# pandas

*pandas* is an open-source Python library that provides fast, flexible, and easy-to-use tools for data analysis and manipulation. It also provides expressive data structures designed to make working with structured (tabular, multidimensional, potentially heterogeneous) and time series data both easy and intuitive.

It offers data exploration, cleaning, and transformation operations that are central to working with data in Python.

*pandas* builds upon *numpy*, providing easy-to-use data structures and data manipulation functions with integrated indexing.

The main data structures *pandas* provides are *Series* and *DataFrames*. After a brief introduction to these two data structures and data ingestion, the key features of *pandas* this notebook covers are:
* Generating descriptive statistics on data
* Data cleaning using built-in pandas functions
* Frequent data operations for subsetting, filtering, insertion, deletion and aggregation of data
* Merging multiple datasets using dataframes
* Working with timestamps and time-series data

This notebook summarizes an introduction to pandas.
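As a quick illustration of the two data structures named above, the following sketch builds a *Series* and a *DataFrame* by hand. The values are toy numbers chosen for the example, not figures from the dataset used later.

```python
import pandas as pd

# A Series is a one-dimensional labelled array (toy values, not from the dataset).
rates = pd.Series([1.3, 0.5, 5.0], index=["Alabama", "Alaska", "Arizona"], name="2018")

# A DataFrame is a two-dimensional table whose columns are Series sharing one index.
table = pd.DataFrame(
    {"2017": [0.7, 0.8, 6.4], "2018": [0.5, 0.5, 5.0]},
    index=["Alabama", "Alaska", "Arizona"],
)

print(rates["Arizona"])       # label-based lookup in a Series
print(type(table["2018"]))    # selecting one column returns a Series
print(table.loc["Alaska"])    # selecting one row by label also returns a Series
```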

In [1]:
# Import the library with alias and print its version
import pandas as pd
pd.__version__

'1.2.3'

In [2]:
# Dataset: Associate's Degrees in Science and Engineering Conferred per 1,000 Individuals 18–24 Years Old (Degrees)

# 1. Number of S&E associate's degree recipients (18–24 years old) by state and year
# Skip the three title rows and read the State column plus the first block of 19 year columns
df_assDe = pd.read_excel("se-associates-degrees-per-1000-18-24-year-olds.xlsx", skiprows = [0, 1, 2],
                         usecols = list(range(0, 20)))

df_assDe.head()
df_assDe.columns

Index(['State', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007',
       '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016',
       '2017', '2018'],
      dtype='object')
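The `skiprows` and `usecols` arguments do the work in the cell above. The sketch below shows the same pattern on a small in-memory text file. It uses `pd.read_csv` instead of `pd.read_excel` only so that it runs without an Excel file; both readers accept these two arguments. The file content and column names are hypothetical.

```python
import io
import pandas as pd

# Hypothetical file content: two title lines, then a header row and the data.
raw = io.StringIO(
    "Some table title\n"
    "Source note\n"
    "State,d2000,d2001,n2000,n2001\n"
    "Alabama,10,12,1000,1100\n"
    "Alaska,1,2,500,550\n"
)

# Skip the two title lines and keep the State column plus the first block.
degrees = pd.read_csv(raw, skiprows=2, usecols=[0, 1, 2])

# Rewind and read again to pull out the second block of columns.
raw.seek(0)
people = pd.read_csv(raw, skiprows=2, usecols=[0, 3, 4])

# Give both blocks the same year labels so they can be compared column by column.
degrees.columns = ["State", "2000", "2001"]
people.columns = ["State", "2000", "2001"]
print(degrees)
print(people)
```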

In [3]:
# 2. Number of individuals (18–24 years old) by state and year
# Same sheet: State column plus the second block of year columns (20 columns in total)
df_indv = pd.read_excel("se-associates-degrees-per-1000-18-24-year-olds.xlsx", skiprows = [0, 1, 2],
                         usecols = [0] + list(range(21, 40)))
df_indv.columns = ['State', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007',
       '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016',
       '2017', '2018']
df_indv.head()

Unnamed: 0,State,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018
0,United States,27744738.0,28423532.0,28908773.0,29340084.0,29719320.0,29849590.0,30001425.0,30197665.0,30579025.0,30909846.0,31138496.0,31454470.0,31745935.0,31860323.0,31780671.0,31484453.0,31156469.0,30871364.0,30761088.0
1,Alabama,442244.0,447963.0,450657.0,456549.0,457934.0,458170.0,459114.0,462996.0,470431.0,476884.0,479740.0,481821.0,485460.0,486952.0,481029.0,471969.0,462636.0,455854.0,452658.0
2,Alaska,57670.0,61774.0,65287.0,68253.0,71882.0,73591.0,74369.0,74383.0,75206.0,73249.0,75359.0,76122.0,77749.0,79653.0,78581.0,76872.0,74513.0,71961.0,70377.0
3,Arizona,519627.0,536018.0,549407.0,559859.0,571196.0,580341.0,591986.0,605860.0,621451.0,629438.0,635984.0,649306.0,663209.0,670120.0,677544.0,677130.0,677720.0,675845.0,687396.0
4,Arkansas,263749.0,268131.0,272203.0,275122.0,275779.0,274439.0,273882.0,275325.0,278835.0,281953.0,285507.0,289572.0,290876.0,291256.0,290495.0,286985.0,283384.0,280768.0,280578.0


In [4]:
df_indv.shape

(60, 20)

In [5]:
# 3. Degrees per 1,000 individuals 18–24 years old
# Same sheet: State column plus the third block of year columns (20 columns in total)
df_Dper1000I = pd.read_excel("se-associates-degrees-per-1000-18-24-year-olds.xlsx", skiprows = [0, 1, 2],
                         usecols = [0] + list(range(41, 60)))
df_Dper1000I.columns = ['State', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007',
       '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016',
       '2017', '2018']
df_Dper1000I.head()

Unnamed: 0,State,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018
0,United States,1.389056,1.59579,1.775413,2.144847,2.027166,1.834196,1.661588,1.572473,1.607834,1.756204,1.997881,2.416858,2.670736,2.692942,2.768223,2.877261,2.851157,3.001163,3.20252
1,Alabama,1.295665,1.428689,1.584354,1.938456,1.629056,1.385948,1.108657,0.771065,0.814147,0.769579,0.858799,1.019051,1.04643,0.975455,0.914706,0.90472,0.737081,0.651524,0.545666
2,Alaska,0.05202,0.0,0.0,0.908385,0.514732,0.271772,0.430287,0.215103,0.186155,0.177477,0.477713,1.247997,0.977505,1.205228,1.005332,0.741492,0.724706,0.833785,0.511531
3,Arizona,1.518397,1.858147,1.991238,2.372026,2.652329,2.36585,2.187552,2.469217,4.257777,8.227975,12.852525,17.843667,18.761808,13.842297,11.919521,10.029093,7.814437,6.417152,5.004393
4,Arkansas,0.686259,0.607912,0.753114,1.046808,0.790488,0.783416,0.828824,0.773631,0.613266,0.741258,0.816092,0.618154,0.587879,0.638613,0.691922,0.73523,0.635181,0.683839,0.976556
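Judging by the dataset title, the third block is presumably the first block divided by the second and scaled to 1,000 individuals. That relationship is an assumption here and has not been checked against the file. Under that assumption, the rate can be recomputed from two frames as sketched below, using toy counts.

```python
import pandas as pd

# Toy counts (hypothetical values), one row per state and one column per year.
degrees = pd.DataFrame({"State": ["A", "B"], "2000": [50.0, 30.0], "2001": [60.0, 45.0]})
people = pd.DataFrame({"State": ["A", "B"], "2000": [10000.0, 20000.0], "2001": [12000.0, 15000.0]})

# Index both frames by state so the division aligns on row and column labels.
per_1000 = degrees.set_index("State") / people.set_index("State") * 1000
print(per_1000)
```

Because pandas aligns on labels, the rows and columns do not need to be in the same order in the two frames, only to carry the same names.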


In [6]:
#df_Dper1000I['P'] = [a - b for a in df_Dper1000I['2002'] for b in  df_Dper1000I['2000'] for x in df_Dper1000I['State'] if x == "Alabama"]

# Change in the rate between 2000 and 2002, computed for all rows at once
df_Dper1000I['K'] = df_Dper1000I['2002'] - df_Dper1000I['2000']
df_Dper1000I.head()

Unnamed: 0,State,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2010,2011,2012,2013,2014,2015,2016,2017,2018,K
0,United States,1.389056,1.59579,1.775413,2.144847,2.027166,1.834196,1.661588,1.572473,1.607834,...,1.997881,2.416858,2.670736,2.692942,2.768223,2.877261,2.851157,3.001163,3.20252,0.386356
1,Alabama,1.295665,1.428689,1.584354,1.938456,1.629056,1.385948,1.108657,0.771065,0.814147,...,0.858799,1.019051,1.04643,0.975455,0.914706,0.90472,0.737081,0.651524,0.545666,0.288689
2,Alaska,0.05202,0.0,0.0,0.908385,0.514732,0.271772,0.430287,0.215103,0.186155,...,0.477713,1.247997,0.977505,1.205228,1.005332,0.741492,0.724706,0.833785,0.511531,-0.05202
3,Arizona,1.518397,1.858147,1.991238,2.372026,2.652329,2.36585,2.187552,2.469217,4.257777,...,12.852525,17.843667,18.761808,13.842297,11.919521,10.029093,7.814437,6.417152,5.004393,0.472841
4,Arkansas,0.686259,0.607912,0.753114,1.046808,0.790488,0.783416,0.828824,0.773631,0.613266,...,0.816092,0.618154,0.587879,0.638613,0.691922,0.73523,0.635181,0.683839,0.976556,0.066856


In [7]:
# P is 0 for Alabama and K for every other state, evaluated row by row
df_Dper1000I['P'] = df_Dper1000I['K'].where(df_Dper1000I['State'] != 'Alabama', 0)
df_Dper1000I.head()

Unnamed: 0,State,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2011,2012,2013,2014,2015,2016,2017,2018,K,P
0,United States,1.389056,1.59579,1.775413,2.144847,2.027166,1.834196,1.661588,1.572473,1.607834,...,2.416858,2.670736,2.692942,2.768223,2.877261,2.851157,3.001163,3.20252,0.386356,0.386356
1,Alabama,1.295665,1.428689,1.584354,1.938456,1.629056,1.385948,1.108657,0.771065,0.814147,...,1.019051,1.04643,0.975455,0.914706,0.90472,0.737081,0.651524,0.545666,0.288689,0.0
2,Alaska,0.05202,0.0,0.0,0.908385,0.514732,0.271772,0.430287,0.215103,0.186155,...,1.247997,0.977505,1.205228,1.005332,0.741492,0.724706,0.833785,0.511531,-0.05202,-0.05202
3,Arizona,1.518397,1.858147,1.991238,2.372026,2.652329,2.36585,2.187552,2.469217,4.257777,...,17.843667,18.761808,13.842297,11.919521,10.029093,7.814437,6.417152,5.004393,0.472841,0.472841
4,Arkansas,0.686259,0.607912,0.753114,1.046808,0.790488,0.783416,0.828824,0.773631,0.613266,...,0.618154,0.587879,0.638613,0.691922,0.73523,0.635181,0.683839,0.976556,0.066856,0.066856


In [23]:
df_Dper1000I['K'] = df_Dper1000I['2002']- df_Dper1000I['2000'] 
df_Dper1000I.dtypes
#print(df_Dper1000I['P'])
df_Dper1000I.shape

(60, 22)

In [26]:
# P is K for Alabama and 0 for every other state. A boolean mask is used because
# assigning df_Dper1000I['P'] inside a loop would overwrite the whole column on every pass.
df_Dper1000I['P'] = df_Dper1000I['K'].where(df_Dper1000I['State'] == 'Alabama', 0)
df_Dper1000I.head()
#df_Dper1000I.dtypes

Unnamed: 0,State,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2011,2012,2013,2014,2015,2016,2017,2018,K,P
0,United States,1.389056,1.59579,1.775413,2.144847,2.027166,1.834196,1.661588,1.572473,1.607834,...,2.416858,2.670736,2.692942,2.768223,2.877261,2.851157,3.001163,3.20252,0.386356,0.0
1,Alabama,1.295665,1.428689,1.584354,1.938456,1.629056,1.385948,1.108657,0.771065,0.814147,...,1.019051,1.04643,0.975455,0.914706,0.90472,0.737081,0.651524,0.545666,0.288689,0.288689
2,Alaska,0.05202,0.0,0.0,0.908385,0.514732,0.271772,0.430287,0.215103,0.186155,...,1.247997,0.977505,1.205228,1.005332,0.741492,0.724706,0.833785,0.511531,-0.05202,0.0
3,Arizona,1.518397,1.858147,1.991238,2.372026,2.652329,2.36585,2.187552,2.469217,4.257777,...,17.843667,18.761808,13.842297,11.919521,10.029093,7.814437,6.417152,5.004393,0.472841,0.0
4,Arkansas,0.686259,0.607912,0.753114,1.046808,0.790488,0.783416,0.828824,0.773631,0.613266,...,0.618154,0.587879,0.638613,0.691922,0.73523,0.635181,0.683839,0.976556,0.066856,0.0
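Row-wise conditions like the one above are usually written with a boolean mask and `.loc` rather than a Python loop, because an assignment to `df['P']` replaces the entire column at once. The sketch below shows one way to do this on a toy frame. The column names `K` and `P` follow the notebook, and the values are rounded from the output shown earlier.

```python
import pandas as pd

# Toy frame using the notebook's column names (values rounded for illustration).
df = pd.DataFrame(
    {"State": ["Alabama", "Alaska", "Arizona"], "K": [0.289, -0.052, 0.473]}
)

# Start with 0 everywhere, then copy K only into the rows that match the condition.
is_alabama = df["State"] == "Alabama"
df["P"] = 0.0
df.loc[is_alabama, "P"] = df.loc[is_alabama, "K"]
print(df)
```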


In [22]:
df_Dper1000I.iloc[:54]

Unnamed: 0,State,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2011,2012,2013,2014,2015,2016,2017,2018,K,P
0,United States,1.389056,1.59579,1.775413,2.144847,2.027166,1.834196,1.661588,1.572473,1.607834,...,2.416858,2.670736,2.692942,2.768223,2.877261,2.851157,3.001163,3.20252,0.386356,0.386356
1,Alabama,1.295665,1.428689,1.584354,1.938456,1.629056,1.385948,1.108657,0.771065,0.814147,...,1.019051,1.04643,0.975455,0.914706,0.90472,0.737081,0.651524,0.545666,0.288689,0.288689
2,Alaska,0.05202,0.0,0.0,0.908385,0.514732,0.271772,0.430287,0.215103,0.186155,...,1.247997,0.977505,1.205228,1.005332,0.741492,0.724706,0.833785,0.511531,-0.05202,-0.05202
3,Arizona,1.518397,1.858147,1.991238,2.372026,2.652329,2.36585,2.187552,2.469217,4.257777,...,17.843667,18.761808,13.842297,11.919521,10.029093,7.814437,6.417152,5.004393,0.472841,0.472841
4,Arkansas,0.686259,0.607912,0.753114,1.046808,0.790488,0.783416,0.828824,0.773631,0.613266,...,0.618154,0.587879,0.638613,0.691922,0.73523,0.635181,0.683839,0.976556,0.066856,0.066856
5,California,2.246889,2.150801,2.482296,2.798916,2.576467,2.421829,2.38835,2.407973,2.527423,...,3.83052,4.379885,5.259684,5.963148,6.660418,7.750218,8.617775,10.213075,0.235408,0.235408
6,Colorado,1.385862,1.664051,1.874592,1.718574,1.527398,1.169972,0.743631,0.617737,0.664584,...,0.892685,0.769913,0.841889,0.81638,0.909904,0.810885,0.792038,0.793428,0.48873,0.48873
7,Connecticut,0.421977,0.438063,0.566915,0.816285,0.779948,0.869173,0.707644,0.529824,0.530805,...,0.64429,0.702786,0.792473,0.784862,0.921829,0.846045,0.943936,0.954251,0.144939,0.144939
8,Delaware,1.092781,0.993251,0.818057,0.863705,1.449065,1.420896,1.852311,1.530548,1.08368,...,1.467216,1.553381,2.028265,2.502709,2.5052,2.911721,3.145289,3.46029,-0.274724,-0.274724
9,District of Columbia,0.944119,2.976519,3.314486,2.549198,3.041026,0.551824,2.68093,2.03749,1.675575,...,0.527211,0.296114,0.265998,0.386737,0.418618,0.212106,0.187291,0.269753,2.370367,2.370367


In [None]:
df_Dper1000I = df_Dper1000I.dropna()
df_Dper1000I.iloc[:54]

In [None]:
df_Dper1000I = df_Dper1000I.iloc[1:54]
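The two cleaning steps above (dropping rows with missing values, then slicing by position) can be sketched on a toy frame. The layout assumed here, an aggregate row first and a trailing note row with missing values, is hypothetical and only mimics the kind of structure those steps would remove.

```python
import numpy as np
import pandas as pd

# Toy frame: an aggregate row, two states, then a trailing note row with no value.
df = pd.DataFrame(
    {"State": ["United States", "Alabama", "Alaska", "Note"],
     "2018": [3.2, 0.5, 0.5, np.nan]}
)

cleaned = df.dropna()          # drops every row that contains a missing value
states = cleaned.iloc[1:]      # positional slice: skip the aggregate row at position 0
print(states)
print(states.index.tolist())   # original labels are kept after slicing
```

If consecutive labels starting at 0 are preferred after slicing, `states.reset_index(drop=True)` renumbers the rows.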

In [None]:
df_Dper1000I.shape

In [None]:
df_Dper1000I.index

In [None]:
df_Dper1000I.dtypes

In [None]:
df_Dper1000I.info()

In [None]:
df_Dper1000I.describe(include = 'all')
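Since the inspection cells above are shown without output, here is a small sketch of what `dtypes` and `describe` report, using a toy frame with one text column and one numeric column (hypothetical values).

```python
import pandas as pd

df = pd.DataFrame(
    {"State": ["Alabama", "Alaska", "Arizona", "Arkansas"],
     "2018": [1.0, 2.0, 3.0, 6.0]}
)

print(df.dtypes)                    # one dtype per column
summary = df.describe()             # numeric columns only by default
print(summary)
print(df.describe(include="all"))   # also summarises the text column (count, unique, top, freq)
```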

*pandas* provides several operations for working with the data in a dataframe, such as subsetting, grouping, and merging.
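A minimal sketch of these three operations on toy data follows. The `regions` lookup table and its `Region` labels are hypothetical and serve only to give the merge and the grouping something to work with.

```python
import pandas as pd

rates = pd.DataFrame(
    {"State": ["Alabama", "Alaska", "Arizona", "Arkansas"],
     "2017": [0.7, 0.8, 6.4, 0.7],
     "2018": [0.5, 0.5, 5.0, 1.0]}
)
# Hypothetical lookup table; the Region labels are illustrative only.
regions = pd.DataFrame(
    {"State": ["Alabama", "Alaska", "Arizona", "Arkansas"],
     "Region": ["South", "West", "West", "South"]}
)

# Subsetting: keep only the columns of interest.
subset = rates[["State", "2018"]]

# Merging: attach the region of each state by matching on the State column.
merged = subset.merge(regions, on="State", how="left")

# Grouping: average the 2018 value within each region.
by_region = merged.groupby("Region")["2018"].mean()
print(by_region)
```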

In [None]:
state = df_Dper1000I['State']

In [None]:
state.head()

In [None]:
state.tail()

In [None]:
type(df_Dper1000I)

In [None]:
type(state)

In [None]:
#df_Dper1000I.set_index('State', inplace = True)
df_Dper1000I.head()

In [None]:
df_Dper1000I['2018'].mean()

In [None]:
df_Dper1000I['2018']>df_Dper1000I['2018'].mean()
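The comparison above yields a boolean Series with one value per row. A common next step is to use that Series as a mask to filter the dataframe, sketched here on toy values.

```python
import pandas as pd

df = pd.DataFrame(
    {"State": ["Alabama", "Alaska", "Arizona", "Arkansas"],
     "2018": [1.0, 2.0, 3.0, 6.0]}
)

above_mean = df["2018"] > df["2018"].mean()   # boolean Series, one True/False per row
print(above_mean.sum())                       # True counts as 1, so this counts matching rows
print(df[above_mean])                         # keep only the rows where the mask is True
print(df.loc[above_mean, "State"].tolist())   # same mask, selecting a single column
```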

In [None]:
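from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import urlopen, urlretrieve
from urllib.parse import urljoin, quote

#url = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009/'

# The part after '#' is a client-side fragment and is not sent to the server,
# so this request fetches the publication page itself
url = 'https://ncses.nsf.gov/pubs/nsf21303#data-tables/nsf21303-tab002.xlxs'

# Download and parse the page so that `soup` is defined for the next cell
soup = BeautifulSoup(urlopen(url).read(), 'html.parser')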


In [None]:
soup

In [None]:
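# Note: everything after '#' is a URL fragment and is not sent to the server, so this
# request goes to the publication page rather than to a workbook; read_excel needs a
# direct link to the Excel file
df = pd.read_excel('https://ncses.nsf.gov/pubs/nsf21303#data-tables/nsf21303-tab002.xls')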

In [None]:
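import os
import http.client

import requests

# Force HTTP/1.0 for connections made through http.client
http.client.HTTPConnection._http_vsn = 10
http.client.HTTPConnection._http_vsn_str = 'HTTP/1.0'

# Direct link to a workbook (the same file that is read with read_excel below);
# a page URL ending in '#data-tables' would download the web page, not an Excel file
url = "https://ncses.nsf.gov/pubs/nsf21303/assets/data-tables/tables/nsf21303-tab001.xlsx"
filename = os.path.basename(url)

print("Downloading...")
resp = requests.get(url)
resp.raise_for_status()  # stop here if the server returned an error status
with open(filename, 'wb') as output:
    output.write(resp.content)
print("Done!")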

In [None]:
df = pd.read_excel("https://ncses.nsf.gov/pubs/nsf21303/assets/data-tables/tables/nsf21303-tab001.xlsx", skiprows = [0, 1, 2], usecols= [0, 1, 3, 5])

In [None]:
df.head()

In [None]:
df.columns