# pandas

*pandas* is an open source Python library that provides fast, flexible, and easy-to-use tools for data analysis and manipulation. Its expressive data structures are designed to make working with structured (tabular, multidimensional, potentially heterogeneous) and time series data easy and intuitive. In pandas, a data table is called a DataFrame. Using pandas, you can explore, clean, and process your data.

It offers a number of data exploration, cleaning and transformation operations that are critical in working with data in Python. 

*pandas* builds upon *numpy*, providing easy-to-use data structures and data manipulation functions with integrated indexing.

The main data structures *pandas* provides are *Series* and *DataFrame*. After a brief introduction to these two data structures and data ingestion, the key features of *pandas* this notebook covers are:
* Generating descriptive statistics on data
* Data cleaning using built-in pandas functions
* Frequent data operations for subsetting, filtering, insertion, deletion and aggregation of data
* Merging multiple datasets using DataFrame
* Working with timestamps and time-series data

This notebook summarizes an introduction to pandas and some of its methods.

#### Import Libraries
The first step in using a library in Python is importing it so that it is accessible for use. You can import a library under its own name or under an alias to make it easier to use.

In [1]:
import datetime
current_time = datetime.datetime.now()
current_time.isoformat()

'2022-04-13T09:47:09.564092'

In [2]:
import datetime as dt
current_time = dt.datetime.now()
current_time.isoformat()

'2022-04-13T09:47:09.656843'

In [3]:
# Import the library with alias and print its version
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
pd.__version__

'1.3.4'

There are two major pandas data structures: Series and DataFrame.
* Series: a one-dimensional labeled array-like object that is capable of holding any data type and maps an index to values.
* DataFrame: a two-dimensional labeled data structure (a rectangular table of data) whose columns can have different data types. It has both rows and columns.

You can import data files in different formats, such as Excel, CSV, SAS, and SPSS, into a pandas DataFrame.
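As a minimal sketch of data ingestion, the following reads CSV text into a DataFrame with `pd.read_csv`. The sample text, its column names, and its values are made up for illustration, and an in-memory buffer stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV content; io.StringIO stands in for a file path.
csv_text = "state,year,degrees\nAlabama,2000,573\nAlaska,2000,3\n"

df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv.shape)   # (rows, columns)
print(df_csv.dtypes)  # column types are inferred from the text
```

Other readers such as `pd.read_excel`, `pd.read_sas`, and `pd.read_spss` follow the same pattern of returning a DataFrame, though some of them need optional packages installed.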

In [4]:
# Creating Series
s = pd.Series(np.random.randn(5), name="Series1")
s

0   -0.550800
1   -0.170476
2    0.030573
3   -0.018817
4   -1.208114
Name: Series1, dtype: float64

In [5]:
# checking name of the series
s.name

'Series1'

In [6]:
# Creating DataFrame
df1 = pd.DataFrame.from_dict(dict([("A", [1, 2, 3, 4]), ("B", [10, 11, 12, 13])]))
df1.head()

Unnamed: 0,A,B
0,1,10
1,2,11
2,3,12
3,4,13


In [7]:
df1 = df1.drop(index=[1,3])
df1.head()

Unnamed: 0,A,B
0,1,10
2,3,12


In [8]:
df1 = df1.reset_index()

df1.head()

Unnamed: 0,index,A,B
0,0,1,10
1,2,3,12
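
`reset_index()` moves the old index labels into a new column named `index`, as the output above shows. If the old labels are not needed, passing `drop=True` discards them instead. A small sketch that rebuilds the same two-column frame (the names `frame`, `kept`, and `dropped` are illustrative):

```python
import pandas as pd

frame = pd.DataFrame({"A": [1, 2, 3, 4], "B": [10, 11, 12, 13]})
frame = frame.drop(index=[1, 3])          # remaining labels: 0 and 2

kept = frame.reset_index()                # old labels become an "index" column
dropped = frame.reset_index(drop=True)    # old labels are discarded

print(kept.columns.tolist())     # ['index', 'A', 'B']
print(dropped.index.tolist())    # [0, 1]
```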


In [9]:
# Creating another DataFrame
df2 = pd.DataFrame.from_dict(
    dict([("A", [1, 2, 3, 4]), ("B", [10, 11, 12, 13])]),
    orient="index",
    columns=["One", "Two", "Three", "Four"],)
df2.head()

Unnamed: 0,One,Two,Three,Four
A,1,2,3,4
B,10,11,12,13


In [10]:
# Main DataFrame components
columns = df2.columns
index = df2.index
data = df2.values

In [11]:
columns

Index(['One', 'Two', 'Three', 'Four'], dtype='object')

In [12]:
index

Index(['A', 'B'], dtype='object')

In [13]:
data

array([[ 1,  2,  3,  4],
       [10, 11, 12, 13]], dtype=int64)

In [14]:
# Dimension of the DataFrame
df2.shape

(2, 4)

In [15]:
# DataFrame values
df2.values

array([[ 1,  2,  3,  4],
       [10, 11, 12, 13]], dtype=int64)

In [16]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2 entries, A to B
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   One     2 non-null      int64
 1   Two     2 non-null      int64
 2   Three   2 non-null      int64
 3   Four    2 non-null      int64
dtypes: int64(4)
memory usage: 80.0+ bytes


In [17]:
# Associate's Degrees in Science and Engineering Conferred per 1,000 Individuals 18–24 Years Old (Degrees) this a publicly 
# 1. Number of S&E associate's degree recipients aged 18–24, by state and year
df_assDe = pd.read_excel("se-associates-degrees-per-1000-18-24-year-olds.xlsx", skiprows = [0, 1, 2], 
                         usecols = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

df_assDe.head()
df_assDe.columns

Index(['State', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007',
       '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016',
       '2017', '2018'],
      dtype='object')

In [18]:
# 2. Number of individuals (18–24 years old) by state and year
df_indv = pd.read_excel("se-associates-degrees-per-1000-18-24-year-olds.xlsx", skiprows = [0, 1, 2], 
                         usecols = [0, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38 , 39])
df_indv.columns = ['State', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007',
       '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016',
       '2017', '2018']
df_indv.head()

Unnamed: 0,State,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018
0,United States,27744738.0,28423532.0,28908773.0,29340084.0,29719320.0,29849590.0,30001425.0,30197665.0,30579025.0,30909846.0,31138496.0,31454470.0,31745935.0,31860323.0,31780671.0,31484453.0,31156469.0,30871364.0,30761088.0
1,Alabama,442244.0,447963.0,450657.0,456549.0,457934.0,458170.0,459114.0,462996.0,470431.0,476884.0,479740.0,481821.0,485460.0,486952.0,481029.0,471969.0,462636.0,455854.0,452658.0
2,Alaska,57670.0,61774.0,65287.0,68253.0,71882.0,73591.0,74369.0,74383.0,75206.0,73249.0,75359.0,76122.0,77749.0,79653.0,78581.0,76872.0,74513.0,71961.0,70377.0
3,Arizona,519627.0,536018.0,549407.0,559859.0,571196.0,580341.0,591986.0,605860.0,621451.0,629438.0,635984.0,649306.0,663209.0,670120.0,677544.0,677130.0,677720.0,675845.0,687396.0
4,Arkansas,263749.0,268131.0,272203.0,275122.0,275779.0,274439.0,273882.0,275325.0,278835.0,281953.0,285507.0,289572.0,290876.0,291256.0,290495.0,286985.0,283384.0,280768.0,280578.0


In [19]:
df_indv.shape

(60, 20)

In [20]:
# 3. Degrees per 1,000 individuals 18–24 years old
df_Dper1000I = pd.read_excel("se-associates-degrees-per-1000-18-24-year-olds.xlsx", skiprows = [0, 1, 2], 
                         usecols = [0, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59])
df_Dper1000I.columns = ['State', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007',
       '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016',
       '2017', '2018']
df_Dper1000I.head()

Unnamed: 0,State,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018
0,United States,1.389056,1.59579,1.775413,2.144847,2.027166,1.834196,1.661588,1.572473,1.607834,1.756204,1.997881,2.416858,2.670736,2.692942,2.768223,2.877261,2.851157,3.001163,3.20252
1,Alabama,1.295665,1.428689,1.584354,1.938456,1.629056,1.385948,1.108657,0.771065,0.814147,0.769579,0.858799,1.019051,1.04643,0.975455,0.914706,0.90472,0.737081,0.651524,0.545666
2,Alaska,0.05202,0.0,0.0,0.908385,0.514732,0.271772,0.430287,0.215103,0.186155,0.177477,0.477713,1.247997,0.977505,1.205228,1.005332,0.741492,0.724706,0.833785,0.511531
3,Arizona,1.518397,1.858147,1.991238,2.372026,2.652329,2.36585,2.187552,2.469217,4.257777,8.227975,12.852525,17.843667,18.761808,13.842297,11.919521,10.029093,7.814437,6.417152,5.004393
4,Arkansas,0.686259,0.607912,0.753114,1.046808,0.790488,0.783416,0.828824,0.773631,0.613266,0.741258,0.816092,0.618154,0.587879,0.638613,0.691922,0.73523,0.635181,0.683839,0.976556
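
The three `read_excel` calls above pass column positions that follow a regular pattern: the `State` column plus a block of 19 consecutive year columns. These lists, and the year labels, can be generated rather than typed out. A sketch, where `block_columns` is a hypothetical helper name and not part of pandas:

```python
def block_columns(start, n_years=19):
    """Return column 0 (State) plus n_years consecutive positions from start."""
    return [0] + list(range(start, start + n_years))

degrees_cols = block_columns(1)       # columns 0 and 1..19
population_cols = block_columns(21)   # columns 0 and 21..39
rate_cols = block_columns(41)         # columns 0 and 41..59

# Column labels: 'State' followed by the years 2000..2018 as strings.
year_labels = ["State"] + [str(year) for year in range(2000, 2019)]

print(rate_cols)
print(year_labels)
```

Each list could then be passed as `usecols=` in the corresponding `read_excel` call, and `year_labels` assigned to `.columns`.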


In [21]:
#df_Dper1000I['P'] = [a - b for a in df_Dper1000I['2002'] for b in  df_Dper1000I['2000'] for x in df_Dper1000I['State'] if x == "Alabama"]

df_Dper1000I['K'] = df_Dper1000I['2002']- df_Dper1000I['2000']  
df_Dper1000I.head()

Unnamed: 0,State,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2010,2011,2012,2013,2014,2015,2016,2017,2018,K
0,United States,1.389056,1.59579,1.775413,2.144847,2.027166,1.834196,1.661588,1.572473,1.607834,...,1.997881,2.416858,2.670736,2.692942,2.768223,2.877261,2.851157,3.001163,3.20252,0.386356
1,Alabama,1.295665,1.428689,1.584354,1.938456,1.629056,1.385948,1.108657,0.771065,0.814147,...,0.858799,1.019051,1.04643,0.975455,0.914706,0.90472,0.737081,0.651524,0.545666,0.288689
2,Alaska,0.05202,0.0,0.0,0.908385,0.514732,0.271772,0.430287,0.215103,0.186155,...,0.477713,1.247997,0.977505,1.205228,1.005332,0.741492,0.724706,0.833785,0.511531,-0.05202
3,Arizona,1.518397,1.858147,1.991238,2.372026,2.652329,2.36585,2.187552,2.469217,4.257777,...,12.852525,17.843667,18.761808,13.842297,11.919521,10.029093,7.814437,6.417152,5.004393,0.472841
4,Arkansas,0.686259,0.607912,0.753114,1.046808,0.790488,0.783416,0.828824,0.773631,0.613266,...,0.816092,0.618154,0.587879,0.638613,0.691922,0.73523,0.635181,0.683839,0.976556,0.066856


In [22]:
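# Note: the else branch yields the whole K Series rather than the row's own K value,
# so every non-Alabama cell of P ends up holding a nested Series (see the output).
df_Dper1000I['P'] = [0 if x == 'Alabama' else df_Dper1000I['K'] for x in df_Dper1000I['State'] ]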
df_Dper1000I.head()

Unnamed: 0,State,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2011,2012,2013,2014,2015,2016,2017,2018,K,P
0,United States,1.389056,1.59579,1.775413,2.144847,2.027166,1.834196,1.661588,1.572473,1.607834,...,2.416858,2.670736,2.692942,2.768223,2.877261,2.851157,3.001163,3.20252,0.386356,0 0.386356 1 0.288689 2 -0.052020 3...
1,Alabama,1.295665,1.428689,1.584354,1.938456,1.629056,1.385948,1.108657,0.771065,0.814147,...,1.019051,1.04643,0.975455,0.914706,0.90472,0.737081,0.651524,0.545666,0.288689,0
2,Alaska,0.05202,0.0,0.0,0.908385,0.514732,0.271772,0.430287,0.215103,0.186155,...,1.247997,0.977505,1.205228,1.005332,0.741492,0.724706,0.833785,0.511531,-0.05202,0 0.386356 1 0.288689 2 -0.052020 3...
3,Arizona,1.518397,1.858147,1.991238,2.372026,2.652329,2.36585,2.187552,2.469217,4.257777,...,17.843667,18.761808,13.842297,11.919521,10.029093,7.814437,6.417152,5.004393,0.472841,0 0.386356 1 0.288689 2 -0.052020 3...
4,Arkansas,0.686259,0.607912,0.753114,1.046808,0.790488,0.783416,0.828824,0.773631,0.613266,...,0.618154,0.587879,0.638613,0.691922,0.73523,0.635181,0.683839,0.976556,0.066856,0 0.386356 1 0.288689 2 -0.052020 3...
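
If the intent of In [22] was a per-row value (0 for Alabama, otherwise that row's own `K`; this is an assumption about the intent), one vectorized option is `Series.mask`. The sketch below uses a three-row stand-in frame named `toy`, with values copied from the output above:

```python
import pandas as pd

toy = pd.DataFrame({
    "State": ["United States", "Alabama", "Alaska"],
    "K": [0.386356, 0.288689, -0.052020],
})

# mask() replaces values where the condition is True and keeps the rest.
toy["P"] = toy["K"].mask(toy["State"] == "Alabama", 0)
print(toy)
```

Here every cell of `P` is a single number, so the column keeps a numeric dtype.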


In [23]:
df_Dper1000I['K'] = df_Dper1000I['2002']- df_Dper1000I['2000'] 
df_Dper1000I.dtypes
#print(df_Dper1000I['P'])
df_Dper1000I.shape

(60, 22)

In [24]:
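# Each assignment below sets the whole P column to one scalar, so after the loop
# P reflects only the last value seen in the '2000' column.
for x in df_Dper1000I['2000']: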
    if x != 1.295665:
        df_Dper1000I['P'] = 'Y'
    else:
        df_Dper1000I['P'] = 'N'
df_Dper1000I.head()
df_Dper1000I.dtypes
# '==' compares and discards the result, so this loop leaves P unchanged.
for x in df_Dper1000I['State']:
    if x == "United States":
        df_Dper1000I['P'] == 'Y'
        break 
    else:    
        df_Dper1000I['P'] == 'N'
         
df_Dper1000I.head()
# df_Dper1000I['P'] = (df_Dper1000I['py-score'] >= 80) & (df_Dper1000I['js-score'] >= 80)

Unnamed: 0,State,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2011,2012,2013,2014,2015,2016,2017,2018,K,P
0,United States,1.389056,1.59579,1.775413,2.144847,2.027166,1.834196,1.661588,1.572473,1.607834,...,2.416858,2.670736,2.692942,2.768223,2.877261,2.851157,3.001163,3.20252,0.386356,Y
1,Alabama,1.295665,1.428689,1.584354,1.938456,1.629056,1.385948,1.108657,0.771065,0.814147,...,1.019051,1.04643,0.975455,0.914706,0.90472,0.737081,0.651524,0.545666,0.288689,Y
2,Alaska,0.05202,0.0,0.0,0.908385,0.514732,0.271772,0.430287,0.215103,0.186155,...,1.247997,0.977505,1.205228,1.005332,0.741492,0.724706,0.833785,0.511531,-0.05202,Y
3,Arizona,1.518397,1.858147,1.991238,2.372026,2.652329,2.36585,2.187552,2.469217,4.257777,...,17.843667,18.761808,13.842297,11.919521,10.029093,7.814437,6.417152,5.004393,0.472841,Y
4,Arkansas,0.686259,0.607912,0.753114,1.046808,0.790488,0.783416,0.828824,0.773631,0.613266,...,0.618154,0.587879,0.638613,0.691922,0.73523,0.635181,0.683839,0.976556,0.066856,Y
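
If the goal of In [24] was a per-row flag rather than a single value for the whole column (an assumption about the intent), `np.where` builds one in a single step. A sketch on a three-row stand-in frame:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"State": ["United States", "Alabama", "Alaska"]})

# 'Y' for the national aggregate row, 'N' for every other row.
toy["P"] = np.where(toy["State"] == "United States", "Y", "N")
print(toy)
```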


In [25]:
#df_Dper1000I['P']  = df_Dper1000I.apply(lambda row: row["2003"] - row['2000'] if row["State"] == 'Alabama' else np.nan, axis=1)
df_Dper1000I.head(52)

Unnamed: 0,State,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2011,2012,2013,2014,2015,2016,2017,2018,K,P
0,United States,1.389056,1.59579,1.775413,2.144847,2.027166,1.834196,1.661588,1.572473,1.607834,...,2.416858,2.670736,2.692942,2.768223,2.877261,2.851157,3.001163,3.20252,0.386356,Y
1,Alabama,1.295665,1.428689,1.584354,1.938456,1.629056,1.385948,1.108657,0.771065,0.814147,...,1.019051,1.04643,0.975455,0.914706,0.90472,0.737081,0.651524,0.545666,0.288689,Y
2,Alaska,0.05202,0.0,0.0,0.908385,0.514732,0.271772,0.430287,0.215103,0.186155,...,1.247997,0.977505,1.205228,1.005332,0.741492,0.724706,0.833785,0.511531,-0.05202,Y
3,Arizona,1.518397,1.858147,1.991238,2.372026,2.652329,2.36585,2.187552,2.469217,4.257777,...,17.843667,18.761808,13.842297,11.919521,10.029093,7.814437,6.417152,5.004393,0.472841,Y
4,Arkansas,0.686259,0.607912,0.753114,1.046808,0.790488,0.783416,0.828824,0.773631,0.613266,...,0.618154,0.587879,0.638613,0.691922,0.73523,0.635181,0.683839,0.976556,0.066856,Y
5,California,2.246889,2.150801,2.482296,2.798916,2.576467,2.421829,2.38835,2.407973,2.527423,...,3.83052,4.379885,5.259684,5.963148,6.660418,7.750218,8.617775,10.213075,0.235408,Y
6,Colorado,1.385862,1.664051,1.874592,1.718574,1.527398,1.169972,0.743631,0.617737,0.664584,...,0.892685,0.769913,0.841889,0.81638,0.909904,0.810885,0.792038,0.793428,0.48873,Y
7,Connecticut,0.421977,0.438063,0.566915,0.816285,0.779948,0.869173,0.707644,0.529824,0.530805,...,0.64429,0.702786,0.792473,0.784862,0.921829,0.846045,0.943936,0.954251,0.144939,Y
8,Delaware,1.092781,0.993251,0.818057,0.863705,1.449065,1.420896,1.852311,1.530548,1.08368,...,1.467216,1.553381,2.028265,2.502709,2.5052,2.911721,3.145289,3.46029,-0.274724,Y
9,District of Columbia,0.944119,2.976519,3.314486,2.549198,3.041026,0.551824,2.68093,2.03749,1.675575,...,0.527211,0.296114,0.265998,0.386737,0.418618,0.212106,0.187291,0.269753,2.370367,Y


In [26]:
df_Dper1000I= df_Dper1000I.iloc[:52]
df_Dper1000I.iloc[:55]                                

Unnamed: 0,State,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2011,2012,2013,2014,2015,2016,2017,2018,K,P
0,United States,1.389056,1.59579,1.775413,2.144847,2.027166,1.834196,1.661588,1.572473,1.607834,...,2.416858,2.670736,2.692942,2.768223,2.877261,2.851157,3.001163,3.20252,0.386356,Y
1,Alabama,1.295665,1.428689,1.584354,1.938456,1.629056,1.385948,1.108657,0.771065,0.814147,...,1.019051,1.04643,0.975455,0.914706,0.90472,0.737081,0.651524,0.545666,0.288689,Y
2,Alaska,0.05202,0.0,0.0,0.908385,0.514732,0.271772,0.430287,0.215103,0.186155,...,1.247997,0.977505,1.205228,1.005332,0.741492,0.724706,0.833785,0.511531,-0.05202,Y
3,Arizona,1.518397,1.858147,1.991238,2.372026,2.652329,2.36585,2.187552,2.469217,4.257777,...,17.843667,18.761808,13.842297,11.919521,10.029093,7.814437,6.417152,5.004393,0.472841,Y
4,Arkansas,0.686259,0.607912,0.753114,1.046808,0.790488,0.783416,0.828824,0.773631,0.613266,...,0.618154,0.587879,0.638613,0.691922,0.73523,0.635181,0.683839,0.976556,0.066856,Y
5,California,2.246889,2.150801,2.482296,2.798916,2.576467,2.421829,2.38835,2.407973,2.527423,...,3.83052,4.379885,5.259684,5.963148,6.660418,7.750218,8.617775,10.213075,0.235408,Y
6,Colorado,1.385862,1.664051,1.874592,1.718574,1.527398,1.169972,0.743631,0.617737,0.664584,...,0.892685,0.769913,0.841889,0.81638,0.909904,0.810885,0.792038,0.793428,0.48873,Y
7,Connecticut,0.421977,0.438063,0.566915,0.816285,0.779948,0.869173,0.707644,0.529824,0.530805,...,0.64429,0.702786,0.792473,0.784862,0.921829,0.846045,0.943936,0.954251,0.144939,Y
8,Delaware,1.092781,0.993251,0.818057,0.863705,1.449065,1.420896,1.852311,1.530548,1.08368,...,1.467216,1.553381,2.028265,2.502709,2.5052,2.911721,3.145289,3.46029,-0.274724,Y
9,District of Columbia,0.944119,2.976519,3.314486,2.549198,3.041026,0.551824,2.68093,2.03749,1.675575,...,0.527211,0.296114,0.265998,0.386737,0.418618,0.212106,0.187291,0.269753,2.370367,Y


In [27]:
df_Dper1000I = df_Dper1000I.dropna()
df_Dper1000I.iloc[:52]

Unnamed: 0,State,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2011,2012,2013,2014,2015,2016,2017,2018,K,P
0,United States,1.389056,1.59579,1.775413,2.144847,2.027166,1.834196,1.661588,1.572473,1.607834,...,2.416858,2.670736,2.692942,2.768223,2.877261,2.851157,3.001163,3.20252,0.386356,Y
1,Alabama,1.295665,1.428689,1.584354,1.938456,1.629056,1.385948,1.108657,0.771065,0.814147,...,1.019051,1.04643,0.975455,0.914706,0.90472,0.737081,0.651524,0.545666,0.288689,Y
2,Alaska,0.05202,0.0,0.0,0.908385,0.514732,0.271772,0.430287,0.215103,0.186155,...,1.247997,0.977505,1.205228,1.005332,0.741492,0.724706,0.833785,0.511531,-0.05202,Y
3,Arizona,1.518397,1.858147,1.991238,2.372026,2.652329,2.36585,2.187552,2.469217,4.257777,...,17.843667,18.761808,13.842297,11.919521,10.029093,7.814437,6.417152,5.004393,0.472841,Y
4,Arkansas,0.686259,0.607912,0.753114,1.046808,0.790488,0.783416,0.828824,0.773631,0.613266,...,0.618154,0.587879,0.638613,0.691922,0.73523,0.635181,0.683839,0.976556,0.066856,Y
5,California,2.246889,2.150801,2.482296,2.798916,2.576467,2.421829,2.38835,2.407973,2.527423,...,3.83052,4.379885,5.259684,5.963148,6.660418,7.750218,8.617775,10.213075,0.235408,Y
6,Colorado,1.385862,1.664051,1.874592,1.718574,1.527398,1.169972,0.743631,0.617737,0.664584,...,0.892685,0.769913,0.841889,0.81638,0.909904,0.810885,0.792038,0.793428,0.48873,Y
7,Connecticut,0.421977,0.438063,0.566915,0.816285,0.779948,0.869173,0.707644,0.529824,0.530805,...,0.64429,0.702786,0.792473,0.784862,0.921829,0.846045,0.943936,0.954251,0.144939,Y
8,Delaware,1.092781,0.993251,0.818057,0.863705,1.449065,1.420896,1.852311,1.530548,1.08368,...,1.467216,1.553381,2.028265,2.502709,2.5052,2.911721,3.145289,3.46029,-0.274724,Y
9,District of Columbia,0.944119,2.976519,3.314486,2.549198,3.041026,0.551824,2.68093,2.03749,1.675575,...,0.527211,0.296114,0.265998,0.386737,0.418618,0.212106,0.187291,0.269753,2.370367,Y


In [28]:
df_Dper1000I = df_Dper1000I.iloc[1:54]

In [29]:
df_Dper1000I.shape

(51, 22)
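
The positional slices in In [26] and In [28] depend on the exact row order of the spreadsheet. A label-based filter states the same idea (drop the national aggregate row) more explicitly. The sketch below works on a small stand-in frame; the real workbook may contain other non-state rows that need their own handling.

```python
import pandas as pd

toy = pd.DataFrame({
    "State": ["United States", "Alabama", "Alaska", None],
    "2018": [3.202520, 0.545666, 0.511531, None],
})

states_only = (
    toy.dropna(subset=["State"])                      # drop empty trailing rows
       .loc[lambda d: d["State"] != "United States"]  # drop the national aggregate
       .reset_index(drop=True)                        # renumber rows from 0
)
print(states_only)
```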

In [30]:
df_Dper1000I.index

Int64Index([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
            18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
            35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
            51],
           dtype='int64')

In [31]:
df_Dper1000I.dtypes

State     object
2000     float64
2001     float64
2002     float64
2003     float64
2004     float64
2005     float64
2006     float64
2007     float64
2008     float64
2009     float64
2010     float64
2011     float64
2012     float64
2013     float64
2014     float64
2015     float64
2016     float64
2017     float64
2018     float64
K        float64
P         object
dtype: object

In [32]:
df_Dper1000I.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 51 entries, 1 to 51
Data columns (total 22 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   State   51 non-null     object 
 1   2000    51 non-null     float64
 2   2001    51 non-null     float64
 3   2002    51 non-null     float64
 4   2003    51 non-null     float64
 5   2004    51 non-null     float64
 6   2005    51 non-null     float64
 7   2006    51 non-null     float64
 8   2007    51 non-null     float64
 9   2008    51 non-null     float64
 10  2009    51 non-null     float64
 11  2010    51 non-null     float64
 12  2011    51 non-null     float64
 13  2012    51 non-null     float64
 14  2013    51 non-null     float64
 15  2014    51 non-null     float64
 16  2015    51 non-null     float64
 17  2016    51 non-null     float64
 18  2017    51 non-null     float64
 19  2018    51 non-null     float64
 20  K       51 non-null     float64
 21  P       51 non-null     object 
dtypes: float64(20), object(2)

In [33]:
df_Dper1000I.describe(include = 'all')


Unnamed: 0,State,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2011,2012,2013,2014,2015,2016,2017,2018,K,P
count,51,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,...,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51
unique,51,,,,,,,,,,...,,,,,,,,,,1
top,Alabama,,,,,,,,,,...,,,,,,,,,,Y
freq,1,,,,,,,,,,...,,,,,,,,,,51
mean,,1.395969,1.641068,1.829406,2.089938,2.014794,1.741067,1.614753,1.533752,1.532981,...,2.227446,2.39875,2.353661,2.381664,2.425209,2.321685,2.389,2.440515,0.433437,
std,,0.87528,0.952092,0.980127,0.893737,0.8676,0.824333,0.832064,0.856646,0.971688,...,2.476779,2.622491,2.073287,1.859088,1.793678,1.691515,1.758781,1.88339,0.445743,
min,,0.05202,0.0,0.0,0.669438,0.514732,0.271772,0.430287,0.215103,0.186155,...,0.527211,0.296114,0.265998,0.386737,0.418618,0.212106,0.187291,0.269753,-0.38066,
25%,,0.69865,1.078716,1.16125,1.395276,1.48555,1.207748,1.10976,0.955597,0.905879,...,1.229187,1.39506,1.397285,1.416752,1.518099,1.335786,1.424074,1.338949,0.146663,
50%,,1.184784,1.508084,1.606224,2.035318,1.907324,1.591448,1.500401,1.428665,1.440205,...,1.705206,1.860134,1.950105,1.978843,1.99792,1.902772,1.928166,1.934671,0.421439,
75%,,1.851651,1.952249,2.412959,2.700316,2.565026,2.261274,1.909093,1.844025,1.756629,...,2.342877,2.491381,2.530685,2.688086,2.765615,2.917187,2.754378,2.988274,0.563053,


Several operations are available for working with data in a *pandas* DataFrame, such as subsetting, grouping, and merging.
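
A compact sketch of subsetting, grouping, and merging on made-up data. The `region`, `rate`, and `population` values are illustrative and not taken from the NSF dataset.

```python
import pandas as pd

rates = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Arizona", "Arkansas"],
    "region": ["South", "West", "West", "South"],
    "rate": [0.5, 0.5, 5.0, 1.0],
})
population = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Arizona"],
    "population": [450_000, 70_000, 690_000],
})

# Subsetting: select rows with a boolean mask and columns by name.
west = rates.loc[rates["region"] == "West", ["State", "rate"]]

# Grouping: aggregate the rate within each region.
mean_by_region = rates.groupby("region")["rate"].mean()

# Merging: a left join keeps every row of `rates`; unmatched rows get NaN.
merged = rates.merge(population, on="State", how="left")

print(west, mean_by_region, merged, sep="\n\n")
```

The left join is chosen here so that Arkansas, which has no entry in `population`, stays in the result with a missing value instead of being dropped.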

In [34]:
state = df_Dper1000I['State']

In [35]:
state.head()

1       Alabama
2        Alaska
3       Arizona
4      Arkansas
5    California
Name: State, dtype: object

In [36]:
state.tail()

47         Virginia
48       Washington
49    West Virginia
50        Wisconsin
51          Wyoming
Name: State, dtype: object

In [37]:
type(df_Dper1000I)

pandas.core.frame.DataFrame

In [38]:
type(state)

pandas.core.series.Series

In [39]:
#df_Dper1000I.set_index('State', inplace = True)
df_Dper1000I.head()

Unnamed: 0,State,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2011,2012,2013,2014,2015,2016,2017,2018,K,P
1,Alabama,1.295665,1.428689,1.584354,1.938456,1.629056,1.385948,1.108657,0.771065,0.814147,...,1.019051,1.04643,0.975455,0.914706,0.90472,0.737081,0.651524,0.545666,0.288689,Y
2,Alaska,0.05202,0.0,0.0,0.908385,0.514732,0.271772,0.430287,0.215103,0.186155,...,1.247997,0.977505,1.205228,1.005332,0.741492,0.724706,0.833785,0.511531,-0.05202,Y
3,Arizona,1.518397,1.858147,1.991238,2.372026,2.652329,2.36585,2.187552,2.469217,4.257777,...,17.843667,18.761808,13.842297,11.919521,10.029093,7.814437,6.417152,5.004393,0.472841,Y
4,Arkansas,0.686259,0.607912,0.753114,1.046808,0.790488,0.783416,0.828824,0.773631,0.613266,...,0.618154,0.587879,0.638613,0.691922,0.73523,0.635181,0.683839,0.976556,0.066856,Y
5,California,2.246889,2.150801,2.482296,2.798916,2.576467,2.421829,2.38835,2.407973,2.527423,...,3.83052,4.379885,5.259684,5.963148,6.660418,7.750218,8.617775,10.213075,0.235408,Y


In [40]:
df_Dper1000I['2018'].mean()

2.4405146765648635

In [41]:
df_Dper1000I['2018']>df_Dper1000I['2018'].mean()

1     False
2     False
3      True
4     False
5      True
6     False
7     False
8      True
9     False
10    False
11    False
12    False
13     True
14    False
15    False
16    False
17    False
18    False
19    False
20     True
21     True
22    False
23    False
24     True
25     True
26    False
27    False
28    False
29    False
30     True
31     True
32     True
33     True
34    False
35    False
36    False
37     True
38    False
39    False
40    False
41    False
42     True
43    False
44    False
45     True
46    False
47     True
48     True
49     True
50    False
51     True
Name: 2018, dtype: bool
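
The comparison in In [41] returns a boolean Series. Passing that Series back into the DataFrame selects the matching rows. A sketch using four 2018 values copied from the tables above; note that the mean here is computed over these four rows only, not over the full dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Arizona", "California"],
    "2018": [0.545666, 0.511531, 5.004393, 10.213075],
})

above_mean = toy["2018"] > toy["2018"].mean()   # boolean mask, one entry per row
print(toy.loc[above_mean, "State"].tolist())
print(int(above_mean.sum()), "of", len(toy), "rows are above the mean")
```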

In [42]:
from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import urlopen, urlretrieve, quote
from urllib.parse import urljoin

#url = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009/'

url = 'https://ncses.nsf.gov/pubs/nsf21303#data-tables/nsf21303-tab002.xlxs'


In [43]:
#soup

In [44]:

df = pd.read_excel('https://ncses.nsf.gov/pubs/nsf21303/assets/data-tables/tables/nsf21303-tab002.xlsx', index_col=0)


In [45]:
import requests, os
import http.client

http.client.HTTPConnection._http_vsn = 10
http.client.HTTPConnection._http_vsn_str = 'HTTP/1.0'

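# Direct link to the workbook (same URL as in In [44]). The landing page
# "https://ncses.nsf.gov/pubs/nsf21303#data-tables" serves HTML, not an .xlsx file.
url = "https://ncses.nsf.gov/pubs/nsf21303/assets/data-tables/tables/nsf21303-tab002.xlsx"

print("Downloading...")
resp = requests.get(url)
resp.raise_for_status()  # stop here if the server returns an error status
with open('nsf21303-tab002.xlsx', 'wb') as output:
    output.write(resp.content)
print("Done!")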

Downloading...
Done!


In [46]:
df = pd.read_excel("https://ncses.nsf.gov/pubs/nsf21303/assets/data-tables/tables/nsf21303-tab001.xlsx", skiprows = [0, 1, 2], usecols= [0, 1, 3, 5])

In [47]:
df.head()

Unnamed: 0,Company and financial information,All companies,1–4 employees,5–9 employees
0,Total R&D cost,6655717.0,3525771.0,3129946.0
1,"R&D for salaries, wages, and fringe benefits",3656750.0,1895125.0,1761625.0
2,R&D for expensed machinery and equipment (not ...,256952.0,153000.0,103952.0
3,R&D for materials and supplies,583126.0,319901.0,263224.0
4,R&D for payments to business partners for coll...,361014.0,228498.0,132516.0


In [48]:
df.columns

Index(['Company and financial information', 'All companies', '1–4 employees',
       '5–9 employees'],
      dtype='object')