# pandas

*pandas* is an open source Python library that provides fast, flexible, and easy-to-use tools for data analysis and manipulation. Its expressive data structures are designed to make working with structured (tabular, multidimensional, potentially heterogeneous) and time series data both easy and intuitive. In pandas, a data table is called a DataFrame.

It offers data exploration, cleaning, and transformation operations that are critical when working with data in Python.

*pandas* builds on *numpy*, providing easy-to-use data structures and data manipulation functions with integrated indexing.

The main data structures *pandas* provides are *Series* and *DataFrame*. After a brief introduction to these two data structures and to data ingestion, this notebook covers the following key features of *pandas*:
* Generating descriptive statistics on data
* Data cleaning using built-in pandas functions
* Frequent data operations for subsetting, filtering, insertion, deletion and aggregation of data
* Merging multiple datasets using DataFrames
* Working with timestamps and time-series data

This notebook summarizes an introduction to pandas and some of its methods.

#### Import Libraries
The first step in using a library in Python is importing it so that it is accessible for use. You can import a library under its own name or under an alias that makes it easier to use.

In [1]:
import datetime
current_time = datetime.datetime.now()
current_time.isoformat()

'2022-04-14T21:42:18.386749'

In [2]:
import datetime as dt
current_time = dt.datetime.now()
current_time.isoformat()

'2022-04-14T21:42:18.496456'

In [3]:
# Import the library with alias and print its version
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
pd.__version__

'1.3.4'

There are two major pandas data structures: Series and DataFrame.
* Series: a one-dimensional labeled array-like object that is capable of holding any data type and maps an index to values.
* DataFrame: a two-dimensional labeled data structure (a rectangular table of data) whose columns can have different data types. It has both rows and columns.

You can import different data files such as Excel, CSV, SAS, SPSS, etc. into a pandas DataFrame.
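
As a minimal sketch of data ingestion, the snippet below reads CSV text from an in-memory buffer with `pd.read_csv`. The column names and values are made up for illustration. Readers for other formats, such as `pd.read_excel`, follow the same pattern but need a real file on disk or at a URL.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """state,year,degrees
Alabama,2000,1.5
Alaska,2000,0.9
Arizona,2000,2.1
"""

# read_csv accepts a path, a URL, or any file-like object
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv.shape)
print(df_csv.dtypes)
```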

In [4]:
# Creating Series
s = pd.Series(np.random.randn(5), name="Series1")
s.head()

0    1.089806
1    1.497397
2   -0.927283
3   -0.203392
4   -0.078266
Name: Series1, dtype: float64

In [5]:
# checking name of the series
s.name

'Series1'

In [6]:
# Creating DataFrame
df1 = pd.DataFrame.from_dict(dict([("A", [1, 2, 3, 4]), ("B", [10, 11, 12, 13])]))
df1.head()

   A   B
0  1  10
1  2  11
2  3  12
3  4  13


In [7]:
df1 = df1.drop(index=[1,3])
df1.head()

   A   B
0  1  10
2  3  12


In [8]:
df1 = df1.reset_index()

df1.head()

   index  A   B
0      0  1  10
1      2  3  12
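
The `index` column above holds the old row labels. If those labels are not needed, `reset_index` accepts `drop=True` to discard them instead. A small sketch on a toy frame (the names `toy`, `kept` and `dropped` are only for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"A": [1, 2, 3, 4], "B": [10, 11, 12, 13]})
toy = toy.drop(index=[1, 3])          # remaining row labels: 0 and 2

kept = toy.reset_index()              # old labels are kept in a new 'index' column
dropped = toy.reset_index(drop=True)  # old labels are discarded

print(list(kept.columns))
print(list(dropped.columns))
```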


In [9]:
# Creating another DataFrame
df2 = pd.DataFrame.from_dict(
    dict([("A", [1, 2, 3, 4]), ("B", [10, 11, 12, 13])]),
    orient="index",
    columns=["One", "Two", "Three", "Four"],)
df2.head()

   One  Two  Three  Four
A    1    2      3     4
B   10   11     12    13


In [10]:
# Main DataFrame components
columns = df2.columns
index = df2.index
data = df2.values

In [11]:
columns

Index(['One', 'Two', 'Three', 'Four'], dtype='object')

In [12]:
index

Index(['A', 'B'], dtype='object')

In [13]:
data

array([[ 1,  2,  3,  4],
       [10, 11, 12, 13]], dtype=int64)

In [14]:
# Dimension of the DataFrame
df2.shape

(2, 4)

In [15]:
# DataFrame values
df2.values

array([[ 1,  2,  3,  4],
       [10, 11, 12, 13]], dtype=int64)
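
The pandas documentation recommends `DataFrame.to_numpy()` over `.values` for getting the underlying data as a NumPy array. A short sketch, rebuilding the same frame under the illustrative name `demo`:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame.from_dict(
    {"A": [1, 2, 3, 4], "B": [10, 11, 12, 13]},
    orient="index",
    columns=["One", "Two", "Three", "Four"],
)

arr = demo.to_numpy()   # 2-D NumPy array, one row per index label
print(arr.shape)
print(np.array_equal(arr, demo.values))
```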

In [16]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2 entries, A to B
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   One     2 non-null      int64
 1   Two     2 non-null      int64
 2   Three   2 non-null      int64
 3   Four    2 non-null      int64
dtypes: int64(4)
memory usage: 80.0+ bytes


In [None]:
# Dataset: Associate's Degrees in Science and Engineering Conferred per 1,000 Individuals 18–24 Years Old (publicly available data)
# 1. Number of S&E associate's degree recipients aged 18–24, by state and year
df_assDe = pd.read_excel("se-associates-degrees-per-1000-18-24-year-olds.xlsx", skiprows=[0, 1, 2],
                         usecols=list(range(20)))  # column positions 0-19

df_assDe.head()
df_assDe.columns

In [None]:
# 2. Number of individuals aged 18–24, by state and year
df_indv = pd.read_excel("se-associates-degrees-per-1000-18-24-year-olds.xlsx", skiprows=[0, 1, 2],
                        usecols=[0] + list(range(21, 40)))  # state column plus positions 21-39
df_indv.columns = ['State', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007',
       '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016',
       '2017', '2018']
df_indv.head()

In [None]:
df_indv.shape

In [None]:
# 3. Degrees per 1,000 individuals aged 18–24, by state and year
df_Dper1000I = pd.read_excel("se-associates-degrees-per-1000-18-24-year-olds.xlsx", skiprows=[0, 1, 2],
                             usecols=[0] + list(range(41, 60)))  # state column plus positions 41-59
df_Dper1000I.columns = ['State', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007',
       '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016',
       '2017', '2018']
df_Dper1000I.head()
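
The three reads above pull different blocks of columns out of one wide sheet by position. The same idea can be sketched without the Excel file, using CSV text and `usecols` with integer positions. The layout and names below are hypothetical and do not reproduce the real workbook.

```python
import io
import pandas as pd

# Hypothetical wide layout: a label column, a two-year block,
# an empty spacer column, then a second two-year block
wide = """State,d2000,d2001,gap,n2000,n2001
Alabama,10,12,,100,110
Alaska,3,4,,30,35
"""

# Block 1: label column plus positions 1-2; block 2: label column plus positions 4-5
degrees = pd.read_csv(io.StringIO(wide), usecols=[0] + list(range(1, 3)))
people = pd.read_csv(io.StringIO(wide), usecols=[0] + list(range(4, 6)))

# Give both blocks the same column names so they line up
degrees.columns = ["State", "2000", "2001"]
people.columns = ["State", "2000", "2001"]
print(degrees)
print(people)
```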

In [None]:
#df_Dper1000I['P'] = [a - b for a in df_Dper1000I['2002'] for b in  df_Dper1000I['2000'] for x in df_Dper1000I['State'] if x == "Alabama"]

# Change in degrees per 1,000 individuals between 2000 and 2002 (element-wise column subtraction)
df_Dper1000I['K'] = df_Dper1000I['2002'] - df_Dper1000I['2000']
df_Dper1000I.head()

In [None]:
# P is 0 for Alabama and that row's own K value for every other state
df_Dper1000I['P'] = np.where(df_Dper1000I['State'] == 'Alabama', 0, df_Dper1000I['K'])
df_Dper1000I.head()
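
A conditional column like `P` is usually built with a vectorized expression rather than a Python loop. Two common options, sketched here on a toy frame with invented values, are `np.where` and boolean assignment through `.loc`:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Arizona"],
    "2000": [1.0, 2.0, 3.0],
    "2002": [1.5, 2.5, 2.0],
})
toy["K"] = toy["2002"] - toy["2000"]

# Option 1: np.where(condition, value_if_true, value_if_false)
toy["P"] = np.where(toy["State"] == "Alabama", 0, toy["K"])

# Option 2: copy the column, then overwrite the matching rows with .loc
toy["P2"] = toy["K"]
toy.loc[toy["State"] == "Alabama", "P2"] = 0
print(toy)
```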

In [None]:
df_Dper1000I['K'] = df_Dper1000I['2002']- df_Dper1000I['2000'] 
df_Dper1000I.dtypes
#print(df_Dper1000I['P'])
df_Dper1000I.shape

In [None]:
# Flag each row: 'Y' where the 2000 value differs from 1.295665, otherwise 'N'
df_Dper1000I['P'] = np.where(df_Dper1000I['2000'] != 1.295665, 'Y', 'N')
df_Dper1000I.head()
df_Dper1000I.dtypes
# Flag the national row: 'Y' for "United States", 'N' for every other row
df_Dper1000I['P'] = np.where(df_Dper1000I['State'] == "United States", 'Y', 'N')

df_Dper1000I.head()
# df_Dper1000I['P'] = (df_Dper1000I['py-score'] >= 80) & (df_Dper1000I['js-score'] >= 80)

In [None]:
#df_Dper1000I['P']  = df_Dper1000I.apply(lambda row: row["2003"] - row['2000'] if row["State"] == 'Alabama' else np.nan, axis=1)
df_Dper1000I.head(52)

In [None]:
df_Dper1000I = df_Dper1000I.iloc[:52]  # keep the first 52 rows
df_Dper1000I.iloc[:55]  # iloc slicing tolerates an end past the last row

In [None]:
df_Dper1000I = df_Dper1000I.dropna()
df_Dper1000I.iloc[:52]
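
By default `dropna` removes every row that contains at least one missing value, which can also remove rows you want to keep. The toy sketch below uses invented values, and the footnote-style last row is only an assumption about how spreadsheet exports often look. It contrasts the default with `how="all"` restricted to a subset of columns.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Note: source text"],
    "2000": [1.2, np.nan, np.nan],
    "2001": [1.3, 0.8, np.nan],
})

# Default: drop rows with at least one NaN in any column
any_missing = toy.dropna()

# Drop only rows where both year columns are NaN
all_missing = toy.dropna(how="all", subset=["2000", "2001"])
print(any_missing)
print(all_missing)
```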

In [None]:
df_Dper1000I = df_Dper1000I.iloc[1:54]

In [None]:
df_Dper1000I.shape

In [None]:
df_Dper1000I.index

In [None]:
df_Dper1000I.dtypes

In [None]:
df_Dper1000I.info()

In [None]:
df_Dper1000I.describe(include = 'all')


*pandas* provides several operations for working with the data in a DataFrame, such as subsetting, grouping, and merging.
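
A minimal sketch of these three operations on two toy frames. All names and numbers are invented for illustration and are unrelated to the degree dataset.

```python
import pandas as pd

degrees = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Arizona", "Arkansas"],
    "Region": ["South", "West", "West", "South"],
    "Degrees": [10, 3, 12, 5],
})
population = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Arizona"],
    "Population": [100, 30, 150],
})

# Subsetting: keep selected columns for rows that satisfy a condition
west = degrees.loc[degrees["Region"] == "West", ["State", "Degrees"]]

# Grouping: total degrees per region
by_region = degrees.groupby("Region")["Degrees"].sum()

# Merging: left join on the shared 'State' column; unmatched rows get NaN
merged = degrees.merge(population, on="State", how="left")
print(west, by_region, merged, sep="\n\n")
```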

In [None]:
state = df_Dper1000I['State']

In [None]:
state.head()

In [None]:
state.tail()

In [None]:
type(df_Dper1000I)

In [None]:
type(state)

In [None]:
#df_Dper1000I.set_index('State', inplace = True)
df_Dper1000I.head()

In [None]:
df_Dper1000I['2018'].mean()

In [None]:
df_Dper1000I['2018']>df_Dper1000I['2018'].mean()
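
The comparison above returns a boolean Series with one entry per row. Passing such a mask back to the DataFrame keeps only the rows where it is `True`. A sketch on a toy frame with invented values:

```python
import pandas as pd

toy = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Arizona"],
    "2018": [2.0, 1.0, 6.0],
})

mask = toy["2018"] > toy["2018"].mean()   # boolean Series, one entry per row
above_average = toy[mask]                 # keep only rows where the mask is True
print(above_average)
```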

In [None]:
from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import urlopen, urlretrieve
from urllib.parse import quote, urljoin

#url = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009/'

url = 'https://ncses.nsf.gov/pubs/nsf21303#data-tables/nsf21303-tab002.xlxs'


In [None]:
#soup

In [None]:

df = pd.read_excel('https://ncses.nsf.gov/pubs/nsf21303/assets/data-tables/tables/nsf21303-tab002.xlsx', index_col=0)


In [None]:
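import requests
import http.client

# Force HTTP/1.0 for connections made through http.client
http.client.HTTPConnection._http_vsn = 10
http.client.HTTPConnection._http_vsn_str = 'HTTP/1.0'

# Direct link to the workbook, the same file passed to pd.read_excel earlier
url = "https://ncses.nsf.gov/pubs/nsf21303/assets/data-tables/tables/nsf21303-tab002.xlsx"

print("Downloading...")
resp = requests.get(url)
resp.raise_for_status()  # stop here if the download failed
with open('nsf21303-tab002.xlsx', 'wb') as output:
    output.write(resp.content)
print("Done!")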

In [None]:
df = pd.read_excel("https://ncses.nsf.gov/pubs/nsf21303/assets/data-tables/tables/nsf21303-tab001.xlsx", skiprows = [0, 1, 2], usecols= [0, 1, 3, 5])

In [None]:
df.head()

In [None]:
df.columns