<center><font size="50"> <b> Pandas </b> </font></center>

From the [pandas GitHub](https://github.com/pandas-dev/pandas) page:

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python.

In [None]:
import pandas as pd

## The two most fundamental data structures in pandas
- Series
- DataFrame

## Series

A one-dimensional, array-like object with an associated array of labels (the index).

In [None]:
import numpy as np
np.random.seed(1)
random_int = np.random.randint(1, 100, 5)
series = pd.Series(random_int)

In [None]:
series

In [None]:
series.index

If we don't pass an index, a default integer index starting from 0 is created.
We can also pass our own labels as the index.

In [None]:
series = pd.Series(random_int, index=['a', 'b', 'c', 'd', 'e'])

In [None]:
series.index

In [None]:
series.values

## We can use boolean filtering (indexing) and math operations

In [None]:
series[series > 60]

In [None]:
np.sqrt(series)
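A small sketch of what happens behind the filtering above, assuming illustrative values rather than the random draw used in this notebook. The names `scores`, `mask`, `high`, `mid` and `roots` are made up for the example.

```python
import numpy as np
import pandas as pd

# Illustrative values with a labelled index
scores = pd.Series([38, 13, 73, 10, 76], index=['a', 'b', 'c', 'd', 'e'])

mask = scores > 60        # a boolean Series aligned on the same index
high = scores[mask]       # keeps only the entries where mask is True

# Combine conditions with & or |, wrapping each condition in parentheses
mid = scores[(scores > 10) & (scores < 75)]

roots = np.sqrt(scores)   # NumPy ufuncs return a Series and keep the index
```

The parentheses matter because `&` binds more tightly than the comparison operators.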

Or we can use a Python dict to create a Series (**accepting a dict is a common theme in Python libraries**).

In [None]:
sdata = {'Colorado': 5.6, 'Utah': 3.1, 'Nevada': 2.9}
state_ser = pd.Series(sdata)

state_ser

Both the Series object and its index have a `name` attribute.

In [None]:
# the index holds the states; the values are populations in millions
state_ser.index.name = 'State'
state_ser.name = 'Population in Million'

In [None]:
state_ser

We can use labels to index values.

In [None]:
state_ser[['Colorado', 'Utah']]

In a real dataset, some values of an attribute will be missing. Let's add a state with a missing value.

In [None]:
# np.nan is the canonical spelling; the np.NAN alias was removed in NumPy 2.0
state_ser['Texas'] = np.nan

In [None]:
state_ser

# Checking for missing values (isna, isnull, notnull)

In [None]:
pd.isna(state_ser)

In [None]:
pd.isnull(state_ser)

In [None]:
# isnull is an alias for isna
pd.isnull
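The heading also mentions `notnull`. A minimal sketch of it, together with two common follow-ups (`dropna` and `fillna`), on a made-up Series called `pop`:

```python
import numpy as np
import pandas as pd

pop = pd.Series({'Colorado': 5.6, 'Utah': 3.1, 'Texas': np.nan})

present = pd.notnull(pop)       # element-wise inverse of pd.isnull
n_missing = pop.isna().sum()    # True counts as 1, so this counts missing values

dropped = pop.dropna()          # new Series without the missing entries
filled = pop.fillna(0.0)        # new Series with NaN replaced by 0.0
```

Whether dropping or filling is appropriate depends on the dataset; both return a new Series and leave `pop` unchanged.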

# DataFrame

Used to represent tabular (2D) data.
- It has both a row index and a column index.
- It can be thought of as a collection (dict) of Series sharing the same index.
- Hierarchical indexing can be used for higher-dimensional data.
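The "dict of Series" view can be sketched as follows. The names `idx`, `vandalism`, `drug_abuse` and `frame` are made up for this illustration, and the numbers are just sample values.

```python
import pandas as pd

idx = ['2007', '2008', '2009']
vandalism = pd.Series([33, 69, 48], index=idx)
drug_abuse = pd.Series([46, 60, 61], index=idx)

# Each dict key becomes a column; the shared index becomes the row index
frame = pd.DataFrame({'vandalism': vandalism, 'drug abuse': drug_abuse})

# Selecting one column gives back a Series on that same index
col = frame['vandalism']
```

Selecting a column returns a Series again, which is why most Series operations carry over to DataFrame columns.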

In [None]:
# Creating a DataFrame from a dictionary
crime = {
    'years':['2007','2008','2009','2010'],
    'vandalism':[33,69,48,44],
    'drug abuse':[46,60,61,67],
    'liquor laws':[86,81,76,86]
}
crime_df = pd.DataFrame(crime)
crime_df

Note that pandas renders the table as nicely formatted HTML.

# Some properties of a pandas DataFrame

In [None]:
crime_df.columns

In [None]:
crime_df.index

In [None]:
crime_df.dtypes

In [None]:
crime_df.isna()

In [None]:
crime_df.head(2)

In [None]:
## How do we view the bottom two rows?


In [None]:
# How do we get the underlying 2D NumPy array?


A real dataset usually has a lot of columns. We can choose the column order and supply index values.

In [None]:
crime.keys()

In [None]:
pd.DataFrame(crime, columns=['years', 'liquor laws', 'drug abuse', 'vandalism'], index =list('abcd'))

In [None]:
# or, if we have already created the DataFrame
pd.DataFrame(crime_df, columns=['years', 'liquor laws', 'drug abuse', 'vandalism'])

In [None]:
# or we want the years to be the index
crime_df = pd.DataFrame(crime_df, columns=['years', 'liquor laws', 'drug abuse', 'vandalism'] )
crime_df.set_index('years')

In [None]:
crime_df

What happened? We just set the index, but `set_index` returns a new DataFrame and leaves `crime_df` unchanged.

In [None]:
# Use inplace=True to modify the DataFrame itself
crime_df.set_index('years', inplace=True)
crime_df
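An alternative to `inplace=True` is to keep the returned DataFrame. This is a minimal sketch on a made-up frame (`df_years` and `indexed` are illustrative names), showing that the original is left as it was.

```python
import pandas as pd

df_years = pd.DataFrame({'years': ['2007', '2008'], 'vandalism': [33, 69]})

# set_index returns a new DataFrame with 'years' moved into the index
indexed = df_years.set_index('years')

# The original still has its default integer index and the 'years' column
original_columns = list(df_years.columns)
```

Reassigning (`df_years = df_years.set_index('years')`) has the same effect as `inplace=True` for most purposes.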

# Slicing and dicing a DataFrame ([], loc, iloc)

In [None]:
crime_df[['drug abuse', 'vandalism']]

## Slicing or selecting data with a boolean array

In [None]:
crime_df[crime_df['vandalism']>40]

In [None]:
# or use attribute access to select a column
crime_df.vandalism

In [None]:
# try to use drug abuse as a property to access this column
# crime_df.drug abuse   <- SyntaxError: an attribute name cannot contain a space

Attribute access requires a valid Python identifier. Let's change the column name.

**Search for the pandas function that renames columns and use it to rename drug abuse to *drug_abuse***

In [None]:
#Write code here

## Rows can be retrieved using loc and iloc

## loc
- loc selects by label (index value)
- it also supports conditional (boolean) lookup

In [None]:
#
crime_df

In [None]:
# using label
series_2010 =crime_df.loc[['2010'], ['drug_abuse', 'vandalism']]
series_2010

In [None]:
# Conditional row selection
crime_df.loc[crime_df.drug_abuse>50]

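<font color = "red">Indexing can return a view or a copy.</font> Selecting with lists of labels, as above, returns a copy, so the assignment below does not change `crime_df`.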

In [None]:
series_2010.drug_abuse = 1.0

In [None]:
crime_df
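To make the copy-versus-original behaviour explicit, here is a hedged sketch on a made-up frame (`frame` and `subset` are illustrative names). It takes an explicit `.copy()` for the subset and assigns through `.loc` when the original is meant to change.

```python
import pandas as pd

frame = pd.DataFrame({'drug_abuse': [46, 60], 'vandalism': [33, 69]},
                     index=['2009', '2010'])

# An explicit copy is independent of the original frame
subset = frame.loc[['2010'], ['drug_abuse', 'vandalism']].copy()
subset['drug_abuse'] = 1

# Assigning through .loc on the original changes the original
frame.loc['2010', 'vandalism'] = 0
```

After this, `frame` still holds 60 for drug_abuse in 2010, while its vandalism value for 2010 is 0.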

## iloc
Use it for integer-location-based indexing.

In [None]:
crime_df

In [None]:
crime_df.iloc[1:3, 1:3]
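One difference worth seeing side by side: slices with `iloc` exclude the stop position, while label slices with `loc` include the stop label. A minimal sketch on a made-up frame:

```python
import pandas as pd

frame = pd.DataFrame({'x': [1, 2, 3, 4]}, index=['a', 'b', 'c', 'd'])

by_position = frame.iloc[0:2]     # positions 0 and 1; the stop is excluded
by_label = frame.loc['a':'c']     # labels 'a' through 'c'; the stop is included
```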

In [None]:
crime_df.T

# Reindex
Creates a new DataFrame conformed to a new index; labels that were not present before are filled with NaN.

In [None]:
df = pd.DataFrame(np.arange(12).reshape((4,3)), index=[0, 3 ,5 ,9], columns=['a', 'b', 'c'])
df

In [None]:
# row reindexing
df.reindex(range(10))

In [None]:
# column reindexing
df.reindex(columns=['c', 'b'])
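Reindexing introduces NaN for new labels by default. The sketch below, on a made-up frame, shows two other options that `reindex` accepts: a constant `fill_value` and forward filling with `method='ffill'` (which needs a monotonic index).

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(6).reshape((3, 2)),
                     index=[0, 3, 5], columns=['a', 'b'])

with_nan = frame.reindex(range(6))                  # new labels become NaN
with_zero = frame.reindex(range(6), fill_value=0)   # new labels get a constant
forward = frame.reindex(range(6), method='ffill')   # new labels copy the previous row
```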

# Dropping rows or columns

In [None]:
data_df = pd.DataFrame(np.arange(16).reshape((4, 4)),
                      index=['Ohio', 'Colorado', 'Utah', 'New York'],
                      columns=['one', 'two', 'three', 'four'])
data_df

In [None]:
data_df.drop(['Utah'])

# To drop a column use axis=1; axis=0 (rows) is the default

In [None]:
data_df.drop(['one', 'three'], axis=1)
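`drop` also accepts `index=` and `columns=` keywords, which some readers find easier to follow than `axis`. A minimal sketch on a made-up frame; like the calls above, it returns a new DataFrame and leaves the original untouched.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(6).reshape((2, 3)),
                     index=['Ohio', 'Utah'], columns=['one', 'two', 'three'])

no_utah = frame.drop(index=['Utah'])     # same as frame.drop(['Utah'])
no_one = frame.drop(columns=['one'])     # same as frame.drop(['one'], axis=1)
```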

# Arithmetic operations and element-wise NumPy functions


In [None]:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
                   columns=list('abcd'))
df1

In [None]:
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
                    columns=list('abcde'))
df2

In [None]:
df1 + df2
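Adding frames with different labels aligns on both index and columns, and any position missing from either side becomes NaN. A hedged sketch on two small made-up frames, including the `add` method with `fill_value` as one way to avoid those NaNs:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame(np.ones((2, 2)), columns=list('ab'))
right = pd.DataFrame(np.ones((3, 3)), columns=list('abc'))

plain = left + right                      # NaN wherever either side lacks the label
filled = left.add(right, fill_value=0)    # a missing side is treated as 0
```

With `fill_value`, a cell is still NaN if it is missing from both frames; that does not occur here because `right` covers every label.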

In [None]:
np.exp(df1)

# Applying a lambda function to a DataFrame

In [None]:
# apply a function to each column (axis=0, the default); x is one column
df1.apply(lambda x: x.max())

In [None]:
# or apply it to each row with axis=1 (or axis='columns'); x is one row
df1.apply(lambda x: x.max(), axis='columns')
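The `axis` argument is easy to mix up, so here is a small sketch on a made-up 2x3 frame that shows what each call returns.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(6.).reshape((2, 3)), columns=list('abc'))

# axis=0 (default): the function receives each column, one result per column
per_column = frame.apply(lambda col: col.max())

# axis='columns' (or 1): the function receives each row, one result per row
per_row = frame.apply(lambda row: row.max(), axis='columns')
```

`per_column` is indexed by the column labels and `per_row` by the row labels.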

# applymap for element-wise functions

In [None]:
# note: pandas 2.1+ deprecates applymap in favor of DataFrame.map
df1.applymap(lambda x: int(x))

In [None]:
df['a'].map(lambda x: x**2)

# Summarizing and Computing Descriptive Statistics

In [None]:
df = pd.DataFrame(np.arange(8).reshape(4,2),
                     index=['a', 'b', 'c', 'd'],
                     columns=['one', 'two'])
df

In [None]:
df['name'] = ['Sam', 'Tim', 'John', 'Chris']
df

In [None]:
df.sum()

In [None]:
# 'name' holds strings, so restrict the mean to numeric columns
df.mean(numeric_only=True)

In [None]:
df.max()

In [None]:
# Can you guess the method that returns summary statistics?

In [None]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                       [np.nan, np.nan], [0.75, -1.3]],
                     index=['a', 'b', 'c', 'd'],
                     columns=['one', 'two'])
df

In [None]:
df.sum()

In [None]:
df.sum(skipna=False)
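Two related details, sketched on a made-up frame: reductions can run across columns with `axis='columns'`, and an all-NaN row sums to 0.0 unless `min_count` asks for at least one valid value.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1.0, np.nan], [np.nan, np.nan]],
                     index=['a', 'b'], columns=['one', 'two'])

row_sums = frame.sum(axis='columns')              # NaN skipped; all-NaN row gives 0.0
strict = frame.sum(axis='columns', min_count=1)   # all-NaN row gives NaN instead
col_means = frame.mean()                          # NaN is skipped here as well
```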

# Aside: a quick way to scrape tables from web pages

In [None]:
tables = pd.read_html('https://en.wikipedia.org/wiki/Malnutrition', header=0)

In [None]:
tables[2]