# Pandas Overview
I wanted to collect some of the common commands I've used in pandas (along with some new ones) for future reference. The same week I decided to do this, I ran across a great post by Ted Petrou about minimally sufficient pandas (https://medium.com/dunder-data/minimally-sufficient-pandas-a8e67f2a2428), so I try to follow some of the examples from that post.

In [1]:
import pandas as pd
import numpy as np

### Read in data
This is the College Scorecard data set used in the blog post. To get it, go to https://collegescorecard.ed.gov/data/ and download the scorecard data. For reading data, Ted recommends using only "read_csv", since "read_table" was slated for deprecation when he wrote the post.

In [2]:
data = pd.read_csv('Most-Recent-Cohorts-All-Data-Elements.csv')
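
As a minimal sketch of why "read_csv" alone is enough: it accepts any delimiter through `sep`, and `usecols` limits which columns get loaded. The tab-separated text and the school names below are made up for illustration; they are not rows from the scorecard file.

```python
import io
import pandas as pd

# Hypothetical stand-in for a delimited file: a tiny tab-separated sample.
tsv_text = (
    "INSTNM\tCITY\tSTABBR\tUGDS\n"
    "Example College\tNormal\tAL\t4616\n"
    "Sample University\tBirmingham\tAL\t12047\n"
)

# read_csv handles tab-separated input through `sep`, so read_table is not needed.
# `usecols` loads only the listed columns, which helps with very wide files.
sample = pd.read_csv(io.StringIO(tsv_text), sep="\t", usecols=["INSTNM", "CITY", "UGDS"])
print(sample)
```

With the real file, the same `usecols` list could replace the column selection done after loading.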



### Select data
Use only loc to select by label (such as column names) and iloc to select by integer position. We could also select columns with brackets and a single list of names: data[['column1','column2']]

In [3]:
# pass the column names as a list; a tuple can be read as a single MultiIndex key
data = data.loc[:, ['INSTNM','CITY','STABBR','HBCU','MENONLY','WOMENONLY','RELAFFIL','SATVRMID','SATMTMID','DISTANCEONLY',
                    'UGDS','UGDS_WHITE','UGDS_BLACK','UGDS_HISP']]
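
To make the three selection styles concrete, here is a small sketch on a made-up frame (the values are illustrative only). All three return the same two columns.

```python
import pandas as pd

# Small made-up frame; the values are illustrative only.
toy = pd.DataFrame(
    {"INSTNM": ["A", "B", "C"], "CITY": ["X", "Y", "Z"], "UGDS": [100, 200, 300]}
)

by_label = toy.loc[:, ["INSTNM", "UGDS"]]   # select columns by name
by_position = toy.iloc[:, [0, 2]]           # select the same columns by position
by_brackets = toy[["INSTNM", "UGDS"]]       # bracket shorthand with one list of names

print(by_label.equals(by_position), by_label.equals(by_brackets))
```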



To select a single cell, Ted talks about converting to a numpy array first, which he reports as around 60x faster. If speed mattered I would convert to a numpy array, but I think iloc is just fine for most cases.
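
A quick sketch of the two ways to read one cell, on a made-up frame. It only shows that both give the same value; it does not measure speed.

```python
import numpy as np
import pandas as pd

# Made-up 4x3 frame holding the integers 0..11.
toy = pd.DataFrame(np.arange(12).reshape(4, 3), columns=["a", "b", "c"])

arr = toy.to_numpy()          # convert once, then index the array directly
cell_numpy = arr[2, 1]        # row 2, column 1 via NumPy
cell_iloc = toy.iloc[2, 1]    # same cell via pandas

print(cell_numpy, cell_iloc)
```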

#### Smaller dataset

In [4]:
import pandas as pd
from datetime import datetime
from dateutil.parser import parse
import numpy as np
raw_data = {'employee_name': ['Andy', 'Andy','Beth', 'Andy', "Dale"],
            'employee_id': [123456,123456,789456,654123,963852],
            'date_joined': ['2015-02-15','2015-02-15', np.nan, '2017-05-16', "2018-01-15"],
            'age': [45,45,36,34,25],
            'yrs_of_experience': [24,24,14,14,4]}
df = pd.DataFrame(raw_data, columns = ['employee_name', 'employee_id', 'date_joined','age', 'yrs_of_experience'])
df

Unnamed: 0,employee_name,employee_id,date_joined,age,yrs_of_experience
0,Andy,123456,2015-02-15,45,24
1,Andy,123456,2015-02-15,45,24
2,Beth,789456,,36,14
3,Andy,654123,2017-05-16,34,14
4,Dale,963852,2018-01-15,25,4


In [16]:
df[(df['employee_name']=='Andy') & (df.employee_id==123456)]

Unnamed: 0,employee_name,employee_id,date_joined,age,yrs_of_experience
0,Andy,123456,2015-02-15,45,24
1,Andy,123456,2015-02-15,45,24


In [10]:
df[['employee_name','date_joined']].drop_duplicates()

Unnamed: 0,employee_name,date_joined
0,Andy,2015-02-15
2,Beth,
3,Andy,2017-05-16
4,Dale,2018-01-15


In [26]:
df_2 = pd.DataFrame([45])
df_2.columns = ['age_x']  # columns must be list-like; assigning the bare string 'age_x' raises a TypeError

In [25]:
df.merge(df, on='employee_name',how='inner')

Unnamed: 0,employee_name,employee_id_x,date_joined_x,age_x,yrs_of_experience_x,employee_id_y,date_joined_y,age_y,yrs_of_experience_y
0,Andy,123456,2015-02-15,45,24,123456,2015-02-15,45,24
1,Andy,123456,2015-02-15,45,24,123456,2015-02-15,45,24
2,Andy,123456,2015-02-15,45,24,654123,2017-05-16,34,14
3,Andy,123456,2015-02-15,45,24,123456,2015-02-15,45,24
4,Andy,123456,2015-02-15,45,24,123456,2015-02-15,45,24
5,Andy,123456,2015-02-15,45,24,654123,2017-05-16,34,14
6,Andy,654123,2017-05-16,34,14,123456,2015-02-15,45,24
7,Andy,654123,2017-05-16,34,14,123456,2015-02-15,45,24
8,Andy,654123,2017-05-16,34,14,654123,2017-05-16,34,14
9,Beth,789456,,36,14,789456,,36,14
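
The merge above grows because the key is not unique: every "Andy" row pairs with every other "Andy" row. A minimal sketch on a made-up frame shows how to predict the row count ahead of time.

```python
import pandas as pd

# Made-up frame with a repeated key.
left = pd.DataFrame({"employee_name": ["Andy", "Andy", "Beth"], "age": [45, 34, 36]})

# Self-merge on a non-unique key: each "Andy" row pairs with every "Andy" row.
pairs = left.merge(left, on="employee_name", how="inner", suffixes=("_x", "_y"))

# Expected row count is the sum of squared key frequencies: 2*2 + 1*1 = 5.
expected_rows = (left["employee_name"].value_counts() ** 2).sum()
print(len(pairs), expected_rows)
```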


In [16]:
df.groupby(['employee_id','employee_name']).agg({'date_joined':['count',lambda x: x.nunique(dropna=False),lambda x: x.shape[0]]})

Unnamed: 0_level_0,Unnamed: 1_level_0,date_joined,date_joined,date_joined
Unnamed: 0_level_1,Unnamed: 1_level_1,count,<lambda_0>,<lambda_1>
employee_id,employee_name,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
123456,Andy,2,1,2
654123,Andy,1,1,1
789456,Beth,0,1,1
963852,Dale,1,1,1
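
The `<lambda_0>` and `<lambda_1>` column names above are hard to read. One alternative, not used in the original cell, is pandas named aggregation, which sets the output names directly. This is a sketch on a made-up frame, and the output names are my own.

```python
import numpy as np
import pandas as pd

# Made-up frame: one id with a repeated date, one id with a missing date.
toy = pd.DataFrame(
    {
        "employee_id": [1, 1, 2],
        "date_joined": ["2015-02-15", "2015-02-15", np.nan],
    }
)

summary = toy.groupby("employee_id").agg(
    non_null=("date_joined", "count"),   # count ignores NaN
    distinct_incl_nan=("date_joined", lambda s: s.nunique(dropna=False)),
    n_rows=("date_joined", "size"),      # size counts every row
)
print(summary)
```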


In [42]:
df.fillna({'date_joined':'2009-12-01'},inplace=True)

In [49]:
df['month'] = df['date_joined'].apply(lambda x: parse(x).month)
df['year'] = df['date_joined'].apply(lambda x: parse(x).year)
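
As an alternative to calling dateutil's parse on every row, pandas can parse the whole column with `pd.to_datetime` and expose the parts through the `.dt` accessor. This is a sketch on a made-up frame, assuming all dates are already filled in.

```python
import pandas as pd

toy = pd.DataFrame({"date_joined": ["2015-02-15", "2009-12-01", "2018-01-15"]})

# Parse the whole column at once, then read parts through the .dt accessor.
joined = pd.to_datetime(toy["date_joined"])
toy["month"] = joined.dt.month
toy["year"] = joined.dt.year
print(toy)
```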

In [51]:
df.drop_duplicates(inplace=True)

In [55]:
df.groupby(['month','year']).agg({'employee_name':'nunique'})

Unnamed: 0_level_0,Unnamed: 1_level_0,employee_name
month,year,Unnamed: 2_level_1
1,2018,1
2,2015,1
5,2017,1
12,2009,1


In [54]:
df

Unnamed: 0,employee_name,employee_id,date_joined,age,yrs_of_experience,month,year
0,Andy,123456,2015-02-15,45,24,2,2015
2,Beth,789456,2009-12-01,36,14,12,2009
3,Andy,654123,2017-05-16,34,14,5,2017
4,Dale,963852,2018-01-15,25,4,1,2018


In [4]:
data[data.CITY=='ARTESIA']

Unnamed: 0,INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP
5591,Angeles Institute,ARTESIA,CA,0.0,0.0,0.0,,,,0.0,96.0,0.0417,0.2917,0.3333


In [19]:
def matt_sum(x):
    # despite the name, this returns the number of rows in the group (nulls included)
    return x.shape[0]

In [22]:
data[data.CITY=='Abilene']

Unnamed: 0,INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP
3393,Abilene Christian University,Abilene,TX,0.0,0.0,0.0,74.0,565.0,558.0,0.0,3655.0,0.6372,0.0876,0.171
3448,Hardin-Simmons University,Abilene,TX,0.0,0.0,0.0,54.0,545.0,540.0,0.0,1670.0,0.6371,0.079,0.1892
3480,McMurry University,Abilene,TX,0.0,0.0,0.0,71.0,505.0,515.0,0.0,1031.0,0.4656,0.1532,0.259
4273,Texas College of Cosmetology-Abilene,Abilene,TX,0.0,0.0,0.0,,,,0.0,73.0,0.5479,0.0822,0.3562
6252,NeeCee's College of Cosmetology,Abilene,TX,0.0,0.0,0.0,,,,0.0,75.0,0.1867,0.3067,0.4133


In [20]:
test = data.groupby(['CITY','MENONLY']).agg({'SATVRMID':matt_sum})

In [21]:
test

Unnamed: 0_level_0,Unnamed: 1_level_0,SATVRMID
CITY,MENONLY,Unnamed: 2_level_1
ARTESIA,0.0,1.0
Aberdeen,0.0,3.0
Abilene,0.0,5.0
Abingdon,0.0,1.0
Abington,0.0,1.0
...,...,...
Yucca Valley,0.0,1.0
Yukon,0.0,1.0
Yuma,0.0,1.0
Zanesville,0.0,3.0


### Basic EDA on data

In [21]:
data.head()

Unnamed: 0,INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP
0,Alabama A & M University,Normal,AL,1.0,0.0,0.0,,425.0,420.0,0.0,4616.0,0.0256,0.9129,0.0076
1,University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,,560.0,575.0,0.0,12047.0,0.5786,0.2626,0.0309
2,Amridge University,Montgomery,AL,0.0,0.0,0.0,74.0,,,1.0,293.0,0.157,0.2355,0.0068
3,University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,,590.0,610.0,0.0,6346.0,0.7148,0.1131,0.0411
4,Alabama State University,Montgomery,AL,1.0,0.0,0.0,,415.0,410.0,0.0,4704.0,0.0138,0.9337,0.0111


In [22]:
data.tail()

Unnamed: 0,INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP
0,Alabama A & M University,Normal,AL,1.0,0.0,0.0,,425.0,420.0,0.0,4616.0,0.0256,0.9129,0.0076
1,University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,,560.0,575.0,0.0,12047.0,0.5786,0.2626,0.0309
2,Amridge University,Montgomery,AL,0.0,0.0,0.0,74.0,,,1.0,293.0,0.157,0.2355,0.0068
3,University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,,590.0,610.0,0.0,6346.0,0.7148,0.1131,0.0411
4,Alabama State University,Montgomery,AL,1.0,0.0,0.0,,415.0,410.0,0.0,4704.0,0.0138,0.9337,0.0111


In [23]:
data.describe()

Unnamed: 0,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP
count,5.0,5.0,5.0,1.0,4.0,4.0,5.0,5.0,5.0,5.0,5.0
mean,0.4,0.0,0.0,74.0,497.5,503.75,0.2,5601.2,0.29796,0.49156,0.0195
std,0.547723,0.0,0.0,,90.415707,103.551517,0.447214,4244.279644,0.326845,0.398195,0.015572
min,0.0,0.0,0.0,74.0,415.0,410.0,0.0,293.0,0.0138,0.1131,0.0068
25%,0.0,0.0,0.0,74.0,422.5,417.5,0.0,4616.0,0.0256,0.2355,0.0076
50%,0.0,0.0,0.0,74.0,492.5,497.5,0.0,4704.0,0.157,0.2626,0.0111
75%,1.0,0.0,0.0,74.0,567.5,583.75,0.0,6346.0,0.5786,0.9129,0.0309
max,1.0,0.0,0.0,74.0,590.0,610.0,1.0,12047.0,0.7148,0.9337,0.0411


In [24]:
data.shape

(5, 14)

### Null values
Ted suggests using **isna** and **notna**, since these names match the methods **dropna** and **fillna**.

In [25]:
data.isna()

Unnamed: 0,INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP
0,False,False,False,False,False,False,True,False,False,False,False,False,False,False
1,False,False,False,False,False,False,True,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,True,True,False,False,False,False,False
3,False,False,False,False,False,False,True,False,False,False,False,False,False,False
4,False,False,False,False,False,False,True,False,False,False,False,False,False,False


This shows that ~isna and notna are the same thing. Note that **equals** is the method to use when comparing two whole data frames, since it returns a single boolean. The method **eq** instead returns a data frame where each cell is True if it equals the corresponding cell in the other data frame.

In [26]:
booled_data = ~data.isna()
booled_data.equals(data.notna())

True
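
A small sketch of the difference between **equals** and **eq**, on a made-up frame with one missing value. The key detail is how each treats NaN.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({"x": [1.0, np.nan], "y": ["p", "q"]})
b = a.copy()

whole_frame_match = a.equals(b)   # single bool; NaNs in the same position count as equal
cellwise = a.eq(b)                # data frame of bools; NaN compared with NaN is False
print(whole_frame_match)
print(cellwise)
```

So a frame with nulls can pass `equals` against its own copy while `eq` still reports False in the null cells.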

**dropna** by default drops every row that contains at least one null value, which in our case is all rows

In [27]:
data.dropna()

Unnamed: 0,INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP
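
Since the default drops everything here, a sketch of the narrower options may help. The frame below is made up, though it mirrors the null pattern of RELAFFIL and SATVRMID above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {"RELAFFIL": [np.nan, 74.0, np.nan], "SATVRMID": [425.0, np.nan, 590.0]}
)

any_null = toy.dropna()                       # default: drop rows with any null
sat_only = toy.dropna(subset=["SATVRMID"])    # only consider nulls in SATVRMID
all_null = toy.dropna(how="all")              # drop rows where every value is null
print(len(any_null), len(sat_only), len(all_null))
```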


**fillna** will fill all null values with a value

In [28]:
data.fillna('test')

Unnamed: 0,INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP
0,Alabama A & M University,Normal,AL,1.0,0.0,0.0,test,425,420,0.0,4616.0,0.0256,0.9129,0.0076
1,University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,test,560,575,0.0,12047.0,0.5786,0.2626,0.0309
2,Amridge University,Montgomery,AL,0.0,0.0,0.0,74,test,test,1.0,293.0,0.157,0.2355,0.0068
3,University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,test,590,610,0.0,6346.0,0.7148,0.1131,0.0411
4,Alabama State University,Montgomery,AL,1.0,0.0,0.0,test,415,410,0.0,4704.0,0.0138,0.9337,0.0111
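
In the output above, filling with the string 'test' appears to turn the numeric SAT columns into mixed-type columns (425 instead of 425.0). A sketch of one way around that, on a made-up frame: pass a dict so each column gets a fill value of its own type. The fill values chosen here are only for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"RELAFFIL": [np.nan, 74.0], "SATVRMID": [425.0, np.nan]})

# One fill value per column: a sentinel for RELAFFIL, the column mean for SATVRMID.
filled = toy.fillna({"RELAFFIL": -1, "SATVRMID": toy["SATVRMID"].mean()})
print(filled)
print(filled.dtypes)
```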


### Operations

When performing operations on data frames, in some cases it is best to use the methods:
* add
* sub and subtract
* mul and multiply
* div, divide and truediv
* pow
* floordiv
* mod

In many cases I think we can use the built-in operators such as '+' and '-', but when subtracting a pandas series from a data frame, for example, we should use the methods above. That is because with '-' the data frame tries to match the series index against its own column names. If they do not match, the result is filled with NaN. With the methods, however, we can specify which axis to align on.

In [29]:
data = data.set_index('INSTNM')

In [30]:
sat = data[['SATVRMID','SATMTMID']]

In [32]:
sat

Unnamed: 0_level_0,SATVRMID,SATMTMID
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1
Alabama A & M University,425.0,420.0
University of Alabama at Birmingham,560.0,575.0
Amridge University,,
University of Alabama in Huntsville,590.0,610.0
Alabama State University,415.0,410.0


Build a pandas series from SATVRMID and try to subtract it from the data frame

In [37]:
satvrmid = sat.iloc[:,0]
satvrmid

INSTNM
Alabama A & M University               425.0
University of Alabama at Birmingham    560.0
Amridge University                       NaN
University of Alabama in Huntsville    590.0
Alabama State University               415.0
Name: SATVRMID, dtype: float64

Using the simple subtraction operator we have an issue

In [38]:
sat - satvrmid

Unnamed: 0_level_0,Alabama A & M University,Alabama State University,Amridge University,SATMTMID,SATVRMID,University of Alabama at Birmingham,University of Alabama in Huntsville
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Alabama A & M University,,,,,,,
University of Alabama at Birmingham,,,,,,,
Amridge University,,,,,,,
University of Alabama in Huntsville,,,,,,,
Alabama State University,,,,,,,


We instead use the sub method to make sure we are subtracting along the right axis

In [40]:
sat.sub(satvrmid, axis='rows')

Unnamed: 0_level_0,SATVRMID,SATMTMID
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1
Alabama A & M University,0.0,-5.0
University of Alabama at Birmingham,0.0,15.0
Amridge University,,
University of Alabama in Huntsville,0.0,20.0
Alabama State University,0.0,-5.0
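
The same contrast in one self-contained sketch, using two made-up schools. Here `axis="index"` does the same job as `axis='rows'` in the cell above.

```python
import pandas as pd

sat_toy = pd.DataFrame(
    {"SATVRMID": [425.0, 560.0], "SATMTMID": [420.0, 575.0]},
    index=["School A", "School B"],
)
verbal = sat_toy["SATVRMID"]

misaligned = sat_toy - verbal                 # aligns the series index with column names
aligned = sat_toy.sub(verbal, axis="index")   # aligns the series index with row labels

print(misaligned)
print(aligned)
```

The operator version ends up with four all-NaN columns (the two score columns plus the two school names), while `sub` gives the per-row differences.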


### Use other built in pandas methods vs. python

In [41]:
sat

Unnamed: 0_level_0,SATVRMID,SATMTMID
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1
Alabama A & M University,425.0,420.0
University of Alabama at Birmingham,560.0,575.0
Amridge University,,
University of Alabama in Huntsville,590.0,610.0
Alabama State University,415.0,410.0


In [43]:
sat.sum()

SATVRMID    1990.0
SATMTMID    2015.0
dtype: float64

In [45]:
sat.min()

SATVRMID    415.0
SATMTMID    410.0
dtype: float64

In [46]:
sat.max()

SATVRMID    590.0
SATMTMID    610.0
dtype: float64

In [48]:
sat.abs()

Unnamed: 0_level_0,SATVRMID,SATMTMID
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1
Alabama A & M University,425.0,420.0
University of Alabama at Birmingham,560.0,575.0
Amridge University,,
University of Alabama in Huntsville,590.0,610.0
Alabama State University,415.0,410.0
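
One reason to prefer the pandas methods over the Python built-ins is how they treat missing values. A short sketch, using the SATVRMID values shown above as a plain series:

```python
import numpy as np
import pandas as pd

scores = pd.Series([425.0, 560.0, np.nan, 590.0, 415.0])

pandas_total = scores.sum()     # skips NaN by default
python_total = sum(scores)      # plain Python loop; the NaN propagates
print(pandas_total, python_total)
```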


### Group by
There are different ways of doing this, but Ted shows the one below to be the most straightforward

In [51]:
data

Unnamed: 0_level_0,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
Alabama A & M University,Normal,AL,1.0,0.0,0.0,,425.0,420.0,0.0,4616.0,0.0256,0.9129,0.0076
University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,,560.0,575.0,0.0,12047.0,0.5786,0.2626,0.0309
Amridge University,Montgomery,AL,0.0,0.0,0.0,74.0,,,1.0,293.0,0.157,0.2355,0.0068
University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,,590.0,610.0,0.0,6346.0,0.7148,0.1131,0.0411
Alabama State University,Montgomery,AL,1.0,0.0,0.0,,415.0,410.0,0.0,4704.0,0.0138,0.9337,0.0111


In [53]:
data.groupby('CITY').agg({'SATVRMID':'sum', 'SATMTMID':'max'})

Unnamed: 0_level_0,SATVRMID,SATMTMID
CITY,Unnamed: 1_level_1,Unnamed: 2_level_1
Birmingham,560.0,575.0
Huntsville,590.0,610.0
Montgomery,415.0,410.0
Normal,425.0,420.0


### Deal with multi-indexing

This shows a dataframe with two levels of indexing for both columns and rows

In [57]:
multi = data.groupby(['CITY','HBCU']).agg({'SATVRMID':['sum','max'], 'SATMTMID':'max'})
multi

Unnamed: 0_level_0,Unnamed: 1_level_0,SATVRMID,SATVRMID,SATMTMID
Unnamed: 0_level_1,Unnamed: 1_level_1,sum,max,max
CITY,HBCU,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
Birmingham,0.0,560.0,560.0,575.0
Huntsville,0.0,590.0,590.0,610.0
Montgomery,0.0,0.0,,
Montgomery,1.0,415.0,415.0,410.0
Normal,1.0,425.0,425.0,420.0


The best way to handle dataframes with multi-indexing is to immediately convert to single indexing. We do this by first renaming the columns

In [58]:
multi.columns = ['Vsum','Vmax','mmax']
multi

Unnamed: 0_level_0,Unnamed: 1_level_0,Vsum,Vmax,mmax
CITY,HBCU,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Birmingham,0.0,560.0,560.0,575.0
Huntsville,0.0,590.0,590.0,610.0
Montgomery,0.0,0.0,,
Montgomery,1.0,415.0,415.0,410.0
Normal,1.0,425.0,425.0,420.0


Reset the index to get rid of multi-indexing on the row level

In [59]:
multi.reset_index()

Unnamed: 0,CITY,HBCU,Vsum,Vmax,mmax
0,Birmingham,0.0,560.0,560.0,575.0
1,Huntsville,0.0,590.0,590.0,610.0
2,Montgomery,0.0,0.0,,
3,Montgomery,1.0,415.0,415.0,410.0
4,Normal,1.0,425.0,425.0,420.0
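
When there are many aggregated columns, typing the new names by hand gets tedious. A sketch of one way to build them from the column tuples, on a made-up frame; the "_" separator is my own choice, not something from Ted's post.

```python
import pandas as pd

# Made-up frame shaped like the scorecard columns used above.
toy = pd.DataFrame(
    {
        "CITY": ["M", "M", "N"],
        "HBCU": [0.0, 1.0, 1.0],
        "SATVRMID": [400.0, 415.0, 425.0],
        "SATMTMID": [390.0, 410.0, 420.0],
    }
)
multi_toy = toy.groupby(["CITY", "HBCU"]).agg(
    {"SATVRMID": ["sum", "max"], "SATMTMID": "max"}
)

# Join each (column, aggregation) tuple into one flat name, then flatten the rows.
multi_toy.columns = ["_".join(pair) for pair in multi_toy.columns]
flat = multi_toy.reset_index()
print(flat)
```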
