# Week 3: Advanced Python Pandas

# Group by

This function takes one or more column names and splits the dataframe into chunks based on the values in those columns. It returns a dataframe `groupby` object. Iterating over that object yields tuples in which the first item is the group condition and the second item is the dataframe reduced to that group.
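As a minimal sketch of that iteration, assuming a small made-up dataframe (the `state` and `pop` columns and their values are illustrative, not taken from the census file):

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
toy = pd.DataFrame({
    'state': ['A', 'A', 'B'],
    'pop': [10, 20, 30],
})

# Iterating over a groupby object yields (group key, sub-dataframe) pairs.
groups = {key: frame for key, frame in toy.groupby('state')}

for key, frame in groups.items():
    print(key, len(frame))
```

Each `frame` here is an ordinary dataframe holding only the rows for that key, so any dataframe operation can be applied to it.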

Load the census data and exclude the state-level summations.

In [1]:
import pandas as pd
import numpy as np
import os

In [3]:
df = pd.read_csv('../data/census.csv')
df = df[df['SUMLEV']==50]
df.head()

Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
1,50,3,6,1,1,Alabama,Autauga County,54571,54571,54660,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.59227,-2.187333
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183193,...,14.83296,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
3,50,3,6,1,5,Alabama,Barbour County,27457,27457,27341,...,-4.728132,-2.50069,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
4,50,3,6,1,7,Alabama,Bibb County,22915,22919,22861,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861
5,50,3,6,1,9,Alabama,Blount County,57322,57322,57373,...,1.807375,-1.177622,-1.748766,-2.062535,-1.36997,1.859511,-0.84858,-1.402476,-1.577232,-0.884411


* Get a list of the unique state names
* For each state in that list, reduce the dataframe to that state's rows and calculate the average

In [4]:
%%timeit -n 10
for state in df['STNAME'].unique():
    avg = np.average(df.where(df['STNAME']==state).dropna()['CENSUS2010POP'])
    print('Counties in state ' + state + ' have an average population of ' + str(avg))

Counties in state Alabama have an average population of 71339.3432836
Counties in state Alaska have an average population of 24490.7241379
Counties in state Arizona have an average population of 426134.466667
Counties in state Arkansas have an average population of 38878.9066667
Counties in state California have an average population of 642309.586207
Counties in state Colorado have an average population of 78581.1875
Counties in state Connecticut have an average population of 446762.125
Counties in state Delaware have an average population of 299311.333333
Counties in state District of Columbia have an average population of 601723.0
Counties in state Florida have an average population of 280616.567164
Counties in state Georgia have an average population of 60928.6352201
Counties in state Hawaii have an average population of 272060.2
Counties in state Idaho have an average population of 35626.8636364
Counties in state Illinois have an average population of 125790.509804
Co

Try it now with a `groupby` approach and compare the timing. 

In [5]:
%%timeit -n 10
for group, frame in df.groupby('STNAME'):
    avg = np.average(frame['CENSUS2010POP'])
    print('Counties in state ' + group + ' have an average population of ' + str(avg))

Counties in state Alabama have an average population of 71339.3432836
Counties in state Alaska have an average population of 24490.7241379
Counties in state Arizona have an average population of 426134.466667
Counties in state Arkansas have an average population of 38878.9066667
Counties in state California have an average population of 642309.586207
Counties in state Colorado have an average population of 78581.1875
Counties in state Connecticut have an average population of 446762.125
Counties in state Delaware have an average population of 299311.333333
Counties in state District of Columbia have an average population of 601723.0
Counties in state Florida have an average population of 280616.567164
Counties in state Georgia have an average population of 60928.6352201
Counties in state Hawaii have an average population of 272060.2
Counties in state Idaho have an average population of 35626.8636364
Counties in state Illinois have an average population of 125790.509804
Counties in stat

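The `groupby` version is much faster: it splits the dataframe into groups once, whereas the loop above filters the whole dataframe again for every state.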
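The two approaches compute the same averages and differ only in how the rows are selected. A small sketch on made-up data (the values are hypothetical stand-ins for the census dataframe) makes that equivalence concrete:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the census dataframe.
toy = pd.DataFrame({
    'STNAME': ['Alabama', 'Alabama', 'Alaska'],
    'CENSUS2010POP': [100, 300, 50],
})

# Approach 1: filter the whole dataframe once per state.
by_filter = {
    state: np.average(toy[toy['STNAME'] == state]['CENSUS2010POP'])
    for state in toy['STNAME'].unique()
}

# Approach 2: split once with groupby, then average each chunk.
by_groupby = {
    state: np.average(frame['CENSUS2010POP'])
    for state, frame in toy.groupby('STNAME')
}

print(by_filter == by_groupby)
```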

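# Provide a function to `groupby`

Instead of column names, you can pass `groupby` a function. It is called on each index label, and the value it returns is used as the group key. The example below sets the state name as the index and splits the records into three batches by the first letter of the name.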

In [6]:
df = df.set_index('STNAME')

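def fun(item):
    # item is one index label (a state name); bucket it by its first letter.
    if item[0] < 'M':
        return 0  # names starting with A-L
    if item[0] < 'Q':
        return 1  # names starting with M-P
    return 2      # names starting with Q-Z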

for group, frame in df.groupby(fun):
    print('There are ' + str(len(frame)) + ' records in group ' + str(group) + ' for processing.')

There are 1177 records in group 0 for processing.
There are 1134 records in group 1 for processing.
There are 831 records in group 2 for processing.
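The same idea can be checked on a tiny example. This sketch assumes a made-up dataframe indexed by four state names, and `first_letter_bucket` is a hypothetical name for a function with the same logic as `fun` above:

```python
import pandas as pd

# Hypothetical toy data indexed by state name.
toy = pd.DataFrame(
    {'pop': [1, 2, 3, 4]},
    index=['Alabama', 'Montana', 'Nevada', 'Texas'],
)

def first_letter_bucket(label):
    """Map an index label to a bucket number by its first letter."""
    if label[0] < 'M':
        return 0
    if label[0] < 'Q':
        return 1
    return 2

# groupby calls the function once per index label and groups by the result.
sizes = toy.groupby(first_letter_bucket).size()
print(sizes)
```

Montana and Nevada land in bucket 1 because `'M' < 'M'` is false while `'M' < 'Q'` and `'N' < 'Q'` are true.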


The `groupby` object also has a method called `agg`, short for aggregate. It applies a function to the column(s) of data in each group and returns the result. Pass in a dictionary that maps each column name to the function you want to apply to it.

Build a summary dataframe of the average populations per state: Give `agg` a dictionary with the `CENSUS2010POP` key and the `np.average` function. 

In [14]:
# Start the df from scratch
df = pd.read_csv('../data/census.csv')
df = df[df['SUMLEV']==50]

In [16]:
df.groupby('STNAME').agg({'CENSUS2010POP':np.average}).head()

STNAME,CENSUS2010POP
Alabama,71339.343284
Alaska,24490.724138
Arizona,426134.466667
Arkansas,38878.906667
California,642309.586207
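A minimal sketch of the same `agg` call on made-up data. The values are hypothetical, and the string `'mean'` (pandas' built-in name for the arithmetic mean) is used here in place of `np.average`:

```python
import pandas as pd

# Hypothetical toy data standing in for the census dataframe.
toy = pd.DataFrame({
    'STNAME': ['Alabama', 'Alabama', 'Alaska'],
    'CENSUS2010POP': [100, 300, 50],
})

# The dictionary maps a column name to the function applied within each group.
summary = toy.groupby('STNAME').agg({'CENSUS2010POP': 'mean'})
print(summary)
```

The result is a dataframe indexed by the group key, with one row per state.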


# Groupby objects

There are two types of `groupby` objects: The DataFrame groupby and the Series groupby. They behave a little differently with aggregate. 

In [19]:
print(type(df.groupby(level=0)['POPESTIMATE2010']))
print(type(df.groupby(level=0)[['POPESTIMATE2010','POPESTIMATE2011']]))

<class 'pandas.core.groupby.SeriesGroupBy'>
<class 'pandas.core.groupby.DataFrameGroupBy'>


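In the case of the Series groupby, you can apply several functions and they will all be applied to the one column. Note that the dictionary form used here, where the keys are new output labels, relies on older pandas behaviour; pandas 1.0 and later reject it with a `SpecificationError`.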

In [29]:
df.set_index('STNAME').groupby(level=0)['CENSUS2010POP'].agg({'avg':np.average, 'sum':np.sum}).head()

STNAME,sum,avg
Alabama,4779736,71339.343284
Alaska,710231,24490.724138
Arizona,6392017,426134.466667
Arkansas,2915918,38878.906667
California,37253956,642309.586207
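On recent pandas versions, one way to get the same kind of labelled result is named aggregation, where each keyword argument becomes an output column. This is a sketch on made-up data, not the notebook's original call; the label `total` and the toy values are assumptions:

```python
import pandas as pd

# Hypothetical toy data standing in for the census dataframe.
toy = pd.DataFrame({
    'STNAME': ['Alabama', 'Alabama', 'Alaska'],
    'CENSUS2010POP': [100, 300, 50],
})

# Named aggregation: each keyword becomes an output column name,
# and its value names the function applied to the single selected column.
result = (
    toy.set_index('STNAME')
       .groupby(level=0)['CENSUS2010POP']
       .agg(avg='mean', total='sum')
)
print(result)
```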


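If we do the same thing with the DataFrame groupby instead of the Series groupby, it will apply both functions to every column in the DataFrame and return hierarchically-labeled columns. (This label-to-function dictionary also relies on older pandas behaviour that pandas 1.0 and later no longer accept.)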

In [31]:
df.set_index('STNAME').groupby(level=0)[['POPESTIMATE2010','POPESTIMATE2011']].agg({'avg':np.average, 'sum':np.sum}).head()

,sum,sum,avg,avg
STNAME,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2010,POPESTIMATE2011
Alabama,4785161,4801108,71420.313433,71658.328358
Alaska,714021,722720,24621.413793,24921.37931
Arizona,6408208,6468732,427213.866667,431248.8
Arkansas,2922394,2938538,38965.253333,39180.506667
California,37334079,37700034,643691.017241,650000.586207
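A comparable result on recent pandas versions can be sketched by passing a list of functions, which applies every function to every selected column. The toy values are made up, and the column levels come out as (column, function) rather than the (function, column) order shown in the output above:

```python
import pandas as pd

# Hypothetical toy data standing in for the census dataframe.
toy = pd.DataFrame({
    'STNAME': ['Alabama', 'Alabama', 'Alaska'],
    'POPESTIMATE2010': [100, 300, 50],
    'POPESTIMATE2011': [110, 310, 60],
})

# A list of functions is applied to every selected column,
# producing hierarchical (column, function) labels.
result = (
    toy.set_index('STNAME')
       .groupby(level=0)[['POPESTIMATE2010', 'POPESTIMATE2011']]
       .agg(['mean', 'sum'])
)
print(result)
```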


Things get confusing when the keys of the dictionary we pass to `aggregate` match column names in our dataframe. In this case, `pandas` treats each key as a column selector and applies the function only to that column, so the result has flat column labels instead of hierarchically-labeled columns.

In [32]:
df.set_index('STNAME').groupby(level=0)[['POPESTIMATE2010','POPESTIMATE2011']].agg({'POPESTIMATE2010':np.average, 'POPESTIMATE2011':np.sum}).head()

STNAME,POPESTIMATE2011,POPESTIMATE2010
Alabama,4801108,71420.313433
Alaska,722720,24621.413793
Arizona,6468732,427213.866667
Arkansas,2938538,38965.253333
California,37700034,643691.017241
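One way to sidestep the ambiguity between "label" keys and "column" keys is to state both explicitly with named aggregation, pairing each output label with a `(column, function)` tuple. This is a sketch on made-up data; the labels `avg_2010` and `total_2011` are hypothetical:

```python
import pandas as pd

# Hypothetical toy data standing in for the census dataframe.
toy = pd.DataFrame({
    'STNAME': ['Alabama', 'Alabama', 'Alaska'],
    'POPESTIMATE2010': [100, 300, 50],
    'POPESTIMATE2011': [110, 310, 60],
})

# Each keyword is an output label; each tuple says which column
# to read and which function to apply to it.
result = (
    toy.set_index('STNAME')
       .groupby(level=0)[['POPESTIMATE2010', 'POPESTIMATE2011']]
       .agg(
           avg_2010=('POPESTIMATE2010', 'mean'),
           total_2011=('POPESTIMATE2011', 'sum'),
       )
)
print(result)
```

Written this way, the output labels can never be mistaken for column names, and the resulting columns stay flat.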
