# The `pandas` Groupby

## A quick reference guide

This is intended as a quick reference for using the `groupby` method in `pandas`. As of 5 Nov 2017 it is under construction. The examples in this notebook have been shamelessly stolen from Wes McKinney's book, [Python for Data Analysis, Second Edition](http://shop.oreilly.com/product/0636920050896.do). Go there to learn more.

In [1]:
# The maths, graphs, stats and style libs

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from scipy import stats
import matplotlib.style as mplstyle
%matplotlib inline
mplstyle.use('fivethirtyeight')

In [2]:
df = pd.DataFrame({
    'key1': 'a a b b a'.split(),
    'key2': 'one two one two one'.split(),
    'data1': np.random.chisquare(100, 5),
    'data2': np.random.chisquare(100, 5)
})

In [3]:
df

        data1       data2 key1 key2
0   84.804940   99.578843    a  one
1   97.931793  106.225475    a  two
2  126.690496  119.118078    b  one
3   83.162669   83.045898    b  two
4   87.952184  109.333221    a  one


In [4]:
g = df['data1'].groupby(df['key1'])

In [5]:
g

<pandas.core.groupby.SeriesGroupBy object at 0x7fc59eb82080>

In [6]:
g.mean()

key1
a     90.229639
b    104.926583
Name: data1, dtype: float64

In [7]:
g.std()

key1
a     6.853369
b    30.778821
Name: data1, dtype: float64

## Multiple layers of grouping?

In [8]:
m = df['data1'].groupby([df['key1'], df['key2']])

In [9]:
m.median()

key1  key2
a     one      86.378562
      two      97.931793
b     one     126.690496
      two      83.162669
Name: data1, dtype: float64

In this summary the words 'one' and 'two' each appear twice. That stacked layout is visually inefficient: the values sit on top of each other, so we can't quickly compare them side by side...

## And check this out...

In [10]:
m.mean().unstack()

key2         one        two
key1
a      86.378562  97.931793
b     126.690496  83.162669


Naturally this would only work nicely with two dimensions. I wonder what happens with three.

In [11]:
df2 = pd.DataFrame({
    'key1': 'a a b b a'.split(),
    'key2': 'one two one two one'.split(),
    'key3': 'fee fi foe foe fum'.split(),
    'data1': np.random.chisquare(100, 5),
    'data2': np.random.chisquare(100, 5),
    'data3': np.random.chisquare(100, 5)
})

In [12]:
df2

        data1       data2       data3 key1 key2 key3
0  106.648497   90.339627  106.185720    a  one  fee
1  106.438256  103.350028  136.174661    a  two   fi
2  104.332531  102.208932   85.825886    b  one  foe
3   94.636823   86.119851  130.668514    b  two  foe
4  130.828863  117.320763  104.294777    a  one  fum


In [13]:
t = df2['data1'].groupby([df2['key1'], df2['key2'], df2['key3']])

In [14]:
t.mean()

key1  key2  key3
a     one   fee     106.648497
            fum     130.828863
      two   fi      106.438256
b     one   foe     104.332531
      two   foe      94.636823
Name: data1, dtype: float64

In [15]:
t.mean().unstack()

key3              fee          fi         foe         fum
key1 key2
a    one   106.648497         NaN         NaN  130.828863
     two          NaN  106.438256         NaN         NaN
b    one          NaN         NaN  104.332531         NaN
     two          NaN         NaN   94.636823         NaN


Well I'll be damned, it still behaves nicely. But it doesn't work as well as the two-dimensional example: only the innermost key moves to the columns, so most cells end up as `NaN`.
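
With more than two keys you can also pick which level gets pivoted. A minimal sketch, using a made-up `toy` frame (not the random `df2`), of `unstack` with and without a level name:

```python
import pandas as pd

# Hypothetical toy data, chosen so the result is easy to check by eye
toy = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b'],
    'key2': ['one', 'two', 'one', 'two'],
    'key3': ['x', 'x', 'y', 'y'],
    'data1': [1.0, 2.0, 3.0, 4.0],
})

means = toy.groupby(['key1', 'key2', 'key3'])['data1'].mean()

# With no argument, unstack() moves the innermost level (key3) to the columns
by_key3 = means.unstack()

# Passing a level name moves that level instead
by_key2 = means.unstack('key2')
print(by_key2)
```

Here `by_key2` has no holes because every `(key1, key3)` pair has both a 'one' and a 'two'. Which level reads best depends on the data.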

## Group keys

Group keys don't have to be columns of the dataframe. Any array of the right length will do.

In [16]:
states = np.array('Ohio California California Ohio Ohio'.split())

In [17]:
years = np.array([2005, 2005, 2006, 2005, 2006])

In [18]:
df['data1'].groupby([states, years]).mean()

California  2005     97.931793
            2006    126.690496
Ohio        2005     83.983804
            2006     87.952184
Name: data1, dtype: float64

Wow. I'm amazed. This is too easy.

In [19]:
# But if they are part of the dataframe, there is a shortcut

df.groupby('key1').mean(numeric_only=True)  # skip the non-numeric key2 column

           data1       data2
key1
a      90.229639  105.045846
b     104.926583  101.081988


In [20]:
df.groupby(['key1', 'key2']).mean()

                data1       data2
key1 key2
a    one    86.378562  104.456032
     two    97.931793  106.225475
b    one   126.690496  119.118078
     two    83.162669   83.045898


In [21]:
# A useful aggregator is size(), which counts the rows in each group

df.groupby(['key1', 'key2']).size()

key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64

## Iterating over groups

In [22]:
# With a single group key

for name, group in df.groupby('key1'):
    print(name)
    print(group.std(numeric_only=True))  # the key columns are strings

a
data1    6.853369
data2    4.983033
dtype: float64
b
data1    30.778821
data2    25.506883
dtype: float64


In [23]:
# With multiple group keys, the first element is always a tuple

for (k1, k2), group in df.groupby(['key1', 'key2']):
    print((k1, k2))
    print(group.mean(numeric_only=True), '\n')  # the key columns are strings

('a', 'one')
data1     86.378562
data2    104.456032
dtype: float64 

('a', 'two')
data1     97.931793
data2    106.225475
dtype: float64 

('b', 'one')
data1    126.690496
data2    119.118078
dtype: float64 

('b', 'two')
data1    83.162669
data2    83.045898
dtype: float64 



### Nice recipe here

In [24]:
pieces = dict(list(df.groupby('key1')))

In [25]:
pieces['b']

        data1       data2 key1 key2
2  126.690496  119.118078    b  one
3   83.162669   83.045898    b  two
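
If only one group is needed, `get_group` is an alternative to building the whole dict. A small sketch on a made-up `toy` frame (the names here are mine, not from the book):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'data1': [1.0, 2.0, 3.0, 4.0, 5.0],
})

grouped = toy.groupby('key1')

# The dict recipe materialises every group up front
pieces = dict(list(grouped))

# get_group pulls out a single group by its key
b_rows = grouped.get_group('b')
print(b_rows)
```

Both routes should hand back the same rows for `'b'`.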


In [26]:
df

        data1       data2 key1 key2
0   84.804940   99.578843    a  one
1   97.931793  106.225475    a  two
2  126.690496  119.118078    b  one
3   83.162669   83.045898    b  two
4   87.952184  109.333221    a  one


## Axis 1 grouping

In [27]:
df.dtypes

data1    float64
data2    float64
key1      object
key2      object
dtype: object

In [28]:
g = df.groupby(df.dtypes, axis=1)

In [29]:
for dtype, group in g:
    print(dtype)
    print(group, '\n')

float64
        data1       data2
0   84.804940   99.578843
1   97.931793  106.225475
2  126.690496  119.118078
3   83.162669   83.045898
4   87.952184  109.333221 

object
  key1 key2
0    a  one
1    a  two
2    b  one
3    b  two
4    a  one 



## Selecting a column or subset of columns

In [30]:
# This,

a = df.groupby('key1')['data1']
a

<pandas.core.groupby.SeriesGroupBy object at 0x7fc59eae9dd8>

In [31]:
# is the same as this

b = df['data1'].groupby(df['key1'])
b

<pandas.core.groupby.SeriesGroupBy object at 0x7fc59eae9470>

In [32]:
# check it

print(a.mean(), '\n')
print(b.mean())

key1
a     90.229639
b    104.926583
Name: data1, dtype: float64 

key1
a     90.229639
b    104.926583
Name: data1, dtype: float64


In [33]:
# Getting fancy with it

df.groupby(['key1', 'key2'])[['data2']].mean()

                data2
key1 key2
a    one   104.456032
     two   106.225475
b    one   119.118078
     two    83.045898


The aggregated result is a `pd.DataFrame` when you select with a list of columns, and a `pd.Series` when you select a single column name.
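
A quick way to convince yourself, sketched on a made-up `toy` frame:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'key1': ['a', 'a', 'b'],
    'data2': [1.0, 3.0, 5.0],
})
grouped = toy.groupby('key1')

as_frame = grouped[['data2']].mean()   # list of columns -> DataFrame
as_series = grouped['data2'].mean()    # single column name -> Series

print(type(as_frame).__name__, type(as_series).__name__)
```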

In [34]:
s_grouped = df.groupby(['key1', 'key2'])['data2']

s_grouped

<pandas.core.groupby.SeriesGroupBy object at 0x7fc59eae97f0>

In [35]:
s_grouped.mean()

key1  key2
a     one     104.456032
      two     106.225475
b     one     119.118078
      two      83.045898
Name: data2, dtype: float64

## Grouping with Dicts and Series

You can create a mapping of columns. Maybe a few columns are similar and should be aggregated together, but you need something to aggregate them by. A dictionary that maps each column name to a group label does that. And because this is a way of grouping columns, it makes sense that we use `axis=1`.

In [36]:
people = pd.DataFrame(np.random.randn(5, 5),
                     columns='a b c d e'.split(),
                     index='Joe Steve Wes Jim Travis'.split())
people

               a         b         c         d         e
Joe     1.470859  0.526744  1.421099  0.288364  0.350974
Steve   0.089474  0.803263 -1.049622  1.105173  0.238681
Wes     0.298881 -0.528112 -0.662961 -0.897172  0.517880
Jim    -0.363227  0.133728 -1.446405 -0.666698  0.689820
Travis -0.309465 -0.875456 -0.124724 -0.983560 -1.641340


In [37]:
people.iloc[2:3, [1, 2]] = np.nan

people

               a         b         c         d         e
Joe     1.470859  0.526744  1.421099  0.288364  0.350974
Steve   0.089474  0.803263 -1.049622  1.105173  0.238681
Wes     0.298881       NaN       NaN -0.897172  0.517880
Jim    -0.363227  0.133728 -1.446405 -0.666698  0.689820
Travis -0.309465 -0.875456 -0.124724 -0.983560 -1.641340


In [38]:
mapping = {
    'a': 'red',
    'b': 'red',
    'c': 'blue',
    'd': 'blue',
    'e': 'red',
    'f': 'orange'
}

In [39]:
by_col = people.groupby(mapping, axis=1)

In [40]:
by_col.sum()

            blue       red
Joe     1.709463  2.348577
Steve   0.055550  1.131418
Wes    -0.897172  0.816761
Jim    -2.113103  0.460321
Travis -1.108283 -2.826260


In [41]:
map_series = pd.Series(mapping)
map_series

a       red
b       red
c      blue
d      blue
e       red
f    orange
dtype: object

In [42]:
people.groupby(map_series, axis=1).count()

        blue  red
Joe        2    3
Steve      2    3
Wes        1    2
Jim        2    3
Travis     2    3
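
As far as I know, recent pandas releases deprecate `axis=1` in `groupby`. A sketch of the same column grouping done by transposing instead, on a small deterministic stand-in for `people` (the numbers are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical deterministic stand-in for the random `people` frame
people = pd.DataFrame(np.arange(10.0).reshape(2, 5),
                      columns=list('abcde'),
                      index=['Joe', 'Steve'])
mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f': 'orange'}

# Transpose so the columns become the index, group those rows by the
# mapping, then transpose back. The unused key 'f' is simply ignored.
by_colour = people.T.groupby(mapping).sum().T
print(by_colour)
```

The double transpose is a bit clunky, but it only relies on ordinary row-wise grouping.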


## Grouping with functions

Ok what??? This is black magic.

In [43]:
people.index

Index(['Joe', 'Steve', 'Wes', 'Jim', 'Travis'], dtype='object')

In [44]:
people.groupby(len).sum()

          a         b         c         d         e
3  1.406514  0.660471 -0.025306 -1.275506  1.558674
5  0.089474  0.803263 -1.049622  1.105173  0.238681
6 -0.309465 -0.875456 -0.124724 -0.983560 -1.641340
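
The magic is less mysterious than it looks: the function is called once per index label, and the return values become the group keys. A sketch with a made-up three-row frame, comparing `groupby(len)` against keys computed by hand:

```python
import numpy as np
import pandas as pd

# Hypothetical small frame indexed by name
people = pd.DataFrame(np.arange(6.0).reshape(3, 2),
                      columns=['a', 'b'],
                      index=['Joe', 'Steve', 'Jim'])

# groupby(len) calls len() on each index label:
# 'Joe' -> 3, 'Steve' -> 5, 'Jim' -> 3
by_func = people.groupby(len).sum()

# The same keys, computed by hand and passed as a plain list
name_lengths = [len(name) for name in people.index]
by_list = people.groupby(name_lengths).sum()

print(by_func)
```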


In [45]:
key_list = 'one one one two two'.split()
key_list

['one', 'one', 'one', 'two', 'two']

Mix and match:

In [46]:
people.groupby([len, key_list]).min()

              a         b         c         d         e
3 one  0.298881  0.526744  1.421099 -0.897172  0.350974
  two -0.363227  0.133728 -1.446405 -0.666698  0.689820
5 one  0.089474  0.803263 -1.049622  1.105173  0.238681
6 two -0.309465 -0.875456 -0.124724 -0.983560 -1.641340


## Grouping by index levels

In [47]:
cols = pd.MultiIndex.from_arrays(['US US US JP JP'.split(),
                                  [1, 3, 5, 1, 3]],
                                names=['city', 'tenor'])

In [48]:
hier_df = pd.DataFrame(np.random.randn(4, 5), columns=cols)

hier_df

city         US                            JP
tenor         1         3         5         1         3
0      0.274526  2.022062  0.698098  0.719362 -0.567175
1      0.214936  0.781465  1.894164 -0.887908 -1.546453
2      0.225761  0.725232  1.912674 -0.844105  0.163861
3     -0.457900 -1.413990 -1.381397  0.341068  0.195141


In [49]:
hier_df.groupby(level='city', axis=1).min()

city         JP        US
0     -0.567175  0.274526
1     -1.546453  0.214936
2     -0.844105  0.225761
3      0.195141 -1.413990


Here we've created a column index with two levels, named `city` and `tenor`. Those names are how we refer to the levels, and the `groupby` call above uses `level='city'` to group on the first one.
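
The `level` keyword works the same way on a row index. A minimal sketch with a made-up Series that reuses the `city` and `tenor` level names:

```python
import pandas as pd

# Hypothetical values on a two-level row index
idx = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                 [1, 3, 5, 1, 3]],
                                names=['city', 'tenor'])
s = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0], index=idx)

by_city = s.groupby(level='city').min()     # one value per city
by_tenor = s.groupby(level='tenor').sum()   # one value per tenor

print(by_city)
print(by_tenor)
```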

## Data aggregation

In [50]:
# Quantile is available for Series objects, thus also available for groupby objects

df

        data1       data2 key1 key2
0   84.804940   99.578843    a  one
1   97.931793  106.225475    a  two
2  126.690496  119.118078    b  one
3   83.162669   83.045898    b  two
4   87.952184  109.333221    a  one


In [51]:
g = df.groupby('key1')

g['data1'].quantile(0.9)

key1
a     95.935871
b    122.337713
Name: data1, dtype: float64

### DIY aggregation with the `agg` method

Just write a function that aggregates arrays, then pass it to the grouped object's `agg` method.

In [52]:
def peak_to_peak(arr):
    return arr.max() - arr.min()

In [53]:
g[['data1', 'data2']].agg(peak_to_peak)  # numeric columns only; key2 holds strings

          data1      data2
key1
a     13.126853   9.754378
b     43.527826  36.072180


### Other methods

In [54]:
g.describe()

     data1
     count        mean        std        min        25%         50%         75%         max
key1
a      3.0   90.229639   6.853369  84.804940  86.378562   87.952184   92.941988   97.931793
b      2.0  104.926583  30.778821  83.162669  94.044626  104.926583  115.808539  126.690496

     data2
     count        mean        std        min         25%         50%         75%         max
key1
a      3.0  105.045846   4.983033  99.578843  102.902159  106.225475  107.779348  109.333221
b      2.0  101.081988  25.506883  83.045898   92.063943  101.081988  110.100033  119.118078


`describe` is not an aggregation function. But it still works.

## Column-wise and multiple function application

Here we use the `tips.csv` dataset that Wes provides in the book's GitHub repository.

In [55]:
tips = pd.read_csv('data/tips.csv')

tips.head()

   total_bill   tip smoker  day    time  size
0       16.99  1.01     No  Sun  Dinner     2
1       10.34  1.66     No  Sun  Dinner     3
2       21.01  3.50     No  Sun  Dinner     3
3       23.68  3.31     No  Sun  Dinner     2
4       24.59  3.61     No  Sun  Dinner     4


In [56]:
tips['tip_pct'] = tips['tip'] / tips['total_bill']

tips.head(6)

   total_bill   tip smoker  day    time  size   tip_pct
0       16.99  1.01     No  Sun  Dinner     2  0.059447
1       10.34  1.66     No  Sun  Dinner     3  0.160542
2       21.01  3.50     No  Sun  Dinner     3  0.166587
3       23.68  3.31     No  Sun  Dinner     2  0.139780
4       24.59  3.61     No  Sun  Dinner     4  0.146808
5       25.29  4.71     No  Sun  Dinner     4  0.186240


In [57]:
g = tips.groupby(['day', 'smoker'])

In [58]:
g_pct = g['tip_pct']

In [59]:
g_pct.agg('mean')

day   smoker
Fri   No        0.151650
      Yes       0.174783
Sat   No        0.158048
      Yes       0.147906
Sun   No        0.160113
      Yes       0.187250
Thur  No        0.160298
      Yes       0.163863
Name: tip_pct, dtype: float64

This is black magic. I swear it's too easy!! I'm not doing any work here!

In [60]:
g_pct.agg(['mean', 'std', peak_to_peak])

                 mean       std  peak_to_peak
day  smoker
Fri  No      0.151650  0.028123      0.067349
     Yes     0.174783  0.051293      0.159925
Sat  No      0.158048  0.039767      0.235193
     Yes     0.147906  0.061375      0.290095
Sun  No      0.160113  0.042347      0.193226
     Yes     0.187250  0.154134      0.644685
Thur No      0.160298  0.038774      0.193350
     Yes     0.163863  0.039389      0.151240


But maybe you want different names for the columns?

In [61]:
# You can pass a list of ('name', func) tuples

g_pct.agg([('Average', 'mean'), ('Std. Dev', 'std'), ('Range', peak_to_peak)])

              Average  Std. Dev     Range
day  smoker
Fri  No      0.151650  0.028123  0.067349
     Yes     0.174783  0.051293  0.159925
Sat  No      0.158048  0.039767  0.235193
     Yes     0.147906  0.061375  0.290095
Sun  No      0.160113  0.042347  0.193226
     Yes     0.187250  0.154134  0.644685
Thur No      0.160298  0.038774  0.193350
     Yes     0.163863  0.039389  0.151240
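
Newer pandas versions (0.25 and later, if I remember right) also accept keyword arguments for the column names. A sketch on made-up data; `toy` and its values are mine, and `peak_to_peak` is redefined so the block runs on its own:

```python
import pandas as pd

def peak_to_peak(arr):
    return arr.max() - arr.min()

# Hypothetical toy data
toy = pd.DataFrame({
    'day': ['Fri', 'Fri', 'Sat', 'Sat'],
    'tip_pct': [0.10, 0.20, 0.15, 0.25],
})

# Each keyword becomes a column name, each value is the function to apply
summary = toy.groupby('day')['tip_pct'].agg(
    Average='mean',
    Range=peak_to_peak,
)
print(summary)
```

Keyword names have to be valid Python identifiers, so a label like 'Std. Dev' still needs the list-of-tuples form.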


In [62]:
funcs = 'count mean max'.split()
funcs

['count', 'mean', 'max']

In [63]:
result = g[['tip_pct', 'total_bill']].agg(funcs)  # select several columns with a list
result

            tip_pct                     total_bill
              count      mean       max      count       mean    max
day  smoker
Fri  No           4  0.151650  0.187735          4  18.420000  22.75
     Yes         15  0.174783  0.263480         15  16.813333  40.17
Sat  No          45  0.158048  0.291990         45  19.661778  48.33
     Yes         42  0.147906  0.325733         42  21.276667  50.81
Sun  No          57  0.160113  0.252672         57  20.506667  48.17
     Yes         19  0.187250  0.710345         19  24.120000  45.35
Thur No          45  0.160298  0.266312         45  17.113111  41.19
     Yes         17  0.163863  0.241255         17  19.190588  43.11


I swear that's just black magic. Really? All that as a one liner? That line is selecting just two columns from the original dataset. Then it is running three aggregation functions on each of them. And it gives you detail on day of the week and smoker/non-smoker?

Ok maybe that took three lines.

1. Group
1. List of functions
1. Aggregation

But still. Nice.
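
The three steps, spelled out on a made-up four-row stand-in for `tips`:

```python
import pandas as pd

# Hypothetical toy data shaped like the tips dataset
toy = pd.DataFrame({
    'day': ['Fri', 'Fri', 'Sat', 'Sat'],
    'smoker': ['No', 'No', 'Yes', 'Yes'],
    'tip_pct': [0.10, 0.20, 0.15, 0.25],
    'total_bill': [10.0, 20.0, 30.0, 50.0],
})

g = toy.groupby(['day', 'smoker'])                  # 1. group
funcs = ['count', 'mean', 'max']                    # 2. list of functions
result = g[['tip_pct', 'total_bill']].agg(funcs)    # 3. aggregation

# The columns come back as (column, function) pairs
print(result.columns.tolist())
```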

In [64]:
ftuples = [('Durchschnitt', 'mean'), ('Abweichung', np.var)]
ftuples

[('Durchschnitt', 'mean'),
 ('Abweichung', <function numpy.core.fromnumeric.var>)]

In [65]:
result = g[['tip_pct', 'total_bill']].agg(ftuples)  # select several columns with a list

result

                 tip_pct              total_bill
            Durchschnitt Abweichung Durchschnitt  Abweichung
day  smoker
Fri  No         0.151650   0.000791    18.420000   25.596333
     Yes        0.174783   0.002631    16.813333   82.562438
Sat  No         0.158048   0.001581    19.661778   79.908965
     Yes        0.147906   0.003767    21.276667  101.387535
Sun  No         0.160113   0.001793    20.506667   66.099980
     Yes        0.187250   0.023757    24.120000  109.046044
Thur No         0.160298   0.001503    17.113111   59.625081
     Yes        0.163863   0.001551    19.190588   69.808518


In [66]:
result['tip_pct']

             Durchschnitt  Abweichung
day  smoker
Fri  No          0.151650    0.000791
     Yes         0.174783    0.002631
Sat  No          0.158048    0.001581
     Yes         0.147906    0.003767
Sun  No          0.160113    0.001793
     Yes         0.187250    0.023757
Thur No          0.160298    0.001503
     Yes         0.163863    0.001551


### What happens with a `dict`?

In [67]:
g.agg({
    'tip': np.max,
    'size': 'sum'
})

               tip  size
day  smoker
Fri  No       3.50     9
     Yes      4.73    31
Sat  No       9.00   115
     Yes     10.00   104
Sun  No       6.00   167
     Yes      6.50    49
Thur No       6.70   112
     Yes      5.00    40


In [68]:
g.agg({
    'tip_pct': 'min max mean std'.split(),
    'size': 'sum'
})

              tip_pct                               size
                  min       max      mean       std  sum
day  smoker
Fri  No      0.120385  0.187735  0.151650  0.028123    9
     Yes     0.103555  0.263480  0.174783  0.051293   31
Sat  No      0.056797  0.291990  0.158048  0.039767  115
     Yes     0.035638  0.325733  0.147906  0.061375  104
Sun  No      0.059447  0.252672  0.160113  0.042347  167
     Yes     0.065660  0.710345  0.187250  0.154134   49
Thur No      0.072961  0.266312  0.160298  0.038774  112
     Yes     0.090014  0.241255  0.163863  0.039389   40


### Return data with non-hierarchical index

Sometimes the index doesn't need to be fancy.

In [69]:
tips.groupby(['day', 'smoker'], as_index=False).mean(numeric_only=True)  # skip the string column `time`

    day smoker  total_bill       tip      size   tip_pct
0   Fri     No   18.420000  2.812500  2.250000  0.151650
1   Fri    Yes   16.813333  2.714000  2.066667  0.174783
2   Sat     No   19.661778  3.102889  2.555556  0.158048
3   Sat    Yes   21.276667  2.875476  2.476190  0.147906
4   Sun     No   20.506667  3.167895  2.929825  0.160113
5   Sun    Yes   24.120000  3.516842  2.578947  0.187250
6  Thur     No   17.113111  2.673778  2.488889  0.160298
7  Thur    Yes   19.190588  3.030000  2.352941  0.163863
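
As far as I can tell, `as_index=False` gives the same flat result as aggregating normally and then calling `reset_index()`. A sketch on made-up data:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'day': ['Fri', 'Fri', 'Sat'],
    'smoker': ['No', 'No', 'Yes'],
    'tip': [1.0, 3.0, 5.0],
})

# Keys stay as ordinary columns
flat = toy.groupby(['day', 'smoker'], as_index=False)['tip'].mean()

# Hierarchical index first, flattened afterwards
also_flat = toy.groupby(['day', 'smoker'])['tip'].mean().reset_index()

print(flat)
```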


## Apply: General split-apply-combine

In [70]:
# Top five values by group

def top(df, n=5, column='tip_pct'):
    return df.sort_values(by=column)[-n:]

In [71]:
top(tips, n=6)

     total_bill   tip smoker  day    time  size   tip_pct
109       14.31  4.00    Yes  Sat  Dinner     2  0.279525
183       23.17  6.50    Yes  Sun  Dinner     4  0.280535
232       11.61  3.39     No  Sat  Dinner     2  0.291990
67         3.07  1.00    Yes  Sat  Dinner     1  0.325733
178        9.60  4.00    Yes  Sun  Dinner     2  0.416667
172        7.25  5.15    Yes  Sun  Dinner     2  0.710345


### Top `n` rows by group using `apply`

In [72]:
tips.groupby('smoker').apply(top)

            total_bill   tip smoker   day    time  size   tip_pct
smoker
No     88        24.71  5.85     No  Thur   Lunch     2  0.236746
       185       20.69  5.00     No   Sun  Dinner     5  0.241663
       51        10.29  2.60     No   Sun  Dinner     2  0.252672
       149        7.51  2.00     No  Thur   Lunch     2  0.266312
       232       11.61  3.39     No   Sat  Dinner     2  0.291990
Yes    109       14.31  4.00    Yes   Sat  Dinner     2  0.279525
       183       23.17  6.50    Yes   Sun  Dinner     4  0.280535
       67         3.07  1.00    Yes   Sat  Dinner     1  0.325733
       178        9.60  4.00    Yes   Sun  Dinner     2  0.416667
       172        7.25  5.15    Yes   Sun  Dinner     2  0.710345


In [73]:
# With args

tips.groupby(['smoker', 'day']).apply(top, n=1, column='total_bill')

                 total_bill    tip smoker   day    time  size   tip_pct
smoker day
No     Fri  94        22.75   3.25     No   Fri  Dinner     2  0.142857
       Sat  212       48.33   9.00     No   Sat  Dinner     4  0.186220
       Sun  156       48.17   5.00     No   Sun  Dinner     6  0.103799
       Thur 142       41.19   5.00     No  Thur   Lunch     5  0.121389
Yes    Fri  95        40.17   4.73    Yes   Fri  Dinner     4  0.117750
       Sat  170       50.81  10.00    Yes   Sat  Dinner     3  0.196812
       Sun  182       45.35   3.50    Yes   Sun  Dinner     3  0.077178
       Thur 197       43.11   5.00    Yes  Thur   Lunch     4  0.115982


## Examples

### Describe by group

In [74]:
result = tips.groupby('smoker')['tip_pct'].describe()
result

        count      mean       std       min       25%       50%       75%       max
smoker
No      151.0  0.159328  0.039910  0.056797  0.136906  0.155625  0.185014  0.291990
Yes      93.0  0.163196  0.085119  0.035638  0.106771  0.153846  0.195059  0.710345


In [75]:
result.unstack('smoker')

       smoker
count  No        151.000000
       Yes        93.000000
mean   No          0.159328
       Yes         0.163196
std    No          0.039910
       Yes         0.085119
min    No          0.056797
       Yes         0.035638
25%    No          0.136906
       Yes         0.106771
50%    No          0.155625
       Yes         0.153846
75%    No          0.185014
       Yes         0.195059
max    No          0.291990
       Yes         0.710345
dtype: float64

### Suppressing the group keys

In [76]:
tips.groupby('smoker', group_keys=False).apply(top)

     total_bill   tip smoker   day    time  size   tip_pct
88        24.71  5.85     No  Thur   Lunch     2  0.236746
185       20.69  5.00     No   Sun  Dinner     5  0.241663
51        10.29  2.60     No   Sun  Dinner     2  0.252672
149        7.51  2.00     No  Thur   Lunch     2  0.266312
232       11.61  3.39     No   Sat  Dinner     2  0.291990
109       14.31  4.00    Yes   Sat  Dinner     2  0.279525
183       23.17  6.50    Yes   Sun  Dinner     4  0.280535
67         3.07  1.00    Yes   Sat  Dinner     1  0.325733
178        9.60  4.00    Yes   Sun  Dinner     2  0.416667
172        7.25  5.15    Yes   Sun  Dinner     2  0.710345


### Quantile and bucket analysis

In [77]:
frame = pd.DataFrame({
    'data1': np.random.randn(1000),
    'data2': np.random.randn(1000)
})
frame.head()

      data1     data2
0 -0.100110 -0.731775
1  0.281648 -1.793507
2  0.960225  0.902989
3 -0.771496 -0.002670
4 -0.359816 -1.482207


In [78]:
quartiles = pd.cut(frame.data1, 4)
quartiles[:10]

0    (-1.521, 0.0717]
1     (0.0717, 1.664]
2     (0.0717, 1.664]
3    (-1.521, 0.0717]
4    (-1.521, 0.0717]
5    (-1.521, 0.0717]
6    (-1.521, 0.0717]
7     (-3.12, -1.521]
8    (-1.521, 0.0717]
9     (0.0717, 1.664]
Name: data1, dtype: category
Categories (4, interval[float64]): [(-3.12, -1.521] < (-1.521, 0.0717] < (0.0717, 1.664] < (1.664, 3.257]]

In [79]:
def get_stats(group):
    return {
        'min': group.min(),
        'max': group.max(),
        'count': group.count(),
        'mean': group.mean()
    }

In [80]:
g = frame.data2.groupby(quartiles)

In [81]:
g.apply(get_stats).unstack()

                  count       max      mean       min
data1
(-3.12, -1.521]    63.0  1.964858 -0.003782 -2.380061
(-1.521, 0.0717]  493.0  3.453817 -0.028346 -2.841952
(0.0717, 1.664]   393.0  2.361924  0.002523 -3.009170
(1.664, 3.257]     51.0  2.168838 -0.002598 -1.918185


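Above are equal-length buckets from `pd.cut`: each interval has the same width, so the counts differ. Below are equal-size buckets from `pd.qcut`: each bucket holds the same number of points.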

In [82]:
quantiles = pd.qcut(frame.data1, 10, labels=False)

In [83]:
g2 = frame.data2.groupby(quantiles)

In [84]:
g2.apply(get_stats).unstack()

       count       max      mean       min
data1
0      100.0  3.203425 -0.033657 -2.380061
1      100.0  2.792355 -0.124342 -2.264983
2      100.0  3.453817  0.079344 -1.981536
3      100.0  2.548706 -0.020325 -2.204968
4      100.0  2.275789 -0.051468 -2.605174
5      100.0  2.026707 -0.046174 -3.009170
6      100.0  1.907233  0.002711 -2.965280
7      100.0  2.226617 -0.170476 -2.775786
8      100.0  2.295722  0.251285 -1.752092
9      100.0  2.361924 -0.020435 -2.580190
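
The difference between the two kinds of buckets is easier to see on a tiny, lopsided example. A sketch with ten made-up values, nine small and one huge:

```python
import numpy as np
import pandas as pd

# Hypothetical values: nine small numbers and one outlier
values = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 100])

# cut: two bins of equal width across the range 0..100
width_bins = pd.cut(values, 2, labels=False)

# qcut: two bins split at the median, so each holds half the points
size_bins = pd.qcut(values, 2, labels=False)

print(np.bincount(width_bins))  # points per equal-width bin
print(np.bincount(size_bins))   # points per equal-size bin
```

With `cut` the outlier ends up alone in its bin. With `qcut` the bins are balanced but their widths are very different.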


### Fill missing values with group specific values

In [85]:
s = pd.Series(np.random.randn(6))
s[::2] = np.nan
s

0         NaN
1    1.604683
2         NaN
3   -0.176565
4         NaN
5   -0.726581
dtype: float64

In [86]:
s.fillna(s.mean())

0    0.233846
1    1.604683
2    0.233846
3   -0.176565
4    0.233846
5   -0.726581
dtype: float64

In [87]:
states = 'Ohio NewYork Vermont Florida Oregon Nevada California Idaho'.split()
states[1] = 'New York'
states

['Ohio',
 'New York',
 'Vermont',
 'Florida',
 'Oregon',
 'Nevada',
 'California',
 'Idaho']

In [88]:
group_key = ['East'] * 4 + ['West'] * 4
group_key

['East', 'East', 'East', 'East', 'West', 'West', 'West', 'West']

In [89]:
data = pd.Series(np.random.randn(8), index=states)
data

Ohio          1.173883
New York     -0.599441
Vermont      -1.340648
Florida      -0.561924
Oregon        0.663556
Nevada        0.345004
California   -0.309906
Idaho         1.710531
dtype: float64

In [90]:
data[['Vermont', 'Nevada', 'Idaho']] = np.nan
data

Ohio          1.173883
New York     -0.599441
Vermont            NaN
Florida      -0.561924
Oregon        0.663556
Nevada             NaN
California   -0.309906
Idaho              NaN
dtype: float64

In [91]:
data.groupby(group_key).mean()

East    0.004172
West    0.176825
dtype: float64

In [92]:
fill_mean = lambda g: g.fillna(g.mean())

In [93]:
data.groupby(group_key).apply(fill_mean)

Ohio          1.173883
New York     -0.599441
Vermont       0.004172
Florida      -0.561924
Oregon        0.663556
Nevada        0.176825
California   -0.309906
Idaho         0.176825
dtype: float64

And maybe we just have the fill value hard coded somewhere...

In [94]:
fill_values = {'East':0.5, 'West':-1}
fill_func = lambda g: g.fillna(fill_values[g.name])

In [95]:
data.groupby(group_key).apply(fill_func)

Ohio          1.173883
New York     -0.599441
Vermont       0.500000
Florida      -0.561924
Oregon        0.663556
Nevada       -1.000000
California   -0.309906
Idaho        -1.000000
dtype: float64
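
Another way to fill with group means, without `apply`, is `transform`. A sketch on made-up numbers (the states are reused from above, the values are not):

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per region
data = pd.Series([1.0, np.nan, 3.0, 10.0, np.nan, 30.0],
                 index=['Ohio', 'Vermont', 'Florida',
                        'Oregon', 'Nevada', 'Idaho'])
group_key = ['East'] * 3 + ['West'] * 3

# transform('mean') returns a Series aligned with `data`, holding each
# row's group mean (NaNs are skipped when averaging)
group_means = data.groupby(group_key).transform('mean')
filled = data.fillna(group_means)
print(filled)
```

Because `transform` keeps the original index, the result lines up with `data` directly.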

### Random sampling and permutation

A French deck with `pandas`. Aka, picking random cards.

In [96]:
suits = 'H S C D'.split()
card_val = (list(range(1,11)) + [10] * 3) * 4
base_names = ['A'] + list(range(2,11)) + 'J Q K'.split()
cards = []
for suit in suits:
    cards.extend(str(num) + suit for num in base_names)
deck = pd.Series(card_val, index=cards)
deck[:13]

AH      1
2H      2
3H      3
4H      4
5H      5
6H      6
7H      7
8H      8
9H      9
10H    10
JH     10
QH     10
KH     10
dtype: int64

In [97]:
def draw(deck, n=5):
    return deck.sample(n)

draw(deck)

QS     10
JC     10
10D    10
5D      5
6H      6
dtype: int64

In [98]:
get_suit = lambda card: card[-1]

deck.groupby(get_suit).apply(draw, n=2)

C  AC      1
   3C      3
D  4D      4
   KD     10
H  2H      2
   9H      9
S  JS     10
   10S    10
dtype: int64

In [99]:
deck.groupby(get_suit, group_keys=False).apply(draw, n=2)

2C     2
8C     8
8D     8
AD     1
9H     9
KH    10
KS    10
6S     6
dtype: int64

## Group weighted average and correlation

In [100]:
df = pd.DataFrame({
    'category': 'a a a a b b b b'.split(),
    'data': np.random.randn(8),
    'weights': np.random.rand(8)
})
df

  category      data   weights
0        a  2.103936  0.294371
1        a  0.947480  0.419243
2        a  0.688057  0.971892
3        a  0.773716  0.252345
4        b -0.759041  0.808385
5        b  0.121176  0.914567
6        b -0.315135  0.637059
7        b -0.745589  0.553597


In [101]:
g = df.groupby('category')

get_wavg = lambda g: np.average(g['data'], weights=g['weights'])

In [102]:
g.apply(get_wavg)

category
a    0.970416
b   -0.383129
dtype: float64
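
For reference, the weighted average is just sum(weight × value) / sum(weight). A sketch that checks `np.average` against the formula on made-up numbers, looping over the groups so each step is visible:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with easy-to-check weights
toy = pd.DataFrame({
    'category': ['a', 'a', 'b', 'b'],
    'data': [1.0, 3.0, 10.0, 20.0],
    'weights': [1.0, 3.0, 1.0, 1.0],
})

def wavg_by_hand(group):
    # sum(w * x) / sum(w)
    return (group['data'] * group['weights']).sum() / group['weights'].sum()

results = {}
for name, group in toy.groupby('category'):
    results[name] = (np.average(group['data'], weights=group['weights']),
                     wavg_by_hand(group))
print(results)
```

For category `a` that is (1·1 + 3·3) / (1 + 3) = 2.5, and for `b` it is (10 + 20) / 2 = 15.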

### Financial dataset example

In [133]:
close_px = pd.read_csv('data/stock_px_2.csv', parse_dates=True, index_col=0)

close_px.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2214 entries, 2003-01-02 to 2011-10-14
Data columns (total 4 columns):
AAPL    2214 non-null float64
MSFT    2214 non-null float64
XOM     2214 non-null float64
SPX     2214 non-null float64
dtypes: float64(4)
memory usage: 86.5 KB


In [134]:
close_px[-4:]

              AAPL   MSFT    XOM      SPX
2011-10-11  400.29  27.00  76.27  1195.54
2011-10-12  402.19  26.96  77.16  1207.25
2011-10-13  408.43  27.18  76.37  1203.66
2011-10-14  422.00  27.27  78.11  1224.58


Maybe we do a yearly correlation of daily returns?

In [135]:
rets = close_px.pct_change().dropna()

rets.head()

                AAPL      MSFT       XOM       SPX
2003-01-03  0.006757  0.001421  0.000684 -0.000484
2003-01-06  0.000000  0.017975  0.024624  0.022474
2003-01-07 -0.002685  0.019052 -0.033712 -0.006545
2003-01-08 -0.020188 -0.028272 -0.004145 -0.014086
2003-01-09  0.008242  0.029094  0.021159  0.019386


In [136]:
get_year = lambda x: x.year
by_year = rets.groupby(get_year)

by_year.size()

2003    251
2004    252
2005    252
2006    251
2007    251
2008    253
2009    252
2010    252
2011    199
dtype: int64

In [137]:
spx_corr = lambda x: x.corrwith(x['SPX'])
by_year.apply(spx_corr)

          AAPL      MSFT       XOM  SPX
2003  0.541124  0.745174  0.661265  1.0
2004  0.374283  0.588531  0.557742  1.0
2005  0.467540  0.562374  0.631010  1.0
2006  0.428267  0.406126  0.518514  1.0
2007  0.508118  0.658770  0.786264  1.0
2008  0.681434  0.804626  0.828303  1.0
2009  0.707103  0.654902  0.797921  1.0
2010  0.710105  0.730118  0.839057  1.0
2011  0.691931  0.800996  0.859975  1.0


In [139]:
# or inter-column correlations

by_year.apply(lambda g: g['AAPL'].corr(g['MSFT']))

2003    0.480868
2004    0.259024
2005    0.300093
2006    0.161735
2007    0.417738
2008    0.611901
2009    0.432738
2010    0.571946
2011    0.581987
dtype: float64

### Group-wise linear regression

In [141]:
import statsmodels.api as sm



In [142]:
def regress(data, yvar, xvars):
    Y = data[yvar]
    X = data[xvars].copy()  # copy, so adding a column doesn't write into a slice of the group
    X['intercept'] = 1.
    result = sm.OLS(Y, X).fit()
    return result.params

In [143]:
by_year.apply(regress, 'AAPL', ['SPX'])

           SPX  intercept
2003  1.195406   0.000710
2004  1.363463   0.004201
2005  1.766415   0.003246
2006  1.645496   0.000080
2007  1.198761   0.003438
2008  0.968016  -0.001110
2009  0.879103   0.002954
2010  1.052608   0.001261
2011  0.806605   0.001514


I'll need to look into [statsmodels](http://www.statsmodels.org/dev/index.html).

The chapter ends with the `pivot_table` and `crosstab` methods. Those seem to be built on top of the `groupby` method and are there for convenience. They are tasks that happen often enough to warrant their own methods. I won't go into them now. I'll first get a good grasp of the `groupby` method and then look into those two.
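
I haven't checked how `pivot_table` is implemented, but on a small made-up frame it gives the same numbers as a `groupby` followed by `unstack`, which fits the idea that it is a convenience wrapper:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'day': ['Fri', 'Fri', 'Sat', 'Sat'],
    'smoker': ['No', 'Yes', 'No', 'Yes'],
    'tip': [1.0, 2.0, 3.0, 4.0],
})

via_pivot = toy.pivot_table(values='tip', index='day',
                            columns='smoker', aggfunc='mean')
via_groupby = toy.groupby(['day', 'smoker'])['tip'].mean().unstack()

print(via_pivot)
print(np.allclose(via_pivot.values, via_groupby.values))
```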

## Rename a single column in one line

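In [None]:
# Rename one column with an {old: new} mapping; the other columns are untouched.
# The example frame is reconstructed from the output shown below.

df = pd.DataFrame({'one': [1, 2, 3, 4, 5],
                   'three': list('abcde'),
                   'two': [9, 8, 7, 6, 5]})
df = df.rename(columns={'two': 'new_name'})
df

  one three  new_name
0    1     a         9
1    2     b         8
2    3     c         7
3    4     d         6
4    5     e         5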