# The `pandas` Groupby

**Author:** Marco "Milo" Hemken

## A quick reference guide

This is intended as a quick reference for using the `groupby` method in `pandas`. As of 5 Nov 2017 it is under construction. The examples in this notebook have been shamelessly stolen from Wes McKinney's book, [Python for Data Analysis, Second Edition](http://shop.oreilly.com/product/0636920050896.do). Go there to learn more.

In [2]:
# The maths, graphs, stats and style libs

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from scipy import stats
import matplotlib.style as mplstyle
%matplotlib inline
mplstyle.use('fivethirtyeight')

### First example

Here we create a `DataFrame` object. Think of it as a spreadsheet outside of Excel: a table of rows and named columns, which here hold both numbers and text labels. Once we've created it, we ask Python to display it.

In [3]:
# Create a set of data from random numbers

df = pd.DataFrame({
    'key1': 'a a b b a'.split() * 2,
    'key2': 'one two one two one'.split() * 2,
    'data1': np.random.chisquare(100, 10),  # Creates random data
    'data2': np.random.chisquare(100, 10)  # Creates random data
})

df  # Show the data frame

,data1,data2,key1,key2
0,110.38887,98.631336,a,one
1,99.611873,117.209323,a,two
2,91.901021,90.467677,b,one
3,104.411832,109.43063,b,two
4,101.235708,95.628458,a,one
5,94.796146,113.209483,a,one
6,111.377867,100.08396,a,two
7,109.146204,104.796501,b,one
8,109.811017,86.232764,b,two
9,112.10775,109.112201,a,one


Now we want to see the average of all the `a`'s and `b`'s. So we create a group object. Once we have a group object we can ask it questions like "what's the average?", "what's the median?" etc.
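To make the idea checkable by hand, here is a minimal sketch on a tiny frame with fixed numbers instead of random ones. The name `toy` and its values are made up for illustration and are not from the book.

```python
import pandas as pd

# Hypothetical toy data, small enough to verify by hand
toy = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'data1': [1.0, 2.0, 3.0, 5.0, 6.0],
})

grouped = toy.groupby('key1')          # only records the split, nothing is computed yet
means = grouped['data1'].mean()        # a: (1 + 2 + 6) / 3, b: (3 + 5) / 2
medians = grouped['data1'].median()    # a: middle of 1, 2, 6; b: midpoint of 3 and 5

print(means)
print(medians)
```

The group object is lazy: the actual work happens when you ask it a question such as `mean()` or `median()`.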

In [4]:
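g = df.groupby('key1')[['data1', 'data2']]  # keep the numeric columns; key2 is text and cannot be averaged on recent pandas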

g  # Show the grouped object

<pandas.core.groupby.DataFrameGroupBy object at 0x7f1b3c4a2550>

In [5]:
# What is the average by group?

g.mean()

key1,data1,data2
a,104.919702,105.645794
b,103.817518,97.731893


In [6]:
# What is the median by group?

g.median()

key1,data1,data2
a,105.812289,104.59808
b,106.779018,97.632089


In [7]:
# What is the standard deviation by group?

g.std()

key1,data1,data2
a,7.314629,8.756988
b,8.300061,11.132055


## Multiple layers of grouping?

In [None]:
whos

In [8]:
m = df['data1'].groupby([df['key1'], df['key2']])

In [9]:
m.median()

key1  key2
a     one     105.812289
      two     105.494870
b     one     100.523613
      two     107.111424
Name: data1, dtype: float64

In this summary the labels `one` and `two` each appear twice, stacked under `a` and `b`. That is visually inefficient: with the values in one long column we can't quickly compare the groups side by side...

## And check this out...

In [10]:
m.mean().unstack()

key2,one,two
key1,,
a,104.632119,105.49487
b,100.523613,107.111424


Naturally, this only works nicely with two dimensions. I wonder what happens with three.

In [None]:
df2 = pd.DataFrame({
    'key1': 'a a b b a'.split(),
    'key2': 'one two one two one'.split(),
    'key3': 'fee fi foe foe fum'.split(),
    'data1': np.random.chisquare(100, 5),
    'data2': np.random.chisquare(100, 5),
    'data3': np.random.chisquare(100, 5)
})

In [None]:
df2

In [None]:
t = df2['data1'].groupby([df2['key1'], df2['key2'], df2['key3']])

In [None]:
t.mean()

In [None]:
t.mean().unstack()

Well, I'll be damned: it still behaves nicely. Only the innermost key moves to the columns, though, so the result is not as easy to read as the two-dimensional example.
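A small deterministic sketch of the three-key case, assuming made-up data (the frame `toy` and its values are hypothetical). It shows that a plain `unstack()` moves only the innermost level, and that you can pass a list of level names to move more than one.

```python
import pandas as pd

# Hypothetical data with three grouping keys
toy = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b'],
    'key2': ['one', 'two', 'one', 'two'],
    'key3': ['x', 'y', 'x', 'x'],
    'data1': [1.0, 2.0, 3.0, 4.0],
})

means = toy.groupby(['key1', 'key2', 'key3'])['data1'].mean()

wide = means.unstack()                    # only key3 becomes columns; rows keep (key1, key2)
wider = means.unstack(['key2', 'key3'])   # move two levels at once; rows keep key1 only

print(wide)
print(wider)
```

Combinations that never occur in the data show up as `NaN`, which is part of why the three-key table is harder to read.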

## Group keys

They don't have to be part of the dataframe. They just have to be arrays of the right length.
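As a quick sanity check of that claim, here is a sketch with fixed numbers where the key is a plain NumPy array that lives outside the data (the names `values` and `labels` are made up for this example).

```python
import numpy as np
import pandas as pd

values = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# The key is just an array with one label per row; it is not stored in `values`
labels = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])

by_label = values.groupby(labels).mean()
print(by_label)
```

The only requirement is that the array has one entry per row, in the same order as the rows.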

In [None]:
df = pd.DataFrame({
    'key1': 'a a b b a'.split(),
    'key2': 'one two one two one'.split(),
    'data1': np.random.chisquare(100, 5),  # Creates random data
    'data2': np.random.chisquare(100, 5)  # Creates random data
})

In [None]:
states = np.array('Ohio California California Ohio Ohio'.split())

In [None]:
years = np.array([2005, 2005, 2006, 2005, 2006])

In [None]:
df['data1'].groupby([states, years]).mean()

Wow. I'm amazed. This is too easy.

In [None]:
# But if they are part of the dataframe, there is a shortcut

df.groupby('key1').mean(numeric_only=True)  # key2 is text, so restrict the mean to numeric columns

In [None]:
df.groupby(['key1', 'key2']).mean()

In [None]:
# A useful aggregator is size(), which counts the rows in each group

df.groupby(['key1', 'key2']).size()

## Iterating over groups

In [None]:
# With a single group key

for name, group in df.groupby('key1'):
    print(name)
    print(group.std(numeric_only=True))  # skip the text key columns

In [None]:
# With multiple group keys, the first element is always a tuple

for (k1, k2), group in df.groupby(['key1', 'key2']):
    print((k1, k2))
    print(group.mean(numeric_only=True), '\n')  # skip the text key columns

### Nice recipe here

In [None]:
pieces = dict(list(df.groupby('key1')))

In [None]:
pieces['b']

In [None]:
df

## Axis 1 grouping

In [None]:
df.dtypes

In [None]:
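g = df.groupby(df.dtypes, axis=1)  # note: axis=1 is deprecated in pandas 2.1+, where df.T.groupby(df.dtypes) is the suggested replacement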

In [None]:
for dtype, group in g:
    print(dtype)
    print(group, '\n')

## Selecting a column or subset of columns

In [None]:
# This,

a = df.groupby('key1')['data1']
a

In [None]:
# is the same as this

b = df['data1'].groupby(df['key1'])
b

In [None]:
# check it

print(a.mean(), '\n')
print(b.mean())

In [None]:
# Getting fancy with it

df.groupby(['key1', 'key2'])[['data2']].mean()

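Selecting with a list of column names (even a list of one) gives a grouped `DataFrame`, so the result is a `DataFrame`. Selecting with a single column name gives a grouped `Series`, so the result is a `Series`.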
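A minimal sketch of that difference, using a made-up frame (`toy` is hypothetical) so the return types can be checked directly.

```python
import pandas as pd

toy = pd.DataFrame({
    'key1': ['a', 'a', 'b'],
    'data1': [1.0, 2.0, 3.0],
    'data2': [4.0, 5.0, 6.0],
})

as_series = toy.groupby('key1')['data2'].mean()    # single name -> Series
as_frame = toy.groupby('key1')[['data2']].mean()   # list of names -> DataFrame

print(type(as_series).__name__)
print(type(as_frame).__name__)
```

The numbers are the same either way; only the container differs, which matters when you chain further operations.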

In [None]:
s_grouped = df.groupby(['key1', 'key2'])['data2']

s_grouped

In [None]:
s_grouped.mean()

## Grouping with Dicts and Series

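You can create a mapping of columns. Maybe a few columns are similar and should be aggregated together, but you need something to aggregate them by. A dictionary that maps each column name to a group name does that. And because this is a way of grouping columns, it makes sense that we use `axis=1`. (On pandas 2.1 and later, `axis=1` in `groupby` is deprecated; transposing first, as in `people.T.groupby(mapping)`, is the suggested replacement.) Keys in the mapping that match no column are simply ignored.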
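Here is a small sketch of the dictionary idea with fixed numbers. It assumes a made-up frame called `scores` and uses the transpose route instead of `axis=1`, which is a swapped-in technique rather than what the cells in this notebook do, so that it also runs on recent pandas versions.

```python
import pandas as pd

# Hypothetical data: two people, three columns
scores = pd.DataFrame(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    columns=['a', 'b', 'c'],
    index=['Joe', 'Steve'],
)

# Column name -> group name; 'z' matches no column and is ignored
mapping = {'a': 'red', 'b': 'red', 'c': 'blue', 'z': 'orange'}

# Transpose so columns become rows, group those rows by the mapping, transpose back
by_colour = scores.T.groupby(mapping).sum().T
print(by_colour)
```

Columns `a` and `b` are summed into `red`, and `c` ends up alone in `blue`.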

In [None]:
people = pd.DataFrame(np.random.randn(5, 5),
                     columns='a b c d e'.split(),
                     index='Joe Steve Wes Jim Travis'.split())
people

In [None]:
people.iloc[2:3, [1, 2]] = np.nan

people

In [None]:
mapping = {
    'a': 'red',
    'b': 'red',
    'c': 'blue',
    'd': 'blue',
    'e': 'red',
    'f': 'orange'
}

In [None]:
by_col = people.groupby(mapping, axis=1)

In [None]:
by_col.sum()

In [None]:
map_series = pd.Series(mapping)
map_series

In [None]:
people.groupby(map_series, axis=1).count()

## Grouping with functions

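Ok what??? This is black magic. Pass a function as the group key and `pandas` calls it once per index value, using the return values as the group names. With `len`, the people get grouped by the length of their first names.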
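To take some of the magic out of it, this sketch compares grouping by a function with the explicit version that maps the function over the index first. The frame `toy_people` and its values are made up for illustration.

```python
import pandas as pd

toy_people = pd.DataFrame(
    {'score': [1.0, 2.0, 3.0, 4.0]},
    index=['Joe', 'Steve', 'Wes', 'Jim'],
)

# The function is applied to each index label; the results become the group names
by_len = toy_people.groupby(len).sum()

# The same thing spelled out: compute the name lengths, then group by them
name_lengths = toy_people.index.map(len)
explicit = toy_people.groupby(name_lengths).sum()

print(by_len)
```

Joe, Wes and Jim all have three letters, so they land in group `3`; Steve is alone in group `5`.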

In [None]:
people.index

In [None]:
people.groupby(len).sum()

In [None]:
key_list = 'one one one two two'.split()
key_list

Mix and match:

In [None]:
people.groupby([len, key_list]).min()

## Grouping by index levels

In [None]:
cols = pd.MultiIndex.from_arrays(['US US US JP JP'.split(),
                                  [1, 3, 5, 1, 3]],
                                names=['city', 'tenor'])

In [None]:
hier_df = pd.DataFrame(np.random.randn(4, 5), columns=cols)

hier_df

In [None]:
hier_df.groupby(level='city', axis=1).min()

Here we've created a column index with two levels. We named one level `city` and the other `tenor`, and those are the names we use to refer to them. Passing `level='city'` to `groupby` groups the columns by that level.
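A compact sketch of grouping by a named level. To keep it short it puts the two-level index on the rows of a `Series` rather than on the columns, which is an assumption of this example and not what the cells above do. The name `rates` and its values are made up.

```python
import pandas as pd

idx = pd.MultiIndex.from_arrays(
    [['US', 'US', 'JP', 'JP'], [1, 3, 1, 3]],
    names=['city', 'tenor'],
)
rates = pd.Series([1.0, 2.0, 3.0, 5.0], index=idx)

by_city = rates.groupby(level='city').min()      # one value per city
by_tenor = rates.groupby(level='tenor').mean()   # one value per tenor

print(by_city)
print(by_tenor)
```

Either level can serve as the key, and the other level is aggregated away.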

## Data aggregation

In [None]:
# Quantile is available for Series objects, thus also available for groupby objects

df

In [None]:
g = df.groupby('key1')

g['data1'].quantile(0.9)

### DIY aggregation with the `agg` method

Just write a function that aggregates arrays, then pass it to the grouped object's `agg` method.

In [None]:
def peak_to_peak(arr):
    return arr.max() - arr.min()

In [None]:
g[['data1', 'data2']].agg(peak_to_peak)  # numeric columns only; subtraction is not defined for the text column key2

### Other methods

In [None]:
g.describe()

`describe` is not an aggregation function in the strict sense, since it returns several values per group rather than one. But it still works.

## Column-wise and multiple function application

Here we use the `tips.csv` dataset provided by Wes in the book's GitHub repository.

In [None]:
tips = pd.read_csv('data/tips.csv')

tips.head()

In [None]:
tips['tip_pct'] = tips['tip'] / tips['total_bill']

tips.head(6)

In [None]:
g = tips.groupby(['day', 'smoker'])

In [None]:
g_pct = g['tip_pct']

In [None]:
g_pct.agg('mean')

This is black magic. I swear it's too easy!! I'm not doing any work here!

In [None]:
g_pct.agg(['mean', 'std', peak_to_peak])

But maybe you want different names for the columns?

In [None]:
# You can pass a list of (name, function) tuples

g_pct.agg([('Average', 'mean'), ('Std. Dev', 'std'), ('Range', peak_to_peak)])

In [None]:
funcs = 'count mean max'.split()
funcs

In [None]:
result = g[['tip_pct', 'total_bill']].agg(funcs)  # select several columns with a list, not a bare tuple
result

I swear that's just black magic. Really? All that as a one liner? That line is selecting just two columns from the original dataset. Then it is running three aggregation functions on each of them. And it gives you detail on day of the week and smoker/non-smoker?

Ok maybe that took three lines.

1. Group
1. List of functions
1. Aggregation

But still. Nice.
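The three steps can be checked on a tiny stand-in for the tips data. The frame `bills` below is made up for illustration and is not the real `tips.csv`.

```python
import pandas as pd

# Hypothetical mini version of the tips data
bills = pd.DataFrame({
    'day': ['Fri', 'Fri', 'Sat', 'Sat'],
    'total_bill': [10.0, 20.0, 30.0, 50.0],
    'tip': [1.0, 4.0, 3.0, 10.0],
})
bills['tip_pct'] = bills['tip'] / bills['total_bill']

grouped = bills.groupby('day')                            # 1. group
funcs = ['count', 'mean', 'max']                          # 2. list of functions
summary = grouped[['tip_pct', 'total_bill']].agg(funcs)   # 3. aggregation

print(summary)
print(summary['total_bill'])   # pick out one block of the two-level columns
```

The result has two-level columns: the outer level is the original column name and the inner level is the function name.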

In [None]:
ftuples = [('Durchschnitt', 'mean'), ('Abweichung', np.var)]
ftuples

In [None]:
result = g[['tip_pct', 'total_bill']].agg(ftuples)  # select several columns with a list, not a bare tuple

result

In [None]:
result['tip_pct']

### What happens with a `dict`?

In [None]:
g.agg({
    'tip': np.max,
    'size': 'sum'
})

In [None]:
g.agg({
    'tip_pct': 'min max mean std'.split(),
    'size': 'sum'
})

### Return data with non-hierarchical index

Sometimes the index doesn't need to be fancy.
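A short sketch of the difference, assuming a made-up frame `bills` with only numeric data columns. With `as_index=False` the group keys come back as ordinary columns instead of a hierarchical index.

```python
import pandas as pd

bills = pd.DataFrame({
    'day': ['Fri', 'Fri', 'Sat'],
    'smoker': ['No', 'Yes', 'No'],
    'tip': [1.0, 3.0, 2.0],
})

nested = bills.groupby(['day', 'smoker']).mean()                 # keys in a 2-level index
flat = bills.groupby(['day', 'smoker'], as_index=False).mean()   # keys as plain columns

print(nested)
print(flat)
```

Calling `reset_index()` on the nested result gets you to the same place, so `as_index=False` mostly saves a step.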

In [None]:
tips.groupby(['day', 'smoker'], as_index=False).mean(numeric_only=True)  # skip text columns such as time

## Apply: General split-apply-combine

In [None]:
# Return the n rows with the largest values in `column` (by default, the top five by tip_pct)

def top(df, n=5, column='tip_pct'):
    return df.sort_values(by=column)[-n:]

In [None]:
top(tips, n=6)

### Top `n` rows by group using `apply`

In [None]:
tips.groupby('smoker').apply(top)

In [None]:
# With args

tips.groupby(['smoker', 'day']).apply(top, n=1, column='total_bill')

## Examples

### Describe by group

In [None]:
result = tips.groupby('smoker')['tip_pct'].describe()
result

In [None]:
result.unstack('smoker')

### Suppressing the group keys

In [None]:
tips.groupby('smoker', group_keys=False).apply(top)

### Quantile and bucket analysis

In [None]:
frame = pd.DataFrame({
    'data1': np.random.randn(1000),
    'data2': np.random.randn(1000)
})
frame.head()

In [None]:
quartiles = pd.cut(frame.data1, 4)
quartiles[:10]

In [None]:
def get_stats(group):
    return {
        'min': group.min(),
        'max': group.max(),
        'count': group.count(),
        'mean': group.mean()
    }

In [None]:
g = frame.data2.groupby(quartiles)

In [None]:
g.apply(get_stats).unstack()

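Above, `pd.cut` gives equal-length buckets: each bucket spans the same range of values. Below, `pd.qcut` gives equal-size buckets: each bucket holds the same number of observations.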
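The difference is easiest to see with a skewed set of numbers. This is a minimal sketch with made-up values, where one outlier stretches the range.

```python
import pandas as pd

# Hypothetical data: seven small values and one large outlier
values = pd.Series([0, 1, 2, 3, 4, 5, 6, 100])

equal_width = pd.cut(values, 4)    # four bins of equal width across 0..100
equal_count = pd.qcut(values, 4)   # four bins with the same number of values each

width_counts = equal_width.value_counts(sort=False)
count_counts = equal_count.value_counts(sort=False)

print(width_counts)   # 7 of the 8 values land in the first bin
print(count_counts)   # every bin holds 2 values
```

With `pd.cut` the outlier leaves most bins empty, while `pd.qcut` moves the bin edges so the counts stay balanced.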

In [None]:
quantiles = pd.qcut(frame.data1, 10, labels=False)

In [None]:
g2 = frame.data2.groupby(quantiles)

In [None]:
g2.apply(get_stats).unstack()

### Fill missing values with group specific values

In [None]:
s = pd.Series(np.random.randn(6))
s[::2] = np.nan
s

In [None]:
s.fillna(s.mean())

In [None]:
states = 'Ohio NewYork Vermont Florida Oregon Nevada California Idaho'.split()
states[1] = 'New York'
states

In [None]:
group_key = ['East'] * 4 + ['West'] * 4
group_key

In [None]:
data = pd.Series(np.random.randn(8), index=states)
data

In [None]:
data[['Vermont', 'Nevada', 'Idaho']] = np.nan
data

In [None]:
data.groupby(group_key).mean()

In [None]:
fill_mean = lambda g: g.fillna(g.mean())

In [None]:
data.groupby(group_key).apply(fill_mean)

And maybe we just have the fill value hard coded somewhere...

In [None]:
fill_values = {'East':0.5, 'West':-1}
fill_func = lambda g: g.fillna(fill_values[g.name])

In [None]:
data.groupby(group_key).apply(fill_func)
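Both fills can be checked on a small series with fixed numbers. Note that this sketch swaps in `transform` and `map` instead of `apply`, which is a different technique from the cells above; the names `toy_data` and `region` and all values are made up.

```python
import numpy as np
import pandas as pd

toy_data = pd.Series(
    [1.0, np.nan, 3.0, np.nan, 10.0, 20.0],
    index=['Ohio', 'Vermont', 'Florida', 'Nevada', 'Oregon', 'Idaho'],
)
region = pd.Series(
    ['East', 'East', 'East', 'West', 'West', 'West'],
    index=toy_data.index,
)

# Group means broadcast back to one value per row, then used as fill values
group_means = toy_data.groupby(region).transform('mean')
filled_mean = toy_data.fillna(group_means)

# Hard-coded fill values, looked up per row through the region label
fill_values = {'East': 0.5, 'West': -1.0}
filled_fixed = toy_data.fillna(region.map(fill_values))

print(filled_mean)
print(filled_fixed)
```

Because the fill values are aligned row by row, the result keeps the original index and order.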

### Random sampling and permutation

A French-suited deck of 52 cards in `pandas`, a.k.a. picking random cards.

In [None]:
suits = 'H S C D'.split()
card_val = (list(range(1,11)) + [10] * 3) * 4
base_names = ['A'] + list(range(2,11)) + 'J Q K'.split()
cards = []
for suit in suits:
    cards.extend(str(num) + suit for num in base_names)
deck = pd.Series(card_val, index=cards)
deck[:13]

In [None]:
def draw(deck, n=5):
    return deck.sample(n)

draw(deck)

In [None]:
get_suit = lambda card: card[-1]

deck.groupby(get_suit).apply(draw, n=2)

In [None]:
deck.groupby(get_suit, group_keys=False).apply(draw, n=2)

## Group weighted average and correlation

In [None]:
df = pd.DataFrame({
    'category': 'a a a a b b b b'.split(),
    'data': np.random.randn(8),
    'weights': np.random.rand(8)
})
df

In [None]:
g = df.groupby('category')

get_wavg = lambda g: np.average(g['data'], weights=g['weights'])

In [None]:
g.apply(get_wavg)

### Financial dataset example

In [None]:
close_px = pd.read_csv('data/stock_px_2.csv', parse_dates=True, index_col=0)

close_px.info()

In [None]:
close_px[-4:]

Maybe we do a yearly correlation of daily returns?

In [None]:
rets = close_px.pct_change().dropna()

rets.head()

In [None]:
get_year = lambda x: x.year
by_year = rets.groupby(get_year)

by_year.size()

In [None]:
spx_corr = lambda x: x.corrwith(x['SPX'])
by_year.apply(spx_corr)

In [None]:
# or inter-column correlations

by_year.apply(lambda g: g['AAPL'].corr(g['MSFT']))

### Group-wise linear regression

The below code does not run. The issue is being worked on by the appropriate authorities.

In [None]:
import statsmodels.api as sm

In [None]:
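def regress(data, yvar, xvars):
    # Fit an ordinary least squares model of yvar on xvars plus a constant term
    Y = data[yvar]
    X = data[xvars].copy()  # copy so that adding a column does not modify a slice of the group
    X['intercept'] = 1.
    result = sm.OLS(Y, X).fit()
    return result.params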

In [None]:
by_year.apply(regress, 'AAPL', ['SPX'])

I'll need to look into [statsmodels](http://www.statsmodels.org/dev/index.html).

The chapter ends with the `pivot_table` and `crosstab` methods. Those seem to be built on top of the `groupby` method and are there for convenience. They are tasks that happen often enough to warrant their own methods. I won't go into them now. I'll first get a good grasp of the `groupby` method and then look into those two.
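Until I get to them properly, here is a small sketch of why they feel like convenience wrappers. It assumes a made-up frame `toy` and compares `pivot_table` with the `groupby` plus `unstack` combination used earlier in this notebook, and `crosstab` with a simple count of rows per key pair.

```python
import pandas as pd

toy = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Group means laid out as a key1 x key2 table, two ways
via_groupby = toy.groupby(['key1', 'key2'])['data1'].mean().unstack()
via_pivot = toy.pivot_table(values='data1', index='key1', columns='key2', aggfunc='mean')

# crosstab counts how often each (key1, key2) pair occurs
counts = pd.crosstab(toy['key1'], toy['key2'])

print(via_pivot)
print(counts)
```

On this data the two tables of means hold the same numbers, which is the sense in which `pivot_table` is a shortcut.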

## Rename a single column in one line

In [None]:
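# Build a small example frame (values reconstructed from the output below), then rename column 'two'

df = pd.DataFrame({'one': [1, 2, 3, 4, 5], 'three': list('abcde'), 'two': [9, 8, 7, 6, 5]})
df = df.rename(columns={'two': 'new_name'})
df

   one three  new_name
0    1     a         9
1    2     b         8
2    3     c         7
3    4     d         6
4    5     e         5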