In [1]:
from pandas import DataFrame, Series
import pandas as pd
import sys
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

# Group-wise Operations and Transformations

Aggregation is only one kind of group operation. It is a special case in the more general
class of data transformations; that is, it accepts functions that reduce a one-dimensional
array to a scalar value. In this section, I will introduce you to the transform and apply
methods, which will enable you to do many other kinds of group operations.

Suppose, instead, we wanted to add a column to a DataFrame containing the group means
for each row. One way to do this is to aggregate, then merge:

In [2]:
df = DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                'key2' : ['one', 'two', 'one', 'two', 'one'],
                'data1' : np.random.randn(5),
                'data2' : np.random.randn(5)})

In [3]:
df

Unnamed: 0,data1,data2,key1,key2
0,-0.767725,0.174813,a,one
1,0.240928,-0.568615,a,two
2,-0.461411,-0.616863,b,one
3,-0.529614,0.147096,b,two
4,-0.95021,-1.659278,a,one


In [4]:
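# Select the numeric columns explicitly; key2 holds strings and cannot be averaged
k1_means = df.groupby('key1')[['data1', 'data2']].mean().add_prefix('mean_')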

In [10]:
k1_means

key1,mean_data1,mean_data2
a,-0.492336,-0.68436
b,-0.495512,-0.234883


In [6]:
pd.merge(df, k1_means, left_on='key1', right_index=True)

Unnamed: 0,data1,data2,key1,key2,mean_data1,mean_data2
0,-0.767725,0.174813,a,one,-0.492336,-0.68436
1,0.240928,-0.568615,a,two,-0.492336,-0.68436
4,-0.95021,-1.659278,a,one,-0.492336,-0.68436
2,-0.461411,-0.616863,b,one,-0.495512,-0.234883
3,-0.529614,0.147096,b,two,-0.495512,-0.234883


This works, but is somewhat inflexible. You can think of the operation as transforming
the two data columns using the np.mean function. Let’s look back at the people DataFrame
from earlier in the chapter and use the transform method on GroupBy:

In [7]:
key = ['one', 'two', 'one', 'two', 'one']

In [11]:
people = DataFrame(np.random.randn(5, 5),
    columns=['a', 'b', 'c', 'd', 'e'],
    index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])

In [12]:
people.groupby(key).mean()

Unnamed: 0,a,b,c,d,e
one,-0.264299,1.097962,0.161261,0.569196,-0.179642
two,-1.345193,0.63669,0.474942,-1.503753,0.949361


In [13]:
people.groupby(key).transform(np.mean)

Unnamed: 0,a,b,c,d,e
Joe,-0.264299,1.097962,0.161261,0.569196,-0.179642
Steve,-1.345193,0.63669,0.474942,-1.503753,0.949361
Wes,-0.264299,1.097962,0.161261,0.569196,-0.179642
Jim,-1.345193,0.63669,0.474942,-1.503753,0.949361
Travis,-0.264299,1.097962,0.161261,0.569196,-0.179642
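
To tie this back to the aggregate-then-merge approach, here is a minimal sketch on a small hand-made frame (the values are made up for illustration and are not the random data shown above). It suggests that transform can produce the same group-mean columns without a separate merge step.

```python
import pandas as pd

# Hand-made frame; values are illustrative only
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [1.0, 2.0, 3.0, 5.0, 6.0],
                   'data2': [10.0, 20.0, 30.0, 50.0, 60.0]})

# Approach 1: aggregate, then merge the group means back on the key
k1_means = df.groupby('key1')[['data1', 'data2']].mean().add_prefix('mean_')
merged = pd.merge(df, k1_means, left_on='key1', right_index=True).sort_index()

# Approach 2: transform places each group mean at the original row positions
via_transform = (df.groupby('key1')[['data1', 'data2']]
                   .transform('mean')
                   .add_prefix('mean_'))

print(merged[['mean_data1', 'mean_data2']])
print(via_transform)
```

Both printouts should show the same two columns; the transform result is already aligned with the original row order, so it can be assigned straight back onto df.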


As you may guess, transform applies a function to each group, then places the results
in the appropriate locations. If each group produces a scalar value, it will be propagated
(broadcasted). Suppose instead you wanted to subtract the mean value from each
group. To do this, create a demeaning function and pass it to transform:

In [14]:
def demean(arr):
    return arr - arr.mean()

In [15]:
demeaned = people.groupby(key).transform(demean)

In [16]:
demeaned

Unnamed: 0,a,b,c,d,e
Joe,0.657619,1.312195,1.1715,0.192572,-1.55544
Steve,0.27722,-0.029362,-0.798328,0.148042,-1.04929
Wes,-1.233028,-1.670551,0.066493,0.534091,0.991478
Jim,-0.27722,0.029362,0.798328,-0.148042,1.04929
Travis,0.57541,0.358356,-1.237992,-0.726664,0.563962


You can check that demeaned now has zero group means:

In [17]:
demeaned.groupby(key).mean()

Unnamed: 0,a,b,c,d,e
one,-3.700743e-17,7.401487e-17,0.0,3.700743e-17,0.0
two,-1.110223e-16,5.5511150000000004e-17,5.5511150000000004e-17,1.110223e-16,0.0


As you’ll see in the next section, group demeaning can be achieved using apply also.
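
As a quick preview, one possible way to write group demeaning with apply is sketched below. The Series values are made up, and passing group_keys=False plus the final reindex are my own choices to keep the result aligned with the input across pandas versions; treat it as an illustration rather than the only way to do it.

```python
import pandas as pd

# Illustrative values; the names mirror the people index used above
s = pd.Series([1.0, 2.0, 3.0, 4.0, 8.0],
              index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
key = ['one', 'two', 'one', 'two', 'one']

def demean(arr):
    return arr - arr.mean()

# transform guarantees a result shaped like the input
via_transform = s.groupby(key).transform(demean)

# apply can do the same; group_keys=False keeps the original labels,
# and reindex restores the original row order explicitly
via_apply = s.groupby(key, group_keys=False).apply(demean).reindex(s.index)

print(via_transform)
print(via_apply)
```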

## Apply: General split-apply-combine
    
Like aggregate, transform is a more specialized function having rigid requirements: the
passed function must either produce a scalar value to be broadcasted (like np.mean) or
a transformed array of the same size. The most general purpose GroupBy method is
apply, which is the subject of the rest of this section. As in Figure 9-1, apply splits the
object being manipulated into pieces, invokes the passed function on each piece, then
attempts to concatenate the pieces together.
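
The three steps can be spelled out by hand. The sketch below uses a made-up Series and a hypothetical helper named top; it splits by iterating over the groups, applies the function to each piece, and combines with pandas.concat. It is only meant to illustrate the idea, not to describe how apply is implemented internally.

```python
import pandas as pd

# Made-up values standing in for something like tip percentages
s = pd.Series([0.10, 0.25, 0.15, 0.30, 0.20, 0.05])
key = ['No', 'Yes', 'No', 'Yes', 'No', 'Yes']

def top(group, n=2):
    # Largest n values of one piece
    return group.sort_values().iloc[-n:]

# Split and apply: one result per group name
pieces = {name: top(group) for name, group in s.groupby(key)}

# Combine: the dict keys become the outer level of the index
manual = pd.concat(pieces)

# The same thing in one call
builtin = s.groupby(key).apply(top)

print(manual)
print(builtin)
```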

Returning to the tipping data set above, suppose you wanted to select the top five
tip_pct values by group. First, it’s straightforward to write a function that selects the
rows with the largest values in a particular column:

In [27]:
def top(df, n=5, column='tip_pct'):
    return df.sort_values(by=column)[-n:]

In [31]:
tips = pd.read_csv('tips.csv')
tips['tip_pct'] = tips['tip'] / tips['total_bill']

In [32]:
top(tips, n=6)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,tip_pct
109,14.31,4.0,Female,Yes,Sat,Dinner,2,0.279525
183,23.17,6.5,Male,Yes,Sun,Dinner,4,0.280535
232,11.61,3.39,Male,No,Sat,Dinner,2,0.29199
67,3.07,1.0,Female,Yes,Sat,Dinner,1,0.325733
178,9.6,4.0,Female,Yes,Sun,Dinner,2,0.416667
172,7.25,5.15,Male,Yes,Sun,Dinner,2,0.710345


In [33]:
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,tip_pct
0,16.99,1.01,Female,No,Sun,Dinner,2,0.059447
1,10.34,1.66,Male,No,Sun,Dinner,3,0.160542
2,21.01,3.5,Male,No,Sun,Dinner,3,0.166587
3,23.68,3.31,Male,No,Sun,Dinner,2,0.13978
4,24.59,3.61,Female,No,Sun,Dinner,4,0.146808


Now, if we group by smoker, say, and call apply with this function, we get the following:

In [34]:
tips.groupby('smoker').apply(top)

smoker,,total_bill,tip,sex,smoker,day,time,size,tip_pct
No,88,24.71,5.85,Male,No,Thur,Lunch,2,0.236746
No,185,20.69,5.0,Male,No,Sun,Dinner,5,0.241663
No,51,10.29,2.6,Female,No,Sun,Dinner,2,0.252672
No,149,7.51,2.0,Male,No,Thur,Lunch,2,0.266312
No,232,11.61,3.39,Male,No,Sat,Dinner,2,0.29199
Yes,109,14.31,4.0,Female,Yes,Sat,Dinner,2,0.279525
Yes,183,23.17,6.5,Male,Yes,Sun,Dinner,4,0.280535
Yes,67,3.07,1.0,Female,Yes,Sat,Dinner,1,0.325733
Yes,178,9.6,4.0,Female,Yes,Sun,Dinner,2,0.416667
Yes,172,7.25,5.15,Male,Yes,Sun,Dinner,2,0.710345


What has happened here? The top function is called on each piece of the DataFrame,
then the results are glued together using pandas.concat, labeling the pieces with the
group names. The result therefore has a hierarchical index whose inner level contains
index values from the original DataFrame.

If you pass a function to apply that takes other arguments or keywords, you can pass
these after the function:

In [35]:
tips.groupby(['smoker', 'day']).apply(top, n=1, column='total_bill')

smoker,day,,total_bill,tip,sex,smoker,day,time,size,tip_pct
No,Fri,94,22.75,3.25,Female,No,Fri,Dinner,2,0.142857
No,Sat,212,48.33,9.0,Male,No,Sat,Dinner,4,0.18622
No,Sun,156,48.17,5.0,Male,No,Sun,Dinner,6,0.103799
No,Thur,142,41.19,5.0,Male,No,Thur,Lunch,5,0.121389
Yes,Fri,95,40.17,4.73,Male,Yes,Fri,Dinner,4,0.11775
Yes,Sat,170,50.81,10.0,Male,Yes,Sat,Dinner,3,0.196812
Yes,Sun,182,45.35,3.5,Male,Yes,Sun,Dinner,3,0.077178
Yes,Thur,197,43.11,5.0,Female,Yes,Thur,Lunch,4,0.115982


NOTE 

Beyond these basic usage mechanics, getting the most out of apply is
largely a matter of creativity. What occurs inside the function passed is
up to you; it only needs to return a pandas object or a scalar value. The
rest of this chapter will mainly consist of examples showing you how to
solve various problems using groupby.

You may recall that I called describe on a GroupBy object earlier:

In [36]:
result = tips.groupby('smoker')['tip_pct'].describe()

In [37]:
result

smoker,count,mean,std,min,25%,50%,75%,max
No,151.0,0.159328,0.03991,0.056797,0.136906,0.155625,0.185014,0.29199
Yes,93.0,0.163196,0.085119,0.035638,0.106771,0.153846,0.195059,0.710345


Inside GroupBy, when you invoke a method like describe, it is actually just a shortcut
for:

In [41]:
grouped = tips.groupby(['sex', 'smoker'])

In [42]:
f = lambda x: x.describe()

In [43]:
grouped.apply(f)

sex,smoker,,total_bill,tip,size,tip_pct
Female,No,count,54.0,54.0,54.0,54.0
Female,No,mean,18.105185,2.773519,2.592593,0.156921
Female,No,std,7.286455,1.128425,1.073146,0.036421
Female,No,min,7.25,1.0,1.0,0.056797
Female,No,25%,12.65,2.0,2.0,0.139708
Female,No,50%,16.69,2.68,2.0,0.149691
Female,No,75%,20.8625,3.4375,3.0,0.18163
Female,No,max,35.83,5.2,6.0,0.252672
Female,Yes,count,33.0,33.0,33.0,33.0
Female,Yes,mean,17.977879,2.931515,2.242424,0.18215


### Suppressing the group keys

In the examples above, you see that the resulting object has a hierarchical index formed
from the group keys along with the indexes of each piece of the original object. This
can be disabled by passing group_keys=False to groupby:

In [44]:
tips.groupby('smoker', group_keys=False).apply(top)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,tip_pct
88,24.71,5.85,Male,No,Thur,Lunch,2,0.236746
185,20.69,5.0,Male,No,Sun,Dinner,5,0.241663
51,10.29,2.6,Female,No,Sun,Dinner,2,0.252672
149,7.51,2.0,Male,No,Thur,Lunch,2,0.266312
232,11.61,3.39,Male,No,Sat,Dinner,2,0.29199
109,14.31,4.0,Female,Yes,Sat,Dinner,2,0.279525
183,23.17,6.5,Male,Yes,Sun,Dinner,4,0.280535
67,3.07,1.0,Female,Yes,Sat,Dinner,1,0.325733
178,9.6,4.0,Female,Yes,Sun,Dinner,2,0.416667
172,7.25,5.15,Male,Yes,Sun,Dinner,2,0.710345
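
The effect of group_keys on the result index can be checked directly. This is a small sketch on a made-up Series with a hypothetical top2 helper; it only compares the index of the two results.

```python
import pandas as pd

s = pd.Series([3, 1, 4, 2, 5, 9], index=list('abcdef'))
key = ['x', 'y', 'x', 'y', 'x', 'y']

def top2(group):
    # Two largest values of one piece
    return group.sort_values().iloc[-2:]

with_keys = s.groupby(key).apply(top2)
without_keys = s.groupby(key, group_keys=False).apply(top2)

print(with_keys.index.nlevels)     # group key plus original label
print(without_keys.index.nlevels)  # original label only
```

The values are identical in both cases; only the labeling differs, which matters if you want to align the result with the original object afterwards.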


## Quantile and Bucket Analysis

As you may recall from Chapter 7, pandas has some tools, in particular cut and qcut,
for slicing data up into buckets with bins of your choosing or by sample quantiles.
Combining these functions with groupby, it becomes very simple to perform bucket or
quantile analysis on a data set. Consider a simple random data set and an equal-length
bucket categorization using cut:

In [45]:
frame = DataFrame({'data1': np.random.randn(1000),
    'data2': np.random.randn(1000)})

In [46]:
factor = pd.cut(frame.data1, 4)

In [47]:
factor[:10]

0     (0.0542, 1.676]
1      (1.676, 3.298]
2     (0.0542, 1.676]
3     (0.0542, 1.676]
4    (-1.568, 0.0542]
5    (-3.196, -1.568]
6    (-1.568, 0.0542]
7     (0.0542, 1.676]
8    (-1.568, 0.0542]
9    (-1.568, 0.0542]
Name: data1, dtype: category
Categories (4, interval[float64]): [(-3.196, -1.568] < (-1.568, 0.0542] < (0.0542, 1.676] < (1.676, 3.298]]

The Categorical object returned by cut can be passed directly to groupby. So we could compute
a set of statistics for the data2 column like so:

In [48]:
def get_stats(group):
    return {'min': group.min(), 'max': group.max(),
            'count': group.count(), 'mean': group.mean()}

In [49]:
grouped = frame.data2.groupby(factor)

In [50]:
grouped.apply(get_stats).unstack()

data1,count,max,mean,min
"(-3.196, -1.568]",58.0,2.326772,-0.08599,-2.068261
"(-1.568, 0.0542]",457.0,2.757238,0.037039,-2.66297
"(0.0542, 1.676]",429.0,3.056418,0.030285,-2.603012
"(1.676, 3.298]",56.0,3.110348,0.154452,-2.715448


These were equal-length buckets; to compute equal-size buckets based on sample
quantiles, use qcut. I’ll pass labels=False to just get quantile numbers.

In [51]:
# Return quantile numbers
grouping = pd.qcut(frame.data1, 10, labels=False)

In [52]:
grouped = frame.data2.groupby(grouping)

In [53]:
grouped.apply(get_stats).unstack()

data1,count,max,mean,min
0,100.0,2.326772,-0.119873,-2.604194
1,100.0,2.705751,-0.006406,-2.66297
2,100.0,2.757238,0.114154,-2.119333
3,100.0,2.526042,0.002532,-2.031835
4,100.0,2.242366,0.041315,-1.822644
5,100.0,2.064285,0.080511,-2.531153
6,100.0,3.056418,-0.036904,-2.603012
7,100.0,2.336053,0.139888,-2.085136
8,100.0,2.893829,0.077886,-2.415502
9,100.0,3.110348,0.042704,-2.715448
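
The difference between equal-length and equal-size buckets is easier to see on skewed data. The sketch below uses a deterministic, made-up Series (the squares of 1 to 100) rather than the random frame above; observed=False is passed only to keep every bucket in the result.

```python
import numpy as np
import pandas as pd

# Skewed values: 1, 4, 9, ..., 10000
values = pd.Series(np.arange(1, 101) ** 2)

# cut: four intervals of equal width over the range of the data
equal_width = pd.cut(values, 4)
# qcut: four intervals that each hold about the same number of points
equal_count = pd.qcut(values, 4, labels=False)

width_counts = values.groupby(equal_width, observed=False).count()
count_counts = values.groupby(equal_count).count()

print(width_counts)
print(count_counts)
```

With cut the lowest bucket holds half of the observations because small squares are densely packed, while qcut spreads the 100 points evenly across the four buckets.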


## Example: Filling Missing Values with Group-specific Values
    
When cleaning up missing data, in some cases you will filter out data observations
using dropna, but in others you may want to impute (fill in) the NA values using a fixed
value or some value derived from the data. fillna is the right tool to use; for example
here I fill in NA values with the mean:

In [54]:
s = Series(np.random.randn(6))

In [55]:
s[::2] = np.nan

In [56]:
s

0         NaN
1    0.053667
2         NaN
3   -1.606363
4         NaN
5    1.027205
dtype: float64

In [57]:
s.fillna(s.mean())

0   -0.175164
1    0.053667
2   -0.175164
3   -1.606363
4   -0.175164
5    1.027205
dtype: float64

Suppose you need the fill value to vary by group. As you may guess, you need only
group the data and use apply with a function that calls fillna on each data chunk. Here
is some sample data on some US states divided into eastern and western states:

In [58]:
states = ['Ohio', 'New York', 'Vermont', 'Florida',
        'Oregon', 'Nevada', 'California', 'Idaho']

In [59]:
group_key = ['East'] * 4 + ['West'] * 4

In [60]:
data = Series(np.random.randn(8), index=states)

In [61]:
data[['Vermont', 'Nevada', 'Idaho']] = np.nan

In [62]:
data

Ohio          0.395448
New York      1.376014
Vermont            NaN
Florida       1.232475
Oregon       -1.015703
Nevada             NaN
California   -0.515764
Idaho              NaN
dtype: float64

In [63]:
data.groupby(group_key).mean()

East    1.001313
West   -0.765733
dtype: float64

We can fill the NA values using the group means like so:

In [64]:
fill_mean = lambda g: g.fillna(g.mean())

In [65]:
# group_keys=False keeps the original state labels as the index
data.groupby(group_key, group_keys=False).apply(fill_mean)

Ohio          0.395448
New York      1.376014
Vermont       1.001313
Florida       1.232475
Oregon       -1.015703
Nevada       -0.765733
California   -0.515764
Idaho        -0.765733
dtype: float64
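
An alternative that avoids apply is to build the group means with transform and hand them to fillna. This is a sketch with made-up values in place of the random data above; fillna only uses the entries of group_means that line up with missing values.

```python
import numpy as np
import pandas as pd

states = ['Ohio', 'New York', 'Vermont', 'Florida',
          'Oregon', 'Nevada', 'California', 'Idaho']
group_key = ['East'] * 4 + ['West'] * 4

# Illustrative values with the same missing entries as in the text
data = pd.Series([1.0, 2.0, np.nan, 3.0, 4.0, np.nan, 6.0, np.nan],
                 index=states)

# One group mean per row, aligned with data's index
group_means = data.groupby(group_key).transform('mean')

# Fill each NA with the mean of its own group
filled = data.fillna(group_means)
print(filled)
```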

In another case, you might have pre-defined fill values in your code that vary by group.
Since the groups have a name attribute set internally, we can use that:

In [66]:
fill_values = {'East': 0.5, 'West': -1}

In [67]:
fill_func = lambda g: g.fillna(fill_values[g.name])

In [68]:
# group_keys=False keeps the original state labels as the index
data.groupby(group_key, group_keys=False).apply(fill_func)

Ohio          0.395448
New York      1.376014
Vermont       0.500000
Florida       1.232475
Oregon       -1.015703
Nevada       -1.000000
California   -0.515764
Idaho        -1.000000
dtype: float64

## Example: Random Sampling and Permutation
    
Suppose you wanted to draw a random sample (with or without replacement) from a
large dataset for Monte Carlo simulation purposes or some other application. There
are a number of ways to perform the “draws”; some are much more efficient than others.
One way is to select the first K elements of np.random.permutation(N), where N is the
size of your complete dataset and K the desired sample size. As a more fun example,
here’s a way to construct a deck of English-style playing cards:

In [73]:
# Hearts, Spades, Clubs, Diamonds
suits = ['H', 'S', 'C', 'D']
# Blackjack-style values: ace is 1, face cards are 10
card_val = (list(range(1, 11)) + [10] * 3) * 4
base_names = ['A'] + list(range(2, 11)) + ['J', 'K', 'Q']
cards = []
for suit in suits:
    cards.extend(str(num) + suit for num in base_names)

deck = Series(card_val, index=cards)


So now we have a Series of length 52 whose index contains card names and values are
the ones used in blackjack and other games (to keep things simple, I just let the ace be
1):

In [74]:
deck[:13]

AH      1
2H      2
3H      3
4H      4
5H      5
6H      6
7H      7
8H      8
9H      9
10H    10
JH     10
KH     10
QH     10
dtype: int64

Now, based on what I said above, drawing a hand of 5 cards from the deck could be
written as:

In [75]:
def draw(deck, n=5):
    return deck.take(np.random.permutation(len(deck))[:n])

In [76]:
draw(deck)

KC    10
6S     6
3C     3
KS    10
4C     4
dtype: int64
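
For reproducible draws, and for sampling with replacement, a seeded NumPy Generator can be used instead of the global random state. The sketch below is one way to do it; the function names, the seed, and the stand-in population are all made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(12345)  # arbitrary seed, for repeatability

# Stand-in population of 52 items
population = pd.Series(np.arange(100, 152))

def draw_without_replacement(series, k, rng):
    # First k entries of a random permutation: no position repeats
    positions = rng.permutation(len(series))[:k]
    return series.take(positions)

def draw_with_replacement(series, k, rng):
    # Independent uniform positions: repeats are allowed
    positions = rng.integers(0, len(series), size=k)
    return series.take(positions)

hand = draw_without_replacement(population, 5, rng)
resample = draw_with_replacement(population, 200, rng)

print(hand)
print(resample.index.is_unique)
```

Drawing 200 items from 52 with replacement must repeat some positions, so the second printout is False.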

Suppose you wanted two random cards from each suit. Because the suit is the last
character of each card name, we can group based on this and use apply:

In [77]:
get_suit = lambda card: card[-1] # last letter is suit

In [78]:
deck.groupby(get_suit).apply(draw, n=2)

C  AC     1
   9C     9
D  4D     4
   6D     6
H  9H     9
   4H     4
S  QS    10
   3S     3
dtype: int64

In [79]:
# alternatively
deck.groupby(get_suit, group_keys=False).apply(draw, n=2)

QC     10
10C    10
5D      5
7D      7
8H      8
6H      6
AS      1
9S      9
dtype: int64

## Example: Group Weighted Average and Correlation

Under the split-apply-combine paradigm of groupby, operations between columns in a
DataFrame or two Series, such as a group weighted average, become a routine affair. As
an example, take this dataset containing group keys, values, and some weights:

In [80]:
df = DataFrame({'category': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'data': np.random.randn(8),
    'weights': np.random.rand(8)})

In [81]:
df

Unnamed: 0,category,data,weights
0,a,-0.626607,0.141789
1,a,0.118698,0.871032
2,a,-0.741121,0.632667
3,a,-0.099925,0.070714
4,b,0.593732,0.642682
5,b,0.063801,0.956804
6,b,-0.423153,0.531805
7,b,0.397379,0.189412


The group weighted average by category would then be:

In [82]:
grouped = df.groupby('category')

In [83]:
get_wavg = lambda g: np.average(g['data'], weights=g['weights'])

In [84]:
grouped.apply(get_wavg)

category
a   -0.268852
b    0.126194
dtype: float64
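
The weighted average of a group is the sum of weight times value divided by the sum of the weights, so it can also be computed from two grouped sums without calling a Python function per group. A minimal sketch with made-up numbers, cross-checked against np.average:

```python
import numpy as np
import pandas as pd

# Illustrative data; not the random frame shown above
df = pd.DataFrame({'category': ['a', 'a', 'b', 'b'],
                   'data': [1.0, 3.0, 10.0, 20.0],
                   'weights': [1.0, 3.0, 1.0, 1.0]})

# Numerator and denominator as grouped sums
weighted_sum = (df['data'] * df['weights']).groupby(df['category']).sum()
weight_total = df['weights'].groupby(df['category']).sum()
wavg = weighted_sum / weight_total

# Cross-check each group with np.average
check = {name: np.average(g['data'], weights=g['weights'])
         for name, g in df.groupby('category')}

print(wavg)
print(check)
```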

As a less trivial example, consider a data set from Yahoo! Finance containing end-of-day
prices for a few stocks and the S&P 500 index (the SPX ticker):

In [85]:
close_px = pd.read_csv('stock_px.csv', parse_dates=True, index_col=0)

In [87]:
close_px.head()

Unnamed: 0,AAPL,MSFT,XOM,SPX
2003-01-02,7.4,21.11,29.22,909.03
2003-01-03,7.45,21.14,29.24,908.59
2003-01-06,7.45,21.52,29.96,929.01
2003-01-07,7.43,21.93,28.95,922.93
2003-01-08,7.28,21.31,28.83,909.93


In [88]:
close_px[-4:]

Unnamed: 0,AAPL,MSFT,XOM,SPX
2011-10-11,400.29,27.0,76.27,1195.54
2011-10-12,402.19,26.96,77.16,1207.25
2011-10-13,408.43,27.18,76.37,1203.66
2011-10-14,422.0,27.27,78.11,1224.58


One task of interest might be to compute a DataFrame consisting of the yearly correlations
of daily returns (computed from percent changes) with SPX. Here is one way to
do it:

In [89]:
rets = close_px.pct_change().dropna()

In [90]:
spx_corr = lambda x: x.corrwith(x['SPX'])

In [91]:
by_year = rets.groupby(lambda x: x.year)

In [92]:
by_year.apply(spx_corr)

Unnamed: 0,AAPL,MSFT,XOM,SPX
2003,0.541124,0.745174,0.661265,1.0
2004,0.374283,0.588531,0.557742,1.0
2005,0.46754,0.562374,0.63101,1.0
2006,0.428267,0.406126,0.518514,1.0
2007,0.508118,0.65877,0.786264,1.0
2008,0.681434,0.804626,0.828303,1.0
2009,0.707103,0.654902,0.797921,1.0
2010,0.710105,0.730118,0.839057,1.0
2011,0.691931,0.800996,0.859975,1.0


There is, of course, nothing to stop you from computing inter-column correlations:

In [93]:
# Annual correlation of Apple with Microsoft
by_year.apply(lambda g: g['AAPL'].corr(g['MSFT']))

2003    0.480868
2004    0.259024
2005    0.300093
2006    0.161735
2007    0.417738
2008    0.611901
2009    0.432738
2010    0.571946
2011    0.581987
dtype: float64
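
If you do not have stock_px.csv at hand, the same grouping pattern can be tried on synthetic data. In the sketch below the column names X and Y and all values are made up: Y is an increasing linear function of X in 2020 and a decreasing one in 2021, so the yearly correlations should come out close to 1 and -1.

```python
import pandas as pd

idx = pd.to_datetime(['2020-01-02', '2020-01-03', '2020-01-06', '2020-01-07',
                      '2021-01-04', '2021-01-05', '2021-01-06', '2021-01-07'])
x = [1.0, 2.0, 4.0, 3.0, 1.0, 3.0, 2.0, 5.0]
# 2020: y = 2x + 1, 2021: y = -x
y = [3.0, 5.0, 9.0, 7.0, -1.0, -3.0, -2.0, -5.0]
rets = pd.DataFrame({'X': x, 'Y': y}, index=idx)

# Group rows by the year of their timestamp
by_year = rets.groupby(lambda d: d.year)

# One correlation per year
yearly_corr = by_year.apply(lambda g: g['X'].corr(g['Y']))
print(yearly_corr)
```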

## Example: Group-wise Linear Regression

In the same vein as the previous example, you can use groupby to perform more complex
group-wise statistical analysis, as long as the function returns a pandas object or scalar
value. For example, I can define the following regress function (using the statsmodels
econometrics library) which executes an ordinary least squares (OLS) regression
on each chunk of data:

In [96]:
import statsmodels.api as sm

def regress(data, yvar, xvars):
    Y = data[yvar]
    # Copy so that adding the intercept column does not touch the group slice
    X = data[xvars].copy()
    X['intercept'] = 1.
    result = sm.OLS(Y, X).fit()
    return result.params

Now, to run a yearly linear regression of AAPL on SPX returns, I execute:

In [97]:
by_year.apply(regress, 'AAPL', ['SPX'])

Unnamed: 0,SPX,intercept
2003,1.195406,0.00071
2004,1.363463,0.004201
2005,1.766415,0.003246
2006,1.645496,8e-05
2007,1.198761,0.003438
2008,0.968016,-0.00111
2009,0.879103,0.002954
2010,1.052608,0.001261
2011,0.806605,0.001514
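
If statsmodels is not installed, the coefficients alone can be obtained with NumPy's least-squares solver. The sketch below swaps sm.OLS for np.linalg.lstsq, so it returns only the parameters and none of the other regression statistics. The function name regress_np and the synthetic data (an exact line per year) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def regress_np(data, yvar, xvars):
    # Design matrix: the chosen regressors plus a constant column
    X = data[xvars].copy()
    X['intercept'] = 1.0
    coefs, *_ = np.linalg.lstsq(X.to_numpy(), data[yvar].to_numpy(),
                                rcond=None)
    return pd.Series(coefs, index=X.columns)

idx = pd.to_datetime(['2020-01-02', '2020-01-03', '2020-01-06', '2020-01-07',
                      '2021-01-04', '2021-01-05', '2021-01-06', '2021-01-07'])
x = np.array([1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0])
# 2020: y = 2x + 1, 2021: y = -x + 3
y = np.where(idx.year == 2020, 2 * x + 1, -x + 3)
frame = pd.DataFrame({'x': x, 'y': y}, index=idx)

# One row of fitted parameters per year
params = frame.groupby(lambda d: d.year).apply(regress_np, 'y', ['x'])
print(params)
```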
