<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Data-aggregation-and-group-operations" data-toc-modified-id="Data-aggregation-and-group-operations-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Data aggregation and group operations</a></span><ul class="toc-item"><li><span><a href="#GroupBy-mechanics" data-toc-modified-id="GroupBy-mechanics-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>GroupBy mechanics</a></span><ul class="toc-item"><li><span><a href="#Iterating-over-groups" data-toc-modified-id="Iterating-over-groups-1.1.1"><span class="toc-item-num">1.1.1&nbsp;&nbsp;</span>Iterating over groups</a></span></li><li><span><a href="#Selecting-a-column-or-subset-of-columns" data-toc-modified-id="Selecting-a-column-or-subset-of-columns-1.1.2"><span class="toc-item-num">1.1.2&nbsp;&nbsp;</span>Selecting a column or subset of columns</a></span></li><li><span><a href="#Grouping-with-dicts-and-series" data-toc-modified-id="Grouping-with-dicts-and-series-1.1.3"><span class="toc-item-num">1.1.3&nbsp;&nbsp;</span>Grouping with dicts and series</a></span></li><li><span><a href="#Grouping-with-functions" data-toc-modified-id="Grouping-with-functions-1.1.4"><span class="toc-item-num">1.1.4&nbsp;&nbsp;</span>Grouping with functions</a></span></li><li><span><a href="#Grouping-by-index-level" data-toc-modified-id="Grouping-by-index-level-1.1.5"><span class="toc-item-num">1.1.5&nbsp;&nbsp;</span>Grouping by index level</a></span></li></ul></li><li><span><a href="#Data-aggregation" data-toc-modified-id="Data-aggregation-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Data aggregation</a></span><ul class="toc-item"><li><span><a href="#Column-wise-and-multiple-function-application" data-toc-modified-id="Column-wise-and-multiple-function-application-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>Column-wise and multiple function application</a></span></li><li><span><a href="#Returning-aggregated-data-without-row-indexes" 
data-toc-modified-id="Returning-aggregated-data-without-row-indexes-1.2.2"><span class="toc-item-num">1.2.2&nbsp;&nbsp;</span>Returning aggregated data without row indexes</a></span></li></ul></li><li><span><a href="#Apply:-General-split-apply-combine" data-toc-modified-id="Apply:-General-split-apply-combine-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Apply: General split-apply-combine</a></span><ul class="toc-item"><li><span><a href="#Suppressing-the-group-keys" data-toc-modified-id="Suppressing-the-group-keys-1.3.1"><span class="toc-item-num">1.3.1&nbsp;&nbsp;</span>Suppressing the group keys</a></span></li><li><span><a href="#Quantile-and-bucket-analysis" data-toc-modified-id="Quantile-and-bucket-analysis-1.3.2"><span class="toc-item-num">1.3.2&nbsp;&nbsp;</span>Quantile and bucket analysis</a></span></li><li><span><a href="#Example:-filling-missing-values-with-group-specific-values" data-toc-modified-id="Example:-filling-missing-values-with-group-specific-values-1.3.3"><span class="toc-item-num">1.3.3&nbsp;&nbsp;</span>Example: filling missing values with group-specific values</a></span></li><li><span><a href="#Example:-random-sampling-and-permutation" data-toc-modified-id="Example:-random-sampling-and-permutation-1.3.4"><span class="toc-item-num">1.3.4&nbsp;&nbsp;</span>Example: random sampling and permutation</a></span></li><li><span><a href="#Example:-group-weighted-average-and-correlation" data-toc-modified-id="Example:-group-weighted-average-and-correlation-1.3.5"><span class="toc-item-num">1.3.5&nbsp;&nbsp;</span>Example: group weighted average and correlation</a></span></li><li><span><a href="#Example:-group-wise-linear-regression" data-toc-modified-id="Example:-group-wise-linear-regression-1.3.6"><span class="toc-item-num">1.3.6&nbsp;&nbsp;</span>Example: group-wise linear regression</a></span></li></ul></li><li><span><a href="#Pivot-tables-and-cross-tabulation" data-toc-modified-id="Pivot-tables-and-cross-tabulation-1.4"><span 
class="toc-item-num">1.4&nbsp;&nbsp;</span>Pivot tables and cross-tabulation</a></span><ul class="toc-item"><li><span><a href="#Crosstab" data-toc-modified-id="Crosstab-1.4.1"><span class="toc-item-num">1.4.1&nbsp;&nbsp;</span>Crosstab</a></span></li></ul></li></ul></li></ul></div>

In [4]:
import numpy as np
import pandas as pd

# Data aggregation and group operations

- After loading, merging, cleaning a dataset
    - compute group statistics
    
- pandas (like SQL) has flexible operations for joining, filtering, aggregating data

## GroupBy mechanics

- Group operations are also called `split-apply-combine`
    
    1. split data (from DataFrame or Series) into groups:
        - based on certain keys
        - along rows or columns
    2. apply a function to each group producing a new value
        - E.g., sum()
    3. combine the results into a series / df object
    
- Grouping can happen in many ways:
    - list or array with values encoding the groups (same length as
      the axis being grouped)
    - a dict or a Series giving the correspondence between values
      on the axes and group names
    - the name of the column to be used for the split
    - a function invoked on the index or on the rows / columns
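
The three steps above can be sketched by hand on a tiny frame. This is only an illustration: the `toy` data and the names `pieces`, `sums`, `manual`, and `auto` are made up for the example and do not appear elsewhere in these notes.

```python
import pandas as pd

# Hypothetical toy data, not taken from the notebook.
toy = pd.DataFrame({
    'key': ['a', 'b', 'a', 'b', 'a'],
    'value': [1.0, 2.0, 3.0, 4.0, 5.0],
})

# 1. Split: one sub-Series of values per distinct key.
pieces = {k: toy.loc[toy['key'] == k, 'value'] for k in toy['key'].unique()}

# 2. Apply: reduce each piece to a single number.
sums = {k: piece.sum() for k, piece in pieces.items()}

# 3. Combine: assemble the per-group results into one Series.
manual = pd.Series(sums, name='value').sort_index()
manual.index.name = 'key'

# groupby() performs the same three steps in one call.
auto = toy.groupby('key')['value'].sum()

print(manual)
print(auto)
```

Both printed Series should hold the same per-key sums; the manual version is only meant to make the mechanics visible.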

In [33]:
np.random.seed(10)

df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': np.random.randn(5),
    'data2': np.random.randn(5)
})

df

Unnamed: 0,key1,key2,data1,data2
0,a,one,1.331587,-0.720086
1,a,two,0.715279,0.265512
2,b,one,-1.5454,0.108549
3,b,two,-0.008384,0.004291
4,a,one,0.621336,-0.1746


In [36]:
# 1)
# - We want to compute the mean of the values in "data1" grouping by values of "key1"
# - groupby() computes the mapping between keys of the groups and rows of the dataframe
grouped = df['data1'].groupby(df['key1'])

# We are grouping a Series since we have a single column.
grouped

<pandas.core.groupby.generic.SeriesGroupBy object at 0x110a3e048>

In [37]:
grouped2 = df[['data1', 'data2']].groupby(df['key1'])

# We are grouping a DataFrame since we have two columns.
grouped2

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x110a3e1d0>

In [38]:
# For each value of the group we compute the mean of the corresponding rows.
grouped.mean()

key1
a    0.889400
b   -0.776892
Name: data1, dtype: float64

In [39]:
# If we group by 2 keys, we end up with a hierarchical index Series.
means = df['data1'].groupby([df['key1'], df['key2']]).mean()

means

key1  key2
a     one     0.976461
      two     0.715279
b     one    -1.545400
      two    -0.008384
Name: data1, dtype: float64

In [9]:
means.unstack()

key2,one,two
key1,Unnamed: 1_level_1,Unnamed: 2_level_1
a,0.976461,0.715279
b,-1.5454,-0.008384


In [10]:
df['data1']

0    1.331587
1    0.715279
2   -1.545400
3   -0.008384
4    0.621336
Name: data1, dtype: float64

In [40]:
# 2)
# We can also use arrays to infer the groups, as long as the size is
# the same as the number of rows.
states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])
years = np.array([2005, 2005, 2006, 2005, 2006])

print(len(states), len(years), len(df['data1']))

df['data1'].groupby([states, years]).mean()

5 5 5


California  2005    0.715279
            2006   -1.545400
Ohio        2005    0.661601
            2006    0.621336
Name: data1, dtype: float64

In [12]:
# 3)
# The grouping information can be stored in the same data frame as the data
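# df['key2'] is not numeric, so it must be left out of the mean.
# numeric_only=True does this explicitly (recent pandas versions raise a
# TypeError on the string column otherwise).
df.groupby('key1').mean(numeric_only=True)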

Unnamed: 0_level_0,data1,data2
key1,Unnamed: 1_level_1,Unnamed: 2_level_1
a,0.8894,-0.209725
b,-0.776892,0.05642


In [13]:
# Group by 2 keys the entire df.
df.groupby(['key1', 'key2']).mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,data1,data2
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,one,0.976461,-0.447343
a,two,0.715279,0.265512
b,one,-1.5454,0.108549
b,two,-0.008384,0.004291


In [41]:
# We can count the number of rows in each group with size().
# Note that rows whose group key is NaN are excluded by default.
df.groupby(['key1', 'key2']).size()

key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64
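
The difference between `size()` and `count()` when data is missing is easy to mix up, so here is a minimal sketch. The `toy` frame is hypothetical, and the `dropna=False` option assumes a reasonably recent pandas (it was added in 1.1, to the best of my knowledge).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing key and one missing value.
toy = pd.DataFrame({
    'key': ['a', 'a', 'b', np.nan],
    'value': [1.0, np.nan, 3.0, 4.0],
})

g = toy.groupby('key')
sizes = g.size()             # rows per group, NaN values included
counts = g['value'].count()  # non-null values per group

# dropna=False keeps the rows whose key is missing as their own group.
sizes_with_nan = toy.groupby('key', dropna=False).size()

print(sizes)
print(counts)
print(sizes_with_nan)
```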

### Iterating over groups

- A groupby object supports iteration, yielding pairs of (group key, chunk of data).

In [42]:
df

Unnamed: 0,key1,key2,data1,data2
0,a,one,1.331587,-0.720086
1,a,two,0.715279,0.265512
2,b,one,-1.5454,0.108549
3,b,two,-0.008384,0.004291
4,a,one,0.621336,-0.1746


In [43]:
for name, group in df.groupby('key1'):
    # group is the sub-DataFrame holding the rows for this key.
    print("\n# key=", name)
    print("group=\n", group)


# key= a
group=
   key1 key2     data1     data2
0    a  one  1.331587 -0.720086
1    a  two  0.715279  0.265512
4    a  one  0.621336 -0.174600

# key= b
group=
   key1 key2     data1     data2
2    b  one -1.545400  0.108549
3    b  two -0.008384  0.004291


In [44]:
# In case of grouping by multiple keys, the "key" is a tuple of values.
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print("\n# key=", (k1, k2))
    print("group=\n", group)


# key= ('a', 'one')
group=
   key1 key2     data1     data2
0    a  one  1.331587 -0.720086
4    a  one  0.621336 -0.174600

# key= ('a', 'two')
group=
   key1 key2     data1     data2
1    a  two  0.715279  0.265512

# key= ('b', 'one')
group=
   key1 key2   data1     data2
2    b  one -1.5454  0.108549

# key= ('b', 'two')
group=
   key1 key2     data1     data2
3    b  two -0.008384  0.004291


In [45]:
# One can compute a dict out of the groupby in one line.
pieces = dict(list(df.groupby('key1')))

import pprint

pprint.pprint(pieces)

{'a':   key1 key2     data1     data2
0    a  one  1.331587 -0.720086
1    a  two  0.715279  0.265512
4    a  one  0.621336 -0.174600,
 'b':   key1 key2     data1     data2
2    b  one -1.545400  0.108549
3    b  two -0.008384  0.004291}


### Selecting a column or subset of columns

In [46]:
df

Unnamed: 0,key1,key2,data1,data2
0,a,one,1.331587,-0.720086
1,a,two,0.715279,0.265512
2,b,one,-1.5454,0.108549
3,b,two,-0.008384,0.004291
4,a,one,0.621336,-0.1746


In [47]:
# Group by "key1".
grouped = df.groupby('key1')

# .groups.keys() to get the keys.
print("keys=", list(grouped.groups.keys()))

# A groupby object can be split by column after being computed.
print(df.groupby('key1')["data1"])

keys= ['a', 'b']
<pandas.core.groupby.generic.SeriesGroupBy object at 0x1109f8780>


In [48]:
print(df.groupby('key1')["data1"].mean())
print(df.groupby('key1')["data2"].mean())

key1
a    0.889400
b   -0.776892
Name: data1, dtype: float64
key1
a   -0.209725
b    0.056420
Name: data2, dtype: float64


In [51]:
# These two orders are equivalent:
# - "split and then select"
# - "select and then split"
print(df.groupby('key1')["data1"].mean())

# Once we have selected "data1" there is no "key1" anymore so we use the array
# df["key1"] to label the values and group.
print(df["data1"].groupby(df['key1']).mean())

key1
a    0.889400
b   -0.776892
Name: data1, dtype: float64
key1
a    0.889400
b   -0.776892
Name: data1, dtype: float64


### Grouping with dicts and series

In [53]:
np.random.seed(10)

people = pd.DataFrame(
    np.random.randn(5, 5),
    columns=list("abcde"),
    index="Joe Steve Wes Jim Travis".split())

# Add NAs at row = 2 and columns = [1, 3]
people.iloc[2:3, [1, 3]] = np.nan

people

Unnamed: 0,a,b,c,d,e
Joe,1.331587,0.715279,-1.5454,-0.008384,0.621336
Steve,-0.720086,0.265512,0.108549,0.004291,-0.1746
Wes,0.433026,,-0.965066,,0.22863
Jim,0.445138,-1.136602,0.135137,1.484537,-1.079805
Travis,-1.977728,-1.743372,0.26607,2.384967,1.123691



In [56]:
# Map each column to a group name, then aggregate by group.
# The unused key 'f' is harmless: keys with no matching column are ignored.
mapping = {
    'a': 'red',
    'b': 'red',
    'c': 'blue',
    'd': 'blue',
    'e': 'red',
    'f': 'orange'
}

by_column = people.groupby(mapping, axis=1)

by_column.sum()

Unnamed: 0,blue,red
Joe,-1.553784,2.668201
Steve,0.11284,-0.629174
Wes,-0.965066,0.661656
Jim,1.619674,-1.771269
Travis,2.651037,-2.597409


In [25]:
# Transform the dict into a fixed mapping series.
map_series = pd.Series(mapping)

print(map_series)

display(people.groupby(map_series, axis=1).sum())

a       red
b       red
c      blue
d      blue
e       red
f    orange
dtype: object


Unnamed: 0,blue,red
Joe,-1.553784,2.668201
Steve,0.11284,-0.629174
Wes,-0.965066,0.661656
Jim,1.619674,-1.771269
Travis,2.651037,-2.597409


### Grouping with functions

- Instead of a fixed mapping through a dict or Series, a function can be used
- When a function is passed to groupby(), it is called once per index
  label and its return value is used as the group key
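
A small sketch of what "called on the index" means in practice: grouping by a function should give the same result as evaluating that function on every index label yourself and grouping by the resulting keys. The `toy` frame and the `initial` helper are hypothetical names used only here.

```python
import pandas as pd

# Hypothetical toy data indexed by name.
toy = pd.DataFrame({'x': [1, 2, 3, 4]}, index=['Joe', 'Jim', 'Wes', 'Will'])


def initial(label):
    # Called once per index label; the return value is the group key.
    return label[0]


by_func = toy.groupby(initial)['x'].sum()

# Same grouping, with the keys computed explicitly from the index.
keys = [initial(label) for label in toy.index]
by_keys = toy.groupby(keys)['x'].sum()

print(by_func)
print(by_keys)
```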

In [26]:
# Group by length of name
people.groupby(len).sum()

Unnamed: 0,a,b,c,d,e
3,2.20975,-0.421323,-2.375329,1.476153,-0.229839
5,-0.720086,0.265512,0.108549,0.004291,-0.1746
6,-1.977728,-1.743372,0.26607,2.384967,1.123691


### Grouping by index level

- One can use the hierarchical index to aggregate using one of the
  levels


In [27]:
columns = pd.MultiIndex.from_arrays(
    [['US', 'US', 'US', 'JP', 'JP'], [1, 3, 5, 1, 3]],
    names=['cty', 'tenor'])

columns

MultiIndex([('US', 1),
            ('US', 3),
            ('US', 5),
            ('JP', 1),
            ('JP', 3)],
           names=['cty', 'tenor'])

In [28]:
hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)

hier_df

cty,US,US,US,JP,JP
tenor,1,3,5,1,3
0,-1.173586,-0.728171,-0.573554,-0.601618,0.68589
1,0.90717,0.447767,0.116431,-0.30392,0.538885
2,0.85955,1.837971,0.610375,-0.347287,-0.739334
3,1.883621,-1.271082,1.066175,-0.806638,-0.251036


In [29]:
hier_df.groupby(level='cty', axis=1).count()

cty,JP,US
0,2,3
1,2,3
2,2,3
3,2,3
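
Recent pandas releases (2.1 and later, as far as I know) warn that the `axis=1` argument of `groupby` is deprecated. A possible workaround, sketched here on a stand-in frame `hier` with the same column layout, is to transpose, group the rows by level, and transpose back.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_arrays(
    [['US', 'US', 'US', 'JP', 'JP'], [1, 3, 5, 1, 3]],
    names=['cty', 'tenor'])

# Stand-in data; the values do not matter for count().
hier = pd.DataFrame(np.arange(20).reshape(4, 5), columns=columns)

# Transpose so the column levels become row levels, group by the 'cty'
# level, then transpose back to the original orientation.
counts = hier.T.groupby(level='cty').count().T

print(counts)
```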


## Data aggregation

- aggregation = any transformation that reduces an array to a scalar value
    - E.g., mean, count, min, sum, first, last

In [30]:
df

Unnamed: 0,key1,key2,data1,data2
0,a,one,1.331587,-0.720086
1,a,two,0.715279,0.265512
2,b,one,-1.5454,0.108549
3,b,two,-0.008384,0.004291
4,a,one,0.621336,-0.1746


In [31]:
# Split by values of key1 and compute the 0.9 quantile.
# key2 holds strings, so select the numeric columns first; otherwise
# quantile() raises a TypeError on the object column.
df.groupby('key1')[['data1', 'data2']].quantile(0.9)

In [None]:
# One can use any custom function.
def peak_to_peak(arr):
    return arr.max() - arr.min()

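# Select the numeric columns: subtraction is not defined for the strings in key2.
df.groupby('key1')[['data1', 'data2']].agg(peak_to_peak)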

In [None]:
# Also functions like describe() work, although they are not
# aggregations.

df.groupby('key1')["data1"].describe()

In [None]:
df.groupby('key1').describe()

### Column-wise and multiple function application

In [None]:
tips = pd.read_csv('~/src/github/pydata-book/examples/tips.csv')

tips['tip_pct'] = tips['tip'] / tips['total_bill']

tips.head()

In [None]:
grouped = tips.groupby(['day', 'smoker'])

# Select a column.
grouped_pct = grouped['tip_pct']
print("keys=", list(grouped_pct.groups.keys()))

# Equivalent.
grouped_pct.agg('mean')
grouped_pct.mean()

In [None]:
# Pass a list of aggregation functions.
funcs = ['mean', 'std', peak_to_peak]
grouped_pct.agg(funcs)

In [None]:
# Name the result columns by passing (name, function) tuples.
funcs = [
    ('foo', 'mean'),
    ('bar', 'std')
]
grouped_pct.agg(funcs)

In [None]:
funcs = ['count', 'mean', 'max']
# Select several columns with a list (double brackets).
result = grouped[['tip_pct', 'total_bill']].agg(funcs)

# The result has hierarchical indexes on both rows and columns.
result

In [None]:
# Aggregation functions can also be specified by dict.
funcs = {'tip': 'max', 'size': 'sum'}
grouped.agg(funcs)

In [None]:
funcs = {'tip_pct': ['min', 'max', 'mean', 'std'], 'size': 'sum'}
grouped.agg(funcs)
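
When a dict of lists produces hierarchical columns that are awkward to work with, named aggregation is one alternative. The sketch below assumes pandas 0.25 or later and uses a hypothetical `toy` frame standing in for a few rows of the tips data; the output column names are made up.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the tips data.
toy = pd.DataFrame({
    'day': ['Sat', 'Sat', 'Sun', 'Sun'],
    'tip': [1.0, 3.0, 2.0, 5.0],
    'size': [2, 4, 2, 3],
})

# Each keyword is an output column: name=(input column, function).
summary = toy.groupby('day').agg(
    max_tip=('tip', 'max'),
    mean_tip=('tip', 'mean'),
    total_size=('size', 'sum'),
)

print(summary)
```

The result has flat, explicitly named columns, which can be easier to pass on to later steps.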

### Returning aggregated data without row indexes

In [None]:
# numeric_only=True skips the remaining string columns (e.g. 'time').
tips.groupby(['day', 'smoker'], as_index=True).mean(numeric_only=True)

In [None]:
# Returning a hierarchical index can be disabled with as_index=False.
tips.groupby(['day', 'smoker'], as_index=False).mean(numeric_only=True)

In [None]:
# This is equivalent to calling reset_index() on the result.

tips.groupby(['day', 'smoker'], as_index=True).mean(numeric_only=True).reset_index()

## Apply: General split-apply-combine

In [None]:
# Select the top n tip_pct values (n=2 by default) by group.

def top(df, n=2, column='tip_pct'):
    return df.sort_values(by=column)[-n:]

top(tips, n=6)

In [None]:
# top() is called on each group of rows and the results are then
# concatenated with pandas.concat, labeled with the group names.
tips.groupby('smoker').apply(top)

In [None]:
# You can pass params to the function using **kwargs.
tips.groupby('smoker').apply(top, n=1, column='total_bill')

In [None]:
# This operation is what describe() does.
display(tips.groupby('smoker')["tip_pct"].describe())

df2 = tips.groupby('smoker')["tip_pct"].apply(lambda x: x.describe())
display(df2)
    

### Suppressing the group keys

In [None]:
# group_keys=False disables the hierarchical index built from the group keys.

df = tips.groupby('smoker', group_keys=True).apply(top)
display(df)

df = tips.groupby('smoker', group_keys=False).apply(top)
display(df)

### Quantile and bucket analysis

In [None]:
df = pd.DataFrame({
    "data1": np.random.randn(100),
    "data2": np.random.randn(100)
})

display(df.head())

In [None]:
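# Note: pd.cut makes 4 equal-width bins, so despite the variable name these
# are not true quartiles (pd.qcut gives equal-count bins).
quartiles = pd.cut(df.data1, 4)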

print(type(quartiles))

quartiles[:4]

In [None]:
# We can use the series above as the grouping key.

# We can select data2 directly: the bins were computed from data1 and share
# the same index, so the row-to-bin mapping is already known.
grouped = df["data2"].groupby(quartiles)

for k, v in grouped:
    print(k)
    print(v.head(2))

In [None]:
def get_stats(group):
    # Summarize one bin as a Series of labeled statistics.
    return pd.Series({
        'min': group.min(),
        'max': group.max(),
        'count': group.count(),
        'mean': group.mean()
    })

df2 = grouped.apply(get_stats)

display(df2)

# Move one level of index to columns.
df2.unstack()

In [None]:
# qcut bins by sample quantiles, so each bin holds about the same number of points.
pd.qcut(df.data1, 4).head()

In [None]:
# cut bins by value, giving intervals of equal width.
pd.cut(df.data1, 4).head()
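
To make the contrast between the two binning functions concrete, here is a minimal sketch on hypothetical random data (`values`, seeded separately from the notebook's data). It counts how many points land in each bin under `pd.cut` and under `pd.qcut`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample, independent of the notebook's df.
rng = np.random.RandomState(0)
values = pd.Series(rng.randn(100))

by_width = pd.cut(values, 4)    # 4 intervals of equal width
by_count = pd.qcut(values, 4)   # 4 intervals with equal numbers of points

width_counts = by_width.value_counts().sort_index()
count_counts = by_count.value_counts().sort_index()

print(width_counts)
print(count_counts)
```

With `qcut` every bin should contain a quarter of the sample, while the `cut` bins follow the shape of the distribution.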

### Example: filling missing values with group-specific values

In [None]:
s = pd.Series(np.random.randn(6))
s[::2] = np.nan

s

In [None]:
s.fillna(s.mean())

In [None]:
# One can fill nans based on the group.

states = [
    'Ohio', 'New York', 'Vermont', 'Florida', 'Oregon', 'Nevada', 'California',
    'Idaho'
]
group_key = ['East'] * 4 + ['West'] * 4
data = pd.Series(np.random.randn(8), index=states)

data

In [None]:
data[['Vermont', "Nevada", "Idaho"]] = np.nan

data

In [None]:
data.groupby(group_key).mean()

In [None]:
fill_mean = lambda g: g.fillna(g.mean())

data.groupby(group_key).apply(fill_mean)
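
An alternative to `apply` for this kind of fill is `transform`, which returns a result aligned with the original index. The sketch below uses small hypothetical data so the filled values are easy to check by eye.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two groups of three values, one gap in each.
data = pd.Series([1.0, np.nan, 3.0, 10.0, np.nan, 30.0],
                 index=['a', 'b', 'c', 'd', 'e', 'f'])
group_key = ['East'] * 3 + ['West'] * 3

# transform('mean') broadcasts each group's mean back to every row of
# that group, so it lines up with data element by element.
group_means = data.groupby(group_key).transform('mean')

# Fill each gap with the mean of its own group.
filled = data.fillna(group_means)

print(filled)
```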

### Example: random sampling and permutation

In [None]:
# Hearts
# Spades
# Clubs
# Diamonds
suits = list("HSCD")
# Values.
card_val = list(range(1, 10 + 1)) + [10] * 3
base_names = ['A'] + list(range(2, 10 + 1)) + list("JQK")
assert len(card_val) == len(base_names)

cards = []
for suit in suits:
    cards.extend(str(num) + suit for num in base_names)
    
deck = pd.Series(card_val * 4, index=cards)
assert len(deck) == 52
deck.head()

In [None]:
# Draw n cards (5 by default) without replacement.

def draw(deck, n=5):
    return deck.sample(n)


draw(deck)

In [None]:
# Last letter is suit.
get_suit = lambda card: card[-1]

In [None]:
# Draw 2 cards per suit.
# Group by suit, and then get 2 cards from each group.
deck.groupby(get_suit, group_keys=False).apply(draw, n=2)

### Example: group weighted average and correlation

In [None]:
df = pd.DataFrame({
    'category': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'data': np.random.randn(8),
    'weights': np.random.rand(8)
})

df

In [None]:
grouped = df.groupby('category')

# Aggregation function: weighted average, i.e. the dot product between
# data and weights divided by the sum of the weights.
get_wavg = lambda g: np.average(g['data'], weights=g['weights'])

grouped.apply(get_wavg)
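
As a quick check of what `np.average` computes when weights are given, the following sketch compares it with the explicit formula sum(w * x) / sum(w) on hypothetical arrays `x` and `w`.

```python
import numpy as np

# Hypothetical values and weights.
x = np.array([1.0, 2.0, 4.0])
w = np.array([1.0, 1.0, 2.0])

# Library call used in the cell above.
wavg = np.average(x, weights=w)

# Explicit formula: dot product normalized by the total weight.
manual = np.dot(x, w) / w.sum()

print(wavg, manual)
```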

In [None]:
close_px = pd.read_csv(
    '~/src/github/pydata-book/examples/stock_px_2.csv',
    parse_dates=True,
    index_col=0)

close_px.head()

In [None]:
close_px.info()

In [None]:
close_px.describe()

In [None]:
# Compute rets.

rets = close_px.pct_change().dropna()

rets.head()

In [None]:
# For each stock compute the correlation with SPX.
spx_corr = lambda x: x.corrwith(x['SPX'])

# Groupby year.
get_year = lambda x: x.year

# For each year, compute the correlation of each stock to SPX.
by_year = rets.groupby(get_year)
by_year.apply(spx_corr)

In [None]:
# For each year, compute the correlation of AAPL and MSFT.
by_year.apply(lambda x: x['AAPL'].corr(x['MSFT']))

### Example: group-wise linear regression

- You can use groupby to perform more complex analyses, as long
  as the function returns a pandas object (Series or DataFrame) or
  a scalar value

In [None]:
import statsmodels.api as sm

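def regress(data, yvar, xvars):
    Y = data[yvar]
    # Copy so that adding the intercept column does not modify (or warn
    # about modifying) a slice of the original data.
    X = data[xvars].copy()
    X['intercept'] = 1.0
    result = sm.OLS(Y, X).fit()
    return result.params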

by_year.apply(regress, 'AAPL', ['SPX'])
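
If statsmodels is not available, the same idea can be sketched with `numpy.linalg.lstsq`. This is a swapped-in technique on synthetic, noise-free data, not the notebook's stock data; `toy`, `fit_line`, and `params` are hypothetical names.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
n = 50
toy = pd.DataFrame({
    'year': [2010] * n + [2011] * n,
    'x': rng.randn(2 * n),
})
# Synthetic y: slope 2 in 2010, slope -1 in 2011, intercept 0.5, no noise.
slope = np.where(toy['year'] == 2010, 2.0, -1.0)
toy['y'] = slope * toy['x'] + 0.5


def fit_line(group):
    # Design matrix [x, 1]; lstsq solves for [slope, intercept].
    X = np.column_stack([group['x'].to_numpy(), np.ones(len(group))])
    coef, *_ = np.linalg.lstsq(X, group['y'].to_numpy(), rcond=None)
    return pd.Series(coef, index=['x', 'intercept'])


# One fit per year, collected into a DataFrame with one row per group.
params = pd.DataFrame({year: fit_line(g) for year, g in toy.groupby('year')}).T

print(params)
```

Iterating over the groups explicitly keeps the sketch independent of how `apply` treats the grouping column in a given pandas version.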

## Pivot tables and cross-tabulation

- A pivot table aggregates a table of data by one or more keys,
  arranging the results with some keys along the rows and others along the columns

In [None]:
tips.head()

In [None]:
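# Aggregate through mean by two indices.
# The numeric columns are listed explicitly because the default mean
# cannot aggregate string columns such as 'time'.
tips.pivot_table(index=['day', 'smoker'],
                 values=['size', 'tip', 'tip_pct', 'total_bill'])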

In [None]:
#?tips.pivot_table

In [None]:
# Compute two metrics by
# - 3 vars: 2 on the index and 1 on the columns
tips.pivot_table(['tip_pct', 'size'],
                 index=['time', 'day'],
                 columns='smoker')

In [None]:
# margins=True adds an "All" row and column holding the aggregate
# (here the mean) over all the data in each tier, for both variables.
tips.pivot_table(['tip_pct', 'size'],
                 index=['time', 'day'],
                 columns='smoker',
                 margins=True)

In [None]:
# You can specify the aggregation function by passing aggfunc.
tips.pivot_table('tip_pct',
                 index=['time', 'smoker'],
                 columns='day',
                 aggfunc=len, margins=True)
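
A pivot table with the default mean can be reproduced with `groupby` followed by `unstack`, which may help when reading the calls above. The sketch uses a hypothetical `toy` frame rather than the tips file.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the tips data.
toy = pd.DataFrame({
    'day': ['Sat', 'Sat', 'Sun', 'Sun', 'Sun'],
    'smoker': ['No', 'Yes', 'No', 'Yes', 'Yes'],
    'tip': [1.0, 3.0, 2.0, 4.0, 6.0],
})

# Mean tip with 'day' on the rows and 'smoker' on the columns.
pivoted = toy.pivot_table('tip', index='day', columns='smoker')

# Same table: group by both keys, then move 'smoker' to the columns.
manual = toy.groupby(['day', 'smoker'])['tip'].mean().unstack()

print(pivoted)
print(manual)
```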

### Crosstab

- Special case of pivot table that computes group frequencies

In [None]:
pd.crosstab([tips.time, tips.day], tips.smoker, margins=True)
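
Since a crosstab is a table of group frequencies, it should match counting rows per key pair with `groupby(...).size()`. A minimal sketch on a hypothetical `toy` frame:

```python
import pandas as pd

# Hypothetical stand-in for two columns of the tips data.
toy = pd.DataFrame({
    'day': ['Sat', 'Sat', 'Sun', 'Sun', 'Sun'],
    'smoker': ['No', 'Yes', 'No', 'Yes', 'Yes'],
})

# Frequency table: rows are days, columns are smoker values.
freq = pd.crosstab(toy['day'], toy['smoker'])

# Same counts via groupby; fill_value=0 covers key pairs that never occur.
manual = toy.groupby(['day', 'smoker']).size().unstack(fill_value=0)

print(freq)
print(manual)
```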