# Chapter 10. Data aggregation and group operations

## 10.1 Groupby mechanics

The grouping key can be provided in different forms:

1. a list or array of values the same length as the axis being grouped
2. a column name
3. a dictionary or Series that maps the values on the axis being grouped to the group names
4. a function to be invoked on the axis index or individual index labels

The following are some examples using these methods.

In [53]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(0)

In [54]:
df = pd.DataFrame({
    'key1' : ['a', 'a', 'b', 'b', 'a'],
    'key2' : ['one', 'two', 'one', 'two', 'one'],
    'data1' : np.random.randn(5), 
    'data2' : np.random.randn(5)
})
df

  key1 key2     data1     data2
0    a  one  1.764052 -0.977278
1    a  two  0.400157  0.950088
2    b  one  0.978738 -0.151357
3    b  two  2.240893 -0.103219
4    a  one  1.867558  0.410599


In [55]:
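# Same construction again: the random generator has advanced, so the values differ.
# All later examples use this second frame.
df = pd.DataFrame({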
    'key1' : ['a', 'a', 'b', 'b', 'a'],
    'key2' : ['one', 'two', 'one', 'two', 'one'],
    'data1' : np.random.randn(5), 
    'data2' : np.random.randn(5)
})
df

  key1 key2     data1     data2
0    a  one  0.144044  0.333674
1    a  two  1.454274  1.494079
2    b  one  0.761038 -0.205158
3    b  two  0.121675  0.313068
4    a  one  0.443863 -0.854096


In [56]:
# `groupby()` creates a new `GroupBy` object.
grouped = df['data1'].groupby(df['key1'])
grouped

<pandas.core.groupby.generic.SeriesGroupBy object at 0x1136f1e90>

In [57]:
# Mean of the data in the 'data1' column, grouped by 'key1'.
grouped.mean()

key1
a    0.680727
b    0.441356
Name: data1, dtype: float64

In [58]:
# Group by two columns.
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means

key1  key2
a     one     0.293953
      two     1.454274
b     one     0.761038
      two     0.121675
Name: data1, dtype: float64

In [59]:
means.unstack()

key2       one       two
key1
a     0.293953  1.454274
b     0.761038  0.121675
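The keys in the examples above are Series taken from the same DataFrame, but any arrays of the right length should work as well. The sketch below is illustrative only: the `states` and `years` arrays and the fixed `data1` values are assumptions made up for the example, not data from this chapter.

```python
import numpy as np
import pandas as pd

# Small fixed frame so the result is easy to check by hand (illustrative values).
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'data1': [1.0, 2.0, 3.0, 4.0, 6.0],
})

# External keys: not columns of df, just arrays with one entry per row.
states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])
years = np.array([2005, 2005, 2006, 2005, 2006])

# Each (state, year) pair defines one group.
by_arrays = df['data1'].groupby([states, years]).mean()
print(by_arrays)
```

The result has a two-level index, one level per key array, just like the two-column grouping above.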


If the grouping information is a column of the same DataFrame, then only the grouping column name is required.

In [60]:
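# numeric_only=True skips the non-numeric 'key2' column; pandas 2.0 and later
# raise a TypeError here without it.
df.groupby('key1').mean(numeric_only=True)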

         data1     data2
key1
a     0.680727  0.324553
b     0.441356  0.053955


In [61]:
df.groupby(['key1', 'key2']).mean()

              data1     data2
key1 key2
a    one   0.293953 -0.260211
     two   1.454274  1.494079
b    one   0.761038 -0.205158
     two   0.121675  0.313068


A frequently useful method on a GroupBy object is `size()`, which returns a Series containing the number of rows in each group.

In [62]:
df.groupby(['key1', 'key2']).size()

key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64
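It may help to contrast `size()` with `count()`, since the two differ once missing values are present. The following is a minimal sketch on a made-up frame (the values, including the `NaN`, are assumptions chosen for illustration).

```python
import numpy as np
import pandas as pd

# Illustrative frame with one missing value in group 'a'.
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'data1': [1.0, np.nan, 3.0, 4.0, 5.0],
})

grouped = df.groupby('key1')

sizes = grouped.size()              # rows per group, missing values included
counts = grouped['data1'].count()   # non-null 'data1' values per group

print(sizes)
print(counts)
```

Here group `a` has three rows but only two non-null `data1` values, so the two results disagree for that group.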

### Iterating over groups

The GroupBy object created by `groupby()` supports iteration, yielding a sequence of 2-tuples that contain the group name and the corresponding chunk of data. With multiple keys, the group name is itself a tuple of key values.

In [63]:
for name, group in df.groupby('key1'):
    print(f'group name: {name}')
    print(group)
    print('')


group name: a
  key1 key2     data1     data2
0    a  one  0.144044  0.333674
1    a  two  1.454274  1.494079
4    a  one  0.443863 -0.854096

group name: b
  key1 key2     data1     data2
2    b  one  0.761038 -0.205158
3    b  two  0.121675  0.313068



In [64]:
for name, group in df.groupby(['key1', 'key2']):
    print(f'group name: {name[0]}-{name[1]}')
    print(group)
    print('')


group name: a-one
  key1 key2     data1     data2
0    a  one  0.144044  0.333674
4    a  one  0.443863 -0.854096

group name: a-two
  key1 key2     data1     data2
1    a  two  1.454274  1.494079

group name: b-one
  key1 key2     data1     data2
2    b  one  0.761038 -0.205158

group name: b-two
  key1 key2     data1     data2
3    b  two  0.121675  0.313068
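Because iteration yields (name, group) pairs, one convenient pattern is to collect the pieces into a dictionary for later lookup. This is a sketch on a small made-up frame; the name `pieces` and the fixed values are assumptions for illustration.

```python
import pandas as pd

# Illustrative frame with fixed values.
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Map each group name to the sub-DataFrame holding that group's rows.
pieces = {name: group for name, group in df.groupby('key1')}

print(pieces['b'])
```

Each value in `pieces` keeps the original row labels, so `pieces['b']` contains rows 2 and 3.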



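By default, `groupby()` groups on `axis=0` (the rows); passing `axis=1` groups the columns instead. Note that pandas 2.1 deprecated the `axis` argument, so the `axis=1` examples in this section emit a `FutureWarning` on recent versions.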

In [65]:
df.dtypes

key1      object
key2      object
data1    float64
data2    float64
dtype: object

In [66]:
grouped = df.groupby(df.dtypes, axis=1)
for dtype, group in grouped:
    print(f'data type: {dtype}')
    print(group)
    print('')


data type: float64
      data1     data2
0  0.144044  0.333674
1  1.454274  1.494079
2  0.761038 -0.205158
3  0.121675  0.313068
4  0.443863 -0.854096

data type: object
  key1 key2
0    a  one
1    a  two
2    b  one
3    b  two
4    a  one
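On pandas versions where `groupby(..., axis=1)` is deprecated, a similar split of columns by dtype can be built by hand. The sketch below is one possible approach, not the library's prescribed replacement; the names `columns_by_dtype` and `frames_by_dtype` are hypothetical, and the frame is made up for the example.

```python
import pandas as pd

# Illustrative frame with one non-numeric and two float columns.
df = pd.DataFrame({
    'key1': ['a', 'a', 'b'],
    'data1': [1.0, 2.0, 3.0],
    'data2': [4.0, 5.0, 6.0],
})

# Collect column names under the string form of their dtype.
columns_by_dtype = {}
for column, dtype in df.dtypes.items():
    columns_by_dtype.setdefault(str(dtype), []).append(column)

# One sub-frame per dtype, similar to what iterating the axis=1 groupby yields.
frames_by_dtype = {name: df[cols] for name, cols in columns_by_dtype.items()}

for name, frame in frames_by_dtype.items():
    print(f'data type: {name}')
    print(frame)
```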



### Selecting a column or subset of columns

Indexing a GroupBy object created from a DataFrame with a column name (or a list of column names) selects those columns for aggregation.
The next two statements are equivalent.

In [67]:
df.groupby('key1')['data1']

<pandas.core.groupby.generic.SeriesGroupBy object at 0x113875750>

In [68]:
df['data1'].groupby(df['key1'])

<pandas.core.groupby.generic.SeriesGroupBy object at 0x11387b5d0>
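The form of the selector affects the type of the result: a single column name gives a grouped Series, while a list of names gives a grouped DataFrame. A short sketch on made-up values (the fixed numbers are assumptions for illustration):

```python
import pandas as pd

# Illustrative frame with fixed values.
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [1.0, 2.0, 3.0, 4.0, 5.0],
    'data2': [10.0, 20.0, 30.0, 40.0, 50.0],
})

# A scalar column name selects a SeriesGroupBy, so the mean is a Series.
as_series = df.groupby(['key1', 'key2'])['data2'].mean()

# A list of names selects a DataFrameGroupBy, so the mean is a DataFrame.
as_frame = df.groupby(['key1', 'key2'])[['data2']].mean()

print(type(as_series).__name__)
print(type(as_frame).__name__)
```

Selecting only the columns that are needed can also avoid aggregating the rest of a large frame.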

### Grouping with dictionaries and Series



In [69]:
people = pd.DataFrame(np.random.randn(5, 5),
                      columns=list('abcde'),
                      index=['Joe', 'Steve', 'Wex', 'Jim', 'Travis'])
people

               a         b         c         d         e
Joe    -2.552990  0.653619  0.864436 -0.742165  2.269755
Steve  -1.454366  0.045759 -0.187184  1.532779  1.469359
Wex     0.154947  0.378163 -0.887786 -1.980796 -0.347912
Jim     0.156349  1.230291  1.202380 -0.387327 -0.302303
Travis -1.048553 -1.420018 -1.706270  1.950775 -0.509652


In [70]:
people.iloc[2, [1, 2]] = np.nan
people

               a         b         c         d         e
Joe    -2.552990  0.653619  0.864436 -0.742165  2.269755
Steve  -1.454366  0.045759 -0.187184  1.532779  1.469359
Wex     0.154947       NaN       NaN -1.980796 -0.347912
Jim     0.156349  1.230291  1.202380 -0.387327 -0.302303
Travis -1.048553 -1.420018 -1.706270  1.950775 -0.509652


Given a mapping from column names to group names, the columns can be summed by group by passing the dictionary as the grouping key. Keys that match no column, such as `'f'` below, are ignored.

In [71]:
mapping = {
    'a': 'red', 'b': 'red', 'c': 'blue',
    'd': 'blue', 'e': 'red', 'f' : 'orange'
}

In [72]:
by_column = people.groupby(mapping, axis=1)
by_column.sum()

            blue       red
Joe     0.122271  0.370383
Steve   1.345595  0.060752
Wex    -1.980796 -0.192965
Jim     0.815053  1.084337
Travis  0.244505 -2.978223
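A Series can play the same role as the dictionary, since it is also a fixed mapping from labels to group names. The sketch below uses a small made-up frame and groups the transposed frame rather than passing `axis=1`, on the assumption that the `axis` argument is deprecated in the installed pandas; `map_series` and `by_color` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Illustrative two-row frame; the values are made up, including the NaNs.
people = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0, 5.0],
     [6.0, np.nan, np.nan, 9.0, 10.0]],
    columns=list('abcde'),
    index=['Joe', 'Wex'],
)

# Column name -> group name; 'f' matches no column and is ignored.
map_series = pd.Series({
    'a': 'red', 'b': 'red', 'c': 'blue',
    'd': 'blue', 'e': 'red', 'f': 'orange',
})

# Transpose so the columns become rows, group them, then transpose back.
by_color = people.T.groupby(map_series).sum().T
print(by_color)
```

Missing values are skipped by `sum()`, so Wex's `blue` total comes from column `d` alone.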


### Grouping with functions

A function can also serve as the group key.
It is called once for each index label, and its return values are used as the group names.

Here is an example of grouping by the length of the first names.

In [73]:
people.groupby(len).sum()

          a         b         c         d         e
3 -2.241693  1.883909  2.066816 -3.110288  1.619540
5 -1.454366  0.045759 -0.187184  1.532779  1.469359
6 -1.048553 -1.420018 -1.706270  1.950775 -0.509652
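Any callable that accepts an index label should work in place of `len`. As a hedged illustration, the sketch below groups by the first letter of each name; `first_letter` is a hypothetical helper and the single-column frame is made up for the example.

```python
import pandas as pd

# Illustrative frame with fixed values and name labels as the index.
people = pd.DataFrame(
    {'a': [1.0, 2.0, 3.0, 4.0, 5.0]},
    index=['Joe', 'Steve', 'Wex', 'Jim', 'Travis'],
)

def first_letter(name):
    """Return the group name for one index label (hypothetical helper)."""
    return name[0]

# pandas calls first_letter on each index label to build the group keys.
by_initial = people.groupby(first_letter).sum()

# Applying the function to the index by hand gives the same grouping.
explicit = people.groupby(people.index.map(first_letter)).sum()

print(by_initial)
```

Joe and Jim share the initial `J`, so their rows are summed into one group.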


Functions can be mixed with arrays, dictionaries, or Series in the same grouping call.

In [74]:
key_list = ['one', 'one', 'one', 'two', 'two']
people.groupby([len, key_list]).min()

              a         b         c         d         e
3 one -2.552990  0.653619  0.864436 -1.980796 -0.347912
  two  0.156349  1.230291  1.202380 -0.387327 -0.302303
5 one -1.454366  0.045759 -0.187184  1.532779  1.469359
6 two -1.048553 -1.420018 -1.706270  1.950775 -0.509652


### Grouping by index levels

For hierarchically indexed data structures, one of the levels of an axis index can be used for grouping by passing the level number or name through the `level` keyword.

In [75]:
columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                     [1, 3, 5, 1, 3]],
                                    names=['cty', 'tenor'])
hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)
hier_df

cty          US                            JP
tenor         1         3         5         1         3
0     -0.438074 -1.252795  0.777490 -1.613898 -0.212740
1     -0.895467  0.386902 -0.510805 -1.180632 -0.028182
2      0.428332  0.066517  0.302472 -0.634322 -0.362741
3     -0.672460 -0.359553 -0.813146 -1.726283  0.177426


In [76]:
hier_df.groupby(level='cty', axis=1).count()

cty  JP  US
0     2   3
1     2   3
2     2   3
3     2   3
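The same level-based grouping can be written without the `axis` argument by transposing first, which may be preferable on pandas versions that deprecate `axis=1`. This is a sketch under that assumption; the frame is filled with `np.arange` values rather than random numbers so the sums can be checked by hand.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_arrays(
    [['US', 'US', 'US', 'JP', 'JP'], [1, 3, 5, 1, 3]],
    names=['cty', 'tenor'],
)
# Deterministic values 0..19 instead of random draws (illustrative only).
hier_df = pd.DataFrame(np.arange(20.0).reshape(4, 5), columns=columns)

# After transposing, 'cty' is a row-index level, so no axis argument is needed.
counts = hier_df.T.groupby(level='cty').count().T
sums = hier_df.T.groupby(level='cty').sum().T

print(counts)
print(sums)
```

The first row of `hier_df` holds 0 through 4, so its `US` sum is 0 + 1 + 2 and its `JP` sum is 3 + 4.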


## 10.2 Data Aggregation