# Data Aggregation and Group Operations

Categorizing a data set and applying a function to each group, whether an aggregation
or transformation, is often a critical component of a data analysis workflow. After
loading, merging, and preparing a data set, a familiar task is to compute group statistics
or possibly pivot tables for reporting or visualization purposes. pandas provides a flexible
and high-performance groupby facility, enabling you to slice and dice, and summarize
data sets in a natural way.


One reason for the popularity of relational databases and SQL (which stands for
“structured query language”) is the ease with which data can be joined, filtered, transformed,
and aggregated. However, query languages like SQL are rather limited in the
kinds of group operations that can be performed. As you will see, with the expressiveness
and power of Python and pandas, we can perform much more complex grouped
operations by utilizing any function that accepts a pandas object or NumPy array. In
this chapter, you will learn how to:

• Split a pandas object into pieces using one or more keys (in the form of functions, arrays, or DataFrame column names)

• Compute group summary statistics, like count, mean, or standard deviation, or a user-defined function

• Apply a varying set of functions to each column of a DataFrame

• Apply within-group transformations or other manipulations, like normalization, linear regression, rank, or subset selection

• Compute pivot tables and cross-tabulations

• Perform quantile analysis and other data-derived group analyses


Aggregation of time series data, a special use case of groupby, is referred
to as resampling in this book and will receive separate treatment in
Chapter 10.

In [2]:
from pandas import DataFrame, Series
import pandas as pd
import sys
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

## GroupBy Mechanics

Hadley Wickham, an author of many popular packages for the R programming language,
coined the term split-apply-combine for talking about group operations, and I
think that’s a good description of the process. In the first stage of the process, data
contained in a pandas object, whether a Series, DataFrame, or otherwise, is split into
groups based on one or more keys that you provide. The splitting is performed on a
particular axis of an object. For example, a DataFrame can be grouped on its rows
(axis=0) or its columns (axis=1). Once this is done, a function is applied to each group,
producing a new value. Finally, the results of all those function applications are combined
into a result object. The form of the resulting object will usually depend on what’s
being done to the data. See Figure 9-1 for a mockup of a simple group aggregation.
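To make the three stages concrete, here is a minimal sketch that performs them by hand and then compares the result with a single groupby call. The `toy` DataFrame and its column names are hypothetical stand-ins, not the chapter's example data.

```python
import pandas as pd

# Hypothetical toy data (not the chapter's random example)
toy = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'a'],
                    'value': [1.0, 2.0, 3.0, 4.0, 5.0]})

# Split: one chunk of values per distinct key
pieces = {k: toy.loc[toy['key'] == k, 'value'] for k in toy['key'].unique()}

# Apply: reduce each chunk to a single number
applied = {k: chunk.sum() for k, chunk in pieces.items()}

# Combine: assemble the per-group results into one Series
manual = pd.Series(applied).sort_index()

# groupby carries out the same three stages in one call
built_in = toy.groupby('key')['value'].sum()

print(manual.tolist())    # [9.0, 6.0]
print(built_in.tolist())  # [9.0, 6.0]
```

The manual version is only meant to illustrate the idea; groupby does the same work far more efficiently.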

Each grouping key can take many forms, and the keys do not have to be all of the same
type:

• A list or array of values that is the same length as the axis being grouped

• A value indicating a column name in a DataFrame

• A dict or Series giving a correspondence between the values on the axis being grouped and the group names

• A function to be invoked on the axis index or the individual labels in the index


Note that the latter three methods are all just shortcuts for producing an array of values
to be used to split up the object. Don’t worry if this all seems very abstract. Throughout
this chapter, I will give many examples of all of these methods. To get started, here is
a very simple small tabular dataset as a DataFrame:

In [3]:
df = DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                'key2' : ['one', 'two', 'one', 'two', 'one'],
                'data1' : np.random.randn(5),
                'data2' : np.random.randn(5)})

In [4]:
df

      data1     data2 key1 key2
0 -0.586185 -1.865190    a  one
1 -0.086984 -0.081112    a  two
2  1.458549  1.659470    b  one
3 -1.868374 -0.425024    b  two
4  2.101978 -0.875400    a  one


In [5]:
grouped = df['data1'].groupby(df['key1'])

In [11]:
grouped

<pandas.core.groupby.SeriesGroupBy object at 0x10b02a5f8>

This grouped variable is now a GroupBy object. It has not actually computed anything
yet except for some intermediate data about the group key df['key1']. The idea is that
this object has all of the information needed to then apply some operation to each of
the groups. For example, to compute group means we can call the GroupBy’s mean
method:

In [12]:
grouped.mean()

key1
a    0.476270
b   -0.204912
Name: data1, dtype: float64

Later, I'll explain more about what’s going on when you call .mean(). The important
thing here is that the data (a Series) has been aggregated according to the group key,
producing a new Series that is now indexed by the unique values in the key1 column.
The result index has the name 'key1' because the DataFrame column df['key1'] did.
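The following sketch, using fixed numbers instead of random data so the results are predictable, shows both points: the GroupBy object only records how the rows split, and the aggregated Series inherits its index name from the key. The `toy` frame is a hypothetical stand-in for df.

```python
import pandas as pd

# Hypothetical stand-in for df with fixed values
toy = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                    'data1': [1.0, 2.0, 3.0, 4.0, 6.0]})

grouped = toy['data1'].groupby(toy['key1'])

# No aggregation has happened yet; the object knows the group layout
print(grouped.ngroups)      # 2

result = grouped.mean()
print(result.index.name)    # key1
print(result.tolist())      # [3.0, 3.5]
```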


If instead we had passed multiple arrays as a list, we would get something different:

In [14]:
means = df['data1'].groupby([df['key1'], df['key2']]).mean()

In [15]:
means

key1  key2
a     one     0.757896
      two    -0.086984
b     one     1.458549
      two    -1.868374
Name: data1, dtype: float64

In this case, we grouped the data using two keys, and the resulting Series now has a
hierarchical index consisting of the unique pairs of keys observed:

In [16]:
means.unstack()

key2       one       two
key1
a     0.757896 -0.086984
b     1.458549 -1.868374


In these examples, the group keys are all Series, though they could be any arrays of the
right length:

In [17]:
states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])

In [18]:
years = np.array([2005, 2005, 2006, 2005, 2006])

In [19]:
df['data1'].groupby([states, years]).mean()

California  2005   -0.086984
            2006    1.458549
Ohio        2005   -1.227280
            2006    2.101978
Name: data1, dtype: float64

Frequently the grouping information is found in the same DataFrame as the data
you want to work on. In that case, you can pass column names (whether those are
strings, numbers, or other Python objects) as the group keys:

In [20]:
df.groupby('key1').mean()

         data1     data2
key1
a     0.476270 -0.940568
b    -0.204912  0.617223


In [24]:
df.groupby(['key1', 'key2']).mean()

              data1     data2
key1 key2
a    one   0.757896 -1.370295
     two  -0.086984 -0.081112
b    one   1.458549  1.659470
     two  -1.868374 -0.425024


You may have noticed in the first case, df.groupby('key1').mean(), that there is no
key2 column in the result. Because df['key2'] is not numeric data, it is said to be a
nuisance column, which is therefore excluded from the result. By default, all of the
numeric columns are aggregated, though it is possible to filter down to a subset, as you’ll
see soon.
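Whether non-numeric columns are dropped silently depends on the pandas version; recent releases may raise an error instead. A sketch of two ways to make the choice explicit is shown here, assuming a pandas version whose GroupBy.mean accepts the numeric_only argument. The `toy` frame is a hypothetical stand-in for df.

```python
import pandas as pd

# Hypothetical stand-in for df with fixed values
toy = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                    'key2': ['one', 'two', 'one', 'two', 'one'],
                    'data1': [1.0, 2.0, 3.0, 4.0, 6.0],
                    'data2': [10.0, 20.0, 30.0, 40.0, 60.0]})

# Option 1: ask for the numeric columns only
means = toy.groupby('key1').mean(numeric_only=True)

# Option 2: select the columns to aggregate up front
selected = toy.groupby('key1')[['data1', 'data2']].mean()

print(list(means.columns))        # ['data1', 'data2']
print(means['data1'].tolist())    # [3.0, 3.5]
print(selected['data2'].tolist()) # [30.0, 35.0]
```

Either form keeps key2 out of the result without relying on version-specific defaults.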

Regardless of the objective in using groupby, a generally useful GroupBy method is
size, which returns a Series containing group sizes:

In [25]:
df.groupby(['key1', 'key2']).size()

key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64

As of this writing, any missing values in a group key will be excluded
from the result. It’s possible (and, in fact, quite likely), that by the time
you are reading this there will be an option to include the NA group in
the result.
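Later pandas releases did add such an option. The sketch below assumes a pandas version (1.1 or newer, to my knowledge) in which groupby accepts a dropna argument; the `toy` frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing group keys
toy = pd.DataFrame({'key': ['a', np.nan, 'b', 'a', np.nan],
                    'value': [1, 2, 3, 4, 5]})

# Default behavior: rows whose key is missing are excluded
default_sizes = toy.groupby('key').size()

# Assumption: this pandas version supports dropna=False in groupby
with_na = toy.groupby('key', dropna=False).size()

print(default_sizes.tolist())   # [2, 1]
print(len(with_na))             # 3
print(int(with_na.sum()))       # 5
```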

In [28]:
df.head()

      data1     data2 key1 key2
0 -0.586185 -1.865190    a  one
1 -0.086984 -0.081112    a  two
2  1.458549  1.659470    b  one
3 -1.868374 -0.425024    b  two
4  2.101978 -0.875400    a  one


## Iterating Over Groups

The GroupBy object supports iteration, generating a sequence of 2-tuples containing
the group name along with the chunk of data. Consider the following loop over the
example data set:

In [27]:
for name, group in df.groupby('key1'):
    print(name)
    print(group)

a
      data1     data2 key1 key2
0 -0.586185 -1.865190    a  one
1 -0.086984 -0.081112    a  two
4  2.101978 -0.875400    a  one
b
      data1     data2 key1 key2
2  1.458549  1.659470    b  one
3 -1.868374 -0.425024    b  two


In the case of multiple keys, the first element in the tuple will be a tuple of key values:

In [29]:
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print(k1, k2)
    print(group)

a one
      data1    data2 key1 key2
0 -0.586185 -1.86519    a  one
4  2.101978 -0.87540    a  one
a two
      data1     data2 key1 key2
1 -0.086984 -0.081112    a  two
b one
      data1    data2 key1 key2
2  1.458549  1.65947    b  one
b two
      data1     data2 key1 key2
3 -1.868374 -0.425024    b  two


Of course, you can choose to do whatever you want with the pieces of data. A recipe
you may find useful is computing a dict of the data pieces as a one-liner:

In [30]:
pieces = dict(list(df.groupby('key1')))

In [31]:
pieces['b']

      data1     data2 key1 key2
2  1.458549  1.659470    b  one
3 -1.868374 -0.425024    b  two


By default groupby groups on axis=0, but you can group on any of the other axes. For
example, we could group the columns of our example df here by dtype like so:

In [33]:
df.dtypes

data1    float64
data2    float64
key1      object
key2      object
dtype: object

In [34]:
grouped = df.groupby(df.dtypes, axis=1)

In [35]:
dict(list(grouped))

{dtype('float64'):       data1     data2
 0 -0.586185 -1.865190
 1 -0.086984 -0.081112
 2  1.458549  1.659470
 3 -1.868374 -0.425024
 4  2.101978 -0.875400, dtype('O'):   key1 key2
 0    a  one
 1    a  two
 2    b  one
 3    b  two
 4    a  one}
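Newer pandas releases deprecate passing axis=1 to groupby, as far as I know. One way to get the same per-dtype pieces without it is to collect the column names under each dtype in plain Python. This is only a sketch; `toy`, `cols_by_dtype`, and `pieces` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one text column and two float columns
toy = pd.DataFrame({'key1': ['a', 'b', 'a'],
                    'data1': [1.0, 2.0, 3.0],
                    'data2': [4.0, 5.0, 6.0]})

# Collect column names under their dtype
cols_by_dtype = {}
for col, dtype in toy.dtypes.items():
    cols_by_dtype.setdefault(dtype, []).append(col)

# One sub-DataFrame per dtype, keyed by the dtype object
pieces = {dtype: toy[cols] for dtype, cols in cols_by_dtype.items()}

print(list(pieces[np.dtype('float64')].columns))   # ['data1', 'data2']
```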

## Selecting a Column or Subset of Columns

Indexing a GroupBy object created from a DataFrame with a column name or array of
column names has the effect of selecting those columns for aggregation. This means that:

In [36]:
df.groupby('key1')['data1']
df.groupby('key1')[['data2']]

<pandas.core.groupby.DataFrameGroupBy object at 0x10b10c6a0>

are syntactic sugar for:

In [37]:
df['data1'].groupby(df['key1'])
df[['data2']].groupby(df['key1'])

<pandas.core.groupby.DataFrameGroupBy object at 0x10b1435c0>

Especially for large data sets, it may be desirable to aggregate only a few columns. For
example, in the above data set, to compute means for just the data2 column and get
the result as a DataFrame, we could write:

In [38]:
df.groupby(['key1', 'key2'])[['data2']].mean()

              data2
key1 key2
a    one  -1.370295
     two  -0.081112
b    one   1.659470
     two  -0.425024


The object returned by this indexing operation is a grouped DataFrame if a list or array
is passed, and a grouped Series if just a single column name is passed as a scalar:

In [39]:
s_grouped = df.groupby(['key1', 'key2'])['data2']

In [44]:
s_grouped

<pandas.core.groupby.SeriesGroupBy object at 0x10b10cb00>

In [46]:
s_grouped.mean()

key1  key2
a     one    -1.370295
      two    -0.081112
b     one     1.659470
      two    -0.425024
Name: data2, dtype: float64
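A short sketch with fixed values ties the pieces of this section together: list selection gives a DataFrame, scalar selection gives a Series, and the indexing shorthand matches the longer form that groups the column directly. The `toy` frame is a hypothetical stand-in for df.

```python
import pandas as pd

# Hypothetical stand-in for df with fixed values
toy = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                    'key2': ['one', 'two', 'one', 'two', 'one'],
                    'data2': [10.0, 20.0, 30.0, 40.0, 60.0]})

g = toy.groupby(['key1', 'key2'])

as_frame = g[['data2']].mean()   # list of names -> DataFrame result
as_series = g['data2'].mean()    # single name   -> Series result

# The longer form that the shorthand stands for
long_form = toy['data2'].groupby([toy['key1'], toy['key2']]).mean()

print(type(as_frame).__name__)   # DataFrame
print(type(as_series).__name__)  # Series
print(as_series.tolist())        # [35.0, 20.0, 30.0, 40.0]
print(long_form.tolist())        # [35.0, 20.0, 30.0, 40.0]
```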

## Grouping with Dicts and Series

Grouping information may exist in a form other than an array. Let’s consider another
example DataFrame:

In [68]:
people = DataFrame(np.random.randn(5, 5),
    columns=['a', 'b', 'c', 'd', 'e'],
    index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])

In [69]:
people.head()

               a         b         c         d         e
Joe     1.566123  0.323672  0.272837  0.150506 -0.038586
Steve  -0.165051  1.344181 -0.007014 -0.849705 -0.760626
Wes    -0.812433 -0.635232  1.171538  0.339441 -1.289153
Jim    -1.143791  0.591405 -1.009851  0.887563 -0.636839
Travis -1.410973  0.177429 -0.596788  0.017589 -1.504730


In [70]:
people.iloc[2:3, [1, 2]] = np.nan # Add a few NA values (row 'Wes', columns 'b' and 'c')

In [71]:
people

               a         b         c         d         e
Joe     1.566123  0.323672  0.272837  0.150506 -0.038586
Steve  -0.165051  1.344181 -0.007014 -0.849705 -0.760626
Wes    -0.812433       NaN       NaN  0.339441 -1.289153
Jim    -1.143791  0.591405 -1.009851  0.887563 -0.636839
Travis -1.410973  0.177429 -0.596788  0.017589 -1.504730


Now, suppose I have a group correspondence for the columns and want to sum together
the columns by group:

In [72]:
mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
        'd': 'blue', 'e': 'red', 'f' : 'orange'}

Now, you could easily construct an array from this dict to pass to groupby, but instead
we can just pass the dict:

In [74]:
by_column = people.groupby(mapping, axis=1)

In [75]:
by_column.sum()

            blue       red
Joe     0.423343  1.851209
Steve  -0.856719  0.418504
Wes     0.339441 -2.101587
Jim    -0.122288 -1.189224
Travis -0.579199 -2.738274


The same functionality holds for Series, which can be viewed as a fixed-size mapping.
When I used Series as group keys in the above examples, pandas did, in fact, inspect
each Series to ensure that its index was aligned with the axis it was grouping:

In [76]:
map_series = Series(mapping)

In [77]:
map_series

a       red
b       red
c      blue
d      blue
e       red
f    orange
dtype: object

In [78]:
people.groupby(map_series, axis=1).count()

        blue  red
Joe        2    3
Steve      2    3
Wes        1    2
Jim        2    3
Travis     2    3
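In pandas versions where axis=1 is no longer accepted by groupby, the same column grouping can be sketched by transposing first, grouping the rows, and transposing back. The small `toy` frame and the extra key 'z' are hypothetical; 'z' plays the role of 'f' in the mapping above, an unused key that is simply ignored.

```python
import pandas as pd

# Hypothetical frame with fixed values
toy = pd.DataFrame([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
                   index=['Joe', 'Wes'], columns=['a', 'b', 'c'])

# 'z' matches no column, so it has no effect on the result
mapping = {'a': 'red', 'b': 'red', 'c': 'blue', 'z': 'orange'}

# Transpose so columns become rows, group, then transpose back
by_dict = toy.T.groupby(mapping).sum().T

# A Series key is aligned on its index in the same way
by_series = toy.T.groupby(pd.Series(mapping)).sum().T

print(list(by_dict.columns))        # ['blue', 'red']
print(by_dict.loc['Joe'].tolist())  # [3.0, 3.0]
print(by_dict.loc['Wes'].tolist())  # [6.0, 9.0]
```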


## Grouping with Functions

Using a Python function is a more abstract way of defining a group mapping than using
a dict or Series, and it allows for fairly creative groupings. Any function passed as a
group key will be called once per index value, with the return values being used as the
group names. More concretely, consider the example DataFrame from the previous
section, which has people’s first names as index values. Suppose you wanted to group
by the length of the names; you could compute an array of string lengths, but instead
you can just pass the len function:

In [79]:
people.groupby(len).sum()

          a         b         c         d         e
3 -0.390101  0.915077 -0.737014  1.377509 -1.964577
5 -0.165051  1.344181 -0.007014 -0.849705 -0.760626
6 -1.410973  0.177429 -0.596788  0.017589 -1.504730


Mixing functions with arrays, dicts, or Series is not a problem as everything gets converted
to arrays internally:

In [80]:
key_list = ['one', 'one', 'one', 'two', 'two']

In [81]:
people.groupby([len, key_list]).min()

              a         b         c         d         e
3 one -0.812433  0.323672  0.272837  0.150506 -1.289153
  two -1.143791  0.591405 -1.009851  0.887563 -0.636839
5 one -0.165051  1.344181 -0.007014 -0.849705 -0.760626
6 two -1.410973  0.177429 -0.596788  0.017589 -1.504730
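To see what passing a function as a key amounts to, this sketch builds the key array by hand and compares it with passing len directly. The `toy` frame is hypothetical and uses fixed integers.

```python
import pandas as pd

# Hypothetical frame indexed by first names
toy = pd.DataFrame({'x': [1, 2, 3, 4]},
                   index=['Joe', 'Steve', 'Wes', 'Travis'])

# Function key: len is called once per index label
by_func = toy.groupby(len).sum()

# The array that the function key stands for
lengths = [len(name) for name in toy.index]
by_array = toy.groupby(lengths).sum()

print(by_func.index.tolist())    # [3, 5, 6]
print(by_func['x'].tolist())     # [4, 2, 4]
print(by_array['x'].tolist())    # [4, 2, 4]
```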


## Grouping by Index Levels

A final convenience for hierarchically indexed data sets is the ability to aggregate using
one of the levels of an axis index. To do this, pass the level number or name using the
level keyword:

In [82]:
columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
    [1, 3, 5, 1, 3]], names=['cty', 'tenor'])

In [83]:
hier_df = DataFrame(np.random.randn(4, 5), columns=columns)

In [84]:
hier_df

cty          US                            JP
tenor         1         3         5         1         3
0     -0.895138 -0.282767  0.386815 -1.057029  0.891477
1     -0.221725  0.729446 -2.031792  1.454463  0.406126
2      1.043111  0.964740 -1.099683 -0.326499  1.402363
3      0.137706  0.577253 -1.184296 -1.741245 -0.539967


In [85]:
hier_df.groupby(level='cty', axis=1).count()

cty  JP  US
0     2   3
1     2   3
2     2   3
3     2   3
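The level keyword accepts either the level name or its number. The sketch below uses fixed integers and groups the transposed frame, which sidesteps axis=1 for pandas versions that no longer accept it; `toy` is a hypothetical stand-in for hier_df.

```python
import pandas as pd

columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                     [1, 3, 5, 1, 3]],
                                    names=['cty', 'tenor'])

# Hypothetical stand-in for hier_df with fixed values
toy = pd.DataFrame([[1, 2, 3, 4, 5],
                    [6, 7, 8, 9, 10]], columns=columns)

# Transpose so the column levels become row levels, then group by level
by_name = toy.T.groupby(level='cty').sum().T
by_number = toy.T.groupby(level=0).sum().T

print(list(by_name.columns))      # ['JP', 'US']
print(by_name.loc[0].tolist())    # [9, 6]
print(by_number.loc[1].tolist())  # [19, 21]
```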


In [86]:
df


      data1     data2 key1 key2
0 -0.586185 -1.865190    a  one
1 -0.086984 -0.081112    a  two
2  1.458549  1.659470    b  one
3 -1.868374 -0.425024    b  two
4  2.101978 -0.875400    a  one
