# Chapter 10. Data Aggregation and Group Operations
<a id='index'></a>
In this chapter, you will learn how to:
* Split a pandas object into pieces using one or more keys (in the form of functions, arrays, or DataFrame column names)
* Calculate group summary statistics, like count, mean, or standard deviation, or a user-defined function
* Apply within-group transformations or other manipulations, like normalization, linear regression, rank, or subset selection
* Compute pivot tables and cross-tabulations
* Perform quantile analysis and other statistical group analyses

## Table of Contents
- [10.1 GroupBy Mechanics](#101)
    - [10.1.1 Iterating Over Groups](#1011)
    - [10.1.2 Selecting a Column or Subset of Columns](#1012)
    - [10.1.3 Grouping with Dicts and Series](#1013)
    - [10.1.4 Grouping with Functions](#1014)
    - [10.1.5 Grouping by Index Levels](#1015)
- [10.2 Data Aggregation](#102)
    - [10.2.1 Column-Wise and Multiple Function Application](#1021)

In [21]:
import pandas as pd
import numpy as np

## 10.1 GroupBy Mechanics
<a id='101'></a>
Each grouping key can take many forms, and the keys do not have to be all of the same type:
* A list or array of values that is the same length as the axis being grouped 
* A value indicating a column name in a DataFrame
* A dict or Series giving a correspondence between the values on the axis being grouped and the group names
* A function to be invoked on the axis index or the individual labels in the index
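
The four key forms listed above can be sketched side by side. This is a minimal illustration on a hypothetical frame named `toy` (not the chapter's `df`); the labels and values are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, chosen so the group sums are easy to check by hand
toy = pd.DataFrame({'key': ['a', 'b', 'a', 'b'],
                    'value': [1.0, 2.0, 3.0, 4.0]},
                   index=['w', 'xx', 'y', 'zz'])

# 1. An array of values with the same length as the axis being grouped
by_array = toy['value'].groupby(np.array(['u', 'u', 'v', 'v'])).sum()

# 2. A value naming a column of the DataFrame
by_column_name = toy.groupby('key')['value'].sum()

# 3. A dict mapping index labels to group names
label_to_group = {'w': 'first', 'xx': 'first', 'y': 'second', 'zz': 'second'}
by_dict = toy['value'].groupby(label_to_group).sum()

# 4. A function invoked on each index label (here, the label length)
by_function = toy['value'].groupby(len).sum()

print(by_array, by_column_name, by_dict, by_function, sep='\n\n')
```

Each call splits the same four values differently, which shows that the key only has to tell pandas which group each row belongs to.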

In [4]:
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5),
                   'data2' : np.random.randn(5)})

df

Unnamed: 0,data1,data2,key1,key2
0,-0.522937,0.021566,a,one
1,-0.563996,0.466624,a,two
2,-2.089548,-1.227312,b,one
3,0.011069,-0.145358,b,two
4,-0.441027,1.476595,a,one


In [6]:
# To compute the mean of the data1 column using the labels from key1.
# Method 1
grouped = df['data1'].groupby(df['key1'])

# This grouped variable is now a GroupBy object
grouped

<pandas.core.groupby.SeriesGroupBy object at 0x10f862160>

In [7]:
grouped.mean()

key1
a   -0.50932
b   -1.03924
Name: data1, dtype: float64

In [9]:
# If instead we had passed multiple arrays as a list, we'd get something different:
means = df['data1'].groupby([df['key1'], df['key2']]).mean()

means

key1  key2
a     one    -0.481982
      two    -0.563996
b     one    -2.089548
      two     0.011069
Name: data1, dtype: float64

In [10]:
means.unstack()

key2,one,two
key1,Unnamed: 1_level_1,Unnamed: 2_level_1
a,-0.481982,-0.563996
b,-2.089548,0.011069


In [13]:
# In the previous example, the group keys were all Series, though they could be any arrays of the right length:
states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])
years = np.array([2005, 2005, 2006, 2005, 2006])

df['data1'].groupby([states, years]).mean()

California  2005   -0.563996
            2006   -2.089548
Ohio        2005   -0.255934
            2006   -0.441027
Name: data1, dtype: float64

In [18]:
# Frequently the grouping information is found in the same DataFrame as the data you want to work on. 
# In that case, you can pass column names (whether those are strings, numbers, or other Python objects) 
# as the group keys:

df.groupby('key1').mean()

# You may have noticed that the result of df.groupby('key1').mean() has no key2
# column. Because df['key2'] is not numeric data, it is said to be a nuisance
# column, which is therefore excluded from the result.
# Note: pandas 2.0 and later no longer drop such columns silently; there,
# pass numeric_only=True to mean() to get the same result.

Unnamed: 0_level_0,data1,data2
key1,Unnamed: 1_level_1,Unnamed: 2_level_1
a,-0.50932,0.654928
b,-1.03924,-0.686335


In [19]:
df.groupby(['key1', 'key2']).mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,data1,data2
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,one,-0.481982,0.74908
a,two,-0.563996,0.466624
b,one,-2.089548,-1.227312
b,two,0.011069,-0.145358


In [20]:
df.groupby(['key1', 'key2']).size()

key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64

### 10.1.1 Iterating Over Groups
<a id='1011'></a>
The GroupBy object supports iteration, generating a sequence of 2-tuples containing the group name along with the chunk of data:

In [26]:
for name, group in df.groupby('key1'):
    print("Name: {0}\nGroup:\n{1}\n".format(name, group))

Name: a
Group:
      data1     data2 key1 key2
0 -0.522937  0.021566    a  one
1 -0.563996  0.466624    a  two
4 -0.441027  1.476595    a  one

Name: b
Group:
      data1     data2 key1 key2
2 -2.089548 -1.227312    b  one
3  0.011069 -0.145358    b  two



In [27]:
# Multiple keys
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print("Name: {0}\nGroup:\n{1}\n".format((k1, k2), group))

Name: ('a', 'one')
Group:
      data1     data2 key1 key2
0 -0.522937  0.021566    a  one
4 -0.441027  1.476595    a  one

Name: ('a', 'two')
Group:
      data1     data2 key1 key2
1 -0.563996  0.466624    a  two

Name: ('b', 'one')
Group:
      data1     data2 key1 key2
2 -2.089548 -1.227312    b  one

Name: ('b', 'two')
Group:
      data1     data2 key1 key2
3  0.011069 -0.145358    b  two



In [28]:
# A recipe you may find useful is computing a dict of the data pieces as a one-liner
pieces = dict(list(df.groupby('key1')))
pieces['b']

Unnamed: 0,data1,data2,key1,key2
2,-2.089548,-1.227312,b,one
3,0.011069,-0.145358,b,two


In [29]:
df.dtypes

data1    float64
data2    float64
key1      object
key2      object
dtype: object

In [31]:
# By default groupby groups on axis=0, but you can group on any of the other axes.
grouped = df.groupby(df.dtypes, axis=1)

In [32]:
for dtype, group in grouped:
    print("Dtype: {0}\nGroup:\n{1}\n".format(dtype, group))

Dtype: float64
Group:
      data1     data2
0 -0.522937  0.021566
1 -0.563996  0.466624
2 -2.089548 -1.227312
3  0.011069 -0.145358
4 -0.441027  1.476595

Dtype: object
Group:
  key1 key2
0    a  one
1    a  two
2    b  one
3    b  two
4    a  one



### 10.1.2 Selecting a Column or Subset of Columns
<a id='1012'></a>
Indexing a GroupBy object created from a DataFrame with a column name or array of column names has the effect of column subsetting for aggregation. This means that:
> * df.groupby('key1')['data1']
> * df.groupby('key1')[['data2']]

are syntactic sugar for:
> * df['data1'].groupby(df['key1'])
> * df[['data2']].groupby(df['key1'])
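
The equivalence above can be checked directly. The following is a small sketch on a hypothetical frame `toy` with made-up values; it compares the indexed GroupBy form against the explicit form for both the scalar and the list case.

```python
import pandas as pd

# Hypothetical toy data standing in for the chapter's df
toy = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                    'data1': [1.0, 2.0, 3.0, 4.0, 5.0],
                    'data2': [10.0, 20.0, 30.0, 40.0, 50.0]})

# Scalar column name: both forms produce a Series
sugar_series = toy.groupby('key1')['data1'].mean()
explicit_series = toy['data1'].groupby(toy['key1']).mean()

# List of column names: both forms produce a DataFrame
sugar_frame = toy.groupby('key1')[['data2']].mean()
explicit_frame = toy[['data2']].groupby(toy['key1']).mean()

print(sugar_series.equals(explicit_series))
print(sugar_frame.equals(explicit_frame))
```

The indexed form is usually preferable on large frames because only the selected columns are aggregated.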

In [37]:
df.groupby(['key1', 'key2'])[['data2']].mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,data2
key1,key2,Unnamed: 2_level_1
a,one,0.74908
a,two,0.466624
b,one,-1.227312
b,two,-0.145358


In [41]:
# The object returned by this indexing operation is a grouped DataFrame if a list or 
# array is passed or a grouped Series if only a single column name is passed as a scalar:

s_grouped = df.groupby(['key1', 'key2'])['data2']
s_grouped

<pandas.core.groupby.SeriesGroupBy object at 0x10f906da0>

In [42]:
s_grouped.mean()

key1  key2
a     one     0.749080
      two     0.466624
b     one    -1.227312
      two    -0.145358
Name: data2, dtype: float64

### 10.1.3 Grouping with Dicts and Series
<a id='1013'></a>

In [43]:
people = pd.DataFrame(np.random.randn(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])

people.iloc[2:3, [1, 2]] = np.nan # Add a few NA values

people

Unnamed: 0,a,b,c,d,e
Joe,-0.360381,-0.277191,-1.026461,0.283021,0.635666
Steve,-0.211355,-0.660918,-0.514262,-0.579179,-0.638758
Wes,-0.277641,,,-1.082167,1.019958
Jim,-0.043795,0.463472,-0.644391,-0.229198,1.236472
Travis,-1.897389,0.853288,2.183437,-0.57007,0.651539


In [46]:
# Now, suppose you have a group correspondence for the columns and want to sum together the columns by group:
mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f': 'orange'}

# You could construct an array from this dict to pass to groupby, but instead we can just pass the dict.
# The unused 'f' key is included to show that unused grouping keys are OK.
by_column = people.groupby(mapping, axis=1)
by_column.sum()

Unnamed: 0,blue,red
Joe,-0.74344,-0.001905
Steve,-1.093441,-1.511031
Wes,-1.082167,0.742317
Jim,-0.873589,1.656149
Travis,1.613367,-0.392562


In [47]:
# The same functionality holds for Series, which can be viewed as a fixed-size mapping:
map_series = pd.Series(mapping)

people.groupby(map_series, axis=1).count()

Unnamed: 0,blue,red
Joe,2,3
Steve,2,3
Wes,1,2
Jim,2,3
Travis,2,3
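
The `axis=1` argument used in the cells above is deprecated in recent pandas releases (2.1 and later). A possible alternative, sketched here on a hypothetical frame `toy_people` with seeded random values, is to transpose, group the rows, and transpose back.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the people frame, with a fixed seed for repeatability
rng = np.random.default_rng(0)
toy_people = pd.DataFrame(rng.standard_normal((3, 5)),
                          columns=['a', 'b', 'c', 'd', 'e'],
                          index=['Joe', 'Steve', 'Wes'])
toy_people.loc['Wes', ['b', 'c']] = np.nan  # add a few NA values

mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f': 'orange'}

# After transposing, the column labels become row labels, so the dict
# can be used as an ordinary row-wise grouping key.
by_color = toy_people.T.groupby(mapping).sum().T

print(by_color)
```

Transposing works cleanly here because every column is numeric; on a frame with mixed dtypes the transpose would upcast everything to `object`.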


### 10.1.4 Grouping with Functions
<a id='1014'></a>

In [49]:
# Suppose you wanted to group by the length of the names; while you could compute an array of string lengths, 
# it’s simpler to just pass the len function:

people.groupby(len).sum()

Unnamed: 0,a,b,c,d,e
3,-0.681817,0.186281,-1.670853,-1.028344,2.892096
5,-0.211355,-0.660918,-0.514262,-0.579179,-0.638758
6,-1.897389,0.853288,2.183437,-0.57007,0.651539


In [50]:
# Mixing functions with arrays, dicts, or Series is not a problem as everything gets converted to arrays internally:
key_list = ['one', 'one', 'one', 'two', 'two']

people.groupby([len, key_list]).min()

Unnamed: 0,Unnamed: 1,a,b,c,d,e
3,one,-0.360381,-0.277191,-1.026461,-1.082167,0.635666
3,two,-0.043795,0.463472,-0.644391,-0.229198,1.236472
5,one,-0.211355,-0.660918,-0.514262,-0.579179,-0.638758
6,two,-1.897389,0.853288,2.183437,-0.57007,0.651539


### 10.1.5 Grouping by Index Levels
<a id='1015'></a>
A final convenience for hierarchically indexed datasets is the ability to aggregate using one of the levels of an axis index.

In [53]:
columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'], [1, 3, 5, 1, 3]],
                                    names=['cty', 'tenor'])

hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)

hier_df

cty,US,US,US,JP,JP
tenor,1,3,5,1,3
0,0.523435,-0.960708,-0.629164,-0.200194,0.433677
1,1.347963,0.586102,0.706475,1.587471,0.089473
2,1.445327,-1.499973,-0.034685,-0.565481,-0.647728
3,0.850024,-0.853514,-0.705422,0.497624,1.21282


In [59]:
# To group by level, pass the level number or name using the level keyword:
hier_df.groupby(level='cty', axis=1).count()

cty,JP,US
0,2,3
1,2,3
2,2,3
3,2,3
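
The `level` keyword works the same way on a hierarchical row index, where no `axis` argument is needed. Here is a minimal sketch, assuming a hypothetical Series `toy_rates` that reuses the `cty` and `tenor` level names with made-up values.

```python
import pandas as pd

# Hypothetical data indexed by the same two levels as hier_df's columns
index = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                   [1, 3, 5, 1, 3]],
                                  names=['cty', 'tenor'])
toy_rates = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=index)

# Group by level name
by_country = toy_rates.groupby(level='cty').sum()

# Group by level position (1 is the 'tenor' level)
by_tenor = toy_rates.groupby(level=1).mean()

print(by_country)
print(by_tenor)
```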


<hr>

## 10.2 Data Aggregation
<a id='102'></a>

In [60]:
df

Unnamed: 0,data1,data2,key1,key2
0,-0.522937,0.021566,a,one
1,-0.563996,0.466624,a,two
2,-2.089548,-1.227312,b,one
3,0.011069,-0.145358,b,two
4,-0.441027,1.476595,a,one


In [69]:
grouped = df.groupby('key1')
grouped['data1'].quantile(0.9)

key1
a   -0.457409
b   -0.198993
Name: data1, dtype: float64

In [67]:
# To use your own aggregation functions, pass any function that aggregates an array to the aggregate or agg method:
def peak_to_peak(arr):
    return arr.max() - arr.min()

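# Select the numeric columns explicitly: key2 holds strings, which cannot be subtracted
grouped[['data1', 'data2']].agg(peak_to_peak)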

Unnamed: 0_level_0,data1,data2
key1,Unnamed: 1_level_1,Unnamed: 2_level_1
a,0.122969,1.455029
b,2.100617,1.081955


In [70]:
# You may notice that some methods like describe also work, even though they are not aggregations, strictly speaking:
grouped.describe()

Unnamed: 0_level_0,data1,data1,data1,data1,data1,data1,data1,data1,data2,data2,data2,data2,data2,data2,data2,data2
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
key1,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2
a,3.0,-0.50932,0.062605,-0.563996,-0.543467,-0.522937,-0.481982,-0.441027,3.0,0.654928,0.745568,0.021566,0.244095,0.466624,0.971609,1.476595
b,2.0,-1.03924,1.48536,-2.089548,-1.564394,-1.03924,-0.514086,0.011069,2.0,-0.686335,0.765057,-1.227312,-0.956824,-0.686335,-0.415846,-0.145358


### 10.2.1 Column-Wise and Multiple Function Application
<a id='1021'></a>

<hr>

[Back to top](#index)