# Chapter 10. Data Aggregation and Group Operations
<a id='index'></a>
In this chapter, you will learn how to:
* Split a pandas object into pieces using one or more keys (in the form of functions, arrays, or DataFrame column names)
* Calculate group summary statistics, like count, mean, or standard deviation, or a user-defined function
* Apply within-group transformations or other manipulations, like normalization, linear regression, rank, or subset selection
* Compute pivot tables and cross-tabulations
* Perform quantile analysis and other statistical group analyses

## Table of Contents
- [10.1 GroupBy Mechanics](#101)
    - [10.1.1 Iterating Over Groups](#1011)
    - [10.1.2 Selecting a Column or Subset of Columns](#1012)
    - [10.1.3 Grouping with Dicts and Series](#1013)
    - [10.1.4 Grouping with Functions](#1014)
    - [10.1.5 Grouping by Index Levels](#1015)
- [10.2 Data Aggregation](#102)
    - [10.2.1 Column-Wise and Multiple Function Application](#1021)
    - [10.2.2 Returning Aggregated Data Without Row Indexes](#1022)

In [12]:
import pandas as pd
import numpy as np

## 10.1 GroupBy Mechanics
<a id='101'></a>
Each grouping key can take many forms, and the keys do not have to be all of the same type:
* A list or array of values that is the same length as the axis being grouped 
* A value indicating a column name in a DataFrame
* A dict or Series giving a correspondence between the values on the axis being grouped and the group names
* A function to be invoked on the axis index or the individual labels in the index
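
The four kinds of key listed above can be sketched on a small frame. The frame `demo`, its labels, and its values are made up for illustration and are not the chapter's data; they are chosen so the group sums are easy to check by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical data, not the chapter's df
demo = pd.DataFrame({'key': ['a', 'b', 'a', 'b'],
                     'value': [1, 2, 3, 4]},
                    index=['w', 'xx', 'y', 'zz'])

# 1. An array with one label per row
by_array = demo['value'].groupby(np.array(['u', 'u', 'v', 'v'])).sum()

# 2. A column name
by_column = demo.groupby('key')['value'].sum()

# 3. A dict mapping index labels to group names
by_dict = demo['value'].groupby({'w': 'short', 'y': 'short',
                                 'xx': 'long', 'zz': 'long'}).sum()

# 4. A function called once per index label
by_func = demo['value'].groupby(len).sum()

for result in (by_array, by_column, by_dict, by_func):
    print(result, end='\n\n')
```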

In [13]:
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5),
                   'data2' : np.random.randn(5)})

df

      data1     data2 key1 key2
0 -0.262195  0.128915    a  one
1 -1.333046 -0.385110    a  two
2 -1.113341 -0.207467    b  one
3  0.486929 -0.636922    b  two
4  0.607649  0.787095    a  one


In [14]:
# To compute the mean of the data1 column using the labels from key1.
# Method 1
grouped = df['data1'].groupby(df['key1'])

# This grouped variable is now a GroupBy object
grouped

<pandas.core.groupby.SeriesGroupBy object at 0x1062c9320>

In [15]:
grouped.mean()

key1
a   -0.329197
b   -0.313206
Name: data1, dtype: float64

In [16]:
# If instead we had passed multiple arrays as a list, we'd get something different:
means = df['data1'].groupby([df['key1'], df['key2']]).mean()

means

key1  key2
a     one     0.172727
      two    -1.333046
b     one    -1.113341
      two     0.486929
Name: data1, dtype: float64

In [17]:
means.unstack()

key2       one       two
key1
a     0.172727 -1.333046
b    -1.113341  0.486929


In [18]:
# In this example, the group keys are all Series, though they could be any arrays of the right length:
states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])
years = np.array([2005, 2005, 2006, 2005, 2006])

df['data1'].groupby([states, years]).mean()

California  2005   -1.333046
            2006   -1.113341
Ohio        2005    0.112367
            2006    0.607649
Name: data1, dtype: float64

In [19]:
# Frequently the grouping information is found in the same DataFrame as the data you want to work on. 
# In that case, you can pass column names (whether those are strings, numbers, or other Python objects) 
# as the group keys:

df.groupby('key1').mean()

# You may have noticed that the result of df.groupby('key1').mean() has no key2
# column. Because df['key2'] is not numeric data, it is said to be a nuisance
# column, which is therefore excluded from the result.
# (Recent pandas versions no longer drop such columns automatically; there,
# mean(numeric_only=True) produces this output.)

         data1     data2
key1
a    -0.329197  0.176967
b    -0.313206 -0.422195


In [20]:
df.groupby(['key1', 'key2']).mean()

              data1     data2
key1 key2
a    one   0.172727  0.458005
     two  -1.333046 -0.385110
b    one  -1.113341 -0.207467
     two   0.486929 -0.636922


In [21]:
df.groupby(['key1', 'key2']).size()

key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64

### 10.1.1 Iterating Over Groups
<a id='1011'></a>
The GroupBy object supports iteration, generating a sequence of 2-tuples containing the group name along with the chunk of data:

In [22]:
for name, group in df.groupby('key1'):
    print("Name: {0}\nGroup:\n{1}\n".format(name, group))

Name: a
Group:
      data1     data2 key1 key2
0 -0.262195  0.128915    a  one
1 -1.333046 -0.385110    a  two
4  0.607649  0.787095    a  one

Name: b
Group:
      data1     data2 key1 key2
2 -1.113341 -0.207467    b  one
3  0.486929 -0.636922    b  two



In [23]:
# Multiple keys
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print("Name: {0}\nGroup:\n{1}\n".format((k1, k2), group))

Name: ('a', 'one')
Group:
      data1     data2 key1 key2
0 -0.262195  0.128915    a  one
4  0.607649  0.787095    a  one

Name: ('a', 'two')
Group:
      data1    data2 key1 key2
1 -1.333046 -0.38511    a  two

Name: ('b', 'one')
Group:
      data1     data2 key1 key2
2 -1.113341 -0.207467    b  one

Name: ('b', 'two')
Group:
      data1     data2 key1 key2
3  0.486929 -0.636922    b  two



In [24]:
# A recipe you may find useful is computing a dict of the data pieces as a one-liner
pieces = dict(list(df.groupby('key1')))
pieces['b']

      data1     data2 key1 key2
2 -1.113341 -0.207467    b  one
3  0.486929 -0.636922    b  two


In [25]:
df.dtypes

data1    float64
data2    float64
key1      object
key2      object
dtype: object

In [26]:
# By default groupby groups on axis=0, but you can group on any of the other axes.
grouped = df.groupby(df.dtypes, axis=1)

In [27]:
for dtype, group in grouped:
    print("Dtype: {0}\nGroup:\n{1}\n".format(dtype, group))

Dtype: float64
Group:
      data1     data2
0 -0.262195  0.128915
1 -1.333046 -0.385110
2 -1.113341 -0.207467
3  0.486929 -0.636922
4  0.607649  0.787095

Dtype: object
Group:
  key1 key2
0    a  one
1    a  two
2    b  one
3    b  two
4    a  one



### 10.1.2 Selecting a Column or Subset of Columns
<a id='1012'></a>
Indexing a GroupBy object created from a DataFrame with a column name or array of column names has the effect of column subsetting for aggregation. This means that:
> * df.groupby('key1')['data1']
> * df.groupby('key1')[['data2']]

are syntactic sugar for:
> * df['data1'].groupby(df['key1'])
> * df[['data2']].groupby(df['key1'])
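
A quick way to convince yourself of this equivalence is to compute the same group means both ways and compare them. The sketch below uses a small made-up frame (`demo` is not the chapter's `df`) so that the result is deterministic.

```python
import pandas as pd

# Hypothetical data with fixed values
demo = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                     'data1': [1.0, 2.0, 3.0, 4.0, 6.0],
                     'data2': [10.0, 20.0, 30.0, 40.0, 60.0]})

# Index the GroupBy object with a scalar: grouped Series
sugar = demo.groupby('key1')['data1'].mean()
# Select the column first, then group by the key column
explicit = demo['data1'].groupby(demo['key1']).mean()

# Index with a list: grouped DataFrame
sugar_df = demo.groupby('key1')[['data2']].mean()
explicit_df = demo[['data2']].groupby(demo['key1']).mean()

print(sugar.equals(explicit))
print(sugar_df.equals(explicit_df))
```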

In [28]:
df.groupby(['key1', 'key2'])[['data2']].mean()

              data2
key1 key2
a    one   0.458005
     two  -0.385110
b    one  -0.207467
     two  -0.636922


In [29]:
# The object returned by this indexing operation is a grouped DataFrame if a list or 
# array is passed or a grouped Series if only a single column name is passed as a scalar:

s_grouped = df.groupby(['key1', 'key2'])['data2']
s_grouped

<pandas.core.groupby.SeriesGroupBy object at 0x1060cdf28>

In [30]:
s_grouped.mean()

key1  key2
a     one     0.458005
      two    -0.385110
b     one    -0.207467
      two    -0.636922
Name: data2, dtype: float64

### 10.1.3 Grouping with Dicts and Series
<a id='1013'></a>

In [31]:
people = pd.DataFrame(np.random.randn(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])

people.iloc[2:3, [1, 2]] = np.nan # Add a few NA values

people

               a         b         c         d         e
Joe     0.121431  0.535703 -0.223080 -1.628091  0.069782
Steve   1.515276 -0.146037 -0.304709 -0.055596  0.526006
Wes     0.412720       NaN       NaN -0.822014  0.768018
Jim     0.147582 -0.555246 -0.382013  2.082841 -1.154042
Travis  0.288919 -0.807659 -0.715064  1.579642 -0.486272


In [32]:
# Suppose you have a group correspondence for the columns and want to sum the columns by group:
mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f': 'orange'}

# You could construct an array from this dict to pass to groupby, but it is simpler to pass the dict itself.
by_column = people.groupby(mapping, axis=1)
by_column.sum()

            blue       red
Joe    -1.851171  0.726916
Steve  -0.360305  1.895245
Wes    -0.822014  1.180738
Jim     1.700828 -1.561706
Travis  0.864578 -1.005013
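
Newer pandas releases deprecate the `axis=1` argument to `groupby`, so the call above may emit a warning or fail depending on the installed version. One alternative, sketched here on made-up integer data rather than the random `people` frame, is to transpose, group the rows, and transpose back.

```python
import pandas as pd

# Hypothetical stand-in for the people frame
demo = pd.DataFrame([[1, 2, 3, 4, 5],
                     [10, 20, 30, 40, 50]],
                    index=['Joe', 'Steve'],
                    columns=['a', 'b', 'c', 'd', 'e'])

mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f': 'orange'}

# After the transpose the former column labels are row labels, so the
# dict is matched against them; the unused key 'f' is simply ignored.
by_column = demo.T.groupby(mapping).sum().T
print(by_column)
```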


In [33]:
# The same functionality holds for Series, which can be viewed as a fixed-size mapping:
map_series = pd.Series(mapping)

people.groupby(map_series, axis=1).count()

        blue  red
Joe        2    3
Steve      2    3
Wes        1    2
Jim        2    3
Travis     2    3


### 10.1.4 Grouping with Functions
<a id='1014'></a>

In [34]:
# Suppose you wanted to group by the length of the names; while you could compute an array of string lengths, 
# it’s simpler to just pass the len function:

people.groupby(len).sum()

          a         b         c         d         e
3  0.681734 -0.019543 -0.605093 -0.367264 -0.316242
5  1.515276 -0.146037 -0.304709 -0.055596  0.526006
6  0.288919 -0.807659 -0.715064  1.579642 -0.486272


In [35]:
# Mixing functions with arrays, dicts, or Series is not a problem as everything gets converted to arrays internally:
key_list = ['one', 'one', 'one', 'two', 'two']

people.groupby([len, key_list]).min()

              a         b         c         d         e
3 one  0.121431  0.535703 -0.223080 -1.628091  0.069782
  two  0.147582 -0.555246 -0.382013  2.082841 -1.154042
5 one  1.515276 -0.146037 -0.304709 -0.055596  0.526006
6 two  0.288919 -0.807659 -0.715064  1.579642 -0.486272
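
To make the role of the function explicit, the sketch below groups once by passing `len` and once by passing a precomputed array of name lengths, then compares the two results. The frame `demo` and its single column `x` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with the same index labels as the people frame
demo = pd.DataFrame({'x': [1, 2, 3, 4, 5]},
                    index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])

# Passing the function: pandas calls len on each index label
by_func = demo.groupby(len).sum()

# Computing the lengths explicitly gives the same grouping
lengths = np.array([len(name) for name in demo.index])
by_array = demo.groupby(lengths).sum()

print(by_func)
print(by_array)
```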


### 10.1.5 Grouping by Index Levels
<a id='1015'></a>
A final convenience for hierarchically indexed datasets is the ability to aggregate using one of the levels of an axis index.

In [36]:
columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'], [1, 3, 5, 1, 3]],
                                    names=['cty', 'tenor'])

hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)

hier_df

cty          US                            JP
tenor         1         3         5         1         3
0      0.119693 -0.715328 -1.367066 -0.626788  1.830574
1     -0.148365 -0.520048  0.217353 -1.659710 -0.463134
2      1.141544 -1.382318 -1.609864  1.852783 -2.212510
3     -2.770203 -0.377639  0.144058  0.240626  0.221584


In [37]:
# To group by level, pass the level number or name using the level keyword:
hier_df.groupby(level='cty', axis=1).count()

cty  JP  US
0     2   3
1     2   3
2     2   3
3     2   3


<hr>

## 10.2 Data Aggregation
<a id='102'></a>

In [38]:
df

      data1     data2 key1 key2
0 -0.262195  0.128915    a  one
1 -1.333046 -0.385110    a  two
2 -1.113341 -0.207467    b  one
3  0.486929 -0.636922    b  two
4  0.607649  0.787095    a  one


In [39]:
grouped = df.groupby('key1')
grouped['data1'].quantile(0.9)

key1
a    0.433680
b    0.326902
Name: data1, dtype: float64
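
As a check on the numbers above, the 0.9 quantile for group a can be reproduced by hand. The sketch assumes pandas' default linear interpolation between sorted values and copies the three data1 values for key1 == 'a' from the df output shown earlier.

```python
import pandas as pd

# data1 values for key1 == 'a', copied from the df output above
a_values = pd.Series([-0.262195, -1.333046, 0.607649])

# Linear interpolation: take position q * (n - 1) in the sorted values
sorted_vals = sorted(a_values)
pos = 0.9 * (len(sorted_vals) - 1)   # 1.8, between the 2nd and 3rd values
lower = int(pos)                     # index of the value just below
frac = pos - lower                   # how far to move toward the next value
by_hand = sorted_vals[lower] + frac * (sorted_vals[lower + 1] - sorted_vals[lower])

print(by_hand, a_values.quantile(0.9))
```

The same arithmetic on the two values of group b gives -1.113341 + 0.9 * (0.486929 + 1.113341), which agrees with the 0.326902 reported above.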

In [40]:
# To use your own aggregation functions, pass any function that aggregates an array to the aggregate or agg method:
def peak_to_peak(arr):
    return arr.max() - arr.min()

grouped.agg(peak_to_peak)

         data1     data2
key1
a     1.940695  1.172205
b     1.600269  0.429455


In [41]:
# You may notice that some methods like describe also work, even though they are not aggregations, strictly speaking:
grouped.describe()

     data1
     count      mean       std       min       25%       50%       75%       max
key1
a      3.0 -0.329197  0.972081 -1.333046 -0.797620 -0.262195  0.172727  0.607649
b      2.0 -0.313206  1.131561 -1.113341 -0.713273 -0.313206  0.086861  0.486929

     data2
     count      mean       std       min       25%       50%       75%       max
key1
a      3.0  0.176967  0.587578 -0.385110 -0.128097  0.128915  0.458005  0.787095
b      2.0 -0.422195  0.303671 -0.636922 -0.529559 -0.422195 -0.314831 -0.207467


### 10.2.1 Column-Wise and Multiple Function Application
<a id='1021'></a>

In [42]:
tips = pd.read_csv('examples/tips.csv')

# Add tip percentage of total bill
tips['tip_pct'] = tips['tip'] / tips['total_bill']

tips.head()

   total_bill   tip     sex smoker  day    time  size   tip_pct
0       16.99  1.01  Female     No  Sun  Dinner     2  0.059447
1       10.34  1.66    Male     No  Sun  Dinner     3  0.160542
2       21.01  3.50    Male     No  Sun  Dinner     3  0.166587
3       23.68  3.31    Male     No  Sun  Dinner     2  0.139780
4       24.59  3.61  Female     No  Sun  Dinner     4  0.146808


In [43]:
# You may want to aggregate using a different function depending on the column,
# or multiple functions at once. Both are possible, as the following examples
# show. First, group the tips by day and smoker:
grouped = tips.groupby(['day', 'smoker'])

group_pct = grouped['tip_pct']
group_pct.agg('mean')

day   smoker
Fri   No        0.151650
      Yes       0.174783
Sat   No        0.158048
      Yes       0.147906
Sun   No        0.160113
      Yes       0.187250
Thur  No        0.160298
      Yes       0.163863
Name: tip_pct, dtype: float64

In [44]:
# If you pass a list of functions or function names instead, you get back a 
# DataFrame with column names taken from the functions:
group_pct.agg(['mean', 'std', peak_to_peak])

                 mean       std  peak_to_peak
day  smoker
Fri  No      0.151650  0.028123      0.067349
     Yes     0.174783  0.051293      0.159925
Sat  No      0.158048  0.039767      0.235193
     Yes     0.147906  0.061375      0.290095
Sun  No      0.160113  0.042347      0.193226
     Yes     0.187250  0.154134      0.644685
Thur No      0.160298  0.038774      0.193350
     Yes     0.163863  0.039389      0.151240


You don’t need to accept the names that GroupBy gives to the columns; notably, lambda functions have the name '<lambda>', which makes them hard to identify (you can see for yourself by looking at a function’s __name__ attribute). Thus, if you pass a list of (name, function) tuples, the first element of each tuple will be used as the DataFrame column name (you can think of a list of 2-tuples as an ordered mapping):

In [46]:
group_pct.agg([('foo', 'mean'), ('bar', np.std)])

                  foo       bar
day  smoker
Fri  No      0.151650  0.028123
     Yes     0.174783  0.051293
Sat  No      0.158048  0.039767
     Yes     0.147906  0.061375
Sun  No      0.160113  0.042347
     Yes     0.187250  0.154134
Thur No      0.160298  0.038774
     Yes     0.163863  0.039389
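
To see why naming matters for lambdas in particular, the sketch below checks a lambda's `__name__` and then uses (name, function) tuples to label two lambda-based aggregations. The frame `demo` and the labels `spread` and `midpoint` are made up for illustration.

```python
import pandas as pd

# Hypothetical data, not the tips dataset
demo = pd.DataFrame({'key': ['a', 'a', 'b', 'b'],
                     'value': [1.0, 3.0, 10.0, 20.0]})

spread_fn = lambda x: x.max() - x.min()
print(spread_fn.__name__)   # the name pandas would otherwise use as a label

# Each tuple supplies the column label and the function to apply
named = demo.groupby('key')['value'].agg([
    ('spread', spread_fn),
    ('midpoint', lambda x: (x.max() + x.min()) / 2),
])
print(named)
```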


With a DataFrame you have more options, as you can specify a list of functions to apply to all of the columns or different functions per column. To start, suppose we wanted to compute the same three statistics for the tip_pct and total_bill columns:

In [47]:
functions = ['count', 'mean', 'max']

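# Select several columns with a list (double brackets); newer pandas versions
# reject the tuple form grouped['tip_pct', 'total_bill'].
result = grouped[['tip_pct', 'total_bill']].agg(functions)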
result

            tip_pct                     total_bill
              count      mean       max      count       mean    max
day  smoker
Fri  No           4  0.151650  0.187735          4  18.420000  22.75
     Yes         15  0.174783  0.263480         15  16.813333  40.17
Sat  No          45  0.158048  0.291990         45  19.661778  48.33
     Yes         42  0.147906  0.325733         42  21.276667  50.81
Sun  No          57  0.160113  0.252672         57  20.506667  48.17
     Yes         19  0.187250  0.710345         19  24.120000  45.35
Thur No          45  0.160298  0.266312         45  17.113111  41.19
     Yes         17  0.163863  0.241255         17  19.190588  43.11


In [48]:
result['tip_pct']

             count      mean       max
day  smoker
Fri  No          4  0.151650  0.187735
     Yes        15  0.174783  0.263480
Sat  No         45  0.158048  0.291990
     Yes        42  0.147906  0.325733
Sun  No         57  0.160113  0.252672
     Yes        19  0.187250  0.710345
Thur No         45  0.160298  0.266312
     Yes        17  0.163863  0.241255


In [49]:
# As before, a list of tuples with custom names can be passed
# (the German labels mean "average" and "deviation"; the second column holds the variance)
ftuples = [('Durchschnitt', 'mean'), ('Abweichung', np.var)]
grouped[['tip_pct', 'total_bill']].agg(ftuples)

                 tip_pct              total_bill
            Durchschnitt Abweichung Durchschnitt  Abweichung
day  smoker
Fri  No         0.151650   0.000791    18.420000   25.596333
     Yes        0.174783   0.002631    16.813333   82.562438
Sat  No         0.158048   0.001581    19.661778   79.908965
     Yes        0.147906   0.003767    21.276667  101.387535
Sun  No         0.160113   0.001793    20.506667   66.099980
     Yes        0.187250   0.023757    24.120000  109.046044
Thur No         0.160298   0.001503    17.113111   59.625081
     Yes        0.163863   0.001551    19.190588   69.808518


Now, suppose you wanted to apply potentially different functions to one or more of the columns. To do this, pass a dict to agg that contains a mapping of column names to any of the function specifications listed so far:

In [50]:
grouped.agg({'tip': np.max, 'size': 'sum'})

               tip  size
day  smoker
Fri  No       3.50     9
     Yes      4.73    31
Sat  No       9.00   115
     Yes     10.00   104
Sun  No       6.00   167
     Yes      6.50    49
Thur No       6.70   112
     Yes      5.00    40


In [51]:
grouped.agg({'tip_pct': ['min', 'max', 'mean', 'std'], 'size': 'sum'})

              tip_pct                               size
                  min       max      mean       std  sum
day  smoker
Fri  No      0.120385  0.187735  0.151650  0.028123    9
     Yes     0.103555  0.263480  0.174783  0.051293   31
Sat  No      0.056797  0.291990  0.158048  0.039767  115
     Yes     0.035638  0.325733  0.147906  0.061375  104
Sun  No      0.059447  0.252672  0.160113  0.042347  167
     Yes     0.065660  0.710345  0.187250  0.154134   49
Thur No      0.072961  0.266312  0.160298  0.038774  112
     Yes     0.090014  0.241255  0.163863  0.039389   40


### 10.2.2 Returning Aggregated Data Without Row Indexes
<a id='1022'></a>
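
The heading refers to results in which the group keys appear as ordinary columns instead of forming the row index. One way to get that, sketched here on made-up data, is to pass `as_index=False` to `groupby`; calling `reset_index` on the usual result gives the same values. The frame `demo` is hypothetical and only stands in for the tips data.

```python
import pandas as pd

# Hypothetical data, not the tips dataset
demo = pd.DataFrame({'day': ['Fri', 'Fri', 'Sat', 'Sat'],
                     'smoker': ['No', 'Yes', 'No', 'No'],
                     'tip': [1.0, 2.0, 3.0, 5.0]})

# Group keys stay as regular columns; the rows get a default integer index
flat = demo.groupby(['day', 'smoker'], as_index=False)['tip'].mean()

# Alternative: aggregate with a hierarchical index, then flatten it
also_flat = demo.groupby(['day', 'smoker'])['tip'].mean().reset_index()

print(flat)
print(also_flat)
```

Passing `as_index=False` skips building the hierarchical index, whereas `reset_index` builds it first and then flattens it.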

<hr>

[Back to top](#index)