# Chapter 10. Data aggregation and group operations

## 10.1 Groupby mechanics

The grouping key can be provided in different forms:

1. a list of values the same length as the axis being grouped
2. a column name
3. a dictionary or Series that maps the values on the axis being grouped to the group names
4. a function to be invoked on the axis index or individual index labels

The following are some examples using these methods.

In [76]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(0)

In [77]:
df = pd.DataFrame({
    'key1' : ['a', 'a', 'b', 'b', 'a'],
    'key2' : ['one', 'two', 'one', 'two', 'one'],
    'data1' : np.random.randn(5), 
    'data2' : np.random.randn(5)
})
df

Unnamed: 0,key1,key2,data1,data2
0,a,one,1.764052,-0.977278
1,a,two,0.400157,0.950088
2,b,one,0.978738,-0.151357
3,b,two,2.240893,-0.103219
4,a,one,1.867558,0.410599


In [78]:
df = pd.DataFrame({
    'key1' : ['a', 'a', 'b', 'b', 'a'],
    'key2' : ['one', 'two', 'one', 'two', 'one'],
    'data1' : np.random.randn(5), 
    'data2' : np.random.randn(5)
})
df

Unnamed: 0,key1,key2,data1,data2
0,a,one,0.144044,0.333674
1,a,two,1.454274,1.494079
2,b,one,0.761038,-0.205158
3,b,two,0.121675,0.313068
4,a,one,0.443863,-0.854096


In [79]:
# `groupby()` creates a new `GroupBy` object.
grouped = df['data1'].groupby(df['key1'])
grouped

<pandas.core.groupby.generic.SeriesGroupBy object at 0x12d2e21d0>

In [80]:
# Mean of the data in the 'data1' column, grouped by 'key1'.
grouped.mean()

key1
a    0.680727
b    0.441356
Name: data1, dtype: float64

In [81]:
# Group by two columns.
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means

key1  key2
a     one     0.293953
      two     1.454274
b     one     0.761038
      two     0.121675
Name: data1, dtype: float64

In [82]:
means.unstack()

key2       one       two
key1
a     0.293953  1.454274
b     0.761038  0.121675


If the grouping information is a column of the same DataFrame, then only the grouping column name is required.

In [83]:
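# `numeric_only=True` skips the non-numeric 'key2' column (required in pandas 2.0 and later).
df.groupby('key1').mean(numeric_only=True)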

         data1     data2
key1
a     0.680727  0.324553
b     0.441356  0.053955


In [84]:
df.groupby(['key1', 'key2']).mean()

              data1     data2
key1 key2
a    one   0.293953 -0.260211
     two   1.454274  1.494079
b    one   0.761038 -0.205158
     two   0.121675  0.313068


A frequently useful method on a grouped DataFrame is `size()`.

In [85]:
df.groupby(['key1', 'key2']).size()

key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64
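
A point worth illustrating here is how `size()` differs from `count()`: `size()` reports the number of rows in each group, while `count()` reports the number of non-null values. The following is a minimal sketch on a small made-up DataFrame (`toy` and its columns are hypothetical, not part of the data above).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; one value is missing on purpose.
toy = pd.DataFrame({
    'key': ['a', 'a', 'b', 'b', 'b'],
    'value': [1.0, np.nan, 3.0, 4.0, 5.0],
})

sizes = toy.groupby('key').size()             # rows per group, NaN rows included
counts = toy.groupby('key')['value'].count()  # non-null values per group

print(sizes)
print(counts)
```

Group `'a'` has two rows but only one non-null value, so the two results differ for that group.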

### Iterating over groups

The GroupBy object created by `groupby()` supports iteration over a sequence of 2-tuples containing the group name and the data.

In [86]:
for name, group in df.groupby('key1'):
    print(f'group name: {name}')
    print(group)
    print('')


group name: a
  key1 key2     data1     data2
0    a  one  0.144044  0.333674
1    a  two  1.454274  1.494079
4    a  one  0.443863 -0.854096

group name: b
  key1 key2     data1     data2
2    b  one  0.761038 -0.205158
3    b  two  0.121675  0.313068



In [87]:
for name, group in df.groupby(['key1', 'key2']):
    print(f'group name: {name[0]}-{name[1]}')
    print(group)
    print('')


group name: a-one
  key1 key2     data1     data2
0    a  one  0.144044  0.333674
4    a  one  0.443863 -0.854096

group name: a-two
  key1 key2     data1     data2
1    a  two  1.454274  1.494079

group name: b-one
  key1 key2     data1     data2
2    b  one  0.761038 -0.205158

group name: b-two
  key1 key2     data1     data2
3    b  two  0.121675  0.313068
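
Because iteration yields `(name, group)` pairs, the pieces can be collected into a dictionary for later lookup. This is a small sketch on hypothetical data (`toy` is made up for illustration), not the only way to access a single group.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'value': [10, 20, 30, 40, 50],
})

# Each item produced by iteration is a (name, sub-frame) pair, so the
# pairs can be collected straight into a dictionary keyed by group name.
pieces = {name: group for name, group in toy.groupby('key1')}

print(pieces['b'])
```

Each value in `pieces` is an ordinary DataFrame that keeps the original row labels.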



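By default, `groupby()` groups the rows (`axis=0`), but the columns can be grouped instead by passing `axis=1`.
Note that the `axis` argument of `groupby()` is deprecated in recent pandas releases (2.1 and later), where grouping the transposed DataFrame is the suggested alternative.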

In [88]:
df.dtypes

key1      object
key2      object
data1    float64
data2    float64
dtype: object

In [89]:
grouped = df.groupby(df.dtypes, axis=1)
for dtype, group in grouped:
    print(f'data type: {dtype}')
    print(group)
    print('')


data type: float64
      data1     data2
0  0.144044  0.333674
1  1.454274  1.494079
2  0.761038 -0.205158
3  0.121675  0.313068
4  0.443863 -0.854096

data type: object
  key1 key2
0    a  one
1    a  two
2    b  one
3    b  two
4    a  one



### Selecting a column or subset of columns

Indexing a GroupBy object with a column name (or a list of column names) selects those columns for aggregation.
The next two statements are equivalent.

In [90]:
df.groupby('key1')['data1']

<pandas.core.groupby.generic.SeriesGroupBy object at 0x12d2db190>

In [91]:
df['data1'].groupby(df['key1'])

<pandas.core.groupby.generic.SeriesGroupBy object at 0x12d2f02d0>
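
To check the equivalence, the two forms can be aggregated and compared. The sketch below uses a hypothetical DataFrame `toy`; it also shows that indexing with a list of names, rather than a single name, gives a DataFrame result.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'data1': [1.0, 2.0, 3.0, 4.0, 6.0],
    'data2': [10.0, 20.0, 30.0, 40.0, 60.0],
})

via_indexing = toy.groupby('key1')['data1'].mean()
via_series = toy['data1'].groupby(toy['key1']).mean()
print(via_indexing.equals(via_series))

# A list of names keeps the result two-dimensional (a DataFrame).
as_frame = toy.groupby('key1')[['data1']].mean()
print(type(via_indexing).__name__, type(as_frame).__name__)
```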

### Grouping with dictionaries and Series



In [92]:
people = pd.DataFrame(np.random.randn(5, 5),
                      columns=list('abcde'),
                      index=['Joe', 'Steve', 'Wex', 'Jim', 'Travis'])
people

Unnamed: 0,a,b,c,d,e
Joe,-2.55299,0.653619,0.864436,-0.742165,2.269755
Steve,-1.454366,0.045759,-0.187184,1.532779,1.469359
Wex,0.154947,0.378163,-0.887786,-1.980796,-0.347912
Jim,0.156349,1.230291,1.20238,-0.387327,-0.302303
Travis,-1.048553,-1.420018,-1.70627,1.950775,-0.509652


In [93]:
people.iloc[2, [1, 2]] = np.nan
people

Unnamed: 0,a,b,c,d,e
Joe,-2.55299,0.653619,0.864436,-0.742165,2.269755
Steve,-1.454366,0.045759,-0.187184,1.532779,1.469359
Wex,0.154947,,,-1.980796,-0.347912
Jim,0.156349,1.230291,1.20238,-0.387327,-0.302303
Travis,-1.048553,-1.420018,-1.70627,1.950775,-0.509652


Given a mapping from column names to group names, the columns can be summed by group by passing the dictionary to `groupby()`.
Keys that match no column, such as `'f'` below, are ignored.

In [94]:
mapping = {
    'a': 'red', 'b': 'red', 'c': 'blue',
    'd': 'blue', 'e': 'red', 'f' : 'orange'
}

In [95]:
by_column = people.groupby(mapping, axis=1)
by_column.sum()

Unnamed: 0,blue,red
Joe,0.122271,0.370383
Steve,1.345595,0.060752
Wex,-1.980796,-0.192965
Jim,0.815053,1.084337
Travis,0.244505,-2.978223
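
A Series can serve as the mapping in the same way as a dictionary. The sketch below uses hypothetical data and groups the transposed frame instead of passing `axis=1`, on the assumption that this spelling is acceptable across pandas versions; it is one possible approach, not necessarily the one used above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value.
toy = pd.DataFrame(
    [[1.0, 2.0, 3.0], [4.0, np.nan, 6.0]],
    columns=list('abc'),
    index=['Joe', 'Wex'],
)
# 'z' does not match any column and is simply ignored.
mapping = {'a': 'red', 'b': 'red', 'c': 'blue', 'z': 'orange'}

# Transpose, group the (former) column labels as rows, transpose back.
by_colour = toy.T.groupby(mapping).sum().T
print(by_colour)

# A Series works as a mapping in the same way as the dictionary.
by_colour_series = toy.T.groupby(pd.Series(mapping)).sum().T
print(by_colour_series)
```

Missing values are skipped by `sum()`, so the `'red'` total for `'Wex'` is just the value in column `'a'`.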


### Grouping with functions

A function can be used to create the mappings.
The function is called once for each index label, and its return values are used as the group names.

Here is an example of grouping by the length of the first names.

In [96]:
people.groupby(len).sum()

Unnamed: 0,a,b,c,d,e
3,-2.241693,1.883909,2.066816,-3.110288,1.61954
5,-1.454366,0.045759,-0.187184,1.532779,1.469359
6,-1.048553,-1.420018,-1.70627,1.950775,-0.509652
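
To make the mechanics concrete, grouping by a function can be compared with computing the keys by hand. This is a sketch on a hypothetical DataFrame (`toy` is made up for illustration).

```python
import pandas as pd

# Hypothetical toy data indexed by first name.
toy = pd.DataFrame(
    {'score': [1, 2, 3, 4]},
    index=['Joe', 'Steve', 'Jim', 'Travis'],
)

by_function = toy.groupby(len).sum()

# Spelling out what the function form does: compute one key per index label.
keys = [len(label) for label in toy.index]
by_keys = toy.groupby(keys).sum()

print(by_function)
```

Both results contain the same groups (3, 5 and 6) with the same sums.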


It is possible to use both a function and an array or dictionary for grouping at the same time.

In [97]:
key_list = ['one', 'one', 'one', 'two', 'two']
people.groupby([len, key_list]).min()

Unnamed: 0,Unnamed: 1,a,b,c,d,e
3,one,-2.55299,0.653619,0.864436,-1.980796,-0.347912
3,two,0.156349,1.230291,1.20238,-0.387327,-0.302303
5,one,-1.454366,0.045759,-0.187184,1.532779,1.469359
6,two,-1.048553,-1.420018,-1.70627,1.950775,-0.509652


### Grouping by index levels

For hierarchically indexed data structures, the levels of the axis can be used for grouping.

In [98]:
columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                     [1, 3, 5, 1, 3]],
                                    names=['cty', 'tenor'])
hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)
hier_df

cty,US,US,US,JP,JP
tenor,1,3,5,1,3
0,-0.438074,-1.252795,0.77749,-1.613898,-0.21274
1,-0.895467,0.386902,-0.510805,-1.180632,-0.028182
2,0.428332,0.066517,0.302472,-0.634322,-0.362741
3,-0.67246,-0.359553,-0.813146,-1.726283,0.177426


In [99]:
hier_df.groupby(level='cty', axis=1).count()

cty,JP,US
0,2,3
1,2,3
2,2,3
3,2,3
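
The same idea applies when the hierarchical index is on the rows. The sketch below builds a hypothetical DataFrame (`toy`, with made-up `rate` values) and groups by a level referred to either by name or by position.

```python
import pandas as pd

# Hypothetical data with a two-level row index.
index = pd.MultiIndex.from_arrays(
    [['US', 'US', 'US', 'JP', 'JP'], [1, 3, 5, 1, 3]],
    names=['cty', 'tenor'],
)
toy = pd.DataFrame({'rate': [0.5, 1.0, 1.5, 0.1, 0.2]}, index=index)

# `level` accepts a level name or its integer position.
by_name = toy.groupby(level='cty').count()
by_position = toy.groupby(level=0).count()
print(by_name)
```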


## 10.2 Data Aggregation

An aggregation method is a data transformation that turns an array into a scalar value.
When used on a GroupBy object, the transformation is applied to each group, separately.

In [100]:
df

Unnamed: 0,key1,key2,data1,data2
0,a,one,0.144044,0.333674
1,a,two,1.454274,1.494079
2,b,one,0.761038,-0.205158
3,b,two,0.121675,0.313068
4,a,one,0.443863,-0.854096


In [101]:
grouped = df.groupby('key1')
grouped['data1'].quantile()

key1
a    0.443863
b    0.441356
Name: data1, dtype: float64

It is common to define custom aggregation methods.
They can be applied to grouped data by passing them to the `agg()` method of a GroupBy object.

In [102]:
def peak_to_peak(arr):
    return arr.max() - arr.min()

grouped.agg(peak_to_peak)

         data1     data2
key1
a      1.31023  2.348175
b     0.639363  0.518226
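
Custom functions passed to `agg()` are typically slower than the optimized built-in aggregations, so it can be worth expressing the same computation with built-ins where possible. The following sketch, on hypothetical data, checks that a peak-to-peak aggregation matches the difference of the built-in `max()` and `min()`.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'data1': [1.0, 4.0, 2.0, 7.0, 3.0],
})
grouped_toy = toy.groupby('key1')

def peak_to_peak(arr):
    return arr.max() - arr.min()

custom = grouped_toy.agg(peak_to_peak)           # Python function per group
builtin = grouped_toy.max() - grouped_toy.min()  # two built-in aggregations
print(custom.equals(builtin))
```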


Other methods will also perform as expected, even though they are not strictly aggregations.

In [103]:
grouped.describe()

      data1
      count      mean       std       min       25%       50%       75%       max
key1
a       3.0  0.680727  0.686479  0.144044  0.293953  0.443863  0.949068  1.454274
b       2.0  0.441356  0.452098  0.121675  0.281516  0.441356  0.601197  0.761038

      data2
      count      mean       std       min       25%       50%       75%       max
key1
a       3.0  0.324553  1.174114 -0.854096 -0.260211  0.333674  0.913877  1.494079
b       2.0  0.053955  0.366441 -0.205158 -0.075602  0.053955  0.183511  0.313068


### Column-wise and multiple function application

It is common to need to use a different aggregation method for each column.
This is demonstrated using an example.

In [104]:
# Read in data on tipping at restaurants.
tips = pd.read_csv('assets/examples/tips.csv')

# Calculate the tip as a fraction of the total bill.
tips['tip_pct'] = tips['tip'] / tips['total_bill']
tips.head()

Unnamed: 0,total_bill,tip,smoker,day,time,size,tip_pct
0,16.99,1.01,No,Sun,Dinner,2,0.059447
1,10.34,1.66,No,Sun,Dinner,3,0.160542
2,21.01,3.5,No,Sun,Dinner,3,0.166587
3,23.68,3.31,No,Sun,Dinner,2,0.13978
4,24.59,3.61,No,Sun,Dinner,4,0.146808


In [105]:
grouped = tips.groupby(['day', 'smoker'])
grouped_pct = grouped['tip_pct']

# The average tipping percentage per day, split into smokers and non-smokers.
grouped_pct.agg('mean')

day   smoker
Fri   No        0.151650
      Yes       0.174783
Sat   No        0.158048
      Yes       0.147906
Sun   No        0.160113
      Yes       0.187250
Thur  No        0.160298
      Yes       0.163863
Name: tip_pct, dtype: float64

The `agg` method can be passed a list of functions or function names to apply to the grouped data.

In [106]:
grouped_pct.agg(['mean', 'std', peak_to_peak])

                 mean       std  peak_to_peak
day  smoker
Fri  No       0.15165  0.028123      0.067349
     Yes     0.174783  0.051293      0.159925
Sat  No      0.158048  0.039767      0.235193
     Yes     0.147906  0.061375      0.290095
Sun  No      0.160113  0.042347      0.193226
     Yes      0.18725  0.154134      0.644685
Thur No      0.160298  0.038774       0.19335
     Yes     0.163863  0.039389       0.15124


In [107]:
# Passing 2-tuples provides the name of the new column and the aggregation method.
grouped_pct.agg([('foo', 'mean'), ('bar', np.std)])

                  foo       bar
day  smoker
Fri  No       0.15165  0.028123
     Yes     0.174783  0.051293
Sat  No      0.158048  0.039767
     Yes     0.147906  0.061375
Sun  No      0.160113  0.042347
     Yes      0.18725  0.154134
Thur No      0.160298  0.038774
     Yes     0.163863  0.039389
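
Output columns can also be named with keyword arguments to `agg()` ("named aggregation", available since pandas 0.25). The sketch below uses a small hypothetical DataFrame (`toy` and the column `party_size` are made up for illustration) rather than the tipping data.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    'day': ['Fri', 'Fri', 'Sat', 'Sat'],
    'tip_pct': [0.10, 0.20, 0.15, 0.25],
    'party_size': [2, 3, 4, 2],
})

# On a grouped Series: keyword = aggregation.
series_result = toy.groupby('day')['tip_pct'].agg(foo='mean', bar='std')

# On a grouped DataFrame: keyword = (column, aggregation).
frame_result = toy.groupby('day').agg(
    mean_pct=('tip_pct', 'mean'),
    total_size=('party_size', 'sum'),
)
print(series_result)
print(frame_result)
```

The keyword names become the column names of the result, which stay flat rather than hierarchical.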


The above example was using a grouped Series.
With a DataFrame, a list of functions can be applied to all columns, or specific functions can be applied to specified columns.

In [108]:
functions = ['count', 'mean', 'max']
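# Select several columns with a list; tuple-style selection is no longer accepted by recent pandas.
result = grouped[['tip_pct', 'total_bill']].agg(functions)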
result

             tip_pct                      total_bill
               count      mean       max       count       mean    max
day  smoker
Fri  No            4   0.15165  0.187735           4      18.42  22.75
     Yes          15  0.174783   0.26348          15  16.813333  40.17
Sat  No           45  0.158048   0.29199          45  19.661778  48.33
     Yes          42  0.147906  0.325733          42  21.276667  50.81
Sun  No           57  0.160113  0.252672          57  20.506667  48.17
     Yes          19   0.18725  0.710345          19      24.12  45.35
Thur No           45  0.160298  0.266312          45  17.113111  41.19
     Yes          17  0.163863  0.241255          17  19.190588  43.11


In [109]:
# A mapping of column names to functions.
fxn_mapping = {
    'tip_pct': ['min', 'max', 'mean', 'std'],
    'size': 'sum'
}
grouped.agg(fxn_mapping)

              tip_pct                                size
                  min       max      mean       std   sum
day  smoker
Fri  No      0.120385  0.187735   0.15165  0.028123     9
     Yes     0.103555   0.26348  0.174783  0.051293    31
Sat  No      0.056797   0.29199  0.158048  0.039767   115
     Yes     0.035638  0.325733  0.147906  0.061375   104
Sun  No      0.059447  0.252672  0.160113  0.042347   167
     Yes      0.06566  0.710345   0.18725  0.154134    49
Thur No      0.072961  0.266312  0.160298  0.038774   112
     Yes     0.090014  0.241255  0.163863  0.039389    40
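
When a dictionary with lists of functions is passed, the result has hierarchical columns. If flat column names are preferred, the two header levels can be joined. This is one possible approach, sketched on hypothetical data (`toy` and `party_size` are made up for illustration).

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    'day': ['Fri', 'Fri', 'Sat', 'Sat'],
    'tip_pct': [0.10, 0.20, 0.15, 0.25],
    'party_size': [2, 3, 4, 2],
})

result = toy.groupby('day').agg({'tip_pct': ['min', 'max'], 'party_size': 'sum'})

# Join the two header levels into single strings, e.g. 'tip_pct_min'.
flat = result.copy()
flat.columns = ['_'.join(pair) for pair in result.columns]
print(flat)
```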


### Return aggregated data without row indexes

So far, the returned aggregations have the group key combinations for an index, often hierarchical.
This can be disabled by passing `as_index=False`.

In [110]:
# with index
tips.groupby(['day', 'smoker']).mean(numeric_only=True)

             total_bill       tip      size   tip_pct
day  smoker
Fri  No           18.42    2.8125      2.25   0.15165
     Yes      16.813333     2.714  2.066667  0.174783
Sat  No       19.661778  3.102889  2.555556  0.158048
     Yes      21.276667  2.875476   2.47619  0.147906
Sun  No       20.506667  3.167895  2.929825  0.160113
     Yes          24.12  3.516842  2.578947   0.18725
Thur No       17.113111  2.673778  2.488889  0.160298
     Yes      19.190588      3.03  2.352941  0.163863


In [111]:
# without index
tips.groupby(['day', 'smoker'], as_index=False).mean(numeric_only=True)

Unnamed: 0,day,smoker,total_bill,tip,size,tip_pct
0,Fri,No,18.42,2.8125,2.25,0.15165
1,Fri,Yes,16.813333,2.714,2.066667,0.174783
2,Sat,No,19.661778,3.102889,2.555556,0.158048
3,Sat,Yes,21.276667,2.875476,2.47619,0.147906
4,Sun,No,20.506667,3.167895,2.929825,0.160113
5,Sun,Yes,24.12,3.516842,2.578947,0.18725
6,Thur,No,17.113111,2.673778,2.488889,0.160298
7,Thur,Yes,19.190588,3.03,2.352941,0.163863


In [112]:
# also without index
tips.groupby(['day', 'smoker']).mean(numeric_only=True).reset_index()

Unnamed: 0,day,smoker,total_bill,tip,size,tip_pct
0,Fri,No,18.42,2.8125,2.25,0.15165
1,Fri,Yes,16.813333,2.714,2.066667,0.174783
2,Sat,No,19.661778,3.102889,2.555556,0.158048
3,Sat,Yes,21.276667,2.875476,2.47619,0.147906
4,Sun,No,20.506667,3.167895,2.929825,0.160113
5,Sun,Yes,24.12,3.516842,2.578947,0.18725
6,Thur,No,17.113111,2.673778,2.488889,0.160298
7,Thur,Yes,19.190588,3.03,2.352941,0.163863


## 10.3 Apply: general split-apply-combine

Apply splits the object being manipulated into pieces, invokes a provided function on each piece, and then concatenates the pieces together.
Here is an example using the tipping data.

In [113]:
def top(df, n=5, column='tip_pct'):
    '''
    Return rows with the top `n` values in `column`.
    '''
    return df.sort_values(by=column)[-n:]

top(tips, n=6)

Unnamed: 0,total_bill,tip,smoker,day,time,size,tip_pct
109,14.31,4.0,Yes,Sat,Dinner,2,0.279525
183,23.17,6.5,Yes,Sun,Dinner,4,0.280535
232,11.61,3.39,No,Sat,Dinner,2,0.29199
67,3.07,1.0,Yes,Sat,Dinner,1,0.325733
178,9.6,4.0,Yes,Sun,Dinner,2,0.416667
172,7.25,5.15,Yes,Sun,Dinner,2,0.710345


In [114]:
tips.groupby('smoker').apply(top, n=3)

            total_bill   tip  smoker   day    time  size   tip_pct
smoker
No     51        10.29   2.6      No   Sun  Dinner     2  0.252672
       149        7.51   2.0      No  Thur   Lunch     2  0.266312
       232       11.61  3.39      No   Sat  Dinner     2   0.29199
Yes    67         3.07   1.0     Yes   Sat  Dinner     1  0.325733
       178         9.6   4.0     Yes   Sun  Dinner     2  0.416667
       172        7.25  5.15     Yes   Sun  Dinner     2  0.710345
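
For this particular task, a similar selection can be made without `apply()` by sorting once and keeping the last rows of each group. This is an alternative sketch on hypothetical data, not a replacement for the general `apply()` mechanism. Unlike the `top()` example, the result keeps the sorted row order and has no group-key index level.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    'smoker': ['No', 'Yes', 'No', 'Yes', 'No', 'Yes'],
    'tip_pct': [0.10, 0.30, 0.20, 0.05, 0.15, 0.25],
})

# Sort once, then keep the last `n` rows of every group.
top_two = toy.sort_values('tip_pct').groupby('smoker').tail(2)
print(top_two)
```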


### Suppressing the group keys

To prevent the resulting DataFrame from having the grouping keys for row indexes, use `group_keys=False` in `groupby()`.

In [115]:
tips.groupby('smoker', group_keys=False).apply(top, n=3)

Unnamed: 0,total_bill,tip,smoker,day,time,size,tip_pct
51,10.29,2.6,No,Sun,Dinner,2,0.252672
149,7.51,2.0,No,Thur,Lunch,2,0.266312
232,11.61,3.39,No,Sat,Dinner,2,0.29199
67,3.07,1.0,Yes,Sat,Dinner,1,0.325733
178,9.6,4.0,Yes,Sun,Dinner,2,0.416667
172,7.25,5.15,Yes,Sun,Dinner,2,0.710345


### Quantile and bucket analysis

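pandas includes `cut()` and `qcut()` for dividing data into buckets: `cut()` uses bins of equal width, whereas `qcut()` uses sample quantiles.
These are often useful in conjunction with `groupby()`.
The categorical object returned by `cut()` can be passed directly to `groupby()` to split the data.
Despite its name, the `quartiles` variable below holds four equal-width bins rather than true quartiles.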

In [116]:
frame = pd.DataFrame({'data1': np.random.randn(1000),
                      'data2': np.random.randn(1000)})
quartiles = pd.cut(frame.data1, 4)
quartiles.head()

0    (-1.492, 0.0624]
1    (-3.052, -1.492]
2     (0.0624, 1.617]
3    (-1.492, 0.0624]
4    (-1.492, 0.0624]
Name: data1, dtype: category
Categories (4, interval[float64]): [(-3.052, -1.492] < (-1.492, 0.0624] < (0.0624, 1.617] < (1.617, 3.171]]

In [117]:
def get_stats(group):
    return {
        'min': group.min(),
        'max': group.max(),
        'count': group.count(),
        'mean': group.mean()
    }

frame.data2.groupby(quartiles).apply(get_stats).unstack()

                       min       max  count      mean
data1
(-3.052, -1.492] -1.730276   2.15312   68.0  0.097696
(-1.492, 0.0624] -2.777359  2.680571  482.0   0.05623
(0.0624, 1.617]  -3.116857  2.929096  398.0 -0.046616
(1.617, 3.171]   -2.158069  2.042072   52.0 -0.156563


Use `qcut()` to get buckets that each contain the same number of observations.

In [118]:
grouping = pd.qcut(frame.data1, 10, labels=False)
frame.data2.groupby(grouping).apply(get_stats).unstack()

            min       max  count      mean
data1
0     -2.534554  2.642936  100.0 -0.013313
1     -1.641703  2.232016  100.0  0.059095
2     -2.777359  2.620574  100.0  0.193168
3     -1.830029  2.488442  100.0 -0.103425
4      -2.12389  2.526368  100.0  0.169418
5     -2.121176  2.680571  100.0  0.044411
6     -3.116857  2.929096  100.0  0.083797
7     -2.994613  2.540232  100.0 -0.305252
8     -1.980566  2.190898  100.0 -0.058111
9     -2.802203  2.198296  100.0  0.000731
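
The difference between the two functions shows up directly in the bucket counts. The sketch below draws its own hypothetical sample (the seed and the name `values` are arbitrary choices for illustration) and counts the members of each bucket.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen arbitrarily for this sketch
values = pd.Series(rng.standard_normal(1000))

width_counts = pd.cut(values, 4).value_counts(sort=False)      # equal-width bins
quantile_counts = pd.qcut(values, 4).value_counts(sort=False)  # equal-count bins

print(width_counts)
print(quantile_counts)
```

With `cut()`, most observations fall into the middle bins, whereas `qcut()` places 250 observations in each of the four buckets.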


## 10.4 Pivot tables and cross-tabulation

Pivot tables are made with pandas by using `groupby()` and some reshaping operations.
There is a `pivot_table()` method for DataFrames and a top-level `pd.pivot_table()` function.

In [119]:
tips.pivot_table(index=['day', 'smoker'])

                 size       tip   tip_pct  total_bill
day  smoker
Fri  No          2.25    2.8125   0.15165       18.42
     Yes     2.066667     2.714  0.174783   16.813333
Sat  No      2.555556  3.102889  0.158048   19.661778
     Yes      2.47619  2.875476  0.147906   21.276667
Sun  No      2.929825  3.167895  0.160113   20.506667
     Yes     2.578947  3.516842   0.18725       24.12
Thur No      2.488889  2.673778  0.160298   17.113111
     Yes     2.352941      3.03  0.163863   19.190588


The following example aggregates only `'tip_pct'` and `'size'`, and additionally groups by `'time'`.

In [120]:
tips.pivot_table(['tip_pct', 'size'], index=['time', 'day'], columns='smoker')

                 size             tip_pct
smoker             No       Yes        No       Yes
time   day
Dinner Fri        2.0  2.222222  0.139622  0.165347
       Sat   2.555556   2.47619  0.158048  0.147906
       Sun   2.929825  2.578947  0.160113   0.18725
       Thur       2.0       NaN  0.159744       NaN
Lunch  Fri        3.0  1.833333  0.187735  0.188937
       Thur       2.5  2.352941  0.160311  0.163863
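
The relationship between `pivot_table()` and `groupby()` can be checked on a small case: with the default mean aggregation, a pivot table should match a grouped mean that is then unstacked. The data below is hypothetical and chosen so that every combination of keys occurs.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    'day': ['Fri', 'Fri', 'Fri', 'Sat', 'Sat', 'Sat'],
    'smoker': ['No', 'Yes', 'Yes', 'No', 'No', 'Yes'],
    'tip_pct': [0.10, 0.20, 0.30, 0.15, 0.25, 0.05],
})

pivoted = toy.pivot_table('tip_pct', index='day', columns='smoker')
manual = toy.groupby(['day', 'smoker'])['tip_pct'].mean().unstack()

print(pivoted)
print(np.allclose(pivoted.values, manual.values))
```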


Setting `margins=True` adds an `'All'` row and column containing the aggregate (here, the mean) computed over all the data in that row or column.

In [121]:
tips.pivot_table(['tip_pct', 'size'],
                 index=['time', 'day'], 
                 columns='smoker', 
                 margins=True)

                 size                       tip_pct
smoker             No       Yes       All        No       Yes       All
time   day
Dinner Fri        2.0  2.222222  2.166667  0.139622  0.165347  0.158916
       Sat   2.555556   2.47619  2.517241  0.158048  0.147906  0.153152
       Sun   2.929825  2.578947  2.842105  0.160113   0.18725  0.166897
       Thur       2.0       NaN       2.0  0.159744       NaN  0.159744
Lunch  Fri        3.0  1.833333       2.0  0.187735  0.188937  0.188765
       Thur       2.5  2.352941  2.459016  0.160311  0.163863  0.161301
All          2.668874  2.408602  2.569672  0.159328  0.163196  0.160803


The default aggregation behaviour is to take the mean.
This can be changed by passing a different function to `aggfunc`.

In [122]:
tips.pivot_table('tip_pct', 
                 index=['time', 'smoker'], 
                 columns='day', 
                 aggfunc=len, 
                 margins=True)

day             Fri   Sat   Sun  Thur    All
time   smoker
Dinner No       3.0  45.0  57.0   1.0  106.0
       Yes      9.0  42.0  19.0   NaN   70.0
Lunch  No       1.0   NaN   NaN  44.0   45.0
       Yes      6.0   NaN   NaN  17.0   23.0
All            19.0  87.0  76.0  62.0  244.0


In [123]:
tips.pivot_table('tip_pct',
                 index=['time', 'smoker'], 
                 columns='day', 
                 aggfunc=len, 
                 margins=True, 
                 fill_value=0)

day            Fri  Sat  Sun  Thur    All
time   smoker
Dinner No        3   45   57     1  106.0
       Yes       9   42   19     0   70.0
Lunch  No        1    0    0    44   45.0
       Yes       6    0    0    17   23.0
All             19   87   76    62  244.0


### Cross-tabulations (crosstab)

Cross-tabulation is a special case of a pivot table that computes group frequencies.

In [124]:
data = pd.DataFrame({
    'Sample': list(range(1, 11)),
    'Nationality': ['USA', 'Japan', 'USA', 'Japan', 'Japan', 'Japan', 'USA', 'USA', 'Japan', 'USA'],
    'Handedness': ['Right-handed', 'Left-handed', 'Right-handed', 'Right-handed', 'Left-handed', 'Right-handed', 'Right-handed', 'Left-handed', 'Right-handed', 'Right-handed']
})
data

Unnamed: 0,Sample,Nationality,Handedness
0,1,USA,Right-handed
1,2,Japan,Left-handed
2,3,USA,Right-handed
3,4,Japan,Right-handed
4,5,Japan,Left-handed
5,6,Japan,Right-handed
6,7,USA,Right-handed
7,8,USA,Left-handed
8,9,Japan,Right-handed
9,10,USA,Right-handed


In [125]:
pd.crosstab(data.Nationality, data.Handedness, margins=True)

Handedness   Left-handed  Right-handed  All
Nationality
Japan                  2             3    5
USA                    1             4    5
All                    3             7   10
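
Since a cross-tabulation counts group frequencies, it can be reproduced with `groupby()`, `size()` and `unstack()`. The sketch below uses a shortened, hypothetical version of the handedness data and compares the two results.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    'Nationality': ['USA', 'Japan', 'USA', 'Japan', 'Japan'],
    'Handedness': ['Right', 'Left', 'Right', 'Right', 'Left'],
})

table = pd.crosstab(toy['Nationality'], toy['Handedness'])

# Count rows per key combination; absent combinations are filled with 0.
manual = toy.groupby(['Nationality', 'Handedness']).size().unstack(fill_value=0)

print(table)
print((table.values == manual.values).all())
```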


In [126]:
pd.crosstab([tips.time, tips.day], tips.smoker, margins=True)

smoker        No  Yes  All
time   day
Dinner Fri     3    9   12
       Sat    45   42   87
       Sun    57   19   76
       Thur    1    0    1
Lunch  Fri     1    6    7
       Thur   44   17   61
All          151   93  244
