In [1]:
from pandas import DataFrame, Series
import pandas as pd
import sys
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

# Pivot Tables and Cross-Tabulation

A pivot table is a data summarization tool frequently found in spreadsheet programs
and other data analysis software. It aggregates a table of data by one or more keys,
arranging the data in a rectangle with some of the group keys along the rows and some
along the columns. In pandas, pivot tables are built from the groupby facility described
in this chapter combined with reshape operations that use hierarchical indexing.
DataFrame has a pivot_table method, and there is also a top-level pandas.pivot_table
function. Besides providing a convenient interface to groupby, pivot_table can add
partial totals, also known as margins.

Returning to the tipping data set, suppose I wanted to compute a table of group means
(the default pivot_table aggregation type) arranged by sex and smoker on the rows:


In [2]:
tips = pd.read_csv('tips.csv')
tips['tip_pct'] = tips['tip'] / tips['total_bill']

In [3]:
tips.pivot_table(values=['size', 'tip', 'tip_pct', 'total_bill'],
                 index=['sex', 'smoker'])  # explicit values skip non-numeric columns

sex,smoker,size,tip,tip_pct,total_bill
Female,No,2.592593,2.773519,0.156921,18.105185
Female,Yes,2.242424,2.931515,0.18215,17.977879
Male,No,2.71134,3.113402,0.160669,19.791237
Male,Yes,2.5,3.051167,0.152771,22.2845


This table could also have been produced with groupby directly. Now suppose we want to
aggregate only tip_pct and size, and additionally group by day. I’ll put smoker in the
table columns and day in the rows:

In [4]:
tips.pivot_table(['tip_pct', 'size'], index=['sex', 'day'],
    columns='smoker')

,,size,size,tip_pct,tip_pct
,smoker,No,Yes,No,Yes
sex,day,,,,
Female,Fri,2.5,2.0,0.165296,0.209129
Female,Sat,2.307692,2.2,0.147993,0.163817
Female,Sun,3.071429,2.5,0.16571,0.237075
Female,Thur,2.48,2.428571,0.155971,0.163073
Male,Fri,2.0,2.125,0.138005,0.14473
Male,Sat,2.65625,2.62963,0.162132,0.139067
Male,Sun,2.883721,2.6,0.158291,0.173964
Male,Thur,2.5,2.3,0.165706,0.164417
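
To make the connection to groupby concrete, here is a minimal sketch suggesting that a pivot_table call of this shape matches a groupby mean followed by unstack. The `toy` frame is a made-up stand-in for the tips data (its values are hypothetical, not taken from tips.csv).

```python
import pandas as pd

# Hypothetical miniature stand-in for the tips data; values are invented.
toy = pd.DataFrame({
    'sex':     ['Female', 'Female', 'Female', 'Female', 'Male', 'Male', 'Male', 'Male'],
    'day':     ['Fri', 'Fri', 'Sat', 'Sat', 'Fri', 'Fri', 'Sat', 'Sat'],
    'smoker':  ['No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'No'],
    'size':    [2, 3, 2, 4, 2, 1, 3, 4],
    'tip_pct': [0.15, 0.20, 0.16, 0.18, 0.10, 0.14, 0.12, 0.17],
})

# Pivot: sex and day on the rows, smoker on the columns, group means as values.
pivoted = toy.pivot_table(['tip_pct', 'size'], index=['sex', 'day'],
                          columns='smoker')

# Same computation spelled out: group on all three keys, take the mean,
# then move the smoker level from the row index to the columns.
grouped = (toy.groupby(['sex', 'day', 'smoker'])[['size', 'tip_pct']]
              .mean()
              .unstack('smoker'))

print(pivoted)
print(grouped)
```

Combinations that never occur in the data (here, male smokers on Saturday) show up as NaN in both results.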


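Passing margins=True augments this table with partial totals. It adds All row and column
labels whose values are the group statistics computed over all the data within a single tier:

In [5]:
tips.pivot_table(['tip_pct', 'size'], index=['sex', 'day'],
    columns='smoker', margins=True)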

,,size,size,size,tip_pct,tip_pct,tip_pct
,smoker,No,Yes,All,No,Yes,All
sex,day,,,,,,
Female,Fri,2.5,2.0,2.111111,0.165296,0.209129,0.199388
Female,Sat,2.307692,2.2,2.25,0.147993,0.163817,0.15647
Female,Sun,3.071429,2.5,2.944444,0.16571,0.237075,0.181569
Female,Thur,2.48,2.428571,2.46875,0.155971,0.163073,0.157525
Male,Fri,2.0,2.125,2.1,0.138005,0.14473,0.143385
Male,Sat,2.65625,2.62963,2.644068,0.162132,0.139067,0.151577
Male,Sun,2.883721,2.6,2.810345,0.158291,0.173964,0.162344
Male,Thur,2.5,2.3,2.433333,0.165706,0.164417,0.165276
All,,2.668874,2.408602,2.569672,0.159328,0.163196,0.160803
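
As a rough check on what the All labels hold, the sketch below compares the margins of a small pivot table against means computed directly. The `toy` frame is hypothetical data invented for illustration, not the tips data set.

```python
import pandas as pd

# Hypothetical miniature stand-in for the tips data; values are invented.
toy = pd.DataFrame({
    'sex':     ['Female', 'Female', 'Female', 'Female', 'Male', 'Male', 'Male', 'Male'],
    'smoker':  ['No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'No'],
    'tip_pct': [0.15, 0.20, 0.16, 0.18, 0.10, 0.14, 0.12, 0.17],
})

table = toy.pivot_table('tip_pct', index='sex', columns='smoker', margins=True)

# The bottom-right cell is the mean over every row of the data.
overall = toy['tip_pct'].mean()

# The All column ignores smoker: it is the mean for each sex on its own.
female_mean = toy.loc[toy['sex'] == 'Female', 'tip_pct'].mean()

print(table)
print(overall, female_mean)
```

Note that with the default mean aggregation, a margin is the mean of the underlying rows, not the mean of the cell means beside it.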


To use a different aggregation function, pass it to aggfunc. For example, 'count' or
len will give you a cross-tabulation (count or frequency) of group sizes:

In [6]:
tips.pivot_table('tip_pct', index=['sex', 'smoker'], columns='day',
    aggfunc=len, margins=True)

,day,Fri,Sat,Sun,Thur,All
sex,smoker,,,,,
Female,No,2.0,13.0,14.0,25.0,54.0
Female,Yes,7.0,15.0,4.0,7.0,33.0
Male,No,2.0,32.0,43.0,20.0,97.0
Male,Yes,8.0,27.0,15.0,10.0,60.0
All,,19.0,87.0,76.0,62.0,244.0
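
The counting behaviour can be sketched on a small example. The `toy` frame below is hypothetical data invented for illustration; with aggfunc=len, each cell holds the number of rows in that group and the All margins hold the row, column, and grand totals.

```python
import pandas as pd

# Hypothetical miniature stand-in for the tips data; values are invented.
toy = pd.DataFrame({
    'sex':     ['Female', 'Female', 'Female', 'Female', 'Male', 'Male', 'Male', 'Male'],
    'day':     ['Fri', 'Fri', 'Sat', 'Sat', 'Fri', 'Fri', 'Sat', 'Sat'],
    'tip_pct': [0.15, 0.20, 0.16, 0.18, 0.10, 0.14, 0.12, 0.17],
})

# len counts the rows in each (sex, day) group; margins=True adds the totals.
counts = toy.pivot_table('tip_pct', index='sex', columns='day',
                         aggfunc=len, margins=True)

print(counts)
```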


If some combinations are empty (or otherwise NA), you may wish to pass a fill_value:

In [7]:
tips.pivot_table('size', index=['time', 'sex', 'smoker'],
    columns='day', aggfunc='sum', fill_value=0)

,,day,Fri,Sat,Sun,Thur
time,sex,smoker,,,,
Dinner,Female,No,2,30,43,2
Dinner,Female,Yes,8,33,10,0
Dinner,Male,No,4,85,124,0
Dinner,Male,Yes,12,71,39,0
Lunch,Female,No,3,0,0,60
Lunch,Female,Yes,6,0,0,17
Lunch,Male,No,0,0,0,50
Lunch,Male,Yes,5,0,0,23
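
To see what fill_value changes, the following sketch builds the same pivot table with and without it. The `toy` frame is hypothetical data invented for illustration, arranged so that one combination (male smokers on Saturday) never occurs.

```python
import pandas as pd

# Hypothetical miniature stand-in for the tips data; values are invented.
toy = pd.DataFrame({
    'sex':    ['Female', 'Female', 'Female', 'Female', 'Male', 'Male', 'Male', 'Male'],
    'day':    ['Fri', 'Fri', 'Sat', 'Sat', 'Fri', 'Fri', 'Sat', 'Sat'],
    'smoker': ['No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'No'],
    'size':   [2, 3, 2, 4, 2, 1, 3, 4],
})

# Without fill_value, the empty (Male, Yes, Sat) cell is NaN.
raw = toy.pivot_table('size', index=['sex', 'smoker'], columns='day',
                      aggfunc='sum')

# With fill_value=0, the same cell is reported as 0.
filled = toy.pivot_table('size', index=['sex', 'smoker'], columns='day',
                         aggfunc='sum', fill_value=0)

print(raw)
print(filled)
```

Zero is a sensible fill for sums and counts; for means, a filled 0 could be misread as an observed value, so leaving NaN may be the better choice there.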


See Table 9-2 for a summary of pivot_table options.

Table 9-2. pivot_table options

| Argument | Description |
|---|---|
| values | Column name or names to aggregate. By default, aggregates all numeric columns |
| index | Column names or other group keys to group on the rows of the resulting pivot table |
| columns | Column names or other group keys to group on the columns of the resulting pivot table |
| aggfunc | Aggregation function or list of functions; 'mean' by default. Can be any function valid in a groupby context |
| fill_value | Replace missing values in the result table |
| margins | Add row/column subtotals and grand total; False by default |
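
As a brief illustration of these options used together, the sketch below calls the top-level pandas.pivot_table function with a list of aggregation functions. The `toy` frame is hypothetical data invented for illustration.

```python
import pandas as pd

# Hypothetical miniature stand-in for the tips data; values are invented.
toy = pd.DataFrame({
    'sex':  ['Female', 'Female', 'Female', 'Female', 'Male', 'Male', 'Male', 'Male'],
    'day':  ['Fri', 'Fri', 'Sat', 'Sat', 'Fri', 'Fri', 'Sat', 'Sat'],
    'size': [2, 3, 2, 4, 2, 1, 3, 4],
})

# A list of functions yields one block of columns per function,
# labelled by the function name on the outer column level.
result = pd.pivot_table(toy, values='size', index='sex', columns='day',
                        aggfunc=['mean', 'sum'], fill_value=0)

print(result)
```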

## Cross-Tabulations: Crosstab
    
A cross-tabulation (or crosstab for short) is a special case of a pivot table that computes
group frequencies. Here is a canonical example taken from the Wikipedia page on cross-tabulation:

In [8]:
data

NameError: name 'data' is not defined
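
Independently of the Wikipedia example, the idea of computing group frequencies can be sketched with the top-level pandas.crosstab function. The `toy` frame below is hypothetical data invented for illustration and is not the `data` object referenced above.

```python
import pandas as pd

# Hypothetical data invented for illustration.
toy = pd.DataFrame({
    'sex':    ['Female', 'Female', 'Female', 'Female', 'Male', 'Male', 'Male', 'Male'],
    'smoker': ['No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'No'],
})

# Count how often each (sex, smoker) pair occurs; margins=True adds totals.
ct = pd.crosstab(toy['sex'], toy['smoker'], margins=True)

print(ct)
```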