# Crosstab and pivot tables
## Reshaping and summarising data

The cross tabulation (crosstab), and the more general pivot table, allow us to reshape and summarise data in tabular form. 

Later we'll look at reshaping data as a separate activity, and we'll also look at the basic building blocks that enable the creation of crosstabs and pivot tables, that is, grouping and aggregation over groups. But for now we'll take the quick and easy route of preparing cross tabulations.

Note that SQL cannot easily produce crosstabs and pivot tables: basically SQL is not good at reshaping data.  SQL's underpinning model of the table gets in the way of morphing the table forms under user control (the core data structure was intended for storage, not analysis).  It's one of the reasons NoSQL languages are seen as being more flexible: reshaping a dataset can be a powerful analysis tool.

We'll be looking at the *pandas* `crosstab()` and `pivot_table()` methods, so ... we need a sample DataFrame to work with.

In [1]:
import pandas as pd
import numpy as np

## First, ingest some data
### The Isle of Wight council spending data

The Isle of Wight spending data is a dataset we've seen before - it gives us a nice clean DataFrame to work with.   (It's also the one used in the sample tables in the crosstab section of Part 4 of the module material.)

In [2]:
# Read in the CSV formatted spending data file.
df = pd.read_csv('data/spendingdata/IW_PUBLISHED FORMAT - JAN 2014.csv',
                 thousands=',', encoding='latin-1')
df[:3]

| | Capital or Revenue | Directorate | Transaction Number | Date | Service Area | Expenses Type | Amount | Supplier Name |
|---|---|---|---|---|---|---|---|---|
| 0 | Revenue | Community Wellbeing & Social Care | 5105650243 | 29.01.2014 | Drug Misuse - Adults | Charges from Independent Providers | 120.0 | REDACTED PERSONAL DATA |
| 1 | Revenue | Community Wellbeing & Social Care | 5105646636 | 15.01.2014 | Drug Misuse - Adults | Charges from Independent Providers | 120.0 | REDACTED PERSONAL DATA |
| 2 | Revenue | Community Wellbeing & Social Care | 5105648361 | 22.01.2014 | Leaseholds by LA | Accommodation Costs - Leaseholder Payments | 695.89 | REDACTED PERSONAL DATA |


## The cross tabulation: counting occurrences

The `crosstab()` method provides a convenient way of counting the occurrences of one column value or index value with respect to another.

Let's get a count of the number of `Capital` and the number of `Revenue` transactions for each of the Isle of Wight council `Directorates`:

In [9]:
pd.crosstab(df['Directorate'], df['Capital or Revenue'])

| Directorate | Capital | Revenue |
|---|---|---|
| Childrens Services | 37 | 4091 |
| Community Wellbeing & Social Care | 30 | 5039 |
| Corporate | 36 | 28 |
| Economy & Environment | 27 | 2075 |
| Resources | 44 | 1092 |


The `crosstab` has reshaped our DataFrame - there is now a row for each unique value in the `Directorate` column of the original table, and a column for each unique value in the `Capital or Revenue` column of the original table.  At the intersection of each row and column there is the count of the number of times that row value and that column value occur in the original table's rows.

What would happen if we switched the order of the `df` column references in the above `crosstab` code?  Try it and see if you were right.  There are really two questions here: what happens to the rows and columns, and what happens to the counts?

In [10]:
pd.crosstab(df['Capital or Revenue'], df['Directorate'])

| Capital or Revenue | Childrens Services | Community Wellbeing & Social Care | Corporate | Economy & Environment | Resources |
|---|---|---|---|---|---|
| Capital | 37 | 30 | 36 | 27 | 44 |
| Revenue | 4091 | 5039 | 28 | 2075 | 1092 |


The first of the `crosstab` parameters gives us the rows (the index) and the second gives us the columns, so swapping them gives a table of a different shape. But because the same pairs of values are being counted, the counts at the intersections remain the same: `Childrens Services` had `37` `Capital` transactions, whatever the shape of the resulting table or the order of its rows and columns.
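
To make this concrete, here is a minimal sketch on a made-up five-row table (the `toy` DataFrame and its values are invented for illustration and are not taken from the council data). It suggests that swapping the two arguments simply transposes the result.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
toy = pd.DataFrame({
    'Directorate': ['Corporate', 'Corporate', 'Resources', 'Resources', 'Resources'],
    'Capital or Revenue': ['Capital', 'Revenue', 'Revenue', 'Revenue', 'Capital'],
})

# First argument -> rows, second argument -> columns.
by_directorate = pd.crosstab(toy['Directorate'], toy['Capital or Revenue'])

# Swapping the arguments swaps the rows and the columns...
by_type = pd.crosstab(toy['Capital or Revenue'], toy['Directorate'])

# ...but each count is unchanged: one table is the transpose of the other.
same_counts = (by_directorate.T.values == by_type.values).all()

print(by_directorate)
print(by_type)
print(same_counts)
```

Comparing the underlying arrays, rather than the DataFrames themselves, keeps the check independent of how the row and column labels are named.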

We can also capture the total count by row and by column by setting the `margins` parameter to be `True` (by default, `margins=False`). The new `All` column and row contains the row and column totals respectively.

In [11]:
pd.crosstab(df['Directorate'], df['Capital or Revenue'], margins=True)

| Directorate | Capital | Revenue | All |
|---|---|---|---|
| Childrens Services | 37 | 4091 | 4128 |
| Community Wellbeing & Social Care | 30 | 5039 | 5069 |
| Corporate | 36 | 28 | 64 |
| Economy & Environment | 27 | 2075 | 2102 |
| Resources | 44 | 1092 | 1136 |
| All | 174 | 12325 | 12499 |
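
As a quick sanity check of what the `All` entries mean, the following sketch uses an invented `toy` DataFrame (not the council data) and pulls out a row total, a column total and the grand total.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
toy = pd.DataFrame({
    'Directorate': ['Corporate', 'Corporate', 'Resources', 'Resources', 'Resources'],
    'Capital or Revenue': ['Capital', 'Revenue', 'Revenue', 'Revenue', 'Capital'],
})

with_totals = pd.crosstab(toy['Directorate'], toy['Capital or Revenue'],
                          margins=True)

# The 'All' column holds each row's total; the 'All' row holds each column's total.
row_total = with_totals.loc['Resources', 'All']
column_total = with_totals.loc['All', 'Revenue']

# The bottom-right cell counts every row of the original table.
grand_total = with_totals.loc['All', 'All']

print(with_totals)
```

In this sketch the grand total is just the number of rows in `toy`, because each row contributes exactly one count.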


## The pivot table: generalising to more than counting

A rather more general method of summarising data is the `pivot_table()`. The `pivot_table()` provides functionality akin to spreadsheet pivot tables, in that it can be used to aggregate data in a DataFrame, over a hierarchy of columns, in a user-controlled way.

Let's start with the simplest aggregation over a single column and work our way up.

What is the `sum` of the `Amount` values for each `Directorate`? I've used the numpy `sum` function, but any meaningful aggregate function could be applied to the columns.

In [13]:
df.pivot_table(index=['Directorate'], aggfunc=np.sum)

| Directorate | Amount |
|---|---|
| Childrens Services | 4400360.57 |
| Community Wellbeing & Social Care | 4418142.0 |
| Corporate | -123917.04 |
| Economy & Environment | 3806277.69 |
| Resources | 776363.23 |


### ASIDE: Why is only the `Amount` column output?
The `df` table has more than just the `Amount` column; so where are the other columns?

In the above cell, the `Amount` column is the only column where the `sum` function makes sense. 

What happens if we apply an aggregation function, like `np.max`,  that makes sense over all the columns? 

Try it and see.

In [14]:
# Put your code here and try it.
df.pivot_table(index=['Directorate'], aggfunc=np.max)
# This is nasty, because max also applies to strings and returns the "largest" string, i.e. the one that sorts last alphabetically.

| Directorate | Amount | Capital or Revenue | Date | Expenses Type | Service Area | Supplier Name | Transaction Number |
|---|---|---|---|---|---|---|---|
| Childrens Services | 268198.52 | Revenue | 31.12.2013 | Water and Sewerage | Youth- West Wight | YOUTH SERVICE | Payroll |
| Community Wellbeing & Social Care | 276434.0 | Revenue | 31.01.2014 | Water and Sewerage | Wroxall Primary Devolved Capital | YOURCARE LIMITED | Payroll |
| Corporate | 8320.25 | Revenue | 31.01.2014 | Water and Sewerage | Rent Allowances Granted | WOOTTON PRIMARY SCHOOL | 5400001770 |
| Economy & Environment | 1410780.85 | Revenue | 31.12.2013 | Water and Sewerage | Youth Council | ZUMBA FITNESS | Payroll |
| Resources | 127326.0 | Revenue | 31.12.2013 | Water and Sewerage | Transformation Costs | YMCA WINCHESTER HOUSE DAY NURSERY | Payroll |


## Pivot table over two, or more, index columns?
We said that the pivot table applies the aggregate function over a hierarchy of columns - so let's look at that.

Imagine that we need to generate a report that shows the total amount associated with capital or revenue spend for each directorate, with those two amounts reported separately for each directorate. 

The `pivot_table()` method allows us to supply a list of columns for the `index` parameter, and that list is used to create the hierarchical breakdown of the aggregations (later in the module we'll see that these are known as `groups`).


In [15]:
df.pivot_table(index=['Directorate', 'Capital or Revenue'], aggfunc=np.sum)

| Directorate | Capital or Revenue | Amount |
|---|---|---|
| Childrens Services | Capital | 205436.48 |
| Childrens Services | Revenue | 4194924.09 |
| Community Wellbeing & Social Care | Capital | 35711.53 |
| Community Wellbeing & Social Care | Revenue | 4382430.47 |
| Corporate | Capital | 32972.17 |
| Corporate | Revenue | -156889.21 |
| Economy & Environment | Capital | 281325.89 |
| Economy & Environment | Revenue | 3524951.8 |
| Resources | Capital | 107492.38 |
| Resources | Revenue | 668870.85 |


In [None]:
# We could extend this to more columns in the hierarchy, 
#    - although the Isle of Wight data doesn't have a nice short 'third' column to use, 
#      so this looks messy. 
df.pivot_table(index=['Directorate', 'Capital or Revenue', 'Expenses Type'], aggfunc=np.sum)


# Here's another hierarchy with a different aggfunc:
#df.pivot_table(index=['Directorate', 'Capital or Revenue', 'Supplier Name'], aggfunc='count')

__Note:__ in the above we're still relying on the `aggfunc` itself to determine that `Amount` is the only column in the output. If we put `np.max` in place of `np.sum`, what do you think will happen? Remember that you can take the `max` of a set of text strings, numbers or dates, so it gets very messy!

There must be some way to control this ....
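
One option, offered here as a sketch rather than as the module's own recipe, is the `values` parameter of `pivot_table()`, which names the column (or columns) to aggregate. The `toy` DataFrame below is invented for illustration, and the string `'max'` is used in place of `np.max`.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
toy = pd.DataFrame({
    'Directorate': ['Corporate', 'Corporate', 'Resources', 'Resources'],
    'Capital or Revenue': ['Capital', 'Revenue', 'Revenue', 'Revenue'],
    'Supplier Name': ['ACME', 'ZEBRA LTD', 'ACME', 'BETA'],
    'Amount': [100.0, 250.0, 40.0, 60.0],
})

# values='Amount' restricts the aggregation to that one column, so the
# text column is left out even though 'max' is defined for strings.
largest = toy.pivot_table(index=['Directorate', 'Capital or Revenue'],
                          values='Amount', aggfunc='max')

print(largest)
```

Naming the column explicitly means the shape of the output no longer depends on which columns the aggregation function happens to accept.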

Let's first remind ourselves of the two-level hierarchy we saw earlier.

In [16]:
df.pivot_table(index=['Directorate', 'Capital or Revenue'], aggfunc=np.sum)

| Directorate | Capital or Revenue | Amount |
|---|---|---|
| Childrens Services | Capital | 205436.48 |
| Childrens Services | Revenue | 4194924.09 |
| Community Wellbeing & Social Care | Capital | 35711.53 |
| Community Wellbeing & Social Care | Revenue | 4382430.47 |
| Corporate | Capital | 32972.17 |
| Corporate | Revenue | -156889.21 |
| Economy & Environment | Capital | 281325.89 |
| Economy & Environment | Revenue | 3524951.8 |
| Resources | Capital | 107492.38 |
| Resources | Revenue | 668870.85 |


The `pivot_table()` method also lets us produce the grouped result DataFrames with other shapes - in particular we can choose where to split the hierarchy into rows and columns. 

So suppose we want to group the sum totals as before at the `Directorate` level, but this time generate a results DataFrame that splits out the `Capital or Revenue` amounts as distinct columns.

In [17]:
df.pivot_table(index=['Directorate'], columns=['Capital or Revenue'], aggfunc=np.sum)

| Directorate | Amount: Capital | Amount: Revenue |
|---|---|---|
| Childrens Services | 205436.48 | 4194924.09 |
| Community Wellbeing & Social Care | 35711.53 | 4382430.47 |
| Corporate | 32972.17 | -156889.21 |
| Economy & Environment | 281325.89 | 3524951.8 |
| Resources | 107492.38 | 668870.85 |


We could create a hierarchy of rows and columns by using a list for the `index` parameter and a list for the `columns` parameter.
The Isle of Wight data really looks messy if you try to do this! There are too many distinct values in the columns that aren't the `Directorate` or `Capital or Revenue` columns. This gives a very wide and very long table.
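
On a small table the idea is easier to see. The sketch below uses an invented `toy` DataFrame; the `Quarter` column in particular is hypothetical and does not exist in the council data. It passes a list to both `index` and `columns`, names the column to sum with `values`, and uses `fill_value=0` to fill in combinations that never occur.

```python
import pandas as pd

# Hypothetical toy data; 'Quarter' is an invented column for illustration.
toy = pd.DataFrame({
    'Directorate': ['Corporate', 'Corporate', 'Resources', 'Resources', 'Resources'],
    'Service Area': ['Legal', 'Legal', 'IT', 'IT', 'Finance'],
    'Capital or Revenue': ['Capital', 'Revenue', 'Capital', 'Revenue', 'Revenue'],
    'Quarter': ['Q1', 'Q1', 'Q2', 'Q1', 'Q2'],
    'Amount': [100.0, 250.0, 40.0, 60.0, 15.0],
})

# Two levels of row labels and two levels of column labels.
wide = toy.pivot_table(index=['Directorate', 'Service Area'],
                       columns=['Capital or Revenue', 'Quarter'],
                       values='Amount', aggfunc='sum', fill_value=0)

# A single cell is addressed by a (row tuple, column tuple) pair.
legal_capital_q1 = wide.loc[('Corporate', 'Legal'), ('Capital', 'Q1')]

print(wide)
```

Even with five rows the result is mostly zeros, which hints at why the same call on the full dataset produces such a wide, sparse table.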

When there is a hierarchy over the `index` and over the `columns`, you may also want to generate column or row totals. The `pivot_table()` method can help here too. Set the `margins` parameter to be `True` (the default is `False`).

In [18]:
df.pivot_table(index=['Directorate'], columns=['Capital or Revenue'], aggfunc=np.sum, margins=True)

| Directorate | Amount: Capital | Amount: Revenue | Amount: All |
|---|---|---|---|
| Childrens Services | 205436.48 | 4194924.09 | 4400360.57 |
| Community Wellbeing & Social Care | 35711.53 | 4382430.47 | 4418142.0 |
| Corporate | 32972.17 | -156889.21 | -123917.04 |
| Economy & Environment | 281325.89 | 3524951.8 | 3806277.69 |
| Resources | 107492.38 | 668870.85 | 776363.23 |
| All | 662938.45 | 12614288.0 | 13277226.45 |


## Different aggregate functions applied to the columns?

We can also apply different aggregation functions to the relevant columns: `aggfunc` accepts a list, and each function in the list creates a summary column for each compatible column in the original table.

In [19]:
# sum and median both apply to numeric values, so we get two output columns here:
df.pivot_table(index=['Directorate', 'Capital or Revenue'], aggfunc=[np.sum, np.median])

| Directorate | Capital or Revenue | sum: Amount | median: Amount |
|---|---|---|---|
| Childrens Services | Capital | 205436.48 | 4032.0 |
| Childrens Services | Revenue | 4194924.09 | 115.08 |
| Community Wellbeing & Social Care | Capital | 35711.53 | 504.73 |
| Community Wellbeing & Social Care | Revenue | 4382430.47 | 203.69 |
| Corporate | Capital | 32972.17 | 148.62 |
| Corporate | Revenue | -156889.21 | 420.0 |
| Economy & Environment | Capital | 281325.89 | 1870.0 |
| Economy & Environment | Revenue | 3524951.8 | 47.0 |
| Resources | Capital | 107492.38 | 1200.5 |
| Resources | Revenue | 668870.85 | 69.06 |


In [None]:
# and if we have multiple columns, we get a hierarchy:
df.pivot_table(index=['Directorate'], columns=['Capital or Revenue'], aggfunc=[np.sum, np.median])
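
To see how the output columns are labelled when a list of functions is supplied, here is a small sketch on an invented `toy` DataFrame. It uses the string names `'sum'` and `'median'` rather than the numpy functions, and names the column to aggregate with `values`; both are choices made for this sketch.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
toy = pd.DataFrame({
    'Directorate': ['Corporate', 'Corporate', 'Corporate', 'Resources', 'Resources'],
    'Amount': [100.0, 250.0, 10.0, 40.0, 60.0],
})

summary = toy.pivot_table(index='Directorate', values='Amount',
                          aggfunc=['sum', 'median'])

# Each output column is labelled by a (function name, original column) pair.
column_labels = summary.columns.tolist()

# So a single result is looked up with the row label and that pair.
corporate_sum = summary.loc['Corporate', ('sum', 'Amount')]
resources_median = summary.loc['Resources', ('median', 'Amount')]

print(summary)
```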

## Crowd-sourcing time
There's probably a way to apply specific functions to specific named columns (say `sum` to `Amount`, and `max` to `Expenses Type`), but I can't seem to figure that one out.

If you do identify how to do this, put the recipe up on the module forum to share.
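
One candidate recipe, sketched here on an invented `toy` DataFrame rather than the council data: the pandas documentation lists a dict among the accepted forms of `aggfunc`, mapping each column name to the function to apply to it. Passing the same column names through `values` is an extra precaution in this sketch and may not be strictly necessary.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
toy = pd.DataFrame({
    'Directorate': ['Corporate', 'Corporate', 'Resources', 'Resources'],
    'Expenses Type': ['Rent', 'Water', 'Fees', 'Audit'],
    'Amount': [100.0, 250.0, 40.0, 60.0],
})

# The dict maps each column to its own aggregation:
# sum the amounts, take the alphabetically last expenses type.
mixed = toy.pivot_table(index='Directorate',
                        values=['Amount', 'Expenses Type'],
                        aggfunc={'Amount': 'sum', 'Expenses Type': 'max'})

print(mixed)
```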

## Exercise 

Use a pivot table to find the median spend in the capital and revenue categories (as the rows) for each directorate (by column).

In [None]:
# YOUR ANSWER HERE.


In [None]:
# Sample solution. Do attempt the exercise yourself before unfolding this sample solution.
df.pivot_table(index=['Capital or Revenue'], columns=['Directorate'], aggfunc=np.median)

## Summary 

Cross tabulations and pivot tables are quick ways of getting aggregations over the rows and columns of tabular datasets.   Later in the module we'll look at ways in which DataFrame and SQL tables can be directly manipulated to group common table values across columns.

For documentation on crosstabs and pivot tables, the _pandas_ documentation is a good place to start:

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.crosstab.html

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html


## Something to watch out for
There seems to be a lack of consistency in what is permitted for the `aggfunc` parameter.

numpy aggregation functions are passed using the `np.name` notation (`np.median`, `np.sum`, `np.max`, etc.), whereas other aggregation functions require their quoted string name (`'count'`, `'median'`, etc.).

I assume this represents a different naming convention, but I'm having trouble working out why.
If you have any insights, please pop them on the forum.
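
A possible reading, offered tentatively: `aggfunc` appears to accept two kinds of thing. A quoted string names one of pandas' own built-in group aggregations, while an unquoted name is simply a Python callable that takes a group's values and returns a single value; numpy functions are one source of such callables, but a function you write yourself works too. The sketch below uses an invented `toy` DataFrame and a hypothetical helper called `spread`.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
toy = pd.DataFrame({
    'Directorate': ['Corporate', 'Corporate', 'Corporate', 'Resources', 'Resources'],
    'Amount': [100.0, 250.0, 10.0, 40.0, 60.0],
})


def spread(amounts):
    """Hypothetical helper: the range (max minus min) of one group's values."""
    return amounts.max() - amounts.min()


# A quoted string selects a built-in pandas aggregation by name.
by_name = toy.pivot_table(index='Directorate', values='Amount', aggfunc='median')

# An unquoted name passes the function object itself.
by_callable = toy.pivot_table(index='Directorate', values='Amount', aggfunc=spread)

print(by_name)
print(by_callable)
```

On this reading, `'count'` needs quotes only because there is no ready-made object called `count` in scope to pass.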


## What next?
If you are working through this Notebook as part of an inline exercise, return to the module materials now.

If you are working through this set of Notebooks as a whole, move on to the `04.2 Descriptive statistics in pandas` Notebook.