# DAML 04 - Pandas Indexes as Dimensions

Michal Grochmal <michal.grochmal@city.ac.uk>

Indexes in `pandas` do much more than enumerate the rows of a series or data frame.
An index entry does not have to be a single value: it can be a combination of values,
in which case we have a multi-index.  Column names are also
an index and can be multi-valued as well.
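
As a minimal sketch of the idea, a multi-index can also be built by hand. The variable names here (`population`, `frame`) are only for illustration, and the figures are the ones used in the rest of this notebook.

```python
import pandas as pd

# Each index entry is a (country, year) pair instead of a single label.
index = pd.MultiIndex.from_tuples(
    [('England', 2011), ('England', 2017), ('Wales', 2011)],
    names=['country', 'year'])
population = pd.Series([53012456, 55268100, 3063456], index=index)

print(population.index.nlevels)      # how many values make up one index entry
print(list(population.index.names))  # the name of each level

# Columns are an index as well, so they can carry more than one level too.
columns = pd.MultiIndex.from_product([['population'], [2011, 2017]])
frame = pd.DataFrame([[53012456, 55268100]], index=['England'], columns=columns)
print(frame.columns.nlevels)
```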

Let's pick the numeric columns from our British Isles data frame and *stack* them
together into a multiple index.

In [1]:
import numpy as np
import pandas as pd
pd.options.display.max_rows = 12

In [2]:
country = ['Northern Ireland', 'Scotland', 'Wales', 'England', 'Isle of Man']
capital = ['Belfast', 'Edinburgh', 'Cardiff', 'London', 'Douglas']
area = np.array([14130, 77933, 20779, 130279, 572])
population2017 = np.array([1876695, 5404700, np.nan, 55268100, np.nan])
population2011 = np.array([1810863, 5313600, 3063456, 53012456, 83314])
data = pd.DataFrame({'capital': capital,
                     'area': area,
                     'population 2011': population2011,
                     'population 2017': population2017},
                    index=country).sort_index()
data

,area,capital,population 2011,population 2017
England,130279,London,53012456,55268100.0
Isle of Man,572,Douglas,83314,
Northern Ireland,14130,Belfast,1810863,1876695.0
Scotland,77933,Edinburgh,5313600,5404700.0
Wales,20779,Cardiff,3063456,


In [3]:
pop = data[['population 2011', 'population 2017']]
pop.columns = [2011, 2017]
pop

,2011,2017
England,53012456,55268100.0
Isle of Man,83314,
Northern Ireland,1810863,1876695.0
Scotland,5313600,5404700.0
Wales,3063456,


We renamed the columns to 2011 and 2017 to keep the examples short.

Now we stack, and we get a `Series` with a two-level index.
Note that the missing 2017 values do not appear in the stacked output below.

In [4]:
pop_year = pop.stack()
pop_year

England           2011    53012456.0
                  2017    55268100.0
Isle of Man       2011       83314.0
Northern Ireland  2011     1810863.0
                  2017     1876695.0
Scotland          2011     5313600.0
                  2017     5404700.0
Wales             2011     3063456.0
dtype: float64

Selecting only one part of the index returns a series with a single-level index,
which may contain more than one value.

In [5]:
pop_year['England']

2011    53012456.0
2017    55268100.0
dtype: float64

To get a single value we select a combined index value.
Note that the tuple syntax is not necessary for `Series`
but may be needed for data frames.

In [6]:
pop_year[('England', 2011)]

53012456.0
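
For data frames, a minimal sketch of the tuple syntax might look as follows. The frame and its `population` column are built here only for illustration.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('England', 2011), ('England', 2017), ('Scotland', 2011), ('Scotland', 2017)],
    names=['country', 'year'])
series = pd.Series([53012456.0, 55268100.0, 5313600.0, 5404700.0], index=index)
frame = series.to_frame(name='population')

# .loc on a data frame reads "row label, column label"; writing the combined
# row label as a tuple keeps it clearly apart from the column label.
value = frame.loc[('England', 2011), 'population']
print(value)
```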

The slicing operators allow us to select parts of the index.
For example, we can select all places that have data for 2017.

In [7]:
pop_year[:, 2017]

England             55268100.0
Northern Ireland     1876695.0
Scotland             5404700.0
dtype: float64
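
The same kind of selection can be sketched with the `xs` (cross-section) method, which names the level explicitly. The series below is rebuilt by hand so that the example stands on its own.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('England', 2011), ('England', 2017), ('Isle of Man', 2011),
     ('Scotland', 2011), ('Scotland', 2017)],
    names=['country', 'year'])
series = pd.Series([53012456.0, 55268100.0, 83314.0, 5313600.0, 5404700.0],
                   index=index)

# Every entry whose 'year' level equals 2017; the year level is dropped.
by_year = series.xs(2017, level='year')
print(by_year)

# Same selection, but keeping both index levels in the result.
by_year_full = series.xs(2017, level='year', drop_level=False)
print(by_year_full)
```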

By *unstacking* we get back the data frame.

In [8]:
pop_year.unstack()

,2011,2017
England,53012456.0,55268100.0
Isle of Man,83314.0,
Northern Ireland,1810863.0,1876695.0
Scotland,5313600.0,5404700.0
Wales,3063456.0,
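
By default `unstack` moves the last index level into the columns. As a small sketch on a hand-built series, the `level` argument chooses a different level instead.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('England', 2011), ('England', 2017), ('Wales', 2011)],
    names=['country', 'year'])
series = pd.Series([53012456.0, 55268100.0, 3063456.0], index=index)

# Last level (year) becomes the columns, countries stay on the rows.
by_year_columns = series.unstack()

# First level (country) becomes the columns, years stay on the rows.
by_country_columns = series.unstack(level=0)

# Combinations that do not exist, here ('Wales', 2017), are filled with NaN.
print(by_year_columns)
print(by_country_columns)
```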


## Indexes on Data Frames

If rows and columns are indexed in the same way we can exchange the row
(index) labels with column labels and reposition data appropriately.
Moreover, if several values are used to index a row or column we can exchange
only some of the values between rows and columns, or vice-versa.
That sounds horribly complicated but it is actually a common task within databases,
notably data warehouses.
The operation of changing labels between rows and columns whilst reordering the data
accordingly is called **pivoting** or **crosstabbing** in database jargon.
Database software extensions often provide pivot or crosstab operations.
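
As a rough sketch of exchanging only some of the index values, we can take a data frame indexed by two values per row and move just one of them into the columns. The frame below is constructed for illustration only.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['England', 'Scotland'], [2011, 2017]], names=['country', 'year'])
frame = pd.DataFrame({'population': [53012456, 55268100, 5313600, 5404700]},
                     index=index)

# Move only the 'year' part of the row index into the columns;
# 'country' stays on the rows.
wide = frame.unstack('year')
print(wide)

# Moving 'country' instead keeps 'year' on the rows.
wide_by_country = frame.unstack('country')
print(wide_by_country)
```

In both results the columns become a multi-index: the original column name plus the value that was moved over from the rows.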

Before we attempt pivoting let's try to move the index into the data frame itself.

In [9]:
pop.index.name = 'country'
pop_full = pop.reset_index()
pop_full

,country,2011,2017
0,England,53012456,55268100.0
1,Isle of Man,83314,
2,Northern Ireland,1810863,1876695.0
3,Scotland,5313600,5404700.0
4,Wales,3063456,


We moved the index into a column, good.
But since the column labels are an index as well, we can move them into the data frame too.

In [10]:
pop_melt = pop_full.melt(id_vars=['country'], var_name='year')
pop_melt

,country,year,value
0,England,2011,53012456.0
1,Isle of Man,2011,83314.0
2,Northern Ireland,2011,1810863.0
3,Scotland,2011,5313600.0
4,Wales,2011,3063456.0
5,England,2017,55268100.0
6,Isle of Man,2017,
7,Northern Ireland,2017,1876695.0
8,Scotland,2017,5404700.0
9,Wales,2017,


Melting a data frame produces long-format data,
i.e. we see the same data as before, but instead of looking up a row and a column
we look at a combination of columns in a single row to understand what the "value" means.
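
A small sketch on a two-country subset shows how to read the melted form. Here `value_name` is used to give the generic "value" column a descriptive name, and the name `population` is our own choice.

```python
import pandas as pd

wide = pd.DataFrame({'country': ['England', 'Wales'],
                     2011: [53012456.0, 3063456.0],
                     2017: [55268100.0, float('nan')]})

# One row per (country, year) pair: 2 countries x 2 years = 4 rows.
long = wide.melt(id_vars=['country'], var_name='year', value_name='population')
print(long)

# Once the data is long, rows without a measurement are easy to drop.
long_known = long.dropna(subset=['population'])
print(long_known)
```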

The *pivot* operation is the opposite of melting.
We build meaningful columns from the data in the rows.

In [11]:
pop_full = pop_melt.pivot(index='country', columns='year', values='value')
pop_full

year,2011,2017
country,,
England,53012456.0,55268100.0
Isle of Man,83314.0,
Northern Ireland,1810863.0,1876695.0
Scotland,5313600.0,5404700.0
Wales,3063456.0,
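
One caveat worth sketching: `pivot` needs each combination of index and column labels to appear at most once. The toy data below (keys `a`, `b` and column `x` are made up for the illustration) has a repeated combination, so `pivot` refuses it, whereas `pivot_table` combines the duplicates with an aggregation function.

```python
import pandas as pd

# Toy data: the pair ('a', 'x') appears twice.
records = pd.DataFrame({
    'key': ['a', 'a', 'b'],
    'column': ['x', 'x', 'x'],
    'value': [1.0, 3.0, 5.0],
})

try:
    records.pivot(index='key', columns='column', values='value')
    pivot_failed = False
except ValueError as error:
    pivot_failed = True
    print('pivot refused:', error)

# pivot_table aggregates the duplicated entries, here by taking their mean.
table = records.pivot_table(index='key', columns='column', values='value',
                            aggfunc='mean')
print(table)
```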


Pivoting is powerful:
not only can it build new columns, it can also aggregate the resulting values.
In `pandas` the `pivot_table` method accepts aggregation functions.
Here we compute the mean and keep the number of values from which the mean was taken,
together with the maximum and the minimum.

In [12]:
pop_agg = pop_melt[['country', 'value']].pivot_table(
    index='country', aggfunc=[np.mean, lambda x: np.sum(~np.isnan(x)), np.max, np.min])
pop_agg

,mean,<lambda>,amax,amin
,value,value,value,value
country,,,,
England,54140278.0,2.0,55268100.0,53012456.0
Isle of Man,83314.0,1.0,83314.0,83314.0
Northern Ireland,1843779.0,2.0,1876695.0,1810863.0
Scotland,5359150.0,2.0,5404700.0,5313600.0
Wales,3063456.0,1.0,3063456.0,3063456.0
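
As an alternative sketch, the aggregations can be given as strings, in which case `'count'` takes the place of the lambda because it counts only the non-missing values. Depending on the installed version, `pandas` may also emit a warning when NumPy functions are passed, which the string names avoid. The small frame below is a two-country subset built for the example.

```python
import numpy as np
import pandas as pd

long = pd.DataFrame({
    'country': ['England', 'England', 'Wales', 'Wales'],
    'value': [53012456.0, 55268100.0, 3063456.0, np.nan],
})

# String names select the built-in aggregations; 'count' ignores NaN.
summary = long[['country', 'value']].pivot_table(
    index='country', aggfunc=['mean', 'count', 'max', 'min'])
print(summary)
```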


There is a side effect here.
Since we may aggregate on more than a single column at once we get a multi-index
on the columns.  Since we do not need it for this case we name the columns ourselves.

In [13]:
pop_agg.columns = ['mean', 'not null', 'max', 'min']
pop_agg

,mean,not null,max,min
country,,,,
England,54140278.0,2.0,55268100.0,53012456.0
Isle of Man,83314.0,1.0,83314.0,83314.0
Northern Ireland,1843779.0,2.0,1876695.0,1810863.0
Scotland,5359150.0,2.0,5404700.0,5313600.0
Wales,3063456.0,1.0,3063456.0,3063456.0
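
Instead of typing the names by hand, the column multi-index can also be flattened programmatically. The following is a sketch on a one-row frame built only for this purpose.

```python
import pandas as pd

columns = pd.MultiIndex.from_tuples(
    [('mean', 'value'), ('max', 'value'), ('min', 'value')])
agg = pd.DataFrame([[54140278.0, 55268100.0, 53012456.0]],
                   index=['England'], columns=columns)

# Keep only the first level (the name of the aggregation) ...
flat = agg.copy()
flat.columns = flat.columns.get_level_values(0)
print(list(flat.columns))

# ... or join both levels when each of them carries information.
joined = agg.copy()
joined.columns = ['_'.join(levels) for levels in agg.columns]
print(list(joined.columns))
```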


## Dimensions

`pandas` provides several ways to pivot columns and rows;
for example, stacking and unstacking can also be performed on data frames.
Yet the most important point of the pivot operation is that we can represent several
dimensions in a smaller number of dimensions by labeling data with combinations of values.

This is just like representing a function of the form $f(x, y) = z$: we can either store a
2-dimensional grid of $x$ and $y$ points mapping to values of $z$, or build a long
1-dimensional list of points of the form $(x, y)$ and map each one to its $z$ value.
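
The two representations can be sketched with a toy function, here $z = x + 10y$, chosen only because its values are easy to check by eye.

```python
import pandas as pd

x = [0, 1, 2]
y = [0, 1]

# 1-dimensional representation: one entry per (x, y) point.
index = pd.MultiIndex.from_product([x, y], names=['x', 'y'])
points = pd.Series([xi + 10 * yi for xi, yi in index], index=index, name='z')
print(points)

# 2-dimensional representation: x on the rows, y on the columns, z in the cells.
grid = points.unstack('y')
print(grid)
```

Both objects hold the same six values; only the labeling differs.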