# DAML 03 - Pandas Indexes as Dimensions

Michal Grochmal <michal.grochmal@city.ac.uk>

Indexes in `pandas` do much more than enumerate the rows of a series or data frame.
An index entry can hold several values at once, in other words we can use a combination
of values as the label of a single element: a multi-index.  Column names are also
an index and can be multi-valued as well.

If rows and columns are indexed in the same way, we can exchange the row (index) labels
with the column labels and reposition the data appropriately.  Moreover, if several values are
used to index a row or column, we can move only some of those values from the rows to the
columns, or vice versa.  That sounds horribly complicated but it is actually a common task
within databases, notably data warehouses.  The operation of moving labels between rows
and columns whilst rearranging the data accordingly is called **pivoting** or **crosstabbing**
in database jargon.  Database software extensions often provide pivot/crosstab operations.

`pandas` provides several ways to pivot columns and rows; we will see some examples next.
The most important point of the pivot operation is that we can represent several
dimensions in a smaller number of dimensions by labeling data with combinations of values.
For example, we can represent a function of the form $f(x, y) = z$ either by storing a
2-dimensional grid of $x$ and $y$ points mapping to values of $z$, or by building a long
1-dimensional list of points of the form $(x, y)$ and mapping each to its $z$ value.
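
The two representations of $f(x, y) = z$ can be sketched with a toy function.  This is only an illustration: the function $z = x + y$, the sample points and the names `grid` and `points` are assumptions made for the example, not part of the data used later.

```python
import numpy as np
import pandas as pd

# Hypothetical sample points, chosen only for illustration.
x = [0, 1, 2]
y = [10, 20]

# 2-dimensional representation: rows are x, columns are y, cells hold z = x + y.
grid = pd.DataFrame(np.add.outer(x, y), index=x, columns=y)
grid.index.name = 'x'
grid.columns.name = 'y'

# 1-dimensional representation: every (x, y) pair labels a single z value.
points = grid.stack()

print(grid)
print(points)
```

Both objects hold the same six values; only the way they are labeled differs.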

In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
plt.style.use('seaborn-whitegrid')
mpl.rcParams['figure.figsize'] = (12.5, 6.0)
import pandas as pd
pd.options.display.max_rows = 12

In [2]:
country = ['Northern Ireland', 'Scotland', 'Wales', 'England', 'Isle of Man']
capital = ['Belfast', 'Edinburgh', 'Cardiff', 'London', 'Douglas']
area = [14130, 77933, 20779, 130279, 572]
population2017 = [1876695, 5404700, np.nan, 55268100, np.nan]
population2011 = [1810863, 5313600, 3063456, 53012456, 83314]
uk_area = pd.Series(area, index=country)
uk_area.sort_index(inplace=True)
array = np.array([area, capital, population2011, population2017]).T
data = pd.DataFrame({'capital': capital,
                     'area': area,
                     'population 2011': population2011,
                     'population 2017': population2017},
                    index=country)
data.sort_index(inplace=True)
data

                    area    capital  population 2011  population 2017
England           130279     London         53012456       55268100.0
Isle of Man          572    Douglas            83314              NaN
Northern Ireland   14130    Belfast          1810863        1876695.0
Scotland           77933  Edinburgh          5313600        5404700.0
Wales              20779    Cardiff          3063456              NaN


In [3]:
pop = data[['population 2011', 'population 2017']]
pop.columns = [2011, 2017]
pop

                      2011        2017
England           53012456  55268100.0
Isle of Man          83314         NaN
Northern Ireland   1810863   1876695.0
Scotland           5313600   5404700.0
Wales              3063456         NaN


In [4]:
pop_year = pop.stack()
pop_year

England           2011    53012456.0
                  2017    55268100.0
Isle of Man       2011       83314.0
Northern Ireland  2011     1810863.0
                  2017     1876695.0
Scotland          2011     5313600.0
                  2017     5404700.0
Wales             2011     3063456.0
dtype: float64

In [5]:
pop_year['England']

2011    53012456.0
2017    55268100.0
dtype: float64

In [6]:
pop_year[('England', 2011)]

53012456.0

In [7]:
pop_year[:, 2017]

England             55268100.0
Northern Ireland     1876695.0
Scotland             5404700.0
dtype: float64
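
The three selections above (by outer label, by a full key, and by inner label) can also be written with `.loc` and `xs`.  The sketch below rebuilds a small two-level series so that it runs on its own; the level names `country` and `year` and the variable names are assumptions added for the example.

```python
import pandas as pd

# A small two-level series; the values are copied from the output above.
index = pd.MultiIndex.from_tuples(
    [('England', 2011), ('England', 2017), ('Isle of Man', 2011),
     ('Scotland', 2011), ('Scotland', 2017)],
    names=['country', 'year'])
s = pd.Series([53012456.0, 55268100.0, 83314.0, 5313600.0, 5404700.0],
              index=index)

by_country = s.loc['England']        # all years for one country
single = s.loc[('England', 2011)]    # one value, addressed by the full key
by_year = s.loc[:, 2017]             # all countries that have a 2017 value

# xs() takes the same cross-section, addressing the level by its name.
by_year_xs = s.xs(2017, level='year')
```

Naming the levels is optional, but it lets `xs` refer to `'year'` rather than to a level position.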

In [8]:
pop_year.unstack()

                        2011        2017
England           53012456.0  55268100.0
Isle of Man          83314.0         NaN
Northern Ireland   1810863.0   1876695.0
Scotland           5313600.0   5404700.0
Wales              3063456.0         NaN
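
By default `unstack` moves the innermost index level to the columns, but any level can be chosen.  A minimal sketch, assuming a series with levels named `country` and `year` (names introduced here for illustration):

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('England', 2011), ('England', 2017), ('Isle of Man', 2011),
     ('Scotland', 2011), ('Scotland', 2017)],
    names=['country', 'year'])
s = pd.Series([53012456.0, 55268100.0, 83314.0, 5313600.0, 5404700.0],
              index=index)

# Default: the innermost level (year) becomes the columns.
wide_by_country = s.unstack()

# Moving the country level instead leaves years as rows.
wide_by_year = s.unstack(level='country')

print(wide_by_year)
```

Combinations missing from the series, such as the Isle of Man in 2017, appear as `NaN` in the wide result.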


In [9]:
pop.index.name = 'country'
pop_full = pop.reset_index()
pop_full

            country      2011        2017
0           England  53012456  55268100.0
1       Isle of Man     83314         NaN
2  Northern Ireland   1810863   1876695.0
3          Scotland   5313600   5404700.0
4             Wales   3063456         NaN


In [10]:
pop_melt = pop_full.melt(id_vars=['country'], var_name='year')
pop_melt

            country  year       value
0           England  2011  53012456.0
1       Isle of Man  2011     83314.0
2  Northern Ireland  2011   1810863.0
3          Scotland  2011   5313600.0
4             Wales  2011   3063456.0
5           England  2017  55268100.0
6       Isle of Man  2017         NaN
7  Northern Ireland  2017   1876695.0
8          Scotland  2017   5404700.0
9             Wales  2017         NaN
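
Unlike the stacked series seen earlier, the melted frame keeps the rows with missing values.  The following sketch, on a reduced two-country table, names the value column with `value_name` and drops the missing observations explicitly; the column name `population` and the variable names are assumptions made for the example.

```python
import numpy as np
import pandas as pd

wide = pd.DataFrame({'country': ['England', 'Wales'],
                     2011: [53012456, 3063456],
                     2017: [55268100.0, np.nan]})

# One row per (country, year) pair; value_name replaces the default 'value'.
melted = wide.melt(id_vars=['country'], var_name='year',
                   value_name='population')

# melt keeps missing observations, so remove them only if that is wanted.
observed = melted.dropna(subset=['population'])

print(observed)
```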


In [11]:
pop_full = pop_melt.pivot(index='country', columns='year', values='value')
pop_full

year                    2011        2017
country
England           53012456.0  55268100.0
Isle of Man          83314.0         NaN
Northern Ireland   1810863.0   1876695.0
Scotland           5313600.0   5404700.0
Wales              3063456.0         NaN
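
`pivot` only rearranges data, so every index/column pair must occur at most once.  When pairs repeat, `pivot_table` can aggregate them instead.  A minimal sketch with made-up records (the values and the name `records` are hypothetical):

```python
import pandas as pd

# Hypothetical long table in which the pair (England, 2011) appears twice.
records = pd.DataFrame({'country': ['England', 'England', 'Wales'],
                        'year': [2011, 2011, 2011],
                        'value': [1.0, 3.0, 5.0]})

try:
    records.pivot(index='country', columns='year', values='value')
    pivot_failed = False
except ValueError:
    # pivot() cannot decide which of the duplicated values to keep.
    pivot_failed = True

# pivot_table() aggregates the duplicates, here with the mean.
table = records.pivot_table(index='country', columns='year',
                            values='value', aggfunc='mean')
print(table)
```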


In [12]:
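# Name the value column explicitly so that only it is aggregated;
# the lambda counts the non-missing values for each country.
pop_agg = pop_melt.pivot_table(index='country', values='value',
                               aggfunc=['mean', lambda x: np.sum(~np.isnan(x))])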
pop_agg

                        mean <lambda>
                       value    value
country
England           54140278.0      2.0
Isle of Man          83314.0      1.0
Northern Ireland   1843779.0      2.0
Scotland           5359150.0      2.0
Wales              3063456.0      1.0


In [13]:
pop_agg.columns = ['mean', 'not null']
pop_agg

                        mean  not null
country
England           54140278.0       2.0
Isle of Man          83314.0       1.0
Northern Ireland   1843779.0       2.0
Scotland           5359150.0       2.0
Wales              3063456.0       1.0
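
A similar summary with flat column names can be produced in one step with `groupby` and named aggregation, which avoids renaming the multi-level columns afterwards.  This is an alternative sketch rather than the approach used above; it assumes a `pandas` version that supports named aggregation (0.25 or later) and uses a reduced, two-country copy of the data.

```python
import numpy as np
import pandas as pd

melted = pd.DataFrame({'country': ['England', 'England', 'Wales', 'Wales'],
                       'year': [2011, 2017, 2011, 2017],
                       'value': [53012456.0, 55268100.0, 3063456.0, np.nan]})

# Each keyword becomes a column name; 'count' ignores missing values.
summary = melted.groupby('country')['value'].agg(mean='mean',
                                                 not_null='count')
print(summary)
```

Here `not_null` holds integer counts, whereas the lambda used above returns floats.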
