# DAML 03 - Pandas Indexes as Dimensions

Michal Grochmal <michal.grochmal@city.ac.uk>

Indexes in `pandas` do much more than enumerate the rows of a series or data frame.
An index entry can hold a list of values rather than a single value; in other words,
we can use a combination of values as the index, a multi-index.  Column names are also
an index and can be multi-valued as well.
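
As a minimal sketch of the idea, a multi-index can be built directly from tuples of values.
The labels and numbers below are invented purely for illustration and are not part of the
dataset used later.

```python
import pandas as pd

# Hypothetical toy data: each row is labelled by a (letter, number) pair.
row_index = pd.MultiIndex.from_tuples(
    [('a', 1), ('a', 2), ('b', 1), ('b', 2)],
    names=['letter', 'number'])
toy = pd.Series([10, 20, 30, 40], index=row_index)

# Selecting by the first value only returns every row under that label.
toy_a = toy['a']

# Selecting by the full combination of values returns a single element.
toy_b2 = toy[('b', 2)]

# Column names are an index too, and can be multi-valued in the same way.
col_index = pd.MultiIndex.from_product(
    [['x', 'y'], ['min', 'max']], names=['variable', 'stat'])
frame = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=col_index)
frame_x = frame['x']
```

Selecting with only part of the combination (`toy['a']` or `frame['x']`) leaves the
remaining values as the index of the result.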

If rows and columns are indexed in the same way, we can exchange the row (index) labels
with the column labels and reposition the data appropriately.  Moreover, if several values
are used to index a row or column, we can move only some of those values from rows to
columns, or vice versa.  That sounds horribly complicated, but it is actually a common task
within databases, notably data warehouses.  The operation of exchanging labels between rows
and columns whilst reordering the data accordingly is called **pivoting** or **crosstabbing**
in database jargon.  Database software extensions often provide pivot/crosstab operations.
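
A small sketch of moving only some of the index values between rows and columns might look
as follows.  The shop and quarter labels are hypothetical and chosen only to keep the example short.

```python
import pandas as pd

# Hypothetical toy data: sales labelled by a (shop, quarter) combination.
index = pd.MultiIndex.from_product(
    [['north', 'south'], ['Q1', 'Q2']], names=['shop', 'quarter'])
sales = pd.Series([5, 7, 3, 9], index=index)

# Move only the 'quarter' labels from the rows to the columns;
# 'shop' stays as the row index and the values are repositioned to match.
wide = sales.unstack('quarter')

# Moving the column labels back to the rows recovers the original labelling.
long_again = wide.stack()
```

Here `wide` has one row per shop and one column per quarter, whilst `long_again` is again
labelled by the `(shop, quarter)` pairs.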

`pandas` provides several ways to pivot columns and rows; we will see some examples next.
The most important point of the pivot operation is that we can represent several
dimensions in a smaller number of dimensions by labeling data with combinations of values.
For instance, we can represent a function of the form $f(x, y) = z$ either by storing a
2-dimensional grid of $x$ and $y$ points that maps to values of $z$, or by building a long
1-dimensional list of points of the form $(x, y)$ and mapping each point to its $z$ value.
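
The two representations can be sketched with a tiny, made-up function.  The choice of
$f(x, y) = x + 10y$ and the sample points are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical function, chosen only for illustration: f(x, y) = x + 10 * y.
x = [0, 1, 2]
y = [0, 1]

# 2-dimensional representation: rows are labelled by x, columns by y.
grid = pd.DataFrame([[xi + 10 * yi for yi in y] for xi in x],
                    index=pd.Index(x, name='x'),
                    columns=pd.Index(y, name='y'))

# 1-dimensional representation: every z value is labelled by an (x, y) pair.
points = grid.stack()
```

Both objects hold the same six values of $z$; only the labelling differs.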

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
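# The seaborn styles were renamed in matplotlib 3.6; fall back to the old name on older versions
style = 'seaborn-v0_8-whitegrid'
plt.style.use(style if style in plt.style.available else 'seaborn-whitegrid')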
mpl.rcParams['figure.figsize'] = (12.5, 6.0)
import pandas as pd
pd.options.display.max_rows = 12

In [None]:
country = ['Northern Ireland', 'Scotland', 'Wales', 'England', 'Isle of Man']
capital = ['Belfast', 'Edinburgh', 'Cardiff', 'London', 'Douglas']
area = [14130, 77933, 20779, 130279, 572]
population2017 = [1876695, 5404700, np.nan, 55268100, np.nan]
population2011 = [1810863, 5313600, 3063456, 53012456, 83314]
uk_area = pd.Series(area, index=country)
uk_area.sort_index(inplace=True)
array = np.array([area, capital, population2011, population2017]).T
data = pd.DataFrame({'capital': capital,
                     'area': area,
                     'population 2011': population2011,
                     'population 2017': population2017},
                    index=country)
data.sort_index(inplace=True)
data

In [None]:
pop = data[['population 2011', 'population 2017']]
pop.columns = [2011, 2017]
pop

In [None]:
pop_year = pop.stack()
pop_year

In [None]:
pop_year['England']

In [None]:
pop_year[('England', 2011)]

In [None]:
pop_year[:, 2017]

In [None]:
pop_year.unstack()

In [None]:
pop.index.name = 'country'
pop_full = pop.reset_index()
pop_full

In [None]:
pop_melt = pop_full.melt(id_vars=['country'], var_name='year')
pop_melt

In [None]:
pop_full = pop_melt.pivot(index='country', columns='year', values='value')
pop_full

In [None]:
# Aggregate only the 'value' column: its mean and its count of non-null entries per country
pop_agg = pop_melt.pivot_table(index='country', values='value', aggfunc=['mean', 'count'])
pop_agg

In [None]:
pop_agg.columns = ['mean', 'not null']
pop_agg