# Hierarchical Indexing

Often it is useful to go beyond the one-dimensional and two-dimensional data stored in
Pandas Series and DataFrame objects and store higher-dimensional data, that is, data
indexed by more than one or two keys. Older versions of Pandas provided Panel and
Panel4D objects that natively handled three-dimensional and four-dimensional data, but
these have been removed from recent releases. A far more common pattern in practice is
hierarchical indexing (also known as multi-indexing), which incorporates multiple index
levels within a single index. In this way, higher-dimensional data can be compactly
represented within the familiar one-dimensional Series and two-dimensional DataFrame objects.

In [1]:
import pandas as pd 
import numpy as np 

## A Multiply Indexed Series
Let’s start by considering how we might represent two-dimensional data within a
one-dimensional Series. For concreteness, we will consider a series of data where
each point has a character and numerical key.

### The bad way
Suppose you would like to track data about products from two different years. Using the
Pandas tools, you might be tempted to simply use Python tuples as keys:

In [2]:
index = [('Bananas', 2016), ('Bananas', 2017),
         ('Tomatoes', 2016), ('Tomatoes', 2017),
         ('Onions', 2016), ('Onions', 2017)]
volume = [33871648, 37253956, 18976457, 19378102, 20851820, 25145561]
vol = pd.Series(volume, index=index)
vol

(Bananas, 2016)     33871648
(Bananas, 2017)     37253956
(Tomatoes, 2016)    18976457
(Tomatoes, 2017)    19378102
(Onions, 2016)      20851820
(Onions, 2017)      25145561
dtype: int64

With this indexing scheme, you can straightforwardly index or slice the series based
on this multiple index:

In [3]:
vol[('Bananas', 2016) : ('Tomatoes', 2017)]

(Bananas, 2016)     33871648
(Bananas, 2017)     37253956
(Tomatoes, 2016)    18976457
(Tomatoes, 2017)    19378102
dtype: int64

But the convenience ends there. For example, if you need to select all values from
2016, you’ll need to do some messy (and potentially slow) munging to make it
happen:

In [4]:
# This produces the desired result, but is not as clean (or as efficient for
# large datasets) as the slicing syntax we’ve grown to love in Pandas.
vol[[i for i in vol.index if i[1] == 2016]]

(Bananas, 2016)     33871648
(Tomatoes, 2016)    18976457
(Onions, 2016)      20851820
dtype: int64

### The better way: Pandas MultiIndex
Fortunately, Pandas provides a better way. Our tuple-based indexing is essentially a
rudimentary multi-index, and the Pandas MultiIndex type gives us the type of operations
we wish to have. We can create a multi-index from the tuples as follows:

In [5]:
index = pd.MultiIndex.from_tuples(index)
index

MultiIndex([( 'Bananas', 2016),
            ( 'Bananas', 2017),
            ('Tomatoes', 2016),
            ('Tomatoes', 2017),
            (  'Onions', 2016),
            (  'Onions', 2017)],
           )

If we reindex our series with this MultiIndex, we see the hierarchical representation
of the data:

In [6]:
vol = vol.reindex(index)
vol

Bananas   2016    33871648
          2017    37253956
Tomatoes  2016    18976457
          2017    19378102
Onions    2016    20851820
          2017    25145561
dtype: int64

Here the first two columns of the Series representation show the multiple index values,
while the third column shows the data. Notice that some entries are missing in
the first column: in this multi-index representation, any blank entry indicates the
same value as the line above it.

Now to access all data for which the second index is 2016, we can simply use the Pandas
slicing notation:

In [7]:
vol[:, 2016]

Bananas     33871648
Tomatoes    18976457
Onions      20851820
dtype: int64

The result is a singly indexed array with just the keys we’re interested in. This syntax
is much more convenient (and the operation is much more efficient!) than the homespun
tuple-based multi-indexing solution that we started with. We’ll now further discuss
this sort of indexing operation on hierarchically indexed data.
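
As a small self-contained check of the comparison above, the following sketch selects the 2016 values both ways and confirms they agree. The names `keys`, `vol_multi`, `by_hand`, and `by_index` are illustrative and not part of the notebook.

```python
import pandas as pd

# Same data as above: (product, year) -> volume
keys = [('Bananas', 2016), ('Bananas', 2017),
        ('Tomatoes', 2016), ('Tomatoes', 2017),
        ('Onions', 2016), ('Onions', 2017)]
volume = [33871648, 37253956, 18976457, 19378102, 20851820, 25145561]

# Build the Series directly on a MultiIndex instead of reindexing afterwards
vol_multi = pd.Series(volume, index=pd.MultiIndex.from_tuples(keys))

# Tuple-style selection: filter the keys by hand in plain Python
by_hand = [v for (product, year), v in zip(keys, volume) if year == 2016]

# MultiIndex selection: an empty slice keeps every product, 2016 picks the year
by_index = vol_multi[:, 2016]

print(by_index)
print(by_hand == by_index.tolist())
```

Building the Series on the MultiIndex from the start avoids the separate reindex step used in the cells above; both routes should give the same object.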

### MultiIndex as an extra dimension
You might notice something else here: we could easily have stored the same data
using a simple DataFrame with index and column labels. In fact, Pandas is built with
this equivalence in mind. The unstack() method will quickly convert a multiply indexed
Series into a conventionally indexed DataFrame:

In [8]:
# unstack method
vol_df = vol.unstack()
vol_df

              2016      2017
Bananas   33871648  37253956
Onions    20851820  25145561
Tomatoes  18976457  19378102


Naturally, the stack() method provides the opposite operation:

In [9]:
# stack method
vol_df.stack()

Bananas   2016    33871648
          2017    37253956
Onions    2016    20851820
          2017    25145561
Tomatoes  2016    18976457
          2017    19378102
dtype: int64

Seeing this, you might wonder why we would bother with hierarchical indexing
at all. The reason is simple: just as we were able to use multi-indexing to represent
two-dimensional data within a one-dimensional Series, we can also use it to represent
data of three or more dimensions in a Series or DataFrame. Each extra level in a
multi-index represents an extra dimension of data; taking advantage of this property
gives us much more flexibility in the types of data we can represent. Concretely, we
might want to add another column of pricing data for each product in each year;
with a MultiIndex this is as easy as adding another column to the DataFrame:

In [10]:
vol_df = pd.DataFrame({'total': vol,
                       'unit_price': [33, 34,
                                      68, 74,
                                      88, 101]})
vol_df

                  total  unit_price
Bananas  2016  33871648          33
         2017  37253956          34
Tomatoes 2016  18976457          68
         2017  19378102          74
Onions   2016  20851820          88
         2017  25145561         101
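
To make the "each extra level is an extra dimension" idea concrete, here is a minimal sketch with a third index level. The `region` level and the integer values are hypothetical placeholders invented for illustration; they are not part of the product data above.

```python
import pandas as pd

# Hypothetical third key ('region') added to the product/year layout;
# the values 0..7 are placeholders, not real data
index = pd.MultiIndex.from_product(
    [['Bananas', 'Onions'], [2016, 2017], ['north', 'south']],
    names=['product', 'year', 'region'],
)
s = pd.Series(range(8), index=index)

# Three index levels: three-dimensional data held in a one-dimensional Series
print(s.index.nlevels)

# One value per (product, year, region) combination
print(s['Bananas', 2017, 'south'])

# Moving one level to the columns gives a two-dimensional view of the same data
print(s.unstack('region'))
```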


In addition, all the ufuncs and other element-wise functionality work with hierarchical
indices as well. Here we compute the revenue for each product and year, given the above data:

In [11]:
# calculating yearly revenues
rev = vol_df['total'] * vol_df['unit_price']
rev.unstack()

                2016        2017
Bananas   1117764384  1266634504
Onions    1834960160  2539701661
Tomatoes  1290399076  1433979548


This allows us to easily and quickly manipulate and explore even high-dimensional
data.

## Methods of MultiIndex Creation
The most straightforward way to construct a multiply indexed Series or DataFrame
is to simply pass a list of two or more index arrays to the constructor. For example:

In [12]:
# MultiIndex Creation - passing a list of two or more index arrays to the constructor
df = pd.DataFrame(np.random.rand(4, 2),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=['data1', 'data2'])
df

        data1     data2
a 1  0.971049  0.826068
  2  0.464429  0.596521
b 1  0.372111  0.922033
  2  0.700548  0.663310


Similarly, if you pass a dictionary with appropriate tuples as keys, Pandas will automatically
recognize this and use a MultiIndex by default:

In [13]:
# passing a dictionary with appropriate tuples as keys
data = {('Bananas', 2016): 33871648,
        ('Bananas', 2017): 37253956,
        ('Tomatoes', 2016): 20851820,
        ('Tomatoes', 2017): 25145561,
        ('Onions', 2016): 18976457,
        ('Onions', 2017): 19378102}
pd.Series(data)

Bananas   2016    33871648
          2017    37253956
Tomatoes  2016    20851820
          2017    25145561
Onions    2016    18976457
          2017    19378102
dtype: int64

Nevertheless, it is sometimes useful to explicitly create a MultiIndex; we’ll see a couple
of these methods here.

### Explicit MultiIndex constructors
For more flexibility in how the index is constructed, you can instead use the class
method constructors available on pd.MultiIndex. For example, as we did before,
you can construct the MultiIndex from a simple list of arrays, giving the index values
within each level:

In [14]:
# MultiIndex from a simple list of arrays
pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

In [15]:
# You can construct it from a list of tuples, giving the multiple index values of each point:
pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('a', 3), ('b', 1), ('b', 2), ('b', 3)])

MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('b', 1),
            ('b', 2),
            ('b', 3)],
           )

In [16]:
# You can even construct it from a Cartesian product of single indices:
pd.MultiIndex.from_product([['a', 'b', 'c'], [1, 2, 3]])

MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('b', 1),
            ('b', 2),
            ('b', 3),
            ('c', 1),
            ('c', 2),
            ('c', 3)],
           )

Similarly, you can construct the MultiIndex directly using its internal encoding by
passing levels (a list of lists containing available index values for each level) and
codes (a list of lists of integer positions that reference these levels). Older releases
called the second argument labels; it was renamed to codes and the old name no longer works:

In [17]:
pd.MultiIndex(levels=[['a', 'b'], [1, 2]], codes=[[0, 0, 1, 1], [0, 1, 0, 1]])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

You can pass any of these objects as the index argument when creating a Series or
DataFrame, or to the reindex method of an existing Series or DataFrame.
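
The sketch below illustrates both uses under simple assumptions: it first checks that the three class-method constructors describe the same index, then passes one as the index argument of a new Series and another to reindex. The variable names and the values 10 to 40 are illustrative only.

```python
import pandas as pd

# Three constructors that should describe the same four-row index
from_arrays = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
from_tuples = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
from_product = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])
print(from_arrays.equals(from_tuples), from_tuples.equals(from_product))

# Use one of them as the index of a new Series...
s = pd.Series([10, 20, 30, 40], index=from_product)

# ...and another one to reorder an existing Series via reindex
reversed_index = pd.MultiIndex.from_tuples([('b', 2), ('b', 1), ('a', 2), ('a', 1)])
s_reversed = s.reindex(reversed_index)
print(s_reversed)
```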

### MultiIndex level names
Sometimes it is convenient to name the levels of the MultiIndex. You can accomplish
this by passing the names argument to any of the above MultiIndex constructors, or
by setting the names attribute of the index after the fact:

In [18]:
vol.index.names = ['product', 'year']
vol

product   year
Bananas   2016    33871648
          2017    37253956
Tomatoes  2016    18976457
          2017    19378102
Onions    2016    20851820
          2017    25145561
dtype: int64
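
For the other route mentioned above, passing names to a constructor, a minimal sketch might look like this. It reuses the banana and onion volumes from earlier; the variable names are illustrative.

```python
import pandas as pd

# Name the levels at construction time instead of assigning index.names later
index = pd.MultiIndex.from_product(
    [['Bananas', 'Onions'], [2016, 2017]],
    names=['product', 'year'],
)
s = pd.Series([33871648, 37253956, 20851820, 25145561], index=index)
print(list(s.index.names))

# Level names can then be used wherever a level is requested
by_year = s.unstack(level='year')
print(by_year)
```

Referring to a level by name rather than by position tends to keep code readable when the number or order of levels changes.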

### MultiIndex for columns
In a DataFrame, the rows and columns are completely symmetric, and just as the rows
can have multiple levels of indices, the columns can have multiple levels as well. Consider
the following, which is a mock-up of some (somewhat realistic) medical data:

In [19]:
# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]], names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']], names=['subject', 'type'])
# mock some data
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10
data += 37
# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data

subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      34.0  37.7  40.0  36.5  43.0  37.1
     2      43.0  36.4  43.0  37.4  42.0  36.2
2014 1      35.0  38.1  38.0  37.2  46.0  37.0
     2      41.0  38.4  46.0  38.6  42.0  35.5


Here we see where the multi-indexing for both rows and columns can come in very
handy. This is fundamentally four-dimensional data, where the dimensions are the
subject, the measurement type, the year, and the visit number. With this in place we
can, for example, index the top-level column by the person’s name and get a full
DataFrame containing just that person’s information:

In [20]:
health_data['Guido']

type          HR  Temp
year visit
2013 1      40.0  36.5
     2      43.0  37.4
2014 1      38.0  37.2
     2      46.0  38.6


For complicated records containing multiple labeled measurements across multiple
times for many subjects (people, countries, cities, etc.), use of hierarchical rows and
columns can be extremely convenient!

## Indexing and Slicing a MultiIndex
Indexing and slicing on a MultiIndex is designed to be intuitive, and it helps if you
think about the indices as added dimensions. We’ll first look at indexing multiply
indexed Series, and then multiply indexed DataFrames.

### Multiply indexed Series
Consider the multiply indexed Series of product volumes we saw earlier:

In [21]:
vol

product   year
Bananas   2016    33871648
          2017    37253956
Tomatoes  2016    18976457
          2017    19378102
Onions    2016    20851820
          2017    25145561
dtype: int64

We can access single elements by indexing with multiple terms:

In [22]:
# accessing single elements by indexing with multiple terms
vol['Bananas', 2016]

33871648

The MultiIndex also supports partial indexing, or indexing just one of the levels in
the index. The result is another Series, with the lower-level indices maintained:

In [23]:
# partial indexing
vol['Bananas']

year
2016    33871648
2017    37253956
dtype: int64

With sorted indices, we can perform partial indexing on lower levels by passing an
empty slice in the first index:

In [24]:
vol[:, 2016]

product
Bananas     33871648
Tomatoes    18976457
Onions      20851820
dtype: int64

In [25]:
# Other types of indexing and selection work as well; for example, selection based on Boolean masks:
vol[vol > 20000000]

product  year
Bananas  2016    33871648
         2017    37253956
Onions   2016    20851820
         2017    25145561
dtype: int64

In [26]:
# Selection based on fancy indexing also works:
vol[['Bananas', 'Tomatoes']]

product   year
Bananas   2016    33871648
          2017    37253956
Tomatoes  2016    18976457
          2017    19378102
dtype: int64

### Multiply indexed DataFrames
A multiply indexed DataFrame behaves in a similar manner. Consider our toy medical
DataFrame from before:

In [27]:
health_data

subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      34.0  37.7  40.0  36.5  43.0  37.1
     2      43.0  36.4  43.0  37.4  42.0  36.2
2014 1      35.0  38.1  38.0  37.2  46.0  37.0
     2      41.0  38.4  46.0  38.6  42.0  35.5


Remember that columns are primary in a DataFrame, and the syntax used for multiply
indexed Series applies to the columns. For example, we can recover Guido’s heart
rate data with a simple operation:

In [28]:
health_data['Guido', 'HR']

year  visit
2013  1        40.0
      2        43.0
2014  1        38.0
      2        46.0
Name: (Guido, HR), dtype: float64

Also, as with the single-index case, we can use the loc and iloc indexers. For example:

In [29]:
health_data.iloc[:2, :2]

subject      Bob
type          HR  Temp
year visit
2013 1      34.0  37.7
     2      43.0  36.4


These indexers provide an array-like view of the underlying two-dimensional data,
but each individual index in loc or iloc can be passed a tuple of multiple indices. For
example:

In [30]:
health_data.loc[:, ('Bob', 'HR')]

year  visit
2013  1        34.0
      2        43.0
2014  1        35.0
      2        41.0
Name: (Bob, HR), dtype: float64

In [31]:
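# pd.IndexSlice allows slice notation inside the row and column keys:
# here, visit 1 of every year and the HR column of every subject
idx = pd.IndexSlice
health_data.loc[idx[:, 1], idx[:, 'HR']]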

subject      Bob Guido   Sue
type          HR    HR    HR
year visit
2013 1      34.0  40.0  43.0
2014 1      35.0  38.0  46.0
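
Python does not accept a bare `:` inside a tuple literal, which is why a helper such as pd.IndexSlice is convenient here. As a sketch, the same selection can also be written with explicit slice(None) objects. The small `frame` below only mimics the shape of health_data, and its values are placeholders.

```python
import numpy as np
import pandas as pd

# A small frame shaped like health_data; the values 0..15 are placeholders
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]], names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
frame = pd.DataFrame(np.arange(16.0).reshape(4, 4), index=index, columns=columns)

# Explicit slice objects inside the per-axis tuples
explicit = frame.loc[(slice(None), 1), (slice(None), 'HR')]

# The same selection written with IndexSlice
idx = pd.IndexSlice
concise = frame.loc[idx[:, 1], idx[:, 'HR']]

print(concise)
print(explicit.equals(concise))
```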


## Rearranging Multi-Indices
One of the keys to working with multiply indexed data is knowing how to effectively
transform the data. There are a number of operations that will preserve all the information
in the dataset, but rearrange it for the purposes of various computations. We
saw a brief example of this in the stack() and unstack() methods, but there are
many more ways to finely control the rearrangement of data between hierarchical
indices and columns, and we’ll explore them here.

### Sorted and unsorted indices
Many of the MultiIndex slicing operations will fail if the index is not sorted. Let’s take a look at
this here. We’ll start by creating some simple multiply indexed data where the indices are not
lexicographically sorted:

In [32]:
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
data = pd.Series(np.random.rand(6), index=index)
data.index.names = ['char', 'int']
data

char  int
a     1      0.077155
      2      0.336472
c     1      0.219946
      2      0.782738
b     1      0.108082
      2      0.706341
dtype: float64

If we try to take a partial slice of this index, it will result in an error:

In [33]:
try:
    data['a':'b']
except KeyError as e:
    print(type(e))
    print(e)

<class 'pandas.errors.UnsortedIndexError'>
'Key length (1) was greater than MultiIndex lexsort depth (0)'


Although it is not entirely clear from the error message, this is the result of the
MultiIndex not being sorted. For various reasons, partial slices and other similar operations
require the levels in the MultiIndex to be in sorted (i.e., lexicographical) order.
Pandas provides the sort_index() method on both Series and DataFrame to perform this
type of sorting (older releases also offered sortlevel(), which has since been removed
from those objects). We’ll use sort_index() here:

In [34]:
data = data.sort_index()
data

char  int
a     1      0.077155
      2      0.336472
b     1      0.108082
      2      0.706341
c     1      0.219946
      2      0.782738
dtype: float64

In [35]:
# With the index sorted in this way, partial slicing will work as expected:
data['a':'b']

char  int
a     1      0.077155
      2      0.336472
b     1      0.108082
      2      0.706341
dtype: float64
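
One way to check sortedness before slicing, sketched here with integer placeholder values instead of random numbers, is the index's is_monotonic_increasing property. The variable names are illustrative.

```python
import pandas as pd

# Same layout as above: 'c' is listed before 'b', so the index is unsorted
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]], names=['char', 'int'])
data = pd.Series(range(6), index=index)
print(data.index.is_monotonic_increasing)

# Partial slicing on the unsorted index is expected to fail as described above;
# UnsortedIndexError is a subclass of KeyError
try:
    data.loc['a':'b']
except KeyError as e:
    print(type(e).__name__)

# After sorting, the same slice selects the 'a' and 'b' groups
data_sorted = data.sort_index()
print(data_sorted.index.is_monotonic_increasing)
print(data_sorted.loc['a':'b'])
```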

### Stacking and unstacking indices
As we saw briefly before, it is possible to convert a dataset from a stacked multi-index
to a simple two-dimensional representation, optionally specifying the level to use:

In [36]:
vol

product   year
Bananas   2016    33871648
          2017    37253956
Tomatoes  2016    18976457
          2017    19378102
Onions    2016    20851820
          2017    25145561
dtype: int64

In [37]:
vol.unstack(level=0)

product   Bananas    Onions  Tomatoes
year
2016     33871648  20851820  18976457
2017     37253956  25145561  19378102


In [38]:
vol.unstack(level=1)

year          2016      2017
product
Bananas   33871648  37253956
Onions    20851820  25145561
Tomatoes  18976457  19378102


The opposite of unstack() is stack(), which here can be used to recover the original
series (up to the order of its rows, which unstack() sorts):

In [39]:
# stack - can be used to recover the original series
vol.unstack().stack()

product   year
Bananas   2016    33871648
          2017    37253956
Onions    2016    20851820
          2017    25145561
Tomatoes  2016    18976457
          2017    19378102
dtype: int64
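
The round trip can be checked directly. This sketch rebuilds the volume Series and compares the result of unstack().stack() with the original, separating "same data" from "same row order". The names `round_trip`, `same_data`, and `same_order` are illustrative.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('Bananas', 2016), ('Bananas', 2017),
     ('Tomatoes', 2016), ('Tomatoes', 2017),
     ('Onions', 2016), ('Onions', 2017)],
    names=['product', 'year'],
)
vol = pd.Series([33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
                index=index)

round_trip = vol.unstack().stack()

# Compare as {(product, year): value} mappings so that row order does not matter
same_data = round_trip.to_dict() == vol.to_dict()

# Compare the sequence of index tuples to see whether the order survived
same_order = list(round_trip.index) == list(vol.index)

print(same_data, same_order)
```

If the original ordering matters, reindexing the result with the original index (round_trip.reindex(vol.index)) is one way to restore it.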

### Index setting and resetting
Another way to rearrange hierarchical data is to turn the index labels into columns;
this can be accomplished with the reset_index method. Calling this on the product
volume Series will result in a DataFrame with product and year columns holding the
information that was formerly in the index. For clarity, we can optionally specify the
name of the data for the column representation:

In [40]:
vol

product   year
Bananas   2016    33871648
          2017    37253956
Tomatoes  2016    18976457
          2017    19378102
Onions    2016    20851820
          2017    25145561
dtype: int64

In [41]:
vol_flat = vol.reset_index(name='volume')
vol_flat

    product  year    volume
0   Bananas  2016  33871648
1   Bananas  2017  37253956
2  Tomatoes  2016  18976457
3  Tomatoes  2017  19378102
4    Onions  2016  20851820
5    Onions  2017  25145561


Often when you are working with data in the real world, the raw input data looks like
this and it’s useful to build a MultiIndex from the column values. This can be done
with the set_index method of the DataFrame, which returns a multiply indexed
DataFrame:

In [42]:
vol_flat.set_index(['product', 'year'])

                 volume
product  year
Bananas  2016  33871648
         2017  37253956
Tomatoes 2016  18976457
         2017  19378102
Onions   2016  20851820
         2017  25145561
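
As a sketch of the full cycle starting from flat records, the following builds the flat table by hand, indexes it with set_index, looks up one value by its (product, year) key, and flattens it again. The names `flat`, `indexed`, and `restored` are illustrative.

```python
import pandas as pd

# Flat records, the shape raw input often has (values from the example above)
flat = pd.DataFrame({
    'product': ['Bananas', 'Bananas', 'Tomatoes', 'Tomatoes', 'Onions', 'Onions'],
    'year': [2016, 2017, 2016, 2017, 2016, 2017],
    'volume': [33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
})

# Columns -> index levels
indexed = flat.set_index(['product', 'year'])
print(indexed.loc[('Onions', 2017), 'volume'])

# Index levels -> columns again
restored = indexed.reset_index()
print(restored.equals(flat))
```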


## Data Aggregations on Multi-Indices
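We’ve previously seen that Pandas has built-in data aggregation methods, such as
mean(), sum(), and max(). For hierarchically indexed data, these can be computed
within each group of rows that share a value at a chosen index level. Older releases
accepted a level parameter on the aggregation methods for this; that parameter was
removed in pandas 2.0, and grouping with groupby(level=...) gives the same result.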

For example, let’s return to our health data:

In [43]:
health_data

subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      34.0  37.7  40.0  36.5  43.0  37.1
     2      43.0  36.4  43.0  37.4  42.0  36.2
2014 1      35.0  38.1  38.0  37.2  46.0  37.0
     2      41.0  38.4  46.0  38.6  42.0  35.5


Perhaps we’d like to average out the measurements in the two visits each year. We can
do this by naming the index level we’d like to explore, in this case the year:

In [44]:
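# Group on the 'year' index level; the older health_data.mean(level='year')
# spelling was removed in pandas 2.0
data_mean = health_data.groupby(level='year').mean()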
data_mean

subject   Bob        Guido          Sue
type       HR   Temp    HR   Temp    HR   Temp
year
2013     38.5  37.05  41.5  36.95  42.5  36.65
2014     38.0  38.25  42.0  37.90  44.0  36.25


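The same kind of aggregation can be taken among levels on the columns as well. One way
to do this in current releases is to transpose, group on the column level, and transpose back:

In [45]:
# older spelling: data_mean.mean(axis=1, level='type')
data_mean.T.groupby(level='type').mean().T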

type         HR       Temp
year
2013  40.833333  36.883333
2014  41.333333  37.466667


Thus in two lines, we’ve been able to find the average heart rate and temperature
measured among all subjects in all visits each year. Aggregating by index level in this
way is a direct application of the GroupBy functionality.