# Hierarchical Indexing

Up to this point we've been focused primarily on one-dimensional and two-dimensional data, stored in Pandas Series and DataFrame objects, respectively. Often it is useful to go beyond this and store higher-dimensional data, that is, data indexed by more than one or two keys. While older versions of Pandas provided Panel and Panel4D objects that natively handled three-dimensional and four-dimensional data (see Aside: Panel Data; both have since been removed from the library), a far more common pattern in practice is to make use of hierarchical indexing (also known as multi-indexing) to incorporate multiple index levels within a single index. In this way, higher-dimensional data can be compactly represented within the familiar one-dimensional Series and two-dimensional DataFrame objects.

In this section, we'll explore the direct creation of MultiIndex objects, considerations when indexing, slicing, and computing statistics across multiply indexed data, and useful routines for converting between simple and hierarchically indexed representations of your data.

We begin with the standard imports:

In [1]:
import pandas as pd
import numpy as np

## A Multiply Indexed Series

Let's start by considering how we might represent two-dimensional data within a one-dimensional Series. For concreteness, we will consider a series of data where each point has both a character key and a numerical key.

## The Bad Way

Suppose you would like to track data about states from two different years. Using the Pandas tools we've covered, you might be tempted to simply use Python tuples as keys:

In [29]:
index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]
pop = pd.Series(populations, index=index)
pop

(California, 2000)    33871648
(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
(Texas, 2010)         25145561
dtype: int64

With this indexing scheme, you can straightforwardly index or slice the series based on this multiple index:

In [30]:
pop[('California', 2010): ('Texas', 2000)]

(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
dtype: int64

But the convenience ends there. For example, if you need to select all values from 2010, you'll need to do some messy (and potentially slow) munging to make it happen:

In [31]:
pop[[i for i in pop.index if i[1] == 2010]] # very slow on larger data sets

(California, 2010)    37253956
(New York, 2010)      19378102
(Texas, 2010)         25145561
dtype: int64

This produces the desired result, but it is not as clean (or as efficient) as the slicing syntax we've grown to love in Pandas.

## The Better Way: Pandas MultiIndex

Fortunately, Pandas provides a better way. Our tuple-based indexing is essentially a rudimentary multi-index, and the Pandas *MultiIndex* type provides the kind of operations we want. We can create a multi-index from the tuples as follows:

In [32]:
index = pd.MultiIndex.from_tuples(index)
index

MultiIndex(levels=[['California', 'New York', 'Texas'], [2000, 2010]],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])

Notice that the *MultiIndex* contains multiple *levels* of indexing (in this case, the state names and the years), as well as multiple *labels* for each data point that encode these levels. In pandas 0.24 and later, these labels are called *codes*.
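As a minimal sketch of how these pieces can be inspected, the snippet below rebuilds the same index and looks at its levels and per-row values. It deliberately sticks to `nlevels`, `levels`, and `get_level_values`, because the attribute holding the integer encoding is named differently across pandas versions; the variable names are illustrative only.

```python
import pandas as pd

tuples = [('California', 2000), ('California', 2010),
          ('New York', 2000), ('New York', 2010),
          ('Texas', 2000), ('Texas', 2010)]
index = pd.MultiIndex.from_tuples(tuples)

n_levels = index.nlevels                      # number of index levels
states = list(index.levels[0])                # unique values stored for level 0
years = list(index.levels[1])                 # unique values stored for level 1
row_states = list(index.get_level_values(0))  # level-0 value for every row
```

Each level stores its unique values once, while `get_level_values` expands them back out to one entry per data point.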

If we re-index our series with this MultiIndex, we see the hierarchical representation of the data:

In [33]:
pop = pop.reindex(index)

pop

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

Here the first two columns of the *Series* representation show the multiple index values, while the third column shows the data. Notice that some entries are missing in the first column: in this multi-index representation, any blank entry indicates the same value as the line above it.

Now, to access all data for which the second index is 2010, we can simply use the Pandas slicing notation:

In [34]:
print(pop[:, 2010])

print(pop[:, 2000])

California    37253956
New York      19378102
Texas         25145561
dtype: int64
California    33871648
New York      18976457
Texas         20851820
dtype: int64


The result is a singly indexed array with just the keys we're interested in. This syntax is much more convenient (and the operation is much more efficient) than the home-spun tuple-based multi-indexing solution that we started with. We'll now further discuss this sort of indexing operation on hierarchically indexed data.
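The same idea extends to other partial selections. The following is a small sketch, not an exhaustive tour, using the explicit `.loc` indexer and the `xs` cross-section method on the population data from above; the variable names are illustrative.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([('California', 2000), ('California', 2010),
                                   ('New York', 2000), ('New York', 2010),
                                   ('Texas', 2000), ('Texas', 2010)])
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)

in_2010 = pop.loc[:, 2010]            # all states, one year: indexed by state
california = pop.loc['California']    # one state, all years: indexed by year
one_value = pop.loc[('Texas', 2000)]  # full key: a single number
same_2010 = pop.xs(2010, level=1)     # cross-section on the second level
```

Selecting on one level drops that level from the result, which is why `in_2010` and `california` come back as singly indexed Series.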

## MultiIndex as an extra dimension

You might notice something else here: we could easily have stored the same data using a simple DataFrame with index and column labels. In fact, Pandas is built with this equivalence in mind. The unstack() method will quickly convert a multiply indexed Series into a conventionally indexed DataFrame:

In [35]:
pop_df = pop.unstack()

pop_df

                2000      2010
California  33871648  37253956
New York    18976457  19378102
Texas       20851820  25145561

Naturally, the stack() method provides the opposite operation:


In [36]:
pop_df.stack()

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
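To make the round trip explicit, here is a minimal sketch that unstacks the population Series and stacks it back, then compares keys and values with the original. The names `wide`, `long`, and `round_trip_ok` are illustrative only.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['California', 'New York', 'Texas'],
                                    [2000, 2010]])
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)

wide = pop.unstack()   # the inner level (year) becomes the columns
long = wide.stack()    # the columns return to an inner index level

# Compare keys and values rather than relying on exact dtype equality.
round_trip_ok = (list(long.index) == list(pop.index)
                 and list(long) == list(pop))
```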

Seeing this, you might wonder why we would bother with hierarchical indexing at all. The reason is simple: just as we were able to use multi-indexing to represent two-dimensional data within a one-dimensional Series, we can also use it to represent data of three or more dimensions in a Series or DataFrame. Each extra level in a multi-index represents an extra dimension of data; taking advantage of this property gives us much more flexibility in the types of data we can represent. Concretely, we might want to add another column of demographic data for each state at each year (say, population under 18); with a MultiIndex this is as easy as adding another column to the DataFrame:



In [37]:
pop_df = pd.DataFrame({'total': pop,
                        'under18': [9267089, 9284094,
                                    4687374, 4318033,
                                    5906301, 6879014]})
pop_df

                    total  under18
California 2000  33871648  9267089
           2010  37253956  9284094
New York   2000  18976457  4687374
           2010  19378102  4318033
Texas      2000  20851820  5906301
           2010  25145561  6879014
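One way to see that each extra level acts as an extra dimension is to stack this two-column DataFrame: the column labels move into a third index level, giving a Series keyed by state, year, and measure. This is only a sketch built from the figures above; the names `pop_3d` and `texas_2010_u18` are illustrative.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['California', 'New York', 'Texas'],
                                    [2000, 2010]])
pop_df = pd.DataFrame({'total': [33871648, 37253956,
                                 18976457, 19378102,
                                 20851820, 25145561],
                       'under18': [9267089, 9284094,
                                   4687374, 4318033,
                                   5906301, 6879014]},
                      index=index)

pop_3d = pop_df.stack()           # column labels become a third index level
n_levels = pop_3d.index.nlevels   # state, year, measure
texas_2010_u18 = pop_3d.loc[('Texas', 2010, 'under18')]
```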


In addition, all the ufuncs and other Pandas functionality work with hierarchical indices as well. Here we compute the fraction of people under 18 by year, given the above data:

In [43]:
f_u18 = pop_df['under18'] / pop_df['total']


f_u18.unstack()

                2000      2010
California  0.273594  0.249211
New York    0.247010  0.222831
Texas       0.283251  0.273568


This allows us to easily and quickly manipulate and explore even high-dimensional data.

## Methods of MultiIndex Creation

The most straightforward way to construct a multiply indexed *Series* or *DataFrame* is to simply pass a list of two or more index arrays to the constructor. For example:

In [44]:
df = pd.DataFrame(np.random.rand(4, 2),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=['data1', 'data2'])

df

        data1     data2
a 1  0.027692  0.329781
  2  0.474337  0.154183
b 1  0.584676  0.111237
  2  0.802998  0.377853


The work of creating the MultiIndex is done in the background.
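A quick way to confirm this is to check the type of the resulting index. The sketch below uses deterministic values from `np.arange` in place of random numbers so the lookup at the end is reproducible; that substitution is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(8).reshape(4, 2),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=['data1', 'data2'])

is_multi = isinstance(df.index, pd.MultiIndex)  # built automatically from the two arrays
keys = list(df.index)                           # one tuple per row
value = df.loc[('b', 1), 'data2']               # select with the full row key
```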

Similarly, if you pass a dictionary with appropriate tuples as keys, Pandas will automatically recognize this and use a MultiIndex by default:

In [46]:
data = {('California', 2000): 33871648,
        ('California', 2010): 37253956,
        ('Texas', 2000): 20851820,
        ('Texas', 2010): 25145561,
        ('New York', 2000): 18976457,
        ('New York', 2010): 19378102}
pd.Series(data)

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

Nevertheless, it is sometimes useful to explicitly create a MultiIndex; we'll see a couple of these methods here.

## Explicit MultiIndex constructors

For more flexibility in how the index is constructed, you can instead use the class method constructors available in pd.MultiIndex. For example, as we did before, you can construct the MultiIndex from a simple list of arrays giving the index values within each level:

In [47]:
pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])


MultiIndex(levels=[['a', 'b'], [1, 2]],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

You can construct it from a list of tuples giving the multiple index values of each point:

In [48]:
pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])

MultiIndex(levels=[['a', 'b'], [1, 2]],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

You can even construct it from a Cartesian product of single indices:

In [50]:
pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

MultiIndex(levels=[['a', 'b'], [1, 2]],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
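Since all three constructors print the same result, it may help to check the equivalence directly. The following sketch compares them with `MultiIndex.equals`; the variable names are illustrative.

```python
import pandas as pd

from_arrays = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
from_tuples = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
from_product = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

# All three describe the same four keys in the same order.
all_equal = (from_arrays.equals(from_tuples)
             and from_tuples.equals(from_product))
```

Which constructor to use is mostly a matter of which form the keys already take: parallel arrays, row-wise tuples, or the unique values of each level.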

Similarly, you can construct the MultiIndex directly using its internal encoding by passing levels (a list of lists containing the available index values for each level) and labels (a list of lists of integer positions that reference those values; this argument is named codes in pandas 0.24 and later):

In [51]:
pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
              labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

MultiIndex(levels=[['a', 'b'], [1, 2]],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
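On newer pandas releases the `labels` keyword is no longer accepted. As a sketch under the assumption that pandas 0.24 or later is installed, the same index can be built by passing the integer positions as `codes`:

```python
import pandas as pd

# Assumes pandas 0.24 or later, where the integer encoding is passed as `codes`.
explicit = pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
                         codes=[[0, 0, 1, 1], [0, 1, 0, 1]])

keys = list(explicit)  # each pair of codes picks one value from each level
matches_product = explicit.equals(pd.MultiIndex.from_product([['a', 'b'], [1, 2]]))
```

Each code is a position into the corresponding level, so the code pair (1, 0) maps to the key ('b', 1).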