# Hierarchical Indexing (MultiIndex)

Sometimes it's useful to have more than one index level for your data. This is called hierarchical indexing, or multi-indexing. It lets you work with higher-dimensional data in a two-dimensional format (like a DataFrame).

Let's import our standard tools first.

In [1]:
import pandas as pd
import numpy as np

## A Multiply Indexed Series

Let's see what a multiply indexed `Series` looks like. We'll represent population data for a few states across two years.

In [2]:
# Create a list of (state, year) tuples for the multi-index
index = [('California', 2010), ('California', 2020),
         ('Texas', 2010), ('Texas', 2020),
         ('New York', 2010), ('New York', 2020)]
populations = [37253956, 39538223,
               25145561, 29145505,
               19378102, 20201249]
pop = pd.Series(populations, index=pd.MultiIndex.from_tuples(index))
pop

California  2010    37253956
            2020    39538223
Texas       2010    25145561
            2020    29145505
New York    2010    19378102
            2020    20201249
dtype: int64

This `Series` now has a `MultiIndex`. You can see the two levels of the index. We can access elements from it like this:

In [3]:
# Access all data for California
pop['California']

2010    37253956
2020    39538223
dtype: int64

In [4]:
# Access data for California in 2020
pop['California', 2020]

39538223
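To inspect the two levels directly rather than reading them off the printed output, something like the following sketch may help. It rebuilds the same `pop` series; the variable names `n_levels`, `outer_labels`, and `unique_years` are only illustrative.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([
    ('California', 2010), ('California', 2020),
    ('Texas', 2010), ('Texas', 2020),
    ('New York', 2010), ('New York', 2020),
])
pop = pd.Series(
    [37253956, 39538223, 25145561, 29145505, 19378102, 20201249],
    index=index,
)

# Number of levels in the index
n_levels = pop.index.nlevels

# The outer-level label of every row, in row order
outer_labels = pop.index.get_level_values(0)

# The unique values stored for the inner level
unique_years = pop.index.levels[1]

print(n_levels)
print(list(outer_labels))
print(list(unique_years))
```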

## Methods of MultiIndex Creation

There are several ways to create a `MultiIndex`. The `pd.MultiIndex.from_tuples()` method we just used is a good one, but here are a few others.

In [5]:
# From a list of arrays
df = pd.DataFrame(np.random.rand(4, 2),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=['data1', 'data2'])
df

        data1     data2
a 1  0.047217  0.961854
  2  0.124222  0.368494
b 1  0.791733  0.944024
  2  0.989073  0.750993


You can also give the index levels names.

In [6]:
# Naming the index levels
df.index.names = ['key1', 'key2']
df

              data1     data2
key1 key2
a    1     0.047217  0.961854
     2     0.124222  0.368494
b    1     0.791733  0.944024
     2     0.989073  0.750993
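Assigning to `df.index.names` modifies the DataFrame in place. If you would rather get a new object back and leave the original untouched, `rename_axis` is one option. This is a small sketch using placeholder zeros instead of random data; `named` is an illustrative name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((4, 2)),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=['data1', 'data2'])

# rename_axis returns a new DataFrame; df itself keeps its unnamed levels
named = df.rename_axis(index=['key1', 'key2'])

print(list(df.index.names))
print(list(named.index.names))
```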


Another way is to use `pd.MultiIndex.from_product`, which creates a `MultiIndex` from the cartesian product of the arrays you provide.

In [7]:
# Using from_product
index = pd.MultiIndex.from_product([['a', 'b'], [1, 2]],
                                   names=['key1', 'key2'])
df_prod = pd.DataFrame(np.random.rand(4, 2), index=index, columns=['data1', 'data2'])
df_prod

              data1     data2
key1 key2
a    1     0.424367  0.382970
     2     0.418999  0.811223
b    1     0.362696  0.385270
     2     0.272894  0.274023
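Two further constructors, `pd.MultiIndex.from_arrays` and `pd.MultiIndex.from_frame`, are not shown above. As a rough sketch, the following builds the same index three ways and compares the results; the variable names are illustrative.

```python
import pandas as pd

names = ['key1', 'key2']

# One array per level, aligned element by element
via_arrays = pd.MultiIndex.from_arrays(
    [['a', 'a', 'b', 'b'], [1, 2, 1, 2]], names=names)

# Cartesian product of the unique values of each level
via_product = pd.MultiIndex.from_product([['a', 'b'], [1, 2]], names=names)

# One DataFrame column per level; column names become level names
frame = pd.DataFrame({'key1': ['a', 'a', 'b', 'b'], 'key2': [1, 2, 1, 2]})
via_frame = pd.MultiIndex.from_frame(frame)

print(via_arrays.tolist())
print(via_arrays.tolist() == via_product.tolist() == via_frame.tolist())
```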


## Indexing and Slicing a MultiIndex

Selecting on the outer level of a `MultiIndex` works like ordinary indexing. Slicing ranges of labels and selecting on inner levels take a little more care.

In [8]:
# Accessing the data for Texas
pop['Texas']

2010    25145561
2020    29145505
dtype: int64

You can also slice by label. Label slices include both endpoints, and once the index is sorted the states are in alphabetical order, so the slice from California to New York excludes Texas.

In [9]:
pop = pop.sort_index() # note: index must be sorted before slicing
pop.loc['California':'New York']

California  2010    37253956
            2020    39538223
New York    2010    19378102
            2020    20201249
dtype: int64
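To see why the sort matters, here is a hedged sketch that checks whether the index is sorted and attempts the slice on the unsorted series. It assumes pandas signals the problem with a `KeyError` (typically its `UnsortedIndexError` subclass); the exact exception and message may vary between versions.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([
    ('California', 2010), ('California', 2020),
    ('Texas', 2010), ('Texas', 2020),
    ('New York', 2010), ('New York', 2020),
])
pop_unsorted = pd.Series(
    [37253956, 39538223, 25145561, 29145505, 19378102, 20201249],
    index=index,
)

# Texas comes before New York, so the index is not sorted
sorted_before = pop_unsorted.index.is_monotonic_increasing

try:
    pop_unsorted.loc['California':'New York']
    slice_failed = False
except KeyError as err:  # assumption: the failure is a KeyError subclass
    slice_failed = True
    print(type(err).__name__)

# After sorting, the same label slice works
pop_sorted = pop_unsorted.sort_index()
sliced = pop_sorted.loc['California':'New York']
print(sliced)
```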

For more advanced slicing, it's often better to use an `IndexSlice` object. This makes the slicing syntax more intuitive.

In [10]:
idx = pd.IndexSlice
pop[idx[:, 2020]]

California    39538223
New York      20201249
Texas         29145505
dtype: int64

This slice gets all data for the year 2020 across all states.
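The same `IndexSlice` pattern carries over to a DataFrame through `.loc`, and `xs()` offers an alternative for picking a single value from one level. The sketch below uses a one-column DataFrame with named levels; `pop_frame` and the level names `state` and `year` are illustrative and not part of the series above.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2010, 2020]],
    names=['state', 'year'])
pop_frame = pd.DataFrame(
    {'population': [37253956, 39538223, 19378102, 20201249,
                    25145561, 29145505]},
    index=index)

idx = pd.IndexSlice

# All states, year 2020; both index levels are kept
rows_2020 = pop_frame.loc[idx[:, 2020], :]

# Cross-section on the 'year' level; that level is dropped from the result
xs_2020 = pop_frame.xs(2020, level='year')

print(rows_2020)
print(xs_2020)
```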

## Rearranging Multi-Indices

Sometimes you need to reshape your data. The two most important methods for this are `stack()` and `unstack()`.

`unstack()` converts a level of the index into columns. By default it moves the innermost level.

In [11]:
print(pop)
pop_df = pop.unstack()
print(pop_df)

# Note: .unstack() takes the unique values of the specified index level
# (given by position or by name; the innermost level by default) and turns
# each one into its own column.
#
# By contrast, .reset_index() turns an index level into a single ordinary
# column named after that level, so its values repeat down the rows.

California  2010    37253956
            2020    39538223
New York    2010    19378102
            2020    20201249
Texas       2010    25145561
            2020    29145505
dtype: int64
                2010      2020
California  37253956  39538223
New York    19378102  20201249
Texas       25145561  29145505
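The contrast between `unstack()` and `reset_index()` described in the notes above can be sketched side by side. The level names `state` and `year` and the column name `population` are assumptions added here so the resulting columns have readable labels.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2010, 2020]],
    names=['state', 'year'])
pop = pd.Series(
    [37253956, 39538223, 19378102, 20201249, 25145561, 29145505],
    index=index)

# Wide form: one column per unique year
wide_form = pop.unstack()

# Long form: each index level becomes one ordinary column
long_form = pop.reset_index(name='population')

print(wide_form.shape)
print(long_form.shape)
print(long_form.head(3))
```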


`stack()` is the opposite. It converts columns back into an index level.

In [12]:
pop_df.stack()

California  2010    37253956
            2020    39538223
New York    2010    19378102
            2020    20201249
Texas       2010    25145561
            2020    29145505
dtype: int64
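`unstack()` also accepts a `level` argument, so a different level can be moved into the columns. The following sketch assumes the levels are named `state` and `year` (the series above leaves them unnamed) and also checks that unstacking and then stacking returns the original values.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2010, 2020]],
    names=['state', 'year'])
pop = pd.Series(
    [37253956, 39538223, 19378102, 20201249, 25145561, 29145505],
    index=index)

# Move the outer level into the columns: rows are years, columns are states
by_year = pop.unstack(level='state')

# Unstack the inner level, then stack it back into the index
round_trip = pop.unstack().stack()

print(by_year)
print(round_trip.tolist() == pop.tolist())
```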

You can also use `set_index` to create a `MultiIndex` from columns in a `DataFrame`.

In [13]:
df_flat = pd.DataFrame({'state': ['California', 'California', 'Texas', 'Texas'],
                        'year': [2010, 2020, 2010, 2020],
                        'pop': [37253956, 39538223, 25145561, 29145505]})
df_flat

        state  year       pop
0  California  2010  37253956
1  California  2020  39538223
2       Texas  2010  25145561
3       Texas  2020  29145505


In [14]:
df_multi = df_flat.set_index(['state', 'year'])
df_multi

                      pop
state      year
California 2010  37253956
           2020  39538223
Texas      2010  25145561
           2020  29145505


## Modern Data Aggregations and Operations with Multi-Indices

The `level` parameter of aggregation methods such as `mean()` and `sum()` was deprecated in pandas 1.3 and removed in pandas 2.0. Use `groupby(level=...)` instead, which is more explicit and more flexible.

In [15]:
# Create a new DataFrame for this example
index = pd.MultiIndex.from_product([[2020, 2021], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10
data += 37

health_data = pd.DataFrame(data, index=index, columns=columns)
health_data

subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2020 1      29.0  39.0  42.0  35.9  35.0  38.8
     2      30.0  36.3  14.0  37.6  45.0  37.1
2021 1      32.0  37.0  37.0  36.5  45.0  39.1
     2      69.0  36.3  53.0  36.1  44.0  37.2


### Grouping and Aggregating

In [16]:
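# Average each measurement type per subject across all years and visits.
# Stacking 'subject' into the row index first avoids groupby(axis=1),
# which is deprecated in recent pandas versions.
health_data.stack(level='subject').groupby(level='subject').mean()

type            HR       Temp
subject
Bob      40.000000  37.150000
Guido    36.500000  36.525000
Sue      42.250000  38.050000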
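The same `groupby(level=...)` pattern applies to a multiply indexed `Series`. As a brief sketch on the population data from earlier, assuming the index levels are named `state` and `year`:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2010, 2020]],
    names=['state', 'year'])
pop = pd.Series(
    [37253956, 39538223, 19378102, 20201249, 25145561, 29145505],
    index=index)

# Total population of the three states in each year
total_by_year = pop.groupby(level='year').sum()

# Mean population of each state over the two years
mean_by_state = pop.groupby(level='state').mean()

print(total_by_year)
print(mean_by_state)
```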

### Slicing and Selecting Data

In [17]:
# Select data for a specific year and visit
health_data.loc[(2020, 2)]

subject  type
Bob       HR      30.0
          Temp    36.3
Guido     HR      14.0
          Temp    37.6
Sue       HR      45.0
          Temp    37.1
Name: (2020, 2), dtype: float64

In [18]:
# Select data for a specific year, visit, and subject
health_data.loc[(2020, 1), 'Guido']

subject  type
Guido    HR      42.0
         Temp    35.9
Name: (2020, 1), dtype: float64
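Selecting on the inner column level, for example every subject's heart rate, is a case where `pd.IndexSlice` helps on the column axis as well. The sketch below rebuilds `health_data` from the values printed earlier so that it is reproducible; `heart_rates` and `hr_2020` are illustrative names.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2020, 2021], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
data = np.array([[29.0, 39.0, 42.0, 35.9, 35.0, 38.8],
                 [30.0, 36.3, 14.0, 37.6, 45.0, 37.1],
                 [32.0, 37.0, 37.0, 36.5, 45.0, 39.1],
                 [69.0, 36.3, 53.0, 36.1, 44.0, 37.2]])
health_data = pd.DataFrame(data, index=index, columns=columns)

idx = pd.IndexSlice

# Every subject, but only the 'HR' measurement
heart_rates = health_data.loc[:, idx[:, 'HR']]

# Combine a row selection (all visits in 2020) with the column selection
hr_2020 = health_data.loc[idx[2020, :], idx[:, 'HR']]

print(heart_rates)
print(hr_2020)
```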

### Resetting and Setting the Index

In [19]:
# Reset the index to columns
health_data_flat = health_data.stack(level=['subject', 'type']).reset_index(name='value')
health_data_flat

In [20]:
# Set the index back; selecting 'value' first keeps the columns two-level
health_data_flat.set_index(['year', 'visit', 'subject', 'type'])['value'].unstack(level=['subject', 'type'])
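To check that flattening and rebuilding really round-trips, here is a small sketch on a hypothetical 4x4 frame (`wide`, filled with the numbers 0 to 15 rather than the health data). Recent pandas versions may emit a `FutureWarning` when stacking several levels at once; the result should be unaffected.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2020, 2021], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
wide = pd.DataFrame(np.arange(16, dtype=float).reshape(4, 4),
                    index=index, columns=columns)

# Long form: one row per (year, visit, subject, type) combination
flat = wide.stack(level=['subject', 'type']).reset_index(name='value')

# Back to wide form: rebuild the index, select the values, unstack the
# two levels that were originally columns
restored = (flat.set_index(['year', 'visit', 'subject', 'type'])['value']
                .unstack(level=['subject', 'type']))

print(flat.head())
print(np.array_equal(restored.to_numpy(), wide.to_numpy()))
```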

### Swapping Index Levels

In [21]:
# Swap the order of the index levels
health_data.swaplevel('year', 'visit')
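`swaplevel` only relabels the index; it does not reorder the rows, so the swapped index is generally no longer sorted. A follow-up `sort_index()` restores sorted order, which label slicing relies on. A minimal sketch with a hypothetical single-column frame:

```python
import pandas as pd

index = pd.MultiIndex.from_product([[2020, 2021], [1, 2]],
                                   names=['year', 'visit'])
df = pd.DataFrame({'HR': [29.0, 30.0, 32.0, 69.0]}, index=index)

# Levels are swapped, but the rows stay in their original order
swapped = df.swaplevel('year', 'visit')

# Sorting puts the rows in order of the new outer level
swapped_sorted = swapped.sort_index()

print(swapped.index.tolist())
print(swapped_sorted.index.tolist())
```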