### Hierarchical Indexing

AKA Multi-indexing.

Used to incorporate multiple index levels within a single index.

Allows higher-dimensional data to be represented with Pandas Series and DataFrames.

In [1]:
import pandas as pd
import numpy as np

#### Introducing the MultiIndex

A MultiIndex lets us keep tuple-like keys while still supporting efficient, level-aware selection and computation.

In [4]:
index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]
pop = pd.Series(populations, index=index)
pop

(California, 2000)    33871648
(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
(Texas, 2010)         25145561
dtype: int64

In [5]:
index = pd.MultiIndex.from_tuples(index)
index

MultiIndex([('California', 2000),
            ('California', 2010),
            (  'New York', 2000),
            (  'New York', 2010),
            (     'Texas', 2000),
            (     'Texas', 2010)],
           )

In [6]:
pop = pop.reindex(index)
pop

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [8]:
pop[('California', 2000)]

33871648

In [11]:
pop['California', 2000]

33871648

In [18]:
pop[:, 2000]

California    33871648
New York      18976457
Texas         20851820
dtype: int64
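For comparison, here is a minimal sketch of the same "all values for one year" selection done against plain tuple keys and against a MultiIndex. The names `keys`, `tuple_pop`, `multi_pop`, and `mask` are illustrative and not part of the original notebook; `tupleize_cols=False` is used only to force a flat index of tuples.

```python
import pandas as pd

keys = [('California', 2000), ('California', 2010),
        ('New York', 2000), ('New York', 2010),
        ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]

# Flat index whose labels are tuples (tupleize_cols=False stops pandas
# from turning the tuples into a MultiIndex).
tuple_pop = pd.Series(populations, index=pd.Index(keys, tupleize_cols=False))

# Selecting 2010 means inspecting every key by hand.
mask = [key[1] == 2010 for key in tuple_pop.index]
manual_2010 = tuple_pop[mask]

# With a MultiIndex, the same selection is a partial index on the second level.
multi_pop = pd.Series(populations, index=pd.MultiIndex.from_tuples(keys))
level_2010 = multi_pop[:, 2010]

print(manual_2010.tolist())
print(level_2010.tolist())
```

Both selections return the same values, but only the MultiIndex version drops the year level and leaves a Series indexed by state.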

#### MultiIndex as an Extra Dimension

The unstack() method converts a multiply indexed Series into a conventionally indexed DataFrame; stack() performs the opposite operation.

Each extra level in a multi-index represents an extra dimension of data.

In [19]:
pop_df = pop.unstack()
pop_df

                2000      2010
California  33871648  37253956
New York    18976457  19378102
Texas       20851820  25145561


In [20]:
pop_df.stack()

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
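The round trip between the long (multiply indexed) and wide (two-dimensional) layouts can be checked directly. This is a small sketch that rebuilds the population Series from scratch; the variable names `wide` and `tall` are assumptions made for illustration.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['California', 'New York', 'Texas'],
                                    [2000, 2010]])
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)

wide = pop.unstack()   # states as rows, years as columns
tall = wide.stack()    # back to one row per (state, year) pair

print(wide.shape)
print(tall.index.nlevels)
```

The wide form has one axis per index level, which is the sense in which each level acts as an extra dimension.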

In [22]:
pop_df = pd.DataFrame({
    'total': pop,
    'under18': [9267089, 9284094, 4687374, 4318033, 5906301, 6879014],
})
pop_df

                    total  under18
California 2000  33871648  9267089
           2010  37253956  9284094
New York   2000  18976457  4687374
           2010  19378102  4318033
Texas      2000  20851820  5906301
           2010  25145561  6879014


In [23]:
pop_df.stack()

California  2000  total      33871648
                  under18     9267089
            2010  total      37253956
                  under18     9284094
New York    2000  total      18976457
                  under18     4687374
            2010  total      19378102
                  under18     4318033
Texas       2000  total      20851820
                  under18     5906301
            2010  total      25145561
                  under18     6879014
dtype: int64

In [24]:
f_u18 = pop_df['under18'] / pop_df['total']
f_u18.unstack()

                2000      2010
California  0.273594  0.249211
New York    0.247010  0.222831
Texas       0.283251  0.273568


In [25]:
f_u18

California  2000    0.273594
            2010    0.249211
New York    2000    0.247010
            2010    0.222831
Texas       2000    0.283251
            2010    0.273568
dtype: float64

#### MultiIndex Creation Methods

In [26]:
# the simplest way to create a MultiIndex is to pass two or more index arrays to the constructor
df = pd.DataFrame(np.random.rand(4,2),
                  index=[list('aabb'), [1, 2, 1, 2]],
                  columns=['data1', 'data2'])
df

        data1     data2
a 1  0.462132  0.174091
  2  0.970736  0.331117
b 1  0.715981  0.079179
  2  0.923744  0.179402


In [29]:
# can also pass dictionary with tuples as keys
data = {('California', 2000): 33871648,
        ('California', 2010): 37253956,
        ('Texas', 2000): 20851820,
        ('Texas', 2010): 25145561,
        ('New York', 2000): 18976457,
        ('New York', 2010): 19378102}
pd.Series(data)

California  2000    33871648
            2010    37253956
Texas       2000    20851820
            2010    25145561
New York    2000    18976457
            2010    19378102
dtype: int64

In [30]:
pd.MultiIndex.from_arrays([
    list('aabb'),
    [1, 2, 1, 2]
])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

In [31]:
pd.MultiIndex.from_tuples([
    ('a', 1),
    ('a', 2),
    ('b', 1),
    ('b', 2)
])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

In [32]:
pd.MultiIndex.from_product([
    ['a', 'b'],
    [1, 2]
])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

In [34]:
pd.MultiIndex(
    levels=[['a', 'b'], [1, 2]],
    codes=[[0, 0, 1, 1], [0, 1, 0, 1]]
)

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )
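The four constructors above describe the same index in different ways. A quick sketch that confirms they are interchangeable (the variable names are illustrative only):

```python
import pandas as pd

from_arrays = pd.MultiIndex.from_arrays([list('aabb'), [1, 2, 1, 2]])
from_tuples = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
from_product = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

# levels hold the unique labels; codes are positions into those levels
explicit = pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
                         codes=[[0, 0, 1, 1], [0, 1, 0, 1]])

same = all(from_arrays.equals(other)
           for other in (from_tuples, from_product, explicit))
print(same)
print(list(explicit.levels[0]), list(explicit.codes[0]))
```

`from_product` is usually the shortest option when every combination of labels is present; `from_tuples` and `from_arrays` also cover irregular indices.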

In [35]:
# we can name the levels of a MultiIndex
pop.index.names = ['state', 'year']
pop

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
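Level names can also be supplied when the index is built, and named levels can then be referred to by name instead of position. A sketch under that assumption, using `xs` and `unstack` (the variable names `by_year` and `years_as_rows` are made up for this example):

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'],
)
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)

# cross-section: every state for a single year, selected by level name
by_year = pop.xs(2010, level='year')

# move the 'state' level to the columns, leaving 'year' as the row index
years_as_rows = pop.unstack(level='state')

print(by_year)
print(years_as_rows)
```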

#### MultiIndex Columns

In a DataFrame, the rows and columns are symmetric. Therefore, we can have MultiIndex columns in addition to rows.

In [37]:
# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

columns

MultiIndex([(  'Bob',   'HR'),
            (  'Bob', 'Temp'),
            ('Guido',   'HR'),
            ('Guido', 'Temp'),
            (  'Sue',   'HR'),
            (  'Sue', 'Temp')],
           names=['subject', 'type'])

In [38]:
# mock some data
data = np.round(np.random.randn(4, 6), 1)
data

array([[ 0. , -1.3,  0.2, -0.3, -0.8, -0.5],
       [ 0.3,  0.3, -0.4, -0.4, -0.2, -1.2],
       [-0.9, -1.4, -0.6,  0.7, -0.2, -1.8],
       [ 1.2, -0.9,  0.5, -2.1,  1.1,  0.4]])

In [40]:
data[:, ::2] *= 10
data += 37
data

array([[ 37. ,  35.7,  57. ,  36.7, -43. ,  36.5],
       [ 67. ,  37.3,  -3. ,  36.6,  17. ,  35.8],
       [-53. ,  35.6, -23. ,  37.7,  17. ,  35.2],
       [157. ,  36.1,  87. ,  34.9, 147. ,  37.4]])

In [41]:
# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data

subject       Bob        Guido          Sue
type           HR  Temp     HR  Temp     HR  Temp
year visit
2013 1       37.0  35.7   57.0  36.7  -43.0  36.5
     2       67.0  37.3   -3.0  36.6   17.0  35.8
2014 1      -53.0  35.6  -23.0  37.7   17.0  35.2
     2      157.0  36.1   87.0  34.9  147.0  37.4


In [48]:
health_data.loc[2013]

subject    Bob        Guido          Sue
type        HR  Temp     HR  Temp     HR  Temp
visit
1         37.0  35.7   57.0  36.7  -43.0  36.5
2         67.0  37.3   -3.0  36.6   17.0  35.8


In [49]:
health_data.loc[(2013, 1)]

subject  type
Bob      HR      37.0
         Temp    35.7
Guido    HR      57.0
         Temp    36.7
Sue      HR     -43.0
         Temp    36.5
Name: (2013, 1), dtype: float64

#### Indexing / Slicing a MultiIndex

In [50]:
pop

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [54]:
pop['California', 2000]

33871648

In [55]:
pop['California']

year
2000    33871648
2010    37253956
dtype: int64

In [56]:
pop.loc['California':'New York']

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
dtype: int64

In [57]:
pop[:, 2000]

state
California    33871648
New York      18976457
Texas         20851820
dtype: int64

In [58]:
pop[pop > 22000000]

state       year
California  2000    33871648
            2010    37253956
Texas       2010    25145561
dtype: int64

In [59]:
pop[['California', 'Texas']]

state       year
California  2000    33871648
            2010    37253956
Texas       2000    20851820
            2010    25145561
dtype: int64

In [60]:
health_data

subject       Bob        Guido          Sue
type           HR  Temp     HR  Temp     HR  Temp
year visit
2013 1       37.0  35.7   57.0  36.7  -43.0  36.5
     2       67.0  37.3   -3.0  36.6   17.0  35.8
2014 1      -53.0  35.6  -23.0  37.7   17.0  35.2
     2      157.0  36.1   87.0  34.9  147.0  37.4


In [63]:
health_data['Sue', 'HR']

year  visit
2013  1        -43.0
      2         17.0
2014  1         17.0
      2        147.0
Name: (Sue, HR), dtype: float64

In [66]:
health_data.iloc[:2, :2]

subject      Bob
type          HR  Temp
year visit
2013 1      37.0  35.7
     2      67.0  37.3


In [67]:
health_data.loc[:, ('Bob', 'HR')]

year  visit
2013  1         37.0
      2         67.0
2014  1        -53.0
      2        157.0
Name: (Bob, HR), dtype: float64

In [68]:
# slicing within index tuples is inconvenient and error-prone
# pandas provides the IndexSlice object to handle this
idx = pd.IndexSlice
idx

<pandas.core.indexing._IndexSlice at 0x11373a978>

In [69]:
health_data.loc[idx[:, 1], idx[:, 'HR']]

subject       Bob  Guido    Sue
type           HR     HR     HR
year visit
2013 1       37.0   57.0  -43.0
2014 1      -53.0  -23.0   17.0
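Python does not allow the `:` slice syntax inside a tuple, so without `IndexSlice` each "take everything" slot has to be spelled `slice(None)`. The sketch below shows the two spellings selecting the same block; it uses deterministic placeholder values rather than the random data above, and the names `df`, `with_idx`, and `with_slice` are illustrative.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

# placeholder values 0..23, not real measurements
df = pd.DataFrame(np.arange(24).reshape(4, 6), index=index, columns=columns)

idx = pd.IndexSlice
with_idx = df.loc[idx[:, 1], idx[:, 'HR']]                  # first visit, HR only
with_slice = df.loc[(slice(None), 1), (slice(None), 'HR')]  # same selection, spelled out

print(with_idx)
print(with_idx.equals(with_slice))
```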


#### Rearranging MultiIndices

There are many ways to rearrange MultiIndexed objects; stack() and unstack() are only the beginning.

In [70]:
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
data = pd.Series(np.random.rand(6), index=index)
data.index.names = ['char', 'int']
data

char  int
a     1      0.115129
      2      0.370067
c     1      0.161804
      2      0.950034
b     1      0.375766
      2      0.468197
dtype: float64

In [71]:
# you can't apply a label slice if the index levels have not been sorted
data['a':'b']

UnsortedIndexError: 'Key length (1) was greater than MultiIndex lexsort depth (0)'

In [72]:
data = data.sort_index()
data

char  int
a     1      0.115129
      2      0.370067
b     1      0.375766
      2      0.468197
c     1      0.161804
      2      0.950034
dtype: float64

In [73]:
data['a':'b']

char  int
a     1      0.115129
      2      0.370067
b     1      0.375766
      2      0.468197
dtype: float64
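Whether a MultiIndex is ready for label slicing can be checked before slicing. The sketch below uses `is_monotonic_increasing` as that check and integer placeholder values instead of random ones; catching `KeyError` relies on `UnsortedIndexError` being a `KeyError` subclass, and the name `sorted_data` is illustrative.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]],
                                   names=['char', 'int'])
data = pd.Series(range(6), index=index)

# the labels are not in lexicographic order yet
print(data.index.is_monotonic_increasing)

try:
    data['a':'b']
except KeyError as err:  # UnsortedIndexError derives from KeyError
    print(type(err).__name__)

# sorting the index makes label slices valid
sorted_data = data.sort_index()
print(sorted_data.index.is_monotonic_increasing)
print(sorted_data['a':'b'].tolist())
```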

In [75]:
pop

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [74]:
pop.unstack()

year            2000      2010
state
California  33871648  37253956
New York    18976457  19378102
Texas       20851820  25145561


In [76]:
pop.unstack(level=0)

state  California  New York     Texas
year
2000     33871648  18976457  20851820
2010     37253956  19378102  25145561


In [77]:
pop.unstack(level=1)

year            2000      2010
state
California  33871648  37253956
New York    18976457  19378102
Texas       20851820  25145561


In [78]:
pop.unstack().stack()

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [80]:
# reset_index converts the MultiIndex levels into ordinary columns
pop_flat = pop.reset_index(name='population')
pop_flat

        state  year  population
0  California  2000    33871648
1  California  2010    37253956
2    New York  2000    18976457
3    New York  2010    19378102
4       Texas  2000    20851820
5       Texas  2010    25145561


In [81]:
# similarly we can build a MultiIndex from the column values
pop_flat.set_index(['state', 'year'])

                 population
state      year
California 2000    33871648
           2010    37253956
New York   2000    18976457
           2010    19378102
Texas      2000    20851820
           2010    25145561
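`reset_index` and `set_index` are inverses of each other for this data, which a short round trip makes concrete. This is a sketch that rebuilds the named population Series; `flat` and `restored` are names chosen for illustration.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'],
)
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)

# index levels become columns; the values get the column name 'population'
flat = pop.reset_index(name='population')

# rebuild the MultiIndex from the columns and pull the values back out
restored = flat.set_index(['state', 'year'])['population']

print(list(flat.columns))
print(restored.index.equals(pop.index))
```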


#### Data Aggregation on Multi-Indices

In [82]:
health_data

subject       Bob        Guido          Sue
type           HR  Temp     HR  Temp     HR  Temp
year visit
2013 1       37.0  35.7   57.0  36.7  -43.0  36.5
     2       67.0  37.3   -3.0  36.6   17.0  35.8
2014 1      -53.0  35.6  -23.0  37.7   17.0  35.2
     2      157.0  36.1   87.0  34.9  147.0  37.4


In [89]:
# average the measurements from the two visits in each year
# (groupby is used because mean(level=...) was removed in pandas 2.0)
data_mean = health_data.groupby(level='year').mean()
data_mean

subject   Bob        Guido           Sue
type       HR   Temp    HR   Temp     HR   Temp
year
2013     52.0  36.50  27.0  36.65  -13.0  36.15
2014     52.0  35.85  32.0  36.30   82.0  36.30


In [90]:
# the same aggregation works along the columns:
# transpose, group by the 'type' column level, then transpose back
data_mean.T.groupby(level='type').mean().T

type         HR       Temp
year
2013  22.000000  36.433333
2014  55.333333  36.150000
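Level-wise aggregation is easier to verify on small deterministic data. The following sketch uses made-up values 0 to 15 (not the health data above) with two hypothetical subjects, and expresses both aggregations through `groupby(level=...)`, which is one portable way to write them.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

# placeholder values, chosen so the means are easy to check by hand
df = pd.DataFrame(np.arange(16, dtype=float).reshape(4, 4),
                  index=index, columns=columns)

# collapse the 'visit' row level: one row per year
yearly = df.groupby(level='year').mean()

# collapse the 'subject' column level: one column per measurement type
by_type = yearly.T.groupby(level='type').mean().T

print(yearly)
print(by_type)
```

For 2013 the two visit rows are 0..3 and 4..7, so the yearly means are 2, 3, 4, 5; averaging the two HR columns then gives 3 and the two Temp columns give 4.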
