# Hierarchical Indexing

Hierarchical (multi-level) indexing opens the door to sophisticated data analysis and manipulation, especially for higher-dimensional data. In essence, it lets you store and manipulate data with an arbitrary number of dimensions in lower-dimensional data structures such as Series (1-D) and DataFrame (2-D).

## Creating a MultiIndex (hierarchical index) object

The MultiIndex object is the hierarchical analogue of the standard Index object which typically stores the axis labels in pandas objects. You can think of MultiIndex as an array of tuples where each tuple is unique. A MultiIndex can be created from a list of arrays (using MultiIndex.from_arrays()), an array of tuples (using MultiIndex.from_tuples()), a crossed set of iterables (using MultiIndex.from_product()), or a DataFrame (using MultiIndex.from_frame()). The Index constructor will attempt to return a MultiIndex when it is passed a list of tuples. The following examples demonstrate different ways to initialize MultiIndexes.

In [1]:
import pandas as pd
import numpy as np

In [2]:
arrays = [
    ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
    ["one", "two", "one", "two", "one", "two", "one", "two"],
]

In [3]:
arrays

[['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
 ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]

In [4]:
tuples = list(zip(*arrays))

In [5]:
tuples

[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]

In [6]:
languages = ['Java', 'Python', 'JavaScript']
versions = [14, 3, 6]

result = zip(languages, versions)
list(result)

[('Java', 14), ('Python', 3), ('JavaScript', 6)]

In [7]:
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])

In [8]:
index

MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])

In [9]:
s = pd.Series(np.random.randn(8), index=index)

In [10]:
s

first  second
bar    one      -0.448490
       two      -1.159014
baz    one      -0.119939
       two       0.652187
foo    one      -0.172993
       two       0.611660
qux    one       0.996318
       two      -0.134842
dtype: float64

To pair every element of one iterable with every element of another, use MultiIndex.from_product():

In [11]:
iterables = [["bar", "baz", "foo", "qux"], ["one", "two"]]

In [12]:
pd.MultiIndex.from_product(iterables, names=["first", "second"])

MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])

In [13]:
df = pd.DataFrame(
    [["bar", "one"], ["bar", "two"], ["foo", "one"], ["foo", "two"]],
    columns=["first", "second"],
)
df

  first second
0   bar    one
1   bar    two
2   foo    one
3   foo    two


In [14]:
pd.MultiIndex.from_frame(df)

MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('foo', 'one'),
            ('foo', 'two')],
           names=['first', 'second'])
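
As a quick check on the constructors described above, the sketch below builds the same small index four ways and compares the results. The sample labels and the variable names are illustrative only; the final line shows the plain Index constructor returning a MultiIndex when it receives a list of tuples.

```python
import pandas as pd

arrays = [["bar", "bar", "foo", "foo"], ["one", "two", "one", "two"]]
tuples = list(zip(*arrays))
names = ["first", "second"]

# Four routes to the same hierarchical index
from_arrays = pd.MultiIndex.from_arrays(arrays, names=names)
from_tuples = pd.MultiIndex.from_tuples(tuples, names=names)
from_product = pd.MultiIndex.from_product([["bar", "foo"], ["one", "two"]], names=names)
from_frame = pd.MultiIndex.from_frame(pd.DataFrame(tuples, columns=names))

# equals() compares the labels position by position
all_equal = all(
    from_arrays.equals(other) for other in (from_tuples, from_product, from_frame)
)

# The plain Index constructor infers a MultiIndex from a list of tuples
inferred = pd.Index(tuples)

print(all_equal, type(inferred).__name__)
```

from_product() is usually the shortest option when the index is a full cross of its levels; from_tuples() and from_arrays() are needed when only some combinations occur.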

As a convenience, you can pass a list of arrays directly into Series or DataFrame to construct a MultiIndex automatically:

In [15]:
arrays = [
    np.array(["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"]),
    np.array(["one", "two", "one", "two", "one", "two", "one", "two"]),
]
s = pd.Series(np.random.randn(8), index=arrays)
s

bar  one    0.622156
     two   -1.443556
baz  one   -0.466087
     two   -1.793726
foo  one    0.659844
     two    0.706194
qux  one   -0.730235
     two    0.403805
dtype: float64
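
The same shortcut should work for a DataFrame. The following is a minimal sketch with deterministic values in place of random ones; the frame and its column names x and y are made up for illustration.

```python
import numpy as np
import pandas as pd

arrays = [
    np.array(["bar", "bar", "baz", "baz"]),
    np.array(["one", "two", "one", "two"]),
]

# Passing the list of arrays as the index builds a MultiIndex automatically
frame = pd.DataFrame(np.arange(8).reshape(4, 2), index=arrays, columns=["x", "y"])

print(type(frame.index).__name__)
print(list(frame.index.names))         # levels stay unnamed unless names are set
print(frame.loc[("baz", "one"), "y"])  # third row, second column
```

Because no names are supplied, both levels are unnamed, which is why the Series output above has no header row for the index.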

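## Reconstructing the level labels

The get_level_values() method returns the labels at each position for a given level, which can be identified by position or by name: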

In [16]:
index.get_level_values(0)

Index(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], dtype='object', name='first')

In [17]:
index.get_level_values("second")

Index(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'], dtype='object', name='second')

## Basic indexing on axis with MultiIndex

One of the important features of hierarchical indexing is that you can select data by a “partial” label identifying a subgroup in the data. Partial selection “drops” levels of the hierarchical index in the result in a completely analogous way to selecting a column in a regular DataFrame:

In [18]:
df = pd.DataFrame(np.random.randn(3, 8), index=["A", "B", "C"], columns=index)

In [19]:
df

first        bar                 baz                 foo                 qux
second       one       two       one       two       one       two       one       two
A      -2.574435  0.271631 -0.049531  0.752474 -0.043887  0.004016 -0.060635   0.97072
B       1.264841 -0.212004  0.806223 -1.879281 -0.986661 -0.893546 -2.145456  -0.09217
C       0.366449 -0.936474  0.559537 -0.937182 -0.579063 -0.776911  0.013164   0.48755


In [20]:
df["bar"]

second       one       two
A      -2.574435  0.271631
B       1.264841 -0.212004
C       0.366449 -0.936474


In [21]:
df["bar", "one"]

A   -2.574435
B    1.264841
C    0.366449
Name: (bar, one), dtype: float64
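
To make the level-dropping behaviour concrete, here is a small sketch on a frame with fixed values. The frame is a cut-down, hypothetical version of the one above.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)
frame = pd.DataFrame(np.arange(12).reshape(3, 4), index=["A", "B", "C"], columns=columns)

sub = frame["bar"]         # partial label: a DataFrame with the outer level dropped
col = frame["bar", "one"]  # full label: a Series

print(sub.columns.nlevels, sub.columns.name)
print(col.tolist())
```

A partial label returns everything beneath it with one level fewer, while a complete label narrows the result to a single column.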

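## Defined levels

A MultiIndex keeps all the defined levels of an index, even if they are not actually used. You may notice this after selecting a subset: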

In [22]:
df.columns.levels 

FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])

In [23]:
df[["foo", "qux"]].columns.levels  # unchanged: unused levels are kept

FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])

To see only the labels that are actually in use, inspect the values directly or call get_level_values():

In [24]:
df[["foo", "qux"]].columns.to_numpy()

array([('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')],
      dtype=object)

In [25]:
df[["foo", "qux"]].columns.get_level_values(0)

Index(['foo', 'foo', 'qux', 'qux'], dtype='object', name='first')

In [26]:
df[["foo", "qux"]].columns.get_level_values(1)

Index(['one', 'two', 'one', 'two'], dtype='object', name='second')
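
If the levels themselves need to be trimmed to the labels in use, MultiIndex.remove_unused_levels() can be applied. The sketch below assumes a zero-filled frame shaped like the one above; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product(
    [["bar", "baz", "foo", "qux"], ["one", "two"]], names=["first", "second"]
)
frame = pd.DataFrame(np.zeros((3, 8)), index=["A", "B", "C"], columns=columns)

subset_cols = frame[["foo", "qux"]].columns

# Returns a new MultiIndex whose levels contain only labels that are in use
trimmed = subset_cols.remove_unused_levels()

print(list(subset_cols.levels[0]))
print(list(trimmed.levels[0]))
```

The trimmed index still compares equal to the original one, since only the stored levels change and not the labels at each position.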

## Data alignment and using reindex

Operations between differently-indexed objects having MultiIndex on the axes will work as you expect; data alignment will work the same as an Index of tuples:

In [27]:
s 

bar  one    0.622156
     two   -1.443556
baz  one   -0.466087
     two   -1.793726
foo  one    0.659844
     two    0.706194
qux  one   -0.730235
     two    0.403805
dtype: float64

In [28]:
s[:-2]

bar  one    0.622156
     two   -1.443556
baz  one   -0.466087
     two   -1.793726
foo  one    0.659844
     two    0.706194
dtype: float64

In [29]:
s + s[:-2]

bar  one    1.244312
     two   -2.887112
baz  one   -0.932175
     two   -3.587452
foo  one    1.319688
     two    1.412387
qux  one         NaN
     two         NaN
dtype: float64

In [30]:
s[::2]

bar  one    0.622156
baz  one   -0.466087
foo  one    0.659844
qux  one   -0.730235
dtype: float64

In [31]:
s + s[::2]

bar  one    1.244312
     two         NaN
baz  one   -0.932175
     two         NaN
foo  one    1.319688
     two         NaN
qux  one   -1.460470
     two         NaN
dtype: float64

In [32]:
index

MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])

In [33]:
s[index[:3]]

bar  one    0.622156
     two   -1.443556
baz  one   -0.466087
dtype: float64

The reindex() method of Series/DataFrames can be called with another MultiIndex, or even a list or array of tuples:

In [34]:
s.reindex(index[:3])

first  second
bar    one       0.622156
       two      -1.443556
baz    one      -0.466087
dtype: float64
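
The alignment and reindexing behaviour can be sketched with fixed values, as below. The Series and the ("qux", "one") label are hypothetical; the point is that missing labels produce NaN and that reindex() accepts a plain list of tuples.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)
ser = pd.Series([1.0, 2.0, 3.0, 4.0], index=index)

# Addition aligns on the full tuples; ("baz", "two") is missing on the right
total = ser + ser.iloc[:-1]

# reindex() accepts a list of tuples; labels not present become NaN
picked = ser.reindex([("baz", "one"), ("bar", "two"), ("qux", "one")])

print(total)
print(picked)
```

Here reindex() both reorders the existing entries and introduces a new label, so the last value of picked is NaN.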

## Advanced indexing with hierarchical index

In [35]:
df

first        bar                 baz                 foo                 qux
second       one       two       one       two       one       two       one       two
A      -2.574435  0.271631 -0.049531  0.752474 -0.043887  0.004016 -0.060635   0.97072
B       1.264841 -0.212004  0.806223 -1.879281 -0.986661 -0.893546 -2.145456  -0.09217
C       0.366449 -0.936474  0.559537 -0.937182 -0.579063 -0.776911  0.013164   0.48755


In [36]:
df.T

                     A         B         C
first second
bar   one    -2.574435  1.264841  0.366449
      two     0.271631 -0.212004 -0.936474
baz   one    -0.049531  0.806223  0.559537
      two     0.752474 -1.879281 -0.937182
foo   one    -0.043887 -0.986661 -0.579063
      two     0.004016 -0.893546 -0.776911
qux   one    -0.060635 -2.145456  0.013164
      two      0.97072  -0.09217   0.48755


In [37]:
df = df.T

In [38]:
df.loc[("bar", "two")]

A    0.271631
B   -0.212004
C   -0.936474
Name: (bar, two), dtype: float64

In [39]:
df.loc[("bar", "two"), "A"]

0.2716314659812904

In [40]:
df.loc["bar"]

               A         B         C
second
one    -2.574435  1.264841  0.366449
two     0.271631 -0.212004 -0.936474


### Partial slicing

In [41]:
df.loc["baz":"foo"]

                     A         B         C
first second
baz   one    -0.049531  0.806223  0.559537
      two     0.752474 -1.879281 -0.937182
foo   one    -0.043887 -0.986661 -0.579063
      two     0.004016 -0.893546 -0.776911


You can also slice with a range of tuples:

In [42]:
df.loc[("baz", "two"):("qux", "one")]

                     A         B         C
first second
baz   two     0.752474 -1.879281 -0.937182
foo   one    -0.043887 -0.986661 -0.579063
      two     0.004016 -0.893546 -0.776911
qux   one    -0.060635 -2.145456  0.013164
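
Both kinds of slice can be reproduced on a frame with known values, as in the sketch below. The single column A holding 0 to 7 is an assumption made for readability, and the index is built already sorted, which label slicing on a MultiIndex generally relies on.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [["bar", "baz", "foo", "qux"], ["one", "two"]], names=["first", "second"]
)
frame = pd.DataFrame({"A": np.arange(8)}, index=index)

# Partial slice on the outer level: every row from "baz" through "foo"
partial = frame.loc["baz":"foo"]

# Slice between two full tuples: both endpoints are included
by_tuple = frame.loc[("baz", "two"):("qux", "one")]

print(partial["A"].tolist())
print(by_tuple["A"].tolist())
```

If the index is not sorted, calling sort_index() first is the usual way to make such slices work.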


### Slicing (extra)

In [43]:
s = pd.Series(np.random.randn(6), index=list("abcdef"))

In [44]:
s

a   -0.166921
b   -0.844386
c   -1.864096
d   -0.915535
e   -1.030069
f    1.342979
dtype: float64

In [45]:
s[:3]

a   -0.166921
b   -0.844386
c   -1.864096
dtype: float64

In [46]:
s[:"d"]

a   -0.166921
b   -0.844386
c   -1.864096
d   -0.915535
dtype: float64

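Note the difference: the positional slice s[:3] stops before position 3, whereas the label-based slice s[:"d"] includes the end label "d". This is most definitely a “practicality beats purity” sort of thing, but it is something to watch out for if you expect label-based slicing to behave exactly like standard Python integer slicing.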
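
The contrast is easier to see when the two kinds of slice are written explicitly with iloc and loc. This is a minimal sketch on a Series with integer values; the variable names are illustrative.

```python
import pandas as pd

ser = pd.Series(range(6), index=list("abcdef"))

by_position = ser.iloc[:3]  # positions 0, 1, 2: the end position is excluded
by_label = ser.loc[:"d"]    # labels "a" through "d": the end label is included

print(list(by_position.index))
print(list(by_label.index))
```

Spelling out iloc or loc removes any doubt about which rule applies to a given slice.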

In [47]:
!pwd

'pwd'은(는) 내부 또는 외부 명령, 실행할 수 있는 프로그램, 또는
배치 파일이 아닙니다.
