# Hierarchical Indexing

- one-dimensional data - ``Series``
- two-dimensional data - ``DataFrame`` 
- higher-dimensional data - data indexed by more than two keys


Older versions of Pandas provided ``Panel`` and ``Panel4D`` objects that natively handled three-dimensional and four-dimensional data (see [Aside: Panel Data](#Aside:-Panel-Data)), but these have been removed from current releases. The far more common pattern in practice is to make use of *hierarchical indexing* (also known as *multi-indexing*) to incorporate multiple index *levels* within a single index.

In this way, higher-dimensional data can be compactly represented within the familiar one-dimensional ``Series`` and two-dimensional ``DataFrame`` objects.

In [None]:
import pandas as pd
import numpy as np

## A Multiply Indexed Series

Consider how we might represent two-dimensional data within a one-dimensional ``Series``.

For concreteness, we will consider a series of data where each point has a character and numerical key.

### The bad way

Suppose you would like to track data about states from two different years.
Using the Pandas tools we've already covered, you might be tempted to simply use Python tuples as keys:

In [None]:
index = [('CF', 2000), 
         ('CF', 2010),
         ('NY', 2000), 
         ('NY', 2010),
         ('TX', 2000), 
         ('TX', 2010)]

populations = [48, 46, 47,  20, 20, 61]

ppln = pd.Series(populations, 
                index=index
                )
ppln

(CF, 2000)    48
(CF, 2010)    46
(NY, 2000)    47
(NY, 2010)    20
(TX, 2000)    20
(TX, 2010)    61
dtype: int64

With this indexing scheme, you can straightforwardly index or slice the series based on this multiple index:

In [None]:
ppln.iloc[0]  # positional access to the first element

48

In [None]:
ppln.index

Index([('CF', 2000), ('CF', 2010), ('NY', 2000), ('NY', 2010), ('TX', 2000),
       ('TX', 2010)],
      dtype='object')

In [None]:
ppln[('CF', 2010)]

46

In [None]:
ppln[('CF', 2010):('TX', 2000)]

(CF, 2010)    46
(NY, 2000)    47
(NY, 2010)    20
(TX, 2000)    20
dtype: int64

But the convenience ends there. 

For example, if you need to select all values from 2010, you'll need to do some messy (and potentially slow) munging to make it happen:

In [None]:
ppln[[i for i in ppln.index if i[1] == 2010]]

(CF, 2010)    46
(NY, 2010)    20
(TX, 2010)    61
dtype: int64

This produces the desired result, but is not as clean (or as efficient for large datasets) as the slicing syntax we've grown to love in Pandas.

### The Better Way - Pandas `MultiIndex`
 
Our tuple-based indexing is essentially a rudimentary multi-index, and the Pandas ``MultiIndex`` type gives us the type of operations we wish to have.

We can create a multi-index from the tuples as follows:

In [None]:
nIndex = pd.MultiIndex.from_tuples(index)
nIndex

MultiIndex([('CF', 2000),
            ('CF', 2010),
            ('NY', 2000),
            ('NY', 2010),
            ('TX', 2000),
            ('TX', 2010)],
           )

Notice that the ``MultiIndex`` contains multiple *levels* of indexing (in this case, the state names and the years), as well as multiple *codes* (called *labels* in older Pandas versions) for each data point which encode these levels.
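The levels and their per-point encoding can be inspected directly. The following is a minimal sketch, assuming a recent Pandas version in which the encoding is exposed as the ``codes`` attribute; the variable names ``mi``, ``level_values``, and ``level_codes`` are illustrative only.

```python
import pandas as pd

# Same (state, year) tuples as in the text
tuples = [('CF', 2000), ('CF', 2010),
          ('NY', 2000), ('NY', 2010),
          ('TX', 2000), ('TX', 2010)]
mi = pd.MultiIndex.from_tuples(tuples)

# Each level stores its unique values once
level_values = [level.tolist() for level in mi.levels]
# Each data point is encoded as an integer position into every level
level_codes = [codes.tolist() for codes in mi.codes]

print(level_values)
print(level_codes)
```

Each entry of ``level_codes`` has one integer per data point, so the pair of codes at a given position identifies that point's state and year.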

If we re-index our series with this ``MultiIndex``, we see the hierarchical representation of the data:

In [None]:
ppln = ppln.reindex(nIndex)
ppln

CF  2000    48
    2010    46
NY  2000    47
    2010    20
TX  2000    20
    2010    61
dtype: int64

In [None]:
type(ppln)

pandas.core.series.Series

Here the first two columns of the ``Series`` representation show the multiple index values, while the third column shows the data.

Notice that some entries are missing in the first column: in this multi-index representation, any blank entry indicates the same value as the line above it.

Now to access all data for which the second index is 2010, we can simply use the Pandas slicing notation:

In [None]:
ppln[:, 2010] #skip first column of index

CF    46
NY    20
TX    61
dtype: int64

In [None]:
ppln['NY']

2000    47
2010    20
dtype: int64

The result is a singly indexed array with just the keys we're interested in.

This syntax is much more convenient (and the operation is much more efficient!) than the home-spun tuple-based multi-indexing solution that we started with.

We'll now further discuss this sort of indexing operation on hierarchically indexed data.

### MultiIndex as extra dimension

We could easily have stored the same data using a simple ``DataFrame`` with index and column labels.
  - ``unstack()`` - convert a multiply indexed ``Series`` into a ``DataFrame``

In [None]:
print(type(ppln))
ppln

<class 'pandas.core.series.Series'>


CF  2000    48
    2010    46
NY  2000    47
    2010    20
TX  2000    20
    2010    61
dtype: int64

In [None]:
ppln_df = ppln.unstack()
ppln_df

Unnamed: 0,2000,2010
CF,48,46
NY,47,20
TX,20,61


In [None]:
type(ppln_df)

pandas.core.frame.DataFrame

- ``stack()`` -  provides the opposite operation

In [None]:
ppln_df_stack = ppln_df.stack()
ppln_df_stack

CF  2000    48
    2010    46
NY  2000    47
    2010    20
TX  2000    20
    2010    61
dtype: int64

In [None]:
type(ppln_df_stack)

pandas.core.series.Series

Just as we were able to use multi-indexing to represent two-dimensional data within a one-dimensional ``Series``, we can also use it to represent data of three or more dimensions in a ``Series`` or ``DataFrame``.

Each extra level in a multi-index represents an extra dimension of data. 
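As a rough illustration of this idea, the sketch below builds a three-level index. The third level (``group``) and the values are hypothetical placeholders, not data from this section, and ``xs`` is used here only as one convenient way to fix some levels.

```python
import numpy as np
import pandas as pd

# Hypothetical three-level index: state x year x group
idx3 = pd.MultiIndex.from_product(
    [['CF', 'NY'], [2000, 2010], ['under18', 'adult']],
    names=['state', 'year', 'group'],
)
# Placeholder values 0..7, not real population figures
s3 = pd.Series(np.arange(8), index=idx3)

# Fixing two of the three levels leaves a one-dimensional slice over 'year'
by_year = s3.xs(('CF', 'under18'), level=['state', 'group'])

# Unstacking moves the innermost level ('group') into the columns
wide = s3.unstack()

print(by_year)
print(wide)
```

Three levels behave like three axes: fixing two of them leaves a series over the remaining one, and unstacking trades an index level for a column axis.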

Concretely, we might want to add another column of demographic data for each state at each year (say, population under 18); with a ``MultiIndex`` this is as easy as adding another column to the ``DataFrame``:

In [None]:
pop_df = pd.DataFrame({'total': ppln,
                       'under18': [19, 14, 34, 13, 10, 14]
                       })
pop_df

Unnamed: 0,Unnamed: 1,total,under18
CF,2000,48,19
CF,2010,46,14
NY,2000,47,34
NY,2010,20,13
TX,2000,20,10
TX,2010,61,14


Compute the fraction of people under 18 by year:

In [None]:
f_u18 = pop_df['under18'] / pop_df['total']
f_u18

CF  2000    0.395833
    2010    0.304348
NY  2000    0.723404
    2010    0.650000
TX  2000    0.500000
    2010    0.229508
dtype: float64

In [None]:
f_u18.unstack()

Unnamed: 0,2000,2010
CF,0.395833,0.304348
NY,0.723404,0.65
TX,0.5,0.229508


This allows us to easily and quickly manipulate and explore even high-dimensional data.

## Methods of MultiIndex Creation

The most straightforward way to construct a multiply indexed ``Series`` or ``DataFrame`` is to simply pass a list of two or more index arrays to the constructor.  

In [None]:
rands = np.random.rand(4, 2)  # 4 rows, 2 columns (a 2-D NumPy array)
rands 

array([[0.12784152, 0.38895604],
       [0.89743458, 0.58116077],
       [0.48839931, 0.36144782],
       [0.78935031, 0.32165196]])

In [None]:
indices = [['a', 'a', 'b', 'b'],  # labels for the first (outer) index level
           [1, 2, 1, 2]           # labels for the second (inner) index level
           ]
indices #list of lists

[['a', 'a', 'b', 'b'], [1, 2, 1, 2]]

In [None]:
df = pd.DataFrame(rands,
                  index= indices, 
                  columns=['data1', 'data2']
                  )
df

Unnamed: 0,Unnamed: 1,data1,data2
a,1,0.127842,0.388956
a,2,0.897435,0.581161
b,1,0.488399,0.361448
b,2,0.78935,0.321652


The work of creating the ``MultiIndex`` is done in the background.

Similarly, if you pass a `dictionary` with appropriate tuples as keys, Pandas will automatically recognize this and use a ``MultiIndex`` by default:

In [None]:
data = {('CF', 2000): 48,
        ('CF', 2010): 56,
        ('TX', 2000): 20,
        ('TX', 2010): 61,
        ('NY', 2000): 57,
        ('NY', 2010): 20
        }

In [None]:
sdta = pd.Series(data)
sdta

CF  2000    48
    2010    56
TX  2000    20
    2010    61
NY  2000    57
    2010    20
dtype: int64

In [None]:
dfdta = sdta.to_frame('population')
dfdta

Unnamed: 0,Unnamed: 1,population
CF,2000,48
CF,2010,56
TX,2000,20
TX,2010,61
NY,2000,57
NY,2010,20


Nevertheless, it is sometimes useful to explicitly create a ``MultiIndex``; we'll see a couple of these methods here.

### Explicit MultiIndex constructors

For more flexibility in how the index is constructed, you can instead use the class method constructors available in the ``pd.MultiIndex``.


For example, as we did before, you can construct the ``MultiIndex`` from a simple list of arrays giving the index values within each level:

In [None]:
pd.MultiIndex.from_arrays(indices)

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

You can construct it from a list of tuples giving the multiple index values of each point:

In [None]:
indices = [('a', 1), ('a', 2), ('b', 1), ('b', 2)] 
indices  #list of tuples

[('a', 1), ('a', 2), ('b', 1), ('b', 2)]

In [None]:
pd.MultiIndex.from_tuples(indices)

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

You can even construct it from a `Cartesian product` of single indices:

In [None]:
indices = [['a', 'b'], 
           [1, 2]
           ]
indices #list of lists

[['a', 'b'], [1, 2]]

In [None]:
pd.MultiIndex.from_product(indices)

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

Any of these objects can be passed as the ``index`` argument when creating a ``Series`` or ``DataFrame``, or be passed to the ``reindex`` method of an existing ``Series`` or ``DataFrame``.
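Both uses can be sketched briefly. The values and the names ``s``, ``bigger``, and ``s_big`` below are made up for illustration; the point to notice is that ``reindex`` conforms existing data to the new index rather than relabeling it, so combinations absent from the original come back as ``NaN``.

```python
import pandas as pd

# Pass a MultiIndex as the index argument at construction time
mi = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])
s = pd.Series([10, 20, 30, 40], index=mi)

# reindex aligns the existing data to a new index;
# the ('c', 1) and ('c', 2) entries do not exist in s, so they become NaN
bigger = pd.MultiIndex.from_product([['a', 'b', 'c'], [1, 2]])
s_big = s.reindex(bigger)

print(s)
print(s_big)
```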

### MultiIndex level names

Sometimes it is convenient to name the levels of the ``MultiIndex``.

This can be accomplished by passing the ``names`` argument to any of the above ``MultiIndex`` constructors, or by setting the ``names`` attribute of the index after the fact:

In [None]:
ppln

CF  2000    48
    2010    46
NY  2000    47
    2010    20
TX  2000    20
    2010    61
dtype: int64

In [None]:
ppln.index.names = ['state', 'year']
ppln

state  year
CF     2000    48
       2010    46
NY     2000    47
       2010    20
TX     2000    20
       2010    61
dtype: int64

With more involved datasets, this can be a useful way to keep track of the meaning of various index values.
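The other route, passing ``names`` at construction time, might look like the following sketch. It reuses the population figures from this section; ``named`` and ``by_state`` are illustrative variable names.

```python
import pandas as pd

# Name the levels when the index is built
named = pd.MultiIndex.from_product(
    [['CF', 'NY', 'TX'], [2000, 2010]],
    names=['state', 'year'],
)
s = pd.Series([48, 46, 47, 20, 20, 61], index=named)

# A level name can be used wherever a level number is accepted
by_state = s.unstack(level='year')

print(by_state)
```

Referring to levels by name rather than position tends to make later code easier to read, especially once there are more than two levels.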

### MultiIndex for columns

In a ``DataFrame``, the rows and columns are completely symmetric, and just as the rows can have multiple levels of indices, the columns can have multiple levels as well.

Consider the following, which is a mock-up of some (somewhat realistic) medical data:

Hierarchical indices and columns : 

In [None]:
rIndices = [[2013, 2014], [1, 2]]
cIndices = [['Bob', 'Guido', 'Sue'], ['HR', 'Temp']]

In [None]:
index = pd.MultiIndex.from_product(rIndices,
                                   names=['year', 'visit']
                                   )
columns = pd.MultiIndex.from_product(cIndices,
                                     names=['subject', 'type'])

In [None]:
index

MultiIndex([(2013, 1),
            (2013, 2),
            (2014, 1),
            (2014, 2)],
           names=['year', 'visit'])

In [None]:
columns

MultiIndex([(  'Bob',   'HR'),
            (  'Bob', 'Temp'),
            ('Guido',   'HR'),
            ('Guido', 'Temp'),
            (  'Sue',   'HR'),
            (  'Sue', 'Temp')],
           names=['subject', 'type'])

mock data:

In [None]:
rands = np.random.randn(4, 6)  # 4 rows, 6 columns (a 2-D NumPy array)
rands

array([[-0.99670703, -2.06796045, -0.66564366,  1.63818315,  0.0146547 ,
         1.77880527],
       [-0.45510831,  0.95414602, -0.52518174, -0.1316519 , -0.14628677,
        -0.56221147],
       [ 0.92203474,  1.22349725, -0.76105974, -0.15939213,  0.58560532,
         0.12612514],
       [-0.82292006, -0.70723157, -2.29007619,  0.24198059,  0.12793811,
         0.09239994]])

In [None]:
data = np.round(rands, 1) #round to one decimal
data

array([[-1. , -2.1, -0.7,  1.6,  0. ,  1.8],
       [-0.5,  1. , -0.5, -0.1, -0.1, -0.6],
       [ 0.9,  1.2, -0.8, -0.2,  0.6,  0.1],
       [-0.8, -0.7, -2.3,  0.2,  0.1,  0.1]])

In [None]:
data[:, ::2] #every other column

array([[-1. , -0.7,  0. ],
       [-0.5, -0.5, -0.1],
       [ 0.9, -0.8,  0.6],
       [-0.8, -2.3,  0.1]])

In [None]:
data[:, ::2] *= 10 #every other column multiplied by 10
data

array([[-10. ,  -2.1,  -7. ,   1.6,   0. ,   1.8],
       [ -5. ,   1. ,  -5. ,  -0.1,  -1. ,  -0.6],
       [  9. ,   1.2,  -8. ,  -0.2,   6. ,   0.1],
       [ -8. ,  -0.7, -23. ,   0.2,   1. ,   0.1]])

In [None]:
data += 37
data

array([[27. , 34.9, 30. , 38.6, 37. , 38.8],
       [32. , 38. , 32. , 36.9, 36. , 36.4],
       [46. , 38.2, 29. , 36.8, 43. , 37.1],
       [29. , 36.3, 14. , 37.2, 38. , 37.1]])

In [None]:
# create the DataFrame
health_data = pd.DataFrame(data, 
                           index=index, 
                           columns=columns
                           )
health_data

Unnamed: 0_level_0,subject,Bob,Bob,Guido,Guido,Sue,Sue
Unnamed: 0_level_1,type,HR,Temp,HR,Temp,HR,Temp
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
2013,1,27.0,34.9,30.0,38.6,37.0,38.8
2013,2,32.0,38.0,32.0,36.9,36.0,36.4
2014,1,46.0,38.2,29.0,36.8,43.0,37.1
2014,2,29.0,36.3,14.0,37.2,38.0,37.1


This is fundamentally four-dimensional data, where the dimensions are the subject, the measurement type, the year, and the visit number.

Here we see where the multi-indexing for both rows and columns can come in *very* handy.

With this in place we can, for example, index the top-level column by the person's name and get a full ``DataFrame`` containing just that person's information:

In [None]:
health_data['Guido']

Unnamed: 0_level_0,type,HR,Temp
year,visit,Unnamed: 2_level_1,Unnamed: 3_level_1
2013,1,30.0,38.6
2013,2,32.0,36.9
2014,1,29.0,36.8
2014,2,14.0,37.2


For complicated records containing multiple labeled measurements across multiple times for many subjects (people, countries, cities, etc.) use of hierarchical rows and columns can be extremely convenient!

## Indexing and Slicing a MultiIndex

Indexing and slicing on a ``MultiIndex`` is designed to be intuitive, and it helps if you think about the indices as added dimensions.

We'll first look at indexing multiply indexed ``Series``, and then multiply-indexed ``DataFrame``s.

### Multiply indexed Series

Consider the multiply indexed ``Series`` of state populations we saw earlier:

In [None]:
ppln

state  year
CF     2000    48
       2010    46
NY     2000    47
       2010    20
TX     2000    20
       2010    61
dtype: int64

We can access single elements by indexing with multiple terms:

In [None]:
ppln['CF', 2000]

48

The ``MultiIndex`` also supports *partial indexing*, or indexing just one of the levels in the index.

- The result is another ``Series``, with the lower-level indices maintained

In [None]:
ppln['CF']

year
2000    48
2010    46
dtype: int64

In [None]:
type(ppln['CF'])

pandas.core.series.Series

Partial slicing is available as well, as long as the ``MultiIndex`` is sorted.

In [None]:
type(ppln)

pandas.core.series.Series

In [None]:
ppln.loc['CF':'NY']

state  year
CF     2000    48
       2010    46
NY     2000    47
       2010    20
dtype: int64

With sorted indices, partial indexing can be performed on lower levels by passing an empty slice in the first index:

In [None]:
ppln[:, 2000] #applicable in series, not in df

state
CF    48
NY    47
TX    20
dtype: int64

Other types of indexing and selection work as well, for example selection based on Boolean masks:

In [None]:
ppln

state  year
CF     2000    48
       2010    46
NY     2000    47
       2010    20
TX     2000    20
       2010    61
dtype: int64

In [None]:
ppln[ppln > 22]

state  year
CF     2000    48
       2010    46
NY     2000    47
TX     2010    61
dtype: int64

Selection based on fancy indexing also works:

In [None]:
ppln[['CF', 'TX']]

state  year
CF     2000    48
       2010    46
TX     2000    20
       2010    61
dtype: int64

### Multiply indexed DataFrames

A multiply indexed ``DataFrame`` behaves in a similar manner.

In [None]:
health_data

Unnamed: 0_level_0,subject,Bob,Bob,Guido,Guido,Sue,Sue
Unnamed: 0_level_1,type,HR,Temp,HR,Temp,HR,Temp
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
2013,1,27.0,34.9,30.0,38.6,37.0,38.8
2013,2,32.0,38.0,32.0,36.9,36.0,36.4
2014,1,46.0,38.2,29.0,36.8,43.0,37.1
2014,2,29.0,36.3,14.0,37.2,38.0,37.1


Keep in mind, though, that columns are primary in a ``DataFrame``, so the syntax used for multiply indexed ``Series`` applies to the columns.

For example, we can recover Guido's heart rate data with a simple operation:

In [None]:
health_data['Guido', 'HR']

year  visit
2013  1        30.0
      2        32.0
2014  1        29.0
      2        14.0
Name: (Guido, HR), dtype: float64

Also, as with the single-index case, we can use the ``loc`` and ``iloc`` indexers.

In [None]:
health_data.iloc[:3,  # first three rows (positions 0, 1, 2)
                 :3   # first three columns (positions 0, 1, 2)
                 ]

Unnamed: 0_level_0,subject,Bob,Bob,Guido
Unnamed: 0_level_1,type,HR,Temp,HR
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
2013,1,27.0,34.9,30.0
2013,2,32.0,38.0,32.0
2014,1,46.0,38.2,29.0


These indexers provide an array-like view of the underlying two-dimensional data, but each individual index in ``loc`` or ``iloc`` can be passed a tuple of multiple indices. 

In [None]:
health_data.index

MultiIndex([(2013, 1),
            (2013, 2),
            (2014, 1),
            (2014, 2)],
           names=['year', 'visit'])

In [None]:
health_data.columns

MultiIndex([(  'Bob',   'HR'),
            (  'Bob', 'Temp'),
            ('Guido',   'HR'),
            ('Guido', 'Temp'),
            (  'Sue',   'HR'),
            (  'Sue', 'Temp')],
           names=['subject', 'type'])

For example:

In [None]:
health_data.loc[:,            #all the rows 
                ('Bob', 'HR') #only this column
                ]

year  visit
2013  1        27.0
      2        32.0
2014  1        46.0
      2        29.0
Name: (Bob, HR), dtype: float64

Working with slices within these index tuples is not especially convenient: trying to create a slice within a tuple will lead to a syntax error:

```
health_data.loc[(:, 1), (:, 'HR')]
```

- we could get around this by building the desired slice explicitly using Python's built-in ``slice()`` function
- a better way in this context is to use an ``IndexSlice`` object, which Pandas provides for precisely this situation

For example:

In [None]:
idx = pd.IndexSlice
idx

<pandas.core.indexing._IndexSlice at 0x7f3b0b4cc460>

In [None]:
health_data

Unnamed: 0_level_0,subject,Bob,Bob,Guido,Guido,Sue,Sue
Unnamed: 0_level_1,type,HR,Temp,HR,Temp,HR,Temp
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
2013,1,27.0,34.9,30.0,38.6,37.0,38.8
2013,2,32.0,38.0,32.0,36.9,36.0,36.4
2014,1,46.0,38.2,29.0,36.8,43.0,37.1
2014,2,29.0,36.3,14.0,37.2,38.0,37.1


In [None]:
health_data.loc[idx[:, 2],    # rows: visit 2, for any year
                idx[:, 'HR']  # columns: type 'HR', for any subject
                ]

Unnamed: 0_level_0,subject,Bob,Guido,Sue
Unnamed: 0_level_1,type,HR,HR,HR
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
2013,2,32.0,32.0,36.0
2014,2,29.0,14.0,38.0


alternative options with `slice()`: 

In [None]:
health_data.loc[(slice(None), 2),    # rows: visit 2, for any year
                (slice(None), 'HR')  # columns: type 'HR', for any subject
                ]

Unnamed: 0_level_0,subject,Bob,Guido,Sue
Unnamed: 0_level_1,type,HR,HR,HR
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
2013,2,32.0,32.0,36.0
2014,2,29.0,14.0,38.0
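The two spellings give the same result because ``pd.IndexSlice`` simply hands back whatever is written inside the brackets, with each ``:`` turned into a ``slice`` object. A small sketch (the names ``rows`` and ``cols`` are illustrative):

```python
import pandas as pd

idx = pd.IndexSlice

# IndexSlice returns the key it was indexed with, as a tuple of slices and labels
rows = idx[:, 2]
cols = idx[:, 'HR']

print(rows)
print(cols)
# The tuples match the ones built by hand with slice(None)
print(rows == (slice(None), 2))
```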


There are so many other ways to interact with data in multiply indexed ``Series`` and ``DataFrame``s. 
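One such alternative, not covered above, is the ``xs`` (cross-section) method, which selects a single label from a named level on either axis. The sketch below uses a smaller frame with placeholder values rather than the ``health_data`` from this section.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
# Placeholder values 0..15, not real measurements
df = pd.DataFrame(np.arange(16.0).reshape(4, 4), index=index, columns=columns)

# Rows where visit == 2; the 'visit' level is dropped, leaving 'year'
second_visits = df.xs(2, level='visit')

# Columns where type == 'HR'; the 'type' level is dropped, leaving 'subject'
heart_rates = df.xs('HR', axis=1, level='type')

print(second_visits)
print(heart_rates)
```

Unlike the ``IndexSlice`` approach, ``xs`` drops the selected level from the result by default, which can be handy when only one value of that level is of interest.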

## Rearranging Multi-Indices

There are many more ways to finely control the rearrangement of data between `hierarchical indices` and `columns`, and we'll explore them here.

### Sorted and unsorted indices

Many of the ``MultiIndex`` slicing operations will fail if the index is not sorted.

We'll start by creating some simple multiply indexed data where the indices are *not lexicographically sorted*:

In [None]:
indices = [['a', 'c', 'b'], [1, 2]]
indices

[['a', 'c', 'b'], [1, 2]]

In [None]:
index = pd.MultiIndex.from_product(indices)
index

MultiIndex([('a', 1),
            ('a', 2),
            ('c', 1),
            ('c', 2),
            ('b', 1),
            ('b', 2)],
           )

In [None]:
rands = np.random.rand(6)
rands

array([0.84314464, 0.178241  , 0.83067047, 0.71600096, 0.79369478,
       0.44871774])

In [None]:
data = pd.Series(rands, index=index)
data

a  1    0.843145
   2    0.178241
c  1    0.830670
   2    0.716001
b  1    0.793695
   2    0.448718
dtype: float64

In [None]:
data.index.names = ['char', 'int']
data

char  int
a     1      0.843145
      2      0.178241
c     1      0.830670
      2      0.716001
b     1      0.793695
      2      0.448718
dtype: float64

In [None]:
type(data)

pandas.core.series.Series

If we try to take a partial slice of this index, it will result in an error, because Pandas cannot locate the slice endpoints in an index whose first level is out of order:

In [None]:
try:
    data['b':'a']
except KeyError as e:
    print(type(e))
    print(e)

<class 'pandas.errors.UnsortedIndexError'>
'Key length (1) was greater than MultiIndex lexsort depth (0)'


Although it is not entirely clear from the error message, this is the result of the MultiIndex not being sorted.

Partial slices and other similar operations require the levels in the ``MultiIndex`` to be in sorted (i.e., lexicographical) order.

Pandas provides the ``sort_index()`` method of ``Series`` and ``DataFrame`` to perform this type of sorting; the older ``sortlevel()`` method is no longer available in current releases.

In [None]:
data = data.sort_index()
data

char  int
a     1      0.843145
      2      0.178241
b     1      0.793695
      2      0.448718
c     1      0.830670
      2      0.716001
dtype: float64

With the index sorted in this way, partial slicing will work as expected:

In [None]:
data['a':'b']

char  int
a     1      0.843145
      2      0.178241
b     1      0.793695
      2      0.448718
dtype: float64
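Before slicing, it can be useful to check whether an index is already sorted. One way to do this, sketched below with placeholder values, is the ``is_monotonic_increasing`` property of the index; this is an addition to the text above rather than something the original example relies on.

```python
import pandas as pd

# First level deliberately out of order: a, c, b
unsorted = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
s = pd.Series(range(6), index=unsorted)

before = s.index.is_monotonic_increasing   # False for the unsorted index

s_sorted = s.sort_index()
after = s_sorted.index.is_monotonic_increasing

# Partial slicing is safe once the index is sorted
subset = s_sorted.loc['a':'b']

print(before, after)
print(subset)
```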

### Stacking and unstacking indices

It is possible to convert a dataset from a stacked multi-index to a simple two-dimensional representation, optionally specifying the level to use:

In [None]:
ppln

state  year
CF     2000    48
       2010    46
NY     2000    47
       2010    20
TX     2000    20
       2010    61
dtype: int64

In [None]:
ppln.unstack(level=0)

state,CF,NY,TX
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2000,48,47,20
2010,46,20,61


In [None]:
ppln.unstack(level=1)

year,2000,2010
state,Unnamed: 1_level_1,Unnamed: 2_level_1
CF,48,46
NY,47,20
TX,20,61


The opposite of ``unstack()`` is ``stack()``, which here can be used to recover the original series:

In [None]:
ppln

state  year
CF     2000    48
       2010    46
NY     2000    47
       2010    20
TX     2000    20
       2010    61
dtype: int64

In [None]:
ppln.unstack().stack()  # stack() reverses unstack(), recovering the original Series

state  year
CF     2000    48
       2010    46
NY     2000    47
       2010    20
TX     2000    20
       2010    61
dtype: int64

### Index setting and resetting

Another way to rearrange hierarchical data is to turn the index labels into columns; this can be accomplished with the ``reset_index`` method.

Calling this on the population ``Series`` will result in a ``DataFrame`` with *state* and *year* columns holding the information that was formerly in the index.

For clarity, we can optionally specify the name of the data for the column representation:

In [None]:
ppln

state  year
CF     2000    48
       2010    46
NY     2000    47
       2010    20
TX     2000    20
       2010    61
dtype: int64

In [None]:
pop_flat = ppln.reset_index(name='population')
pop_flat

Unnamed: 0,state,year,population
0,CF,2000,48
1,CF,2010,46
2,NY,2000,47
3,NY,2010,20
4,TX,2000,20
5,TX,2010,61


In [None]:
pop_flat.columns = ['state', 'year', 'population']
pop_flat

Unnamed: 0,state,year,population
0,CF,2000,48
1,CF,2010,46
2,NY,2000,47
3,NY,2010,20
4,TX,2000,20
5,TX,2010,61


Often when working with data in the real world, the raw input data looks like this and it's useful to build a ``MultiIndex`` from the column values.
- This can be done with the ``set_index`` method of the ``DataFrame``, which returns a multiply indexed ``DataFrame``

In [None]:
pop_flat.set_index(['state', 'year'])

Unnamed: 0_level_0,Unnamed: 1_level_0,population
state,year,Unnamed: 2_level_1
CF,2000,48
CF,2010,46
NY,2000,47
NY,2010,20
TX,2000,20
TX,2010,61


This type of reindexing is one of the more useful patterns when encountering real-world datasets.
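The two methods are inverses of each other, which the following sketch illustrates on a small flat table. The table reuses a few of the population figures from this section; ``flat``, ``indexed``, and ``round_trip`` are illustrative names.

```python
import pandas as pd

# A flat table, as raw input data often arrives
flat = pd.DataFrame({
    'state': ['CF', 'CF', 'NY', 'NY'],
    'year': [2000, 2010, 2000, 2010],
    'population': [48, 46, 47, 20],
})

# Move the key columns into a two-level row index
indexed = flat.set_index(['state', 'year'])

# Move the index levels back into ordinary columns
round_trip = indexed.reset_index()

print(indexed)
print(round_trip)
```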

## Data Aggregations on Multi-Indices

- Pandas has built-in data aggregation methods, such as ``mean()``, ``sum()``, and ``max()``

- For hierarchically indexed data, these can be combined with ``groupby(level=...)``, which controls which subset of the data the aggregate is computed on.

For example, let's return to our health data:

In [None]:
health_data

Unnamed: 0_level_0,subject,Bob,Bob,Guido,Guido,Sue,Sue
Unnamed: 0_level_1,type,HR,Temp,HR,Temp,HR,Temp
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
2013,1,27.0,34.9,30.0,38.6,37.0,38.8
2013,2,32.0,38.0,32.0,36.9,36.0,36.4
2014,1,46.0,38.2,29.0,36.8,43.0,37.1
2014,2,29.0,36.3,14.0,37.2,38.0,37.1


To average out the measurements in the two visits each year, we group by the name of the index level we'd like to explore, in this case the year:

In [None]:
data_mean = health_data.groupby(level='year').mean()
data_mean

subject,Bob,Bob,Guido,Guido,Sue,Sue
type,HR,Temp,HR,Temp,HR,Temp
year,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
2013,29.5,36.45,31.0,37.75,36.5,37.6
2014,37.5,37.25,21.5,37.0,40.5,37.1


By transposing so that the column levels become row levels, we can take the mean among levels on the columns as well:

In [None]:
data_mean.T.groupby(level='type').mean().T

type,HR,Temp
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2013,32.333333,37.266667
2014,33.166667,37.116667


- Thus, we find the average heart rate and temperature measured among all subjects in all visits each year.
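The same pattern applies to a multiply indexed ``Series``. As a small worked sketch, here it is on the population figures used earlier in this section (the result names are illustrative):

```python
import pandas as pd

idx = pd.MultiIndex.from_product([['CF', 'NY', 'TX'], [2000, 2010]],
                                 names=['state', 'year'])
ppln = pd.Series([48, 46, 47, 20, 20, 61], index=idx)

# Average over the years within each state
mean_by_state = ppln.groupby(level='state').mean()

# Sum over the states within each year
total_by_year = ppln.groupby(level='year').sum()

print(mean_by_state)
print(total_by_year)
```

Grouping by one level collapses the other, so each result is indexed only by the level that was named.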

## Aside: Panel Data

Older versions of Pandas included a few other fundamental data structures that we have not discussed, namely the ``pd.Panel`` and ``pd.Panel4D`` objects. Both have since been deprecated and removed, so they matter mainly when reading older code.

These could be thought of, respectively, as three-dimensional and four-dimensional generalizations of the (one-dimensional) ``Series`` and (two-dimensional) ``DataFrame`` structures.

Once you were familiar with indexing and manipulation of data in a ``Series`` and ``DataFrame``, ``Panel`` and ``Panel4D`` were relatively straightforward to use; in particular, the ``loc`` and ``iloc`` indexers extended readily to these higher-dimensional structures.

The key difference is that panel data is fundamentally a dense data representation, while multi-indexing is fundamentally a sparse data representation.

As the number of dimensions increases, the dense representation can become very inefficient for the majority of real-world datasets.
For the occasional specialized application, however, these structures could be useful.
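The dense-versus-sparse distinction can be seen on a small scale with ``unstack()``. In the sketch below (values and names are made up for illustration), the multiply indexed ``Series`` stores only the combinations that exist, while the rectangular form has to reserve a cell for every combination.

```python
import pandas as pd

# Only three of the four possible (key, year) combinations are present
sparse = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.MultiIndex.from_tuples([('a', 2000), ('a', 2010), ('b', 2010)]),
)

# The dense 2 x 2 layout must fill the missing ('b', 2000) cell with NaN
dense = sparse.unstack()

print(len(sparse))
print(dense)
```

With more dimensions and more missing combinations, the number of empty cells in the dense layout grows quickly, while the multiply indexed form grows only with the number of observations.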

