# Hierarchical indexing
---------------

For higher-dimensional data, a common pattern is to use **hierarchical indexing** (also called **multi-indexing**) to incorporate multiple index *levels* within a single index.
In this way, higher-dimensional data can be compactly represented within the 1D ``Series`` and 2D ``DataFrame``.

In [None]:
import pandas as pd
import numpy as np

## 1. Multiply Indexed Series
________________________________
* Representation of 2D data within a 1D ``Series``, where each point has both a character key and a numerical key.

#### 1.1. Using Python tuples as keys (the bad way)
-----------------

In [None]:
index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]
pop = pd.Series(populations, index=index)
pop

* With this indexing scheme, straightforward indexing and slicing of the series based on the tuple keys is available:

In [None]:
pop[('California', 2010):('Texas', 2000)]

* Selecting all values from 2010, however, requires a messy (and potentially slow) scan over every key:

In [None]:
pop[[i for i in pop.index if i[1] == 2010]]

#### 1.2. MultiIndex (the better way)
-----------------------------
* Tuple-based indexing is essentially a rudimentary multi-index
* The ``MultiIndex`` type provides the operations that the tuple-based approach lacks
* A ``MultiIndex`` contains multiple *levels* of indexing (here the state names and the years) as well as integer *codes* (called *labels* in older pandas versions) that encode the level values of each data point


In [None]:
index = pd.MultiIndex.from_tuples(index)
index

In [None]:
pop = pop.reindex(index)
pop

* Hierarchical representation of the data after re-indexing the series with the ``MultiIndex``:
    * first two columns of the ``Series`` represent the multiple index values, while the third column shows the data
    * some entries are missing in the first column: any blank entry indicates the same value as the line above it

* Accessing all data for which the second index is 2010 with slicing notation:

In [None]:
p2010 = pop[:, 2010]
p2010

In [None]:
type(p2010)

This syntax is much more convenient (and the operation is much more efficient) than the tuple-based solution from section 1.1.
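
As a cross-check on the slicing notation, the sketch below rebuilds the same population series so it runs on its own and compares ``pop[:, 2010]`` with ``Series.xs``, which selects a cross-section on a given level. The names ``by_slice`` and ``by_xs`` are only illustrative.

```python
import pandas as pd

# Same population data as above, rebuilt so the snippet is self-contained
index = pd.MultiIndex.from_tuples(
    [('California', 2000), ('California', 2010),
     ('New York', 2000), ('New York', 2010),
     ('Texas', 2000), ('Texas', 2010)])
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)

# All states, year 2010: the year level is dropped from the result
by_slice = pop[:, 2010]

# The same cross-section, naming the level position explicitly
by_xs = pop.xs(2010, level=1)

print(by_slice)
print(by_xs)
```

Both results are indexed by state only, since the selected year level is no longer needed.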

#### 1.3. MultiIndex as extra dimension
-------------------------------------

* The ``unstack()`` method quickly converts a multiply indexed ``Series`` into a conventionally indexed ``DataFrame``:

In [None]:
pop_df = pop.unstack()
pop_df

In [None]:
pop_df.columns

In [None]:
pop_df.loc['California', 2010]

In [None]:
pop_df.iloc[1, 1]

In [None]:
pop_df['California', 2010]  # KeyError: [] looks up columns, and no such column exists

In [None]:
pop_df.values

In [None]:
pop.values

* ``stack()`` method provides the opposite operation:

In [None]:
pop_df.stack()

#### 1.4. Why use hierarchical indexing
---------------
* just as we were able to use multi-indexing to represent 2D data within a 1D ``Series``, we can also use it to represent data of three or more dimensions in a ``Series`` or ``DataFrame``
* each extra level in a multi-index represents an extra dimension of data
* it also simplifies data processing: for example, we can add another column to a ``DataFrame`` with a ``MultiIndex`` (here a column of demographic data, the population under 18, for each state in each year):

In [None]:
pop_df = pd.DataFrame({'total': pop,
                       'under18': [9267089, 9284094,
                                   4687374, 4318033,
                                   5906301, 6879014]})
pop_df

* All the ufuncs and other functionality work with hierarchical indices as well; for example, computing the fraction of people under 18 by year:

In [None]:
f_u18 = pop_df['under18'] / pop_df['total']
f_u18.unstack()
#f_u18.unstack()[2000]
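
To illustrate that element-wise operations keep the hierarchical index intact, here is a minimal self-contained sketch. It rebuilds the population series with named levels; ``log_pop`` and ``ratio`` are illustrative names, not part of the original notebook.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)

# A NumPy ufunc applied to the Series returns a Series with the same MultiIndex
log_pop = np.log10(pop)

# Arithmetic between unstacked columns: growth factor from 2000 to 2010 per state
by_year = pop.unstack(level='year')
ratio = by_year[2010] / by_year[2000]

print(log_pop)
print(ratio)
```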

## 2. Methods of ``MultiIndex`` creation
___________________________

#### 2.1. In ``Series`` or ``DataFrame`` constructors 
__________________________________________________
* The most straightforward (*implicit*) way:
    * to pass a list of two or more index arrays
    * to pass a dictionary with appropriate tuples as keys
* The work of creating the ``MultiIndex`` is done in the background.

In [None]:
df = pd.DataFrame(np.random.rand(4, 2),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=['data1', 'data2'])
df

In [None]:
data = {('California', 2000): 33871648,
        ('California', 2010): 37253956,
        ('Texas', 2000): 20851820,
        ('Texas', 2010): 25145561,
        ('New York', 2000): 18976457,
        ('New York', 2010): 19378102}
pd.Series(data)

#### 2.2. Explicit ``MultiIndex`` constructors
___________________________________________
Any of the ``MultiIndex`` objects below can be passed as the ``index`` argument when creating a ``Series`` or ``DataFrame``, or be passed to the ``reindex`` method of an existing ``Series`` or ``DataFrame``.

* from a list of arrays -- giving the index values within each level:

In [None]:
pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])

* from a list of tuples -- giving the multiple index values of each point:

In [None]:
pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])

* from a Cartesian product of single indices:

In [None]:
pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

* directly from its internal encoding, by passing
    * ``levels`` -- a list of lists containing the available index values for each level
    * ``codes`` (the older name ``labels`` is deprecated) -- a list of lists of integer positions that reference these level values

In [None]:
pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
              codes=[[0, 0, 1, 1], [0, 1, 0, 1]])
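
The four constructors above describe the same index in different ways. The following sketch checks this with ``MultiIndex.equals``; the variable names are chosen here only for illustration.

```python
import pandas as pd

from_arrays = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
from_tuples = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
from_product = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])
from_codes = pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
                           codes=[[0, 0, 1, 1], [0, 1, 0, 1]])

# Each construction route should yield the same sequence of (letter, number) keys
all_equal = all(from_arrays.equals(other)
                for other in (from_tuples, from_product, from_codes))
print(all_equal)
```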

#### 2.3. ``MultiIndex`` level names
_________________________
Naming the levels of the ``MultiIndex`` can be accomplished:
* by passing the ``names`` argument to any of the above ``MultiIndex`` constructors or 
* by setting the ``names`` attribute of the index after the fact:

In [None]:
pop.index.names = ['state', 'year']
pop

In [None]:
pop.loc['California', :]
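
The other option mentioned above, passing ``names`` to the constructor, can be sketched as follows. The series is rebuilt here (as ``pop_named``, an illustrative name) so the snippet runs on its own.

```python
import pandas as pd

# Level names are supplied at construction time rather than assigned afterwards
index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])
pop_named = pd.Series([33871648, 37253956,
                       18976457, 19378102,
                       20851820, 25145561], index=index)

# Named levels can then be referred to by name instead of by position
by_year = pop_named.unstack(level='year')
print(by_year)
```

With more than two levels, referring to levels by name tends to be easier to read than positional level numbers.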

## 3. Indexing and Slicing a MultiIndex
______________________

Indexing and slicing on a ``MultiIndex`` is designed to be intuitive, and it helps if you think about the indices as added dimensions.

#### 3.1. Access single elements in multiply indexed ``Series`` by multiple terms 
________________________________

In [None]:
pop

In [None]:
pop['California', 2000]

#### 3.2. Partial indexing, or indexing just one of the levels in the index.
-------------------
* the result is another ``Series``, with the lower-level indices maintained:

In [None]:
pop['California']

In [None]:
pop['California'][2010]

* Partial slicing is available as well, as long as the ``MultiIndex`` is sorted :

In [None]:
pop.loc['California':'New York']

* With **sorted** indices, partial indexing can be performed on lower levels by passing an empty slice in the first index:

In [None]:
pop[:, 2000]

* Other types of indexing and selection work as well; for example, selection based on Boolean masks:

In [None]:
pop[pop > 22000000]

* Selection based on fancy indexing also works:

In [None]:
pop.loc[['California', 'Texas']]

In [None]:
pop.loc[['California', 'Texas']].iloc[1]  # second row by position
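
Boolean masks can also be built from the index itself. The sketch below uses ``get_level_values`` to select every entry for one year while keeping both index levels; the series is rebuilt with named levels, and ``mask`` and ``selected`` are illustrative names.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)

# Boolean array: True wherever the 'year' level equals 2010
mask = pop.index.get_level_values('year') == 2010

# Unlike pop[:, 2010], the result keeps the full (state, year) index
selected = pop[mask]
print(selected)
```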

## 4. Multiply indexed ``DataFrames``
____________________________

A multiply indexed ``DataFrame`` behaves in much the same way as a multiply indexed ``Series``.

#### 4.1. MultiIndex for columns
---------------------

* In a ``DataFrame``, the rows and columns are completely symmetric, and just as the rows can have multiple levels of indices, the columns can have multiple levels as well
* Because columns are primary in a ``DataFrame``, the syntax used for multiply indexed ``Series`` applies to the columns

In [None]:
# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

In [None]:
# mock some data
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10
data += 37
data

In [None]:
# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data

#### 4.2. Multi-indexing
----------------
* Multi-indexing for both rows and columns can come in *very* handy
* The ``health_data`` frame is fundamentally four-dimensional data, where the dimensions are:
   * the subject
   * the measurement type
   * the year
   * visit number

In [None]:
health_data[('Guido','HR')]

In [None]:
gt=health_data['Guido','Temp']
gt

In [None]:
health_data['Guido','Temp'][2013]

In [None]:
gt.unstack()

In [None]:
gt.unstack()[1]

In [None]:
gt.unstack()[1][2014]

In [None]:
gt[2013,1]

In [None]:
health_data.loc[:, [('Guido', 'HR'), ('Sue', 'HR')]]

In [None]:
health_data.loc[[(2013,2),(2014,2)], [('Guido', 'HR'), ('Sue', 'HR')]]

* As in the single-index case, the ``loc`` and ``iloc`` indexers can be used; they provide an array-like view of the underlying 2D data

In [None]:
health_data.iloc[:2, :4]

* Each individual index in ``loc`` or ``iloc`` can be passed as a **tuple** of multiple indices:

In [None]:
health_data.loc[:, ('Bob', 'HR')]

* Slices cannot be written directly inside these index tuples, because ``:`` is only valid syntax directly within square brackets:

In [None]:
# health_data.loc[(:, 1), (:, 'HR')] # SyntaxError: invalid syntax

* Using an ``IndexSlice`` object for working with slices within tuples:

In [None]:
idx = pd.IndexSlice
health_data.loc[idx[:, 1], idx[:, 'HR']]
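
For comparison, the same selection can be written with explicit ``slice(None)`` objects, which is what ``:`` stands for inside brackets. This sketch uses placeholder values (0 to 23) in a frame named ``frame`` instead of the random measurements above, so that the result is reproducible.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
# Placeholder data, not real measurements
frame = pd.DataFrame(np.arange(24).reshape(4, 6), index=index, columns=columns)

# First visit of every year, heart-rate column of every subject
idx = pd.IndexSlice
with_index_slice = frame.loc[idx[:, 1], idx[:, 'HR']]

# The same selection spelled out with slice(None) inside plain tuples
with_slice_none = frame.loc[(slice(None), 1), (slice(None), 'HR')]

print(with_index_slice)
```

``IndexSlice`` is usually the more readable of the two, but the explicit form can be handy when the selection is assembled programmatically.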

## 5. Rearranging multi-indices
___________________________
* There are a number of operations that will preserve all the information in the dataset, but rearrange it for the purposes of various computations
* The ``stack()`` and ``unstack()`` methods shown earlier are a brief example of such operations, but there are many more ways to finely control the rearrangement of data between hierarchical indices and columns

In [None]:
# creating multiply indexed data where indices are not lexicographically sorted
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
data = pd.Series(np.random.rand(6), index=index)
data.index.names = ['char', 'int']
data

#### 5.1. Sorted and unsorted indices
-------------

* Many of the ``MultiIndex`` slicing operations will fail if the index is not sorted
* For example, a partial slice of an unsorted index raises an error:

In [None]:
try:
    data['a':'b']
except KeyError as e:
    print(type(e))
    print(e)

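* Partial slices and other similar operations require the levels in the ``MultiIndex`` to be in sorted (i.e., lexicographical) order
* Pandas provides convenience routines to perform this sorting:
    * ``sort_index()``, available on both ``Series`` and ``DataFrame``
    * ``sort_index(level=...)`` to sort on specific levels (the older ``sortlevel()`` method has been removed from recent pandas versions)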

In [None]:
data = data.sort_index()
data

* With the index sorted, partial slicing works as expected:

In [None]:
data['a':'b']
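
Rather than catching the error, one can check whether the index is sorted before slicing. The sketch below uses ``is_monotonic_increasing`` for that check; it uses fixed values (0 to 5) instead of random numbers so that the result is reproducible.

```python
import numpy as np
import pandas as pd

# Index built in the order a, c, b, so it is not lexicographically sorted
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]],
                                   names=['char', 'int'])
data = pd.Series(np.arange(6), index=index)

# False here, because 'c' comes before 'b' in the first level
was_sorted = data.index.is_monotonic_increasing

# Sort only when needed, then take the partial slice
if not was_sorted:
    data = data.sort_index()
subset = data.loc['a':'b']

print(was_sorted)
print(subset)
```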

#### 5.2. Stacking and unstacking indices
------------------

* ``unstack()`` converts a dataset from a stacked multi-index to a simple 2D representation, optionally specifying the level to use:

In [None]:
pop.unstack(level='state')

In [None]:
pop.unstack(level=0)

In [None]:
pop.unstack(level=1)

* The opposite of ``unstack()`` is ``stack()``, which here can be used to recover the original series:

In [None]:
pop.unstack().stack()
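
The round trip can be verified directly. This minimal sketch rebuilds the population series with named levels and compares the result of ``unstack().stack()`` with the original; ``table`` and ``round_trip`` are illustrative names.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)

# Series -> 2D table (states as rows, years as columns) -> Series again
table = pop.unstack()
round_trip = table.stack()

print(table.shape)
print(round_trip)
```

The recovery is exact here because every state has a value for every year; with missing combinations, ``unstack()`` would introduce ``NaN`` entries.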

#### 5.3. Index setting and resetting
--------------------------

* ``reset_index()`` turns the index labels into columns; here it results in a ``DataFrame`` with *state* and *year* columns holding the information that was formerly in the index:

In [None]:
pop_flat = pop.reset_index(name='population')
pop_flat

In [None]:
pop_flat['state']

In [None]:
pop_flat.loc[1]

In [None]:
pop_flat.iloc[1]

* ``set_index()`` does the reverse, returning a multiply indexed ``DataFrame`` built from the column values:

In [None]:
psy = pop_flat.set_index(['state', 'year'])
psy

In [None]:
psy.loc['California', 2010]
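
Taken together, ``reset_index()`` and ``set_index()`` are inverses of each other. The sketch below rebuilds the series, flattens it, and restores the multi-index; ``restored`` is an illustrative name.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)

# Index levels become ordinary columns; the values go into 'population'
pop_flat = pop.reset_index(name='population')

# Move the two columns back into the index and pick out the value column
restored = pop_flat.set_index(['state', 'year'])['population']

print(list(pop_flat.columns))
print(restored)
```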

## 6. Data aggregations on multi-indices
_____________________________
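Aggregation methods such as ``mean()``, ``sum()``, and ``max()`` can be applied to a subset of the data by grouping on an index level with ``groupby(level=...)``. Older pandas versions accepted a ``level`` argument on the aggregation methods themselves; that argument has been removed from recent versions.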

#### 6.1.  ``level`` parameter 
-------------------
* Average-out the measurements in the two visits each year:

In [None]:
health_data

In [None]:
# mean(level='year') was removed in recent pandas versions; group by the level instead
data_mean = health_data.groupby(level='year').mean()
data_mean

#### 6.2.  Aggregating along the columns
--------------

* The same grouping can be applied to the levels of the columns by transposing first (older pandas versions used ``mean(axis=1, level='type')`` for this):

In [None]:
data_mean.T.groupby(level='type').mean().T

##  Panel Data
_________________

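Older versions of Pandas offered other fundamental data structures, namely the ``pd.Panel`` and ``pd.Panel4D`` objects, which can be thought of, respectively, as 3D and 4D generalizations of the 1D ``Series`` and 2D ``DataFrame``.
Both have been removed from recent pandas versions; multiply indexed ``Series`` and ``DataFrame`` objects cover most of the same use cases.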
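
As a minimal sketch of holding 3D data without a dedicated 3D container, the example below flattens a NumPy array into a ``Series`` with a three-level ``MultiIndex``. The array contents, the state abbreviations, and the ``quarter`` dimension are hypothetical and only serve to illustrate the layout.

```python
import numpy as np
import pandas as pd

# Hypothetical 3D array: 2 years x 3 states x 4 quarters
cube = np.arange(24).reshape(2, 3, 4)

# One index level per array dimension, in the same order as the array axes
index = pd.MultiIndex.from_product(
    [[2000, 2010], ['CA', 'NY', 'TX'], [1, 2, 3, 4]],
    names=['year', 'state', 'quarter'])
flat = pd.Series(cube.ravel(), index=index)

# Label-based lookup; corresponds to cube[1, 1, 2]
value = flat.loc[2010, 'NY', 3]

# Any level can be moved to the columns for a 2D view
table = flat.unstack(level='quarter')

print(value)
print(table)
```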
