# [AHA! Activity Health Analytics](http://casas.wsu.edu/)
[Center for Advanced Studies of Adaptive Systems (CASAS)](http://casas.wsu.edu/)

[Washington State University](https://wsu.edu)
# L2 Hierarchical Indexing

## Learner Objectives
At the conclusion of this lesson, participants should have an understanding of:
* Hierarchical indexing

## Acknowledgments
Content used in this lesson is based upon information in the following sources:
* [Pandas website](http://pandas.pydata.org/)
* Python for Data Analysis by Wes McKinney

## Hierarchical Index (Pandas MultiIndex)
So far, we have been working with series and data frames that have a one-dimensional index. Examples include the string index \["Seattle", "Spokane", "Bellevue", "Leavenworth"\] that uniquely identified populations in a series, and an integer index in the range \[0, n) that uniquely identifies each row in a data frame. The latter is the index we used in our lesson on data cleaning, where we worked with the [pd_hoa_activities.csv](https://raw.githubusercontent.com/gsprint23/aha/master/lessons/files/pd_hoa_activities.csv) dataset. Let's take a look at the first few rows in this dataset *after* we performed our cleaning process:

|pid|task|duration|age|class|
|-|-|-|-|-|
|0|Water Plants|146|72|HOA|
|0|Fill Medication Dispenser|210|72|HOA|
|0|Wash Countertop|241|72|HOA|
|0|Sweep and Dust|328|72|HOA|
|0|Cook|229|72|HOA|
|0|Wash Hands|38|72|HOA|
|0|Perform TUG|10|72|HOA|
|0|Perform TUG w/Questions|10|72|HOA|
|0|Day Out Task|680|72|HOA|
|1|Water Plants|63|54|HOA|
|...|...|...|...|...|

A simple row-labeling index does not adequately represent this data because the data has a natural *hierarchical* index. That is, pid uniquely identifies age and class, while the tuple (pid, task) uniquely identifies the participant's duration for a specific task.
<img src="https://raw.githubusercontent.com/gsprint23/aha/master/lessons/figures/pd_hierarchical_index1.png" width="450">
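
The relationship described above can be checked directly. The following is a minimal sketch, assuming a small hand-typed data frame (the hypothetical name `toy`) that copies a few rows from the table above rather than the full dataset:

```python
import pandas as pd

# Toy rows copied from the cleaned dataset preview above (not the full file)
toy = pd.DataFrame({
    "pid": [0, 0, 0, 1],
    "task": ["Water Plants", "Cook", "Wash Hands", "Water Plants"],
    "duration": [146, 229, 38, 63],
    "age": [72, 72, 72, 54],
    "class": ["HOA", "HOA", "HOA", "HOA"],
})

# Each pid should map to exactly one age and exactly one class
per_pid_unique = toy.groupby("pid")[["age", "class"]].nunique()
pid_determines_age_class = bool((per_pid_unique == 1).all().all())

# Each (pid, task) pair should appear at most once
pair_is_unique = not toy.duplicated(subset=["pid", "task"]).any()

print(pid_determines_age_class, pair_is_unique)
```

If both checks hold on the real data, (pid, task) is a valid key for a hierarchical index.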

We could more appropriately represent the relationship among these five variables by storing the data in a different format. Consider the following two storage approaches:
1. One data structure: data frame with hierarchical indexing (outer: pid, inner: task) and columns (duration, age, class)
    * Note: contains redundant copies of age and class values for each task.
1. Two data structures:
    * Data frame with index (pid) and columns (age, class)
    * Series with hierarchical index (outer: pid, inner: task) and values (duration)
        * Note: if we have other features describing the participant's performance on a task available (e.g. an efficiency score, number of sensor events, etc.), this would be a data frame with one column for each feature.
<img src="https://raw.githubusercontent.com/gsprint23/aha/master/lessons/figures/pd_hierarchical_index2.png" width="500">
    
There are trade-offs to each approach. For example, option 1 stores redundant information, but it is easier to keep track of one object than two (e.g., if we decide to drop a pid from one object because of missing data, we need to remember to drop the same pid from the other object). For this lesson, we are going to use option 1. For practice, try implementing option 2; it is a good exercise in Pandas!
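
As a starting point for option 2, here is a minimal sketch on a few toy rows. The names `toy`, `participants`, `durations`, and `combined` are hypothetical and chosen for illustration; this is one possible layout, not the only way to split the data.

```python
import pandas as pd

# Toy rows copied from the cleaned dataset preview (not the full file)
toy = pd.DataFrame({
    "pid": [0, 0, 0, 1],
    "task": ["Water Plants", "Cook", "Wash Hands", "Water Plants"],
    "duration": [146, 229, 38, 63],
    "age": [72, 72, 72, 54],
    "class": ["HOA", "HOA", "HOA", "HOA"],
})

# Participant-level data frame: one row per pid, no redundant copies
participants = toy[["pid", "age", "class"]].drop_duplicates().set_index("pid")

# Task-level series: duration indexed by (pid, task)
durations = toy.set_index(["pid", "task"])["duration"]

# Option 1 can be rebuilt from the two objects with a merge on pid
combined = (durations.reset_index()
            .merge(participants.reset_index(), on="pid", how="left")
            .set_index(["pid", "task"]))

# The trade-off: dropping a participant must be done in both objects
participants_kept = participants.drop(index=1)
durations_kept = durations.drop(index=1, level="pid")

print(participants)
print(durations)
print(combined)
```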
    
### Creating a MultiIndex
Let's take a look at a small example of creating and using a hierarchical index. In Pandas, a hierarchical index is represented as a [`MultiIndex`](https://pandas.pydata.org/pandas-docs/stable/advanced.html#hierarchical-indexing-multiindex) object.

In [21]:
import numpy as np  # needed for np.random.randn below
import pandas as pd

# adapted from https://pandas.pydata.org/pandas-docs/stable/advanced.html#hierarchical-indexing-multiindex
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
# *arrays unpacks arrays into two arguments to zip
# zip creates tuples from parallel arrays
tuples = list(zip(*arrays))
print(tuples)
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
print(index)
s = pd.Series(np.random.randn(8), index=index, name="random data")
print(s, end="\n\n")
print("Indexing once into the outer index 'first':", s["bar"], end="\n\n")
print("Indexing twice into the outer index 'first' and inner index 'second':", s["bar"]["one"])

[('bar', 'one'), ('bar', 'two'), ('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')]
MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=['first', 'second'])
first  second
bar    one       0.074745
       two       0.665284
baz    one      -0.094223
       two       0.849528
foo    one       0.426799
       two       0.302697
qux    one      -0.335500
       two      -0.008077
Name: random data, dtype: float64

Indexing once into the outer index 'first': second
one    0.074745
two    0.665284
Name: random data, dtype: float64

Indexing twice into the outer index 'first' and inner index 'second': 0.0747453681967
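
Building tuples with `zip` is only one route to a `MultiIndex`. The sketch below shows two other constructors, `MultiIndex.from_arrays` and `MultiIndex.from_product`, along with selecting by a full key using `.loc` and by an inner level using `xs`. The values are a simple `np.arange` instead of random numbers so the results are predictable; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]

# Build the index directly from the parallel arrays (no zip needed)
index_from_arrays = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])

# Build the same index as the Cartesian product of the two label sets
index_from_product = pd.MultiIndex.from_product(
    [['bar', 'baz', 'foo', 'qux'], ['one', 'two']], names=['first', 'second'])

same_index = index_from_arrays.equals(index_from_product)

s = pd.Series(np.arange(8.0), index=index_from_arrays, name="example data")

# A tuple passed to .loc selects on both levels in a single lookup
one_value = s.loc[("bar", "one")]

# xs selects on an inner level across all outer labels
all_ones = s.xs("one", level="second")

print(same_index)
print(one_value)
print(all_ones)
```

Selecting with a single `.loc` tuple avoids the intermediate series that `s["bar"]["one"]` creates.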


### Creating a MultiIndex with `read_csv()`
We can set up a hierarchical index when we read data in from a csv file. We have seen examples of using the `index_col` keyword with `read_csv()` to specify the index column when we load data. If we set `index_col` to an ordered list of column positions, Pandas will build the `MultiIndex` for us! Let's try it out with our working "foo bar" example:

In [42]:
out_fname = r"files\hierarchical_foobar.csv"
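# header=False writes no header row, which matches header=None in read_csv below
# (recent Pandas versions write a header row for a series by default)
s.to_csv(out_fname, header=False)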
# column 0 is first, column 1 is second
s2 = pd.read_csv(out_fname, header=None, index_col=[0, 1])
# note: s2 is now a data frame

# restore the index level names and the column name
s2.index.set_names(["first", "second"], inplace=True)
s2.rename(columns={2: "random data"}, inplace=True)
print(s2.shape)
print(s2)

# convert to series
s2 = s2["random data"]
print(s2)

(8, 1)
              random data
first second             
bar   one        0.074745
      two        0.665284
baz   one       -0.094223
      two        0.849528
foo   one        0.426799
      two        0.302697
qux   one       -0.335500
      two       -0.008077
first  second
bar    one       0.074745
       two       0.665284
baz    one      -0.094223
       two       0.849528
foo    one       0.426799
       two       0.302697
qux    one      -0.335500
       two      -0.008077
Name: random data, dtype: float64
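
An alternative is to keep the header row in the file so that the level names and the column name survive the round trip without manual relabeling. The sketch below assumes a smaller four-row series and writes to an in-memory `io.StringIO` buffer instead of a file on disk, purely so it runs anywhere:

```python
import io
import pandas as pd

index = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"])
s = pd.Series([0.1, 0.2, 0.3, 0.4], index=index, name="random data")

# Write the series with a header row: first,second,random data
buffer = io.StringIO()
s.to_csv(buffer, header=True)
buffer.seek(0)

# header=0 reads the names from the first row; index_col builds the MultiIndex
df2 = pd.read_csv(buffer, header=0, index_col=[0, 1])

# Select the single column to get back a series
s2 = df2["random data"]
print(s2)
```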


## Hierarchical Indexing Example
We are going to continue working with the [pd_hoa_activities.csv](https://raw.githubusercontent.com/gsprint23/aha/master/lessons/files/pd_hoa_activities.csv) dataset. This dataset contains information from a smart home study where participants performed 9 activities in a smart home environment. In a previous lesson on data cleaning, we read in the data, cleaned it, and saved a new csv file with the data in cleaned form: [pd_hoa_activities_cleaned.csv](https://raw.githubusercontent.com/gsprint23/aha/master/lessons/files/pd_hoa_activities_cleaned.csv). We will start with this cleaned version of the dataset.

In [5]:
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import pandas as pd
import numpy as np

fname = r"files\pd_hoa_activities_cleaned.csv"
df = pd.read_csv(fname, header=0)
print(df.shape)
print(df.head(n=10))

(665, 5)
   pid                       task  duration  age class
0    0               Water Plants       146   72   HOA
1    0  Fill Medication Dispenser       210   72   HOA
2    0            Wash Countertop       241   72   HOA
3    0             Sweep and Dust       328   72   HOA
4    0                       Cook       229   72   HOA
5    0                 Wash Hands        38   72   HOA
6    0                Perform TUG        10   72   HOA
7    0    Perform TUG w/Questions        10   72   HOA
8    0               Day Out Task       680   72   HOA
9    1               Water Plants        63   54   HOA


Now, we are going to apply a hierarchical index with the outer index "pid" and the inner index "task". 

In [6]:
def apply_multi_index(df):
    '''
    Replace the default index of df with a MultiIndex (modifies df in place).
    outer index: participant id (pid)
    inner index: task
    '''
    arrays = [df["pid"], df["task"]]
    df.drop(["pid", "task"], axis=1, inplace=True)
    tuples = list(zip(*arrays))
    index = pd.MultiIndex.from_tuples(tuples, names=["pid", "task"])
    df.set_index(index, inplace=True)
    
apply_multi_index(df)
print(df.head(n=12), "\n")
print(df.describe())

                               duration  age class
pid task                                          
0   Water Plants                    146   72   HOA
    Fill Medication Dispenser       210   72   HOA
    Wash Countertop                 241   72   HOA
    Sweep and Dust                  328   72   HOA
    Cook                            229   72   HOA
    Wash Hands                       38   72   HOA
    Perform TUG                      10   72   HOA
    Perform TUG w/Questions          10   72   HOA
    Day Out Task                    680   72   HOA
1   Water Plants                     63   54   HOA
    Fill Medication Dispenser       202   54   HOA
    Wash Countertop                 259   54   HOA 

          duration         age
count   665.000000  665.000000
mean    356.541353   68.735338
std     722.675794    9.812659
min       0.000000   54.000000
25%      34.000000   61.000000
50%     190.000000   68.000000
75%     326.000000   76.000000
max    6151.000000   93.000000
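
The same hierarchical index can also be built by passing a list of column names to `DataFrame.set_index`, which moves those columns into the index in one step. The sketch below uses a few hand-typed rows (the hypothetical `toy` frame) rather than the full dataset, and then shows some common ways to select from the result:

```python
import pandas as pd

# Toy rows copied from the cleaned dataset preview (not the full file)
toy = pd.DataFrame({
    "pid": [0, 0, 0, 1],
    "task": ["Water Plants", "Cook", "Wash Hands", "Water Plants"],
    "duration": [146, 229, 38, 63],
    "age": [72, 72, 72, 54],
    "class": ["HOA", "HOA", "HOA", "HOA"],
})

# pid becomes the outer level and task the inner level
indexed = toy.set_index(["pid", "task"])

# All tasks for participant 0 (result is indexed by task)
participant_0 = indexed.loc[0]

# One participant's duration for one task
one_duration = indexed.loc[(0, "Cook"), "duration"]

# One task across all participants (result is indexed by pid)
water_plants = indexed.xs("Water Plants", level="task")

# Aggregate over the inner level
mean_by_task = indexed.groupby(level="task")["duration"].mean()

print(participant_0)
print(one_duration)
print(water_plants)
print(mean_by_task)
```

Unlike `apply_multi_index()`, `set_index` here returns a new data frame and leaves `toy` unchanged.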
