# [AHA! Activity Health Analytics](http://casas.wsu.edu/)
[Center for Advanced Studies of Adaptive Systems (CASAS)](http://casas.wsu.edu/)

[Washington State University](https://wsu.edu)
# L5 Aggregation

## Learner Objectives
At the conclusion of this lesson, participants should have an understanding of:
* Performing data aggregation
* Computing summary statistics

## Acknowledgments
Content used in this lesson is based upon information in the following sources:
* [Pandas website](http://pandas.pydata.org/)
* Python for Data Analysis by Wes McKinney

## Data Aggregation Overview
Gathering and summarizing information, perhaps in preparation for statistical analysis or visualization, is called *data aggregation*. For example, suppose you want to investigate the similarities and differences among patients in a clinical setting, and the attributes you are interested in include medical condition, age, and gender. You might *group* the data into two groups: male and female. By grouping the data based on a variable, such as gender, you are aggregating the data. The grouping then allows you to create a bar chart of the frequency of each medical condition in each group, or to perform hypothesis testing to see whether there is a significant age difference between the two groups.

### Split-Apply-Combine
Data aggregation typically follows a "split, apply, combine" process:
* Split the data into groups based on some criteria
    * Perform *group by* operations
    * Select or slice data to form a subset
    * Example: Group a data frame by rows (axis 0) or by columns (axis 1)
* Apply a function to each group independently, producing a new value
    * Compute summary statistics (aggregation)
        * Example: Count the size of each group
        * Example: Compute mean, standard deviation, custom stats, etc.
    * Transform the data in the group (transformation)
        * Example: Standardizing data (z-score) within each group
        * Example: Filling missing data with a value derived from each group
    * Discard some groups (filtration)
        * Example: Discarding data that belongs to groups with only a few members
        * Example: Filtering out data based on the group sum or mean
* Combine the results of the function applications into a data structure
    * Example: A series with index corresponding to data frame column names and values representing the column means
    
<img src="http://blog.yhat.com/static/img/split-apply-combine.jpg" width="500">
(image from [http://blog.yhat.com/static/img/split-apply-combine.jpg](http://blog.yhat.com/static/img/split-apply-combine.jpg))
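
The three kinds of "apply" step listed above (aggregation, transformation, filtration) can be sketched with Pandas as follows. This is a minimal illustration, not part of the lesson's dataset: the data frame and the names `group`, `value`, `group_means`, `zscores`, and `filtered` are made up for the example.

```python
import pandas as pd

# Hypothetical toy data: three groups of unequal size
df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "c"],
    "value": [1.0, 2.0, 3.0, 10.0, 20.0, 100.0],
})
grouped = df.groupby("group")

# Aggregation: one summary value per group
group_means = grouped["value"].mean()

# Transformation: z-score within each group (result has the same length as the input)
zscores = grouped["value"].transform(lambda s: (s - s.mean()) / s.std())

# Filtration: keep only rows that belong to groups with at least 2 members
filtered = grouped.filter(lambda g: len(g) >= 2)

print(group_means)
print(zscores)
print(filtered)
```

Note that the single-member group `c` has an undefined sample standard deviation, so its z-score is `NaN`, and the filtration step discards that group entirely.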
    
### Pandas GroupBy
In the split step, we want to divide a dataset into a mapping of group names to group data. With the Pandas [`groupby()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html) function, we can divide a data frame into a [`GroupBy`](http://pandas.pydata.org/pandas-docs/stable/groupby.html) object that stores the mapping. For example:

In [2]:
import numpy as np
import pandas as pd

# adapted from http://pandas.pydata.org/pandas-docs/stable/groupby.html
df = pd.DataFrame({"Gender" : ["F", "F", "M", "F", "M", "M", "M", "F"],
                   "AgeGroup" : ["OA", "A", "OA", "YA", "YA", "OA", "A", "YA"], # OA: older adult, A: adult, YA: young adult
                   "Feature1" : np.random.randn(8),
                   "Feature2" : np.random.randn(8)})
print(df)
# GroupBy object (mapping of group name -> group data frame)
gender_groups = df.groupby("Gender")
# groups attribute is a dictionary storing the mapping
print("Groups:", gender_groups.groups)
print("Female data frame")
F_df = gender_groups.get_group("F")
print(F_df)
print("Male data frame")
M_df = gender_groups.get_group("M")
print(M_df)
# confirm M_df is a data frame
print(type(M_df))
# divided the data frame into 2 groups
print(len(df) == len(F_df) + len(M_df))

  AgeGroup  Feature1  Feature2 Gender
0       OA -0.257802 -0.106882      F
1        A -0.749021 -0.917163      F
2       OA -1.268724  1.006705      M
3       YA  0.618698 -0.257106      F
4       YA  1.245493 -0.374474      M
5       OA -1.388780 -0.822717      M
6        A  0.296953 -0.894486      M
7       YA  0.607894 -1.509741      F
Groups: {'M': [2, 4, 5, 6], 'F': [0, 1, 3, 7]}
Female data frame
  AgeGroup  Feature1  Feature2 Gender
0       OA -0.257802 -0.106882      F
1        A -0.749021 -0.917163      F
3       YA  0.618698 -0.257106      F
7       YA  0.607894 -1.509741      F
Male data frame
  AgeGroup  Feature1  Feature2 Gender
2       OA -1.268724  1.006705      M
4       YA  1.245493 -0.374474      M
5       OA -1.388780 -0.822717      M
6        A  0.296953 -0.894486      M
<class 'pandas.core.frame.DataFrame'>
True
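
As a small follow-up sketch, `groupby()` also accepts a list of column names, and `size()` counts the rows in each group. The data frame below reuses the `Gender` and `AgeGroup` columns from the example above; the `Feature1` values are made up so that the output is reproducible.

```python
import pandas as pd

# Same Gender/AgeGroup layout as the example above; the Feature1 values are made up
df = pd.DataFrame({"Gender": ["F", "F", "M", "F", "M", "M", "M", "F"],
                   "AgeGroup": ["OA", "A", "OA", "YA", "YA", "OA", "A", "YA"],
                   "Feature1": [0.5, -1.0, 1.5, 2.0, -0.5, 0.0, 1.0, 3.0]})

# size() counts the rows in each group
gender_sizes = df.groupby("Gender").size()
print(gender_sizes)

# A list of column names forms one group per unique combination of values
pair_sizes = df.groupby(["Gender", "AgeGroup"]).size()
print(pair_sizes)
```

Grouping by two keys yields a series with a two-level index, so a single count can be looked up with a tuple such as `pair_sizes[("F", "YA")]`.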


We now have enough background to work through a data aggregation example!

## Data Aggregation Example
We will continue working with the [pd_hoa_activities.csv](https://raw.githubusercontent.com/gsprint23/aha/master/lessons/files/pd_hoa_activities.csv) dataset. This dataset contains information from a smart home study in which participants performed 9 activities in a smart home environment. In a previous lesson on data cleaning, we read in the data, cleaned it, and saved the cleaned data to a new CSV file: [pd_hoa_activities_cleaned.csv](https://raw.githubusercontent.com/gsprint23/aha/master/lessons/files/pd_hoa_activities_cleaned.csv). We will start with this cleaned version of the dataset.

In [3]:
import pandas as pd
import numpy as np

# forward slashes work on Windows, macOS, and Linux
fname = "files/pd_hoa_activities_cleaned.csv"
df = pd.read_csv(fname, header=0, index_col=[0, 1])
print(df.shape)
print(df.head(n=10))

(665, 3)
                               duration  age class
pid task                                          
0   Water Plants                    146   72   HOA
    Fill Medication Dispenser       210   72   HOA
    Wash Countertop                 241   72   HOA
    Sweep and Dust                  328   72   HOA
    Cook                            229   72   HOA
    Wash Hands                       38   72   HOA
    Perform TUG                      10   72   HOA
    Perform TUG w/Questions          10   72   HOA
    Day Out Task                    680   72   HOA
1   Water Plants                     63   54   HOA


### Split
Now let's group the data into two population groups, HOA and PD. 

In [4]:
classes = df.groupby("class")
for class_name, cls_df in classes:
    print(class_name)
    print(cls_df.head())

HOA
                               duration  age class
pid task                                          
0   Water Plants                    146   72   HOA
    Fill Medication Dispenser       210   72   HOA
    Wash Countertop                 241   72   HOA
    Sweep and Dust                  328   72   HOA
    Cook                            229   72   HOA
PD
                               duration  age class
pid task                                          
2   Water Plants                     47   62    PD
    Fill Medication Dispenser       205   62    PD
    Wash Countertop                 232   62    PD
    Sweep and Dust                  543   62    PD
    Cook                            511   62    PD


### Apply and Combine
Then, we can compute summary statistics for each group, such as the mean and standard deviation of age. We will store the results in a new results data frame with index "HOA" and "PD":

In [5]:
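# one row per class, one column per summary statistic
age_results_df = pd.DataFrame(index=list(classes.groups), columns=["age mean", "age std"])
for class_name, cls_df in classes:
    # .loc with a (row, column) pair replaces the removed .ix indexer
    # and avoids chained assignment
    age_results_df.loc[class_name, "age mean"] = cls_df["age"].mean()
    age_results_df.loc[class_name, "age std"] = cls_df["age"].std()
print(age_results_df)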

    age mean  age std
PD   68.8539  9.88264
HOA  68.6771  9.78872
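
The loop above builds the results table one cell at a time. As an alternative sketch, the apply and combine steps can be collapsed into a single call by selecting the `age` column from the `GroupBy` object and passing a list of statistic names to `agg()`. The small data frame `toy_df` below is a made-up stand-in for the cleaned dataset so that the example runs on its own; its values are not from the study.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned dataset: only the columns needed here
toy_df = pd.DataFrame({
    "age": [72, 54, 62, 70, 66, 80],
    "class": ["HOA", "HOA", "PD", "PD", "HOA", "PD"],
})

# Select the age column from each group, then apply both statistics at once
age_stats = toy_df.groupby("class")["age"].agg(["mean", "std"])
print(age_stats)
```

The result is a data frame indexed by class with one column per statistic, which is the same shape as `age_results_df` but with numeric columns rather than generic object columns.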
