# [AHA! Activity Health Analytics](http://casas.wsu.edu/)
[Center for Advanced Studies of Adaptive Systems (CASAS)](http://casas.wsu.edu/)

[Washington State University](https://wsu.edu)

# PA Daily Change Scores

## Learner Objectives
At the conclusion of this programming assignment, participants should be able to:
* Implement a dissimilarity change score algorithm
* Implement baseline window comparisons
* Implement sliding window comparisons

## Prerequisites
Before starting this programming assignment, participants should be able to:
* Write object-oriented Python code
* Use the `pandas` library
* Load/save data in Pandas objects
* Clean, aggregate, summarize, and visualize data

## Acknowledgments
Content used in this assignment is based upon information in the following sources:
* [Pandas website](http://pandas.pydata.org/)
* [Sprint et al., 2016](http://www.sciencedirect.com/science/article/pii/S1532046416300740)
* [Wang et al., 2012](http://ieeexplore.ieee.org/document/6189784/?section=abstract)

## Overview and Requirements
For this programming assignment, we are going to extend the PACD functionality we covered in the Behavior Change Detection lessons. Specifically, we are going to write code to do the following:
1. Load a Fitbit dataset that contains 50 days of Fitbit data
1. Re-sample the dataset according to a parameter
1. Adjust the feature extraction code from the lessons to handle re-sampled data
1. Implement AHA-DS (AHA Dissimilarity Scorer), a simple dissimilarity-based change score algorithm
1. Compute AHA-DS change scores for window sizes of 1 day (daily)
    1. Baseline comparison implementation
    1. Sliding window comparison implementation
1. Write AHA-DS daily change score results to a csv file

## Program Details
### Dataset
Download [fitbit_example2_data.zip](https://raw.githubusercontent.com/gsprint23/aha/master/progassignments/files/fitbit_example2_data.zip). This dataset is in the same format as the Fitbit data we have been working with: one csv file per day, with minute-by-minute rows and Fitbit metric columns. It contains 50 days of Fitbit data and is a superset of our previous Fitbit dataset (which had 21 days of data).

### Data Storage
Read in each csv file as a data frame with a `DatetimeIndex` representing each minute in the day. Extract the steps from each day to form a new data frame with the same minute-by-minute index and one step column per day, labeled by date, for each of the 50 days. For example:

|time|10/1/2015|10/2/2015|...|11/19/2015|
|-|-|-|-|-|
|0:00:00|0|0|...|0|
|0:01:00|0|0|...|0|
|...|...|...|...|...|
|23:59:00|0|0|...|0|
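
As a minimal sketch of the layout above, the following builds the combined steps frame from per-day frames. The column name `steps`, the helper names, and the dummy anchor date used to give every day the same time-of-day index are assumptions for illustration; synthetic all-zero days stand in for the real csv files.

```python
import numpy as np
import pandas as pd

def to_time_of_day(index, anchor="1970-01-01"):
    """Move every timestamp onto one dummy date so all days share an index."""
    offsets = index - index.normalize()      # time elapsed since midnight
    return pd.Timestamp(anchor) + offsets    # still a DatetimeIndex

def build_steps_frame(day_frames):
    """day_frames: dict of date label -> minute-indexed frame with a "steps" column."""
    columns = {}
    for label, day_df in day_frames.items():
        steps = day_df["steps"].copy()
        steps.index = to_time_of_day(day_df.index)
        columns[label] = steps
    return pd.DataFrame(columns)             # one column per day

# Synthetic stand-ins for two daily files (all zeros, 1440 minutes each).
days = {}
for label in ["2015-10-01", "2015-10-02"]:
    idx = pd.date_range(label, periods=1440, freq="min")
    days[label] = pd.DataFrame({"steps": np.zeros(1440, dtype=int)}, index=idx)

steps_df = build_steps_frame(days)
```

With the real dataset, each entry of `days` would instead come from reading one csv file with a parsed datetime index.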

### Re-sampling
Add support to re-sample the `DatetimeIndex` by a number of minutes specified by the command line argument, $t_{mins}$. You can expect $t_{mins}$ to be in the range [1, 120] minutes, inclusive.
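
One possible way to re-sample is sketched below. It assumes step counts should be summed within each $t_{mins}$-minute bin (a reasonable choice for counts, but an assumption here), and the function name `resample_steps` is hypothetical.

```python
import numpy as np
import pandas as pd

def resample_steps(steps_df, t_mins):
    """Sum minute-level step counts into bins of t_mins minutes."""
    if not 1 <= t_mins <= 120:
        raise ValueError("t_mins must be between 1 and 120 inclusive")
    # "min" is the pandas offset alias for minutes.
    return steps_df.resample(f"{t_mins}min").sum()

# Synthetic day with exactly one step per minute.
idx = pd.date_range("1970-01-01", periods=1440, freq="min")
demo = pd.DataFrame({"2015-10-01": np.ones(1440, dtype=int)}, index=idx)

hourly = resample_steps(demo, 60)   # 24 hourly bins of 60 steps each
```

Note that when $t_{mins}$ does not divide 1440 evenly, the last bin of the day covers fewer minutes than the others.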

### Feature Extraction
Extract the same features for each day as we extracted in the lessons:
1. Total steps
1. Max steps
1. Average steps
1. Standard deviation of steps
1. Physical activity intensity percentages. Percent of the day:
    1. Sedentary ($<$ 5 steps/min)
    1. Low (5 $\leq$ steps/min $<$ 40)
    1. Moderate (40 $\leq$ steps/min $<$ 100)
    1. High ($\geq$ 100 steps/min)

However, be sure to account for re-sampled data when computing the above features!
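
The sketch below shows one way to account for re-sampling: the intensity thresholds are stated in steps/min, so each bin's count is divided by $t_{mins}$ before it is compared. The feature names, computing max/average/standard deviation over bins rather than minutes, and reporting intensities as fractions in [0, 1] are all assumptions; match them to the conventions used in the lessons.

```python
import pandas as pd

def extract_features(steps, t_mins=1):
    """steps: Series of step counts per t_mins-minute bin for one day."""
    rate = steps / t_mins   # convert bin totals back to steps per minute
    return pd.Series({
        "total_steps": steps.sum(),
        "max_steps": steps.max(),
        "mean_steps": steps.mean(),
        "std_steps": steps.std(),   # pandas default: sample std (ddof=1)
        # Fraction of bins falling in each intensity band.
        "sedentary": (rate < 5).mean(),
        "low": ((rate >= 5) & (rate < 40)).mean(),
        "moderate": ((rate >= 40) & (rate < 100)).mean(),
        "high": (rate >= 100).mean(),
    })

# Four hourly bins whose per-minute rates are 0, 10, 50, and 100.
feats = extract_features(pd.Series([0, 600, 3000, 6000]), t_mins=60)
```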

### AHA-DS
We are going to implement a simple, daily change score algorithm that we are going to call *AHA-DS*! AHA-DS computes a change score based on the dissimilarity between daily feature vectors. To describe AHA-DS in detail, consider two days of Fitbit data, $D_{i}$ and $D_{j}$, each containing 1 day of step data. Let $F_{i}$ and $F_{j}$ be the feature vectors for days $D_{i}$ and $D_{j}$, respectively.

To compare two days $D_{i}$ and $D_{j}$ for changes, a weighted normalized Euclidean distance (WNED) measure is used as a change score to quantify the differences between the corresponding feature vectors $F_{i}$ and $F_{j}$. The smaller the Euclidean distance between these two vectors, the more similar the two days of data are. 

Let the number of features in feature vectors $F_{i}$ and $F_{j}$ be $d$. First, each feature is normalized as follows (from [Wang et al., 2012](http://ieeexplore.ieee.org/document/6189784/?section=abstract)):

$$F_{i}^{*}(k) = \frac{F_{i}(k)}{\max\lbrack F_{i}(k), F_{j}(k) \rbrack}$$

$$F_{j}^{*}(k) = \frac{F_{j}(k)}{\max\lbrack F_{i}(k), F_{j}(k) \rbrack}$$

for $k = 1, ..., d$

Then, the WNED between $F_{i}^{*}$ and $F_{j}^{*}$, $d_{ij}$, is defined as follows:

$$d_{ij} = \sqrt{\sum_{k=1}^{d} \frac{1}{d} \lbrack F_{i}^{*}(k) - F_{j}^{*}(k) \rbrack^{2}}.$$
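
The normalization and distance can be sketched in a few lines of NumPy. The function name `wned` is hypothetical, and the handling of a feature that is 0 on both days (where the max in the denominator is 0) is an assumption: the sketch treats that feature as contributing no difference.

```python
import numpy as np

def wned(f_i, f_j):
    """Weighted normalized Euclidean distance between two feature vectors."""
    f_i = np.asarray(f_i, dtype=float)
    f_j = np.asarray(f_j, dtype=float)
    denom = np.maximum(f_i, f_j)              # element-wise max[F_i(k), F_j(k)]
    # Assumption: if both features are 0, avoid 0/0 and count no difference.
    safe = np.where(denom == 0, 1.0, denom)
    diff = f_i / safe - f_j / safe            # F_i*(k) - F_j*(k)
    # Summing (1/d) * diff^2 over k is the mean of the squared differences.
    return float(np.sqrt(np.mean(diff ** 2)))

same = wned([1, 2], [1, 2])        # identical days give 0.0
partial = wned([2, 4], [1, 4])     # sqrt((0.5**2 + 0**2) / 2)
```

Because every normalized feature lies in [0, 1] for non-negative features, the resulting score also lies in [0, 1].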

### Comparison Modes
AHA-DS can compute a sequence of WNED change scores in two different comparison modes:
1. Baseline day comparison: the first day in the dataset is a reference day and is used in each comparison, including a comparison to itself. All subsequent days are compared to the baseline day. For example:
    1. First comparison: days (1, 1)
    1. Second comparison: days (1, 2)
    1. ...
    1. Last comparison: days (1, $N$)
1. Sliding day comparison: pairs of days are compared starting with the first two days. Subsequent pairs are the result of advancing by one day. For example:
    1. First comparison: days (1, 2)
    1. Second comparison: days (2, 3)
    1. ...
    1. Last comparison: days ($N - 1$, $N$)
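
The two modes differ only in which pairs of days are compared, which the following sketch makes explicit. The function name `comparison_pairs` is hypothetical, and the pairs use zero-based positions (day 1 in the lists above is position 0).

```python
def comparison_pairs(n_days, mode):
    """Return the (i, j) day positions to compare, using zero-based indices."""
    if mode == "b":
        # Baseline: the first day against every day, itself included.
        return [(0, j) for j in range(n_days)]
    if mode == "s":
        # Sliding: each day against the day that follows it.
        return [(i, i + 1) for i in range(n_days - 1)]
    raise ValueError("mode must be 'b' or 's'")

baseline_pairs = comparison_pairs(3, "b")   # [(0, 0), (0, 1), (0, 2)]
sliding_pairs = comparison_pairs(3, "s")    # [(0, 1), (1, 2)]
```

Baseline mode therefore yields $N$ scores and sliding mode yields $N - 1$.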

### Program Output
1. Save the features data frame to the output file specified by a command line argument.
1. Save the resulting AHA-DS change scores in a Pandas `Series` in the following format (sliding mode labels shown; actual results omitted):

||score|
|-|-|
|2015-10-01~2015-10-02|0.XX|
|2015-10-02~2015-10-03|0.XX|
|...|...|
|2015-11-18~2015-11-19|0.XX|

The series should be named "score" and is written to the output file specified by a command line argument. Use the following [`to_csv()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.to_csv.html) keywords to output your series:
1. `header=True` 
1. `float_format="%.2f"`
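
As a small illustration of these keywords, the sketch below writes a named series to an in-memory buffer instead of a file. The two score values are made-up placeholders, not actual results.

```python
import io
import pandas as pd

# Placeholder scores only; real values come from the AHA-DS computation.
scores = pd.Series(
    [0.1234, 0.5],
    index=["2015-10-01~2015-10-02", "2015-10-02~2015-10-03"],
    name="score",
)

buffer = io.StringIO()   # stands in for the output filename
scores.to_csv(buffer, header=True, float_format="%.2f")
csv_text = buffer.getvalue()
```

The header row carries the series name, and each score is rounded to two decimal places on output.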


### Command Line Arguments
Your program should accept the following parameters as command line arguments (in this order):
1. Input filename: the filename (relative path) of the input data
1. Output features filename: the filename (relative path) to output the features
1. Output scores filename: the filename (relative path) to output the change scores
1. $t_{mins}$: number of minutes per time interval (used for re-sampling)
1. Mode: either "b" for baseline day comparisons or "s" for sliding day comparisons

Example: `files\fitbit_example2_data files\aha-ds_60_features.csv files\aha-ds_60_b.csv 60 b`
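
One way to read these five positional arguments is with `argparse`, as sketched below. The argument names (`input_path`, `features_path`, `scores_path`) and the range check are illustrative assumptions; reading `sys.argv` directly would work just as well.

```python
import argparse

def parse_args(argv=None):
    """Parse the five positional arguments in the order listed above."""
    parser = argparse.ArgumentParser(description="AHA-DS daily change scores")
    parser.add_argument("input_path", help="relative path of the input data")
    parser.add_argument("features_path", help="where to write the features csv")
    parser.add_argument("scores_path", help="where to write the change scores csv")
    parser.add_argument("t_mins", type=int, help="minutes per re-sampled interval")
    parser.add_argument("mode", choices=["b", "s"], help="b=baseline, s=sliding")
    args = parser.parse_args(argv)   # argv=None falls back to sys.argv[1:]
    if not 1 <= args.t_mins <= 120:
        parser.error("t_mins must be between 1 and 120 inclusive")
    return args

args = parse_args([
    "files/fitbit_example2_data",
    "files/aha-ds_60_features.csv",
    "files/aha-ds_60_b.csv",
    "60",
    "b",
])
```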

### Example Results (Visualized)
#### Baseline
Example AHA-DS baseline change scores for the dataset with $t_{mins}$ = 60 and Mode = "b" look like the following:
<img src="https://raw.githubusercontent.com/gsprint23/aha/master/progassignments/figures/ahads_60_b.png" width="1000" />

Note: There is a cyclical pattern with peaks around weekends and valleys around weekdays. On some days it appears the Fitbit was not worn; perhaps it was charging or forgotten. For these days, the AHA-DS change score is quite high. A more advanced AHA-DS system could detect missing data by looking at null heart rate values and fill missing data by taking averages of neighboring days that share the same day of the week (e.g. the Monday before and the Monday after to fill a missing Monday).

#### Sliding
Example AHA-DS sliding change scores for the dataset with $t_{mins}$ = 60 and Mode = "s" look like the following:
<img src="https://raw.githubusercontent.com/gsprint23/aha/master/progassignments/figures/ahads_60_s.png" width="1000" />