In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from datawrangler import *

## Data
The data is an abridged version of the PAXRAW_D dataset from the
[NHANES (National Health and Nutrition Examination Survey)](https://wwwn.cdc.gov/nchs/nhanes/Search/DataPage.aspx?Component=Examination&CycleBeginYear=2005) study conducted by the CDC's National Center for Health Statistics.


The dataset contains objective physical activity measurements as recorded by an activity monitor device. I took the original dataset, reduced its size (from ~7000 subjects to 1000), and converted it to a CSV file for the demo.

You can download the demo dataset here:
https://drive.google.com/open?id=1sd3ePUTp4ZYqOgw8DhZapF3kXIbtz5Fg

In [2]:
# Read the data into a DataFrame
data = pd.read_csv("PAXRAW_1000.csv")

In [3]:
data.head()

Unnamed: 0,SEQN,PAXSTAT,PAXCAL,PAXDAY,PAXN,PAXHOUR,PAXMINUT,PAXINTEN,PAXSTEP
0,31128.0,1.0,1.0,1.0,1.0,0.0,0.0,166.0,4.0
1,31128.0,1.0,1.0,1.0,2.0,0.0,1.0,27.0,0.0
2,31128.0,1.0,1.0,1.0,3.0,0.0,2.0,0.0,0.0
3,31128.0,1.0,1.0,1.0,4.0,0.0,3.0,276.0,4.0
4,31128.0,1.0,1.0,1.0,5.0,0.0,4.0,0.0,0.0


## Data cont'd.
In order to resample the data, we need to have a timestamp value for each row that contains the correct unit of time (minute, hour, etc). Usually, these datasets come with an exact date timestamp, which I would parse using pd.to_datetime().

However, this dataset doesn't contain an exact timestamp. Instead it provides the measurement number "PAXN",
which records the order that the measurements are taken (3rd measurement == 3, 4th == 4, etc) for each subject.

In this case, I know that the measurements are taken at exact 1-minute intervals, so we can treat the "PAXN" column as a Timedelta between the starting measurement $t=0$ and the current one. This lets us resample without an exact timestamp.

In [4]:
# Set t0 = 0 by subtracting the minimum value from the entire column
data.PAXN -= data.PAXN.min()
# Add a timedelta column to the data, so that we can resample.
# The conversion is elementwise, so it does not need a per-subject groupby.
data["delta_t"] = pd.to_timedelta(data["PAXN"], unit="m")
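
As a small illustration of the idea, the sketch below builds a toy long-format frame and converts the measurement counter into a timedelta. The values are made up; only the column names come from the dataset. Unlike the cell above, it shifts each subject's counter by that subject's own minimum, which is an assumption about how one might handle subjects whose counters do not all start at the same value.

```python
import pandas as pd

# Toy stand-in for the real data: two subjects, six 1-minute measurements each.
toy = pd.DataFrame({
    "SEQN": [1] * 6 + [2] * 6,
    "PAXN": list(range(1, 7)) * 2,
    "PAXINTEN": [5, 0, 3, 2, 0, 1, 4, 4, 0, 0, 7, 1],
})

# Shift each subject's counter so that its first measurement sits at t = 0.
toy["PAXN"] -= toy.groupby("SEQN")["PAXN"].transform("min")

# Elementwise conversion: measurement k becomes an offset of k minutes from t = 0.
toy["delta_t"] = pd.to_timedelta(toy["PAXN"], unit="min")

print(toy[["SEQN", "PAXN", "delta_t"]])
```

Each subject now has its own clock starting at zero, which is all a resampler needs.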

# Normalization
For my project, I wanted to build a pipeline that would allow me to easily apply some of my most commonly used normalization techniques sequentially. 

In this demo I apply the following normalizations:

1. **LongToWideFormat** - This converts the "long" format data (each row is a measurement) to a "wide" format (each row contains all measurements for a single subject).


2. **TimeseriesResampler** - By default, the PAXRAW dataset reports cumulative activity over each minute of the study week. This is usually too noisy to use on its own, so I resample it to a lower frequency (5 minutes, 15 minutes, etc.) to clean up some of the noise. In this case, I have Pandas resample the data to a 5-minute interval by summing up the 1-minute measurements.


3. **NaNReplacer** - Depending on the dataset and timestamp type used, some subjects could end up with NaN measurements (for example, if the device did not record anything for one portion of the study). If the NaNs are infrequent, I usually just replace them with 0. Right now, this module simply sets every NaN to a constant value (0 in this demo).


4. **ConstValueDropper** - One issue that occurs in the PAXRAW dataset is the inclusion of subjects whose measurements have a constant value for the entire study period, which can produce misleading results if you try to cluster the data. I'm not sure if this is caused by a device malfunction or something else, but these values are obviously incorrect. This module drops any rows (subjects) which have a constant value for all measurements.


5. **StableSeasonalFilter** - This module allows me to correct the data for the "seasonal" diurnal patterns that occur by having a user wear the device over a week-long study. When clustering the data, the "strongest" signal is often a sinusoidal pattern with 7 peaks (i.e. higher activity during waking hours, low/no activity at night). We are interested in more subtle patterns though, so this module allows us to correct the data by removing this seasonality.


6. **ZTransformNormalize** - This module normalizes the measurements for each subject (row) to have $\mu=0$ and $\sigma=1$. This preserves the overall shape of the series, but normalizes it to account for different activity levels in each subject.

In [5]:
# Initialize the Normalizer with a list of the operations I want to perform.
normalizer = Normalizer([LongToWideFormat(index_col="SEQN",data_col="PAXINTEN",timestamp_col="delta_t"),
                         TimeseriesResampler("5T",axis=1),
                         NaNReplacer(const_val=0),
                         ConstValueDropper(axis=1),
                         StableSeasonalFilter(num_seasons=7),
                         ZTransformNormalize(axis=-1)])
# The output DataFrame will be the normalized data.
data_df = normalizer.apply(data)
print("done!")

          Module         |    Input Shape     |    Output Shape    
-------------------------------------------------------------------
LongToWideFormat         |   (10047756, 10)   |   (1000, 10080)    
TimeseriesResampler      |   (1000, 10080)    |    (1000, 2016)    
NaNReplacer              |    (1000, 2016)    |    (1000, 2016)    
ConstValueDropper        |    (1000, 2016)    |    (994, 2016)     
StableSeasonalFilter     |    (994, 2016)     |    (994, 2016)     
ZTransformNormalize      |    (994, 2016)     |    (994, 2016)     
-------------------------------------------------------------------
done!


# Output
The normalizer also prints out a small summary table for each step. The *Input Shape* column contains the dimensions of the data before the normalization step is applied, and the *Output Shape* column contains the dimensions after the normalization step is applied.

I wanted to have this because in some datasets I end up "losing" quite a few subjects during the normalization phase, due to lack of data or something else. This table makes it easier for me to see what is happening to the data at each step.
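
A table like this is cheap to produce. The following is a minimal sketch of the idea, not the actual `Normalizer` code: `run_pipeline` and the two toy steps are hypothetical names made up for this example.

```python
import numpy as np
import pandas as pd


def run_pipeline(df, steps):
    """Apply (name, function) steps in order, recording the shape before and after each."""
    log = []
    for name, func in steps:
        before = df.shape
        df = func(df)
        log.append((name, before, df.shape))
    return df, log


# Hypothetical steps standing in for the real modules.
steps = [
    ("drop_nan_rows", lambda d: d.dropna()),
    ("center_rows", lambda d: d.sub(d.mean(axis=1), axis=0)),
]

toy = pd.DataFrame([[1.0, np.nan, 3.0], [2.0, 4.0, 6.0], [0.0, 5.0, 1.0]])
result, log = run_pipeline(toy, steps)

# Print one row per step, in the same spirit as the summary table.
print(f"{'Module':<16}|{'Input Shape':^14}|{'Output Shape':^14}")
for name, before, after in log:
    print(f"{name:<16}|{str(before):^14}|{str(after):^14}")
```

Comparing the two shape columns row by row shows immediately which step removed subjects.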

# Analysis

### 1. LongToWideFormat
![Normalization Output at Step 0](imgs/Step_0.png)
This is the "raw" data that comes directly after Pandas does the pivot to convert from long to wide format.
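
The module's internals are not shown here, but a long-to-wide conversion of this kind can be sketched with plain `DataFrame.pivot`. The toy values below are made up; only the column names match the dataset.

```python
import pandas as pd

# Long format: one row per (subject, time) measurement.
long_df = pd.DataFrame({
    "SEQN": [1, 1, 1, 2, 2, 2],
    "delta_t": pd.to_timedelta([0, 1, 2, 0, 1, 2], unit="min"),
    "PAXINTEN": [16, 2, 0, 12, 0, 5],
})

# Wide format: one row per subject, one column per time offset.
wide = long_df.pivot(index="SEQN", columns="delta_t", values="PAXINTEN")
print(wide)
```

If a subject is missing a time offset that other subjects have, the pivot leaves a NaN in that cell, which is one way NaNs can enter the pipeline.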

### 2. TimeseriesResampler
![Normalization Output at Step 1](imgs/Step_1.png)
This mostly looks the same, but this new series is 1/5th the length of the old one, since it was resampled from a 1-minute interval to a 5-minute interval.
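
For reference, summing 1-minute columns into 5-minute bins can be done in plain Pandas as sketched below. This assumes the wide frame has a `TimedeltaIndex` for its columns; the real `TimeseriesResampler` may be implemented differently.

```python
import numpy as np
import pandas as pd

# Two subjects, ten 1-minute measurements each, every value equal to 1.
cols = pd.to_timedelta(np.arange(10), unit="min")
wide = pd.DataFrame(np.ones((2, 10)), index=[101, 102], columns=cols)

# resample() bins along the index, so transpose, resample, and transpose back.
resampled = wide.T.resample("5min").sum().T
print(resampled)
```

Ten 1-minute columns collapse into two 5-minute columns, each holding the sum of the five values it covers.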

### 3. NaNReplacer
![Normalization Output at Step 2](imgs/Step_2.png)
Nothing to see here, since there were no NaNs for this subject.

### 4. ConstValueDropper
![Normalization Output at Step 3](imgs/Step_3.png)
Again nothing to see. This time series doesn't have constant values, so it was passed through this module unchanged.
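
One simple way to detect constant rows, shown here as an assumption rather than as the module's actual logic, is to count the distinct values in each row:

```python
import pandas as pd

wide = pd.DataFrame(
    [[0, 3, 1, 4], [7, 7, 7, 7], [2, 0, 0, 5]],
    index=[101, 102, 103],  # made-up subject IDs
)

# A row with at most one distinct value is constant.
is_constant = wide.nunique(axis=1) <= 1
cleaned = wide.loc[~is_constant]
print(cleaned)
```

Subject 102 is dropped because all of its measurements are identical; the other two rows pass through unchanged.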

### 5. StableSeasonalFilter
![Normalization Output at Step 4](imgs/Step_4.png)
Here you can see that the series has been corrected (sort-of) for the seasonality of the data. The data variance is still higher during the day, but the seasonality filter has pulled the daytime mean activity lower.
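
To make the idea concrete, here is a minimal sketch of one common form of stable seasonal filtering: average the series at each time of day across all days, then subtract that average day from every day. The function name is hypothetical, and the actual `StableSeasonalFilter` may differ in detail.

```python
import numpy as np


def remove_stable_seasonality(series, num_seasons):
    """Subtract the average within-season profile from a 1-D series."""
    series = np.asarray(series, dtype=float)
    period = series.size // num_seasons
    if period * num_seasons != series.size:
        raise ValueError("series length must be a multiple of num_seasons")
    seasons = series.reshape(num_seasons, period)  # one row per season (day)
    profile = seasons.mean(axis=0)                 # the "average day"
    return (seasons - profile).ravel()


# Toy week: 7 identical "days" of 4 samples, plus a bump on the third day.
day = np.array([0.0, 10.0, 10.0, 0.0])
week = np.tile(day, 7)
week[9] += 7.0

residual = remove_stable_seasonality(week, num_seasons=7)
print(residual.reshape(7, 4))
```

The repeating daily shape is removed, and what remains is how each day deviates from the average day, so the bump stands out.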

### 6. ZTransformNormalize
![Normalization Output at Step 5](imgs/Step_5.png)
There isn't much visible difference here, but the Y-axis has changed since the data has been transformed to Z-scores. This doesn't change the shape of a single series, but it puts all subjects on a common scale, so that every subject has the same mean (0) and standard deviation (1).
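
A row-wise Z-transform, $z = (x - \mu) / \sigma$ with $\mu$ and $\sigma$ computed per subject, can be sketched as follows. Using the population standard deviation (`ddof=0`) is an assumption; the module may use the sample version instead.

```python
import numpy as np
import pandas as pd

# Two made-up subjects with very different activity levels.
wide = pd.DataFrame([[1.0, 2.0, 3.0], [10.0, 20.0, 60.0]], index=[101, 102])

mu = wide.mean(axis=1)             # per-subject mean
sigma = wide.std(axis=1, ddof=0)   # per-subject (population) standard deviation
z = wide.sub(mu, axis=0).div(sigma, axis=0)
print(z)
```

After the transform, both rows have mean 0 and standard deviation 1, even though the second subject's raw values were an order of magnitude larger.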
