# [AHA! Activity Health Analytics](http://casas.wsu.edu/)
[Center for Advanced Studies of Adaptive Systems (CASAS)](http://casas.wsu.edu/)

[Washington State University](https://wsu.edu)

# L6 Virtual Classifier: Part 1

## Learner Objectives
At the conclusion of this lesson, participants should have an understanding of:
* The virtual classifier algorithm
* Implementing steps of PACD with the virtual classifier algorithm

## Acknowledgments
Content used in this lesson is based upon information in the following sources:
* [Sprint et al., 2016](http://www.sciencedirect.com/science/article/pii/S1532046416300740)
* [Hido et al., 2008](https://link.springer.com/chapter/10.1007%2F978-3-540-68125-0_15)

## Virtual Classifier Overview
Change analysis, as proposed by [Hido et al., 2008](https://link.springer.com/chapter/10.1007%2F978-3-540-68125-0_15), uses a virtual binary classifier (VC) to detect and investigate change. Before discussing the details of the VC approach, we need to cover some background on binary classification and k-fold cross validation.
    
### Background: Binary Classification
Snippets from [Wikipedia](https://en.wikipedia.org/wiki/Statistical_classification):
> In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. An example would be assigning a given email into "spam" or "non-spam" classes or assigning a diagnosis to a given patient as described by observed characteristics of the patient (gender, blood pressure, presence or absence of certain symptoms, etc.).

> In the terminology of machine learning, classification is considered an instance of supervised learning, i.e. learning where a training set of correctly identified observations is available. Often, the individual observations are analyzed into a set of quantifiable properties, known variously as explanatory variables or features. These properties may variously be categorical (e.g. "A", "B", "AB" or "O", for blood type), ordinal (e.g. "large", "medium" or "small"), integer-valued (e.g. the number of occurrences of a particular word in an email) or real-valued (e.g. a measurement of blood pressure). Other classifiers work by comparing observations to previous observations by means of a similarity or distance function.

> An algorithm that implements classification, especially in a concrete implementation, is known as a classifier. The term "classifier" sometimes also refers to the mathematical function, implemented by a classification algorithm, that maps input data to a category.

> In machine learning, the observations are often known as instances, the explanatory variables are termed features (grouped into a feature vector), and the possible categories to be predicted are classes. Other fields may use different terminology: e.g. in community ecology, the term "classification" normally refers to cluster analysis, i.e. a type of unsupervised learning, rather than the supervised learning described in this article.

> Binary or binomial classification is the task of classifying the elements of a given set into two groups on the basis of a classification rule. Some of the methods commonly used for binary classification are:
* Decision trees
* Random forests
* Bayesian networks
* Support vector machines
* Neural networks
* Logistic regression

### Background: Binary Classifier Performance
While there are *several* classification evaluation metrics, we are going to evaluate our VC with classification accuracy, the proportion of correct classifications out of all classifications made:

$$Accuracy = \frac{\# correct}{\# correct + \# incorrect}$$
   
Note: Accuracy can be misleading. For example, when classes are imbalanced and 99% of the dataset belongs to the positive class, a classifier that always predicts the positive class achieves 99% accuracy without learning anything about the data.
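
To make the accuracy formula and the imbalance caveat concrete, here is a minimal sketch. The `accuracy` helper and the toy labels are made up for illustration and are not part of the lesson's dataset.

```python
def accuracy(y_true, y_pred):
    """Proportion of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy, imbalanced labels (hypothetical): 99 positive examples, 1 negative
y_true = [1] * 99 + [0]
# A "classifier" that ignores its input and always predicts positive
y_pred = [1] * 100

acc = accuracy(y_true, y_pred)
print("Accuracy:", acc)
```

Even though this predictor never identifies the negative example, its accuracy is high, which is why accuracy should be interpreted alongside the class balance.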


### Background: K-fold Cross Validation
Training and evaluating a classifier on separate training and testing sets is called the *holdout method*. In the holdout method, the dataset is divided into two sets, the training set and the testing set. The training set is used to build the model and the testing set is used to evaluate the model (e.g. the model's accuracy). One shortcoming of this approach is that the evaluation of the model depends heavily on which examples are selected for training versus testing.

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/09/Supervised_machine_learning_in_a_nutshell.svg/2000px-Supervised_machine_learning_in_a_nutshell.svg.png" width="650">
(image from [https://upload.wikimedia.org/wikipedia/commons/thumb/0/09/Supervised_machine_learning_in_a_nutshell.svg/2000px-Supervised_machine_learning_in_a_nutshell.svg.png](https://upload.wikimedia.org/wikipedia/commons/thumb/0/09/Supervised_machine_learning_in_a_nutshell.svg/2000px-Supervised_machine_learning_in_a_nutshell.svg.png))

K-fold cross validation is a model evaluation approach that addresses this shortcoming of the holdout method. The examples are divided into $k$ subsets $S = \{s_{1}, \ldots, s_{i}, \ldots, s_{k}\}$ and the holdout method is repeated $k$ times. In each iteration $i$, subset $s_{i}$ is held out for testing and the remaining subsets $S - \{s_{i}\}$ are used for training. The performance is then averaged over all $k$ train/test trials.

<img src="https://upload.wikimedia.org/wikipedia/commons/1/1c/K-fold_cross_validation_EN.jpg" width="600">
(image from [https://upload.wikimedia.org/wikipedia/commons/1/1c/K-fold_cross_validation_EN.jpg](https://upload.wikimedia.org/wikipedia/commons/1/1c/K-fold_cross_validation_EN.jpg))

Note: When $k$ equals the number of examples, so that each test set contains exactly one example, the approach is called *leave-one-out* cross validation.
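
The splitting procedure can be sketched with NumPy alone. The `k_fold_indices` helper below is a hypothetical name introduced for illustration. It only produces the train/test index sets; fitting and scoring a classifier on each split is left out.

```python
import numpy as np

def k_fold_indices(num_examples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation."""
    if not 2 <= k <= num_examples:
        raise ValueError("k must be between 2 and the number of examples")
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(num_examples)
    # array_split tolerates fold sizes that do not divide evenly
    folds = np.array_split(shuffled, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# Example: 14 examples split into 7 folds of 2 examples each
splits = list(k_fold_indices(num_examples=14, k=7))
for fold, (train_idx, test_idx) in enumerate(splits):
    print("fold", fold, "train size:", len(train_idx), "test:", sorted(test_idx))
```

Every example appears in exactly one test set, so each example is used for testing once and for training $k - 1$ times.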

### VC Steps
We apply the VC approach as part of PACD for large window sizes. The VC algorithm is composed of the following steps:
1. A feature extraction step reduces two windows $W_{i}$ and $W_{j}$ into sets of $n$ daily feature vectors, $F_{i}$ and $F_{j}$, where $n$ is the number of days in each window.
1. Each daily feature vector of $F_{i}$ is labeled with a positive class and each daily feature vector of $F_{j}$ is labeled with a negative class. 
1. VC trains a decision tree to learn the decision boundary between the virtual positive and negative classes. The resulting average prediction accuracy based on $k$-fold cross validation is represented as $p_{VC}$. 
1. If a significant change exists between $W_{i}$ and $W_{j}$, the average classification accuracy $p_{VC}$ of the learner should be significantly higher than the accuracy expected from random noise, $p_{rand} = 0.5$, the binomial maximum likelihood of two equal length windows.
    * For this test, the inverse survival function of a binomial distribution is used to determine a critical value, $p_{critical}$, at which $n$ Bernoulli trials are expected to exceed $p_{rand}$ at $\alpha$ significance. If $p_{VC} \geq p_{critical}$, a significant change exists between the two windows, $W_{i}$ and $W_{j}$.
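
The significance test in the last step can be sketched using only the standard library. This is an illustration under stated assumptions, not the lesson's implementation: `binomial_sf` and `critical_accuracy` are hypothetical helper names, $\alpha = 0.05$ and the value of `p_vc` are made up, and the number of trials is set to $n = 7$ to follow the wording above. The search should correspond to what a binomial inverse survival function (for example `scipy.stats.binom.isf`) returns, divided by the number of trials.

```python
from math import comb

def binomial_sf(k, n, p):
    """Survival function P(X > k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1, n + 1))

def critical_accuracy(num_trials, p_rand=0.5, alpha=0.05):
    """Smallest accuracy k / num_trials such that P(X > k) <= alpha under chance."""
    for k in range(num_trials + 1):
        if binomial_sf(k, num_trials, p_rand) <= alpha:
            return k / num_trials

num_trials = 7   # assumption: n Bernoulli trials, with n = 7 days per window
p_critical = critical_accuracy(num_trials, p_rand=0.5, alpha=0.05)

p_vc = 0.90      # hypothetical cross-validated accuracy, not a real result
print("p_critical:", p_critical)
print("Significant change:", p_vc >= p_critical)
```

With so few trials the critical accuracy sits well above 0.5, so small windows require a very accurate classifier before a change is declared significant.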

## VC Example
To implement the VC algorithm, we will work with the steps data frame we constructed in a previous lesson, [fitbit_example_data_steps_df.csv](https://raw.githubusercontent.com/gsprint23/aha/master/lessons/files/fitbit_example_data_steps_df.csv). 
<img src="https://raw.githubusercontent.com/gsprint23/aha/master/lessons/figures/ADM_example.png" width="700">

Since there are 21 days of data in this dataset, we will segment our data into three 1-week windows, where each subscript is the index of the window's first day:
1. $W_{1}$ (week 1): [2015-10-01, 2015-10-07]
1. $W_{8}$ (week 2): [2015-10-08, 2015-10-14]
1. $W_{15}$ (week 3): [2015-10-15, 2015-10-21]

With our three windows, we can perform either a set of baseline comparisons or a set of sliding window comparisons. For baseline comparisons, we can perform the following comparisons:
1. $W_{1}$ to $W_{8}$: compare the first week to the second week.
1. $W_{1}$ to $W_{15}$: compare the first week to the third week.

For sliding comparisons, we can perform the following comparisons:
1. $W_{1}$ to $W_{8}$: compare the first week to the second week.
1. $W_{8}$ to $W_{15}$: compare the second week to the third week.
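
As a small sketch, the two comparison schemes can be generated from the list of window start days. The variable names below are chosen for illustration only.

```python
# Start days of the three 1-week windows: W_1, W_8, W_15
window_starts = [1, 8, 15]

# Baseline: compare the first window against every later window
baseline = [(window_starts[0], w) for w in window_starts[1:]]

# Sliding: compare each window against the window that follows it
sliding = list(zip(window_starts, window_starts[1:]))

print("Baseline comparisons:", baseline)
print("Sliding comparisons:", sliding)
```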

For simplicity, in this example we are going to perform a single comparison of $W_{1}$ to $W_{8}$. Let's begin by loading the data into a data frame.

In [1]:
import pandas as pd
import numpy as np

fname = "files/fitbit_example_data_steps_df.csv" # forward slash works on Windows, macOS, and Linux
df = pd.read_csv(fname, header=0, parse_dates=[0])
df.set_index("time", inplace=True)
print(type(df.index))
print(df.shape, "Number of days:", len(df.columns))
print(df.head(n=5))

<class 'pandas.tseries.index.DatetimeIndex'>
(1440, 21) Number of days: 21
                     2015-10-01  2015-10-02  2015-10-03  2015-10-04  \
time                                                                  
2017-06-08 00:00:00         0.0         0.0         0.0         0.0   
2017-06-08 00:01:00         0.0         0.0         0.0        14.0   
2017-06-08 00:02:00         0.0         0.0         0.0        60.0   
2017-06-08 00:03:00         0.0         0.0         4.0        10.0   
2017-06-08 00:04:00         0.0         0.0         7.0         8.0   

                     2015-10-05  2015-10-06  2015-10-07  2015-10-08  \
time                                                                  
2017-06-08 00:00:00         0.0        49.0         0.0         5.0   
2017-06-08 00:01:00         0.0         0.0         0.0        26.0   
2017-06-08 00:02:00         0.0         0.0         0.0        17.0   
2017-06-08 00:03:00         0.0         5.0         0.0        53.0   
2

### Extract Features
Now that we have loaded the data into a data frame, we need to compute features that are likely to be indicative of changes from day to day and window to window. For our VC example, our window size, $n$, is going to be 7 days (one week). Consequently, let's extract the following 8 daily features that are straightforward to compute:
1. Total steps
1. Max steps
1. Average steps
1. Standard deviation of steps
1. Physical activity intensity percentages. Percent of the day:
    1. Sedentary ($<$ 5 steps/min)
    1. Low (5 $\leq$ steps/min $<$ 40)
    1. Moderate (40 $\leq$ steps/min $<$ 100)
    1. High ($\geq$ 100 steps/min)
    
With window size $n = 7$ days and 8 features per day, we are going to construct feature matrices that are $7 \times 8$ data frames. 

Note: To compute the four physical activity percentages, we will need to bin the data according to the specified cut-off values. We can do this with support from the Pandas [`cut()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.cut.html) function.
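
Before applying `cut()` to the full data frame, it can help to see how the cut-off values map onto bins for a handful of values. The step counts below are made up so that each intensity level receives two values. They are not taken from the Fitbit dataset.

```python
import pandas as pd

# Hypothetical per-minute step counts sitting on the bin boundaries
steps = pd.Series([0, 4, 5, 39, 40, 99, 100, 120])

intensities = ["sedentary", "low", "moderate", "high"]
# Bins exclude the left edge and include the right edge by default:
# (-1, 4], (4, 39], (39, 99], (99, max]
bins = [-1, 4, 39, 99, steps.max()]

binned = pd.cut(steps, bins, labels=intensities)
counts = binned.value_counts()

# Percent of observations falling into each intensity level
percentages = {label: counts.loc[label] / len(steps) * 100 for label in intensities}
print(binned.tolist())
print(percentages)
```

Starting the first bin at -1 ensures that a step count of 0 is included, since the left edge of each bin is exclusive.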

In [3]:
def compute_features_df(df):
    '''
    df is a data frame with a DatetimeIndex (minutes of the day) and one column per date
    '''
    features_df = pd.DataFrame(index=df.columns)
    features_df["total"] = df.sum()
    features_df["max"] = df.max()
    features_df["mean"] = df.mean()
    features_df["std"] = df.std()

    # PHYSICAL ACTIVITY INTENSITY FEATURES
    intensities = ["sedentary", "low", "moderate", "high"]
    # add blank columns for each intensity. values will be filled in one at a time
    for intensity in intensities:
        features_df[intensity] = np.nan # np.NaN alias was removed in NumPy 2.0
    
    # the largest step count in the data set
    max_val = features_df["max"].max()
    # need to adjust if resample data using sum instead of mean
    # exclusive of left, inclusive of right
    bins = [-1, 4, 39, 99, max_val]

    for date in df.columns:
        intensity_ser = pd.cut(df[date], bins, labels=intensities)
        counts = intensity_ser.value_counts()
        for intensity in intensities:
            # use loc on counts because its index is categorical
            percentage = counts.loc[intensity] / len(df) * 100
            # a single .loc call avoids chained indexing (.ix was removed from pandas)
            features_df.loc[date, intensity] = percentage
    return features_df

features_df = compute_features_df(df)
print(features_df)

              total    max       mean        std  sedentary        low  \
2015-10-01   5097.0  118.0   3.539583  12.590844  86.458333  10.972222   
2015-10-02   8898.0  113.0   6.179167  17.505718  81.041667  12.708333   
2015-10-03  13208.0  121.0   9.172222  26.262312  83.680556   7.500000   
2015-10-04  31017.0  124.0  21.539583  40.019937  70.208333   9.861111   
2015-10-05   3176.0  117.0   2.205556   9.228742  90.138889   8.680556   
2015-10-06   6021.0  118.0   4.181250  14.944877  87.777778   8.194444   
2015-10-07   4822.0  110.0   3.348611  12.330547  88.333333   8.750000   
2015-10-08   5803.0  119.0   4.029861  12.875836  84.027778  12.777778   
2015-10-09   4272.0  113.0   2.966667  11.257810  88.402778   9.305556   
2015-10-10     19.0   19.0   0.013194   0.500694  99.930556   0.069444   
2015-10-11  10134.0  110.0   7.037500  19.796673  81.180556  12.083333   
2015-10-12   4061.0  118.0   2.820139  10.902576  88.611111   9.375000   
2015-10-13   3522.0  114.0   2.445833 

## To be Continued...
We will finish the VC implementation in the next lesson. Until then, let's save our features data frame to file so we don't have to reconstruct it.

In [None]:
out_fname = fname[:-12] + "features_df.csv" # write out for later
features_df.to_csv(out_fname)