# Classifying temporal data with non-temporal models

In this Jupyter notebook we classify human activities from accelerometer data using a Random Forest classifier. We use the WISDM human activity recognition dataset (http://www.cis.fordham.edu/wisdm/dataset.php), which contains six activities: walking, jogging, upstairs, downstairs, sitting, and standing.

In [None]:
import numpy as np
import pandas as pd
import warnings
from scipy.io import arff
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import shuffle

In [None]:
#Define global variables
filename = "wisdm_modified.arff"

### Loading the data with features extracted
The .arff file contains 43 features extracted from the raw accelerometer data. We load the .arff file and convert it into a pandas data frame. Again, we explore the data, but this time on the extracted features.

<img src="img/exploration.png" width="100">

In [None]:
#Load arff file.
dataset = arff.loadarff(filename)
dataset = pd.DataFrame(dataset[0])

# remove double quotes in column names
dataset.columns = dataset.columns.str.replace('\"','')

#print the first rows of the data frame
dataset.head(10)

In [None]:
#remove rows with nan. We may want to impute those values instead.
dataset = dataset.dropna(how='any')
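
As the comment above notes, imputation is an alternative to dropping rows. The following is a minimal sketch, assuming scikit-learn's `SimpleImputer` with a median strategy; the toy frame `toy` and its values are made up for illustration and are not taken from the WISDM file.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame standing in for the extracted features (values are made up).
toy = pd.DataFrame({
    "XSTANDDEV": [1.0, np.nan, 3.0, 5.0],
    "YSTANDDEV": [2.0, 4.0, np.nan, 6.0],
})

# Replace each missing value with the median of its column.
imputer = SimpleImputer(strategy="median")
toy_imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(toy_imputed)
```

When imputation is combined with cross validation, the imputer would normally be fitted on the training folds only, in the same way the normalizer is handled later in this notebook.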

In [None]:
#compute summary statistics
dataset.describe()

In [None]:
dataset[['XSTANDDEV','class']].groupby('class').describe()

In [None]:
# The same but with sql syntax
from pandasql import sqldf

#initialize
pysqldf = lambda q: sqldf(q, globals())

print(pysqldf("""SELECT class, COUNT(class), AVG(XSTANDDEV), MIN(XSTANDDEV), MAX(XSTANDDEV) FROM dataset GROUP BY class;"""))


<img src="img/preprocessing.png" width="100">

In [None]:
#Drop unused columns
dataset = dataset.drop(['UNIQUE_ID','user','XAVG'], axis=1) #XAVG is all zeros

#Convert bytes to string
dataset['class'] = dataset['class'].str.decode("utf-8")

#shuffle the rows
seed = 321 #set seed for reproducibility
np.random.seed(seed)

dataset = shuffle(dataset) # In some cases, it is a good practice to shuffle the data

dataset.head()

In [None]:
#Select features and class
features = dataset.drop(['class'], axis=1)
labels = dataset[['class']]

#convert to numpy array
features = features.values
labels = labels.values

In [None]:
# Encode the labels converting them from strings to integers
le = LabelEncoder()
labels_int = le.fit_transform(labels.ravel())

In [None]:
# display the first labels
labels_int[0:10]

<img src="img/tt.png" width="100">

## k-fold cross validation

k-fold cross validation is commonly used to estimate the generalization performance of a predictive model. The dataset is divided into k subsets (folds) and k iterations are performed. In each iteration, one subset serves as the test set and the remaining subsets form the training set, so each subset is used as the test set exactly once. Stratified cross validation additionally preserves the percentage of samples of each class in every fold (this is what we will use here).

![title](img/kfold.png)
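
To make the stratification property concrete, here is a small sketch on made-up labels (80% class 0, 20% class 1). The names `y_toy`, `X_toy` and `skf_toy` are illustrative only. It checks that every test fold keeps the same class ratio and that each sample is used for testing exactly once.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels: 40 samples of class 0 and 10 samples of class 1 (made up).
y_toy = np.array([0] * 40 + [1] * 10)
X_toy = np.zeros((len(y_toy), 1))  # feature values do not affect the split

skf_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

test_fractions = []                          # share of class 1 per test fold
times_tested = np.zeros(len(y_toy), dtype=int)

for train_idx, test_idx in skf_toy.split(X_toy, y_toy):
    test_fractions.append(y_toy[test_idx].mean())
    times_tested[test_idx] += 1

print(test_fractions)
print(times_tested.min(), times_tested.max())
```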

In [None]:
#Define our cross validation strategy.
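#random_state only takes effect when shuffle=True; recent scikit-learn versions raise an error otherwise.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)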

In [None]:
#https://stackoverflow.com/questions/48687375/deprecation-error-in-sklearn-about-empty-array-without-any-empty-array-in-my-cod?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa
warnings.filterwarnings('ignore') #suppress warnings from scikit-learn

#variables to accumulate predictions in int format
acum_true_classes_int = np.empty((0,))
acum_predicted_classes_int = np.empty((0,))

#variables to accumulate predictions in string format
acum_true_classes = np.empty((0,))
acum_predicted_classes = np.empty((0,))


n_folds = skf.get_n_splits()
for i, (train_idxs, test_idxs) in enumerate(skf.split(features, labels_int), start=1):
    print("=================Fold ", i, "/", n_folds)
    clf = RandomForestClassifier(n_estimators = 100, random_state=seed)
    
    # Normalize features between 0 and 1
    # This is done within each fold; the normalization parameters are learned from the training data only
    normalizer = preprocessing.MinMaxScaler().fit(features[train_idxs])
    train_normalized = normalizer.transform(features[train_idxs])
    test_normalized = normalizer.transform(features[test_idxs])
    
    #train classifier with the training data
    clf.fit(train_normalized, labels_int[train_idxs])
    predictions = clf.predict(test_normalized)
    
    acum_true_classes_int = np.hstack((acum_true_classes_int, labels_int[test_idxs]))
    acum_predicted_classes_int = np.hstack((acum_predicted_classes_int, predictions))
    
    # convert classes back to strings
    true_classes = le.inverse_transform(labels_int[test_idxs])
    predicted_classes = le.inverse_transform(predictions)
    acum_true_classes = np.hstack((acum_true_classes, true_classes))
    acum_predicted_classes = np.hstack((acum_predicted_classes, predicted_classes))
    
    
warnings.resetwarnings()
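
The per-fold normalization done in the loop above can also be expressed with a scikit-learn pipeline. The sketch below is an alternative formulation, not the approach used in this notebook. It runs on synthetic data from `make_classification` (the names `X_demo`, `y_demo`, `pipe` and `cv` are illustrative), and `cross_val_predict` refits the scaler and the classifier on the training part of each fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic three-class data standing in for the extracted features.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# The scaler is part of the pipeline, so it is fitted on training folds only.
pipe = make_pipeline(
    MinMaxScaler(),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# One out-of-fold prediction per sample, in the original sample order.
demo_predictions = cross_val_predict(pipe, X_demo, y_demo, cv=cv)
print(demo_predictions[:10])
```

Keeping the scaler inside the pipeline makes it harder to leak test-fold statistics into training by accident.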

<img src="img/results.png" width="100">

In [None]:
# Create confusion matrix
pd.crosstab(acum_true_classes, acum_predicted_classes, rownames=['True labels'], colnames=['Predicted labels'])

### Performance metrics

P: total number of positive cases  
N: total number of negative cases  
TP: number of correctly classified positive samples  
TN: number of correctly classified negative samples  
FP: number of negative samples incorrectly classified as positive  
FN: number of positive samples incorrectly classified as negative  
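
The definitions above are stated for a binary problem. With several classes, a common convention is to treat each class in turn as "positive" and all others as "negative". The sketch below, on made-up labels, derives the four counts per class from a confusion matrix; the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels for three classes.
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 1])
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 0, 2, 2])

cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class

tp = np.diag(cm)                 # correctly predicted samples of each class
fp = cm.sum(axis=0) - tp         # predicted as the class but belonging to another
fn = cm.sum(axis=1) - tp         # belonging to the class but predicted as another
tn = cm.sum() - (tp + fp + fn)   # everything else

print(cm)
print(tp, fp, fn, tn)
```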



### Accuracy
Proportion of correctly classified instances.
$$ACC=\frac{TP + TN}{P + N}$$

In [None]:
accuracy_score(acum_true_classes_int, acum_predicted_classes_int)

### Recall (sensitivity)
The proportion of positives that are correctly identified as such.

$$RECALL=\frac{TP}{P}$$

In [None]:
#average='macro' will report the average across all classes.
#average=None will report the performance metric for each class.
recall_score(acum_true_classes_int, acum_predicted_classes_int, average='macro')

### Precision (positive predictive value)
The ability of the classifier not to label a negative sample as positive. Equivalently, it is the fraction of relevant instances among the selected ones.

$$PRECISION=\frac{TP}{TP + FP}$$

In [None]:
precision_score(acum_true_classes_int, acum_predicted_classes_int, average='macro')
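
To illustrate the difference between `average=None` and `average='macro'` mentioned earlier, the following sketch computes both on made-up labels. The macro score is simply the unweighted mean of the per-class scores. The arrays and variable names are illustrative only.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up true and predicted labels for three classes.
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 2, 2, 0, 0])

# One score per class.
recall_per_class = recall_score(y_true, y_pred, average=None)
precision_per_class = precision_score(y_true, y_pred, average=None)

# Unweighted mean across classes.
recall_macro = recall_score(y_true, y_pred, average='macro')
precision_macro = precision_score(y_true, y_pred, average='macro')

print(recall_per_class, recall_macro)
print(precision_per_class, precision_macro)
```

Because macro averaging weights every class equally, a poorly recognized minority class lowers the score as much as a poorly recognized majority class would.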


![title](img/PrecisionRecall.png)
By Walber - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=36926283

# Conclusions

- In this part of the tutorial we performed activity recognition from sensor data.  
- We learned how to use Python to explore and pre-process the data.
- We used scikit-learn to train a Random Forest classifier and achieved an acceptable performance.  

