# Machine Learning on MCU - LAB1 ex2
Predicting Human Activity from Smartphone Accelerometer and Gyroscope Data

We use a dataset of tri-axial accelerometer and gyroscope signals, collected in an academic experiment and hosted on the UC Irvine Machine Learning Repository.

The dataset can be downloaded here: https://archive.ics.uci.edu/ml/datasets/Smartphone-Based+Recognition+of+Human+Activities+and+Postural+Transitions#
where you can also find the description of the dataset.

In [1]:
# If required, download the dataset
import requests
import os.path
import zipfile
if not os.path.isdir('./HAPT Data Set'):
    response = requests.get(
        "https://archive.ics.uci.edu/ml/machine-learning-databases/00341/HAPT%20Data%20Set.zip",
        allow_redirects=True)
    response.raise_for_status()  # stop early if the download failed
    with open('./HAPT Data Set.zip', 'wb') as f:
        f.write(response.content)
    with zipfile.ZipFile('./HAPT Data Set.zip', 'r') as archive:
        archive.extractall('./HAPT Data Set')

In [None]:
from __future__ import division
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import random
random.seed(7)

# display pandas results to 3 decimal points, not in scientific notation
# pd.set_option('display.float_format', lambda x: '%.3f' % x)

Load the accelerometer and gyroscope data.
We read the feature names from features.txt and the activity labels from activity_labels.txt.

In [None]:
with open('./HAPT Data Set/features.txt') as f:
    features = f.read().split()

print('There are {} features.'.format(len(features)))
    
with open('./HAPT Data Set/activity_labels.txt') as f:
    activity_labels = f.readlines()

activity_df = [x.split() for x in activity_labels]
print('There are {} activities.'.format(len(activity_df)))
pd.DataFrame(activity_df, columns = ['Activity_id', 'Activity_label'])

The data are pre-split into training and test sets. Let's load the features x and the labels y, and have a look at a few features.

In [None]:
X_train = pd.read_table('./HAPT Data Set/Train/X_train.txt',
             header = None, sep = " ", names = list(dict.fromkeys(features)))
X_train.iloc[:10, :10].head()

In [None]:
y_train = pd.read_table('./HAPT Data Set/Train/y_train.txt',
             header = None, sep = " ", names = ['Activity_id'])
y_train.head()

In [None]:
X_test = pd.read_table('./HAPT Data Set/Test/X_test.txt',
             header = None, sep = " ", names = list(dict.fromkeys(features)))
y_test = pd.read_table('./HAPT Data Set/Test/y_test.txt',
             header = None, sep = " ", names = ['Activity_id'])

# Human activity classification
Now that we’ve loaded the train and test data into memory, we can start building a model to predict the activity from the features. We use the Python library scikit-learn.

The test set must not be used until the very last step of testing (as the name suggests). So we split the given training set into training and validation sets. The training set is used for training and the validation set is used for validating the models (identifying the best model, tuning the hyperparameters).

In this exercise we'll use 5-fold cross-validation (CV): the data are partitioned into 5 equal-sized sub-datasets, of which 1 is held out as validation data and the remaining 4 are used for training. The whole process is repeated 5 times, with each of the 5 sub-datasets used exactly once as the validation data. The results can then be averaged to produce a single estimate.
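
As a minimal sketch of what 5-fold CV does under the hood, the loop below runs the folds by hand. It uses synthetic data from `make_classification` as a stand-in (an assumption, not the HAPT features), and the names `X_demo`, `y_demo`, and `fold_scores` are purely illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.svm import LinearSVC

# Synthetic stand-in data (assumption: NOT the HAPT features)
X_demo, y_demo = make_classification(n_samples=200, n_features=10,
                                     random_state=7)

# Randomly partition the samples into 5 folds
kfold = KFold(n_splits=5, shuffle=True, random_state=7)

fold_scores = []
for train_idx, val_idx in kfold.split(X_demo):
    # Train on 4 folds, validate on the remaining one
    clf = LinearSVC(random_state=7)
    clf.fit(X_demo[train_idx], y_demo[train_idx])
    fold_scores.append(clf.score(X_demo[val_idx], y_demo[val_idx]))

# Average the 5 validation accuracies into a single estimate
print("Fold accuracies:", np.round(fold_scores, 3))
print("Mean CV accuracy: {:.3f}".format(np.mean(fold_scores)))
```

Note that when an integer such as `cv = 5` is passed together with a classifier, scikit-learn uses stratified folds without shuffling by default, so the partitioning differs slightly from this hand-written loop.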

Let's start with a linear support vector classifier (LinearSVC).

In [None]:
from sklearn.svm import LinearSVC
from sklearn.model_selection import validation_curve

# Declare the hyper-parameter
C_params = np.logspace(-6, 3, 10)

# Declare the classifier
clf_svc = LinearSVC(random_state = 7)

# Compute training and validation scores for varying parameter values
train_scores, val_scores = validation_curve(
    clf_svc, X_train.values, y_train.values.flatten(),
    param_name = "C", param_range = C_params,
    cv = 5, scoring = "accuracy", n_jobs = -1)

The training scores and the validation scores are saved. Now we can plot the validation curves with respect to the parameter C. Let's first calculate the means and standard deviations of the training and validation scores across the folds, and then plot the training and validation accuracy against the parameter values.

In [None]:
def plot_accuracy(train_scores, val_scores, C_params):
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    val_scores_mean = np.mean(val_scores, axis=1)
    val_scores_std = np.std(val_scores, axis=1)

    # To better visualize the plot, we set the y-axis limits
    y_min = 0.5
    y_max = 1.0

    f = plt.figure(figsize = (12, 8))
    ax = plt.axes()
    plt.title("SVM Training and Validation Accuracy")
    plt.xlabel("C Value")
    plt.ylabel("Accuracy")
    plt.ylim(y_min, y_max)
    plt.yticks(np.arange(y_min, y_max + .01, .05))
    plt.semilogx(C_params, train_scores_mean, label = "Training Accuracy", color = "red")
    plt.fill_between(C_params, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha = 0.2, color = "red")
    plt.semilogx(C_params, val_scores_mean, label = "Validation Accuracy",
                 color = "green")
    plt.fill_between(C_params, val_scores_mean - val_scores_std,
                     val_scores_mean + val_scores_std, alpha = 0.2, color = "green")
    plt.legend(loc = "best")

    plt.show()
    
plot_accuracy(train_scores, val_scores, C_params)

From the graph, the best value of C appears to be 0.1. Beyond that value the validation accuracy slowly decreases, which indicates that the model is starting to overfit.

In [None]:
idx = np.flatnonzero(np.isclose(C_params, 0.1))[0]  # avoid exact float comparison
print("The mean validation score is {:.4f}".format(np.mean(val_scores[idx])))
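
Instead of reading the best C off the plot, it can also be picked programmatically. The sketch below assumes scores shaped like the output of `validation_curve` (one row per parameter value, one column per fold). The helper `best_param` and the small score table are hypothetical and for illustration only; they are not results from the HAPT data.

```python
import numpy as np

def best_param(val_scores, param_range):
    """Return the parameter with the highest mean validation score."""
    mean_scores = np.mean(val_scores, axis=1)  # average over the folds
    best_idx = int(np.argmax(mean_scores))
    return param_range[best_idx], mean_scores[best_idx]

# Illustrative numbers only (NOT measured on the HAPT data):
# 3 parameter values, 2 folds each
demo_params = np.array([0.01, 0.1, 1.0])
demo_scores = np.array([[0.80, 0.82],
                        [0.90, 0.92],
                        [0.88, 0.90]])

demo_C, demo_score = best_param(demo_scores, demo_params)
print("Best C = {}, mean validation accuracy = {:.2f}".format(demo_C, demo_score))
```

In the notebook, the same helper could be called as `best_param(val_scores, C_params)`.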

So far we have included all the features. Would a subset of features perform better? Let's find out.

One way would be to select particular features manually. Let's look at the feature list (you can also open features.txt in a text editor):

In [None]:
features

In [None]:
features[0:40]

The first 40 features are derived from the time-domain 3-axial accelerometer signals (for more details read features_info.txt). Let's repeat the validation curve using only these 40 features.

In [None]:
X_train_new = X_train.iloc[:,0:40]
X_test_new = X_test.iloc[:,0:40]

# Compute training and validation scores for varying parameter values
train_scores, val_scores = validation_curve(
    clf_svc, X_train_new.values, y_train.values.flatten(),
    param_name = "C", param_range = C_params,
    cv = 5, scoring = "accuracy", n_jobs = -1)

plot_accuracy(train_scores, val_scores, C_params)

The classifier performs worse (as expected, since the feature set is not well selected).

Manually selecting subsets of features, and combinations of those subsets, can be extremely cumbersome. Let's use a more advanced function in scikit-learn.

In [None]:
from sklearn.feature_selection import SelectKBest

feature_selector = SelectKBest(k=20).fit(X_train.values, y_train.values.flatten())
X_train_new = feature_selector.transform(X_train.values)
X_test_new = feature_selector.transform(X_test.values)
# We use .values because X_train is a pandas DataFrame
# The output of .transform is a NumPy array, so we don't need .values anymore in validation_curve

In [None]:
# Compute training and validation scores for varying parameter values
train_scores, val_scores = validation_curve(
    clf_svc, X_train_new, y_train.values.flatten(),
    param_name = "C", param_range = C_params,
    cv = 5, scoring = "accuracy", n_jobs = -1)

plot_accuracy(train_scores, val_scores, C_params)

We can see that the 20 best features are not enough: the performance is worse than with all the features. Here "best" is measured by the ANOVA F-value (f_classif), which is the default score function in SelectKBest, since we didn't specify any score_func.
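
To see which columns SelectKBest actually keeps, the fitted selector can be inspected. This is a small sketch on synthetic stand-in data (an assumption, not the HAPT features) with `score_func` written out explicitly; `X_demo`, `y_demo`, and `selected_idx` are illustrative names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in data (assumption: NOT the HAPT features)
X_demo, y_demo = make_classification(n_samples=200, n_features=12,
                                     n_informative=4, random_state=7)

# Same selector as in the notebook, with the default score function made explicit
selector = SelectKBest(score_func=f_classif, k=4).fit(X_demo, y_demo)

# Column indices of the k features with the highest ANOVA F-values
selected_idx = selector.get_support(indices=True)
print("Selected column indices:", selected_idx)
print("Their F-values:", np.round(selector.scores_[selected_idx], 1))
```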

(Optional) Try with higher k to see if there is a subset of features which can perform better than including all the features.

The parameter k in SelectKBest is another hyperparameter that should be tuned. The documentation of the function also lists the score functions that can be used. As you can see, the options are broad. Due to time limitations, we don't investigate feature selection further during the lab session. In the following steps of this exercise we will use the feature matrices X_train and X_test with all the features.
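
For readers who want to tune k anyway, one possible approach is to put the selector and the classifier into a Pipeline and search over k with GridSearchCV. The sketch below uses synthetic stand-in data and an arbitrary list of candidate k values (both assumptions); the step names `select` and `svc` are also just illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Synthetic stand-in data (assumption: NOT the HAPT features)
X_demo, y_demo = make_classification(n_samples=300, n_features=30,
                                     n_informative=8, random_state=7)

# Chain feature selection and classification into one estimator
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),
    ("svc", LinearSVC(random_state=7)),
])

# Candidate values for k (arbitrary choices for this sketch)
param_grid = {"select__k": [5, 10, 20, 30]}

search_k = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search_k.fit(X_demo, y_demo)

print("Best k:", search_k.best_params_["select__k"])
print("Best CV accuracy: {:.3f}".format(search_k.best_score_))
```

Because the selector is part of the pipeline, it is refit on the training folds only in each CV split, so the validation fold does not influence which features are chosen.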

The performance of the linear SVM is already quite good, but we can investigate further and do a grid search over different kernels and C values for the SVM model. With a larger search space, we might get a different set of optimal parameters.

Use the general Support Vector Classifier (SVC) in scikit-learn and try non-linear kernels. Use GridSearchCV in model_selection to search for the best parameters. (Read the documentation pages of SVC and GridSearchCV on scikit-learn, where you can also find example code.)
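
As a starting point, here is a rough sketch of the GridSearchCV pattern on a small synthetic dataset (an assumption, not the HAPT data). The grid values are arbitrary examples rather than recommended settings, and the variable names are illustrative; adapt them for the exercise below.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic 3-class stand-in data (assumption: NOT the HAPT features)
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=5, n_classes=3,
                                     random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=7, stratify=y_demo)

# One dictionary per kernel, so gamma is only searched for the RBF kernel
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": ["scale", 0.01]},
]

# 5-fold CV over every parameter combination (3 + 6 = 9 candidates)
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Best CV accuracy: {:.3f}".format(search.best_score_))
# By default the best model is refit on all of X_tr, so it can score new data
print("Held-out accuracy: {:.3f}".format(search.score(X_te, y_te)))
```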

In [None]:
# import the functions SVC and GridSearchCV


# Declare the parameters
# (Pay attention to the number of parameters you declare: more points in the grid search mean longer training time, which could exceed half an hour...)


# Declare the classifier (estimator) to be used in GridSearchCV


# Declare the classifier using GridSearchCV


In [None]:
# Fit the classifier and find the best parameters using GridSearchCV




In [None]:
# Which one is the best model and the best parameter? What's the best CV score you got?
# (Hint: read the list of Attributes in the documentation page of GridSearchCV on scikit-learn)


# Predict on test data using the best model



Now you have a general understanding of training, validating, and testing using the Python library scikit-learn. The following points are optional, in case you want to explore more of the scikit-learn library.

In [None]:
# Now try other classifiers, for example Decision Trees, Nearest Neighbors, or Ensemble methods. Can you get better results?

# After identifying the best model with the optimal set of parameters, test your model using the test set. 

# What's your test accuracy? Are you satisfied? How complex is your model to be deployed on a microcontroller?
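
One rough way to reason about model complexity for a microcontroller is to count the parameters that must be stored. The sketch below does this for a LinearSVC trained on synthetic stand-in data, assuming the weights would be stored as 32-bit floats (an assumption; quantized formats would be smaller). All names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic 3-class stand-in data (assumption: NOT the HAPT features)
X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     n_informative=6, n_classes=3,
                                     random_state=7)
clf = LinearSVC(random_state=7).fit(X_demo, y_demo)

# A linear model stores one weight per (class, feature) plus one bias per class
n_params = clf.coef_.size + clf.intercept_.size

# Assumption: parameters are stored as 32-bit floats on the device
bytes_float32 = n_params * np.dtype(np.float32).itemsize

print("Parameters: {}".format(n_params))
print("Approximate storage: {} bytes".format(bytes_float32))
```

For a kernel SVC the picture is different: the support vectors themselves have to be stored, so the footprint grows with their number rather than only with the number of features and classes.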