# Computational Exercise 11: Human Activity Recognition

**Please note that this assignment may optionally be completed in groups of two students.**

---
In this exercise, we'll be working with the [Human Activity Recognition Using Smartphones Data Set](https://archive.ics.uci.edu/ml/datasets/human+activity+recognition+using+smartphones), which is exactly what it sounds like. As described on the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php) site:
- Experiments were carried out with a group of 30 volunteers within an age bracket of 19-48 years
- Each person performed six activities (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING) wearing a smartphone (Samsung Galaxy S II) on the waist
- 3-axial linear acceleration and 3-axial angular velocity were captured at a constant rate of 50Hz with the phone's embedded accelerometer and gyroscope
- The obtained dataset has been randomly partitioned into two sets, where 70% of the volunteers were selected for generating the training data and 30% the test data

For each record, we have:
- Triaxial (i.e. x-, y-, and z-direction) acceleration from the accelerometer (total acceleration) and the estimated body acceleration
- Triaxial Angular velocity from the gyroscope
- A label indicating the corresponding activity
- There is also a 561-feature vector with extracted time and frequency domain variables, but we will not be using it in this exercise

Goals are as follows:

- Describe and visualize the dataset
- Extract summary statistics for each record
- Predict activities using the extracted summary statistics
- Predict activities by applying a simple RNN to the raw data (i.e. without extracting summary statistics)

We'll begin by importing the usual libraries including `tensorflow`, which we'll use to create and train the RNN.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
import os

## Download the dataset

First, we'll need to download the dataset. The block below will download it into a folder named 'HAR' in your current working directory. By default, this is the directory where this notebook is located.

In [2]:
from io import BytesIO
from urllib.request import urlopen
from zipfile import ZipFile

zipurl = "https://archive.ics.uci.edu/ml/machine-learning-databases/00240/UCI HAR Dataset.zip".replace(" ", "%20")

with urlopen(zipurl) as zipresp:
    with ZipFile(BytesIO(zipresp.read())) as zfile:
        zfile.extractall('HAR')

## Load the dataset

We can now load each of the 9 inertial signals described above along with the corresponding activity labels. The details of this block reflect the organization of this particular dataset and are less important than describing and visualizing the dataset, which you'll do in exercise 11.1 below. We'll also go ahead and standardize the data in this block, which will be important when training our RNN.

In [3]:
### LIST THE 9 AVAILABLE INERTIAL SIGNALS ###
### THESE CORRESPOND TO SPECIFIC FILES WE'LL LOAD FROM THE DATA DIRECTORY ###

inertial_signals = [
    'body_acc_x',
    'body_acc_y',
    'body_acc_z',
    'total_acc_x',
    'total_acc_y',
    'total_acc_z',
    'body_gyro_x',
    'body_gyro_y',
    'body_gyro_z'
]

### LOAD THE DATA ###

x_train = np.concatenate([
    pd.read_table(
        os.path.join(
            'HAR/UCI HAR Dataset/train/Inertial Signals',
            sig + '_train.txt'),
        header=None, sep=r'\s+').values[:, :, np.newaxis]
    for sig in inertial_signals], axis=2)

x_test = np.concatenate([
    pd.read_table(
        os.path.join(
            'HAR/UCI HAR Dataset/test/Inertial Signals',
            sig + '_test.txt'),
        header=None, sep=r'\s+').values[:, :, np.newaxis]
    for sig in inertial_signals], axis=2)

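# The label files use 1-6, so subtract 1 to get 0-based class indices.
# (The squeeze=True argument was removed in pandas 2.0; DataFrame.squeeze replaces it.)
y_train = pd.read_table(
    'HAR/UCI HAR Dataset/train/y_train.txt',
    header=None
).squeeze('columns').values - 1

y_test = pd.read_table(
    'HAR/UCI HAR Dataset/test/y_test.txt',
    header=None
).squeeze('columns').values - 1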

### STANDARDIZE THE DATA ###

x_mean = np.mean(x_train, axis=(0, 1))
x_std = np.std(x_train, axis=(0, 1))

x_train = (x_train - x_mean) / x_std
x_test = (x_test - x_mean) / x_std
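
To see what the `axis=(0, 1)` reduction in the block above does, here is a minimal sketch on a synthetic array. The shapes and random values are made up for illustration and are not taken from the HAR files. Averaging over examples and time steps leaves one mean and one standard deviation per signal, and broadcasting then applies them along the last axis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 5 examples, 4 time steps, 3 signals,
# where each signal has a different offset and scale.
toy = rng.normal(loc=[0.0, 10.0, -5.0], scale=[1.0, 2.0, 0.5], size=(5, 4, 3))

# Reduce over examples (axis 0) and time (axis 1): one statistic per signal.
toy_mean = np.mean(toy, axis=(0, 1))
toy_std = np.std(toy, axis=(0, 1))
print(toy_mean.shape)  # (3,)

# Broadcasting matches the trailing axis, so each signal is scaled separately.
toy_standardized = (toy - toy_mean) / toy_std
print(toy_standardized.shape)  # (5, 4, 3)
```

Note that the block above standardizes the test set with the training mean and standard deviation, so no statistics are computed from the test data.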

## Exercise 11.1: Exploring the data

In as many blocks as needed, explore the dataset, including:
- Determine the shape of `x_train`, `x_test`, `y_train`, and `y_test`
- Count the number of each type of label in `y_train` and `y_test`
- Plot (using `plt.plot`) at least one inertial signal over time for at least one example. To do this, note that in `x_train`, the different examples are stacked along axis 0, time varies along axis 1, and the inertial signal varies along axis 2 (e.g. rotation and acceleration in each direction)
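
As a hint for the axis layout described above, here is a small sketch on a made-up array (the sizes and labels are hypothetical, not the real dataset). It shows how to pull out one signal over time for one example and how to count labels.

```python
import numpy as np

# Hypothetical toy array: 4 examples, 6 time steps, 2 signals.
toy_x = np.arange(4 * 6 * 2).reshape(4, 6, 2)
toy_y = np.array([0, 2, 2, 1])  # made-up integer labels

# One signal over time for one example: fix axis 0 and axis 2, keep axis 1.
series = toy_x[1, :, 0]
print(series.shape)  # (6,)

# Count how often each label occurs.
labels, counts = np.unique(toy_y, return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))  # {0: 1, 1: 1, 2: 2}
```

A 1-D slice like `series` can be passed directly to `plt.plot`.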

In [4]:
### DETERMINE THE SHAPE OF THE DATA ###



In [5]:
### COUNT AND/OR PLOT THE NUMBER OF EACH LABEL (i.e. activity) ###



In [6]:
### PLOT AT LEAST ONE INERTIAL SIGNAL OVER TIME FOR AT LEAST ONE EXAMPLE ###



## Exercise 11.2: Extract features

As we've discussed in class, the simplest way to build a predictive model from repeated measures data is to calculate a limited set of summary statistics (e.g. maximum, minimum, mean) for each measure. This gives you a fixed-length vector of length $M\times L$, where $M$ is the number of repeated measures, and $L$ is the number of summary statistics. The block below shows how to construct this vector using just two summary statistics, the max and min.

In this block, you should:
- Select at least two additional summary statistics to calculate
- Following the example below, calculate them for each of your 9 inertial signals and stack all of the summary statistics together
- Inspect the shape of the resulting features, noting that the first dimension (axis 0) should be unchanged compared to `x_train` and `x_test` from exercise 11.1 above
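
The $M\times L$ bookkeeping can be checked on a toy array first. This is only a sketch with made-up sizes ($N=5$ examples, $M=3$ signals, $L=2$ statistics), not the real data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: N=5 examples, T=8 time steps, M=3 signals.
toy = rng.normal(size=(5, 8, 3))

# L=2 summary statistics, each aggregated over time (axis 1) -> shape (N, M).
stats = [toy.max(axis=1), toy.min(axis=1)]

# Stack along axis 1 to get one row of M * L features per example.
toy_features = np.concatenate(stats, axis=1)
print(toy_features.shape)  # (5, 6)
```

The columns are ordered statistic by statistic: the first $M$ columns hold the max of each signal and the next $M$ hold the min.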

In [7]:
x_train_features = np.concatenate(
    [
        x_train.max(axis=1),
        x_train.min(axis=1),
        ### ADD AT LEAST TWO MORE FEATURES HERE. MAKE SURE YOU AGGREGATE OVER AXIS 1 ###
    ],
    axis=1
)

x_test_features = np.concatenate(
    [
        x_test.max(axis=1),
        x_test.min(axis=1),
        ### ADD THE SAME FEATURES AS YOU DID FOR THE TRAINING SET ###
    ],
    axis=1
)

## Exercise 11.3: Logistic regression

Now, use this feature vector to build a model that predicts the activity labels. You may use any model you like. It may be instructive to see how choosing different summary statistics in exercise 11.2 affects performance.
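
Whatever model you choose, evaluation comes down to comparing predicted labels with true labels. Here is a minimal sketch using made-up label arrays (not output from any real model) that computes overall accuracy and a confusion matrix with NumPy alone:

```python
import numpy as np

# Made-up labels for illustration: 3 classes, 8 examples.
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2, 2, 0])

# Fraction of examples predicted correctly.
accuracy = np.mean(y_true == y_pred)
print(accuracy)  # 0.625

# Confusion matrix: rows are true labels, columns are predicted labels.
n_classes = 3
confusion = np.zeros((n_classes, n_classes), dtype=int)
np.add.at(confusion, (y_true, y_pred), 1)
print(confusion)
```

The off-diagonal entries show which activities get mistaken for which. scikit-learn's `accuracy_score` and `confusion_matrix` compute the same quantities if you prefer not to write them by hand.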

In [8]:
### TRAIN A CLASSIFIER OF YOUR CHOICE (e.g. LogisticRegression) ON THE TRAINING SET (i.e. x_train_features) ###


### EVALUATE ITS PERFORMANCE ON THE TEST SET (i.e. x_test_features) ###



## Exercise 11.4: LSTM

Finally, let's see if we can improve performance using a recurrent neural network -- specifically, an LSTM. This code is almost identical to code you've seen before, and includes:
- Loading specific layers we'll need from tensorflow, including a special LSTM layer
- Defining our model -- in this case, a single LSTM block followed by a linear prediction layer -- and creating an instance of it
- Defining our loss, which is the usual cross-entropy loss, as well as our optimizer
- Converting our dataset to tensorflow tensors

We'll then train for 10 epochs, evaluating accuracy on the training set and test set in each epoch.

The code in these blocks is complete, and does not need to be modified before running the blocks. However, it may be interesting to see how changes to the model -- including the number of hidden units in the LSTM, which is currently set to 36 -- affect performance.
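
Because the final `Dense` layer outputs raw logits rather than probabilities, the loss is created with `from_logits=True`. As a rough illustration of what that loss computes, here is a NumPy sketch, assuming the default behaviour of averaging over the batch. The helper name `sparse_cross_entropy_from_logits` and the toy values are made up for this example and are not part of tensorflow.

```python
import numpy as np

def sparse_cross_entropy_from_logits(labels, logits):
    """Mean cross-entropy for integer labels and unnormalized logits."""
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability assigned to the true class of each example.
    picked = log_probs[np.arange(len(labels)), labels]
    return -picked.mean()

# Made-up batch: 2 examples, 6 classes (one logit per activity).
toy_logits = np.array([[2.0, 0.0, 0.0, 0.0, 0.0, 0.0],
                       [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]])
toy_labels = np.array([0, 3])

loss = sparse_cross_entropy_from_logits(toy_labels, toy_logits)
print(loss)
```

The second example has uniform logits, so its loss is $\log 6$, the value a model would score by guessing equally among the six activities.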

In [9]:
from tensorflow.keras.layers import Dense, LSTM
from tensorflow.keras import Model

class MyLSTM(Model):
    def __init__(self):
        super(MyLSTM, self).__init__()
        self.lstm = LSTM(36)
        self.fc = Dense(6)

    def call(self, x):
        x = self.lstm(x)
        return self.fc(x)
    
model = MyLSTM()

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True) # multi-class cross-entropy loss
optimizer = tf.keras.optimizers.Adam() # modified stochastic gradient descent optimizer

# create tensorflow datasets
train_ds = tf.data.Dataset.from_tensor_slices(
    (x_train.astype('float32'), y_train)).batch(32)

test_ds = tf.data.Dataset.from_tensor_slices(
    (x_test.astype('float32'), y_test)).batch(32)

In [None]:
EPOCHS = 10

for epoch in range(EPOCHS):

    train_accuracy = []
    test_accuracy = []
  
    for i, (x, y) in enumerate(train_ds):
        
        print('Running training batch %i of %i' % (i + 1, len(train_ds)), end='\r')
    
        with tf.GradientTape() as tape:
            predicted_logits = model(x)
            loss = loss_object(y, predicted_logits)

        y_pred = np.argmax(predicted_logits, axis=1)
        batch_accuracy = np.mean(y_pred == y)
        train_accuracy.append(batch_accuracy)
    
        gradients = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    for i, (x, y) in enumerate(test_ds):
        
        print('Running test batch %i of %i        ' % (i + 1, len(test_ds)), end='\r')
    
        predicted_logits = model(x)
        y_pred = np.argmax(predicted_logits, axis=1)
    
        batch_accuracy = np.mean(y_pred == y)
        test_accuracy.append(batch_accuracy)

    train_accuracy = 100 * np.mean(train_accuracy)
    test_accuracy = 100 * np.mean(test_accuracy)
        
    print('Epoch %i: train accuracy = %.1f%%, test accuracy = %.1f%%' % (
        epoch + 1, train_accuracy, test_accuracy))

### Once you've completed these exercises, please turn in the assignment as follows:

If you're using Anaconda on your local machine:
- download your notebook as html (see File > Download as > HTML (.html))
- .zip the file (i.e. place it in a .zip archive)
- submit the .zip file in Talent LMS

If you're using Google Colab:
- download your notebook as .ipynb (see File > Download > Download .ipynb)
- if you have nbconvert installed, convert it to .html; if not, leave it as .ipynb
- .zip the file (i.e. place it in a .zip archive)
- submit the .zip file in Talent LMS