# Exercise 5: Activity Recognition from Smartphone Sensors

## What you will learn:
* Sequence Prediction (many-to-one) with Recurrent Neural Networks. Here: LSTM
* Activity recognition based on a smartphone's acceleration sensors

## Preparation

1. Prepare yourself by studying the notebook [Bike Rental Prediction with Recurrent Neural Networks](../Lecture/06KerasLSTMbikeRentalPrediction.ipynb) and answer the following questions:

    1. How does the baseline model predict future values?
    2. What is a bidirectional LSTM?


In [4]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
import pandas as pd
import seaborn as sns
from pylab import rcParams
import matplotlib.pyplot as plt
from matplotlib import rc
from pandas.plotting import register_matplotlib_converters

%matplotlib inline
%config InlineBackend.figure_format='retina'

register_matplotlib_converters()
sns.set(style='whitegrid', palette='muted', font_scale=1.5)

rcParams['figure.figsize'] = 22, 10

RANDOM_SEED = 42

np.random.seed(RANDOM_SEED)
tf.random.set_seed(RANDOM_SEED)

In [23]:
import warnings
warnings.filterwarnings("ignore")

## Tasks
### Download and prepare data
1. Download the smartphone activity data file from the [Wireless Sensor Data Mining webpage](http://www.cis.fordham.edu/wisdm/dataset.php)
2. Decompress the downloaded archive
3. Move the file *WISDM_ar_v1.1_raw.txt* into your current directory
4. Import the data of this file into a pandas dataframe using the following code cells

The raw file has no header row, so the column names are supplied explicitly.

In [11]:
column_names = ['user_id', 'activity', 'timestamp', 'x_axis', 'y_axis', 'z_axis']
# File name as given in step 3 above
df = pd.read_csv('WISDM_ar_v1.1_raw.txt', header=None, names=column_names)

Also, each value in the last column (`z_axis`) is followed by an extra `;`, which must be removed before the column can be converted to floats.

In [13]:
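# Strip the trailing ';' and convert the column from strings to floats.
# Assigning the result back avoids relying on an in-place update of an attribute-accessed column.
df['z_axis'] = df['z_axis'].replace(to_replace=r';', value='', regex=True).astype(np.float64)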
df.shape

(1098204, 6)

In [14]:
df.dropna(axis=0, how='any', inplace=True)

In [16]:
df.head()

|   | user_id | activity | timestamp | x_axis | y_axis | z_axis |
|---|---------|----------|-----------|--------|--------|--------|
| 0 | 33 | Jogging | 49105962326000 | -0.694638 | 12.680544 | 0.503953 |
| 1 | 33 | Jogging | 49106062271000 | 5.012288 | 11.264028 | 0.953424 |
| 2 | 33 | Jogging | 49106112167000 | 4.903325 | 10.882658 | -0.081722 |
| 3 | 33 | Jogging | 49106222305000 | -0.612916 | 18.496431 | 3.023717 |
| 4 | 33 | Jogging | 49106332290000 | -1.18497 | 12.108489 | 7.205164 |


In [7]:
df.shape

(1098203, 6)

### Data understanding by visualisation
1. Visualise the distribution of all samples (rows) over the 6 activities, e.g. by applying a *seaborn countplot*.
2. Visualise the distribution of all samples (rows) over the different user-ids, e.g. by applying a *seaborn countplot*.
3. Write a function that takes two arguments: `activity` and the dataframe containing all data. The function shall generate a figure with 3 subplots (3 rows, 1 column) that plot the accelerometer values `x_axis`, `y_axis` and `z_axis` of the first 200 samples belonging to the selected `activity`. Invoke this function for the activities `Sitting` and `Jogging` and compare the two sequences. Do the accelerometer values of the two activities show different characteristics?
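
One possible shape for the plotting function of task 3 is sketched below. It is only a sketch: the function name `plot_activity`, the parameter `n_samples` and the dataframe `toy` (filled with random numbers instead of WISDM data) are assumptions made for illustration and are not part of the exercise material.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

AXIS_COLUMNS = ["x_axis", "y_axis", "z_axis"]

def plot_activity(activity, df, n_samples=200):
    """Plot the first n_samples accelerometer readings of one activity."""
    data = df[df["activity"] == activity][AXIS_COLUMNS].head(n_samples)
    fig, axes = plt.subplots(nrows=3, ncols=1, sharex=True, figsize=(12, 8))
    for ax, column in zip(axes, AXIS_COLUMNS):
        ax.plot(np.arange(len(data)), data[column].to_numpy())
        ax.set_ylabel(column)
    axes[0].set_title(activity)
    axes[-1].set_xlabel("sample index")
    return fig

# Toy stand-in for the WISDM dataframe (random values, for illustration only)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "activity": ["Sitting"] * 250 + ["Jogging"] * 250,
    "x_axis": rng.normal(size=500),
    "y_axis": rng.normal(size=500),
    "z_axis": rng.normal(size=500),
})

fig = plot_activity("Jogging", toy)
```

With the real data, the call would be `plot_activity("Sitting", df)` and `plot_activity("Jogging", df)`.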

### Split data into training- and test-data
All data of users with *user_id <= 30* shall be used for training; the data of all other users shall be used for testing. Split the dataframe accordingly into `df_train` and `df_test`, the names expected by the code cells below.
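
A split by user can be expressed with a boolean mask. The following is a minimal sketch on a hypothetical four-row dataframe `toy`; the names `train_mask`, `df_train_toy` and `df_test_toy` are chosen here for illustration only.

```python
import pandas as pd

# Hypothetical miniature dataframe with the same user_id column as the WISDM data
toy = pd.DataFrame({
    "user_id": [1, 30, 31, 36],
    "x_axis": [0.1, 0.2, 0.3, 0.4],
})

train_mask = toy["user_id"] <= 30
df_train_toy = toy[train_mask].copy()   # users 1..30
df_test_toy = toy[~train_mask].copy()   # all remaining users

print(df_train_toy["user_id"].tolist(), df_test_toy["user_id"].tolist())
```

Splitting by user rather than by row keeps all recordings of a test user out of the training set.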

### Scaling of Accelerometer Values
Apply the [Scikit-Learn Robust Scaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html) to scale the columns `x_axis`, `y_axis` and `z_axis`. Fit the scaler on the training data only, then apply the fitted scaler to both the training and the test data.
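
The fit-on-train, transform-both pattern might look as follows. This is a sketch on random toy data; `toy_train`, `toy_test` and `axis_columns` are illustrative names, and the random values merely stand in for the accelerometer columns.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

axis_columns = ["x_axis", "y_axis", "z_axis"]

# Random stand-ins for the training and test dataframes
rng = np.random.default_rng(42)
toy_train = pd.DataFrame(rng.normal(loc=5.0, scale=3.0, size=(101, 3)), columns=axis_columns)
toy_test = pd.DataFrame(rng.normal(loc=5.0, scale=3.0, size=(20, 3)), columns=axis_columns)

# The statistics (median and interquartile range) come from the training data only
scaler = RobustScaler().fit(toy_train[axis_columns])

# The same fitted scaler is applied to both partitions
toy_train[axis_columns] = scaler.transform(toy_train[axis_columns])
toy_test[axis_columns] = scaler.transform(toy_test[axis_columns])

print(toy_train[axis_columns].median())
```

After scaling, each training column has a median of 0; the test columns generally do not, because they are shifted by the training statistics.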

### Prepare Training- and Test-data
The following code cell prepares the training and test data such that they can be passed to an LSTM. Study this code and explain what it does.

In [26]:
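def create_dataset(X, y, time_steps=1, step=1):
    """Cut X into overlapping windows and label each window with its most frequent activity."""
    Xs, ys = [], []
    for i in range(0, len(X) - time_steps, step):
        v = X.iloc[i:(i + time_steps)].values
        labels = y.iloc[i:(i + time_steps)]
        Xs.append(v)
        # Series.mode() works on string labels; scipy.stats.mode no longer
        # accepts non-numeric data in recent SciPy versions.
        ys.append(labels.mode().iloc[0])
    return np.array(Xs), np.array(ys).reshape(-1, 1)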

TIME_STEPS = 200
STEP = 40

X_train, y_train = create_dataset(
    df_train[['x_axis', 'y_axis', 'z_axis']], 
    df_train.activity, 
    TIME_STEPS, 
    STEP
)

X_test, y_test = create_dataset(
    df_test[['x_axis', 'y_axis', 'z_axis']], 
    df_test.activity, 
    TIME_STEPS, 
    STEP
)

In [27]:
print(X_train.shape, y_train.shape)

(22454, 200, 3) (22454, 1)
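
The first dimension of these shapes is the number of windows, which follows directly from the `range(0, len(X) - time_steps, step)` loop in `create_dataset`. The sketch below replays that loop on a toy label sequence; `window_starts` is a helper name introduced here for illustration, and the majority vote uses `collections.Counter` instead of the code above.

```python
from collections import Counter

def window_starts(n_rows, time_steps, step):
    """Start indices of the windows produced by the loop in create_dataset."""
    return list(range(0, n_rows - time_steps, step))

# Toy sequence: 10 rows, windows of 4 rows, shifted by 2 rows
labels = ["Walking"] * 5 + ["Jogging"] * 5
starts = window_starts(len(labels), time_steps=4, step=2)

majorities = []
for s in starts:
    window = labels[s:s + 4]
    majority = Counter(window).most_common(1)[0][0]  # most frequent label in the window
    majorities.append(majority)
    print(s, window, "->", majority)

# Number of windows for 1000 rows with TIME_STEPS = 200 and STEP = 40
n_full = len(window_starts(1000, 200, 40))
print(n_full)
```

Note that the toy sequence yields the start indices 0, 2 and 4 only: the window starting at row 6 would still fit into the 10 rows, but the exclusive upper bound of `range` leaves it out.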


In [28]:
from sklearn.preprocessing import OneHotEncoder

# scikit-learn >= 1.2 names this parameter `sparse_output`; older versions use `sparse`
enc = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

enc = enc.fit(y_train)

y_train = enc.transform(y_train)
y_test = enc.transform(y_test)

In [29]:
print(X_train.shape, y_train.shape)

(22454, 200, 3) (22454, 6)
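
The column order of the one-hot vectors is stored in the encoder's `categories_` attribute, which is what allows predicted class probabilities to be mapped back to activity names, for example when building a confusion matrix. The following sketch uses a toy label array and made-up probabilities (`probs`); neither comes from the exercise data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import OneHotEncoder

# Toy labels, shaped (n_samples, 1) like y_train above
labels = np.array(["Walking", "Jogging", "Sitting", "Jogging"]).reshape(-1, 1)

enc_toy = OneHotEncoder(handle_unknown="ignore").fit(labels)
onehot = enc_toy.transform(labels).toarray()   # dense array of shape (4, 3)
classes = enc_toy.categories_[0]               # column order of the one-hot vectors

# Decoding: the position of the 1 (or of the largest probability) selects the class
decoded = classes[onehot.argmax(axis=1)]

# Made-up class probabilities, standing in for the output of a softmax layer
probs = np.array([
    [0.1, 0.2, 0.7],
    [0.8, 0.1, 0.1],
    [0.2, 0.5, 0.3],
    [0.3, 0.1, 0.6],
])
predicted = classes[probs.argmax(axis=1)]

# Rows: true class, columns: predicted class, both ordered as in `classes`
cm = confusion_matrix(labels.ravel(), predicted, labels=classes)
print(classes)
print(cm)
```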


### Define LSTM, Train and Evaluate it
1. Define a neural network for this classification problem, consisting of
    * a bidirectional LSTM,
    * one hidden Dense layer,
    * one output layer (Dense layer).
2. For training, apply
    * the `categorical_crossentropy` loss function,
    * the `adam` optimizer,
    * the `accuracy` metric.
3. Train the model and optimize its hyperparameters such that the accuracy on the test data is at least 80%. Plot the training and test accuracy over the training epochs.
4. For your best network configuration, calculate the model's predictions for all test data and plot the corresponding confusion matrix. Discuss the result, in particular the confusion matrix.
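
As a starting point for tasks 1 and 2, the layer stack could be set up roughly as follows. This is a sketch, not a tuned solution: the 64 LSTM units, the 64 hidden units and the `relu` activation are assumed starting values, and reaching the required test accuracy will likely need further experimentation.

```python
from tensorflow import keras

# Window length, number of sensor axes and number of activities, as in the cells above
TIME_STEPS, N_FEATURES, N_CLASSES = 200, 3, 6

model = keras.Sequential([
    keras.Input(shape=(TIME_STEPS, N_FEATURES)),
    # Bidirectional wrapper: one LSTM reads the window forwards, one backwards
    keras.layers.Bidirectional(keras.layers.LSTM(64)),   # 64 units: assumed starting value
    keras.layers.Dense(64, activation="relu"),           # hidden layer, size assumed
    keras.layers.Dense(N_CLASSES, activation="softmax"), # one probability per activity
])

model.compile(
    loss="categorical_crossentropy",
    optimizer="adam",
    metrics=["accuracy"],
)
model.summary()
```

When the model is trained with `model.fit(..., validation_data=(X_test, y_test))`, the returned history object holds the per-epoch values under the keys `accuracy` and `val_accuracy`, which can be used for the plot in task 3.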