In [13]:
import pandas as pd
import numpy as np
import os
import random
from IPython.display import Image
from sklearn.preprocessing import MinMaxScaler

## 2. Preprocessing
This notebook will teach you how to preprocess a sensor-based Human Activity Recognition dataset.

Let's read in the data.

In [14]:
data_folder = 'data'
# Note: `dataset` is an absolute path, so os.path.join below ignores `data_folder`.
dataset = '/Users/ahoelzemann/Documents/git/dl-for-har/data/rwhar_3sbjs_data.csv'
data = pd.read_csv(os.path.join(data_folder, dataset),
                   names=['subject_id', 'acc_x', 'acc_y', 'acc_z', 'activity_label'])

#### 1.1 Cleaning

##### 1.1.1 Sensor Orientation

Whenever we are working with a multimodal dataset, that is, a dataset that consists of data from different sensors,
we need to make sure that the sensor orientations match each other.



Depending on the circumstances, we want to clean the data before we train our classifier.

It is very important to double-check whether the dataset contains **NaN values**. If it does,
make sure that the missing values are interpolated, since we want to keep the original sampling rate.
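As a minimal sketch of this check, the following cell counts the missing values per column and fills them by linear interpolation. The toy frame and its values are made up for illustration, and linear interpolation is only one possible choice, not necessarily the one used for this dataset.

```python
import numpy as np
import pandas as pd

# Toy accelerometer frame with gaps; the values are made up for illustration.
toy = pd.DataFrame({
    "acc_x": [0.0, np.nan, 2.0, 3.0],
    "acc_y": [1.0, 1.0, np.nan, 4.0],
    "acc_z": [9.8, 9.8, 9.8, np.nan],
})

# Count the missing values per column before deciding how to handle them.
nan_counts = toy.isna().sum()

# Linear interpolation fills the gaps without dropping rows, so the number
# of samples (and therefore the sampling rate) stays the same.
# limit_direction="both" also fills gaps at the start or end of a column.
filled = toy.interpolate(method="linear", limit_direction="both")

print(nan_counts.to_dict())
print(filled)
```

Dropping the rows with `dropna()` would also remove the NaN values, but it would leave holes in the time axis, which is why interpolation is preferred here.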


In [15]:
Image(url="../images/pamap_skoda_orientation.png")

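In [16]:
# The definitions of the individual columns are missing from this cell.
# Assumption: the acceleration axes are linearly interpolated, ids and labels are kept as-is.
subject_id = data['subject_id']
acc_x, acc_y, acc_z = (data[c].interpolate() for c in ['acc_x', 'acc_y', 'acc_z'])
activity_label = data['activity_label']

interpolated_data = pd.concat([subject_id, acc_x, acc_y, acc_z, activity_label], axis=1)
interpolated_data.columns = ['subject_id', 'acc_x', 'acc_y', 'acc_z', 'activity_label']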


Be careful when removing noise or outliers from the data: this is only advisable if the noise or outliers are of no importance for the use case of our model.
#### 1.2 Resampling

Resampling is necessary if we work with data from different sensors that were not recorded with the same sampling rate.
To optimize the classifier, we need to align the sampling rates with each other.
Resampling can be done by either up- or downsampling the data.
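Both directions can be sketched with pandas on a toy signal. The 50 Hz source rate, the target rates and the sine signal are assumptions chosen for illustration; averaging each bin is a simple way to downsample, and linear interpolation is one of several ways to fill the new samples when upsampling.

```python
import numpy as np
import pandas as pd

# Toy signal: 1 second of data sampled at 50 Hz (one sample every 20 ms).
index_50hz = pd.timedelta_range(start="0s", periods=50, freq="20ms")
signal_50hz = pd.Series(np.sin(np.linspace(0, 2 * np.pi, 50)), index=index_50hz)

# Downsample to 25 Hz by averaging the samples in each 40 ms bin.
signal_25hz = signal_50hz.resample("40ms").mean()

# Upsample to 100 Hz: create 10 ms slots and fill the new ones by interpolation.
signal_100hz = signal_50hz.resample("10ms").interpolate(method="linear")

print(len(signal_50hz), len(signal_25hz), len(signal_100hz))
```

Upsampling does not add information to the signal; it only makes the sample counts of the different sensors compatible.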


#### 1.3 Normalizing
Normalizing is an important part of the preprocessing chain, but it can also be the reason for many mistakes.
Therefore, it is important to choose the correct strategy for normalizing your dataset.

##### 1.3.1 How to normalize?



There are three possible ways to normalize correctly.
Normalization is a big pitfall, since beginners tend to normalize the whole vector at once.

Normalizing sensor-wise

In [26]:
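# One scaler for the whole sensor: reshape(-1, 1) stacks all three axes into a
# single column, so x, y and z share the same minimum and maximum.
# feature_range is passed as a tuple, which recent scikit-learn versions require.
scaler_sensorwise = MinMaxScaler(feature_range=(-1, 1))

scaled_sensorwise = scaler_sensorwise.fit_transform(interpolated_data[["acc_x", "acc_y", "acc_z"]].values.reshape(-1,1))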
scaled_sensorwise

array([[ 0.0193068 ],
       [ 0.51856662],
       [ 0.04323806],
       ...,
       [-0.05468475],
       [ 0.51707086],
       [ 0.01787214]])

before / after

Normalizing axis-wise

before / after


In [None]:
# Each fit_transform call refits the scaler, so every axis gets its own minimum and maximum.
scaler_axiswise = MinMaxScaler(feature_range=(-1, 1))
scaled_x = scaler_axiswise.fit_transform(interpolated_data["acc_x"].values.reshape(-1,1))
scaled_y = scaler_axiswise.fit_transform(interpolated_data["acc_y"].values.reshape(-1,1))
scaled_z = scaler_axiswise.fit_transform(interpolated_data["acc_z"].values.reshape(-1,1))
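To make the difference between the two strategies visible, here is a small NumPy sketch that applies the same min-max arithmetic as `MinMaxScaler` to a made-up block of accelerometer samples. The helper `min_max_scale` and the toy values are hypothetical and only serve as an illustration.

```python
import numpy as np

def min_max_scale(values, axis=None, low=-1.0, high=1.0):
    """Scale values linearly to [low, high] using the min/max along `axis`."""
    v_min = values.min(axis=axis, keepdims=True)
    v_max = values.max(axis=axis, keepdims=True)
    return (values - v_min) / (v_max - v_min) * (high - low) + low

# Toy accelerometer block (rows = samples, columns = x, y, z); values are made up.
acc = np.array([
    [0.0, 8.0, -2.0],
    [1.0, 9.0, -1.0],
    [2.0, 10.0, 0.0],
])

# Sensor-wise: one shared minimum (-2) and maximum (10) for all three axes,
# so the offsets between the axes are preserved.
sensorwise = min_max_scale(acc, axis=None)

# Axis-wise: every column uses its own minimum and maximum,
# so each axis is stretched to the full [-1, 1] range on its own.
axiswise = min_max_scale(acc, axis=0)

print(sensorwise)
print(axiswise)
```

In this toy example all three axes end up identical after axis-wise scaling, while sensor-wise scaling keeps the y axis clearly above the other two.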


Batch-Normalization
Where should it be placed in the architecture?
According to citation[] batch normalization layers should be placed after convolutional layers.
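What such a layer computes on the output of a convolutional layer can be sketched in NumPy. This is a simplified illustration of the training-mode computation only: the `(batch, channels, time)` layout, the helper `batch_norm_1d` and the random stand-in for the convolution output are assumptions, and the running statistics that deep learning frameworks keep for inference are left out.

```python
import numpy as np

def batch_norm_1d(feature_maps, gamma, beta, eps=1e-5):
    """Normalize feature maps of shape (batch, channels, time) per channel."""
    # Mean and variance are computed over the batch and time dimensions,
    # which yields one pair of statistics per channel.
    mean = feature_maps.mean(axis=(0, 2), keepdims=True)
    var = feature_maps.var(axis=(0, 2), keepdims=True)
    normalized = (feature_maps - mean) / np.sqrt(var + eps)
    # gamma and beta are the learnable per-channel scale and shift.
    return gamma.reshape(1, -1, 1) * normalized + beta.reshape(1, -1, 1)

# Stand-in for the output of a convolutional layer: 8 windows, 4 channels, 50 time steps.
rng = np.random.default_rng(0)
conv_output = rng.normal(loc=3.0, scale=2.0, size=(8, 4, 50))

# Convolution output -> batch normalization -> activation (ReLU here).
normalized = batch_norm_1d(conv_output, gamma=np.ones(4), beta=np.zeros(4))
activated = np.maximum(normalized, 0.0)

print(normalized.mean(axis=(0, 2)), normalized.std(axis=(0, 2)))
```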


#### 1.4 Windowing
##### 1.4.1 Jumping/Sliding Window

https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-window-functions
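A minimal windowing sketch, assuming that a jumping window means non-overlapping windows (step equal to the window size) and a sliding window means overlapping windows (step smaller than the window size). The helper `sliding_windows` and the toy signal are hypothetical; samples at the end that do not fill a complete window are dropped.

```python
import numpy as np

def sliding_windows(data, window_size, step):
    """Cut (samples, channels) data into an array of shape (n_windows, window_size, channels)."""
    starts = range(0, len(data) - window_size + 1, step)
    return np.stack([data[s:s + window_size] for s in starts])

# Toy signal with 10 samples and 2 channels.
signal = np.arange(20).reshape(10, 2)

# Jumping window: the step equals the window size, so the windows do not overlap.
jumping = sliding_windows(signal, window_size=4, step=4)

# Sliding window: a step of half the window size gives 50 % overlap.
sliding = sliding_windows(signal, window_size=4, step=2)

print(jumping.shape, sliding.shape)
```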

Shuffling
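One common way to shuffle windowed data is to apply the same random permutation to the windows and their labels, so that every window keeps its label. The sketch below assumes the data has already been cut into windows; the toy arrays and the seed are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Toy data: 6 windows of shape (2, 2) and one activity label per window.
windows = np.arange(24).reshape(6, 2, 2)
labels = np.array([0, 0, 1, 1, 2, 2])

# Draw one permutation and index both arrays with it,
# so that windows and labels stay aligned.
permutation = rng.permutation(len(windows))
shuffled_windows = windows[permutation]
shuffled_labels = labels[permutation]

print(permutation)
print(shuffled_labels)
```

Only the order of the windows changes; the samples inside each window keep their temporal order.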
