

# 1. Introduction & Data Analysis

## 1.1. Introduction

Welcome to the Tutorial on Deep Learning for Human Activity Recognition! This is the first notebook of the tutorial. It will give you a basic understanding of all the steps one needs to go through when applying Deep Learning to Human Activity Recognition.

We assume that you have already listened to the introduction and are familiar with the ultimate goal of the tutorial as well as the terms needed to follow it. If not, you can catch up with the introduction slides on the [tutorial website](https://mariusbock.github.io/dl-for-har/).

During this tutorial we will cover the Deep Learning Activity Recognition Chain (**DL-ARC**) pipeline, which is an overhaul of the Activity Recognition Chain as proposed by [Bulling et al. in 2014](http://dx.doi.org/10.1145/2499621). To familiarise you with the DL-ARC, we will cover the following over the next hours:
- Data Collection & Analysis
- Data Preprocessing
- Evaluation
- Training
- Validation & Testing

We will work with a mix of presentation slides and Jupyter Notebooks. Before jumping into the contents of the pipeline, we will briefly cover the basics of how to use Jupyter Notebooks and, if used, [Google Colab](https://colab.research.google.com/).

### 1.1.1. For Colab Users

If you are accessing this tutorial via Google Colab, first make sure to use Google Colab in English. This will help us to better assist you with issues that might arise during the tutorial. There are two ways to change the default language if it isn't English already:
1. On Google Colab, go to 'Help' -> 'View in English'. 
2. Change the default language of your browser to English.



In general, we strongly advise you to use Google Colab as it provides you with a working Python distribution as well as free GPU resources. To make Colab use GPUs, you need to change the current notebook's runtime type via:

- Runtime -> Change runtime type -> Dropdown -> GPU -> Save
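
To confirm that the runtime change took effect, a quick check along the following lines can help. This is only a sketch: it assumes that a GPU runtime exposes NVIDIA's `nvidia-smi` command-line tool on the PATH, and the helper name `gpu_visible` is hypothetical (it is not part of the tutorial repository).

```python
import shutil
import subprocess


def gpu_visible():
    """Return True if `nvidia-smi` is on the PATH and exits successfully.

    Hypothetical helper; assumes an NVIDIA GPU runtime provides `nvidia-smi`.
    """
    executable = shutil.which('nvidia-smi')
    if executable is None:
        # the tool is not installed, so we assume no GPU runtime is active
        return False
    try:
        result = subprocess.run([executable], capture_output=True, timeout=30)
    except (OSError, subprocess.SubprocessError):
        return False
    return result.returncode == 0


print('GPU visible:', gpu_visible())
```

If this prints `False` on Colab, the runtime type is most likely still set to CPU.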

For the live tutorial, we require all participants to use Colab. If you decide to rerun the tutorial later and would rather run it locally on your machine, feel free to clone our GitHub repository (mariusbock/dl-for-har).

This notebook will teach you how to analyse a Human Activity Recognition dataset. It will illustrate methods which one can apply to get a feel for the data and use case at hand.

### 1.1.2. Jupyter Notebooks Basics

[Jupyter notebooks](https://jupyter.org/) are made of two main components: Markdown text cells and code cells. The latter can be seen as small Python scripts, which can be run individually. The output of the code is printed right after the cell. This allows for more granular and expressive code with explanations and intermediate outputs along the way. Below you will find a sample code cell. You can run the cell by either clicking the 'run' symbol in the top left of the cell or by clicking on the cell and hitting `Shift + Enter` on your keyboard. You can also rerun cells as many times as you want, but be aware that some cells require other cells to be run beforehand in order to work properly (e.g. if one cell references variables defined in another cell).

**Note**: If you get "Warning: This notebook was not authored by Google.", just hit "Run anyway".

In [None]:
# This is a normal print statement
print('Hello World!')

In [None]:
# Just like any script you can declare variables and import packages
import numpy as np

test_array = np.array([1, 2, 3, 4, 5])
print(test_array)

# You can also just write the variable again and it will be printed as output of the cell
test_array

## 1.1. The Dataset

Throughout the whole tutorial we will use the [RealWorld HAR dataset](https://doi.org/10.1109/PERCOM.2016.7456521). The dataset contains recorded data from 8 activities (climbing stairs up and down, jumping, lying, running/jogging, sitting, standing, walking), performed by 15 people. The original dataset covers acceleration, GPS, gyroscope, light, magnetic field, and sound level data. Sensors were placed on multiple body positions, i.e. chest, forearm, head, shin, thigh, upper arm, and waist. Each subject performed each activity for roughly 10 minutes, except for jumping (~1.7 minutes).

To keep things simple and fast, we will use only data obtained from the first three subjects and only use acceleration data captured from the wrist. Within the first task of this notebook, we will show you how to load the dataset using [pandas](https://pandas.pydata.org/) and print the first rows of the dataset.
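
For a rough sense of scale, the durations quoted above can be translated into an expected number of records. The sketch below assumes the 50 Hz sampling rate that is used later in this notebook and takes the approximate durations at face value, so the real number of rows in the file will differ somewhat.

```python
# Back-of-the-envelope estimate of the dataset size (all inputs are approximations)
sampling_rate = 50        # records per second (assumed, as used later in this notebook)
n_subjects = 3            # we only use the first three subjects
n_long_activities = 7     # all activities except jumping
long_minutes = 10         # roughly 10 minutes each
jumping_minutes = 1.7     # roughly 1.7 minutes

# minutes -> seconds -> records
records_per_long_activity = round(long_minutes * 60 * sampling_rate)
records_jumping = round(jumping_minutes * 60 * sampling_rate)

records_per_subject = n_long_activities * records_per_long_activity + records_jumping
records_total = n_subjects * records_per_subject

print('records per 10-minute activity:', records_per_long_activity)  # 30000
print('records for jumping:', records_jumping)                       # 5100
print('records per subject:', records_per_subject)                   # 215100
print('records for three subjects:', records_total)                  # 645300
```

Note how few records the jumping activity contributes compared to the others.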

To get started with this notebook, you need to first run the code cell below. Please set `use_colab` to be `True` if you are accessing this notebook via Colab. If not, please set it to `False`. This code cell will make sure that imports from our GitHub repository will work.

In [None]:
import os
import sys

use_colab = True

module_path = os.path.abspath(os.path.join('..'))

if use_colab:
    # clone package repository
    !git clone https://github.com/mariusbock/dl-for-har.git

    # navigate to dl-for-har directory
    %cd dl-for-har/
else:
    os.chdir(module_path)
    
# this statement is needed so that we can use the methods of the DL-ARC pipeline
if module_path not in sys.path:
    sys.path.append(module_path)

### Task: Loading the dataset

1. Load the dataset containing the data of the first three subjects within the RealWorld HAR dataset, **'rwhar_3sbjs_data.csv'** within the **'data'** folder of the repository, using [pandas](https://pandas.pydata.org/)' `read_csv()` method
2. While reading in the data, use the **names** parameter to pass along the header of the CSV file; the columns we will be using are 'subject_id', 'acc_x', 'acc_y', 'acc_z' and 'activity_label'
3. Print the first five rows of the loaded dataset using the built-in `head()` method of the pandas dataframe

In [None]:
import pandas as pd

# declare where the dataset lies
data_dir = 'data/rwhar_3sbjs_data.csv'

# use pd.read_csv() to load the dataset; use the names parameter to pass along the column names
data = pd.read_csv(data_dir, names=['subject_id', 'acc_x', 'acc_y', 'acc_z', 'activity_label'])

# print the first five rows of the loaded data using .head()
data.head()

As you can see, the dataset consists of 5 columns.
- **subject_id**: identifier of the person the data belongs to
- **acc_x**: acceleration data obtained from the wrist (x-axis)
- **acc_y**: acceleration data obtained from the wrist (y-axis)
- **acc_z**: acceleration data obtained from the wrist (z-axis)
- **activity_label**: name of the activity that was performed
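
Before moving on, it can be worth checking that the columns were parsed as expected. The following is a minimal sketch of such a check. To stay self-contained it reads a few made-up rows (illustrative values only, not taken from the dataset) in the same header-less layout; with the real data you would apply the same three calls to `data`.

```python
import io

import pandas as pd

# a few made-up rows in the same header-less layout as the CSV file
sample_csv = io.StringIO(
    "0,0.12,9.71,1.03,standing\n"
    "0,0.15,9.68,,standing\n"
    "1,1.92,8.40,-2.31,walking\n"
)
columns = ['subject_id', 'acc_x', 'acc_y', 'acc_z', 'activity_label']
sample = pd.read_csv(sample_csv, names=columns)

# data type of each column: the three sensor columns should be numeric
print(sample.dtypes)

# number of missing values per column (the second row lacks acc_z)
print(sample.isna().sum())

# number of records per subject
print(sample.groupby('subject_id').size())
```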

## 1.2. Visualizing the Dataset

Once the dataset has been loaded, we go over steps you can take to get a better feeling for a dataset and what its data tends to look like.


We first have a look at the labeling of the dataset. Each record represents a recorded value of the sensor worn by the participants on their wrist. The corresponding label is the activity they were performing at that point in time. The next code cell will introduce you to built-in functions of [pandas](https://pandas.pydata.org/) that you can use to quickly see how the labels are distributed across all records.

### Task: Analyse the labeling
1. Analyse the label distribution of the dataset: What unique labels exist in the dataset? How many instances of each label are there?
2. Visualize your results obtained above, using a bar plot diagram in [matplotlib](https://matplotlib.org/)

In [None]:
import matplotlib.pyplot as plt

# obtain the unique labels within the 'activity_label' column
unique_labels = data['activity_label'].unique()
print('\nUnique labels in the dataset:')
print(unique_labels)

# obtain the label distribution of the 'activity_label' column:
label_distribution = data['activity_label'].value_counts()
print('\nLabel Distribution: ')
print(label_distribution)

# declare the x- and y-axis of the plot
# x_axis = the different labels within the dataset
# y_axis = their occurrences across the dataset
x_axis = label_distribution.index.tolist()
y_axis = label_distribution.tolist()
# this will declare the plot
plt.figure(figsize=(12, 5))
plt.bar(x_axis, y_axis, width=0.5)
plt.xlabel('Activity label')
plt.ylabel('Count')
plt.title('Label Distribution')
plt.show()
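
Absolute counts can be hard to compare at a glance, so it can also help to look at the share of each label. The sketch below uses a short made-up label series as a stand-in for `data['activity_label']`; `value_counts(normalize=True)` is a standard pandas option that returns fractions instead of counts.

```python
import pandas as pd

# made-up labels standing in for data['activity_label']
labels = pd.Series(['walking'] * 6 + ['sitting'] * 3 + ['jumping'] * 1)

# share of each label in percent, largest first
label_share = labels.value_counts(normalize=True) * 100
print(label_share.round(1))
```

A strongly skewed distribution is worth keeping in mind, since it affects how results on rare activities should be interpreted.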

Now let us focus on visualizing the acceleration data in the next task. As you saw in the introduction to this tutorial, even though raw accelerometer readings are less intelligible than, for instance, images, we can still visualize the accelerometer timeseries as a simple graph with time along the x-axis. Within the next coding task, you will use a function `plot_activity`, which will plot data belonging to sample activities as a simple timeseries plot.

### Task: Plot activity data


1. Filter the original dataset to only contain records with the wanted label 
2. Define the y-axis of the plot as the sensor values, e.g. acceleration data on the x-, y- and z-axis
3. Define the x-axis of the plot as the time in seconds
4. Test the function for different activity labels and sensors: What interesting parts in the data can you spot?

In [None]:
import plotly.graph_objects as go

# define the activity name, sensor names and sampling rate
activity_label = 'walking'
sensor_names = ['acc_x', 'acc_y', 'acc_z']
sampling_rate = 50

def plot_activity(data, label, sensor_names, sampling_rate):
    # filter the data to retain only activity_label data
    filtered_data = data[data.activity_label==label]
    
    # define the y- and x-axis as defined above
    # count for x how many records there are, divide by sampling rate
    y_axis = filtered_data[sensor_names]
    x_axis = np.array(range(len(filtered_data))) / sampling_rate

    # plot data
    fig = go.Figure()
    for s in sensor_names:
        fig.add_trace(go.Scatter(x=x_axis, y=y_axis[s], name=s))
    fig.update_layout(margin_l=0,margin_r=0)  # no margins left or right
    fig.update_layout(title="activity: "+label)
    fig.update_layout(xaxis_title="time (seconds)")
    fig.update_layout(yaxis_title="acceleration (10g)")
    fig.update_layout(legend_x=0, legend_bgcolor='rgba(0,0,0,0.2)')
    fig.show()

# call the function you just defined using the correct inputs
plot_activity(data, activity_label, sensor_names, sampling_rate)

## 1.3. More detailed analysis

Finally, we introduce a more complex analysis of the dataset. As the next notebook will show you, we need to segment our data using a sliding window approach. The size of the windows is often a crucial parameter, which you need to decide upon before training your network. Sometimes smaller windows are required because activities change quickly; smaller windows also give you more training examples, as you are chunking the data into smaller pieces. On the other hand, windows that are too small can prevent your network from recognizing characteristic traits within the data and thus hurt the expressiveness of your data.
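
To make the trade-off concrete, the sketch below cuts ten seconds of dummy data into windows of different lengths. It is not the implementation used in the tutorial repository: the function name `sliding_windows`, the dummy signal, and the 50% overlap are assumptions made purely for illustration.

```python
import numpy as np


def sliding_windows(values, window_size, step):
    """Cut a 1-D array into windows of `window_size` samples, moving `step` samples each time.

    Hypothetical helper for illustration; trailing samples that do not fill a window are dropped.
    """
    starts = range(0, len(values) - window_size + 1, step)
    return np.array([values[s:s + window_size] for s in starts])


sampling_rate = 50                        # Hz, as used in this notebook
signal = np.arange(10 * sampling_rate)    # 10 seconds of dummy data (500 samples)

for seconds in (1, 2, 5):
    window_size = seconds * sampling_rate
    # move by half a window each time, i.e. 50% overlap (an assumption)
    windows = sliding_windows(signal, window_size, step=window_size // 2)
    print(f'{seconds} s windows -> {windows.shape[0]} windows of {windows.shape[1]} samples')
```

On this dummy signal, 1-second windows yield 19 training examples, while 5-second windows yield only 3, but each of the longer windows covers far more of the movement.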

We therefore implemented [helpful functions](https://github.com/mariusbock/dl-for-har/blob/main/data_processing/data_analysis.py) within our GitHub repository, which allow you to get a summary of the activities in your dataset. For each subject as well as overall, the function `analyze_window_lengths` gives you:
1. A list of all the activities that are included and how long they lasted
2. The average, maximum and minimum time each activity lasted

This can help you better understand your use case at hand, and decide on how large or small your sliding windows should be.
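
To illustrate the idea behind such a summary, the sketch below computes how long each uninterrupted stretch of an activity lasts. It is a simplified stand-in written with plain pandas and a made-up label sequence, not the code of `analyze_window_lengths`, whose actual implementation and output format may differ.

```python
import pandas as pd

sampling_rate = 50  # Hz, as used in this notebook

# made-up label sequence: 100 records walking, 50 sitting, 150 walking
labels = pd.Series(['walking'] * 100 + ['sitting'] * 50 + ['walking'] * 150)

# a new segment starts wherever the label differs from the previous record
segment_id = (labels != labels.shift()).cumsum()

# one row per segment: its label and its number of records
frame = pd.DataFrame({'label': labels, 'segment': segment_id})
segments = frame.groupby('segment').agg(
    label=('label', 'first'),
    n_records=('label', 'size'),
)

# convert the number of records into a duration in seconds
segments['seconds'] = segments['n_records'] / sampling_rate
print(segments)

# average, minimum and maximum duration per activity
print(segments.groupby('label')['seconds'].agg(['mean', 'min', 'max']))
```

If the shortest segments in your data last only a few seconds, windows longer than that will frequently span more than one activity.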

### Task: Analyze the activities

1. Run the code cell below
2. Analyze the results: What do you notice? Do the activities change frequently?
3. Do the results differ a lot across subjects?

In [None]:
from data_processing.data_analysis import analyze_window_lengths

analyze_window_lengths(labels=data['activity_label'], subject_idx=data['subject_id'], sampling_rate=50)