# Activity Classification

## Table of Contents

[1. Introduction](#introduction)
<br />
[2. Methods](#methods)
<br />
[3. Results](#results)
<br />
[4. Conclusion](#conclusion)

<a id='introduction'></a>
## 1. Introduction

__Goal:__
<br>
The goal of this project was to predict the type of physical activity (e.g., walking, climbing stairs) from tri-axial smartphone accelerometer data. 

__Data set:__
<br>
The input data used for training in this project consists of two files. The first file, train_time_series.csv, contains the raw accelerometer data and has the following format:

    timestamp, UTC time, accuracy, x, y, z

The second file, train_labels.csv, contains the activity labels. Activities are encoded as integers: 1 = standing, 2 = walking, 3 = stairs down, 4 = stairs up. Because the accelerometers are sampled at high frequency, train_labels.csv provides a label only for every 10th observation in train_time_series.csv.
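
To make the encoding and the label spacing concrete, here is a minimal sketch on made-up data. The frames `toy_series` and `toy_labels`, the dictionary `ACTIVITY_NAMES`, and the starting offset of the first label are illustrative assumptions, not taken from the actual files.

```python
import pandas as pd

# Activity codes as described above (1 = standing, ..., 4 = stairs up)
ACTIVITY_NAMES = {1: "standing", 2: "walking", 3: "stairs down", 4: "stairs up"}

# Toy stand-in for train_time_series.csv: 30 consecutive samples (made-up timestamps)
toy_series = pd.DataFrame({"timestamp": range(1000, 1030)})

# Toy stand-in for train_labels.csv: one label per 10 samples (hypothetical offset of 3)
toy_labels = toy_series.iloc[3::10].assign(label=[1, 2, 4])

# Translate the integer codes into readable activity names
toy_labels["activity"] = toy_labels["label"].map(ACTIVITY_NAMES)

# Fraction of samples that carry a label
labeled_fraction = len(toy_labels) / len(toy_series)

print(toy_labels)
print(labeled_fraction)
```

With one label per ten samples, roughly 90% of the accelerometer rows have no label of their own, which is why the labels are propagated to neighbouring rows during data preparation.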

The test data likewise consists of two files. The first file, test_time_series.csv, contains raw accelerometer data in the same format as train_time_series.csv. The second file, test_labels.csv, lists every 10th data point of test_time_series.csv; these are the observations for which a label has to be predicted.

### Importing Libraries

In [None]:
from time import process_time
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

### Loading Data from CSV

In [None]:
# Record start time (process_time measures CPU time, not wall-clock time)
start = process_time()

In [None]:
# Load train and test data
df_trainTime_raw = pd.read_csv('train_time_series.csv')
df_trainLabels_raw = pd.read_csv('train_labels.csv')
df_testTime_raw = pd.read_csv('test_time_series.csv')
df_testLabels_raw = pd.read_csv('test_labels.csv')

<a id='methods'></a>
## 2. Methods

### Data Preparation

In [None]:
# Drop columns 'Unnamed: 0', 'UTC time', and 'accuracy' for training data
df_trainTime = df_trainTime_raw.drop(columns=['Unnamed: 0', 'UTC time', 'accuracy'])

# Drop columns 'Unnamed: 0' and 'UTC time' for labels data 
df_trainLabels = df_trainLabels_raw.drop(columns=['Unnamed: 0', 'UTC time'])

# Merge data frames 'df_trainTime' and 'df_trainLabels' (i.e. add column 'label' to 'df_trainTime')
df_trainTimeLabels = pd.merge(df_trainTime, df_trainLabels, how="outer", on="timestamp")

# Propagate labels to all rows: forward fill from each labeled row onwards,
# then backward fill for the rows that precede the first label
df_trainTimeLabels['label'] = df_trainTimeLabels['label'].ffill().bfill()

# Rows whose timestamp carries an originally provided label
is_labeled = df_trainTimeLabels['timestamp'].isin(df_trainLabels['timestamp'])

# Create training data (i.e. data points with propagated labels are used as training data)
X_train = df_trainTimeLabels[~is_labeled].drop(columns=['timestamp', 'label'])
y_train = df_trainTimeLabels[~is_labeled].drop(columns=['timestamp', 'x', 'y', 'z'])

# Create test data (i.e. data points with originally given label are used as test data)
X_test = df_trainTimeLabels[is_labeled].drop(columns=['timestamp', 'label'])
y_test = df_trainTimeLabels[is_labeled].drop(columns=['timestamp', 'x', 'y', 'z'])
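
The label propagation used in this step can be illustrated on a toy example. This is a sketch with made-up values; the frames `readings` and `labels` are hypothetical. It uses a left merge for simplicity, whereas the cell above uses an outer merge.

```python
import pandas as pd

# Toy sensor readings (made-up values), one row per timestamp
readings = pd.DataFrame({
    "timestamp": [10, 20, 30, 40, 50, 60],
    "x": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})

# Sparse labels: only two of the six timestamps are labeled
labels = pd.DataFrame({"timestamp": [20, 50], "label": [1, 2]})

# A left merge keeps every reading; unlabeled rows get NaN in 'label'
merged = readings.merge(labels, how="left", on="timestamp")

# Forward fill carries each label to the rows after it;
# backward fill covers the rows before the first label
merged["label"] = merged["label"].ffill().bfill().astype(int)

print(merged["label"].tolist())  # [1, 1, 1, 1, 2, 2]
```

Note that a propagated label is only an approximation: a row just before an activity change inherits the label of the previous activity.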

### Building and Evaluating Classifier

In [None]:
# Initialize classifier
forest_classifier = RandomForestClassifier(max_depth=10, n_estimators=100)

# Train classifier on self-created training data
forest_classifier.fit(X_train, y_train['label'].to_numpy())

# Make predictions
preds = forest_classifier.predict(X_test)

# Evaluate accuracy on self-created test data
accuracy = accuracy_score(y_test['label'].to_numpy(), preds)
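
A single accuracy figure can hide which activities are confused with one another. As a minimal sketch with hand-made label arrays (not the project's data), a confusion matrix and per-class recall could be computed as follows; `y_true` and `y_pred` are hypothetical stand-ins for `y_test['label']` and `preds`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-made labels for illustration only (1 = standing, ..., 4 = stairs up)
y_true = np.array([1, 1, 2, 2, 2, 2, 3, 4])
y_pred = np.array([1, 2, 2, 2, 2, 2, 2, 4])

# Overall fraction of correct predictions
acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=[1, 2, 3, 4])

# Recall per class: correct predictions divided by the number of true samples
per_class_recall = cm.diagonal() / cm.sum(axis=1)

print(acc)               # 0.75
print(cm)
print(per_class_recall)  # [0.5 1.  0.  1. ]
```

In this toy case the overall accuracy is 75% even though class 3 is never predicted correctly, which is the kind of imbalance a per-class view makes visible.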

### Predicting Labels on Test Set

In [None]:
# Create test set
df_testTime = df_testTime_raw[df_testTime_raw['timestamp'].isin(df_testLabels_raw['timestamp'])]

# Drop columns 'Unnamed: 0', 'timestamp', 'UTC time', and 'accuracy' for test data
df_testTime = df_testTime.drop(columns=['Unnamed: 0', 'timestamp', 'UTC time', 'accuracy'])

# Make predictions on test set
preds = forest_classifier.predict(df_testTime)

# Set column 'Unnamed: 0' as index without an index name
df_testLabels = df_testLabels_raw.set_index('Unnamed: 0')
df_testLabels.index.name = None

# Add class predictions to column 'label'
df_testLabels['label'] = preds

# Write predictions to 'test_labels.csv' (this overwrites the input file of the same name)
df_testLabels.to_csv('test_labels.csv')

In [None]:
# Record end time
end = process_time()

# Elapsed CPU time in seconds
end - start
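
Because `process_time` counts CPU time of the current process and excludes time spent sleeping, the value above can differ from the wall-clock duration. The following sketch contrasts it with `time.perf_counter`; the `time.sleep` call is only a stand-in for waiting and is not part of this project.

```python
import time

# Start both clocks
wall_start = time.perf_counter()
cpu_start = time.process_time()

# Stand-in for waiting (e.g. on disk or network); uses almost no CPU
time.sleep(0.2)

# Elapsed time according to each clock
wall_elapsed = time.perf_counter() - wall_start
cpu_elapsed = time.process_time() - cpu_start

print(f"wall-clock: {wall_elapsed:.3f} s, CPU: {cpu_elapsed:.3f} s")
```

If the wall-clock duration of the notebook is what matters, `perf_counter` is likely the more suitable clock.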

<a id='results'></a>
## 3. Results

In [None]:
# Print accuracy score on self-created test data
print(accuracy)

# Print the importance of each covariate in the random forest classification
sorted(list(zip(X_test, forest_classifier.feature_importances_)), key=lambda tup: tup[1])

<a id='conclusion'></a>
## 4. Conclusion

The classification accuracy on the self-created test data is ~64%, while the actual classification accuracy on the undisclosed test data is between 40% and 50%.
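
The gap between the two scores may partly stem from how the self-created test data was built: its rows are interleaved in time with training rows whose labels were propagated from those same rows. One way to probe this, offered as a sketch rather than the method used above, is a chronological hold-out in which the classifier is trained on the earlier part of the recording and evaluated on the later part. The data below is synthetic, and the names `toy`, `cutoff`, and `holdout_accuracy` as well as the 75% split point are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the merged training frame: four cycles of four
# activities, 25 samples each, with features loosely tied to the label
n = 400
labels = np.tile(np.repeat([1, 2, 3, 4], 25), 4)
toy = pd.DataFrame({
    "timestamp": np.arange(n),
    "x": labels * 0.5 + rng.normal(size=n),
    "y": rng.normal(size=n),
    "z": rng.normal(size=n),
    "label": labels,
})

# Chronological split: the first 75% of rows for training, the rest held out
toy = toy.sort_values("timestamp")
cutoff = int(len(toy) * 0.75)
train, holdout = toy.iloc[:cutoff], toy.iloc[cutoff:]

# Same hyperparameters as in the notebook; random_state added for repeatability
features = ["x", "y", "z"]
clf = RandomForestClassifier(max_depth=10, n_estimators=100, random_state=0)
clf.fit(train[features], train["label"])

# Accuracy on rows that all lie later in time than the training rows
holdout_accuracy = accuracy_score(holdout["label"], clf.predict(holdout[features]))
print(holdout_accuracy)
```

If the accuracy under such a split comes out closer to the 40% to 50% range, that would suggest the interleaved split overstates how well the model generalizes.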