# Final Project
HarvardX PH526x - Using Python for Research

Florian Jäckel

<h3> Introduction </h3>

<b>The goal of the final project</b> is to predict types of physical activity (e.g., walking, climbing stairs) from tri-axial smartphone accelerometer data.

<b>The input data </b> consists of two files. The first file, <code>train_time_series.csv</code>, contains the raw accelerometer data with following format: <code>timestamp, UTC time, accuracy, x, y, z</code>. <code>x</code>, <code>y</code>, and <code>z</code> correspond to measurements of linear acceleration along each of the three orthogonal axes. Another file, <code>train_labels.csv</code>, contains the activity labels. With these labels, one can train the model. Different activities have been numbered with integers: 1 = standing, 2 = walking, 3 = stairs down, 4 = stairs up. The labels in <code>train_labels.csv</code> are only provided for every 10th observation in <code>train_time_series.csv</code>. <code>train_time_series.csv</code> contains about 3750 observations; accordingly, <code>train_labels.csv</code> contains around 375 labels. There is also a file called <code>test_time_series.csv></code> based on which one is asked to provide the activity labels predicted by the model. 

The <b> key steps taken </b> were, first, to map the training labels with the observations. In order not to lose the data in between the provided labels, a rolling window method was used. Next, since the true labels for this test data is not given, my approach was to split the training data once more. This way, different models and paramaters could be tested. Once a reasonably accurate model was found, it was used on the actual test data and the results written to the file <code>test_labels.csv</code>.
    
As an <b> additional assignment </b>, one was asked to provide the running time of one's code, starting at the moment the test data is loaded and ending when the predictions are computed. Here, the simple approach was to wright the starting and ending time to a variable, respectively, and calculate the difference.

<b>Nota bene:</b> I have a background in the humanities. Working with quantifiable/quantified data is not my area of expertise. I cannot explain some of the problems underlying the assignment beyond what was immediately taught during the course. My solution is based on trial and error as well as skimming through some scientific papers on the subject.

<h3> Methods </h3>

In the following, I explain the steps taken and provide the respective code.

<h4>1 Import Library and Data</h4>

First things first: the pandas library is imported and the provided data stored in pandas Dataframe objects by using the <code>read_csv</code> method.

In [1]:
# import libraries

import pandas as pd

# create pandas Dataframes based on provided csv files

train_time_series = pd.read_csv("train_time_series.csv")

train_labels = pd.read_csv("train_labels.csv")

test_time_series = pd.read_csv("test_time_series.csv")

test_labels = pd.read_csv("test_labels.csv")

<h4>2 Prepare Data </h4>

Next, to take care of the problem that labels are only provided for every 10th observation, a column is added with the mean of the surrounding 10 observations for each of the three directions x, y and z, respectively. This is done with the <code>rolling</code> method of pandas. At the beginning and end, the <code>nan</code> values are interpolated using the <code>interpolate</code> method of pandas.

In [2]:
train_time_series["roll_mean_x"] = train_time_series["x"].rolling(window=10, center=True).mean().interpolate(limit_direction="both")

train_time_series["roll_mean_y"] = train_time_series["y"].rolling(window=10, center=True).mean().interpolate(limit_direction="both")

train_time_series["roll_mean_z"] = train_time_series["z"].rolling(window=10, center=True).mean().interpolate(limit_direction="both")

Then, the Dataframes for observations and labels are merged, leaving only rows with matching timestamp.

In [3]:
train_merge = pd.merge(train_time_series, train_labels, on=("timestamp", "UTC time"))

Finally, the test data is prepared the same way.

In [4]:
test_time_series["roll_mean_x"] = test_time_series["x"].rolling(window=10, center=True).mean().interpolate(limit_direction="both")

test_time_series["roll_mean_y"] = test_time_series["y"].rolling(window=10, center=True).mean().interpolate(limit_direction="both")

test_time_series["roll_mean_z"] = test_time_series["z"].rolling(window=10, center=True).mean().interpolate(limit_direction="both")

test_merge = pd.merge(test_time_series, test_labels, on=("timestamp", "UTC time"))

<h4>3 Set Classification Features and Classification Target</h4>

To make the code more readable and easily adaptable, the classification features and the classification target are stored in variables.

In [5]:
classification_features = ["roll_mean_x", "roll_mean_y", "roll_mean_z"]

classification_target = "label"

<h4>4 Predictions Based on Different Models taught in the Course</h4>

<h5>4.1 Starting with Logistic Regression</h5>

One of the most basic models for a classification problem such as the given one is logistic regression. To use it, the respective class from the <code>scikit-learn</code> library must be imported and a respective object must be initialized.

In [6]:
# import model

from sklearn.linear_model import LogisticRegression

# initalize object and asign to variable

logreg_model = LogisticRegression()

Next, we train or fit the model using the fit method of the class <code>LogisticRegression</code>. The features used in training the model, in other words: "X", will be the three columns with the means of x, y, and z. The classification target used in training the model, in other words: "y", will be the labels provided with <code>train_labels.csv</code>. Both X and y can be retrieved from the Dataframe <code>train_merge</code> using the variables defined above in section 3.

In [7]:
logreg_model.fit(train_merge[classification_features], train_merge[classification_target])

LogisticRegression()

Finally, based on the trained or fitted model we can make predictions of the test data. These are stored in a variable and printed.

In [8]:
predictions_logreg = logreg_model.predict(test_merge[classification_features])

print(predictions_logreg)

[2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2]


Based on this output, it seems likely that logistic regression does not handle our problem well. Hence, in the next section, we will try another model.

<h5>4.2 An alternative: Random Forest Classification</h5>

Another common model for classification problems is random forest classification. Codewise, the procedure is analogous to what has been done using logistic regression.

In [9]:
# import model

from sklearn.ensemble import RandomForestClassifier

# initalize object and asign to variable

rdmfrst_model = RandomForestClassifier()

# train model based on training data

rdmfrst_model.fit(train_merge[classification_features], train_merge[classification_target])

# predict labels of test data

predictions_rdmfrst = rdmfrst_model.predict(test_merge[classification_features])

print(predictions_rdmfrst)

[2 4 4 4 2 2 3 3 2 3 2 4 2 2 2 2 4 3 2 2 3 4 2 2 3 4 2 2 4 4 2 2 2 2 2 1 2
 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 2 2 2 3 2 3 2 2 2 3 2 2 3 2 3 3 2 2 2 2 2 4
 2 4 2 2 3 3 2 2 2 4 2 2 2 4 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2]


This output looks more realistic. However, this scores only little more than 50% accuracy when submitted.

<h4> 5 Outwiting the assignment? </h4>

Without the actual test labels (or, maybe, better knowledge on these kinds of problems), chosing the right model is very challenging. What could be a solution? I decided to "forget" about the test labels and split the training data once more to measure the accuracy of different models.

This was accomplished with the <code>train_test_split</code> method. Two thirds of the data was "extracted" for training, one third for testing. The data was stratified to obtain a suitable proportion of values.


In [10]:
# import the method train_test_split from the sklearn library

from sklearn.model_selection import train_test_split

# split data

X_train, X_test, y_train, y_test = train_test_split(train_merge[classification_features], train_merge[classification_target], test_size=0.33, stratify=train_merge[classification_target])

Based on these training and testing data sets, different models can be tested and their accuracy evaluated. Skimming through some research papers on similar problems (i.e. classifying physical activities based on accelaration data), the following methods were mentioned in addition to Logistic Regression and Random Forest Classification, among others: k Nearest Neighbours, Support Vector Machines (SVMs), and Gaussian Naive Bayes.

All five models were implemented the way already described, this time using the generated variables <code>X_train</code>, <code>X_test</code>, <code>y_train</code>, and <code>y_test</code>. <code>y_test</code> can then be compared with the generated variable <code>y_predict</code> to test the accuracy of the respective model. This is done using the <code>metrics</code> method.

In [11]:
# importing the methods from the sklearn library

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

# importing method metrics from the sklearn library to measure accuracy

from sklearn import metrics

# using the pattern described above, 
# based on the newly generated variables
# X_train, X_test, y_train, and y_test

logreg_model = LogisticRegression()
logreg_model.fit(X_train, y_train)
y_predict = logreg_model.predict(X_test)

# comparing the outcome y_predict with y_test to measure and print accuracy

print(metrics.accuracy_score(y_test, y_predict))

# analogous approach for remaining models

rdmfrst_model = RandomForestClassifier()
rdmfrst_model.fit(X_train, y_train)
y_predict = rdmfrst_model.predict(X_test)
print(metrics.accuracy_score(y_test, y_predict))

knn_model = KNeighborsClassifier(n_neighbors=15, weights="distance") # different values for k / n_neighbors were tested
knn_model.fit(X_train, y_train)
y_predict = knn_model.predict(X_test)
print(metrics.accuracy_score(y_test, y_predict))

SVC_model = SVC()
SVC_model.fit(X_train, y_train)
y_predict = SVC_model.predict(X_test)
print(metrics.accuracy_score(y_test, y_predict))

GaussianNB_model = GaussianNB()
GaussianNB_model.fit(X_train, y_train)
y_predict = GaussianNB_model.predict(X_test)
print(metrics.accuracy_score(y_test, y_predict))

0.5645161290322581
0.6370967741935484
0.6209677419354839
0.5645161290322581
0.6129032258064516


<h4>6 Solution: Window Size </h4>

Looking at the results, it seems as if none of the three newly applied models scores significantly higher than Logistic Regression or Random Forest Classifier. Skimming through the papers again, I found out that one approach is to use a different window size:

>The data vector containing 3-axis acceleration and 3-axis rotation rate recorded at a time instant is called a sample. To reduce noise and capture cyclic patterns of motion, features were not computed on each single sample, but on a sliding window of samples. Many studies have indicated the superiority of using a 1-second window size; others have used larger window sizes such as 2 seconds and 10 seconds to capture more cyclic patterns. We experimentally compared window sizes of 1 second, 2 seconds, 5 seconds, and 10 seconds, and found that the 2-second window size (60 samples in our case) produced the best classification performance. --- Wu, Wanmin et al. “Classification accuracies of physical activities using smartphone motion sensors.” Journal of medical Internet research vol. 14,5 e130. 5 Oct. 2012, doi:10.2196/jmir.2208)

Accordingly, I adapted my code from above, using a variable to conveniently test different window sizes. (To make the interactive usage within the jupyter notebook possible, the necessary lines of code must be repeated here.)

In [12]:
window_size = 200

train_time_series["roll_mean_x"] = train_time_series["x"].rolling(window=window_size, center=True).mean().interpolate(limit_direction="both")
train_time_series["roll_mean_y"] = train_time_series["y"].rolling(window=window_size, center=True).mean().interpolate(limit_direction="both")
train_time_series["roll_mean_z"] = train_time_series["z"].rolling(window=window_size, center=True).mean().interpolate(limit_direction="both")

train_merge = pd.merge(train_time_series, train_labels, on=("timestamp", "UTC time"))

X_train, X_test, y_train, y_test = train_test_split(train_merge[classification_features], train_merge[classification_target], test_size=0.33, stratify=train_merge[classification_target])

logreg_model = LogisticRegression()
logreg_model.fit(X_train, y_train)
y_predict = logreg_model.predict(X_test)
print(metrics.accuracy_score(y_test, y_predict))

rdmfrst_model = RandomForestClassifier()
rdmfrst_model.fit(X_train, y_train)
y_predict = rdmfrst_model.predict(X_test)
print(metrics.accuracy_score(y_test, y_predict))

knn_model = KNeighborsClassifier(n_neighbors=15, weights="distance")
knn_model.fit(X_train, y_train)
y_predict = knn_model.predict(X_test)
print(metrics.accuracy_score(y_test, y_predict))

SVC_model = SVC()
SVC_model.fit(X_train, y_train)
y_predict = SVC_model.predict(X_test)
print(metrics.accuracy_score(y_test, y_predict))

GaussianNB_model = GaussianNB()
GaussianNB_model.fit(X_train, y_train)
y_predict = GaussianNB_model.predict(X_test)
print(metrics.accuracy_score(y_test, y_predict))

0.5645161290322581
0.9354838709677419
0.9354838709677419
0.5645161290322581
0.8709677419354839


<h3> Results </h3>

Different window sizes were tested. The value was increased first. A value of about 200 seemed to score much better results. Beyond the value 200, the accuracy seemed to decrease again. In general, the predictions based on Random Forest Classification, k Nearest Neighbors, and Gaussian Naive Bayes are much higher with an increased window size. The method Random Forest Classification, seemingly scoring the highest results on average, was chosen to calculate the final results.

<h4> Finalize the results for submission </h4>

Finally, it is possible to return to the original test data provided with the assignment to predict test labels based on Random Forest Classification, however using a bigger window size of 200. (Again, to make the interactive usage within the jupyter notebook possible, the necessary lines of code must be repeated here:)

In [13]:
# train model based on training data

rdmfrst_model.fit(train_merge[classification_features], train_merge[classification_target])

# predict labels of test data

predictions_rdmfrst = rdmfrst_model.predict(test_merge[classification_features])

print(predictions_rdmfrst)

[4 3 4 4 2 2 4 4 2 3 4 3 2 1 4 2 4 2 3 2 4 4 4 2 4 4 2 3 4 4 2 4 2 1 1 2 2
 2 2 2 1 1 2 1 1 1 1 1 1 1 2 1 2 4 2 3 4 3 4 3 2 4 2 1 4 4 3 4 3 3 2 2 2 4
 4 2 1 4 3 3 2 1 3 4 3 4 2 4 2 2 2 4 2 3 2 2 2 4 2 2 2 1 2 2 1 2 2 2 4 2 2
 1 2 1 2 2 2 2 1 2 2 3 1 2 1]


These predictions can now be assigned to the Dataframe <code>test_labels</code> and the dataframe can be written to a CSV file using the <code>to_csv</code> method.

In [21]:
test_labels["label"] = predictions_rdmfrst
test_labels.to_csv("test_labels_labels-added.csv")

<h4>Measuring Code Runtime</h4>

Finally, to measure the code runtime, the training and prediction based on logistic regression are omitted. The library <code>time</code> is imported and the entire code wrapped in the following lines of code (here, for brevity, the rest of the code is not repeated again):

In [34]:
# import library

import time

# wrap code with the following:

start_time = time.time()

# all lines of code above, except import of libraries and codes of logistic regression.

time_used = time.time() - start_time

<h3>Conclusion</h3>

The assignment was to predict types of physical activity from smartphone accelerometer data. Since the test labels for this classification problem were not provided with the test data and since the results based on Logistic Regression and Random Forest Classification scored a rather low accuracy upon first submission, the problem of missing test labels was circumvented by splitting the provided training data further and testing further models. Although it was now possible to measure accuracy, scores were still low. Based on skimming through research papers, an additional approach was adopted, namely increasing the window size of calculating the means of observations. With a significantly bigger window size, the accuracy of Random Forest Classification (among others) scored much higher. This approach was then used on the originally provided test data. The main python techniques were pandas Dataframe objects to store and work on the data as well as methods from the sklearn library to implement machine learning.