# Python for Research: Final Project
### Project goal
Predict type of physical activity from tri-axial smartphone accelerometer data.

### Input data
- train_time_series.csv: Raw accelerometer data
 - timestamp, UTC time, accuracy, x, y, z
- train_labels.csv: Activity labels to use to train model
 - 1 = standing, 2 = walking, 3 = stairs down, 4 = stairs up
 - Only provided for every 10th observation in train_time_series.csv
- test_time_series.csv: Predict activity type for this dataset. Accuracy determined as percentage of correct classifications.
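
Because labels exist only for every 10th observation, joining readings to labels on `timestamp` leaves just the labeled rows. The sketch below illustrates this with two hypothetical miniature frames (the timestamps and values are made up for illustration, not taken from the project files):

```python
import pandas as pd

# Hypothetical miniature of train_time_series.csv: 20 readings, one per timestamp.
readings = pd.DataFrame({
    "timestamp": range(1000, 1020),
    "x": [0.1] * 20,
    "y": [-0.5] * 20,
    "z": [0.05] * 20,
})

# Hypothetical miniature of train_labels.csv: a label for every 10th reading only.
labels = pd.DataFrame({
    "timestamp": [1009, 1019],
    "label": [2, 3],
})

# An inner join keeps only timestamps present in both frames,
# so the unlabeled readings drop out.
labeled = pd.merge(readings, labels, on="timestamp", how="inner")
print(labeled.shape)
```

Only the two labeled timestamps survive the join, which is why the merged training frame is roughly a tenth the size of the raw time series.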

### Output
- test_labels.csv: Upload activity predictions from test_time_series.csv
- Runtime of code

### Method
- Inner join train_time_series.csv and train_labels.csv on timestamp
- Train random forest model using 90% of the labeled training data
 - Linear regression is not suited to classification, and logistic regression is most commonly used for binary outcomes, whereas this task has four activity classes.
- Predict activity for remaining 10% of training data
- Conduct cross-validation to estimate out-of-sample performance and tune the model
- Predict activity for test data
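
The cross-validation step in the list above could look roughly like the following. This is a minimal sketch on synthetic data, assuming `cross_val_score` with 5 folds and a small, arbitrary grid of `max_depth` values. The data, the grid, and the names `mean_scores` and `best_depth` are illustrative assumptions, not part of the project.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the x, y, z covariates and the four activity labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = rng.integers(1, 5, size=200)

# Mean 5-fold accuracy for a few candidate tree depths (grid is illustrative).
mean_scores = {}
for depth in (2, 4, 8):
    clf = RandomForestClassifier(max_depth=depth, random_state=0)
    scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    mean_scores[depth] = scores.mean()

# Keep the depth with the highest mean cross-validated accuracy.
best_depth = max(mean_scores, key=mean_scores.get)
print(mean_scores, best_depth)
```

Each fold is scored on rows the model did not see during fitting, so the mean score is a less optimistic estimate than in-sample accuracy.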

In [1]:
import pandas as pd
import numpy as np
import time

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import accuracy_score
from sklearn.metrics import r2_score

import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")

In [2]:
start = time.time()

# Import data
train_time_df = pd.read_csv('train_time_series.csv')
train_activity_df = pd.read_csv('train_labels.csv')
test_time_df = pd.read_csv('test_time_series.csv')
test_labels_df = pd.read_csv('test_labels.csv')

In [3]:
# Remove unnecessary columns
train_time_df = train_time_df.drop(['Unnamed: 0', 'UTC time', 'accuracy'], axis = 1)
train_activity_df = train_activity_df.drop(['Unnamed: 0', 'UTC time'], axis = 1)
test_time_df = test_time_df.drop(['Unnamed: 0', 'UTC time', 'accuracy'], axis = 1)
test_labels_df = test_labels_df.drop(['Unnamed: 0', 'UTC time'], axis = 1)

In [4]:
# Inner join train_time_series.csv and train_labels.csv on timestamp
train_df = pd.merge(left = train_time_df, right = train_activity_df, on = 'timestamp')
train_df.head()

# Create training set with 90% of data (no random_state is set, so the sampled rows differ between runs)
subset = 0.9
nRows = round(train_df.shape[0] * subset)
train_subset_df = train_df.sample(n = nRows)
train_subset_df.head()

Unnamed: 0,timestamp,x,y,z,label
77,1565110008264,0.909073,-0.851181,-0.034027,2
39,1565109970177,-0.021774,-0.363892,0.07933,2
178,1565110109588,0.062851,-0.497757,0.058685,3
281,1565110212825,0.255386,-0.786774,0.076691,3
303,1565110234876,-0.429321,-0.694839,-0.462814,2


In [5]:
classification_outcome = train_subset_df['label']
classification_outcome.head()
classification_outcome.shape

(338,)

In [6]:
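# Create function to assess accuracy of prediction
# The estimator must already be fitted; it is not re-fitted here,
# so the function can score data the model has not seen.
def accuracy(estimator, X, y):
    predictions = estimator.predict(X)
    return accuracy_score(y, predictions)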

# Define inputs (covariates) and outputs (classification_outcome)
covariates = train_subset_df[['x', 'y', 'z']]
classification_outcome = train_subset_df['label']

# Train random forest model using training subset
forest_classifier = RandomForestClassifier(max_depth = 4, random_state = 0)
forest_classifier.fit(covariates, classification_outcome)
activity_pred = forest_classifier.predict(covariates)

In [7]:
# In-sample accuracy
accuracy(forest_classifier, covariates, classification_outcome)

# In-sample accuracy is 65%

0.650887573964497

In [8]:
# Out-of-sample accuracy

# Left join with indicator, keeping only rows absent from train_subset_df (the remaining 10% of the training set)
train_subset2_df = train_df.merge(train_subset_df, on = ['timestamp'], how = 'left', indicator = True)
train_subset2_df = train_subset2_df.loc[train_subset2_df._merge == 'left_only']
train_subset2_df = train_subset2_df.drop(['x_y', 'y_y', 'z_y', 'label_y', '_merge'], axis = 1).rename(columns = {"x_x": "x", "y_x": "y", "z_x": "z", "label_x": "label"})
train_subset2_df.head()

covariates2 = train_subset2_df[['x', 'y', 'z']]
classification_outcome2 = train_subset2_df['label']

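# Score the model fitted on the training subset, without re-fitting it.
# The value recorded below came from a run in which the model was re-fit on
# these held-out rows before scoring, so it is an in-sample score and
# likely overstates out-of-sample accuracy.
activity_pred2 = forest_classifier.predict(covariates2)
accuracy_score(classification_outcome2, activity_pred2)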

0.8648648648648649
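
A hold-out check of this kind can also be sketched with scikit-learn's `train_test_split`, which produces the two disjoint subsets in one call. The example below uses synthetic data and assumes a 90/10 split. The frame `df` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the merged training frame (x, y, z, label).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x": rng.normal(size=100),
    "y": rng.normal(size=100),
    "z": rng.normal(size=100),
    "label": rng.integers(1, 5, size=100),
})

# Split once into disjoint training (90%) and held-out (10%) rows.
X_train, X_test, y_train, y_test = train_test_split(
    df[["x", "y", "z"]], df["label"], test_size=0.1, random_state=0
)

# Fit on the training rows only, then score both subsets with the same model.
clf = RandomForestClassifier(max_depth=4, random_state=0).fit(X_train, y_train)
in_sample = accuracy_score(y_train, clf.predict(X_train))
out_of_sample = accuracy_score(y_test, clf.predict(X_test))
print(in_sample, out_of_sample)
```

The model is fitted exactly once, so the held-out score reflects rows it never saw during training.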

In [9]:
# Predict on test data
covariates3 = test_time_df[['x', 'y', 'z']]
activity_pred3 = forest_classifier.predict(covariates3)
test_time_df['activity_pred'] = activity_pred3
test_time_df.head()

test_time_df.to_csv("test_time_series_predict.csv")

In [10]:
# Write to CSV file
test_labels_df_merged = pd.merge(test_labels_df, test_time_df, on = ['timestamp'])
test_labels_df_merged.head()

test_labels_df_merged.to_csv("test_labels_predict.csv")

In [11]:
test_labels_df_merged['activity_pred'].tolist()

# With this approach, accuracy on test data is 46.4%. This isn't perfect, but it is about 13 percentage points above the 33% baseline, a relative improvement of roughly 40%.

[2,
 2,
 2,
 3,
 2,
 2,
 3,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 3,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 3,
 2,
 2,
 2,
 3,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 3,
 2,
 2,
 3,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 3,
 3,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 3,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 3,
 3,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 3,
 2,
 2,
 2,
 2]
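
The list above contains only labels 2 and 3. A quick tally makes that kind of skew easy to spot. The sketch below counts the first eight predictions from the list. The variable names are illustrative, and in the notebook the same call could be applied to `test_labels_df_merged['activity_pred']`.

```python
import pandas as pd

# First eight predicted labels from the list above.
predictions = pd.Series([2, 2, 2, 3, 2, 2, 3, 2])

# Count how often each activity label was predicted, ordered by label.
counts = predictions.value_counts().sort_index()
print(counts)
```

Labels missing from the tally (here 1 and 4) are classes the model never predicted for these rows.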

In [12]:
end = time.time()
end - start

0.6279420852661133

### Conclusion
Based on accelerometer data, we can predict activity type with about 46% accuracy on the test data. That is roughly 13 percentage points above the 33% baseline, or a relative improvement of about 40%.
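
The improvement figure can be reproduced from the two numbers reported in this notebook (46.4% test accuracy and a 33% baseline). A short sketch of the arithmetic, with illustrative variable names:

```python
test_accuracy = 0.464  # accuracy on test data reported above
baseline = 0.33        # baseline accuracy reported above

# Absolute gain in percentage points, and gain relative to the baseline.
absolute_gain = test_accuracy - baseline
relative_gain = absolute_gain / baseline

print(f"{absolute_gain:.3f} {relative_gain:.3f}")  # prints "0.134 0.406"
```

The "40% improvement" is therefore a relative figure. In absolute terms the gain is about 13 percentage points.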