## Introduction
The data set contains tri-axial smartphone accelerometer data. There are four CSV files: train_time_series, train_labels, test_time_series, and test_labels.
 -  The time_series files hold a timestamp and the accelerations in the x, y, and z directions for each reading.
 -  The _labels files contain labels for only every tenth data point in the time_series files.
 -  Our task is to use the training set to predict the type of physical activity in the test set.

## Method
This is a statistical classification problem. In previous homework, the Random Forest Classifier in Scikit-learn gave the most accurate classifications, so we use it here. Our approach has two main parts: data cleaning, followed by training and predicting.

The specific steps are as follows:
 - Import relevant packages: NumPy, Pandas, Scikit-learn, Datetime
 - Record start time
 - Read in all files to Pandas data frames
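 - Match the data frames' dimensions by keeping only the time-series rows whose timestamp appears in the labels file
 - Fit the training data set, using the accelerations as covariates
 - Predict labels for the test data set and write the results to test_labels.csv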

In [1]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import datetime 

start_time = datetime.datetime.today()

# read in training sets
timedf = pd.read_csv('train_time_series.csv')
labeldf = pd.read_csv('train_labels.csv')

# read in test sets
testtimedf = pd.read_csv('test_time_series.csv')
testlabeldf  = pd.read_csv('test_labels.csv')

# data cleaning

# keep only the time-series rows whose timestamp has a label,
# so that features and labels have the same number of rows
labellist = list(labeldf.timestamp)
timedf = timedf[timedf.timestamp.isin(labellist)]

testlabellist = list(testlabeldf.timestamp)
testtimedf = testtimedf[testtimedf.timestamp.isin(testlabellist)]

# train data set
forest_classifier = RandomForestClassifier()
classification_outcome = labeldf['label']
all_covariates = ['x', 'y', 'z']
forest_classifier.fit(timedf[all_covariates], classification_outcome)


# predict labels for the test set and write them back to test_labels.csv
# (this overwrites the input file)
testlabeldf['label'] = forest_classifier.predict(testtimedf[all_covariates])
testlabeldf.to_csv('test_labels.csv')

end_time = datetime.datetime.today()
print("Runtime: ", (end_time-start_time)/datetime.timedelta(seconds=1))

Runtime:  0.203311
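
The cleaning step above filters the time series with `isin` and then relies on the two data frames being in the same row order. As a hedged alternative, the sketch below pairs each reading with its label by joining on `timestamp`. The data here are synthetic stand-ins, not values from the course files, and the names `time_series`, `labels`, and `aligned` are illustrative only.

```python
import pandas as pd

# Synthetic stand-ins for the two training files (all values are made up).
time_series = pd.DataFrame({
    "timestamp": range(100, 130),
    "x": [0.01 * i for i in range(30)],
    "y": [0.02 * i for i in range(30)],
    "z": [0.03 * i for i in range(30)],
})
# Labels exist only for every tenth reading, as in the _labels files.
labels = pd.DataFrame({
    "timestamp": [109, 119, 129],
    "label": [1, 2, 4],
})

# An inner join keeps only readings that have a label and pairs each
# reading with its label by timestamp rather than by row position.
# validate="one_to_one" raises an error if a timestamp is duplicated.
aligned = time_series.merge(labels, on="timestamp", how="inner", validate="one_to_one")

features = aligned[["x", "y", "z"]]
target = aligned["label"]
print(aligned)
```

With this layout, `features` and `target` come from the same data frame, so they cannot drift out of step if one file is sorted differently from the other.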


## Results
Our predictions are 44.8% accurate according to the course's autograder. Below, we split the training set into new training and test sets with a 60/40 ratio to score the model ourselves. This gives roughly 53-57% accuracy; the value varies from run to run because the random forest is not seeded, and the run shown below scored 56.7%. The low accuracy may be due to having too few covariates and data points.

In [2]:
from sklearn.model_selection import train_test_split
# hold out 40% of the labeled training rows to estimate accuracy
x_train, x_test, y_train, y_test = train_test_split(timedf, labeldf.label, train_size=0.6, random_state=1)
# refit on the 60% split, then report mean accuracy on the held-out 40%
forest_classifier.fit(x_train[all_covariates], y_train)
forest_classifier.score(x_test[all_covariates], y_test)

0.5666666666666667
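
A single 60/40 split gives a score that depends on which rows land in each part. One possible refinement, sketched here under stated assumptions, is to average accuracy over several folds with `cross_val_score` and to compare the raw covariates against a set with one extra derived covariate. The `magnitude` column is a hypothetical addition, not something used in this project, and the data are random numbers, so the printed accuracies say nothing about the real data set.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic accelerations and labels standing in for the aligned training data.
rng = np.random.default_rng(0)
n_samples = 200
data = pd.DataFrame(rng.normal(size=(n_samples, 3)), columns=["x", "y", "z"])
labels = rng.integers(1, 5, size=n_samples)  # four made-up activity classes

# Hypothetical extra covariate: the magnitude of the acceleration vector.
data["magnitude"] = np.sqrt(data["x"] ** 2 + data["y"] ** 2 + data["z"] ** 2)

# Fixing random_state makes the forest reproducible between runs.
model = RandomForestClassifier(n_estimators=50, random_state=1)

# 5-fold cross-validation averages accuracy over five different splits,
# which is less sensitive to one lucky or unlucky partition.
base_scores = cross_val_score(model, data[["x", "y", "z"]], labels, cv=5)
extended_scores = cross_val_score(model, data[["x", "y", "z", "magnitude"]], labels, cv=5)

print("x, y, z:            ", base_scores.mean())
print("x, y, z + magnitude:", extended_scores.mean())
```

On the real training data, the same comparison could indicate whether a derived covariate is worth keeping before submitting predictions to the grader.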

## Conclusions
In this final project, we used Scikit-learn's Random Forest Classifier to predict the type of physical activity from accelerometer data. Using the accelerations along the x, y, and z axes as covariates, we fit the training data and generated predictions for the test set. Our accuracy is low, likely because of the limited covariates. The runtime is about 0.2 seconds.