# Harvard edX Final Project

### This project was submitted in fulfillment of the final project requirement for the Harvard edX course Using Python for Research. The following instructions were provided:

#### "Your goal is to classify different physical activities as accurately as possible. To test your code, you're also provided a file called test_time_series.csv, and at the end of the project you're asked to provide the activity labels predicted by your code for this test data set. Only the course staff have the corresponding true labels for the test data, and the accuracy of your code will be determined as the percentage of correct classifications. Note that in both cases, for training and testing, the input file consists of a single (3-dimensional) time series. To test the accuracy of your code, you'll be asked to upload your predictions as a CSV file. This file called test_labels.csv is provided to you, but it only contains the time stamps needed for prediction; you'll need to augment this file by adding the corresponding class predictions (1,2,3,4)."

In [35]:
# Import standard packages
import sklearn
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats as ss
from sklearn import datasets
from sklearn import metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
import time

In [36]:
# Start a timer for the whole kNN workflow, then read in the necessary files
start_0 = time.perf_counter()
df_train_ts = pd.read_csv("data/train_time_series.csv")

df_train_labels = pd.read_csv("data/train_labels.csv")

df_test_ts = pd.read_csv("data/test_time_series.csv")

df_test_labels = pd.read_csv("data/test_labels.csv")

In [37]:
#view the datasets

print("Train_TS:")
print(df_train_ts)
print("Train_Labels")
print(df_train_labels)
print("Test_TS: ")
print(df_test_ts)
print("Test_Labels")
print(df_test_labels)

Train_TS:
      Unnamed: 0     timestamp                 UTC time accuracy         x  \
0          20586  1.565110e+12  2019-08-06T16:45:30.787  unknown -0.006485   
1          20587  1.565110e+12  2019-08-06T16:45:30.887  unknown -0.066467   
2          20588  1.565110e+12  2019-08-06T16:45:30.987  unknown -0.043488   
3          20589  1.565110e+12  2019-08-06T16:45:31.087  unknown -0.053802   
4          20590  1.565110e+12  2019-08-06T16:45:31.188  unknown -0.054031   
...          ...           ...                      ...      ...       ...   
3739       24325  1.565110e+12  2019-08-06T16:51:45.638  unknown  0.024384   
3740       24326  1.565110e+12  2019-08-06T16:51:45.738  unknown  0.487228   
3741       24327  1.565110e+12  2019-08-06T16:51:45.838  unknown  0.369446   
3742       24328  1.565110e+12  2019-08-06T16:51:45.939  unknown  0.167877   
3743       24329  1.565110e+12  2019-08-06T16:51:46.039  unknown  0.689346   

             y         z  
0    -0.934860 -0.069046  

In [38]:
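# Prepare data and establish X and y.
# Labels exist for every tenth reading, starting at positional row 3,
# so keep only those rows and drop the non-numeric columns.
# Note: 'Unnamed: 0' (row index) and 'timestamp' remain in X as features.
X = df_train_ts.iloc[3::10].drop(['UTC time', 'accuracy'], axis=1)
y = df_train_labels['label']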


In [39]:
#view altered datasets
print(X)
print(y)

      Unnamed: 0     timestamp         x         y         z
3          20589  1.565110e+12 -0.053802 -0.987701  0.068985
13         20599  1.565110e+12  0.013718 -0.852371 -0.000870
23         20609  1.565110e+12  0.145584 -1.007843 -0.036819
33         20619  1.565110e+12 -0.099380 -1.209686  0.304489
43         20629  1.565110e+12  0.082794 -1.001434 -0.025375
...          ...           ...       ...       ...       ...
3703       24289  1.565110e+12 -0.641953 -1.469177  0.301041
3713       24299  1.565110e+12 -0.171616 -0.366074 -0.059082
3723       24309  1.565110e+12  0.401810 -1.077698  0.258911
3733       24319  1.565110e+12  0.330338 -1.470062  0.303894
3743       24329  1.565110e+12  0.689346 -0.991043  0.034973

[375 rows x 5 columns]
0      1
1      1
2      1
3      1
4      1
      ..
370    4
371    4
372    4
373    4
374    4
Name: label, Length: 375, dtype: int64


In [40]:
# ensure both X and y are of equal length or shape
print(len(X))
print(len(y))

375
375


In [41]:
# set random seed
np.random.seed(42)

## Trial 1:  kNN


Introduction: The dataset consists of triaxial accelerometer readings collected from a smartphone. The goal is to classify the user's movement as accurately as possible, where the classes are 1 = standing, 2 = walking, 3 = stairs down, and 4 = stairs up. Classification was performed with k-Nearest Neighbors (kNN), chosen because the independent variables are three-dimensional (x, y, z). The scikit-learn utilities `KNeighborsClassifier` and `train_test_split` were used to build the model and predict the movement class from the accelerometer data.



Methods: Training and test data were read in using pandas. Labels are provided for only every tenth observation, so the original data files were inspected to determine the row from which every tenth observation should be counted (row 3 for the training series and row 9 for the test series). A kNN classifier was then fit for the reasons outlined above.
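Counting every tenth row by position depends on picking the right starting row by eye. As an alternative, the sketch below aligns readings and labels by joining on the `timestamp` column. It uses small made-up frames (`ts`, `labels`, and all their values are hypothetical) and assumes each label timestamp appears verbatim in the time series, which has not been verified against the project files.

```python
import pandas as pd

# Hypothetical miniature stand-ins for train_time_series.csv and train_labels.csv
ts = pd.DataFrame({
    "timestamp": [100, 200, 300, 400, 500, 600],
    "x": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    "y": [-1.0, -0.9, -1.1, -1.0, -0.8, -1.2],
    "z": [0.0, 0.1, 0.0, -0.1, 0.2, 0.1],
})
labels = pd.DataFrame({"timestamp": [300, 600], "label": [1, 2]})

# Inner join keeps only the readings whose timestamp has a label.
# validate raises an error if a timestamp is duplicated on either side.
aligned = ts.merge(labels, on="timestamp", how="inner", validate="one_to_one")

X_demo = aligned[["x", "y", "z"]]  # accelerometer axes only
y_demo = aligned["label"]
print(aligned)
```

If the join returns fewer rows than there are labels, the timestamps do not match exactly and positional slicing (as used in this notebook) may be the safer choice.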



In [42]:
# Split the data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

In [43]:
# Instantiate kNN model
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

In [44]:
# Select the test readings that need labels (every tenth row, starting at row 9)
Make_preds = df_test_ts.iloc[9::10].drop(['UTC time', 'accuracy'], axis=1)
print(len(Make_preds))

# Predict the held-out split for the classification report
Pred_labels_te = knn.predict(X_test)

# Predict the labels for the project submission
Project_pred_lables_te = knn.predict(Make_preds)


125


In [45]:
print(X_test)
print(Pred_labels_te)
print(len(Pred_labels_te))


      Unnamed: 0     timestamp         x         y         z
1673       22259  1.565110e+12  0.007278 -0.744370  0.162964
333        20919  1.565110e+12 -0.438858 -1.916260 -0.644577
153        20739  1.565110e+12  0.083878 -0.950729 -0.048096
3163       23749  1.565110e+12  0.618240 -1.144638  0.392334
573        21159  1.565110e+12  0.624207 -0.983139 -0.387100
...          ...           ...       ...       ...       ...
943        21529  1.565110e+12  0.017471 -0.742752  0.127029
1963       22549  1.565110e+12  0.395752 -1.168472  0.292664
3503       24089  1.565110e+12  0.542786 -1.187042  0.527435
3123       23709  1.565110e+12  0.257050 -0.921478  0.071121
3493       24079  1.565110e+12  0.740326 -1.319458 -0.196182

[75 rows x 5 columns]
[3 2 2 2 2 2 2 4 3 4 3 2 3 2 2 2 1 2 2 4 1 3 2 3 3 1 2 3 1 4 2 4 2 2 3 4 2
 2 1 3 3 3 2 2 4 4 2 2 1 2 3 1 3 2 2 2 2 1 2 2 4 2 2 2 2 2 3 3 2 2 2 2 2 2
 2]
75


### Results: Modeling produced a predicted class (movement type) for each set of x, y, z coordinates in the test data. Accuracy was calculated on the training split and on the held-out split: 99.3% and 96%, respectively. The 96% held-out accuracy is suspect given the complexity of the data. One possible explanation is that the feature matrix still contains the row index ('Unnamed: 0') and the timestamp, both of which track the order in which the activities were recorded. Instructors were not available to provide further information on the dataset or to verify the predictions. The submission received full marks.
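One way to probe a suspiciously high kNN score is to check whether an ordering column alone can produce it. The sketch below is an illustration on synthetic data, not a rerun of the project: the labels come in contiguous blocks, the `xyz` values are pure noise, and `row_id` is a hypothetical stand-in for a row-index feature. All names and values here are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n_per_class = 250

# Synthetic stand-in: labels occur in contiguous blocks over time,
# while x, y, z are pure noise that carries no class information.
labels = np.repeat([1, 2, 3, 4], n_per_class)
row_id = np.arange(labels.size) * 10  # grows with time, like a row index
xyz = rng.normal(size=(labels.size, 3))


def knn_accuracy(features, targets):
    """Fit a 5-nearest-neighbor model on a random 80/20 split and return test accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, targets, test_size=0.2, random_state=0
    )
    model = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
    return model.score(X_te, y_te)


acc_with_id = knn_accuracy(np.column_stack([row_id, xyz]), labels)
acc_without_id = knn_accuracy(xyz, labels)
print("with row_id:   ", acc_with_id)
print("without row_id:", acc_without_id)
```

In this toy setting the model with `row_id` scores far higher even though the sensor values are uninformative, because neighbors in index space almost always share a label. Whether the same effect explains the score reported here would need to be checked on the real data.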


In [46]:
knn.score(X_train, y_train)

0.9933333333333333

In [47]:
knn.score(X_test, y_test)

0.96

In [48]:
#calculate accuracy score
Accuracy_score_test = knn.score(X_test, y_test)
print('Accuracy Score of test model: ', Accuracy_score_test)
Accuracy_score_train = knn.score(X_train, y_train)
print('Accuracy Score of training model: ', Accuracy_score_train)

#print classification reports to evaluate testing and training data further
print(df_test_labels)
print(len(df_test_labels))
print(Pred_labels_te)
print(len(Pred_labels_te))
print(classification_report(y_test, Pred_labels_te))

Accuracy Score of test model:  0.96
Accuracy Score of training model:  0.9933333333333333
     Unnamed: 0     timestamp                 UTC time  label
0         24339  1.565110e+12  2019-08-06T16:51:47.041    NaN
1         24349  1.565110e+12  2019-08-06T16:51:48.043    NaN
2         24359  1.565110e+12  2019-08-06T16:51:49.046    NaN
3         24369  1.565110e+12  2019-08-06T16:51:50.048    NaN
4         24379  1.565110e+12  2019-08-06T16:51:51.050    NaN
..          ...           ...                      ...    ...
120       25539  1.565110e+12  2019-08-06T16:53:47.366    NaN
121       25549  1.565110e+12  2019-08-06T16:53:48.369    NaN
122       25559  1.565110e+12  2019-08-06T16:53:49.371    NaN
123       25569  1.565110e+12  2019-08-06T16:53:50.373    NaN
124       25579  1.565110e+12  2019-08-06T16:53:51.376    NaN

[125 rows x 4 columns]
125
[3 2 2 2 2 2 2 4 3 4 3 2 3 2 2 2 1 2 2 4 1 3 2 3 3 1 2 3 1 4 2 4 2 2 3 4 2
 2 1 3 3 3 2 2 4 4 2 2 1 2 3 1 3 2 2 2 2 1 2 2 4 2 2 2 2 2 3 3 

In [49]:
# Save the held-out predictions to CSV (work on a copy so X_test itself is not modified)
Pred_X_test_labels = X_test.copy()
Pred_X_test_labels['Predicted Labels'] = Pred_labels_te
print(Pred_X_test_labels)
Pred_X_test_labels.to_csv('data/Predicted_X_test_Labels.csv')

      Unnamed: 0     timestamp         x         y         z  Predicted Labels
1673       22259  1.565110e+12  0.007278 -0.744370  0.162964                 3
333        20919  1.565110e+12 -0.438858 -1.916260 -0.644577                 2
153        20739  1.565110e+12  0.083878 -0.950729 -0.048096                 2
3163       23749  1.565110e+12  0.618240 -1.144638  0.392334                 2
573        21159  1.565110e+12  0.624207 -0.983139 -0.387100                 2
...          ...           ...       ...       ...       ...               ...
943        21529  1.565110e+12  0.017471 -0.742752  0.127029                 2
1963       22549  1.565110e+12  0.395752 -1.168472  0.292664                 2
3503       24089  1.565110e+12  0.542786 -1.187042  0.527435                 2
3123       23709  1.565110e+12  0.257050 -0.921478  0.071121                 2
3493       24079  1.565110e+12  0.740326 -1.319458 -0.196182                 2

[75 rows x 6 columns]


In [50]:
#insert predictions into provided csv file
df_test_labels['label'] = Project_pred_lables_te
Final_predictions = df_test_labels
print(Final_predictions)
Final_predictions.to_csv('data/Final_predictions.csv')

     Unnamed: 0     timestamp                 UTC time  label
0         24339  1.565110e+12  2019-08-06T16:51:47.041      4
1         24349  1.565110e+12  2019-08-06T16:51:48.043      4
2         24359  1.565110e+12  2019-08-06T16:51:49.046      4
3         24369  1.565110e+12  2019-08-06T16:51:50.048      4
4         24379  1.565110e+12  2019-08-06T16:51:51.050      4
..          ...           ...                      ...    ...
120       25539  1.565110e+12  2019-08-06T16:53:47.366      4
121       25549  1.565110e+12  2019-08-06T16:53:48.369      4
122       25559  1.565110e+12  2019-08-06T16:53:49.371      4
123       25569  1.565110e+12  2019-08-06T16:53:50.373      4
124       25579  1.565110e+12  2019-08-06T16:53:51.376      4

[125 rows x 4 columns]


In [51]:
end_0 = time.perf_counter() 
run_time_0 = end_0 - start_0
print("The run time for this process was: " + str(run_time_0))

The run time for this process was: 13.456000000000017


## Additional ML algorithm trials:

### Trial 2:  LinearSVC

In [52]:
from sklearn.svm import LinearSVC

# Set up random seed
np.random.seed(42)

# Make the data (x, y, z columns only)
X = df_train_ts.iloc[3::10, 4:7]
y = df_train_labels['label']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

# Instantiate Linear SVC
clf = LinearSVC()
clf.fit(X_train, y_train)

# Evaluate the Linear SVC
clf.score(X_test, y_test)



0.5733333333333334

### Trial 3:  Support Vector Machines (SVM)

In [53]:
from sklearn import svm
from sklearn.svm import SVC

In [54]:
### try more pre-processing to improve the results
#from sklearn.pipeline import make_pipeline
#from sklearn.preprocessing import StandardScaler


#clf = make_pipeline(StandardScaler(), SVC())
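The commented-out cell above points toward scaling the features before fitting an SVC. A minimal sketch of that pipeline follows, run on synthetic data rather than the accelerometer files; `labels_demo`, `informative`, and `noise` are made-up names and values chosen so that the two features sit on very different scales.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in data (not the accelerometer files):
# one informative feature and one large-scale noise feature.
labels_demo = np.repeat([1, 2, 3, 4], 100)
informative = labels_demo * 3 + rng.normal(scale=0.5, size=labels_demo.size)
noise = rng.normal(scale=1000.0, size=labels_demo.size)
X_demo = np.column_stack([informative, noise])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, labels_demo, test_size=0.2, random_state=0, stratify=labels_demo
)

# The scaler is fit on the training split only, then applied to the test split
scaled_svc = make_pipeline(StandardScaler(), SVC())
scaled_svc.fit(X_tr, y_tr)
scaled_score = scaled_svc.score(X_te, y_te)
print(scaled_score)
```

Because the x, y, z readings in this project already share similar ranges, scaling may change the real results much less than it does in this contrived example.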

In [55]:
# Fit a support vector classifier with default settings (RBF kernel)
clf = svm.SVC()
clf.fit(X_train, y_train)

In [56]:
clf.score(X_test, y_test)

0.5733333333333334

### Trial 4:  Gradient Boosting Classifier

In [57]:
from sklearn.ensemble import GradientBoostingClassifier

In [58]:
# Fit on the accelerometer split from Trial 2 (same X_train, y_train as above)
clf = GradientBoostingClassifier(n_estimators=5, learning_rate=1.0, max_depth=1, random_state=0).fit(X_train, y_train)
clf.score(X_test, y_test)

0.5466666666666666

### Trial 5:  Spectral Clustering

In [59]:
from sklearn.cluster import SpectralClustering

# Spectral clustering is unsupervised: fit() ignores y_train, so no accuracy score is reported
clf = SpectralClustering(random_state=0).fit(X_train, y_train)

SpectralClustering(n_clusters=4, random_state=0)

### Trial 6:  SGD Classifier

In [60]:
from sklearn.linear_model import SGDClassifier
clf = SGDClassifier(loss="hinge", penalty="l2", max_iter=5)
clf.fit(X_train, y_train)
SGDClassifier(max_iter=10)



In [61]:
clf.score(X_test, y_test)

0.49333333333333335

### Trial 7:  Random Forest Classifier

In [62]:
from sklearn.ensemble import RandomForestClassifier

# Set up random seed
np.random.seed(42)

# Make the data (x, y, z columns only)
X = df_train_ts.iloc[3::10, 4:7]
y = df_train_labels['label']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

# Instantiate the Random Forest Classifier
clf = RandomForestClassifier(n_estimators = 100)
clf.fit(X_train, y_train)

# Evaluate the Random Forest Classifier
print(clf.score(X_train, y_train))
print(clf.score(X_test, y_test))

1.0
0.5333333333333333


In [63]:
from sklearn.model_selection import cross_val_score
np.random.seed(42)

cvs = cross_val_score(clf, X, y, cv = 5)

In [64]:
np.mean(cvs)

0.5813333333333334

### Trial 8: Logistic Regression

In [65]:
# Split the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)


In [66]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(random_state=0).fit(X_train, y_train)

In [67]:
clf.score(X_test, y_test)


0.6
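Most of the trials above are scored on a single 80/20 split, so differences of a few points between models may be noise. The sketch below shows one way to compare several classifiers under the same 5-fold cross-validation. It runs on a synthetic four-class, three-feature dataset from `make_classification`, so the printed numbers say nothing about the accelerometer data; the `models` dictionary and its settings are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic four-class, three-feature stand-in for the x, y, z readings
X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

# Candidate models, each evaluated on identical folds
models = {
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean accuracy over 5 folds for each model
cv_means = {
    name: cross_val_score(model, X_demo, y_demo, cv=5).mean()
    for name, model in models.items()
}

# Print the models from best to worst mean accuracy
for name, mean_acc in sorted(cv_means.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {mean_acc:.3f}")
```

Swapping `X_demo` and `y_demo` for the project's `X` and `y` would apply the same comparison to the real readings.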