# Test ML pipeline on ACC/GYR feature matrix

# Supervised Machine Learning Pipeline - Multi-class Classification

# Overview

Use the activity recognition test dataset to:
1. Perform multi-class classification of activity recognition tasks (6 classes) using 131 time- and frequency-domain features per sensor type (262 feature columns across ACC and GYR).
2. Compare classifiers using a machine learning pipeline.

Classifiers compared
- k nearest neighbors (not a linear model; included as a baseline)
- logistic regression
- SVM - LinearSVC
- SVM - SVC, which uses a nonlinear (RBF) kernel by default

# Import packages

In [1]:
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
import sklearn.datasets
import pandas as pd
import numpy as np
import seaborn as sns

# Loading data

In [9]:
# load test set - acc/gyr feature matrix
testfile = r'//FS2.smpp.local\RTO\Inpatient Sensors -Stroke\Data analysis\Analysis_ActivityRecognition\accgyr_feature_to_pipeline.csv'
df = pd.read_csv(testfile)

In [10]:
df.head(2)

Unnamed: 0,subject,date,test,task,trial,location,acc-rawdata,gyr-rawdata,acc-meanX,acc-meanY,...,gyr-meanpower_bin11_z,gyr-meanpower_bin12_z,gyr-meanpower_bin13_z,gyr-meanpower_bin14_z,gyr-meanpower_bin15_z,gyr-meanpower_bin16_z,gyr-meanpower_bin17_z,gyr-meanpower_bin18_z,gyr-meanpower_bin19_z,gyr-meanpower_bin20_z
0,HC02,temp date,activity recognition,LYING,0,sacrum,Accel X (g) Accel Y (g) Acce...,Gyro X (°/s) Gyro Y (°/s) Gy...,-0.058348,0.183076,...,7.5e-05,7.7e-05,6.7e-05,5.7e-05,5.5e-05,7.9e-05,5.4e-05,6.2e-05,4.5e-05,4.5e-05
1,HC02,temp date,activity recognition,LYING,0,distal_lateral_shank_right,Accel X (g) Accel Y (g) Acce...,Gyro X (°/s) Gyro Y (°/s) Gy...,0.105776,0.906628,...,0.000336,0.000195,0.000197,0.000209,0.000138,0.000108,9.1e-05,5.7e-05,4.7e-05,6.5e-05


In [11]:
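# Encode the task label as an integer target
df.insert(8, 'target_category', df['task'].astype('category').cat.codes)
# Drop metadata, raw-signal, and target columns by name so that only numeric features remain
meta_cols = ['Unnamed: 0', 'subject', 'date', 'test', 'task', 'trial', 'location',
             'acc-rawdata', 'gyr-rawdata', 'target_category']
X = df.drop(columns=meta_cols)
y = df['target_category']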

# stratified sampling
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Exploratory data analysis (EDA)

In [12]:
# check dimensions
print('AR data dimensions: ', X.shape)
print('AR target dimensions: ', y.shape)

AR data dimensions:  (110, 262)
AR target dimensions:  (110,)


In [13]:
# check dimensions
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(88, 262) (88,) (22, 262) (22,)
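A quick way to confirm what `stratify=y` does is to compare class proportions before and after the split. The sketch below uses synthetic labels as a stand-in for the activity data; the class sizes and the names `X_demo` and `y_demo` are illustrative assumptions, not values taken from the CSV.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 110 samples, 6 imbalanced classes (sizes are assumptions)
rng = np.random.RandomState(42)
y_demo = np.repeat(np.arange(6), [10, 20, 10, 10, 40, 20])
X_demo = rng.normal(size=(y_demo.size, 4))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Share of each class in the full set, the training split, and the test split
full_share = np.bincount(y_demo, minlength=6) / y_demo.size
train_share = np.bincount(y_tr, minlength=6) / y_tr.size
test_share = np.bincount(y_te, minlength=6) / y_te.size

for label in range(6):
    print(label,
          round(full_share[label], 2),
          round(train_share[label], 2),
          round(test_share[label], 2))
```

With stratification, the three columns should be nearly identical for every class, which matters here because some classes contribute only a couple of test samples.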


# 1. k nearest neighbors (knn)

The kNN models below reuse the stratified 80/20 split (`X_train`, `X_test`, `y_train`, `y_test`) created in the Loading data section.

In [14]:
from sklearn.neighbors import KNeighborsClassifier

# Create and fit the model with default hyperparameters
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [15]:
# default n_neighbors=5
knn.score(X_test, y_test)

0.5909090909090909

In [16]:
# test k=6 with parameter n_neighbors
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X_train, y_train)
knn.score(X_test, y_test)

0.6363636363636364

In [25]:
# hyperparameter tuning for n_neighbors
# search k = 5..29; starting the grid at k = 1 overfits
param_grid = {'n_neighbors': np.arange(5, 30)}

# Option 1
knn = KNeighborsClassifier()
# Option 2 - scaling doesn't work here
# steps = [('scaler', StandardScaler()),
#          ('knn', KNeighborsClassifier())]
# pipeline = Pipeline(steps)

# args: model, grid, number of folds for cross validation
knn_cv = GridSearchCV(knn, param_grid, cv=5)
knn_cv.fit(X_train, y_train)

# Print the tuned parameters and score
print("Tuned KNN Parameters: {}".format(knn_cv.best_params_)) 
print("Best score is {}".format(knn_cv.best_score_))

print("knn training accuracy:", knn_cv.score(X_train, y_train))
print("knn test accuracy    :", knn_cv.score(X_test, y_test))

Tuned KNN Parameters: {'n_neighbors': 5}
Best score is 0.5454545454545454
knn training accuracy: 0.7272727272727273
knn test accuracy    : 0.5909090909090909


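## knn overfits
- Training accuracy (0.73) exceeds both the test accuracy (0.59) and the best cross-validation score (0.55).
- Possible next step: add feature scaling.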
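The scaling idea can be sketched by putting `StandardScaler` and the kNN model in one `Pipeline` and tuning that pipeline. One detail that is easy to miss, and possibly (an assumption) why the commented-out pipeline did not work, is that grid keys for a pipeline need the step-name prefix, for example `knn__n_neighbors`. The data below is synthetic and only stands in for the feature matrix.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix (not the notebook's data)
X_demo, y_demo = make_classification(
    n_samples=120, n_features=20, n_informative=6, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

knn_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Keys follow "<step name>__<parameter>"; a bare "n_neighbors" key fails
# because the Pipeline object itself has no such parameter.
knn_param_grid = {"knn__n_neighbors": np.arange(5, 30)}

knn_scaled_cv = GridSearchCV(knn_pipeline, knn_param_grid, cv=5)
knn_scaled_cv.fit(X_tr, y_tr)

print("Tuned parameters:", knn_scaled_cv.best_params_)
print("Training accuracy:", knn_scaled_cv.score(X_tr, y_tr))
print("Test accuracy    :", knn_scaled_cv.score(X_te, y_te))
```

Because the scaler sits inside the pipeline, it is refit on the training portion of every fold, so no information from the validation fold leaks into the scaling.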

# 2. Logistic Regression - multi-class
Key hyperparameters:
- C (inverse regularization strength)
- penalty (type of regularization - L1 and L2)
- multi_class (multi-class strategy: one-vs-rest or multinomial)

## 2.1 One-vs-Rest

In [26]:
# LogisticRegression
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)
# lr.predict(X_test)
# lr.score(X_test, y_test)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [27]:
lr.score(X_test, y_test)

0.7272727272727273

In [28]:
# Fit one-vs-rest logistic regression classifier
# lr_ovr = LogisticRegression()
lr_ovr = OneVsRestClassifier(LogisticRegression()) 
lr_ovr.fit(X_train, y_train)

print("OVR training accuracy:", lr_ovr.score(X_train, y_train))
print("OVR test accuracy    :", lr_ovr.score(X_test, y_test))

OVR training accuracy: 1.0
OVR test accuracy    : 0.7272727272727273


In [29]:
y_pred_lr_ovr = lr_ovr.predict(X_test)
print(classification_report(y_test, y_pred_lr_ovr))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred_lr_ovr))

             precision    recall  f1-score   support

          0       0.33      0.50      0.40         2
          1       0.50      0.25      0.33         4
          2       1.00      0.50      0.67         2
          3       1.00      0.50      0.67         2
          4       0.89      1.00      0.94         8
          5       0.67      1.00      0.80         4

avg / total       0.75      0.73      0.71        22

Confusion matrix:
 [[1 1 0 0 0 0]
 [2 1 0 0 1 0]
 [0 0 1 0 0 1]
 [0 0 0 1 0 1]
 [0 0 0 0 8 0]
 [0 0 0 0 0 4]]


Task code:
- Lying 0
- Sitting 1
- Stairs dn 2
- Stairs up 3
- Standing 4
- Walking 5
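The task codes come from `astype('category').cat.codes`, which numbers the categories in sorted order. Rather than maintaining the table by hand, the mapping can be recovered from the categorical itself. The sketch below uses toy labels; only `LYING` appears in the data preview, so the other label spellings are assumptions.

```python
import pandas as pd

# Toy labels (spellings other than "LYING" are assumptions for illustration)
tasks = pd.Series([
    "WALKING", "LYING", "SITTING", "STANDING", "STAIRS DOWN", "STAIRS UP", "LYING",
])

task_cat = tasks.astype("category")

# Categories are stored in sorted order; the position of each is its integer code
code_to_task = dict(enumerate(task_cat.cat.categories))
print(code_to_task)

# The codes used as the classification target
print(task_cat.cat.codes.tolist())
```

Passing the resulting names to `classification_report(..., target_names=...)` would label the report rows with activities instead of integers.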

## 2.2 Softmax/Multinomial/Cross-Entropy Loss

In [30]:
lr_mn = LogisticRegression(multi_class="multinomial",solver="lbfgs")
lr_mn.fit(X_train, y_train)

print("Softmax training accuracy:", lr_mn.score(X_train, y_train))
print("Softmax test accuracy    :", lr_mn.score(X_test, y_test))

Softmax training accuracy: 1.0
Softmax test accuracy    : 0.6818181818181818


In [31]:
y_pred_lr_mn = lr_mn.predict(X_test)
print(classification_report(y_test, y_pred_lr_mn))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred_lr_mn))

             precision    recall  f1-score   support

          0       0.50      1.00      0.67         2
          1       0.50      0.25      0.33         4
          2       0.33      0.50      0.40         2
          3       0.00      0.00      0.00         2
          4       0.88      0.88      0.88         8
          5       1.00      1.00      1.00         4

avg / total       0.67      0.68      0.66        22

Confusion matrix:
 [[2 0 0 0 0 0]
 [2 1 0 0 1 0]
 [0 0 1 1 0 0]
 [0 0 2 0 0 0]
 [0 1 0 0 7 0]
 [0 0 0 0 0 4]]


## Optional: L1 regularization

In [32]:
# Specify L1 regularization (the liblinear solver supports the L1 penalty)
lr = LogisticRegression(penalty='l1', solver='liblinear')

# Instantiate the GridSearchCV object and run the search
searcher = GridSearchCV(lr, {'C':[0.001, 0.01, 0.1, 1, 10, 100]})
searcher.fit(X_train, y_train)

# Report the best parameters
print("Best CV params", searcher.best_params_)

# Count nonzero coefficients. coef_ has shape (n_classes, n_features), so
# coefs.size is 6 classes x 262 features = 1572 coefficients, not 1572 features.
best_lr = searcher.best_estimator_
coefs = best_lr.coef_
print("Total number of features:", coefs.size)
print("Number of selected features:", np.count_nonzero(coefs))

# with l1 reg - C=1
# without l1 reg - C=10

Best CV params {'C': 100}
Total number of features: 1572
Number of selected features: 373


In [33]:
print("L1 reg training accuracy:", searcher.score(X_train, y_train))
print("L1 reg test accuracy    :", searcher.score(X_test, y_test))

L1 reg training accuracy: 1.0
L1 reg test accuracy    : 0.7272727272727273
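Since the coefficient array holds one row per class, counting nonzero entries counts coefficients rather than features. A feature can be treated as selected if any class gives it a nonzero weight. The sketch below shows both counts on synthetic data, using an explicit one-vs-rest wrapper around L1-penalized logistic regression; the dataset, `C=0.1`, and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 3 classes, 30 features, only a few of them informative
X_demo, y_demo = make_classification(
    n_samples=150, n_features=30, n_informative=5, n_classes=3, random_state=0
)
X_scaled = StandardScaler().fit_transform(X_demo)

# One binary L1-penalized model per class
ovr_l1 = OneVsRestClassifier(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
)
ovr_l1.fit(X_scaled, y_demo)

# Stack per-class coefficient rows into shape (n_classes, n_features)
coefs = np.vstack([est.coef_ for est in ovr_l1.estimators_])

n_coefficients = coefs.size
n_nonzero_coefficients = np.count_nonzero(coefs)
# A feature counts as selected if at least one class uses it
n_selected_features = int(np.any(coefs != 0, axis=0).sum())

print("Coefficient matrix shape:", coefs.shape)
print("Nonzero coefficients    :", n_nonzero_coefficients, "of", n_coefficients)
print("Selected features       :", n_selected_features, "of", coefs.shape[1])
```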


# 3. LinearSVC for SVM

In [34]:
# LinearSVC
from sklearn.svm import LinearSVC

In [35]:
linearsvm = LinearSVC()
linearsvm.fit(X_train, y_train)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

In [36]:
print("LinearSVC training accuracy:", linearsvm.score(X_train, y_train))
print("LinearSVC test accuracy    :", linearsvm.score(X_test, y_test))

LinearSVC training accuracy: 1.0
LinearSVC test accuracy    : 0.8181818181818182


In [37]:
LinearSVC().get_params().keys()

dict_keys(['C', 'class_weight', 'dual', 'fit_intercept', 'intercept_scaling', 'loss', 'max_iter', 'multi_class', 'penalty', 'random_state', 'tol', 'verbose'])

In [38]:
# CV and scaling in a pipeline using standardization (StandardScaler)
steps = [('scaler', StandardScaler()),
         ('svc', LinearSVC())]
pipeline = Pipeline(steps)

# Specify hyperparameter space using a dictionary
parameters = {'svc__C':[0.1, 1, 10]}

X_train_svc, X_test_svc, y_train_svc, y_test_svc = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

cv = GridSearchCV(pipeline, param_grid=parameters)
cv.fit(X_train_svc, y_train_svc)
y_pred_svc = cv.predict(X_test_svc)

# Compute and print metrics
# print best parameters
print("Tuned Model Parameters: {}".format(cv.best_params_))
print("LinearSVC training accuracy:", cv.score(X_train_svc, y_train_svc))
print("Test Accuracy: {}".format(cv.score(X_test_svc, y_test_svc)))
print(classification_report(y_test_svc, y_pred_svc))
print("Confusion matrix:\n", confusion_matrix(y_test_svc, y_pred_svc))

Tuned Model Parameters: {'svc__C': 0.1}
LinearSVC training accuracy: 1.0
Test Accuracy: 0.8181818181818182
             precision    recall  f1-score   support

          0       1.00      0.50      0.67         2
          1       0.50      0.25      0.33         4
          2       1.00      1.00      1.00         2
          3       0.67      1.00      0.80         2
          4       0.80      1.00      0.89         8
          5       1.00      1.00      1.00         4

avg / total       0.81      0.82      0.79        22

Confusion matrix:
 [[1 1 0 0 0 0]
 [0 1 0 1 2 0]
 [0 0 2 0 0 0]
 [0 0 0 2 0 0]
 [0 0 0 0 8 0]
 [0 0 0 0 0 4]]


Misclassifications in the confusion matrix above (rows are true classes, columns are predictions):
- Lying predicted as Sitting 1
- Sitting predicted as Stairs up 1
- Sitting predicted as Standing 2

Task code:
- Lying 0
- Sitting 1
- Stairs dn 2
- Stairs up 3
- Standing 4
- Walking 5
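Reading misclassifications off a confusion matrix by hand is error-prone, so it can help to list the off-diagonal entries programmatically. The sketch below hard-codes the matrix printed for the tuned LinearSVC pipeline together with the task codes; in scikit-learn's convention, rows are true classes and columns are predicted classes.

```python
import numpy as np

task_names = ["Lying", "Sitting", "Stairs dn", "Stairs up", "Standing", "Walking"]

# Confusion matrix printed for the tuned LinearSVC pipeline
cm = np.array([
    [1, 1, 0, 0, 0, 0],
    [0, 1, 0, 1, 2, 0],
    [0, 0, 2, 0, 0, 0],
    [0, 0, 0, 2, 0, 0],
    [0, 0, 0, 0, 8, 0],
    [0, 0, 0, 0, 0, 4],
])

# Keep only off-diagonal, nonzero cells: (true class, predicted class, count)
confusions = [
    (task_names[t], task_names[p], int(cm[t, p]))
    for t, p in zip(*np.nonzero(cm))
    if t != p
]
for true_name, pred_name, count in confusions:
    print(f"{true_name} predicted as {pred_name}: {count}")

# Overall accuracy is the diagonal sum over the total count
accuracy = np.trace(cm) / cm.sum()
print("Accuracy:", round(accuracy, 4))
```

This prints three misclassification pairs and an accuracy of 0.8182, matching the reported test accuracy.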

# 4. SVC - default nonlinear SVM

In [39]:
# SVC (RBF kernel by default)
from sklearn.svm import SVC
svm = SVC()

In [40]:
svm.fit(X_train, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [41]:
print("SVC training accuracy:", svm.score(X_train, y_train))
print("SVC test accuracy    :", svm.score(X_test, y_test))

# overfit model: training accuracy is far above test accuracy

SVC training accuracy: 0.9886363636363636
SVC test accuracy    : 0.36363636363636365


## 4.1 SVC: Tune hyperparameters to improve test accuracy

In [42]:
# Instantiate an RBF SVM
svm = SVC()

# Instantiate the GridSearchCV object and run the search
parameters = {'C':[0.1, 1, 10], 'gamma':[0.00001, 0.0001, 0.001, 0.01, 0.1]}
searcher = GridSearchCV(svm, parameters)
searcher.fit(X_train,y_train)

# Report the best parameters and the corresponding score
print("Best CV params", searcher.best_params_)
print("Best CV accuracy", searcher.best_score_)

# Report the training accuracy using these best parameters
print("Train accuracy of best grid search hypers:", 
      searcher.score(X_train,y_train))
# Report the test accuracy using these best parameters
print("Test accuracy of best grid search hypers:", 
      searcher.score(X_test, y_test))

Best CV params {'C': 10, 'gamma': 1e-05}
Best CV accuracy 0.4431818181818182
Train accuracy of best grid search hypers: 0.9772727272727273
Test accuracy of best grid search hypers: 0.36363636363636365


## 4.2 SVC: normalize data, tune hyperparameters and check final result

In [44]:
# CV and scaling in a pipeline
steps = [('scaler', StandardScaler()),
         ('svm', SVC())]
pipeline = Pipeline(steps)

# Specify hyperparameter space using a dictionary
parameters = {'svm__C':[0.1, 1, 10],
              'svm__gamma':[0.00001, 0.0001, 0.001, 0.01, 0.1]}

X_train_svm, X_test_svm, y_train_svm, y_test_svm = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

cv = GridSearchCV(pipeline, param_grid=parameters)
cv.fit(X_train_svm, y_train_svm)
y_pred_svm = cv.predict(X_test_svm)

# Compute and print metrics
# print best parameters
print("Tuned Model Parameters: {}".format(cv.best_params_))
print("Train Accuracy: {}".format(cv.score(X_train_svm, y_train_svm)))
print("Test Accuracy: {}".format(cv.score(X_test_svm, y_test_svm)))
print(classification_report(y_test_svm, y_pred_svm))
print("Confusion matrix:\n", confusion_matrix(y_test_svm, y_pred_svm))

Tuned Model Parameters: {'svm__C': 10, 'svm__gamma': 0.001}
Train Accuracy: 1.0
Test Accuracy: 0.8636363636363636
             precision    recall  f1-score   support

          0       1.00      1.00      1.00         2
          1       1.00      0.50      0.67         4
          2       1.00      1.00      1.00         2
          3       0.50      0.50      0.50         2
          4       0.89      1.00      0.94         8
          5       0.80      1.00      0.89         4

avg / total       0.88      0.86      0.85        22

Confusion matrix:
 [[2 0 0 0 0 0]
 [0 2 0 1 1 0]
 [0 0 2 0 0 0]
 [0 0 0 1 0 1]
 [0 0 0 0 8 0]
 [0 0 0 0 0 4]]


Misclassifications in the confusion matrix above (rows are true classes, columns are predictions):
- Sitting predicted as Stairs up 1
- Sitting predicted as Standing 1
- Stairs up predicted as Walking 1

Task code:
- Lying 0
- Sitting 1
- Stairs dn 2
- Stairs up 3
- Standing 4
- Walking 5

# Everything overfits
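To put a number on the overfitting, one option is to report the gap between training and validation accuracy for every model under the same cross-validation scheme. The sketch below does this with `cross_validate(..., return_train_score=True)` on synthetic data; the dataset, the model settings, and the names `models` and `gaps` are assumptions for illustration, not the notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

# Synthetic stand-in: few samples relative to the number of features
X_demo, y_demo = make_classification(
    n_samples=110, n_features=40, n_informative=8, n_classes=3, random_state=42
)

models = {
    "knn": KNeighborsClassifier(),
    "logreg": LogisticRegression(max_iter=1000),
    "linearsvc": LinearSVC(max_iter=10000),
    "svc_rbf": SVC(),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
gaps = {}
for name, model in models.items():
    # Scaling lives inside the pipeline so it is refit on each training fold
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_validate(pipe, X_demo, y_demo, cv=cv, return_train_score=True)
    train_acc = scores["train_score"].mean()
    val_acc = scores["test_score"].mean()
    gaps[name] = train_acc - val_acc
    print(f"{name:10s} train={train_acc:.2f} validation={val_acc:.2f} gap={gaps[name]:.2f}")
```

A large positive gap points to overfitting. With roughly 88 training samples and 262 features, stronger regularization (smaller `C`) or feature selection are natural things to try next.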