This notebook runs a KNN model to get a good performance baseline on four binary classification tasks:
1. Control vs. PD/MSA/PSP
2. PD vs. MSA/PSP
3. MSA vs. PD/PSP
4. PSP vs. PD/MSA

## Imports and Function Definitions

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import StandardScaler, Normalizer
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, precision_recall_fscore_support
from sklearn.model_selection import StratifiedKFold, cross_val_score, KFold, train_test_split, LeaveOneOut, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler 
from imblearn.pipeline import Pipeline

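def grid_search_optimization(model, tuned_parameters, X, y, Xh, yh, cv=5, scoring='accuracy', verbose=False):
    """Grid-search `model` on (X, y) with `cv`-fold CV, then report on the holdout set (Xh, yh).

    Returns the fitted GridSearchCV object, refit on all of (X, y) with the best parameters.
    """
    print("# Tuning hyper-parameters for %s" % scoring)
    print()

    clf = GridSearchCV(model, tuned_parameters, cv=cv, n_jobs=-1, scoring=scoring, verbose=1)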
    clf.fit(X, y)

    print("Best parameters set found on development set:")
    print()
    print(clf.best_params_)
    print()
    if verbose:
        print("Grid scores on development set:")
        print()
        means = clf.cv_results_['mean_test_score']
        stds = clf.cv_results_['std_test_score']
        for mean, std, params in zip(means, stds, clf.cv_results_['params']):
            print("%0.3f (+/-%0.03f) for %r"
                  % (mean, std * 2, params))
        print()

    print("Detailed classification report (holdout):")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    y_true, y_pred = yh, clf.predict(Xh)
    print(classification_report(y_true, y_pred))
    print()
    
    return clf

def group_classes(data, grouping):
    # Keep only the rows whose GroupID is a key of `grouping`, then relabel
    # GroupID alone so that feature values are never rewritten.
    data_to_keep = data.loc[data['GroupID'].isin(list(grouping))].copy()
    data_to_keep['GroupID'] = data_to_keep['GroupID'].map(grouping)
    return data_to_keep
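
The grouping dictionaries used in the cells below map original `GroupID` values to binary labels and drop any class that is not a key. A minimal sketch of that relabelling on a toy frame, assuming the intent is to change only the `GroupID` column; `relabel_groups`, `toy`, and `feature_a` are hypothetical names used for illustration:

```python
import pandas as pd

def relabel_groups(data, grouping, label_col="GroupID"):
    """Keep rows whose label is a key of `grouping` and map it to the new label."""
    kept = data.loc[data[label_col].isin(list(grouping))].copy()
    kept[label_col] = kept[label_col].map(grouping)
    return kept

# Toy frame: four classes (0..3) and one made-up feature column.
toy = pd.DataFrame({
    "GroupID": [0, 1, 2, 3, 1, 2],
    "feature_a": [1.0, 2.0, 3.0, 1.0, 0.5, 2.0],
})

# Task 2 grouping: drop class 0, class 1 becomes 0, classes 2 and 3 become 1.
task2 = relabel_groups(toy, {1: 0, 2: 1, 3: 1})
print(task2["GroupID"].tolist())    # [0, 1, 1, 0, 1]
print(task2["feature_a"].tolist())  # [2.0, 3.0, 1.0, 0.5, 2.0]
```

Mapping only the label column keeps feature values that happen to equal 1, 2, or 3 from being relabelled along with the classes.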

In [2]:
# Hyperparameters and the ranges to search
param_grid = {
    "classifier__n_neighbors": range(1,25,1),
    "PCA__n_components": range(1,113,10)
    #"fss__k" : range(1,115,3)
}

clf = Pipeline([
#     ('Norm', Normalizer()),
#     ('Undersampler', RandomUnderSampler()),
    #("Oversample", RandomOverSampler()),
#     ('fss',SelectKBest()),
    ('Scaler', StandardScaler()),
    ('PCA', PCA()),
    ('classifier', KNeighborsClassifier(weights='uniform', n_jobs=-1))
])
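
The size of the search follows directly from the two ranges in `param_grid`. A quick check in plain Python, assuming the 20-fold setting (`cv=20`) used in the cells that follow; the variable names are illustrative:

```python
from itertools import product

# Same ranges as the param_grid above.
n_neighbors_values = list(range(1, 25, 1))     # 1, 2, ..., 24
n_components_values = list(range(1, 113, 10))  # 1, 11, ..., 111

# Every (n_components, n_neighbors) pair is one candidate pipeline.
candidates = list(product(n_components_values, n_neighbors_values))
n_folds = 20  # the cv value passed to grid_search_optimization

print(len(n_neighbors_values), len(n_components_values))  # 24 12
print(len(candidates))            # 288
print(len(candidates) * n_folds)  # 5760
```

This matches the "288 candidates, totalling 5760 fits" line printed by each search.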

## 1. Control vs. PD/MSA/PSP

In [3]:
# Get the data
data1 = pd.read_excel('training_data.xlsx')
data1 = group_classes(data1, {0:0, 1:1, 2:1, 3:1})

y1 = data1['GroupID']
X1 = data1.drop(['GroupID'], axis=1)

X_train1, X_test1, Y_train1, Y_test1 = train_test_split(X1, y1, test_size=0.35, random_state=42)

best1 = grid_search_optimization(clf, param_grid, X_train1, Y_train1, X_test1, Y_test1, cv=20, scoring='f1_micro')

# Tuning hyper-parameters for f1_micro

Fitting 20 folds for each of 288 candidates, totalling 5760 fits


[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    8.3s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   12.9s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:   20.6s
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:   31.5s
[Parallel(n_jobs=-1)]: Done 1234 tasks      | elapsed:   44.3s
[Parallel(n_jobs=-1)]: Done 1784 tasks      | elapsed:   59.8s
[Parallel(n_jobs=-1)]: Done 2434 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done 3184 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done 4034 tasks      | elapsed:  2.1min
[Parallel(n_jobs=-1)]: Done 4984 tasks      | elapsed:  2.6min
[Parallel(n_jobs=-1)]: Done 5760 out of 5760 | elapsed:  3.0min finished


Best parameters set found on development set:

{'PCA__n_components': 41, 'classifier__n_neighbors': 10}

Detailed classification report (holdout):

The model is trained on the full development set.
The scores are computed on the full evaluation set.

             precision    recall  f1-score   support

          0       0.63      0.45      0.53        95
          1       0.80      0.89      0.84       227

avg / total       0.75      0.76      0.75       322
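
A note on the `f1_micro` scoring used for the search: when every sample has exactly one label, micro-averaged F1 pools true positives, false positives, and false negatives over all classes, and the result equals plain accuracy. A small pure-Python sketch on made-up labels (the helper functions are illustrative, not scikit-learn's implementation):

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_micro(y_true, y_pred):
    """Micro-averaged F1: pool TP/FP/FN over all labels, then compute F1."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for t, p in zip(y_true, y_pred):
            tp += (p == label and t == label)
            fp += (p == label and t != label)
            fn += (p != label and t == label)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up binary labels for illustration only.
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 1, 0]
print(accuracy(y_true, y_pred), f1_micro(y_true, y_pred))  # 0.625 0.625
```

Because the two metrics coincide here, the grid search effectively optimises accuracy, which can favour the majority class on imbalanced tasks.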




## 2. PD vs. MSA/PSP

In [4]:
# Get the data
data2 = pd.read_excel('training_data.xlsx')
data2 = group_classes(data2, {1:0, 2:1, 3:1})

y2 = data2['GroupID']
X2 = data2.drop(['GroupID'], axis=1)

X_train2, X_test2, Y_train2, Y_test2 = train_test_split(X2, y2, test_size=0.35, random_state=42)

best2 = grid_search_optimization(clf, param_grid, X_train2, Y_train2, X_test2, Y_test2, cv=20, scoring='f1_micro')

# Tuning hyper-parameters for f1_micro

Fitting 20 folds for each of 288 candidates, totalling 5760 fits


[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    8.4s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   12.9s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:   20.7s
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:   31.3s
[Parallel(n_jobs=-1)]: Done 1234 tasks      | elapsed:   44.5s
[Parallel(n_jobs=-1)]: Done 1784 tasks      | elapsed:  1.0min
[Parallel(n_jobs=-1)]: Done 2434 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done 3184 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done 4034 tasks      | elapsed:  2.1min
[Parallel(n_jobs=-1)]: Done 4984 tasks      | elapsed:  2.5min
[Parallel(n_jobs=-1)]: Done 5760 out of 5760 | elapsed:  2.9min finished


Best parameters set found on development set:

{'PCA__n_components': 31, 'classifier__n_neighbors': 1}

Detailed classification report (holdout):

The model is trained on the full development set.
The scores are computed on the full evaluation set.

             precision    recall  f1-score   support

          0       0.90      0.93      0.91       164
          1       0.80      0.74      0.77        66

avg / total       0.87      0.87      0.87       230




## 3. MSA vs. PD/PSP

In [5]:
# Get the data
data3 = pd.read_excel('training_data.xlsx')
data3 = group_classes(data3, {1:0, 3:0, 2:1})

y3 = data3['GroupID']
X3 = data3.drop(['GroupID'], axis=1)

X_train3, X_test3, Y_train3, Y_test3 = train_test_split(X3, y3, test_size=0.35, random_state=42)

best3 = grid_search_optimization(clf, param_grid, X_train3, Y_train3, X_test3, Y_test3, cv=20, scoring='f1_micro')

# Tuning hyper-parameters for f1_micro

Fitting 20 folds for each of 288 candidates, totalling 5760 fits


[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    8.9s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   13.4s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:   20.6s
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:   30.7s
[Parallel(n_jobs=-1)]: Done 1234 tasks      | elapsed:   43.4s
[Parallel(n_jobs=-1)]: Done 1784 tasks      | elapsed:   59.0s
[Parallel(n_jobs=-1)]: Done 2434 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done 3184 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 4034 tasks      | elapsed:  2.0min
[Parallel(n_jobs=-1)]: Done 4984 tasks      | elapsed:  2.5min
[Parallel(n_jobs=-1)]: Done 5760 out of 5760 | elapsed:  2.9min finished


Best parameters set found on development set:

{'PCA__n_components': 51, 'classifier__n_neighbors': 1}

Detailed classification report (holdout):

The model is trained on the full development set.
The scores are computed on the full evaluation set.

             precision    recall  f1-score   support

          0       0.93      0.94      0.94       208
          1       0.37      0.32      0.34        22

avg / total       0.88      0.88      0.88       230




## 4. PSP vs. PD/MSA

In [6]:
# Get the data
data4 = pd.read_excel('training_data.xlsx')
data4 = group_classes(data4, {1:0, 2:0, 3:1})

y4 = data4['GroupID']
X4 = data4.drop(['GroupID'], axis=1)

X_train4, X_test4, Y_train4, Y_test4 = train_test_split(X4, y4, test_size=0.35, random_state=42)

best4 = grid_search_optimization(clf, param_grid, X_train4, Y_train4, X_test4, Y_test4, cv=20, scoring='f1_micro')

# Tuning hyper-parameters for f1_micro

Fitting 20 folds for each of 288 candidates, totalling 5760 fits


[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    8.5s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   13.0s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:   21.0s
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:   32.5s
[Parallel(n_jobs=-1)]: Done 1234 tasks      | elapsed:   46.7s
[Parallel(n_jobs=-1)]: Done 1784 tasks      | elapsed:  1.0min
[Parallel(n_jobs=-1)]: Done 2434 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done 3184 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done 4034 tasks      | elapsed:  2.1min
[Parallel(n_jobs=-1)]: Done 4984 tasks      | elapsed:  2.5min
[Parallel(n_jobs=-1)]: Done 5760 out of 5760 | elapsed:  2.9min finished


Best parameters set found on development set:

{'PCA__n_components': 61, 'classifier__n_neighbors': 11}

Detailed classification report (holdout):

The model is trained on the full development set.
The scores are computed on the full evaluation set.

             precision    recall  f1-score   support

          0       0.92      0.99      0.95       186
          1       0.93      0.64      0.76        44

avg / total       0.92      0.92      0.92       230
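
To put the four reports in context, it helps to compare each against the accuracy of always predicting the larger class. The sketch below computes that majority-class baseline from the holdout supports printed in the reports above; `supports` and `majority_baseline` are illustrative names:

```python
# Holdout supports copied from the classification reports in this notebook.
supports = {
    "1. Control vs. PD/MSA/PSP": {0: 95, 1: 227},
    "2. PD vs. MSA/PSP": {0: 164, 1: 66},
    "3. MSA vs. PD/PSP": {0: 208, 1: 22},
    "4. PSP vs. PD/MSA": {0: 186, 1: 44},
}

def majority_baseline(class_counts):
    """Accuracy obtained by always predicting the most frequent class."""
    return max(class_counts.values()) / sum(class_counts.values())

for task, counts in supports.items():
    print(f"{task}: {majority_baseline(counts):.3f}")
# 1. Control vs. PD/MSA/PSP: 0.705
# 2. PD vs. MSA/PSP: 0.713
# 3. MSA vs. PD/PSP: 0.904
# 4. PSP vs. PD/MSA: 0.809
```

The support-weighted average recall in each report equals overall accuracy, so for task 3 the reported 0.88 sits below the 0.904 baseline, while the other three tasks exceed theirs.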


