Run an RBF-kernel SVM (`SVC`), tuned by grid search, to get a good baseline for performance on four binary tasks:
1. Control vs. all parkinsonian groups (PD/MSA/PSP)
1. PD vs. MSA/PSP
1. MSA vs. PD/PSP
1. PSP vs. PD/MSA

## Imports and Function Definitions

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import StandardScaler, Normalizer
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, precision_recall_fscore_support
from sklearn.model_selection import StratifiedKFold, cross_val_score, KFold, train_test_split, LeaveOneOut, GridSearchCV
from sklearn.svm import SVC

from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline

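def grid_search_optimization(model, tuned_parameters, X, y, Xh, yh, cv=5, scoring='accuracy', verbose=False):
    """Grid-search `tuned_parameters` on (X, y) with `cv`-fold CV, then report on the holdout set (Xh, yh)."""
    print("# Tuning hyper-parameters for %s" % scoring)
    print()

    # GridSearchCV refits the best parameter setting on all of (X, y) by default.
    clf = GridSearchCV(model, tuned_parameters, cv=cv, n_jobs=-1, scoring=scoring, verbose=1)
    clf.fit(X, y)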

    print("Best parameters set found on development set:")
    print()
    print(clf.best_params_)
    print()
    if verbose:
        print("Grid scores on development set:")
        print()
        means = clf.cv_results_['mean_test_score']
        stds = clf.cv_results_['std_test_score']
        for mean, std, params in zip(means, stds, clf.cv_results_['params']):
            print("%0.3f (+/-%0.03f) for %r"
                  % (mean, std * 2, params))
        print()

    print("Detailed classification report (holdout):")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    y_true, y_pred = yh, clf.predict(Xh)
    print(classification_report(y_true, y_pred))
    print()
    
    return clf

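def group_classes(data, grouping):
    """Keep the rows whose GroupID is a key of `grouping` and map each GroupID to its new label."""
    data_to_keep = data.loc[data['GroupID'].isin(list(grouping))].copy()
    # Relabel the GroupID column only, so that feature values which happen
    # to equal a class label are left unchanged.
    data_to_keep['GroupID'] = data_to_keep['GroupID'].map(grouping)
    return data_to_keep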
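
The grouping step can be illustrated on a few toy rows. This is a minimal sketch in plain Python rather than pandas; the helper `relabel_rows`, the `feat` column and the row values are hypothetical, and the label coding (0 = control, 1 = PD, 2 = MSA, 3 = PSP) is inferred from the section headings and the mappings passed to `group_classes`.

```python
# Hypothetical toy rows; 'feat' stands in for the real feature columns.
rows = [
    {"GroupID": 0, "feat": 2.0},
    {"GroupID": 1, "feat": 0.5},
    {"GroupID": 2, "feat": 3.0},
    {"GroupID": 3, "feat": 1.0},
]

def relabel_rows(rows, grouping, label_key="GroupID"):
    """Keep rows whose label is a key of `grouping` and remap that label only."""
    return [
        {**row, label_key: grouping[row[label_key]]}
        for row in rows
        if row[label_key] in grouping
    ]

# Task 2 (PD vs. MSA/PSP): controls are dropped, PD -> 0, MSA and PSP -> 1.
task2 = relabel_rows(rows, {1: 0, 2: 1, 3: 1})
labels = [row["GroupID"] for row in task2]
features = [row["feat"] for row in task2]
print(labels)    # labels after grouping
print(features)  # feature values are untouched by the relabeling
```

Restricting the mapping to the label column matters because a feature value such as 2.0 or 3.0 would otherwise be rewritten along with the class labels.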

In [22]:
# Hyper-parameter ranges to search (keys follow the pipeline's <step>__<parameter> naming)
param_grid = {
    "classifier__C": np.logspace(-5, 10, 20),
    "classifier__gamma": np.logspace(-15,3,20)
#     "PCA__n_components": range(1,113,10)
    #"fss__k" : range(1,115,3)
}

clf = Pipeline([
#     ('Norm', Normalizer()),
    #("Oversample", RandomOverSampler()),
#     ('PCA', PCA()),
#     ('fss',SelectKBest()),
    ('Scaler', StandardScaler()),
    ('classifier', SVC(kernel='rbf', class_weight='balanced'))
])
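
To get a feel for the cost of this search, the grid can be rebuilt without NumPy. A rough sketch: `logspace` below is a hypothetical pure-Python stand-in for `np.logspace`, and the fold count of 20 is the `cv=20` used in the sections that follow.

```python
def logspace(start, stop, num):
    """Pure-Python stand-in for np.logspace(start, stop, num) with base 10."""
    step = (stop - start) / (num - 1)
    return [10.0 ** (start + i * step) for i in range(num)]

C_values = logspace(-5, 10, 20)       # same range as classifier__C
gamma_values = logspace(-15, 3, 20)   # same range as classifier__gamma
n_folds = 20                          # assumption: cv=20, as in the calls below

n_candidates = len(C_values) * len(gamma_values)
n_fits = n_candidates * n_folds

# Multiplicative spacing between neighbouring grid points.
C_ratio = C_values[1] / C_values[0]
gamma_ratio = gamma_values[1] / gamma_values[0]

print(n_candidates, "candidates,", n_fits, "fits")
print("C grows by a factor of %.2f per step, gamma by %.2f" % (C_ratio, gamma_ratio))
```

Each grid step changes a parameter by most of an order of magnitude, so the search is coarse; a finer grid around the selected values would be a natural follow-up.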

## 1. Control vs. PD/MSA/PSP

In [23]:
# Get the data
data1 = pd.read_excel('training_data.xlsx')
data1 = group_classes(data1, {0:0, 1:1, 2:1, 3:1})

y1 = data1['GroupID']
X1 = data1.drop(['GroupID'], axis=1)

X_train1, X_test1, Y_train1, Y_test1 = train_test_split(X1, y1, test_size=0.35, random_state=42)

best1 = grid_search_optimization(clf, param_grid, X_train1, Y_train1, X_test1, Y_test1, cv=20, scoring='f1_micro')

# Tuning hyper-parameters for f1_micro

Fitting 20 folds for each of 400 candidates, totalling 8000 fits




Best parameters set found on development set:

{'classifier__C': 183298.07108324376, 'classifier__gamma': 3.3598182862837741e-07}

Detailed classification report (holdout):

The model is trained on the full development set.
The scores are computed on the full evaluation set.

             precision    recall  f1-score   support

          0       0.85      0.96      0.90        95
          1       0.98      0.93      0.95       227

avg / total       0.94      0.94      0.94       322




[Parallel(n_jobs=-1)]: Done 8000 out of 8000 | elapsed:  4.0min finished
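
As a sanity check on how the report's columns relate, the F1 scores can be recomputed from the printed precision and recall. This is a sketch only: the inputs are the rounded values from the report above, so the results agree with the report only to about two decimals.

```python
def f1_score_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, support) per class, copied from the holdout report.
report = {
    0: (0.85, 0.96, 95),    # control
    1: (0.98, 0.93, 227),   # PD/MSA/PSP
}

f1_per_class = {label: f1_score_from(p, r) for label, (p, r, _) in report.items()}
total_support = sum(support for _, _, support in report.values())

# The "avg / total" row is a support-weighted average of the per-class scores.
weighted_f1 = sum(
    f1_per_class[label] * support for label, (_, _, support) in report.items()
) / total_support

for label, value in f1_per_class.items():
    print("class %d: F1 = %.2f" % (label, value))
print("support-weighted F1 = %.2f over %d samples" % (weighted_f1, total_support))
```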


## 2. PD vs. MSA/PSP

In [24]:
# Get the data
data2 = pd.read_excel('training_data.xlsx')
data2 = group_classes(data2, {1:0, 2:1, 3:1})

y2 = data2['GroupID']
X2 = data2.drop(['GroupID'], axis=1)

X_train2, X_test2, Y_train2, Y_test2 = train_test_split(X2, y2, test_size=0.35, random_state=42)

best2 = grid_search_optimization(clf, param_grid, X_train2, Y_train2, X_test2, Y_test2, cv=20, scoring='f1_micro')

# Tuning hyper-parameters for f1_micro

Fitting 20 folds for each of 400 candidates, totalling 8000 fits




Best parameters set found on development set:

{'classifier__C': 20.691380811147901, 'classifier__gamma': 0.0020691380811147901}

Detailed classification report (holdout):

The model is trained on the full development set.
The scores are computed on the full evaluation set.

             precision    recall  f1-score   support

          0       0.97      0.88      0.92       164
          1       0.75      0.92      0.83        66

avg / total       0.91      0.89      0.89       230




[Parallel(n_jobs=-1)]: Done 8000 out of 8000 | elapsed:  2.6min finished
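
The searches here are scored with `f1_micro`. For single-label problems like these, micro-averaged F1 pools the counts over all classes and comes out equal to plain accuracy, so the grid search is effectively optimizing accuracy. A small sketch on made-up labels (the `y_true` and `y_pred` lists are hypothetical):

```python
# Hypothetical true and predicted labels for a binary task.
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0, 1, 1]

labels = sorted(set(y_true))
pairs = list(zip(y_true, y_pred))

# Pool true positives, false positives and false negatives over every class.
tp = sum(1 for label in labels for t, p in pairs if t == label and p == label)
fp = sum(1 for label in labels for t, p in pairs if t != label and p == label)
fn = sum(1 for label in labels for t, p in pairs if t == label and p != label)

micro_precision = tp / (tp + fp)
micro_recall = tp / (tp + fn)
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)

accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)

print("micro F1 = %.3f, accuracy = %.3f" % (micro_f1, accuracy))
```

Every misclassified sample counts once as a false positive for one class and once as a false negative for another, which is why the pooled precision and recall coincide. If the minority class matters more, a scorer such as `f1_macro` would weight the classes equally instead.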


## 3. MSA vs. PD/PSP

In [25]:
# Get the data
data3 = pd.read_excel('training_data.xlsx')
data3 = group_classes(data3, {1:0, 3:0, 2:1})

y3 = data3['GroupID']
X3 = data3.drop(['GroupID'], axis=1)

X_train3, X_test3, Y_train3, Y_test3 = train_test_split(X3, y3, test_size=0.35, random_state=42)

best3 = grid_search_optimization(clf, param_grid, X_train3, Y_train3, X_test3, Y_test3, cv=20, scoring='f1_micro')

# Tuning hyper-parameters for f1_micro

Fitting 20 folds for each of 400 candidates, totalling 8000 fits


[Parallel(n_jobs=-1)]: Done 8000 out of 8000 | elapsed:  2.4min finished


Best parameters set found on development set:

{'classifier__C': 20.691380811147901, 'classifier__gamma': 0.018329807108324301}

Detailed classification report (holdout):

The model is trained on the full development set.
The scores are computed on the full evaluation set.

             precision    recall  f1-score   support

          0       0.92      1.00      0.96       208
          1       0.83      0.23      0.36        22

avg / total       0.92      0.92      0.90       230
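
The averaged scores above hide how poorly the minority class (label 1, MSA under this grouping) is recovered. A rough sketch of the arithmetic, using only the rounded recall and support values from the report, so the figures are approximate:

```python
# Recall and support per class, copied from the holdout report.
support = {0: 208, 1: 22}     # 0 = PD/PSP, 1 = MSA
recall = {0: 1.00, 1: 0.23}

total = sum(support.values())

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = max(support.values()) / total

# Overall accuracy is the support-weighted mean of the per-class recalls.
accuracy = sum(recall[c] * support[c] for c in support) / total

# Balanced accuracy is the unweighted mean of the per-class recalls.
balanced_accuracy = sum(recall.values()) / len(recall)

print("majority-class baseline: %.3f" % majority_baseline)
print("approximate accuracy:    %.3f" % accuracy)
print("balanced accuracy:       %.3f" % balanced_accuracy)
```

On these numbers the tuned model is only a couple of points above always predicting class 0, while the balanced accuracy sits far below the headline averages.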




## 4. PSP vs. PD/MSA

In [26]:
# Get the data
data4 = pd.read_excel('training_data.xlsx')
data4 = group_classes(data4, {1:0, 2:0, 3:1})

y4 = data4['GroupID']
X4 = data4.drop(['GroupID'], axis=1)

X_train4, X_test4, Y_train4, Y_test4 = train_test_split(X4, y4, test_size=0.35, random_state=42)

best4 = grid_search_optimization(clf, param_grid, X_train4, Y_train4, X_test4, Y_test4, cv=20, scoring='f1_micro')

# Tuning hyper-parameters for f1_micro

Fitting 20 folds for each of 400 candidates, totalling 8000 fits




Best parameters set found on development set:

{'classifier__C': 127.42749857031347, 'classifier__gamma': 2.6366508987303556e-05}

Detailed classification report (holdout):

The model is trained on the full development set.
The scores are computed on the full evaluation set.

             precision    recall  f1-score   support

          0       0.98      0.91      0.94       186
          1       0.70      0.91      0.79        44

avg / total       0.92      0.91      0.91       230




[Parallel(n_jobs=-1)]: Done 8000 out of 8000 | elapsed:  2.1min finished
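
One quick check on the four searches is whether any selected value lies on the edge of its grid, which would suggest widening the range. The sketch below locates each reported best parameter in a rebuilt grid; `logspace` and `grid_index` are hypothetical helpers written for this illustration, and the parameter values are copied from the outputs above.

```python
import math

def logspace(start, stop, num):
    """Pure-Python stand-in for np.logspace(start, stop, num) with base 10."""
    step = (stop - start) / (num - 1)
    return [10.0 ** (start + i * step) for i in range(num)]

def grid_index(grid, value):
    """Index of the grid point closest to `value` on a log10 scale."""
    return min(range(len(grid)),
               key=lambda i: abs(math.log10(grid[i]) - math.log10(value)))

C_grid = logspace(-5, 10, 20)
gamma_grid = logspace(-15, 3, 20)

# Best (C, gamma) per task, as printed by the four grid searches.
best_params = {
    "1. Control vs. PD/MSA/PSP": (183298.07108324376, 3.3598182862837741e-07),
    "2. PD vs. MSA/PSP": (20.691380811147901, 0.0020691380811147901),
    "3. MSA vs. PD/PSP": (20.691380811147901, 0.018329807108324301),
    "4. PSP vs. PD/MSA": (127.42749857031347, 2.6366508987303556e-05),
}

positions = {
    task: (grid_index(C_grid, C), grid_index(gamma_grid, gamma))
    for task, (C, gamma) in best_params.items()
}

last = len(C_grid) - 1
for task, (i_C, i_gamma) in positions.items():
    on_edge = i_C in (0, last) or i_gamma in (0, last)
    print("%s: C index %d, gamma index %d, on edge: %s" % (task, i_C, i_gamma, on_edge))
```

All four optima fall in the interior of the grid, so the chosen ranges appear wide enough for these tasks.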
