# Idiopathic Parkinson's vs MSA
In this notebook I will build a model that differentiates idiopathic Parkinson's disease from MSA. The goal is to find the simplest model that scores 85% or better on the chosen success metric for the holdout set.

## QUESTION:
What performance metric should we use to determine the success of a run?
* **Accuracy** - will be high if the model simply favors the class with the most samples. We could mitigate this by upsampling the minority class, but then the contribution of a poorly predicted minority sample is amplified (a single misclassified minority sample could be represented MANY times).
* **Precision (Positive Predictive Value)** - Assuming MSA is the positive class, this is the number of MSA samples correctly predicted as MSA divided by the total number of samples predicted as MSA. The downside of using this value is that there isn't an obvious way to declare the "positive" class if we are comparing, say, MSA and PSP.
* **Recall (True Positive Rate, Sensitivity)** - Assuming MSA is the positive class, this is the number of samples correctly predicted as MSA divided by the total number of MSA samples
* **Specificity (True Negative Rate)** - the percentage of Parkinson's samples correctly predicted as Parkinson's
* **Negative Predictive Value** - the percentage of samples predicted as Parkinson's that were actually Parkinson's
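
As a rough illustration of how these metrics relate, the sketch below computes all five from the four cells of a binary confusion matrix. The counts are made up for illustration (they are not results from this dataset), `binary_metrics` is a hypothetical helper, and MSA is assumed to be the positive class.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute the five candidate metrics from confusion-matrix counts.

    Assumption: positive class = MSA, negative class = Parkinson's.
    """
    def ratio(num, den):
        # Return None instead of dividing by zero when a metric is undefined
        return num / den if den else None

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": ratio(tp, tp + fp),    # positive predictive value
        "recall": ratio(tp, tp + fn),       # sensitivity, true positive rate
        "specificity": ratio(tn, tn + fp),  # true negative rate
        "npv": ratio(tn, tn + fn),          # negative predictive value
    }

# Hypothetical imbalanced case: 90 Parkinson's and 10 MSA samples,
# scored for a model that predicts Parkinson's for every sample.
majority_only = binary_metrics(tp=0, fp=0, tn=90, fn=10)
print(majority_only)
```

In this made-up case accuracy and specificity look strong while recall is zero and precision is undefined, which is the reason accuracy alone is a risky success metric for imbalanced classes.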

### Plan of attack
1. Get the baseline for the evaluation metrics (Precision, Recall, Specificity) for the following models:
    * Logistic Regression
    * K-nearest Neighbors
    * Random Forest
    * LDA
    * Gradient Boost
    * AdaBoost
    * Linear SVC
    * RBF SVC
    * XGBoost
1. (Maybe) Run the same baseline with variations and note which models improved with which techniques:
    * Standard Scaler (Definitely)
    * K-Best
    * PCA
1. Repeat the best scenario for each model above on each of the 3 sites that have MSA subjects (individually)
1. Repeat the best scenario for each model above on the 3 sites that have MSA subjects, combined
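
Steps 1 and 2 of the plan could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the notebook's actual pipeline: the data is synthetic (generated with `make_classification`), `make_variants` is a hypothetical helper, only logistic regression is shown, the values `k=10` and `n_components=5` are arbitrary, and scikit-learn's own `Pipeline` is used rather than the imbalanced-learn one.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (NOT the notebook's dataset)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=5,
    weights=[0.8, 0.2], random_state=42,
)

def make_variants(estimator_factory, k=10, n_components=5):
    """Hypothetical helper: the baseline plus the variations from the plan."""
    return {
        "baseline": Pipeline([("clf", estimator_factory())]),
        "scaled": Pipeline([
            ("scale", StandardScaler()),
            ("clf", estimator_factory()),
        ]),
        "scaled+kbest": Pipeline([
            ("scale", StandardScaler()),
            ("kbest", SelectKBest(score_func=f_classif, k=k)),
            ("clf", estimator_factory()),
        ]),
        "scaled+pca": Pipeline([
            ("scale", StandardScaler()),
            ("pca", PCA(n_components=n_components)),
            ("clf", estimator_factory()),
        ]),
    }

variants = make_variants(lambda: LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Cross-validate every variant and keep the mean test scores
variant_scores = {}
for name, pipe in variants.items():
    res = cross_validate(pipe, X_demo, y_demo, cv=cv,
                         scoring=["precision", "recall"])
    variant_scores[name] = {
        "precision": res["test_precision"].mean(),
        "recall": res["test_recall"].mean(),
    }
    print(name, variant_scores[name])
```

Putting the scaler and feature-selection steps inside the pipeline means they are refit on each training fold, so no information from the validation fold leaks into preprocessing.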

## Get baseline performance metrics 

In [2]:
from pandas import DataFrame
import numpy as np
import matplotlib.pyplot as plt

from pml.experiment.experiment import Grouping, Experiment
from pml.experiment.model import Model
from pml.data.data import DataFile, BrainData
from pml.utility.default import get_baseline_models

from sklearn.preprocessing import StandardScaler, Normalizer

# Load libraries
from matplotlib import pyplot
from pandas import set_option
from pandas.plotting import scatter_matrix
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import RandomOverSampler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier

DATA_FILE = BrainData(file_url='data/Training_with_Site.xlsx.xlsx', name='Original Data with Site Info')

dataset = DATA_FILE.df

ValueError: labels ['Subject'] not contained in axis

In [18]:
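## Make the specificity scorer

from sklearn.metrics import roc_curve, make_scorer, confusion_matrix
def specificity(y_true, y_pred):
    # Specificity = TN / (TN + FP), treating label 0 as the negative class.
    # labels=[0, 1] keeps the matrix 2x2 even if a fold is missing one class.
    conf = confusion_matrix(y_true, y_pred, labels=[0, 1])
    speci = conf[0, 0] / (conf[0, 0] + conf[0, 1])
    return speci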

In [39]:
# config values 
seed = 42
holdout_percent = 0.2
num_folds = 20
scoring = {'precision':'precision','recall':'recall', 'specificity':make_scorer(specificity)}
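
The `holdout_percent` and `seed` values above are presumably meant for carving out the holdout set. One possible way to do that, sketched here on hypothetical stand-in data (the arrays `X_demo` and `y_demo` are made up and are not the notebook's dataset), is a stratified `train_test_split`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

seed = 42
holdout_percent = 0.2

# Hypothetical stand-in data: 80 samples of class 0 and 20 of class 1
rng = np.random.default_rng(seed)
X_demo = rng.normal(size=(100, 4))
y_demo = np.array([0] * 80 + [1] * 20)

X_train, X_holdout, y_train, y_holdout = train_test_split(
    X_demo, y_demo,
    test_size=holdout_percent,
    stratify=y_demo,       # keep the class ratio the same in both splits
    random_state=seed,
)
print(len(y_train), len(y_holdout), int(y_holdout.sum()))
```

Stratifying matters with a small minority class, because an unstratified split could leave the holdout set with very few MSA samples.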

### 1. Baseline Measurements

In [1]:
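import pandas as pd
from ml_utils import group_classes
from pml.utility.viz import make_barplots
from sklearn.model_selection import cross_validate
from pml.utility.default import get_baseline_models

# Relabel the groups with the mapping {1: 0, 2: 1}
data = group_classes(dataset, {1:0,2:1})
Y = data.GroupID.values
X = data.drop(['GroupID'], axis=1).values

models = get_baseline_models()
# random_state only has an effect when shuffle=True
# (newer scikit-learn versions raise an error otherwise)
kfold = StratifiedKFold(n_splits=num_folds, shuffle=True, random_state=seed)

# ASSUMPTION: the structure returned by get_baseline_models() is not shown in
# this notebook, so a plain LogisticRegression stands in for the 'LR' baseline
model = LogisticRegression()
cv_results = cross_validate(model, X, Y, cv=kfold, scoring=scoring, return_train_score=True)

# Summarize every non-timing metric as its mean and standard deviation
row = {'model':'LR'}
for key in cv_results.keys():
    if 'time' not in key:
        row["%s_mean"%key] = cv_results[key].mean()
        row["%s_std"%key] = cv_results[key].std()

# DataFrame.append returned a new frame (and was removed in pandas 2.0),
# so build the frame directly from the row
df = pd.DataFrame([row])
df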

SyntaxError: invalid syntax (<ipython-input-1-5f0758b4bb79>, line 4)
