# Parkinson's Machine Learning Classification

## Table of Contents
1. [Initial Data Analysis](#Initial-Data-Analysis)
1. [Initial Model Benchmarks](#Initial-Model-Benchmarks)
    1. [Benchmark Results](#Results)
1. [Preprocessing](#Preprocessing)
    1. [Standardization](#Standardization)

## Initial Data Analysis 
[See Data Exploration Notebook](Data_Exploration.ipynb)

The data set has a clear class imbalance (Control 32%, Parkinsons 53%, MSA 7%, PSP 7%).  We will need to resample the data or use an algorithm, such as decision trees, that is less heavily influenced by class imbalance.  We should also check how well the under-represented classes are predicted when evaluating our models, since overall accuracy alone can hide poor minority-class performance.
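To make the concern about accuracy concrete, here is a small sketch using only the class shares quoted above. It computes what a trivial "always predict the largest class" rule would score; the variable names are illustrative and not taken from the notebooks.

```python
# Class shares quoted above, in percent (they sum to 99 because of rounding).
class_share = {"Control": 32, "Parkinsons": 53, "MSA": 7, "PSP": 7}

total = sum(class_share.values())
majority_class = max(class_share, key=class_share.get)

# Accuracy of a rule that always predicts the majority class.
baseline_accuracy = class_share[majority_class] / total

# Per-class recall of that same rule: 1 for the majority class, 0 elsewhere.
recall = {label: 1.0 if label == majority_class else 0.0 for label in class_share}
macro_recall = sum(recall.values()) / len(recall)

print(f"always predict {majority_class}: "
      f"accuracy={baseline_accuracy:.2f}, macro recall={macro_recall:.2f}")
```

A rule that never identifies a single MSA or PSP case still gets more than half of the predictions right, which is why per-class metrics matter here.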

Features appear relatively normally distributed between groups.  Nothing particularly interesting jumps out from the covariance and Pearson correlations.

## Initial Model Benchmarks
[See Modeling Benchmark Notebook](ModelingBenchmarks.ipynb)

Eight initial models were tested for __multiclass classification__ (I would ideally like to avoid composing the results of multiple binary classifiers, since that composition is costly):
* Logistic Regression (*log*)
* SVC with Linear Kernel (*svc_lin*)
* SVC with RBF Kernel (*svc_rbf*)
* k-Nearest Neighbors (*knn*)
* Random Forest (*rand_for*)
* Artificial Neural Networks (*ann*)
* Naive Bayes (*gnb*)
* AdaBoost (*ada*)

These models were run with 5-fold (stratified) cross validation on 80% of the training data (the remaining 20% was kept for holdout validation).  To address the class imbalance, the data was resampled (upsampling the minority classes) so that the minority classes were well represented in the models.  Alternative class-imbalance corrections can be explored in the future.  No additional preprocessing of the data (scaling, feature selection, etc.) was performed.
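As a rough illustration of upsampling the minority classes, the sketch below resamples with replacement until every class matches the size of the largest one. It is library-free and uses toy labels; the function name `upsample_minority` is hypothetical, and the notebook may well use a different routine (scikit-learn's `sklearn.utils.resample` is one common choice).

```python
import random
from collections import Counter, defaultdict

def upsample_minority(samples, labels, seed=0):
    """Resample with replacement so every class matches the largest class size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for label, group in by_class.items():
        # Keep every original sample, then draw the shortfall with replacement.
        extras = rng.choices(group, k=target - len(group))
        for sample in group + extras:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels

# Toy labels for illustration only; not the study data.
labels = ["Parkinsons"] * 6 + ["Control"] * 4 + ["MSA"] + ["PSP"]
samples = list(range(len(labels)))

up_samples, up_labels = upsample_minority(samples, labels)
print(Counter(up_labels))
```

One design point worth keeping in mind: if resampling happens before the data is split, copies of the same sample can land in both the training and test folds and inflate the cross-validation score. Resampling only the training portion of each fold avoids that.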

For most models, default hyperparameters were used, with minor tweaks based on experience.

The "Outside Validation" data set (whose sites were not included in the training data at all) was tested to measure generalizability to unseen data.

The models were evaluated by several means with particular attention paid to the performance on the minority classes:
* Cross Validation Score (Mean Accuracy of the folds)
* Holdout Data accuracy, precision (PPV), and recall (sensitivity)
* "Outside Validation" Data accuracy, precision (PPV), and recall (sensitivity)
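The metrics listed above can be sketched in a few lines of plain Python. This is a minimal illustration on toy labels, not the notebook's evaluation code; `per_class_report` is a hypothetical helper, and scikit-learn's `precision_score`, `recall_score`, and `classification_report` cover the same ground.

```python
def per_class_report(y_true, y_pred):
    """Return accuracy plus {label: (precision, recall)} for every class."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # Report 0.0 when a ratio is undefined (no predictions or no true cases).
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        report[label] = (precision, recall)
    return accuracy, report

# Toy labels for illustration only; not results from the notebooks.
y_true = ["Parkinsons", "Parkinsons", "Parkinsons", "Control", "Control", "MSA"]
y_pred = ["Parkinsons", "Parkinsons", "Control", "Control", "Parkinsons", "Parkinsons"]

accuracy, report = per_class_report(y_true, y_pred)
print(f"accuracy={accuracy:.2f}")
for label, (precision, recall) in report.items():
    print(f"{label:>10}: precision={precision:.2f} recall={recall:.2f}")
```

In this toy case the overall accuracy is 0.50 while the MSA recall is 0.00, the kind of gap the per-class view is meant to expose.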

### Results

!['Initial Model Accuracies'](images/benchmark_accuracies.png)

k-nearest neighbors, the non-linear support vector classifier, random forest, and logistic regression yielded cross-validation and holdout accuracies > 80%.

Each of those classifiers yields "Outside Validation" accuracies significantly lower than 80%.  The other classifiers, however, behave consistently across the cross-validation, holdout, and outside validation data sets.

Of course, significant improvements are expected when we begin to introduce various preprocessing methods.

Further useful insight into how well these models perform at the individual class level can be gained by viewing the confusion matrices in 
[the Modeling Benchmark Notebook](Model Benchmarks/ModelingBenchmarks.ipynb)
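For readers unfamiliar with confusion matrices, here is a minimal sketch of how one is tabulated. The labels are toy values and the helper is written by hand for illustration; the notebook presumably relies on a library routine such as scikit-learn's `confusion_matrix`.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

# Toy labels for illustration only; not results from the notebooks.
labels = ["Control", "Parkinsons", "MSA", "PSP"]
y_true = ["Control", "Control", "Parkinsons", "Parkinsons", "Parkinsons", "MSA", "PSP"]
y_pred = ["Control", "Parkinsons", "Parkinsons", "Parkinsons", "Parkinsons", "Parkinsons", "PSP"]

matrix = confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, matrix):
    print(f"{label:>10}: {row}")
```

Off-diagonal entries show which classes are confused with which. An all-zero column means the model never predicts that class, which overall accuracy alone would not reveal.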


## Preprocessing

### Standardization
[See Model Search- Standardization Notebook](ModelSearch-Standardization.ipynb)

We will use the standard scaler to force each feature column to have a mean of zero and a variance of 1.  Many algorithms (for example SVMs, neural networks, and k-nearest neighbors) assume, or work best with, features on a comparable scale.

Most of the features are already bounded in [0,1] with low variance.  The major exceptions are Age and UPDRS.
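The arithmetic behind standardization is simple enough to sketch by hand. The example below fits the mean and (population) standard deviation on a training column and reuses them for holdout values, which is the same fit-then-transform pattern that scikit-learn's `StandardScaler` follows. The Age values are made up for illustration, and `fit_scaler` and `transform` are hypothetical helper names.

```python
from statistics import fmean, pstdev

def fit_scaler(column):
    """Return the mean and population standard deviation of a training column."""
    return fmean(column), pstdev(column)

def transform(column, mean, std):
    # Guard against constant columns, which have zero variance.
    return [(x - mean) / std if std else 0.0 for x in column]

# Hypothetical Age values; not taken from the data set.
train_age = [55.0, 60.0, 65.0, 70.0, 75.0]
holdout_age = [50.0, 65.0]

mean, std = fit_scaler(train_age)            # statistics come from training data only
train_scaled = transform(train_age, mean, std)
holdout_scaled = transform(holdout_age, mean, std)

print(f"mean={mean:.1f}, std={std:.3f}")
print("train:  ", [round(z, 3) for z in train_scaled])
print("holdout:", [round(z, 3) for z in holdout_scaled])
```

Fitting on the training data alone keeps information about the holdout and outside validation sets from leaking into the preprocessing step.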

!['Accuracies with Standardization'](images/standardization_acc.png)

With the standardization step, most models' accuracies do not change much.  However, the linear SVC and neural network models see dramatic improvements.

!['Cross Validations Accuracies with Standardization'](images/standardization_validation.png)

In this image, the cross-validation accuracy is the mean of the accuracies of all 5 models trained during 5-fold cross validation. The holdout and validation accuracies are the accuracies obtained by training the models on the **entire** training set and passing the standardized holdout and outside validation data to the trained model.
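To spell out what "mean of the accuracies of all 5 models" means, here is a library-free sketch of stratified 5-fold cross validation. A majority-class rule stands in for a real model, and the labels are toy values; in practice scikit-learn's `StratifiedKFold` and `cross_val_score` would do this work. The helper names are hypothetical.

```python
from collections import Counter, defaultdict
from statistics import fmean

def stratified_folds(labels, k):
    """Deal the indices of each class round-robin into k folds."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

def majority_label(labels):
    """Stand-in 'model': always predict the most common training label."""
    return Counter(labels).most_common(1)[0][0]

# Toy labels for illustration only; not the study data.
labels = ["Parkinsons"] * 10 + ["Control"] * 5
folds = stratified_folds(labels, k=5)

fold_accuracies = []
for test_indices in folds:
    held_out = set(test_indices)
    train_labels = [lab for i, lab in enumerate(labels) if i not in held_out]
    prediction = majority_label(train_labels)   # "train" on the other 4 folds
    correct = sum(labels[i] == prediction for i in test_indices)
    fold_accuracies.append(correct / len(test_indices))

cv_accuracy = fmean(fold_accuracies)            # the reported cross-validation score
print(f"fold accuracies={fold_accuracies}, mean={cv_accuracy:.3f}")
```

The holdout and outside validation scores differ from this in that a single model is fit once on the full training set and then scored on data it has never seen.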

Looking at the accuracies of these models on the holdout and outside validation data, we see some regression.  This suggests overfitting (it also looks like our *svc_rbf* model is always predicting a single class).  We will address overfitting issues when we do our hyperparameter tuning.