# Parkinson's Machine Learning Classification
*Goal: To create a machine learning algorithm which will distinguish Parkinson’s Disease, Multiple System Atrophy, and Progressive Supranuclear Palsy using an automated pipeline.*

## Table of Contents
1. [Initial Data Analysis](#Initial-Data-Analysis)
1. [Initial Model Benchmarks](#Initial-Model-Benchmarks)
    1. [Benchmark Results](#Results)
1. [Preprocessing](#Preprocessing)
    1. [Standardization](#Standardization)
    1. [Feature Subset Selection](#Feature-Subset-Selection)
    1. [Dimensionality Reduction](#Dimensionality-Reduction)
        1. [Principal Component Analysis](#Principal-Component-Analysis)
        1. [Linear Discriminant Analysis](#Linear-Discriminant-Analysis)
1. [Model Optimization **(new!)**](#Model-Optimization)
    1. [Random Forest](#Random-Forest)
1. [Notes](#Notes)
    1. [6/28/2018- Resampling before cross-validation](#6/28/2018)

## Initial Data Analysis 
[See Data Exploration Notebook](Data_Exploration.ipynb)

The data set has a clear class imbalance (Control 32%, Parkinson's 53%, MSA 7%, PSP 7%).  We will need to resample the data or use an algorithm, such as decision trees, that is less heavily influenced by class imbalance.  We should also check how well the under-represented classes are predicted when evaluating the accuracy of our models.

Features appear relatively normally distributed between groups.  Nothing particularly interesting jumps out from the covariance and Pearson correlations.

## Initial Model Benchmarks
[See Modeling Benchmark Notebook](ModelingBenchmarks.ipynb)

Eight initial models were tested for __multiclass classification__ (I would ideally like to avoid composing the results of multiple binary classifiers, since that composition is costly):
* Logistic Regression (*log*)
* SVC with Linear Kernel (*svc_lin*)
* SVC with RBF Kernel (*svc_rbf*)
* k-Nearest Neighbors (*knn*)
* Random Forest (*rand_for*)
* Artificial Neural Networks (*ann*)
* Naive Bayes (*gnb*)
* AdaBoost (*ada*)

These models were run with 5-fold (stratified) cross-validation on 80% of the training data (20% was kept for holdout validation).  To address the class imbalance, the data was resampled (upsampling of the minority classes) so that the minority classes were well represented in the models.  Alternative corrections for class imbalance can be explored in the future.  No additional preprocessing of the data (scaling, feature selection, etc.) was performed.
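The split and cross-validation scheme described above could be sketched roughly as follows. This is an illustration on synthetic data rather than the notebook's code: `make_classification` stands in for the real feature table, the class weights only approximate the proportions reported earlier, the feature count of 37 is an assumption, and logistic regression is used as an arbitrary example model. Resampling is left out here on purpose.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic stand-in for the real data: 4 imbalanced classes, 37 features (assumed).
X, y = make_classification(
    n_samples=1000,
    n_features=37,
    n_informative=10,
    n_classes=4,
    weights=[0.32, 0.53, 0.07, 0.08],
    random_state=0,
)

# Keep 20% as a holdout set; stratify so every class appears in both parts.
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 5-fold stratified cross-validation on the remaining 80%.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = LogisticRegression(max_iter=1000)
cv_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")

# Refit on the full training portion and score on the holdout set.
holdout_accuracy = model.fit(X_train, y_train).score(X_holdout, y_holdout)
print(f"CV accuracy: {cv_scores.mean():.3f}, holdout accuracy: {holdout_accuracy:.3f}")
```

Stratifying both the holdout split and the folds keeps the two 7% classes represented in every partition, which matters with classes this small.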

For most models, default hyperparameters were used, with minor tweaks based on experience.

The "Outside Validation" data set (whose sites were not included in the training data at all) was tested to measure generalizability to unseen data.

The models were evaluated by several means with particular attention paid to the performance on the minority classes:
* Cross Validation Score (Mean Accuracy of the folds)
* Holdout Data accuracy, precision (PPV), and recall (sensitivity)
* "Outside Validation" Data accuracy, precision (PPV), and recall (sensitivity)
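To show why per-class precision and recall are tracked alongside accuracy, here is a small self-contained sketch. The labels are made up for illustration and the helper `per_class_precision_recall` is hypothetical (it is not taken from the notebooks); the toy classifier never predicts PSP, yet overall accuracy still looks reasonable.

```python
from collections import Counter

def per_class_precision_recall(y_true, y_pred):
    """Return {label: (precision, recall)} computed from paired label lists."""
    true_pos = Counter()
    predicted = Counter(y_pred)   # how often each label was predicted
    actual = Counter(y_true)      # how often each label truly occurs
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            true_pos[truth] += 1
    scores = {}
    for label in sorted(actual):
        # Precision is defined as 0.0 here when a label is never predicted.
        precision = true_pos[label] / predicted[label] if predicted[label] else 0.0
        recall = true_pos[label] / actual[label]
        scores[label] = (precision, recall)
    return scores

# Toy labels (made up): the classifier never predicts the PSP class.
y_true = ["PD", "PD", "PD", "PD", "Control", "Control", "MSA", "PSP"]
y_pred = ["PD", "PD", "PD", "PD", "Control", "PD", "MSA", "PD"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
scores = per_class_precision_recall(y_true, y_pred)
print(f"accuracy = {accuracy:.2f}")
for label, (precision, recall) in scores.items():
    print(f"{label:8s} precision = {precision:.2f}  recall = {recall:.2f}")
```

Accuracy comes out at 0.75 even though recall for PSP is 0, which is the failure mode the minority-class metrics are meant to expose.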

### Results

!['Initial Model Accuracies'](images/benchmark_accuracies.png)

k-nearest neighbors, non-linear support vector, random forest, and logistic regression classifiers yielded cross-validation and holdout accuracies > 80%.

Each of those classifiers yields "Outside Validation" accuracies significantly less than 80%.  However, the other classifiers behave consistently across the cross-validation, holdout, and outside validation data sets.

Of course, significant improvements are expected when we begin to introduce various preprocessing methods.

Further insight into how well these models perform at the individual class level can be gained from the confusion matrices in
[the Modeling Benchmark Notebook](Model Benchmarks/ModelingBenchmarks.ipynb)


## Preprocessing

### Standardization
[See Model Search- Standardization Notebook](ModelSearch-Standardization.ipynb)

We will use the standard scaler to force each feature column to have a mean of zero and a variance of 1.  This is a requirement of many algorithms.

Most of the features are already [0,1] bound with low variance.  The major exceptions are Age and UPDRS.
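A minimal sketch of this step, assuming the scaler in question is scikit-learn's `StandardScaler`. The three columns are randomly generated placeholders (one [0,1]-bounded feature plus an age-like and a UPDRS-like column), not the project's data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Made-up columns: one feature bounded in [0, 1], an age-like and a UPDRS-like column.
X_train = np.column_stack([
    rng.uniform(0, 1, size=200),
    rng.normal(65, 10, size=200),
    rng.normal(30, 15, size=200),
])
X_holdout = np.column_stack([
    rng.uniform(0, 1, size=50),
    rng.normal(65, 10, size=50),
    rng.normal(30, 15, size=50),
])

# Fit on the training data only, then reuse the same means and variances
# for the holdout data so no information leaks from it.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_holdout_std = scaler.transform(X_holdout)

print(X_train_std.mean(axis=0).round(3))  # approximately 0 for every column
print(X_train_std.std(axis=0).round(3))   # approximately 1 for every column
```

The holdout columns will generally not have exactly zero mean and unit variance after the transform, since they are scaled with the training statistics.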

!['Accuracies with Standardization'](images/standardization_acc.png)

With the standardization step, most models' accuracies do not change much.  However, the linear SVC and neural network models see dramatic improvements.

!['Cross Validations Accuracies with Standardization'](images/standardization_validation.png)

In this image, the cross-validation accuracy is the mean of the accuracies of all 5 models trained during 5-fold cross validation. The holdout and validation accuracies are the accuracies obtained by training the models on the **entire** training set and passing the standardized holdout and outside validation data to the trained model.

Looking at the accuracies of these models on the holdout and outside validation data, we see a drop relative to the cross-validation accuracy.  This suggests overfitting (it also looks like our *svc_rbf* model is always predicting a single class).  We will address overfitting issues when we do our hyperparameter tuning.

### Feature Subset Selection
[See ModelSearch- k Best Features Notebook](ModelSearch-kBestFeatures.ipynb)

We will do further data preprocessing, this time training models on the **k-best** features.  "Best" here is determined by which features have the largest dependence on the class target variable.  We will use the ANOVA F-score as our measure of dependence (Mutual Information was also compared, but it showed no significant improvement and was much slower).  This preprocessing will be done **in addition to the standardization step from the previous section**.

!['K-Best Feature Selection'](images/kBest_fscore.png)

In the above figure, we see several models where the maximum (cross-validation) accuracy is at a k < 37, suggesting that we can get a performance improvement by feature subset selection.

For each model, we will find the value of **k** which maximizes the mean cross-validation accuracy.

|    _model_    | knn | svc_lin | svc_rbf | rand_for | ada | gnb | log | ann |
|---------------|-----|---------|---------|----------|-----|-----|-----|-----|
| **_best_ k**  |  4  | 33      |    3    |  20      |  26 |  9  |  20 |  32 |
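One way the search for the best **k** could look, sketched with scikit-learn's `SelectKBest` and `f_classif` inside a pipeline. The data is synthetic, the feature count of 37 is an assumption, and k-nearest neighbors is used only as an example model; the notebook may organize the search differently.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (4 classes, 37 features assumed).
X, y = make_classification(
    n_samples=400, n_features=37, n_informative=8, n_classes=4, random_state=0
)

# Standardize, keep the k features with the highest ANOVA F-score, then classify.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif)),
    ("model", KNeighborsClassifier()),
])

# Try every k from 1 to 37 and keep the one with the best mean CV accuracy.
search = GridSearchCV(
    pipeline,
    param_grid={"select__k": list(range(1, 38))},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
)
search.fit(X, y)
best_k = search.best_params_["select__k"]
print(f"best k = {best_k}, mean CV accuracy = {search.best_score_:.3f}")
```

Because the selector sits inside the pipeline, the F-scores are recomputed on each training fold, so the test fold never influences which features are kept.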


Just as in the standardization section, we compare the **k-best** cross-validation accuracy with that of the initial model benchmarks, and then compare the cross-validation, holdout, and outside validation accuracies obtained with this method.

!['Feature Selection Comparison w/ Standardization'](images/feature_subset_acc_fscore.png)
!['Feature Selection Validation'](images/fss_validation_fscore.png)

We see mild performance improvement in most of the models.  We should therefore include feature selection in our model pipeline and tune **k** during the optimization stage.


### Dimensionality Reduction
[See ModelSearch- Dimensionality Reduction notebook](ModelSearch-DimensionalityReduction.ipynb)

We are going to reduce the number of dimensions of the data using Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA).  Both methods project the data onto fewer dimensions, which can reduce noise and improve performance.  PCA projects onto the directions of greatest variance, while LDA takes the class labels into account and projects onto the directions that maximize the differences between classes.

We will include the standardization step from the previous stage.

#### Principal Component Analysis
PCA did not yield particularly useful results.  As can be seen in the graphs below, we get significantly better accuracy for most models by using all available components, and therefore it is not beneficial to project our data via PCA unless we need to improve computational speed.

!['PCA Accuracy vs Number of Components'](images/pca_ns.png)
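A sweep over the number of components, of the kind plotted above, might be sketched like this. It is an assumption-laden illustration: the data is synthetic, the component counts are arbitrary, and logistic regression stands in for the full set of models.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (4 classes, 37 features assumed).
X, y = make_classification(
    n_samples=400, n_features=37, n_informative=8, n_classes=4, random_state=0
)

# Mean 5-fold CV accuracy for several numbers of retained components.
accuracy_by_n = {}
for n_components in (2, 5, 10, 20, 37):
    pipeline = make_pipeline(
        StandardScaler(),
        PCA(n_components=n_components),
        LogisticRegression(max_iter=1000),
    )
    accuracy_by_n[n_components] = cross_val_score(pipeline, X, y, cv=5).mean()

for n_components, accuracy in accuracy_by_n.items():
    print(f"{n_components:2d} components: mean CV accuracy = {accuracy:.3f}")
```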

#### Linear Discriminant Analysis
LDA can project the data onto at most (number of classes - 1) dimensions, which is 3 for our four classes.  It is generally best to use all components.
In the graph below we compare accuracies of our pipeline with only standardization and with standardization plus the LDA projection.  We also observe how the transformation affects accuracy across our training, holdout, and outside validation data.
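The component limit can be checked with a short sketch using scikit-learn's `LinearDiscriminantAnalysis`. The data is synthetic (four classes, 37 features assumed) and only serves to show the shape of the projected output.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (4 classes, 37 features assumed).
X, y = make_classification(
    n_samples=400, n_features=37, n_informative=8, n_classes=4, random_state=0
)

# With n_components left at its default, LDA keeps
# min(n_classes - 1, n_features) components.
projector = make_pipeline(StandardScaler(), LinearDiscriminantAnalysis())
X_projected = projector.fit_transform(X, y)
print(X_projected.shape)  # (400, 3) for 4 classes
```

Unlike PCA, the fit needs the labels `y`, so inside cross-validation the projection has to be learned on each training fold only.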

!['LDA Comparison w/ Standardization'](images/lda_acc.png)
!['LDA Validation'](images/lda_validation.png)

From the first graph, we see significant improvements over the standardization-only pipeline for the knn and svc_rbf models.  However, we see about equal or worse performance on all other models, including a particularly large drop in accuracy for rand_for.

We may explore this preprocessing step in the future during optimization.  We may also find it prudent to only apply this transformation to the models where it shows improvement.

## Model Optimization
Here we will perform hyperparameter tuning on each of the models and then investigate how we could possibly further improve the accuracy.
### Random Forest
[See Optimization- Random Forest Notebook](Optimization-RandomForest.ipynb)

## Notes
### 6/28/2018
#### Resampling before cross-validation
Realized a mistake I had been making.  I was resampling all of the data before cross-validation.  Because upsampling duplicates minority-class samples, this results in copies of the same samples being present in both the training and test data sets during cross-validation, and thus in overfitted models with inflated cross-validation accuracies.

I refactored the code to include the resampling (oversampling of the minority classes) as one of the steps in a pipeline that occurs during each cross-validation run.
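The idea of resampling inside each cross-validation run can be sketched as below. This is not the refactored project code: it uses an explicit fold loop rather than a pipeline step, the helper `upsample_minority_classes` is hypothetical, the data is synthetic, and random forest is an arbitrary example model. A pipeline from the imbalanced-learn package with a random oversampler would be another way to get the same effect.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

def upsample_minority_classes(X, y, random_state=0):
    """Add resampled rows to each smaller class until all classes match the largest."""
    labels, counts = np.unique(y, return_counts=True)
    target = counts.max()
    X_parts, y_parts = [], []
    for label, count in zip(labels, counts):
        X_label = X[y == label]
        if count < target:
            # Draw the missing rows with replacement from this class only.
            extra = resample(
                X_label, replace=True, n_samples=target - count,
                random_state=random_state,
            )
            X_label = np.vstack([X_label, extra])
        X_parts.append(X_label)
        y_parts.append(np.full(len(X_label), label))
    return np.vstack(X_parts), np.concatenate(y_parts)

# Synthetic, imbalanced stand-in for the real data (proportions approximate).
X, y = make_classification(
    n_samples=600, n_features=37, n_informative=10, n_classes=4,
    weights=[0.32, 0.53, 0.07, 0.08], random_state=0,
)

fold_accuracies = []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(X, y):
    # Upsample the training fold only; the test fold keeps its original rows,
    # so no duplicated sample can appear on both sides of the split.
    X_train, y_train = upsample_minority_classes(X[train_idx], y[train_idx])
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X_train, y_train)
    fold_accuracies.append(model.score(X[test_idx], y[test_idx]))

print(f"mean CV accuracy: {np.mean(fold_accuracies):.3f}")
```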

This has resulted in cross-validation accuracies that are lower, but much more consistent with the holdout and outside validation accuracies (see the updated accuracy charts in [Benchmark Results](#Results) and [Standardization](#Standardization)).