# Applying the ANN Model to the Data
The goal of this notebook is to apply the MLP Classifier model to the data and to optimize its performance. All of
the code lives in this notebook; in particular, PCA is applied here in the pipeline rather than through the `data_loader`.


### What was learned from `ANN.ipynb`?
We saw that a largely un-tuned MLP Classifier can predict with $\approx 64\%$ accuracy. The goal is to greatly improve
that metric.

From `ANN.ipynb`, the top 10 models that we saw were as follows:
```
    accuracy activation  n_hidden_layers  n_nodes_per_layer solver
12  0.643750       relu              2.0              279.0   adam
20  0.643750       relu              4.0              139.0   adam
9   0.639583   logistic              2.0              279.0  lbfgs
16  0.639583       relu              3.0              186.0   adam
7   0.631250       relu              1.0              100.0  lbfgs
10  0.629167   logistic              2.0              279.0   adam
19  0.627083       relu              4.0              139.0  lbfgs
15  0.622917       relu              3.0              186.0  lbfgs
8   0.620833       relu              1.0              100.0   adam
5   0.616667   logistic              1.0              100.0  lbfgs
```

### Thoughts on how we may improve the model?
* We should try evaluating models with 1-4 hidden layers.
    * However, models with 3+ layers are prone to overfitting. We can account for this by performing a
    train/validation/test split (instead of the normal train/test split). We will then train until the loss on the
    validation set reaches its minimum.
* It seems like stochastic solvers (`adam`) worked well with the `relu` activation function. Correspondingly, I'm going
    to use SGD with Nesterov momentum (which `sklearn` applies by default for `solver='sgd'` whenever `momentum > 0`).
    * I'm familiar with its inner workings from another class (Math 450), so it should be a pleasant optimization.
* Z-score standardization still makes the most sense: PCA assumes centered data and is sensitive to feature scale, and
NN training also depends heavily on the scale of the inputs.
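* We _definitely_ need to optimize the learning rate. The slides recommend using $\eta = 0.1$ as an initial value;
however, it will need to be annealed over time.
    * It's typical to anneal the learning rate by $\frac{1}{N_\mathrm{iter}}$ when using SGD. ADAM instead rescales each
    step using running estimates of the gradient moments, so it depends less on an explicit schedule.
    * The initial learning rate used by `sklearn` is actually `0.001`, a hundredfold difference from the slides' value.
        * `sklearn` does not perform a Wolfe-condition line search for SGD. Instead, `learning_rate='invscaling'` decays
        the rate as `learning_rate_init / t**power_t`, and `learning_rate='adaptive'` divides it by 5 whenever
        progress stalls.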
* We need to perform some degree of optimization on the regularization parameter as well.
    * The default regularization constant is $\alpha = 0.0001$, which is likely too small to have much effect on our
        overfitting problem.

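The two learning-rate schedules mentioned in the list above can be sketched in plain Python. This is a simplified illustration and not `sklearn`'s actual implementation: `invscaling_rate` follows the documented formula `learning_rate_init / t**power_t`, while `adaptive_rates` is a hypothetical helper whose `patience` and `factor` arguments are names chosen here for illustration.

```python
def invscaling_rate(eta0, t, power_t=0.5):
    """Inverse-scaling schedule: eta_t = eta0 / t**power_t, for step t >= 1."""
    return eta0 / (t ** power_t)


def adaptive_rates(losses, eta0, tol=1e-4, patience=2, factor=5.0):
    """Simplified plateau rule (an assumption, not sklearn's code): divide the
    rate by `factor` after `patience` consecutive epochs whose loss fails to
    improve on the best loss seen so far by at least `tol`."""
    eta, best, stalled = eta0, float("inf"), 0
    rates = []
    for loss in losses:
        if loss > best - tol:
            stalled += 1   # no meaningful improvement this epoch
        else:
            stalled = 0    # progress was made, reset the counter
        best = min(best, loss)
        if stalled >= patience:
            eta /= factor
            stalled = 0
        rates.append(eta)
    return rates


# Rate used at each epoch for a loss curve that plateaus at 0.5.
print(adaptive_rates([1.0, 0.5, 0.5, 0.5, 0.4], eta0=0.1))
```

With `power_t=1.0`, `invscaling_rate` reduces to the $\frac{1}{N_\mathrm{iter}}$ annealing described above.
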
### Define a data pipeline
The pipeline will:
1. Apply Z-score standardization to the data
2. Apply PCA with 8 components to the data
3. Feed the data into the MLP Classifier.
    * Note this uses SGD with early stopping, which holds out part of the training data (10% by default) as a
    validation set to combat overfitting.

In [1]:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from models.data_loader import DataLoader
from sklearn.model_selection import GridSearchCV

# We learned from `exploring_data.ipynb` that PCA with 8 principal components is optimal.
n_components = 8

# Fixed random state so that results are reproducible.
random_state = np.random.RandomState(42069)

# Build the model pipeline
model_pipeline = Pipeline(steps=[
    ('standardization', StandardScaler()),
    ('pca', PCA(n_components=n_components, random_state=random_state)),
    ('classifier', MLPClassifier(
        random_state=random_state, activation='relu', solver='sgd', learning_rate='adaptive', early_stopping=True
    ))
])

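To see what early stopping records during training, here is a minimal sketch on synthetic data. The data from `make_classification` and the small layer size are assumptions made purely for illustration; they are not the wine data or the settings tuned in this notebook.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (assumption: NOT the wine-quality dataset).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

clf = MLPClassifier(
    hidden_layer_sizes=(16,),
    solver='sgd',
    learning_rate='adaptive',
    early_stopping=True,       # hold out part of the training data
    validation_fraction=0.1,   # sklearn's default hold-out share
    max_iter=200,
    random_state=0,
)
with warnings.catch_warnings():
    # Silence the warning raised if max_iter is reached before convergence.
    warnings.simplefilter('ignore', ConvergenceWarning)
    clf.fit(X, y)

# One validation score is recorded per epoch; the best one is kept separately.
print(clf.n_iter_, len(clf.validation_scores_), clf.best_validation_score_)
```

The validation score here is accuracy on the held-out 10%, so it is only a rough guide when that hold-out set is small.
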
### Load the data and create a test/train split.

In [2]:
dl = DataLoader('../data/winequality-red.csv', random_state=random_state)
X_train, X_test, y_train, y_test = dl.train_test_split()

N_train, d = X_train.shape

N_h = N_train // 10  # Following the recommendation from the slides.

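The grid defined later splits this `N_h` budget evenly across 1 to 4 hidden layers. A small sketch of that arithmetic follows; `equal_width_layers` is a hypothetical helper and `N_train_example = 1200` is a made-up value, since the real `N_train` depends on the `DataLoader` split.

```python
def equal_width_layers(n_hidden_total, depth):
    """Split a hidden-unit budget evenly across `depth` layers (floor division)."""
    width = n_hidden_total // depth
    return (width,) * depth


N_train_example = 1200               # hypothetical training-set size
N_h_example = N_train_example // 10  # same rule as N_h above

for depth in range(1, 5):
    print(depth, equal_width_layers(N_h_example, depth))
# 1 (120,)
# 2 (60, 60)
# 3 (40, 40, 40)
# 4 (30, 30, 30, 30)
```

This keeps the total number of hidden units roughly constant across depths, though not the number of weights, which also depends on the products of adjacent layer widths.
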
### Define the search parameter grid space

Parameters to change:
* `hidden_layer_sizes`
* `alpha` - regularization penalty
* `learning_rate_init` - the initial learning rate
* `momentum`

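Before launching the search, it can help to count how many models it will fit. The sketch below mirrors the grid in the next cell, with `N_h = 120` as a placeholder assumption (the real value comes from `N_train // 10`); the count does not depend on it.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

N_h = 120  # placeholder assumption; the notebook computes this from N_train

parameter_grid = {
    'classifier__hidden_layer_sizes': [
        (N_h,), (N_h // 2,) * 2, (N_h // 3,) * 3, (N_h // 4,) * 4
    ],
    'classifier__alpha': np.logspace(-4, -2, 5),
    'classifier__learning_rate_init': [0.1, 0.01, 0.001],
    'classifier__momentum': [0, 0.9],
}

n_candidates = len(ParameterGrid(parameter_grid))  # 4 * 5 * 3 * 2
cv_folds = 3
print(n_candidates, n_candidates * cv_folds)  # 120 360
```

The `classifier__` prefix routes each parameter to the pipeline step of that name. With 3-fold cross-validation, that comes to 360 fits, plus one final refit of the best candidate on the full training set.
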
In [None]:
parameter_grid = {
    'classifier__hidden_layer_sizes': [
        (N_h,),
        (N_h // 2, N_h // 2),
        (N_h // 3, N_h // 3, N_h // 3),
        (N_h // 4, N_h // 4, N_h // 4, N_h // 4),
    ],
    'classifier__alpha': np.logspace(-4, -2, 5),
    'classifier__learning_rate_init': [0.1, 0.01, 0.001],
    'classifier__momentum': [0, 0.9]
}

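# 3-fold cross-validation.
# NOTE: scoring='roc_auc' only supports binary targets; multiclass labels would need e.g. 'roc_auc_ovr'.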
grid_search = GridSearchCV(model_pipeline, param_grid=parameter_grid, scoring='roc_auc', n_jobs=-1, cv=3)

grid_search.fit(X_train, y_train)
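
Once a search like the one above has been fitted, its results can be ranked and the best model scored on held-out data. The following is a minimal sketch on synthetic binary data with a deliberately tiny grid; the data, the grid values, and `max_iter=50` are assumptions chosen so the example runs quickly, not the settings used in this notebook.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary data (assumption: a stand-in for the wine data).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

pipe = Pipeline(steps=[
    ('standardization', StandardScaler()),
    ('classifier', MLPClassifier(solver='sgd', max_iter=50, random_state=0)),
])
grid = {'classifier__alpha': [1e-4, 1e-2], 'classifier__momentum': [0, 0.9]}

search = GridSearchCV(pipe, param_grid=grid, scoring='roc_auc', cv=3)
with warnings.catch_warnings():
    warnings.simplefilter('ignore', ConvergenceWarning)
    search.fit(X_tr, y_tr)

# Rank candidates by mean cross-validated score (rank 1 is best).
cv = search.cv_results_
for i in cv['rank_test_score'].argsort():
    print(cv['params'][i], round(cv['mean_test_score'][i], 3))

print(search.best_params_)
# Scores the refitted best pipeline on the test split with the same scorer.
print(search.score(X_te, y_te))
```

Scoring on the test split only once, after the search has finished, keeps the test set out of model selection.
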