# ANN

Neural Networks are known as "universal approximators," meaning that they can learn _any_ function, no matter how
complex, given the correct number of hidden layers + nodes/layer. This is primarily just to see how effective it could
be, without much tuning.

#### Goals:
1. Determine how effective an ANN would be at this type of classification task.
2. Get a general idea of what ballpark parameters should be in.

I imagine that this will result in considerable overfitting and may not generalize too well. I am expecting, however,
$> 60\%$ accuracy from this model. I do not anticipate that this will perform much better than a RFC model.

#### Loading the Data

In [1]:
import numpy as np
from models.data_loader import DataLoader
from sklearn.neural_network import MLPClassifier
import itertools
import pandas as pd

rs = np.random.RandomState(42069)

# Load the data.
dl = DataLoader('../data/winequality-red.csv', random_state=rs)
dl.apply_pca_to_dataset()

# Apply a Train/Test split.
X_train, X_test, y_train, y_test = dl.train_test_split()

# Obtain the dimension of the data.
N_train, d = X_train.shape

N_h = N_train // 2

#### Building the MLP Classifier
There are a couple of parameters to consider.
* `hidden_layer_sizes`: tuple containing how many nodes to have for each hidden layer
* `activation`: which activation function to use. I'm a fan of `logistic` (sigmoid), but `ReLU` is good too.
* `solver`: the `lbfgs` solver is a quasi-Newton method that works well on small datasets. However, we can explore `adam`
as well, which is a type of SGD.

Notes: There are many other parameters to tune, including learning rate, regularization penalty, etc. However, these will
be saved for if/when we select the ANN as a final model.

Like before, the set of parameter setups we're exploring is $\mathrm{hidden\_layer\_sizes} \times \mathrm{activation} \times \mathrm{solver}$.


In [2]:
hidden_layer_sizes = [
    (100,), # This is the default
    # There's a recommendation for the number of hidden nodes to be 1/10th of the total number of examples.
    # See `slides_module6.pdf` page 32. I am exploring that option here, with a 2-layer, 3-layer, and 4-layer NN.
    (N_h//2, N_h//2,),
    (N_h//3, N_h//3, N_h//3,),
    (N_h//4, N_h//4, N_h//4, N_h//4,),
]
activation = ['logistic', 'relu']
solvers = ['lbfgs', 'adam']

In [3]:
performance = pd.DataFrame(index=['n_hidden_layers', 'n_nodes_per_layer', 'activation', 'solver', 'accuracy'])

for hidden_layer_setup, activation_fn, solver in itertools.product(hidden_layer_sizes, activation, solvers):
    # Build and fit the model.
    mlp = MLPClassifier(hidden_layer_sizes=hidden_layer_setup, activation=activation_fn, solver=solver, random_state=rs,
                        max_iter=2000)
    mlp.fit(X_train, y_train)

    # Score the model on the test set.
    accuracy = mlp.score(X_test, y_test)

    # Log its performance
    performance = performance.append({
        'n_hidden_layers': len(hidden_layer_setup),
        'n_nodes_per_layer': hidden_layer_setup[0],
        'activation': activation_fn,
        'solver': solver,
        'accuracy': accuracy
    }, ignore_index=True)



#### Determine the Top-10 Best Model Configurations

In [4]:
performance = performance.sort_values('accuracy', ascending=False)
print(performance.head(10))

    accuracy activation  n_hidden_layers  n_nodes_per_layer solver
12  0.643750       relu              2.0              279.0   adam
20  0.643750       relu              4.0              139.0   adam
9   0.639583   logistic              2.0              279.0  lbfgs
16  0.639583       relu              3.0              186.0   adam
7   0.631250       relu              1.0              100.0  lbfgs
10  0.629167   logistic              2.0              279.0   adam
19  0.627083       relu              4.0              139.0  lbfgs
15  0.622917       relu              3.0              186.0  lbfgs
8   0.620833       relu              1.0              100.0   adam
5   0.616667   logistic              1.0              100.0  lbfgs


#### What was learned?
* It performs better than the SVM + KNN models (without much tuning), but it's outclassed by the Random Forest model.
* The model appears to not always converge when using a stochastic optimizer (even within 2000 iterations), meaning that
    either the data is mal-formatted internally or this will become a very computationally expensive operation.
* While I do not have definitive proof, I'm assuming that there is considerable overfitting occurring (as is the nature
of MLPs). If we do proceed with this model, we _will_ need to focus on annealing the learning rate properly + tuning the
    regularization parameter.
* The prototypical `ReLU` activation function appears to be the best.