In [1]:
###### Set Up #####
# verify our folder with the data and module assets is installed
# if it is installed make sure it is the latest
!test -e ds-assets && cd ds-assets && git pull && cd ..
# if it is not installed clone it
!test ! -e ds-assets && git clone https://github.com/lutzhamel/ds-assets.git
# point to the folder with the assets
home = "ds-assets/assets/"
import sys
sys.path.append(home)      # add home folder to module search path

Cloning into 'ds-assets'...
remote: Enumerating objects: 204, done.
remote: Counting objects: 100% (40/40), done.
remote: Compressing objects: 100% (28/28), done.
remote: Total 204 (delta 18), reused 34 (delta 12), pack-reused 164
Receiving objects: 100% (204/204), 9.57 MiB | 27.60 MiB/s, done.
Resolving deltas: 100% (78/78), done.


# Constructing a basic ANN/MLP

Let's build some MLPs.  A fundamental problem with MLP design is the sheer number of design possibilities for these models.  The MLP classifier in the sklearn package has 23 (!) tunable parameters.  The good news is that all of these parameters, except for the architectural parameters and the maximum number of training iterations, have good default values. For the architectural parameters a good starting point is an MLP with a single hidden layer where the number of nodes in the hidden layer is computed as follows,

> $ \#\mbox{hidden nodes} = 2 \times \#\mbox{vars}$

That is, the number of hidden nodes is twice the number of independent variables in the training data.  For the maximum number of training iterations we simply choose a very large value, e.g. 10,000. Let's try this using the breast cancer dataset,

In [2]:
# set up
import pandas as pd
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from confint import classification_confint

In [5]:
# get data
df = pd.read_csv(home+"wdbc.csv").drop(columns=['ID'])

X  = df.drop(columns=['Diagnosis'])
y = df['Diagnosis']  # ANN wants this to be a series

Looking at the shape of the training data we see that there are 30 independent variables. Applying our rule from above means that we should construct an MLP with a single hidden layer that contains 60 nodes.
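
As a small sketch, the rule of thumb can be written as a helper that derives the architecture from the number of independent variables.  The function name `rule_of_thumb_layers` and its `factor` argument are hypothetical; they are not part of sklearn or of the course assets.

```python
# Rule of thumb: a single hidden layer with twice as many nodes as there
# are independent variables.  The helper name is hypothetical.
def rule_of_thumb_layers(n_vars, factor=2):
    """Return a hidden_layer_sizes tuple for a single-hidden-layer MLP."""
    if n_vars < 1:
        raise ValueError("need at least one independent variable")
    return (factor * n_vars,)

# the breast cancer data has 30 independent variables
layers = rule_of_thumb_layers(30)
print(layers)   # (60,)
```

With the data frame from above one could call `rule_of_thumb_layers(X.shape[1])` and pass the result as `hidden_layer_sizes` to `MLPClassifier`.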

In [12]:
# neural network
model = MLPClassifier(hidden_layer_sizes=(60,), max_iter=10000, random_state=1)

# train and test the model
(X_train, X_test, y_train, y_test) = \
    train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=1)

model.fit(X_train, y_train)
predict_y = model.predict(X_test)
acc = accuracy_score(y_test, predict_y)
lb, ub = classification_confint(acc, X.shape[0])
print("Accuracy: {:3.2f} ({:3.2f}, {:3.2f})".format(acc, lb, ub))

Accuracy: 0.93 (0.91, 0.95)


The accuracy of this classifier is encouraging given that we constructed it using just our rule of thumb.  Note that we hold out 20% of the data as a test set, so the accuracy is measured on data the model did not see during training.  The confidence interval, however, is computed using the size of the whole data set (`X.shape[0]`); using the size of the test set (`X_test.shape[0]`) instead would give a wider interval.  A single train-test split is sufficient here because we are not performing a model search, we simply want to see how our rule of thumb performs.  For a model search we use cross-validation, as we do in the grid search below.
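
The interval printed next to the accuracy comes from `classification_confint` in the course assets.  A common way to compute such an interval is the normal approximation for a proportion, $acc \pm 1.96 \sqrt{acc(1-acc)/n}$.  The sketch below is an assumption about what such a helper might look like, not the actual `confint` implementation.  The name `approx_confint` is hypothetical, and the sample sizes 569 and 114 assume that the data set has 569 rows, of which 20% are held out.

```python
import math

def approx_confint(acc, n, z=1.96):
    """Approximate 95% confidence interval for an accuracy `acc`
    measured on `n` observations (normal approximation)."""
    half_width = z * math.sqrt(acc * (1.0 - acc) / n)
    # clip to the valid range of an accuracy
    return max(0.0, acc - half_width), min(1.0, acc + half_width)

# the interval narrows as the number of observations grows
lb_small, ub_small = approx_confint(0.93, 114)   # assumed size of a 20% test set
lb_large, ub_large = approx_confint(0.93, 569)   # assumed size of the whole data set
print("n=114: ({:3.2f}, {:3.2f})".format(lb_small, ub_small))
print("n=569: ({:3.2f}, {:3.2f})".format(lb_large, ub_large))
```

The number of observations passed to such a function should match the data the accuracy was measured on, because a smaller sample gives a wider interval.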

# MLP Grid Search

To look for a better network than the one given by our rule of thumb we perform a grid search.

Beware that a grid search over all possible parameters of an MLP is almost impossible: there are too many possible combinations and training MLPs is slow.  To mitigate this we concentrate on a couple of key parameters to search over (see the comments in the code).
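
To get a feel for how quickly the search space grows, the sketch below counts the combinations in a small grid of architectures and activation functions and compares it with a grid extended by three more parameters.  `solver`, `alpha` and `learning_rate_init` are parameters of `MLPClassifier`, but the candidate values chosen for `alpha` and `learning_rate_init` here are hypothetical and serve only as an illustration.

```python
from itertools import product

# restricted grid: 12 architectures x 3 activation functions
architectures = [
    (30,), (60,), (120,),
    (30, 30), (30, 60), (30, 120),
    (60, 30), (60, 60), (60, 120),
    (120, 30), (120, 60), (120, 120),
]
activations = ['logistic', 'tanh', 'relu']
restricted = list(product(architectures, activations))

# extended grid: three more parameters with only a handful of values each
# (the alpha and learning_rate_init values are hypothetical)
solvers = ['lbfgs', 'sgd', 'adam']
alphas = [1e-5, 1e-4, 1e-3, 1e-2]
learning_rate_inits = [1e-4, 1e-3, 1e-2, 1e-1]
extended = list(product(architectures, activations,
                        solvers, alphas, learning_rate_inits))

folds = 3  # each combination is trained once per cross-validation fold
print("restricted:", len(restricted), "combinations,",
      len(restricted) * folds, "model fits")
print("extended:  ", len(extended), "combinations,",
      len(extended) * folds, "model fits")
```

Adding just three parameters multiplies the number of model fits by 48, which is why we restrict the search to the architecture and the activation function.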

In [8]:
# set up
import pandas as pd
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV
from confint import classification_confint

# get data
df = pd.read_csv(home+"wdbc.csv").drop(columns=['ID'])
X  = df.drop(columns=['Diagnosis'])
y = df['Diagnosis'] # ANN wants this to be a series

# neural network object
model = MLPClassifier(max_iter=10000, random_state=1)

# grid search
# We set up a grid search over the architecture and activation functions.
# In the architecture search we limit ourselves to node values that are multiples
# of the number of independent variables in the training data.  Also, we
# limit ourselves to a maximum of two hidden layers.
param_grid = {
    # search over different architectures
    'hidden_layer_sizes':
      [
      (30,), (60,), (120,),            # single layer MLP
      (30,30), (30, 60), (30, 120),    # 2 layers, first 30, second varying
      (60, 30), (60,60), (60, 120),    # 2 layers, first 60, second varying
      (120, 30), (120, 60), (120, 120) # 2 layers, first 120, second varying
      ],
    # search different activation functions
    'activation' : ['logistic', 'tanh', 'relu']
}

# use 3-fold cross-validation, otherwise the grid search takes too long
grid = GridSearchCV(model, param_grid, cv=3)
grid.fit(X, y)
print("Grid Search: best parameters: {}".format(grid.best_params_))

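# evaluate the best model on the full data set (by default GridSearchCV
# refits the best model on all of X, so this is accuracy on the training data)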
best_model = grid.best_estimator_
predict_y = best_model.predict(X)
acc = accuracy_score(y, predict_y)
lb,ub = classification_confint(acc,X.shape[0])
print("Accuracy: {:3.2f} ({:3.2f},{:3.2f})".format(acc,lb,ub))

Grid Search: best parameters: {'activation': 'logistic', 'hidden_layer_sizes': (30, 30)}
Accuracy: 0.96 (0.94,0.97)


Interestingly enough, the best network found by the grid search has two hidden layers with 30 nodes each and uses the logistic activation function.

# Model Comparison

The accuracy of the network we constructed using our rule of thumb was,
```
93% (91%, 95%)
```
and the accuracy of our network constructed using a grid search was,
```
96% (94%, 97%)
```
Our first observation is that our rule of thumb got us pretty close to the performance of our optimized network.
The second observation is that **the difference in accuracy between these two models is not statistically significant** because their confidence intervals overlap.  
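
The overlap criterion used above can be checked with a few lines of code.  This is a minimal sketch; the helper name `intervals_overlap` is hypothetical, and the interval bounds are the rounded values reported above.

```python
def intervals_overlap(a, b):
    """True if the closed intervals a=(lb, ub) and b=(lb, ub) share a point."""
    return a[0] <= b[1] and b[0] <= a[1]

rule_of_thumb = (0.91, 0.95)   # interval reported for the rule-of-thumb MLP
grid_search = (0.94, 0.97)     # interval reported for the grid-search MLP

if intervals_overlap(rule_of_thumb, grid_search):
    print("intervals overlap: difference is not statistically significant")
else:
    print("intervals do not overlap: difference is statistically significant")
```

Here the two intervals share the range from 0.94 to 0.95, so the check reports an overlap.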

# Project


An exercise in artificial neural networks.  For more details please see BrightSpace Assignment #4.