In [1]:
###### Config #####
import sys, os, platform
if os.path.isdir("ds-assets"):
  !cd ds-assets && git pull
else:
  !git clone https://github.com/lutzhamel/ds-assets.git
colab = 'google.colab' in sys.modules
system = platform.system() # "Windows", "Linux", "Darwin"
home = "ds-assets/assets/"
sys.path.append(home)  

Already up to date.


In [2]:
# notebook level imports
import pandas as pd
import dsutils                        # classification_confint
from sklearn import neural_network    # MLPClassifier
from sklearn import metrics           # accuracy_score
from sklearn import model_selection   # GridSearchCV

# Constructing a basic MLP

**NOTE**: For the most part we will stick with the sklearn terminology and call ANNs **Multi Layer Perceptrons**.

Let's build some MLPs.  A fundamental problem with MLP design is the sheer number of design possibilities for these models.  The MLP classifier in the sklearn package has 23 (!) tunable parameters.  The good news is that all of these parameters, except for the architectural parameters and the maximum number of training iterations, have good default values. **For the architectural parameters a good starting point is an MLP with a single hidden layer where the number of nodes in the hidden layer is computed as follows**,

 $$
\#{\rm hidden\,nodes} = 2 \times \#{\rm vars}
$$

That is, the number of hidden nodes is twice the number of independent variables in the training data.  For the maximum number of training iterations we simply choose a very large value, e.g. 10,000. Let's try this using the breast cancer dataset,

In [3]:
# get data
df = pd.read_csv(home+"wdbc.csv").drop(columns=['ID'])

X  = df.drop(columns=['Diagnosis'])
y = df['Diagnosis']

In [4]:
X.shape

(569, 30)

Looking at the shape of the training data we see that there are 30 independent variables. Applying our rule from above means that we should construct an MLP with a single hidden layer that contains 60 nodes.

In [5]:
# neural network architecture
nnodes = 2 * X.shape[1]
print("We are using {} nodes".format(nnodes))

We are using 60 nodes


In [6]:

# build model object
model = neural_network.MLPClassifier(hidden_layer_sizes=(nnodes,), 
                                     max_iter=10000, 
                                     random_state=1)

# train the model
model.fit(X, y) 

In [7]:
# test the model with resubstitution error (use training data for testing)
predict_y = model.predict(X)
acc = metrics.accuracy_score(y, predict_y)
lb, ub = dsutils.classification_confint(acc, X.shape[0])
print("Accuracy: {:3.2f} ({:3.2f}, {:3.2f})".format(acc, lb, ub))

Accuracy: 0.95 (0.93, 0.96)


The accuracy of this classifier is encouraging given that we constructed it using just our rule of thumb.  
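
The interval printed next to the accuracy comes from `dsutils.classification_confint`. The following is a minimal sketch of how such an interval can be computed, assuming the usual normal approximation for a proportion at the 95% level. The helper name `approx_confint` and the example count are hypothetical, and the actual `dsutils` implementation may differ.

```python
import math

def approx_confint(acc, n, z=1.96):
    """Approximate 95% confidence interval for an accuracy `acc`
    measured on `n` observations (normal approximation, an assumption)."""
    half_width = z * math.sqrt(acc * (1.0 - acc) / n)
    return acc - half_width, acc + half_width

# hypothetical example: 538 of 569 observations classified correctly
demo_acc = 538 / 569
demo_lb, demo_ub = approx_confint(demo_acc, 569)
print("Accuracy: {:3.2f} ({:3.2f}, {:3.2f})".format(demo_acc, demo_lb, demo_ub))
```

The interval narrows as the number of observations grows, which is why the size of the data set is passed along with the accuracy.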

**NOTE**: It is OK to use the training data as testing data (resubstitution) here because we are not searching for the best model; we just want to assess this one.  Since the model parameters are fixed, there is no danger of overfitting.

# MLP Grid Search

We have to perform a grid search to find the optimal network.

Beware that a grid search over all possible parameters of an MLP is almost impossible: there are too many possible combinations and training MLPs is very slow.  To mitigate this we search over only a couple of key parameters, and even then only over a few key combinations.  We use the same data set as in the previous section.
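
To get a feel for the cost before starting a search, the number of candidate models can be counted up front. The sketch below uses sklearn's `ParameterGrid` on a grid shaped like the one in the next cell; the value `nnodes = 60` and the helper list `sizes` are assumptions made so that the snippet runs on its own.

```python
from sklearn.model_selection import ParameterGrid

nnodes = 60  # assumed value: 2 * 30 independent variables
sizes = [nnodes // 2, nnodes, nnodes * 2]

count_grid = {
    # three single-layer and nine two-layer architectures
    'hidden_layer_sizes': [(s,) for s in sizes]
                          + [(a, b) for a in sizes for b in sizes],
    'activation': ['logistic', 'tanh', 'relu'],
}

n_candidates = len(ParameterGrid(count_grid))
n_folds = 3
print("candidates:", n_candidates)            # → candidates: 36
print("model fits:", n_candidates * n_folds)  # → model fits: 108
```

With the default `refit=True`, `GridSearchCV` trains one additional model on the full data set after the search, so even this small grid trains more than a hundred networks.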

In [8]:
# neural network object
model = neural_network.MLPClassifier(max_iter=10000, random_state=1)

# grid search
# We set up a grid search over the architecture and activation functions.
# In the architecture search we limit ourselves to node values that are multiples
# of the number of independent variables in the training data.  Also, we
# limit ourselves to a maximum of two hidden layers.

nnodes = 2*X.shape[1] # rule of thumb

param_grid = {
    # search over different architectures
    'hidden_layer_sizes':
      [
      # single layer MLP: vary size by nnodes with multipliers of 2
      (nnodes//2,), (nnodes,), (nnodes*2,),
      # 2 layers: first fixed at nnodes/2, second varying
      (nnodes//2,nnodes//2), (nnodes//2, nnodes), (nnodes//2, nnodes*2),
      # 2 layers: first fixed at nnodes, second varying
      (nnodes, nnodes//2), (nnodes,nnodes), (nnodes, nnodes*2),
      # 2 layers: first nnodes*2, second varying
      (nnodes*2, nnodes//2), (nnodes*2, nnodes), (nnodes*2, nnodes*2)
      ],
    # search different activation functions
    'activation' : ['logistic',  'tanh', 'relu']  
}

# use 3-fold cross-validation, otherwise the grid search takes too long
grid = model_selection.GridSearchCV(model, param_grid, cv=3)
grid.fit(X, y)
print("Grid Search: best parameters: {}".format(grid.best_params_))


Grid Search: best parameters: {'activation': 'logistic', 'hidden_layer_sizes': (30, 30)}


In [9]:
# evaluate the best model -- resubstitution error
best_model = grid.best_estimator_
predict_y = best_model.predict(X)
acc = metrics.accuracy_score(y, predict_y)
lb,ub = dsutils.classification_confint(acc,X.shape[0])
print("Accuracy: {:3.2f} ({:3.2f},{:3.2f})".format(acc,lb,ub))

Accuracy: 0.96 (0.94,0.97)


Interestingly enough, the network selected by the grid search has two hidden layers with 30 nodes each and uses the logistic activation function.
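
A grid search keeps the cross-validation scores of every candidate, not only the winner, in its `cv_results_` attribute. The sketch below shows one way to rank the candidates with pandas. It uses a small synthetic data set and a tiny grid (both assumptions, chosen only so the example runs quickly), so its scores say nothing about the breast cancer data.

```python
import warnings
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# small synthetic data set so the sketch runs quickly (not the wdbc data)
X_demo, y_demo = make_classification(n_samples=150, n_features=4, random_state=0)

demo_grid = {
    'hidden_layer_sizes': [(4,), (8,)],
    'activation': ['tanh', 'relu'],
}

with warnings.catch_warnings():
    # the short training budget may trigger convergence warnings
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    demo_search = GridSearchCV(
        MLPClassifier(max_iter=500, random_state=1), demo_grid, cv=3)
    demo_search.fit(X_demo, y_demo)

# one row per candidate, best candidate first
results = pd.DataFrame(demo_search.cv_results_)
cols = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
print(results[cols].sort_values('rank_test_score'))
```

Applied to the `grid` object above, the same table would show how far the runner-up architectures are from the best one.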

## Activation Functions Reviewed

Logistic:<br>
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/Logistic-curve.svg/1920px-Logistic-curve.svg.png"
width="300" height="200">

Hyperbolic-Tangent:<br>
<img src="https://media.datacamp.com/cms/google/ad_4nxccvibwobiyw2uym_egwqg8pf6zmnygb_mqy3khokh_3biaex50ultijygp7wg_13qdnddbbd2yevutty96pcimrj3hdihnpv-vsjopo4wyvfpzp92e8kj_i6q4ync0x-hvucvwnevb9i9nnrnyxbayh8ge.png"
width="300" height="200">

ReLU (Rectified Linear Unit):<br>
<img src="https://media.datacamp.com/cms/google/ad_4nxdm3mjfeqgnkibwih8jqb9p93eqd1zmoasqf17atrjzvc7vyafjt2d5lvglvfy9tbuy86dsd7uijmrak-nqpqmfniawbkirncuwlyspzojwo6ta6xrda1mcq-dkuizktyc6jk4peni3evumlow5lkpywah1.png"
width="300" height="200">
<br>
[More Details](https://en.wikipedia.org/wiki/Activation_function#Table_of_activation_functions)
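
The three activation functions pictured above can also be written down directly. This is a minimal NumPy sketch; the function names `logistic` and `relu` are chosen here for illustration and are not part of sklearn's public API.

```python
import numpy as np

def logistic(x):
    """Logistic (sigmoid) function: squashes values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    """Rectified linear unit: zero for negative inputs, identity otherwise."""
    return np.maximum(0.0, x)

x_vals = np.array([-2.0, 0.0, 2.0])
print("logistic:", logistic(x_vals))
print("tanh:    ", np.tanh(x_vals))   # squashes values into (-1, 1)
print("relu:    ", relu(x_vals))
```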

## Back Propagation Reviewed

<center>
<img src="https://machinelearningknowledge.ai/wp-content/uploads/2019/10/Backpropagation.gif"
width="400" height="300">
</center>

[source](https://machinelearningknowledge.ai/animated-explanation-of-feed-forward-neural-network-architecture/)
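
The animation can be made concrete with a few lines of NumPy. The following is a minimal sketch of back propagation for a network with one hidden layer, assuming logistic activations, a squared-error loss, and plain gradient descent on a toy XOR data set. All names and settings are illustrative; sklearn's `MLPClassifier` minimizes log-loss and uses the `adam` solver by default, so this is not what happens inside sklearn.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data: the XOR problem
X_toy = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y_toy = np.array([[0.], [1.], [1.], [0.]])

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# weights and biases of a 2-4-1 network
W1 = rng.normal(size=(2, 4)); b1 = np.zeros(4)
W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)

def forward(X):
    h = logistic(X @ W1 + b1)     # hidden layer activations
    out = logistic(h @ W2 + b2)   # network output
    return h, out

def loss(out):
    return float(np.mean((out - y_toy) ** 2))

initial_loss = loss(forward(X_toy)[1])

lr = 0.5  # learning rate; constant factors of the gradient are folded into it
for _ in range(2000):
    # forward pass
    h, out = forward(X_toy)
    # backward pass: error signal at the output, then pushed back to the hidden layer
    d_out = (out - y_toy) * out * (1.0 - out)
    d_hid = (d_out @ W2.T) * h * (1.0 - h)
    # gradient descent update of all weights and biases
    W2 -= lr * (h.T @ d_out)
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (X_toy.T @ d_hid)
    b1 -= lr * d_hid.sum(axis=0)

final_loss = loss(forward(X_toy)[1])
print("loss before: {:.3f}  after: {:.3f}".format(initial_loss, final_loss))
```

The hidden-layer error signal is computed before `W2` is updated, so both layers use gradients from the same forward pass.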

# Model Comparison

The accuracy of the network we constructed using our rule of thumb was,
```
0.95 (0.93, 0.96)
```
and the accuracy of our network constructed using a grid search was,
```
0.96 (0.94, 0.97)
```
Our first observation is that our rule of thumb got us pretty close to the performance of our optimized network.
The second observation is that **the difference in accuracy between these two models is not statistically significant** because their confidence intervals overlap.  
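
The overlap test used here is easy to automate. A minimal sketch follows, using the two intervals reported above; the helper name `intervals_overlap` is hypothetical.

```python
def intervals_overlap(a, b):
    """True if the closed intervals a = (lb, ub) and b = (lb, ub) share a point."""
    return a[0] <= b[1] and b[0] <= a[1]

rule_of_thumb = (0.93, 0.96)   # interval reported for the rule-of-thumb network
grid_search   = (0.94, 0.97)   # interval reported for the grid-search network

print(intervals_overlap(rule_of_thumb, grid_search))  # → True
```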

# Project


Please see BrightSpace.