In [16]:
###### Config #####
import sys, os, platform
if os.path.isdir("ds-assets"):
  !cd ds-assets && git pull
else:
  !git clone https://github.com/lutzhamel/ds-assets.git
colab = 'google.colab' in sys.modules
system = platform.system() # "Windows", "Linux", "Darwin"
home = "ds-assets/assets/"
sys.path.append(home)  

Already up to date.


In [17]:
# notebook level imports
import pandas as pd
import dsutils                        # classification_confint
from sklearn import neural_network    # MLPClassifier
from sklearn import metrics           # accuracy_score
from sklearn import model_selection   # GridSearchCV

# Constructing a basic MLP

**NOTE**: For the most part we will stick with the sklearn terminology and call artificial neural networks (ANNs) **Multi-Layer Perceptrons** (MLPs).

* A fundamental problem with MLP design is the **sheer number of design possibilities**.  
* The sklearn MLP classifier has 23 (!) tunable parameters.  
* The good news is that most of these parameters have good default values that we don't have to touch.
* There are only a few parameters that have an extraordinary effect on the quality of MLP models:
   1. The number of **layers** and **nodes** in the network
   1. The **transfer/activation** function
* The final thing we have to worry about is: **How often do we apply the training data to the network
   until it is fully trained** -- the `max_iter` parameter.
   * Unfortunately, the default value of 200 is often insufficient for the network to converge.
   * I usually set this to some very large value like **10000** or in some instances **100000**.
   * The good news is that sklearn issues a `ConvergenceWarning` if training stops at `max_iter` before the network has converged.
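
The warning can be provoked on purpose to see what it looks like. The following is a minimal sketch, assuming synthetic data from `make_classification` (not part of this notebook) and a deliberately tiny `max_iter` of 5; the names `X_demo`, `y_demo`, and `demo` are illustrative only.

```python
import warnings
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

# synthetic stand-in data (assumption: any small classification set would do)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")   # make sure no warning is suppressed
    demo = MLPClassifier(hidden_layer_sizes=(20,), max_iter=5, random_state=1)
    demo.fit(X_demo, y_demo)

# did training stop because it ran out of iterations?
hit_limit = any(issubclass(w.category, ConvergenceWarning) for w in caught)
print("iterations run:", demo.n_iter_)
print("ConvergenceWarning raised:", hit_limit)
```

If the warning shows up on real data, raising `max_iter` is the first thing to try.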



## Where to Start: The 'Rule of Thumb' Network

A good place to start a network design is with a network consisting of a **single hidden layer** with a **number of nodes** computed as,

$$
N = 2 \times V
$$

where $N$ is the number of nodes in the hidden layer and $V$ is the number of independent variables in
the training data, e.g. `V = X.shape[1]`.
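
The rule is simple enough to wrap in a small helper. This is only a sketch; the function name `rule_of_thumb_hidden_size` is hypothetical and not part of sklearn or `dsutils`.

```python
def rule_of_thumb_hidden_size(n_features: int) -> int:
    """Hidden layer size N = 2 * V for V independent variables."""
    if n_features < 1:
        raise ValueError("need at least one independent variable")
    return 2 * n_features

# e.g. a training set with 30 independent variables
print(rule_of_thumb_hidden_size(30))   # 60
```

With a feature matrix `X` at hand, the result can be passed on as `hidden_layer_sizes=(rule_of_thumb_hidden_size(X.shape[1]),)`.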



Let's try this using the Wisconsin dataset,

In [18]:
# get data
df = pd.read_csv(home+"wdbc.csv").drop(columns=['ID'])

X  = df.drop(columns=['Diagnosis'])
y = df['Diagnosis']
V = X.shape[1]
N = 2*V

In [19]:
# hidden layer size
N 

60

In [None]:

# a multi-layer perceptron with one hidden layer
model = neural_network\
   .MLPClassifier(
      hidden_layer_sizes=(N,),  # one hidden layer with N neurons
      activation='logistic',    # logistic activation function
      max_iter=10000, 
      random_state=1
      )\
   .fit(X, y)  



**Note**: The logistic activation function is a great "general purpose" activation function.

In [23]:
# evaluate the model
dsutils.acc_score(model, X, y, as_string=True)


'Accuracy: 0.95 (0.93, 0.96)'

**Observation**: The accuracy of this classifier is encouraging given that we constructed it **using just our rule of thumb**.  

## MLP Grid Search

Let's construct an optimal model and compare its performance to our 'rule of thumb' network.

In [7]:
# neural network object
model = neural_network.MLPClassifier(max_iter=10000, random_state=1)

# grid search:
#   * limit to 1 and 2 hidden layers
#   * vary number of neurons in each layer (N/2, N, or 2N)
#   * vary activation functions
param_grid = {
    # search over different architectures
    'hidden_layer_sizes':
      [
      # single layer MLP: size N/2, N, or 2N
      (N//2,), (N,), (N*2,),
      
      # 2 layers: first fixed at N/2, second varying
      (N//2,N//2), (N//2, N), (N//2, N*2),
      
      # 2 layers: first fixed at N, second varying
      (N, N//2), (N,N), (N, N*2),
      
      # 2 layers: first N*2, second varying
      (N*2, N//2), (N*2, N), (N*2, N*2)
      ],
    
    # search different activation functions
    'activation' : ['logistic',  'tanh', 'relu']  
}

# perform grid search
grid = model_selection.GridSearchCV(model, param_grid).fit(X, y)
best_params = grid.best_params_
best_model = grid.best_estimator_

print(f"Grid Search: best parameters: {best_params}")

Grid Search: best parameters: {'activation': 'logistic', 'hidden_layer_sizes': (120, 30)}
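
It helps to know how much work a grid like this implies before running it. The sketch below rebuilds the same architecture list with `itertools.product` and counts the model fits, assuming the `GridSearchCV` defaults of 5-fold cross-validation and a final refit of the best candidate (both are defaults in recent sklearn versions; adjust `n_folds` if you pass `cv` explicitly).

```python
from itertools import product

N = 60                          # rule-of-thumb hidden layer size
sizes = [N // 2, N, N * 2]      # 30, 60, 120

one_layer = [(s,) for s in sizes]              # 3 single-layer networks
two_layers = list(product(sizes, repeat=2))    # 9 two-layer networks
architectures = one_layer + two_layers
activations = ['logistic', 'tanh', 'relu']

n_candidates = len(architectures) * len(activations)
n_folds = 5                               # assumed: GridSearchCV default
n_fits = n_candidates * n_folds + 1       # +1 for the refit on all data
print(f"{n_candidates} candidates, {n_fits} model fits")
```

Every additional value in the grid multiplies this count, which is one reason the search above is limited to one and two hidden layers.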


In [8]:
# evaluate the best model
dsutils.acc_score(best_model, X, y, as_string=True)

'Accuracy: 0.96 (0.95, 0.98)'

**Observations**: The difference in performance between the optimal MLP and our 'rule of thumb' MLP is **not** statistically significant: the two confidence intervals, (0.93, 0.96) and (0.95, 0.98), overlap.

> We see this a lot with complex models where 'rule of thumb' models come very close to the optimal performance.  

>Therefore, practitioners often forgo the search for optimal models and use 'rule of thumb' models which they then tweak.
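
The significance check used here is the informal one of comparing confidence intervals: if they overlap, we do not call the difference significant. A minimal sketch of that check, with the interval bounds copied from the two outputs above and a hypothetical helper name `intervals_overlap`:

```python
def intervals_overlap(a, b):
    """True if closed intervals a=(lo, hi) and b=(lo, hi) share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

rule_of_thumb_ci = (0.93, 0.96)   # 'rule of thumb' MLP
grid_search_ci = (0.95, 0.98)     # best model from the grid search

print(intervals_overlap(rule_of_thumb_ci, grid_search_ci))   # True
```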

## Activation Functions Reviewed

Logistic:<br>
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/Logistic-curve.svg/1920px-Logistic-curve.svg.png"
width="300" height="200">

Hyperbolic-Tangent:<br>
<img src="https://media.datacamp.com/cms/google/ad_4nxccvibwobiyw2uym_egwqg8pf6zmnygb_mqy3khokh_3biaex50ultijygp7wg_13qdnddbbd2yevutty96pcimrj3hdihnpv-vsjopo4wyvfpzp92e8kj_i6q4ync0x-hvucvwnevb9i9nnrnyxbayh8ge.png"
width="300" height="200">

ReLU (Rectified Linear Unit):<br>
<img src="https://media.datacamp.com/cms/google/ad_4nxdm3mjfeqgnkibwih8jqb9p93eqd1zmoasqf17atrjzvc7vyafjt2d5lvglvfy9tbuy86dsd7uijmrak-nqpqmfniawbkirncuwlyspzojwo6ta6xrda1mcq-dkuizktyc6jk4peni3evumlow5lkpywah1.png"
width="300" height="200">
<br>
[More Details](https://en.wikipedia.org/wiki/Activation_function#Table_of_activation_functions)
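
For reference, the three functions plotted above can be written in a few lines of plain Python. This is a sketch using the standard textbook definitions, not the sklearn internals.

```python
import math

def logistic(x):
    """Logistic (sigmoid): squashes x into (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)              # avoids overflow for very negative x
    return z / (1.0 + z)

def tanh(x):
    """Hyperbolic tangent: squashes x into (-1, 1)."""
    return math.tanh(x)

def relu(x):
    """Rectified linear unit: 0 for negative x, x otherwise."""
    return max(0.0, x)

for x in (-2.0, 0.0, 2.0):
    print(f"x={x:+.1f}  logistic={logistic(x):.3f}  tanh={tanh(x):.3f}  relu={relu(x):.3f}")
```

Note the different output ranges: logistic and tanh saturate for large inputs, while ReLU grows without bound on the positive side.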

# Project #4


Please see BrightSpace.