In [5]:
###### Set Up #####
# verify our folder with the data and module assets is installed
# if it is installed make sure it is the latest
!test -e ds-assets && cd ds-assets && git pull && cd ..
# if it is not installed clone it 
!test ! -e ds-assets && git clone https://github.com/lutzhamel/ds-assets.git
# point to the folder with the assets
home = "ds-assets/assets/" 
import sys
sys.path.append(home)      # add home folder to module search path

Already up to date.


# ANN/MLP Code Examples

Let's build some MLPs.  A fundamental problem with MLP design is the sheer number of design possibilities for these models.  The MLP classifier in the sklearn package has 23 (!) tunable parameters.  The good news is that all of these parameters except for the architectural parameters have good default values. For the architectural parameters, a good starting point is an MLP with a single hidden layer where the number of nodes in the hidden layer is computed as follows,

$ \#\mbox{nodes} = 2 \times \#\mbox{vars}$

That is, the number of hidden nodes is twice the number of independent variables in the training data.  Let's try this using the breast cancer dataset,

In [6]:
# set up
import pandas as pd
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from confint import classification_confint

In [7]:
# get data
df = pd.read_csv(home+"wdbc.csv")
df = df.drop(['ID'],axis=1)


X  = df.drop(['Diagnosis'],axis=1)
y = df['Diagnosis']

print("Shape: {}".format(X.shape))

# set up training and test data
datasets = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=3)
train_X, test_X, train_y, test_y = datasets

Shape: (569, 30)


Looking at the shape of the data, we see that there are 30 independent variables. Applying our rule from above means that we should construct an MLP with a single hidden layer that contains 60 nodes.
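
The layer size does not have to be hard-coded; it can be derived from the data. The following is a minimal sketch. It assumes scikit-learn's bundled copy of the Wisconsin breast cancer data (`load_breast_cancer`) as a stand-in for `wdbc.csv`, and the helper name `rule_of_thumb_hidden_layers` is made up for illustration.

```python
from sklearn.datasets import load_breast_cancer

def rule_of_thumb_hidden_layers(n_vars):
    """Return a hidden_layer_sizes tuple: one hidden layer with 2 * n_vars nodes."""
    return (2 * n_vars,)

# assumption: sklearn's bundled breast cancer data stands in for wdbc.csv
X, y = load_breast_cancer(return_X_y=True)
n_vars = X.shape[1]                          # number of independent variables
hidden = rule_of_thumb_hidden_layers(n_vars)
print("variables: {}, hidden_layer_sizes: {}".format(n_vars, hidden))
```

The resulting tuple can be passed directly as the `hidden_layer_sizes` argument of `MLPClassifier`.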

In [8]:
# neural network
model = MLPClassifier(hidden_layer_sizes=(60,), random_state=1)

# train and test the model
model.fit(train_X, train_y)
predict_y = model.predict(test_X)
acc = accuracy_score(test_y, predict_y)
lb, ub = classification_confint(acc, test_X.shape[0])
print("Accuracy: {:3.2f} ({:3.2f}, {:3.2f})".format(acc, lb, ub))

Accuracy: 0.93 (0.88, 0.98)
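
The interval in parentheses comes from `classification_confint` in the course's `confint` module. One common way to obtain such an interval is the normal approximation to the binomial distribution. The sketch below assumes a 95% confidence level and is not necessarily how `confint` is implemented; the function name `approx_confint` is hypothetical.

```python
import math

def approx_confint(acc, n, z=1.96):
    """Normal-approximation confidence interval for an accuracy
    measured on n observations (z=1.96 corresponds to roughly 95%)."""
    half_width = z * math.sqrt(acc * (1.0 - acc) / n)
    return acc - half_width, acc + half_width

# 0.93 accuracy on a test set of 114 observations (20% of 569)
lb, ub = approx_confint(0.93, 114)
print("Accuracy: {:3.2f} ({:3.2f}, {:3.2f})".format(0.93, lb, ub))
```

The half-width shrinks with the square root of the number of observations, so an accuracy measured on a small test set comes with a wide interval.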


## MLP Grid Search

We can also perform a grid search to find the optimal network. However, beware that a grid search over all possible parameters of an MLP is almost impossible: there are too many possible combinations, and training MLPs is very slow.  To mitigate this, we concentrate on a couple of key parameters to search over.
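
To get a feel for the cost, we can count the models a grid search has to train before running it. The sketch below uses `ParameterGrid` from sklearn on the same grid as the cell that follows; the variable names `n_candidates` and `n_fits` are only for illustration.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {'hidden_layer_sizes': [(30,), (60,), (120,),
                                     (30, 30), (30, 60), (30, 120),
                                     (60, 30), (60, 60), (60, 120),
                                     (120, 30), (120, 60), (120, 120)],
              'activation': ['logistic', 'tanh', 'relu']}
cv = 3                                          # number of cross-validation folds
n_candidates = len(ParameterGrid(param_grid))   # 12 architectures x 3 activations
n_fits = n_candidates * cv                      # one fit per candidate per fold
print("candidates: {}, cross-validation fits: {}".format(n_candidates, n_fits))
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best candidate on all of the data passed to `fit`. Every extra parameter added to the grid multiplies the number of candidates.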

In [9]:
# set up
import pandas as pd
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV
from confint import classification_confint

# get data
df = pd.read_csv(home+"wdbc.csv")
df = df.drop(['ID'],axis=1)
X  = df.drop(['Diagnosis'],axis=1)
actual_y = df['Diagnosis']

# neural network
model = MLPClassifier(max_iter=10000, random_state=1)

# grid search
param_grid = {'hidden_layer_sizes': [ (30,), (60,), (120,),
                                      (30,30), (30, 60), (30, 120),
                                      (60, 30), (60,60), (60, 120),
                                      (120, 30), (120, 60), (120, 120)
                                    ],
              'activation' : ['logistic', 'tanh', 'relu']
             }
grid = GridSearchCV(model, param_grid, cv=3) # 3-fold cross-validation
grid.fit(X, actual_y)
print("Grid Search: best parameters: {}".format(grid.best_params_))

# evaluate the best model
best_model = grid.best_estimator_
predict_y = best_model.predict(X)
acc = accuracy_score(actual_y, predict_y)
lb,ub = classification_confint(acc,X.shape[0])
print("Accuracy: {:3.2f} ({:3.2f},{:3.2f})".format(acc,lb,ub))

# build the confusion matrix
labels = ['M', 'B']
cm = confusion_matrix(actual_y, predict_y, labels=labels)
cm_df = pd.DataFrame(cm, index=labels, columns=labels)
print("Confusion Matrix:\n{}".format(cm_df))

Grid Search: best parameters: {'activation': 'logistic', 'hidden_layer_sizes': (30, 30)}
Accuracy: 0.96 (0.94,0.97)
Confusion Matrix:
     M    B
M  200   12
B   13  344
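
Note that `best_model` above is refit on all of `X` and then scored on that same data, whereas the rule-of-thumb model was scored on a held-out test set. One possible variation, sketched here under assumptions, runs the search on a training split only and scores the winner on the test split. It uses scikit-learn's bundled `load_breast_cancer` data as a stand-in for `wdbc.csv` and a deliberately small grid so that it runs quickly; the results are not meant to match the numbers above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier

# assumption: sklearn's bundled breast cancer data stands in for wdbc.csv
X, y = load_breast_cancer(return_X_y=True)
train_X, test_X, train_y, test_y = train_test_split(
    X, y, train_size=0.8, test_size=0.2, random_state=3)

# deliberately small grid so that the sketch runs quickly
param_grid = {'hidden_layer_sizes': [(30,), (60,)],
              'activation': ['logistic', 'relu']}
model = MLPClassifier(max_iter=10000, random_state=1)
grid = GridSearchCV(model, param_grid, cv=3)   # 3-fold cross-validation
grid.fit(train_X, train_y)                     # the search only sees the training split

# score the refitted best model on data it has never seen
predict_y = grid.best_estimator_.predict(test_X)
acc = accuracy_score(test_y, predict_y)
print("Grid Search: best parameters: {}".format(grid.best_params_))
print("Held-out accuracy: {:3.2f}".format(acc))
```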


Notice that even though our first instinct is that the optimized MLP is much better than the straightforward MLP built with our rule of thumb, the difference in accuracy between these two models is not statistically significant because their confidence intervals overlap!
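
The overlap test itself is easy to express in code. The sketch below plugs in the two intervals reported above; the helper name `intervals_overlap` is hypothetical.

```python
def intervals_overlap(a, b):
    """Return True if the closed intervals a=(lb, ub) and b=(lb, ub) share a point."""
    return a[0] <= b[1] and b[0] <= a[1]

rule_of_thumb = (0.88, 0.98)   # interval reported for the 60-node MLP
grid_search = (0.94, 0.97)     # interval reported for the grid-search MLP
print("overlap: {}".format(intervals_overlap(rule_of_thumb, grid_search)))
```

Here the grid-search interval lies entirely inside the rule-of-thumb interval, so the check reports an overlap.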

# Team Exercise

In this exercise we use a data set to predict cervical cancer risk based
on social and behavioral characteristics.

Please see the file `template-ca-cervix.ipynb` on the CSC310 shared drive.

Do the following:

* Build a 1-hidden-layer MLP according to our rule of thumb and using the 'relu' activation function (train and test on full data set).
* Build the best 2-layer MLP using grid search over layer sizes and activation functions.  For the activation functions use 'logistic' and 'relu'. For more details see the [MLP documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html).
* Build the best decision tree for this data set using grid search.

Evaluation:
* Which one of the above models has the best accuracy?
* Are the differences in accuracy between the three models statistically significant?

For more details, please see BrightSpace Assignment #4.