# Aims of this tutorial
The aim of this tutorial is to illustrate how Perceptrons can be combined into Neural Networks to solve problems that are not linearly separable, such as XOR.  
We will look at the key differences between the two algorithms and also consider how network architecture and training parameters affect the outcome.

## Learning Objectives:
1. Understand the key differences between the Neural Network and Perceptron algorithms:
- Non-linear activation functions.
- Using Backpropagation to update (learn) the weights.
- Configuring an MLP with more than one output node when there are more than two different output labels (multi-class learning).
2. Understand how different nodes learn different aspects of the problem.

3. Consider the need for different network architectures and learning parameters for different problems.

### Overview:
<img src="ANN-2-Node.png" style="float:right">

As we have seen, Perceptrons are only capable of solving linearly separable problems.   
To overcome this limitation, we can connect Perceptrons together into a network.  
Each one becomes a Node in the network, and they are connected together into Layers. 

In a standard Artificial Neural Network (ANN) architecture there is one input layer, one output layer and one or more hidden layers.  
- The term *input layer* is a bit misleading: it doesn't actually do any computation, it is just the inputs to the network.
- The outputs of each hidden layer become the inputs to the next hidden layer, or to the final output layer. 
- Hidden nodes tend to learn different aspects of the problem space. Together they build more complex decision boundaries, so the network is able to solve more complex problems.
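
To make the idea of layers feeding into layers concrete, here is a minimal sketch of a forward pass written with numpy. The 2-3-1 layer sizes, the random weights and the helper names `sigmoid` and `forward` are assumptions chosen for illustration; they are not taken from sklearn or from the rest of this notebook.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Pass one input vector through every layer in turn.

    weights[i] has shape (nodes in layer i, nodes in layer i+1).
    Returns the list of activations, starting with the inputs themselves.
    """
    activations = [np.asarray(x, dtype=float)]
    for W, b in zip(weights, biases):
        # the outputs of the previous layer are the inputs to this one
        activations.append(sigmoid(activations[-1] @ W + b))
    return activations

# Hypothetical 2-3-1 network with random weights (illustration only)
rng = np.random.default_rng(0)
layer_sizes = [2, 3, 1]
weights = [rng.normal(size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

acts = forward([0.0, 1.0], weights, biases)
for i, a in enumerate(acts):
    print("layer", i, "has", a.shape[0], "values")
```

Notice that the 'input layer' entry in the list is just the data itself: no weights or activation function are applied to it.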

Note: 
- The number of nodes in the input layer must equal the number of inputs/features in the data. 
- One output node can discriminate between two classes (classification problems),  
  or predict a value for one continuous variable (regression problems).  
  If your data  has more than two classes (or variables to predict),  
  the number of output nodes must equal the number of classes/regression variables. 
- The number of hidden layers and nodes in the layers is arbitrary, and selecting this architecture is part of building an ANN.

### Neural Network Training Algorithm  
Similar to Perceptrons, ANNs are trained in two 'phases'. 
- The forward pass, where data is input into the network to produce an output. 
- The backward pass, where the error in the output is used to update the weights using Backpropagation and Gradient Descent.
  - Note that to pass the error back through a node we need the slope of its activation function at the value of the summed inputs going *in* to that node. For the sigmoid we can calculate this *sigmoid derivative* directly from the signal coming *out* of that node: output × (1 − output).
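
The sigmoid derivative has a convenient property: it can be worked out from a node's output alone. The sketch below checks this numerically. The function names and the example input values are assumptions made for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative_from_output(a):
    """Slope of the sigmoid, written in terms of the node's output a = sigmoid(z)."""
    return a * (1.0 - a)

z = np.array([-2.0, 0.0, 0.5, 3.0])   # example summed inputs going in to a node
a = sigmoid(z)                         # signals coming out of the node

# numerical slope by central finite differences, for comparison
h = 1e-6
numerical = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
analytic = sigmoid_derivative_from_output(a)

print("slopes from outputs:   ", analytic)
print("slopes measured on z:  ", numerical)
```

The two rows should agree to several decimal places, which is why backpropagation only needs to remember each node's output from the forward pass.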

<img src="ann-pseudocode.png" style="float:center">
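
As a rough companion to the pseudocode, here is a small numpy sketch of the two phases applied to XOR. It is a simplified illustration under stated assumptions (four hidden sigmoid nodes, squared error, whole-batch updates, a learning rate of 0.5 and a fixed random seed); it is not how sklearn's MLPClassifier is implemented.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(1)
W1 = rng.normal(size=(2, 4))   # input -> hidden weights
b1 = np.zeros(4)
W2 = rng.normal(size=(4, 1))   # hidden -> output weights
b2 = np.zeros(1)
learning_rate = 0.5

losses = []
for epoch in range(2000):
    # forward pass
    hidden = sigmoid(X @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)
    error = output - y
    losses.append(float(np.mean(error ** 2)))

    # backward pass: error signal times the sigmoid derivative, layer by layer
    delta_out = error * output * (1 - output)
    delta_hidden = (delta_out @ W2.T) * hidden * (1 - hidden)

    # gradient descent: move each weight against its error gradient
    W2 -= learning_rate * hidden.T @ delta_out
    b2 -= learning_rate * delta_out.sum(axis=0)
    W1 -= learning_rate * X.T @ delta_hidden
    b1 -= learning_rate * delta_hidden.sum(axis=0)

print("loss after first epoch:", losses[0])
print("loss after last epoch: ", losses[-1])
```

Changing the seed changes the starting weights, so (just as with the sklearn version used later) some runs will learn XOR more completely than others.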

## Part 1: Solving XOR
As an introduction to the ANN algorithm, and to give you an intuition for how different nodes and layers in the network learn different aspects of the problem space, we are going to look at how a small network can solve the XOR problem.

Running the code will train an ANN to solve the XOR problem and produce a visualisation to show how different nodes have learned different aspects of the problem (in this case different logical functions) to create a more complex decision boundary.

- You do not need to understand *how* the graphs/visualisations are produced.

- You should try and understand *what* the graphs/visualisations output means.
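
To give a feel for what 'different nodes learning different logical functions' can look like, here is a hand-wired sketch using step-function perceptrons. The weights are chosen by hand for illustration (one hidden node acts as OR, the other as NAND, and the output node as AND); a trained network may well find a different combination.

```python
import numpy as np

def step_perceptron(inputs, weights, bias):
    """Classic perceptron: output 1 if the weighted sum plus bias is positive."""
    return int(np.dot(inputs, weights) + bias > 0)

def xor_network(x1, x2):
    h_or = step_perceptron([x1, x2], [1, 1], -0.5)       # fires if x1 OR x2 is on
    h_nand = step_perceptron([x1, x2], [-1, -1], 1.5)    # fires unless both are on
    return step_perceptron([h_or, h_nand], [1, 1], -1.5) # fires only if both hidden nodes fire

results = [xor_network(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(results)
```

Each hidden node on its own only draws a straight line through the input space; it is the output node combining the two lines that produces the XOR behaviour.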

### Activity 1: Train an MLP with one hidden layer and see (through experimentation) how many nodes are needed to reliably solve XOR
- Run the next two cells below once to import the libraries and define the function that plots the decision surface.
- If the first cell reports an error trying to import VisualiseNN, make sure you have downloaded the file VisualiseNN.py and that it is in the same directory as this notebook.

In [None]:
# basics for manipulating and outputting arrays etc
import ipywidgets as widgets
from ipywidgets import interact, interact_manual
import matplotlib.pyplot as plt
import numpy as np
from random import random
%matplotlib inline

## MLP specific stuff
from sklearn.neural_network import MLPClassifier
import VisualiseNN as VisNN


# useful sklearn functions for preprocessing data and showing results
from sklearn.model_selection import train_test_split
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay

#the iris data
from sklearn.datasets import load_iris


In [None]:
def plotDecisionSurface(model,X,y):
    min1, max1 = X[:, 0].min() - 1, X[:, 0].max() + 1 #1st feature
    min2, max2 = X[:, 1].min() - 1, X[:, 1].max() + 1 #2nd feature
    x1_scale = np.arange(min1, max1, 0.1)
    x2_scale = np.arange(min2, max2, 0.1)
    x_grid, y_grid = np.meshgrid(x1_scale, x2_scale)
    # flatten each grid to a vector
    x_g, y_g = x_grid.flatten(), y_grid.flatten()
    x_g, y_g = x_g.reshape((len(x_g), 1)), y_g.reshape((len(y_g), 1))
    # stack to produce hi-res grid in form like dataset
    grid = np.hstack((x_g, y_g))

    # make predictions for the grid
    y_pred_2 = model.predict(grid)
    
    #predict the probability
    p_pred = model.predict_proba(grid)
    # keep just the probabilities for class 0
    p_pred = p_pred[:, 0]
    # reshape the results to match the grid
    pp_grid = p_pred.reshape(x_grid.shape)

    # plot the grid of x, y and z values as a surface
    levels=[0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]
    surface = plt.contourf(x_grid, y_grid, pp_grid, levels,cmap='Pastel1')
    plt.colorbar(surface)
    # create scatter plot for samples from each class
    for class_value in range(2):
        # get row indexes for samples with this class
        row_ix = np.where(y == class_value)
        # create scatter of these samples (one default colour per class)
        plt.scatter(X[row_ix, 0], X[row_ix, 1])
    # show the plot
    plt.show()


### Activity 1.1 Investigating repeatability
Now run the cell below. It will try to learn the XOR problem and show you a plot of how the error rate changes over *time*, measured in epochs.  
- One epoch means that all the training data is shown to the system once and the weights are updated.
- We know that *in theory* it should be able to learn XOR with 2 hidden nodes - **but is there a difference between theory and what happens in practice?**
- Each time you run the cell it starts the whole process afresh, so the error curve will be different and you might get different final accuracy scores.
- As there are only four cases, we do not have any test data for this problem - we are just looking at how reliably different sized networks can learn a simple problem.

You should:
1. Run the cell 10 times with 2 nodes in the hidden layer (the parameter in the MLP constructor set to *hidden_layer_sizes=(2,)*) and note how many times it ended up with no errors (training set accuracy = 100%).  
**Remember that jupyter will move on to the next cell, so you need to select the cell to re-run it.**

2. Now repeat, changing the constructor so that the size of the hidden layer is 4, 6, 8 and 10 nodes - and again note how many times out of 10 it successfully learned the problem.
3. Edit the second cell below, changing the values in xor_success to record your actual results, and run the cell to produce a **sensitivity analysis** - a plot showing how much your results depend on a network parameter: the number of hidden nodes.


In [None]:
# the four input cases form our training data
train_X = np.array( [[0,0],[0,1],[1,0],[1,1]])
# and here are the labels our network should learn for the XOR problem
xor_y = np.array([0,1,1,0])

train_y= xor_y

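# one hidden layer trained with Stochastic Gradient Descent (backprop): edit hidden_layer_sizes to change the number of neurons
# note: no activation is passed, so sklearn's default ('relu') is used; add activation='logistic' for sigmoid nodes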

xorMLP =  MLPClassifier(hidden_layer_sizes=(5,), max_iter=1000, alpha=1e-4,
                    solver='sgd', verbose=0, 
                    learning_rate_init=.1)


xorMLP.fit(train_X, train_y)
    
lossplot=plt.plot(xorMLP.loss_curve_) 
training_accuracy = 100* xorMLP.score(train_X, train_y)
print("Training set accuracy: " + str(training_accuracy) + "%")

In [None]:
# Edit the array xor_success to replace the 'dummy' values (1,6,4,8,10) with your results, i.e. the number of times your algorithm reached 100% accuracy on the training set

num_hidden_nodes = [2,4,6,8,10]
xor_success = [1,6,4,8,10]
plt.plot(num_hidden_nodes, xor_success)
plt.xlabel('number of hidden nodes')
plt.ylabel('successful runs out of 10')

### Activity 1.2: Visualising what the network is doing
After a successful run of the training cell above (i.e. one ending with training set accuracy 100%) run the cell below.
- The top plot shows the output of the final node for different inputs.  
  In this case we only have the four inputs marked by circles.
- The bottom plot shows a visualisation of the network structure and weights. 
  - Blue ones are *negative*, so will be suppressing the output of the node they lead to if there is a signal down that connection.
  - Red ones are *positive*, so will be stimulating the node they lead to if there is a signal present.

In [None]:
theMLP=xorMLP # change this line to reuse the code below for a different problem
num_output_nodes = 1 # and this one for multi-class problems

plotDecisionSurface(theMLP,train_X,train_y)


# layer sizes: number of inputs, then each hidden layer, then the number of output nodes
network_structure = np.hstack((train_X.shape[1], np.asarray(theMLP.hidden_layer_sizes), num_output_nodes))
# Draw the Neural Network with weights
network=VisNN.DrawNN(network_structure, theMLP.coefs_)
network.draw()

## Part 2: Using MLP for multiclass problems:  iris data
<img src="cascading.png" style="float:right">

So far we have used multilayer perceptrons for learning binary (two-class) problems.  
Last week you should have discussed how you could solve a multi-class problem,  
by 'cascading' binary classifiers. 
This is shown in the image for a three class problem.  
Here the diamonds represent classifiers, each doing a "this class or not" decision.


In this part we will introduce a different idea, which is to use a parallel classifier using softmax and one-hot encoding.

Not only is this simpler to manage, it has the benefit that the classifiers can all share the feature creation done in previous layers.
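
As a quick illustration of the softmax idea, the sketch below turns three raw output signals into values that sum to one and then picks the largest. The function name `softmax` and the example signals are assumptions for this illustration.

```python
import numpy as np

def softmax(z):
    """Turn a vector of raw output signals into positive values that sum to 1."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max()          # subtracting the maximum avoids overflow in exp
    exps = np.exp(shifted)
    return exps / exps.sum()

raw_outputs = np.array([2.0, 1.0, 0.1])   # hypothetical signals from three output nodes
probs = softmax(raw_outputs)
predicted_class = int(np.argmax(probs))   # 'winner takes all'

print("softmax outputs:", probs)
print("predicted class:", predicted_class)
```

After fitting an MLPClassifier you can check which output activation sklearn actually used by printing the model's `out_activation_` attribute.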






In [None]:
# load the data


irisX,irisy = load_iris(return_X_y = True)
feature_names = ['sepal_length','sepal_width','petal_length','petal_width']  # order of the columns returned by load_iris
irisLabels = np.array(('setosa','versicolor','virginica'))
# show what the labels look like
print(irisy)


### Transforming our label data to a format for training a MLP with three output nodes
As you can see when you run the cell above, the labels are stored in a 1-D array with values of 0, 1, or 2.  
This is fine for models like nearest neighbours, rule sets or decision trees.
However, (crudely speaking) the output from a neuron tends to be *off* (0) or *on* (1).  
So if we want our network to make a choice of three predictions, then we need a node for each class.

So there are two changes we make:
1. We configure the network to have three output nodes and use 'softmax' ('winner-takes-all') activation,  
    i.e. each node outputs a value, and we take as our final output the class whose node has the highest output signal.
2. We convert our labels to tell the network what *each of the nodes* should ideally output for each training example.  
   In other words, if the label is 0 then the output should be [1,0,0],  
   if the label is 1 it should be [0,1,0], and if it is 2 the output should be [0,0,1].

Sklearn comes with a class, sklearn.preprocessing.OneHotEncoder, to do this,   
but the cell below does it explicitly to illustrate what is going on. 

I've made it generic so that you can easily reuse it for different datasets.

In [None]:
numcases = len(irisy)
print('there are ' +str(numcases) +' training examples')
thelabels = np.unique(irisy)
numlabels = len(thelabels)
print( 'there are ' + str(numlabels) + ' labels: ' + str(thelabels))

# make a 2d array with numcases rows. and numlabels columns
irisy_onehot = np.zeros((numcases,numlabels))


# Now loop through the rows of the new array setting the appropriate column value to 1
for row in range(numcases):
    label = irisy[row]
    irisy_onehot[row][label]= 1

print('This is what  rows 45-55 look like')
print(irisy_onehot[44:55,:])
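
The explicit loop above is the clearest way to see what one-hot encoding does. For comparison, here is a compact numpy sketch of the same idea, together with the reverse step of turning one-hot rows back into labels. The small `labels` array is a made-up example, not the iris data.

```python
import numpy as np

labels = np.array([0, 2, 1, 2, 0])        # hypothetical labels for five cases
classes = np.unique(labels)                # the distinct labels, in sorted order

# row i of the identity matrix is the one-hot code for the i-th class,
# so indexing the identity matrix by class position encodes every case at once
positions = np.searchsorted(classes, labels)
onehot = np.eye(len(classes))[positions]

# decoding: the column holding the 1 tells us which class it was
decoded = classes[np.argmax(onehot, axis=1)]

print(onehot)
print(decoded)
```

The decoding step is the same `argmax` trick used when reading off the 'winning' output node of the network.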

### Splitting our data into a training and a test set
As you can see from the output of the cells above, the iris data groups all the classes together, i.e. rows 0-49 are 'iris-setosa', rows 50-99 are 'iris-versicolor', and rows 100-149 are 'iris-virginica'.

So if we want to train our network  and then estimate how well it will do on new data, we need to split this into a training and test set.  
Again, we could do this manually:
- first shuffling the rows so that we got a mixture of classes, 
- then taking the first part of the data for training and the second for testing.
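
A minimal sketch of that manual approach might look like the following. The tiny dataset, the two-thirds split and the fixed random seed are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(42)

# a tiny made-up dataset: 15 cases, 2 features, grouped by class like the iris data
X = np.arange(30).reshape(15, 2)
y = np.repeat([0, 1, 2], 5)

# shuffle the row order, applying the same order to features and labels
order = rng.permutation(len(X))
X_shuffled, y_shuffled = X[order], y[order]

# take the first two thirds for training and the rest for testing
num_train = int(round(len(X) * 2 / 3))
train_X, test_X = X_shuffled[:num_train], X_shuffled[num_train:]
train_y, test_y = y_shuffled[:num_train], y_shuffled[num_train:]

print("training labels:", train_y)
print("test labels:    ", test_y)
```

Nothing in this simple version guarantees that each class is equally represented in the two parts, which is one of the things that makes the manual approach awkward.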

If the data are not so well organised, or the numbers of examples of different classes are not roughly equal, then that code gets trickier.  
So the cell below shows how to do this using a method from sklearn.   
The parameters are, in order:
- the feature values (irisX)
- the onehot-encoded set of labels (irisy_onehot)
- what proportion of our data we hold back from training, so we can use it for testing. We'll use 1/3rd (test_size=0.33)
- the array holding the labels that we want to be evenly represented in both our training and test sets (stratify=irisy_onehot)

This function returns four different arrays - train and test, X and y.  
Note that this function also works if your data is not one-hot encoded - it figures that out for itself.

In [None]:

iris_train_X, iris_test_X, iris_train_y, iris_test_y = train_test_split(irisX,irisy_onehot, test_size=0.33, stratify=irisy_onehot )


### Activity 2.1 Training a MLP to learn the iris classification problem
1. Start by using the settings for the MLPClassifier that we had before (including solver='sgd'; note that the cell below is currently set to 'adam') and just change the size of the hidden layer to five or ten.
- You will probably see that the training stops making improvements before the problem has been fully learned.
- This is an example of backpropagation getting 'stuck' in a **local optimum** (we'll talk about these more next week).
- It happens because the basic 'stochastic gradient descent' algorithm *'sgd'* is a local search method with only crude methods for getting out of 'traps'. 
- Try changing the solver to 'adam' and see if this gives better performance.

**Remember** to run a few times with each setting - this is a randomised algorithm and the random set of initial weights makes a huge difference.  

**Question**: what do you understand by *better*?

2. Now try adding a second hidden layer - for example by changing that parameter in the constructor to *hidden_layer_sizes=(3,3)*.  
- Experiment to see if it is better to have one hidden layer of 10 nodes or 2 layers of 5 nodes.

In [None]:
# create an MLP object-  you will want to change the number of hidden nodes
irisMLP =  MLPClassifier(hidden_layer_sizes=(5,), max_iter=1000, alpha=1e-4,
                    solver='adam', verbose=0, 
                    learning_rate_init=.1)






irisMLP.fit(iris_train_X, iris_train_y)
print('number of output nodes = ' +str(irisMLP.n_outputs_))
    
lossplot=plt.plot(irisMLP.loss_curve_)    

# report how well it does on the training set
training_accuracy = 100* irisMLP.score(iris_train_X, iris_train_y)
print("Training set accuracy: " + str(training_accuracy) + "%")


# now how good is our network at predicting data it has never seen before
test_accuracy = 100* irisMLP.score(iris_test_X, iris_test_y)
print("Estimated (Test set) accuracy: " + str(test_accuracy) + "%")

# this bit of code prints a simple confusion matrix showing how the predicted labels correspond to the 'real' ones
predictions=irisMLP.predict(iris_test_X)
confusion = np.zeros((3,3))
for row in range (predictions.shape[0]):
    actual = np.argmax(iris_test_y[row])
    predicted = np.argmax(predictions[row])
    confusion [actual] [predicted] += 1

print( '\nPredicted->   Setosa  Versicolor  Virginica')
print( 'Actual \\|/ ')
for i in range(3):
    print( '{:<10}       {:2.0f}       {:2.0f}       {:2.0f}'.format(irisLabels[i], confusion[i][0], confusion[i][1],confusion[i][2]))
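
Once you have a confusion matrix like the one printed above, a couple of numpy lines will summarise it. The sketch below uses a made-up matrix (placeholder numbers, not real results from this notebook) with rows as actual classes and columns as predicted classes.

```python
import numpy as np

# placeholder confusion matrix for illustration only (rows = actual, columns = predicted)
example_confusion = np.array([[17, 0, 0],
                              [0, 15, 2],
                              [0, 1, 15]], dtype=float)

# correct predictions sit on the diagonal
overall_accuracy = np.trace(example_confusion) / example_confusion.sum()

# for each actual class, the fraction of its examples that were predicted correctly
per_class_correct = np.diag(example_confusion) / example_confusion.sum(axis=1)

print("overall accuracy:", overall_accuracy)
print("fraction correct per class:", per_class_correct)
```

The per-class figures are often more informative than the overall accuracy, because they show *which* classes the network confuses with each other.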


### Activity 2.2 Discussion
Try to come up with answers to these questions. (these are the sorts of things you might be asked in an exam)

1. Why is the test accuracy sometimes much lower than the training accuracy?

2. Why is it sometimes less reliable to train a network with multiple hidden layers when learning the iris data?  
Hint: how many connections are you trying to learn?  how much data have you got?
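
To help with the hint, here is a small sketch that counts the connections (weights plus one bias per node) in a fully connected network. The helper name `count_parameters` is made up for this illustration; the layer sizes assume the four iris features as inputs and three output nodes.

```python
def count_parameters(layer_sizes):
    """Number of weights and biases in a fully connected network.

    layer_sizes lists the nodes per layer, e.g. [4, 10, 3] means
    4 inputs, one hidden layer of 10 nodes, and 3 output nodes.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out   # one weight per connection between the two layers
        total += n_out          # one bias per node in the receiving layer
    return total

for sizes in ([4, 5, 3], [4, 10, 3], [4, 3, 3, 3], [4, 5, 5, 3]):
    print(sizes, "->", count_parameters(sizes), "values to learn")
```

Compare these counts with the number of training examples left after the train/test split when you think about your answer.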

### Activity 2.3 (stretch): Does it help if you normalise the data like we did in week 5?
In Activity 2.3 of the unsupervised learning tutorial (workbook5) we used a MinMax scaler so that each feature was transformed to the range (0,1).  
Reusing snippets of code from that workbook, try adding a few lines to the cell at the start of this section (Part 2), so that scaling gets applied to irisX before you make the call to train_test_split().
- Does this improve learning?
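
If you do not have workbook5 to hand, the sketch below shows the arithmetic behind min-max scaling written directly in numpy. It is a hand-written illustration of the idea rather than the code from that workbook, and the function name and example values are made up.

```python
import numpy as np

def minmax_scale_columns(data):
    """Rescale every column (feature) so its smallest value is 0 and its largest is 1."""
    data = np.asarray(data, dtype=float)
    col_min = data.min(axis=0)
    col_range = data.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0      # avoid dividing by zero for constant columns
    return (data - col_min) / col_range

# three made-up cases with two features on very different scales
example = np.array([[5.1, 0.2],
                    [7.0, 1.4],
                    [6.3, 2.5]])
scaled = minmax_scale_columns(example)
print(scaled)
```

After scaling, no single feature dominates the weighted sums just because it happens to be measured in bigger numbers.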

## Part 3: Learning to recognise hand-written digits:  MNIST





### Activity 3.1: Loading and visualising the data
The next cell downloads and saves the data locally.
- If I can sort out shared storage on the Jupyterhub server, I will create an option to use that to save time.
 

In [None]:

# example code to run on the server where i will put a version of the data locally

In [None]:
#example code from https://scikit-learn.org/stable/auto_examples/neural_networks/plot_mnist_filters.html#sphx-glr-auto-examples-neural-networks-plot-mnist-filters-py

# the data to download is about 33Mb 
# so I've put this code in its own cell so you can just do it once.
from sklearn.datasets import fetch_openml

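# Load data from https://www.openml.org/d/554
# as_frame=False asks for numpy arrays (newer sklearn versions may otherwise return pandas objects),
# which the indexing in the cells below relies on
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, cache=True, data_home="data", as_frame=False)

# rescale the pixel values from 0-255 to the range 0-1
X = X / 255.

# use the traditional train/test split: first 60,000 images for training, the rest for testing
X_train, X_test = X[:60000], X[60000:]
y_train, y_test = y[:60000], y[60000:]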

print('data loaded and saved locally')

This cell  shows us some example images

In [None]:
# display the first five images of each class found among the first 200 test images
print('The test data has {} images, each described by {} features (pixel values)'.format(X_test.shape[0], X_test.shape[1]))


plt.figure(figsize=(10, 10))
for label in range(10):
    imagesForLabel= np.empty((0,784))
    
    for possible in range (200):
        if (int(y_test[possible])==int(label)):
            imagesForLabel = np.vstack((imagesForLabel, X_test[possible]))
    for col in range(5):
        exampleplot = plt.subplot(10, 5, (label*5 +col+1) )
        exampleplot.imshow(imagesForLabel[col].reshape(28, 28), 
                   cmap=plt.cm.gray)
plt.show()

### Activity 3.2 : Visualising what features the hidden layers learn to respond to 
We will now configure a multilayer perceptron and train it with all 60,000 images from the standard MNIST training set.

The idea for you to learn here is that each hidden node is effectively acting as a feature detector: 
 - Consider just one hidden layer node: 
   - and a simple pattern where the weights from pixels in the top-left and bottom-right quadrants are all +1, 
   - and the weights from pixels in the top-right and bottom-left quadrants are all -1.

- Now consider an input image that has some constant value for every pixel (feature) - i.e. is all the same colour. 
  - When these inputs to the node are multiplied by their weights and summed, they will cancel each other.
  - So the sum will be zero and the output will be sigmoid(0) = 0.5.

- Next consider the image of a simple 'chequer' pattern with white (255) in the top-left and bottom-right quadrants,  
  and black (0) in the other two.
  - In this case the pattern of pixel intensities (features) in the image matches the pattern in the weights.
  - So the weighted sum will be at its maximum, and the node's output will be very close to +1.
So we can consider each hidden node as a 'feature detector' that responds to how well the input image matches a particular pattern.
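
The thought experiment above can be checked with a few lines of numpy. This sketch assumes a single sigmoid node with no bias and pixel values already rescaled to the range 0 to 1 (as in the data-loading cell); the variable names are made up for the illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

size, half = 28, 14

# weights: +1 in the top-left and bottom-right quadrants, -1 in the other two
weights = -np.ones((size, size))
weights[:half, :half] = 1.0
weights[half:, half:] = 1.0

# an image that is the same mid-grey everywhere
uniform_image = np.full((size, size), 0.5)

# a chequer image: white (1.0) where the weights are positive, black (0.0) elsewhere
chequer_image = np.zeros((size, size))
chequer_image[:half, :half] = 1.0
chequer_image[half:, half:] = 1.0

out_uniform = sigmoid(np.sum(weights * uniform_image))
out_chequer = sigmoid(np.sum(weights * chequer_image))

print("output for the uniform image:", out_uniform)
print("output for the chequer image:", out_chequer)
```

Try swapping the white and black quadrants of the chequer image: the node's output should then drop towards 0, because the image is now the opposite of the pattern the node is 'looking for'.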

The next set of cells:
- Set up and train the network with 16 nodes (so we can visualise it). 
- Then output the pattern  weights from each of the nodes as an image.

<div class="alert alert-block alert-info"> In year 2, the Machine Learning module will explain how this concept of feature detectors has been extended in Deep Convolutional Networks. <br>
In these, features (called 'filters') can be smaller than the image, and a process of Convolution (rather than straightforward multiplying) lets them detect small local features anywhere in the image.<br>  Convolutional Neural Networks have completely revolutionised the field of image processing and AI for visual tasks.</div>


In [None]:
# Set up and train network
import warnings
from sklearn.exceptions import ConvergenceWarning
mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=25, alpha=1e-4,
                    solver='sgd', verbose=1, random_state=10,
                    learning_rate_init=.1)

# this example won't converge in only 25 iterations (kept low to save time),
# so we catch the warning and ignore it here
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=ConvergenceWarning,
                            module="sklearn")
    mlp.fit(X_train, y_train)

print("Training set score: %f" % mlp.score(X_train, y_train))
print("Test set score: %f" % mlp.score(X_test, y_test))

In [None]:
# get the weights from the input nodes to the first hidden layer
coef = mlp.coefs_.copy()[0].T

# use the same symmetric colour scale for every node, so white is zero and the colours match the title
plt.figure(figsize=(10, 10))
scale = np.abs(coef).max()
for i in range(16):
    l1_plot = plt.subplot(4, 4, i + 1)
    # RdBu_r maps negative values to blue and positive values to red
    l1_plot.imshow(coef[i].reshape(28, 28), 
                   cmap=plt.cm.RdBu_r, vmin=-scale, vmax=scale)
    l1_plot.set_xticks(())
    l1_plot.set_yticks(())
    #l1_plot.set_xlabel('Hidden Node %i' % i)
title = 'Learned weights from pixels to each hidden node.\n'
title = title + 'Blue indicates negative weights: signals from these pixels suppress the node.\n'
title = title + 'Red indicates positive weights: signals from these pixels stimulate the hidden node.'

_=plt.suptitle(title)

### Activity 3.3: Discussion
Iris is a simple problem with only four features and three classes.
MNIST is a much more complicated problem with 784 features and ten classes - some of which (e.g. fours and sevens) can be drawn in completely different ways.

So how come the accuracy is roughly the same on these two problems?

Can you predict what the effect on training and test accuracy might be if we trained with less data?

### Activity 3.4: examining the effect of having less data
- Run the cell below a few times, and make sure you can explain the patterns of changing training and test scores you see.

### (Stretch) Activity 3.5
- Run the cell n times, saving the training and test accuracy from each run.
- Capture the data and display it as two different lines in the same plot, with error bars for each.  
  HINT: google is good for finding code snippets to make plots with.

In [None]:
for trSetSize in (100,600,1000,6000,10000,50000):
    # an integer test_size asks for exactly that many examples; we keep that part as our smaller training set
    _,X_train_small,_,y_train_small = train_test_split(X_train,y_train, test_size=trSetSize,stratify=y_train)
    smallMnistMLP = MLPClassifier(hidden_layer_sizes=(16,), max_iter=25, alpha=1e-4,
                    solver='sgd', verbose=0, random_state=10,
                    learning_rate_init=.1)

    # put a loop of n runs here
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", category=ConvergenceWarning,
                            module="sklearn")
        smallMnistMLP.fit(X_train_small, y_train_small)
    print('With a training set of {} examples'.format(trSetSize))
    print("    Training set score: %f" % smallMnistMLP.score(X_train_small, y_train_small))
    print("    Test set score: %f" % smallMnistMLP.score(X_test, y_test))
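
For the stretch activity, one possible way to organise the results is a 2-D array with one row per run and one column per training-set size. The sketch below uses placeholder scores (made-up numbers, not real results) just to show how the mean and spread across runs could be calculated.

```python
import numpy as np

train_sizes = [100, 600, 1000]

# placeholder scores for illustration only: rows = runs, columns = training-set sizes
placeholder_test_scores = np.array([[0.60, 0.80, 0.85],
                                    [0.64, 0.82, 0.87],
                                    [0.62, 0.78, 0.86]])

# average over the runs (axis 0) to get one value per training-set size
mean_scores = placeholder_test_scores.mean(axis=0)
# the standard deviation across runs gives the size of the error bars
std_scores = placeholder_test_scores.std(axis=0)

for size, mean, std in zip(train_sizes, mean_scores, std_scores):
    print("{:>6} examples: mean {:.3f}, std {:.3f}".format(size, mean, std))
```

Arrays like these could then be passed to matplotlib, for example `plt.errorbar(train_sizes, mean_scores, yerr=std_scores)`, once for the training scores and once for the test scores.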

<div class="alert alert-block alert-danger"> Please save your work (click the save icon) then shut down the notebook when you have finished with this tutorial (menu->file->close and shutdown notebook).</div>

<div class="alert alert-block alert-danger"> Remember to download and save your work if you are not running this notebook locally.</div>