# Aims of this tutorial
The aim of this tutorial is to illustrate how Perceptrons can be combined into Neural Networks to solve problems that are not linearly separable, such as XOR.  
We will look at the key differences between the two algorithms and also consider how network architecture and training parameters affect the outcome.

## Learning Objectives:
1. Understand the key differences between the Neural Network and Perceptron algorithms:
- Non-linear activation functions.
- Using Backpropagation to update (learn) the weights.
- Configuring an MLP with more than one output node when there are more than two different output labels (multi-class learning).
2. Understand how different nodes learn different aspects of the problem.

3. Consider the need for different network architectures and learning parameters for different problems.

### Overview:
<img src="ANN-2-Node.png" style="float:right">

As we have seen, Perceptrons are only capable of solving linearly separable problems.   
To overcome this limitation, we can connect Perceptrons together into a network.  
Each one becomes a Node in the network, and they are connected together into Layers. 

In a standard Artificial Neural Network (ANN) architecture there is one input layer, one output layer, and one or more hidden layers.  
- The term 'input layer' is a bit misleading: it doesn't actually do any computation, it just holds the inputs to the network.
- The outputs of each hidden layer become the inputs to the next hidden layer, or to the final output layer. 
- Hidden nodes tend to learn different aspects of the problem space, building more complex decision boundaries, so the network is able to solve more complex problems.
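
To make the idea of combining Perceptrons concrete, here is a minimal hand-wired sketch (not a trained network). The weights and biases are chosen by hand for illustration, and the function names `perceptron` and `xor_network` are hypothetical. Each hidden node computes a linearly separable function (OR and NAND), and the output node combines them with AND to give XOR.

```python
import numpy as np

def perceptron(inputs, weights, bias):
    """Classic perceptron: weighted sum of the inputs followed by a step function."""
    return int(np.dot(inputs, weights) + bias > 0)

def xor_network(x1, x2):
    # hidden layer: two perceptrons, each solving a linearly separable problem
    h_or = perceptron([x1, x2], [1, 1], -0.5)     # fires if at least one input is 1 (OR)
    h_nand = perceptron([x1, x2], [-1, -1], 1.5)  # fires unless both inputs are 1 (NAND)
    # output node: fires only if both hidden nodes fire (AND)
    return perceptron([h_or, h_nand], [1, 1], -1.5)

xor_outputs = [xor_network(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(xor_outputs)  # → [0, 1, 1, 0]
```

No single perceptron can produce this output, but two hidden nodes feeding one output node can.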

Note: 
- The number of nodes in the input layer must equal the number of inputs/features in the data. 
- For multi-class problems, the number of output nodes must equal the number of labels/classes in the data (a two-class problem such as XOR needs only a single output node). 
- The number of hidden layers and nodes in the layers is arbitrary, and selecting this architecture is part of building an ANN.

### Neural Network Training Algorithm  
Similar to Perceptrons, ANNs are trained in two 'phases'. 
- The forward pass, where data is input into the network to produce an output. 
- The backward pass, where the error in the output is used to update the weights using Backpropagation and Gradient Descent.
  - Note that to pass the error back through a node to the sum of inputs going *in* to it, we multiply by the *sigmoid derivative*, which can be calculated directly from the signal coming *out* of that node: out × (1 − out).
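
The two phases can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the code sklearn uses: it assumes a squared-error loss, full-batch updates, four hidden nodes, a learning rate of 0.5 and a fixed random seed, and all variable names are made up for this example. Note how the sigmoid derivative is computed from each node's output.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

n_hidden = 4                      # assumed size, chosen for illustration
W1 = rng.normal(size=(2, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(size=(n_hidden, 1))
b2 = np.zeros(1)
learning_rate = 0.5               # assumed value
losses = []

for epoch in range(5000):
    # forward pass: inputs -> hidden layer -> output layer
    hidden = sigmoid(X @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)
    losses.append(float(np.mean((output - y) ** 2)))

    # backward pass: error times the sigmoid derivative, out * (1 - out)
    delta_out = (output - y) * output * (1 - output)
    delta_hidden = (delta_out @ W2.T) * hidden * (1 - hidden)

    # gradient descent step on every weight and bias
    W2 -= learning_rate * hidden.T @ delta_out
    b2 -= learning_rate * delta_out.sum(axis=0)
    W1 -= learning_rate * X.T @ delta_hidden
    b1 -= learning_rate * delta_hidden.sum(axis=0)

print("first loss:", round(losses[0], 4), "last loss:", round(losses[-1], 4))
```

Plotting `losses` gives the same kind of curve as the `loss_curve_` plots later in this tutorial.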

<img src="ann-pseudocode.png" style="float:center">

## Part 1: Solving XOR
As an introduction to the ANN algorithm, and to give you an intuition for how different nodes and layers in the network learn different aspects of the problem space, we are going to look at how a small network can solve the XOR problem.

Running the code will train an ANN to solve the XOR problem and produce a visualisation showing how different nodes have learned different aspects of the problem (in this case different logical functions) to create a more complex decision boundary.

You do not need to understand how the graphs/visualisations are produced.

You should try to understand what the output of the graphs/visualisations means.

### Activity 1: Train an MLP with one hidden layer and see (through experimentation) how many nodes are needed to reliably solve XOR
- Run the next two cells below once to import the libraries and define the function that plots the decision surface.
- If the first cell reports an error trying to import VisualiseNN, make sure you have downloaded the file VisualiseNN.py and that it is in the same directory as this notebook.

In [None]:
# basics for manipulating and outputting arrays etc
import ipywidgets as widgets
from ipywidgets import interact, interact_manual
import matplotlib.pyplot as plt
import numpy as np
from random import random
%matplotlib inline

## MLP specific stuff
from sklearn.neural_network import MLPClassifier
import VisualiseNN as VisNN


# useful sklearn functions for preprocessing data and showing results
from sklearn.model_selection import train_test_split 
from sklearn.metrics import ConfusionMatrixDisplay  # replaces plot_confusion_matrix, removed in scikit-learn 1.2

#the iris data
from sklearn.datasets import load_iris


In [None]:
def plotDecisionSurface(model,X,y):
    min1, max1 = X[:, 0].min() - 1, X[:, 0].max() + 1 #1st feature
    min2, max2 = X[:, 1].min() - 1, X[:, 1].max() + 1 #2nd feature
    x1_scale = np.arange(min1, max1, 0.1)
    x2_scale = np.arange(min2, max2, 0.1)
    x_grid, y_grid = np.meshgrid(x1_scale, x2_scale)
    # flatten each grid to a vector
    x_g, y_g = x_grid.flatten(), y_grid.flatten()
    x_g, y_g = x_g.reshape((len(x_g), 1)), y_g.reshape((len(y_g), 1))
    # stack to produce hi-res grid in form like dataset
    grid = np.hstack((x_g, y_g))

    # make predictions for the grid
    y_pred_2 = model.predict(grid)
    
    #predict the probability
    p_pred = model.predict_proba(grid)
    # keep just the probabilities for class 0
    p_pred = p_pred[:, 0]
    # reshape the results to match the grid
    pp_grid = p_pred.reshape(x_grid.shape)

    # plot the grid of x, y and z values as a surface
    levels=[0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]
    surface = plt.contourf(x_grid, y_grid, pp_grid, levels,cmap='Pastel1')
    plt.colorbar(surface)
    # create scatter plot for samples from each class
    for class_value in range(2):
        # get row indexes for samples with this class
        row_ix = np.where(y == class_value)
        # create scatter of these samples
        plt.scatter(X[row_ix, 0], X[row_ix, 1])
    # show the plot
    plt.show()
    
    


### Activity 1.1 Investigating repeatability
Now run the cell below - it will try and learn the XOR problem and show you a plot of how the error rate changes over *time* measured in epochs.  
- One epoch means that all the training data is shown to the system once and the weights are updated.
- We know that *in theory* it should be able to learn XOR with 2 hidden nodes, **but is there a difference between theory and what happens in practice?**
- Each time you run the cell it starts the whole process afresh, so the error curve will be different and you might get different final accuracy scores.
- As there are only four cases, we do not have any test data for this problem: we are just looking at how reliably different sized networks can learn a simple problem.

You should:
1. Run the cell 10 times with 2 nodes in the hidden layer (the parameter in the MLP constructor set to *hidden_layer_sizes=(2,)*) and note how many times it ended up with no errors (training set accuracy = 100%).  
**Remember to click on the cell then press run or shift-return to run it.**

2. Now repeat, changing the constructor so that the hidden layer has 4, 6, 8, then 10 nodes, and again note how many times out of 10 it successfully learned the problem.
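
If you would like to check your manual tallies, the experiment can also be automated. The sketch below is one possible way to do it, with constructor settings modelled on the training cell in this activity. The helper name `count_successes` and the variable names are hypothetical, and because training is stochastic the counts will differ from run to run.

```python
import warnings
import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

xor_X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
xor_labels = np.array([0, 1, 1, 0])

def count_successes(n_hidden, repeats=10):
    """Train a fresh network `repeats` times; count runs with 100% training accuracy."""
    successes = 0
    for _ in range(repeats):
        mlp = MLPClassifier(hidden_layer_sizes=(n_hidden,), max_iter=1000, alpha=1e-4,
                            solver='sgd', learning_rate_init=0.1)
        with warnings.catch_warnings():
            # runs that hit max_iter raise a ConvergenceWarning; silence it here
            warnings.simplefilter("ignore", category=ConvergenceWarning)
            mlp.fit(xor_X, xor_labels)
        if mlp.score(xor_X, xor_labels) == 1.0:
            successes += 1
    return successes

results = {n: count_successes(n) for n in (2, 4, 6, 8, 10)}
for n, wins in results.items():
    print(f"{n} hidden nodes: {wins}/10 runs reached 100% training accuracy")
```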


In [None]:
# the four input cases form our training data
train_X = np.array( [[0,0],[0,1],[1,0],[1,1]])
# and here are the labels our network should learn for the XOR problem
xor_y = np.array([0,1,1,0])

train_y= xor_y

# one hidden layer of 2 neurons, trained with Stochastic Gradient Descent (backprop)
# note: MLPClassifier defaults to activation='relu'; add activation='logistic' to use sigmoid nodes

xorMLP =  MLPClassifier(hidden_layer_sizes=(2,), max_iter=1000, alpha=1e-4,
                    solver='sgd', verbose=0, 
                    learning_rate_init=.1)


xorMLP.fit(train_X, train_y)
    
lossplot=plt.plot(xorMLP.loss_curve_) 
training_accuracy = 100* xorMLP.score(train_X, train_y)
print("Training set accuracy: " + str(training_accuracy) + "%")

### Activity 1.2: Visualising what the network is doing
After a successful run of the cell above (i.e. one ending with training set accuracy 100%) run the cell below.
- The top plot shows the output of the final node for different inputs.  
  In this case we only have the four inputs marked by circles.
- The bottom plot shows a visualisation of the network structure and weights:
  - blue ones are *negative*, so will be suppressing the output of the node they lead to if there is a signal down that connection
  - red ones are *positive*, so will be trying to turn on the node they lead to if there is a signal present
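
To see numerically what each hidden node contributes, you can redo the forward pass by hand from the learned weights. The sketch below trains its own small network so that it is self-contained. It assumes logistic (sigmoid) activation, four hidden nodes and a fixed random seed, and the names `demo_mlp`, `hidden_out` and `final_out` are made up for this example. It uses the fitted model's `coefs_` and `intercepts_` attributes.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# assumed settings: sigmoid nodes so the manual forward pass below matches
demo_mlp = MLPClassifier(hidden_layer_sizes=(4,), activation='logistic', solver='sgd',
                         learning_rate_init=0.1, max_iter=1000, random_state=0)
demo_mlp.fit(X, y)

# signal coming out of each hidden node for each of the four inputs
hidden_out = sigmoid(X @ demo_mlp.coefs_[0] + demo_mlp.intercepts_[0])
# signal coming out of the single output node (probability of class 1)
final_out = sigmoid(hidden_out @ demo_mlp.coefs_[1] + demo_mlp.intercepts_[1])

for inputs, h, o in zip(X, hidden_out, final_out):
    print(inputs, "hidden:", np.round(h, 2), "output:", np.round(o, 2))
```

Each column of `hidden_out` shows how one hidden node responds to the four input cases, which is the numerical counterpart of the coloured connections in the network plot.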

In [None]:
theMLP=xorMLP # change this line to reuse the code below for a different problem
num_output_nodes = 1 # and this one for multi-class problems

plotDecisionSurface(theMLP,train_X,train_y)


# layer sizes: number of input features, the hidden layer sizes, then the number of output nodes
network_structure = np.hstack((train_X.shape[1], np.asarray(theMLP.hidden_layer_sizes), num_output_nodes))
# Draw the Neural Network with weights
network=VisNN.DrawNN(network_structure, theMLP.coefs_)
network.draw()

## Part 2: Using MLP for multiclass problems:  iris data

- Introduce the idea of parallel classifiers using softmax and one-hot encoding
  - this has the benefit that the classifiers can all share the feature creation done in previous layers
- Same visualisations as in the first case 





In [None]:
# load the data


irisX,irisy = load_iris(return_X_y = True)
feature_names = ['sepal_length','sepal_width','petal_length','petal_width']
irisLabels = ['iris-setosa','iris-versicolor','iris-virginica']
# show what the labels look like
print(irisy)


### Transforming our label data to a format for training a MLP with three output nodes
As you can see when you run the cell above, the labels are a 1-D array with values of 0, 1, or 2.  
However, if we want our network to choose between three predictions, then we need an output node for each class.

So there are two changes we make:
1. We tell the network to have three output nodes and use 'softmax' activation.  
    i.e. each node outputs a value, and we take as our final output the class whose node has the highest output signal.
2. We convert our labels to tell the network what *each of the nodes* should ideally output for each training example.  
   In other words, if the label is 0 then the output should be [1,0,0], if the label is 1 it should be [0,1,0], and if it is 2 the output should be [0,0,1].
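
As a small illustration of the first change, here is a sketch of what softmax does to the raw signals of three output nodes, and how the final class is picked. The input values are made up, and `softmax` here is a plain NumPy helper written for this example.

```python
import numpy as np

def softmax(signals):
    """Turn raw output-node signals into values that are positive and sum to 1."""
    shifted = signals - np.max(signals)  # subtracting the max avoids overflow in exp
    exps = np.exp(shifted)
    return exps / exps.sum()

raw_signals = np.array([1.0, 3.0, 0.5])   # made-up signals from three output nodes
probabilities = softmax(raw_signals)
predicted_class = int(np.argmax(probabilities))  # the node with the highest output wins

print(np.round(probabilities, 3), "-> predicted class", predicted_class)
```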

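sklearn comes with a class, sklearn.preprocessing.OneHotEncoder, to do this, but the cell below does it explicitly to illustrate what is going on.  
Note that MLPClassifier will also accept the 1-D labels directly and encode them internally; when it is given a 2-D one-hot array like this one, it treats the problem as multi-label and uses a logistic, rather than softmax, output layer.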

I've made it generic so that you can easily reuse it for different datasets

In [None]:
numcases = len(irisy)
print('there are ' +str(numcases) +' training examples')
thelabels = np.unique(irisy)
numlabels = len(thelabels)
print( 'there are ' + str(numlabels) + ' labels: ' + str(thelabels))
# make a 2d array with numcases rows. and numlabels columns
irisy_onehot = np.zeros((numcases,numlabels))


# Now loop through the rows of the new array setting the appropriate column value to 1
for row in range(numcases):
    label = irisy[row]
    irisy_onehot[row][label]= 1

#print(irisy_onehot)
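
For comparison, the explicit loop above can be written more compactly. The sketch below shows two alternatives on a small made-up label array: indexing an identity matrix (which assumes the labels are the integers 0, 1, 2, ...) and sklearn's OneHotEncoder. It also shows how argmax recovers the original labels.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

labels = np.array([0, 2, 1, 2, 0])   # made-up labels for illustration

# option 1: pick rows of an identity matrix (assumes labels are 0..k-1)
onehot_eye = np.eye(len(np.unique(labels)))[labels]

# option 2: sklearn's encoder, which expects a 2-D column of labels
onehot_sklearn = OneHotEncoder().fit_transform(labels.reshape(-1, 1)).toarray()

# going back from one-hot rows to class indices
recovered = onehot_eye.argmax(axis=1)

print(onehot_eye)
print(recovered)
```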

### Splitting our data into a training and a test set
As you can see from the output of the cells above, the iris data is grouped by class, i.e. rows 0-49 are 'iris-setosa', rows 50-99 are 'iris-versicolor', and rows 100-149 are 'iris-virginica'.

So if we want to train our network and then estimate how well it will do on new data, we need to split this into a training and test set.
Again, we could do this manually: first shuffling the rows so that we get a mixture of classes, then taking the first part of the data for training and the second for testing.

If the data are not so well organised, or the numbers of examples of different classes are not roughly equal, then that code gets trickier.
So the cell below shows how to do this using a method from sklearn.  The parameters are, in order:
- the feature values (irisX)
- the one-hot encoded set of labels (irisy_onehot)
- what proportion of our data we hold back from training, so we can use it for testing. We'll use roughly 1/3rd: test_size=0.33
- the array holding the labels that we want to be evenly represented in both our training and test sets: stratify=irisy_onehot 

In [None]:

iris_train_X, iris_test_X, iris_train_y, iris_test_y = train_test_split(irisX,irisy_onehot, test_size=0.33, stratify=irisy_onehot )
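
To check that stratification does what we want, you can count how many examples of each class end up in each set. The sketch below is self-contained and makes two assumptions for simplicity: it stratifies on the 1-D integer labels rather than the one-hot array, and it fixes random_state so the split is repeatable.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, stratify=y,
                                          random_state=0)

# number of examples of class 0, 1 and 2 in each set
train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print("training set class counts:", train_counts)
print("test set class counts:    ", test_counts)
```

With stratification the three classes are represented almost equally in both sets; without it, the balance depends on the luck of the shuffle.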


### Activity 2.1 Training a MLP to learn the iris classification problem
1. Start by using the settings for the MLPClassifier that we had before and just change the size of the hidden layer to five or ten.
- You will probably see that the training stops making improvements before the problem has been fully learned.
- This is an example of backpropagation getting 'stuck' in a **local optimum** (we'll talk about these more next week).
- It happens because the basic 'stochastic gradient descent' algorithm *'sgd'* is a fairly crude local search method with only limited ways of getting out of 'traps'. 
- Try changing the solver to 'adam' and see if this gives better performance.

**Remember** to run a few times with each setting: this is a stochastic algorithm and the random set of initial weights makes a huge difference.  

**Question**: what do you understand by *better*?

2. Now try adding a second hidden layer - for example by changing that parameter in the constructor to *hidden_layer_sizes=(3,3)*.  
- Experiment to see if it is better to have one hidden layer of 10 nodes or 2 layers of 5 nodes.
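
Because single runs vary so much, one way to make this comparison fairer is to average over several runs. The sketch below is a possible starting point, not a definitive experiment: the helper name `mean_test_accuracy` is hypothetical, it uses the 1-D integer labels for simplicity, and five repeats with seeds 0 to 4 is an arbitrary choice.

```python
import warnings
import numpy as np
from sklearn.datasets import load_iris
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)

def mean_test_accuracy(hidden_layer_sizes, repeats=5):
    """Average test accuracy over several different splits and initial weights."""
    scores = []
    for seed in range(repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, stratify=y,
                                                  random_state=seed)
        mlp = MLPClassifier(hidden_layer_sizes=hidden_layer_sizes, max_iter=1000,
                            alpha=1e-4, solver='adam', learning_rate_init=0.1,
                            random_state=seed)
        with warnings.catch_warnings():
            warnings.simplefilter("ignore", category=ConvergenceWarning)
            mlp.fit(X_tr, y_tr)
        scores.append(mlp.score(X_te, y_te))
    return float(np.mean(scores))

comparison = {sizes: mean_test_accuracy(sizes) for sizes in [(10,), (5, 5)]}
for sizes, accuracy in comparison.items():
    print(f"hidden_layer_sizes={sizes}: mean test accuracy {100 * accuracy:.1f}%")
```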

In [None]:
# create an MLP object-  you will want to change the number of hidden nodes
irisMLP =  MLPClassifier(hidden_layer_sizes=(5,5), max_iter=1000, alpha=1e-4,
                    solver='adam', verbose=0, 
                    learning_rate_init=.1)






irisMLP.fit(iris_train_X, iris_train_y)
print('number of output nodes = ' +str(irisMLP.n_outputs_))
    
lossplot=plt.plot(irisMLP.loss_curve_)    

# report how well it does on the training set
training_accuracy = 100* irisMLP.score(iris_train_X, iris_train_y)
print("Training set accuracy: " + str(training_accuracy) + "%")

# now how good is our network at predicting data it has never seen before
test_accuracy = 100* irisMLP.score(iris_test_X, iris_test_y)
print("Estimated (Test set) accuracy: " + str(test_accuracy) + "%")


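# print a confusion matrix showing where the errors occur
# (the labels are one-hot encoded, so argmax converts them back to class indices first;
#  ConfusionMatrixDisplay replaces plot_confusion_matrix, which was removed in scikit-learn 1.2)
#from sklearn.metrics import ConfusionMatrixDisplay
#ConfusionMatrixDisplay.from_predictions(iris_test_y.argmax(axis=1), irisMLP.predict(iris_test_X).argmax(axis=1), display_labels=irisLabels)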

### Activity 2.2 Discussion
Try to come up with answers to these questions. (These are the sorts of things you might be asked in an exam.)

1. Why is the test accuracy sometimes much lower than the training accuracy?

2. Why is it sometimes less reliable to train a network with multiple hidden layers when learning the iris data?  
Hint: how many connections are you trying to learn?  how much data have you got?
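
To help with the hint, the number of values a network has to learn can be counted directly from the layer sizes. The helper below, `count_weights`, is a hypothetical name written for this example. It counts one weight per connection plus one bias per hidden or output node, for the iris problem with 4 inputs and 3 outputs.

```python
def count_weights(n_inputs, hidden_layer_sizes, n_outputs):
    """Count the connection weights and biases in a fully connected network."""
    layer_sizes = [n_inputs, *hidden_layer_sizes, n_outputs]
    total = 0
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += fan_in * fan_out + fan_out  # weights into the layer plus one bias per node
    return total

one_layer = count_weights(4, (10,), 3)
two_small_layers = count_weights(4, (5, 5), 3)
two_wide_layers = count_weights(4, (10, 10), 3)
print(one_layer, two_small_layers, two_wide_layers)  # → 83 73 193
```

Compare these numbers with the roughly 100 training examples left after holding back a third of the 150 iris rows.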

## Activity 3: Solving MNIST

## TO DO - tidy up, show ten classes in parallel, translate code below so it doesn't use keras
### The activity is finding out how many layers, and how wide, are needed for this more complex task


The aim of this activity is to give you some experience selecting the training parameters and network architecture for applying neural networks to a classification task.


 

### Neural Network (MLP) Visualization for Digit Recognition
Loosely based on example visualisation code from a Kaggle example and "Towards Data Science".

This notebook contrasts how simple Multi-layer Perceptron (MLP) Neural Networks and Convolutional Neural Networks recognize hand-written digits from the MNIST data set.
- A two-layer MLP does a decent job at recognizing hand-written digits.
- A very simple convolutional model does even better (though still not close to the 'state of the art').
- There is a considerable difference in the interpretability of the features.

In [None]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import matplotlib.animation
plt.rcParams["animation.html"] = "jshtml"


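from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# assumption: the Keras cells further down need these names; the paths below are for the TensorFlow 2.x Keras API
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.utils import to_categorical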





In [None]:
# example code from https://scikit-learn.org/stable/auto_examples/neural_networks/plot_mnist_filters.html#sphx-glr-auto-examples-neural-networks-plot-mnist-filters-py
import warnings

import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

# Load data from https://www.openml.org/d/554
# as_frame=False returns NumPy arrays rather than a pandas DataFrame
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
# rescale the pixel values to the range [0, 1]
X = X / 255.

# use the traditional train/test split
X_train, X_test = X[:60000], X[60000:]
y_train, y_test = y[:60000], y[60000:]

mlp = MLPClassifier(hidden_layer_sizes=(50,), max_iter=10, alpha=1e-4,
                    solver='sgd', verbose=10, random_state=1,
                    learning_rate_init=.1)

# this example won't converge because of CI's time constraints, so we catch the
# warning and ignore it here
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=ConvergenceWarning,
                            module="sklearn")
    mlp.fit(X_train, y_train)

print("Training set score: %f" % mlp.score(X_train, y_train))
print("Test set score: %f" % mlp.score(X_test, y_test))

fig, axes = plt.subplots(4, 4)
# use global min / max to ensure all weights are shown on the same scale
vmin, vmax = mlp.coefs_[0].min(), mlp.coefs_[0].max()
for coef, ax in zip(mlp.coefs_[0].T, axes.ravel()):
    ax.matshow(coef.reshape(28, 28), cmap=plt.cm.gray, vmin=.5 * vmin,
               vmax=.5 * vmax)
    ax.set_xticks(())
    ax.set_yticks(())

plt.show()


In [None]:
# load mnist dataset
data = pd.read_csv('mnist_train.csv')

# next few lines added by jim to estimate vanilla cnn performance
x_train = data.iloc[:,1:785]
y_train = data.iloc[:,0]

testdata = pd.read_csv('mnist_test.csv')
x_test = testdata.iloc[:,1:785]
y_test = testdata.iloc[:,0]


#data = data.head(30000)
# split data into train and test sample
#x_train, x_test, y_train, y_test = train_test_split(data.iloc[:,1:785], data.iloc[:,0],  
                                                    #test_size = 0.1, random_state = 42)

dataSamples = data.shape[0]
trnum = dataSamples        # previously int(dataSamples*0.9) when splitting a single file
tenum = testdata.shape[0]  # previously dataSamples - trnum
x_train = x_train.values.reshape(trnum, 784)
x_test = x_test.values.reshape(tenum, 784)

# compute the number of labels
num_labels = len(np.unique(y_train))

# convert to one-hot vector
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)

# normalize
x_train = x_train.astype('float32') / 255
x_test = x_test.astype('float32') / 255

X_train_img = x_train.reshape(x_train.shape[0],28,28,1)
X_test_img = x_test.reshape(x_test.shape[0],28,28,1)



f, axes = plt.subplots(2, 10, sharey=True,figsize=(20,5))
for i,ax in enumerate(axes.flat):
    ax.axis('off')
    ax.imshow(X_test_img[i,:,:,0],cmap="gray")

### Specifying and Training a simple Neural Network

<img align="right" src="simple_MLP_for_Mnist.png" alt="Architecture of simple MLP, only 28 inputs shown" width="400"/>

We use a simple MLP with two hidden layers and sigmoid activation.
- In the first hidden layer, each neuron takes every pixel value as an input.
- Every neuron in the second hidden layer then takes all the outputs of the first layer (after activation using sigmoid) as inputs.
- A final layer with one node per digit takes the second layer's outputs and applies a softmax activation to form the network's output.
- The optimizer then adjusts the weights so as to minimize the loss function (in this case the categorical cross-entropy), which in turn tends to improve the accuracy of the classification.
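
As an illustration of the loss being minimised, here is a minimal NumPy sketch of categorical cross-entropy for one-hot targets. The function is written for this example (it is not the Keras implementation) and the probability values are made up.

```python
import numpy as np

def categorical_crossentropy(one_hot_targets, predicted_probs, eps=1e-12):
    """Mean over examples of -log(probability given to the correct class)."""
    clipped = np.clip(predicted_probs, eps, 1.0)  # avoid log(0)
    return float(-np.mean(np.sum(one_hot_targets * np.log(clipped), axis=1)))

targets = np.array([[1, 0, 0],
                    [0, 1, 0]])
confident = np.array([[0.9, 0.05, 0.05],
                      [0.1, 0.8, 0.1]])
unsure = np.array([[0.4, 0.3, 0.3],
                   [0.3, 0.4, 0.3]])

loss_confident = categorical_crossentropy(targets, confident)
loss_unsure = categorical_crossentropy(targets, unsure)
print(round(loss_confident, 3), round(loss_unsure, 3))
```

Both sets of predictions pick the correct class, but the loss is lower when the network puts more probability on it, which is why the loss can keep falling after the accuracy has stopped changing.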

In [None]:
# network parameters
input_size = x_train.shape[1]
batch_size = 64
activation = 'sigmoid'
# two hidden layers of 25 sigmoid nodes each, followed by a softmax output layer
model = Sequential()
model.add(Dense(25, input_dim=input_size, activation=activation))
model.add(Dense(25, activation=activation))
model.add(Dense(num_labels, activation='softmax'))
model.summary()

# loss function for one-hot vector using adam optimizer
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

In [None]:
# train the network
model.fit(x_train, y_train, epochs=10, batch_size=batch_size)

# validate the model on test dataset to determine generalization
loss, acc = model.evaluate(x_test, y_test, batch_size=batch_size)
print("\nTest accuracy: %.1f%%" % (100.0 * acc))

This model implementation reaches an accuracy of ~96% if we give it the whole training set, which is remarkable, but better results are possible using, e.g., a Convolutional Neural Network. We use the simpler MLP to make the interpretation of the visualization results easier.
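
For readers without Keras installed, a roughly comparable network can be built with sklearn's MLPClassifier. The sketch below is only an approximation of the Keras model above, under several assumptions: it uses sklearn's small built-in 8x8 digits dataset (load_digits) as a stand-in for MNIST so that it runs quickly and offline, a fixed random seed, and max_iter=200. All variable names are made up for this example.

```python
import warnings
from sklearn.datasets import load_digits
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# stand-in dataset: 8x8 images of digits, much smaller than MNIST
digits_X, digits_y = load_digits(return_X_y=True)
digits_X = digits_X / 16.0  # pixel values in this dataset range from 0 to 16

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    digits_X, digits_y, test_size=0.2, stratify=digits_y, random_state=0)

# two hidden layers of 25 sigmoid nodes, adam optimizer, batches of 64
sk_model = MLPClassifier(hidden_layer_sizes=(25, 25), activation='logistic',
                         solver='adam', batch_size=64, max_iter=200, random_state=0)
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    sk_model.fit(Xd_train, yd_train)

digits_accuracy = sk_model.score(Xd_test, yd_test)
print("output activation:", sk_model.out_activation_)
print("Test accuracy: %.1f%%" % (100.0 * digits_accuracy))
```

Given integer class labels, MLPClassifier sets up one output node per class with softmax activation, so no explicit one-hot conversion is needed here.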

### Extract Output and Group Information
We extract the outputs generated by each individual neuron and for each frame in the MNIST training sample and store them on a per-layer basis.

In [None]:
# one input tensor (the network input) and the output tensor of each of the three layers
get_layer_output = K.function([model.layers[0].input],
                              [model.layers[0].output, model.layers[1].output, model.layers[2].output])

layer1_output, layer2_output, layer3_output = get_layer_output([x_train])

Finally, we extract and store the indices of frames showing the same digit.

In [None]:
train_ids = [np.arange(len(y_train))[y_train[:,i] == 1] for i in range(10)]
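
The list comprehension above is compact, so here is the same idea on a tiny made-up one-hot array with three classes. The names `toy_onehot`, `toy_ids` and `alt_ids` are hypothetical. The second version uses np.flatnonzero, which gives the same indices.

```python
import numpy as np

# four made-up examples with one-hot labels for classes 0, 2, 0, 1
toy_onehot = np.array([[1, 0, 0],
                       [0, 0, 1],
                       [1, 0, 0],
                       [0, 1, 0]])

# same pattern as above: for each class, the row indices whose column i is 1
toy_ids = [np.arange(len(toy_onehot))[toy_onehot[:, i] == 1] for i in range(3)]

# equivalent, using flatnonzero to find the non-zero positions in each column
alt_ids = [np.flatnonzero(toy_onehot[:, i]) for i in range(3)]

print([ids.tolist() for ids in toy_ids])  # → [[0, 2], [3], [1]]
```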

### Visualization of Individual Frames
In this visualization, we focus on individual training data (i.e., individual frames with hand-written digits).
The following panel shows, from left to right:
- the original 28x28 pixel frame depicting a hand-written digit,
- the output values of all neurons of the first hidden layer,
- the output values of all neurons of the second hidden layer, and
- the one-hot encoded output layer indicating the model classification result.

Note that in those plots showing network layers, each pixel stands for the output of a single neuron. This output is based on the input parameters passed on from the previous layer, the trained weights for each neuron, and the activation function used in this layer. Dark blue pixels stand for low output values, while yellow pixels stand for high output values. The pixels have been arranged in two dimensions to save space; just think of these layers in linear arrangements to stay in the typical picture of layers in a network.
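
The two-dimensional arrangement is just a reshape of each layer's output vector. The sketch below uses made-up values to show how 25 hidden outputs become a 5x5 grid, and how the 10 class outputs are padded with two NaN values to fill a 3x4 grid, following the same pattern as the plotting code in the next cell.

```python
import numpy as np

# 25 made-up hidden-node outputs laid out as a 5x5 grid, row by row
hidden_outputs = np.linspace(0, 1, 25)
hidden_grid = hidden_outputs.reshape((5, 5))

# 10 made-up class outputs in which the node for digit 6 is fully 'on'
class_outputs = np.array([0.0] * 6 + [1.0] + [0.0] * 3)
# pad with two NaNs so that the 10 values fill a 3x4 grid
output_grid = np.append(class_outputs, [np.nan, np.nan]).reshape((3, 4))

# the grid position of a digit is (digit // 4, digit % 4)
row, col = divmod(6, 4)
print(output_grid)
print("digit 6 is shown at row", row, "column", col)  # → row 1 column 2
```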

In [None]:
%%capture
%matplotlib inline

# digit to be plotted
digit = 6

# indices of frames to be plotted for this digit
n = range(50)

# initialize plots
f, (ax1, ax2, ax3, ax4) = plt.subplots(1, 4, figsize=(15,4))

# prepare plots
ax1.set_title('Input Layer', fontsize=16)
ax1.axes.get_xaxis().set_visible(False)
ax1.axes.get_yaxis().set_visible(False)

ax2.set_title('Hidden Layer 1', fontsize=16)
ax2.axes.get_xaxis().set_visible(False)
ax2.axes.get_yaxis().set_visible(False)

ax3.set_title('Hidden Layer 2', fontsize=16)
ax3.axes.get_xaxis().set_visible(False)
ax3.axes.get_yaxis().set_visible(False)
    
ax4.set_title('Output Layer', fontsize=16)
ax4.axes.get_xaxis().set_visible(False)
ax4.axes.get_yaxis().set_visible(False)   

# add numbers to the output layer plot to indicate label
for i in range(3):
    for j in range(4):
        text = ax4.text(j, i, [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, '', '']][i][j],
                        ha="center", va="center", color="w", fontsize=16)    
        
def animate(id):
    # plot elements that are changed in the animation
    digit_plot = ax1.imshow(x_train[train_ids[digit][id]].reshape((28,28)), animated=True)
    layer1_plot = ax2.imshow(layer1_output[train_ids[digit][id]].reshape((5,5)), animated=True)
    layer2_plot = ax3.imshow(layer2_output[train_ids[digit][id]].reshape((5,5)), animated=True)
    output_plot = ax4.imshow(np.append(layer3_output[train_ids[digit][id]], 
                                       [np.nan, np.nan]).reshape((3,4)), animated=True)
    return digit_plot, layer1_plot, layer2_plot, output_plot,

# define animation
ani = matplotlib.animation.FuncAnimation(f, animate, frames=n, interval=100)

In [None]:
ani

Scrolling through the animation, it becomes clear that in most cases the same subset of neurons fires, while other neurons remain quiescent. This is much more obvious in the second hidden layer than in the first hidden layer and can be interpreted as the first layer pre-processing the pixel data, while the second layer deals with pattern recognition. Note that in most cases the recognition of the digit shown is unambiguous; ambiguity only occurs in somewhat pathological cases.
You can change the digit shown by changing the digit value in the code block above.

### Conclusions
- Despite variations in the shapes of hand-written digits, the same groups of neurons are involved in the identification of the same digits.
- Similarities in the shapes of digits translate into similarities in the groups of neurons that are involved in their identification in the first hidden layer, but not so much in the second hidden layer.