# Building and training a Multi Layered Perceptron (MLP) using Tensorflow

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/olaiya/MLTutorialNotebooks/blob/master/mlp.ipynb)

# Contents

- [1. Constructing a MLP with Tensorflow](#1.)
    - [1.1 Import required libraries](#1.1)
    - [1.2 Generate a dataset](#1.2)
    - [1.3 Activation functions](#1.3)
    - [1.4 Examples of activation functions in Tensorflow](#1.4)
    - [1.5 Sequential API](#1.5)
    - [1.6 Training on data](#1.6)
    - [1.7 Functional API](#1.7)
    - [1.8 Saving a model](#1.8)
    - [1.9 Loading a saved model](#1.9)
- [2. Image Classification with a MLP](#2.)
- [3. Learning rate scheduling](#3.)
    - [3.1 Power scheduling](#3.1)
    - [3.2 Exponential scheduling](#3.2)
    - [3.3 Piecewise constant scheduling](#3.3)
- [4. MLPs for regression](#4.)
- [5. Dropout](#5.)
- [6. Exercise](#6.)

## 1. Constructing a MLP with Tensorflow <a name="1."></a>

In this workbook we will use the Python library Tensorflow to implement an MLP. We will implement MLPs for classification as a way of dipping into Tensorflow. We will also cover considerations for training such as batch sizes and learning rates, as well as ways to avoid overfitting. We will also look at the training loss output, as well as saving and loading models.

To run a code cell, click on the cell then press "Shift + Enter"

### 1.1 Import required libraries <a name="1.1"></a>

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
import matplotlib.pyplot as plt
from matplotlib import cm

#Want to use version of Tensorflow > 2.0
print('Using Tensorflow version %s' % tf.__version__)


### 1.2 Generate a dataset <a name="1.2"></a>

### Create the data

Let's generate a dataset consisting of two data types, which we call signal and background. Each data type is normally distributed around a point in the x-y plane. Distinguishing signal from background in this case is a very simple problem. You could use PDFs and likelihoods to identify signal and background. However, for the purpose of this tutorial we will build a very simple MLP classifier to identify events as signal or background.

In [None]:
#Create datasets
num_events = 10000

#Signal x and y mean values
signal_mean = [1.0, 1.0]
#Signal x and y values are uncorrelated
signal_cov = [[1.0, 0.0],
              [0.0, 1.0]]

#Generate a training and validation sample
signal_train = np.random.multivariate_normal(
        signal_mean, signal_cov, num_events)
signal_val = np.random.multivariate_normal(
        signal_mean, signal_cov, num_events)

#Background x and y mean values
background_mean = [-1.0, -1.0]
#Background x and y values are uncorrelated
background_cov = [[1.0, 0.0],
                  [0.0, 1.0]]

#Generate a training and validation sample
background_train = np.random.multivariate_normal(
        background_mean, background_cov, num_events)
background_val = np.random.multivariate_normal(
        background_mean, background_cov, num_events)

#Add the signal and background samples
data_train = np.vstack([signal_train, background_train])
labels_train = np.vstack([np.ones((num_events, 1)), np.zeros((num_events, 1))])

#Add the signal and background samples
data_val = np.vstack([signal_val, background_val])
labels_val = np.vstack([np.ones((num_events, 1)), np.zeros((num_events, 1))])

In [None]:
#Plot the datasets generated
range_ = ((-3, 3), (-3, 3))
plt.figure(0, figsize=(8,4))
plt.subplot(1,2,1); plt.title("Signal")
plt.xlabel("x"), plt.ylabel("y")
plt.hist2d(signal_train[:,0], signal_train[:,1],
        range=range_, bins=20, cmap=cm.coolwarm)
plt.subplot(1,2,2); plt.title("Background")
plt.hist2d(background_train[:,0], background_train[:,1],
        range=range_, bins=20, cmap=cm.coolwarm)
plt.xlabel("x"), plt.ylabel("y");

Because this is a simple problem we can build a simple MLP to identify signal and background events 

An MLP consists of at least three layers of nodes: an input layer, a hidden layer and an output layer. Except for the input nodes, each node is a neuron that can use a nonlinear activation function

<img src="images/mlp.png" alt="mlp" width="800"/>
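The structure described above can be sketched in a few lines of NumPy. This is only an illustration of what a single forward pass through a one-hidden-layer MLP computes; the helper name `mlp_forward`, the hidden layer size of 4 and the random weights are assumptions made for the example and are not part of the Tensorflow model built later.

```python
import numpy as np

def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of a one-hidden-layer MLP (hypothetical helper, not a Keras API)."""
    # Hidden layer: weighted sum of the inputs plus a bias, then a ReLU activation
    hidden = np.maximum(0, x @ w_hidden + b_hidden)
    # Output layer: weighted sum of the hidden nodes plus a bias, then a sigmoid
    z = hidden @ w_out + b_out
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
n_inputs, n_hidden = 2, 4                      # assumed sizes, for illustration only
w_hidden = rng.normal(size=(n_inputs, n_hidden))
b_hidden = np.zeros(n_hidden)
w_out = rng.normal(size=(n_hidden, 1))
b_out = np.zeros(1)

# Two example events: one near the signal mean, one near the background mean
x_batch = np.array([[1.0, 1.0], [-1.0, -1.0]])
probs = mlp_forward(x_batch, w_hidden, b_hidden, w_out, b_out)
print(probs.shape)  # one output value per event
```

With untrained random weights the outputs carry no meaning yet; training adjusts the weights and biases so that the output approaches 1 for signal and 0 for background.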

### 1.3 Activation Functions <a name="1.3"></a>

In [None]:
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def relu(z):
    return np.maximum(0, z)

def derivative(f, z, eps=0.000001):
    return (f(z + eps) - f(z - eps))/(2 * eps)

In [None]:
z = np.linspace(-5, 5, 200)

plt.figure(figsize=(11,4))

plt.subplot(121)
plt.plot(z, np.sign(z), "r-", linewidth=1, label="Step")
plt.plot(z, sigmoid(z), "g--", linewidth=2, label="Sigmoid")
plt.plot(z, np.tanh(z), "b-", linewidth=2, label="Tanh")
plt.plot(z, relu(z), "m-.", linewidth=2, label="ReLU")
plt.grid(True)
plt.legend(loc="center right", fontsize=14)
plt.title("Activation functions", fontsize=14)
plt.axis([-5, 5, -1.2, 1.2])

plt.subplot(122)
plt.plot(z, derivative(np.sign, z), "r-", linewidth=1, label="Step")
plt.plot(0, 0, "ro", markersize=5)
plt.plot(0, 0, "rx", markersize=10)
plt.plot(z, derivative(sigmoid, z), "g--", linewidth=2, label="Sigmoid")
plt.plot(z, derivative(np.tanh, z), "b-", linewidth=2, label="Tanh")
plt.plot(z, derivative(relu, z), "m-.", linewidth=2, label="ReLU")
plt.grid(True)
#plt.legend(loc="center right", fontsize=14)
plt.title("Derivatives", fontsize=14)
plt.axis([-5, 5, -0.2, 1.2])

### 1.4 Examples of activation functions in Tensorflow <a name="1.4"></a>

I use the tensorflow shorthand for activation functions such as activation="relu". You can instead use activation=tf.nn.relu


tf.nn.relu
tf.nn.leaky_relu
tf.nn.elu


### 1.5 Sequential API  <a name="1.5"></a>

In [None]:
# Let's use Tensorflow's Sequential API. We add the layers to the model one after another
#Construct Neural Net

model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(100, activation="relu", input_dim=2))
model.add(tf.keras.layers.Dense(1, activation="sigmoid"))

#Use model with leaky relu activation function
#model = tf.keras.models.Sequential()
#model.add(tf.keras.layers.Dense(100, activation=tf.nn.leaky_relu, input_dim=2))
#model.add(tf.keras.layers.Dense(1, activation="sigmoid"))



It is useful to look at the architecture of your neural net to check it makes sense. Do you have the total number of parameters you would expect to have?

Expected number of parameters: each input is connected to a node in the first hidden layer by a weight. So that is 2 inputs connected to 100 nodes = 200 weights. Each node has a bias, so 100 biases. Therefore that is 200 weights + 100 biases = 300 parameters at the first layer

The 100 nodes are connected to 1 output node, so that is 100 weights. The output node has a bias, so the number of parameters here is 101. Therefore the total number of parameters required to construct this neural net is 300 + 101 = 401
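The counting argument above can be checked with a small helper. This is a sketch in plain Python; `dense_param_count` is a hypothetical name introduced here, and it assumes every layer is fully connected with one bias per node.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases of a fully connected net (hypothetical helper).

    layer_sizes lists the number of nodes per layer, inputs first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out  # one weight per connection between the two layers
        total += n_out         # one bias per node in the receiving layer
    return total

# 2 inputs -> 100 hidden nodes -> 1 output node
print(dense_param_count([2, 100, 1]))
```

The same helper works for any stack of dense layers, so it can be reused to check larger networks against the output of .summary().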

Use the model's method .summary() to see a breakdown of your neural net

In [None]:
model.summary()

### 1.6 Training on data  <a name="1.6"></a>

Set the loss function and optimiser. As this is a classification neural net we want to use binary cross entropy as the loss function. We will use Adam as the optimiser. The Adam optimiser is based on gradient descent but has a more sophisticated adaptation of the learning rates for the parameters. Then you run the 'compile' method on the model

In [None]:
# Set loss function and optimiser
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
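To give a feel for what the binary cross entropy loss measures, here is a minimal NumPy sketch. The function `binary_cross_entropy` and the example predictions are made up for illustration; Keras has its own implementation, and the clipping constant `eps` is an assumption used here only to avoid taking log(0).

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross entropy (illustrative sketch, not the Keras code)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # guard against log(0)
    return -np.mean(y_true * np.log(y_pred)
                    + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1.0, 0.0, 1.0, 0.0])
confident_right = np.array([0.9, 0.1, 0.9, 0.1])  # made-up predictions
confident_wrong = np.array([0.1, 0.9, 0.1, 0.9])  # made-up predictions

loss_right = binary_cross_entropy(y_true, confident_right)
loss_wrong = binary_cross_entropy(y_true, confident_wrong)
print(loss_right, loss_wrong)
```

Confident correct predictions give a small loss, while confident wrong predictions are penalised heavily, which is the behaviour we want from a classification loss.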


Train on data. Define the dataset you want to train on and the validation set you want to validate the training against. Set the number of epochs (the number of passes over the full dataset) and the batch size. The batch size is the number of events, no larger than your dataset, that the network processes before it updates its weights. You run over successive batches until you have run over all the events in your dataset. That is one epoch.

The batch size can actually be quite important. The main benefit of large batch sizes is that the training algorithm will see more instances per calculation. If you have the hardware, large batch sizes are usually recommended. However, large batch sizes can lead to instabilities, particularly at the start of training. A small batch size can be good if you are low on computer memory or want to avoid getting stuck in a local minimum. Too small a batch size can make converging on optimal weights slow!

In [None]:
num_epochs = 100
#Not worried about memory or local minima
batchSize = len(data_train)

#Train on data
history = model.fit(data_train, labels_train,
          validation_data=(data_val, labels_val),
          batch_size=batchSize,
          epochs=num_epochs)
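The relationship between batch size and the number of weight updates per epoch can be made concrete with a little arithmetic. The sketch below assumes the 20000 training events generated earlier (10000 signal plus 10000 background); `updates_per_epoch` is a hypothetical helper, and the batch sizes 32 and 256 are arbitrary choices for comparison.

```python
import math

def updates_per_epoch(num_samples, batch_size):
    """Number of weight updates in one epoch (the last batch may be smaller)."""
    return math.ceil(num_samples / batch_size)

num_samples = 20000  # 10000 signal + 10000 background training events
for batch_size in (32, 256, num_samples):
    print(batch_size, updates_per_epoch(num_samples, batch_size))
```

Setting the batch size to the full dataset, as in the cell above, therefore gives exactly one weight update per epoch, which is why many epochs are needed there.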


92% accuracy. Not bad! Be careful with accuracy measurements. It is easy to measure high accuracy if your validation or test sample mainly has one category of data. Not the case here!
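The warning about accuracy can be illustrated with a made-up, heavily imbalanced sample. In this sketch the 95/5 split and the "always predict background" classifier are assumptions chosen purely to show the effect.

```python
import numpy as np

# Made-up imbalanced sample: 95 background events (0) and 5 signal events (1)
labels = np.array([0] * 95 + [1] * 5)

# A useless "classifier" that calls every event background
always_background = np.zeros_like(labels)

accuracy = np.mean(always_background == labels)
signal_found = np.sum((always_background == 1) & (labels == 1))
print(accuracy, signal_found)
```

The accuracy looks high even though not a single signal event is identified. Our training and validation samples contain equal numbers of signal and background events, so accuracy is a fair measure here.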

In [None]:
plt.plot(history.history['accuracy'], label='accuracy')
plt.plot(history.history['val_accuracy'], label = 'val_accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.ylim([0.5, 1])
plt.legend(loc='lower right')

val_loss, val_acc = model.evaluate(data_val,  labels_val, verbose=2)

Plot the loss of the neural net as a function of epoch

In [None]:
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(loss) + 1)

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'r', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

### 1.7 Functional API  <a name="1.7"></a>

The Keras functional API is a way to create models that are more flexible than the tf.keras.Sequential API. The functional API can handle models with non-linear topology, shared layers, and even multiple inputs or outputs.

We can construct the above MLP model in Tensorflow using the functional API

In [None]:
inputs = tf.keras.Input(shape=(2,))
x = tf.keras.layers.Dense(100, activation="relu")(inputs)
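#Sigmoid output, as in the Sequential model, so that binary_crossentropy receives probabilities
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)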
model_fapi = tf.keras.Model(inputs=inputs, outputs=outputs)

model_fapi.summary()

# Set loss function and optimiser
model_fapi.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

num_epochs = 10
#Only run with 10 epochs. Just a demonstration that functional API works
#Not worried about memory or local minima
batchSize = len(data_train)

#Train on data
history = model_fapi.fit(data_train, labels_train,
          validation_data=(data_val, labels_val),
          batch_size=batchSize,
          epochs=num_epochs)


### 1.8 Saving a model  <a name="1.8"></a>

What about saving the model? And at what point do you save the model? This is an ideal example where the loss is smooth and decreases monotonically. You'll find this is rarely the case. What I tend to do is save the model each time the validation loss reaches a new minimum. You overtrain your neural net if you use the training loss as a metric for saving your model. Let's use a callback to save the model. A callback lets you specify a list of objects that will be called during training. We can use a checkpoint callback to save the model every time the validation loss reaches a new minimum. To start with we will save the weights of the model only, by specifying save_weights_only=True.
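The rule "save only when the validation loss reaches a new minimum" can be sketched in plain Python before handing it over to Keras. The helper `epochs_where_model_is_saved` and the validation losses are made up for illustration; this is not how the Keras callback is implemented, only the decision rule it follows when save_best_only=True.

```python
def epochs_where_model_is_saved(val_losses):
    """Return the epochs (counted from 1) at which a checkpoint would be written."""
    best = float("inf")
    saved = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:  # new minimum of the validation loss
            best = loss
            saved.append(epoch)
    return saved

val_losses = [0.70, 0.55, 0.60, 0.50, 0.52, 0.51]  # made-up values
print(epochs_where_model_is_saved(val_losses))
```

Epochs where the validation loss goes back up are skipped, so the file on disk always holds the parameters with the lowest validation loss seen so far.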

In [None]:
import os

#Specify a directory to save model

model_filename = 'simple_mlp_weights.h5'
saved_model_directory = 'models'

checkpoint_filename = os.path.join(saved_model_directory, model_filename)

CHECK_FOLDER = os.path.isdir(saved_model_directory)

# If folder doesn't exist, then create it.
if not CHECK_FOLDER:
    os.makedirs(saved_model_directory)
    print("created folder : ", saved_model_directory)
                
#Create checkpoint
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
        filepath=checkpoint_filename,
        monitor='val_loss',
        save_weights_only=True,
        save_best_only=True,
        verbose = 1)

        
history = model.fit(data_train, labels_train,
        validation_data=(data_val, labels_val),
        batch_size=len(data_train),
        epochs=num_epochs,
        callbacks=[model_checkpoint_callback])

Notice your model continues learning from its previous parameters! Makes sense, you haven't reset the model parameters from the last training iteration

So far we have saved only the weights of the model, so you will need to define the model before you can load the weights (next subsection). What if you want to save the model itself, so it can be recalled without any prior information? You can do this using the save method. Don't use the .h5 file type extension when saving the full model. For example:

In [None]:
full_model_filename = 'simple_mlp_fullModel'
saved_model_directory = 'models'

#Save the full model inside the models directory
full_model_filename = os.path.join(saved_model_directory, full_model_filename)
model.save(full_model_filename)

### 1.9 Loading a saved model <a name="1.9"></a>

Now we have saved the parameters we can load them at any time

In [None]:
#Load model weights

model.load_weights(checkpoint_filename)
model.summary()

In the above example the model had to be defined before the weights could be loaded. If you have the full model saved you can load it directly

In [None]:
#Load full model
loaded_model = keras.models.load_model(full_model_filename)
loaded_model.summary()

## 2. Image classification with a MLP  <a name="2."></a>

We can look at the MNIST dataset, famous in machine learning examples. Specifically we can look at images of handwritten numbers 0-9 and use a MLP to identify them. This is a dataset of 60,000 28x28 grayscale images of the 10 digits, along with a test set of 10,000 images. More info can be found at the <a href='http://yann.lecun.com/exdb/mnist'>MNIST homepage</a>.

In [None]:
#Load the dataset
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()


# Randomly look at one of the images
random_ints =np.random.randint(len(x_test), size=1)
image_index = random_ints[0]

plt.imshow(x_test[image_index].reshape(28, 28),cmap='Greys')

print('Image label is: %i' % y_test[image_index])


Let's build an MLP to identify the images. The intensity of the pixels is represented as a value from 0 to 255. We should normalise the intensity! This is important! If your inputs have different scales, they will have disproportionate contributions to the loss. Scale your inputs to avoid this problem!

Also, the image is a 2D array so we should flatten it to a 1D array so the values can be input into a MLP. Let's build a simple MLP with a hidden layer containing 128 nodes and an output layer of 10 nodes. Again we can work out how many parameters it would take to build this MLP.

The total number of inputs required for the mlp is (an input for each pixel) 28 x 28  = 784

Total number of weights to connect each input to each node in the first layer is 784 x 128 = 100352

Total number of biases in the first layer is 128.

Total number of parameters for the first layer is = 100352 + 128 = 100480

Total number of parameters for the output layer is = number of weights + biases = (128 * 10) + 10 = 1290

Total number of parameters =  101770

In [None]:
# Intensity of pixels ranges from 0-255. We need to normalise them
x_train, x_test = x_train / 255.0, x_test / 255.0


#Rather than instantiate the model and use the .add method to keep adding inputs, layers, outputs, etc.
#we can pass the components as a comma-separated list during instantiation

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(10)
])

model.summary()

One would expect a softmax activation function (the multi-class counterpart of the sigmoid) to be applied to the output. Instead we leave the output layer without an activation, so it produces raw scores (logits), and pass from_logits=True to the loss function SparseCategoricalCrossentropy, which applies the softmax internally when calculating the loss.

In [None]:
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

model.compile(optimizer='adam',
              loss=loss_fn,
              metrics=['accuracy'])
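What a sparse categorical cross entropy "from logits" computes can be sketched with NumPy. The function below is an illustrative reimplementation under the assumption that the labels are integer class indices; its name and the example logits are made up, and it is not the Keras source code.

```python
import numpy as np

def sparse_categorical_crossentropy_from_logits(labels, logits):
    """Mean cross entropy for integer labels and raw scores (illustrative sketch)."""
    # Log-softmax, with the row maximum subtracted for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability assigned to the true class of each example
    return -np.mean(log_probs[np.arange(len(labels)), labels])

logits = np.array([[2.0, 0.5, -1.0],   # made-up scores for 3 classes
                   [0.0, 0.0, 0.0]])
labels = np.array([0, 2])               # true class index of each example
loss = sparse_categorical_crossentropy_from_logits(labels, logits)
print(loss)
```

When all logits of an example are equal the softmax assigns probability 1/3 to each of the three classes, so that example contributes log(3) to the loss.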


Train the model!

In [None]:
# Pixel intensities were already scaled to the range 0-1 above; dividing by 255 again would shrink the inputs a second time

#Split into training and validation samples
x_train, x_val = x_train[0:55000], x_train[55000:]
y_train, y_val = y_train[0:55000], y_train[55000:]

In [None]:
num_epochs = 10
history = model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=num_epochs)

Greater than 90%, pretty good! With a more complex mlp and more epochs you can do much better!

In [None]:
plt.plot(history.history['accuracy'], label='accuracy')
plt.plot(history.history['val_accuracy'], label = 'val_accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.ylim([0.5, 1])
plt.legend(loc='lower right')

val_loss, val_acc = model.evaluate(x_val,  y_val, verbose=2)

## 3. Learning Rate Scheduling <a name="3."></a>

Finding a good learning rate is very important. Too large a learning rate and your training may never converge. Too low and training can take forever to converge. One has to find an optimal learning rate. Even better, you don't need to have a constant learning rate: you can start with a large learning rate and then reduce it as soon as the training stops making fast progress. You can use what is called a learning rate scheduler to define the learning rate as a function of epoch. The learning rate scheduler can be passed to the fit function as a callback.

### 3.1 Power Scheduling <a name="3.1"></a>

lr = lr0 /(1 + (epoch / s))
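As a quick worked example of the formula, the sketch below evaluates it at a few epochs. The values lr0 = 0.01 and s = 20 are example choices, and `power_lr` is a hypothetical helper name.

```python
def power_lr(epoch, lr0=0.01, s=20):
    """Power schedule: lr0 / (1 + epoch / s)."""
    return lr0 / (1 + epoch / s)

for epoch in (0, 20, 40, 60):
    print(epoch, round(power_lr(epoch), 6))
```

The learning rate falls to lr0/2 after s epochs, lr0/3 after 2s epochs and lr0/4 after 3s epochs, so the decay is fast at first and then slows down.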

In [None]:
# Learning rate drops to lr0/2, lr0/3, lr0/4, ... after s, 2s, 3s, ... epochs
def power_decay(lr0, s):
    def power_decay_function(epoch):
        return lr0/(1 + (epoch / s))
    return power_decay_function

power_decay_fn = power_decay(0.01,20)

lr_scheduler = keras.callbacks.LearningRateScheduler(power_decay_fn)

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(10)
])

model.compile(optimizer='adam',
              loss=loss_fn,
              metrics=['accuracy'])
             
num_epochs = 10
history = model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=num_epochs, callbacks=[lr_scheduler])

### 3.2 Exponential Scheduling <a name="3.2"></a>

lr = lr0 * 0.1**(epoch / s)
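Evaluating the exponential formula at a few epochs shows how it behaves. As before, lr0 = 0.01 and s = 20 are example values and `exponential_lr` is a hypothetical helper name.

```python
def exponential_lr(epoch, lr0=0.01, s=20):
    """Exponential schedule: lr0 * 0.1**(epoch / s)."""
    return lr0 * 0.1 ** (epoch / s)

for epoch in (0, 10, 20, 40):
    print(epoch, exponential_lr(epoch))
```

The learning rate drops by a factor of 10 every s epochs. With s = 20 and a training run of only 10 epochs, the learning rate therefore ends at roughly a third of its starting value.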

In [None]:
#Learning rate drops exponentially
def exponential_decay(lr0, s):
    def exponential_decay_fn(epoch):
        return lr0 * 0.1**(epoch / s)
    return exponential_decay_fn

exponential_decay_fn = exponential_decay(lr0=0.01, s=20)

lr_scheduler = keras.callbacks.LearningRateScheduler(exponential_decay_fn)

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(10)
])

model.compile(optimizer='adam',
              loss=loss_fn,
              metrics=['accuracy'])
             
num_epochs = 10
history = model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=num_epochs, callbacks=[lr_scheduler])


### 3.3 Piecewise Constant Scheduling <a name="3.3"></a>

With piecewise constant scheduling you can set a fixed learning rate for different epoch ranges. Most of the time I find piecewise scheduling simple and effective enough to achieve what I want.

In [None]:
##Two implementations for piecewise constant scheduling
def piecewise_constant_fn(epoch):
    if epoch < 5:
        return 0.01
    elif epoch < 15:
        return 0.005
    else:
        return 0.001
    
    
def piecewise_constant(boundaries, values):
    boundaries = np.array([0] + boundaries)
    values = np.array(values)
    def piecewise_constant_fn(epoch):
        return values[np.argmax(boundaries > epoch) - 1]
    return piecewise_constant_fn

piecewise_constant_fn = piecewise_constant([5, 15], [0.01, 0.005, 0.001])



In [None]:
lr_scheduler = keras.callbacks.LearningRateScheduler(piecewise_constant_fn)

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(10)
])

model.compile(optimizer='adam',
              loss=loss_fn,
              metrics=['accuracy'])
             
num_epochs = 10
history = model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=num_epochs, callbacks=[lr_scheduler])

## 4. MLPs for Regression <a name="4."></a>

We can use MLPs to predict values. Simply remove the sigmoid activation function and change the loss function. You can pick Mean Squared Error (MSE), Mean Absolute Error (MAE), whatever you like.  https://www.tensorflow.org/api_docs/python/tf/keras/losses
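The difference between the two regression losses mentioned above can be seen on a handful of made-up numbers. This NumPy sketch computes both by hand; the target and predicted values are invented for illustration.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])  # made-up targets
y_pred = np.array([1.5, 2.0, 2.0, 6.0])  # made-up predictions

errors = y_pred - y_true
mse = np.mean(errors ** 2)     # squaring emphasises large errors
mae = np.mean(np.abs(errors))  # every error counts in proportion to its size
print(mse, mae)
```

The single large error of 2.0 dominates the MSE, whereas it contributes to the MAE only linearly. MSE is therefore more sensitive to outliers than MAE.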

In [None]:
def generate_data(num_events):
    
    #Generate x values
    x = np.linspace(-10, 30, num_events)
    #Generate noise
    noise = np.random.rand(1, num_events)
    #Generate y values (x squared distribution + noise)
    y = (x  * x)
    #Add noise
    noise =  noise * y
    y = y + noise
    #Scale data
    ymax = np.amax(y[0])
    y = y / ymax
    #Add noise
    #noise = noise * amplitude
    #y = y + noise
    #Shuffle the data
    shuffled_indices = np.arange(len(x))
    np.random.shuffle(shuffled_indices)
    
    return x[shuffled_indices] , y[0][shuffled_indices]

data_x, data_y  = generate_data(500)

i1 = 400
i2 = 450

x_train , x_val, x_test = data_x[0:i1], data_x[i1:i2], data_x[i2:]
y_train, y_val, y_test = data_y[0:i1], data_y[i1:i2], data_y[i2:]


plt.plot(x_train, y_train, 'o', label='Data')
plt.legend()

Create a simple MLP to model the distribution. This is regression so no use of the sigmoid activation function on the output. Also we will use a different loss function, say mean squared error

In [None]:
model = tf.keras.models.Sequential([
  tf.keras.layers.InputLayer(input_shape=(1,)),
  tf.keras.layers.Dense(10, activation='relu'),
  tf.keras.layers.Dense(1, )
])

model.summary()




In [None]:
model.compile(optimizer='adam',
              loss='mse')
                 
num_epochs = 200
history = model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=num_epochs)

In [None]:
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(loss) + 1)

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'r', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

In [None]:
y_test_prediction = model.predict(x_test)
#Change the shape of data (make it 1D)
y_test_prediction = np.squeeze(y_test_prediction)

In [None]:
# Compare estimation with truth

plt.plot(x_test, y_test, 'o', label='Test data')
plt.plot(x_test, y_test_prediction, 'o', color='red', label='Model prediction')
plt.legend()

## 5. Dropout <a name="5."></a>

If your neural net tends to suffer from overtraining, the dropout technique is a good way to combat it. At every training step, every neuron in the layer has a probability p of being 'dropped out'. A neuron that is dropped out is entirely ignored during that training step, but may be active again in the next one. The hyperparameter p is called the dropout rate and is typically set between 10% and 50%. You can also apply dropout to the input layer.

We can construct a more complex neural net for our dataset and add dropout. The purpose is only to demonstrate the implementation of dropout.
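What a dropout layer does to its inputs during training can be sketched with NumPy. The `dropout` function below is an illustrative stand-in, not the Keras implementation; like the Keras layer it rescales the surviving values by 1 / (1 - rate) so that their expected sum is unchanged. The activations of all ones and the random seed are assumptions made for the example.

```python
import numpy as np

def dropout(activations, rate, rng):
    """Zero each value with probability `rate` and rescale the rest (sketch)."""
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(42)
activations = np.ones((1000, 20))  # made-up layer outputs
dropped = dropout(activations, rate=0.3, rng=rng)

print((dropped == 0).mean())  # fraction of values that were dropped
print(dropped.mean())         # mean activation after rescaling
```

Roughly 30% of the values are set to zero and the rest are scaled up, so the average activation stays close to its original value. A fresh random mask is drawn at every training step, which stops the network from relying too heavily on any single neuron.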


In [None]:
model = tf.keras.models.Sequential([
  tf.keras.layers.InputLayer(input_shape=(1,)),
  tf.keras.layers.Dense(20, activation='relu'),
  tf.keras.layers.Dropout(rate=0.3),
  tf.keras.layers.Dense(20, activation='relu'),
  tf.keras.layers.Dropout(rate=0.3),
  tf.keras.layers.Dense(1, )
])

model.summary()


In [None]:
model.compile(optimizer='adam',
              loss='mse')
                 
num_epochs = 200
history = model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=num_epochs)


In [None]:
y_test_prediction = model.predict(x_test)
#Change the shape of data (make it 1D)
y_test_prediction = np.squeeze(y_test_prediction)

In [None]:
# Compare estimation with truth

plt.plot(x_test, y_test, 'o', label='Test data')
plt.plot(x_test, y_test_prediction, 'o', color='red', label='Model prediction')
plt.legend()

## 6. Exercise <a name="6."></a>

Rerun the image classification MLP with some modifications to see if you can improve the accuracy. Play with the number of nodes, layers, change the activation functions, the loss function, add learning rate scheduling, maybe dropout.  Don't make your neural net too large. We do have limited resources.