# COMP4702/7703 Prac 7: MNIST Deep Learning with TensorFlow
This notebook lets you build deep networks and MLPs for the MNIST dataset using TensorFlow. As the course assumes no Python knowledge, I have written some code that does the implementation for you.
If you are not using a lab computer to do this practical, you will need to install TensorFlow on your machine before continuing. See [here](https://www.tensorflow.org/install/) for more information on how to do this.

# Disclaimer - this code has been tested on Ubuntu 16.04 and Windows 10 only.
## Let's get cracking!

In [None]:
from prac7ConvMLPModel import *
from SupportCode.Helpers import *
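import numpy as np
import tensorflow as tf  # tf is used directly below, so import it explicitly rather than relying on the star imports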
# Set seed so randomly generated values for hyper-parameters are predictable
np.random.seed(42)

### Set up data

In [None]:
mnist = tf.keras.datasets.mnist.load_data()
[x_train, y_train], [x_test, y_test] = mnist

# Flatten input arrays from 28x28 to 784 for x_train and x_test
x_train = x_train.reshape(len(x_train), 784)
x_test = x_test.reshape(len(x_test), 784)

# Concatenate x_train and y_train in order to randomly shuffle whole dataset (VERY IMPORTANT - used for K-Fold CV)
y_train = y_train.reshape(len(y_train), 1)
train = np.concatenate((x_train, y_train), axis=1)
np.random.shuffle(train)
# Resplit x_train and y_train
x_train = train[:, :-1]
y_train = train[:, -1]

# One hot encoding for y_train to be able to train neural net
shape = (y_train.size, y_train.max() + 1)
one_hot = np.zeros(shape)
rows = np.arange(y_train.size)
one_hot[rows, y_train] = 1
y_train = one_hot

# One hot encoding for y_test so it matches the format of y_train
shape = (y_test.size, y_test.max() + 1)
one_hot = np.zeros(shape)
rows = np.arange(y_test.size)
one_hot[rows, y_test] = 1
y_test = one_hot
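
The shuffle and one-hot steps above can also be written more compactly. The sketch below is only an illustration on toy data: the helper names `shuffle_in_unison` and `one_hot` are made up for this example and are not part of the prac's support code.

```python
import numpy as np

def shuffle_in_unison(x, y, rng):
    """Shuffle features and labels with one shared permutation."""
    perm = rng.permutation(len(x))
    return x[perm], y[perm]

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) array with a single 1 per row."""
    return np.eye(num_classes)[labels]

rng = np.random.default_rng(0)
x_demo = np.arange(12).reshape(6, 2)    # 6 toy samples, 2 features each
y_demo = np.array([0, 1, 2, 0, 1, 2])   # toy integer labels (row i has label i % 3)

x_shuf, y_shuf = shuffle_in_unison(x_demo, y_demo, rng)
y_encoded = one_hot(y_shuf, num_classes=3)
print(y_encoded)
```

Indexing with a shared permutation keeps every row paired with its label without concatenating the two arrays first.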

To begin, we will create an MLP with default settings. Running this the first time will be a bit slow as it will download the MNIST dataset. The number of training steps has been set to a small number in order to verify the code is working and for you to see what the output is.

In [None]:
# Verify modified code works fine!
# data = [x_train, y_train, x_test, y_test]
# prac7ConvMLPModel(data)

You'll notice that two new TensorBoard tabs have opened in your browser. These display a variety of information about the network parameters. One tab indicates the test set summaries and accuracies and the other tab is the train set summaries. In the bottom left corner of the page you will see either "MODELtrainX" or "MODELtestX". X indicates the run number - this simply corresponds to the order that you run them in. MODEL is just the model name - either "MLP" or "convNet".

To open a previous TensorBoard simply call:

In [None]:
# openTensorBoardAtIndex("MLP", "GradientDescent", 0)

As you go through the prac, make sure to keep track of your settings for a given run - if you lose track you'll have to delete the folder and start again. Each file is on the order of tens to hundreds of MB, so you might need to delete some.

A particularly interesting part of TensorBoard is the images tab. This shows the weight values as a greyscale image for each layer. You can move the sliding bar to see how the weights change over each iteration.
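
If you're wondering how a set of weights becomes a picture, here is one plausible way to do it. This is a sketch under assumptions: the support code's actual rendering isn't shown in this notebook, `weights_to_image` is a made-up helper, and the random matrix only stands in for trained first-layer weights.

```python
import numpy as np

def weights_to_image(weight_column, side=28):
    """Rescale one neuron's 784 incoming weights to a 28x28 uint8 image."""
    w = np.asarray(weight_column, dtype=np.float64)
    span = w.max() - w.min()
    if span == 0:
        scaled = np.zeros_like(w)        # constant weights map to a black image
    else:
        scaled = (w - w.min()) / span    # min-max scale into [0, 1]
    return (scaled * 255).astype(np.uint8).reshape(side, side)

rng = np.random.default_rng(0)
first_layer = rng.normal(scale=0.05, size=(784, 500))  # stand-in for trained weights
image = weights_to_image(first_layer[:, 0])             # weights feeding hidden neuron 0
print(image.shape, image.dtype)
```

Each hidden neuron in the first layer has one weight per input pixel, which is why its weights can be laid out on the same 28x28 grid as the digits.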

# Configure your MLP
By default prac7ConvMLPModel() will generate an MLP with one hidden layer that consists of 500 neurons. It will use stochastic gradient descent to optimise. You'll also notice that it doesn't perform very well.
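
To make the default topology concrete, here is a rough NumPy sketch of a forward pass through a 784-500-10 network. It is not the code inside prac7ConvMLPModel: the softmax output, the weight initialisation and the helper names are all assumptions made for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def init_params(layer_sizes, rng):
    """One (weights, biases) pair per layer, using small random weights."""
    return [(rng.normal(scale=0.01, size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def mlp_forward(x, params):
    """Hidden layers use ReLU; the output layer uses softmax (an assumption)."""
    a = x
    for w, b in params[:-1]:
        a = relu(a @ w + b)
    w_out, b_out = params[-1]
    return softmax(a @ w_out + b_out)

rng = np.random.default_rng(0)
params = init_params([784, 500, 10], rng)   # 784 inputs, 500 hidden neurons, 10 classes
batch = rng.random((4, 784))                # 4 fake flattened images
probs = mlp_forward(batch, params)
print(probs.shape)
```

Adding a second hidden layer only means passing `[784, 500, 500, 10]` to `init_params`, which mirrors what the `hiddenDims` list does later in this notebook.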

To change this you will need to change the optimiser values and the hidden layer values. Your prac 6 MLP might be a good start... unless it was bad ><.

## Let's take a look at the optimisers
In this prac you can use the following optimisers: GradientDescent, Adam, RMSProp, Momentum, and Adagrad.

Adam seems to be the most popular at the moment, followed by RMSProp.
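
If it helps to see what these optimisers actually do, the sketch below writes out textbook versions of the five update rules in NumPy and runs each on a toy quadratic. Treat it as an illustration only: TensorFlow's implementations differ in details such as where epsilon is added, and the function names and the `lr=0.1` defaults are made up for this example. The keyword names mirror the dictionary keys used later in this notebook.

```python
import numpy as np

def sgd_step(w, g, state, lr=0.1):
    return w - lr * g

def momentum_step(w, g, state, lr=0.1, momentum=0.9):
    # Velocity is a decaying sum of past gradients
    state["v"] = momentum * state.get("v", 0.0) + g
    return w - lr * state["v"]

def adagrad_step(w, g, state, lr=0.1, initial_accumulator_value=0.1, eps=1e-7):
    # Accumulate all squared gradients, so the effective step only shrinks
    state["acc"] = state.get("acc", initial_accumulator_value) + g ** 2
    return w - lr * g / (np.sqrt(state["acc"]) + eps)

def rmsprop_step(w, g, state, lr=0.1, decay=0.9, eps=1e-7):
    # Moving average of squared gradients instead of a running sum
    state["ms"] = decay * state.get("ms", 0.0) + (1 - decay) * g ** 2
    return w - lr * g / (np.sqrt(state["ms"]) + eps)

def adam_step(w, g, state, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-7):
    # Moving averages of the gradient and its square, with bias correction
    t = state["t"] = state.get("t", 0) + 1
    state["m"] = beta1 * state.get("m", 0.0) + (1 - beta1) * g
    state["v"] = beta2 * state.get("v", 0.0) + (1 - beta2) * g ** 2
    m_hat = state["m"] / (1 - beta1 ** t)
    v_hat = state["v"] / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)

def minimise(update, steps=200):
    """Minimise f(w) = sum(w**2) from a fixed start and return the final loss."""
    w = np.array([3.0, -2.0])
    state = {}
    for _ in range(steps):
        grad = 2 * w
        w = update(w, grad, state)
    return float(np.sum(w ** 2))

steps_to_try = (sgd_step, momentum_step, adagrad_step, rmsprop_step, adam_step)
final_losses = {fn.__name__: minimise(fn) for fn in steps_to_try}
for name, loss in final_losses.items():
    print(f"{name}: final loss = {loss:.3g}")
```

A toy quadratic says little about MNIST, but it does show which quantities each optimiser keeps track of between steps.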

The following code should help you in configuring your optimisers. If you don't know what a parameter for a particular optimiser does, remember that Google is your friend. For the details, check out the [TensorFlow documentation](https://www.tensorflow.org/api_guides/python/train).

**Don't** forget to *change* these values (srsly they will throw errors ;))!

The MLP is pretty easy to set up as you only need to choose the layer layout. You could also change the activation function, but by default it's the rectified linear unit (ReLU). Check out the [TensorFlow documentation](https://www.tensorflow.org/api_guides/python/) for which functions you can use and then change the act parameter in the function call.

## Q1

### (a)
Compare the performance of the different optimisers using the MLP topology you chose in Prac 6 Q3. 

### Instructions
* Use a **table** to present your results including the hyper-parameters selected. 
* Describe in **at most** 250 words your methodology for selecting hyper-parameters.
* Discuss in **at most** 150 words what attributes of each optimiser might make it perform better or worse. 

### Set up k-Fold Cross-Validation function that splits training set into training and validation sets (leaving test set untouched for final models only)

In [None]:
# Number of times to split training set into 1/5 = 20% validation sets and 4/5 = 80% training sets for hyper-parameter
# optimisation
k = 5

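def k_fold_cv(k, i, x_train, y_train):
    """Return (train_x, train_y, valid_x, valid_y) for fold i, where i runs from 1 to k."""
    # Fold i holds out rows [start_idx, end_idx) as the validation set
    start_idx = int((i - 1)/k*len(x_train))
    end_idx = int(i/k*len(x_train))
    valid_x = x_train[start_idx:end_idx, :]
    valid_y = y_train[start_idx:end_idx, :]

    # Every remaining row goes into the training set
    valid_rows = np.arange(start_idx, end_idx)
    train_x = np.delete(x_train, valid_rows, axis=0)
    train_y = np.delete(y_train, valid_rows, axis=0)
    return train_x, train_y, valid_x, valid_y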
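
A quick way to convince yourself that a fold splitter behaves is to run it on a tiny dataset and check that every row ends up in exactly one validation set. The sketch below restates the same splitting logic under the made-up name `k_fold_split` so that it runs on its own; the toy arrays are placeholders, not MNIST.

```python
import numpy as np

def k_fold_split(k, i, x, y):
    """Return (train_x, train_y, valid_x, valid_y) for fold i, with i in 1..k."""
    start = (i - 1) * len(x) // k
    end = i * len(x) // k
    valid_rows = np.arange(start, end)
    return (np.delete(x, valid_rows, axis=0), np.delete(y, valid_rows, axis=0),
            x[start:end], y[start:end])

x_toy = np.arange(20).reshape(10, 2)     # 10 toy samples; row r starts with the value 2*r
y_toy = np.eye(2)[np.arange(10) % 2]     # toy one-hot labels

seen = []
for fold in range(1, 6):
    train_x, train_y, valid_x, valid_y = k_fold_split(5, fold, x_toy, y_toy)
    seen.extend(valid_x[:, 0].tolist())  # record which rows were held out
    print(f"fold {fold}: train {train_x.shape}, valid {valid_x.shape}")
```

With 10 rows and 5 folds, each fold should hold out 2 rows and train on the other 8.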

In [None]:
activationFunction = tf.nn.relu
MLPTopology={}
# Use 2 hidden layers with 500 neurons in each layer
MLPTopology['hiddenDims'] = [500, 500]

### Using Random Search k-Fold CV for Hyper-Parameter Optimisation

[Reference](https://jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf) - "Random Search for Hyper-Parameter Optimization" by Bergstra and Bengio (JMLR, 2012)
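
For reference, here is a generic sketch of random search combined with k-fold scoring. It is a variant, not the procedure used in the cells of this notebook: it scores every sampled configuration on all k folds and samples the learning rate log-uniformly, a common choice when a range spans several orders of magnitude. `toy_score` is a hypothetical stand-in for a real train-and-validate run, and the 1e-4 to 1e-1 range is an assumption.

```python
import numpy as np

def sample_log_uniform(low, high, size, rng):
    """Draw values whose logarithm is uniform on [log(low), log(high))."""
    return np.exp(rng.uniform(np.log(low), np.log(high), size=size))

def random_search(score_fn, n_trials, k, rng):
    """Score each sampled configuration on all k folds and keep the best mean."""
    best_config, best_score = None, -np.inf
    for lr in sample_log_uniform(1e-4, 1e-1, n_trials, rng):
        config = {"learning_rate": float(lr)}
        mean_score = np.mean([score_fn(config, fold) for fold in range(1, k + 1)])
        if mean_score > best_score:
            best_config, best_score = config, float(mean_score)
    return best_config, best_score

def toy_score(config, fold):
    # Hypothetical stand-in for validation accuracy; peaks near lr = 0.01
    return 1.0 - abs(np.log10(config["learning_rate"]) + 2.0) / 4.0 - 0.001 * fold

rng = np.random.default_rng(42)
best_config, best_score = random_search(toy_score, n_trials=20, k=5, rng=rng)
print(best_config, best_score)
```

Averaging over folds costs k times as many training runs per configuration, so it trades compute for a less noisy comparison.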

In [None]:
# Optimisation dictionary for Gradient Descent
optDicGD = {}
optDicGD["optMethod"] = "GradientDescent"

learning_rates = np.random.uniform(low=0.0001, high=0.001, size=5)

print(f"Learning rates = {learning_rates}")

for i in range(1, k + 1):
    data = k_fold_cv(k, i, x_train, y_train)
    optDicGD["learning_rate"] = learning_rates[i-1]
    
    
    # Evaluate Gradient Descent optimiser with varying learning rates
    prac7ConvMLPModel(data, model='MLP', MLPTop=MLPTopology, optimiser=optDicGD, act=activationFunction, max_steps=100)

In [None]:
# Evaluate best Gradient Descent model on test set (unseen data)
data = [x_train, y_train, x_test, y_test]
optDicGD["learning_rate"] = learning_rates[2]
prac7ConvMLPModel(data, model='MLP', MLPTop=MLPTopology, optimiser=optDicGD, act=activationFunction, max_steps=100)

In [None]:
# openTensorBoardAtIndex("MLP", "GradientDescent", 5)

In [None]:
# Optimisation dictionary for Momentum
optDicM = {}
optDicM["optMethod"] = "Momentum"

# Have already chosen 5 random learning rates, just choose 5 random momentums
momentums = np.random.uniform(low=0.001, high=0.1, size=5)

print(f"Learning rates = {learning_rates}")
print(f"Momentums = {momentums}")

for i in range(1, k + 1):
    data = k_fold_cv(k, i, x_train, y_train)
    optDicM["learning_rate"] = learning_rates[i-1]
    optDicM["momentum"] = momentums[i-1]
    
    # Evaluate Momentum optimiser with varying learning rates and momentums
    prac7ConvMLPModel(data, model='MLP', MLPTop=MLPTopology, optimiser=optDicM, act=activationFunction, max_steps=100)

In [None]:
# Evaluate best Momentum model on test set (unseen data)
data = [x_train, y_train, x_test, y_test]
optDicM["learning_rate"] = learning_rates[3]
optDicM["momentum"] = momentums[3]
prac7ConvMLPModel(data, model='MLP', MLPTop=MLPTopology, optimiser=optDicM, act=activationFunction, max_steps=100)

In [None]:
# openTensorBoardAtIndex("MLP", "Momentum", 5)

In [None]:
# Optimisation dictionary for Adagrad
optDicAGrad = {}
optDicAGrad["optMethod"] = "Adagrad"

initial_accum_values = np.random.uniform(low=0.01, high=0.2, size=5)

print(f"Learning rates = {learning_rates}")
print(f"Initial accumulator values = {initial_accum_values}")

for i in range(1, k + 1):
    data = k_fold_cv(k, i, x_train, y_train)
    optDicAGrad["learning_rate"] = learning_rates[i-1]
    optDicAGrad["initial_accumulator_value"] = initial_accum_values[i-1]
    
    # Evaluate Adagrad optimiser with varying learning rates and initial accumulator values
    prac7ConvMLPModel(data, model='MLP', MLPTop=MLPTopology, optimiser=optDicAGrad, act=activationFunction, max_steps=100)

In [None]:
# Evaluate best Adagrad model on test set (unseen data)
data = [x_train, y_train, x_test, y_test]
optDicAGrad["learning_rate"] = learning_rates[3]
optDicAGrad["initial_accumulator_value"] = initial_accum_values[3]
prac7ConvMLPModel(data, model='MLP', MLPTop=MLPTopology, optimiser=optDicAGrad, act=activationFunction, max_steps=100)

In [None]:
# openTensorBoardAtIndex("MLP", "Adagrad", 5)

In [None]:
# Optimisation dictionary for RMSProp
optDicRMS = {}
optDicRMS["optMethod"] = "RMSProp"
optDicRMS["centered"] = False # If True, gradients are normalised by their estimated variance (centered second moment)

decays = np.random.uniform(low=0.0, high=0.01, size=5)

print(f"Learning rates = {learning_rates}")
print(f"Momentums = {momentums}")
print(f"Decays = {decays}")

for i in range(1, k + 1):
    data = k_fold_cv(k, i, x_train, y_train)
    optDicRMS["learning_rate"] = learning_rates[i-1]
    optDicRMS["momentum"] = momentums[i-1]
    optDicRMS["decay"] = decays[i-1]
    
    # Evaluate RMSProp optimiser with varying learning rates, momentums and decays
    prac7ConvMLPModel(data, model='MLP', MLPTop=MLPTopology, optimiser=optDicRMS, act=activationFunction, max_steps=100)

In [None]:
# Evaluate best RMSProp model on test set (unseen data)
data = [x_train, y_train, x_test, y_test]
optDicRMS["learning_rate"] = learning_rates[3]
optDicRMS["momentum"] = momentums[3]
optDicRMS["decay"] = decays[3]
prac7ConvMLPModel(data, model='MLP', MLPTop=MLPTopology, optimiser=optDicRMS, act=activationFunction, max_steps=100)

In [None]:
# openTensorBoardAtIndex("MLP", "RMSProp", 5)

In [None]:
# Optimisation dictionary for Adam
optDicAdam = {}
optDicAdam["optMethod"] = "Adam"
optDicAdam["learning_rate"] = 0.001
optDicAdam["beta1"] = 0.9
optDicAdam["beta2"] = 0.999

beta1s = np.random.uniform(low=0.85, high=0.95, size=5)
# Upper limit of 1.0 excluded
beta2s = np.random.uniform(low=0.9, high=1.0, size=5)

print(f"Learning rates = {learning_rates}")
print(f"Beta_1 = {beta1s}")
print(f"Beta_2 = {beta2s}")

for i in range(1, k + 1):
    data = k_fold_cv(k, i, x_train, y_train)
    optDicAdam["learning_rate"] = learning_rates[i-1]
    optDicAdam["beta1"] = beta1s[i-1]
    optDicAdam["beta2"] = beta2s[i-1]
    
    # Evaluate Adam optimiser with varying learning rates, beta1 and beta2 values
    prac7ConvMLPModel(data, model='MLP', MLPTop=MLPTopology, optimiser=optDicAdam, act=activationFunction, max_steps=100)

In [None]:
# Evaluate best Adam model on test set (unseen data)
data = [x_train, y_train, x_test, y_test]
optDicAdam["learning_rate"] = learning_rates[3]
optDicAdam["beta1"] = beta1s[3]
optDicAdam["beta2"] = beta2s[3]
prac7ConvMLPModel(data, model='MLP', MLPTop=MLPTopology, optimiser=optDicAdam, act=activationFunction, max_steps=100)

In [None]:
# openTensorBoardAtIndex("MLP", "Adam", 5)

## Q1


### (b)
Compare the performance of the network from (a) using ReLU, tanh, and sigmoid functions as the activation function.

### Instructions
* Use a **table** to present your results. 
* Discuss in **at most** 150 words the differences in the functions, and why that may have led to different results. 
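
As a starting point for thinking about the differences, the sketch below evaluates the gradient of each activation at a few input values in plain NumPy. The helper names are made up for this example and it is independent of the prac's support code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu_grad(z):
    return (z > 0).astype(float)      # 1 for positive inputs, 0 otherwise

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2      # peaks at 1 when z = 0

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)              # peaks at 0.25 when z = 0

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
gradients = {
    "relu": relu_grad(z),
    "tanh": tanh_grad(z),
    "sigmoid": sigmoid_grad(z),
}
for name, grad in gradients.items():
    print(name, np.round(grad, 4))
```

Notice how the tanh and sigmoid gradients shrink towards zero for large positive or negative inputs, while the ReLU gradient stays at 1 for any positive input.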

### Hints

Don't forget to play with the parameters, as the ones above will probably throw errors!

Increase 'max_steps' to increase the number of iterations. Also, use TensorBoard - it might provide some insight into what's happening during training.

If you have any trouble using any of the parameters, consult the [TensorFlow documentation](https://www.tensorflow.org/api_docs/python/). The search bar up the top is really good! (I guess it's not surprising considering the API is made by Google).

In [None]:
data = [x_train, y_train, x_test, y_test]

# TESTING ALL ACTIVATION FUNCTIONS ON THE ADAM CONFIGURATION CURRENTLY STORED IN optDicAdam FROM Q1.a)
# (INDEX 3 IF THE CELLS ABOVE WERE RUN IN ORDER, INDEX 4 IF ONLY THE CV LOOP WAS RUN)
# Testing ReLU
activationFunction = tf.nn.relu
prac7ConvMLPModel(data, model='MLP', MLPTop=MLPTopology, optimiser=optDicAdam, act=activationFunction, max_steps=100,
                  path="ReLU")

In [None]:
# Testing Tanh
activationFunction = tf.nn.tanh
prac7ConvMLPModel(data, model='MLP', MLPTop=MLPTopology, optimiser=optDicAdam, act=activationFunction, max_steps=100,
                  path="Tanh")

In [None]:
# Testing Sigmoid
activationFunction = tf.nn.sigmoid
prac7ConvMLPModel(data, model='MLP', MLPTop=MLPTopology, optimiser=optDicAdam, act=activationFunction, max_steps=100,
                  path="Sigmoid")