# Neural nets

Welcome to the third Jupyter notebook! In this session, we'll cover some basics of neural nets. If you're already familiar with all of its concepts and expect to be through this notebook in ten minutes or so, feel free to challenge yourself a little more with the overarching machine-learning challenge.

## Setup

To allow the next code blocks to run smoothly, this section takes care of a few imports and settings.

Some imports that we will be using:

In [None]:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

And some more imports specific to this notebook:

In [None]:
import tensorflow as tf
from tensorflow import keras

Some figure plotting settings: increase the font size of the axis and tick labels of our figures a bit.

In [None]:
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

## Preparing the input data

We've already touched on it previously, but one of the most popular (and probably most boring) datasets used in machine-learning teaching is the MNIST dataset. It's a collection of pictures of handwritten digits. We'll use it to train some neural nets in this session! First, we'll need to fetch the dataset from the internet. Then, we'll have to transform it in such a way that it can be used with our model.

Let's start with downloading the dataset, which already comes in two sets for training and testing:

In [None]:
(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.mnist.load_data()

To understand what we're dealing with, let's check the shape of the objects:

In [None]:
X_train_full.shape

Ok, great! Our training dataset consists of 60,000 instances, each of which has 28 by 28 features. Since we're talking about images, these 28 by 28 features are just the pixels of the image. And the test dataset?

In [None]:
X_test.shape

That's 10,000 instances, good. When dealing with images, the individual pixels can carry different types of information: in the worst case, three different colour channels (red, green, blue), each with a certain "depth" (that is, the number of bits used to "describe" the colour). 8 or 16 bits are typical numbers for this. The first case, for example, would mean that we can have 256 different colour intensities ($2^{8}$). Luckily, our images are only greyscale, so each pixel only carries 8 bits to describe its brightness, that is, an integer value between 0 and 255. Let's scale this to be between 0 and 1:

In [None]:
X_train_full = X_train_full / 255.0
X_test = X_test / 255.0
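
As a quick sanity check of the scaling step, the following sketch uses a small made-up array (not the real dataset) to show what dividing 8-bit pixel values by 255 does. The array `fake_image` is purely illustrative.

```python
import numpy as np

# Hypothetical 2 x 2 "image" with 8-bit greyscale values (0 = min, 255 = max).
fake_image = np.array([[0, 51], [102, 255]], dtype=np.uint8)

# Dividing by 255.0 converts to floating point and maps [0, 255] onto [0, 1].
scaled = fake_image / 255.0

print(scaled.dtype)   # float64
print(scaled.min(), scaled.max())
```

Note that the division also changes the data type from integers to floating-point numbers, which is what the models below work with.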

In addition, we should reserve a small fraction of the training data for validation. Remember the three different types of datasets we usually consider when building/fitting/validating/testing a model:
* The training data, which is directly used in the training steps of the model.
* The validation data, which is used to evaluate the model performance on-the-fly during training. Validation data does _not_ go into the fit procedure itself, but it does have an impact on the training procedure. For example, when using techniques like early stopping, the model performance on the validation data is the deciding factor when to stop training.
* The test data, which the model _only_ gets to see once it is fully built and trained. This is to check how the model performs on unseen data.
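
To make the role of the validation data a bit more concrete, here is a minimal sketch of the early-stopping logic in plain Python. The function name `early_stopping_epoch`, the `patience` parameter and the list of validation losses are made up for illustration; this is not how Keras implements it internally.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch after which training would stop.

    Training stops once the validation loss has not improved for
    `patience` consecutive epochs. If that never happens, the index
    of the last epoch is returned.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1


# Made-up validation losses: improving at first, then getting worse.
example_losses = [0.9, 0.6, 0.5, 0.55, 0.58, 0.7]
stop_epoch = early_stopping_epoch(example_losses, patience=2)
print("Training would stop after epoch", stop_epoch)
```

The validation losses never enter the weight updates here, but they still decide how long the training runs. That is why a separate test set is needed for an unbiased final check.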

We've already separated our dataset into 60,000 training and 10,000 testing instances, but let's reserve another fraction of the training data for on-the-fly validation purposes.

In [None]:
X_valid, X_train = X_train_full[:5000], X_train_full[5000:]
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]

Before we get started with building a model and the training, let's have a quick look at the data itself. The following block of code picks one arbitrary instance (index 36,000), makes sure it has the shape of a 28 by 28 image and displays it on your screen:

In [None]:
some_digit = X_train[36000]
some_digit_image = some_digit.reshape(28, 28)
plt.imshow(some_digit_image, cmap = mpl.cm.binary,
           interpolation="nearest")
plt.axis("off")
plt.show()

Looks very much like a five, doesn't it? Let's look at a bunch of instances.

In [None]:
def plot_digits(instances, images_per_row=10, **options):
    size = 28
    images_per_row = min(len(instances), images_per_row)
    images = [instance.reshape(size,size) for instance in instances]
    n_rows = (len(instances) - 1) // images_per_row + 1
    row_images = []
    n_empty = n_rows * images_per_row - len(instances)
    images.append(np.zeros((size, size * n_empty)))
    for row in range(n_rows):
        rimages = images[row * images_per_row : (row + 1) * images_per_row]
        row_images.append(np.concatenate(rimages, axis=1))
    image = np.concatenate(row_images, axis=0)
    plt.imshow(image, cmap = mpl.cm.binary, **options)
    plt.axis("off")

In [None]:
plt.figure(figsize=(9,9))
example_images = np.r_[X_train[::612]]
plot_digits(example_images, images_per_row=10)
plt.show()

Cool! We can already see that most of them are easy to classify with the human eye, but there are a few instances that are quite tricky. You'll also notice that almost all digits were written by Americans; they would look different if put down by a German native speaker.

In [None]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, kernel_initializer="he_normal"),
    keras.layers.LeakyReLU(),
    keras.layers.Dense(100, kernel_initializer="he_normal"),
    keras.layers.LeakyReLU(),
    keras.layers.Dense(10, activation="softmax")
])

In [None]:
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.SGD(lr=1e-3),
              metrics=["accuracy"])

In [None]:
history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_valid, y_valid))

In [None]:
model.summary()
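
The parameter counts reported by `model.summary()` can be cross-checked by hand: a dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases, while the flatten and leaky-ReLU layers have no trainable parameters. A small sketch (the helper `dense_params` is made up for this check):

```python
def dense_params(n_in, n_out):
    """Number of trainable parameters of a dense layer: weights plus biases."""
    return n_in * n_out + n_out


# Layer sizes of the model above: 28 * 28 inputs -> 300 -> 100 -> 10.
layer_sizes = [28 * 28, 300, 100, 10]
params_per_layer = [dense_params(n_in, n_out)
                    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

print(params_per_layer)       # [235500, 30100, 1010]
print(sum(params_per_layer))  # 266610
```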

In [None]:
model.evaluate(X_test, y_test)

# Batch Normalisation

The following model implements two hidden layers, each of which uses the ELU activation function. The activation function is already implemented as a separate step in the model to make your life easier. Can you implement batch normalisation for each of the hidden layers and the output? The function to use is [tf.layers.batch_normalization](https://www.tensorflow.org/api_docs/python/tf/layers/batch_normalization).

Again, you might see deprecation warnings for this class (also on the documentation page).
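
Before filling in the blanks, it may help to see what batch normalisation computes during training. The following NumPy sketch is an illustration only, not the TensorFlow implementation: it standardises each feature over the batch and then applies a scale `gamma` and an offset `beta`, which are learnable parameters in a real layer. The function name, the value of `epsilon` and the random input are assumptions made for this example.

```python
import numpy as np


def batch_norm_training(x, gamma, beta, epsilon=1e-3):
    """Normalise each feature (column) of x over the batch dimension."""
    batch_mean = x.mean(axis=0)
    batch_var = x.var(axis=0)
    x_hat = (x - batch_mean) / np.sqrt(batch_var + epsilon)
    return gamma * x_hat + beta


rng = np.random.RandomState(42)
# Made-up batch: 200 instances, 5 features, far from zero mean / unit variance.
x = rng.normal(loc=3.0, scale=10.0, size=(200, 5))

out = batch_norm_training(x, gamma=np.ones(5), beta=np.zeros(5))
print(out.mean(axis=0).round(3))  # close to 0 for every feature
print(out.std(axis=0).round(3))   # close to 1 for every feature
```

At inference time, a real layer uses running averages of the mean and variance collected during training instead of the statistics of the current batch, which is what the `training` switch in the model below controls.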

In [None]:
reset_graph()

n_inputs = 28 * 28  # pixel size of the input
n_hidden1 = 300     # Arbitrary size of a first hidden layer
n_hidden2 = 100     # Arbitrary size of a second hidden layer
n_outputs = 10      # Number of output nodes, one for each digit

# Placeholder for the input data.
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")

# This is basically a placeholder with a switch whether we're looking
# at training data or not. This will be needed to tell the batch
# normalisation how the normalisation is evaluated (batch vs. total).
training = tf.placeholder_with_default(False, shape=(), name='training')

# Build the first hidden layer. Can you add batch normalisation here?
hidden1 = tf.layers.dense(X, n_hidden1, name="hidden1")
bn1 =    # implement here
bn1_act = tf.nn.elu(bn1)

# Build the second one. Again, can you add batch normalisation?
hidden2 = tf.layers.dense(bn1_act, n_hidden2, name="hidden2")
bn2 =    # implement here
bn2_act = tf.nn.elu(bn2)

# And the same for the output layer?
logits_before_bn = tf.layers.dense(bn2_act, n_outputs, name="outputs")
logits =   # implement here

# As before, define the cross entropy and loss.
with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")

# As before, gradient-descent optimiser to minimise the loss.
with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

# As before, evaluate on the accuracy by comparing to y.
with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
    
init = tf.global_variables_initializer()
saver = tf.train.Saver()

Now we're ready to run the model. Again, this might take a moment. Can you guess what the operations in the `UPDATE_OPS` collection are doing for batch normalisation?

In [None]:
n_epochs = 20
batch_size = 200

extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
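        # The placeholder X expects flat vectors of n_inputs = 28 * 28 pixels,
        # so the 28 x 28 images are flattened before they are fed to the graph.
        for X_batch, y_batch in shuffle_batch(X_train.reshape(-1, n_inputs), y_train, batch_size):
            sess.run([training_op, extra_update_ops],
                     feed_dict={training: True, X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: X_valid.reshape(-1, n_inputs), y: y_valid})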
        print(epoch, "Validation accuracy:", accuracy_val)

    save_path = saver.save(sess, "./my_model_final.ckpt")
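
The training loop relies on a helper called `shuffle_batch`, which is not defined in this section. If it is not available in your session, a minimal sketch could look as follows, assuming that the helper is supposed to yield randomly shuffled mini-batches of roughly `batch_size` instances; the actual helper used in this course may differ.

```python
import numpy as np


def shuffle_batch(X, y, batch_size):
    """Yield shuffled mini-batches (X_batch, y_batch) covering the data once."""
    shuffled_idx = np.random.permutation(len(X))
    n_batches = max(len(X) // batch_size, 1)
    for batch_idx in np.array_split(shuffled_idx, n_batches):
        yield X[batch_idx], y[batch_idx]


# Tiny made-up dataset to show the behaviour: 10 instances, 3 features.
X_demo = np.arange(30).reshape(10, 3)
y_demo = np.arange(10)
for X_batch, y_batch in shuffle_batch(X_demo, y_demo, batch_size=4):
    print(X_batch.shape, y_batch)
```

Every instance appears exactly once per pass over the data, and features and labels stay aligned because both are indexed with the same shuffled indices.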

That's it! But actually, isn't this worse than what we had before with the leaky ReLU and _no_ batch normalisation? So that means that a "better" activation function plus batch normalisation give a worse result than the cheap leaky ReLU? Do you have any idea why?

# Alternative (and faster) optimisers

There are various optimisers available in TensorFlow, all of which tend to be a lot faster than the 'standard' gradient-descent optimiser. Below you find the TensorFlow implementations of:
* Momentum optimisation ([tf.train.MomentumOptimizer](https://www.tensorflow.org/api_docs/python/tf/train/MomentumOptimizer))
* Nesterov momentum optimisation
* Adaptive gradient (AdaGrad) optimisation ([tf.train.AdagradOptimizer](https://www.tensorflow.org/api_docs/python/tf/train/AdagradOptimizer))
* RMSProp optimisation ([tf.train.RMSPropOptimizer](https://www.tensorflow.org/api_docs/python/tf/train/RMSPropOptimizer))
* Adaptive moment estimation (Adam) optimisation ([tf.train.AdamOptimizer](https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer))

All of these can easily be used in the above neural net(s) trained on the MNIST dataset. Just replace the current optimizer of the 'train' scope of the model. Can you make out differences between the optimizers? Do they considerably speed up the convergence and/or the training cycle?

In [None]:
# Pick ONE of the following; each assignment overrides the previous one.

# Momentum optimisation
optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate, momentum=0.9)

# Nesterov momentum optimisation
optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate, momentum=0.9, use_nesterov=True)

# AdaGrad optimisation
optimizer = tf.train.AdagradOptimizer(learning_rate=learning_rate)

# RMSProp optimisation
optimizer = tf.train.RMSPropOptimizer(learning_rate=learning_rate, momentum=0.9, decay=0.9, epsilon=1e-10)

# Adam optimisation
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
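
To get a feeling for why momentum can speed things up, here is a small NumPy sketch that compares plain gradient descent with momentum optimisation on a simple quadratic loss. The loss function, the starting point, the learning rate and the number of steps are made up for illustration, and the update rule is a textbook form of momentum (accumulate gradients, then step), not a copy of the TensorFlow internals.

```python
import numpy as np

# Toy loss: 0.5 * sum(curvature * w**2), with its minimum at w = 0.
curvature = np.array([1.0, 10.0])


def loss(w):
    return 0.5 * np.sum(curvature * w ** 2)


def gradient(w):
    return curvature * w


def run_gradient_descent(n_steps, learning_rate):
    w = np.ones(2)
    for _ in range(n_steps):
        w = w - learning_rate * gradient(w)
    return w


def run_momentum(n_steps, learning_rate, momentum=0.9):
    w = np.ones(2)
    velocity = np.zeros(2)
    for _ in range(n_steps):
        # Accumulate an exponentially decaying sum of past gradients ...
        velocity = momentum * velocity + gradient(w)
        # ... and step along the accumulated direction.
        w = w - learning_rate * velocity
    return w


loss_gd = loss(run_gradient_descent(n_steps=100, learning_rate=0.01))
loss_momentum = loss(run_momentum(n_steps=100, learning_rate=0.01))
print("Plain gradient descent:", loss_gd)
print("Momentum optimisation: ", loss_momentum)
```

With the same learning rate and number of steps, the momentum variant ends up much closer to the minimum in this toy setting. Whether the same holds for the MNIST model is exactly what the exercise above asks you to find out.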

# Regularisation via dropout

One of the most popular regularisation techniques for neural nets is dropout. Can you implement dropout in the following example? The function to use in TensorFlow is [tf.layers.dropout](https://www.tensorflow.org/api_docs/python/tf/layers/dropout).

Again, you might see deprecation warnings because of tensorflow v2 coming soon ...
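
As a reminder of what dropout does, here is a NumPy sketch of so-called inverted dropout: during training, each input is set to zero with probability `rate` and the remaining ones are scaled up by `1 / (1 - rate)` so that the expected sum stays the same; outside of training, the input is passed through unchanged. The function name and the example input are made up, and the code illustrates the idea rather than reproducing the TensorFlow implementation.

```python
import numpy as np


def dropout(x, rate, training, rng):
    """Inverted dropout: zero out units at random and rescale the rest."""
    if not training:
        return x
    keep_mask = rng.random_sample(x.shape) >= rate
    return x * keep_mask / (1.0 - rate)


rng = np.random.RandomState(42)
x = np.ones((1000, 10))

dropped = dropout(x, rate=0.5, training=True, rng=rng)
unchanged = dropout(x, rate=0.5, training=False, rng=rng)

print("Fraction of zeroed units:", np.mean(dropped == 0))
print("Values of kept units:    ", np.unique(dropped[dropped > 0]))
print("Mean during training:    ", dropped.mean())
```

This is also why the model below needs the `training` switch: dropout should only be active while the weights are being updated, not when the accuracy is evaluated.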

In [None]:
reset_graph()

# Placeholder for the input data.
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")

# This is basically a placeholder with a switch whether we're looking
# at training data or not. This will be needed to tell the dropout
# layers whether to drop units (training) or pass everything through.
training = tf.placeholder_with_default(False, shape=(), name='training')

# Set the dropout rate.
dropout_rate = 0.5

# First, let's implement dropout for the input X. Can you add it?
X_drop =    # implement here

# Build the actual NN with two hidden layers.
with tf.name_scope("dnn"):
    # Define the first hidden layer.
    hidden1 = tf.layers.dense(X_drop, n_hidden1, activation=tf.nn.relu, name="hidden1")
    # Now the first layer was created. Can you add dropout for it?
    hidden1_drop =    # implement here
    # Define the second hidden layer.
    hidden2 = tf.layers.dense(hidden1_drop, n_hidden2, activation=tf.nn.relu, name="hidden2")
    # And for the second one, too?
    hidden2_drop =    # implement here
    # Build the output layer.
    logits = tf.layers.dense(hidden2_drop, n_outputs, name="outputs")

# As before, define the cross entropy and loss.
with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")

# Let's use momentum optimisation this time.
with tf.name_scope("train"):
    optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9)
    training_op = optimizer.minimize(loss)    

# As before, evaluate on the accuracy by comparing to y.
with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
    
init = tf.global_variables_initializer()
saver = tf.train.Saver()

Now let's start the training!

In [None]:
n_epochs = 20
batch_size = 50

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
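        # The placeholder X expects flat vectors of n_inputs = 28 * 28 pixels,
        # so the 28 x 28 images are flattened before they are fed to the graph.
        for X_batch, y_batch in shuffle_batch(X_train.reshape(-1, n_inputs), y_train, batch_size):
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch, training: True})
        accuracy_val = accuracy.eval(feed_dict={X: X_valid.reshape(-1, n_inputs), y: y_valid})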
        print(epoch, "Validation accuracy:", accuracy_val)

    save_path = saver.save(sess, "./my_model_final.ckpt")