# MNIST Dataset

This is the 'Hello World' of ML training.

The MNIST dataset consists of 70,000 handwritten digits. Since we have 10 digits, we have 10 classes from 0 to 9.

Our challenge is to create an algorithm that takes an image and correctly determines what digit is shown in that image. 

## Solution Outline

Each image in the MNIST dataset is 28 pixels by 28 pixels. 

The images are greyscale, so the pixel values are between 0 and 255.

We can think of each image as a 28x28 matrix with input values between 0 and 255.

0 corresponds to pure black and 255 to pure white.

As each photo is 28x28 pixels, the total pixel number is 784.

The approach for deep neural networks is to 'flatten' each image into a 784 x 1 vector.

So for each image you would have 784 inputs to the neural network.
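
A minimal sketch of the flattening step, using NumPy. The image here is random noise standing in for a real MNIST digit, so the variable names and values are illustrative only:

```python
import numpy as np

# Hypothetical stand-in for one MNIST image: 28x28 greyscale values in [0, 255].
rng = np.random.default_rng(seed=0)
image = rng.integers(low=0, high=256, size=(28, 28), dtype=np.uint8)

# Flatten row by row into a single vector of 28 * 28 = 784 values.
flat = image.reshape(-1)

print(image.shape)  # (28, 28)
print(flat.shape)   # (784,)
```

The first 28 entries of `flat` are the first row of the image, the next 28 are the second row, and so on.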

Then we will linearly combine the input layer and add a non-linearity to create the first hidden layer. This model will have two hidden layers. Two are enough to produce a model for this with very good accuracy.
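
As a rough illustration of "linearly combine and add a non-linearity", here is one hidden layer computed by hand in NumPy. The weights are random and untrained, the width of 50 is just an example, and ReLU is assumed as the non-linearity because the model later in this document uses it:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical sizes: 784 inputs (one flattened image) and 50 hidden units.
x = rng.random(784)                          # flattened image, values in [0, 1)
W = rng.normal(scale=0.01, size=(784, 50))   # weights (random, untrained)
b = np.zeros(50)                             # biases

z = x @ W + b            # linear combination of the inputs
h = np.maximum(z, 0.0)   # ReLU non-linearity: negative values become 0

print(h.shape)  # (50,)
```

The second hidden layer repeats the same two steps, taking `h` as its input.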

We then need to produce the output layer. There are 10 digits, therefore 10 classes and so there will be 10 output units after the second hidden layer. 

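The output will then be compared to the targets. Conceptually, we use one-hot encoding for both the outputs and the targets. (In the code further down, the targets stay as integer labels and the `sparse_categorical_crossentropy` loss handles them; this gives the same loss as comparing against one-hot targets.)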

I.e. the digit 0 will be represented by [1,0,0,0,0,0,0,0,0,0] and 5 will be represented by [0,0,0,0,0,1,0,0,0,0].
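
A small sketch of one-hot encoding in NumPy. The helper name `one_hot` is made up for this example:

```python
import numpy as np

def one_hot(digit, num_classes=10):
    """Return a vector of zeros with a single 1 at position `digit`."""
    vec = np.zeros(num_classes, dtype=int)
    vec[digit] = 1
    return vec

print(one_hot(0).tolist())  # [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(one_hot(5).tolist())  # [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
```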

Since we want to see the probability of a digit being correctly labelled, we will use a softmax activation function on the output layer.
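
To make the softmax step concrete, here is a hand-written version in NumPy. The raw scores are made up; in the real model they would come from the last linear combination:

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into positive values that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

# Made-up raw outputs for the 10 digit classes.
logits = np.array([1.0, 0.5, 0.1, 2.0, 0.3, 4.0, 0.2, 0.0, 1.5, 0.7])
probs = softmax(logits)

print(probs.round(3))         # ten probabilities, one per digit
print(int(np.argmax(probs)))  # 5
```

The largest raw score (index 5) gets the largest probability, so the predicted digit is 5.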

### Action plan:
- prepare our data and preprocess it. create training, validation and test datasets
- outline the model and choose the activation functions
- set the appropriate advanced optimisers and the loss function
- make it learn
- test the accuracy of the model

## Import the relevant packages






In [1]:
import numpy as np
import tensorflow as tf

import tensorflow_datasets as tfds # Tensorflow datasets has a lot of data ready for modelling

## Data

Loading the dataset with flag `as_supervised = True` loads the data in a 2-tuple structure (input, target).

Loading the dataset with flag `with_info = True` provides a tuple whose second element contains info about the version, features and number of samples of the dataset.

In [2]:
mnist_dataset,mnist_info = tfds.load(name='mnist',with_info=True,as_supervised=True)

In [5]:
mnist_train, mnist_test = mnist_dataset['train'],mnist_dataset['test']

# this dataset does not already have a validation dataset so we will take 10% of the training set

num_validation_samples = 0.1 * mnist_info.splits['train'].num_examples
num_validation_samples = tf.cast(num_validation_samples,tf.int64)
# cast converts a variable into a given data type

num_test_samples = mnist_info.splits['test'].num_examples
num_test_samples = tf.cast(num_test_samples,tf.int64)

def scale(image,label):
    image = tf.cast(image,tf.float32)
    image /= 255.
    return image,label

# there is a TensorFlow method called dataset.map that applies a custom
# transformation to a given dataset. it takes as input a function which
# determines the transformation.
# Because the data is in (image, label) form, that function must accept
# an image and a label and return both.

scaled_train_and_validation_data = mnist_train.map(scale)

test_data = mnist_test.map(scale)

Shuffling is something often done at the preprocessing stage. It keeps the same information, but puts it in a different order. Like shuffling cards. 

It's possible that the targets are stored in a sorted order (for example descending), so that the first x batches contain only one target value and later batches contain only other values. Shuffling protects against this. Otherwise the stochastic gradient descent algorithm would see unrepresentative batches and would not work properly.

You need to set a buffer size, which limits how much data is held in memory and shuffled at one time. Otherwise the data might be too large, take up the entire memory of the machine and cause issues.
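
The idea of a shuffle buffer can be sketched in plain Python. This is a simplified model of the behaviour, not TensorFlow's actual implementation, and the function name `buffered_shuffle` is made up for this example:

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in shuffled order while holding at most buffer_size of them."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Still filling the buffer: nothing is emitted yet.
            buffer.append(item)
        else:
            # Buffer is full: emit a random element and put the new item in its slot.
            idx = rng.randrange(buffer_size)
            yield buffer[idx]
            buffer[idx] = item
    # Input exhausted: emit whatever is left, again in random order.
    while buffer:
        idx = rng.randrange(len(buffer))
        buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
        yield buffer.pop()

data = list(range(20))
print(list(buffered_shuffle(data, buffer_size=5, seed=0)))  # locally shuffled
print(list(buffered_shuffle(data, buffer_size=1, seed=0)))  # unchanged order
```

A buffer of size 1 does no shuffling at all, and only a buffer at least as large as the dataset gives a fully uniform shuffle. With a buffer of 10,000 on a larger dataset, the shuffle is therefore approximate: an element can only move as far forward as the buffer allows.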

In [6]:
BUFFER_SIZE = 10000

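# reshuffle_each_iteration=False keeps the shuffled order fixed, so the take()/skip()
# split below stays the same on every pass and validation samples do not drift into training
shuffled_train_and_validation_data = scaled_train_and_validation_data.shuffle(BUFFER_SIZE, reshuffle_each_iteration=False)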

validation_data = shuffled_train_and_validation_data.take(num_validation_samples)
train_data = shuffled_train_and_validation_data.skip(num_validation_samples)

BATCH_SIZE = 100

train_data = train_data.batch(BATCH_SIZE)
validation_data = validation_data.batch(num_validation_samples) # model expects this in batch format so use the samples num
test_data = test_data.batch(num_test_samples)

# Validation data must have same shape and format as the train and test data
# The MNIST data is iterable and in 2-tuple format (as_supervised=True)

validation_inputs, validation_targets = next(iter(validation_data))


## Model

### Outline the model

- 784 inputs
- 10 outputs, one for each digit
- we will work with 2 hidden layers of 50 elements each

The optimal width and depth for this problem are unknown, but it is safe to assume that the initial values of 4 layers for depth (input, two hidden, output) and 50 for width are suboptimal. NB: I don't know what the best width and depth are; these two values are just a starting assumption.

In [12]:
input_size = 784
output_size = 10
hidden_layer_size = 100
# underlying assumption is that all hidden layers are same size
# it is possible to vary hidden layer size

# the model below shows how each layer is built. I imagine you could be clever about adding multiple hidden layers.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28,28,1)),
    tf.keras.layers.Dense(hidden_layer_size,activation='relu'), # activation function chosen because it is known to be good at this problem
    tf.keras.layers.Dense(hidden_layer_size,activation='relu'),
    tf.keras.layers.Dense(output_size,activation='softmax') # output layer for this problem must turn outputs into probabilities, which softmax does.    
])

### Choose the optimiser and the loss function

In [13]:
model.compile(optimizer='adam',loss='sparse_categorical_crossentropy',metrics=['accuracy'])
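
To show what `sparse_categorical_crossentropy` computes for a single image, here is a hand calculation in NumPy. The probabilities are made up, and this is only a sketch of the formula, not the Keras implementation:

```python
import numpy as np

# Made-up softmax output for one image (10 probabilities summing to 1).
probs = np.array([0.01, 0.01, 0.02, 0.05, 0.01, 0.80, 0.03, 0.02, 0.03, 0.02])

label = 5                 # integer ("sparse") target, as stored in the dataset
one_hot = np.zeros(10)
one_hot[label] = 1.0      # the same target written as a one-hot vector

# Sparse version: pick the probability of the true class directly.
sparse_loss = -np.log(probs[label])
# One-hot version: sum over all classes; only the true class contributes.
categorical_loss = -np.sum(one_hot * np.log(probs))

print(round(float(sparse_loss), 4))       # 0.2231
print(round(float(categorical_loss), 4))  # 0.2231
```

Both forms give the same number, which is why integer targets can be used without converting them to one-hot vectors first.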

### Training 

What happens inside an epoch?
- At the beginning of each epoch, the training loss will be set to 0
- The algorithm will iterate over a preset number of batches, all from train_data
- The weights and biases will be updated as many times as there are batches
- We will get a value for the loss function, indicating how the training is going
- We will also see a training accuracy
- At the end of the epoch, the algorithm will forward propagate the whole validation set
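
The number of weight updates per epoch follows from the split and batch sizes used earlier. A quick check, assuming the standard MNIST training split of 60,000 images:

```python
import math

num_train_examples = 60_000   # size of the MNIST 'train' split (assumed standard split)
validation_fraction = 0.1     # 10% held out for validation
batch_size = 100              # BATCH_SIZE

num_validation = round(validation_fraction * num_train_examples)
num_training = num_train_examples - num_validation
updates_per_epoch = math.ceil(num_training / batch_size)

print(num_validation)     # 6000
print(num_training)       # 54000
print(updates_per_epoch)  # 540
```

This matches the `540/540` that Keras prints for each epoch.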

In [14]:
NUM_EPOCHS = 5

model.fit(train_data,epochs=NUM_EPOCHS,validation_data=(validation_inputs,validation_targets),verbose=2)

Epoch 1/5
540/540 - 3s - loss: 0.3301 - accuracy: 0.9048 - val_loss: 0.1548 - val_accuracy: 0.9553
Epoch 2/5
540/540 - 3s - loss: 0.1399 - accuracy: 0.9584 - val_loss: 0.1106 - val_accuracy: 0.9665
Epoch 3/5
540/540 - 3s - loss: 0.0987 - accuracy: 0.9702 - val_loss: 0.0879 - val_accuracy: 0.9720
Epoch 4/5
540/540 - 3s - loss: 0.0763 - accuracy: 0.9766 - val_loss: 0.0838 - val_accuracy: 0.9738
Epoch 5/5
540/540 - 3s - loss: 0.0614 - accuracy: 0.9807 - val_loss: 0.0585 - val_accuracy: 0.9825


<tensorflow.python.keras.callbacks.History at 0x14942a400>

The loss decreases with every epoch, which is great. It does not change enormously between epochs because the weights and biases have already been updated 540 times (once per batch) within the first epoch alone.

The accuracy shows in what % of the cases our outputs were equal to the targets. Logically, it increases over time as the model gets better.
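
A minimal sketch of how accuracy is obtained from softmax outputs, using a made-up toy example with 4 images and 3 classes to keep it readable:

```python
import numpy as np

# Made-up softmax outputs: one row per image, one column per class.
outputs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.4, 0.3],
    [0.2, 0.2, 0.6],
])
targets = np.array([0, 1, 2, 2])  # integer labels

predictions = outputs.argmax(axis=1)         # most probable class per image
accuracy = (predictions == targets).mean()   # fraction of correct predictions

print(predictions.tolist())  # [0, 1, 1, 2]
print(accuracy)              # 0.75
```

Three of the four predictions match their targets, so the accuracy is 75%.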

We usually keep an eye on the validation loss (or set early stopping mechanisms) to determine whether the model is overfitting.
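
The logic behind early stopping can be sketched in a few lines of plain Python. The function name `should_stop` and the `patience` value are choices made for this example; Keras provides a ready-made `tf.keras.callbacks.EarlyStopping` callback for real training runs:

```python
def should_stop(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Validation losses from the run above: they keep improving, so no stop.
print(should_stop([0.1548, 0.1106, 0.0879, 0.0838, 0.0585]))  # None
# Made-up losses that start rising after epoch 3.
print(should_stop([0.20, 0.15, 0.12, 0.13, 0.14, 0.16]))      # 5
```

A validation loss that rises while the training loss keeps falling is the typical sign of overfitting that this check is meant to catch.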

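The validation accuracy (`val_accuracy`) is the better indicator of the model's true accuracy, because it is measured on data that was not used to update the weights.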

A validation accuracy of about 97% (the result with a hidden layer size of 50) is already great, but can it get better if the hyperparameters are changed?

The hidden layer size can be changed from 50 to 100 and the model re-run.

With 100, the validation accuracy is 98.25% (the run shown above), which is noticeably better.

According to TensorFlow, models should be able to get to 99% and above in many circumstances.

## Test the model

The validation accuracy is still only measured on validation data. We may still have overfitted, because the hyperparameters were tuned against that same data.

We need to test it on the test dataset.

In [15]:
test_loss, test_accuracy = model.evaluate(test_data)



In [19]:
print('Test loss: {0:.2f}. Test accuracy: {1:.2f}%'.format(test_loss,test_accuracy*100))

Test loss: 0.08. Test accuracy: 97.35%


The model has a test accuracy of 97.35%.

After you test the model, you should no longer change it. If you keep tuning after seeing the test result, the test data stops being a dataset that the model has never seen.

The main point of the test dataset is to simulate model deployment. 

Getting a test accuracy very close to the validation accuracy is a sign that the model has not overfitted.



