# Deep Neural Network for MNIST Classification

We'll apply the knowledge from the lectures in this section to write a deep neural network. The problem we've chosen is referred to as the "Hello World" of deep learning because for most students it is the first deep learning algorithm they see. 

The dataset is called MNIST and refers to handwritten digit recognition. You can find more about it on Yann LeCun's website (Director of AI Research at Facebook). He is one of the pioneers of what we've been talking about and of more complex approaches that are widely used today, such as convolutional neural networks (CNNs). 

The dataset provides 70,000 images (28x28 pixels) of handwritten digits (1 digit per image).

The goal is to write an algorithm that detects which digit is written. Since there are only 10 digits (0, 1, 2, 3, 4, 5, 6, 7, 8, 9), this is a classification problem with 10 classes.

Our goal is to build a neural network with 2 hidden layers. 

## Import the relevant packages

In [1]:
import numpy as np
import tensorflow as tf

import tensorflow_datasets as tfds

## Data

In [2]:
mnist_dataset, mnist_info = tfds.load(name='mnist', with_info=True, as_supervised= True)
mnist_train, mnist_test = mnist_dataset['train'], mnist_dataset['test']
## MNIST in tfds has no validation split, so we create one manually: 10% of the training set
num_validation_samples = 0.1 * mnist_info.splits['train'].num_examples
## we could count the training samples ourselves, but mnist_info has the number readily available
## multiplying by 0.1 divides the number of samples by 10
num_validation_samples = tf.cast(num_validation_samples, tf.int64)
## tf.cast converts the result to an integer to prevent float issues
num_test_samples = mnist_info.splits['test'].num_examples
num_test_samples = tf.cast(num_test_samples, tf.int64)
## ideally we also scale the data so the results are more numerically stable (e.g. inputs between 0 and 1)

def scale(image,label):
    image = tf.cast(image,tf.float32)
    image /=255. ## . signifies we want output to be float
    return image, label

## scale is written for dataset.map, which applies a given function to every element of the dataset,
## so the function we pass in determines the transformation
## any other function works too, as long as it takes an image and a label and returns an image and a label

scaled_train_and_validation_data = mnist_train.map(scale)
test_data = mnist_test.map(scale)
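
To see what the scaling step does to a single image, here is a minimal NumPy sketch. It is not part of the notebook's pipeline: `scale_np` and the randomly generated `image` are hypothetical stand-ins for `scale` and a real MNIST image.

```python
import numpy as np

# Hypothetical stand-in for one MNIST image: 28 x 28 x 1, integer pixels in [0, 255]
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(28, 28, 1), dtype=np.uint8)
label = 7  # arbitrary label, for illustration only

def scale_np(image, label):
    """NumPy analogue of scale: cast to float32, then divide by 255."""
    image = image.astype(np.float32)
    image /= 255.0
    return image, label

scaled_image, scaled_label = scale_np(image, label)
# The dtype is now float32 and every pixel lies between 0 and 1
print(scaled_image.dtype, scaled_image.min(), scaled_image.max())
```

The label passes through unchanged, which is what lets the same function signature work with `dataset.map` on `(image, label)` pairs.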

BUFFER_SIZE = 10000
## shuffle the data when preprocessing: same information, different order
## this avoids ordered data, e.g. all 0's in one part of the dataset and all 1's in another
## (or, in another dataset, high prices first and low prices after)
## batching only works as intended on shuffled data:
## if one batch held only 0's and the next only 1's, the loss would differ greatly between batches
shuffled_train_and_validation_data = scaled_train_and_validation_data.shuffle(BUFFER_SIZE)
validation_data = shuffled_train_and_validation_data.take(num_validation_samples)
## take extracts the first num_validation_samples elements (6,000 here, 10% of the 60,000 training images)
## BUFFER_SIZE = 10000 means shuffle holds 10,000 observations at a time;
## the buffer can't be too big, or the computer's memory won't handle it
## if buffer size = 1, no shuffling happens
## if buffer size >= num_samples, all the data is shuffled at once (uniformly)
## if buffer size is between 1 and the total sample size, we trade some shuffling quality for lower memory use
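
The effect of the buffer size can be illustrated with a small pure-Python sketch. This is a simplified model of a fixed-size shuffle buffer, written for illustration; `buffered_shuffle` is a hypothetical helper, not the tf.data implementation.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in an order produced by a fixed-size shuffle buffer."""
    rng = random.Random(seed)
    iterator = iter(items)
    buffer = []
    # Fill the buffer with the first buffer_size elements
    for item in iterator:
        buffer.append(item)
        if len(buffer) == buffer_size:
            break
    # Emit a random buffered element and refill its slot from the stream
    for item in iterator:
        index = rng.randrange(len(buffer))
        yield buffer[index]
        buffer[index] = item
    # Stream exhausted: drain what is left in random order
    rng.shuffle(buffer)
    yield from buffer

data = list(range(10))
print(list(buffered_shuffle(data, buffer_size=1)))   # original order, no shuffling
print(list(buffered_shuffle(data, buffer_size=3)))   # only local reordering
print(list(buffered_shuffle(data, buffer_size=10)))  # whole list shuffled at once
```

With a buffer of 3, an element can only move a few positions from where it started, which is why a small buffer on ordered data leaves the data mostly ordered.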

train_data = shuffled_train_and_validation_data.skip(num_validation_samples)
## skip returns all samples except the first num_validation_samples (the 10% used for validation)

## set the batch size for batching
## batch size = 1 is stochastic gradient descent (SGD)
## batch size = number of samples is (single-batch) gradient descent, as seen previously
## we need a number that is reasonably small but large enough to preserve the underlying dependencies: mini-batch GD
BATCH_SIZE = 100

## batch method combines consecutive elements of dataset into batches
train_data = train_data.batch(BATCH_SIZE)
## this tells the model how many samples to take in each batch
## we overwrite train_data because from here on we want the batched version
## the validation dataset doesn't need mini-batches:
## batching matters in training because the weights are updated once per batch (100 samples),
## and averaging over a batch reduces the noise in each training update
## when validating we only forward propagate, once, and update nothing
## batching gives an average loss per batch, but when validating or testing we want the exact loss,
## so we take all the data at once
## calculating the exact values is not expensive, since forward propagation alone needs little computation

validation_data = validation_data.batch(num_validation_samples)
## one batch 
test_data = test_data.batch(num_test_samples)
## the model expects the validation dataset to be batched, so it is overwritten; the same goes for the test data
## validation data must have the same shape and object properties as the train and test data:
## the MNIST data is iterable and in 2-tuple (input, target) format, because as_supervised is set to True

validation_inputs,validation_targets = next(iter(validation_data))
## iter makes the dataset iterable, one element (batch) at a time, like a list; next loads the next batch
## since there is only one batch, this loads all validation inputs and targets
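
The sizes that result from this preprocessing can be checked with a few lines of arithmetic. The sketch below assumes the 60,000 examples of the MNIST `train` split; the variable names `NUM_TRAIN_EXAMPLES`, `VALIDATION_FRACTION`, `num_train_samples` and `steps_per_epoch` are introduced here for illustration.

```python
NUM_TRAIN_EXAMPLES = 60_000   # size of the MNIST 'train' split
VALIDATION_FRACTION = 0.1     # 10% of the training data is held out
BATCH_SIZE = 100

num_validation_samples = int(VALIDATION_FRACTION * NUM_TRAIN_EXAMPLES)
num_train_samples = NUM_TRAIN_EXAMPLES - num_validation_samples
steps_per_epoch = num_train_samples // BATCH_SIZE

print(num_validation_samples)  # 6000 samples for validation (one batch)
print(num_train_samples)       # 54000 samples left for training
print(steps_per_epoch)         # 540 batches, so 540 weight updates per epoch
```

The last number matches the `540/540` step count printed in the training log.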

## Model


In [3]:
## with the data loaded and preprocessed, we can outline the model

### Outline the model

In [4]:
## the width and depth of the model are hyperparameters; we can fiddle with them for an improved result,
## as 50 nodes for each hidden layer is suboptimal (600 are used below)
## 784 inputs (28 x 28 pixels) for the input layer, and 10 outputs (one per digit)

In [5]:
input_size = 784
output_size = 10
hidden_layer_size = 600
early_stopping = tf.keras.callbacks.EarlyStopping(patience=2)
## stops training once the monitored quantity (val_loss by default) has not improved for 2 epochs
## we assume all hidden layers are the same size, but layers of different sizes could suit the problem better
model = tf.keras.Sequential([tf.keras.layers.Flatten(input_shape = (28,28,1)),
                            tf.keras.layers.Dense(hidden_layer_size, activation = 'relu'),
                            tf.keras.layers.Dense(hidden_layer_size, activation = 'relu'),
                            tf.keras.layers.Dense(output_size, activation = 'softmax')
                            ## Dense builds one layer of the model and takes that layer's output size as its argument
                            ## it calculates the dot product of the inputs and the weights, adds the bias,
                            ## and can apply an activation function too
                            ## the hidden layers use relu, a choice specific to this problem:
                            ## each real-life neural network has a different optimal combination of activation functions
                            ## when creating a classifier, the output activation must turn the outputs into probabilities,
                            ## hence softmax in the last layer
                             
])  
## each input from tfds is 28 x 28 x 1, i.e. a tensor of rank 3
## Flatten turns each such tensor (image) into a vector; input_shape is the shape of what we want to flatten
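
To make the shapes concrete, the following NumPy sketch pushes a small batch through the same architecture: flatten, two relu layers of 600 units, and a softmax output of 10 units. It is an illustration only; the random weights, the `dense_params` helper and the batch of 4 random images are assumptions, not what Keras initializes or trains.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_layer_size, output_size = 784, 600, 10

def dense_params(n_in, n_out):
    """Random weights and zero biases for one dense layer (illustrative initialization)."""
    return rng.normal(scale=0.05, size=(n_in, n_out)), np.zeros(n_out)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    # Subtract the row maximum first for numerical stability
    exp = np.exp(x - x.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

layers = [dense_params(input_size, hidden_layer_size),
          dense_params(hidden_layer_size, hidden_layer_size),
          dense_params(hidden_layer_size, output_size)]

images = rng.random(size=(4, 28, 28, 1))   # hypothetical batch of 4 scaled images
x = images.reshape(len(images), -1)        # Flatten: (4, 28, 28, 1) -> (4, 784)
for i, (w, b) in enumerate(layers):
    z = x @ w + b                          # dot product of inputs and weights, plus bias
    x = softmax(z) if i == len(layers) - 1 else relu(z)

num_params = sum(w.size + b.size for w, b in layers)
print(x.shape)         # (4, 10): one probability per class for each image
print(x.sum(axis=1))   # each row sums to 1
print(num_params)      # 837610 weights and biases in total
```

The parameter count follows from the layer sizes: 784 x 600 + 600, then 600 x 600 + 600, then 600 x 10 + 10.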

## Choose optimizer and loss function 

In [6]:
model.compile(optimizer = 'adam',loss = 'sparse_categorical_crossentropy',metrics = ['accuracy']) 
## we use the adam optimizer and need a loss function relevant to classifiers
## TensorFlow has three cross-entropy loss functions (cross-entropy suits categorical targets best):
## binary_crossentropy is for binary data, categorical_crossentropy presumes the targets are already one-hot encoded,
## while sparse_categorical_crossentropy takes integer labels and applies the one-hot encoding itself
## the output shape needs to match the target shape, which is why one-hot encoding is needed
## metrics lists what else we want calculated; for us that is accuracy
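
The relationship between the sparse and the one-hot version of the loss can be checked numerically. The NumPy functions below are minimal re-implementations written for this illustration (they are not the Keras functions), and the probabilities and targets are made up.

```python
import numpy as np

# Hypothetical softmax outputs for 3 samples and 4 classes (each row sums to 1)
probabilities = np.array([[0.7, 0.1, 0.1, 0.1],
                          [0.2, 0.5, 0.2, 0.1],
                          [0.1, 0.1, 0.2, 0.6]])
integer_targets = np.array([0, 1, 3])

def sparse_categorical_crossentropy(targets, probs):
    """Mean negative log-probability of the correct class, from integer labels."""
    return -np.log(probs[np.arange(len(targets)), targets]).mean()

def categorical_crossentropy(one_hot_targets, probs):
    """The same loss, computed from one-hot encoded targets."""
    return -(one_hot_targets * np.log(probs)).sum(axis=1).mean()

# One-hot encode the integer labels: row i of the identity matrix encodes class i
one_hot_targets = np.eye(probabilities.shape[1])[integer_targets]

sparse_loss = sparse_categorical_crossentropy(integer_targets, probabilities)
dense_loss = categorical_crossentropy(one_hot_targets, probabilities)
print(sparse_loss, dense_loss)  # the two values agree
```

Because tfds delivers the MNIST labels as integers, the sparse version saves the explicit one-hot encoding step.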

## Training 

In [7]:
NUM_EPOCHS = 5
model.fit(train_data,epochs = NUM_EPOCHS, validation_data = (validation_inputs,validation_targets),callbacks = [early_stopping],
          verbose = 2)
          ## the number of epochs is parametrized so the value can be amended if needed
          ## it helps to create variables for hyperparameters such as the number of epochs and the output size,
          ## since they are then easy to spot when debugging etc.
          ## model.fit(train_data, epochs = NUM_EPOCHS) alone is enough to train the model, but we want to validate too,
          ## so we also pass validation_data
          ## verbose = 2 prints only the most important information: one line per epoch

Epoch 1/5
540/540 - 5s - loss: 0.2094 - accuracy: 0.9364 - val_loss: 0.1040 - val_accuracy: 0.9682 - 5s/epoch - 10ms/step
Epoch 2/5
540/540 - 4s - loss: 0.0798 - accuracy: 0.9750 - val_loss: 0.0755 - val_accuracy: 0.9760 - 4s/epoch - 8ms/step
Epoch 3/5
540/540 - 4s - loss: 0.0545 - accuracy: 0.9824 - val_loss: 0.0500 - val_accuracy: 0.9837 - 4s/epoch - 8ms/step
Epoch 4/5
540/540 - 4s - loss: 0.0376 - accuracy: 0.9879 - val_loss: 0.0390 - val_accuracy: 0.9882 - 4s/epoch - 7ms/step
Epoch 5/5
540/540 - 4s - loss: 0.0331 - accuracy: 0.9894 - val_loss: 0.0325 - val_accuracy: 0.9895 - 4s/epoch - 8ms/step


<keras.callbacks.History at 0x1ed09574e20>

In [8]:
## what happens inside each epoch:
## 1. at the beginning of each epoch the training loss is set to 0
## 2. the algorithm iterates over a preset number of batches, all from train_data
## 3. the weights and biases are updated as many times as there are batches
## 4. we get a value for the loss function, indicating how training is going
## 5. we see the training accuracy
## 6. at the end of the epoch, the algorithm forward propagates the validation set, in a single batch,
## through the optimized model and calculates the validation accuracy
## when the maximum number of epochs is reached, training is over
## my observation: stochastic gradient descent (batch size 1) is very slow and less accurate
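
The steps listed above can be sketched as an explicit loop. This is a toy illustration, not what `model.fit` does internally: it trains a single softmax layer with plain mini-batch gradient descent (not adam) on synthetic 2-D data, and all names and values in it are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumption): 3 classes of 2-D points scattered around 3 centers
centers = np.array([[0.0, 3.0], [-3.0, -2.0], [3.0, -2.0]])
targets = rng.integers(0, 3, size=600)
inputs = centers[targets] + rng.normal(size=(600, 2))
train_x, train_y = inputs[:500], targets[:500]
val_x, val_y = inputs[500:], targets[500:]

weights, biases = np.zeros((2, 3)), np.zeros(3)
BATCH_SIZE, NUM_EPOCHS, LEARNING_RATE = 100, 5, 0.1

def forward(x):
    """Softmax probabilities for a batch of inputs."""
    logits = x @ weights + biases
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

def loss_and_accuracy(probs, y):
    rows = np.arange(len(y))
    return -np.log(probs[rows, y]).mean(), (probs.argmax(axis=1) == y).mean()

history = []
num_batches = len(train_x) // BATCH_SIZE
for epoch in range(NUM_EPOCHS):
    epoch_loss, epoch_accuracy = 0.0, 0.0                  # step 1: reset at epoch start
    for b in range(num_batches):                           # step 2: iterate over the batches
        batch = slice(b * BATCH_SIZE, (b + 1) * BATCH_SIZE)
        xb, yb = train_x[batch], train_y[batch]
        probs = forward(xb)
        batch_loss, batch_accuracy = loss_and_accuracy(probs, yb)
        epoch_loss += batch_loss / num_batches             # step 4: average loss over batches
        epoch_accuracy += batch_accuracy / num_batches     # step 5: average accuracy over batches
        grad = probs.copy()                                # gradient of the loss w.r.t. the logits
        grad[np.arange(len(yb)), yb] -= 1.0
        grad /= len(yb)
        weights -= LEARNING_RATE * (xb.T @ grad)           # step 3: one update per batch
        biases -= LEARNING_RATE * grad.sum(axis=0)
    # step 6: forward propagate the whole validation set in a single batch
    val_loss, val_accuracy = loss_and_accuracy(forward(val_x), val_y)
    history.append((epoch_loss, epoch_accuracy, val_loss, val_accuracy))
    print(f'Epoch {epoch + 1}/{NUM_EPOCHS} - loss: {epoch_loss:.4f} - accuracy: {epoch_accuracy:.4f}'
          f' - val_loss: {val_loss:.4f} - val_accuracy: {val_accuracy:.4f}')
```

Note that the training loss and accuracy are averages over batches computed while the weights are still changing, whereas the validation numbers come from one pass with the weights as they stand at the end of the epoch.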

In [9]:
## loss here is the training loss; it changes little from epoch to epoch after the first one,
## because by the end of the first epoch the weights and biases have already been updated 540 times (once per batch)
## accuracy shows what percentage of the outputs equal the targets
## validation loss and accuracy are checks: keep an eye on the validation loss to see whether the model is overfitting
## validation accuracy is the true accuracy of the model, since training accuracy is the average accuracy across batches,
## while validation accuracy is calculated for the whole validation dataset
## about 97% for us the first time
## we can change hyperparameters such as the number of hidden layer nodes (hidden layer size)
## increasing the hidden layer size increased the accuracy: 98.35% now
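
Watching the validation loss is also what the `early_stopping` callback passed to `model.fit` automates. The sketch below is a simplified model of a patience-based stopping rule, written for illustration and not taken from the Keras source; `epochs_run` is a hypothetical helper and the second loss sequence is made up.

```python
def epochs_run(val_losses, patience):
    """Return how many epochs run before a patience-based rule stops training."""
    best = float('inf')
    epochs_without_improvement = 0
    for epoch, val_loss in enumerate(val_losses, start=1):
        if val_loss < best:
            best = val_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch  # stop: no improvement for `patience` epochs
    return len(val_losses)

# Validation losses from the training log above: always improving, so all 5 epochs run
logged = [0.1040, 0.0755, 0.0500, 0.0390, 0.0325]
print(epochs_run(logged, patience=2))        # 5

# Hypothetical run in which the validation loss stalls after epoch 2
stalled = [0.10, 0.08, 0.09, 0.085, 0.07]
print(epochs_run(stalled, patience=2))       # 4
```

In the stalled run the rule stops after epoch 4 and never sees the improvement at epoch 5, which is the trade-off a small patience value makes.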

## Test the model

In [10]:
## the final accuracy comes from forward propagating the test data; the reason we need a separate test set is overfitting

In [11]:
## fine-tuning the hyperparameters can overfit the model to the validation set

In [12]:
test_loss, test_accuracy = model.evaluate(test_data)
## evaluate forward propagates the test data and returns the loss and metric values (accuracy for us) for the model in test mode
## we store them in two separate variables to be clear



In [13]:
print('Test loss: {0:.2}. Test accuracy: {1:.2f}%'.format(test_loss,test_accuracy * 100))

Test loss: 0.067. Test accuracy: 97.97%
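
The two format specifications in the print statement behave differently, which is why the loss shows three decimals and the accuracy two. A short sketch, using illustrative values close to the logged result rather than the exact ones:

```python
test_loss, test_accuracy = 0.0671, 0.9797  # illustrative values, not the exact logged ones

loss_two_significant = '{0:.2}'.format(test_loss)            # '.2' keeps two significant digits
loss_two_decimals = '{0:.2f}'.format(test_loss)              # '.2f' keeps two decimal places
accuracy_percent = '{0:.2f}%'.format(test_accuracy * 100)    # accuracy as a percentage

print(loss_two_significant)  # 0.067
print(loss_two_decimals)     # 0.07
print(accuracy_percent)      # 97.97%
```

For a small loss, two significant digits keep more information than two decimal places would.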


In [14]:
## we shouldn't test more than once: the first time, the model has never seen the test data,
## but if we keep adjusting the model after seeing the test result, we could overfit to the test set too
## the main point of the test dataset is to simulate model deployment
## a test accuracy well below the validation accuracy means the model has overfit,
## while a test accuracy close to the validation accuracy shows we have not
## it is the accuracy we would expect if we put the model to use in real life