# Halter Technical Test

This is the Google Colab notebook containing the code to complete the Halter Technical Test. Please refer to the README.md provided in the email for instructions and answers to the questions.

You may wish to use 'Runtime > Run all' if you do not want to run the code cells one at a time. It is also recommended that you use a Google Colab GPU, as it will significantly reduce the model training time.

The README.md text is also appended at the end of this notebook.

# Imports and Configs

In [None]:
# Imports
import os
import time
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Conv2D, MaxPool2D, Flatten
from tensorflow.keras.callbacks import TensorBoard
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from PIL import Image

# Initialise
IMG_SIZE = 28

# Load dataset
mnist = tf.keras.datasets.mnist
(images_train, labels_train), (images_test, labels_test) = mnist.load_data()

# Normalise the dataset and reshape it for input to the Convolutional Neural Network
images_train = tf.keras.utils.normalize(images_train, axis=1)
images_test = tf.keras.utils.normalize(images_test, axis=1)
images_train = images_train.reshape(len(images_train), IMG_SIZE, IMG_SIZE, 1)
images_test = images_test.reshape(len(images_test), IMG_SIZE, IMG_SIZE, 1)

# Preprocessing
Perform data augmentation on the training dataset to minimise bias and the potential for overfitting when fitting the sequential model.

In [None]:
# Data Augmentation
datagen = ImageDataGenerator(
        rotation_range=25,
        width_shift_range=0.1,
        height_shift_range=0.1,
        shear_range=0.2,
        zoom_range=0.2)

# Create an iterator object to pass into model.fit()
train_generator = datagen.flow(images_train, labels_train)

# Model Optimisation

This is a remnant of some code that I used to optimise my Machine Learning Model.
I don't recommend that you run this code, as I have already shown and chosen my final model below.

In [None]:
# Set this to false by default as you will probably not want to train and test over 100
# Convolutional Neural Networks
run_hyper_param_opt = False #@param {type:"boolean"}

## Hyperparameter Optimisation
I have performed a coarse Hyperparameter Optimisation below.

In [None]:
# Iteratively create a new model
opt = ['sgd', 'rmsprop', 'adam']
loss = ['categorical_crossentropy', 'poisson']
conv_layers = [1, 2, 3]
dense_layers = [1, 2, 3, 4, 5]
layer_sizes = [32, 64, 128, 256]
dropout = [0, 0.2, 0.4, 0.8]
epochs = [10, 12, 15, 18, 20]

# Hyperparameter Optimisation
if run_hyper_param_opt:
  for dense_layer in dense_layers:
      for layer_size in layer_sizes:
          for drop in dropout:
              for epoch in epochs:
                # Initialise naming and callback for Tensorboard analysis later
                name = "dropout_{}_nodes_{}_dense_{}_epochs_{}_time_{}".format(drop, layer_size, dense_layer, epoch, int(time.time()))
                tensorboard = TensorBoard(log_dir='logs/{}'.format(name))
                print(name)

                model = Sequential()
                # Convolutional/Input Layer 
                model.add(Conv2D(25, kernel_size=(3,3), strides=(1,1), padding='valid', activation='relu', input_shape=(IMG_SIZE,IMG_SIZE,1)))
                model.add(MaxPool2D(pool_size=(1,1)))
                # flatten output of conv (input) layer 
                model.add(Flatten())

                # Variable number of Dense Layers
                for i in range(dense_layer):
                    model.add(Dense(layer_size, activation='relu'))  
                    model.add(Dropout(drop))

                # Final dense layer and output layer (10 classes, one per MNIST digit)
                model.add(Dense(layer_size, activation='relu'))
                model.add(Dense(10, activation='softmax'))

                # Compile the sequential model; the labels are integers, so use the sparse loss
                model.compile(loss='sparse_categorical_crossentropy', metrics=['accuracy'], optimizer='adam')
                model.fit(images_train, labels_train, epochs=epoch, batch_size=20, validation_data=(images_test, labels_test), callbacks=[tensorboard])

## Analysing results with TensorBoard
Ensure that the `logs` folder has been uploaded into the files workspace on the left in order to view the TensorBoard with the results.

In [None]:
%load_ext tensorboard
# !rm -rf ./logs/  # Uncomment only to clear old logs; this deletes the uploaded results
%tensorboard --logdir logs

# The Selected Model
The model below is the selected model, using the best parameters found in the hyperparameter optimisation. The optimisation was done in conjunction with TensorBoard analysis.

In [None]:
model = tf.keras.models.Sequential()
model.add(Conv2D(25,
                 kernel_size=(3,3),
                 strides=(1,1),
                 padding='valid',
                 activation='relu',
                 input_shape=(IMG_SIZE,IMG_SIZE,1)))
model.add(MaxPool2D(pool_size=(1,1)))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(Dropout(0.2))
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(Dropout(0.2))
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.Dense(10, activation='softmax'))
model.compile(loss='sparse_categorical_crossentropy', metrics=['accuracy'], optimizer='adam')
history = model.fit(train_generator,
                    steps_per_epoch=(len(images_train))//32,
                    epochs=10
                    )

model.summary()

## Finding the most ambiguous images
Create a list of the images with the lowest prediction confidence percentages. These will be the images that the model finds the most 'ambiguous'. The prediction, position, and the confidence in the prediction for the top 10 images will be stored.

In [None]:
# Make model predictions for each category for each image in the testing dataset
predictions = model.predict(images_test)

# Generate a list of the 10 images with the lowest confidence predictions
num = 10
lowest_confidence = np.ones(num)
lowest_conf_pos = np.zeros(num)
lowest_conf_predict = np.zeros(num)

# Iterate through the testing dataset, keeping the `num` lowest confidences seen so far
for counter, prediction in enumerate(predictions):
    # Replace the largest stored confidence whenever a lower one is found
    if np.max(prediction) < np.max(lowest_confidence):
        max_pos = lowest_confidence.argmax()
        lowest_conf_pos[max_pos] = counter
        lowest_confidence[max_pos] = np.max(prediction)
        lowest_conf_predict[max_pos] = np.argmax(prediction)
lowest_conf_pos = [int(i) for i in lowest_conf_pos]
lowest_conf_predict = [int(i) for i in lowest_conf_predict]

print('lowest_confidence:')
print(lowest_confidence)
print('lowest_conf_pos')
print(lowest_conf_pos)
print('lowest_conf_predict')
print(lowest_conf_predict)

## Output Images
Run the code below to save and store the output images, and have a preview of the images.

Note that due to the stochastic nature of the convolutional neural network, the output images may differ between runs.

In [None]:
# Create subdirectory to store output files
if not os.path.exists('outputs'):
    os.makedirs('outputs')

# Save each image into the subdirectory
fig = plt.figure()
for i in range(num):
    # Subplot 10 most ambiguous images
    ax = fig.add_subplot(2, 5, i+1)
    ax.set_title('Pred: %.0f' % lowest_conf_predict[i])
    ax.imshow(images_test[lowest_conf_pos[i]].reshape(IMG_SIZE, IMG_SIZE))
    name = "outputs/Low_confidence_%s_(%.5s%%).png" % (int(lowest_conf_predict[i]), lowest_confidence[i]*100)
    img = Image.fromarray(np.uint8(images_test[int(lowest_conf_pos[i])].reshape(IMG_SIZE, IMG_SIZE) * 255) , 'L')
    img.save(name)  

## Thank you for looking through my technical test submission!

# Halter Tech Challenge - Jae Min Seo (README.md)

## Instructions
The linked Google Colab notebook (https://colab.research.google.com/drive/1SNWyp18lqTr6qTgBSI6U427-aKwSnkwc?usp=sharing) has instructions on how to run the model above each code cell. As the code is run remotely on Colab, all dependencies will already be in place. How convenient!

## Question One
### Describe the model you have created. Explain the design decisions you have made in constructing the model.

As the MNIST dataset is well known for being relatively simple, I did not try to create an overly complex model. I performed a fairly coarse hyperparameter optimisation on the CNN, aiming to maximise the validation accuracy and minimise the loss at the final epoch of each configuration. I analysed the steady-state accuracies and losses in TensorBoard, and stepped each parameter in the direction that reduced the loss (e.g. if a 5% dropout rate gave a lower loss than 0%, I then tested 10%). The optimizer, activation, dropout rate, number of dense layers, and number of nodes per layer were all varied during the optimisation.
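
The coarse search described above can be sketched as a plain grid enumeration. This is a minimal illustration rather than the exact notebook code: the `grid` dictionary and the `iter_configs` helper are hypothetical names, and the values are copied from the lists in the notebook.

```python
from itertools import product

# Hypothetical coarse grid; the values mirror the lists used in the notebook.
grid = {
    "dense_layers": [1, 2, 3, 4, 5],
    "layer_size": [32, 64, 128, 256],
    "dropout": [0, 0.2, 0.4, 0.8],
    "epochs": [10, 12, 15, 18, 20],
}

def iter_configs(grid):
    """Yield one dict per combination of hyperparameter values."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))

configs = list(iter_configs(grid))
print(len(configs))  # 5 * 4 * 4 * 5 = 400 configurations
```

Each dictionary would then be used to build, train, and log one model. Counting the combinations up front gives a rough idea of how long a full sweep will take.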

Ultimately, I settled on a basic convolutional neural network with three dense layers and a 20% dropout rate between them. The dropout helps to prevent significant overfitting and minimise bias. Whilst the MNIST dataset is relatively noise-free, it was still important to have a robust model when finding the images in the test dataset that deviate the most from the training dataset. To further prevent overfitting and improve the overall robustness of the model, I applied data augmentation to the training data, though only a modest amount, as the training dataset already contains a large amount of quality data.

After the model was trained, I iterated through every image in the test dataset, collated the images with the lowest prediction confidence, and saved them in a subdirectory.
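
As a rough sketch of that selection step, the lowest-confidence images can also be picked out with a sort over the top class probability. The helper name `lowest_confidence_indices` and the `demo` array are made up for illustration. In the notebook, the array passed in would be the output of `model.predict(images_test)`.

```python
import numpy as np

def lowest_confidence_indices(predictions, num=10):
    """Return positions, confidences and predicted classes of the `num`
    rows with the smallest top-class probability."""
    confidence = predictions.max(axis=1)       # top softmax probability per image
    order = np.argsort(confidence)[:num]       # ascending: least confident first
    return order, confidence[order], predictions[order].argmax(axis=1)

# Made-up softmax outputs for four images and three classes.
demo = np.array([
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.10, 0.80, 0.10],
    [0.34, 0.33, 0.33],
])
idx, conf, pred = lowest_confidence_indices(demo, num=2)
print(idx, conf, pred)
```

Unlike a running top-10 list, this returns the images already ordered from least to most confident.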

## Question Two
### Analyse your results: do the 10 images you have found align with your expectations? Why do you think this is? State how you have quantified “most ambiguous”.

Due to the stochastic nature of CNN fitting, I did not get the same set of ambiguous images from the MNIST test dataset on every run. However, some images clearly recurred across runs more often than others.
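
One way to make "repeated more than others" concrete is to count how many runs each test image index appears in. The snippet below is only a sketch: `recurring_indices` is a hypothetical helper, and the `runs` lists are made-up indices rather than results from the notebook.

```python
from collections import Counter

def recurring_indices(runs, min_count=2):
    """Return the image indices that appear in at least `min_count` runs."""
    # set(run) ensures an index is counted at most once per run
    counts = Counter(index for run in runs for index in set(run))
    return sorted(index for index, count in counts.items() if count >= min_count)

# Made-up low-confidence indices from three separate training runs.
runs = [[3, 17, 42], [17, 42, 99], [42, 5, 17]]
stable = recurring_indices(runs)
print(stable)
```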

I quantified ‘ambiguity’ as how low the model’s prediction confidence (its highest class probability) is when classifying an image. I found that this was a fairly good measure of how hard an image is to classify: even after a brief visual look through the output files, I could understand how a machine could misinterpret each image. However, my model still attempted to classify the ambiguous images, and it appeared to be correct around half of the time.

Common trends among images with low prediction scores were strokes that appeared to touch the edge of the image, or strokes that were lifted and continued on another part of the image. Other cases included angled or circular strokes that were too small relative to the straight strokes, creating a somewhat distorted image. One reason for the misclassifications could be class noise: after a rough visual inspection of the MNIST dataset, I definitely saw images that could plausibly have been either of two numbers. However, some images were clearly misclassified, and this is due more to my own model than to the data.


## Question Three
### Let’s say your only goal is to train the best handwritten digit classifier on earth, using this dataset. What do you think the impact is of these ‘hard to classify’ images on the performance? How could you use your ‘hard to classify’ image detector to improve this?

Noisy data is known to add needless complexity to a machine learning model, whilst also increasing the training time. This ultimately degrades the performance of machine learning algorithms. The result could be decreased validation/classification accuracy, as well as poor predictions.

This ‘hard to classify’ image detector could be used to systematically remove data that is noisy or has an adverse effect on the machine learning model. With a cleaner training dataset, a model can be fitted faster (e.g. by removing excess connections between dense layers and reducing the need for dropout), its accuracy can increase, and its overall complexity and training time can be reduced.
A limitation is that my model has itself already been trained on some ambiguous images from the raw MNIST dataset, which can cause biasing issues. This bias could be seen on certain occasions in my model, as it gave relatively high confidence predictions for some images that visually appeared quite ambiguous. In such a case, an initial supervised model, or a model trained on a relatively noise-free dataset, would be recommended.
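
A minimal sketch of such a filter is shown below, assuming the detector supplies one softmax vector per training image. The `keep_confident` helper and the `threshold` value of 0.5 are illustrative assumptions, not something tuned or tested in the notebook, and the arrays are made-up data.

```python
import numpy as np

def keep_confident(images, labels, predictions, threshold=0.5):
    """Drop samples whose top-class probability is below `threshold`."""
    confidence = predictions.max(axis=1)
    mask = confidence >= threshold
    return images[mask], labels[mask], mask

# Made-up data: four 2x2 "images", their labels, and softmax outputs.
images = np.arange(16, dtype=float).reshape(4, 2, 2)
labels = np.array([0, 1, 2, 1])
predictions = np.array([
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.10, 0.10, 0.80],
    [0.34, 0.33, 0.33],
])
kept_images, kept_labels, mask = keep_confident(images, labels, predictions)
print(kept_labels)  # labels of the samples that survive the filter
```

The returned mask makes it easy to inspect the removed images by eye before discarding them, which matters given the bias limitation noted above.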