## AI Skill Academy
## Deep Learning
### Classifying images from Fashion MNIST with Keras

Pierre Feillet

## Part 1 - Team Member 2 (Convolutional NN) - Directions

Classify images using a convolutional neural network (CNN) architecture. 

Before the implementation of each task, we add a hypothesis Markdown cell in the Jupyter Notebook documenting the expected result for that task.

Fixed parameters
For consistency, the following parameters are fixed for Tasks 1-3:
   - Sigmoid transfer function for hidden layers
   - Softmax activation function for the output layer
   - Categorical cross entropy as loss function
   - Stochastic Gradient Descent (SGD) as the optimization algorithm
 
For all Part 2 tasks add a dense layer on the top to connect to the 10 possible output values. 
For Task  1 & 2, set training epochs to 5 and a fixed batch size of 128.
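
The softmax output and categorical cross-entropy loss named in the fixed parameters can be illustrated with a small NumPy sketch. The helper names `softmax` and `categorical_cross_entropy` and the toy values are illustrative assumptions, not the Keras implementation.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 along the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

def categorical_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean cross-entropy between one-hot targets and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=-1)))

logits = np.array([[2.0, 1.0, 0.1]])   # toy scores for 3 classes (illustrative)
y_true = np.array([[1.0, 0.0, 0.0]])   # the true class is class 0
probs = softmax(logits)
loss = categorical_cross_entropy(y_true, probs)
print(probs, loss)
```

The loss shrinks toward 0 as the probability assigned to the true class approaches 1.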

In this notebook we train a simple Convolutional Neural Network (CNN) with Keras on the Fashion MNIST dataset to classify images of fashion items into categories. Like the MNIST digit dataset, the Fashion MNIST dataset includes:

    60,000 training examples
    10,000 testing examples
    10 classes
    28×28 grayscale/single channel images

The ten fashion class labels include:

    T-shirt/top
    Trouser/pants
    Pullover shirt
    Dress
    Coat
    Sandal
    Shirt
    Sneaker
    Bag
    Ankle boot
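
The dataset stores each label as an integer from 0 to 9. A minimal lookup sketch, assuming the labels follow the order of the list above (the `CLASS_NAMES` list and the `label_name` helper are illustrative names, not part of the notebook):

```python
# Class names in label order (index 0..9), copied from the list above.
CLASS_NAMES = [
    "T-shirt/top", "Trouser/pants", "Pullover shirt", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot",
]

def label_name(label):
    """Return the readable class name for an integer label (hypothetical helper)."""
    return CLASS_NAMES[int(label)]

print(label_name(9))  # Ankle boot
```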

In [2]:
import numpy as np
import time

from tensorflow import keras
from tensorflow.keras import layers

from keras.datasets import fashion_mnist

import tensorflow as tf
print("TensorFlow version: " + tf.__version__)

TensorFlow version: 1.14.0


Using TensorFlow backend.


Setting a deterministic random seed

In [3]:
import random as python_random
import os

os.environ["PYTHONHASHSEED"]="0"
np.random.seed(123)
python_random.seed(123)
tf.random.set_random_seed(123)

In [4]:
# Model / data parameters
num_classes = 10
input_shape = (28, 28, 1)

# the data, split between train and test sets
(x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()

# Scale images to the [0, 1] range for normalization
x_train = x_train.astype("float32") / 255
x_test = x_test.astype("float32") / 255
# Make sure images have shape (28, 28, 1)
x_train = np.expand_dims(x_train, -1)
x_test = np.expand_dims(x_test, -1)
print("x_train shape:", x_train.shape)
print(x_train.shape[0], "train samples")
print(x_test.shape[0], "test samples")

# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
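
The preprocessing steps of this cell (scaling to [0, 1], adding a channel axis, one-hot encoding the labels) can be sketched with NumPy alone. The `one_hot` helper is a hypothetical stand-in for `keras.utils.to_categorical`, and the tiny array replaces the real images for illustration.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Hypothetical stand-in for keras.utils.to_categorical on integer labels."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.shape[0], num_classes), dtype="float32")
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded

# Toy uint8 "image" standing in for the real data (assumption for illustration).
images = np.array([[[0, 128], [255, 64]]], dtype="uint8")   # shape (1, 2, 2)
scaled = images.astype("float32") / 255                     # values in [0, 1]
scaled = np.expand_dims(scaled, -1)                         # shape (1, 2, 2, 1)

print(scaled.shape)
print(one_hot([0, 9, 3], 10))
```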


### Task 1

Implement a simple CNN to classify fashion images with the following fixed parameters to start with. 
1.     Sigmoid transfer function for hidden layer
2.     Softmax activation function for the output layer
3.     Categorical cross entropy as loss function
4.     Stochastic Gradient Descent (SGD) as the optimization algorithm
5.     Number of layers = 2 (one hidden convolutional layer and one output layer)
6.     Number of nodes per layer = 8
7.     Batch size = 128, epochs = 5



### Hypothesis
Such a simple model is likely to have high bias, so we expect low to moderate accuracy on the test set.
A simpler model should also train quickly. We will measure both the accuracy and the training time.

In [5]:
model = keras.Sequential(
    [
        #layers.Flatten(input_shape=(28, 28, 1)),
        keras.Input(shape=input_shape),
        layers.Conv2D(32, kernel_size=(5, 5), activation="sigmoid"),
        layers.Flatten(),
        layers.Dense(10, activation="softmax"),
    ]
)

model.summary()

Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 24, 24, 32)        832       
_________________________________________________________________
flatten (Flatten)            (None, 18432)             0         
_________________________________________________________________
dense (Dense)                (None, 10)                184330    
Total params: 185,162
Trainable params: 185,162
Non-trainable params: 0
_________________________________________________________________
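
The shapes and parameter counts in this summary can be checked by hand. The sketch below assumes the Keras defaults for `Conv2D` (stride 1, no padding); the helper names are illustrative.

```python
def conv2d_output_size(input_size, kernel_size):
    """Spatial output size for an unpadded convolution with stride 1."""
    return input_size - kernel_size + 1

def conv2d_params(kernel_size, in_channels, filters):
    """Kernel weights for every filter plus one bias per filter."""
    return kernel_size * kernel_size * in_channels * filters + filters

def dense_params(inputs, units):
    """One weight per input/unit pair plus one bias per unit."""
    return inputs * units + units

side = conv2d_output_size(28, 5)              # 24
conv = conv2d_params(5, 1, 32)                # 832
flat = side * side * 32                       # 18432
dense = dense_params(flat, 10)                # 184330
print(side, conv, flat, dense, conv + dense)  # 24 832 18432 184330 185162
```

Almost all of the parameters sit in the dense output layer, because the flattened feature map has 18,432 values.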


In [9]:
batch_size = 128
epochs = 5

In [7]:
model.compile(loss="categorical_crossentropy", optimizer="sgd", metrics=["accuracy", "AUC"])

tic = time.perf_counter()
model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, validation_split=0.1)
toc = time.perf_counter()
print(f"Executing in {toc - tic:0.4f} seconds")

Train on 54000 samples, validate on 6000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Executing in 199.6002 seconds
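
The line `Train on 54000 samples, validate on 6000 samples` follows from `validation_split=0.1`. A small arithmetic sketch of the split and of the number of batches per epoch (variable names are illustrative):

```python
import math

num_samples = 60000        # Fashion MNIST training examples
validation_split = 0.1     # as passed to model.fit
batch_size = 128

# Keras holds out the last fraction of the arrays for validation.
split_at = int(num_samples * (1.0 - validation_split))
num_train = split_at
num_val = num_samples - split_at

# Number of gradient updates (batches) per epoch; the last batch is smaller.
steps_per_epoch = math.ceil(num_train / batch_size)

print(num_train, num_val, steps_per_epoch)  # 54000 6000 422
```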


In [8]:
score = model.evaluate(x_test, y_test, verbose=0)
print("Test loss:", score[0])
print("Test accuracy:", score[1])
print("AUC:", score[2])

Test loss: 0.6108185867786408
Test accuracy: 0.7752
AUC: 0.97880584
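
As a reminder of what the accuracy metric measures, here is a minimal NumPy sketch that compares the highest-probability class with the one-hot label. The `accuracy` helper and the toy predictions are illustrative assumptions, not Keras code.

```python
import numpy as np

def accuracy(y_true_one_hot, y_prob):
    """Fraction of samples whose highest-probability class matches the label."""
    predicted = np.argmax(y_prob, axis=-1)
    actual = np.argmax(y_true_one_hot, axis=-1)
    return float(np.mean(predicted == actual))

# Toy predictions for 4 samples and 3 classes (illustrative values only).
y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]])
y_prob = np.array([
    [0.7, 0.2, 0.1],   # correct
    [0.1, 0.8, 0.1],   # correct
    [0.3, 0.5, 0.2],   # wrong: predicts class 1, label is class 2
    [0.2, 0.6, 0.2],   # correct
])
print(accuracy(y_true, y_prob))  # 0.75
```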


Training takes about 200 s (a little over 3 minutes) on a MacBook Pro and test accuracy is 0.78. Knowing that ML models can easily reach 95% accuracy on MNIST, the first version of the model is suboptimal.

### Task 2

Increase the complexity of the CNN by adding multiple convolution and dense layers. 
Add one more convolutional layer with 32 neurons (feature maps) and a 5x5 feature detector. 
Add a dense layer with 128 nodes.

### Hypothesis
We expect to boost the accuracy of the model, probably at a training time cost.

In [46]:
model = keras.Sequential(
    [
        keras.Input(shape=input_shape),
        layers.Conv2D(32, kernel_size=(5, 5), activation="sigmoid"),
        layers.Conv2D(32, kernel_size=(5, 5), activation="sigmoid"),
        layers.Flatten(),
        layers.Dense(128, activation="sigmoid"),
        layers.Flatten(),
        layers.Dense(10, activation="softmax"),
    ]
)

model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 24, 24, 32)        832       
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 20, 20, 32)        25632     
_________________________________________________________________
flatten_2 (Flatten)          (None, 12800)             0         
_________________________________________________________________
dense_2 (Dense)              (None, 128)               1638528   
_________________________________________________________________
flatten_3 (Flatten)          (None, 128)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 10)                1290      
Total params: 1,666,282
Trainable params: 1,666,282
Non-trainable params: 0
____________________________________________

In [47]:
model.compile(loss="categorical_crossentropy", optimizer="sgd", metrics=["accuracy", "AUC"])

tic = time.perf_counter()
model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, validation_split=0.1)
toc = time.perf_counter()
print(f"Executing in {toc - tic:0.4f} seconds")

Train on 54000 samples, validate on 6000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Executing in 465.3710 seconds


In [48]:
score = model.evaluate(x_test, y_test, verbose=0)

print("Test loss:", score[0])
print("Test accuracy:", score[1])
print("AUC:", score[2])

Test loss: 0.8828392681121826
Test accuracy: 0.6897
AUC: 0.96053743


### Results

Training duration increases significantly with the complexity of the neural network, reaching 465 s. Accuracy is 69%. So adding a second convolutional layer degrades the accuracy while increasing the training duration.

### Task 3

Document answers to the following questions:
How can you improve the models built in Tasks 1 & 2?
Using only convolutional layers, will hyper-parameter optimization (number of layers, number of nodes, learning rate, etc.) help in increasing the accuracy? If yes, implement the changes and report your results. 

### Hypothesis
HPO explores a configuration space with multiple dimensions. It automates the exploration of the hypermodel space and should find a better configuration within the given topology.
On the other hand, we have not yet considered normalization, pooling or dropout.
The time needed to evaluate the hypermodel combinations is unknown at this point.

In [28]:
!pip install -q -U keras-tuner
import kerastuner as kt

## Define the model

Define the hyperparameter search space in addition to the model architecture. The model you set up for hypertuning is called a *hypermodel*.

You can define a hypermodel by using a model builder function or by subclassing the `HyperModel` class of the Keras Tuner API.

You can also use two pre-defined `HyperModel` classes - [HyperXception](https://keras-team.github.io/keras-tuner/documentation/hypermodels/#hyperxception-class) and [HyperResNet](https://keras-team.github.io/keras-tuner/documentation/hypermodels/#hyperresnet-class) for computer vision applications.

We use here a model builder function to define the image classification model. The model builder function returns a compiled model and uses hyperparameters you define inline to hypertune the model.

In [29]:
import IPython

In [30]:
def model_builder(hp):   
  model = keras.Sequential()
  model.add(keras.Input(shape=input_shape))
    
  # Convolutional layers
  hp_cv_layers = hp.Int('cv_layers', min_value = 1, max_value = 2, step = 1)
  hp_cv_units = hp.Int('cv_units', min_value = 32, max_value = 64, step = 32)

  for i in range(hp_cv_layers):  # vary the number of convolutional layers
    model.add(layers.Conv2D(hp_cv_units, kernel_size=(5, 5), activation="sigmoid"))
    
  model.add(layers.Flatten())

  hp_units = hp.Int('units', min_value = 32, max_value = 64, step = 32)
  model.add(layers.Dense(units = hp_units, activation="sigmoid"))
    
  model.add(layers.Flatten())
  model.add(layers.Dense(10, activation="softmax"))

  # Tune the learning rate for the optimizer
  # The search is currently restricted to 0.01; the wider set 0.01, 0.001, 0.0001 is commented out
  hp_learning_rate = hp.Choice('learning_rate', values = [1e-2]) #values = [1e-2, 1e-3, 1e-4]) 
  
  model.compile(optimizer=keras.optimizers.SGD(learning_rate = hp_learning_rate),
                loss="categorical_crossentropy", 
                metrics = ['accuracy'])
  
  return model
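
To get a feel for the size of the search space defined in `model_builder`, the sketch below enumerates every combination in plain Python. It assumes that `max_value` is inclusive in `hp.Int`; the `search_space` dictionary is an illustrative construct, not a Keras Tuner object.

```python
from itertools import product

# Values implied by min_value, max_value and step in model_builder.
search_space = {
    "cv_layers": list(range(1, 2 + 1, 1)),       # [1, 2]
    "cv_units": list(range(32, 64 + 1, 32)),     # [32, 64]
    "units": list(range(32, 64 + 1, 32)),        # [32, 64]
    "learning_rate": [1e-2],                     # single choice
}

combinations = [dict(zip(search_space, values))
                for values in product(*search_space.values())]
print(len(combinations))  # 8
print(combinations[0])
```

With only 8 configurations, an exhaustive comparison would also be feasible; the tuner matters more once the learning rate choices are widened.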

## Instantiate the tuner and perform hypertuning

Instantiate the HPO tuner to perform the hypertuning. The Keras Tuner has four tuners available - `RandomSearch`, `Hyperband`, `BayesianOptimization`, and `Sklearn`. In this tutorial, we use the [Hyperband](https://arxiv.org/pdf/1603.06560.pdf) tuner. 

To instantiate the Hyperband tuner, you must specify the hypermodel, the `objective` to optimize and the maximum number of epochs to train (`max_epochs`).

In [31]:
tuner = kt.Hyperband(model_builder,
                     objective = 'val_acc', 
                     max_epochs = 5,
                     factor = 3,
                     directory = '.',
                     project_name = 'intro_to_kt')      

INFO:tensorflow:Reloading Oracle from existing project ./intro_to_kt/oracle.json
INFO:tensorflow:Reloading Tuner from ./intro_to_kt/tuner0.json


The Hyperband tuning algorithm uses adaptive resource allocation and early-stopping to quickly converge on a high-performing model. This is done using a sports championship style bracket. The algorithm trains a large number of models for a few epochs and carries forward only the top-performing models (roughly one in `factor`) to the next round. Hyperband determines the number of models to train in a bracket by computing 1 + log<sub>`factor`</sub>(`max_epochs`) and rounding it up to the nearest integer.
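
Evaluating the expression quoted above for the values used in this notebook (`factor = 3`, `max_epochs = 5`) is a one-line computation. This is only the arithmetic from the text, not a reimplementation of the tuner.

```python
import math

factor = 3       # as passed to kt.Hyperband
max_epochs = 5   # as passed to kt.Hyperband

# Expression quoted in the text: 1 + log_factor(max_epochs), rounded up.
raw = 1 + math.log(max_epochs, factor)
rounded = math.ceil(raw)

print(round(raw, 3), rounded)  # 2.465 3
```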

Before running the hyperparameter search, define a callback to clear the training outputs at the end of every training step.

In [32]:
class ClearTrainingOutput(tf.keras.callbacks.Callback):
  def on_train_end(*args, **kwargs):
    IPython.display.clear_output(wait = True)

Run the hyperparameter search. The arguments for the search method are the same as those used for `tf.keras.Model.fit`, in addition to the callback above.

The `./intro_to_kt` directory contains detailed logs and checkpoints for every trial (model configuration) run during the hyperparameter search. When re-running the hyperparameter search, the Keras Tuner uses the existing state from these logs to resume the search. To disable this behavior, pass an additional `overwrite = True` argument while instantiating the tuner.

In [37]:
tuner.search(x_train, y_train, epochs = 5, validation_data = (x_test, y_test), callbacks = [ClearTrainingOutput()])

# Get the optimal hyperparameters
best_hps = tuner.get_best_hyperparameters(num_trials = 1)[0]

Train on 60000 samples, validate on 10000 samples
Epoch 1/5

KeyboardInterrupt: 

In [39]:
print("The hyperparameter search is complete. The optimal hyper parameters found are:")
print ("nb of cv layers: {0}.".format(best_hps.get('cv_layers')))
print ("number of units for the convolutionnal layers: {0}.".format(best_hps.get('cv_units')))
print ("number of units: {0}.".format(best_hps.get('units')))
print ("learning rate: {0}.".format(best_hps.get('learning_rate')))

The hyperparameter search is complete. The optimal hyper parameters found are:
nb of cv layers: 1.
number of units for the convolutionnal layers: 128.
number of units: 32.
learning rate: 0.01.


To finish this task, we retrain the model with the optimal hyperparameters from the search.

In [90]:
# Build the model with the optimal hyperparameters and train it on the data
model = tuner.hypermodel.build(best_hps)
tic = time.perf_counter()
model.fit(x_train, y_train, epochs = 5, validation_data = (x_test, y_test))
toc = time.perf_counter()
print(f"Executing in {toc - tic:0.4f} seconds")

Train on 60000 samples, validate on 10000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5

<tensorflow.python.keras.callbacks.History at 0x7fac97735fd0>

### Results
Accuracy for the best model found is 80%.


### Task 4

In this task you are free to change any parameter/architecture to improve the quality metrics. This is the time to get creative!
After training and evaluating the network, document the quality metrics for each change and include the findings in your Jupyter Notebook.

### Hypothesis
Normalizing and centering the inputs with a batch normalization layer should improve the accuracy of the model.


In [10]:
model = keras.Sequential(
    [
        keras.Input(shape=input_shape),
        keras.layers.BatchNormalization(),
        layers.Conv2D(32, kernel_size=(5, 5), activation="sigmoid"),
        layers.Flatten(),
        layers.Dense(10, activation="softmax"),
    ]
)

model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
batch_normalization_1 (Batch (None, 28, 28, 1)         4         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 24, 24, 32)        832       
_________________________________________________________________
flatten_2 (Flatten)          (None, 18432)             0         
_________________________________________________________________
dense_2 (Dense)              (None, 10)                184330    
Total params: 185,166
Trainable params: 185,164
Non-trainable params: 2
_________________________________________________________________
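
The 4 parameters reported for the batch normalization layer (2 trainable, 2 non-trainable) correspond to one scale, one shift, one moving mean and one moving variance for the single input channel. A minimal NumPy sketch of the training-time computation, assuming normalization over every axis except the channel axis and the Keras default `epsilon` of 1e-3 (the `batch_norm_train` helper is illustrative):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-3):
    """Normalize over every axis except the last (channel) axis, then scale and shift."""
    axes = tuple(range(x.ndim - 1))
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

channels = 1
gamma = np.ones(channels)    # trainable scale
beta = np.zeros(channels)    # trainable shift
# The moving mean and variance are updated during training but not by gradient
# descent, which is why the summary reports them as non-trainable.
trainable = gamma.size + beta.size
non_trainable = 2 * channels

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(8, 28, 28, channels))   # toy batch of images
y = batch_norm_train(x, gamma, beta)
print(trainable, non_trainable, float(y.mean()), float(y.std()))
```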


In [11]:
model.compile(loss="categorical_crossentropy", optimizer="sgd", metrics=["accuracy", "AUC"])

tic = time.perf_counter()
model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, validation_split=0.1)
toc = time.perf_counter()
print(f"Executing in {toc - tic:0.4f} seconds")

Train on 54000 samples, validate on 6000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Executing in 143.1490 seconds


In [12]:
score = model.evaluate(x_test, y_test, verbose=0)

print("Test loss:", score[0])
print("Test accuracy:", score[1])
print("AUC:", score[2])

Test loss: 0.5044754775047302
Test accuracy: 0.8164
AUC: 0.98459816


### Results
We gain about 4 percentage points of accuracy compared with Task 1 (0.78 to 0.82). A welcome improvement, but not a breakthrough.


### Hypothesis
Changing the activation function to ReLU may improve the model.
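
One commonly cited difference between the two activations is the size of their gradients: the sigmoid derivative never exceeds 0.25, while the ReLU derivative is 1 for positive inputs. The sketch below only illustrates that property; this notebook does not test whether it explains the results.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # never exceeds 0.25

def relu_grad(x):
    return (x > 0).astype(float)  # 1 for positive inputs, 0 otherwise

x = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(sigmoid_grad(x))
print(relu_grad(x))
```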


In [53]:
model = keras.Sequential(
    [
        #layers.Flatten(input_shape=(28, 28, 1)),
        keras.Input(shape=input_shape),
        keras.layers.BatchNormalization(),
        layers.Conv2D(32, kernel_size=(5, 5), activation="relu"),
        #layers.Dense(8, activation="sigmoid"),
        #layers.MaxPooling2D(pool_size=(2, 2)),
        #layers.Conv2D(8, kernel_size=(3, 3), activation="sigmoid"),
        #layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        #layers.Dropout(0.5),
        layers.Dense(10, activation="softmax"),
    ]
)

model.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
batch_normalization_1 (Batch (None, 28, 28, 1)         4         
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 24, 24, 32)        832       
_________________________________________________________________
flatten_5 (Flatten)          (None, 18432)             0         
_________________________________________________________________
dense_5 (Dense)              (None, 10)                184330    
Total params: 185,166
Trainable params: 185,164
Non-trainable params: 2
_________________________________________________________________


In [54]:
model.compile(loss="categorical_crossentropy", optimizer="sgd", metrics=["accuracy", "AUC"])

tic = time.perf_counter()
model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, validation_split=0.1)
toc = time.perf_counter()
print(f"Executing in {toc - tic:0.4f} seconds")

Train on 54000 samples, validate on 6000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Executing in 176.5187 seconds


In [55]:
score = model.evaluate(x_test, y_test, verbose=0)

print("Test loss:", score[0])
print("Test accuracy:", score[1])
print("AUC:", score[2])

Test loss: 0.36861664798259736
Test accuracy: 0.8719
AUC: 0.9906212


### Results
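Switching the activation to ReLU, with the batch normalization layer kept, raises test accuracy to 0.87 (from 0.82 with sigmoid). Training takes 177 s, slightly longer than the 143 s of the sigmoid variant but shorter than the 200 s of Task 1.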


### Hypothesis
Introducing a dropout layer should regularize the model and may improve accuracy on the test set.
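
A minimal NumPy sketch of what a dropout layer does during training, assuming the usual "inverted dropout" formulation in which surviving values are rescaled by 1 / (1 - rate). The `dropout` helper is illustrative, not the Keras layer.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero a fraction `rate` of values and rescale the rest."""
    if not training or rate == 0.0:
        return x
    keep_prob = 1.0 - rate
    mask = rng.random(x.shape) < keep_prob
    return x * mask / keep_prob

rng = np.random.default_rng(123)
activations = np.ones((4, 6))
dropped = dropout(activations, rate=0.5, rng=rng)
print(dropped)                                         # entries are 0.0 or 2.0
print(dropout(activations, 0.5, rng, training=False))  # unchanged at inference
```

Because of the rescaling, the expected value of each activation is the same during training and inference.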


In [13]:
model = keras.Sequential(
    [
        keras.Input(shape=input_shape),
        keras.layers.BatchNormalization(),
        layers.Conv2D(32, kernel_size=(5, 5), activation="sigmoid"),
        layers.Dropout(0.5),
        layers.Flatten(),
        layers.Dense(10, activation="softmax"),
    ]
)

model.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
batch_normalization_2 (Batch (None, 28, 28, 1)         4         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 24, 24, 32)        832       
_________________________________________________________________
dropout_1 (Dropout)          (None, 24, 24, 32)        0         
_________________________________________________________________
flatten_3 (Flatten)          (None, 18432)             0         
_________________________________________________________________
dense_3 (Dense)              (None, 10)                184330    
Total params: 185,166
Trainable params: 185,164
Non-trainable params: 2
_________________________________________________________________


In [14]:
model.compile(loss="categorical_crossentropy", optimizer="sgd", metrics=["accuracy", "AUC"])

tic = time.perf_counter()
model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, validation_split=0.1)
toc = time.perf_counter()
print(f"Executing in {toc - tic:0.4f} seconds")

Train on 54000 samples, validate on 6000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Executing in 356.2885 seconds


In [15]:
score = model.evaluate(x_test, y_test, verbose=0)

print("Test loss:", score[0])
print("Test accuracy:", score[1])
print("AUC:", score[2])

Test loss: 0.48319054408073425
Test accuracy: 0.8283
AUC: 0.98529756


### Results
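With a dropout rate of 0.5 after the sigmoid convolutional layer, test accuracy is 0.83, slightly above the 0.82 obtained without dropout, while training time rises to 356 s.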


### Task 5

Discuss the results for Tasks 1-4 with your project partner. Note down the differences in the results between Part 1 and 2 so far and document your insights. Does one architecture work better than the other? Why? Would the optimization from Tasks 3 and 4 help improve both Part 1 and Part 2 results?


Note: Attend the Project Check-in and Debrief session before continuing with Task 6