In [1]:
import logging
import numpy as np
import time

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, datasets, losses, models, optimizers, metrics

In [2]:
# This is the configuration for the logging output

logging_format = '%(message)s'
logging.basicConfig(format=logging_format, level=logging.INFO)

# Custom Models and Training with TensorFlow

This notebook contains solutions to exercises 12 and 13 of Chapter 12, *Custom Models and Training with TensorFlow*, from the book *Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow* by Aurélien Géron.

## Exercise 12

**Implement a custom layer that performs Layer Normalization:**
- **The ```build()``` method should define two trainable weights $\alpha$ and $\beta$, both of shape ```input_shape[-1:]``` and data type ```tf.float32```. $\alpha$ should be initialized with ones, and $\beta$ with zeros.**

- **The ```call()``` method should compute the mean $\mu$ and standard deviation $\sigma$ of each instance’s features. For this, you can use ```tf.nn.moments(inputs, axes=-1, keepdims=True)```, which returns the mean $\mu$ and the variance $\sigma^2$ of all instances (compute the square root of the variance to get the standard deviation). Then the function should compute and return $\alpha \otimes (\textbf{X}-\mu)/(\sigma + \varepsilon) + \beta$, where $\otimes$ represents itemwise multiplication (*) and $\varepsilon$ is a smoothing term (small constant to avoid division by zero, e.g., 0.001).**

- **Ensure that your custom layer produces the same (or very nearly the same) output as the ```keras.layers.LayerNormalization``` layer.**

To create a custom layer, we need to write a class that subclasses the ```keras.layers.Layer``` class.

In [3]:
class CustomLayerNormalization(layers.Layer):
    
    
    def __init__(self, epsilon=0.001, **kwargs):
        super().__init__(**kwargs)
        self.epsilon = epsilon
    
    
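    def build(self, batch_input_shape):
        # Both weights have shape input_shape[-1:] (one value per feature), as the exercise requires
        self.alpha = self.add_weight(
            name='alpha', 
            shape=batch_input_shape[-1:],
            dtype=tf.float32,
            initializer='ones'
        )
        self.beta = self.add_weight(
            name='beta',
            shape=batch_input_shape[-1:],
            dtype=tf.float32,
            initializer='zeros'
        )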
    
    
    def call(self, X):
        mean, variance = tf.nn.moments(X, axes=-1, keepdims=True)
        return self.alpha * (X - mean)/(tf.sqrt(variance) + self.epsilon) + self.beta

    
    def compute_output_shape(self, batch_input_shape):
        return batch_input_shape
    
    
    def get_config(self):
        # Call the parent implementation; self.get_config() would recurse forever
        base_config = super().get_config()
        return {**base_config, 'epsilon':self.epsilon}

To verify that our custom layer produces the same output as the ```keras.layers.LayerNormalization``` layer, let's create an array of random numbers and compute the MSE between the outputs of our layer and those of the Keras implementation.

In [4]:
data = np.random.rand(40000, 32, 32).astype(np.float32)

keras_layer = layers.LayerNormalization()
custom_layer = CustomLayerNormalization()

difference = losses.mean_squared_error(keras_layer(data), custom_layer(data))
logging.info(f'The mean difference between the LayerNormalization layer and the custom layer is {np.mean(difference)}')

The mean difference between the LayerNormalization layer and the custom layer is 8.283178431156557e-06





The mean squared difference is about $8.3 \times 10^{-6}$, so we can conclude that the two outputs are almost the same.
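One likely source of this small residual is where the smoothing term is added. Our layer adds $\varepsilon$ to the standard deviation, as the exercise asks, while, to the best of my knowledge, ```keras.layers.LayerNormalization``` adds its epsilon (default 0.001) to the variance before taking the square root. The NumPy sketch below compares the two variants on random data. The function names `layer_norm_eps_outside` and `layer_norm_eps_inside` are illustrative and not part of Keras, and $\alpha$ and $\beta$ are left at their initial values (ones and zeros).

```python
import numpy as np


def layer_norm_eps_outside(x, eps=0.001):
    """Normalize each instance with (x - mean) / (std + eps), as in the exercise."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)


def layer_norm_eps_inside(x, eps=0.001):
    """Normalize each instance with (x - mean) / sqrt(var + eps)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


# Random data in [0, 1), similar to the array used in the cell above (but smaller)
rng = np.random.default_rng(0)
x = rng.random((4, 32)).astype(np.float32)

out_a = layer_norm_eps_outside(x)
out_b = layer_norm_eps_inside(x)

# Mean squared difference between the two variants
mse = float(np.mean((out_a - out_b) ** 2))
print(f'MSE between the two epsilon placements: {mse}')
```

The two variants differ only by a slightly different scaling factor per instance, which is consistent with a small but non-zero MSE.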

## Exercise 13

**Train a model using a custom training loop to tackle the Fashion MNIST dataset**

Let's first download the Fashion MNIST dataset. This is a collection of 28 x 28 grayscale images of fashion items: 60,000 in the training set and 10,000 in the test set. The classes are the following:

|Label|Description|
|:---:|:---:|
|0|T-shirt/top|
|1|Trouser|
|2|Pullover|
|3|Dress|
|4|Coat|
|5|Sandal|
|6|Shirt|
|7|Sneaker|
|8|Bag|
|9|Ankle Boot|

For more details on the dataset, please see the [documentation of Keras](https://keras.io/api/datasets/fashion_mnist/#fashion-mnist-dataset-an-alternative-to-mnist).
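Since the labels are stored as integers, a small lookup list can help when inspecting predictions. This is only a convenience sketch: `class_names` and `label_to_name` are illustrative names and are not provided by Keras.

```python
# Class descriptions in label order, copied from the table above
class_names = [
    'T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
    'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle Boot',
]


def label_to_name(label):
    """Return the description of an integer label (0 to 9)."""
    return class_names[int(label)]


print(label_to_name(9))
```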

In [5]:
(x_train, y_train), (x_test, y_test) = datasets.fashion_mnist.load_data()
x_train, x_valid = x_train[10000:], x_train[:10000]
y_train, y_valid = y_train[10000:], y_train[:10000]

- **Display the epoch, iteration, mean training loss, and mean accuracy over each epoch (updated at each iteration), as well as the validation loss and accuracy at the end of each epoch.**

Creating a custom training loop gives us extra flexibility, but it also adds complexity and makes our code harder to maintain.

Even though it may look like a good alternative, only consider this option when you really need the extra flexibility. One good example, as suggested by Aurélien Géron in his book, is implementing models from papers that require it. Another good example, also suggested by Géron, is building models that use different optimizers for different layers.

Let's first create two functions. The first one, ```random_sampling```, samples $n$ instances from the training set, where $n$ is given by the parameter ```batch_size```. The second one, ```status_bar```, shows the relevant information for each epoch; which information to display is established by Géron in the exercise statement.

In [6]:
# Create a function to sample randomly from the training set
def random_sampling(x, y, batch_size=32):
    idx = np.random.randint(len(x), size=batch_size)
    return x[idx], y[idx]


# Create a function that simulates the status bar printed when training a neural network
def status_bar(epoch, total_epochs, loss, val_loss, time_epoch, metrics=None, val_accuracy=None):
    metrics = ' - '.join(['{}: {:.4f}'.format(m.name, m.result()) for m in [loss] + (metrics or [])])
    
    logging.info(f'Epoch {epoch}/{total_epochs} ({time_epoch} s) - ' + metrics + ' - val_loss: {:.4f} - val_accuracy: {:.4f}'.format(val_loss, val_accuracy))
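
Note that `np.random.randint` draws indices with replacement, so a batch can contain the same instance twice and an epoch is not guaranteed to visit every instance. If that matters, one possible alternative is to shuffle the indices once per epoch and slice them into batches. The sketch below assumes this approach; `epoch_batches` is a hypothetical helper, not a function used elsewhere in this notebook.

```python
import numpy as np


def epoch_batches(x, y, batch_size=32, rng=None):
    """Yield shuffled mini-batches that cover every instance exactly once."""
    if rng is None:
        rng = np.random.default_rng()
    order = rng.permutation(len(x))            # A random order without repetition
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]  # The last batch may be smaller
        yield x[idx], y[idx]


# Tiny demo with 10 instances and batches of 4
x_demo = np.arange(10).reshape(10, 1)
y_demo = np.arange(10)
batches = list(epoch_batches(x_demo, y_demo, batch_size=4, rng=np.random.default_rng(42)))
seen = np.concatenate([y_batch for _, y_batch in batches])
print([len(y_batch) for _, y_batch in batches])
```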

Let's create a simple model: the input layer is a ```layers.Flatten()```, followed by a ```layers.LayerNormalization()``` layer that normalizes the inputs, then two hidden layers with ```activation='elu'``` and ```kernel_initializer='he_normal'```, and an output layer with ```activation='softmax'``` and ```kernel_initializer='glorot_normal'```. A common alternative is to add a ```BatchNormalization()``` layer before each activation function (as suggested by Sergey Ioffe and Christian Szegedy in their [paper](https://arxiv.org/abs/1502.03167)).

Then, we will create the variables for the training loop. Among them, we will create an optimizer: the Nesterov Accelerated Gradient (NAG). 

The *NAG* optimizer is a variant of the original *Momentum optimization* algorithm. The idea of momentum optimization is to let past gradients build up momentum: just as a ball rolling down a slope accelerates as it rolls, the updates gain speed while successive gradients keep pointing in the same direction.

*NAG* extends *Momentum Optimization* by measuring the gradient of the cost function not at the current position, but slightly ahead in the direction of the momentum. With $\theta$ the vector of weights, NAG measures the gradient not at $\theta$ but at $\theta + \beta m$, where $m$ is the momentum vector and $\beta$ is the momentum hyperparameter, which acts like friction (0 means high friction, 1 means no friction).

The algorithm for *NAG* is the following:

> 1. $m \gets \beta m - \eta \nabla_\theta J(\theta + \beta m)$
> 2. $\theta \gets \theta + m$

*NAG* keeps track of the previous gradients in the momentum vector $m$. At each step it subtracts the gradient measured at $\theta + \beta m$, scaled by the learning rate $\eta$, from $\beta m$, and then adds the updated $m$ to the weights $\theta$.

For further information on the *NAG* algorithm, please visit [this post](https://machinelearningmastery.com/gradient-descent-with-nesterov-momentum-from-scratch/).
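
The two update rules above can be sketched in a few lines of plain Python. This is a minimal illustration on a one-dimensional cost function $J(\theta) = (\theta - 3)^2$, chosen here only as an example; `nag_minimize` is a hypothetical helper and does not reproduce the internals of ```optimizers.SGD```.

```python
def nag_minimize(grad_fn, theta, learning_rate=0.1, beta=0.9, n_steps=200):
    """Minimize a 1-D function with Nesterov Accelerated Gradient.

    grad_fn returns the gradient of the cost function at a given point.
    """
    m = 0.0  # Momentum vector (a scalar in this 1-D example)
    for _ in range(n_steps):
        # 1. Measure the gradient slightly ahead, at theta + beta * m
        m = beta * m - learning_rate * grad_fn(theta + beta * m)
        # 2. Update the weights with the new momentum
        theta = theta + m
    return theta


# Gradient of J(theta) = (theta - 3)^2, whose minimum is at theta = 3
theta_final = nag_minimize(lambda t: 2.0 * (t - 3.0), theta=0.0)
print(theta_final)
```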

In [7]:
# Create a simple model, using a LayerNormalization layer

mnist_model = models.Sequential([
    layers.Flatten(input_shape=[28, 28]),
    layers.LayerNormalization(),        # To normalize the input of the model
    layers.Dense(300, activation='elu', kernel_initializer='he_normal'),
    layers.Dense(300, activation='elu', kernel_initializer='he_normal'),
    layers.Dense(10, activation='softmax', kernel_initializer='glorot_normal')
])

# Set the variables for the training loop
n_epochs = 25
batch_size = 1500
n_steps = len(x_train) // batch_size
optimizer = optimizers.SGD(momentum=0.9, nesterov=True)     # Nesterov Accelerated Gradient
loss_fn = losses.SparseCategoricalCrossentropy(name='Val_Loss')
mean_loss = metrics.Mean(name='Loss')
metrics_model = [metrics.SparseCategoricalAccuracy(name='Accuracy')]

Now let's create the actual training loop. We will create two loops: one for the epochs and another for the steps.

The steps of the training loop are the following:

- First we randomly sample $n$ instances of the training set.
- The [```tf.GradientTape()```](https://www.tensorflow.org/api_docs/python/tf/GradientTape) records the operations executed within its block. The recorded operations are then used for automatic differentiation.
- The ```tape.gradient()``` method computes the gradients of the loss with respect to the trainable variables, and the ```optimizer.apply_gradients()``` method applies those gradients to all the trainable variables (weights) of the model (or of a group of layers, as in the second part of this exercise).
- The mean of the loss is updated, and we iterate over each metric (in this case just ```keras.metrics.SparseCategoricalAccuracy()```) to update it.
- To calculate the loss and the accuracy on the validation set, we call the model as a function on the validation set. During training we set ```training=True```; here we leave it unset, so the model runs in inference mode. With its default reduction, the ```keras.losses.SparseCategoricalCrossentropy()``` object already returns the mean loss over the batch, so taking the mean again is harmless; ```keras.metrics.sparse_categorical_accuracy``` returns one value per instance, so its mean is required.
- We measure the processing time by recording the start time and the end time. This gives us the training time per epoch.
- Finally we call the ```status_bar()``` function.
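
A note on the reported training values: ```metrics.Mean``` keeps a running mean over every value it has received since it was created or last reset. The loop in this notebook never resets its metrics, so the `Loss` and `Accuracy` columns are averages over all steps so far rather than over the current epoch only. The pure-Python sketch below illustrates the difference; `RunningMean` is a hypothetical stand-in, not the Keras implementation, and the loss values are made up for the illustration.

```python
class RunningMean:
    """Streaming mean of all the values seen since the last reset."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value):
        self.total += value
        self.count += 1

    def result(self):
        return self.total / self.count if self.count else 0.0

    def reset(self):
        self.total = 0.0
        self.count = 0


# Made-up step losses for two epochs
losses_per_epoch = [[1.0, 0.8], [0.4, 0.2]]

without_reset, with_reset = [], []
mean_a, mean_b = RunningMean(), RunningMean()
for epoch_losses in losses_per_epoch:
    for loss in epoch_losses:
        mean_a.update(loss)
        mean_b.update(loss)
    without_reset.append(mean_a.result())  # Average over all the steps so far
    with_reset.append(mean_b.result())     # Average over this epoch only
    mean_b.reset()

print(without_reset, with_reset)
```

With Keras metrics, the equivalent of `reset()` would be called on the loss and accuracy metrics at the end of each epoch.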

In [8]:
for epoch in range(1, n_epochs + 1):
    start_time = time.time()    
    for step in range(1, n_steps + 1):
        # batch_size is not passed here, so random_sampling falls back to its default of 32 instances
        x_batch, y_batch = random_sampling(x_train, y_train)
        with tf.GradientTape() as tape:
            y_pred = mnist_model(x_batch, training=True)
            main_loss = tf.reduce_mean(loss_fn(y_batch, y_pred))
            model_loss = tf.add_n([main_loss] + mnist_model.losses)
        gradients = tape.gradient(model_loss, mnist_model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, mnist_model.trainable_variables))
        mean_loss(model_loss)
        for metric in metrics_model:
            metric(y_batch, y_pred)

    y_valid_pred = mnist_model(x_valid)
    val_loss = np.mean(loss_fn(y_valid, y_valid_pred))
    val_accuracy = np.mean(metrics.sparse_categorical_accuracy(tf.constant(y_valid), y_valid_pred))
        
    end_time = round(time.time() - start_time, 2)
    status_bar(epoch, n_epochs, mean_loss, val_loss, end_time, metrics_model, val_accuracy)

Epoch 1/25 (0.92 s) - Loss: 1.0735 - Accuracy: 0.6676 - val_loss: 0.6729 - val_accuracy: 0.7702
Epoch 2/25 (0.76 s) - Loss: 0.8709 - Accuracy: 0.7188 - val_loss: 0.5840 - val_accuracy: 0.7946
Epoch 3/25 (0.72 s) - Loss: 0.7798 - Accuracy: 0.7497 - val_loss: 0.5456 - val_accuracy: 0.8034
Epoch 4/25 (0.77 s) - Loss: 0.7191 - Accuracy: 0.7654 - val_loss: 0.4834 - val_accuracy: 0.8222
Epoch 5/25 (0.76 s) - Loss: 0.6829 - Accuracy: 0.7759 - val_loss: 0.4992 - val_accuracy: 0.8155
Epoch 6/25 (0.75 s) - Loss: 0.6440 - Accuracy: 0.7852 - val_loss: 0.5192 - val_accuracy: 0.8145
Epoch 7/25 (0.75 s) - Loss: 0.6206 - Accuracy: 0.7896 - val_loss: 0.4556 - val_accuracy: 0.8413
Epoch 8/25 (0.76 s) - Loss: 0.6013 - Accuracy: 0.7958 - val_loss: 0.4901 - val_accuracy: 0.8268
Epoch 9/25 (0.75 s) - Loss: 0.5871 - Accuracy: 0.7999 - val_loss: 0.4852 - val_accuracy: 0.8292
Epoch 10/25 (0.76 s) - Loss: 0.5781 - Accuracy: 0.8009 - val_loss: 0.4627 - val_accuracy: 0.8311
Epoch 11/25 (0.77 s) - Loss: 0.5639 - A

- **Try using a different optimizer with a different learning rate for the upper layers and the lower layers.**

To achieve this, we split the model into two sub-models, one for the lower layers and one for the upper layers, so that each can be trained with its own optimizer and learning rate.

In [9]:
lower_layers = models.Sequential([
    layers.Flatten(input_shape=[28, 28]),
    layers.LayerNormalization(),
    layers.Dense(300, activation='elu', kernel_initializer='he_normal'),
])

upper_layers = models.Sequential([
    layers.Dense(300, activation='elu', kernel_initializer='he_normal'),
    layers.Dense(10, activation='softmax', kernel_initializer='glorot_normal')
])

mnist_model_2 = models.Sequential([
    lower_layers,
    upper_layers
])

# Set the variables for the training loop
n_epochs = 25
batch_size = 1500
n_steps = len(x_train) // batch_size
lower_optimizer = optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)     # Nesterov Accelerated Gradient
upper_optimizer = optimizers.Nadam(learning_rate=0.002, beta_1=0.9, beta_2=0.999)      # Nadam optimizer
loss_fn = losses.SparseCategoricalCrossentropy(name='Val_Loss')
mean_loss = metrics.Mean(name='Loss')
metrics_model = [metrics.SparseCategoricalAccuracy(name='Accuracy')]

There are some changes to take into consideration when creating the training loop:

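- It is necessary to pass ```persistent=True``` to ```tf.GradientTape()```, because ```tape.gradient()``` is called more than once (once per group of layers) and a non-persistent tape releases its resources after the first call.
- We iterate over the lower layers and the upper layers, applying each optimizer to the trainable variables of its own group.

The rest of the training loop is the same as the previous one.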
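
The idea of updating two groups of weights with different learning rates can be sketched with NumPy on a toy model $\hat{y} = w_{upper} \cdot w_{lower} \cdot x$. This is only an illustration under simplifying assumptions: plain gradient descent is swapped in for both NAG and Nadam, the gradients are derived by hand instead of with a gradient tape, and all names (`params`, `learning_rates`, `loss_and_grads`) are hypothetical.

```python
import numpy as np

# Toy data: the product of the two weights should approach 6
x = np.linspace(-1.0, 1.0, 21)
y = 6.0 * x

params = {'lower': 1.0, 'upper': 1.0}
learning_rates = {'lower': 0.01, 'upper': 0.05}  # One learning rate per group


def loss_and_grads(params):
    """Return the MSE loss and its gradient with respect to each weight."""
    error = params['upper'] * params['lower'] * x - y
    loss = float(np.mean(error ** 2))
    grads = {
        'lower': float(np.mean(2 * error * params['upper'] * x)),
        'upper': float(np.mean(2 * error * params['lower'] * x)),
    }
    return loss, grads


initial_loss, _ = loss_and_grads(params)
for _ in range(500):
    # Compute all the gradients once from the same forward pass ...
    _, grads = loss_and_grads(params)
    # ... then update each group with its own learning rate
    for name in params:
        params[name] -= learning_rates[name] * grads[name]
final_loss, _ = loss_and_grads(params)

print(initial_loss, final_loss, params)
```

Because the upper weight uses the larger learning rate, it moves further from its initial value than the lower weight.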

In [11]:
for epoch in range(1, n_epochs + 1):
    start_time = time.time()    
    for step in range(1, n_steps + 1):
        x_batch, y_batch = random_sampling(x_train, y_train)
        with tf.GradientTape(persistent=True) as tape:
            y_pred = mnist_model_2(x_batch, training=True)
            main_loss = tf.reduce_mean(loss_fn(y_batch, y_pred))
            model_loss = tf.add_n([main_loss] + mnist_model_2.losses)
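        # Use a name other than `layers` so the imported keras `layers` module is not shadowed
        for sub_model, optimizer in ((lower_layers, lower_optimizer), (upper_layers, upper_optimizer)):
            gradients = tape.gradient(model_loss, sub_model.trainable_variables)
            optimizer.apply_gradients(zip(gradients, sub_model.trainable_variables))
        del tape  # Release the resources held by the persistent tape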
        mean_loss(model_loss)
        for metric in metrics_model:
            metric(y_batch, y_pred)

    y_valid_pred = mnist_model_2(x_valid)
    val_loss = np.mean(loss_fn(y_valid, y_valid_pred))
    val_accuracy = np.mean(metrics.sparse_categorical_accuracy(tf.constant(y_valid), y_valid_pred))
    
    end_time = round(time.time() - start_time, 2)
    status_bar(epoch, n_epochs, mean_loss, val_loss, end_time, metrics_model, val_accuracy)

Epoch 1/25 (1.0 s) - Loss: 1.0863 - Accuracy: 0.6477 - val_loss: 0.6976 - val_accuracy: 0.7665
Epoch 2/25 (0.95 s) - Loss: 0.8908 - Accuracy: 0.7012 - val_loss: 0.6237 - val_accuracy: 0.7737
Epoch 3/25 (0.94 s) - Loss: 0.8014 - Accuracy: 0.7257 - val_loss: 0.5894 - val_accuracy: 0.7902
Epoch 4/25 (0.96 s) - Loss: 0.7604 - Accuracy: 0.7358 - val_loss: 0.5668 - val_accuracy: 0.8007
Epoch 5/25 (1.01 s) - Loss: 0.7269 - Accuracy: 0.7481 - val_loss: 0.5315 - val_accuracy: 0.8170
Epoch 6/25 (0.97 s) - Loss: 0.6945 - Accuracy: 0.7580 - val_loss: 0.6007 - val_accuracy: 0.7994
Epoch 7/25 (0.98 s) - Loss: 0.6692 - Accuracy: 0.7661 - val_loss: 0.5734 - val_accuracy: 0.8066
Epoch 8/25 (0.93 s) - Loss: 0.6507 - Accuracy: 0.7720 - val_loss: 0.4858 - val_accuracy: 0.8312
Epoch 9/25 (0.99 s) - Loss: 0.6276 - Accuracy: 0.7797 - val_loss: 0.4634 - val_accuracy: 0.8358
Epoch 10/25 (1.04 s) - Loss: 0.6091 - Accuracy: 0.7857 - val_loss: 0.4477 - val_accuracy: 0.8430
Epoch 11/25 (1.08 s) - Loss: 0.6016 - Ac