### Exercise 12

Implement a custom layer that performs layer normalization (we will use this type of layer in Chapter 15):

1. The `build()` method should define two trainable weights **α** and **β**, both of shape `input_shape[-1:]` and data type `tf.float32`. **α** should be initialized with 1s, and **β** with 0s.

2. The `call()` method should compute the mean _μ_ and standard deviation *σ* of each instance’s features. For this, you can use `tf.nn.moments(inputs, axes=-1, keepdims=True)`, which returns the mean *μ* and the variance *σ2* of all instances (compute the square root of the variance to get the standard deviation). Then the function should compute and return **α** ⊗ (**X** – μ)/(σ + ε) + **β**, where ⊗ represents itemwise multiplication (`*`) and *ε* is a smoothing term (a small constant to avoid division by zero, e.g., 0.001).

3. Ensure that your custom layer produces the same (or very nearly the same) output as the `tf.keras.layers.LayerNormalization` layer.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf

In [2]:
# # Set TensorFlow to use CPU only
tf.config.set_visible_devices([], 'GPU')

In [3]:
# Load a dataset and split it to train, validation and test set

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

housing = fetch_california_housing()
X_train_full, X_test, y_train_full, y_test = train_test_split(
    housing.data, housing.target.reshape(-1, 1), random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_full, y_train_full, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_valid_scaled = scaler.transform(X_valid)
X_test_scaled = scaler.transform(X_test)

# input_shape = X_train.shape[1:]

tf.keras.utils.set_random_seed(42)

In [4]:
X_train.shape[-1:]

(8,)

In [5]:
foo_mean, foo_var = tf.nn.moments(X_train_scaled, axes=-1, keepdims=True)
print(foo_mean.shape, foo_var.shape)

(11610, 1) (11610, 1)


In [6]:
tf.sqrt(foo_var).shape

TensorShape([11610, 1])

In [7]:
(tf.ones_like(foo_mean) * foo_mean).shape

TensorShape([11610, 1])

In [8]:
class MyLayerNormalization(tf.keras.layers.Layer):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.eps = 1e-3

    def build(self, batch_input_shape):
        self.a = self.add_weight(
            name="a", shape=batch_input_shape[-1:], initializer="ones"
        )
        self.b = self.add_weight(
            name="b", shape=batch_input_shape[-1:], initializer="zeros"
        )
    
    def call(self, X):
        mean, variance = tf.nn.moments(X, axes=-1, keepdims=True)
        # It is preferable to compute sqrt(variance + eps) instead of
        # sqrt(variance) + eps because the derivative of sqrt(z) is
        # undefined at z = 0 and training will bomp in this case
        std = tf.sqrt(variance + self.eps)
        return self.a * ((X - mean) / std) + self.b


In [9]:
my_normalization_layer = MyLayerNormalization()
X_normalized = my_normalization_layer(X_train)
X_normalized.shape

TensorShape([11610, 8])

In [10]:
keras_normalization_layer = tf.keras.layers.LayerNormalization()
X_normalized_keras = keras_normalization_layer(X_train)
X_normalized_keras.shape

TensorShape([11610, 8])

In [11]:
error = tf.keras.metrics.MeanAbsoluteError()
error(X_normalized_keras, X_normalized)

<tf.Tensor: shape=(), dtype=float32, numpy=3.3941788e-08>

Indeed there is not much difference between the two layers!

### Exercise 13
Train a model using a custom training loop to tackle the Fashion MNIST dataset (see Chapter 10):

1. Display the epoch, iteration, mean training loss, and mean accuracy over each epoch (updated at each iteration), as well as the validation loss and accuracy at the end of each epoch.

2. Try using a different optimizer with a different learning rate for the upper layers and the lower layers.

In [12]:
# Load the dataset
fashion_mnist = tf.keras.datasets.fashion_mnist.load_data()
(X_train_full, y_train_full), (X_test, y_test) = fashion_mnist

In [13]:
# Scale the input features as early as possible!
X_train_full, X_test = X_train_full / 255., X_test / 255.

In [14]:
X_train, y_train = X_train_full[:-5000], y_train_full[:-5000]
X_valid, y_valid = X_train_full[-5000:], y_train_full[-5000:]

In [15]:
class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

In [16]:
# Build the model
tf.keras.backend.clear_session()
tf.keras.utils.set_random_seed(42)
model = tf.keras.Sequential()
model.add(tf.keras.layers.Flatten(input_shape=(28, 28)))
for _ in range(5):
    model.add(tf.keras.layers.Dense(100, activation="swish", kernel_initializer="he_normal"))
model.add(tf.keras.layers.Dense(10, activation="softmax"))

In [17]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 flatten (Flatten)           (None, 784)               0         
                                                                 
 dense (Dense)               (None, 100)               78500     
                                                                 
 dense_1 (Dense)             (None, 100)               10100     
                                                                 
 dense_2 (Dense)             (None, 100)               10100     
                                                                 
 dense_3 (Dense)             (None, 100)               10100     
                                                                 
 dense_4 (Dense)             (None, 100)               10100     
                                                                 
 dense_5 (Dense)             (None, 10)                1

In [18]:
# A simple function that samples a batch of instances from the training set
def random_batch(X, y, batch_size=32):
    idx = np.random.randint(len(X), size=batch_size)
    return X[idx], y[idx]

In [19]:
# Utility function
def print_status_bar(step, total, loss, metrics=None):
    metrics = " - ".join([f"{m.name}: {m.result():.4f}"
                          for m in [loss] + (metrics or [])])
    end = "" if step < total else "\n"
    print(f"\r{step}/{total} - " + metrics, end=end)

In [20]:
# Define some hyperparameters
n_epochs = 5
batch_size = 64
n_steps = len(X_train) // batch_size
optimizer_lower = tf.keras.optimizers.legacy.SGD(learning_rate=0.001)
optimizer_upper = tf.keras.optimizers.legacy.Adam(learning_rate=0.0001)
loss_fn = tf.keras.losses.sparse_categorical_crossentropy
mean_loss = tf.keras.metrics.Mean(name="mean_loss")
metrics = [tf.keras.metrics.SparseCategoricalAccuracy()]
val_metrics = [tf.keras.metrics.SparseCategoricalAccuracy()]

In [21]:
len(model.trainable_variables)

12

In [22]:
len(model.trainable_variables[:-4])

8

In [23]:
# Build the custom training loop
for epoch in range(1, n_epochs + 1):
    print("Epoch {}/{}".format(epoch, n_epochs))
    for step in range(1, n_steps + 1):
        X_batch, y_batch = random_batch(X_train, y_train)
        with tf.GradientTape() as tape:
            y_pred = model(X_batch, training=True)
            loss = tf.reduce_mean(loss_fn(y_batch, y_pred))
        
        gradients = tape.gradient(loss, model.trainable_variables)
        optimizer_lower.apply_gradients(zip(gradients[:-4], model.trainable_variables[:-4]))
        optimizer_upper.apply_gradients(zip(gradients[-4:], model.trainable_variables[-4:]))
        mean_loss(loss)
        for metric in metrics:
            metric(y_batch, y_pred)

        print_status_bar(step, n_steps, mean_loss, metrics)

    # Validation loss and accuracy
    y_pred = model(X_valid)
    val_loss = tf.reduce_mean(loss_fn(y_valid, y_pred))
    for metric in val_metrics:
        metric(y_valid, y_pred)
    print(f"Validation loss: {val_loss}, validation accuracy: {val_metrics[0].result():.4f}")

    for metric in [mean_loss] + metrics + val_metrics:
        metric.reset_states()

Epoch 1/5
859/859 - mean_loss: 1.6184 - sparse_categorical_accuracy: 0.4681
Validation loss: 0.9442396759986877, validation accuracy: 0.6852
Epoch 2/5
859/859 - mean_loss: 0.8209 - sparse_categorical_accuracy: 0.7198
Validation loss: 0.7189589738845825, validation accuracy: 0.7576
Epoch 3/5
859/859 - mean_loss: 0.6954 - sparse_categorical_accuracy: 0.7587
Validation loss: 0.6419775485992432, validation accuracy: 0.7688
Epoch 4/5
859/859 - mean_loss: 0.6288 - sparse_categorical_accuracy: 0.7751
Validation loss: 0.5945408940315247, validation accuracy: 0.7812
Epoch 5/5
859/859 - mean_loss: 0.5926 - sparse_categorical_accuracy: 0.7831
Validation loss: 0.5678561329841614, validation accuracy: 0.7920
