# HOML Chapter 12 Exercise 12

## Exercise: Implement a custom layer that performs Layer Normalization (we will use this type of layer in Chapter 15):


*a. The build() method should define two trainable weights α and β, both of
shape input_shape[-1:] and data type tf.float32. α should be initialized
with 1s, and β with 0s.*

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np

In [None]:
# Set random seeds for both NumPy and TensorFlow so results are reproducible
np.random.seed(999)
tf.random.set_seed(999)

We'll set up two trainable weights, alpha and beta. Their shape is taken from the last dimension of the batch input shape, so the layer learns one scale (alpha) and one offset (beta) per input feature.

In [None]:
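# Build method (a method of the custom layer class, shown in full in part b)
def build(self, batch_input_shape):
    # One scale (alpha) and one offset (beta) per feature, both float32
    self.alpha = self.add_weight(
        name="alpha", shape=batch_input_shape[-1:],
        dtype=tf.float32, initializer="ones")
    self.beta = self.add_weight(
        name="beta", shape=batch_input_shape[-1:],
        dtype=tf.float32, initializer="zeros")
    super().build(batch_input_shape)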
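
To see why the shape is written as `batch_input_shape[-1:]` rather than `batch_input_shape[-1]`, here is a small sketch. It uses a plain Python tuple as a stand-in for the shape object that Keras passes to `build()`; the value `(None, 8)` is an assumed example (unknown batch size, 8 features).

```python
# Stand-in for the shape passed to build(): unknown batch size, 8 features.
batch_input_shape = (None, 8)

last_dim = batch_input_shape[-1]       # indexing gives a bare integer
weight_shape = batch_input_shape[-1:]  # slicing gives a one-element shape

print(last_dim)      # 8
print(weight_shape)  # (8,)
```

The slice keeps the result a shape, which is what `add_weight` expects for its `shape` argument.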

*b. The call() method should compute the mean μ and standard deviation σ of
each instance’s features. For this, you can use tf.nn.moments(inputs,
axes=-1, keepdims=True), which returns the mean μ and the variance σ² of
all instances (compute the square root of the variance to get the standard
deviation). Then the function should compute and return α⊗(X - μ)/(σ + ε) +
β, where ⊗ represents itemwise multiplication (*) and ε is a smoothing term
(small constant to avoid division by zero, e.g., 0.001).*


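We'll define the epsilon hyperparameter in the constructor.

In the call method, we add epsilon to the variance inside the square root, rather than adding it to σ as the exercise text states. This keeps the denominator strictly positive even when an instance's variance is zero. Judging by the comparison in part c, it also appears to be how keras.layers.LayerNormalization handles epsilon.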

In [None]:
# Custom Layer Normalization 
class LayerNormalization(keras.layers.Layer):
    def __init__(self, epsilon=0.001, **kwargs):
        super().__init__(**kwargs)
        self.epsilon = epsilon

    def build(self, batch_input_shape):
        # One scale (alpha) and one offset (beta) per feature, both float32
        self.alpha = self.add_weight(
            name="alpha", shape=batch_input_shape[-1:],
            dtype=tf.float32, initializer="ones")
        self.beta = self.add_weight(
            name="beta", shape=batch_input_shape[-1:],
            dtype=tf.float32, initializer="zeros")
        super().build(batch_input_shape) # must be at the end

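    def call(self, X):
        # Per-instance mean and variance over the feature axis
        mean, variance = tf.nn.moments(X, axes=-1, keepdims=True)
        # Epsilon goes inside the square root so the denominator is never zero
        return self.alpha * (X - mean) / tf.sqrt(variance + self.epsilon) + self.beta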

    def compute_output_shape(self, batch_input_shape):
        return batch_input_shape

    def get_config(self):
        base_config = super().get_config()
        return {**base_config, "epsilon": self.epsilon}
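
The arithmetic in `call()` can be cross-checked without TensorFlow. The following is a minimal NumPy sketch, not part of the original solution. The function names `layer_norm_np` and `layer_norm_exercise_np` and the small input array are made up for illustration. The first function puts epsilon under the square root, as the layer above does; the second follows the exercise text literally and adds epsilon to σ.

```python
import numpy as np

def layer_norm_np(X, alpha, beta, epsilon=0.001):
    """Same arithmetic as call() above: epsilon inside the square root."""
    mean = X.mean(axis=-1, keepdims=True)
    variance = X.var(axis=-1, keepdims=True)  # population variance (ddof=0)
    return alpha * (X - mean) / np.sqrt(variance + epsilon) + beta

def layer_norm_exercise_np(X, alpha, beta, epsilon=0.001):
    """Literal reading of the exercise: epsilon added to the standard deviation."""
    mean = X.mean(axis=-1, keepdims=True)
    std = X.std(axis=-1, keepdims=True)
    return alpha * (X - mean) / (std + epsilon) + beta

# Hypothetical input: the second row has zero variance on purpose.
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 10.0, 10.0, 10.0]], dtype=np.float32)
alpha = np.ones(4, dtype=np.float32)
beta = np.zeros(4, dtype=np.float32)

out = layer_norm_np(X, alpha, beta)
out_exercise = layer_norm_exercise_np(X, alpha, beta)

print(out)                               # zero-variance row maps to all zeros
print(np.abs(out - out_exercise).max())  # the two variants differ only slightly
```

Both variants stay finite on the zero-variance row. They are not numerically identical, so the placement of epsilon matters when comparing against another implementation.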

*c. Ensure that your custom layer produces the same (or very nearly the same)
output as the keras.layers.LayerNormalization layer.*

The author tested this custom layer on the California housing dataset. We'll do the same. Let's import it and split it into training, validation, and testing sets.

In [None]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [None]:
housing = fetch_california_housing()

Downloading Cal. housing from https://ndownloader.figshare.com/files/5976036 to /root/scikit_learn_data


In [None]:
X_train_all, X_test, y_train_all, y_test = train_test_split(
    housing.data, housing.target.reshape(-1, 1), random_state=999)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_all, y_train_all, random_state=999)
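
For reference, the sizes of the three sets can be worked out by hand. This sketch assumes the dataset has 20,640 rows and that `train_test_split` holds out 25% of its input when no size is given, rounding the held-out count up. The variable names are made up for illustration.

```python
import math

n_samples = 20640      # assumed row count of the California housing dataset
test_fraction = 0.25   # assumed default hold-out fraction of train_test_split

n_test = math.ceil(n_samples * test_fraction)   # first split: test set
n_train_all = n_samples - n_test
n_val = math.ceil(n_train_all * test_fraction)  # second split: validation set
n_train = n_train_all - n_val

print(n_train, n_val, n_test)  # 11610 3870 5160
```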

Now we need to convert the data to 32-bit floats for use in TensorFlow. To check whether the two layers behave the same, we'll compute the mean absolute error between their outputs and average it over all instances.

In [None]:
# Convert training features to 32-bit floats
X_train_32 = X_train.astype(np.float32)

In [None]:
# Instantiate both the custom layer and Keras's LayerNormalization
custom_ln = LayerNormalization()
keras_ln = keras.layers.LayerNormalization()

In [None]:
# Mean absolute error between the two layers' outputs, averaged over all instances
tf.reduce_mean(keras.losses.mean_absolute_error(
    keras_ln(X_train_32), custom_ln(X_train_32)))

<tf.Tensor: shape=(), dtype=float32, numpy=3.7963805e-08>

The difference between the two outputs is on the order of 1e-8, so with their default weights the two layers produce essentially the same result.
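
To make the metric concrete, here is a NumPy sketch of what the comparison above computes: a mean absolute error per instance (over the feature axis), followed by a mean over instances. The arrays `a` and `b` are made-up stand-ins for the two layers' outputs, not real results.

```python
import numpy as np

# Hypothetical outputs of two layers for 2 instances with 3 features each.
a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
b = a + np.array([[0.0, 0.1, 0.2],
                  [0.3, 0.0, 0.0]])

per_instance_mae = np.abs(a - b).mean(axis=-1)  # one value per row
overall = per_instance_mae.mean()               # single scalar summary

print(per_instance_mae.shape)  # (2,)
print(overall)
```

An alternative check, if an elementwise pass/fail answer is preferred over a single average, is `np.allclose(a, b, atol=...)` with a tolerance chosen for the data type.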

Just to be sure, the author also compared the two layers after setting alpha and beta to random values, so that the check does not rely on the default weights of 1 and 0. We'll do so as well.

In [None]:
# Randomly generated values for alpha and beta, one per feature
random_alpha = np.random.rand(X_train_32.shape[-1])
random_beta = np.random.rand(X_train_32.shape[-1])

In [None]:
# Give both layers the same weights
custom_ln.set_weights([random_alpha, random_beta])
keras_ln.set_weights([random_alpha, random_beta])

In [None]:
# Mean absolute error between the two layers' outputs, averaged over all instances
tf.reduce_mean(keras.losses.mean_absolute_error(
    keras_ln(X_train_32), custom_ln(X_train_32)))

<tf.Tensor: shape=(), dtype=float32, numpy=2.4424876e-08>

Again, the difference is negligibly small. The custom layer norm works as hoped.
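
One way to judge "negligibly small" is to compare the reported differences with the resolution of the float32 type itself. The sketch below copies the two values from the outputs above and compares them with float32 machine epsilon, the gap between 1.0 and the next representable float32. The variable names are made up for illustration.

```python
import numpy as np

eps32 = np.finfo(np.float32).eps  # gap between 1.0 and the next float32
print(eps32)                      # about 1.19e-07

# Differences reported by the two comparisons above.
observed = [3.7963805e-08, 2.4424876e-08]
print(all(d < eps32 for d in observed))  # True
```

Both differences fall below float32 machine epsilon, which is consistent with the two layers differing only by rounding.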