# Advanced Regularization Methods

This notebook explores more sophisticated regularization techniques in TensorFlow:

1. Weight initialization strategies
2. Batch normalization
3. Custom dropout and regularization implementations
4. Callbacks and TensorBoard integration

We'll demonstrate how these techniques affect model performance and provide code for implementing them in your own projects.

## Data Preparation for Advanced Regularization

We start by loading the Fashion MNIST dataset and preparing it for the experiments that follow.

Key aspects of the data preparation:
1. **Data Normalization**: Pixel values are scaled to the range [0, 1] by dividing by 255.
2. **Validation Split**: We hold out the first 5,000 training examples as a validation set for monitoring model performance.
3. **Dataset Reduction**: To speed up execution, we train on only the first 10,000 of the remaining 55,000 training examples.

This setup allows us to quickly test different regularization techniques while still providing meaningful comparisons.

In [2]:
# Import necessary libraries
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Dropout, BatchNormalization
from tensorflow.keras.initializers import RandomNormal, GlorotNormal, GlorotUniform, HeNormal, HeUniform
import matplotlib.pyplot as plt
import numpy as np
import os
import datetime

# Load and preprocess the Fashion MNIST dataset
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # Normalize pixel values

# Split the training data to create a validation set
val_split = 5000
x_val, y_val = x_train[:val_split], y_train[:val_split]
x_train, y_train = x_train[val_split:], y_train[val_split:]

# Reduce dataset size for faster execution
train_size = 10000
x_train, y_train = x_train[:train_size], y_train[:train_size]

print(f"Training set shape: {x_train.shape}")
print(f"Validation set shape: {x_val.shape}")
print(f"Test set shape: {x_test.shape}")

Training set shape: (10000, 28, 28)
Validation set shape: (5000, 28, 28)
Test set shape: (10000, 28, 28)


## Weight Initialization Strategies

Weight initialization is crucial for neural network performance. Poor initialization can lead to vanishing or exploding gradients, slowing down or preventing learning.

This code compares five common initializers, which fall into three families:

1. **RandomNormal**: Draws weights from a normal distribution with a small, fixed standard deviation (0.01 here). Because the scale ignores layer size, it is usually a poor choice for deep networks.

2. **GlorotNormal/GlorotUniform (Xavier)**: Scales the weights by the layer's fan-in and fan-out to keep the scale of activations and gradients roughly constant across layers. Works well with symmetric activation functions like tanh.

3. **HeNormal/HeUniform**: Variants of Xavier initialization that scale by fan-in only, with a larger variance to compensate for ReLU zeroing out negative inputs. Generally preferred for ReLU networks.

Each model is trained for 5 epochs with identical architecture except for the weight initialization strategy. The training histories are stored for comparison to determine which strategy works best for this dataset.

When to use each initializer:
- For ReLU activation: He initializers
- For tanh/sigmoid activation: Glorot initializers
- For very deep networks: Specialized initializers may be needed
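
To make the difference between these schemes concrete, the sketch below computes the scale each one would use for the first hidden layer of this notebook's model (784 inputs, 128 units). It uses the standard Glorot and He formulas; the helper function names are illustrative and are not part of the Keras API.

```python
import math

def glorot_normal_stddev(fan_in, fan_out):
    # Glorot/Xavier normal: variance = 2 / (fan_in + fan_out)
    return math.sqrt(2.0 / (fan_in + fan_out))

def glorot_uniform_limit(fan_in, fan_out):
    # Glorot/Xavier uniform: weights drawn from [-limit, limit]
    return math.sqrt(6.0 / (fan_in + fan_out))

def he_normal_stddev(fan_in):
    # He normal: variance = 2 / fan_in
    return math.sqrt(2.0 / fan_in)

def he_uniform_limit(fan_in):
    # He uniform: weights drawn from [-limit, limit]
    return math.sqrt(6.0 / fan_in)

# First hidden layer of the model in this notebook: 28*28 inputs -> 128 units
fan_in, fan_out = 28 * 28, 128

scales = {
    "RandomNormal stddev (fixed)": 0.01,
    "GlorotNormal stddev": glorot_normal_stddev(fan_in, fan_out),
    "GlorotUniform limit": glorot_uniform_limit(fan_in, fan_out),
    "HeNormal stddev": he_normal_stddev(fan_in),
    "HeUniform limit": he_uniform_limit(fan_in),
}

for name, value in scales.items():
    print(f"{name}: {value:.4f}")
```

For this layer, both the Glorot and He scales come out several times larger than the fixed 0.01 used by the RandomNormal baseline.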

In [3]:
# Create models with different weight initialization strategies
def create_model_with_initializer(initializer_name, initializer):
    model = Sequential([
        Flatten(input_shape=(28, 28)),
        Dense(128, activation='relu', kernel_initializer=initializer),
        Dense(64, activation='relu', kernel_initializer=initializer),
        Dense(10, activation='softmax', kernel_initializer=initializer)
    ])

    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    return model

# Define different initializers to compare
initializers = {
    'RandomNormal': RandomNormal(mean=0.0, stddev=0.01),
    'GlorotNormal': GlorotNormal(),  # Suitable for tanh activation
    'GlorotUniform': GlorotUniform(),  # Xavier uniform, good for tanh
    'HeNormal': HeNormal(),  # Better for ReLU activation
    'HeUniform': HeUniform()  # Better for ReLU activation
}

# Train models with different initializers
histories = {}
EPOCHS = 5
BATCH_SIZE = 128

for name, initializer in initializers.items():
    print(f"Training model with {name} initializer...")
    model = create_model_with_initializer(name, initializer)
    history = model.fit(
        x_train, y_train,
        epochs=EPOCHS,
        batch_size=BATCH_SIZE,
        validation_data=(x_val, y_val),
        verbose=1
    )
    histories[name] = history

Training model with RandomNormal initializer...
Epoch 1/5
79/79 ━━━━━━━━━━━━━━━━━━━━ 4s 23ms/step - accuracy: 0.2964 - loss: 1.8895 - val_accuracy: 0.6128 - val_loss: 0.9352
Epoch 2/5
79/79 ━━━━━━━━━━━━━━━━━━━━ 2s 5ms/step - accuracy: 0.6355 - loss: 0.8879 - val_accuracy: 0.7168 - val_loss: 0.7782
Epoch 3/5
79/79 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.7149 - loss: 0.7452 - val_accuracy: 0.7276 - val_loss: 0.7082
Epoch 4/5
79/79 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.7462 - loss: 0.6576 - val_accuracy: 0.7764 - val_loss: 0.6228
Epoch 5/5
79/79 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.7723 - loss: 0.6126 - val_accuracy: 0.7984 - val_loss: 0.5885
Training model with GlorotNormal initializer...
Epoch 1/5
79/79 ━━━━━━━━━━━━━━━━━━━━ 2s 15ms/step - accuracy: 0.5963 - loss: 1.1732 - val_accuracy: 0.7984 - val_loss: 0.5

## Batch Normalization

Batch normalization stabilizes training by normalizing each layer's inputs using the statistics of the current mini-batch. This implementation places a BatchNorm layer between each hidden Dense layer and its activation; the learnable scale and shift parameters let the network adjust the distribution that reaches each activation.

Key benefits:
- Often speeds up training (an effect originally attributed to reducing internal covariate shift)
- Acts as a regularizer, often reducing the need for dropout
- Allows higher learning rates
- Makes the model less sensitive to weight initialization
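
As a rough illustration of what a BatchNorm layer computes in training mode, here is a minimal pure-Python sketch for a single feature across one mini-batch. The function name and the sample batch are hypothetical; `epsilon=1e-3` is chosen to match the Keras default, and `gamma`/`beta` stand in for the learnable scale and shift.

```python
import math

def batch_norm_train(values, gamma=1.0, beta=0.0, epsilon=1e-3):
    """Normalize one feature across a mini-batch, then scale and shift."""
    n = len(values)
    mean = sum(values) / n
    # Biased (population) variance of the mini-batch
    variance = sum((v - mean) ** 2 for v in values) / n
    inv_std = 1.0 / math.sqrt(variance + epsilon)
    return [gamma * (v - mean) * inv_std + beta for v in values]

# Hypothetical pre-activation values of one unit for a batch of four examples
batch = [2.0, 4.0, 6.0, 8.0]

normalized = batch_norm_train(batch)
shifted = batch_norm_train(batch, gamma=2.0, beta=1.0)

print([round(v, 3) for v in normalized])
print([round(v, 3) for v in shifted])
```

At inference time the layer does not use the current batch's statistics; it relies on moving averages of the mean and variance accumulated during training, which this sketch leaves out.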

In [4]:
# Create a model with Batch Normalization
def create_batch_norm_model():
    model = Sequential([
        Flatten(input_shape=(28, 28)),
        Dense(128),  # No activation here since BatchNorm will be applied before activation
        BatchNormalization(),
        tf.keras.layers.Activation('relu'),
        Dense(64),
        BatchNormalization(),
        tf.keras.layers.Activation('relu'),
        Dense(10, activation='softmax')
    ])

    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    return model

# Train the batch normalization model
batch_norm_model = create_batch_norm_model()
batch_norm_history = batch_norm_model.fit(
    x_train, y_train,
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    validation_data=(x_val, y_val),
    verbose=1
)

Epoch 1/5
79/79 ━━━━━━━━━━━━━━━━━━━━ 4s 26ms/step - accuracy: 0.6447 - loss: 1.1119 - val_accuracy: 0.8072 - val_loss: 0.8414
Epoch 2/5
79/79 ━━━━━━━━━━━━━━━━━━━━ 1s 8ms/step - accuracy: 0.8376 - loss: 0.4830 - val_accuracy: 0.8418 - val_loss: 0.5531
Epoch 3/5
79/79 ━━━━━━━━━━━━━━━━━━━━ 1s 8ms/step - accuracy: 0.8650 - loss: 0.4012 - val_accuracy: 0.8398 - val_loss: 0.4721
Epoch 4/5
79/79 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.8742 - loss: 0.3582 - val_accuracy: 0.8324 - val_loss: 0.4740
Epoch 5/5
79/79 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.8973 - loss: 0.3007 - val_accuracy: 0.8522 - val_loss: 0.4259


## Custom Dropout Implementation

This code implements a custom dropout layer by subclassing `tf.keras.layers.Layer`. During training, the layer zeroes a random fraction `rate` of its input units and, through `tf.nn.dropout`, scales the remaining units by `1 / (1 - rate)`; during inference it passes inputs through unchanged.

Our implementation mirrors standard dropout functionality but allows for future customization of dropout behavior for specialized applications.
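
The train/inference behavior described above can be sketched without TensorFlow. The following is a minimal pure-Python version of "inverted" dropout, the scheme `tf.nn.dropout` follows; the function name, the `rng` parameter, and the sample activations are assumptions made for illustration.

```python
import random

def inverted_dropout(values, rate, training, rng=None):
    """Zero each value with probability `rate`; rescale survivors by 1 / (1 - rate)."""
    if not training or rate == 0.0:
        # Inference (or no dropout): pass inputs through unchanged
        return list(values)
    if rng is None:
        rng = random.Random()
    keep_prob = 1.0 - rate
    # Rescaling keeps the expected value of each unit the same as without dropout
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]

# Hypothetical activations from a hidden layer
activations = [1.0] * 10

dropped = inverted_dropout(activations, rate=0.3, training=True, rng=random.Random(0))
passthrough = inverted_dropout(activations, rate=0.3, training=False)

print(dropped)
print(passthrough)
```

Because the surviving units are scaled up during training, no extra scaling is needed at inference time, which is why the inference branch can simply return the inputs.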

In [5]:
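# Custom dropout layer (wraps tf.nn.dropout so the behavior can be customized later)
class CustomDropout(tf.keras.layers.Layer):
    def __init__(self, rate, noise_shape=None, seed=None, **kwargs):
        super().__init__(**kwargs)
        self.rate = rate                # Fraction of input units to drop
        self.noise_shape = noise_shape  # Optional shape of the dropout mask
        self.seed = seed                # Optional seed for the random mask

    def call(self, inputs, training=None):
        # Apply dropout only in training mode; kept units are rescaled by 1 / (1 - rate)
        if training:
            return tf.nn.dropout(
                inputs,
                rate=self.rate,
                noise_shape=self.noise_shape,
                seed=self.seed
            )
        # Inference mode: return inputs unchanged
        return inputs

    def get_config(self):
        # Include constructor arguments so the layer can be serialized and re-created
        config = super().get_config()
        config.update({
            'rate': self.rate,
            'noise_shape': self.noise_shape,
            'seed': self.seed,
        })
        return config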

# Create a model with custom dropout
def create_custom_dropout_model(dropout_rate=0.3):
    model = Sequential([
        Flatten(input_shape=(28, 28)),
        Dense(128, activation='relu'),
        CustomDropout(dropout_rate),
        Dense(64, activation='relu'),
        CustomDropout(dropout_rate),
        Dense(10, activation='softmax')
    ])

    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    return model

# Train the custom dropout model
custom_dropout_model = create_custom_dropout_model()
custom_dropout_history = custom_dropout_model.fit(
    x_train, y_train,
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    validation_data=(x_val, y_val),
    verbose=1
)

Epoch 1/5
79/79 ━━━━━━━━━━━━━━━━━━━━ 4s 24ms/step - accuracy: 0.4254 - loss: 1.6172 - val_accuracy: 0.7580 - val_loss: 0.6932
Epoch 2/5
79/79 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.6947 - loss: 0.8184 - val_accuracy: 0.8114 - val_loss: 0.5544
Epoch 3/5
79/79 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - accuracy: 0.7564 - loss: 0.6725 - val_accuracy: 0.8356 - val_loss: 0.4905
Epoch 4/5
79/79 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - accuracy: 0.7887 - loss: 0.5794 - val_accuracy: 0.8340 - val_loss: 0.4699
Epoch 5/5
79/79 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.8002 - loss: 0.5509 - val_accuracy: 0.8416 - val_loss: 0.4514


## Custom Regularizer and TensorBoard Setup

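This code implements a custom regularizer that adds two penalties to the loss: an L1 term (the sum of absolute weights, scaled by `l1`) and an L2 term (the sum of squared weights, scaled by `l2`). We also set up a TensorBoard callback that logs training metrics, weight histograms, and the model graph under `logs/fit/`.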
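
To see what the combined penalty amounts to numerically, here is a small pure-Python sketch of the same computation on a toy weight matrix. The function name and the weight values are made up for illustration; the coefficients match the defaults used for the model in this section (`l1=0.001`, `l2=0.001`).

```python
def l1_l2_penalty(weights, l1=0.0, l2=0.0):
    """Return l1 * sum(|w|) + l2 * sum(w^2) over all entries of a 2-D weight list."""
    flat = [w for row in weights for w in row]
    l1_term = l1 * sum(abs(w) for w in flat)
    l2_term = l2 * sum(w * w for w in flat)
    return l1_term + l2_term

# Toy 2x2 weight matrix (hypothetical values)
weights = [[1.0, -2.0], [0.5, 0.0]]

# sum(|w|) = 3.5 and sum(w^2) = 5.25
penalty = l1_l2_penalty(weights, l1=0.001, l2=0.001)
print(f"{penalty:.5f}")  # 0.00875
```

Keras adds each layer's regularization penalty to the loss it reports, so loss values from a regularized model are not directly comparable with those from the unregularized models earlier in the notebook.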

In [6]:
# Custom L1L2 Regularizer
class CustomL1L2Regularizer(tf.keras.regularizers.Regularizer):
    def __init__(self, l1=0.0, l2=0.0):
        self.l1 = l1
        self.l2 = l2

    def __call__(self, weights):
        l1_loss = tf.reduce_sum(tf.abs(weights)) * self.l1
        l2_loss = tf.reduce_sum(tf.square(weights)) * self.l2
        return l1_loss + l2_loss

    def get_config(self):
        return {'l1': float(self.l1), 'l2': float(self.l2)}

# Set up TensorBoard
log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir=log_dir,
    histogram_freq=1,
    write_graph=True,
    write_images=True,
    update_freq='epoch'
)

# Create a model with custom regularization
def create_custom_reg_model(l1=0.001, l2=0.001):
    model = Sequential([
        Flatten(input_shape=(28, 28)),
        Dense(128, activation='relu', kernel_regularizer=CustomL1L2Regularizer(l1=l1, l2=l2)),
        Dense(64, activation='relu', kernel_regularizer=CustomL1L2Regularizer(l1=l1, l2=l2)),
        Dense(10, activation='softmax')
    ])

    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    return model

## Training with Callbacks and Evaluation

This code demonstrates the use of various callbacks in Keras:

1. **LearningRateScheduler**: Keeps the learning rate fixed for the first three epochs, then multiplies it by 0.9 at the start of each later epoch
2. **ModelCheckpoint**: Saves the best model based on validation accuracy
3. **TensorBoard**: Records metrics for visualization

We then compare all the regularization techniques by plotting validation accuracy and printing final performance metrics.
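
Before running the training cell, it can help to preview the learning rates the schedule will produce. The sketch below replays the same rule outside of Keras, assuming the optimizer starts at 0.001 (the Keras default for Adam); `lr_scheduler_sketch` and `trace_schedule` are illustrative names, not Keras functions.

```python
def lr_scheduler_sketch(epoch, lr):
    # Same rule as the notebook's scheduler: hold for 3 epochs, then decay by 10%
    if epoch < 3:
        return float(lr)
    return float(lr * 0.9)

def trace_schedule(initial_lr, epochs):
    """Replay the schedule, feeding each epoch's result into the next epoch."""
    rates = []
    lr = initial_lr
    for epoch in range(epochs):
        lr = lr_scheduler_sketch(epoch, lr)
        rates.append(lr)
    return rates

# Assumed starting rate: 0.001, the default learning rate of Keras's Adam optimizer
rates = trace_schedule(initial_lr=0.001, epochs=5)
print([f"{r:.2e}" for r in rates])
```

The decay compounds because Keras passes the optimizer's current learning rate into the schedule function at the start of every epoch.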

In [8]:
# LearningRateScheduler callback: hold the rate for 3 epochs, then decay it by 10% per epoch
def lr_scheduler(epoch, lr):
    if epoch < 3:
        return float(lr)  # Return a plain Python float, as the callback expects
    else:
        return float(lr * 0.9)  # Multiplicative decay applied to the current rate

lr_callback = tf.keras.callbacks.LearningRateScheduler(lr_scheduler)

# Create ModelCheckpoint callback
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    'best_model.h5',
    monitor='val_accuracy',
    save_best_only=True,
    mode='max',
    verbose=1
)

# Train with custom regularization and callbacks
custom_reg_model = create_custom_reg_model()
custom_reg_history = custom_reg_model.fit(
    x_train, y_train,
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    validation_data=(x_val, y_val),
    callbacks=[tensorboard_callback, lr_callback, checkpoint_callback],
    verbose=1
)

Epoch 1/5
79/79 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - accuracy: 0.5292 - loss: 5.6047
Epoch 1: val_accuracy improved from -inf to 0.76980, saving model to best_model.h5
79/79 ━━━━━━━━━━━━━━━━━━━━ 4s 25ms/step - accuracy: 0.5309 - loss: 5.5899 - val_accuracy: 0.7698 - val_loss: 2.7696 - learning_rate: 0.0010
Epoch 2/5
70/79 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.7759 - loss: 2.4429
Epoch 2: val_accuracy improved from 0.76980 to 0.79440, saving model to best_model.h5
79/79 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - accuracy: 0.7765 - loss: 2.4099 - val_accuracy: 0.7944 - val_loss: 1.7532 - learning_rate: 0.0010
Epoch 3/5
71/79 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.7893 - loss: 1.6601
Epoch 3: val_accuracy improved from 0.79440 to 0.79560, saving model to best_model.h5
79/79 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.7897 - loss: 1.6496 - val_accuracy: 0.7956 - val_loss: 1.4059 - learning_rate: 0.0010
Epoch 4/5
69/79 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.8083 - loss: 1.3707
Epoch 4: val_accuracy improved from 0.79560 to 0.80940, saving model to best_model.h5
79/79 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - accuracy: 0.8073 - loss: 1.3661 - val_accuracy: 0.8094 - val_loss: 1.2549 - learning_rate: 9.0000e-04
Epoch 5/5
71/79 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.8089 - loss: 1.2354
Epoch 5: val_accuracy did not improve from 0.80940
79/79 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - accuracy: 0.8085 - loss: 1.2334 - val_accuracy: 0.7958 - val_loss: 1.1914 - learning_rate: 8.1000e-04
