## Regularisation and optimisation techniques

### Introduction
When training neural networks, it’s not enough to simply define an architecture and pick an activation function. We also need to control overfitting and improve convergence. This is where *regularisation* and *optimisation* techniques come into play. We will show why regularisation is important and demonstrate *dropout* and *L2 weight decay* on a simple model. We will also compare different *optimisers* (SGD, RMSProp, Adam) to see how they affect training, and illustrate how these techniques can improve validation loss.

*Regularisation* is about preventing a model from fitting the training data too closely, a scenario called *overfitting*. If a model memorises the training data, it might perform poorly on unseen data. By introducing controlled constraints or noise, we encourage the network to learn general patterns rather than memorising specifics.

Common regularisation methods include:
- *Dropout*: Randomly "drops" neurons during training so that the network cannot rely on any specific neuron or path.
- *L2 Weight Decay* (or Ridge regularisation): Penalises large weights to encourage smaller, more stable parameter values.
- *L1* (or Lasso): Encourages sparsity by driving some weights to zero.
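
To make the difference between the two penalties concrete, here is a minimal sketch in plain Python. The helper names and the coefficient `lam` are illustrative assumptions, not part of any library. It computes each penalty for a small weight vector, together with the gradient that the penalty adds to each weight.

```python
def l2_penalty(weights, lam):
    """L2 penalty: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)


def l1_penalty(weights, lam):
    """L1 penalty: lam times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)


def l2_grad(weights, lam):
    """Gradient of the L2 penalty: proportional to each weight."""
    return [2.0 * lam * w for w in weights]


def l1_grad(weights, lam):
    """(Sub)gradient of the L1 penalty: fixed size, sign of each weight."""
    return [lam * ((w > 0) - (w < 0)) for w in weights]


weights = [2.0, -0.5, 0.01]
lam = 0.1  # illustrative value, chosen only to keep the numbers readable

print("L2 penalty:", l2_penalty(weights, lam))
print("L1 penalty:", l1_penalty(weights, lam))
print("L2 gradient:", l2_grad(weights, lam))
print("L1 gradient:", l1_grad(weights, lam))
```

The L2 gradient shrinks along with the weight, so weights that are already small are barely pushed any further. The L1 gradient keeps the same size however small the weight is, which is why L1 tends to drive some weights all the way to zero.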

### Why different optimisers?
While *SGD* (Stochastic Gradient Descent) is the original go-to method, more advanced optimisers like *RMSProp* and *Adam* adapt the learning rate for each parameter, often leading to *faster convergence* with less manual tuning.

- *SGD*: Updates weights by stepping against the average gradient of a mini-batch, using one learning rate for all parameters.
- *RMSProp*: Keeps a moving average of squared gradients, adapting each parameter’s learning rate.
- *Adam*: Combines RMSProp and Momentum ideas, often converging quickly in practice.
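
The three update rules can be sketched in a few lines of plain Python. This is a textbook-style illustration under stated assumptions: the function names are hypothetical, the hyperparameter defaults are only typical values, and library implementations differ in small details. It applies one update to two weights whose gradients differ in scale by a factor of 10,000.

```python
import math


def sgd_step(w, g, lr=0.01):
    """Plain SGD: every parameter uses the same learning rate."""
    return [wi - lr * gi for wi, gi in zip(w, g)]


def rmsprop_step(w, g, v, lr=0.001, rho=0.9, eps=1e-7):
    """RMSProp: divide each step by a running RMS of that parameter's gradients."""
    v = [rho * vi + (1 - rho) * gi * gi for vi, gi in zip(v, g)]
    w = [wi - lr * gi / (math.sqrt(vi) + eps) for wi, gi, vi in zip(w, g, v)]
    return w, v


def adam_step(w, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-7):
    """Adam: momentum-style average m plus RMSProp-style average v, bias-corrected."""
    m = [b1 * mi + (1 - b1) * gi for mi, gi in zip(m, g)]
    v = [b2 * vi + (1 - b2) * gi * gi for vi, gi in zip(v, g)]
    m_hat = [mi / (1 - b1 ** t) for mi in m]
    v_hat = [vi / (1 - b2 ** t) for vi in v]
    w = [wi - lr * mh / (math.sqrt(vh) + eps) for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w, m, v


w = [1.0, 1.0]
g = [100.0, 0.01]  # two gradients on very different scales (illustrative values)

sgd_w = sgd_step(w, g)
rms_w, _ = rmsprop_step(w, g, v=[0.0, 0.0])
adam_w, _, _ = adam_step(w, g, m=[0.0, 0.0], v=[0.0, 0.0], t=1)

# Size of the step each optimiser took on each weight
step_sizes = {}
for name, new_w in [("SGD", sgd_w), ("RMSProp", rms_w), ("Adam", adam_w)]:
    step_sizes[name] = [old - new for old, new in zip(w, new_w)]
    print(name, step_sizes[name])
```

With plain SGD the step on the first weight is 10,000 times larger than the step on the second. RMSProp and Adam normalise by the gradient magnitude, so both weights move by roughly the same amount.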

### Install Python libraries

In [None]:
!pip install tensorflow torch numpy matplotlib

### The CIFAR-10 dataset

The *CIFAR-10* (Canadian Institute for Advanced Research) dataset is one of the most popular datasets used for developing and benchmarking computer vision models. It is commonly used in deep learning projects, especially for beginners who want to build image classification models with neural networks. The dataset was created by Alex Krizhevsky, Geoffrey Hinton, and Vinod Nair in 2009 and is available freely for academic and educational use.

CIFAR-10 consists of 60,000 colour images with a resolution of 32x32 pixels. Each image belongs to one of 10 classes, and the classes are chosen to be mutually exclusive and visually distinct. This relatively small image size makes the dataset lightweight and fast to train on, which is great for learning and experimenting without needing powerful hardware.

The dataset is split into a training set of 50,000 images and a test set of 10,000 images. Each class has exactly 6,000 images, providing a balanced distribution that helps in training fair models. All images are in RGB (3 channels), and the pixel values range from 0 to 255 (often normalised to the 0–1 range during preprocessing).

CIFAR-10 is especially useful for demonstrating the effects of different training techniques, such as regularisation, dropout, data augmentation, and convolutional architectures. Because it's easy to overfit on this dataset with a complex model, it's a perfect choice to illustrate how regularisation methods help improve generalisation.

Let's look at the different categories of data:

| Label | Category    | Description                                                 |
|-------|-------------|-------------------------------------------------------------|
| 0     | Airplane    | Images of aircraft, often in flight, with sky backgrounds. |
| 1     | Automobile  | Sedans, SUVs and similar cars; no trucks or pickup trucks.  |
| 2     | Bird        | Side or top views of birds, sometimes in natural settings.  |
| 3     | Cat         | Domestic cats in different poses and environments.          |
| 4     | Deer        | Wild deer in forest or grass backgrounds.                   |
| 5     | Dog         | Various dog breeds, similar to cats in image composition.   |
| 6     | Frog        | Frogs typically in outdoor settings like grass or ponds.    |
| 7     | Horse       | Horses in side view, often in outdoor or farm settings.     |
| 8     | Ship        | Boats and ships, including container ships and sailboats.   |
| 9     | Truck       | Large trucks only; pickup trucks are not included.          |

This diversity of image types provides a good challenge for image classifiers. The small image size forces models to learn efficient, compact feature representations, and the simplicity of the dataset allows us to focus on model design and training techniques without getting overwhelmed by data processing or scale.


### Loading the data



In [None]:
import os
import urllib.request
import tarfile

# URL and target path
cifar_url = "https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz"
target_path = "cifar-10-python.tar.gz"
extract_path = "cifar-10-batches-py"

# Download if not already present
if not os.path.exists(target_path):
    print("Downloading CIFAR-10 dataset...")
    urllib.request.urlretrieve(cifar_url, target_path)

# Extract if not already extracted
if not os.path.exists(extract_path):
    print("Extracting CIFAR-10 dataset...")
    with tarfile.open(target_path, "r:gz") as tar:
        tar.extractall()


Load data from extracted pickle files:

In [None]:
import pickle
import numpy as np

def load_batch(filepath):
    with open(filepath, 'rb') as f:
        batch = pickle.load(f, encoding='bytes')
        data = batch[b'data']
        labels = batch[b'labels']
        # Reshape and normalise
        data = data.reshape(-1, 3, 32, 32).astype("float32") / 255.0
        return data, np.array(labels)

limit_train = 400
limit_test = 100

# Load a small sample of the training set (e.g., first 400 samples)
data, labels = load_batch(f"{extract_path}/data_batch_1")

X_train = data[:limit_train].transpose(0, 2, 3, 1)  
Y_train = labels[:limit_train]

# Load a small sample of the test set (e.g., first 100 samples)
X_test_full, Y_test_full = load_batch(f"{extract_path}/test_batch")
X_test = X_test_full[:limit_test].transpose(0, 2, 3, 1)
Y_test = Y_test_full[:limit_test]
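
The `reshape` and `transpose` calls are easier to follow on a tiny synthetic example. The sketch below builds a fake batch in the flat layout the CIFAR-10 Python files use (3,072 values per image: 1,024 red, then 1,024 green, then 1,024 blue) and applies the same two steps as the loading code. The fill values are arbitrary and chosen only so the channels are easy to tell apart.

```python
import numpy as np

# A fake "batch" of 2 images in the flat CIFAR-10 layout:
# 3072 values per row, i.e. 1024 red, then 1024 green, then 1024 blue.
n_images = 2
flat = np.zeros((n_images, 3072), dtype=np.uint8)
flat[:, 0:1024] = 255      # red plane
flat[:, 1024:2048] = 128   # green plane
flat[:, 2048:3072] = 0     # blue plane

# Same first step as load_batch: split into channel planes and scale to 0-1
channels_first = flat.reshape(-1, 3, 32, 32).astype("float32") / 255.0

# Same second step as the loading cell: move the channel axis to the end,
# so each pixel holds its own (R, G, B) triple
channels_last = channels_first.transpose(0, 2, 3, 1)

print(channels_first.shape)      # (2, 3, 32, 32)
print(channels_last.shape)       # (2, 32, 32, 3)
print(channels_last[0, 0, 0])    # R, G, B values of the top-left pixel
```

The channels-last layout `(N, 32, 32, 3)` is what `plt.imshow` expects for a single image, and it matches the `input_shape=(32, 32, 3)` used by the Keras models later on.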


Now that we have our data loaded, we visualise a subset of the training images in a grid, each annotated with its corresponding class label, to gain a better understanding of the data we will be working with:

In [None]:
import matplotlib.pyplot as plt

# Define label names for CIFAR-10
cifar10_labels = ['airplane', 'automobile', 'bird', 'cat', 'deer',
                  'dog', 'frog', 'horse', 'ship', 'truck']

# Function to plot a grid of images
def plot_cifar_images(images, labels, class_names, n=5):
    plt.figure(figsize=(10, 10))
    for i in range(n * n):
        plt.subplot(n, n, i + 1)
        plt.imshow(images[i])
        plt.title(class_names[labels[i]])
        plt.axis("off")
    plt.tight_layout()
    plt.show()

# Plot a 5x5 grid of training images
plot_cifar_images(X_train, Y_train, cifar10_labels, n=5)


### Adding Regularisation to a CNN
In this exercise, we extend a basic convolutional neural network (CNN) for the CIFAR-10 dataset by incorporating regularisation techniques to improve generalisation and reduce overfitting. The model consists of two convolutional blocks followed by fully connected layers, and we apply two forms of regularisation throughout:

1. *Dropout*: Introduced after each pooling layer and before the final dense layers, dropout randomly deactivates a proportion of neurons during training, forcing the network to develop more redundant and general features.

2. *L2 Regularisation* (weight decay): Applied to the convolutional and dense layers, this technique penalises large weights by adding a term to the loss function, encouraging the model to learn simpler, more stable patterns.
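
As a rough illustration of what a dropout layer does to its inputs, here is a NumPy sketch of "inverted" dropout. It is a hypothetical helper written for this explanation, not the Keras implementation, although the Keras `Dropout` layer likewise scales the kept inputs by `1 / (1 - rate)`.

```python
import numpy as np


def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero a fraction `rate` of entries and rescale the rest."""
    if not training or rate == 0.0:
        return x  # at inference time the inputs pass through unchanged
    keep_prob = 1.0 - rate
    mask = rng.random(x.shape) < keep_prob   # True where a unit is kept
    return x * mask / keep_prob              # rescale so the expected value is unchanged


rng = np.random.default_rng(0)  # fixed seed, only so the sketch is repeatable
activations = np.ones((1000, 100), dtype="float32")

dropped = dropout(activations, rate=0.2, rng=rng)
unchanged = dropout(activations, rate=0.2, rng=rng, training=False)

print("fraction zeroed:", float((dropped == 0).mean()))
print("mean after dropout:", float(dropped.mean()))
print("inference output unchanged:", np.array_equal(unchanged, activations))
```

About 20% of the entries are zeroed, and the survivors are scaled up to 1.25, so the average activation stays close to 1. This is why no extra rescaling is needed when dropout is switched off at inference time.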

The model is compiled using the Adam optimiser and trained using sparse categorical cross-entropy, which is suitable for multi-class classification with integer labels. If we compare the performance of this regularised model against a baseline without dropout and L2 penalties, we can observe how regularisation helps control overfitting and improves validation performance on unseen data:


In [None]:
# Set a common number of epochs for our experiment
num_epochs = 20

### Train model without Regularisation

In this step, we will train a version of the model *without* any regularisation techniques, in order to compare its performance with the regularised version. Specifically, we will remove both *dropout* and *L2 regularisation* from the architecture.

Regularisation methods like dropout and L2 penalty are designed to reduce overfitting by making the model less sensitive to noise in the training data. When we remove them, the model is more likely to memorise the training set, especially if the dataset is small or complex. As a result, we might observe a lower training loss but a higher validation loss, an indication that the model is not generalising well to new data.

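If we train a non-regularised model and compare its training and validation losses against the regularised version, we can better understand the impact of regularisation on generalisation performance. Note that the baseline defined here has three convolutional blocks, whereas the regularised model defined later has two, so the comparison is indicative rather than strictly like-for-like: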


In [None]:
from tensorflow.keras.models import Sequential  # Sequential model groups layers linearly
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense  # Core layers: convolution, pooling, flatten, dense
from tensorflow.keras.optimizers import Adam, SGD, RMSprop  # Optimizer algorithms


def build_model_no_regularisation(input_shape=(32, 32, 3),
                                  num_classes=10,
                                  optimizer='adam'):
    """
    Builds a simple convolutional neural network (CNN) for CIFAR-10 classification,
    without any dropout or L2 regularisation.

    Parameters:
    - input_shape (tuple): Dimensions of the input images, e.g. (32,32,3).
    - num_classes (int): Number of target classes (10 for CIFAR-10).
    - optimizer (str): Which optimiser to use ('adam', 'sgd', 'rmsprop');
      any other value falls back to Adam.

    Returns:
    - A compiled Keras Sequential model, ready for training.
    """
    # Initialise a sequential (stacked) model
    model = Sequential([
        # Conv block 1: 32 filters, 3x3 kernel, ReLU activation, same padding to keep size
        Conv2D(32, (3, 3), activation='relu', padding='same',
               input_shape=input_shape),
        # Downsample feature maps by 2x2
        MaxPooling2D((2, 2)),

        # Conv block 2: 64 filters to learn more complex features
        Conv2D(64, (3, 3), activation='relu', padding='same'),
        # Further downsampling
        MaxPooling2D((2, 2)),

        # Conv block 3: 128 filters for even deeper feature extraction
        Conv2D(128, (3, 3), activation='relu', padding='same'),
        MaxPooling2D((2, 2)),

        # Flatten 3D feature maps into 1D vector for classification layers
        Flatten(),
        # Fully connected layer with 128 neurons and ReLU activation
        Dense(128, activation='relu'),
        # Output layer: num_classes neurons with softmax to output probabilities
        Dense(num_classes, activation='softmax')
    ])

    # Choose optimiser based on user argument
    if optimizer == 'sgd':
        # Stochastic gradient descent with momentum for smoothing
        opt = SGD(learning_rate=0.01, momentum=0.9)
    elif optimizer == 'rmsprop':
        # RMSprop adapts learning rate per parameter
        opt = RMSprop(learning_rate=0.001)
    else:
        # Default: Adam combines momentum and adaptive rates
        opt = Adam(learning_rate=0.001)

    # Compile model: specify loss and metrics
    model.compile(
        optimizer=opt,
        loss='sparse_categorical_crossentropy',  # for integer labels
        metrics=['accuracy']  # track classification accuracy
    )
    return model


In [None]:
# Build the non-regularised model with the Adam optimiser
model_no_reg = build_model_no_regularisation(optimizer='adam')

# Train the model on the training data
# Evaluate generalisation performance using the test set as validation data
# Train for the same number of epochs as the regularised model
history_no_reg = model_no_reg.fit(
    X_train, Y_train,
    validation_data=(X_test, Y_test),
    epochs=num_epochs,
    verbose=1
)


### Evaluate
Over the twenty epochs, your convolutional model learned very quickly on the examples it saw during training. Its training accuracy went from about 16 per cent in the first epoch to over 91 per cent by the last, and its training loss fell steeply from around 2.30 down to 0.31. 

In contrast, its performance on the unseen validation set was much more erratic: validation accuracy peaked at 42 per cent in epoch 18 but finished at just 29 per cent, while validation loss generally drifted upwards, ending at about 2.51. In plain terms, the network got extremely good at the images it practised on but struggled to apply what it had learned consistently to new images, which is a tell-tale sign of overfitting.
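
The loss values quoted here are easier to interpret against a reference point. The sketch below is a plain-Python version of sparse categorical cross-entropy (a hypothetical helper, not the Keras implementation). It shows that a model assigning equal probability to all 10 classes scores ln(10), about 2.30, which matches the starting training loss. The labels and probabilities are made up purely for illustration.

```python
import math


def sparse_categorical_crossentropy(probs, labels):
    """Mean of -log(probability assigned to the true class), for integer labels."""
    losses = [-math.log(p[y]) for p, y in zip(probs, labels)]
    return sum(losses) / len(losses)


num_classes = 10
labels = [3, 7]  # arbitrary integer class labels, for illustration only

# A model that has learned nothing: equal probability for every class
uniform = [[1.0 / num_classes] * num_classes for _ in labels]
uniform_loss = sparse_categorical_crossentropy(uniform, labels)
print("uniform guess:", uniform_loss)  # ln(10), about 2.3026

# A model that is confident on both examples, but wrong on the second
confident = [
    [0.01] * 3 + [0.91] + [0.01] * 6,   # 0.91 on the true class 3
    [0.91] + [0.01] * 9,                # only 0.01 on the true class 7
]
confident_loss = sparse_categorical_crossentropy(confident, labels)
print("confident, half wrong:", confident_loss)
```

In the second case the model gets half the examples right, yet its loss is higher than that of uniform guessing, because cross-entropy punishes confident mistakes heavily. A validation loss above 2.30 alongside above-chance accuracy is therefore consistent with a network that has become overconfident on unseen images.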

### Train model with Regularisation
In this step, we will train our convolutional neural network with both dropout and L2 regularisation applied. Specifically, we will set the `dropout_rate` to 0.2, apply an L2 `weight_decay` (regularisation strength) of `1e-4`, and use the `Adam` optimiser.

These values are commonly used as sensible defaults when starting to regularise a model:

* *Dropout rate of 0.2* means that during training, 20% of the neurons in each dropout layer will be randomly deactivated in each forward pass. This is a moderate amount that typically helps reduce overfitting without significantly impairing the model’s ability to learn. It's a good starting point because it introduces regularisation without being too aggressive.

* *Weight decay of 1e-4* applies a small penalty to large weights, encouraging the model to keep its parameter values small and stable. This helps avoid overly complex models that can memorise the training data. A value of `1e-4` is often used in practice as it tends to strike a balance between under- and over-regularisation.

* *Adam optimiser* is chosen for its adaptive learning rate and efficiency in training deep networks. It combines the benefits of both RMSProp and momentum-based gradient descent, and it generally performs well out-of-the-box across a wide range of tasks.

Starting with these parameters will hopefully improve the model’s ability to generalise to unseen data, reduce the risk of overfitting, and maintain stable and efficient training behaviour. These settings can later be tuned based on validation performance:


In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from tensorflow.keras.regularizers import l2
from tensorflow.keras.optimizers import Adam

def build_model_regularisation(dropout_rate=0.3, weight_decay=1e-4, optimizer='adam'):
    """
    Builds a CNN model with dropout and L2 regularisation for image classification on CIFAR-10.

    Parameters:
    - dropout_rate (float): fraction of neurons to drop during training.
    - weight_decay (float): strength of L2 regularisation.
    - optimizer (str or keras.optimizers.Optimizer): optimiser to use for training; passed
      straight to model.compile, so a string such as 'sgd' uses the Keras default settings.

    Returns:
    - A compiled Keras Sequential model.
    """

    model = Sequential([

        # First convolutional layer with 32 filters of size 3x3
        # Uses ReLU activation and 'same' padding to preserve spatial dimensions
        # Applies L2 regularisation to the kernel weights
        Conv2D(32, (3, 3), activation='relu', padding='same',
               kernel_regularizer=l2(weight_decay), input_shape=(32, 32, 3)),

        # Max pooling reduces the feature map size by half (32x32 -> 16x16)
        MaxPooling2D((2, 2)),

        # Dropout randomly disables a fraction of neurons to prevent overfitting
        Dropout(dropout_rate),

        # Second convolutional layer with 64 filters of size 3x3
        # Again uses ReLU activation and L2 regularisation
        Conv2D(64, (3, 3), activation='relu', padding='same',
               kernel_regularizer=l2(weight_decay)),

        # Another max pooling operation (16x16 -> 8x8)
        MaxPooling2D((2, 2)),

        # Additional dropout for regularisation
        Dropout(dropout_rate),

        # Flatten the 3D feature maps into a 1D vector for the dense layers
        Flatten(),

        # Fully connected layer with 128 neurons and L2 regularisation
        Dense(128, activation='relu', kernel_regularizer=l2(weight_decay)),

        # Dropout before the final classification layer
        Dropout(dropout_rate),

        # Output layer with 10 neurons (one per CIFAR-10 class)
        # Softmax activation converts outputs to class probabilities
        Dense(10, activation='softmax')
    ])

    # Compile the model with specified optimiser and sparse categorical cross-entropy loss
    # Accuracy is used as the evaluation metric
    model.compile(optimizer=optimizer,
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

    return model
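
To get a feel for the capacity of this network, the following sketch traces the feature-map size through the layers and counts the trainable parameters by hand. The helper functions are illustrative; the layer sizes are copied from `build_model_regularisation`. Dropout, pooling and flatten layers have no trainable parameters, so only the convolutional and dense layers contribute.

```python
def conv_params(kernel, in_ch, out_ch):
    """Weights plus biases of a 2D convolution with a square kernel."""
    return kernel * kernel * in_ch * out_ch + out_ch


def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


# Layer sizes copied from build_model_regularisation
size, channels = 32, 3
total = 0

for filters in (32, 64):
    total += conv_params(3, channels, filters)  # 'same' padding keeps the spatial size
    channels = filters
    size //= 2                                  # 2x2 max pooling halves width and height

flat = size * size * channels                   # length of the vector after Flatten
total += dense_params(flat, 128)
total += dense_params(128, 10)

print("flattened features:", flat)      # 8 * 8 * 64 = 4096
print("trainable parameters:", total)   # 545098
```

With `limit_train = 400`, that is well over a thousand trainable parameters per training image, which helps explain why the network can memorise the training set so easily and why regularisation matters here.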

In [None]:
# Number of training epochs
num_epochs = 20

# Build the CNN model with regularisation
# Dropout rate set to 0.2 (20% of neurons dropped during training)
# L2 weight decay set to 1e-4 to penalise large weights
# Using the Adam optimiser for adaptive learning rate and stability
model_with_reg = build_model_regularisation(
    dropout_rate=0.2,
    weight_decay=1e-4,
    optimizer='adam'
)

# Train the model on the training data for 20 epochs
# Use a validation set to monitor generalisation (X_test, Y_test)
# 'verbose=1' prints progress bar and training metrics
history_with_reg = model_with_reg.fit(
    X_train, Y_train,
    validation_data=(X_test, Y_test),
    epochs=num_epochs,
    verbose=1
)


### Evaluate
This version generalised better to the test images. Although its training accuracy only climbed to about 76 per cent (versus over 91 per cent for the non-regularised model), its final validation accuracy settled at 37 per cent rather than 29 per cent, and its validation loss ended lower (≈2.25 vs ≈2.51). In other words, it didn’t memorise the training set quite as perfectly, but it did a noticeably better job when faced with unseen data, so it’s less overfitted and overall more reliable on new images.


### Comparing Validation Loss

To evaluate the effectiveness of regularisation, we will compare the *validation loss* of the regularised and non-regularised models side by side. Validation loss provides a useful indication of how well the model generalises to unseen data; lower values typically suggest better performance on data outside the training set.

Regularisation techniques such as dropout and L2 penalty are designed to prevent overfitting by reducing the model’s reliance on specific weights and features. In contrast, a non-regularised model may achieve very low training loss but struggle to perform well on the validation set due to memorising noise or irrelevant patterns.

Let's plot the validation loss over epochs for both models so we can observe:

* Whether the regularised model maintains a lower and more stable validation loss,
* Whether the non-regularised model begins to overfit (i.e. validation loss increases while training loss continues to decrease),
* And how quickly each model converges during training.

This visual comparison helps us understand the trade-offs between model complexity and generalisation, and demonstrates the practical benefits of applying regularisation in deep learning workflows:

In [None]:
from matplotlib import pyplot as plt  # Import matplotlib for plotting

# Plot the validation loss for the model with regularisation
plt.plot(history_with_reg.history['val_loss'], label='With Regularisation')

# Plot the validation loss for the model without regularisation
plt.plot(history_no_reg.history['val_loss'], label='No Regularisation')

# Set the title of the plot
plt.title('Validation Loss comparison')

# Label the x-axis (number of training epochs)
plt.xlabel('Epoch')

# Label the y-axis (loss value on validation set)
plt.ylabel('Validation Loss')

# Display the legend to differentiate between the two curves
plt.legend()

# Display the plot
plt.show()


If the model with dropout and L2 regularisation shows a lower validation loss or a smoother, more stable curve across epochs than the non-regularised model, this suggests that regularisation is reducing overfitting and improving generalisation. In practice, more work may still be needed to improve performance.

A *lower validation loss* means the model is performing better on data it hasn’t seen during training, which is the ultimate goal in most machine learning tasks. Regularisation helps by discouraging the model from learning overly complex patterns or memorising the training data, which often leads to poor performance on new inputs.

A *smoother curve* (i.e. fewer sharp spikes or fluctuations in the validation loss) indicates that the model's learning is more stable and less sensitive to noise or individual training samples. This is another benefit of regularisation, especially dropout, which introduces randomness during training and forces the model to rely on distributed representations rather than specific neurons or weights.

In contrast, if the non-regularised model shows rapid improvements in training performance but its validation loss either stagnates or increases, this typically signals overfitting: the model is learning the training data too well at the expense of general performance. Therefore, observing these trends in the validation loss curve is a useful diagnostic tool for deciding whether regularisation is necessary or whether additional techniques (like data augmentation) might be beneficial.
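
The same diagnosis can be read off the numbers directly rather than from the plot. The helper below is a hypothetical sketch that works on a dictionary shaped like Keras's `history.history`, with `'loss'` and `'val_loss'` lists. It reports the epoch with the lowest validation loss and how far validation loss has risen since then. The example values are made up, only to show the shape of an overfitting run.

```python
def summarise_history(history):
    """Report where validation loss was lowest and how far it has drifted since."""
    train, val = history["loss"], history["val_loss"]
    best_epoch = min(range(len(val)), key=val.__getitem__)
    return {
        "best_epoch": best_epoch + 1,                      # 1-based, as Keras prints it
        "best_val_loss": val[best_epoch],
        "final_gap": val[-1] - train[-1],                  # validation minus training loss
        "val_rise_since_best": val[-1] - val[best_epoch],  # > 0 suggests overfitting
    }


# Made-up numbers, purely for illustration (not results from this notebook)
example = {
    "loss":     [2.3, 1.9, 1.5, 1.1, 0.8, 0.5],
    "val_loss": [2.3, 2.0, 1.9, 2.0, 2.2, 2.5],
}
summary = summarise_history(example)
print(summary)
```

On the real runs it could be called as `summarise_history(history_no_reg.history)` and `summarise_history(history_with_reg.history)`. A large `final_gap` combined with a positive `val_rise_since_best` is the numerical counterpart of a validation curve that turns upwards while training loss keeps falling.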


### Experimenting with different optimisers

Optimisers play a critical role in how a neural network learns during training. They determine how the model's weights are updated in response to the calculated gradients from the loss function. To explore the effect of different optimisation strategies, you can modify the `optimizer` argument in the model-building functions to use `'sgd'`, `'rmsprop'`, or `'adam'`. Each of these optimisers implements a different approach to adjusting learning rates and updating parameters, and as a result, they produce different training behaviours and generalisation characteristics.

#### SGD (Stochastic Gradient Descent)

* This is the most basic optimiser, using the raw gradients to update weights after each batch.
* It typically requires careful tuning of the learning rate and often benefits from techniques like momentum or learning rate schedules.
* Although it can be slow to converge, especially on complex datasets like CIFAR-10, it is sometimes preferred when generalisation is more important than speed, as it may avoid overfitting better in some cases.
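
Momentum and a learning rate schedule can both be sketched in a few lines. The code below is an illustrative plain-Python version on a toy one-dimensional problem. The function names are hypothetical, and the decay settings are arbitrary values chosen for the demonstration.

```python
def exponential_decay(initial_lr, decay_rate, decay_steps, step):
    """Learning rate that is multiplied by `decay_rate` every `decay_steps` updates."""
    return initial_lr * decay_rate ** (step / decay_steps)


def sgd_momentum_step(w, grad, velocity, lr, momentum=0.9):
    """One SGD update with classical momentum on a single scalar weight."""
    velocity = momentum * velocity - lr * grad   # decaying sum of past steps
    return w + velocity, velocity


# Toy objective f(w) = w**2 with gradient 2*w; the minimum is at w = 0
w, velocity = 5.0, 0.0
for step in range(200):
    lr = exponential_decay(initial_lr=0.01, decay_rate=0.5, decay_steps=100, step=step)
    w, velocity = sgd_momentum_step(w, grad=2.0 * w, velocity=velocity, lr=lr)

print("final w:", w)
print("learning rate at last step:", lr)
```

The velocity term lets consecutive gradients that point the same way build up speed, while the schedule halves the learning rate every 100 updates so that later steps become more cautious as the weight approaches the minimum.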

#### RMSProp (Root Mean Square Propagation)

* RMSProp improves on SGD by adapting the learning rate for each weight individually based on a moving average of recent squared gradients.
* Because each step is scaled by the recent gradient magnitude, the optimiser copes better with non-stationary objectives and with gradients whose size varies widely between parameters.
* It is particularly effective for training models on noisy or sparse data and is well-suited to recurrent neural networks.

#### Adam (Adaptive Moment Estimation)

* Adam is one of the most popular and effective optimisers in deep learning.
* It combines the benefits of both RMSProp (adaptive learning rates) and momentum (accumulated gradient history), making it robust and efficient across a wide range of tasks.
* Adam generally converges quickly and performs well out of the box, making it a strong default choice for most problems.

If you experiment with these different optimisers, you can observe how the choice affects convergence speed, final accuracy, and stability of training and validation losses. This is an important part of model tuning and helps build an understanding of how different learning dynamics can influence model performance.


### What have we learnt?

Throughout this experiment, we’ve seen how both *regularisation* and *optimiser choice* significantly influence the performance and reliability of a neural network.

*Regularisation* techniques such as *Dropout* and *L2 weight decay* are crucial for reducing overfitting. This is a common issue where a model learns to perform very well on the training data but fails to generalise to unseen examples. Dropout achieves this by randomly deactivating neurons during training, which forces the model to learn more robust, distributed representations. L2 regularisation, on the other hand, penalises large weights, encouraging the network to find simpler solutions that generalise better. When used effectively, regularisation leads to smoother and lower validation loss curves, indicating better generalisation and more reliable performance on new data.

We’ve also learnt that the choice of *optimiser* affects how efficiently a model converges during training. Traditional *SGD* steps against the gradient using a single fixed learning rate, which can be slow and sensitive to the learning rate setting. More advanced optimisers like *RMSProp* and *Adam* adapt the learning rate for each parameter and incorporate information from past gradients. These adaptations allow the model to navigate complex loss landscapes more effectively and converge more quickly and reliably. Adam, in particular, is widely used due to its robustness and minimal need for manual tuning.

These experiments highlight the importance of applying the right tools to control overfitting and ensure smooth convergence. If you combine regularisation with a well-chosen optimiser, you are better placed to build models that not only perform well on training data but also maintain strong and consistent performance on unseen data, which is our main goal.
