# Week 3 Deep Learning

**Objectives**

This week, we will:

- Build and train deep learning models.
- Diagnose and fix common training issues.
- Implement transfer learning techniques.

**Notes**

- If a line starts with the paintbrush symbol (🖌️), it asks you to implement a piece of code or answer a question.
- Lines starting with the light bulb symbol (💡) provide important information or tips and tricks.
- Lines starting with the checkmark symbol (✅) reveal the solutions to specific exercises.

## Tools

Building deep learning models is a complex task.
It involves designing architectures with sometimes millions or billions of parameters and optimising them using gradient descent, the process by which model parameters are adjusted to reduce the training loss.
These operations must also be computationally efficient to minimise training and inference time.
Fortunately, several libraries provide high-level tools to handle these complexities, making it easier to create and train models effectively.

The most popular deep learning frameworks are `PyTorch`, `TensorFlow`, and `JAX`.
While these are open-source, they are primarily developed by Meta (`PyTorch`) and Google (`TensorFlow` and `JAX`).
Although there are specialised reasons to choose one over another, any of them is suitable for most problems; the choice usually depends on convenience and personal preference.
`PyTorch` is often preferred in academic circles because it is intuitive and flexible, making it ideal for quick experimentation.

![pytorch](https://pytorch.org/wp-content/uploads/2025/01/pytorch_seo.png)

In [None]:
# Import the torch library
import torch

# We can use it to check if we can use the GPU
torch.cuda.is_available()

In this notebook, however, we will use `Keras`.
It is designed as a high-level wrapper, making it more user-friendly than those underlying frameworks.
While `Keras` runs on top of these libraries, it hides their complexity from the user.
If you eventually need to develop highly custom features or examine internal mechanics, you would likely work directly with a framework like `PyTorch` or `TensorFlow`, but for well-established workflows, `Keras` makes the process much easier.

In [None]:
import os

# We need to specify that we want to use torch
# as the "backend" for keras
os.environ["KERAS_BACKEND"] = "torch"

import keras

keras.__version__

We will also use a few additional libraries to support our workflow.
Since we will be working with images, we use `Pillow` (PIL) for image manipulation and `torchvision` to bridge the gap between `Pillow` and `PyTorch`.
Finally, we will use `timm`, a package that provides access to a large set of pre-trained image models.

## Case study

### Description

For this notebook, we will use an example dataset of moth images as a case study.

![moths](https://storage.googleapis.com/kaggle-datasets-images/2439824/4128992/4527186ab6ea46da0adb9bbce34e8d81/dataset-cover.jpg?t=2022-08-27-19-01-29)

This dataset was shared via [Kaggle](https://www.kaggle.com/).
Kaggle is a great resource for machine learning, serving as a platform that hosts datasets, models, and more.
It also hosts competitions where participants attempt to train the best model for a specific task.

This particular [dataset](https://www.kaggle.com/datasets/gpiosenka/moths-image-datasetclassification?select=MOTHS.csv) was assembled by Kaggle user Gerry through internet searches for various moth species.

### Download

Let's download the dataset.

In [None]:
# Kaggle offers a library to interface with their platforms
import kagglehub

# We can download the dataset. The returned object is the path to where it was downloaded
download_dir = kagglehub.dataset_download("gpiosenka/moths-image-datasetclassification")

The dataset includes a CSV file containing metadata for each image.
This includes the location of the file within the download folder (`filepaths`), the species name (`labels`), and a pre-defined split for training, testing, and validation (`data set`).

In [None]:
import pandas as pd

# Define the path to the CSV file
path_to_table = os.path.join(download_dir, "MOTHS.csv")

# Load the file into a DataFrame
df = pd.read_csv(path_to_table)

# Check its first few rows
df.head()

💡**Note:** If you are unfamiliar with `os.path.join`, here is a brief explanation.
If you have a file named `my_file.txt` in a folder called `my_folder`, the path to that file on Windows is `my_folder\my_file.txt`, whereas on macOS and Linux it is `my_folder/my_file.txt`.
This difference (`/` vs `\`) means code shared across different operating systems might break.
The `os.path.join` function handles these differences automatically, making your code platform-independent.
Using it is considered best practice for writing shareable code.
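
As a small illustration of the note above, the sketch below joins the same two path components with `os.path.join` and with `pathlib.Path`, a standard-library alternative that is not used elsewhere in this notebook. The folder and file names are the hypothetical ones from the note.

```python
import os
from pathlib import Path

# Hypothetical folder and file names taken from the note above
folder = "my_folder"
filename = "my_file.txt"

# os.path.join inserts the separator used by the current operating system
joined = os.path.join(folder, filename)

# pathlib offers the same behaviour through the "/" operator
joined_pathlib = Path(folder) / filename

print(joined)
print(os.sep)  # the separator used on this machine
```

Running this on different operating systems should print different separators while the code itself stays unchanged.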

### Dataset summary

We can now count how many images are available for each species.

In [None]:
# Count the number of images per species and sort by frequency
counts = df.labels.value_counts().sort_values(ascending=False)

counts

Plotting these counts makes the distribution easier to visualise.

In [None]:
import matplotlib.pyplot as plt

_, ax = plt.subplots()

# Plot the number of counts
ax.plot(range(len(counts)), counts)

# Make the y axis range start from 0
ax.set_ylim(0, counts.max() + 10)

# Add labels and title
ax.set_ylabel("Number of images")
ax.set_xlabel("Species rank")
ax.set_title("Counts of images per species")

#### 🖌️*Full dataset summary*

When publishing a model, it is good practice to provide a detailed breakdown of your dataset.
This includes the number of examples per species across the training, validation, and test sets.
Create a table with one row for each species and three columns showing the image counts for that species in each of the three splits.

### Visualisation

We will now use Pillow (`PIL`) to load and view some of the moth images.

In [None]:
from PIL import Image

# select a random example
example = df.sample(n=1).iloc[0]

# And display it
im = Image.open(os.path.join(download_dir, example.filepaths))

plt.imshow(im)
plt.title(example.labels)

Each time you run this cell, a different image is selected at random from the dataset.
Try running the cell several times to see the variety of species.

#### 🖌️*Qualitative dataset description*

Suppose you wanted to develop a moth identifier for biologists to use on mobile devices in the field.
After looking at several images, list two reasons why training with this dataset might be unsuitable for this task, and two reasons why it might be suitable.

## Data preprocessing

An essential step in deep learning is preparing your data for the model.
This involves converting raw data into a numerical representation (e.g. turning an image into an array of pixel values) and applying transformations like cropping or scaling values to a 0–1 range.

These steps are known as **preprocessing**.
Getting this stage right is very important, as the quality of preprocessing can significantly impact a model's performance.
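
To make the scaling step concrete, here is a minimal pure-Python sketch. It assumes 8-bit pixel values (0 to 255), which is the usual encoding for image files, and the pixel values themselves are invented for illustration. The `torchvision` transforms used later in this notebook perform the equivalent operation on tensors.

```python
# A made-up 2x2 grayscale image stored as 8-bit integers (0 to 255)
raw_pixels = [
    [0, 64],
    [128, 255],
]

# Dividing by 255 maps every value into the 0-1 range
scaled_pixels = [[value / 255 for value in row] for row in raw_pixels]

for row in scaled_pixels:
    print([round(value, 3) for value in row])
```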

### Tensors

The first step is converting images into a numerical format.

In previous notebooks, we used `numpy` to store and process arrays of data.
`torch` uses its own version of a numerical array called a **tensor**.
In most situations, a tensor behaves similarly to a `numpy` array, making it easy to apply what you already know.
However, they are not identical, and we will highlight the key differences as we progress.

In [None]:
# Use the transforms provided by torchvision
from torchvision.transforms import v2

# Create a transformation object that converts PIL images to an image tensor
to_tensor = v2.ToImage()

# Apply the transform to our example image
im_tensor = to_tensor(im)

im_tensor

We can also convert tensors back to images:

In [None]:
# Create a transform that converts image tensors back to PIL images
to_pil = v2.ToPILImage()

# Apply the transform to the image tensor
reconstructed_im = to_pil(im_tensor)

# Plot
plt.imshow(reconstructed_im)

### Transformations

`torchvision` provides a variety of transforms for manipulating images.
For example, we can use these tools to resize images and reduce their dimensions.

In [None]:
# Create a transformation that will resize any image into a 32x32 image
resize = v2.Resize([32, 32])

# Apply the transform to the image *tensor*
resized = resize(im_tensor)

# Plot the original and resized images side-by-side for comparison
_, (ax1, ax2) = plt.subplots(ncols=2)
ax1.imshow(im)
ax2.imshow(to_pil(resized))

Multiple transformations can be combined into a single pipeline.
For example, we can resize an image and convert it to grayscale in one step.

In [None]:
import torch

# Define a pipeline of multiple transformations
compose_transform = v2.Compose(
    [
        v2.ToImage(),  # convert to image tensor
        v2.Resize([32, 32]),  # resize 32x32
        v2.Grayscale(),  # make grayscale
    ]
)

# Apply the combined transformations to the original PIL image
compose_tensor = compose_transform(im)

# Plot the original and processed results side-by-side
_, (ax1, ax2) = plt.subplots(ncols=2)
ax1.imshow(im)
ax2.imshow(to_pil(compose_tensor), cmap="gray")

#### 🖌️*Why rescaling?*

The original images in this dataset are 224x224 pixels.
Reducing the resolution makes each image smaller by reducing the total number of pixels, but it also results in a loss of detail.
When developing a model, what are some considerations that might guide your decision on which image size to use?

### Datasets

For training and evaluation, we use a **dataset**: a collection of examples, which in our case are images of moths.
We need a way to load these images and iterate through them during the training process.

In this dataset, images are organised into `train`, `valid`, and `test` folders.
Within each of these, there are subfolders named after each species containing the corresponding images.
Like so:

```
download_dir
├── test
│   ├── ARCIGERA FLOWER MOTH
│   ├── ...
│   └── WHITE SPOTTED SABLE MOTH
├── train
│   ├── ARCIGERA FLOWER MOTH
│   ├── ...
│   └── WHITE SPOTTED SABLE MOTH
└── valid
    ├── ARCIGERA FLOWER MOTH
    ├── ...
    └── WHITE SPOTTED SABLE MOTH

```

This is a standard structure for classification tasks.
We can use the `ImageFolder` utility to load this data easily.

In [None]:
from torchvision.datasets import ImageFolder

# Define the path to the training directory
train_dataset_path = os.path.join(download_dir, "train")

# Create a transform for the dataset
transform = v2.Compose(
    [
        v2.ToImage(),  # convert to image tensor
        v2.Resize([32, 32]),  # resize 32x32
        v2.ToDtype(torch.float32, scale=True),  # convert to floating numbers
    ]
)

# Create a dataset from the folder structure
# Subfolder names are automatically used as the labels for the images within them
# Setting `transform=transform` ensures every image is pre-processed automatically when loaded.
train_dataset = ImageFolder(train_dataset_path, transform=transform)

# The length of the dataset represents the total number of examples
print(f"Num of examples: {len(train_dataset)}")

You can access individual items from the dataset as you would with a Python list.

In [None]:
# Access the first element in the dataset
# Notice it contains both the image tensor and the label (species)
im, label = train_dataset[0]

print(f"Image label: {label}")

plt.imshow(to_pil(im))

You may have noticed that the image labels appear as numbers rather than species names.
This is because models require numerical inputs to perform calculations.
However, the dataset keeps track of the original class names and the mapping between these integers and the species.

In [None]:
# Print the list of all species names detected in the folder structure
print(train_dataset.classes)

# We'll store the total number of classes for later use
num_classes = len(train_dataset.classes)

# Retrieve the original species name from an integer label
class_name = train_dataset.classes[label]
print(f"Label to species: {label} -> {class_name}")

# Find the integer label associated with a specific species name
species = "REGAL MOTH"
label = train_dataset.class_to_idx[species]
print(f"Species to label: {species} -> {label}")
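
The mapping between species names and integers can be pictured with a short pure-Python sketch. It mirrors the way `ImageFolder` sorts the subfolder names alphabetically before numbering them, but it only uses three of the species names, so the integers printed here will not match those of the full dataset. The variable names are illustrative.

```python
# Three subfolder names, deliberately listed out of order
folder_names = ["REGAL MOTH", "ARCIGERA FLOWER MOTH", "WHITE SPOTTED SABLE MOTH"]

# Sort the names, then number them starting from 0
sorted_names = sorted(folder_names)
name_to_index = {name: index for index, name in enumerate(sorted_names)}

print(sorted_names)
print(name_to_index["REGAL MOTH"])
```

Going from an integer back to a name is then a simple list lookup, `sorted_names[index]`.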

### Data loading

The final step is the actual process of loading the images.
When iterating over a dataset for training or evaluation, we typically process data in **batches**.
Loading files from a disk can be slow and is often the primary bottleneck in the training process.
To address this, `torch` provides the `DataLoader` utility, which makes data retrieval more efficient by handling batching and loading in parallel.

In [None]:
from torch.utils.data import DataLoader

# Create a DataLoader to handle batching and shuffling
# 'batch_size' determines how many images are processed at once
# 'shuffle=True' ensures batches are assembled randomly during training
# 'num_workers=2' uses two worker processes to load batches in parallel for efficiency
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True, num_workers=2)

# A loader is an 'iterable', allowing you to loop through the entire dataset
for im_batch, label_batch in train_loader:
    # This is where you would typically pass the batches to your model

    # Check the shape of the image batch
    print(f"Image batch shape: {im_batch.shape}")

    # We break the loop here
    break
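
To show what batching and shuffling amount to, here is a simplified pure-Python sketch. It is not how `DataLoader` is implemented (it ignores parallel loading entirely), and `iterate_batches` is a hypothetical helper written only for this illustration.

```python
import random


def iterate_batches(examples, batch_size, shuffle=True, seed=0):
    """Yield lists of at most `batch_size` examples."""
    indices = list(range(len(examples)))
    if shuffle:
        # A fixed seed keeps this illustration reproducible
        random.Random(seed).shuffle(indices)

    # Step through the (possibly shuffled) indices one batch at a time
    for start in range(0, len(indices), batch_size):
        yield [examples[i] for i in indices[start:start + batch_size]]


# Ten placeholder "examples" standing in for images
toy_examples = list(range(10))
toy_batches = list(iterate_batches(toy_examples, batch_size=4))

for toy_batch in toy_batches:
    print(toy_batch)
```

Note that the last batch is smaller than the others whenever the number of examples is not a multiple of the batch size.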

To help verify that our data is loading correctly, we can visualise an entire batch of images at once.

In [None]:
import numpy as np


def plot_image_batch(batch, ncols=8, figsize=None):
    # Calculate the number of rows required based on the batch size and columns
    batch_size = len(batch)
    nrows = int(np.ceil(batch_size / ncols))  # np.ceil rounds up to nearest integer

    # Create a figure with a grid of sub-axes
    # squeeze=False ensures `axes` is always a 2D array, even for a single row or column
    fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=figsize, squeeze=False)

    # Hide the axis decorations everywhere, including grid cells that stay empty
    for ax in axes.flatten():
        ax.axis("off")

    # Iterate over images in batch and axes in figure
    for im, ax in zip(batch, axes.flatten()):
        ax.imshow(to_pil(im))

    return fig


# Visualise the batch we just loaded
plot_image_batch(im_batch)

#### 🖌️*Practice batching*

Create a new dataset using the test directory.
This time, keep the images at their **original resolution** but convert them to **grayscale**.
Once the dataset and loader are ready, draw a **random** batch of **16 images** and plot them using the function above.

## Model architecture

With our data loading pipeline in place, we are ready to build our first deep learning model.
We will use `keras`, which provides high-level building blocks to create complex architectures and manage the training process.

As discussed in the lectures, the fundamental building block is the fully connected layer, known in Keras as a **Dense** layer.
In this layer, every neuron is connected to every neuron in the preceding layer.
We also apply an **activation** function, such as **ReLU** (Rectified Linear Unit), to introduce non-linearity, allowing the model to learn complex patterns.
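
The computation performed by a single Dense layer with a ReLU activation can be sketched in a few lines of pure Python. This is only meant to expose the arithmetic; Keras performs the same operation far more efficiently on tensors. All numbers and function names below are made up for illustration.

```python
def relu(value):
    # ReLU keeps positive values and replaces negative ones with 0
    return max(0.0, value)


def dense_forward(inputs, weights, biases):
    """Compute relu(w . x + b) for every neuron in the layer."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        # Each neuron combines *all* inputs, hence "fully connected"
        weighted_sum = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(relu(weighted_sum))
    return outputs


# Made-up numbers: 3 inputs feeding a layer of 2 neurons
layer_inputs = [1.0, 2.0, 3.0]
layer_weights = [
    [0.5, -1.0, 0.5],  # weights of neuron 1
    [-1.0, 0.0, -1.0],  # weights of neuron 2
]
layer_biases = [0.5, 1.0]

activations = dense_forward(layer_inputs, layer_weights, layer_biases)
print(activations)  # prints [0.5, 0.0]
```

The second neuron's weighted sum is negative, so ReLU sets its output to zero.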

Here is an example of a neural network model:

In [None]:
import keras

keras.config.set_image_data_format("channels_first")

resolution = 32

model = keras.Sequential(
    [
        keras.layers.Input(shape=[3, resolution, resolution]),
        keras.layers.Flatten(),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(num_classes, activation="softmax"),
    ]
)

When defining the model architecture, there are several key components to understand:

First, we specify the input shape.
Our data consists of RGB images defined by their height and width.
In this example, we use 32×32 pixels to match our downscaled data, though the model can be configured for any resolution.

Then we have the **Flatten** layer.
This converts the 3D structure of the image into a 1D list of features.
In this basic model, we do not account for the spatial relationship between pixels; instead, every pixel value across all three colour channels is treated as an individual input neuron.
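
As a rough picture of what flattening does, the sketch below unrolls a tiny, made-up image with 3 channels, 2 rows, and 2 columns into a single list. It uses nested Python lists rather than tensors, so it is an illustration of the idea rather than of the Keras layer itself.

```python
# A made-up image with 3 channels, 2 rows and 2 columns (channels first)
toy_image = [
    [[1, 2], [3, 4]],  # red channel
    [[5, 6], [7, 8]],  # green channel
    [[9, 10], [11, 12]],  # blue channel
]

# Flattening reads every value in order and places it in a single list
flat_values = [value for channel in toy_image for row in channel for value in row]

print(len(flat_values))  # 3 * 2 * 2 values
print(flat_values)
```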

Then we have two intermediate layers each with 64 neurons.
Both of them use **ReLU** as the activation function.

The final layer is a dense layer where the number of neurons matches the total number of moth species in our dataset.
This layer uses the softmax activation function, which is key for classification.
Softmax takes the raw outputs from the network and scales them so that:

- Every output value is between 0 and 1.
- The sum of all output values equals exactly 1.

This allows us to interpret the output of each neuron as a probability.
For example, a value of 0.75 on a specific neuron indicates a 75% confidence that the image belongs to that particular species.
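
The two properties of softmax listed above can be checked with a minimal pure-Python sketch. The raw scores are invented, and subtracting the maximum before exponentiating is a common numerical safeguard rather than something this notebook relies on.

```python
import math


def softmax(scores):
    """Turn a list of raw scores into values that sum to 1."""
    # Subtracting the maximum avoids overflow and does not change the result
    highest = max(scores)
    exponentials = [math.exp(score - highest) for score in scores]
    total = sum(exponentials)
    return [value / total for value in exponentials]


# Made-up raw outputs for three species
raw_scores = [2.0, 1.0, 0.1]
probabilities = softmax(raw_scores)

print([round(p, 3) for p in probabilities])
print(sum(probabilities))
```

The largest raw score always receives the largest probability, so the ranking of the species is preserved.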

We can use the `summary` method to view a detailed breakdown of the model.
This includes the output shape of each layer and the number of trainable parameters.

In [None]:
model.summary()

We can also generate a visual diagram to see how the data flows through the different layers of the model.

In [None]:
keras.utils.plot_model(model)

#### 🖌️*Dependence of model size on resolution*

The model above is relatively simple, with only two hidden layers and a modest number of neurons, yet it already contains around 200,000 parameters.
Try changing the input resolution using the values (8, 16, 32, 64, 128, 256) and record the number of trainable parameters for each case.
Plot this relationship to see how the model size changes as the image resolution increases.
What patterns do you observe in the plot?
What does this suggest about the practical trade-offs involved when choosing an image resolution for your model?

## Model training

We are now ready to train our model.
As you observed in the previous exercise, using high-resolution images leads to a significantly larger model with more parameters.
To allow for faster experimentation, we will start with a very low resolution of 8×8 pixels.
At this size, it is difficult to identify specific moth features, but basic shapes and colours remain visible.

First, we define the transformation and the model architecture for this 8×8 resolution.

In [None]:
# Define preprocessing for 8x8 resolution
transform_0 = v2.Compose(
    [
        v2.ToImage(),
        v2.Resize([8, 8]),
        v2.ToDtype(torch.float32, scale=True), # scale=True ensures values are 0-1
    ]
)

# Prepare the training dataset
train_dataset_0 = ImageFolder(train_dataset_path, transform=transform_0)

# Build a small model for fast experimentation
model_0 = keras.Sequential(
    [
        keras.layers.Input(shape=[3, 8, 8]),
        keras.layers.Flatten(),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(num_classes, activation="softmax"),
    ]
)

model_0.summary()

We use the `compile` method in Keras to configure the training process.
This requires two final components: the **training loss** and the **optimisation algorithm**.

For classification, the standard choice is **cross-entropy loss**.
As discussed, our model outputs a probability score for each species.
The cross-entropy loss measures the difference between these predicted probabilities and the ideal case, where the correct species has a score of 1 and all others are 0.
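
For a single example with an integer label, the cross-entropy loss reduces to minus the logarithm of the probability assigned to the correct class. The sketch below illustrates this with made-up probabilities; the Keras loss used in this notebook additionally averages over the examples in a batch.

```python
import math


def cross_entropy(probabilities, true_label):
    """Loss for one example: minus the log-probability of the true class."""
    return -math.log(probabilities[true_label])


# Made-up predictions over three species; the correct species has label 0
confident_and_right = [0.9, 0.05, 0.05]
confident_and_wrong = [0.05, 0.9, 0.05]

loss_right = cross_entropy(confident_and_right, true_label=0)
loss_wrong = cross_entropy(confident_and_wrong, true_label=0)

print(round(loss_right, 3))
print(round(loss_wrong, 3))
```

The loss is close to zero when the model is confident and correct, and grows quickly when it is confident and wrong.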

For the optimisation, we will use the **Adam** optimiser.
This is a variant of the **Stochastic Gradient Descent (SGD)** algorithm covered in the lectures that additionally adapts the step size for each parameter.
It calculates the loss for each batch, determines the direction of the steepest descent, and updates the model parameters accordingly.
The size of these updates is controlled by the **learning rate**, which we have set to 1e-3 (0.001).
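
The role of the learning rate can be seen in a toy example. The sketch below runs plain gradient descent (not Adam) on a made-up one-parameter loss, `(w - 3)^2`, whose minimum is at `w = 3`. The loss, the starting point, and the learning rate are all assumptions chosen for illustration.

```python
def toy_gradient(w):
    # Derivative of the toy loss (w - 3)^2 with respect to w
    return 2 * (w - 3)


toy_learning_rate = 0.1  # illustrative value, larger than the 1e-3 used in this notebook
w = 0.0  # arbitrary starting point

for step in range(50):
    # Move against the gradient; the learning rate scales the size of the step
    w = w - toy_learning_rate * toy_gradient(w)

print(round(w, 4))
```

With a much smaller learning rate, the same 50 steps would leave `w` further from the minimum; with a much larger one, the updates could overshoot and diverge.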

In [None]:
# Set the learning rate for the optimiser
learning_rate = 1e-3

# Configure the model for training
# We also track 'accuracy' to monitor how many images the model classifies correctly
model_0.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(),
    optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
    metrics=[
        keras.metrics.SparseCategoricalAccuracy(name="acc"),
    ],
)

We have also included `metrics` in the configuration.
These are additional values, such as accuracy, calculated during training to help us monitor performance.
Unlike the loss function, metrics do not influence the gradient descent process itself.

Finally, we initiate the training process using the `fit` method.
We first create a data loader to feed our 8×8 images into the model in batches.
We will train the model for 20 **epochs**, which means the model will iterate through the entire dataset 20 times.

In [None]:
# Create a DataLoader for our low-resolution dataset
train_loader_0 = DataLoader(train_dataset_0, batch_size=8, shuffle=True, num_workers=2)

# Start the training process
# The 'history' object will store the loss and accuracy values for each epoch
history = model_0.fit(train_loader_0, epochs=20)

We can now plot how the training loss and accuracy evolved during training:

In [None]:
fig, (ax1, ax2) = plt.subplots(ncols=2)


ax1.plot(history.history["loss"])
ax1.set_title("Training loss")
ax1.set_xlabel("Epochs")

ax2.plot(history.history["acc"])
ax2.set_title("Accuracy")
ax2.set_xlabel("Epochs")

#### 🖌️*Hyperparameter choice*

In this training run, we achieved a relatively low accuracy score on the training set.

We used a `learning_rate=1e-3`, `batch_size=8`, and `epochs=20`.
These are configurations of the model or the learning process that are not learned by the model itself, unlike its weights and biases.
Because they are set before training begins, they are called **hyperparameters**.

Choosing the right values can impact how well a model learns.
Researchers often perform **hyperparameter tuning**, searching for the best combination through trial and error or structured methods like a Grid Search.
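
A grid search simply enumerates every combination of candidate values. The sketch below shows the enumeration step only, using hypothetical candidate values; in a real search, each combination would be used to train a fresh model whose validation performance is then recorded.

```python
from itertools import product

# Hypothetical candidate values; choose your own when experimenting
candidate_learning_rates = [1e-2, 1e-3, 1e-4]
candidate_batch_sizes = [8, 32]

# A grid search tries every combination of the candidate values
grid = list(product(candidate_learning_rates, candidate_batch_sizes))

for candidate_lr, candidate_bs in grid:
    # In practice: train a fresh model here and record its validation accuracy
    print(f"learning_rate={candidate_lr}, batch_size={candidate_bs}")
```

The number of combinations grows multiplicatively with each added hyperparameter, which is one reason tuning can be so time-consuming.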

Pick one of the hyperparameters mentioned above and change its value.
Can you achieve a better accuracy score?
Try at most three different configurations; hyperparameter tuning can be a time-consuming process and does not always guarantee a good result.

## Evaluation

So far, we have only measured the training loss and accuracy.
However, these metrics can be misleading when it comes to **generalisation**.
Since the model has been explicitly trained to provide the correct answers for the training set, evaluating it with those same examples is a biased measure of its true ability.

The moth dataset includes a **validation set** and a **test set** to help us assess performance more objectively.

Recall from the lectures that the **validation set** is used during hyperparameter tuning.
It is a subset of data that the model never sees during training, providing a better estimate of how it performs on new data.
However, because we make decisions on how to improve the model based on this specific subset, using it for our final evaluation would still lead to overoptimistic results.

For the final, unbiased assessment of our model, we use the **test set**.
This data is kept completely separate until the very end of our development process.

In [None]:
# We create datasets pointing to the folders containing the validation and test images.
# Note that we are using the same transform.
val_dataset_0 = ImageFolder(os.path.join(download_dir, "valid"), transform=transform_0)

test_dataset_0 = ImageFolder(os.path.join(download_dir, "test"), transform=transform_0)


We can use the `evaluate` method to calculate the loss and performance metrics across our separate datasets.

In [None]:
# Create specific loaders for the validation and test sets
val_loader_0 = DataLoader(val_dataset_0, batch_size=8, shuffle=False)
test_loader_0 = DataLoader(test_dataset_0, batch_size=8, shuffle=False)

# Evaluate the model on both sets
val_loss, val_accuracy = model_0.evaluate(val_loader_0)
test_loss, test_accuracy = model_0.evaluate(test_loader_0)

print(f"Loss: validation={val_loss}  test={test_loss}")
print(f"Accuracy: validation={val_accuracy}  test={test_accuracy}")

You can also monitor the validation performance during the training process by passing the `validation_data` argument to the `fit` method.

In [None]:
# Train for another 20 epochs while monitoring validation performance
history_1 = model_0.fit(train_loader_0, epochs=20, validation_data=val_loader_0)

Visualising these curves helps us understand how the model is learning over time.

In [None]:
# Create a function to reuse throughout
def plot_train_history(history):
    fig, (ax1, ax2) = plt.subplots(ncols=2)

    # Use the `history` argument so the function works for any training run
    ax1.plot(history.history["loss"], label="train")
    ax1.plot(history.history["val_loss"], label="val")
    ax1.set_title("Training loss")
    ax1.set_xlabel("Epochs")
    ax1.legend()

    ax2.plot(history.history["acc"], label="train")
    ax2.plot(history.history["val_acc"], label="val")
    ax2.set_title("Accuracy")
    ax2.set_xlabel("Epochs")
    ax2.legend()

    return fig

plot_train_history(history_1)

Notice that we used the same `model_0` as before.
Unless you re-ran the cell where the model was initially defined, this second round of training resumed exactly where the previous one finished.

Observe the gap between the two lines: the training loss continues to decrease as the model "memorises" the training images, but the validation loss may stay stagnant or even begin to rise.
This is a clear sign of **overfitting**, where the model is no longer learning general features of moths, but rather specific details unique to the training set.

#### 🖌️*Start from scratch*

Create a new model called `model_01` and train it from scratch for 40 epochs.
Monitor both the validation loss and accuracy throughout the process.
Once finished, plot the training curves to get a complete view of how the model evolved.

*Optional*: Research early stopping and how to implement it in Keras.
Try it out!
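
As a starting point, the logic behind early stopping can be sketched in pure Python: training halts once the validation loss has failed to improve for a given number of epochs (the "patience"). This is a conceptual sketch with made-up loss values and a hypothetical helper name, not the Keras implementation; in Keras, the relevant tool is the `keras.callbacks.EarlyStopping` callback, which takes `monitor` and `patience` arguments.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the epoch (0-based) at which training would stop."""
    best_loss = float("inf")
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: reset the counter
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1

        if epochs_without_improvement >= patience:
            return epoch

    # Training ran to completion without triggering early stopping
    return len(val_losses) - 1


# Made-up validation losses that improve and then start rising
toy_val_losses = [1.0, 0.8, 0.7, 0.72, 0.75, 0.8, 0.9]
stop_epoch = early_stopping_epoch(toy_val_losses, patience=2)
print(stop_epoch)  # prints 4
```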

#### 🖌️*Full evaluation*

Accuracy provides a good overall summary, but it can hide specific weaknesses in a model.
For example, a model might be very good at identifying common moth species but consistently confuse two similar-looking ones.
To investigate this, we use a confusion matrix.
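
A confusion matrix counts, for each true class, how often each class was predicted. The pure-Python sketch below builds one by hand from made-up labels, with rows as true classes and columns as predicted classes (the same convention used by scikit-learn's `confusion_matrix`). The helper name is hypothetical.

```python
def confusion_matrix_counts(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix


# Made-up labels for three classes (0, 1, 2)
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 0, 1, 2, 1, 2]

toy_matrix = confusion_matrix_counts(toy_true, toy_pred, n_classes=3)
for matrix_row in toy_matrix:
    print(matrix_row)
```

Correct predictions land on the diagonal; here the off-diagonal entries show that classes 1 and 2 are confused with each other, while class 0 is always identified correctly.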

While Keras has many built-in features, its native metrics for detailed error analysis are limited.
We can bridge this gap by using scikit-learn.
First, we need to extract the raw prediction scores (the confidence for each class) and the true labels from our test set.

In [None]:
def extract_scores(model, dataset):
    loader = DataLoader(dataset, batch_size=8, shuffle=False, num_workers=2)

    y_score = []
    y_true = []

    for im_batch, label_batch in loader:
        outputs = model(im_batch)

        # Move the outputs to the CPU first: .numpy() fails on GPU tensors
        y_score.extend(outputs.detach().cpu().numpy())
        y_true.extend(label_batch.detach().numpy())

    return np.array(y_score), np.array(y_true)


Use the `extract_scores` function with your preferred model and the test dataset to retrieve the raw confidence scores and true labels.
Then use `argmax` to get the most confident class for each image: `y_pred = y_score.argmax(axis=1)`.
Finally, use `scikit-learn` to generate the confusion matrix.

## Better model design

One strategy to improve performance is to use architectures specifically suited to your data type.
For images, we use **Convolutional Neural Networks** (CNNs).

While we will cover the technical details later in the module, the main advantage of a CNN is its ability to recognise patterns (like edges or textures) regardless of where they appear in an image.
This makes them far more efficient for visual tasks than the Dense models we have used so far.
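
The idea that one small filter can detect a pattern anywhere in the image can be illustrated with a pure-Python sketch. The function below slides a kernel over an image without padding, which is the operation deep learning libraries call a convolution (strictly speaking, a cross-correlation). The kernel and images are made up, and the function name is hypothetical.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding) and return the responses."""
    k = len(kernel)
    out_rows = len(image) - k + 1
    out_cols = len(image[0]) - k + 1
    output = []
    for r in range(out_rows):
        response_row = []
        for c in range(out_cols):
            # Multiply the kernel with the patch under it and sum the result
            response_row.append(
                sum(
                    kernel[i][j] * image[r + i][c + j]
                    for i in range(k)
                    for j in range(k)
                )
            )
        output.append(response_row)
    return output


# A 2x2 kernel that responds to a bright diagonal (made-up values)
diagonal_kernel = [[1, 0], [0, 1]]

# The same diagonal pattern placed in two different corners of a 4x4 image
top_left = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
bottom_right = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

response_a = convolve2d(top_left, diagonal_kernel)
response_b = convolve2d(bottom_right, diagonal_kernel)

# The strongest response is the same; only its location differs
print(max(max(r) for r in response_a))
print(max(max(r) for r in response_b))
```

Because the same four kernel weights are reused at every position, the number of parameters does not depend on where the pattern appears.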

Below is a simple convolutional architecture.

In [None]:
resolution = 64

model_cnn = keras.Sequential(
    [
        keras.layers.Input(shape=[3, resolution, resolution]),
        keras.layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
        keras.layers.MaxPooling2D(pool_size=(2, 2)),
        keras.layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
        keras.layers.GlobalMaxPooling2D(),
        keras.layers.Dense(num_classes, activation="softmax"),
    ]
)

If you look at the summary, you will notice that this model has fewer trainable parameters than a Dense model, even when handling larger images (`resolution = 64`).

#### 🖌️*CNN sizes*

Try out different resolution sizes (for example: 32, 64, 128).
Does the number of parameters change in the same way as it did for the previous Dense model?

#### 🖌️*Train a CNN model*

Train the model using the same settings as our earlier experiments: 8×8 images, cross-entropy loss, and the Adam optimiser.
Ensure you monitor the validation loss throughout the process.

Answer the following questions:

- Do you observe a change in accuracy or loss?
- What happens if you increase the resolution to 32×32 or even 64×64?
- How does the training time change as you increase the resolution?

## Data augmentation

So far, our models have struggled with overfitting.
You may have noticed the training accuracy climbing while the validation accuracy remains low, a clear sign that the model is simply "memorising" the specific images in our training set rather than learning general features of moths.

To improve generalisation, we need more data.
However, in ecology, collecting thousands of additional samples is often not feasible.
This is where **data augmentation** becomes a great strategy.

Augmentation artificially expands our dataset by creating slightly modified versions of our existing images.
By applying random transformations (e.g. cropping, flipping, or adjusting colours) we force the model to focus on the essential features of the moth (its shape and wing patterns) rather than irrelevant details like the exact position in the frame or the specific lighting conditions of the photo.

We can use the `torchvision` library to define a range of random transformations.
These will be applied "on the fly" during training, meaning the model sees a slightly different version of the image in every epoch.

In [None]:
# Create a sequence of random augmentations
augmentations = v2.Compose([
    # Randomly crop a portion of the image and resize it back to 128x128
    v2.RandomResizedCrop([128, 128]),

    # Randomly flip the image horizontally
    v2.RandomHorizontalFlip(p=0.5),

    # Randomly convert the image to grayscale 10% of the time
    v2.RandomGrayscale(p=0.1),

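    # Randomly adjust brightness, contrast, and saturation
    # (with its default arguments ColorJitter leaves the image unchanged;
    # the strengths below are illustrative values)
    v2.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),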
])

# Visualise the effect of these augmentations on a single image
fig, axes = plt.subplots(nrows=4, ncols=4)
for ax in axes.flatten():
    # Applying the same augmentation pipeline generates a unique result each time
    im_aug = augmentations(im_tensor)
    ax.imshow(to_pil(im_aug))
    ax.axis("off")
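
To make the "on the fly" idea concrete, here is a pure-Python sketch of a random horizontal flip applied to a tiny, made-up image. It is a simplified stand-in for `v2.RandomHorizontalFlip`, not its implementation, and the function name and seed are assumptions made for illustration.

```python
import random


def random_horizontal_flip(image_rows, p, rng):
    """Mirror each row left-to-right with probability `p`."""
    if rng.random() < p:
        return [list(reversed(row)) for row in image_rows]
    return [list(row) for row in image_rows]


# A made-up 2x3 single-channel image
toy_pixels = [
    [1, 2, 3],
    [4, 5, 6],
]

rng = random.Random(0)  # fixed seed so the illustration is reproducible

# Each "epoch" the same image may or may not be flipped
for epoch in range(4):
    print(random_horizontal_flip(toy_pixels, p=0.5, rng=rng))
```

Since the decision is drawn again every time the image is loaded, the stored dataset never changes; only the version handed to the model does.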

Now we can integrate these augmentations into our training pipeline.
It is important to note that we only apply augmentations to the training set.
The validation and test sets should remain unmodified (except for basic resizing) so they provide a reliable, "real-world" measure of performance.

In [None]:
transform_128 = v2.Compose(
    [
        v2.ToImage(),
        v2.Resize([128, 128]),
        v2.ToDtype(torch.float32, scale=True), # scale=True ensures values are 0-1
    ]
)

# We can also compose two complex transformations
transform_aug = v2.Compose([
    transform_128,
    augmentations,
])

# Use the new augmentation transform
train_dataset_aug = ImageFolder(train_dataset_path, transform=transform_aug)

# Create val dataset. Note we are not using augmentations here
val_dataset_aug = ImageFolder(os.path.join(download_dir, "valid"), transform=transform_128)

model_cnn_128 = keras.Sequential(
    [
        keras.layers.Input(shape=[3, 128, 128]),
        keras.layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
        keras.layers.MaxPooling2D(pool_size=(2, 2)),
        keras.layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
        keras.layers.GlobalMaxPooling2D(),
        keras.layers.Dense(num_classes, activation="softmax"),
    ]
)

model_cnn_128.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(),
    optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
    metrics=[
        keras.metrics.SparseCategoricalAccuracy(name="acc"),
    ],
)
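
As a sanity check on the architecture above, the spatial size of the feature maps and the number of convolution parameters can be worked out by hand. The sketch below assumes the Keras defaults used in the cell (no padding, stride 1 for `Conv2D`, and a pooling stride equal to the pool size); the helper names `conv_out` and `conv_params` are hypothetical.

```python
def conv_out(size, kernel, stride=1):
    """Output size of a convolution or pooling layer without padding."""
    return (size - kernel) // stride + 1


def conv_params(in_channels, out_channels, kernel):
    """Number of weights plus biases in a 2D convolution with a square kernel."""
    return kernel * kernel * in_channels * out_channels + out_channels


size = 128
size = conv_out(size, kernel=3)            # first Conv2D: 128 -> 126
size = conv_out(size, kernel=2, stride=2)  # MaxPooling2D: 126 -> 63
size = conv_out(size, kernel=3)            # second Conv2D: 63 -> 61

params_conv1 = conv_params(3, 32, 3)   # 3 input channels, 32 filters
params_conv2 = conv_params(32, 32, 3)  # 32 input channels, 32 filters
print(size, params_conv1, params_conv2)
```

Under these assumptions, `GlobalMaxPooling2D` then reduces each of the 32 feature maps of size 61x61 to a single value, so the final `Dense` layer receives a 32-dimensional vector.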

Finally, we run the training.

In [None]:
train_loader_aug = DataLoader(train_dataset_aug, batch_size=8, num_workers=2, shuffle=True)
val_loader_aug = DataLoader(val_dataset_aug, batch_size=8, num_workers=2, shuffle=False)
history_aug = model_cnn_128.fit(train_loader_aug, epochs=20, validation_data=val_loader_aug)

With augmentations enabled, the training loss decreases more slowly, as the task has become harder.
However, the gap between training and validation performance should ideally narrow, indicating better generalisation.

In [None]:
plot_train_history(history_aug)
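
One way to put a number on this gap is to subtract the validation metric from the training metric at each epoch. The sketch below assumes a dictionary shaped like `history_aug.history`, with the keys `"acc"` and `"val_acc"` that follow from the metric name used in `compile`. The helper `generalisation_gap` is hypothetical, and the values in `example_history` are made up for illustration rather than real results.

```python
def generalisation_gap(history_dict, metric="acc"):
    """Return the per-epoch difference between training and validation metric."""
    train_values = history_dict[metric]
    val_values = history_dict[f"val_{metric}"]
    return [train - val for train, val in zip(train_values, val_values)]


# Made-up values for illustration only
example_history = {
    "acc": [0.50, 0.75, 1.00],
    "val_acc": [0.50, 0.50, 0.50],
}

gaps = generalisation_gap(example_history)
print(gaps)
```

With a fitted model, the same helper could be called as `generalisation_gap(history_aug.history)`; a gap that keeps growing over the epochs is a sign of overfitting.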

#### 🖌️*Improve the model performance*

Experiment with the architecture, hyperparameters, transforms, and so on, and try to get the best performance you can.
As another way to reduce overfitting, read up on dropout.
Make sure you leave at least 20 minutes for the next section, though.
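
As a starting point for the dropout idea mentioned above, here is a minimal NumPy sketch of "inverted" dropout. It is an illustration under simplifying assumptions, not the Keras implementation; in a Keras model you would normally add a `keras.layers.Dropout(rate)` layer instead. The function name `dropout` and the toy activations are hypothetical.

```python
import numpy as np


def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest."""
    if not training or rate == 0.0:
        # At inference time, dropout leaves activations unchanged
        return activations
    keep_prob = 1.0 - rate
    # Each unit is kept independently with probability keep_prob
    mask = rng.random(activations.shape) < keep_prob
    # Dividing by keep_prob keeps the expected activation unchanged
    return activations * mask / keep_prob


rng = np.random.default_rng(seed=0)
activations = np.ones((1000, 32))

dropped = dropout(activations, rate=0.5, rng=rng)
unchanged = dropout(activations, rate=0.5, rng=rng, training=False)

print(f"Fraction zeroed: {(dropped == 0).mean():.2f}")
print(f"Mean activation after dropout: {dropped.mean():.2f}")
```

Randomly silencing units during training discourages the network from relying on any single feature, which is why dropout tends to reduce overfitting.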


## Transfer learning

The final strategy we will explore is **transfer learning**.
As discussed in the lectures, this involves reusing a model that has already been trained on a massive, general dataset.
The idea is that the model has already learned to recognise fundamental visual features, such as edges, textures, and shapes, which are "transferable" to our specific task of identifying moth species.

### Model loading

We will use the `timm` library (PyTorch Image Models), which provides a wide range of pre-trained architectures.
We have selected `efficientnet_b0`, a model known for being highly efficient and fast while maintaining strong performance.
This model was originally trained on ImageNet, a widely used benchmark dataset containing over a million images across a thousand different categories.

In [None]:
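import timm

# List the names of all architectures that ship with pre-trained weights
timm.list_models(pretrained=True)

# Load EfficientNet-B0 with pre-trained weights and switch it to inference mode
efficientnet = timm.create_model("efficientnet_b0", pretrained=True).eval()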

Every pre-trained model has specific requirements for its input images (such as specific normalisation values).
It is essential to use the same preprocessing pipeline that the model was originally trained with.

In [None]:
transform = timm.data.create_transform(
    **timm.data.resolve_data_config(efficientnet.pretrained_cfg)
)
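
To illustrate what such a normalisation step does, the sketch below applies channel-wise normalisation with NumPy. It assumes the commonly used ImageNet channel statistics; the actual values for a given model should be read from its `pretrained_cfg`, and the real `timm` pipeline also handles resizing and cropping. The function name `normalise` and the toy image are hypothetical.

```python
import numpy as np

# Commonly used ImageNet channel statistics (assumed here for illustration)
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])


def normalise(image_chw, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Normalise a (C, H, W) image with values in [0, 1], channel by channel."""
    # Reshape to (C, 1, 1) so the statistics broadcast over height and width
    return (image_chw - mean[:, None, None]) / std[:, None, None]


# Toy image where every pixel equals the channel mean
toy_image = np.broadcast_to(IMAGENET_MEAN[:, None, None], (3, 4, 4))
normalised = normalise(toy_image)
print(normalised.shape, np.abs(normalised).max())
```

A pixel equal to the channel mean maps to zero, so a model fed with differently scaled inputs would see values far from the range it was trained on.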

### Feature embeddings

Instead of training a deep network from scratch, we will use EfficientNet as a "feature extractor".
We pass our moth images through the model and stop just before the final classification layer.
The outputs at this stage are called **feature embeddings**.

Embeddings are high-dimensional numerical representations of an image.
Because the model was trained on over a million varied images, these embeddings are quite rich and descriptive.

We can then use these embeddings as inputs for a much simpler model, such as a Logistic Regression classifier.
This is exactly the same approach we used in the machine learning session.

First, we define a function to extract these features:

In [None]:
from tqdm import tqdm

def extract_efficientnet_features(dataset, batch_size=8):
    loader = DataLoader(dataset, batch_size=batch_size)
    features = []
    targets = []

    # Gradients are not needed for feature extraction
    with torch.no_grad():
        for im_batch, label_batch in tqdm(loader):
            # Extract the spatial feature maps, shape (batch, channels, height, width)
            feats = efficientnet.forward_features(im_batch)

            # Global average pooling to convert spatial features into a 1D vector
            feats = feats.mean(dim=(2, 3))

            # Convert tensors to numpy arrays
            features.extend(feats.numpy())
            targets.extend(label_batch.numpy())

    features = np.array(features)
    targets = np.array(targets)
    return features, targets
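
The global average pooling step in the function above can be illustrated with NumPy alone. The toy array below is hypothetical and much smaller than real EfficientNet feature maps; it only shows how averaging over the two spatial axes turns a `(batch, channels, height, width)` array into one vector per image.

```python
import numpy as np

# Toy feature maps: batch of 2 images, 3 channels, 2x2 spatial grid
feature_maps = np.arange(24, dtype=np.float32).reshape(2, 3, 2, 2)

# Average over the two spatial axes (height and width)
pooled = feature_maps.mean(axis=(2, 3))

print(feature_maps.shape, "->", pooled.shape)
print(pooled)
```

Each row of `pooled` is the embedding of one image, with one value per channel.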

Now, we apply this to our moth datasets:

In [None]:
# Extract features for the training set
train_dataset = ImageFolder(os.path.join(download_dir, "train"), transform=transform)
X_train, y_train = extract_efficientnet_features(train_dataset)

# Extract features for the test set
test_dataset = ImageFolder(os.path.join(download_dir, "test"), transform=transform)
X_test, y_test = extract_efficientnet_features(test_dataset)

With our features ready, we can train a Logistic Regression model in seconds:

In [None]:
from sklearn.linear_model import LogisticRegression

# Fit the classifier on the extracted embeddings
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)

Finally, we evaluate the classifier on the test set:

In [None]:
from sklearn.metrics import accuracy_score

# Evaluate on the test set
y_pred = lr_model.predict(X_test)
score = accuracy_score(y_test, y_pred)

print(f"Transfer Learning Accuracy: {score:.4f}")

#### 🖌️*Few-shot transfer learning*

Transfer learning is particularly interesting when you have very little data.
In ecology, obtaining thousands of labelled images is often impossible.
This is the idea behind **few-shot learning**: training a model with only a handful of examples.

How well can we perform with only 1, 5, or 10 examples per species?
Since we have already extracted the embeddings for the entire dataset, we can quickly test this by selecting small subsets.

Use the function below to run an experiment.
For each scenario (1, 5, and 10 examples per species), train the model 5 times (using different random seeds) and plot the average accuracy on the test set.

In [None]:
def select_subset(X, y, n=5, seed=None):
    """Randomly select n examples per class from the embeddings X and labels y."""
    labels = pd.DataFrame({"label": y})
    selection = labels.groupby("label").sample(n=n, random_state=seed)
    X_subset = X[selection.index]
    y_subset = y[selection.index]
    return X_subset, y_subset
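
Before running the full experiment, it can help to check what `select_subset` returns on a small example. The sketch below repeats the function so that it runs on its own and applies it to hypothetical toy data (three classes with four examples each); it does not touch the real embeddings.

```python
import numpy as np
import pandas as pd


def select_subset(X, y, n=5, seed=None):
    # Same logic as the function above: sample n rows per label
    labels = pd.DataFrame({"label": y})
    selection = labels.groupby("label").sample(n=n, random_state=seed)
    idx = selection.index.to_numpy()
    return X[idx], y[idx]


# Toy data: 3 classes with 4 examples each, 2-dimensional "embeddings"
y_toy = np.repeat([0, 1, 2], 4)
X_toy = np.arange(24).reshape(12, 2)

X_small, y_small = select_subset(X_toy, y_toy, n=2, seed=0)
print(X_small.shape, np.bincount(y_small))
```

Calling the function again with the same `seed` returns the same subset, while a different seed generally returns a different one, which is what makes repeated runs comparable.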