# Lab 3: TensorBoard, Vanishing Gradients & Hyperparameter Tuning

## Learning Objectives

By the end of this lab, you will be able to:
1. Use TensorBoard to monitor and visualise training metrics
2. Explain mathematically why vanishing gradients occur in deep networks
3. Diagnose vanishing gradients using gradient norm analysis
4. Apply solutions: ReLU, batch normalisation, careful initialisation
5. Use Optuna for automated hyperparameter optimisation

**Prerequisites:** Lab 2 (FFNNs, backpropagation basics)


In [None]:
# ==== Environment Setup ====
# Detects Colab vs local and provides cross-platform utilities

import os
import sys

# Detect environment
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    print("✓ Running on Google Colab")
else:
    print("✓ Running locally")

def download_file(url: str, filename: str) -> str:
    """Download file if it doesn't exist. Works on both Colab and local."""
    if os.path.exists(filename):
        print(f"✓ {filename} already exists")
        return filename
    
    print(f"Downloading {filename}...")
    if IN_COLAB:
        import subprocess
        subprocess.run(['wget', '-q', url, '-O', filename], check=True)
    else:
        import urllib.request
        urllib.request.urlretrieve(url, filename)
    print(f"✓ Downloaded {filename}")
    return filename

In [None]:
# ==== Device Setup ====
import torch

def get_device():
    """Get best available device: CUDA > MPS > CPU."""
    if torch.cuda.is_available():
        device = torch.device('cuda')
        print(f"✓ Using CUDA GPU: {torch.cuda.get_device_name(0)}")
    elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
        device = torch.device('mps')
        print("✓ Using Apple MPS (Metal)")
    else:
        device = torch.device('cpu')
        print("✓ Using CPU")
    return device

DEVICE = get_device()

# Loading the data

As in last week's lab, we will use the PyTorch framework.

In [23]:
# import basic libs
import pandas as pd, numpy as np
import matplotlib.pyplot as plt, seaborn as sns
from sklearn.model_selection import train_test_split
from IPython.display import display, clear_output
import time
import datetime

# import torch (whole lib & specific modules)
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

torch.set_default_dtype(torch.double)

seed = 42

In [24]:
# import dataset
from torchvision import datasets
from torchvision.transforms import ToTensor

training_data = datasets.FashionMNIST( # MNIST for image classification
    root="data", # specifies directory
    train=True,
    download=True,
    transform=ToTensor(), # converts images from PIL format (or numpy array) to PyTorch Tensor (fundamental data structure for PyTorch)
)

test_data = datasets.FashionMNIST(
    root="data",
    train=False, # NB
    download=True,
    transform=ToTensor(),
)

In [25]:
# the full dataset is more than we need for this lab
print(f"len of training dataset: {len(training_data)}, len of testing dataset: {len(test_data)}")

len of training dataset: 60000, len of testing dataset: 10000


In [26]:
# extract a subset of the datasets
from torch.utils.data import random_split

training_data, _ = random_split(training_data, # arg: original dataset to split
                                [1/10, 9/10], # split props
                                generator=torch.manual_seed(seed)) # reproducibility

test_data, _ = random_split(test_data, [1/10, 9/10], generator=torch.manual_seed(seed))

print(f"len of training dataset: {len(training_data)}, len of testing dataset: {len(test_data)}")

len of training dataset: 6000, len of testing dataset: 1000


# Constructing the model using a generic class

As last week, we construct a lazy sequential model with a configurable number of hidden layers, whose widths are given by the list `hidden_sizes`.

Last week's model was hard-wired for a classification task, so it only required `input_dims` as an argument. Here we define a more generic architecture that can be used for either regression or classification, so it also requires `out_feat_size`, which sets the size of the output layer.

In [None]:
class LazySequential(nn.Sequential):
    """Flexible sequential model with configurable hidden layers and activations."""
    
    def __init__(self,
                 in_feat_size: int,
                 out_feat_size: int,
                 hidden_sizes: list,
                 activation_fn: str = 'ReLU'):
        
        layers = [nn.Flatten()]  # Flatten input image to vector
        
        # Add hidden layers with activations
        for idx, size in enumerate(hidden_sizes):
            in_features = in_feat_size if idx == 0 else hidden_sizes[idx-1]
            layers.append(nn.Linear(in_features, size))
            layers.append(getattr(nn, activation_fn)())
        
        # Output layer (no activation - handled by loss function)
        layers.append(nn.Linear(hidden_sizes[-1], out_feat_size))
        
        super().__init__(*layers)


So now we have a generic class for defining the architecture up to (but not including) the output activation function. We haven't applied a final activation because in PyTorch it is handled by the selected loss function (here, `CrossEntropyLoss`, which applies log-softmax internally).

Next, we need to define the training function. Because the loss function contains the final activation, the same generic architecture can be reused for multiple end uses.

We will also send our per-batch training loss (already calculated in the training loop) to TensorBoard for visualisation.

In [None]:
from tqdm import tqdm
from torch.utils.tensorboard import SummaryWriter
import datetime

def torch_train(model,
                data,
                optimizer='Adam',
                loss_fn='CrossEntropyLoss',  # name of a loss class in torch.nn
                batch_size=64,
                epochs=1,
                shuffle=False,
                logdir=None):
    """Train a PyTorch model with TensorBoard logging."""
    
    criterion = getattr(nn, loss_fn)()
    opt = getattr(optim, optimizer)(model.parameters())
    dataloader = DataLoader(data, batch_size=batch_size, shuffle=shuffle)
    
    model.train()
    
    # Create unique log directory
    if logdir is None:
        current_time = datetime.datetime.now().strftime('%Y%m%d-%H%M%S')
        logdir = f'runs/{current_time}/train'
    
    logger = SummaryWriter(logdir)
    
    for epoch in range(epochs):
        for batch_idx, (X, Y) in enumerate(tqdm(dataloader, desc=f'Epoch {epoch+1}')):
            opt.zero_grad()
            pred = model(X)
            loss = criterion(pred.squeeze(-1), Y.long())
            loss.backward()
            opt.step()
            
            global_step = batch_idx + len(dataloader) * epoch
            logger.add_scalar('Training loss per batch', loss.item(), global_step)
    
    logger.close()
    return model


### Instantiate our models

In [None]:
# define 2 architectures
hidden_layers1 = [2**10, 2**5, 2**3]
hidden_layers2 = [2**10, 2**7, 2**5]

# model1
model1 = LazySequential(in_feat_size=28*28,
                        out_feat_size=10,
                        hidden_sizes=hidden_layers1)
model1 = torch_train(model1,
                     training_data,
                     logdir='runs/models1_2/model1')

# model 2
model2 = LazySequential(in_feat_size=28*28, out_feat_size=10, hidden_sizes=hidden_layers2)
model2 = torch_train(model2, training_data, logdir='runs/models1_2/model2')

# TensorBoard(s)

[TensorBoard](https://www.tensorflow.org/tensorboard) is a visualisation tool provided by TensorFlow that lets us visualise the ML training process, which helps us understand how our model is training and diagnose issues. It can display:
- loss,
- accuracy,
- model graphs,
- other data (all customisable).

TensorBoard is just the visualisation dashboard; we need to generate the metrics and send them to it. PyTorch integrates with TensorBoard through `SummaryWriter`, which writes the values we log to disk. The writer does not compute anything itself, so any metric we want to see has to be calculated in our training / eval loops.

In [None]:
# IPython magic commands (work in Colab and Jupyter): the first loads the TensorBoard extension,
# the second tells TensorBoard to look for log files in the `runs/models1_2` dir

%load_ext tensorboard
%tensorboard --logdir runs/models1_2

This TensorBoard shows us the per-batch training loss that we calculated as a by-product of our training process, but it might be more informative to know how the average epoch loss changes over time.

Let's extend our TensorBoard to visualise this. We'll need to calculate this new metric and pass it to PyTorch's `SummaryWriter` object within the training function.

In [None]:
# @title Modifying training to log avg epoch loss and total elapsed time per epoch.

import time


def torch_train_mod(model, data, optimizer='Adam', loss_fn='CrossEntropyLoss', batch_size=2**6, epochs=1, shuffle=False, logdir=None):

    # distinct names (criterion / opt / loss) so the arguments are never overwritten
    criterion = getattr(nn, loss_fn)()
    opt = getattr(optim, optimizer)(model.parameters())
    dataloader = DataLoader(data, batch_size=batch_size, shuffle=shuffle)

    model.train()

    current_time = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    logdir = '/'.join(['runs', current_time, 'train']) if logdir is None else logdir
    logger = SummaryWriter(logdir)


    start = time.time() # <- NEW: returns current time as float

    for epoch in range(epochs):
        epoch_loss = 0 # <- NEW: init epoch loss
        for batch_idx, (X, Y) in enumerate(tqdm(dataloader)):
            opt.zero_grad()
            pred = model(X)
            loss = criterion(pred.squeeze(-1), Y.long()) # labels must be integer class indices
            loss.backward()
            opt.step()
            epoch_loss += loss.item() # <- NEW: sum of epoch loss
            logger.add_scalar('Training loss per batch', loss.item(), batch_idx + len(dataloader) * epoch)

        # new:
        logger.add_scalar('Avg epoch loss',  # <- NEW comp avg
                        epoch_loss / len(dataloader),
                        epoch) # step is the epoch here, not the batch
        logger.add_scalar('Total training time', time.time() - start, epoch)

    logger.flush() # forces buffered data to disk
    logger.close()

    return model

# Online learning, minibatch and batch

With our training function set up to visualise how loss changes over epoch, let's visualise three common batch learning strategies.

- **Online learning**: Processes one sample at a time, updating model weights after each individual example.
- **Minibatch learning**: Processes a small subset of the samples (here, 64) at a time, updating model weights after each minibatch.
- **Batch learning** (or Full-batch learning): Processes the entire dataset at once, computing gradients over all samples before updating weights.


<details>
<summary><b>🤔 Question:</b> Why does online learning show more noise but potentially faster initial progress?</summary>

**Answer:** 
- **More noise**: Each gradient is computed from a single sample, so it's a high-variance estimate of the true gradient
- **Faster initial progress**: Updates happen after every sample (many updates per epoch), allowing the model to escape poor initialisations quickly
- **Trade-off**: The noise can help escape local minima but makes convergence to a precise minimum harder

Minibatch provides a balance: lower variance than online, more updates per epoch than full-batch.
</details>
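To make the "number of updates" part of this trade-off concrete, the sketch below counts optimiser steps per epoch for the three strategies. It assumes the 6000-sample training subset from this lab and that the last partial batch is kept (the `DataLoader` default); `updates_per_epoch` is a helper name introduced here for illustration only.

```python
import math


def updates_per_epoch(n_samples: int, batch_size: int) -> int:
    """Number of optimiser steps in one pass over the data (last partial batch kept)."""
    return math.ceil(n_samples / batch_size)


n_samples = 6000  # size of the training subset used in this lab
for name, batch_size in [("online", 1), ("minibatch", 64), ("fullbatch", n_samples)]:
    print(f"{name:>9}: {updates_per_epoch(n_samples, batch_size)} updates per epoch")
```

Online learning makes thousands of noisy updates per epoch, full-batch makes exactly one, and minibatch sits in between.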


In [None]:
import warnings
warnings.filterwarnings('ignore')

# define our learning types
names = ['online', 'minibatch', 'fullbatch']
batch_sizes = [1, 2**6, len(training_data)]

# make one timestamped parent directory for this batch of runs
timestamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
parent_logdir = f"runs/batches/{timestamp}"

# first instantiate the model, then feed that model to the training function
for name, batch_size in zip(names, batch_sizes): # zip combines our two lists into an iterator of tuples
    batch_model = LazySequential(
        in_feat_size=28*28, # model is generic across batch types
        out_feat_size=10,
        hidden_sizes=hidden_layers1 # Using hidden_layers1 for consistency
    )
    logdir = f"{parent_logdir}/{name}"   # <- now grouped under timestamp
    batch_model = torch_train_mod(
        batch_model,
        training_data,
        batch_size=batch_size, # this changes within the loop
        epochs=2,
        logdir=logdir
    )

print("Logs saved to:", parent_logdir)

In [None]:
%tensorboard --logdir {parent_logdir}

# The Vanishing Gradient Problem

## Mathematical Foundation

### Backpropagation and the Chain Rule

For a network with $L$ layers, the gradient of the loss with respect to weights in layer $l$ is:

$$\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \frac{\partial \mathcal{L}}{\partial a^{(L)}} \cdot \prod_{k=l+1}^{L} \frac{\partial a^{(k)}}{\partial a^{(k-1)}} \cdot \frac{\partial a^{(l)}}{\partial W^{(l)}}$$

Each term $\frac{\partial a^{(k)}}{\partial a^{(k-1)}} = \sigma'(z^{(k)}) \cdot W^{(k)}$ involves the **activation derivative**.

### Sigmoid Derivative Derivation

For sigmoid: $\sigma(z) = \frac{1}{1 + e^{-z}}$

**Step 1:** Apply quotient rule:
$$\sigma'(z) = \frac{e^{-z}}{(1 + e^{-z})^2}$$

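**Step 2:** Rewrite in terms of $\sigma(z)$, using $1 - \sigma(z) = \frac{e^{-z}}{1 + e^{-z}}$:
$$\sigma'(z) = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = \sigma(z)(1 - \sigma(z))$$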

**Step 3:** Find maximum:
- $\sigma(z) \in (0, 1)$, so $\sigma'(z) = \sigma(1-\sigma)$ is maximised when $\sigma = 0.5$
- Maximum value: $0.5 \times 0.5 = 0.25$

$$\boxed{\sigma'(z) \leq 0.25 \text{ for all } z}$$
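The derivation can be checked numerically. The following is a minimal sketch using only the standard library: it confirms that the Step 1 and Step 2 forms agree and that the maximum over a grid of $z$ values is 0.25 at $z = 0$. The function names and the grid are choices made here for illustration.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime_quotient(z: float) -> float:
    """Step 1 form: e^{-z} / (1 + e^{-z})^2."""
    return math.exp(-z) / (1.0 + math.exp(-z)) ** 2


def sigmoid_prime(z: float) -> float:
    """Step 2 form: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


grid = [i / 10 for i in range(-80, 81)]  # z from -8 to 8 in steps of 0.1

# largest disagreement between the two forms (should be at rounding-error level)
max_gap = max(abs(sigmoid_prime_quotient(z) - sigmoid_prime(z)) for z in grid)
# grid point where the derivative is largest
z_at_max = max(grid, key=sigmoid_prime)

print(f"largest gap between the two forms: {max_gap:.1e}")
print(f"max sigma'(z) on the grid = {sigmoid_prime(z_at_max):.4f} at z = {z_at_max}")
```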

<details>
<summary><b>🤔 Question:</b> What is the maximum value of tanh'(z)? Where does it occur?</summary>

**Answer:** For $\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$, the derivative is $\tanh'(z) = 1 - \tanh^2(z)$.

Maximum occurs at $z = 0$ where $\tanh(0) = 0$, giving $\tanh'(0) = 1$.

This is better than sigmoid (max 0.25), but tanh can still cause vanishing gradients because $\tanh'(z) < 1$ for all $z \neq 0$ and $\tanh'(z) \to 0$ as $|z|$ grows.
</details>

### Gradient Attenuation Through Depth

For a network with $L$ sigmoid layers, and assuming the weight factors $W^{(k)}$ do not amplify the signal (norm at most 1), the activation derivatives alone give roughly:

$$\left|\frac{\partial \mathcal{L}}{\partial W^{(1)}}\right| \leq 0.25^{L-1} \cdot \left|\frac{\partial \mathcal{L}}{\partial W^{(L)}}\right|$$

| Depth (L) | Gradient Attenuation |
|-----------|----------------------|
| 5 | $0.25^4 \approx 4 \times 10^{-3}$ |
| 10 | $0.25^9 \approx 4 \times 10^{-6}$ |
| 20 | $0.25^{19} \approx 4 \times 10^{-12}$ |
| 50 | $0.25^{49} \approx 3 \times 10^{-30}$ |

<details>
<summary><b>🤔 Question:</b> Calculate the expected gradient magnitude at layer 1 of a 50-layer sigmoid network if the gradient at layer 50 is 1.0</summary>

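**Answer:** At most $0.25^{49} \approx 3.2 \times 10^{-30}$ (taking the maximum derivative of 0.25 at every layer and ignoring the weights).

This number is still representable in floating point, but it is about 30 orders of magnitude smaller than the gradient at the last layer, so the resulting weight updates are negligible. The first layer receives no meaningful gradient signal!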
</details>
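The attenuation factors can be reproduced in a few lines. This sketch only evaluates the bound $0.25^{L-1}$ from the activation derivatives; it ignores the weight matrices, and `attenuation_bound` is a name introduced here for illustration.

```python
def attenuation_bound(depth: int, max_derivative: float = 0.25) -> float:
    """Upper bound on the factor contributed by activation derivatives alone
    when backpropagating from layer `depth` down to layer 1 (weights ignored)."""
    return max_derivative ** (depth - 1)


for depth in (5, 10, 20, 50):
    print(f"L = {depth:2d}: 0.25^{depth - 1} = {attenuation_bound(depth):.2e}")
```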


# BONUS:

## Why it’s worse for sigmoid/tanh

* Sigmoid squashes input to $(0,1)$. Inputs with large $|z|$ fall in saturated regions ($\sigma'(z) \approx 0$).
* Tanh squashes to $(-1,1)$. Better, but still saturates.
* ReLU partially fixes it ($f'(z)=1$ when active, 0 when inactive).
  * That prevents vanishing *for active neurons*, but “dead ReLUs” still give zero gradient.
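The bullets above can be compared side by side. This is a small standard-library sketch that evaluates the three derivatives at a few pre-activation values; the sample points are arbitrary choices for illustration.

```python
import math


def sigmoid_prime(z: float) -> float:
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def tanh_prime(z: float) -> float:
    return 1.0 - math.tanh(z) ** 2


def relu_prime(z: float) -> float:
    # derivative is 1 for active units, 0 for inactive ("dead") ones
    return 1.0 if z > 0 else 0.0


print(f"{'z':>5} {'sigmoid':>10} {'tanh':>10} {'relu':>6}")
for z in (-5.0, -2.0, 0.5, 2.0, 5.0):
    print(f"{z:5.1f} {sigmoid_prime(z):10.2e} {tanh_prime(z):10.2e} {relu_prime(z):6.1f}")
```

For large $|z|$ both sigmoid and tanh derivatives collapse towards zero, while the ReLU derivative stays at exactly 1 for any positive input.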

---

## Exploding gradient (the sibling problem)

If the per-layer factors (activation derivative × weight magnitude) are > 1, the product **explodes exponentially**.

* Early gradients become enormous → unstable updates.
* Training loss oscillates or diverges.

Vanishing and exploding are two sides of the same coin: *repeated multiplication through depth*.
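A toy calculation shows how sensitive the product is to the per-layer factor. The sketch below assumes, purely for illustration, that every layer multiplies the gradient by the same scalar; real networks have a different factor at each layer.

```python
def backprop_scale(per_layer_factor: float, depth: int) -> float:
    """Size of a unit gradient after being multiplied by the same factor at every layer."""
    scale = 1.0
    for _ in range(depth):
        scale *= per_layer_factor
    return scale


for factor in (0.5, 1.0, 1.5):
    print(f"factor {factor}: after 50 layers -> {backprop_scale(factor, 50):.2e}")
```

A factor slightly below 1 vanishes and a factor slightly above 1 explodes; only factors close to 1 keep the gradient at a usable scale.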


---

## Back to our lab implementation:

The exercise below aims to demonstrate the vanishing gradient problem. By (i) initialising the weights to small random values and (ii) using a Sigmoid activation function in a deep network, we observe how gradients diminish during training, hindering effective learning.

In [None]:
# Vanishing gradient demonstration
# Use SMALL RANDOM weights (not zero) with sigmoid in a deep network
# Zero init fails to break the symmetry between units, which is a DIFFERENT problem from vanishing gradients

def init_weights_small(module):
    """Initialise with small random weights to demonstrate vanishing gradients."""
    if isinstance(module, nn.Linear):
        nn.init.normal_(module.weight, mean=0, std=0.1)
        nn.init.zeros_(module.bias)

# Deep network with sigmoid (prone to vanishing gradients)
vanishing_layers = [32] * 20  # 20 layers of 32 neurons each
vanishing_model = LazySequential(
    in_feat_size=28*28,
    out_feat_size=10,
    hidden_sizes=vanishing_layers,
    activation_fn='Sigmoid'  # Sigmoid derivatives are small (max 0.25)
)

# Apply small random initialisation
vanishing_model.apply(init_weights_small)

print(f"Model depth: {len(vanishing_layers)} hidden layers")
print("Sigmoid max derivative: 0.25")
print(f"Upper bound on attenuation from sigmoid derivatives alone: 0.25^{len(vanishing_layers)} = {0.25**len(vanishing_layers):.2e}")

# Train and observe vanishing gradients
vanishing_model = torch_train(
    vanishing_model, 
    training_data, 
    epochs=2, 
    logdir='runs/vanishing_gradients'
)


In [None]:
%tensorboard --logdir runs/vanishing_gradients

We can see no downward trend in the loss time series above; it only fluctuates from batch to batch. There's no meaningful learning going on when we encounter the vanishing gradient problem.
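A useful reference point when reading this loss curve is the loss of a model that has learned nothing. As a minimal sketch: a classifier that assigns equal probability to all 10 classes has a cross-entropy of $\ln 10$ (in nats, the unit used by `CrossEntropyLoss`). The helper name below is introduced here for illustration.

```python
import math


def uniform_cross_entropy(n_classes: int) -> float:
    """Cross-entropy (in nats) when every class is assigned probability 1 / n_classes."""
    return -math.log(1.0 / n_classes)


print(f"chance-level loss for 10 classes: {uniform_cross_entropy(10):.4f}")
```

A loss that stays close to this value indicates predictions that are no better than guessing.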

Now we are going to modify our training function to output a heatmap of the weights to TensorBoard (same model as before, only a different training function).

In [None]:
def weight_heatmaps(model, cmap='Reds', **fig_kwargs):
    """Plot |weights| of the first 3 and last 3 hidden layers on a shared colour scale."""
    mats, titles = [], []

    for name, param in model.named_parameters():
        if param.dim() == 2:  # weight matrices only (skip bias vectors)
            # .cpu() makes this work even if the model lives on a GPU
            mats.append(param.detach().abs().cpu().numpy())
            titles.append(name)

    # shared colour limits so the panels are directly comparable
    vmin = min(m.min() for m in mats)
    vmax = max(m.max() for m in mats)

    fig, axes = plt.subplots(2, 3, **fig_kwargs)
    # skip the input layer (index 0) and the output layer (index -1)
    top, bottom = [1, 2, 3], [-4, -3, -2]
    for i, indices in enumerate((top, bottom)):
        for j, w in enumerate(indices):
            axes[i, j].imshow(mats[w], vmin=vmin, vmax=vmax, cmap=cmap)
            axes[i, j].set_title(titles[w], fontsize=8)

    fig.suptitle('First 3 (top row) vs. last 3 (bottom row) hidden weights')
    return fig

This shows the weight matrices for selected layers.
- Each panel in the figure corresponds to a single weight matrix, connecting input features to output features in that layer; each cell is one weight.
  - x axis: input features (units from prev layer)
  - y axis: output features (units in current layer)
- The colour of each cell indicates the magnitude (absolute value) of that weight.

Next, we add it to our training function.

In [None]:
# train function that logs heatmaps of the weights before every batch update

def torch_train_heatmap(model, data,
                        optimizer='Adam', loss_fn='CrossEntropyLoss',
                        batch_size=2**6, epochs=1, shuffle=False, logdir=None,
                        cmap='Reds', **fig_kwargs):

    # distinct names (criterion / opt / loss) so the arguments are never overwritten
    criterion = getattr(nn, loss_fn)()
    opt = getattr(optim, optimizer)(model.parameters())
    dataloader = DataLoader(data, batch_size=batch_size, shuffle=shuffle)

    model.train()

    current_time = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    logdir = '/'.join(['runs', current_time, 'train']) if logdir is None else logdir
    logger = SummaryWriter(logdir)
    start = time.time()

    for epoch in range(epochs):
        epoch_loss = 0

        for batch_idx, (X, Y) in enumerate(tqdm(dataloader)):
            global_step = batch_idx + len(dataloader) * epoch

            # NEW: log a figure of the current weights (add_figure closes it afterwards)
            fig = weight_heatmaps(model, cmap=cmap, **fig_kwargs)
            logger.add_figure('Weights', fig, global_step=global_step)

            opt.zero_grad()

            pred = model(X)
            loss = criterion(pred.squeeze(-1), Y.long())  # labels must be integer class indices

            loss.backward()
            opt.step()

            epoch_loss += loss.item()
            logger.add_scalar('Training loss', loss.item(), global_step)

        logger.add_scalar('Avg epoch loss', epoch_loss / len(dataloader), epoch)
        logger.add_scalar('Total training time', time.time() - start, epoch)

    logger.close()  # flush remaining events to disk
    return model

In [None]:
vanishing_model = torch_train_heatmap(vanishing_model, training_data, epochs=1, logdir='runs/heatmap')

In [None]:
%tensorboard --logdir runs/heatmap

Here the weights of the earlier layers barely move away from their initial values: gradient norms get smaller and smaller on the way back through the network → earlier layers barely update → their weights stay close to initialisation.

## Exercise: Gradient Norm Analysis

Let's quantitatively track gradient magnitudes through the network layers.
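Before measuring this on the lab's models, it can help to see the effect in a network small enough to differentiate by hand. The sketch below is a hypothetical toy, not the lab's architecture: a chain of one-unit sigmoid layers, $a_k = \sigma(w_k a_{k-1})$, with a squared-error loss, where the gradient for each weight is computed directly from the chain rule. All names and values are assumptions made for illustration.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def chain_gradient_magnitudes(weights, x, y):
    """Return |dLoss/dw_k| for every layer of a chain of one-unit sigmoid layers.

    Loss is 0.5 * (a_L - y)^2; index 0 of the result is the first layer.
    """
    # forward pass: store every activation, starting with the input
    activations = [x]
    for w in weights:
        activations.append(sigmoid(w * activations[-1]))

    # backward pass: `upstream` holds dLoss/d(output of the current layer)
    upstream = activations[-1] - y
    magnitudes = [0.0] * len(weights)
    for k in reversed(range(len(weights))):
        a_out = activations[k + 1]
        local = a_out * (1.0 - a_out)          # sigma'(z_k), written via the output
        magnitudes[k] = abs(upstream * local * activations[k])
        upstream *= local * weights[k]         # propagate to the previous layer
    return magnitudes


grads = chain_gradient_magnitudes([1.0] * 10, x=1.0, y=0.0)
for layer, g in enumerate(grads, start=1):
    print(f"layer {layer:2d}: |grad| = {g:.2e}")
```

Each step backwards multiplies the gradient by a factor of at most 0.25, so the magnitudes shrink steadily from the last layer to the first.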


In [None]:
def track_gradient_norms(model, data, epochs=1):
    """Track gradient norms per layer during training."""
    criterion = nn.CrossEntropyLoss()
    opt = optim.Adam(model.parameters())
    dataloader = DataLoader(data, batch_size=64)
    
    layer_names = [name for name, _ in model.named_parameters() if 'weight' in name]
    gradient_history = {name: [] for name in layer_names}
    
    model.train()
    for X, Y in dataloader:
        opt.zero_grad()
        loss = criterion(model(X).squeeze(-1), Y.long())
        loss.backward()
        
        # Record gradient norms
        for name, param in model.named_parameters():
            if 'weight' in name and param.grad is not None:
                gradient_history[name].append(param.grad.norm().item())
        
        opt.step()
        break  # Just one batch for demonstration
    
    return gradient_history

# Compare sigmoid vs ReLU gradient flow
sigmoid_model = LazySequential(28*28, 10, [64]*10, 'Sigmoid')
relu_model = LazySequential(28*28, 10, [64]*10, 'ReLU')

sigmoid_grads = track_gradient_norms(sigmoid_model, training_data)
relu_grads = track_gradient_norms(relu_model, training_data)

# Plot comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# named_parameters() yields layers in network order and dicts preserve insertion order,
# so no sorting is needed (sorting the string keys would put '11.weight' before '3.weight')
sigmoid_norms = [norms[0] for norms in sigmoid_grads.values()]
relu_norms = [norms[0] for norms in relu_grads.values()]

axes[0].bar(range(len(sigmoid_norms)), sigmoid_norms, alpha=0.7)
axes[0].set_xlabel('Layer')
axes[0].set_ylabel('Gradient Norm')
axes[0].set_title('Sigmoid: Gradient Norms by Layer')
axes[0].set_yscale('log')

axes[1].bar(range(len(relu_norms)), relu_norms, alpha=0.7, color='orange')
axes[1].set_xlabel('Layer')
axes[1].set_ylabel('Gradient Norm')
axes[1].set_title('ReLU: Gradient Norms by Layer')

plt.tight_layout()
plt.show()

print(f"Sigmoid gradient range: {min(sigmoid_norms):.2e} to {max(sigmoid_norms):.2e}")
print(f"ReLU gradient range: {min(relu_norms):.2e} to {max(relu_norms):.2e}")


# Hyperparameter Tuning
Hyperparameter tuning is the process of systematically searching for the best set of hyperparameters (e.g. number of layers, size of the layers, learning rate, batch size, dropout rate) of an ML model. There are four common methods of hyperparameter optimisation:

*   Manual
*   Grid search
*   Random search
*   Bayesian search

## Hyperparameter Tuning Strategies

| Method | Description | Pros | Cons |
|--------|-------------|------|------|
| **Manual** | Trial and error | Intuitive | Time-consuming, biased |
| **Grid Search** | Exhaustive search over parameter grid | Thorough | Exponential cost in dimensions |
| **Random Search** | Random sampling from parameter space | Often better than grid | No learning from past trials |
| **Bayesian (Optuna)** | Model-based optimisation | Sample-efficient | More complex |

<details>
<summary><b>🤔 Question:</b> Why might random search outperform grid search in high dimensions?</summary>

**Answer:** In high-dimensional spaces, often only a few hyperparameters have a strong effect on performance. A grid with $k$ values per axis spends its whole budget on just $k$ distinct values of each hyperparameter, while random search draws a new value of every hyperparameter in every trial, so it explores many more values of the ones that matter. This is known as the "low effective dimensionality" argument (Bergstra & Bengio, 2012).
</details>
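The argument can be illustrated with a fixed budget of 9 trials over two hyperparameters. The sketch below is a toy with made-up value ranges: it only counts how many distinct values of one hyperparameter each strategy gets to try, without training any model.

```python
import random


def distinct_values_tested(points, axis: int) -> int:
    """Count how many different values of one hyperparameter a set of trials covers."""
    return len({p[axis] for p in points})


budget = 9

# Grid search: 3 values per hyperparameter -> 3 x 3 = 9 trials
grid_values = [0.1, 0.5, 0.9]
grid_points = [(a, b) for a in grid_values for b in grid_values]

# Random search: 9 trials, each hyperparameter drawn uniformly from [0, 1)
rng = random.Random(0)  # seeded for reproducibility
random_points = [(rng.random(), rng.random()) for _ in range(budget)]

print("grid  :", distinct_values_tested(grid_points, axis=0), "distinct values of hyperparameter 0")
print("random:", distinct_values_tested(random_points, axis=0), "distinct values of hyperparameter 0")
```

If hyperparameter 0 is the one that matters, the grid has effectively run only 3 informative experiments, while random search has run 9.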


In [None]:
%pip install optuna torcheval --quiet

### Optuna
Optuna is an open-source hyperparameter optimisation framework -> automates the process of finding the best set of hyperparameters. Optuna systematically searches through a defined space of possible values to find the combination that yields the best performance (e.g., highest accuracy, lowest loss) on our model.

- **Study**: In Optuna, a "study" represents an optimisation session. Think of a study as a container that manages the entire hyperparameter tuning process.
    - We specify the optimisation direction.
        - 'maximize' for metrics like accuracy,
        -  'minimize' for metrics like loss.

- **Trial**: a single run of our model with a specific set of hyperparameters suggested by Optuna.
    - The objective function defines what happens in each trial (w/ different parameters).
    - Optuna calls this function repeatedly, each time providing a trial object.

- **Objective Function**: This is the function that Optuna optimises.
    - takes a trial object as input
    - returns the metric we want to optimise (e.g., test accuracy).
    - Inside the objective function:
        - we use trial object to suggest values for the hyperparameters you want to tune. Optuna uses these suggestions to explore the hyperparameter space.
        - we build and train our model using the suggested hyperparameters.
        - we evaluate our model's performance using the chosen metric.
        - we return the evaluated metric.
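The study / trial / objective relationship can be sketched without any library. The code below is a hypothetical stand-in, not Optuna's API or its sampling algorithm: `ToyTrial`, `toy_objective` and `toy_study` are names invented here, the "study" simply samples at random, and the score is a made-up function rather than a trained model's accuracy. It only shows the calling pattern: the study creates a trial, the objective asks the trial for values, and the returned metric is compared across trials.

```python
import random


class ToyTrial:
    """Hypothetical stand-in for a trial object: records every value it suggests."""

    def __init__(self, rng: random.Random):
        self.rng = rng
        self.params = {}

    def suggest_int(self, name: str, low: int, high: int) -> int:
        value = self.rng.randint(low, high)  # bounds are inclusive
        self.params[name] = value
        return value


def toy_objective(trial: ToyTrial) -> float:
    """Made-up score that peaks at n_units = 64 (no model is trained here)."""
    n_units = trial.suggest_int("n_units", 4, 128)
    return -abs(n_units - 64)


def toy_study(objective, n_trials: int, seed: int = 0):
    """Call the objective once per trial and keep the best result (maximise)."""
    rng = random.Random(seed)
    best_value, best_params = float("-inf"), None
    for _ in range(n_trials):
        trial = ToyTrial(rng)
        value = objective(trial)
        if value > best_value:
            best_value, best_params = value, trial.params
    return best_value, best_params


best_value, best_params = toy_study(toy_objective, n_trials=20)
print("best value:", best_value, "with params:", best_params)
```

The difference in a real Bayesian optimiser is in how the next value is chosen: instead of sampling blindly, it uses the results of earlier trials to decide where to look next.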

In [None]:
import optuna
from torcheval.metrics.functional import multiclass_accuracy
from tqdm import tqdm

# Define an objective function to be maximized.
def objective(trial,
              training_data,
              test_data,
              **train_params):
    """
    Objective function for Optuna hyperparameter tuning.

    Args:
        trial (optuna.Trial): An Optuna trial object.
        training_data (torch.utils.data.Dataset): The training dataset.
        test_data (torch.utils.data.Dataset): The test dataset.
        **train_params: Additional parameters to pass to the torch_train function.

    Returns:
        float: The test accuracy of the model with the suggested hyperparameters.
    """

    # Suggest values of the hyperparameters using a trial object.
    n_layers = trial.suggest_int('n_layers', 1, 3)
        # ^^^ telling optuna to sample this hyperparam from the defined bounds
        # this is the magic: optuna's sampler uses the results of past trials to decide where to sample next
    layers = []

    for i in range(n_layers):
        size = trial.suggest_int(f'n_units_l{i}', 4, 128)
        # ^^^ telling optuna to sample numb of hidden units from our bounds
        layers.append(size)

    # build models
    model = LazySequential(in_feat_size=28*28, out_feat_size=10, hidden_sizes=layers).to(torch.device('cpu'))
    model = torch_train(model, training_data, **train_params)

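    # prep data: evaluate on the whole test set in a single batch
    test_dataloader = DataLoader(test_data, batch_size=len(test_data))

    # compute acc (evaluation mode, no gradients needed)
    model.eval()
    with torch.no_grad():
        X_test, Y_test = next(iter(test_dataloader))
        Y_pred = model(X_test).argmax(dim=-1)
    test_acc = multiclass_accuracy(Y_pred, Y_test)

    return test_acc.item()  # plain Python float, as stated in the docstring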

...with our objective function defined, let's create a study object

In [None]:
# Create a study that maximises the objective (test accuracy)
study = optuna.create_study(direction='maximize')

# Run optimisation (increase n_trials for a more thorough search)
study.optimize(
    lambda trial: objective(trial, training_data, test_data),
    n_trials=15,
    show_progress_bar=True
)


In [None]:
# Visualise optimisation history
import optuna.visualization as vis

# Plot optimisation history
fig1 = vis.plot_optimization_history(study)
fig1.show()

# Plot parameter importances
try:
    fig2 = vis.plot_param_importances(study)
    fig2.show()
except Exception as err:  # e.g. too few completed trials to estimate importances
    print(f"Could not plot parameter importances: {err}")

# Plot parallel coordinate
fig3 = vis.plot_parallel_coordinate(study)
fig3.show()


In [None]:
# Print the best trial's parameters and value
print("Best trial:")
print("  Acc: {}".format(study.best_trial.value))
print("  Params: ")
for key, value in study.best_trial.params.items():
    print("    {}: {}".format(key, value))

# can also access all trials
print("\nAll trials:")
for trial in study.trials:
    print("  Trial {}:".format(trial.number))
    print("    Acc: {}".format(trial.value))
    print("    Params: {}".format(trial.params))

---

# Summary & Key Takeaways

## TensorBoard
- Essential tool for monitoring training progress and diagnosing issues
- Log scalars (loss, accuracy), images, histograms, and custom figures
- Use `SummaryWriter` to create logs, view with `%tensorboard --logdir <path>`

## Vanishing Gradients
- **Cause**: Activation derivatives < 1 compound through layers
- **Effect**: Early layers receive tiny gradients → don't learn
- **Diagnosis**: Track gradient norms per layer
- **Solutions**: ReLU, batch normalisation, skip connections, careful initialisation

## Batch Learning
- **Online (batch=1)**: High variance, many updates, noise can help escape local minima
- **Full-batch**: Low variance, few updates, precise convergence
- **Minibatch (32-256)**: Best of both worlds for most applications

## Hyperparameter Tuning
- Grid search: Exhaustive but expensive
- Random search: Often surprisingly effective
- Bayesian (Optuna): Sample-efficient, learns from past trials

<details>
<summary><b>🤔 Question:</b> When would you prefer manual tuning over automated methods?</summary>

**Answer:** Manual tuning is preferred when:
1. You're exploring a new problem and building intuition
2. Training is very expensive (each trial takes hours/days)
3. You have strong domain knowledge about reasonable parameter ranges
4. You need to understand parameter interactions, not just find "good" values

Automated methods excel for final optimisation once you have a reasonable baseline.
</details>


## Further Reading

- **Vanishing Gradients**: Glorot & Bengio (2010) - *Understanding the difficulty of training deep feedforward neural networks*
- **Initialisation**: He et al. (2015) - *Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification*
- **Batch Normalisation**: Ioffe & Szegedy (2015) - *Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift*
- **Random Search**: Bergstra & Bengio (2012) - *Random Search for Hyper-Parameter Optimization*
- **Optuna**: Akiba et al. (2019) - *Optuna: A Next-generation Hyperparameter Optimization Framework*
