=========================== What Is Hyperparameter Optimization =======================

### Hyperparameter Optimization

This section introduces **hyperparameter optimization** as the process of selecting good values for hyperparameters in machine learning models. Hyperparameters are settings that govern model training or structure but are *not* learned from the training data itself. Unlike model parameters (weights), hyperparameters must be tuned externally.

#### What Are Hyperparameters?

In machine learning, we distinguish between:

- **Model parameters:** learned from data via optimization (e.g., weights in neural networks)
- **Hyperparameters:** set before training and control aspects such as:
  - model complexity (e.g., number of layers, number of hidden units)
  - optimization behavior (e.g., learning rate, batch size)
  - regularization (e.g., weight decay, dropout rate)

Hyperparameters are often denoted as $\lambda$.

#### Hyperparameter Optimization as a Nested Problem

Hyperparameter optimization can be formalized as a nested optimization problem, where we aim to find the hyperparameter setting $\lambda^*$ that minimizes some evaluation metric on held-out (validation) data:

$$
\lambda^*
= \arg\min_{\lambda\in\Lambda}
\mathcal{L}_{\text{val}}\Big(
\big\{\hat{\mathbf{y}}_j(\lambda)\big\}_{j=1}^m,\,\big\{\mathbf{y}_j\big\}_{j=1}^m
\Big)
$$

- $\Lambda$: hyperparameter search space
- $\hat{\mathbf{y}}_j(\lambda)$: model predictions with hyperparameter $\lambda$
- $\mathbf{y}_j$: true validation labels
- $\mathcal{L}_{\text{val}}$: validation loss
- The model is first trained on training data for each $\lambda$, then evaluated on validation data.

This nested structure makes hyperparameter tuning computationally expensive because each evaluation of the validation loss requires training a model.
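
To make the nested structure concrete, here is a minimal sketch that is not part of the original notebook. It uses a one-dimensional ridge regression whose regularization strength plays the role of $\lambda$. The helpers `make_data`, `train`, and `validation_loss` are hypothetical names chosen for illustration, and the inner problem is solved in closed form so the example runs instantly.

```python
import random

def make_data(n, seed):
    # Hypothetical toy data: y = 2x + Gaussian noise
    rng = random.Random(seed)
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0, 0.1) for x in xs]
    return xs, ys

def train(xs, ys, lam):
    # Inner problem: 1-D ridge regression, w = sum(x*y) / (sum(x^2) + lam)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def validation_loss(w, xs, ys):
    # Mean squared error of the trained model on held-out data
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

train_x, train_y = make_data(50, seed=0)
val_x, val_y = make_data(50, seed=1)

# Outer problem: each candidate lambda triggers one full (inner) training run
candidates = [0.0, 0.1, 1.0, 10.0, 100.0]
losses = {lam: validation_loss(train(train_x, train_y, lam), val_x, val_y)
          for lam in candidates}
best_lam = min(losses, key=losses.get)
print(f"best lambda among candidates = {best_lam}")
```

Every entry in `losses` costs one call to `train`, which is cheap here but corresponds to a full training run for a neural network.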

#### Role of Training and Validation

In practice:
- **Training set:** used to update model parameters given a specific $\lambda$
- **Validation set:** used to evaluate the performance of the resulting model
- **Test set:** used only after hyperparameter tuning to estimate generalization performance

Model selection (choosing $\lambda$) is based on performance on validation data, not training data.
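
As a rough illustration of the three roles, the sketch below partitions example indices into disjoint training, validation, and test subsets. The function name `split_indices` and the 60/20/20 proportions are assumptions made for this example, not something prescribed by the text.

```python
import random

def split_indices(n, val_frac=0.2, test_frac=0.2, seed=0):
    # Shuffle once, then carve out disjoint test, validation, and training parts
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return train, val, test

train_idx, val_idx, test_idx = split_indices(100)
print(len(train_idx), len(val_idx), len(test_idx))
```

Hyperparameters would be compared using only `val_idx`; `test_idx` stays untouched until the final configuration has been chosen.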

#### Simple Optimization Methods

##### 1. Grid Search

- Define a discrete grid of possible hyperparameter values.
- Evaluate every combination by training a model and computing validation loss.
- Choose the combination with the best validation performance.

Pros:
- Simple to implement

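Cons:
- Computationally inefficient: every combination requires training a full model
- Scales poorly: the number of combinations grows exponentially with the number of hyperparameters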
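
The grid search steps can be sketched as follows. The function `toy_validation_error` is a hypothetical stand-in for "train a model and compute its validation error", and the grid values are arbitrary choices for illustration.

```python
import itertools
import math

def toy_validation_error(config):
    # Hypothetical stand-in for training plus validation.
    # By construction it is lowest near learning_rate = 1e-2, batch_size = 64.
    lr_term = (math.log10(config["learning_rate"]) + 2) ** 2
    bs_term = 0.01 * abs(config["batch_size"] - 64) / 64
    return lr_term + bs_term

grid = {
    "learning_rate": [1e-4, 1e-3, 1e-2, 1e-1],
    "batch_size": [16, 64, 256],
}

names = list(grid)
results = []
# Evaluate every combination: 4 * 3 = 12 evaluations
for combo in itertools.product(*(grid[name] for name in names)):
    config = dict(zip(names, combo))
    results.append((toy_validation_error(config), config))

best_error, best_config = min(results, key=lambda r: r[0])
print(f"{len(results)} evaluations, best config = {best_config}")
```

Adding a third hyperparameter with five values would raise the count from 12 to 60 evaluations, which is the scalability problem listed above.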

##### 2. Random Search

- Sample hyperparameter configurations randomly from a distribution or range.
- Evaluate validation performance for each sample.

Advantages over grid search:
- More efficient coverage of high-dimensional spaces
- Often finds good settings with fewer evaluations
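
A minimal random search sketch, using only the standard library, might look like this. The helper `sample_loguniform`, the toy objective, and the ranges are assumptions for illustration; the learning rate is sampled on a log scale, while momentum is sampled uniformly.

```python
import math
import random

def sample_loguniform(rng, low, high):
    # Uniform in log space, so each order of magnitude is equally likely
    return math.exp(rng.uniform(math.log(low), math.log(high)))

def toy_validation_error(config):
    # Hypothetical stand-in for training plus validation
    lr_term = (math.log10(config["learning_rate"]) + 2) ** 2
    momentum_term = 0.1 * (config["momentum"] - 0.9) ** 2
    return lr_term + momentum_term

rng = random.Random(0)
trials = []
for _ in range(20):
    config = {
        "learning_rate": sample_loguniform(rng, 1e-4, 1.0),
        "momentum": rng.uniform(0.0, 0.99),
    }
    trials.append((toy_validation_error(config), config))

best_error, best_config = min(trials, key=lambda t: t[0])
print(f"best config after {len(trials)} trials: {best_config}")
```

The number of trials is chosen freely here, independent of how many hyperparameters are searched.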

#### Why Random Search Works Better

Empirical observation:
- Some hyperparameters matter more than others
- Grid search wastes evaluations on unimportant dimensions
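- Random search draws a new value for every hyperparameter in each trial, so the important dimensions are probed at more distinct values for the same budget

Example: tuning learning rate vs. momentum. Sampling randomly can find effective learning rates faster than a full grid over both, because a 3 × 3 grid tests only 3 distinct learning rates in 9 trials, whereas 9 random samples test 9.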
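
The following sketch counts how many distinct learning rates each strategy tries under the same budget of 9 trials. The grid values and sampling ranges are illustrative assumptions.

```python
import itertools
import random

budget = 9

# Grid search: 3 learning rates x 3 momentum values = 9 trials
lr_grid = [1e-3, 1e-2, 1e-1]
momentum_grid = [0.0, 0.5, 0.9]
grid_trials = list(itertools.product(lr_grid, momentum_grid))

# Random search: 9 independent draws of both hyperparameters
rng = random.Random(0)
random_trials = [(10 ** rng.uniform(-3, -1), rng.uniform(0.0, 0.9))
                 for _ in range(budget)]

distinct_lr_grid = len({lr for lr, _ in grid_trials})
distinct_lr_random = len({lr for lr, _ in random_trials})
print(f"grid: {distinct_lr_grid} distinct learning rates, "
      f"random: {distinct_lr_random}")
```

If momentum turned out to have little effect, the grid would effectively have spent 9 trainings on 3 informative experiments.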

#### General Workflow

A typical hyperparameter optimization loop:

1. **Choose a search strategy:** grid, random, Bayesian, etc.
2. **Sample hyperparameters** from the search space
3. **Train the model** with these hyperparameters
4. **Evaluate on validation data**
5. **Record results**
6. **Select the best configuration**

This loop may be parallelized to utilize multiple computational resources.
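
The loop and its parallelization can be sketched as below. The functions `sample_config` and `train_and_validate` are hypothetical placeholders. A thread pool is used only to keep the example short; real training jobs would more likely run in separate processes or on separate machines.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor

def sample_config(rng):
    # Step 2: sample hyperparameters (log-uniform learning rate in [1e-4, 1])
    return {"learning_rate": 10 ** rng.uniform(-4, 0)}

def train_and_validate(config):
    # Steps 3 and 4: hypothetical stand-in for training and validation
    return (math.log10(config["learning_rate"]) + 1) ** 2

rng = random.Random(0)
configs = [sample_config(rng) for _ in range(8)]

# Trials are independent, so they can be evaluated concurrently
with ThreadPoolExecutor(max_workers=4) as pool:
    errors = list(pool.map(train_and_validate, configs))

# Step 5: record results; step 6: select the best configuration
history = list(zip(configs, errors))
best_config, best_error = min(history, key=lambda r: r[1])
print(f"best config = {best_config}, validation error = {best_error:.4f}")
```

This works for grid and random search because no trial depends on the outcome of another; sequential strategies such as Bayesian optimization need more care to parallelize.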

#### Challenges

- **High cost:** each evaluation trains a full model
- **Search space design:** ranges and distributions matter
- **Interactions between hyperparameters:** complex dependencies
- **Stochastic training dynamics:** results may vary across runs
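
One common way to cope with run-to-run variation is to repeat each configuration with several random seeds and compare the averages. The sketch below assumes a hypothetical noisy objective, `noisy_validation_error`, in place of real training.

```python
import random
import statistics

def noisy_validation_error(config, seed):
    # Hypothetical stand-in: a deterministic part plus seed-dependent noise
    rng = random.Random(seed)
    base = (config["learning_rate"] - 0.1) ** 2
    return base + rng.gauss(0, 0.01)

def mean_error(config, seeds=(0, 1, 2, 3, 4)):
    # Average over seeds and report the spread alongside the mean
    runs = [noisy_validation_error(config, s) for s in seeds]
    return statistics.mean(runs), statistics.stdev(runs)

mean, std = mean_error({"learning_rate": 0.1})
print(f"validation error = {mean:.4f} +/- {std:.4f}")
```

Averaging multiplies the cost per configuration by the number of seeds, so it trades search breadth for more reliable comparisons.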

#### Summary

- **Hyperparameter optimization** is choosing settings that minimize validation error.
- It is a **nested optimization**, often expensive and time-consuming.
- **Grid search** and **random search** are basic strategies.
- Random search scales better in higher dimensions.
- The search strategy and definition of the search space are critical to success.

In [None]:
import numpy as np
import torch
from scipy import stats
from torch import nn
from d2l import torch as d2l

In [None]:
class HPOTrainer(d2l.Trainer):  #@save
    def validation_error(self):
        """Return 1 minus the mean per-batch validation accuracy."""
        self.model.eval()  # Switch to evaluation mode
        accuracy = 0
        val_batch_idx = 0
        for batch in self.val_dataloader:
            with torch.no_grad():  # No gradients are needed for evaluation
                x, y = self.prepare_batch(batch)
                y_hat = self.model(x)
                accuracy += self.model.accuracy(y_hat, y)
            val_batch_idx += 1
        return 1 - accuracy / val_batch_idx

In [None]:
def hpo_objective_softmax_classification(config, max_epochs=8):
    """Train with the given config and return the validation error."""
    learning_rate = config["learning_rate"]
    trainer = d2l.HPOTrainer(max_epochs=max_epochs)
    data = d2l.FashionMNIST(batch_size=16)
    model = d2l.SoftmaxRegression(num_outputs=10, lr=learning_rate)
    trainer.fit(model=model, data=data)
    # Convert the scalar tensor to a NumPy value for later comparison
    return trainer.validation_error().detach().numpy()

In [None]:
config_space = {"learning_rate": stats.loguniform(1e-4, 1)}

In [None]:
errors, values = [], []
num_iterations = 5

for i in range(num_iterations):
    learning_rate = config_space["learning_rate"].rvs()
    print(f"Trial {i}: learning_rate = {learning_rate}")
    y = hpo_objective_softmax_classification({"learning_rate": learning_rate})
    print(f"    validation_error = {y}")
    values.append(learning_rate)
    errors.append(y)

In [None]:
best_idx = np.argmin(errors)
print(f"optimal learning rate = {values[best_idx]}")