<a href="https://www.kaggle.com/code/mrafraim/dl-day-18-hyperparameters-in-dl?scriptVersionId=288422925" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Day 18: Hyperparameters in DL

Welcome to Day 18!

Today you'll learn:
- The key hyperparameters: **learning rate, epochs, batch size**
- How hyperparameters affect training and generalization
- The intuition behind grid search for hyperparameter tuning



---

# What are Hyperparameters?

- Hyperparameters are configuration values set before training; unlike weights, they are not learned from data.  
- They control how the model learns, how fast it converges, and how well it generalizes.

Key hyperparameters we will explore:
1. **Learning Rate (LR)**: Step size for gradient updates  
2. **Epochs**: Number of complete passes through the training data  
3. **Batch Size**: Number of samples processed before each gradient update


# Learning Rate (LR)

The learning rate controls how big a step the optimizer takes when updating model parameters during training.  
It is often the single most sensitive hyperparameter in deep learning: get it wrong and nothing else matters.


## What Learning Rate Actually Does

At each training step, parameters are updated as:

$$
\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)
$$

Where:
- $\theta$ = model parameters  
- $\eta$ = learning rate  
- $\nabla L(\theta)$ = gradient of the loss  

LR scales the gradient. It does not change direction, only *how far* you move.
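
As a small worked example (the loss $L(\theta) = \theta^2$ and the starting values are assumptions chosen purely for illustration), one update with $\theta_0 = 5$ and $\eta = 0.1$ looks like this:

$$
\nabla L(\theta_0) = 2\theta_0 = 10, \qquad \theta_1 = \theta_0 - \eta \nabla L(\theta_0) = 5 - 0.1 \times 10 = 4
$$

Doubling $\eta$ to $0.2$ would give $\theta_1 = 3$: same direction, twice the distance.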


## Too Large Learning Rate

**Symptoms**
- Loss oscillates wildly
- Loss increases instead of decreasing
- NaN or Inf values appear
- Validation loss explodes early

**Why It Happens**
- Updates overshoot the minimum
- Parameters jump back and forth across the loss valley
- Optimizer never settles

**Real-World Impact**
- Training becomes unstable
- Model can fail quietly (with Adam, the loss may stall at a poor value instead of blowing up)
- Wasted compute with no learning


## Too Small Learning Rate

**Symptoms**
- Loss decreases very slowly
- Training appears “stuck”
- Validation loss plateaus early

**Why It Happens**
- Updates are too tiny to escape flat regions
- Requires excessive epochs
- Can get trapped in sharp local minima or saddle points

**Real-World Impact**
- Overfitting risk (long training)
- Inefficient GPU usage
- False belief that model architecture is bad

## Sweet Spot: The Goldilocks Zone

A good learning rate:
- Decreases training loss smoothly
- Validation loss follows with a small gap
- No large oscillations
- Reaches convergence in reasonable epochs
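
The three regimes (too small, reasonable, too large) can be seen on a toy problem. The sketch below is only an illustration under stated assumptions: the loss $L(\theta) = \theta^2$, the starting point `theta = 5.0`, the 20-step budget, and the three learning rates are all made up for the demo, and real networks will not have such clean thresholds.

```python
def gradient_descent(lr, steps=20, theta=5.0):
    """Minimize the toy loss L(theta) = theta**2 with plain gradient descent."""
    for _ in range(steps):
        grad = 2 * theta            # dL/dtheta for L = theta**2
        theta = theta - lr * grad   # theta_{t+1} = theta_t - lr * grad
    return theta

# Hypothetical learning rates picked to show each regime on this toy loss
for lr in (0.01, 0.1, 1.1):
    final_theta = gradient_descent(lr)
    print(f"lr={lr}: theta after 20 steps = {final_theta:.4f}")
```

On this loss each step multiplies `theta` by `(1 - 2 * lr)`, so a tiny rate barely moves, a moderate rate shrinks `theta` toward the minimum at 0, and a rate above 1 makes the magnitude grow every step.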

## Rule of Thumb (Practitioner Defaults)

| Model Type | Typical LR |
|------------|------------|
| Linear / Logistic Regression | `0.01 – 0.1` |
| Neural Networks (SGD) | `0.001 – 0.01` |
| Neural Networks (Adam) | `0.0001 – 0.001` |
| Fine-tuning Pretrained Models | `1e-5 – 1e-4` |

> Start large enough to learn, small enough to survive.

# Epoch

An epoch is one complete pass of the entire training dataset through the model. That’s it.

But that definition alone is shallow. The meaning of an epoch comes from what actually happens during it.

## What Actually Happens Inside One Epoch

Assume:

* Dataset size = 1,000 samples
* Batch size = 100

Inside 1 epoch:

* The data is split into 10 batches
* For each batch:

  1. Forward pass
  2. Loss calculation
  3. Backpropagation
  4. Weight update

So:

1 epoch = 10 gradient updates

General formula:

$$
\text{Updates per epoch} = \frac{\text{Number of samples}}{\text{Batch size}}
$$

This is the real operational meaning.
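
The formula above assumes the dataset divides evenly by the batch size. A minimal sketch of the general case follows; the helper name `updates_per_epoch` is hypothetical, and its `drop_last` flag mirrors the PyTorch `DataLoader` option of the same name, which discards a final incomplete batch.

```python
import math

def updates_per_epoch(num_samples, batch_size, drop_last=False):
    """Number of weight updates in one epoch (one update per batch)."""
    if drop_last:
        # The final incomplete batch is thrown away
        return num_samples // batch_size
    # The final incomplete batch still triggers an update
    return math.ceil(num_samples / batch_size)

print(updates_per_epoch(1000, 100))                  # even split
print(updates_per_epoch(200, 16))                    # last batch has only 8 samples
print(updates_per_epoch(200, 16, drop_last=True))    # that partial batch is dropped
```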

## Why Epoch ≠ Learning

A common beginner mistake:

> “More epochs means better learning”

<b style="color:red;">Wrong!</b>

Epochs only define how many times the model sees the data, not how well it understands it.

Learning quality depends on:

* Learning rate
* Batch size
* Model capacity
* Data quality

Epochs just control exposure count.

## Why We Need Multiple Epochs

On the first epoch:

* Weights are random
* Gradients are large and chaotic
* The model captures only coarse patterns

Each additional epoch:

* Refines parameters
* Reduces loss
* Fits finer structure

But after a point:

* The model starts fitting noise
* Validation performance degrades

That’s where overfitting begins.


## Epoch vs Iteration

| Term          | Meaning                       |
| ------------- | ----------------------------- |
| Iteration | One batch → one weight update |
| Epoch     | All batches processed once    |

Relationship:

$$
\text{Iterations} = \text{Epochs} \times \frac{N}{\text{Batch Size}}
$$

Professionals think in iterations, not epochs.


## Why Epoch Count is a Weak Hyperparameter

In real systems:

* Dataset sizes vary
* Batch sizes change
* Distributed training alters update frequency

So:

* “Train for 50 epochs” is meaningless without context
* “Train for 100k updates” is precise

Epochs are a human convenience, not a fundamental unit.

## Industry Reality

What actually happens in production-grade training:

* Epoch count is rarely fixed
* Training stops based on:

  * Validation loss plateau
  * Early stopping
  * Budget constraints

* Epochs are used only for:

  * Logging
  * Checkpointing
  * Monitoring progress
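
Stopping on a validation plateau is usually implemented with a patience counter. The following is a minimal sketch, not a reference implementation: the function name, the `patience` and `min_delta` parameters, and the list of validation losses are all hypothetical values chosen for illustration.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation losses.

    Training stops once the loss has failed to improve by more than
    min_delta for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch
            bad_epochs = 0                    # improvement resets the counter
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch, best_epoch      # plateau detected: stop here
    return len(val_losses) - 1, best_epoch    # never triggered: ran all epochs

# Made-up validation curve: improves, then drifts upward (overfitting)
history = [1.0, 0.6, 0.4, 0.35, 0.36, 0.37, 0.38, 0.50]
stop_epoch, best_epoch = early_stopping_epoch(history, patience=3)
print(f"stop at epoch {stop_epoch}, restore checkpoint from epoch {best_epoch}")
```

In practice the weights from the best epoch are usually restored from a checkpoint, which is one reason epochs remain useful as checkpointing boundaries.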


## Mental Model

> Epoch = one full opportunity for the model to correct itself using the entire dataset.

But:

* Too few opportunities → underfitting
* Too many → memorization

An epoch is one full pass over the training data, resulting in multiple weight updates. Its value lies not in the number itself, but in how it interacts with batch size, learning rate, and stopping criteria.


# Batch Size

Batch size is the number of training samples used to compute one gradient update.

One batch → one forward pass + one backward pass + one weight update.

That’s the atomic unit of learning.

## What Batch Size Really Controls

Batch size does not control speed alone. It controls how noisy your learning signal is.

Mathematically, each mini-batch gradient is an estimate of the true gradient over the full dataset:

$$
\nabla \mathcal{L}_{\text{batch}} \approx \nabla \mathcal{L}_{\text{data}}
$$

Batch size determines the quality of this approximation.
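
One way to see this is to measure how much the mini-batch gradient scatters around the full-data gradient. The sketch below is an illustration under assumptions: the synthetic data (`y = 3x + noise`), the one-parameter model `y_hat = w * x`, the batch sizes, and the 500-trial count are all invented for the demo.

```python
import random
import statistics

rng = random.Random(0)

# Synthetic dataset (assumed for illustration): y = 3x + Gaussian noise
data = []
for _ in range(1000):
    x = rng.gauss(0, 1)
    data.append((x, 3 * x + rng.gauss(0, 0.5)))

def mse_gradient(w, rows):
    """Gradient of the mean squared error w.r.t. w for the model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in rows) / len(rows)

w = 0.0
full_gradient = mse_gradient(w, data)   # gradient over the entire dataset

def gradient_spread(batch_size, trials=500):
    """Standard deviation of the mini-batch gradient across random batches."""
    estimates = [mse_gradient(w, rng.sample(data, batch_size)) for _ in range(trials)]
    return statistics.stdev(estimates)

spreads = {b: gradient_spread(b) for b in (8, 32, 128)}
print(f"full-data gradient: {full_gradient:.3f}")
for batch_size, spread in spreads.items():
    print(f"batch size {batch_size:>3}: gradient std = {spread:.3f}")
```

The spread shrinks as the batch grows, roughly with the square root of the batch size, so quadrupling the batch only about halves the noise.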

## Small vs Large Batch: The Real Difference

### Small Batch (e.g. 8–64)

**Behavior**

* Gradient changes direction a lot  
  → because each update sees only a small, incomplete view of the data  
* Loss curve looks jumpy  
  → sometimes up, sometimes down, even if learning is happening  
* Many updates per epoch  
  → weights are adjusted very frequently  

**Why it often generalizes better**

* The randomness forces the model to learn patterns that work across many batches, not just one  
* The model can’t perfectly memorize training data  
  → this naturally reduces overfitting  
* The model settles into wide, stable solutions  
  → small input changes don’t break performance  

**Cost**

* Training takes longer in wall-clock time  
  → more sequential updates per epoch, each with its own overhead  
* GPU is not fully utilized  
  → each small batch leaves much of the hardware idle  
* If learning rate is high, updates can overshoot  
  → training becomes unstable or diverges  

### Large Batch (e.g. 512–8192)

**Behavior**

* Gradient direction is consistent  
  → each update sees a more complete picture of the data  
* Loss curve is smooth  
  → steady decrease with fewer sudden jumps  
* Fewer updates per epoch  
  → weights change less frequently  

**Why it can generalize worse**

* Lack of randomness lets the model lock onto very specific solutions
* These solutions fit training data well but break on new data  
* The model becomes sensitive  
  → small input changes can hurt predictions  

**Benefit**

* Much faster per epoch  
  → fewer updates, more parallel computation  
* Excellent GPU utilization  
  → hardware works at full capacity  
* Necessary for huge datasets  
  → small batches would take impractically long  

## Batch Size vs Epoch vs Iteration

Assume:

* Dataset = 10,000 samples
* Batch size = 100

Then:

* Batches per epoch = 100
* Updates per epoch = 100
* 1 update = 1 batch

| Term       | Meaning                    |
| ---------- | -------------------------- |
| Batch      | Chunk of data              |
| Batch size | Samples per batch          |
| Iteration  | One batch processed        |
| Epoch      | All batches processed once |

## The Hidden Equation Most People Ignore

Total learning depends on number of updates, not epochs:

$$
\text{Total Updates} = \text{Iterations} = \frac{N}{\text{Batch Size}} \times \text{Epochs}
$$

Change batch size → you change how many times weights are updated.

That’s why changing batch size without adjusting epochs or learning rate can break training.
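
To make the effect concrete, here is a small sketch using the 10,000-sample dataset from this section. The 10-epoch budget, the batch sizes, and both helper names are assumptions for illustration.

```python
import math

def total_updates(num_samples, batch_size, epochs):
    """Total weight updates = batches per epoch * epochs (partial last batch kept)."""
    return math.ceil(num_samples / batch_size) * epochs

def epochs_to_match(target_updates, num_samples, batch_size):
    """Epochs needed to reach at least target_updates weight updates."""
    per_epoch = math.ceil(num_samples / batch_size)
    return math.ceil(target_updates / per_epoch)

baseline = total_updates(10_000, batch_size=100, epochs=10)
for batch_size in (50, 100, 1000):
    updates = total_updates(10_000, batch_size, epochs=10)
    needed = epochs_to_match(baseline, 10_000, batch_size)
    print(f"batch={batch_size:>4}: {updates:>5} updates in 10 epochs, "
          f"{needed:>3} epochs to match the baseline")
```

Matching the update count is only one way to compensate; raising the learning rate for larger batches is the other common lever.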

## Learning Rate Coupling

Batch size and learning rate are coupled.

**Linear scaling heuristic:**

* Double batch size → double learning rate

Why?

* Larger batch = more confident gradient
* Needs a larger step to stay efficient

Ignore this → either slow convergence or divergence.
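
The heuristic is simple enough to write down directly. In this sketch the function name and the base values (`lr = 0.01` at batch size 16, matching the experiment later in this notebook) are illustrative assumptions, and the result should be treated as a starting point to validate rather than a guarantee.

```python
def scale_learning_rate(base_lr, base_batch_size, new_batch_size):
    """Linear scaling heuristic: LR grows in proportion to batch size."""
    return base_lr * new_batch_size / base_batch_size

base_lr, base_batch_size = 0.01, 16   # assumed reference configuration
for new_batch_size in (16, 32, 64, 128):
    lr = scale_learning_rate(base_lr, base_batch_size, new_batch_size)
    print(f"batch size {new_batch_size:>3} -> suggested lr {lr:g}")
```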

## Why “Bigger Batch = Better” Is False

### Large Batch

* Optimizes faster

  → each update uses lots of data, so the direction is confident  
  → loss drops smoothly and quickly  

* Learns narrower solutions

  → the model settles into very specific parameter settings  
  → works extremely well on training data  
  → small changes in input or data distribution can hurt performance  


### Small Batch

* Optimizes slower

  → each update is based on limited data  
  → progress looks messy and takes more time  

* Learns more robust solutions

  → randomness forces the model to perform well across many different mini-samples  
  → solutions are tolerant to noise and unseen data  
  → better real-world performance  

### Generalization Lives in Noise

* Noise prevents the model from becoming overconfident  
* Noise pushes the model away from fragile solutions  
* Noise acts like built-in regularization  

> A model that learns smoothly is not always a model that learns **well**.

## Industry Defaults

| Task                     | Typical Batch        |
| ------------------------ | -------------------- |
| Tabular ML               | 32–128               |
| CNN (vision)             | 64–256               |
| Transformers fine-tuning | 8–64                 |
| Pretraining LLMs         | 2k–32k (distributed) |

Context decides, not dogma.

## Mental Model

> Batch size = how many examples you trust before changing your mind.

Small batch → “I update my belief often, even if noisy”

Large batch → “I wait for more evidence before updating”

Batch size is the number of samples used per weight update, controlling gradient noise, update frequency, generalization behavior, and training efficiency, not just speed.

# Sample Experiment 

Experiment by changing the learning rate, batch size, and number of epochs, and observe how model performance responds.

Recommended:

- Keep the experiment small (like 200 samples) for speed
- Change one hyperparameter at a time to isolate effect
- Track training loss vs. validation loss for each setting
- Optional: Make plots for each change to visually see the impact

```python
# Import libraries

import torch                                           # Import core PyTorch library
import torch.nn as nn                                  # Import neural network modules (layers, loss functions)
from torch.utils.data import DataLoader, TensorDataset # Utilities for batching and dataset handling
```


```python
# Simple Dataset

torch.manual_seed(42)                           # Same sequence of random numbers every time
X = torch.randn(200, 2)                         # 200 samples and 2 features
y = X[:,0]*2 + X[:,1]*-3 + torch.randn(200)*0.5 # Linear function + noise
y = y.unsqueeze(1)                              # Reshape target to (200, 1) to match model output shape
```

```python
# Dataset & Dataloader

dataset = TensorDataset(X, y)                             # Combines inputs and targets into a PyTorch dataset
loader = DataLoader(dataset, batch_size=16, shuffle=True) # DataLoader handles batching and shuffling automatically
```

```python
# Model

model = nn.Linear(2, 1)                                    # Simple linear regression model: 2 input features, 1 output
criterion = nn.MSELoss()                                   # Mean squared error loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # Plain SGD with learning rate 0.01
```

```python
# Training Loop

epochs = 50

for epoch in range(epochs):

    model.train()                                # Set model to training mode
    running_loss = 0.0

    for xb, yb in loader:

        optimizer.zero_grad()                     # Clear old gradients
        preds = model(xb)                         # Forward pass: predictions
        loss = criterion(preds, yb)               # Mean loss over this batch
        loss.backward()                           # Backward pass: compute gradients
        optimizer.step()                          # Update weights

        running_loss += loss.item() * xb.size(0)  # Batch mean loss -> summed loss over samples

    running_loss /= len(loader.dataset)           # Average loss per sample for this epoch

    if epoch % 10 == 0:                           # Print progress every 10 epochs
        print(f"Epoch {epoch}: Loss = {running_loss:.4f}")
```

Output:

```text
Epoch 0: Loss = 6.6718
Epoch 10: Loss = 0.3366
Epoch 20: Loss = 0.2362
Epoch 30: Loss = 0.2341
Epoch 40: Loss = 0.2338
```


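The loss decreases rapidly in the first few epochs, indicating that the model is quickly capturing the main patterns in the data. After about epoch 20, the loss plateaus, suggesting convergence. The plateau near 0.23 is close to the variance of the injected noise ($0.5^2 = 0.25$), which is roughly as low as the MSE can go on this data. Minimal improvement beyond this point indicates that further training is unlikely to yield significant gains, so early stopping could be considered. Note that this run tracks training loss only; a proper stopping decision should be based on a held-out validation set.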

# Grid Search for Hyperparameters

- **Goal:** Find the combination of hyperparameters that minimizes validation loss
- **Approach:**
    1. Define a set of candidate values for each hyperparameter  
        - e.g., LR = [0.001, 0.01, 0.1], Batch Size = [16,32,64], Epochs = [20,50]  
    2. Train a model for every combination  
    3. Evaluate on validation set  
    4. Pick the combination with lowest validation error

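- **Tip:**  
    - Grid search is exhaustive → the number of runs multiplies with every hyperparameter you add  
    - Alternatives: random search or Bayesian optimization (learning rate schedulers complement the search by varying the LR within a run; they do not replace it)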
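
The four steps above can be sketched end to end. To stay dependency-free, this example uses a hand-written mini-batch SGD in plain Python rather than PyTorch, so it is a simplified stand-in for the notebook's training loop, not a copy of it. The data-generating process mirrors the experiment ($y = 2x_1 - 3x_2$ plus noise with standard deviation 0.5); the 160/40 train/validation split, the seeds, and all function names are assumptions.

```python
import itertools
import random

def make_data(n=200, seed=42):
    """Synthetic regression data: y = 2*x1 - 3*x2 + noise (std 0.5)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
        rows.append((x1, x2, 2 * x1 - 3 * x2 + rng.gauss(0, 0.5)))
    return rows

def mse(params, rows):
    """Mean squared error of the linear model on the given rows."""
    w1, w2, b = params
    return sum((w1 * x1 + w2 * x2 + b - y) ** 2 for x1, x2, y in rows) / len(rows)

def train(train_rows, lr, batch_size, epochs, seed=0):
    """Mini-batch SGD on a 2-feature linear model; returns (w1, w2, b)."""
    rng = random.Random(seed)
    w1 = w2 = b = 0.0
    rows = list(train_rows)
    for _ in range(epochs):
        rng.shuffle(rows)                          # reshuffle every epoch
        for start in range(0, len(rows), batch_size):
            batch = rows[start:start + batch_size]
            g1 = g2 = gb = 0.0
            for x1, x2, y in batch:
                err = w1 * x1 + w2 * x2 + b - y
                g1 += 2 * err * x1
                g2 += 2 * err * x2
                gb += 2 * err
            n = len(batch)
            w1 -= lr * g1 / n                      # one weight update per batch
            w2 -= lr * g2 / n
            b -= lr * gb / n
    return w1, w2, b

data = make_data()
train_rows, val_rows = data[:160], data[160:]      # assumed 80/20 split

# Step 1: candidate values for each hyperparameter
grid = {"lr": [0.001, 0.01, 0.1], "batch_size": [16, 32, 64], "epochs": [20, 50]}

# Steps 2-3: train every combination and evaluate on the validation set
results = []
for lr, batch_size, epochs in itertools.product(grid["lr"], grid["batch_size"], grid["epochs"]):
    params = train(train_rows, lr, batch_size, epochs)
    results.append({"lr": lr, "batch_size": batch_size, "epochs": epochs,
                    "val_loss": mse(params, val_rows)})

# Step 4: pick the combination with the lowest validation loss
best = min(results, key=lambda r: r["val_loss"])
print(f"{len(results)} runs, best: {best}")
```

Even this small grid requires 3 × 3 × 2 = 18 full training runs, which is why the cost of grid search grows so quickly as more hyperparameters or candidate values are added.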


# Key Takeaways from Day 18

- Hyperparameters control how the model learns, not what it learns  
- Learning rate affects convergence speed and stability  
- Epochs control total training time → watch for overfitting  
- Batch size affects gradient stability and generalization  
- Grid search systematically finds the best combination among the candidate values you define  
- Validation set is essential for hyperparameter tuning

---

<p style="text-align:center; font-size:18px;">
© 2025 Mostafizur Rahman
</p>
