<a href="https://www.kaggle.com/code/mrafraim/dl-day-32-hyperparameters-for-cnn-rnn?scriptVersionId=294808330" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Day 32: Hyperparameters for CNN/RNN

Welcome to Day 32!

Today you'll learn:

1. What hyperparameters actually control
2. Learning rate, the most dangerous knob
3. Optimizers and their behavior
4. CNN-specific tuning heuristics
5. RNN-specific tuning heuristics
6. Practical tuning workflow (real world)

If you found this notebook helpful, your **<b style="color:orange;">UPVOTE</b>** would be greatly appreciated! It helps others discover the work and supports continuous improvement.

---


# What Are Hyperparameters?

Hyperparameters are:
- Set before training
- Not learned from data
- Control how learning happens

Examples:
- Learning rate
- Optimizer type
- Batch size
- Number of layers
- Dropout rate


# The Learning Rate

Learning rate $\alpha$ controls update size:

$$
\theta_{t+1} = \theta_t - \alpha \nabla_\theta L
$$

Where:
- $\theta$ = model parameters
- $\nabla_\theta L$ = gradient
- $\alpha$ = learning rate


## Learning Rate Intuition

| Learning Rate | Behavior |
|-------------|---------|
| Too small | Very slow convergence |
| Too large | Loss oscillates / diverges |
| Just right | Fast, stable descent |

Visual intuition:
- Small steps → safe but slow
- Large steps → unstable jumps
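To make the table concrete, here is a minimal sketch that applies the update rule to the toy loss $L(\theta) = \theta^2$. The toy loss, the function name `gradient_descent`, and the three learning rates are illustrative assumptions, not values from a real training run.

```python
def gradient_descent(lr, theta=1.0, steps=20):
    """Run plain gradient descent on L(theta) = theta**2 and return the final theta."""
    for _ in range(steps):
        grad = 2.0 * theta          # dL/dtheta for L = theta^2
        theta = theta - lr * grad   # parameter update: step against the gradient
    return theta

# Illustrative values: too small, reasonable, too large.
for lr in (0.001, 0.4, 1.1):
    print(f"lr={lr}: theta after 20 steps = {gradient_descent(lr):.4g}")
```

With the smallest rate the parameter barely moves from its starting point, with the middle rate it lands very close to the minimum at 0, and with the largest rate each step overshoots so the value grows instead of shrinking.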


# PART A: CNN Hyperparameters


## 1. CNN Learning Rate Heuristics


### What Is Learning Rate (LR)?

Learning rate ($\eta$, written $\alpha$ in the general update rule earlier) controls how much the weights change during each update.

$$
w_{new} = w_{old} - \eta \cdot \frac{\partial L}{\partial w}
$$

- Small $\eta$ → very slow learning  
- Large $\eta$ → unstable training / divergence  

Choosing the right LR is critical for CNN training.


### Why CNNs Tolerate Higher Learning Rates

CNNs have built-in properties that stabilize training:

#### 1️⃣ Weight Sharing
- Same filter applied across spatial locations
- Each filter's gradient aggregates contributions from many spatial positions
- Reduces gradient noise

#### 2️⃣ Local Receptive Fields
- Each neuron sees only a small region
- Gradients are less chaotic than fully connected layers

#### 3️⃣ Structured Depth
- Repeated Conv → Norm → Activation blocks
- Optimizers adapt faster

Result: **CNN gradients tend to be more stable than those of fully connected networks**


### Role of Batch Normalization in CNNs

Batch Normalization normalizes activations per mini-batch:

$$
\hat{x} = \frac{x - \mu_{batch}}{\sqrt{\sigma^2_{batch} + \epsilon}}
$$

Then applies learnable scaling:

$$
y = \gamma \hat{x} + \beta
$$
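The two formulas can be traced by hand on a single feature. The sketch below is a plain-Python illustration, not the PyTorch implementation; the helper name `batch_norm` and the sample activations are made up for the example. It uses training-mode batch statistics only (at inference time, frameworks use running averages instead).

```python
import math

def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift."""
    mu = sum(xs) / len(xs)                            # batch mean
    var = sum((x - mu) ** 2 for x in xs) / len(xs)    # biased batch variance
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in xs]
    return [gamma * xh + beta for xh in x_hat]        # learnable scale and shift

activations = [2.0, 4.0, 6.0, 8.0]   # one feature over a batch of 4 (made-up values)
normalized = batch_norm(activations)
print([round(v, 3) for v in normalized])
```

With the default $\gamma = 1$ and $\beta = 0$, the output has mean approximately 0 and variance approximately 1, whatever the scale of the input.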

### Why BatchNorm Enables Higher LR

BatchNorm:
- Keeps activations near mean $0$ and unit variance
- Reduces the risk of exploding/vanishing gradients
- Makes gradient scale more predictable

Effect:
> CNNs with BatchNorm can usually train stably with **larger learning rates**


### Typical Learning Rates for CNNs (Practice)

#### SGD (with momentum)

| Setup | Learning Rate |
|------|---------------|
| No BatchNorm | $0.01$ |
| With BatchNorm | $0.05$ – $0.1$ |
| Large batch (≥128) | $0.1$ |

Why:
- SGD depends heavily on gradient magnitude
- BatchNorm stabilizes gradients


#### Adam / AdamW

| Setup | Learning Rate |
|------|---------------|
| Standard CNN | $1 \times 10^{-3}$ |
| Deep CNN | $3 \times 10^{-4}$ |
| With BatchNorm | up to $2 \times 10^{-3}$ |

Why lower than SGD:
- Adam adapts learning rate per parameter
- Too high LR causes overshooting


### Batch Size vs Learning Rate (CNN Rule)

CNNs often follow linear LR scaling:

$$
\text{LR}_{new} = \text{LR}_{base} \times \frac{\text{Batch}_{new}}{\text{Batch}_{base}}
$$

#### Example
- Batch size = 32 → LR = $0.01$
- Batch size = 128 → LR = $0.04$

⚠️ Works best when BatchNorm is used
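As a quick check of the arithmetic, the rule fits in a one-line helper. The function name `scale_lr` is a hypothetical choice for this sketch.

```python
def scale_lr(base_lr, base_batch, new_batch):
    """Linear scaling rule: the learning rate grows in proportion to batch size."""
    return base_lr * new_batch / base_batch

# Reproduces the example: batch 32 at LR 0.01, scaled up to batch 128.
print(scale_lr(0.01, 32, 128))
```

Treat the result as a starting point for tuning rather than a guaranteed optimum, especially for very large batches.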


### Signs of Incorrect Learning Rate

#### LR Too High
- Loss oscillates or explodes
- Training accuracy stuck at chance level
- NaN values appear

#### LR Too Low
- Very slow loss decrease
- Underfitting despite long training
- Wasted compute time


### Practical CNN LR Tuning Strategy

1. Start with Adam at $1 \times 10^{-3}$
2. Add Batch Normalization
3. Increase LR gradually:
   - $1 \times 10^{-3}$ → $2 \times 10^{-3}$ → $3 \times 10^{-3}$
4. Observe loss curve:
   - Smooth decrease → good
   - Sudden spikes → too high


### Key Takeaways

- CNNs have stable gradients
- BatchNorm increases stability further
- Higher stability → higher learning rates possible
- SGD benefits most from BatchNorm
- Large-batch CNNs almost require BatchNorm



## 2. CNN Optimizers: When to Use Which

Optimizers control how CNN weights are updated during training.  
They determine the speed, stability, and generalization of learning.

### 1️⃣ SGD (Stochastic Gradient Descent)

SGD updates weights using the gradient of the loss on one batch at a time:

$$
w_{t+1} = w_t - \eta \cdot \nabla L(w_t)
$$

- $w_t$ → current weight  
- $\eta$ → learning rate  
- $\nabla L(w_t)$ → gradient of loss w.r.t. $w_t$  

**Intuition:**  
- Move weights in the direction of decreasing loss.
- “Stochastic” because we use mini-batches, not the whole dataset.

**Pros:**
- Simple and widely used  
- Works well for large datasets 
- Often better generalization than adaptive optimizers

**Cons:**
- Converges slowly, especially for deep CNNs  
- Can get stuck in local minima or saddle points

**PyTorch Example:**

```python
import torch  # `model` below is assumed to be an existing torch.nn.Module
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
```

**Use Case:**

* Classic CNNs (e.g., ResNet, VGG)
* Large-scale image datasets


### 2️⃣ SGD + Momentum

**Why Momentum?**

* Standard SGD can oscillate along steep slopes
* Momentum “remembers” previous updates to accelerate in consistent directions:

Mathematical Formulation:

$$
v_{t+1} = \mu v_t - \eta \nabla L(w_t)
$$

$$
w_{t+1} = w_t + v_{t+1}
$$

* $v_t$ → velocity (accumulated gradient)
* $\mu$ → momentum coefficient (usually 0.9)
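To see the effect of the velocity term, the sketch below runs the two update equations on the toy loss $L(w) = \frac{1}{2} w^2$, once with $\mu = 0$ (plain SGD) and once with $\mu = 0.9$. The toy loss, the helper name `run`, and the step count are assumptions for illustration.

```python
def run(lr=0.01, mu=0.0, steps=100):
    """Minimize L(w) = 0.5 * w**2 with the momentum update; mu=0 gives plain SGD."""
    w, v = 1.0, 0.0
    for _ in range(steps):
        grad = w                  # dL/dw for L = 0.5 * w^2
        v = mu * v - lr * grad    # velocity update
        w = w + v                 # weight update
    return w

w_sgd = run(mu=0.0)
w_mom = run(mu=0.9)
print(f"plain SGD: w = {w_sgd:.4f}")
print(f"momentum:  w = {w_mom:.4f}")
```

With the same learning rate, the momentum run ends much closer to the minimum at 0. Note that `torch.optim.SGD` applies the learning rate at the weight update instead of inside the velocity, which gives the same trajectory when the learning rate is constant.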

**Intuition:**

* Like a ball rolling down a hill: it keeps moving in the same direction, smoothing out oscillations.

**Pros:**

* Faster convergence than plain SGD
* Smoother updates
* Better for long-term training stability

**Cons:**

* Still sensitive to learning rate
* Can overshoot if LR too high

**PyTorch Example:**

```python
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```

**Use Case:**

* Vision models (ResNet, EfficientNet)
* Large datasets with many epochs

### 3️⃣ Adam (Adaptive Moment Estimation)

Adam combines Momentum + RMSProp, adapting learning rate per parameter:

Mathematical Steps:

1. Compute moving averages of gradients:

$$
m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla L(w_t)
$$

2. Compute moving average of squared gradients:

$$
v_t = \beta_2 v_{t-1} + (1 - \beta_2) (\nabla L(w_t))^2
$$

3. Bias-corrected estimates:

$$
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}
$$

4. Update weights:

$$
w_{t+1} = w_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
$$

* $\beta_1 \approx 0.9$, $\beta_2 \approx 0.999$
* $\epsilon \approx 10^{-8}$ for numerical stability
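The four steps translate almost line by line into code. This is a minimal scalar sketch for intuition, not the `torch.optim.Adam` implementation; the function name `adam_step` and the test gradients are assumptions.

```python
import math

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar weight, following steps 1-4."""
    m = beta1 * m + (1 - beta1) * grad             # step 1: moving average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2        # step 2: moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)                   # step 3: bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)  # step 4: weight update
    return w, m, v

# First step (t=1) from w=1.0 with two very different gradient magnitudes.
for grad in (0.01, 100.0):
    w_new, _, _ = adam_step(w=1.0, grad=grad, m=0.0, v=0.0, t=1)
    print(f"grad={grad}: step size = {1.0 - w_new:.6f}")
```

Both gradients produce a first step of roughly the learning rate itself, because the update divides by the gradient's own magnitude. This is why Adam's learning rate is set on a different scale than SGD's.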

**Intuition:**

* Momentum-like term ($m_t$) accelerates learning
* RMSProp-like term ($v_t$) scales learning rate per weight
* Automatically adapts to gradient magnitude

**Pros:**

* Fast convergence
* Works well **out of the box**
* Handles sparse gradients

**Cons:**

* Can **overfit** if LR not tuned
* Sometimes **worse final accuracy** than SGD+Momentum

**PyTorch Example:**

```python
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```

**Use Case:**

* Quick prototyping
* Small to medium CNNs
* When you need fast convergence


### Summary Table

| Optimizer      | Key Idea                   | Pros                     | Cons                         | Best Use Case                  |
| -------------- | -------------------------- | ------------------------ | ---------------------------- | ------------------------------ |
| SGD            | Gradient descent per batch | Simple, generalizes well | Slow convergence             | Large datasets, vision CNNs    |
| SGD + Momentum | Accumulate past gradients  | Faster, smooth updates   | Sensitive to LR              | Deep CNNs, long training       |
| Adam           | Adaptive LR per weight     | Fast, good default       | Can overfit, final acc lower | Small/medium CNNs, prototyping |


**Learning Tip:**

* For large, classic CNNs, start with SGD + Momentum.
* For quick experiments or small datasets, start with Adam.
* Always tune learning rate for best results.


## 3. CNN Batch Size

Batch size $B$ determines how many samples are processed before the model updates its weights: one weight update happens after every batch of $B$ samples.


### Effects of Batch Size

1. **Large Batch Size**
- Examples: 128, 256  
- **Pros:**  
  - Gradients are averaged over many samples → smoother updates  
  - Can use higher learning rate
  - Training is more stable
- **Cons:**  
  - Requires more GPU memory  
  - Can sometimes generalize worse

2. **Small Batch Size**
- Examples: 16, 32  
- **Pros:**  
  - More gradient noise → can help generalization 
  - Works with limited memory
- **Cons:**  
  - Updates are noisy → loss may fluctuate  
  - May require smaller learning rate
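The "smoother vs noisier" claim can be checked with a small simulation. The sketch below assumes synthetic per-sample gradients (a true gradient of 1.0 plus unit-variance Gaussian noise) and measures how much the mini-batch average fluctuates; the data and the helper name `batch_gradient_std` are hypothetical.

```python
import random
import statistics

random.seed(0)
# Hypothetical per-sample gradients: true gradient 1.0 plus unit-variance noise.
per_sample_grads = [random.gauss(1.0, 1.0) for _ in range(20_000)]

def batch_gradient_std(batch_size, n_batches=200):
    """Standard deviation of the mini-batch mean gradient across random batches."""
    means = [
        statistics.fmean(random.sample(per_sample_grads, batch_size))
        for _ in range(n_batches)
    ]
    return statistics.stdev(means)

for b in (16, 64, 256):
    print(f"batch size {b:>3}: gradient std ~ {batch_gradient_std(b):.3f}")
```

The spread shrinks roughly in proportion to $1/\sqrt{B}$, which is the sense in which larger batches give smoother updates.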


### Typical CNN Batch Sizes

| Batch Size | Notes |
|-----------|-------|
| 32 – 64  | Common default for modest GPUs, good balance of speed and generalization |
| 128 – 256 | Large GPUs, smoother training, faster convergence |
| >256     | Very large batch, requires careful LR tuning (Linear scaling rule) |


### Practical Tip

- Start with batch size that fits your GPU memory
- Adjust learning rate based on batch size:

$$
\text{LR}_{new} = \text{LR}_{base} \times \frac{\text{Batch}_{new}}{\text{Batch}_{base}}
$$

- Combine with BatchNorm for stability at larger batches  


### Key Takeaways

- Batch size = tradeoff between stability and generalization  
- Small batches → noisy gradients → often better generalization  
- Large batches → smooth gradients → faster convergence  
- Always consider GPU memory limits


# PART B: RNN Hyperparameters

RNNs (Recurrent Neural Networks) behave differently from CNNs due to sequential dependencies.
This affects learning rate choices, optimizer selection, and batch size decisions.


## Why RNNs Are Sensitive

RNNs:
- Reuse parameters across time
- Accumulate gradients across time steps
- Prone to instability

Result:
- Learning rate must be smaller
- Optimizer choice matters more

## 1. RNN Learning Rate Heuristics

RNNs are more sensitive to learning rate than CNNs because:

- Gradients can explode or vanish due to repeated multiplications over time steps  
- Sequential dependencies make weight updates less stable  
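The repeated-multiplication effect is easy to see in a simplified setting. The sketch below assumes a scalar linear recurrence $h_t = w \cdot h_{t-1}$, so a gradient flowing back through $T$ steps is scaled by $w^T$. Real RNNs use weight matrices and nonlinearities, so this shows only the mechanism; the helper name `gradient_scale` is made up.

```python
def gradient_scale(weight, time_steps):
    """Factor applied to a gradient backpropagated through `time_steps` steps
    of the scalar linear recurrence h_t = weight * h_{t-1}."""
    return weight ** time_steps

for w in (0.9, 1.0, 1.1):
    print(f"w={w}: factor after 50 steps = {gradient_scale(w, 50):.4g}")
```

A recurrent weight slightly below 1 shrinks the gradient toward zero (vanishing), and one slightly above 1 blows it up (exploding), even though the two weights differ by only 0.2.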

**Typical Learning Rates:**

| Optimizer | LR (vanilla RNN) | Notes |
|-----------|-----------------|-------|
| SGD       | 0.01 – 0.05     | Small LR recommended for stability |
| SGD+Momentum | 0.01 – 0.05  | Helps smooth updates over time |
| Adam      | 5e-4 – 1e-3     | Good default for small/medium RNNs |

**Tips:**

- Use smaller LR than CNNs to prevent exploding gradients  
- Combine with gradient clipping to handle large updates:

$$
\text{if } \lVert g \rVert > \text{threshold}, \quad g \leftarrow g \, \frac{\text{threshold}}{\lVert g \rVert}
$$

- Learning rate schedules (StepLR, Cosine, OneCycle) can improve stability.

**Gradient clipping** caps the norm of the gradients after backpropagation, before the optimizer step, to contain exploding gradients and stabilize training.
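The clipping rule above can be sketched in plain Python. The helper name `clip_by_norm` and the example vectors are assumptions for illustration.

```python
import math

def clip_by_norm(grads, threshold):
    """Rescale the gradient vector so its L2 norm is at most `threshold`."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > threshold:
        scale = threshold / norm
        return [g * scale for g in grads]   # direction kept, length capped
    return list(grads)                      # already within the threshold

print(clip_by_norm([3.0, 4.0], threshold=1.0))   # norm 5 is rescaled to norm 1
print(clip_by_norm([0.3, 0.4], threshold=1.0))   # norm 0.5 is left unchanged
```

In PyTorch the same operation is available as `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=...)`, called between `loss.backward()` and `optimizer.step()`.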

## 2. RNN Optimizers

### SGD / SGD + Momentum

- Same update rule as for CNNs:

$$
w_{t+1} = w_t - \eta \nabla L(w_t)
$$

- Momentum helps smooth updates across training iterations
- Can work for long sequences if the LR is kept small

**Pros:**
- Often better generalization for sequential tasks  
- Simple and predictable

**Cons:**
- Can be very slow for long sequences  
- Sensitive to learning rate

**PyTorch Example:**

```python
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```


### Adam

* Automatically adapts per-parameter learning rates
* Handles sparse or noisy gradients in sequential data

**Pros:**

* Fast convergence
* Handles varying gradient scales

**Cons:**

* Can overfit if LR not tuned
* Final accuracy sometimes worse than SGD+Momentum

**PyTorch Example:**

```python
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```

## 3. RNN Batch Size

Batch size in RNNs = number of sequences processed simultaneously.

### Effects of Batch Size

1. **Large Batch**

* Examples: 64 – 256 sequences
* Pros:

  * Smoother gradient estimates
  * Faster training per epoch
* Cons:

  * Higher GPU memory usage
  * Can hurt generalization in sequential tasks

2. **Small Batch**

* Examples: 16 – 32 sequences
* Pros:

  * Noisy gradients → often better generalization
  * Works with limited memory
* Cons:

  * Updates noisy → loss fluctuates
  * Training may be slower

## Practical Tips

* Start with small batch sizes if sequences are long
* Use gradient clipping to stabilize training
* Combine with LayerNorm or BatchNorm (input only) to stabilize hidden states

**Linear scaling rule for batch size:**

$$
\text{LR}_{new} = \text{LR}_{base} \times \frac{\text{Batch}_{new}}{\text{Batch}_{base}}
$$

* Works best if the RNN is properly normalized

## Key Takeaways

* **RNNs are fragile**: careful LR, optimizer, and batch size choices matter
* Use smaller learning rates than CNNs
* Prefer SGD+Momentum or Adam depending on task
* Small batches → better generalization, safer for long sequences
* Use normalization and gradient clipping to improve stability

# Key Takeaways from Day 32

- Learning rate is the primary control knob
- CNNs tolerate aggressive settings
- RNNs require caution and clipping
- Optimizer choice affects convergence speed and stability, not what the model can represent
- Tuning is iterative, not magical

---

<p style="text-align:center; font-size:18px;">
© 2026 Mostafizur Rahman
</p>
