
# Day 31: Regularization in CNN/RNN

Welcome to day 31!

Today you'll learn: 

1. Why regularization is necessary
2. Dropout: intuition and behavior
3. Batch Normalization: why it works
4. Regularization in CNNs
5. Regularization in RNNs (important differences)
6. Practical do’s and don’ts



---

#  Why Regularization Exists

Deep networks have:
- Millions of parameters
- High expressive power
- Ability to memorize noise

This leads to:
- Low training loss
- High validation loss

→ **Overfitting**

Regularization:
> Intentionally restricts model freedom to improve generalization


# PART A: Regularization in CNNs

# CNN Overfitting Pattern

Typical CNN behavior without regularization:

- Training loss ↓↓↓
- Validation loss ↓ then ↑
- Filters become too specialized
- Model memorizes training images

CNNs overfit because:
- Many filters
- Fully connected layers at the end


# Dropout in CNN

Dropout is a **regularization technique** designed to reduce overfitting by randomly disabling neurons during training.

In CNNs, dropout must be used surgically. Blindly applying it hurts spatial learning.


## 1. What Problem Dropout Solves

Deep CNNs have high representational capacity.

This creates a core risk:

- Model memorizes training data
- Learns fragile feature dependencies
- Performs poorly on unseen data

This phenomenon is called **overfitting**.

Dropout directly attacks:

❌ Feature co-adaptation  
❌ Over-reliance on specific neurons  
❌ Brittle internal representations


## 2. What Dropout Actually Does (Mathematics)

Dropout randomly turns off neurons during training.

For a neuron output $h$ during training:

$$
\tilde{h} = h \cdot r, \quad r \sim \text{Bernoulli}(p)
$$

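Where:
- $p$ = probability of keeping the neuron active (the keep probability; PyTorch's `nn.Dropout(p=...)` instead takes the drop probability, which is one minus this value)
- $r = 1$ → neuron active
- $r = 0$ → neuron dropped

Important:
- Dropout is applied only during training
- During inference, all neurons are active
- Frameworks rescale the kept outputs during training so that training and inference activations match in expectation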

### Step 1: What is `h`?

`h` is the output of a neuron before dropout.

Example:
- Input → weights → activation function
- Resulting number = `h`

So if a neuron fires with value:

$h = 6.0$

That is its contribution to the next layer.

### Step 2: What is `r`?

`r` is a random switch.

It is drawn from a Bernoulli distribution:

$$r \sim \text{Bernoulli}(p)$$

Meaning:
- r = 1 with probability p (keep neuron)
- r = 0 with probability 1 − p (drop neuron)

This decision is made:
- Independently
- For every neuron
- At every training step


---

### What Is a Bernoulli Distribution?

A Bernoulli distribution models an experiment with:

- Exactly two possible outcomes
- One is labeled “success”
- One is labeled “failure”

That’s it. No middle ground.


**Mathematical Definition**

A random variable $X$ follows a Bernoulli distribution if:

$$
X \sim \text{Bernoulli}(p)
$$

Where:
- $p$ = probability of success
- $X = 1$ with probability $p$
- $X = 0$ with probability $1 - p$

So:

| Outcome | Value | Probability |
|------|------|-------------|
| Success | 1 | $p$ |
| Failure | 0 | $1 - p$ |


**Example A: Coin Flip (Biased)**

- Head = 1
- Tail = 0
- Probability of head = 0.7

Then:

$$
X \sim \text{Bernoulli}(0.7)
$$

Each flip gives:
- 1 (70% of the time)
- 0 (30% of the time)


**Example B: Light Switch**

- ON = 1
- OFF = 0

A light switch on its own is deterministic.
A Bernoulli variable is a switch flipped at random, with $p$ controlling how often it ends up ON.

### Bernoulli in Neural Networks

In dropout:

- Each neuron is treated like a light switch
- ON = neuron kept
- OFF = neuron dropped

The switch is flipped randomly using Bernoulli.

For each neuron:

$$
r \sim \text{Bernoulli}(p)
$$

Meaning:
- r = 1 → neuron survives
- r = 0 → neuron removed

This happens:
- For every neuron
- For every training batch
- Independently


**Manual Dropout Example Using Bernoulli**

Layer output:

$h = [5, 3, 7, 1]$

Keep probability:

$p = 0.5$

Bernoulli draws:

$r = [1, 0, 1, 0]$

Apply dropout:

$$
\tilde{h} = h \odot r
$$

Result:

$$[5, 0, 7, 0]$$
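
The same masking step can be written in a few lines of plain Python. This is only a sketch: `apply_mask` is a hypothetical helper name, and the mask is hard-coded to the draw used above rather than sampled.

```python
def apply_mask(h, r):
    """Element-wise product of two equal-length lists (h ⊙ r)."""
    if len(h) != len(r):
        raise ValueError("h and r must have the same length")
    return [h_i * r_i for h_i, r_i in zip(h, r)]

h = [5, 3, 7, 1]
r = [1, 0, 1, 0]          # the Bernoulli draws from the example above
h_tilde = apply_mask(h, r)
print(h_tilde)            # [5, 0, 7, 0]
```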


---

### Step 3: The Core Equation

$$
\tilde{h} = h \cdot r
$$

This is not fancy math.

It literally means:

- If r = 1 → output stays the same
- If r = 0 → output becomes zero


### Step 4: Manual Single-Neuron Example

Assume:
- Neuron output: $h = 6$
- Keep probability: $p = 0.5$

Possible outcomes:

| r | Calculation | Output |
|---|------------|--------|
| 1 | 6 × 1 | 6 |
| 0 | 6 × 0 | 0 |

So during training:
- Sometimes the neuron exists
- Sometimes it vanishes


### Step 5: Manual Multi-Neuron Example (Critical)

Assume a layer output:

$h = [4, 2, 8, 6]$

Let $p = 0.5$

Random Bernoulli mask:

$r = [1, 0, 1, 0]$

Apply dropout:

$$
\tilde{h} = h \odot r
$$

Result:

$$[4, 0, 8, 0]$$

Half the neurons are removed for this step only.


### Why Scaling Is Required

If we randomly drop neurons, the expected output magnitude decreases.

Without correction:

- Training sees smaller activations
- Inference sees larger activations
- Model breaks

### Expected Value Explanation

Original expected output:

$$E[h] = h$$

After dropout (no scaling):

$$
E[\tilde{h}] = p \cdot h + (1 - p) \cdot 0
$$

$$
E[\tilde{h}] = p \cdot h
$$

So magnitude shrinks by factor p.

**Concrete Numeric Example**

Let:
- h = 10
- p = 0.5

Expected value:

$$
E[\tilde{h}] = 0.5 \times 10 = 5
$$

Meaning:

- During training, neuron contributes half as much on average


**Why This Is a Problem**

During training:
- Network learns using average signal ≈ 5

During inference (no dropout):
- All neurons active
- Output = 10

Distribution mismatch:
- Training sees small activations
- Inference sees larger activations
- Leads to unstable predictions

### Inverted Dropout: The Fix (Used in PyTorch)

Instead of scaling at inference, we scale during training.

$$
\tilde{h} = \frac{h \cdot r}{p}
$$


**Expected Value With Inverted Dropout**

Possible outcomes now:


| r | Probability | Output |
|--|--|--|
| 1 | $p$ | $h / p$ |
| 0 | $1 - p$ | 0 |

**Expected Value Calculation**

$$
E[\tilde{h}] = p \cdot \frac{h}{p} + (1 - p) \cdot 0
$$

$$
E[\tilde{h}] = h
$$

Matches original neuron output

**Numeric Example**

Let:
- $h = 10$
- $p = 0.5$

Case 1: 

$r = 1$  
$\text{Output} = 10 / 0.5 = 20$

Case 2: 

$r = 0$  
$\text{Output} = 0$  

Expected value:

$0.5 \times 20 + 0.5 \times 0 = 10$
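
A minimal sketch of inverted dropout, using this section's convention that $p$ is the keep probability. The function name `inverted_dropout` and its `training` flag are hypothetical; they are not PyTorch's API.

```python
import random

def inverted_dropout(h, keep_prob, rng, training=True):
    """Inverted dropout on a list of activations, with keep probability keep_prob."""
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    if not training:
        return list(h)  # inference: identity, no mask and no scaling
    # Kept values are divided by keep_prob so the expected output equals the input.
    return [h_i / keep_prob if rng.random() < keep_prob else 0.0 for h_i in h]

rng = random.Random(0)
outputs = [inverted_dropout([10.0], 0.5, rng)[0] for _ in range(20_000)]
mean_output = sum(outputs) / len(outputs)
print(f"average output: {mean_output:.2f}")  # near the original h = 10
```

Each individual output is either 0 or 20, yet the average stays near 10, which is why no rescaling is needed at inference.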


### What “Outputs Are Scaled Automatically” Really Means

When you write:

`nn.Dropout(p=0.5)`

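PyTorch:
- Treats `p` as the probability of **dropping** an element, so the keep probability is `1 - p`
- Applies a Bernoulli mask during training
- Divides the kept values by the keep probability `1 - p` during training
- Does nothing during inference

You never see the scaling, but it’s there. With `p=0.5` the keep and drop probabilities coincide, so the numbers match the example above.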
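
A quick way to see this behaviour is to pass a tensor of ones through `nn.Dropout` in both modes. This is a small sketch; the tensor size and seed are arbitrary. Note that in PyTorch `p` is the probability of zeroing an element, so kept values are scaled by $1/(1-p)$.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)   # p is the probability of zeroing an element
x = torch.ones(1000)

drop.train()               # training mode: mask + rescale
y_train = drop(x)

drop.eval()                # inference mode: identity
y_eval = drop(x)

print(y_train.unique())    # only 0 and 1 / (1 - p) = 2 appear
print(torch.equal(y_eval, x))
```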


## 3. How Dropout Actually Helps

Dropout exists to prevent co-adaptation between neurons.

Co-adaptation happens when:
- Neuron A becomes useful only because neuron B exists
- Neuron B depends on neuron A to work correctly

This creates fragile feature learning.

If either neuron fails, the prediction collapses.


### What Dropout Does During Training

During every training step:

- Random neurons are temporarily removed
- The network structure changes every batch
- Forward and backward passes use a different sub-network

Example:

- Batch 1: Neurons A, C active
- Batch 2: Neurons B, D active
- Batch 3: Neurons A, B active

No neuron is guaranteed to be present.

### Why This Is Equivalent to Training Many Sub-Networks

Because neurons are randomly removed:
- The model never trains as a single fixed architecture
- With $n$ droppable neurons there are up to $2^n$ possible sub-networks, and each step trains one sampled at random
- All sub-networks share the same weights

This behaves like an ensemble:
- But without training separate models
- And without extra memory cost

### Why Neurons Become More Robust

Since any neuron can disappear:
- No neuron can rely on a specific partner
- Each neuron must learn independently useful features

Instead of learning:

Feature = Neuron A AND Neuron B

The network learns:

Feature = Neuron A OR Neuron B OR Neuron C

This creates redundancy.

### What This Achieves

Dropout forces the model to learn:
- Multiple ways to represent the same pattern
- Backup features instead of brittle shortcuts

Results:
- Better generalization
- Reduced overfitting
- More stable performance on unseen data

## 4. Why Dropout Is Tricky in CNNs

Dropout behaves very differently in CNNs compared to fully connected networks. This is not accidental: it comes from how CNNs represent information.

### How CNNs Represent Information

CNNs rely on three structural ideas:

- **Local spatial correlations**  
  Nearby pixels (or tokens) are strongly related.

- **Shared convolutional filters**  
  The same filter detects the same pattern everywhere.

- **Structured feature maps**  
  Activations form grids (height × width × channels), not flat vectors.

This structure is the strength of CNNs.

### What Dropout Does That Causes Trouble

Dropout randomly removes individual activations.

In early convolution layers, this means:
- Random pixels in feature maps are erased
- Local continuity is broken
- Partial edges or textures disappear

**Example (edge detection):**

Original feature map:

████████<br>
████████<br>
████████

After dropout:

███ ███<br>
█ █████<br>
████ ██

Edges become fragmented.


### Why This Hurts Early CNN Layers

Early convolution layers learn:
- Edges
- Corners
- Textures
- Simple shapes

These features require spatial consistency.

Dropping random neurons early:
- Destroys local patterns
- Makes filters harder to learn
- Slows convergence
- Reduces representation quality

In short:
> Dropout fights against what early CNN layers are trying to learn.

### Why Dropout Works Better in Later Layers

Later CNN layers (especially fully connected layers):
- Represent abstract concepts
- No longer depend on precise spatial layout
- Behave like standard dense networks

Examples:
- “Catness”
- “Face-like structure”
- “Positive sentiment”

Here:
- Co-adaptation becomes a real risk
- Dropout helps prevent over-reliance on specific neurons


### Practical Rule Used in Real Systems

- Avoid dropout in early convolution layers
- Dropout is usually not needed in later convolutional layers either; if you use it there, keep the rate low
- Use dropout in:
  - Fully connected layers
  - Classification heads
  - Dense decision layers


### Industry Reality Check

Modern CNN architectures often:
- Use Batch Normalization instead of dropout
- Use data augmentation for regularization
- Apply dropout only near the output

That’s why you rarely see heavy dropout in ResNet, EfficientNet, etc.



> CNNs depend on spatial structure. Dropout destroys spatial structure.

So:

> Dropout is a poor regularizer for early CNN layers but a good regularizer for dense decision layers.
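
The fragmentation effect is easy to observe on a toy tensor. The sketch below applies element-wise `nn.Dropout` to a batch of feature maps and, purely for contrast, PyTorch's `nn.Dropout2d`, which zeroes whole channels instead of individual positions. `nn.Dropout2d` is not discussed in the text above, and the helper `mixed_channels` and the tensor shape are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
feature_maps = torch.ones(1, 8, 4, 4)      # (N, C, H, W): 8 channels of 4x4

elementwise = nn.Dropout(p=0.5).train()    # drops individual positions
channelwise = nn.Dropout2d(p=0.5).train()  # drops entire channels

y_elem = elementwise(feature_maps)
y_chan = channelwise(feature_maps)

def mixed_channels(y):
    """Count channels that contain both kept and dropped positions."""
    per_channel = y[0].flatten(1)           # shape (C, H*W)
    all_same = (per_channel == per_channel[:, :1]).all(dim=1)
    return int((~all_same).sum())

print("element-wise, fragmented channels:", mixed_channels(y_elem))
print("channel-wise, fragmented channels:", mixed_channels(y_chan))  # 0
```

Element-wise dropout leaves holes inside each feature map, while the channel-wise variant keeps every surviving map spatially intact.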


## 5. Where to Use Dropout in CNN

- After Fully Connected (FC) layers
- After Global Average Pooling
- Late-stage convolution blocks (light dropout)

Avoid:
- First conv layer
- Aggressive dropout in early feature extraction

Typical values:

| Layer Type | Dropout Rate (drop probability) |
|----------|-------------|
| FC Layers | 0.3 – 0.5 |
| Conv Blocks | 0.1 – 0.3 |


## 6. Example: Dropout in a CNN

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F

# Define a simple CNN model
class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        
        # Convolutional layer:
        # Input: 1 channel (grayscale), Output: 16 channels, Kernel: 3x3
        # No padding, so for a 28x28 input the output spatial size is 28 - 3 + 1 = 26
        self.conv = nn.Conv2d(1, 16, 3)
        
        # Dropout layer for regularization
        # Randomly zeroes 30% of neurons during training
        self.dropout = nn.Dropout(p=0.3)
        
        # Fully connected layer:
        # Input features = 16 channels * 26 * 26 pixels (flattened)
        # Output features = 10 classes
        self.fc = nn.Linear(16*26*26, 10)  

    def forward(self, x):
        # Apply convolution
        x = self.conv(x)
        
        # Apply ReLU activation function
        # F.relu is functional (stateless) version
        x = F.relu(x)
        
        # Flatten 4D tensor (B, C, H, W) -> 2D tensor (B, features)
        # Necessary for feeding into fully connected layer
        x = x.view(x.size(0), -1)
        
        # Apply dropout (only active during training)
        x = self.dropout(x)
        
        # Fully connected layer to produce logits for 10 classes
        x = self.fc(x)
        
        return x


Dropout is:

- `ON` during `model.train()`
- `OFF` during `model.eval()`

# Batch Normalization in CNN

**Batch Normalization (BatchNorm)** is a technique to normalize the inputs of each layer in a neural network. It is widely used in CNNs to stabilize and accelerate training.

For an input activation $x$ in a mini-batch:

$$
\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}
$$

Where:  
- $\mu$ = mean of the mini-batch  
- $\sigma^2$ = variance of the mini-batch  
- $\epsilon$ = small constant to avoid division by zero

After normalization, activations have zero mean and unit variance, which can limit the network’s expressive power. BatchNorm therefore introduces two learnable parameters, $\gamma$ and $\beta$, that scale and shift the normalized values:

$$
y = \gamma \hat{x} + \beta
$$

- $\gamma$ = scale factor; it controls the **spread (standard deviation)** of a feature  
- $\beta$ = shift factor; it controls the **position (mean)** of a feature  
- Together they let BatchNorm represent the identity function if needed, by learning $\gamma = \sqrt{\sigma^2 + \epsilon}$ and $\beta = \mu$  

Without scale and shift, BatchNorm would stabilize training but reduce model capacity.



## 1. Manual Example

Consider a mini-batch of 4 activations from a single neuron/channel:

$$
x = [2, 4, 6, 8]
$$

We'll apply Batch Normalization with a small $\epsilon = 10^{-5}$, and assume learnable parameters:

$$
\gamma = 2, \quad \beta = 1
$$

### Step 1: Compute Mini-Batch Mean

The mean $\mu$ is:

$$
\mu = \frac{1}{N} \sum_{i=1}^{N} x_i
$$

Here, $N=4$:

$$
\mu = \frac{2 + 4 + 6 + 8}{4} = \frac{20}{4} = 5
$$


### Step 2: Compute Mini-Batch Variance

The variance $\sigma^2$ is:

$$
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2
$$

$$
\sigma^2 = \frac{(2-5)^2 + (4-5)^2 + (6-5)^2 + (8-5)^2}{4}
$$

$$
\sigma^2 = \frac{(-3)^2 + (-1)^2 + 1^2 + 3^2}{4} = \frac{9 + 1 + 1 + 9}{4} = \frac{20}{4} = 5
$$


### Step 3: Normalize the Activations

Normalized activations $\hat{x}$:

$$
\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}
$$

$$
\hat{x} = \frac{[2,4,6,8] - 5}{\sqrt{5 + 10^{-5}}} \approx \frac{[-3, -1, 1, 3]}{\sqrt{5}} \approx [-1.34, -0.45, 0.45, 1.34]
$$


### Step 4: Scale and Shift with $\gamma$ and $\beta$

Finally, apply learnable parameters:

$$
y_i = \gamma \hat{x}_i + \beta
$$

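$$
y = 2 \cdot [-1.34, -0.45, 0.45, 1.34] + 1 \approx [-1.68, 0.11, 1.89, 3.68]
$$

(The results are computed from the unrounded $\hat{x}$ values, so the middle entries differ slightly from what the rounded values would give.)


### Result

Original mini-batch: $[2, 4, 6, 8]$  
Normalized & scaled mini-batch: $[-1.68, 0.11, 1.89, 3.68]$

- The outputs now have mean $\beta = 1$ and a standard deviation of about $\gamma = 2$, whatever the scale of the original inputs.
- BatchNorm has stabilized the input distribution while keeping learnable flexibility.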
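
The four steps above can be reproduced with a short, dependency-free sketch. `batch_norm_1d` is a hypothetical helper written for this example, not a library function; it uses the biased variance (dividing by $N$), as in Step 2.

```python
import math

def batch_norm_1d(x, gamma, beta, eps=1e-5):
    """Batch-normalize a list of activations from one feature, then scale and shift."""
    n = len(x)
    mu = sum(x) / n
    var = sum((x_i - mu) ** 2 for x_i in x) / n   # biased variance, divides by N
    x_hat = [(x_i - mu) / math.sqrt(var + eps) for x_i in x]
    y = [gamma * v + beta for v in x_hat]
    return mu, var, x_hat, y

mu, var, x_hat, y = batch_norm_1d([2, 4, 6, 8], gamma=2, beta=1)
print(mu, var)
print([round(v, 2) for v in x_hat])
print([round(v, 2) for v in y])
```

The printed values agree with the hand calculation up to rounding of the intermediate $\hat{x}$ values.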


## 2. How BatchNorm Works

1. Normalizes activations across the mini-batch.  
2. Centers and scales each feature to have zero mean and unit variance.  
3. Introduces learnable parameters ($\gamma$ and $\beta$) to retain representational flexibility.  
4. Integrates with convolutional layers by normalizing each channel separately, using statistics pooled over the batch and all spatial locations.


## 3. Benefits of BatchNorm

1. **Stabilizes Gradients**  
   - By keeping activations in a consistent range, gradients are less likely to explode or vanish.  
   - This makes deeper networks easier to train.

2. **Allows Higher Learning Rates**  
   - Reduces the risk of divergence, enabling faster convergence.

3. **Reduces Internal Covariate Shift**  
   - The distribution of inputs to each layer becomes more stable during training. This was the original motivation for BatchNorm, though how much of its benefit comes from this effect is debated.

4. **Acts as a Mild Regularizer**  
   - Slight noise from mini-batch statistics reduces overfitting, sometimes reducing the need for dropout.

5. **Improves Generalization**  
   - Normalization smooths the optimization landscape, making training more robust.


## 4. CNN Batch Normalization Placement

The most common placement of BatchNorm in CNNs, and the one used throughout this notebook, is:

$$
\text{Conv} \;\rightarrow\; \text{BatchNorm} \;\rightarrow\; \text{Activation}
$$

Example with ReLU:

$$
y = \text{ReLU}(\text{BN}(\text{Conv}(x)))
$$

This is the default choice in modern CNN architectures.


### Why BatchNorm Is Placed After Convolution

A convolution layer produces raw feature maps with unstable distributions during training.

BatchNorm:
- Normalizes these feature maps
- Stabilizes their distribution
- Makes the activation function behave predictably

Placing BN before activation ensures the nonlinearity receives normalized inputs.


### Why NOT Place BatchNorm After Activation

Discouraged pattern:

$$
\text{Conv} \;\rightarrow\; \text{Activation} \;\rightarrow\; \text{BatchNorm}
$$

Reasons:
- Activations like ReLU clip negative values
- This skews the distribution toward non-negative values
- BatchNorm then normalizes an already-clipped signal
- Often reported to converge worse, although results vary by architecture


## 5. Exact Computation Order in CNN

For an input tensor $x$:

1. Convolution:
$$
z = W * x + b
$$

2. Batch Normalization:
$$
\hat{z} = \frac{z - \mu}{\sqrt{\sigma^2 + \epsilon}}
$$

$$
z_{\text{BN}} = \gamma \hat{z} + \beta
$$

3. Activation:
$$
y = \text{ReLU}(z_{\text{BN}})
$$
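
In PyTorch this computation order maps directly onto an `nn.Sequential` block. The sketch below is one possible arrangement; the channel counts and the 32×32 input size are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

# Conv -> BatchNorm -> ReLU, in the order listed above.
block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # z = W * x + b
    nn.BatchNorm2d(16),                          # normalize, then gamma * z_hat + beta
    nn.ReLU(),                                   # y = ReLU(z_BN)
)

x = torch.randn(8, 3, 32, 32)   # hypothetical batch of 8 RGB 32x32 images
y = block(x)
print(y.shape)                  # torch.Size([8, 16, 32, 32])
```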

## 6. Channel-wise Normalization in CNNs

For CNNs, BatchNorm is applied per channel, not per pixel.

Given tensor shape:
$$
(N, C, H, W)
$$

BatchNorm computes:
- Mean $\mu_c$
- Variance $\sigma_c^2$

Across:
$$
N \times H \times W
$$

For each channel $c$ independently.
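
One way to confirm the per-channel behaviour is to compute the statistics by hand over the $N$, $H$ and $W$ axes and compare with `nn.BatchNorm2d` in training mode. This is a sketch with an arbitrary small tensor; it relies on $\gamma = 1$, $\beta = 0$ at initialization and the default $\epsilon = 10^{-5}$.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 3, 5, 5)                 # (N, C, H, W)

# Per-channel statistics over the N, H and W axes.
mu = x.mean(dim=(0, 2, 3), keepdim=True)    # shape (1, C, 1, 1)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
manual = (x - mu) / torch.sqrt(var + 1e-5)

bn = nn.BatchNorm2d(3).train()              # gamma = 1, beta = 0 at initialization
bn_out = bn(x)
print(torch.allclose(manual, bn_out, atol=1e-5))
```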


## 7. Real Architecture Examples

**VGG-BN**

$$
\text{Conv} \rightarrow \text{BN} \rightarrow \text{ReLU}
$$

**ResNet (Post-activation)**

$$
\text{Conv} \rightarrow \text{BN} \rightarrow \text{ReLU}
$$

**ResNet (Pre-activation variant)**

$$
\text{BN} \rightarrow \text{ReLU} \rightarrow \text{Conv}
$$

Pre-activation ResNet is a special case, not the default.


## 8. BatchNorm + Bias Redundancy

When using BatchNorm:

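**Do NOT use bias in Conv layers that feed directly into BatchNorm**

Reason: BN subtracts the per-channel mean, which cancels any constant bias added by the convolution, and then adds its own learnable shift:

$$
\beta \text{ in BN replaces bias}
$$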

Practical rule:
- `Conv2d(bias=False)`
- `BatchNorm2d(...)`
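
The redundancy is easy to verify: adding a constant to every value in a batch does not change the normalized result. The sketch below uses made-up numbers (`conv_out` and `bias` are hypothetical) and a plain-Python `normalize` helper.

```python
import math

def normalize(z, eps=1e-5):
    """Batch-normalize a list of values (no gamma/beta)."""
    mu = sum(z) / len(z)
    var = sum((v - mu) ** 2 for v in z) / len(z)
    return [(v - mu) / math.sqrt(var + eps) for v in z]

conv_out = [0.5, -1.0, 2.0, 3.5]          # hypothetical pre-bias conv outputs
bias = 7.0                                # hypothetical conv bias

without_bias = normalize(conv_out)
with_bias = normalize([v + bias for v in conv_out])

max_gap = max(abs(a - b) for a, b in zip(without_bias, with_bias))
print(f"largest difference: {max_gap:.2e}")  # zero up to floating-point error
```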


## 9. Summary: CNN BatchNorm Placement Rules

Default:

$$
\text{Conv} \rightarrow \text{BN} \rightarrow \text{Activation}
$$

Avoid:

$$
\text{Conv} \rightarrow \text{Activation} \rightarrow \text{BN}
$$

- BN normalizes the raw convolution outputs (pre-activation), not the post-ReLU activations  
- Disable Conv bias when the convolution is followed directly by BN  
- Enables deeper networks and higher learning rates


## 10. Example: BatchNorm in CNN

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F

# CNN with BatchNorm Example
class CNNWithBN(nn.Module):
    def __init__(self):
        super().__init__()
        # Conv layer 1: input 3 channels, output 16 channels, kernel 3x3
        # bias=False because the following BatchNorm supplies its own shift (beta)
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1, bias=False)
        # BatchNorm for conv1 output channels
        self.bn1 = nn.BatchNorm2d(16)
        
        # Conv layer 2: input 16 channels, output 32 channels
        self.conv2 = nn.Conv2d(16, 32, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(32)
        
        # Fully connected layer: flatten 32*8*8 -> 10 classes
        # (assumes 8x8 inputs: padding=1 keeps the spatial size and there is no pooling)
        self.fc = nn.Linear(32*8*8, 10)

    def forward(self, x):
        # Conv -> BatchNorm -> ReLU
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        
        # Flatten for FC layer
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x

Learning Notes:

- BatchNorm is applied after Conv, before ReLU.
- `nn.BatchNorm2d(num_features)`: `num_features` is the number of channels produced by the preceding convolution.
- Normalizes activations per channel across the batch, stabilizing training.

# PART B: Regularization in RNNs

RNNs are designed to handle sequence data, e.g., text, time series, or speech.  
Unlike regular feedforward networks, RNNs maintain a memory of previous inputs using hidden states.

**RNN Structure**

At each time step $t$, the RNN computes:

$$
h_t = f(W_x x_t + W_h h_{t-1} + b)
$$

- $x_t$ = input at time step $t$  
- $h_{t-1}$ = hidden state from previous time step  
- $W_x, W_h$ = weights for input and hidden state  
- $b$ = bias  
- $f$ = activation function (usually $\tanh$ or ReLU)

Output can be:

$$
y_t = g(W_y h_t + c)
$$

- $y_t$ = output at time step $t$  
- $g$ = activation for output (softmax for classification)

Suppose a simple RNN with:
- 1 input neuron, 1 hidden neuron  
- $W_x = 0.5$, $W_h = 0.8$, $b = 0$  
- Activation $f = \tanh$  
- Sequence $x = [1.0, 2.0]$  
- Initial hidden state $h_0 = 0$

Time Step 1:

$$
h_1 = \tanh(W_x x_1 + W_h h_0) = \tanh(0.5 \cdot 1 + 0.8 \cdot 0) = \tanh(0.5) \approx 0.462
$$

Time Step 2:

$$
h_2 = \tanh(W_x x_2 + W_h h_1) = \tanh(0.5 \cdot 2 + 0.8 \cdot 0.462) = \tanh(1.3696) \approx 0.879
$$

So hidden states are $h = [0.462, 0.879]$.
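
The recurrence can be unrolled in a few lines of plain Python. `rnn_forward` is a hypothetical helper for this one-unit example, using the weights and inputs given above.

```python
import math

def rnn_forward(xs, w_x, w_h, b=0.0, h0=0.0):
    """Run a one-unit tanh RNN over a sequence and return every hidden state."""
    h, states = h0, []
    for x_t in xs:
        h = math.tanh(w_x * x_t + w_h * h + b)   # h_t = tanh(W_x x_t + W_h h_{t-1} + b)
        states.append(h)
    return states

hidden = rnn_forward([1.0, 2.0], w_x=0.5, w_h=0.8)
print([round(h, 3) for h in hidden])
```

Because each state feeds the next, any change to $h_1$ (for example from dropout) propagates into $h_2$ as well.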


# Why RNN Regularization Is Tricky

RNNs:
- Share weights across time
- Depend on hidden state
- Sensitive to noise

Naive dropout:
- Breaks temporal consistency
- Destroys memory


# Dropout in RNNs

- RNNs can overfit small datasets: although the weights are shared across time steps, the hidden state gives the model enough capacity to memorize training sequences.  
- Dropout is a way to regularize and improve generalization.

⚠ Important: Hidden states are temporally dependent.  
- Randomly dropping hidden state values at each time step independently destroys sequence information.  
- **Solution:** use the same dropout mask across all time steps (recurrent dropout).


## 1. Standard Dropout Concept

For a layer with input vector $x$, standard dropout randomly sets a fraction $p$ of activations to zero:

$$
\tilde{x}_i =
\begin{cases} 
0 & \text{with probability } p \\
\frac{x_i}{1-p} & \text{with probability } 1-p
\end{cases}
$$

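- $p$ = dropout rate, i.e. the probability of dropping a neuron (e.g., 0.2 means 20% of neurons are dropped on average). This matches PyTorch's convention and differs from Part A, where $p$ denoted the keep probability
- Scaling by $1/(1-p)$ keeps the expected activation unchanged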


## 2. Dropout in RNNs (Vanilla RNN/LSTM/GRU)

Key difference: hidden states are temporally correlated.  

- Standard dropout at each time step can destroy temporal patterns.  
- **Solution:** apply dropout to inputs and/or hidden states consistently across time steps (same mask for all time steps in a sequence).

Common patterns:

1. **Input Dropout**: Apply dropout to $x_t$ (the inputs at each time step).  
2. **Recurrent Dropout**: Apply dropout to $h_{t-1}$ (hidden state) using the same mask across all $t$.  
3. **Output Dropout**: Optional dropout after $h_t$ before feeding to final layer.


## 3. Vanilla RNN with Dropout (Step-by-Step)

RNN update (simplified):

$$
h_t = \tanh(W_x x_t + W_h h_{t-1} + b)
$$

With **dropout on input**:

$$
\tilde{x}_t = \text{Dropout}(x_t)
$$

$$
h_t = \tanh(W_x \tilde{x}_t + W_h h_{t-1} + b)
$$

With **recurrent dropout**:

$$
\tilde{h}_{t-1} = \text{Dropout}(h_{t-1})
$$

$$
h_t = \tanh(W_x x_t + W_h \tilde{h}_{t-1} + b)
$$


## 4. Example: RNN with Input & Recurrent Dropout


**Step 1: RNN Setup**

- Sequence: $x = [1.0, 2.0]$ (2 time steps)  
- Single hidden unit: $h_t$  
- Initial hidden state: $h_0 = 0$  
- Weights: $W_x = 0.5$, $W_h = 0.8$, bias $b = 0$  
- Activation: $\tanh$  
- Dropout rate: $p = 0.5$  
- Apply:
  - Input dropout
  - Recurrent dropout (same mask across time steps)


**Step 2: Generate Dropout Masks**

- Input dropout mask (one entry per time step, since the input is a single scalar): $m_x = [1, 0]$  
- Recurrent dropout mask (one entry per hidden unit, reused at every time step): $m_h = [1]$  

Scaling factor: $1/(1-p) = 1/0.5 = 2$  


**Step 3: Apply Dropout to Inputs**

$$
\tilde{x}_t = \frac{x_t \cdot m_{x,t}}{1-p}
$$

- Time step 1: $\tilde{x}_1 = 1.0 \cdot 1 \cdot 2 = 2.0$  
- Time step 2: $\tilde{x}_2 = 2.0 \cdot 0 \cdot 2 = 0.0$


**Step 4: Apply Dropout to Hidden State**

Initial hidden: $h_0 = 0$  
Recurrent dropout mask: $m_h = 1$  
Scaled: $h_0 \cdot 1 / 0.5 = 0$  

So initial hidden used in computation: $h_0^{drop} = 0$


**Step 5: Compute Hidden States**

Time step 1:

$$
h_1 = \tanh(W_x \tilde{x}_1 + W_h h_0^{drop} + b)
= \tanh(0.5 \cdot 2.0 + 0.8 \cdot 0 + 0)
= \tanh(1.0) \approx 0.7616
$$

Time step 2:

$$
h_1^{drop} = \frac{h_1 \cdot m_h}{1-p} = \frac{0.7616 \cdot 1}{0.5} \approx 1.523
$$

$$
h_2 = \tanh(W_x \tilde{x}_2 + W_h h_1^{drop} + b)
= \tanh(0.5 \cdot 0 + 0.8 \cdot 1.523 + 0)
= \tanh(1.218) \approx 0.839
$$


**Step 6: Summary of Computation**

| Time Step | Input $x_t$ | Input Dropout $\tilde{x}_t$ | Hidden $h_{t-1}^{drop}$ | Hidden $h_t$ |
|-----------|------------|----------------------------|-------------------------|-------------|
| 1         | 1.0        | 2.0                        | 0.0                     | 0.762       |
| 2         | 2.0        | 0.0                        | 1.523                   | 0.839       |

Key observations:

- Input dropout zeroes some inputs and scales up the ones it keeps.  
- In this example the recurrent mask keeps the single hidden unit, so recurrent dropout only rescales the hidden state, by the same factor at every step.  
- Temporal patterns are preserved while regularizing the network.
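
The worked example can be replayed in code. The sketch below hard-codes the masks from Step 2 instead of sampling them; `rnn_with_dropout` is a hypothetical helper, and `p` is the drop probability.

```python
import math

def rnn_with_dropout(xs, input_mask, recurrent_mask, w_x, w_h, p, b=0.0, h0=0.0):
    """One-unit tanh RNN with inverted dropout on inputs and on the hidden state.

    input_mask has one 0/1 entry per time step; recurrent_mask is a single 0/1
    value reused at every step. p is the drop probability.
    """
    scale = 1.0 / (1.0 - p)
    h, states = h0, []
    for x_t, m_x in zip(xs, input_mask):
        x_drop = x_t * m_x * scale             # input dropout
        h_drop = h * recurrent_mask * scale    # recurrent dropout, same mask each step
        h = math.tanh(w_x * x_drop + w_h * h_drop + b)
        states.append(h)
    return states

states = rnn_with_dropout([1.0, 2.0], input_mask=[1, 0], recurrent_mask=1,
                          w_x=0.5, w_h=0.8, p=0.5)
print([round(h, 3) for h in states])
```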


## 5. CNN Dropout vs RNN Dropout


### 1. **Where Dropout is Applied**

**CNN:**
- Applied to activations of fully connected layers or feature maps of convolutional layers.  
- Usually after Conv + Activation, or between FC layers.  
- Each neuron (or feature map) is dropped independently for every forward pass.

**RNN:**
- Applied to input vectors ($x_t$), hidden states ($h_t$), or output before final layer.  
- Recurrent dropout is shared across all time steps to preserve temporal correlations.  
- Independent dropout per time step on hidden state usually breaks sequence learning.

### 2. **Temporal Dependency**

**CNN:**
- No temporal dependency between inputs; dropping neurons is independent.  
- Random dropout every forward pass works fine.

**RNN:**
- Hidden states are sequentially dependent ($h_t$ depends on $h_{t-1}$).  
- Random dropout per time step on $h_t$ can destroy temporal patterns.  
- Use same mask across sequence for hidden state (recurrent dropout).

### 3. **Scaling and Implementation**

**CNN:**
- Standard inverted dropout: scale kept activations by $1/(1-p)$ during training.  
- Dropout mask regenerated every forward pass.

**RNN:**
- Input dropout: same as CNN.  
- Recurrent dropout: mask is fixed for entire sequence, scaled by $1/(1-p)$.  
- In PyTorch, the `dropout` argument of `nn.RNN`, `nn.LSTM`, and `nn.GRU` applies ordinary dropout to the outputs of every layer except the last; it is not recurrent dropout.


### 4. **Effect on Training**

**CNN:**
- Reduces co-adaptation between neurons, helps generalization.  
- Works well in deep feedforward and convolutional architectures.

**RNN:**
- Reduces overfitting, which is most noticeable on small sequence datasets.  
- Preserves sequence information when applied correctly.  
- Must carefully balance dropout rate: too high can destroy memory of previous time steps.


### **Summary Table**

| Aspect                     | CNN Dropout                           | RNN Dropout                              |
|-----------------------------|--------------------------------------|-----------------------------------------|
| Applied To                  | FC layers, Conv activations          | Input $x_t$, Hidden $h_t$, Output       |
| Temporal Dependency          | None                                  | High, hidden states depend on $t-1$    |
| Dropout Mask per Pass       | Yes, independent                       | Input: independent; Hidden: same mask across sequence |
| Scaling                     | $1/(1-p)$                             | $1/(1-p)$ (input & recurrent separately) |
| Effect on Network           | Reduces neuron co-adaptation          | Preserves sequence, regularizes memory  |
| Common Pitfall              | Element-wise dropout in early conv layers fragments feature maps | Different mask per time step → destroys temporal info |


---

### Understanding Dropout Scaling $1/(1-p)$

**Step 1: Setup**

Suppose a neuron has activation value:

$$
x = 4.0
$$

We apply dropout rate $p = 0.5$ (50% chance to drop the neuron during training).  

Without scaling, if the neuron is kept, its value is $x=4$; if dropped, $x=0$.

**Step 2: Why Scaling is Needed**

Dropout randomly drops neurons.  
- If we don’t scale, the expected value of the neuron during training decreases.  

Expected value without scaling:

$$
E[x_{\text{drop}}] = (1-p) \cdot x + p \cdot 0 = 0.5 \cdot 4 + 0.5 \cdot 0 = 2
$$

Notice: The expected activation drops from $4$ to $2$, so the network sees smaller activations during training than at inference.

**Step 3: Apply Inverted Dropout Scaling**

To keep the expected value same, scale kept neurons by $1/(1-p)$:

$$
\tilde{x} =
\begin{cases} 
0 & \text{with probability } p \\
x / (1-p) & \text{with probability } 1-p
\end{cases}
$$

Here, $1/(1-p) = 1/0.5 = 2$.

**Step 4: Compute Scaled Dropout Example**

- With probability $p = 0.5$, neuron is dropped: $\tilde{x} = 0$  
- With probability $1-p = 0.5$, neuron is kept: $\tilde{x} = x/(1-p) = 4/0.5 = 8$  

Expected value with scaling:

$$
E[\tilde{x}] = 0.5 \cdot 0 + 0.5 \cdot 8 = 4
$$

Now the expected activation is the same as original $x$, preventing the network from seeing smaller values during training.

**Step 5: Summary Table**

| Dropout Event | Without Scaling | With Scaling $1/(1-p)$ |
|---------------|----------------|------------------------|
| Neuron dropped | 0              | 0                      |
| Neuron kept    | 4              | 8                      |
| Expected value | 2              | 4 ✅                   |

Key idea: Scaling ensures training activations match inference activations in expectation.

### Dropout Scaling $1/(1-p)$ in CNN vs RNN

### 1. CNNs

- Dropout is applied independently to activations of neurons (or feature maps).  
- Example: after a fully connected layer or convolution + activation:
  $$ \tilde{x}_i = 
    \begin{cases} 
    0 & \text{with probability } p \\
    x_i / (1-p) & \text{with probability } 1-p
    \end{cases} $$
- Each neuron’s mask is regenerated every forward pass.  
- Scaling ensures expected activation remains same, so the network sees similar values during training and inference.


### 2. RNNs

- Dropout is applied to **input $x_t$, hidden states $h_t$, or output**.  
- Input dropout: same as CNN, applied independently per neuron (mask can be same across sequence or per step depending on framework).  
- Recurrent dropout (on $h_{t-1}$):
  - Mask is shared across all time steps to preserve temporal patterns.  
  - Still uses scaling $1/(1-p)$:
    $$ \tilde{h}_{t-1} = h_{t-1} \cdot \text{mask} / (1-p) $$
- Scaling ensures hidden states have same expected magnitude, preventing the network from shrinking memory over time.
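
PyTorch's built-in recurrent layers do not expose this kind of shared-mask dropout, so one option is to unroll an `nn.RNNCell` by hand. The following is a minimal sketch under that assumption; `run_with_recurrent_dropout` is a hypothetical helper, and the sizes are arbitrary.

```python
import torch
import torch.nn as nn

def run_with_recurrent_dropout(cell, x, p=0.5, training=True):
    """Unroll an nn.RNNCell over x of shape (batch, time, features).

    One dropout mask is sampled per sequence and reused at every time step.
    """
    batch, steps, _ = x.shape
    h = x.new_zeros(batch, cell.hidden_size)
    if training and p > 0:
        keep = torch.full_like(h, 1.0 - p)
        mask = torch.bernoulli(keep) / (1.0 - p)   # sampled once, inverted scaling
    else:
        mask = torch.ones_like(h)                  # inference: no dropout
    outputs = []
    for t in range(steps):
        h = cell(x[:, t, :], h * mask)             # same mask at every step
        outputs.append(h)
    return torch.stack(outputs, dim=1), mask

torch.manual_seed(0)
cell = nn.RNNCell(input_size=4, hidden_size=6)
x = torch.randn(2, 5, 4)
out, mask = run_with_recurrent_dropout(cell, x, p=0.5)
print(out.shape)          # torch.Size([2, 5, 6])
print(mask.unique())      # values drawn from {0, 2}
```

Sampling the mask outside the time loop is the whole point: a hidden unit that is dropped stays dropped for the entire sequence, so the surviving units carry a consistent memory.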


### 3. Key Similarities and Differences

| Aspect                | CNN Dropout                | RNN Dropout                      |
|-----------------------|----------------------------|----------------------------------|
| Scaling               | $1/(1-p)$ (inverted)      | $1/(1-p)$ (inverted)            |
| Mask per forward pass | Yes, independent          | Input: independent; Recurrent: same across time steps |
| Temporal dependency    | None                      | Hidden state depends on previous steps |
| Purpose               | Prevent co-adaptation      | Prevent overfitting while preserving temporal memory |


---


## 6. Example: Dropout in RNN

In [3]:
import torch
import torch.nn as nn

# RNN with input dropout example
class RNNWithDropout(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, dropout_p=0.5):
        super().__init__()
        # Single-layer RNN
        self.rnn = nn.RNN(input_size=input_size, hidden_size=hidden_size,
                          num_layers=1, batch_first=True, dropout=0.0)
        
        # Dropout layer applied to input sequence
        self.input_dropout = nn.Dropout(dropout_p)
        
        # Fully connected output layer
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # Apply dropout to input sequence
        x = self.input_dropout(x)
        
        # RNN forward pass
        out, h_n = self.rnn(x)
        
        # Take output of last time step for prediction
        out = self.fc(out[:, -1, :])
        return out

Learning Notes:

- Dropout prevents overfitting by randomly zeroing input features.
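- For RNN hidden states, the recurrent dropout mask should be shared across time steps.
- `nn.RNN(..., dropout=...)` applies dropout to the outputs of every layer except the last, so it has no effect on a single-layer RNN (PyTorch emits a warning). It is between-layer dropout, not recurrent dropout; dropout on $h_{t-1}$ must be implemented manually.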
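
The behavior of the `dropout` argument of `nn.RNN` can be observed directly. The sketch below assumes a recent PyTorch version on CPU; the tensor sizes are arbitrary and chosen for illustration.

```python
import warnings
import torch
import torch.nn as nn

# Single layer: there is no "next layer" to drop into, so PyTorch warns
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    single = nn.RNN(input_size=8, hidden_size=16, num_layers=1,
                    batch_first=True, dropout=0.5)
print(len(caught) > 0)

# Two layers: dropout acts on layer 1's outputs before they enter layer 2
stacked = nn.RNN(input_size=8, hidden_size=16, num_layers=2,
                 batch_first=True, dropout=0.5)
x = torch.randn(4, 5, 8)

stacked.train()
out_a, _ = stacked(x)
out_b, _ = stacked(x)
print(torch.equal(out_a, out_b))  # training: a new mask on each call

stacked.eval()
out_c, _ = stacked(x)
out_d, _ = stacked(x)
print(torch.equal(out_c, out_d))  # inference: dropout is disabled
```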

# Normalization in RNNs

Unlike CNNs, Batch Normalization in RNNs is not straightforward because:

- **Hidden states are sequentially dependent:** $h_t$ depends on $h_{t-1}$  
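- BatchNorm on hidden states would need batch statistics at every time step; these vary from step to step and get noisy with small batches or variable-length sequences, which can disrupt the temporal patterns the network is learning
- **Solution:** use normalization schemes designed for sequences, described below.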


## 1. Approaches for Normalization in RNNs

1. **Batch Normalization (BN) on inputs**
   - Apply BN only on input vectors $x_t$ at each time step:
     $$
     \hat{x}_t = \frac{x_t - \mu_{\text{batch}}}{\sqrt{\sigma^2_{\text{batch}} + \epsilon}}
     $$
   - Then feed $\hat{x}_t$ into RNN: $h_t = f(W_x \hat{x}_t + W_h h_{t-1} + b)$
   - Safe because input does not depend on hidden states

2. **Layer Normalization (LN) on hidden states**
   - Normalize within a hidden state vector, not across batch:
     $$
     \hat{h}_t = \frac{h_t - \mu_{h_t}}{\sqrt{\sigma^2_{h_t} + \epsilon}}
     $$
   - $\mu_{h_t}$ and $\sigma^2_{h_t}$ computed across hidden units, not across batch
   - Works well because it preserves temporal dependencies

3. **Recurrent BatchNorm / Variants**
   - Some research applies BN to hidden-to-hidden transitions carefully, but usually LayerNorm is preferred in RNNs.


## 2. Why LayerNorm is Preferred in RNNs

- BatchNorm depends on batch statistics, which vary per time step → can destabilize RNN  
- LayerNorm normalizes across hidden units of a single time step, not across batch → stable for sequences  
- Works for LSTM, GRU, Vanilla RNN 
- Often combined with dropout for regularization
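
The difference in normalization axes can be made concrete. The sketch below places the same sample in two different batches and compares how `nn.LayerNorm` and `nn.BatchNorm1d` (in training mode) treat it; sizes and names are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_size = 4

bn = nn.BatchNorm1d(hidden_size).train()  # uses statistics of the current batch
ln = nn.LayerNorm(hidden_size)            # uses statistics of each vector alone

sample = torch.randn(1, hidden_size)
# Same first row, different companions in the batch
batch_a = torch.cat([sample, torch.randn(3, hidden_size)])
batch_b = torch.cat([sample, 5 * torch.randn(3, hidden_size)])

ln_same = torch.allclose(ln(batch_a)[0], ln(batch_b)[0])
bn_same = torch.allclose(bn(batch_a)[0], bn(batch_b)[0])

print(ln_same)  # LayerNorm output for the sample ignores the rest of the batch
print(bn_same)  # BatchNorm output changes with the rest of the batch
```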


## 3. Example: LayerNorm in RNN Hidden State

Suppose hidden state vector at time $t$:

$$
h_t = [0.5, 1.0, -0.5]
$$

1. Compute mean and variance across units:
$$
\mu_{h_t} = (0.5 + 1.0 - 0.5)/3 \approx 0.333
$$
$$
\sigma^2_{h_t} = \frac{(0.5-0.333)^2 + (1-0.333)^2 + (-0.5-0.333)^2}{3} \approx 0.389
$$

2. Normalize (with $\sqrt{0.389} \approx 0.624$ and negligible $\epsilon$):
$$
\hat{h}_t = \frac{h_t - \mu_{h_t}}{\sqrt{\sigma^2_{h_t} + \epsilon}} \approx [0.27, 1.07, -1.34]
$$

3. Apply learnable $\gamma, \beta$:
$$
h_t^{LN} = \gamma \hat{h}_t + \beta
$$

The output is normalized at each time step using only that step's hidden units, so temporal dependencies are preserved.
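
The arithmetic for $h_t = [0.5, 1.0, -0.5]$ can be recomputed in a few lines. This is a minimal sketch that assumes $\epsilon = 10^{-5}$ (PyTorch's default) and $\gamma = 1$, $\beta = 0$.

```python
import torch
import torch.nn.functional as F

h_t = torch.tensor([0.5, 1.0, -0.5])

mu = h_t.mean()                    # mean across hidden units, about 0.333
var = h_t.var(unbiased=False)      # population variance (divide by 3), about 0.389
h_hat = (h_t - mu) / torch.sqrt(var + 1e-5)

print(mu.item(), var.item())
print(h_hat)                       # approximately [0.27, 1.07, -1.34]

# Cross-check against PyTorch's built-in (no gamma or beta passed)
h_ref = F.layer_norm(h_t, normalized_shape=(3,), eps=1e-5)
print(torch.allclose(h_hat, h_ref, atol=1e-6))
```

The normalized vector has zero mean, which is a quick sanity check for any hand calculation.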


## 4. Normalization (CNN vs RNN)

| Concept                    | CNN                                 | RNN                           |
|----------------------------|-------------------------------------|-------------------------------|
| BatchNorm placement        | After Conv/FC, before activation    | Only safe on inputs ($x_t$)   |
| Hidden state normalization | Not applicable (no recurrent state) | LayerNorm preferred           |
| Temporal dependency        | None                                | Must preserve sequence        |
| Scaling                    | Learnable $\gamma, \beta$           | Same, per hidden state vector |
| Regularization             | Dropout often used                  | Dropout + LayerNorm           |


## 5. Example

In [4]:
# BatchNorm on RNN Input 

import torch
import torch.nn as nn

class RNNWithBN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        # BatchNorm applied to input features
        self.bn = nn.BatchNorm1d(input_size)
        self.rnn = nn.RNN(input_size=input_size, hidden_size=hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        b, seq_len, f = x.size()
        
        # Reshape for BatchNorm: (batch*seq_len, input_size)
        x_reshaped = x.contiguous().view(-1, f)
        x_norm = self.bn(x_reshaped)
        
        # Reshape back to (batch, seq_len, input_size)
        x = x_norm.view(b, seq_len, f)
        
        # RNN forward
        out, h_n = self.rnn(x)
        out = self.fc(out[:, -1, :])
        return out


Learning Notes:

- BatchNorm is safe on inputs $x_t$ but not on hidden states.
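- `nn.BatchNorm1d` accepts `(batch, features)` or `(batch, features, seq_len)`, so a `(batch, seq_len, features)` tensor must be reshaped or transposed first. Flattening time into the batch dimension, as done here, pools the statistics over all time steps.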
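
As an alternative to flattening, the feature axis can be moved to dimension 1. The sketch below compares both layouts on a random tensor in training mode; the sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, features = 4, 5, 8
x = torch.randn(batch, seq_len, features)

bn = nn.BatchNorm1d(features).train()

# Option A: fold time into the batch dimension -> (batch*seq_len, features)
out_flat = bn(x.reshape(-1, features)).reshape(batch, seq_len, features)

# Option B: move features to dim 1 -> (batch, features, seq_len)
out_perm = bn(x.transpose(1, 2)).transpose(1, 2)

# Both compute per-feature statistics over every (sample, time step) pair
print(torch.allclose(out_flat, out_perm, atol=1e-5))
```

Either layout works; the transpose version avoids tracking the original shape by hand.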

In [5]:
# LayerNorm on RNN Hidden States

import torch
import torch.nn as nn

class RNNWithLN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.rnn = nn.RNN(input_size=input_size, hidden_size=hidden_size, batch_first=True)
        
        # LayerNorm normalizes hidden state across features per time step
        self.ln = nn.LayerNorm(hidden_size)
        
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # RNN forward
        out, h_n = self.rnn(x)
        
        # Apply LayerNorm to the RNN outputs at every time step (after the recurrence has run)
        out = self.ln(out)
        
        # Take last time step output
        out = self.fc(out[:, -1, :])
        return out


Learning Notes:

- LayerNorm is preferred for RNN hidden states.
- Normalizes across hidden units per time step, preserving temporal dependencies.
- Works for Vanilla RNN, LSTM, and GRU.
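
The `RNNWithLN` class normalizes the outputs after `nn.RNN` has finished, so the state passed from step to step is not normalized. A sketch of LayerNorm inside the recurrence, where the normalized state is what gets fed back, could look like the following. The class name `LayerNormRNN` and its layer names are hypothetical, and the cell is a hand-written vanilla RNN rather than a PyTorch built-in.

```python
import torch
import torch.nn as nn

class LayerNormRNN(nn.Module):
    """Vanilla RNN with LayerNorm applied to the pre-activation at each step."""

    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.in_proj = nn.Linear(input_size, hidden_size, bias=False)    # W_x
        self.rec_proj = nn.Linear(hidden_size, hidden_size, bias=False)  # W_h
        self.ln = nn.LayerNorm(hidden_size)  # supplies learnable gamma and beta
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # x: (batch, seq_len, input_size)
        batch, seq_len, _ = x.shape
        h = x.new_zeros(batch, self.hidden_size)
        for t in range(seq_len):
            pre = self.in_proj(x[:, t, :]) + self.rec_proj(h)
            h = torch.tanh(self.ln(pre))  # normalized state is reused at step t+1
        return self.fc(h)                 # prediction from the last hidden state

model = LayerNormRNN(input_size=8, hidden_size=16, output_size=3)
x = torch.randn(4, 5, 8)
print(model(x).shape)  # torch.Size([4, 3])
```

Because LayerNorm uses no batch statistics, each sequence gets the same result regardless of what else is in the batch.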

# Key Takeaways from Day 31

- Regularization fights overfitting, not training loss
- Dropout simulates ensemble learning
- BatchNorm stabilizes CNN training
- RNNs need sequence-aware regularization
- LayerNorm > BatchNorm for RNNs
- Bad regularization can harm learning

---

<p style="text-align:center; font-size:18px;">
© 2026 Mostafizur Rahman
</p>
