
# Day 31: Regularization in CNN/RNN

Welcome to day 31!

Today you'll learn: 

1. Why regularization is necessary
2. Dropout: intuition and behavior
3. Batch Normalization: why it works
4. Regularization in CNNs
5. Regularization in RNNs (important differences)
6. Practical do’s and don’ts



---

# Why Regularization Exists

Deep networks have:
- Millions of parameters
- High expressive power
- Ability to memorize noise

This leads to:
- Low training loss
- High validation loss

→ **Overfitting**

Regularization:
> Intentionally restricts model freedom to improve generalization


# CNN Overfitting Pattern

Typical CNN behavior without regularization:

- Training loss ↓↓↓
- Validation loss ↓ then ↑
- Filters become too specialized
- Model memorizes training images

CNNs overfit because:
- Many filters give the model a large capacity
- Fully connected layers at the end often hold most of the parameters


# Dropout in CNN

Dropout is a **regularization technique** designed to reduce overfitting by randomly disabling neurons during training.

In CNNs, dropout must be used surgically. Blindly applying it hurts spatial learning.


## 1. What Problem Dropout Solves

Deep CNNs have high representational capacity.

This creates a core risk:

- Model memorizes training data
- Learns fragile feature dependencies
- Performs poorly on unseen data

This phenomenon is called **overfitting**.

Dropout directly attacks:

❌ Feature co-adaptation  
❌ Over-reliance on specific neurons  
❌ Brittle internal representations


## 2. What Dropout Actually Does (Mathematics)

Dropout randomly turns off neurons during training.

For a neuron output $h$ during training:

$$
\tilde{h} = h \cdot r, \quad r \sim \text{Bernoulli}(p)
$$

Where:
- $p$ = probability of keeping the neuron active
- $r = 1$ → neuron active
- $r = 0$ → neuron dropped

Important:
- Dropout is applied only during training
- During inference, all neurons are active
- Outputs are scaled automatically

### Step 1: What is `h`?

`h` is the output of a neuron before dropout.

Example:
- Input → weights → activation function
- Resulting number = `h`

So if a neuron fires with value:

$h = 6.0$

That is its contribution to the next layer.

### Step 2: What is `r`?

`r` is a random switch.

It is drawn from a Bernoulli distribution:

$$r \sim \text{Bernoulli}(p)$$

Meaning:
- r = 1 with probability p (keep neuron)
- r = 0 with probability 1 − p (drop neuron)

This decision is made:
- Independently
- For every neuron
- At every training step


---

### What Is a Bernoulli Distribution?

A Bernoulli distribution models an experiment with:

- Exactly two possible outcomes
- One is labeled “success”
- One is labeled “failure”

That’s it. No middle ground.


**Mathematical Definition**

A random variable $X$ follows a Bernoulli distribution if:

$$
X \sim \text{Bernoulli}(p)
$$

Where:
- $p$ = probability of success
- $X = 1$ with probability $p$
- $X = 0$ with probability $1 - p$

So:

| Outcome | Value | Probability |
|------|------|-------------|
| Success | 1 | $p$ |
| Failure | 0 | $1 - p$ |


**Example A: Coin Flip (Biased)**

- Head = 1
- Tail = 0
- Probability of head = 0.7

Then:

$$
X \sim \text{Bernoulli}(0.7)
$$

Each flip gives:
- 1 (70% of the time)
- 0 (30% of the time)
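
To make the biased coin concrete, here is a minimal sketch that simulates it with Python's standard library. The helper name `bernoulli`, the seed, and the number of flips are assumptions chosen for illustration.

```python
import random

def bernoulli(p, rng):
    """Return 1 with probability p, otherwise 0."""
    return 1 if rng.random() < p else 0

rng = random.Random(0)  # fixed seed so the run is repeatable
flips = [bernoulli(0.7, rng) for _ in range(100_000)]

heads_fraction = sum(flips) / len(flips)
print(f"fraction of heads: {heads_fraction:.3f}")  # should land close to 0.7
```

With this many flips, the observed fraction of heads should sit very close to the chosen $p$.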


**Example B: Light Switch**

- ON = 1
- OFF = 0

A plain switch is deterministic: it is either ON or OFF.
A Bernoulli variable is a switch flipped at random, with $p$ controlling how often it lands ON.

### Bernoulli in Neural Networks

In dropout:

- Each neuron is treated like a light switch
- ON = neuron kept
- OFF = neuron dropped

The switch is flipped randomly using Bernoulli.

For each neuron:

$$
r \sim \text{Bernoulli}(p)
$$

Meaning:
- r = 1 → neuron survives
- r = 0 → neuron removed

This happens:
- For every neuron
- For every training batch
- Independently


**Manual Dropout Example Using Bernoulli**

Layer output:

$h = [5, 3, 7, 1]$

Keep probability:

$p = 0.5$

Bernoulli draws:

$r = [1, 0, 1, 0]$

Apply dropout:

$$
\tilde{h} = h \odot r
$$

Result:

$$[5, 0, 7, 0]$$
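
The same calculation can be sketched in a few lines of plain Python. The helper name `apply_mask` is hypothetical, and the mask is the fixed one from the example rather than a random draw.

```python
def apply_mask(h, r):
    """Element-wise product of activations h and a 0/1 mask r."""
    if len(h) != len(r):
        raise ValueError("h and r must have the same length")
    return [h_i * r_i for h_i, r_i in zip(h, r)]

h = [5, 3, 7, 1]   # layer output
r = [1, 0, 1, 0]   # Bernoulli draws from the example

h_tilde = apply_mask(h, r)
print(h_tilde)  # [5, 0, 7, 0]
```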


---

### Step 3: The Core Equation

$$
\tilde{h} = h \cdot r
$$

This is not fancy math.

It literally means:

- If r = 1 → output stays the same
- If r = 0 → output becomes zero


### Step 4: Manual Single-Neuron Example

Assume:
- Neuron output: $h = 6$
- Keep probability: $p = 0.5$

Possible outcomes:

| r | Calculation | Output |
|---|------------|--------|
| 1 | 6 × 1 | 6 |
| 0 | 6 × 0 | 0 |

So during training:
- Sometimes the neuron exists
- Sometimes it vanishes


### Step 5: Manual Multi-Neuron Example (Critical)

Assume a layer output:

$h = [4, 2, 8, 6]$

Let $p = 0.5$

Random Bernoulli mask:

$r = [1, 0, 1, 0]$

Apply dropout:

$$
\tilde{h} = h \odot r
$$

Result:

$$[4, 0, 8, 0]$$

Half the neurons are removed for this step only.


### Why Scaling Is Required

If we randomly drop neurons, the expected output magnitude decreases.

Without correction:

- Training sees smaller activations
- Inference sees larger activations
- Model breaks

### Expected Value Explanation

Original expected output:

$$E[h] = h$$

After dropout (no scaling):

$$
E[\tilde{h}] = p \cdot h + (1 - p) \cdot 0
$$

$$
E[\tilde{h}] = p \cdot h
$$

So the expected magnitude shrinks by a factor of $p$.

**Concrete Numeric Example**

Let:
- h = 10
- p = 0.5

Expected value:

$$
E[\tilde{h}] = 0.5 \times 10 = 5
$$

Meaning:

- During training, neuron contributes half as much on average
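
A quick Monte Carlo check of this expectation, as a sketch: it applies plain dropout with no rescaling to $h = 10$ with keep probability $p = 0.5$ and averages the result. The seed and step count are arbitrary choices.

```python
import random

rng = random.Random(0)
h, p = 10.0, 0.5        # activation and keep probability from the example
n_steps = 100_000

total = 0.0
for _ in range(n_steps):
    r = 1 if rng.random() < p else 0   # r ~ Bernoulli(p)
    total += h * r                      # plain dropout, no rescaling

mean_output = total / n_steps
print(f"average output: {mean_output:.2f}")  # should be close to p * h = 5
```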


**Why This Is a Problem**

During training:
- Network learns using average signal ≈ 5

During inference (no dropout):
- All neurons active
- Output = 10

Distribution mismatch:
- Training sees small activations
- Inference sees larger activations
- Leads to unstable predictions

### Inverted Dropout: The Fix (Used in PyTorch)

Instead of scaling at inference, we scale during training.

$$
\tilde{h} = \frac{h \cdot r}{p}
$$


**Expected Value With Inverted Dropout**

Possible outcomes now:


| r | Probability | Output |
|--|--|--|
| 1 | $p$ | $h / p$ |
| 0 | $1 - p$ | 0 |

**Expected Value Calculation**

$$
E[\tilde{h}] = p \cdot \frac{h}{p} + (1 - p) \cdot 0
$$

$$
E[\tilde{h}] = h
$$

This matches the original neuron output.

**Numeric Example**

Let:
- $h = 10$
- $p = 0.5$

Case 1:

$r = 1$  
$\text{Output} = 10 / 0.5 = 20$

Case 2:

$r = 0$  
$\text{Output} = 0$

Expected value:

$0.5 \times 20 + 0.5 \times 0 = 10$
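
The following is a minimal sketch of inverted dropout in plain Python, not the PyTorch implementation. The function name `inverted_dropout` and its `keep_prob` parameter are assumptions; `keep_prob` plays the role of $p$ in the formulas above.

```python
import random

def inverted_dropout(h, keep_prob, rng):
    """Zero each entry with probability 1 - keep_prob; scale survivors by 1 / keep_prob."""
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    return [x / keep_prob if rng.random() < keep_prob else 0.0 for x in h]

rng = random.Random(0)
samples = [inverted_dropout([10.0], 0.5, rng)[0] for _ in range(100_000)]

mean_output = sum(samples) / len(samples)
print(sorted(set(samples)))                   # [0.0, 20.0]
print(f"average output: {mean_output:.2f}")   # should be close to h = 10
```

Each individual draw is either 0 or 20, yet the average stays near the original activation, which is the point of the rescaling.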


### What “Outputs Are Scaled Automatically” Really Means

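When you write:

`nn.Dropout(p=0.5)`

PyTorch:
- Treats its `p` argument as the probability of **dropping** an element, so the keep probability used in the formulas above is `1 - p`
- Applies a Bernoulli mask during training
- Divides the surviving activations by the keep probability `1 - p` during training
- Does nothing during inference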

You never see the scaling, but it’s there.
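
The hidden scaling is easy to observe. In this sketch (seed and tensor size are arbitrary), note that PyTorch's `p` is the probability of zeroing an element, so survivors are multiplied by `1 / (1 - p)`. With `p=0.5` that is the same factor of 2 as dividing by a keep probability of 0.5.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)   # p = probability of zeroing an element
x = torch.ones(1000)

drop.train()
train_out = drop(x)        # survivors are scaled by 1 / (1 - p) = 2

drop.eval()
eval_out = drop(x)         # identity: no mask, no scaling

print(train_out.unique())          # tensor([0., 2.])
print(train_out.mean().item())     # should be close to 1.0
print(torch.equal(eval_out, x))    # True
```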


## 3. How Dropout Actually Helps

Dropout exists to prevent co-adaptation between neurons.

Co-adaptation happens when:
- Neuron A becomes useful only because neuron B exists
- Neuron B depends on neuron A to work correctly

This creates fragile feature learning.

If either neuron fails, the prediction collapses.


### What Dropout Does During Training

During every training step:

- Random neurons are temporarily removed
- The network structure changes every batch
- Forward and backward passes use a different sub-network

Example:

- Batch 1: Neurons A, C active
- Batch 2: Neurons B, D active
- Batch 3: Neurons A, B active

No neuron is guaranteed to be present.
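
As an illustration, the sketch below draws a fresh mask for four hypothetical neurons on each of three batches. The neuron names, seed, and keep probability are assumptions, and the active sets will not necessarily match the example above.

```python
import random

rng = random.Random(0)
neurons = ["A", "B", "C", "D"]
keep_prob = 0.5

history = []
for batch in range(1, 4):
    # A new, independent Bernoulli draw for every neuron on every batch
    mask = [1 if rng.random() < keep_prob else 0 for _ in neurons]
    active = [name for name, keep in zip(neurons, mask) if keep]
    history.append((mask, active))
    print(f"Batch {batch}: mask={mask}, active={active}")
```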

### Why This Is Equivalent to Training Many Sub-Networks

Because neurons are randomly removed:
- The model never trains as a single fixed architecture
- Each step trains one of up to $2^n$ possible sub-networks (for $n$ droppable neurons)
- All sub-networks share the same weights

This behaves like an ensemble:
- But without training separate models
- And without extra memory cost

### Why Neurons Become More Robust

Since any neuron can disappear:
- No neuron can rely on a specific partner
- Each neuron must learn independently useful features

Instead of learning:

Feature = Neuron A AND Neuron B

The network learns:

Feature = Neuron A OR Neuron B OR Neuron C

This creates redundancy.

### What This Achieves

Dropout forces the model to learn:
- Multiple ways to represent the same pattern
- Backup features instead of brittle shortcuts

Results:
- Better generalization
- Reduced overfitting
- More stable performance on unseen data

## 4. Why Dropout Is Tricky in CNNs

Dropout behaves very differently in CNNs compared to fully connected networks. This is not accidental; it comes from how CNNs represent information.

### How CNNs Represent Information

CNNs rely on three structural ideas:

- **Local spatial correlations**  
  Nearby pixels (or tokens) are strongly related.

- **Shared convolutional filters**  
  The same filter detects the same pattern everywhere.

- **Structured feature maps**  
  Activations form grids (height × width × channels), not flat vectors.

This structure is the strength of CNNs.

### What Dropout Does That Causes Trouble

Dropout randomly removes individual activations.

In early convolution layers, this means:
- Random pixels in feature maps are erased
- Local continuity is broken
- Partial edges or textures disappear

**Example (edge detection):**

Original feature map:

████████<br>
████████<br>
████████

After dropout:

███ ███<br>
█ █████<br>
████ ██

Edges become fragmented.
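
A small sketch can reproduce this kind of picture. It zeroes positions of a solid 3×8 response independently and renders the result. The drop rate of 0.3 and the `render` helper are assumptions for illustration, and no rescaling is applied because only the pattern of holes matters here.

```python
import random

rng = random.Random(0)
drop_prob = 0.3     # hypothetical drop rate chosen for illustration
rows, cols = 3, 8

# A solid response, like a strong edge covering the whole patch
feature_map = [[1] * cols for _ in range(rows)]

# Element-wise dropout: every position is dropped independently
dropped = [[v if rng.random() >= drop_prob else 0 for v in row] for row in feature_map]

def render(grid):
    """Draw active positions as blocks and dropped positions as spaces."""
    return "\n".join("".join("█" if v else " " for v in row) for row in grid)

print(render(feature_map))
print()
print(render(dropped))
```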


### Why This Hurts Early CNN Layers

Early convolution layers learn:
- Edges
- Corners
- Textures
- Simple shapes

These features require spatial consistency.

Dropping random neurons early:
- Destroys local patterns
- Makes filters harder to learn
- Slows convergence
- Reduces representation quality

In short:
> Dropout fights against what early CNN layers are trying to learn.

### Why Dropout Works Better in Later Layers

Later CNN layers (especially fully connected layers):
- Represent abstract concepts
- No longer depend on precise spatial layout
- Behave like standard dense networks

Examples:
- “Catness”
- “Face-like structure”
- “Positive sentiment”

Here:
- Co-adaptation becomes a real risk
- Dropout helps prevent over-reliance on specific neurons


### Practical Rule Used in Real Systems

- Avoid dropout in early convolution layers
- Dropout is usually not needed in convolutional layers. If you use it there, keep the rate low.
- Use dropout in:
  - Fully connected layers
  - Classification heads
  - Dense decision layers


### Industry Reality Check

Modern CNN architectures often:
- Use Batch Normalization instead of dropout
- Use data augmentation for regularization
- Apply dropout only near the output

That’s why you rarely see heavy dropout in ResNet, EfficientNet, etc.



> CNNs depend on spatial structure. Dropout destroys spatial structure.

So:

> Dropout is a poor regularizer for early CNN layers but a good regularizer for dense decision layers.


## 5. Where to Use Dropout in CNN

- After Fully Connected (FC) layers
- After Global Average Pooling
- Late-stage convolution blocks (light dropout)

Avoid
- First conv layer
- Aggressive dropout in early feature extraction

Typical Values

| Layer Type | Dropout Rate |
|----------|-------------|
| FC Layers | 0.3 – 0.5 |
| Conv Blocks | 0.1 – 0.3 |
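
One way these placements could look in PyTorch is sketched below. The model `SmallCNN`, its layer sizes, and the default rates (0.1 in the late conv block, 0.5 before the classifier) are hypothetical choices taken from the ranges in the table, not a recommended architecture.

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Hypothetical toy model: conv features -> global average pooling -> dropout -> linear head."""

    def __init__(self, num_classes=10, conv_drop=0.1, head_drop=0.5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # early layer: no dropout
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # late conv block
            nn.ReLU(),
            nn.Dropout(p=conv_drop),                      # light dropout
        )
        self.pool = nn.AdaptiveAvgPool2d(1)               # global average pooling
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(p=head_drop),                      # heavier dropout before the classifier
            nn.Linear(32, num_classes),
        )

    def forward(self, x):
        return self.head(self.pool(self.features(x)))

model = SmallCNN()
model.eval()                                   # dropout disabled for inference
with torch.no_grad():
    logits = model(torch.randn(2, 1, 28, 28))
print(logits.shape)  # torch.Size([2, 10])
```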


## 6. Example: Dropout in a CNN

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Define a simple CNN model
class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()

        # Convolutional layer:
        # Input: 1 channel (grayscale), Output: 16 channels, Kernel: 3x3
        # No padding, so output spatial size = 28 - 3 + 1 = 26 (assuming 28x28 input)
        self.conv = nn.Conv2d(1, 16, 3)

        # Dropout layer for regularization
        # Randomly zeroes 30% of activations during training (p is the drop probability)
        self.dropout = nn.Dropout(p=0.3)

        # Fully connected layer:
        # Input features = 16 channels * 26 * 26 pixels (flattened)
        # Output features = 10 classes
        self.fc = nn.Linear(16 * 26 * 26, 10)

    def forward(self, x):
        # Apply convolution
        x = self.conv(x)

        # Apply ReLU activation function
        # F.relu is the functional (stateless) version
        x = F.relu(x)

        # Flatten 4D tensor (B, C, H, W) -> 2D tensor (B, features)
        # Necessary for feeding into the fully connected layer
        x = x.view(x.size(0), -1)

        # Apply dropout (only active during training)
        x = self.dropout(x)

        # Fully connected layer to produce logits for 10 classes
        x = self.fc(x)

        return x
```


Dropout:

- `ON` during `model.train()`
- `OFF` during `model.eval()`

# Batch Normalization in CNN

**Batch Normalization (BatchNorm)** is a technique to normalize the inputs of each layer in a neural network. It is widely used in CNNs to stabilize and accelerate training.

For an input activation $x$ in a mini-batch:

$$
\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}
$$

Where:  
- $\mu$ = mean of the mini-batch  
- $\sigma^2$ = variance of the mini-batch  
- $\epsilon$ = small constant to avoid division by zero

After normalization, activations have zero mean and unit variance, which can limit the network’s expressive power. BatchNorm introduces two learnable parameters $\gamma$ and $\beta$ to allow the network to scale and shift the normalized values:

- $\gamma$ controls the **scale (standard deviation)** of a feature  
- $\beta$ controls the **position (mean)** of a feature  
- They allow BatchNorm to represent the identity function if needed  
- Prevent normalization from restricting what the network can learn  

Without scale and shift, BatchNorm would stabilize training but reduce model capacity.

$$
y = \gamma \hat{x} + \beta
$$

- $\gamma$ = scale factor  
- $\beta$ = shift factor  

For example, setting $\gamma = \sqrt{\sigma^2 + \epsilon}$ and $\beta = \mu$ undoes the normalization and returns the original activation $x$.
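
The identity claim can be checked numerically. In this sketch the helper `batch_norm` and the input values are arbitrary illustrations. Choosing $\gamma = \sqrt{\sigma^2 + \epsilon}$ and $\beta = \mu$ should return the input unchanged, up to floating-point error.

```python
import math

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize a list of activations, then scale by gamma and shift by beta."""
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n   # biased (population) variance
    return [gamma * (v - mu) / math.sqrt(var + eps) + beta for v in x]

x = [2.0, 4.0, 6.0, 8.0]
eps = 1e-5
mu = sum(x) / len(x)
var = sum((v - mu) ** 2 for v in x) / len(x)

# gamma and beta chosen to undo the normalization
y = batch_norm(x, gamma=math.sqrt(var + eps), beta=mu, eps=eps)
print([round(v, 6) for v in y])  # [2.0, 4.0, 6.0, 8.0]
```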



## 1. Manual Example

Consider a mini-batch of 4 activations from a single neuron/channel:

$$
x = [2, 4, 6, 8]
$$

We'll apply Batch Normalization with a small $\epsilon = 10^{-5}$, and assume learnable parameters:

$$
\gamma = 2, \quad \beta = 1
$$

### Step 1: Compute Mini-Batch Mean

The mean $\mu$ is:

$$
\mu = \frac{1}{N} \sum_{i=1}^{N} x_i
$$

Here, $N=4$:

$$
\mu = \frac{2 + 4 + 6 + 8}{4} = \frac{20}{4} = 5
$$


### Step 2: Compute Mini-Batch Variance

The variance $\sigma^2$ is:

$$
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2
$$

$$
\sigma^2 = \frac{(2-5)^2 + (4-5)^2 + (6-5)^2 + (8-5)^2}{4}
$$

$$
\sigma^2 = \frac{(-3)^2 + (-1)^2 + 1^2 + 3^2}{4} = \frac{9 + 1 + 1 + 9}{4} = \frac{20}{4} = 5
$$


### Step 3: Normalize the Activations

Normalized activations $\hat{x}$:

$$
\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}
$$

$$
\hat{x} = \frac{[2,4,6,8] - 5}{\sqrt{5 + 10^{-5}}} \approx \frac{[-3, -1, 1, 3]}{\sqrt{5}} \approx [-1.34, -0.45, 0.45, 1.34]
$$


### Step 4: Scale and Shift with $\gamma$ and $\beta$

Finally, apply learnable parameters:

$$
y_i = \gamma \hat{x}_i + \beta
$$

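$$
y = 2 \cdot [-1.34, -0.45, 0.45, 1.34] + 1 \approx [-1.68, 0.11, 1.89, 3.68]
$$

The result uses the unrounded $\hat{x}$ values, which is why the middle entries are 0.11 and 1.89 rather than 0.1 and 1.9.


### Result

Original mini-batch: $[2, 4, 6, 8]$  
Normalized & scaled mini-batch: $[-1.68, 0.11, 1.89, 3.68]$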

- The activations are now centered, scaled, and shifted.
- BatchNorm has stabilized the input distribution while keeping learnable flexibility.
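
The four steps can be replayed in plain Python as a sanity check. This sketch uses only the values stated in the example. Because it carries full precision into Step 4, the middle entries come out as 0.11 and 1.89. Doubling the already-rounded $\hat{x}$ values instead would give 0.1 and 1.9.

```python
import math

x = [2.0, 4.0, 6.0, 8.0]
gamma, beta, eps = 2.0, 1.0, 1e-5

mu = sum(x) / len(x)                                    # Step 1: mini-batch mean
var = sum((v - mu) ** 2 for v in x) / len(x)            # Step 2: mini-batch variance
x_hat = [(v - mu) / math.sqrt(var + eps) for v in x]    # Step 3: normalize
y = [gamma * v + beta for v in x_hat]                   # Step 4: scale and shift

print(mu, var)                         # 5.0 5.0
print([round(v, 2) for v in x_hat])    # [-1.34, -0.45, 0.45, 1.34]
print([round(v, 2) for v in y])        # [-1.68, 0.11, 1.89, 3.68]
```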


## 2. How BatchNorm Works

1. Computes the mean and variance of each feature over the mini-batch during training (running averages of these statistics are used at inference).  
2. Centers and scales each feature to have zero mean and unit variance.  
3. Introduces learnable parameters ($\gamma$ and $\beta$) to retain representational flexibility.  
4. In convolutional layers, normalizes each channel using statistics computed over the batch and all spatial locations, with one $\gamma$ and one $\beta$ per channel.
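
For convolutional feature maps, `nn.BatchNorm2d` keeps one mean and one variance per channel, computed over the batch and both spatial dimensions. The sketch below compares the layer in training mode against a manual computation. The tensor shape, seed, and tolerance are arbitrary choices.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 3, 5, 5) * 4.0 + 2.0   # (batch, channels, height, width)

bn = nn.BatchNorm2d(num_features=3)        # gamma = 1 and beta = 0 at initialization
bn.train()                                 # use mini-batch statistics
out = bn(x)

# Manual version: one mean and one biased variance per channel over (N, H, W)
mu = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
manual = (x - mu) / torch.sqrt(var + bn.eps)

print(torch.allclose(out, manual, atol=1e-4))   # True
print(out.mean(dim=(0, 2, 3)))                  # each channel mean is near 0
```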


## 3. Benefits of BatchNorm

1. **Stabilizes Gradients**  
   - By keeping activations in a consistent range, it makes exploding or vanishing gradients less likely.  
   - This makes deeper networks easier to train.

2. **Allows Higher Learning Rates**  
   - Reduces the risk of divergence, enabling faster convergence.

3. **Reduces Internal Covariate Shift**  
   - The distribution of inputs to each layer becomes more stable during training, which improves learning efficiency. This was the original motivation for BatchNorm; how much of the benefit it explains is still debated.

4. **Acts as a Mild Regularizer**  
   - Noise from mini-batch statistics slightly perturbs each activation, which can reduce overfitting and sometimes the need for dropout.

5. **Improves Generalization**  
   - Normalization smooths the optimization landscape, making training more robust, which often carries over to validation performance.
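
The regularizing noise comes from the fact that the same sample is normalized differently depending on which mini-batch it lands in. A minimal sketch, where the second batch is a hypothetical one made up for contrast:

```python
import math

def normalize(batch, eps=1e-5):
    """Normalize a list of activations with its own mean and biased variance."""
    mu = sum(batch) / len(batch)
    var = sum((v - mu) ** 2 for v in batch) / len(batch)
    return [(v - mu) / math.sqrt(var + eps) for v in batch]

sample = 4.0
batch_a = [2.0, sample, 6.0, 8.0]
batch_b = [3.0, sample, 9.0, 12.0]   # hypothetical second batch containing the same sample

out_a = normalize(batch_a)[1]
out_b = normalize(batch_b)[1]
print(round(out_a, 2), round(out_b, 2))  # -0.45 -0.82
```

The identical input value of 4 maps to two different normalized values, so the network sees a slightly perturbed version of each sample from step to step.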


## 4. CNN Batch Normalization Placement

The most common placement of BatchNorm in CNNs is:

$$
\text{Conv} \;\rightarrow\; \text{BatchNorm} \;\rightarrow\; \text{Activation}
$$

Example with ReLU:

$$
y = \text{ReLU}(\text{BN}(\text{Conv}(x)))
$$

This is the default choice in modern CNN architectures.
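
A minimal sketch of this ordering in PyTorch follows. The helper `conv_bn_relu` and the channel sizes are hypothetical. Setting `bias=False` on the convolution is a common choice here, since BatchNorm's $\beta$ already provides a per-channel shift.

```python
import torch
import torch.nn as nn

def conv_bn_relu(in_channels, out_channels):
    """Hypothetical helper that builds one Conv -> BatchNorm -> ReLU block."""
    return nn.Sequential(
        # bias=False: the BatchNorm shift (beta) makes a conv bias redundant
        nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(out_channels),
        nn.ReLU(inplace=True),
    )

block = conv_bn_relu(3, 16)
x = torch.randn(4, 3, 32, 32)
y = block(x)

print(y.shape)           # torch.Size([4, 16, 32, 32])
print((y >= 0).all())    # tensor(True)
```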


<p style="text-align:center; font-size:18px;">
© 2026 Mostafizur Rahman
</p>
