# Week 11 — Deeper CNNs, Normalization & Modern Training Tricks

## Goal

Move from a basic CNN to a stronger, stable architecture.

By the end of this week, you should:
- Understand deeper CNN architectures
- Implement Batch Normalization (conceptually + optionally from scratch)
- Understand internal covariate shift
- Improve MNIST accuracy significantly
- Think about architecture design decisions

---

# 1. Why Go Deeper?

Shallow CNN:
- Learns edges + simple patterns

Deeper CNN:
- Learns hierarchical abstractions
- Builds complex representations
- Improves generalization

Depth increases:
- Expressive power
- Training difficulty

---

# 2. The Problem with Deeper Networks

As networks deepen:
- Gradients may vanish
- Training becomes unstable
- Convergence slows

We need stabilization techniques.
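As a toy illustration of the vanishing-gradient point above, the sketch below chains scalar sigmoid layers and tracks the gradient of the output with respect to the input. The function name `input_gradient`, the weight `w=1.0`, and the input `x=0.5` are illustrative assumptions; a real CNN uses ReLU and many weights per layer, so this only shows the mechanism.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def input_gradient(depth, w=1.0, x=0.5):
    """d(output)/d(input) for a chain of `depth` scalar layers a -> sigmoid(w * a)."""
    a, grad = x, 1.0
    for _ in range(depth):
        s = sigmoid(w * a)
        grad *= w * s * (1.0 - s)  # chain rule: multiply by this layer's local derivative
        a = s
    return grad

# The sigmoid derivative is at most 0.25, so with w = 1 the product shrinks geometrically.
for depth in (1, 5, 10, 20):
    print(depth, input_gradient(depth))
```

Each extra layer multiplies the gradient by a factor below 0.25 here, which is why early layers in a deep stack can receive almost no learning signal without stabilization.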

---

# 3. Batch Normalization

## Core Idea

Normalize activations per mini-batch:

x̂ = (x − μ_batch) / sqrt(σ²_batch + ε)

Then scale and shift:

y = γx̂ + β

Where:
- γ and β are learnable
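- μ_batch and σ²_batch are computed from the current mini-batch during training (per channel for conv layers, pooled over the batch and spatial dimensions)
- At inference, running averages of these statistics collected during training are used instead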
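The two formulas above can be sketched directly in PyTorch. This is a minimal training-mode forward pass, assuming a 4D conv activation of shape (N, C, H, W) and one mean/variance per channel; the helper name `batchnorm2d_forward` is hypothetical. It is checked against `nn.BatchNorm2d` in training mode, which normalizes with the biased batch variance.

```python
import torch
import torch.nn as nn

def batchnorm2d_forward(x, gamma, beta, eps=1e-5):
    """Training-mode BatchNorm for a (N, C, H, W) tensor."""
    # One mean and variance per channel, pooled over batch and spatial dims.
    mean = x.mean(dim=(0, 2, 3), keepdim=True)
    var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
    x_hat = (x - mean) / torch.sqrt(var + eps)
    # Learnable per-channel scale (gamma) and shift (beta).
    return gamma.view(1, -1, 1, 1) * x_hat + beta.view(1, -1, 1, 1)

torch.manual_seed(0)
x = torch.randn(8, 16, 14, 14) * 3.0 + 2.0
gamma, beta = torch.ones(16), torch.zeros(16)

reference = nn.BatchNorm2d(16)  # defaults: eps=1e-5, weight=1, bias=0
reference.train()
max_diff = (batchnorm2d_forward(x, gamma, beta) - reference(x)).abs().max().item()
print(max_diff)  # should be close to zero
```

The sketch omits the running statistics that the built-in layer also updates for use at inference time.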

---

# 4. Why BatchNorm Works

Effects:
- Stabilizes gradient flow
- Allows higher learning rates
- Reduces sensitivity to initialization
- Acts as mild regularizer

Important:
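BatchNorm changes the optimization landscape: the loss surface tends to become smoother, so gradients are more predictable from step to step.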

---

# 5. Where to Place BatchNorm

Common pattern:

Conv → BatchNorm → ReLU

NOT:

Conv → ReLU → BatchNorm
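The common pattern can be packaged as a reusable block. The helper name `conv_bn_relu` is hypothetical, and `bias=False` on the convolution is an optional choice assumed here (BatchNorm's β already provides a per-channel shift), not something this notebook requires.

```python
import torch
import torch.nn as nn

def conv_bn_relu(in_channels, out_channels):
    """Hypothetical helper: 3x3 Conv -> BatchNorm -> ReLU."""
    return nn.Sequential(
        # bias=False is optional: the BatchNorm shift (beta) makes the conv bias redundant
        nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(out_channels),
        nn.ReLU(inplace=True),
    )

block = conv_bn_relu(1, 16)
out = block(torch.randn(4, 1, 28, 28))
print(out.shape)  # torch.Size([4, 16, 28, 28])
```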

---

# 6. Deeper CNN Architecture

Example improved MNIST CNN:

Input (28×28)
→ Conv(3×3, 16)
→ BatchNorm
→ ReLU
→ Conv(3×3, 16)
→ BatchNorm
→ ReLU
→ MaxPool(2×2)
→ Conv(3×3, 32)
→ BatchNorm
→ ReLU
→ MaxPool(2×2)
→ Flatten
→ FC(128)
→ ReLU
→ Dropout
→ FC(10)
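→ Softmax (in PyTorch, `nn.CrossEntropyLoss` applies log-softmax internally, so the model itself outputs raw logits)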
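To see where the flattened feature size comes from, the spatial dimensions can be traced layer by layer. The listing above does not state padding or stride; this sketch assumes `padding=1` and stride 1 for every conv, matching the implementation later in this notebook. The helper names are hypothetical.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Output width/height of a conv layer (assumed padding=1, stride=1)."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel=2, stride=2):
    """Output width/height of a 2x2 max pool."""
    return (size - kernel) // stride + 1

size = 28
size = conv_out(size)   # Conv(3x3, 16) -> 28
size = conv_out(size)   # Conv(3x3, 16) -> 28
size = pool_out(size)   # MaxPool(2x2)  -> 14
size = conv_out(size)   # Conv(3x3, 32) -> 14
size = pool_out(size)   # MaxPool(2x2)  -> 7
flat_features = 32 * size * size
print(size, flat_features)  # 7 1568
```

Under these assumptions the first fully connected layer receives 32 × 7 × 7 = 1568 inputs.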

---

# 7. Receptive Field Growth

Each conv layer increases:
- Effective receptive field

Deeper layers:
- See larger portions of image
- Capture global structure
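The growth can be computed with the standard recurrence: each layer adds `(kernel - 1) * jump` to the receptive field, where `jump` is the input-pixel distance between neighbouring outputs and is multiplied by the stride. The sketch applies it to the conv and pool layers of the Section 6 architecture; the function name is hypothetical.

```python
def receptive_fields(layers):
    """layers: list of (name, kernel, stride). Returns the receptive field after each layer."""
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    result = []
    for name, kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
        result.append((name, rf))
    return result

layers = [("conv1", 3, 1), ("conv2", 3, 1), ("pool1", 2, 2), ("conv3", 3, 1), ("pool2", 2, 2)]
fields = receptive_fields(layers)
for name, rf in fields:
    print(name, rf)
# conv1 3, conv2 5, pool1 6, conv3 10, pool2 12
```

After the second pool, each output location depends on a 12×12 patch of the 28×28 input.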

---

# 8. Global Average Pooling (Concept)

Instead of flattening:
- Average each feature map

Reduces parameters.
Encourages spatial robustness.

(Implement optionally.)
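One way this could look in PyTorch is sketched below, assuming 32 feature maps of size 7×7 as produced by the architecture in Section 6. The class name `GapHead` is hypothetical, and feeding the pooled features straight into a 10-way linear layer is one possible design, not the only one.

```python
import torch
import torch.nn as nn

class GapHead(nn.Module):
    """Hypothetical classifier head: global average pooling, then one linear layer."""
    def __init__(self, channels=32, num_classes=10):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)      # (N, C, H, W) -> (N, C, 1, 1)
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, x):
        return self.fc(self.pool(x).flatten(1))  # flatten to (N, C)

head = GapHead()
features = torch.randn(4, 32, 7, 7)
logits = head(features)
gap_params = sum(p.numel() for p in head.parameters())
flatten_fc_params = 32 * 7 * 7 * 128 + 128       # Flatten -> FC(128): weights + biases
print(logits.shape, gap_params, flatten_fc_params)
```

The pooled head has a few hundred parameters, while the Flatten → FC(128) layer alone has about two hundred thousand.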

---

# 9. Overfitting in CNNs

Even CNNs overfit.

Use:
- Dropout
- L2 regularization
- Data augmentation (optional exploration)
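A minimal sketch of the first two options in PyTorch, using a small stand-in model rather than the CNN from this notebook. The `weight_decay=1e-4` value is illustrative, not tuned; in `optim.Adam` it adds an L2 penalty term to the gradient.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
model = nn.Sequential(
    nn.Flatten(), nn.Linear(784, 128), nn.ReLU(), nn.Dropout(0.5), nn.Linear(128, 10)
)

# L2 regularization via weight_decay (1e-4 is an illustrative value)
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

x = torch.randn(2, 1, 28, 28)
model.train()                    # dropout active: each pass samples a new random mask
train_a, train_b = model(x), model(x)
model.eval()                     # dropout disabled: output is deterministic
eval_a, eval_b = model(x), model(x)
print(torch.equal(train_a, train_b), torch.equal(eval_a, eval_b))
```

Calling `model.eval()` before measuring validation or test accuracy matters for exactly this reason: dropout (and BatchNorm) behave differently in the two modes.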

---

# Coding Exercises

## Question 1: Deepen Your CNN

Modify your Week 10 CNN:
- Add extra Conv layer
- Increase filters progressively
- Track performance

---

## Question 2: Implement BatchNorm (Optional Advanced)

Implement forward pass:

- Compute batch mean
- Compute batch variance
- Normalize
- Apply γ and β

For backward:
- Implement full gradient
OR
- Use simplified derivative (challenge)
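One possible solution sketch for the fully connected case, with input of shape (N, D) and statistics taken over the batch dimension. The function names and the cache layout are assumptions for illustration. The hand-written backward pass is the full gradient and is checked against autograd in float64.

```python
import torch

def bn_forward(x, gamma, beta, eps=1e-5):
    mean = x.mean(dim=0)                       # batch mean
    var = x.var(dim=0, unbiased=False)         # batch variance (biased)
    inv_std = 1.0 / torch.sqrt(var + eps)
    x_hat = (x - mean) * inv_std               # normalize
    return gamma * x_hat + beta, (x_hat, inv_std, gamma)

def bn_backward(dy, cache):
    x_hat, inv_std, gamma = cache
    n = dy.shape[0]
    dgamma = (dy * x_hat).sum(dim=0)
    dbeta = dy.sum(dim=0)
    dx_hat = dy * gamma
    # Full gradient: accounts for the dependence of the mean and variance on x
    dx = inv_std / n * (n * dx_hat - dx_hat.sum(dim=0) - x_hat * (dx_hat * x_hat).sum(dim=0))
    return dx, dgamma, dbeta

torch.manual_seed(0)
x = torch.randn(6, 4, dtype=torch.float64, requires_grad=True)
gamma = torch.randn(4, dtype=torch.float64, requires_grad=True)
beta = torch.randn(4, dtype=torch.float64, requires_grad=True)
dy = torch.randn(6, 4, dtype=torch.float64)    # stand-in upstream gradient

y, cache = bn_forward(x, gamma, beta)
y.backward(dy)                                 # autograd reference
with torch.no_grad():
    dx, dgamma, dbeta = bn_backward(dy, cache)
print(torch.allclose(dx, x.grad), torch.allclose(dgamma, gamma.grad), torch.allclose(dbeta, beta.grad))
```

For conv activations the same formulas apply with the sums taken over the batch and spatial dimensions, and `n` replaced by N × H × W.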

---

## Question 3: Compare With and Without BatchNorm

Train:
- CNN without BN
- CNN with BN

Compare:
- Convergence speed
- Stability
- Final accuracy

---

## Question 4: Experiment with Dropout

Add dropout before the final FC layer.

Compare:
- Overfitting behavior
- Validation accuracy

---

## Question 5: Parameter Efficiency Study

Compare:
- Flatten + large FC
vs
- Smaller FC + more conv

Observe:
- Parameter counts
- Accuracy difference
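The parameter counts can be worked out by hand before training anything. The sketch below counts the three-conv network with a Flatten → FC(128) → FC(10) head and compares it with a hypothetical alternative that adds a Conv(32→64) and a third 2×2 pool (7×7 → 3×3) and shrinks the hidden FC layer to 64 units. The alternative is an assumption for illustration only.

```python
def conv_params(c_in, c_out, k=3):
    return c_in * c_out * k * k + c_out        # weights + biases

def linear_params(n_in, n_out):
    return n_in * n_out + n_out                # weights + biases

convs = conv_params(1, 16) + conv_params(16, 16) + conv_params(16, 32)

# Option A: Flatten (32*7*7) -> FC(128) -> FC(10)
large_fc_total = convs + linear_params(32 * 7 * 7, 128) + linear_params(128, 10)

# Option B (hypothetical): extra Conv(32->64) + pool, Flatten (64*3*3) -> FC(64) -> FC(10)
more_conv_total = (convs + conv_params(32, 64)
                   + linear_params(64 * 3 * 3, 64) + linear_params(64, 10))

print("conv layers only:     ", convs)            # 7120
print("flatten + large FC:   ", large_fc_total)   # 209242
print("more conv + smaller FC:", more_conv_total) # 63194
```

In option A the first FC layer holds roughly 96% of all parameters, which is why shifting capacity into conv layers shrinks the model so much.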

---

# Conceptual Questions

1. Why does BatchNorm allow higher learning rates?
2. Why do deeper CNNs have larger receptive fields?
3. Why is Conv → BN → ReLU preferred?
4. Why can deeper networks generalize better?
5. Why does global average pooling reduce overfitting?

---

# Outcome of Week 11

After this week, you should:
- Build deeper CNNs confidently
- Stabilize training
- Improve MNIST performance beyond 95%
- Think architecturally about models


Step 0 — Load Real MNIST

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

train_dataset = datasets.MNIST(root="./data", train=True, download=True, transform=transform)
test_dataset  = datasets.MNIST(root="./data", train=False, download=True, transform=transform)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader  = DataLoader(test_dataset, batch_size=1000)


Question 1 — Deeper CNN (No BatchNorm Yet)

Architecture:

Conv → ReLU
Conv → ReLU
Pool
Conv → ReLU
Pool
FC → ReLU
FC

In [2]:
class DeepCNN_NoBN(nn.Module):
  def __init__(self):
    super().__init__()

    self.conv1 = nn.Conv2d(1, 16, 3, padding=1)
    self.conv2 = nn.Conv2d(16, 16, 3, padding=1)
    self.conv3 = nn.Conv2d(16, 32, 3, padding=1)

    self.fc1 = nn.Linear(32 * 7 * 7, 128)
    self.fc2 = nn.Linear(128, 10)

  def forward(self, X):
    x = F.relu(self.conv1(X))
    x = F.relu(self.conv2(x))
    x = F.max_pool2d(x, 2)

    x = F.relu(self.conv3(x))
    x = F.max_pool2d(x, 2)

    x = x.view(x.size(0), -1)

    x = F.relu(self.fc1(x))
    x = self.fc2(x)

    return x


Training Function

In [3]:
def train(model, optimizer, epochs=5):
  criterion = nn.CrossEntropyLoss()
  model.to(device)

  for epoch in range(epochs):
    model.train()
    total_loss = 0

    for data, target in train_loader:
      data, target = data.to(device), target.to(device)

      optimizer.zero_grad()
      output = model(data)
      loss = criterion(output, target)
      loss.backward()
      optimizer.step()

      total_loss += loss.item()

    print(f"Epoch {epoch+1}, Loss: {total_loss/len(train_loader):.4f}")


Test Accuracy

In [4]:
def test(model):
  model.eval()
  correct = 0

  with torch.no_grad():
    for data, target in test_loader:
      data, target = data.to(device), target.to(device)
      output = model(data)
      pred = output.argmax(dim=1)  # shape (N,), same as target, so eq() compares elementwise
      correct += pred.eq(target).sum().item()
  accuracy = 100.0 * correct / len(test_loader.dataset)
  print("Test Accuracy:", accuracy)
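A shape pitfall worth checking in any accuracy loop: if predictions have shape (N, 1) and targets have shape (N,), `eq` broadcasts to an (N, N) comparison and the match count is inflated. The toy logits and targets below are made up for illustration.

```python
import torch

logits = torch.tensor([[2., 0., 0.],
                       [0., 2., 0.],
                       [0., 2., 0.],
                       [0., 0., 2.]])
target = torch.tensor([0, 1, 1, 0])

pred_col = logits.argmax(dim=1, keepdim=True)   # shape (4, 1)
pred_flat = logits.argmax(dim=1)                # shape (4,)

wrong = pred_col.eq(target).sum().item()        # (4, 1) vs (4,) broadcasts to (4, 4)
right = pred_flat.eq(target).sum().item()       # elementwise comparison
also_right = pred_col.eq(target.view_as(pred_col)).sum().item()

print(pred_col.eq(target).shape, wrong, right, also_right)
# torch.Size([4, 4]) 6 3 3
```

Either dropping `keepdim=True` or reshaping the target with `view_as` keeps the comparison elementwise, so the count can never exceed the number of samples.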

Train It

In [5]:
model_no_bn = DeepCNN_NoBN()
optimizer = optim.Adam(model_no_bn.parameters(), lr=0.001)

train(model_no_bn, optimizer, epochs=5)
test(model_no_bn)


Epoch 1, Loss: 0.1815
Epoch 2, Loss: 0.0488
Epoch 3, Loss: 0.0343
Epoch 4, Loss: 0.0249
Epoch 5, Loss: 0.0196
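Test Accuracy: 10078.32

Note: an accuracy above 100 cannot be right. This value was recorded with `pred` of shape (N, 1) compared against `target` of shape (N,) in `test`, which broadcasts to an (N, N) comparison and overcounts matches. Rerun `test` with matching shapes to get the true figure.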


**Question 2 — Add BatchNorm**

Correct order:

Conv → BN → ReLU

In [6]:
class DeepCNN_BN(nn.Module):
  def __init__(self):
    super().__init__()

    self.conv1 = nn.Conv2d(1, 16, 3, padding=1)
    self.bn1 = nn.BatchNorm2d(16)

    self.conv2 = nn.Conv2d(16, 16, 3, padding=1)
    self.bn2 = nn.BatchNorm2d(16)

    self.conv3 = nn.Conv2d(16, 32, 3, padding=1)
    self.bn3 = nn.BatchNorm2d(32)

    self.fc1 = nn.Linear(32 * 7 * 7, 128)
    self.dropout = nn.Dropout(0.5)
    self.fc2 = nn.Linear(128, 10)

  def forward(self, X):
    x = F.relu(self.bn1(self.conv1(X)))
    x = F.relu(self.bn2(self.conv2(x)))
    x = F.max_pool2d(x, 2)

    x = F.relu(self.bn3(self.conv3(x)))
    x = F.max_pool2d(x, 2)

    x = x.view(x.size(0), -1)

    x = F.relu(self.fc1(x))
    x = self.dropout(x)
    x = self.fc2(x)

    return x


Train BN Model

In [7]:
model_bn = DeepCNN_BN()
optimizer = optim.Adam(model_bn.parameters(), lr=0.003)

train(model_bn, optimizer, epochs=5)
test(model_bn)


Epoch 1, Loss: 0.2346
Epoch 2, Loss: 0.1066
Epoch 3, Loss: 0.0829
Epoch 4, Loss: 0.0700
Epoch 5, Loss: 0.0609
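Test Accuracy: 10081.81

Note: like the earlier run, this value exceeds 100 and is not a real accuracy. It is inflated by the (N, 1) versus (N,) broadcast in `test`, so the BN and no-BN models cannot be compared from these two numbers.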


Parameter Efficiency Comparison

In [8]:
def count_params(model):
    return sum(p.numel() for p in model.parameters())

print("No BN Params:", count_params(model_no_bn))
print("With BN Params:", count_params(model_bn))

No BN Params: 209242
With BN Params: 209370
