### Deep Convolutional Neural Networks (AlexNet)

#### Historical Context: The AI Winter Ends (2012)

Although CNNs existed since LeNet (1995), they didn't dominate computer vision until **AlexNet won ImageNet 2012** by a large margin. Before this breakthrough:

- **Traditional pipelines** dominated: Hand-crafted features (SIFT, SURF, HOG) → Linear classifiers
- **Neural networks** were outperformed by kernel methods, ensemble methods, and SVMs
- **Features were engineered**, not learned

##### What Was Missing?

| Missing Ingredient | Problem | Solution by 2012 |
|-------------------|---------|------------------|
| **Data** | Small datasets (thousands of images) | ImageNet: 1.2M images, 1000 classes |
| **Compute** | CPUs too slow for deep networks | GPUs: orders of magnitude faster than CPUs on dense linear algebra |
| **Algorithms** | Training instabilities | ReLU, Dropout, better initialization |

#### Representation Learning

The key insight: **features should be learned, not hand-designed**.

- Traditional CV: pixels → SIFT/SURF → classifier
- Deep Learning: pixels → learned features (hierarchical) → classifier

AlexNet showed that learned features outperform hand-crafted ones when given enough data and compute.

#### AlexNet Architecture (2012)

AlexNet is an evolutionary improvement over LeNet with crucial differences:

| Aspect | LeNet-5 (1998) | AlexNet (2012) |
|--------|----------------|----------------|
| **Depth** | 5 layers | 8 layers |
| **Input size** | 28×28 (32×32 in the original paper) | 224×224 |
| **Activation** | Sigmoid | ReLU |
| **Regularization** | None | Dropout (0.5) |
| **Data augmentation** | Minimal | Extensive |
| **Training hardware** | CPU | 2× NVIDIA GTX 580 GPUs |

##### Architecture Details

```
Input: 224×224×3
    ↓
Conv1: 96 filters, 11×11, stride 4, ReLU → MaxPool 3×3, stride 2
    ↓  [54×54×96 → 26×26×96]
Conv2: 256 filters, 5×5, pad 2, ReLU → MaxPool 3×3, stride 2
    ↓  [26×26×256 → 12×12×256]
Conv3: 384 filters, 3×3, pad 1, ReLU
    ↓  [12×12×384]
Conv4: 384 filters, 3×3, pad 1, ReLU
    ↓  [12×12×384]
Conv5: 256 filters, 3×3, pad 1, ReLU → MaxPool 3×3, stride 2
    ↓  [12×12×256 → 5×5×256]
Flatten → 6400
    ↓
FC1: 4096, ReLU, Dropout(0.5)
    ↓
FC2: 4096, ReLU, Dropout(0.5)
    ↓
FC3: 1000 (classes)
```
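
The shapes in the diagram follow from the standard output-size formula $\lfloor (n + 2p - k)/s \rfloor + 1$. A minimal sketch that traces the spatial size through the stack is shown here; `out_size` and the `layers` list are illustrative names, and the padding of 1 on the first convolution is taken from the implementation later in this notebook, not from the diagram.

```python
def out_size(n, kernel, stride=1, padding=0):
    """Spatial output size of a conv or pooling layer (floor division)."""
    return (n + 2 * padding - kernel) // stride + 1

# (name, kernel, stride, padding) for every layer that changes or keeps spatial size
layers = [
    ("conv1", 11, 4, 1),
    ("pool1", 3, 2, 0),
    ("conv2", 5, 1, 2),
    ("pool2", 3, 2, 0),
    ("conv3", 3, 1, 1),
    ("conv4", 3, 1, 1),
    ("conv5", 3, 1, 1),
    ("pool3", 3, 2, 0),
]

def trace(n, layers):
    """Apply out_size layer by layer and record each intermediate size."""
    sizes = []
    for name, k, s, p in layers:
        n = out_size(n, k, s, p)
        sizes.append((name, n))
    return sizes

sizes = trace(224, layers)
for name, n in sizes:
    print(f"{name}: {n}x{n}")

flattened = 256 * sizes[-1][1] ** 2  # 256 channels after the last pooling layer
print("flattened features:", flattened)
```

The last line reproduces the 6400 input features of the first fully connected layer.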

#### Key Innovations

##### 1. ReLU Activation
- **Problem with Sigmoid**: Vanishing gradients when output ≈ 0 or 1
- **ReLU solution**: $f(x) = \max(0, x)$ — gradient is always 1 for positive inputs
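- **Benefit**: Faster training. In the AlexNet paper, a four-layer ReLU network reached 25% training error on CIFAR-10 about 6× faster than the same network with tanh units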
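
The contrast between the two gradients can be checked numerically. This is a small sketch using only the standard library; the function names are illustrative, and the depth of 8 is chosen only to mirror AlexNet's layer count.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # peaks at 0.25 when x = 0

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  sigmoid'={sigmoid_grad(x):.6f}  relu'={relu_grad(x):.1f}")

# Backpropagating through several saturated sigmoid layers multiplies small factors
depth = 8
print("sigmoid chain:", sigmoid_grad(5.0) ** depth)
print("relu chain:   ", relu_grad(5.0) ** depth)
```

Note that ReLU's gradient is exactly zero for negative inputs, so it removes saturation only on the positive side.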

##### 2. Dropout Regularization
- Randomly zero out 50% of neurons during training
- Prevents co-adaptation of features
- Acts like training an ensemble of networks
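
As a rough illustration, dropout can be written in a few lines. The sketch below uses the "inverted" form, which rescales the surviving activations during training (the convention PyTorch's `nn.Dropout` documents); the original AlexNet paper instead multiplied outputs by 0.5 at test time. The function name and the `rng` argument are illustrative.

```python
import random

def dropout(values, p=0.5, training=True, rng=random):
    """Inverted dropout: zero each value with probability p, rescale survivors."""
    if not training or p == 0.0:
        return list(values)
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]

rng = random.Random(0)
activations = [1.0] * 10_000
dropped = dropout(activations, p=0.5, rng=rng)

zero_fraction = dropped.count(0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)
print(f"zeroed: {zero_fraction:.3f}, mean after dropout: {mean_after:.3f}")
```

Roughly half of the values are zeroed while the average activation stays close to its original value, which is why no rescaling is needed at prediction time.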

##### 3. Data Augmentation
- Random crops from 256×256 → 224×224
- Horizontal flips
- Color jittering (PCA-based)
- Makes model robust to variations
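
The first two augmentations are simple enough to sketch without any library. The example below treats an image as a list of rows; `random_crop` and `random_hflip` are hypothetical helpers written for illustration (in practice one would likely use `torchvision.transforms`), and the PCA-based color jitter is not covered.

```python
import random

def random_crop(img, size, rng=random):
    """Cut a size x size window at a random offset from a 2-D image (list of rows)."""
    h, w = len(img), len(img[0])
    top = rng.randint(0, h - size)
    left = rng.randint(0, w - size)
    return [row[left:left + size] for row in img[top:top + size]]

def random_hflip(img, p=0.5, rng=random):
    """Mirror the image left-to-right with probability p."""
    return [row[::-1] for row in img] if rng.random() < p else img

rng = random.Random(0)
image = [[r * 256 + c for c in range(256)] for r in range(256)]  # toy 256x256 "image"
augmented = random_hflip(random_crop(image, 224, rng), rng=rng)
print(len(augmented), len(augmented[0]))
```

Each call produces a different 224×224 view of the same source image, so the network rarely sees exactly the same input twice.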

##### 4. GPU Training
- Split model across 2 GPUs (3GB each)
- Enabled training on large ImageNet dataset
- Paradigm shift: Deep learning became GPU-bound

In [21]:
import torch
from torch import nn
from d2l import torch as d2l

In [22]:
# AlexNet Implementation
class AlexNet(nn.Module):
    """AlexNet architecture for ImageNet-scale classification."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            # Conv Block 1: Large kernel to capture global patterns
            nn.Conv2d(1, 96, kernel_size=11, stride=4, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2), #(26,26,96)
            # Conv Block 2
            nn.Conv2d(96, 256, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2), # (12,12,256)
            # Conv Blocks 3-5: Smaller 3x3 kernels
            nn.Conv2d(256, 384, kernel_size=3, padding=1),
            nn.ReLU(),                             # (12,12,384)
            nn.Conv2d(384, 384, kernel_size=3, padding=1),
            nn.ReLU(),                             # (12,12,384)
            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.ReLU(),                             # (12,12,256)
            nn.MaxPool2d(kernel_size=3, stride=2), # (5,5,256)
            # Classifier
            nn.Flatten(),
            nn.Linear(6400, 4096),
            nn.ReLU(),
            nn.Dropout(p=0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(),
            nn.Dropout(p=0.5),
            nn.Linear(4096, num_classes)
        )
    
    def forward(self, x):
        return self.net(x)

# Initialize weights using Xavier initialization
def init_weights(m):
    if isinstance(m, (nn.Linear, nn.Conv2d)):
        nn.init.xavier_uniform_(m.weight)

model = AlexNet()
model.apply(init_weights)
print("AlexNet Architecture:")
print(model)

AlexNet Architecture:
AlexNet(
  (net): Sequential(
    (0): Conv2d(1, 96, kernel_size=(11, 11), stride=(4, 4), padding=(1, 1))
    (1): ReLU()
    (2): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(96, 256, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (4): ReLU()
    (5): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (6): Conv2d(256, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (7): ReLU()
    (8): Conv2d(384, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (9): ReLU()
    (10): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU()
    (12): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (13): Flatten(start_dim=1, end_dim=-1)
    (14): Linear(in_features=6400, out_features=4096, bias=True)
    (15): ReLU()
    (16): Dropout(p=0.5, inplace=False)
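    (17): Linear(in_features=4096, out_features=4096, bias=True)
    (18): ReLU()
    (19): Dropout(p=0.5, inplace=False)
    (20): Linear(in_features=4096, out_features=10, bias=True)
  )
)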


In [23]:
# Inspect layer-by-layer output shapes
def layer_summary(model, input_shape):
    """Print output shape at each layer."""
    X = torch.randn(*input_shape)
    print(f"{'Layer':<15} {'Output Shape':<25} {'Params':<10}")
    print("=" * 55)
    for layer in model.net:
        X = layer(X)
        params = sum(p.numel() for p in layer.parameters()) if hasattr(layer, 'parameters') else 0
        print(f"{layer.__class__.__name__:<15} {str(X.shape):<25} {params:,}")

# AlexNet expects 224x224 input, but we'll use 1 channel for Fashion-MNIST
# Note: For ImageNet, input would be (1, 3, 224, 224)
print("AlexNet layer shapes for 224x224 single-channel input:\n")
layer_summary(model, (1, 1, 224, 224))

# Total parameters
total_params = sum(p.numel() for p in model.parameters())
print(f"\nTotal parameters: {total_params:,}")

AlexNet layer shapes for 224x224 single-channel input:

Layer           Output Shape              Params    
Conv2d          torch.Size([1, 96, 54, 54]) 11,712
ReLU            torch.Size([1, 96, 54, 54]) 0
MaxPool2d       torch.Size([1, 96, 26, 26]) 0
Conv2d          torch.Size([1, 256, 26, 26]) 614,656
ReLU            torch.Size([1, 256, 26, 26]) 0
MaxPool2d       torch.Size([1, 256, 12, 12]) 0
Conv2d          torch.Size([1, 384, 12, 12]) 885,120
ReLU            torch.Size([1, 384, 12, 12]) 0
Conv2d          torch.Size([1, 384, 12, 12]) 1,327,488
ReLU            torch.Size([1, 384, 12, 12]) 0
Conv2d          torch.Size([1, 256, 12, 12]) 884,992
ReLU            torch.Size([1, 256, 12, 12]) 0
MaxPool2d       torch.Size([1, 256, 5, 5]) 0
Flatten         torch.Size([1, 6400])     0
Linear          torch.Size([1, 4096])     26,218,496
ReLU            torch.Size([1, 4096])     0
Dropout         torch.Size([1, 4096])     0
Linear          torch.Size([1, 4096])     16,781,312
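ReLU            torch.Size([1, 4096])     0
Dropout         torch.Size([1, 4096])     0
Linear          torch.Size([1, 10])       40,970

Total parameters: 46,764,746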

### Training AlexNet on Fashion-MNIST

Since ImageNet is large (~150GB), we'll train on Fashion-MNIST resized to 224×224. This demonstrates the architecture while being computationally feasible.

In [24]:
import torchvision
from torchvision import transforms
from torch.utils.data import DataLoader

# Resize Fashion-MNIST to 224x224 (AlexNet's expected input size)
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

train_dataset = torchvision.datasets.FashionMNIST(
    root='./data', train=True, download=True, transform=transform)
test_dataset = torchvision.datasets.FashionMNIST(
    root='./data', train=False, download=True, transform=transform)

# Smaller batch size due to larger images
batch_size = 128
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

print(f"Training samples: {len(train_dataset)}")
print(f"Test samples: {len(test_dataset)}")
print(f"Image shape: {train_dataset[0][0].shape}")  # Should be (1, 224, 224)

Training samples: 60000
Test samples: 10000
Image shape: torch.Size([1, 224, 224])


In [None]:
# Training functions
def train_epoch(model, train_loader, criterion, optimizer, device):
    model.train()
    total_loss, correct, total = 0, 0, 0
    for X, y in train_loader:
        X, y = X.to(device), y.to(device)
        optimizer.zero_grad()
        y_hat = model(X)
        loss = criterion(y_hat, y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * X.size(0)
        correct += (y_hat.argmax(dim=1) == y).sum().item()
        total += X.size(0)
    return total_loss / total, correct / total

def evaluate(model, test_loader, criterion, device):
    model.eval()
    total_loss, correct, total = 0, 0, 0
    with torch.no_grad():
        for X, y in test_loader:
            X, y = X.to(device), y.to(device)
            y_hat = model(X)
            loss = criterion(y_hat, y)
            total_loss += loss.item() * X.size(0)
            correct += (y_hat.argmax(dim=1) == y).sum().item()
            total += X.size(0)
    return total_loss / total, correct / total

# Setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Training on: {device}")

model = AlexNet(num_classes=10).to(device)
model.apply(init_weights)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Train (fewer epochs due to computational cost)
num_epochs = 10
print(f"\n{'Epoch':<8}{'Train Loss':<12}{'Train Acc':<12}{'Test Loss':<12}{'Test Acc':<12}")
print("=" * 56)

for epoch in range(num_epochs):
    train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer, device)
    test_loss, test_acc = evaluate(model, test_loader, criterion, device)
    print(f"{epoch+1:<8}{train_loss:<12.4f}{train_acc:<12.4f}{test_loss:<12.4f}{test_acc:<12.4f}")

### Summary: AlexNet

#### Why AlexNet Matters

AlexNet (2012) marked the beginning of the deep learning revolution in computer vision:

| Impact | Before AlexNet | After AlexNet |
|--------|---------------|---------------|
| **Features** | Hand-crafted (SIFT, HOG) | Learned automatically |
| **ImageNet accuracy** | ~74% (top-5, 2012 runner-up) | 84.7% (top-5), about 11 percentage points higher |
| **Dominant approach** | SVMs, ensemble methods | Deep neural networks |
| **Hardware** | CPUs | GPUs become essential |

#### Key Architectural Innovations

1. **Deeper network**: 8 layers (5 conv + 3 FC) vs LeNet's 5 layers
2. **ReLU activation**: Faster training; mitigates vanishing gradients because it does not saturate for positive inputs
3. **Dropout**: Regularization for large FC layers
4. **Overlapping pooling**: 3×3 windows with stride 2
5. **Local Response Normalization** (LRN): Rarely used today, replaced by BatchNorm

#### Parameter Count Breakdown

| Layer Type | Parameters | % of Total |
|------------|------------|------------|
| Conv layers | ~2.3M | 4% |
| FC layers | ~58.6M | 96% |
| **Total** | ~60.9M | 100% |

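The FC layers dominate! This motivated later architectures (NiN, GoogLeNet) to reduce FC parameters. The counts above describe the original ImageNet model (3-channel input, 1000 classes); the single-channel, 10-class variant implemented earlier is smaller, but its FC layers still hold most of the parameters.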
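
A quick way to check where the parameters sit is to count them by hand. The sketch below does this for the `AlexNet` class defined earlier in this notebook (1 input channel, 10 classes), so its totals are smaller than the table's; `conv_params` and `fc_params` are illustrative helpers, not part of PyTorch or d2l.

```python
def conv_params(c_in, c_out, k):
    """Weights plus biases of a k x k convolution."""
    return c_in * c_out * k * k + c_out

def fc_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# Layer sizes copied from the AlexNet class above (1 input channel, 10 classes)
conv = [conv_params(1, 96, 11), conv_params(96, 256, 5), conv_params(256, 384, 3),
        conv_params(384, 384, 3), conv_params(384, 256, 3)]
fc = [fc_params(6400, 4096), fc_params(4096, 4096), fc_params(4096, 10)]

total = sum(conv) + sum(fc)
print(f"conv:  {sum(conv):,} ({sum(conv) / total:.1%})")
print(f"fc:    {sum(fc):,} ({sum(fc) / total:.1%})")
print(f"total: {total:,}")
```

The per-layer values match the `Params` column printed by `layer_summary` above.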

#### Legacy

AlexNet established the modern CNN template:
- **Conv-ReLU-Pool** blocks for feature extraction
- **FC layers** for classification
- **Dropout** for regularization
- **GPU training** as standard practice

This paved the way for VGG, GoogLeNet, ResNet, and all modern architectures.

### Networks Using Blocks (VGG)

#### The Block Design Philosophy

While AlexNet showed deep CNNs work, it didn't provide a **template** for designing new networks. VGG (Visual Geometry Group, Oxford, 2014) introduced a key insight:

> **Design networks using repeating blocks of layers, not individual layers.**

This mirrors chip design evolution: transistors → logic gates → logic blocks. Similarly:
- **Early CNNs**: Design each layer individually (AlexNet)
- **VGG onwards**: Design blocks, then stack them
- **Modern**: Use entire pretrained models (foundation models)

#### Why 3×3 Convolutions?

VGG's key finding: **Deep and narrow beats shallow and wide**.

##### Receptive Field Equivalence

Two 3×3 convolutions have the same receptive field as one 5×5 convolution:

```
Single 5×5:     Two 3×3 stacked:
  □□□□□           □□□ → □□□
  □□□□□           □□□   □□□
  □□□□□           □□□   □□□
  □□□□□
  □□□□□
```

Both "see" a 5×5 region of the input!
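
The equivalence can be verified with the usual receptive-field recurrence: each layer adds `(kernel - 1)` times the current spacing between neighboring outputs. This is a small sketch; `receptive_field` is an illustrative helper.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs applied in order."""
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

print(receptive_field([(5, 1)]))                  # one 5x5 conv
print(receptive_field([(3, 1), (3, 1)]))          # two stacked 3x3 convs
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # three stacked 3x3 convs
```

The first two calls return the same value, and the third matches a single 7×7 convolution.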

##### Parameter Comparison

For $c$ input and output channels:

| Configuration | Parameters | Relative to the single large conv |
|---------------|------------|-----------------------------------|
| One 5×5 conv | $25c^2$ | 1.0× |
| Two 3×3 convs (5×5 equivalent) | $2 \times 9c^2 = 18c^2$ | 0.72× |
| One 7×7 conv | $49c^2$ | 1.0× |
| Three 3×3 convs (7×7 equivalent) | $3 \times 9c^2 = 27c^2$ | ≈0.55× |

**Benefit**: More layers = more nonlinearities + fewer parameters!
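
The ratios are easy to reproduce. The sketch below counts weights only (biases are ignored, as in the table) for an arbitrary channel count; `stacked_conv_weights` is an illustrative helper.

```python
def stacked_conv_weights(kernel, num_layers, channels):
    """Weights of num_layers kernel x kernel convs, each mapping c -> c channels."""
    return num_layers * kernel * kernel * channels * channels

c = 256  # arbitrary channel count; the ratio does not depend on it
for big, small_count in ((5, 2), (7, 3)):
    one_big = stacked_conv_weights(big, 1, c)
    stacked = stacked_conv_weights(3, small_count, c)
    print(f"{small_count} x 3x3 vs one {big}x{big}: {stacked / one_big:.2f}x the weights")
```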

#### VGG Block Structure

A VGG block consists of:
1. Multiple 3×3 convolutions (with padding=1 to preserve size)
2. ReLU after each convolution
3. 2×2 max pooling with stride 2 (halves dimensions)

```
VGG Block(num_convs, out_channels):
    for i in range(num_convs):
        Conv2d(3×3, padding=1) → ReLU
    MaxPool2d(2×2, stride=2)
```

#### VGG Network Variants

VGG defines a **family** of networks by varying the architecture:

| Model | Architecture | Conv Layers | Parameters |
|-------|-------------|-------------|------------|
| VGG-11 | (1,64)-(1,128)-(2,256)-(2,512)-(2,512) | 8 | 133M |
| VGG-13 | (2,64)-(2,128)-(2,256)-(2,512)-(2,512) | 10 | 133M |
| VGG-16 | (2,64)-(2,128)-(3,256)-(3,512)-(3,512) | 13 | 138M |
| VGG-19 | (2,64)-(2,128)-(4,256)-(4,512)-(4,512) | 16 | 144M |

Format: (num_convs, channels) per block
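
The conv-layer and parameter columns can be derived directly from the `(num_convs, channels)` tuples. The sketch below assumes the ImageNet setting (3-channel 224×224 input, 1000 classes, two hidden FC layers of width 4096); `vgg_stats` is an illustrative helper, not part of d2l.

```python
def vgg_stats(arch, in_channels=3, image_size=224, num_classes=1000):
    """Return (number of conv layers, total parameters) for a VGG-style arch."""
    params, convs, c_in, size = 0, 0, in_channels, image_size
    for num_convs, c_out in arch:
        for _ in range(num_convs):
            params += c_in * c_out * 9 + c_out  # 3x3 conv weights + biases
            c_in = c_out
            convs += 1
        size //= 2  # 2x2 max pooling with stride 2 halves height and width
    for n_in, n_out in ((c_in * size * size, 4096), (4096, 4096), (4096, num_classes)):
        params += n_in * n_out + n_out  # FC weights + biases
    return convs, params

archs = {
    "VGG-11": ((1, 64), (1, 128), (2, 256), (2, 512), (2, 512)),
    "VGG-13": ((2, 64), (2, 128), (2, 256), (2, 512), (2, 512)),
    "VGG-16": ((2, 64), (2, 128), (3, 256), (3, 512), (3, 512)),
    "VGG-19": ((2, 64), (2, 128), (4, 256), (4, 512), (4, 512)),
}
for name, arch in archs.items():
    convs, params = vgg_stats(arch)
    print(f"{name}: {convs} conv layers, {params / 1e6:.1f}M parameters")
```

Because the three FC layers are identical across variants, the differences in the totals come entirely from the extra convolutions.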

#### VGG-11 Architecture Diagram

```
Input: 224×224×3
    ↓
Block 1: 1×Conv(64) + Pool → 112×112×64
    ↓
Block 2: 1×Conv(128) + Pool → 56×56×128
    ↓
Block 3: 2×Conv(256) + Pool → 28×28×256
    ↓
Block 4: 2×Conv(512) + Pool → 14×14×512
    ↓
Block 5: 2×Conv(512) + Pool → 7×7×512
    ↓
Flatten → 25088
    ↓
FC(4096) → ReLU → Dropout(0.5)
    ↓
FC(4096) → ReLU → Dropout(0.5)
    ↓
FC(1000)
```

In [25]:
def vgg_block(num_convs, out_channels):
    layers = []
    for _ in range(num_convs):
        layers.append(nn.LazyConv2d(out_channels, kernel_size=3, padding=1))
        layers.append(nn.ReLU())
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

In [26]:
class VGG(d2l.Classifier):
    def __init__(self, arch, lr=0.1, num_classes=10):
        super().__init__()
        self.save_hyperparameters()
        conv_blks = []
        for (num_convs, out_channels) in arch:
            conv_blks.append(vgg_block(num_convs, out_channels))
        self.net = nn.Sequential(
            *conv_blks, nn.Flatten(),
            nn.LazyLinear(4096), nn.ReLU(), nn.Dropout(0.5),
            nn.LazyLinear(4096), nn.ReLU(), nn.Dropout(0.5),
            nn.LazyLinear(num_classes))
        self.net.apply(d2l.init_cnn)

In [28]:
VGG(arch=((1, 64), (1, 128), (2, 256), (2, 512), (2, 512))).layer_summary((1, 1, 224, 224))

Sequential output shape:	 torch.Size([1, 64, 112, 112])
Sequential output shape:	 torch.Size([1, 128, 56, 56])
Sequential output shape:	 torch.Size([1, 256, 28, 28])
Sequential output shape:	 torch.Size([1, 512, 14, 14])
Sequential output shape:	 torch.Size([1, 512, 7, 7])
Flatten output shape:	 torch.Size([1, 25088])
Linear output shape:	 torch.Size([1, 4096])
ReLU output shape:	 torch.Size([1, 4096])
Dropout output shape:	 torch.Size([1, 4096])
Linear output shape:	 torch.Size([1, 4096])
ReLU output shape:	 torch.Size([1, 4096])
Dropout output shape:	 torch.Size([1, 4096])
Linear output shape:	 torch.Size([1, 10])


### Training VGG on Fashion-MNIST

VGG is more computationally demanding than AlexNet. For Fashion-MNIST, we use a **smaller version** with fewer channels to make training feasible.

In [None]:
model = VGG(arch=((1, 16), (1, 32), (2, 64), (2, 128), (2, 128)))
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128, resize=(224, 224))
model.apply_init([next(iter(data.get_dataloader(True)))[0]], d2l.init_cnn)
trainer.fit(model, data)

### Summary: VGG

#### Key Contributions

1. **Block-based design**: First to use repeating blocks of layers
2. **3×3 convolutions everywhere**: Showed deep+narrow > shallow+wide
3. **Network families**: VGG-11, 13, 16, 19 provide speed-accuracy tradeoffs
4. **Simple and uniform**: Easy to implement, understand, and modify

#### VGG vs AlexNet

| Aspect | AlexNet | VGG-16 |
|--------|---------|--------|
| Design | Ad-hoc layers | Repeating blocks |
| Conv kernels | 11×11, 5×5, 3×3 | Only 3×3 |
| Depth | 8 layers | 16 layers |
| Parameters | ~60M | ~138M |
| Top-5 error | 15.3% | 7.3% |

!["AlexNet to VGG"](./Images/8/AlexNetToVGG.png)

#### The 3×3 Convolution Insight

VGG proved that **stacking small filters is better** than using large filters:
- Same receptive field
- More nonlinearities (more expressive)
- Fewer parameters
- Faster GPU implementations

This became the **default choice** in most subsequent architectures (ResNet, later Inception versions, etc.), although some still use a larger kernel in the first layer.

#### Limitations

1. **Very large FC layers**: 7×7×512 = 25,088 → 4096 requires ~100M parameters
2. **Computationally expensive**: Slow training and inference
3. **No skip connections**: Harder to train very deep versions

These limitations motivated NiN (no FC layers), GoogLeNet (inception modules), and ResNet (skip connections).

#### VGG Architecture Pattern

```python
# The VGG pattern: blocks with increasing channels, decreasing spatial size
arch = [
    (num_convs_1, channels_1),  # Block 1: spatial / 2
    (num_convs_2, channels_2),  # Block 2: spatial / 2
    ...
]
# Channels typically: 64 → 128 → 256 → 512 → 512
# Spatial: 224 → 112 → 56 → 28 → 14 → 7
```

### Network in Network (NiN)

#### Core Idea

AlexNet and VGG follow a pattern: **convolutional layers** (feature extraction) → **fully connected layers** (classification). The FC layers contain the vast majority of parameters (e.g., about 124M of VGG-16's 138M, with roughly 103M in the first FC layer alone).

**NiN's insight**: Replace the expensive FC layers entirely by using **1×1 convolutions** and **global average pooling**.

#### The NiN Block

Each NiN block consists of:
1. A **standard convolution** (e.g., 5×5 or 3×3) for spatial feature extraction
2. Two **1×1 convolutions** with ReLU — acting as a **per-pixel MLP**

```
Input → Conv(k×k) → ReLU → Conv(1×1) → ReLU → Conv(1×1) → ReLU → Output
```

**Why 1×1 convolutions?** They mix channel information at each spatial position independently — equivalent to applying a fully connected layer *per pixel* across the channel dimension.
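
This equivalence can be checked on a tiny example. The sketch below implements a bias-free 1×1 convolution on nested lists and compares it, pixel by pixel, with a dense layer applied to that pixel's channel vector; `conv1x1` and `dense` are illustrative helpers, and the numbers are arbitrary.

```python
def conv1x1(x, weight):
    """x: [C_in][H][W], weight: [C_out][C_in] -> output [C_out][H][W]."""
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    return [[[sum(weight[o][i] * x[i][r][c] for i in range(c_in))
              for c in range(w)] for r in range(h)] for o in range(len(weight))]

def dense(vec, weight):
    """Fully connected layer without bias: weight @ vec."""
    return [sum(w_i * v for w_i, v in zip(row, vec)) for row in weight]

x = [[[1.0, 2.0], [3.0, 4.0]],      # channel 0 of a 2x2 feature map
     [[0.5, 0.0], [-1.0, 2.0]],     # channel 1
     [[2.0, 1.0], [0.0, -3.0]]]     # channel 2
weight = [[1.0, 0.0, -1.0],         # 3 input channels -> 2 output channels
          [0.5, 2.0, 0.0]]

y = conv1x1(x, weight)
# The same dense layer applied to each pixel's channel vector gives identical values
for r in range(2):
    for c in range(2):
        pixel = [x[i][r][c] for i in range(3)]
        assert dense(pixel, weight) == [y[o][r][c] for o in range(2)]
print(y)
```

The weights are shared across all spatial positions, which is what distinguishes this from a fully connected layer over the flattened feature map.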

#### NiN Architecture

```
NiN Block 1: Conv 11×11, 96 channels, stride 4 + two 1×1 convs
    ↓ MaxPool 3×3, stride 2
NiN Block 2: Conv 5×5, 256 channels + two 1×1 convs
    ↓ MaxPool 3×3, stride 2
NiN Block 3: Conv 3×3, 384 channels + two 1×1 convs
    ↓ MaxPool 3×3, stride 2
NiN Block 4: Conv 3×3, 10 channels (= num_classes) + two 1×1 convs
    ↓ Global Average Pooling → flatten → output (10,)
```

**No FC layers at all!** The final NiN block outputs `num_classes` channels, and global average pooling reduces each channel's spatial map to a single number.

#### Why Global Average Pooling?

| Aspect | FC Layers | Global Average Pooling |
|--------|-----------|----------------------|
| Parameters | Millions (e.g., 25088×4096) | **Zero** |
| Overfitting | High risk, needs dropout | Much less prone |
| Spatial info | Destroyed by flattening | Summarized naturally |
| Input size | Fixed (must match dimensions) | **Any** spatial size works |
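
The last row of the table is worth seeing concretely: global average pooling returns one number per channel regardless of the spatial size. This is a minimal sketch on nested lists; `global_avg_pool` and `constant_maps` are illustrative helpers.

```python
def global_avg_pool(x):
    """x: [C][H][W] -> list of C channel means."""
    return [sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
            for channel in x]

def constant_maps(num_channels, size):
    """Toy feature maps: channel k is filled with the value k."""
    return [[[float(k)] * size for _ in range(size)] for k in range(num_channels)]

# The output length depends only on the channel count, not on the spatial size
for size in (5, 7, 12):
    scores = global_avg_pool(constant_maps(10, size))
    print(size, len(scores), scores[:3])
```

With `num_classes` channels in the final block, these per-channel means serve directly as the class scores.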

#### Key Contributions

1. **1×1 convolutions**: Adds per-pixel nonlinearity and channel mixing — became a fundamental building block (used in GoogLeNet, ResNet, etc.)
2. **Global average pooling**: Eliminates FC layers, dramatically reducing parameters and overfitting
3. **Fully convolutional**: The entire network is convolutional — no dense layers at all

#### NiN vs VGG

| Aspect | VGG-16 | NiN |
|--------|--------|-----|
| FC parameters | ~124M | **0** |
| Total parameters | ~138M | **Much fewer** |
| Classification head | 3 FC layers + dropout | 1×1 conv + global avg pool |
| Overfitting risk | High | Lower |
| Key innovation | Block-based design | 1×1 conv + global avg pool |

!["VGG-Nin"](./Images/8/VGG-NiN.png)

#### Legacy

- **1×1 convolutions** became ubiquitous: GoogLeNet's Inception module, ResNet's bottleneck blocks, and squeeze-and-excitation networks all rely on them
- **Global average pooling** replaced FC layers in almost all modern architectures
- NiN showed that the classification head doesn't need to be the parameter bottleneck

In [29]:
import torch
from torch import nn
from d2l import torch as d2l

In [30]:
def nin_block(out_channels, kernel_size, strides, padding):
    return nn.Sequential(
        nn.LazyConv2d(out_channels, kernel_size, strides, padding), nn.ReLU(),
        nn.LazyConv2d(out_channels, kernel_size=1), nn.ReLU(),
        nn.LazyConv2d(out_channels, kernel_size=1), nn.ReLU())

In [None]:
class NiN(d2l.Classifier):
    def __init__(self, lr=0.1, num_classes=10):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.Sequential(
            nin_block(96, kernel_size=11, strides=4, padding=0),
            nn.MaxPool2d(3, stride=2),
            nin_block(256, kernel_size=5, strides=1, padding=2),
            nn.MaxPool2d(3, stride=2),
            nin_block(384, kernel_size=3, strides=1, padding=1),
            nn.MaxPool2d(3, stride=2),
            nn.Dropout(0.5),  # p=0.5 assumed (PyTorch's default rate)
            nin_block(num_classes, kernel_size=3, strides=1, padding=1),
            nn.AdaptiveAvgPool2d((1, 1)),
            nn.Flatten())
        self.net.apply(d2l.init_cnn)

In [32]:
NiN().layer_summary((1, 1, 224, 224))

Sequential output shape:	 torch.Size([1, 96, 54, 54])
MaxPool2d output shape:	 torch.Size([1, 96, 26, 26])
Sequential output shape:	 torch.Size([1, 256, 26, 26])
MaxPool2d output shape:	 torch.Size([1, 256, 12, 12])
Sequential output shape:	 torch.Size([1, 384, 12, 12])
MaxPool2d output shape:	 torch.Size([1, 384, 5, 5])
Dropout output shape:	 torch.Size([1, 384, 5, 5])
Sequential output shape:	 torch.Size([1, 10, 5, 5])
AdaptiveAvgPool2d output shape:	 torch.Size([1, 10, 1, 1])
Flatten output shape:	 torch.Size([1, 10])


In [None]:
model = NiN(lr=0.05)
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128, resize=(224, 224))
model.apply_init([next(iter(data.get_dataloader(True)))[0]], d2l.init_cnn)
trainer.fit(model, data)

### Multi-Branch Networks (GoogLeNet / Inception)

#### The Problem: Which Convolution Size to Use?

Previous architectures made a **fixed choice** of kernel size per layer (AlexNet: 11×11, 5×5, 3×3; VGG: all 3×3). But different features exist at different scales:
- **Small patterns** (edges, textures) → small kernels (1×1, 3×3)
- **Medium patterns** (parts, shapes) → medium kernels (5×5)
- **Large patterns** (objects) → larger receptive fields

**GoogLeNet's answer**: Don't choose — **use them all in parallel!**

#### The Inception Block

The core building block of GoogLeNet processes input through **4 parallel paths** and concatenates results along the channel dimension:

```
                        Input
           ┌──────┬──────┼──────────┐
           │      │      │          │
        Path 1  Path 2  Path 3   Path 4
        1×1     1×1     1×1     MaxPool
        conv    conv    conv      3×3
           │      │      │          │
           │    3×3    5×5       1×1
           │    conv   conv      conv
           │      │      │          │
           └──────┴──────┴──────────┘
                        │
                   Concatenate
                  (along channels)
```

| Path | Operations | Purpose |
|------|-----------|---------|
| **Path 1** | 1×1 conv | Capture single-pixel features |
| **Path 2** | 1×1 conv → 3×3 conv | Capture local spatial features |
| **Path 3** | 1×1 conv → 5×5 conv | Capture wider spatial features |
| **Path 4** | 3×3 MaxPool → 1×1 conv | Capture pooled features |

##### Why 1×1 Convolutions Before 3×3 and 5×5?

The 1×1 convolutions act as **channel-dimension bottlenecks** to reduce computation:

| Without 1×1 bottleneck | With 1×1 bottleneck |
|------------------------|---------------------|
| Input: 192 channels | Input: 192 channels |
| 5×5 conv → 32 channels | 1×1 conv → 16 channels → 5×5 conv → 32 channels |
| FLOPs: $192 \times 32 \times 5^2 = 153{,}600$ per pixel | FLOPs: $192 \times 16 + 16 \times 32 \times 25 = 15{,}872$ per pixel |
| — | **~10× fewer operations!** |

This is the **bottleneck design pattern** — compress channels first, then apply expensive operations.
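
The numbers in the table can be reproduced in a few lines. The sketch counts multiply-accumulate operations per output pixel (the quantity the table labels FLOPs) and ignores biases and activations; `conv_macs_per_pixel` is an illustrative helper.

```python
def conv_macs_per_pixel(c_in, c_out, kernel):
    """Multiply-accumulates needed to produce one output pixel across all channels."""
    return c_in * c_out * kernel * kernel

direct = conv_macs_per_pixel(192, 32, 5)
bottleneck = conv_macs_per_pixel(192, 16, 1) + conv_macs_per_pixel(16, 32, 5)
print(direct, bottleneck, f"{direct / bottleneck:.1f}x")
```

The saving grows with the number of input channels, since the expensive 5×5 convolution only ever sees the compressed 16-channel tensor.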

#### GoogLeNet Architecture

GoogLeNet (22 layers deep) is organized into stages:

```
Input: 224×224×3
    ↓
Stage 1: Conv 7×7/2, MaxPool 3×3/2          → 56×56×64
Stage 2: Conv 1×1, Conv 3×3, MaxPool 3×3/2  → 28×28×192
    ↓
Stage 3: Inception(3a) + Inception(3b) + MaxPool  → 14×14×480
Stage 4: Inception(4a–4e) + MaxPool                → 7×7×832
Stage 5: Inception(5a) + Inception(5b)             → 7×7×1024
    ↓
Global Average Pooling → 1024
    ↓
Linear → 1000 classes
```

**Key**: No FC layers except the final classifier (following NiN's insight)!

!["GoogLeNet"](./Images/8/GoogLeNet.png)



#### Inception Block Channel Configuration

Each Inception block has carefully tuned channel counts:

| Block | Path 1 (1×1) | Path 2 (1×1→3×3) | Path 3 (1×1→5×5) | Path 4 (Pool→1×1) | Output |
|-------|-------------|-------------------|-------------------|-------------------|--------|
| 3a | 64 | 96→128 | 16→32 | 32 | 256 |
| 3b | 128 | 128→192 | 32→96 | 64 | 480 |
| 4a | 192 | 96→208 | 16→48 | 64 | 512 |
| 4e | 256 | 160→320 | 32→128 | 128 | 832 |
| 5b | 384 | 192→384 | 48→128 | 128 | 1024 |

Output channels = sum of all path outputs (concatenated).
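
The `Output` column can be checked from the path configurations, using the same argument layout as the `Inception` class below (a count for path 1, pairs for paths 2 and 3, a count for path 4). The helper name is illustrative.

```python
# (path 1, (path 2 reduce, path 2 out), (path 3 reduce, path 3 out), path 4)
blocks = {
    "3a": (64, (96, 128), (16, 32), 32),
    "3b": (128, (128, 192), (32, 96), 64),
    "4a": (192, (96, 208), (16, 48), 64),
    "4e": (256, (160, 320), (32, 128), 128),
    "5b": (384, (192, 384), (48, 128), 128),
}

def inception_out_channels(c1, c2, c3, c4):
    # Only the last layer of each path contributes to the concatenation
    return c1 + c2[1] + c3[1] + c4

for name, cfg in blocks.items():
    print(name, inception_out_channels(*cfg))
```

The 1×1 reduction widths (the first number in each pair) affect cost but not the output width.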

#### Key Innovations

1. **Multi-scale feature extraction**: Parallel paths with different kernel sizes capture features at multiple scales simultaneously
2. **1×1 bottleneck convolutions**: Dramatically reduce computation by compressing channels before expensive operations
3. **Global average pooling**: Following NiN, eliminates expensive FC layers
4. **22 layers deep**: Much deeper than VGG, yet far fewer parameters (roughly 5–7M depending on the source, vs VGG-16's 138M)
5. **Auxiliary classifiers** (training only): Added at intermediate layers to combat vanishing gradients in the deep network

#### GoogLeNet vs Previous Architectures

| Aspect | AlexNet | VGG-16 | NiN | GoogLeNet |
|--------|---------|--------|-----|-----------|
| Depth | 8 | 16 | 12 | 22 |
| Parameters | ~60M | ~138M | ~2M | **~5–7M** |
| FC layers | 3 | 3 | 0 | 0 (+ 1 linear) |
| Key idea | Deep CNN | Block design | 1×1 conv | Multi-branch |
| Top-5 error | 15.3% | 7.3% | — | **6.7%** |
| Kernel sizes | Mixed | Only 3×3 | Mixed | **All in parallel** |

#### Inception Versions

The original Inception module evolved through several versions:

| Version | Paper | Key Change |
|---------|-------|------------|
| **v1** | GoogLeNet (2014) | Original 4-path design |
| **v2** | Ioffe & Szegedy (2015) | Added Batch Normalization |
| **v3** | Szegedy et al. (2016) | Factorized convolutions (e.g., 5×5 → two 3×3) |
| **v4** | Szegedy et al. (2017) | Streamlined Inception-v4, plus Inception-ResNet variants that add residual connections |

#### Legacy

- **Multi-branch/parallel design** influenced many architectures (ResNeXt, Xception)
- **1×1 bottleneck** became standard in ResNet and beyond
- Proved that **smart architecture design** can achieve better results with far fewer parameters
- Showed the value of **letting the network decide** which scale is important (by using all scales)

In [1]:
import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l

In [2]:
class Inception(nn.Module):
    # c1--c4 are the number of output channels for each branch
    def __init__(self, c1, c2, c3, c4, **kwargs):
        super(Inception, self).__init__(**kwargs)
        # Branch 1
        self.b1_1 = nn.LazyConv2d(c1, kernel_size=1)
        # Branch 2
        self.b2_1 = nn.LazyConv2d(c2[0], kernel_size=1)
        self.b2_2 = nn.LazyConv2d(c2[1], kernel_size=3, padding=1)
        # Branch 3
        self.b3_1 = nn.LazyConv2d(c3[0], kernel_size=1)
        self.b3_2 = nn.LazyConv2d(c3[1], kernel_size=5, padding=2)
        # Branch 4
        self.b4_1 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
        self.b4_2 = nn.LazyConv2d(c4, kernel_size=1)

    def forward(self, x):
        b1 = F.relu(self.b1_1(x))
        b2 = F.relu(self.b2_2(F.relu(self.b2_1(x))))
        b3 = F.relu(self.b3_2(F.relu(self.b3_1(x))))
        b4 = F.relu(self.b4_2(self.b4_1(x)))
        return torch.cat((b1, b2, b3, b4), dim=1)

In [3]:
class GoogleNet(d2l.Classifier):
    def b1(self):
        return nn.Sequential(
            nn.LazyConv2d(64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(), nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

In [4]:
@d2l.add_to_class(GoogleNet)
def b2(self):
    return nn.Sequential(
        nn.LazyConv2d(64, kernel_size=1), nn.ReLU(),
        nn.LazyConv2d(192, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

In [5]:
@d2l.add_to_class(GoogleNet)
def b3(self):
    return nn.Sequential(Inception(64, (96, 128), (16, 32), 32),
                         Inception(128, (128, 192), (32, 96), 64),
                         nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

In [6]:
@d2l.add_to_class(GoogleNet)
def b4(self):
    return nn.Sequential(Inception(192, (96, 208), (16, 48), 64),
                         Inception(160, (112, 224), (24, 64), 64),
                         Inception(128, (128, 256), (24, 64), 64),
                         Inception(112, (144, 288), (32, 64), 64),
                         Inception(256, (160, 320), (32, 128), 128),
                         nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

In [7]:
@d2l.add_to_class(GoogleNet)
def b5(self):
    return nn.Sequential(Inception(256, (160, 320), (32, 128), 128),
                         Inception(384, (192, 384), (48, 128), 128),
                         nn.AdaptiveAvgPool2d((1,1)), nn.Flatten())

In [8]:
@d2l.add_to_class(GoogleNet)
def __init__(self, lr=0.1, num_classes=10):
    super(GoogleNet, self).__init__()
    self.save_hyperparameters()
    self.net = nn.Sequential(self.b1(), self.b2(), self.b3(), self.b4(),
                             self.b5(), nn.LazyLinear(num_classes))
    self.net.apply(d2l.init_cnn)

In [9]:
GoogleNet().layer_summary((1, 1, 96, 96))

Sequential output shape:	 torch.Size([1, 64, 24, 24])
Sequential output shape:	 torch.Size([1, 192, 12, 12])
Sequential output shape:	 torch.Size([1, 480, 6, 6])
Sequential output shape:	 torch.Size([1, 832, 3, 3])
Sequential output shape:	 torch.Size([1, 1024])
Linear output shape:	 torch.Size([1, 10])




In [None]:
model = GoogleNet(lr=0.01)
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128, resize=(96, 96))
model.apply_init([next(iter(data.get_dataloader(True)))[0]], d2l.init_cnn)
trainer.fit(model, data)

### Batch Normalization

#### Why Batch Normalization?

Training deep networks is hard — intermediate layer activations can drift in magnitude across layers, units, and over time (*internal covariate shift*). Batch normalization (Ioffe & Szegedy, 2015) addresses this by **normalizing activations within each minibatch**, providing three key benefits:

1. **Preprocessing / Numerical Stability** — keeps intermediate values well-scaled, enabling more aggressive learning rates.
2. **Faster Convergence** — actively centers and rescales activations back to a controlled mean and variance.
3. **Regularization** — noise from minibatch statistics acts as implicit regularization (similar in spirit to dropout).

#### The Batch Normalization Formula

For a minibatch $\mathcal{B}$ and input $\mathbf{x} \in \mathcal{B}$:

$$\text{BN}(\mathbf{x}) = \boldsymbol{\gamma} \odot \frac{\mathbf{x} - \hat{\boldsymbol{\mu}}_\mathcal{B}}{\hat{\boldsymbol{\sigma}}_\mathcal{B}} + \boldsymbol{\beta}$$

where:

$$\hat{\boldsymbol{\mu}}_\mathcal{B} = \frac{1}{|\mathcal{B}|} \sum_{\mathbf{x} \in \mathcal{B}} \mathbf{x}, \qquad \hat{\boldsymbol{\sigma}}_\mathcal{B}^2 = \frac{1}{|\mathcal{B}|} \sum_{\mathbf{x} \in \mathcal{B}} (\mathbf{x} - \hat{\boldsymbol{\mu}}_\mathcal{B})^2 + \epsilon$$

- $\boldsymbol{\gamma}$ (scale) and $\boldsymbol{\beta}$ (shift) are **learnable parameters** that restore representational power after normalization.
- $\epsilon > 0$ prevents division by zero.
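As a worked instance of the formula, the sketch below normalizes a minibatch of four scalars (a single feature) in plain Python. The helper name `batch_norm_1d` and the toy values are assumptions made for illustration; $\epsilon$ is added to the variance before the square root, as in the definition above.

```python
import math

def batch_norm_1d(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Apply the BN formula to a minibatch of scalars (one feature)."""
    n = len(xs)
    mu = sum(xs) / n                                 # minibatch mean
    var = sum((x - mu) ** 2 for x in xs) / n + eps   # biased variance plus eps
    sigma = math.sqrt(var)
    return [gamma * (x - mu) / sigma + beta for x in xs]

batch = [1.0, 2.0, 3.0, 4.0]
normalized = batch_norm_1d(batch)
mean_out = sum(normalized) / len(normalized)
var_out = sum((v - mean_out) ** 2 for v in normalized) / len(normalized)

# With gamma = 2 and beta = 3 the output is rescaled and shifted again.
shifted = batch_norm_1d(batch, gamma=2.0, beta=3.0)
mean_shifted = sum(shifted) / len(shifted)
print(mean_out, var_out, mean_shifted)
```

The normalized values have mean close to 0 and variance close to 1; the learnable $\gamma$ and $\beta$ then move the output to whatever scale and offset training finds useful.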

#### Batch Norm in Different Layer Types

| Layer Type | Normalization Dimension | Details |
|---|---|---|
| **Fully Connected** | Per-feature across the batch | Applied after affine transform, before activation: $\mathbf{h} = \phi(\text{BN}(\mathbf{Wx} + \mathbf{b}))$ |
| **Convolutional** | Per-channel across batch & spatial dims | Each channel gets its own $\gamma$ and $\beta$; normalization computed over batch × height × width |
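The two rows of the table differ only in which axes the statistics are pooled over. The following sketch, with tensor shapes and the random seed chosen arbitrarily, computes both cases and cross-checks the convolutional one against PyTorch's `nn.BatchNorm2d` in training mode.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Fully connected case: X has shape (batch, features); one statistic per feature.
X_fc = torch.randn(8, 5)
fc_mean = X_fc.mean(dim=0)                                    # shape (5,)

# Convolutional case: X has shape (batch, channels, height, width);
# one statistic per channel, pooled over batch and spatial positions.
X_conv = torch.randn(8, 3, 4, 4)
conv_mean = X_conv.mean(dim=(0, 2, 3), keepdim=True)          # shape (1, 3, 1, 1)
conv_var = ((X_conv - conv_mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)
manual = (X_conv - conv_mean) / torch.sqrt(conv_var + 1e-5)

# Built-in layer in training mode; gamma = 1 and beta = 0 at initialization.
bn = nn.BatchNorm2d(3, eps=1e-5)
bn.train()
builtin = bn(X_conv)
print(fc_mean.shape, conv_mean.shape, torch.allclose(manual, builtin, atol=1e-5))
```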

#### Training vs. Prediction

- **Training mode:** Normalizes using **minibatch** mean and variance (introduces beneficial noise).
- **Prediction mode:** Uses **running estimates** (exponential moving averages) of the full dataset mean and variance for deterministic output.
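The running estimates can be illustrated with a few toy minibatches. This is a minimal sketch in plain Python that uses the same update convention as the `batch_norm` function in these notes (new = (1 - momentum) * old + momentum * batch statistic); the helper names and data are hypothetical.

```python
def update_running(running, batch_stat, momentum=0.1):
    """Exponential moving average of a minibatch statistic."""
    return (1.0 - momentum) * running + momentum * batch_stat

running_mean, running_var = 0.0, 1.0             # initial values: mean 0, variance 1
batches = [[4.0, 6.0], [5.0, 7.0], [3.0, 9.0]]   # toy minibatches of one feature

# Training mode: each minibatch contributes its own mean and variance.
for xs in batches:
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    running_mean = update_running(running_mean, mu)
    running_var = update_running(running_var, var)

def predict(x, eps=1e-5):
    """Prediction mode: normalize with the fixed running estimates."""
    return (x - running_mean) / (running_var + eps) ** 0.5

print(running_mean, running_var, predict(2.0))
```

Because `predict` no longer depends on the other examples in the batch, the same input always yields the same output.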

#### Key Practical Notes

- Batch size matters more with BN — moderate sizes (50–100) inject the "right amount" of noise for regularization.
- A minibatch of size 1 is useless for fully connected layers: subtracting the mean sets every hidden unit to 0, so nothing can be learned.
- BN allows higher learning rates and reduces sensitivity to initialization.
- Combined with residual connections (ResNet), BN enabled training networks with 100+ layers.
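The batch-size-1 failure is easy to reproduce. The sketch below applies per-feature normalization (with $\gamma = 1$, $\beta = 0$) to a list of rows; `bn_features` and the sample values are assumptions for illustration only.

```python
def bn_features(batch, eps=1e-5):
    """Per-feature batch normalization (gamma=1, beta=0) for a list of rows."""
    n = len(batch)
    num_features = len(batch[0])
    out = [[0.0] * num_features for _ in range(n)]
    for j in range(num_features):
        col = [row[j] for row in batch]
        mu = sum(col) / n
        var = sum((v - mu) ** 2 for v in col) / n
        for i in range(n):
            out[i][j] = (batch[i][j] - mu) / (var + eps) ** 0.5
    return out

single = bn_features([[0.3, -1.2, 5.0]])                   # one example only
pair = bn_features([[0.3, -1.2, 5.0], [1.3, 0.8, 1.0]])    # two examples
print(single, pair)
```

With a single example each feature equals its own mean, so every normalized value is 0 regardless of the input; with two or more examples the outputs carry information again.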

#### Controversy & Alternatives

- The original "internal covariate shift" explanation is debated; BN may work primarily by **smoothing the loss landscape** (Santurkar et al., 2018).
- **Layer Normalization** (Ba et al., 2016) normalizes across features instead of batch — useful for RNNs and Transformers where batch stats are unreliable.
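The difference between the two normalizations comes down to the axis over which statistics are taken. A small sketch, assuming a toy 2×3 input and omitting the learnable scale and shift, compares them and checks the layer-norm result against `torch.nn.functional.layer_norm`.

```python
import torch
from torch.nn import functional as F

X = torch.tensor([[1.0, 2.0, 3.0],
                  [10.0, 20.0, 30.0]])   # 2 examples, 3 features
eps = 1e-5

def normalize(X, dim):
    """Zero-mean, unit-variance normalization along `dim` (no scale or shift)."""
    mean = X.mean(dim=dim, keepdim=True)
    var = ((X - mean) ** 2).mean(dim=dim, keepdim=True)
    return (X - mean) / torch.sqrt(var + eps)

batch_normed = normalize(X, dim=0)   # statistics per feature, across the batch
layer_normed = normalize(X, dim=1)   # statistics per example, across features

# Layer norm gives the same result for an example whether or not
# other examples are present in the batch.
single = normalize(X[:1], dim=1)
reference = F.layer_norm(X, normalized_shape=(3,), eps=eps)
print(batch_normed, layer_normed)
```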

In [1]:
import torch
from torch import nn
from d2l import torch as d2l

In [None]:
def batch_norm(X, gamma, beta, moving_mean, moving_var, eps, momentum):
    # Use is_grad_enabled to determine whether we are in training mode
    if not torch.is_grad_enabled():
        # In prediction mode, use mean and variance obtained by moving average
        X_hat = (X - moving_mean) / torch.sqrt(moving_var + eps)
    else:
        assert len(X.shape) in (2, 4)
        if len(X.shape) == 2:
            # When using a fully connected layer, calculate the mean and
            # variance on the feature dimension
            mean = X.mean(dim=0)
            var = ((X - mean) ** 2).mean(dim=0)
        else:
            # When using a two-dimensional convolutional layer, calculate the
            # mean and variance on the channel dimension (axis=1). Here we
            # need to maintain the shape of X, so that the broadcasting
            # operation can be carried out later
            mean = X.mean(dim=(0, 2, 3), keepdim=True)
            var = ((X - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)
        # In training mode, the current mean and variance are used
        X_hat = (X - mean) / torch.sqrt(var + eps)
        # Update the mean and variance using moving average
        moving_mean = (1.0 - momentum) * moving_mean + momentum * mean
        moving_var = (1.0 - momentum) * moving_var + momentum * var

    # gamma shape: (1, num_features) or (1, num_features(channels), 1, 1)
    # beta shape: (1, num_features) or (1, num_features(channels), 1, 1)
    # broadcasting will automatically expand the shape of gamma and beta
    Y = gamma * X_hat + beta  # Scale and shift
    return Y, moving_mean.data, moving_var.data

In [None]:
class BatchNorm(nn.Module):
    # num_features: the number of outputs for a fully connected layer or the
    # number of output channels for a convolutional layer. num_dims: 2 for a
    # fully connected layer and 4 for a convolutional layer
    def __init__(self, num_features, num_dims):
        super().__init__()
        if num_dims == 2:
            shape = (1, num_features)
        else:
            shape = (1, num_features, 1, 1) # here num_features means channels, using a generic name
        # The scale parameter and the shift parameter (model parameters) are
        # initialized to 1 and 0, respectively
        self.gamma = nn.Parameter(torch.ones(shape))
        self.beta = nn.Parameter(torch.zeros(shape))
        # The variables that are not model parameters are initialized to 0 and 1
        self.moving_mean = torch.zeros(shape)
        self.moving_var = torch.ones(shape)

    def forward(self, X):
        # If X is not on the main memory, copy moving_mean and moving_var to
        # the device where X is located
        if self.moving_mean.device != X.device:
            self.moving_mean = self.moving_mean.to(X.device)
            self.moving_var = self.moving_var.to(X.device)
        # Save the updated moving_mean and moving_var
        Y, self.moving_mean, self.moving_var = batch_norm(
            X, self.gamma, self.beta, self.moving_mean,
            self.moving_var, eps=1e-5, momentum=0.1)
        return Y

In [4]:
class BNLeNetScratch(d2l.Classifier):
    def __init__(self, lr=0.1, num_classes=10):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.Sequential(
            nn.LazyConv2d(6, kernel_size=5), BatchNorm(6, num_dims=4),
            nn.Sigmoid(), nn.AvgPool2d(kernel_size=2, stride=2),
            nn.LazyConv2d(16, kernel_size=5), BatchNorm(16, num_dims=4),
            nn.Sigmoid(), nn.AvgPool2d(kernel_size=2, stride=2),
            nn.Flatten(), nn.LazyLinear(120),
            BatchNorm(120, num_dims=2), nn.Sigmoid(), nn.LazyLinear(84),
            BatchNorm(84, num_dims=2), nn.Sigmoid(),
            nn.LazyLinear(num_classes))

In [None]:
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128)
model = BNLeNetScratch(lr=0.1)
model.apply_init([next(iter(data.get_dataloader(True)))[0]], d2l.init_cnn)
trainer.fit(model, data)

In [None]:
model.net[1].gamma.reshape((-1,)), model.net[1].beta.reshape((-1,))

In [None]:
# concise implementation of batch normalization
class BNLeNet(d2l.Classifier):
    def __init__(self, lr=0.1, num_classes=10):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.Sequential(
            nn.LazyConv2d(6, kernel_size=5), nn.LazyBatchNorm2d(),
            nn.Sigmoid(), nn.AvgPool2d(kernel_size=2, stride=2),
            nn.LazyConv2d(16, kernel_size=5), nn.LazyBatchNorm2d(),
            nn.Sigmoid(), nn.AvgPool2d(kernel_size=2, stride=2),
            nn.Flatten(), nn.LazyLinear(120), nn.LazyBatchNorm1d(),
            nn.Sigmoid(), nn.LazyLinear(84), nn.LazyBatchNorm1d(),
            nn.Sigmoid(), nn.LazyLinear(num_classes))

In [None]:
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128)
model = BNLeNet(lr=0.1)
model.apply_init([next(iter(data.get_dataloader(True)))[0]], d2l.init_cnn)
trainer.fit(model, data)

### Residual Networks (ResNet) and ResNeXt

#### 1. Motivation: Nested Function Classes

- Let $\mathcal{F}$ be the class of functions a network architecture can represent. We seek $f^*_\mathcal{F} = \arg\min_{f \in \mathcal{F}} L(\mathbf{X}, \mathbf{y}, f)$.
- If we move to a more powerful class $\mathcal{F}'$, there is **no guarantee** $f^*_{\mathcal{F}'}$ is better — unless $\mathcal{F} \subseteq \mathcal{F}'$ (**nested** function classes).
- **Key insight:** If every added layer can easily learn the **identity mapping** $f(\mathbf{x}) = \mathbf{x}$, the new model is *at least* as good as the old one — strictly nesting the function classes.

#### 2. Residual Blocks

- In a standard block the network must learn $f(\mathbf{x})$ directly. In a **residual block** the network learns the *residual* $g(\mathbf{x}) = f(\mathbf{x}) - \mathbf{x}$, then computes $f(\mathbf{x}) = g(\mathbf{x}) + \mathbf{x}$.
- Learning $g(\mathbf{x}) = 0$ (identity) is easy — just push weights/biases toward zero.
- The direct path $\mathbf{x} \to$ addition is called a **residual connection** (shortcut connection).
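To see why the identity is easy to reach, the sketch below zeroes the weights and bias of a stand-in residual branch and checks that the block passes its input through unchanged. The single convolution used for $g$ and the non-negative random input are simplifying assumptions; the trailing ReLU mirrors the block's output activation and is a no-op only because the input is non-negative.

```python
import torch
from torch import nn
from torch.nn import functional as F

torch.manual_seed(0)

# g(x): a single 3x3 convolution standing in for the residual branch.
g = nn.Conv2d(3, 3, kernel_size=3, padding=1)
nn.init.zeros_(g.weight)   # push the weights to zero ...
nn.init.zeros_(g.bias)     # ... and the bias too, so g(x) = 0

X = torch.rand(2, 3, 6, 6)           # values in [0, 1), so ReLU leaves them unchanged
with torch.no_grad():
    out = F.relu(g(X) + X)           # f(x) = g(x) + x, followed by ReLU

print(torch.equal(out, X))
```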

**Architecture of a residual block (PyTorch):**

| Component | Details |
|---|---|
| Conv → BN → ReLU | $3 \times 3$ conv, same #channels |
| Conv → BN | $3 \times 3$ conv, same #channels |
| Shortcut | Identity (or $1 \times 1$ conv if channels/resolution change) |
| Output | ReLU( shortcut + conv output ) |

- When changing #channels or halving spatial resolution (stride = 2), a **$1 \times 1$ convolution** on the shortcut aligns dimensions.

!["ResNet block"](./Images/8/ResNetBlock.png)

#### 3. ResNet Model Architecture

| Stage | Layers | Output Size |
|---|---|---|
| **Stem (b1)** | $7 \times 7$ Conv (64), stride 2 → BN → ReLU → $3 \times 3$ MaxPool, stride 2 | $56 \times 56$ |
| **Stage 2 (b2)** | 2 residual blocks, 64 channels | $56 \times 56$ |
| **Stage 3 (b3)** | 2 residual blocks, 128 channels (first block stride 2) | $28 \times 28$ |
| **Stage 4 (b4)** | 2 residual blocks, 256 channels (first block stride 2) | $14 \times 14$ |
| **Stage 5 (b5)** | 2 residual blocks, 512 channels (first block stride 2) | $7 \times 7$ |
| **Head** | Global Average Pool → Dense (num_classes) | $1 \times 1$ |

- Each stage doubles channels and halves spatial resolution (except b2).
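- This is the **ResNet-18** configuration: 1 stem convolution + 4 stages × 2 blocks × 2 convolutions + 1 dense layer = 18 weight layers (the $1 \times 1$ shortcut convolutions are not counted).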
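The output sizes in the table and the layer count can be checked with a few lines of arithmetic. This sketch assumes a $224 \times 224$ input and the kernel, stride, and padding values used in the stem and residual blocks of these notes; `out_size` is a hypothetical helper.

```python
def out_size(n, kernel, stride, padding):
    """Spatial size after a convolution or pooling layer (floor convention)."""
    return (n + 2 * padding - kernel) // stride + 1

size = 224
size = out_size(size, kernel=7, stride=2, padding=3)   # stem convolution
size = out_size(size, kernel=3, stride=2, padding=1)   # stem max-pooling
sizes = [size]                                         # b2 keeps this resolution
for _ in range(3):                                     # b3, b4, b5 each use stride 2 once
    size = out_size(size, kernel=3, stride=2, padding=1)
    sizes.append(size)

# Weight layers: 1 stem conv + (4 stages x 2 blocks x 2 convs) + 1 dense layer.
num_weight_layers = 1 + 4 * 2 * 2 + 1
print(sizes, num_weight_layers)   # [56, 28, 14, 7] 18
```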

!["ResNet-18"](./Images/8/ResNet-18.png)

#### 4. ResNeXt: Grouped Convolutions

ResNeXt extends ResNet by replacing the inner $3 \times 3$ convolution with **grouped convolutions**:

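$$\mathbf{y} = \left[\mathbf{W}_1 \mathbf{x}_1,\; \mathbf{W}_2 \mathbf{x}_2,\; \ldots,\; \mathbf{W}_g \mathbf{x}_g\right]$$

- Input channels are split into $g$ **groups** $\mathbf{x}_1, \ldots, \mathbf{x}_g$; each group is transformed independently by its own weights $\mathbf{W}_i$, and the results are concatenated along the channel dimension.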
- A $1 \times 1$ conv first reduces channels to `bot_channels`, a grouped $3 \times 3$ conv follows, and another $1 \times 1$ conv expands back — a **bottleneck** design.
- This increases the number of *paths* (cardinality) through the block while keeping parameter count manageable.
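The parameter saving from grouping follows directly from the weight shape: each output channel only sees the input channels in its own group. The sketch below counts weights for an ordinary and a grouped $3 \times 3$ convolution; the channel count of 128 and the choice of 32 groups are illustrative assumptions.

```python
def conv_weights(c_in, c_out, kernel, groups=1):
    """Number of weights (biases ignored) in a 2D convolution with `groups` groups."""
    assert c_in % groups == 0 and c_out % groups == 0
    return c_out * (c_in // groups) * kernel * kernel

dense = conv_weights(128, 128, kernel=3)                # ordinary 3x3 convolution
grouped = conv_weights(128, 128, kernel=3, groups=32)   # 32 groups of 4 channels
print(dense, grouped, dense // grouped)   # 147456 4608 32
```

Splitting into $g$ groups divides the weight count of that layer by $g$, which is what lets ResNeXt add more paths without a matching growth in parameters.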

#### 5. Key Takeaways

- **Residual connections** address the degradation problem: extra blocks can fall back to the identity, so adding depth no longer has to hurt training accuracy.
- Learning the **residual** $g(\mathbf{x}) = f(\mathbf{x}) - \mathbf{x}$ is easier than learning $f(\mathbf{x})$ directly, especially near the identity.
- ResNet won **ImageNet 2015** and profoundly influenced subsequent architectures (Transformers, GNNs, etc.).
- **ResNeXt** shows that increasing *cardinality* (number of groups) is more effective than increasing depth or width alone.
- Batch normalization + residual connections together enabled training networks with **100+ layers**.

In [1]:
import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l

In [2]:
class Residual(nn.Module):  #@save
    """The Residual block of ResNet models."""
    def __init__(self, num_channels, use_1x1conv=False, strides=1):
        super().__init__()
        self.conv1 = nn.LazyConv2d(num_channels, kernel_size=3, padding=1, stride=strides)
        self.conv2 = nn.LazyConv2d(num_channels, kernel_size=3, padding=1)
        if use_1x1conv:
            self.conv3 = nn.LazyConv2d(num_channels, kernel_size=1, stride=strides)
        else:
            self.conv3 = None
        self.bn1 = nn.LazyBatchNorm2d()
        self.bn2 = nn.LazyBatchNorm2d()

    def forward(self, X):
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3:
            X = self.conv3(X)
        Y += X
        return F.relu(Y)

In [3]:
blk = Residual(3)
X = torch.randn(4, 3, 6, 6)
blk(X).shape



torch.Size([4, 3, 6, 6])

In [4]:
blk = Residual(6, use_1x1conv=True, strides=2)
blk(X).shape

torch.Size([4, 6, 3, 3])

In [5]:
class ResNet(d2l.Classifier):
    def b1(self):
        return nn.Sequential(
            nn.LazyConv2d(64, kernel_size=7, stride=2, padding=3),
            nn.LazyBatchNorm2d(), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

In [6]:
@d2l.add_to_class(ResNet)
def block(self, num_residuals, num_channels, first_block=False):
    blk = []
    for i in range(num_residuals):
        if i == 0 and not first_block:
            blk.append(Residual(num_channels, use_1x1conv=True, strides=2))
        else:
            blk.append(Residual(num_channels))
    return nn.Sequential(*blk)

In [None]:
@d2l.add_to_class(ResNet)
def __init__(self, arch, lr=0.1, num_classes=10):
    super(ResNet, self).__init__()
    self.save_hyperparameters()
    self.net = nn.Sequential(self.b1())
    for i, b in enumerate(arch):
        self.net.add_module(f'b{i+2}', self.block(*b, first_block=(i==0)))
    self.net.add_module('last', nn.Sequential(
        nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten(),
        nn.LazyLinear(num_classes)))
    self.net.apply(d2l.init_cnn)

In [None]:
class ResNet18(ResNet):
    def __init__(self, lr=0.1, num_classes=10):
        super().__init__(((2, 64), (2, 128), (2, 256), (2, 512)),
                       lr, num_classes)

ResNet18().layer_summary((1, 1, 96, 96))

In [None]:
model = ResNet18(lr=0.01)
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128, resize=(96, 96))
model.apply_init([next(iter(data.get_dataloader(True)))[0]], d2l.init_cnn)
trainer.fit(model, data)

In [None]:
class ResNeXtBlock(nn.Module):  #@save
    """The ResNeXt block."""
    def __init__(self, num_channels, groups, bot_mul, use_1x1conv=False,
                 strides=1):
        super().__init__()
        bot_channels = int(round(num_channels * bot_mul))
        self.conv1 = nn.LazyConv2d(bot_channels, kernel_size=1, stride=1)
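        # Note: the `groups` argument is used as the number of channels per
        # group, so the convolution itself has bot_channels // groups groups
        self.conv2 = nn.LazyConv2d(bot_channels, kernel_size=3,
                                   stride=strides, padding=1,
                                   groups=bot_channels//groups)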
        self.conv3 = nn.LazyConv2d(num_channels, kernel_size=1, stride=1)
        self.bn1 = nn.LazyBatchNorm2d()
        self.bn2 = nn.LazyBatchNorm2d()
        self.bn3 = nn.LazyBatchNorm2d()
        if use_1x1conv:
            self.conv4 = nn.LazyConv2d(num_channels, kernel_size=1,
                                       stride=strides)
            self.bn4 = nn.LazyBatchNorm2d()
        else:
            self.conv4 = None

    def forward(self, X):
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = F.relu(self.bn2(self.conv2(Y)))
        Y = self.bn3(self.conv3(Y))
        if self.conv4:
            X = self.bn4(self.conv4(X))
        return F.relu(Y + X)

In [None]:
blk = ResNeXtBlock(32, 16, 1)
X = torch.randn(4, 32, 96, 96)
blk(X).shape

### Densely Connected Networks (DenseNet)

#### 1. From ResNet to DenseNet

- **ResNet** decomposes functions as $f(\mathbf{x}) = \mathbf{x} + g(\mathbf{x})$ — the input is **added** to the residual.
- **DenseNet** goes further: instead of adding, it **concatenates** the outputs of all preceding layers:

$$
\mathbf{x} \to \left[\mathbf{x},\; f_1(\mathbf{x}),\; f_2([\mathbf{x}, f_1(\mathbf{x})]),\; f_3([\mathbf{x}, f_1(\mathbf{x}), f_2(\cdot)]),\; \ldots \right]
$$

- This creates **dense connections**: every layer receives feature maps from *all* previous layers, encouraging **feature reuse** and strengthening gradient flow.
- The name "DenseNet" reflects the dense dependency graph between layers.

#### 2. Dense Blocks

Each **dense block** contains multiple convolution blocks with structure: **BN → ReLU → 3×3 Conv**.

- Each conv block produces a fixed number of output channels (the **growth rate**, e.g., 10).
- Input and output of each block are **concatenated** along the channel dimension.
- After $n$ conv blocks with growth rate $k$ and input channels $c_0$, the output has $c_0 + n \cdot k$ channels.

```python
def conv_block(num_channels):
    return nn.Sequential(
        nn.LazyBatchNorm2d(), nn.ReLU(),
        nn.LazyConv2d(num_channels, kernel_size=3, padding=1))

class DenseBlock(nn.Module):
    def __init__(self, num_convs, num_channels):
        super().__init__()
        layer = []
        for i in range(num_convs):
            layer.append(conv_block(num_channels))
        self.net = nn.Sequential(*layer)

    def forward(self, X):
        for blk in self.net:
            Y = blk(X)
            X = torch.cat((X, Y), dim=1)  # Concatenate along channels
        return X
```

**Example:** 2 conv blocks with growth rate 10 on 3-channel input → output has $3 + 10 + 10 = 23$ channels.

#### 3. Transition Layers

Since dense blocks continuously **increase** the number of channels, **transition layers** are used to reduce complexity:

| Component | Purpose |
|---|---|
| BN → ReLU | Normalization and activation |
| $1 \times 1$ Conv | Reduce number of channels |
| $2 \times 2$ AvgPool (stride 2) | Halve spatial resolution |

```python
def transition_block(num_channels):
    return nn.Sequential(
        nn.LazyBatchNorm2d(), nn.ReLU(),
        nn.LazyConv2d(num_channels, kernel_size=1),
        nn.AvgPool2d(kernel_size=2, stride=2))
```

#### 4. DenseNet Model Architecture

DenseNet follows a structure similar to ResNet:

| Stage | Details |
|---|---|
| **Stem** | $7 \times 7$ Conv (64), stride 2 → BN → ReLU → $3 \times 3$ MaxPool, stride 2 |
| **Body** | Alternating **Dense Blocks** and **Transition Layers** (4 dense blocks with growth rate 32; block sizes: 4, 4, 4, 4 conv layers) |
| **Head** | BN → ReLU → Global AvgPool → Flatten → Dense (num_classes) |

- Each transition layer halves both channels and spatial resolution.
- The initial number of channels is 64; growth rate is typically 32.
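The channel counts through the body can be traced without building the network. This sketch mirrors the bookkeeping in the `DenseNet.__init__` cell of these notes using the default values stated above; the function name `densenet_channels` is hypothetical.

```python
def densenet_channels(num_channels=64, growth_rate=32, arch=(4, 4, 4, 4)):
    """Track channel counts after each dense block and transition layer."""
    trace = []
    for i, num_convs in enumerate(arch):
        num_channels += num_convs * growth_rate    # dense block concatenates outputs
        trace.append((f"dense_blk{i+1}", num_channels))
        if i != len(arch) - 1:
            num_channels //= 2                     # transition layer halves channels
            trace.append((f"tran_blk{i+1}", num_channels))
    return trace

trace = densenet_channels()
for name, channels in trace:
    print(name, channels)
```

With the defaults, the final dense block outputs 248 channels, which is what the head's global average pooling and dense layer receive.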

#### 5. Key Takeaways

- **Concatenation vs. Addition:** DenseNet concatenates features from all preceding layers rather than adding residuals, preserving and reusing earlier features.
- **Feature reuse:** Each layer has direct access to gradients and feature maps from all previous layers, which encourages feature reuse and improves gradient flow.
- **Parameter efficiency:** Despite dense connections, DenseNet can be more parameter-efficient than ResNet because each layer produces only a narrow set of feature maps (controlled by the growth rate).
- **Transition layers** are essential to keep the model size manageable by compressing channels between dense blocks.
- **Growth rate** $k$ controls how much new information each layer contributes — a key hyperparameter of DenseNet.

In [9]:
import torch
from torch import nn
from d2l import torch as d2l

In [10]:
def conv_block(num_channels):
    return nn.Sequential(
        nn.LazyBatchNorm2d(), nn.ReLU(),
        nn.LazyConv2d(num_channels, kernel_size=3, padding=1))

In [11]:
class DenseBlock(nn.Module):
    def __init__(self, num_convs, num_channels):
        super(DenseBlock, self).__init__()
        layer = []
        for i in range(num_convs):
            layer.append(conv_block(num_channels))
        self.net = nn.Sequential(*layer)

    def forward(self, X):
        for blk in self.net:
            Y = blk(X)
            # Concatenate input and output of each block along the channels
            X = torch.cat((X, Y), dim=1)
        return X

In [12]:
blk = DenseBlock(2, 10)
X = torch.randn(4, 3, 8, 8)
Y = blk(X)
Y.shape

torch.Size([4, 23, 8, 8])

In [13]:
def transition_block(num_channels):
    return nn.Sequential(
        nn.LazyBatchNorm2d(), nn.ReLU(),
        nn.LazyConv2d(num_channels, kernel_size=1),
        nn.AvgPool2d(kernel_size=2, stride=2))

In [14]:
blk = transition_block(10)
blk(Y).shape

torch.Size([4, 10, 4, 4])

In [None]:
class DenseNet(d2l.Classifier):
    def b1(self):
        return nn.Sequential(
            nn.LazyConv2d(64, kernel_size=7, stride=2, padding=3),
            nn.LazyBatchNorm2d(), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

In [16]:
@d2l.add_to_class(DenseNet)
def __init__(self, num_channels=64, growth_rate=32, arch=(4, 4, 4, 4),
             lr=0.1, num_classes=10):
    super(DenseNet, self).__init__()
    self.save_hyperparameters()
    self.net = nn.Sequential(self.b1())
    for i, num_convs in enumerate(arch):
        self.net.add_module(f'dense_blk{i+1}', DenseBlock(num_convs,
                                                          growth_rate))
        # The number of output channels in the previous dense block
        num_channels += num_convs * growth_rate
        # A transition layer that halves the number of channels is added
        # between the dense blocks
        if i != len(arch) - 1:
            num_channels //= 2
            self.net.add_module(f'tran_blk{i+1}', transition_block(
                num_channels))
    self.net.add_module('last', nn.Sequential(
        nn.LazyBatchNorm2d(), nn.ReLU(),
        nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten(),
        nn.LazyLinear(num_classes)))
    self.net.apply(d2l.init_cnn)

In [None]:
model = DenseNet(lr=0.01)
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128, resize=(96, 96))
trainer.fit(model, data)

### Designing Convolution Network Architectures

#### 1. Motivation

- Previous architectures (VGG, ResNet, DenseNet, etc.) were designed largely by **intuition and trial-and-error**.
- The goal is to develop **systematic design principles** that can guide architecture construction — moving from hand-crafted designs to a principled **design space**.
- The key reference is the **RegNet** approach (Radosavovic et al., 2020), which progressively narrows a broad design space using empirical analysis.

#### 2. The AnyNet Design Space

The starting point is a generic network template with three parts:

| Component | Description |
|---|---|
| **Stem** | Initial processing: $3 \times 3$ Conv (stride 2) to reduce resolution |
| **Body** | $n$ stages, each with $d_i$ residual blocks of width $w_i$, group width $g_i$, bottleneck ratio $b_i$ |
| **Head** | Global Average Pooling → Fully Connected layer for classification |

- Each stage uses **ResNeXt-style blocks** (grouped convolutions with bottleneck).
- The first block of each stage (except the first) uses stride 2 to halve resolution.
- This is called **AnyNet** because each stage can have independent hyperparameters $(d_i, w_i, g_i, b_i)$.

#### 3. Design Principles (Narrowing the Space)

By training and evaluating many random configurations, several design constraints emerge:

| Constraint | Insight | Resulting Design |
|---|---|---|
| **Shared bottleneck ratio** $b_i = b$ | Performance is roughly the same regardless of per-stage $b_i$ — fix $b = 1$ | AnyNetA → AnyNetB |
| **Shared group width** $g_i = g$ | Per-stage group widths don't help — share a single $g$ | AnyNetB → AnyNetC |
| **Increasing widths** $w_i \leq w_{i+1}$ | Networks with non-decreasing widths across stages perform better | AnyNetC → AnyNetD |
| **Increasing depths** $d_i \leq d_{i+1}$ | Deeper later stages tend to improve performance | AnyNetD → AnyNetE |

Each constraint **reduces the search space** without sacrificing (and often improving) the distribution of good models.
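The four constraints can be expressed as simple predicates over a list of per-stage hyperparameters. The sketch below is only a way to make the table concrete; the `Stage` tuple, the function name, and the sample configurations are assumptions and do not come from the RegNet code.

```python
from typing import NamedTuple, Sequence

class Stage(NamedTuple):
    depth: int         # d_i: number of blocks in the stage
    width: int         # w_i: number of channels
    group_width: int   # g_i: channels per group
    bot_mul: float     # b_i: bottleneck ratio

def satisfied_constraints(stages: Sequence[Stage]) -> dict:
    """Report which of the four design constraints a configuration meets."""
    pairs = list(zip(stages, stages[1:]))
    return {
        "shared_bottleneck": len({s.bot_mul for s in stages}) == 1,
        "shared_group_width": len({s.group_width for s in stages}) == 1,
        "nondecreasing_width": all(a.width <= b.width for a, b in pairs),
        "nondecreasing_depth": all(a.depth <= b.depth for a, b in pairs),
    }

candidate = [Stage(1, 32, 16, 1.0), Stage(2, 80, 16, 1.0), Stage(4, 160, 16, 1.0)]
rejected = [Stage(4, 160, 16, 1.0), Stage(2, 80, 8, 0.5)]
print(satisfied_constraints(candidate))
print(satisfied_constraints(rejected))
```

A random search restricted to configurations that pass all four checks samples from a much smaller space than unconstrained AnyNet.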

#### 4. RegNet: Parameterizing the Design Space

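A final observation is that the per-stage widths of good networks grow roughly geometrically, so the **log-widths** $\log(w_i)$ are approximately **linear** in the stage index $i$:

$$w_i \approx w_0 \cdot w_m^i$$

Radosavovic et al. (2020) obtain these widths by first assigning each block $j$ a linear width $u_j = w_0 + w_a \cdot j$ and then quantizing $u_j$ to $w_0$ times the nearest power of $w_m$, where:
- $w_0$: initial width
- $w_a$: width slope (additive growth per block)
- $w_m$: width multiplier (geometric step between stages)

This reduces the entire body configuration to just **6 parameters**: total depth $d$, $w_0$, $w_a$, $w_m$, group width $g$, and bottleneck ratio $b$.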

The resulting family of networks is called **RegNet**. By quantizing the continuous widths to valid channel counts, we recover a compact, high-performing design space.
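The quantization step can be sketched in plain Python. It follows the procedure described by Radosavovic et al. (2020), in which each block $j$ first receives a linear width $w_0 + w_a j$ that is then snapped to a power of a width multiplier. The function name `regnet_widths`, the multiplier argument `w_m`, the rounding to multiples of 8, and the input values are assumptions chosen for illustration.

```python
import math

def regnet_widths(w_0, w_a, w_m, depth, quantum=8):
    """Per-stage widths and depths from a linear width rule quantized to powers of w_m."""
    widths = []
    for j in range(depth):
        u_j = w_0 + w_a * j                                  # continuous per-block width
        s_j = round(math.log(u_j / w_0) / math.log(w_m))     # nearest power of w_m
        w_j = w_0 * w_m ** s_j
        widths.append(int(round(w_j / quantum)) * quantum)   # snap to a multiple of quantum
    stage_widths = sorted(set(widths))                       # blocks of equal width form a stage
    stage_depths = [widths.count(w) for w in stage_widths]
    return stage_widths, stage_depths

stage_widths, stage_depths = regnet_widths(w_0=24, w_a=36.44, w_m=2.49, depth=13)
print(stage_widths, stage_depths)   # [24, 56, 152, 368] [1, 1, 4, 7]
```

Consecutive blocks that round to the same width are merged into one stage, so the number of stages and their depths fall out of the parameters instead of being chosen separately.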

#### 5. Implementation Highlights

```python
class AnyNet(d2l.Classifier):
    def stem(self, num_channels):
        return nn.Sequential(
            nn.LazyConv2d(num_channels, kernel_size=3, stride=2, padding=1),
            nn.LazyBatchNorm2d(), nn.ReLU())
```

- Each **stage** is built from `ResNeXtBlock`s with specified depth, width, group width, and bottleneck ratio.
- The **RegNet** subclass computes per-stage widths using the linear log-width formula and quantizes them.

#### 6. Key Takeaways

- **Design spaces over architectures:** Instead of searching for one optimal network, define a *space* of good networks and characterize it.
- **Simplicity wins:** Sharing hyperparameters across stages (bottleneck ratio, group width) and enforcing monotonic widths/depths loses nothing but greatly simplifies the search.
- **Linear parameterization:** The log-widths of well-performing networks are approximately linear across stages — a powerful inductive bias captured by RegNet.
- **Efficiency:** RegNet models achieve competitive accuracy with fewer FLOPs and faster inference than many hand-designed architectures.
- **Generalizable methodology:** The approach of progressively constraining a design space via empirical analysis can be applied beyond CNNs to other architecture families.