#### Deep Convolutional Neural Networks (AlexNet)

#### Historical Context: The AI Winter Ends (2012)

Although CNNs had existed since LeNet (1995), they did not dominate computer vision until **AlexNet won the ImageNet challenge in 2012** by a large margin. Before this breakthrough:

- **Traditional pipelines** dominated: Hand-crafted features (SIFT, SURF, HOG) → Linear classifiers
- **Neural networks** were outperformed by kernel methods, ensemble methods, and SVMs
- **Features were engineered**, not learned

##### What Was Missing?

| Missing Ingredient | Problem | Solution by 2012 |
|-------------------|---------|------------------|
| **Data** | Small datasets (thousands of images) | ImageNet: 1.2M images, 1000 classes |
| **Compute** | CPUs too slow for deep networks | GPUs: far higher throughput than CPUs on convolutions and matrix multiplies |
| **Algorithms** | Training instabilities | ReLU, Dropout, better initialization |

#### Representation Learning

The key insight: **features should be learned, not hand-designed**.

- Traditional CV: pixels → SIFT/SURF → classifier
- Deep Learning: pixels → learned features (hierarchical) → classifier

AlexNet showed that learned features outperform hand-crafted ones when given enough data and compute.

#### AlexNet Architecture (2012)

AlexNet is an evolutionary improvement over LeNet with crucial differences:

| Aspect | LeNet-5 (1998) | AlexNet (2012) |
|--------|----------------|----------------|
| **Depth** | 5 layers | 8 layers |
| **Input size** | 28×28 (32×32 in the original paper) | 224×224 |
| **Activation** | Sigmoid | ReLU |
| **Regularization** | None | Dropout (0.5) |
| **Data augmentation** | Minimal | Extensive |
| **Training hardware** | CPU | 2× NVIDIA GTX 580 GPUs |

##### Architecture Details

```
Input: 224×224×3
    ↓
Conv1: 96 filters, 11×11, stride 4, pad 1, ReLU → MaxPool 3×3, stride 2
    ↓  [54×54×96 → 26×26×96]
Conv2: 256 filters, 5×5, pad 2, ReLU → MaxPool 3×3, stride 2
    ↓  [26×26×256 → 12×12×256]
Conv3: 384 filters, 3×3, pad 1, ReLU
    ↓  [12×12×384]
Conv4: 384 filters, 3×3, pad 1, ReLU
    ↓  [12×12×384]
Conv5: 256 filters, 3×3, pad 1, ReLU → MaxPool 3×3, stride 2
    ↓  [12×12×256 → 5×5×256]
Flatten → 6400
    ↓
FC1: 4096, ReLU, Dropout(0.5)
    ↓
FC2: 4096, ReLU, Dropout(0.5)
    ↓
FC3: 1000 (classes)
```
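
The shapes in the diagram follow from the standard output-size formula $\lfloor (n + 2p - k)/s \rfloor + 1$. The sketch below traces them in plain Python. The helper names (`out_size`, `trace_sizes`) are illustrative, and `pad=1` for Conv1 is taken from the PyTorch implementation in this notebook rather than from the diagram.

```python
def out_size(n, kernel, stride=1, pad=0):
    """Output height/width of a conv or pooling layer (floor division)."""
    return (n + 2 * pad - kernel) // stride + 1

# (name, kernel, stride, pad) for every layer that acts on the spatial size.
# pad=1 on Conv1 is an assumption taken from the notebook's AlexNet class.
LAYERS = [
    ("Conv1", 11, 4, 1), ("Pool1", 3, 2, 0),
    ("Conv2", 5, 1, 2), ("Pool2", 3, 2, 0),
    ("Conv3", 3, 1, 1), ("Conv4", 3, 1, 1),
    ("Conv5", 3, 1, 1), ("Pool3", 3, 2, 0),
]

def trace_sizes(n=224):
    """Return {layer name: output height/width} for a square n x n input."""
    sizes = {}
    for name, kernel, stride, pad in LAYERS:
        n = out_size(n, kernel, stride, pad)
        sizes[name] = n
    return sizes

sizes = trace_sizes()
for name, n in sizes.items():
    print(f"{name}: {n}x{n}")

# 256 channels leave Conv5, so the flattened vector has 5 * 5 * 256 entries.
flatten = sizes["Pool3"] ** 2 * 256
print("Flatten:", flatten)
```

The final value matches the 6400 inputs expected by the first fully connected layer.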

#### Key Innovations

##### 1. ReLU Activation
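- **Problem with sigmoid**: Gradients vanish when the output saturates near 0 or 1
- **ReLU solution**: $f(x) = \max(0, x)$, whose gradient is exactly 1 for positive inputs (and 0 for negative inputs)
- **Benefit**: In the AlexNet paper, a small CNN with ReLU reached 25% training error on CIFAR-10 about 6× faster than the same network with tanh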
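
To make the saturation argument concrete, here is a minimal sketch that compares the two derivatives in plain Python. It looks only at the activation derivatives and ignores the weight matrices that also scale gradients in a real network; the depth of 8 and the pre-activation value 2.0 are arbitrary choices for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # peaks at 0.25 (x = 0) and tends to 0 as |x| grows

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

for x in (-6.0, 0.5, 6.0):
    print(f"x={x:+.1f}  sigmoid'={sigmoid_grad(x):.4f}  relu'={relu_grad(x):.1f}")

# Product of activation derivatives through 8 layers, each evaluated at x = 2.0.
depth = 8
sig_chain = sigmoid_grad(2.0) ** depth
relu_chain = relu_grad(2.0) ** depth
print(f"sigmoid chain: {sig_chain:.2e}, ReLU chain: {relu_chain:.1f}")
```

Even at its best the sigmoid derivative is 0.25, so the product shrinks geometrically with depth, while active ReLU units pass the gradient through unchanged.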

##### 2. Dropout Regularization
- Zero each hidden unit in the first two FC layers independently with probability 0.5 during training
- Prevents co-adaptation of features
- Acts like training an ensemble of networks
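
The mechanism can be sketched in a few lines of plain Python. This version uses "inverted" dropout, which rescales the surviving activations by $1/(1-p)$ during training so that nothing needs to change at test time; this is the convention `nn.Dropout` follows. The function name and the list-based interface are illustrative only.

```python
import random

def dropout(values, p=0.5, training=True, rng=random):
    """Inverted dropout on a list of activations (illustrative sketch)."""
    assert 0.0 <= p < 1.0
    if not training or p == 0.0:
        return list(values)  # identity at evaluation time
    keep = 1.0 - p
    # Keep each unit with probability `keep` and rescale it; zero it otherwise.
    return [v / keep if rng.random() < keep else 0.0 for v in values]

rng = random.Random(0)
acts = [1.0] * 10_000
dropped = dropout(acts, p=0.5, training=True, rng=rng)

zero_frac = dropped.count(0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)
print(f"fraction zeroed: {zero_frac:.2f}, mean after dropout: {mean_after:.2f}")
print(dropout([1.0, 2.0], training=False))
```

About half of the units are zeroed on each call, yet the expected activation stays the same, which is why the network can be evaluated without dropout afterwards.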

##### 3. Data Augmentation
- Random crops from 256×256 → 224×224
- Horizontal flips
- Color jittering (PCA-based)
- Makes model robust to variations
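
The first two augmentations are simple enough to sketch without any image library. The example below treats an image as a list of rows and is only meant to show the indexing. The function name is hypothetical, and the PCA-based color jitter is not covered.

```python
import random

def random_crop_flip(img, crop=224, rng=random):
    """Return a random crop x crop patch of img, mirrored half of the time.

    img is a list of rows (H x W); each entry can be any pixel value.
    """
    h, w = len(img), len(img[0])
    top = rng.randint(0, h - crop)    # randint is inclusive on both ends
    left = rng.randint(0, w - crop)
    patch = [row[left:left + crop] for row in img[top:top + crop]]
    if rng.random() < 0.5:
        patch = [row[::-1] for row in patch]  # horizontal flip
    return patch

rng = random.Random(0)
# A 256x256 "image" whose pixels record their own (row, col) coordinates.
image = [[(r, c) for c in range(256)] for r in range(256)]
patch = random_crop_flip(image, rng=rng)
print(len(patch), len(patch[0]))
```

In a PyTorch pipeline the same effect is usually obtained with `transforms.RandomCrop(224)` and `transforms.RandomHorizontalFlip()` from torchvision.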

##### 4. GPU Training
- Split model across 2 GPUs (3GB each)
- Enabled training on large ImageNet dataset
- Paradigm shift: Deep learning became GPU-bound

In [None]:
import torch
from torch import nn
from d2l import torch as d2l

In [None]:
# AlexNet Implementation
class AlexNet(nn.Module):
    """AlexNet-style network for single-channel 224x224 input (e.g. resized Fashion-MNIST)."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            # Conv Block 1: large 11x11 kernel with stride 4 cuts resolution quickly.
            # in_channels=1 because Fashion-MNIST is grayscale (ImageNet would need 3).
            nn.Conv2d(1, 96, kernel_size=11, stride=4, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),
            # Conv Block 2
            nn.Conv2d(96, 256, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),
            # Conv Blocks 3-5: Smaller 3x3 kernels
            nn.Conv2d(256, 384, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(384, 384, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),
            # Classifier
            nn.Flatten(),
            nn.Linear(6400, 4096),
            nn.ReLU(),
            nn.Dropout(p=0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(),
            nn.Dropout(p=0.5),
            nn.Linear(4096, num_classes)
        )
    
    def forward(self, x):
        return self.net(x)

# Initialize weights using Xavier initialization
def init_weights(m):
    if isinstance(m, (nn.Linear, nn.Conv2d)):
        nn.init.xavier_uniform_(m.weight)

model = AlexNet()
model.apply(init_weights)
print("AlexNet Architecture:")
print(model)

In [None]:
# Inspect layer-by-layer output shapes
def layer_summary(model, input_shape):
    """Print output shape at each layer."""
    X = torch.randn(*input_shape)
    print(f"{'Layer':<15} {'Output Shape':<25} {'Params':<10}")
    print("=" * 55)
    for layer in model.net:
        X = layer(X)
        # Layers without weights (ReLU, pooling, Flatten, Dropout) contribute 0
        params = sum(p.numel() for p in layer.parameters())
        print(f"{layer.__class__.__name__:<15} {str(X.shape):<25} {params:,}")

# AlexNet expects 224x224 input; this variant uses 1 channel for Fashion-MNIST
# Note: For ImageNet, input would be (1, 3, 224, 224) and the first Conv2d would need in_channels=3
print("AlexNet layer shapes for 224x224 single-channel input:\n")
layer_summary(model, (1, 1, 224, 224))

# Total parameters
total_params = sum(p.numel() for p in model.parameters())
print(f"\nTotal parameters: {total_params:,}")

### Training AlexNet on Fashion-MNIST

Since ImageNet is large (~150GB), we'll train on Fashion-MNIST resized to 224×224. This demonstrates the architecture while being computationally feasible.

In [None]:
import torchvision
from torchvision import transforms
from torch.utils.data import DataLoader

# Resize Fashion-MNIST to 224x224 (AlexNet's expected input size)
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

train_dataset = torchvision.datasets.FashionMNIST(
    root='./data', train=True, download=True, transform=transform)
test_dataset = torchvision.datasets.FashionMNIST(
    root='./data', train=False, download=True, transform=transform)

# Smaller batch size due to larger images
batch_size = 128
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

print(f"Training samples: {len(train_dataset)}")
print(f"Test samples: {len(test_dataset)}")
print(f"Image shape: {train_dataset[0][0].shape}")  # Should be (1, 224, 224)

In [None]:
# Training functions
def train_epoch(model, train_loader, criterion, optimizer, device):
    model.train()
    total_loss, correct, total = 0, 0, 0
    for X, y in train_loader:
        X, y = X.to(device), y.to(device)
        optimizer.zero_grad()
        y_hat = model(X)
        loss = criterion(y_hat, y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * X.size(0)
        correct += (y_hat.argmax(dim=1) == y).sum().item()
        total += X.size(0)
    return total_loss / total, correct / total

def evaluate(model, test_loader, criterion, device):
    model.eval()
    total_loss, correct, total = 0, 0, 0
    with torch.no_grad():
        for X, y in test_loader:
            X, y = X.to(device), y.to(device)
            y_hat = model(X)
            loss = criterion(y_hat, y)
            total_loss += loss.item() * X.size(0)
            correct += (y_hat.argmax(dim=1) == y).sum().item()
            total += X.size(0)
    return total_loss / total, correct / total

# Setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Training on: {device}")

model = AlexNet(num_classes=10).to(device)
model.apply(init_weights)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Train (fewer epochs due to computational cost)
num_epochs = 10
print(f"\n{'Epoch':<8}{'Train Loss':<12}{'Train Acc':<12}{'Test Loss':<12}{'Test Acc':<12}")
print("=" * 56)

for epoch in range(num_epochs):
    train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer, device)
    test_loss, test_acc = evaluate(model, test_loader, criterion, device)
    print(f"{epoch+1:<8}{train_loss:<12.4f}{train_acc:<12.4f}{test_loss:<12.4f}{test_acc:<12.4f}")

### Summary: AlexNet

#### Why AlexNet Matters

AlexNet (2012) marked the beginning of the deep learning revolution in computer vision:

| Impact | Before AlexNet | After AlexNet |
|--------|---------------|---------------|
| **Features** | Hand-crafted (SIFT, HOG) | Learned automatically |
| **ImageNet accuracy** | ~74% (top-5) | 84.7% (top-5), a gain of about 11 percentage points |
| **Dominant approach** | SVMs, ensemble methods | Deep neural networks |
| **Hardware** | CPUs | GPUs become essential |

#### Key Architectural Innovations

1. **Deeper network**: 8 layers (5 conv + 3 FC) vs LeNet's 5 layers
2. **ReLU activation**: Faster training and less gradient vanishing than sigmoid or tanh
3. **Dropout**: Regularization for large FC layers
4. **Overlapping pooling**: 3×3 windows with stride 2
5. **Local Response Normalization** (LRN): Rarely used today, replaced by BatchNorm
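
"Overlapping" pooling simply means the window is wider than the stride, so neighbouring windows share inputs. A small 1-D sketch makes this visible; the helper `pool_windows` is illustrative and only lists which input indices each window covers.

```python
def pool_windows(n, kernel, stride):
    """Input indices covered by each 1-D pooling window on a length-n input."""
    return [list(range(start, start + kernel))
            for start in range(0, n - kernel + 1, stride)]

overlapping = pool_windows(7, kernel=3, stride=2)      # AlexNet-style 3/2 pooling
non_overlapping = pool_windows(6, kernel=2, stride=2)  # classic 2/2 pooling

print("3-wide, stride 2:", overlapping)
print("2-wide, stride 2:", non_overlapping)
```

With a 3-wide window and stride 2, indices 2 and 4 each appear in two windows; with a 2-wide window every index appears exactly once.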

#### Parameter Count Breakdown

| Layer Type | Parameters | % of Total |
|------------|------------|------------|
| Conv layers | ~2.3M | 4% |
| FC layers | ~58.6M | 96% |
| **Total** | ~60.9M | 100% |

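These figures are for the original 1000-class ImageNet model. The single-channel, 10-class variant implemented above has about 46.8M parameters, mainly because its first FC layer takes 6400 rather than 9216 inputs and its output layer has 10 rather than 1000 units. Either way the FC layers dominate, which motivated later architectures (NiN, GoogLeNet) to reduce FC parameters.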
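
The same breakdown can be reproduced by hand for the model defined in this notebook. The sketch below uses the layer sizes from the `AlexNet` class above (1 input channel, 6400 flattened features, 10 classes), so its totals differ from the table, which describes the original ImageNet network. The helper names are illustrative.

```python
def conv_params(c_in, c_out, k):
    """Weights plus biases of a k x k convolution."""
    return c_in * c_out * k * k + c_out

def fc_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# (in_channels, out_channels, kernel) as in the notebook's AlexNet class
convs = [(1, 96, 11), (96, 256, 5), (256, 384, 3), (384, 384, 3), (384, 256, 3)]
# (in_features, out_features) of the three linear layers
fcs = [(6400, 4096), (4096, 4096), (4096, 10)]

conv_total = sum(conv_params(*c) for c in convs)
fc_total = sum(fc_params(*f) for f in fcs)
total = conv_total + fc_total

print(f"conv:  {conv_total:,} ({conv_total / total:.0%})")
print(f"fc:    {fc_total:,} ({fc_total / total:.0%})")
print(f"total: {total:,}")
```

The total should agree with the `Total parameters` line printed by `layer_summary` earlier, and the FC share stays above 90% in this variant as well.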

#### Legacy

AlexNet established the modern CNN template:
- **Conv-ReLU-Pool** blocks for feature extraction
- **FC layers** for classification
- **Dropout** for regularization
- **GPU training** as standard practice

This paved the way for VGG, GoogLeNet, ResNet, and all modern architectures.

## 8.2 Networks Using Blocks (VGG)

### The Block Design Philosophy

While AlexNet showed deep CNNs work, it didn't provide a **template** for designing new networks. VGG (Visual Geometry Group, Oxford, 2014) introduced a key insight:

> **Design networks using repeating blocks of layers, not individual layers.**

This mirrors chip design evolution: transistors → logic gates → logic blocks. Similarly:
- **Early CNNs**: Design each layer individually (AlexNet)
- **VGG onwards**: Design blocks, then stack them
- **Modern**: Use entire pretrained models (foundation models)

### Why 3×3 Convolutions?

VGG's key finding: **Deep and narrow beats shallow and wide**.

#### Receptive Field Equivalence

Two 3×3 convolutions have the same receptive field as one 5×5 convolution:

```
Single 5×5:     Two 3×3 stacked:
  □□□□□           □□□ → □□□
  □□□□□           □□□   □□□
  □□□□□           □□□   □□□
  □□□□□
  □□□□□
```

Both "see" a 5×5 region of the input!
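
For stride-1 convolutions the receptive field grows by $k - 1$ with every additional $k \times k$ layer, which gives a one-line check of the equivalence. This is a sketch under that stride-1 assumption; pooling or strided layers would need a more general formula.

```python
def receptive_field(kernel_sizes):
    """Receptive field (one side) of stacked stride-1 convolutions."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1  # each layer extends the view by k - 1 pixels
    return rf

print("one 5x5:  ", receptive_field([5]))
print("two 3x3:  ", receptive_field([3, 3]))
print("three 3x3:", receptive_field([3, 3, 3]))
```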

#### Parameter Comparison

For $c$ input and output channels (ignoring biases):

| Configuration | Receptive field | Parameters | Relative to single large conv |
|---------------|-----------------|------------|-------------------------------|
| One 5×5 conv | 5×5 | $25c^2$ | 1.0× |
| Two 3×3 convs | 5×5 | $2 \times 9c^2 = 18c^2$ | 0.72× |
| One 7×7 conv | 7×7 | $49c^2$ | 1.0× |
| Three 3×3 convs | 7×7 | $3 \times 9c^2 = 27c^2$ | 0.55× |

**Benefit**: More layers = more nonlinearities + fewer parameters!
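
The ratios are easy to verify numerically. The sketch below counts weights only (no biases) and picks `c = 256` purely as an example; the ratios do not depend on `c`.

```python
def stacked_conv_weights(kernel, num_layers, c):
    """Weights in num_layers stacked kernel x kernel convs with c channels in and out."""
    return num_layers * kernel * kernel * c * c

c = 256  # example channel count; any positive value gives the same ratios
one_5x5 = stacked_conv_weights(5, 1, c)
two_3x3 = stacked_conv_weights(3, 2, c)
one_7x7 = stacked_conv_weights(7, 1, c)
three_3x3 = stacked_conv_weights(3, 3, c)

ratio_5 = two_3x3 / one_5x5
ratio_7 = three_3x3 / one_7x7
print(f"two 3x3 vs one 5x5:   {ratio_5:.2f}x")
print(f"three 3x3 vs one 7x7: {ratio_7:.2f}x")
```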

### VGG Block Structure

A VGG block consists of:
1. Multiple 3×3 convolutions (with padding=1 to preserve size)
2. ReLU after each convolution
3. 2×2 max pooling with stride 2 (halves dimensions)

```
VGG Block(num_convs, out_channels):
    for i in range(num_convs):
        Conv2d(3×3, padding=1) → ReLU
    MaxPool2d(2×2, stride=2)
```

### VGG Network Variants

VGG defines a **family** of networks by varying the architecture:

| Model | Architecture | Conv Layers | Parameters |
|-------|-------------|-------------|------------|
| VGG-11 | (1,64)-(1,128)-(2,256)-(2,512)-(2,512) | 8 | 133M |
| VGG-13 | (2,64)-(2,128)-(2,256)-(2,512)-(2,512) | 10 | 133M |
| VGG-16 | (2,64)-(2,128)-(3,256)-(3,512)-(3,512) | 13 | 138M |
| VGG-19 | (2,64)-(2,128)-(4,256)-(4,512)-(4,512) | 16 | 144M |

Format: (num_convs, channels) per block
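
The table can be cross-checked directly from the architecture tuples. The sketch below assumes the standard ImageNet setting (3 input channels, 224×224 images, 1000 classes) and the 4096-4096-1000 classifier; `vgg_stats` is an illustrative helper, not part of any library.

```python
VGG_ARCHS = {
    "VGG-11": ((1, 64), (1, 128), (2, 256), (2, 512), (2, 512)),
    "VGG-13": ((2, 64), (2, 128), (2, 256), (2, 512), (2, 512)),
    "VGG-16": ((2, 64), (2, 128), (3, 256), (3, 512), (3, 512)),
    "VGG-19": ((2, 64), (2, 128), (4, 256), (4, 512), (4, 512)),
}

def vgg_stats(arch, in_channels=3, image_size=224, num_classes=1000):
    """Return (number of conv layers, total parameters) for a VGG-style arch."""
    conv_layers, params = 0, 0
    c_in, size = in_channels, image_size
    for num_convs, c_out in arch:
        for _ in range(num_convs):
            params += 3 * 3 * c_in * c_out + c_out  # 3x3 weights + biases
            c_in = c_out
            conv_layers += 1
        size //= 2                                   # 2x2 max pool, stride 2
    n_in = c_in * size * size                        # flattened feature size
    for n_out in (4096, 4096, num_classes):
        params += n_in * n_out + n_out
        n_in = n_out
    return conv_layers, params

stats = {name: vgg_stats(arch) for name, arch in VGG_ARCHS.items()}
for name, (convs, params) in stats.items():
    print(f"{name}: {convs} conv layers, {params / 1e6:.1f}M parameters")
```

The number in each model name counts weight layers, that is, the conv layers plus the three fully connected layers.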

### VGG-11 Architecture Diagram

```
Input: 224×224×3
    ↓
Block 1: 1×Conv(64) + Pool → 112×112×64
    ↓
Block 2: 1×Conv(128) + Pool → 56×56×128
    ↓
Block 3: 2×Conv(256) + Pool → 28×28×256
    ↓
Block 4: 2×Conv(512) + Pool → 14×14×512
    ↓
Block 5: 2×Conv(512) + Pool → 7×7×512
    ↓
Flatten → 25088
    ↓
FC(4096) → ReLU → Dropout(0.5)
    ↓
FC(4096) → ReLU → Dropout(0.5)
    ↓
FC(1000)
```

In [None]:
# VGG Block: the fundamental building unit
def vgg_block(num_convs, out_channels):
    """
    VGG block: multiple 3x3 convolutions followed by max pooling.
    
    Args:
        num_convs: Number of convolutional layers in the block
        out_channels: Number of output channels
    
    Returns:
        nn.Sequential containing the block
    """
    layers = []
    for _ in range(num_convs):
        layers.append(nn.LazyConv2d(out_channels, kernel_size=3, padding=1))
        layers.append(nn.ReLU())
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# Test a single VGG block
block = vgg_block(num_convs=2, out_channels=128)
X = torch.randn(1, 64, 224, 224)
print(f"Input shape:  {X.shape}")
print(f"Output shape: {block(X).shape}")  # Height/width halved, channels changed to 128

In [None]:
# VGG Network: stack VGG blocks + FC classifier
class VGG(nn.Module):
    """
    VGG Network - configurable via architecture tuple.
    
    Args:
        arch: Tuple of (num_convs, out_channels) for each block
        num_classes: Number of output classes
    """
    def __init__(self, arch, num_classes=10):
        super().__init__()
        conv_blks = []
        for (num_convs, out_channels) in arch:
            conv_blks.append(vgg_block(num_convs, out_channels))
        
        self.net = nn.Sequential(
            *conv_blks,
            nn.Flatten(),
            nn.LazyLinear(4096), nn.ReLU(), nn.Dropout(0.5),
            nn.LazyLinear(4096), nn.ReLU(), nn.Dropout(0.5),
            nn.LazyLinear(num_classes)
        )
    
    def forward(self, x):
        return self.net(x)

# VGG-11 architecture: (num_convs, channels) per block
vgg11_arch = ((1, 64), (1, 128), (2, 256), (2, 512), (2, 512))

model_vgg = VGG(arch=vgg11_arch)
print("VGG-11 Architecture:")
print(model_vgg)

In [None]:
# Inspect VGG-11 layer shapes
def vgg_layer_summary(model, input_shape):
    """Print output shape at each block/layer."""
    X = torch.randn(*input_shape)
    print(f"{'Block/Layer':<20} {'Output Shape':<25}")
    print("=" * 50)
    for i, layer in enumerate(model.net):
        X = layer(X)
        name = f"Block {i+1}" if isinstance(layer, nn.Sequential) else layer.__class__.__name__
        print(f"{name:<20} {str(X.shape):<25}")

print("VGG-11 layer shapes for 224x224 input:\n")
vgg_layer_summary(model_vgg, (1, 1, 224, 224))

# Total parameters (note: much larger than AlexNet!)
total_params = sum(p.numel() for p in model_vgg.parameters())
print(f"\nTotal parameters: {total_params:,}")

### Training VGG on Fashion-MNIST

VGG is more computationally demanding than AlexNet. For Fashion-MNIST, we use a **smaller version** with fewer channels to make training feasible.

In [None]:
# Smaller VGG for Fashion-MNIST (reduced channels)
small_vgg_arch = ((1, 16), (1, 32), (2, 64), (2, 128), (2, 128))

model_small_vgg = VGG(arch=small_vgg_arch, num_classes=10).to(device)

# Lazy modules have no weight shapes until the first forward pass,
# so run a dummy batch before applying Xavier initialization.
_ = model_small_vgg(torch.randn(1, 1, 224, 224, device=device))
model_small_vgg.apply(init_weights)

print("Small VGG Architecture (for Fashion-MNIST):")
print(f"Architecture: {small_vgg_arch}")
print(f"Device: {device}")

# Count parameters (valid now that the lazy modules are materialized)
total_params = sum(p.numel() for p in model_small_vgg.parameters())
print(f"Total parameters: {total_params:,}")

In [None]:
# Train small VGG on Fashion-MNIST (reuse data loaders from AlexNet section)
optimizer_vgg = torch.optim.SGD(model_small_vgg.parameters(), lr=0.05)

print("Training Small VGG on Fashion-MNIST:")
print(f"\n{'Epoch':<8}{'Train Loss':<12}{'Train Acc':<12}{'Test Loss':<12}{'Test Acc':<12}")
print("=" * 56)

for epoch in range(10):
    train_loss, train_acc = train_epoch(model_small_vgg, train_loader, criterion, optimizer_vgg, device)
    test_loss, test_acc = evaluate(model_small_vgg, test_loader, criterion, device)
    print(f"{epoch+1:<8}{train_loss:<12.4f}{train_acc:<12.4f}{test_loss:<12.4f}{test_acc:<12.4f}")

### Summary: VGG

#### Key Contributions

1. **Block-based design**: Popularized building networks from repeating blocks of layers
2. **3×3 convolutions everywhere**: Showed deep+narrow > shallow+wide
3. **Network families**: VGG-11, 13, 16, 19 provide speed-accuracy tradeoffs
4. **Simple and uniform**: Easy to implement, understand, and modify

#### VGG vs AlexNet

| Aspect | AlexNet | VGG-16 |
|--------|---------|--------|
| Design | Ad-hoc layers | Repeating blocks |
| Conv kernels | 11×11, 5×5, 3×3 | Only 3×3 |
| Depth | 8 layers | 16 layers |
| Parameters | ~60M | ~138M |
| Top-5 error | 15.3% | 7.3% |

#### The 3×3 Convolution Insight

VGG showed that **stacking small filters works better** than using large filters:
- Same receptive field
- More nonlinearities (more expressive)
- Fewer parameters
- Fast, well-optimized GPU implementations of 3×3 convolutions

Stacked 3×3 convolutions became a **standard** building block in most subsequent architectures (ResNet, Inception, etc.).

#### Limitations

1. **Very large FC layers**: The first FC layer maps 7×7×512 = 25,088 inputs to 4096 units, which alone takes ~103M parameters
2. **Computationally expensive**: Slow training and inference
3. **No skip connections**: Harder to train very deep versions

These limitations motivated NiN (no FC layers), GoogLeNet (inception modules), and ResNet (skip connections).

#### VGG Architecture Pattern

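```python
# The VGG pattern: blocks with increasing channels, decreasing spatial size.
# Each entry is (num_convs, out_channels); every block halves height and width.
# The values below are the VGG-11 configuration used earlier in this section.
arch = [
    (1, 64),    # Block 1: 224 -> 112
    (1, 128),   # Block 2: 112 -> 56
    (2, 256),   # Block 3: 56 -> 28
    (2, 512),   # Block 4: 28 -> 14
    (2, 512),   # Block 5: 14 -> 7
]
# Channels typically: 64 → 128 → 256 → 512 → 512
# Spatial: 224 → 112 → 56 → 28 → 14 → 7
```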