# Lab 03: CNN & Model Comparison

In this lab, we'll build a **Convolutional Neural Network (CNN)** - an architecture specifically designed for image data. We'll then compare all three models we've built across the labs.

**What we'll cover:**
1. Why CNNs are better for images
2. Understanding Conv2d and MaxPool2d layers
3. Building the TinyVGG architecture
4. Training the CNN
5. Comparing all three models
6. Creating a confusion matrix
7. Saving and loading the best model

## 1. Import Libraries

In [None]:
import torch
from torch import nn
import torchvision
from torchvision import datasets
from torchvision.transforms import ToTensor
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt
from timeit import default_timer as timer
import pandas as pd

print(f"PyTorch version: {torch.__version__}")
print(f"Torchvision version: {torchvision.__version__}")

## 2. Setup Device

We'll use CPU for this lab to keep things simple and ensure it runs on any machine.

In [None]:
# For this lab, we'll use CPU for training
# This keeps the lab simple and works on any machine
device = "cpu"
print(f"Using device: {device}")

## 3. Load the Dataset

In [None]:
# Load training and test data
train_data = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor(),
    target_transform=None
)

test_data = datasets.FashionMNIST(
    root="data",
    train=False,
    download=True,
    transform=ToTensor()
)

# Get class names from the dataset
class_names = train_data.classes

print(f"Training samples: {len(train_data)}")
print(f"Test samples: {len(test_data)}")
print(f"Classes: {class_names}")

## 4. Create DataLoaders

In [None]:
BATCH_SIZE = 32

train_dataloader = DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True)
test_dataloader = DataLoader(test_data, batch_size=BATCH_SIZE, shuffle=False)

print(f"Training batches: {len(train_dataloader)}")
print(f"Test batches: {len(test_dataloader)}")

## 5. Why CNNs for Images?

### Problems with Fully Connected Networks for Images

1. **Too many parameters**: For a flattened 28×28 image, the first layer alone needs 784 × hidden_units weights, and that count grows with image size
2. **No spatial awareness**: Flattening discards the 2D layout, so the network has no built-in notion of which pixels are neighbors
3. **Not translation invariant**: The same object shifted to a different position activates different inputs, so the network has to learn it separately for each position
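To make the first point concrete, here is a small sketch that counts learnable parameters by hand. The helper names `linear_params` and `conv2d_params` are hypothetical, and the sizes (10 hidden units, 10 filters of 3×3) are assumptions borrowed from the model built later in this lab:

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


def conv2d_params(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """Weights plus biases of a 2D convolution with a square kernel."""
    return in_channels * out_channels * kernel_size * kernel_size + out_channels


# First layer of a fully connected model on a flattened 28x28 image
fc_first_layer = linear_params(28 * 28, 10)

# First layer of a CNN: 10 filters of size 3x3 over 1 input channel
conv_first_layer = conv2d_params(1, 10, 3)

print(f"Fully connected first layer: {fc_first_layer} parameters")
print(f"3x3 conv first layer:        {conv_first_layer} parameters")
```

Note that the convolution's parameter count does not depend on the image size at all, because the same small filters are reused at every position.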

### CNN Solution

CNNs address these issues with:
- **Convolutional layers**: Learn local patterns using small filters
- **Parameter sharing**: Same filter is applied across the entire image
- **Pooling layers**: Reduce spatial dimensions while preserving important features

Let's explore these components!

## 6. Understanding Conv2d

A convolutional layer slides a small filter (kernel) across the input image:

```python
nn.Conv2d(
    in_channels=1,     # Grayscale input (1 channel)
    out_channels=10,   # Number of filters to learn
    kernel_size=3,     # 3×3 filter
    stride=1,          # Move 1 pixel at a time
    padding=1          # Add 1 pixel border to preserve size
)
```
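Why does `padding=1` preserve the size? For a dilation of 1, the output size along one spatial dimension is `floor((size + 2 * padding - kernel_size) / stride) + 1`. The sketch below applies that formula in plain Python; `conv_output_size` is a hypothetical helper name, not a PyTorch function:

```python
def conv_output_size(size: int, kernel_size: int, stride: int = 1, padding: int = 0) -> int:
    """Output size along one spatial dimension (dilation of 1 assumed)."""
    return (size + 2 * padding - kernel_size) // stride + 1


with_padding = conv_output_size(28, kernel_size=3, stride=1, padding=1)
without_padding = conv_output_size(28, kernel_size=3, stride=1, padding=0)

print(f"3x3 kernel, padding=1: 28 -> {with_padding}")
print(f"3x3 kernel, padding=0: 28 -> {without_padding}")
```

With a 3×3 kernel and no padding, each convolution would trim one pixel from every edge, so stacking several layers would steadily shrink the feature maps.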

In [None]:
# Demonstrate Conv2d
conv_layer = nn.Conv2d(in_channels=1, out_channels=10, kernel_size=3, stride=1, padding=1)

# Get a sample image
sample_image = train_data[0][0].unsqueeze(0)  # Add batch dimension: [1, 1, 28, 28]
print(f"Input shape: {sample_image.shape}")

# Pass through conv layer
with torch.inference_mode():
    conv_output = conv_layer(sample_image)
print(f"Output shape after Conv2d: {conv_output.shape}")
print("\nNote: 10 feature maps, each 28x28 (size preserved due to padding=1)")

## 7. Understanding MaxPool2d

Pooling reduces spatial dimensions by taking the maximum value in each window:

```python
nn.MaxPool2d(
    kernel_size=2,     # 2×2 window
    stride=2           # Move 2 pixels (reduces size by half)
)
```
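To see exactly what "taking the maximum value in each window" means, here is a minimal plain-Python sketch of 2×2 max pooling with stride 2. The function `max_pool_2x2` and the numbers in `feature_map` are made up for illustration; in practice `nn.MaxPool2d` does this on tensors:

```python
def max_pool_2x2(grid):
    """Apply 2x2 max pooling with stride 2 to a list of equal-length rows."""
    pooled = []
    for r in range(0, len(grid) - 1, 2):
        row = []
        for c in range(0, len(grid[0]) - 1, 2):
            # The four values covered by the window whose top-left corner is (r, c)
            window = (grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1])
            row.append(max(window))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 3],
]

pooled = max_pool_2x2(feature_map)
print(pooled)
```

Each non-overlapping 2×2 window collapses to a single number, so a 4×4 map becomes 2×2, in the same way a 28×28 map becomes 14×14.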

In [None]:
# Demonstrate MaxPool2d
pool_layer = nn.MaxPool2d(kernel_size=2, stride=2)

print(f"Before pooling: {conv_output.shape}")

with torch.inference_mode():
    pool_output = pool_layer(conv_output)
print(f"After pooling: {pool_output.shape}")
print("\nNote: Spatial dimensions halved from 28x28 to 14x14")

In [None]:
# Visualize the effect of convolution and pooling
fig, axes = plt.subplots(1, 4, figsize=(14, 4))

# Original image
axes[0].imshow(sample_image.squeeze(), cmap='gray')
axes[0].set_title('Original\n(1×28×28)')
axes[0].axis('off')

# After Conv2d (show first feature map)
axes[1].imshow(conv_output[0, 0].detach(), cmap='gray')
axes[1].set_title('After Conv2d\n(Feature Map 1 of 10)')
axes[1].axis('off')

# After Conv2d (show another feature map)
axes[2].imshow(conv_output[0, 5].detach(), cmap='gray')
axes[2].set_title('After Conv2d\n(Feature Map 6 of 10)')
axes[2].axis('off')

# After MaxPool2d
axes[3].imshow(pool_output[0, 0].detach(), cmap='gray')
axes[3].set_title('After MaxPool2d\n(14×14)')
axes[3].axis('off')

plt.tight_layout()
plt.show()

## 8. Build the CNN Model (TinyVGG)

Our CNN follows the TinyVGG architecture with three main parts. Let's understand each block before we build it.

### Block 1: Extract Basic Features

![Block 1](https://raw.githubusercontent.com/poridhiEng/lab-asset/8104ff41aaf569aa65977e43cdbadc13fc1b7a34/tensorcode/Deep-learning-with-pytorch/Computer-Vision/Lab_03/images/infra-13.svg)

Block 1 is the first stage of feature extraction. It processes the raw input image and learns to detect simple patterns.

**What happens in Block 1:**
- **Input**: A 28×28 grayscale image (1 channel)
- **First Conv2d (1→10)**: Creates 10 different 3×3 filters that learn to detect basic patterns like edges, corners, and simple textures
- **ReLU**: Introduces non-linearity so the network can learn complex patterns (not just linear combinations)
- **Second Conv2d (10→10)**: Further refines the detected features by combining patterns from the first layer
- **ReLU**: Another non-linearity for more expressive power
- **MaxPool2d**: Reduces spatial dimensions from 28×28 to 14×14, keeping only the strongest activations
- **Output**: 10 feature maps of size 14×14

### Block 2: Extract Higher-Level Patterns

![Block 2](https://raw.githubusercontent.com/poridhiEng/lab-asset/8104ff41aaf569aa65977e43cdbadc13fc1b7a34/tensorcode/Deep-learning-with-pytorch/Computer-Vision/Lab_03/images/infra-14.svg)

Block 2 builds on top of Block 1's features to detect more complex patterns like shapes and textures.

**What happens in Block 2:**
- **Input**: 10 feature maps of size 14×14 (output from Block 1)
- **First Conv2d (10→10)**: Combines features from Block 1 to detect higher-level patterns (e.g., combining edges into shapes)
- **ReLU**: Non-linearity for learning complex combinations
- **Second Conv2d (10→10)**: Further combines patterns to detect even more abstract features
- **ReLU**: Another non-linearity
- **MaxPool2d**: Reduces spatial dimensions from 14×14 to 7×7
- **Output**: 10 feature maps of size 7×7

At this point, each of the 10 feature maps represents different high-level patterns detected in the image, at a much smaller spatial resolution.
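The spatial sizes quoted for the two blocks can be checked with a few lines of arithmetic. This is only a sketch: `out_size` is a hypothetical helper that applies the standard output-size formula for convolution and pooling layers (dilation of 1 assumed), and `hidden_units = 10` mirrors the value used in this lab:

```python
def out_size(size: int, kernel_size: int, stride: int, padding: int = 0) -> int:
    """Spatial size after a conv or pooling layer (dilation of 1 assumed)."""
    return (size + 2 * padding - kernel_size) // stride + 1


hidden_units = 10
size = 28
trace = [("input", size)]

for block in (1, 2):
    size = out_size(size, kernel_size=3, stride=1, padding=1)  # first Conv2d
    size = out_size(size, kernel_size=3, stride=1, padding=1)  # second Conv2d
    trace.append((f"block {block} convs", size))
    size = out_size(size, kernel_size=2, stride=2)             # MaxPool2d
    trace.append((f"block {block} pool", size))

# Number of values left once the final feature maps are flattened
flattened_features = hidden_units * size * size

for name, s in trace:
    print(f"{name:<14} {s}x{s}")
print(f"flattened features: {flattened_features}")
```

Only the pooling layers change the spatial size here; the padded 3×3 convolutions leave it untouched.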

### Classifier: Make Predictions

![Classifier](https://raw.githubusercontent.com/poridhiEng/lab-asset/8104ff41aaf569aa65977e43cdbadc13fc1b7a34/tensorcode/Deep-learning-with-pytorch/Computer-Vision/Lab_03/images/infra-15.svg)

The classifier takes all the extracted features and uses them to predict which class the image belongs to.

**What happens in the Classifier:**
- **Input**: 10 feature maps of size 7×7 (total: 10 × 7 × 7 = 490 values)
- **Flatten**: Converts the 3D tensor [10, 7, 7] into a 1D vector of 490 values
- **Linear (490→10)**: A fully connected layer that maps the 490 features to 10 output values (one for each clothing class)
- **Output**: 10 raw scores (logits), one per class

The class with the highest score is the model's prediction. During training, these logits are passed through CrossEntropyLoss to compute the loss.
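As a rough illustration of how logits become a prediction and a loss, the sketch below reproduces the computation for a single sample in plain Python. The three logits are made-up numbers for a toy 3-class problem, and `softmax` and `cross_entropy` are hypothetical helpers; `nn.CrossEntropyLoss` takes raw logits, applies log-softmax internally, and by default averages the per-sample losses over the batch:

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[target_index])


logits = [2.0, 0.5, -1.0]  # made-up scores for a 3-class toy problem
probs = softmax(logits)
predicted = max(range(len(logits)), key=lambda i: logits[i])  # argmax

loss_if_class_0 = cross_entropy(logits, 0)  # true class has the highest score
loss_if_class_2 = cross_entropy(logits, 2)  # true class has the lowest score

print(f"Probabilities: {[round(p, 3) for p in probs]}")
print(f"Predicted class: {predicted}")
print(f"Loss if true class is 0: {loss_if_class_0:.3f}")
print(f"Loss if true class is 2: {loss_if_class_2:.3f}")
```

The loss is small when the true class already has the highest score and large when it has the lowest, which is the signal that drives training.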

In [None]:
class FashionMNISTModelV2(nn.Module):
    """CNN model following TinyVGG architecture."""
    
    def __init__(self, input_shape: int, hidden_units: int, output_shape: int):
        super().__init__()
        
        # Block 1: Input [1, 28, 28] -> Output [hidden_units, 14, 14]
        self.block_1 = nn.Sequential(
            nn.Conv2d(
                in_channels=input_shape,
                out_channels=hidden_units,
                kernel_size=3,
                stride=1,
                padding=1
            ),
            nn.ReLU(),
            nn.Conv2d(
                in_channels=hidden_units,
                out_channels=hidden_units,
                kernel_size=3,
                stride=1,
                padding=1
            ),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2)  # 28x28 -> 14x14
        )
        
        # Block 2: Input [hidden_units, 14, 14] -> Output [hidden_units, 7, 7]
        self.block_2 = nn.Sequential(
            nn.Conv2d(hidden_units, hidden_units, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden_units, hidden_units, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2)  # 14x14 -> 7x7
        )
        
        # Classifier: Input [hidden_units * 7 * 7] -> Output [output_shape]
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(
                in_features=hidden_units * 7 * 7,  # 10 * 7 * 7 = 490
                out_features=output_shape
            )
        )
    
    def forward(self, x):
        x = self.block_1(x)
        x = self.block_2(x)
        x = self.classifier(x)
        return x

In [None]:
# Create model instance
torch.manual_seed(42)

model_2 = FashionMNISTModelV2(
    input_shape=1,            # Grayscale images (1 channel)
    hidden_units=10,          # Same as other models for comparison
    output_shape=len(class_names)  # 10 classes
).to(device)

print(f"Model architecture:\n{model_2}")
print(f"\nModel is on: {next(model_2.parameters()).device}")

In [None]:
# Verify model with a dummy input
dummy_input = torch.randn(1, 1, 28, 28).to(device)

with torch.inference_mode():
    dummy_output = model_2(dummy_input)

print(f"Input shape: {dummy_input.shape}")
print(f"Output shape: {dummy_output.shape}")

## 9. Define Helper Functions

We'll reuse our training functions from Lab 02.

In [None]:
def accuracy_fn(y_true, y_pred):
    """Return the percentage of labels in y_pred that equal y_true."""
    correct = torch.eq(y_true, y_pred).sum().item()
    return (correct / len(y_true)) * 100

def print_train_time(start, end, device=None):
    """Print training time."""
    total = end - start
    print(f"Train time on {device}: {total:.3f} seconds")
    return total

In [None]:
def train_step(model, data_loader, loss_fn, optimizer, accuracy_fn, device):
    """Performs one training epoch."""
    train_loss, train_acc = 0, 0
    model.train()
    
    for batch, (X, y) in enumerate(data_loader):
        X, y = X.to(device), y.to(device)
        
        y_pred = model(X)
        loss = loss_fn(y_pred, y)
        train_loss += loss.item()
        train_acc += accuracy_fn(y, y_pred.argmax(dim=1))
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    
    train_loss /= len(data_loader)
    train_acc /= len(data_loader)
    print(f"Train loss: {train_loss:.5f} | Train accuracy: {train_acc:.2f}%")

def test_step(model, data_loader, loss_fn, accuracy_fn, device):
    """Evaluates model on test data."""
    test_loss, test_acc = 0, 0
    model.eval()
    
    with torch.inference_mode():
        for X, y in data_loader:
            X, y = X.to(device), y.to(device)
            test_pred = model(X)
            test_loss += loss_fn(test_pred, y).item()
            test_acc += accuracy_fn(y, test_pred.argmax(dim=1))
    
    test_loss /= len(data_loader)
    test_acc /= len(data_loader)
    print(f"Test loss: {test_loss:.5f} | Test accuracy: {test_acc:.2f}%")

## 10. Setup Loss Function and Optimizer

In [None]:
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(params=model_2.parameters(), lr=0.1)

## 11. Train the CNN

Now we'll train our CNN model on the FashionMNIST dataset. 

**What to expect:**
- Training will take approximately **40-120 seconds** on CPU (depending on your machine)
- We train for **3 epochs** - each epoch processes all 60,000 training images
- You'll see the loss decrease and accuracy increase with each epoch
- The model processes **1,875 batches** per epoch (60,000 images ÷ 32 images per batch)

**During training, you'll see:**
- **Train loss**: How well the model fits the training data (lower is better)
- **Train accuracy**: Percentage of training images correctly classified
- **Test loss/accuracy**: Performance on unseen data (this is what really matters!)

Be patient - CNNs have more computations per image than linear models due to the convolution operations.
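The batch count quoted above is simple arithmetic, sketched here in plain Python. `num_batches` is a hypothetical helper; it rounds up because a `DataLoader` keeps a smaller final batch by default (`drop_last=False`). The test-set size of 10,000 is the standard FashionMNIST split:

```python
import math


def num_batches(num_samples: int, batch_size: int) -> int:
    """Batches per epoch when a smaller final batch is kept."""
    return math.ceil(num_samples / batch_size)


train_batches = num_batches(60_000, 32)
test_batches = num_batches(10_000, 32)

# Size of the last test batch, which is not a full 32 images
last_test_batch = 10_000 - (test_batches - 1) * 32

print(f"Training batches per epoch: {train_batches}")
print(f"Test batches: {test_batches} (last one holds {last_test_batch} images)")
```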

In [None]:
torch.manual_seed(42)

train_time_start = timer()

epochs = 3

for epoch in range(epochs):
    print(f"\nEpoch: {epoch}\n---------")
    
    train_step(
        model=model_2,
        data_loader=train_dataloader,
        loss_fn=loss_fn,
        optimizer=optimizer,
        accuracy_fn=accuracy_fn,
        device=device
    )
    
    test_step(
        model=model_2,
        data_loader=test_dataloader,
        loss_fn=loss_fn,
        accuracy_fn=accuracy_fn,
        device=device
    )

train_time_end = timer()
total_train_time_model_2 = print_train_time(train_time_start, train_time_end, device)

## 12. Evaluate and Create Model Comparison

Now that training is complete, let's evaluate our CNN and compare it with the models from previous labs.

We'll create an `eval_model` function that:
- Runs the model on the entire test dataset
- Calculates the average loss and accuracy
- Returns the results in a dictionary for easy comparison

This allows us to fairly compare V0 (baseline), V1 (non-linear), and V2 (CNN) side by side.
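One subtlety worth knowing: averaging per-batch accuracies gives exactly the same number as overall accuracy only when every batch has the same size. The toy sketch below (hypothetical helper names, made-up batch results) exaggerates the effect. In this lab the difference is tiny, since with 10,000 test images and a batch size of 32 only the final batch is smaller (16 images):

```python
def batch_averaged_accuracy(batches):
    """Mean of per-batch accuracies; each batch counts equally."""
    per_batch = [100 * correct / size for correct, size in batches]
    return sum(per_batch) / len(per_batch)


def sample_weighted_accuracy(batches):
    """Overall accuracy; each sample counts equally."""
    total_correct = sum(correct for correct, _ in batches)
    total_seen = sum(size for _, size in batches)
    return 100 * total_correct / total_seen


# (correct predictions, batch size) for two made-up batches of unequal size
toy_batches = [(4, 4), (0, 2)]

batch_avg = batch_averaged_accuracy(toy_batches)
weighted = sample_weighted_accuracy(toy_batches)

print(f"Batch-averaged accuracy:  {batch_avg:.2f}%")
print(f"Sample-weighted accuracy: {weighted:.2f}%")
```

Because all three models are evaluated the same way, the comparison between them stays fair either way.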

In [None]:
def eval_model(model, data_loader, loss_fn, accuracy_fn, device):
    """Evaluate model and return metrics."""
    loss, acc = 0, 0
    model.eval()
    
    with torch.inference_mode():
        for X, y in data_loader:
            X, y = X.to(device), y.to(device)
            y_pred = model(X)
            loss += loss_fn(y_pred, y).item()
            acc += accuracy_fn(y, y_pred.argmax(dim=1))
    
    return {
        "model_name": model.__class__.__name__,
        "model_loss": loss / len(data_loader),
        "model_acc": acc / len(data_loader)
    }

In [None]:
# Evaluate CNN model
model_2_results = eval_model(
    model=model_2,
    data_loader=test_dataloader,
    loss_fn=loss_fn,
    accuracy_fn=accuracy_fn,
    device=device
)

print("\nCNN Model Results:")
print(f"Model: {model_2_results['model_name']}")
print(f"Loss: {model_2_results['model_loss']:.4f}")
print(f"Accuracy: {model_2_results['model_acc']:.2f}%")

## 13. Compare All Three Models

After training all three models across our labs, let's compare their performance side by side.

![Model Comparison](https://raw.githubusercontent.com/poridhiEng/lab-asset/8104ff41aaf569aa65977e43cdbadc13fc1b7a34/tensorcode/Deep-learning-with-pytorch/Computer-Vision/Lab_03/images/model-comparison.svg)

**Key Observations:**

- **V2 (CNN) wins**: The CNN achieves the highest accuracy (~88%) and lowest loss (~0.33), demonstrating why CNNs are the go-to architecture for image tasks.

- **V0 (Baseline) is solid**: The simple linear model achieves ~83% accuracy - a strong baseline that's fast to train (~32s).

- **V1 (Non-Linear) underperforms**: Surprisingly, adding ReLU activations hurt performance (~75%). This happens because a ReLU was placed after the output layer: it clamps every negative logit to zero, which distorts the class scores before they reach the loss.

- **Training time vs accuracy tradeoff**: The CNN takes longer to train (~44s) but the accuracy gain is worth it for image classification tasks.
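If you kept the result dictionaries from the earlier labs, a comparison like this can be assembled in a few lines. The sketch below is illustrative only: the accuracies are the approximate figures quoted in this lab rather than fresh measurements, and the names `FashionMNISTModelV0` and `FashionMNISTModelV1` are assumed class names for the earlier models:

```python
# Approximate accuracies quoted in this lab; replace with your own eval_model outputs
results = [
    {"model_name": "FashionMNISTModelV0", "model_acc": 83.0},  # assumed name, Lab 01
    {"model_name": "FashionMNISTModelV1", "model_acc": 75.0},  # assumed name, Lab 02
    {"model_name": "FashionMNISTModelV2", "model_acc": 88.0},  # this lab's CNN
]

# Rank models from highest to lowest accuracy
ranked = sorted(results, key=lambda r: r["model_acc"], reverse=True)

for rank, r in enumerate(ranked, start=1):
    print(f"{rank}. {r['model_name']:<22} {r['model_acc']:.1f}%")

best_model_name = ranked[0]["model_name"]
```

The same list of dictionaries can also be passed to `pd.DataFrame(results)` to view it as a table.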

## 14. Make Predictions with CNN

In [None]:
# Visualize predictions
torch.manual_seed(42)

fig, axes = plt.subplots(3, 3, figsize=(9, 9))

model_2.eval()
with torch.inference_mode():
    for i, ax in enumerate(axes.flatten()):
        random_idx = torch.randint(0, len(test_data), size=[1]).item()
        image, true_label = test_data[random_idx]
        
        image_device = image.unsqueeze(0).to(device)
        pred_logits = model_2(image_device)
        pred_label = pred_logits.argmax(dim=1).item()
        
        ax.imshow(image.squeeze().cpu(), cmap="gray")
        title_color = "green" if pred_label == true_label else "red"
        ax.set_title(
            f"True: {class_names[true_label]}\nPred: {class_names[pred_label]}",
            color=title_color,
            fontsize=10
        )
        ax.axis("off")

plt.suptitle("CNN Model (V2) Predictions", fontsize=14)
plt.tight_layout()
plt.show()

## 15. Create a Confusion Matrix

A confusion matrix shows where our model makes mistakes:
- **Rows**: True labels
- **Columns**: Predicted labels
- **Diagonal**: Correct predictions
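Before building the real matrix, it can help to see the bookkeeping on a tiny example. This plain-Python sketch uses made-up labels for a 3-class problem, and `confusion_counts` is a hypothetical helper that follows the same row/column convention described above:

```python
def confusion_counts(y_true, y_pred, num_classes):
    """Count (true, predicted) pairs: rows are true labels, columns are predictions."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


toy_true = [0, 0, 1, 1, 2, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0, 2]

toy_cm = confusion_counts(toy_true, toy_pred, num_classes=3)
for row in toy_cm:
    print(row)

# Correct predictions sit on the diagonal
toy_correct = sum(toy_cm[i][i] for i in range(3))
print(f"Correct: {toy_correct} of {len(toy_true)}")
```

Reading row 2, for example: two samples of class 2 were predicted correctly and one was mistaken for class 0.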

In [None]:
# Get all predictions
y_preds = []
y_trues = []

model_2.eval()
with torch.inference_mode():
    for X, y in test_dataloader:
        X, y = X.to(device), y.to(device)
        y_pred = model_2(X)
        y_preds.extend(y_pred.argmax(dim=1).cpu().numpy())
        y_trues.extend(y.cpu().numpy())

print(f"Total predictions: {len(y_preds)}")
print(f"Total true labels: {len(y_trues)}")

The cell above collects the model's predictions for the whole test dataset: it loops through every test batch and stores the predicted labels and the true labels in two lists.

In [None]:
# Create confusion matrix
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(y_trues, y_preds)

# Plot
fig, ax = plt.subplots(figsize=(10, 10))
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=class_names)
disp.plot(ax=ax, cmap='Blues', values_format='d')
plt.title('Confusion Matrix - CNN Model (V2)')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

The cell above uses scikit-learn's `confusion_matrix` and `ConfusionMatrixDisplay` to build and plot the matrix. Each entry counts how many times a true class (row) was predicted as a given class (column).

In [None]:
# Find most confused pairs
import numpy as np

# Zero out diagonal (correct predictions)
cm_no_diag = cm.copy()
np.fill_diagonal(cm_no_diag, 0)

# Find top confusions
print("Most Common Misclassifications:")
print("="*50)

for _ in range(5):
    idx = np.unravel_index(np.argmax(cm_no_diag), cm_no_diag.shape)
    count = cm_no_diag[idx]
    if count > 0:
        print(f"{class_names[idx[0]]} mistaken as {class_names[idx[1]]}: {count} times")
        cm_no_diag[idx] = 0

The cell above finds the class pairs the model confuses most often: it zeroes out the diagonal (correct predictions) and then repeatedly picks the largest remaining off-diagonal value.

## 16. Save and Load the Model

Now that we have our best model, let's save it for future use.

In [None]:
# Save the model state dict
from pathlib import Path

# Create models directory if it doesn't exist
MODEL_PATH = Path("models")
MODEL_PATH.mkdir(parents=True, exist_ok=True)

# Save
MODEL_NAME = "fashion_mnist_cnn_v2.pth"
MODEL_SAVE_PATH = MODEL_PATH / MODEL_NAME

torch.save(obj=model_2.state_dict(), f=MODEL_SAVE_PATH)
print(f"Model saved to: {MODEL_SAVE_PATH}")

In [None]:
# Load the model
loaded_model = FashionMNISTModelV2(
    input_shape=1,
    hidden_units=10,
    output_shape=len(class_names)
).to(device)

# Load the saved state dict; map_location places the tensors on `device` (CPU in this lab)
loaded_model.load_state_dict(torch.load(MODEL_SAVE_PATH, map_location=device))

print("Model loaded successfully!")

In [None]:
# Verify loaded model works
loaded_results = eval_model(
    model=loaded_model,
    data_loader=test_dataloader,
    loss_fn=loss_fn,
    accuracy_fn=accuracy_fn,
    device=device
)

print("\nLoaded Model Results:")
print(f"Loss: {loaded_results['model_loss']:.4f}")
print(f"Accuracy: {loaded_results['model_acc']:.2f}%")

# Verify results match
assert abs(loaded_results['model_acc'] - model_2_results['model_acc']) < 0.01, "Results don't match!"
print("\nResults match original model!")

## Summary

### What We Accomplished Across All Labs:

| Lab | Model | Key Concepts | Result |
|-----|-------|--------------|--------|
| **Lab 01** | V0 (Baseline) | Linear layers, training loop, evaluation | ~83% |
| **Lab 02** | V1 (Non-Linear) | ReLU, device-agnostic code, reusable functions | ~75% |
| **Lab 03** | V2 (CNN) | Conv2d, MaxPool2d, specialized architecture | ~88% |

### Key Takeaways:

1. **Architecture matters**: The CNN outperformed both earlier models (~88% vs ~83% and ~75%) because its layers exploit the spatial structure of image data.

2. **More complex ≠ better**: V1 (non-linear) actually performed *worse* than the baseline - adding complexity without proper architecture doesn't help.

3. **CNNs for images**: Convolutional layers learn local patterns, parameter sharing keeps the number of weights small, and pooling adds tolerance to small shifts in position.

4. **Always start with a baseline**: Simple models establish performance benchmarks and help you understand when more complex models actually help.

5. **Confusion matrices reveal insights**: We can see which classes are commonly confused (e.g., Shirt vs T-shirt, Pullover vs Coat).

### What's Next?

From here, you could:
- Increase hidden_units for more model capacity
- Add more convolutional blocks for deeper features
- Use data augmentation for better generalization
- Try transfer learning with pre-trained models
- Experiment with different optimizers (Adam, etc.)