# PyTorch Tutorial for Computer Vision

Topics covered include:

- PyTorch fundamentals and tensor operations
- Neural network building blocks
- Computer vision with CIFAR-10 and Fashion-MNIST
- LeNet and AlexNet implementations
- Complete training, validation, and evaluation pipelines
- Comprehensive visualizations

**Prerequisites**: Basic Python knowledge

---


## Table of Contents

1. [PyTorch Basics](#1-pytorch-basics)
2. [Tensors and Operations](#2-tensors-and-operations)
3. [Automatic Differentiation](#3-automatic-differentiation)
4. [Neural Network Fundamentals](#4-neural-network-fundamentals)
5. [Dataset and DataLoader](#5-dataset-and-dataloader)
6. [Computer Vision Datasets](#6-computer-vision-datasets)
7. [LeNet Implementation](#7-lenet-implementation)
8. [AlexNet Implementation](#8-alexnet-implementation)
9. [Training Pipeline](#9-training-pipeline)
10. [Evaluation and Visualization](#10-evaluation-and-visualization)

---


In [2]:
# Import required libraries
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, random_split

import torchvision
import torchvision.transforms as transforms
from torchvision.datasets import CIFAR10, FashionMNIST

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.notebook import tqdm

import os
from pathlib import Path

# Set style for better visualizations
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 10

# Check device availability
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')
print(f'PyTorch version: {torch.__version__}')

Using device: cpu
PyTorch version: 2.10.0


## 1. PyTorch Basics

PyTorch is a deep learning framework that provides:

- **Tensors**: Multi-dimensional arrays similar to NumPy, but with GPU acceleration
- **Autograd**: Automatic differentiation for computing the gradients used to train neural networks
- **Neural Network modules**: Pre-built layers and models
- **Optimizers**: Algorithms for training neural networks

### Workflow Overview

```mermaid
graph LR
    A[Data] --> B[DataLoader]
    B --> C[Model]
    C --> D[Loss Function]
    D --> E[Backpropagation]
    E --> F[Optimizer]
    F --> C
    C --> G[Predictions]
```


## 2. Tensors and Operations

Tensors are the fundamental data structure in PyTorch. They are similar to NumPy arrays but can run on GPUs.


### 2.1 Creating Tensors


In [None]:
# From Python lists: each list is a row in the tensor
tensor_from_list = torch.tensor([[1, 2, 3], [4, 5, 6]])
print("Tensor from list:")
print(tensor_from_list)
print(f"Shape: {tensor_from_list.shape}, dtype: {tensor_from_list.dtype}\n")

Tensor from list:
tensor([[1, 2, 3],
        [4, 5, 6]])
Shape: torch.Size([2, 3]), dtype: torch.int64



Zeros tensor

```
torch.zeros(size, dtype, device) - creates a tensor filled with zeros
```

Parameters:

- `size`: tuple defining tensor shape
- `dtype`: data type (default: torch.float32)
- `device`: cpu or cuda


In [4]:
zeros = torch.zeros(3, 4, dtype=torch.float32)
print("Zeros tensor:")
print(zeros)
print()

Zeros tensor:
tensor([[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]])



In [5]:
# Ones tensor
ones = torch.ones(2, 3)
print("Ones tensor:")
print(ones)
print()


Ones tensor:
tensor([[1., 1., 1.],
        [1., 1., 1.]])



In [6]:
# Random tensors
# torch.randn(size) - creates tensor with values from standard normal distribution N(0,1)
random_normal = torch.randn(3, 3)
print("Random normal tensor (mean=0, std=1):")
print(random_normal)
print()

Random normal tensor (mean=0, std=1):
tensor([[-1.2639, -1.7205,  0.9888],
        [ 0.3760, -0.0761, -0.8899],
        [-1.1185,  0.0677, -0.6314]])



In [7]:
# torch.rand(size) - creates tensor with values from uniform distribution [0, 1)
random_uniform = torch.rand(2, 4)
print("Random uniform tensor [0, 1):")
print(random_uniform)
print()

Random uniform tensor [0, 1):
tensor([[0.1675, 0.3780, 0.8059, 0.5116],
        [0.9904, 0.8365, 0.6565, 0.1969]])



In [8]:
# torch.arange(start, end, step) - creates 1D tensor with evenly spaced values
range_tensor = torch.arange(0, 10, 2)
print("Range tensor [0, 10) with step 2:")
print(range_tensor)
print()

Range tensor [0, 10) with step 2:
tensor([0, 2, 4, 6, 8])



In [9]:
# torch.linspace(start, end, steps) - creates 1D tensor with linearly spaced values
# Parameters:
#   start: starting value
#   end: ending value (included)
#   steps: number of points
linspace_tensor = torch.linspace(0, 1, 5)
print("Linspace tensor [0, 1] with 5 steps:")
print(linspace_tensor)

Linspace tensor [0, 1] with 5 steps:
tensor([0.0000, 0.2500, 0.5000, 0.7500, 1.0000])
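
The difference between `torch.arange` (end excluded, fixed step) and `torch.linspace` (end included, fixed number of points) comes down to simple arithmetic. The following is a minimal pure-Python sketch of that arithmetic, assuming a positive step. The helper names `arange_values` and `linspace_values` are hypothetical and are not part of PyTorch.

```python
def arange_values(start, end, step):
    """Values start, start+step, ... strictly below end (end excluded)."""
    values = []
    current = start
    while current < end:  # assumes step > 0
        values.append(current)
        current += step
    return values


def linspace_values(start, end, steps):
    """`steps` evenly spaced values from start to end (end included)."""
    if steps == 1:
        return [float(start)]
    spacing = (end - start) / (steps - 1)
    return [start + i * spacing for i in range(steps)]


print(arange_values(0, 10, 2))   # [0, 2, 4, 6, 8]
print(linspace_values(0, 1, 5))  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Both results match the tensors printed by the two cells above.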


### 2.2 Tensor Operations


In [10]:
# Element-wise operations
a = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
b = torch.tensor([[5.0, 6.0], [7.0, 8.0]])

print("Tensor a:")
print(a)
print("\nTensor b:")
print(b)

Tensor a:
tensor([[1., 2.],
        [3., 4.]])

Tensor b:
tensor([[5., 6.],
        [7., 8.]])


In [11]:
# Addition
print("\na + b (element-wise addition):")
print(a + b)


a + b (element-wise addition):
tensor([[ 6.,  8.],
        [10., 12.]])


In [12]:
# Multiplication (element-wise)
print("\na * b (element-wise multiplication):")
print(a * b)


a * b (element-wise multiplication):
tensor([[ 5., 12.],
        [21., 32.]])


In [13]:
# Matrix multiplication
# torch.matmul(a, b) or a @ b - performs matrix multiplication
# For 2D tensors: (m, n) @ (n, p) -> (m, p)
print("\na @ b.T (matrix multiplication):")
print(a @ b.T)


a @ b.T (matrix multiplication):
tensor([[17., 23.],
        [39., 53.]])
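
To make the shape rule `(m, n) @ (n, p) -> (m, p)` concrete, the sketch below reproduces the result above with plain nested lists. It is an illustration only: `matmul` and `transpose` are hypothetical helpers written for this example, not how PyTorch computes the product internally.

```python
def matmul(a, b):
    """Multiply an (m, n) matrix by an (n, p) matrix given as nested lists."""
    m, n, p = len(a), len(b), len(b[0])
    assert all(len(row) == n for row in a), "inner dimensions must match"
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(p)]
            for i in range(m)]


def transpose(mat):
    """Swap rows and columns."""
    return [list(col) for col in zip(*mat)]


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]

print(matmul(a, transpose(b)))  # [[17.0, 23.0], [39.0, 53.0]]
print(matmul(a, b))             # [[19.0, 22.0], [43.0, 50.0]]
```

The first line corresponds to `a @ b.T` and the second to `a @ b`, which shows that transposing one operand changes the result.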


In [14]:
# Reshaping
# tensor.view(new_shape) - returns a view with the new shape (shares memory; needs a compatible memory layout)
# tensor.reshape(new_shape) - returns a view when possible, otherwise a copy
c = torch.arange(12)
print("\nOriginal tensor:")
print(c)
print("\nReshaped to 3x4:")
print(c.view(3, 4))
print("\nReshaped to 2x6:")
print(c.reshape(2, 6))


Original tensor:
tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

Reshaped to 3x4:
tensor([[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]])

Reshaped to 2x6:
tensor([[ 0,  1,  2,  3,  4,  5],
        [ 6,  7,  8,  9, 10, 11]])
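
The practical difference between `view` and `reshape` only shows up on tensors whose memory layout is not contiguous, for example after a transpose. The snippet below is a small sketch of that case; the variable names are chosen for illustration.

```python
import torch

c = torch.arange(12)
t = c.view(3, 4).T  # transposed view: shape (4, 3), no longer contiguous

print(t.is_contiguous())  # False

# view() cannot reinterpret this memory layout as a flat tensor
try:
    t.view(12)
    view_failed = False
except RuntimeError:
    view_failed = True
print(f"view(12) failed: {view_failed}")

# reshape() falls back to copying the data when a view is impossible
flat = t.reshape(12)
print(flat.tolist())  # [0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11]
```

When in doubt, `reshape` is the safer choice; `view` is useful when a silent copy must be avoided.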


In [None]:
# Transpose
# tensor.T - returns transposed tensor
# tensor.transpose(dim0, dim1) - swaps dimensions
print("\nTranspose:")
print(a.T)


Transpose:
tensor([[1., 3.],
        [2., 4.]])


In [16]:
# Slicing
matrix = torch.arange(20).reshape(4, 5)
print("\nOriginal matrix:")
print(matrix)
print("\nFirst two rows:")
print(matrix[:2, :])
print("\nLast column:")
print(matrix[:, -1])


Original matrix:
tensor([[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9],
        [10, 11, 12, 13, 14],
        [15, 16, 17, 18, 19]])

First two rows:
tensor([[0, 1, 2, 3, 4],
        [5, 6, 7, 8, 9]])

Last column:
tensor([ 4,  9, 14, 19])


### 2.3 Tensor Reductions and Statistics


In [None]:
data = torch.randn(3, 4)
print("Data tensor:")
print(data)

# tensor.sum(dim) - sums elements along dimension
# Parameters:
#   dim: dimension to reduce (None = all dimensions)
#   keepdim: whether to keep reduced dimension as size 1
print("\nSum of all elements:")
print(data.sum())
print("\nSum over rows (dim=0), one value per column:")
print(data.sum(dim=0))
print("\nSum over columns (dim=1), one value per row:")
print(data.sum(dim=1))

# tensor.mean(dim) - computes mean along dimension
print("\nMean:")
print(data.mean())

# tensor.std(dim) - computes standard deviation
print("\nStandard deviation:")
print(data.std())

# tensor.max() - returns the single largest value
# tensor.max(dim) - returns a (values, indices) pair along that dimension
print("\nMaximum value:")
print(data.max())
print("\nMaximum values and indices along dim=1:")
max_vals, max_indices = data.max(dim=1)
print(f"Values: {max_vals}")
print(f"Indices: {max_indices}")

# tensor.argmax(dim) - returns index of maximum value
print("\nArgmax along dim=1:")
print(data.argmax(dim=1))
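
The `keepdim` parameter mentioned in the comments is easiest to understand on a tensor with known values. The sketch below uses a small deterministic tensor instead of random data so the sums can be checked by hand.

```python
import torch

data = torch.arange(12, dtype=torch.float32).reshape(3, 4)

col_sums = data.sum(dim=0)                     # collapses the 3 rows    -> shape (4,)
row_sums = data.sum(dim=1)                     # collapses the 4 columns -> shape (3,)
row_sums_kept = data.sum(dim=1, keepdim=True)  # reduced axis kept       -> shape (3, 1)

print(col_sums.tolist())    # [12.0, 15.0, 18.0, 21.0]
print(row_sums.tolist())    # [6.0, 22.0, 38.0]
print(row_sums_kept.shape)  # torch.Size([3, 1])

# keepdim=True lets the result broadcast against the original tensor,
# for example to rescale every row so that it sums to 1
normalized = data / row_sums_kept
print(normalized.sum(dim=1))
```

Without `keepdim=True`, the division would try to broadcast a shape `(3,)` tensor against `(3, 4)` and fail.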

## 3. Automatic Differentiation

PyTorch's autograd package provides automatic differentiation for differentiable tensor operations. This is essential for training neural networks using backpropagation.

### Autograd Workflow

```mermaid
graph TD
    A[Input Tensor<br/>requires_grad=True] --> B[Forward Pass<br/>Computations]
    B --> C[Loss Calculation]
    C --> D[loss.backward<br/>Compute Gradients]
    D --> E[Gradients Stored<br/>in tensor.grad]
    E --> F[Optimizer Step<br/>Update Parameters]
    F --> G[optimizer.zero_grad<br/>Clear Gradients]
```


In [None]:
### 3.1 Basic Autograd Example

# Create tensors with gradient tracking
# requires_grad=True tells PyTorch to track all operations for gradient computation
x = torch.tensor([2.0, 3.0], requires_grad=True)
print(f"x: {x}")
print(f"x.requires_grad: {x.requires_grad}\n")

# Perform operations (PyTorch builds computation graph automatically)
y = x ** 2  # y = x^2
z = y.sum()  # z = sum(y) = x1^2 + x2^2

print(f"y = x^2: {y}")
print(f"z = sum(y): {z}")
print(f"z.grad_fn: {z.grad_fn}\n")  # Shows the operation that created z

# Compute gradients
# z.backward() computes dz/dx and stores it in .grad of every leaf tensor with requires_grad=True
z.backward()

# Access gradients
# For z = x1^2 + x2^2, dz/dx1 = 2*x1, dz/dx2 = 2*x2
print(f"Gradient dz/dx: {x.grad}")
print(f"Expected gradient [2*x1, 2*x2]: {2 * x.detach()}")
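
A quick way to sanity-check a gradient such as `[2*x1, 2*x2]` is a finite-difference estimate. The sketch below does this in pure Python for the same function `z = x1^2 + x2^2`. The helper `numerical_gradient` is hypothetical and only meant as an independent check, not as a replacement for autograd.

```python
def z(x):
    """z = x1^2 + x2^2, the same function as in the autograd example."""
    return sum(v ** 2 for v in x)


def numerical_gradient(f, x, eps=1e-6):
    """Central-difference estimate of df/dx_i for each component of x."""
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += eps
        x_minus[i] -= eps
        grad.append((f(x_plus) - f(x_minus)) / (2 * eps))
    return grad


approx = numerical_gradient(z, [2.0, 3.0])
print(approx)  # close to [4.0, 6.0], i.e. [2*x1, 2*x2]
```

The estimate agrees with the analytic gradient up to floating-point rounding.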

In [None]:
### 3.2 More Complex Example

# Create input tensor
x = torch.tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)

# Complex operation
y = x * 2  # y = 2x
z = y ** 2  # z = (2x)^2 = 4x^2
out = z.mean()  # out = mean(4x^2)

print(f"x:\n{x}\n")
print(f"out: {out}\n")

# Backpropagate
out.backward()

# For out = mean(4x^2) = (1/4) * sum(4x^2)
# dout/dx = (1/4) * 8x = 2x
print(f"Gradient dout/dx:\n{x.grad}")
print(f"\nExpected gradient (2x):\n{2 * x.detach()}")

In [None]:
### 3.3 Gradient Management

x = torch.ones(2, 2, requires_grad=True)

# torch.no_grad() - context manager to disable gradient tracking
# Use this during inference to save memory and computation
print("With gradient tracking:")
y = x + 2
print(f"y.requires_grad: {y.requires_grad}\n")

print("Without gradient tracking:")
with torch.no_grad():
    y = x + 2
    print(f"y.requires_grad: {y.requires_grad}\n")

# tensor.detach() - creates a new tensor that doesn't require gradients
y = x + 2
y_detached = y.detach()
print(f"y.requires_grad: {y.requires_grad}")
print(f"y_detached.requires_grad: {y_detached.requires_grad}\n")

# Zero gradients (important for training loops)
# Gradients accumulate by default, so we need to zero them
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
for i in range(3):
    y = (x ** 2).sum()
    y.backward()
    print(f"Iteration {i+1}, gradient: {x.grad}")
    if i == 0:  # Zero only after the first pass, so the third pass shows accumulation
        x.grad.zero_()  # In-place operation to zero gradients
# Iterations 1 and 2 print [2., 4., 6.]; iteration 3 prints [4., 8., 12.] (accumulated)
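
Because gradients accumulate by default, a hand-written training loop has to clear them on every step. The following is a minimal sketch of such a loop on a made-up one-parameter problem; the objective `(w - 3)^2` and the learning rate are assumptions chosen for illustration.

```python
import torch

# Hypothetical problem: minimise (w - 3)^2, whose minimum is at w = 3
w = torch.tensor(0.0, requires_grad=True)
lr = 0.1  # illustrative learning rate

for step in range(100):
    loss = (w - 3.0) ** 2
    loss.backward()        # adds d(loss)/dw into w.grad

    with torch.no_grad():  # the update itself must not be tracked
        w -= lr * w.grad

    w.grad.zero_()         # clear before the next backward pass

print(f"w after training: {w.item():.4f}")
```

In practice an optimizer takes over the update and the zeroing, but the order of the three steps (backward, update, zero) stays the same.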

## 4. Neural Network Fundamentals

PyTorch provides the `torch.nn` module for building neural networks. Key components:

- **nn.Module**: Base class for all neural network modules
- **nn.Linear**: Fully connected layer
- **nn.Conv2d**: 2D convolutional layer
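- **nn.ReLU**: One of several activation functions (others include `nn.Sigmoid` and `nn.Tanh`)
- **nn.CrossEntropyLoss**: One of several loss functions (others include `nn.MSELoss`)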

### Neural Network Architecture

```mermaid
graph LR
    A[Input<br/>x] --> B[Linear Layer<br/>Wx + b]
    B --> C[Activation<br/>ReLU/Sigmoid]
    C --> D[Hidden Layer]
    D --> E[Output Layer]
    E --> F[Loss Function]
```


In [None]:
### 4.1 Simple Linear Layer

# nn.Linear(in_features, out_features, bias=True)
# Parameters:
#   in_features: size of input features
#   out_features: size of output features
#   bias: whether to include bias term (default: True)
# Computes: y = xW^T + b

linear = nn.Linear(in_features=5, out_features=3)

print("Linear layer structure:")
print(linear)
print(f"\nWeight shape: {linear.weight.shape}")  # (out_features, in_features)
print(f"Bias shape: {linear.bias.shape}")  # (out_features,)

# Forward pass
x = torch.randn(2, 5)  # Batch of 2 samples, 5 features each
output = linear(x)
print(f"\nInput shape: {x.shape}")
print(f"Output shape: {output.shape}")  # (2, 3)
print(f"\nOutput:\n{output}")
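
The formula `y = xW^T + b` can be checked directly against the layer's own parameters. The sketch below does that and also counts the parameters. The seed is set only to make the run repeatable.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # only so the run is repeatable

linear = nn.Linear(in_features=5, out_features=3)
x = torch.randn(2, 5)

# Manual computation using the layer's weight (3, 5) and bias (3,)
manual = x @ linear.weight.T + linear.bias

print(torch.allclose(linear(x), manual, atol=1e-6))  # True

# One weight per (output, input) pair plus one bias per output
num_params = sum(p.numel() for p in linear.parameters())
print(num_params)  # 5 * 3 + 3 = 18
```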

In [None]:
### 4.2 Activation Functions

x = torch.linspace(-3, 3, 100)

# ReLU: max(0, x)
relu = nn.ReLU()
relu_output = relu(x)

# Sigmoid: 1 / (1 + exp(-x))
sigmoid = nn.Sigmoid()
sigmoid_output = sigmoid(x)

# Tanh: (exp(x) - exp(-x)) / (exp(x) + exp(-x))
tanh = nn.Tanh()
tanh_output = tanh(x)

# LeakyReLU: max(0.01*x, x)
# nn.LeakyReLU(negative_slope=0.01)
# Parameters:
#   negative_slope: slope for negative values (default: 0.01)
leaky_relu = nn.LeakyReLU(negative_slope=0.01)
leaky_relu_output = leaky_relu(x)

# Visualize activation functions
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

axes[0, 0].plot(x.numpy(), relu_output.numpy(), label='ReLU', color='blue')
axes[0, 0].set_title('ReLU Activation')
axes[0, 0].grid(True)
axes[0, 0].axhline(y=0, color='k', linestyle='--', alpha=0.3)
axes[0, 0].axvline(x=0, color='k', linestyle='--', alpha=0.3)

axes[0, 1].plot(x.numpy(), sigmoid_output.numpy(), label='Sigmoid', color='green')
axes[0, 1].set_title('Sigmoid Activation')
axes[0, 1].grid(True)
axes[0, 1].axhline(y=0, color='k', linestyle='--', alpha=0.3)
axes[0, 1].axvline(x=0, color='k', linestyle='--', alpha=0.3)

axes[1, 0].plot(x.numpy(), tanh_output.numpy(), label='Tanh', color='red')
axes[1, 0].set_title('Tanh Activation')
axes[1, 0].grid(True)
axes[1, 0].axhline(y=0, color='k', linestyle='--', alpha=0.3)
axes[1, 0].axvline(x=0, color='k', linestyle='--', alpha=0.3)

axes[1, 1].plot(x.numpy(), leaky_relu_output.numpy(), label='Leaky ReLU', color='purple')
axes[1, 1].set_title('Leaky ReLU Activation')
axes[1, 1].grid(True)
axes[1, 1].axhline(y=0, color='k', linestyle='--', alpha=0.3)
axes[1, 1].axvline(x=0, color='k', linestyle='--', alpha=0.3)

plt.tight_layout()
plt.show()

print("Activation Functions Summary:")
print("- ReLU: Simple, fast, but can cause 'dying ReLU' problem")
print("- Sigmoid: Outputs in (0,1), but suffers from vanishing gradients")
print("- Tanh: Outputs in (-1,1), zero-centered, but also vanishing gradients")
print("- Leaky ReLU: Prevents dying ReLU by allowing small negative values")
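
The formulas in the comments above can also be written out for a single scalar, which makes the behaviour at a few sample points easy to inspect. This is a pure-Python sketch for illustration; the function names are local to this example and PyTorch's own implementations operate on whole tensors.

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


def leaky_relu(x, negative_slope=0.01):
    return x if x >= 0 else negative_slope * x


for value in (-2.0, 0.0, 2.0):
    print(f"x={value:+.1f}  relu={relu(value):.4f}  sigmoid={sigmoid(value):.4f}  "
          f"tanh={tanh(value):.4f}  leaky_relu={leaky_relu(value):.4f}")
```

For the negative input, ReLU returns exactly zero while Leaky ReLU returns a small negative value, which is the difference the summary refers to.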

In [None]:
### 4.3 Building a Simple Neural Network

class SimpleNN(nn.Module):
    """
    A simple feedforward neural network with two hidden layers.
    
    Architecture:
    Input -> Linear(input_size, 128) -> ReLU -> Linear(128, 64) -> ReLU -> Linear(64, num_classes)
    
    Args:
        input_size (int): Number of input features
        num_classes (int): Number of output classes
    """
    def __init__(self, input_size, num_classes):
        super(SimpleNN, self).__init__()  # Initialize parent class
        
        # Define layers
        self.fc1 = nn.Linear(input_size, 128)  # First fully connected layer
        self.fc2 = nn.Linear(128, 64)          # Second fully connected layer
        self.fc3 = nn.Linear(64, num_classes)  # Output layer
        
        self.relu = nn.ReLU()  # Activation function
        
    def forward(self, x):
        """
        Forward pass through the network.
        
        Args:
            x (torch.Tensor): Input tensor of shape (batch_size, input_size)
            
        Returns:
            torch.Tensor: Output tensor of shape (batch_size, num_classes)
        """
        x = self.fc1(x)      # Linear transformation
        x = self.relu(x)     # Non-linear activation
        
        x = self.fc2(x)
        x = self.relu(x)
        
        x = self.fc3(x)      # Output layer (no activation, we'll use it with CrossEntropyLoss)
        return x

# Create model instance
model = SimpleNN(input_size=784, num_classes=10)  # For MNIST-like data (28x28 = 784)
print(model)
print(f"\nTotal parameters: {sum(p.numel() for p in model.parameters()):,}")

# Test forward pass
dummy_input = torch.randn(32, 784)  # Batch of 32 samples
output = model(dummy_input)
print(f"\nInput shape: {dummy_input.shape}")
print(f"Output shape: {output.shape}")
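
The total parameter count printed above can be reproduced by hand: each fully connected layer has one weight per input-output pair plus one bias per output. The helper `linear_params` below is a hypothetical function written for this check.

```python
def linear_params(in_features, out_features, bias=True):
    """Weights (out x in) plus optional biases (out) of a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


layer_sizes = [(784, 128), (128, 64), (64, 10)]  # fc1, fc2, fc3 of SimpleNN
per_layer = [linear_params(i, o) for i, o in layer_sizes]

print(per_layer)       # [100480, 8256, 650]
print(sum(per_layer))  # 109386
```

Almost all parameters sit in the first layer, because it connects 784 inputs to 128 units.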

In [None]:
### 4.4 Loss Functions and Optimizers

# Classification loss functions
print("=== Loss Functions ===")

# nn.CrossEntropyLoss(weight=None, reduction='mean')
# Combines nn.LogSoftmax() and nn.NLLLoss()
# Parameters:
#   weight: manual rescaling weight for each class
#   reduction: 'mean' (default), 'sum', or 'none'
# Input: (batch_size, num_classes) - raw scores (logits)
# Target: (batch_size,) - class indices
criterion_ce = nn.CrossEntropyLoss()

# Example usage
logits = torch.randn(4, 3)  # 4 samples, 3 classes
targets = torch.tensor([0, 1, 2, 1])  # True class indices
loss_ce = criterion_ce(logits, targets)
print(f"\nCrossEntropyLoss: {loss_ce.item():.4f}")

# nn.MSELoss() - Mean Squared Error for regression
# Computes: mean((prediction - target)^2)
criterion_mse = nn.MSELoss()
predictions = torch.randn(4, 1)
targets_reg = torch.randn(4, 1)
loss_mse = criterion_mse(predictions, targets_reg)
print(f"MSELoss: {loss_mse.item():.4f}")

# nn.BCEWithLogitsLoss() - Binary cross-entropy with logits
# More numerically stable than separate sigmoid + BCE
criterion_bce = nn.BCEWithLogitsLoss()
logits_binary = torch.randn(4, 1)
targets_binary = torch.randint(0, 2, (4, 1)).float()
loss_bce = criterion_bce(logits_binary, targets_binary)
print(f"BCEWithLogitsLoss: {loss_bce.item():.4f}")

print("\n=== Optimizers ===")

# Create a simple model
model = SimpleNN(784, 10)

# optim.SGD(params, lr, momentum=0, weight_decay=0)
# Stochastic Gradient Descent
# Parameters:
#   params: model parameters to optimize
#   lr: learning rate
#   momentum: momentum factor (default: 0)
#   weight_decay: L2 regularization (default: 0)
optimizer_sgd = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
print(f"SGD optimizer: {optimizer_sgd}")

# optim.Adam(params, lr=0.001, betas=(0.9, 0.999), weight_decay=0)
# Adaptive Moment Estimation
# Parameters:
#   params: model parameters to optimize
#   lr: learning rate (default: 0.001)
#   betas: coefficients for computing running averages
#   weight_decay: L2 regularization (default: 0)
optimizer_adam = optim.Adam(model.parameters(), lr=0.001)
print(f"\nAdam optimizer: {optimizer_adam}")

# optim.RMSprop(params, lr=0.01, alpha=0.99, weight_decay=0)
# Root Mean Square Propagation
optimizer_rmsprop = optim.RMSprop(model.parameters(), lr=0.001, alpha=0.99)
print(f"\nRMSprop optimizer: {optimizer_rmsprop}")
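
Since `nn.CrossEntropyLoss` combines log-softmax with negative log-likelihood, the loss for one sample is `-log(softmax(logits)[target])`. The sketch below computes that by hand in pure Python. The function names are hypothetical, and the example logits are made up.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy for one sample: -log(softmax(logits)[target])."""
    # Subtract the maximum before exponentiating for numerical stability
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(v - max_logit) for v in logits))
    log_softmax = [v - log_sum_exp for v in logits]
    return -log_softmax[target]


def mean_cross_entropy(batch_logits, targets):
    """Average over the batch, matching reduction='mean'."""
    losses = [cross_entropy(row, t) for row, t in zip(batch_logits, targets)]
    return sum(losses) / len(losses)


print(f"{cross_entropy([2.0, 1.0, 0.1], 0):.4f}")  # 0.4170
print(f"{cross_entropy([0.0, 0.0, 0.0], 1):.4f}")  # 1.0986, i.e. ln(3)
```

The second line shows a useful reference point: a model that assigns equal scores to all `C` classes has a loss of `ln(C)`.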

## 5. Dataset and DataLoader

PyTorch provides utilities for loading and preprocessing data efficiently.

### Data Loading Pipeline

```mermaid
graph LR
    A[Raw Data] --> B[Dataset Class<br/>__getitem__, __len__]
    B --> C[Transforms<br/>Preprocessing]
    C --> D[DataLoader<br/>Batching & Shuffling]
    D --> E[Model Training]

    style D fill:#e1f5ff
```


In [None]:
### 5.1 Custom Dataset

class CustomDataset(Dataset):
    """
    Custom dataset class for demonstration.
    
    All custom datasets must implement:
    - __init__: Initialize dataset (load data, setup transforms, etc.)
    - __len__: Return the size of the dataset
    - __getitem__: Return a single sample at given index
    
    Args:
        num_samples (int): Number of samples to generate
        num_features (int): Number of features per sample
        num_classes (int): Number of classes
    """
    def __init__(self, num_samples=1000, num_features=20, num_classes=3):
        self.num_samples = num_samples
        self.num_features = num_features
        self.num_classes = num_classes
        
        # Generate random data
        self.data = torch.randn(num_samples, num_features)
        self.labels = torch.randint(0, num_classes, (num_samples,))
        
    def __len__(self):
        """Return the total number of samples."""
        return self.num_samples
    
    def __getitem__(self, idx):
        """
        Get a single sample.
        
        Args:
            idx (int): Index of the sample to retrieve
            
        Returns:
            tuple: (data, label)
        """
        return self.data[idx], self.labels[idx]

# Create dataset instance
dataset = CustomDataset(num_samples=1000, num_features=20, num_classes=3)
print(f"Dataset size: {len(dataset)}")

# Get a single sample
sample, label = dataset[0]
print(f"Sample shape: {sample.shape}")
print(f"Label: {label}")

In [None]:
### 5.2 DataLoader

# DataLoader(dataset, batch_size=1, shuffle=False, num_workers=0)
# Parameters:
#   dataset: dataset to load from
#   batch_size: number of samples per batch
#   shuffle: whether to shuffle data at every epoch
#   num_workers: number of subprocesses for data loading (0 = main process)
#   drop_last: drop the last incomplete batch if dataset size not divisible by batch_size
#   pin_memory: if True, copies tensors into CUDA pinned memory (faster transfer to GPU)

dataloader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=0,  # 0 = load in the main process (safest default); try 2-4 to parallelize loading
    drop_last=False
)

print(f"Number of batches: {len(dataloader)}")

# Iterate through batches
for batch_idx, (data, labels) in enumerate(dataloader):
    print(f"\nBatch {batch_idx + 1}:")
    print(f"  Data shape: {data.shape}")      # (batch_size, num_features)
    print(f"  Labels shape: {labels.shape}")  # (batch_size,)
    
    if batch_idx == 2:  # Show only first 3 batches
        break

print("\n...")
print(f"Total batches: {len(dataloader)}")
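
The number of batches follows directly from the dataset size, `batch_size`, and `drop_last`. The sketch below reproduces that arithmetic in pure Python for a dataset with a known length; `num_batches` is a hypothetical helper, not a PyTorch function.

```python
import math


def num_batches(num_samples, batch_size, drop_last=False):
    """Number of batches per epoch for a dataset of known length."""
    if drop_last:
        return num_samples // batch_size        # incomplete final batch discarded
    return math.ceil(num_samples / batch_size)  # incomplete final batch kept


print(num_batches(1000, 32))                  # 32 (31 full batches + one of 8)
print(num_batches(1000, 32, drop_last=True))  # 31
print(1000 % 32)                              # 8 samples in the final batch
```

With `drop_last=False`, the last batch in the loop above therefore has shape `(8, 20)` rather than `(32, 20)`.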

## 6. Computer Vision Datasets

We'll work with two popular computer vision datasets:

- **Fashion-MNIST**: 28x28 grayscale images of clothing items (10 classes)
- **CIFAR-10**: 32x32 color images of objects (10 classes)

### Image Preprocessing Pipeline

```mermaid
graph LR
    A[Raw Image] --> B[Resize/Crop]
    B --> C[Augmentations<br/>flip, crop]
    C --> D[ToTensor<br/>0-255 → 0-1]
    D --> E[Normalize<br/>mean, std]
    E --> F[Model Input]

    style D fill:#ffe1e1
    style E fill:#ffe1e1
```


In [None]:
### 6.1 Fashion-MNIST Dataset

# transforms.Compose() - chains multiple transforms together
# transforms.ToTensor() - converts PIL Image or ndarray to tensor and scales to [0, 1]
# transforms.Normalize(mean, std) - normalizes tensor with mean and standard deviation
#   output = (input - mean) / std

transform_fmnist = transforms.Compose([
    transforms.ToTensor(),  # Convert to tensor and scale to [0, 1]
    transforms.Normalize((0.5,), (0.5,))  # Normalize to [-1, 1]
])

# Load Fashion-MNIST
# FashionMNIST(root, train=True, transform=None, download=True)
# Parameters:
#   root: directory to save/load data
#   train: if True, loads training set; if False, loads test set
#   transform: transformations to apply to data
#   download: if True, downloads data if not already present
train_fmnist = FashionMNIST(
    root='./data',
    train=True,
    transform=transform_fmnist,
    download=True
)

test_fmnist = FashionMNIST(
    root='./data',
    train=False,
    transform=transform_fmnist,
    download=True
)

print(f"Fashion-MNIST Training set size: {len(train_fmnist)}")
print(f"Fashion-MNIST Test set size: {len(test_fmnist)}")

# Class labels
fmnist_classes = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
                  'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

# Visualize samples
fig, axes = plt.subplots(2, 5, figsize=(12, 5))
fig.suptitle('Fashion-MNIST Sample Images', fontsize=14, fontweight='bold')

for idx, ax in enumerate(axes.flat):
    image, label = train_fmnist[idx]
    # Denormalize: image = image * std + mean
    image = image * 0.5 + 0.5  # Convert back to [0, 1]
    ax.imshow(image.squeeze(), cmap='gray')
    ax.set_title(fmnist_classes[label])
    ax.axis('off')

plt.tight_layout()
plt.show()
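
The effect of `Normalize((0.5,), (0.5,))` and of the denormalization used before plotting can be traced on individual pixel values. The sketch below applies the same arithmetic to scalars; `normalize` and `denormalize` are hypothetical helpers for illustration.

```python
def normalize(value, mean=0.5, std=0.5):
    """Same arithmetic as transforms.Normalize, for one pixel value."""
    return (value - mean) / std


def denormalize(value, mean=0.5, std=0.5):
    """Inverse mapping, as used before plotting."""
    return value * std + mean


for pixel in (0.0, 0.5, 1.0):
    z = normalize(pixel)
    print(f"{pixel:.1f} -> {z:+.1f} -> {denormalize(z):.1f}")
```

Black (0.0) maps to -1, mid-grey (0.5) to 0, and white (1.0) to +1, and denormalizing recovers the original values.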

In [None]:
### 6.2 CIFAR-10 Dataset

# CIFAR-10 specific transforms with data augmentation
# Data augmentation helps prevent overfitting

# Training transforms (with augmentation)
# transforms.RandomHorizontalFlip(p=0.5) - randomly flip images horizontally
# transforms.RandomCrop(size, padding) - randomly crop image after padding
transform_train = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),  # 50% chance of flipping
    transforms.RandomCrop(32, padding=4),     # Crop to 32x32 after padding by 4
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.4914, 0.4822, 0.4465],  # CIFAR-10 mean per channel (R, G, B)
        std=[0.2470, 0.2435, 0.2616]     # CIFAR-10 std per channel
    )
])

# Test transforms (no augmentation)
transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.4914, 0.4822, 0.4465],
        std=[0.2470, 0.2435, 0.2616]
    )
])

# Load CIFAR-10
train_cifar10 = CIFAR10(
    root='./data',
    train=True,
    transform=transform_train,
    download=True
)

test_cifar10 = CIFAR10(
    root='./data',
    train=False,
    transform=transform_test,
    download=True
)

print(f"CIFAR-10 Training set size: {len(train_cifar10)}")
print(f"CIFAR-10 Test set size: {len(test_cifar10)}")

# Class labels
cifar10_classes = ['airplane', 'automobile', 'bird', 'cat', 'deer',
                   'dog', 'frog', 'horse', 'ship', 'truck']

# Visualize samples
fig, axes = plt.subplots(2, 5, figsize=(12, 5))
fig.suptitle('CIFAR-10 Sample Images', fontsize=14, fontweight='bold')

# Use test set for visualization (no augmentation)
for idx, ax in enumerate(axes.flat):
    image, label = test_cifar10[idx]
    # Denormalize for visualization
    image = image * torch.tensor([0.2470, 0.2435, 0.2616]).view(3, 1, 1) + \
            torch.tensor([0.4914, 0.4822, 0.4465]).view(3, 1, 1)
    image = torch.clamp(image, 0, 1)  # Ensure values are in [0, 1]
    ax.imshow(image.permute(1, 2, 0))  # Change from (C, H, W) to (H, W, C)
    ax.set_title(cifar10_classes[label])
    ax.axis('off')

plt.tight_layout()
plt.show()
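
Per-channel statistics like the mean and standard deviation passed to `Normalize` are typically obtained by reducing over every axis except the channel axis. The sketch below shows one plausible way to do this. It uses random stand-in data so that it runs without a download, so the printed numbers are not the CIFAR-10 values.

```python
import torch

torch.manual_seed(0)

# Stand-in for a stack of images already scaled to [0, 1]: (N, C, H, W).
# Random data is an assumption made only to keep the snippet self-contained.
images = torch.rand(64, 3, 32, 32)

# Reduce over batch, height and width, keeping the channel axis
channel_mean = images.mean(dim=(0, 2, 3))
channel_std = images.std(dim=(0, 2, 3))

print(channel_mean)  # three values, one per channel
print(channel_std)

# Normalizing with these statistics gives roughly zero mean and unit std
normalized = (images - channel_mean.view(1, 3, 1, 1)) / channel_std.view(1, 3, 1, 1)
print(normalized.mean(dim=(0, 2, 3)))
```

The `.view(1, 3, 1, 1)` reshape serves the same purpose as `.view(3, 1, 1)` in the denormalization code above: it lines the statistics up with the channel axis for broadcasting.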

In [None]:
### 6.3 Dataset Statistics

# Analyze class distribution
# The labels are stored on the dataset, so there is no need to load and augment every image
train_labels_cifar = train_cifar10.targets

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Class distribution
class_counts = np.bincount(train_labels_cifar)
ax1.bar(range(10), class_counts, color=sns.color_palette('husl', 10))
ax1.set_xlabel('Class')
ax1.set_ylabel('Number of Samples')
ax1.set_title('CIFAR-10 Class Distribution')
ax1.set_xticks(range(10))
ax1.set_xticklabels(cifar10_classes, rotation=45, ha='right')
ax1.grid(axis='y', alpha=0.3)

# Sample image dimensions
sample_image, _ = train_cifar10[0]
info_text = f"""CIFAR-10 Dataset Information:

Image shape: {sample_image.shape}
Channels: {sample_image.shape[0]} (RGB)
Height: {sample_image.shape[1]} pixels
Width: {sample_image.shape[2]} pixels

Training samples: {len(train_cifar10):,}
Test samples: {len(test_cifar10):,}
Number of classes: 10

Samples per class: {class_counts[0]:,}
(Balanced dataset)
"""

ax2.text(0.1, 0.5, info_text, fontsize=11, family='monospace',
         verticalalignment='center', bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.3))
ax2.axis('off')

plt.tight_layout()
plt.show()

## 7. LeNet Implementation

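LeNet-5 is a pioneering convolutional neural network introduced by Yann LeCun and colleagues in 1998 for handwritten digit recognition. The implementation below follows the original layer layout but uses ReLU activations, a common modern substitution for the original tanh-style units.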

### LeNet Architecture

```mermaid
graph LR
    A[Input<br/>32x32x1] --> B[Conv1<br/>6@28x28]
    B --> C[Pool1<br/>6@14x14]
    C --> D[Conv2<br/>16@10x10]
    D --> E[Pool2<br/>16@5x5]
    E --> F[Flatten<br/>400]
    F --> G[FC1<br/>120]
    G --> H[FC2<br/>84]
    H --> I[FC3<br/>10]

    style B fill:#e1f5ff
    style D fill:#e1f5ff
    style F fill:#ffe1e1
```


In [None]:
### 7.1 LeNet-5 Model

class LeNet5(nn.Module):
    """
    LeNet-5 Convolutional Neural Network.
    
    Original paper: "Gradient-Based Learning Applied to Document Recognition" (LeCun et al., 1998)
    
    Architecture:
    - Conv1: 1 input channel, 6 output channels, 5x5 kernel
    - Pooling: 2x2 average pooling
    - Conv2: 6 input channels, 16 output channels, 5x5 kernel
    - Pooling: 2x2 average pooling
    - FC1: 16*5*5 -> 120
    - FC2: 120 -> 84
    - FC3: 84 -> num_classes
    
    Args:
        num_classes (int): Number of output classes (default: 10)
        in_channels (int): Number of input channels (default: 1 for grayscale)
    """
    def __init__(self, num_classes=10, in_channels=1):
        super(LeNet5, self).__init__()
        
        # nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0)
        # Parameters:
        #   in_channels: number of channels in input
        #   out_channels: number of channels produced by convolution
        #   kernel_size: size of convolving kernel
        #   stride: stride of convolution (default: 1)
        #   padding: zero-padding added to both sides (default: 0)
        # Output size: floor((input_size - kernel_size + 2*padding) / stride) + 1
        
        self.conv1 = nn.Conv2d(in_channels, 6, kernel_size=5, stride=1, padding=0)
        # Input: (batch, in_channels, 32, 32) -> Output: (batch, 6, 28, 28)
        
        # nn.AvgPool2d(kernel_size, stride=None)
        # Average pooling operation
        # If stride is None, it's set to kernel_size
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)
        # Input: (batch, 6, 28, 28) -> Output: (batch, 6, 14, 14)
        
        self.conv2 = nn.Conv2d(6, 16, kernel_size=5, stride=1, padding=0)
        # Input: (batch, 6, 14, 14) -> Output: (batch, 16, 10, 10)
        # After pool: (batch, 16, 5, 5)
        
        # Fully connected layers
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, num_classes)
        
    def forward(self, x):
        """
        Forward pass.
        
        Args:
            x (torch.Tensor): Input tensor of shape (batch_size, in_channels, height, width)
            
        Returns:
            torch.Tensor: Output logits of shape (batch_size, num_classes)
        """
        # Convolutional layers with activation and pooling
        x = self.pool(F.relu(self.conv1(x)))  # (batch, 6, 14, 14)
        x = self.pool(F.relu(self.conv2(x)))  # (batch, 16, 5, 5)
        
        # Flatten: reshape to (batch, 16*5*5)
        x = x.view(x.size(0), -1)
        
        # Fully connected layers
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)  # No activation (use with CrossEntropyLoss)
        
        return x

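# Create LeNet model for Fashion-MNIST (grayscale)
# Note: fc1 expects 16*5*5 features, which corresponds to 32x32 inputs. Fashion-MNIST
# images are 28x28, so pad or resize them to 32x32 (e.g. transforms.Pad(2)) before a full forward pass.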
lenet_fmnist = LeNet5(num_classes=10, in_channels=1)
print("LeNet-5 for Fashion-MNIST:")
print(lenet_fmnist)
print(f"\nTotal parameters: {sum(p.numel() for p in lenet_fmnist.parameters()):,}")

# Create LeNet model for CIFAR-10 (color)
lenet_cifar = LeNet5(num_classes=10, in_channels=3)
print("\n" + "="*50)
print("LeNet-5 for CIFAR-10:")
print(lenet_cifar)
print(f"\nTotal parameters: {sum(p.numel() for p in lenet_cifar.parameters()):,}")

# Test forward pass
dummy_input_gray = torch.randn(4, 1, 32, 32)  # Batch of 4 grayscale images
dummy_input_color = torch.randn(4, 3, 32, 32)  # Batch of 4 color images

output_gray = lenet_fmnist(dummy_input_gray)
output_color = lenet_cifar(dummy_input_color)

print(f"\nGrayscale input shape: {dummy_input_gray.shape}")
print(f"Grayscale output shape: {output_gray.shape}")
print(f"\nColor input shape: {dummy_input_color.shape}")
print(f"Color output shape: {output_color.shape}")
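
The shape comments in the model can be verified with the output-size formula for convolution and pooling. The sketch below traces one spatial dimension through the layers in pure Python. `conv_out` and `lenet_flatten_size` are hypothetical helpers that assume square inputs and the kernel sizes used above.

```python
def conv_out(size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel_size) // stride + 1


def lenet_flatten_size(input_size):
    """Trace one spatial dimension through conv1 -> pool -> conv2 -> pool."""
    s = conv_out(input_size, kernel_size=5)   # conv1
    s = conv_out(s, kernel_size=2, stride=2)  # pool
    s = conv_out(s, kernel_size=5)            # conv2
    s = conv_out(s, kernel_size=2, stride=2)  # pool
    return 16 * s * s                         # 16 channels after conv2


print(lenet_flatten_size(32))  # 400, matches nn.Linear(16 * 5 * 5, 120)
print(lenet_flatten_size(28))  # 256, does not match fc1
```

The second result shows why input size matters: a 28x28 image produces 256 features after flattening, so it would need padding or resizing to 32x32 before reaching `fc1`.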

In [None]:
### 7.2 Visualize LeNet Feature Maps

def visualize_feature_maps(model, image, layer_name='conv1'):
    """
    Visualize feature maps from a convolutional layer.
    
    Args:
        model: PyTorch model
        image: Input image tensor (1, C, H, W)
        layer_name: Name of the layer to visualize
    """
    model.eval()
    
    # Forward pass through first conv layer
    if layer_name == 'conv1':
        features = model.pool(F.relu(model.conv1(image)))
        title = 'Conv1 Feature Maps (after pooling)'
    else:
        x = model.pool(F.relu(model.conv1(image)))
        features = model.pool(F.relu(model.conv2(x)))
        title = 'Conv2 Feature Maps (after pooling)'
    
    # Convert to numpy
    features = features.detach().cpu().numpy()[0]  # Remove batch dimension
    
    # Plot feature maps
    num_features = features.shape[0]
    cols = 8 if num_features == 16 else 6
    rows = (num_features + cols - 1) // cols
    
    fig, axes = plt.subplots(rows, cols, figsize=(12, rows * 1.5))
    fig.suptitle(title, fontsize=14, fontweight='bold')
    
    for idx in range(rows * cols):
        ax = axes.flat[idx] if rows > 1 else axes[idx]
        if idx < num_features:
            ax.imshow(features[idx], cmap='viridis')
            ax.set_title(f'Filter {idx+1}', fontsize=8)
        ax.axis('off')
    
    plt.tight_layout()
    plt.show()

# Get a sample image from Fashion-MNIST
sample_image, label = train_fmnist[0]
sample_image = sample_image.unsqueeze(0)  # Add batch dimension

# Visualize original image
plt.figure(figsize=(3, 3))
plt.imshow(sample_image.squeeze().numpy() * 0.5 + 0.5, cmap='gray')
plt.title(f'Original Image: {fmnist_classes[label]}')
plt.axis('off')
plt.show()

# Visualize feature maps
visualize_feature_maps(lenet_fmnist, sample_image, 'conv1')
visualize_feature_maps(lenet_fmnist, sample_image, 'conv2')

## 8. AlexNet Implementation

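AlexNet is a deeper convolutional neural network that won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012, a result widely seen as a breakthrough for deep learning. The version implemented here is scaled down for 32x32 CIFAR-10 images.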

### AlexNet Architecture

```mermaid
graph LR
    A[Input<br/>3x32x32] --> B[Conv1<br/>64@32x32]
    B --> C[Pool1<br/>64@16x16]
    C --> D[Conv2<br/>192@16x16]
    D --> E[Pool2<br/>192@8x8]
    E --> F[Conv3<br/>384@8x8]
    F --> G[Conv4<br/>256@8x8]
    G --> H[Conv5<br/>256@8x8]
    H --> I[Pool3<br/>256@4x4]
    I --> P[AvgPool<br/>256@2x2]
    P --> J[Flatten<br/>1024]
    J --> L[Dropout]
    L --> K[FC1<br/>2048]
    K --> N[Dropout]
    N --> M[FC2<br/>1024]
    M --> O[FC3<br/>10]

    style B fill:#e1f5ff
    style D fill:#e1f5ff
    style F fill:#e1f5ff
    style L fill:#ffe1e1
    style N fill:#ffe1e1
```
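
The spatial sizes in the diagram follow from the usual output-size rule for convolution and pooling layers, `out = floor((in + 2*padding - kernel) / stride) + 1`. As a rough check, the sketch below traces a 32x32 input through 3x3 convolutions with padding 1 and 2x2 max pooling with stride 2. The helper name `out_size` is made up for this illustration and is not part of PyTorch.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a conv or pooling layer (floor division, no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1

# (layer name, kernel, stride, padding) in the order they appear in the feature extractor
layers = [
    ("conv1", 3, 1, 1), ("pool1", 2, 2, 0),
    ("conv2", 3, 1, 1), ("pool2", 2, 2, 0),
    ("conv3", 3, 1, 1), ("conv4", 3, 1, 1),
    ("conv5", 3, 1, 1), ("pool3", 2, 2, 0),
]

size = 32
trace = []
for name, kernel, stride, padding in layers:
    size = out_size(size, kernel, stride, padding)
    trace.append((name, size))
    print(f"{name}: {size}x{size}")
```

A 3x3 kernel with padding 1 leaves the size unchanged, so only the three pooling layers shrink the feature map.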


In [None]:
### 8.1 AlexNet Model (Adapted for CIFAR-10)

class AlexNet(nn.Module):
    """
    AlexNet Convolutional Neural Network (adapted for CIFAR-10 32x32 images).
    
    Original paper: "ImageNet Classification with Deep Convolutional Neural Networks" 
    (Krizhevsky et al., 2012)
    
    Modifications for CIFAR-10:
    - Adjusted layer sizes for 32x32 input instead of 224x224
    - Reduced fully connected layer sizes
    - Added adaptive average pooling before the classifier
    
    Architecture:
    - 5 Convolutional layers
    - 3 Fully connected layers
    - ReLU activations
    - Dropout for regularization
    - Max pooling
    
    Args:
        num_classes (int): Number of output classes (default: 10)
    """
    def __init__(self, num_classes=10):
        super(AlexNet, self).__init__()
        
        # Feature extraction layers
        self.features = nn.Sequential(
            # Conv1: 3 -> 64 channels
            # nn.Conv2d keeps the spatial dimensions here (kernel_size=3, stride=1, padding=1)
            # Input: (batch, 3, 32, 32) -> Output: (batch, 64, 32, 32)
            nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            # nn.MaxPool2d(kernel_size, stride) - takes maximum value in window
            # Input: (batch, 64, 32, 32) -> Output: (batch, 64, 16, 16)
            nn.MaxPool2d(kernel_size=2, stride=2),
            
            # Conv2: 64 -> 192 channels
            # Input: (batch, 64, 16, 16) -> Output: (batch, 192, 16, 16)
            nn.Conv2d(64, 192, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            # Input: (batch, 192, 16, 16) -> Output: (batch, 192, 8, 8)
            nn.MaxPool2d(kernel_size=2, stride=2),
            
            # Conv3: 192 -> 384 channels
            # Input: (batch, 192, 8, 8) -> Output: (batch, 384, 8, 8)
            nn.Conv2d(192, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            
            # Conv4: 384 -> 256 channels
            # Input: (batch, 384, 8, 8) -> Output: (batch, 256, 8, 8)
            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            
            # Conv5: 256 -> 256 channels
            # Input: (batch, 256, 8, 8) -> Output: (batch, 256, 8, 8)
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            # Input: (batch, 256, 8, 8) -> Output: (batch, 256, 4, 4)
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        
        # nn.AdaptiveAvgPool2d(output_size) - adaptive pooling to fixed output size
        # Useful when input size varies
        # Input: (batch, 256, 4, 4) -> Output: (batch, 256, 2, 2)
        self.avgpool = nn.AdaptiveAvgPool2d((2, 2))
        
        # Classification layers
        self.classifier = nn.Sequential(
            # nn.Dropout(p=0.5) - randomly zeroes elements with probability p
            # Parameters:
            #   p: probability of an element to be zeroed (default: 0.5)
            # Helps prevent overfitting during training
            nn.Dropout(p=0.5),
            nn.Linear(256 * 2 * 2, 2048),
            nn.ReLU(inplace=True),
            
            nn.Dropout(p=0.5),
            nn.Linear(2048, 1024),
            nn.ReLU(inplace=True),
            
            nn.Linear(1024, num_classes),
        )
        
    def forward(self, x):
        """
        Forward pass.
        
        Args:
            x (torch.Tensor): Input tensor of shape (batch_size, 3, 32, 32)
            
        Returns:
            torch.Tensor: Output logits of shape (batch_size, num_classes)
        """
        x = self.features(x)      # Convolutional layers
        x = self.avgpool(x)       # Adaptive pooling
        x = torch.flatten(x, 1)   # Flatten all dimensions except batch
        x = self.classifier(x)    # Fully connected layers
        return x

# Create AlexNet model
alexnet = AlexNet(num_classes=10)
print("AlexNet for CIFAR-10:")
print(alexnet)
print(f"\nTotal parameters: {sum(p.numel() for p in alexnet.parameters()):,}")

# Count parameters by layer type
conv_params = sum(p.numel() for name, p in alexnet.named_parameters() if 'features' in name)
fc_params = sum(p.numel() for name, p in alexnet.named_parameters() if 'classifier' in name)
print(f"\nConvolutional layers parameters: {conv_params:,}")
print(f"Fully connected layers parameters: {fc_params:,}")

# Test forward pass
dummy_input = torch.randn(4, 3, 32, 32)
output = alexnet(dummy_input)
print(f"\nInput shape: {dummy_input.shape}")
print(f"Output shape: {output.shape}")
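
The split between convolutional and fully connected parameters can also be worked out by hand. The sketch below assumes the layer sizes used in the model above (3x3 kernels, a 256 x 2 x 2 input to the classifier) and counts weights plus biases for each layer. The helper names `conv_params` and `linear_params` are illustrative only.

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Weights plus biases of a square-kernel Conv2d layer."""
    return out_channels * in_channels * kernel_size * kernel_size + out_channels

def linear_params(in_features, out_features):
    """Weights plus biases of a Linear layer."""
    return in_features * out_features + out_features

# (in_channels, out_channels) for the five 3x3 convolutions
conv_layers = [(3, 64), (64, 192), (192, 384), (384, 256), (256, 256)]
# (in_features, out_features) for the three linear layers
fc_layers = [(256 * 2 * 2, 2048), (2048, 1024), (1024, 10)]

conv_total = sum(conv_params(c_in, c_out, 3) for c_in, c_out in conv_layers)
fc_total = sum(linear_params(f_in, f_out) for f_in, f_out in fc_layers)

print(f"Convolutional parameters: {conv_total:,}")
print(f"Fully connected parameters: {fc_total:,}")
print(f"Total: {conv_total + fc_total:,}")
```

Even with the reduced classifier, the fully connected layers hold roughly two thirds of the parameters.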

## 9. Training Pipeline

This section builds a complete pipeline for training, validating, and evaluating models on the CIFAR-10 dataset.

### Training Pipeline Workflow

```mermaid
graph TD
    A[Start Epoch] --> B[Set model.train]
    B --> C[Iterate Training Batches]
    C --> D[Forward Pass]
    D --> E[Compute Loss]
    E --> F[Backward Pass]
    F --> G[Optimizer Step]
    G --> H{More Batches?}
    H -->|Yes| C
    H -->|No| I[Validation Phase]
    I --> J[Set model.eval]
    J --> K[torch.no_grad]
    K --> L[Iterate Val Batches]
    L --> M[Forward Pass]
    M --> N[Compute Metrics]
    N --> O{More Batches?}
    O -->|Yes| L
    O -->|No| P[Log Results]
    P --> Q{More Epochs?}
    Q -->|Yes| A
    Q -->|No| R[Final Evaluation]

    style B fill:#e1f5ff
    style J fill:#ffe1e1
    style F fill:#fff4e1
```
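
The workflow in the diagram can be sketched end to end on a toy problem. The example below is a minimal illustration, not the tutorial's pipeline: the data is random, and every `toy_` name is an assumption introduced here. It only shows where `model.train()`, `zero_grad()`, `backward()`, `step()`, `model.eval()`, and `torch.no_grad()` sit in the loop.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in data: 64 samples, 8 features, 3 classes
toy_x = torch.randn(64, 8)
toy_y = torch.randint(0, 3, (64,))
toy_train_loader = DataLoader(TensorDataset(toy_x[:48], toy_y[:48]), batch_size=16, shuffle=True)
toy_val_loader = DataLoader(TensorDataset(toy_x[48:], toy_y[48:]), batch_size=16)

toy_model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Dropout(0.5), nn.Linear(16, 3))
toy_criterion = nn.CrossEntropyLoss()
toy_optimizer = optim.SGD(toy_model.parameters(), lr=0.1)

val_accuracies = []
for epoch in range(3):
    toy_model.train()                  # training mode: dropout is active
    for inputs, targets in toy_train_loader:
        toy_optimizer.zero_grad()      # clear gradients left over from the previous step
        loss = toy_criterion(toy_model(inputs), targets)  # forward pass and loss
        loss.backward()                # backward pass
        toy_optimizer.step()           # parameter update

    toy_model.eval()                   # evaluation mode: dropout is disabled
    correct, total = 0, 0
    with torch.no_grad():              # no gradient tracking during validation
        for inputs, targets in toy_val_loader:
            predicted = toy_model(inputs).argmax(dim=1)
            correct += (predicted == targets).sum().item()
            total += targets.size(0)
    val_accuracies.append(100.0 * correct / total)
    print(f"epoch {epoch + 1}: val acc {val_accuracies[-1]:.1f}%")
```

Because the labels are random, the accuracy itself is meaningless here; the point is the order of the calls.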


In [None]:
### 9.1 Prepare CIFAR-10 Data with Train/Validation Split

# Create data directory
data_dir = Path('./data')
data_dir.mkdir(exist_ok=True)

# Reload CIFAR-10 with proper transforms
transform_train_cifar = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomCrop(32, padding=4),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),  # Additional augmentation
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.4914, 0.4822, 0.4465], std=[0.2470, 0.2435, 0.2616]),
])

transform_test_cifar = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.4914, 0.4822, 0.4465], std=[0.2470, 0.2435, 0.2616]),
])

# Load datasets
full_train_dataset = CIFAR10(root='./data', train=True, transform=transform_train_cifar, download=True)
test_dataset = CIFAR10(root='./data', train=False, transform=transform_test_cifar, download=True)

# Split training data into train and validation sets
# random_split(dataset, lengths) - randomly splits dataset into non-overlapping new datasets
# Parameters:
#   dataset: dataset to be split
#   lengths: lengths of splits to be produced (list or tuple)
train_size = int(0.9 * len(full_train_dataset))  # 90% for training
val_size = len(full_train_dataset) - train_size  # 10% for validation

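# Set seed for reproducibility
torch.manual_seed(42)
train_dataset, val_dataset = random_split(full_train_dataset, [train_size, val_size])

# random_split returns Subset objects that share the parent dataset's transform,
# so the validation subset would also receive the random training augmentations.
# Re-point it at an un-augmented view of the same images, keeping the same indices.
full_train_eval_view = CIFAR10(root='./data', train=True, transform=transform_test_cifar, download=False)
val_dataset = torch.utils.data.Subset(full_train_eval_view, val_dataset.indices)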

print(f"Training set size: {len(train_dataset):,}")
print(f"Validation set size: {len(val_dataset):,}")
print(f"Test set size: {len(test_dataset):,}")

# Create data loaders
batch_size = 128

train_loader = DataLoader(
    train_dataset,
    batch_size=batch_size,
    shuffle=True,      # Shuffle training data
    num_workers=0,     # Number of parallel workers
    pin_memory=torch.cuda.is_available()  # Pinned memory only speeds up host-to-GPU copies
)

val_loader = DataLoader(
    val_dataset,
    batch_size=batch_size,
    shuffle=False,     # Don't shuffle validation data
    num_workers=0,
    pin_memory=torch.cuda.is_available()
)

test_loader = DataLoader(
    test_dataset,
    batch_size=batch_size,
    shuffle=False,
    num_workers=0,
    pin_memory=torch.cuda.is_available()
)

print(f"\nNumber of training batches: {len(train_loader)}")
print(f"Number of validation batches: {len(val_loader)}")
print(f"Number of test batches: {len(test_loader)}")
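
The sizes and batch counts printed by this cell follow from simple arithmetic. The sketch below reproduces it without loading any data, using the fact that CIFAR-10 ships with 50,000 training and 10,000 test images. The helper `num_batches` is a hypothetical name and assumes the `DataLoader` default of keeping the last, smaller batch.

```python
import math

num_train_images = 50_000   # size of the CIFAR-10 training set
num_test_images = 10_000    # size of the CIFAR-10 test set
batch_size = 128

train_size = int(0.9 * num_train_images)   # 90% for training
val_size = num_train_images - train_size   # remaining 10% for validation

def num_batches(num_samples, batch_size):
    """Batches per epoch when the final partial batch is kept (drop_last=False)."""
    return math.ceil(num_samples / batch_size)

for split, n in [("train", train_size), ("val", val_size), ("test", num_test_images)]:
    print(f"{split}: {n:,} samples, {num_batches(n, batch_size)} batches")
```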

In [None]:
### 9.2 Training and Validation Functions

def train_one_epoch(model, train_loader, criterion, optimizer, device):
    """
    Train the model for one epoch.
    
    Args:
        model (nn.Module): Neural network model
        train_loader (DataLoader): Training data loader
        criterion: Loss function
        optimizer: Optimization algorithm
        device: Device to run on (cpu or cuda)
        
    Returns:
        tuple: (average_loss, accuracy)
    """
    model.train()  # Set model to training mode (enables dropout, batch norm updates)
    
    running_loss = 0.0
    correct = 0
    total = 0
    
    # tqdm provides a progress bar
    pbar = tqdm(train_loader, desc='Training', leave=False)
    
    for batch_idx, (inputs, targets) in enumerate(pbar):
        # Move data to device
        inputs, targets = inputs.to(device), targets.to(device)
        
        # Zero the parameter gradients
        # IMPORTANT: Must be done before backward pass
        optimizer.zero_grad()
        
        # Forward pass
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        
        # Backward pass
        loss.backward()  # Compute gradients
        
        # Optimizer step
        optimizer.step()  # Update weights
        
        # Statistics
        running_loss += loss.item()
        _, predicted = outputs.max(1)  # Get class with highest score
        total += targets.size(0)
        correct += predicted.eq(targets).sum().item()  # Count correct predictions
        
        # Update progress bar
        pbar.set_postfix({
            'loss': running_loss / (batch_idx + 1),
            'acc': 100. * correct / total
        })
    
    avg_loss = running_loss / len(train_loader)
    accuracy = 100. * correct / total
    
    return avg_loss, accuracy


def validate(model, val_loader, criterion, device):
    """
    Validate the model.
    
    Args:
        model (nn.Module): Neural network model
        val_loader (DataLoader): Validation data loader
        criterion: Loss function
        device: Device to run on (cpu or cuda)
        
    Returns:
        tuple: (average_loss, accuracy)
    """
    model.eval()  # Set model to evaluation mode (disables dropout, batch norm uses running stats)
    
    running_loss = 0.0
    correct = 0
    total = 0
    
    # torch.no_grad() disables gradient computation (saves memory and computation)
    with torch.no_grad():
        pbar = tqdm(val_loader, desc='Validation', leave=False)
        
        for batch_idx, (inputs, targets) in enumerate(pbar):
            inputs, targets = inputs.to(device), targets.to(device)
            
            # Forward pass only
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            
            # Statistics
            running_loss += loss.item()
            _, predicted = outputs.max(1)
            total += targets.size(0)
            correct += predicted.eq(targets).sum().item()
            
            # Running average over the batches seen so far
            pbar.set_postfix({
                'loss': running_loss / (batch_idx + 1),
                'acc': 100. * correct / total
            })
    
    avg_loss = running_loss / len(val_loader)
    accuracy = 100. * correct / total
    
    return avg_loss, accuracy


def train_model(model, train_loader, val_loader, criterion, optimizer, 
                num_epochs, device, scheduler=None):
    """
    Complete training loop with validation.
    
    Args:
        model (nn.Module): Neural network model
        train_loader (DataLoader): Training data loader
        val_loader (DataLoader): Validation data loader
        criterion: Loss function
        optimizer: Optimization algorithm
        num_epochs (int): Number of epochs to train
        device: Device to run on (cpu or cuda)
        scheduler: Learning rate scheduler (optional)
        
    Returns:
        dict: Training history containing losses and accuracies
    """
    history = {
        'train_loss': [],
        'train_acc': [],
        'val_loss': [],
        'val_acc': []
    }
    
    best_val_acc = 0.0
    
    for epoch in range(num_epochs):
        print(f'\nEpoch {epoch + 1}/{num_epochs}')
        print('-' * 50)
        
        # Training phase
        train_loss, train_acc = train_one_epoch(model, train_loader, criterion, optimizer, device)
        
        # Validation phase
        val_loss, val_acc = validate(model, val_loader, criterion, device)
        
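        # Update learning rate scheduler if provided
        if scheduler is not None:
            # ReduceLROnPlateau needs the monitored metric; other schedulers take no argument
            if isinstance(scheduler, optim.lr_scheduler.ReduceLROnPlateau):
                scheduler.step(val_loss)
            else:
                scheduler.step()
            current_lr = optimizer.param_groups[0]['lr']
            print(f'Learning rate: {current_lr:.6f}')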
        
        # Record history
        history['train_loss'].append(train_loss)
        history['train_acc'].append(train_acc)
        history['val_loss'].append(val_loss)
        history['val_acc'].append(val_acc)
        
        # Print epoch results
        print(f'Train Loss: {train_loss:.4f} | Train Acc: {train_acc:.2f}%')
        print(f'Val Loss: {val_loss:.4f} | Val Acc: {val_acc:.2f}%')
        
        # Save best model
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            print(f'New best validation accuracy: {best_val_acc:.2f}%')
            # In practice, you would save the model here:
            # torch.save(model.state_dict(), 'best_model.pth')
    
    print(f'\nTraining completed! Best validation accuracy: {best_val_acc:.2f}%')
    return history

print("Training functions defined successfully!")
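
One subtlety in `train_one_epoch` and `validate`: both report `running_loss / len(loader)`, which is the mean of the per-batch mean losses. When the last batch is smaller than the others, this differs slightly from the true per-sample mean. The sketch below uses made-up loss values and batch sizes to show the difference; for large datasets the gap is usually negligible.

```python
# Hypothetical per-batch mean losses and batch sizes (the last batch is smaller)
batch_losses = [0.50, 0.40, 0.90]
batch_sizes = [128, 128, 16]

# What running_loss / len(loader) computes: every batch counts equally
mean_of_batch_means = sum(batch_losses) / len(batch_losses)

# Per-sample mean: each batch is weighted by the number of samples it contains
per_sample_mean = (
    sum(loss * n for loss, n in zip(batch_losses, batch_sizes)) / sum(batch_sizes)
)

print(f"mean of batch means: {mean_of_batch_means:.4f}")
print(f"per-sample mean:     {per_sample_mean:.4f}")
```

If the exact per-sample loss matters, one option is to accumulate `loss.item() * targets.size(0)` and divide by the total sample count.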

In [None]:
### 9.3 Train LeNet-5 on CIFAR-10

# Initialize model
lenet_model = LeNet5(num_classes=10, in_channels=3).to(device)
print(f"Training LeNet-5 on {device}")
print(f"Total parameters: {sum(p.numel() for p in lenet_model.parameters()):,}\n")

# Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer_lenet = optim.Adam(lenet_model.parameters(), lr=0.001, weight_decay=1e-4)

# Learning rate scheduler
# optim.lr_scheduler.ReduceLROnPlateau - reduces learning rate when metric plateaus
# Parameters:
#   optimizer: wrapped optimizer
#   mode: 'min' for loss, 'max' for accuracy
#   factor: factor by which to reduce learning rate (new_lr = lr * factor)
#   patience: number of epochs with no improvement after which lr is reduced
# Note: the `verbose` argument is deprecated in recent PyTorch releases, so it is
# omitted here; train_model prints the current learning rate after each epoch
scheduler_lenet = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer_lenet, mode='min', factor=0.5, patience=3
)

# Train the model
num_epochs = 15
print(f"Training for {num_epochs} epochs...\n")

history_lenet = train_model(
    model=lenet_model,
    train_loader=train_loader,
    val_loader=val_loader,
    criterion=criterion,
    optimizer=optimizer_lenet,
    num_epochs=num_epochs,
    device=device,
    scheduler=scheduler_lenet
)
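
To make the `factor` and `patience` settings concrete, the sketch below models the plateau rule in plain Python. It is a simplified approximation, not PyTorch's implementation: it ignores the `threshold`, `cooldown`, and `min_lr` options, and the function name `plateau_schedule` and the loss values are invented for illustration.

```python
def plateau_schedule(losses, lr=0.001, factor=0.5, patience=3):
    """Simplified reduce-on-plateau rule (no threshold, cooldown, or min_lr)."""
    best = float("inf")
    bad_epochs = 0
    lrs = []
    for loss in losses:
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:   # more than `patience` epochs without improvement
            lr *= factor
            bad_epochs = 0
        lrs.append(lr)
    return lrs

# Hypothetical validation losses: improvement stalls after the second epoch
val_losses = [1.0, 0.8, 0.81, 0.82, 0.80, 0.83, 0.79]
lrs = plateau_schedule(val_losses)
for epoch, (loss, lr) in enumerate(zip(val_losses, lrs), start=1):
    print(f"epoch {epoch}: val_loss={loss:.2f} lr={lr}")
```

With `patience=3`, the learning rate is halved only on the fourth consecutive epoch without a new best loss.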

In [None]:
### 9.4 Train AlexNet on CIFAR-10

# Initialize model
alexnet_model = AlexNet(num_classes=10).to(device)
print(f"Training AlexNet on {device}")
print(f"Total parameters: {sum(p.numel() for p in alexnet_model.parameters()):,}\n")

# Define a separate optimizer for AlexNet (same hyperparameters as for LeNet-5)
optimizer_alexnet = optim.Adam(alexnet_model.parameters(), lr=0.001, weight_decay=1e-4)

# Learning rate scheduler
scheduler_alexnet = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer_alexnet, mode='min', factor=0.5, patience=3
)

# Train the model
print(f"Training for {num_epochs} epochs...\n")

history_alexnet = train_model(
    model=alexnet_model,
    train_loader=train_loader,
    val_loader=val_loader,
    criterion=criterion,
    optimizer=optimizer_alexnet,
    num_epochs=num_epochs,
    device=device,
    scheduler=scheduler_alexnet
)

## 10. Evaluation and Visualization

This section evaluates the trained models on the test set and visualizes the results with matplotlib and seaborn.

### Evaluation Metrics

```mermaid
graph LR
    A[Model Predictions] --> B[Accuracy]
    A --> C[Precision]
    A --> D[Recall]
    A --> E[F1-Score]
    A --> F[Confusion Matrix]

    B --> G[Overall Performance]
    C --> G
    D --> G
    E --> G
    F --> H[Per-Class Analysis]

    style G fill:#e1ffe1
    style H fill:#ffe1e1
```
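
All of the metrics in the diagram can be derived from the confusion matrix. As a small worked example, the sketch below computes accuracy and per-class precision, recall, and F1 from a made-up 3-class matrix in plain Python. The numbers are hypothetical and are not results from the models in this tutorial.

```python
# Hypothetical 3-class confusion matrix: rows are true labels, columns are predictions
cm = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 2, 8],
]
num_classes = len(cm)
total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(num_classes)) / total

per_class = []
for k in range(num_classes):
    tp = cm[k][k]                                             # true positives for class k
    predicted_k = sum(cm[i][k] for i in range(num_classes))   # column sum
    actual_k = sum(cm[k])                                     # row sum
    precision = tp / predicted_k if predicted_k else 0.0
    recall = tp / actual_k if actual_k else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    per_class.append((precision, recall, f1))
    print(f"class {k}: precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")

print(f"accuracy={accuracy:.3f}")
```

Precision divides by the column sum (everything predicted as the class), while recall divides by the row sum (everything that truly belongs to it).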


In [None]:
### 10.1 Plot Training History

def plot_training_history(history, model_name='Model'):
    """
    Plot training and validation loss and accuracy.
    
    Args:
        history (dict): Training history with keys: train_loss, train_acc, val_loss, val_acc
        model_name (str): Name of the model for plot title
    """
    epochs = range(1, len(history['train_loss']) + 1)
    
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
    
    # Plot loss
    ax1.plot(epochs, history['train_loss'], 'b-o', label='Training Loss', linewidth=2, markersize=6)
    ax1.plot(epochs, history['val_loss'], 'r-s', label='Validation Loss', linewidth=2, markersize=6)
    ax1.set_xlabel('Epoch', fontsize=12)
    ax1.set_ylabel('Loss', fontsize=12)
    ax1.set_title(f'{model_name} - Loss Curves', fontsize=14, fontweight='bold')
    ax1.legend(fontsize=11)
    ax1.grid(True, alpha=0.3)
    
    # Plot accuracy
    ax2.plot(epochs, history['train_acc'], 'b-o', label='Training Accuracy', linewidth=2, markersize=6)
    ax2.plot(epochs, history['val_acc'], 'r-s', label='Validation Accuracy', linewidth=2, markersize=6)
    ax2.set_xlabel('Epoch', fontsize=12)
    ax2.set_ylabel('Accuracy (%)', fontsize=12)
    ax2.set_title(f'{model_name} - Accuracy Curves', fontsize=14, fontweight='bold')
    ax2.legend(fontsize=11)
    ax2.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # Print summary statistics
    print(f"\n{model_name} Training Summary:")
    print(f"{'='*60}")
    print(f"Final Training Loss: {history['train_loss'][-1]:.4f}")
    print(f"Final Training Accuracy: {history['train_acc'][-1]:.2f}%")
    print(f"Final Validation Loss: {history['val_loss'][-1]:.4f}")
    print(f"Final Validation Accuracy: {history['val_acc'][-1]:.2f}%")
    print(f"\nBest Validation Accuracy: {max(history['val_acc']):.2f}% (Epoch {np.argmax(history['val_acc']) + 1})")
    print(f"Best Validation Loss: {min(history['val_loss']):.4f} (Epoch {np.argmin(history['val_loss']) + 1})")

# Plot LeNet training history
plot_training_history(history_lenet, 'LeNet-5')

# Plot AlexNet training history
plot_training_history(history_alexnet, 'AlexNet')

In [None]:
### 10.2 Compare Models

def compare_models(histories, model_names):
    """
    Compare training histories of multiple models.
    
    Args:
        histories (list): List of history dictionaries
        model_names (list): List of model names
    """
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    
    epochs = range(1, len(histories[0]['train_loss']) + 1)
    colors = ['#2E86AB', '#A23B72', '#F18F01', '#C73E1D']
    
    # Training Loss
    for idx, (history, name) in enumerate(zip(histories, model_names)):
        axes[0, 0].plot(epochs, history['train_loss'], 
                       label=name, linewidth=2, marker='o', color=colors[idx])
    axes[0, 0].set_xlabel('Epoch', fontsize=12)
    axes[0, 0].set_ylabel('Loss', fontsize=12)
    axes[0, 0].set_title('Training Loss Comparison', fontsize=14, fontweight='bold')
    axes[0, 0].legend(fontsize=11)
    axes[0, 0].grid(True, alpha=0.3)
    
    # Validation Loss
    for idx, (history, name) in enumerate(zip(histories, model_names)):
        axes[0, 1].plot(epochs, history['val_loss'], 
                       label=name, linewidth=2, marker='s', color=colors[idx])
    axes[0, 1].set_xlabel('Epoch', fontsize=12)
    axes[0, 1].set_ylabel('Loss', fontsize=12)
    axes[0, 1].set_title('Validation Loss Comparison', fontsize=14, fontweight='bold')
    axes[0, 1].legend(fontsize=11)
    axes[0, 1].grid(True, alpha=0.3)
    
    # Training Accuracy
    for idx, (history, name) in enumerate(zip(histories, model_names)):
        axes[1, 0].plot(epochs, history['train_acc'], 
                       label=name, linewidth=2, marker='o', color=colors[idx])
    axes[1, 0].set_xlabel('Epoch', fontsize=12)
    axes[1, 0].set_ylabel('Accuracy (%)', fontsize=12)
    axes[1, 0].set_title('Training Accuracy Comparison', fontsize=14, fontweight='bold')
    axes[1, 0].legend(fontsize=11)
    axes[1, 0].grid(True, alpha=0.3)
    
    # Validation Accuracy
    for idx, (history, name) in enumerate(zip(histories, model_names)):
        axes[1, 1].plot(epochs, history['val_acc'], 
                       label=name, linewidth=2, marker='s', color=colors[idx])
    axes[1, 1].set_xlabel('Epoch', fontsize=12)
    axes[1, 1].set_ylabel('Accuracy (%)', fontsize=12)
    axes[1, 1].set_title('Validation Accuracy Comparison', fontsize=14, fontweight='bold')
    axes[1, 1].legend(fontsize=11)
    axes[1, 1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # Create comparison table
    print("\nModel Comparison Summary:")
    print(f"{'='*80}")
    print(f"{'Model':<15} {'Final Train Acc':<18} {'Final Val Acc':<18} {'Best Val Acc':<15}")
    print(f"{'-'*80}")
    for history, name in zip(histories, model_names):
        print(f"{name:<15} {history['train_acc'][-1]:>15.2f}% "
              f"{history['val_acc'][-1]:>15.2f}% {max(history['val_acc']):>12.2f}%")

# Compare LeNet and AlexNet
compare_models([history_lenet, history_alexnet], ['LeNet-5', 'AlexNet'])

In [None]:
### 10.3 Confusion Matrix and Detailed Evaluation

def evaluate_model(model, test_loader, device, class_names):
    """
    Evaluate model on test set and compute detailed metrics.
    
    Args:
        model (nn.Module): Trained model
        test_loader (DataLoader): Test data loader
        device: Device to run on
        class_names (list): List of class names
        
    Returns:
        tuple: (all_predictions, all_targets, accuracy)
    """
    model.eval()
    all_predictions = []
    all_targets = []
    
    with torch.no_grad():
        for inputs, targets in tqdm(test_loader, desc='Evaluating'):
            inputs, targets = inputs.to(device), targets.to(device)
            outputs = model(inputs)
            _, predicted = outputs.max(1)
            
            all_predictions.extend(predicted.cpu().numpy())
            all_targets.extend(targets.cpu().numpy())
    
    all_predictions = np.array(all_predictions)
    all_targets = np.array(all_targets)
    accuracy = 100. * (all_predictions == all_targets).sum() / len(all_targets)
    
    return all_predictions, all_targets, accuracy


def plot_confusion_matrix(predictions, targets, class_names, model_name='Model'):
    """
    Plot confusion matrix using seaborn.
    
    Args:
        predictions (array): Predicted labels
        targets (array): True labels
        class_names (list): List of class names
        model_name (str): Name of the model
    """
    # Compute confusion matrix
    from sklearn.metrics import confusion_matrix
    cm = confusion_matrix(targets, predictions)
    
    # Normalize by true labels (rows)
    cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
    
    # Raw counts
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
                xticklabels=class_names, yticklabels=class_names,
                ax=ax1, cbar_kws={'label': 'Count'})
    ax1.set_xlabel('Predicted Label', fontsize=12)
    ax1.set_ylabel('True Label', fontsize=12)
    ax1.set_title(f'{model_name} - Confusion Matrix (Counts)', fontsize=14, fontweight='bold')
    
    # Normalized (percentages)
    sns.heatmap(cm_normalized, annot=True, fmt='.2%', cmap='RdYlGn', 
                xticklabels=class_names, yticklabels=class_names,
                ax=ax2, cbar_kws={'label': 'Percentage'}, vmin=0, vmax=1)
    ax2.set_xlabel('Predicted Label', fontsize=12)
    ax2.set_ylabel('True Label', fontsize=12)
    ax2.set_title(f'{model_name} - Confusion Matrix (Normalized)', fontsize=14, fontweight='bold')
    
    plt.tight_layout()
    plt.show()


def compute_per_class_metrics(predictions, targets, class_names):
    """
    Compute per-class precision, recall, and F1-score.
    
    Args:
        predictions (array): Predicted labels
        targets (array): True labels
        class_names (list): List of class names
        
    Returns:
        dict: Dictionary containing per-class metrics
    """
    from sklearn.metrics import precision_recall_fscore_support
    
    # zero_division=0 reports 0.0 (instead of warning) for classes that are never predicted
    precision, recall, f1, support = precision_recall_fscore_support(
        targets, predictions, average=None, zero_division=0
    )
    
    metrics = {
        'class_names': class_names,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'support': support
    }
    
    return metrics


def plot_per_class_metrics(metrics, model_name='Model'):
    """
    Plot per-class metrics using bar charts.
    
    Args:
        metrics (dict): Dictionary containing per-class metrics
        model_name (str): Name of the model
    """
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    
    x = np.arange(len(metrics['class_names']))
    width = 0.6
    
    # Precision
    bars1 = axes[0, 0].bar(x, metrics['precision'], width, color='steelblue', edgecolor='black')
    axes[0, 0].set_xlabel('Class', fontsize=12)
    axes[0, 0].set_ylabel('Precision', fontsize=12)
    axes[0, 0].set_title('Precision per Class', fontsize=14, fontweight='bold')
    axes[0, 0].set_xticks(x)
    axes[0, 0].set_xticklabels(metrics['class_names'], rotation=45, ha='right')
    axes[0, 0].set_ylim([0, 1.05])
    axes[0, 0].grid(axis='y', alpha=0.3)
    # Add value labels on bars
    for bar in bars1:
        height = bar.get_height()
        axes[0, 0].text(bar.get_x() + bar.get_width()/2., height,
                       f'{height:.3f}', ha='center', va='bottom', fontsize=9)
    
    # Recall
    bars2 = axes[0, 1].bar(x, metrics['recall'], width, color='coral', edgecolor='black')
    axes[0, 1].set_xlabel('Class', fontsize=12)
    axes[0, 1].set_ylabel('Recall', fontsize=12)
    axes[0, 1].set_title('Recall per Class', fontsize=14, fontweight='bold')
    axes[0, 1].set_xticks(x)
    axes[0, 1].set_xticklabels(metrics['class_names'], rotation=45, ha='right')
    axes[0, 1].set_ylim([0, 1.05])
    axes[0, 1].grid(axis='y', alpha=0.3)
    for bar in bars2:
        height = bar.get_height()
        axes[0, 1].text(bar.get_x() + bar.get_width()/2., height,
                       f'{height:.3f}', ha='center', va='bottom', fontsize=9)
    
    # F1-Score
    bars3 = axes[1, 0].bar(x, metrics['f1'], width, color='mediumseagreen', edgecolor='black')
    axes[1, 0].set_xlabel('Class', fontsize=12)
    axes[1, 0].set_ylabel('F1-Score', fontsize=12)
    axes[1, 0].set_title('F1-Score per Class', fontsize=14, fontweight='bold')
    axes[1, 0].set_xticks(x)
    axes[1, 0].set_xticklabels(metrics['class_names'], rotation=45, ha='right')
    axes[1, 0].set_ylim([0, 1.05])
    axes[1, 0].grid(axis='y', alpha=0.3)
    for bar in bars3:
        height = bar.get_height()
        axes[1, 0].text(bar.get_x() + bar.get_width()/2., height,
                       f'{height:.3f}', ha='center', va='bottom', fontsize=9)
    
    # All metrics together
    x_multi = np.arange(len(metrics['class_names']))
    width_multi = 0.25
    
    axes[1, 1].bar(x_multi - width_multi, metrics['precision'], width_multi, 
                   label='Precision', color='steelblue', edgecolor='black')
    axes[1, 1].bar(x_multi, metrics['recall'], width_multi, 
                   label='Recall', color='coral', edgecolor='black')
    axes[1, 1].bar(x_multi + width_multi, metrics['f1'], width_multi, 
                   label='F1-Score', color='mediumseagreen', edgecolor='black')
    
    axes[1, 1].set_xlabel('Class', fontsize=12)
    axes[1, 1].set_ylabel('Score', fontsize=12)
    axes[1, 1].set_title('All Metrics Comparison', fontsize=14, fontweight='bold')
    axes[1, 1].set_xticks(x_multi)
    axes[1, 1].set_xticklabels(metrics['class_names'], rotation=45, ha='right')
    axes[1, 1].set_ylim([0, 1.05])
    axes[1, 1].legend(fontsize=11)
    axes[1, 1].grid(axis='y', alpha=0.3)
    
    plt.suptitle(f'{model_name} - Per-Class Performance Metrics', 
                 fontsize=16, fontweight='bold', y=1.00)
    plt.tight_layout()
    plt.show()
    
    # Print metrics table
    print(f"\n{model_name} - Detailed Per-Class Metrics:")
    print(f"{'='*90}")
    print(f"{'Class':<15} {'Precision':<12} {'Recall':<12} {'F1-Score':<12} {'Support':<10}")
    print(f"{'-'*90}")
    for i, class_name in enumerate(metrics['class_names']):
        print(f"{class_name:<15} {metrics['precision'][i]:>10.4f}  "
              f"{metrics['recall'][i]:>10.4f}  {metrics['f1'][i]:>10.4f}  "
              f"{metrics['support'][i]:>8}")
    print(f"{'-'*90}")
    print(f"{'Average':<15} {np.mean(metrics['precision']):>10.4f}  "
          f"{np.mean(metrics['recall']):>10.4f}  {np.mean(metrics['f1']):>10.4f}  "
          f"{np.sum(metrics['support']):>8}")

print("Evaluation functions defined successfully!")

In [None]:
### 10.4 Evaluate LeNet-5 on Test Set

print("Evaluating LeNet-5 on test set...")
print("="*60)

# Get predictions
predictions_lenet, targets_lenet, accuracy_lenet = evaluate_model(
    lenet_model, test_loader, device, cifar10_classes
)

print(f"\nLeNet-5 Test Accuracy: {accuracy_lenet:.2f}%\n")

# Plot confusion matrix
plot_confusion_matrix(predictions_lenet, targets_lenet, cifar10_classes, 'LeNet-5')

# Compute and plot per-class metrics
metrics_lenet = compute_per_class_metrics(predictions_lenet, targets_lenet, cifar10_classes)
plot_per_class_metrics(metrics_lenet, 'LeNet-5')

In [None]:
### 10.5 Evaluate AlexNet on Test Set

print("Evaluating AlexNet on test set...")
print("="*60)

# Get predictions
predictions_alexnet, targets_alexnet, accuracy_alexnet = evaluate_model(
    alexnet_model, test_loader, device, cifar10_classes
)

print(f"\nAlexNet Test Accuracy: {accuracy_alexnet:.2f}%\n")

# Plot confusion matrix
plot_confusion_matrix(predictions_alexnet, targets_alexnet, cifar10_classes, 'AlexNet')

# Compute and plot per-class metrics
metrics_alexnet = compute_per_class_metrics(predictions_alexnet, targets_alexnet, cifar10_classes)
plot_per_class_metrics(metrics_alexnet, 'AlexNet')

In [None]:
### 10.6 Visualize Predictions

def visualize_predictions(model, test_dataset, device, class_names, num_images=16, show_correct=True):
    """
    Visualize model predictions on test images.
    
    Args:
        model (nn.Module): Trained model
        test_dataset: Test dataset
        device: Device to run on
        class_names (list): List of class names
        num_images (int): Number of images to display
        show_correct (bool): If True, show correct predictions; if False, show incorrect
    """
    model.eval()
    
    # Find correct/incorrect predictions
    indices = []
    with torch.no_grad():
        for idx in range(len(test_dataset)):
            if len(indices) >= num_images:
                break
            
            image, label = test_dataset[idx]
            image_input = image.unsqueeze(0).to(device)
            output = model(image_input)
            _, predicted = output.max(1)
            
            is_correct = (predicted.item() == label)
            if is_correct == show_correct:
                indices.append(idx)
    
    # Plot images
    cols = 4
    rows = (num_images + cols - 1) // cols
    fig, axes = plt.subplots(rows, cols, figsize=(14, rows * 3.5))
    
    title = "Correct Predictions" if show_correct else "Incorrect Predictions"
    fig.suptitle(f'{title} - {model.__class__.__name__}', fontsize=16, fontweight='bold')
    
    for idx, ax in enumerate(axes.flat):
        if idx < len(indices):
            image, label = test_dataset[indices[idx]]
            
            # Get prediction
            with torch.no_grad():
                image_input = image.unsqueeze(0).to(device)
                output = model(image_input)
                probabilities = F.softmax(output, dim=1)
                confidence, predicted = probabilities.max(1)
            
            # Denormalize image
            image_display = image * torch.tensor([0.2470, 0.2435, 0.2616]).view(3, 1, 1) + \
                           torch.tensor([0.4914, 0.4822, 0.4465]).view(3, 1, 1)
            image_display = torch.clamp(image_display, 0, 1)
            
            # Display image
            ax.imshow(image_display.permute(1, 2, 0).cpu().numpy())
            
            # Set title with prediction info
            true_label = class_names[label]
            pred_label = class_names[predicted.item()]
            conf = confidence.item() * 100
            
            # Both cases share the same title text; only the color differs
            title_text = f'True: {true_label}\nPred: {pred_label}\nConf: {conf:.1f}%'
            title_color = 'green' if show_correct else 'red'
            
            ax.set_title(title_text, fontsize=10, color=title_color, fontweight='bold')
            ax.axis('off')
        else:
            ax.axis('off')
    
    plt.tight_layout()
    plt.show()

# Visualize correct predictions for AlexNet
print("AlexNet - Correct Predictions:")
visualize_predictions(alexnet_model, test_dataset, device, cifar10_classes, num_images=16, show_correct=True)

# Visualize incorrect predictions for AlexNet
print("\nAlexNet - Incorrect Predictions:")
visualize_predictions(alexnet_model, test_dataset, device, cifar10_classes, num_images=16, show_correct=False)
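
The denormalization step inside `visualize_predictions` inverts `transforms.Normalize`. The sketch below, assuming the same CIFAR-10 mean and standard deviation used in the function, shows that normalizing and then denormalizing recovers the original pixel values; the helper names `normalize` and `denormalize` are hypothetical.

```python
import torch

# Channel statistics copied from visualize_predictions above
mean = torch.tensor([0.4914, 0.4822, 0.4465]).view(3, 1, 1)
std = torch.tensor([0.2470, 0.2435, 0.2616]).view(3, 1, 1)

def normalize(image):
    """Same arithmetic as transforms.Normalize: (x - mean) / std per channel."""
    return (image - mean) / std

def denormalize(image):
    """Invert the normalization and clamp to the displayable [0, 1] range."""
    return torch.clamp(image * std + mean, 0, 1)

raw = torch.rand(3, 32, 32)             # stand-in for a CIFAR-10 image in [0, 1]
restored = denormalize(normalize(raw))  # should match raw up to float error
hwc = restored.permute(1, 2, 0)         # matplotlib expects (H, W, C)
print(torch.allclose(raw, restored, atol=1e-5), tuple(hwc.shape))
```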

In [None]:
### 10.7 Model Performance Comparison Visualization

def create_performance_dashboard(models_data):
    """
    Create a comprehensive performance dashboard comparing multiple models.
    
    Args:
        models_data (list): List of tuples (model_name, history, test_accuracy, metrics)
    """
    fig = plt.figure(figsize=(16, 10))
    gs = fig.add_gridspec(3, 3, hspace=0.3, wspace=0.3)
    
    # Extract data
    model_names = [data[0] for data in models_data]
    histories = [data[1] for data in models_data]
    test_accs = [data[2] for data in models_data]
    metrics_list = [data[3] for data in models_data]
    
    colors = ['#2E86AB', '#A23B72', '#F18F01', '#C73E1D']
    
    # 1. Final Performance Comparison (Bar chart)
    ax1 = fig.add_subplot(gs[0, 0])
    x_pos = np.arange(len(model_names))
    bars = ax1.bar(x_pos, test_accs, color=colors[:len(model_names)], 
                   edgecolor='black', linewidth=1.5)
    ax1.set_ylabel('Accuracy (%)', fontsize=11, fontweight='bold')
    ax1.set_title('Test Accuracy Comparison', fontsize=12, fontweight='bold')
    ax1.set_xticks(x_pos)
    ax1.set_xticklabels(model_names, fontsize=10)
    ax1.set_ylim([0, 100])
    ax1.grid(axis='y', alpha=0.3)
    # Add value labels
    for bar in bars:
        height = bar.get_height()
        ax1.text(bar.get_x() + bar.get_width()/2., height,
                f'{height:.2f}%', ha='center', va='bottom', fontsize=10, fontweight='bold')
    
    # 2. Training Progress (Line plot)
    ax2 = fig.add_subplot(gs[0, 1:])
    for idx, (name, history) in enumerate(zip(model_names, histories)):
        # Derive the x-axis per model so runs with different epoch counts still plot
        epochs = range(1, len(history['val_acc']) + 1)
        ax2.plot(epochs, history['val_acc'], marker='o', linewidth=2, 
                label=name, color=colors[idx % len(colors)], markersize=4)
    ax2.set_xlabel('Epoch', fontsize=11, fontweight='bold')
    ax2.set_ylabel('Validation Accuracy (%)', fontsize=11, fontweight='bold')
    ax2.set_title('Validation Accuracy Progress', fontsize=12, fontweight='bold')
    ax2.legend(fontsize=10)
    ax2.grid(True, alpha=0.3)
    
    # 3. Average Precision per Model
    ax3 = fig.add_subplot(gs[1, 0])
    avg_precisions = [np.mean(m['precision']) for m in metrics_list]
    bars3 = ax3.barh(model_names, avg_precisions, color=colors[:len(model_names)],
                     edgecolor='black', linewidth=1.5)
    ax3.set_xlabel('Average Precision', fontsize=11, fontweight='bold')
    ax3.set_title('Average Precision Comparison', fontsize=12, fontweight='bold')
    ax3.set_xlim([0, 1])
    ax3.grid(axis='x', alpha=0.3)
    for i, bar in enumerate(bars3):
        width = bar.get_width()
        ax3.text(width, bar.get_y() + bar.get_height()/2.,
                f'{width:.4f}', ha='left', va='center', fontsize=10, fontweight='bold')
    
    # 4. Average Recall per Model
    ax4 = fig.add_subplot(gs[1, 1])
    avg_recalls = [np.mean(m['recall']) for m in metrics_list]
    bars4 = ax4.barh(model_names, avg_recalls, color=colors[:len(model_names)],
                     edgecolor='black', linewidth=1.5)
    ax4.set_xlabel('Average Recall', fontsize=11, fontweight='bold')
    ax4.set_title('Average Recall Comparison', fontsize=12, fontweight='bold')
    ax4.set_xlim([0, 1])
    ax4.grid(axis='x', alpha=0.3)
    for i, bar in enumerate(bars4):
        width = bar.get_width()
        ax4.text(width, bar.get_y() + bar.get_height()/2.,
                f'{width:.4f}', ha='left', va='center', fontsize=10, fontweight='bold')
    
    # 5. Average F1-Score per Model
    ax5 = fig.add_subplot(gs[1, 2])
    avg_f1s = [np.mean(m['f1']) for m in metrics_list]
    bars5 = ax5.barh(model_names, avg_f1s, color=colors[:len(model_names)],
                     edgecolor='black', linewidth=1.5)
    ax5.set_xlabel('Average F1-Score', fontsize=11, fontweight='bold')
    ax5.set_title('Average F1-Score Comparison', fontsize=12, fontweight='bold')
    ax5.set_xlim([0, 1])
    ax5.grid(axis='x', alpha=0.3)
    for i, bar in enumerate(bars5):
        width = bar.get_width()
        ax5.text(width, bar.get_y() + bar.get_height()/2.,
                f'{width:.4f}', ha='left', va='center', fontsize=10, fontweight='bold')
    
    # 6. Training vs Validation Loss (final epoch)
    ax6 = fig.add_subplot(gs[2, :])
    x = np.arange(len(model_names))
    width = 0.35
    train_losses = [h['train_loss'][-1] for h in histories]
    val_losses = [h['val_loss'][-1] for h in histories]
    
    ax6.bar(x - width/2, train_losses, width, label='Training Loss', 
            color='steelblue', edgecolor='black', linewidth=1.5)
    ax6.bar(x + width/2, val_losses, width, label='Validation Loss', 
            color='coral', edgecolor='black', linewidth=1.5)
    ax6.set_ylabel('Loss', fontsize=11, fontweight='bold')
    ax6.set_title('Final Training vs Validation Loss', fontsize=12, fontweight='bold')
    ax6.set_xticks(x)
    ax6.set_xticklabels(model_names)
    ax6.legend(fontsize=10)
    ax6.grid(axis='y', alpha=0.3)
    
    plt.suptitle('Comprehensive Model Performance Dashboard', 
                 fontsize=16, fontweight='bold', y=0.995)
    plt.show()
    
    # Print summary table
    print("\nComprehensive Performance Summary:")
    print("="*100)
    print(f"{'Model':<12} {'Test Acc':<12} {'Avg Precision':<15} {'Avg Recall':<12} "
          f"{'Avg F1':<10} {'Final Train Loss':<17}")
    print("-"*100)
    for i, name in enumerate(model_names):
        print(f"{name:<12} {test_accs[i]:>9.2f}% {avg_precisions[i]:>13.4f}  "
              f"{avg_recalls[i]:>10.4f}  {avg_f1s[i]:>8.4f}  "
              f"{histories[i]['train_loss'][-1]:>14.4f}")
    print("="*100)

# Create performance dashboard
models_data = [
    ('LeNet-5', history_lenet, accuracy_lenet, metrics_lenet),
    ('AlexNet', history_alexnet, accuracy_alexnet, metrics_alexnet)
]

create_performance_dashboard(models_data)

## 11. Saving and Loading Models

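PyTorch offers three common ways to persist a model: saving only the `state_dict` (parameters and buffers), saving the entire model object, or saving a full training checkpoint that also captures the optimizer state.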

### Model Persistence Workflow

```mermaid
graph LR
    A[Trained Model] --> B{What to Save?}
    B -->|State Dict| C[torch.save<br/>model.state_dict]
    B -->|Entire Model| D[torch.save<br/>model]
    B -->|Checkpoint| E[torch.save<br/>full checkpoint]

    C --> F[torch.load]
    D --> G[torch.load]
    E --> H[torch.load]

    F --> I[model.load_state_dict]
    I --> J[Loaded Model]
    G --> J
    H --> K[Resume Training]

    style C fill:#e1f5ff
    style E fill:#ffe1e1
```
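
As a minimal sketch of the state-dict path in the diagram, the example below saves a tiny model to an in-memory buffer and restores it into a fresh instance. The `nn.Linear` stand-in and the use of `io.BytesIO` instead of a file on disk are assumptions made to keep the example self-contained.

```python
import io
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 2)  # stand-in for a trained model

# Save only the parameters (state_dict) to an in-memory buffer
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
buffer.seek(0)

# Restore into a freshly constructed model with the same architecture
restored = nn.Linear(4, 2)
state_dict = torch.load(buffer, map_location='cpu', weights_only=True)
load_result = restored.load_state_dict(state_dict)  # reports missing/unexpected keys
restored.eval()

x = torch.randn(3, 4)
with torch.no_grad():
    same_output = torch.equal(model(x), restored(x))
print(list(state_dict.keys()), same_output)
```

Because only tensors are stored, this path works with `weights_only=True` and does not depend on pickling the model class.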


In [None]:
### 11.1 Saving Models

# Create models directory
models_dir = Path('./saved_models')
models_dir.mkdir(exist_ok=True)

print("=== Saving Models ===\n")

# Method 1: Save only the model parameters (state_dict) - RECOMMENDED
# This is the most common and flexible approach
# torch.save(obj, path) - saves a serialized object to disk
# Parameters:
#   obj: object to save (state_dict, model, or any Python object)
#   path: file path to save to
torch.save(alexnet_model.state_dict(), models_dir / 'alexnet_state_dict.pth')
print("✓ Saved AlexNet state_dict to 'saved_models/alexnet_state_dict.pth'")

# Method 2: Save entire model (less flexible, not recommended for production)
torch.save(lenet_model, models_dir / 'lenet_entire_model.pth')
print("✓ Saved LeNet entire model to 'saved_models/lenet_entire_model.pth'")

# Method 3: Save training checkpoint (for resuming training)
# This saves everything needed to resume training
checkpoint = {
    'epoch': num_epochs,
    'model_state_dict': alexnet_model.state_dict(),
    'optimizer_state_dict': optimizer_alexnet.state_dict(),
    'train_loss': history_alexnet['train_loss'][-1],
    'val_loss': history_alexnet['val_loss'][-1],
    'val_accuracy': history_alexnet['val_acc'][-1],
    'history': history_alexnet
}
torch.save(checkpoint, models_dir / 'alexnet_checkpoint.pth')
print("✓ Saved AlexNet checkpoint to 'saved_models/alexnet_checkpoint.pth'")

print(f"\nSaved files:")
for file in sorted(models_dir.iterdir()):
    size_mb = file.stat().st_size / (1024 * 1024)
    print(f"  - {file.name}: {size_mb:.2f} MB")

In [None]:
### 11.2 Loading Models

print("=== Loading Models ===\n")

# Method 1: Load state_dict (RECOMMENDED)
# First, create a new instance of the model
loaded_alexnet = AlexNet(num_classes=10).to(device)

# torch.load(path, map_location) - loads a saved object
# Parameters:
#   path: file path to load from
#   map_location: device to map tensors to (e.g., 'cpu', 'cuda:0')
#   weights_only: if True, only loads weights (safer, prevents code execution)
state_dict = torch.load(models_dir / 'alexnet_state_dict.pth', 
                        map_location=device, weights_only=True)

# model.load_state_dict(state_dict, strict=True)
# Parameters:
#   state_dict: dictionary containing parameters and buffers
#   strict: whether to strictly enforce that keys match (default: True)
loaded_alexnet.load_state_dict(state_dict)
loaded_alexnet.eval()  # Set to evaluation mode
print("✓ Loaded AlexNet from state_dict")

# Verify the loaded model works
test_input = torch.randn(1, 3, 32, 32).to(device)
with torch.no_grad():
    output = loaded_alexnet(test_input)
    print(f"  Test output shape: {output.shape}")
    print(f"  Predicted class: {output.argmax(1).item()}")

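# Method 2: Load entire model
# weights_only=False unpickles arbitrary Python objects, so only load files you trust.
# The model's class definition must also be available when loading.
loaded_lenet = torch.load(models_dir / 'lenet_entire_model.pth', 
                          map_location=device, weights_only=False)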
loaded_lenet.eval()
print("\n✓ Loaded LeNet entire model")

# Method 3: Load checkpoint (for resuming training)
checkpoint = torch.load(models_dir / 'alexnet_checkpoint.pth', 
                        map_location=device, weights_only=False)

# Create model and optimizer
# (the optimizer class must match the one whose state was saved in the checkpoint)
resumed_model = AlexNet(num_classes=10).to(device)
resumed_optimizer = optim.Adam(resumed_model.parameters())

# Load states
resumed_model.load_state_dict(checkpoint['model_state_dict'])
resumed_optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch']
history = checkpoint['history']

print(f"\n✓ Loaded checkpoint from epoch {start_epoch}")
print(f"  Last validation accuracy: {checkpoint['val_accuracy']:.2f}%")
print(f"  Last training loss: {checkpoint['train_loss']:.4f}")
print(f"  Last validation loss: {checkpoint['val_loss']:.4f}")
print(f"\nCheckpoint can now be used to resume training from epoch {start_epoch + 1}")
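
To make the resume step concrete, here is a small self-contained sketch. It assumes a toy `nn.Linear` model, SGD with momentum, and an in-memory buffer instead of a file; the helpers `make_model_and_optimizer` and `train_one_epoch` are hypothetical. It checks that resuming from a checkpoint gives the same weights as training without interruption.

```python
import io
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
inputs = torch.randn(16, 4)
targets = torch.randint(0, 2, (16,))

def make_model_and_optimizer():
    model = nn.Linear(4, 2)
    optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    return model, optimizer

def train_one_epoch(model, optimizer):
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(inputs), targets)
    loss.backward()
    optimizer.step()
    return loss.item()

# Train for 3 epochs, then checkpoint model and optimizer state
model, optimizer = make_model_and_optimizer()
for epoch in range(3):
    train_one_epoch(model, optimizer)
buffer = io.BytesIO()
torch.save({'epoch': 3,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict()}, buffer)
buffer.seek(0)

# Reference run: keep training the original model for 2 more epochs
for epoch in range(3, 5):
    train_one_epoch(model, optimizer)

# Resumed run: rebuild, load the checkpoint, and continue from start_epoch
resumed_model, resumed_optimizer = make_model_and_optimizer()
checkpoint = torch.load(buffer, map_location='cpu', weights_only=True)
resumed_model.load_state_dict(checkpoint['model_state_dict'])
resumed_optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch']
resumed_epochs = []
for epoch in range(start_epoch, 5):
    train_one_epoch(resumed_model, resumed_optimizer)
    resumed_epochs.append(epoch + 1)

weights_match = torch.allclose(model.weight, resumed_model.weight)
print(resumed_epochs, weights_match)
```

Restoring the optimizer state matters here: without the saved momentum buffers, the resumed run would generally drift away from the uninterrupted one.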

## 12. Summary and Best Practices

### Key Takeaways

#### 1. **PyTorch Fundamentals**

- **Tensors**: Multi-dimensional arrays with GPU support
- **Autograd**: Automatic differentiation for backpropagation
- **nn.Module**: Base class for all neural networks
- **Optimizers**: Algorithms for updating model parameters

#### 2. **Model Architecture**

- **LeNet-5**: Simple CNN with 2 conv layers, 3 FC layers (~60K parameters)
  - Good for simple image classification tasks
  - Fast training, lower accuracy on complex datasets
- **AlexNet**: Deeper CNN with 5 conv layers, 3 FC layers (~2-3M parameters)
  - Better for complex image classification
  - Higher capacity, better accuracy, but slower training

#### 3. **Training Pipeline**

```python
for epoch in range(num_epochs):
    model.train()  # Training mode
    for inputs, targets in train_loader:
        optimizer.zero_grad()  # Clear gradients
        outputs = model(inputs)  # Forward pass
        loss = criterion(outputs, targets)  # Compute loss
        loss.backward()  # Backward pass
        optimizer.step()  # Update weights

    model.eval()  # Evaluation mode
    with torch.no_grad():  # Disable gradient computation
        # Validation loop
```

#### 4. **Best Practices**

**Data Preparation:**

- Normalize inputs using dataset statistics
- Apply data augmentation for training (RandomCrop, RandomHorizontalFlip)
- Use separate transforms for training and testing
- Split data into train/validation/test sets

**Model Training:**

- Always call `optimizer.zero_grad()` before backward pass
- Use `model.train()` for training, `model.eval()` for evaluation
- Use `torch.no_grad()` during inference to save memory
- Monitor both training and validation metrics
- Use learning rate scheduling for better convergence
- Save checkpoints regularly
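
As an illustration of learning rate scheduling, the sketch below attaches `StepLR` to a toy model and records the learning rate at the start of each epoch. The model, data, and the choices `step_size=2` and `gamma=0.1` are arbitrary assumptions, not the settings used earlier in this tutorial.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
model = nn.Linear(4, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1)
# Multiply the learning rate by gamma every step_size epochs
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.1)

lrs = []
for epoch in range(4):
    lrs.append(optimizer.param_groups[0]['lr'])  # LR used during this epoch

    # One dummy training step standing in for a full epoch
    optimizer.zero_grad()
    loss = model(torch.randn(8, 4)).pow(2).mean()
    loss.backward()
    optimizer.step()

    scheduler.step()  # call once per epoch, after optimizer.step()

print(lrs)
```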

**Debugging:**

- Check tensor shapes at each layer
- Verify data loading with sample batches
- Start with a small learning rate
- Use gradient clipping if gradients explode
- Monitor loss values (should decrease)
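
Gradient clipping can be sketched with `torch.nn.utils.clip_grad_norm_`, which rescales all gradients so that their combined norm does not exceed a threshold. The toy model, the deliberately large inputs, and `max_norm=1.0` below are assumptions chosen only to trigger clipping.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)
inputs = torch.randn(32, 10) * 100   # large inputs to provoke large gradients
targets = torch.randn(32, 1)

loss = nn.functional.mse_loss(model(inputs), targets)
loss.backward()

max_norm = 1.0
# Returns the total gradient norm measured before clipping
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
# Recompute the total norm after clipping for comparison
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))

print(f"Gradient norm before clipping: {norm_before.item():.2f}")
print(f"Gradient norm after clipping:  {norm_after.item():.4f}")
```

In a training loop, the clipping call goes between `loss.backward()` and `optimizer.step()`.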

**Performance Optimization:**

- Use GPU when available (`model.to(device)`)
- Set `num_workers > 0` in DataLoader (on Windows, worker processes are spawned, so scripts need an `if __name__ == '__main__':` guard)
- Use `pin_memory=True` for faster GPU transfer
- Batch multiple samples together
- Use mixed precision training for large models

### Common Pitfalls to Avoid

1. **Forgetting to zero gradients**: Gradients accumulate by default
2. **Not switching to eval mode**: Dropout/BatchNorm behave differently
3. **Using wrong device**: Ensure model and data are on same device
4. **Incorrect tensor shapes**: Always verify dimensions
5. **Learning rate too high**: Can cause training instability
6. **No validation set**: Can't detect overfitting
7. **Saving entire model vs state_dict**: Use state_dict for flexibility
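
The first pitfall is easy to reproduce. In the sketch below, a single scalar parameter (a toy assumption) receives the same gradient twice; without zeroing, the second `backward()` adds to the first.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

(3 * w).backward()
first = w.grad.item()          # d(3w)/dw = 3

(3 * w).backward()             # no zero_grad in between
accumulated = w.grad.item()    # gradients add up: 3 + 3 = 6

optimizer.zero_grad()          # reset before the next backward pass
(3 * w).backward()
after_reset = w.grad.item()    # back to 3

print(first, accumulated, after_reset)
```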

### Next Steps

To continue learning PyTorch:

1. **Advanced Architectures**: ResNet, VGG, EfficientNet
2. **Transfer Learning**: Use pre-trained models on ImageNet
3. **Custom Datasets**: Create your own dataset classes
4. **Advanced Optimizers**: AdamW, RAdam, Ranger
5. **Learning Rate Schedulers**: CosineAnnealing, OneCycleLR
6. **Mixed Precision Training**: Faster training with `torch.amp` (the older `torch.cuda.amp` namespace is deprecated)
7. **Distributed Training**: Multi-GPU training
8. **Other Domains**: NLP (transformers), RL, GANs

### Quick Reference

```python
# Essential imports
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader

# Create model
model = MyModel().to(device)

# Define loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
for epoch in range(num_epochs):
    for inputs, targets in train_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

# Save model
torch.save(model.state_dict(), 'model.pth')

# Load model
model.load_state_dict(torch.load('model.pth', map_location=device, weights_only=True))
```

---

**Congratulations!** You've completed this comprehensive PyTorch tutorial. You now have the knowledge to build, train, and evaluate deep learning models for computer vision tasks.


## Appendix: Additional Examples and Tips

### A. Custom Loss Functions


In [None]:
### A.1 Custom Loss Function Example

class FocalLoss(nn.Module):
    """
    Focal Loss for addressing class imbalance.
    
    Focal Loss = -α(1-pt)^γ * log(pt)
    
    where pt is the model's estimated probability for the true class.
    
    Args:
        alpha (float): Weighting factor (default: 0.25)
        gamma (float): Focusing parameter (default: 2.0)
    """
    def __init__(self, alpha=0.25, gamma=2.0):
        super(FocalLoss, self).__init__()
        self.alpha = alpha
        self.gamma = gamma
        
    def forward(self, inputs, targets):
        """
        Args:
            inputs: (batch_size, num_classes) - raw logits
            targets: (batch_size,) - class indices
        """
        # Get probabilities
        ce_loss = F.cross_entropy(inputs, targets, reduction='none')
        pt = torch.exp(-ce_loss)  # Probability of true class
        
        # Focal loss formula
        focal_loss = self.alpha * (1 - pt) ** self.gamma * ce_loss
        return focal_loss.mean()

# Example usage
focal_loss = FocalLoss(alpha=0.25, gamma=2.0)
dummy_logits = torch.randn(4, 10)
dummy_targets = torch.randint(0, 10, (4,))
loss = focal_loss(dummy_logits, dummy_targets)
print(f"Focal Loss: {loss.item():.4f}")
print("\nFocal Loss is useful when dealing with imbalanced datasets,")
print("as it down-weights easy examples and focuses on hard examples.")
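
One quick sanity check for a focal loss implementation is that it reduces to ordinary cross-entropy when `alpha=1` and `gamma=0`. The sketch below uses a hypothetical functional helper `focal_loss` that mirrors the formula in the class above, applied to random logits.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Functional form of the focal loss: alpha * (1 - pt)^gamma * CE."""
    ce = F.cross_entropy(logits, targets, reduction='none')
    pt = torch.exp(-ce)  # probability assigned to the true class
    return (alpha * (1 - pt) ** gamma * ce).mean()

torch.manual_seed(0)
logits = torch.randn(6, 10)
targets = torch.randint(0, 10, (6,))

plain_ce = F.cross_entropy(logits, targets)
reduced = focal_loss(logits, targets, alpha=1.0, gamma=0.0)  # equals plain CE
focused = focal_loss(logits, targets, alpha=1.0, gamma=2.0)  # down-weighted

print(f"Cross-entropy:        {plain_ce.item():.4f}")
print(f"Focal (gamma=0):      {reduced.item():.4f}")
print(f"Focal (gamma=2):      {focused.item():.4f}")
```

Since `(1 - pt) ** gamma` never exceeds 1, the focal loss with `alpha=1` is never larger than the cross-entropy, and the reduction is strongest for well-classified samples.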

### B. Regularization Techniques


In [None]:
### B.1 Dropout and Batch Normalization

class RegularizedNet(nn.Module):
    """
    Example network with Dropout and Batch Normalization.
    
    Regularization techniques help prevent overfitting:
    - Dropout: Randomly drops neurons during training
    - Batch Normalization: Normalizes layer inputs
    - L2 Regularization: Added through optimizer weight_decay parameter
    """
    def __init__(self, input_size=784, num_classes=10):
        super(RegularizedNet, self).__init__()
        
        self.fc1 = nn.Linear(input_size, 512)
        # nn.BatchNorm1d(num_features) - normalizes batch of features
        # Parameters:
        #   num_features: number of features (same as output of previous layer)
        #   momentum: value used for running mean/var (default: 0.1)
        #   eps: small value for numerical stability (default: 1e-5)
        self.bn1 = nn.BatchNorm1d(512)
        self.dropout1 = nn.Dropout(p=0.5)
        
        self.fc2 = nn.Linear(512, 256)
        self.bn2 = nn.BatchNorm1d(256)
        self.dropout2 = nn.Dropout(p=0.3)
        
        self.fc3 = nn.Linear(256, num_classes)
        
    def forward(self, x):
        # Layer 1
        x = self.fc1(x)
        x = self.bn1(x)  # Normalize
        x = F.relu(x)
        x = self.dropout1(x)  # Dropout (only active during training)
        
        # Layer 2
        x = self.fc2(x)
        x = self.bn2(x)
        x = F.relu(x)
        x = self.dropout2(x)
        
        # Output layer (no dropout or batch norm)
        x = self.fc3(x)
        return x

# Create model
reg_model = RegularizedNet()
print("Regularized Network:")
print(reg_model)
print(f"\nTotal parameters: {sum(p.numel() for p in reg_model.parameters()):,}")

# Test behavior in training vs evaluation mode
reg_model.train()
dummy_input = torch.randn(8, 784)
output_train = reg_model(dummy_input)
print(f"\nTraining mode output (with dropout): shape {output_train.shape}")

reg_model.eval()
output_eval = reg_model(dummy_input)
print(f"Evaluation mode output (no dropout): shape {output_eval.shape}")

print("\nNote: Dropout is only active in training mode!")
print("Always use model.train() for training and model.eval() for inference.")
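
The difference between the two modes is easiest to see on a standalone `nn.Dropout` layer. In this small sketch (the all-ones input is an arbitrary assumption), training mode zeroes roughly half the values and scales the rest by `1 / (1 - p)`, while evaluation mode returns the input unchanged.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.5)
x = torch.ones(1000)

dropout.train()
train_out = dropout(x)   # entries are either 0 or 1 / (1 - 0.5) = 2

dropout.eval()
eval_out = dropout(x)    # identity in evaluation mode

print(f"Training mode unique values: {train_out.unique().tolist()}")
print(f"Fraction dropped: {(train_out == 0).float().mean().item():.2f}")
print(f"Eval output equals input: {torch.equal(eval_out, x)}")
```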

### C. GPU Utilization and Performance Tips


In [None]:
### C.1 Device Management and GPU Utilization

# Check GPU availability
print("=== Device Information ===\n")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"Number of GPUs: {torch.cuda.device_count()}")
    print(f"Current GPU: {torch.cuda.current_device()}")
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")
    
    # GPU memory management
    print(f"\n=== GPU Memory ===")
    print(f"Allocated: {torch.cuda.memory_allocated() / 1024**2:.2f} MB")
    print(f"Reserved: {torch.cuda.memory_reserved() / 1024**2:.2f} MB")
else:
    print("No GPU available, using CPU")

print("\n=== Performance Tips ===\n")

# 1. Moving data to GPU efficiently
print("1. Moving data to GPU:")
print("   - Move model to GPU once: model.to(device)")
print("   - Move batches in training loop: inputs.to(device)")
print("   - Use pin_memory=True in DataLoader for faster transfer")
print()

# 2. Batch size optimization
print("2. Batch size optimization:")
print("   - Larger batch = better GPU utilization")
print("   - But: may not fit in memory, may harm generalization")
print("   - Start with 32-128 and increase until GPU memory is ~80% full")
print()

# 3. Clear cache when needed
print("3. Memory management:")
print("   - Clear unused tensors: del tensor")
print("   - Clear GPU cache: torch.cuda.empty_cache()")
print("   - Use torch.no_grad() for inference")
print()

# 4. Mixed precision training
# (torch.cuda.amp is deprecated; torch.amp is the current namespace)
print("4. Mixed precision training (for faster training):")
print("   scaler = torch.amp.GradScaler('cuda')")
print("   with torch.autocast(device_type='cuda', dtype=torch.float16):")
print("       outputs = model(inputs)")
print()

# Example: Memory-efficient inference
def predict_batch(model, inputs, device, batch_size=32):
    """
    Memory-efficient batch prediction.
    
    Args:
        model: PyTorch model
        inputs: Input tensor
        device: Device to use
        batch_size: Batch size for inference
        
    Returns:
        Predictions
    """
    model.eval()
    predictions = []
    
    with torch.no_grad():  # Disable gradient computation
        for i in range(0, len(inputs), batch_size):
            batch = inputs[i:i+batch_size].to(device)
            output = model(batch)
            predictions.append(output.cpu())  # Move back to CPU
    
    return torch.cat(predictions)

print("5. Example function for memory-efficient inference defined above ↑")

# Benchmark example
if torch.cuda.is_available():
    import time
    
    print("\n=== Quick Benchmark ===\n")
    model_bench = AlexNet(num_classes=10)
    dummy_data = torch.randn(128, 3, 32, 32)
    
    # CPU benchmark
    model_bench_cpu = model_bench.to('cpu')
    dummy_data_cpu = dummy_data.to('cpu')
    
    start = time.time()
    with torch.no_grad():
        _ = model_bench_cpu(dummy_data_cpu)
    cpu_time = time.time() - start
    
    # GPU benchmark
    model_bench_gpu = model_bench.to('cuda')
    dummy_data_gpu = dummy_data.to('cuda')
    
    # Warm up GPU
    with torch.no_grad():
        _ = model_bench_gpu(dummy_data_gpu)
    torch.cuda.synchronize()
    
    start = time.time()
    with torch.no_grad():
        _ = model_bench_gpu(dummy_data_gpu)
    torch.cuda.synchronize()
    gpu_time = time.time() - start
    
    print(f"CPU inference time: {cpu_time*1000:.2f} ms")
    print(f"GPU inference time: {gpu_time*1000:.2f} ms")
    print(f"Speedup: {cpu_time/gpu_time:.2f}x")
else:
    print("\nGPU not available for benchmark")
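
Single-shot timings like the ones above can be noisy. A slightly more robust pattern, sketched below under the assumption of a small stand-in network, adds warm-up runs, averages over several repeats, and synchronizes only when timing on CUDA. The helper name `time_inference` and its default arguments are hypothetical.

```python
import time
import torch
import torch.nn as nn

def time_inference(model, inputs, device, warmup=2, repeats=5):
    """Return the average forward-pass time in milliseconds."""
    model = model.to(device).eval()
    inputs = inputs.to(device)
    with torch.no_grad():
        for _ in range(warmup):          # warm-up runs are not timed
            model(inputs)
        if device.type == 'cuda':
            torch.cuda.synchronize()     # wait for queued GPU work to finish
        start = time.perf_counter()
        for _ in range(repeats):
            model(inputs)
        if device.type == 'cuda':
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
    return elapsed / repeats * 1000

# Small stand-in network for 32x32 RGB inputs
toy_model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(8 * 32 * 32, 10),
)
cpu_ms = time_inference(toy_model, torch.randn(16, 3, 32, 32), torch.device('cpu'))
print(f"Average CPU inference time: {cpu_ms:.2f} ms")
```

`time.perf_counter()` is used instead of `time.time()` because it is a monotonic, high-resolution clock intended for measuring short intervals.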