# Demo toy dataset: CIFAR-10 (32×32) — upscaled to 224×224 for pretrained ResNet-18.
### Student project: Replace CIFAR-10 with ASL Alphabet (Kaggle) (200×200 RGB, 29 classes). 

# Learning Goals

By the end of this notebook, you will:

- Understand transfer learning: why to use pretrained models and when to freeze layers
- Master the ResNet-18 architecture: layers, blocks, skip connections, and their purpose
- Control fine-tuning granularity: freeze/unfreeze specific layers strategically
- Apply best practices: proper data preprocessing, normalization, and augmentation

## 1) Setup & Imports

In [16]:
# Standard PyTorch + Torchvision stack
import torch
from torch import nn, optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models

# Reproducibility (essential for research and debugging)
import random
SEED = 1337
random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)

# Note: For complete reproducibility, you may also need:
# torch.backends.cudnn.deterministic = True
# torch.backends.cudnn.benchmark = False

# Device (GPU if available)
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', DEVICE)

Using device: cuda


## 2) Understanding ResNet-18 Architecture
### What Makes ResNets Special?
The Problem: Very deep plain networks are hard to optimize. Gradients can shrink roughly exponentially as they propagate back through many layers (vanishing gradients), so adding more layers can make training worse rather than better.

The Solution: Residual connections (skip connections) give gradients a direct path around each block, which makes very deep networks trainable.

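### ResNet-18 Architecture Breakdown
ResNet-18 has 18 layers with learnable weights: the 7×7 stem convolution (`conv1`), 16 3×3 convolutions in 8 BasicBlocks (4 stages × 2 blocks × 2 convolutions), and the final fully connected layer (`fc`). The 1×1 downsample convolutions on the shortcut paths are conventionally not counted. The `print(res18)` output further down shows the complete structure.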

## What's a BasicBlock?
Each BasicBlock learns a residual function F(x) that's added to the input:

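The block computes: Output = ReLU(F(x) + x)
where F(x) is the residual (the "difference" from the input). If the best mapping is close to the identity, the block only has to push F(x) toward zero, which is easier than learning the identity from scratch.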
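To make the residual idea concrete, here is a minimal sketch of a block in the same style as torchvision's `BasicBlock`, assuming equal input and output channels and stride 1 (so no downsample path). The class name `TinyBasicBlock` is made up for this illustration. With 64 channels it has the same parameter count as the `layer1` blocks reported later in this notebook, and zeroing the last batch-norm scale turns the block into a plain ReLU of its input.

```python
import torch
from torch import nn


class TinyBasicBlock(nn.Module):
    """Simplified residual block: out = relu(F(x) + x), same shape in and out."""

    def __init__(self, channels: int):
        super().__init__()
        # F(x): conv -> bn -> relu -> conv -> bn
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))  # F(x)
        return self.relu(residual + x)  # skip connection adds the input back


torch.manual_seed(0)
block = TinyBasicBlock(channels=64)
n_params = sum(p.numel() for p in block.parameters())
print(f"{n_params:,} parameters")  # 73,984

# Force F(x) = 0 by zeroing the last batch-norm scale (its bias is already 0).
nn.init.zeros_(block.bn2.weight)
x = torch.randn(2, 64, 8, 8)
is_identity = torch.equal(block(x), torch.relu(x))
print(is_identity)  # True
```

When F(x) is zero the block passes its input straight through (up to the final ReLU), which is why stacking many such blocks does not have to hurt optimization.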

In [20]:
# Let's examine a fresh ResNet-18 pretrained on ImageNet
res18 = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Print the full architecture
print(res18)
print("*"*100)

# Count parameters in each stage
def count_parameters(model):
    """Count trainable parameters by module"""
    param_counts = {}
    for name, module in model.named_children():
        params = sum(p.numel() for p in module.parameters())
        param_counts[name] = f"{params:,}"
    return param_counts

print("\n Parameters per module:")
for module, count in count_parameters(res18).items():
    print(f"  {module:10s}: {count:>12s} params")
print("*"*100)
total_params = sum(p.numel() for p in res18.parameters())
print(f"\n  {'TOTAL':10s}: {total_params:,} params")

ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
  

## 3) Why Transfer Learning? The Power of Pretrained Features

### The Transfer Learning Hypothesis

Networks trained on large datasets (like ImageNet with 1.2M images, 1000 classes) learn hierarchical features:

- Early layers (`conv1`, `layer1`): low-level features (edges, textures, colors)
- Middle layers (`layer2`, `layer3`): mid-level features (shapes, parts, patterns)
- Deep layers (`layer4`): high-level features that are more specific to the original task (object-level patterns)
- Final layer (`fc`): class-specific decision boundaries

Key Insight: Low- and mid-level features transfer well across many vision tasks, so we can reuse them and adapt only the high-level features and the classifier to our new task.

## Fine-Tuning Strategies

### 1) Feature Extraction *(Freeze all, train head)*
- **Pros:** Fastest; lowest overfitting risk  
- **Use when:** Limited data; domain ≈ ImageNet  
- **Unfrozen:** `fc` only

---

### 2) Shallow Fine-Tuning *(Unfreeze layer4 + head)*
- **Pros:** Adapts high-level features; still efficient  
- **Use when:** Moderate data; somewhat different domain  
- **Unfrozen:** `layer4`, `fc`

---

### 3) Deep Fine-Tuning *(Unfreeze layer3 + layer4 + head)*
- **Pros:** Greater adaptation capacity  
- **Use when:** Sufficient data; noticeable domain shift  
- **Unfrozen:** `layer3`, `layer4`, `fc`

---

### 4) Full Fine-Tuning *(Unfreeze everything)*
- **Pros:** Maximum flexibility  
- **Cons:** Slowest; higher overfitting risk  
- **Use when:** Large dataset; very different domain  
- **Unfrozen:** all layers
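
The four strategies above differ only in which child modules stay trainable, so they can be expressed as one small helper. This is a sketch rather than the notebook's own code: `STRATEGIES`, `apply_strategy`, and the `StandIn` model are hypothetical names, and the stand-in merely reuses ResNet-18's child names so the example runs without downloading weights. Passing a real torchvision ResNet-18 should work the same way.

```python
from torch import nn

# Child modules left trainable under each strategy (None = unfreeze everything).
STRATEGIES = {
    "feature_extraction": ["fc"],
    "shallow": ["layer4", "fc"],
    "deep": ["layer3", "layer4", "fc"],
    "full": None,
}


def apply_strategy(model: nn.Module, strategy: str) -> int:
    """Freeze all parameters, unfreeze the strategy's children, return trainable count."""
    unfrozen = STRATEGIES[strategy]
    for p in model.parameters():
        p.requires_grad = unfrozen is None
    if unfrozen is not None:
        for name in unfrozen:
            for p in getattr(model, name).parameters():
                p.requires_grad = True
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


class StandIn(nn.Module):
    """Hypothetical tiny model that only mimics ResNet-18's child names."""

    def __init__(self):
        super().__init__()
        self.layer2 = nn.Linear(8, 8)
        self.layer3 = nn.Linear(8, 8)
        self.layer4 = nn.Linear(8, 8)
        self.fc = nn.Linear(8, 10)

    def forward(self, x):
        return self.fc(self.layer4(self.layer3(self.layer2(x))))


demo = StandIn()
counts = {name: apply_strategy(demo, name) for name in STRATEGIES}
for name, n in counts.items():
    print(f"{name:20s} -> {n:,} trainable params")
```

Remember to build the optimizer after choosing a strategy, so that it only receives the parameters that are actually trainable.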

---

### Practical Tips
- Prefer **smaller LR** for earlier layers (discriminative LRs).
- Add regularization when unfreezing more (augmentations, weight decay, label smoothing).
- Monitor validation; consider early stopping/checkpointing.
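
As a rough illustration of the first two tips, the sketch below builds an optimizer with per-group learning rates and adds weight decay and label smoothing. The values `1e-4`, `weight_decay=1e-4`, and `label_smoothing=0.1` are assumptions chosen for the example, not tuned recommendations, and `StandIn` is a hypothetical stand-in that only borrows the `layer4` and `fc` names.

```python
from torch import nn, optim


class StandIn(nn.Module):
    """Hypothetical tiny model reusing two ResNet-18 child names."""

    def __init__(self):
        super().__init__()
        self.layer4 = nn.Linear(8, 8)
        self.fc = nn.Linear(8, 10)

    def forward(self, x):
        return self.fc(self.layer4(x))


demo_model = StandIn()

LR_BACKBONE = 1e-4  # assumed: smaller step for pretrained layers
LR_HEAD = 1e-3      # larger step for the freshly initialized head

# One parameter group per part of the network, each with its own learning rate.
demo_optimizer = optim.AdamW(
    [
        {"params": demo_model.layer4.parameters(), "lr": LR_BACKBONE},
        {"params": demo_model.fc.parameters(), "lr": LR_HEAD},
    ],
    weight_decay=1e-4,  # applied to both groups
)

# Label smoothing softens the one-hot targets as a mild regularizer.
smooth_criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

group_lrs = [g["lr"] for g in demo_optimizer.param_groups]
print(group_lrs)  # [0.0001, 0.001]
```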





## 4) Data Preprocessing: Why ImageNet Statistics?

### Understanding ImageNet Normalization
Pretrained networks expect inputs with specific statistics because they were trained on normalized ImageNet data:

In [21]:
# ImageNet channel-wise statistics (computed over millions of images)
IMAGENET_MEAN = [0.485, 0.456, 0.406]  # Mean per channel (R, G, B)
IMAGENET_STD  = [0.229, 0.224, 0.225]  # Std dev per channel

# Why these specific values?
# - After ToTensor() scales pixels to [0, 1], they center each channel near 0
#   and scale it to roughly [-2.1, 2.6]
# - This matches the input distribution the network was trained on
# - Network weights are calibrated to these input scales
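
A quick worked check of what normalization does to the value range: for pixel values in [0, 1], each channel maps to [(0 - mean) / std, (1 - mean) / std]. The helper name `normalized_range` is made up for this illustration.

```python
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]


def normalized_range(mean: float, std: float) -> tuple:
    """Range of (x - mean) / std when x lies in [0, 1]."""
    return (0.0 - mean) / std, (1.0 - mean) / std


ranges = [normalized_range(m, s) for m, s in zip(IMAGENET_MEAN, IMAGENET_STD)]
for channel, (lo, hi) in zip("RGB", ranges):
    print(f"{channel}: [{lo:.2f}, {hi:.2f}]")
# R: [-2.12, 2.25]
# G: [-2.04, 2.43]
# B: [-1.80, 2.64]
```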

## 4.1 Input Size Requirements

ResNet-18 was trained on 224×224 crops from ImageNet. However, it accepts other input sizes because:

1. Convolutional layers are size-agnostic (the kernels slide across an input of any size)

2. The adaptive average pooling layer (`avgpool`) before `fc` reduces any spatial size to 1×1, so `fc` always receives 512 features
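
The two points above can be checked with plain arithmetic. The sketch below applies the standard convolution output-size formula to the stride-2 layers of ResNet-18 (the stem convolution, the max-pool, and the first convolution of `layer2`, `layer3`, and `layer4`) to get the feature-map size that reaches the adaptive pooling layer. The function names are made up for this example.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a conv/pool layer along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def resnet18_feature_size(size: int) -> int:
    """Side length of the layer4 output for a square input of side `size`."""
    size = conv_out(size, kernel=7, stride=2, padding=3)  # conv1
    size = conv_out(size, kernel=3, stride=2, padding=1)  # maxpool
    # layer1 keeps the size; layer2, layer3, and layer4 each halve it once.
    for _ in range(3):
        size = conv_out(size, kernel=3, stride=2, padding=1)
    return size


sizes = {s: resnet18_feature_size(s) for s in (112, 224, 256, 448)}
print(sizes)  # {112: 4, 224: 7, 256: 8, 448: 14}
```

Whatever this final side length is, adaptive average pooling collapses it to 1×1, which is why the classifier input stays at 512 features.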

In [23]:
# ResNet input size flexibility demonstration
def test_input_sizes(model):
    """Test different input sizes through ResNet"""
    test_sizes = [112, 224, 256, 448]
    model.eval()
    
    print(" Testing input size flexibility:")
    for size in test_sizes:
        x = torch.randn(1, 3, size, size)
        with torch.no_grad():
            output = model(x)
        print(f"  Input: (3×{size}×{size}) → Output: {output.shape}")
    
    print("\n Note: While flexible, best performance is near 224×224")
    print("   (the training resolution)")

test_input_sizes(res18)

 Testing input size flexibility:
  Input: (3×112×112) → Output: torch.Size([1, 1000])
  Input: (3×224×224) → Output: torch.Size([1, 1000])
  Input: (3×256×256) → Output: torch.Size([1, 1000])
  Input: (3×448×448) → Output: torch.Size([1, 1000])

 Note: While flexible, best performance is near 224×224
   (the training resolution)


## 4.2 Data Augmentation Philosophy

In [24]:
IMG_SIZE = 224          # Standard ImageNet size
BATCH_SIZE = 64

# Training transforms. Note: this pipeline only resizes and normalizes;
# no random augmentation is applied here, so it matches val_tf below.
# Random augmentations (for example transforms.RandomHorizontalFlip())
# would go after the resize.
train_tf = transforms.Compose([
    # 1. Resize: CIFAR-10 is 32×32, we need 224×224
    transforms.Resize((IMG_SIZE, IMG_SIZE)),
    
    # 2. Convert to tensor: PIL Image → Tensor, scales to [0,1]
    transforms.ToTensor(),
    
    # 3. Normalize: Match ImageNet statistics
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD)
    # This does: output = (input - mean) / std
])

# Validation transforms: No augmentation (we want consistent evaluation)
val_tf = transforms.Compose([
    transforms.Resize((IMG_SIZE, IMG_SIZE)),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD)
])

# Load CIFAR-10 dataset
train_ds = datasets.CIFAR10(root='./data', train=True,  download=True, transform=train_tf)
val_ds   = datasets.CIFAR10(root='./data', train=False, download=True, transform=val_tf)

train_loader = DataLoader(
    train_ds, 
    batch_size=BATCH_SIZE, 
    shuffle=True,           # Randomize order each epoch
    num_workers=2,          # Parallel data loading
    pin_memory=True         # Faster GPU transfer
)
val_loader = DataLoader(
    val_ds, 
    batch_size=BATCH_SIZE, 
    shuffle=False,          # Keep validation order consistent
    num_workers=2, 
    pin_memory=True
)

NUM_CLASSES = 10
print(f' Dataset: {len(train_ds):,} train, {len(val_ds):,} val')
print(f' Classes: {train_ds.classes}')

Files already downloaded and verified
Files already downloaded and verified
 Dataset: 50,000 train, 10,000 val
 Classes: ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']


## 5) Model Setup: Adapting ResNet-18 for the Toy Task

### Replacing the Classification Head

The pretrained ResNet-18 outputs 1000 classes (ImageNet), but we need 10 (CIFAR-10):

In [25]:
# Start with ImageNet-pretrained weights
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Examine the original classifier
print(" Original FC layer:")
print(f"  Input features: {model.fc.in_features}")
print(f"  Output features: {model.fc.out_features} (ImageNet classes)")

# Replace with our custom classifier
# The in_features must match (512 for ResNet-18's final feature size)
# The NUM_CLASSES will change for other datasets
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

print("\n New FC layer:")
print(f"  Input features: {model.fc.in_features}")
print(f"  Output features: {model.fc.out_features} (our classes)")

# Move model to GPU if available
model = model.to(DEVICE)

 Original FC layer:
  Input features: 512
  Output features: 1000 (ImageNet classes)

 New FC layer:
  Input features: 512
  Output features: 10 (our classes)


## Understanding Parameter Names and Hierarchy
To selectively freeze/unfreeze layers, we need to understand PyTorch's parameter naming:

In [26]:
def explore_model_structure(model, max_depth=2):
    """Visualize the model's hierarchical structure"""
    
    print("\n Model Structure (hierarchical view):")
    
    def print_module(module, prefix="", depth=0):
        if depth >= max_depth:
            return
        for name, child in module.named_children():
            param_count = sum(p.numel() for p in child.parameters())
            trainable = sum(p.numel() for p in child.parameters() if p.requires_grad)
            print(f"{prefix}├── {name}: {child.__class__.__name__} "
                  f"({param_count:,} params, {trainable:,} trainable)")
            if depth < max_depth - 1:
                print_module(child, prefix + "│   ", depth + 1)
    
    print_module(model)

# Explore structure
explore_model_structure(model)


 Model Structure (hierarchical view):
├── conv1: Conv2d (9,408 params, 9,408 trainable)
├── bn1: BatchNorm2d (128 params, 128 trainable)
├── relu: ReLU (0 params, 0 trainable)
├── maxpool: MaxPool2d (0 params, 0 trainable)
├── layer1: Sequential (147,968 params, 147,968 trainable)
│   ├── 0: BasicBlock (73,984 params, 73,984 trainable)
│   ├── 1: BasicBlock (73,984 params, 73,984 trainable)
├── layer2: Sequential (525,568 params, 525,568 trainable)
│   ├── 0: BasicBlock (230,144 params, 230,144 trainable)
│   ├── 1: BasicBlock (295,424 params, 295,424 trainable)
├── layer3: Sequential (2,099,712 params, 2,099,712 trainable)
│   ├── 0: BasicBlock (919,040 params, 919,040 trainable)
│   ├── 1: BasicBlock (1,180,672 params, 1,180,672 trainable)
├── layer4: Sequential (8,393,728 params, 8,393,728 trainable)
│   ├── 0: BasicBlock (3,673,088 params, 3,673,088 trainable)
│   ├── 1: BasicBlock (4,720,640 params, 4,720,640 trainable)
├── avgpool: AdaptiveAvgPool2d (0 params, 0 trainable)
├── f
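
Parameter names are built by joining the names of nested modules with dots, which is what makes prefix-based selection possible. The toy model below is a hypothetical structure that only mimics ResNet-style naming (it is not meant to be run forward); the same `named_parameters()` call applies to the real model.

```python
from collections import OrderedDict
from torch import nn


def make_block() -> nn.Sequential:
    """Tiny block with ResNet-like child names (illustration only)."""
    return nn.Sequential(OrderedDict(
        conv1=nn.Conv2d(4, 4, kernel_size=3, padding=1, bias=False),
        bn1=nn.BatchNorm2d(4),
    ))


# Sequential without names numbers its children: 0, 1, ...
toy = nn.Sequential(OrderedDict(
    layer1=nn.Sequential(make_block(), make_block()),
    fc=nn.Linear(4, 10),
))

names = [name for name, _ in toy.named_parameters()]
for name in names:
    print(name)
# layer1.0.conv1.weight
# layer1.0.bn1.weight
# layer1.0.bn1.bias
# layer1.1.conv1.weight
# layer1.1.bn1.weight
# layer1.1.bn1.bias
# fc.weight
# fc.bias

# Select every parameter of the second block in layer1 by its prefix.
second_block = [n for n in names if n.startswith("layer1.1.")]
```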

## 6) Freezing and Unfreezing: The Core Mechanism
### How Freezing Works
When we "freeze" a layer, we set `requires_grad=False` on its parameters:

- Frozen parameters: no gradients are computed for them, so the optimizer never updates them
- Unfrozen parameters: gradients are computed and the weights are updated
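
A small experiment makes this visible. In the sketch below, two linear layers stand in for a backbone and a head (both names are hypothetical); after the backward pass only the unfrozen layer has a gradient.

```python
import torch
from torch import nn

torch.manual_seed(0)
backbone = nn.Linear(4, 4)  # stand-in for pretrained layers
head = nn.Linear(4, 2)      # stand-in for the new classifier

# Freeze the backbone.
for p in backbone.parameters():
    p.requires_grad = False

x = torch.randn(8, 4)
loss = head(backbone(x)).sum()
loss.backward()

print(backbone.weight.grad)    # None: no gradient was stored
print(head.weight.grad.shape)  # torch.Size([2, 4])
```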

In [29]:
def set_requires_grad(module: nn.Module, requires_grad: bool):
    """
    Recursively set requires_grad for all parameters in a module.
    
    Args:
        module: PyTorch module (layer, block, or entire model)
        requires_grad: True to unfreeze (train), False to freeze
    """
    for param in module.parameters():
        param.requires_grad = requires_grad
    
    # Print status
    param_count = sum(p.numel() for p in module.parameters())
    status = "UNFROZEN (trainable)" if requires_grad else "FROZEN"
    print(f"  {module.__class__.__name__}: {param_count:,} parameters {status}")

# Example: Freeze entire model, then selectively unfreeze
print(" Freezing entire model...")
set_requires_grad(model, False)

print("\n Unfreezing only the FC layer...")
set_requires_grad(model.fc, True)

# Verify what's trainable
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())
print(f"\n Trainable: {trainable_params:,} / {total_params:,} parameters "
      f"({100*trainable_params/total_params:.6f}%)")

 Freezing entire model...
  ResNet: 11,181,642 parameters FROZEN

 Unfreezing only the FC layer...
  Linear: 5,130 parameters UNFROZEN (trainable)

 Trainable: 5,130 / 11,181,642 parameters (0.045879%)
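
One caveat worth knowing: `requires_grad=False` stops weight updates, but batch-norm layers still update their running mean and variance whenever the module is in training mode. The sketch below demonstrates this on a single layer and shows one possible way to keep the statistics fixed. The helper `freeze_batchnorm_stats` is a hypothetical name, and whether to freeze the statistics at all is a design choice, not something this notebook prescribes.

```python
import torch
from torch import nn


def freeze_batchnorm_stats(module: nn.Module) -> None:
    """Put every batch-norm layer inside `module` into eval mode."""
    for m in module.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.eval()


torch.manual_seed(0)
bn = nn.BatchNorm2d(3)
for p in bn.parameters():
    p.requires_grad = False  # "frozen" in the requires_grad sense

x = torch.randn(16, 3, 8, 8) + 5.0  # inputs with mean around 5

bn.train()
bn(x)  # the forward pass alone updates the running statistics
changed_in_train = not torch.equal(bn.running_mean, torch.zeros(3))

freeze_batchnorm_stats(bn)
snapshot = bn.running_mean.clone()
bn(x)
unchanged_in_eval = torch.equal(bn.running_mean, snapshot)

print(changed_in_train, unchanged_in_eval)  # True True
```

Because `model.train()` switches every submodule back to training mode, a helper like this would have to be called again after each `model.train()` call.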


## 7) Training Infrastructure

### Training and Evaluation Functions

In [30]:
criterion = nn.CrossEntropyLoss()

def train_one_epoch(model, loader, optimizer):
    """
    Train for one epoch.
    
    Returns:
        tuple: (average_loss, accuracy)
    """
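    # Training mode: dropout is active and batch norm normalizes with batch statistics.
    # Note: batch norm running stats are updated even in layers with requires_grad=False.
    model.train()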
    
    total_samples = 0
    correct_predictions = 0
    running_loss = 0.0
    
    for batch_idx, (images, labels) in enumerate(loader):
        # Move data to device (GPU/CPU)
        images, labels = images.to(DEVICE), labels.to(DEVICE)
        
        # Forward pass
        optimizer.zero_grad()  # Clear previous gradients
        logits = model(images)
        loss = criterion(logits, labels)
        
        # Backward pass
        loss.backward()  # Compute gradients
        optimizer.step()  # Update weights
        
        # Track metrics
        running_loss += loss.item() * images.size(0)
        predictions = logits.argmax(dim=1)
        correct_predictions += (predictions == labels).sum().item()
        total_samples += images.size(0)
        
        # Optional: Print progress
        if batch_idx % 100 == 0:
            print(f"    Batch {batch_idx}/{len(loader)}, "
                  f"Loss: {loss.item():.4f}")
    
    avg_loss = running_loss / total_samples
    accuracy = correct_predictions / total_samples
    return avg_loss, accuracy

@torch.no_grad()  # Decorator disables gradient computation
def evaluate(model, loader):
    """
    Evaluate model on validation/test set.
    
    Returns:
        tuple: (average_loss, accuracy)
    """
    model.eval()  # Eval mode: dropout disabled, batch norm uses its running statistics
    
    total_samples = 0
    correct_predictions = 0
    running_loss = 0.0
    
    for images, labels in loader:
        images, labels = images.to(DEVICE), labels.to(DEVICE)
        
        # Forward pass only (no backward)
        logits = model(images)
        loss = criterion(logits, labels)
        
        # Track metrics
        running_loss += loss.item() * images.size(0)
        predictions = logits.argmax(dim=1)
        correct_predictions += (predictions == labels).sum().item()
        total_samples += images.size(0)
    
    avg_loss = running_loss / total_samples
    accuracy = correct_predictions / total_samples
    return avg_loss, accuracy
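
Both functions multiply each batch loss by the batch size before summing. The reason is that the last batch is usually smaller: with 50,000 training images and a batch size of 64 there are 782 batches, and the final one holds only 16 images. The toy numbers below (two batches with losses 1.0 and 2.0) are made up to show how a sample-weighted average differs from a plain mean of batch means.

```python
import math


def epoch_average(batch_losses, batch_sizes):
    """Sample-weighted mean of per-batch mean losses."""
    total = sum(loss * n for loss, n in zip(batch_losses, batch_sizes))
    return total / sum(batch_sizes)


n_train, batch_size = 50_000, 64
n_batches = math.ceil(n_train / batch_size)
last_batch = n_train - (n_batches - 1) * batch_size
print(n_batches, last_batch)  # 782 16

weighted = epoch_average([1.0, 2.0], [64, 16])  # (64*1.0 + 16*2.0) / 80
naive = (1.0 + 2.0) / 2                         # treats both batches equally
print(weighted, naive)  # 1.2 1.5
```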

## 8) Phase 1.1: Head-Only Fine-Tuning (Feature Extraction)
### Strategy: Use ResNet as a Fixed Feature Extractor

In this phase, we:

1. Freeze all convolutional layers (keep ImageNet features)

2. Train only the new classifier head (learn new class boundaries)

3. Use a relatively high learning rate (the new head is randomly initialized, so it is trained from scratch)

This is the safest approach with limited data!

In [31]:
# Hyperparameters for Phase 1
EPOCHS_HEAD_ONLY = 3    
LR_HEAD = 1e-3          

print("\n" + "="*60)
print(" PHASE 1: HEAD-ONLY FINE-TUNING")
print("="*60)

# Step 1: Freeze entire model
print("\n Freezing all layers...")
set_requires_grad(model, False)

# Step 2: Unfreeze only the classifier head
print("\n Unfreezing classifier head...")
set_requires_grad(model.fc, True)

# Step 3: Create optimizer for ONLY trainable parameters
# filter() ensures we only optimize parameters with requires_grad=True
trainable_params = filter(lambda p: p.requires_grad, model.parameters())
optimizer = optim.Adam(trainable_params, lr=LR_HEAD)

print(f"\n Optimizer setup:")
print(f"   Learning rate: {LR_HEAD}")
print(f"   Trainable params: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")

# Step 4: Training loop
print("\n Training progress:")
print("-" * 60)

best_val_acc = 0.0
for epoch in range(1, EPOCHS_HEAD_ONLY + 1):
    print(f"\nEpoch {epoch}/{EPOCHS_HEAD_ONLY}")
    
    # Train
    train_loss, train_acc = train_one_epoch(model, train_loader, optimizer)
    
    # Validate
    val_loss, val_acc = evaluate(model, val_loader)
    
    # Track best model (strict improvement only, so ties are not reported as new bests)
    is_best = val_acc > best_val_acc
    if is_best:
        best_val_acc = val_acc
        # Optional: Save best model
        # torch.save(model.state_dict(), 'best_model_phase1.pth')
    
    print(f"   Train: Loss={train_loss:.4f}, Acc={train_acc:.3f}")
    print(f"   Val:   Loss={val_loss:.4f}, Acc={val_acc:.3f} "
          f"{' New best!' if is_best else ''}")

print("\n Phase 1 Complete!")
print(f"   Best validation accuracy: {best_val_acc:.3f}")


 PHASE 1: HEAD-ONLY FINE-TUNING

 Freezing all layers...
  ResNet: 11,181,642 parameters FROZEN

 Unfreezing classifier head...
  Linear: 5,130 parameters UNFROZEN (trainable)

 Optimizer setup:
   Learning rate: 0.001
   Trainable params: 5,130

 Training progress:
------------------------------------------------------------

Epoch 1/3


    Batch 0/782, Loss: 2.4818
    Batch 100/782, Loss: 0.9739
    Batch 200/782, Loss: 0.9588
    Batch 300/782, Loss: 0.7352
    Batch 400/782, Loss: 0.9352
    Batch 500/782, Loss: 0.8188
    Batch 600/782, Loss: 0.6524
    Batch 700/782, Loss: 0.5866
   Train: Loss=0.8212, Acc=0.734
   Val:   Loss=0.6470, Acc=0.777  New best!

Epoch 2/3
    Batch 0/782, Loss: 0.7357
    Batch 100/782, Loss: 0.5079
    Batch 200/782, Loss: 0.7547
    Batch 300/782, Loss: 0.5252
    Batch 400/782, Loss: 0.4904
    Batch 500/782, Loss: 0.5411
    Batch 600/782, Loss: 0.4878
    Batch 700/782, Loss: 0.6596
   Train: Loss=0.6179, Acc=0.787
   Val:   Loss=0.5880, Acc=0.801  New best!

Epoch 3/3
    Batch 0/782, Loss: 0.6366
    Batch 100/782, Loss: 0.5225
    Batch 200/782, Loss: 0.7769
    Batch 300/782, Loss: 0.3899
    Batch 400/782, Loss: 0.4761
    Batch 500/782, Loss: 0.4913
    Batch 600/782, Loss: 0.5393
    Batch 700/782, Loss: 0.6337
   Train: Loss=0.5900, Acc=0.795
   Val:   Loss=0.5956, Acc=0.