<div style="display: flex; justify-content: space-between; align-items: center;">
    <div style="text-align: left; flex: 4">
        <strong>Author:</strong> Amirhossein Heydari | 
        📧 <a href="mailto:amirhosseinheydari78@gmail.com">amirhosseinheydari78@gmail.com</a> | 
        🐙 <a href="https://github.com/mr-pylin/pytorch-workshop" target="_blank" rel="noopener">github.com/mr-pylin</a>
    </div>
    <div style="text-align: right; flex: 1;">
        <a href="https://pytorch.org/" target="_blank" rel="noopener noreferrer">
            <img src="../assets/images/pytorch/logo/pytorch-logo-dark.svg" 
                 alt="PyTorch Logo"
                 style="max-height: 48px; width: auto; background-color: #ffffff; border-radius: 8px;">
        </a>
    </div>
</div>
<hr>


**Table of contents**<a id='toc0_'></a>    
- [Dependencies](#toc1_)    
- [Prepare a Dataset](#toc2_)    
- [Model Reuse](#toc3_)    
  - [Transfer Learning](#toc3_1_)    
    - [Transfer Learning via Feature Extraction](#toc3_1_1_)    
      - [Load a Pre-trained Model](#toc3_1_1_1_)    
      - [Create a Feature Extractor](#toc3_1_1_2_)    
      - [Define a Custom Classifier](#toc3_1_1_3_)    
      - [Training Loop (Demo)](#toc3_1_1_4_)    
    - [End-to-End Transfer Learning](#toc3_1_2_)    
      - [Load a Pre-trained Model](#toc3_1_2_1_)    
      - [Define End-to-End Model](#toc3_1_2_2_)    
      - [Check Gradients](#toc3_1_2_3_)    
      - [Training Loop (Demo)](#toc3_1_2_4_)    
  - [Fine-tuning](#toc3_2_)    
    - [Fine-tuning Strategies](#toc3_2_1_)    
      - [Full Fine-Tuning](#toc3_2_1_1_)    
        - [Load a Pre-trained Model](#toc3_2_1_1_1_)    
        - [Unfreeze All Parameters](#toc3_2_1_1_2_)    
        - [Training Loop (Demo)](#toc3_2_1_1_3_)    
      - [Partial Fine-Tuning](#toc3_2_1_2_)    
        - [Load a Pre-trained Model](#toc3_2_1_2_1_)    
        - [Partially Unfreeze Parameters](#toc3_2_1_2_2_)    
        - [Training Loop (Demo)](#toc3_2_1_2_3_)    
      - [Progressive Fine-Tuning](#toc3_2_1_3_)    
        - [Load a Pre-trained Model](#toc3_2_1_3_1_)    
        - [Stage 1: Train classifier only](#toc3_2_1_3_2_)    
        - [Stage 2: Unfreeze Last Block](#toc3_2_1_3_3_)    
        - [Stage 3: Fine-tune Entire Network](#toc3_2_1_3_4_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Dependencies](#toc0_)


In [None]:
from pathlib import Path

import matplotlib.pyplot as plt
import torch
from torch import nn
from torch.optim import Adam
from torch.utils.data import DataLoader
from torchinfo import summary
from torchvision.datasets import CIFAR10
from torchvision.models import MobileNet_V3_Small_Weights, mobilenet_v3_small
from torchvision.models.feature_extraction import create_feature_extractor
from torchvision.transforms import v2

In [None]:
# disable automatic figure display (plt.show() required)
# this ensures consistency with .py scripts and gives full control over when plots appear
plt.ioff()

In [None]:
# set a seed for deterministic results
seed = 42
torch.manual_seed(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

In [None]:
# check if cuda is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# log
device

In [None]:
# update paths as needed based on your project structure
DATASET_DIR = Path("../datasets")

# <a id='toc2_'></a>[Prepare a Dataset](#toc0_)


In [None]:
transform = v2.Compose(
    [
        v2.ToImage(),
        v2.Resize((224, 224)),
        v2.ToDtype(torch.float32, scale=True),
        v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),  # ImageNet normalization
    ]
)

In [None]:
trainset = CIFAR10(root=DATASET_DIR, train=True, download=False, transform=transform)
train_loader = DataLoader(trainset, batch_size=32, shuffle=True, num_workers=2)

In [None]:
# log
print("trainset:")
print(f"    -> trainset.data.shape    : {trainset.data.shape}")
print(f"    -> trainset.data.dtype    : {trainset.data.dtype}")
print(f"    -> type(trainset.data)    : {type(trainset.data)}")
print(f"    -> type(trainset.targets) : {type(trainset.targets)}")
print("-" * 50)
print(f"classes : {trainset.classes}")
print(f"trainset distribution : {torch.unique(torch.tensor(trainset.targets), return_counts=True)[1]}")

In [None]:
# plot
fig, axs = plt.subplots(nrows=4, ncols=8, figsize=(12, 6), layout="compressed")
for i in range(4):
    for j in range(8):
        axs[i, j].imshow(trainset.data[i * 8 + j])  # RGB images: no colormap needed
        axs[i, j].set_title(trainset.classes[trainset.targets[i * 8 + j]])
        axs[i, j].axis("off")
plt.show()

# <a id='toc3_'></a>[Model Reuse](#toc0_)


## <a id='toc3_1_'></a>[Transfer Learning](#toc0_)

- Transfer learning is the practice of **reusing knowledge** from a **pretrained model** to solve a new but **related task**.
- Early layers of neural networks learn **general features** like edges, textures, or shapes that are often useful across tasks.

📈 **Motivation**:

- Reduces training time and data requirements.
- Leverages knowledge from large datasets (e.g., ImageNet) for smaller or domain-specific tasks.

📉 **Common Pitfalls**:

- **Mismatched input size or channels:** Pretrained models expect specific input shapes (e.g., 224×224×3).
- **Overfitting downstream model:** Use regularization if downstream data is small.
- **Feature collapse:** Some extracted features may not be informative for new tasks.
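
As an illustration of the first pitfall, here is a minimal sketch of adapting a batch to the shape an ImageNet backbone expects. The helper name `adapt_to_rgb_224` and the random grayscale batch are assumptions made for this example, not part of the workshop code.

```python
import torch
import torch.nn.functional as F


def adapt_to_rgb_224(x: torch.Tensor) -> torch.Tensor:
    """Convert a (B, 1, H, W) or (B, 3, H, W) batch to (B, 3, 224, 224)."""
    if x.shape[1] == 1:
        # copy the single gray channel into three identical channels
        x = x.repeat(1, 3, 1, 1)
    # resize spatial dimensions to the resolution the backbone was trained on
    return F.interpolate(x, size=(224, 224), mode="bilinear", align_corners=False)


# hypothetical grayscale batch, e.g. 28x28 digit images
gray_batch = torch.rand(4, 1, 28, 28)
rgb_batch = adapt_to_rgb_224(gray_batch)
print(tuple(rgb_batch.shape))
```

Repeating the channel is the simplest option. Replacing the first convolution of the backbone is another, at the cost of discarding its pretrained weights.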


### <a id='toc3_1_1_'></a>[Transfer Learning via Feature Extraction](#toc0_)


#### <a id='toc3_1_1_1_'></a>[Load a Pre-trained Model](#toc0_)


In [None]:
pretrained_model = mobilenet_v3_small(weights=MobileNet_V3_Small_Weights.IMAGENET1K_V1)

In [None]:
# set to evaluation mode
pretrained_model.eval()

In [None]:
# freeze backbone parameters
for param in pretrained_model.parameters():
    param.requires_grad = False

#### <a id='toc3_1_1_2_'></a>[Create a Feature Extractor](#toc0_)


In [None]:
# extract features from the last layer before classifier
return_nodes = {"avgpool": "embedding"}
feature_extractor = create_feature_extractor(pretrained_model, return_nodes)

#### <a id='toc3_1_1_3_'></a>[Define a Custom Classifier](#toc0_)


In [None]:
class Classifier(nn.Module):
    def __init__(self, embedding_size: int = 576, num_classes: int = 10):
        super().__init__()
        self.fc = nn.Linear(embedding_size, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # flatten from (B, C, 1, 1) → (B, C)
        x = torch.flatten(x, 1)
        return self.fc(x)

In [None]:
classifier = Classifier(embedding_size=576, num_classes=10)

#### <a id='toc3_1_1_4_'></a>[Training Loop (Demo)](#toc0_)


In [None]:
feature_extractor.to(device)
classifier.to(device)

In [None]:
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)

In [None]:
for epoch in range(2):
    for x, y_true in train_loader:
        x, y_true = x.to(device), y_true.to(device)

        # extract features from frozen backbone
        with torch.no_grad():
            features = feature_extractor(x)["embedding"]

        # forward through new classifier
        y_pred = classifier(features)
        loss = criterion(y_pred, y_true)

        # backpropagate only through classifier
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    # log
    print(f"Epoch {epoch+1}: Loss: {loss.item():.4f}")
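
Because the backbone is frozen, the loop above recomputes identical embeddings in every epoch. One possible optimisation, valid only when the input transform is deterministic (no random augmentation), is to compute the embeddings once and train the classifier on the cached tensors. The sketch below shows the idea with a tiny stand-in backbone and random data, both of which are assumptions for illustration.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# hypothetical stand-ins: a tiny frozen "backbone" and random image data
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
).eval()
for p in backbone.parameters():
    p.requires_grad = False

images = torch.rand(64, 3, 32, 32)
labels = torch.randint(0, 10, (64,))
raw_loader = DataLoader(TensorDataset(images, labels), batch_size=16)

# pass 1: run the frozen backbone once and store the embeddings
feats, targets = [], []
with torch.no_grad():
    for x, y in raw_loader:
        feats.append(backbone(x))
        targets.append(y)
cached = TensorDataset(torch.cat(feats), torch.cat(targets))

# pass 2: train only the classifier on the cached embeddings
classifier = nn.Linear(8, 10)
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
for epoch in range(2):
    for f, y in DataLoader(cached, batch_size=16, shuffle=True):
        loss = criterion(classifier(f), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

print(tuple(cached.tensors[0].shape))
```

The trade-off is memory: the cache holds one embedding per training sample, which is small for pooled features but can be large for earlier feature maps.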

### <a id='toc3_1_2_'></a>[End-to-End Transfer Learning](#toc0_)


#### <a id='toc3_1_2_1_'></a>[Load a Pre-trained Model](#toc0_)


In [None]:
pretrained_model = mobilenet_v3_small(weights=MobileNet_V3_Small_Weights.IMAGENET1K_V1)

In [None]:
# set to evaluation mode
pretrained_model.eval()

In [None]:
# freeze backbone parameters
for param in pretrained_model.parameters():
    param.requires_grad = False

#### <a id='toc3_1_2_2_'></a>[Define End-to-End Model](#toc0_)


In [None]:
class CustomTransferModel(nn.Module):
    def __init__(self, backbone: nn.Module, num_classes: int = 10):
        super().__init__()

        # remove original classifier (fc layer)
        self.backbone = backbone.features
        self.pool = nn.AdaptiveAvgPool2d((1, 1))
        self.classifier = nn.Linear(576, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.backbone(x)
        x = self.pool(x)
        x = torch.flatten(x, start_dim=1)
        x = self.classifier(x)
        return x

In [None]:
# instantiate model
custom_model = CustomTransferModel(backbone=pretrained_model, num_classes=10)
custom_model

In [None]:
summary(custom_model, input_size=(1, 3, 224, 224), device="cpu")

#### <a id='toc3_1_2_3_'></a>[Check Gradients](#toc0_)


In [None]:
for name, param in custom_model.named_parameters():
    print(f"{name:<30s} -> requires_grad: {param.requires_grad}")
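
A per-parameter listing gets long for real models, so a compact count of trainable versus frozen parameters is a useful complement. The helper `count_parameters` and the tiny model below are assumptions for illustration.

```python
import torch
from torch import nn


def count_parameters(model: nn.Module) -> tuple[int, int]:
    """Return (trainable, frozen) parameter counts."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
    return trainable, frozen


# hypothetical tiny model: first layer frozen, last layer trainable
model = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 2))
for p in model[0].parameters():
    p.requires_grad = False

trainable, frozen = count_parameters(model)
print(f"trainable: {trainable}, frozen: {frozen}")  # trainable: 8, frozen: 15
```

Applied to `custom_model`, the trainable count should equal the size of the new linear classifier alone.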

#### <a id='toc3_1_2_4_'></a>[Training Loop (Demo)](#toc0_)


In [None]:
custom_model.to(device)

In [None]:
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(custom_model.classifier.parameters(), lr=1e-3)  # only train classifier

In [None]:
for epoch in range(2):
    for x, y_true in train_loader:
        x, y_true = x.to(device), y_true.to(device)

        # forward
        y_pred = custom_model(x)
        loss = criterion(y_pred, y_true)

        # backward
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    # log
    print(f"Epoch {epoch+1}: Loss: {loss.item():.4f}")

## <a id='toc3_2_'></a>[Fine-tuning](#toc0_)

- Fine-tuning is the practice of **starting from a pretrained model** and **updating some or all of its weights** on a new dataset.
- Instead of keeping the backbone frozen, the model is allowed to **adapt its learned representations** to better fit the target task.

📈 **Motivation**:

- Improves performance when the target dataset differs from the source dataset.
- Allows higher-level features to specialize for the new task.
- Particularly useful when the target dataset is moderately sized.

📉 **Common Pitfalls**:

- **Overfitting:** Updating too many parameters with limited data can harm generalization.
- **Catastrophic forgetting:** The model may lose useful pretrained knowledge.
- **Learning rate misconfiguration:** Using too large a learning rate can destroy pretrained features; typically a smaller learning rate is used for pretrained layers.
- **Unstable training:** Fine-tuning without proper normalization or scheduling may cause training divergence.
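
One common way to give pretrained layers a smaller learning rate than a freshly initialised head is to pass parameter groups to the optimizer. The sketch below uses a hypothetical two-part model. The names `backbone` and `head` and the specific learning rates are assumptions, not recommendations.

```python
import torch
from torch import nn

# hypothetical two-part model: "backbone" plays the pretrained layers,
# "head" plays the newly added classifier
model = nn.ModuleDict(
    {
        "backbone": nn.Sequential(nn.Linear(16, 8), nn.ReLU()),
        "head": nn.Linear(8, 10),
    }
)

# each dict is one parameter group with its own learning rate
optimizer = torch.optim.Adam(
    [
        {"params": model["backbone"].parameters(), "lr": 1e-5},
        {"params": model["head"].parameters(), "lr": 1e-3},
    ]
)

for i, group in enumerate(optimizer.param_groups):
    print(f"group {i}: lr={group['lr']:g}, tensors={len(group['params'])}")
```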


### <a id='toc3_2_1_'></a>[Fine-tuning Strategies](#toc0_)


#### <a id='toc3_2_1_1_'></a>[Full Fine-Tuning](#toc0_)

- All pretrained layers are **unfrozen and learnable**.
- The entire network adapts to the new dataset.
- Provides maximum flexibility and adaptation.
- Requires a **small learning rate** to preserve useful pretrained knowledge.


##### <a id='toc3_2_1_1_1_'></a>[Load a Pre-trained Model](#toc0_)


In [None]:
pretrained_model = mobilenet_v3_small(weights=MobileNet_V3_Small_Weights.IMAGENET1K_V1)

In [None]:
# replace classifier for CIFAR-10
in_features = pretrained_model.classifier[3].in_features
pretrained_model.classifier[3] = nn.Linear(in_features, 10)

##### <a id='toc3_2_1_1_2_'></a>[Unfreeze All Parameters](#toc0_)


In [None]:
for param in pretrained_model.parameters():
    param.requires_grad = True

##### <a id='toc3_2_1_1_3_'></a>[Training Loop (Demo)](#toc0_)


In [None]:
pretrained_model = pretrained_model.to(device)

In [None]:
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(
    pretrained_model.parameters(),
    lr=1e-5,  # critical: small learning rate
)

In [None]:
for epoch in range(2):

    pretrained_model.train()
    total_loss = 0

    for x, y_true in train_loader:
        x, y_true = x.to(device), y_true.to(device)

        # forward
        y_pred = pretrained_model(x)
        loss = criterion(y_pred, y_true)

        # backward
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        # calculate loss
        total_loss += loss.item()

    avg_loss = total_loss / len(train_loader)

    # log
    print(f"Epoch {epoch+1}: Loss: {avg_loss:.4f}")
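
The loop above uses a constant learning rate. A short linear warmup is one option sometimes used to soften the first updates when all pretrained weights are trainable. The sketch below builds one with `LambdaLR`. The stand-in model, the random batch, and `warmup_steps = 4` are assumptions for illustration.

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import LambdaLR

torch.manual_seed(0)

model = nn.Linear(4, 2)  # hypothetical stand-in for the pretrained network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

# scale the base LR linearly from 1/warmup_steps up to 1, then hold it
warmup_steps = 4
scheduler = LambdaLR(optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps))

x, y = torch.rand(8, 4), torch.randint(0, 2, (8,))
criterion = nn.CrossEntropyLoss()

lrs = []
for _ in range(6):
    lrs.append(scheduler.get_last_lr()[0])  # LR used for this step
    loss = criterion(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()  # advance the schedule once per batch

print(lrs)
```

Here the scheduler is stepped per batch rather than per epoch, since warmup typically spans only the first few hundred updates.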

#### <a id='toc3_2_1_2_'></a>[Partial Fine-Tuning](#toc0_)

- Only **some layers** (usually higher layers) are unfrozen.
- Early layers remain frozen because they contain general features.
- Later layers adapt to task-specific features.
- Provides a balance between stability and adaptability.


##### <a id='toc3_2_1_2_1_'></a>[Load a Pre-trained Model](#toc0_)


In [None]:
pretrained_model = mobilenet_v3_small(weights=MobileNet_V3_Small_Weights.IMAGENET1K_V1)

In [None]:
# replace classifier for CIFAR-10
in_features = pretrained_model.classifier[3].in_features
pretrained_model.classifier[3] = nn.Linear(in_features, 10)

##### <a id='toc3_2_1_2_2_'></a>[Partially Unfreeze Parameters](#toc0_)


In [None]:
pretrained_model.features[-1]

In [None]:
# freeze entire backbone
for param in pretrained_model.features.parameters():
    param.requires_grad = False

In [None]:
# unfreeze last block of features
for param in pretrained_model.features[-1].parameters():
    param.requires_grad = True

In [None]:
# classifier should be learnable
for param in pretrained_model.classifier.parameters():
    param.requires_grad = True

In [None]:
# check parameters
for name, param in pretrained_model.named_parameters():
    print(f"{name:<30s} -> requires_grad: {param.requires_grad}")

##### <a id='toc3_2_1_2_3_'></a>[Training Loop (Demo)](#toc0_)


In [None]:
pretrained_model = pretrained_model.to(device)

In [None]:
criterion = nn.CrossEntropyLoss()

# pass only the trainable params so the optimizer tracks exactly the tensors that get updated
optimizer = torch.optim.Adam(
    filter(lambda p: p.requires_grad, pretrained_model.parameters()),
    lr=1e-4,  # higher than full fine-tuning since fewer params update
)

In [None]:
for epoch in range(2):

    pretrained_model.train()
    total_loss = 0

    for x, y_true in train_loader:
        x, y_true = x.to(device), y_true.to(device)

        # forward
        y_pred = pretrained_model(x)
        loss = criterion(y_pred, y_true)

        # backward
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        # calculate loss
        total_loss += loss.item()

    avg_loss = total_loss / len(train_loader)

    # log
    print(f"Epoch {epoch+1}: Loss: {avg_loss:.4f}")
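
One detail the loop above leaves implicit: `pretrained_model.train()` also switches the BatchNorm layers inside the frozen blocks to training mode, so their running statistics keep updating even though their weights do not. Whether that is desirable depends on the task. If the frozen blocks should stay exactly as pretrained, a helper like the hypothetical `keep_frozen_batchnorm_in_eval` below can be called after every `.train()` call. The tiny model is an assumption for illustration.

```python
import torch
from torch import nn


def keep_frozen_batchnorm_in_eval(model: nn.Module) -> None:
    """Put BatchNorm layers whose parameters are all frozen back into eval mode."""
    for module in model.modules():
        if isinstance(module, nn.modules.batchnorm._BatchNorm):
            params = list(module.parameters())
            if params and all(not p.requires_grad for p in params):
                module.eval()


# hypothetical tiny model: a frozen block followed by a trainable block
model = nn.Sequential(
    nn.Sequential(nn.Conv2d(3, 4, 3, padding=1), nn.BatchNorm2d(4), nn.ReLU()),
    nn.Sequential(nn.Conv2d(4, 4, 3, padding=1), nn.BatchNorm2d(4), nn.ReLU()),
)
for p in model[0].parameters():
    p.requires_grad = False

model.train()                         # sets every submodule to training mode
keep_frozen_batchnorm_in_eval(model)  # undo that for frozen BatchNorm layers

frozen_bn, trainable_bn = model[0][1], model[1][1]
before = frozen_bn.running_mean.clone()
model(torch.rand(8, 3, 16, 16))

print(torch.equal(frozen_bn.running_mean, before))  # frozen statistics untouched
print(trainable_bn.training)                        # trainable block still in train mode
```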

#### <a id='toc3_2_1_3_'></a>[Progressive Fine-Tuning](#toc0_)

- Layers are **gradually unfrozen during training**.
- Training starts with fewer learnable layers and increases over time.
- Improves stability and reduces risk of damaging pretrained representations.
- Common in research and professional workflows.


##### <a id='toc3_2_1_3_1_'></a>[Load a Pre-trained Model](#toc0_)


In [None]:
pretrained_model = mobilenet_v3_small(weights=MobileNet_V3_Small_Weights.IMAGENET1K_V1)

In [None]:
# replace classifier for CIFAR-10
in_features = pretrained_model.classifier[3].in_features
pretrained_model.classifier[3] = nn.Linear(in_features, 10)

In [None]:
pretrained_model = pretrained_model.to(device)
criterion = nn.CrossEntropyLoss()

##### <a id='toc3_2_1_3_2_'></a>[Stage 1: Train classifier only](#toc0_)


In [None]:
# freeze all backbone
for param in pretrained_model.features.parameters():
    param.requires_grad = False

In [None]:
optimizer = torch.optim.Adam(
    filter(lambda p: p.requires_grad, pretrained_model.parameters()),
    lr=1e-3,
)

In [None]:
for epoch in range(2):
    for x, y_true in train_loader:
        x, y_true = x.to(device), y_true.to(device)

        # forward
        y_pred = pretrained_model(x)
        loss = criterion(y_pred, y_true)

        # backward
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    # log
    print(f"Epoch {epoch+1}: Loss: {loss.item():.4f}")

##### <a id='toc3_2_1_3_3_'></a>[Stage 2: Unfreeze Last Block](#toc0_)


In [None]:
# unfreeze last block
for param in pretrained_model.features[-1].parameters():
    param.requires_grad = True

In [None]:
optimizer = torch.optim.Adam(
    filter(lambda p: p.requires_grad, pretrained_model.parameters()),
    lr=1e-4,
)

In [None]:
for epoch in range(2):
    for x, y_true in train_loader:
        x, y_true = x.to(device), y_true.to(device)

        # forward
        y_pred = pretrained_model(x)
        loss = criterion(y_pred, y_true)

        # backward
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    # log
    print(f"Epoch {epoch+1}: Loss: {loss.item():.4f}")

##### <a id='toc3_2_1_3_4_'></a>[Stage 3: Fine-tune Entire Network](#toc0_)


In [None]:
# unfreeze all parameters
for param in pretrained_model.parameters():
    param.requires_grad = True

In [None]:
optimizer = torch.optim.Adam(
    pretrained_model.parameters(),
    lr=1e-5,  # very small LR for stability
)

In [None]:
for epoch in range(2):
    for x, y_true in train_loader:
        x, y_true = x.to(device), y_true.to(device)

        # forward
        y_pred = pretrained_model(x)
        loss = criterion(y_pred, y_true)

        # backward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # log
    print(f"Epoch {epoch+1}: Loss: {loss.item():.4f}")
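
The three stages above repeat the same pattern: unfreeze some modules, rebuild the optimizer with a new learning rate, train. As a sketch, that pattern can be driven by a list of stages. The helper `start_stage`, the `TinyNet` stand-in (which only mimics the `features`/`classifier` attribute names of MobileNetV3), and the random data are assumptions for illustration.

```python
import torch
from torch import nn


def start_stage(model: nn.Module, modules_to_unfreeze: list, lr: float) -> torch.optim.Optimizer:
    """Unfreeze the given modules and build a fresh optimizer over all trainable params."""
    for module in modules_to_unfreeze:
        for p in module.parameters():
            p.requires_grad = True
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=lr)


def count_trainable(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


class TinyNet(nn.Module):
    """Hypothetical stand-in with the same attribute names as MobileNetV3."""

    def __init__(self) -> None:
        super().__init__()
        self.features = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8))
        self.classifier = nn.Linear(8, 10)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))


torch.manual_seed(0)
model = TinyNet()
for p in model.parameters():
    p.requires_grad = False  # start fully frozen

# (modules to unfreeze at the start of the stage, learning rate)
stages = [
    ([model.classifier], 1e-3),    # stage 1: classifier only
    ([model.features[-1]], 1e-4),  # stage 2: plus last block
    ([model], 1e-5),               # stage 3: entire network
]

x, y = torch.rand(16, 8), torch.randint(0, 10, (16,))
criterion = nn.CrossEntropyLoss()
trainable_per_stage = []

for modules, lr in stages:
    optimizer = start_stage(model, modules, lr)
    trainable_per_stage.append(count_trainable(model))
    for _ in range(2):  # a couple of demo steps per stage
        loss = criterion(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

print(trainable_per_stage)  # [90, 162, 234]
```

Rebuilding the optimizer at each stage discards Adam's moment estimates. Adding new parameter groups to an existing optimizer with `optimizer.add_param_group` is an alternative that keeps them.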