<div style="line-height:0.5">
  <h1 style="color:#BF66F2 ">  Fine-tuning with pretrained models in PyTorch </h1>
  <span style="display: inline-block;">
      <h3 style="color: lightblue; display: inline;">Keywords:</h3>
    transforms.Compose + WeightedRandomSampler + Loss + Optimizers  
    </span>

</div>
<br>
<div style="line-height:1.8">
  <h3 style="color:red; margin-bottom: 0;"> Notes: </h3>
  <div style="line-height:0.6">
    <div style="line-height:1.5">
      Fine-tuning a pretrained model trains faster because most parameters are frozen, and it tends to generalize better because the features were learned on a large dataset. <br>
      VGG16 alternatives:   <br>
      - ResNet - A very common alternative, especially ResNet-18 or ResNet-50. Works well for transfer learning. <br>
      - Inception - Also suitable for transfer learning, though sometimes slightly worse than ResNet. <br>
      - MobileNet - A more efficient architecture, good if you have memory or latency constraints. <br>

In this case the custom dataset consists of dog images. It starts out imbalanced, and the imbalance is fixed with weighted sampling. <br>  WeightedRandomSampler => samples indices from [0,..,len(weights)-1] with the given probabilities (a sequence of weights, one per sample). <br>
    </div>
  </div>
</div>
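
The sampling idea can be tried on a toy label list before touching the image data. This is a minimal sketch, not the notebook's loader: the labels (8 samples of class 0, 2 of class 1), the seed and the number of draws are assumptions made up for illustration.

```python
import torch
from torch.utils.data import WeightedRandomSampler

# Hypothetical toy labels: class 0 has 8 samples, class 1 has only 2.
labels = torch.tensor([0] * 8 + [1] * 2)

# Weight of a sample = 1 / (number of samples in its class).
class_counts = torch.bincount(labels)
sample_weights = 1.0 / class_counts[labels].float()

torch.manual_seed(0)  # only to make the sketch repeatable
sampler = WeightedRandomSampler(sample_weights, num_samples=10_000, replacement=True)

# The sampler yields dataset indices, not the samples themselves.
drawn_indices = torch.tensor(list(sampler))
drawn_labels = labels[drawn_indices]
fraction_class_1 = (drawn_labels == 1).float().mean().item()
print(f"Fraction of draws from the minority class: {fraction_class_1:.2f}")
```

Because each class ends up with the same total weight, the minority class should be drawn roughly half of the time, even though it holds only 20% of the samples.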


In [5]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'  # TensorFlow log-level setting (hides its INFO and WARNING logs); it has no effect on PyTorch

In [6]:
import numpy as np

import torch
import torchvision
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as Func
import torchvision.datasets as datasets
import torchvision.transforms as transforms
from torch.utils.data import WeightedRandomSampler, DataLoader

from tqdm import tqdm

from adabound import AdaBound

In [7]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cpu')

The ImageFolder class expects the root directory to contain subdirectories for each class, <br> and each subdirectory should contain images belonging to that class.

In [9]:
dset = datasets.ImageFolder(root="./dataset/dogs/")

In [10]:
def imb_get_loader(root_dir, batch_size):
    """ DataLoader object for loading images from the given root directory. \\
    Load and sample from imbalanced image datasets, where some classes have significantly fewer images than others. \\
    
    Parameters:
        - Root directory containing subdirectories of images [str]
        - Batch size to use for loading images [int]
    
    Details: 
        - #1 Calculate class weights for weighted random sampling after building the path;
        - #2 Calculate sample weights for weighted random sampling;
        - #3 Use WeightedRandomSampler for weighted random sampling;
        - #4 Create DataLoader object using ImageFolder dataset and WeightedRandomSampler.

    Returns:
        PyTorch DataLoader object for the specified dataset.
    """
    my_transforms = transforms.Compose(
        [
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
        ]
    )
    dataset = datasets.ImageFolder(root=root_dir, transform=my_transforms)
    # #1 Class weights: inverse of the number of images in each class.
    # dataset.targets holds the label of every sample, so no image has to be loaded here.
    class_counts = np.bincount(dataset.targets, minlength=len(dataset.classes))
    class_weights = 1.0 / class_counts

    # #2 Sample weights: each sample gets the weight of its class.
    sample_weights = class_weights[dataset.targets]


    sampler = WeightedRandomSampler(sample_weights, num_samples=len(sample_weights), replacement=True)
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)

    return loader

In [11]:
desired_bat, tot_samples = 10, len(dset)

# Batch size based on num of samples and desired number of batches
batch_size_custom = int(np.ceil(tot_samples / desired_bat))
# Create the DataLoader with num of batches => total / size 
loader = imb_get_loader(root_dir="dataset/dogs/", batch_size=9)  #batch_size=batch_size_custom

for batch_idx, (data, labels) in enumerate(loader):
    print(f"Batch {batch_idx}: {data.shape}, {labels.shape}")

Batch 0: torch.Size([9, 3, 224, 224]), torch.Size([9])
Batch 1: torch.Size([9, 3, 224, 224]), torch.Size([9])
Batch 2: torch.Size([9, 3, 224, 224]), torch.Size([9])
Batch 3: torch.Size([9, 3, 224, 224]), torch.Size([9])
Batch 4: torch.Size([9, 3, 224, 224]), torch.Size([9])
Batch 5: torch.Size([6, 3, 224, 224]), torch.Size([6])
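
The batch counts follow from ceiling division. A quick check, assuming the 51 images implied by the output above (five batches of 9 plus one of 6):

```python
import math

tot_samples = 51   # assumption: read off the printed batch shapes (5 * 9 + 6)
desired_bat = 10

# Fixed batch size of 9, as passed to imb_get_loader above.
n_batches_fixed = math.ceil(tot_samples / 9)
last_batch_fixed = tot_samples - 9 * (n_batches_fixed - 1)

# Batch size derived from the desired number of batches.
batch_size_custom = math.ceil(tot_samples / desired_bat)
n_batches_custom = math.ceil(tot_samples / batch_size_custom)

print(n_batches_fixed, last_batch_fixed, batch_size_custom, n_batches_custom)
```

Rounding the batch size up means the loader can produce fewer batches than `desired_bat`: here a batch size of 6 gives 9 batches rather than 10.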


In [14]:
num_goldens, num_swedish = 0, 0

for epoch in range(10):
    for data, labels in loader:
        num_goldens += torch.sum(labels == 0)
        num_swedish += torch.sum(labels == 1)

print("Total dog breed Golden is: ", num_goldens.item())
print("Total dog breed Swedish is:", num_swedish.item())

Total dog breed Golden is:  247
Total dog breed Swedish is: 263


<h3 style="color:#BF66F2"> Recap: </h3>
<div style="margin-top: -10px;">

**VGG16:** <br>
Convolutional neural network architecture [2014] for image classification tasks.    
It has 16 weight layers (13 convolutional and 3 fully connected),   
and is known for its simplicity and effectiveness in image recognition tasks.     
The pretrained VGG16 weights come from training on ImageNet with 1000 classes.   


<div style="line-height:1.5">
Using a pre-trained model can help with generalization since it has been trained on a large dataset like ImageNet. 
<div style="line-height:1.3">
<div>
<div>

- Freeze the existing parameters with requires_grad=False.    
- Replace the final classification layer with one that has num_classes outputs (10 here; reduced to 2 later for the two dog breeds).   
- Only train the new classification layer, keeping the feature layers frozen
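
The three steps above can be sketched on a tiny stand-in network. This is an illustration under stated assumptions, not the notebook's model: `backbone`, `head` and `toy_model` are hypothetical names, and a real run would use the pretrained VGG16 instead of the small convolutional stack.

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Stand-in for a pretrained network (assumption: a real run would use torchvision's VGG16).
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)

# 1) Freeze the existing parameters.
for param in backbone.parameters():
    param.requires_grad = False

# 2) Attach a new head; freshly created layers have requires_grad=True by default.
head = nn.Linear(8, 10)
toy_model = nn.Sequential(backbone, head)

# 3) Give the optimizer only the parameters that are still trainable.
trainable = [p for p in toy_model.parameters() if p.requires_grad]
toy_optimizer = optim.SGD(trainable, lr=1e-3)

# One training step on random data to show where gradients end up.
scores = toy_model(torch.randn(4, 3, 16, 16))
loss = nn.CrossEntropyLoss()(scores, torch.tensor([0, 1, 2, 3]))
loss.backward()
toy_optimizer.step()

n_trainable = sum(p.numel() for p in trainable)
print(f"Trainable parameters: {n_trainable}")
```

After `backward()`, only the head holds gradients; the frozen backbone parameters keep `grad=None`, which is what makes the training step cheaper.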

In [15]:
""" Hyperparameters """
num_classes = 10
learning_rate = 1e-3
batch_size = 1024
num_epochs = 5

In [16]:
# Load VGG16 with the default pretrained weights; it is modified in the cells below
model = torchvision.models.vgg16(weights="DEFAULT")

In [17]:
# Set requires_grad = False to freeze the pretrained parameters for fine-tuning.
# Remove the two loop lines below to train the entire model, starting from the pretrained weights.
for param in model.parameters():
    param.requires_grad = False

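# Drop the adaptive average pooling so the conv features are flattened directly.
model.avgpool = nn.Identity()
# Note: in_features=512 only fits 32x32 inputs; 224x224 inputs flatten to 25088 features,
# so this classifier is replaced before training.
model.classifier = nn.Sequential(nn.Linear(512, 100), nn.ReLU(), nn.Linear(100, num_classes))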
# Send to GPU
model.to(device)

VGG(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU(inplace=True)
    (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (3): ReLU(inplace=True)
    (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (6): ReLU(inplace=True)
    (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (8): ReLU(inplace=True)
    (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace=True)
    (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (13): ReLU(inplace=True)
    (14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (15): ReLU(inplace=True)
    (16): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1

<h3 style="color:#BF66F2 ">  Recap: </h3>
<div style="margin-top: -19px;">

<div style="line-height:1.3">
<div>
<div>

- The LOSS defines how we measure error/success for our model predictions.
- The OPTIMIZER minimizes a specified loss function by updating the model weights.

<div style="line-height:0.22">
<h3 style="color:#BF66F2 ">  Loss </h3>
<div style="line-height:1.3">
<div>
<div>

**Losses:**

- CrossEntropyLoss - Loss for multi-class classification; takes raw logits and class indices.
- BCEWithLogitsLoss - Loss for binary classification; takes raw logits and applies the sigmoid internally.
- MSELoss - Loss for regression problems (mean squared error).
- NLLLoss - Negative log likelihood loss; expects log-probabilities as input.
- L1Loss - Loss based on the L1 norm (mean absolute error).
- KLDivLoss - Kullback–Leibler divergence loss.
- HuberLoss - Robust regression loss: quadratic for small errors, linear for large ones.
- PoissonNLLLoss - Loss for modeling count data.
- BCELoss - Binary cross entropy loss; expects probabilities in [0, 1].
- CTCLoss - Connectionist Temporal Classification loss, for sequence outputs.

**Optimizers (one line each):**

- SGD - Basic stochastic gradient descent optimizer.
- Adam - Adaptive optimizer that works well for non-convex optimization problems.
- RMSprop - Optimizer that divides the gradient by a decaying average of its recent magnitude.
- Adadelta - Adaptive learning rate method that extends Adagrad.
- Adamax - Adaptive optimizer using the infinity norm.
- SparseAdam - Variant of Adam for sparse gradients.
- AdaBound - Adaptive optimizer that combines Adam and SGD.
- Rprop - Uses the derivative sign for optimization.
- ASGD - Averaged Stochastic Gradient Descent optimizer.
- RAdam - Rectified variant of the Adam optimizer.
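
Several of these losses differ mainly in the kind of input they expect. The sketch below, on random toy tensors (an assumption for illustration), checks two pairs that should agree: CrossEntropyLoss against NLLLoss on log-softmax outputs, and BCEWithLogitsLoss against BCELoss on sigmoid outputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 3)            # raw scores for 4 samples, 3 classes
targets = torch.tensor([0, 2, 1, 2])  # class indices

# CrossEntropyLoss takes raw logits; NLLLoss expects log-probabilities.
ce = nn.CrossEntropyLoss()(logits, targets)
nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), targets)

# BCEWithLogitsLoss takes raw logits; BCELoss expects probabilities in [0, 1].
binary_logits = torch.randn(4)
binary_targets = torch.tensor([0.0, 1.0, 1.0, 0.0])
bce_logits = nn.BCEWithLogitsLoss()(binary_logits, binary_targets)
bce = nn.BCELoss()(torch.sigmoid(binary_logits), binary_targets)

print(ce.item(), nll.item())
print(bce_logits.item(), bce.item())
```

Each pair prints matching values up to floating-point error, so the choice within a pair depends on whether the model already applies softmax or sigmoid.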

<div style="line-height:0.3">
<h3 style="color:#BF66F2 ">  Optimizers </h3>
<div style="line-height:1.5">
SGD - Stochastic Gradient Descent. The basic optimizer that performs parameter updates proportionally to the negative of the gradients.
<div style="line-height:1.5">
<div>
<div>
<div>

- Adam - Adaptive Moment Estimation optimizer. It computes an individual adaptive learning rate for each parameter <br> from estimates of the first and second moments of the gradients.   
- RMSprop - Uses a moving average of squared gradients to normalize the gradients, dampening the impact of oscillations or large gradients.     
- Adadelta - Extends Adagrad: scales each update by the ratio between a running average of recent updates and a running average of recent squared gradients.     
- Adagrad - Accumulates the sum of squared gradients over time and divides the learning rate by the square root of this accumulated sum.     
- AdaBound - Combines the benefits of Adam and SGD. Applies dynamic bounds on the learning rate, so that it starts like Adam and ends <br> like SGD.     
- Rprop - Uses only the sign of the gradient to determine the update direction; the step size is adapted per parameter, <br> growing while the sign stays the same and shrinking when it flips.     
- SparseAdam - A variant of Adam for sparse gradients. Only the moment estimates of the entries that appear in the gradient are updated.     
- ASGD - Averaged Stochastic Gradient Descent. Runs SGD and additionally keeps an average of the parameters over the iterations.     
- RAdam - Rectified Adam. A variant of Adam that rectifies the variance of the adaptive learning rate during the first training steps.     
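
The plain SGD rule described above, a step proportional to the negative gradient, can be checked by hand. A minimal sketch on a toy two-element parameter (the values and learning rate are assumptions for illustration):

```python
import torch
import torch.optim as optim

lr = 0.1
w = torch.tensor([1.0, -2.0], requires_grad=True)
sgd = optim.SGD([w], lr=lr)

loss = (w ** 2).sum()  # the gradient of this loss is 2 * w
loss.backward()

# The update computed by hand: w - lr * grad.
expected = w.detach() - lr * w.grad
sgd.step()

print(w.detach(), expected)
```

The adaptive optimizers in the list change how the step is scaled per parameter, but they all start from this same gradient.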

<h3 style="color:#BF66F2 ">  Recap: AMSGrad (amsgrad=True in Adam)</h3>
<div style="margin-top: -10px;">

<div style="line-height:1.5">
AMSGrad addresses some issues with the original Adam optimizer, specifically:  <br>
    - Adam can fail to converge in some settings, even while the step-size is decayed.   <br> 
    - Adam stores an exponentially decaying average of past squared gradients, which can forget large earlier gradients and let the effective learning rate grow again.  
<div style="line-height:1.5">
<div>
<div>
<div>
--> Keeps the running maximum of the mean squared gradient estimate    <br>
--> Uses that maximum in the denominator when calculating the adaptive learning rate   <br>
--> Intended to stabilize Adam; whether it improves results depends on the task   <br>
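
The difference between the two second-moment estimates can be traced in plain Python. This is a simplified sketch: it leaves out bias correction and the parameter update itself, and the gradient sequence and `beta2` value are assumptions chosen so the effect shows up in a few steps.

```python
beta2 = 0.9                         # small value so the decay is visible quickly
grads = [10.0, 0.1, 0.1, 0.1, 0.1]  # hypothetical: one large gradient, then small ones

v, v_hat = 0.0, 0.0
adam_history, amsgrad_history = [], []

for g in grads:
    # Exponential moving average of squared gradients (the estimate Adam uses).
    v = beta2 * v + (1 - beta2) * g * g
    # AMSGrad keeps the largest average seen so far and uses it in the denominator.
    v_hat = max(v_hat, v)
    adam_history.append(v)
    amsgrad_history.append(v_hat)

print("Adam    v:    ", [round(x, 3) for x in adam_history])
print("AMSGrad v_hat:", [round(x, 3) for x in amsgrad_history])
```

The plain average decays once the large gradient is in the past, while the running maximum never decreases, so the effective step size cannot grow back.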

<h2 style="color:#BF66F2 "> <u>  => 1) First trial: CTCLoss and SGD selected, then CrossEntropyLoss and Adam for training </u></h2>

In [18]:
""" Loss and optimizer options.
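    Uncomment one loss and one optimizer.
    The loss measures the prediction error (e.g. BCEWithLogitsLoss for binary classification problems);
    the optimizer (e.g. RMSprop) updates the weights to minimize that loss function.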
"""
#criterion = nn.CrossEntropyLoss()
#criterion = nn.BCEWithLogitsLoss()
#criterion = nn.MSELoss()
#criterion = nn.NLLLoss()
#criterion = nn.L1Loss()
#criterion = nn.KLDivLoss()
#criterion = nn.HuberLoss()
#criterion = nn.PoissonNLLLoss()
#criterion = nn.BCELoss()
criterion = nn.CTCLoss()  # sequence loss, not suited to image classification; replaced in the next cell

optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=0.8)
# optimizer = optim.Adam(model.parameters(), lr=learning_rate, amsgrad=True)   
# optimizer = optim.RMSprop(model.parameters(), lr=learning_rate, momentum=0.9)
# optimizer = optim.Adadelta(model.parameters(), lr=learning_rate)
# optimizer = optim.Adamax(model.parameters(), lr=learning_rate)
# optimizer = optim.SparseAdam(model.parameters(), lr=learning_rate)
# optimizer = AdaBound(model.parameters(), lr=learning_rate, weight_decay=0.7, betas=(0.6, 0.9))
# optimizer = optim.Rprop(model.parameters(), lr=learning_rate)
# optimizer = optim.ASGD(model.parameters(), lr=learning_rate)
# optimizer = optim.RAdam(model.parameters(), lr=learning_rate)

In [19]:
""" Modify the model to have 2 output classes.
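- The in_features of the first Linear layer must match the flattened feature size,
  otherwise nn.Linear raises a RuntimeError.
N.B.
"mat1 and mat2 shapes cannot be multiplied": mat1 is (batch x size (25088 in this case)) and mat2 is (Linear_in_features x Linear_out_features).
"""
# New layers are created on the CPU, so move them to the same device as the rest of the model.
model.classifier = nn.Sequential(nn.Linear(25088, 100), nn.ReLU(), nn.Linear(100, 2)).to(device)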

## Adjust loss and optimizer for 2 classes 
criterion = nn.CrossEntropyLoss()  
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

######## Forward pass on DataLoader_1 (the loss is computed, but no backward or optimizer step is taken here)
for epoch in range(num_epochs):
    losses = []

    for batch_idx, (data, targets) in enumerate(tqdm(loader)):
        # Get data to cuda if possible
        data = data.to(device=device)
        targets = targets.to(device=device)
        # Forward
        scores = model(data)
        loss = criterion(scores, targets)

100%|██████████| 6/6 [00:24<00:00,  4.05s/it]
100%|██████████| 6/6 [00:22<00:00,  3.79s/it]
100%|██████████| 6/6 [00:25<00:00,  4.19s/it]
100%|██████████| 6/6 [00:29<00:00,  4.85s/it]
100%|██████████| 6/6 [00:25<00:00,  4.31s/it]
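
The 25088 used for the classifier above is the flattened size of the convolutional output. A minimal sketch of how such a number can be obtained, assuming a hypothetical helper `flattened_size` and a tiny stand-in feature extractor instead of VGG16 itself:

```python
import torch
import torch.nn as nn


def flattened_size(feature_extractor, input_shape):
    """Number of features per sample after the extractor's output is flattened."""
    with torch.no_grad():
        dummy = torch.zeros(1, *input_shape)
        return feature_extractor(dummy).flatten(start_dim=1).shape[1]


# Tiny stand-in for a convolutional feature extractor (not VGG16 itself).
toy_features = nn.Sequential(
    nn.Conv2d(3, 4, kernel_size=3, padding=1),
    nn.MaxPool2d(2),
    nn.Conv2d(4, 8, kernel_size=3, padding=1),
    nn.MaxPool2d(2),
)
toy_size = flattened_size(toy_features, (3, 32, 32))  # 8 channels * 8 * 8

# VGG16 arithmetic: five 2x2 max-pools reduce 224 to 7, with 512 output channels.
vgg16_size = 512 * (224 // 2 ** 5) ** 2

print(toy_size, vgg16_size)
```

Passing a dummy batch through the feature layers is a way to read the size off directly instead of deriving it by hand when the input resolution changes.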


In [20]:
# Load the 2-class dog dataset 
loader = imb_get_loader(root_dir="dataset/dogs/", batch_size=batch_size_custom)

############# Train again loader    
for epoch in range(num_epochs):
    losses = []    

    for batch_idx, (data, targets) in enumerate(tqdm(loader)):
        # Get data to cuda if possible
        data = data.to(device=device)
        targets = targets.to(device=device)  
        ## Forward
        scores = model(data)
        loss = criterion(scores, targets)

        losses.append(loss.item())
        # Backward
        optimizer.zero_grad()
        loss.backward()
        # Gradient descent or Adam step  
        optimizer.step()

    print(f"Cost at epoch {epoch} is {sum(losses)/len(losses):.5f}")

100%|██████████| 9/9 [00:26<00:00,  2.99s/it]


Cost at epoch 0 is 0.11726


100%|██████████| 9/9 [00:28<00:00,  3.19s/it]


Cost at epoch 1 is 0.00000


100%|██████████| 9/9 [00:31<00:00,  3.48s/it]


Cost at epoch 2 is 0.00000


100%|██████████| 9/9 [00:35<00:00,  3.91s/it]


Cost at epoch 3 is 0.00000


100%|██████████| 9/9 [00:40<00:00,  4.49s/it]

Cost at epoch 4 is 0.00000





In [32]:
""" Accuracy checking.   
N.B.
predictions = model(x).argmax(dim=1, keepdim=True) => leads to num_correct being larger than num_samples!
With keepdim=True the output of argmax has shape (batch_size, 1). Comparing it with y, which has shape (batch_size,),
broadcasts both to (batch_size, batch_size), so matches are counted across all pairs instead of element-wise.
"""

def check_accuracy(loader, model):
    num_correct = 0
    num_samples = 0
    model.eval() 
    
    with torch.no_grad():
        for x, y in loader:
            # Move the batch to the same device as the model (uses the global `device`).
            x, y = x.to(device), y.to(device)
            predictions = model(x).argmax(dim=1)  # shape (batch_size,), same as y
            # Convert to Python scalar
            num_correct += (predictions == y).sum().item()
            num_samples += predictions.size(0)
        accuracy = (num_correct / num_samples) * 100    
        print(f"Got {num_correct} / {num_samples} with accuracy {accuracy:.2f}%")
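
The keepdim pitfall from the note above can be reproduced on a toy batch. The logits and labels below are assumptions chosen so that every prediction is correct:

```python
import torch

logits = torch.tensor([[2.0, 0.0], [0.0, 2.0], [2.0, 0.0], [0.0, 2.0]])
y = torch.tensor([0, 1, 0, 1])                  # shape (4,)

pred_flat = logits.argmax(dim=1)                # shape (4,)
pred_kept = logits.argmax(dim=1, keepdim=True)  # shape (4, 1)

# Same shapes: element-wise comparison, one result per sample.
correct_flat = (pred_flat == y).sum().item()
# (4, 1) == (4,) broadcasts to a (4, 4) matrix, so matches are counted across pairs.
correct_kept = (pred_kept == y).sum().item()

print(correct_flat, correct_kept)
```

With four samples the element-wise count is 4, while the broadcast comparison reports 8, which is how num_correct can exceed num_samples.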


In [33]:
# Check accuracy on the custom loader (the same data used for training, so this is training accuracy)
check_accuracy(loader, model)

Got 51 / 51 with accuracy 100.00%


<h2 style="color:#BF66F2 "> <u>  => 2) Second trial changing criterion and optimizer (HuberLoss + Adam) </u></h2>

In [40]:
criterion = nn.HuberLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate, amsgrad=True)   

In [41]:
############# Train new loader    
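def train_with_huber_loss(loader, model, criterion, optimizer, device):
    """Run one training epoch with an element-wise loss such as nn.HuberLoss; return the mean batch loss."""
    model.train()
    total_loss = 0.0

    for batch_idx, (data, targets) in enumerate(loader):
        data = data.to(device)
        targets = targets.to(device)

        optimizer.zero_grad()
        outputs = model(data)
        # HuberLoss compares tensors element-wise, so the class indices are one-hot encoded
        # to match the (batch, num_classes) shape and float dtype of the outputs.
        targets_one_hot = Func.one_hot(targets, num_classes=outputs.shape[1]).float()
        loss = criterion(outputs, targets_one_hot)  # Use nn.HuberLoss() here
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    return total_loss / len(loader)    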
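
nn.HuberLoss is an element-wise regression loss, so class-index targets have to be brought to the same shape and dtype as the model outputs. A minimal sketch on toy tensors (the zero outputs and the two labels are assumptions for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

outputs = torch.zeros(2, 2)     # hypothetical model outputs: 2 samples, 2 classes
targets = torch.tensor([0, 1])  # class indices, shape (2,)

# One-hot encode so the targets match the outputs in shape and dtype.
targets_one_hot = F.one_hot(targets, num_classes=outputs.shape[1]).float()

# Default settings: delta=1.0 and mean reduction over all elements.
huber = nn.HuberLoss()(outputs, targets_one_hot)
print(huber.item())
```

Two of the four elements are off by 1 and contribute 0.5 each, so the mean is 0.25. For classification, CrossEntropyLoss usually remains the more natural choice; Huber on one-hot targets is shown here only because the trial above uses it.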

In [42]:
# Check accuracy on the custom loader (in the cells shown, train_with_huber_loss is defined but not called)
check_accuracy(loader, model)

Got 51 / 51 with accuracy 100.00%
