## PyTorch Experiment Tracking

Machine Learning is very experimental. 

In order to figure out which experiment worth pursuing, that's where **experiment tracking** comes in, it helps you to figure out what doesn't work so you can figure out what does work.

In this notebook, we're going to see an example of programmatically tracking experiments.

In [4]:
!nvidia-smi

Thu May  2 20:59:33 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 537.79                 Driver Version: 537.79       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA T1200 Laptop GPU      WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   59C    P8               3W /  35W |    162MiB /  4096MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [5]:
import torch 
import torchvision 

print(torch.__version__)
print(torchvision.__version__)

2.3.0+cu118
0.18.0+cu118


In [6]:
import matplotlib.pyplot as plt
import os 
import numpy as np
import pandas as pd

from torch import nn 
from torchvision import transforms

In [7]:
# need to install torchinfo module 
try: 
    import torchinfo 
except: 
    print("[INFO] we don't have torchinfo, installing it....")
    !pip install torchinfo

In [8]:
# internal module  (need to clone them from github)
try: 
    from going_modular.going_modular import data_setup, engine 
except:
    print("[INFO] couldn't find the going_modular scripts, cloning them from github....")
    !git clone https://github.com/mrdbourke/pytorch-deep-learning
    !mv 
    !rm -rf pytorch-deep-learning
    from going_modular.going_modular import data_setup, engine

In [9]:
# !git clone https://github.com/mrdbourke/pytorch-deep-learning.git

# !move pytorch-deep-learning/going_modular .

In [10]:
# Setting device agnostic code 
device = "cuda" if torch.cuda.is_available() else "cpu"

print(f"Target Device: {device}")

Target Device: cuda


In [11]:
# Set Seeds 
def set_seeds(seed: int = 42):
    # Set the seed for general torch operations
    torch.manual_seed(seed)
    # Set the seed for CUDA torch operations
    torch.cuda.manual_seed(seed)

In [12]:
set_seeds(42)

### 1. Getting Dataset


In [13]:
import os 
import zipfile 

from pathlib import Path

import requests


def download_data(source_url: str, 
                  destination_path: str,
                  remove_zipfile: bool = True) -> Path:
    """ 
    Downloads a zipped dataset from source URL and unzips it to destination path. 
    Removes the zip file after extraction.

    Args: 
        source_url (str): URL of the zipped dataset
        destination_path (str): Path to extract the dataset
        remove_zipfile (bool): Flag to remove the zip file after extraction (default is True)

    Returns:
        Path: Path to the extracted dataset
    """
    # Setup path to data folder and image folder (destination)
    data_path = Path('dataset/')
    image_path = data_path / Path(destination_path)

    # Create destination directory if it doesn't exist
    if image_path.is_dir():
        print(f"[INFO] {image_path} directory already exists....")
    else: 
        print(f"[INFO] {image_path} directory doesn't exist, creating one....")
        image_path.mkdir(parents=True, exist_ok=True)

    # Download the dataset from source
    with open(data_path / 'dataset.zip', 'wb') as f: 
        request = requests.get(source_url)
        print(f"[INFO] Downloading dataset from {source_url}....")
        f.write(request.content)

    # Unzip the dataset
    with zipfile.ZipFile(data_path / 'dataset.zip', 'r') as zip_ref:
        print(f"[INFO] Extracting dataset to {image_path}....")
        zip_ref.extractall(image_path)
    
    # Remove the zip file
    if remove_zipfile: 
        print(f"[INFO] Removing the dataset zip file....")
        os.remove(data_path / 'dataset.zip')

    return image_path

In [14]:
download_data(source_url="https://github.com/mrdbourke/pytorch-deep-learning/raw/main/data/pizza_steak_sushi.zip", 
              destination_path="pizza_steak_sushi", 
              remove_zipfile=True)

[INFO] dataset\pizza_steak_sushi directory already exists....
[INFO] Downloading dataset from https://github.com/mrdbourke/pytorch-deep-learning/raw/main/data/pizza_steak_sushi.zip....
[INFO] Extracting dataset to dataset\pizza_steak_sushi....
[INFO] Removing the dataset zip file....


WindowsPath('dataset/pizza_steak_sushi')

In [15]:
image_path = Path('dataset/pizza_steak_sushi')

In [16]:
image_path

WindowsPath('dataset/pizza_steak_sushi')

### 2. Create Dataset and Dataloaders

We can create transforms manually or automatically (latest) 

Here I will create transforms automatically! 

* The goal with transforms is to ensure your custom data is formatted in a reproducible way as well as a way that will suit pre-trained models.

In [17]:
# Setup directories 
train_dir = image_path / "train"
test_dir = image_path / "test"

train_dir, test_dir

(WindowsPath('dataset/pizza_steak_sushi/train'),
 WindowsPath('dataset/pizza_steak_sushi/test'))

In [18]:
# Creating a data loader using Automatic transform

# Setup pretrained weights (plenty of these weights are available in torchvision)
import torchvision

weights = torchvision.models.EfficientNet_B0_Weights.DEFAULT # "DEFAULT" = best available weights

# Get the transforms from weights (these transforms are used while training the model)
efficientnet_b0_transforms = weights.transforms()

In [19]:
efficientnet_b0_transforms

ImageClassification(
    crop_size=[224]
    resize_size=[256]
    mean=[0.485, 0.456, 0.406]
    std=[0.229, 0.224, 0.225]
    interpolation=InterpolationMode.BICUBIC
)

In [20]:
from going_modular.going_modular import data_setup

# create Dataloaders
train_dataloader, test_dataloader, class_names = data_setup.create_dataloaders(train_dir=train_dir,
                                                                  test_dir=test_dir,
                                                                  transform=efficientnet_b0_transforms,
                                                                  batch_size=32)

train_dataloader, test_dataloader, class_names

(<torch.utils.data.dataloader.DataLoader at 0x1d3518bec20>,
 <torch.utils.data.dataloader.DataLoader at 0x1d3518be9b0>,
 ['pizza', 'steak', 'sushi'])

### 3. Getting a pretrained model, freeze the base layers and add a classifier head (for our task)

In [21]:
from torchvision.models import efficientnet_b0, EfficientNet_B0_Weights

# Download the pretrained weights for EfficientNetB0
effnet_b0_weights = EfficientNet_B0_Weights.DEFAULT # "DEFAULT" = best available weights

# Create the model and send it to device (cuda)
model = efficientnet_b0(weights=effnet_b0_weights).to(device)

In [22]:
model

EfficientNet(
  (features): Sequential(
    (0): Conv2dNormActivation(
      (0): Conv2d(3, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): SiLU(inplace=True)
    )
    (1): Sequential(
      (0): MBConv(
        (block): Sequential(
          (0): Conv2dNormActivation(
            (0): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=32, bias=False)
            (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (2): SiLU(inplace=True)
          )
          (1): SqueezeExcitation(
            (avgpool): AdaptiveAvgPool2d(output_size=1)
            (fc1): Conv2d(32, 8, kernel_size=(1, 1), stride=(1, 1))
            (fc2): Conv2d(8, 32, kernel_size=(1, 1), stride=(1, 1))
            (activation): SiLU(inplace=True)
            (scale_activation): Sigmoid()
          )
          (2): Conv2dNormActivat

In [23]:
# base layers (feature extractor)
model.features

Sequential(
  (0): Conv2dNormActivation(
    (0): Conv2d(3, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
    (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): SiLU(inplace=True)
  )
  (1): Sequential(
    (0): MBConv(
      (block): Sequential(
        (0): Conv2dNormActivation(
          (0): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=32, bias=False)
          (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (2): SiLU(inplace=True)
        )
        (1): SqueezeExcitation(
          (avgpool): AdaptiveAvgPool2d(output_size=1)
          (fc1): Conv2d(32, 8, kernel_size=(1, 1), stride=(1, 1))
          (fc2): Conv2d(8, 32, kernel_size=(1, 1), stride=(1, 1))
          (activation): SiLU(inplace=True)
          (scale_activation): Sigmoid()
        )
        (2): Conv2dNormActivation(
          (0): Conv2d(32, 16, kernel_size=(1, 1), stride=(1, 1), 

In [24]:
# average pooling layer
model.avgpool

AdaptiveAvgPool2d(output_size=1)

In [25]:
# classifier layer 
model.classifier

Sequential(
  (0): Dropout(p=0.2, inplace=True)
  (1): Linear(in_features=1280, out_features=1000, bias=True)
)

In [26]:
from torchinfo import summary

summary(
    model=model,
    input_size=(1, 3, 224, 224),
    col_names=["input_size", "output_size", "num_params", "trainable"]
)

Layer (type:depth-idx)                                  Input Shape               Output Shape              Param #                   Trainable
EfficientNet                                            [1, 3, 224, 224]          [1, 1000]                 --                        True
├─Sequential: 1-1                                       [1, 3, 224, 224]          [1, 1280, 7, 7]           --                        True
│    └─Conv2dNormActivation: 2-1                        [1, 3, 224, 224]          [1, 32, 112, 112]         --                        True
│    │    └─Conv2d: 3-1                                 [1, 3, 224, 224]          [1, 32, 112, 112]         864                       True
│    │    └─BatchNorm2d: 3-2                            [1, 32, 112, 112]         [1, 32, 112, 112]         64                        True
│    │    └─SiLU: 3-3                                   [1, 32, 112, 112]         [1, 32, 112, 112]         --                        --
│    └─Sequential: 2-2  

In [27]:
# Freeze the base layers (feature extractor) by setting their requires_grad attribute to False
for param in model.features.parameters():
    param.requires_grad = False

In [28]:
summary(
    model=model,
    input_size=(1, 3, 224, 224),
    col_names=["input_size", "output_size", "num_params", "trainable"]
)

Layer (type:depth-idx)                                  Input Shape               Output Shape              Param #                   Trainable
EfficientNet                                            [1, 3, 224, 224]          [1, 1000]                 --                        Partial
├─Sequential: 1-1                                       [1, 3, 224, 224]          [1, 1280, 7, 7]           --                        False
│    └─Conv2dNormActivation: 2-1                        [1, 3, 224, 224]          [1, 32, 112, 112]         --                        False
│    │    └─Conv2d: 3-1                                 [1, 3, 224, 224]          [1, 32, 112, 112]         (864)                     False
│    │    └─BatchNorm2d: 3-2                            [1, 32, 112, 112]         [1, 32, 112, 112]         (64)                      False
│    │    └─SiLU: 3-3                                   [1, 32, 112, 112]         [1, 32, 112, 112]         --                        --
│    └─Sequential

In [29]:
model.classifier

Sequential(
  (0): Dropout(p=0.2, inplace=True)
  (1): Linear(in_features=1280, out_features=1000, bias=True)
)

In [30]:
len(class_names)

3

In [31]:
# Added a new classifier layer to the model according to our task 
# our task is to classify between pizza, steak and sushi (3 classes)

# to maintain reproducibility, we need to set the seeds 
set_seeds(42)

model.classifier = nn.Sequential(
        nn.Dropout(p=0.2, inplace=True),
        nn.Linear(in_features=1280, out_features=len(class_names), bias=True)
).to(device)

In [32]:
summary(
    model=model,
    input_size=(32, 3, 224, 224), # [batch_size, color_channels, height, width]
    col_names=["input_size", "output_size", "num_params", "trainable"],
    verbose=0,
    col_width=20,
    row_settings=["var_names"]
)

Layer (type (var_name))                                      Input Shape          Output Shape         Param #              Trainable
EfficientNet (EfficientNet)                                  [32, 3, 224, 224]    [32, 3]              --                   Partial
├─Sequential (features)                                      [32, 3, 224, 224]    [32, 1280, 7, 7]     --                   False
│    └─Conv2dNormActivation (0)                              [32, 3, 224, 224]    [32, 32, 112, 112]   --                   False
│    │    └─Conv2d (0)                                       [32, 3, 224, 224]    [32, 32, 112, 112]   (864)                False
│    │    └─BatchNorm2d (1)                                  [32, 32, 112, 112]   [32, 32, 112, 112]   (64)                 False
│    │    └─SiLU (2)                                         [32, 32, 112, 112]   [32, 32, 112, 112]   --                   --
│    └─Sequential (1)                                        [32, 32, 112, 112]   [32, 

Now we only have 3,843 Trainable parameters.

## 4. Train a single model and track results

In [33]:
# Define loss function and optimizer 
loss_fn = nn.CrossEntropyLoss() 
optimizer = torch.optim.Adam(params=model.parameters(), 
                             lr=0.001)

To track experiments, we're going to use TensorBoard

In [34]:
# Setup TensorBoard for logging training results
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()

In [35]:
from going_modular.going_modular.engine import train_step, test_step

from tqdm.auto import tqdm
from typing import Dict, List, Tuple

def train(model: torch.nn.Module, 
          train_dataloader: torch.utils.data.DataLoader, 
          test_dataloader: torch.utils.data.DataLoader, 
          optimizer: torch.optim.Optimizer,
          loss_fn: torch.nn.Module,
          epochs: int,
          device: torch.device) -> Dict[str, List]:
    """Trains and tests a PyTorch model.

    Passes a target PyTorch models through train_step() and test_step()
    functions for a number of epochs, training and testing the model
    in the same epoch loop.

    Calculates, prints and stores evaluation metrics throughout.

    Args:
    model: A PyTorch model to be trained and tested.
    train_dataloader: A DataLoader instance for the model to be trained on.
    test_dataloader: A DataLoader instance for the model to be tested on.
    optimizer: A PyTorch optimizer to help minimize the loss function.
    loss_fn: A PyTorch loss function to calculate loss on both datasets.
    epochs: An integer indicating how many epochs to train for.
    device: A target device to compute on (e.g. "cuda" or "cpu").

    Returns:
    A dictionary of training and testing loss as well as training and
    testing accuracy metrics. Each metric has a value in a list for 
    each epoch.
    In the form: {train_loss: [...],
              train_acc: [...],
              test_loss: [...],
              test_acc: [...]} 
    For example if training for epochs=2: 
             {train_loss: [2.0616, 1.0537],
              train_acc: [0.3945, 0.3945],
              test_loss: [1.2641, 1.5706],
              test_acc: [0.3400, 0.2973]} 
    """
    # Create empty results dictionary
    results = {"train_loss": [],
               "train_acc": [],
               "test_loss": [],
               "test_acc": []
    }
    
    # Make sure model on target device
    model.to(device)

    # Loop through training and testing steps for a number of epochs
    for epoch in tqdm(range(epochs)):
        train_loss, train_acc = train_step(model=model,
                                          dataloader=train_dataloader,
                                          loss_fn=loss_fn,
                                          optimizer=optimizer,
                                          device=device)
        # model evaluation
        test_loss, test_acc = test_step(model=model,
          dataloader=test_dataloader,
          loss_fn=loss_fn,
          device=device)

        # Print out what's happening
        print(
          f"Epoch: {epoch+1} | "
          f"train_loss: {train_loss:.4f} | "
          f"train_acc: {train_acc:.4f} | "
          f"test_loss: {test_loss:.4f} | "
          f"test_acc: {test_acc:.4f}"
        )

        # Update results dictionary
        results["train_loss"].append(train_loss)
        results["train_acc"].append(train_acc)
        results["test_loss"].append(test_loss)
        results["test_acc"].append(test_acc)

        ## New: Experiment Tracking
        writer.add_scalars(main_tag="Loss", 
                           tag_scalar_dict={"train_loss": train_loss,
                                            "test_loss": test_loss},
                           global_step=epoch)

        writer.add_scalars(main_tag="Accuracy",
                           tag_scalar_dict={"train_acc": train_acc,
                                            "test_acc": test_acc},
                           global_step=epoch)

        writer.add_graph(model=model, 
                         input_to_model=torch.randn(32, 3, 224, 224).to(device))

        # closing writer 
        writer.close()

    # Return the filled results at the end of the epochs
    return results

In [36]:
# Train model 

# setting random seed
set_seeds()

# # saving results
# results = train(model=model,
#       train_dataloader=train_dataloader,
#       test_dataloader=test_dataloader,
#       optimizer=optimizer,
#       loss_fn=loss_fn,
#       epochs=5,
#       device=device)

In [37]:
 from torch.utils.tensorboard import SummaryWriter
 

### Lauching Tensorboard in Notebook

In [40]:
# %load_ext tensorboard

# %tensorboard --logdir runs

### 6. Create a function to prepare a `SummaryWriter()` instance 

By default our `SummaryWriter()` class saves to `log_dir`. 

How about if we wanted to save different experiments to different folders? 

In essence, one experiment = one folder.

For example, we'd like to track: 

* Experiment date/timestamp
* Experiment name 
* Model name 
* Extra - is there anything else that should be tracked ?


Let's create a function to create a `SummaryWriter()` instance to take all of these things into account. 

So ideally we end up tracking experiments to a directory: 

`runs/YYYY-MM-DD/experiment_name/model_name/extra`


In [52]:
from datetime import datetime

datetime.now()

datetime.now().strftime('%Y-%m-%d')

'2024-05-02'

In [55]:
from datetime import datetime
from torch.utils.tensorboard import SummaryWriter
import os


def create_writer(experiment_name: str,
                  model_name: str,
                   extra: str = None): 
    """Creates a torch.utils.tensorboard.writer.SummaryWriter instance tracking to a specific directory."""

    # Get timestamp of current date in reverse order
    time_stamp = datetime.now().strftime("%Y-%m-%d")

    if extra: 
        # create log directory 
        log_dir = os.path.join("runs", time_stamp, experiment_name, model_name, extra)
    else: 
        log_dir = os.path.join("runs", time_stamp, experiment_name, model_name)

    print(f"[INFO] Created SummaryWriter saving to {log_dir}")
    return SummaryWriter(log_dir=log_dir)

In [56]:
example_writer = create_writer(
    experiment_name="data_10_percent",
    model_name="efficientnet_b0",
    extra="5_epochs"
)

example_writer

[INFO] Created SummaryWriter saving to runs\2024-05-02\data_10_percent\efficientnet_b0\5_epochs


<torch.utils.tensorboard.writer.SummaryWriter at 0x1d453bee860>

#### 6.1 update the `train()` to include a `writer` parameter to track our experiments 

In [57]:
from going_modular.going_modular.engine import train_step, test_step
from torch.utils.tensorboard import SummaryWriter
from tqdm.auto import tqdm
from typing import Dict, List, Tuple

def train(model: torch.nn.Module, 
          train_dataloader: torch.utils.data.DataLoader, 
          test_dataloader: torch.utils.data.DataLoader, 
          optimizer: torch.optim.Optimizer,
          loss_fn: torch.nn.Module,
          epochs: int,
          device: torch.device,
          writer: SummaryWriter = None ) -> Dict[str, List]:
    """Trains and tests a PyTorch model.

    Passes a target PyTorch models through train_step() and test_step()
    functions for a number of epochs, training and testing the model
    in the same epoch loop.

    Calculates, prints and stores evaluation metrics throughout.

    Args:
    model: A PyTorch model to be trained and tested.
    train_dataloader: A DataLoader instance for the model to be trained on.
    test_dataloader: A DataLoader instance for the model to be tested on.
    optimizer: A PyTorch optimizer to help minimize the loss function.
    loss_fn: A PyTorch loss function to calculate loss on both datasets.
    epochs: An integer indicating how many epochs to train for.
    device: A target device to compute on (e.g. "cuda" or "cpu").

    Returns:
    A dictionary of training and testing loss as well as training and
    testing accuracy metrics. Each metric has a value in a list for 
    each epoch.
    In the form: {train_loss: [...],
              train_acc: [...],
              test_loss: [...],
              test_acc: [...]} 
    For example if training for epochs=2: 
             {train_loss: [2.0616, 1.0537],
              train_acc: [0.3945, 0.3945],
              test_loss: [1.2641, 1.5706],
              test_acc: [0.3400, 0.2973]} 
    """
    # Create empty results dictionary
    results = {"train_loss": [],
               "train_acc": [],
               "test_loss": [],
               "test_acc": []
    }
    
    # Make sure model on target device
    model.to(device)

    # Loop through training and testing steps for a number of epochs
    for epoch in tqdm(range(epochs)):
        train_loss, train_acc = train_step(model=model,
                                          dataloader=train_dataloader,
                                          loss_fn=loss_fn,
                                          optimizer=optimizer,
                                          device=device)
        # model evaluation
        test_loss, test_acc = test_step(model=model,
          dataloader=test_dataloader,
          loss_fn=loss_fn,
          device=device)

        # Print out what's happening
        print(
          f"Epoch: {epoch+1} | "
          f"train_loss: {train_loss:.4f} | "
          f"train_acc: {train_acc:.4f} | "
          f"test_loss: {test_loss:.4f} | "
          f"test_acc: {test_acc:.4f}"
        )

        # Update results dictionary
        results["train_loss"].append(train_loss)
        results["train_acc"].append(train_acc)
        results["test_loss"].append(test_loss)
        results["test_acc"].append(test_acc)

        if writer:
            ## Experiment Tracking
            writer.add_scalars(main_tag="Loss", 
                            tag_scalar_dict={"train_loss": train_loss,
                                                "test_loss": test_loss},
                            global_step=epoch)

            writer.add_scalars(main_tag="Accuracy",
                            tag_scalar_dict={"train_acc": train_acc,
                                                "test_acc": test_acc},
                            global_step=epoch)

            writer.add_graph(model=model, 
                            input_to_model=torch.randn(32, 3, 224, 224).to(device))

            # closing writer 
            writer.close()

    # Return the filled results at the end of the epochs
    return results

In [58]:
# creating a writer 
trail_1_writer = create_writer(
    experiment_name="trail_1", 
    model_name="effinet_b0",
    extra="3_epochs"
)

[INFO] Created SummaryWriter saving to runs\2024-05-02\trail_1\effinet_b0\3_epochs


In [59]:
train_results = train(model=model,
                     train_dataloader=train_dataloader,
                     test_dataloader=test_dataloader,
                     optimizer=optimizer,
                     loss_fn=loss_fn,
                     epochs=3,
                     device=device,
                     writer=trail_1_writer)

  0%|          | 0/3 [00:00<?, ?it/s]

Epoch: 1 | train_loss: 1.0916 | train_acc: 0.3828 | test_loss: 0.9098 | test_acc: 0.5909
Epoch: 2 | train_loss: 0.8992 | train_acc: 0.6445 | test_loss: 0.7881 | test_acc: 0.8561
Epoch: 3 | train_loss: 0.8069 | train_acc: 0.7422 | test_loss: 0.6774 | test_acc: 0.8864
