<a href="https://colab.research.google.com/github/eborin/SSL-course/blob/main/16_minerva_BYOL-STL10-backbone_pretrain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[View Source Code](https://github.com/eborin/SSL-course/blob/main/16_minerva_BYOL-STL10-backbone_pretrain.ipynb)

# Pretraining backbones with Minerva BYOL

This notebook demonstrates how to pretrain feature extraction backbones using the Minerva BYOL model. 
In particular, it walks through training a ResNet-18 backbone on the "unlabeled" split of the STL10 dataset.

## 1. Introduction

### 1.1 Objective

The main objective of this tutorial is to present how to employ Minerva BYOL to pretrain a given backbone.

### 1.2 Before Running this Notebook

The training operation performed in this notebook is likely to take a considerable amount of time when executed on typical laptop or desktop hardware.
As of May 2025, it also remains time-consuming even when running on Google Colab.
If you have access to a system equipped with a powerful GPU, it is recommended that you run this notebook on that system to significantly reduce the training time.

#### 1.2.1 Running this notebook from a terminal as a Python script

You can convert this notebook into a Python script by executing the following command in your terminal:

```bash
jupyter nbconvert --to script 16_minerva_BYOL-STL10-backbone_pretrain.ipynb
```

This will generate a Python script named `16_minerva_BYOL-STL10-backbone_pretrain.py`, which you can then run directly from your terminal using:

```bash
python 16_minerva_BYOL-STL10-backbone_pretrain.py
```

> **Note**: Before converting the notebook, you may want to adjust the main configuration variables found in the "Basic Setup" section to ensure they are appropriately set for your environment.
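
When the notebook runs as a script, editing the file before every run can be inconvenient. The sketch below shows one possible way to override the main configuration variables through environment variables. The `BYOL_*` variable names and the `env_int` helper are hypothetical and are not part of the original notebook; the defaults mirror the values used in the "Basic Setup" section.

```python
import os

def env_int(name: str, default: int) -> int:
    """Return the integer stored in environment variable `name`, or `default` if it is unset."""
    value = os.environ.get(name)
    return int(value) if value is not None else default

# Hypothetical environment variable names; defaults mirror the values used in this tutorial.
n_epochs = env_int("BYOL_N_EPOCHS", 200)
DL_BATCH_SIZE = env_int("BYOL_BATCH_SIZE", 256)
DL_NUM_WORKERS = env_int("BYOL_NUM_WORKERS", 16)
```

With this in place, a shorter run could be launched with, for example, `BYOL_N_EPOCHS=90 python 16_minerva_BYOL-STL10-backbone_pretrain.py`.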

### 1.3 BYOL

BYOL (Bootstrap Your Own Latent) is a self-supervised learning framework introduced by DeepMind.
It learns useful visual representations without requiring labeled data or negative pairs, by encouraging consistency between different augmented views of the same image using a bootstrapping mechanism.

The method was presented in a paper published as a preprint on the [arXiv repository](https://arxiv.org/pdf/2006.07733), and has significantly influenced the development of non-contrastive self-supervised learning techniques since its release in 2020.
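
To make the bootstrapping mechanism more concrete, the following is a minimal sketch of the two core ingredients described in the BYOL paper: a regression loss between the normalized online prediction and the normalized target projection, and an exponential moving average (EMA) update of the target network. The function names are illustrative, and Minerva's internal implementation may use a different but related formulation (for example, a differently scaled loss).

```python
import torch
import torch.nn.functional as F

def byol_regression_loss(prediction: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Squared error between L2-normalized vectors, i.e., 2 - 2 * cosine similarity."""
    prediction = F.normalize(prediction, dim=-1)
    target = F.normalize(target.detach(), dim=-1)  # stop-gradient on the target branch
    return (2 - 2 * (prediction * target).sum(dim=-1)).mean()

@torch.no_grad()
def ema_update(online: torch.nn.Module, target: torch.nn.Module, tau: float = 0.996) -> None:
    """Move each target parameter toward the matching online parameter."""
    for p_online, p_target in zip(online.parameters(), target.parameters()):
        p_target.mul_(tau).add_(p_online, alpha=1 - tau)
```

The loss is zero when prediction and target point in the same direction and reaches its maximum of 4 when they point in opposite directions. Only the online network receives gradients; the target network changes solely through the EMA update.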

### 1.4 What we're going to cover

In this tutorial, we demonstrate how to use the BYOL model from the [Minerva framework](https://github.com/discovery-unicamp/Minerva) to pretrain a backbone network.
Specifically, we will train a ResNet-18-based backbone using the "unlabeled" split of the STL10 dataset.

The training process closely follows the approach used in the `09_minerva_SimCLR-STL10-backbone_pretrain.ipynb` tutorial, but here we apply the BYOL self-supervised learning technique instead of SimCLR.

| **Topic** | **Contents** |
| ----- | ----- |
| [**2. Basic Setup**](#sec_2) | Import useful modules (torch, torchvision, and lightning). |
| [**3. Setting up the Dataset**](#sec_3) | Set up the data transforms, the dataset and the data module for the training process. |
| [**4. Create the Model for the Pretext Task**](#sec_4) | Create the backbone, the projection head, and the model for the pretext task. |
| [**5. Setting up the KNN benchmark**](#sec_5) | Create a benchmark to track the performance of the backbone on the downstream task during training. |
| [**6. Training the model**](#sec_6) | Create the trainer object and invoke the `fit` method to train the model. |
| [**7. Exercises**](#sec_7) | Suggested Exercises. |

### 1.5 Where can you get help?

In addition to discussing with your colleagues or the course professor, you might also consider:

* Minerva: check the [Minerva docs](https://discovery-unicamp.github.io/Minerva/).

* Lightning: check the [Lightning documentation](https://lightning.ai/docs/overview/getting-started) and search for or post Lightning-related questions on the [PyTorch Lightning forum](https://lightning.ai/forums/).

* PyTorch: check the [PyTorch documentation](https://pytorch.org/docs/stable/index.html) and search for or post PyTorch-related questions on the [PyTorch developer forums](https://discuss.pytorch.org/).

## <a id="sec_2">2. Basic Setup</a>

### 2.1 Setup main variables

Several variables influence the execution of this notebook, particularly in terms of memory usage and training time. These include:

* **`n_epochs`**: Specifies the maximum number of training epochs. 
    Increasing this value generally improves backbone performance but also leads to longer training times. 
    Reducing the number of epochs can speed up training, but doing so excessively may compromise the quality of the learned representations. 
    Based on my experiments, training for at least 90 epochs typically yields backbones that perform noticeably better than randomly initialized backbones. You will be able to evaluate this in the next tutorials.

* **`checkpoint_every_n_epochs`**: Specifies how often, in terms of training epochs, a model checkpoint is saved. 
    For example, if set to 10, the model's state will be saved every 10 epochs during training.
    These checkpoints can be used to recover from interruptions, and they also allow you to evaluate the backbone's performance at various stages to monitor how the learned representations are evolving over time.

* **`DL_BATCH_SIZE`**: Determines the batch size used during training. 
    Very large batches may exceed your GPU's memory capacity. Small batch sizes, however, may affect the quality of the representation learned by BYOL. Adjust this value based on the available resources.

* **`DL_NUM_WORKERS`**: Sets the number of worker processes used by the DataLoader for parallel data loading and preprocessing. 
    Increasing this value can help improve data throughput and reduce training bottlenecks, especially on multi-core systems.

You can customize these parameters in the following code cell.

In [None]:
# Total number of epochs for training the model using the BYOL pretext task.
n_epochs = 200

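# Number of epochs between model checkpoints.
# The max() keeps the interval at 1 or more: with n_epochs < 10 the integer division
# yields 0, and ModelCheckpoint treats every_n_epochs=0 as "disabled".
checkpoint_every_n_epochs = max(1, n_epochs // 10)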

# Dataloaders/Datamodule parameters
DL_BATCH_SIZE=256
DL_NUM_WORKERS=16

### 2.2 Installing Lightning and Minerva modules

The code below attempts to import the Minerva module and will automatically install it if it is not already available.
> **Note**: Since Minerva depends on PyTorch Lightning, Lightning will also be installed automatically if it is not already present.

In [2]:
try:
    import minerva
except ImportError:
    try:
        # Try to install it and import again
        print("[INFO]: Could not import the minerva module. Trying to install it!")
        !pip install -q minerva-ml
        import minerva
        print("[INFO]: It looks like minerva was successfully imported!")
    except ImportError:
        raise ImportError("[ERROR] Couldn't find the minerva module ... \n" +
                          "Please, install it before running the notebook.\n" +
                          "You might want to install the modules listed at requirements.txt\n" +
                          "To do so, run: \"pip install -r requirements.txt\"")

### 2.3 Importing basic modules

Let's import the basic modules, such as lightning, torch, minerva, and other utility modules.

In [3]:
# Import PyTorch
import torch

# Import torchvision
import torchvision

# Import lightning
import lightning

# Import minerva
import minerva

# Check versions
# Note: your PyTorch version shouldn't be lower than 1.10.0 and torchvision version shouldn't be lower than 0.11
print(f"PyTorch version: {torch.__version__}")
print(f"torchvision version: {torchvision.__version__}")
print(f"Lightning version: {lightning.__version__}")
#print(f"Minerva version: {M.__version__}") ## TODO

# Import matplotlib for visualization
import matplotlib.pyplot as plt

PyTorch version: 2.6.0+cu124
torchvision version: 0.21.0+cu124
Lightning version: 2.5.1


## <a id="sec_3">3. Setting up the Dataset</a>

We will use the unlabeled split of the STL10 dataset to pretrain our backbone. 
Because BYOL learns by comparing different views of the same image, we will apply a series of data transformations to generate randomly augmented views of each image.

For a detailed discussion of the data augmentation strategies used in the next code block, please refer to the tutorial:
`08_minerva_data_transforms.ipynb`.

In [4]:
# Torchvision transforms
from torchvision.transforms.v2 import Compose, ToImage, ToDtype, RandomHorizontalFlip, RandomResizedCrop, RandomApply, ColorJitter, RandomGrayscale, GaussianBlur, Normalize
# Minerva Contrastive transform
from minerva.transforms.transform import ContrastiveTransform

# STL10 statistics for the unlabeled split. 
# - Note: If you would like to compute these statistics for your own dataset, refer 
#         to the discussion in tutorial 05_pytorch_transfer_learning.ipynb.
stl10_unlabeled_mean  = torch.tensor([0.4406, 0.4273, 0.3858])
stl10_unlabeled_std = torch.tensor([0.2687, 0.2613, 0.2685])

transform_pipeline = Compose([
    ToImage(), 
    ToDtype(torch.float32, scale=True),
    RandomHorizontalFlip(),
    RandomResizedCrop(size=96),
    RandomApply([ColorJitter(brightness=0.5,contrast=0.5,saturation=0.5,hue=0.1)], p=0.8),
    RandomGrayscale(p=0.2),
    GaussianBlur(kernel_size=9),
    Normalize(mean=stl10_unlabeled_mean, std=stl10_unlabeled_std)
])

contrastive_transform = ContrastiveTransform(transform_pipeline)

contrastive_dataset = torchvision.datasets.STL10(root="data", split="unlabeled",  download=True,
                                                 transform=contrastive_transform)

The BYOL SSL class expects the dataloader to return batches containing an array with two augmented versions of each input sample.
Unlike SimCLR, it does not expect a tuple with the sample features and label. 
Therefore, we must modify the dataloader to return only the input data.
To achieve this, we define a simple utility class called `FeaturesOnlyDataset`, which ensures that only the input features are retrieved when the `__getitem__` method is called.

In [5]:
class FeaturesOnlyDataset(torch.utils.data.Dataset):
    def __init__(self, original_dataset):
        self.dataset = original_dataset

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        features, _ = self.dataset[idx]
        return features

contrastive_dataset_features_only = FeaturesOnlyDataset(contrastive_dataset)

To monitor learning performance, we will split the dataset into 80% training and 20% validation subsets.
Additionally, we will use a `MinervaDataModule` to streamline data handling and simplify the training workflow.

In [6]:
from torch.utils.data import random_split
from minerva.data.data_modules.base import MinervaDataModule

torch.manual_seed(42)
train_size = int(0.8 * len(contrastive_dataset_features_only))
val_size = len(contrastive_dataset_features_only) - train_size
train_set, val_set = random_split(contrastive_dataset_features_only, [train_size, val_size])

datamodule = MinervaDataModule(name="Contrastive STL10",
                                train_dataset=train_set, 
                                val_dataset=val_set, 
                                test_dataset=None,
                                batch_size=DL_BATCH_SIZE, 
                                num_workers=DL_NUM_WORKERS)
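
To see what a batch looks like when each dataset item holds two views, the sketch below uses a toy dataset and a plain PyTorch `DataLoader`. `ToyTwoViewDataset` is a hypothetical stand-in, and the sketch assumes that each item is a list with two tensors; the exact container returned by Minerva's `ContrastiveTransform` was not verified here.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class ToyTwoViewDataset(Dataset):
    """Hypothetical stand-in: each item is a list with two 'views' of one sample."""

    def __len__(self):
        return 8

    def __getitem__(self, idx):
        image = torch.full((3, 8, 8), float(idx))
        return [image, image + 0.5]  # two (fake) augmented views of the same sample

loader = DataLoader(ToyTwoViewDataset(), batch_size=4)
batch = next(iter(loader))

# The default collate function stacks the first views and the second views separately.
print(len(batch), batch[0].shape, batch[1].shape)
```

Under this assumption, every batch is a list with two tensors of shape `[batch_size, C, H, W]`, one per view, and no labels.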

## <a id="sec_4">4. Create the Model for the Pretext Task</a>

### 4.1 Backbone and Projection Head Generation

We will use a modified version of the ResNet18 model as the backbone. 
Specifically, we replace its final fully connected (fc) layer with an identity layer—`torch.nn.Identity()`—which effectively removes any operation at that stage, allowing us to extract raw feature representations.

The `generate_backbone()` function handles this process: it instantiates a ResNet18 model, replaces its fully connected layer with an identity layer, and returns the modified model.

In the following code block, we instantiate the backbone and display its architecture using the summary() function from the torchinfo package.

In [7]:
from torchinfo import summary
from torchvision.models import resnet18

# Function to generate a ResNet18 based backbone.
def generate_backbone(weights=None):
    backbone = resnet18(weights=weights)
    backbone.fc = torch.nn.Identity()
    return backbone

# Generate the backbone and check its structure
backbone = generate_backbone()

summary(backbone,
        input_size=(32, 3, 96, 96), # input data shape (N x C x H x W)
        col_names=["input_size", "output_size", "num_params", "trainable"],
        col_width=20,
        row_settings=["var_names"]
)

Layer (type (var_name))                  Input Shape          Output Shape         Param #              Trainable
ResNet (ResNet)                          [32, 3, 96, 96]      [32, 512]            --                   True
├─Conv2d (conv1)                         [32, 3, 96, 96]      [32, 64, 48, 48]     9,408                True
├─BatchNorm2d (bn1)                      [32, 64, 48, 48]     [32, 64, 48, 48]     128                  True
├─ReLU (relu)                            [32, 64, 48, 48]     [32, 64, 48, 48]     --                   --
├─MaxPool2d (maxpool)                    [32, 64, 48, 48]     [32, 64, 24, 24]     --                   --
├─Sequential (layer1)                    [32, 64, 24, 24]     [32, 64, 24, 24]     --                   True
│    └─BasicBlock (0)                    [32, 64, 24, 24]     [32, 64, 24, 24]     --                   True
│    │    └─Conv2d (conv1)               [32, 64, 24, 24]     [32, 64, 24, 24]     36,864               True
│    │    └─BatchN

Note that the output of the backbone is a 512-dimensional feature vector, as indicated by the Identity (fc) layer in the model summary.

Next, we will define the projection head and the prediction head, the components that will be attached to the backbone to form the complete pretext model used during BYOL pretraining.

For both heads, we'll use a simple multi-layer perceptron (MLP), defined as follows:

In [8]:
from minerva.models.nets.mlp import MLP
import torch

def generate_BYOL_proj_head(input_dim=512, output_dim=256):
    """Creates the default projection head used in BYOL."""
    return MLP(
        layer_sizes=[input_dim, 4096, output_dim],
        activation_cls=torch.nn.ReLU,
        intermediate_ops=[torch.nn.BatchNorm1d(4096), None],
    )

def generate_BYOL_pred_head(input_dim=256, output_dim=256):
    """Creates the default prediction head used in BYOL."""
    return MLP(
        layer_sizes=[input_dim, 4096, output_dim],
        activation_cls=torch.nn.ReLU,
        intermediate_ops=[torch.nn.BatchNorm1d(4096), None],
    )
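
For readers who want to see the head without the Minerva `MLP` helper, the sketch below builds a comparable module in plain PyTorch using the layout described in the BYOL paper (Linear, BatchNorm, ReLU, Linear). The `plain_byol_head` function is hypothetical, and Minerva's `MLP` may order the intermediate operations differently.

```python
import torch
from torch import nn

def plain_byol_head(input_dim: int = 512, hidden_dim: int = 4096, output_dim: int = 256) -> nn.Sequential:
    """Linear -> BatchNorm -> ReLU -> Linear, the head layout described in the BYOL paper."""
    return nn.Sequential(
        nn.Linear(input_dim, hidden_dim),
        nn.BatchNorm1d(hidden_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, output_dim),
    )

head = plain_byol_head()
embeddings = head(torch.randn(8, 512))  # a batch of 8 backbone feature vectors
print(embeddings.shape)
```

The same layout serves as a prediction head by setting `input_dim=256`, so that it maps projections back into the projection space.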

### 4.2 Adjusting the BYOL optimizer

For our experiments, we will employ PyTorch's `Adam` optimizer with learning rate = 0.0001 and weight decay = 1e-6.

To apply this configuration, we need to modify the `configure_optimizers()` method of the pretext model so that it returns our chosen optimizer.

While one clean approach would be to extend the Minerva BYOL class, we will opt for a quicker, more pragmatic solution: we will monkey-patch the `configure_optimizers` method directly on the model instance. This is handled by the `adjust_configure_optimizers_Adam()` function.

In [9]:
def adjust_configure_optimizers_Adam(model, 
                                lr=1e-4, # Adam Optimizer learning rate parameter
                                weight_decay=1e-6): # Adam Optimizer weight_decay parameter
    # Redefine the optimizers
    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=lr, weight_decay=weight_decay)

    model.configure_optimizers = configure_optimizers.__get__(model)
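
The `__get__` call is what turns a plain function into a method bound to one specific instance. The toy example below, unrelated to Minerva, illustrates the mechanism: only the patched object changes its behavior, while other instances of the class keep the original method.

```python
class Greeter:
    def __init__(self, name):
        self.name = name

    def greet(self):
        return f"Hello, {self.name}"

def shout(self):
    return f"HELLO, {self.name.upper()}!"

a, b = Greeter("ada"), Greeter("bob")

# Bind `shout` to instance `a` only; the class and instance `b` are left untouched.
a.greet = shout.__get__(a)

print(a.greet())  # HELLO, ADA!
print(b.greet())  # Hello, bob
```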

### 4.3 Building and Configuring the Model for the Pretext Task

Now we build and configure the model for the pretext task.

In [10]:
from minerva.models.ssl.byol import BYOL

# Create the backbone, the projection head, and the prediction head
backbone  = generate_backbone()
BYOL_proj_head = generate_BYOL_proj_head()
BYOL_pred_head = generate_BYOL_pred_head()

# Create the model for the pretext task
BYOL_model = BYOL(backbone=backbone, 
                  projection_head=BYOL_proj_head, 
                  prediction_head=BYOL_pred_head)

# Adjust the model optimizer
adjust_configure_optimizers_Adam(BYOL_model, lr=0.0001, weight_decay=1e-6)

## <a id="sec_5">5. Setting up the KNN benchmark</a>

In this section, we will create a benchmark and attach it to the trainer object to track the backbone's performance on the downstream task when training with BYOL.
Specifically, at the end of each epoch, we will use the backbone to encode the target dataset features and train and evaluate a KNN model with these encoded features.

### 5.1 Implementing the `KNN_Benchmark` class

Our KNN benchmark will be implemented as a PyTorch Lightning Callback that is invoked by the trainer at the end of each training epoch.
To achieve this, we will define a class that extends the `lightning.pytorch.callbacks.Callback` class and overrides the `on_train_epoch_end()` method.
For example:

```python
from lightning.pytorch.callbacks import Callback

class KNN_Benchmark(Callback):
    def __init__(self):  # add the arguments your benchmark needs
        super().__init__()
        # Initialization code goes here

    def on_train_epoch_end(self, trainer, model):
        # Benchmark evaluation code goes here
        pass
```

After defining the class, we create a benchmark object and attach it to the trainer in the same way we attach other callbacks, such as `ModelCheckpoint` and `LearningRateMonitor`.

With this setup, the `on_train_epoch_end(self, trainer, model)` method will be automatically invoked by the trainer at the end of every epoch, allowing us to extract features from the benchmark dataset using the model's backbone and evaluate a KNN model on these features.

To simplify the process, we will use the `KNeighborsClassifier` class from `scikit-learn` to implement the KNN model.

In [None]:
import torch
from torch import Tensor
from torch.nn import functional as F
from lightning.pytorch.callbacks import Callback

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

class KNN_Benchmark(Callback):
    """
    KNN Benchmark that can be attached to the Lightning trainer object to track the backbone's performance on the downstream task.
    It expects the PyTorch Lightning model to have a backbone attribute.
    """

    def __init__(self, train_dataset, test_dataset, K: int = 10) -> None:
        """        
        Args:
            train_dataset: downstream training dataset -- N samples with D features - D must be compatible with the backbone.
            test_dataset: downstream test dataset -- M samples with D features - D must be compatible with the backbone.
            K: hyperparameter for KNN

        The callback will be invoked at the end of every epoch to compute the accuracy using KNN on the features extracted by the backbone of the lightning model.
        """
        super().__init__()

        # Store the features and labels from the downstream train and test datasets
        self.train_X, self.train_y = self.dataset_to_tensors(train_dataset)
        self.test_X, self.test_y = self.dataset_to_tensors(test_dataset)

        # Set the KNN hyperparameter
        self.K = K

        # Create the KNN classifier
        self.skl_KNN = KNeighborsClassifier(n_neighbors=self.K)    

    def on_train_epoch_end(self, trainer, model):
        # Use the model backbone to compute the features.
        train_features, test_features = self.compute_features(model.backbone)
        # Train the KNN model with the train set
        self.skl_KNN.fit(train_features, self.train_y)
        # Predict and compute the accuracy with the test set
        y_pred = self.skl_KNN.predict(test_features)
        acc = accuracy_score(self.test_y, y_pred)
        # Log the result using the PyTorch Lightning model logger.
        model.log("KNN_acc", acc, on_step=False, on_epoch=True, sync_dist=True)

    # Organize the features and labels from the dataset samples into two tensors.
    def dataset_to_tensors(self, dataset):
        features_l = [ f for f,l in dataset ]
        labels_l = [ l for f,l in dataset ]
        return torch.stack(features_l), torch.tensor(labels_l)

    # Extract the features using the backbone
    def compute_features(self, backbone):
        backbone_device = next(backbone.parameters()).device
        # Switch to eval mode so that BatchNorm uses (and does not update) its running statistics.
        was_training = backbone.training
        backbone.eval()
        with torch.no_grad():
            # Extract features from the train and test datasets using the model backbone
            train_features = backbone( self.train_X.to(backbone_device) ).flatten(start_dim=1)
            test_features = backbone( self.test_X.to(backbone_device) ).flatten(start_dim=1)
        # Restore the original mode before training resumes.
        backbone.train(was_training)
        return train_features.to("cpu"), test_features.to("cpu")
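
The evaluation step inside the callback can be tried in isolation. The sketch below replaces the backbone features with synthetic data (two well-separated Gaussian clusters, a made-up stand-in for encoded images) and runs the same fit, predict, and score sequence with `scikit-learn`.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def make_split(n_per_class):
    """Two well-separated Gaussian clusters standing in for backbone features."""
    x0 = rng.normal(loc=0.0, scale=1.0, size=(n_per_class, 16))
    x1 = rng.normal(loc=10.0, scale=1.0, size=(n_per_class, 16))
    X = np.concatenate([x0, x1])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y

train_X, train_y = make_split(50)
test_X, test_y = make_split(20)

# Same steps as in the callback: fit on the train features, score on the test features.
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(train_X, train_y)
acc = accuracy_score(test_y, knn.predict(test_X))
print(f"KNN accuracy: {acc:.2f}")
```

With real backbone features, the accuracy is expected to start low and increase as pretraining improves the representations.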

### 5.2 Setting Up the Downstream Dataset

We will download the STL10 training dataset and split it into separate training and testing subsets.
> Note: We will not use the STL10 test partition, as we want to avoid biasing our decisions based on the official test set.

In [14]:
# Torchvision transforms
from torchvision.transforms.v2 import Compose, ToImage, ToDtype, Normalize

# STL10 statistics for the train split
stl10_train_mean = torch.tensor([0.4467, 0.4398, 0.4066])
stl10_train_std  = torch.tensor([0.2603, 0.2566, 0.2713])

# Build the data transform pipeline to convert from PIL images to tensors and normalize the data. 
transform_pipeline = Compose([
    ToImage(), 
    ToDtype(torch.float32, scale=True),
    Normalize(mean=stl10_train_mean, std=stl10_train_std)
])

# Build the dataset object (This step will download the dataset if it hasn't been previously downloaded).
train_dataset = torchvision.datasets.STL10(root="data", 
                                           split="train",  
                                           download=True,
                                           transform=transform_pipeline)

The following code splits the `train_dataset` into train and validation subsets.

In [15]:
from torch.utils.data import random_split

# Split the data
torch.manual_seed(42)
train_size = int(0.80 * len(train_dataset))
test_size = len(train_dataset) - train_size
train_set, val_set = random_split(train_dataset, [train_size, test_size])
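
The split above relies on `torch.manual_seed(42)`, which seeds the global random number generator. An alternative, sketched below on a toy dataset, is to pass a dedicated `torch.Generator` to `random_split`, so that the partition does not depend on any other random operation executed earlier. The `split_indices` helper is hypothetical and only serves to expose the resulting indices.

```python
import torch
from torch.utils.data import TensorDataset, random_split

toy_dataset = TensorDataset(torch.arange(10))

def split_indices(seed):
    """Split the toy dataset 80/20 with a dedicated generator and return both index lists."""
    generator = torch.Generator().manual_seed(seed)
    train_part, val_part = random_split(toy_dataset, [8, 2], generator=generator)
    return train_part.indices, val_part.indices

# The same seed yields the same partition, regardless of the global RNG state.
first = split_indices(42)
torch.rand(5)  # consume some global randomness in between
second = split_indices(42)
print(first == second)  # True
```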

### 5.3 Create the Downstream Benchmark

In [16]:
downstream_benchmark = KNN_Benchmark(train_dataset=train_set, 
                                     test_dataset=val_set,
                                     K=10)

With the benchmark object created, we can attach it to the trainer, which is configured in the next section.

Once the training process finishes, you will find the model weights in the `checkpoints/` directory of the corresponding run (for example, `${log_ckpt_dir}/BYOL-Resnet18/version_0/checkpoints/`), where `log_ckpt_dir` is the log directory set in the next section.

## <a id="sec_6">6. Training the model</a>

As before, we’ll use a PyTorch Lightning `Trainer` object to handle the training process.

This time, however, we will enhance the trainer by adding several callbacks:

* **`ModelCheckpoint`**: to save model weights at regular intervals and also store the best-performing weights (based on the lowest validation loss).
  - We will also set `save_weights_only=True` to save only the model parameters (e.g., the backbone and the projection and prediction heads), excluding additional training states such as optimizer values and scheduler status.

* **`LearningRateMonitor`**: to track and log the learning rate throughout training.

* **`downstream_benchmark`**: The KNN benchmark to be executed at the end of every epoch. The goal is to assess whether the backbone's evolution is improving the feature representations for the downstream task, enabling a simple machine learning model to achieve better performance when trained on these features. It was defined in the previous section.

Additionally, we will configure a `TensorBoardLogger` to log training metrics. Model checkpoints are stored as `logs/16_BYOL_STL/Pretext/BYOL-Resnet18/version_0/checkpoints/epoch=N-step=...ckpt` files, where `N` corresponds to the epoch at which the checkpoint was saved.

The following code sets up the trainer along with these callbacks and logging configuration.

In [17]:
from lightning import Trainer
from lightning.pytorch.callbacks import ModelCheckpoint, LearningRateMonitor
from lightning.pytorch.loggers import TensorBoardLogger

log_ckpt_dir=f"logs/16_BYOL_STL/Pretext"
trainer = Trainer(max_epochs=n_epochs,
                  log_every_n_steps=16,
                  benchmark=True,
                  callbacks=[ModelCheckpoint(save_weights_only=True, mode='min', monitor='val_loss', save_last="link"), 
                             ModelCheckpoint(save_weights_only=True, every_n_epochs=checkpoint_every_n_epochs, save_top_k=-1), 
                             LearningRateMonitor('epoch'), 
                             downstream_benchmark],
                  logger = TensorBoardLogger(save_dir=log_ckpt_dir, name=f"BYOL-Resnet18"))

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs


Finally, we invoke the trainer object with the BYOL (pretext) model and the two-view version of the STL10 datamodule. 

In [18]:
trainer.fit(BYOL_model, datamodule)

/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/configuration_validator.py:68: You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.
You are using a CUDA device ('NVIDIA A100 80GB PCIe') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name                     | Type                     | Params | Mode 
------------------------------------------------------------------------------
0 | backbone                 | ResNet                   | 11.2 M | train
1 | projection_head          | MLP                      | 3.2 M  | train
2 | prediction_head          | MLP                      | 2.1 M  | train
3 | backbone_momentum        | ResNet       

Epoch 0: 100%|██████████| 313/313 [00:50<00:00,  6.15it/s, v_num=5, train_loss=-0.751]

/usr/local/lib/python3.10/dist-packages/lightning/pytorch/callbacks/model_checkpoint.py:384: `ModelCheckpoint(monitor='val_loss')` could not find the monitored key in the returned metrics: ['lr-Adam', 'train_loss', 'KNN_acc', 'epoch', 'step']. HINT: Did you call `log('val_loss', value)` in the `LightningModule`?


Epoch 1: 100%|██████████| 313/313 [00:45<00:00,  6.82it/s, v_num=5, train_loss=-0.818]

`Trainer.fit` stopped: `max_epochs=2` reached.


Epoch 1: 100%|██████████| 313/313 [00:47<00:00,  6.61it/s, v_num=5, train_loss=-0.818]


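Once the training is complete, you will find several checkpoints in the `log_ckpt_dir/BYOL-Resnet18/` folder. 
These checkpoints were saved at different epochs/training steps.
According to the Lightning documentation, `save_last="link"` makes `last.ckpt` a symbolic link to the most recently saved checkpoint, which is not necessarily the one with the best validation loss.
Note also that, as the warnings in the training log above indicate, the validation loop is skipped and `val_loss` is never logged in this run, so the first `ModelCheckpoint` callback cannot select checkpoints based on the validation loss.

Now, we can use these checkpoints to load backbone weights and employ pre-trained backbones on downstream tasks -- this is the subject for another tutorial.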
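
As a preview, the sketch below shows one possible way to isolate the backbone weights from a checkpoint. It assumes that the checkpoint stores the parameters under a `state_dict` entry and that backbone keys start with `backbone.`, which matches the module names printed in the training log but was not verified against Minerva's checkpoints. The `extract_backbone_state_dict` helper and the toy checkpoint are hypothetical.

```python
import torch

def extract_backbone_state_dict(checkpoint: dict, prefix: str = "backbone.") -> dict:
    """Keep only the entries under `prefix` and strip the prefix from their keys."""
    state_dict = checkpoint["state_dict"]
    return {k[len(prefix):]: v for k, v in state_dict.items() if k.startswith(prefix)}

# Toy checkpoint mimicking the assumed key layout.
toy_checkpoint = {
    "state_dict": {
        "backbone.weight": torch.ones(2, 2),
        "backbone_momentum.weight": torch.zeros(2, 2),
        "projection_head.weight": torch.zeros(2, 2),
    }
}

backbone_weights = extract_backbone_state_dict(toy_checkpoint)
print(list(backbone_weights))  # ['weight']
```

With a real checkpoint, the dictionary would come from `torch.load(path, map_location="cpu")`, and the filtered weights could then be passed to `load_state_dict()` on a model created with `generate_backbone()`. Because the prefix ends with a dot, entries under `backbone_momentum.` are excluded.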

## <a id="sec_7">7. Exercises</a>

1) **Transformation Analysis**: Experiment with different combinations of data augmentation transforms and evaluate their impact on the performance of the Downstream Benchmark.

2) **Projection Head Ablation**: Modify the projection head parameters and evaluate how these changes affect the quality of the learned representations.

3) **Backbone Evaluation with BYOL**: Modify the tutorial `10_minerva_SimCLR-STL10-downstream_task.ipynb` to load and evaluate backbones pretrained using BYOL.
    > **Hint**: Extend the notebook to allow a side-by-side comparison between the performance of backbones trained with SimCLR and BYOL.

4) **Latent Space Visualization**: Adapt the tutorial `12_minerva_SimCLR-STL10-latend_space_vis.ipynb` to visualize the latent spaces generated by BYOL-pretrained models.

5) **Optimizer Comparison**: Update the training code to use the LARS optimizer with the same hyperparameters reported in the original BYOL paper. 
   Compare the downstream performance against the results obtained using the current optimizer.
