# **Introduction: Self-Supervised Learning with BYOL (Bootstrap Your Own Latent)**


**Self-supervised learning** is a form of unsupervised learning where the system learns to predict part of its input from other parts of its input. The key idea is to create a supervised learning task from unlabeled data, allowing the model to learn representations that can be useful for a wide range of tasks without the need for manually annotated labels. This approach leverages the inherent structure in the data to generate labels from the data itself, often through cleverly designed **pretext tasks**. It has become increasingly popular due to its ability to leverage large amounts of unlabeled data, significantly reducing the dependence on expensive labeled datasets.    
    
## **General Process**
**Pretext Task Creation**: A pretext task is created from the unlabeled data. The nature of this task can vary widely but is designed so that solving it will require the model to understand and learn meaningful representations of the data.   

**Model Training**: The model is trained on this self-generated supervised task, learning to predict the artificially created labels from the input data.

**Feature Extraction**: After training, the learned representations (features) can be used for **downstream tasks**. These tasks are often the actual target tasks we care about, such as classification, detection, or segmentation in vision, and various NLP tasks in text.
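To make the pretext-task step concrete, here is a minimal sketch of one well-known pretext task, rotation prediction, which is not otherwise used in this notebook. The helper names `rotate90` and `make_rotation_batch` are hypothetical, and images are represented as plain nested lists purely for illustration.

```python
import random


def rotate90(image, k):
    """Rotate a 2D grid (list of rows) clockwise by k quarter turns."""
    for _ in range(k % 4):
        image = [list(row) for row in zip(*image[::-1])]
    return image


def make_rotation_batch(images, rng):
    """Turn unlabeled images into (rotated_image, rotation_label) pairs."""
    batch = []
    for image in images:
        k = rng.randrange(4)  # pseudo-label: number of quarter turns applied
        batch.append((rotate90(image, k), k))
    return batch


# Two tiny unlabeled "images"; the labels come from the data itself.
unlabeled = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
pairs = make_rotation_batch(unlabeled, random.Random(0))
for rotated, label in pairs:
    print(label, rotated)
```

A model trained to predict `label` from `rotated` never sees a human annotation, yet it has to pick up on object orientation and shape to solve the task.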

## **Popular Approaches**

**Contrastive Learning**: This approach involves learning representations by contrasting positive pairs against negative pairs. A positive pair consists of two different augmentations of the same data point, while negative pairs are generated from different data points. The model learns by bringing the representations of positive pairs closer and pushing those of negative pairs apart. Examples include SimCLR and MoCo.
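The pull-together, push-apart idea can be sketched as an InfoNCE-style loss, the family of objectives that SimCLR and MoCo build on. This is a simplified single-anchor version in plain Python; the function names and the temperature value are assumptions made for illustration, not taken from either paper's code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.5):
    """Cross-entropy of picking the positive among positive + negatives."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[0]


anchor = [1.0, 0.0]
# Positive close to the anchor, negatives far away: low loss.
good = info_nce(anchor, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
# Positive far from the anchor, a negative close to it: high loss.
bad = info_nce(anchor, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
print(good, bad)
```

The loss is small only when the anchor is more similar to its positive than to every negative, which is why the choice and number of negatives matters for these methods.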

**Cluster-Based Learning**: Techniques like DeepCluster and SeLa work by clustering the feature space to assign pseudo-labels to the data, then training the model to predict these cluster assignments. This cyclic process of clustering and prediction helps in learning useful features.

**Prediction-Based Methods**: These methods predict one part of the data from another. Examples include predicting future frames in a video or a missing region of an image. In natural language processing (NLP), popular pretext tasks are predicting the next word in a sequence, as in GPT-style language models, and predicting masked words in a sentence, as in BERT.
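As a small illustration of a prediction-based pretext task, the sketch below builds masked-word training pairs from a raw sentence. It is deliberately simplified: BERT's actual masking procedure has additional rules, and the `mask_tokens` helper and `[MASK]` constant here are hypothetical names.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob, rng):
    """Hide random tokens; the hidden originals become the prediction targets."""
    inputs, targets = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            targets.append(token)  # the model must recover this token
        else:
            inputs.append(token)
            targets.append(None)  # no loss is computed at this position
    return inputs, targets


sentence = "the online network predicts the target representation".split()
inputs, targets = mask_tokens(sentence, mask_prob=0.3, rng=random.Random(0))
print(inputs)
print(targets)
```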

**BYOL (Bootstrap Your Own Latent)**: A novel approach that avoids the need for negative pairs by training two networks simultaneously: an online network and a target network. The online network learns to predict the target network's representation of the same data point under a different augmentation.

**SimSiam**: Similar to BYOL, SimSiam operates without negative pairs but simplifies the architecture by not using a moving average target network. Instead, it employs a stop-gradient operation to prevent collapsing.

## **Advantages and Challenges**
### **Advantages**:

* Reduces the reliance on expensive labeled data.

* Can leverage vast amounts of unlabeled data available.

* Learned representations are often more generalizable across different tasks.

### **Challenges**:

* Designing effective pretext tasks is non-trivial and often domain-specific.

* Some approaches, particularly contrastive learning, require careful negative pair sampling to avoid trivial solutions.


It is an ongoing research question how best to transfer self-supervised features to downstream tasks.

Self-supervised learning continues to offer more scalable and efficient ways to learn from data, and it remains an area of active research, with new methods and improvements proposed regularly.

In this notebook we implement the BYOL approach to self-supervised learning using the Lightly Python library (https://pypi.org/project/lightly/).

### **Installs**

In [3]:
! pip install lightly


### **Imports**

In [1]:
import copy

import torch
import torchvision
from torch import nn

from lightly.loss import NegativeCosineSimilarity
from lightly.models.modules import BYOLPredictionHead, BYOLProjectionHead
from lightly.models.utils import deactivate_requires_grad, update_momentum
from lightly.transforms.byol_transform import (
    BYOLTransform,
    BYOLView1Transform,
    BYOLView2Transform,
)
from lightly.utils.scheduler import cosine_schedule

### **Model Definition**

**BYOL (Bootstrap Your Own Latent)** is a novel approach to **self-supervised learning** introduced in a paper by Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, and Michal Valko. The main idea behind BYOL is to train a deep neural network to learn powerful representations without relying on negative samples, which is a common requirement in many other self-supervised learning frameworks.

### **Mechanics of BYOL**:

**Architecture**: BYOL uses two networks, an online network and a target network. Both contain an encoder (backbone) and a projection head with the same architecture but separate weights. The online network additionally has a prediction head, and the paper reports that this asymmetry helps prevent the representations from collapsing to a trivial solution. The online network is updated through backpropagation during training, while the target network's weights are updated as a slow-moving average of the online network's weights. This means the target network evolves more smoothly over time.

**Learning Process**: The core idea is to make the representation of an augmented version of an image (produced by the online network) similar to the representation of another augmented version of the same image (produced by the target network). BYOL uses two sets of data augmentations to create these two different views of the same image. These augmentations can include cropping, resizing, color jittering, etc.
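As a toy illustration of the two-view idea above, the sketch below produces two independently augmented views of the same image using only random cropping and horizontal flipping on a nested list. The helper names are hypothetical, and this is not the augmentation pipeline of the paper or of Lightly's `BYOLTransform`.

```python
import random


def random_crop(image, size, rng):
    """Cut a random size x size window out of a 2D grid."""
    top = rng.randrange(len(image) - size + 1)
    left = rng.randrange(len(image[0]) - size + 1)
    return [row[left:left + size] for row in image[top:top + size]]


def random_hflip(image, rng, p=0.5):
    """Mirror the grid left-to-right with probability p."""
    return [row[::-1] for row in image] if rng.random() < p else image


def two_views(image, size, rng):
    """Apply the same random pipeline twice to get two different views."""
    def augment():
        return random_hflip(random_crop(image, size, rng), rng)
    return augment(), augment()


image = [[r * 4 + c for c in range(4)] for r in range(4)]
view_a, view_b = two_views(image, size=3, rng=random.Random(0))
print(view_a)
print(view_b)
```

In BYOL, one such view would be passed through the online network and the other through the target network.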

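**Loss Function**: The similarity between the online network's prediction and the target network's projection is measured with a regression loss. The BYOL paper uses the mean squared error between the L2-normalized vectors, which equals 2 − 2 × their cosine similarity. The `NegativeCosineSimilarity` loss used later in this notebook differs from it only by a constant offset and scale. The goal is to minimize the distance between the representations of the two augmented views of the same image as produced by the online and target networks, respectively.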
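The relationship between a squared-error loss on L2-normalized vectors and cosine similarity can be checked numerically. The sketch below uses plain Python lists and hypothetical function names; it illustrates the identity only and is not Lightly's implementation.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def byol_regression_loss(prediction, target):
    """Squared Euclidean distance between the normalized vectors."""
    p, z = l2_normalize(prediction), l2_normalize(target)
    return sum((a - b) ** 2 for a, b in zip(p, z))


def negative_cosine_similarity(prediction, target):
    """Negative cosine similarity; ranges from -1 (aligned) to 1 (opposite)."""
    p, z = l2_normalize(prediction), l2_normalize(target)
    return -sum(a * b for a, b in zip(p, z))


p = [1.0, 2.0, 3.0]
z = [2.0, 1.0, 0.5]
mse = byol_regression_loss(p, z)
ncs = negative_cosine_similarity(p, z)
# For unit vectors: ||p - z||^2 = 2 - 2 * cos(p, z) = 2 + 2 * ncs
print(mse, 2 + 2 * ncs)
```

Because the two losses differ only by an affine transformation with positive scale, minimizing one also minimizes the other. It also explains why the negative-cosine loss can take negative values: lower values mean that predictions and targets are better aligned.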

**No Negative Pairs**: Unlike contrastive learning approaches that require comparing positive pairs (similar or the same data points) with negative pairs (dissimilar data points) to learn useful features, BYOL does not use negative pairs. It only relies on positive pairs and still learns useful representations. This is significant because managing negative pairs can be challenging and computationally expensive in large datasets.

**Update Mechanism**: The target network's parameters are updated as an exponential moving average of the online network's parameters. This update mechanism is key to BYOL's performance, as it provides stability to the learning process and helps in learning consistent representations.
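The exponential moving average update described above can be written in a few lines. This sketch treats each network's parameters as a flat list of floats, which is a simplification; `ema_update` is a hypothetical helper, not the Lightly function used later in the notebook.

```python
def ema_update(target_params, online_params, momentum):
    """target <- momentum * target + (1 - momentum) * online, per parameter."""
    return [
        momentum * t + (1.0 - momentum) * o
        for t, o in zip(target_params, online_params)
    ]


online = [1.0, -2.0]  # held fixed here to show how the target follows it
target = [0.0, 0.0]
for step in range(10):
    target = ema_update(target, online, momentum=0.9)
print(target)
```

With a momentum close to 1 the target moves only a small fraction of the way toward the online weights at each step, which is what makes it a slowly changing, more stable regression target.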

Grill, Jean-Bastien, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch et al. "Bootstrap your own latent-a new approach to self-supervised learning." Advances in neural information processing systems 33 (2020): 21271-21284.

In [2]:
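class BYOL(nn.Module):
    def __init__(self, backbone):
        super().__init__()

        # Online network: backbone -> projection head -> prediction head.
        # 512 is the feature dimension of the ResNet-18 backbone used below.
        self.backbone = backbone
        self.projection_head = BYOLProjectionHead(512, 1024, 256)
        self.prediction_head = BYOLPredictionHead(256, 1024, 256)

        # Target (momentum) network: copies of the backbone and projection head.
        # It has no prediction head and receives no gradients.
        self.backbone_momentum = copy.deepcopy(self.backbone)
        self.projection_head_momentum = copy.deepcopy(self.projection_head)

        deactivate_requires_grad(self.backbone_momentum)
        deactivate_requires_grad(self.projection_head_momentum)

    def forward(self, x):
        # Online branch: returns the prediction p.
        y = self.backbone(x).flatten(start_dim=1)
        z = self.projection_head(y)
        p = self.prediction_head(z)
        return p

    def forward_momentum(self, x):
        # Target branch: returns the projection z, detached from the graph.
        y = self.backbone_momentum(x).flatten(start_dim=1)
        z = self.projection_head_momentum(y)
        z = z.detach()
        return z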

In [3]:
resnet = torchvision.models.resnet18()
backbone = nn.Sequential(*list(resnet.children())[:-1])
model = BYOL(backbone)

In [4]:
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

BYOL(
  (backbone): Sequential(
    (0): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
    (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU(inplace=True)
    (3): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
    (4): Sequential(
      (0): BasicBlock(
        (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace=True)
        (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
      (1): BasicBlock(
        (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
 

### **Transforms and Data Loading**


In [6]:
# Disable resizing and gaussian blur for cifar10.
transform = BYOLTransform(
    view_1_transform=BYOLView1Transform(input_size=32, gaussian_blur=0.0),
    view_2_transform=BYOLView2Transform(input_size=32, gaussian_blur=0.0),
)

In [None]:
dataset = torchvision.datasets.CIFAR10(
    "datasets/cifar10", download=True, transform=transform
)
# or create a dataset from a folder containing images or videos
# (requires `from lightly.data import LightlyDataset`):
# dataset = LightlyDataset("path/to/folder", transform=transform)


Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to datasets/cifar10/cifar-10-python.tar.gz


100%|██████████| 170498071/170498071 [00:09<00:00, 18455263.65it/s]


Extracting datasets/cifar10/cifar-10-python.tar.gz to datasets/cifar10


### **Model training**

In [None]:
dataloader = torch.utils.data.DataLoader(
    dataset,
    batch_size=256,
    shuffle=True,
    drop_last=True,
    num_workers=8,
)



In [None]:
criterion = NegativeCosineSimilarity()
optimizer = torch.optim.SGD(model.parameters(), lr=0.06)

epochs = 10

In [None]:
print("Starting Training")
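for epoch in range(epochs):
    total_loss = 0
    # Momentum for the target update, increased from 0.996 to 1 over training.
    momentum_val = cosine_schedule(epoch, epochs, 0.996, 1)
    for batch in dataloader:
        # batch[0] holds the two augmented views; the labels in batch[1] are unused.
        x0, x1 = batch[0]
        # Move the target network toward the online network (moving average).
        update_momentum(model.backbone, model.backbone_momentum, m=momentum_val)
        update_momentum(
            model.projection_head, model.projection_head_momentum, m=momentum_val
        )
        x0 = x0.to(device)
        x1 = x1.to(device)
        # Online predictions (p) and target projections (z) for both views.
        p0 = model(x0)
        z0 = model.forward_momentum(x0)
        p1 = model(x1)
        z1 = model.forward_momentum(x1)
        # Symmetrized loss: each prediction is matched to the other view's target.
        loss = 0.5 * (criterion(p0, z1) + criterion(p1, z0))
        total_loss += loss.detach()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    avg_loss = total_loss / len(dataloader)
    print(f"epoch: {epoch:>02}, loss: {avg_loss:.5f}")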

Starting Training
epoch: 00, loss: -0.49175
epoch: 01, loss: -0.54451
epoch: 02, loss: -0.56582
epoch: 03, loss: -0.58042
epoch: 04, loss: -0.58841
epoch: 05, loss: -0.59704
epoch: 06, loss: -0.60009
epoch: 07, loss: -0.60295
epoch: 08, loss: -0.60661
epoch: 09, loss: -0.60718
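The momentum used in the training loop above rises from 0.996 toward 1 along a cosine curve. The sketch below implements the schedule as given in the BYOL paper, τ = 1 − (1 − τ_base) · (cos(πk/K) + 1) / 2. The function name `byol_momentum` is hypothetical, and Lightly's `cosine_schedule` helper may differ in details such as how the final step is handled.

```python
import math


def byol_momentum(step, total_steps, base=0.996):
    """Cosine ramp from `base` at step 0 to 1.0 at step == total_steps."""
    return 1.0 - (1.0 - base) * (math.cos(math.pi * step / total_steps) + 1.0) / 2.0


schedule = [byol_momentum(k, 10) for k in range(11)]
for k, tau in enumerate(schedule):
    print(f"step {k:>2}: momentum {tau:.5f}")
```

As the momentum approaches 1, the target network changes less and less, so it is effectively frozen by the end of training.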


### **Conclusion**

BYOL has shown impressive results in learning visual representations without labels, matching or outperforming the state-of-the-art methods of its time on multiple benchmarks [1]. Its effectiveness without negative pairs challenges the previously held belief that contrastive learning with negative samples was necessary for successful self-supervised learning.

### **References**

[1] Grill, J.B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., Piot, B., Kavukcuoglu, K., Munos, R. and Valko, M., 2020. **Bootstrap Your Own Latent-A New Approach to Self-Supervised Learning**. Advances in neural information processing systems, 33, pp.21271-21284.

**Abstract from BYOL paper**:
We introduce Bootstrap Your Own Latent (BYOL), a new approach to self-supervised image representation learning. BYOL relies on two neural networks, referred to as online and target networks, that interact and learn from each other. From an augmented view of an image, we train the online network to predict the target network representation of the same image under a different augmented view. At the same time, we update the target network with a slow-moving average of the online network. While state-of-the art methods rely on negative pairs, BYOL achieves a new state of the art without them. BYOL reaches 74.3% top-1 classification accuracy on ImageNet using a linear evaluation with a ResNet-50 architecture and 79.6% with a larger ResNet. We show that BYOL performs on par or better than the current state of the art on both transfer and semi-supervised benchmarks. Our implementation and pretrained models are given on GitHub.

[2] **Review — BYOL: Bootstrap Your Own Latent A New Approach to Self-Supervised Learning
Outperforms Contrastive Learning Approaches: SimCLR, MoCo v2, CPCv2, CMC, MoCo**. https://sh-tsang.medium.com/review-byol-bootstrap-your-own-latent-a-new-approach-to-self-supervised-learning-6f770a624441

[3] **BYOL —The Alternative to Contrastive Self-Supervised Learning, Paper Analysis—Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning**. https://towardsdatascience.com/byol-the-alternative-to-contrastive-self-supervised-learning-5d0a26983d7c


