# Self-Supervised Learning <a id="top"></a>

---
## Table of Contents

* [Self-Supervised Learning Overview](#ssl_overview)
    * [Core Concepts](#core_concepts)
    * [How It Works](#how_it_works)
* [Self-Supervised Learning Tutorial](#ssl_tutorial)
    * [Imports](#imports)
    * [Dataset Preparation](#dataset_prep)
    * [Model Architecture](#model_architecture)
    * [Projection Head](#projection_head)
    * [Simple Contrastive Learning of Representations](#simclr)
    * [Contrastive Loss Function](#constrastive_loss)
    * [Init the Model](#init_model)
    * [Training Loop](#training_loop)
* [Downstream Task (Image Classification)](#img_classification)
    * [Linear Classifier](#linear_classifier)
    * [Setup](#setup)
    * [Init the Classifier](#init_classifier)
    * [Train the Classifier](#train_classifier)
    * [Evaluate the Classifier](#eval_classifier) 

# Self-Supervised Learning Overview <a class="anchor" id="ssl_overview"></a>

Self-supervised learning (SSL) applies unsupervised learning to tasks that conventionally require supervised learning. SSL has attracted considerable interest in recent years for its ability to learn from unlabeled data, reduce annotation costs, and produce transferable representations.

Instead of relying on labeled datasets to understand semantic meanings, self-supervised models generate implicit labels from unstructured data. This enables the model to extract meaningful features from the data, allowing it to learn useful representations even without explicit labels.

SSL is particularly useful in fields like computer vision and natural language processing (NLP), where obtaining large amounts of labeled data can be challenging (e.g., anomaly detection).

A core technique in self-supervised learning is contrastive learning, which focuses on maximizing the similarity between representations of similar data points and minimizing the similarity between dissimilar ones. Imagine showing your model an image of a cat and an image of a dog. Contrastive learning encourages the model to create representations where the cat image's representation is closer to that of another cat image than it is to the dog image's representation.
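
To make the cat/dog intuition concrete, here is a minimal sketch that compares embedding vectors with cosine similarity. The three vectors and their names (`cat_a`, `cat_b`, `dog`) are made up for illustration; in practice a trained encoder would produce such embeddings.

```python
import torch
import torch.nn.functional as F

# Hypothetical 4-dimensional embeddings (illustrative values, not model output)
cat_a = torch.tensor([0.9, 0.1, 0.8, 0.2])
cat_b = torch.tensor([0.8, 0.2, 0.9, 0.1])
dog = torch.tensor([0.1, 0.9, 0.2, 0.8])

# Cosine similarity ranges from -1 (opposite direction) to 1 (same direction)
sim_cat_cat = F.cosine_similarity(cat_a, cat_b, dim=0).item()
sim_cat_dog = F.cosine_similarity(cat_a, dog, dim=0).item()

print(f"cat vs. cat: {sim_cat_cat:.3f}")
print(f"cat vs. dog: {sim_cat_dog:.3f}")
```

A contrastive objective pushes the encoder toward embeddings where the first number is high and the second is low.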

## How It Works <a class="anchor" id="how_it_works"></a>

- In supervised learning, ground truth labels are directly provided by human experts.
- In self-supervised learning, tasks are designed such that “ground truth” can be inferred from unlabeled data.
- SSL tasks fall into two categories:
  - Pretext Tasks: Train AI systems to learn meaningful representations of unstructured data. These learned representations can be subsequently used in downstream tasks.
  - Downstream Tasks: Reuse pre-trained models on new tasks, a technique known as "transfer learning".
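
As an illustration of how "ground truth" can be inferred from unlabeled data, the sketch below builds a rotation-prediction pretext task: each image is rotated by a multiple of 90 degrees, and the rotation index serves as the label. This is a different pretext task from the contrastive one used in this tutorial, and `make_rotation_batch` is a hypothetical helper name.

```python
import torch

def make_rotation_batch(images):
    """Rotate each image by 0, 90, 180 and 270 degrees; the label is the rotation index."""
    rotated, labels = [], []
    for k in range(4):
        # torch.rot90 rotates by k * 90 degrees in the (height, width) plane
        rotated.append(torch.rot90(images, k=k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

# Random tensors stand in for a batch of unlabeled 32x32 RGB images
unlabeled = torch.rand(8, 3, 32, 32)
inputs, pseudo_labels = make_rotation_batch(unlabeled)
print(inputs.shape, pseudo_labels.shape)
```

No human annotation is involved: the labels come entirely from the transformation applied to the data.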

# Self-Supervised Learning Tutorial <a class="anchor" id="ssl_tutorial"></a>

## Imports <a class="anchor" id="imports"></a>

In [None]:
import time

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

In [None]:
# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Available Device: {device}")

## Dataset Preparation <a class="anchor" id="dataset_prep"></a>

The `Cutout` class implements the cutout augmentation technique: during training it zeroes out randomly placed square patches of each input image, which makes the model more robust.

For this tutorial, we'll use the CIFAR-10 dataset. You can download and load it using torchvision.

In [None]:
# Data Augmentation
class Cutout:
    def __init__(self, n_holes, length):
        self.n_holes = n_holes
        self.length = length

    def __call__(self, img):
        # Extracts the height (h) and width (w) of the input image (img) using .size()
        h, w = img.size(1), img.size(2)
        mask = np.ones((h, w), np.float32)

        # Iterates over the specified number of holes (n_holes) and 
        # selects random coordinates (x, y) within the image dimensions
        for _ in range(self.n_holes):
            y = np.random.randint(h)
            x = np.random.randint(w)

            # Calculates the coordinates (y1, y2, x1, x2) for the cutout region 
            # around the random coordinates (x, y)
            y1 = int(max(0, y - self.length // 2))
            y2 = int(min(h, y + self.length // 2))
            x1 = int(max(0, x - self.length // 2))
            x2 = int(min(w, x + self.length // 2))

            # Updates the corresponding region in the mask to zeros, 
            # effectively creating a hole in the mask
            mask[y1:y2, x1:x2] = 0

        # The binary mask is converted into a PyTorch tensor (torch.Tensor) and 
        # expanded to match the dimensions of the input image
        mask = torch.from_numpy(mask)
        mask = mask.expand_as(img)

        return img * mask
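
The clipping arithmetic in `__call__` means a hole centred near the border is smaller than `length` x `length`. The sketch below isolates that calculation in a hypothetical helper, `hole_bounds`, which is not part of the class above.

```python
def hole_bounds(h, w, y, x, length):
    """Return (y1, y2, x1, x2) of a cutout hole centred at (y, x), clipped to the image."""
    half = length // 2
    return max(0, y - half), min(h, y + half), max(0, x - half), min(w, x + half)

# Centre in the middle of a 32x32 image: the full 16x16 hole fits
print(hole_bounds(32, 32, 16, 16, 16))  # (8, 24, 8, 24)
# Centre in the top-left corner: only an 8x8 patch lies inside the image
print(hole_bounds(32, 32, 0, 0, 16))    # (0, 8, 0, 8)
```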

In [None]:
# Takes a list of transformations as input and applies them sequentially to the image
transform = transforms.Compose([
    # Converts the input image into a PyTorch tensor
    transforms.ToTensor(),
    # Random crop of the input image and then resizes it to the specified size
    transforms.RandomResizedCrop(32),
    # Horizontally flips the input image at a probability of 0.5
    transforms.RandomHorizontalFlip(p=0.5),
    # Randomly rotates the image up to 10 degrees and 
    # translates it up to 10% of the image size both vertically and horizontally.
    transforms.RandomAffine(degrees=10, translate=(0.1, 0.1)),
    # Randomly adjusts the brightness, contrast, saturation, and hue of the input image
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.2),
    # Normalizes the image by subtracting the mean (0.5, 0.5, 0.5) from each channel and 
    # dividing by the standard deviation (0.5, 0.5, 0.5) for each channel
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
    # Masks out one randomly placed square region of up to 16x16 pixels
    Cutout(n_holes=1, length=16)
])
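
As a worked example of the `Normalize` step, the sketch below applies the same formula, `(x - mean) / std` with mean and standard deviation 0.5, to a few sample pixel values. The pixel values are chosen for illustration.

```python
import torch

mean, std = 0.5, 0.5
# ToTensor() scales pixel values to the range [0, 1]
pixels = torch.tensor([0.0, 0.25, 0.5, 1.0])
normalized = (pixels - mean) / std
print(normalized.tolist())  # [-1.0, -0.5, 0.0, 1.0]
```

With these settings the range [0, 1] is mapped to [-1, 1], so a pixel zeroed afterwards by `Cutout` corresponds to mid-gray.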

In [None]:
# Load CIFAR-10 Datasets
batch_size = 128 # number of samples to be processed in each batch during training and testing
num_workers = 16 # number of subprocesses to use for parallel data loading

# Creates an instance of the CIFAR-10 dataset for training
train_dataset = datasets.CIFAR10(root='./data', train=True,
                                 download=True, transform=transform)
# Creates a data loader for the training dataset
train_loader = DataLoader(train_dataset, batch_size=batch_size,
                          shuffle=True, num_workers=num_workers)

# Creates an instance of the CIFAR-10 dataset for testing
test_dataset = datasets.CIFAR10(root='./data', train=False,
                                download=True, transform=transform)
# Creates a data loader for the testing dataset
test_loader = DataLoader(test_dataset, batch_size=batch_size,
                         shuffle=False, num_workers=num_workers)

## Model Architecture <a class="anchor" id="model_architecture"></a>

Define a simple convolutional neural network (CNN) as the base encoder, which encodes the input images into feature representations. It contains Conv2d, ReLU and MaxPool2d layers.

- The encoder is defined as a Sequential container of convolutional layers (Conv2d), activation functions (ReLU), and max-pooling layers (MaxPool2d). This sequence of layers forms the encoder part of the neural network.
- The first convolutional layer (Conv2d) takes input channels of size 3 (for RGB images) and outputs 64 channels. It uses a kernel size of 3x3 and padding of 1 to maintain the spatial dimensions.
- ReLU activation functions (ReLU) are applied after each convolutional layer to introduce non-linearity.
- Max-pooling layers (MaxPool2d) with a kernel size of 2x2 are used to downsample the spatial dimensions of the feature maps.
- The process is repeated with increasing numbers of output channels (64, 128, 256, and finally 512) along with downsampling using max-pooling after each set of convolutional layers.

The forward method defines the forward pass computation of the Encoder module. It takes an input tensor x and passes it through the layers defined in self.encoder, returning the output tensor.

In [None]:
class Encoder(nn.Module):
    def __init__(self):
        super(Encoder, self).__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2),
            nn.Conv2d(128, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2),
            nn.Conv2d(256, 512, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2)
            )

    def forward(self, x):
        return self.encoder(x)
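
The size of the encoder's output can be traced by hand. The sketch below applies the standard output-size formulas for convolution and pooling to a 32x32 CIFAR-10 input; `conv_out` and `pool_out` are hypothetical helper names.

```python
def conv_out(size, kernel=3, padding=1, stride=1):
    """Spatial output size of a convolution."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel=2):
    """Spatial output size of max pooling with stride equal to the kernel size."""
    return size // kernel

size = 32  # CIFAR-10 images are 32x32
for channels in (64, 128, 256, 512):
    size = pool_out(conv_out(size))
    print(f"{channels} channels, {size}x{size}")

flat_dim = channels * size * size
print(flat_dim)  # 2048
```

Each 3x3 convolution with padding 1 keeps the spatial size and each pooling layer halves it, so the flattened encoder output has 512 * 2 * 2 = 2048 values.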

## Projection Head <a class="anchor" id="projection_head"></a>

Add a projection head to project the encoded features into a lower-dimensional space. This takes the encoded features from the Encoder and projects them into embeddings. It contains Linear and ReLU layers.

- The projection_head is defined as a Sequential container of linear layers (Linear) and activation functions (ReLU). This sequence of layers forms the projection head part of the neural network.
- The first linear layer (Linear) takes an input of size input_dim and outputs hidden_dim. This is essentially a fully connected layer with a ReLU activation function (ReLU) applied afterward.
- The second linear layer (Linear) then takes the output of the previous layer (of size hidden_dim) and outputs a tensor of size output_dim. No activation function is applied after this layer.

The forward method defines the forward pass computation of the ProjectionHead module. It takes an input tensor x and passes it through the layers defined in self.projection_head, returning the output tensor.

In [None]:
class ProjectionHead(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(ProjectionHead, self).__init__()
        self.projection_head = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, output_dim)
        )

    def forward(self, x):
        return self.projection_head(x)
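
A quick shape check can confirm what the head does. The sketch below rebuilds the same layer stack with `nn.Sequential`, using the dimensions this tutorial passes later (2048, 256, 128), and feeds it random tensors standing in for flattened encoder features.

```python
import torch
import torch.nn as nn

# Same layer stack as ProjectionHead, with input_dim=2048, hidden_dim=256, output_dim=128
head = nn.Sequential(
    nn.Linear(2048, 256),
    nn.ReLU(inplace=True),
    nn.Linear(256, 128),
)

features = torch.randn(4, 2048)  # stand-in for a batch of 4 flattened feature vectors
embeddings = head(features)
print(embeddings.shape)  # torch.Size([4, 128])
```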

## Simple Contrastive Learning of Representations <a class="anchor" id="simclr"></a>

Combine the encoder and projection head into the SimCLR model. This overall model contains the Encoder and ProjectionHead; it passes inputs through both and returns the features and projections.

- The `__init__` method initializes the SimCLR class. It first calls the `__init__` method of the parent class (`nn.Module`) using `super()`. Then, it initializes the encoder and projection_head attributes of the SimCLR class with the provided encoder and projection_head modules.
- `encoder` is the backbone neural network (e.g., a CNN) that extracts features from input data, and `projection_head` is a neural network module that projects the extracted features into a lower-dimensional space.

The `forward` method defines the forward pass computation of the SimCLR module. It takes an input tensor `x` and passes it through the `encoder` module (backbone network) to extract features.
- After extracting features, the `features` tensor is reshaped using `.view()` to flatten it while preserving the batch size (`features.size(0)`).
- The flattened features are then passed through the `projection_head` module to project them into a lower-dimensional space.
- The method returns both the extracted features and their corresponding projections.

In [None]:
class SimCLR(nn.Module):
    def __init__(self, encoder, projection_head):
        super(SimCLR, self).__init__()
        self.encoder = encoder
        self.projection_head = projection_head

    def forward(self, x):
        features = self.encoder(x)
        features = features.view(features.size(0), -1)  # Flatten the features
        projections = self.projection_head(features)
        return features, projections

## Contrastive Loss Function <a class="anchor" id="constrastive_loss"></a>

Define the contrastive loss function. This calculates the loss for the contrastive learning objective. It normalizes the features, calculates the similarity matrix between samples, and uses cross entropy on this to optimize the model.

- The `__init__` method initializes the `ContrastiveLoss` class. It takes an optional argument `temperature`, which defaults to `0.5`. The temperature parameter scales the similarity scores before computing the loss.

The forward method defines the forward pass computation of the ContrastiveLoss module. It takes a tensor of features extracted from the model's encoder as input.
- First, it normalizes the feature vectors along the embedding dimension (dimension 1) using `nn.functional.normalize()`. Normalization ensures that feature vectors have unit length, which is often beneficial for contrastive learning.
- It then computes the similarity matrix by performing a matrix multiplication of the normalized feature vectors and their transposes. The resulting matrix contains pairwise cosine similarities between all feature vectors.
- The similarity scores are divided by the temperature parameter to scale them.
- Finally, it computes the contrastive loss using `F.cross_entropy()`, treating each row of the similarity matrix as class scores. The labels are the indices 0 to batch size - 1, so the target for row i is column i, the diagonal entry. In this implementation each batch contains a single view per image, so the diagonal holds each sample's similarity with itself; the loss is low when that entry scores higher than the off-diagonal entries, which compare different samples.

In [None]:
class ContrastiveLoss(nn.Module):
    def __init__(self, temperature=0.5):
        super(ContrastiveLoss, self).__init__()
        self.temperature = temperature

    def forward(self, features):
        bs = features.size(0)
        features = nn.functional.normalize(features, dim=1)
        similarity_matrix = torch.matmul(features, features.T) / self.temperature
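        # Create the target indices on the same device as the features (works on CPU and GPU)
        labels = torch.arange(bs, device=features.device)
        loss = F.cross_entropy(similarity_matrix, labels)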
        return loss
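
For comparison, the SimCLR paper describes a two-view loss (NT-Xent) in which the positive for each sample is a second augmentation of the same image and self-similarity is excluded. The sketch below is one possible way to write it, not this tutorial's method; `nt_xent_loss`, `z1` and `z2` are hypothetical names, and the random inputs stand in for projections of two augmented views.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """Two-view contrastive loss: z1[i] and z2[i] embed two augmentations of image i."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2n, d), unit length
    sim = z @ z.T / temperature                          # (2n, 2n) scaled cosine similarities
    # Exclude each sample's similarity with itself from the softmax
    self_mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))
    # The positive for row i is the other view of the same image
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

torch.manual_seed(0)
z1 = torch.randn(8, 16)
aligned = nt_xent_loss(z1, z1.clone())             # both views embed identically
unrelated = nt_xent_loss(z1, torch.randn(8, 16))   # views embed independently
print(f"aligned: {aligned.item():.3f}, unrelated: {unrelated.item():.3f}")
```

Using it in training would require the data pipeline to return two independently augmented copies of each image, which the loaders in this tutorial do not do.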

## Init the Model <a class="anchor" id="init_model"></a>

In [None]:
# Instantiate the model
encoder = Encoder().to(device)
projection_head = ProjectionHead(2048, 256, 128).to(device) # Encoder output is 512 channels * 2 * 2 = 2048 features
model = SimCLR(encoder, projection_head).to(device)

# Hyperparameter Tuning (Experiment with different learning rates and epochs)
learning_rate = 0.0005  # Determines the step size at which the model parameters are updated during optimization
num_epochs = 10  # Specifies the number of times the entire dataset will pass through the model during training

# Define optimizer and loss function
criterion = ContrastiveLoss() # Defines the loss function for the contrastive learning task
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate) # Defines the optimizer using Adam optimizer

## Training Loop <a class="anchor" id="training_loop"></a>

Define and run the training loop.

In [None]:
start = time.time()

for epoch in range(num_epochs):
    total_loss = 0
    # Iterates over batches of data from the train_loader
    for batch in train_loader:
        images, _ = batch
        # Moves images to GPU, if available
        images = images.to(device)
        # Passes the images through the model to obtain both features and their projections
        features, projections = model(images)
        # Contrastive loss is computed using the features
        loss = criterion(features)

        # The optimizer's gradients are zeroed
        optimizer.zero_grad()
        # Backpropagates the loss through the network
        loss.backward()
        # Updates the model parameters based on the computed gradients
        optimizer.step()

        # Accumulates total loss for the epoch
        total_loss += loss.item()
        
    # Print information every 20 epochs or at the last epoch
    if (epoch + 1) % 20 == 0 or epoch == num_epochs - 1:
        print(f"Epoch [{epoch + 1}/{num_epochs}], "
              f"Loss: {total_loss / len(train_loader):.4f}")

end = time.time()
print(f"\nTraining took {(end - start)/60} mins")

<details>
    <summary>Click to show/hide expected output </summary>
    <br>
    Epoch [10/10], Loss: 4.8454
    <br>
    <br>
    
Training took 0.8129964351654053 mins
    
</details>

# Downstream Task (Image Classification) <a class="anchor" id="img_classification"></a>

This section trains a simple classifier on top of the frozen encoder of your SimCLR model.

### Linear Classifier <a class="anchor" id="linear_classifier"></a>

This simple classifier takes the learned features and classifies them into classes. Despite its name, it is a small multilayer perceptron that contains Linear, ReLU and Dropout layers.

- `input_dim` is the dimensionality of the input features.
- `num_classes` is the number of classes in the classification task.
- It consists of three fully connected (linear) layers (`nn.Linear`). The first two are each followed by a ReLU activation function (`nn.ReLU`) and a dropout layer (`nn.Dropout`).
- The first fully connected layer (`fc1`) takes input of size input_dim and outputs 512 features.
- The second fully connected layer (`fc2`) takes 512 features as input and outputs 256 features.
- The third fully connected layer (`fc3`) takes 256 features as input and outputs `num_classes` features, corresponding to the class scores.

The forward method defines the forward pass computation of the LinearClassifier module.
- It takes an input tensor `x` and passes it through the layers defined in the `__init__` method.
- After each of the first two fully connected layers, a ReLU activation function is applied, followed by dropout for regularization.
- The output of the last fully connected layer is returned, representing the class scores for each input sample.

In [None]:
# Define a simple linear classifier
class LinearClassifier(nn.Module):
    def __init__(self, input_dim, num_classes):
        super(LinearClassifier, self).__init__()
        self.fc1 = nn.Linear(input_dim, 512)  # First hidden layer
        self.relu = nn.ReLU(inplace=True)
        self.dropout1 = nn.Dropout(p=0.2)  # Dropout layer for regularization
        self.fc2 = nn.Linear(512, 256)  # Second hidden layer
        self.dropout2 = nn.Dropout(p=0.2)
        self.fc3 = nn.Linear(256, num_classes)  # Output layer producing class scores

    def forward(self, x):
        x = self.dropout1(self.relu(self.fc1(x)))
        x = self.dropout2(self.relu(self.fc2(x)))
        x = self.fc3(x)
        return x

### Setup <a class="anchor" id="setup"></a>

In [None]:
# Set it to evaluation mode and send it to GPU
model.eval()
model.to(device)
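
One way to obtain features from a frozen encoder is sketched below. It is an illustration under stated assumptions rather than this tutorial's code: the small `encoder` is a stand-in so the example runs on its own (in the tutorial it would be `model.encoder`), and `extract_features` is a hypothetical helper name.

```python
import torch
import torch.nn as nn

# Stand-in encoder (hypothetical); in the tutorial this role is played by model.encoder
encoder = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
)
encoder.eval()
# Freeze the encoder so its weights are never updated
for param in encoder.parameters():
    param.requires_grad_(False)

def extract_features(encoder, images):
    """Run images through the frozen encoder and flatten the result per sample."""
    with torch.no_grad():
        feats = encoder(images)
    return feats.flatten(start_dim=1)

images = torch.randn(4, 3, 32, 32)  # random stand-in for a CIFAR-10 batch
feats = extract_features(encoder, images)
print(feats.shape)  # torch.Size([4, 2048])
```

A classifier trained on such features would need its `input_dim` to match the flattened feature size (8 * 16 * 16 = 2048 for this stand-in).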

### Init the Classifier <a class="anchor" id="init_classifier"></a>

In [None]:
# Initialize the classifier (input_dim = 192 * 4 * 4 = 3072, the length of a flattened 3 x 32 x 32 CIFAR-10 image)
classifier = LinearClassifier(input_dim=192 * 4 * 4, 
                              num_classes=10).to(device)

# Hyperparameter Tuning (Experiment with different learning rates and epochs)
learning_rate = 0.0001  # Determines the step size at which the model parameters are updated during optimization
num_epochs = 5  # Specifies the number of times the entire dataset will pass through the model during training

# Define optimizer and loss function
criterion = nn.CrossEntropyLoss() # Cross-entropy loss combines log-softmax and negative log-likelihood loss
optimizer = torch.optim.Adam(classifier.parameters(), lr=learning_rate, 
                             weight_decay=0.001) # Initializes the optimizer using the Adam algorithm
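
The relationship between cross-entropy, log-softmax and negative log-likelihood can be checked numerically. The sketch below uses random logits and arbitrary labels chosen for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)           # raw class scores for 4 samples, 10 classes
labels = torch.tensor([1, 0, 3, 9])   # arbitrary target classes

combined = F.cross_entropy(logits, labels)
two_step = F.nll_loss(F.log_softmax(logits, dim=1), labels)
print(torch.allclose(combined, two_step))  # True
```

This is why the classifier returns raw scores from its last layer without applying softmax itself.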

### Train the Classifier <a class="anchor" id="train_classifier"></a>

In [None]:
start = time.time()

# Train the linear classifier
for epoch in range(num_epochs):
    # Sets the model (classifier) to training mode
    classifier.train()
    # Iterates over the batches from the train_loader; `features` holds the image tensors
    for features, labels in train_loader:
        # Flattens each image into a vector of length 3 * 32 * 32 = 3072
        features = features.view(features.size(0), -1).to(device)
        labels = labels.to(device)
        # Performs a forward pass through the classifier model to obtain predicted class scores for the input features
        outputs = classifier(features)
        # Computes the cross-entropy loss using the predicted outputs and the ground truth labels
        loss = criterion(outputs, labels)


        # The optimizer's gradients are zeroed
        optimizer.zero_grad()
        # Backpropagates the loss through the network
        loss.backward()
        # Updates the model parameters based on the computed gradients
        optimizer.step()

    # Print information every 20 epochs or at the last epoch
    if (epoch + 1) % 20 == 0 or epoch == num_epochs - 1:
        print(f"Epoch [{epoch + 1}/{num_epochs}], Loss: {loss.item():.4f}")

end = time.time()
print(f"\nTraining took {(end - start)/60} mins")

<details>
    <summary>Click to show/hide expected output </summary>
    <br>
    Epoch [5/5], Loss: 2.0327
    <br>
    <br>
    Training took 0.3966091076533 mins
    
</details>

### Evaluate the Classifier <a class="anchor" id="eval_classifier"></a>

In [None]:
classifier.eval()
correct = 0
total = 0
# Disable gradient computation during inference to save memory and computation
with torch.no_grad():
    # Iterates over the batches of data (images and labels) from the test_loader
    for images, labels in test_loader:
        # Flattens each image into a vector of length 3 * 32 * 32 = 3072
        images = images.view(images.size(0), -1).to(device)
        labels = labels.to(device)
        # Performs a forward pass through the classifier model to obtain predicted class scores for the input image
        outputs = classifier(images)
        # Uses `torch.max()` to find the maximum score along dimension 1 and return the index of the predicted class
        _, predicted = torch.max(outputs, 1)
        # Increments the total counter by the number of labels in the current batch
        total += labels.size(0)
        # Comparing the predicted labels with the ground truth labels and summing the number of correct predictions
        correct += (predicted == labels).sum().item()

print(f"Accuracy on the test set: {(100 * correct / total):.2f}%")

<details>
    <summary>Click to show/hide expected output </summary>
    <br>
    Accuracy on the test set: 27.60%
    
</details>
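
The accuracy calculation can be followed on a tiny hand-made example. The scores and labels below are invented for illustration (4 samples, 3 classes).

```python
import torch

# Hypothetical class scores for 4 samples and 3 classes
outputs = torch.tensor([[2.0, 0.5, 0.1],
                        [0.2, 0.3, 1.5],
                        [0.9, 1.1, 0.4],
                        [1.2, 0.1, 0.3]])
labels = torch.tensor([0, 2, 0, 0])

# torch.max returns (values, indices); the indices are the predicted classes
_, predicted = torch.max(outputs, 1)
correct = (predicted == labels).sum().item()
accuracy = 100 * correct / labels.size(0)
print(predicted.tolist(), f"{accuracy:.2f}%")  # [0, 2, 1, 0] 75.00%
```

For reference, guessing uniformly among CIFAR-10's 10 balanced classes would give about 10% accuracy.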

## Notes

This tutorial provides a basic implementation of contrastive learning with SimCLR. You can experiment further by adjusting hyperparameters, using different datasets, or exploring other data augmentations and architectures.

**[Go to Top](#top)**