<a href="https://colab.research.google.com/github/naru289/Assignment-37/blob/main/M4_AST_37_Momentum_Contrast(MoCo)_Cifar10_C%20Copy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Programme in Deep Learning (Foundations and Applications)
## A Program by IISc and TalentSprint
### Assignment: Momentum Contrast (MoCo) for Unsupervised Visual Representation Learning










## Learning Objectives

At the end of the experiment, you will be able to:

* implement Momentum Contrast (MoCo) for unsupervised visual representation learning

## Dataset

### Description

In this experiment, we will use the CIFAR-10 dataset. It consists of 60,000 colour images (32x32) in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images.


The dataset is divided into five training batches and one test batch, each with 10,000 images. The test batch contains 1,000 randomly selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5,000 images from each class.

Here are the classes in the dataset, as well as 10 random images from each:


<img src="https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/Images/CIFAR10.png" alt="Drawing" height="350" width="440"/>

It has the classes: ‘airplane’, ‘automobile’, ‘bird’, ‘cat’, ‘deer’, ‘dog’, ‘frog’, ‘horse’, ‘ship’, ‘truck’. The images in CIFAR-10 are of size 3x32x32, i.e. 3-channel color images of 32x32 pixels in size.

## Information

Unsupervised representation learning is highly successful in natural language processing, e.g., as shown by BERT. But supervised pre-training is still dominant in computer vision, where unsupervised methods generally lag behind. The reason may stem from differences in their respective signal spaces. Language tasks have discrete signal spaces (words, sub-word units, etc.) for building tokenized dictionaries, on which unsupervised learning can be based. Computer vision, in contrast, further concerns dictionary building, as the raw signal is in a continuous, high-dimensional space and is not structured for human communication (e.g., unlike words).

Several recent studies present promising results on unsupervised visual representation learning using approaches related to the contrastive loss. Though driven by various motivations, these methods
can be thought of as building **dynamic dictionaries**. The
**“keys”** (tokens) in the dictionary are sampled from data
(e.g., images or patches) and are represented by an encoder
network. Unsupervised learning trains encoders to perform
dictionary look-up: an encoded **“query”** should be similar
to its matching key and dissimilar to others. Learning is
formulated as minimizing a contrastive loss.

<br>
<center>
<img src="https://miro.medium.com/max/1016/1*AURmxepRI4G6WT0DirxZDw.png" width=450px/>
</center>

Momentum Contrast (MoCo) trains a visual representation encoder by matching an encoded **query q** to a dictionary of encoded keys using a contrastive loss. The dictionary keys {k0, k1, k2, ...} are defined by a set of data samples. The dictionary is built as a queue, with the current mini-batch enqueued and the oldest mini-batch dequeued, decoupling it from the mini-batch size. The keys are encoded by a slowly progressing encoder, driven by a momentum update with the query encoder. This method enables a large and consistent dictionary for learning visual representations.

From this perspective, we hypothesize that it is desirable
to build dictionaries that are: (i) large and (ii) consistent
as they evolve during training. Intuitively, a larger dictionary may better sample the underlying continuous, high-dimensional visual space, while the keys in the dictionary should be represented by the same or similar encoder so that their comparisons to the query are consistent.


We present **Momentum Contrast (MoCo)** as a way of building large and consistent dictionaries for unsupervised learning with a contrastive loss (figure above). We maintain the **dictionary as a queue** of data samples: the encoded representations of the current mini-batch are enqueued, and the
oldest are dequeued. The queue decouples the dictionary
size from the mini-batch size, allowing it to be large. Moreover, as the dictionary keys come from the preceding several mini-batches, a slowly progressing key encoder, implemented as a momentum-based moving average of the query encoder, is proposed to maintain consistency.


In the [paper](https://arxiv.org/pdf/1911.05722.pdf), they followed a simple instance discrimination task: a query matches a key if they are encoded views (e.g., different crops) of the same image. Using this pretext task, MoCo shows competitive results.

**Contrastive learning** is used for unsupervised pre-training. It learns a metric space over samples (images in our case) in which the distance between two positive samples is reduced while the distance between two negative samples is enlarged. Positive samples can be samples from the same category or different augmented versions of the same sample, while negative samples can be samples from different categories.

A main purpose of unsupervised learning is to pre-train representations (i.e., features) that can be transferred to downstream tasks by fine-tuning. The results reported in the paper show that MoCo largely closes the gap between unsupervised and supervised representation learning in many computer vision tasks.



### MoCo (Momentum Contrast)




The training process is designed as follows:

* A query image is selected and processed by the encoder network to compute q, the encoded query image.

* Since the goal of the model is to learn to differentiate between a large number of different images, this query image encoding is compared not just to one mini-batch of encoded key images, but to several of them.

* To achieve that, MoCo forms a queue of mini-batches that are encoded by the momentum encoder network. As a new mini-batch is selected, its encodings are enqueued and the oldest encodings in the data structure are dequeued. This decouples the dictionary size, represented by the queue, from the batch size and enables a much larger dictionary to query from.

If the encoding of the query image matches a key in the dictionary, these two views are deemed to be from the same image (e.g. multiple different crops).
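
The enqueue/dequeue step described above can be sketched as a fixed-size ring buffer. This is only a minimal illustration, not the notebook's model code: the sizes `dim`, `K`, and `batch_size` and the helper `enqueue` are hypothetical, and constant vectors stand in for encoded keys so the result is easy to trace.

```python
import torch

dim, K, batch_size = 4, 8, 4          # hypothetical toy sizes; K is a multiple of batch_size
queue = torch.zeros(dim, K)           # one column per stored key
ptr = 0                               # column where the next mini-batch is written

def enqueue(queue, ptr, keys):
    """Overwrite the oldest columns with the new keys and return the advanced pointer."""
    n = keys.shape[0]
    queue[:, ptr:ptr + n] = keys.t()  # keys is (n, dim); the queue stores (dim, n)
    return (ptr + n) % queue.shape[1]

# Three mini-batches; batch b is filled with the value b + 1 so it is easy to trace.
for b in range(3):
    keys = torch.full((batch_size, dim), float(b + 1))
    ptr = enqueue(queue, ptr, keys)

first_half = queue[:, :batch_size]    # slot of batch 1, reused by batch 3
second_half = queue[:, batch_size:]   # still holds batch 2
print(first_half[0].tolist(), second_half[0].tolist(), ptr)
```

Because the queue holds only two mini-batches here, the third batch overwrites the first one, which is the "oldest out, newest in" behaviour the text describes.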





#### Results

These are the ResNet-18 classification accuracies (in %) of a **kNN monitor** on the unsupervised pre-training features.

| config | 200ep | 400ep | 800ep |
| --- | --- | --- | --- |
| Asymmetric | 82.6 | 86.3 | 88.7 |
| Symmetric | 85.3 | 88.5 | 89.7 |

#### Notes

* **Symmetric loss**: the original MoCo [paper](https://arxiv.org/pdf/1911.05722.pdf) uses an *asymmetric* loss -- one crop is the query and the other crop is the key, and it backpropagates to one crop (query). Following SimCLR/BYOL, here we provide an option of a *symmetric* loss -- it swaps the two crops and computes an extra loss. The symmetric loss behaves like 2x epochs of the asymmetric counterpart: this may dominate the comparison results when the models are trained with a fixed epoch number.

* **SplitBatchNorm**: the original MoCo was trained on 8 GPUs. To simulate the multi-GPU behavior of BatchNorm on a single GPU, we provide a SplitBatchNorm layer. We set `bn_splits 8` by default to simulate 8 GPUs. `bn_splits 1` is analogous to SyncBatchNorm in the multi-GPU case.

* **kNN monitor**: The paper provides a kNN monitor on the test set.



### Setup Steps:

In [1]:
#@title Please enter your registration id to start: { run: "auto", display-mode: "form" }
Id = "2237180" #@param {type:"string"}

In [2]:
#@title Please enter your password (normally your phone number) to continue: { run: "auto", display-mode: "form" }
password = "6366871391" #@param {type:"string"}

In [3]:
#@title Run this cell to complete the setup for this Notebook
from IPython import get_ipython
import warnings
warnings.filterwarnings("ignore")

ipython = get_ipython()

notebook= "M4_AST_37_Momentum_Contrast(MoCo)_Cifar10_C" #name of the notebook

def setup():
    ipython.magic("sx wget https://cdn.iisc.talentsprint.com/DLFA/Experiment_related_data/MoCo_model_checkpoint.pth")
    from IPython.display import HTML, display
    display(HTML('<script src="https://dashboard.talentsprint.com/aiml/record_ip.html?traineeId={0}&recordId={1}"></script>'.format(getId(),submission_id)))
    print("Setup completed successfully")
    return

def submit_notebook():
    ipython.magic("notebook -e "+ notebook + ".ipynb")

    import requests, json, base64, datetime

    url = "https://dashboard.talentsprint.com/xp/app/save_notebook_attempts"
    if not submission_id:
      data = {"id" : getId(), "notebook" : notebook, "mobile" : getPassword()}
      r = requests.post(url, data = data)
      r = json.loads(r.text)

      if r["status"] == "Success":
          return r["record_id"]
      elif "err" in r:
        print(r["err"])
        return None
      else:
        print ("Something is wrong, the notebook will not be submitted for grading")
        return None

    elif getAnswer1() and getAnswer2() and getComplexity() and getAdditional() and getConcepts() and getComments() and getMentorSupport():
      f = open(notebook + ".ipynb", "rb")
      file_hash = base64.b64encode(f.read())

      data = {"complexity" : Complexity, "additional" :Additional,
              "concepts" : Concepts, "record_id" : submission_id,
              "answer1" : Answer1, "answer2" : Answer2, "id" : Id, "file_hash" : file_hash,
              "notebook" : notebook,
              "feedback_experiments_input" : Comments,
              "feedback_mentor_support": Mentor_support}
      r = requests.post(url, data = data)
      r = json.loads(r.text)
      if "err" in r:
        print(r["err"])
        return None
      else:
        print("Your submission is successful.")
        print("Ref Id:", submission_id)
        print("Date of submission: ", r["date"])
        print("Time of submission: ", r["time"])
        print("View your submissions: https://dlfa-iisc.talentsprint.com/notebook_submissions")
        #print("For any queries/discrepancies, please connect with mentors through the chat icon in LMS dashboard.")
        return submission_id
    else: return submission_id


def getAdditional():
  try:
    if not Additional:
      raise NameError
    else:
      return Additional
  except NameError:
    print ("Please answer Additional Question")
    return None

def getComplexity():
  try:
    if not Complexity:
      raise NameError
    else:
      return Complexity
  except NameError:
    print ("Please answer Complexity Question")
    return None

def getConcepts():
  try:
    if not Concepts:
      raise NameError
    else:
      return Concepts
  except NameError:
    print ("Please answer Concepts Question")
    return None


# def getWalkthrough():
#   try:
#     if not Walkthrough:
#       raise NameError
#     else:
#       return Walkthrough
#   except NameError:
#     print ("Please answer Walkthrough Question")
#     return None

def getComments():
  try:
    if not Comments:
      raise NameError
    else:
      return Comments
  except NameError:
    print ("Please answer Comments Question")
    return None


def getMentorSupport():
  try:
    if not Mentor_support:
      raise NameError
    else:
      return Mentor_support
  except NameError:
    print ("Please answer Mentor support Question")
    return None

def getAnswer1():
  try:
    if not Answer1:
      raise NameError
    else:
      return Answer1
  except NameError:
    print ("Please answer Question 1")
    return None

def getAnswer2():
  try:
    if not Answer2:
      raise NameError
    else:
      return Answer2
  except NameError:
    print ("Please answer Question 2")
    return None


def getId():
  try:
    return Id if Id else None
  except NameError:
    return None

def getPassword():
  try:
    return password if password else None
  except NameError:
    return None

submission_id = None
### Setup
if getPassword() and getId():
  submission_id = submit_notebook()
  if submission_id:
    setup()
else:
  print ("Please complete Id and Password cells before running setup")



Setup completed successfully


### Importing required packages


In [None]:
from functools import partial
from PIL import Image
import math
import numpy as np
import matplotlib.pyplot as plt
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import CIFAR10
from torchvision.models import resnet
from tqdm import tqdm

### Initializing CUDA

CUDA is the interface between our code and the GPU.

By default, the code runs on the CPU. To run it on the GPU, we need CUDA. Check if CUDA is available:

In [None]:
# Test whether a GPU is present in the system or not.
use_cuda = torch.cuda.is_available()
print('Using PyTorch version:', torch.__version__, 'CUDA:', use_cuda)

If it's False, the program runs on the CPU. If it's True, the program runs on the GPU.

Let us initialize some GPU-related variables:

In [None]:
device = torch.device("cuda" if use_cuda else "cpu")
print(device)

### Load Cifar-10 dataset


In [None]:
# Define transformations
# The data augmentation setting proposed by the paper
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(32),    # Crop a random region of the image and resize it to 32x32
    transforms.RandomHorizontalFlip(p=0.5), # Random horizontal flip
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8), # Random color jittering (brightness, contrast, saturation, hue)
    transforms.RandomGrayscale(p=0.2),   # Random grayscale conversion
    transforms.ToTensor(),
    transforms.Normalize([0.4914, 0.4822, 0.4465], [0.2023, 0.1994, 0.2010])]) # Normalizing with mean and standard deviation values

test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize([0.4914, 0.4822, 0.4465], [0.2023, 0.1994, 0.2010])])

The class below wraps the CIFAR-10 dataset and returns, for each image, a pair of independently transformed views.

In [None]:
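class CIFAR10Pair(CIFAR10):
    """CIFAR-10 dataset that returns two augmented views of each image."""
    def __getitem__(self, index):
        # Select the raw image array at this index
        img = self.data[index]

        # Convert the array to a PIL image
        img = Image.fromarray(img)

        # Apply the (random) transform twice to get two different views;
        # without a transform, both views are the same untransformed image
        if self.transform is not None:
            im_1 = self.transform(img)
            im_2 = self.transform(img)
        else:
            im_1, im_2 = img, img

        return im_1, im_2 # Returns image pairs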

In [None]:
# Loading train and test sets
train_data = CIFAR10Pair(root='data', train=True, transform=train_transform, download=True)
memory_data = CIFAR10(root='data', train=True, transform=test_transform, download=True)
test_data = CIFAR10(root='data', train=False, transform=test_transform, download=True)

In [None]:
# Check number of training and test images
dataset_sizes = {'Train': len(train_data), 'Test': len(test_data)}
dataset_sizes



The **torch.utils.data.DataLoader** class represents a Python iterable over a dataset, with the following features:

1. Batching the data
2. Shuffling the data
3. Loading the data in parallel using multiprocessing workers


The batches of train and test data are provided via data loaders that provide iterators over the datasets to train our models.
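
To make the batching behaviour concrete, here is a small sketch on a hypothetical toy dataset (`toy_data`, 10 samples), not on CIFAR-10. It illustrates what `batch_size` and `drop_last` do; the variable names are chosen for illustration only.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy dataset: 10 samples with 3 features each, plus integer labels.
features = torch.arange(30, dtype=torch.float32).view(10, 3)
labels = torch.arange(10)
toy_data = TensorDataset(features, labels)

# drop_last=True discards the final incomplete batch (10 samples -> 2 batches of 4).
toy_loader = DataLoader(toy_data, batch_size=4, shuffle=True, drop_last=True)
batch_shapes = [tuple(x.shape) for x, y in toy_loader]
print(batch_shapes)

# Without drop_last, the remaining 2 samples form a smaller third batch.
full_loader = DataLoader(toy_data, batch_size=4, shuffle=False)
print(len(toy_loader), len(full_loader))
```

Dropping the last incomplete batch keeps every training batch the same size, which matters later because the MoCo queue update asserts that the queue size is a multiple of the batch size.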

In [None]:
# Define dataloaders
batch_size = 512

train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True, num_workers=16, pin_memory=True, drop_last=True)

memory_loader = DataLoader(memory_data, batch_size=batch_size, shuffle=False, num_workers=16, pin_memory=True)

test_loader = DataLoader(test_data, batch_size=batch_size, shuffle=False, num_workers=16, pin_memory=True)

In [None]:
# Fetch one batch (batch_size images and labels) from the memory loader
train_images, train_labels = next(iter(memory_loader))
train_images.shape, train_labels.shape

In [None]:
# labels Translator
label_names = {v: k for k, v in train_data.class_to_idx.items()}
label_names

### Visualization of CIFAR-10 dataset

In [None]:
# Create a grid of images along with their corresponding labels
L = 3
W = 3

fig, axes = plt.subplots(L, W, figsize = (12, 12))
axes = axes.reshape(-1)

# The images are normalized tensors, so clamp them to [0, 1] once for display
# (colors look distorted because the normalization is not undone)
images_to_show = train_images.clamp(0, 1)

for i in range(L * W):
    axes[i].imshow(images_to_show[i].permute(1, 2, 0))
    axes[i].set_title(label_names[train_labels[i].item()])
    axes[i].axis('off')

plt.tight_layout()

### Split Batch-Normalization

The name Split Batch Normalization (Split-BN) comes from work on semi-supervised learning (SSL). That work observes that using unlabeled data in semi-supervised learning is not always beneficial and can even hurt generalization, especially when there is a class mismatch between the unlabeled and labeled examples. The authors investigate this phenomenon for image classification on the CIFAR-10 and ImageNet datasets under several forms of domain shift. Their main contribution is Split-BN, a technique to improve SSL when the additional unlabeled data comes from a shifted distribution; it works by using separate batch normalization statistics for unlabeled examples. Because it is simple, they recommend it as standard practice. They also analyse how domain shift affects the SSL training process and find that, during training, the statistics of hidden activations in late layers become markedly different between the unlabeled and the labeled examples.

In this notebook, the `SplitBatchNorm` layer reuses the idea of separate statistics for a different purpose: it splits each batch into groups along the batch dimension to simulate multi-GPU BatchNorm on a single GPU.


**Batch Normalization:** The main idea behind batch normalization is to normalize the distribution of hidden activations $h$ based on the batch statistics as follows:

<center>
$\hat{h} = \alpha \frac{h - \mu(h)}{\sigma(h)} + \beta$
</center>

where $\alpha$ and $\beta$ are learnable parameters, and $\mu(h)$ and $\sigma(h)$ are the mean and the standard deviation computed on the given batch $h$, called batch normalization statistics. Batch normalization leads to large improvements in both convergence speed and generalization performance of deep neural networks.
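
As a quick sanity check of the formula, the sketch below computes it by hand for a toy activation tensor and compares the result with PyTorch's functional batch norm in training mode. The tensor sizes and the values of `alpha` and `beta` are made up for illustration; the small constant `eps` is an implementation detail added inside the square root for numerical stability and does not appear in the formula above.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
h = torch.randn(16, 3, 4, 4)               # (N, C, H, W) toy activations
alpha = torch.tensor([1.0, 2.0, 0.5])      # per-channel scale (called weight in PyTorch)
beta = torch.tensor([0.0, -1.0, 3.0])      # per-channel shift (called bias in PyTorch)
eps = 1e-5

# Statistics are taken per channel over the batch and spatial dimensions.
mu = h.mean(dim=(0, 2, 3), keepdim=True)
var = h.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
h_hat = alpha.view(1, -1, 1, 1) * (h - mu) / torch.sqrt(var + eps) + beta.view(1, -1, 1, 1)

# Reference: functional batch norm in training mode, without running statistics.
reference = F.batch_norm(h, None, None, weight=alpha, bias=beta, training=True, eps=eps)
print(torch.allclose(h_hat, reference, atol=1e-5))
```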

**Split Batch Normalization:**

Typically, the batch normalization statistics used during inference are computed on the whole training dataset. However, these statistics are not accurate if the deep network is applied to examples coming from a different distribution. One possible solution to this issue is to recompute the statistics on the new dataset and allow the model to learn new $\alpha$ and $\beta$ parameters.

The authors' main contribution is to introduce a related technique for semi-supervised learning: computing batch normalization statistics separately for the unlabeled and the labeled data. By ensuring the hidden activations have the same statistics regardless of the label presence, the aim is to reduce the negative effect of a domain shift between the labeled and unlabeled examples. This technique is referred to as **Split Batch Normalization (Split-BN)**.

More precisely, let $h_l$ and $h_u$ denote the labeled and the unlabeled examples in a given batch $h$, respectively. Then Split-BN normalizes the hidden activations
as follows:

<center>
$\hat{h_u} = \alpha \frac{h_u - \mu(h_u)}{\sigma(h_u)} + \beta$
</center>
<br>
<center>
$\hat{h_l} = \alpha \frac{h_l - \mu(h_l)}{\sigma(h_l)} + \beta$
</center>

Analogously, during inference the means and standard deviations are computed separately on the labeled and the unlabeled examples. Even though the statistics are computed independently, the $\alpha$ and $\beta$ parameters are shared.

**Note:** To understand more about split batch normalization refer to the following [link](https://arxiv.org/pdf/1904.03515.pdf).
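
The `SplitBatchNorm` layer in this notebook applies the same idea of separate statistics to sub-groups of a single batch. It does so with a reshape: the split index is folded into the channel dimension, so one batch norm call normalizes every group independently. The sketch below, with hypothetical toy sizes and without learnable parameters or running statistics, checks that this reshape is equivalent to normalizing each group on its own.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, C, H, W, S = 8, 3, 2, 2, 4          # hypothetical toy sizes; S is the number of splits
x = torch.randn(N, C, H, W)

# Reshape trick: fold the split index into the channel dimension, then run one batch norm.
folded = F.batch_norm(x.view(-1, C * S, H, W), None, None, training=True).view(N, C, H, W)

# Explicit version: normalize each group of samples with its own statistics.
explicit = torch.empty_like(x)
for s in range(S):
    group = x[s::S].contiguous()       # samples s, s + S, s + 2S, ... form one group
    explicit[s::S] = F.batch_norm(group, None, None, training=True)

print(torch.allclose(folded, explicit, atol=1e-5))
```

Note that with this reshape the groups are interleaved (every `S`-th sample) rather than contiguous chunks of the batch.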



In [None]:
# SplitBatchNorm: simulate multi-gpu behavior of BatchNorm in one gpu by splitting along the batch dimension
# implementation adapted from https://github.com/davidcpage/cifar10-fast/blob/master/torch_backend.py
# Redefining the nn.BatchNorm2d method with the num of batch splits
# Applies Batch Normalization over a 4D input (a mini-batch of 2D inputs with additional channel dimension)
class SplitBatchNorm(nn.BatchNorm2d):
    def __init__(self, num_features, num_splits, **kw):
        super().__init__(num_features, **kw)
        self.num_splits = num_splits

    # The mean and standard-deviation are calculated per-dimension over the mini-batches and α and β
    # are learnable parameter vectors of size C (where C is the input size). By default, the elements of α are
    # set to 1 and the elements of β are set to 0.
    def forward(self, input):
        N, C, H, W = input.shape
        # Also by default, during training this layer keeps running estimates of its computed mean and variance,
        # which are then used for normalization during evaluation. The running estimates are kept with a default momentum of 0.1
        # If track_running_stats is set to False, this layer then does not keep running estimates,
        # and batch statistics are instead used during evaluation time as well.
        if self.training or not self.track_running_stats:
            running_mean_split = self.running_mean.repeat(self.num_splits)
            running_var_split = self.running_var.repeat(self.num_splits)
            outcome = nn.functional.batch_norm(
                input.view(-1, C * self.num_splits, H, W), running_mean_split, running_var_split,
                self.weight.repeat(self.num_splits), self.bias.repeat(self.num_splits),
                True, self.momentum, self.eps).view(N, C, H, W)
            self.running_mean.data.copy_(running_mean_split.view(self.num_splits, C).mean(dim=0))
            self.running_var.data.copy_(running_var_split.view(self.num_splits, C).mean(dim=0))
            return outcome
        else:
            return nn.functional.batch_norm(
                input, self.running_mean, self.running_var,
                self.weight, self.bias, False, self.momentum, self.eps)

### Define base ResNet 18 Encoder

ResNet is a Convolutional Neural Network (CNN) architecture made up of a series of residual blocks (ResBlocks). Their skip connections are what differentiate ResNets from other CNNs.

We adopt a ResNet as the encoder, whose last fully-connected layer (after global average pooling) has a fixed-dimensional output (128-D). This output vector is normalized. This is the representation of the query or key.
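
To illustrate why the output vector is normalized, the sketch below uses random vectors as stand-ins for encoder outputs and shows that, after L2 normalization, a plain dot product equals the cosine similarity. The batch size of 4 is arbitrary.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
features = torch.randn(4, 128)               # stand-in for raw encoder outputs (N, 128)
z = F.normalize(features, dim=1)             # each row now has unit L2 norm

norms = z.norm(dim=1)
dot = z @ z.t()                              # pairwise dot products of normalized rows
cosine = F.cosine_similarity(features.unsqueeze(1), features.unsqueeze(0), dim=2)
print(norms, torch.allclose(dot, cosine, atol=1e-5))
```

With unit-norm features, every similarity lies in [-1, 1], so the temperature in the contrastive loss controls the scale of the logits.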



<center>
<img src="https://www.researchgate.net/profile/Sajid-Iqbal-13/publication/336642248/figure/fig1/AS:839151377203201@1577080687133/Original-ResNet-18-Architecture.png" width=750px/>
</center>



The following is the pipeline from images to representations:

<br>
<center>
<img src="https://miro.medium.com/max/700/1*0bYRv7XBQPpbnxMjO7i1RQ.jpeg" width=750px/>
</center>



In [None]:
# Create a ResNet backbone and remove the classification head
class ModelBase(nn.Module):
    """
    Common CIFAR ResNet recipe.
    Comparing with ImageNet ResNet recipe, it:
    (i) replaces conv1 with kernel=3, str=1
    (ii) removes pool1
    """
    def __init__(self, feature_dim=128, arch=None, bn_splits=16):
        super(ModelBase, self).__init__()

        # use split batchnorm
        norm_layer = partial(SplitBatchNorm, num_splits=bn_splits) if bn_splits > 1 else nn.BatchNorm2d

        # Loading the ResNet-18 architecure without pretrained weights
        resnet_arch = getattr(resnet, arch)

        # Feature representation of images are passed as number of classes to the linear layer
        # num_classes is the output fc dimension
        net = resnet_arch(num_classes=feature_dim, norm_layer=norm_layer)

        self.net = []
        # This module is composed of “children” or “submodules” that define the layers of the neural network
        # and are utilized for computation within the module’s forward() method.
        # Immediate children of a module can be iterated through via a call to children() or named_children():
        for name, module in net.named_children():
            # Changing the first convolutional layer kernel size, stride and padding, remove maxpooling and adding the layers back to the module
            if name == 'conv1':
                module = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
            if isinstance(module, nn.MaxPool2d):
                continue
            if isinstance(module, nn.Linear):
                self.net.append(nn.Flatten(1))
            self.net.append(module)  # keep the (possibly replaced) child module, in its original order
        self.net = nn.Sequential(*self.net)

    def forward(self, x):
        x = self.net(x)
        # note: not normalized here
        return x

### Define Momentum Contrast (MoCo) wrapper

**Contrastive Learning as Dictionary Look-up**

Contrastive learning is a way of building a discrete dictionary on high-dimensional continuous inputs such as images. The dictionary is dynamic in the sense that the keys are randomly sampled, and that the key encoder evolves during training.

Contrastive learning, and its recent developments, can be thought of as training an encoder for a dictionary look-up task.

Consider an encoded **query q** and a set of encoded samples **{k0, k1, k2, …}** that are the keys of a dictionary. Assume that there is a single key (denoted as k+) in the dictionary that q matches. A contrastive loss is a function whose value is low when q is similar to its **positive key k+** and dissimilar to all other keys (considered negative keys for q). With similarity measured by dot product, a form of contrastive loss function called **InfoNCE** (NCE stands for noise-contrastive estimation) is considered in the [paper](https://arxiv.org/pdf/1911.05722.pdf), where $T$ is a temperature hyper-parameter and the sum runs over one positive and $K$ negative samples:

<br>
<center>
$L_q = -\log \frac{\exp(q \cdot k_{+} / T)}{\sum_{i=0}^{K} \exp(q \cdot k_{i} / T)}$
</center>
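
As a rough sketch of how this loss can be computed, the snippet below evaluates InfoNCE for a toy batch. It assumes L2-normalized features, places the positive logit in column 0, and applies cross-entropy with target class 0, which is the construction used in the paper's pseudocode. The helper name `info_nce` and the sizes `N`, `C`, `K` are hypothetical, and random vectors stand in for encoder outputs.

```python
import torch
import torch.nn.functional as F

def info_nce(q, k_pos, queue, T=0.1):
    """InfoNCE for queries q (N, C), positive keys k_pos (N, C), negatives queue (C, K)."""
    l_pos = torch.einsum('nc,nc->n', q, k_pos).unsqueeze(-1)  # (N, 1) positive logits
    l_neg = torch.einsum('nc,ck->nk', q, queue)               # (N, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / T             # positive sits in column 0
    labels = torch.zeros(logits.shape[0], dtype=torch.long)   # so the target class is 0
    return F.cross_entropy(logits, labels)

torch.manual_seed(0)
N, C, K = 8, 16, 32                                   # hypothetical toy sizes
q = F.normalize(torch.randn(N, C), dim=1)
queue = F.normalize(torch.randn(C, K), dim=0)

loss_matched = info_nce(q, q.clone(), queue)          # key identical to the query
loss_random = info_nce(q, F.normalize(torch.randn(N, C), dim=1), queue)
print(loss_matched.item(), loss_random.item())
```

The loss is small when each query agrees with its positive key and large when the "positive" is just another random vector, which is the behaviour the formula is designed to produce.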


**Dictionary as a queue (Dynamic Dictionaries):** At the core of the approach is maintaining the dictionary as a queue of data samples. This allows us to reuse the encoded keys from the immediately preceding mini-batches. The introduction of a queue decouples the dictionary size from the mini-batch size. The dictionary size can be much larger than a typical mini-batch size, and can be flexibly and independently set as a hyper-parameter.


<br>
<center>
<img src="https://miro.medium.com/max/700/1*wvWN9acS5AlXMM0nKghRvg.png" width=750px/>
</center>
<br>

We can look at the contrastive learning approach in a slightly different way i.e., matching queries to keys. Instead of having a single encoder, we now have two encoders — one for query and another one for the key. Moreover, to have a large number of negative samples, we have a large dictionary of encoded keys.

A **positive pair** in this context means that the query matches the key. They match if both the query and the key come from the same image. An encoded query should be similar to its matching key and dissimilar to others.

For **negative pairs**, **we maintain a large dictionary which contains encoded keys from previous batches**. They serve as negative samples to the query at hand. We maintain the dictionary in the form of a **queue**. The **latest batch is enqueued and the oldest batch is dequeued**. By changing the size of this queue, we change the number of negative samples.

**Challenges with this approach**

* Using a queue lets the dictionary be large, but because the key encoder changes during training, keys enqueued later can become inconsistent with keys that were enqueued much earlier. **For the contrastive learning approach to work, all the keys that are compared to the queries must come from the same or similar encoders for the comparisons to be meaningful and consistent.**
    
* Another challenge is that **it’s not feasible to learn the key encoder parameters using backpropagation because that would require calculating gradients for all the samples in the queue** (which would result in a large computational graph).



**Momentum update:**

To address both of the issues above, MoCo implements the key encoder as a momentum-based moving average of the query encoder. It updates the key encoder parameters in the following way:

$f_k$ - Key Encoder
<br>
$f_q$ - Query Encoder

Formally, denoting the parameters of $f_k$ as $\theta_k$ and those
of $f_q$ as $\theta_q$, we update $\theta_k$ by:

<br>
<center>
$\theta_k \leftarrow m\theta_k + (1 - m) \theta_q$
</center>
<br>
Here $m \in [0, 1)$ is a momentum coefficient. Only the parameters $\theta_q$ are updated by back-propagation. The momentum update in the equation above makes $\theta_k$ evolve more smoothly than $\theta_q$. As a result, though the keys in the queue are encoded by different encoders (in different mini-batches), the difference among these encoders can be made small. In the paper's experiments, a relatively large momentum (e.g., m = 0.999, the paper's default) works much better than a smaller value (e.g., m = 0.9), suggesting that a slowly evolving key encoder is core to making use of a queue. The model defined in this notebook uses m = 0.99 by default.
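
The update rule can be sketched with two small linear layers standing in for the encoders. This is an illustrative toy, not the notebook's model: `f_q`, `f_k`, and the helper `momentum_update` are hypothetical names, and the "+1.0" shift simply imitates an optimizer step on the query encoder.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
f_q = nn.Linear(4, 2)                  # stand-in for the query encoder
f_k = copy.deepcopy(f_q)               # key encoder starts as an exact copy
for p in f_k.parameters():
    p.requires_grad = False            # the key encoder receives no gradients

@torch.no_grad()
def momentum_update(f_q, f_k, m):
    """theta_k <- m * theta_k + (1 - m) * theta_q, parameter by parameter."""
    for p_q, p_k in zip(f_q.parameters(), f_k.parameters()):
        p_k.mul_(m).add_(p_q, alpha=1.0 - m)

# Pretend an optimizer step moved the query weights by +1.0 everywhere.
with torch.no_grad():
    for p in f_q.parameters():
        p.add_(1.0)

momentum_update(f_q, f_k, m=0.99)
gap = (f_q.weight - f_k.weight).abs().max().item()
print(gap)
```

With m = 0.99 the key encoder moves only 1% of the way toward the query encoder per step, so the remaining gap after one update is 0.99 of the original shift.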

In [None]:
 # Build a MoCo model with: a query encoder, a key encoder, and a queue https://arxiv.org/abs/1911.05722
class ModelMoCo(nn.Module):
    def __init__(self, dim=128, K=4096, m=0.99, T=0.1, arch='resnet18', bn_splits=8, symmetric=True):
        super(ModelMoCo, self).__init__()

        """
        dim: feature dimension (default: 128)
        K: queue size; number of negative keys (default: 4096)
        m: moco momentum of updating key encoder (default: 0.99)
        T: softmax temperature (default: 0.1)
        """
        self.K = K
        self.m = m
        self.T = T
        self.symmetric = symmetric

        # Create the encoders for query and key
        # f_q, f_k: encoder networks for query and key
        # queue: dictionary as a queue of K keys (CxK)
        # num_classes is the output fc dimension
        self.encoder_q = ModelBase(feature_dim=dim, arch=arch, bn_splits=bn_splits)
        self.encoder_k = ModelBase(feature_dim=dim, arch=arch, bn_splits=bn_splits)

        # Initialize the key encoder to have the same values as query encoder
        # Do not update the key encoder via gradient
        for param_q, param_k in zip(self.encoder_q.parameters(), self.encoder_k.parameters()):
            param_k.data.copy_(param_q.data)  # initialize
            param_k.requires_grad = False  # The gradients of keys will not get updated

        # create the queue
        # For the current mini-batch, we encode the
        # queries and their corresponding keys, which form the positive sample pairs. The negative samples are from the queue.
        # register_buffer adds a buffer to the module:
        # the first argument is the buffer name, through which it can be accessed on the module;
        # the second argument is the tensor to store (saved with the model, but not trained)
        # Create the queue to store negative samples, initialized with random values
        self.register_buffer("queue", torch.randn(dim, K))

        # Normalization of the negative keys in a specified dimensions
        self.queue = nn.functional.normalize(self.queue, dim=0)

        # Initialize Queue pointer (dequeue and enqueue)
        self.register_buffer("queue_ptr", torch.zeros(1, dtype=torch.long))

    @torch.no_grad()
    def _momentum_update_key_encoder(self):
        # For each of the parameters in each encoder
        for param_q, param_k in zip(self.encoder_q.parameters(), self.encoder_k.parameters()):
            # θk ← m.θk + (1 − m)θq.
            # Only the parameters θq are updated by back-propagation.
            param_k.data = param_k.data * self.m + param_q.data * (1. - self.m)

    @torch.no_grad()
    def _dequeue_and_enqueue(self, keys):
        '''
        Update the memory / queue.
        Add batch to end of most recent sample index and remove the oldest samples in the queue.
        Store location of most recent sample index (ptr).
        args:
            keys (Tensor): Feature representations of the view x_k computed by the key encoder.
        '''

        # gather keys before updating queue
        batch_size = keys.shape[0]

        ptr = int(self.queue_ptr)
        assert self.K % batch_size == 0  # for simplicity

        # replace the keys at ptr (dequeue and enqueue)
        self.queue[:, ptr:ptr + batch_size] = keys.t()  # transpose
        ptr = (ptr + batch_size) % self.K  # move pointer

        self.queue_ptr[0] = ptr

    @torch.no_grad()
    def _batch_shuffle_single_gpu(self, x):
        """
        Batch shuffle, for making use of splitBatchNorm.
        """
        # random shuffle index
        idx_shuffle = torch.randperm(x.shape[0]).cuda()

        # index for restoring
        idx_unshuffle = torch.argsort(idx_shuffle)

        return x[idx_shuffle], idx_unshuffle

    @torch.no_grad()
    def _batch_unshuffle_single_gpu(self, x, idx_unshuffle):
        """
        Undo batch shuffle.
        """
        return x[idx_unshuffle]

    # Defining contrastive loss function
    def contrastive_loss(self, im_q, im_k):

        # compute query features
        # Feature representations of the query view from the query encoder
        q = self.encoder_q(im_q)  # queries: NxC
        q = nn.functional.normalize(q, dim=1)  # L2-normalize the query features

        # compute key features
        # Get shuffled and reversed indexes for the current minibatch
        with torch.no_grad():  # no gradient to keys
            # shuffle for making use of BN
            im_k_, idx_unshuffle = self._batch_shuffle_single_gpu(im_k)

            k = self.encoder_k(im_k_)  # keys: NxC
            k = nn.functional.normalize(k, dim=1)  # Normalize the feature representations

            # undo shuffle
            k = self._batch_unshuffle_single_gpu(k, idx_unshuffle)

        # With similarity measured by dot product, a form of a contrastive loss function, called InfoNCE is used in the paper
        # Einstein sum is more intuitive
        # It Sums the product of the elements of the input operands along dimensions specified using a notation based on the Einstein summation convention.
        # positive logits: Nx1 (Compute sim between positive views)
        l_pos = torch.einsum('nc,nc->n', [q, k]).unsqueeze(-1)
        # negative logits: NxK (compute similarity between each query and all negatives in the queue)
        l_neg = torch.einsum('nc,ck->nk', [q, self.queue.clone().detach()]) # Get queue from register_buffer (self.queue.clone().detach())

        # logits: Nx(1+K)
        logits = torch.cat([l_pos, l_neg], dim=1)

        # apply temperature
        logits /= self.T

        # labels: positive key indicators
        # For the current mini-batch, we encode the queries and their corresponding keys, which form the positive sample pairs
        labels = torch.zeros(logits.shape[0], dtype=torch.long).cuda()

        # softmax-based classifier that tries to classify q as k+
        # A contrastive loss is a function whose value is low when query(q) is similar to its positive key k+
        loss = nn.CrossEntropyLoss().cuda()(logits, labels)

        return loss, q, k

    # Forward pass
    def forward(self, im1, im2):
        """
        Input:
            im1: a batch of randomly augmented images
            im2: the same batch under another random augmentation
        Output:
            loss
        """

        # update the key encoder
        with torch.no_grad():  # no gradient to keys
            self._momentum_update_key_encoder()

        # compute loss
        if self.symmetric:  # symmetric loss
            loss_12, q1, k2 = self.contrastive_loss(im1, im2)
            loss_21, q2, k1 = self.contrastive_loss(im2, im1)
            loss = loss_12 + loss_21
            k = torch.cat([k1, k2], dim=0)
        else:  # asymmetric loss
            loss, q, k = self.contrastive_loss(im1, im2)

        # Update the queue/memory with the current key_encoder minibatch.
        self._dequeue_and_enqueue(k)

        return loss
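
The two bookkeeping steps of the class above, the momentum update of the key encoder and the queue update, can be sketched without any model at all. The block below is only an illustration in plain Python (lists instead of tensors, so it runs without a GPU); the helper names `momentum_update` and `dequeue_and_enqueue` and the toy values are assumptions made for this example, not part of the notebook's code.

```python
def momentum_update(theta_k, theta_q, m):
    """Return m * theta_k + (1 - m) * theta_q, element by element."""
    return [m * k + (1.0 - m) * q for k, q in zip(theta_k, theta_q)]


def dequeue_and_enqueue(queue, ptr, keys):
    """Overwrite the oldest slots of a fixed-size queue with the new keys.

    The queue is modified in place; the new write pointer is returned.
    """
    queue_size, batch_size = len(queue), len(keys)
    assert queue_size % batch_size == 0  # same simplification as in the class above
    queue[ptr:ptr + batch_size] = keys
    return (ptr + batch_size) % queue_size


# Toy "parameters": the key encoder drifts slowly towards the query encoder.
theta_q = [1.0, 2.0]
theta_k = [0.0, 0.0]
for _ in range(3):
    theta_k = momentum_update(theta_k, theta_q, m=0.5)
print(theta_k)

# Toy queue with 4 slots, refreshed two keys at a time.
queue = ["n0", "n1", "n2", "n3"]
ptr = 0
for batch in (["a", "b"], ["c", "d"], ["e", "f"]):
    ptr = dequeue_and_enqueue(queue, ptr, batch)
print(queue, ptr)
```

With a momentum close to 1 (the notebook uses `m = 0.99`), the key encoder changes far more slowly than in this toy run, which is what keeps the keys stored in the queue consistent with each other.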

### Instantiate the model

In [None]:
moco_dim = 128 # feature dimension
moco_k = 4096  # queue size; number of negative keys
moco_m = 0.99  # moco momentum of updating key encoder
moco_t = 0.1   # softmax temperature
bn_splits = 8  # simulate multi-gpu behavior of BatchNorm in one gpu; 1 is SyncBatchNorm in multi-gpu
symmetric = False # use a symmetric loss function that backprops to both crops

# Create model
model = ModelMoCo(
        dim=moco_dim,
        K=moco_k,
        m=moco_m,
        T=moco_t,
        bn_splits=bn_splits,
        symmetric=symmetric,
    ).cuda()
print(model.encoder_q)
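
To get a feeling for the softmax temperature `moco_t`, the InfoNCE loss for a single query can be computed by hand. The sketch below uses plain Python and made-up similarity values (one positive key and three negatives); the function name `info_nce` and all numbers are assumptions for illustration only.

```python
import math


def info_nce(l_pos, l_neg, temperature):
    """Cross-entropy of softmax([l_pos, *l_neg] / T) with the positive as class 0."""
    logits = [l_pos / temperature] + [l / temperature for l in l_neg]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[0]


# Hypothetical cosine similarities: one positive key and three queued negatives.
l_pos, l_neg = 0.8, [0.1, 0.0, -0.2]

loss_sharp = info_nce(l_pos, l_neg, temperature=0.1)
loss_soft = info_nce(l_pos, l_neg, temperature=1.0)
print(f"T=0.1: {loss_sharp:.4f}   T=1.0: {loss_soft:.4f}")
```

In this toy case the positive key is already the most similar one, so the lower temperature yields the smaller loss; when all similarities are equal, the loss reduces to the log of the number of candidates regardless of the temperature.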

### Define training function



In [None]:
# Train for one epoch
def train(net, data_loader, train_optimizer, epoch):
    # Set the model in training mode
    net.train()
    adjust_learning_rate(train_optimizer, epoch)
    total_loss, total_num, train_bar = 0.0, 0, tqdm(data_loader)

    # load a minibatch x with N samples
    for im_1, im_2 in train_bar:
        # im_1 : randomly augmented image
        # im_2 : another randomly augmented image
        im_1, im_2 = im_1.cuda(non_blocking=True), im_2.cuda(non_blocking=True)

        # Forward pass of the model
        loss = net(im_1, im_2)

        # Zero out the gradients
        train_optimizer.zero_grad()

        # SGD update: query network
        loss.backward()

        # Update the weights
        train_optimizer.step()

        total_num += data_loader.batch_size
        total_loss += loss.item() * data_loader.batch_size
        train_bar.set_description('Train Epoch: [{}/{}], lr: {:.6f}, Loss: {:.4f}'.format(epoch, epochs, train_optimizer.param_groups[0]['lr'], total_loss / total_num))

    return total_loss / total_num

# lr scheduler for training
def adjust_learning_rate(optimizer, epoch):
    """Decay the learning rate based on schedule"""
    lr = 0.06
    if cos:  # cosine lr schedule
        lr *= 0.5 * (1. + math.cos(math.pi * epoch / epochs))
    else:  # stepwise lr schedule
        for milestone in schedule:
            lr *= 0.1 if epoch >= milestone else 1.
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr
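
As a quick sanity check of the cosine branch of the scheduler above, the learning rate can be tabulated for a few epochs. This is a small standalone sketch; the helper name `cosine_lr` is an assumption for this example, while the base learning rate of 0.06 and the 20 epochs are the values used in this notebook.

```python
import math


def cosine_lr(base_lr, epoch, total_epochs):
    """Learning rate at `epoch` under half-cosine decay from base_lr towards 0."""
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))


base_lr, total_epochs = 0.06, 20  # values used in this notebook
for epoch in (0, 5, 10, 15, 19):
    print(f"epoch {epoch:2d}: lr = {cosine_lr(base_lr, epoch, total_epochs):.5f}")
```

The rate starts at the base value, passes through half of it at the midpoint, and approaches zero in the last epoch.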

### Define test function

* After training, the encoder produces feature representations for the images (the pre-trained features), which are transferable to other tasks.

* Using these pre-trained features, we evaluate the encoder on the supervised classification task, in the spirit of transfer learning.

* We first extract the features of the memory set (memory_data_loader) without computing gradients.

* We then loop through the test data and predict each label with a weighted kNN search.


In [None]:
# Test using a knn monitor
# test for one epoch, use weighted knn to find the most similar image's label to assign the test image
# knn_k = k in kNN monitor, knn_t = softmax temperature in kNN monitor; could be different with moco_t
def test(net, memory_data_loader, test_data_loader, epoch, knn_k = 200, knn_t = 0.1):

    # Set the model mode to evaluation
    net.eval()

    # Loading the number of classes from the memory data loader
    classes = len(memory_data_loader.dataset.classes)
    total_top1, total_top5, total_num, feature_bank = 0.0, 0.0, 0, []

    with torch.no_grad(): # do not compute gradients
        # Generate feature bank
        # Loop through the batches of images
        for data, target in tqdm(memory_data_loader, desc='Feature extracting'):
            feature = net(data.cuda(non_blocking=True)) # -> output shape: [512, 128] -> [batch_size, feature dimension]
            feature = F.normalize(feature, dim=1)
            feature_bank.append(feature) # Appending all the features in batches to the feature bank
        # [D, N]
        # contiguous(), actually makes a copy of the tensor such that the order of its elements
        # in memory is the same as if it had been created from scratch with the same data.
        feature_bank = torch.cat(feature_bank, dim=0).t().contiguous() # feature_bank (shape[128, 50000]) -> [feature_dimension, train_images]
        # [N]
        feature_labels = torch.tensor(memory_data_loader.dataset.targets, device=feature_bank.device) # feature_labels (50000)

        # loop test data to predict the label by weighted knn search
        test_bar = tqdm(test_data_loader)
        for data, target in test_bar:
            data, target = data.cuda(non_blocking=True), target.cuda(non_blocking=True)

            # Forward pass
            feature = net(data)

            # Normalize the features
            feature = F.normalize(feature, dim=1)

            # Call the knn_predict function to get kNN predictions on features based on a feature bank
            pred_labels = knn_predict(feature, feature_bank, feature_labels, classes, knn_k, knn_t)

            total_num += data.size(0)
            total_top1 += (pred_labels[:, 0] == target).float().sum().item()
            test_bar.set_description('Test Epoch: [{}/{}] Acc@1:{:.2f}%'.format(epoch, epochs, total_top1 / total_num * 100))

    return total_top1 / total_num * 100

### Define function to run kNN predictions

In [None]:
# knn monitor as in InstDisc https://arxiv.org/abs/1805.01978
# implementation follows http://github.com/zhirongw/lemniscate.pytorch and https://github.com/leftthomas/SimCLR

def knn_predict(feature, feature_bank, feature_labels, classes, knn_k=200, knn_t=0.1):

    """Helper function to run kNN predictions on features based on a feature bank
    Args:
        feature: Tensor of shape [N, D] consisting of N D-dimensional features
        feature_bank: Tensor of a database of features used for kNN
        feature_labels: Labels for the features in our feature_bank
        classes: Number of classes (e.g. 10 for CIFAR-10)
        knn_k: Number of k neighbors used for kNN
        knn_t: Softmax temperature used to re-weight the similarities (default 0.1)
    """
    # compute cos similarity between each feature vector and feature bank ---> [B, N]
    # Performs a matrix multiplication of the matrices input and mat2.
    # If input is a (n×m) tensor, mat2 is a (m×p) tensor, out will be a (n×p) tensor
    sim_matrix = torch.mm(feature, feature_bank)
    # [B, K]
    sim_weight, sim_indices = sim_matrix.topk(k=knn_k, dim=-1)
    # [B, K]
    sim_labels = torch.gather(feature_labels.expand(feature.size(0), -1), dim=-1, index=sim_indices)
    # we do a reweighting of the similarities
    sim_weight = (sim_weight / knn_t).exp()

    # counts for each class
    one_hot_label = torch.zeros(feature.size(0) * knn_k, classes, device=sim_labels.device)
    # [B*K, C]
    one_hot_label = one_hot_label.scatter(dim=-1, index=sim_labels.view(-1, 1), value=1.0)
    # weighted score ---> [B, C]
    pred_scores = torch.sum(one_hot_label.view(feature.size(0), -1, classes) * sim_weight.unsqueeze(dim=-1), dim=1)

    pred_labels = pred_scores.argsort(dim=-1, descending=True)
    return pred_labels
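
The weighted vote performed by `knn_predict` can be followed on a single test feature with a handful of neighbours. The sketch below is plain Python with made-up similarities and labels; the function name `weighted_knn_vote` and the toy data are assumptions for illustration, not the notebook's implementation.

```python
import math
from collections import defaultdict


def weighted_knn_vote(similarities, labels, k, temperature):
    """Predict a label from the k most similar bank entries.

    Each neighbour votes for its own label with weight exp(similarity / T).
    """
    pairs = sorted(zip(similarities, labels), key=lambda p: p[0], reverse=True)
    scores = defaultdict(float)
    for sim, label in pairs[:k]:
        scores[label] += math.exp(sim / temperature)
    return max(scores, key=scores.get), dict(scores)


# Hypothetical cosine similarities of one test feature to five bank features.
similarities = [0.9, 0.5, 0.45, 0.4, 0.1]
labels = [3, 5, 5, 5, 3]

pred_sharp, _ = weighted_knn_vote(similarities, labels, k=4, temperature=0.1)
pred_soft, _ = weighted_knn_vote(similarities, labels, k=4, temperature=1.0)
print(pred_sharp, pred_soft)
```

With the low temperature the single very close neighbour of class 3 outweighs the three moderately close neighbours of class 5; with a temperature of 1.0 the vote is close to a plain majority and class 5 wins.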

### Start training the model

In [None]:
lr = 0.06
epochs = 20  # number of total epochs
wd = 0.0005  # weight decay
cos = True  # use a cosine learning rate schedule
schedule = []  # stepwise learning rate schedule (when to drop lr by 10x); does not take effect if cos is on

# Define optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=wd, momentum=0.9)

**Note:** The model has already been trained for 200 epochs, and its checkpoint (model and optimizer state) has been saved. Here we load that checkpoint and train for only 20 more epochs.

In [None]:
# Function to load the downloaded checkpoint from the given file path
def load_ckp(checkpoint_fpath, model, optimizer):
    checkpoint = torch.load(checkpoint_fpath) # Load the saved or downloaded checkpoint
    model.load_state_dict(checkpoint["state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer"])
    return model, optimizer

In [None]:
# Loading the checkpoint path here
ckp_path = "/content/MoCo_model_checkpoint.pth"

# Call the load checkpoint function by passing the file path
# Basically, you first initialize your model and optimizer and then update the state dictionaries using the load checkpoint function.
model, optimizer = load_ckp(ckp_path, model, optimizer)

In [None]:
!mkdir -p outputs

In [None]:
# logging
results = {'train_loss': [], 'test_accuracy': []}

# Training loop
for epoch in range(epochs):
    # Call train function
    train_loss = train(model, train_loader, optimizer, epoch)
    results['train_loss'].append(train_loss)

    # Call test function
    test_acc = test(model.encoder_q, memory_loader, test_loader, epoch)
    results['test_accuracy'].append(test_acc)

    # Save model
    torch.save({'epoch': epoch, 'state_dict': model.state_dict(), 'optimizer' : optimizer.state_dict(),},  '/content/outputs/model_last.pth')

### Please answer the questions below to complete the experiment:




In [4]:
#@title Q.1. Contrastive learning is a way of building a discrete dictionary on high-dimensional continuous inputs such as images. The dictionary is dynamic in the sense that the keys are randomly sampled, and that the key encoder evolves during training.
Answer1 = "TRUE" #@param ["","TRUE", "FALSE"]


#### Consider the following statements about MoCo and answer Q2.


A. Momentum Contrast (MoCo) trains a visual representation encoder by matching an encoded query q to a dictionary
of encoded keys using a contrastive loss

B. The dictionary is built as a queue, the encoded representations of the current mini-batch are enqueued,
and the oldest are dequeued, decoupling it from the mini-batch size allowing it to be large.

C. The keys are encoded by a slowly progressing key encoder, driven by a momentum-based moving average of the query
encoder, which enables a large and consistent dictionary for learning visual representations.

In [5]:
#@title Q.2. Which of the above statements is/are True for Momentum Contrast (MOCO)?
Answer2 = "A, B and C" #@param ["","Only A", "Only C", "Only A and B", "Only B and C", "Only A and C", "A, B and C"]


In [6]:
#@title How was the experiment? { run: "auto", form-width: "500px", display-mode: "form" }
Complexity = "Good and Challenging for me" #@param ["","Too Simple, I am wasting time", "Good, But Not Challenging for me", "Good and Challenging for me", "Was Tough, but I did it", "Too Difficult for me"]


In [7]:
#@title If it was too easy, what more would you have liked to be added? If it was very difficult, what would you have liked to have been removed? { run: "auto", display-mode: "form" }
Additional = "NA" #@param {type:"string"}


In [8]:
#@title Can you identify the concepts from the lecture which this experiment covered? { run: "auto", vertical-output: true, display-mode: "form" }
Concepts = "Yes" #@param ["","Yes", "No"]


In [9]:
#@title  Text and image description/explanation and code comments within the experiment: { run: "auto", vertical-output: true, display-mode: "form" }
Comments = "Very Useful" #@param ["","Very Useful", "Somewhat Useful", "Not Useful", "Didn't use"]


In [10]:
#@title Mentor Support: { run: "auto", vertical-output: true, display-mode: "form" }
Mentor_support = "Somewhat Useful" #@param ["","Very Useful", "Somewhat Useful", "Not Useful", "Didn't use"]


In [11]:
#@title Run this cell to submit your notebook for grading { vertical-output: true }
try:
  if submission_id:
      return_id = submit_notebook()
      if return_id : submission_id = return_id
  else:
      print("Please complete the setup first.")
except NameError:
  print ("Please complete the setup first.")

Your submission is successful.
Ref Id: 3022
Date of submission:  13 Nov 2023
Time of submission:  12:46:17
View your submissions: https://dlfa-iisc.talentsprint.com/notebook_submissions
