In [None]:
!nvidia-smi

Tue Apr 18 16:54:55 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   65C    P8    11W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# TD 5: Transformers for computer vision
By Nicolas Dufour, Vicky Kalogeiton and Pascal Vannier

In this TD, we will implement the Transformer architecture. Transformers have been a key architecture in deep learning for the past 5 years.

They were first adopted in NLP, then in audio and finally, since 2020, in computer vision.
We will implement every block of a transformer from scratch and try to build a deep understanding of what is happening.
Here is a diagram of the transformer architecture:

<img src="https://www.researchgate.net/profile/Miruna-Gheata/publication/355339249/figure/fig1/AS:1079476452622337@1634378650979/Encoder-decoder-architecture-of-the-Transformer-developed-by-Vaswani-et-al-28.ppm" width=768>

## Instructions
As stated before, in PyTorch you must avoid for loops at all costs. It is almost always possible to find a vectorized version of the operation you want to implement.
In this TP, the only for loop you may write is the training loop.

In [None]:
!pip install einops
!pip install timm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting einops
  Downloading einops-0.6.0-py3-none-any.whl (41 kB)
Installing collected packages: einops
Successfully installed einops-0.6.0
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting timm
  Downloading timm-0.6.13-py3-none-any.whl (549 kB)
Collecting huggingface-hub
  Downloading huggingface_hub-0.13.4-py3-none-any.whl (200 kB)
Installing collected packages: huggingface-hub, timm
Successfully installed huggingface-hub-0.13.4 timm-0.6.13


In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import AdamW
import math
from torchvision.datasets import CIFAR10
from torch.utils.data import DataLoader
from torchvision import transforms
from tqdm import tqdm
import matplotlib.pyplot as plt
import numpy as np
from einops import rearrange, repeat
from PIL import Image
import requests
from io import BytesIO
from torchvision.models.feature_extraction import get_graph_node_names, create_feature_extractor
import timm



## The Transformer model from the paper Attention is All You Need.

### Attention

The transformer architecture is built around one key block: attention.
The idea behind attention is the following. Imagine you want to retrieve information from a dictionary. The dictionary is indexed by keys, each of which maps to a particular value. You have a query that is matched against the keys of the dictionary and, if there is a match, you retrieve the associated value.
Attention is very similar to this simple retrieval example. Real data does not come with this structure, so we are going to learn to create it.

We have 2 sets of vectors (also called tokens), stored as the rows of a matrix. One is $X_{to}$, the destination set. We want to map this set of tokens to queries, which we do with a linear projection of $X_{to}$: $Q = X_{to}W_Q$.

The other set is $X_{from}$, the set from which we want to retrieve information. We need to extract both keys and values from this set, so we apply 2 linear projections to $X_{from}$: $K = X_{from}W_K$ and $V = X_{from}W_V$.

Now, contrary to the dictionary, where a query matches a key exactly, we have no exact matches here. Therefore, we perform a softer match by computing the similarity matrix between $Q$ and $K$. Then, for each query, we want to output the values with the highest similarity. We therefore output the weighted sum of the values, weighted by the softmax of the similarity (also called the attention matrix).

Finally, the attention operation, in its general cross-attention form, is given by:

$$
A(Q,K,V) = SoftMax(\frac{QK^T}{\sqrt{d_k}})V
$$

where $d_k$ is the dimension of the queries and keys. We divide the similarity by $\sqrt{d_k}$ for numerical stability: without it, the dot products grow with the vector dimension, which would lead to very sharp attention coefficients.
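
To make the effect of the scaling concrete, here is a minimal sketch on hypothetical random data (the sizes and the seed are arbitrary choices, not part of the TD). It compares the softmax of raw dot products with the softmax of scaled dot products for a single query.

```python
import math
import torch

torch.manual_seed(0)
d_k = 64
# Hypothetical toy data: one query and five keys with unit-variance entries.
q = torch.randn(1, d_k)
k = torch.randn(5, d_k)

# Raw dot products: their spread grows with d_k.
scores = q @ k.T

attn_raw = torch.softmax(scores, dim=-1)
attn_scaled = torch.softmax(scores / math.sqrt(d_k), dim=-1)

# The scaled version spreads the attention over more keys.
print("largest weight without scaling:", attn_raw.max().item())
print("largest weight with scaling:   ", attn_scaled.max().item())
```

Dividing the scores by a constant larger than 1 can only flatten the softmax, so the largest weight never increases after scaling.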

##### Question 1:
Implement the attention operation.

Tip: Look into `torch.einsum` to easily compute the similarity matrix. An easy-to-understand explanation can be found [here](https://rockt.github.io/2018/04/30/einsum).
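
As an illustration of the tip, the sketch below (tensor sizes are arbitrary assumptions) computes a batched similarity matrix with `torch.einsum` and checks that it matches the equivalent batched matrix product.

```python
import torch

torch.manual_seed(0)
q = torch.randn(2, 3, 8)   # [batch, x_to_len, dim]
k = torch.randn(2, 5, 8)   # [batch, x_from_len, dim]

# 'bik,bjk->bij': for every batch b, dot product of query i with key j.
sim_einsum = torch.einsum('bik,bjk->bij', q, k)

# Same computation written as a batched matrix product Q K^T.
sim_matmul = q @ k.transpose(1, 2)

print(sim_einsum.shape)  # torch.Size([2, 3, 5])
print(torch.allclose(sim_einsum, sim_matmul, atol=1e-6))
```

The repeated index `k` is summed over, while the indices kept on the right of `->` define the output shape.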


In [None]:
class Attention(nn.Module):
    def __init__(self, x_to_dim, x_from_dim, hidden_dim,):
        # To complete
        super().__init__()
        self.w_query = nn.Linear(x_to_dim, hidden_dim)
        self.w_key = nn.Linear(x_from_dim,hidden_dim)
        self.w_value = nn.Linear(x_from_dim, hidden_dim)
        self.hidden_dim = hidden_dim
    def forward(self, x_to, x_from):
        # x_to = [batch size, x_to_len, x_to_dim]
        # x_from = [batch size, x_from_len, x_from_dim]

        # To complete
        query = self.w_query(x_to)
        key = self.w_key(x_from)
        value = self.w_value(x_from)

        A = torch.einsum('bik,bjk->bij', query, key)
        A = A/math.sqrt(self.hidden_dim)
        A = nn.Softmax(dim = -1)(A)
        A = torch.einsum('bik,bkj->bij', A, value)
        return A

In [None]:
x = torch.Tensor([[[0,1,2,3],[1,2,3,4]]])

In [None]:
attention = Attention(4,4,10)

In [None]:
attention(x,x)

tensor([[[-1.3018, -0.3023, -0.1118, -1.0333,  1.2913,  2.4931,  0.4207,
           2.9757, -0.6286, -0.5660],
         [-1.3028, -0.2606, -0.1687, -1.0282,  1.2700,  2.3313,  0.4122,
           2.8274, -0.5908, -0.4989]]], grad_fn=<ViewBackward0>)

#### Multi-head attention

We improve the above attention implementation by introducing multi-head attention. The idea is to compute the attention on subspaces of the $Q,K,V$ triplets.
We split each vector into n sub-vectors (one per head) and compute the attention for each of them. At the end, we concatenate all the attention outputs and apply an output projection.
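
The split and the concatenation are pure reshapes. The following sketch, with arbitrary toy sizes chosen for illustration, shows one way to split the last dimension into heads and to merge the heads back without losing information.

```python
import torch

batch, seq_len, hidden_dim, n_heads = 2, 4, 12, 3
head_dim = hidden_dim // n_heads

x = torch.arange(batch * seq_len * hidden_dim, dtype=torch.float32)
x = x.view(batch, seq_len, hidden_dim)

# Split: [batch, seq, hidden] -> [batch, heads, seq, head_dim]
heads = x.view(batch, seq_len, n_heads, head_dim).transpose(1, 2)

# Merge: [batch, heads, seq, head_dim] -> [batch, seq, hidden]
merged = heads.transpose(1, 2).reshape(batch, seq_len, hidden_dim)

print(heads.shape)             # torch.Size([2, 3, 4, 4])
print(torch.equal(merged, x))  # True
```

Head number 1 of the first token is simply the slice `x[0, 0, 4:8]`: each head sees a contiguous chunk of the projected vector.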

##### Question 2
Implement Multihead attention.

In [None]:
class MultiHeadAttention(nn.Module):
    def __init__(self, x_to_dim, x_from_dim, hidden_dim, n_heads):
        super().__init__()
        self.x_to_dim = x_to_dim
        self.x_from_dim = x_from_dim
        self.hidden_dim = hidden_dim
        self.n_heads = n_heads
        self.he_dim = hidden_dim // n_heads

        self.w_query = nn.Linear(x_to_dim, hidden_dim)
        self.w_key = nn.Linear(x_from_dim, hidden_dim)
        self.w_value = nn.Linear(x_from_dim, hidden_dim)
        self.linear = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, x_to, x_from):
        # x_to = [batch size, x_to_len, x_to_dim]
        # x_from = [batch size, x_from_len, x_from_dim]

        batch_size = x_to.shape[0]
        q = self.w_query(x_to)
        k = self.w_key(x_from)
        v = self.w_value(x_from)

        # Split the projected vectors into heads: [batch, n_heads, len, he_dim]
        q = q.view(batch_size, -1, self.n_heads, self.he_dim).transpose(1, 2)
        k = k.view(batch_size, -1, self.n_heads, self.he_dim).transpose(1, 2)
        v = v.view(batch_size, -1, self.n_heads, self.he_dim).transpose(1, 2)

        # Per-head scaled similarity: [batch, n_heads, x_to_len, x_from_len]
        A = torch.einsum('bqik,bqjk->bqij', q, k) / math.sqrt(self.he_dim)
        A = nn.Softmax(dim=-1)(A)
        # Weighted sum of the values: contract over the x_from positions (last axis of A)
        A = torch.einsum('bqik,bqkj->bqij', A, v)
        # Concatenate the heads and apply the output projection
        A = A.transpose(1, 2).reshape(batch_size, -1, self.hidden_dim)
        A = self.linear(A)
        return A


In [None]:
attention = MultiHeadAttention(4,4,10,2)

In [None]:
attention(x, x)

tensor([[[ 0.9453,  0.1184,  0.8194, -1.0045,  0.1303, -1.6511, -0.6159,
          -0.3407, -0.9195,  0.2212],
         [ 1.0536, -0.2872,  0.7614, -1.4821, -0.1132, -1.9304, -0.9134,
          -0.1743, -0.7936,  0.4989]]], grad_fn=<ViewBackward0>)

Multi-head attention is the attention used in transformers in practice. It comes in 2 flavors:
- Self Attention: $X_{to}$ attends to itself ($X_{to}=X_{from}$)
- Cross Attention: $X_{to}\neq X_{from}$


##### Question 3: Implement MultiHead Self Attention and MultiHeadCrossAttention from Multihead attention

In [None]:
class MultiheadSelfAttention(nn.Module):
  def __init__(self, x_to_dim, hidden_dim, n_heads):
    super().__init__()
    self.MHSAttention = MultiHeadAttention(x_to_dim, x_to_dim,hidden_dim,n_heads)

  def forward(self, x_to):
    return self.MHSAttention(x_to,x_to)

class MultiheadCrossAttention(nn.Module):
  def __init__(self, x_to_dim,x_from_dim, hidden_dim, n_heads):
    super().__init__()
    self.MHCAttention = MultiHeadAttention(x_to_dim,x_from_dim, hidden_dim, n_heads)

  def forward(self,x_to,x_from):
    return self.MHCAttention(x_to, x_from)

In [None]:
mhsa = MultiheadSelfAttention(4,10,2)

In [None]:
mhsa(x)

tensor([[[ 0.3547, -0.5814, -0.4119,  0.2062, -0.1709,  0.3415,  0.4629,
           0.0819, -0.1023,  0.1509],
         [ 0.7639, -1.7763, -0.5130,  0.1048, -0.0878,  0.6728,  1.6058,
           0.9961, -0.6688,  0.1572]]], grad_fn=<ViewBackward0>)

### LayerNorm
Another key component of the transformer is the LayerNorm. As we have previously seen, normalizing the output of a deep learning layer helps a lot with convergence and stability.
Before Transformers, the most used normalization was BatchNorm, which normalizes the data along the batch dimension. However, this has a few problems:
- The normalization depends on the other samples in the batch.
- When using multiple GPUs, BatchNorm needs to synchronize the batch statistics across GPUs, which blocks the forward pass and slows down training.

The last point is the most important one. The Transformer aims to be an architecture that is easy to parallelize and cannot afford to use BatchNorm.

Instead, Transformers use LayerNorm. LayerNorm is computed independently for each sample, which removes the synchronization issue. We normalize over the channel dimension instead of the batch dimension.

<img src="https://production-media.paperswithcode.com/methods/Screen_Shot_2020-05-19_at_4.24.42_PM.png">

To compensate for the loss of capacity, we apply an affine transformation with a learned scale and bias to the normalized output.
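
The normalization itself can be written in a few lines. The sketch below uses hypothetical toy data and follows PyTorch's convention of dividing by $\sqrt{\sigma^2 + \epsilon}$ (other implementations divide by $\sigma + \epsilon$, which differs only slightly); it checks the result against `torch.nn.functional.layer_norm` without the learned scale and bias.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(2, 3, 8) * 5 + 2   # [batch, tokens, channels]
eps = 1e-5

# Statistics are computed per token, over the channel dimension only.
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
x_norm = (x - mean) / torch.sqrt(var + eps)

# Reference implementation, without learned scale and bias.
reference = F.layer_norm(x, normalized_shape=(8,), eps=eps)

print(torch.allclose(x_norm, reference, atol=1e-5))
print(x_norm.mean(dim=-1).abs().max().item())  # close to 0
```

No statistic is shared between samples, so the result for one sample does not change when the rest of the batch changes.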

##### Question 4:
Implement the LayerNorm

In [None]:
class LayerNorm(nn.Module):
    def __init__(self, size, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.w = torch.nn.Parameter(torch.ones(size))
        self.b = torch.nn.Parameter(torch.zeros(size))

    def forward(self, y):
        x=y
        mean = x.mean(dim=-1, keepdim=True)
        std = x.std(dim=-1, keepdim=True, unbiased=False)
        x = (x - mean) / (std + self.eps)
        x = self.w * x + self.b
        return x

### Feed-Forward Network

Finally, the last block is a feed-forward network with one hidden layer. This hidden layer usually has a size of $2 * input\_dim$. It is followed by an activation function and a dropout layer, then by a linear layer that projects back to the input dimension. Here, we will use leaky ReLU, with a leak parameter of 0.1.
##### Question 5: Implement the FFN layer

In [None]:
class FFN(nn.Sequential):
    def __init__(self, hidden_dim, dropout_rate=0.1, expansion_factor=2):
        super().__init__(
            nn.Linear(hidden_dim, hidden_dim * expansion_factor),
            # The non-linearity sits between the two linear layers;
            # otherwise they would collapse into a single linear map
            nn.LeakyReLU(0.1),
            nn.Dropout(dropout_rate),
            nn.Linear(hidden_dim*expansion_factor, hidden_dim)
        )

### The Transformer block

The last thing we are missing is the skip connection. Like ResNet, the transformer architecture uses skip connections. They allow for a better gradient flow and help avoid vanishing gradients.
There is a skip connection around the attention and another around the feed-forward network.

##### Question 6.
Looking at the transformer figure, implement the Transformer Encoder Block


In [None]:
class TransformerEncoderBlock(nn.Module):
    def __init__(self, data_dim, hidden_dim, n_heads, dropout_rate=0.1):
        super().__init__()
        # The skip connections require data_dim == hidden_dim
        self.h_dim = hidden_dim
        self.dropout_rate = dropout_rate
        self.MHSAttention = MultiheadSelfAttention(data_dim, hidden_dim, n_heads)
        # One LayerNorm per sub-layer so that each has its own scale and bias
        self.norm1 = LayerNorm(hidden_dim)
        self.norm2 = LayerNorm(hidden_dim)
        self.ffn = FFN(hidden_dim, dropout_rate)

    def forward(self, x):
        # x = [batch size, x_len, hidden dim]
        # Self-attention sub-layer: skip connection, then normalization
        x = self.norm1(x + self.MHSAttention(x))
        # Feed-forward sub-layer: skip connection, then normalization
        x = self.norm2(x + self.ffn(x))
        return x

### Positional embedding
The transformer architecture is permutation equivariant: if we swap 2 input tokens, the corresponding outputs are swapped too but are otherwise unchanged, so the model has no notion of token order. However, the position of a token can be very important information. Think of an image: if a pixel is close to another pixel, we want the transformer to be able to capture this information, which is not the case for now.
That's why we introduce positional encodings. For each token, we add the positional encoding to the original token:

$$
X_i = X_i + PE(i)
$$

with $X_i$ the token at position $i$.

The most used positional encodings are sinusoidal encodings. They are defined as follows:

$$
PE(i, 2j) = \sin(i / 10000^{\frac{2j}{d}}) \\
PE(i, 2j + 1) = \cos(i / 10000^{\frac{2j}{d}})
$$

where $d$ is the dimension of the tokens, $i$ is the position of the token in the sequence, and $2j$ (resp. $2j + 1$) is the index of the dimension within the vector.
The idea is that we add sinusoids of different frequencies, which together encode the position across the dimensions of the token.

Another common positional encoding is the learned positional encoding. We simply let the network learn a tensor $PE$ that matches the sequence length and the dimension of the tokens.
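
To check the sinusoidal formula on concrete numbers, here is a small sketch that builds the table $PE(i, \cdot)$ for a short sequence. The helper name `sinusoidal_table` and the sizes are hypothetical, and the token dimension is assumed to be even.

```python
import math
import torch

def sinusoidal_table(seq_len, d):
    """Return a [seq_len, d] table of sinusoidal encodings (d assumed even)."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # i
    two_j = torch.arange(0, d, 2, dtype=torch.float32)                  # 2j
    # 1 / 10000^(2j / d), computed in log space
    inv_freq = torch.exp(-math.log(10000.0) * two_j / d)
    pe = torch.zeros(seq_len, d)
    pe[:, 0::2] = torch.sin(position * inv_freq)  # even dimensions
    pe[:, 1::2] = torch.cos(position * inv_freq)  # odd dimensions
    return pe

pe = sinusoidal_table(seq_len=6, d=8)
print(pe.shape)   # torch.Size([6, 8])
print(pe[0])      # position 0: sin terms are 0, cos terms are 1
```

With $d = 8$, the entry $PE(3, 2)$ is $\sin(3 / 10000^{2/8}) = \sin(0.3)$, which can be compared with `pe[3, 2]`.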

##### Question 7.

Implement both Sinusoidal and Learned positional embeddings

In [None]:
class SinusoidalPositionalEncoding(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.hidden_dim = hidden_dim

    def forward(self, x):
        # x = [batch size, seq_len, hidden_dim]; hidden_dim is assumed to be even
        batch_size, seq_len, h_dim = x.size()
        position = torch.arange(0, seq_len).unsqueeze(-1)
        div_term = torch.exp(torch.arange(0, self.hidden_dim, 2).float() * (-math.log(10000.0) / self.hidden_dim))
        pe = torch.zeros(1, seq_len, self.hidden_dim)
        pe[0, :, 0::2] = torch.sin(position * div_term)
        pe[0, :, 1::2] = torch.cos(position * div_term)
        # Move the encoding to the device of the input rather than relying on a global `device`
        return x + pe.to(x.device)

class LearnedPositionalEncoding(nn.Module):
    def __init__(self, hidden_dim, max_len):
        # To complete
        super().__init__()
        self.param = nn.Parameter(torch.zeros((max_len,hidden_dim)))
        self.hidden_dim = hidden_dim

    def forward(self, x):
      bs, len, h_dim = x.size()
      lpe = self.param[:len, :].unsqueeze(0)
      return x + lpe




In [None]:
x = torch.arange(6)
x

tensor([0, 1, 2, 3, 4, 5])

In [None]:
torch.zeros(1, 4,5)[0,:,1::2]

tensor([[0., 0.],
        [0., 0.],
        [0., 0.],
        [0., 0.]])

### The transformer encoder
Now you have everything you need to implement the transformer encoder. Add the positional encoding to the tokens, then stack N transformer encoder layers.

##### Question 8.
Implement the transformer encoder with n_layers and the ability to choose both positional embeddings.

Tip: Look into `ModuleList`
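
The reason for the tip is that sub-modules stored in a plain Python list are not registered, so their parameters are invisible to the optimizer and to `.to(device)`. The two toy classes below are hypothetical and only illustrate the difference.

```python
import torch.nn as nn

class PlainList(nn.Module):
    def __init__(self):
        super().__init__()
        # A plain Python list: the layers are NOT registered as sub-modules.
        self.layers = [nn.Linear(4, 4) for _ in range(3)]

class WithModuleList(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.ModuleList registers every layer as a sub-module.
        self.layers = nn.ModuleList([nn.Linear(4, 4) for _ in range(3)])

n_plain = sum(p.numel() for p in PlainList().parameters())
n_registered = sum(p.numel() for p in WithModuleList().parameters())
print(n_plain, n_registered)  # 0 60
```

Each `nn.Linear(4, 4)` holds 16 weights and 4 biases, hence 60 parameters for three registered layers.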

In [None]:
class TransformerEncoder(nn.Module):
    def __init__(self, data_dim,  hidden_dim, n_heads, n_layers, dropout_rate=0.1, positional_encoding="sinusoidal", max_len=1000):
        # To complete
        super().__init__()

        if positional_encoding == "sinusoidal":
          self.pe = SinusoidalPositionalEncoding(data_dim)
        else :
          self.pe = LearnedPositionalEncoding(data_dim, max_len)

        self.transformers = nn.ModuleList([TransformerEncoderBlock(data_dim if i == 0 else hidden_dim ,hidden_dim, n_heads, dropout_rate = dropout_rate)
            for i in range(n_layers)])


    def forward(self, y):
        # To complete
        x = self.pe(y)
        for transformer_block in self.transformers:
          x = transformer_block(x)
        return x


## The Vision Transformer
The above architecture was introduced in 2017 to process sequences of text tokens. However, it is also useful to leverage this architecture for computer vision. Unlike convolutional neural networks, the transformer has the advantage of introducing less inductive bias.

This can help improve vision systems: if we learn the biases from the data, we can hope for better performance. However, doing so requires compute and a lot of data.

To apply the transformer to images, one key question remains: how do we transform an image into tokens? The approach introduced in Vision Transformers is to cut the image into patches, each of which is transformed into a token through a linear projection.

We also add an extra token, known as the classification token, which is the token used for prediction. After going through the N transformer layers, this is the token that goes through a multi-layer perceptron.


<img src= "https://1.bp.blogspot.com/-_mnVfmzvJWc/X8gMzhZ7SkI/AAAAAAAAG24/8gW2AHEoqUQrBwOqjhYB37A7OOjNyKuNgCLcBGAsYHQ/s1600/image1.gif" width="512">


##### Question 9

Implement the vision transformer

Hint: Use Conv2D with the right kernel size and stride to do the linear projection of non-overlapping patches.
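
The hint works because a convolution whose kernel size equals its stride applies the same linear map to every non-overlapping patch. The sketch below, with arbitrary sizes chosen for illustration, checks this against an explicit patch extraction with `torch.nn.functional.unfold`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
patch_size, hidden_dim = 4, 16
img = torch.randn(2, 3, 32, 32)

# Convolution with kernel size = stride = patch size
conv = nn.Conv2d(3, hidden_dim, kernel_size=patch_size, stride=patch_size)
tokens_conv = conv(img).flatten(2).transpose(1, 2)        # [2, 64, 16]

# Explicit version: extract the patches, flatten them, project them linearly
patches = F.unfold(img, kernel_size=patch_size, stride=patch_size)  # [2, 48, 64]
patches = patches.transpose(1, 2)                                   # [2, 64, 48]
weight = conv.weight.view(hidden_dim, -1)                           # [16, 48]
tokens_linear = patches @ weight.T + conv.bias

print(tokens_conv.shape)  # torch.Size([2, 64, 16])
print(torch.allclose(tokens_conv, tokens_linear, atol=1e-5))
```

A 32x32 image with 4x4 patches gives 8x8 = 64 tokens, each of dimension `hidden_dim`.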

In [None]:
class ViT(nn.Module):
    def __init__(self, patch_size, hidden_dim, n_heads, n_layers, n_classes, dropout_rate=0.1, positional_encoding="sinusoidal", max_len=1000):
        super().__init__()
        # Conv2d with kernel size = stride = patch size: linear projection of non-overlapping patches
        self.projection = nn.Conv2d(3, hidden_dim, kernel_size=patch_size, stride=patch_size)
        # After the projection, every token already has dimension hidden_dim
        self.transformerEncoder = TransformerEncoder(hidden_dim, hidden_dim, n_heads, n_layers, dropout_rate, positional_encoding, max_len)
        self.norm = nn.LayerNorm(hidden_dim)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim,n_classes))

    def forward(self, y):
        # y = [batch size, 3, image height, image width]
        x = self.projection(y)   # [batch size, hidden_dim, H / patch, W / patch]
        x = x.flatten(2)         # [batch size, hidden_dim, n_patches]
        x = x.permute(0, 2, 1)   # [batch size, n_patches, hidden_dim]
        x = self.transformerEncoder(x)
        # Note: this version averages all tokens instead of using a classification token
        x = x.mean(dim=1)
        x = self.norm(x)
        x = self.mlp(x)
        return x

### Compact Convolutional Transformer
The previous network needs a lot of compute and data to be trained. As mentioned before, the transformer removes the inductive bias of convnets, so it requires more data to train.
For this TP, we will train a hybrid architecture that preserves the inductive biases of convolution while using the transformer to add global learning.

The first change is the tokenizer. We replace it with a ConvNet. Each ConvNet layer consists of a convolution, a ReLU and a max pooling.
The second change is to remove the classification token and classify on top of a pooling of all tokens. The pooling is done with an attention-like mechanism:
- For each token of a sample, we predict a scalar, then compute the softmax of these scalars over all the tokens of the sample.
- We then compute a weighted average of the tokens, using these softmax values as weights.

For more details, see: https://arxiv.org/abs/2104.05704

<img src= https://miro.medium.com/v2/resize:fit:720/format:webp/1*8diH01Fl7MhHRemLy9hUHw.png width=512>

##### Question 10
Implement the convolution-based tokenizer and the SeqPool operation.

In [None]:
class ConvPatchEmbedding(nn.Module):
    def __init__(self, n_layers, kernel_size, hidden_dim):
        # To complete
        super().__init__()
        initial_layer = nn.Sequential(nn.Conv2d(3, hidden_dim, kernel_size, stride= 1, padding = kernel_size//2),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size))
        list_layers = [initial_layer] + [nn.Sequential(
                nn.Conv2d(hidden_dim, hidden_dim, kernel_size, stride= 1, padding = kernel_size//2),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size)
            )
            for i in range(n_layers-1)]
        self.conv_nets = nn.ModuleList(list_layers)

    def forward(self,x):
      for conv in self.conv_nets:
            x = conv(x)
      x = x.flatten(2).transpose(1, 2)
      return x


class SeqPool(nn.Module):
    def __init__(self, hidden_dim):
        # To complete
        super().__init__()
        self.hidden_dim = hidden_dim
        self.linear = nn.Linear(hidden_dim,1)

    def forward(self, x):
      scalar = self.linear(x).squeeze(-1)
      w = nn.functional.softmax(scalar, dim=1)
      pooled = torch.einsum('bsh,bs->bh', x, w)
      return pooled



##### Question 11

Implement the Compact Convolutional Transformer.

In [None]:
class CCT(nn.Module):
    def __init__(self, n_conv_layers, kernel_size,  n_transformer_layers, hidden_dim, n_heads, n_classes, dropout_rate=0.1):
      super().__init__()
      self.conv = ConvPatchEmbedding(n_conv_layers, kernel_size, hidden_dim)
      self.transformerEncoder = TransformerEncoder(hidden_dim ,  hidden_dim , n_heads, n_transformer_layers, dropout_rate)
      self.seqPool = SeqPool(hidden_dim)
      self.mlpHead = nn.Sequential(
          nn.LayerNorm(hidden_dim),
          nn.Linear(hidden_dim,hidden_dim),
          nn.ReLU(),
          nn.Linear(hidden_dim, n_classes)
      )

    def forward(self, y):
      x = self.conv(y)
      x = self.transformerEncoder(x)
      x = self.seqPool(x)
      x = self.mlpHead(x)
      return x


##### Question 12
Train the CCT on CIFAR-10 for 300 epochs and log both train and test loss and accuracy. You should reach at least 80% test accuracy (90%+ is possible).
We provide a data augmentation strategy called AutoAugment to avoid overfitting on the training data.
Hyperparameters are left to your discretion.

Tips for Hparams:
- Don't use too big of a transformer hidden dim (<256)
- For the convnet, aim to have between 32 and 128 output tokens.
- Use AdamW with some weight decay to avoid overfitting
- Use between 2 and 6 transformer layers.
- Use between 2 and 4 transformer heads
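
To check the tip about the number of output tokens before launching a long training run, one option is to push a dummy image through a stack of convolution and pooling layers and count the resulting positions. The helper `count_tokens` below is a hypothetical sketch: it assumes stride-1 convolutions with padding `conv_kernel // 2` followed by max pooling, and 32x32 inputs as in CIFAR-10.

```python
import torch
import torch.nn as nn

def count_tokens(n_layers, conv_kernel, pool_kernel, image_size=32):
    """Number of tokens produced by n_layers of (conv, max-pool) on a square image."""
    layers = []
    channels = 3
    for _ in range(n_layers):
        layers.append(nn.Conv2d(channels, 8, conv_kernel, stride=1, padding=conv_kernel // 2))
        layers.append(nn.MaxPool2d(pool_kernel))
        channels = 8  # arbitrary width, it does not affect the token count
    with torch.no_grad():
        out = nn.Sequential(*layers)(torch.zeros(1, 3, image_size, image_size))
    return out.shape[-2] * out.shape[-1]

print(count_tokens(n_layers=2, conv_kernel=3, pool_kernel=2))  # 64
print(count_tokens(n_layers=1, conv_kernel=3, pool_kernel=2))  # 256
```

With two layers the 32x32 image is reduced to an 8x8 grid (64 tokens), which is within the suggested range; a single layer leaves a 16x16 grid (256 tokens), which is above it.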

Training takes around 30 min (depending on the hyperparameters), so keep working on the next questions while it trains. You can copy the notebook and run it in a separate Colab instance to execute the code of the next questions.

In [None]:
batch_size = 128
train_set = CIFAR10(root='./data', train=True, download=True, transform=transforms.Compose([
    transforms.autoaugment.AutoAugment(policy=transforms.AutoAugmentPolicy.CIFAR10),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616))
]))

train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True, num_workers=4)

test_set = CIFAR10(root='./data', train=False, download=True, transform=transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616))
]))

test_loader = DataLoader(test_set, batch_size=batch_size, shuffle=False, num_workers=4)


Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz


100%|██████████| 170498071/170498071 [00:03<00:00, 45027005.11it/s]


Extracting ./data/cifar-10-python.tar.gz to ./data




Files already downloaded and verified


In [None]:
for i,test in enumerate(test_loader):
  print(test[0][1])
  break

tensor([[[ 1.7416,  1.6781,  1.6939,  ...,  1.7098,  1.7098,  1.6939],
         [ 1.7892,  1.7416,  1.7416,  ...,  1.7575,  1.7575,  1.7416],
         [ 1.7733,  1.7257,  1.7257,  ...,  1.7416,  1.7416,  1.7257],
         ...,
         [-0.6082, -1.3068, -1.6878,  ...,  0.6937,  0.9001,  0.9954],
         [-0.6876, -1.2591, -1.4179,  ...,  0.7731,  0.9477,  0.9795],
         [-0.6399, -1.0051, -1.0686,  ...,  0.6778,  0.8683,  0.9636]],

        [[ 1.8044,  1.7400,  1.7561,  ...,  1.7722,  1.7722,  1.7561],
         [ 1.8527,  1.8044,  1.8044,  ...,  1.8205,  1.8205,  1.8044],
         [ 1.8366,  1.7883,  1.7883,  ...,  1.8044,  1.8044,  1.7883],
         ...,
         [-0.3859, -1.1589, -1.6099,  ...,  0.9830,  1.1924,  1.2729],
         [-0.4342, -1.0623, -1.2717,  ...,  1.0636,  1.2407,  1.2729],
         [-0.3537, -0.7724, -0.9013,  ...,  0.9669,  1.1602,  1.2407]],

        [[ 1.8160,  1.7560,  1.7710,  ...,  1.7860,  1.7860,  1.7710],
         [ 1.8610,  1.8160,  1.8160,  ...,  1

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


In [None]:
def train(model, device, dataloader, epoch, rate = 1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=rate)
    loss_fn = nn.CrossEntropyLoss()
    train_losses = []
    model.train()  # make sure dropout is active during training
    for t in tqdm(range(epoch)):
        for i, (input_data, target) in enumerate(dataloader):
            input_data, target = input_data.to(device), target.to(device)
            y_pred = model(input_data)
            loss = loss_fn(y_pred, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # Only the loss of the last mini-batch of each epoch is recorded
        train_losses.append(loss.detach().item())

    return train_losses

In [None]:
cct = CCT(2, 2 , 2, 64, 2, 10, dropout_rate=0.1).to(device)

In [None]:
train(cct, device, train_loader, epoch = 60)

100%|██████████| 60/60 [32:03<00:00, 32.06s/it]


[1.9476810693740845,
 1.8285179138183594,
 1.7229106426239014,
 1.553041696548462,
 1.5502225160598755,
 1.5303542613983154,
 1.5699392557144165,
 1.5757817029953003,
 1.5357153415679932,
 1.5646289587020874,
 1.4238452911376953,
 1.5295069217681885,
 1.439213514328003,
 1.1810235977172852,
 1.2987089157104492,
 1.519635796546936,
 1.4135290384292603,
 1.2578595876693726,
 1.5348069667816162,
 1.4834283590316772,
 1.6704965829849243,
 1.3717098236083984,
 1.5442084074020386,
 1.3372304439544678,
 1.2268873453140259,
 1.62014639377594,
 1.2977235317230225,
 1.4628371000289917,
 1.3077534437179565,
 1.3193362951278687,
 1.4504109621047974,
 1.0950582027435303,
 1.2305861711502075,
 1.3158506155014038,
 1.340714931488037,
 1.1126363277435303,
 1.3511735200881958,
 1.1501401662826538,
 1.1959095001220703,
 1.3208059072494507,
 1.1327016353607178,
 1.2234971523284912,
 1.3177011013031006,
 1.2000446319580078,
 1.3073893785476685,
 1.0386688709259033,
 1.0902806520462036,
 1.1748830080032349

In [None]:
def success_rate(model, dataloader, batch_size = batch_size):
  # Fraction of correctly classified samples (batch_size is unused, kept for compatibility)
  model.eval()  # disable dropout during evaluation
  nb = 0
  success = 0
  with torch.no_grad():
    for inputs, target in dataloader:
      inputs, target = inputs.to(device), target.to(device)
      pred = torch.argmax(model(inputs), dim = -1)
      # Vectorized comparison instead of a Python loop over the samples
      success += (pred == target).sum().item()
      nb += target.shape[0]
  return success / nb

In [None]:
success_rate(cct,test_loader)

0.6522

Trained for 60 epochs in about 30 minutes on a Tesla T4 GPU. Test accuracy: 65.22%.

## What is my transformer doing? Visualizing the attention matrices
Transformers offer a great tool for visualization. Indeed, we can look at the attention matrices to see what our attention blocks are looking at, that is, which parts of the data the transformer pays attention to. This can be very useful to identify biases the network has been focusing on.

Imagine you want to classify dogs and cats, but in your training data dogs always have a red collar. When you use your classifier on a cat with a red collar, it classifies it as a dog! Looking at the attention matrix, you can see that the transformer only looks at the cat's collar and does not pay attention to the cat itself. You have just identified a bias in your data! You can now fix it by collecting data of dogs without collars or of cats with red collars.


The idea here is to look at the attention matrices. We will look into a pretrained ViT called DINO. DINO has been trained with a self-supervised procedure. We will visualize the attention matrices of this network for some images.

We will use the timm library, which implements many models with pretrained weights.

First, let's list all the models that have been trained with the DINO procedure:

In [None]:
timm.list_models('*vit*dino*')

['vit_base_patch8_224_dino',
 'vit_base_patch16_224_dino',
 'vit_small_patch8_224_dino',
 'vit_small_patch16_224_dino']

In [None]:
dino = timm.create_model('vit_base_patch8_224_dino', pretrained=True, img_size=480).eval()

Downloading: "https://dl.fbaipublicfiles.com/dino/dino_vitbase8_pretrain/dino_vitbase8_pretrain.pth" to /root/.cache/torch/hub/checkpoints/dino_vitbase8_pretrain.pth


We will use torchvision to extract the attention matrices. Look at this tutorial on how to extract a given node from the computational graph of a model: https://pytorch.org/vision/stable/feature_extraction.html

##### Question 13
First, for each block, find the name of the node that corresponds to the attention matrix.
To guide you, you can look at the timm implementation of ViT.

https://github.com/huggingface/pytorch-image-models/blob/7501972cd61dde7428164041b0a6dd8fea60c4d4/timm/models/vision_transformer.py

In [None]:
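# Assumption: with the non-fused attention of timm 0.6.x, the attention matrix of block i
# is the output of the softmax node; check the exact names with get_graph_node_names(dino).
nodes = [f"blocks.{i}.attn.softmax" for i in range(len(dino.blocks))]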

##### Question 14
Create the feature extractor that outputs all the attention matrices

In [None]:
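def get_attention_maps(x):
    # Assumption: with the non-fused attention of timm 0.6.x, the post-softmax
    # attention of block i is the graph node 'blocks.{i}.attn.softmax'.
    # Check the names with get_graph_node_names(dino) if your version differs.
    return_nodes = [f"blocks.{i}.attn.softmax" for i in range(len(dino.blocks))]
    extractor = create_feature_extractor(dino, return_nodes=return_nodes)

    with torch.no_grad():
        outputs = extractor(x)

    # One tensor per block, each of shape [batch, n_heads, n_tokens, n_tokens]
    return [outputs[name] for name in return_nodes]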

##### Question 15
Now, find some images online and visualize the attention matrices. Look for images with multiple objects.
We will visualize the attention of the class token to all the other tokens. Make sure to reshape these attention vectors so that they have an image format. For every block, plot the attention matrix of each head in a single row. Comment. Also plot the last layer's attention superimposed on the real images.

We provide the code to process the image from an image url.

Tip: For easy reshaping of tensor, look into the `einops` library.

Tip 2: To better visualize the post softmax attention, you can clamp the values and renormalize. Otherwise, a single token can have too much attention and will not allow to visualize the rest of the tokens.
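
One possible way to apply both tips is sketched below on a random stand-in for an attention tensor; it is not the expected solution. The sizes are hypothetical and kept small (a 6x6 patch grid with 2 heads, whereas a 480-pixel image with 8-pixel patches would give a 60x60 grid), and the choice of the 95th percentile as clamping threshold is an arbitrary assumption.

```python
import torch

torch.manual_seed(0)
n_heads, grid = 2, 6
n_tokens = 1 + grid * grid   # class token + patch tokens

# Stand-in for one block's post-softmax attention: [batch, heads, tokens, tokens]
attn = torch.softmax(torch.randn(1, n_heads, n_tokens, n_tokens), dim=-1)

# Attention of the class token (row 0) to every patch token (columns 1:)
cls_attn = attn[0, :, 0, 1:]                        # [heads, grid * grid]
cls_attn = cls_attn.reshape(n_heads, grid, grid)    # image-like layout

# Clamp each head at its 95th percentile so that one token cannot dominate
upper = torch.quantile(cls_attn.flatten(1), 0.95, dim=1).view(-1, 1, 1)
clipped = torch.minimum(cls_attn, upper)

# Renormalize each head to [0, 1] for display
low = clipped.amin(dim=(1, 2), keepdim=True)
high = clipped.amax(dim=(1, 2), keepdim=True)
vis = (clipped - low) / (high - low + 1e-8)

print(vis.shape)  # torch.Size([2, 6, 6])
```

Each `vis[h]` can then be passed to `plt.imshow`, or upsampled to the image size before being overlaid on the picture.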

In [None]:

def get_img_from_url(url):
    response = requests.get(url)
    img = Image.open(BytesIO(response.content))
    img = img.convert('RGB')
    img = img.resize((480, 480))
    img = transforms.ToTensor()(img)
    img = transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))(img)
    img = img.unsqueeze(0)
    return img

In [None]:
# To complete

In [None]:
#def plot_attn_matrix(attn_matrices):
    # To complete

#plot_attn_matrix(attn_matrices)


In [None]:
# Overlay the attention matrix on the image (only the last block)
#def plot_attn_matrix_with_image(attn_matrices, img):
    # To complete

#plot_attn_matrix_with_image(attn_matrices, img)
