<a href="https://colab.research.google.com/github/mail4apz/deep-learning-with-python-notebooks/blob/master/m1_10_Introduction_to_ViT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1 style="font-size:30px;">Vision Transformers (ViT)</h1>

In the previous section of the module, we revisited the basics by going through PyTorch and CNN classification tasks. Now it's time to gradually shift gears and move into advanced computer vision techniques and practices.

Until 2020, Convolutional Neural Networks (CNNs) were the dominant architecture in computer vision, but they had limitations in capturing long-range dependencies and adapting to other domains. However, the significant strides made in Natural Language Processing (NLP) by **Transformers**, introduced by Vaswani et al. and popularized by models like BERT and GPT, paved the way for their application in computer vision.


The vision community sought to make use of the Transformer's ability to capture long-range dependencies in visual scenes, resulting in the Vision Transformer (ViT) architecture. The original paper titled **"[An Image is Worth 16x16 Words: Transformers For Image Recognition At Scale](https://arxiv.org/abs/2010.11929v2)"** by Dosovitskiy et al. from the Google Brain team explored the potential of applying Transformers to image classification tasks.


<img src = "https://learnopencv.com/wp-content/uploads/2023/02/image-9-1024x538.png" width = 600>


This pioneering work laid the foundation for vision architectures such as **CLIP, SAM, TrOCR and DINO** (we will encounter a lot of these architectures in future modules).

In the upcoming modules, we will get hands-on with these foundation models on various downstream tasks. In this notebook, we will primarily study the architectural details of vanilla ViT, along with notable derivatives that have emerged since its introduction, up until Q3 of 2024.

Tidbit: As of Oct 2024, OmniVec(ViT) is a SOTA classification model achieving **92.4** on the ImageNet dataset. [[Source](https://paperswithcode.com/sota/image-classification-on-imagenet)]

## Table of Contents

* [Introduction to ViT](#Introduction-to-ViT)
* [Internal Workings of a ViT](#Internal-Workings-of-a-ViT)
    * [1. Patch Embeddings](#1.-Patch-Embeddings)
    * [2. Positional Embedding](#2.-Positional-Embedding)
    * [3. Transformer Encoder](#3.-Transformer-Encoder)
        * [3.1 Attention](#3.1-Attention)
        * [3.2 LayerNorm and Residual Connections](#3.2-LayerNorm-and-Residual-Connections)
        * [3.3 Feed Forward Layers](#3.3-Feed-Forward-Layers)
    * [4. Classification Head](#4.-Classification-Head)
    * [5. Conclusion](#5.-Conclusion)
    * [6. References](#6.-References)
    * [7. Further Reads](#7.-Further-Reads)

-----------------------------------------------------------------------------------------------------------------------------------

## Introduction to ViT

Transformers dominate the NLP space, primarily due to their inherent scalability and ability to handle long-range dependencies efficiently. Unlike CNNs, which process an image through the local receptive fields of convolutional kernels, Vision Transformers adopt most of their architecture from the vanilla Transformer introduced by Vaswani et al.

**At a high level, here are a few pointers**:

In ViT, a 224x224 image is divided into 196 non-overlapping patches, each of size 16x16 pixels. In general, an image yields $N = HW/P^2$ patches, where $H$, $W$ and $P$ are the image height, image width and patch size respectively. Each patch is flattened and linearly projected into a 1D vector, giving $N$ vectors, and each vector is treated like a token. A special **[CLS]** token, similar to the one in the BERT architecture, is prepended and acts as a class representation of all the patches. Additionally, learnable or fixed positional embeddings are added to each patch embedding, which helps retain the spatial relation between patches within an image.

The resulting sequence ($N$ patch embeddings + [CLS]) is passed to a stack of standard Transformer Encoder blocks built around Multi Head Attention (MHA). The output is a contextualized representation of the tokens produced by self-attention. Finally, an MLP Head is attached on top of the output at the classification token position, which gives the classification result for the image.



<img src="https://uvadlc-notebooks.readthedocs.io/en/latest/_images/vit.gif" width = 600>

Figure Credits: [Phil Wang - lucidrains](https://github.com/lucidrains/vit-pytorch/blob/main/images/vit.gif)

## Internal Workings of a ViT

---------------------------------------------------------------------------------------------------------------------------------

Ok, now we have a holistic idea of how a ViT works. Next, we will look at the internal workings of a ViT in much more detail, block by block and layer by layer. For a more practical understanding throughout this notebook, we will load a `vit_base_patch16_224` model from the popular `timm` (PyTorch Image Models) library. This will help us examine the model parameters, layers and flow.

Here `224` refers to the input image size and `16` is the spatial dimension of the patches, i.e., each patch will be of 16x16 resolution.

In [None]:
# !pip install -q timm

The following ViT Base model has 12 Encoder blocks and 12 Heads.



In [None]:
import timm

model_path = "vit_base_patch16_224"

vit_model = timm.create_model(
    model_path, pretrained = False) #to avoid downloading weights

vit_model.eval()

VisionTransformer(
  (patch_embed): PatchEmbed(
    (proj): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))
    (norm): Identity()
  )
  (pos_drop): Dropout(p=0.0, inplace=False)
  (patch_drop): Identity()
  (norm_pre): Identity()
  (blocks): Sequential(
    (0): Block(
      (norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
      (attn): Attention(
        (qkv): Linear(in_features=768, out_features=2304, bias=True)
        (q_norm): Identity()
        (k_norm): Identity()
        (attn_drop): Dropout(p=0.0, inplace=False)
        (proj): Linear(in_features=768, out_features=768, bias=True)
        (proj_drop): Dropout(p=0.0, inplace=False)
      )
      (ls1): Identity()
      (drop_path1): Identity()
      (norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
      (mlp): Mlp(
        (fc1): Linear(in_features=768, out_features=3072, bias=True)
        (act): GELU(approximate='none')
        (drop1): Dropout(p=0.0, inplace=False)
        (norm): Identity(

### 1. Patch Embeddings

Transformer models process inputs as tokens. If we want to apply transformers to image recognition, the first question that comes to mind is "what is the equivalent of words in images?" There are several choices, such as treating each pixel as a token. However, the computational complexity of calculating the attention matrix is $O(N^2)$, where $N$ is the sequence length. If we treat each pixel as a separate word, then even for a relatively small image of size 100×100, the attention matrix will be of size 10000×10000. This is impractical to compute for every head and layer, even on the largest GPUs.
Additionally, treating individual pixels as tokens would fail to capture the interconnected local features between neighboring pixels.

<img src = https://learnopencv.com/wp-content/uploads/2023/02/image-1-1024x716.png width = 600>

A better approach is to treat patches as tokens and learn the inter-patch representations. For this, the image is divided into equal-sized 16x16 patches. Thus, an RGB image of size $H$x$W$x$3$ is split into patches, each of size $P$x$P$x$3$.

For example, an input image of 224x224 ($H$, $W$) will be divided into a total of 14x14 = 196 patches ($N$) by a 16x16 patch ($P$).
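To make the arithmetic concrete, here is a small sketch in plain Python that compares the token count and attention-matrix size for pixel tokens versus patch tokens. The helper name `num_tokens` is ours and only serves this illustration.

```python
def num_tokens(height: int, width: int, patch: int) -> int:
    """Number of non-overlapping patch tokens, N = H * W / P^2."""
    assert height % patch == 0 and width % patch == 0, "image must tile evenly"
    return (height // patch) * (width // patch)


n_patch = num_tokens(224, 224, 16)  # 14 * 14 patch tokens
n_pixel = num_tokens(100, 100, 1)   # every pixel of a 100x100 image is a token

# The attention matrix has N x N entries per head.
print(n_patch, n_patch ** 2)  # 196 38416
print(n_pixel, n_pixel ** 2)  # 10000 100000000
```

Even though the 224x224 image is larger than the 100x100 one, using 16x16 patches shrinks the attention matrix by more than three orders of magnitude.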

Then all the patches are flattened into a sequence. The embedding layer transforms each patch into a hidden, learned representation of dimension $d_{in}$.

In code, this "patch embedding" can be achieved using `Conv2d` with a kernel size of `16x16` and a stride of `(16,16)`, which effectively creates non-overlapping patches and projects each one into a vector space of dimension $d_{in}$ (the embed dim).

```python
(patch_embed): PatchEmbed(
    (proj): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))
    (norm): Identity()
  )
```

In terms of tensor sizes, assuming a batch size of 1, the input is of size [1, 3, 224, 224]. After patch embedding, the tensor has size [1, 196,  $d_{in}$] where each patch is a 1D vector. For example,  $d_{in}$ = 768 in the base vision transformer model.
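The shape change described above can be checked with a minimal sketch. `PatchEmbedSketch` is a hypothetical class written for this notebook, not timm's `PatchEmbed`; it assumes the image size is a multiple of the patch size.

```python
import torch
import torch.nn as nn


class PatchEmbedSketch(nn.Module):
    """Hypothetical minimal patch embedding: Conv2d projection, then flatten."""

    def __init__(self, in_chans=3, embed_dim=768, patch_size=16):
        super().__init__()
        # kernel_size == stride, so every patch is seen exactly once
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)          # [B, 768, 14, 14]
        x = x.flatten(2)          # [B, 768, 196]
        return x.transpose(1, 2)  # [B, 196, 768]


patch_embed = PatchEmbedSketch()
images = torch.randn(1, 3, 224, 224)
tokens = patch_embed(images)
print(tokens.shape)  # torch.Size([1, 196, 768])
```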

From now, we will use the notation $d_{\text{in}}$ and $\text{emb_dim}$ interchangeably for embedding dimension.


<img src = "https://learnopencv.com/wp-content/uploads/2023/02/image-2-1024x578.png" width=600>

BERT framed the pre-training task as a classification problem. To let the transformer model perform classification, an extra token called the class token was used. Following this idea, ViT concatenates a **learnable [CLS]** token to the beginning of the patch sequence. After passing through the encoder, this classification token acts as a summary representation of the entire image, and the final MLP Head uses it to predict the image class.

```python
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))  # [1, 1, embed_dim], expanded over the batch before concatenation
```

In terms of tensor sizes, after adding the class token the resulting tensor is of size [1, 197, 768] where the shape is $[B, \text{patches} + 1, d_{in}]$.
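The concatenation step can be sketched as follows. This assumes the class token is stored with shape `[1, 1, embed_dim]` and uses a random tensor as a stand-in for real patch embeddings.

```python
import torch
import torch.nn as nn

embed_dim = 768
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))  # one learnable token

patch_tokens = torch.randn(2, 196, embed_dim)          # stand-in patch embeddings, B = 2
cls = cls_token.expand(patch_tokens.shape[0], -1, -1)  # [B, 1, 768], same weights for every image
x = torch.cat([cls, patch_tokens], dim=1)              # [B, 197, 768]
print(x.shape)  # torch.Size([2, 197, 768])
```

`expand` creates a view rather than a copy, so every image in the batch shares the same learnable token.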

<img src = "https://learnopencv.com/wp-content/uploads/2023/02/image-3-1024x483.png" width=600>

### 2. Positional Embedding

> "**Vision Transformer has much less image-specific inductive bias than
CNNs. In CNNs, locality, two-dimensional neighborhood structure, and translation equivariance are
baked into each layer throughout the whole model. In ViT, only MLP layers are local and translationally equivariant, while the self-attention layers are global**"

We know that the vanilla self-attention mechanism from 2017 has no notion of order among its inputs. All patches or words are treated equally regardless of where they appear. This is a problem since the order of patches and words really matters in both NLP and computer vision. Thus, to allow the transformer to learn to differentiate between patches at different locations, we add something called a position embedding to the inputs.

There are many kinds of position embeddings in the NLP literature, such as the fixed sine/cosine embeddings and learnable embeddings. In the fixed variant, sine and cosine functions of different frequencies are applied to alternating embedding dimensions (sine on even indices, cosine on odd indices), which gives every position in the sequence a unique encoding.
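As an illustration of the fixed variant, here is a sketch of the sine/cosine embedding from "Attention Is All You Need", where sine fills the even embedding dimensions and cosine the odd ones. The function name is ours, and an even `dim` is assumed.

```python
import math
import torch


def sinusoidal_position_embedding(num_positions: int, dim: int) -> torch.Tensor:
    """Fixed sine/cosine position embedding of shape [num_positions, dim]."""
    position = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)  # [N, 1]
    # One frequency per pair of dimensions: 1 / 10000^(2i / dim)
    div_term = torch.exp(
        torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim)
    )
    pe = torch.zeros(num_positions, dim)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe


pe = sinusoidal_position_embedding(197, 768)
print(pe.shape)  # torch.Size([197, 768])
```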


Vision transformers work about the same with either of these types. In the figure below, a position embedding is just a learnable parameter. Continuing with our example of images of size 224×224, recall that after concatenating the classification token, the tensor has size **[1, 197, 768]**. We need to instantiate the position embedding parameter with the same size and add it to the token sequence element-wise. The resulting sequence of vectors is then fed into the transformer model.
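A learnable position embedding can be sketched in a few lines. The initialization scale of 0.02 is our assumption for illustration; the point is only that one `[1, 197, 768]` parameter is broadcast over the batch.

```python
import torch
import torch.nn as nn

num_tokens, embed_dim = 197, 768
# Learnable parameter; the 0.02 init scale is an illustrative assumption
pos_embed = nn.Parameter(torch.randn(1, num_tokens, embed_dim) * 0.02)

tokens = torch.randn(4, num_tokens, embed_dim)  # [B, 197, 768], [CLS] included
x = tokens + pos_embed                          # broadcast over the batch dimension
print(x.shape)  # torch.Size([4, 197, 768])
```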

<img src="https://learnopencv.com/wp-content/uploads/2023/02/image-4-1024x334.png" width=800>


Positional embeddings help capture the spatial relationship of a patch along its row and column. At the start of pre-training, the positional embeddings carry no information about the 2D positions of the patches; all spatial relations are learnt from scratch. The paper notes: "*Vision Transformer can handle arbitrary sequence lengths (up to memory constraints), however, the pre-trained position embeddings may no longer be meaningful*", which suggests that if we fine-tune at a different resolution, the learned position embeddings no longer match the new patch grid. In practice, however, interpolating the positional embeddings to the new resolution is found to be effective.
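One way to picture the interpolation is the sketch below. `resize_pos_embed` is a hypothetical helper, not a library function; it assumes a square patch grid, the [CLS] position stored first, and bicubic interpolation as the resampling method.

```python
import math
import torch
import torch.nn.functional as F


def resize_pos_embed(pos_embed: torch.Tensor, new_grid: int) -> torch.Tensor:
    """Resample a [1, 1 + g*g, D] position embedding to a new_grid x new_grid grid."""
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    old_grid = math.isqrt(patch_pos.shape[1])  # assumes a square grid
    dim = patch_pos.shape[-1]
    # [1, g*g, D] -> [1, D, g, g] so the grid can be treated like an image
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_pos, patch_pos], dim=1)  # [CLS] position is kept as-is


pos_embed_224 = torch.randn(1, 197, 768)                             # 14 x 14 grid + [CLS]
pos_embed_384 = resize_pos_embed(pos_embed_224, new_grid=384 // 16)  # 24 x 24 grid
print(pos_embed_384.shape)  # torch.Size([1, 577, 768])
```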


The following map from the paper shows that, after training, patches in the same row and column, as well as nearby patches, have position embeddings with higher cosine similarity than those farther apart.

<img src = "https://learnopencv.com/wp-content/uploads/2024/10/Positional_Embedding_Cosine_Similarity-ViT-16x16-paper.png" width = 400>

### 3. Transformer Encoder

The core part of a ViT is the Transformer Encoder with its **attention mechanism**. Once the patch embeddings are prepared, they are fed into a series of encoder blocks, where the attention mechanism captures semantic and contextual relationships between input patches. An alignment score is computed for each pair of tokens; these scores form the attention map, which determines how strongly one patch token attends to another.

<img src="https://learnopencv.com/wp-content/uploads/2024/10/ViT-MHA.png"  width="700" >

Each Transformer encoder consists of Norm, MHA and Feed Forward Network layers.

Let's look at one of the encoder blocks of our timm ViT model:
```python
12 x (0): Block(
      (norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
      (attn): Attention(
        (qkv): Linear(in_features=768, out_features=2304, bias=True)
        (q_norm): Identity()
        (k_norm): Identity()
        (attn_drop): Dropout(p=0.0, inplace=False)
        (proj): Linear(in_features=768, out_features=768, bias=True)
        (proj_drop): Dropout(p=0.0, inplace=False)
      )
      (ls1): Identity()
      (drop_path1): Identity()
      (norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
      (mlp): Mlp(
        (fc1): Linear(in_features=768, out_features=3072, bias=True)
        (act): GELU(approximate='none')
        (drop1): Dropout(p=0.0, inplace=False)
        (norm): Identity()
        (fc2): Linear(in_features=3072, out_features=768, bias=True)
        (drop2): Dropout(p=0.0, inplace=False)
      )
      (ls2): Identity()
      (drop_path2): Identity()
    )

```

##### 3.1 Attention

Each patch embedding is transformed into three matrices namely **Query(Q)**, **Key(K)** and **Value(V)** having shape $[B, \text{num_patches}, \text{emb_dim}]$

The transformations are achieved using learned weight matrices $W_{Q}$, $W_{K}$, and $W_{V}$ which project the input patch embeddings into the query, key and value matrices.

$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$$

where $X$ is the input embedding matrix.


```python
Q = nn.Linear(emb_dim, head_size) # In MHA head size
K = nn.Linear(emb_dim, head_size)
V = nn.Linear(emb_dim, head_size)
```

The significant components of Self Attention are:

- **Query**: What the patch is trying to learn from other patches? (a question)
- **Key**: The information other patches hold that might help to answer the query. (answer relevance)
- **Value**: The value holds the actual content from the patches, which is weighted by the attention mechanism to produce the final representation.


The following analogy is adapted from Jay Alammar [[Source](https://jalammar.github.io/illustrated-gpt2/)]:

> "A crude analogy is to think of it like searching through a filing cabinet. The query is like a sticky note with the topic you’re researching. The keys are like the labels of the folders inside the cabinet. When you match the tag with a sticky note, we take out the contents of that folder, these contents are the value vector. Except you’re not only looking for one value, but a blend of values from a blend of folders. Multiplying the query vector by each key vector produces a score for each folder (technically: dot product followed by softmax)"

<img src="https://jalammar.github.io/images/gpt2/self-attention-example-folders-scores-3.png" width=400>


**Self Attention**:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where ${d_k}$ is the dimension of the key vectors (in MHA, the per-head dimension).


In the self-attention mechanism, each patch token attends to every token in the sequence (including itself), helping the model understand inter-patch relationships.

- To calculate attention scores, a dot product is taken between the Query(Q) of one token and the Key(K) of every token in the sequence.
$$\text{Score}(Q, K) = Q K^T$$

- To avoid large values and stabilize training, divide the attention scores by ${\sqrt{d_k}}$

$$\text{Scaled Scores} = \frac{Q K^T}{\sqrt{d_k}}$$

- These scaled scores are passed through a softmax function to convert them into a probability distribution over the tokens.
$$\text{Attention Weights} = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right)$$

- Finally, the attention weights are used to compute a weighted sum of the Value(V) vectors.
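The four steps above can be sketched as a single-head function. The projection layers and the 64-dimensional embeddings here are illustrative stand-ins, not weights taken from the timm model.

```python
import math
import torch
import torch.nn as nn


def self_attention(q, k, v):
    """Scaled dot-product attention for tensors of shape [B, N, d_k]."""
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1)  # [B, N, N] raw dot products
    scaled = scores / math.sqrt(d_k)  # keep logits in a stable range
    weights = scaled.softmax(dim=-1)  # each row sums to 1
    return weights @ v, weights       # weighted sum of the value vectors


torch.manual_seed(0)
x = torch.randn(1, 197, 64)  # stand-in token embeddings
w_q, w_k, w_v = (nn.Linear(64, 64, bias=False) for _ in range(3))
out, weights = self_attention(w_q(x), w_k(x), w_v(x))
print(out.shape, weights.shape)  # torch.Size([1, 197, 64]) torch.Size([1, 197, 197])
```

Each row of `weights` describes how one token distributes its attention over all 197 tokens.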

<img src = "https://learnopencv.com/wp-content/uploads/2023/01/neural-self-attention-cover-picture-768x576.png" width=650>

**Multi Head Attention**:
$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \text{head}_2, \dots, \text{head}_h) W_O$$

MHA extends self attention mechanism by efficiently running multiple attention heads in parallel. Each head learns different types of relationships between patches or tokens. After computing attention for each head, the results are concatenated and passed through a linear transformation. $W_O$ is the output projection matrix, applied after concatenating the results from all attention heads.

```python
(proj): Linear(in_features=768, out_features=768, bias=True)
```
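The key point is that the `qkv` linear layer computes the query, key and value vectors for all heads in a single linear transformation. In MHA, $\text{emb_dim}$ must be divisible by $\text{num_heads}$, so that each head works with vectors of size $\text{emb_dim} / \text{num_heads}$. ViT-Base has 12 heads, so each head has dimension 768 / 12 = 64, and the output size of `qkv` is $3 \times \text{num_heads} \times \text{head_dim} = 3 \times \text{emb_dim}$ $=>$ 3 x 768 = 2304.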

```python
      (attn): Attention(
        (qkv): Linear(in_features=768, out_features=2304, bias=True)
        (q_norm): Identity()
        (k_norm): Identity()
        (attn_drop): Dropout(p=0.0, inplace=False)
        (proj): Linear(in_features=768, out_features=768, bias=True)
        (proj_drop): Dropout(p=0.0, inplace=False)
      )
```


In code,
```python
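head_dim = emb_dim // num_heads  # d_k, e.g. 768 // 12 = 64
scale = head_dim ** -0.5         # 1 / sqrt(d_k)
q = q * scale                    # scale the queries before the dot product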
attn = q @ k.transpose(-2, -1)
attn = attn.softmax(dim=-1)
attn = self.attn_drop(attn)
x = attn @ v
```
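Putting the pieces together, a multi-head attention module can be sketched as below. This is a simplified illustration in the spirit of the `Attention` block printed above, not timm's actual implementation: dropout and the optional q/k norms are omitted, and the class name is hypothetical.

```python
import torch
import torch.nn as nn


class MultiHeadAttentionSketch(nn.Module):
    """Hypothetical minimal MHA: fused qkv projection, per-head attention, output projection."""

    def __init__(self, emb_dim=768, num_heads=12):
        super().__init__()
        assert emb_dim % num_heads == 0, "emb_dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = emb_dim // num_heads      # 64 for ViT-Base
        self.scale = self.head_dim ** -0.5        # 1 / sqrt(d_k)
        self.qkv = nn.Linear(emb_dim, emb_dim * 3)
        self.proj = nn.Linear(emb_dim, emb_dim)   # W_O

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4).unbind(0)  # each [B, heads, N, head_dim]
        attn = (q * self.scale) @ k.transpose(-2, -1)   # [B, heads, N, N]
        attn = attn.softmax(dim=-1)
        x = (attn @ v).transpose(1, 2).reshape(B, N, C)  # concatenate the heads
        return self.proj(x)


mha = MultiHeadAttentionSketch()
y = mha(torch.randn(2, 197, 768))
print(y.shape)  # torch.Size([2, 197, 768])
```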

Attention maps offer a way to visualize how a ViT focuses on different image patches, similar to how feature maps in a CNN highlight key regions in an image. We will discuss this in depth in our upcoming notebook.

##### 3.2 LayerNorm and Residual Connections

Layer normalization, first proposed by the Nobel laureate Professor Geoffrey Hinton’s lab, is a slightly different version of batch normalization. We are all familiar with batch norm in the context of computer vision. However, batch norm cannot be directly applied to recurrent architectures. Moreover, since the mean (μ) and standard deviation (σ) statistics in batch norm are calculated over a mini-batch, the results depend on the batch size. As shown in the figure below, layer normalization overcomes this problem by calculating the statistics over the neurons in a layer rather than across the mini-batch. Thus, each sample in the mini-batch gets its own μ and σ, and these are shared by all neurons in the layer for that sample.
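The per-sample statistics can be verified with a short sketch: a manual normalization over the feature dimension should match `nn.LayerNorm`, and the result for one sample should not change when other samples are present in the batch. The tensor sizes mirror our ViT example; in a Transformer, the statistics are computed per token over its 768 features.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 197, 768)  # [B, tokens, emb_dim]

# One mean and variance per token, computed over the 768 features
mu = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
manual = (x - mu) / torch.sqrt(var + 1e-6)

layer_norm = nn.LayerNorm(768, eps=1e-6)  # affine params start at weight=1, bias=0
matches_manual = torch.allclose(manual, layer_norm(x), atol=1e-4)

# Normalizing the first sample alone gives the same result as inside the batch
batch_independent = torch.allclose(layer_norm(x[:1]), layer_norm(x)[:1], atol=1e-6)
print(matches_manual, batch_independent)
```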

The thing to note is that for typical model sizes, layer norm is slower than batch norm. Thus, some architectures like DEST (which are designed for speed, but we will not introduce them here), use engineering tricks to use batch norm while keeping the training stable. However, for most widely used vision transformers, layer norm is used and is quite critical for their performance.

In ViT, the normalization layers are reorganized relative to the vanilla Transformer: LayerNorm is applied before each attention and MLP sublayer (pre-norm) rather than after it. This ordering supports better gradient flow and reduces the need for a warm-up stage during training.
```python
 (norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
 ```
<img src="https://learnopencv.com/wp-content/uploads/2023/02/image-5-1024x405.png" width=700>


Additionally, to mitigate the vanishing gradient problem, the transformer architecture uses skip (residual) connections that add the input of each sublayer to its output.
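A pre-norm encoder block with residual connections can be sketched as follows. Here PyTorch's built-in `nn.MultiheadAttention` is used as a stand-in for timm's `Attention` module, and dropout is omitted; `EncoderBlockSketch` is a hypothetical name.

```python
import torch
import torch.nn as nn


class EncoderBlockSketch(nn.Module):
    """Hypothetical pre-norm block: x + Attn(LN(x)), then x + MLP(LN(x))."""

    def __init__(self, emb_dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(emb_dim, eps=1e-6)
        # Stand-in for timm's Attention module
        self.attn = nn.MultiheadAttention(emb_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(emb_dim, eps=1e-6)
        self.mlp = nn.Sequential(
            nn.Linear(emb_dim, emb_dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(emb_dim * mlp_ratio, emb_dim),
        )

    def forward(self, x):
        h = self.norm1(x)                                # normalize before attention
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out                                 # first residual connection
        x = x + self.mlp(self.norm2(x))                  # second residual connection
        return x


block = EncoderBlockSketch()
out = block(torch.randn(2, 197, 768))
print(out.shape)  # torch.Size([2, 197, 768])
```

Because both sublayers preserve the `[B, 197, 768]` shape, blocks of this kind can be stacked 12 times without any reshaping in between.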

##### 3.3 Feed Forward Layers

Each Transformer encoder block also contains an MLP (feed forward) layer. This is a sequential module consisting of:

- A linear layer that projects the output of the MHA layer into a higher dimension ($d_{mlp} > d_{in}$). The output dimension of 3072 is determined by the MLP ratio (typically 4) multiplied by the hidden dimension.

- An activation layer with GELU activation ($\text{GELU}(x) = x\Phi(x)$, where $\Phi(x)$ is the cumulative distribution function of the standard Gaussian distribution).
- A dropout layer to prevent overfitting.
- A linear layer to project the output back to the same size as the output of the MHA layer.
- Another dropout layer to prevent overfitting.


```python
(mlp): Mlp(
        (fc1): Linear(in_features=768, out_features=3072, bias=True)
        (act): GELU(approximate='none')
        (drop1): Dropout(p=0.0, inplace=False)
        (norm): Identity()
        (fc2): Linear(in_features=3072, out_features=768, bias=True)
        (drop2): Dropout(p=0.0, inplace=False)
      )
```
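The layer list above maps directly onto a small module. The sketch below mirrors the printed `Mlp` layout (without the `Identity` norm) and also checks the exact GELU formula against PyTorch's implementation; `MlpSketch` and `gelu_exact` are names we introduce for illustration.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


def gelu_exact(x: torch.Tensor) -> torch.Tensor:
    """GELU(x) = x * Phi(x), with Phi the standard normal CDF written via erf."""
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))


class MlpSketch(nn.Module):
    """Hypothetical feed forward block: expand by mlp_ratio, GELU, project back."""

    def __init__(self, emb_dim=768, mlp_ratio=4, drop=0.0):
        super().__init__()
        hidden = emb_dim * mlp_ratio  # 768 * 4 = 3072
        self.fc1 = nn.Linear(emb_dim, hidden)
        self.act = nn.GELU()
        self.drop1 = nn.Dropout(drop)
        self.fc2 = nn.Linear(hidden, emb_dim)
        self.drop2 = nn.Dropout(drop)

    def forward(self, x):
        x = self.drop1(self.act(self.fc1(x)))
        return self.drop2(self.fc2(x))


mlp = MlpSketch()
y = mlp(torch.randn(2, 197, 768))
t = torch.linspace(-3.0, 3.0, 13)
gelu_matches = torch.allclose(gelu_exact(t), F.gelu(t), atol=1e-6)
print(y.shape, gelu_matches)
```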

### 4. Classification Head

As we saw in the patch embedding section, a learnable parameter called the classification token is concatenated to the patch embeddings. This token becomes a part of the vector sequence fed into the transformer model and evolves with self-attention. Finally, we attach a small MLP classification head on top of this token's output and read the classification results from it. This is just a vanilla dense layer with the number of neurons equal to the number of classes in the dataset. So, for example, continuing with our example of dim=768, for the ImageNet dataset, this layer will take in a vector of size 768 and output 1000 class scores (logits), which a softmax turns into class probabilities.

```python
(head): Linear(in_features=768, out_features=1000, bias=True)
```

Note that once we have obtained the classification probabilities from the MLP head on top of the classification token, the outputs from all the other patches are ignored! This seems quite unintuitive, and one may wonder why the classification token is required at all. After all, can’t we average the outputs from all the other tokens and train an MLP on top of that, much like what we do with ResNets? Yes, it is quite possible to do so, and it works just as well as the classification token approach. Just note that a different, lower learning rate is required to get this to work.
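The two pooling options can be compared in a short sketch. The random tensor stands in for the encoder output, and the final LayerNorm that normally precedes the head is omitted for brevity.

```python
import torch
import torch.nn as nn

tokens = torch.randn(2, 197, 768)  # stand-in encoder output, [CLS] at index 0
head = nn.Linear(768, 1000)        # one output per ImageNet class

cls_logits = head(tokens[:, 0])               # class-token pooling: patch outputs are ignored
gap_logits = head(tokens[:, 1:].mean(dim=1))  # global average pooling over the 196 patches
probs = cls_logits.softmax(dim=-1)            # logits -> class probabilities
print(cls_logits.shape, gap_logits.shape)     # torch.Size([2, 1000]) torch.Size([2, 1000])
```

Both variants feed a single 768-dimensional vector per image into the same kind of linear head; they differ only in how that vector is obtained.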




## 5. Conclusion

ViTs can scale effectively with enhanced performance as dataset sizes increase and more computational resources become available. However, these improvements come at the cost of increased parameters and latency. This need for more parameters in ViT-based models is likely due to their lack of the image-specific inductive bias inherent to CNNs. Many resource-constrained applications, such as AR and mobile deployments, still rely heavily on lightweight CNNs such as MobileNet, ESPNet, ShuffleNet, and MNASNet, as they are easy to optimize and integrate with task-specific networks.


Successors like the Swin Transformer and MobileViT architectures address these issues by, in essence, applying “transformers as convolutions”, which allows them to leverage the merits of both convolutions (versatile and simple training) and transformers (global processing).

The success of ViTs has made them an integral part of a wide range of tasks, from traditional computer vision applications and multimodal large language models (MLLMs) to vision-based action models for robotics and more.

## 6. References
- [An Image is Worth 16x16 Words](https://arxiv.org/abs/2010.11929v2)
- [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
- [HuggingFace Timm Repository](https://github.com/huggingface/pytorch-image-models)
- [PyTorch Docs](https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html)

## 7. Further Reads

- [Transformer Visualizer](https://poloclub.github.io/transformer-explainer/)
- [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030)
- [DeiT : Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877)
- [BEiT : BERT Pre-Training of Image Transformers](https://arxiv.org/abs/2106.08254)
- [Mobile ViT](https://arxiv.org/abs/2110.02178)
- [LLM Visualization](https://bbycroft.net/llm)