# Vision Transformer (Incomplete) <a class="anchor" id="top"></a>

Vision Transformers (ViT), since their introduction by Dosovitskiy et. al. [reference] in 2020, have dominated the field of Computer Vision.

In a vision transformer, an input image is divided into smaller patches, similar to how a CNN processes local image regions. These patches are then flattened and fed into the transformer architecture. The transformer comprises multiple layers of self-attention and feed-forward neural networks, allowing it to learn both local and global relationships between patches.

The self-attention mechanism enables the model to attend to different patches and learn their relationships, which helps in capturing long-range dependencies in images. Additionally, vision transformers can be pretrained on large datasets, such as ImageNet, and then fine-tuned on specific tasks.

This tutorial will provide a basic vision transformer from scratch.

### Papers

[**An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale**](https://arxiv.org/abs/2010.11929).<br />
Github: [**Google Research - Vision Transformer**](https://github.com/google-research/vision_transformer)

[**Attention is all you need**](https://arxiv.org/pdf/1706.03762.pdf).<br />
Github: [**Natural Language Processing Lab**](https://github.com/jadore801120/attention-is-all-you-need-pytorch)(not official but quite good!)

---
## Table of Contents

* [Overview](#overview)
    * [Training Process](#training_process)
* [Building Models](#models)
    * [Imports](#imports)
    * [Architectures](#arch)
* [FAQ](#faq)

# Overview<a class="anchor" id="overview"></a>

Transformer models revolutionized Natural Language Processing (NLP). They have become a de-facto standard for modern NLP tasks and display obvious performance boost when compared to models like LSTMs and GRUs.

The most important paper that transformed the NLP landscape is the "[**Attention is all you need**](https://arxiv.org/pdf/1706.03762.pdf)" paper.

![Transformer Architecture](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/vit_architecture.jpg)

## Training Process<a class="anchor" id="training_process"></a>

**Data Preparation**:
- Collect and preprocess your training data. This often includes resizing images to a consistent size, data augmentation, and splitting the dataset into training, validation, and test sets.
- Organize the data with appropriate labels, especially if you are working on a supervised task like image classification or object detection.

**Model Architecture**:
- Define the architecture of your ViT. This includes specifying the number of layers, patch size, embedding dimensions, number of attention heads, and the structure of the feed-forward networks.
- Pre-trained models can also be used as a starting point, fine-tuned for your specific task.

**Data Encoding**:
- Convert your image data into a format that the ViT can understand. This usually involves splitting images into non-overlapping patches and linearly projecting these patches into embedding vectors.

**Positional Encoding**:
- Since ViTs don't have built-in spatial information like convolutional networks, you need to add positional encodings to the patch embeddings. This informs the model about the relative positions of patches.

**Loss Function**:
- Define an appropriate loss function for your task. For image classification, cross-entropy loss is commonly used. For object detection, you may use a combination of localization and classification losses.

**Training Objective**:
- ViTs can be pre-trained on large datasets (pre-training) and fine-tuned on your specific task (fine-tuning). The pre-training stage often involves tasks like image classification or predicting patch permutations.

**Training**:
- Train the ViT on your task-specific dataset using the defined loss function.
- Backpropagate gradients and update the model's parameters using optimization techniques like Adam or SGD.
- Monitor training progress with validation data and consider early stopping to avoid overfitting.

**Evaluation**:
- After training, evaluate your ViT on a held-out test dataset to assess its performance. Common evaluation metrics include accuracy, mAP (mean Average Precision), or other task-specific metrics.

**Hyperparameter Tuning**:
- Fine-tune hyperparameters like learning rate, batch size, and architectural choices to optimize the model's performance.

**Deployment**:
- Once your ViT is trained and evaluated, you can deploy it to make predictions on new, unseen data.

# Building Models<a class="anchor" id="models"></a>

## IMPORTS<a class="anchor" id="imports"></a>

In [85]:
import cv2
import numpy as np
from tqdm import tqdm, trange
from urllib.request import urlopen

import torch
import torch.nn as nn
from torch.optim import Adam
from torch.nn import CrossEntropyLoss
from torch.utils.data import DataLoader
import torchvision.transforms as transforms
from torchvision.datasets.mnist import MNIST

np.random.seed(0)
torch.manual_seed(0)

<torch._C.Generator at 0x7f4b061068f0>

In [2]:
# Check if cuda is available
torch.cuda.is_available()

True

## Architectures<a class="anchor" id="arch"></a>

### Args

In [None]:
PATCH_SIZE = 16
IMAGE_WIDTH = 224
IMAGE_HEIGHT = IMAGE_WIDTH
IMAGE_CHANNELS = 3
EMBEDDING_DIMS = IMAGE_CHANNELS * PATCH_SIZE**2
NUM_OF_PATCHES = int((IMAGE_WIDTH * IMAGE_HEIGHT) / PATCH_SIZE**2)

### Image from url

In [None]:
url = "https://upload.wikimedia.org/wikipedia/commons/6/68/Orange_tabby_cat_sitting_on_fallen_leaves-Hisashi-01A.jpg"
req = urlopen(url)
arr = np.asarray(bytearray(req.read()), dtype=np.uint8)
img = cv2.imdecode(arr, -1) # 'Load it as it is'
img.shape

### Transformations

In [222]:
# Define the train_transform using Compose
transform_img = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Resize((224,224))])

### Image Patching 

We start with splitting the inpute image into sub-images of equal sizes - patches. Each of the sub-images/patchs goes through a linear embedding resulting in a 1-d vector.

In [223]:
def make_2tuple(x):
    if isinstance(x, tuple):
        assert len(x) == 2
        return x

    assert isinstance(x, int)
    return (x, x)

In [224]:
def patching(
    img,
    embed_dim = 768,
    flatten_embedding = True,
    img_size = 224,
    img_chans = 3,
    norm_layer = None,
    patch_size = 16
):
    image_HW = make_2tuple(img_size) # img_size, img_size
    patch_HW = make_2tuple(patch_size) # patch_size, patch_size

    img = transform_img(img) # torch.Size([3, 224, 224])
    img = img.unsqueeze(0) # torch.Size([1, 3, 224, 224])
    
    _, _, H, W = img.shape
    img_size = image_HW
    patch_size = patch_HW
    
    patch_H, patch_W = patch_size
    
    assert W % patch_W == 0 and H % patch_H == 0, \
        print("Image Width is not divisible by patch size")
    
    proj = nn.Conv2d(img_chans,
                     embed_dim,
                     kernel_size=patch_HW,
                     stride=patch_HW)
    x = proj(img) # Batch (B), Channels (C), Height (H), Width (W)
    H, W = x.size(2), x.size(3)
#     x = x.flatten(2).transpose(1,2) # B HW C
    x = torch.einsum('ijk -> ijk', x.flatten(2)) # same as transpose
#     print(torch.all(x.eq(x2)))
    
    norm = norm_layer(embed_dim) if norm_layer else nn.Identity()
    x = norm(x)
    if not flatten_embedding:
        x = x.reshape(-1, H, W, embed_dim) # B, H, W, C
    
    return x

In [226]:
patching(img = test_img)

tensor([[[ 0.1330,  0.1418,  0.1928,  ...,  0.1036, -0.0242,  0.2727],
         [ 0.5675,  0.5009,  0.4831,  ...,  0.2510,  0.0497,  0.1118],
         [ 0.1200,  0.1250,  0.1544,  ...,  0.0028,  0.0744,  0.0110],
         ...,
         [ 0.6000,  0.6404,  0.6084,  ...,  0.2332,  0.1651,  0.1323],
         [ 0.1264,  0.1149,  0.0716,  ...,  0.0388,  0.0849, -0.0154],
         [-0.2970, -0.2613, -0.2429,  ..., -0.1754, -0.0784,  0.1188]]],
       grad_fn=<PermuteBackward0>)

### Classification token

### Positional Encoding

### Endcoder Block

### Putting it all together

In [218]:
class VisionTransformer(nn.Module):
    def __init__(self, image, n_patches=16):
        super(VisionTransformer, self).__init__()
        
        # Attributes
        self.image = image
        self.chw = image.shape # (H, W, C)
        print(self.chw)
        self.n_patches = n_patches
        
    def forward(self, img = None):
        img = img if img is not None else self.image
        patches = patching(img = img)
        return patches

### Example

In [219]:
vit_test = VisionTransformer(image = test_img)

(2848, 2136, 3)


In [220]:
patches = vit_test.forward(img)

In [221]:
patches

tensor([[[-0.2537, -0.2921, -0.3430,  ..., -0.0436, -0.1249, -0.0270],
         [ 0.2203,  0.2702,  0.3274,  ...,  0.0705,  0.1268,  0.1792],
         [-0.3846, -0.4100, -0.4190,  ..., -0.1242, -0.1103, -0.1355],
         ...,
         [ 0.0136,  0.0054, -0.0144,  ...,  0.1597,  0.0722, -0.0587],
         [ 0.5060,  0.5510,  0.5870,  ...,  0.3285,  0.4363,  0.2380],
         [ 0.3207,  0.3567,  0.3643,  ...,  0.0378,  0.2446, -0.0531]]],
       grad_fn=<PermuteBackward0>)

# Notes<a class="anchor" id="faq"></a>

**CNNs**:
- Philosophy:
    - Pixels are dependent on their neighboring pixels. Important features and edges are extracted using filters on a patch of an image.
- Advantage:
    - Perform better with a smaller labeled dataset
    - Compact and efficient memory utilization
- Disadvantage:
    - Does not provide details of each pixel of an image
    - Convolving can lead to bias (inductive)

**ViTs**:
- Philosophy:
    - Instead of parts that the filters can extract, feed a model with entire image data.
- Advantage:
    - Labeled data aren't necessary needed
    - Generalize better
- Disadvantage:
    - Require a lot of data to be effective

---
[**Back to top**](#top)