# Vision Transformer
Step-by-step explanation of Vision Transformer Architecture
1. Image to patches
2. Patch Embedding
3. Add Positional Embeddings
4. Add [CLS] token
5. Transformer Encoder Layers
6. Classification Head

##### 1. Image to Patches
Split the image into fixed-size patches on a regular grid (for example 4x4 or 16x16 pixels). This is the vision equivalent of tokenization in NLP models, where a sequence of text is segmented into tokens.

##### 2. Patch Embedding
Flatten each patch into a 1D vector, because the Transformer operates on a sequence of vectors.
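
Steps 1 and 2 can be sketched with plain tensor operations. This is a minimal sketch, not the notebook's own implementation: the helper name `image_to_patches` is hypothetical, and the random batch stands in for CIFAR-10-sized images.

```python
import torch

def image_to_patches(images: torch.Tensor, patch_size: int) -> torch.Tensor:
    """Split (B, C, H, W) images into flattened patches of shape (B, N, C*P*P)."""
    b, c, h, w = images.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image size must be divisible by patch size"
    # Slide a non-overlapping window over height, then width: (B, C, H/P, W/P, P, P)
    patches = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    # Bring the patch-grid axes forward: (B, H/P, W/P, C, P, P)
    patches = patches.permute(0, 2, 3, 1, 4, 5)
    # Flatten the grid into a sequence and each patch into a 1D vector
    num_patches = (h // patch_size) * (w // patch_size)
    return patches.reshape(b, num_patches, c * patch_size * patch_size)

images = torch.randn(2, 3, 32, 32)            # stand-in batch, not real data
patches = image_to_patches(images, patch_size=4)
print(patches.shape)                          # torch.Size([2, 64, 48])
```

A 32x32 image with 4x4 patches gives an 8x8 grid, so 64 patches of 4 * 4 * 3 = 48 values each.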

##### 3. Add Positional Embeddings
We pass the flattened patch vectors through an embedding layer that maps them to dense vectors of a fixed embedding dimension. We then add positional embeddings so that the model retains each patch's position in the image; without them, the Transformer has no notion of patch order.
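
A minimal sketch of this step, assuming learnable positional embeddings and the sizes used later in this notebook (64 patches of 48 values, embedding dimension 256). The names `patch_proj` and `pos_embed` are illustrative. The sketch follows the step order above and adds positions to the patch sequence only; many implementations prepend the [CLS] token first and learn one extra position for it.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_patches, patch_dim, embed_dim = 64, 48, 256   # 32x32 image, 4x4 patches, 3 channels

patch_proj = nn.Linear(patch_dim, embed_dim)      # maps each flat patch to a dense vector
pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))  # one learnable vector per position
nn.init.trunc_normal_(pos_embed, std=0.02)        # small random init (an assumption, not from the notebook)

flat_patches = torch.randn(2, num_patches, patch_dim)  # stand-in for flattened patches
tokens = patch_proj(flat_patches) + pos_embed          # pos_embed broadcasts over the batch dimension
print(tokens.shape)                                    # torch.Size([2, 64, 256])
```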

##### 4. Add [CLS] Token
The [CLS] token is a special learnable vector that is prepended to the sequence of patch embeddings, giving the sequence [CLS token + patch embeddings]. Its output after the encoder summarizes the whole image and is what the classification head uses.
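
Prepending the token is a single concatenation along the sequence dimension. The sketch below assumes an embedding dimension of 256 and uses random stand-ins for the embedded patches; the variable names are illustrative.

```python
import torch
import torch.nn as nn

embed_dim = 256
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))   # one learnable vector shared by all images

patch_tokens = torch.randn(2, 64, embed_dim)             # stand-in for embedded patches
cls = cls_token.expand(patch_tokens.size(0), -1, -1)     # (B, 1, D) view, no memory copy
sequence = torch.cat([cls, patch_tokens], dim=1)         # [CLS] sits at index 0
print(sequence.shape)                                    # torch.Size([2, 65, 256])
```

The sequence length grows from 64 to 65, which is why position tables in many implementations have `num_patches + 1` entries.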

##### 5. Transformer Encoder Layers
The encoder consists of a stack of identical blocks. The sequence ([CLS] token + patch embeddings) is passed through these layers.
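
One way to build such a stack is with PyTorch's built-in encoder modules. This is a sketch under assumptions: the sizes mirror the hyperparameters set later in this notebook (embedding dimension 256, 8 heads, MLP dimension 512, depth 6), and GELU is an assumed activation choice.

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(
    d_model=256, nhead=8, dim_feedforward=512,
    dropout=0.1, activation="gelu", batch_first=True,   # inputs are (B, N, D)
)
encoder = nn.TransformerEncoder(layer, num_layers=6)    # 6 stacked blocks

sequence = torch.randn(2, 65, 256)   # stand-in for [CLS] + 64 patch tokens
encoder.eval()                       # disable dropout for a deterministic pass
with torch.no_grad():
    encoded = encoder(sequence)
print(encoded.shape)                 # torch.Size([2, 65, 256])
```

The encoder keeps the sequence shape unchanged: every token goes in and comes out as a 256-dimensional vector.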

##### 6. Classification Head
Pass the encoder output to the classification head, which classifies the image into one of the predefined classes. For example, a model trained on CIFAR-10 outputs one probability per class (10 in total), and the predicted class is the one with the highest probability, selected with argmax.
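
A minimal sketch of the head, assuming a LayerNorm followed by a single linear layer (one common choice; the exact head design is an assumption here). The random tensor stands in for the encoder output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_dim, num_classes = 256, 10
head = nn.Sequential(nn.LayerNorm(embed_dim), nn.Linear(embed_dim, num_classes))

encoded = torch.randn(2, 65, embed_dim)   # stand-in for encoder output
cls_out = encoded[:, 0]                   # hidden state of the [CLS] token, shape (B, D)
logits = head(cls_out)                    # (B, num_classes)
probs = logits.softmax(dim=-1)            # one probability per class, each row sums to 1
pred = probs.argmax(dim=-1)               # index of the most likely class
print(probs.shape, pred.shape)            # torch.Size([2, 10]) torch.Size([2])
```

Because softmax preserves ordering, taking argmax of the logits gives the same prediction as taking argmax of the probabilities.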

#### Comparative Study
| NLP Transformer | Vision Transformer (ViT) | Explanation |
| --- | --- | --- |
| Tokens | Patches | Words/subwords in text -> Image split into small fixed-size patches |
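| Token IDs | Patch Indices | Each patch can be indexed by its position in the grid; unlike a token ID, this index is not a vocabulary lookup key |
| Token Embedding | Patch Embedding | Maps each token ID (embedding lookup) or flattened patch (learned linear projection) to a dense vector |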
| Positional Embedding | Positional Embedding | Adds location information for sequence order/spatial structure |
| [CLS] Token | [CLS] Token | Special token to summarize sequence/image -> used for classification |
| Encoder Input Sequence | Embedded Patch Sequence | Input to Transformer: tokens + positional encoding OR patches + positional info |
| Transformer Encoder Layers | Transformer Encoder Layers | Identical architecture for modeling dependencies |
| Output Token Representation | [CLS] Token Representation | Used to generate final prediction (e.g., class label) |
| Softmax Layer | Classification Head (MLP + Softmax) | Converts final output to class probabilities |
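
The Token Embedding vs. Patch Embedding row is the main practical difference between the two columns, so here is a small sketch contrasting them. The vocabulary size and token IDs are made-up illustrative values.

```python
import torch
import torch.nn as nn

embed_dim = 256

# NLP: integer token IDs select rows of a learned lookup table.
vocab_size = 1000                                # illustrative value
token_embed = nn.Embedding(vocab_size, embed_dim)
token_ids = torch.tensor([[5, 42, 7]])           # (B, T) integers
text_vectors = token_embed(token_ids)
print(text_vectors.shape)                        # torch.Size([1, 3, 256])

# ViT: real-valued flattened patches are multiplied by a learned weight matrix.
patch_dim = 4 * 4 * 3
patch_embed = nn.Linear(patch_dim, embed_dim)
flat_patches = torch.randn(1, 64, patch_dim)     # (B, N, P*P*C) floats
patch_vectors = patch_embed(flat_patches)
print(patch_vectors.shape)                       # torch.Size([1, 64, 256])
```

Both paths end with a sequence of 256-dimensional vectors, which is why the same encoder can process either.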

#### Step-by-Step ViT Pipeline
1. Image to Patches
    - The input image (e.g., 224x224x3) is divided into fixed-size patches (e.g., 16x16, which gives 14 x 14 = 196 patches).
    - This converts the image into a sequence of small grids, each of which is then flattened.
2. Patch Embedding
    - Each patch is flattened into a vector and passed through a **linear projection layer** to embed it into a fixed-length vector (like word embeddings in NLP).
    - In ViT, the embedding layer is a linear layer, not an embedding (lookup) layer as in NLP.
3. Add Positional Embeddings
    - Since Transformers have no built-in sense of order, positional embeddings are added to each patch embedding to preserve spatial information.
4. Add [CLS] Token
    - A special learnable token [CLS] is prepended to the sequence. Its output after the Transformer represents the whole image and is used for classification.
5. Transformer Encoder Layers
    - The sequence (patch embeddings + [CLS] + positional embeddings) is fed into standard Transformer encoder layers, each consisting of:
        - Multi-Head Self-Attention
        - Add & Norm
        - Feed Forward Network
        - Add & Norm
6. Classification Head
    - The final hidden state of the [CLS] token is passed through a classification layer (typically an MLP) to produce the predicted image class.
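
The four sub-steps of an encoder layer listed above can be written out by hand. This is a sketch, with the class name `EncoderBlock` and the GELU activation as assumptions. It follows the post-norm order given in the list (Add & Norm after each sub-layer); many ViT implementations instead apply LayerNorm before each sub-layer.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Post-norm block: self-attention -> Add & Norm -> feed-forward -> Add & Norm."""

    def __init__(self, embed_dim=256, num_heads=8, mlp_dim=512, drop=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, dropout=drop, batch_first=True)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, mlp_dim), nn.GELU(), nn.Dropout(drop),
            nn.Linear(mlp_dim, embed_dim), nn.Dropout(drop),
        )
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x):
        # Self-attention: queries, keys, and values all come from x
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = self.norm1(x + attn_out)        # Add & Norm
        x = self.norm2(x + self.ffn(x))     # Feed Forward Network, then Add & Norm
        return x

block = EncoderBlock().eval()               # eval() disables dropout
x = torch.randn(2, 65, 256)                 # stand-in for [CLS] + 64 patch tokens
with torch.no_grad():
    y = block(x)
print(y.shape)                              # torch.Size([2, 65, 256])
```

The residual additions require the block's input and output shapes to match, which is what lets blocks be stacked to any depth.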

### Implementation

##### Import required libraries

In [6]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
import numpy as np
import random # generate random indexes to visualize some images randomly
import matplotlib.pyplot as plt

In [7]:
torch.__version__

'2.7.1+cu126'

In [8]:
torchvision.__version__

'0.22.1+cu126'

##### Setup Device-Agnostic Code

In [11]:
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cpu'

In [12]:
print(f"Using Device: {device}")

Using Device: cpu


##### Set the seed

In [13]:
torch.manual_seed(42)
torch.cuda.manual_seed(42)
random.seed(42)

##### Setting the hyperparameters

In [14]:
BATCH_SIZE = 128
EPOCHS = 10
LEARNING_RATE = 3e-4
PATCH_SIZE = 4
NUM_CLASSES = 10
IMAGE_SIZE = 32
CHANNELS = 3
EMBED_DIM = 256
NUM_HEADS = 8
DEPTH = 6
MLP_DIM = 512
DROP_RATE = 0.1
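
Several shapes used throughout the model follow directly from these hyperparameters. The short check below derives them; the variable names `num_patches`, `patch_dim`, `head_dim`, and `seq_len` are introduced here for illustration.

```python
IMAGE_SIZE, PATCH_SIZE, CHANNELS = 32, 4, 3
EMBED_DIM, NUM_HEADS = 256, 8

assert IMAGE_SIZE % PATCH_SIZE == 0, "patches must tile the image exactly"
num_patches = (IMAGE_SIZE // PATCH_SIZE) ** 2     # 8 x 8 grid of patches
patch_dim = PATCH_SIZE * PATCH_SIZE * CHANNELS    # values per flattened patch

assert EMBED_DIM % NUM_HEADS == 0, "embedding must split evenly across heads"
head_dim = EMBED_DIM // NUM_HEADS                 # dimension handled by each attention head

seq_len = num_patches + 1                         # patches plus the [CLS] token
print(num_patches, patch_dim, head_dim, seq_len)  # 64 48 32 65
```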

##### Define Image Transformations

In [15]:
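transform = transforms.Compose([
    transforms.ToTensor(),  # PIL image -> float tensor in [0, 1], shape (C, H, W)
    # Normalize each of the 3 RGB channels with mean 0.5 and std 0.5.
    # This helps the model converge faster and
    # keeps numerical computations stable.
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])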
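
`Normalize` computes `(x - mean) / std` per channel. With a mean and standard deviation of 0.5, pixel values in [0, 1] are mapped to [-1, 1]. The sketch below applies the same formula by hand to a few sample values, which are chosen only for illustration.

```python
import torch

x = torch.tensor([0.0, 0.25, 0.5, 1.0])   # sample pixel values in [0, 1], as produced by ToTensor
mean, std = 0.5, 0.5
normalized = (x - mean) / std             # same formula Normalize applies per channel
print(normalized.tolist())                # [-1.0, -0.5, 0.0, 1.0]
```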

##### Getting a dataset

In [16]:
train_dataset = datasets.CIFAR10(root="data", train=True, download=True, transform=transform)
test_dataset = datasets.CIFAR10(root="data", train=False, download=True, transform=transform)


##### Converting our datasets into dataloaders
- Right now, our data is in the form of PyTorch Datasets.
- A DataLoader turns a dataset into an iterable of batches (mini-batches).
    - It is more computationally efficient: our hardware may not be able to hold all 50,000 training images in memory at once, so we process them 128 at a time (batch size of 128).
    - It gives our network more chances to update its weights per epoch: one gradient step per batch instead of one per pass over the data.
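
The batching behaviour can be checked without downloading anything. The sketch below uses a `TensorDataset` of 1,000 random tensors as a stand-in for CIFAR-10 (an assumption made to keep the example self-contained); for the real data the same `DataLoader` call would take `train_dataset` instead.

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset: 1000 random "images" with random labels (not CIFAR-10 itself).
images = torch.randn(1000, 3, 32, 32)
labels = torch.randint(0, 10, (1000,))
dataset = TensorDataset(images, labels)

loader = DataLoader(dataset, batch_size=128, shuffle=True)
batch_images, batch_labels = next(iter(loader))
print(batch_images.shape, batch_labels.shape)  # torch.Size([128, 3, 32, 32]) torch.Size([128])
print(len(loader))                             # 8 batches: 7 full ones plus one of 104 samples

# For the 50,000-image CIFAR-10 training split at batch size 128:
print(math.ceil(50000 / 128))                  # 391 weight updates per epoch
```

Shuffling is usually enabled for the training loader and disabled for the test loader, so that evaluation order stays fixed.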