# Comprehensive Vision Transformer (ViT) Tutorial with Detailed Mathematical Computations

The Vision Transformer (ViT) is an image recognition model that applies the transformer architecture, originally developed for natural language processing, to sequences of image patches. This tutorial walks through the ViT's operations layer by layer, giving the forward pass and, for each layer with learnable parameters, the backward pass.

## Vision Transformer Architecture Overview

The Vision Transformer consists of several key components:
1. **Input Layer**: Tokenizes input images into patches.
2. **Patch Embedding**: Projects flattened patches into a higher-dimensional space.
3. **Positional Encoding**: Adds positional information to the token embeddings.
4. **Transformer Encoder Layers**: Consist of multi-head self-attention and feedforward neural networks.
5. **Classification Head**: Applies a linear layer to the class token (a learned embedding prepended to the patch sequence) for classification.

### Components and their Details:
1. **Input Layer**: Split an image of size $H \times W$ into patches of size $P \times P$.
2. **Patch Embedding**: Each patch is flattened and mapped to a vector of size $D$ using a linear projection.
3. **Positional Encoding**: Adds a positional embedding to each patch embedding.
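4. **Transformer Encoder**: Stack of $L$ identical layers, each consisting of the following sublayers (in the original ViT, each sublayer is also wrapped with layer normalization and a residual connection, which this tutorial omits for brevity):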
   - **Multi-Head Self-Attention (MHSA)**
   - **Feedforward Neural Network (FFN)**
5. **Classification Head**: A single fully connected layer that outputs class scores (logits), which a softmax converts to class probabilities.

## Detailed Layer-by-Layer Mathematical Operations

### Input Layer - Patching
- **Forward Pass**:
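  - Split the image into $N$ non-overlapping patches, where $N = \frac{H \times W}{P^2}$ (assuming $H$ and $W$ are divisible by $P$).
  - **Formula**: $I_{\text{patch}}^n = I[:, i*P:(i+1)*P, j*P:(j+1)*P]$ where $i \in \{0, \dots, H/P - 1\}$ and $j \in \{0, \dots, W/P - 1\}$ index the patch row and column, and $n = i \cdot (W/P) + j$.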
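
The slicing formula above can be sketched in NumPy as follows. This is a minimal illustration, assuming a channels-first `(C, H, W)` array and row-major patch ordering; the function name `extract_patches` and the toy image size are hypothetical choices, not part of the original tutorial.

```python
import numpy as np

def extract_patches(image, patch_size):
    """Split a (C, H, W) image into N = (H * W) / P^2 flattened patches."""
    c, h, w = image.shape
    p = patch_size
    assert h % p == 0 and w % p == 0, "H and W must be divisible by P"
    rows, cols = h // p, w // p
    patches = []
    for i in range(rows):          # patch row index
        for j in range(cols):      # patch column index
            patch = image[:, i * p:(i + 1) * p, j * p:(j + 1) * p]
            patches.append(patch.reshape(-1))  # flatten to length C * P * P
    return np.stack(patches)       # shape (N, C * P * P), patch n = i * cols + j

# Toy example: a 3-channel 8x8 image with 4x4 patches gives 4 patches of length 48.
image = np.arange(3 * 8 * 8, dtype=np.float64).reshape(3, 8, 8)
patches = extract_patches(image, patch_size=4)
print(patches.shape)  # (4, 48)
```

Real implementations usually replace the Python loops with a reshape or a strided convolution, but the loop form mirrors the formula most directly.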

### Patch Embedding
- **Forward Pass**:
  - Flatten each patch and project it to a higher-dimensional space.
  - **Formula**: $E_{\text{patch}}^n = W_e \cdot \text{flatten}(I_{\text{patch}}^n) + b_e$
  - Where $W_e \in \mathbb{R}^{D \times (P^2 \cdot C)}$ is the weight matrix, $b_e \in \mathbb{R}^{D}$ is the bias vector, and $C$ is the number of image channels.
- **Backward Pass**:
  - **Gradient w.r.t. input**: $\frac{\partial L}{\partial \text{flatten}(I_{\text{patch}}^n)} = W_e^T \cdot \frac{\partial L}{\partial E_{\text{patch}}^n}$
  - **Gradient w.r.t. weights**: $\frac{\partial L}{\partial W_e} = \sum_n \frac{\partial L}{\partial E_{\text{patch}}^n} \cdot \text{flatten}(I_{\text{patch}}^n)^T$
  - **Gradient w.r.t. bias**: $\frac{\partial L}{\partial b_e} = \sum_n \frac{\partial L}{\partial E_{\text{patch}}^n}$
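
As a rough check of the three gradient formulas, the sketch below stacks the flattened patches as columns of a matrix `X`, so the sum over $n$ becomes a single matrix product. The sizes, the random upstream gradient `G`, and the surrogate loss `sum(G * E)` are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
patch_dim, embed_dim, num_patches = 12, 5, 4   # hypothetical P^2 * C, D, N

X = rng.normal(size=(patch_dim, num_patches))  # column n is flatten(I_patch^n)
W_e = rng.normal(size=(embed_dim, patch_dim))
b_e = rng.normal(size=(embed_dim, 1))

def embed(X, W_e, b_e):
    # Forward pass: E[:, n] = W_e @ X[:, n] + b_e
    return W_e @ X + b_e

E = embed(X, W_e, b_e)
G = rng.normal(size=E.shape)            # stand-in for dL/dE

grad_X = W_e.T @ G                      # dL/dX, one column per patch
grad_W = G @ X.T                        # sum over n of g_n x_n^T
grad_b = G.sum(axis=1, keepdims=True)   # sum over n of g_n

# Finite-difference check of one weight entry, using L = sum(G * E).
eps = 1e-6
W_plus = W_e.copy()
W_plus[2, 3] += eps
numeric = (np.sum(G * embed(X, W_plus, b_e)) - np.sum(G * E)) / eps
print(np.isclose(numeric, grad_W[2, 3], atol=1e-4))
```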

### Positional Encoding
- **Forward Pass**:
  - Add positional information to each patch embedding.
  - **Formula**: $E_{\text{pos}}^n = E_{\text{patch}}^n + P^n$
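  - Where $P^n \in \mathbb{R}^{D}$ is the positional encoding for position $n$ (a learned vector in the original ViT). Note that $P^n$ is unrelated to the patch size $P$ used earlier.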
- **Backward Pass**:
  - **Gradient w.r.t. input**: $\frac{\partial L}{\partial E_{\text{patch}}^n} = \frac{\partial L}{\partial E_{\text{pos}}^n}$
  - **Gradient w.r.t. positional encoding**: $\frac{\partial L}{\partial P^n} = \frac{\partial L}{\partial E_{\text{pos}}^n}$
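
The addition passes gradients through unchanged for a single image. One detail worth sketching is the batched case: assuming the same positional table is shared by every image in a batch (the batch dimension and all sizes here are assumptions for illustration), its gradient is the sum of the per-image gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, embed_dim, num_patches = 3, 5, 4   # hypothetical sizes

E_patch = rng.normal(size=(batch, embed_dim, num_patches))
pos = rng.normal(size=(embed_dim, num_patches))  # shared positional table

# Forward pass: broadcasting adds the same table to every image.
E_pos = E_patch + pos

G = rng.normal(size=E_pos.shape)   # stand-in for dL/dE_pos

grad_E_patch = G                   # identity: dL/dE_patch = dL/dE_pos
grad_pos = G.sum(axis=0)           # shared parameter: sum over the batch

print(grad_E_patch.shape, grad_pos.shape)
```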

### Transformer Encoder Layers

#### Multi-Head Self-Attention (MHSA)
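- **Forward Pass** (tokens are the columns of $X \in \mathbb{R}^{D \times N}$, matching the $W \cdot x$ convention used for the patch embedding):
  - Compute Query, Key, and Value matrices for each head.
  - **Formula**: $Q = W_Q \cdot X$, $K = W_K \cdot X$, $V = W_V \cdot X$
  - Compute attention weights: $A = \text{softmax}\left(\frac{Q^T \cdot K}{\sqrt{d_k}}\right)$, an $N \times N$ matrix with the softmax taken over each row, so row $i$ holds the weights of query $i$ over all keys.
  - Compute output: $O = V \cdot A^T$
  - Concatenate heads and project: $O' = W_O \cdot \text{concat}(O_1, O_2, ..., O_h)$
  - Where $W_Q, W_K, W_V, W_O$ are the weight matrices, and $d_k$ is the dimensionality of the keys. With tokens stored as rows instead, the same computation reads $\text{softmax}\left(\frac{Q \cdot K^T}{\sqrt{d_k}}\right) \cdot V$.
- **Backward Pass**:
  - **Gradient through attention** (per head, with $S = \frac{Q^T \cdot K}{\sqrt{d_k}}$): $\frac{\partial L}{\partial V} = \frac{\partial L}{\partial O} \cdot A$, $\frac{\partial L}{\partial A} = \left(\frac{\partial L}{\partial O}\right)^T \cdot V$, $\frac{\partial L}{\partial S} = A \odot \left(\frac{\partial L}{\partial A} - \text{rowsum}\left(\frac{\partial L}{\partial A} \odot A\right)\right)$, $\frac{\partial L}{\partial Q} = \frac{1}{\sqrt{d_k}} K \cdot \left(\frac{\partial L}{\partial S}\right)^T$, $\frac{\partial L}{\partial K} = \frac{1}{\sqrt{d_k}} Q \cdot \frac{\partial L}{\partial S}$
  - **Gradient w.r.t. input** (summed over heads): $\frac{\partial L}{\partial X} = W_Q^T \cdot \frac{\partial L}{\partial Q} + W_K^T \cdot \frac{\partial L}{\partial K} + W_V^T \cdot \frac{\partial L}{\partial V}$
  - **Gradient w.r.t. weights**: $\frac{\partial L}{\partial W_Q} = \frac{\partial L}{\partial Q} \cdot X^T$, $\frac{\partial L}{\partial W_K} = \frac{\partial L}{\partial K} \cdot X^T$, $\frac{\partial L}{\partial W_V} = \frac{\partial L}{\partial V} \cdot X^T$, $\frac{\partial L}{\partial W_O} = \frac{\partial L}{\partial O'} \cdot \text{concat}(O_1, O_2, ..., O_h)^T$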
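
The following is a minimal single-head sketch of the attention forward and backward passes, checked against a finite difference. It assumes tokens are stored as columns of `X` (so the score matrix is `Q.T @ K` and has shape `N x N`), omits the output projection $W_O$ and multiple heads, and uses hypothetical sizes and function names throughout.

```python
import numpy as np

def softmax_rows(S):
    S = S - S.max(axis=1, keepdims=True)   # subtract row max for stability
    e = np.exp(S)
    return e / e.sum(axis=1, keepdims=True)

def attention_forward(X, W_Q, W_K, W_V):
    Q, K, V = W_Q @ X, W_K @ X, W_V @ X        # each (d_k, N)
    d_k = Q.shape[0]
    A = softmax_rows(Q.T @ K / np.sqrt(d_k))   # (N, N), row i: query i over keys
    O = V @ A.T                                # column i = sum_j A[i, j] * V[:, j]
    return O, (X, Q, K, V, A)

def attention_backward(G_O, cache, W_Q, W_K, W_V):
    X, Q, K, V, A = cache
    d_k = Q.shape[0]
    G_V = G_O @ A                              # dL/dV
    G_A = G_O.T @ V                            # dL/dA
    # Row-wise softmax backward: dL/dS
    G_S = A * (G_A - np.sum(G_A * A, axis=1, keepdims=True))
    G_Q = K @ G_S.T / np.sqrt(d_k)             # dL/dQ
    G_K = Q @ G_S / np.sqrt(d_k)               # dL/dK
    grad_X = W_Q.T @ G_Q + W_K.T @ G_K + W_V.T @ G_V
    grads_W = (G_Q @ X.T, G_K @ X.T, G_V @ X.T)
    return grad_X, grads_W

rng = np.random.default_rng(0)
D, d_k, N = 6, 3, 4                            # hypothetical sizes
X = rng.normal(size=(D, N))
W_Q, W_K, W_V = (rng.normal(size=(d_k, D)) for _ in range(3))

O, cache = attention_forward(X, W_Q, W_K, W_V)
G_O = rng.normal(size=O.shape)                 # stand-in for dL/dO
grad_X, (grad_WQ, grad_WK, grad_WV) = attention_backward(G_O, cache, W_Q, W_K, W_V)

# Central finite difference on one input entry, using L = sum(G_O * O).
eps = 1e-5
X_p, X_m = X.copy(), X.copy()
X_p[1, 2] += eps
X_m[1, 2] -= eps
numeric = (np.sum(G_O * attention_forward(X_p, W_Q, W_K, W_V)[0])
           - np.sum(G_O * attention_forward(X_m, W_Q, W_K, W_V)[0])) / (2 * eps)
print(np.isclose(numeric, grad_X[1, 2], atol=1e-5))
```

The intermediate gradients `G_Q`, `G_K`, and `G_V` are exactly the $\frac{\partial L}{\partial Q}$, $\frac{\partial L}{\partial K}$, and $\frac{\partial L}{\partial V}$ terms that the input and weight gradients consume.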

#### Feedforward Neural Network (FFN)
- **Forward Pass**:
  - Two linear transformations with a ReLU activation in between (the original ViT uses GELU here; ReLU is used in this tutorial because its derivative is simpler).
  - **Formula**: $FFN(X) = W_2 \cdot \text{ReLU}(W_1 \cdot X + b_1) + b_2$
  - Intermediate values: $Z = W_1 \cdot X + b_1$, $H = \text{ReLU}(Z)$, $O = W_2 \cdot H + b_2$.
- **Backward Pass**:
  - **Gradient w.r.t. hidden layer**: $\frac{\partial L}{\partial H} = W_2^T \cdot \frac{\partial L}{\partial O}$, $\frac{\partial L}{\partial Z} = \frac{\partial L}{\partial H} \odot \text{ReLU}'(Z)$, where $\odot$ is the element-wise product.
  - **Gradient w.r.t. input**: $\frac{\partial L}{\partial X} = W_1^T \cdot \frac{\partial L}{\partial Z}$
  - **Gradient w.r.t. weights**: $\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial Z} \cdot X^T$, $\frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial O} \cdot H^T$
  - **Gradient w.r.t. biases**: $\frac{\partial L}{\partial b_1} = \sum \frac{\partial L}{\partial Z}$, $\frac{\partial L}{\partial b_2} = \sum \frac{\partial L}{\partial O}$, with both sums taken over the tokens.
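
The chain rule through the two linear layers can be sketched as below. In this sketch the gradient reaches the input only through the hidden pre-activation `Z`: the upstream gradient is first pulled back through `W2`, masked by the ReLU derivative, and then pulled back through `W1`. Sizes, names, and the surrogate loss are assumptions for illustration.

```python
import numpy as np

def ffn_forward(X, W1, b1, W2, b2):
    Z = W1 @ X + b1            # pre-activation, (hidden, N)
    H = np.maximum(Z, 0.0)     # ReLU
    O = W2 @ H + b2            # output, (D, N)
    return O, (X, Z, H)

def ffn_backward(G_O, cache, W1, W2):
    X, Z, H = cache
    G_H = W2.T @ G_O           # dL/dH
    G_Z = G_H * (Z > 0)        # dL/dZ: element-wise product with ReLU'(Z)
    return {
        "X": W1.T @ G_Z,
        "W1": G_Z @ X.T,
        "b1": G_Z.sum(axis=1, keepdims=True),   # sum over tokens
        "W2": G_O @ H.T,
        "b2": G_O.sum(axis=1, keepdims=True),   # sum over tokens
    }

rng = np.random.default_rng(0)
D, hidden, N = 4, 8, 3                          # hypothetical sizes
X = rng.normal(size=(D, N))
W1, b1 = rng.normal(size=(hidden, D)), rng.normal(size=(hidden, 1))
W2, b2 = rng.normal(size=(D, hidden)), rng.normal(size=(D, 1))

O, cache = ffn_forward(X, W1, b1, W2, b2)
G_O = rng.normal(size=O.shape)                  # stand-in for dL/dO
grads = ffn_backward(G_O, cache, W1, W2)

# Central finite difference on one input entry, using L = sum(G_O * O).
eps = 1e-6
X_p, X_m = X.copy(), X.copy()
X_p[0, 1] += eps
X_m[0, 1] -= eps
numeric = (np.sum(G_O * ffn_forward(X_p, W1, b1, W2, b2)[0])
           - np.sum(G_O * ffn_forward(X_m, W1, b1, W2, b2)[0])) / (2 * eps)
print(np.isclose(numeric, grads["X"][0, 1], atol=1e-5))
```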

### Classification Head
- **Forward Pass**:
  - **Formula**: $O_{\text{class}} = \text{softmax}(z)$ with logits $z = W_{\text{class}} \cdot X_{\text{class}} + b_{\text{class}}$
  - Where $X_{\text{class}}$ is the class token.
- **Backward Pass**:
  - **Gradient w.r.t. logits**: the gradient must pass through the softmax before reaching the linear layer. For the cross-entropy loss with one-hot target $y$, $\frac{\partial L}{\partial z} = O_{\text{class}} - y$.
  - **Gradient w.r.t. input**: $\frac{\partial L}{\partial X_{\text{class}}} = W_{\text{class}}^T \cdot \frac{\partial L}{\partial z}$
  - **Gradient w.r.t. weights**: $\frac{\partial L}{\partial W_{\text{class}}} = \frac{\partial L}{\partial z} \cdot X_{\text{class}}^T$
  - **Gradient w.r.t. biases**: $\frac{\partial L}{\partial b_{\text{class}}} = \frac{\partial L}{\partial z}$
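
A small sketch of the head with a cross-entropy loss follows. The choice of cross-entropy is an assumption (the text above does not fix the loss in its original form), and with it the gradient with respect to the logits reduces to `probs - y_onehot`. All sizes and names are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def head_forward(x_class, W, b):
    return softmax(W @ x_class + b)

def head_backward(probs, y_onehot, x_class, W):
    g_z = probs - y_onehot     # dL/dz for softmax followed by cross-entropy
    grad_x = W.T @ g_z
    grad_W = np.outer(g_z, x_class)
    grad_b = g_z
    return grad_x, grad_W, grad_b

rng = np.random.default_rng(0)
embed_dim, num_classes, label = 5, 3, 1        # hypothetical sizes and target
x_class = rng.normal(size=embed_dim)
W = rng.normal(size=(num_classes, embed_dim))
b = rng.normal(size=num_classes)
y = np.eye(num_classes)[label]

probs = head_forward(x_class, W, b)
grad_x, grad_W, grad_b = head_backward(probs, y, x_class, W)

def loss(W_):
    # Cross-entropy for the true label
    return -np.log(head_forward(x_class, W_, b)[label])

# Central finite difference on one weight entry.
eps = 1e-6
W_p, W_m = W.copy(), W.copy()
W_p[0, 2] += eps
W_m[0, 2] -= eps
numeric = (loss(W_p) - loss(W_m)) / (2 * eps)
print(np.isclose(numeric, grad_W[0, 2], atol=1e-5))
```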




## Vision Transformer (ViT) Overview

Vision Transformer (ViT), introduced by researchers at Google Brain in 2020, marks a significant shift in the approach to image classification tasks traditionally dominated by convolutional neural networks (CNNs). ViT applies the principles of the Transformer architecture, commonly used in natural language processing, to vision tasks by treating images as sequences of patches, which allows it to capture global dependencies within an image.

### Key Innovations of ViT

Vision Transformer introduced several groundbreaking ideas to the field of computer vision:

1. **Transformer Architecture**: ViT adapts the Transformer model, primarily used in NLP, to process images by treating image patches as equivalent to words in a sentence.
2. **Attention Mechanism**: Utilizes the self-attention mechanism to weigh the importance of different patches of an image relative to each other.
3. **Patch Encoding**: Images are split into fixed-size patches, embedded into tokens, and processed through multiple layers of Transformer blocks.

### Architecture of ViT

ViT's architecture is comparatively simple: it relies on standard, well-understood Transformer blocks rather than specialized convolutional layers. Here’s a simplified overview:

| Layer Type             | Input Dimension              | Output Dimension             | Details                                                                 | Parameters Formula                   | Number of Parameters |
|------------------------|------------------------------|------------------------------|-------------------------------------------------------------------------|--------------------------------------|----------------------|
| **Input Image**        | $224 \times 224 \times 3$    | N/A                          | N/A                                                                     | N/A                                  | 0                    |
| **Patch Embedding**    | $224 \times 224 \times 3$    | $N \times D$                 | Split into $16 \times 16$ patches, flatten each to $P^2 \cdot C$ values, project to $D$ | $(P^2 \cdot C) \times D$             | Varies               |
| **Positional Encoding**| $N \times D$                 | $N \times D$                 | Add to embedding                                                        | $N \times D$                         | Varies               |
| **Transformer Blocks** | $N \times D$                 | $N \times D$                 | Multi-head attention, MLP                                               | Varies per block                     | Varies               |
| **Classifier Head**    | $D$                          | Number of classes            | Linear                                                                  | $D \times \text{Number of classes}$  | Varies               |

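- **$N$** is the number of patches ($N = (224 / 16)^2 = 196$ for the image and patch sizes above).
- **$P$** is the side length of each square patch (e.g., $P = 16$ for $16 \times 16$-pixel patches).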
- **$C$** is the number of channels in the image (usually 3 for RGB).
- **$D$** is the dimensionality of the patch embeddings.
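
To make the table's formulas concrete, the sketch below evaluates them for one configuration. The image size and patch size come from the table; `embed_dim=768` and `num_classes=1000` are assumed values chosen only for illustration. The counts follow the table's weight-only formulas; bias terms, where used, would add `embed_dim` and `num_classes` parameters respectively.

```python
def vit_table_counts(image_size=224, patch_size=16, channels=3,
                     embed_dim=768, num_classes=1000):
    """Evaluate the table's formulas; embed_dim and num_classes are assumed."""
    num_patches = (image_size // patch_size) ** 2      # N
    patch_dim = patch_size ** 2 * channels             # P^2 * C
    return {
        "num_patches": num_patches,
        "patch_dim": patch_dim,
        "patch_embedding_params": patch_dim * embed_dim,   # (P^2 * C) x D
        "positional_params": num_patches * embed_dim,      # N x D
        "classifier_params": embed_dim * num_classes,      # D x classes
    }

counts = vit_table_counts()
for name, value in counts.items():
    print(f"{name}: {value}")
# num_patches: 196
# patch_dim: 768
# patch_embedding_params: 589824
# positional_params: 150528
# classifier_params: 768000
```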

### Advantages of Vision Transformer

- **Scalability**: ViT scales effectively with model and dataset size, often matching or surpassing strong CNNs when pretrained on sufficiently large datasets.
- **Efficient Transfer Learning**: Demonstrates strong performance on smaller datasets when pretrained on larger datasets.
- **Global Context**: Capable of capturing relationships between distant parts of the image, which is a challenge for CNNs that rely on local receptive fields.

### Disadvantages of Vision Transformer

- **Data Hungry**: Requires a large amount of data to train effectively from scratch, making it less practical for tasks with limited data.
- **Computational Requirements**: Self-attention compares every patch with every other patch, so its cost grows quadratically with the number of patches $N$.
- **Lack of Inductive Biases**: Unlike CNNs, ViT lacks built-in inductive biases such as locality and translation equivariance, which can hinder its performance on smaller or less diverse datasets.
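
The computational cost point can be illustrated with a quick count: each attention head builds an $N \times N$ weight matrix, so halving the patch size quadruples $N$ and multiplies the matrix size by 16. The helper below is a hypothetical illustration for a $224 \times 224$ image, not a full cost model.

```python
def attention_matrix_entries(image_size, patch_size):
    """Return (N, N^2): patch count and entries in one N x N attention matrix."""
    n = (image_size // patch_size) ** 2
    return n, n * n

for p in (32, 16, 8):
    n, entries = attention_matrix_entries(224, p)
    print(f"P={p:2d}  N={n:4d}  N^2={entries}")
# P=32  N=  49  N^2=2401
# P=16  N= 196  N^2=38416
# P= 8  N= 784  N^2=614656
```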

### Key Properties of ViT

- **Patch-based Processing**: Treats image patches as the fundamental component, akin to tokens in NLP, enabling it to leverage the Transformer's capabilities.
- **Self-Attention Across Patches**: Allows the model to focus on the most informative parts of the image regardless of their spatial location.
- **Flexibility**: Can be easily adapted and extended for various vision tasks beyond classification, such as object detection and segmentation.

Vision Transformer continues to inspire innovations in the field of machine learning, challenging conventional approaches and encouraging further exploration of Transformer-based models in vision tasks.
