# 📄 Paper Summary: Vision Transformer (ViT)

**Title**: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale  
**Authors**: Dosovitskiy et al. (Google Research, Brain Team)  
**Published in**: ICLR 2021  
**Link**: https://arxiv.org/abs/2010.11929  

---

## ✅ Day 1 – Abstract & Introduction

### 📌 Background & Motivation
- **CNN dominance in CV**: For over a decade, convolutional neural networks (CNNs) have been the backbone of computer vision tasks (e.g., image classification, detection, segmentation).  
- **Limitations of CNNs**:  
  - Strong **inductive biases** (locality, translation equivariance) limit flexibility.  
  - In the paper's experiments, ResNet baselines gain less than ViT from very large pre-training sets, so their advantage disappears as the data grows.  
- **NLP breakthrough with Transformers**: In contrast, NLP models based on pure **Transformer architectures** scale extremely well, achieving state-of-the-art results without handcrafted inductive biases.  
- **Core Question**: Can the same Transformer architecture that revolutionized NLP be directly applied to vision tasks, without convolutions?  

---

### 📌 Core Idea
- Represent an image as a **sequence of patches** (like words in NLP).  
- Flatten each patch (e.g., 16×16 pixels), project it into an embedding space, and feed it into a standard Transformer encoder.  
- Introduce a special **[CLS] token** for classification tasks.  
- Add **positional embeddings** to retain spatial information.  

---

### 📌 Contributions
1. Proposed the **Vision Transformer (ViT)**: a pure Transformer applied directly to sequences of image patches for large-scale image recognition, with as few vision-specific modifications as possible.  
2. Demonstrated that ViT, when trained on **large datasets** (JFT-300M), can **match or outperform CNNs** on ImageNet and other benchmarks.  
3. Showed that the favorable **scaling behavior** of Transformers in NLP carries over to vision: performance keeps improving with model and data size, without saturating in the range tested.  

---


## ✅ Day 2 – Method & Architecture

### 1. Image to Sequence (Patch Embedding)
- Input image size: \( H \times W \times C \)  
- Split the image into **non-overlapping patches** of size \( P \times P \).  
  → Total number of patches: \( N = \frac{HW}{P^2} \)  
- Flatten each patch into a vector \( x_p^i \in \mathbb{R}^{P^2 \cdot C} \).  
- Apply a learnable linear projection to map it into a \( D \)-dimensional embedding, treating \( x_p^i \) as a row vector:  
  \[
  x_p^i \mapsto x_p^i E, \quad E \in \mathbb{R}^{(P^2 \cdot C) \times D}
  \]
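
The patch-embedding step can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `patchify`, the random image, and the random matrix standing in for the learned projection \( E \) are all assumptions made for the example.

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into N flattened patches of length P*P*C."""
    h, w, c = image.shape
    p = patch_size
    if h % p or w % p:
        raise ValueError("H and W must be divisible by the patch size")
    # (H/P, P, W/P, P, C) -> (H/P, W/P, P, P, C) -> (N, P*P*C)
    grid = image.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(-1, p * p * c)

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))        # H x W x C (random stand-in)
P, D = 16, 768                                    # patch size, embedding width

patches = patchify(image, P)                      # (N, P^2 * C) = (196, 768)
E = rng.standard_normal((P * P * 3, D)) * 0.02    # stand-in for the learned projection
patch_embeddings = patches @ E                    # (N, D) = (196, 768)
```

With \( H = W = 224 \) and \( P = 16 \) the image yields \( N = 196 \) patches. Here \( P^2 \cdot C \) and \( D \) both happen to equal 768, which is a coincidence of the chosen sizes.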

---

### 2. Positional Encoding
- Transformers lack inherent order awareness → need **positional information**.  
- Add a learnable 1D positional embedding \( E_{pos}^i \in \mathbb{R}^{D} \) to each patch embedding:  
  \[
  z_0^i = x_p^i E + E_{pos}^i
  \]
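
As a rough sketch of how the encoder input is assembled, the snippet below prepends the class token (covered in the Classification Head section) and adds one positional embedding per token. All arrays are random or zero stand-ins for parameters that would be learned; the variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 196, 768                                    # number of patches, embedding width

patch_embeddings = rng.standard_normal((N, D))     # stand-in for the projected patches
cls_token = np.zeros((1, D))                       # learnable in practice
E_pos = rng.standard_normal((N + 1, D)) * 0.02     # learnable, one row per token

# Prepend the class token, then add position information element-wise.
tokens = np.concatenate([cls_token, patch_embeddings], axis=0)   # (N + 1, D)
z0 = tokens + E_pos                                               # encoder input
```

Because the positional embeddings are simply added, the sequence keeps its shape \( (N + 1) \times D \) on the way into the encoder.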

---

### 3. Transformer Encoder
- Standard Transformer encoder stack consisting of:  
  - **Multi-Head Self-Attention (MSA)**  
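  - **Feed-Forward Network (MLP block)**: two dense layers with a GELU non-linearity  
  - Layer Normalization applied **before** each block (pre-norm), with a residual connection added **after** each block  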
- This allows modeling of **global relationships** among all patches.
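
One encoder layer can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: biases and the learnable LayerNorm scale/shift are omitted, GELU uses the tanh approximation, the weights are random, and the tiny sizes (5 tokens, width 8, 2 heads) are chosen only to keep the example fast.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)         # learnable scale/shift omitted

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def msa(x, w_qkv, w_out, num_heads):
    """Multi-head self-attention over a (tokens, width) sequence."""
    n, d = x.shape
    head_dim = d // num_heads
    q, k, v = np.split(x @ w_qkv, 3, axis=-1)      # each (n, d)
    # (n, d) -> (heads, n, head_dim)
    q, k, v = (t.reshape(n, num_heads, head_dim).transpose(1, 0, 2) for t in (q, k, v))
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(head_dim))   # (heads, n, n)
    out = (attn @ v).transpose(1, 0, 2).reshape(n, d)               # merge heads
    return out @ w_out

def encoder_block(z, params, num_heads):
    # Pre-norm: LayerNorm before each sub-block, residual connection after it.
    z = z + msa(layer_norm(z), params["w_qkv"], params["w_out"], num_heads)
    z = z + gelu(layer_norm(z) @ params["w1"]) @ params["w2"]
    return z

rng = np.random.default_rng(0)
n_tokens, d, d_mlp = 5, 8, 32
params = {
    "w_qkv": rng.standard_normal((d, 3 * d)) * 0.5,
    "w_out": rng.standard_normal((d, d)) * 0.5,
    "w1": rng.standard_normal((d, d_mlp)) * 0.5,
    "w2": rng.standard_normal((d_mlp, d)) * 0.5,
}
z = rng.standard_normal((n_tokens, d))
out = encoder_block(z, params, num_heads=2)
print(out.shape)  # (5, 8)
```

The attention matrix has one row and one column per token, so every output token can draw on every input patch within a single layer.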

---

### 4. Classification Head
- A special **[CLS] token** is prepended to the patch sequence.  
- After the Transformer encoder, the final hidden state of [CLS] is taken.  
- A classification head followed by softmax is applied to predict the class label: an MLP with one hidden layer during pre-training, a single linear layer during fine-tuning.
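
The read-out step might look like the sketch below, which assumes the single-linear-layer variant of the head. The encoder output and head weights are random stand-ins, and the small sizes are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)         # learnable scale/shift omitted

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
num_tokens, D, num_classes = 197, 64, 10           # small hypothetical sizes

z_L = rng.standard_normal((num_tokens, D))         # stand-in for the encoder output
W_head = rng.standard_normal((D, num_classes)) * 0.02
b_head = np.zeros(num_classes)

cls_state = layer_norm(z_L[0])                     # final hidden state of [CLS]
logits = cls_state @ W_head + b_head               # linear head
probs = softmax(logits)                            # class probabilities
predicted_class = int(np.argmax(probs))
```

Only the first row of the encoder output feeds the head; the patch tokens influence the prediction through attention inside the encoder.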

---

## ✅ Day 3 – Experiments & Results


### 1. Pre-training
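- Trained only on mid-sized datasets such as ImageNet, without strong regularization, ViT lands a few percentage points below ResNets of comparable size.  
- The best results come from models **pre-trained on JFT-300M** (about 300M images, 18k classes); the paper also reports pre-training on ImageNet-21k.  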
- Optimization setup:  
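  - Adam with a high weight decay  
  - Learning rate: linear warm-up followed by decay (cosine or linear, depending on the setup)  
  - Regularization: weight decay and dropout, plus label smoothing when pre-training on the smaller datasets  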
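
A learning-rate schedule with linear warm-up and cosine decay can be written as a small pure function. This is a generic sketch of that schedule, not the paper's training code; the function name and the step counts and base rate in the usage lines are hypothetical.

```python
import math

def lr_at_step(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warm-up to base_lr, then cosine decay to zero."""
    if step < warmup_steps:
        # Ramp linearly from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))

# Hypothetical settings, for illustration only.
schedule = [lr_at_step(s, base_lr=1e-3, warmup_steps=10, total_steps=100) for s in range(101)]
```

The warm-up phase avoids large early updates while Adam's moment estimates are still noisy, and the decay phase lets the model settle late in training.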

---

### 2. Fine-tuning (Transfer Learning)
- Pre-trained ViT is fine-tuned on downstream datasets (e.g., **ImageNet, CIFAR-100, VTAB**).  
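- During fine-tuning, the input resolution is often **increased** for better detail, while the patch size stays the same, so the sequence gets longer.  
  - Example: pre-training at 224×224, fine-tuning at 384×384.  
  - The pre-trained positional embeddings are 2D-interpolated to fit the larger patch grid.  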
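
Going from 224×224 to 384×384 with 16×16 patches changes the patch grid from 14×14 (196 tokens) to 24×24 (576 tokens), so the pre-trained positional embeddings no longer match the sequence; the paper handles this with 2D interpolation. The sketch below shows one way such an interpolation could be done. The bilinear mode, the corner-aligned sampling, and the name `resize_pos_embed` are assumptions, and the class-token embedding is assumed to be kept aside and re-attached unchanged.

```python
import numpy as np

def resize_pos_embed(pos_embed: np.ndarray, old_grid: int, new_grid: int) -> np.ndarray:
    """Bilinearly resize (old_grid * old_grid, D) patch position embeddings."""
    d = pos_embed.shape[-1]
    grid = pos_embed.reshape(old_grid, old_grid, d)
    # Sample locations expressed in the old grid's coordinates (corners aligned).
    coords = np.linspace(0.0, old_grid - 1, new_grid)
    lo = np.floor(coords).astype(int)
    hi = np.minimum(lo + 1, old_grid - 1)
    frac = coords - lo
    # Interpolate along rows, then along columns.
    rows = grid[lo] * (1 - frac)[:, None, None] + grid[hi] * frac[:, None, None]
    out = rows[:, lo] * (1 - frac)[None, :, None] + rows[:, hi] * frac[None, :, None]
    return out.reshape(new_grid * new_grid, d)

rng = np.random.default_rng(0)
old_grid, new_grid, D = 224 // 16, 384 // 16, 8    # 14, 24, small hypothetical width
pos_pretrained = rng.standard_normal((old_grid * old_grid, D))
pos_finetune = resize_pos_embed(pos_pretrained, old_grid, new_grid)
print(pos_finetune.shape)  # (576, 8)
```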

---

### 3. Baseline Comparison
- Compared against strong CNN baselines: **ResNet**-based Big Transfer (BiT) models and Noisy Student (EfficientNet-L2).  
- Key findings:  
  - On **large datasets**, ViT **outperforms ResNets**.  
  - On **smaller datasets**, ResNets still perform better due to stronger inductive biases.  
- Conclusion: ViT requires **large-scale data** to unleash its potential.  

---

### 4. Scaling Experiments
- ViT was tested at different scales: **Base, Large, Huge**.  
- Larger models and more data → consistent performance improvements.  
- Suggests that the favorable **scaling behavior** observed in NLP Transformers also holds for vision tasks; performance did not saturate within the range tested.  
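
To make the three scales concrete, the sketch below records the layer count, hidden size, MLP size, and head count given for each variant in the paper's model table, and estimates the encoder weight count from them. The estimate is a back-of-the-envelope assumption: it counts only the attention and MLP weight matrices and ignores biases, LayerNorm, embeddings, and the head, so it lands slightly below the paper's reported totals of 86M, 307M, and 632M.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ViTConfig:
    name: str
    layers: int
    hidden: int
    mlp: int
    heads: int

VARIANTS = [
    ViTConfig("ViT-Base", layers=12, hidden=768, mlp=3072, heads=12),
    ViTConfig("ViT-Large", layers=24, hidden=1024, mlp=4096, heads=16),
    ViTConfig("ViT-Huge", layers=32, hidden=1280, mlp=5120, heads=16),
]

def approx_encoder_params(cfg: ViTConfig) -> int:
    """Rough count of encoder weights (matrices only)."""
    attention = 4 * cfg.hidden * cfg.hidden        # Q, K, V and output projections
    mlp = 2 * cfg.hidden * cfg.mlp                 # two dense layers
    return cfg.layers * (attention + mlp)

for cfg in VARIANTS:
    print(f"{cfg.name}: ~{approx_encoder_params(cfg) / 1e6:.1f}M encoder weights")
# ViT-Base: ~84.9M encoder weights
# ViT-Large: ~302.0M encoder weights
# ViT-Huge: ~629.1M encoder weights
```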

---

### 5. Ablation Studies
- **Patch Size**:  
  - Smaller patches (16×16) yield higher accuracy.  
  - Larger patches (32×32) reduce computation but hurt performance.  
- **Depth & Width**:  
  - Increasing the number of layers and hidden dimensions improves results.  
- **Regularization**:  
  - Techniques like weight decay, dropout, and label smoothing improve results when pre-training data is limited.  
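
The patch-size trade-off follows from simple arithmetic: halving the patch side quadruples the number of tokens, and the self-attention matrix grows with the square of that. A small worked example, assuming a 224×224 input (the helper name is hypothetical):

```python
def sequence_length(image_size: int, patch_size: int) -> int:
    """Number of patches for a square image and square patches."""
    per_side = image_size // patch_size
    return per_side * per_side

for patch_size in (32, 16):
    n = sequence_length(224, patch_size)
    print(f"P={patch_size}: N={n}, attention entries per head={n * n}")
# P=32: N=49, attention entries per head=2401
# P=16: N=196, attention entries per head=38416
```

Moving from 32×32 to 16×16 patches therefore gives the model 4 times as many tokens to work with, at 16 times the attention-matrix size.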

---

## ✅ Key Takeaways
1. ViT’s performance strongly depends on **large-scale pre-training**.  
2. With sufficient data and scaling, ViT **surpasses CNNs** on multiple benchmarks.  
3. The favorable **scaling behavior** of NLP Transformers extends to vision tasks.  
4. Patch size, model depth/width, and regularization are crucial design factors.  