# 📄 Paper Summary: Vision Transformer (ViT)

**Title**: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale  
**Authors**: Dosovitskiy et al. (Google Research, Brain Team)  
**Published in**: ICLR 2021  
**Link**: https://arxiv.org/abs/2010.11929  

---

## ✅ Day 1 – Abstract & Introduction

### 📌 Background & Motivation
- **CNN dominance in CV**: For over a decade, convolutional neural networks (CNNs) have been the backbone of computer vision tasks (e.g., image classification, detection, segmentation).  
- **Limitations of CNNs**:  
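  - Strong **inductive biases** (locality, translation equivariance) are hard-coded into the architecture; they help when data is scarce but cannot be relaxed when data is plentiful.  
  - With very large pre-training datasets these built-in priors matter less, and the paper reports that ViT overtakes comparable ResNets in that regime.  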
- **NLP breakthrough with Transformers**: In contrast, NLP models based on pure **Transformer architectures** scale extremely well, achieving state-of-the-art results without handcrafted inductive biases.  
- **Core Question**: Can the same Transformer architecture that revolutionized NLP be directly applied to vision tasks, without convolutions?  

---

### 📌 Core Idea
- Represent an image as a **sequence of patches** (like words in NLP).  
- Flatten each patch (e.g., 16×16 pixels), project it into an embedding space, and feed it into a standard Transformer encoder.  
- Introduce a special **[CLS] token** for classification tasks.  
- Add **positional embeddings** to retain spatial information.  
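
To make the patch-as-word analogy concrete, the sketch below counts the tokens produced for one image. The 224×224 RGB resolution and the helper name `vit_sequence_shape` are illustrative assumptions, not values taken from these notes.

```python
def vit_sequence_shape(height, width, channels, patch_size, use_cls_token=True):
    """Return (sequence_length, flattened_patch_dim) for a patchified image.

    Hypothetical helper for illustration only.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by the patch size")
    num_patches = (height // patch_size) * (width // patch_size)
    patch_dim = patch_size * patch_size * channels
    # The [CLS] token adds one extra position to the sequence.
    seq_len = num_patches + (1 if use_cls_token else 0)
    return seq_len, patch_dim


seq_len, patch_dim = vit_sequence_shape(224, 224, 3, 16)
print(seq_len, patch_dim)  # 197 768
```

With these assumed sizes the "sentence" has 196 patch tokens plus one [CLS] token, and each "word" starts as a 768-dimensional vector of raw pixel values.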

---

### 📌 Contributions
1. Proposed the **Vision Transformer (ViT)**: a pure Transformer applied directly to sequences of image patches for large-scale image recognition, with as few image-specific modifications as possible.  
2. Demonstrated that ViT, when pre-trained on **large datasets** (ImageNet-21k, JFT-300M) and then fine-tuned, can **match or outperform CNNs** on ImageNet and other benchmarks.  
3. Showed that the favorable **scaling behavior** of Transformers in NLP carries over to vision: performance keeps improving with model and data size, without clear saturation in the range tested.  

---

# 📄 Vision Transformer (ViT) – Day 2

## 🧠 Method / Architecture

### 1. Image to Sequence (Patch Embedding)
- Input image size: \( H \times W \times C \)  
- Split the image into **non-overlapping patches** of size \( P \times P \).  
  → Total number of patches: \( N = \frac{HW}{P^2} \)  
- Flatten each patch into a vector \( x_p^i \in \mathbb{R}^{P^2 \cdot C} \).  
- Apply a learnable linear projection \( E \) to map each flattened patch to a \( D \)-dimensional patch embedding:  
  \[
  x_p^i E \in \mathbb{R}^{D}, \quad E \in \mathbb{R}^{(P^2 \cdot C) \times D}
  \]
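
The patch-embedding steps above can be sketched in NumPy as follows. This is a minimal illustration under assumed toy sizes (a 32×32×3 image, \( P = 16 \), \( D = 8 \)); the function name `patchify` is hypothetical, and the random matrix `E` only stands in for the learned projection. Patches are treated as row vectors, so `E` has shape \( (P^2 \cdot C, D) \).

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into N flattened patches of length P*P*C."""
    h, w, c = image.shape
    p = patch_size
    assert h % p == 0 and w % p == 0, "H and W must be divisible by P"
    # (H/P, P, W/P, P, C) -> (H/P, W/P, P, P, C) -> (N, P*P*C)
    patches = image.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, p * p * c)


rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))         # toy image: H = W = 32, C = 3
P, D = 16, 8                                     # toy patch size and embedding width

x_p = patchify(image, P)                         # (N, P*P*C) = (4, 768)
E = rng.standard_normal((P * P * 3, D)) * 0.02   # random stand-in for the learned projection
patch_embeddings = x_p @ E                       # (N, D) = (4, 8)
print(x_p.shape, patch_embeddings.shape)
```

Patches are ordered row by row (raster order), so patch index 1 is the top-right block of this toy image. A strided convolution with kernel size and stride both equal to \( P \) computes the same projection, which is a common implementation choice.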

---

### 2. Positional Encoding
- Self-attention treats its input as an unordered set of tokens → the model needs explicit **positional information**.  
- Add a learnable 1D positional embedding \( E_{pos}^i \in \mathbb{R}^{D} \) to each patch embedding to form the encoder input:  
  \[
  z_0^i = x_p^i E + E_{pos}^i
  \]
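
As a rough illustration of why the positional term matters, the sketch below adds a positional table to toy patch embeddings and then shuffles the patches. All sizes and variable names are assumptions, and random values stand in for parameters that would be learned.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 4, 8                                      # toy number of patches and embedding width

patch_embeddings = rng.standard_normal((N, D))   # stand-in for the projected patches x_p^i E
E_pos = rng.standard_normal((N, D)) * 0.02       # one row per position (learned in practice)

z0 = patch_embeddings + E_pos                    # encoder input, shape (N, D)

# Swap the first two patches: the same content now sits at different positions,
# so it receives different positional rows and the encoder input changes.
perm = [1, 0, 2, 3]
z0_shuffled = patch_embeddings[perm] + E_pos
print(np.allclose(z0_shuffled, z0[perm]))        # False
```

In the paper the [CLS] token is prepended before this addition, so the positional table has \( N + 1 \) rows rather than \( N \).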

---

### 3. Transformer Encoder
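- Standard Transformer encoder stack; each layer consists of:  
  - **Multi-Head Self-Attention (MSA)**  
  - **Feed-Forward Network (MLP block)**: two linear layers with a GELU non-linearity  
  - Layer Normalization applied before each sub-block, with a residual connection after each  
- Because every patch attends to every other patch, even the first layer can model **global relationships** among all patches.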
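
One encoder layer can be sketched in NumPy as below. This is a simplified illustration, not a reference implementation: biases and the learnable LayerNorm scale/shift are omitted, GELU uses the tanh approximation, and all sizes, weights, and names (`encoder_block`, `params`) are assumptions.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    # Normalize each token vector (learnable scale/shift omitted for brevity).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def msa(x, w_qkv, w_out, num_heads):
    """Multi-head self-attention over a (tokens, D) sequence."""
    n, d = x.shape
    head_dim = d // num_heads

    def split_heads(t):
        return t.reshape(n, num_heads, head_dim).transpose(1, 0, 2)

    q, k, v = (split_heads(t) for t in np.split(x @ w_qkv, 3, axis=-1))
    # Every token attends to every other token: (heads, tokens, tokens).
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(head_dim))
    merged = (attn @ v).transpose(1, 0, 2).reshape(n, d)
    return merged @ w_out, attn


def encoder_block(x, params, num_heads):
    """Pre-norm block: x + MSA(LN(x)), then x + MLP(LN(x))."""
    attn_out, attn = msa(layer_norm(x), params["w_qkv"], params["w_out"], num_heads)
    x = x + attn_out
    mlp_out = gelu(layer_norm(x) @ params["w1"]) @ params["w2"]
    return x + mlp_out, attn


rng = np.random.default_rng(0)
tokens, D, heads, mlp_dim = 5, 8, 2, 32          # toy sizes: 4 patches + [CLS]
params = {
    "w_qkv": rng.standard_normal((D, 3 * D)) * 0.1,
    "w_out": rng.standard_normal((D, D)) * 0.1,
    "w1": rng.standard_normal((D, mlp_dim)) * 0.1,
    "w2": rng.standard_normal((mlp_dim, D)) * 0.1,
}
z = rng.standard_normal((tokens, D))
out, attn = encoder_block(z, params, heads)
print(out.shape, attn.shape)                     # (5, 8) (2, 5, 5)
```

The attention matrix has one row per token and one column per token for each head, which is where the global receptive field comes from; a full model would stack several such blocks.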

---

### 4. Classification Head
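- A special learnable **[CLS] token** is prepended to the patch sequence before it enters the encoder.  
- After the Transformer encoder, the final hidden state of [CLS] is taken as the image representation.  
- A classification head maps this representation to class logits, and a softmax turns them into probabilities. The paper uses an MLP with one hidden layer during pre-training and a single linear layer during fine-tuning.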
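
The classification path can be sketched as follows. This is a minimal illustration assuming a single linear head (the fine-tuning configuration); the encoder is replaced by an identity stand-in, and all sizes, weights, and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, num_classes = 4, 8, 3                      # toy sizes

patch_tokens = rng.standard_normal((N, D))       # stand-in for embedded patches
cls_token = rng.standard_normal((1, D)) * 0.02   # learnable in practice, random here

# Prepend [CLS] so it occupies position 0 of the sequence.
tokens = np.concatenate([cls_token, patch_tokens], axis=0)   # (N + 1, D)

# Identity stand-in: a real model would run the Transformer encoder stack here.
encoder_output = tokens

cls_state = encoder_output[0]                    # final hidden state of [CLS], shape (D,)
W_head = rng.standard_normal((D, num_classes)) * 0.02
b_head = np.zeros(num_classes)

logits = cls_state @ W_head + b_head
probs = np.exp(logits - logits.max())
probs /= probs.sum()                             # softmax over classes
predicted_class = int(np.argmax(probs))
print(tokens.shape, probs.shape)
```

Only position 0 feeds the head; the patch positions influence the prediction indirectly, through the attention layers that update the [CLS] state.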

---