# Vision Transformer (ViT) — Complete Notes

---

# 1. Overview
A Vision Transformer converts an image into a sequence of tokens (patch embeddings) and feeds them into a standard Transformer encoder. It relies on **three key components**:

### ViT Input = 3 Components
1. **Patch Embeddings**
2. **Class Token (`[CLS]`)**
3. **Positional Embeddings**

---

# 2. Patch Embeddings
### Goal
Convert an image into a sequence of tokens.

### Steps
- Split image into patches of size **P × P**
- Flatten each patch
- Linearly project into embedding of size **D**

### Output Shape
```
(batch, num_patches, D)
```
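
The three steps above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: the channels-last layout `(batch, H, W, C)`, the helper name `patch_embed`, and the random projection weights are assumptions made for the example.

```python
import numpy as np

def patch_embed(images, patch_size, weight, bias):
    """Split images into non-overlapping patches and project them linearly.

    images: (batch, H, W, C); weight: (P*P*C, D); bias: (D,)
    """
    b, h, w, c = images.shape
    p = patch_size
    assert h % p == 0 and w % p == 0, "image size must be divisible by patch size"
    gh, gw = h // p, w // p
    # (b, gh, p, gw, p, c) -> (b, gh, gw, p, p, c) -> (b, gh*gw, p*p*c)
    patches = images.reshape(b, gh, p, gw, p, c).transpose(0, 1, 3, 2, 4, 5)
    patches = patches.reshape(b, gh * gw, p * p * c)
    # Linear projection of every flattened patch to dimension D.
    return patches @ weight + bias

rng = np.random.default_rng(0)
images = rng.normal(size=(2, 32, 32, 3))                 # toy batch of 32x32 RGB images
P, D = 8, 64
W_proj = rng.normal(scale=0.02, size=(P * P * 3, D))     # stand-in for learned weights
b_proj = np.zeros(D)
tokens = patch_embed(images, P, W_proj, b_proj)
print(tokens.shape)  # (2, 16, 64)
```

Many implementations express the same operation as a convolution with kernel size and stride both equal to **P**; the reshape version above just makes the "split, flatten, project" steps explicit.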

### Analogy
Patch embedding ≈ Word embedding in NLP.

---

# 3. Class Token (`[CLS]`)
### What it is
A **learnable vector** of dimension **D**, prepended to the patch sequence.

### Why it exists
Transformers produce one output per token, but classification needs **one global vector**.  
The `[CLS]` token learns to gather information from all patches through attention.

### After the Transformer
Take the encoder output at sequence position 0:
```
CLS_out = output[:, 0, :]
```
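
A small sketch of how the token gets prepended before the encoder, assuming NumPy arrays and a zero-initialized `cls_token` (in a real model it is a trained parameter; the variable names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
batch, num_patches, D = 2, 16, 64
patch_embeddings = rng.normal(size=(batch, num_patches, D))

# One vector shared by every image in the batch (a learned parameter in practice).
cls_token = np.zeros((1, 1, D))

# Copy it across the batch and prepend it along the sequence axis.
cls_batch = np.broadcast_to(cls_token, (batch, 1, D))
tokens = np.concatenate([cls_batch, patch_embeddings], axis=1)
print(tokens.shape)  # (2, 17, 64)
```

Because `[CLS]` sits at index 0, the same index is used to read it back after the encoder.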

---

# 4. Positional Embeddings
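Self-attention is permutation-equivariant: on its own it cannot tell which part of the image a patch came from. ViT adds **learned** positional embeddings, one vector per sequence position.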

### Shape
```
(num_patches + 1, D)
```

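### Why Learned Instead of Sinusoidal
- Classic sinusoidal encodings index a 1D sequence, while patches come from a 2D grid
- A learned table makes no layout assumption and can pick up the grid structure from data
- The gain is modest: in the original ViT ablations, learned 1D, 2D-aware, and relative variants score about the same, while removing positional embeddings entirely clearly hurts accuracy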

### Added to Tokens
```
tokens = concat([CLS], patch_embeddings) + positional_embeddings
```
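
A sketch of the addition, assuming the `[CLS]` token has already been prepended so the sequence length matches the `(num_patches + 1, D)` table. The random `pos_embed` stands in for trained values.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, num_patches, D = 2, 16, 64
tokens = rng.normal(size=(batch, num_patches + 1, D))  # [CLS] already prepended

# One row per sequence position, including position 0 for [CLS].
pos_embed = rng.normal(scale=0.02, size=(num_patches + 1, D))

# Broadcasting adds the same (num_patches + 1, D) table to every image in the batch.
x = tokens + pos_embed
print(x.shape)  # (2, 17, 64)
```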

---

# 5. Full Transformer Input
```
concat([CLS], patch_embeddings) + positional_embeddings
```

Shape:
```
(batch, num_patches + 1, D)
```
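
As a quick worked example of the sequence length, assuming a square image whose side is divisible by the patch size (the function name is illustrative):

```python
def vit_sequence_length(image_size: int, patch_size: int) -> int:
    """Number of tokens the encoder sees: one per patch plus one [CLS]."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    grid = image_size // patch_size      # patches per side
    return grid * grid + 1               # num_patches + 1

print(vit_sequence_length(224, 16))  # 197
```

A 224 × 224 image with 16 × 16 patches gives a 14 × 14 grid, so 196 patch tokens plus `[CLS]`.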

---

# 6. Transformer Encoder in ViT
Each layer uses the same building blocks as an NLP Transformer encoder (ViT applies LayerNorm before each sub-block, i.e. pre-norm):
- Multi-Head Self-Attention
- MLP
- LayerNorm
- Residual connections

No decoder is used.
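
One encoder layer can be sketched in NumPy as below. Assumptions for the sketch: pre-norm ordering, LayerNorm without learnable scale and shift, no biases or dropout, the tanh approximation of GELU, and randomly initialized weights stored in a hypothetical `params` dict.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the feature axis; learnable scale/shift omitted for brevity.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def self_attention(x, w_qkv, w_out, num_heads):
    b, n, d = x.shape
    head_dim = d // num_heads
    qkv = (x @ w_qkv).reshape(b, n, 3, num_heads, head_dim)
    q, k, v = np.moveaxis(qkv, 2, 0)                          # each (b, n, heads, head_dim)
    q, k, v = (t.transpose(0, 2, 1, 3) for t in (q, k, v))    # (b, heads, n, head_dim)
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(head_dim)  # (b, heads, n, n)
    attn = softmax(scores, axis=-1)
    out = (attn @ v).transpose(0, 2, 1, 3).reshape(b, n, d)   # merge heads
    return out @ w_out

def encoder_block(x, params, num_heads):
    # Pre-norm residual layout: x + Attn(LN(x)), then x + MLP(LN(x)).
    x = x + self_attention(layer_norm(x), params["w_qkv"], params["w_out"], num_heads)
    h = gelu(layer_norm(x) @ params["w1"])
    return x + h @ params["w2"]

rng = np.random.default_rng(0)
D, heads = 64, 4
params = {
    "w_qkv": rng.normal(scale=0.02, size=(D, 3 * D)),
    "w_out": rng.normal(scale=0.02, size=(D, D)),
    "w1": rng.normal(scale=0.02, size=(D, 4 * D)),
    "w2": rng.normal(scale=0.02, size=(4 * D, D)),
}
x = rng.normal(size=(2, 17, D))       # (batch, num_patches + 1, D)
y = encoder_block(x, params, heads)
print(y.shape)  # (2, 17, 64)
```

The block maps `(batch, num_patches + 1, D)` to the same shape, which is why layers can be stacked directly.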

---

# 7. Classification Head
Only the final CLS vector is used:
```
cls_vec = output[:, 0, :]
logits = MLP(cls_vec)
```
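
A runnable version of the snippet above, simplified for illustration: the head is a single linear layer rather than a multi-layer MLP, `output` is random data standing in for the encoder output, and `num_classes = 10` is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, D, num_classes = 2, 17, 64, 10
output = rng.normal(size=(batch, seq_len, D))   # stand-in for encoder output

# Stand-ins for the head's learned weights.
w_head = rng.normal(scale=0.02, size=(D, num_classes))
b_head = np.zeros(num_classes)

cls_vec = output[:, 0, :]             # (batch, D): the [CLS] position only
logits = cls_vec @ w_head + b_head    # (batch, num_classes)
pred = logits.argmax(axis=-1)         # predicted class index per image
print(logits.shape, pred.shape)  # (2, 10) (2,)
```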

---

# 8. ViT vs NLP Transformer

| Component | NLP Transformer | Vision Transformer |
|----------|------------------|--------------------|
| Input Tokens | Words | Image patches |
| Token Embedding | Lookup | Linear projection |
| Positional Encoding | Sinusoidal or learned | Learned (in the original ViT) |
| Structure | 1D | 2D patches flattened |
| Special Token | CLS | CLS |
| Input Size Changes | Longer sequences are easy with sinusoidal encodings | New resolution needs positional-embedding interpolation |
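
The interpolation mentioned in the last row can be sketched as follows. This is one possible approach under stated assumptions: a square patch grid, bilinear interpolation with aligned corners written by hand in NumPy (libraries often use bicubic instead), and a hypothetical helper name `resize_pos_embed`.

```python
import numpy as np

def resize_pos_embed(pos_embed, old_grid, new_grid):
    """Bilinearly resize patch position embeddings; the [CLS] row is kept as-is.

    pos_embed: (old_grid*old_grid + 1, D) -> (new_grid*new_grid + 1, D)
    """
    cls_row, grid = pos_embed[:1], pos_embed[1:]
    d = grid.shape[-1]
    grid = grid.reshape(old_grid, old_grid, d)    # back to the 2D patch layout

    # Sample positions expressed in the old grid's coordinates (corners aligned).
    coords = np.linspace(0, old_grid - 1, new_grid)
    lo = np.floor(coords).astype(int)
    hi = np.minimum(lo + 1, old_grid - 1)
    frac = coords - lo

    # Interpolate along rows, then along columns.
    rows = grid[lo] * (1 - frac)[:, None, None] + grid[hi] * frac[:, None, None]
    out = rows[:, lo] * (1 - frac)[None, :, None] + rows[:, hi] * frac[None, :, None]
    return np.concatenate([cls_row, out.reshape(new_grid * new_grid, d)], axis=0)

rng = np.random.default_rng(0)
old = rng.normal(size=(14 * 14 + 1, 64))   # e.g. 224 / 16 = 14 patches per side
new = resize_pos_embed(old, 14, 24)        # e.g. 384 / 16 = 24 patches per side
print(new.shape)  # (577, 64)
```

The `[CLS]` embedding has no spatial location, so it is carried over unchanged while only the patch grid is resized.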

---

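# 9. Why a CLS Token is Used
- The Transformer outputs one vector per token
- Classification requires **one vector** per image
- The CLS token learns global aggregation through attention
- It is not strictly required: global average pooling over the patch tokens is a common alternative and performs comparably when tuned (the original ViT paper notes it needs a different learning rate)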
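
The two ways of reducing the encoder output to one vector per image, reading the `[CLS]` position or averaging the patch tokens, can be sketched as follows. The `output` array is random data standing in for a real encoder output.

```python
import numpy as np

rng = np.random.default_rng(0)
output = rng.normal(size=(2, 17, 64))          # stand-in: [CLS] + 16 patch tokens

cls_pooled = output[:, 0, :]                   # read the [CLS] position
mean_pooled = output[:, 1:, :].mean(axis=1)    # average the patch tokens only
print(cls_pooled.shape, mean_pooled.shape)  # (2, 64) (2, 64)
```

Both produce a `(batch, D)` array, so the classification head is the same in either case.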

---

# 10. Minimal Summary
- Image → patches → linear projection  
- Add `[CLS]` token  
- Add learned positional embeddings  
- Pass through Transformer encoder  
- Extract CLS output  
- Classify

This is the core of Vision Transformers.