# Limitations of CNNs That Led to Vision Transformers  

---

# 1. Strong Inductive Bias
CNNs are built on assumptions like locality, translation equivariance, and hierarchical feature extraction.  
These assumptions help when data is small but restrict flexibility and global reasoning.

---

# 2. Poor Global Context Modeling
Convolutions operate on local neighborhoods (e.g., 3×3 or 5×5).  
To gather long-range relationships, CNNs must stack many layers, use dilations, or add attention-like modules.

This makes capturing global structure:
- slow,
- indirect,
- computationally expensive.

Transformers, in contrast, use self-attention, which gives every token a global receptive field from the first layer.

---

# 3. Slow Receptive Field Growth
A CNN’s receptive field grows only as you stack more layers.  
This is inefficient and can still fail to capture whole-image relationships.  
ViTs eliminate this by letting every patch attend to every other patch immediately.
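
To make the growth rate concrete, here is a minimal sketch of the standard receptive-field recurrence for a stack of convolutions. The helper name `receptive_field` and the layer configurations are illustrative assumptions, not taken from any particular network.

```python
def receptive_field(layers):
    """Return the receptive field (in input pixels) after each layer.

    `layers` is a list of (kernel_size, stride) pairs, applied in order.
    """
    rf = 1      # a single input pixel sees only itself
    jump = 1    # distance in input pixels between adjacent outputs
    sizes = []
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump  # each layer widens the view by (k - 1) * jump
        jump *= stride                  # striding makes later layers widen it faster
        sizes.append(rf)
    return sizes


# Ten stacked 3x3 convolutions with stride 1.
print(receptive_field([(3, 1)] * 10))  # [3, 5, 7, 9, 11, 13, 15, 17, 19, 21]
```

With stride-1 3×3 layers the receptive field grows by only 2 pixels per layer, so covering a 224-pixel-wide image this way would take more than a hundred layers. Real CNNs use strides and pooling to grow it faster, at the cost of spatial resolution.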

---

# 4. Inefficient Scaling
Increasing the depth or width of a CNN often yields diminishing returns.  
Transformers tend to keep improving as data and compute grow, even though self-attention itself costs quadratic time and memory in the number of tokens.

This ability to scale up, also seen in GPT-like language models, is a major reason ViT surpasses CNNs when pretrained on large datasets.

---

# 5. Heavy Architectural Engineering
CNNs require manually designed components:
- kernel sizes,
- strides,
- padding,
- feature pyramids,
- skip connections,
- specialized detection/segmentation heads.

ViTs simplify architecture to repeated Transformer blocks (+ patch embedding).

---

# 6. Weaker Transfer Across Domains
CNNs encode vision-specific priors → less reusable across modalities.  
Transformers unify architectures across NLP, vision, audio, and multimodal tasks.

---

# 7. Fixed Local Interactions
After training, CNN kernels are **fixed weights** that are applied identically to every local neighborhood.  
Transformers instead compute **dynamic token-to-token relationships**: the attention weights are recomputed for each input and change with its content.
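
The contrast can be seen in a small sketch: with the same learned projection matrices, the attention weights still differ from input to input. This is a single-head toy example in NumPy. The names `attention_weights`, `w_q`, and `w_k` and the sizes are assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_weights(tokens, w_q, w_k):
    """Single-head attention weights for tokens of shape (n, d)."""
    q = tokens @ w_q
    k = tokens @ w_k
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores, axis=-1)  # each row sums to 1


rng = np.random.default_rng(0)
d = 8
w_q = rng.normal(size=(d, d))  # "learned" parameters, random here
w_k = rng.normal(size=(d, d))

tokens_a = rng.normal(size=(4, d))
tokens_b = rng.normal(size=(4, d))

weights_a = attention_weights(tokens_a, w_q, w_k)
weights_b = attention_weights(tokens_b, w_q, w_k)

# Same parameters, different inputs: the mixing weights are recomputed.
print(np.allclose(weights_a, weights_b))
```

A convolution kernel, by comparison, multiplies every neighborhood by the same numbers regardless of what the image contains.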

---

# 8. Limitations in High-Data Regimes
When more data becomes available (ImageNet-21k, JFT-300M), CNNs tend to plateau earlier.  
Transformers keep improving with scale, which suggests a higher upper limit.

---

# 9. No Unified Architecture With NLP
ViT enables a consistent architecture across fields.  
This makes multimodal models (CLIP, Flamingo, LLaVA) easier to build, because image and text tokens can be processed by the same kind of backbone.

---

# 10. Translation Equivariance (Major Difference)

### CNNs **have** translation equivariance  
Translation equivariance means:

**If the input image shifts, the output feature map shifts in the same way.**  
The network’s response is preserved under translation.

Why CNNs have this:
- Convolution uses the **same kernel at every location**.
- The kernel reacts identically no matter where a pattern appears.
- So moving the pattern just moves the activation.

This is a strong image-specific inductive bias that makes CNNs:
- robust to object movement,
- very effective at small/medium data settings,
- naturally suited for spatial tasks.
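
The property can be checked numerically. The sketch below uses a 1D signal and a circular (wrap-around) convolution, which is an assumption made so that equivariance holds exactly. With zero padding, strides, or pooling it holds only approximately. The helper `circular_conv1d` is illustrative.

```python
import numpy as np


def circular_conv1d(signal, kernel):
    """Cross-correlate `signal` with `kernel`, wrapping around at the edges."""
    n = len(signal)
    out = np.zeros(n)
    for i in range(n):
        for j, w in enumerate(kernel):
            out[i] += w * signal[(i + j) % n]  # same kernel at every location
    return out


signal = np.array([0.0, 1.0, 3.0, 2.0, 0.0, 0.0, 0.0, 0.0])
kernel = np.array([1.0, -2.0, 1.0])
shift = 3

shift_then_conv = circular_conv1d(np.roll(signal, shift), kernel)
conv_then_shift = np.roll(circular_conv1d(signal, kernel), shift)

# Shifting the input and then convolving equals convolving and then shifting.
print(np.allclose(shift_then_conv, conv_then_shift))  # True
```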

---

### Vision Transformers **do NOT** have translation equivariance  

Why ViTs break this property:

#### 1. Patchifying breaks translation structure  
If the image shifts slightly:
- patches are cut differently,
- patch boundaries change,
- the token sequence changes completely.

Even 1–2 pixels of shift can create a different patch composition.
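
A small sketch of this effect, using a toy 8×8 image, patch size 4, and a circular shift (all assumptions for illustration). A 1-pixel shift produces tokens that match none of the original tokens, while a shift by exactly one patch width only permutes them.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W) image into flattened non-overlapping patches."""
    h, w = image.shape
    p = patch_size
    patches = image.reshape(h // p, p, w // p, p).transpose(0, 2, 1, 3)
    return patches.reshape(-1, p * p)  # (num_patches, P*P), row-major patch order


image = np.arange(64, dtype=float).reshape(8, 8)
p = 4

tokens = patchify(image, p)
tokens_shift_1 = patchify(np.roll(image, 1, axis=1), p)  # shift right by 1 pixel
tokens_shift_p = patchify(np.roll(image, p, axis=1), p)  # shift right by one patch

# After a 1-pixel shift, does any new token equal any original token?
same_after_1px = any(
    np.array_equal(a, b) for a in tokens for b in tokens_shift_1
)

# After a full-patch shift, the tokens are the originals in a new order.
grid = tokens.reshape(8 // p, 8 // p, -1)
permuted = np.roll(grid, 1, axis=1).reshape(tokens.shape)

print(same_after_1px)                            # False
print(np.array_equal(tokens_shift_p, permuted))  # True
```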

#### 2. Positional embeddings remove translation symmetry  
A positional embedding gives each patch a **fixed coordinate identity**.  
If the image shifts, patches get different embeddings → outputs do not shift predictably.

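#### 3. Global self-attention adds no spatial constraint  
Self-attention on its own treats tokens as an unordered set and has no notion of spatial neighborhood.  
Combined with the two points above, a shift in the input does NOT guarantee any structured shift in the output.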

---

### Consequence  
**CNN:**
- Built-in robustness to object movement.
- No need to learn translation behavior.

**ViT:**
- Must learn robustness to translation largely from data.
- Tends to perform worse than comparable CNNs in low-data settings.
- Can outperform CNNs once pretrained on very large datasets.

---

# CNN vs ViT: Summary Table

| Property / Limitation | CNNs | Vision Transformers (ViT) |
|------------------------|------|----------------------------|
| **Inductive Bias** | Strong: locality, translation equivariance | Weak; most spatial structure is learned from data |
| **Translation Equivariance** | **Yes (built-in)** | **No (broken by patching + positional embeddings)** |
| **Global Context** | Hard; large depth needed | Immediate global self-attention |
| **Receptive Field** | Grows slowly | Global from first layer |
| **Scalability** | Tends to saturate earlier when scaled | Scales extremely well with data/compute |
| **Dynamic Interactions** | Fixed convolution kernels | Dynamic token-to-token attention |
| **Architecture Complexity** | Requires manual design (kernels, pooling) | Simple repeated blocks |
| **Transfer Learning** | Moderate; mostly vision-only | Excellent; cross-modal + large-scale |
| **Data Efficiency** | Strong for small data | Weak for small data; needs large pretraining |
| **Multimodal Compatibility** | Not natural | Natural (same Transformer backbone as NLP) |

---

# Vision Transformer (ViT)

---

# 1. Overview
A Vision Transformer converts an image into a sequence of tokens (patch embeddings) and feeds them into a standard Transformer encoder. It relies on **three key components**:

### ViT Input = 3 Components
1. **Patch Embeddings**
2. **Class Token (`[CLS]`)**
3. **Positional Embeddings**

---

# 2. Patch Embeddings
### Goal
Convert an image into a sequence of tokens.

### Steps
- Split image into patches of size **P × P**
- Flatten each patch
- Linearly project into embedding of size **D**

### Output Shape
```
(batch, num_patches, D)
```
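
The steps above can be sketched in NumPy as a reshape followed by a matrix multiply. The function name `patch_embed`, the channels-last layout, and the sizes (224×224 RGB input, P = 16, D = 768) are assumptions for illustration. The projection weights are random stand-ins for learned parameters.

```python
import numpy as np


def patch_embed(images, patch_size, w_proj, b_proj):
    """Map images (B, H, W, C) to patch embeddings (B, N, D).

    w_proj has shape (P*P*C, D); b_proj has shape (D,).
    """
    b, h, w, c = images.shape
    p = patch_size
    assert h % p == 0 and w % p == 0, "image size must be divisible by P"
    x = images.reshape(b, h // p, p, w // p, p, c)
    x = x.transpose(0, 1, 3, 2, 4, 5)                  # (B, H/P, W/P, P, P, C)
    x = x.reshape(b, (h // p) * (w // p), p * p * c)   # flatten each patch
    return x @ w_proj + b_proj                         # linear projection to D


rng = np.random.default_rng(0)
images = rng.normal(size=(2, 224, 224, 3))
P, D = 16, 768
w_proj = rng.normal(size=(P * P * 3, D)) * 0.02
b_proj = np.zeros(D)

patch_embeddings = patch_embed(images, P, w_proj, b_proj)
print(patch_embeddings.shape)  # (2, 196, 768)
```

Here `num_patches` is (224 / 16)² = 196. Halving the patch size quadruples the sequence length, which matters because attention cost grows with the square of the number of tokens.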

### Analogy
Patch embedding ≈ Word embedding in NLP.

---

# 3. Class Token (`[CLS]`)
### What it is
A **learnable vector** of dimension **D**, prepended to the patch sequence.

### Why it exists
Transformers produce one output per token, but classification needs **one global vector**.  
The `[CLS]` token learns to gather information from all patches through attention.

### After the Transformer
Use:
```
CLS_out = output[:, 0, :]
```

---

# 4. Positional Embeddings
Transformers lack positional awareness. ViT uses **learned** positional embeddings.

### Shape
```
(num_patches + 1, D)
```

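### Why Learned Instead of Sinusoidal
- The original ViT uses learned 1D positional embeddings, one per token position  
- The model can pick up 2D spatial structure from data, without a hand-designed encoding  
- The ViT authors report no significant gain from 2D-aware variants over learned 1D embeddings  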

### Added to Tokens
```
# prepend [CLS] first so the sequence length matches (num_patches + 1, D)
tokens = concat([CLS], patch_embeddings) + positional_embeddings
```

---

# 5. Full Transformer Input
```
concat([CLS], patch_embeddings) + positional_embeddings
```

Shape:
```
(batch, num_patches + 1, D)
```
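
A minimal NumPy sketch of this assembly step: the class token is prepended along the sequence axis, and the positional embeddings are then added element-wise. The function name `build_input` and the tiny sizes are assumptions for illustration. In a real model `cls_token` and `pos_embed` are learned parameters.

```python
import numpy as np


def build_input(patch_embeddings, cls_token, pos_embed):
    """Prepend [CLS] and add positional embeddings.

    patch_embeddings: (B, N, D); cls_token: (D,); pos_embed: (N + 1, D).
    """
    b, n, d = patch_embeddings.shape
    cls = np.broadcast_to(cls_token, (b, 1, d))               # one copy per image
    tokens = np.concatenate([cls, patch_embeddings], axis=1)  # (B, N + 1, D)
    return tokens + pos_embed                                 # broadcast over batch


rng = np.random.default_rng(0)
B, N, D = 2, 4, 8
patch_embeddings = rng.normal(size=(B, N, D))
cls_token = rng.normal(size=(D,))
pos_embed = rng.normal(size=(N + 1, D))

tokens = build_input(patch_embeddings, cls_token, pos_embed)
print(tokens.shape)  # (2, 5, 8)
```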

---

# 6. Transformer Encoder in ViT
Same structure as NLP Transformer encoder:
- Multi-Head Self-Attention
- MLP
- LayerNorm (applied before each sub-layer in ViT)
- Residual connections

No decoder is used.
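
The components listed above can be combined into one encoder block as in the following NumPy sketch. It is simplified under stated assumptions: biases and the learnable LayerNorm scale and shift are omitted, GELU uses the tanh approximation, and the weights in `params` are random placeholders with hypothetical key names.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def self_attention(x, params, num_heads):
    b, n, d = x.shape
    hd = d // num_heads

    def split(t):  # (B, N, D) -> (B, heads, N, hd)
        return t.reshape(b, n, num_heads, hd).transpose(0, 2, 1, 3)

    q, k, v = (split(x @ params[name]) for name in ("w_q", "w_k", "w_v"))
    attn = softmax(q @ k.transpose(0, 1, 3, 2) / np.sqrt(hd))  # (B, heads, N, N)
    out = (attn @ v).transpose(0, 2, 1, 3).reshape(b, n, d)    # merge heads
    return out @ params["w_o"]


def encoder_block(x, params, num_heads):
    # Pre-norm: LayerNorm -> attention -> residual, then LayerNorm -> MLP -> residual.
    x = x + self_attention(layer_norm(x), params, num_heads)
    h = gelu(layer_norm(x) @ params["w_1"]) @ params["w_2"]
    return x + h


rng = np.random.default_rng(0)
d, mlp_dim, heads = 8, 32, 2
params = {name: rng.normal(size=(d, d)) * 0.1 for name in ("w_q", "w_k", "w_v", "w_o")}
params["w_1"] = rng.normal(size=(d, mlp_dim)) * 0.1
params["w_2"] = rng.normal(size=(mlp_dim, d)) * 0.1

x = rng.normal(size=(2, 5, d))
y = encoder_block(x, params, heads)
print(y.shape)  # (2, 5, 8)
```

The output has the same shape as the input, which is what allows the block to be stacked repeatedly.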

---

# 7. Classification Head
Only the final CLS vector is used:
```
cls_vec = output[:, 0, :]
logits = MLP(cls_vec)
```

---

# 8. ViT vs NLP Transformer

| Component | NLP Transformer | Vision Transformer |
|----------|------------------|--------------------|
| Input Tokens | Words or subwords | Image patches |
| Token Embedding | Lookup | Linear projection |
| Positional Encoding | Sinusoidal or learned | Learned (in the original ViT) |
| Structure | 1D | 2D patches flattened |
| Special Token | CLS (BERT-style models) | CLS |
| Resolution Changes | Easy | Need interpolation |
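
The "Need interpolation" entry refers to resizing the learned positional embeddings when the input resolution, and therefore the patch grid, changes. Below is a hedged NumPy sketch using plain bilinear interpolation with aligned corners. The function name `resize_pos_embed` and the interpolation details are assumptions, and library implementations may use a different scheme (for example bicubic).

```python
import numpy as np


def resize_pos_embed(pos_embed, old_grid, new_grid):
    """Resize positional embeddings from an old_grid^2 to a new_grid^2 patch grid.

    pos_embed: (1 + old_grid**2, D), with the [CLS] position at index 0.
    """
    cls_pos, patch_pos = pos_embed[:1], pos_embed[1:]
    d = patch_pos.shape[-1]
    grid = patch_pos.reshape(old_grid, old_grid, d)

    # Sample positions in old-grid coordinates (corners map to corners).
    coords = np.linspace(0, old_grid - 1, new_grid)
    lo = np.floor(coords).astype(int)
    hi = np.minimum(lo + 1, old_grid - 1)
    frac = coords - lo

    # Interpolate along rows, then along columns.
    rows = grid[lo] * (1 - frac)[:, None, None] + grid[hi] * frac[:, None, None]
    out = rows[:, lo] * (1 - frac)[None, :, None] + rows[:, hi] * frac[None, :, None]

    # The [CLS] position has no spatial location, so it is kept unchanged.
    return np.concatenate([cls_pos, out.reshape(new_grid * new_grid, d)], axis=0)


rng = np.random.default_rng(0)
pos_embed = rng.normal(size=(1 + 14 * 14, 8))    # 224 / 16 = 14 patches per side
resized = resize_pos_embed(pos_embed, 14, 24)    # 384 / 16 = 24 patches per side
print(resized.shape)  # (577, 8)
```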

---

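# 9. Why a CLS Token Is Used
- Transformer outputs one vector per token  
- Classification requires **one vector**  
- CLS token learns global aggregation through attention  
- Global average pooling over the patch tokens is a workable alternative; the ViT authors report similar accuracy when the learning rate is tuned for it  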
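
The CLS readout and the average-pooling alternative mentioned above can be sketched side by side. Both reduce the encoder output to one vector per image before a classification head. The function names, the random stand-in for the encoder output, and the single linear head are assumptions for illustration.

```python
import numpy as np


def cls_readout(output):
    """Take the [CLS] token (index 0) as the image representation."""
    return output[:, 0, :]


def mean_pool_readout(output):
    """Average the patch tokens, skipping the [CLS] token at index 0."""
    return output[:, 1:, :].mean(axis=1)


def linear_head(features, w, b):
    """Map (B, D) features to (B, num_classes) logits."""
    return features @ w + b


rng = np.random.default_rng(0)
B, N, D, num_classes = 2, 196, 64, 10
output = rng.normal(size=(B, N + 1, D))   # stand-in for the encoder output
w = rng.normal(size=(D, num_classes)) * 0.02
b = np.zeros(num_classes)

logits_cls = linear_head(cls_readout(output), w, b)
logits_gap = linear_head(mean_pool_readout(output), w, b)
print(logits_cls.shape, logits_gap.shape)  # (2, 10) (2, 10)
```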

---

# 10. Minimal Summary
- Image → patches → linear projection  
- Add `[CLS]` token  
- Add learned positional embeddings  
- Pass through Transformer encoder  
- Extract CLS output  
- Classify

This is the core of Vision Transformers.