# 🧠 Attention Mechanisms — Core Concepts & Intuition

---

## 💡 What Is Attention?

Attention lets each token in a sequence decide **which other tokens to pay attention to**, and **how much**, when constructing its contextualized representation.

Unlike RNNs, attention is not sequential — every token **attends to all tokens in parallel**, including itself.

---

## 🎯 The Goal

For each token in a sequence:  
> “What other tokens should I care about, and how much should their information contribute to my final vector?”

---

## 🔁 Q, K, V: Query, Key, and Value

Given an input sequence of token embeddings:

$$
X = \begin{bmatrix}
x_1 \\
x_2 \\
\vdots \\
x_n
\end{bmatrix} \in \mathbb{R}^{n \times d_{\text{model}}}
$$

We compute:

- **Queries**: $Q = X W^Q$
- **Keys**: $K = X W^K$
- **Values**: $V = X W^V$

Where:

- $W^Q, W^K, W^V \in \mathbb{R}^{d_{\text{model}} \times d_k}$ are **learned projection matrices**
- Each token embedding is linearly projected into 3 spaces:
  - **Query** = “What am I looking for?”
  - **Key** = “What do I offer?”
  - **Value** = “What content should you use from me?”

✅ This part is critical: Q, K, and V are **not the embeddings themselves**, but **learned views** of them based on the task.

---

## 📐 Scaled Dot-Product Attention Formula

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V
$$

This formula happens in **4 core steps**:

---

### Step 1: Similarity Scoring — $QK^\top$

This is where each token’s **query** is compared to every token’s **key** (including its own) using a dot product:

- The output is a matrix $\in \mathbb{R}^{n \times n}$
- Entry $(i, j)$ is the similarity between:
  - Token $i$’s **query**
  - Token $j$’s **key**

✅ These raw scores determine **how much attention token $i$ gives to token $j$**; they become normalized, interpretable weights only after scaling and softmax.

---

### Step 2: Scaling

Divide each score by $\sqrt{d_k}$. Dot products tend to grow in magnitude as $d_k$ grows, and large scores push softmax toward a near one-hot output where gradients become very small.
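
A minimal sketch of the effect of scaling, assuming hand-picked raw scores and $d_k = 64$ (both are illustrative values, not taken from a trained model). It compares the softmax output with and without the division by $\sqrt{d_k}$:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64
# Hypothetical raw dot products of one query against three keys.
raw_scores = [16.0, 8.0, 0.0]

unscaled = softmax(raw_scores)
scaled = softmax([s / math.sqrt(d_k) for s in raw_scores])

print("unscaled:", [round(w, 4) for w in unscaled])
print("scaled:  ", [round(w, 4) for w in scaled])
```

Without scaling almost all of the weight lands on the first key; with scaling the other keys still receive a noticeable share.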

---

### Step 3: Softmax — Normalizing to Attention Weights

Apply softmax across each row:

- Turns raw scores into probabilities (weights between 0 and 1)
- Each row now sums to 1
- Row $i$ gives a **distribution over how much token $i$ should attend to all tokens (including itself)**

✅ This softmax output is the **attention weight matrix** — it holds the interpretable attention values.

---

### Step 4: Weighted Sum — Multiply with V

$$
\text{Output} = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V
$$

- Multiply attention weights by value vectors
- This gives a new vector for each token — a **blend** of all token values, weighted by how relevant they are
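
The four steps above can be sketched in plain Python. This is a minimal illustration under assumed toy inputs (3 tokens, $d_k = d_v = 2$); the helper names `matmul`, `transpose`, `softmax_rows`, and `attention` are made up for this example:

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(q, k, v):
    """Return (output, weights) for scaled dot-product attention."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))                                 # Step 1: n x n scores
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]   # Step 2: scaling
    weights = softmax_rows(scaled)                                   # Step 3: rows sum to 1
    return matmul(weights, v), weights                               # Step 4: weighted sum

# Assumed toy values: n = 3 tokens, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

output, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row is one token's distribution over all tokens, and `output` has one blended value vector per token.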

---

## 🔍 Intuition Summary

| Role   | Meaning                                                   |
|--------|-----------------------------------------------------------|
| Query  | What am I looking for?                                    |
| Key    | What do I offer?                                          |
| Value  | What content should you use if you care about me?         |
| $Q \cdot K^\top$ | How well does one token match another                     |
| $\text{softmax}(QK^\top / \sqrt{d_k})$ | Attention weights: how much focus to give each token |
| Output | Contextualized vector based on others’ values             |

---

## 🧠 Additional Insights

- **Self-attention** = Q, K, V all come from the same input
- **Cross-attention** = Q comes from one input, K/V from another (e.g., encoder-decoder setup)
- **Attention is fully differentiable** and learned via backprop
- The attention output has the **same shape** as the input (when $d_v = d_{\text{model}}$, or after the output projection $W^O$ in multi-head attention), so it can be combined via residual connections

---

## ✅ This Is Where Interpretation Happens:

> The attention **weights** (after softmax of $QK^\top$) are **directly interpretable** as:  
> “How much should this token care about each other token (including itself)?”

That matrix holds the **core meaning** of attention.


# 🧠 Single-Head vs Multi-Head Attention — Dimensions, Design, and Intuition

---

## 🔁 Single-Head (Standard) Attention

In standard attention (no heads), we work directly with the full embedding dimension:

Let:
- $X \in \mathbb{R}^{n \times d_{\text{model}}}$ — input sequence
- $n$: number of tokens in sequence
- $d_{\text{model}}$: embedding size per token

We compute:

- $Q = X W^Q$, $K = X W^K$, $V = X W^V$
- Where each projection matrix $W^Q, W^K, W^V \in \mathbb{R}^{d_{\text{model}} \times d_{\text{model}}}$

So:
- $Q, K, V \in \mathbb{R}^{n \times d_{\text{model}}}$
- Meaning:
  $$
  d_k = d_v = d_{\text{model}}
  $$

✅ We retain the full expressiveness of each token's embedding for computing attention.

---

## 🔀 Multi-Head Attention

We split attention into $h$ **parallel heads**, each operating in a smaller subspace.

We set:
$$
d_k = d_v = \frac{d_{\text{model}}}{h}
$$

Each head computes:

$$
Q_i = X W_i^Q,\quad K_i = X W_i^K,\quad V_i = X W_i^V
\quad \text{with} \quad W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_k}
$$

Each head returns:

$$
\text{head}_i = \text{softmax}\left( \frac{Q_i K_i^\top}{\sqrt{d_k}} \right) V_i \in \mathbb{R}^{n \times d_v}
$$

---

### 🔧 Concatenation and Final Projection

All heads are concatenated:
$$
\text{Concat}(\text{head}_1, \dots, \text{head}_h) \in \mathbb{R}^{n \times (h \cdot d_v)} = \mathbb{R}^{n \times d_{\text{model}}}
$$

Then passed through a final linear projection:
$$
\text{Output} = \text{Concat}(\dots) \cdot W^O,\quad W^O \in \mathbb{R}^{d_{\text{model}} \times d_{\text{model}}}
$$

✅ Final output shape: $n \times d_{\text{model}}$
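
A minimal multi-head sketch in plain Python, assuming $d_{\text{model}} = 4$, $h = 2$, and random stand-in weights (a trained model would learn them). The function names and the `(W_q, W_k, W_v)` tuple layout are choices made for this example only:

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(q, k, v):
    """Scaled dot-product attention for one head."""
    d_k = len(k[0])
    scores = [[sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_k) for kj in k] for qi in q]
    return matmul(softmax_rows(scores), v)

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(x, heads, w_o):
    """heads is a list of (W_q, W_k, W_v) triples, one triple per head."""
    head_outputs = []
    for w_q, w_k, w_v in heads:
        q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
        head_outputs.append(attention(q, k, v))                 # n x d_v
    # Concatenate head outputs along the feature axis: n x (h * d_v).
    concat = [sum((head[i] for head in head_outputs), []) for i in range(len(x))]
    return matmul(concat, w_o)                                  # n x d_model

rng = random.Random(0)
n, d_model, h = 3, 4, 2
d_k = d_model // h

x = random_matrix(n, d_model, rng)
heads = [tuple(random_matrix(d_model, d_k, rng) for _ in range(3)) for _ in range(h)]
w_o = random_matrix(d_model, d_model, rng)

out = multi_head_attention(x, heads, w_o)
print(len(out), "x", len(out[0]))
```

Real implementations usually compute all heads with one batched matrix multiply; the explicit loop here is only meant to mirror the formulas.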

---

## 🤔 Why Set $d_k = d_v = \frac{d_{\text{model}}}{h}$?

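- Keeps total computation close to that of a single full-dimension head, because each of the $h$ heads works in a subspace of size $d_{\text{model}}/h$
- Preserves total output size after concatenation ($h \cdot d_v = d_{\text{model}}$)
- Enables diversity across heads (each can learn a different “view”)
- May encourage heads to specialize; any regularizing effect is an intuition rather than a guarantee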
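
A quick arithmetic check of the cost argument, assuming $d_{\text{model}} = 512$, $h = 8$, and a sequence of 128 tokens purely for illustration. It counts Q/K/V projection weights (biases ignored) and the multiply-adds needed for the score matrices:

```python
d_model, h = 512, 8
d_k = d_model // h  # 64 per head

# Q, K and V projection weights, biases ignored.
single_head_params = 3 * d_model * d_model
multi_head_params = h * 3 * d_model * d_k

# Multiply-adds for the n x n score matrices of an n-token sequence.
n = 128
single_head_score_ops = n * n * d_model
multi_head_score_ops = h * n * n * d_k

print(single_head_params, multi_head_params)
print(single_head_score_ops, multi_head_score_ops)
```

Both pairs come out equal, so splitting into heads redistributes the same budget rather than adding to it (the extra $W^O$ projection is not counted here).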

---

## ❌ What if $d_k$ or $d_v$ were too small?

- $d_k$ too small → poor matching resolution → queries can't “find” relevant keys effectively
- $d_v$ too small → limited content to pass into final representation
- Bottom line: shrinking Q/K/V dimensions **too much** causes **information loss**

---

## ✅ Summary

| Concept                     | Single-Head                     | Multi-Head                              |
|-----------------------------|----------------------------------|------------------------------------------|
| Q/K/V dimension             | $d_{\text{model}}$               | $\frac{d_{\text{model}}}{h}$             |
| Output shape                | $n \times d_{\text{model}}$     | $n \times d_{\text{model}}$ (via concat + proj) |
| Why use multiple heads?     | N/A                              | Captures different types of attention in parallel |
| Final projection?           | No                               | Yes — to mix head outputs               |


# 📍 Positional Information in Transformers — Complete Guide

This note summarizes all the key concepts, comparisons, and implementations for encoding positional information into transformer models. It includes positional **embeddings**, **encodings**, and **rotary embeddings (RoPE)**.

---

## 🔁 Why We Need Positional Information

Transformers process tokens in parallel, meaning they have **no inherent sense of token order**. Positional information must be added explicitly so that:

- Tokens can be aware of **absolute position** (e.g., beginning, middle, end)
- Attention can reflect **relative distance** (e.g., "how far apart")

---

## 📦 Positional Embeddings (Learned)

- Each position (0, 1, 2, ...) is assigned a **learnable vector** of size $d_{\text{model}}$
- Looked up from a table and **added to token embeddings** before the first transformer layer
- Example: `input = token_embedding + position_embedding`
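
A minimal sketch of the lookup-and-add step, assuming tiny table sizes and random stand-in values (in a real model both tables are trained). The names `token_table`, `position_table`, and `embed` are hypothetical:

```python
import random

rng = random.Random(0)
vocab_size, max_len, d_model = 10, 8, 4

# Both tables would be learned; random values stand in for trained weights.
token_table = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(vocab_size)]
position_table = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(max_len)]

def embed(token_ids):
    """Return token_embedding + position_embedding for each position."""
    if len(token_ids) > max_len:
        # No learned vector exists for positions beyond the table.
        raise ValueError("sequence is longer than the position table")
    return [
        [t + p for t, p in zip(token_table[tok], position_table[pos])]
        for pos, tok in enumerate(token_ids)
    ]

inputs = embed([3, 1, 4, 1])
print(len(inputs), "x", len(inputs[0]))
```

Token id 1 appears at positions 1 and 3 and ends up with two different input vectors, which is the whole point of adding position information.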

### ✅ Pros:
- Task-specific optimization (learned from data)
- Compatible with any tokenizer and model

### ❌ Cons:
- Does **not generalize** to sequences longer than the position table (no vector exists for unseen positions)
- Positional meaning is **not interpretable**

---

## 🔢 Sinusoidal Positional Encodings (Fixed)

### Formula:
For position $\text{pos}$ and dimension $i$ (dimensions $2k$ and $2k+1$ form a pair and share one frequency, hence the $\lfloor i/2 \rfloor$):

If $i$ is even:

$$
\text{PE}_{\text{pos}, i} = \sin\left(\frac{\text{pos}}{10000^{\frac{2\lfloor i/2 \rfloor}{d_{\text{model}}}}}\right)
$$

If $i$ is odd:

$$
\text{PE}_{\text{pos}, i} = \cos\left(\frac{\text{pos}}{10000^{\frac{2\lfloor i/2 \rfloor}{d_{\text{model}}}}}\right)
$$


- Injected **by addition** to token embeddings
- Dimensions vary by **frequency**, allowing **multi-scale position awareness**

### ✅ Pros:
- Generalizes to **arbitrarily long sequences**
- Smooth, interpretable, and deterministic

### ❌ Cons:
- Not task-optimized
- Position only affects **input embeddings**, not attention directly

---

## 🌀 Rotary Positional Embeddings (RoPE)

RoPE encodes position **inside the attention mechanism itself** by rotating Q and K vectors.

### Core Idea:
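- Split each Q/K vector into **2D pairs**
- Rotate pair $i$ (with $i = 0, 1, 2, \dots$ counting pairs, not single dimensions) by an angle that depends on the token position:

$$
\theta_i = \frac{\text{pos}}{10000^{\frac{2i}{d_{\text{model}}}}}
$$

In multi-head attention the rotation is applied per head, so the dimension in the exponent is the per-head size $d_k$ rather than the full $d_{\text{model}}$.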


- Rotation uses a 2D matrix:

$$
R(\theta) =
\begin{bmatrix}
\cos(\theta) & -\sin(\theta) \\
\sin(\theta) & \cos(\theta)
\end{bmatrix}
$$
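
A short derivation of why rotating queries and keys yields relative-position scores, written for a single 2D pair. The symbols $m$ (query position), $n$ (key position), and $\omega$ (the pair's base frequency, so that $\theta = \text{pos} \cdot \omega$) are introduced here for illustration. Using $R(\alpha)^\top = R(-\alpha)$ and $R(\alpha) R(\beta) = R(\alpha + \beta)$:

$$
\langle R(m\omega)\, q,\ R(n\omega)\, k \rangle
= q^\top R(m\omega)^\top R(n\omega)\, k
= q^\top R\big((n - m)\,\omega\big)\, k
$$

The result depends on the positions only through the offset $n - m$.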


### 🔁 Multi-scale Encoding:
- **Slow-rotating** pairs capture **global** position
- **Fast-rotating** pairs capture **fine-grained** distances

### ✅ Pros:
- Encodes **relative position** through dot products
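- Relative-position behavior carries over to **longer sequences**, though quality beyond the training length is not guaranteed without further techniques
- No learnable parameters; the sin/cos values can be precomputed or generated on the fly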
- Lightweight and fused into attention calculation

### ❌ Cons:
- Rotation must be manually implemented per 2D pair

### 🔍 Example Output (Simulated):
Dot product between the Q vector at position 0 and Q vectors at other positions after RoPE (simulated values):

| Q\_pos=0 vs | Dot Product |
|------------|-------------|
| Q\_pos=0    | 1.35         |
| Q\_pos=1    | 0.89         |
| Q\_pos=2    | -0.07        |
| Q\_pos=3    | -0.65        |

> In this example the dot product falls off smoothly as the distance between positions grows, purely as a result of the rotation.

---

## ✅ Final Summary

| Method                   | Type     | Generalizes | Relative Awareness | Added to Embedding? | Interpretable? |
|--------------------------|----------|-------------|---------------------|----------------------|-----------------|
| Learned Pos Embedding    | Learned  | ❌          | ❌                  | ✅                   | ❌              |
| Sinusoidal Encoding      | Fixed    | ✅          | ✅ (with effort)     | ✅                   | ✅              |
| Rotary Embedding (RoPE)  | Fixed    | ✅          | ✅ (by design)       | ❌ (used in Q/K)     | ✅ (via rotation) |

RoPE gives the most **elegant and functional** way to encode position **without any addition to embeddings**, directly into **how tokens attend to each other**.


# 🔢 Sinusoidal Positional Encoding — Simple Toy Example (d_model = 4)

### ❗ Key Reminder:
- **`pos`** = the position of the token in the sequence (e.g., 0, 1, 2, ...)
- **`i`** = the dimension index of the embedding (0 through d_model - 1)
- ✅ **`i ≠ pos`** — they are totally separate axes

---

### ⚙️ Setup:
- Sequence length = 2 (positions 0 and 1)
- Embedding size = 4

Formulas (dimensions $2k$ and $2k+1$ share one frequency, hence the $\lfloor i/2 \rfloor$):

For even $i$:

$$
\text{PE}(\text{pos}, i) = \sin\left(\frac{\text{pos}}{10000^{\frac{2\lfloor i/2 \rfloor}{d_{\text{model}}}}}\right)
$$

For odd $i$:

$$
\text{PE}(\text{pos}, i) = \cos\left(\frac{\text{pos}}{10000^{\frac{2\lfloor i/2 \rfloor}{d_{\text{model}}}}}\right)
$$

---

## 📍 Position 0:

- $\text{PE}(0, 0) = \sin(0) = 0$
- $\text{PE}(0, 1) = \cos(0) = 1$
- $\text{PE}(0, 2) = \sin(0) = 0$
- $\text{PE}(0, 3) = \cos(0) = 1$

**Resulting vector:**  
`[0.0, 1.0, 0.0, 1.0]`

---

## 📍 Position 1:

- For $i = 0, 1$: $\frac{1}{10000^{0}} = 1$
- For $i = 2, 3$: $\frac{1}{10000^{0.5}} = \frac{1}{\sqrt{10000}} = 0.01$

- $\text{PE}(1, 0) = \sin(1) \approx 0.841$
- $\text{PE}(1, 1) = \cos(1) \approx 0.540$
- $\text{PE}(1, 2) = \sin(0.01) \approx 0.010$
- $\text{PE}(1, 3) = \cos(0.01) \approx 1.000$ (more precisely $0.99995$)

**Resulting vector:**  
`[0.841, 0.540, 0.010, 1.000]`

---

## ✅ Final PE Matrix (2 positions × 4 dimensions):

| Position | dim 0 | dim 1 | dim 2 | dim 3 |
|----------|--------|--------|--------|--------|
| 0        | 0.000  | 1.000  | 0.000  | 1.000  |
| 1        | 0.841  | 0.540  | 0.010  | 1.000  |
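
The toy setup above (2 positions, 4 dimensions) can be recomputed with a short sketch. The function name `sinusoidal_encoding` and the `base` parameter are choices made for this example; dimensions $2k$ and $2k+1$ are assumed to share one frequency, matching the angles used for position 1:

```python
import math

def sinusoidal_encoding(num_positions, d_model, base=10000.0):
    """Return a num_positions x d_model table of sinusoidal encodings."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            k = i // 2  # pair index shared by dimensions 2k and 2k+1
            angle = pos / base ** (2 * k / d_model)
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = sinusoidal_encoding(num_positions=2, d_model=4)
for row in pe:
    print([round(v, 3) for v in row])
```

Because the table is a pure function of `pos` and `i`, it can be extended to any number of positions without retraining.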

---


# 🌀 Rotary Positional Embedding (RoPE) — Toy Example + Math Explained

### ❗ Key Points

- RoPE applies **rotations directly to Q and K vectors** in attention
- It does **not** modify the input embeddings — the position is injected **within the attention mechanism**
- Each token's Q/K vector is split into **2D subvectors** and rotated using a position-dependent angle
- Different 2D subvectors rotate at **different frequencies** (multi-scale)

---

### ⚙️ Setup: 1 token vector of dimension 4

Let:
- Token position = 1
- Vector = `[x₀, x₁, x₂, x₃] = [1.0, 0.0, 0.5, 0.0]`
- We split it into two 2D subvectors: `[1.0, 0.0]` and `[0.5, 0.0]`
- Let $d_{\text{model}} = 4$

We will rotate each subvector by an angle:

$$
\theta_i = \frac{\text{pos}}{10000^{\frac{2i}{d_{\text{model}}}}}
$$

So for:
- Pair 0 ($i=0$): $\theta_0 = \frac{1}{10000^0} = 1.0$
- Pair 1 ($i=1$): $\theta_1 = \frac{1}{10000^{0.5}} = 0.01$

---

### 📐 Rotation Matrix

Each 2D vector is rotated using:

$$
R(\theta) =
\begin{bmatrix}
\cos(\theta) & -\sin(\theta) \\
\sin(\theta) & \cos(\theta)
\end{bmatrix}
$$

---

### 🧮 Applying RoPE

For pair 0 (the first subvector): $[1.0,\ 0.0]$ with $\theta_0 = 1.0$

$$
\text{RoPE}_0 = R(1.0) \cdot
\begin{bmatrix}
1.0 \\
0.0
\end{bmatrix}
=
\begin{bmatrix}
\cos(1.0) \cdot 1.0 + (-\sin(1.0)) \cdot 0.0 \\
\sin(1.0) \cdot 1.0 + \cos(1.0) \cdot 0.0
\end{bmatrix}
\approx
\begin{bmatrix}
0.540 \\
0.841
\end{bmatrix}
$$

For pair 1 (the second subvector): $[0.5,\ 0.0]$ with $\theta_1 = 0.01$

$$
\text{RoPE}_1 = R(0.01) \cdot
\begin{bmatrix}
0.5 \\
0.0
\end{bmatrix}
\approx
\begin{bmatrix}
0.5 \cdot \cos(0.01) \\
0.5 \cdot \sin(0.01)
\end{bmatrix}
\approx
\begin{bmatrix}
0.5000 \\
0.0050
\end{bmatrix}
$$

---

### ✅ Final Rotated Vector:

Concatenate the two rotated subvectors:  
`[0.540, 0.841, 0.500, 0.005]`
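
The rotation in this toy example can be reproduced with a short sketch. The function name `rope` and the `base` parameter are choices made for this example, and it pairs adjacent dimensions as in the text (some implementations pair dimensions differently):

```python
import math

def rope(vector, pos, base=10000.0):
    """Rotate consecutive 2D pairs of `vector` by position-dependent angles."""
    d = len(vector)
    rotated = []
    for i in range(d // 2):
        theta = pos / base ** (2 * i / d)  # angle for pair i at this position
        x, y = vector[2 * i], vector[2 * i + 1]
        rotated.append(x * math.cos(theta) - y * math.sin(theta))
        rotated.append(x * math.sin(theta) + y * math.cos(theta))
    return rotated

rotated = rope([1.0, 0.0, 0.5, 0.0], pos=1)
print([round(v, 4) for v in rotated])
```

Rotation leaves the length of each 2D pair unchanged, so RoPE changes direction only, not magnitude.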



# 🔁 Positional Information in Transformers

## ✅ Why Do We Need Positional Information?

Transformers lack recurrence or convolution, so they process tokens **in parallel**.  
To understand **order**, we inject positional information into the model.

---

## 🔷 Types of Positional Information

| Method                         | Category      | Generalizes Beyond Training Length? | Notes |
|-------------------------------|---------------|------------------------|-------|
| Learned Positional Embeddings | Absolute       | ❌ No | Fixed-size lookup table; 1 vector per position |
| Sinusoidal Encoding           | Absolute       | ✅ Yes | Uses trigonometric functions to encode positions |
| Relative Position Bias (T5, DeBERTa) | Relative       | ✅ Yes | Focuses on token distance, not exact location |
| RoPE (Rotary Positional Embeddings) | Hybrid (Abs + Rel) | ✅ Yes | Injects position into attention via rotations |

---

## 🔹 Absolute Positional Encodings

- Encode the **exact position** of each token (e.g., "token 5").
- Either **learned** (lookup table) or **fixed** (sin/cos functions).
- Cannot reason about *distances between tokens* directly.

---

## 🔸 Relative Positional Encodings

- Encode **distances** between tokens (e.g., "token A is 3 steps before B").
- More robust to repeated structures, generalized patterns (e.g., music, DNA).
- Often used as bias in the attention score between query and key.

---

## 🚀 Rotary Positional Embeddings (RoPE)

RoPE rotates each **query and key vector** based on its token's **absolute position**,  
but the resulting **attention scores depend on relative position**.

### 🔬 How It Works

1. Split each query/key vector into pairs:  
   $$
   [x_0, x_1], [x_2, x_3], \dots
   $$

2. Rotate pair $i$ of the token at position $p$ by the angle $\theta_{p,i} = p / 10000^{2i/d}$, where $d$ is the length of the vector being split:
   $$
   \begin{bmatrix}
   \cos(\theta_{p,i}) & -\sin(\theta_{p,i}) \\
   \sin(\theta_{p,i}) & \cos(\theta_{p,i})
   \end{bmatrix}
   \cdot
   \begin{bmatrix}
   x_{2i} \\
   x_{2i+1}
   \end{bmatrix}
   $$

3. At attention time, compute the score between the query at position $m$ and the key at position $n$:
   $$
   \text{score}(m, n) = \text{RoPE}(q_m)^\top \cdot \text{RoPE}(k_n)
   $$

4. For each pair, the dot product depends on the **angle difference** $\theta_{n,i} - \theta_{m,i}$, that is, on the offset $n - m$,  
   giving **relative position sensitivity** even though each vector is rotated using only its own **absolute position**.
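
A small numerical check of the relative-position property, assuming arbitrary toy query and key vectors. The helper names `rope` and `dot` are made up for this example. Shifting both positions by the same amount should leave the score unchanged, while changing the offset should not:

```python
import math

def rope(vector, pos, base=10000.0):
    """Rotate consecutive 2D pairs of `vector` by position-dependent angles."""
    d = len(vector)
    rotated = []
    for i in range(d // 2):
        theta = pos / base ** (2 * i / d)
        x, y = vector[2 * i], vector[2 * i + 1]
        rotated.append(x * math.cos(theta) - y * math.sin(theta))
        rotated.append(x * math.sin(theta) + y * math.cos(theta))
    return rotated

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Arbitrary toy vectors (assumed values).
q = [0.3, -1.2, 0.8, 0.5]
k = [1.1, 0.4, -0.7, 0.9]

score_near_start = dot(rope(q, 2), rope(k, 5))      # offset 3
score_shifted = dot(rope(q, 102), rope(k, 105))     # same offset, both shifted by 100
score_other_offset = dot(rope(q, 2), rope(k, 9))    # offset 7

print(round(score_near_start, 6), round(score_shifted, 6), round(score_other_offset, 6))
```

The first two scores agree up to floating-point error; the third differs because the offset changed.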

---

## 🧠 Why RoPE is the Best of Both Worlds

- ✅ **Absolute position** is encoded via rotation angle.
- ✅ **Relative distance** affects the similarity score in attention.
- ✅ No need for additional embeddings or memory overhead.
- ✅ Extends to longer sequences more gracefully than a fixed lookup table, though quality past the training length is not guaranteed.

---

## 🔑 Key Takeaways

- Transformers need position info because they’re order-agnostic by default.
- Absolute ≠ Relative — they serve different roles.
- **RoPE injects both types of info directly into attention**, making it highly efficient and flexible.


# 🧱 Transformer Architecture — Encoder & Decoder Notes

This breakdown covers the structure, flow, and components of the original Transformer encoder and decoder, as described in *“Attention Is All You Need.”*

---

## 🔁 Encoder Overview

The encoder processes the **entire input sequence** at once and produces a sequence of **context-enriched token representations**.

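### 💡 Structure of Each Encoder Layer:

1. Multi-head self-attention
2. Add & Norm (residual connection + layer normalization)
3. Position-wise feedforward network
4. Add & Norm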


### 🔑 Key Details:
- **Self-attention is unmasked** → each token can attend to all others
- **The same layer structure is stacked N times**, with separate weights per layer ($N = 6$ in the original paper)
- Positional information is added at the embedding level before the first layer
- Outputs a sequence of vectors: one per input token

---

## 📤 Decoder Overview

The decoder generates an output sequence **token by token**, using:
- Previously generated tokens (causal masked self-attention)
- The full encoder output (via cross-attention)

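### 💡 Structure of Each Decoder Layer:

1. Masked multi-head self-attention
2. Add & Norm
3. Cross-attention over the encoder output
4. Add & Norm
5. Position-wise feedforward network
6. Add & Norm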


### 🔑 Key Details:
- **Masked self-attention** prevents the model from looking ahead
- **Cross-attention** uses keys/values from encoder output and queries from the decoder
- Also stacked N times
- Final decoder output is projected to vocab logits → softmax → next token
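
A minimal sketch of the causal mask, assuming a hand-written 3x3 matrix of already-scaled scores. The function name `causal_attention_weights` is hypothetical. Future positions are set to negative infinity before softmax, so they receive zero weight:

```python
import math

def causal_attention_weights(scores):
    """Softmax each row after masking out future positions (j > i)."""
    weights = []
    for i, row in enumerate(scores):
        masked = [s if j <= i else -math.inf for j, s in enumerate(row)]
        peak = max(masked)  # finite, because position i itself is never masked
        exps = [math.exp(s - peak) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical scaled scores for a 3-token sequence.
scores = [[0.5, 1.0, 2.0],
          [1.0, 1.0, 0.0],
          [0.0, 0.0, 0.0]]

weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

The first token can only attend to itself, the second to the first two, and so on, which is what keeps generation from looking ahead.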

---

## 🧠 Summary: Encoder vs Decoder

| Component     | Encoder                        | Decoder                            |
|---------------|--------------------------------|-------------------------------------|
| Attention 1   | Self-attention (unmasked)      | Masked self-attention (causal)     |
| Attention 2   | ❌                              | Cross-attention (with encoder out) |
| Feedforward   | ✅ Yes                         | ✅ Yes                              |
| Add & Norm    | ✅ Yes (twice per layer)        | ✅ Yes (three times per layer)      |
| Positional Info | Added at input               | Added at input                      |
| Output        | Contextual token vectors       | Next token logits                   |

---

## ✅ Final Output Flow (Decoder Side)

1. Decoder produces contextual token representations
2. Projected to vocabulary size:  
   $$ \text{logits} = \text{decoder\_output} \cdot W_{\text{vocab}} + b $$
3. Softmax is applied → pick next token
4. Append token → repeat until `<EOS>` or max length
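
The loop above can be sketched with a toy stand-in for the model. Everything here is assumed for illustration: `VOCAB` is a made-up four-token vocabulary, and `toy_logits` is a hypothetical replacement for the decoder plus vocabulary projection that simply favors the next token in the list:

```python
import math

VOCAB = ["<EOS>", "a", "b", "c"]

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_logits(token_ids):
    """Hypothetical stand-in for decoder output projected to vocabulary logits."""
    last = token_ids[-1] if token_ids else 0   # treat an empty prefix like a start token
    target = (last + 1) % len(VOCAB)
    return [5.0 if i == target else 0.0 for i in range(len(VOCAB))]

def greedy_decode(max_length=10):
    token_ids = []
    while len(token_ids) < max_length:
        probs = softmax(toy_logits(token_ids))                     # steps 1-3
        next_id = max(range(len(probs)), key=probs.__getitem__)    # argmax
        token_ids.append(next_id)                                  # step 4
        if VOCAB[next_id] == "<EOS>":
            break
    return [VOCAB[i] for i in token_ids]

decoded = greedy_decode()
print(decoded)
```

Greedy argmax is only one choice; sampling from `probs` would slot into the same loop.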

---




# 🧱 Transformer Architecture — Encoder & Decoder Notes (with Sequential Steps)

This breakdown covers the structure, flow, and sequential processing steps of the Transformer **encoder** and **decoder**, as introduced in *“Attention Is All You Need.”*

---

## 🔁 Encoder Overview

The encoder processes the **entire input sequence** in parallel and transforms each token into a **contextual embedding**.

### ✅ Sequential Steps (Step 1 Once, Steps 2–5 Per Encoder Layer):

1. **Token Embedding + Positional Encoding**  
   Input tokens are embedded into vectors, and position encodings are added:  
   $$
   X = \text{TokenEmbedding} + \text{PositionalEncoding}
   $$

2. **Multi-Head Self-Attention**  
   - Compute $Q, K, V$ for each token  
   - Perform scaled dot-product attention across all tokens (no mask)  
   - Outputs new token representations with context

3. **Add & LayerNorm**  
   $$
   \text{Norm}_1 = \text{LayerNorm}(X + \text{SelfAttention}(X))
   $$

4. **Feedforward Network (FFN)**  
   Apply a 2-layer MLP to each token independently (ReLU in the original paper; later models such as BERT and GPT use GELU):  
   $$
   \text{FFN}(x) = \text{ReLU}(xW_1 + b_1)W_2 + b_2
   $$

5. **Add & LayerNorm**  
   $$
   \text{Output} = \text{LayerNorm}(\text{Norm}_1 + \text{FFN}(\text{Norm}_1))
   $$

Repeat steps 2–5 $N$ times (e.g., $N = 6$), feeding each layer's output into the next layer.
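
A minimal sketch of one encoder layer and a small stack, under several simplifying assumptions: a single attention head, no biases, layer normalization without learned gain or offset, ReLU in the feedforward network as in the original paper, and random stand-in weights. The function names and the `params` dictionary layout are made up for this example:

```python
import math
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def self_attention(x, w_q, w_k, w_v):
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    scores = [[sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_k) for kj in k] for qi in q]
    return matmul(softmax_rows(scores), v)

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def ffn(x, w1, w2):
    hidden = [[max(0.0, h) for h in row] for row in matmul(x, w1)]  # ReLU
    return matmul(hidden, w2)

def encoder_layer(x, params):
    attn = self_attention(x, params["w_q"], params["w_k"], params["w_v"])
    norm1 = layer_norm(add(x, attn))                                   # Add & LayerNorm
    return layer_norm(add(norm1, ffn(norm1, params["w1"], params["w2"])))  # Add & LayerNorm

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n, d_model, d_ff, num_layers = 3, 4, 8, 2
layers = [
    {"w_q": random_matrix(d_model, d_model, rng), "w_k": random_matrix(d_model, d_model, rng),
     "w_v": random_matrix(d_model, d_model, rng), "w1": random_matrix(d_model, d_ff, rng),
     "w2": random_matrix(d_ff, d_model, rng)}
    for _ in range(num_layers)
]

x = random_matrix(n, d_model, rng)  # stands in for token embedding + positional encoding
for params in layers:               # each layer has its own weights
    x = encoder_layer(x, params)
print(len(x), "x", len(x[0]))
```

The output keeps the input shape, which is what allows layers to be stacked.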

---

## 📤 Decoder Overview

The decoder generates the output **one token at a time**, attending to:
- **Previously generated tokens** (masked self-attention)
- **Encoder output** (via cross-attention)

### ✅ Sequential Steps (Step 1 Once, Steps 2–7 Per Decoder Layer):

1. **Token Embedding + Positional Encoding**  
   Decoder input is embedded and positionally encoded

2. **Masked Multi-Head Self-Attention**  
   - Causal mask applied so each token only sees itself and earlier tokens  
   - Compute attention over the generated sequence so far

3. **Add & LayerNorm**  
   Residual + normalization as usual

4. **Cross-Attention (Encoder-Decoder Attention)**  
   - Decoder queries attend to encoder output (keys/values)  
   - Lets decoder align output with relevant input context

5. **Add & LayerNorm**  
   Another residual + normalization

6. **Feedforward Network (MLP)**  
   Same structure as encoder (ReLU in the original paper; later models such as BERT and GPT use GELU):  
   $$
   \text{FFN}(x) = \text{ReLU}(xW_1 + b_1)W_2 + b_2
   $$

7. **Add & LayerNorm**  
   Final norm and residual

Repeat steps 2–7 $N$ times, then:

8. **Final Output Projection**  
   Project last-layer decoder output to vocabulary size:  
   $$
   \text{logits} = \text{DecoderOutput} \cdot W_{\text{vocab}} + b
   $$

9. **Softmax & Sampling**  
   Convert logits to probabilities → sample or argmax next token

10. **Append next token & repeat** until `<EOS>` or max length

---

## 🔁 Summary Table

| Step | Encoder | Decoder |
|------|---------|---------|
| 1    | Embed input + add position | Embed output so far + add position |
| 2    | Self-attention (unmasked) | Masked self-attention |
| 3    | Add & Norm | Add & Norm |
| 4    | Feedforward | Cross-attention (to encoder output) |
| 5    | Add & Norm | Add & Norm |
| 6    | — | Feedforward |
| 7    | — | Add & Norm |
| 8    | Repeat $N$ layers | Repeat $N$ layers |
| 9    | — | Project to vocab + softmax |
| 10   | — | Predict next token & repeat |

---
