<a href="https://colab.research.google.com/github/karankulshrestha/ai-notebooks/blob/main/concepts_attention_is_all_you_need.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [39]:
SENTENCE = "AI IS FUTURE"
D_MODEL = 512

In [40]:
import torch
import math

input_ids = torch.tensor([10, 20, 30])

### Step 2: The Embedding Matrix (The First Weights)

The transformer model starts by converting each token ID into a dense vector using a large **lookup table** called the **Embedding Matrix**. This matrix contains trainable weights that the model learns during training.

- **Rows**: Equal to the vocabulary size (e.g., 10,000 different tokens/words).
- **Columns**: Equal to the model dimension $d_{\text{model}}$ (e.g., 512 in the original Transformer – here we use a small value like 4 for illustration).

#### Shape of the Embedding Matrix

`[vocab_size, d_model]`, for example `[10000, 512]`. In the code below we use `[100, 4]`.

#### How it works
For each input token ID (an integer), the model simply **looks up** ("plucks") the corresponding row from this matrix. That row becomes the initial vector representation of the token.

**Example** (with tiny numbers for clarity):

| Token ID | Token     | Embedding Vector (d_model = 4)      |
|----------|-----------|-------------------------------------|
| 0        | \<pad\>   | [0.12, -0.45, 0.67, 0.23]          |
| 1        | hello     | [-0.34, 0.89, -0.12, 0.56]         |
| 2        | world     | [0.78, -0.23, 0.45, -0.89]         |
| ...      | ...       | ...                                 |
| 9999     | !         | [0.01, 0.34, -0.67, 0.12]          |

So, if the input token is "hello" (ID = 1), the model grabs the vector `[-0.34, 0.89, -0.12, 0.56]` directly from the embedding matrix.

This simple lookup is the very first operation in the model and turns discrete token IDs into continuous vectors that the rest of the transformer can work with. During training, these embedding weights are updated so the model learns meaningful representations for each token.
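
To make the lookup concrete, the sketch below checks that plain row indexing gives the same result as multiplying a one-hot matrix by the embedding weights. The names `W` and `ids` and the tiny sizes are illustrative assumptions, not values taken from this notebook.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, d_model = 10, 4            # illustrative sizes (assumed)
W = torch.randn(vocab_size, d_model)   # stand-in embedding matrix
ids = torch.tensor([1, 2, 1])          # three token IDs, one of them repeated

# Lookup: pluck rows 1, 2, 1 from W
looked_up = W[ids]                     # shape [3, 4]

# Equivalent view: each one-hot row selects a single row of W via matmul
one_hot = F.one_hot(ids, num_classes=vocab_size).float()  # shape [3, 10]
via_matmul = one_hot @ W               # shape [3, 4]

print(torch.allclose(looked_up, via_matmul))
```

Indexing is what frameworks do in practice because it avoids building the large one-hot matrix; the matmul form is only useful for seeing why the rows behave like ordinary trainable weights.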

In [41]:
# Weights: [Vocab Size=100, d_model=4]
W_embedding = torch.randn(100, 4)

# Operation: Select rows 10, 20, and 30
# Shape becomes: [3, 4] (3 words, each is a vector of size 4)
x = W_embedding[input_ids]

# Scale the embeddings by sqrt(d_model), as in the paper
x = x * math.sqrt(4)

## Step 3: Positional Encoding (Adding Order)

Self-attention is built from matrix operations that treat the input rows as an unordered set, so the model has no inherent notion that "AI" appears before "is" in the sequence. We must encode the order of the tokens mathematically.

# Positional Encoding Formula

For a model with dimension $d_{model} = 4$, we use two indices to cover the columns: $2i$ (even columns) and $2i+1$ (odd columns).

$$\begin{aligned}
PE_{(pos, 2i)} &= \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right) \\
PE_{(pos, 2i+1)} &= \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)
\end{aligned}$$

Where:
* $pos$ is the position of the word (0, 1, 2).
* $i$ is the index of the frequency pair (0, 1).
* $d_{model}$ is 4.

### The Solution

We create a fixed matrix of the same shape `[3, 4]` containing sine and cosine waves, then add it element-wise to our input embeddings.

### Why This Works

The sine and cosine functions with different frequencies create unique positional signatures for each position in the sequence, allowing the model to distinguish between tokens based on their location while maintaining mathematical properties that help with learning.

# Positional Encoding Calculation for Position 1

Since the example matrix is $3 \times 4$ (3 positions, 4 dimensions), let's work through the sinusoidal positional encoding for Position 1 (the middle row). This shows exactly how $i$ (the frequency pair index) drives the values in that row.

## The Inputs

- Current Position ($pos$): 1
- Model Dimension ($d_{model}$): 4
- Frequency Pairs ($i$): Since we have 4 columns, we have 2 pairs.
  - $i = 0$ (controls columns 0 & 1)
  - $i = 1$ (controls columns 2 & 3)

## Step 1: Calculate for Pair $i = 0$ (Columns 0 & 1)

**1. Calculate the Divisor (Frequency)**

The formula for the divisor is $10000^{\frac{2i}{d_{model}}}$.

$$\text{Divisor} = 10000^{\frac{2(0)}{4}} = 10000^{0} = \mathbf{1}$$

**2. Calculate the Values**

- Column 0 (Sine): $\sin(\frac{pos}{\text{divisor}}) = \sin(\frac{1}{1}) = \sin(1) \approx \mathbf{0.84}$
- Column 1 (Cosine): $\cos(\frac{pos}{\text{divisor}}) = \cos(\frac{1}{1}) = \cos(1) \approx \mathbf{0.54}$

**Result for $i=0$:** [0.84, 0.54]

## Step 2: Calculate for Pair $i = 1$ (Columns 2 & 3)

**1. Calculate the Divisor (Frequency)**

$$\text{Divisor} = 10000^{\frac{2(1)}{4}} = 10000^{\frac{2}{4}} = 10000^{0.5} = \sqrt{10000} = \mathbf{100}$$

**2. Calculate the Values**

- Column 2 (Sine): $\sin(\frac{pos}{\text{divisor}}) = \sin(\frac{1}{100}) = \sin(0.01) \approx \mathbf{0.01}$
- Column 3 (Cosine): $\cos(\frac{pos}{\text{divisor}}) = \cos(\frac{1}{100}) = \cos(0.01) \approx \mathbf{1.00}$

**Result for $i=1$:** [0.01, 1.00]

## Step 3: The Final Row

Combining the results from $i=0$ and $i=1$, the actual positional encoding vector for Position 1 is:

$$[0.84, 0.54, 0.01, 1.00]$$
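
The hand calculation above can be reproduced for every position at once. The following is a minimal sketch of the sinusoidal formula; the function name `sinusoidal_positions` is a hypothetical helper introduced here, and it assumes an even `d_model`.

```python
import torch

def sinusoidal_positions(seq_len: int, d_model: int) -> torch.Tensor:
    """Return a [seq_len, d_model] matrix of sinusoidal positional encodings.

    Hypothetical helper for illustration; assumes d_model is even.
    """
    pe = torch.zeros(seq_len, d_model)
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # [seq_len, 1]
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)       # 2i = 0, 2, ...
    divisor = 10000.0 ** (two_i / d_model)                         # [d_model / 2]
    pe[:, 0::2] = torch.sin(pos / divisor)  # even columns get the sine
    pe[:, 1::2] = torch.cos(pos / divisor)  # odd columns get the cosine
    return pe

pe = sinusoidal_positions(seq_len=3, d_model=4)
print(pe[1])  # the row for position 1
```

Row 1 of `pe` should agree with the hand-computed vector $[0.84, 0.54, 0.01, 1.00]$ up to rounding, and row 0 is $[0, 1, 0, 1]$ because $\sin(0) = 0$ and $\cos(0) = 1$.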

## Comparison to the Demo Values

The code cell below uses a simplified dummy row for Position 1: [0.3, 0.4, 0.3, 0.4].

Notice the difference:

- **Dummy row:** The values repeat (0.3, 0.4 then 0.3, 0.4). This would imply the same frequency (divisor) for both pairs.
- **Actual math:** The values differ sharply between pairs (0.84, 0.54 vs 0.01, 1.00). This is because $i$ increased, making the divisor larger ($1 \rightarrow 100$), which lowers the frequency of the wave.

In [42]:
# Fixed matrix of positions (simplified for demo)
# Shape: [3, 4]
position_matrix = torch.tensor([
    [0.1, 0.2, 0.1, 0.2],  # Position 0
    [0.3, 0.4, 0.3, 0.4],  # Position 1
    [0.5, 0.6, 0.5, 0.6]   # Position 2
])

# Operation: Element-wise addition
# Shape remains: [3, 4]
x = x + position_matrix

## Step 4: Creating Query, Key, Value (The "Projections")

Now we enter the **Self-Attention mechanism**—this is the core "brain."

### Three Distinct Views

We need three different perspectives of our data:

1. **Query** ($Q$): What I am looking for
2. **Key** ($K$): What I contain (for others to search)
3. **Value** ($V$): My actual content

We create these by multiplying our input $x$ by three different weight matrices ($W_Q$, $W_K$, $W_V$).

Each projection transforms the same input into a specialized representation that serves a specific purpose in the attention mechanism.

In [43]:
# Randomly initialized weight matrices
# Shape: [d_model=4, d_model=4]
W_q = torch.randn(4, 4)
W_k = torch.randn(4, 4)
W_v = torch.randn(4, 4)

# Operation: Matrix Multiplication (Dot Product)
# Equation: Q = x @ W_q
Q = x @ W_q  # Shape: [3, 4]
K = x @ W_k  # Shape: [3, 4]
V = x @ W_v  # Shape: [3, 4]

## Step 5: The Attention Scores (Relationships)

We want to find out how much the word "Future" (Query) relates to "AI" (Key). We do this by calculating the **Dot Product** between every Query and every Key.

### The Operation

$$Q \times K^T$$

We transpose $K$ to align the dimensions properly for matrix multiplication.

```python
# Q shape: [3, 4]
# K shape: [3, 4]
# K^T shape: [4, 3]

# Matrix multiplication: [3, 4] @ [4, 3] = [3, 3]
attention_scores = Q @ K.T

# Result: A [3, 3] matrix where each cell (i, j) represents
# how much token i (query) relates to token j (key)
```

### Understanding the Result

The resulting `[3, 3]` matrix contains similarity scores:
- Each row represents one query (one word asking "who should I pay attention to?")
- Each column represents one key (one word being searched)
- Higher scores indicate stronger relationships between tokens
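
As a quick sanity check on how to read the score matrix, the sketch below confirms that cell $(i, j)$ is just the dot product of query row $i$ with key row $j$. The random `Q` and `K` here are stand-ins, not the tensors computed earlier in the notebook.

```python
import torch

torch.manual_seed(0)

Q = torch.randn(3, 4)  # stand-in queries, one row per token
K = torch.randn(3, 4)  # stand-in keys, one row per token

scores = Q @ K.T       # shape [3, 3]

# Query = third token ("future"), key = first token ("AI")
i, j = 2, 0
single = torch.dot(Q[i], K[j])

print(torch.allclose(scores[i, j], single, atol=1e-5))
```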

In [44]:
# Transpose K: Swaps rows and columns
K_transpose = K.t() # Shape: [4, 3]

# Operation: Matrix Multiply
# Shape: [3, 4] @ [4, 3] -> [3, 3]
scores = Q @ K_transpose

# The result is a 3x3 grid showing the relationship of every word to every other word.
# Row 1 is "AI", Row 2 is "is", Row 3 is "future".

## Step 6: Scaling and Masking (The "Decoder" Logic)

### Scaling: Preventing Gradient Issues

We divide by the square root of the dimension size ($\sqrt{d_k}$) to stop the numbers from getting too large and killing the gradients during backpropagation.

```python
# d_k = 4 in our example
d_k = 4

# Scale the attention scores
attention_scores = attention_scores / math.sqrt(d_k)
```

**Why this matters:** Without scaling, large dot products push the softmax function into regions with extremely small gradients, making learning difficult.
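
The effect of the scaling factor can be checked empirically. The sketch below assumes query and key components are independent with mean 0 and variance 1 (the same assumption the paper makes); the sample count and `d_k = 512` are arbitrary choices for illustration.

```python
import math
import torch

torch.manual_seed(0)

d_k = 512
n_samples = 10_000

# Components drawn i.i.d. from N(0, 1), an assumption for this demo
q = torch.randn(n_samples, d_k)
k = torch.randn(n_samples, d_k)

dots = (q * k).sum(dim=-1)         # one dot product per sample
scaled = dots / math.sqrt(d_k)     # the scaling used in attention

print("variance before scaling:", dots.var().item())    # close to d_k
print("variance after scaling: ", scaled.var().item())  # close to 1
```

Under that assumption the unscaled variance grows linearly with $d_k$, so dividing by $\sqrt{d_k}$ brings the scores back to roughly unit variance before the softmax.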

---

### Masking: Enforcing Causality

This is **crucial for GPT/Decoder models**. We cannot let the first word see the third word (because the third word is "the future"). We force those scores to be $-\infty$.

```python
# Create a causal mask: True strictly above the diagonal (the future positions)
# Shape: [3, 3]
mask = torch.triu(torch.ones(3, 3), diagonal=1).bool()

# Apply mask: Set future positions to -inf
attention_scores = attention_scores.masked_fill(mask, float('-inf'))

# Result looks like:
# [[score, -inf, -inf],   # Token 0 can only see itself
#  [score, score, -inf],  # Token 1 can see 0 and 1
#  [score, score, score]] # Token 2 can see all previous tokens
```

**The Logic:** When we apply softmax in the next step, $e^{-\infty} = 0$, effectively zeroing out attention to future tokens. This ensures the model can only attend to previous and current positions—maintaining the autoregressive property that makes GPT a valid language model.
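
The same causal mask can be written in two ways, and both appear in this notebook: `triu(..., diagonal=1)` marks the blocked cells directly, while `tril(...) == 0` marks everything outside the allowed lower triangle. A small sketch, with all-zero scores chosen only to make the softmax result easy to read:

```python
import torch

seq_len = 3

# Style 1: True strictly above the diagonal (blocked positions)
blocked_triu = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

# Style 2: lower triangle is allowed, so "== 0" marks the blocked positions
blocked_tril = torch.tril(torch.ones(seq_len, seq_len)) == 0

print(torch.equal(blocked_triu, blocked_tril))

# With equal raw scores, each token spreads attention evenly over what it may see
scores = torch.zeros(seq_len, seq_len)
masked = scores.masked_fill(blocked_triu, float("-inf"))
weights = torch.softmax(masked, dim=-1)
print(weights)
```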

In [45]:
d_k = 4
scores = scores / math.sqrt(d_k)
# triu(..., diagonal=1) keeps only the entries strictly above the main diagonal,
# so the mask is True for future positions and False on and below the diagonal
mask = torch.triu(torch.ones(3, 3), diagonal=1).bool()
scores = scores.masked_fill(mask, float('-inf'))

# Result Ex:
# [ 0.5, -inf, -inf ] -> Word 1 only sees Word 1
# [ 0.2,  0.8, -inf ] -> Word 2 sees Word 1 and 2
# [ 0.1,  0.4,  0.9 ] -> Word 3 sees all

scores

tensor([[-7.5409,    -inf,    -inf],
        [ 3.9556,  5.2303,    -inf],
        [-7.4488,  4.2887, 10.1782]])

## Step 7: Softmax (Probabilities)

We convert these raw scores into **probabilities** that sum to 1. This tells the model "pay 80% attention to word X and 20% to word Y".

### The Transformation

```python
# Apply softmax across the last dimension (each row independently)
# Shape: [3, 3] -> [3, 3]
attention_weights = torch.softmax(scores, dim=-1)

# Each row now sums to 1.0
# Example result:
# [[1.0,  0.0,  0.0 ],   # Token 0: 100% attention to itself
#  [0.3,  0.7,  0.0 ],   # Token 1: 30% to token 0, 70% to itself
#  [0.1,  0.2,  0.7 ]]   # Token 2: distributed across all three tokens
```

### What This Means

Each row represents an **attention distribution** for one token:
- Values are between 0 and 1
- Each row sums to exactly 1.0
- Higher values mean "pay more attention to this token"
- The masking from Step 6 ensures future tokens have 0% attention (due to $e^{-\infty} = 0$)

This probability distribution determines how much information each token will gather from other tokens in the next step.
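
To see what softmax does numerically, the sketch below applies it by hand to the masked scores printed by the Step 6 cell and compares the result with `torch.softmax`. Subtracting the row maximum first is a common stability trick and is an addition here, not something the notebook does explicitly.

```python
import torch

NEG_INF = float("-inf")

# Masked, scaled scores copied from the Step 6 output
scores = torch.tensor([
    [-7.5409, NEG_INF, NEG_INF],
    [ 3.9556,  5.2303, NEG_INF],
    [-7.4488,  4.2887, 10.1782],
])

# Manual softmax: shift by the row max, exponentiate, normalize
shifted = scores - scores.max(dim=-1, keepdim=True).values
exp = torch.exp(shifted)                    # exp(-inf) becomes exactly 0
manual = exp / exp.sum(dim=-1, keepdim=True)

builtin = torch.softmax(scores, dim=-1)

print(manual)
print(torch.allclose(manual, builtin))
```

The second row comes out at roughly 22% for the first token and 78% for the second, and every masked cell is exactly 0.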

In [46]:
# Operation: Softmax
attention_weights = torch.nn.functional.softmax(scores, dim=-1)

# Now every row sums to 1.0

## Step 8: Weighted Sum (The Output of Attention)

Now we create the **new representation** of each word. We multiply our probabilities (Attention Weights) by the actual content ($V$).

### The Operation

$$\text{Output} = \text{Weights} \times V$$

```python
# attention_weights shape: [3, 3]
# V shape: [3, 4]

# Matrix multiplication: [3, 3] @ [3, 4] = [3, 4]
attention_output = attention_weights @ V

# Result: [3, 4] - same shape as our original input
# Each token now contains information from all tokens it attended to
```

### What Just Happened

Each token's new representation is a **weighted combination** of all the Value vectors it was allowed to attend to:

- **Token 0**: Gets 100% of its own value (due to masking)
- **Token 1**: Gets 30% of token 0's value + 70% of its own value
- **Token 2**: Gets 10% of token 0 + 20% of token 1 + 70% of its own value

The attention weights act as a **mixing recipe**, determining how much of each token's content to blend together. This is where the model learns relationships and context between words.
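
The "mixing recipe" can be verified directly. This sketch reuses the illustrative weights from the text (they are example numbers, not the notebook's actual softmax output) with a random stand-in for $V$.

```python
import torch

torch.manual_seed(0)

# Example attention weights from the explanation above
weights = torch.tensor([
    [1.0, 0.0, 0.0],
    [0.3, 0.7, 0.0],
    [0.1, 0.2, 0.7],
])
V = torch.randn(3, 4)  # stand-in value vectors, one row per token

out = weights @ V      # shape [3, 4]

# Row 1 of the output is 30% of token 0's value plus 70% of token 1's value
row1_by_hand = 0.3 * V[0] + 0.7 * V[1]

print(torch.allclose(out[1], row1_by_hand, atol=1e-6))
print(torch.allclose(out[0], V[0]))  # token 0 keeps only its own value
```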

In [47]:
# Shape: [3, 3] @ [3, 4] -> [3, 4]
attention_output = attention_weights @ V

## Step 9: Feed Forward Network (The "Processing")

The attention mechanism just gathered information. Now the **Feed Forward Network (FFN)** processes it.

The FFN is simply two linear transformations with a ReLU activation (removing negatives) in the middle.

### The Mathematical Formula

$$\text{FFN}(x) = \text{ReLU}(xW_1 + b_1)W_2 + b_2$$

### The Implementation

```python
# Typical FFN expands then contracts
# d_model = 4, d_ff = 16 (usually 4x larger)

# First layer weights and bias
W_1 = torch.randn(4, 16)  # Expand: [4] -> [16]
b_1 = torch.randn(16)

# Second layer weights and bias
W_2 = torch.randn(16, 4)  # Contract: [16] -> [4]
b_2 = torch.randn(4)

# Forward pass
# Step 1: Linear transformation + expand
hidden = attention_output @ W_1 + b_1  # [3, 4] @ [4, 16] = [3, 16]

# Step 2: ReLU activation (zero out negatives)
hidden = torch.relu(hidden)  # [3, 16] -> [3, 16]

# Step 3: Linear transformation + contract back
ffn_output = hidden @ W_2 + b_2  # [3, 16] @ [16, 4] = [3, 4]
```

### What This Does

The FFN provides **position-wise processing**:
- **Expand**: Projects to a higher dimension ($d_{ff} = 4 \times d_{model}$) to increase representational capacity
- **ReLU**: Introduces non-linearity, allowing the network to learn complex patterns
- **Contract**: Projects back to the original dimension ($d_{model}$)

Each token is processed **independently** through the same FFN, allowing the model to refine the representations after attention has mixed information between tokens.
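
"Position-wise" means that running the FFN on the whole `[3, 4]` matrix gives the same answer as running it on each row separately. A minimal sketch with random weights; the helper name `ffn` is introduced here only for illustration.

```python
import torch

torch.manual_seed(0)

d_model, d_ff = 4, 16
W1, b1 = torch.randn(d_model, d_ff), torch.zeros(d_ff)
W2, b2 = torch.randn(d_ff, d_model), torch.zeros(d_model)

def ffn(x: torch.Tensor) -> torch.Tensor:
    """Two linear maps with a ReLU in between (hypothetical helper)."""
    return torch.relu(x @ W1 + b1) @ W2 + b2

x = torch.randn(3, d_model)  # stand-in for the attention output

all_at_once = ffn(x)                                # [3, 4]
one_by_one = torch.stack([ffn(row) for row in x])   # each token alone, then stacked

print(torch.allclose(all_at_once, one_by_one, atol=1e-5))
```

Because no row influences any other row here, all mixing between tokens has to come from the attention step.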

In [48]:
W1 = torch.randn(4, 16)
b1 = torch.zeros(16)
W2 = torch.randn(16, 4)
b2 = torch.zeros(4)


# 1. First Expansion
# Shape: [3, 4] @ [4, 16] -> [3, 16]
hidden = attention_output @ W1 + b1

# 2. ReLU Activation (Non-linearity)
hidden = torch.relu(hidden)

# 3. Projection Back
# Shape: [3, 16] @ [16, 4] -> [3, 4]
layer_output = hidden @ W2 + b2

## Step 10: Final Prediction (Logits)

- After passing through $N$ layers of the above logic (repeating Steps 4-9), we project the final vector back to the vocabulary size to predict the next word.
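
When generating text with a decoder, only the last position's logits are used to pick the next token, since the earlier rows predict tokens that are already known. The sketch below illustrates this together with the effect of temperature; `hidden` is a random stand-in for the final layer output, and the helper `probs_at` and the temperature values are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, d_model = 100, 4
hidden = torch.randn(3, d_model)               # stand-in for the final layer output
W_unembed = torch.randn(d_model, vocab_size)

logits = hidden @ W_unembed                    # [3, 100], one row per position
next_token_logits = logits[-1]                 # [100], the last position only

def probs_at(temperature: float) -> torch.Tensor:
    """Softmax over the vocabulary at a given temperature (hypothetical helper)."""
    return F.softmax(next_token_logits / temperature, dim=-1)

sharp = probs_at(0.5)  # lower temperature: more mass on the top token
flat = probs_at(2.0)   # higher temperature: a flatter distribution

print("greedy token id:", torch.argmax(next_token_logits).item())
print("top probability at T=0.5:", sharp.max().item())
print("top probability at T=2.0:", flat.max().item())
```

Temperature rescales the logits but never changes which token has the highest probability, so it only matters when sampling rather than taking the argmax.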

In [49]:
# Unembedding Matrix (Output Head)
# Shape: [4, 100] (Projects back to vocab size)
W_unembed = torch.randn(4, 100)

# Operation: Matrix Multiply
# Shape: [3, 4] @ [4, 100] -> [3, 100]
logits = layer_output @ W_unembed

temperature = 0.7
scaled_logits = logits / temperature


import torch.nn.functional as F

probs = F.softmax(scaled_logits, dim=-1)

print("Probabilities:", probs)

predicted_token_ids = torch.argmax(probs, dim=-1)

print("Predicted Token IDs:", predicted_token_ids)

Probabilities: tensor([[0.0000e+00, 2.7273e-04, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         ...
[output truncated]

## Add & Norm: Basic Principles

In [50]:
# Attention Output
# [[ 0.5,  0.5, -1.0 ],   <- Update for Word 1
# [ 0.2, -0.2,  0.0 ]]   <- Update for Word 2

# Define the tensors
x = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
attn_out = torch.tensor([[0.5, 0.5, -1.0], [0.2, -0.2, 0.0]])

# THE RESIDUAL ADDITION
added_x = x + attn_out

print("After Add:")
print(added_x)

After Add:
tensor([[1.5000, 2.5000, 2.0000],
        [4.2000, 4.8000, 6.0000]])


## Step 2: The "Norm" (Layer Normalization)

We normalize each word (row) independently. We want every word vector to have a **Mean** of $\approx 0$ and a **Standard Deviation** of $\approx 1$.

---

### 2a. Calculate Mean & Variance (Per Row)

#### Analyzing Row 1: `[1.5, 2.5, 2.0]`

**1. Mean:**
$$\frac{1.5 + 2.5 + 2.0}{3} = 2.0$$

**2. Variance:** How far is each number from the mean (2.0)?
- $(1.5 - 2.0)^2 = (-0.5)^2 = 0.25$
- $(2.5 - 2.0)^2 = (0.5)^2 = 0.25$
- $(2.0 - 2.0)^2 = (0.0)^2 = 0.00$
- Variance (average of the squared deviations) = $0.50 / 3 \approx 0.1667$

**3. Standard Deviation:**
$$\sqrt{0.1667} \approx 0.408$$

---

#### Analyzing Row 2: `[4.2, 4.8, 6.0]`

**1. Mean:** $5.0$

**2. Variance:** $\frac{(-0.8)^2 + (-0.2)^2 + (1.0)^2}{3} = \frac{1.68}{3} = 0.56$

**3. Standard Deviation:** $\sqrt{0.56} \approx 0.748$

---

### 2b. Normalize (Shift & Scale)

**Formula:**
$$\frac{x - \text{mean}}{\text{std\_dev} + \epsilon}$$

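We add a tiny $\epsilon$ (like $10^{-5}$) to avoid dividing by zero. Library implementations such as `torch.nn.LayerNorm` place $\epsilon$ inside the square root, $\sqrt{\text{variance} + \epsilon}$, and then apply a learnable scale and shift; the simplified form used here gives nearly the same numbers.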

---

#### Normalizing Row 1:

- Value 1: $\frac{1.5 - 2.0}{0.408} = -1.22$
- Value 2: $\frac{2.5 - 2.0}{0.408} = +1.22$
- Value 3: $\frac{2.0 - 2.0}{0.408} = 0.00$

---

#### Normalizing Row 2:

- Value 1: $\frac{4.2 - 5.0}{0.748} = -1.07$
- Value 2: $\frac{4.8 - 5.0}{0.748} = -0.27$
- Value 3: $\frac{6.0 - 5.0}{0.748} \approx +1.34$

---

In [51]:
# 1. Calculate Mean (keepdim=True ensures we get [[mean1], [mean2]])
mean = added_x.mean(dim=-1, keepdim=True)
# Result: [[2.0], [5.0]]

# 2. Calculate Std Dev
std = added_x.std(dim=-1, keepdim=True, unbiased=False)
# Result: [[0.4082], [0.7483]]

# 3. Normalize
normalized_x = (added_x - mean) / (std + 1e-5)

print("After Normalization:")
print(normalized_x)

After Normalization:
tensor([[-1.2247,  1.2247,  0.0000],
        [-1.0690, -0.2673,  1.3363]])
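
The manual normalization above can be cross-checked against PyTorch's built-in layer normalization. This sketch assumes no learnable scale or shift (the defaults of `F.layer_norm`), which matches the simplified formula used in this section.

```python
import torch
import torch.nn.functional as F

added_x = torch.tensor([[1.5, 2.5, 2.0],
                        [4.2, 4.8, 6.0]])

# Manual version, as in the cell above: epsilon is added to the std
mean = added_x.mean(dim=-1, keepdim=True)
std = added_x.std(dim=-1, keepdim=True, unbiased=False)
manual = (added_x - mean) / (std + 1e-5)

# Built-in version: epsilon is added to the variance, inside the square root
builtin = F.layer_norm(added_x, normalized_shape=(3,), eps=1e-5)

print(manual)
print(builtin)
print(torch.allclose(manual, builtin, atol=1e-4))
```

The two placements of $\epsilon$ differ only in the fifth decimal place or so for these inputs, which is why the comparison uses a small tolerance rather than exact equality.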


# Multi-Head Concept

Multi-Head Attention changes the shape of the data so we run several smaller attention mechanisms in parallel.

## The Goal

We want to split our d_model (4) into 2 Heads of size 2.

- Head 1 works on the first 2 numbers of each projected vector (it might learn something like "grammar").
- Head 2 works on the last 2 numbers of each projected vector (it might learn something like "meaning").

In [62]:
x = torch.randn(1, 3, 4)


d_model = 4
n_head = 2
head_dim = d_model // n_head # 4 // 2 = 2

# Weights (Total size is still 4x4)
W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)
W_o = torch.randn(d_model, d_model) # The output projection


# 1. Project to Q, K, V (Standard Matrix Multiplication)
# shape [1, 3, 4]
Q = x @ W_q
K = x @ W_k
V = x @ W_v

In [63]:
Q

tensor([[[-0.0454, -0.9407,  0.4785,  0.4870],
         [-1.8014, -0.4781, -0.9136, -1.0884],
         [-1.2597,  1.5056, -1.7328, -2.7407]]])

# Step 2: The "Split" (Reshape & Transpose) [MAGIC BEGINS]

# The Operation

1. **View (Reshape):** Split the `4` into `2` (heads) $\times$ `2` (head_dim).
   - Old Shape: `[1, 3, 4]`
   - New Shape: `[1, 3, 2, 2]`

2. **Transpose (Swap):** We want the "Head" dimension to be before the "Sequence" dimension so PyTorch treats each head as a separate batch.
   - Swap dim 1 (Seq) and dim 2 (Head).
   - Final Shape: `[1, 2, 3, 2]` $\rightarrow$ `[Batch, Heads, Seq, Head_Dim]`
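
The reshape and transpose described above do not mix any numbers; they only regroup columns. The sketch below, using a random stand-in for `Q`, checks that head 0 is exactly the first `head_dim` columns and head 1 is the last `head_dim` columns.

```python
import torch

torch.manual_seed(0)

batch, seq_len, n_head, head_dim = 1, 3, 2, 2
Q = torch.randn(batch, seq_len, n_head * head_dim)   # [1, 3, 4]

# Split the last dimension into heads, then move Heads in front of Seq
Q_heads = Q.view(batch, seq_len, n_head, head_dim).transpose(1, 2)  # [1, 2, 3, 2]

print(Q_heads.shape)
print(torch.equal(Q_heads[:, 0], Q[..., :head_dim]))  # head 0 = first 2 columns
print(torch.equal(Q_heads[:, 1], Q[..., head_dim:]))  # head 1 = last 2 columns
```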

In [64]:
Q = Q.view(1, 3, 2, 2)
K = K.view(1, 3, 2, 2)
V = V.view(1, 3, 2, 2)

# Transpose (swap Seq and Heads)
# [Batch, Seq, Heads, Head_Dim] = [1, 3, 2, 2] -> [Batch, Heads, Seq, Head_Dim] = [1, 2, 3, 2]
Q = Q.transpose(1, 2)
K = K.transpose(1, 2)
V = V.transpose(1, 2)

print("Shape after split:", Q.shape)

Shape after split: torch.Size([1, 2, 3, 2])


# Step 3: Scaled Dot-Product Attention (Parallel)

In [65]:
# 1. MatMul Q and K Transposed
# We transpose the LAST two dimensions only (Seq and Head_Dim)
# [1, 2, 3, 2] @ [1, 2, 2, 3] -> [1, 2, 3, 3]
scores = Q @ K.transpose(-2, -1)

# 2. Scale
scores = scores / math.sqrt(head_dim)

# 3. Mask (Optional but standard for Decoder)
# We apply the same mask to both heads
mask = torch.tril(torch.ones(3, 3))
scores = scores.masked_fill(mask == 0, float('-inf'))

# 4. Softmax
attn_weights = torch.softmax(scores, dim=-1)

# 5. Multiply by V
# [1, 2, 3, 3] @ [1, 2, 3, 2] -> [1, 2, 3, 2]
attn_output = attn_weights @ V

# Step 4: The "Concat" (Merging)

We need to glue the heads back together to get our original shape `[1, 3, 4]` back.

## The Operation

1. **Transpose Back:** Swap `Head` and `Seq` again. $\rightarrow$ `[1, 3, 2, 2]`
2. **Contiguous:** Copy the data into a contiguous memory layout. `.view()` requires this after a transpose; `.reshape()` would make the copy automatically when needed.
3. **View (Flatten):** Smash the last two dimensions (`2` and `2`) back into `4`.
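
The merge steps above can be traced on a tensor with easy-to-read values. In this sketch `heads` is a stand-in for the per-head attention output, filled with 0 to 11 so the regrouping is visible; it also shows `.reshape()` as an alternative to `.contiguous().view()`.

```python
import torch

# Stand-in for the attention output: [Batch, Heads, Seq, Head_Dim]
heads = torch.arange(12.0).view(1, 2, 3, 2)

# 1. Transpose back: [1, 2, 3, 2] -> [1, 3, 2, 2] (no data is copied)
swapped = heads.transpose(1, 2)
print(swapped.is_contiguous())  # the transposed tensor is no longer contiguous

# 2 + 3. Make contiguous, then flatten the last two dimensions into d_model
merged = swapped.contiguous().view(1, 3, 4)

# Alternative: reshape copies only if it has to
merged_alt = swapped.reshape(1, 3, 4)

print(merged[0, 0])  # token 0: head 0's pair followed by head 1's pair
print(torch.equal(merged, merged_alt))
```

For token 0 the merged row places head 0's two numbers first and head 1's two numbers after them, which is the concatenation that the output projection `W_o` then mixes.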

In [66]:
attn_output = attn_output.transpose(1, 2).contiguous()

final_output = attn_output.view(1, 3, d_model)

print("Shape after merge:", final_output.shape)

Shape after merge: torch.Size([1, 3, 4])


In [67]:
final_output = final_output @ W_o