# Transformers from Scratch

<img src="../images/transformer-architecture.png" alt="transformer" width="500"/>

## 1. Tokenization + embeddings
- Build char/word vocab
- Map to embeddings
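
The two steps above can be sketched as follows. This is a minimal character-level example, not a production tokenizer: the toy string `text`, the helper names `encode`/`decode`, and the width `d_model = 8` are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

text = "hello world"  # toy corpus (illustrative assumption)

# Character-level vocabulary: sorted unique characters -> integer ids.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # string -> id
itos = {i: ch for ch, i in stoi.items()}      # id -> string

def encode(s):
    """Map a string to a list of token ids (hypothetical helper)."""
    return [stoi[ch] for ch in s]

def decode(ids):
    """Map a list of token ids back to a string (hypothetical helper)."""
    return "".join(itos[i] for i in ids)

d_model = 8  # embedding width, chosen arbitrarily for this sketch
embedding = nn.Embedding(num_embeddings=len(chars), embedding_dim=d_model)

token_ids = torch.tensor(encode("hello"))  # shape [5]
token_vecs = embedding(token_ids)          # shape [5, d_model]
print(token_ids.tolist(), token_vecs.shape)
```

The embedding table is randomly initialized and learned during training; a word-level vocabulary would follow the same pattern with `text.split()` in place of iterating over characters.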

## 2. Positional encoding
- Implement sinusoidal positional encodings and visualize the curves.
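
One possible sketch of the sinusoidal encoding, following the sine/cosine formulation from the original Transformer paper. The function name `sinusoidal_positional_encoding` and the sizes `max_len=50`, `d_model=16` are assumptions for illustration, and the code assumes `d_model` is even.

```python
import math
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a [max_len, d_model] table of encodings (assumes even d_model)."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # [max_len, 1]
    # Frequencies 1 / 10000^(2i / d_model) for i = 0 .. d_model/2 - 1
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32)
        * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions use sine
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)
```

Plotting a few columns `pe[:, i]` against position (for example with matplotlib) shows sinusoids whose wavelength grows with the dimension index.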

## 3. Scaled dot-product attention

**Self-Attention** is a mechanism that captures dependencies and relationships within input sequences.

**What it does** — for each token, self-attention builds a **contextualized vector** by mixing information from **all** tokens in the sequence, weighted by how relevant they are to that token.
- It allows the model to identify and weight the importance of different parts of the input sequence by attending to itself.

<img src="../images/scaled-dot-product-attn.png" alt="scaled-dot-product-attn" width="400"/>

#### Attention equation:
$$
\operatorname{Attention}(\textbf{Q}, \textbf{K}, \textbf{V})=\underbrace{\operatorname{softmax}\left(\frac{\textbf{Q}\textbf{K}^{\top}}{\sqrt{d_k}}+\operatorname{mask}\right)}_{\text {attention weights } A} \textbf{V}
$$
- $ Q \in \mathbb{R}^{n \times d_{k}}, K \in \mathbb{R}^{n \times d_{k}}, V \in \mathbb{R}^{n \times d_{v}}$, with sequence length $ n$.
- $ A \in \mathbb{R}^{n \times n}$ has **row-wise** softmax: each row sums to 1.

**Why the scale $ \sqrt{d_{k}}$ ?**

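Dot products grow with dimension: if the components of a query and a key are independent with zero mean and unit variance, their dot product has variance $ d_{k}$. Dividing by $ \sqrt{ d_{k} }$ brings the logits back to roughly unit variance, so softmax does not saturate and its gradients do not become vanishingly small.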
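
A quick empirical check of this effect, assuming query and key entries are drawn independently from a standard normal distribution (an idealization; real activations will differ). The sample count and the `d_k` values are arbitrary choices.

```python
import torch

torch.manual_seed(0)
n_samples = 10_000  # number of random (q, k) pairs per setting

stds = {}
for d_k in (4, 64, 512):
    q = torch.randn(n_samples, d_k)
    k = torch.randn(n_samples, d_k)
    raw = (q * k).sum(dim=-1)   # unscaled dot products, one per pair
    scaled = raw / d_k ** 0.5   # same dot products after scaling
    stds[d_k] = (raw.std().item(), scaled.std().item())
    print(f"d_k={d_k:4d}  raw std={stds[d_k][0]:.2f}  scaled std={stds[d_k][1]:.2f}")
```

Under these assumptions the raw standard deviation should track $ \sqrt{d_{k}}$ while the scaled one stays close to 1 regardless of dimension.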

#### Where do $Q, K, V$ come from?
$Q, K, V$ are **learned linear projections** (not literal copies) of the same input $ H \in \mathbb{R}^{n \times d}$:

$$
Q=HW_{Q},\ \ K=HW_{K},\ \ V=HW_{V}
$$

This lets the model learn:
- **Q (Query):** what to ask.
- **K (Key):** how to index.
- **V (Value):** what to retrieve.
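
The projections can be sketched with bias-free `nn.Linear` layers. The sizes `n`, `d`, `d_k`, `d_v` and the variable names below are assumptions chosen for illustration, and the weights are randomly initialized rather than trained.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d, d_k, d_v = 4, 8, 6, 6  # illustrative sizes

H = torch.randn(n, d)  # input token representations, one row per token

# Bias-free linear layers, matching Q = H W_Q, K = H W_K, V = H W_V.
W_Q = nn.Linear(d, d_k, bias=False)
W_K = nn.Linear(d, d_k, bias=False)
W_V = nn.Linear(d, d_v, bias=False)

Q, K, V = W_Q(H), W_K(H), W_V(H)
print(Q.shape, K.shape, V.shape)
```

Note that `nn.Linear` stores its weight with shape `[out_features, in_features]` and computes `H @ weight.T`, so `W_Q.weight.T` plays the role of $W_{Q}$ in the equation.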

**Hash-table intuition (soft lookup)**
- Compare each **query** to all **keys** via dot product –> *similarity scores*.
- Softmax turns scores into a probability distribution –> *attention weights*.
- Take a weighted sum of **values** using those weights –> the output for that token.

#### Masks (what to zero out and why)
- **Padding mask:** add $ -\infty$ to logits where positions are padding so their softmax weight becomes 0.
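- **Causal mask (decoder):** add $ -\infty$ to logits at positions $ j > i$ so token $ i$ cannot look ahead at future tokens.
> The mask is added to the logits **before** softmax; as a result, the masked positions get zero weight *after* softmax.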
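
Both masks can be sketched as additive matrices of `0` and `-inf`. In this example the random `scores` stand in for $QK^\top / \sqrt{d_k}$, and the choice of the last position as padding is a hypothetical setup.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n = 4
scores = torch.randn(n, n)  # stand-in for Q K^T / sqrt(d_k)

# Causal mask: -inf strictly above the diagonal, 0 elsewhere,
# so position i may attend only to positions j <= i.
causal_mask = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)

# Padding mask: assume the last position is padding (hypothetical example).
# Every query is blocked from attending to that key column.
is_pad = torch.tensor([False, False, False, True])
padding_mask = torch.zeros(n, n).masked_fill(is_pad.unsqueeze(0), float("-inf"))

causal_weights = F.softmax(scores + causal_mask, dim=-1)
padding_weights = F.softmax(scores + padding_mask, dim=-1)
print(causal_weights)
print(padding_weights)
```

One caveat of this approach: a row whose logits are all $ -\infty$ produces NaNs after softmax, so each query needs at least one unmasked key.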

#### What the $ Q, K, V$ matrices mean
- $ QK^\top$ –> an $ n \times n$ **score matrix** where entry $ (i, j)$ is how much token $ i$ attends to token $ j$.
- After softmax, each row is a distribution over **which positions token $ i$ should read from.**
- Multiplying by $ V$ aggregates information: token $ i$ gets a weighted mix of all tokens’ value vectors, its own included.
> Important: rows are **not** forced to be one-hot. They’re usually **soft** (spread across many tokens), especially in early layers.


```python
import torch
import torch.nn.functional as F

# 3 tokens, head_dim = 2
Q = torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
K = torch.tensor([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
V = torch.tensor([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
L, d_h = Q.shape  # L=3, d_h=2

# attention scores: [L, L]
scores = (Q @ K.T) / (d_h**0.5)
print(f"scores shape: {scores.shape}")
print(scores)

# attention weights: softmax over keys dimension
attn_weights = F.softmax(scores, dim=-1)  # [L, L]
print(f"\nattn_weights shape: {attn_weights.shape}")
print(attn_weights)

# output: [L, d_h]
attn_output = attn_weights @ V
print(f"\nattn_output shape: {attn_output.shape}")
print(attn_output)
```

## 4. Multi-head attention

<img src="../images/multi-head-attn.png" alt="multi-head-attn" width="400"/>

## 5. Feedforward layer

## 6. Residual + LayerNorm

In the [Attention Is All You Need](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf) paper, the original diagram shows the residual connection and LayerNorm being applied **after** each sublayer (Multi-Head Attention and Feed-Forward Network):
- **Multi-Head Attention -> Add & Norm**
- **Feed-Forward Network -> Add & Norm**

This design is often called **Post-LN (post-layer normalization)**.

Later work found training to be unstable with the original Post-LN architecture, especially for deeper models. To fix this, the LayerNorm was moved to the input of each sublayer:
- **LayerNorm -> Multi-Head Attention -> Add**
- **LayerNorm -> Feed-Forward Network -> Add**

This is called **Pre-LN (pre-layer normalization)**. It stabilizes gradients and makes optimization easier for large models.
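
The two orderings can be contrasted in a short sketch. The wrapper classes `PostLNSublayer` and `PreLNSublayer` and the helper `make_ffn` are hypothetical names introduced here, and a small feed-forward network stands in for either sublayer.

```python
import torch
import torch.nn as nn

class PostLNSublayer(nn.Module):
    """Post-LN: x -> LayerNorm(x + sublayer(x))."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return self.norm(x + self.sublayer(x))

class PreLNSublayer(nn.Module):
    """Pre-LN: x -> x + sublayer(LayerNorm(x))."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return x + self.sublayer(self.norm(x))

d_model = 16  # illustrative width

def make_ffn():
    # Small feed-forward network used as a stand-in sublayer.
    return nn.Sequential(
        nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
    )

x = torch.randn(2, 5, d_model)  # [batch, seq_len, d_model]
post_out = PostLNSublayer(d_model, make_ffn())(x)
pre_out = PreLNSublayer(d_model, make_ffn())(x)
print(post_out.shape, pre_out.shape)
```

In the Pre-LN version the residual path from input to output is an identity, with no normalization in between, which is one common explanation for its more stable gradients in deep stacks.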

## 7. Encoder block

## 8. Decoder block

## 9. Tiny Transformer model

## 10. Training loop

## 11. Sampling / generation