# Transformers from Scratch

<img src="../images/transformer-architecture.png" alt="transformer" width="500"/>

## 1. Tokenization + embeddings
- Build char/word vocab
- Map to embeddings
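
The two steps above can be sketched as follows. This is a minimal character-level example, not a definitive tokenizer: the toy string `text`, the helper names `stoi`, `itos`, `encode`, and `decode`, and the choice `d_model = 4` are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

text = "hello world"  # toy corpus (illustrative assumption)

# Character-level vocabulary: one integer id per unique character
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # char -> id
itos = {i: ch for ch, i in stoi.items()}      # id -> char


def encode(s):
    """Map a string to a list of token ids."""
    return [stoi[ch] for ch in s]


def decode(ids):
    """Map a list of token ids back to a string."""
    return "".join(itos[i] for i in ids)


token_ids = torch.tensor(encode("hello"))  # shape [5]

# Embedding table: one learnable d_model-dimensional vector per vocab entry
d_model = 4  # illustrative size
torch.manual_seed(0)
embedding = nn.Embedding(num_embeddings=len(chars), embedding_dim=d_model)

x = embedding(token_ids)  # shape [5, d_model]
print(f"vocab size: {len(chars)}")
print(f"token ids: {token_ids.tolist()}")
print(f"embedded shape: {x.shape}")
```

A word-level vocabulary would follow the same pattern, with `text.split()` in place of iterating over characters.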

## 2. Positional encoding
- Implement sinusoidal positional encodings and visualize the curves.
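
A minimal sketch of the sinusoidal encoding from the original Transformer paper, where even dimensions use a sine and odd dimensions a cosine with wavelengths that grow geometrically up to a base of 10000. The function name `sinusoidal_positional_encoding` and the sizes `max_len=50`, `d_model=16` are illustrative assumptions, and the sketch assumes `d_model` is even.

```python
import math

import torch


def sinusoidal_positional_encoding(max_len, d_model):
    """Return a [max_len, d_model] tensor of encodings (d_model assumed even)."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # [max_len, 1]
    # 1 / 10000^(2i / d_model) for each pair index i, computed in log space
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32)
        * (-math.log(10000.0) / d_model)
    )  # [d_model / 2]
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe


pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(f"pe shape: {pe.shape}")
```

To visualize the curves, plotting a few columns of `pe` against position (for example with matplotlib) shows that low dimensions oscillate quickly while high dimensions vary slowly.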

## 3. Scaled dot-product attention

<img src="../images/scaled-dot-product-attn.png" alt="scaled-dot-product-attn" width="400"/>

```python
import torch
import torch.nn.functional as F

# Toy Q, K, V tensors: 3 tokens, head_dim = 2
Q = torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
K = torch.tensor([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
V = torch.tensor([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
L, d_h = Q.shape  # L=3, d_h=2

# attention scores: dot products scaled by sqrt(d_h), shape [L, L]
scores = (Q @ K.T) / (d_h**0.5)
print(f"scores shape: {scores.shape}")
print(scores)

# attention weights: softmax over the keys dimension, so each row sums to 1
attn_weights = F.softmax(scores, dim=-1)  # [L, L]
print(f"\nattn_weights shape: {attn_weights.shape}")
print(attn_weights)

# output: weighted sum of value vectors, shape [L, d_h]
attn_output = attn_weights @ V
print(f"\nattn_output shape: {attn_output.shape}")
print(attn_output)
```

## 4. Multi-head attention

<img src="../images/multi-head-attn.png" alt="multi-head-attn" width="400"/>
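
A minimal self-attention sketch of the multi-head structure in the diagram: project the input to queries, keys, and values, split them into heads, run scaled dot-product attention per head, then concatenate and apply an output projection. The class name `MultiHeadAttention`, the attribute names `w_q`, `w_k`, `w_v`, `w_o`, and the sizes used are assumptions for illustration. Masking and dropout are left out to keep it short.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadAttention(nn.Module):
    """Self-attention with num_heads heads; no masking or dropout."""

    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_h = d_model // num_heads  # per-head dimension
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)  # output projection

    def forward(self, x):
        B, L, d_model = x.shape

        def split_heads(t):
            # [B, L, d_model] -> [B, num_heads, L, d_h]
            return t.view(B, L, self.num_heads, self.d_h).transpose(1, 2)

        q = split_heads(self.w_q(x))
        k = split_heads(self.w_k(x))
        v = split_heads(self.w_v(x))

        # Scaled dot-product attention, computed independently per head
        scores = q @ k.transpose(-2, -1) / (self.d_h**0.5)  # [B, H, L, L]
        weights = F.softmax(scores, dim=-1)
        out = weights @ v  # [B, H, L, d_h]

        # Concatenate heads: [B, H, L, d_h] -> [B, L, d_model]
        out = out.transpose(1, 2).contiguous().view(B, L, d_model)
        return self.w_o(out)


torch.manual_seed(0)
mha = MultiHeadAttention(d_model=8, num_heads=2)
x = torch.randn(1, 3, 8)  # batch of 1, 3 tokens, d_model = 8
y = mha(x)
print(f"output shape: {y.shape}")
```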

## 5. Feedforward layer
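
A minimal sketch of the position-wise feedforward layer, assuming the two-linear-layer form with a ReLU in between that the original paper describes. The class name `FeedForward` and the sizes are illustrative assumptions. The hidden width `d_ff` is commonly set to four times `d_model`.

```python
import torch
import torch.nn as nn


class FeedForward(nn.Module):
    """Applies the same two-layer MLP to every position independently."""

    def __init__(self, d_model, d_ff):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),  # expand
            nn.ReLU(),
            nn.Linear(d_ff, d_model),  # project back
        )

    def forward(self, x):
        return self.net(x)


torch.manual_seed(0)
ffn = FeedForward(d_model=8, d_ff=32)
x = torch.randn(1, 3, 8)  # [batch, tokens, d_model]
y = ffn(x)
print(f"output shape: {y.shape}")
```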

## 6. Residual + LayerNorm

In the [Attention Is All You Need](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf) paper, the original diagram applies the residual connection and LayerNorm ***after*** each sublayer (Multi-Head Attention and Feed-Forward Network):
- **Multi-Head Attention -> Add & Norm**
- **Feed-Forward Network -> Add & Norm**

This design is often called **Post-LN (post-layer normalization)**.

Later work found that training with the original Post-LN architecture can be unstable, especially for deeper models. To address this, LayerNorm was moved to the input of each sublayer, inside the residual branch, so the skip connection itself is left unnormalized:
- **LayerNorm -> Multi-Head Attention -> Add**
- **LayerNorm -> Feed-Forward Network -> Add**

This is called **Pre-LN (pre-layer normalization)**. Because the residual path stays an identity connection, gradients tend to be better behaved, which makes optimization easier for large models.
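
The difference between the two orderings comes down to one line in the forward pass. The sketch below wraps an arbitrary sublayer both ways. The class names `PostLNSublayer` and `PreLNSublayer` are hypothetical, and a plain `nn.Linear` stands in for the attention or feedforward sublayer.

```python
import torch
import torch.nn as nn


class PostLNSublayer(nn.Module):
    """Post-LN: sublayer -> add residual -> LayerNorm."""

    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return self.norm(x + self.sublayer(x))


class PreLNSublayer(nn.Module):
    """Pre-LN: LayerNorm -> sublayer -> add residual."""

    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return x + self.sublayer(self.norm(x))


torch.manual_seed(0)
d_model = 8
x = torch.randn(1, 3, d_model)

# nn.Linear is a stand-in for attention or the feedforward network
post = PostLNSublayer(d_model, nn.Linear(d_model, d_model))
pre = PreLNSublayer(d_model, nn.Linear(d_model, d_model))

out_post = post(x)
out_pre = pre(x)
print(f"Post-LN output shape: {out_post.shape}")
print(f"Pre-LN output shape: {out_pre.shape}")
```

In the Post-LN version the output of every block is normalized, so the residual stream passes through LayerNorm at each layer. In the Pre-LN version the input `x` is added back untouched, and only the branch feeding the sublayer is normalized.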

## 7. Encoder block

## 8. Decoder block

## 9. Tiny Transformer model

## 10. Training loop

## 11. Sampling / generation