# Minimal Transformer (PyTorch)

This notebook contains a minimal implementation of the Transformer architecture using PyTorch's autograd engine. It covers the core building blocks of attention-based models:
- Scaled Dot-Product Attention: compute attention weights from queries, keys, and values
- Multi-Head Attention: project into multiple subspaces and combine parallel attention heads
- Residual Connections and LayerNorm: stabilize training with skip connections and normalization
- Feed-Forward Network with GELU: apply non-linear transformations between attention layers
- Encoder and Decoder Blocks: stack layers to form the Transformer backbone

In [3]:
import numpy as np
import matplotlib.pyplot as plt
import tqdm
np.set_printoptions(precision=3)
np.random.seed(0)

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
torch.manual_seed(0)


## Scaled Dot-Product Attention

The goal of attention is to decide **how much each token should look at other tokens** when forming a new representation.

### Tokens and Embeddings

Before attention, raw text is turned into vectors through **tokenization** and **embedding**. Tokenization splits a sentence into smaller units. In practice, this is usually **subword pieces** (e.g. "Transformers" $\rightarrow$ ["Trans", "former", "s"]), each mapped to an integer ID from the vocabulary.

Each ID is then mapped to a **token embedding**, a learned vector of size $d_{model}$ (often 128–1024+). A **positional encoding** is usually added to represent where each token appears in the sequence, since attention by itself has no notion of order. After this step, a sequence of length $n$ is represented as a matrix of shape $(n, d_{model})$ where each row is a token.
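As a rough sketch of this step, the snippet below turns three token IDs into an $(n, d_{model})$ matrix. The IDs, the vocabulary size, and the sizes `d_model` and `max_len` are made up for illustration, and the learned positional embedding is only one common choice (sinusoidal encodings are another).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes, not taken from the notebook
vocab_size, d_model, max_len = 100, 16, 32

# Hypothetical IDs for the pieces ["Trans", "former", "s"]
token_ids = torch.tensor([5, 42, 7])
n = token_ids.size(0)

tok_emb = nn.Embedding(vocab_size, d_model)  # one learned vector per vocabulary entry
pos_emb = nn.Embedding(max_len, d_model)     # one learned vector per position (assumed choice)

positions = torch.arange(n)                  # 0, 1, ..., n-1
x = tok_emb(token_ids) + pos_emb(positions)  # each row is one token

print(x.shape)
```

Each row of `x` is the sum of a token vector and a position vector, so the same token at two different positions gets two different representations.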

From this input matrix $X$, the model forms three projections: queries $Q = X W_Q$, keys $K = X W_K$, and values $V = X W_V$. Each projection uses its own learned weight matrix. Intuitively, each token produces a query that asks what it wants, a key that signals what it offers, and a value that carries the content to be shared. These are the inputs to the attention mechanism.
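The three projections can be sketched with bias-free linear layers. The sizes below are illustrative, and the random `x` stands in for the embedded input; `nn.Linear` stores the transposed weight internally, but the effect is the same matrix product described above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

n, d_model, d_k = 4, 16, 8       # illustrative sizes
x = torch.randn(n, d_model)      # stand-in for the embedded input matrix

# One weight matrix per projection (bias omitted to mirror the plain matrix product)
W_Q = nn.Linear(d_model, d_k, bias=False)
W_K = nn.Linear(d_model, d_k, bias=False)
W_V = nn.Linear(d_model, d_k, bias=False)

Q, K, V = W_Q(x), W_K(x), W_V(x)  # each has shape (n, d_k)
print(Q.shape, K.shape, V.shape)
```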


### Mechanism

Attention works by comparing queries and keys to decide how much weight to give each value. First, we compute similarity scores between every query and key.

$$
\text{scores} = Q K^T
$$

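To prevent these values from growing too large as the dimension $d_k$ increases, the scores are divided by $\sqrt{d_k}$. If the components of a query and a key are independent with unit variance, their dot product has variance $d_k$, and large scores push the softmax into regions where gradients become very small.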

$$
\text{scores}_{\text{scaled}} = \frac{Q K^T}{\sqrt{d_k}}
$$
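The effect of the scaling can be checked numerically. This sketch assumes query and key components drawn independently from a standard normal, which is an idealization rather than what a trained model produces; the sample count and the two values of `d_k` are arbitrary.

```python
import torch

torch.manual_seed(0)

stds = {}
for d_k in (16, 256):
    q = torch.randn(10_000, d_k)
    k = torch.randn(10_000, d_k)

    raw = (q * k).sum(dim=-1)     # one dot product per row
    scaled = raw / d_k**0.5       # same dot products after scaling

    stds[d_k] = (raw.std().item(), scaled.std().item())
    print(f"d_k={d_k}: raw std={stds[d_k][0]:.2f}, scaled std={stds[d_k][1]:.2f}")
```

Under these assumptions the raw standard deviation grows like $\sqrt{d_k}$, while the scaled one stays close to 1 regardless of the dimension.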

The scaled scores are then passed through a softmax, which normalizes each row into a probability distribution (so the entries in each row sum to $1$).

$$
\alpha = \text{softmax}(\text{scores}_{\text{scaled}})
$$

Finally, the output of attention is obtained by weighting the values by these probabilities:

$$
\text{Attention}(Q, K, V) = \alpha V
$$

In words, each token's new representation is a weighted combination of the value vectors of all tokens in the sequence.
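To make the four steps concrete, here is a small example that can be followed by hand. The matrices are made up for illustration: two tokens, $d_k = 2$, and queries equal to keys so that each token matches itself best.

```python
import torch
import torch.nn.functional as F

# Toy inputs (hypothetical values): 2 tokens, d_k = 2
Q = torch.tensor([[1.0, 0.0], [0.0, 1.0]])
K = torch.tensor([[1.0, 0.0], [0.0, 1.0]])
V = torch.tensor([[1.0, 2.0], [3.0, 4.0]])

d_k = Q.size(-1)
scores = Q @ K.T                    # similarity of every query with every key
scaled = scores / d_k**0.5          # divide by sqrt(d_k)
alpha = F.softmax(scaled, dim=-1)   # each row becomes a probability distribution
out = alpha @ V                     # weighted combination of the value vectors

print(alpha)
print(alpha.sum(dim=-1))
print(out)
```

Each token places roughly two thirds of its weight on itself and the rest on the other token, so each output row lies between the two value vectors, closer to the token's own value.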

### Masking

Sometimes we want to restrict which tokens can attend to which others. This is done with a **mask** applied to the attention scores before the softmax.

- **Padding mask:** prevents the model from attending to padded positions in sequences of different lengths.
- **Causal mask:** prevents a token from looking ahead at future tokens, which is required in the decoder for autoregressive generation.

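In practice, the scores at masked positions are set to $-\infty$ before the softmax, so that their attention weights become exactly zero. Every query needs at least one unmasked key; if an entire row is masked, the softmax of that row produces NaN.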
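Both kinds of mask can be sketched as 0/1 matrices, following the `mask == 0` convention of the attention function in this notebook (0 means blocked). The random scores and the choice of which position is padding are made up for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

n = 4
scores = torch.randn(n, n)                    # stand-in for scaled attention scores

# Causal mask: query i may attend to keys 0..i (lower triangle)
causal_mask = torch.tril(torch.ones(n, n))

# Padding mask: suppose the last position is padding (hypothetical)
is_real = torch.tensor([1, 1, 1, 0])
padding_mask = is_real.unsqueeze(0).expand(n, n)  # block the padded key for every query

# A position is allowed only if both masks allow it
combined = causal_mask * padding_mask

masked_scores = scores.masked_fill(combined == 0, float('-inf'))
weights = F.softmax(masked_scores, dim=-1)
print(weights)
```

The blocked entries come out as exact zeros, and each row still sums to 1 because the softmax renormalizes over the positions that remain.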

In [None]:
def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention over the last two dims; mask entries equal to 0 are blocked."""
    # Query/key dimension, used to scale the scores
    d_k = Q.size(-1)

    # Compute scaled similarity scores between queries and keys
    scores = Q @ K.transpose(-2, -1) / d_k**0.5

    # Apply optional mask (0 = blocked, nonzero = allowed)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))

    # Normalize each row of scores into attention weights
    weights = F.softmax(scores, dim=-1)

    # Weighted average of the value vectors
    return weights @ V
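A quick usage sketch with a causal mask on batched, multi-head shaped inputs. The function is repeated in compact form so the snippet runs on its own, the tensor sizes are illustrative, and the cross-check against PyTorch's built-in `F.scaled_dot_product_attention` only runs if that function is available (it was added in PyTorch 2.0).

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Same logic as the cell above, repeated so this snippet is self-contained
    scores = Q @ K.transpose(-2, -1) / Q.size(-1) ** 0.5
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    return F.softmax(scores, dim=-1) @ V

torch.manual_seed(0)

batch, heads, n, d_k = 2, 2, 5, 8            # illustrative sizes
Q = torch.randn(batch, heads, n, d_k)
K = torch.randn(batch, heads, n, d_k)
V = torch.randn(batch, heads, n, d_k)

causal_mask = torch.tril(torch.ones(n, n))   # broadcasts over batch and heads
out = scaled_dot_product_attention(Q, K, V, mask=causal_mask)
print(out.shape)

# With a causal mask the first token can only attend to itself,
# so its output should equal its own value vector
first_matches_value = torch.allclose(out[..., 0, :], V[..., 0, :])
print(first_matches_value)

# Optional cross-check against the built-in (boolean mask: True = allowed)
matches_builtin = None
if hasattr(F, "scaled_dot_product_attention"):
    ref = F.scaled_dot_product_attention(Q, K, V, attn_mask=causal_mask.bool())
    matches_builtin = torch.allclose(out, ref, atol=1e-5)
    print(matches_builtin)
```

Because the function only touches the last two dimensions, the same code works for a single sequence of shape $(n, d_k)$ and for batched multi-head inputs.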

## Multi-Head Attention

## Feedforward Block (FFN, GELU)

## Residual Connections and Layer Norm

## Encoder Block

## Decoder Block

## Transformer Model

## Training and Validation