# Minimal Transformer (PyTorch)

This notebook contains a minimal implementation of the Transformer architecture using PyTorch's autograd engine. It covers the core building blocks of attention-based models:
- Scaled Dot-Product Attention: compute attention weights from queries, keys, and values
- Multi-Head Attention: project into multiple subspaces and combine parallel attention heads
- Residuals and LayerNorm: stabilize training with skip connections and normalization
- Feed-Forward Network with GELU: apply non-linear transformations between attention layers
- Encoder and Decoder Blocks: stack layers to form the Transformer backbone

## Setup

In [30]:
import numpy as np
import matplotlib.pyplot as plt
import tqdm
np.set_printoptions(precision=3)
np.random.seed(0)

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
torch.manual_seed(0)

<torch._C.Generator at 0x117c77ab0>

## Tokens and Embeddings

Before attention, raw text is turned into vectors through **tokenization** and **embedding**. Tokenization splits a sentence into smaller units. In practice, this is usually **subword pieces** (e.g. "Transformers" $\rightarrow$ ["Trans", "former", "s"]), each mapped to an integer ID from the vocabulary.

Each ID is then mapped to a **word embedding**, a vector of size $d_{model}$ (often 128-1024+). A **positional encoding** is often added to represent where each token appears in the sequence. After this step, a sequence of length $n$ is represented as a matrix of shape $(n, d_{model})$ where each row is a token.

## Scaled Dot-Product Attention

The goal of attention is to decide **how much each token should look at other tokens** when forming a new representation.

From the input matrix, the model forms three projections, queries $Q$, keys $K$, and values $V$. Each is obtained by multiplying the input matrix by a different weight matrix ($W_Q$, $W_K$, $W_V$). Intuitively, each token produces a query that asks what it wants, a key that signals what it offers, and a value that carries the content to be shared. These are the inputs to the attention mechanism.

### Mechanism

Attention works by comparing queries and keys to decide how much weight to give each value. First, we compute similarity scores between every query and key.

$$
scores = Q K^T
$$

To prevent these values from growing too large as the dimension $d_k$ increases, the scores tend to be scaled.

$$
scores_{scaled} = \frac{Q K^T}{\sqrt{d_k}}
$$

The scaled scores are then passed through a softmax, which normalizes each row into a probability distribution (so the sum of all entries is $1$).

$$
\alpha = \text{softmax}(scores_{scaled})
$$

Finally, the output of attention is obtained by weighting the values by these probabilities:

$$
\text{Attention}(Q, K, V) = \alpha V
$$

In words, each token's new representation is a weighted combination of the value vectors of all tokens in the sequence.

### Masking

Sometimes we want to restrict which tokens can attend to which others. This is done with a **mask** applied to the attention scores before the softmax.

- **Padding mask:** prevents the model from attending to padded positions in sequences of different lengths.
- **Causal mask:** prevents a token from looking ahead at future tokens, which is required in the decoder for autoregressive generation.

In practice, masked positions are set to $-\infty$ so that their softmax probability becomes zero.

In [31]:
def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Compute attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) * V with optional masking

    Args:
        Q: (num_batches, sequence_length, d_k)
        K: (num_batches, sequence_length, d_k)
        V: (num_batches, sequence_length, d_v)

    Returns:
        H: (num_batches, sequence_length, d_v)
    """

    # Normalizing factor for stability
    d_k = Q.size(-1)

    # Compute similarity scores between queries and keys
    scores = Q @ K.transpose(-2, -1) / d_k**0.5

    # Apply optional mask
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))

    # Convert to probability distribution
    weights = F.softmax(scores, dim=-1)

    # Get weighted average of value vectors based on similarity scores of queries and keys
    return weights @ V

## Multi-Head Attention

Scaled dot-product attention works with a single set of queries, keys, and values. Multi-head attention extends this idea by applying attention multiple times in parallel, with each head having its own learned projections. The purpose is to let the model capture different types of relationships: one head might focus on nearby words, another on long-range dependencies, and others may discover more abstract patterns that go beyond grammatical structures.

For head $i$:
$$
head_i = \text{Attention}(X W_Q^{(i)}, X W_K^{(i)}, X W_V^{(i)})
$$
where $X \in \mathbb{R}^{n \times d_{model}}$ is the input sequence, and $W_Q^{(i)}, W_K^{(i)}, W_V^{(i)}$ are learned weight matrices specific to head $i$.

After computing all $h$ heads, the results are concatenated and projected back into the original model dimension:
$$
\text{MultiHead}(X) = \text{Concat}(head_1, \dots, head_h) \cdot W_O
$$
where $W_O$ is another learned weight matrix.

In terms of shapes, the input has shape $(n, d_{model})$. It is projected and reshaped into $(num\_heads, n, d_k)$ so that each head works in a subspace of dimension $d_k = d_{model} / num\_heads$. Each head runs scaled dot-product attention independently. The outputs are then concatenated back into $(n, d_{model})$ and passed through $W_O$.

In [32]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()

        # Check that token embedding size is divisible by number of heads
        assert d_model % num_heads == 0
        self.d_k = d_model // num_heads
        self.num_heads = num_heads

        # d_model = num_heads * d_k so these are the weights of each head stacked together
        self.W_Q = nn.Linear(d_model, d_model)
        self.W_K = nn.Linear(d_model, d_model)
        self.W_V = nn.Linear(d_model, d_model)
        self.W_O = nn.Linear(d_model, d_model)

    def forward(self, X, mask=None):
        B, n, _ = X.size()

        # Project to Q, K, V
        Q = self.W_Q(X).view(B, n, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_K(X).view(B, n, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_V(X).view(B, n, self.num_heads, self.d_k).transpose(1, 2)

        # Run attention per head
        H = scaled_dot_product_attention(Q, K, V, mask)

        # Combine heads
        H = H.transpose(1, 2).contiguous().view(B, n, -1)
        return self.W_O(H)

## Feed-Forward Network (FFN) with GELU

After multi-head attention, each position in the sequence is processed independently by a feed-forward network. This network applies the same transformation to every token embedding, without mixing information across sequence positions. Its purpose is to increase the expressive power of the model by introducing non-linearity and enabling richer transformations.

For an input $x \in \mathbb{R}^{d_{model}}$:
$$
\text{FFN}(x) = \text{GELU}(x W_1 + b_1) W_2 + b_2
$$

where:
- $W_1 \in \mathbb{R}^{d_{model} \times d_{ff}}$ and $b_1 \in \mathbb{R}^{d_{ff}}$
- $W_2 \in \mathbb{R}^{d_{ff} \times d_{model}}$ and $b_2 \in \mathbb{R}^{d_{model}}$
- $d_{ff}$ is typically larger than $d_{model}$ (e.g. 2048 vs. 512), creating an “expansion” layer before projecting back down.

The non-linearity used here is the **Gaussian Error Linear Unit (GELU)**, which smoothly scales negative values instead of zeroing them out (as ReLU would).

In terms of shapes:
- The input sequence has shape $(n, d_{model})$.
- It is projected into $(n, d_{ff})$, passed through GELU, then projected back into $(n, d_{model})$.
- Each token is transformed independently, but using the same learned weights.

In [33]:
class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()

        # First projection expands the hidden dimension
        self.W1 = nn.Linear(d_model, d_ff)

        # Second projection projects back to model dimension
        self.W2 = nn.Linear(d_ff, d_model)

        # GELU activation
        self.activation = nn.GELU()

    def forward(self, X):
        """
        Args:
            X: (batch_size, sequence_length, d_model)

        Returns:
            Output: (batch_size, sequence_length, d_model)
        """
        return self.W2(self.activation(self.W1(X)))

## Residual Connections

Imagine each Transformer sublayer (like attention or the feed-forward network) as a function that transforms its input. Without a residual connection, the model would have to **completely rewrite the representation** at every layer, which can make training unstable and slow.

Residual connections fix this by **adding the original input back in** after the sublayer transformation.
This means:
- The sublayer only needs to **learn a correction or refinement** to the input, not a full replacement.
- During backpropagation, gradients can flow directly through the skip path, which reduces the risk of vanishing gradients in very deep models.

For a sublayer function $\text{Sublayer}(x)$:
$$
\text{Residual}(x) = x + \text{Sublayer}(x)
$$

### Example: Residual Connections Illustration

Suppose the input vector is $x = [2, 3]$, and the sublayer produces $\text{Sublayer}(x) = [5, -1]$. Without a residual, the output would simply be $[5, -1]$, completely replacing the input. With a residual (skip connection), we add the input back:
$$
\text{Output} = x + \text{Sublayer}(x) = [2, 3] + [5, -1] = [7, 2]
$$

Now the sublayer only needs to learn a refinement of $x$ rather than rebuild it from scratch. If the sublayer output were $[0, 0]$, the residual path would still pass $x$ forward unchanged. This means residuals let the network preserve useful information, make each layer act as a tweak instead of a rewrite, and provide a direct path for gradients, which stabilizes training.

## Layer Normalization

After adding the residual, a **Layer Normalization** step is applied. LayerNorm normalizes activations across the feature dimension for each token, stabilizing training and ensuring consistent scale.

For input $x \in \mathbb{R}^{d_{model}}$:
$$
\text{LayerNorm}(x) = \frac{x - \mu}{\sigma} \cdot \gamma + \beta
$$

where:
- $\mu$ and $\sigma$ are the mean and standard deviation of the features in $x$  
- $\gamma, \beta \in \mathbb{R}^{d_{model}}$ are learnable parameters that scale and shift the normalized values  

Each sublayer in the Transformer (multi-head attention, feed-forward) is wrapped as:
$$
\text{Output} = \text{LayerNorm}(x + \text{Sublayer}(x))
$$

In [34]:
class ResidualConnection(nn.Module):
    def __init__(self, d_model, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        """
        Apply residual connection to any sublayer with the same input/output shape.

        Args:
            x: (batch_size, sequence_length, d_model)
            sublayer: a function or nn.Module that takes x and returns (batch_size, sequence_length, d_model)

        Returns:
            Output: (batch_size, sequence_length, d_model)
        """
        return self.norm(x + self.dropout(sublayer(x)))

## Encoder Block

A Transformer encoder block stacks the core components including multi-head attention, a feed-forward network, and residual connections with layer normalization. Each block processes the input sequence:
- Apply multi-head attention with residual and layer normalization
- Apply a position-wise feed-forward network with residual and layer normalization

$$
\begin{aligned}
\text{EncoderBlock}(x) 
&= \text{LayerNorm}(x + \text{MultiHeadAttention}(x)) \\
&\to \text{LayerNorm}(x + \text{FeedForward}(x))
\end{aligned}
$$

This structure allows the model to mix information across sequence positions (via attention) and then transform each position independently (via the FFN), while residuals and normalization ensure stable training and consistent dimensions throughout the stack.


In [35]:
class EncoderBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.ffn = FeedForward(d_model, d_ff)
        self.residual1 = ResidualConnection(d_model, dropout)
        self.residual2 = ResidualConnection(d_model, dropout)

    def forward(self, x, mask=None):
        """
        Args:
            x: (batch_size, sequence_length, d_model)
            mask: optional attention mask

        Returns:
            Output: (batch_size, sequence_length, d_model)
        """
        # Multi-head attention with residual + norm
        x = self.residual1(x, lambda x: self.attention(x, mask))

        # Feed-forward with residual + norm
        x = self.residual2(x, self.ffn)
        return x

## Decoder Block

A Transformer decoder block extends the encoder structure with two attention layers:
- Apply **masked multi-head self-attention** with residual and layer normalization (causal mask prevents each position from attending to future tokens)
- Apply **encoder-decoder cross-attention** with residual and layer normalization (queries come from the decoder, keys and values from the encoder output)
- Apply a position-wise feed-forward network with residual and layer normalization

$$
\begin{align}
\text{DecoderBlock}(x, \text{enc\_out})
&= \text{LayerNorm}(x + \text{MaskedSelfAttention}(x)) \\
&\to \text{LayerNorm}(x + \text{CrossAttention}(x, \text{enc\_out})) \\
&\to \text{LayerNorm}(x + \text{FeedForward}(x))
\end{align}
$$

This structure allows the decoder to (1) build up its own representation autoregressively, (2) attend to the encoder's representation of the input sequence, and (3) refine token embeddings through the feed-forward network, all while residuals and normalization maintain stability.

In [36]:
class DecoderBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.ffn = FeedForward(d_model, d_ff)

        self.residual1 = ResidualConnection(d_model, dropout)
        self.residual2 = ResidualConnection(d_model, dropout)
        self.residual3 = ResidualConnection(d_model, dropout)

    def forward(self, x, enc_out, src_mask=None, tgt_mask=None):
        """
        Args:
            x: (batch_size, target_length, d_model) decoder input
            enc_out: (batch_size, source_length, d_model) encoder output
            src_mask: optional mask for encoder attention
            tgt_mask: optional mask for decoder self-attention (causal mask)

        Returns:
            Output: (batch_size, target_length, d_model)
        """
        # Masked self-attention (causal)
        x = self.residual1(x, lambda x: self.self_attn(x, tgt_mask))

        # Cross-attention (queries from decoder, keys/values from encoder)
        x = self.residual2(x, lambda x: self.cross_attn(x, src_mask))

        # Feed-forward network
        x = self.residual3(x, self.ffn)
        return x


## Transformer Model

The Transformer consists of an **encoder stack** and a **decoder stack**:
- The **encoder** is a stack of $N$ identical encoder blocks, each with multi-head self-attention and a feed-forward network, connected with residuals and layer normalization.
- The **decoder** is a stack of $N$ decoder blocks, each with masked self-attention, encoder-decoder cross-attention, and a feed-forward network, also wrapped with residuals and layer normalization.

The encoder produces hidden representations $H = \text{Encoder}(X)$ for the input sequence $X$, and the decoder generates the output sequence autoregressively as:
$$
Y = \text{Decoder}(Y_{<t}, H)
$$

This design enables the model to capture long-range dependencies, align source and target sequences through cross-attention, and generate outputs in parallel during encoding and sequentially during decoding.

In [37]:
class Encoder(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, num_layers, dropout=0.1):
        super().__init__()
        self.layers = nn.ModuleList([
            EncoderBlock(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])

    def forward(self, x, mask=None):
        for layer in self.layers:
            x = layer(x, mask)
        return x


class Decoder(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, num_layers, dropout=0.1):
        super().__init__()
        self.layers = nn.ModuleList([
            DecoderBlock(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])

    def forward(self, x, enc_out, src_mask=None, tgt_mask=None):
        for layer in self.layers:
            x = layer(x, enc_out, src_mask, tgt_mask)
        return x


class Transformer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, num_layers, src_vocab_size, tgt_vocab_size, dropout=0.1):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab_size, d_model)
        self.tgt_embed = nn.Embedding(tgt_vocab_size, d_model)
        self.encoder = Encoder(d_model, num_heads, d_ff, num_layers, dropout)
        self.decoder = Decoder(d_model, num_heads, d_ff, num_layers, dropout)
        self.output_layer = nn.Linear(d_model, tgt_vocab_size)

    def forward(self, src, tgt, src_mask=None, tgt_mask=None):
        """
        Args:
            src: (batch_size, src_len) - source token indices
            tgt: (batch_size, tgt_len) - target token indices
            src_mask: optional source mask
            tgt_mask: optional target (causal) mask

        Returns:
            Output logits: (batch_size, tgt_len, tgt_vocab_size)
        """
        # Embedding lookup
        src = self.src_embed(src)
        tgt = self.tgt_embed(tgt)

        # Encode source
        enc_out = self.encoder(src, src_mask)

        # Decode target
        dec_out = self.decoder(tgt, enc_out, src_mask, tgt_mask)

        # Project to vocab size
        return self.output_layer(dec_out)

## Training and Validation

A simple way to validate the Transformer is with a toy copy task where the model is trained to reproduce its input sequence.

In [39]:
# Tiny Toy Dataset: copy task
src_vocab_size = tgt_vocab_size = 10
seq_len, batch_size = 5, 2
src = torch.randint(1, src_vocab_size, (batch_size, seq_len))
# Target is same as input
tgt = src.clone()

# Shift target for decoder input
dec_in = torch.zeros_like(tgt)
dec_in[:, 1:] = tgt[:, :-1]

# Small model
model = Transformer(d_model=16, num_heads=2, d_ff=32,
                    num_layers=1, src_vocab_size=src_vocab_size,
                    tgt_vocab_size=tgt_vocab_size)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-2)

# Simple causal mask
mask = torch.tril(torch.ones(seq_len, seq_len)).unsqueeze(0).unsqueeze(0)

# Training loop
for step in range(101):
    out = model(src, dec_in, tgt_mask=mask)  # (batch, seq_len, vocab)
    loss = criterion(out.view(-1, tgt_vocab_size), tgt.view(-1))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % 20 == 0:
        print(f"Step {step}, Loss = {loss.item():.4f}")


Step 0, Loss = 2.5311
Step 20, Loss = 0.2345
Step 40, Loss = 0.0186
Step 60, Loss = 0.0071
Step 80, Loss = 0.0039
Step 100, Loss = 0.0025
