---
title: "Transformer"
description: "Implementing a Transformer Architecture from scratch in PyTorch"
date: "2025-02-09"
#date-modified: "2025-02-22"
#categories: [news]
bread-crumbs: true
back-to-top-navigation: true
toc: true
toc-depth: 3
#image: images/pizza-13601_256.gif
---

The Transformer is a neural network architecture for natural language processing (NLP) tasks and was introduced in the ["Attention Is All You Need"](https://arxiv.org/abs/1706.03762) paper by Vaswani et al. in 2017. 


The core idea is that for any given position in a sequence, the transformer asks "What other positions in this sequence should I pay attention to?"


This article explains the core concepts of the Transformer and implements them in Python with PyTorch.

Note that PyTorch has the [class `nn.Transformer` and its components (`nn.TransformerEncoder`, `nn.TransformerDecoder`, `nn.TransformerEncoderLayer`, and `nn.TransformerDecoderLayer`)](https://github.com/pytorch/pytorch/blob/v2.7.0/torch/nn/modules/transformer.py) already built-in. The goal of this article, however, is to implement the Transformer from scratch to gain a better understanding of it.

In [45]:
#| echo: true
#| output: false
!pip3.7 install -q torchdata torchtext spacy altair



In [2]:
import torch
import torch.nn as nn

import math
import copy

## Transformer Architecture Overview

Transformers have an encoder-decoder architecture.
That means they have two main components:

- **Encoder**: Processes the input sequence $(x_1, ..., x_n)$ and creates rich representations in the form of vector embeddings $\mathbf{z} = (z_1, ...,z_n)$
- **Decoder**: Generates the output sequence $(y_1,...,y_m)$ one token at a time, attending to both its own partial output and the encoder's representations

That's why they are commonly used for language translation models.

![Transformer architecture overview diagram showing encoder-decoder structure.](images/transformer_encoders_decoders.png)

## Core components
At a more detailed level, the Transformer is composed of the following core components:

- Input Embeddings
- Positional Encodings: to maintain sequence order without recurrence
- Multi-Head Attention and Masked Multi-Head Attention
- Feed-Forward layer
- Add & Norm layer
- Residual connections and layer normalization: to stabilize training
- Projection layer (Linear and Softmax)

![**Figure:** Detailed Transformer architecture with encoder and decoder layers, self-attention, and feed-forward networks](images/architecture_detailed.png)

### Embeddings

Embeddings convert discrete tokens (words, subwords, characters) into dense vector representations that the transformer can work with.

The transformer uses learned embeddings to convert the input tokens and output tokens to vectors of dimension $d_{\text{model}}$.  

The Problem: Transformers need numerical input, but text consists of discrete symbols. For example, a vocabulary of 50,000 words would create massive, sparse one-hot vectors that are computationally inefficient and carry no semantic meaning.

The Solution: Map each token to a dense vector of fixed dimension $d_{\text{model}}$ (`d_model`)  (typically 256, 512, or 768 dimensions) that can encode semantic relationships.
These vectors are learned during training to capture semantic similarity.

The original paper multiplies the embedding weights by $\sqrt{d_{\text{model}}}$. A common interpretation is that this scaling:

- keeps the embedding values and the positional encoding values at a comparable magnitude,
- so that neither dominates the other when they are added together,
- which helps training stability.

The embedding matrix has shape `vocab_size x d_model`.

TODO:
- [ ] difference between input and output embeddings

![](images/embeddings.png)


In [47]:
class Embeddings(nn.Module):
    def __init__(self, vocab_size, d_model):
        """
        Args:
            vocab_size: size of vocabulary
            d_model: dimension of embeddings
        """
        super(Embeddings, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.d_model = d_model

    def forward(self, x):
        """
        Args:
            x: input vector
        Returns:
            out: scaled embedding vector
        """
        # Scale by sqrt(d_model) from original paper
        return self.embedding(x) * math.sqrt(self.d_model)

Suppose each embedding vector has 512 dimensions and the vocabulary size is 100. The embedding matrix then has size 100x512.
This embedding matrix is learned during training.
During inference, each token is mapped to its corresponding 512-dimensional vector.
With a batch size of 32 and a sequence length of 10 tokens, the output has shape 32x10x512.
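The shapes in this example can be checked with a minimal sketch. It uses `nn.Embedding` directly rather than the `Embeddings` class above, and the token ids are random placeholders, not real text.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, d_model = 100, 512
batch_size, seq_len = 32, 10

embedding = nn.Embedding(vocab_size, d_model)

# Hypothetical batch of token ids in [0, vocab_size)
token_ids = torch.randint(0, vocab_size, (batch_size, seq_len))

# Look up one row of the embedding matrix per token, then scale by sqrt(d_model)
out = embedding(token_ids) * math.sqrt(d_model)

print(embedding.weight.shape)  # torch.Size([100, 512])
print(out.shape)               # torch.Size([32, 10, 512])
```

Each output vector is simply the row of the embedding matrix selected by the token id, multiplied by the scaling factor.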

### Positional Encodings

Positional encodings are vectors that tell the transformer where each token sits in a sequence.
The motivation for positional encoding is that unlike RNNs, which process tokens one by one, the transformer looks at all tokens in a sentence at the same time.
This makes the transformer fast but it also means that it can't distinguish between "The cat sat on the mat" and "The mat sat on the cat" because it just sees the same [bag of words](bag_of_words.ipynb).

To overcome this problem, you can add positional information in the form of positional encoding vectors directly to the input embeddings. The positional encoding vectors have the same dimensions $d_{model}$ (`d_model`)  as the input embeddings, so that they can be summed element-wise.
This gives each token a combined representation that captures both its meaning and its location in the sequence.

![](images/positional_encodings.png)

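For example, with a batch size of 32, a sequence length of 10, and an embedding dimension of 512, the embedding tensor has shape 32 x 10 x 512.
The positional encodings for the first 10 positions have shape 10 x 512 and are the same for every sequence in the batch.
They are broadcast across the batch dimension and added element-wise, so the sum again has shape 32 x 10 x 512.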

The original transformer uses the **sinusoidal positional encodings**  based on sine and cosine functions of different frequencies:
$$
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$
$$
PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$

where $pos$ is the position and $i$ is the dimension index. 

That is, each dimension of the positional encoding corresponds to a sinusoid.  
The wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$.  
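To make the geometric progression concrete, here is a small sketch in plain Python. It assumes $d_{model}=512$, and the helper name `wavelength` is introduced only for this illustration. The sinusoid for dimension pair $i$ has angular frequency $10000^{-2i/d_{model}}$, so its wavelength is $2\pi \cdot 10000^{2i/d_{model}}$.

```python
import math

d_model = 512

def wavelength(i, d_model):
    """Wavelength (in positions) of the sinusoid used for dimensions 2i and 2i+1."""
    return 2 * math.pi * 10000 ** (2 * i / d_model)

wavelengths = [wavelength(i, d_model) for i in range(d_model // 2)]

# Ratio between consecutive wavelengths: constant for a geometric progression
ratios = [b / a for a, b in zip(wavelengths, wavelengths[1:])]

print(wavelengths[0])   # shortest wavelength, equal to 2 * pi
print(wavelengths[-1])  # longest wavelength, slightly below 10000 * 2 * pi
print(ratios[0])        # common ratio, 10000 ** (2 / d_model)
```

The longest wavelength only approaches $10000 \cdot 2\pi$ because the largest pair index is $d_{model}/2 - 1$.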

These formulas create unique patterns for each position by using different frequencies across the embedding dimensions. Position 0 has one pattern, position 1 has a slightly different pattern, and so on. This enables the model to learn to recognize not just absolute positions, but also relative distances between tokens.

The intuition is that adding these values to the embeddings provides meaningful distances between the embedding vectors once they are projected into Q, K, and V vectors and compared during dot-product attention.
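One way to illustrate this intuition: since $\sin a \sin b + \cos a \cos b = \cos(a - b)$, the dot product of two raw sinusoidal encodings depends only on the offset between the two positions. The sketch below checks this in plain Python. The helper names are made up for this example, and the learned Q and K projections that sit between the encodings and the attention scores are left out, so treat it as an illustration rather than a statement about a trained model.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding of a single position as a plain list of floats."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / 10000 ** (2 * i / d_model)
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

d_model = 64

# Two position pairs with the same offset of 4
same_offset_a = dot(positional_encoding(3, d_model), positional_encoding(7, d_model))
same_offset_b = dot(positional_encoding(10, d_model), positional_encoding(14, d_model))

# A position compared with itself: every sin^2 + cos^2 pair contributes 1
self_similarity = dot(positional_encoding(3, d_model), positional_encoding(3, d_model))

print(same_offset_a, same_offset_b)  # equal up to floating-point error
print(self_similarity)               # d_model / 2
```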

In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks.  
For the base model, we use a rate of $P_{drop}=0.1$.

In [48]:
class PositionalEncoding(nn.Module):
    "Implement the PE function."

    def __init__(self, d_model, max_seq_len=5000, dropout=0.1):
        """
        Args:
            d_model: dimension of embeddings
            dropout: dropout rate, the original paper uses 0.1
            max_seq_len: maximum sequence length
        """
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        # Compute the positional encodings once in log space.
        pe = torch.zeros(max_seq_len, d_model)
        position = torch.arange(0, max_seq_len).unsqueeze(1)

        # Create a div term for the denominator
        div_term = torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))

        # Apply sin to even indices (0, 2, 4, ...)
        pe[:, 0::2] = torch.sin(position * div_term)

        # Apply cos to odd indices (1, 3, 5, ...)
        pe[:, 1::2] = torch.cos(position * div_term)

        # Add batch dimension
        pe = pe.unsqueeze(0)

        # Register as buffer (saved with model, not trained)
        self.register_buffer("pe", pe)

    def forward(self, x):
        # x shape : [batch_size, seq_len, d_model]
        seq_len = x.size(1)
        x = x + self.pe[:, :seq_len].requires_grad_(False)
        return self.dropout(x)


> Below the positional encoding will add in a sine wave based on
> position. The frequency and offset of the wave is different for
> each dimension.

In [49]:
#| echo: false
#| output: true
"""import altair as alt
import pandas as pd

def example_positional():
    pe = PositionalEncoding(20, dropout=0)
    y = pe.forward(torch.zeros(1, 100, 20))

    data = pd.concat(
        [
            pd.DataFrame(
                {
                    "embedding": y[0, :, dim],
                    "dimension": dim,
                    "position": list(range(100)),
                }
            )
            for dim in [4, 5, 6, 7]
        ]
    )

    return (
        alt.Chart(data)
        .mark_line()
        .properties(width=800)
        .encode(x="position", y="embedding", color="dimension:N")
        .interactive()
    )


show_example(example_positional)"""


Note that there are different ways to add positional information to input tokens. Read more on [positional encoding](positional_encoding.ipynb).

### Attention mechanism

**Attention** is the mechanism where one set of elements "pays attention to" another set of elements to gather relevant information.

#### Self-attention

*What is self-attention and how does it work mathematically?*

*Why do transformers use multiple attention heads? What does each head learn?*

*Derive the attention formula: Attention(Q,K,V) = softmax(QK^T/√d_k)V*

*Why do we scale by √d_k in scaled dot-product attention?*

*What is the computational complexity of self-attention?*

*How does causal/masked attention work in decoder models?*

*Explain cross-attention vs self-attention*

*What are some alternatives to standard attention (sparse attention, linear attention, flash attention)?*


As the model processes each word (each position in the input sequence), self-attention allows it to look at other positions in the input sequence for clues that can help lead to a better encoding for this word.

> “The animal didn't cross the street because it was too tired”

What does “it” in this sentence refer to? Is it referring to the street or to the animal? 
When the model is processing the word “it”, self-attention allows it to associate “it” with “animal”.

![](images/transformer_self-attention_visualization.png)

Be sure to check out the Tensor2Tensor notebook where you can load a Transformer model, and examine it using this interactive visualization.

- [] Add visual of formula here

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors.  
The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a
compatibility function of the query with the corresponding key.

We call our particular attention "Scaled Dot-Product Attention".
The input consists of queries and keys of dimension $d_k$, and values of dimension $d_v$.  We compute the dot products of the query with all keys, divide each by $\sqrt{d_k}$, and apply a softmax function to obtain the weights on the values.

In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix $Q$.  The keys and values are also packed together into matrices $K$ and $V$.  
We compute the matrix of outputs as:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(\frac{QK^T}{\sqrt{d_k}})V
$$

- Query ($Q$): What we're looking for
- Key ($K$): What we're looking at
- Value ($V$): What we actually use

1. Compute similarity scores between queries and keys using dot products.
2. Divide by $\sqrt{d_k}$ so that the softmax does not saturate (and produce extremely small gradients) when $d_k$ is large.
3. Optionally apply a mask to ignore padded tokens or to implement causal attention.
4. Apply softmax to get attention weights that sum to 1.
5. Use these weights to compute a weighted average of the values.



In [5]:
def scaled_dot_product_attention(query, key, value, mask=None, dropout=None):
    """
    Compute 'Scaled Dot Product Attention'
    Attention with optional masking
    mask shape: [batch_size, seq_len, seq_len] or broadcastable
    """

    # Compute scaled attention scores
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)

    # Apply mask before softmax (set masked positions to large negative value)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)

    # Softmax over the last dimension
    p_attn = scores.softmax(dim=-1)

    # Apply dropout after softmax
    if dropout is not None:
        p_attn = dropout(p_attn)

    return torch.matmul(p_attn, value), p_attn

**Self-attention vs. cross-attention: where Q, K, and V come from**

- Self-attention: all from same sequence (Q = K = V = same_input)
- Cross-attention: from different sequences (Q = sequence_A, K = V = sequence_B)
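The difference shows up in the tensor shapes. The sketch below is a simplified illustration that leaves out the learned Q, K, and V projections and any masking. The sequence lengths and the helper `attention` are assumptions made for this example.

```python
import math
import torch

torch.manual_seed(0)
batch_size, d_k = 2, 64
len_a, len_b = 5, 7  # hypothetical lengths of sequence A and sequence B

seq_a = torch.randn(batch_size, len_a, d_k)
seq_b = torch.randn(batch_size, len_b, d_k)

def attention(query, key, value):
    """Scaled dot-product attention without masking or dropout."""
    scores = query @ key.transpose(-2, -1) / math.sqrt(query.size(-1))
    weights = scores.softmax(dim=-1)
    return weights @ value, weights

# Self-attention: queries, keys, and values all come from sequence A
self_out, self_weights = attention(seq_a, seq_a, seq_a)

# Cross-attention: queries from sequence A, keys and values from sequence B
cross_out, cross_weights = attention(seq_a, seq_b, seq_b)

print(self_weights.shape)   # torch.Size([2, 5, 5])
print(cross_weights.shape)  # torch.Size([2, 5, 7])
print(cross_out.shape)      # torch.Size([2, 5, 64])
```

In both cases the output has one vector per query position. Only the number of positions being attended to changes.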


Key points:
- Supports both self-attention and cross-attention
- Handles different sequence lengths for encoder/decoder

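Implementation tips:

- Use separate Q, K, V projections
- Masking can also be applied additively (adding a large negative value to the masked scores) instead of with `masked_fill`, which the function above uses
- Remember to use broadcasting and reshaping for multi-head attention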


Why masking? In the decoder, we need to prevent tokens from seeing future tokens during training.
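A causal mask of this kind can be sketched as a lower-triangular matrix. The example uses all-zero scores so that the effect of the mask on the softmax is easy to read off, and the sequence length of 4 is an arbitrary choice.

```python
import torch

seq_len = 4

# Lower-triangular matrix: position i may attend to positions 0..i
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

# Uniform (all-zero) scores, so any difference in the weights comes from the mask
scores = torch.zeros(seq_len, seq_len)
masked_scores = scores.masked_fill(causal_mask == 0, -1e9)
weights = masked_scores.softmax(dim=-1)
print(weights)

# With leading singleton dimensions the mask broadcasts against
# multi-head scores of shape [batch_size, num_heads, seq_len, seq_len]
mask_4d = causal_mask.unsqueeze(0).unsqueeze(0)
print(mask_4d.shape)  # torch.Size([1, 1, 4, 4])
```

Row `i` of `weights` spreads its probability mass evenly over positions `0..i` and assigns zero weight to every later position.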

The goal of reducing sequential computation also forms the
foundation of the Extended Neural GPU, ByteNet and ConvS2S, all of
which use convolutional neural networks as basic building block,
computing hidden representations in parallel for all input and
output positions. In these models, the number of operations required
to relate signals from two arbitrary input or output positions grows
in the distance between positions, linearly for ConvS2S and
logarithmically for ByteNet. This makes it more difficult to learn
dependencies between distant positions. In the Transformer this is
reduced to a constant number of operations, albeit at the cost of
reduced effective resolution due to averaging attention-weighted
positions, an effect we counteract with Multi-Head Attention.

Self-attention, sometimes called intra-attention, is an attention
mechanism relating different positions of a single sequence in order
to compute a representation of the sequence. Self-attention has been
used successfully in a variety of tasks including reading
comprehension, abstractive summarization, textual entailment and
learning task-independent sentence representations. End-to-end
memory networks are based on a recurrent attention mechanism instead
of sequence-aligned recurrence and have been shown to perform well on
simple-language question answering and language modeling tasks.

To the best of our knowledge, however, the Transformer is the first
transduction model relying entirely on self-attention to compute
representations of its input and output without using
sequence-aligned RNNs or convolution.


The self-attention layer helps the encoder look at other words in the input sentence as it encodes a specific word.

The two most commonly used attention functions are additive
attention [(cite)](https://arxiv.org/abs/1409.0473), and dot-product
(multiplicative) attention.  Dot-product attention is identical to
our algorithm, except for the scaling factor of
$\frac{1}{\sqrt{d_k}}$. Additive attention computes the
compatibility function using a feed-forward network with a single
hidden layer.  While the two are similar in theoretical complexity,
dot-product attention is much faster and more space-efficient in
practice, since it can be implemented using highly optimized matrix
multiplication code.


While for small values of $d_k$ the two mechanisms perform
similarly, additive attention outperforms dot product attention
without scaling for larger values of $d_k$
[(cite)](https://arxiv.org/abs/1703.03906). We suspect that for
large values of $d_k$, the dot products grow large in magnitude,
pushing the softmax function into regions where it has extremely
small gradients (To illustrate why the dot products get large,
assume that the components of $q$ and $k$ are independent random
variables with mean $0$ and variance $1$.  Then their dot product,
$q \cdot k = \sum_{i=1}^{d_k} q_ik_i$, has mean $0$ and variance
$d_k$.). To counteract this effect, we scale the dot products by
$\frac{1}{\sqrt{d_k}}$.
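The variance argument can be checked empirically. This is a rough simulation under the same assumption as the text (independent components with mean 0 and variance 1). The sample size and seed are arbitrary.

```python
import math
import torch

torch.manual_seed(0)
d_k, num_samples = 512, 10_000

# Components drawn independently with mean 0 and variance 1
q = torch.randn(num_samples, d_k)
k = torch.randn(num_samples, d_k)

dots = (q * k).sum(dim=-1)       # one dot product per sample
scaled = dots / math.sqrt(d_k)   # scaled as in the attention formula

print(dots.var().item())    # close to d_k
print(scaled.var().item())  # close to 1
```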

![Multi-Head Attention mechanism diagram: visualization of how multiple attention heads process input sequences in parallel, each focusing on different representation subspaces, then concatenating their outputs.](images/multihead-attention.png)

Multi-head attention allows the model to jointly attend to
information from different representation subspaces at different
positions. With a single attention head, averaging inhibits this.

$$
\begin{aligned}
\mathrm{MultiHead}(Q, K, V) &= \mathrm{Concat}(\mathrm{head}_1, ..., \mathrm{head}_h)W^O \\
\text{where}~\mathrm{head}_i &= \mathrm{Attention}(QW^Q_i, KW^K_i, VW^V_i)
\end{aligned}
$$

Where the projections are parameter matrices $W^Q_i \in
\mathbb{R}^{d_{\text{model}} \times d_k}$, $W^K_i \in
\mathbb{R}^{d_{\text{model}} \times d_k}$, $W^V_i \in
\mathbb{R}^{d_{\text{model}} \times d_v}$ and $W^O \in
\mathbb{R}^{hd_v \times d_{\text{model}}}$.

In this work we employ $h=8$ parallel attention layers, or
heads. For each of these we use $d_k=d_v=d_{\text{model}}/h=64$. Due
to the reduced dimension of each head, the total computational cost
is similar to that of single-head attention with full
dimensionality.
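Because $h \cdot d_k = d_{\text{model}}$, the $h$ per-head projection matrices $W^Q_i$ can be stored side by side in a single $d_{\text{model}} \times d_{\text{model}}$ matrix. The sketch below checks that one fused `nn.Linear` followed by a reshape gives the same result as projecting each head separately. The bias is turned off and the variable names are chosen only to keep the comparison simple.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, num_heads = 512, 8
d_k = d_model // num_heads
batch_size, seq_len = 2, 10

x = torch.randn(batch_size, seq_len, d_model)
fused = nn.Linear(d_model, d_model, bias=False)

# Fused projection, then split the last dimension into heads:
# [batch, seq, d_model] -> [batch, seq, heads, d_k] -> [batch, heads, seq, d_k]
q_fused = fused(x).view(batch_size, seq_len, num_heads, d_k).transpose(1, 2)

# Equivalent per-head projections: head i uses rows i*d_k .. (i+1)*d_k
# of the fused weight (nn.Linear stores its weight as [out_features, in_features])
q_per_head = torch.stack(
    [x @ fused.weight[i * d_k:(i + 1) * d_k].T for i in range(num_heads)],
    dim=1,
)

print(q_fused.shape)  # torch.Size([2, 8, 10, 64])
print(torch.allclose(q_fused, q_per_head, atol=1e-4))
```

This is why the total number of projection parameters matches that of a single head with full dimensionality.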


*How would you implement multi-head attention from scratch?*


In [3]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads  # dimension per head
        
        # Linear projections for queries, keys, and values
        self.W_query = nn.Linear(d_model, d_model)
        self.W_key = nn.Linear(d_model, d_model)
        self.W_value = nn.Linear(d_model, d_model)
        
        # Output projection
        self.W_output = nn.Linear(d_model, d_model)
        
    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)
        seq_len_query = query.size(1)
        seq_len_key = key.size(1)
        
        # Step 1: Linear projections for all heads at once
        # Shape: [batch_size, seq_len, d_model]
        Q = self.W_query(query)
        K = self.W_key(key)
        V = self.W_value(value)
        
        # Step 2: Reshape to separate heads
        # From [batch_size, seq_len, d_model] to [batch_size, seq_len, num_heads, d_k]
        Q = Q.view(batch_size, seq_len_query, self.num_heads, self.d_k)
        K = K.view(batch_size, seq_len_key, self.num_heads, self.d_k)
        V = V.view(batch_size, seq_len_key, self.num_heads, self.d_k)
        
        # Step 3: Transpose to [batch_size, num_heads, seq_len, d_k] for efficient computation
        Q = Q.transpose(1, 2)  # [batch_size, num_heads, seq_len_query, d_k]
        K = K.transpose(1, 2)  # [batch_size, num_heads, seq_len_key, d_k]
        V = V.transpose(1, 2)  # [batch_size, num_heads, seq_len_key, d_k]
        
        # Step 4: Apply scaled dot-product attention to each head
        attention_output, attention_weights = scaled_dot_product_attention(
            Q, K, V, mask
        )
        # attention_output: [batch_size, num_heads, seq_len_query, d_k]
        
        # Step 5: Concatenate heads
        # Transpose back: [batch_size, seq_len_query, num_heads, d_k]
        attention_output = attention_output.transpose(1, 2)
        
        # Reshape to concatenate heads: [batch_size, seq_len_query, d_model]
        attention_output = attention_output.contiguous().view(
            batch_size, seq_len_query, self.d_model
        )
        
        # Step 6: Final linear projection
        output = self.W_output(attention_output)
        
        return output, attention_weights

In [6]:
# Example usage
def example_usage():
    batch_size, seq_len, d_model = 2, 10, 512
    num_heads = 8
    
    # Create sample input
    x = torch.randn(batch_size, seq_len, d_model)
    
    # Initialize multi-head attention
    mha = MultiHeadAttention(d_model, num_heads)
    
    # Self-attention (query, key, value are all the same)
    output, weights = mha(x, x, x)
    
    print(f"Input shape: {x.shape}")
    print(f"Output shape: {output.shape}")
    print(f"Attention weights shape: {weights.shape}")
    
    # For encoder-decoder attention, you'd use different inputs:
    # output, weights = mha(decoder_hidden, encoder_output, encoder_output)

example_usage()

Input shape: torch.Size([2, 10, 512])
Output shape: torch.Size([2, 10, 512])
Attention weights shape: torch.Size([2, 8, 10, 10])


Multi-head attention runs multiple attention mechanisms in parallel, each focusing on different aspects of the relationships, then concatenates and projects the results.

Applications of Attention in our Model

The Transformer uses multi-head attention in three different ways:
1) In "encoder-decoder attention" layers, the queries come from the
previous decoder layer, and the memory keys and values come from the
output of the encoder.  This allows every position in the decoder to
attend over all positions in the input sequence.  This mimics the
typical encoder-decoder attention mechanisms in sequence-to-sequence
models such as [(cite)](https://arxiv.org/abs/1609.08144).


2) The encoder contains self-attention layers.  In a self-attention
layer all of the keys, values and queries come from the same place,
in this case, the output of the previous layer in the encoder.  Each
position in the encoder can attend to all positions in the previous
layer of the encoder.


3) Similarly, self-attention layers in the decoder allow each
position in the decoder to attend to all positions in the decoder up
to and including that position.  We need to prevent leftward
information flow in the decoder to preserve the auto-regressive
property.  We implement this inside of scaled dot-product attention
by masking out (setting to $-\infty$) all values in the input of the
softmax which correspond to illegal connections.

### Position-wise Feed-Forward Networks

The outputs of the self-attention layer are fed to a feed-forward neural network in the encoder.

In addition to attention sub-layers, each of the layers in our
encoder and decoder contains a fully connected feed-forward network,
which is applied to each position separately and identically.  This
consists of two linear transformations with a ReLU activation in
between.

$$\mathrm{FFN}(x)=\max(0, xW_1 + b_1) W_2 + b_2$$

While the linear transformations are the same across different
positions, they use different parameters from layer to
layer. Another way of describing this is as two convolutions with
kernel size 1.  The dimensionality of input and output is
$d_{\text{model}}=512$, and the inner-layer has dimensionality
$d_{ff}=2048$.
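"Applied to each position separately and identically" can be verified directly. In this sketch the network is built from plain `nn.Linear` layers with dropout left out, so that the comparison is deterministic.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_ff = 512, 2048

# Same two-layer network as the FFN equation (dropout omitted for determinism)
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

x = torch.randn(2, 10, d_model)  # [batch, seq_len, d_model]

# Apply the network to the whole sequence at once...
full = ffn(x)

# ...and to each position on its own, then stack along the sequence dimension
per_position = torch.stack([ffn(x[:, t]) for t in range(x.size(1))], dim=1)

print(full.shape)  # torch.Size([2, 10, 512])
print(torch.allclose(full, per_position, atol=1e-4))
```

No information flows between positions in this sub-layer. Mixing across positions happens only in the attention sub-layers.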

In [52]:
class FeedForward(nn.Module):
    "Implements FFN equation."

    def __init__(self, d_model, d_ff, dropout=0.1):
        super(FeedForward, self).__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.w_2(self.dropout(self.w_1(x).relu()))

### Layer normalization

*What is the purpose of layer normalization in transformers?*

We employ a residual connection [(cite)](https://arxiv.org/abs/1512.03385) around each of the two
sub-layers, followed by layer normalization [(cite)](https://arxiv.org/abs/1607.06450).


When we look at the encoder and decoder blocks, we see several normalization layers called Add & Norm.

The `LayerNorm` class below performs layer normalization on the input data. During its forward pass, we compute the mean and standard deviation of the input over the last (feature) dimension. We then normalize the input by subtracting the mean and dividing by the standard deviation plus a small number called epsilon (`eps`), which avoids division by zero. This results in a normalized output with mean 0 and a standard deviation of approximately 1.

We then scale the normalized output by a learnable parameter (`a_2`) and add a learnable bias (`b_2`). The training process is responsible for adjusting these parameters. The final result is a layer-normalized tensor, which ensures that the scale of the inputs to layers in the network is consistent.



In [53]:
class LayerNorm(nn.Module):
    "Construct a layernorm module (See citation for details)."

    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__()
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2
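
The `LayerNorm` class above differs slightly from PyTorch's built-in `nn.LayerNorm`: `Tensor.std` uses the unbiased estimator by default and `eps` is added to the standard deviation, whereas the built-in version uses the biased variance and adds `eps` inside the square root. The sketch below compares both variants without the learnable scale and bias. The input is random and the differences are small.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, eps = 512, 1e-5
x = torch.randn(2, 10, d_model)

mean = x.mean(-1, keepdim=True)

# Variant used by nn.LayerNorm: biased variance, eps added inside the square root
var_biased = x.var(-1, unbiased=False, keepdim=True)
manual = (x - mean) / torch.sqrt(var_biased + eps)
builtin = F.layer_norm(x, (d_model,), eps=eps)

# Variant used by the class above: unbiased std, eps added to the std
std_unbiased = x.std(-1, keepdim=True)
custom = (x - mean) / (std_unbiased + eps)

print(torch.allclose(manual, builtin, atol=1e-5))
print((custom - builtin).abs().max().item())  # small, but not exactly zero
```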


### Residual connection



When we look at the architecture of the Transformer, we see that each sub-layer, including the self-attention and feed-forward blocks, adds its output to its input in the Add & Norm layer. This is known as a residual (or skip) connection. It allows the Transformer to train deep networks more effectively by providing a shortcut for the gradient to flow through during backpropagation.

The ResidualConnection class below is responsible for this process.


In [54]:
class ResidualConnection(nn.Module):
    """
    A residual connection followed by a layer norm.
    Note for code simplicity the norm is first as opposed to last.
    """

    def __init__(self, size, dropout):
        super(ResidualConnection, self).__init__()
        self.norm = LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        "Apply residual connection to any sublayer with the same size."
        return x + self.dropout(sublayer(self.norm(x)))
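
As the docstring notes, this class applies the norm first, while the original paper applies it last. The two orderings can be sketched side by side. In this illustration a single `nn.Linear` stands in for the attention or feed-forward sub-layer and dropout is omitted. The function names `post_norm` and `pre_norm` are labels chosen for this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 512
norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)  # stand-in for attention or feed-forward

def post_norm(x):
    # Ordering from the original paper: LayerNorm(x + Sublayer(x))
    return norm(x + sublayer(x))

def pre_norm(x):
    # Ordering used by ResidualConnection: x + Sublayer(LayerNorm(x))
    return x + sublayer(norm(x))

x = torch.randn(2, 10, d_model)
post_out, pre_out = post_norm(x), pre_norm(x)

print(post_out.shape, pre_out.shape)
# With post-norm the output itself is normalized; with pre-norm the
# residual path carries x through unchanged
print(post_out.mean(-1).abs().max().item())
```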
    

## Transformer Architecture
Now, we have all the core components to build the Transformer.

Transformer models come in three variants:

- encoder-decoder (the original Transformer)
- decoder-only
- encoder-only

### Encoder

The encoders are all identical in structure but don't share any weights.

Each encoder has two sub-layers:
1. Self-attention layer
2. Feed Forward Neural Network

We employ residual connections around each of the sub-layers, followed by layer normalization.

![](images/encoder-architecture.png)


That is, the output of each sub-layer is $\mathrm{LayerNorm}(x +
\mathrm{Sublayer}(x))$, where $\mathrm{Sublayer}(x)$ is the function
implemented by the sub-layer itself.  We apply dropout
[(cite)](http://jmlr.org/papers/v15/srivastava14a.html) to the
output of each sub-layer, before it is added to the sub-layer input
and normalized.

To facilitate these residual connections, all sub-layers in the
model, as well as the embedding layers, produce outputs of dimension
$d_{\text{model}}=512$.

Each layer has two sub-layers. The first is a multi-head
self-attention mechanism, and the second is a simple, position-wise
fully connected feed-forward network.

In [1]:
class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super().__init__()
        
        # Sub-layer 1: Multi-head self-attention
        self.self_attention = MultiHeadAttention(d_model, num_heads)
        
        # Sub-layer 2: Feed-forward network
        self.feed_forward = FeedForward(d_model, d_ff)
        
        # Layer normalization for each sub-layer
        self.layer_norm_1 = nn.LayerNorm(d_model)
        self.layer_norm_2 = nn.LayerNorm(d_model)
        
        # Dropout
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, mask=None):
        # Sub-layer 1: Multi-head self-attention with residual connection
        # For self-attention: query, key, and value are all the same input
        attention_output, attention_weights = self.self_attention(
            query=x,    # Same input
            key=x,      # Same input  
            value=x,    # Same input
            mask=mask
        )
        
        # Post-norm: residual connection then normalize
        x = self.layer_norm_1(x + self.dropout(attention_output))
        
        # Sub-layer 2: Feed-forward with residual connection  
        feed_forward_output = self.feed_forward(x)

        # Post-norm: residual connection then normalize
        x = self.layer_norm_2(x + self.dropout(feed_forward_output))
        
        return x


The full encoder is a stack of `num_layers` encoder layers ($N = 6$ in the original paper), applied to the sum of the token embeddings and the positional encodings.

In [None]:
class TransformerEncoder(nn.Module):
    def __init__(self, source_vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_len, dropout):
        super().__init__()
        
        # Input processing
        self.embedding = Embeddings(source_vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model, max_seq_len, dropout)
        
        # Stack of encoder layers
        self.encoder_layers = nn.ModuleList([
            TransformerEncoderLayer(d_model, num_heads, d_ff, dropout) 
            for _ in range(num_layers)
        ])
        
    def forward(self, source_tokens, source_mask=None):
        # Step 1: Convert tokens to embeddings + add positional encoding
        embeddings = self.embedding(source_tokens)  # [batch, seq_len, d_model]
        encoder_input = self.positional_encoding(embeddings)
        
        # Step 2: Pass through each encoder layer
        for encoder_layer in self.encoder_layers:
            encoder_input = encoder_layer(encoder_input, source_mask)
        
        return encoder_input  # [batch, seq_len, d_model]
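
As a cross-check on the stacking idea, the same structure can be built from PyTorch's built-in `nn.TransformerEncoderLayer` and `nn.TransformerEncoder`. This is only a sketch: the built-in modules expect already-embedded inputs (no token embedding or positional encoding) and their internals differ in detail from the classes above. The hyperparameters are those of the base model.

```python
import torch
import torch.nn as nn

d_model, num_heads, d_ff, num_layers = 512, 8, 2048, 6

layer = nn.TransformerEncoderLayer(
    d_model=d_model,
    nhead=num_heads,
    dim_feedforward=d_ff,
    dropout=0.1,
    batch_first=True,  # inputs are [batch, seq_len, d_model]
)
encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
encoder.eval()  # disable dropout for a deterministic forward pass

x = torch.randn(2, 10, d_model)  # stand-in for embeddings + positional encodings
with torch.no_grad():
    out = encoder(x)

params_per_layer = sum(p.numel() for p in layer.parameters())
params_total = sum(p.numel() for p in encoder.parameters())

print(out.shape)                                      # torch.Size([2, 10, 512])
print(params_total == num_layers * params_per_layer)  # True
```

The stack keeps the `[batch, seq_len, d_model]` shape from layer to layer, and each layer holds its own copy of the parameters.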

### Decoder

The decoder has the following layers:
- Self-attention layer (masked multi-head attention)
    - Sub-layer 1: Self-attention (same sequence)
    - query=key=value=target_input
- Cross-attention (multi-head encoder-decoder attention layer):
    - Helps the decoder focus on relevant parts of the input sentence, similar to attention in `seq2seq` models.
    - In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack.
    - Sub-layer 2: Cross-attention (different sequences)
    - query=target_input: what we're generating
    - key=value=encoder_output: what we can look at
- Feed-forward layer


Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization.
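
To make the cross-attention shapes concrete, here is a minimal sketch. It assumes PyTorch's built-in `nn.MultiheadAttention` rather than the `MultiHeadAttention` class from this article, and the tensor sizes are arbitrary. The query comes from the target side, while key and value come from the encoder output.

```python
import torch
import torch.nn as nn

d_model, num_heads = 32, 4  # arbitrary sizes for illustration

# Built-in attention module used as a stand-in for the custom class
cross_attention = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

target_input = torch.randn(2, 3, d_model)    # [batch, target_seq_len, d_model]
encoder_output = torch.randn(2, 7, d_model)  # [batch, source_seq_len, d_model]

output, weights = cross_attention(
    query=target_input,    # what we're generating
    key=encoder_output,    # what we can look at
    value=encoder_output,
)

print(output.shape)   # one vector per target position
print(weights.shape)  # one row of source weights per target position
```

The output has one `d_model`-sized vector per target position, and the attention weights have shape `[batch, target_seq_len, source_seq_len]`: each target position distributes its attention over the source positions.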

In [None]:
class TransformerDecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super().__init__()
        
        # Sub-layer 1: Masked multi-head self-attention
        self.masked_self_attention = MultiHeadAttention(d_model, num_heads)
        
        # Sub-layer 2: Multi-head encoder-decoder attention
        self.encoder_decoder_attention = MultiHeadAttention(d_model, num_heads)
        
        # Sub-layer 3: Feed-forward network
        self.feed_forward = FeedForward(d_model, d_ff)
        
        # Layer normalization for each sub-layer
        self.layer_norm_1 = nn.LayerNorm(d_model)
        self.layer_norm_2 = nn.LayerNorm(d_model)
        self.layer_norm_3 = nn.LayerNorm(d_model)
        
        # Dropout
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, target_input, encoder_output, source_mask=None, target_mask=None):
        # Sub-layer 1: Masked self-attention on target sequence
        # For self-attention: query, key, and value are all the same input (target)
        masked_attention_output, masked_attention_weights = self.masked_self_attention(
            query=target_input,     # Same target input
            key=target_input,       # Same target input
            value=target_input,     # Same target input
            mask=target_mask        # Causal mask to prevent seeing future tokens
        )
        # Post-norm: residual connection then normalize
        target_input = self.layer_norm_1(target_input + self.dropout(masked_attention_output))
        
        # Sub-layer 2: Encoder-decoder attention
        # Query comes from decoder, key and value come from encoder
        encoder_attention_output, encoder_attention_weights = self.encoder_decoder_attention(
            query=target_input,     # What the decoder is generating
            key=encoder_output,     # What information is available from encoder
            value=encoder_output,   # What information to retrieve from encoder
            mask=source_mask        # Mask for padding tokens in source
        )
        # Post-norm: residual connection then normalize
        target_input = self.layer_norm_2(target_input + self.dropout(encoder_attention_output))
        
        # Sub-layer 3: Feed-forward network
        feed_forward_output = self.feed_forward(target_input)
        # Post-norm: residual connection then normalize
        target_input = self.layer_norm_3(target_input + self.dropout(feed_forward_output))
        
        return target_input

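The full decoder mirrors the encoder: it embeds the target tokens, adds positional encodings, and passes the result through a stack of `num_layers` decoder layers. Every decoder layer receives the same `encoder_output`.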

In [None]:
class TransformerDecoder(nn.Module):
    def __init__(self, target_vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_len, dropout):
        super().__init__()
        
        # Input processing
        self.embedding = Embeddings(target_vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model, max_seq_len, dropout)
        
        # Stack of decoder layers
        self.decoder_layers = nn.ModuleList([
            TransformerDecoderLayer(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])
        
    def forward(self, target_tokens, encoder_output, source_mask=None, target_mask=None):
        # Step 1: Convert tokens to embeddings + add positional encoding
        embeddings = self.embedding(target_tokens)  # [batch, seq_len, d_model]
        decoder_input = self.positional_encoding(embeddings)
        
        # Step 2: Pass through each decoder layer
        for decoder_layer in self.decoder_layers:
            decoder_input = decoder_layer(
                target_input=decoder_input,
                encoder_output=encoder_output,
                source_mask=source_mask,
                target_mask=target_mask
            )
        
        return decoder_input  # [batch, seq_len, d_model]


We also modify the self-attention sub-layer in the decoder stack to
prevent positions from attending to subsequent positions.  This
masking, combined with the fact that the output embeddings are offset by
one position, ensures that the predictions for position $i$ can
depend only on the known outputs at positions less than $i$.

In [59]:
# "Mask out subsequent positions."

In [60]:
def create_causal_mask(seq_len):
    """Create causal mask to prevent attending to future positions"""
    mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1)
    return mask == 0  # Convert to boolean mask
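
A quick way to inspect the mask is to print it for a short sequence. The sketch below repeats the function so that it runs on its own. The second half assumes that blocked positions are filled with `-inf` before the softmax, which is a common convention; the `MultiHeadAttention` class defined earlier may apply the mask differently. The all-zero `scores` tensor is a hypothetical placeholder for real attention scores.

```python
import torch

def create_causal_mask(seq_len):
    """Same definition as above, repeated so this snippet is self-contained."""
    mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1)
    return mask == 0  # True = position may be attended to

mask = create_causal_mask(4)
print(mask.int())  # lower-triangular matrix of ones

# Hypothetical uniform attention scores, used only to show the mask's effect
scores = torch.zeros(4, 4)
masked_scores = scores.masked_fill(~mask, float("-inf"))  # block future positions
weights = torch.softmax(masked_scores, dim=-1)
print(weights)
```

Row $i$ of the mask is `True` for columns $0$ to $i$ only. With uniform scores, the softmax therefore spreads each row's weight evenly over the current and earlier positions and assigns zero weight to future ones.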


> The attention mask below shows the positions each target word (row) is
> allowed to look at (column). Words are blocked from attending to
> future words during training.


TODO: add visualization of mask here

![](images/ModalNet-20.png)

### Transformer Model

Both the encoder and decoder layers are repeated $N$ times. The original paper uses $N=6$, and the `Transformer` class below uses the same value as its default (`num_layers=6`).

![](images/transformer_encoders_decoder_stacks.png)


In [61]:
class Transformer(nn.Module):
    def __init__(self, 
                 source_vocab_size,     # Source vocabulary size
                 target_vocab_size,     # Target vocabulary size  
                 d_model=512,           # Model dimension
                 num_heads=8,           # Number of attention heads
                 num_layers=6,          # Number of encoder/decoder layers
                 d_ff=2048,             # Feed-forward dimension
                 max_seq_len=5000,      # Maximum sequence length
                 dropout=0.1):
        super().__init__()
        
        self.d_model = d_model
        
        # Encoder stack
        self.encoder = TransformerEncoder(
            source_vocab_size=source_vocab_size,
            d_model=d_model, 
            num_heads=num_heads, 
            num_layers=num_layers,
            d_ff=d_ff, 
            max_seq_len=max_seq_len, 
            dropout=dropout
        )
        
        # Decoder stack  
        self.decoder = TransformerDecoder(
            target_vocab_size=target_vocab_size,
            d_model=d_model, 
            num_heads=num_heads, 
            num_layers=num_layers,
            d_ff=d_ff, 
            max_seq_len=max_seq_len, 
            dropout=dropout
        )
        
        # Final output projection to target vocabulary
        self.output_projection = nn.Linear(d_model, target_vocab_size)
        
    def forward(self, source_tokens, target_tokens, source_mask=None, target_mask=None):
        """
        Forward pass for training (teacher forcing)
        
        source_tokens: [batch_size, source_seq_len] - source token ids
        target_tokens: [batch_size, target_seq_len] - target token ids  
        """
        # Step 1: Encode source sequence
        encoder_output = self.encoder(source_tokens, source_mask)
        # Shape: [batch_size, source_seq_len, d_model]
        
        # Step 2: Decode target sequence
        decoder_output = self.decoder(
            target_tokens=target_tokens,
            encoder_output=encoder_output,
            source_mask=source_mask,
            target_mask=target_mask
        )
        # Shape: [batch_size, target_seq_len, d_model]
        
        # Step 3: Project to vocabulary logits
        output_logits = self.output_projection(decoder_output)
        # Shape: [batch_size, target_seq_len, target_vocab_size]
        
        return output_logits

    def encode(self, source_tokens, source_mask=None):
        """Encode source sequence (for inference)"""
        return self.encoder(source_tokens, source_mask)
    
    def decode_step(self, target_tokens, encoder_output, source_mask=None, target_mask=None):
        """Decode one step (for autoregressive generation)"""
        decoder_output = self.decoder(target_tokens, encoder_output, source_mask, target_mask)
        return self.output_projection(decoder_output)
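
The `encode` and `decode_step` methods are meant for autoregressive generation. The following is a minimal greedy-decoding sketch under stated assumptions: `greedy_decode`, `sos_id`, and `eos_id` are hypothetical names that do not appear elsewhere in this article, and `StubModel` is a made-up stand-in that emits a fixed token script so the loop can run without the full model.

```python
import torch

def causal_mask(seq_len):
    # True where attention is allowed (same convention as create_causal_mask)
    return torch.triu(torch.ones(seq_len, seq_len), diagonal=1) == 0

@torch.no_grad()
def greedy_decode(model, source_tokens, sos_id, eos_id, max_len, source_mask=None):
    """Hypothetical helper: encode once, then append the argmax token each step."""
    encoder_output = model.encode(source_tokens, source_mask)
    batch_size = source_tokens.size(0)
    target_tokens = torch.full((batch_size, 1), sos_id, dtype=torch.long)
    for _ in range(max_len):
        target_mask = causal_mask(target_tokens.size(1))
        logits = model.decode_step(target_tokens, encoder_output, source_mask, target_mask)
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)  # last position only
        target_tokens = torch.cat([target_tokens, next_token], dim=1)
        if (next_token == eos_id).all():
            break
    return target_tokens[:, 1:]  # drop the start token

class StubModel:
    """Made-up stand-in: predicts 3, then 4, then <eos> (id 2), ignoring the input."""
    script = [3, 4, 2]

    def encode(self, source_tokens, source_mask=None):
        return source_tokens.float().unsqueeze(-1)

    def decode_step(self, target_tokens, encoder_output, source_mask=None, target_mask=None):
        batch_size, seq_len = target_tokens.shape
        logits = torch.zeros(batch_size, seq_len, 5)
        for pos in range(seq_len):
            logits[:, pos, self.script[min(pos, len(self.script) - 1)]] = 1.0
        return logits

generated = greedy_decode(StubModel(), torch.tensor([[7, 8]]), sos_id=1, eos_id=2, max_len=10)
print(generated)
```

The encoder runs only once per source sentence, while the decoder is called again for every generated token with the sequence produced so far.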


Interview Questions:
- Walk through the forward pass of a transformer block.
- What are the differences between encoder-only, decoder-only, and encoder-decoder transformers?

## Training and Evaluation

In [None]:
#| echo: true
#| output: false
!pip3.7 install -q torchdata torchtext spacy altair
#!python3.7 -m spacy download de_core_news_sm
#!python3.7 -m spacy download en_core_web_sm

In [62]:
import time
from torch.optim.lr_scheduler import LambdaLR
from torchtext.data.functional import to_map_style_dataset
from torch.utils.data import DataLoader
from torchtext.vocab import build_vocab_from_iterator
import torchtext.datasets as datasets
import spacy

from torch.nn.functional import log_softmax, pad

In [66]:
# Example: Training and evaluating a Transformer on a simple English-to-French translation task

# 1. Prepare toy data (English to French pairs)
english_sentences = [
    "hello world", "how are you", "good morning", "thank you", "see you soon"
]
french_sentences = [
    "bonjour le monde", "comment ça va", "bonjour", "merci", "à bientôt"
]

# 2. Build vocabularies
def build_vocab(sentences):
    tokens = set()
    for s in sentences:
        tokens.update(s.split())
    vocab = {word: idx+2 for idx, word in enumerate(sorted(tokens))}
    vocab["<pad>"] = 0
    vocab["<unk>"] = 1
    return vocab

src_vocab = build_vocab(english_sentences)
tgt_vocab = build_vocab(french_sentences)
inv_tgt_vocab = {v: k for k, v in tgt_vocab.items()}

# 3. Tokenize and numericalize
def encode(sentence, vocab, max_len):
    tokens = [vocab.get(w, vocab["<unk>"]) for w in sentence.split()]
    tokens = tokens + [vocab["<pad>"]] * (max_len - len(tokens))
    return tokens[:max_len]

max_src_len = max(len(s.split()) for s in english_sentences)
max_tgt_len = max(len(s.split()) for s in french_sentences)

src_data = torch.tensor([encode(s, src_vocab, max_src_len) for s in english_sentences])
tgt_data = torch.tensor([encode(s, tgt_vocab, max_tgt_len) for s in french_sentences])

# 4. Initialize transformer
transformer = Transformer(
    source_vocab_size=len(src_vocab),
    target_vocab_size=len(tgt_vocab),
    d_model=128,
    num_heads=4,
    num_layers=2,
    d_ff=256,
    max_seq_len=20,
    dropout=0.1
)

optimizer = torch.optim.Adam(transformer.parameters(), lr=1e-3)
loss_function = nn.CrossEntropyLoss(ignore_index=0)

# 5. Training loop (few epochs for demonstration)
for epoch in range(20):
    transformer.train()
    optimizer.zero_grad()
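    # Shift target for teacher forcing: the decoder reads tokens [0..n-2]
    # and is trained to predict tokens [1..n-1]
    tgt_input = tgt_data[:, :-1]
    tgt_output = tgt_data[:, 1:]
    # Note: no source_mask or target_mask is passed, so attention is unmasked here
    logits = transformer(src_data, tgt_input)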
    loss = loss_function(
        logits.view(-1, logits.size(-1)),
        tgt_output.contiguous().view(-1)
    )
    loss.backward()
    optimizer.step()
    if (epoch+1) % 5 == 0:
        print(f"Epoch {epoch+1}, Loss: {loss.item():.4f}")

# 6. Evaluation: Translate English sentences
transformer.eval()
with torch.no_grad():
    for idx, src in enumerate(english_sentences):
        src_tensor = torch.tensor([encode(src, src_vocab, max_src_len)])
        # The toy vocabulary has no start-of-sequence token, so decoding is seeded with <pad>
        tgt_tokens = [tgt_vocab["<pad>"]]
        for _ in range(max_tgt_len):
            tgt_tensor = torch.tensor([tgt_tokens])
            logits = transformer(src_tensor, tgt_tensor)
            next_token = logits[0, -1].argmax().item()
            tgt_tokens.append(next_token)
            if next_token == tgt_vocab["<pad>"]:
                break
        translation = [inv_tgt_vocab.get(tok, "") for tok in tgt_tokens[1:] if tok != tgt_vocab["<pad>"]]
        print(f"EN: {src} -> FR: {' '.join(translation)}")

Epoch 5, Loss: 0.2940
Epoch 10, Loss: 0.0584
Epoch 15, Loss: 0.0239
Epoch 20, Loss: 0.0200
EN: hello world -> FR: monde le monde
EN: how are you -> FR: va ça va
EN: good morning -> FR: bientôt bientôt bientôt
EN: thank you -> FR: bientôt bientôt bientôt
EN: see you soon -> FR: bientôt bientôt bientôt
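
The repetitive translations may stem from how the toy example prepares its data: training passes no causal mask, so the decoder can see the tokens it is asked to predict, and generation starts from `<pad>`, which the decoder never saw as a first token during training. A possible remedy, sketched below under assumptions, is to add explicit start and end tokens and to build a causal mask for the shifted target. The `<sos>` and `<eos>` tokens and the `encode_target` helper are hypothetical additions that are not part of the code above, and the sketch has not been run against the model in this article.

```python
import torch

def build_vocab(sentences):
    # Hypothetical variant of build_vocab with <sos> and <eos> added
    specials = ["<pad>", "<unk>", "<sos>", "<eos>"]
    words = sorted({word for s in sentences for word in s.split()})
    return {token: idx for idx, token in enumerate(specials + words)}

def encode_target(sentence, vocab, max_len):
    # Wrap the sentence in <sos> ... <eos>, then pad to max_len
    ids = [vocab["<sos>"]]
    ids += [vocab.get(word, vocab["<unk>"]) for word in sentence.split()]
    ids += [vocab["<eos>"]]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))

french_sentences = ["bonjour le monde", "merci"]
tgt_vocab = build_vocab(french_sentences)
max_tgt_len = max(len(s.split()) for s in french_sentences) + 2  # room for <sos>, <eos>

tgt_data = torch.tensor([encode_target(s, tgt_vocab, max_tgt_len) for s in french_sentences])

# Teacher forcing: the decoder reads <sos> w1 w2 ... and predicts w1 w2 ... <eos>
tgt_input = tgt_data[:, :-1]
tgt_output = tgt_data[:, 1:]

# Causal mask for the decoder input (True = attention allowed)
seq_len = tgt_input.size(1)
target_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1) == 0

print(tgt_input)
print(tgt_output)
```

With this layout, generation can start from `<sos>` and stop at `<eos>`, and `target_mask` can be passed as the `target_mask` argument of the model's forward pass, provided the attention implementation accepts a boolean mask of shape `[seq_len, seq_len]`.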


## References
- [Attention Is All You Need (Vaswani et al., 2017)](https://arxiv.org/abs/1706.03762)
- [The Annotated Transformer](https://nlp.seas.harvard.edu/annotated-transformer/)
- [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)


TODO: ### Softmax
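A learned linear transformation followed by a softmax converts the decoder output to predicted next-token probabilities. In the `Transformer` class above, `output_projection` is that linear transformation; it returns raw logits, and `nn.CrossEntropyLoss` applies the log-softmax internally during training.
The original paper shares the same weight matrix between the two embedding layers and the pre-softmax linear transformation. The implementation above keeps the two embeddings and `output_projection` as separate parameters.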
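
Weight sharing between an embedding and the pre-softmax projection can be sketched as follows. This is an illustration rather than part of the model above: it uses a plain `nn.Embedding` instead of the article's `Embeddings` class, the sizes are arbitrary, and the projection is created without a bias so that the embedding matrix is its only parameter.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 10, 16  # arbitrary sizes for illustration

embedding = nn.Embedding(vocab_size, d_model)                   # weight: [vocab_size, d_model]
output_projection = nn.Linear(d_model, vocab_size, bias=False)  # weight: [vocab_size, d_model]

# Tie the weights: both modules now use the same Parameter object
output_projection.weight = embedding.weight

decoder_output = torch.randn(2, 3, d_model)     # [batch, seq_len, d_model]
logits = output_projection(decoder_output)      # [batch, seq_len, vocab_size]
probabilities = torch.softmax(logits, dim=-1)   # next-token distribution per position

print(probabilities.shape)
```

Tying works because `nn.Linear(d_model, vocab_size)` stores its weight as `[vocab_size, d_model]`, the same shape as the embedding matrix. Sharing the target embedding with the source embedding as well would additionally require a shared vocabulary.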