# Understanding and Implementing a Transformer Encoder

<small>

**Introduction:**

**Goal of this Notebook:**
The goal of this notebook is to provide a comprehensive understanding of the transformer encoder architecture, a key component in many state-of-the-art models in natural language processing (NLP). Participants will learn how to implement each core component of a transformer encoder from scratch, including the embedding layer, positional encoding, multi-head attention, feedforward networks, and residual connections.

**Learning Outcomes:**
By the end of this notebook, participants will be able to:
- Understand the role and function of each component within a transformer encoder.
- Implement a multi-layer transformer encoder from scratch using PyTorch.
- Apply embeddings and positional encodings to prepare input sequences for transformer models.
- Gain practical experience in constructing deep learning models that use the transformer architecture for sequence processing tasks, and in running a forward pass through them.

This notebook serves as both an educational resource and a practical guide, walking participants through the process of building and understanding transformer encoders, which are foundational in tasks such as machine translation, text classification, and more.

</small>


In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

# InputEmbedding Class Definition

<small>

**Details:**  
This section defines the `InputEmbedding` class, a custom embedding layer used in neural networks. The class inherits from `nn.Module`, making it compatible with PyTorch's model framework.

**Key Components:**

- **Initialization (`__init__` method):**
  - **Parameters:**
    - `d_model` (int): The dimension of the embedding vector. This is a key hyperparameter that defines the size of each word's vector representation.
    - `vocab_size` (int): The size of the vocabulary. This parameter determines the number of unique words that can be embedded.
  - **Embedding Layer:**
    - The `nn.Embedding` layer is initialized with `vocab_size` and `d_model`, creating a lookup table where each word in the vocabulary is mapped to a `d_model`-dimensional vector.

- **Forward Method:**
  - **Input:** `x`, a tensor of word indices.
  - **Operation:**
    - The method retrieves the corresponding embeddings for the input word indices using the embedding layer.
    - The output embeddings are multiplied by the square root of `d_model` (`math.sqrt(self.d_model)`), as in the original Transformer paper. A common explanation is that this keeps the embedding magnitudes comparable to the positional encodings that are added next.
  - **Output:** The scaled embeddings, which can be fed into subsequent layers of the model.

</small>


In [None]:
# InputEmbedding class creates an embedding layer and scales its output by the square root of the embedding dimension (d_model).
# The scaling follows the original Transformer paper; it keeps embedding magnitudes comparable to the positional encodings added next.
class InputEmbedding(nn.Module):
  def __init__(self, d_model:int, vocab_size:int):
    super().__init__()
    self.d_model = d_model
    self.vocab_size = vocab_size
    self.embedding = nn.Embedding(vocab_size, d_model)
  def forward(self, x):
    return self.embedding(x) * math.sqrt(self.d_model)
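
The lookup-and-scale step can also be traced without PyTorch. The following is a minimal sketch under stated assumptions: `make_embedding_table` and `embed` are hypothetical helpers that stand in for `nn.Embedding`, and the table is filled with random Gaussian values purely for illustration.

```python
import math
import random

def make_embedding_table(vocab_size, d_model, seed=0):
    """Hypothetical stand-in for nn.Embedding: one random row per token id."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(vocab_size)]

def embed(token_ids, table):
    """Look up each token's row and scale it by sqrt(d_model)."""
    d_model = len(table[0])
    scale = math.sqrt(d_model)
    return [[value * scale for value in table[token_id]] for token_id in token_ids]

table = make_embedding_table(vocab_size=5, d_model=4)
vectors = embed([0, 3, 3], table)
print(len(vectors), len(vectors[0]))  # 3 4
```

Repeated token ids map to identical vectors, which is why positional information has to be added separately.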

# PositionalEncoding Class Definition

<small>

**Details:**  
This section defines the `PositionalEncoding` class, which generates and applies sinusoidal positional encodings to the input embeddings. These encodings enable the model to capture the order of tokens in a sequence, which is essential for processing sequential data. The class also includes a dropout layer to reduce the risk of overfitting by randomly zeroing out elements in the input tensor during training.

**Key Components:**

- **Initialization (`__init__` method):**
  - **Parameters:**
    - `d_model` (int): The dimension of the embedding vector. This defines the size of each positional encoding vector.
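    - `seq_len` (int): The maximum sequence length supported. Encodings are precomputed for `seq_len` positions, and `forward` slices out the first `x.shape[1]` of them, so shorter inputs work but longer inputs are not supported.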
    - `dropout` (float): The dropout rate, which controls the probability of zeroing out elements in the input tensor.
  - **Positional Encoding Generation:**
    - A matrix of positional encodings is created with shape `(seq_len, d_model)`, where each position is encoded using sinusoidal functions. Specifically:
      - The even indices of the encoding vector are assigned values using the sine function.
      - The odd indices are assigned values using the cosine function.
    - Dimensions `2i` and `2i + 1` share the frequency `1 / 10000^(2i / d_model)`, so low dimensions oscillate quickly with position and high dimensions oscillate slowly. The implementation assumes `d_model` is even.
  - **Buffer Registration:**
    - The generated positional encodings are stored in a buffer, which ensures they are part of the model's state but not updated during training.

- **Forward Method:**
  - **Input:** `x`, the input embeddings tensor.
  - **Operation:**
    - The positional encodings are added to the input embeddings to inject the positional information.
    - The dropout layer is applied to the resulting tensor, adding a regularization effect during training.
  - **Output:** The modified embeddings with positional information, ready for further processing by the model.

</small>


In [None]:
# PositionalEncoding class generates and applies sinusoidal positional encodings to the input embeddings.
# This encoding helps the model capture the order of tokens in a sequence, which is crucial for sequential data processing.
# The dropout layer is used to prevent overfitting by randomly zeroing some of the elements in the input tensor.
class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, seq_len: int, dropout: float) -> None:
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        positional_encoding = torch.zeros(seq_len, d_model)
        position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        positional_encoding[:, 0::2] = torch.sin(position * div_term)
        positional_encoding[:, 1::2] = torch.cos(position * div_term)
        positional_encoding = positional_encoding.unsqueeze(0)
        self.register_buffer('positional_encoding', positional_encoding)

    def forward(self, x):
        x = x + self.positional_encoding[:, :x.shape[1], :].requires_grad_(False)
        return self.dropout(x)
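
To check the encoding values by hand, the same table can be built with plain loops. This is a dependency-free sketch, not the class above: `sinusoidal_encoding` is a hypothetical name, and it assumes `d_model` is even.

```python
import math

def sinusoidal_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal encodings (d_model must be even)."""
    table = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            # Same frequency as div_term above: 10000 ** (-i / d_model)
            angle = pos / (10000.0 ** (i / d_model))
            table[pos][i] = math.sin(angle)      # even index: sine
            table[pos][i + 1] = math.cos(angle)  # odd index: cosine
    return table

pe = sinusoidal_encoding(seq_len=3, d_model=4)
print(pe[0])  # [0.0, 1.0, 0.0, 1.0]
```

Position 0 always encodes to alternating 0 and 1 because `sin(0) = 0` and `cos(0) = 1`; later positions differ in every pair of dimensions.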

# LayerNormalization Class Definition

<small>

**Details:**  
This section defines the `LayerNormalization` class, which applies layer normalization to the input tensor. Layer normalization stabilizes the learning process by normalizing each feature vector (the last dimension) to roughly zero mean and unit variance, and then applying a learnable scale and shift. Unlike batch normalization, the statistics are computed per example, so they do not depend on the batch size. The technique is often described as reducing internal covariate shift, where the distribution of layer inputs changes during training.

**Key Components:**

- **Initialization (`__init__` method):**
  - **Parameters:**
    - `eps` (float): A small constant added to the denominator for numerical stability. The default value is set to \(10^{-6}\).
  - **Learnable Parameters:**
    - `alpha`: A learnable scale initialized to 1. Here it is a single scalar (shape `(1,)`) shared by all features, whereas PyTorch's built-in `nn.LayerNorm` learns one scale per feature by default.
    - `bias`: A learnable shift initialized to 0, likewise a single scalar shared by all features.

- **Forward Method:**
  - **Input:** `x`, the input tensor.
  - **Operation:**
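    - The mean and standard deviation of the input tensor are calculated along the last dimension (`dim=-1`). Note that `x.std` uses the unbiased estimator (dividing by `n - 1`) by default.
    - The input tensor is then normalized by subtracting the mean and dividing by the standard deviation plus `eps`. Adding `eps` to the standard deviation, rather than to the variance under the square root as `nn.LayerNorm` does, gives slightly different values but serves the same purpose.
    - The normalized tensor is scaled by `alpha` and shifted by `bias` to allow the model to adjust the normalized output.
  - **Output:** The layer-normalized tensor. With the initial `alpha = 1` and `bias = 0`, each feature vector has a mean of approximately 0 and a variance of approximately 1; training may move it away from those values.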

</small>


In [None]:
# LayerNormalization class applies layer normalization to the input tensor.
# Each feature vector (last dimension) is normalized to roughly zero mean and unit variance,
# then scaled by alpha and shifted by bias. Note that alpha and bias are single scalars here,
# and x.std uses the unbiased (n - 1) estimator by default.
class LayerNormalization(nn.Module):
    def __init__(self, eps: float = 10**-6) -> None:
        super().__init__()
        self.eps = eps
        self.alpha = nn.Parameter(torch.ones(1))
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        std = x.std(dim=-1, keepdim=True)
        return self.alpha * (x - mean) / (std + self.eps) + self.bias
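
The arithmetic in `forward` can be reproduced on a single feature vector. The sketch below is illustrative only: `layer_norm` is a hypothetical helper, and it deliberately mirrors the choices made above (sample standard deviation, `eps` added to the standard deviation, scalar `alpha` and `bias`).

```python
import math

def layer_norm(features, alpha=1.0, bias=0.0, eps=1e-6):
    """Normalize one feature vector the same way as the forward method above."""
    n = len(features)
    mean = sum(features) / n
    # Sample standard deviation (divides by n - 1), matching x.std's default
    variance = sum((v - mean) ** 2 for v in features) / (n - 1)
    std = math.sqrt(variance)
    return [alpha * (v - mean) / (std + eps) + bias for v in features]

normalized = layer_norm([1.0, 2.0, 3.0, 4.0])
print([round(v, 4) for v in normalized])
```

The result is symmetric around zero, so its mean is (numerically) zero. Because of the `n - 1` divisor, the values are slightly smaller in magnitude than those from a normalization that divides by `n`.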

# FeedForwardBlock Class Definition

<small>

**Details:**  
This section defines the `FeedForwardBlock` class, which implements a two-layer feedforward neural network with ReLU activation and dropout. The block is applied to each position of the sequence independently, using the same weights, and in transformer models it further processes the output of the attention mechanism.

**Key Components:**

- **Initialization (`__init__` method):**
  - **Parameters:**
    - `d_model` (int): The dimension of the input and output features. This parameter controls the size of the input and the final output of the block.
    - `d_ff` (int): The dimension of the hidden layer. This is typically larger than `d_model` and allows the network to learn more complex representations by increasing the number of features.
    - `dropout` (float): The dropout rate, which adds regularization to prevent overfitting by randomly zeroing out elements of the hidden layer during training.
  - **Layers:**
    - `linear_1`: The first linear layer, which expands the input from `d_model` to `d_ff` dimensions.
    - `dropout`: The dropout layer applied after the ReLU activation to introduce regularization.
    - `linear_2`: The second linear layer, which projects the output back from `d_ff` to `d_model` dimensions.

- **Forward Method:**
  - **Input:** `x`, the input tensor.
  - **Operation:**
    - The input tensor is first passed through `linear_1`, which expands the dimensionality.
    - A ReLU activation function is applied to introduce non-linearity.
    - The dropout layer is then applied to the activated output to prevent overfitting.
    - Finally, the tensor is passed through `linear_2`, which projects it back to the original dimension (`d_model`).
  - **Output:** The processed tensor, which has the same dimensionality as the input but with enhanced feature representations.

</small>


In [None]:
# FeedForwardBlock class implements a two-layer feedforward neural network with ReLU activation and dropout.
# This block is typically used in transformer models to process the output of the attention mechanism.
# The first linear layer expands the dimensionality, the ReLU activation adds non-linearity,
# dropout is applied for regularization, and the second linear layer projects the output back to the original dimension.
class FeedForwardBlock(nn.Module):
    def __init__(self, d_model: int, d_ff: int, dropout: float) -> None:
        super().__init__()
        self.linear_1 = nn.Linear(d_model, d_ff)
        self.dropout = nn.Dropout(dropout)
        self.linear_2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.linear_2(self.dropout(F.relu(self.linear_1(x))))
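
A small worked example may make the expand, activate, project sequence concrete. This is a sketch with hand-picked weights, assuming `d_model = 2` and `d_ff = 3`; the helper names are hypothetical and dropout is omitted.

```python
def linear(vector, weights, bias):
    """Compute weights @ vector + bias; weights has one row per output feature."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weights, bias)]

def relu(vector):
    return [max(0.0, v) for v in vector]

def feed_forward(vector, w1, b1, w2, b2):
    """Expand to d_ff, apply ReLU, then project back to d_model."""
    return linear(relu(linear(vector, w1, b1)), w2, b2)

# Hand-picked weights for d_model = 2, d_ff = 3
w1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 1.0, 1.0], [1.0, -1.0, 0.0]]
b2 = [0.5, 0.0]

out = feed_forward([2.0, -3.0], w1, b1, w2, b2)
print(out)  # [3.5, 2.0]

# Position-wise: every token in a sequence goes through the same weights
sequence_out = [feed_forward(token, w1, b1, w2, b2) for token in [[2.0, -3.0], [0.0, 0.0]]]
```

The hidden vector is `[2, -3, 1]` before the ReLU and `[2, 0, 1]` after it, which is where the negative component is discarded.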

# MultiHeadAttentionBlock Class Definition

<small>

**Details:**  
This section defines the `MultiHeadAttentionBlock` class, which implements the multi-head attention mechanism used in transformer models. Multi-head attention allows the model to focus on different parts of the input sequence simultaneously, capturing various relationships between tokens. The mechanism involves splitting the input into multiple heads, performing scaled dot-product attention on each head, and then concatenating and projecting the results back to the original dimension.

**Key Components:**

- **Initialization (`__init__` method):**
  - **Parameters:**
    - `d_model` (int): The dimension of the input features. It must be divisible by the number of heads (`h`).
    - `h` (int): The number of attention heads. Each head attends over the whole sequence but works on its own `d_k = d_model // h` slice of the projected feature dimension.
    - `dropout` (float): The dropout rate applied to the attention weights (after the softmax) to prevent overfitting.
  - **Layer Definitions:**
    - `w_q`: Linear layer to generate query vectors from the input.
    - `w_k`: Linear layer to generate key vectors from the input.
    - `w_v`: Linear layer to generate value vectors from the input.
    - `w_o`: Linear layer to project the concatenated output of all heads back to the original dimension.
    - `dropout`: Dropout layer applied to the attention weights (the softmax output) for regularization.

- **Attention Method:**
  - **Input:** `query`, `key`, `value` tensors, and an optional `mask`.
  - **Operation:**
    - Compute attention scores by performing a scaled dot-product of the `query` and `key` tensors.
    - If a `mask` is provided, scores at positions where `mask == 0` are set to `-1e9`, so those positions receive a weight of effectively zero after the softmax.
    - Apply softmax to the attention scores to get the attention weights.
    - Optionally, apply dropout to the attention weights.
    - Multiply the attention weights with the `value` tensor to get the output of the attention mechanism.

- **Forward Method:**
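  - **Input:** `q`, `k`, `v` tensors of shape `(batch, seq_len, d_model)`, and a `mask` (pass `None` to disable masking).
  - **Operation:**
    - Apply the linear layers (`w_q`, `w_k`, `w_v`) to compute the query, key, and value vectors.
    - Reshape and transpose these vectors from `(batch, seq_len, d_model)` to `(batch, h, seq_len, d_k)` to create multiple heads.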
    - Perform the attention mechanism on each head using the `attention` method.
    - Concatenate the outputs from all heads and project them back to the original dimensionality using the `w_o` layer.
  - **Output:** The final tensor, which represents the result of the multi-head attention mechanism, ready for subsequent layers in the model.

</small>
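
The attention steps listed above can be sketched for a single head using only the standard library. This is an illustration under stated assumptions rather than the class itself: the `softmax` and `attention` helpers are hypothetical, they operate on plain lists, and dropout is left out.

```python
import math

def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values, mask=None):
    """Single-head scaled dot-product attention on lists of vectors.

    mask[i][j] == 0 hides key j from query i.
    """
    d_k = len(queries[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        if mask is not None:
            scores = [s if mask[i][j] != 0 else -1e9 for j, s in enumerate(scores)]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output component at a time
        outputs.append([sum(w * v[c] for w, v in zip(weights, values))
                        for c in range(len(values[0]))])
        all_weights.append(weights)
    return outputs, all_weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mask = [[1, 1, 0] for _ in tokens]  # treat the third token as padding
outputs, weights = attention(tokens, tokens, tokens, mask)
print([round(w, 3) for w in weights[0]])
```

Each row of `weights` sums to 1, and the masked column receives a weight of zero, so the third token contributes nothing to any output.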


In [None]:
# MultiHeadAttentionBlock class implements the multi-head attention mechanism used in transformer models.
# Multi-head attention allows the model to focus on different parts of the input sequence simultaneously,
# capturing various relationships between tokens. The input is split into multiple heads, each head performs
# scaled dot-product attention, and the results are concatenated and projected back to the original dimension.
class MultiHeadAttentionBlock(nn.Module):
    def __init__(self, d_model: int, h: int, dropout: float) -> None:
        super().__init__()
        self.d_model = d_model
        self.h = h
        assert d_model % h == 0, "d_model is not divisible by h"
        self.d_k = d_model // h
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)

    def attention(self, query, key, value, mask, dropout):
        d_k = query.shape[-1]
        attention_scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
        if mask is not None:
            attention_scores = attention_scores.masked_fill(mask == 0, -1e9)
        attention_scores = attention_scores.softmax(dim=-1)
        if dropout is not None:
            attention_scores = dropout(attention_scores)
        return torch.matmul(attention_scores, value)

    def forward(self, q, k, v, mask):
        query = self.w_q(q)
        key = self.w_k(k)
        value = self.w_v(v)
        query = query.view(query.shape[0], query.shape[1], self.h, self.d_k).transpose(1, 2)
        key = key.view(key.shape[0], key.shape[1], self.h, self.d_k).transpose(1, 2)
        value = value.view(value.shape[0], value.shape[1], self.h, self.d_k).transpose(1, 2)
        x = self.attention(query, key, value, mask, self.dropout)
        x = x.transpose(1, 2).contiguous().reshape(x.shape[0], -1, self.h * self.d_k)
        return self.w_o(x)
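
The `view` and `transpose` calls in `forward` only rearrange values; they do not mix them. A rough list-based sketch of the same rearrangement for one sequence follows, where `split_heads` and `merge_heads` are hypothetical helpers and the batch dimension is omitted.

```python
def split_heads(sequence, h):
    """(seq_len, d_model) -> (h, seq_len, d_k): each head gets a contiguous feature slice."""
    d_model = len(sequence[0])
    assert d_model % h == 0, "d_model is not divisible by h"
    d_k = d_model // h
    return [[token[head * d_k:(head + 1) * d_k] for token in sequence] for head in range(h)]

def merge_heads(heads):
    """(h, seq_len, d_k) -> (seq_len, d_model): concatenate the heads per position."""
    seq_len = len(heads[0])
    return [[value for head in heads for value in head[pos]] for pos in range(seq_len)]

sequence = [[1, 2, 3, 4], [5, 6, 7, 8]]  # seq_len = 2, d_model = 4
heads = split_heads(sequence, h=2)
print(heads[1])  # [[3, 4], [7, 8]]
```

Merging the heads again restores the original layout, which corresponds to the `transpose(1, 2)` and `reshape` step before `w_o`.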

# Residual Class Definition

<small>

**Details:**  
This section defines the `Residual` class, which implements a residual connection combined with layer normalization and dropout. Residual connections are crucial in training deep networks as they help mitigate the vanishing gradient problem, allowing gradients to flow more effectively through the network. In this class, the input is first normalized, passed through a sublayer, then dropout is applied, and finally, the processed output is added back to the original input.

**Key Components:**

- **Initialization (`__init__` method):**
  - **Parameters:**
    - `dropout` (float): The dropout rate used to regularize the output of the sublayer by randomly zeroing out elements, reducing the risk of overfitting.
  - **Layer Definitions:**
    - `dropout`: A dropout layer applied after the sublayer to introduce regularization.
    - `norm`: A layer normalization module, which normalizes the input to have zero mean and unit variance, improving training stability.

- **Forward Method:**
  - **Input:**
    - `x`: The input tensor that passes through the residual connection.
    - `sublayer`: A function or layer that processes the normalized input tensor.
  - **Operation:**
    - First, the input tensor `x` is normalized using the `LayerNormalization` instance.
    - The normalized tensor is then passed through the provided `sublayer`.
    - Dropout is applied to the output of the sublayer for regularization.
    - Finally, the resulting tensor is added back to the original input `x` to complete the residual connection.
  - **Output:** The tensor that combines the original input with the output of the sublayer, enhanced by the residual connection for better gradient flow and model performance.

</small>


In [None]:
# Residual class implements a residual connection with layer normalization and dropout.
# Residual connections help in training deep networks by mitigating the vanishing gradient problem,
# allowing gradients to flow through the network more effectively.
# The input is first normalized, passed through a sublayer, followed by dropout,
# and finally added back to the original input.
class Residual(nn.Module):
    def __init__(self, dropout: float) -> None:
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.norm = LayerNormalization()

    def forward(self, x, sublayer):
        return x + self.dropout(sublayer(self.norm(x)))
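
The order of operations matters here: the class above normalizes before the sublayer, while the original Transformer paper normalizes after the addition. The sketch below contrasts the two on a single vector; the function names are hypothetical, dropout is omitted, and the learnable scale and shift are left at their initial values.

```python
import math

def normalize(vector, eps=1e-6):
    """Layer normalization with alpha = 1 and bias = 0, using the sample standard deviation."""
    mean = sum(vector) / len(vector)
    std = math.sqrt(sum((v - mean) ** 2 for v in vector) / (len(vector) - 1))
    return [(v - mean) / (std + eps) for v in vector]

def pre_norm_residual(vector, sublayer):
    """x + sublayer(norm(x)): the order used by the Residual class."""
    update = sublayer(normalize(vector))
    return [x + u for x, u in zip(vector, update)]

def post_norm_residual(vector, sublayer):
    """norm(x + sublayer(x)): the order described in the original Transformer paper."""
    return normalize([x + u for x, u in zip(vector, sublayer(vector))])

def zero_sublayer(vector):
    """A sublayer that contributes nothing, to isolate the effect of the wrapper."""
    return [0.0 for _ in vector]

x = [10.0, 20.0, 30.0]
print(pre_norm_residual(x, zero_sublayer))  # [10.0, 20.0, 30.0]
post = post_norm_residual(x, zero_sublayer)
```

With pre-norm the input passes through the skip path unchanged, whereas post-norm rescales it even when the sublayer adds nothing. This is one reason a final normalization is applied after the last block in a pre-norm stack.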

# EncoderBlock Class Definition

<small>

**Details:**  
This section defines the `EncoderBlock` class, which represents a single block in the transformer encoder. Each encoder block consists of a multi-head self-attention mechanism, followed by a feedforward network. Each sublayer is wrapped in a residual connection that normalizes the sublayer's input first (the pre-norm arrangement, `x + sublayer(norm(x))`), rather than normalizing after the addition as in the original Transformer paper. This structure enables the encoder to capture complex relationships within the input sequence, which is crucial for tasks such as machine translation, text summarization, and other sequence processing tasks.

**Key Components:**

- **Initialization (`__init__` method):**
  - **Parameters:**
    - `self_attention_block`: An instance of `MultiHeadAttentionBlock`, responsible for capturing dependencies between tokens in the input sequence by computing attention scores across different heads.
    - `feed_forward_block`: An instance of `FeedForwardBlock`, a two-layer feedforward neural network that further processes the output of the self-attention mechanism.
    - `dropout` (float): The dropout rate applied in the residual connections to prevent overfitting.
  - **Residual Connections:**
    - Two residual connections are defined using the `Residual` class:
      - The first residual connection wraps around the multi-head self-attention block.
      - The second residual connection wraps around the feedforward block.

- **Forward Method:**
  - **Input:**
    - `x`: The input tensor representing the sequence to be processed.
    - `src_mask`: A source mask tensor that prevents attention to certain positions in the sequence, typically used to handle padding tokens.
  - **Operation:**
    - The input tensor `x` is first passed through the self-attention block, with a residual connection wrapping around it.
    - The output is then passed through the feedforward block, again wrapped with a residual connection.
    - The residual connections ensure that the input is added back to the output of each sublayer, helping in stabilizing the gradient flow and improving the overall training process.
  - **Output:** The processed tensor, enriched with information captured by the self-attention and feedforward blocks, ready for further processing by subsequent encoder blocks or layers.

</small>


In [None]:
# EncoderBlock class represents a single block in the transformer encoder.
# It consists of a multi-head self-attention mechanism, followed by a feedforward network.
# Each sublayer is wrapped in a residual connection that normalizes the sublayer's input first (pre-norm).
# This structure allows the encoder to capture complex relationships in the input sequence while maintaining stable gradients.
class EncoderBlock(nn.Module):
    def __init__(self, self_attention_block: MultiHeadAttentionBlock, feed_forward_block: FeedForwardBlock, dropout: float) -> None:
        super().__init__()
        self.self_attention_block = self_attention_block
        self.feed_forward_block = feed_forward_block
        self.residual_connections = nn.ModuleList([Residual(dropout) for _ in range(2)])

    def forward(self, x, src_mask):
        x = self.residual_connections[0](x, lambda x: self.self_attention_block(x, x, x, src_mask))
        x = self.residual_connections[1](x, self.feed_forward_block)
        return x
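
The `src_mask` argument is typically built from the token ids. One possible way to do that is sketched below with nested lists; `padding_mask` and the choice of `pad_id = 0` are assumptions for illustration, since the notebook's vocabulary does not define a padding token. The `(batch, 1, 1, seq_len)` layout is chosen so that, once wrapped in `torch.tensor(...)`, it broadcasts against attention scores of shape `(batch, h, seq_len, seq_len)`.

```python
def padding_mask(batch_token_ids, pad_id):
    """Return nested lists shaped (batch, 1, 1, seq_len): 1 = real token, 0 = padding."""
    return [[[[0 if token == pad_id else 1 for token in sequence]]]
            for sequence in batch_token_ids]

# Two sequences padded to length 5 with a hypothetical pad id of 0
batch = [[4, 2, 3, 0, 0], [1, 3, 0, 0, 0]]
mask = padding_mask(batch, pad_id=0)
print(mask[0][0][0])  # [1, 1, 1, 0, 0]
```

Positions marked 0 are the ones that `masked_fill(mask == 0, -1e9)` suppresses, so no query attends to padding.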

# Encoder Class Definition

<small>

**Details:**  
This section defines the `Encoder` class, which stacks multiple encoder blocks to form the complete transformer encoder. The input sequence is passed sequentially through each encoder block, allowing the model to incrementally build a rich and complex representation of the input data. After processing through all the encoder blocks, layer normalization is applied to the final output to ensure stability and improve the overall training process.

**Key Components:**

- **Initialization (`__init__` method):**
  - **Parameters:**
    - `layers`: An `nn.ModuleList` containing multiple instances of `EncoderBlock`. Each block is responsible for processing the input sequence, capturing dependencies, and transforming the representation at each layer.
  - **Layer Normalization:**
    - The final output of the stacked encoder blocks is normalized using an instance of `LayerNormalization`, which ensures that the output has stable mean and variance, improving training stability.

- **Forward Method:**
  - **Input:**
    - `x`: The input tensor representing the sequence to be processed.
    - `mask`: A mask tensor that is passed to each encoder block to control which parts of the input sequence should be attended to, typically used to handle padding tokens.
  - **Operation:**
    - The input tensor `x` is passed through each encoder block in the `layers` list in sequence. Each block processes the input, capturing increasingly complex relationships within the sequence.
    - After passing through all the encoder blocks, the final output tensor is normalized using the `LayerNormalization` instance.
  - **Output:** The normalized tensor that has been processed through all the encoder blocks, enriched with the hierarchical representation built by the transformer encoder.

</small>


In [None]:
# Encoder class stacks multiple encoder blocks to form the transformer encoder.
# The input passes through each encoder block in sequence, allowing the model to build a rich representation of the input.
# Finally, layer normalization is applied to the output of the last encoder block for stability.
class Encoder(nn.Module):
    def __init__(self, layers: nn.ModuleList) -> None:
        super().__init__()
        self.layers = layers
        self.norm = LayerNormalization()

    def forward(self, x, mask):
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)

# Example Usage of Transformer Encoder

<small>

**Details:**
This example demonstrates how to use the previously defined classes, including `InputEmbedding`, `PositionalEncoding`, `EncoderBlock`, `MultiHeadAttentionBlock`, `FeedForwardBlock`, and `Encoder`. The code walks through the process of embedding a sequence of tokens, applying positional encoding, and processing the sequence through a stack of encoder blocks.

**Key Steps:**

- **Vocabulary and Input Sequence:**
  - A small vocabulary is created with five words: `"this"`, `"is"`, `"an"`, `"example"`, and `"sentence"`.
  - A sample sentence is converted into a list of indices corresponding to the vocabulary.
  - The input tensor is then formed by converting these indices into a tensor and adding a batch dimension.

- **Embedding and Positional Encoding:**
  - The `InputEmbedding` class is used to convert the input indices into embedding vectors of dimension `d_model = 512`.
  - The `PositionalEncoding` class is then applied to add positional information to these embeddings, which helps the transformer model understand the order of tokens in the sequence.

- **Encoder Layers:**
  - Six encoder blocks are created, each consisting of a `MultiHeadAttentionBlock` and a `FeedForwardBlock`.
  - The `MultiHeadAttentionBlock` uses 8 attention heads, and the `FeedForwardBlock` expands the dimensionality to `d_ff = 2048` before projecting it back to `d_model = 512`.

- **Forward Pass:**
  - The embedded and positionally encoded tensor is passed through the stack of encoder blocks.
  - The output tensor, which contains the enriched representations of the input sequence, is then printed along with its shape.

**Expected Output:**
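- The final output tensor will have the same shape as the embedded input, with the values transformed by all the encoder blocks. The exact values differ from run to run because the weights are randomly initialized and dropout is active (modules are in training mode by default).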
- The shape of the output will be `(1, seq_len, d_model)` which is `(1, 5, 512)` in this case.

</small>


In [None]:
vocab_size = 5  # vocabulary size example
d_model = 512
seq_len = 5  # sequence length example

# test example
vocab = {word: idx for idx, word in enumerate(["this", "is", "an", "example", "sentence"])}
sentence = ["this", "is", "an", "example", "sentence"]

# convert the sentence into indices
input_indices = [vocab[word] for word in sentence]

input_tensor = torch.tensor(input_indices, dtype=torch.long).unsqueeze(0)  # (1, seq_len)

# create embedding and positional encoding
embedding = InputEmbedding(d_model=d_model, vocab_size=vocab_size)
pos_encoding = PositionalEncoding(d_model=d_model, seq_len=seq_len, dropout=0.1)

# create encoder layers
num_layers = 6
dropout = 0.1
attention_heads = 8
d_ff = 2048

layers = nn.ModuleList([
    EncoderBlock(
        MultiHeadAttentionBlock(d_model=d_model, h=attention_heads, dropout=dropout),
        FeedForwardBlock(d_model=d_model, d_ff=d_ff, dropout=dropout),
        dropout=dropout
    ) for _ in range(num_layers)
])

encoder = Encoder(layers=layers)

# forward pass
x = embedding(input_tensor)
x = pos_encoding(x)
output = encoder(x, None)

print(output)
print(output.shape)

tensor([[[-4.7664e-04,  3.5593e-01, -1.0730e+00,  ..., -1.5652e+00,
          -9.0137e-01,  7.7596e-01],
         [-4.8475e-01,  7.7966e-01, -1.0795e-02,  ..., -2.6110e-01,
           5.0710e-01,  4.4744e-01],
         [-1.2681e-01,  1.8640e-01, -1.5022e+00,  ...,  2.3316e+00,
          -1.5395e-01,  7.0094e-01],
         [-6.8704e-01,  1.4763e-02,  1.5459e-01,  ..., -9.5795e-01,
           5.3138e-01, -4.1171e-01],
         [-2.0745e+00,  3.8764e-01,  1.1710e-02,  ...,  7.0964e-01,
           2.4866e-01, -4.2348e-01]]], grad_fn=<AddBackward0>)
torch.Size([1, 5, 512])
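
As a rough sanity check on this configuration, the number of learnable parameters can be worked out by hand. The sketch below assumes the classes exactly as defined in this notebook (linear layers with biases, scalar `alpha` and `bias` in each `LayerNormalization`, no parameters in `PositionalEncoding`); the function names are hypothetical.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of one linear layer."""
    return in_features * out_features + out_features

def encoder_param_count(d_model, d_ff, num_layers, vocab_size):
    attention = 4 * linear_params(d_model, d_model)    # w_q, w_k, w_v, w_o
    feed_forward = linear_params(d_model, d_ff) + linear_params(d_ff, d_model)
    layer_norm = 2                                     # scalar alpha and scalar bias
    block = attention + feed_forward + 2 * layer_norm  # two Residual wrappers per block
    encoder = num_layers * block + layer_norm          # plus the final normalization
    embedding = vocab_size * d_model
    return {"block": block, "encoder": encoder, "embedding": embedding}

counts = encoder_param_count(d_model=512, d_ff=2048, num_layers=6, vocab_size=5)
print(counts["block"])    # 3150340
print(counts["encoder"])  # 18902042
```

The number of heads does not appear in the count, because splitting `d_model` into heads only reshapes the projections. The figures can be compared with `sum(p.numel() for p in encoder.parameters())` in the notebook.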
