# Transformers Architecture

References:
- [Attention is All You Need](https://arxiv.org/pdf/1706.03762.pdf)
- [NLP with Transformers](https://www.amazon.com.au/Natural-Language-Processing-Transformers-Applications/dp/1098103246)
- [Transformer Anatomy](https://github.com/nlp-with-transformers/notebooks/blob/main/03_transformer-anatomy.ipynb)
- [Advanced Deep Learning with Python](https://www.amazon.com.au/Advanced-Deep-Learning-Python-Vasilev/dp/178995617X)

<a href="https://colab.research.google.com/github/paulaceccon/deep-learning-studies/blob/main/notebooks/generative_models/transformers/transformer_architecture.ipynb" target="_parent" style="float: left;"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
import sys
import numpy as np
import torch
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from transformers import AutoConfig, AutoModel, AutoTokenizer
from math import sqrt
from torch import nn
from loguru import logger
from typing import Optional
import matplotlib.pyplot as plt

## Transformer Architecture

<img src="images/architecture.png" alt="Architecture" width="400" align="left"/>

The encoder in a Transformer model is responsible for processing the input sequence, such as a sentence or a document. It consists of a stack (`Nx`) of encoder layers or "blocks". Each encoder layer receives a sequence of token embeddings, which are representations of the input tokens obtained through tokenization and embedding techniques.

To capture the sequential nature of the text, the encoder combines the token embeddings with positional embeddings. Positional embeddings provide information about the relative positions of the tokens in the input sequence. This injection of positional information helps the attention mechanism in the Transformer model to understand the order of the tokens.

Each encoder layer in the stack performs the following operations on the input embeddings:

1. **Multi-head self-attention layer:** This layer allows each token to attend to other tokens in the input sequence. It computes attention weights that determine the importance of each token with respect to other tokens. The self-attention mechanism helps the model capture dependencies and relationships between different tokens in the input sequence. Because the self-attention mechanism works across the whole input sequence, the encoder is **bidirectional** by design.

2. **Fully connected feed-forward layer:** After the self-attention layer, the output embeddings from the previous step are passed through a feed-forward neural network layer. This layer applies a non-linear transformation to each input embedding independently. The feed-forward layer introduces additional modeling capacity and helps capture more complex relationships within the input sequence. This layer can be defined by:

$$\mathrm{FFN}(x) = \mathrm{ReLU}(x\mathbf{W}_1 + b_1)\mathbf{W}_2 + b_2$$

In addition to the self-attention and feed-forward layers, each encoder layer in the stack incorporates a layer normalization step. Layer normalization is applied after each sublayer, such as the self-attention layer and the feed-forward layer. The purpose of layer normalization is to normalize the values across the features dimension (often referred to as the hidden or inner dimension) for each position in the sequence.

By applying layer normalization, the model ensures that the mean of the values across the features dimension is close to zero, and the standard deviation is close to one. This normalization step helps address issues related to internal covariate shift, which is the phenomenon of the distribution of inputs to a learning system changing during training, causing difficulties in learning.
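The effect described above can be checked on a toy tensor. This is a minimal sketch, assuming PyTorch's `nn.LayerNorm` with its default settings; the batch size, sequence length, and hidden size are arbitrary values chosen for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy batch: 2 sequences, 4 positions, hidden size 8 (illustrative sizes).
# The input is deliberately shifted and scaled away from mean 0 / std 1.
x = torch.randn(2, 4, 8) * 5.0 + 3.0

layer_norm = nn.LayerNorm(8)  # normalizes over the last (feature) dimension
y = layer_norm(x)

# Statistics are computed separately for every position, across its 8 features.
per_position_mean = y.mean(dim=-1)
per_position_std = y.std(dim=-1, unbiased=False)

print(per_position_mean.abs().max())  # close to 0
print(per_position_std.mean())        # close to 1
```

Each position is normalized on its own, so the result does not depend on the other sequences in the batch.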

The output embeddings of each encoder layer have the same size as the inputs. The role of the encoder stack is to "update" the input embeddings at each layer, gradually incorporating contextual information and capturing higher-level representations of the sequence. The stacking of multiple encoder layers allows the model to learn hierarchical representations of the input, capturing both local and global dependencies.

The encoder's final output is then passed to the decoder for further processing. In a sequence-to-sequence setting, the decoder generates predictions for the next token in the sequence based on the encoder's output. This prediction is then fed back into the decoder to generate subsequent tokens, continuing until an end-of-sequence (EOS) token is reached.
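The feedback loop described above can be sketched with a stand-in for the decoder. Everything here is hypothetical and only for illustration: `toy_next_token_logits` replaces a real decoder, and the token IDs `BOS_ID` and `EOS_ID` and the vocabulary size are made up.

```python
import torch

EOS_ID = 0      # hypothetical end-of-sequence token id
BOS_ID = 1      # hypothetical start-of-sequence token id
VOCAB_SIZE = 5  # hypothetical vocabulary size


def toy_next_token_logits(encoder_output: torch.Tensor, generated: list) -> torch.Tensor:
    """Stand-in for a real decoder: favours the next id in order, then EOS."""
    logits = torch.zeros(VOCAB_SIZE)
    next_id = len(generated) + 1
    logits[next_id if next_id < VOCAB_SIZE else EOS_ID] = 1.0
    return logits


def greedy_decode(encoder_output: torch.Tensor, max_len: int = 10) -> list:
    generated = [BOS_ID]
    for _ in range(max_len):
        # The decoder sees the encoder output plus everything generated so far.
        logits = toy_next_token_logits(encoder_output, generated)
        next_id = int(torch.argmax(logits))
        generated.append(next_id)  # feed the prediction back in
        if next_id == EOS_ID:      # stop once EOS is produced
            break
    return generated


encoder_output = torch.randn(1, 9, 768)  # placeholder for real encoder states
tokens = greedy_decode(encoder_output)
print(tokens)
```

A real decoder would compute the logits from the encoder states and the generated prefix; greedy `argmax` selection is only one of several possible decoding strategies.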

---

The decoder in the Transformer model is structurally similar to the encoder, but it has some key differences. The input at each step of the decoder is its own predicted output word from the previous step, similar to an autoregressive model. The input word is embedded and combined with positional encodings, just like in the encoder.

The decoder consists of a stack of N identical blocks. Each block contains three sublayers, and within each sublayer, residual connections and normalization are employed. The three sublayers in each block of the decoder are as follows:

1. **Masked multihead self-attention layer**: This self-attention mechanism in the decoder allows each position in the sequence to attend to preceding positions in the partially generated target sequence. Unlike the encoder's self-attention, which attends to the entire input sequence, the decoder's self-attention _only attends to the preceding sequence elements_. This is achieved by applying a _mask_ to the softmax input, setting the corresponding values to -∞, which prevents illegal connections between future positions and the current position being attended to. This masking ensures that the decoder is unidirectional, attending only to the preceding positions.

2. **Encoder-decoder attention layer**: This layer allows every position in the decoder to attend over all positions in the input sequence (encoder output). The queries for this attention mechanism come from the previous decoder sublayer, while the keys and values come from the output of the encoder. This mimics the typical encoder-decoder attention mechanisms used in sequence-to-sequence models with attention.

3. **Feed-forward network**: Similar to the encoder, the decoder includes a feed-forward network. This network applies a non-linear transformation to each position's representation independently, enhancing the model's ability to capture complex relationships within the decoder.
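The masking used in the first sublayer can be illustrated on a small example. This is a minimal sketch, assuming a lower-triangular mask built with `torch.tril` and all-zero placeholder scores so that the effect of the mask is easy to read.

```python
import torch
import torch.nn.functional as F

seq_len = 4
scores = torch.zeros(seq_len, seq_len)  # placeholder attention scores

# Lower-triangular mask: position i may attend to positions 0..i only.
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

# Future positions get -inf, so softmax assigns them zero weight.
masked_scores = scores.masked_fill(causal_mask == 0, float("-inf"))
weights = F.softmax(masked_scores, dim=-1)
print(weights)
```

With equal scores, row `i` spreads its weight uniformly over the first `i + 1` positions and gives exactly zero weight to later ones.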

The decoder concludes with a fully connected layer, followed by a softmax activation function. This final layer predicts the most probable next word of the sentence based on the representations and attention mechanisms of the decoder.

To regularize the model and prevent overfitting, dropout is applied in the Transformer model. Dropout is added to the output of each sublayer before it is combined with the sublayer input and normalized. Additionally, dropout is applied to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks.

## Main Components

### 1. Attention

#### Scaled Dot-Product Attention

The scaled dot-product attention is the most common way to implement a self-attention layer. It computes the attention weights between a query vector and a set of key-value pairs by calculating the dot product similarity between them. The key idea behind the scaled dot-product attention is to scale the dot products by the square root of the dimensionality of the query and key vectors, which helps stabilize the gradients during training.
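The role of the scaling factor can be seen numerically. The sketch below assumes queries and keys whose components are independent with zero mean and unit variance, which is an idealization; under that assumption the variance of a raw dot product grows with the dimension, and dividing by the square root of the dimension brings it back to roughly one.

```python
import torch

torch.manual_seed(0)
d_k = 64              # dimensionality of the query and key vectors
num_samples = 10_000  # number of random query/key pairs

q = torch.randn(num_samples, d_k)
k = torch.randn(num_samples, d_k)

raw = (q * k).sum(dim=-1)   # one dot product per query/key pair
scaled = raw / d_k ** 0.5   # same dot products after scaling

print(raw.var().item())     # roughly d_k
print(scaled.var().item())  # roughly 1
```

Large unscaled scores push the softmax towards a near one-hot distribution, where its gradients become very small; scaling keeps the scores in a range where the softmax stays responsive.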

<img src="images/scale_dot_product.png" alt="Scale Dot Product" width="200" align="left"/>

In [3]:
def scaled_dot_product_attention(
    query: torch.Tensor, 
    key: torch.Tensor, 
    value: torch.Tensor, 
    mask: Optional[torch.Tensor] = None,
    dropout: Optional[nn.Dropout] = None
) -> torch.Tensor:
    """
    Compute scaled dot-product attention.

    Args:
        query: Tensor with shape [batch_size, num_heads, seq_length_q, depth].
        key: Tensor with shape [batch_size, num_heads, seq_length_k, depth].
        value: Tensor with shape [batch_size, num_heads, seq_length_k, depth_v].
        mask: Optional tensor with shape [batch_size, seq_length_q, seq_length_k].
            Positions where the mask is 0 are excluded from attention. The mask
            is broadcast over the head dimension. Default is None.
        dropout: Optional dropout module applied to the attention weights.
            Default is None.

    Returns:
        Tensor with shape [batch_size, num_heads, seq_length_q, depth_v].
    """
    dim_k = query.size(-1)
    logger.debug(f"query_size: {query.size()}")
    logger.debug(f"key: {key.transpose(-2, -1).size()}")
    scores = torch.matmul(query, key.transpose(-2, -1)) / sqrt(dim_k)
    if mask is not None:
        mask = mask.unsqueeze(1)
        scores = scores.masked_fill(mask == 0, float("-inf"))
        
    weights = F.softmax(scores, dim=-1)
    
    if dropout is not None:
        weights = dropout(weights)
        
    return torch.matmul(weights, value)

#### Multi-head Attention

The multi-head attention is an extension of the self-attention mechanism. It enhances the modeling capability by performing multiple attention computations in parallel, with different learned linear projections.

The reasoning for having multi-head attention is that the softmax of a single head tends to focus on mostly one aspect of similarity. Using several heads allows the model to capture different types of dependencies and relationships between the elements in the input sequence: each head can attend to different parts of the sequence, enabling the model to learn more nuanced patterns.

<img src="images/multi_head_attention.png" alt="Multi-Head Attention" width="300" align="left"/>

In [4]:
class MultiHeadAttention(nn.Module):
    """
    Multi-head attention module.

    Args:
        config: Configuration for the multi-head attention.
    """
    def __init__(self, config) -> None:
        super().__init__()
        self.embed_dim = config.hidden_size
        self.num_heads = config.num_attention_heads
        logger.debug(f"hidden_dim: {self.embed_dim}")
        logger.debug(f"num_heads: {self.num_heads}")
        
        assert self.embed_dim % self.num_heads == 0
        self.head_dim = self.embed_dim // self.num_heads
        logger.debug(f"head_dim: {self.head_dim}")
        
        self.q = nn.Linear(self.embed_dim, self.head_dim * self.num_heads)
        self.k = nn.Linear(self.embed_dim, self.head_dim * self.num_heads)
        self.v = nn.Linear(self.embed_dim, self.head_dim * self.num_heads)
        self.output_linear = nn.Linear(self.embed_dim, self.embed_dim)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(
        self, 
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor, 
        mask: Optional[torch.Tensor] = None
    ) -> torch.Tensor:
        """
        Perform a forward pass of the multi-head attention.

        Args:
            query: Query tensor of shape [batch_size, seq_len, embed_dim].
            key: Key tensor of shape [batch_size, seq_len, embed_dim].
            value: Value tensor of shape [batch_size, seq_len, embed_dim].
            mask: Optional mask tensor. Default is None.

        Returns:
            Tensor of shape [batch_size, seq_len, embed_dim], 
            representing the output of the multi-head attention.
        """
        q = self.q(query)
        k = self.k(key)
        v = self.v(value)
        logger.debug(f"q_size: {q.size()}")
        logger.debug(f"k_size: {k.size()}")
        logger.debug(f"v_size: {v.size()}")
                     
        q = q.view(q.size(0), -1, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(k.size(0), -1, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(v.size(0), -1, self.num_heads, self.head_dim).transpose(1, 2)
        
        logger.debug(f"qT_size: {q.size()}")
        logger.debug(f"kT_size: {k.size()}")
        logger.debug(f"vT_size: {v.size()}")

        attn_scores = scaled_dot_product_attention(q, k, v, mask, self.dropout)
        attn_scores = attn_scores.transpose(1, 2).contiguous()
        attn_scores = attn_scores.view(attn_scores.size(0), -1, self.embed_dim)
        logger.debug(f"attn_scores: {attn_scores.size()}")

        output = self.output_linear(attn_scores)
        logger.debug(f"output_size: {output.size()}")
        return output

In [5]:
model_ckpt = "bert-base-uncased"
config = AutoConfig.from_pretrained(model_ckpt)

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
token_emb = nn.Embedding(config.vocab_size, config.hidden_size)

text = "The quick brown fox jumps over the lazy dog"
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)
inputs_embeds = token_emb(inputs.input_ids)

query = key = value = inputs_embeds

multihead_attn = MultiHeadAttention(config)
attn_output = multihead_attn(query, key, value)

2023-07-27 15:35:44.212 | DEBUG    | __main__:__init__:12 - hidden_dim: 768
2023-07-27 15:35:44.213 | DEBUG    | __main__:__init__:13 - num_heads: 12
2023-07-27 15:35:44.213 | DEBUG    | __main__:__init__:17 - head_dim: 64
2023-07-27 15:35:44.221 | DEBUG    | __main__:forward:48 - q_size: torch.Size([1, 9, 768])
2023-07-27 15:35:44.221 | DEBUG    | __main__:forward:49 - k_size: torch.Size([1, 9, 768])
2023-07-27 15:35:44.222 | DEBUG    | __main__:forward:50 - v_size: torch.Size([1, 9, 768])
2023-07-27 15:35:44.222 | DEBUG    | __main__:forward:56 - qT_size: torch.

Here's a breakdown of what the code does:

1. It applies linear transformations to the query, key, and value tensors using the learned linear layers `self.q`, `self.k`, and `self.v`, respectively. This projects the tensors to the appropriate dimensions for the attention computation.

2. The tensors are reshaped and transposed to prepare them for the matrix multiplication step. The view and transpose operations create multi-head versions of the tensors.

3. The attention scores are computed by performing matrix multiplication between the query and key tensors. The resulting scores represent the similarity or importance of each element in the query with respect to the elements in the key.

4. The attention scores are scaled by dividing them by the square root of the head dimension (`self.head_dim`). This scaling helps stabilize the gradients during training.

5. If a mask is provided (`mask` is not `None`), the attention scores are masked by setting the scores corresponding to masked positions to negative infinity. This ensures that these masked positions do not contribute to the attention computation.

6. The attention scores are passed through a softmax activation function along the last dimension (`dim=-1`). This calculates the attention weights or probabilities for each element in the query with respect to the elements in the key.

7. The attention probabilities are used to weight the value tensor. This is done by performing matrix multiplication between the attention probabilities and the value tensor. This step computes the context or output representation based on the attention weights.

8. The resulting attention output is transposed and reshaped to match the original shape. This is achieved using the transpose and view operations.

9. Finally, the attention output is passed through the `self.output_linear` linear layer, which applies another linear transformation to the output representation.

10. The resulting output is returned, representing the output of the multi-head attention operation.
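The steps above can be traced with plain tensors, outside of any module. This is a minimal sketch: the sizes mirror the `bert-base-uncased` values seen in the logs (hidden size 768, 12 heads, 9 tokens), the input is random rather than real embeddings, and the names `q_proj`, `k_proj`, `v_proj`, `out_proj`, and `split_heads` are chosen for illustration only.

```python
import torch
import torch.nn.functional as F
from torch import nn

torch.manual_seed(0)
batch_size, seq_len, embed_dim, num_heads = 1, 9, 768, 12
head_dim = embed_dim // num_heads

x = torch.randn(batch_size, seq_len, embed_dim)  # random stand-in for embeddings
q_proj, k_proj, v_proj, out_proj = (nn.Linear(embed_dim, embed_dim) for _ in range(4))


def split_heads(t: torch.Tensor) -> torch.Tensor:
    # [batch, seq, embed] -> [batch, heads, seq, head_dim]
    return t.view(t.size(0), -1, num_heads, head_dim).transpose(1, 2)


# Steps 1-2: project, then reshape into heads.
q, k, v = split_heads(q_proj(x)), split_heads(k_proj(x)), split_heads(v_proj(x))

# Steps 3-4: similarity scores, scaled by sqrt(head_dim).
scores = q @ k.transpose(-2, -1) / head_dim ** 0.5

# Step 6: softmax over the key dimension (no mask here, so step 5 is skipped).
weights = F.softmax(scores, dim=-1)

# Step 7: weighted sum of the values.
context = weights @ v

# Step 8: merge the heads back into one hidden dimension.
context = context.transpose(1, 2).contiguous().view(batch_size, -1, embed_dim)

# Steps 9-10: final linear projection.
output = out_proj(context)
print(scores.shape, output.shape)
```

Each head works on its own 64-dimensional slice, so the 12 heads produce 12 separate 9 x 9 attention maps before being merged.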

### 2. The Feed-Forward Layer

The feed-forward layer is a type of neural network layer that processes the input data independently at each position in the input sequence, without considering the dependencies between different positions. This means that the computations for different positions can be parallelized, making the Transformer architecture highly efficient for sequence processing tasks.

The feed-forward layer in Transformers typically consists of two linear transformations with a non-linear activation function in between. The input to the feed-forward layer is a tensor representing the hidden states of the previous layer or the input embeddings. 

The feed-forward layer is a critical component of Transformers. Because it operates on each position separately, it does not mix information between tokens; instead, it transforms each token's representation after attention has combined information across the sequence. By incorporating non-linear transformations, it enables the model to learn complex representations and extract meaningful features from the input sequence.
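The position-wise behaviour can be verified directly: applying the network to the whole sequence should give the same result as applying it to each position on its own. This is a minimal sketch with toy sizes and without dropout; the `nn.Sequential` stack is an illustrative stand-in, not the notebook's module.

```python
import torch
from torch import nn

torch.manual_seed(0)
hidden_size, intermediate_size = 8, 32  # toy sizes chosen for illustration

# Two linear transformations with a ReLU in between.
ffn = nn.Sequential(
    nn.Linear(hidden_size, intermediate_size),
    nn.ReLU(),
    nn.Linear(intermediate_size, hidden_size),
)

x = torch.randn(1, 5, hidden_size)  # [batch_size, seq_len, hidden_size]

full = ffn(x)  # whole sequence at once
per_position = torch.stack(
    [ffn(x[:, i]) for i in range(x.size(1))], dim=1
)  # one position at a time

print(torch.allclose(full, per_position, atol=1e-5))
```

Since no position depends on any other, all positions can be processed in a single batched matrix multiplication.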

In [6]:
class FeedForward(nn.Module):
    """
    Feed-forward layer module.

    Args:
        config: Configuration for the feed-forward layer.
    """
    def __init__(self, config) -> None:
        super().__init__()
        self.linear_1 = nn.Linear(config.hidden_size, config.intermediate_size)
        self.linear_2 = nn.Linear(config.intermediate_size, config.hidden_size)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Perform a forward pass of the feed-forward layer.

        Args:
            x: Input tensor of shape [batch_size, seq_len, hidden_dim].

        Returns:
            Tensor of shape [batch_size, seq_len, hidden_dim], 
            representing the output of the feed-forward layer.
        """
        x = self.linear_1(x)
        x = self.relu(x)
        x = self.linear_2(x)
        x = self.dropout(x)
        logger.debug(f"ff_output_size: {x.size()}")
        return x  

In [7]:
feed_forward = FeedForward(config)
ff_outputs = feed_forward(attn_output)

2023-07-27 15:35:52.429 | DEBUG    | __main__:forward:30 - ff_output_size: torch.Size([1, 9, 768])


## The Encoder

Converts an input sequence of tokens into a sequence of embedding vectors (hidden state).
It's composed of multiple components:

### Normalization

Layer normalization normalizes each input in the batch to have zero mean and unit variance across the hidden dimension.

In [8]:
class TransformerEncoderBlock(nn.Module):
    """
    Transformer Encoder block module.

    Args:
        config: Configuration for the encoder block.
    """

    def __init__(self, config) -> None:
        super().__init__()
        self.layer_norm_1 = nn.LayerNorm(config.hidden_size)
        self.layer_norm_2 = nn.LayerNorm(config.hidden_size)
        self.attention = MultiHeadAttention(config)
        self.feed_forward = FeedForward(config)
        self.dropout = torch.nn.Dropout(config.hidden_dropout_prob)

    def forward(self, x: torch.Tensor, mask: Optional[torch.Tensor]=None) -> torch.Tensor:
        """
        Perform a forward pass of the transformer encoder block.

        Args:
            x: Input tensor of shape [batch_size, seq_len, hidden_dim].
            mask: Optional mask tensor. Default is None.

        Returns:
            Tensor of shape [batch_size, seq_len, hidden_dim], 
            representing the output of the encoder block.
        """
        logger.debug(f"encoder_block_input_size: {x.size()}")
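        hidden_state = self.layer_norm_1(x)
        attention_output = self.attention(hidden_state, hidden_state, hidden_state, mask)
        # Skip connection: add the attention output (not x itself) to the block input.
        x = x + self.dropout(attention_output)
        # Normalize, then add the feed-forward output back onto the un-normalized x.
        hidden_state = self.layer_norm_2(x)
        x = x + self.feed_forward(hidden_state)
        x = self.dropout(x)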
        logger.debug(f"encoder_block_output_size: {x.size()}")
        return x

In [9]:
encoder_layer = TransformerEncoderBlock(config)
_ = encoder_layer(inputs_embeds)

2023-07-27 15:41:56.168 | DEBUG    | __main__:__init__:12 - hidden_dim: 768
2023-07-27 15:41:56.172 | DEBUG    | __main__:__init__:13 - num_heads: 12
2023-07-27 15:41:56.173 | DEBUG    | __main__:__init__:17 - head_dim: 64
2023-07-27 15:41:56.206 | DEBUG    | __main__:forward:29 - encoder_block_input_size: torch.Size([1, 9, 768])
2023-07-27 15:41:56.208 | DEBUG    | __main__:forward:48 - q_size: torch.Size([1, 9, 768])
2023-07-27 15:41:56.208 | DEBUG    | __main__:forward:49 - k_size: torch.Size([1, 9, 768])
2023-07-27 15:41:56.209 | DEBUG    | __main__:forward:50 -

1. **Layer Normalization:** The input tensor `x` is first passed through a layer normalization operation using `self.layer_norm_1`. This operation normalizes the activations across the hidden dimension of `x` to have zero mean and unit variance. The result is stored in `hidden_state`.

2. **Attention with Skip Connection:** The attention mechanism is applied to `hidden_state` using `self.attention`. This attention operation takes `hidden_state` as the input and produces an attention-based output. The output is then element-wise added (`+`) to the original input tensor `x`. This skip connection allows the model to directly incorporate the original input along with the attention-based output.

3. **Feed-Forward Layer with Skip Connection:** The output of the previous step is passed through another layer normalization operation `self.layer_norm_2` to normalize the activations. Then, the result is passed through the feed-forward layer `self.feed_forward`. The output of the feed-forward layer is again element-wise added (`+`) to the input tensor from the previous step (`x`). This skip connection allows the model to combine the information from the original input with the transformed output from the feed-forward layer.

In summary, the skip connections enable the model to incorporate the original input tensor `x` into the output of each layer. By adding the transformed outputs to the original input, the model can retain important information from the input and facilitate the flow of gradients during training. The skip connections help in addressing the vanishing gradient problem and make it easier to train deep Transformer architectures by ensuring the model has access to the original input information at each layer.
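The claim about gradient flow can be illustrated in isolation. This is a minimal sketch: `PreNormResidual` is a hypothetical helper written for this example, and the sublayer is a linear layer deliberately zeroed out so that only the skip path carries signal.

```python
import torch
from torch import nn


class PreNormResidual(nn.Module):
    """Hypothetical helper computing x + sublayer(LayerNorm(x))."""

    def __init__(self, hidden_size: int, sublayer: nn.Module) -> None:
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.sublayer(self.norm(x))


hidden_size = 16
sublayer = nn.Linear(hidden_size, hidden_size)
# Zero the sublayer so it contributes nothing to the output or the input gradient.
nn.init.zeros_(sublayer.weight)
nn.init.zeros_(sublayer.bias)

block = PreNormResidual(hidden_size, sublayer)
x = torch.randn(2, 5, hidden_size, requires_grad=True)

out = block(x)
out.sum().backward()

print(torch.equal(out, x))                        # the input passes through unchanged
print(torch.equal(x.grad, torch.ones_like(x)))    # the gradient reaches x via the skip path
```

Even when the sublayer passes no gradient back, the identity path still delivers a full gradient to the input, which is why stacking many such blocks remains trainable.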

### Positional Embeddings

The purpose of positional embeddings is to provide the model with a representation that encodes the positions of tokens within the sequence. This allows the model to differentiate between tokens based on their position, even though the same token always maps to the same token embedding wherever it appears.

In the original Transformer model, the positional encodings are not learned: they are fixed sinusoidal functions of different frequencies that are added to the token embeddings. The authors also experimented with learned positional embeddings and reported nearly identical results. The sinusoidal encodings in the original Transformer model are defined as follows:

$$PE(pos, 2i) = \sin(\frac{pos}{10000^{2i/d_{model}}})$$


$$PE(pos, 2i+1) = \cos(\frac{pos}{10000^{2i/d_{model}}})$$

Here, $PE$ represents the matrix of positional encodings, $pos$ is the position of the token in the sequence, $i$ indexes pairs of dimensions, and $d_{model}$ is the dimensionality of both the token and positional embeddings.
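The two formulas above can be turned into a small function. This is a minimal sketch, assuming an even `d_model`; the function name `sinusoidal_positional_encoding` and the sizes in the usage line are chosen for illustration.

```python
import math

import torch


def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Return a [max_len, d_model] matrix of fixed sinusoidal encodings (d_model even)."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # [max_len, 1]
    # 1 / 10000^(2i / d_model) for i = 0 .. d_model / 2 - 1, computed in log space.
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions: PE(pos, 2i)
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions: PE(pos, 2i+1)
    return pe


pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)
```

The matrix has no trainable parameters, so it can be computed once and added to the token embeddings for any sequence up to `max_len` tokens.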

In [11]:
class Embeddings(nn.Module):
    """
    Embeddings layer module.
    Combines a token embedding layer that projects the `input_ids` to a dense hidden state 
    with the positional embedding that does the same for `position_ids`. 
    The resulting embedding is simply the sum of both embeddings.

    Args:
        config: Configuration for the embeddings layer.
    """
    def __init__(self, config):
        super().__init__()
        self.token_embeddings = nn.Embedding(config.vocab_size, config.hidden_size)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.layer_norm = nn.LayerNorm(config.hidden_size, eps=1e-12)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        """
        Perform a forward pass of the embeddings layer.

        Args:
            input_ids: Input tensor of shape [batch_size, seq_len].

        Returns:
            Tensor of shape [batch_size, seq_len, hidden_dim], 
            representing the embeddings of the input.

        Notes:
            1. Create position IDs for input sequence.
            2. Create token and position embeddings.
            3. Combine token and position embeddings.
        """
        logger.debug(f"input_size: {input_ids.size()}")
        seq_length = input_ids.size(1)
        position_ids = torch.arange(seq_length, dtype=torch.long).unsqueeze(0)
        token_embeddings = self.token_embeddings(input_ids)
        logger.debug(f"token_embd_size: {token_embeddings.size()}")
        position_embeddings = self.position_embeddings(position_ids)
        logger.debug(f"position_embd_size: {position_embeddings.size()}")
        
        embeddings = token_embeddings + position_embeddings
        embeddings = self.layer_norm(embeddings)
        embeddings = self.dropout(embeddings)
        logger.debug(f"embd_size: {embeddings.size()}")
        
        return embeddings

In [12]:
embedding_layer = Embeddings(config)
_ = embedding_layer(inputs.input_ids).size()

2023-07-27 15:42:40.460 | DEBUG    | __main__:forward:34 - input_size: torch.Size([1, 9])
2023-07-27 15:42:40.460 | DEBUG    | __main__:forward:38 - token_embd_size: torch.Size([1, 9, 768])
2023-07-27 15:42:40.460 | DEBUG    | __main__:forward:40 - position_embd_size: torch.Size([1, 9, 768])
2023-07-27 15:42:40.461 | DEBUG    | __main__:forward:45 - embd_size: torch.Size([1, 9, 768])


1. In the `__init__` method of the `Embeddings` class, two embedding layers are defined: `self.token_embeddings` and `self.position_embeddings`. Both are instances of `nn.Embedding` and share the same hidden size, but the token table has `config.vocab_size` rows while the position table has `config.max_position_embeddings` rows.

2. In the `forward` method, the input sequence `input_ids` is passed as an argument. The size of the input sequence is determined using `input_ids.size(1)` and stored in `seq_length`.

3. Position IDs are created using `torch.arange(seq_length, dtype=torch.long).unsqueeze(0)`. This creates a tensor of sequential integers from 0 to `seq_length - 1` and unsqueezes it to have a shape of `[1, seq_length]`. These position IDs represent the positions of the tokens in the input sequence.

4. The token embeddings for the input sequence are obtained by passing `input_ids` to `self.token_embeddings`. This maps each token ID to its corresponding embedding vector.

5. The position embeddings for the input sequence are obtained by passing `position_ids` to `self.position_embeddings`. This maps each position ID to its corresponding embedding vector.

6. The token embeddings and position embeddings are added element-wise (`token_embeddings + position_embeddings`) to create the combined embeddings. This operation incorporates both the token information and the positional information of each token in the input sequence.

7. The combined embeddings are then passed through `self.layer_norm`, which applies layer normalization to normalize the embeddings along the hidden dimension.

8. A dropout layer, `self.dropout`, is applied to the normalized embeddings to prevent overfitting by randomly dropping out some elements.

9. The resulting embeddings are returned as the output of the `forward` method.


Note that the previous code assumes that the positional embeddings are learned embeddings, and it does not directly define the specific sinusoidal positional embeddings mentioned earlier.

### Putting It All Together

In [13]:
class TransformerEncoder(nn.Module):
    """
    Transformer Encoder module.

    Args:
        config: Configuration for the encoder.
    """
    def __init__(self, config) -> None:
        super().__init__()
        self.embeddings = Embeddings(config)
        self.layers = nn.ModuleList([TransformerEncoderBlock(config) for _ in range(config.num_hidden_layers)])

    def forward(self, x: torch.Tensor, mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        """
        Perform a forward pass of the transformer encoder.

        Args:
            x: Input tensor of shape [batch_size, seq_len].
            mask: Optional mask tensor. Default is None.

        Returns:
            Tensor of shape [batch_size, seq_len, hidden_dim], 
            representing the output of the encoder.
        """
        x = self.embeddings(x)
        for layer in self.layers:
            x = layer(x, mask)
        
        return x

In [14]:
logger.remove()
logger.add(sys.stderr, level="INFO")
encoder = TransformerEncoder(config)
encoder_output = encoder(inputs.input_ids)
encoder_output.size()

torch.Size([1, 9, 768])

In [15]:
encoder_output

tensor([[[-0.2112, -1.0971,  0.6207,  ..., -0.0322, -0.1413, -0.0000],
         [ 0.3913, -0.4309,  0.5121,  ..., -0.7229, -1.2283, -0.2651],
         [ 0.0000, -0.4601,  0.1352,  ...,  0.1401, -0.3543, -0.1372],
         ...,
         [ 0.5117,  0.3125, -0.0207,  ...,  0.2547, -0.2552, -0.6708],
         [ 0.5857, -0.0468, -0.2946,  ...,  0.0000,  0.2407, -0.9026],
         [ 0.4352, -0.1858, -0.3051,  ...,  0.5849, -0.4693,  0.2345]]],
       grad_fn=<MulBackward0>)

### Adding a Classification Head

In [16]:
class TransformerForSequenceClassification(nn.Module):
    """
    Transformer for Sequence Classification module.

    Args:
        config: Configuration for the model.
    """
    def __init__(self, config) -> None:
        super().__init__()
        self.encoder = TransformerEncoder(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Perform a forward pass of the transformer for sequence classification model.

        Args:
            x: Input tensor of shape [batch_size, seq_len].

        Returns:
            Tensor of shape [batch_size, num_labels], 
            representing the logits for each class.
        """
        x = self.encoder(x)[:, 0, :]  # select hidden state of the first token ([CLS] when special tokens are added)
        x = self.dropout(x)
        x = self.classifier(x)
        
        return x

In [17]:
config.num_labels = 3
encoder_classifier = TransformerForSequenceClassification(config)
encoder_classifier(inputs.input_ids)  # unnormalized logits for each class in the output.

tensor([[-0.4699, -0.6738,  0.6459]], grad_fn=<AddmmBackward0>)
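The logits above are unnormalized scores; a softmax turns them into class probabilities. This is a minimal sketch that reuses the three values printed above. Since the model is untrained, the resulting prediction carries no meaning beyond illustrating the mechanics.

```python
import torch
import torch.nn.functional as F

# Logit values copied from the output printed above.
logits = torch.tensor([[-0.4699, -0.6738, 0.6459]])

probs = F.softmax(logits, dim=-1)        # probabilities over the 3 classes
predicted_class = probs.argmax(dim=-1)   # index of the most likely class

print(probs)
print(predicted_class)
```

During training, the raw logits would typically be passed directly to a loss such as `nn.CrossEntropyLoss`, which applies the log-softmax internally.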

## The Decoder

The main difference between the decoder and encoder is that the decoder has two attention sublayers.

The first attention sublayer, known as the _self-attention sublayer_, allows the decoder to attend to its own previously generated tokens, capturing dependencies and relationships within the output sequence. The second attention sublayer is the _encoder-decoder attention_, which allows the decoder to attend to the encoded representations produced by the encoder, incorporating contextual information from the input sequence. These attention sublayers play a crucial role in the decoder's ability to generate coherent and contextually appropriate output based on the input and the generated context.
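The encoder-decoder attention differs from self-attention mainly in where its inputs come from, which shows up in the tensor shapes. This is a minimal sketch with random tensors, a toy hidden size, a single head, and the learned projections omitted.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
hidden_size = 16  # toy hidden size chosen for illustration

decoder_states = torch.randn(1, 4, hidden_size)  # 4 target positions generated so far
encoder_output = torch.randn(1, 9, hidden_size)  # 9 source positions

# Queries come from the decoder; keys and values come from the encoder.
scores = decoder_states @ encoder_output.transpose(-2, -1) / hidden_size ** 0.5
weights = F.softmax(scores, dim=-1)   # [1, 4, 9]: one row per target position
context = weights @ encoder_output    # [1, 4, hidden_size]

print(weights.shape, context.shape)
```

The output has one vector per target position, while each attention row spans all source positions, so the source and target sequences may have different lengths.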

In [18]:
class TransformerDecoderBlock(nn.Module):
    """
    Transformer Decoder layer module.

    Args:
        config: Configuration for the decoder layer.
    """

    def __init__(self, config) -> None:
        super().__init__()
        self.layer_norm_1 = nn.LayerNorm(config.hidden_size)
        self.layer_norm_2 = nn.LayerNorm(config.hidden_size)
        self.layer_norm_3 = nn.LayerNorm(config.hidden_size)
        self.attn_1 = MultiHeadAttention(config) 
        self.attn_2 = MultiHeadAttention(config)
        self.feed_forward = FeedForward(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(
        self, 
        x: torch.Tensor, 
        encoder_output: torch.Tensor,
        source_mask: Optional[torch.Tensor] = None,
        target_mask: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        """
        Perform a forward pass of the transformer decoder block.

        Args:
            x: Input tensor of shape [batch_size, tgt_len, hidden_dim].
            encoder_output: Output tensor from the encoder of shape [batch_size, src_len, hidden_dim].
            source_mask: Optional mask for the encoder-decoder attention. Default is None.
            target_mask: Optional mask for the decoder self-attention. Default is None.

        Returns:
            Tensor of shape [batch_size, tgt_len, hidden_dim], 
            representing the output of the decoder block.
        """
        logger.debug(f"decoder_block_input_size: {x.size()}")
        hidden_state = self.layer_norm_1(x)

        attn_1_out = self.attn_1(hidden_state, hidden_state, hidden_state, target_mask)
        x = x + self.dropout(attn_1_out)
        x = self.layer_norm_2(x) 

        attn_2_out = self.attn_2(x, encoder_output, encoder_output, source_mask)
        x = x + self.dropout(attn_2_out)
        x = self.layer_norm_3(x) 

        feed_forward_output = self.feed_forward(x)
        x = x + self.dropout(feed_forward_output)
        logger.debug(f"decoder_block_output_size: {x.size()} ")
        
        return x

The target mask is applied in the decoder's self-attention to enforce causality during decoding. Because the decoder generates the target sequence autoregressively, each position may attend only to itself and to earlier positions, which prevents information from leaking in from future tokens.

The encoder-decoder attention, which attends over the encoder output, has no causality constraint and therefore needs no causal mask: every decoder position can attend to every position in the source sequence. A padding mask can still be passed as `source_mask` when the source batch contains padding tokens.

In [19]:
seq_len = inputs.input_ids.size(-1)
mask = torch.tril(torch.ones(seq_len, seq_len)).unsqueeze(0)
mask.size()

torch.Size([1, 9, 9])

If you recall the `scaled_dot_product_attention` function, the scores at masked positions (the entries above the diagonal, where the mask is 0) are set to negative infinity. This guarantees that the corresponding attention weights are exactly zero once we take the softmax over the scores (as $e^{-\infty} = 0$).
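
The effect can be checked on a small example. This is a minimal sketch that assumes the masking is done with `masked_fill` on positions where the mask is 0; the scores are random and the sequence length of 4 is arbitrary.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len = 4
scores = torch.randn(1, seq_len, seq_len)                     # raw attention scores
mask = torch.tril(torch.ones(seq_len, seq_len)).unsqueeze(0)  # 1 = may attend, 0 = future

# Future positions get -inf, so softmax assigns them a weight of exactly 0.
masked_scores = scores.masked_fill(mask == 0, float("-inf"))
weights = F.softmax(masked_scores, dim=-1)

print(weights[0])
```

Every row still sums to 1, and the first row puts all of its weight on the first position because that is the only one it is allowed to see.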

In [20]:
logger.remove()
logger.add(sys.stderr, level="DEBUG")
decoder_layer = TransformerDecoderBlock(config)
decoder_layer(encoder_output, encoder_output, target_mask=mask)

2023-07-27 15:44:28.537 | DEBUG    | __main__:__init__:12 - hidden_dim: 768
2023-07-27 15:44:28.539 | DEBUG    | __main__:__init__:13 - num_heads: 12
2023-07-27 15:44:28.540 | DEBUG    | __main__:__init__:17 - head_dim: 64
2023-07-27 15:44:28.551 | DEBUG    | __main__:__init__:12 - hidden_dim: 768
2023-07-27 15:44:28.551 | DEBUG    | __main__:__init__:13 - num_heads: 12
2023-07-27 15:44:28.552 | DEBUG    | __main__:__init__:17 - head_dim: 64
2023-07-27 15:44:28.577 | DEBUG    | __main__:forward:38 - decoder_block_input_size: torch.Size([1, 9, 768])
2023-0

tensor([[[-0.4448, -0.7806, -0.0674,  ...,  0.2934,  0.0953,  0.2780],
         [ 0.1839,  0.1398,  0.3562,  ..., -0.0375, -1.3261, -0.3123],
         [-0.0157, -0.4480, -0.0162,  ...,  0.0030, -0.9646,  0.1143],
         ...,
         [ 0.7106,  0.5281,  0.2768,  ...,  0.2146, -0.2368, -0.8534],
         [-0.0257, -0.2485, -0.4623,  ...,  0.2212, -0.1476, -0.7523],
         [ 0.3837, -0.2680, -0.3011,  ...,  0.5106, -0.7389,  0.2627]]],
       grad_fn=<AddBackward0>)

In [21]:
class TransformerDecoder(nn.Module):
    def __init__(self, config) -> None:
        """
        Transformer Decoder module.

        Args:
            config: Configuration object for the decoder.
        """
        super().__init__()
        self.embeddings = Embeddings(config)
        self.layers = nn.ModuleList([TransformerDecoderBlock(config) for _ in range(config.num_hidden_layers)])

    def forward(
        self, 
        input_ids: torch.Tensor, 
        encoder_output: torch.Tensor,
        source_mask: Optional[torch.Tensor] = None,
        target_mask: Optional[torch.Tensor] = None
    ) -> torch.Tensor:
        """
        Perform a forward pass of the transformer decoder.

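        Args:
            input_ids: Target token ids of shape [batch_size, tgt_len].
            encoder_output: Output tensor from the encoder of shape [batch_size, src_len, hidden_size].
            source_mask: Optional mask for the encoder-decoder attention. Default is None.
            target_mask: Optional mask for the decoder self-attention. Default is None.

        Returns:
            Tensor of shape [batch_size, tgt_len, hidden_size], 
            representing the decoder hidden states.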
        """
        x = self.embeddings(input_ids)
        for layer in self.layers:
            x = layer(x, encoder_output, source_mask=source_mask, target_mask=target_mask) 
        return x

In [22]:
logger.remove()
logger.add(sys.stderr, level="INFO")
encoder = TransformerEncoder(config)
encoder_output = encoder(inputs.input_ids)
decoder = TransformerDecoder(config)
output = decoder(inputs.input_ids, encoder_output, target_mask=mask)
output.size()

torch.Size([1, 9, 768])

In [23]:
output

tensor([[[ 0.8122, -0.4533,  2.1675,  ...,  0.3536, -0.0078,  0.3310],
         [ 0.0167,  0.6473, -0.9278,  ...,  1.1630,  2.0164,  0.4868],
         [ 1.1181,  0.5270, -0.4915,  ...,  0.5112,  2.1158, -0.9579],
         ...,
         [ 0.3959, -0.0088, -0.1022,  ...,  0.1227,  1.2568,  0.2252],
         [ 1.0487,  1.2220,  0.1334,  ...,  0.2272,  1.4226, -1.4288],
         [-0.5014,  0.0110,  0.2871,  ...,  0.6070,  0.5336, -0.5335]]],
       grad_fn=<AddBackward0>)

## Encoder-Decoder

Let's put it all together to create a Transformer model.

In [24]:
class EncoderDecoder(nn.Module):
    """
    Encoder-Decoder model that combines the TransformerEncoder and TransformerDecoder.

    Args:
        config: Configuration shared by the encoder and the decoder.
    """
    def __init__(
        self, 
        config, 
    ) -> None:
        super().__init__()
        self.encoder = TransformerEncoder(config)
        self.decoder = TransformerDecoder(config)
        self.fc = nn.Linear(config.hidden_size, config.vocab_size)

    def forward(
        self, 
        input_ids: torch.Tensor, 
        target_ids: torch.Tensor,
        source_mask: Optional[torch.Tensor] = None,
        target_mask: Optional[torch.Tensor] = None
    ) -> torch.Tensor:
        """
        Perform a forward pass of the encoder-decoder model.

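        Args:
            input_ids: Input tensor of shape [batch_size, src_len].
            target_ids: Target tensor of shape [batch_size, tgt_len].
            source_mask: Optional mask for the encoder-decoder attention. Default is None.
            target_mask: Optional mask for the decoder self-attention. Default is None.

        Returns:
            Tensor of shape [batch_size, tgt_len, vocab_size], 
            representing the unnormalized logits over the vocabulary.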
        """
        encoder_output = self.encoder(input_ids)
        decoder_output = self.decoder(
            target_ids, 
            encoder_output, 
            source_mask=source_mask, 
            target_mask=target_mask
        )
        x = self.fc(decoder_output)  # Apply linear layer to transform to vocab_size
        
        return x

#### Masking

The mask used in the Transformer model should have a specific shape and values to ensure proper masking during the attention mechanism. Here's how you can define the mask:

1. **Padding Mask:** The padding mask is used to mask out padding tokens in the input sequences. It should have a shape of `(batch_size, seq_length)` and, following the convention used in this notebook (positions where the mask is 0 are ignored), contain 0 where the padding tokens are present and 1 for the non-padding tokens. This mask ensures that the padding tokens do not contribute to the attention scores.

2. **Future Mask:** The future mask is used to prevent attending to future positions in the self-attention mechanism. It should have a shape of `(seq_length, seq_length)` and be lower triangular, with 1 for positions that can be attended and 0 for positions that should be masked or ignored.

3. **Combined Mask:** To create the final mask, you need to combine the padding mask and the future mask. This can be done with an element-wise multiplication or, for boolean masks, a logical AND, so that a position is attended only if both masks allow it. The padding mask needs an extra dimension, `(batch_size, 1, seq_length)`, so that it broadcasts against the future mask.
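
The three masks above can be sketched as follows, using the convention that a nonzero (or `True`) entry means "may attend". The token ids and the padding id `pad_id = 0` are made-up values for illustration.

```python
import torch

pad_id = 0  # assumed padding token id
token_ids = torch.tensor([[5, 7, 9, 0, 0],   # sequence padded to length 5
                          [3, 4, 6, 8, 2]])  # sequence without padding
seq_length = token_ids.size(-1)

# Padding mask with an extra dimension so it broadcasts: [batch_size, 1, seq_length]
padding_mask = (token_ids != pad_id).unsqueeze(1)
# Future mask, lower triangular: [seq_length, seq_length]
future_mask = torch.tril(torch.ones(seq_length, seq_length, dtype=torch.bool))
# Logical AND keeps a position only if both masks allow it: [batch_size, seq_length, seq_length]
combined_mask = padding_mask & future_mask

print(combined_mask.shape)
print(combined_mask[0].int())
```

For the first sequence, the last two columns are zero in every row because those key positions hold padding; for the second sequence, the combined mask is identical to the future mask.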

Let's verify the encoder-decoder with some fake data and future masking only.

In [25]:
class TransformerConfig:
    """
    Configuration class for the Transformer model.

    Args:
        hidden_size: Size of the hidden state.
        intermediate_size: Size of the intermediate layer in the feed-forward network.
        num_hidden_layers: Number of hidden layers in the Transformer.
        vocab_size: Size of the vocabulary.
        max_position_embeddings: Maximum number of positional embeddings.
        hidden_dropout_prob: Dropout probability for the hidden layers.
        num_attention_heads: Number of attention heads in the multi-head attention.
    """
    def __init__(
        self, 
        hidden_size: int, 
        intermediate_size: int, 
        num_hidden_layers: int, 
        vocab_size: int, 
        max_position_embeddings: int, 
        hidden_dropout_prob: float,
        num_attention_heads: int
    ):
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        self.num_hidden_layers = num_hidden_layers
        self.vocab_size = vocab_size
        self.max_position_embeddings = max_position_embeddings
        self.hidden_dropout_prob = hidden_dropout_prob        
        self.num_attention_heads = num_attention_heads

In [26]:
# Set up hyperparameters and configuration
config = TransformerConfig(
    hidden_size=512,
    intermediate_size=2048,
    num_hidden_layers=6,
    vocab_size=100,
    max_position_embeddings=512,
    hidden_dropout_prob=0.1,
    num_attention_heads=8
)

In [27]:
# Define some fake data
batch_size = 16
source_length = 10
target_length = 12

source_ids = torch.randint(0, config.vocab_size, (batch_size, source_length))
target_ids = torch.randint(0, config.vocab_size, (batch_size, target_length))

source_ids.size(), target_ids.size()

(torch.Size([16, 10]), torch.Size([16, 12]))

In [28]:
def create_mask(batch_size: int, seq_length: int) -> torch.Tensor:
    """
    Create a lower triangular mask with ones on and below the diagonal.

    Args:
        batch_size: The batch size.
        seq_length: The length of the sequence.

    Returns:
        The mask tensor with shape (batch_size, seq_length, seq_length).
    """
    mask = torch.tril(torch.ones(seq_length, seq_length))
    mask = mask.unsqueeze(0).expand(batch_size, seq_length, seq_length)  # Expand the mask along the batch dimension
    
    return mask

In [29]:
source_mask = create_mask(batch_size, source_length)
target_mask = create_mask(batch_size, target_length)
source_mask.size(), target_mask.size()

(torch.Size([16, 10, 10]), torch.Size([16, 12, 12]))

In [30]:
logger.remove()
logger.add(sys.stderr, level="INFO")

4

In [31]:
encoder = TransformerEncoder(config)
encoder_output = encoder(source_ids)
decoder = TransformerDecoder(config)
output = decoder(source_ids, encoder_output, source_mask=source_mask)
output.size()

torch.Size([16, 10, 512])

In [32]:
# Define the EncoderDecoder model
encoder_decoder = EncoderDecoder(config)
output = encoder_decoder(source_ids, target_ids, target_mask=target_mask)
print("Output Shape:", output.shape)  # Should be (batch_size, target_length, vocab_size)

Output Shape: torch.Size([16, 12, 100])


In [33]:
target_ids.size()

torch.Size([16, 12])

### Training

The dataset generation and training code is adapted from [Advanced Deep Learning with Python](https://github.com/PacktPublishing/Advanced-Deep-Learning-with-Python/blob/master/Chapter08/transformer.py#L326).

In [34]:
class RandomDataset(torch.utils.data.Dataset):
    """
    Provides random data copy dataset for training.

    Args:
        vocabulary_size: The vocabulary size.
        batch_size: The batch size.
        num_samples: The number of samples.
        sample_length: The length of each sample.
    """

    def __init__(self, vocabulary_size: int, batch_size: int, num_samples: int, sample_length: int):
        self.samples = list()

        for _ in range(batch_size * num_samples):
            data = torch.from_numpy(np.random.randint(1, vocabulary_size, size=(sample_length,)))
            data[0] = 1  # every sequence starts with the same token
            source = data
            target = data  # copy task: the target is identical to the source

            # Prepare the sample dictionary
            sample = {
                'source': source,
                'target': target[:-1],  # decoder input
                'target_y': target[1:],  # expected output: the decoder input shifted by one position
                'source_mask': (source != 0).unsqueeze(-2),
                'target_mask': self.make_std_mask(target[:-1], 0),  # sized to match the decoder input
                'tokens_count': (target[1:] != 0).sum()
            }

            self.samples.append(sample)

    def __len__(self) -> int:
        """
        Get the number of samples in the dataset.

        Returns:
            The number of samples.
        """
        return len(self.samples)

    def __getitem__(self, idx: int) -> dict:
        """
        Get a sample from the dataset.

        Args:
            idx: The index of the sample to retrieve.

        Returns:
            A dictionary containing the source, target, target_y, source_mask, target_mask, and tokens_count.
        """
        return self.samples[idx]

    @staticmethod
    def make_std_mask(target: torch.Tensor, pad: int) -> torch.Tensor:
        """
        Create a mask to hide padding and future words.

        Args:
            target (torch.Tensor): The target tensor.
            pad (int): The padding value.

        Returns:
            torch.Tensor: The mask tensor.
        """
        target_mask = (target != pad)  # True for non-padding tokens
        # Combine with the causal mask; broadcasting yields shape [seq_len, seq_len]
        target_mask = target_mask & RandomDataset.subsequent_mask(target.size(-1))

        return target_mask

    @staticmethod
    def subsequent_mask(size: int) -> torch.Tensor:
        """
        Mask out subsequent positions.

        Args:
            size: The size of the mask.

        Returns:
            torch.Tensor: The subsequent mask tensor.
        """
        attn_shape = (size, size)
        subsequent_mask = np.triu(np.ones(attn_shape), k=1).astype('uint8')
        return torch.from_numpy(subsequent_mask) == 0


In [35]:
batch_size = 64
num_samples = 1000
samples_len = 10
train_set = RandomDataset(config.vocab_size, batch_size, num_samples, samples_len)
train_loader = torch.utils.data.DataLoader(train_set, batch_size)

- Inside the training loop, we iterate over the dataset batches and perform the forward pass using `model` with the appropriate input sequences and masks.

- The loss is computed based on the output of the model and the target batch.

- Backpropagation is performed by calling `loss.backward()`.

- The model's parameters are updated using `optimizer.step()` and the gradients are reset using `optimizer.zero_grad()`.
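
The loss computation in particular involves a reshaping step that is easy to get wrong: `CrossEntropyLoss` expects logits of shape `[N, vocab_size]` and targets of shape `[N]`, so the batch and sequence dimensions are flattened together. The sketch below uses random logits in place of a model output. The `ignore_index` argument is an optional extra that is not used in this notebook's training loop; it is shown on the assumption that id 0 is the padding token.

```python
import torch
from torch import nn

torch.manual_seed(0)
batch_size, tgt_len, vocab_size, pad_id = 2, 4, 10, 0

# Stand-in for the model output: [batch_size, tgt_len, vocab_size]
logits = torch.randn(batch_size, tgt_len, vocab_size, requires_grad=True)
# Expected next tokens: [batch_size, tgt_len]; 0 marks padding (assumed)
target_y = torch.tensor([[3, 5, 2, 0],
                         [7, 1, 0, 0]])

loss_function = nn.CrossEntropyLoss(ignore_index=pad_id)
# Flatten to [batch_size * tgt_len, vocab_size] and [batch_size * tgt_len]
loss = loss_function(logits.view(-1, vocab_size), target_y.view(-1))
loss.backward()

print(loss.item())
print(logits.grad[0, 3].abs().sum())  # gradient at a padded position
```

Padded positions are excluded from the average and receive no gradient, so the model is not trained to predict padding.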

In [36]:
model = EncoderDecoder(config)

# Initialize every parameter with more than one dimension using Xavier uniform.
for p in model.parameters():
    if p.dim() > 1:
        torch.nn.init.xavier_uniform_(p)
        
model.train()

optimizer = torch.optim.Adam(model.parameters())
loss_function = torch.nn.CrossEntropyLoss()
    
current_loss = 0.0
counter = 0

for i, batch in enumerate(train_loader):
    # Forward pass: call the module directly instead of model.forward
    out = model(batch['source'], batch['target'], batch['source_mask'], batch['target_mask'])
    # Flatten logits to [N, vocab_size] and targets to [N] for the loss
    loss = loss_function(out.contiguous().view(-1, out.size(-1)), batch['target_y'].contiguous().view(-1))
    loss.backward()

    optimizer.step()
    optimizer.zero_grad()

    # Accumulate a Python float so the computation graph is not kept alive
    current_loss += loss.item()
    counter += 1

    if counter % 5 == 0:
        print("Batch: %d; Loss: %f" % (i + 1, current_loss / counter))
        current_loss = 0.0
        counter = 0

Batch: 5; Loss: 4.172111
Batch: 10; Loss: 3.737579
Batch: 15; Loss: 2.513162
Batch: 20; Loss: 2.348919
Batch: 25; Loss: 2.237202
Batch: 30; Loss: 2.113222
Batch: 35; Loss: 1.482312
Batch: 40; Loss: 0.713707
Batch: 45; Loss: 0.354369
Batch: 50; Loss: 0.063452
Batch: 55; Loss: 0.005703
Batch: 60; Loss: 0.002458
Batch: 65; Loss: 0.000987
Batch: 70; Loss: 0.000610
Batch: 75; Loss: 0.000323
Batch: 80; Loss: 0.000215
Batch: 85; Loss: 0.000146
Batch: 90; Loss: 0.000089
Batch: 95; Loss: 0.000067
Batch: 100; Loss: 0.000059
Batch: 105; Loss: 0.000050
Batch: 110; Loss: 0.000044
Batch: 115; Loss: 0.000041
Batch: 120; Loss: 0.000038
Batch: 125; Loss: 0.000036
Batch: 130; Loss: 0.000035
Batch: 135; Loss: 0.000036
Batch: 140; Loss: 0.000032
Batch: 145; Loss: 0.000030
Batch: 150; Loss: 0.000030
Batch: 155; Loss: 0.000029
Batch: 160; Loss: 0.000027
Batch: 165; Loss: 0.000029
Batch: 170; Loss: 0.000025
Batch: 175; Loss: 0.000024
Batch: 180; Loss: 0.000024
Batch: 185; Loss: 0.000022
Batch: 190; Loss: 0.0