# Introduction to the Transformer Architecture

Welcome to this notebook, where we will explore the inner workings of the revolutionary Transformer Architecture. 

Originally introduced in the paper "Attention is All You Need" by Vaswani et al. in 2017, transformers have become a cornerstone of modern Natural Language Processing (NLP) and have found applications in a myriad of other domains.

The transformer model, with its self-attention mechanism, is adept at handling sequences, making it well suited to tasks like translation, text generation, and more. Unlike recurrent models, transformers do not process tokens one step at a time: all positions of a sequence are processed in parallel, which makes them faster to train on modern hardware.

In this notebook, our goal is to demystify the underlying components of the transformer architecture, specifically focusing on an autoregressive setting. By building it from scratch, we aim to provide clarity on how each part contributes to the overall functioning of the model.

In this notebook you will find some of the most thoroughly documented code available on the topic. My goal is to make sure you understand everything. We are leaving no stone unturned here!

### Here's what we'll cover:

* Positional Encoding Class: Learn how transformers account for the order of data without inherently processing it sequentially.
* Scaled Dot-Product Attention Function: Dive into the core computation of self-attention and see how different parts of a sequence attend to each other.
* Attention Head Class: Understand the fundamental building block of the self-attention mechanism.
* Multi-head Attention Class: Discover how transformers harness multiple attention heads to capture various features from the data.
* Feed-forward Module Class: Delve into the feed-forward networks present within the transformer and their role.
* Transformer Block Class: See how the various components come together to form a transformer block.
* Transformer Model: Integrate the different classes to build the complete autoregressive transformer model.


By the end of this notebook, you'll have a hands-on understanding of the transformer architecture's components and the knowledge to build upon it for various applications. 

Whether you're a student, a researcher, or an enthusiast, this deep dive into transformers will equip you with the foundational knowledge to further explore the vast landscape of modern NLP.

##### Get ready for a lot of explanations!

# Autoregressive Language Transformer

The model we will build here is a decoder-only model, also called an **autoregressive** model. This type of model is good for:

* Text Generation: Given a prompt or starting sequence, the model can generate subsequent tokens/words to produce coherent and contextually relevant text.
* Filling in the Blanks: Because the model only conditions on the tokens to its left, it can predict a missing word or token when the blank falls at the end of the available text.
* Next Word Prediction: Predict the next word in a sentence.
* Other NLP Tasks: With fine-tuning, the model can be adapted for tasks like text classification, sentiment analysis, etc., although that's not its primary design.

#### Let's embark on this exciting journey together!

In [20]:
import torch 
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import TransformerEncoder, TransformerEncoderLayer

from torch.nn.utils import clip_grad_norm_
from torch.optim.lr_scheduler import LambdaLR
from torch.nn.utils.rnn import pad_sequence

#for learning rate decay:
from torch.optim.lr_scheduler import ReduceLROnPlateau

import os
import random
import numpy as np
import math

In [21]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.cuda.empty_cache()

## Positional Encoding

### Overview:
The PositionalEncoding class is designed to add positional encodings to the input embeddings of a transformer model. Transformers do not have an inherent sense of sequence order (unlike recurrent models), so positional encodings tell them where each token sits in the sequence.

### Code Breakdown:

1. class PositionalEncoding(nn.Module):  
    We're defining a new class PositionalEncoding that inherits from PyTorch's base nn.Module class, allowing it to be used as a part of neural networks.

2. def __init__(self, max_tokens=1000, embedding_dimensions=768): 
    This is the constructor of the class where we initialize the parameters. By default, it assumes a maximum sequence length (max_tokens) of 1000 and embedding dimensions of 768, typical for BERT-like models.

3. super().__init__() 
   Calls the constructor of the parent class (nn.Module). This is necessary for the proper functioning of PyTorch modules.

4. pe = torch.zeros(max_tokens, embedding_dimensions) 
   Initializes a matrix full of zeros with dimensions [max_tokens, embedding_dimensions], intended to store our positional encodings.

5. position = torch.arange(0, max_tokens, dtype=torch.float).unsqueeze(1)
   Creates a tensor of shape [max_tokens, 1] representing positions (0 to max_tokens-1).

6. div_term = 1 / (10000 ** (torch.arange(0, embedding_dimensions, 2).float() / embedding_dimensions))
   Computes the frequency term used in the positional encoding formula, one value per pair of (even, odd) embedding indices. The frequencies decrease geometrically from 1 down to roughly 1/10000, so each pair of dimensions oscillates at a different rate along the sequence.

7. pe[:, 0::2] = torch.sin(position * div_term)
   Fills the even embedding indices (0, 2, 4, ...) of the pe tensor with the sine of position * div_term.

8. pe[:, 1::2] = torch.cos(position * div_term)
   Fills the odd embedding indices (1, 3, 5, ...) of the pe tensor with the cosine of the same position * div_term products.

9.  self.pe = pe.unsqueeze(0)
    Adds an extra dimension at the start, making the shape [1, max_tokens, embedding_dimensions]. This facilitates broadcasting when adding positional encodings to embeddings.

10. self.pe = self.pe.to(device)
    Transfers the pe tensor to the specified device (device should be defined elsewhere in the code, typically as "cuda" or "cpu").

11. def forward(self, x):
    The forward method that will be called when we pass input data through the PositionalEncoding module.

12. x = x.to(device)
    Transfers the input tensor x to the same device as self.pe.
    
13. return x + self.pe
    Adds the positional encoding to the input embeddings. Due to broadcasting, this will work across batches of any size, as long as the sequence length equals max_tokens and the embedding dimensions match.

These are the standard sinusoidal positional encodings used in the original Transformer.
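
The formula described in the steps above can be sketched in plain Python for a single position. This is only an illustration of the arithmetic, not the notebook's implementation; the function name sinusoidal_encoding is hypothetical.

```python
import math

def sinusoidal_encoding(position, embedding_dimensions):
    """Return the sinusoidal encoding vector for one position (plain-Python sketch)."""
    encoding = [0.0] * embedding_dimensions
    for i in range(0, embedding_dimensions, 2):
        # Same frequency term as div_term: 1 / 10000^(i / embedding_dimensions)
        angle = position / (10000 ** (i / embedding_dimensions))
        encoding[i] = math.sin(angle)          # even index -> sine
        if i + 1 < embedding_dimensions:
            encoding[i + 1] = math.cos(angle)  # odd index -> cosine
    return encoding

pe_0 = sinusoidal_encoding(0, 4)
pe_1 = sinusoidal_encoding(1, 4)
print(pe_0)  # [0.0, 1.0, 0.0, 1.0]
print(pe_1)
```

Position 0 always encodes to alternating zeros and ones, because sin(0) = 0 and cos(0) = 1. Later positions differ mostly in the low-index (high-frequency) dimensions.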

In [22]:
class PositionalEncoding(nn.Module):
    def __init__(self, max_tokens=1000, embedding_dimensions=768):
        super().__init__()
        pe = torch.zeros(max_tokens, embedding_dimensions)
        position = torch.arange(0, max_tokens, dtype=torch.float).unsqueeze(1)
        div_term = 1 / (10000 ** (torch.arange(0, embedding_dimensions, 2).float() / embedding_dimensions))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        self.pe = pe.unsqueeze(0) # Shape [1, max_tokens, embedding_dimensions]
        self.pe = self.pe.to(device)
        
    def forward(self, x):
        # x has shape [batch_size, max_tokens, embedding_dimensions]
        # self.pe has shape [1, max_tokens, embedding_dimensions]
        # Broadcasting will take care of the batch size dimension
        
        x = x.to(device) 
        
        return x + self.pe
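
As written, the class above expects every input to contain exactly max_tokens positions and relies on the global device variable. A possible variant, sketched under the assumption that shorter sequences should also be accepted, stores the table with register_buffer and slices it to the input length. The class name SlicedPositionalEncoding is hypothetical and is not used elsewhere in this notebook.

```python
import torch
import torch.nn as nn

class SlicedPositionalEncoding(nn.Module):
    """Hypothetical variant: buffer-based, accepts sequences shorter than max_tokens."""
    def __init__(self, max_tokens=1000, embedding_dimensions=768):
        super().__init__()
        pe = torch.zeros(max_tokens, embedding_dimensions)
        position = torch.arange(0, max_tokens, dtype=torch.float).unsqueeze(1)
        div_term = 1 / (10000 ** (torch.arange(0, embedding_dimensions, 2).float() / embedding_dimensions))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        # A buffer follows the module through .to(device) but is not a trainable parameter
        self.register_buffer("pe", pe.unsqueeze(0))

    def forward(self, x):
        # Keep only the first seq_len positions so shorter inputs broadcast correctly
        return x + self.pe[:, : x.size(1)]

encoder = SlicedPositionalEncoding(max_tokens=16, embedding_dimensions=8)
short_batch = torch.zeros(2, 5, 8)   # batch of 2 sequences, only 5 tokens each
print(encoder(short_batch).shape)    # torch.Size([2, 5, 8])
```

Since the table is a buffer rather than a parameter, it does not change the parameter count reported at the end of the notebook.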


### Attention Mechanism

The attention mechanism, particularly as implemented in the Transformer architecture, has been a pivotal development in the field of deep learning, especially for tasks like machine translation and text generation.

Here's an overview of the attention mechanism in general terms:

#### Intuition
Imagine you're translating a sentence from one language to another. While translating a particular word, you might want to focus on, or "attend to", one or a few specific words in the source sentence rather than the whole sentence. The attention mechanism allows a model to focus on different parts of the input data with different intensities.

#### Purpose
The primary goal of attention is to capture contextual information. For a given word or token in a sequence, the attention mechanism computes its importance concerning other words in the sequence, thereby capturing the context in which the word appears.

#### Attention Score
The attention mechanism computes an "attention score" between input elements. This score typically represents how much focus a model should place on a particular input element when producing an output.

#### Queries, Keys, and Values
The Transformer projects each input element into three vectors: a "Query", a "Key", and a "Value". For each input element, attention scores are computed by taking the dot product of its Query with the Keys of all elements in the sequence (including its own). The scores are then normalized and used to take a weighted sum of the Values, which gives the output.

#### Scaled Dot Product



*Check out my post [A Simplified, Even 'Blasphemous' 👿 Explanation of the Transformer's Scaled Dot-Product Attention](https://www.linkedin.com/feed/update/urn:li:activity:7102755952914784257?utm_source=share&utm_medium=member_desktop) where I attempt to explain this function, which is at the core of the attention model used by the transformers.*



In the Transformer architecture, a specific kind of attention called "Scaled Dot-Product Attention" is used. Here, the attention scores are calculated as the dot product of query and key vectors, and then they're scaled down by the square root of the key dimension (for numerical stability).

The function accepts three parameters:

* query: A matrix that, in the context of attention, seeks specific information.
* key: A matrix that acts like a reference, helping match the appropriate parts of the input data to the queries.
* value: A matrix holding the actual data/content that the attention mechanism will use to produce an output.

  
In the context of the Transformer model, these matrices come from the input embeddings, and they play specific roles in determining which parts of the input data the model should focus on (i.e., attend to).

The function starts by reading the size of the last dimension of the query matrix (dim_k). Because queries and keys must have the same depth for their dot product to be defined, this is also the depth of the key vectors. In this notebook the value matrix has the same depth as well.

Next, attention scores are computed by taking the dot product between the query and a transposed version of the key. This transposition ensures that the matrix multiplication provides the desired attention scores. These scores are then scaled down by dividing them by the square root of the depth of the key vectors. This scaling step is essential for the stability of the model during training, preventing the scores from becoming too large and causing gradient issues.

Subsequently, the function computes attention weights by applying the softmax operation to the scaled scores. This operation converts the scores into a set of probabilities (weights) that sum up to 1. These weights essentially determine how much focus should be placed on different parts of the input data.

Lastly, the function produces its output. It does this by taking a weighted sum of the value matrix using the attention weights. This results in an output that's a combination of the input data (from the value matrix) with emphasis placed according to the computed attention weights.
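
The four steps above (dot product, scaling, softmax, weighted sum) can be traced by hand on a tiny example. The sketch below uses plain Python lists and made-up numbers for a single query; the helper names softmax and attend are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    dim_k = len(query)
    # Dot product of the query with every key, scaled by sqrt(dim_k)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim_k) for key in keys]
    weights = softmax(scores)
    # Weighted sum of the value vectors, one output entry per value dimension
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
    return weights, output

# Toy data (made up for illustration)
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
weights, output = attend(query, keys, values)
print(weights)
print(output)
```

The query points in the same direction as the first key, so the first weight is the larger one and the output leans toward the first value vector.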

In [23]:
def scaled_dot_product_attention(query, key, value):
    dim_k = query.size(-1)
    scores = torch.bmm(query, key.transpose(1, 2)) / np.sqrt(dim_k) 
    weights = F.softmax(scores, dim=-1)
    return torch.bmm(weights, value)
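
One thing to note for the autoregressive setting: the function above lets every position attend to every other position, including later ones. Decoder-only language models are usually trained with a causal mask, so that position i only sees positions up to i. A minimal sketch of such a variant follows; the name causal_scaled_dot_product_attention and the extra weights return value are hypothetical additions for illustration and are not part of the notebook's code.

```python
import math
import torch
import torch.nn.functional as F

def causal_scaled_dot_product_attention(query, key, value):
    """Hypothetical masked variant: position i attends only to positions <= i."""
    dim_k = query.size(-1)
    scores = torch.bmm(query, key.transpose(1, 2)) / math.sqrt(dim_k)
    seq_len = scores.size(-1)
    # Lower-triangular boolean mask: True where attention is allowed
    allowed = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=scores.device))
    # Future positions get -inf so that softmax assigns them zero weight
    scores = scores.masked_fill(~allowed, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return torch.bmm(weights, value), weights

torch.manual_seed(0)
q = torch.randn(1, 4, 8)
k = torch.randn(1, 4, 8)
v = torch.randn(1, 4, 8)
output, weights = causal_scaled_dot_product_attention(q, k, v)
print(weights[0])  # entries above the diagonal are 0
```

The first position can only attend to itself, so its output equals its own value vector.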

#### Attention Head

The AttentionHead class represents a single attention head within the Transformer architecture. In the context of Transformers, multi-head attention allows the model to focus on various parts of the input simultaneously. Each "head" captures different relationships or patterns in the data, and AttentionHead defines the functionality for one of these heads.

Upon initialization, the AttentionHead requires two parameters:

* embed_dim: This is the size or dimensionality of the embeddings or hidden states in the Transformer.
* head_dim: This represents the size or dimensionality specific to this particular attention head.
  
The class possesses three primary linear transformations represented by the nn.Linear layers:

* self.q: Transforms the input embedding or hidden state into a query matrix.
* self.k: Transforms the input embedding or hidden state into a key matrix.
* self.v: Transforms the input embedding or hidden state into a value matrix.

The forward method defines the computation performed by this attention head when processing a given hidden_state. 

Within this method, the hidden_state is transformed using the query (self.q), key (self.k), and value (self.v) linear layers. These transformed matrices are then passed to the scaled_dot_product_attention function (as seen in the earlier code), which calculates the attention outputs. 

The resulting attn_outputs are then returned by the method.

In [24]:
class AttentionHead(nn.Module):
    def __init__(self, embed_dim, head_dim):
        super().__init__()
        self.q = nn.Linear(embed_dim, head_dim)
        self.k = nn.Linear(embed_dim, head_dim)
        self.v = nn.Linear(embed_dim, head_dim)
        
    def forward(self, hidden_state):
        attn_outputs = scaled_dot_product_attention(self.q(hidden_state), 
                                                    self.k(hidden_state), 
                                                    self.v(hidden_state)) 
        return attn_outputs

#### Multi-Head Attention

*I recommend that you read this explanation following along with the code below, line by line.*

Instead of having a single set of attention weights, the Transformer can employ multiple heads, allowing it to focus on different parts of the input simultaneously. This is called "Multi-Head Attention". Each "head" captures different types of relationships or patterns in the data.

Upon initialization the class takes two parameters:

* embedding_dimensions: This represents the size or dimensionality of the input embeddings or hidden states. In our case here it defaults to 512, which is a common choice in many Transformer implementations.
* num_attention_heads: This specifies how many attention heads the model should have. In our case it defaults to 8.

The head_dim variable is then calculated by dividing the embedding dimensions by the number of attention heads. This value represents the size of each individual attention head.

Based on the number of heads, the class creates multiple instances of the previously described AttentionHead class. 

The number of instances corresponds to the number of attention heads. These instances are stored in the self.heads variable, which is an nn.ModuleList. This ensures that PyTorch registers each attention head as a sub-module, so its parameters are tracked, moved between devices, and updated during training.

Finally, an output linear layer (self.output_linear) is defined. This layer will be used to transform the concatenated output from all attention heads to the desired output size.

When the forward method is called:

* For each attention head in self.heads, the hidden_state (input data) is processed, and the individual attention outputs are collected.

* All these separate attention outputs are concatenated together using torch.cat. They're concatenated along the last dimension (dim=-1), effectively stacking the outputs side by side.

* This concatenated output is then passed through the self.output_linear layer. This step serves to transform the concatenated outputs from all heads into a unified output with the desired shape and properties.

Finally, the transformed output is returned.
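
Because head_dim comes from an integer division, the embedding size should be a multiple of the number of heads. With 512 dimensions and 6 heads, for example, head_dim would be 85 and the concatenated heads would be 510 wide, which no longer matches the 512 inputs that output_linear expects. A small check along these lines can catch that early; the helper name head_dim_for is hypothetical.

```python
def head_dim_for(embedding_dimensions, num_attention_heads):
    """Per-head size; raises if the heads would not tile the embedding exactly."""
    head_dim, remainder = divmod(embedding_dimensions, num_attention_heads)
    if remainder != 0:
        raise ValueError(
            f"{embedding_dimensions} is not divisible by {num_attention_heads}: "
            f"concatenated heads would have width {head_dim * num_attention_heads}"
        )
    return head_dim

print(head_dim_for(512, 8))    # 64
print(head_dim_for(768, 12))   # 64
```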

In [25]:
class MultiHeadAttention(nn.Module): 
    def __init__(self, embedding_dimensions=512, num_attention_heads=8):
        super().__init__()
        embed_dim = embedding_dimensions
        num_heads = num_attention_heads
        head_dim = embed_dim // num_heads
        self.heads = nn.ModuleList(
            [AttentionHead(embed_dim, head_dim) for _ in range(num_heads)]
        )
        self.output_linear = nn.Linear(embed_dim, embed_dim)
        
    def forward(self, hidden_state):
        x = torch.cat([h(hidden_state) for h in self.heads], dim=-1) 
        x = self.output_linear(x)
        return x

#### Feed-Forward module

The MLP class represents the feed-forward neural network module often found in Transformer architectures. This feed-forward module acts as a non-linear transformation that operates on the output of the multi-head attention mechanism in each layer of the Transformer. 

**Its main objective is to further process and transform the information after the attention mechanism has done its job, adding more capacity and expressiveness to the model.**

Let's review the forward pass:

* The input x, representing the data or hidden states, first undergoes the initial linear transformation through self.c_fc.
* The transformed data then passes through the GELU activation function (self.act), introducing non-linearity.
* Finally, the activated data is projected back to the original dimension using self.c_proj, resulting in the final output of the feed-forward module.
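
The three steps above can be followed on a toy vector. The sketch below uses plain Python with made-up weights (embed_dim = 2, mlp_dim = 4) and the exact GELU form x * Phi(x), which is what nn.GELU() computes with its default settings. The helper names gelu, linear, and mlp_forward are hypothetical.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(vector, weights, bias):
    """y = W x + b, with W given as a list of output rows."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weights, bias)]

def mlp_forward(vector, w_fc, b_fc, w_proj, b_proj):
    hidden = linear(vector, w_fc, b_fc)      # expand: embed_dim -> mlp_dim
    hidden = [gelu(h) for h in hidden]       # non-linearity
    return linear(hidden, w_proj, b_proj)    # project back: mlp_dim -> embed_dim

# Made-up weights for illustration
w_fc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
b_fc = [0.0, 0.0, 0.0, 0.0]
w_proj = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]
b_proj = [0.0, 0.0]

out = mlp_forward([1.0, -1.0], w_fc, b_fc, w_proj, b_proj)
print(len(out))   # 2
print(gelu(0.0))  # 0.0
```

The output has the same size as the input, which is what allows the residual connection in the Transformer block to add the two together.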


In [26]:
class MLP(nn.Module):
    def __init__(self, embed_dim, mlp_dim):
        super().__init__()
        self.c_fc = nn.Linear(embed_dim, mlp_dim)
        self.c_proj = nn.Linear(mlp_dim, embed_dim)
        self.act = nn.GELU()

    def forward(self, x):
        x = self.c_fc(x)
        x = self.act(x)
        x = self.c_proj(x)
        return x


#### The Transformer block

*I recommend that you read this explanation following along with the code below, line by line.*


Here we start to glue all the above pieces together:

The TransformerBlock class represents a single layer or block within a Transformer model. This block typically comprises two main components: 

* a multi-head attention mechanism and 
* a feed-forward neural network. 
  
Both parts are supplemented by normalization and residual connections, ensuring smooth learning and rich information flow.


This class requires three primary parameters:

* embed_dim: The size or dimensionality of the embeddings or hidden states in the Transformer.
* mlp_dim: The size of the intermediate layer in the feed-forward neural network.
* num_attention_heads: Specifies the number of attention heads for the multi-head attention mechanism.

Two layer normalization modules are instantiated:

self.ln_1 and self.ln_2: These modules perform layer normalization, ensuring that the inputs to both the attention mechanism and the feed-forward neural network have a stable and standardized distribution.

The multi-head attention module (self.attn) is set up using the MultiHeadAttention class. It will focus on different parts of the input simultaneously, capturing various relationships or patterns.

The feed-forward neural network module (self.mlp) is set up using the MLP class. This acts as a non-linear transformation operating sequentially after the attention mechanism.

**During the forward pass**:

* The input data, denoted by x, first undergoes layer normalization using self.ln_1. Post normalization, this data is processed by the multi-head attention module (self.attn).

* The attention mechanism's output (attn_output) is added back to the original input x. This is the residual connection, ensuring the direct flow of information through the network and aiding in the learning process.

* Next, the summation from the previous step undergoes another layer normalization, this time using self.ln_2. Following this normalization, the data is processed by the feed-forward neural network (self.mlp).

* Similar to the attention mechanism, the output of the feed-forward network (ff_output) is added back to the result from the previous step, once again forming a residual connection.

The final transformed and enriched data x is then returned by the method.
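
The pattern used twice in the forward pass, x + sublayer(layer_norm(x)), can be isolated in a few lines. This is a plain-Python sketch with a stand-in sublayer and without the learned scale and shift of nn.LayerNorm; the names layer_norm, pre_norm_residual, and toy_sublayer are hypothetical.

```python
import math

def layer_norm(vector, eps=1e-5):
    """Normalize to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(vector) / len(vector)
    variance = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(variance + eps) for v in vector]

def pre_norm_residual(vector, sublayer):
    """Compute x + sublayer(layer_norm(x)), the pattern applied twice per block."""
    transformed = sublayer(layer_norm(vector))
    return [v + t for v, t in zip(vector, transformed)]

def toy_sublayer(vector):
    """Stand-in for attention or the MLP: halves every entry."""
    return [0.5 * v for v in vector]

x = [1.0, 2.0, 3.0, 4.0]
x = pre_norm_residual(x, toy_sublayer)   # attention step in the real block
x = pre_norm_residual(x, toy_sublayer)   # feed-forward step in the real block
print(x)
```

If a sublayer outputs zeros, the input passes through unchanged. That direct path is what the residual connection provides.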

In [27]:
class TransformerBlock(nn.Module):
    def __init__(self, embed_dim, mlp_dim, num_attention_heads):
        super().__init__()

        self.ln_1 = nn.LayerNorm(embed_dim)
        self.ln_2 = nn.LayerNorm(embed_dim)
        self.attn = MultiHeadAttention(embedding_dimensions=embed_dim, num_attention_heads=num_attention_heads)
        self.mlp = MLP(embed_dim, mlp_dim)
        
    def forward(self, x):
        # First, the attention + residual connection
        attn_output = self.attn(self.ln_1(x))
        x = x + attn_output
        
        # Then, the feed-forward + residual connection
        ff_output = self.mlp(self.ln_2(x))
        x = x + ff_output
        
        return x


### The full model

*I recommend that you read this explanation following along with the code below, line by line.*


The following class, Transformer, defines the causal language modeling architecture based on the Transformer framework. 

It starts by embedding input tokens and adding positional information, then processes this data through multiple Transformer blocks, and finally produces a score for each vocabulary token as the candidate next token in the sequence.

Please follow along with the code as you read these explanations:

#### Upon initialization

* **Token Embeddings (self.wte)**: An embedding layer is created to convert each token from the input into a vector representation of specified embedding_dimensions.

* **Positional Encodings (self.wpe)**: The PositionalEncoding module creates positional embeddings, ensuring the model accounts for the position or order of tokens in the sequence.

* **Transformer Blocks (self.blocks)**: A list of TransformerBlock modules is initialized, representing the core layers of the Transformer. The number of blocks is determined by num_blocks.

* **Final Layer Normalization (self.ln_f)**: A layer normalization module is added to stabilize and normalize the outputs after processing through all Transformer blocks.

* **Output Head (self.lm_head)**: A linear layer that maps the Transformer's output back to the size of the vocabulary. This facilitates token prediction, essential for language modeling tasks.

#### During the forward pass

* **Token Embedding**: The input tokens are first converted into their respective vector representations using the token embeddings.

* **Positional Encoding Addition**: To these embeddings, positional encodings are added, ensuring the model recognizes the position of each token within the sequence.

* **Processing through Transformer Blocks**: The enriched token representations sequentially pass through each Transformer block, undergoing attention mechanisms and feed-forward transformations.

* **Final Layer Normalization**: After all blocks have processed the data, it undergoes a final layer normalization.

* **Token Prediction**: The processed data is passed through the lm_head, producing the logits or raw predictions for each token in the vocabulary. 

##### These logits can be further processed to predict the next token in the sequence.
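
One simple way to process the logits is greedy selection: take the logits at the last position, turn them into probabilities with softmax, and pick the most likely token. The sketch below uses plain Python and made-up logits over a four-token vocabulary; the function name greedy_next_token is hypothetical.

```python
import math

def greedy_next_token(last_position_logits):
    """Return (token_id, probabilities) for the most likely next token."""
    peak = max(last_position_logits)
    exps = [math.exp(logit - peak) for logit in last_position_logits]
    total = sum(exps)
    probabilities = [e / total for e in exps]
    # Index of the highest probability, which is also the index of the highest logit
    token_id = max(range(len(probabilities)), key=probabilities.__getitem__)
    return token_id, probabilities

toy_logits = [2.0, 0.5, -1.0, 3.0]   # made-up scores for a 4-token vocabulary
token_id, probabilities = greedy_next_token(toy_logits)
print(token_id)  # 3
```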

In [28]:
class Transformer(nn.Module):
    def __init__(self, vocab_size, embedding_dimensions, max_tokens, num_blocks, mlp_dim, num_attention_heads):
        super().__init__()

        # Token embeddings
        self.wte = nn.Embedding(vocab_size, embedding_dimensions)
        
        # Positional encodings
        self.wpe = PositionalEncoding(max_tokens, embedding_dimensions)

        # Transformer blocks
        self.blocks = nn.ModuleList(
            [TransformerBlock(embedding_dimensions, mlp_dim, num_attention_heads) for _ in range(num_blocks)]
        )

        # Final layer normalization
        self.ln_f = nn.LayerNorm(embedding_dimensions)
        
        # Output head for causal language modeling (prediction of the next token)
        self.lm_head = nn.Linear(embedding_dimensions, vocab_size, bias=False)

    def forward(self, input_tokens):
        # input_tokens is of shape [batch_size, max_tokens]

        # Get embeddings
        x = self.wte(input_tokens)
        
        # Add positional encodings
        x = self.wpe(x)

        # Go through each block
        for block in self.blocks:
            x = block(x)

        # Final layer normalization
        x = self.ln_f(x)

        # Project to vocabulary logits (unnormalized scores, not probabilities)
        logits = self.lm_head(x)
        
        return logits


### Instantiation

Now that we have everything together, we can instantiate our model. Just run the following cell.

At this point you have built a complete autoregressive model with a whopping 105 million parameters! You will find the function that counts the parameters of this (or any) transformer in the last code cell of this notebook.

In [29]:
model = Transformer(vocab_size=50257, embedding_dimensions=768, max_tokens=1000, num_blocks=4, mlp_dim=3072, num_attention_heads=12)
model.to(device) 

Transformer(
  (wte): Embedding(50257, 768)
  (wpe): PositionalEncoding()
  (blocks): ModuleList(
    (0-3): 4 x TransformerBlock(
      (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (attn): MultiHeadAttention(
        (heads): ModuleList(
          (0-11): 12 x AttentionHead(
            (q): Linear(in_features=768, out_features=64, bias=True)
            (k): Linear(in_features=768, out_features=64, bias=True)
            (v): Linear(in_features=768, out_features=64, bias=True)
          )
        )
        (output_linear): Linear(in_features=768, out_features=768, bias=True)
      )
      (mlp): MLP(
        (c_fc): Linear(in_features=768, out_features=3072, bias=True)
        (c_proj): Linear(in_features=3072, out_features=768, bias=True)
        (act): GELU(approximate='none')
      )
    )
  )
  (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  (lm_head): Linear(in_features=768

### Transformer Parameters Counter

In [30]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"The model has {count_parameters(model):,} parameters.")

The model has 105,547,776 parameters.
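
The reported total can be reproduced by hand from the layer shapes printed above. The following sketch is plain arithmetic; the helper name linear_params is hypothetical. The positional encodings contribute nothing because they are a fixed tensor, not trainable parameters.

```python
def linear_params(in_features, out_features, bias=True):
    """Number of weights (plus optional biases) in a linear layer."""
    return in_features * out_features + (out_features if bias else 0)

vocab_size, embed_dim, mlp_dim = 50257, 768, 3072
num_heads, num_blocks = 12, 4
head_dim = embed_dim // num_heads

token_embedding = vocab_size * embed_dim
layer_norm = 2 * embed_dim  # one weight and one bias per dimension
attention = (
    num_heads * 3 * linear_params(embed_dim, head_dim)  # q, k, v for each head
    + linear_params(embed_dim, embed_dim)               # output_linear
)
mlp = linear_params(embed_dim, mlp_dim) + linear_params(mlp_dim, embed_dim)
block = 2 * layer_norm + attention + mlp
lm_head = linear_params(embed_dim, vocab_size, bias=False)

total = token_embedding + num_blocks * block + layer_norm + lm_head
print(f"{total:,}")  # 105,547,776
```

Roughly three quarters of the parameters sit in the token embedding and the output head, since the vocabulary is large and the model has only four blocks.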


## Conclusion

Throughout this Jupyter notebook, we've embarked on a comprehensive journey, discovering the intricate design of the autoregressive transformer, one of the pillars of modern natural language processing. By building it from scratch, we've gained an invaluable understanding, not just of its high-level operations, but of the nuanced mechanics that power each module.

From token and positional embeddings to the multi-head attention mechanisms, and from the feed-forward networks to the very structure of the transformer blocks, we've delved deep into the heart of what makes transformers revolutionary. Our step-by-step approach, interspersed with code snippets, has offered a pragmatic perspective, enabling us to appreciate both the theoretical foundation and practical implementations.

But our exploration doesn't end here. This notebook is not just an endpoint but a springboard. Equipped with this foundational knowledge, there are numerous avenues to explore:

* Optimization and Fine-tuning: How can we optimize our transformer for specific tasks or datasets?
* Model Variants: Delving into variants like BERT, GPT, and T5 to understand the evolution of transformer architectures.
* Applications: Beyond language modeling, transformers have reshaped fields from image processing to bioinformatics. Exploring these realms can offer fascinating insights.

Building the autoregressive transformer from scratch has been similar to assembling a complex object. Each piece, a module or a function, has its unique role, and watching them come together to form a coherent, functioning model has been enlightening.

As you step beyond this guide, remember that understanding the basic concepts is the path to innovation. 

In my next episodes I will add more complexity to this notebook. Stay tuned for these additional steps:

1. The dataset preprocessing
2. The Training loop, a simple version
3. The Training loop, a more robust version. Here we build on the previous understanding of the basic training loop and add lots of additional sophistication like learning rate schedule, checkpoints, validation loop, and much more.
4. The inference loop, a simple version
5. The inference loop, a more robust version. Here we build on the previous understanding of the generation and add 'temperature', 'top_k', 'top_p' to the inference.

Happy coding and exploring!

Developed by [Juan Olano](https://www.linkedin.com/in/juan-olano-b9a330112/) Sept.2023