### **Tutorial 16: Transformer (Custom)**

In this tutorial, we will learn how to build and implement a custom Transformer model from scratch for a machine learning task. The Transformer model has revolutionized the field of natural language processing (NLP) and time series prediction by replacing traditional recurrent neural networks (RNNs) with a more efficient attention mechanism. This tutorial will guide you through the process of creating a Transformer model for sequence-to-sequence tasks, such as translation, text summarization, or time series forecasting.

### **What You Will Learn**
- Introduction to the Transformer architecture
- Implementing the multi-head self-attention mechanism
- Building the Encoder and Decoder components of the Transformer
- Customizing the Transformer for specific tasks
- Training the model on sample data
- Evaluating and fine-tuning the model

---

### **1. Understanding the Transformer Architecture**

The Transformer architecture is composed of two main parts:
- **Encoder**: This part processes the input sequence and generates feature representations that the decoder can use.
- **Decoder**: This part generates the output sequence from the feature representations provided by the encoder.

The key component that distinguishes Transformers from other architectures is the **self-attention mechanism**, which allows the model to focus on different parts of the input sequence as it generates an output.
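
For reference, the scaled dot-product attention computed by each head is usually written as follows, where $Q$, $K$, and $V$ are the query, key, and value matrices and $d_k$ is the key dimension (this corresponds to `head_size` in the code later in this tutorial):

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$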

Here’s a simplified overview of the Transformer architecture:

- **Multi-Head Attention**: Several attention heads run in parallel, each attending to different parts of the sequence; their outputs are concatenated and projected back to the embedding dimension.
- **Positional Encoding**: Unlike RNNs, Transformers do not process tokens one at a time in order, so positional encodings are added to the input embeddings to tell the model where each element sits in the sequence.
- **Feed-Forward Networks**: After each attention layer, the output passes through a small fully connected network applied to every position independently.
- **Residual Connections and Layer Normalization**: Used to ease gradient flow through deep stacks and to improve training stability.

---

### **2. Implementing the Transformer Components**

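Let’s now dive into the implementation of a custom Transformer model. We will implement the essential components one by one: a causal attention head, the multi-head attention layer, the feed-forward network, and the Transformer block. We then assemble them into a small GPT-style (decoder-only) language model.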

In [1]:
from torch.utils.data import Dataset
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
import numpy as np

In [2]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


#### **Multi-Head Attention Layer**
The multi-head attention mechanism allows the model to simultaneously focus on different parts of the sequence. Here's the code to implement it:

In [3]:
class AttentionHead(nn.Module):
  """A single causal self-attention head."""
  def __init__(self, config):
    super().__init__()
    # Project each token embedding to query, key, and value vectors of size head_size
    self.Q_weights = nn.Linear(config["embedding_dim"], config["head_size"], bias=config["use_bias"])
    self.K_weights = nn.Linear(config["embedding_dim"], config["head_size"], bias=config["use_bias"])
    self.V_weights = nn.Linear(config["embedding_dim"], config["head_size"], bias=config["use_bias"])

    self.dropout = nn.Dropout(config["dropout_rate"])

    # Lower-triangular mask: token i may only attend to tokens 0..i
    causal_attention_mask = torch.tril(torch.ones(config["context_size"], config["context_size"]))
    self.register_buffer('causal_attention_mask', causal_attention_mask)


  def forward(self, input):
    batch_size, tokens_num, embedding_dim = input.shape
    Q = self.Q_weights(input)  # (batch, tokens, head_size)
    K = self.K_weights(input)  # (batch, tokens, head_size)
    V = self.V_weights(input)  # (batch, tokens, head_size)

    # Scaled dot-product scores, shape (batch, tokens, tokens)
    attention_scores = Q @ K.transpose(1, 2) / (K.shape[-1] ** 0.5)
    # Hide future positions before the softmax
    attention_scores = attention_scores.masked_fill(
        self.causal_attention_mask[:tokens_num, :tokens_num] == 0,
        -torch.inf
    )
    attention_scores = torch.softmax(attention_scores, dim=-1)
    attention_scores = self.dropout(attention_scores)

    return attention_scores @ V  # (batch, tokens, head_size)
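
To see what the causal mask does in isolation, the following minimal sketch applies the same `tril` / `masked_fill` / `softmax` steps to a toy score matrix. The all-zero scores and the sequence length of 4 are illustrative assumptions, not values from the tutorial.

```python
import torch

# Toy attention scores for one sequence of 4 tokens (all zeros for simplicity).
tokens_num = 4
scores = torch.zeros(1, tokens_num, tokens_num)

# Lower-triangular mask: position i may attend to positions 0..i only.
causal_mask = torch.tril(torch.ones(tokens_num, tokens_num))
masked = scores.masked_fill(causal_mask == 0, float("-inf"))

# Softmax maps -inf to exactly 0, so future positions receive no weight.
weights = torch.softmax(masked, dim=-1)
print(weights[0])
```

Each row of `weights[0]` spreads its probability evenly over the current and earlier positions (1, then 1/2 each, then 1/3 each, and so on), and every entry above the diagonal is zero.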

In [4]:
class MultiHeadAttention(nn.Module):
  def __init__(self, config):
    super().__init__()

    heads_list = [AttentionHead(config) for _ in range(config["heads_num"])]
    self.heads = nn.ModuleList(heads_list)

    self.linear = nn.Linear(config["embedding_dim"], config["embedding_dim"])
    self.dropout = nn.Dropout(config["dropout_rate"])

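  def forward(self, input):
    # Run every head on the same input; each returns (batch, tokens, head_size)
    heads_outputs = [head(input) for head in self.heads]

    # Concatenate along the feature dimension: (batch, tokens, heads_num * head_size)
    scores_change = torch.cat(heads_outputs, dim=-1)

    # Mix information across heads with a final linear projection
    scores_change = self.linear(scores_change)
    return self.dropout(scores_change)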
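
The concatenation step only works with the output projection if `heads_num * head_size` equals `embedding_dim`. The shape check below is a small sketch using random stand-in tensors in place of real head outputs; the sizes 768 and 12 match the configuration used later in this tutorial, while the batch and sequence sizes are arbitrary.

```python
import torch

batch_size, tokens_num = 2, 5
embedding_dim, heads_num = 768, 12
head_size = embedding_dim // heads_num  # 64

# Stand-ins for the per-head outputs; each has shape (batch, tokens, head_size).
heads_outputs = [torch.randn(batch_size, tokens_num, head_size) for _ in range(heads_num)]

# Concatenating along the last dimension restores the embedding width.
concatenated = torch.cat(heads_outputs, dim=-1)
print(concatenated.shape)  # torch.Size([2, 5, 768])
```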

In [5]:
class TransformerEncoder(nn.Module):
    """Position-wise feed-forward network applied after attention (despite the class name)."""
    def __init__(self, config):
        super().__init__()

        # Expand to 4x the embedding width, apply GELU, then project back
        self.linear_layers = nn.Sequential(
            nn.Linear(config["embedding_dim"], config["embedding_dim"] * 4),
            nn.GELU(),
            nn.Linear(config["embedding_dim"] * 4, config["embedding_dim"]),
            nn.Dropout(config["dropout_rate"])
        )


    def forward(self, input):
        return self.linear_layers(input)

In [6]:
class TransformerBlock(nn.Module):

  def __init__(self, config):
    super().__init__()

    self.multi_head = MultiHeadAttention(config)
    self.layer_norm_1 = nn.LayerNorm(config["embedding_dim"])

    self.feed_forward = TransformerEncoder(config)
    self.layer_norm_2 = nn.LayerNorm(config["embedding_dim"])

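  def forward(self, input):
    # Attention sub-layer: normalize first (pre-norm), then add the residual
    residual = input
    x = self.multi_head(self.layer_norm_1(input))
    x = x + residual

    # Feed-forward sub-layer, using the same pre-norm + residual pattern
    residual = x
    x = self.feed_forward(self.layer_norm_2(x))
    return x + residual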
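
Both sub-layers in the block follow the same pattern, `x + sublayer(LayerNorm(x))`. The sketch below isolates that pattern in a hypothetical helper class (`PreNormResidual` is not part of the tutorial's code) with a plain linear layer standing in for attention or the feed-forward network.

```python
import torch
import torch.nn as nn

class PreNormResidual(nn.Module):
    """Hypothetical helper: computes x + sublayer(LayerNorm(x))."""
    def __init__(self, dim, sublayer):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.sublayer = sublayer

    def forward(self, x):
        return x + self.sublayer(self.norm(x))

dim = 16
block = PreNormResidual(dim, nn.Linear(dim, dim))
x = torch.randn(2, 5, dim)
y = block(x)
print(y.shape)  # torch.Size([2, 5, 16]), same shape as the input

# With the sub-layer zeroed out, the block reduces to the identity:
# the residual path always carries the input (and its gradient) through.
nn.init.zeros_(block.sublayer.weight)
nn.init.zeros_(block.sublayer.bias)
print(torch.equal(block(x), x))  # True
```

Because the output shape always matches the input shape, blocks of this form can be stacked to any depth, which is what `DemoGPT` does with `layers_num` copies.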
  


In [7]:
class DemoGPT(nn.Module):
  def __init__(self, config):
    super().__init__()

    self.token_embedding_layer = nn.Embedding(config["vocabulary_size"], config["embedding_dim"])
    self.positional_embedding_layer = nn.Embedding(config["context_size"], config["embedding_dim"])

    blocks = [TransformerBlock(config) for _ in range(config["layers_num"])]
    self.layers = nn.Sequential(*blocks)

    self.layer_norm = nn.LayerNorm(config["embedding_dim"])
    self.unembedding = nn.Linear(config["embedding_dim"], config["vocabulary_size"], bias=False)

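  def forward(self, token_ids):
    batch_size, tokens_num = token_ids.shape

    x = self.token_embedding_layer(token_ids)
    # Position indices 0..tokens_num-1, created on the same device as the input
    sequence = torch.arange(tokens_num, device=token_ids.device)
    x = x + self.positional_embedding_layer(sequence)

    x = self.layers(x)
    x = self.layer_norm(x)
    x = self.unembedding(x)  # logits over the vocabulary: (batch, tokens, vocabulary_size)

    return x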

In [8]:
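def generate(model, prompt_ids, max_tokens):
    # The size of the positional embedding table is the model's context window
    context_size = model.positional_embedding_layer.num_embeddings
    output_ids = prompt_ids
    for _ in range(max_tokens):
        # Stop once the sequence fills the context window
        if output_ids.shape[1] >= context_size:
            break
        with torch.no_grad():
            logits = model(output_ids)

        # Keep only the prediction for the last position and sample the next token
        logits = logits[:, -1, :]
        probs = F.softmax(logits, dim=-1)
        next_token_id = torch.multinomial(probs, num_samples=1)
        output_ids = torch.cat([output_ids, next_token_id], dim=-1)

    return output_ids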
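
The core of the generation loop is a single sampling step: take the logits at the last position, turn them into probabilities, and draw one token id. The sketch below runs that step on made-up logits (the shapes and the prompt ids are illustrative assumptions). It also shows greedy `argmax` decoding as a deterministic alternative; the tutorial itself uses `torch.multinomial`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Pretend logits for a batch of 1, a sequence of 3 tokens, and a vocabulary of 5.
logits = torch.randn(1, 3, 5)

# Only the last position predicts the next token.
last_logits = logits[:, -1, :]          # shape (1, 5)
probs = F.softmax(last_logits, dim=-1)  # each row sums to 1

sampled_id = torch.multinomial(probs, num_samples=1)   # stochastic, shape (1, 1)
greedy_id = torch.argmax(probs, dim=-1, keepdim=True)  # deterministic, shape (1, 1)

# Append the sampled token to the running sequence.
prompt_ids = torch.tensor([[2, 0, 4]])
extended = torch.cat([prompt_ids, sampled_id], dim=-1)
print(extended.shape)  # torch.Size([1, 4])
```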

In [9]:
def generate_with_prompt(model, tokenizer, prompt, max_tokens=100):
  
  model.eval()
  prompt = tokenizer.encode(prompt).unsqueeze(dim=0).to(device)

  return tokenizer.decode(generate(model, prompt, max_tokens=max_tokens)[0])
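
`generate_with_prompt` expects a tokenizer whose `encode` returns a 1-D tensor of token ids and whose `decode` maps ids back to text. The helper module that provides the tokenizer in this tutorial is not shown, so the class below is only a hypothetical stand-in that splits on whitespace. Its name and behavior are assumptions, and it may not reproduce the vocabulary or ids printed in the outputs here.

```python
import torch

class SimpleWordTokenizer:
    """Hypothetical whitespace tokenizer; NOT the tutorial's WordTokenizer."""

    def __init__(self, vocabulary):
        self.id_to_token = list(vocabulary)
        self.token_to_id = {token: i for i, token in enumerate(self.id_to_token)}

    @classmethod
    def train_from_text(cls, text):
        # Vocabulary = sorted set of whitespace-separated tokens.
        return cls(sorted(set(text.split())))

    def vocabulary_size(self):
        return len(self.id_to_token)

    def encode(self, text):
        # Raises KeyError for words that were not seen during training.
        return torch.tensor([self.token_to_id[t] for t in text.split()], dtype=torch.long)

    def decode(self, token_ids):
        return " ".join(self.id_to_token[int(i)] for i in token_ids)


toy_tokenizer = SimpleWordTokenizer.train_from_text("the cat sat on the mat")
ids = toy_tokenizer.encode("the mat")
print(ids, toy_tokenizer.decode(ids))  # tensor([4, 1]) the mat
```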

The next cell builds the tokenizer and the model, then generates text before any training, so the output is essentially random.

In [10]:
from utils.tokenizer import CharTokenizer, WordTokenizer

text_paragraph = """
You are a manufacturer of hip implants. The doctor who will use your implants in surgeries has a requirement: he is willing to accept implants that are 1 mm bigger or smaller than the specified target size. This means the implant sizes must fall within a 2 mm range of the target size, i.e., ±1 mm from the target.
Additionally, your financial officer has stated that in order to maintain profitability, you can afford to discard **1 out of every 1000 implants**. This means that the size distribution of your implants must be such that only 0.1% of implants fall outside the acceptable ±1 mm range.
Given a recent sample of 1000 implants from the factory, the task is to evaluate whether the factory is meeting the specified quality and profitability requirements. If more than one percent of the implants fall outside the ±1 mm range, the factory will incur a loss due to excess waste.
You are a manufacturer of surgical gloves. The doctor who will use your gloves in surgeries has a requirement: the gloves must fit tightly but comfortably. 
The doctor is willing to accept a slight variance of up to **2 cm** in the glove's length from the specified target size. This means the gloves must fall within a **4 cm range** of the target size, 
i.e., ±2 cm from the target. Additionally, your financial officer has stated that to maintain profitability, you can afford to discard **2 out of every 1000 gloves**. 
This means that the size distribution of your gloves must be such that only 0.2% of gloves fall outside the acceptable ±2 cm range. Given a recent sample of 1000 gloves from the factory, 
the task is to evaluate whether the factory is meeting the specified quality and profitability requirements. If more than **two percent** of 
the gloves fall outside the ±2 cm range, the factory will incur a loss due to excess waste.
"""

tokenizer = WordTokenizer.train_from_text(text_paragraph)
print(f"Vocabulary size: {tokenizer.vocabulary_size()}")
print(f"Encoded vector of 'Given a recent sample of ' is: {tokenizer.encode('Given a recent sample of ')}")

config = {
  "vocabulary_size": tokenizer.vocabulary_size(),
  "context_size": 26, # Maximum sequence length (context window), in tokens
  "embedding_dim": 768,
  "heads_num": 12,
  "layers_num": 10,
  "dropout_rate": 0.1,
  "use_bias": False,
}

# Each head gets an equal slice of the embedding: 768 // 12 = 64
config["head_size"] = config["embedding_dim"] // config["heads_num"]

model = DemoGPT(config).to(device)
generate_with_prompt(model, tokenizer, "Given a recent sample of")

Vocabulary size: 118
Encoded vector of 'Given a recent sample of ' is: tensor([10, 15, 82, 85, 65])


'Given a recent sample of target willing order You are within the 1 loss profitability This use meeting up target **two percent must use use more'
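
It can help to know how large this model is. The sketch below counts parameters by hand from the layer definitions above, using the configuration values and the vocabulary size of 118 printed in this cell. The function name `count_parameters` is an illustrative assumption.

```python
def count_parameters(vocabulary_size, context_size, embedding_dim,
                     heads_num, layers_num, use_bias):
    head_size = embedding_dim // heads_num
    bias = 1 if use_bias else 0

    # Attention: Q, K, V projections per head, plus the output projection (with bias).
    per_head = 3 * (embedding_dim * head_size + bias * head_size)
    attention = heads_num * per_head + (embedding_dim * embedding_dim + embedding_dim)

    # Feed-forward: embedding_dim -> 4 * embedding_dim -> embedding_dim, both with bias.
    hidden = 4 * embedding_dim
    feed_forward = (embedding_dim * hidden + hidden) + (hidden * embedding_dim + embedding_dim)

    # Two LayerNorms per block, each with a weight and a bias vector.
    layer_norms = 2 * (2 * embedding_dim)
    block = attention + feed_forward + layer_norms

    embeddings = (vocabulary_size + context_size) * embedding_dim
    final_norm = 2 * embedding_dim
    unembedding = embedding_dim * vocabulary_size  # created with bias=False

    return embeddings + layers_num * block + final_norm + unembedding


total = count_parameters(118, 26, 768, 12, 10, use_bias=False)
print(f"{total:,}")  # 71,058,432
```

On the real model, `sum(p.numel() for p in model.parameters())` should give the same figure. At roughly 71 million parameters, the model has far more capacity than the short training paragraph requires, so it can be expected to memorize the text rather than generalize from it.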

In [11]:
# Train the model, then generate from the same prompt again
from utils.sequentialdataset import TokenIdsDataset
from torch.utils.data import Dataset, DataLoader, RandomSampler


batch_size = 16
train_iterations = 60
evaluation_interval = 10
learning_rate=4e-4

train_data = tokenizer.encode(text_paragraph).to(device)
train_dataset = TokenIdsDataset(train_data, config["context_size"])
print(f"Training Dataset length: {len(train_dataset)}")


train_sampler = RandomSampler(train_dataset, num_samples=batch_size * train_iterations, replacement=True)
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, sampler=train_sampler)
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)


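# Note: this paragraph is also the second half of text_paragraph, so the "test" loss
# below measures how well the model has memorized the text, not how well it generalizes.
test_paragraph = """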
You are a manufacturer of surgical gloves. The doctor who will use your gloves in surgeries has a requirement: the gloves must fit tightly but comfortably. 
The doctor is willing to accept a slight variance of up to **2 cm** in the glove's length from the specified target size. This means the gloves must fall within a **4 cm range** of the target size, 
i.e., ±2 cm from the target. Additionally, your financial officer has stated that to maintain profitability, you can afford to discard **2 out of every 1000 gloves**. 
This means that the size distribution of your gloves must be such that only 0.2% of gloves fall outside the acceptable ±2 cm range. Given a recent sample of 1000 gloves from the factory, 
the task is to evaluate whether the factory is meeting the specified quality and profitability requirements. If more than **two percent** of 
the gloves fall outside the ±2 cm range, the factory will incur a loss due to excess waste.
"""
test_data = tokenizer.encode(test_paragraph).to(device)
test_dataset = TokenIdsDataset(test_data, config["context_size"])
print(f"Testing Dataset length: {len(test_dataset)}")

test_sampler = RandomSampler(test_dataset, num_samples=batch_size * train_iterations, replacement=True)
# Pass the sampler via `sampler=`; `shuffle` expects a bool
test_dataloader = DataLoader(test_dataset, batch_size=batch_size, sampler=test_sampler, drop_last=True)


for step_num, sample in enumerate(train_dataloader):
    model.train()
    inputs, targets = sample
    logits = model(inputs)

    # Flatten (batch, tokens, vocab) -> (batch * tokens, vocab) for cross-entropy
    logits_view = logits.view(-1, config["vocabulary_size"])
    targets_view = targets.view(-1)

    loss = F.cross_entropy(logits_view, targets_view)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)  # free gradient tensors to reduce memory usage

    # Report the training loss and evaluate on the test dataset at regular intervals
    if step_num % evaluation_interval == 0:
        print(f"Step {step_num}. Loss {loss.item():.3f}")

        model.eval()  # Switch model to evaluation mode (disables dropout)
        total_test_loss = 0
        with torch.no_grad():  # Disable gradient computation for testing
            for test_input, test_targets in test_dataloader:
                test_logits = model(test_input)

                test_logits_view = test_logits.view(-1, config["vocabulary_size"])
                test_targets_view = test_targets.view(-1)

                test_loss = F.cross_entropy(test_logits_view, test_targets_view)
                total_test_loss += test_loss.item()

        average_test_loss = total_test_loss / len(test_dataloader)
        print(f"Evaluation Step {step_num}. Test Loss: {average_test_loss:.3f}")


model.eval()
generate_with_prompt(model, tokenizer, "Given a recent sample of")

Training Dataset length: 295
Testing Dataset length: 139
Step 0. Loss 4.941
Evaluation Step 0. Test Loss: 5.514
Step 10. Loss 0.934
Evaluation Step 10. Test Loss: 0.918
Step 20. Loss 0.582
Evaluation Step 20. Test Loss: 0.518
Step 30. Loss 0.394
Evaluation Step 30. Test Loss: 0.316
Step 40. Loss 0.262
Evaluation Step 40. Test Loss: 0.188
Step 50. Loss 0.112
Evaluation Step 50. Test Loss: 0.133


'Given a recent sample of 1000 gloves from the factory, the task is to evaluate whether the factory is meeting the specified quality and profitability requirements.'
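
The training loop relies on `TokenIdsDataset`, which comes from a helper module not shown here. A common way to build such a dataset for next-token prediction is a sliding window in which the targets are the inputs shifted by one position. The class below is a hypothetical stand-in under that assumption, not the tutorial's actual implementation.

```python
import torch
from torch.utils.data import Dataset

class SlidingWindowDataset(Dataset):
    """Hypothetical stand-in; NOT the tutorial's TokenIdsDataset."""

    def __init__(self, token_ids, context_size):
        self.token_ids = token_ids
        self.context_size = context_size

    def __len__(self):
        # Each window needs context_size inputs plus one extra token for the last target.
        return len(self.token_ids) - self.context_size

    def __getitem__(self, index):
        inputs = self.token_ids[index : index + self.context_size]
        targets = self.token_ids[index + 1 : index + self.context_size + 1]
        return inputs, targets


data = torch.arange(10)
dataset = SlidingWindowDataset(data, context_size=4)
x, y = dataset[0]
print(len(dataset), x.tolist(), y.tolist())  # 6 [0, 1, 2, 3] [1, 2, 3, 4]
```

With this layout, every position in a window contributes a prediction to the loss, which is why the training loop flattens logits and targets across both the batch and the token dimension before calling `F.cross_entropy`.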

In [12]:
# For comparison: a pre-trained GPT-2 via the Hugging Face transformers library

from transformers import pipeline

# Load a pre-trained model for text generation
generator = pipeline("text-generation", model = 'gpt2')
result = generator("Given a recent sample of", max_length=200, truncation=True)
print(result)

Device set to use cuda:0
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Given a recent sample of all the people interviewed for this study reported a strong preference of women on all political topics.\n\nA third of the study\'s authors had been appointed by members of Parliament.\n\nAs of this writing, only four men and two women were on the advisory board: Mr Turnbull, Mr Davis, Lisa Raitt, and Ms Abbott.\n\nMr Turnbull\'s chief of staff, David Lakin, resigned at the end of February after a series of embarrassing controversies.\n\nThis past week it emerged that the ALP had tried to get the public to agree on issues such as the minimum wage and the abolition of the free childcare program.\n\nOpposition spokesman Paul Nuttall said yesterday he was "shocked and saddened" to hear about the report, calling it "ridiculous".\n\n"While people believe everything this group says they care about, they are not really going to change their minds," he said.\n\n"We are not changing our minds'}]
