# Building a Transformer Model for Machine Translation: A Comprehensive Guide

The Transformer architecture is one of the most influential contributions to the field of Natural Language Processing (NLP). In this notebook, we demonstrate the implementation of a Transformer-based translation model using PyTorch. The model is designed to translate text from one language to another. We'll cover every aspect in detail, from data preparation to model architecture, training, evaluation, and testing.

## 1. Imports and Setup

In [1]:
from datasets import load_dataset
import torch
import time
import os
import torch.nn as nn
import math
from pathlib import Path
from tqdm import tqdm
import random
import warnings
import nltk
from nltk.tokenize import word_tokenize
from torch.utils.data import Dataset, DataLoader

# Suppress warnings for cleaner output
warnings.filterwarnings("ignore")

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /home/pdeb/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /home/pdeb/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

## 2. Hyperparameters and Configurations

In [2]:
# Set up the device for computation
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = torch.device("mps") if torch.backends.mps.is_available() else device
print(f"Device: {device}")

torch.manual_seed(42)
random.seed(42)

Device: cuda


In [3]:
def get_params():
    """
    Define and return a dictionary of hyperparameters and configuration settings.

    Returns:
        dict: A dictionary containing model and training parameters.
    """
    param = {
        'lr': 1e-4,  # Learning rate
        'bs': 16,    # Batch size
        'n_epochs': 100,  # Number of training epochs
        'src_lang': 'en',  # Source language
        'tgt_lang': 'fr',  # Target language
        'model_path': './model/',  # Path to store model
        'dim_embed': 256,  # Embedding dimension
        'n_layers': 6,  # Number of encoder/decoder layers
        'n_heads': 8,  # Number of attention heads
        'dropout': 0.1,  # Dropout rate
        'ffn_hidden_dim': 256  # Hidden dimension of feed-forward network
    }
    return param

parameters = get_params()
print(f"Parameters: {parameters}")

Parameters: {'lr': 0.0001, 'bs': 16, 'n_epochs': 100, 'src_lang': 'en', 'tgt_lang': 'fr', 'model_path': './model/', 'dim_embed': 256, 'n_layers': 6, 'n_heads': 8, 'dropout': 0.1, 'ffn_hidden_dim': 256}


## 3. Dataset Preparation

Now, let's prepare our dataset. We're using the `opus_books` dataset, which contains English-French translations. We select a subset (the first 20,000 examples) to keep training manageable; you can lower this number if your machine is limited.

We transform the dataset into a list of tuples of (source, target) pairs, where source is the English sentence and target is the French translation.

In [4]:
nltk.download('punkt')
nltk.download('punkt_tab')
# ds = load_dataset('opus_books', 'en-fr', split='train')
ds = load_dataset('opus_books', 'en-fr', split='train').select(range(20000))


[nltk_data] Downloading package punkt to /home/pdeb/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /home/pdeb/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [5]:
# Prepare the dataset as a list of (source, target) pairs
dataset = []
for example in ds:
    src = example['translation']['en']
    tgt = example['translation']['fr']
    dataset.append((src, tgt))

print("Sample from the dataset:")
print(dataset[:2])

Sample from the dataset:
[('The Wanderer', 'Le grand Meaulnes'), ('Alain-Fournier', 'Alain-Fournier')]


## 4. Vocabulary Creation

The next step is to create vocabularies for both source and target languages:

In [6]:
def create_vocabulary(dataset):
    """
    Create vocabularies for source and target languages.

    Args:
        dataset (list): List of (source, target) sentence pairs.

    Returns:
        tuple: Contains vocab_src, vocab_src_size, max_src_seq_len, vocab_tgt,
               vocab_tgt_size, max_tgt_seq_len, src_token_to_id, tgt_token_to_id,
               src_id_to_token, tgt_id_to_token
    """
    vocab_src = set()
    vocab_tgt = set()

    max_src_seq_len = 0
    max_tgt_seq_len = 0

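    for src, tgt in dataset:
        # len() on a string counts characters, not tokens, so these maxima are
        # loose upper bounds on the number of word tokens per sentence
        max_src_seq_len = max(max_src_seq_len, len(src))
        max_tgt_seq_len = max(max_tgt_seq_len, len(tgt))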

        src_tokens = word_tokenize(src)
        tgt_tokens = word_tokenize(tgt)
        vocab_src.update(src_tokens)
        vocab_tgt.update(tgt_tokens)

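    # Add special tokens; sort first so token IDs are the same on every run
    # (iteration order of a set of strings can change between Python sessions)
    special_tokens = ['<sos>', '<eos>', '<pad>', '<unk>']
    vocab_src = sorted(vocab_src) + special_tokens
    vocab_tgt = sorted(vocab_tgt) + special_tokens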

    vocab_src_size = len(vocab_src)
    vocab_tgt_size = len(vocab_tgt)

    print(f"Source vocabulary size: {vocab_src_size}")
    print(f"Target vocabulary size: {vocab_tgt_size}")

    # Create token to id and id to token mappings
    src_token_to_id = {token: id for id, token in enumerate(vocab_src)}
    tgt_token_to_id = {token: id for id, token in enumerate(vocab_tgt)}
    src_id_to_token = {id: token for token, id in src_token_to_id.items()}
    tgt_id_to_token = {id: token for token, id in tgt_token_to_id.items()}

    return (vocab_src, vocab_src_size, max_src_seq_len, vocab_tgt, vocab_tgt_size, max_tgt_seq_len, src_token_to_id, tgt_token_to_id,
            src_id_to_token, tgt_id_to_token)

vocab_src, vocab_src_size, max_src_seq_len, vocab_tgt, vocab_tgt_size, max_tgt_seq_len, src_token_to_id, tgt_token_to_id, src_id_to_token, tgt_id_to_token = create_vocabulary(dataset)

Source vocabulary size: 21051
Target vocabulary size: 28348


In [7]:
# # Printing src_token_to_id
# print("\nsrc_token_to_id:")
# print(src_token_to_id)

# # Printing tgt_token_to_id
# print("\ntgt_token_to_id:")
# print(tgt_token_to_id)

# # Printing src_id_to_token
# print("\nsrc_id_to_token:")
# print(src_id_to_token)

# # Printing tgt_id_to_token
# print("\ntgt_id_to_token:")
# print(tgt_id_to_token)

The function above accomplishes a few important things:

1. Tokenizes each sentence in the dataset.
2. Creates vocabularies for both source (`en`) and target (`fr`) languages by collecting all unique words.
3. Adds special tokens like `<sos>` (start of sentence), `<eos>` (end of sentence), `<pad>` (padding), and `<unk>` (unknown words).
4. Creates mappings between tokens and their numerical IDs, which we'll use to convert sentences into sequences of numbers that our model can process.
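
To make the mapping concrete, here is a minimal sketch on two made-up sentences. It uses `str.split` in place of `word_tokenize` so it runs without NLTK data, and it sorts the vocabulary so the IDs are reproducible. The sentences and the `encode` helper are hypothetical and only serve as illustration.

```python
# Toy corpus (hypothetical sentences, not taken from opus_books)
toy_sentences = ["the cat sat", "the dog sat down"]

# Collect unique tokens; str.split stands in for word_tokenize here
toy_vocab = set()
for sentence in toy_sentences:
    toy_vocab.update(sentence.split())

# Sort for a deterministic order, then append the special tokens
special_tokens = ['<sos>', '<eos>', '<pad>', '<unk>']
toy_vocab = sorted(toy_vocab) + special_tokens

toy_token_to_id = {token: idx for idx, token in enumerate(toy_vocab)}
toy_id_to_token = {idx: token for token, idx in toy_token_to_id.items()}


def encode(sentence):
    """Map a sentence to token IDs, falling back to <unk> for unseen words."""
    unk = toy_token_to_id['<unk>']
    tokens = ['<sos>'] + sentence.split() + ['<eos>']
    return [toy_token_to_id.get(tok, unk) for tok in tokens]


ids = encode("the bird sat")
decoded = [toy_id_to_token[i] for i in ids]
print(ids)      # [5, 4, 8, 3, 6]
print(decoded)  # ['<sos>', 'the', '<unk>', 'sat', '<eos>']
```

The word "bird" never appears in the toy corpus, so it maps to the `<unk>` ID, which is the same fallback the real dataset class relies on.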

## 5. Custom Dataset and DataLoader

Now, we create a custom Dataset class and a collate function for our DataLoader:


In [8]:
class TranslationDataset(Dataset):
    """
    Custom Dataset for our translation task.

    Attributes:
        dataset (list): List of (source, target) sentence pairs.
        src_token_to_id (dict): Mapping from source tokens to IDs.
        tgt_token_to_id (dict): Mapping from target tokens to IDs.
        src_id_to_token (dict): Mapping from source IDs to tokens.
        tgt_id_to_token (dict): Mapping from target IDs to tokens.
    """

    def __init__(self, dataset, src_token_to_id, tgt_token_to_id, src_id_to_token, tgt_id_to_token):
        self.dataset = dataset
        self.src_token_to_id = src_token_to_id
        self.tgt_token_to_id = tgt_token_to_id
        self.src_id_to_token = src_id_to_token
        self.tgt_id_to_token = tgt_id_to_token

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        """
        Fetch and process a single item from the dataset.

        Args:
            idx (int): Index of the item to fetch.

        Returns:
            tuple: Contains source and target tensor, each representing a sentence as a sequence of token IDs.
        """
        src, tgt = self.dataset[idx]
        # Tokenize and add start/end tokens
        src_tokens = ['<sos>'] + word_tokenize(src) + ['<eos>']
        tgt_tokens = ['<sos>'] + word_tokenize(tgt) + ['<eos>']
        # Convert tokens to IDs
        src_ids = [self.src_token_to_id.get(token, self.src_token_to_id['<unk>']) for token in src_tokens]
        tgt_ids = [self.tgt_token_to_id.get(token, self.tgt_token_to_id['<unk>']) for token in tgt_tokens]
        return torch.tensor(src_ids).long(), torch.tensor(tgt_ids).long()

def collate_fn(batch):
    """
    Collate function to create batches of data.

    This function pads sequences in a batch to the same length.

    Args:
        batch (list): List of (source, target) tensor pairs.

    Returns:
        tuple: Contains padded source and target tensors.
    """
    src_batch, tgt_batch = zip(*batch)
    src_max_len = max(len(src) for src in src_batch)
    tgt_max_len = max(len(tgt) for tgt in tgt_batch)

    # Pad sequences
    src_batch = [torch.cat([src, torch.full((src_max_len - len(src),), src_token_to_id['<pad>'])]) for src in src_batch]
    tgt_batch = [torch.cat([tgt, torch.full((tgt_max_len - len(tgt),), tgt_token_to_id['<pad>'])]) for tgt in tgt_batch]

    # Stack tensors
    src_batch = torch.stack(src_batch)
    tgt_batch = torch.stack(tgt_batch)
    return src_batch, tgt_batch

# Create dataset and dataloader
dataset = TranslationDataset(dataset, src_token_to_id, tgt_token_to_id, src_id_to_token, tgt_id_to_token)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True, collate_fn=collate_fn)

# Print a sample batch
for batch_idx, (src, tgt) in enumerate(dataloader):
    print(f"Batch {batch_idx + 1}:")
    print("Source shape:", src.shape)
    print("Target shape:", tgt.shape)
    break

Batch 1:
Source shape: torch.Size([2, 12])
Target shape: torch.Size([2, 10])


The `TranslationDataset` class is a custom implementation of PyTorch's Dataset. It handles the conversion of our text data into tensor form. The `__getitem__` method is particularly important as it:

1. Retrieves a source-target sentence pair
2. Tokenizes both sentences
3. Adds start and end tokens
4. Converts tokens to their corresponding IDs

The `collate_fn` function is used by the DataLoader to assemble individual examples into batches. It pads all sequences in a batch to the length of the longest one, which is required to stack them into a single tensor.

This setup allows us to easily iterate over our data during training, with each batch containing padded sequences of token IDs.
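
The manual padding above can also be written with PyTorch's `pad_sequence` utility. The following is a minimal sketch of that alternative, not the notebook's own code; `PAD_ID = 0` is a hypothetical padding ID, whereas the notebook looks the ID up in the vocabulary.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_ID = 0  # hypothetical padding ID, chosen only for this example

# Three variable-length sequences of token IDs
sequences = [
    torch.tensor([5, 8, 2]),
    torch.tensor([5, 9, 7, 4, 2]),
    torch.tensor([5, 2]),
]

# batch_first=True yields shape (batch_size, max_len) instead of (max_len, batch_size)
padded = pad_sequence(sequences, batch_first=True, padding_value=PAD_ID)

print(padded.shape)  # torch.Size([3, 5])
print(padded)
```

Each row is right-padded to the longest sequence in the batch, which matches the behavior of the `torch.cat` and `torch.full` combination used in `collate_fn`.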

In the next sections, we'll dive into the model architecture, starting with the individual components of the Transformer model.

## 6. Model Architecture

Now, let's dive into the heart of our Transformer model. We'll break down each component, explaining its purpose and implementation in detail.

### 6.1 Input Embedding

In [9]:
class InputEmbedding(nn.Module):
    """
    Embeds input tokens into continuous vector representations.

    Attributes:
        dim_embed (int): The dimension of the embedding space.
        vocab_size (int): The size of the vocabulary.
        embedding (nn.Embedding): The embedding layer.
    """

    def __init__(self, vocab_size: int, dim_embed: int):
        """
        Initialize the InputEmbedding.

        Args:
            vocab_size (int): The size of the vocabulary.
            dim_embed (int): The dimension of the embedding space.
        """
        super().__init__()
        self.dim_embed = dim_embed
        self.vocab_size = vocab_size
        self.embedding = nn.Embedding(self.vocab_size, self.dim_embed)

    def forward(self, input: torch.Tensor) -> torch.Tensor:
        """
        Embed the input tokens.

        Args:
            input (torch.Tensor): Input tensor of token IDs.

        Returns:
            torch.Tensor: Embedded representation of the input.
        """
        input = input.long()
        # Scale embeddings by sqrt(dim_embed) as per the original paper
        output = self.embedding(input) * math.sqrt(self.dim_embed)
        return output

The `InputEmbedding` class is responsible for converting our input tokens (represented as integer IDs) into continuous vector representations. Here's what's happening:

1. We initialize an `nn.Embedding` layer with our vocabulary size and embedding dimension.
2. In the forward pass, we ensure our input is of type `long` (required for `nn.Embedding`).
3. We multiply the embeddings by the square root of the embedding dimension, following the original Transformer paper. The paper does not spell out the reason; a common explanation is that it keeps the embedding values from being small relative to the positional encodings that are added next.
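
The shapes involved may be easier to see on a toy example. This is a minimal sketch with made-up sizes (`vocab_size=10`, `dim_embed=4`), using `nn.Embedding` directly rather than the class above.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, dim_embed = 10, 4  # toy sizes, chosen only for illustration
embedding = nn.Embedding(vocab_size, dim_embed)

# A batch of 2 sequences with 3 token IDs each
token_ids = torch.tensor([[1, 2, 3], [4, 5, 6]])

raw = embedding(token_ids)             # (batch_size, seq_len, dim_embed)
scaled = raw * math.sqrt(dim_embed)    # sqrt(4) = 2, so every value doubles

print(scaled.shape)  # torch.Size([2, 3, 4])
```

Each integer ID selects one row of the embedding matrix, so the output gains a trailing dimension of size `dim_embed`.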

### 6.2 Positional Embedding

In [10]:
class PositionalEmbedding(nn.Module):
    """
    Adds positional information to input embeddings.

    Attributes:
        dim_embed (int): The dimension of the embedding space.
        max_seq_len (int): The maximum sequence length.
        dropout (nn.Dropout): Dropout layer for regularization.
        pos_emb (torch.Tensor): Pre-computed positional embeddings.
    """

    def __init__(self, dim_embed: int, max_seq_len: int = 1000, dropout: float = 0.2):
        """
        Initialize the PositionalEmbedding.

        Args:
            dim_embed (int): The dimension of the embedding space.
            max_seq_len (int, optional): The maximum sequence length. Defaults to 1000.
            dropout (float, optional): Dropout rate. Defaults to 0.2.
        """
        super().__init__()
        self.dim_embed = dim_embed
        self.max_seq_len = max_seq_len
        self.dropout = nn.Dropout(dropout)

        # Pre-compute positional embeddings
        pos_emb = torch.zeros(max_seq_len, dim_embed)
        position = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, dim_embed, 2).float() * (-math.log(10000.0) / dim_embed))

        pos_emb[:, 0::2] = torch.sin(position * div_term)
        pos_emb[:, 1::2] = torch.cos(position * div_term)
        pos_emb = pos_emb.unsqueeze(0)

        self.register_buffer('pos_emb', pos_emb)

    def forward(self, input: torch.Tensor) -> torch.Tensor:
        """
        Add positional embeddings to the input.

        Args:
            input (torch.Tensor): Input tensor of shape (batch_size, seq_len, dim_embed).

        Returns:
            torch.Tensor: Input with added positional embeddings.
        """
        output = input + self.pos_emb[:, :input.size(1)].requires_grad_(False)
        return self.dropout(output)

The `PositionalEmbedding` class adds positional information to our input embeddings. This is crucial because the Transformer model, unlike RNNs, doesn't inherently understand the order of its inputs. Here's what's happening:

1. We pre-compute positional embeddings for positions up to `max_seq_len`.
2. The positional embeddings use sine and cosine functions of different frequencies, as described in the original Transformer paper. This allows the model to easily learn to attend to relative positions.
3. In the forward pass, we add these positional embeddings to our input. We only use as many positional embeddings as we have input tokens.
4. We apply dropout for regularization.
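
As a sanity check on the sinusoidal table, the sketch below rebuilds it as a standalone function and inspects position 0, where every sine entry should be 0 and every cosine entry should be 1. The function name `sinusoidal_table` and the toy sizes are assumptions for illustration.

```python
import math
import torch


def sinusoidal_table(max_seq_len: int, dim_embed: int) -> torch.Tensor:
    """Return a (max_seq_len, dim_embed) table of sinusoidal position encodings."""
    position = torch.arange(max_seq_len, dtype=torch.float).unsqueeze(1)
    # One frequency per pair of dimensions: 1 / 10000^(2i / dim_embed)
    div_term = torch.exp(
        torch.arange(0, dim_embed, 2).float() * (-math.log(10000.0) / dim_embed)
    )
    table = torch.zeros(max_seq_len, dim_embed)
    table[:, 0::2] = torch.sin(position * div_term)  # even indices
    table[:, 1::2] = torch.cos(position * div_term)  # odd indices
    return table


table = sinusoidal_table(max_seq_len=50, dim_embed=8)
print(table.shape)  # torch.Size([50, 8])
print(table[0])     # alternating 0 (sin) and 1 (cos) at position 0
```

All values stay within [-1, 1], and the lower dimensions oscillate faster than the higher ones, so each position receives a distinct pattern.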

### 6.3 Layer Normalization

In [11]:
class LayerNormalization(nn.Module):
    """
    Implements Layer Normalization.

    Attributes:
        eps (float): A small value added for numerical stability.
        alpha (nn.Parameter): Learnable scale parameter.
        bias (nn.Parameter): Learnable bias parameter.
    """

    def __init__(self, eps: float = 1e-6):
        """
        Initialize LayerNormalization.

        Args:
            eps (float, optional): Small value added for numerical stability. Defaults to 1e-6.
        """
        super().__init__()
        self.eps = eps
        self.alpha = nn.Parameter(torch.ones(1))  # scale
        self.bias = nn.Parameter(torch.zeros(1))  # shift

    def forward(self, input: torch.Tensor) -> torch.Tensor:
        """
        Apply layer normalization to the input.

        Args:
            input (torch.Tensor): Input tensor.

        Returns:
            torch.Tensor: Normalized input.
        """
        mean = input.mean(-1, keepdim=True)
        std = input.std(-1, keepdim=True)
        output = self.alpha * (input - mean) / (std + self.eps) + self.bias
        return output

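Layer Normalization normalizes the inputs across the features. In this implementation it is applied to the input of each sub-layer (see the residual connection in section 6.5) and once more at the end of the encoder and decoder stacks; the original paper instead applies it after each sub-layer. Here's what it does: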

1. It calculates the mean and standard deviation of the input tensor along the last dimension (which represents the features).
2. It normalizes the input using these statistics.
3. It then applies a learnable scale (`alpha`) and shift (`bias`) to allow the network to undo the normalization if needed.

Layer Normalization helps in stabilizing the learning process and reduces the training time by reducing the dependence of gradients on the scale of parameters or their initial values.
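
The effect of the normalization step can be checked numerically. The sketch below applies the same formula as the class above, without the learnable `alpha` and `bias`, to a deliberately shifted and scaled random tensor; the tensor sizes are arbitrary.

```python
import torch

torch.manual_seed(0)
# Features with mean around 7 and standard deviation around 3
x = torch.randn(2, 5, 16) * 3.0 + 7.0

eps = 1e-6
mean = x.mean(-1, keepdim=True)
std = x.std(-1, keepdim=True)  # unbiased estimate, as in the class above
normalized = (x - mean) / (std + eps)

# After normalization every position has roughly zero mean and unit std
print(normalized.mean(-1).abs().max())
print(normalized.std(-1))
```

Note that PyTorch's built-in `nn.LayerNorm` differs in a few details: it uses the biased variance estimate, adds `eps` inside the square root, and by default learns one scale and one shift per feature rather than a single scalar each.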

### 6.4 Feedforward Block

In [12]:
class FeedforwardBlock(nn.Module):
    """
    Implements the feedforward network used in each Transformer layer.

    Attributes:
        linear1 (nn.Linear): First linear transformation.
        dropout (nn.Dropout): Dropout layer for regularization.
        linear2 (nn.Linear): Second linear transformation.
    """

    def __init__(self, dim_embed: int, hidden_dim: int, dropout: float = 0.2):
        """
        Initialize the FeedforwardBlock.

        Args:
            dim_embed (int): The input and output dimension.
            hidden_dim (int): The dimension of the hidden layer.
            dropout (float, optional): Dropout rate. Defaults to 0.2.
        """
        super().__init__()
        self.linear1 = nn.Linear(dim_embed, hidden_dim)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(hidden_dim, dim_embed)

    def forward(self, input: torch.Tensor) -> torch.Tensor:
        """
        Apply the feedforward network to the input.

        Args:
            input (torch.Tensor): Input tensor.

        Returns:
            torch.Tensor: Output after applying the feedforward network.
        """
        output = self.linear2(self.dropout(torch.relu(self.linear1(input))))
        return output

The Feedforward Block is a simple neural network applied to each position separately and identically. It consists of:

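1. A linear transformation from `dim_embed` to `hidden_dim`. The original paper uses a hidden dimension four times the model dimension (2048 vs. 512); in this notebook both are 256, so no expansion takes place.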
2. A ReLU activation function.
3. Dropout for regularization.
4. Another linear transformation that projects back to the original input dimension.

This block allows the model to process the information from the attention mechanism and introduce non-linearity into the model.
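
"Applied to each position separately and identically" can be verified directly: running the network on the whole sequence gives the same result at a given position as running it on that position alone. This is a minimal sketch with toy sizes; dropout is left out so that the comparison is deterministic.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dim_embed, hidden_dim = 8, 32  # toy sizes, chosen only for illustration

# Same structure as the block above, minus dropout
ffn = nn.Sequential(
    nn.Linear(dim_embed, hidden_dim),
    nn.ReLU(),
    nn.Linear(hidden_dim, dim_embed),
)

x = torch.randn(2, 5, dim_embed)  # (batch_size, seq_len, dim_embed)

full = ffn(x)          # process all 5 positions at once
single = ffn(x[:, 3])  # process only position 3

# Position 3 of the full output matches the standalone result
print(torch.allclose(full[:, 3], single, atol=1e-5))
```

Because no information moves between positions here, mixing across the sequence is left entirely to the attention layers.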

### 6.5 Residual Connection

In [13]:
class ResidualConnection(nn.Module):
    """
    Implements a residual connection with layer normalization and dropout.

    Attributes:
        dropout (nn.Dropout): Dropout layer for regularization.
        layer_norm (LayerNormalization): Layer normalization.
    """

    def __init__(self, dropout: float = 0.2):
        """
        Initialize the ResidualConnection.

        Args:
            dropout (float, optional): Dropout rate. Defaults to 0.2.
        """
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = LayerNormalization()

    def forward(self, input: torch.Tensor, sublayer: callable) -> torch.Tensor:
        """
        Apply the residual connection.

        Args:
            input (torch.Tensor): Input tensor.
            sublayer (callable): The sublayer to apply (e.g., attention or feedforward).

        Returns:
            torch.Tensor: Output after applying the residual connection.
        """
        return input + self.dropout(sublayer(self.layer_norm(input)))

The Residual Connection is a crucial component in the Transformer architecture. It does the following:

1. Applies layer normalization to the input.
2. Passes the normalized input through a sublayer (which could be attention or feedforward).
3. Applies dropout to the output of the sublayer.
4. Adds the result back to the original input (the residual connection).

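Residual connections help in training very deep networks by allowing gradients to flow directly through the network. Note that the ordering here differs from the "Add & Norm" step in the original Transformer paper, which normalizes after the addition, `LayerNorm(x + Sublayer(x))`. The class above instead normalizes the input before the sublayer, `x + Dropout(Sublayer(LayerNorm(x)))`, an arrangement commonly called pre-norm.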
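
The two possible orderings of normalization and residual addition can be compared side by side. This is a minimal sketch that uses `nn.LayerNorm` and a single `nn.Linear` as stand-ins for the notebook's `LayerNormalization` and sublayers; dropout is omitted for brevity.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dim_embed = 8
norm = nn.LayerNorm(dim_embed)
sublayer = nn.Linear(dim_embed, dim_embed)  # stand-in for attention or feedforward

x = torch.randn(2, 5, dim_embed)

# Pre-norm: normalize, apply the sublayer, then add the untouched input
pre_norm = x + sublayer(norm(x))

# Post-norm: add first, then normalize the sum
post_norm = norm(x + sublayer(x))

print(pre_norm.shape, post_norm.shape)
print(torch.allclose(pre_norm, post_norm))  # False: the two orderings differ
```

In the pre-norm form the input `x` reaches the output without passing through any normalization, which is one reason a final layer normalization is applied at the end of the encoder and decoder stacks.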

In the next part, we'll continue with the Multi-Head Attention mechanism, which is at the core of the Transformer's power.

### 6.6 Multi-Head Attention Block

The Multi-Head Attention mechanism is the core innovation of the Transformer model. Let's break it down:

In [14]:
class MultiHeadAttentionBlock(nn.Module):
    """
    Implements the Multi-Head Attention mechanism.

    Attributes:
        dim_k (int): Dimension of keys and queries.
        dim_embed (int): Embedding dimension.
        n_head (int): Number of attention heads.
        dropout (nn.Dropout): Dropout layer for regularization.
        w_k (nn.Linear): Linear transformation for keys.
        w_q (nn.Linear): Linear transformation for queries.
        w_v (nn.Linear): Linear transformation for values.
        w_o (nn.Linear): Linear transformation for output.
    """

    def __init__(self, dim_embed: int, n_head: int, dropout: float = 0.2):
        """
        Initialize the MultiHeadAttentionBlock.

        Args:
            dim_embed (int): Embedding dimension.
            n_head (int): Number of attention heads.
            dropout (float, optional): Dropout rate. Defaults to 0.2.
        """
        super().__init__()
        assert dim_embed % n_head == 0, 'Embedding dimension must be divisible by number of attention heads!'
        self.dim_k = dim_embed // n_head
        self.dim_embed = dim_embed
        self.n_head = n_head
        self.dropout = nn.Dropout(dropout)

        self.w_k = nn.Linear(dim_embed, dim_embed, bias=False)
        self.w_q = nn.Linear(dim_embed, dim_embed, bias=False)
        self.w_v = nn.Linear(dim_embed, dim_embed, bias=False)
        self.w_o = nn.Linear(dim_embed, dim_embed, bias=False)

    @staticmethod
    def self_attention(query: torch.Tensor, key: torch.Tensor, value: torch.Tensor,
                       mask: torch.Tensor = None, dropout: nn.Dropout = None) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Compute scaled dot-product attention.

        Args:
            query (torch.Tensor): Query tensor.
            key (torch.Tensor): Key tensor.
            value (torch.Tensor): Value tensor.
            mask (torch.Tensor, optional): Attention mask. Defaults to None.
            dropout (nn.Dropout, optional): Dropout layer. Defaults to None.

        Returns:
            Tuple[torch.Tensor, torch.Tensor]: Attention output and attention scores.
        """
        d_k = query.shape[-1]
        score = (query @ key.transpose(-2, -1)) / math.sqrt(d_k)
        if mask is not None:
            score = score.masked_fill(mask == 0, -1e9)
        score = torch.softmax(score, dim=-1)
        if dropout is not None:
            score = dropout(score)
        return (score @ value), score

    def forward(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, mask: torch.Tensor = None) -> torch.Tensor:
        """
        Apply multi-head attention.

        Args:
            q (torch.Tensor): Query tensor.
            k (torch.Tensor): Key tensor.
            v (torch.Tensor): Value tensor.
            mask (torch.Tensor, optional): Attention mask. Defaults to None.

        Returns:
            torch.Tensor: Output after applying multi-head attention.
        """
        bs = q.shape[0]

        # Linear transformations and reshape
        q = self.w_q(q).view(bs, -1, self.n_head, self.dim_k).transpose(1, 2)
        k = self.w_k(k).view(bs, -1, self.n_head, self.dim_k).transpose(1, 2)
        v = self.w_v(v).view(bs, -1, self.n_head, self.dim_k).transpose(1, 2)

        # Apply attention
        output, self.attention_score = self.self_attention(q, k, v, mask, self.dropout)

        # Reshape and apply final linear transformation
        output = output.transpose(1, 2).contiguous().view(bs, -1, self.n_head * self.dim_k)
        return self.w_o(output)

The Multi-Head Attention mechanism allows the model to jointly attend to information from different representation subspaces at different positions. Here's a breakdown of how it works:

1. The input is linearly projected to create queries, keys, and values using `w_q`, `w_k`, and `w_v` respectively.
2. These projections are split into multiple heads (reshaped and transposed).
3. Scaled dot-product attention is applied in parallel for each head.
4. The results are concatenated and linearly transformed using `w_o`.

The `self_attention` method implements the scaled dot-product attention mechanism:

- It computes the dot product of queries and keys, scaled by the square root of their dimension.
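- If a mask is provided, it sets the masked positions to a large negative value so they receive near-zero weight after the softmax. Masks are used to hide padding tokens and, in the decoder, to prevent attending to future tokens.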
- The softmax function is applied to obtain attention weights.
- These weights are used to compute a weighted sum of the values.
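
The four steps above can be traced on a single small head. The sketch below uses random toy tensors and a lower-triangular (causal) mask; the sizes are arbitrary and dropout is omitted.

```python
import math
import torch

torch.manual_seed(0)
seq_len, d_k = 4, 8  # toy sizes
q = torch.randn(1, seq_len, d_k)
k = torch.randn(1, seq_len, d_k)
v = torch.randn(1, seq_len, d_k)

# Causal mask: 1 where attention is allowed (current and earlier positions)
mask = torch.tril(torch.ones(seq_len, seq_len))

scores = (q @ k.transpose(-2, -1)) / math.sqrt(d_k)  # (1, seq_len, seq_len)
scores = scores.masked_fill(mask == 0, -1e9)         # block future positions
weights = torch.softmax(scores, dim=-1)              # each row sums to 1
output = weights @ v                                 # weighted sum of values

print(weights[0])
```

The printed matrix is lower triangular: the first position can only attend to itself, so its output equals its own value vector, while the last position can attend to all four.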

### 6.7 Encoder Block & Encoder Module

Now, let's look at how these components come together in an Encoder Block and then an Encoder module:

In [15]:
class EncoderBlock(nn.Module):
    """
    Implements a single Encoder block of the Transformer.

    Attributes:
        attention_block (MultiHeadAttentionBlock): Multi-head attention mechanism.
        ffn_block (FeedforwardBlock): Feedforward neural network.
        residual_block (nn.ModuleList): List of residual connections.
    """

    def __init__(self, attention_block: MultiHeadAttentionBlock, ffn_block: FeedforwardBlock, dropout: float = 0.2):
        """
        Initialize the EncoderBlock.

        Args:
            attention_block (MultiHeadAttentionBlock): Multi-head attention mechanism.
            ffn_block (FeedforwardBlock): Feedforward neural network.
            dropout (float, optional): Dropout rate. Defaults to 0.2.
        """
        super().__init__()
        self.attention_block = attention_block
        self.ffn_block = ffn_block
        self.residual_block = nn.ModuleList([ResidualConnection(dropout) for _ in range(2)])

    def forward(self, input: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        """
        Apply the Encoder block to the input.

        Args:
            input (torch.Tensor): Input tensor.
            mask (torch.Tensor): Attention mask.

        Returns:
            torch.Tensor: Output after applying the Encoder block.
        """
        input = self.residual_block[0](input, lambda x: self.attention_block(x, x, x, mask))
        output = self.residual_block[1](input, self.ffn_block)
        return output
    
class Encoder(nn.Module):
    """
    Implements the full Encoder of the Transformer.

    Attributes:
        layers (nn.ModuleList): List of EncoderBlock layers.
        norm (LayerNormalization): Final layer normalization.
    """

    def __init__(self, layers: nn.ModuleList):
        """
        Initialize the Encoder.

        Args:
            layers (nn.ModuleList): List of EncoderBlock layers.
        """
        super().__init__()
        self.layers = layers
        self.norm = LayerNormalization()

    def forward(self, input: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        """
        Apply the full Encoder to the input.

        Args:
            input (torch.Tensor): Input tensor.
            mask (torch.Tensor): Attention mask.

        Returns:
            torch.Tensor: Output after applying the full Encoder.
        """
        for layer in self.layers:
            input = layer(input, mask)
        return self.norm(input)

An Encoder Block consists of:

1. A Multi-Head Attention layer, where queries, keys, and values all come from the same input (hence "self-attention").
2. A Feedforward layer.
3. Residual connections and layer normalizations around each of these sub-layers.

The Encoder applies each Encoder Block in sequence, with a final layer normalization at the end.
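
The `mask` argument passed through the encoder is typically a padding mask. How the notebook builds it is not shown in this section, so the following is a sketch under assumptions: `PAD_ID = 0` is a hypothetical padding ID, and the mask is shaped to broadcast over attention scores of shape `(batch_size, n_heads, seq_len, seq_len)`.

```python
import torch

PAD_ID = 0  # hypothetical padding ID, chosen only for this example

# Two source sequences; the second ends with one padding token
src = torch.tensor([[5, 8, 9, 2],
                    [5, 7, 2, PAD_ID]])

# True for real tokens; shape (batch_size, 1, 1, seq_len) broadcasts over heads and queries
src_mask = (src != PAD_ID).unsqueeze(1).unsqueeze(2)

# Uniform dummy scores for 8 heads, masked the same way as in self_attention
scores = torch.zeros(2, 8, 4, 4)
masked = scores.masked_fill(src_mask == 0, -1e9)
weights = torch.softmax(masked, dim=-1)

print(src_mask.shape)    # torch.Size([2, 1, 1, 4])
print(weights[1, 0, 0])  # the padded last position receives zero weight
```

With this mask, no query in the second sequence assigns attention weight to the padded position, while the first sequence is unaffected.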

### 6.8 Decoder Block & Decoder Module

The Decoder Block is similar to the Encoder Block, but with an additional attention layer. As with the Encoder module, we stack multiple Decoder blocks to create the Decoder module.

The Decoder Block has three main components:

1. A masked self-attention layer, where each position is prevented from attending to subsequent positions.
2. A cross-attention layer, which attends to the Encoder's output.
3. A feedforward layer.

Each of these is wrapped with a residual connection and layer normalization.

The Decoder Module applies each Decoder Block in sequence, with a final layer normalization at the end.
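
The target mask for the masked self-attention layer usually combines two constraints: no attention to future positions and no attention to padding. The sketch below shows one way to build such a mask; it is an assumption about the construction (the notebook's own mask code is not part of this section), and `PAD_ID = 0` is a hypothetical padding ID.

```python
import torch

PAD_ID = 0  # hypothetical padding ID, chosen only for this example

# Two target sequences; the second ends with one padding token
tgt = torch.tensor([[5, 7, 9, 2],
                    [5, 7, 2, PAD_ID]])
seq_len = tgt.size(1)

# Causal part: position i may attend to positions 0..i
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Padding part: True for real tokens, shape (batch_size, 1, 1, seq_len)
padding = (tgt != PAD_ID).unsqueeze(1).unsqueeze(2)

# Broadcasting the logical AND gives shape (batch_size, 1, seq_len, seq_len)
tgt_mask = padding & causal

print(tgt_mask.shape)  # torch.Size([2, 1, 4, 4])
print(tgt_mask[1, 0])
```

For the first sequence the result is the plain lower-triangular mask; for the second, the last column is additionally blocked because it corresponds to a padding token.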

In [16]:
class DecoderBlock(nn.Module):
    """
    Implements a single Decoder block of the Transformer.

    Attributes:
        masked_attention_block (MultiHeadAttentionBlock): Masked self-attention mechanism.
        attention_block (MultiHeadAttentionBlock): Cross-attention mechanism.
        ffn_block (FeedforwardBlock): Feedforward neural network.
        residual_block (nn.ModuleList): List of residual connections.
    """

    def __init__(self, masked_attention_block: MultiHeadAttentionBlock, attention_block: MultiHeadAttentionBlock,
                 ffn_block: FeedforwardBlock, dropout: float = 0.2):
        """
        Initialize the DecoderBlock.

        Args:
            masked_attention_block (MultiHeadAttentionBlock): Masked self-attention mechanism.
            attention_block (MultiHeadAttentionBlock): Cross-attention mechanism.
            ffn_block (FeedforwardBlock): Feedforward neural network.
            dropout (float, optional): Dropout rate. Defaults to 0.2.
        """
        super().__init__()
        self.masked_attention_block = masked_attention_block
        self.attention_block = attention_block
        self.ffn_block = ffn_block
        self.residual_block = nn.ModuleList([ResidualConnection(dropout) for _ in range(3)])

    def forward(self, input: torch.Tensor, encoder_output: torch.Tensor, src_mask: torch.Tensor, tgt_mask: torch.Tensor) -> torch.Tensor:
        """
        Apply the Decoder block to the input.

        Args:
            input (torch.Tensor): Input tensor.
            encoder_output (torch.Tensor): Output from the Encoder.
            src_mask (torch.Tensor): Source attention mask.
            tgt_mask (torch.Tensor): Target attention mask.

        Returns:
            torch.Tensor: Output after applying the Decoder block.
        """
        input = self.residual_block[0](input, lambda x: self.masked_attention_block(x, x, x, tgt_mask))
        input = self.residual_block[1](input, lambda x: self.attention_block(x, encoder_output, encoder_output, src_mask))
        output = self.residual_block[2](input, self.ffn_block)
        return output
    
class Decoder(nn.Module):
    """
    Implements the full Decoder of the Transformer.

    Attributes:
        layers (nn.ModuleList): List of DecoderBlock layers.
        norm (LayerNormalization): Final layer normalization.
    """

    def __init__(self, layers: nn.ModuleList):
        """
        Initialize the Decoder.

        Args:
            layers (nn.ModuleList): List of DecoderBlock layers.
        """
        super().__init__()
        self.layers = layers
        self.norm = LayerNormalization()

    def forward(self, input: torch.Tensor, encoder_output: torch.Tensor, src_mask: torch.Tensor, tgt_mask: torch.Tensor) -> torch.Tensor:
        """
        Apply the full Decoder to the input.

        Args:
            input (torch.Tensor): Input tensor.
            encoder_output (torch.Tensor): Output from the Encoder.
            src_mask (torch.Tensor): Source attention mask.
            tgt_mask (torch.Tensor): Target attention mask.

        Returns:
            torch.Tensor: Output after applying the full Decoder.
        """
        for layer in self.layers:
            input = layer(input, encoder_output, src_mask, tgt_mask)
        return self.norm(input)

### 6.9 Final Projection

After the Decoder, we need to project the output to our vocabulary size. This layer transforms the Decoder's output into log probabilities over our target vocabulary.

In the next part, we'll put all these components together to create the full Transformer model and explain how to train and use it for translation tasks.


In [17]:
class Projection(nn.Module):
    """
    Projects the Decoder output to vocabulary size and applies log softmax.

    Attributes:
        linear (nn.Linear): Linear transformation.
    """

    def __init__(self, vocab_size: int, dim_embed: int):
        """
        Initialize the Projection.

        Args:
            vocab_size (int): Size of the vocabulary.
            dim_embed (int): Dimension of the input embeddings.
        """
        super().__init__()
        self.linear = nn.Linear(dim_embed, vocab_size)

    def forward(self, input: torch.Tensor) -> torch.Tensor:
        """
        Apply the projection to the input.

        Args:
            input (torch.Tensor): Input tensor.

        Returns:
            torch.Tensor: Log probabilities over the vocabulary.
        """
        return torch.log_softmax(self.linear(input), dim=-1)

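Because `Projection` returns log-probabilities rather than raw scores, it is worth checking how that output behaves when it is handed to a loss function. The sketch below uses toy sizes chosen only for illustration (they are not the notebook's settings) and a bare `nn.Linear` in place of the `Projection` class. It suggests that applying `log_softmax` a second time changes nothing, so `nn.CrossEntropyLoss` (which applies `log_softmax` internally) and `nn.NLLLoss` should give the same value on this kind of output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes for illustration only (assumptions, not the notebook's values)
batch, seq_len, dim_embed, vocab_size = 2, 4, 8, 10

linear = nn.Linear(dim_embed, vocab_size)
decoder_output = torch.randn(batch, seq_len, dim_embed)

# Same computation as Projection.forward
log_probs = torch.log_softmax(linear(decoder_output), dim=-1)

# Each position holds a valid distribution: the probabilities sum to 1
prob_sums = log_probs.exp().sum(dim=-1)

# log_softmax applied to log-probabilities returns them unchanged
twice = torch.log_softmax(log_probs, dim=-1)

# Compare the two losses on flattened (positions, vocab) log-probabilities
targets = torch.randint(0, vocab_size, (batch, seq_len))
flat_log_probs = log_probs.reshape(-1, vocab_size)
flat_targets = targets.reshape(-1)
ce = nn.CrossEntropyLoss()(flat_log_probs, flat_targets)
nll = nn.NLLLoss()(flat_log_probs, flat_targets)

print(log_probs.shape)
print(torch.allclose(twice, log_probs, atol=1e-5))
print(torch.allclose(ce, nll, atol=1e-5))
```

Either loss can therefore be paired with this projection; `nn.NLLLoss` is simply the more direct match for an output that is already in log space.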
## 7. Assembling the Full Transformer Model

Now that we have all the components, let's put them together to create our full Transformer model.

In [18]:
class Transformer(nn.Module):
    """
    Implements the full Transformer model for sequence-to-sequence tasks.

    Attributes:
        src_emb (InputEmbedding): Input embedding for source language.
        tgt_emb (InputEmbedding): Input embedding for target language.
        src_pos_emb (PositionalEmbedding): Positional embedding for source language.
        tgt_pos_emb (PositionalEmbedding): Positional embedding for target language.
        encoder (Encoder): Encoder stack.
        decoder (Decoder): Decoder stack.
        projection (Projection): Final projection layer.
    """

    def __init__(self, src_emb: InputEmbedding, tgt_emb: InputEmbedding,
                 src_pos_emb: PositionalEmbedding, tgt_pos_emb: PositionalEmbedding,
                 encoder: Encoder, decoder: Decoder, projection: Projection):
        """
        Initialize the Transformer.

        Args:
            src_emb (InputEmbedding): Input embedding for source language.
            tgt_emb (InputEmbedding): Input embedding for target language.
            src_pos_emb (PositionalEmbedding): Positional embedding for source language.
            tgt_pos_emb (PositionalEmbedding): Positional embedding for target language.
            encoder (Encoder): Encoder stack.
            decoder (Decoder): Decoder stack.
            projection (Projection): Final projection layer.
        """
        super().__init__()
        self.src_emb = src_emb
        self.tgt_emb = tgt_emb
        self.src_pos_emb = src_pos_emb
        self.tgt_pos_emb = tgt_pos_emb
        self.encoder = encoder
        self.decoder = decoder
        self.projection = projection

    def encode(self, src: torch.Tensor, src_mask: torch.Tensor) -> torch.Tensor:
        """
        Encode the source sequence.

        Args:
            src (torch.Tensor): Source sequence.
            src_mask (torch.Tensor): Source mask.

        Returns:
            torch.Tensor: Encoded representation of the source sequence.
        """
        src = self.src_emb(src)
        src = self.src_pos_emb(src)
        return self.encoder(src, src_mask)

    def decode(self, tgt: torch.Tensor, tgt_mask: torch.Tensor, src_mask: torch.Tensor, encoder_output: torch.Tensor) -> torch.Tensor:
        """
        Decode the target sequence.

        Args:
            tgt (torch.Tensor): Target sequence.
            tgt_mask (torch.Tensor): Target mask.
            src_mask (torch.Tensor): Source mask.
            encoder_output (torch.Tensor): Output from the encoder.

        Returns:
            torch.Tensor: Decoded representation of the target sequence.
        """
        tgt = self.tgt_emb(tgt)
        tgt = self.tgt_pos_emb(tgt)
        return self.decoder(tgt, encoder_output, src_mask, tgt_mask)

    def project(self, decoder_output: torch.Tensor) -> torch.Tensor:
        """
        Project the decoder output to vocabulary space.

        Args:
            decoder_output (torch.Tensor): Output from the decoder.

        Returns:
            torch.Tensor: Log probabilities over the target vocabulary.
        """
        return self.projection(decoder_output)

    def forward(self, src: torch.Tensor, tgt: torch.Tensor, src_mask: torch.Tensor, tgt_mask: torch.Tensor) -> torch.Tensor:
        """
        Perform a forward pass through the Transformer.

        Args:
            src (torch.Tensor): Source sequence.
            tgt (torch.Tensor): Target sequence.
            src_mask (torch.Tensor): Source mask.
            tgt_mask (torch.Tensor): Target mask.

        Returns:
            torch.Tensor: Log probabilities over the target vocabulary for each position.
        """
        encoder_output = self.encode(src, src_mask)
        decoder_output = self.decode(tgt, tgt_mask, src_mask, encoder_output)
        return self.project(decoder_output)

This `Transformer` class combines all the components we've built so far. The forward pass:

1. Encodes the source sequence.
2. Decodes the target sequence, using the encoded source.
3. Projects the decoder output to get log probabilities over the target vocabulary.


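To see the three steps and the tensor shapes they produce without depending on the classes defined in this notebook, here is a minimal sketch that uses PyTorch's built-in `nn.Transformer` as a stand-in. The sizes are hypothetical, positional embeddings and padding masks are left out for brevity, and `embed` and `to_vocab` are illustrative substitutes for `InputEmbedding` and `Projection`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy sizes, not the notebook's hyperparameters
vocab_size, d_model, batch, src_len, tgt_len = 50, 32, 2, 7, 5

embed = nn.Embedding(vocab_size, d_model)   # stand-in for InputEmbedding
core = nn.Transformer(d_model=d_model, nhead=4,
                      num_encoder_layers=1, num_decoder_layers=1,
                      dim_feedforward=64, dropout=0.0, batch_first=True)
to_vocab = nn.Linear(d_model, vocab_size)   # stand-in for Projection

src = torch.randint(0, vocab_size, (batch, src_len))
tgt = torch.randint(0, vocab_size, (batch, tgt_len))

# 1. Encode the source sequence
memory = core.encoder(embed(src))

# 2. Decode the target; -inf above the diagonal blocks attention to future positions
causal_mask = torch.triu(torch.full((tgt_len, tgt_len), float("-inf")), diagonal=1)
decoded = core.decoder(embed(tgt), memory, tgt_mask=causal_mask)

# 3. Project to log-probabilities over the vocabulary
log_probs = torch.log_softmax(to_vocab(decoded), dim=-1)

print(memory.shape, decoded.shape, log_probs.shape)
```

The encoder output keeps the source length, while the decoder output and the final log-probabilities follow the target length.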

## 8. Creating the Transformer Model

Now, let's create a function to instantiate our Transformer model:

In [19]:
def create_transformer_model(
        src_vocab_size: int,
        tgt_vocab_size: int,
        src_max_seq_len: int,
        tgt_max_seq_len: int,
        dim_embed: int = 256,
        n_layer: int = 8,
        n_head: int = 8,  # must divide dim_embed evenly
        dropout: float = 0.2,
        ffn_hidden_dim: int = 1024
    ) -> Transformer:
    """
    Create a Transformer model with the specified parameters.

    Args:
        src_vocab_size (int): Size of the source vocabulary.
        tgt_vocab_size (int): Size of the target vocabulary.
        src_max_seq_len (int): Maximum length of source sequences.
        tgt_max_seq_len (int): Maximum length of target sequences.
        dim_embed (int, optional): Embedding dimension. Defaults to 256.
        n_layer (int, optional): Number of encoder/decoder layers. Defaults to 8.
        n_head (int, optional): Number of attention heads; must divide dim_embed evenly. Defaults to 8.
        dropout (float, optional): Dropout rate. Defaults to 0.2.
        ffn_hidden_dim (int, optional): Hidden dimension of feedforward networks. Defaults to 1024.

    Returns:
        Transformer: Instantiated Transformer model.
    """
    src_embedding = InputEmbedding(src_vocab_size, dim_embed)
    tgt_embedding = InputEmbedding(tgt_vocab_size, dim_embed)
    src_pos_embedding = PositionalEmbedding(dim_embed, src_max_seq_len, dropout)
    tgt_pos_embedding = PositionalEmbedding(dim_embed, tgt_max_seq_len, dropout)

    encoder_blocks = []
    decoder_blocks = []
    for _ in range(n_layer):
        encoder_attention = MultiHeadAttentionBlock(dim_embed, n_head, dropout)
        encoder_ffn = FeedforwardBlock(dim_embed, ffn_hidden_dim, dropout)
        encoder_block = EncoderBlock(encoder_attention, encoder_ffn, dropout)
        encoder_blocks.append(encoder_block)

        decoder_masked_attention = MultiHeadAttentionBlock(dim_embed, n_head, dropout)
        decoder_attention = MultiHeadAttentionBlock(dim_embed, n_head, dropout)
        decoder_ffn = FeedforwardBlock(dim_embed, ffn_hidden_dim, dropout)
        decoder_block = DecoderBlock(decoder_masked_attention, decoder_attention, decoder_ffn, dropout)
        decoder_blocks.append(decoder_block)

    encoder = Encoder(nn.ModuleList(encoder_blocks))
    decoder = Decoder(nn.ModuleList(decoder_blocks))
    projection = Projection(tgt_vocab_size, dim_embed)

    transformer = Transformer(
        src_embedding,
        tgt_embedding,
        src_pos_embedding,
        tgt_pos_embedding,
        encoder,
        decoder,
        projection
    )

    # Initialize parameters with Xavier uniform distribution
    for param in transformer.parameters():
        if param.dim() > 1:
            nn.init.xavier_uniform_(param)

    return transformer

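The initialization loop at the end of `create_transformer_model` touches only parameters with more than one dimension, so weight matrices are re-initialized while one-dimensional parameters such as biases keep their defaults. The sketch below applies the same loop to a single `nn.Linear` layer (an illustrative stand-in, not part of the model) and compares the weights against the Xavier uniform bound `sqrt(6 / (fan_in + fan_out))`.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Linear(64, 128)  # weight shape (128, 64), bias shape (128,)
bias_before = layer.bias.detach().clone()

# Same loop as in create_transformer_model
for param in layer.parameters():
    if param.dim() > 1:
        nn.init.xavier_uniform_(param)

# Xavier uniform samples from U(-bound, bound) with gain 1
fan_in, fan_out = 64, 128
bound = math.sqrt(6.0 / (fan_in + fan_out))
max_abs_weight = layer.weight.abs().max().item()

print(f"bound = {bound:.4f}")
print(f"max |weight| = {max_abs_weight:.4f}")
print("bias unchanged:", torch.equal(layer.bias, bias_before))
```

If `InputEmbedding` wraps an `nn.Embedding` (an assumption about the earlier definition), its two-dimensional weight table is re-initialized by this loop as well.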
## 9. Training the Model

Now that we have our model, let's set up the training process:

In [None]:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
import time
import os

def create_mask(src: torch.Tensor, tgt: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """
    Create masks for source and target sequences.

    Args:
        src (torch.Tensor): Source sequence.
        tgt (torch.Tensor): Target sequence.

    Returns:
        Tuple[torch.Tensor, torch.Tensor]: Source and target masks.
    """
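    # Padding masks: True for real tokens, False for <pad>
    src_mask = (src != src_token_to_id['<pad>']).unsqueeze(1).unsqueeze(2)  # (batch, 1, 1, src_len)
    tgt_mask = (tgt != tgt_token_to_id['<pad>']).unsqueeze(1).unsqueeze(3)  # (batch, 1, tgt_len, 1)

    # Causal mask: position i may only attend to positions <= i
    tgt_len = tgt.size(1)
    tgt_submask = torch.tril(torch.ones((tgt_len, tgt_len), device=tgt.device)).bool()
    tgt_mask = tgt_mask & tgt_submask  # (batch, 1, tgt_len, tgt_len)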

    return src_mask, tgt_mask

def train_epoch(model: nn.Module, optimizer: torch.optim.Optimizer, dataloader: DataLoader, loss_fn: nn.Module, device: torch.device, log_file) -> tuple[float, float]:
    """
    Train the model for one epoch.

    Args:
        model (nn.Module): The Transformer model.
        optimizer (torch.optim.Optimizer): The optimizer.
        dataloader (DataLoader): DataLoader for training data.
        loss_fn (nn.Module): The loss function.
        device (torch.device): The device to train on.
        log_file: File object for logging.

    Returns:
        tuple[float, float]: Average loss for this epoch and the epoch duration in seconds.
    """
    model.train()
    total_loss = 0
    start_time = time.time()

    for i, (src, tgt) in enumerate(dataloader):
        src = src.to(device)
        tgt = tgt.to(device)

        tgt_input = tgt[:, :-1]
        tgt_output = tgt[:, 1:]

        src_mask, tgt_mask = create_mask(src, tgt_input)

        optimizer.zero_grad()

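        # The model's Projection layer returns log-probabilities rather than raw logits;
        # CrossEntropyLoss applies log_softmax internally, which leaves them unchanged.
        logits = model(src, tgt_input, src_mask, tgt_mask)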

        loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_output.contiguous().reshape(-1))
        loss.backward()

        optimizer.step()

        total_loss += loss.item()

        # Log the loss after each iteration
        log_file.write(f"Iteration {i+1}, Loss: {loss.item():.4f}\n")
        log_file.flush()  # Ensure the loss is written to the file immediately

    end_time = time.time()
    epoch_time = end_time - start_time
    avg_loss = total_loss / len(dataloader)
    return avg_loss, epoch_time

# Set up the model and training
transformer_model = create_transformer_model(
    vocab_src_size,
    vocab_tgt_size,
    max_src_seq_len,
    max_tgt_seq_len,
    parameters['dim_embed'],
    parameters['n_layers'],
    parameters['n_heads'],
    parameters['dropout'],
    parameters['ffn_hidden_dim']
).to(device)

optimizer = torch.optim.Adam(transformer_model.parameters(), lr=parameters['lr'], betas=(0.9, 0.98), eps=1e-9)
loss_fn = nn.CrossEntropyLoss(ignore_index=tgt_token_to_id['<pad>'])
dataloader = DataLoader(dataset, batch_size=parameters['bs'], shuffle=True, collate_fn=collate_fn)

# Create model save directory if it doesn't exist
os.makedirs(parameters['model_path'], exist_ok=True)

# Construct the full paths for the model files
best_model_path = os.path.join(parameters['model_path'], 'best_transformer_model.pth')
latest_model_path = os.path.join(parameters['model_path'], 'latest_transformer_model.pth')

best_loss = float('inf')
no_improvement_count = 0
max_no_improvement = 3

# Open the log file
with open("log.txt", "w") as log_file:
    log_file.write("Training started\n")
    
    for epoch in range(parameters['n_epochs']):
        train_loss, epoch_time = train_epoch(transformer_model, optimizer, dataloader, loss_fn, device, log_file)
        log_message = f'Epoch: {epoch+1}/{parameters["n_epochs"]}, Loss: {train_loss:.4f}, Time: {epoch_time:.2f} seconds'
        print(log_message)
        log_file.write(log_message + "\n")
        
        # Check for improvement (training losses are compared after rounding to two decimals)
        if round(train_loss, 2) < round(best_loss, 2):
            best_loss = train_loss
            no_improvement_count = 0
            torch.save(transformer_model.state_dict(), best_model_path)
            save_message = f"New best model saved with loss: {best_loss:.4f}"
            print(save_message)
            log_file.write(save_message + "\n")
        else:
            no_improvement_count += 1
        
        # Save the latest model (useful for resuming training)
        torch.save({
            'epoch': epoch,
            'model_state_dict': transformer_model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'loss': train_loss,
        }, latest_model_path)
        
        log_file.flush()

        # Check for early stopping
        if no_improvement_count >= max_no_improvement:
            print(f"Early stopping after {no_improvement_count} epochs without improvement.")
            log_file.write(f"Early stopping after {no_improvement_count} epochs without improvement.\n")
            break

    print("Training complete.")
    log_file.write("Training complete.\n")

print(f"Best model saved to {best_model_path}")
print(f"Latest model checkpoint saved to {latest_model_path}")
print(f"Total epochs: {epoch + 1}")

Epoch: 1/100, Loss: 6.4170, Time: 89.81 seconds
New best model saved with loss: 6.4170
Epoch: 2/100, Loss: 5.5120, Time: 95.87 seconds
New best model saved with loss: 5.5120
Epoch: 3/100, Loss: 5.2158, Time: 99.55 seconds
New best model saved with loss: 5.2158
Epoch: 4/100, Loss: 4.9875, Time: 100.15 seconds
New best model saved with loss: 4.9875
Epoch: 5/100, Loss: 4.8123, Time: 99.99 seconds
New best model saved with loss: 4.8123
Epoch: 6/100, Loss: 4.6637, Time: 99.78 seconds
New best model saved with loss: 4.6637
Epoch: 7/100, Loss: 4.5310, Time: 101.85 seconds
New best model saved with loss: 4.5310
Epoch: 8/100, Loss: 4.4090, Time: 100.23 seconds
New best model saved with loss: 4.4090
Epoch: 9/100, Loss: 4.3023, Time: 101.62 seconds
New best model saved with loss: 4.3023
Epoch: 10/100, Loss: 4.1951, Time: 100.22 seconds
New best model saved with loss: 4.1951
Epoch: 11/100, Loss: 4.0931, Time: 101.04 seconds
New best model saved with loss: 4.0931
Epoch: 12/100, Loss: 4.0014, Time: 

This training loop does the following for each epoch:

1. Iterates over batches of data.
2. Creates appropriate masks for source and target sequences.
3. Performs a forward pass through the model.
4. Calculates the loss.
5. Performs backpropagation and updates the model parameters.

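Steps 2 to 4 rely on two details that are easy to miss: the target is shifted by one position (teacher forcing), and the target mask combines a padding mask with a causal mask. The following minimal sketch reproduces both on a toy batch. `PAD_ID = 0` and the token ids are hypothetical values chosen for this example, not the notebook's vocabulary.

```python
import torch

PAD_ID = 0  # hypothetical padding id, used only in this sketch

# Toy batch: the second sequence in each tensor is padded
src = torch.tensor([[1, 4, 9, 2],
                    [1, 3, 2, 0]])
tgt = torch.tensor([[1, 5, 6, 7, 2],
                    [1, 8, 2, 0, 0]])

# Teacher forcing: the decoder reads tokens 0..n-2 and must predict tokens 1..n-1
tgt_input = tgt[:, :-1]
tgt_output = tgt[:, 1:]

# Source padding mask, shape (batch, 1, 1, src_len)
src_mask = (src != PAD_ID).unsqueeze(1).unsqueeze(2)

# Target padding mask over query positions, shape (batch, 1, tgt_len, 1)
pad_mask = (tgt_input != PAD_ID).unsqueeze(1).unsqueeze(3)

# Causal mask, shape (tgt_len, tgt_len): position i sees positions <= i
tgt_len = tgt_input.size(1)
causal_mask = torch.tril(torch.ones((tgt_len, tgt_len))).bool()

# Broadcasting combines them into shape (batch, 1, tgt_len, tgt_len)
tgt_mask = pad_mask & causal_mask

print(tgt_input.tolist())
print(tgt_output.tolist())
print(src_mask.shape, tgt_mask.shape)
print(tgt_mask[1, 0].int().tolist())
```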
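After every epoch, we save a checkpoint of the latest model and optimizer state, and whenever the training loss improves we also save the best weights so far for later use. Training stops early after three consecutive epochs without improvement.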

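A saved checkpoint is only useful if it can be loaded again. The sketch below shows one way training might be resumed from a dictionary with the same keys as the one written in the training loop. The tiny `nn.Linear` model, the temporary file name, and the stored epoch and loss values are placeholders for illustration only.

```python
import os
import tempfile
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-in model and optimizer, for illustration only
model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Take one optimizer step so that Adam has internal state to save
model(torch.randn(3, 4)).sum().backward()
optimizer.step()

with tempfile.TemporaryDirectory() as tmp_dir:
    checkpoint_path = os.path.join(tmp_dir, "latest_checkpoint.pth")  # hypothetical name

    # Same dictionary layout as the checkpoint written in the training loop
    torch.save({
        "epoch": 4,        # placeholder value
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
        "loss": 1.0,       # placeholder value
    }, checkpoint_path)

    # Restore into freshly created objects, as a new session would
    restored_model = nn.Linear(4, 2)
    restored_optimizer = torch.optim.Adam(restored_model.parameters(), lr=1e-4)

    checkpoint = torch.load(checkpoint_path, map_location="cpu")
    restored_model.load_state_dict(checkpoint["model_state_dict"])
    restored_optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    start_epoch = checkpoint["epoch"] + 1  # continue with the next epoch

print(start_epoch)
print(torch.equal(model.weight, restored_model.weight))
```

Restoring the optimizer state alongside the weights matters for Adam, which keeps running moment estimates for each parameter.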
## 10. Using the Model for Translation

Finally, let's create a function to use our trained model for translation.

The `translate` function defined below:

1. Encodes the source sequence once.
2. Generates the translation one token at a time with greedy decoding: at each step it picks the most probable next token and appends it to the input for the next prediction.
3. Stops when it generates an end-of-sequence token or reaches the maximum length.

With this, we've completed our implementation of a Transformer model for the machine translation task.

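The two stopping conditions of this decoding loop can be seen in isolation with a toy stand-in for the model. In the sketch below, `toy_next_token_scores`, the token ids, and the five-token vocabulary are all hypothetical; only the loop structure mirrors the translation procedure.

```python
SOS, EOS = 0, 1  # hypothetical special-token ids for this sketch


def toy_next_token_scores(prefix: list[int]) -> list[float]:
    """Stand-in for the model: scores over a 5-token vocabulary given the prefix."""
    # Prefers tokens 2, 3, 4 in order, then EOS
    step = len(prefix) - 1
    preferred = [2, 3, 4, EOS][min(step, 3)]
    scores = [0.0] * 5
    scores[preferred] = 1.0
    return scores


def greedy_decode(max_len: int = 50) -> list[int]:
    """Greedy decoding: always append the highest-scoring next token."""
    tokens = [SOS]
    for _ in range(max_len - 1):
        scores = toy_next_token_scores(tokens)
        next_token = max(range(len(scores)), key=scores.__getitem__)  # argmax
        tokens.append(next_token)
        if next_token == EOS:
            break
    return tokens


print(greedy_decode())           # ends when EOS is generated
print(greedy_decode(max_len=3))  # ends at the length limit
```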
In [24]:
def translate(model: nn.Module, src: torch.Tensor, max_len: int = 50) -> torch.Tensor:
    """
    Translate a source sequence using the trained model.

    Args:
        model (nn.Module): The trained Transformer model.
        src (torch.Tensor): Source sequence tensor.
        max_len (int, optional): Maximum length of the translated sequence. Defaults to 50.

    Returns:
        torch.Tensor: Translated sequence as tensor of token IDs.
    """
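    model.eval()
    src_mask = (src != src_token_to_id['<pad>']).unsqueeze(-2)

    # Inference only: disable gradient tracking to save memory and time
    with torch.no_grad():
        enc_output = model.encode(src, src_mask)

        # Start the target sequence with the <sos> token
        tgt = torch.full((1, 1), tgt_token_to_id['<sos>'], dtype=torch.long, device=src.device)
        for _ in range(max_len - 1):
            # Combine the padding mask with a causal mask over the tokens generated so far
            tgt_mask = (tgt != tgt_token_to_id['<pad>']).unsqueeze(-2)
            causal_mask = torch.tril(torch.ones((1, tgt.size(1), tgt.size(1)), device=src.device)).bool()
            tgt_mask = tgt_mask & causal_mask

            out = model.decode(tgt, tgt_mask, src_mask, enc_output)

            # Greedy decoding: pick the most probable token at the last position
            log_probs = model.project(out[:, -1])
            next_word = torch.argmax(log_probs, dim=1).item()

            next_token = torch.full((1, 1), next_word, dtype=torch.long, device=src.device)
            tgt = torch.cat([tgt, next_token], dim=1)

            if next_word == tgt_token_to_id['<eos>']:
                break

    return tgt.squeeze(0)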

# Example usage
src_sentence = "Hello, how are you?"
src_tokens = ['<sos>'] + word_tokenize(src_sentence) + ['<eos>']
src_ids = [src_token_to_id.get(token, src_token_to_id['<unk>']) for token in src_tokens]
src_tensor = torch.tensor(src_ids).unsqueeze(0).to(device)

translated_ids = translate(transformer_model, src_tensor)
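# Drop special tokens before joining the output into a sentence
special_ids = {tgt_token_to_id['<sos>'], tgt_token_to_id['<eos>'], tgt_token_to_id['<pad>']}
translated_tokens = [tgt_id_to_token[token_id.item()] for token_id in translated_ids if token_id.item() not in special_ids]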
translated_sentence = ' '.join(translated_tokens)

print(f"Source: {src_sentence}")
print(f"Translation: {translated_sentence}")

Source: Hello, how are you?
Translation: A vous le sentir .


## 11. Evaluation & Translation

To properly assess our model's performance, we should use an established evaluation metric for machine translation. The most common one is the BLEU (Bilingual Evaluation Understudy) score, which measures n-gram overlap between the model's output and a reference translation. Let's implement a function to calculate BLEU scores.

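Sentence-level BLEU is harsh on short sentences: with NLTK's default weights, a hypothesis that shares no 3-grams or 4-grams with the reference scores close to zero. The sketch below contrasts the plain score with NLTK's `SmoothingFunction().method1` on a toy, pre-tokenized pair. Smoothing is shown here as a common option, not as what this notebook's evaluation uses.

```python
import warnings
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Pre-tokenized toy pair, so no tokenizer data needs to be downloaded
reference = ["J'ai", "deux", "frères", "."]
hypothesis = ["J'ai", "deux", "ans", "."]

# Without smoothing: no 3-gram or 4-gram matches, so the score collapses toward 0
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # NLTK warns about zero n-gram overlaps
    plain = sentence_bleu([reference], hypothesis)

# With smoothing: method1 adds a small epsilon to zero n-gram match counts
smoothed = sentence_bleu([reference], hypothesis,
                         smoothing_function=SmoothingFunction().method1)

print(f"plain BLEU:    {plain:.4f}")
print(f"smoothed BLEU: {smoothed:.4f}")
```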
In [25]:
from nltk.translate.bleu_score import sentence_bleu
import nltk
nltk.download('punkt')

def evaluate_bleu(model: nn.Module, test_data: list[tuple[str, str]]) -> float:
    """
    Evaluate the model using BLEU score.

    Args:
        model (nn.Module): The trained Transformer model.
        test_data (List[Tuple[str, str]]): List of (source, target) sentence pairs.

    Returns:
        float: Average BLEU score across all test samples.
    """
    model.eval()
    bleu_scores = []

    for src, ref in test_data:
        src_tokens = ['<sos>'] + word_tokenize(src) + ['<eos>']
        src_ids = [src_token_to_id.get(token, src_token_to_id['<unk>']) for token in src_tokens]
        src_tensor = torch.tensor(src_ids).unsqueeze(0).to(device)

        with torch.no_grad():
            translated_ids = translate(model, src_tensor)

        special_ids = {tgt_token_to_id['<sos>'], tgt_token_to_id['<eos>'], tgt_token_to_id['<pad>']}
        translated_tokens = [tgt_id_to_token[token_id.item()] for token_id in translated_ids
                             if token_id.item() not in special_ids]

        
        translated_sentence = ' '.join(translated_tokens)
        
        reference_tokens = word_tokenize(ref)

        print(f"Reference Sentence: {ref} \t Translated Sentence: {translated_sentence}")

        bleu = sentence_bleu([reference_tokens], translated_tokens)
        bleu_scores.append(bleu)

    return sum(bleu_scores) / len(bleu_scores)

# Sample validation data
sample_data = [
    ("I love programming.", "J'aime la programmation."),
    ("She is a good teacher.", "Elle est une bonne professeure."),
    ("He plays the guitar.", "Il joue de la guitare."),
    ("We went to the beach.", "Nous sommes allés à la plage."),
    ("I have two brothers.", "J'ai deux frères."),
    ("She speaks three languages.", "Elle parle trois langues."),
    ("He is from France.", "Il vient de France."),
    ("We are going to the cinema.", "Nous allons au cinéma."),
    ("I drink coffee every morning.", "Je bois du café tous les matins."),
    ("She reads a book every day.", "Elle lit un livre tous les jours."),
    ("He likes to cook.", "Il aime cuisiner."),
    ("We are learning French.", "Nous apprenons le français."),
    ("I work as a software engineer.", "Je travaille comme ingénieur logiciel."),
    ("She studies medicine.", "Elle étudie la médecine."),
    ("He is a professional musician.", "Il est musicien professionnel."),
    ("We are visiting our grandparents.", "Nous rendons visite à nos grands-parents."),
    ("I prefer tea to coffee.", "Je préfère le thé au café."),
    ("She is a vegetarian.", "Elle est végétarienne."),
    ("He is a runner.", "Il est un coureur."),
    ("We are going on a vacation.", "Nous partons en vacances.")
]

# Split the sample data into source and target sentences
source_sentences, target_sentences = zip(*sample_data)

# Create a test dataset
test_dataset = list(zip(source_sentences, target_sentences))

# Evaluate test_dataset
test_bleu = evaluate_bleu(transformer_model, test_dataset)
print(f"Test BLEU score: {test_bleu:.4f}")

[nltk_data] Downloading package punkt to /home/pdeb/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Reference Sentence: J'aime la programmation. 	 Translated Sentence: Je veux supporter ce sujet .
Reference Sentence: Elle est une bonne professeure. 	 Translated Sentence: Elle est une bonne maîtresse .
Reference Sentence: Il joue de la guitare. 	 Translated Sentence: Il fait de ce devoir .
Reference Sentence: Nous sommes allés à la plage. 	 Translated Sentence: Nous avons fait ce sujet .
Reference Sentence: J'ai deux frères. 	 Translated Sentence: J'ai deux ans .
Reference Sentence: Elle parle trois langues. 	 Translated Sentence: Elle est puis , le reste de la Mr .
Reference Sentence: Il vient de France. 	 Translated Sentence: Il est de la somme de
Reference Sentence: Nous allons au cinéma. 	 Translated Sentence: Nous allons nous établir dans ce sujet .
Reference Sentence: Je bois du café tous les matins. 	 Translated Sentence: Je passai tout le café du matin .
Reference Sentence: Elle lit un livre tous les jours. 	 Translated Sentence: Elle mit leur livre pendant ce temps à parler .