### Machine Translation with Seq2Seq Model in PyTorch

# Machine Translation
Machine translation (MT) refers to the automatic translation of text from one language to another using computational models. Neural Machine Translation (NMT) systems often use Seq2Seq architectures with encoder-decoder structures and attention mechanisms.

## What is Machine Translation?

Machine Translation (MT) is a subfield of Natural Language Processing (NLP) focused on the automatic translation of text or speech from one language to another. Given a source sentence $ X = (x_1, x_2, \dots, x_n) $ in a source language, the goal is to generate a target sentence $ Y = (y_1, y_2, \dots, y_m) $ in a target language that conveys the same meaning.

In mathematical terms, the task can be modeled as finding the most probable translation $ Y^* $:

$$
Y^* = \underset{Y}{\text{argmax}} \; P(Y|X)
$$

Where:
- $ P(Y|X) $ is the conditional probability of the target sequence given the source sequence.

Seq2Seq models break this task into two steps:
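1. **Encoding:** The source sentence $ X $ is encoded into a fixed-length context vector $ C $, which summarizes the meaning of the source sentence. (With attention, as in the model built later in this lab, the decoder also reads the per-token encoder outputs rather than relying on $ C $ alone.)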
2. **Decoding:** The target sentence $ Y $ is generated word-by-word based on the context vector $ C $ and the previously generated words.
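
The word-by-word decoding in step 2 corresponds to the chain-rule factorization of the conditional probability. As a sketch, writing $ y_{<t} $ for the target words generated before step $ t $ (notation introduced here for illustration):

$$
P(Y|X) = \prod_{t=1}^{m} P(y_t \mid y_{<t}, C)
$$

Each factor is the decoder's predicted distribution over the next word, so maximizing $ P(Y|X) $ during training amounts to minimizing the sum of per-step cross-entropy losses.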




In [1]:
#!pip install datasets

In [2]:
#!pip install torchmetrics

This lab demonstrates how to:
1. Load a translation dataset from Hugging Face's `datasets` library.
2. Create a PyTorch Dataset and DataLoader.
3. Build a Seq2Seq model with an encoder-decoder architecture.
4. Train the model.
5. Evaluate the model using BLEU score.

### 1. Load the Dataset
The **ManyThings English-French dataset** is a popular parallel corpus for English-to-French translation experiments. It is simple and lightweight, and is mainly intended for beginner-level work with machine translation models. Its key aspects:

**Content:**
   - The dataset consists of parallel sentences: each English sentence is paired with its French translation.
   - The sentences are mostly short, conversational, and simple, which makes the dataset a good fit for introductory machine translation tasks.



In [3]:
from datasets import load_dataset

ds = load_dataset("avitri/eng-fra")



In [4]:
ds

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 135842
    })
})

In [5]:
# Hold out 10% of the sentence pairs for testing
split_ds = ds["train"].train_test_split(test_size=0.1)

In [6]:
train_data, test_data= split_ds['train'], split_ds['test']

In [7]:
train_data[0]['text'].split('\t')

['You are taller than me.', 'Vous êtes plus grands que moi.']

In [8]:
# Each row is "english<TAB>french"; lowercase and whitespace-tokenize each side
train_tokens_A = [token for sentence_pair in train_data for token in sentence_pair['text'].split('\t')[0].lower().split()]
train_tokens_B = [token for sentence_pair in train_data for token in sentence_pair['text'].split('\t')[1].lower().split()]

In [9]:
from collections import Counter

vocab_A=Counter(train_tokens_A)
vocab_B=Counter(train_tokens_B)



In [10]:
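# Word ids start at 4; ids 0-3 are reserved for the special tokens added in the next cell
w2i_A = {k: (i + 4) for i, (k, v) in enumerate(vocab_A.items())}
w2i_B = {k: (i + 4) for i, (k, v) in enumerate(vocab_B.items())}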

In [11]:
for i, k in enumerate(['<pad>', '<bos>', '<eos>', '<unk>']):
    w2i_A[k] = i
    w2i_B[k] = i

### 2. Dataset and Dataloader Creation

In [12]:
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    """Maps each tab-separated sentence pair to two lists of token ids."""

    def __init__(self, train_data, w2i_A, w2i_B):
        self.train_data = train_data
        self.w2i_A = w2i_A
        self.i2w_A = {v: k for (k, v) in w2i_A.items()}

        self.w2i_B = w2i_B
        self.i2w_B = {v: k for (k, v) in w2i_B.items()}

    def __len__(self):
        return len(self.train_data)

    def __getitem__(self, idx):
        sample = self.train_data[idx]
        sent_A, sent_B = sample['text'].split('\t')

        # Use the vocabularies stored on the instance rather than the globals;
        # words missing from a vocabulary map to <unk>.
        tokens_A = [self.w2i_A['<bos>']]
        for token in sent_A.lower().split():
            tokens_A.append(self.w2i_A.get(token, self.w2i_A['<unk>']))
        tokens_A.append(self.w2i_A['<eos>'])

        tokens_B = [self.w2i_B['<bos>']]
        for token in sent_B.lower().split():
            tokens_B.append(self.w2i_B.get(token, self.w2i_B['<unk>']))
        tokens_B.append(self.w2i_B['<eos>'])

        return tokens_A, tokens_B, len(tokens_A), len(tokens_B)

In [13]:
train_dataset=CustomDataset(train_data, w2i_A, w2i_B)
test_dataset=CustomDataset(test_data, w2i_A, w2i_B)

In [16]:
import torch
from torch.nn.utils.rnn import pad_sequence

def my_collate_fn(batch):
    """
    Custom collate function for the DataLoader to handle variable-length sequences.
    Pads source and target sequences to the maximum length in the batch.

    Args:
        batch (list of tuples): Each element is
            (source_ids, target_ids, source_length, target_length),
            as returned by CustomDataset.__getitem__.

    Returns:
        tuple: (padded_source_ids, padded_target_ids, source_lengths, target_lengths)
    """
    # Separate source and target sequences
    source_ids = [torch.tensor(item[0]) for item in batch]
    target_ids = [torch.tensor(item[1]) for item in batch]

    # Unpadded lengths, including <bos> and <eos>
    source_lengths = torch.tensor([item[2] for item in batch])
    target_lengths = torch.tensor([item[3] for item in batch])

    # Pad the sequences to the maximum length in the batch (0 is the <pad> id)
    padded_source_ids = pad_sequence(source_ids, batch_first=True, padding_value=0)
    padded_target_ids = pad_sequence(target_ids, batch_first=True, padding_value=0)

    return padded_source_ids, padded_target_ids, source_lengths, target_lengths
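
To make the padding step concrete, here is a small standalone sketch of what `pad_sequence` does to two sequences of different lengths. The token ids are arbitrary illustrative values, and `pad_mask` is a name introduced here, not something the collate function returns.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Toy id sequences of different lengths (ids are arbitrary illustrations)
seqs = [torch.tensor([1, 5, 6, 2]), torch.tensor([1, 7, 2])]
lengths = torch.tensor([len(s) for s in seqs])

# Shorter sequences are right-padded with 0, the <pad> id
padded = pad_sequence(seqs, batch_first=True, padding_value=0)

# True at real tokens, False at padding positions
pad_mask = padded != 0

print(padded.shape)  # torch.Size([2, 4])
print(padded[1])     # tensor([1, 7, 2, 0])
```

The number of `True` entries in each row of `pad_mask` equals the corresponding entry of `lengths`, which is why the collate function can return the lengths alongside the padded tensors.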

In [17]:
from torch.utils.data import DataLoader

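train_dataloader = DataLoader(train_dataset, batch_size=64, shuffle=True, collate_fn=my_collate_fn)
# The test loader needs the same collate function to pad variable-length sequences;
# shuffling is not needed for evaluation.
test_dataloader = DataLoader(test_dataset, batch_size=64, shuffle=False, collate_fn=my_collate_fn)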

### 3. Seq2Seq Model Definition

In [18]:
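import torch.nn as nn

class BahdanauAttention(nn.Module):
    """Additive attention: score(q, k) = Va(tanh(Wa(q) + Ua(k)))."""

    def __init__(self, hidden_size):
        super(BahdanauAttention, self).__init__()
        self.Wa = nn.Linear(hidden_size, hidden_size)
        self.Ua = nn.Linear(hidden_size, hidden_size)
        self.Va = nn.Linear(hidden_size, 1)

    def forward(self, query, keys):
        # query: (batch, 1, hidden); keys: (batch, src_len, hidden)
        # The sum broadcasts the query over all source positions.
        scores = self.Va(torch.tanh(self.Wa(query) + self.Ua(keys)))  # (batch, src_len, 1)
        scores = scores.squeeze(2).unsqueeze(1)                       # (batch, 1, src_len)

        # Note: padded source positions are not masked here.
        weights = torch.nn.functional.softmax(scores, dim=-1)
        context = torch.bmm(weights, keys)                            # (batch, 1, hidden)

        return context, weights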
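
The tensor shapes in the attention computation are easy to get wrong, so the following standalone sketch traces them with random inputs. The sizes (`batch`, `src_len`, `hidden`) are toy values chosen only for illustration, and the three linear layers are fresh, untrained stand-ins for `Wa`, `Ua`, and `Va`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
batch, src_len, hidden = 2, 5, 8  # toy sizes for illustration only

# Untrained stand-ins for the three projections of additive attention
Wa, Ua, Va = nn.Linear(hidden, hidden), nn.Linear(hidden, hidden), nn.Linear(hidden, 1)

query = torch.randn(batch, 1, hidden)       # decoder hidden state, one step
keys = torch.randn(batch, src_len, hidden)  # encoder outputs, one per source token

# Broadcasting adds the single query to every source position
scores = Va(torch.tanh(Wa(query) + Ua(keys)))                 # (batch, src_len, 1)
weights = F.softmax(scores.squeeze(2).unsqueeze(1), dim=-1)   # (batch, 1, src_len)
context = torch.bmm(weights, keys)                            # (batch, 1, hidden)

print(weights.shape, context.shape)
```

Each row of `weights` sums to 1, so `context` is a weighted average of the encoder outputs for that decoding step.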



In [19]:
class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size, padding_idx_A, dropout_p=0.1):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size

        self.embedding = nn.Embedding(input_size, hidden_size, padding_idx=padding_idx_A)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.dropout = nn.Dropout(dropout_p)

    def forward(self, input):
        embedded = self.dropout(self.embedding(input))
        output, hidden = self.gru(embedded)
        return output, hidden

In [20]:
class Seq2Seq(nn.Module):
    def __init__(self, num_embeddings_A, num_embeddings_B, hidden_size, padding_idx_A, padding_idx_B , dropout_rate=0.1):
        super(Seq2Seq, self).__init__()

        self.encoder= EncoderRNN(num_embeddings_A, hidden_size, padding_idx_A, dropout_rate)
        self.embedding = nn.Embedding(num_embeddings_B, hidden_size, padding_idx=padding_idx_B)

        self.attention = BahdanauAttention(hidden_size)

        self.gru = nn.GRU(2 * hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, num_embeddings_B)
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, input_tensor, target_tensor):

        encoder_outputs, encoder_hidden=self.encoder(input_tensor)
        batch_size,seq_len = target_tensor.shape

        decoder_hidden = encoder_hidden
        decoder_outputs = []
        attentions = []

        for i in range(seq_len):

            decoder_input = target_tensor[:, i].unsqueeze(1)
            decoder_output, decoder_hidden, attn_weights = self.forward_step(
                decoder_input, decoder_hidden, encoder_outputs
            )
            decoder_outputs.append(decoder_output)
            attentions.append(attn_weights)


        decoder_outputs = torch.cat(decoder_outputs, dim=1)
        attentions = torch.cat(attentions, dim=1)

        return decoder_outputs, attentions


    def forward_step(self, input, hidden, encoder_outputs):
        embedded =  self.dropout(self.embedding(input))

        query = hidden.permute(1, 0, 2)
        context, attn_weights = self.attention(query, encoder_outputs)
        input_gru = torch.cat((embedded, context), dim=2)

        output, hidden = self.gru(input_gru, hidden)
        output = self.out(output)

        return output, hidden, attn_weights

In [21]:
# Set vocabulary sizes, padding indices, and hyperparameters, then build the model
num_embeddings_A=len(w2i_A)
padding_idx_A=w2i_A['<pad>']


num_embeddings_B=len(w2i_B)
padding_idx_B=w2i_B['<pad>']

hidden_size= 256
dropout_rate=0.1

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model= Seq2Seq(num_embeddings_A, num_embeddings_B ,hidden_size,padding_idx_A, padding_idx_B,dropout_rate).to(device)
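
### 4. Train the Model

The lab outline lists training as step 4, but no training cell appears in this section. The block below is a minimal sketch of one teacher-forced training step, not the original training code. It assumes a model whose `forward(src, tgt)` returns `(logits, attentions)` like `Seq2Seq` above. The names `train_step` and `TinyStandIn`, the Adam optimizer, the learning rate, and the gradient clipping value are all illustrative assumptions.

```python
import torch
import torch.nn as nn

PAD_IDX = 0  # same id as '<pad>' in the vocabularies built earlier

def train_step(model, optimizer, criterion, src, tgt):
    """Run one teacher-forced update and return the batch loss as a float."""
    model.train()
    optimizer.zero_grad()
    # Decoder input is tgt without its last token; labels are tgt shifted left by one
    logits, _ = model(src, tgt[:, :-1])
    loss = criterion(logits.reshape(-1, logits.size(-1)), tgt[:, 1:].reshape(-1))
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # assumed value
    optimizer.step()
    return loss.item()

class TinyStandIn(nn.Module):
    """Hypothetical stand-in with the same (src, tgt) -> (logits, attn) interface."""
    def __init__(self, vocab_size=10, hidden_size=8):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size, padding_idx=PAD_IDX)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, src, tgt):
        return self.out(self.embedding(tgt)), None

torch.manual_seed(0)
toy = TinyStandIn()
optimizer = torch.optim.Adam(toy.parameters(), lr=1e-2)
# ignore_index keeps padded label positions out of the loss
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)

# Two toy padded batches: 1 = <bos>, 2 = <eos>, 0 = <pad>
src = torch.tensor([[1, 4, 5, 2], [1, 6, 2, 0]])
tgt = torch.tensor([[1, 7, 8, 2], [1, 9, 2, 0]])

losses = [train_step(toy, optimizer, criterion, src, tgt) for _ in range(50)]
print(losses[0], losses[-1])
```

With the notebook's own objects, the same function could be called once per batch, for example `train_step(model, optimizer, criterion, src.to(device), tgt.to(device))` for each `src, tgt, _, _` yielded by `train_dataloader`, with `criterion` using `ignore_index=padding_idx_B`.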


In [None]:
# https://drive.google.com/file/d/1p7bsrla6DGgtJNOAuqOmAn6FQ5YgrHDH/view?usp=sharing

In [26]:

torch.save(model.state_dict(), 'pretrained_mt.pth')

### 5. Evaluation with BLEU Score

### BLEU Score Explanation

The **BLEU (Bilingual Evaluation Understudy)** score is a metric for evaluating the quality of text that has been machine-translated compared to a reference translation. It is based on the following key components:

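1. **N-gram Precision**: BLEU computes the modified precision of n-grams (sequences of $n$ words): the fraction of candidate n-grams that also appear in the reference, with each n-gram's count clipped to its count in the reference so that repeated words are not over-credited.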

2. **Brevity Penalty (BP)**: A penalty is applied if the candidate translation is shorter than the reference translation, to discourage overly short translations.

The BLEU score formula is as follows:

$$
\text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^N w_n \cdot \log p_n\right)
$$

Where:  
- $p_n$ is the precision for n-grams of size $n$.  
- $w_n$ is the weight for n-grams of size $n$, typically $w_n = \frac{1}{N}$ (equal weights).  
- $N$ is the maximum size of n-grams considered (e.g., 4 for BLEU-4).  
- $\text{BP}$ is the brevity penalty, defined as:  

$$
\text{BP} =
\begin{cases}
1, & \text{if } c > r \\
e^{1 - \frac{r}{c}}, & \text{if } c \leq r
\end{cases}
$$

Where:  
- $c$ is the length of the candidate translation.  
- $r$ is the length of the reference translation.
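
As a worked example of the formula, here is a minimal pure-Python sketch of sentence-level BLEU against a single reference, without smoothing. The function names (`ngrams`, `modified_precision`, `brevity_penalty`, `bleu`) are introduced here for illustration; library implementations typically work at the corpus level and may differ in details such as smoothing and tokenization.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision p_n of the candidate against one reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

def brevity_penalty(c, r):
    """BP = 1 if c > r, else exp(1 - r / c)."""
    if c == 0:
        return 0.0
    return 1.0 if c > r else math.exp(1 - r / c)

def bleu(candidate, reference, max_n=4):
    """BP * exp(sum_n (1/N) * log p_n) with equal weights w_n = 1/N."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined; unsmoothed BLEU is 0 in this case
    log_avg = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(len(candidate), len(reference)) * math.exp(log_avg)

reference = "the cat is on the mat".split()
candidate = "the cat sat on the mat".split()

# p_1 = 5/6, p_2 = 3/5, BP = 1, so BLEU-2 = sqrt(5/6 * 3/5) = sqrt(0.5)
print(round(bleu(candidate, reference, max_n=2), 4))  # 0.7071
# No 4-gram of the candidate appears in the reference, so unsmoothed BLEU-4 is 0
print(bleu(candidate, reference, max_n=4))            # 0.0
```

The second result shows why sentence-level BLEU on short sentences is often computed with smoothing or reported at the corpus level: a single missing n-gram order drives the whole score to zero.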




In [None]:
#from nltk.translate.bleu_score import sentence_bleu