### Machine Translation with Seq2Seq Model in PyTorch

# Machine Translation
Machine translation (MT) refers to the automatic translation of text from one language to another using computational models. Neural Machine Translation (NMT) systems often use Seq2Seq architectures with encoder-decoder structures and attention mechanisms.

## What is Machine Translation?

Machine Translation (MT) is a subfield of Natural Language Processing (NLP) focused on the automatic translation of text or speech from one language to another. Given a source sentence $ X = (x_1, x_2, \dots, x_n) $ in a source language, the goal is to generate a target sentence $ Y = (y_1, y_2, \dots, y_m) $ in a target language that conveys the same meaning.

In mathematical terms, the task can be modeled as finding the most probable translation $ Y^* $:

$$
Y^* = \underset{Y}{\text{argmax}} \; P(Y|X)
$$

Where:
- $ P(Y|X) $ is the conditional probability of the target sequence given the source sequence.

Seq2Seq models break this task into two steps:
1. **Encoding:** The source sentence $ X $ is encoded into a fixed-length context vector $ C $, which represents the meaning of the source sentence.
2. **Decoding:** The target sentence $ Y $ is generated word-by-word based on the context vector $ C $ and the previously generated words.
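
Under the usual autoregressive assumption, the two steps above correspond to a chain-rule factorization of $ P(Y|X) $. The notation $ y_{<t} $ (the target words before position $ t $) is introduced here for illustration and is not part of the original text:

$$
P(Y|X) = \prod_{t=1}^{m} P(y_t \mid y_{<t}, C)
$$

Taking logarithms turns the product into a sum, so training typically minimizes the per-token negative log-likelihood $ -\sum_{t=1}^{m} \log P(y_t \mid y_{<t}, C) $.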




In [91]:
!pip install datasets



In [92]:
!pip install torchmetrics



This lab demonstrates how to:
1. Load a translation dataset from Hugging Face's `datasets` library.
2. Create a PyTorch Dataset and DataLoader.
3. Build a Seq2Seq model with an encoder-decoder architecture.
4. Train the model.
5. Evaluate the model using BLEU score.

### 1. Load Dataset
The **ManyThings English-French dataset** is a parallel corpus widely used for English-French machine translation experiments. It is small and simple, which makes it well suited to beginner-level work. Its key aspects:

**Content:**
   - The dataset consists of parallel sentences, meaning that each English sentence is paired with its corresponding translation in French.
   - The translations are often short, conversational, and simple, making it ideal for introductory machine translation tasks.



In [93]:
from datasets import load_dataset

ds = load_dataset("avitri/eng-fra")

In [94]:
ds

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 135842
    })
})

In [95]:
split_ds=ds["train"].train_test_split(0.1, seed=33)

In [96]:
train_data, test_data= split_ds['train'], split_ds['test']

In [97]:
train_data[0]['text'].split('\t')

['They bought a box of cookies.', 'Ils ont acheté une boîte de biscuits.']

In [98]:
train_tokens_A=[token for sentence_pair in  train_data for token in sentence_pair['text'].split('\t')[0].lower().split()]
train_tokens_B=[token for sentence_pair in  train_data for token in sentence_pair['text'].split('\t')[1].lower().split()]

In [99]:
from collections import Counter

vocab_A=Counter(train_tokens_A)
vocab_B=Counter(train_tokens_B)



In [100]:
w2i_A={k: (i+4) for i, (k,v) in enumerate(vocab_A.items())}
w2i_B={k: (i+4) for i, (k,v) in enumerate(vocab_B.items())}

In [101]:
for i, k in enumerate(['<pad>','<bos>','<eos>','<unk>']):
     w2i_A[k]=i
     w2i_B[k]=i

In [102]:
len(w2i_A),len(w2i_B)

(22603, 37272)
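
The vocabulary cells above can be condensed into two small helpers. This is a minimal sketch on a toy corpus, not the notebook's exact code: `build_vocab`, `encode`, and the `min_freq` parameter are hypothetical names introduced here, and the tokenization (lowercasing, whitespace split) is assumed to match the cells above.

```python
from collections import Counter

# Ids 0-3 are reserved for the special tokens, as in the notebook.
SPECIALS = ['<pad>', '<bos>', '<eos>', '<unk>']

def build_vocab(sentences, min_freq=1):
    """Map each lowercased whitespace token to an integer id."""
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    w2i = {tok: i for i, tok in enumerate(SPECIALS)}
    for tok, freq in counts.items():
        if freq >= min_freq:
            w2i[tok] = len(w2i)
    return w2i

def encode(sentence, w2i):
    """Wrap the sentence in <bos>/<eos> and map unseen tokens to <unk>."""
    unk = w2i['<unk>']
    ids = [w2i.get(tok, unk) for tok in sentence.lower().split()]
    return [w2i['<bos>']] + ids + [w2i['<eos>']]

toy = ["They bought a box of cookies.", "They bought a car."]
w2i = build_vocab(toy)
print(len(w2i))
print(encode("They bought a bike.", w2i))
```

Because the split is on whitespace only, punctuation stays attached to words (`cookies.` and `cookies` would be separate entries), which inflates the vocabulary.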

### 2. Dataset and Dataloader Creation

In [103]:
from torch.utils.data import Dataset
class CustomDataset(Dataset):
    def __init__(self, train_data, w2i_A, w2i_B):
        self.train_data = train_data
        self.w2i_A = w2i_A
        self.i2w_A = {v:k for (k,v) in w2i_A.items()}

        self.w2i_B = w2i_B
        self.i2w_B = {v:k for (k,v) in w2i_B.items()}
    def __len__(self):
        return len(self.train_data)

    def __getitem__(self, idx):
        sample = self.train_data[idx]
        sent_A, sent_B= sample['text'].split('\t')


        # Encode the source sentence; out-of-vocabulary tokens map to <unk>.
        # Use the vocabularies stored on the instance rather than the globals.
        tokens_A=[self.w2i_A['<bos>']]
        for token in sent_A.lower().split():
            tokens_A.append(self.w2i_A.get(token, self.w2i_A['<unk>']))
        tokens_A.append(self.w2i_A['<eos>'])

        # Encode the target sentence the same way
        tokens_B=[self.w2i_B['<bos>']]
        for token in sent_B.lower().split():
            tokens_B.append(self.w2i_B.get(token, self.w2i_B['<unk>']))
        tokens_B.append(self.w2i_B['<eos>'])

        return tokens_A, tokens_B, len(tokens_A) , len(tokens_B)

In [104]:
train_dataset=CustomDataset(train_data, w2i_A, w2i_B)
test_dataset=CustomDataset(test_data, w2i_A, w2i_B)

In [105]:
import torch
from torch.nn.utils.rnn import pad_sequence

def my_collate_fn(batch):
    """
    Custom collate function for the DataLoader to handle variable-length sequences.
    Pads source and target sequences to the maximum length in the batch.

    Args:
        batch (list of tuples): Each element is `(tokens_A, tokens_B, len_A, len_B)`,
            as returned by `CustomDataset.__getitem__`.

    Returns:
        tuple: `(padded_source_ids, padded_target_ids, source_lengths, target_lengths)`.
    """
    # Separate source and target sequences
    source_ids = [torch.tensor(item[0]) for item in batch]
    target_ids = [torch.tensor(item[1]) for item in batch]

    # Unpadded lengths, used later to trim predictions and references
    source_lengths = torch.tensor([item[2] for item in batch])
    target_lengths = torch.tensor([item[3] for item in batch])

    # Pad the sequences to the maximum length in the batch (0 is the <pad> id)
    padded_source_ids = pad_sequence(source_ids, batch_first=True, padding_value=0)
    padded_target_ids = pad_sequence(target_ids, batch_first=True, padding_value=0)

    return padded_source_ids, padded_target_ids, source_lengths, target_lengths

In [106]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(train_dataset, batch_size=64, shuffle=True, collate_fn= my_collate_fn)
# Shuffling is not needed for evaluation
test_dataloader = DataLoader(test_dataset, batch_size=64, shuffle=False, collate_fn= my_collate_fn)
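
To see what the collate function does to a batch, the padding step can be run on two toy sequences. This is a small illustration with made-up token ids; the `mask` tensor is an addition for illustration and is not used elsewhere in the notebook.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD = 0  # id of <pad> in both vocabularies

# Two already-encoded sentences of different lengths (<bos>=1, <eos>=2)
batch = [torch.tensor([1, 7, 8, 2]), torch.tensor([1, 9, 2])]

# Shorter sequences are right-padded with PAD up to the longest one
padded = pad_sequence(batch, batch_first=True, padding_value=PAD)

# A boolean mask recovers which positions hold real tokens
mask = padded != PAD

print(padded.tolist())
print(mask.sum(dim=1).tolist())
```

The row sums of the mask equal the unpadded lengths, which is the same information the collate function returns explicitly.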

### 3. Seq2Seq Model Definition

In [107]:
import torch.nn as nn
class BahdanauAttention(nn.Module):
    def __init__(self, hidden_size):
        super(BahdanauAttention, self).__init__()
        self.Wa = nn.Linear(hidden_size, hidden_size)
        self.Ua = nn.Linear(hidden_size, hidden_size)
        self.Va = nn.Linear(hidden_size, 1)

    def forward(self, query, keys):
        scores = self.Va(torch.tanh(self.Wa(query) + self.Ua(keys)))
        scores = scores.squeeze(2).unsqueeze(1)

        weights = torch.nn.functional.softmax(scores, dim=-1)
        context = torch.bmm(weights, keys)

        return context, weights
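
The attention module above scores every encoder position, including padded ones. A possible variant that excludes padding is sketched below; the class name `MaskedAdditiveAttention` and the `mask` argument are assumptions added for illustration and are not part of the notebook's model or its pretrained weights.

```python
import torch
import torch.nn as nn

class MaskedAdditiveAttention(nn.Module):
    """Additive (Bahdanau-style) attention with an optional padding mask."""
    def __init__(self, hidden_size):
        super().__init__()
        self.Wa = nn.Linear(hidden_size, hidden_size)
        self.Ua = nn.Linear(hidden_size, hidden_size)
        self.Va = nn.Linear(hidden_size, 1)

    def forward(self, query, keys, mask=None):
        # query: (B, 1, H), keys: (B, T, H); the sum broadcasts to (B, T, H)
        scores = self.Va(torch.tanh(self.Wa(query) + self.Ua(keys)))  # (B, T, 1)
        scores = scores.squeeze(2).unsqueeze(1)                       # (B, 1, T)
        if mask is not None:
            # mask: (B, T) bool, True for real tokens; padded positions get -inf
            scores = scores.masked_fill(~mask.unsqueeze(1), float("-inf"))
        weights = torch.softmax(scores, dim=-1)                       # (B, 1, T)
        context = torch.bmm(weights, keys)                            # (B, 1, H)
        return context, weights

torch.manual_seed(0)
attn = MaskedAdditiveAttention(hidden_size=8)
source_ids = torch.tensor([[1, 5, 6, 2], [1, 7, 2, 0]])  # 0 is <pad>
keys = torch.randn(2, 4, 8)    # stand-in for encoder outputs
query = torch.randn(2, 1, 8)   # stand-in for the decoder hidden state
context, weights = attn(query, keys, mask=source_ids != 0)
print(context.shape, weights.shape)
print(weights[1, 0])
```

The padded position in the second sentence receives zero weight, and the remaining weights still sum to one.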



In [108]:
class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size, padding_idx_A, dropout_p=0.1):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size

        self.embedding = nn.Embedding(input_size, hidden_size, padding_idx=padding_idx_A)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.dropout = nn.Dropout(dropout_p)

    def forward(self, input):
        embedded = self.dropout(self.embedding(input))
        output, hidden = self.gru(embedded)
        return output, hidden

In [109]:
class Seq2Seq(nn.Module):
    def __init__(self, num_embeddings_A, num_embeddings_B, hidden_size, padding_idx_A, padding_idx_B , dropout_rate=0.1):
        super(Seq2Seq, self).__init__()

        self.encoder= EncoderRNN(num_embeddings_A, hidden_size, padding_idx_A, dropout_rate)
        self.embedding = nn.Embedding(num_embeddings_B, hidden_size, padding_idx=padding_idx_B)

        self.attention = BahdanauAttention(hidden_size)

        self.gru = nn.GRU(2 * hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, num_embeddings_B)
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, input_tensor, target_tensor, test_phase=False):

        encoder_outputs, encoder_hidden=self.encoder(input_tensor)

        batch_size = input_tensor.size(0)

        decoder_input= torch.ones((batch_size, 1)).long().to(input_tensor.device)

        decoder_hidden = encoder_hidden
        decoder_outputs = []
        attentions = []

        max_length=target_tensor.size(1)

        for i in range(max_length):


            decoder_output, decoder_hidden, attn_weights = self.forward_step(
                decoder_input, decoder_hidden, encoder_outputs
            )
            if test_phase:
                _, top_index= decoder_output.squeeze(1).topk(1)# bs, 1, V
                decoder_input= top_index.detach() #bs, 1
            else:
              decoder_input = target_tensor[:, i].unsqueeze(1)
            decoder_outputs.append(decoder_output)
            attentions.append(attn_weights)


        decoder_outputs = torch.cat(decoder_outputs, dim=1)
        attentions = torch.cat(attentions, dim=1)

        return decoder_outputs, attentions


    def forward_step(self, input, hidden, encoder_outputs):
        embedded =  self.dropout(self.embedding(input))

        query = hidden.permute(1, 0, 2)
        context, attn_weights = self.attention(query, encoder_outputs)
        input_gru = torch.cat((embedded, context), dim=2)

        output, hidden = self.gru(input_gru, hidden)
        output = self.out(output)

        return output, hidden, attn_weights

In [110]:
# Initialize encoder and decoder
num_embeddings_A=len(w2i_A)
padding_idx_A=w2i_A['<pad>']


num_embeddings_B=len(w2i_B)
padding_idx_B=w2i_B['<pad>']

hidden_size= 256
dropout_rate=0.1

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model= Seq2Seq(num_embeddings_A, num_embeddings_B ,hidden_size,padding_idx_A, padding_idx_B,dropout_rate).to(device)



In [111]:
#!gdown 1a3nVKJ8QVUb2dkevAbT6sG8Oi3UPIhSW
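
The next cell restores pretrained weights instead of running step 4 of the outline (training). A minimal sketch of what one training epoch could look like follows. It assumes the model is called as `model(src, tgt)` and returns `(logits, attentions)`, and that the logits at step `t` are compared with target token `t`, which is how the evaluation cell later aligns predictions and references. `train_one_epoch`, `ToyModel`, and the hyperparameters are hypothetical; the toy model only stands in for `Seq2Seq` so that the block runs on its own.

```python
import torch
import torch.nn as nn

def train_one_epoch(model, dataloader, optimizer, pad_idx=0, device="cpu", clip=1.0):
    """One teacher-forced pass over the data; returns the mean batch loss."""
    criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)  # padding does not count
    model.train()
    total, n_batches = 0.0, 0
    for src, tgt, _src_len, _tgt_len in dataloader:
        src, tgt = src.to(device), tgt.to(device)
        optimizer.zero_grad()
        logits, _ = model(src, tgt)                         # (B, T, V)
        loss = criterion(logits.reshape(-1, logits.size(-1)), tgt.reshape(-1))
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), clip)  # guard against exploding gradients
        optimizer.step()
        total += loss.item()
        n_batches += 1
    return total / max(n_batches, 1)

class ToyModel(nn.Module):
    """Hypothetical stand-in with the same (src, tgt) -> (logits, attn) interface."""
    def __init__(self, vocab=12, hidden=16):
        super().__init__()
        self.emb = nn.Embedding(vocab, hidden, padding_idx=0)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, src, tgt):
        # Shift right so position t only sees token t-1 (<bos>=1 at position 0)
        prev = torch.cat([torch.ones_like(tgt[:, :1]), tgt[:, :-1]], dim=1)
        return self.out(self.emb(prev)), None

torch.manual_seed(0)
src = torch.tensor([[1, 4, 5, 2], [1, 6, 2, 0]])
tgt = torch.tensor([[1, 7, 8, 2], [1, 9, 2, 0]])
lengths = torch.tensor([4, 3])
toy_loader = [(src, tgt, lengths, lengths)]  # one batch in the collate function's format

model = ToyModel()
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
losses = [train_one_epoch(model, toy_loader, optimizer) for _ in range(20)]
print(f"first epoch: {losses[0]:.3f}, last epoch: {losses[-1]:.3f}")
```

With the notebook's objects, the same function would be called with the `Seq2Seq` instance and `train_dataloader` in place of the toy stand-ins.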

In [124]:
# map_location keeps loading working on CPU-only machines
model.load_state_dict(torch.load('pretrained_mt.pth', map_location=device))


<All keys matched successfully>

### 5. Evaluation with BLEU Score

### BLEU Score Explanation

The **BLEU (Bilingual Evaluation Understudy)** score is a metric for evaluating the quality of text that has been machine-translated compared to a reference translation. It is based on the following key components:

1. **N-gram Precision**: BLEU computes the modified precision of n-grams (sequences of $n$ words): the fraction of candidate n-grams that also occur in the reference, with each n-gram's count clipped to its count in the reference.

2. **Brevity Penalty (BP)**: A penalty is applied if the candidate translation is shorter than the reference translation, to discourage overly short translations.

The BLEU score formula is as follows:

$$
\text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^N w_n \cdot \log p_n\right)
$$

Where:  
- $p_n$ is the precision for n-grams of size $n$.  
- $w_n$ is the weight for n-grams of size $n$, typically $w_n = \frac{1}{N}$ (equal weights).  
- $N$ is the maximum size of n-grams considered (e.g., 4 for BLEU-4).  
- $\text{BP}$ is the brevity penalty, defined as:  

$$
\text{BP} =
\begin{cases}
1, & \text{if } c > r \\
e^{1 - \frac{r}{c}}, & \text{if } c \leq r
\end{cases}
$$

Where:  
- $c$ is the length of the candidate translation.  
- $r$ is the length of the reference translation.
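
The formula can be checked by hand on a short example. The sketch below is a plain-Python, single-reference implementation written for illustration (the function names are hypothetical); it applies no smoothing, so it is not a drop-in replacement for NLTK's `sentence_bleu`.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision p_n against a single reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total

def brevity_penalty(c, r):
    """BP = 1 if c > r, else exp(1 - r/c)."""
    if c == 0:
        return 0.0
    return 1.0 if c > r else math.exp(1 - r / c)

def bleu(candidate, reference, max_n=4):
    """BLEU with equal weights w_n = 1/max_n and no smoothing."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined, so the score collapses to zero
    log_avg = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(len(candidate), len(reference)) * math.exp(log_avg)

reference = "the cat is on the mat".split()
candidate = "the cat is on mat".split()
p1 = modified_precision(candidate, reference, 1)
p2 = modified_precision(candidate, reference, 2)
bp = brevity_penalty(len(candidate), len(reference))
score = bleu(candidate, reference, max_n=2)
print(p1, p2, round(bp, 4), round(score, 4))
```

Here all five candidate unigrams appear in the reference ($p_1 = 1$), three of the four bigrams do ($p_2 = 0.75$), and the candidate is one word shorter than the reference, so $\text{BP} = e^{1 - 6/5}$.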




In [132]:
def decoded_sent(idxs, i2w, max_length):
  # Keep the first max_length ids and drop the special tokens (ids 0-3)
  idxs=idxs[:  max_length]
  return [i2w[idx] for idx in idxs if idx >=4]

In [136]:
from nltk.translate.bleu_score import sentence_bleu, corpus_bleu

In [147]:
import numpy as np
from tqdm import tqdm

sent_blue=[]
gt=[]
pred=[]
for batch in tqdm(test_dataloader):
    padded_source_ids, padded_target_ids , source_lentghs, target_lentghs=batch

    padded_source_ids, padded_target_ids= padded_source_ids.to(device), padded_target_ids.to(device)

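    # test_phase keeps its default (False), so the decoder is fed the reference tokens
    # (teacher forcing) and the resulting scores are likely optimistic; pass
    # test_phase=True for free-running greedy decoding. Calling model.eval()
    # before the loop would also disable dropout.
    with torch.no_grad():
      decoder_outputs, attentions= model(padded_source_ids, padded_target_ids)
      # argmax over the logits gives the predicted ids; a softmax would not change the result
      decoder_outputs= decoder_outputs.argmax(-1).cpu().numpy()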

    padded_target_ids= padded_target_ids.detach().cpu().numpy()
    target_lentghs = target_lentghs.detach().cpu().numpy()

    for i in range(len(target_lentghs)):
      gt.append(decoded_sent(padded_target_ids[i], train_dataset.i2w_B, target_lentghs[i]))
      pred.append(decoded_sent(decoder_outputs[i], train_dataset.i2w_B, target_lentghs[i]))


      sent_blue.append(sentence_bleu([gt[-1]], pred[-1], weights=(1,0,0,0)))


The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
100%|██████████| 213/213 [00:44<00:00,  4.76it/s]


In [149]:
np.mean(sent_blue)

0.5369103053320133


In [155]:
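filtered_idxs=[i for i, sent in enumerate(gt) if len(sent)>10]

In [158]:
# corpus_bleu expects, for each hypothesis, a list of reference token lists,
# so each single reference is wrapped in its own list
corpus_bleu([[gt[i]] for i in filtered_idxs], [pred[i] for i in filtered_idxs], weights=(1,0,0,0))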

In [183]:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("translation", model="hopkins/mbart-finetuned-eng-fra", src_lang='fr', tgt_lang="eng")

Device set to use cuda:0


In [186]:
text= train_data[1]['text']

In [187]:
text.split('\t')

['Tom signed the documents.', 'Tom signa les documents.']

In [188]:
source, target = text.split('\t')

In [189]:
target

'Tom signa les documents.'

In [195]:
pipe(target)[0]['translation_text'].split()[1:]

['Tom', 'signed', 'the', 'documents.']

In [196]:
source.split()

['Tom', 'signed', 'the', 'documents.']


In [198]:
sent_blue=[]
for sample in tqdm(test_data):
  # Each record is "English<TAB>French". The pipeline translates French to English,
  # so the French side is the input and the English side is the reference.
  target, source = sample['text'].split('\t')

  # [1:] drops the first token of the pipeline output, as in the single-sentence check above
  score= sentence_bleu([target.split()], pipe(source)[0]['translation_text'].split()[1:])
  sent_blue.append(score)

print(np.mean(sent_blue))
