## Assigment 3: Transformers for translation 🙊


Have you ever wondered how applications like Google Translate or language translation features in social media platforms work? Behind these impressive technologies are sophisticated machine learning models that can understand and translate text between different languages. One of the most powerful and groundbreaking models used for this purpose is the Transformer model.

In this assignment, you will step into the shoes of an AI researcher and engineer to create your own Transformer model for translating text from English to French. This journey will not only enhance your understanding of machine learning and deep learning but also give you hands-on experience with state-of-the-art techniques in natural language processing.

Let's start by downloading important libraries

In [None]:
!pip install datasets
!pip install evaluate
!pip install transformers
!pip install bert_score
!pip install rouge_score



For this assignment we are using the IWSLT2017 dataset (read more about it [here](https://huggingface.co/datasets/IWSLT/iwslt2017) ). This dataset easily found in Huggingface fits perfectly for our machine translation task.

In [1]:
from datasets import load_dataset
from datasets import load_dataset
import re
import string
from nltk.corpus import stopwords
import nltk
from transformers import T5ForConditionalGeneration, T5Tokenizer
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoTokenizer
from torch.amp import GradScaler, autocast

nltk.download('stopwords')
dataset = load_dataset("IWSLT/iwslt2017",'iwslt2017-en-fr')

  from .autonotebook import tqdm as notebook_tqdm
The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.
0it [00:00, ?it/s]


ValueError: The repository for IWSLT/iwslt2017 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/IWSLT/iwslt2017.
Please pass the argument `trust_remote_code=True` to allow custom code to be run.

Just to have an idea let's have a quick peak at what our dataset looks like.

In [None]:
dataset['train']['translation'][0]

{'en': "Thank you so much, Chris. And it's truly a great honor to have the opportunity to come to this stage twice; I'm extremely grateful.",
 'fr': "Merci beaucoup, Chris. C'est vraiment un honneur de pouvoir venir sur cette scène une deuxième fois. Je suis très reconnaissant."}

Since we don't want to take 8 hours training, let's trim our dataset a bit (although this might lead to underperformance, feel free to use the complete dataset if you have the computing power).

SUGESTION: start with a small dataset to debug your code and increase it gradually (the same principle applies for the number of epochs, batch size, test set size...).

In [None]:
trim_dataset= dataset['train']['translation'][:100000]

### Preprocessing


Same as our previous assignments preprocessing is an essential part of any NLP task.

In [None]:
import string
def preprocess_data(text):
    text = text.lower()
    text = text.replace('\n', ' ')
    text = re.sub(r'[^\w\s]', ' ', text)
    text = ' '.join([word for word in text.split(" ") if word.isalpha()])
    stop_words = set(stopwords.words('english'))
    text = ' '.join([word for word in text.split() if word not in stop_words])
    return text

For an easier training structure, it is useful to format our training and validation sets. The following function should help with this.

In [None]:
def create_dataset(dataset,source_lang,target_lang):
    new_dataset = []
    for example in dataset:
        source_text = example.get(source_lang, "")
        target_text = example.get(target_lang, "")
        clean_source = preprocess_data(source_text)
        clean_target = preprocess_data(target_text)
        new_dataset.append((clean_source, clean_target))
    return new_dataset

training_set=create_dataset(trim_dataset,'en','fr')
validation_set=create_dataset(dataset['validation']['translation'],'en','fr')
test_set=create_dataset(dataset['test']['translation'],'en','fr')

### Model Creation


Now that our data is ready, we can get started. Let's start by creating our Sequence to Sequence Transformer model.

In [None]:
class TransformerModel(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, nhead, num_encoder_layers, num_decoder_layers, dim_feedforward, dropout):
        super(TransformerModel, self).__init__()
        self.src_embedding = nn.Embedding(src_vocab_size, d_model)
        self.tgt_embedding = nn.Embedding(tgt_vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model,
            nhead=nhead,
            num_encoder_layers=num_encoder_layers,
            num_decoder_layers=num_decoder_layers,
            dim_feedforward=dim_feedforward,
            dropout=dropout,
            batch_first=True
        )
        self.fc = nn.Linear(d_model, tgt_vocab_size)

    def positional_encoding(self, d_model, maxlen=5000):
        pos = torch.arange(0, maxlen).unsqueeze(1)
        denominator = 10000 ** (torch.arange(0, d_model, 2) / d_model)

        PE = torch.zeros((maxlen, d_model))
        PE[:, 0::2] = torch.sin(pos / denominator)
        PE[:, 1::2] = torch.cos(pos / denominator)

        return PE.unsqueeze(0)

    def forward(self, src, tgt, src_mask=None, tgt_mask=None, src_key_padding_mask=None, tgt_key_padding_mask=None):
        src = self.src_embedding(src)
        tgt = self.tgt_embedding(tgt)
        positional_encoding = self.positional_encoding(d_model=src.shape[2]).to(src.device)
        src_emb = src + positional_encoding[:, :src.shape[1], :]
        tgt_emb = tgt + positional_encoding[:, :tgt.shape[1], :]
        output = self.transformer(
            src_emb, tgt_emb,
            src_mask, tgt_mask,
            None,
            src_key_padding_mask, tgt_key_padding_mask,
            src_key_padding_mask
        )
        return self.fc(output)

    def encode(self, src, src_mask):
        src = self.src_embedding(src)
        positional_encoding = self.positional_encoding(d_model=src.shape[2]).to(src.device)
        src_emb = src + positional_encoding[:, :src.shape[1], :]
        return self.transformer.encoder(src_emb, src_mask)

    def decode(self, tgt, memory, tgt_mask):
        tgt = self.tgt_embedding(tgt)
        positional_encoding = self.positional_encoding(d_model=tgt.shape[2]).to(tgt.device)
        tgt_emb = tgt + positional_encoding[:, :tgt.shape[1], :]
        return self.transformer.decoder(tgt_emb, memory, tgt_mask)

Now that our model is ready, we still need some methods that will come in handy during training.

In [None]:
def create_padding_mask(seq):
    return (seq == 0).float()

def create_triu_mask(sz):
    mask = torch.triu(torch.ones(sz, sz), diagonal=1).transpose(0, 1).float()
    mask = mask.masked_fill(mask == 1, float('-inf')).masked_fill(mask == 0, float(0.0))
    return torch.flip(mask, dims=(0, 1))

def tokenize_batch(source, targets, tokenizer):
    tokenized_source = tokenizer(source, padding='max_length', max_length=120, truncation=True, return_tensors='pt')
    tokenized_targets = tokenizer(targets, padding='max_length', max_length=120, truncation=True, return_tensors='pt')
    return tokenized_source['input_ids'], tokenized_targets['input_ids']


### Training


In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

tokenizer=AutoTokenizer.from_pretrained('FacebookAI/xlm-roberta-base')
PAD_IDX = tokenizer.pad_token_id
BOS_IDX = tokenizer.bos_token_id
EOS_IDX = tokenizer.eos_token_id

model = TransformerModel(tokenizer.vocab_size, tokenizer.vocab_size,512, 8, 3, 3, 256,0.1).to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)
loss_function = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)

train_loader = torch.utils.data.DataLoader(training_set, batch_size=64, shuffle=True)
validation_loader = torch.utils.data.DataLoader(validation_set, batch_size=64, shuffle=False)

NameError: name 'torch' is not defined

In [None]:
from torch.utils.data import DataLoader
from torch.amp import autocast
from tqdm import tqdm
def train_epoch(model, train_loader, tokenizer, scaler, accumulation_steps=4):
    model.train()
    losses = 0
    optimizer.zero_grad()

    for batch_idx, (src, tgt) in enumerate(tqdm(train_loader)):
        src, tgt = tokenize_batch(src, tgt, tokenizer)
        src = src.to(device)
        tgt = tgt.to(device)

        tgt_input = tgt[:, :-1]
        tgt_out = tgt[:, 1:]

        src_mask = torch.zeros((src.size(1), src.size(1)), device=device)
        tgt_mask = create_triu_mask(tgt_input.size(1)).to(device)
        src_padding_mask = create_padding_mask(src).to(device)
        tgt_padding_mask = create_padding_mask(tgt_input).to(device)
        with autocast(device_type='cuda'):
            logits = model(
                src, tgt_input,
                src_mask=src_mask, tgt_mask=tgt_mask,
                src_key_padding_mask=src_padding_mask, tgt_key_padding_mask=tgt_padding_mask
            )
            loss = loss_function(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
            loss = loss / accumulation_steps
        scaler.scale(loss).backward()
        if (batch_idx + 1) % accumulation_steps == 0:
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()

        losses += loss.item()

    return losses / len(list(train_loader))

def evaluate(model, val_dataloader):
    model.eval()
    losses = 0
    with torch.no_grad():
        for src, tgt in tqdm(val_dataloader):
            src, tgt = tokenize_batch(src, tgt, tokenizer)
            src = src.to(device)
            tgt = tgt.to(device)
            tgt_input = tgt[:, :-1]
            src_mask = torch.zeros((src.size(1), src.size(1)), device=device)
            tgt_mask = create_triu_mask(tgt_input.size(1)).to(device)
            src_padding_mask = create_padding_mask(src).to(device)
            tgt_padding_mask = create_padding_mask(tgt_input).to(device)
            logits = model(
                src, tgt_input,
                src_mask=src_mask, tgt_mask=tgt_mask,
                src_key_padding_mask=src_padding_mask, tgt_key_padding_mask=tgt_padding_mask
            )
            tgt_out = tgt[:, 1:]
            loss = loss_function(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
            losses += loss.item()
    return losses / len(list(val_dataloader))

Now we can start training! Keep in mind this code is very demanding computationally, it has been set to 10 epochs (which can take up to 6-8 hours) but feel free to change this value depending on your resources, in this case the more epochs you can execute the better 😀

In [None]:
def train(model, epochs, train_loader, validation_loader):
    scaler = GradScaler()
    for epoch in range(1, epochs + 1):
        train_loss = train_epoch(model, train_loader, tokenizer, scaler)
        val_loss = evaluate(model, validation_loader)
        print(f"Epoch: {epoch}, Train loss: {train_loss:.3f}, Val loss: {val_loss:.3f}")
train(model, 1, train_loader,validation_loader)

100%|██████████| 1563/1563 [06:19<00:00,  4.11it/s]
100%|██████████| 14/14 [00:01<00:00, 10.48it/s]


Epoch: 1, Train loss: 5.813, Val loss: 5.280


100%|██████████| 1563/1563 [06:19<00:00,  4.12it/s]
100%|██████████| 14/14 [00:01<00:00, 11.09it/s]


Epoch: 2, Train loss: 4.576, Val loss: 4.697


100%|██████████| 1563/1563 [06:15<00:00,  4.17it/s]
100%|██████████| 14/14 [00:01<00:00, 11.15it/s]


Epoch: 3, Train loss: 4.049, Val loss: 4.437


100%|██████████| 1563/1563 [06:14<00:00,  4.18it/s]
100%|██████████| 14/14 [00:01<00:00, 10.86it/s]


Epoch: 4, Train loss: 3.686, Val loss: 4.251


100%|██████████| 1563/1563 [06:18<00:00,  4.13it/s]
100%|██████████| 14/14 [00:01<00:00, 10.90it/s]


Epoch: 5, Train loss: 3.400, Val loss: 4.074


100%|██████████| 1563/1563 [06:17<00:00,  4.14it/s]
100%|██████████| 14/14 [00:01<00:00, 10.96it/s]


Epoch: 6, Train loss: 3.160, Val loss: 3.924


100%|██████████| 1563/1563 [06:14<00:00,  4.18it/s]
100%|██████████| 14/14 [00:01<00:00, 10.66it/s]


Epoch: 7, Train loss: 2.960, Val loss: 3.832


100%|██████████| 1563/1563 [06:19<00:00,  4.11it/s]
100%|██████████| 14/14 [00:01<00:00, 10.69it/s]


Epoch: 8, Train loss: 2.788, Val loss: 3.681


100%|██████████| 1563/1563 [06:18<00:00,  4.13it/s]
100%|██████████| 14/14 [00:01<00:00, 10.98it/s]


Epoch: 9, Train loss: 2.640, Val loss: 3.635


100%|██████████| 1563/1563 [06:14<00:00,  4.18it/s]
100%|██████████| 14/14 [00:01<00:00, 10.64it/s]

Epoch: 10, Train loss: 2.513, Val loss: 3.568





### Testing


In this assignment, we will use three different evaluation metrics to see our model's test performance: [Bert Score](https://huggingface.co/spaces/evaluate-metric/bertscore), [Meteor](https://huggingface.co/spaces/evaluate-metric/meteor) and [Rouge](https://huggingface.co/spaces/evaluate-metric/rouge). Please access their hugging face documentation to know how to implement them.

In [None]:
from evaluate import load
bertscore = load("bertscore")
rouge = load('rouge')
meteor = load('meteor')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


Implement greedy decode as seen in class in the NLG slides.

In [None]:
from collections import defaultdict
def greedy_decode(model, src, src_mask, max_len, start_symbol, repetition_penalty=1.5, top_k=10, max_repetitions=5):
    src = src.to(device)
    src_mask = src_mask.to(device)
    memory = model.encode(src, src_mask)
    ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(device)
    repetition_counter = defaultdict(int)
    for i in range(max_len - 1):
        tgt_mask = create_triu_mask(ys.size(1)).to(device)
        out = model.decode(ys, memory, tgt_mask)
        logits = model.fc(out[:, -1])
        for token_id, count in repetition_counter.items():
            if count > 0:
                logits[0, token_id] /= (repetition_penalty ** count)
        topk_prob, topk_indices = torch.topk(logits, top_k, dim=-1)
        next_word_index = torch.multinomial(torch.nn.functional.softmax(topk_prob, dim=-1), 1).item()
        next_word = topk_indices[0, next_word_index].item()
        if next_word == EOS_IDX:
            break
        repetition_counter[next_word] += 1
        ys = torch.cat([ys, torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=1)
        if repetition_counter[next_word] >= max_repetitions:
            break
    return ys

def translate(model: torch.nn.Module, src_sentence: str, tokenizer):
    model.eval()
    src, _ = tokenize_batch(src_sentence, "", tokenizer)
    src = src.to(device)
    num_tokens = src.shape[1]
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.float).to(device)
    tgt_tokens = greedy_decode(
        model,  src, src_mask, max_len= int(num_tokens * 1.2 ), start_symbol=BOS_IDX).flatten()
    return tokenizer.decode(tgt_tokens, skip_special_tokens=True)

In [None]:
print(translate(model, "Hello how are you today",tokenizer))
print(translate(model, "Hi",tokenizer))
print(translate(model, "I live in Montreal",tokenizer))
print(translate(model, "There are two cats in the room",tokenizer))

comment vous avez comment


In [None]:
import numpy as np
def test(test_loader, model, tokenizer, device, max_length=200):
    precision = 0
    recall = 0
    f1 = 0
    meteor_metric = 0
    for src, target in test_loader:
        src_tensor, _ = tokenize_batch([src], [""], tokenizer)
        src_tensor = src_tensor.to(device)
        translated_output = translate(model, tokenizer.decode(src_tensor[0]), tokenizer)
        target_sentence = target[0]
        bert_results = bertscore.compute(predictions=[translated_output], references=[target_sentence], lang='fr')
        meteor_results = meteor.compute(predictions=[translated_output], references=[target_sentence])
        precision += bert_results['precision'][0]
        recall += bert_results['recall'][0]
        f1 += bert_results['f1'][0]
        meteor_metric += meteor_results['meteor']
    num_samples = len(test_loader)
    return precision / num_samples, recall / num_samples, f1 / num_samples, meteor_metric / num_samples

test(test_set, model, tokenizer, device)

tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/625 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/996k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.96M [00:00<?, ?B/s]



model.safetensors:   0%|          | 0.00/714M [00:00<?, ?B/s]

(0.5954862767920406,
 0.7134637086422232,
 0.6467910342276394,
 0.30765436145334296)

## Let's experiment!

1. Play with a hyperparameter of your choice to measure its effect on the translation.

2. Compare the results of your model with the performance of using the T5 pretrained model. This [tutorial](https://huggingface.co/docs/transformers/en/tasks/translation) on using T5 for machine translation might come in handy.

## Code of T5

In [1]:
import torch
import os
os.environ["WANDB_DISABLED"] = "true"

In [None]:
nltk.download('wordnet')
nltk.download('omw-1.4')

In [None]:
def preprocess_t5_input_for_training(dataset, source_lang="English", target_lang="French"):
    inputs = [f"translate {source_lang} to {target_lang}: {src}" for src, tgt in dataset]
    targets = [tgt for src, tgt in dataset]
    return inputs, targets

train_inputs, train_targets = preprocess_t5_input_for_training(training_set)
val_inputs, val_targets = preprocess_t5_input_for_training(validation_set)
test_inputs, test_targets = preprocess_t5_input_for_training(test_set)


In [None]:
from transformers import T5ForConditionalGeneration, T5Tokenizer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_name = "t5-small"
t5_model = T5ForConditionalGeneration.from_pretrained(model_name).to(device)
t5_tokenizer = T5Tokenizer.from_pretrained(model_name)

In [None]:
def translate_with_t5(t5_model, t5_tokenizer, sentences, device, max_length=200):
    inputs = t5_tokenizer(sentences, return_tensors="pt", padding=True, truncation=True, max_length=max_length).to(device)
    outputs = t5_model.generate(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask, max_length=max_length)
    return t5_tokenizer.batch_decode(outputs, skip_special_tokens=True)

In [None]:
from bert_score import score as bert_score
from nltk.translate.meteor_score import single_meteor_score
def evaluate_translation_model(model, tokenizer, test_sentences, reference_sentences, device, is_t5=False, max_length=200):
    generated_translations, meteor_metric = [], 0
    for src_sentence in test_sentences:
        if is_t5:
            inputs = tokenizer(src_sentence, return_tensors="pt", truncation=True, padding=True, max_length=max_length).to(device)
            outputs = model.generate(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask, max_length=max_length)
            translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
        else:
            translation = translate(model, src_sentence, tokenizer)
        generated_translations.append(translation)
    P, R, F1 = bert_score(generated_translations, reference_sentences, lang="en")
    precision = P.mean().item()
    recall = R.mean().item()
    f1 = F1.mean().item()
    for ref, hyp in zip(reference_sentences, generated_translations):
        meteor_metric += single_meteor_score(ref.split(), hyp.split())
    return generated_translations, precision, recall, f1, meteor_metric / len(reference_sentences)
test_src_sentences = [src for src, _ in test_set[:10]]
test_ref_sentences = [tgt for _, tgt in test_set[:10]]
generated_translations, precision, recall, f1, meteor_metric = evaluate_translation_model(
    model=t5_model,
    tokenizer=t5_tokenizer,
    test_sentences=test_src_sentences,
    reference_sentences=test_ref_sentences,
    device=device,
    is_t5=True
)
print("Sample Translations:")
for i in range(5):
    print(f"Source: {test_src_sentences[i]}")
    print(f"Generated Translation: {generated_translations[i]}")
    print(f"Reference Translation: {test_ref_sentences[i]}\n")
print(f"T5 Model - Precision: {precision:.4f}, Recall: {recall:.4f}, F1: {f1:.4f}, METEOR: {meteor_metric:.4f}")


In [None]:
from bert_score import score as bert_score
from evaluate import load as load_metric
from nltk.translate.meteor_score import single_meteor_score
rouge = load_metric('rouge')
def evaluate_translation_model(model, tokenizer, test_sentences, reference_sentences, device, is_t5=False, max_length=200):
    generated_translations = []
    meteor_metric = 0
    for src_sentence in test_sentences:
        if is_t5:
            inputs = tokenizer(src_sentence, return_tensors="pt", truncation=True, padding=True, max_length=max_length).to(device)
            outputs = model.generate(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask, max_length=max_length)
            translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
        else:
            translation = translate(model, src_sentence, tokenizer)
        generated_translations.append(translation)
    P, R, F1 = bert_score(generated_translations, reference_sentences, lang="en")
    precision = P.mean().item()
    recall = R.mean().item()
    f1 = F1.mean().item()
    rouge_scores = rouge.compute(predictions=generated_translations, references=reference_sentences)
    for ref, hyp in zip(reference_sentences, generated_translations):
        meteor_metric += single_meteor_score(ref.split(), hyp.split())
    return generated_translations, precision, recall, f1, meteor_metric / len(reference_sentences), rouge_scores
test_src_sentences = [src for src, _ in test_set[:10]]
test_ref_sentences = [tgt for _, tgt in test_set[:10]]
generated_translations, precision, recall, f1, meteor_metric, rouge_scores = evaluate_translation_model(
    model=t5_model,
    tokenizer=t5_tokenizer,
    test_sentences=test_src_sentences,
    reference_sentences=test_ref_sentences,
    device=device,
    is_t5=True
)
print("Sample Translations:")
for i in range(5):
    print(f"Source: {test_src_sentences[i]}")
    print(f"Generated Translation: {generated_translations[i]}")
    print(f"Reference Translation: {test_ref_sentences[i]}\n")
print(f"T5 Model - Precision (BERTScore): {precision:.4f}, Recall (BERTScore): {recall:.4f}, F1 (BERTScore): {f1:.4f}, METEOR: {meteor_metric:.4f}")
print(f"ROUGE Scores: {rouge_scores}")

In [None]:
avg_f1_score = (rouge_scores['rouge1'] + rouge_scores['rouge2'] + rouge_scores['rougeL']) / 3
print(f"Average ROUGE F1 Score: {avg_f1_score:.4f}")
print(rouge_scores)

In [None]:
avg_f1_score = (rouge_scores['rouge1'] + rouge_scores['rouge2'] + rouge_scores['rougeL'] + rouge_scores['rougeLsum']) / 4
print(f"Average ROUGE F1 Score (including ROUGE-Lsum): {avg_f1_score:.4f}")