<a href="https://colab.research.google.com/github/MUmairAB/English-to-French-Translation-Model-using-HuggingFace-Transformers/blob/main/English_to_French_Translation_using_HuggingFace_Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Translation

Translation is one of the most popular NLP tasks. It is a sequence-to-sequence problem, meaning it can be formulated as mapping one sequence to another. Models of this kind can also be applied to related problems:


- **Style transfer**: Creating a model that translates texts written in a certain style to another (e.g., formal English to casual English; formal English to Shakespearean English)

- **Generative question answering**: Creating a model that generates answers to questions, given a context.

In this project, we'll fine-tune a Transformers model for **English to French** translation. For this task, we'll use the [KDE4 dataset](https://huggingface.co/datasets/kde4) from HuggingFace.

In [None]:
#Install the transformers library
!pip install transformers

In [None]:
#Declare a seed value for better reproducibility
SEED = 4243

## Dataset

We'll use the [KDE4 dataset](https://huggingface.co/datasets/kde4) available on HuggingFace. We are fine-tuning an English to French translation model, so we'll download that language pair. To fine-tune the model for a different language pair, change the **lang1** and **lang2** parameters below.

In [None]:
#Install the datasets library
!pip install datasets

Collecting datasets
  Downloading datasets-2.14.6-py3-none-any.whl (493 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
Installing collected packages: dill, multiprocess, datasets
Successfully installed datasets-2.14.6 dill-0.3.7 multiprocess-0.70.15


In [None]:
#Load the dataset from HuggingFace for English to French translation
from datasets import load_dataset

dataset = load_dataset(path="kde4",
                       lang1="en",
                       lang2="fr"
                      )

Downloading builder script:   0%|          | 0.00/4.25k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/8.45k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/5.10k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/7.05M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/210173 [00:00<?, ? examples/s]

In [None]:
#Let's view the data fields
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 210173
    })
})

In [None]:
#Let's have a look at the data
i = 10
dataset["train"][i]

{'id': '10', 'translation': {'en': 'translate', 'fr': 'traduction'}}

In [None]:
#Let's inspect the data further

#Shuffle the dataset using seed and then print 5 random samples
random_samples = dataset["train"].shuffle(seed=SEED).select(range(5))
for sample in random_samples:
    print(sample)
    print("\n\t\t\t\t%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%\n")

{'id': '29209', 'translation': {'en': 'At the beginning of each game all cards are mixed in the deck. In some games not all cards are dealt out. The remaining cards are put down on the so-called talon. You can find this quite easily, since in most games it is the only pile showing the reverse.', 'fr': "Au début de chaque partie, toutes les cartes sont mélangées dans le paquet. Dans certains jeux, certaines cartes ne sont pas distribuées. Ces cartes se retrouvent dans ce qu'on appelle le talon, que l'on reconnaît facilement au fait que, dans la plupart des jeux, c'est le seul tas de cartes vues de dos."}}

				%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

{'id': '33602', 'translation': {'en': 'Gadu-Gadu', 'fr': 'Gadu-Gadu'}}

				%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

{'id': '106740', 'translation': {'en': 'Change the color of the numbers', 'fr': 'Modifier la couleur des nombres'}}

				%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

{'id': '69569', 'translation': {'en': 'Leave Channel', 'fr': 'Quitter le canal

**We can see that the dataset has only a train split, with no validation or test split. So we need to create one ourselves using the "train_test_split()" method.**

In [None]:
split_dataset = dataset["train"].train_test_split(train_size=0.9,
                                         #Use a different seed value
                                         seed=int(SEED/2))
#View the DatasetDict object
split_dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 189155
    })
    test: Dataset({
        features: ['id', 'translation'],
        num_rows: 21018
    })
})
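
The row counts above follow from the split fraction. As a rough, library-free sketch (the helper `split_indices` is hypothetical, it does not reproduce the exact shuffling of `train_test_split()`, and it assumes the training size is rounded down, which matches the counts printed above), a seeded 90/10 split can be pictured like this:

```python
import math
import random


def split_indices(n_rows, train_size, seed):
    """Shuffle row indices with a fixed seed and cut them into two parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Assumption: the number of training rows is rounded down
    n_train = math.floor(train_size * n_rows)
    return indices[:n_train], indices[n_train:]


# Same row count, fraction and seed (4243 halved) as in the cell above
train_idx, val_idx = split_indices(n_rows=210173, train_size=0.9, seed=4243 // 2)
print(len(train_idx), len(val_idx))
```

Because the seed is fixed, calling the helper again returns the same two index lists, which is the point of passing a seed to the real method as well.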

In [None]:
#Let's rename the "test" key of the "split_dataset" as "validation"
split_dataset["validation"] = split_dataset.pop("test")
split_dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 189155
    })
    validation: Dataset({
        features: ['id', 'translation'],
        num_rows: 21018
    })
})

## Model

The model that we'll fine-tune is [Helsinki-NLP's English to French Translator](https://huggingface.co/Helsinki-NLP/opus-mt-en-fr?text=My+name+is+Sarah+and+I+live+in+London).

This model's tokenizer needs **sentencepiece** and **sacremoses**. So, we'll need to install them first.

In [None]:
#Install the sentencepiece library
!pip install sentencepiece

Collecting sentencepiece
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.99


In [None]:
#Install the sacremoses library
!pip install sacremoses

Collecting sacremoses
  Downloading sacremoses-0.1.1-py3-none-any.whl (897 kB)
Installing collected packages: sacremoses
Successfully installed sacremoses-0.1.1


In [None]:
#Instantiate the model
from transformers import pipeline

checkpoint = "Helsinki-NLP/opus-mt-en-fr"
translator = pipeline(task="translation",
                      model=checkpoint)

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.42k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/301M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/293 [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/42.0 [00:00<?, ?B/s]

Downloading (…)olve/main/source.spm:   0%|          | 0.00/778k [00:00<?, ?B/s]

Downloading (…)olve/main/target.spm:   0%|          | 0.00/802k [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/1.34M [00:00<?, ?B/s]

In [None]:
#Let's test the model
sample_text = "How are you doing today?"
translator(sample_text)

[{'translation_text': "Comment allez-vous aujourd'hui ?"}]

## Data pre-processing

This involves preparing the data for the model. The steps involved are:

1. Preparing DatasetDict object,
2. Tokenizing

In [None]:
#Instantiate the tokenizer
from transformers import AutoTokenizer

#The tensor format is chosen later (e.g. in the data collator), not when loading
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

**Before applying the tokenizer to the whole dataset, let's see how it works on a sample text.**

In [None]:
split_dataset["train"][1]

{'id': '178111',
 'translation': {'en': 'Cloudy weather', 'fr': 'Temps nuageux'}}

In [None]:
#Extract the English and French sentences for one sample
i = 20
en_sentence = split_dataset["train"][i]["translation"]["en"]
fr_sentence = split_dataset["train"][i]["translation"]["fr"]
print("English:",en_sentence)
print("French:",fr_sentence)

English: Insert new cell(s) at selected location, moving existing cell(s) to make room.
French: Insérer de nouvelles cellules à l'emplacement sélectionné, déplaçant les cellules existantes pour faire de la place.


In [None]:
#Apply the tokenizer
model_input = tokenizer(en_sentence,
                        #If you are doing translation in some other language,
                        # then change the following value accordingly
                        text_target=fr_sentence)
model_input

{'input_ids': [24849, 191, 6742, 401, 9, 28, 71, 3819, 2014, 2, 6383, 1462, 6742, 401, 9, 28, 12, 399, 1478, 3, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'labels': [45013, 5, 828, 6758, 17, 14, 6, 10810, 18524, 2, 45501, 16, 6758, 6737, 27, 183, 5, 8, 245, 3, 0]}

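The resultant dictionary contains:
- input_ids: the token IDs of the source (English) sentence
- attention_mask: 1 for every real token (padding positions, if any, would be 0)
- labels: the token IDs of the target (French) sentence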

In [None]:
#Let's convert these tokens back to words to
# check how the tokenizer performed

#English (source language) word tokens
print("English Tokens:",tokenizer.convert_ids_to_tokens(model_input["input_ids"]))

#French (target language) word tokens
print("French Tokens:",tokenizer.convert_ids_to_tokens(model_input["labels"]))

English Tokens: ['▁Insert', '▁new', '▁cell', '(', 's', ')', '▁at', '▁selected', '▁location', ',', '▁moving', '▁existing', '▁cell', '(', 's', ')', '▁to', '▁make', '▁room', '.', '</s>']
French Tokens: ['▁Insérer', '▁de', '▁nouvelles', '▁cellules', '▁à', '▁l', "'", 'emplacement', '▁sélectionné', ',', '▁déplaçant', '▁les', '▁cellules', '▁existantes', '▁pour', '▁faire', '▁de', '▁la', '▁place', '.', '</s>']
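
In the output above, the `▁` character marks the start of a new word and `</s>` marks the end of the sequence. As a simplified sketch of how such pieces map back to text (the helper `pieces_to_text` is hypothetical, and the tokenizer's own decoding may handle more cases than this):

```python
def pieces_to_text(pieces, eos_token="</s>"):
    """Join SentencePiece-style pieces; '▁' marks the start of a new word."""
    kept = [piece for piece in pieces if piece != eos_token]
    return "".join(kept).replace("▁", " ").strip()


# Pieces copied from the English output above
english_pieces = ['▁Insert', '▁new', '▁cell', '(', 's', ')', '▁at', '▁selected',
                  '▁location', ',', '▁moving', '▁existing', '▁cell', '(', 's', ')',
                  '▁to', '▁make', '▁room', '.', '</s>']
print(pieces_to_text(english_pieces))
```

The printed string is the original English sentence, which shows that no information was lost when the text was split into pieces.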


**The tokenizer works fine. Now we'll define a function that applies this tokenization to the whole dataset using the map() method.**

In [None]:
max_length = 128

def tokenize_dataset(examples):
    #Extract the English sentence from the given sample
    inputs = [ex["en"] for ex in examples["translation"]]
    #Extract the French sentence from the given sample
    targets = [ex["fr"] for ex in examples["translation"]]

    #Apply tokenizer
    model_inputs = tokenizer(inputs,
                             text_target=targets,
                             max_length=max_length,
                             truncation=True
                            )
    return model_inputs

In [None]:
tokenized_dataset = split_dataset.map(function=tokenize_dataset,
                                   batched=True,
                                   remove_columns=split_dataset["train"].column_names
                                  )

Map:   0%|          | 0/189155 [00:00<?, ? examples/s]

Map:   0%|          | 0/21018 [00:00<?, ? examples/s]
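
With `batched=True`, the mapped function receives a dictionary of lists rather than a single record, which is why `tokenize_dataset` loops over `examples["translation"]`. The sketch below imitates that call with a toy batch. The `toy_tokenize` function is a hypothetical stand-in for the real tokenizer (it emits one number per whitespace-separated word), and the small `max_length` is chosen only to make truncation visible:

```python
def toy_tokenize(text, max_length):
    """Hypothetical stand-in for the tokenizer: one ID (the word length) per word."""
    ids = [len(word) for word in text.split()]
    return ids[:max_length]  # like truncation=True: keep at most max_length IDs


def tokenize_batch(examples, max_length=4):
    # examples["translation"] is a list with one {"en": ..., "fr": ...} dict per row
    inputs = [ex["en"] for ex in examples["translation"]]
    targets = [ex["fr"] for ex in examples["translation"]]
    return {
        "input_ids": [toy_tokenize(text, max_length) for text in inputs],
        "labels": [toy_tokenize(text, max_length) for text in targets],
    }


batch = {
    "id": ["0", "1"],
    "translation": [
        {"en": "Cloudy weather", "fr": "Temps nuageux"},
        {"en": "Change the color of the numbers", "fr": "Modifier la couleur des nombres"},
    ],
}
out = tokenize_batch(batch)
print(out)
```

Only the keys returned by the function appear in the result. In the real cell, `remove_columns` additionally drops the original `id` and `translation` columns.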

## Model

In [None]:
#Instantiate the model
from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained(checkpoint)

Downloading tf_model.h5:   0%|          | 0.00/301M [00:00<?, ?B/s]

All model checkpoint layers were used when initializing TFMarianMTModel.

All the layers of TFMarianMTModel were initialized from the model checkpoint at Helsinki-NLP/opus-mt-en-fr.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMarianMTModel for predictions without further training.


## Data collator


In [None]:
#Instantiate the data collator
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer,
                                       model=model,
                                       return_tensors="tf")

**Before applying the data collator on the whole batch, let's test it on some samples.**

In [None]:
collated_samples = data_collator([sample for sample in tokenized_dataset["train"].select(range(5))])
collated_samples.keys()

dict_keys(['input_ids', 'attention_mask', 'labels', 'decoder_input_ids'])

In [None]:
#Let's convert some English token IDs back to tokens
print(tokenizer.convert_ids_to_tokens(collated_samples["input_ids"][2]))
print("\n\t\t\t\t%%%%%%%%%%%%%%%%%%%%%%%%%%%%\n")
#Let's convert some French token IDs back to tokens
print(tokenizer.convert_ids_to_tokens(collated_samples["labels"][2]))

['▁Open', '▁a', '▁Bi', 'b', 'tex', '▁file', '%', '▁q', 'd', 'bus', '▁or', 'g', '.', '▁k', 'de', '.', '▁tell', 'ico', '▁/', '▁Tell', 'ico', '▁or', 'g', '.', '▁k', 'de', '.', '▁tell', 'ico', '.', '▁import', 'Bi', 'b', 'tex', '▁"', '/', '▁home', '/', '▁rob', 'by', '/', '▁reference', '.', '▁bi', 'b', '"', '▁"', 're', 'place', '"', '▁true', '</s>']

				%%%%%%%%%%%%%%%%%%%%%%%%%%%%

['▁Ouvrir', '▁un', '▁fichier', '▁B', 'ib', 'tex', '%', '▁q', 'd', 'bus', '▁', 'org', '.', '▁k', 'de', '.', '▁tel', 'lic', 'o', '▁/', '▁T', 'elli', 'co', '▁', 'org', '.', '▁k', 'de', '.', '▁tel', 'lic', 'o', '.', '▁import', 'B', 'ib', 'tex', '▁"', '/', '▁home', '/', '▁', 'rob', 'by', '/', '▁re', 'ference', '.', '▁b', 'ib', '"', '▁"', 're', 'place', '"', '▁tru', 'e', '</s>']
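
To see where the four keys come from, here is a plain-Python sketch of what a seq2seq collator conceptually does to a batch. It is not the library's implementation. It assumes inputs are padded with the pad token (59513 for this checkpoint), labels are padded with -100 so the loss ignores them, and decoder inputs are the labels shifted right by one position, starting with the pad token:

```python
PAD_ID = 59513     # assumed pad token ID for this checkpoint
LABEL_PAD = -100   # label value that the loss ignores


def collate(features, pad_id=PAD_ID, label_pad=LABEL_PAD):
    """Pad every sample to the longest one in the batch (conceptual sketch)."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": [], "decoder_input_ids": []}
    for f in features:
        n_pad = max_in - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_id] * n_pad)
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        labels = f["labels"] + [label_pad] * (max_lab - len(f["labels"]))
        batch["labels"].append(labels)
        # Assumption: shift labels right by one and replace -100 with the pad token
        shifted = [pad_id] + labels[:-1]
        batch["decoder_input_ids"].append(
            [pad_id if token == label_pad else token for token in shifted]
        )
    return batch


features = [
    {"input_ids": [11, 12, 0], "labels": [21, 0]},
    {"input_ids": [13, 0], "labels": [22, 23, 24, 0]},
]
batch = collate(features)
for key, value in batch.items():
    print(key, value)
```

Padding per batch, rather than to one global length, keeps short batches small, which is the usual reason for using a collator instead of padding during tokenization.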


In [None]:
#Apply the data collator to convert the dataset to
#tf.data.Dataset object
tf_train_dataset = model.prepare_tf_dataset(
    tokenized_dataset["train"],
    collate_fn=data_collator,
    shuffle=True,
    batch_size=32,
)
tf_eval_dataset = model.prepare_tf_dataset(
    tokenized_dataset["validation"],
    collate_fn=data_collator,
    shuffle=False,
    batch_size=16,
)

## Model fine-tuning

In [None]:
from transformers import create_optimizer
from transformers.keras_callbacks import PushToHubCallback
import tensorflow as tf

num_epochs = 5

#Number of training steps is given by the formula:
#    (number of samples // batch size) * number of epochs
num_train_steps = len(tf_train_dataset) * num_epochs

#Instantiate the optimizer
optimizer, schedule = create_optimizer(
    init_lr=5e-5,
    num_warmup_steps=0,
    num_train_steps=num_train_steps,
    weight_decay_rate=0.01,
)

#Compile the model
model.compile(optimizer=optimizer)
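
The step count can be checked by hand. The sketch below redoes the arithmetic for the split sizes used in this notebook and illustrates a learning rate that falls linearly from `init_lr` to zero over training. The linear shape is an assumption about the schedule, and `linear_decay_lr` is a hypothetical helper, not a library function:

```python
num_samples = 189155   # rows in the train split
batch_size = 32
num_epochs = 5

steps_per_epoch = num_samples // batch_size
num_train_steps = steps_per_epoch * num_epochs


def linear_decay_lr(step, init_lr=5e-5, total_steps=num_train_steps):
    """Assumed schedule: the learning rate falls linearly from init_lr to 0."""
    remaining = max(0.0, 1.0 - step / total_steps)
    return init_lr * remaining


print(steps_per_epoch, num_train_steps)
for step in (0, num_train_steps // 2, num_train_steps):
    print(step, linear_decay_lr(step))
```

This gives 5911 steps per epoch, which agrees with the `5911` shown in the training progress bar, and 29555 steps in total.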

In [None]:
#Log in to HuggingFace account
#from huggingface_hub import notebook_login

#notebook_login()


In [None]:
#Define the callbacks to save the model on the
# HuggingFace Hub during training
from transformers.keras_callbacks import PushToHubCallback


#callback = PushToHubCallback(output_dir="marian-finetuned-kde4-en-to-fr",
#                             tokenizer=tokenizer
#)

#Train the model
history = model.fit(tf_train_dataset,
                    validation_data=tf_eval_dataset,
                    #callbacks=[callback],
                    epochs=num_epochs,
)

Epoch 1/5
Epoch 2/5
 956/5911 [===>..........................] - ETA: 47:40 - loss: 0.8152

KeyboardInterrupt: ignored

## Don't be afraid of the scary-looking error above, it's just a KeyboardInterrupt message

The message above is not a real error but a **KeyboardInterrupt**. It appears because we **manually stopped the training**, which was taking too long: a single epoch takes about an hour or more. We stopped once the loss looked good enough (about 0.82 at the point of interruption in the log above), and we'll manually upload the model and tokenizer to the hub using the **push_to_hub()** method.
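
Stopping a cell this way does not discard the progress made so far, because the model object stays in memory with its current weights. The toy loop below illustrates the idea with a counter standing in for the weights. It is only an analogy: `run_training` is hypothetical and is not how `model.fit()` is implemented.

```python
def run_training(total_steps, interrupt_at):
    """Toy loop: 'weights' is just a counter of completed update steps."""
    weights = {"updates": 0}
    try:
        for step in range(1, total_steps + 1):
            if step == interrupt_at:
                raise KeyboardInterrupt  # simulates pressing the stop button
            weights["updates"] = step
    except KeyboardInterrupt:
        print(f"Interrupted at step {step}, keeping {weights['updates']} updates")
    return weights


weights = run_training(total_steps=10, interrupt_at=4)
print(weights)
```

The updates applied before the interrupt are still there afterwards, which is why the partially trained model can be saved and uploaded.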

In [None]:
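#No Trainer exists in this Keras workflow, so save via the model (directory name is our choice)
model.save_pretrained("marian-finetuned-kde4-en-to-fr")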

In [None]:
#Save the model to the Model Hub
#model.push_to_hub("marian-finetuned-kde4-english-to-french")

In [None]:
#Save the tokenizer to hub
tokenizer.push_to_hub("marian-finetuned-kde4-english-to-french")

## Evaluation

For evaluating translation models, the **BLEU** or **SacreBLEU** metric is commonly used. Here, we'll use **SacreBLEU**. This metric compares the model's translated sentences (here, French) against one or more reference translations per sentence. It works on complete, decoded sentences rather than on token IDs.

So, to evaluate the model, we can write a function that takes the **validation** split, runs the model to get the translated text, and then computes the BLEU score.

We will wrap the generation step in **@tf.function** and compile it with [XLA](https://www.tensorflow.org/xla) by passing **jit_compile=True** as an argument. This should increase the processing speed.
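
A compiled function is specialised to the shapes of its inputs, so a new sequence length can trigger a new compilation. The evaluation cell further down passes `pad_to_multiple_of=128` to the collator, which limits the number of distinct lengths. A minimal sketch of that rounding (the helper `padded_length` is hypothetical):

```python
def padded_length(length, multiple=128):
    """Round a sequence length up to the next multiple of `multiple`."""
    return ((length + multiple - 1) // multiple) * multiple


raw_lengths = [5, 21, 52, 127, 128]
print({n: padded_length(n) for n in raw_lengths})
# Number of distinct shapes the compiled function would see
print(len({padded_length(n) for n in raw_lengths}))
```

Since the inputs were truncated to at most 128 tokens during tokenization, every batch ends up with the same padded length under this rounding.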

In [None]:
#Install the SacreBLEU library
!pip install sacrebleu

Collecting sacrebleu
  Downloading sacrebleu-2.3.2-py3-none-any.whl (119 kB)
Collecting portalocker (from sacrebleu)
  Downloading portalocker-2.8.2-py3-none-any.whl (17 kB)
Collecting colorama (from sacrebleu)
  Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Installing collected packages: portalocker, colorama, sacrebleu
Successfully installed colorama-0.4.6 portalocker-2.8.2 sacrebleu-2.3.2


In [None]:
!pip install evaluate

Collecting evaluate
  Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)
Collecting responses<0.19 (from evaluate)
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Installing collected packages: responses, evaluate
Successfully installed evaluate-0.4.1 responses-0.18.0


In [None]:
#Import the SacreBLEU metric
import evaluate

metric = evaluate.load("sacrebleu")

Downloading builder script:   0%|          | 0.00/8.15k [00:00<?, ?B/s]
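
For intuition about what a BLEU-style score rewards, the sketch below computes a clipped n-gram precision between one prediction and one reference. This is deliberately not SacreBLEU: the full metric combines several n-gram orders, applies a brevity penalty, and uses its own tokenization. The function names are hypothetical, and words are split on whitespace only.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all consecutive n-token tuples in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(prediction, reference, n=1):
    """Share of predicted n-grams found in the reference, counted at most
    as often as they occur in the reference."""
    pred_counts = Counter(ngrams(prediction.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    total = sum(pred_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    return matched / total


reference = "Modifier la couleur des nombres"
print(clipped_precision("Modifier la couleur des nombres", reference, n=1))
print(clipped_precision("Modifier la couleur", reference, n=2))
print(clipped_precision("la la la", reference, n=1))
```

The clipping is what stops a degenerate output such as "la la la" from scoring well: only one of its three words is credited, because "la" occurs once in the reference.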

In [None]:
import numpy as np
import tensorflow as tf
#We'll use tqdm to monitor the processing speed
from tqdm import tqdm

#Instantiate the data collator
generation_data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer,
                                                  model=model,
                                                  return_tensors="tf",
                                                  pad_to_multiple_of=128
)

#Generate the tf.data object
tf_generate_dataset = model.prepare_tf_dataset(tokenized_dataset["validation"],
                                               collate_fn=generation_data_collator,
                                               shuffle=False,
                                               batch_size=8,
)

#Wrap the function in tf.function
@tf.function(jit_compile=True)
def generate_with_xla(batch):
    return model.generate(input_ids=batch["input_ids"],
                          attention_mask=batch["attention_mask"],
                          max_new_tokens=128,
    )


def compute_metrics():
    all_preds = []
    all_labels = []

    #Use tqdm to get the progress bar
    for batch, labels in tqdm(tf_generate_dataset):

        #Translate the text
        predictions = generate_with_xla(batch)

        #Convert the tokens into words
        decoded_preds = tokenizer.batch_decode(predictions,
                                               skip_special_tokens=True)
        #Convert the label IDs to NumPy array
        labels = labels.numpy()

        #Replace the -100 tokens with pad_token_id (here, 59513)
        labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
        #np.where does the following for every element:
        # new_labels = []
        # for label in labels:
        #     if label != -100:
        #         new_labels.append(label)
        #     else:
        #         new_labels.append(tokenizer.pad_token_id)

        #Decode the labels
        decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
        #Remove all unnecessary spaces
        decoded_preds = [pred.strip() for pred in decoded_preds]
        decoded_labels = [[label.strip()] for label in decoded_labels]

        #Since we are dealing with batches of data, we are using "extend()"
        # method, not the "append()"
        # "extend()" will iterate over the batch and then add
        # each element of the iterable to the end of the List
        all_preds.extend(decoded_preds)
        all_labels.extend(decoded_labels)

    #Finally, compute the metric
    result = metric.compute(predictions=all_preds, references=all_labels)
    return {"bleu": result["score"]}
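
Two steps inside `compute_metrics()` are easy to check in isolation: replacing the `-100` label padding before decoding, and wrapping each reference in its own list. A small sketch with made-up token IDs and strings (the value 59513 is the pad token ID quoted in the comment above):

```python
import numpy as np

PAD_ID = 59513  # pad token ID quoted in the notebook comment

# Made-up label batch: -100 marks padded label positions
labels = np.array([[21, 0, -100, -100],
                   [22, 23, 24, 0]])
# Keep real IDs, swap -100 for the pad token so the batch can be decoded
cleaned = np.where(labels != -100, labels, PAD_ID)
print(cleaned)

# Each prediction may have several references, hence one list per sentence
decoded = [" Temps nuageux ", "Quitter le canal"]
references = [[text.strip()] for text in decoded]
print(references)
```

The replacement is needed because `-100` is not a valid token ID, so it has to be swapped for a real one before the labels are decoded back to text.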

In [None]:
#Compute the SacreBLEU score
print(compute_metrics())

  0%|          | 0/2628 [00:21<?, ?it/s]


InvalidArgumentError: ignored

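**The evaluation run above stopped with an InvalidArgumentError before a score was produced, so no SacreBLEU value is reported here. Once the evaluation completes, the score could likely be improved further by training the model for more epochs.**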