Fine-tuning Helsinki-NLP/opus-mt-en-fr for English-to-French translation on a parallel corpus from Europarl, then testing it on a parallel corpus from ECHR.

We start with the preprocessing and preparation of the datasets.

## Prepare the data

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Pre-process the dataset

Tokenizing the dataset.

In [2]:
! pip install datasets transformers sacrebleu torch sentencepiece transformers[sentencepiece]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.12.0-py3-none-any.whl (474 kB)
Collecting transformers
  Downloading transformers-4.28.1-py3-none-any.whl (7.0 MB)
Collecting sacrebleu
  Downloading sacrebleu-2.3.1-py3-none-any.whl (118 kB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting dill<0.3.7,>=0.3.0 (from datasets

In [3]:
import os
os.environ["WANDB_DISABLED"]="true"

In [4]:
import transformers
print(transformers.__version__) #Ensure that the version is greater than 4.11.1

4.28.1


In [5]:
#Name of the pretrained checkpoint to fine-tune
model_checkpoint = "Helsinki-NLP/opus-mt-en-fr"

In [6]:
#Prepare the tokenizer
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

Downloading (…)okenizer_config.json:   0%|          | 0.00/42.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.42k [00:00<?, ?B/s]

Downloading (…)olve/main/source.spm:   0%|          | 0.00/778k [00:00<?, ?B/s]

Downloading (…)olve/main/target.spm:   0%|          | 0.00/802k [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/1.34M [00:00<?, ?B/s]



In [7]:
#We define the file paths once so that we don't have to repeat them
# Using the split files obtained from the notebook 'SUPERVISED-NMT' (parallel corpus from Europarl)
train_dataset_en = "/content/drive/MyDrive/HAPLAPMaster/MachineTranslation/MT-project/en-fr-baseline/train.en"
train_dataset_fr = "/content/drive/MyDrive/HAPLAPMaster/MachineTranslation/MT-project/en-fr-baseline/train.fr"
dev_dataset_en = '/content/drive/MyDrive/HAPLAPMaster/MachineTranslation/MT-project/en-fr-baseline/dev.en'
dev_dataset_fr = '/content/drive/MyDrive/HAPLAPMaster/MachineTranslation/MT-project/en-fr-baseline/dev.fr'

with open(train_dataset_en, "r", encoding="utf-8") as f:
    train_en = f.readlines()

with open(train_dataset_fr, "r", encoding="utf-8") as f:
    train_fr = f.readlines()

with open(dev_dataset_en, "r", encoding="utf-8") as f:
    dev_en = f.readlines()

with open(dev_dataset_fr, "r", encoding="utf-8") as f:
    dev_fr = f.readlines()
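
The `.en` and `.fr` files are expected to be line-aligned, so line i on one side is the translation of line i on the other. A minimal sketch of a sanity check follows, assuming a hypothetical helper `read_parallel` that is not part of the original notebook. It strips the trailing newline that `readlines()` keeps and refuses to return misaligned data.

```python
def read_parallel(src_path, tgt_path):
    """Read two line-aligned text files and return (source_lines, target_lines).

    Hypothetical helper for illustration only: trailing newlines are removed
    and a ValueError is raised when the two files differ in line count.
    """
    with open(src_path, "r", encoding="utf-8") as f:
        src = [line.rstrip("\n") for line in f]
    with open(tgt_path, "r", encoding="utf-8") as f:
        tgt = [line.rstrip("\n") for line in f]
    if len(src) != len(tgt):
        raise ValueError(
            f"Misaligned corpus: {len(src)} source lines vs {len(tgt)} target lines"
        )
    return src, tgt
```

The outputs recorded in this notebook (for example the trailing `\n` in `tokenized_train['source'][27]`) come from the `readlines()` version above, not from this helper.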

In [8]:
#We define a function for preprocessing
prefix = ""
max_input_length = 128
max_target_length = 128
source_lang = "en"
target_lang = "fr"

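def preprocess_function(examples):
    source_examples = [prefix + ex for ex in examples["source"]]
    target_examples = [ex for ex in examples["target"]]

    # Every sequence is truncated and padded to a fixed length (128 tokens).
    # Note: the targets go through the same tokenizer call as the sources;
    # passing them as `text_target=` would use the target-side SentencePiece model instead.
    source_inputs = tokenizer(source_examples, max_length=max_input_length, truncation=True, padding="max_length")
    target_inputs = tokenizer(target_examples, max_length=max_target_length, truncation=True, padding="max_length")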

    model_inputs = {
        "input_ids": source_inputs["input_ids"],
        "attention_mask": source_inputs["attention_mask"],
        "labels": target_inputs["input_ids"],
    }

    return model_inputs
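
Because the targets are padded to `max_target_length`, the `labels` hold the pad token id at every padded position. PyTorch's cross-entropy loss skips only positions equal to `-100` by default, which is also why `compute_metrics` further down maps `-100` back to the pad id before decoding. The following is a minimal sketch of that masking and its inverse in plain Python. The helper names and the pad id `0` are illustrative assumptions, not part of the original notebook.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def mask_label_padding(label_ids, pad_token_id, ignore_index=IGNORE_INDEX):
    """Replace pad ids in each label sequence so the loss skips them."""
    return [
        [ignore_index if tok == pad_token_id else tok for tok in seq]
        for seq in label_ids
    ]


def unmask_label_padding(label_ids, pad_token_id, ignore_index=IGNORE_INDEX):
    """Inverse step needed before decoding: map the ignore index back to the pad id."""
    return [
        [pad_token_id if tok == ignore_index else tok for tok in seq]
        for seq in label_ids
    ]


# Toy batch with a hypothetical pad id of 0.
labels = [[17, 42, 9, 0, 0], [5, 0, 0, 0, 0]]
masked = mask_label_padding(labels, pad_token_id=0)
print(masked)  # [[17, 42, 9, -100, -100], [5, -100, -100, -100, -100]]
```

In the function above, this would be applied to `target_inputs["input_ids"]` with `tokenizer.pad_token_id`. The results reported in this notebook were obtained without it.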

In [9]:
from datasets import Dataset
# Build a dictionary with the keys "source" and "target" to pass to preprocess_function
data_dict = {
    "source": train_en,
    "target": train_fr,
}

# Convert the dictionary into a Dataset object
train = Dataset.from_dict(data_dict)

tokenized_train = train.map(preprocess_function, batched=True)

Map:   0%|          | 0/1884273 [00:00<?, ? examples/s]

In [11]:
tokenized_train

Dataset({
    features: ['source', 'target', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 1884273
})

In [12]:
tokenized_train['source'][27]

'"My report hinges on the notions of transparency and accountability , regulation and solidarity ."\n'

In [13]:
tokenized_train['target'][27]

'"L&apos; axe de mon rapport tourne autour des notions de transparence et de responsabilisation , de régulation et de solidarité ."\n'

In [14]:
from datasets import Dataset
# Build a dictionary with the keys "source" and "target" to pass to preprocess_function
data_dict = {
    "source": dev_en,
    "target": dev_fr,
}

# Convert the dictionary into a Dataset object
dev = Dataset.from_dict(data_dict)

tokenized_dev = dev.map(preprocess_function, batched=True)

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

## Start the fine-tuning and training

In [15]:
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

Downloading pytorch_model.bin:   0%|          | 0.00/301M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/293 [00:00<?, ?B/s]

In [16]:
batch_size = 16
model_name = model_checkpoint.split("/")[-1]
args = Seq2SeqTrainingArguments(
    f"{model_name}-finetuned-{source_lang}-to-{target_lang}",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    predict_with_generate=True    
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


In [17]:
#We define our datacollator
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

In [18]:
#We define some processing functions to check the metrics
!pip install evaluate
import evaluate
import numpy as np

metric = evaluate.load("bleu")


def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]
    return preds, labels


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)
    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["bleu"]}
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting evaluate
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.0


Downloading builder script:   0%|          | 0.00/5.94k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/1.55k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/3.34k [00:00<?, ?B/s]

In [19]:
#We define the training parameters
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_dev,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [None]:
trainer.train()



| Epoch | Training Loss | Validation Loss | Bleu | Gen Len |
|---|---|---|---|---|
| 1 | 0.2751 | 0.259839 | 0.4383 | 62.202 |


TrainOutput(global_step=117768, training_loss=0.2723879138782502, metrics={'train_runtime': 12956.9562, 'train_samples_per_second': 145.426, 'train_steps_per_second': 9.089, 'total_flos': 6.387377377797734e+16, 'train_loss': 0.2723879138782502, 'epoch': 1.0})
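
The `global_step` and throughput figures in `TrainOutput` follow from the dataset size, the batch size and the runtime. The arithmetic check below assumes a single device and no gradient accumulation. Neither is stated explicitly in the notebook, but both are consistent with the reported numbers.

```python
import math

num_examples = 1_884_273     # rows in tokenized_train
batch_size = 16              # per_device_train_batch_size
num_epochs = 1               # num_train_epochs
train_runtime = 12956.9562   # seconds, taken from TrainOutput

# One optimizer step per batch; the last, partial batch still counts as a step.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

steps_per_second = total_steps / train_runtime
samples_per_second = num_examples * num_epochs / train_runtime

print(total_steps)                    # 117768
print(round(steps_per_second, 3))     # 9.089
print(round(samples_per_second, 3))   # 145.426
```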

In [None]:
#Saving it on Drive
output_dir = "/content/drive/MyDrive/HAPLAPMaster/MachineTranslation/MT-project/fine-tuning-results/"

trainer.save_model(output_dir)


Usage example. We translate a sentence taken from the English side of the training corpus.

In [None]:
from transformers import MarianMTModel, MarianTokenizer
!pip install sacremoses 

src_text = ['The European Parliament should work hard to make an ambitious , Europeanist response worthy of our citizens .']
# reference translation = Le Parlement européen doit travailler dur pour apporter une réponse européiste ambitieuse digne de nos concitoyens .

tokenizer = MarianTokenizer.from_pretrained('/content/drive/MyDrive/HAPLAPMaster/MachineTranslation/MT-project/fine-tuning-results/')
model = MarianMTModel.from_pretrained('/content/drive/MyDrive/HAPLAPMaster/MachineTranslation/MT-project/fine-tuning-results/')
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
[tokenizer.decode(t, skip_special_tokens=True) for t in translated]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/




['Le Parlement européen devrait travailler dur pour apporter une réponse ambitieuse et européiste digne de nos▁citoyens.']
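
The decoded output still contains the SentencePiece word-boundary marker `▁`, and the corpus itself carries escaped entities such as `&apos;` (visible in the training sample shown earlier). A minimal post-processing sketch using only the standard library is given below. `clean_translation` is a hypothetical helper, not part of the original notebook, and it treats every `▁` as a space, which is an assumption about how the marker should be rendered.

```python
import html
import re


def clean_translation(text):
    """Unescape HTML entities and turn SentencePiece markers into spaces."""
    text = html.unescape(text)           # &apos; -> ', &quot; -> "
    text = text.replace("\u2581", " ")   # U+2581 is the SentencePiece marker
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace


print(clean_translation("digne de nos\u2581citoyens."))
# digne de nos citoyens.
print(clean_translation("L&apos; axe de mon rapport"))
# L' axe de mon rapport
```

This only tidies the displayed strings; it does not change what the model generates.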

In [20]:
# PREPARING THE TEST SET FOR EVALUATION

test_dataset_en = "/content/drive/MyDrive/HAPLAPMaster/MachineTranslation/MT-project/en-fr-baseline/test.en"
test_dataset_fr = "/content/drive/MyDrive/HAPLAPMaster/MachineTranslation/MT-project/en-fr-baseline/test.fr"

with open(test_dataset_en, "r", encoding="utf-8") as f:
    test_en = f.readlines()

with open(test_dataset_fr, "r", encoding="utf-8") as f:
    test_fr = f.readlines()

In [21]:
from datasets import Dataset
# Build a dictionary with the keys "source" and "target" to pass to preprocess_function
data_dict = {
    "source": test_en,
    "target": test_fr,
}

# Convert the dictionary into a Dataset object
test = Dataset.from_dict(data_dict)

tokenized_test = test.map(preprocess_function, batched=True)

Map:   0%|          | 0/330 [00:00<?, ? examples/s]

In [22]:
tokenized_test

Dataset({
    features: ['source', 'target', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 330
})

In [23]:
tokenized_test['source'][27]

'The Grand Chamber to be constituted included ex officio Mr Ryssdal , President of the Court , Mr Bernhardt , Vice-President of the Court , and the other members and substitute judges ( namely , Mr B. Repik , Mr F. Gölcüklü , Mr R. Pekkanen and Mr K. Jungwiert ) of the Chamber which had relinquished jurisdiction ( Rule 51 para . 2 ( a ) and ( b ) ) .\n'

In [24]:
tokenized_test['target'][27]

'Conformément à l ’ article 51 par . 2 a ) et b ) du règlement A , le président et le vice-président de la Cour , M. Ryssdal et M. Bernhardt , ainsi que les autres membres et juges suppléants ( à savoir M. B. Repik , M. F. Gölcüklü , M. R. Pekkanen et M. K. Jungwiert ) de la chambre originaire sont devenus membres de la grande chambre .\n'

In [None]:
# EVALUATING THE MODEL WITH THE TEST SET

trainer.evaluate(eval_dataset=tokenized_test)

{'eval_loss': 6.845170497894287,
 'eval_bleu': 0.1629,
 'eval_gen_len': 35.4212,
 'eval_runtime': 46.0068,
 'eval_samples_per_second': 7.173,
 'eval_steps_per_second': 0.456}

In [25]:
# PREPARING ONE TEXT OF THE TEST SET FOR TRANSLATION

partial_test_en = "/content/drive/MyDrive/HAPLAPMaster/MachineTranslation/MT-project/en-fr/partial-test.en"
partial_test_fr = "/content/drive/MyDrive/HAPLAPMaster/MachineTranslation/MT-project/en-fr/partial-test.fr"

with open(partial_test_en, "r", encoding="utf-8") as f:
    p_test_en = f.readlines()

with open(partial_test_fr, "r", encoding="utf-8") as f:
    p_test_fr = f.readlines()

In [27]:
from transformers import MarianMTModel, MarianTokenizer
!pip install sacremoses 

# TRANSLATING ONE TEXT OF THE TEST SET 

tokenizer = MarianTokenizer.from_pretrained('/content/drive/MyDrive/HAPLAPMaster/MachineTranslation/MT-project/fine-tuning-results/')
model = MarianMTModel.from_pretrained('/content/drive/MyDrive/HAPLAPMaster/MachineTranslation/MT-project/fine-tuning-results/')
translated = model.generate(**tokenizer(p_test_en, return_tensors="pt", padding=True))
[tokenizer.decode(t, skip_special_tokens=True) for t in translated]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


['Procédure d&apos; intégration',
 '"L ▁affaire a▁trouvé son▁origine dans une▁demande ( n ° 8866 / 04 ) déposée par un ressortissant▁britannique, M. Yassar Hussain, en vertu de l  article 34 de la Convention de sauvegarde des droits de l  homme et des▁libertés fondamentales ( &quot; la Convention &quot; ), à l  encontre du Royaume-Uni, le 1er mars 2004."',
 'Le▁demandeur était représenté par M. Bromley de Lichfield Reynolds à Stoke-on-Trent.',
 '"Le gouvernement du Royaume-Uni ( « le gouvernement » ) était représenté par son agent, M. J. Grager, du ministère des affaires étrangères et du Commonwealth."',
 '"Le 16 février 2005, la Cour a décidé de transmettre la▁demande au gouvernement."',
 '"Conformément aux dispositions de l&apos; article 29,▁paragraphe 3, de la Convention, elle a décidé d&apos; examiner le bien-fondé de la▁demande en même temps que sa recevabilité."',
 'L&apos; état d&apos; avancement de l&apos; Union européenne dans le▁domaine de l&apos; énergie et de l&apos; énergi