**Lev Berezhnoy Y8749641H**

# Neural machine translation (Aplicaciones de Procesamiento del Lenguaje Natural)

In this colab notebook you will learn how to work with a pre-trained model from [Huggingface](https://huggingface.co/) and how to fine tune such a model to improve its performace on your texts.

----
The code in this colab notebook is insprired in the content of the following websites:
* https://medium.com/@tskumar1320/how-to-fine-tune-pre-trained-language-translation-model-3e8a6aace9f
* https://huggingface.co/docs/transformers/training
* https://huggingface.co/docs/transformers/main_classes/tokenizer
* https://huggingface.co/docs/evaluate/


## Install the required libraries

In [1]:
%%capture
! pip install transformers[torch,sentencepiece]
! pip install datasets
! pip install evaluate
! pip install sacremoses
! pip install sacrebleu

In [2]:
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer, EarlyStoppingCallback
from datasets import Dataset
import evaluate
import numpy as np
import os

## Give access to your Google Drive and set some variables


List of variables to set:
* `mydrive`: Full path to the Google Drive folder containing the corpora for fine-tuning.
* `source_es`: Source language for ES->RU translation (see [ISO-639-1 two-letter codes](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes)).
* `target_ru`: Target language for ES->RU translation.
* `source_ru`: Source language for RU->ES translation.
* `target_es`: Target language for RU->ES translation.
* `corpus`: Prefix of the files with the parallel corpus in moses format (two documents with the same number of lines; no blank line in either document, no duplicated parallel entries)
* `model_name_es_ru`: Identifier for the pre-trained machine translation model for ES->RU, available from the NLP Language Technology Research Group at the University of Helsinki or other sources.
* `model_name_ru_es`: Identifier for the pre-trained machine translation model for RU->ES, available from the NLP Language Technology Research Group at the University of Helsinki or other sources.
* `output_model_name_es_ru`: Directory name where the fine-tuned ES->RU model will be saved.
* `output_model_name_ru_es`: Directory name where the fine-tuned RU->ES model will be saved.
* `output_model_es_ru_path`: Full path to the directory where the fine-tuned ES->RU model will be saved.
* `output_model_ru_es_path`: Full path to the directory where the fine-tuned RU->ES model will be saved.
* `train_size`: Number of parallel sentences for training (fine-tuning).
* `train_mono_size_es`: Number of monolingual sentences in Spanish for synthetic data generation.
* `train_mono_size_ru`: Number of monolingual sentences in Russian for synthetic data generation.
* `test_size`: Number of parallel sentences for testing.
* `dev_size`: Number of parallel sentences for development.
* `patience`: Number of epochs to wait without improvement before early stopping.
* `batch_size`: Batch size for training steps. Use 64 for training with FP16 precision; otherwise, consider using 32.

The parallel corpus for fine-tuning should contain at least `train_size` + `test_size` + `dev_size` + `train_mono_size_es`  + `train_mono_size_ru`  parallel sentences.

In [3]:
from google.colab import drive

drive.mount("/content/drive", force_remount=True)

mydrive="/content/drive/MyDrive/apln"
corpus = "News-Commentary.ready"

source_es = "es"
target_ru = "ru"
model_name_es_ru = "Helsinki-NLP/opus-mt-es-ru"
output_model_name_es_ru = "fine-tuned-model-es-ru"
output_model_es_ru_path = mydrive + "/" + output_model_name_es_ru

source_ru = "ru"
target_es = "es"
model_name_ru_es = "Helsinki-NLP/opus-mt-ru-es"
output_model_name_ru_es = "fine-tuned-model-ru-es"
output_model_ru_es_path = mydrive + "/" + output_model_name_ru_es

train_size = 2500
train_mono_size_es = 2500
train_mono_size_ru = 2500
test_size = 2000
dev_size = 2000

patience = 3 # default 5
batch_size = 32 ## Using fp16; otherwise use 32 #default 64

source_path_es = mydrive + "/" + corpus + "."+ source_es
source_path_ru = mydrive + "/" + corpus + "."+ source_ru
target_path_es = mydrive + "/" + corpus + "."+ target_es
target_path_ru = mydrive + "/" + corpus + "."+ target_ru
#save_path = "/content/corpus-for-finetuning"

Mounted at /content/drive


## Translate a sentence with the selected model

* The text to be translated must be assigned to the variable `text`

In [None]:
%%capture
text = "Este es un ejemplo muy simple de traducción del español al ruso."

tokenizer = AutoTokenizer.from_pretrained(model_name_es_ru)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name_es_ru).to("cuda:0") # Use GPU 0

input_ids = tokenizer.encode(text, return_tensors="pt").to("cuda:0") # Use GPU 0
outputs = model.generate(input_ids)
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)

del tokenizer
del model

In [None]:
print(decoded)

Это очень простой пример перевода с испанского на русский.


In [None]:
%%capture
text = "Это очень простой пример перевода с русского языка на испанский."

tokenizer = AutoTokenizer.from_pretrained(model_name_ru_es)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name_ru_es).to("cuda:0") # Use GPU 0

input_ids = tokenizer.encode(text, return_tensors="pt").to("cuda:0") # Use GPU 0
outputs = model.generate(input_ids)
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)

del model
del tokenizer

In [None]:
print(decoded)

Es un ejemplo muy sencillo de la traducción del ruso al español.


## Generate a dataset from a parallel corpus in moses format
The moses format consists of two separate files with the same amount of lines so that the segments in the *n*-th line in both documents are mutual translation. The parallel corpus must not contain blank or duplicated entries.

Once created, the dataset is split into training, development and testing.

In [4]:
def erase_key(example, target):
    example['translation'][target] = ''
    return example

In [5]:
# Read data from the files
with open(source_path_es, 'r', encoding='utf-8') as source_file, open(target_path_ru, 'r', encoding='utf-8') as target_file:
    source_data_es = source_file.readlines()  # Read all lines from the Spanish source file
    target_data_ru = target_file.readlines()  # Read all lines from the Russian target file

# Create the dataset
dataset_es_ru = Dataset.from_dict({
    'translation': [{source_es: src.strip(), target_ru: tgt.strip()} for src, tgt in zip(source_data_es, target_data_ru)]
})  # Create a dataset from the read data, stripping whitespace from each line and pairing Spanish source with Russian target

print(dataset_es_ru[0])  # Print the first pair of sentences from the dataset


remaining_size = len(dataset_es_ru) - test_size - dev_size - train_mono_size_es - train_mono_size_ru
if (train_size > remaining_size):
    train_size = remaining_size  # Adjust train_size if it exceeds the size available after reserving data for test, dev, and mono training

# Split the dataset into training, development, testing, train_mono_size_es and train_mono_size_ru
dataset_split = dataset_es_ru.train_test_split(test_size=(test_size+dev_size+train_mono_size_es+train_mono_size_ru), train_size=train_size)
train_dataset = dataset_split['train']  # Split off the training dataset
test_dev_mono_dataset = dataset_split['test']  # Remaining data for further splitting into test, dev, and mono datasets

dataset_split = test_dev_mono_dataset.train_test_split(test_size=(train_mono_size_es+train_mono_size_ru), train_size=(test_size+dev_size))
mono_dataset = dataset_split['test']  # Split off mono dataset for training monolingual models
test_dev_dataset = dataset_split['train']  # Remaining data for splitting into test and dev datasets

dataset_split = test_dev_dataset.train_test_split(test_size=dev_size, train_size=test_size)
dev_dataset = dataset_split['test']  # Development dataset (for evaluating)
test_dataset = dataset_split['train']  # Test dataset

dataset_split = mono_dataset.train_test_split(test_size=train_mono_size_ru, train_size=train_mono_size_es)
mono_dataset_ru = dataset_split['test']  # Monolingual dataset for Russian
mono_dataset_es = dataset_split['train']  # Monolingual dataset for Spanish

# Process the monolingual datasets to remove ES keys
mono_dataset_ru = mono_dataset_ru.map(erase_key,
    fn_kwargs={
        'target': target_es,
    })

# Process the monolingual datasets to remove RU keys
mono_dataset_es = mono_dataset_es.map(erase_key,
    fn_kwargs={
        'target': target_ru,
    })

# Print dataset sizes for verification
print("Size of the training set: "+str(len(train_dataset)))
print("Size of the development set: "+str(len(dev_dataset)))
print("Size of the test set: "+str(len(test_dataset)))
print("Size of the train_mono_size_ru set: "+str(len(mono_dataset_ru)))
print("Size of the train_mono_size_es set: "+str(len(mono_dataset_es)))

# Print the first element from each monolingual dataset for inspection
print(mono_dataset_ru[0])
print(mono_dataset_es[0])

{'translation': {'es': 'Es posible ver otros enfoques: por ejemplo, fortalecer la legislación en materia de patentes o imponer controles de precios a las industrias monopólicas, como la farmacéutica; estrategia que varias economías de mercado ya han adoptado.', 'ru': 'Возможны и другие подходы к решению проблемы: ужесточение патентных прав или введение контроля над ценами монопольных отраслей промышленности, таких как фармацевтические препараты, как это уже сделано во многих странах с рыночной экономикой.'}}


Map:   0%|          | 0/2500 [00:00<?, ? examples/s]

Map:   0%|          | 0/2500 [00:00<?, ? examples/s]

Size of the training set: 2500
Size of the development set: 2000
Size of the test set: 2000
Size of the train_mono_size_ru set: 2500
Size of the train_mono_size_es set: 2500
{'translation': {'es': '', 'ru': 'Америка гордится тем, что является одной самых процветающих стран в мире, она может хвалиться тем, что все последние годы за исключение�� одного – 2009-го – её подушевой ВВП неуклонно рос.'}}
{'translation': {'es': 'Los bajísimos tipos de interés que ahora predominan han movido a los inversores a correr riesgos excesivos para logar un mayor rédito actual de sus carteras, en muchos casos para atender las obligaciones de devolución establecidas en los contratos de pensiones y seguros.', 'ru': ''}}


In [None]:
'''
{'translation': {'es': 'Es posible ver otros enfoques: por ejemplo, fortalecer la legislación en materia de patentes o imponer controles de precios a las industrias monopólicas, como la farmacéutica; estrategia que varias economías de mercado ya han adoptado.', 'ru': 'Возможны и другие подходы к решению проблемы: ужесточение патентных прав или введение контроля над ценами монопольных отраслей промышленности, таких как фармацевтические препараты, как это уже сделано во многих странах с рыночной экономикой.'}}
Size of the training set: 2500
Size of the development set: 2000
Size of the test set: 2000
Size of the train_mono_size_ru set: 2500
Size of the train_mono_size_es set: 2500
{'translation': {'es': '', 'ru': 'Многие, кто хочет рефинансировать свои ипотечные кредиты, по-прежнему не могут этого сделать, поскольку они «находятся под водой» (имея долг по своей ипотеке выше, чем стоит залоговое имущество).'}}
{'translation': {'es': 'Cuando se trata de la agricultura, los países desarrollados, como Estados Unidos y los miembros de la Unión Europea, blindan tanto a los consumidores como a los agricultores frente a estos riesgos.', 'ru': ''}}
'''

## Evaluating the pre-trained models



In [6]:
def evaluate_model(model_path, source, target):
    inputs = [ex[source] for ex in test_dataset["translation"]]
    references = [ex[target] for ex in test_dataset["translation"]]

    # Translate using pipelines - Use GPU 0 (device="cuda:0")
    translator = pipeline("translation", model=model_path, device="cuda:0", batch_size=batch_size)
    pre_outputs = translator(inputs)
    outputs = [ex["translation_text"] for ex in pre_outputs]

    metric = evaluate.load("sacrebleu") # BLEU
    result = metric.compute(predictions=outputs, references=references)
    print(model_path)
    print(result)

    del translator

In [None]:
# Initial model evaluation for ES->RU translation
evaluate_model(model_name_es_ru, source_es, target_ru)

Helsinki-NLP/opus-mt-es-ru
{'score': 20.988746600665284, 'counts': [23798, 12032, 6972, 4125], 'totals': [48437, 46437, 44440, 42452], 'precisions': [49.13186200631748, 25.910373193789436, 15.688568856885688, 9.7168566851974], 'bp': 1.0, 'sys_len': 48437, 'ref_len': 48326}


In [None]:
'''
Helsinki-NLP/opus-mt-es-ru
{'score': 20.988746600665284, 'counts': [23798, 12032, 6972, 4125], 'totals': [48437, 46437, 44440, 42452], 'precisions': [49.13186200631748, 25.910373193789436, 15.688568856885688, 9.7168566851974], 'bp': 1.0, 'sys_len': 48437, 'ref_len': 48326}
'''

In [None]:
# Initial model evaluation for RU->ES translation
evaluate_model(model_name_ru_es, source_ru, target_es)

Helsinki-NLP/opus-mt-ru-es
{'score': 27.365264903298467, 'counts': [33457, 18485, 11706, 7524], 'totals': [58868, 56868, 54871, 52878], 'precisions': [56.833933546239045, 32.505099528733204, 21.333673525177232, 14.228979916033133], 'bp': 1.0, 'sys_len': 58868, 'ref_len': 57449}


In [None]:
'''
Helsinki-NLP/opus-mt-ru-es
{'score': 27.365264903298467, 'counts': [33457, 18485, 11706, 7524], 'totals': [58868, 56868, 54871, 52878], 'precisions': [56.833933546239045, 32.505099528733204, 21.333673525177232, 14.228979916033133], 'bp': 1.0, 'sys_len': 58868, 'ref_len': 57449}
'''

## Preprocess the datasets before their use for fine tuning
The proprocessing implies tokenizing the sentences in the datasets using the tokenizer included in the pre-trained model


In [7]:
max_input_length = 128
max_target_length = 128

# I modified this method by adding additional parameters to increase its versatility.
def preprocess_function(examples, source, target, tokenizer):
    inputs = [ex[source] for ex in examples["translation"]]
    targets = [ex[target] for ex in examples["translation"]]
    model_inputs = tokenizer(text=inputs, max_length=max_input_length, padding=True, truncation=True)
    labels = tokenizer(text_target=targets, max_length=max_target_length, padding=True, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [8]:
tokenizer_es_ru = AutoTokenizer.from_pretrained(model_name_es_ru)
print(tokenizer_es_ru)
tokenized_train_dataset_es_ru = train_dataset.map(preprocess_function,
    batched=True,
    fn_kwargs={
        'source': source_es,
        'target': target_ru,
        'tokenizer': tokenizer_es_ru
    })

tokenized_dev_dataset_es_ru = dev_dataset.map(preprocess_function,
    batched=True,
    fn_kwargs={
        'source': source_es,
        'target': target_ru,
        'tokenizer': tokenizer_es_ru
    })

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/42.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.38k [00:00<?, ?B/s]

source.spm:   0%|          | 0.00/829k [00:00<?, ?B/s]

target.spm:   0%|          | 0.00/1.06M [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.62M [00:00<?, ?B/s]

MarianTokenizer(name_or_path='Helsinki-NLP/opus-mt-es-ru', vocab_size=63430, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	63429: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}


Map:   0%|          | 0/2500 [00:00<?, ? examples/s]

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

In [None]:
'''
MarianTokenizer(name_or_path='Helsinki-NLP/opus-mt-es-ru', vocab_size=63430, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	63429: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
'''

In [9]:
tokenizer_ru_es = AutoTokenizer.from_pretrained(model_name_ru_es)
print(tokenizer_ru_es)
tokenized_train_dataset_ru_es = train_dataset.map(preprocess_function,
    batched=True,
    fn_kwargs={
        'source': source_ru,
        'target': target_es,
        'tokenizer': tokenizer_ru_es
    })
tokenized_dev_dataset_ru_es = dev_dataset.map(preprocess_function,
    batched=True,
    fn_kwargs={
        'source': source_ru,
        'target': target_es,
        'tokenizer': tokenizer_ru_es
    })

tokenizer_config.json:   0%|          | 0.00/42.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.38k [00:00<?, ?B/s]

source.spm:   0%|          | 0.00/1.06M [00:00<?, ?B/s]

target.spm:   0%|          | 0.00/829k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.62M [00:00<?, ?B/s]

MarianTokenizer(name_or_path='Helsinki-NLP/opus-mt-ru-es', vocab_size=63430, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	63429: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}


Map:   0%|          | 0/2500 [00:00<?, ? examples/s]

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

In [None]:
'''
MarianTokenizer(name_or_path='Helsinki-NLP/opus-mt-ru-es', vocab_size=63430, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	63429: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
'''

## Fine tune the model on the dataset created above
Before fine tuning, we need to set the automatic evaluation metric to be used to evaluate on the development set, then we will run the training algorithm on the training dataset


### Define the metric to be used on the development set and basic methods

In [10]:
metric = evaluate.load("sacrebleu") # BLEU

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]
    return preds, labels

def compute_metrics_es_ru(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer_es_ru.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer_es_ru.pad_token_id)
    decoded_labels = tokenizer_es_ru.batch_decode(labels, skip_special_tokens=True)
    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)
    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}
    prediction_lens = [np.count_nonzero(pred != tokenizer_es_ru.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result

def compute_metrics_ru_es(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer_ru_es.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer_ru_es.pad_token_id)
    decoded_labels = tokenizer_ru_es.batch_decode(labels, skip_special_tokens=True)
    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)
    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}
    prediction_lens = [np.count_nonzero(pred != tokenizer_ru_es.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result

Downloading builder script:   0%|          | 0.00/8.15k [00:00<?, ?B/s]

In [11]:
def fine_tune_model(model_name, output_dir, tokenizer, compute_metrics, tokenized_train_dataset, tokenized_dev_dataset, num_train_epochs=30):
    model_file_path = os.path.join(output_dir, "model.safetensors")
    if os.path.exists(model_file_path):
        print(f"Model already fine-tuned and saved at {output_dir}. Skipping training.")
        return

    data_collator = DataCollatorForSeq2Seq(tokenizer, model=model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to("cuda:0") # Load in GPU 0

    args = Seq2SeqTrainingArguments(
        output_dir = output_dir,
        evaluation_strategy = "epoch",
        save_strategy="epoch",
        #evaluation_strategy="steps",
        #save_strategy="steps",
        learning_rate=2e-5,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        weight_decay=0.01,
        save_total_limit=3,
        num_train_epochs=num_train_epochs,
        predict_with_generate=True,
        fp16=True,
        metric_for_best_model="bleu",
        load_best_model_at_end=True, # It uses metric_for_best_model to compare models
    )

    trainer = Seq2SeqTrainer(
        model=model,
        args=args,
        train_dataset=tokenized_train_dataset,
        eval_dataset=tokenized_dev_dataset,
        data_collator=data_collator,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
        callbacks = [EarlyStoppingCallback(early_stopping_patience=patience)], # It uses metric_for_best_model
    )

    trainer.train()
    trainer.save_model()

    del trainer
    del model
    del data_collator

### Fine-tuning models using bilingual datasets for ES->RU / RU->ES translation

**In the process of iterative back-translation and training of machine translation models based on synthetic data**, we will combine bilingual and synthetic data at each stage of iteration. At each iteration, we will create a new set of synthetic data and merge it with the original bilingual corpus for further training of the models.

Order of actions:

0. Initial training: First, we will train both models (ES→RU and RU→ES) on the original bilingual corpus.

1. First iteration:
- Generate synthetic parallel data using both models to translate the respective monolingual corpora.
- Merge the generated synthetic data with the original bilingual corpus.
- Retrain the models on this combined dataset.

2. Second iteration:
- Repeat the process of generating synthetic data using the models that have already been trained in the iteration 1.
- Again, merge the new synthetic data with the original bilingual corpus.
- Retrain the models on the new combined dataset.

3. Third iteration:
- Once more, repeat the generation of synthetic data and retrain the models on the combined data.

In [None]:
# Fine tune the model ES->RU
fine_tune_model(model_name_es_ru, output_model_es_ru_path, tokenizer_es_ru, compute_metrics_es_ru, tokenized_train_dataset_es_ru, tokenized_dev_dataset_es_ru)

pytorch_model.bin:   0%|          | 0.00/309M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/293 [00:00<?, ?B/s]

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,No log,0.502211,21.655,31.493
2,No log,0.489586,21.8738,31.701
3,No log,0.482159,21.7009,31.6735
4,No log,0.477004,21.5916,31.761
5,No log,0.475382,21.5094,31.9185


Non-default generation parameters: {'max_length': 512, 'num_beams': 4, 'bad_words_ids': [[63429]], 'forced_eos_token_id': 0}
Non-default generation parameters: {'max_length': 512, 'num_beams': 4, 'bad_words_ids': [[63429]], 'forced_eos_token_id': 0}
Non-default generation parameters: {'max_length': 512, 'num_beams': 4, 'bad_words_ids': [[63429]], 'forced_eos_token_id': 0}
Non-default generation parameters: {'max_length': 512, 'num_beams': 4, 'bad_words_ids': [[63429]], 'forced_eos_token_id': 0}
Non-default generation parameters: {'max_length': 512, 'num_beams': 4, 'bad_words_ids': [[63429]], 'forced_eos_token_id': 0}
There were missing keys in the checkpoint model loaded: ['model.encoder.embed_tokens.weight', 'model.encoder.embed_positions.weight', 'model.decoder.embed_tokens.weight', 'model.decoder.embed_positions.weight', 'lm_head.weight'].
Non-default generation parameters: {'max_length': 512, 'num_beams': 4, 'bad_words_ids': [[63429]], 'forced_eos_token_id': 0}


In [None]:
evaluate_model(output_model_es_ru_path, source_es, target_ru)

/content/drive/MyDrive/apln/fine-tuned-model-es-ru
{'score': 21.727459780613138, 'counts': [24267, 12507, 7390, 4437], 'totals': [48934, 46934, 44934, 42943], 'precisions': [49.59128622225855, 26.648058976434992, 16.446343526060446, 10.332300957082644], 'bp': 0.9981012857763029, 'sys_len': 48934, 'ref_len': 49027}


In [None]:
'''
/content/drive/MyDrive/apln/fine-tuned-model-es-ru
{'score': 21.727459780613138, 'counts': [24267, 12507, 7390, 4437], 'totals': [48934, 46934, 44934, 42943], 'precisions': [49.59128622225855, 26.648058976434992, 16.446343526060446, 10.332300957082644], 'bp': 0.9981012857763029, 'sys_len': 48934, 'ref_len': 49027}
'''

In [None]:
# Fine tune the model RU->ES
fine_tune_model(model_name_ru_es, output_model_ru_es_path, tokenizer_ru_es, compute_metrics_ru_es, tokenized_train_dataset_ru_es, tokenized_dev_dataset_ru_es)

pytorch_model.bin:   0%|          | 0.00/309M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/293 [00:00<?, ?B/s]

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,No log,0.492237,27.686,33.0445
2,No log,0.484824,27.9391,32.833
3,No log,0.482185,27.6849,33.0905
4,No log,0.481513,27.6757,32.8335
5,No log,0.480608,27.6022,32.9225


Non-default generation parameters: {'max_length': 512, 'num_beams': 6, 'bad_words_ids': [[63429]], 'forced_eos_token_id': 0}
Non-default generation parameters: {'max_length': 512, 'num_beams': 6, 'bad_words_ids': [[63429]], 'forced_eos_token_id': 0}
Non-default generation parameters: {'max_length': 512, 'num_beams': 6, 'bad_words_ids': [[63429]], 'forced_eos_token_id': 0}
Non-default generation parameters: {'max_length': 512, 'num_beams': 6, 'bad_words_ids': [[63429]], 'forced_eos_token_id': 0}
Non-default generation parameters: {'max_length': 512, 'num_beams': 6, 'bad_words_ids': [[63429]], 'forced_eos_token_id': 0}
There were missing keys in the checkpoint model loaded: ['model.encoder.embed_tokens.weight', 'model.encoder.embed_positions.weight', 'model.decoder.embed_tokens.weight', 'model.decoder.embed_positions.weight', 'lm_head.weight'].
Non-default generation parameters: {'max_length': 512, 'num_beams': 6, 'bad_words_ids': [[63429]], 'forced_eos_token_id': 0}


In [None]:
evaluate_model(output_model_ru_es_path, source_ru, target_es)

/content/drive/MyDrive/apln/fine-tuned-model-ru-es
{'score': 27.8189706169949, 'counts': [34127, 18949, 11993, 7714], 'totals': [59262, 57262, 55262, 53267], 'precisions': [57.586649117478316, 33.09175369354895, 21.702073757735878, 14.48176169110331], 'bp': 1.0, 'sys_len': 59262, 'ref_len': 58326}


In [None]:
'''
/content/drive/MyDrive/apln/fine-tuned-model-ru-es
{'score': 27.8189706169949, 'counts': [34127, 18949, 11993, 7714], 'totals': [59262, 57262, 55262, 53267], 'precisions': [57.586649117478316, 33.09175369354895, 21.702073757735878, 14.48176169110331], 'bp': 1.0, 'sys_len': 59262, 'ref_len': 58326}
'''

## Fine-tuning models using synthetic data ES->RU / RU->ES

In [13]:
# Define a function to generate a synthetic dataset using a monolingual dataset and a translation model
def generate_synthetic_dataset(mono_dataset, model_path, source, target):
    # Initialize the translation pipeline with the specified model, setting the device to GPU and batch size to 64 for efficiency
    translator = pipeline("translation", model=model_path, device="cuda:0", batch_size=64)

    # Extract the source texts from the monolingual dataset to prepare for translation
    inputs = [ex['translation'][source] for ex in mono_dataset]

    # Use the translator to translate the batch of source texts
    pre_outputs = translator(inputs)
    # Extract the translated texts from the translator's output
    translations = [ex['translation_text'] for ex in pre_outputs]

    # Create a new dataset from the original sentences and their translations, forming synthetic bilingual pairs
    synthetic_dataset = Dataset.from_dict({
        'translation': [{source: sentence['translation'][source], target: translation.strip()} for sentence, translation in zip(mono_dataset, translations)]
    })

    # Return the synthetic dataset
    return synthetic_dataset

In [14]:
from datasets import concatenate_datasets

# Define the main loop for fine-tuning the model
iterations = range(3)

# Store the initial paths of the output models for ES->RU and RU->ES translations
old_output_model_es_ru_path = output_model_es_ru_path
old_output_model_ru_es_path = output_model_ru_es_path

# Iterate over the specified number of iterations
for iteration in iterations:
    print(f'=======================')
    print(f'iteration: {iteration}')
    print(f'old_output_model_es_ru_path: {old_output_model_es_ru_path}')
    print(f'old_output_model_ru_es_path: {old_output_model_ru_es_path}')

    # Calculate the version of the next model by adding 1 to the current iteration
    next_model_v = iteration + 1
    # Generate new paths for the output models by appending the next model version number
    new_output_model_ru_es_path = old_output_model_ru_es_path + f'{next_model_v}'
    new_output_model_es_ru_path = old_output_model_es_ru_path + f'{next_model_v}'

    print(f'new_output_model_ru_es_path: {new_output_model_ru_es_path}')
    print(f'new_output_model_es_ru_path: {new_output_model_es_ru_path}')

    # Generate a synthetic dataset for ES->RU translation using the RU monolingual dataset and the current RU->ES model
    synthetic_dataset_es_ru = generate_synthetic_dataset(mono_dataset_ru, old_output_model_ru_es_path, source_ru, target_es)
    print(synthetic_dataset_es_ru[0])

    # Merge the newly created synthetic dataset with the original training dataset for ES->RU translation
    merged_dataset_es_ru = concatenate_datasets([train_dataset, synthetic_dataset_es_ru])
    print(merged_dataset_es_ru[0])

    # Tokenize the merged training dataset for ES->RU translation
    tokenized_train_dataset = merged_dataset_es_ru.map(preprocess_function,
        batched=True,
        fn_kwargs={
            'source': source_es,
            'target': target_ru,
            'tokenizer': tokenizer_es_ru
        })

    # Fine-tune the ES->RU translation model with the tokenized training and development datasets, then evaluate its performance
    fine_tune_model(old_output_model_es_ru_path, new_output_model_es_ru_path, tokenizer_es_ru, compute_metrics_es_ru, tokenized_train_dataset, tokenized_dev_dataset_es_ru, num_train_epochs=3)
    evaluate_model(new_output_model_es_ru_path, source_es, target_ru)

    # Repeat the process for RU->ES translation
    synthetic_dataset_ru_es = generate_synthetic_dataset(mono_dataset_es, old_output_model_es_ru_path, source_es, target_ru)
    print(synthetic_dataset_ru_es[0])

    merged_dataset_ru_es = concatenate_datasets([train_dataset, synthetic_dataset_ru_es])

    print(merged_dataset_ru_es[0])  # Print the first entry of the RU->ES merged dataset for inspection
    # Tokenize the merged training dataset for RU->ES translation
    tokenized_train_dataset = merged_dataset_ru_es.map(preprocess_function,
        batched=True,
        fn_kwargs={
            'source': source_ru,
            'target': target_es,
            'tokenizer': tokenizer_ru_es
        })

    # Fine-tune the RU->ES translation model with the tokenized training and development datasets, then evaluate its performance
    fine_tune_model(old_output_model_ru_es_path, new_output_model_ru_es_path, tokenizer_ru_es, compute_metrics_ru_es, tokenized_train_dataset, tokenized_dev_dataset_ru_es, num_train_epochs=3)
    evaluate_model(new_output_model_ru_es_path, source_ru, target_es)

    # Update the paths of the output models for the next iteration
    old_output_model_es_ru_path = new_output_model_es_ru_path
    old_output_model_ru_es_path = new_output_model_ru_es_path

iteration: 0
old_output_model_es_ru_path: /content/drive/MyDrive/apln/fine-tuned-model-es-ru
old_output_model_ru_es_path: /content/drive/MyDrive/apln/fine-tuned-model-ru-es
new_output_model_ru_es_path: /content/drive/MyDrive/apln/fine-tuned-model-ru-es1
new_output_model_es_ru_path: /content/drive/MyDrive/apln/fine-tuned-model-es-ru1
{'translation': {'es': 'Estados Unidos está orgulloso de ser uno de los países más prósperos del mundo, y puede alardearse de que en los últimos años, con la excepción de un año 2009, su PIB per cápita ha aumentado constantemente.', 'ru': 'Америка гордится тем, что является одной самых процветающих стран в мире, она может хвалиться тем, что все последние годы за исключение�� одного – 2009-го – её подушевой ВВП неуклонно рос.'}}
{'translation': {'es': '¿Debemos buscar la mayor cantidad total posible de felicidad o el mayor nivel de felicidad promedio?', 'ru': 'Должны ли мы стремиться к максимально возможному общему количеству счастья или же к максимально выс

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

Model already fine-tuned and saved at /content/drive/MyDrive/apln/fine-tuned-model-es-ru1. Skipping training.
/content/drive/MyDrive/apln/fine-tuned-model-es-ru1
{'score': 22.60148783861412, 'counts': [24005, 12578, 7554, 4652], 'totals': [46665, 44665, 42667, 40674], 'precisions': [51.44112289724633, 28.16075226687563, 17.704549183209505, 11.437281801642326], 'bp': 0.9711726261394117, 'sys_len': 46665, 'ref_len': 48030}
{'translation': {'es': 'Los bajísimos tipos de interés que ahora predominan han movido a los inversores a correr riesgos excesivos para logar un mayor rédito actual de sus carteras, en muchos casos para atender las obligaciones de devolución establecidas en los contratos de pensiones y seguros.', 'ru': 'Самые низкие процентные ставки, которые сейчас преобладают, заставляют инвесторов идти на чрезмерные риски, чтобы получить большую прибыль от своих портфелей в настоящее время, во многих случаях для выполнения обязательств по возврату, предусмотренных пенсионными и стра

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

Model already fine-tuned and saved at /content/drive/MyDrive/apln/fine-tuned-model-ru-es1. Skipping training.
/content/drive/MyDrive/apln/fine-tuned-model-ru-es1
{'score': 28.485214380520908, 'counts': [33784, 19045, 12201, 7986], 'totals': [58593, 56593, 54595, 52599], 'precisions': [57.65876469885481, 33.652571872846465, 22.34820038465061, 15.182798152056122], 'bp': 1.0, 'sys_len': 58593, 'ref_len': 56801}
iteration: 1
old_output_model_es_ru_path: /content/drive/MyDrive/apln/fine-tuned-model-es-ru1
old_output_model_ru_es_path: /content/drive/MyDrive/apln/fine-tuned-model-ru-es1
new_output_model_ru_es_path: /content/drive/MyDrive/apln/fine-tuned-model-ru-es12
new_output_model_es_ru_path: /content/drive/MyDrive/apln/fine-tuned-model-es-ru12
{'translation': {'es': 'Estados Unidos está orgulloso de ser uno de los países más prósperos del mundo, y puede alardearse de que en los últimos años, con la excepción de uno el 2009-, su PIB per cápita ha crecido constantemente.', 'ru': 'Америка го

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

Model already fine-tuned and saved at /content/drive/MyDrive/apln/fine-tuned-model-es-ru12. Skipping training.
/content/drive/MyDrive/apln/fine-tuned-model-es-ru12
{'score': 22.722523174113054, 'counts': [24007, 12649, 7600, 4694], 'totals': [46487, 44487, 42489, 40497], 'precisions': [51.6423946479661, 28.433025378200373, 17.886982513121044, 11.590982048052942], 'bp': 0.9673527372694961, 'sys_len': 46487, 'ref_len': 48030}
{'translation': {'es': 'Los bajísimos tipos de interés que ahora predominan han movido a los inversores a correr riesgos excesivos para logar un mayor rédito actual de sus carteras, en muchos casos para atender las obligaciones de devolución establecidas en los contratos de pensiones y seguros.', 'ru': 'Очень низкие процентные ставки, которые сейчас преобладают, побуждают инвесторов брать на себя чрезмерные риски, чтобы добиться большей текущей доходности своих портфелей, во многих случаях для выполнения обязательств по возврату, предусмотренных пенсионными и страхо

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

Model already fine-tuned and saved at /content/drive/MyDrive/apln/fine-tuned-model-ru-es12. Skipping training.
/content/drive/MyDrive/apln/fine-tuned-model-ru-es12
{'score': 28.750397709047416, 'counts': [33760, 19064, 12270, 8079], 'totals': [58323, 56323, 54324, 52327], 'precisions': [57.88453954700547, 33.84762885499707, 22.586702010161254, 15.43944808607411], 'bp': 1.0, 'sys_len': 58323, 'ref_len': 56801}
iteration: 2
old_output_model_es_ru_path: /content/drive/MyDrive/apln/fine-tuned-model-es-ru12
old_output_model_ru_es_path: /content/drive/MyDrive/apln/fine-tuned-model-ru-es12
new_output_model_ru_es_path: /content/drive/MyDrive/apln/fine-tuned-model-ru-es123
new_output_model_es_ru_path: /content/drive/MyDrive/apln/fine-tuned-model-es-ru123
{'translation': {'es': 'Estados Unidos está orgulloso de ser uno de los países más prósperos del mundo, y puede alarderse de que en los últimos años, con la excepción de un año  2009, su PIB per cápita ha crecido constantemente.', 'ru': 'Америк

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

Model already fine-tuned and saved at /content/drive/MyDrive/apln/fine-tuned-model-es-ru123. Skipping training.
/content/drive/MyDrive/apln/fine-tuned-model-es-ru123
{'score': 22.801704079989857, 'counts': [23975, 12672, 7639, 4732], 'totals': [46410, 44410, 42412, 40420], 'precisions': [51.659125188536954, 28.534113938302184, 18.011411864566632, 11.707075705096488], 'bp': 0.9656959265014239, 'sys_len': 46410, 'ref_len': 48030}
{'translation': {'es': 'Los bajísimos tipos de interés que ahora predominan han movido a los inversores a correr riesgos excesivos para logar un mayor rédito actual de sus carteras, en muchos casos para atender las obligaciones de devolución establecidas en los contratos de pensiones y seguros.', 'ru': 'Очень низкие процентные ставки, которые сейчас преобладают, привели инвесторов к тому, что они берут на себя чрезмерные риски, чтобы добиться большей текущей доходности своих портфелей, во многих случаях для выполнения обязательств по возврату, предусмотренных пе

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,No log,0.476358,29.3698,33.6315
2,No log,0.477307,29.375,33.823
3,No log,0.478286,29.3223,33.798


Non-default generation parameters: {'max_length': 512, 'num_beams': 6, 'bad_words_ids': [[63429]], 'forced_eos_token_id': 0}
Non-default generation parameters: {'max_length': 512, 'num_beams': 6, 'bad_words_ids': [[63429]], 'forced_eos_token_id': 0}
Non-default generation parameters: {'max_length': 512, 'num_beams': 6, 'bad_words_ids': [[63429]], 'forced_eos_token_id': 0}
There were missing keys in the checkpoint model loaded: ['model.encoder.embed_tokens.weight', 'model.encoder.embed_positions.weight', 'model.decoder.embed_tokens.weight', 'model.decoder.embed_positions.weight', 'lm_head.weight'].
Non-default generation parameters: {'max_length': 512, 'num_beams': 6, 'bad_words_ids': [[63429]], 'forced_eos_token_id': 0}


/content/drive/MyDrive/apln/fine-tuned-model-ru-es123
{'score': 28.79784865889285, 'counts': [33869, 19151, 12347, 8143], 'totals': [58534, 56534, 54536, 52541], 'precisions': [57.86209724262822, 33.87519015105954, 22.640090949097843, 15.498372699415695], 'bp': 1.0, 'sys_len': 58534, 'ref_len': 56801}


In [None]:
'''
=======================
iteration: 0
old_output_model_es_ru_path: /content/drive/MyDrive/apln/fine-tuned-model-es-ru
old_output_model_ru_es_path: /content/drive/MyDrive/apln/fine-tuned-model-ru-es
new_output_model_ru_es_path: /content/drive/MyDrive/apln/fine-tuned-model-ru-es1
new_output_model_es_ru_path: /content/drive/MyDrive/apln/fine-tuned-model-es-ru1

/content/drive/MyDrive/apln/fine-tuned-model-es-ru1
{'score': 22.60148783861412, 'counts': [24005, 12578, 7554, 4652], 'totals': [46665, 44665, 42667, 40674], 'precisions': [51.44112289724633, 28.16075226687563, 17.704549183209505, 11.437281801642326], 'bp': 0.9711726261394117, 'sys_len': 46665, 'ref_len': 48030}

/content/drive/MyDrive/apln/fine-tuned-model-ru-es1
{'score': 28.485214380520908, 'counts': [33784, 19045, 12201, 7986], 'totals': [58593, 56593, 54595, 52599], 'precisions': [57.65876469885481, 33.652571872846465, 22.34820038465061, 15.182798152056122], 'bp': 1.0, 'sys_len': 58593, 'ref_len': 56801}

=======================
iteration: 1
old_output_model_es_ru_path: /content/drive/MyDrive/apln/fine-tuned-model-es-ru1
old_output_model_ru_es_path: /content/drive/MyDrive/apln/fine-tuned-model-ru-es1
new_output_model_ru_es_path: /content/drive/MyDrive/apln/fine-tuned-model-ru-es12
new_output_model_es_ru_path: /content/drive/MyDrive/apln/fine-tuned-model-es-ru12

/content/drive/MyDrive/apln/fine-tuned-model-es-ru12
{'score': 22.722523174113054, 'counts': [24007, 12649, 7600, 4694], 'totals': [46487, 44487, 42489, 40497], 'precisions': [51.6423946479661, 28.433025378200373, 17.886982513121044, 11.590982048052942], 'bp': 0.9673527372694961, 'sys_len': 46487, 'ref_len': 48030}

/content/drive/MyDrive/apln/fine-tuned-model-ru-es12
{'score': 28.750397709047416, 'counts': [33760, 19064, 12270, 8079], 'totals': [58323, 56323, 54324, 52327], 'precisions': [57.88453954700547, 33.84762885499707, 22.586702010161254, 15.43944808607411], 'bp': 1.0, 'sys_len': 58323, 'ref_len': 56801}

=======================
iteration: 2
old_output_model_es_ru_path: /content/drive/MyDrive/apln/fine-tuned-model-es-ru12
old_output_model_ru_es_path: /content/drive/MyDrive/apln/fine-tuned-model-ru-es12
new_output_model_ru_es_path: /content/drive/MyDrive/apln/fine-tuned-model-ru-es123
new_output_model_es_ru_path: /content/drive/MyDrive/apln/fine-tuned-model-es-ru123

/content/drive/MyDrive/apln/fine-tuned-model-es-ru123
{'score': 22.801704079989857, 'counts': [23975, 12672, 7639, 4732], 'totals': [46410, 44410, 42412, 40420], 'precisions': [51.659125188536954, 28.534113938302184, 18.011411864566632, 11.707075705096488], 'bp': 0.9656959265014239, 'sys_len': 46410, 'ref_len': 48030}

/content/drive/MyDrive/apln/fine-tuned-model-ru-es123
{'score': 28.79784865889285, 'counts': [33869, 19151, 12347, 8143], 'totals': [58534, 56534, 54536, 52541], 'precisions': [57.86209724262822, 33.87519015105954, 22.640090949097843, 15.498372699415695], 'bp': 1.0, 'sys_len': 58534, 'ref_len': 56801}
'''

<table>
  <tr>
    <th>BLEU</th>
    <th colspan="2">Helsinki-NLP/opus-mt-es-ru</th>
    <th colspan="2">Helsinki-NLP/opus-mt-ru-es</th>
  </tr>
  <tr>
    <td>Base</td>
    <td>20.98</td>
    <td></td>
    <td>27.36</td>
    <td></td>
  </tr>
  <tr>
    <td>Fine-tuning</td>
    <td>21.72</td>
    <td>+0.74</td>
    <td>27.81</td>
    <td>+0.45</td>
  </tr>
  <tr>
    <td>1 iteration</td>
    <td>22.60</td>
    <td>+0.78</td>
    <td>28.48</td>
    <td>+0.67</td>
  </tr>
  <tr>
    <td>2 iteration</td>
    <td>22.72</td>
    <td>+0.12</td>
    <td>28.75</td>
    <td>+0.32</td>
  </tr>
  <tr>
    <td>3 iteration</td>
    <td>22.80</td>
    <td>+0.08</td>
    <td>28.79</td>
    <td>+0.04</td>
  </tr>
  <tr>
    <td>IN TOTAL</td>
    <td></td>
    <td>+1.72</td>
    <td></td>
    <td>+1.48</td>
  </tr>
</table>