# NLP 2 Project: Backtranslation for Domain Adaptation

In this project, you will fine-tune a translation model by backtranslating monolingual in-domain text. You will then test performance in that domain as well as general domains.

Your first task is to compare fine-tuning with backtranslation.
Next, you will explore a method of data selection.
Third, you will extend backtranslation by modifying decoding, changing the model, or using multilingual pivots.
Finally, you will explore your own research question.

This notebook provides starter code to preprocess, fine-tune, and generate with a translation model. This is enough to get you started on the task.
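
To make the core idea concrete, backtranslation can be sketched as follows: monolingual in-domain text in the target language is translated into the source language by a reverse model, and each synthetic source sentence is paired with its original target sentence. This is a minimal sketch under stated assumptions, not a reference implementation. `build_backtranslated_pairs` and `reverse_translate` are hypothetical names, and the toy lookup translator only stands in for a real ru-to-en model.

```python
from typing import Callable, Dict, List


def build_backtranslated_pairs(
    target_texts: List[str],
    reverse_translate: Callable[[List[str]], List[str]],
    src_lang: str = "en",
    tgt_lang: str = "ru",
) -> List[Dict[str, str]]:
    """Pair synthetic source sentences with their original target sentences."""
    synthetic_sources = reverse_translate(target_texts)
    if len(synthetic_sources) != len(target_texts):
        raise ValueError("reverse_translate must return one output per input")
    # The target side stays human-written; only the source side is synthetic.
    return [
        {src_lang: src, tgt_lang: tgt}
        for src, tgt in zip(synthetic_sources, target_texts)
    ]


# Toy stand-in for a real ru->en model (hypothetical, for illustration only).
toy_lexicon = {"ДНК": "DNA", "костная ткань": "bone tissue"}


def toy_reverse_translate(texts: List[str]) -> List[str]:
    return [toy_lexicon.get(text, text) for text in texts]


pairs = build_backtranslated_pairs(["ДНК", "костная ткань"], toy_reverse_translate)
print(pairs)
```

In practice, `reverse_translate` would wrap a trained target-to-source model, and the resulting pairs would be added to (or replace) the parallel training data before fine-tuning.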

In [None]:
%load_ext autoreload
%autoreload 2

In [2]:
# imports
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer
from datasets import load_dataset, Dataset, DatasetDict
from evaluate import load
import numpy as np
# import vllm
from tqdm import tqdm

## Preprocessing
First, we need to tokenize our inputs. With HF Transformers, this is fairly simple and is done for you below. Here, we use the model's tokenizer to split the inputs into the model's pre-defined numerical tokens, i.e. convert text into tensors. We also need a function to convert back from tensors into text.

In [3]:
from lib.preprocessing import preprocess_data, postprocess_predictions
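
The implementations in `lib/preprocessing.py` are not shown in this notebook. As a rough sketch of the text-to-ids and ids-to-text round trip, the example below uses a toy vocabulary in place of the model's tokenizer. It assumes labels are padded with `-100`, which is the default label padding value of `DataCollatorForSeq2Seq`; every name in the block is hypothetical.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # default label padding value of DataCollatorForSeq2Seq
PAD_ID = 0

# Toy vocabulary standing in for the model's subword vocabulary (assumption).
id_to_token: Dict[int, str] = {0: "<pad>", 1: "</s>", 2: "bone", 3: "tissue"}
token_to_id = {tok: idx for idx, tok in id_to_token.items()}


def encode(text: str, max_length: int = 6) -> List[int]:
    """Map whitespace tokens to ids, append end-of-sequence, pad with IGNORE_INDEX."""
    ids = [token_to_id[tok] for tok in text.split()] + [token_to_id["</s>"]]
    ids = ids[:max_length]
    return ids + [IGNORE_INDEX] * (max_length - len(ids))


def decode(ids: List[int]) -> str:
    """Replace IGNORE_INDEX with the pad id, then drop special tokens."""
    cleaned = [PAD_ID if i == IGNORE_INDEX else i for i in ids]
    tokens = [id_to_token[i] for i in cleaned]
    return " ".join(t for t in tokens if t not in ("<pad>", "</s>"))


label_ids = encode("bone tissue")
print(label_ids)
print(decode(label_ids))
```

The replacement step matters because `-100` is not a valid vocabulary id, so label tensors generally cannot be decoded until it has been swapped for the pad token id.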

## Evaluation
During fine-tuning, we need to see how good the outputs are on our dev set. For this, we can use the BLEU score (Papineni et al., 2002). The metric function decodes the predicted token tensors and computes the BLEU score.

On our test sets, we also want to calculate an automatic metric, but on decoded text. We can use BLEU again, but also more advanced metrics like COMET. It's up to you to implement your choice of metric. We will discuss some metrics from the literature in class. It's always good to use at least 2 metrics.

In [4]:
from lib.metrics import compute_comet, compute_bleu
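
To show what BLEU measures, here is a simplified corpus-level version: clipped n-gram precision for n = 1 to 4, combined by geometric mean and multiplied by a brevity penalty. It is a teaching sketch only. It assumes whitespace tokenization, one reference per hypothesis, and no smoothing, and `simple_corpus_bleu` is a hypothetical name. The `compute_bleu` helper in `lib/metrics.py` is not shown here and may differ; reported scores should come from an established implementation.

```python
import math
from collections import Counter
from typing import List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_corpus_bleu(references: List[str], hypotheses: List[str], max_n: int = 4) -> float:
    """Simplified corpus BLEU on a 0-100 scale (whitespace tokens, no smoothing)."""
    matches = [0] * max_n
    totals = [0] * max_n
    ref_len = hyp_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tok, hyp_tok = ref.split(), hyp.split()
        ref_len += len(ref_tok)
        hyp_len += len(hyp_tok)
        for n in range(1, max_n + 1):
            ref_counts = ngram_counts(ref_tok, n)
            hyp_counts = ngram_counts(hyp_tok, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives a zero score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # The brevity penalty punishes hypotheses shorter than the references.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


references = ["the bone tissue shows signs of heat destruction"]
hypotheses = ["the bone tissue shows signs of thermal destruction"]
print(round(simple_corpus_bleu(references, hypotheses), 2))
```

Because the score depends on exact n-gram overlap, it is sensitive to tokenization and to valid paraphrases, which is one reason to pair it with a learned metric such as COMET.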

## Fine-tuning
Now that we've tokenized our data and got our evaluation ready, we can start fine-tuning (i.e., training from a pre-trained model). This is a minimal training loop.

We also need to generate at test time from a text dataset. This function involves generation without calculating gradients.

In [5]:
from lib.training_utils import train_model, translate_text
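
The code of `translate_text` is in `lib/training_utils.py` and is not reproduced here. The batching part of test-time generation can be sketched as below, with a pluggable `generate_fn` so the example runs without a model. `batched`, `translate_in_batches`, and `generate_fn` are hypothetical names. With a real model, `generate_fn` would tokenize the batch, call `model.generate` inside a `torch.no_grad()` context, and decode the outputs.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def translate_in_batches(
    texts: Sequence[str],
    generate_fn: Callable[[List[str]], List[str]],
    batch_size: int = 64,
) -> List[str]:
    """Apply generate_fn batch by batch and keep the outputs in input order."""
    translations: List[str] = []
    for batch in batched(texts, batch_size):
        translations.extend(generate_fn(list(batch)))
    return translations


texts = [f"sentence {i}" for i in range(1000)]
num_batches = sum(1 for _ in batched(texts, 64))
outputs = translate_in_batches(texts, lambda batch: [t.upper() for t in batch])
print(num_batches, len(outputs))
```

With 1000 examples and a batch size of 64, this gives 16 batches, the last one holding the remaining 40 examples.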

## Final Setup
We now have all the ingredients to run our experiments. This is all standard training code; the interesting results come from what you do with the data. Below, we give an initial setup for getting the code running (either in Colab or on Snellius).

In [28]:

SRC_LANG = "en"
TGT_LANG = "ru"
MODEL_NAME = "Helsinki-NLP/opus-mt-en-ru"
TRAIN_DATASET_NAME = "sethjsa/medline_en_ru_parallel"
DEV_DATASET_NAME = "sethjsa/medline_en_ru_parallel"
TEST_DATASET_NAME = "sethjsa/medline_en_ru_parallel"
OUTPUT_DIR = "./results/checkpoints"

train_dataset = load_dataset(TRAIN_DATASET_NAME)
dev_dataset = load_dataset(DEV_DATASET_NAME)
test_dataset = load_dataset(TEST_DATASET_NAME)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

# change the splits for actual training as needed; here the train split of the MEDLINE dataset is used (7,500 examples)
tokenized_train_dataset = preprocess_data(train_dataset, tokenizer, SRC_LANG, TGT_LANG, "train")
tokenized_dev_dataset = preprocess_data(dev_dataset, tokenizer, SRC_LANG, TGT_LANG, "dev")
# Note(jp): Here test is the same as dev
tokenized_test_dataset = preprocess_data(test_dataset, tokenizer, SRC_LANG, TGT_LANG, "dev")

# print sizes
print(f"Train dataset size: {len(tokenized_train_dataset)}")
print(f"Dev dataset size: {len(tokenized_dev_dataset)}")
print(f"Test dataset size: {len(tokenized_test_dataset)}")

tokenized_datasets = DatasetDict({
    "train": tokenized_train_dataset,
    "dev": tokenized_dev_dataset,
    "test": tokenized_test_dataset
})

# {'learning_rate': 0.0001, 'weight_decay': 0.05, 'num_train_epochs': 8}
# modify these as you wish; RQ3 could involve testing effects of various hyperparameters
training_args = Seq2SeqTrainingArguments(
    torch_compile=True, # generally speeds up training, try without it to see if it's faster for small datasets
    output_dir=OUTPUT_DIR,
    evaluation_strategy="epoch",
    learning_rate=0.0001,
    per_device_train_batch_size=32, # change batch sizes to fit your GPU memory and train faster
    per_device_eval_batch_size=128,
    weight_decay=0.05,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    save_total_limit=1, # modify this to save more checkpoints
    num_train_epochs=8, # modify this to train more epochs
    predict_with_generate=True,
    generation_num_beams=4,
    generation_max_length=128,
    no_cuda=False,  # Set to False to enable GPU
    fp16=True,      # Enable mixed precision training for faster training
)

Train dataset size: 7500
Dev dataset size: 1000
Test dataset size: 1000




In [29]:
# fine-tune model
model_finetuned = train_model(MODEL_NAME, tokenized_datasets, tokenizer, training_args)

Using GPU: NVIDIA A100-SXM4-40GB



| Epoch | Training Loss | Validation Loss | BLEU |
|---|---|---|---|
| 1 | No log | 0.825157 | 5.395509 |
| 2 | No log | 0.690568 | 7.257873 |
| 3 | 1.034600 | 0.650453 | 8.324024 |
| 4 | 1.034600 | 0.621671 | 8.954924 |
| 5 | 0.577100 | 0.607129 | 9.217407 |
| 6 | 0.577100 | 0.597891 | 9.487136 |
| 7 | 0.460100 | 0.60147 | 9.918562 |
| 8 | 0.460100 | 0.599004 | 9.546207 |
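
The `No log` entries and the repeated training-loss values follow from how often the trainer logs. The arithmetic can be sketched as below, assuming a single GPU, no gradient accumulation, and the default `logging_steps=500`; none of these are set explicitly in the training arguments above, so treat them as assumptions.

```python
import math

train_examples = 7500      # "Train dataset size" printed above
batch_size = 32            # per_device_train_batch_size
num_epochs = 8             # num_train_epochs
logging_steps = 500        # assumed: the Trainer default, not set in the arguments

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Epoch in which each logging step falls (epochs are numbered from 1).
epochs_with_new_log = [
    math.ceil(step / steps_per_epoch)
    for step in range(logging_steps, total_steps + 1, logging_steps)
]

print(steps_per_epoch, total_steps, epochs_with_new_log)
```

Under these assumptions, a new training-loss value first appears in epochs 3, 5, and 7, and the epochs in between repeat the most recent logged value. Setting a smaller `logging_steps` would give a finer-grained loss curve.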




# Evaluate the model

In [30]:
# Test baseline model
predictions = translate_text(test_dataset["dev"][SRC_LANG], model, tokenizer, max_length=128, batch_size=64)

Using device: cuda


Translating: 100%|█████████████████████████████████████████████████████████████████████████████| 16/16 [01:08<00:00,  4.30s/it]


In [31]:
# evaluate checkpoint
# checkpoint_model_path = "results/checkpoints/final_model/best-tuned-opus-mt-en-ru"

# checkpoint_model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint_model_path)
# checkpoint_tokenizer = AutoTokenizer.from_pretrained(checkpoint_model_path)

predictions_finetuned = translate_text(test_dataset["dev"][SRC_LANG], model_finetuned,
                                       tokenizer, max_length=128, batch_size=64)

Using device: cuda


Translating: 100%|█████████████████████████████████████████████████████████████████████████████| 16/16 [01:27<00:00,  5.48s/it]


In [36]:
print(test_dataset["dev"][SRC_LANG][0])
print(test_dataset["dev"][TGT_LANG][0])
print(predictions[0])
print(predictions_finetuned[0])
# В случаях более тяжелого термального воздействия, при костной ткани хорошо выраженные признаки термического воздействия, при

In the cases of a more rigorous thermal impact when the bone tissue exhibits well pronounced signs of heat destruction, it should be considered as inherently unsuitable for genotyping of mtDNA.
Установили, что хромосомная ДНК уступает в аналитической устойчивости мтДНК.
В случае более жесткого термического удара, когда костная ткань имеет хорошо заметные признаки термического разрушения, ее следует рассматривать как по своей сути непригодную для гнилого гнилого гнильного гнильного гнильного гнильного гнильного гнильного гнилого гнильного гнильного гнильного гнили с гнилью гнилого смолы и гнилого смолы.
В случаях более тяжелого темплематического воздействия, при костной ткани хрошо выраженные признаки термического воздействия,


In [37]:
# Source: In the cases of a more rigorous thermal impact when the bone tissue exhibits well pronounced signs of heat destruction, it should be considered as inherently unsuitable for genotyping of mtDNA.
# Target: Установили, что хромосомная ДНК уступает в аналитической устойчивости мтДНК.
# Prediction: В случаях более тяжелого термического воздействия, при костной ткани хорошо выраженные признаки терапии, должно считать, что

In [38]:
bleu_baseline = compute_bleu(
        test_dataset["dev"][TGT_LANG],
        predictions)

bleu_finetuned = compute_bleu(
        test_dataset["dev"][TGT_LANG],
        predictions_finetuned)

print(f"BLEU score for baseline model: {bleu_baseline}")
print(f"BLEU score for fine-tuned model: {bleu_finetuned}")

# BLEU score for baseline model: 6.71
# BLEU fine-tuned-opus: 5.36
# BLEU HILRfine-tuned-opus: 6.03

BLEU score for baseline model: 6.714642882868273
BLEU score for fine-tuned model: 6.379819414189476


In [39]:
comet_baseline = compute_comet(
        test_dataset["dev"][SRC_LANG],
        test_dataset["dev"][TGT_LANG],
        predictions)

comet_finetuned = compute_comet(
        test_dataset["dev"][SRC_LANG],
        test_dataset["dev"][TGT_LANG],
        predictions_finetuned)

print(f"COMET score for baseline model: {comet_baseline}")
print(f"COMET score for fine-tuned model: {comet_finetuned}")

Fetching 5 files:   0%|          | 0/5 [00:00<?, ?it/s]

Lightning automatically upgraded your loaded checkpoint from v1.8.3.post1 to v2.5.1. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../../../../home/scur2189/.cache/huggingface/hub/models--Unbabel--wmt22-comet-da/snapshots/2760a223ac957f30acfb18c8aa649b01cf1d75f2/checkpoints/model.ckpt`
Encoder model frozen.
/home/scur2189/.conda/envs/nmt/lib/python3.10/site-packages/pytorch_lightning/core/saving.py:195: Found keys that are not in the model state dict but in the checkpoint: ['encoder.model.embeddings.position_ids']
/home/scur2189/.conda/envs/nmt/lib/python3.10/site-packages/lightning_fabric/plugins/environments/slurm.py:204: The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python /home/scur2189/.conda/envs/nmt/lib/python3.10/site-p ...
You are using the plain ModelCheckpoint callback. Consider using Lit

COMET score for baseline model: 0.4779623000472784
COMET score for fine-tuned model: 0.5186686627268792


In [None]:
test_dataset["dev"][SRC_LANG][0]

'In the cases of a more rigorous thermal impact when the bone tissue exhibits well pronounced signs of heat destruction, it should be considered as inherently unsuitable for genotyping of mtDNA.'

In [12]:
test_dataset["dev"][TGT_LANG][0]

'Установили, что хромосомная ДНК уступает в аналитической устойчивости мтДНК.'

In [14]:
predictions[0]

'В случае более жесткого термического удара, когда костная ткань имеет хорошо заметные признаки термического разрушения, ее следует рассматривать как по своей сути непригодную для гнилого гнилого гнильного гнильного гнильного гнильного гнильного гнильного гнилого гнильного гнильного гнильного гнили с гнилью гнилого смолы и гнилого смолы.'

In [15]:
predictions_finetuned[0]

'В случаях более тяжелого термального воздействия, при костной ткани хорошо выраженные признаки термического воздействия, при'

In [None]:
!pip freeze | grep "evaluate" # evaluate==0.4.3

evaluate==0.4.3


In [17]:
!python3 scripts/evaluate_test.py --checkpoint_path results/checkpoints/final_model/best-tuned-opus-mt-en-ru

[INFO] Loading checkpoint model from: results/checkpoints/final_model/best-tuned-opus-mt-en-ru
[INFO] Loading datasets...
[INFO] Loading dataset: sethjsa/medline_en_ru_parallel (dev)
[INFO] Preprocessing data...
[INFO] Translating test data...
Translating: 100%|██████████████████████████████| 16/16 [01:25<00:00,  5.32s/it]
[INFO] Computing metrics...

In [18]:
!python3 scripts/evaluate_test.py

[INFO] Loading baseline model from: Helsinki-NLP/opus-mt-en-ru
[INFO] Loading datasets...
[INFO] Loading dataset: sethjsa/medline_en_ru_parallel (dev)
[INFO] Preprocessing data...
[INFO] Translating test data...
Translating: 100%|██████████████████████████████| 16/16 [01:08<00:00,  4.26s/it]
[INFO] Computing metrics...

You will find all the datasets for this project under: https://huggingface.co/sethjsa

For other models, consider "Helsinki-NLP/opus-mt-en-ru" (general MT model), "glazzova/translation_en_ru" (tuned on biomedical domain), or "facebook/m2m100_418M" (multilingual model with 100 languages -- consider using for multilingual pivot experiments).

To read more about the WMT Biomedical test data, see here: https://aclanthology.org/2022.wmt-1.69/

# Advanced



ONLY if you have GPU hours left and want to generate backtranslations with an LLM, consider using vLLM for faster generation. An example function is given below.

In [None]:

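# if using LLM for generation, consider using vllm for faster generation
import vllm  # requires the vllm package; the import in the first cell is commented out


def translate_text_vllm(texts, model_name, tokenizer, max_length=128, batch_size=32):
    """
    Translate texts using vllm for faster generation

    Args:
        texts: List of texts to translate (for an LLM, these should already be full prompts)
        model_name: Name or path of the model (str)
        tokenizer: Tokenizer name or path (str), or a Hugging Face tokenizer object
        max_length: Maximum number of generated tokens per text
        batch_size: Batch size for translation
    Returns:
        translations: List of translated texts
    """
    # vllm.LLM expects the tokenizer as a name or path, not a tokenizer object
    tokenizer_name = getattr(tokenizer, "name_or_path", tokenizer)

    # Use model_name instead of model object
    llm = vllm.LLM(
        model=model_name,
        tokenizer=tokenizer_name,
        tensor_parallel_size=1,
        # vLLM may reject this value if it is smaller than the model's context length;
        # drop the argument in that case
        max_num_batched_tokens=max_length * batch_size
    )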

    # Create sampling params
    sampling_params = vllm.SamplingParams(
        temperature=0.0,  # Equivalent to greedy decoding
        max_tokens=max_length,
        stop=None
    )

    # Generate translations in batches
    translations = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        outputs = llm.generate(batch, sampling_params)

        # Extract generated text from outputs
        batch_translations = [output.outputs[0].text for output in outputs]
        translations.extend(batch_translations)

    return translations
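
Unlike the seq2seq models above, an instruction-following LLM typically needs an explicit instruction before it will translate. Below is a minimal sketch of formatting inputs before passing them to `translate_text_vllm`. The template wording, `LANGUAGE_NAMES`, and `build_translation_prompts` are hypothetical, and the best prompt format depends on the chosen model.

```python
from typing import List

# Hypothetical mapping from language codes to names used in the prompt.
LANGUAGE_NAMES = {"en": "English", "ru": "Russian"}


def build_translation_prompts(texts: List[str], src_lang: str, tgt_lang: str) -> List[str]:
    """Wrap each input text in a simple translation instruction."""
    src_name = LANGUAGE_NAMES[src_lang]
    tgt_name = LANGUAGE_NAMES[tgt_lang]
    return [
        f"Translate the following {src_name} text into {tgt_name}.\n"
        f"{src_name}: {text.strip()}\n"
        f"{tgt_name}:"
        for text in texts
    ]


# For backtranslation, the direction is target (ru) to source (en).
prompts = build_translation_prompts(["  ДНК  "], src_lang="ru", tgt_lang="en")
print(prompts[0])
```

The generated text may continue past the translation, so some post-processing (for example, keeping only the first line of the output) is likely to be needed.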