# NLP 2 Project: Backtranslation for Domain Adaptation

In this project, you will fine-tune a translation model by backtranslating monolingual in-domain text. You will then test performance in that domain as well as general domains.

Your first task is to compare fine-tuning with backtranslation.
Next, you will explore a method of data selection.
Third, you will extend backtranslation by modifying decoding, changing the model, or using multilingual pivots.
Finally, you will explore your own research question.

This notebook provides starter code to preprocess, fine-tune, and generate with a translation model. This is enough to get you started on the task.
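
For orientation, the core backtranslation step can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's reference implementation: `build_backtranslated_pairs` and `reverse_translate` are hypothetical names, and in practice `reverse_translate` would wrap a target-to-source model (a ru→en model when adapting an en→ru system). The toy translator only exists so the sketch runs without downloading anything.

```python
from typing import Callable, Dict, List


def build_backtranslated_pairs(
    target_sentences: List[str],
    reverse_translate: Callable[[List[str]], List[str]],
    src_lang: str = "en",
    tgt_lang: str = "ru",
) -> List[Dict[str, str]]:
    """Pair real in-domain target text with synthetic (machine-translated) source text."""
    synthetic_sources = reverse_translate(target_sentences)
    if len(synthetic_sources) != len(target_sentences):
        raise ValueError("reverse_translate must return one output per input")
    # The target side stays human-written; only the source side is synthetic.
    return [
        {src_lang: src, tgt_lang: tgt}
        for src, tgt in zip(synthetic_sources, target_sentences)
    ]


def toy_reverse_translate(sentences: List[str]) -> List[str]:
    """Stand-in for a real ru->en model (assumption: replace with actual generation)."""
    return [f"<en translation of: {s}>" for s in sentences]


pairs = build_backtranslated_pairs(["Материал и методы."], toy_reverse_translate)
print(pairs)
```

The resulting list of dictionaries uses the language codes as keys, so it could be wrapped with `Dataset.from_list(pairs)` and fine-tuned on like ordinary parallel data.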

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
# imports
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer
from datasets import load_dataset, Dataset, DatasetDict
from evaluate import load
import numpy as np
# import vllm
from tqdm import tqdm

## Preprocessing
First, we need to tokenize our inputs. With HF Transformers, this is fairly simple and is done for you below. Here, we use the model's tokenizer to split the inputs into the model's pre-defined numerical tokens, i.e. convert text into tensors. We also need a function to convert back from tensors into text.
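
The contents of `lib.preprocessing` are not shown in this notebook, so the following is only a sketch of what the tokenization step might look like. `tokenize_batch` is a hypothetical helper, and it assumes the dataset columns are named by language code (as `test_dataset["dev"][SRC_LANG]` suggests later on). The `text_target` argument is the Hugging Face tokenizer option that places the tokenized targets under `labels`.

```python
from typing import Any, Dict, List


def tokenize_batch(
    batch: Dict[str, List[str]],
    tokenizer: Any,
    src_lang: str,
    tgt_lang: str,
    max_length: int = 128,
) -> Dict[str, Any]:
    """Tokenize source sentences as model inputs and target sentences as labels."""
    return tokenizer(
        batch[src_lang],               # source side -> input_ids / attention_mask
        text_target=batch[tgt_lang],   # target side -> labels
        max_length=max_length,
        truncation=True,
    )
```

With a `datasets` split this could be applied as `split.map(lambda b: tokenize_batch(b, tokenizer, "en", "ru"), batched=True)`; padding is left to `DataCollatorForSeq2Seq` at batch time.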

In [3]:
from lib.preprocessing import preprocess_data, postprocess_predictions

## Evaluation
During fine-tuning, we need to see how good the outputs are on our dev set. For this, we can use the BLEU score (Papineni et al., 2002). This function decodes the predicted token tensors and computes the BLEU score.

On our test sets, we also want to calculate an automatic metric, but on decoded text. We can use BLEU again, but also more advanced metrics like COMET. It's up to you to implement your choice of metric. We will discuss some metrics from the literature in class. It's always good to use at least 2 metrics.
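
As a rough sketch of the "decode, then score" step used during fine-tuning: the function below builds a `compute_metrics` callback of the shape `Seq2SeqTrainer` expects. The names `make_compute_metrics` and `score_fn` are hypothetical, and `score_fn` stands in for whichever corpus-level metric you choose (it is not the project's `compute_bleu`). The one detail worth noting is that padded label positions are stored as `-100`, which cannot be decoded and has to be replaced with the pad token id first.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple

IGNORE_INDEX = -100  # value the seq2seq data collator uses for padded label positions


def make_compute_metrics(
    tokenizer: Any,
    score_fn: Callable[[List[str], List[str]], float],
) -> Callable[[Tuple[Sequence[Sequence[int]], Sequence[Sequence[int]]]], Dict[str, float]]:
    """Build a callback that decodes token ids and scores them with `score_fn`."""

    def _decode(sequences: Sequence[Sequence[int]]) -> List[str]:
        pad_id = tokenizer.pad_token_id
        # Replace the ignore index so that every id is a valid vocabulary entry.
        cleaned = [[int(t) if t != IGNORE_INDEX else pad_id for t in seq] for seq in sequences]
        decoded = tokenizer.batch_decode(cleaned, skip_special_tokens=True)
        return [text.strip() for text in decoded]

    def compute_metrics(eval_preds) -> Dict[str, float]:
        preds, labels = eval_preds
        return {"bleu": score_fn(_decode(preds), _decode(labels))}

    return compute_metrics
```

Passing the scoring function in as an argument keeps the decoding logic reusable if you later swap BLEU for another metric.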

In [4]:
from lib.metrics import compute_comet, compute_bleu

## Fine-tuning
Now that we've tokenized our data and got our evaluation ready, we can start fine-tuning (i.e., training from a pre-trained model). This is a minimal training loop.

We also need to generate at test time from a text dataset. This function involves generation without calculating gradients.
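
A minimal sketch of gradient-free batched generation is given here for reference. It is not the code in `lib.training_utils`; `batched` and `generate_translations` are hypothetical names, and the defaults simply mirror values used elsewhere in this notebook (`max_length=128`, `batch_size=64`, four beams).

```python
from typing import Any, Iterator, List


def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def generate_translations(
    texts: List[str],
    model: Any,
    tokenizer: Any,
    max_length: int = 128,
    batch_size: int = 64,
    num_beams: int = 4,
) -> List[str]:
    """Translate `texts` with a Hugging Face seq2seq model, without tracking gradients."""
    import torch  # imported lazily so `batched` can be used without PyTorch installed

    model.eval()
    device = next(model.parameters()).device
    translations: List[str] = []
    with torch.no_grad():  # inference only: no autograd graph, lower memory use
        for batch in batched(texts, batch_size):
            inputs = tokenizer(
                batch,
                return_tensors="pt",
                padding=True,
                truncation=True,
                max_length=max_length,
            ).to(device)
            output_ids = model.generate(**inputs, max_length=max_length, num_beams=num_beams)
            translations.extend(tokenizer.batch_decode(output_ids, skip_special_tokens=True))
    return translations
```

With 1000 test sentences and a batch size of 64, this loop runs for 16 batches.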

In [5]:
from lib.training_utils import train_model, translate_text

## Final Setup
We now have all the ingredients to run our experiments. This is all standard training code; the interesting results come from what you do with the data. Below, we give an initial setup for getting the code running (either in Colab or on Snellius).

In [6]:

SRC_LANG = "en"
TGT_LANG = "ru"
MODEL_NAME = "Helsinki-NLP/opus-mt-en-ru"
TRAIN_DATASET_NAME = "sethjsa/medline_en_ru_parallel"
DEV_DATASET_NAME = "sethjsa/medline_en_ru_parallel"
TEST_DATASET_NAME = "sethjsa/medline_en_ru_parallel"
OUTPUT_DIR = "./results/checkpoints"

train_dataset = load_dataset(TRAIN_DATASET_NAME)
dev_dataset = load_dataset(DEV_DATASET_NAME)
test_dataset = load_dataset(TEST_DATASET_NAME)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

# change the datasets/splits for your actual experiments; here all three come from the same MEDLINE en-ru parallel dataset
tokenized_train_dataset = preprocess_data(train_dataset, tokenizer, SRC_LANG, TGT_LANG, "train")
tokenized_dev_dataset = preprocess_data(dev_dataset, tokenizer, SRC_LANG, TGT_LANG, "dev")
# Note(jp): Here test is the same as dev
tokenized_test_dataset = preprocess_data(test_dataset, tokenizer, SRC_LANG, TGT_LANG, "dev")

# print sizes
print(f"Train dataset size: {len(tokenized_train_dataset)}")
print(f"Dev dataset size: {len(tokenized_dev_dataset)}")
print(f"Test dataset size: {len(tokenized_test_dataset)}")

tokenized_datasets = DatasetDict({
    "train": tokenized_train_dataset,
    "dev": tokenized_dev_dataset,
    "test": tokenized_test_dataset
})

# modify these as you wish; RQ3 could involve testing effects of various hyperparameters
training_args = Seq2SeqTrainingArguments(
    torch_compile=True, # generally speeds up training, try without it to see if it's faster for small datasets
    output_dir=OUTPUT_DIR,
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=128, # change batch sizes to fit your GPU memory and train faster
    per_device_eval_batch_size=128,
    weight_decay=0.01,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    save_total_limit=1, # modify this to save more checkpoints
    num_train_epochs=1, # modify this to train more epochs
    predict_with_generate=True,
    generation_num_beams=4,
    generation_max_length=128,
    no_cuda=False,  # Set to False to enable GPU
    fp16=True,      # Enable mixed precision training for faster training
)

Train dataset size: 7500
Dev dataset size: 1000
Test dataset size: 1000




In [None]:
# fine-tune model
model_finetuned = train_model(MODEL_NAME, tokenized_datasets, tokenizer, training_args)

Using GPU: NVIDIA A100-SXM4-40GB



# Evaluate the model

In [13]:
# test baseline model
predictions = translate_text(test_dataset["dev"][SRC_LANG], model, tokenizer, max_length=128, batch_size=64)

Translating: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [01:08<00:00,  4.30s/it]


In [16]:
bleu_score = compute_bleu(
        # test_dataset["dev"][SRC_LANG],
        test_dataset["dev"][TGT_LANG],
        predictions)
print(bleu_score)

6.714642882868273


In [19]:
comet_score = compute_comet(
        test_dataset["dev"][SRC_LANG],
        test_dataset["dev"][TGT_LANG],
        predictions)
print(comet_score)

Lightning automatically upgraded your loaded checkpoint from v1.8.3.post1 to v2.5.1. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../../../../home/scur2189/.cache/huggingface/hub/models--Unbabel--wmt22-comet-da/snapshots/2760a223ac957f30acfb18c8aa649b01cf1d75f2/checkpoints/model.ckpt`


Encoder model frozen.
/home/scur2189/.conda/envs/nmt/lib/python3.10/site-packages/pytorch_lightning/core/saving.py:195: Found keys that are not in the model state dict but in the checkpoint: ['encoder.model.embeddings.position_ids']


ValueError: Module inputs don't match the expected format.
Expected format: {'sources': Value(dtype='string', id='sequence'), 'predictions': Value(dtype='string', id='sequence'), 'references': Value(dtype='string', id='sequence')},
Input sources: ['In the cases of a more rigorous thermal impact when the bone tissue exhibits well pronounced signs of heat destruction, it should be considered as inherently unsuitable for genotyping of mtDNA.', 'It was shown that chromosomal DNA is inferior to mtDNA in terms of heat resistance.  This finding agrees with the currently adopted view, however this advantage of mtDNA is relatively insignificant from the standpoint of genotyping efficiency.', '[The dynamics of the dimensional characteristics of the sella turcica in the subjects above 20 years of age].  The objective of the present study was to determine the biological age of the unidentified dead subjects based on the morphometric characteristics of the sella turcica in comparison with other methods available for the purpose in order to narrow the range of the alleged ages of the human remains being examined.', ..., 'PATIENTS AND METHODS', 'It has been carried out a prospective analysis of treatment of 218 patients with scars of different duration, locations and anatomic areas with the use of CO2-laser for the period 2011-2017.', 'POSAS scale and sonography were used for analysis.'],
Input predictions: ['В случае более жесткого термического удара, когда костная ткань имеет хорошо заметные признаки термического разрушения, ее следует рассматривать как по своей сути непригодную для гнилого гнилого гнильного гнильного гнильного гнильного гнильного гнильного гнилого гнильного гнильного гнильного гнили с гнилью гнилого смолы и гнилого смолы.', 'Было продемонстрировано, что хромосомная ДНК с точки зрения теплостойкости ниже mtDNA, что согласуется с принятой в настоящее время точкой зрения, однако это преимущество mtden является относительно незначительным с точки зрения с точки зрения mut grum grum grum sub sub sub sub sub sub sub sub sub sub sub sub sub sub sub int int int int int it sput sput', 'Цель настоящего исследования состояла в том, чтобы определить биологический возраст неопознанных мёртвых субъектов на основе морфометрических характеристик цокольной дыры по сравнению с другими имеющимися для этой цели методами, с тем чтобы сузить диапазон предполагаемых возрастных групп хребтов хребтовой хижины хижины хижины хижины хижины хижины хижины хижины хижины.', ..., 'СОВЕЩАНИЕ ПО ПО ПОДДЕРЖАНИЮ ЭТИХ ПОМОЩЬ ПО ПОМОЩЬ В СВЯЗИ С С ПОМОЩЬЮ ПО СВЯЗЯМ С С ПОМОЩЬЮ ПО СВЯЗЯМ С С ПОМОЩЬЮ ПО СВЯЗЯМ С С СВЯЗЯМ С С ПОМОЩЬЮ ПО СВЯЗЯМ С С ПОАМАМ ПО ПО САМАМАМ', 'Был проведен перспективный анализ лечения 218 пациентов с шрамами различной продолжительности, местоположениями и анатомическими зонами с использованием в период с 2011 по 2017 годы кустарниковых шрамов и кустарниковых черепах с использованием кустарниковых черепах.', 'Сценарий и сонография по сценарию по сухости и сухости по сухости и сухости по сухости и сухости по сухости и сухости по сухости и сухости по сухости по сухости и сухости по сухости и сухости.'],
Input references: [['Установили, что хромосомная ДНК уступает в аналитической устойчивости мтДНК.'], ['Это соответствует устоявшимся представлениям, однако, с точки зрения эффективности генотипирования, это преимущество мтДНК относительно невелико.'], ['Цель исследования - установление биологического возрастного периода неопознанных погибших лиц с помощью морфометрических параметров турецкого седла для использования их с уже имеющимися методами, чтобы сузить диапазон предполагаемого возраста изучаемых человеческих останков.'], ..., ['Материал и методы.'], ['Проведен проспективный анализ лечения 218 пациентов с рубцами различных сроков существования, площадей поражения и анатомических областей с использованием CO2-лазера в период с 2011 по 2017 г.'], ['Оценку эффективности лечения проводили с помощью шкалы POSAS и ультразвукового исследования.']]
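
The error above shows the references arriving as a list of single-element lists (`[['...'], ['...']]`), while sources and predictions are flat lists of strings, which is the format the metric expects for all three. One possible fix is sketched below. `flatten_references` and `comet_score` are hypothetical helpers (the actual `compute_comet` in `lib.metrics` is not shown), and the sketch assumes the default `comet` metric from the `evaluate` library, whose result dictionary contains a `mean_score` entry.

```python
from typing import List, Sequence, Union


def flatten_references(references: Sequence[Union[str, Sequence[str]]]) -> List[str]:
    """Turn [['ref'], ['ref'], ...] into ['ref', 'ref', ...]; flat strings pass through."""
    flat: List[str] = []
    for ref in references:
        if isinstance(ref, str):
            flat.append(ref)
        elif len(ref) == 1:
            flat.append(ref[0])
        else:
            raise ValueError(f"expected exactly one reference per segment, got {len(ref)}")
    return flat


def comet_score(sources: Sequence[str], references, predictions: Sequence[str]) -> float:
    """Score predictions with COMET after normalizing the reference format."""
    from evaluate import load  # lazy import: loading COMET downloads a large checkpoint

    refs = flatten_references(references)
    if not (len(sources) == len(refs) == len(predictions)):
        raise ValueError("sources, references and predictions must have the same length")
    comet = load("comet")
    result = comet.compute(
        sources=list(sources),
        predictions=list(predictions),
        references=refs,
    )
    return result["mean_score"]
```

The nested format is what BLEU implementations with multiple references per segment usually take, so it is worth checking which shape each metric function expects before sharing reference lists between them.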

You will find all the datasets for this project under: https://huggingface.co/sethjsa

For other models, consider "Helsinki-NLP/opus-mt-en-ru" (general MT model), "glazzova/translation_en_ru" (tuned on biomedical domain), or "facebook/m2m100_418M" (multilingual model with 100 languages -- consider using for multilingual pivot experiments).
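
If you try the multilingual pivot idea, the composition itself is simple, as the sketch below illustrates. All function names here are hypothetical. The M2M100 part assumes the standard `transformers` usage for this model family: set `tokenizer.src_lang` before encoding and force the target language with `forced_bos_token_id=tokenizer.get_lang_id(...)`. The model and tokenizer are passed in so that one loaded checkpoint can serve both translation legs.

```python
from typing import Any, Callable, List

Translator = Callable[[List[str]], List[str]]


def pivot_translate(texts: List[str], src_to_pivot: Translator, pivot_to_tgt: Translator) -> List[str]:
    """Translate via an intermediate language: source -> pivot -> target."""
    return pivot_to_tgt(src_to_pivot(texts))


def m2m100_translator(model: Any, tokenizer: Any, src_lang: str, tgt_lang: str,
                      max_length: int = 128) -> Translator:
    """Wrap an already loaded M2M100 model as a src_lang -> tgt_lang translator.

    Assumed loading code (not executed here):
        from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer
        tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
        model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
    """

    def translate(texts: List[str]) -> List[str]:
        tokenizer.src_lang = src_lang  # tells the tokenizer which language tag to prepend
        inputs = tokenizer(texts, return_tensors="pt", padding=True,
                           truncation=True, max_length=max_length)
        output_ids = model.generate(
            **inputs,
            forced_bos_token_id=tokenizer.get_lang_id(tgt_lang),  # force the output language
            max_length=max_length,
        )
        return tokenizer.batch_decode(output_ids, skip_special_tokens=True)

    return translate
```

For example, `pivot_translate(texts, m2m100_translator(model, tokenizer, "ru", "de"), m2m100_translator(model, tokenizer, "de", "en"))` would backtranslate Russian text into English through German; the choice of German as the pivot is purely illustrative.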

To read more about the WMT Biomedical test data, see here: https://aclanthology.org/2022.wmt-1.69/

# Advanced



ONLY if you have GPU hours left and want to generate backtranslations with an LLM, consider using vLLM for faster generation. An example function is given below.

In [None]:

# if using an LLM for generation, consider using vllm for faster generation
def translate_text_vllm(texts, model_name, tokenizer=None, max_length=128, batch_size=32):
    """
    Translate texts using vllm for faster generation

    Args:
        texts: List of texts to translate (for an instruction-tuned LLM, each text
            should already contain the translation prompt)
        model_name: Name or path of the model (str)
        tokenizer: Name or path of the tokenizer (str), or None to use the model's own;
            vllm.LLM expects a name/path here, not a tokenizer object
        max_length: Maximum number of generated tokens per text
        batch_size: Batch size for translation
    Returns:
        translations: List of translated texts
    """
    import vllm  # imported here because the top-level import is commented out

    llm = vllm.LLM(
        model=model_name,
        tokenizer=tokenizer,
        tensor_parallel_size=1,
        # may need to be raised if vLLM reports it is below the model's context length
        max_num_batched_tokens=max_length * batch_size
    )

    # Create sampling params
    sampling_params = vllm.SamplingParams(
        temperature=0.0,  # Equivalent to greedy decoding
        max_tokens=max_length,
        stop=None
    )

    # Generate translations in batches
    translations = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        outputs = llm.generate(batch, sampling_params)

        # Extract generated text from outputs
        batch_translations = [output.outputs[0].text for output in outputs]
        translations.extend(batch_translations)

    return translations