## Description of the assignment execution steps

While working on the hw1 assignment, I went through the following steps:

1. Since the baseline model falls far short of the target BLEU, I decided to go straight to the Transformer, which, according to the paper "Attention Is All You Need", gives better results. I first implemented the Transformer from the paper myself, but my attempts to train it were unsuccessful. After a day of looking for the root of the problem, I decided to switch to the ready-made implementation in PyTorch.

2. PyTorch provides a ready-made Transformer module, so I used it. At first this led nowhere either, and I concluded that the earlier problem was not in my implementation but in the way the model was trained. Initially, as with the baseline model, I fed src and <sos> to the transformer, generated the sentence autoregressively token by token, and computed cross-entropy on the result. In this mode the transformer did not start generating meaningful sentences even after several hours of training. Apparently the long generation loop makes the gradients too chaotic to learn from. To fix this, I changed the approach: during training I gave the model src and a prefix of trg and asked it to predict only the next word. In this mode a gradient step was taken after each word prediction, and the model trained noticeably better.

3. Next, I experimented with batch size and learning rate to improve training. The recommended lr_scheduler, with warmup followed by a gradually decreasing learning rate, worked well. After half a day of training the model generated text that already looked somewhat human, but BLEU was low: 8. I think it could be raised significantly by further training the model to predict two words ahead, then four, and so on. However, training the Transformer takes a long time (apparently several days) and requires a lot of GPU resources, and **GPU access is expensive**.

4. In light of the above, I decided to take a pretrained transformer and fine-tune it on our data. Out of the box the pretrained model scored BLEU 16; after fine-tuning it reached BLEU 39. The training process and results are shown below.
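The change described in step 2 (feeding a gold prefix of trg and predicting only the next word) can be illustrated with a small sketch. This is not the code used in the experiments; the token ids `SOS` and `EOS` and both helper names are assumptions made up for the illustration. The second helper shows the common one-pass variant, where the decoder input and the labels are the same sequence offset by one position.

```python
from typing import List, Tuple

SOS, EOS = 0, 1  # hypothetical special-token ids, chosen only for this sketch


def next_token_examples(trg_ids: List[int]) -> List[Tuple[List[int], int]]:
    """Turn one target sentence into (gold prefix, next token) training pairs."""
    full = [SOS] + trg_ids + [EOS]
    # The decoder sees the gold prefix full[:i] and must predict full[i].
    return [(full[:i], full[i]) for i in range(1, len(full))]


def shifted_pair(trg_ids: List[int]) -> Tuple[List[int], List[int]]:
    """One-pass variant: decoder input and labels are offset by one token."""
    full = [SOS] + trg_ids + [EOS]
    return full[:-1], full[1:]


pairs = next_token_examples([7, 8, 9])
decoder_input, labels = shifted_pair([7, 8, 9])
```

In both variants the model is always conditioned on the reference prefix rather than on its own earlier predictions, which is the difference from the free-running generation loop tried first.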

I did not generate the .txt file with translations because generating it takes a long time.
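Step 3 mentions a scheduler with warmup and a gradually decreasing learning rate but does not say which one was used. One common choice is the schedule from "Attention Is All You Need"; the sketch below assumes that formula, and the function name `noam_lr` and the default values are illustrative only.

```python
def noam_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Linear warmup for warmup_steps, then decay proportional to 1/sqrt(step)."""
    step = max(step, 1)  # guard against 0 ** -0.5 on the very first call
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


peak = noam_lr(4000)          # the rate is highest at the end of warmup
later = noam_lr(16000)        # four times more steps halves the rate
```

A function like this could be passed as `lr_lambda` to `torch.optim.lr_scheduler.LambdaLR` with a base learning rate of 1.0, so that the returned value is used as the learning rate directly.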


In [None]:
! pip install transformers
! pip install datasets
! pip install evaluate
! pip install sentencepiece
! pip install sacrebleu
! pip install accelerate


In [2]:
BATCH_SIZE = 32
MAX_LEN = 200
NUM_EPOCHS = 10

MODEL_CHECKPOINT = "Helsinki-NLP/opus-mt-ru-en"
FINETUNED_MODEL_CHECKPOINT = "marian-ru-en-finetuned"

PAD_LABEL = -100


In [3]:
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
    pipeline
)
from datasets import load_dataset
import torch
import evaluate
import functools
import numpy as np


In [4]:
device = torch.device(
    "cuda"
    if torch.cuda.is_available()
    else "cpu"
)

device


device(type='cuda')

In [5]:
import os

path_do_data = "../../datasets/Machine_translation_EN_RU/data.txt"
if not os.path.exists(path_do_data):
    print("Dataset not found locally. Downloading from github.")
    !wget https://raw.githubusercontent.com/neychev/made_nlp_course/master/datasets/Machine_translation_EN_RU/data.txt -nc
    path_do_data = "./data.txt"


Dataset not found locally. Downloading from github.
File ‘data.txt’ already there; not retrieving.



In [6]:
def split_row_into_ru_en(row):
    line = row["text"]
    split = line.strip().split("\t")
    return {
        "ru": split[1],
        "en": split[0],
    }


split_datasets = (
    load_dataset("text", data_files=[path_do_data], split="train")
    .map(split_row_into_ru_en)
    .remove_columns(["text"])
    .train_test_split(0.2, seed=20)
)
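The parsing step above relies on every line having the form `english<TAB>russian`. A minimal sketch of the same split with a guard for malformed lines is shown here; the helper name `parse_pair` and the sample sentence are made up for illustration and are not taken from the dataset.

```python
from typing import Dict, Optional


def parse_pair(line: str) -> Optional[Dict[str, str]]:
    """Split one 'english<TAB>russian' line; return None if a field is missing."""
    parts = line.strip().split("\t")
    if len(parts) < 2:
        return None
    # English comes first in the file, Russian second.
    return {"ru": parts[1], "en": parts[0]}


sample = "Free WiFi is offered.\tПредоставляется бесплатный Wi-Fi.\n"  # invented example
parsed = parse_pair(sample)
skipped = parse_pair("a line without a tab")
```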


In [7]:
split_datasets["train"][0]


{'ru': "Расстояние до Дворца фестивалей и конгрессов составляет 2 км, а до церкви Нотр-Дам-д'Эсперанс — 1,4 км.",
 'en': "The property is 2 km from Palais des Festivals de Cannes and 1.4 km from Notre Dame d'Esperance Church."}

In [8]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)


def preprocess_function(rows, tokenizer, max_len=MAX_LEN):
    # Tokenize Russian sources and English targets; the targets become "labels".
    # Use the max_len argument rather than the global constant.
    model_inputs = tokenizer(
        rows["ru"], text_target=rows["en"], max_length=max_len, truncation=True
    )
    return model_inputs


tokenized_datasets = split_datasets.map(
    functools.partial(preprocess_function, tokenizer=tokenizer),
    batched=True,
    remove_columns=split_datasets["train"].column_names,
).with_format("torch")




Map:   0%|          | 0/40000 [00:00<?, ? examples/s]

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

In [9]:
example_input = tokenized_datasets["train"][0]


In [10]:
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_CHECKPOINT)


In [11]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)


In [12]:
batch = data_collator([tokenized_datasets["train"][i] for i in range(1, 3)])

batch["labels"]


tensor([[   32, 19894, 16096,  3949,  1375,    11, 51246, 30124,    23,     8,
         35852,    11, 39944, 26213,   649,     3,     0],
        [ 3235,    23,    63, 16282,  2397,     8,  1586, 23366, 41911,     3,
             0,  -100,  -100,  -100,  -100,  -100,  -100]])
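The `-100` entries in the second row are label padding: the shorter sequence is padded to the length of the longest one in the batch. `-100` is the default `ignore_index` of `torch.nn.CrossEntropyLoss`, so these positions do not contribute to the loss. A rough pure-Python sketch of the padding step follows; `pad_labels` is a hypothetical helper and not the collator's actual implementation.

```python
from typing import List

IGNORE_INDEX = -100  # same value as PAD_LABEL in this notebook


def pad_labels(label_seqs: List[List[int]], ignore_index: int = IGNORE_INDEX) -> List[List[int]]:
    """Right-pad every label sequence to the length of the longest one."""
    longest = max(len(seq) for seq in label_seqs)
    return [seq + [ignore_index] * (longest - len(seq)) for seq in label_seqs]


padded = pad_labels([[5, 6, 7], [8]])
# Only positions that are not ignore_index would be scored by the loss.
scored_positions = sum(tok != IGNORE_INDEX for row in padded for tok in row)
```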

In [13]:
metric = evaluate.load("sacrebleu")


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # In case the model returns more than the prediction logits
    if isinstance(preds, tuple):
        preds = preds[0]

    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    labels = np.where(labels != PAD_LABEL, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    return {"bleu": result["score"]}
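The `np.where` call in `compute_metrics` is needed because `-100` is not a valid token id, so the tokenizer cannot decode it. Replacing it with the pad token id lets `skip_special_tokens=True` drop those positions during decoding. The sketch below shows only the replacement step; the token ids and `pad_token_id = 99` are invented for the example, whereas the notebook takes the real value from `tokenizer.pad_token_id`.

```python
import numpy as np

PAD_LABEL = -100
pad_token_id = 99  # hypothetical pad id, for illustration only

labels = np.array([[11, 12, 13, 2], [21, 2, PAD_LABEL, PAD_LABEL]])
# Keep real token ids, swap every ignored position for the pad id.
restored = np.where(labels != PAD_LABEL, labels, pad_token_id)
```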


In [14]:
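# Enable mixed precision only on GPU. Compare device.type, because a
# torch.device object never compares equal to a plain string.
FP16 = device.type == "cuda"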

args = Seq2SeqTrainingArguments(
    FINETUNED_MODEL_CHECKPOINT,
    evaluation_strategy="no",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=64,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=NUM_EPOCHS,
    predict_with_generate=True,
    fp16=FP16,
)


In [15]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)


In [16]:
trainer.evaluate(max_length=MAX_LEN)


{'eval_loss': 2.3310399055480957,
 'eval_bleu': 16.380540601728995,
 'eval_runtime': 441.3917,
 'eval_samples_per_second': 22.656,
 'eval_steps_per_second': 0.356}

In [17]:
translator = pipeline("translation", model=model.to('cpu'), tokenizer=tokenizer)

def show_examples(translator):
    for i in range(3):
        test_example = split_datasets['test'][i]
        print(f"Original text: {test_example['en']}")
        print(f"Generated text: {translator(test_example['ru'])[0]['translation_text']}\n")

show_examples(translator)


Original text: Free WiFi is offered in the entire building.
Generated text: A free WiFi is available throughout the building.

Original text: The rooms feature a TV and a private balcony overlooking the hotel's gardens.
Generated text: The rooms are equipped with a television set and come out on a separate balcony with a view to the garden.

Original text: You can book a session at the sauna or Turkish bath, or take a dip in the hot tub.
Generated text: Guests may order a visit to a sauna or a Turkish bath or rest in a hydraulic bathtub.



In [18]:
model.to(device)
trainer.train()


Step,Training Loss
500,1.5827
1000,1.3216
1500,1.1868
2000,1.1111
2500,1.0869
3000,1.0054
3500,0.9872
4000,0.9541
4500,0.9267
5000,0.914


TrainOutput(global_step=12500, training_loss=0.9203389233398438, metrics={'train_runtime': 1475.3576, 'train_samples_per_second': 271.121, 'train_steps_per_second': 8.473, 'total_flos': 7164682139860992.0, 'train_loss': 0.9203389233398438, 'epoch': 10.0})
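As a sanity check, `global_step=12500` matches the training configuration. The arithmetic below assumes a single device and no gradient accumulation, and takes the 40,000 training examples from the `Map` progress output earlier in the notebook.

```python
import math

train_examples = 40_000  # size of the train split (from the Map progress output)
batch_size = 32          # per-device training batch size
num_epochs = 10          # NUM_EPOCHS

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
```

This gives 1250 optimizer steps per epoch and 12500 in total, which is the value reported by `trainer.train()`.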

In [19]:
trainer.evaluate()


{'eval_loss': 0.9772880673408508,
 'eval_bleu': 39.738467247339194,
 'eval_runtime': 417.0196,
 'eval_samples_per_second': 23.98,
 'eval_steps_per_second': 0.376,
 'epoch': 10.0}

In [20]:
translator = pipeline('translation', model=model.to('cpu'), tokenizer=tokenizer)

show_examples(translator)


Original text: Free WiFi is offered in the entire building.
Generated text: Free Wi-Fi is available throughout the property.

Original text: The rooms feature a TV and a private balcony overlooking the hotel's gardens.
Generated text: Rooms here will provide you with a TV and a private balcony with garden views.

Original text: You can book a session at the sauna or Turkish bath, or take a dip in the hot tub.
Generated text: Guests can enjoy a sauna or Turkish bath or relax in the hot tub.

