First, install the packages needed in this notebook:

In [1]:
%%capture
! pip install transformers datasets==2.7.1 evaluate bert_score==0.3.13 sacrebleu==2.3.1
! pip install git+https://github.com/google-research/bleurt.git

# 1 Seq2seq evaluation metrics

### 1.1 You are given a candidate translation, a reference translation, and the score assigned by a metric. What type of metric was used? Can you suggest a better metric? Justify your answer!

```
Reference: "My cat loves to watch the birds outside the window."
Candidate: "My cat hates to watch the birds outside the window."
-> score: 0.99
```


### 1.2 You want to train a machine translation system but you only have a few thousand aligned sentences. Are there metrics that are especially suited for this low-resource scenario? Why?


### 1.3 Your friend tells you: "I cannot use a learned metric for my task because my data is from a very special domain and there will be a domain mismatch." Is she right? Is she missing something?



## 1.4 Recreate the scores from the lecture slides with Huggingface evaluate

In [4]:
%%capture
from evaluate import load # use the Huggingface evaluate implementations
bertscore = load("bertscore")
bleu = load("sacrebleu")
bleurt = load("bleurt", module_type="metric", checkpoint="Elron/bleurt-base-128")

In [7]:
print(bleu.compute(predictions=["My weekend was bad"], references=["My weekend was superb"])['score'])
print(bleu.compute(predictions=["At the weekend, we ate my grandma's house."], references=["At the weekend, we visited my grandma's house and ate cake."])['score'])
print(bleu.compute(predictions=["At the weekend, we visited my grandma's house. And we ate cake."], references=["At the weekend, we visited my grandma's house and ate cake."])['score'])

59.460355750136046
41.154215810165745
64.75445426291287
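
To make the first score above easier to interpret, here is a minimal sketch of sentence-level BLEU with a single reference. It assumes plain whitespace tokenization and exponential smoothing of zero n-gram matches; the function names are hypothetical, and sacrebleu additionally applies its own tokenizer, so results may differ on sentences with punctuation.

```python
import math
from collections import Counter


def ngram_counts(tokens: list[str], n: int) -> Counter:
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu_sketch(candidate: str, reference: str, max_n: int = 4) -> float:
    """Simplified single-reference BLEU on whitespace tokens, scaled to 0-100."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    smooth = 1.0
    for n in range(1, max_n + 1):
        cand_ngrams = ngram_counts(cand, n)
        ref_ngrams = ngram_counts(ref, n)
        total = sum(cand_ngrams.values())
        if total == 0:
            return 0.0  # simplification: candidate has fewer than n tokens
        # Clipped matches: a candidate n-gram counts at most as often as it occurs in the reference
        matches = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
        if matches == 0:
            # Exponential smoothing: each order without matches halves the substitute count
            smooth *= 2.0
            precision = 1.0 / (smooth * total)
        else:
            precision = matches / total
        log_precisions.append(math.log(precision))
    # Brevity penalty punishes candidates that are shorter than the reference
    brevity_penalty = 1.0 if len(cand) >= len(ref) else math.exp(1.0 - len(ref) / len(cand))
    return 100.0 * brevity_penalty * math.exp(sum(log_precisions) / max_n)


print(sentence_bleu_sketch("My weekend was bad", "My weekend was superb"))
```

For this punctuation-free pair the n-gram precisions are 3/4, 2/3, 1/2 and a smoothed 1/2, so the geometric mean is about 0.5946, which agrees with the first score printed above.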


In [8]:
# Convenience function for comparing different scores on a given reference-candidate pair
def evaluate_and_compare_scores(reference: str, candidate: str, language: str='en') -> None:
    print("Reference: ", reference)
    print("Candidate: ", candidate)

    score_bleu = bleu.compute(predictions=[candidate], references=[reference], smooth_method='none')['score']
    print(f"BLEU: {score_bleu}")
    score_bertscore = bertscore.compute(predictions=[candidate], references=[reference], lang=language)['f1']
    print(f"BERTscore: {score_bertscore}")
    score_bleurt = bleurt.compute(predictions=[candidate], references=[reference])['scores']
    print(f"BlEURT: {score_bleurt}")
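
As background for the `f1` value returned by BERTScore, the following sketch shows the greedy matching idea on toy vectors: each candidate token is matched to its most similar reference token (precision) and vice versa (recall). The 2-d "embeddings" and function names are hypothetical, and the real metric uses contextual BERT embeddings and optionally IDF weighting and baseline rescaling, which are omitted here.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def greedy_match_f1(cand_vecs: list[list[float]], ref_vecs: list[list[float]]) -> float:
    """F1 from greedy token matching, without IDF weighting or rescaling."""
    # Precision: every candidate token picks its best-matching reference token
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: every reference token picks its best-matching candidate token
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    return 2 * precision * recall / (precision + recall)


# Hypothetical token embeddings: the candidate differs only slightly in its last token
ref_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cand_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 1.0]]
print(greedy_match_f1(cand_vecs, ref_vecs))
```

Because a single slightly shifted token vector barely changes the average, the score stays very close to 1, which is worth keeping in mind when reading the BERTScore values below.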

In [9]:
ref = "This house is in a big city."
cands = ["The house is in a big city.", 
         "The house is not in a big city.", 
         "The house in a big city is.", 
         "This house is in the big city close to the ocean."
         ]

####################################################################
# TODO come up with your own examples and try to fool the scores
# Can you make further observations?
####################################################################
ref = ref
cands = cands
####################################################################
for cand in cands:
    evaluate_and_compare_scores(ref, cand)
    print('***')

ref_de = "Dieses Haus ist in einer großen Stadt."
cand_de = "Das Haus in einer großen Stadt ist."
evaluate_and_compare_scores(ref_de, cand_de, language='de')

Reference:  This house is in a big city.
Candidate:  The house is in a big city.
BLEU: 84.08964152537145
BERTscore: [0.9993593692779541]
BlEURT: [0.7634966373443604]
***
Reference:  This house is in a big city.
Candidate:  The house is not in a big city.
BLEU: 51.33450480401705
BERTscore: [0.9788229465484619]
BlEURT: [-0.2529091536998749]
***
Reference:  This house is in a big city.
Candidate:  The house in a big city is.
BLEU: 39.76353643835254
BERTscore: [0.9517703652381897]
BlEURT: [-0.27970874309539795]
***
Reference:  This house is in a big city.
Candidate:  This house is in the big city close to the ocean.
BLEU: 26.20251007173262
BERTscore: [0.96694016456604]
BlEURT: [-0.026131577789783478]
***
Reference:  Dieses Haus ist in einer großen Stadt.
Candidate:  Das Haus in einer großen Stadt ist.
BLEU: 39.76353643835254
BERTscore: [0.9289785027503967]
BlEURT: [0.41285941004753113]


In [10]:
####################################################################
# TODO Look at the Huggingface metrics page (https://huggingface.co/metrics)
# Select two additional metrics and test them on our sample sentences
# Note: you may have to install additional packages to use these metrics!
# To silence pip output, use a separate install cell with `%%capture` on its first line
# (a cell magic is only valid as the first line of a cell).
####################################################################
#! pip install 
metric1 = None
metric2 = None
####################################################################

for cand in cands:
  print("Reference: ", ref)
  print("Candidate: ", cand)
  print(f"{metric1.name}: ", metric1.compute(predictions=[cand], references=[ref]))
  print(f"{metric2.name}: ", metric2.compute(predictions=[cand], references=[ref]))


## 1.5 Explain the predicted scores

Instead of using the Huggingface evaluate library, you can also load the scoring models with the transformers library. With this, you can use any explainability framework that can interact with Huggingface to explain your score.

In [None]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

In [None]:
%%capture
model_name = "Elron/bleurt-base-128"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

def predict_bleurt_score(reference:str, candidate:str) -> None:
    print("Reference: ", reference)
    print("Candidate: ", candidate)
    ####################################################################
    # TODO Tokenize the reference and candidate and feed the tokenizer 
    # output into the model. Print the score prediction.
    ####################################################################
    
    ####################################################################

In [None]:
ref = ("At the weekend, we visited my grandma's house and ate cake. She has baked a chocolate cake especially for me as it is my favourite cake. "
  "Afterwards, we went for a long walk across the fields. The weather was superb and we saw a lot of birds, squirrels and even some wild rabbids.")
cand = ("At the weekend, we visited my grandma's house and ate cake. She has baked a chocolate cake especially for me as it is my favourite cake. It was really delicious! "
  "Afterwards, we went for a long walk across the fields. The weather was superb and we saw a lot of birds, squirrels and even some wild rabbids.")
cand2 = ("At the weekend, we visited my grandma's house and ate cake. She has baked a chocolate cake especially for me as it is my favourite cake. "
  "Afterwards, we went for a long walk across the fields. The weather was superb and we saw a lot of birds, squirrels and even some wild rabbids. It was really delicious!")
predict_bleurt_score(ref, cand)
print('***')
predict_bleurt_score(ref, cand2)

### 1.5.1 Both candidates hallucinate "It was really delicious!". However, the second candidate does not seem to get punished for it. Can you think of an explanation why?

# 2 Machine translation

## 2.1 Open [Google Translate](https://translate.google.com) and enter the sentence "The Elbphilharmonie is a concert hall in Hamburg, Germany." Select a language of your choice and translate the sentence. Copy the result and translate it into another language. Finally, translate it back into English. What do you observe?

## 2.2 Monolingual vs. multilingual translation models

First, load the models and their tokenizers.

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, GenerationConfig

In [None]:
%%capture
def load_tokenizer_and_model(model_name:str) -> tuple[AutoTokenizer, AutoModelForSeq2SeqLM]:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    return tokenizer, model


monolingual_model_name = "Helsinki-NLP/opus-mt-de-en"
tokenizer_mono, model_mono = load_tokenizer_and_model(monolingual_model_name)

multilingual_model_name = "google/mt5-base" 
tokenizer_multi, model_multi = load_tokenizer_and_model(multilingual_model_name)

Compare the translations of the different models

In [None]:
source_text_de = ("Die TUM ist erneut Exzellenzuniversität und damit die einzige Technische Universität, die den Titel seit 2006 durchgehend hält."
  " Die Auszeichnung wird als Teil der Exzellenzstrategie von Bund und Ländern vergeben, um die deutsche Spitzenforschung international strategisch zu unterstützen.")
source_text = source_text_de

def translate(source_text:str, tokenizer: AutoTokenizer, model:AutoModelForSeq2SeqLM) -> str:
    gen_config = GenerationConfig(num_beams=3, early_stopping=True, no_repeat_ngram_size=3)
    tokenizer_output = tokenizer(source_text, return_tensors='pt')['input_ids'].to(model.device)
    generated_output = model.generate(tokenizer_output, max_new_tokens=300, generation_config=gen_config)
    return tokenizer.batch_decode(generated_output)[0]

print("Monolingual model:")
print(translate(source_text, tokenizer_mono, model_mono))
print("Multilingual model:")
print(translate("Translate German to English: "+source_text, tokenizer_multi, model_multi))

Looks like the multilingual model performs worse than the model specifically trained for this language pair. Let us try a different model.

In [None]:
tokenizer_multi, model_multi = load_tokenizer_and_model("bigscience/mt0-base")
print(translate("Translate to English: "+source_text, tokenizer_multi, model_multi))

### 2.2.1 Compare the model pages of [mt5](https://huggingface.co/google/mt5-base) and [mt0](https://huggingface.co/bigscience/mt0-base). Can you find an explanation why mt0 performs better?


### 2.2.2 Your own language combinations

In [None]:
####################################################################
# TODO 
# - Think of your own example sentences in other languages
# - Visit the Huggingface model hub and select models that support these languages
# - What do you observe?
####################################################################
source_text = ""
model_name = ""
####################################################################
tokenizer_custom, model_custom = load_tokenizer_and_model(model_name)
print("Monolingual model:")
print(translate(source_text, tokenizer_custom, model_custom))
print("Multilingual model:")
print(translate("Translate to English: "+source_text, tokenizer_multi, model_multi))

## 2.3 Fine-tuning models for translation

### Loading and preparing WMT data 
WMT (the Conference on Machine Translation) is a major machine translation conference that publishes aligned datasets for many language pairs. These datasets are available on Huggingface. We use the [wmt16](https://huggingface.co/datasets/wmt16) `de-en` dataset.

In [None]:
from datasets import load_dataset, Dataset
from transformers import AutoTokenizer, DataCollatorForSeq2Seq

model_name = "t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)

wmt_data = load_dataset("wmt16", "de-en")
wmt_data['train'] = Dataset.from_dict(wmt_data['train'][:1000]) # reduce training size
wmt_data


DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 1000
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 2169
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 2999
    })
})

In [None]:
wmt_data['train']['translation'][0]

{'de': 'Wiederaufnahme der Sitzungsperiode', 'en': 'Resumption of the session'}
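
The record above is a dict keyed by language code. When a function is applied with `batched=True`, `map` passes each column as a list, so the `translation` column arrives as a list of such dicts. The sketch below illustrates that layout on a hand-written two-row batch; the second sentence pair is a made-up example, not taken from WMT.

```python
# Hypothetical two-row batch in the layout that a batched map function receives:
# one key per column, each holding a list with one entry per row.
batch = {
    "translation": [
        {"de": "Wiederaufnahme der Sitzungsperiode", "en": "Resumption of the session"},
        {"de": "Guten Morgen", "en": "Good morning"},  # made-up pair for illustration
    ]
}

# Pull one language out of every row of the batch
german = [pair["de"] for pair in batch["translation"]]
english = [pair["en"] for pair in batch["translation"]]

print(german)
print(english)
```

Both lists keep the row order, so the i-th source sentence stays aligned with the i-th target sentence.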

In [None]:
prefix = "Translate German to English: "
src_lang = "de"
tgt_lang = "en"

def preprocess_function(examples):
    ####################################################################
    # TODO prepend the prefix to all source language samples and store them in inputs
    # collect all target language samples in translations
    ####################################################################
    inputs = None
    translations = None
    assert len(inputs) == len(translations)
    ####################################################################
    tokenizer_output = tokenizer(inputs, text_target=translations, padding=True)
    return tokenizer_output

wmt_data = wmt_data.map(preprocess_function, batched=True)
wmt_data.set_format(type="torch")
print(wmt_data)

# The `model` argument expects a model instance rather than a name string, so it is omitted here
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer)


DatasetDict({
    train: Dataset({
        features: ['translation', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1000
    })
    validation: Dataset({
        features: ['translation', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 2169
    })
    test: Dataset({
        features: ['translation', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 2999
    })
})


In [None]:
print(tokenizer.decode(wmt_data["test"]["input_ids"][1], skip_special_tokens=True))
print(tokenizer.decode(wmt_data["test"]["labels"][1], skip_special_tokens=True))

Translate German to English: Das Verhältnis zwischen Obama und Netanyahu ist nicht gerade freundschaftlich.
The relationship between Obama and Netanyahu is not exactly friendly.


### Use BLEURT and BLEU to evaluate the translation quality

In [None]:
from evaluate import load
import numpy as np
from transformers import EvalPrediction 

metric_bleurt = load("bleurt", module_type="metric", checkpoint="Elron/bleurt-base-128")
metric_bleu = load("sacrebleu")

def postprocess_text(preds: list[str], labels: list[str]) -> tuple[list[str], list[str]]:
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]

    return preds, labels

def compute_metric(eval_preds: EvalPrediction) -> dict:
    preds, targets = eval_preds

    ####################################################################
    # TODO use the tokenizer.batch_decode function to get the strings 
    # based on the model's prediction. 
    # Call the postprocess_text function afterwards
    ####################################################################
    decoded_preds = None
    decoded_targets = None

    decoded_preds, decoded_targets = None, None
    ####################################################################

    scores_bleurt = metric_bleurt.compute(predictions=decoded_preds, references=decoded_targets)["scores"]
    score_bleu = metric_bleu.compute(predictions=decoded_preds, references=decoded_targets)["score"]
    return {"bleurt": sum(scores_bleurt)/len(scores_bleurt), "bleu": score_bleu}
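
Depending on how batches are padded, the arrays that reach `compute_metric` may contain `-100`, the value commonly used to mark positions the loss should ignore. It is not a valid token id, so it is usually replaced by the padding id before decoding. The sketch below shows this step on plain Python lists; the helper name and the token ids are hypothetical, and `PAD_TOKEN_ID = 0` is an assumption standing in for `tokenizer.pad_token_id`. With NumPy arrays, `np.where` can do the same.

```python
IGNORE_INDEX = -100  # marker for label positions that the loss ignores
PAD_TOKEN_ID = 0     # assumption: stands in for tokenizer.pad_token_id


def replace_ignore_index(batch_ids: list[list[int]], pad_token_id: int = PAD_TOKEN_ID) -> list[list[int]]:
    """Swap every ignore marker for the padding id so the rows can be decoded."""
    return [
        [pad_token_id if token_id == IGNORE_INDEX else token_id for token_id in row]
        for row in batch_ids
    ]


# Hypothetical label ids; the first row was padded with the ignore marker
labels = [[37, 1782, 1, -100, -100], [100, 19, 3, 9, 1]]
print(replace_ignore_index(labels))  # [[37, 1782, 1, 0, 0], [100, 19, 3, 9, 1]]
```

Padding tokens are then dropped when decoding with `skip_special_tokens=True`.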



### Load the model and train it with the trainer API

With the Trainer API, it is no longer necessary to write your own training loop and perform the gradient updates manually. You define training arguments, in which you set hyperparameters such as the learning rate, then create a trainer object and train the model with `trainer.train()`.

In [None]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

In [None]:
# Initial translation quality
print(translate(prefix + source_text, tokenizer, model))
print(translate(prefix + "Heute ist ein wunderschöner Tag und wir besuchen meine Großeltern.", tokenizer, model))

In [None]:
output_dir = "mt_model"

training_args = Seq2SeqTrainingArguments(
    evaluation_strategy="epoch",
    logging_strategy="epoch",
    learning_rate=2e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    weight_decay=0.01,
    num_train_epochs=5,
    fp16=False,
    predict_with_generate=True,
    output_dir=output_dir,
    report_to="tensorboard",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=wmt_data["train"],
    eval_dataset=wmt_data["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metric
)
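
To help read the training curves, the number of optimizer steps implied by the arguments above can be estimated as follows. This is a rough sketch that assumes a single device and the default of one gradient accumulation step; the variable names are chosen for illustration.

```python
import math

num_train_samples = 1000  # size of the reduced training split
per_device_batch = 8      # per_device_train_batch_size
num_epochs = 5            # num_train_epochs
num_devices = 1           # assumption: a single GPU or CPU
grad_accum = 1            # assumption: default gradient_accumulation_steps

# One optimizer step consumes per_device_batch * num_devices * grad_accum samples
steps_per_epoch = math.ceil(num_train_samples / (per_device_batch * num_devices * grad_accum))
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 125 625
```

With `evaluation_strategy="epoch"` and `logging_strategy="epoch"`, this corresponds to five evaluation and logging points over the run.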

With TensorBoard, you can monitor training and see how the loss and metrics evolve over time.

In [None]:
# Start TensorBoard
%load_ext tensorboard
# %reload_ext tensorboard
%tensorboard --logdir "{output_dir}"/runs

In [None]:
# If you get an "OutOfMemoryError: CUDA out of memory." error here, try to restart the runtime to free all CUDA memory.
trainer.train()

In [None]:
print(translate(prefix + source_text, tokenizer, model))
print(translate(prefix + "Heute ist ein wunderschöner Tag und wir besuchen meine Großeltern.", tokenizer, model))

## Think of problems with this fine-tuning procedure. Remember why we tested multilingual models in the first place.
