# Model Evaluation

This notebook evaluates the model's translation quality on:
- In-domain Ladin
- Out-of-domain Ladin (transfer learning across domains)
- Italian-English, in both directions (forgetting of previous knowledge)

**Note**: For simplicity, the evaluation is shown for a single model. To evaluate a different model, change `MODEL_LOAD_PATH`.

## Requirements

In [None]:
!pip install sentencepiece transformers sacrebleu bert-score -q

In [None]:
import pandas as pd
import csv
from transformers import NllbTokenizer, AutoModelForSeq2SeqLM
from tqdm.auto import tqdm
import sacrebleu
from bert_score import BERTScorer

## Data

In [None]:
!wget https://raw.githubusercontent.com/jo-valer/machine-translation-ladin-fascian/main/data/test_id.tsv
!wget https://raw.githubusercontent.com/jo-valer/machine-translation-ladin-fascian/main/data/test_ood.tsv

In [None]:
df_test = pd.read_csv('test_id.tsv', sep="\t", quoting=csv.QUOTE_NONE, encoding='utf-8')
df_test_ood = pd.read_csv('test_ood.tsv', sep="\t", quoting=csv.QUOTE_NONE, encoding='utf-8')
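
Both files are read with `csv.QUOTE_NONE`, so quote characters inside a sentence are kept as literal text instead of being parsed as field delimiters. The sketch below illustrates that behaviour with the standard library only, together with a check for the columns the testing loop relies on. The column names `ladin`, `italian` and `english` are taken from the `lang_codes` mapping defined later; the sample row is a made-up placeholder, not data from the test sets.

```python
import csv
import io

# Placeholder sample that only mimics the TSV layout (NOT real test data).
# The second field starts with an unbalanced quote on purpose.
sample = 'ladin\titalian\tenglish\nx "y" z\t"unbalanced\tplain\n'

# QUOTE_NONE: quotes are ordinary characters, tabs alone separate fields.
reader = csv.DictReader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)

# Columns that test_loop looks up through lang_codes.
required = {"ladin", "italian", "english"}
missing = required - set(reader.fieldnames)
print("rows:", len(rows), "missing columns:", sorted(missing))
```

The same check can be run on the loaded DataFrames by comparing `required` against `set(df_test.columns)`.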

## Model

In [None]:
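import torch

# 'jo-valer/nllb-multi' or 'jo-valer/nllb-pivot'
MODEL_LOAD_PATH = 'jo-valer/nllb-multi'

# Use the GPU when one is available; otherwise fall back to the CPU.
device = 'cuda' if torch.cuda.is_available() else 'cpu'

tokenizer = NllbTokenizer.from_pretrained(MODEL_LOAD_PATH)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_LOAD_PATH).to(device)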

## Testing loop

In [None]:
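bleu_calc = sacrebleu.BLEU()
# word_order=2 adds word unigrams and bigrams to the character n-grams (chrF++).
chrf_calc = sacrebleu.CHRF(word_order=2)
scorer = BERTScorer(model_type='bert-base-uncased')

# Short code -> [column name in the test files, NLLB language tag].
# Ladin is mapped to 'fur_Latn', which is NLLB's tag for Friulian.
lang_codes = {
  "it": ["italian", "ita_Latn"],
  "en": ["english", "eng_Latn"],
  "lld": ["ladin", "fur_Latn"]
}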

def translate(text, src_lang='fur_Latn', tgt_lang='eng_Latn', a=32, b=3, max_input_length=1024, num_beams=4, **kwargs):
    """Translate a sentence (or a list of sentences) from src_lang to tgt_lang.

    Generation is capped at a + b * n new tokens, where n is the (padded)
    input length in tokens. Extra keyword arguments go to model.generate.
    """
    tokenizer.src_lang = src_lang
    tokenizer.tgt_lang = tgt_lang
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=max_input_length)
    outputs = model.generate(
        **inputs.to(model.device),
        # Force the first generated token to be the target-language tag.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_new_tokens=int(a + b * inputs.input_ids.shape[1]),
        num_beams=num_beams,
        **kwargs
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

def test_loop(data=df_test, column='en_translated', src='lld', tgt='en'):
    """Translate the src column of data into tgt, store it in data[column], and print corpus-level scores."""
    model.eval()
    src_name, src_code = lang_codes[src]
    tgt_name, tgt_code = lang_codes[tgt]
    data[column] = [translate(t, src_code, tgt_code)[0] for t in tqdm(data[src_name])]
    hypotheses = data[column].tolist()
    references = data[tgt_name].tolist()
    # sacrebleu expects a list of reference streams, hence the extra brackets.
    bleu_score = bleu_calc.corpus_score(hypotheses, [references]).score
    chrf_score = chrf_calc.corpus_score(hypotheses, [references]).score
    P, R, F1 = scorer.score(hypotheses, references)
    print("\nSrc:", src, "Tgt:", tgt)
    print(f"BLEU = {bleu_score:.2f} / chrF++ = {chrf_score:.2f} / BERTScore F1 = {(F1.mean()*100):.2f}")
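
`translate` caps generation at `a + b * n` new tokens, where `n` is the padded input length in tokens, and `test_loop` calls it once per sentence. Because the tokenizer call already pads its input, `translate` could also be given a list of sentences. The following is a minimal sketch of the length cap and of a batching helper; `max_new_tokens_for` and `batched` are hypothetical names introduced here for illustration and are not part of the notebook.

```python
def max_new_tokens_for(input_length, a=32, b=3):
    """Generation cap used in translate(): a constant plus b tokens per input token."""
    return int(a + b * input_length)

def batched(items, batch_size=16):
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# A 10-token input with the default a=32, b=3 allows 32 + 3 * 10 new tokens.
cap = max_new_tokens_for(10)

# Placeholder sentences, split into batches of two.
sentences = [f"sentence {i}" for i in range(5)]
batches = list(batched(sentences, batch_size=2))
print(cap, [len(batch) for batch in batches])
```

With such a helper, the list comprehension in `test_loop` could be replaced by a loop that extends a result list with `translate(batch, src_code, tgt_code)` for each batch. Decoding padded batches may give slightly different outputs than decoding sentence by sentence, so scores from the two approaches should be compared with care.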

## In-domain Evaluation

Ladin-English

In [None]:
test_loop()

English-Ladin

In [None]:
test_loop(column='lld_translated_en', src='en', tgt='lld')

Ladin-Italian

In [None]:
test_loop(column='it_translated', src='lld', tgt='it')

Italian-Ladin

In [None]:
test_loop(column='lld_translated_it', src='it', tgt='lld')

## Out-of-domain Evaluation

Ladin-English

In [None]:
test_loop(data=df_test_ood, column='en_translated_ood')

English-Ladin

In [None]:
test_loop(data=df_test_ood, column='lld_translated_en_ood', src='en', tgt='lld')

Ladin-Italian

In [None]:
test_loop(data=df_test_ood, column='it_translated_ood', src='lld', tgt='it')

Italian-Ladin

In [None]:
test_loop(data=df_test_ood, column='lld_translated_it_ood', src='it', tgt='lld')

## Forgetting of previous knowledge

Test Italian-to-English translation.

In [None]:
# Use a dedicated column so the Ladin-English output in 'en_translated' is not overwritten.
test_loop(column='en_translated_it', src='it', tgt='en')

Test English-to-Italian translation.

In [None]:
# Use a dedicated column so the Ladin-Italian output in 'it_translated' is not overwritten.
test_loop(column='it_translated_en', src='en', tgt='it')