# Baseline

In this notebook, we are going to learn how to use Meta's large pre-trained model [NLLB](https://huggingface.co/docs/transformers/model_doc/nllb) on the [Europarl-ST dataset](https://huggingface.co/datasets/tj-solergibert/Europarl-ST). More precisely, we use only the [version of Europarl-ST that focuses on the text data for machine translation (MT) from English](https://huggingface.co/datasets/tj-solergibert/Europarl-ST-processed-mt-en).

In [None]:
!pip install datasets evaluate transformers==4.30 accelerate peft bitsandbytes
!pip install sacrebleu

Europarl-ST is already available in the [Datasets repository](https://huggingface.co/datasets) from Hugging Face. The [Datasets library](https://huggingface.co/docs/datasets) makes it easy to access and load datasets. For example, you can easily load your own dataset following [this tutorial](https://huggingface.co/docs/datasets/loading#local-and-remote-files).

In [2]:
from datasets import load_dataset

raw_datasets = load_dataset("tj-solergibert/Europarl-ST-processed-mt-en")

print(raw_datasets)

DatasetDict({
    train: Dataset({
        features: ['source_text', 'dest_text', 'dest_lang'],
        num_rows: 602605
    })
    test: Dataset({
        features: ['source_text', 'dest_text', 'dest_lang'],
        num_rows: 86170
    })
    valid: Dataset({
        features: ['source_text', 'dest_text', 'dest_lang'],
        num_rows: 81968
    })
})


As shown, Europarl-ST already comes with a predefined partition into the three conventional sets: training, validation and test. Each set provides a list of source sentences (source_text), target sentences (dest_text) and target languages (dest_lang).

Let's take a closer look at the features of the training set:

In [3]:
raw_datasets["train"].features

{'source_text': Value(dtype='string', id=None),
 'dest_text': Value(dtype='string', id=None),
 'dest_lang': ClassLabel(names=['de', 'en', 'es', 'fr', 'it', 'nl', 'pl', 'pt', 'ro'], id=None)}

As you can see, the possible target languages are German, English, Spanish, French, Italian, Dutch, Polish, Portuguese and Romanian.

Let us take a look at the translations of the first two English sentences:

In [4]:
raw_datasets["train"][:14]["source_text"]

['Since 1977, most financial services, including insurance and investment fund management, have been exempt from VAT.',
 'Since 1977, most financial services, including insurance and investment fund management, have been exempt from VAT.',
 'Since 1977, most financial services, including insurance and investment fund management, have been exempt from VAT.',
 'Since 1977, most financial services, including insurance and investment fund management, have been exempt from VAT.',
 'Since 1977, most financial services, including insurance and investment fund management, have been exempt from VAT.',
 'Since 1977, most financial services, including insurance and investment fund management, have been exempt from VAT.',
 'Since 1977, most financial services, including insurance and investment fund management, have been exempt from VAT.',
 'During this period, two problems have essentially arisen: the definition of the scope of the exemption and the impossibility of recovering VAT incurred in ord

In [5]:

raw_datasets["train"][:14]["dest_text"]

['Seit 1977 wurden die meisten Finanzdienstleistungen, einschließlich Versicherungen und Verwaltung von Investmentfunds, von der Mehrwertsteuer ausgenommen.',
 'La mayoría de los servicios financieros, incluidos los seguros y la gestión de fondos de inversión, están exentos de IVA desde 1977.',
 'Depuis 1977, la plupart des services financiers, dont les assurances et la gestion des fonds de placement, ne sont pas tenus d ’ appliquer une TVA.',
 'Dal 1997 la maggior parte dei servizi finanziari, compresi i servizi assicurativi e la gestione di fondi di investimento, sono esenti da IVA.',
 'Sinds 1977 zijn de meeste financiële diensten, met inbegrip van verzekeringen en het beheer van beleggingsfondsen, vrijgesteld van btw.',
 'większość usług finansowych, w tym usług w zakresie ubezpieczeń i zarządzania funduszami inwestycyjnymi, była zwolniona z opodatkowania podatkiem VAT.',
 'Desde 1977 que a maioria dos serviços financeiros, incluindo os seguros e a gestão de fundos de investimento,

In [6]:
raw_datasets["train"][:14]["dest_lang"]

[0, 2, 3, 4, 5, 6, 7, 0, 2, 3, 4, 5, 6, 7]

As shown, each English sentence is repeated for each of the seven target languages (0: 'de', 2: 'es', 3: 'fr', 4: 'it', 5: 'nl', 6: 'pl', 7: 'pt').
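
The integer labels can be mapped back to language codes with a plain list lookup. The snippet below is a minimal sketch that copies the `names` list and the first seven `dest_lang` values from the outputs above; the helper `ids_to_names` is a hypothetical name introduced only for illustration. Within the Datasets library, the `int2str` method of the `ClassLabel` feature should perform the same lookup.

```python
# Language codes copied from the ClassLabel feature printed above.
names = ['de', 'en', 'es', 'fr', 'it', 'nl', 'pl', 'pt', 'ro']
# Integer labels of the first seven training rows, as printed above.
dest_lang_ids = [0, 2, 3, 4, 5, 6, 7]


def ids_to_names(ids, names):
    """Map integer class ids back to their string labels (hypothetical helper)."""
    return [names[i] for i in ids]


decoded = ids_to_names(dest_lang_ids, names)
print(decoded)
```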

Since the NLLB model was pre-trained on sentence pairs involving 200 languages, with English into Spanish being one of the translation directions, we filter Europarl-ST to keep only the English-Spanish pairs, using a simple [lambda function](https://realpython.com/python-lambda/) with the [Dataset.filter() function](https://huggingface.co/docs/datasets/v2.9.0/en/package_reference/main_classes#datasets.Dataset.filter).

In [7]:
lang="es"
lang_id = raw_datasets["train"].features["dest_lang"].names.index(lang)
raw_datasets = raw_datasets.filter(lambda x: x["dest_lang"] == lang_id)
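
To make the filtering logic concrete, here is a rough pure-Python equivalent on a handful of made-up rows (the sentences are toy data, not taken from Europarl-ST). It only illustrates the predicate; the real `Dataset.filter()` call operates on the Arrow-backed dataset.

```python
# Language codes copied from the ClassLabel feature printed earlier.
names = ['de', 'en', 'es', 'fr', 'it', 'nl', 'pl', 'pt', 'ro']

# Made-up toy rows with the same three columns as the dataset.
rows = [
    {"source_text": "Hello.", "dest_text": "Hallo.", "dest_lang": 0},
    {"source_text": "Hello.", "dest_text": "Hola.", "dest_lang": 2},
    {"source_text": "Hello.", "dest_text": "Bonjour.", "dest_lang": 3},
]

# Same lookup as in the cell above: position of "es" in the list of names.
lang_id = names.index("es")

# Keep only the rows whose target language is Spanish.
kept = [row for row in rows if row["dest_lang"] == lang_id]
print(lang_id, kept)
```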

Now we load the pre-trained tokenizer for the NLLB model and apply it to the English-Spanish pair:

In [8]:
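from transformers import AutoTokenizer

# Maximum number of tokens allowed per sentence (source or target).
max_tok_length = 16

checkpoint = "facebook/nllb-200-distilled-600M"
# from flores200_codes import flores_codes
# FLORES-200 language codes for the source (English) and target (Spanish) languages.
src_code = "eng_Latn"
tgt_code = "spa_Latn"
# Note: as the outputs of the following cells show, plain calls to this tokenizer
# neither pad nor truncate; sentence length is controlled explicitly later on.
tokenizer = AutoTokenizer.from_pretrained(
    checkpoint,
    padding=True,
    pad_to_multiple_of=8,
    src_lang=src_code,
    tgt_lang=tgt_code,
    truncation=True,
    max_length=max_tok_length,
)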



We can apply the tokenizer to any dataset, taking advantage of the fact that Hugging Face Datasets are [Apache Arrow](https://arrow.apache.org) files stored on disk, so only the samples you ask for are loaded in memory.

To keep the data as a dataset, we will use the [Dataset.map() function](https://huggingface.co/docs/datasets/en/package_reference/main_classes#datasets.Dataset.map). This also gives us some extra flexibility if we need more preprocessing than just tokenization. The map() method works by applying a function to each element of the dataset (or to each batch of elements when batched=True).

In our case, each sample pair is going to be preprocessed according to the training needs of the model that is to be used:

In [9]:
def preprocess_function(sample):
    model_inputs = tokenizer(
        sample["source_text"], 
        text_target = sample["dest_text"],
        )
    return model_inputs
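
As a rough illustration of how a batched mapping function interacts with the dataset columns, the sketch below mimics the behavior on a single batch in pure Python. The names `toy_tokenize` and `toy_map` are hypothetical stand-ins (whitespace splitting instead of the NLLB tokenizer), and the sentences are made up; the point is only that the function receives a dictionary of lists and that the returned keys become new columns.

```python
def toy_tokenize(batch):
    """Hypothetical stand-in for preprocess_function.

    Receives a dict of lists (one list per column) and returns new
    columns with one entry per input row.
    """
    return {
        "input_ids": [text.split() for text in batch["source_text"]],
        "labels": [text.split() for text in batch["dest_text"]],
    }


def toy_map(columns, function):
    """Mimic a batched map on one batch: keep the original columns and
    add (or overwrite) the columns returned by the function."""
    new_columns = function(columns)
    return {**columns, **new_columns}


# Made-up toy batch with two rows.
columns = {
    "source_text": ["good morning", "thank you very much"],
    "dest_text": ["buenos días", "muchas gracias"],
}

mapped = toy_map(columns, toy_tokenize)
print(sorted(mapped))
```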


The Datasets library applies this processing by adding new fields to the datasets, one for each key in the dictionary returned by the preprocessing function, that is, *input_ids*, *attention_mask* and *labels*. We can check what preprocess_function does with a small sample:

In [10]:
sample = raw_datasets["train"].select(range(2))
model_input = preprocess_function(sample)
print(model_input)

{'input_ids': [[256047, 97919, 190824, 10268, 57248, 34029, 248079, 48478, 111679, 540, 104166, 8292, 63821, 248079, 2986, 8857, 15740, 248065, 5057, 171138, 248075, 2], [256047, 130245, 3423, 19964, 248079, 9872, 70314, 2986, 65840, 5472, 283, 5196, 248144, 349, 128450, 452, 349, 2931, 6467, 452, 349, 15740, 1590, 540, 349, 184094, 20651, 452, 3629, 139, 5334, 171138, 189348, 2404, 108, 22756, 202, 57454, 15740, 248065, 34029, 248079, 111612, 139794, 202, 349, 164135, 5835, 18, 452, 221010, 171138, 248075, 2]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'labels': [[256161, 1034, 63269, 79, 1563, 79392, 19822, 23012, 248079, 18691, 8897, 1563, 159450, 35, 82, 159318, 79, 152025, 79, 222800, 248079, 23087, 689, 19849, 79, 20424, 248085, 22172, 201728, 2], [256161, 66630, 4872, 4609

In [11]:
for sample in model_input['input_ids']:
    print(tokenizer.convert_ids_to_tokens(sample))

['eng_Latn', '▁Since', '▁1977,', '▁most', '▁financial', '▁services', ',', '▁including', '▁insurance', '▁and', '▁investment', '▁fund', '▁management', ',', '▁have', '▁been', '▁exemp', 't', '▁from', '▁VAT', '.', '</s>']
['eng_Latn', '▁During', '▁this', '▁period', ',', '▁two', '▁problems', '▁have', '▁essenti', 'ally', '▁ar', 'isen', ':', '▁the', '▁definition', '▁of', '▁the', '▁sc', 'ope', '▁of', '▁the', '▁exemp', 'tion', '▁and', '▁the', '▁impossi', 'bility', '▁of', '▁rec', 'ov', 'ering', '▁VAT', '▁incur', 'red', '▁in', '▁order', '▁to', '▁provide', '▁exemp', 't', '▁services', ',', '▁giving', '▁rise', '▁to', '▁the', '▁phen', 'omen', 'on', '▁of', '▁hidden', '▁VAT', '.', '</s>']


We can recover the source text by applying the tokenizer's [batch_decode](https://huggingface.co/docs/transformers/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.batch_decode) method:

In [12]:
tokenizer.batch_decode(model_input['input_ids'])

['eng_Latn Since 1977, most financial services, including insurance and investment fund management, have been exempt from VAT.</s>',
 'eng_Latn During this period, two problems have essentially arisen: the definition of the scope of the exemption and the impossibility of recovering VAT incurred in order to provide exempt services, giving rise to the phenomenon of hidden VAT.</s>']

Now, we can apply the preprocess_function to the raw datasets (training, validation and test):

In [13]:
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)

We are going to filter the tokenized datasets by maximum number of tokens in source and target language:

In [14]:
tokenized_datasets = tokenized_datasets.filter(
    lambda x: len(x["input_ids"]) <= max_tok_length and len(x["labels"]) <= max_tok_length,
    desc=f"Discarding source and target sentences with more than {max_tok_length} tokens",
)

We can take a quick look at the length histogram in the source language:

In [15]:
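from collections import Counter

# Count how many training samples have each source length (in tokens).
length_counts = Counter(len(sample["input_ids"]) for sample in tokenized_datasets["train"])

# Print one line per length that actually occurs: "<length> <count>".
for length in range(1, max_tok_length + 1):
    if length in length_counts:
        print(f"{length:>2} {length_counts[length]:>3}")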

 4   8
 5  60
 6  85
 7 292
 8 457
 9 581
10 751
11 840
12 902
13 912
14 828
15 682
16 520


Let us check a few samples after filtering by maximum number of tokens:

In [16]:
for sample in tokenized_datasets['train'].select(range(5)):
    print(sample['input_ids'])
    print(sample['attention_mask'])
    print(sample['labels'])

[256047, 13959, 32604, 248079, 659, 50100, 81437, 10003, 59602, 202, 3423, 248075, 2]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[256161, 36077, 31877, 248079, 254, 36601, 3888, 8423, 1563, 120385, 248075, 2]
[256047, 2820, 8625, 202, 41101, 112070, 20511, 8095, 80333, 248075, 2]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[256161, 15296, 340, 44635, 3841, 74652, 1125, 8095, 214249, 40226, 248075, 2]
[256047, 9680, 4062, 280, 10966, 351, 10003, 131189, 540, 140515, 351, 349, 57248, 170048, 248075, 2]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[256161, 106057, 53920, 1115, 103192, 174023, 23, 235963, 1115, 1563, 138823, 19822, 23012, 248075, 2]
[256047, 2820, 14173, 9903, 110432, 9, 87081, 281, 28911, 6780, 2402, 470, 16322, 29633, 248075, 2]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[256161, 50098, 7687, 248079, 4979, 16628, 29943, 45720, 79, 51867, 4229, 21027, 193878, 248075, 2]
[256047, 1617, 74867, 2790, 63968, 1398, 121077, 15880, 69569, 248075, 2]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

bitsandbytes is a quantization library with a Transformers integration. With this integration, you can quantize a model to 8 or 4 bits and enable many other options by configuring the BitsAndBytesConfig class. For example, you can:

<ul>
<li>set load_in_4bit=True to quantize the model to 4 bits when you load it</li>
<li>set bnb_4bit_quant_type="nf4" to use a special 4-bit data type for weights initialized from a normal distribution</li>
<li>set bnb_4bit_use_double_quant=True to use a nested quantization scheme to quantize the already quantized weights</li>
<li>set bnb_4bit_compute_dtype=torch.bfloat16 to use bfloat16 for faster computation</li>
</ul>


In [17]:
import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
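
To give an intuition of what 4-bit quantization does to a block of weights, here is a deliberately simplified sketch. It uses uniform absmax quantization, which is an assumption made for illustration: it is not the NF4 data type and not the bitsandbytes implementation. The function names and the weight values are hypothetical.

```python
def quantize_block(weights, n_bits=4):
    """Uniform absmax quantization of one block of weights (simplified sketch).

    Each float is mapped to a signed integer code in [-levels, levels],
    and a single scale per block is kept to map the codes back.
    """
    levels = 2 ** (n_bits - 1) - 1  # 7 for 4 bits
    scale = max(abs(w) for w in weights) / levels or 1.0  # avoid a zero scale
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize_block(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]


# Made-up weights for illustration.
weights = [0.12, -0.50, 0.33, 0.07, -0.21, 0.49]
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)
print(f"max reconstruction error: {max_error:.4f}")
```

The reconstruction error of each weight is bounded by half the scale, which is why a good choice of quantization levels for the actual weight distribution matters.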

Pass the quantization_config to the from_pretrained method.

In [18]:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(
    checkpoint,
    quantization_config=quantization_config
    )



## Evaluation

The last thing to define for our Seq2SeqTrainer is how to compute the metrics that evaluate the predictions of our model against the references. For this purpose, we use the [Evaluate library](https://huggingface.co/docs/evaluate), which includes the definition of generic and task-specific metrics. In our case, we use the [BLEU metric](https://huggingface.co/spaces/evaluate-metric/bleu), or to be more precise, [sacreBLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu). We load the metric as follows:

In [19]:
from evaluate import load

metric = load("sacrebleu")
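
BLEU is built on clipped n-gram precisions, which it combines over several n-gram orders together with a brevity penalty. The sketch below computes only the clipped precision for one hypothesis and one reference, using whitespace tokenization. It is a simplified illustration, not the sacreBLEU implementation; the function names and the example sentences are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(hypothesis, reference, n):
    """Fraction of hypothesis n-grams that also occur in the reference,
    counting each n-gram at most as often as it appears in the reference."""
    hyp_counts = Counter(ngrams(hypothesis.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / total


hypothesis = "the cat sat on the mat"
reference = "the cat is on the mat"
p1 = clipped_precision(hypothesis, reference, 1)
p2 = clipped_precision(hypothesis, reference, 2)
print(f"unigram precision: {p1:.3f}, bigram precision: {p2:.3f}")
```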

We need to define a function compute_metrics to compute BLEU scores at each epoch. The example below decodes the predictions into text and applies some basic post-processing:

In [20]:
import numpy as np
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace negative ids in the labels (e.g. -100 used to mask padding), as we can't decode them.
    # A new list is built so that the labels passed by the caller are not modified in place.
    labels = [
        [tokenizer.pad_token_id if j < 0 else j for j in label]
        for label in labels
    ]
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result
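
The two bookkeeping steps inside compute_metrics, replacing negative label ids and measuring the average length without padding, can be checked in isolation. The following is a minimal pure-Python sketch; the helper names are hypothetical, the token ids are illustrative, and -100 is assumed to be the value used to mask ignored label positions. The pad id of 1 matches the pad_token_id of NLLB's generation configuration.

```python
PAD_TOKEN_ID = 1      # pad_token_id of NLLB's generation configuration
IGNORE_INDEX = -100   # assumed value used to mask ignored label positions


def replace_negative_ids(labels, pad_token_id):
    """Replace every negative id by the pad id so the sequence can be decoded."""
    return [[pad_token_id if tok < 0 else tok for tok in seq] for seq in labels]


def mean_unpadded_length(sequences, pad_token_id):
    """Average number of non-pad tokens per sequence."""
    lengths = [sum(tok != pad_token_id for tok in seq) for seq in sequences]
    return sum(lengths) / len(lengths)


# Illustrative label sequences; the first one is masked at the end.
labels = [
    [256161, 1034, 2, IGNORE_INDEX, IGNORE_INDEX],
    [256161, 66630, 4872, 4609, 2],
]

clean = replace_negative_ids(labels, PAD_TOKEN_ID)
gen_len = mean_unpadded_length(clean, PAD_TOKEN_ID)
print(clean, gen_len)
```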

## Inference

At inference time, it is recommended to use the [generate function](https://huggingface.co/docs/transformers/main_classes/text_generation). This method takes care of encoding the input and auto-regressively generating the decoder output. Check out [this blog post](https://huggingface.co/blog/how-to-generate) to learn all the details about generating text with Transformers.
There’s also [this blog post](https://huggingface.co/blog/encoder-decoder#encoder-decoder) which explains how generation works in general in encoder-decoder models.

Let us first load the default inference parameters of NLLB.

In [21]:
from transformers import GenerationConfig

generation_config = GenerationConfig.from_pretrained(
    checkpoint,
)

print(generation_config)

GenerationConfig {
  "_from_model_config": true,
  "bos_token_id": 0,
  "decoder_start_token_id": 2,
  "eos_token_id": 2,
  "max_length": 200,
  "pad_token_id": 1,
  "transformers_version": "4.30.0"
}



We prepare the test set in batches to be translated:

In [22]:
test_batch_size = 32
batch_tokenized_test = tokenized_datasets['test'].batch(test_batch_size)
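
Batching groups consecutive samples together, and each batch then has to be padded to a common length before it can become a tensor. The sketch below shows both steps in pure Python under simplifying assumptions: right-padding with pad id 1, hypothetical helper names, and made-up token ids. In the notebook itself, the tokenizer takes care of padding and of building the attention mask.

```python
def make_batches(sequences, batch_size):
    """Split a list of sequences into consecutive batches."""
    return [sequences[i:i + batch_size] for i in range(0, len(sequences), batch_size)]


def pad_batch(batch, pad_token_id):
    """Right-pad every sequence to the longest one in the batch and
    build the matching attention mask (1 = real token, 0 = padding)."""
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_token_id] * (max_len - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return input_ids, attention_mask


# Made-up token id sequences of different lengths.
sequences = [[256047, 11, 12, 2], [256047, 21, 2], [256047, 31, 32, 33, 2]]

batches = make_batches(sequences, batch_size=2)
input_ids, attention_mask = pad_batch(batches[0], pad_token_id=1)
print(input_ids)
print(attention_mask)
```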

We process the test set batch by batch, adding padding and converting to tensors, and then perform inference with num_beams = 1 and do_sample = False, that is, greedy search.

In [23]:
number_of_batches = len(batch_tokenized_test["source_text"])
output_sequences = []
for i in range(number_of_batches):
    # Tokenize the i-th batch of source sentences, padding to the longest one.
    inputs = tokenizer(
        batch_tokenized_test["source_text"][i],
        max_length=max_tok_length,
        truncation=True,
        return_tensors="pt",
        padding=True,
    )
    # Greedy search: a single beam and no sampling.
    output_batch = model.generate(
        generation_config=generation_config,
        input_ids=inputs["input_ids"].cuda(),
        attention_mask=inputs["attention_mask"].cuda(),
        # Force the first generated token to be the target language code.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_code),
        max_length=max_tok_length,
        num_beams=1,
        do_sample=False,
    )
    # Move the generated token ids back to the CPU for decoding.
    output_sequences.extend(output_batch.cpu())
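
Greedy search, as selected with num_beams=1 and do_sample=False, picks the highest-scoring next token at every step until the end-of-sequence token or the maximum length is reached. The toy sketch below illustrates the loop. It is not how generate is implemented: the score table is hypothetical and depends only on the previous token, whereas a real model conditions on the source sentence and on the whole generated prefix.

```python
# Hypothetical next-token scores keyed by the previous token (toy values).
TOY_SCORES = {
    "<s>": {"la": 0.6, "el": 0.4},
    "la": {"casa": 0.7, "mesa": 0.3},
    "casa": {"</s>": 0.8, "roja": 0.2},
}


def greedy_decode(scores, start="<s>", eos="</s>", max_length=16):
    """Repeatedly append the highest-scoring next token until EOS or max_length."""
    tokens = [start]
    while len(tokens) < max_length:
        candidates = scores.get(tokens[-1])
        if not candidates:
            break  # the toy table has no continuation for this token
        next_token = max(candidates, key=candidates.get)
        tokens.append(next_token)
        if next_token == eos:
            break
    return tokens


full = greedy_decode(TOY_SCORES)
truncated = greedy_decode(TOY_SCORES, max_length=3)
print(full)
print(truncated)
```

With a small max_length the output is cut before the end-of-sequence token, which is the same effect that max_length = max_tok_length has on long translations.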

In [24]:
result = compute_metrics((output_sequences,tokenized_datasets["test"]["labels"]))
print(f'BLEU score: {result["bleu"]}')

BLEU score: 44.6235
