# Prompting

Prompting in LLMs is the design of a structured input that provides a task description, demonstrations and the actual input, so that the model generates the desired output.
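
As a minimal sketch of this definition, the snippet below assembles a prompt from its three parts. The helper `build_prompt` and the `Input:`/`Output:` template are assumptions made for illustration only; they are not the template used later in this notebook.

```python
def build_prompt(task_description, demonstrations, query):
    """Concatenate the three parts of a few-shot prompt into one string."""
    lines = [task_description]
    for source, target in demonstrations:
        lines.append(f"Input: {source}\nOutput: {target}")
    # The final block leaves the output empty for the model to complete.
    lines.append(f"Input: {query}\nOutput:")
    return "\n".join(lines)


prompt = build_prompt(
    "Translate from English to Spanish.",
    [("Good morning.", "Buenos días.")],
    "Thank you very much.",
)
print(prompt)
```

With no demonstrations the same helper yields a zero-shot prompt; each additional pair adds one shot.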

In [1]:
!pip install datasets evaluate transformers==4.30 accelerate peft bitsandbytes
!pip install sacrebleu
!pip install huggingface_hub

Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting transformers==4.30
  Downloading transformers-4.30.0-py3-none-any.whl.metadata (113 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 113.6/113.6 kB 5.4 MB/s eta 0:00:00
Collecting bitsandbytes
  Downloading bitsandbytes-0.44.1-py3-none-manylinux_2_24_x86_64.whl.metadata (3.5 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers==4.30)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70

In this notebook, we are going to use a dataset that is already available in the [Datasets repository](https://huggingface.co/datasets) from Hugging Face. The [Datasets library](https://huggingface.co/docs/datasets) also makes it easy to access and load other datasets. For example, you can load your own dataset following [this tutorial](https://huggingface.co/docs/datasets/loading#local-and-remote-files).

More precisely, we are going to explain how to prompt the [Llama2 model](https://huggingface.co/docs/transformers/model_doc/llama2) on the [Europarl-ST dataset](https://huggingface.co/datasets/tj-solergibert/Europarl-ST), using only the [subset of Europarl-ST that contains the text data for machine translation (MT) from English](https://huggingface.co/datasets/tj-solergibert/Europarl-ST-processed-mt-en).

In [1]:
from datasets import load_dataset

raw_datasets = load_dataset("tj-solergibert/Europarl-ST-processed-mt-en")

print(raw_datasets)

Using the latest cached version of the dataset since tj-solergibert/Europarl-ST-processed-mt-en couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /home/jorcisai/.cache/huggingface/datasets/tj-solergibert___europarl-st-processed-mt-en/default/0.0.0/b562e57d6d9a3c5c4d67ddd334a969c67f93c005 (last modified on Sat Nov 16 22:03:28 2024).


DatasetDict({
    train: Dataset({
        features: ['source_text', 'dest_text', 'dest_lang'],
        num_rows: 602605
    })
    test: Dataset({
        features: ['source_text', 'dest_text', 'dest_lang'],
        num_rows: 86170
    })
    valid: Dataset({
        features: ['source_text', 'dest_text', 'dest_lang'],
        num_rows: 81968
    })
})


As shown, Europarl-ST already comes with a predefined partition into the three conventional sets: training, validation and test. Each set contains a list of source sentences (source_text), target sentences (dest_text) and target languages (dest_lang).

Let's take a closer look at the features of the training set:

In [2]:
raw_datasets["train"].features

{'source_text': Value(dtype='string', id=None),
 'dest_text': Value(dtype='string', id=None),
 'dest_lang': ClassLabel(names=['de', 'en', 'es', 'fr', 'it', 'nl', 'pl', 'pt', 'ro'], id=None)}

As you can see, the possible target languages are German, English, Spanish, French, Italian, Dutch, Polish, Portuguese and Romanian.

Let us take a look at the translations of the first two English sentences:

In [3]:
raw_datasets["train"][:14]["source_text"]

['Since 1977, most financial services, including insurance and investment fund management, have been exempt from VAT.',
 'Since 1977, most financial services, including insurance and investment fund management, have been exempt from VAT.',
 'Since 1977, most financial services, including insurance and investment fund management, have been exempt from VAT.',
 'Since 1977, most financial services, including insurance and investment fund management, have been exempt from VAT.',
 'Since 1977, most financial services, including insurance and investment fund management, have been exempt from VAT.',
 'Since 1977, most financial services, including insurance and investment fund management, have been exempt from VAT.',
 'Since 1977, most financial services, including insurance and investment fund management, have been exempt from VAT.',
 'During this period, two problems have essentially arisen: the definition of the scope of the exemption and the impossibility of recovering VAT incurred in ord

In [4]:
raw_datasets["train"][:14]["dest_text"]

['Seit 1977 wurden die meisten Finanzdienstleistungen, einschließlich Versicherungen und Verwaltung von Investmentfunds, von der Mehrwertsteuer ausgenommen.',
 'La mayoría de los servicios financieros, incluidos los seguros y la gestión de fondos de inversión, están exentos de IVA desde 1977.',
 'Depuis 1977, la plupart des services financiers, dont les assurances et la gestion des fonds de placement, ne sont pas tenus d ’ appliquer une TVA.',
 'Dal 1997 la maggior parte dei servizi finanziari, compresi i servizi assicurativi e la gestione di fondi di investimento, sono esenti da IVA.',
 'Sinds 1977 zijn de meeste financiële diensten, met inbegrip van verzekeringen en het beheer van beleggingsfondsen, vrijgesteld van btw.',
 'większość usług finansowych, w tym usług w zakresie ubezpieczeń i zarządzania funduszami inwestycyjnymi, była zwolniona z opodatkowania podatkiem VAT.',
 'Desde 1977 que a maioria dos serviços financeiros, incluindo os seguros e a gestão de fundos de investimento,

In [5]:
raw_datasets["train"][:14]["dest_lang"]

[0, 2, 3, 4, 5, 6, 7, 0, 2, 3, 4, 5, 6, 7]

As shown, each English sentence is repeated for each of the seven target languages (0: 'de', 2: 'es', 3: 'fr', 4: 'it', 5: 'nl', 6: 'pl', 7: 'pt').
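
To make the mapping between integer labels and language codes concrete, here is a small sketch in plain Python. The two lists are copied from the outputs above; no Datasets call is involved.

```python
# Language names as listed in the ClassLabel feature shown above.
names = ['de', 'en', 'es', 'fr', 'it', 'nl', 'pl', 'pt', 'ro']
# Integer labels of the first 14 training rows, copied from the output above.
dest_lang = [0, 2, 3, 4, 5, 6, 7, 0, 2, 3, 4, 5, 6, 7]

# Decode each integer label into its language code.
decoded = [names[i] for i in dest_lang]
print(decoded[:7])  # ['de', 'es', 'fr', 'it', 'nl', 'pl', 'pt']

# The inverse lookup gives the integer id of a language code.
print(names.index("es"))  # 2
```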

The Llama2 model is a pretrained Large Language Model (LLM) ready to tackle several NLP tasks, one of them being translation from English into Spanish. Let us keep only the English-to-Spanish pairs of Europarl-ST using a simple [lambda function](https://realpython.com/python-lambda/) with the [Dataset.filter() function](https://huggingface.co/docs/datasets/v2.9.0/en/package_reference/main_classes#datasets.Dataset.filter).

In [6]:
lang="es"
lang_id = raw_datasets["train"].features["dest_lang"].names.index(lang)
raw_datasets = raw_datasets.filter(lambda x: x["dest_lang"] == lang_id)

More precisely, we are going to use the Llama-2 checkpoint [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) to run our experiments, for which you need to accept the LLAMA 2 COMMUNITY LICENSE AGREEMENT. Processing your request may take some time, so please do it in advance.

In [5]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: fineGrained).
Your token has been saved to /root/.cache/huggingface/token
Login successful


We can apply the tokenizer function to any dataset, taking advantage of the fact that Hugging Face Datasets are [Apache Arrow](https://arrow.apache.org) files stored on disk, so only the samples you ask for are loaded in memory.

To keep the data as a dataset, we will use the [Dataset.map() function](https://huggingface.co/docs/datasets/en/package_reference/main_classes#datasets.Dataset.map). This also allows us some extra flexibility, if we need more preprocessing done than just tokenization. The map() method works by applying a function on each element of the dataset.
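
When map() is called with batched=True, as done later in this notebook, the function receives a dictionary of lists (one list per column) and returns a dictionary of lists whose keys become new columns. The sketch below imitates that behaviour with plain Python; `toy_batched_map` and `add_length` are hypothetical names for illustration and are not part of the Datasets library.

```python
def toy_batched_map(columns, function):
    """Imitate the effect of Dataset.map(function, batched=True) on one batch.

    `columns` maps column names to equally long lists. The returned dict keeps
    the original columns and adds (or overwrites) those produced by `function`.
    """
    new_columns = function(columns)
    return {**columns, **new_columns}


def add_length(batch):
    # Receives whole columns and returns one new column of the same length.
    return {"source_len": [len(text.split()) for text in batch["source_text"]]}


data = {
    "source_text": ["I will now start.", "Now for two points."],
    "dest_lang": [2, 2],
}
result = toy_batched_map(data, add_length)
print(result["source_len"])  # [4, 4]
```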

In our case, each sample pair is going to be preprocessed according to the needs of the model to be prompted. In the case of Llama2, it is recommended to explicitly state a task prompt for each source sentence:

In [8]:
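from transformers import AutoTokenizer

max_tok_length = 16  # maximum number of tokens per source/target sentence
checkpoint = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(
    checkpoint, use_auth_token=True,
    padding=True,
    pad_to_multiple_of=8,
    truncation=True,
    max_length=max_tok_length,
    padding_side='left',  # pad on the left so that generation continues right after the prompt
    )
# Llama 2 does not define a padding token, so we set one explicitly
tokenizer.pad_token = "[PAD]"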

In [9]:
def preprocess_function(sample):
    model_inputs = tokenizer(
        sample["source_text"], 
        text_target = sample["dest_text"],
        )
    return model_inputs

The way the Datasets library applies this processing is by adding new fields to the datasets, one for each key in the dictionary returned by the tokenize function, that is, *input_ids*, *attention_mask* and *labels*. We can check what preprocess_function is doing with a small sample:

In [10]:
sample = raw_datasets["train"].select(range(2))
model_input = preprocess_function(sample)
print(model_input)

{'input_ids': [[1, 4001, 29871, 29896, 29929, 29955, 29955, 29892, 1556, 18161, 5786, 29892, 3704, 1663, 18541, 322, 13258, 358, 5220, 10643, 29892, 505, 1063, 429, 3456, 515, 478, 1299, 29889], [1, 7133, 445, 3785, 29892, 1023, 4828, 505, 13674, 564, 7674, 29901, 278, 5023, 310, 278, 6874, 310, 278, 11875, 683, 322, 278, 7275, 29879, 4127, 310, 9792, 292, 478, 1299, 297, 2764, 1127, 297, 1797, 304, 3867, 429, 3456, 5786, 29892, 6820, 14451, 304, 278, 27791, 265, 310, 7934, 478, 1299, 29889]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

In [11]:
for sample in model_input['input_ids']:
    print(tokenizer.convert_ids_to_tokens(sample))

['<s>', '▁Since', '▁', '1', '9', '7', '7', ',', '▁most', '▁financial', '▁services', ',', '▁including', '▁ins', 'urance', '▁and', '▁invest', 'ment', '▁fund', '▁management', ',', '▁have', '▁been', '▁ex', 'empt', '▁from', '▁V', 'AT', '.']
['<s>', '▁During', '▁this', '▁period', ',', '▁two', '▁problems', '▁have', '▁essentially', '▁ar', 'isen', ':', '▁the', '▁definition', '▁of', '▁the', '▁scope', '▁of', '▁the', '▁exem', 'ption', '▁and', '▁the', '▁impos', 's', 'ibility', '▁of', '▁recover', 'ing', '▁V', 'AT', '▁in', 'cur', 'red', '▁in', '▁order', '▁to', '▁provide', '▁ex', 'empt', '▁services', ',', '▁giving', '▁rise', '▁to', '▁the', '▁phenomen', 'on', '▁of', '▁hidden', '▁V', 'AT', '.']


We can recover the source text by applying the [batch_decode](https://huggingface.co/docs/transformers/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.batch_decode) method of the tokenizer:

In [12]:
tokenizer.batch_decode(model_input['input_ids'])

['<s> Since 1977, most financial services, including insurance and investment fund management, have been exempt from VAT.',
 '<s> During this period, two problems have essentially arisen: the definition of the scope of the exemption and the impossibility of recovering VAT incurred in order to provide exempt services, giving rise to the phenomenon of hidden VAT.']

Now, we can apply the preprocess_function to the raw datasets (training, validation and test):

In [13]:
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)

Map: 100%|██████████| 76469/76469 [00:04<00:00, 16579.18 examples/s]
Map: 100%|██████████| 11008/11008 [00:00<00:00, 14931.69 examples/s]
Map: 100%|██████████| 10686/10686 [00:00<00:00, 14750.06 examples/s]


We are going to filter the tokenized datasets by maximum number of tokens in source and target language:

In [14]:
tokenized_datasets = tokenized_datasets.filter(
    lambda x: len(x["input_ids"]) <= max_tok_length and len(x["labels"]) <= max_tok_length,
    desc=f"Discarding source and target sentences with more than {max_tok_length} tokens",
)

Discarding source and target sentences with more than 16 tokens: 100%|██████████| 76469/76469 [00:02<00:00, 26592.84 examples/s]
Discarding source and target sentences with more than 16 tokens: 100%|██████████| 11008/11008 [00:00<00:00, 26965.55 examples/s]
Discarding source and target sentences with more than 16 tokens: 100%|██████████| 10686/10686 [00:00<00:00, 26930.91 examples/s]


We can take a quick look at the length histogram in the source language:

In [15]:
dic = {}
for sample in tokenized_datasets['train']:
    sample_length = len(sample['input_ids'])
    if sample_length not in dic:
        dic[sample_length] = 1
    else:
        dic[sample_length] += 1 

for i in range(1,max_tok_length+1):
    if i in dic:
        print(f"{i:>2} {dic[i]:>3}")

 3   6
 4  64
 5  79
 6 304
 7 455
 8 568
 9 704
10 703
11 629
12 545
13 370
14 200
15 135
16  68
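
The same kind of histogram can be computed more compactly with `collections.Counter`. The sketch below runs on a few made-up token-id lists (`toy_input_ids` is a hypothetical stand-in for the `input_ids` column of the training set).

```python
from collections import Counter

# Hypothetical token-id sequences standing in for tokenized_datasets['train'].
toy_input_ids = [[1, 5, 9], [1, 7, 7, 2], [1, 4, 4], [1, 8, 3, 6, 2]]

# Count how many sequences have each length.
length_counts = Counter(len(ids) for ids in toy_input_ids)
for length in sorted(length_counts):
    print(f"{length:>2} {length_counts[length]:>3}")
```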


Checking a sample after filtering by maximum number of tokens:

In [16]:
for sample in tokenized_datasets['train'].select(range(5)):
    print(sample['input_ids'])
    print(sample['attention_mask'])
    print(sample['labels'])

[1, 3237, 7178, 29892, 591, 2609, 12522, 1749, 5076, 304, 445, 29889]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 922, 30046, 272, 28828, 29892, 694, 13279, 7681, 274, 3127, 279, 1232, 288, 14736, 29889]
[1, 1334, 817, 304, 4337, 7113, 7824, 2428, 4924, 29889]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 11389, 712, 1029, 4096, 279, 17926, 1185, 2428, 1730, 3175, 14721, 29874, 29889]
[1, 450, 24161, 411, 16762, 471, 6200, 1407, 10676, 29889]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 319, 15255, 6079, 29892, 425, 24161, 378, 16762, 3576, 12287, 15258, 29889]
[1, 512, 445, 3390, 2086, 29892, 591, 526, 10223, 292, 363, 278, 5434, 29889]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 18247, 427, 831, 29877, 29892, 4697, 14054, 29892, 707, 14054, 447, 27182, 3105, 2192, 29889]
[1, 2193, 338, 451, 1855, 29311, 3381, 29889]
[1, 1, 1, 1, 1, 1, 1, 1]
[1, 382, 578, 694, 831, 425, 1120, 29463, 983, 29311, 22919, 29889]


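We now build the few-shot prompt: a task prefix, followed by num_shots translation examples (shots) drawn at random from the training set, followed by the source sentence to be translated. We also compute max_tok_len, an upper bound on the number of tokens that is used when tokenizing the prompts and when generating:

In [52]: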
src = "en"
tgt = lang
task_prefix = f"Translate from {src} to {tgt}:\n"
num_shots = 1
shots = ""
s = ""

# Token counts of the empty templates (no sentences filled in yet)
prefix_tok_len = len(tokenizer.encode(f"{task_prefix}{shots}{src}: {s} = {tgt}: "))
shot_tok_len   = len(tokenizer.encode(f"{src}: {s} = {tgt}: {s}\n"))
# Length budget: templates plus up to max_tok_length tokens per sentence
max_tok_len = prefix_tok_len
max_tok_len += num_shots * (shot_tok_len + 2 * max_tok_length)
max_tok_len += max_tok_length

# Draw the demonstrations (shots) at random from the training set
random_seed = 13
sample = tokenized_datasets['train'].shuffle(seed=random_seed).select(range(num_shots))
for shot in sample:
    shots += f"{src}: {shot['source_text']} = {tgt}: {shot['dest_text']}\n"

def preprocess4test_function(sample):
    # Prompt = task prefix + shots + source sentence with an empty target slot
    inputs = [f"{task_prefix}{shots}{src}: {s} = {tgt}: " for s in sample["source_text"]]
    model_inputs = tokenizer(
        inputs,
        max_length=max_tok_len,
        truncation=True,
        return_tensors="pt",
        padding=True)
    return model_inputs
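
The arithmetic behind max_tok_len in the cell above can be isolated in a small function. This is only a sketch: `prompt_token_budget` is a hypothetical helper, and the token counts passed to it are made-up values, not the ones the Llama 2 tokenizer returns.

```python
def prompt_token_budget(prefix_tok_len, shot_tok_len, num_shots, max_tok_length):
    """Reproduce the max_tok_len computation of the cell above."""
    budget = prefix_tok_len  # task prefix plus the empty query template
    # Each shot: its template plus a source and a target sentence.
    budget += num_shots * (shot_tok_len + 2 * max_tok_length)
    budget += max_tok_length  # one more sentence of at most max_tok_length tokens
    return budget


# Made-up token counts, for illustration only.
print(prompt_token_budget(prefix_tok_len=16, shot_tok_len=8, num_shots=1, max_tok_length=16))  # 72
print(prompt_token_budget(prefix_tok_len=16, shot_tok_len=8, num_shots=0, max_tok_length=16))  # 32
```

The budget grows linearly with num_shots, so every additional demonstration lengthens all prompts by the same fixed amount.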

As before, we can check what preprocess4test_function is doing with a small sample of the test set. Note that the prompts are padded on the left, so that all sequences in a batch have the same length:

In [64]:
sample = tokenized_datasets['test'].select(range(5))
model_input = preprocess4test_function(sample)
print(model_input)
print(tokenizer.batch_decode(model_input['input_ids']))

{'input_ids': tensor([[    0,     0,     1,  4103,  9632,   515,   427,   304,   831, 29901,
            13,   264, 29901,  2567,   363,  1023,  3291, 29889,   353,   831,
         29901,  4785, 29891,   263,  9248,   279,  3248, 28780, 29889,    13,
           264, 29901,  1938,   591,   864,   304, 26054,   895,   278,  2791,
          1691, 29973,   353,   831, 29901, 29871],
        [    1,  4103,  9632,   515,   427,   304,   831, 29901,    13,   264,
         29901,  2567,   363,  1023,  3291, 29889,   353,   831, 29901,  4785,
         29891,   263,  9248,   279,  3248, 28780, 29889,    13,   264, 29901,
          2398, 29892,   445,   947,   451,  2099,   766, 29885,   424,  1847,
           963, 29889,   353,   831, 29901, 29871],
        [    0,     0,     0,     0,     0,     0,     0,     1,  4103,  9632,
           515,   427,   304,   831, 29901,    13,   264, 29901,  2567,   363,
          1023,  3291, 29889,   353,   831, 29901,  4785, 29891,   263,  9248,
           27

In [65]:
preprocessed_test_dataset = tokenized_datasets['test'].map(preprocess4test_function, batched=True)

In [66]:
for sample in preprocessed_test_dataset.select(range(5)):
    print(sample['input_ids'])
    print(sample['attention_mask'])
    print(sample['labels'])

[0, 0, 0, 0, 0, 1, 4103, 9632, 515, 427, 304, 831, 29901, 13, 264, 29901, 2567, 363, 1023, 3291, 29889, 353, 831, 29901, 4785, 29891, 263, 9248, 279, 3248, 28780, 29889, 13, 264, 29901, 1938, 591, 864, 304, 26054, 895, 278, 2791, 1691, 29973, 353, 831, 29901, 29871]
[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 18613, 2182, 7884, 26054, 15356, 1232, 16856, 2255, 29973]
[0, 0, 0, 1, 4103, 9632, 515, 427, 304, 831, 29901, 13, 264, 29901, 2567, 363, 1023, 3291, 29889, 353, 831, 29901, 4785, 29891, 263, 9248, 279, 3248, 28780, 29889, 13, 264, 29901, 2398, 29892, 445, 947, 451, 2099, 766, 29885, 424, 1847, 963, 29889, 353, 831, 29901, 29871]
[0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 28608, 831, 29877, 694, 28711, 553, 29885, 424, 295, 279, 5409, 29889]
[0, 0, 0, 0, 0, 0, 0, 0, 0

bitsandbytes is a quantization library with a Transformers integration. With this integration, you can quantize a model to 8 or 4 bits and enable many other options by configuring the BitsAndBytesConfig class. For example, you can:

<ul>
<li>set load_in_4bit=True to quantize the model to 4 bits when you load it</li>
<li>set bnb_4bit_quant_type="nf4" to use a special 4-bit data type for weights initialized from a normal distribution</li>
<li>set bnb_4bit_use_double_quant=True to use a nested quantization scheme to quantize the already quantized weights</li>
<li>set bnb_4bit_compute_dtype=torch.bfloat16 to use bfloat16 for faster computation</li>
</ul>
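
To get a rough idea of why 4-bit loading matters, the sketch below estimates the memory needed to store the weights alone. It assumes a round figure of 7 billion parameters and ignores quantization constants, activations and any other overhead, so the numbers are only indicative; `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_parameters, bits_per_parameter):
    """Approximate memory needed to store the weights alone, in gigabytes."""
    return num_parameters * bits_per_parameter / 8 / 1e9


num_parameters = 7e9  # assumed round figure for a "7B" model
print(f"16-bit: {weight_memory_gb(num_parameters, 16):.1f} GB")
print(f" 4-bit: {weight_memory_gb(num_parameters, 4):.1f} GB")
```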


In [33]:
import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

Pass the quantization_config to the from_pretrained method.

In [34]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
)


Loading checkpoint shards: 100%|██████████| 2/2 [00:08<00:00,  4.27s/it]


## Inference

We load the default inference parameters of the model, so that additional parameters can be added and passed to the [generate function](https://huggingface.co/docs/transformers/main_classes/text_generation):

In [35]:
from transformers import GenerationConfig

generation_config = GenerationConfig.from_pretrained(
    checkpoint,
    )

print(generation_config)

GenerationConfig {
  "bos_token_id": 1,
  "do_sample": true,
  "eos_token_id": 2,
  "max_length": 4096,
  "pad_token_id": 0,
  "temperature": 0.6,
  "top_p": 0.9,
  "transformers_version": "4.30.0"
}



We prepare the test set in batches to be translated:

In [48]:
test_batch_size = 32
batch_tokenized_test = preprocessed_test_dataset.batch(test_batch_size)

Batching examples: 100%|██████████| 678/678 [00:00<00:00, 11019.04 examples/s]


In [67]:
number_of_batches = len(batch_tokenized_test["input_ids"])
output_sequences = []
for i in range(number_of_batches):
    output_batch = model.generate(
        generation_config=generation_config, 
        input_ids=torch.tensor(batch_tokenized_test["input_ids"][i]).cuda(), 
        attention_mask=torch.tensor(batch_tokenized_test["attention_mask"][i]).cuda(), 
        max_length = max_tok_len, 
        num_beams=1, 
        do_sample=False,)
    output_sequences.extend(output_batch)
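
The call above sets num_beams=1 and do_sample=False, which corresponds to greedy decoding: at each step the most likely next token is appended until the end-of-sequence token is produced or max_length (prompt included) is reached. The toy sketch below illustrates the idea; `greedy_decode` and `toy_scores` are hypothetical stand-ins and do not call the model.

```python
def greedy_decode(next_token_scores, prompt_ids, eos_id, max_length):
    """Repeatedly append the highest-scoring next token.

    `next_token_scores` is a hypothetical stand-in for the model: it maps the
    ids generated so far to a dict {token_id: score}.
    """
    ids = list(prompt_ids)
    while len(ids) < max_length:  # max_length counts prompt and generated tokens
        scores = next_token_scores(ids)
        best = max(scores, key=scores.get)
        ids.append(best)
        if best == eos_id:
            break
    return ids


def toy_scores(ids):
    # Toy "model": prefers (last token + 1), and EOS (id 2) once it reaches 12.
    if ids[-1] >= 12:
        return {2: 1.0, ids[-1] + 1: 0.5}
    return {ids[-1] + 1: 1.0, 2: 0.1}


print(greedy_decode(toy_scores, [1, 10], eos_id=2, max_length=8))  # [1, 10, 11, 12, 2]
```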

## Evaluation

The last step is to define how to compute the metrics that evaluate the predictions of our model with respect to the references. For this purpose, we use the [Evaluate library](https://huggingface.co/docs/evaluate), which includes the definition of generic and task-specific metrics. In our case, we use the [BLEU metric](https://huggingface.co/spaces/evaluate-metric/bleu), or to be more precise, [sacreBLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu). We start by loading the metric:

In [68]:
from evaluate import load

metric = load("sacrebleu")
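
BLEU combines clipped n-gram precisions with a brevity penalty. The sketch below shows only the clipped precision building block, using whitespace tokenization; it is an illustration and not the sacreBLEU implementation, and `clipped_ngram_precision` is a hypothetical helper.

```python
from collections import Counter


def clipped_ngram_precision(prediction, reference, n=1):
    """Fraction of predicted n-grams that also occur in the reference,
    counting each n-gram at most as often as it appears in the reference."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    pred_counts = ngrams(prediction.split())
    ref_counts = ngrams(reference.split())
    total = sum(pred_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    return matched / total


# "the" is predicted three times but appears once in the reference,
# so only one of its occurrences counts as a match.
print(clipped_ngram_precision("the the the cat", "the cat sat", n=1))  # 0.5
```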

The example below performs a basic post-processing to decode the predictions and extract the translation:

In [69]:
import re

def compute_metrics(sample, output_sequences):
    # Rebuild the prompts so that they can be stripped from the decoded outputs
    inputs = [f"{task_prefix}{shots}{src}: {s} = {tgt}: " for s in sample["source_text"]]
    preds = tokenizer.batch_decode(output_sequences, skip_special_tokens=True)
    print(inputs)
    print(preds)
    for i, (prompt, pred) in enumerate(zip(inputs, preds)):
        # Keep only the first generated line that follows the prompt
        match = re.search(r'^.*\n', pred.removeprefix(prompt).strip())
        if match is not None:
            preds[i] = match.group()[:-1]
        else:
            preds[i] = ""
    print(sample["source_text"])
    print(sample["dest_text"])
    print(preds)
    result = metric.compute(predictions=preds, references=sample["dest_text"])
    result = {"bleu": result["score"]}
    return result
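
The extraction step can be tried in isolation on a single string. The sketch below is a simplified variant: it removes the prompt and keeps the first generated line, and unlike the regular expression above it also keeps a continuation that contains no newline. `extract_translation` is a hypothetical helper and the model continuation is made up.

```python
def extract_translation(decoded, prompt):
    """Drop the prompt and keep only the first generated line."""
    continuation = decoded.removeprefix(prompt).strip()
    return continuation.split("\n", 1)[0]


prompt = (
    "Translate from en to es:\n"
    "en: Now for two points. = es: Voy a tratar dos puntos.\n"
    "en: I will now start. = es: "
)
# Made-up continuation: the model often keeps generating further "shots".
decoded = prompt + "Voy a empezar.\nen: Thank you. = es: Gracias."
print(extract_translation(decoded, prompt))  # Voy a empezar.
```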

In [70]:
result = compute_metrics(preprocessed_test_dataset,output_sequences)
print(f'BLEU score: {result["bleu"]}')

['Translate from en to es:\nen: Now for two points. = es: Voy a tratar dos puntos.\nen: Do we want to liberalise the markets? = es: ', 'Translate from en to es:\nen: Now for two points. = es: Voy a tratar dos puntos.\nen: However, this does not mean dismantling them. = es: ', 'Translate from en to es:\nen: Now for two points. = es: Voy a tratar dos puntos.\nen: I will now start. = es: ', 'Translate from en to es:\nen: Now for two points. = es: Voy a tratar dos puntos.\nen: The international community cannot remain impassive. = es: ', 'Translate from en to es:\nen: Now for two points. = es: Voy a tratar dos puntos.\nen: Nonetheless, there must be a balanced outcome. = es: ', 'Translate from en to es:\nen: Now for two points. = es: Voy a tratar dos puntos.\nen: We now know what he wanted it for. = es: ', 'Translate from en to es:\nen: Now for two points. = es: Voy a tratar dos puntos.\nen: Secondly, some of your suggestions are inadvisable. = es: ', 'Translate from en to es:\nen: Now for