# Practical Machine Learning and Deep Learning
# Lesson 5



# Fine-tuning a model on a translation task
Today we will fine-tune the T5 (Text-To-Text Transfer Transformer) [model](https://github.com/google-research/t5x) on a translation task. For this purpose we will use [HuggingFace transformers](https://huggingface.co/docs/transformers/index) and the [WMT16](https://huggingface.co/datasets/wmt16) dataset.

In [None]:
# installing huggingface libraries for dataset, models and metrics
!pip install datasets transformers[sentencepiece] sacrebleu

!pip install numpy==1.24.3

!pip install transformers[torch]

In [None]:
# Necessary imports
import warnings

from datasets import load_dataset, load_metric
import transformers
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

warnings.filterwarnings('ignore')

## Selecting the model
For this example we select the smallest transformer in the T5 family, `t5-small`, as the model checkpoint. Other pre-trained models can be found [here](https://huggingface.co/docs/transformers/model_doc/t5#:~:text=T5%20comes%20in%20different%20sizes%3A).

In [None]:
# selecting model checkpoint
model_checkpoint = "t5-small"

## Loading the dataset

In [None]:
# setting random seed for transformers library
transformers.set_seed(42)

# Load the WMT16 dataset
raw_datasets = load_dataset("wmt16", "de-en")

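# Load the BLEU metric
# (`load_metric` was removed in datasets 3.0; newer setups use `evaluate.load("sacrebleu")` from the `evaluate` package)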
metric = load_metric("sacrebleu")

## Dataset
The dataset downloaded from HuggingFace is a `DatasetDict`.

In the Hugging Face datasets library, DatasetDict is a class used to manage multiple datasets, usually representing different splits of a dataset, such as `train`, `test`, and `validation`. It allows users to work with these splits in a structured and intuitive way.

### Why use `DatasetDict`?

1.   **Dictionary-Like Structure:** DatasetDict is essentially a dictionary where keys are split names (e.g., `train`, `test`, `validation`) and values are Dataset objects.
2.   **Convenient Access:** Provides easy access to different splits of a dataset using dictionary-style lookups.
3.   **Integrated Operations:** Facilitates applying transformations or preprocessing steps across all dataset splits simultaneously.
4.   **Seamless Integration:** Works well with other components of the Hugging Face ecosystem, including tokenizers and models.




In [None]:
raw_datasets

In [None]:
# samples from train dataset
raw_datasets["train"][:5]

## Metric
[Sacrebleu](https://huggingface.co/spaces/evaluate-metric/sacrebleu) computes:
- `score`: BLEU score
- `counts`: list of counts of correct n-grams
- `totals`: list of counts of total n-grams
- `precisions`: list of precisions
- `bp`: Brevity penalty
- `sys_len`: cumulative system length
- `ref_len`: cumulative reference length

The main metric is [BLEU score](https://en.wikipedia.org/wiki/BLEU). BLEU (BiLingual Evaluation Understudy) is a metric for automatically evaluating machine-translated text. The BLEU score measures the similarity of the machine-translated text to a set of high quality reference translations. It was originally designed for machine translation tasks but has since been adapted and widely used in other natural language generation tasks.


Here’s a step-by-step overview of how BLEU score is calculated:

1. **N-gram Precision**:
   - BLEU score calculates precision for [n-grams](https://en.wikipedia.org/wiki/N-gram) (contiguous sequences of n items) of varying lengths (typically 1 to 4). The precision for each n-gram length is computed as follows:
     - **Reference Maximum**: For each distinct n-gram of length $n$, find the maximum number of times it appears in any single reference translation, denoted as $\text{max}_n$.
     - **Candidate n-gram Count**: Count how many times that n-gram appears in the candidate (machine-generated) translation, denoted as $\text{count}_n$.
     - **Clip Count**: Clip $\text{count}_n$ by $\text{max}_n$, i.e. $\text{clip}_n = \min(\text{count}_n, \text{max}_n)$. This prevents artificially inflating the score by repeating an n-gram more often than it occurs in the references.
     - **Precision Calculation**: Calculate the modified precision for each n-gram length as $p_n = \frac{\sum \text{clip}_n}{\sum \text{count}_n}$, where both sums run over the distinct n-grams of the candidate.

2. **Brevity Penalty**:
   - BLEU includes a penalty term to account for shorter candidate translations compared to the reference translations. This is to discourage generating overly short translations that may inflate precision scores unfairly.
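   - **Brevity Penalty Calculation**: Compute the brevity penalty as $\text{BP} = 1$ if $\text{candidate length} > \text{reference length}$, and $\text{BP} = \exp\left(1 - \frac{\text{reference length}}{\text{candidate length}}\right)$ otherwise. Here $\text{candidate length}$ is the length of the candidate translation and $\text{reference length}$ is the length of the reference (with several references, the one closest in length to the candidate is typically used). At corpus level both lengths are summed over all sentences.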

3. **BLEU Score Calculation**:
   - Combine the n-gram precisions using a weighted geometric mean, adjusted by the brevity penalty:
     - $\text{BLEU} = \text{BP} \times \exp \left( \frac{1}{N} \sum_{i=1}^{N} \log p_i \right)$, where $p_i$ is the precision for n-gram length $i$ and $N$ is the maximum n-gram length (typically 4).

4. **Interpretation**:
   - BLEU score ranges from 0 to 1 (sacreBLEU reports it on a 0 to 100 scale), with the maximum indicating a perfect match between candidate and reference translations. Higher BLEU scores indicate better translation quality, but the interpretation can vary depending on the task and the specific context.

5. **Implementation**:
   - BLEU score computation is often implemented in libraries such as NLTK or directly in evaluation scripts provided by machine translation frameworks. These implementations handle the counting of n-grams, calculation of precisions, and application of the brevity penalty.
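
The steps above can be sketched in plain Python for a single sentence. This is a simplified illustration, not the sacreBLEU implementation: it assumes whitespace tokenization, uniform n-gram weights, no smoothing, and a 0 to 1 scale, and the function names `ngram_counts` and `sentence_bleu` are made up for this example.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count all contiguous n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(candidate, references, max_n=4):
    """Simplified sentence-level BLEU on a 0..1 scale (no smoothing)."""
    cand = candidate.split()
    refs = [ref.split() for ref in references]

    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        # max_n: highest count of each n-gram in any single reference
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngram_counts(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        # clip_n = min(count_n, max_n), summed over candidate n-grams
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, one zero precision gives BLEU = 0
        precisions.append(clipped / total)

    # Brevity penalty, using the reference closest in length to the candidate
    cand_len = len(cand)
    ref_len = min((abs(len(ref) - cand_len), len(ref)) for ref in refs)[1]
    bp = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)

    # Geometric mean of the precisions, scaled by the brevity penalty
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(sentence_bleu("the cat sat on the mat", ["the cat sat on the mat"]))
print(sentence_bleu("the the the the", ["the cat sat on the mat"], max_n=1))
```

In the second call the unigram `the` appears four times in the candidate but only twice in the reference, so clipping limits the precision to 2/4, and the shorter candidate is further reduced by the brevity penalty.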





In [None]:
# display the metric object: its description and usage
metric

In [None]:
fake_preds = ["hello there", "general kenobi", "Can I get an A"]
fake_labels = [["hello there"], ["general kenobi"], ['Can I get a C']]
metric.compute(predictions=fake_preds, references=fake_labels)

## Preprocessing the data
As usual, we need to preprocess and tokenize the data before passing it to the model.

In [None]:
from transformers import AutoTokenizer

# we will use autotokenizer for this purpose
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [None]:
tokenizer("Hello, this one sentence!")

In [None]:
tokenizer(["Hello, this one sentence!", "This is another sentence."])

In [None]:
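# task prefix for the model input; T5 was pre-trained with this wording,
# and the trailing space separates the prefix from the sentence
prefix = "translate English to German: "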

#### Constants and Language Specifications

The constants `max_input_length` and `max_target_length` will define the maximum lengths for the input and target sequences, respectively. The `source_lang` and `target_lang` variables will specify the source language (English) and the target language (German).

## Preprocess Function
Let's define a preprocessing function that prepares text data for a sequence-to-sequence model. It converts raw text into tokenized sequences suitable as model input.


1. Input and Target Sentence Preparation

    Inside the function, `inputs` will be generated by concatenating a prefix with the source language sentences extracted from `examples["translation"]`. Similarly, `targets` will be a list of target language sentences from the same dictionary.

2. Tokenize Inputs

    The function will then use a `tokenizer` to convert these sentences into token IDs. For the inputs, the `tokenizer` will be applied with a `max_length` parameter set to `max_input_length` and truncation to ensure that sequences longer than the specified length are truncated.

3. Tokenize Targets

    Next, the function will tokenize the target sentences, again using the `tokenizer` with a `max_length` parameter set to `max_target_length` and truncation.

4. Add Tokenized Targets to Model Inputs

    The resulting tokenized target sequences will be added to the `model_inputs` dictionary under the key `"labels"`.

5. Return Model Inputs

    Finally, return the `model_inputs` dictionary, which will now include both the tokenized input sequences and their corresponding tokenized target sequences (labels).


In [None]:
max_input_length = 128
max_target_length = 128
source_lang = "en"
target_lang = "de"

def preprocess_function(examples):
    inputs = [prefix + ex[source_lang] for ex in examples["translation"]]
    targets = [ex[target_lang] for ex in examples["translation"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Tokenize the targets (T5 uses the same tokenizer for inputs and targets)
    labels = tokenizer(targets, max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [None]:
# example of preprocessing
preprocess_function(raw_datasets['train'][:2])

In [None]:
# for this example we crop the dataset: the first 5000 examples for train
# and the first 500 each for validation and test
# (a new DatasetDict is built so that raw_datasets itself is not modified)
cropped_datasets = datasets.DatasetDict({
    'train': raw_datasets['train'].select(range(5000)),
    'validation': raw_datasets['validation'].select(range(500)),
    'test': raw_datasets['test'].select(range(500)),
})
tokenized_datasets = cropped_datasets.map(preprocess_function, batched=True)
tokenized_datasets['train'][0]

## Fine-tuning the model

In [None]:
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer
# load the pretrained model from the checkpoint
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

About the training arguments:

  - An output directory will be created for saving model checkpoints and logs, named based on the model name, source language, and target language.
  - The model will be set to evaluate at the end of each epoch.
  - The learning rate will be set to 2e-5 for the optimizer.
  - Both training and evaluation will use a per-device batch size of 32.
  - A weight decay of 0.01 will be applied to the model for regularization.
  - A maximum of 3 checkpoints will be saved during training.
  - The model will be trained for 10 epochs.
  - During evaluation, the model will predict with generation capabilities (e.g., for tasks like translation or summarization).
  - Mixed precision (fp16) training will be enabled to speed up training and reduce memory usage; this setting requires a CUDA GPU.
  - Training metrics will be reported to TensorBoard for visualization.
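
To get a feel for what these settings imply, the number of optimizer steps can be estimated with simple arithmetic. This is a rough sketch: it assumes the 5000-example training subset used in this lesson, a single device, no gradient accumulation, and that the last partial batch is kept. The variable names are illustrative.

```python
import math

num_train_examples = 5000  # size of the cropped training split
per_device_batch_size = 32
num_epochs = 10

# The last batch of an epoch may be smaller, hence the ceiling
steps_per_epoch = math.ceil(num_train_examples / per_device_batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)
```

With evaluation at the end of each epoch, the validation BLEU would then be reported ten times over the whole run.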

In [None]:
# defining the parameters for training
batch_size = 32
model_name = model_checkpoint.split("/")[-1]
args = Seq2SeqTrainingArguments(
    f"{model_name}-finetuned-{source_lang}-to-{target_lang}",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=10,
    predict_with_generate=True,
    fp16=True,
    report_to='tensorboard',
)

Instead of writing a `collate_fn` function, we will use `DataCollatorForSeq2Seq`.
Similarly, it implements batch creation for training: it pads the inputs and labels of each batch to a common length.
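
What such a collator does can be approximated in a few lines of plain Python. The sketch below is a simplification, not the library code: `pad_batch` is a hypothetical helper, the token IDs are made up, and the real `DataCollatorForSeq2Seq` additionally returns tensors and, when a model is passed, prepares `decoder_input_ids`.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Pad every example in a batch to the longest sequence in that batch."""
    max_input_len = max(len(f["input_ids"]) for f in features)
    max_label_len = max(len(f["labels"]) for f in features)

    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in = len(f["input_ids"])
        n_lab = len(f["labels"])
        # Inputs are padded with the pad token and masked out in attention
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * (max_input_len - n_in))
        batch["attention_mask"].append([1] * n_in + [0] * (max_input_len - n_in))
        # Labels are padded with -100 so the loss ignores these positions
        batch["labels"].append(f["labels"] + [label_pad_token_id] * (max_label_len - n_lab))
    return batch

# Two examples of different lengths (illustrative token IDs)
batch = pad_batch([
    {"input_ids": [5, 6, 7, 1], "labels": [8, 1]},
    {"input_ids": [9, 1], "labels": [10, 11, 12, 1]},
])
print(batch["input_ids"])
print(batch["attention_mask"])
print(batch["labels"])
```

Padding per batch rather than to a fixed global length keeps batches of short sentences small, and the `-100` label padding is the reason `compute_metrics` has to replace `-100` before decoding.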

In [None]:

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

The function `postprocess_text` iterates through the predictions (`preds`) and labels (`labels`) and strips leading and trailing whitespace.

For `labels`, it additionally wraps each stripped label in its own list, because sacreBLEU expects a list of references for every prediction.

Here is how the metrics would be computed:


1. **Decoding Predictions and Labels**:
   - `preds` would be decoded using `tokenizer.batch_decode(preds, skip_special_tokens=True)`. This will convert the model's token IDs back into readable text, skipping any special tokens.
   - `labels` would be decoded similarly, but with special handling for `-100` values. These would be replaced with `tokenizer.pad_token_id` because `-100` indicates tokens that were ignored during training (e.g., padding tokens).

2. **Post-processing**:
   - After decoding, both `decoded_preds` and `decoded_labels` would undergo post-processing using the `postprocess_text` function. This function would remove leading and trailing whitespace from each prediction (`preds`) and label (`labels`).

3. **Metric Calculation**:
   - The `metric.compute` function would compute evaluation metrics (like BLEU score) based on the decoded predictions (`decoded_preds`) and decoded labels (`decoded_labels`).

4. **Additional Metrics**:
   - The function would calculate the average length of the generated predictions (`prediction_lens`) excluding padding tokens.
   - The results would be formatted into a dictionary (`result`), where `"bleu"` would contain the computed BLEU score, and `"gen_len"` would contain the average length of the generated predictions.

5. **Output**:
   - Finally, the function would return `result`, a dictionary containing `"bleu"` and `"gen_len"` metrics rounded to four decimal places.


In [None]:
import numpy as np

# simple postprocessing for text
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]

    return preds, labels

# compute metrics function to pass to trainer
def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result

In [None]:
# instead of writing a training loop we will use Seq2SeqTrainer
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

Let's train the model and save it once training has finished.

In [None]:
trainer.train()

In [None]:
# saving model
trainer.save_model('best')

Now we will load the trained model and use it to perform some translations.

In [None]:
# load the saved model and prepare it for inference
model = AutoModelForSeq2SeqLM.from_pretrained('best')
model.eval()
model.config.use_cache = False

In [None]:
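def translate(model, inference_request, tokenizer=tokenizer):
    # tokenize the request and return PyTorch tensors
    input_ids = tokenizer(inference_request, return_tensors="pt").input_ids
    outputs = model.generate(input_ids=input_ids)
    # temperature is a sampling option of generate(), not of decode(), so it is not passed here
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))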

In [None]:
inference_request = prefix + 'Why did it take so long to train the model?'
translate(model, inference_request,tokenizer)

In [None]:
inference_request = prefix + "My name is Wolfgang and I live in Berlin"
translate(model, inference_request, tokenizer)

In [None]:
inference_request = prefix + "Your assignment is hard. Start it today"
translate(model, inference_request,tokenizer)