Copyright (c) 2023 Graphcore Ltd.

# Textual Entailment on IPUs using Flan-T5 - Fine-tuning

[Flan-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) is an encoder-decoder transformer model that reframes all NLP tasks into a text-to-text format. Compared to T5, Flan-T5 has been fine-tuned on more than 1000 additional tasks. It can also be used for text generation.

Language models are very powerful because a huge variety of tasks can be formulated as text-to-text problems and thus adapted to fit the generative setup, where the model is asked to correctly predict future tokens. For more details, check out the T5 paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/pdf/1910.10683.pdf), and the Flan-T5 paper [Scaling Instruction-Finetuned Language Models](https://arxiv.org/pdf/2210.11416.pdf).

In this notebook we apply the idea from the T5 paper and fine-tune Flan-T5 on the task of textual entailment with the [GLUE MNLI](https://huggingface.co/datasets/glue#mnli) dataset.

We also show how you can easily adapt the example in this notebook for custom fine-tuning of several downstream tasks, such as question answering, named entity recognition, sentiment analysis and text classification. All you would need to do is prepare the data in the way needed for the specific task.

|  Domain | Tasks | Model | Datasets | Workflow |   Number of IPUs   | Execution time |
|---------|-------|-------|----------|----------|--------------|--------------|
|   NLP   |  Textual entailment  | Flan-T5 | GLUE MNLI | Fine-tuning | recommended: 64 (min: 16) |  4h (with 16 IPUs: 7h)  |

### Learning outcomes
In this demo, you will:
- Prepare a configuration for fine-tuning and one for validation, and optionally customise them according to your needs
- Download and preprocess the MNLI dataset in a way suitable for the T5 architecture
- Fine-tune the Flan-T5 model on the dataset, and then perform validation in order to compute the achieved accuracy
- Save a Hugging Face checkpoint of the fine-tuned model, suitable for use on any hardware
- Learn how to adapt this notebook to any downstream task, exploiting the generality of the text-to-text approach of T5

[![Join our Slack Community](https://img.shields.io/badge/Slack-Join%20Graphcore's%20Community-blue?style=flat-square&logo=slack)](https://www.graphcore.ai/join-community)

## Environment setup

The best way to run this demo is on Paperspace Gradient's cloud IPUs because everything is already set up for you.

[![Run on Gradient](../../../gradient-badge.svg)](https://ipu.dev/IotDe5)

To run the demo using other IPU hardware, you need to have the Poplar SDK enabled. Refer to the [Getting Started guide](https://docs.graphcore.ai/en/latest/getting-started.html#getting-started) for your system for details on how to do this. Also refer to the [Jupyter Quick Start guide](https://docs.graphcore.ai/projects/jupyter-notebook-quick-start/en/latest/index.html) for how to set up Jupyter to be able to run this notebook on a remote IPU machine.

## Dependencies and configuration


In [None]:
%pip install -r requirements.txt

## Initial setup
This notebook supports both Flan-T5 XXL (11B parameters) and Flan-T5 XL (3B parameters). The code below uses the XXL model by default. The XXL variant needs a minimum of 16 IPUs to run, while the XL variant needs a minimum of 8 IPUs. Handling the weights and checkpoints of XXL takes more time than for the XL variant, so if you want to run quick experiments, you can change the code below to use the XL model.

Note that if you have enough IPUs available, you can speed up training by using data parallelism. We'll see how to do this in the section [Data parallel](#data-parallel).

In [None]:
# Change the following to "xl" if you want to use the smaller variant
model_size = "xxl"

In [None]:
import os

ipus_needed = 16 if model_size == "xxl" else 8
number_of_ipus = int(os.getenv("NUM_AVAILABLE_IPU", ipus_needed))
if number_of_ipus < ipus_needed:
    raise ValueError(
        f"This example needs {ipus_needed} IPUs to work. Detected {number_of_ipus}"
    )

os.environ["POPART_CACHE_DIR"] = os.getenv(
    "POPLAR_EXECUTABLE_CACHE_DIR", "./exe_cache/"
)
checkpoint_dir = os.getenv("CHECKPOINT_DIR", "checkpoints")

### Fine-tuning configuration
First of all, we need to load a default configuration, defined in `config/finetuning.yml`.
This file has optimised configurations to run the model on IPUs.
We need to pick the configuration suitable for our model size.

This configuration uses a sequence length of 512 tokens. Flan-T5 XXL layers are split across 16 IPUs, using [tensor model parallelism](https://arxiv.org/pdf/1909.08053.pdf). No data parallelism is used by default.
The `t5_config_setup` function sets up the specified configuration and configures logging and Weights & Biases.

In [None]:
from utils.setup import t5_config_setup
from config import CONFIG_DIR

We support logging to Weights & Biases. If you want to use it, you first need to manually log in (see the [W&B quickstart](https://docs.wandb.ai/quickstart) for details).


In [None]:
# Set this to True if you want to use W&B. Be sure to be logged in.
wandb_setup = False

In [None]:
# Choose a configuration
config_name = "xxl_pod16" if model_size == "xxl" else "xl_pod8"
config, args, _ = t5_config_setup(
    CONFIG_DIR / "finetuning.yml", "release", config_name, wandb_setup
)

In [None]:
print(config.dumps_yaml())

### Validation configuration
Configurations for inference are found in `config/inference.yml`:
- `xxl` is the default configuration. It uses a batch size of 12 and is guaranteed to fit into memory with a sequence length of 512.
- `xxl-mnli` is optimised for the MNLI dataset. This increases the batch size to 20 but requires the sequence length to be reduced to 268.

In this example we will start with the default configuration, and manually reduce the sequence length and increase the batch size later on.
You can do the same on your custom dataset to find the best configuration.

In [None]:
# Set this to True if you want to use W&B. Be sure to be logged in.
wandb_setup_on_eval = False

In [None]:
eval_config, *_ = t5_config_setup(
    CONFIG_DIR / "inference.yml",
    "release",
    model_size,
    hf_model_setup=False,
    wandb_setup=wandb_setup_on_eval,
)

In [None]:
print(eval_config.dumps_yaml())

## Dataset
The MNLI dataset consists of pairs of sentences, a *premise* and a *hypothesis*.
The task is to predict the relation between the premise and the hypothesis, which can be:
- `entailment`: hypothesis follows from the premise,
- `contradiction`: hypothesis contradicts the premise,
- `neutral`: the premise neither entails nor contradicts the hypothesis.

Data splits for the MNLI dataset are the following:

|train |validation_matched|validation_mismatched|
|-----:|-----------------:|--------------------:|
|392702|              9815|                 9832|

The matched split is made of samples derived from the same sources as those in the training set. Samples in the mismatched split are derived from different sources, so they don't closely resemble any of the examples seen at training time. For validation we're going to use the mismatched split.
You can explore it on [Hugging Face](https://huggingface.co/datasets/glue/viewer/mnli/train).
![MNLI dataset](imgs/mnli_dataset.png)


### Training preprocessing
The columns we are interested in are `hypothesis`, `premise` and `label`.

As mentioned at the beginning, T5 has an encoder-decoder architecture, so it needs 2 input sequences: one for the encoder, and one for the decoder.
So, the first step consists of forming input prompts for the encoder with the format
```
mnli hypothesis: {hypothesis} premise: {premise}
```
Next, we provide the decoder with the corresponding label, shifted right and prepended with the `<pad>` token:
```
<pad>{label}
```
For example, an encoder sequence would be:
```
mnli hypothesis: Product and geography are what make cream skimming work.  premise: Conceptually cream skimming has two basic dimensions - product and geography.
```
Similarly, an example decoder sequence would be:
```
<pad>neutral
```
The pad token acts as `decoder_start_token_id` for the T5 models.

Then, the encoder and decoder sequences are tokenized and padded to the model sequence length of 512. Attention masks are generated accordingly.
Since the model is trained to predict the MNLI class, the labels are the decoder input sequence shifted by one token to the left. In other words, the labels are the class label itself, without the pad token at the beginning.
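The relationship between decoder inputs and labels can be illustrated with a small, self-contained sketch. The token IDs for the target are made up, and `build_decoder_features` is a hypothetical helper, not the function used in `data/mnli_data.py`. The sketch assumes the usual T5 conventions (pad ID 0, `</s>` ID 1) and that `</s>` is appended to the target, so the real preprocessing may differ in detail.

```python
PAD_ID = 0        # T5 pad token, also used as the decoder start token
EOS_ID = 1        # T5 end-of-text token </s> (assumed convention)
IGNORE_ID = -100  # label value ignored by the cross-entropy loss


def build_decoder_features(target_ids, max_length):
    """Hypothetical helper: build decoder inputs and labels from target token IDs."""
    labels = list(target_ids) + [EOS_ID]
    if len(labels) > max_length:
        raise ValueError("target is too long for the chosen length")
    n_pad = max_length - len(labels)
    # Decoder input: the labels shifted right by one, starting with the pad token
    decoder_input_ids = ([PAD_ID] + labels + [PAD_ID] * n_pad)[:max_length]
    # Labels: padding positions are set to -100 so they do not affect the loss
    labels = labels + [IGNORE_ID] * n_pad
    return decoder_input_ids, labels


# Made-up token IDs standing in for a tokenized class label
decoder_input_ids, labels = build_decoder_features([10, 11], max_length=6)
print(decoder_input_ids)  # [0, 10, 11, 1, 0, 0]
print(labels)             # [10, 11, 1, -100, -100, -100]
```

Each label position holds the token the decoder should predict after seeing the decoder inputs up to that position.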

In [None]:
from datasets import load_dataset, disable_progress_bar, enable_progress_bar
from transformers import AutoTokenizer
import data.mnli_data as mnli_data

The next two cells are the ones you want to change when doing custom fine-tuning.

We first load the MNLI dataset, and then create a custom preprocessing function to build prompts suitable for a
text-to-text setup.
For a custom fine-tuning, you will need to choose a format for your prompts and change the `form_training_prompts` function.
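As an illustration of what such a change might look like, here is a minimal sketch of a prompt function for a sentiment analysis task. The function name, the prompt format and the column names (`sentence`, and `label` with 0 for negative and 1 for positive) are assumptions made for this example, so adjust them to match your own dataset.

```python
def form_sentiment_prompts(example):
    """Hypothetical counterpart of form_training_prompts for a sentiment task."""
    # Assumed columns: "sentence" (str) and "label" (0 = negative, 1 = positive)
    class_label = ["negative", "positive"][example["label"]]
    return {
        "text": f"sentiment sentence: {example['sentence']}",
        "target": class_label,
    }


# A plain dictionary stands in for one row of the dataset
example = {"sentence": "A charming and often affecting journey.", "label": 1}
print(form_sentiment_prompts(example))
```

A function like this could be passed to `dataset.map(...)` in the same way as `form_training_prompts`, with `remove_columns` listing the original columns of your dataset.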

In [None]:
# Load HF dataset
dataset = load_dataset("glue", "mnli", split="train")

In [None]:
dataset[0]

In [None]:
# Form prompts in the format: "mnli hypothesis: {hypothesis} premise: {premise}"
def form_training_prompts(example):
    hypothesis = example["hypothesis"]
    premise = example["premise"]
    class_label = ["entailment", "neutral", "contradiction"][example["label"]]

    example["text"] = f"mnli hypothesis: {hypothesis} premise: {premise}"
    example["target"] = f"{class_label}"
    return example


dataset = dataset.map(
    form_training_prompts,
    remove_columns=["hypothesis", "premise", "label", "idx"],
    load_from_cache_file=False,
    desc="Generating text prompt",
)

In [None]:
# shows first textual prompt
dataset[0]

After that, we tokenize the prompts. You won't need to change this step for custom fine-tuning.

In [None]:
# Create tokenizer
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xxl")

In [None]:
# Tokenize prompts
# We disable the progress bar for this,
# because when num_proc > 1 it's a bit messy
disable_progress_bar()
dataset = dataset.map(
    mnli_data.tokenizes_text(tokenizer),
    batched=True,
    batch_size=1000,
    num_proc=8,
    remove_columns=dataset.column_names,
    load_from_cache_file=False,
    desc="Tokenizing text",
)
enable_progress_bar()

In [None]:
# shows first tokenized prompt
dataset[0]

In [None]:
len(dataset)

In [None]:
# Show a portion of first decoder input. You can see that the decoder input is the label prepended with the pad token (0).
print("First 10 tokens of first decoder input")
print("decoder_input_ids")
print(dataset[0]["decoder_input_ids"][:10])
# Note that the labels' tokens that correspond to padding have been set to -100,
# this value will be ignored when computing the cross-entropy loss
print("labels")
print(dataset[0]["labels"][:10])

> **Note** If you want to adapt this code for another dataset, be sure to call all the inputs and labels in the same way: `input_ids`, `attention_mask`, `decoder_input_ids`, `decoder_attention_mask` and `labels`.

### Validation preprocessing
For validation, we use the [MNLI `validation_mismatched`](https://huggingface.co/datasets/glue/viewer/mnli_mismatched/validation) split.

Similar to what we did for training, the first preprocessing step is creating prompts for the encoder inputs.

In [None]:
eval_dataset = load_dataset("glue", "mnli", split="validation_mismatched")

In [None]:
def form_validation_prompts(example):
    hypothesis = example["hypothesis"]
    premise = example["premise"]

    example["text"] = f"mnli hypothesis: {hypothesis} premise: {premise}"
    return example


eval_dataset = eval_dataset.map(
    form_validation_prompts,
    remove_columns=["hypothesis", "premise", "idx"],
    load_from_cache_file=False,
    desc="Generating text prompt",
)

In [None]:
eval_dataset[0]

Finally, input prompts are tokenized.

In [None]:
def prepare_validation_features(tokenizer):
    def func(dataset):
        tokenized = tokenizer(
            dataset["text"], padding="max_length", return_tensors="np"
        )
        tokenized.update(label=dataset["label"])
        return tokenized

    return func


eval_dataset = eval_dataset.map(
    prepare_validation_features(tokenizer),
    batched=True,
    remove_columns=eval_dataset.column_names,
    load_from_cache_file=False,
    desc="Tokenizing text",
)

In [None]:
eval_dataset[0]

> **Note** If you want to adapt this code for another dataset, be sure to call all the inputs in the same way: `input_ids`, `attention_mask` and `label`.

## Customise configuration and create a Trainer instance
Right at the beginning of the notebook we loaded the default configurations for training and inference. We are now going to show how to customise some of the parameters for your needs.

### Customise training configuration
In the cells below we list the parameters you are most likely to change when doing a custom fine-tuning.

These are the training steps, learning rate and optimizer parameters.

Moreover, it is important that you specify **checkpoint** parameters, namely a folder to save the fine-tuned weights and a periodicity for checkpointing. Be aware that saving checkpoints takes time, so you don't want to save them too often.
To disable intermediate checkpoints set `config.checkpoint.steps = 0`. The final checkpoint is always saved provided the `config.checkpoint.save` directory is given. Set it to `None` if you don't want to save weights, but it's unlikely that you want to disable the last checkpoint.

If you don't need to resume training later on, you can reduce the time and memory required to save checkpoints by setting `config.checkpoint.optim_state = False`. In this case, only the model weights will be saved, while the optimiser state will be discarded.

Checkpoints will be saved in the directory given by the environment variable `CHECKPOINT_DIR`, which we saved in `checkpoint_dir` at the beginning. You can have a look at the path by printing it out in the cell below.

In [None]:
checkpoint_dir

In [None]:
# Customise training arguments
# Note that for the XXL variant it will take approximately 36 seconds per training step,
# and the default number of steps is 500, which means it will take
# about 5 hours to train.
# For the XL variant it will take approximately 16 seconds
# per training step, so about 2.25 hours for 500 steps.
# Here you can set a lower number of steps, to finish training earlier.
# NOTE: you need to increase the following to 500 in order to reach 87% accuracy.
config.training.steps = 50

In [None]:
# Customise optimiser and learning rate schedule
config.training.optimizer.learning_rate.maximum = 5e-6
config.training.optimizer.learning_rate.warmup_steps = 10
config.training.optimizer.beta1 = 0.9
config.training.optimizer.beta2 = 0.999
config.training.optimizer.weight_decay = 0.0
config.training.optimizer.gradient_clipping = 1.0

In [None]:
# Customise checkpoints
config.checkpoint.save = (
    checkpoint_dir  # where the model is saved. None means don't save any checkpoint
)
config.checkpoint.steps = (
    0  # how often you save the model. 0 means only the final checkpoint is saved
)
config.checkpoint.to_keep = 4  # maximum number of checkpoints kept on disk
config.checkpoint.optim_state = (
    False  # Whether to include the optimiser state in checkpoints
)

In [None]:
# Resume training
config.checkpoint.load = (
    None  # you can specify a directory containing a previous checkpoint
)
# os.path.join(checkpoint_dir, ...)

#### Data parallel
If you have enough IPUs available, you can speed up training by using data parallelism. As mentioned previously, the XXL variant needs 16 IPUs, so if you have 64 IPUs you can set data parallel to 4. Similarly, the XL variant needs 8 IPUs, so if you have 16 IPUs you can set data parallel to 2 (or to higher values if you have more than 16 IPUs).

Note that when changing the data parallel value the model will need to be re-compiled the first time it is run with the changed parameter.
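The arithmetic above can be written as a small helper. This is only a sketch: `max_data_parallel` is a hypothetical function, and it assumes that any whole number of replicas is acceptable, which this notebook does not confirm for every value.

```python
def max_data_parallel(available_ipus, ipus_per_replica):
    """Largest number of full model replicas that fit on the available IPUs."""
    if available_ipus < ipus_per_replica:
        raise ValueError(
            f"Need at least {ipus_per_replica} IPUs, found {available_ipus}"
        )
    return available_ipus // ipus_per_replica


print(max_data_parallel(64, 16))  # XXL on 64 IPUs -> 4
print(max_data_parallel(16, 8))   # XL on 16 IPUs -> 2
```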

In [None]:
# You can change the following to a suitable value that depends on
# the number of IPUs in your system and the model variant you chose
config.execution.data_parallel = 1

### Customise validation configuration
For validation there are fewer relevant parameters.

You can control the maximum number of tokens generated by the model using the `output_length` parameter. If you know that the targets are only a few tokens long, it is convenient to set it to a small number. Generation stops when the output token is `</s>` (the end-of-text token) or after `output_length` tokens have been generated.

In our case, we set the `output_length` to 5 to accommodate all class labels and the `</s>` token.

You can also reduce the model `sequence_length` to account for the maximum length of sentences encountered in the validation dataset.
In our case of the MNLI example, `sequence_length = 268` and this allows us to increase the batch size to `20`.

If you adapt this example to another dataset, you can compare your dataset `max_len` with ours and decide if you can safely use the increased batch size. Otherwise, stick with the default one.

Note that when changing the batch size and sequence length the model will need to be re-compiled the first time it is run with the changed parameters.

You can specify a checkpoint to be used for validation. If `None` is provided, the latest weights from the fine-tuning session will be used.

In [None]:
from functools import reduce

In [None]:
# Specify maximum number of tokens that can be generated.
eval_config.inference.output_length = 5
# Reducing sequence length
max_len = reduce(lambda l, e: max(l, sum(e["attention_mask"])), eval_dataset, 0)
print(f"Maximum length in mnli-mismatched dataset: {max_len}")
eval_config.model.sequence_length = max_len
print(f"Setting sequence length to {eval_config.model.sequence_length}")
# Increase batch size
eval_config.execution.micro_batch_size = 20

In [None]:
# You can specify a directory containing a previous checkpoint
# None means the latest weights are used
eval_config.checkpoint.load = None  # os.path.join(checkpoint_dir, ...)

In [None]:
print(eval_config.dumps_yaml())

We also need to define a validation metric. We will use `accuracy`.

In [None]:
import evaluate

accuracy_metric = evaluate.load("accuracy")

Finally, we define a function to convert our generated labels to integer indices.

In [None]:
def postprocess_mnli_predictions(generated_sentences):
    labels_to_ids = {"entailment": 0, "neutral": 1, "contradiction": 2, "unknown": -1}
    predictions = []
    for s in generated_sentences:
        answer = mnli_data.extract_class_label(s)
        predictions.append(labels_to_ids[answer])
    return predictions

### Create a Trainer instance
Once a config is specified, we are ready to create the training session with the help of the 
`T5Trainer` class.
You can provide the following arguments:

- `config`: the training configuration.
- `pretrained`: the Hugging Face pre-trained model, used to initialise the weights.
- `dataset`: the training dataset.
- `eval_dataset`: the validation dataset.
- `eval_config`: the inference configuration, to be used in validation.
- `tokenizer`: the tokenizer, needed during validation.
- `metric`: the metric for validation. A Hugging Face metric from the `evaluate` module. We use the `accuracy` metric.
- `process_answers_func`: a function to convert the generated answers to the format required by the metric and the labels. For example we need to convert textual categories `[entailment, contradiction, neutral]` to indices.

These arguments can also be provided later on when calling `trainer.train(...)` or `trainer.evaluate(...)`.

If you want to run fine-tuning, the pre-trained model should be `google/flan-t5-xxl` (or `google/flan-t5-xl` if you chose the XL variant).

In [None]:
from utils.trainer import T5Trainer
from transformers.models.t5.modeling_t5 import T5ForConditionalGeneration

In [None]:
# This will probably take a few minutes to load, due to the size of the model
pretrained = T5ForConditionalGeneration.from_pretrained(f"google/flan-t5-{model_size}")

In [None]:
trainer = T5Trainer(
    config,
    pretrained,
    dataset,
    eval_dataset,
    eval_config,
    tokenizer,
    accuracy_metric,
    postprocess_mnli_predictions,
    args,
)

## Run fine-tuning
We can now run training for the number of steps you set in the config.
Checkpoints will be saved in the folder you specified in `config.checkpoint.save`, with the periodicity set in `config.checkpoint.steps`.

The first time you run `trainer.train()`, it will take around 15 minutes to compile the training model.

Training takes around 10 minutes to start because the pre-trained weights need to be downloaded to the IPUs.
After that, each step takes around 36 seconds (with no data parallelism).
This does not include time for checkpointing.
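To get a rough idea of the total time for a first run, you can combine the figures quoted above. The helper below is a back-of-the-envelope sketch, not a measurement: it assumes the XXL variant with no data parallelism, includes compilation and start-up, and ignores checkpointing.

```python
def estimated_training_minutes(
    steps, seconds_per_step=36, compile_minutes=15, startup_minutes=10
):
    """Rough estimate of a first training run, using the timings quoted above."""
    return compile_minutes + startup_minutes + steps * seconds_per_step / 60


print(estimated_training_minutes(50))   # 55.0
print(estimated_training_minutes(500))  # 325.0
```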

In [None]:
trainer.train()

## Run validation
Finally, we validate our model on the [`validation_mismatched`](https://huggingface.co/datasets/glue/viewer/mnli_mismatched/validation) split of the MNLI dataset.

Generative inference is performed token-by-token using a greedy heuristic: at each step, the next token is the one with the highest logit.
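The greedy loop can be sketched in a few lines of plain Python. Here `next_token_logits` and `toy_model` are hypothetical stand-ins for the real model, and the token IDs follow the T5 conventions (0 for the start/pad token, 1 for `</s>`). The actual implementation in this repository runs batched on the IPU and will look different.

```python
EOS_ID = 1  # T5 end-of-text token </s>


def greedy_generate(next_token_logits, output_length, start_id=0):
    """Generate tokens one at a time, always picking the highest-scoring one.

    `next_token_logits` is a hypothetical stand-in for the model: it takes the
    tokens generated so far and returns one score per vocabulary entry.
    """
    generated = [start_id]
    for _ in range(output_length):
        logits = next_token_logits(generated)
        next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax
        generated.append(next_id)
        if next_id == EOS_ID:  # stop early at the end-of-text token
            break
    return generated[1:]  # drop the start token


def toy_model(tokens):
    # Toy "model" over a 5-token vocabulary: emits 3, then 4, then </s>
    script = {1: 3, 2: 4}
    target = script.get(len(tokens), EOS_ID)
    return [1.0 if i == target else 0.0 for i in range(5)]


print(greedy_generate(toy_model, output_length=5))  # [3, 4, 1]
print(greedy_generate(toy_model, output_length=2))  # [3, 4]
```

The second call shows the other stopping condition: generation ends after `output_length` tokens even if `</s>` has not been produced.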

We run token-by-token inference in batches to generate labels for multiple examples at once.
We then compute the accuracy by comparing the model answers with the true labels.
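The comparison itself is simple. The sketch below uses a hypothetical parser, `to_label_id`, which is not the `mnli_data.extract_class_label` function used in this notebook, and a hand-written `accuracy` function in place of the `evaluate` metric. The generated strings are invented for illustration.

```python
LABELS_TO_IDS = {"entailment": 0, "neutral": 1, "contradiction": 2}


def to_label_id(generated_text):
    """Hypothetical parser: map generated text to a class index, -1 if unknown."""
    cleaned = generated_text.replace("<pad>", "").replace("</s>", "").strip().lower()
    return LABELS_TO_IDS.get(cleaned, -1)


def accuracy(predictions, references):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Invented model outputs and true labels
generated = ["entailment</s>", "<pad>neutral", "contradiction", "maybe"]
true_labels = [0, 1, 0, 2]

predictions = [to_label_id(s) for s in generated]
print(predictions)                         # [0, 1, 2, -1]
print(accuracy(predictions, true_labels))  # 0.5
```

An answer that cannot be parsed is mapped to -1, so it never matches a true label and counts as incorrect.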

The resulting model should achieve an accuracy of about 87% when fine-tuned for 500 steps.

```
Total number of examples                 9832
Number with badly formed result          0
Number with incorrect result             1253
Number with correct result               8579 [87.3%]
```

The first time you run `trainer.evaluate()` it takes around 6 minutes to compile the inference model.

Running validation on the whole dataset takes around 18 minutes.

Note that:
- If you specify `trainer.evaluate(use_pretrained=True)`, we will use the weights from `pretrained`.
- If you set `eval_config.checkpoint.load` to the path of a specific checkpoint, the weights from that checkpoint will be used.
- If a training session exists we will use the latest weights from the training model.
- If none of the above is true, we will use the weights from the latest checkpoint, if present.
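Reading the list above as an order of precedence, the choice of weights could be sketched as follows. The function `choose_weights_source` and its parameters are hypothetical and only mirror the list. Both the ordering and the behaviour when no weights are found at all (here, raising an error) are assumptions, not the trainer's documented behaviour.

```python
def choose_weights_source(
    use_pretrained=False,
    checkpoint_load=None,
    has_training_session=False,
    latest_checkpoint=None,
):
    """Hypothetical sketch of which weights are used for validation."""
    if use_pretrained:
        return "pretrained"
    if checkpoint_load is not None:
        return f"checkpoint: {checkpoint_load}"
    if has_training_session:
        return "training session"
    if latest_checkpoint is not None:
        return f"checkpoint: {latest_checkpoint}"
    # Assumption: the list above does not say what happens in this case
    raise RuntimeError("No weights available for validation")


print(choose_weights_source(has_training_session=True))       # training session
print(choose_weights_source(latest_checkpoint="checkpoints"))  # checkpoint: checkpoints
```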

In [None]:
trainer.evaluate()

## Save Hugging Face checkpoint
You can save the trained weights so that they can be uploaded to Hugging Face and used with Hugging Face's PyTorch model, on any hardware. For guidance on how to upload the saved model to the Model Hub, check the [Hugging Face documentation](https://huggingface.co/docs/hub/models-uploading).

You can specify a checkpoint path if you want to convert a specific checkpoint instead of the latest fine-tuning weights.

In [None]:
hf_checkpoint_path = os.path.join(checkpoint_dir, "hf_checkpoint")
ckpt_path = None  # os.path.join(checkpoint_dir, ...)

In [None]:
finetuned = trainer.save_hf_checkpoint(hf_checkpoint_path, ckpt_path)

## Run the model with Hugging Face pipeline
The same model can later be used with the standard Hugging Face pipeline on any hardware.

In [None]:
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xxl")
hf_model = T5ForConditionalGeneration.from_pretrained(hf_checkpoint_path)
generator = pipeline("text2text-generation", model=hf_model, tokenizer=tokenizer)

In [None]:
prompt = (
    "mnli hypothesis: Your contributions were of no help with our students' education. "
    "premise: Your contribution helped make it possible for us to provide our students with a quality education."
)

out = generator(prompt)
print(out)

## Conclusion
This notebook has demonstrated how easy it is to fine-tune Flan-T5 on the Graphcore IPU for a textual entailment task. While not as powerful as larger models for free text generation, medium-size encoder-decoder models like Flan-T5 can still be successfully fine-tuned to handle a range of NLP downstream tasks such as question answering, sentiment analysis, and named entity recognition. In fact, for these kinds of tasks you don't need models with 175B parameters like GPT-3. Flan-T5 XXL with 11B parameters has very good language understanding and is suitable for most language tasks. Larger models only give a small improvement in language understanding, but they do add more world knowledge. They therefore perform better at free text generation, as might be used in an AI assistant or chatbot.

In this example we performed fine-tuning on Flan-T5 for textual entailment on the GLUE MNLI dataset.

You can easily adapt this example to do your custom fine-tuning on several downstream tasks, such as question answering, named entity recognition, sentiment analysis, and text classification in general, simply by preparing your data accordingly.

Overall, this notebook showcases the potential for Flan-T5 to be used effectively and efficiently for fine-tuning.

If you'd like to use Flan-T5 XL and XXL to perform inference, check out the notebook in this directory [Flan-T5-generative-inference.ipynb](Flan-T5-generative-inference.ipynb).