# Textual Entailment on IPU using GPT-J - Fine-tuning

Copyright (c) 2023 Graphcore Ltd.

[GPT-J](https://huggingface.co/EleutherAI/gpt-j-6b) is a causal decoder-only transformer model which can be used for text generation.
Causal means that a causal mask is used in the decoder attention, so that each token can attend only to itself and to previous tokens.
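
As a minimal illustration of a causal mask (a generic sketch in plain Python, not the IPU implementation used in this notebook; `causal_mask` is a name made up for the example), position `i` may attend to position `j` only when `j <= i`:

```python
def causal_mask(seq_len):
    """Return a seq_len x seq_len matrix where entry [i][j] is True
    if token i is allowed to attend to token j (that is, j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


mask = causal_mask(4)
for row in mask:
    # 1 marks a visible position, 0 a masked (future) position.
    print(" ".join("1" if visible else "0" for visible in row))
```

The printed matrix is lower triangular: the first token sees only itself, while the last token sees the whole sequence.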

Language models are very powerful because a huge variety of tasks can be formulated as text-to-text problems and thus adapted to fit the generative setup, where the model is asked to correctly predict future tokens. This idea has been widely explored in the [T5 paper: Exploring the Limits of Transfer Learning with a Unified
Text-to-Text Transformer](https://arxiv.org/pdf/1910.10683.pdf).

In this example we apply this idea and fine-tune GPT-J as a Causal Language Model (CLM) for Text Entailment on the [GLUE MNLI dataset](https://huggingface.co/datasets/glue#mnli).

You can easily adapt this example to do your own fine-tuning on several downstream tasks, such as Question Answering, Named Entity Recognition, Sentiment Analysis and Text Classification in general. You only need to prepare the data in the right format.

Note that for these kinds of tasks you don't need GPT-3 175B sized models. GPT-J at 6B has very good language understanding and is suitable for most of these scenarios. Larger models give only a small improvement in language understanding. Mainly, they add more world knowledge and better performance at free text generation, as might be used in an AI assistant or chatbot.

To provide an indication of notebook timing and costs, it would take approximately 2 hours 47 minutes to run through this notebook in its entirety on the Bow Pod16 platform (including compile time, model loading and checkpointing) which would cost approximately $74 based on the currently published prices. This estimate uses the hyperparameters and configurations in the notebook: sequence length of 1024, global batch size of 128 and 400 steps.

Our weights are also available as a Hugging Face checkpoint at [Graphcore/gptj-mnli](https://huggingface.co/Graphcore/gptj-mnli).

## Paperspace setup

In [None]:

%pip install -r requirements.txt

This notebook saves checkpoints during training to allow you to resume the fine-tuning from intermediate states.
On Paperspace these checkpoints are stored in `/storage`, a storage space shared with your team, which allows you to reuse a checkpoint on a different notebook instance.

In [None]:
import os

number_of_ipus = int(os.getenv("NUM_AVAILABLE_IPU", 16))
if number_of_ipus != 16:
    raise ValueError(f"This example needs 16 IPUs to work. Detected {number_of_ipus}")

os.environ["POPART_CACHE_DIR"] = os.getenv(
    "POPLAR_EXECUTABLE_CACHE_DIR", "./exe_cache/"
)
checkpoint_dir = os.getenv("PERSISTENT_CHECKPOINT_DIR", "checkpoints") + "/gpt-j"

## Initial setup
### Fine-tuning configuration
First of all, we need to load a base configuration, defined in `config/finetuning.yml`.
This file has optimised configurations to run the model on IPUs.
We need to pick the one suitable for a Pod16.

This configuration uses a sequence length of 1024 tokens. GPT-J layers are split across 16 IPUs, using [Tensor Model Parallelism](https://arxiv.org/pdf/1909.08053.pdf). No data parallelism is used (this extra optimization is available when using a Pod64).
The `gptj_config_setup` function sets up the specified configuration and configures logging and Weights & Biases.
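
To give some intuition for tensor model parallelism, here is a toy sketch in plain Python. It is a generic illustration of one common form (column-wise splitting), not the implementation used by this notebook, and the helper names are made up for the example. A weight matrix is split into column blocks, each shard computes its slice of the output independently, and the slices are concatenated:

```python
def matmul(x, w):
    """Multiply a vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w, num_shards):
    """Split the columns of w into num_shards equal contiguous blocks."""
    cols_per_shard = len(w[0]) // num_shards
    return [
        [row[s * cols_per_shard : (s + 1) * cols_per_shard] for row in w]
        for s in range(num_shards)
    ]


x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

# Each "device" holds one block of columns and computes its slice of the output.
shards = split_columns(w, num_shards=2)
partial_outputs = [matmul(x, shard) for shard in shards]

# Concatenating the slices recovers the result of the unsharded product.
sharded_result = [value for part in partial_outputs for value in part]
print(sharded_result)
```

Each shard only needs to store its own block of the weights, which is what lets a model that does not fit on one device be spread over several.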

In [None]:
from utils.setup import gptj_config_setup
from config import CONFIG_DIR

> **W&B**: We support logging to Weights & Biases.
If you want to use it, you will first need to manually log in (see the quickstart guide [here](https://docs.wandb.ai/quickstart)).


In [None]:
# Set this to True if you want to use W&B. Be sure to be logged in.
wandb_setup = False

In [None]:
# Choose a configuration
config, *_ = gptj_config_setup(
    CONFIG_DIR / "finetuning.yml", "release", "gptj_6B_1024_pod16", wandb_setup
)

In [None]:
print(config.dumps_yaml())

### Validation configuration
Configurations for inference-only are available in `config/inference.yml`:
- `gpt-j` is the base configuration, and is guaranteed to fit into memory with 1024 sequence length.
- `gpt-j-mnli` is optimised for the MNLI dataset. This selects a bigger batch size but requires the sequence length to be reduced.

In this example we will start from the base one, and manually reduce the sequence length and increase the batch size later on.
You can do the same on your custom dataset to find the best configuration.

In [None]:
from utils.setup import gptj_config_setup

> **W&B** We support logging to Weights & Biases.
If you want to use it, you will first need to manually log in (see the quickstart guide [here](https://docs.wandb.ai/quickstart)).

In [None]:
wandb_setup_on_eval = False

In [None]:
eval_config, args, _ = gptj_config_setup(
    CONFIG_DIR / "inference.yml",
    "release",
    "gpt-j",
    hf_model_setup=False,
    wandb_setup=wandb_setup_on_eval,
)

In [None]:
print(eval_config.dumps_yaml())

## Dataset
The MNLI dataset consists of pairs of sentences, a *premise* and a *hypothesis*.
The task is to predict the relation between the premise and the hypothesis, which can be:
- `entailment`: the hypothesis follows from the premise,
- `contradiction`: the hypothesis contradicts the premise,
- `neutral`: the hypothesis neither follows from nor contradicts the premise.

Data splits for the MNLI dataset are the following:

|train |validation_matched|validation_mismatched|
|-----:|-----------------:|--------------------:|
|392702|              9815|                 9832|


You can explore it [on Hugging Face](https://huggingface.co/datasets/glue/viewer/mnli/train).
![MNLI dataset](imgs/mnli_dataset.png)


### Training pre-processing
The columns we are interested in are `hypothesis`, `premise` and `label`.

The first step consists of forming input prompts with the format
```
mnli hypothesis: {hypothesis} premise: {premise} target: {label}<|endoftext|>
```
For example:
```
mnli hypothesis: Product and geography are what make cream skimming work.  premise: Conceptually cream skimming has two basic dimensions - product and geography. target: neutral<|endoftext|>
```

Then, prompt sentences are tokenized and packed together to form 1024 token sequences, following the [Hugging Face packing algorithm](https://github.com/huggingface/transformers/blob/v4.20.1/examples/pytorch/language-modeling/run_clm.py). No padding is used.

Finally, the prompt is split into `input_ids` and `labels`. The input consists of the full sentence except for the last token (`prompt[:-1]`), and the label is the sentence shifted by one (`prompt[1:]`).
Given the training format, no extra care is needed to account for different sequences: the model does not need to know which sentence a token belongs to.
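
The packing and shifting steps can be sketched on toy data as follows. This is a simplified illustration, not the code in `hf_data_utils.group_texts`: the function name is hypothetical, and it assumes that each packed block holds `sequence_length + 1` tokens so that inputs and labels both end up with `sequence_length` tokens.

```python
from itertools import chain


def pack_and_split(tokenized_prompts, sequence_length):
    """Concatenate tokenized prompts, cut them into fixed-size blocks and
    derive (input_ids, labels) pairs shifted by one token."""
    # Assumption: each block holds sequence_length + 1 tokens so that both
    # the inputs and the shifted labels have exactly sequence_length tokens.
    block_size = sequence_length + 1
    stream = list(chain.from_iterable(tokenized_prompts))
    # Drop the incomplete tail instead of padding it.
    usable = (len(stream) // block_size) * block_size
    examples = []
    for start in range(0, usable, block_size):
        block = stream[start : start + block_size]
        examples.append({"input_ids": block[:-1], "labels": block[1:]})
    return examples


# Toy token ids; 0 stands in for the <|endoftext|> token.
prompts = [[11, 12, 13, 0], [21, 22, 0], [31, 32, 33, 34, 0]]
for example in pack_and_split(prompts, sequence_length=4):
    print(example)
```

Note how a packed sequence can start or end in the middle of a prompt. The `<|endoftext|>` token is the only boundary marker between prompts.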

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer
import data.hf_data_utils as hf_data_utils
import data.mnli_data as mnli_data

The next two cells are the ones you want to change when doing a custom fine-tuning.

We first load the MNLI dataset, and then create a custom pre-processing function to build prompts suitable for a
text-to-text setup.
For a custom fine-tuning, you will need to choose a format for your prompts and change the `form_training_prompts` function.

In [None]:
# Load HF dataset
dataset = load_dataset("glue", "mnli", split="train")

In [None]:
print(dataset[0])

In [None]:
# Form prompts in the format mnli hypothesis: {hypothesis} premise: {premise} target: {class_label}<|endoftext|>
def form_training_prompts(example):
    hypothesis = example["hypothesis"]
    premise = example["premise"]
    class_label = ["entailment", "neutral", "contradiction"][example["label"]]

    example[
        "text"
    ] = f"mnli hypothesis: {hypothesis} premise: {premise} target: {class_label}<|endoftext|>"
    return example

In [None]:
dataset = dataset.map(
    form_training_prompts,
    remove_columns=["hypothesis", "premise", "label", "idx"],
    load_from_cache_file=False,
    desc="Generating text prompt",
)

In [None]:
# shows first textual prompt
print(dataset[0])

After that, we tokenize the prompts. You won't need to change this step for a custom fine-tuning.

In [None]:
# Create tokenizer
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6b")
tokenizer.add_special_tokens({"pad_token": "<|extratoken_1|>"})  # index 50257

In [None]:
# Tokenize prompts
dataset = dataset.map(
    mnli_data.tokenizes_text(tokenizer),
    batched=True,
    batch_size=1000,
    num_proc=1,
    remove_columns=dataset.column_names,
    load_from_cache_file=False,
    desc="Tokenizing text",
)

In [None]:
# shows first tokenized prompt
print(dataset[0])

Finally, we use the Hugging Face packing algorithm (`group_texts`) to create packed sentences of the specified sequence length,
and separate inputs and labels.
Again, this is a step you are not going to change for a custom fine-tuning.

In [None]:
# Pack tokenized prompts into sequences and split sequences in input_ids and labels
dataset = dataset.map(
    hf_data_utils.group_texts(config),
    batched=True,
    batch_size=1000,
    num_proc=1,
    load_from_cache_file=False,
    desc="Packing sequences",
)

In [None]:
print(len(dataset))

In [None]:
# Show a portion of first sentence. You can see that the label is the input shifted by one.
print("first 10 tokens of first sentence")
print("input_ids")
print(dataset["input_ids"][0][:10])
print("labels - shifted by one")
print(dataset["labels"][0][:10])

> **Note** If you want to adapt this code for another dataset, be sure to call the inputs and labels in the same way: `input_ids` and `labels`.

### Validation pre-processing
For validation, we use the [mnli validation_mismatched](https://huggingface.co/datasets/glue/viewer/mnli_mismatched/validation) split.

"Mismatched" means that the validation examples are not derived from the same sources as those in the training set and therefore don't closely resemble any of the examples seen at training time.

Similar to what we did for training, the first pre-processing step is creating prompts, this time without including the answer.

In [None]:
def form_validation_prompts(example):
    hypothesis = example["hypothesis"]
    premise = example["premise"]

    # The target is left out of the prompt: the model has to generate it.
    example["text"] = f"mnli hypothesis: {hypothesis} premise: {premise} target:"
    return example

In [None]:
eval_dataset = load_dataset("glue", "mnli", split="validation_mismatched")
eval_dataset = eval_dataset.map(
    form_validation_prompts,
    remove_columns=["hypothesis", "premise", "idx"],
    load_from_cache_file=False,
    desc="Generating text prompt",
)

In [None]:
print(eval_dataset[0])

Finally, input prompts are tokenized.

In [None]:
def prepare_validation_features(dataset, tokenizer):
    tokenized_examples = []
    for example in dataset["text"]:
        tokenized_example = tokenizer.encode(example, return_tensors="pt").squeeze()
        tokenized_examples.append(tokenized_example)
    return {"input_ids": tokenized_examples, "label": dataset["label"]}


eval_dataset = eval_dataset.map(
    prepare_validation_features,
    batched=True,
    remove_columns=eval_dataset.column_names,
    load_from_cache_file=False,
    fn_kwargs={"tokenizer": tokenizer},
)

In [None]:
print(eval_dataset[0])

> **Note** If you want to adapt this code for another dataset, be sure to call the inputs and labels in the same way: `input_ids` and `label`.

## Customise configuration and create a Trainer

Right at the beginning of the notebook we loaded the base configurations for training and inference. We are now going to show how to customise some of the parameters for your needs.

### Customise training configuration
In the cells below we list the parameters you are most likely to play around with when doing a custom fine-tuning.

These are the training steps, dropout probability and optimizer/learning rate parameters.

Moreover, it is important that you specify **checkpoint** parameters, namely a folder to save the fine-tuned weights and a periodicity for checkpointing. Be aware that saving checkpoints takes time, so you don't want to save them too often.
To disable intermediate checkpoints set `config.checkpoint.steps = 0`. The final checkpoint is always saved provided the `config.checkpoint.save` directory is given. Set it to `None` if you don't want to save weights, but it's unlikely you want to disable the last checkpoint.

If you don't intend to resume training later on, you can reduce the time and memory required to save checkpoints by setting `config.checkpoint.optim_state = False`. In this case, only the model weights are saved and the optimiser state is discarded.

Checkpoints will be saved in the directory given by the environment variable `PERSISTENT_CHECKPOINT_DIR`, which we saved in `checkpoint_dir` at the beginning. You can have a look at the path by printing it out in the cell below.

In [None]:
print(checkpoint_dir)

In [None]:
# Customise training arguments
config.model.dropout_prob = 0.0
config.training.steps = 400

In [None]:
# Customise optimiser and learning rate schedule
config.training.optimizer.learning_rate.maximum = 5e-06
config.training.optimizer.learning_rate.warmup_proportion = 0.005995
config.training.optimizer.learning_rate.beta1 = 0.9
config.training.optimizer.learning_rate.beta2 = 0.999
config.training.optimizer.learning_rate.weight_decay = 0.0
config.training.optimizer.learning_rate.gradient_clipping = 1.0

In [None]:
# Customise checkpoints
config.checkpoint.save = (
    checkpoint_dir  # where the model is saved. None means don't save any checkpoint
)
config.checkpoint.steps = (
    100  # how often you save the model. 0 means only the final checkpoint is saved
)
config.checkpoint.to_keep = 4  # maximum number of checkpoints kept on disk
config.checkpoint.optim_state = (
    False  # Whether to include the optimiser state in checkpoints
)

In [None]:
# Resume training
config.checkpoint.load = (
    None  # you can specify a directory containing a previous checkpoint
)
# os.path.join(checkpoint_dir, ...)

### Customise validation configuration
For validation there are fewer relevant parameters.

You can control the maximum number of tokens generated by the model using the `output_length` parameter. If you know that the targets are only a few tokens long, it is convenient to set it to a small number. Generation stops if the output token is `<|endoftext|>` or after `output_length` tokens are generated.

In our case, we set the `output_length` to 5 to accommodate all class labels and the `<|endoftext|>` token.

You can also reduce the model `sequence_length` to account for the maximum length of sentences encountered in the validation dataset, plus the `output_length` specified.
In the case of the MNLI example, `sequence_length = 229`, which allows us to increase the batch size to `16`.

If you adapt this example to another dataset, you can compare your dataset `max_len` with ours and decide if you can safely use the increased batch size. Otherwise, stick with the default one or create your own.

You can specify a checkpoint to be used for validation. If `None` is provided, the latest weights will be used.

In [None]:
from functools import reduce

In [None]:
# Specify maximum number of tokens that can be generated.
eval_config.inference.output_length = 5

In [None]:
# Reducing sequence length
max_len = reduce(lambda l, e: max(l, len(e["input_ids"])), eval_dataset, 0)

In [None]:
print(f"Maximum length in mnli-mismatched dataset: {max_len}")

In [None]:
eval_config.model.sequence_length = max_len + eval_config.inference.output_length

In [None]:
print(f"Setting sequence length to {eval_config.model.sequence_length}")

In [None]:
# Increase batch size
eval_config.execution.micro_batch_size = 16

In [None]:
# You can specify a directory containing a previous checkpoint
# None means latest weights are used
eval_config.checkpoint.load = None  # os.path.join(checkpoint_dir, ...)

In [None]:
print(eval_config.dumps_yaml())

Finally, we need to define a validation metric. We will use `accuracy`.

In [None]:
import evaluate

In [None]:
accuracy_metric = evaluate.load("accuracy")

We also need to define a function to convert our generated labels to integer indices.

In [None]:
def postprocess_mnli_predictions(generated_sentences):
    labels_to_ids = {"entailment": 0, "neutral": 1, "contradiction": 2, "unknown": -1}
    predictions = []
    for s in generated_sentences:
        answer = mnli_data.extract_class_label(s)
        predictions.append(labels_to_ids[answer])
    return predictions
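
The implementation of `mnli_data.extract_class_label` is not shown in this notebook. As a rough idea of what such a helper needs to do, here is a hypothetical stand-in (the name `extract_class_label_sketch` and its exact behaviour are assumptions): it keeps the text generated before `<|endoftext|>` and falls back to `"unknown"` when the answer is not one of the three class labels.

```python
def extract_class_label_sketch(generated_text):
    """Hypothetical stand-in for mnli_data.extract_class_label."""
    # Keep only the text generated before the end-of-text marker, if any.
    answer = generated_text.split("<|endoftext|>")[0].strip()
    if answer in ("entailment", "neutral", "contradiction"):
        return answer
    # Anything else counts as a badly formed answer.
    return "unknown"


print(extract_class_label_sketch(" contradiction<|endoftext|>"))
print(extract_class_label_sketch(" something else"))
```

Mapping badly formed answers to `"unknown"` means they are converted to `-1` by the `labels_to_ids` dictionary above, so they can never match a true label.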

### Create a Trainer
Once a config is specified, we are ready to create the training session with the help of the 
`GPTJTrainer` class.
You need to provide the following arguments:

- *config*: the training configuration.
- *pretrained*: the Hugging Face pre-trained model, used to initialise the weights.
- *dataset*: the training dataset.

Moreover, you can specify:

- *eval_dataset*: the validation dataset.
- *eval_config*: the inference configuration, to be used in validation.
- *tokenizer*: the tokenizer, needed by validation.
- *metric*: the metric for validation. A Hugging Face metric from the `evaluate` module. We use the `accuracy` metric.
- *process_answers_func*: a function to convert the generated answers to the format required by the metric and the labels. For example, we need to convert the textual categories `[entailment, contradiction, neutral]` to indices.

These extra arguments can also be provided later on when calling `trainer.evaluate(...)`.

If you want to run fine-tuning, the pre-trained model should be `EleutherAI/gpt-j-6b`. If you want to just run validation on the existing Hugging Face checkpoint, you should change it to the fine-tuned model `Graphcore/gptj-mnli`.

In [None]:
from utils.trainer import GPTJTrainer
from transformers.models.gptj.modeling_gptj import GPTJForCausalLM

In [None]:
# change to Graphcore/gptj-mnli for validation only
pretrained = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6b")

In [None]:
trainer = GPTJTrainer(
    config,
    pretrained,
    dataset,
    eval_dataset,
    eval_config,
    tokenizer,
    accuracy_metric,
    postprocess_mnli_predictions,
)

## Run fine-tuning
We can now run training for the number of steps you set in the config.
Checkpoints will be saved in the folder you specified in `config.checkpoint.save`, with the periodicity set in `config.checkpoint.steps`.

The first time you run `trainer.train()` it takes around 10 minutes to compile the training model.

Training takes around 8 minutes to start because the pre-trained weights need to be downloaded to the IPUs.
After that, each step takes around 20-22 seconds.
This time does not include checkpointing time.
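
Putting the figures above together gives a rough estimate of the time spent in `trainer.train()` for the 400 steps configured in this notebook. This is only back-of-the-envelope arithmetic: the function name is made up for the example, and checkpointing and validation are not included.

```python
def estimate_training_minutes(steps, seconds_per_step, compile_minutes, loading_minutes):
    """Rough wall-clock estimate in minutes, excluding checkpointing."""
    return compile_minutes + loading_minutes + steps * seconds_per_step / 60


for label, step_time in (("lower", 20), ("upper", 22)):
    total = estimate_training_minutes(
        steps=400, seconds_per_step=step_time, compile_minutes=10, loading_minutes=8
    )
    hours, minutes = divmod(round(total), 60)
    print(f"{label} estimate: {hours} h {minutes} min")
# lower estimate: 2 h 31 min
# upper estimate: 2 h 45 min
```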

In [None]:
trainer.train()

## Run validation
Finally, we validate our model on the [validation_mismatched](https://huggingface.co/datasets/glue/viewer/mnli_mismatched/validation) split of the MNLI dataset.

Generative inference is performed token by token using a greedy heuristic: at each step, the token with the highest logit is chosen as the next token.

We run token-by-token inference in batches to generate labels for multiple examples at once.
We then compute the accuracy by comparing the model answers with the true labels.
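
The greedy generation loop can be sketched as follows. This is a toy, single-example illustration in plain Python, not the batched IPU inference code: `greedy_generate` and `toy_logits` are hypothetical names, and the "model" is a stand-in function returning one logit per vocabulary entry.

```python
def greedy_generate(next_token_logits, prompt_ids, output_length, eos_id):
    """Generate up to output_length tokens, always taking the arg-max logit."""
    ids = list(prompt_ids)
    generated = []
    for _ in range(output_length):
        logits = next_token_logits(ids)
        # Greedy choice: index of the largest logit.
        next_id = max(range(len(logits)), key=logits.__getitem__)
        generated.append(next_id)
        ids.append(next_id)
        # Stop early once the end-of-text token has been produced.
        if next_id == eos_id:
            break
    return generated


# Toy "model" over a 4-token vocabulary: always favours (last token + 1) % 4.
def toy_logits(ids):
    favoured = (ids[-1] + 1) % 4
    return [1.0 if token == favoured else 0.0 for token in range(4)]


print(greedy_generate(toy_logits, prompt_ids=[0], output_length=5, eos_id=3))
# [1, 2, 3]
```

Generation ends either when the end-of-text token is produced, as in this example, or when `output_length` tokens have been generated.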

The resulting model matches SOTA performance with 82.5% accuracy.

```
Total number of examples                 9832
Number with badly formed result          0
Number with incorrect result             1725
Number with correct result               8107 [82.5%]
```
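
The counts in this report can be derived from the integer predictions and the true labels. Here is a small sketch on toy data, where `-1` marks a badly formed answer. The function name `summarise_predictions` is hypothetical and this is not the code used by `trainer.evaluate()`.

```python
def summarise_predictions(predictions, labels, unknown_id=-1):
    """Count badly formed, incorrect and correct predictions."""
    total = len(labels)
    badly_formed = sum(1 for p in predictions if p == unknown_id)
    correct = sum(1 for p, l in zip(predictions, labels) if p == l)
    # Assumes true labels are never equal to unknown_id.
    incorrect = total - correct - badly_formed
    return {
        "total": total,
        "badly_formed": badly_formed,
        "incorrect": incorrect,
        "correct": correct,
        "accuracy": correct / total,
    }


print(summarise_predictions([0, 1, 2, -1, 1], [0, 1, 1, 2, 1]))

# Sanity check of the figures reported above.
assert 0 + 1725 + 8107 == 9832
print(f"{8107 / 9832:.1%}")  # 82.5%
```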

The first time you run `trainer.evaluate()` it takes around 4 minutes to compile the inference model.

Running validation on the whole dataset takes around 7 minutes.

Note that:

- If you have compiled the training model and you don't specify an `eval_config.checkpoint.load` folder, the latest weights will be used.

- If you instead specify a directory in `eval_config.checkpoint.load`, you will be evaluating that specific set of weights.

- If none of the above holds or if you specify `trainer.evaluate(use_pretrained=True)`, weights from `pretrained` will be used.

In [None]:
# If you want to change the pretrained model to run validation on the HF checkpoint, uncomment and run below
# pretrained = GPTJForCausalLM.from_pretrained("Graphcore/gptj-mnli")
# trainer.pretrained = pretrained
# trainer.evaluate(use_pretrained=True)

In [None]:
trainer.evaluate()

## Save Hugging Face checkpoint
You can save the trained weights so that they can be uploaded to Hugging Face and used with Hugging Face's PyTorch model.
You can specify a checkpoint path if you want to convert a specific checkpoint instead of the latest weights.

In [None]:
hf_checkpoint_path = os.path.join(checkpoint_dir, "hf_checkpoint")
ckpt_path = None  # os.path.join(checkpoint_dir, ...)

In [None]:
finetuned = trainer.save_hf_checkpoint(hf_checkpoint_path, ckpt_path)

## Run the model with Hugging Face pipeline
The same model can later be used with the standard Hugging Face pipeline on any hardware.

In [None]:
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6b")
tokenizer.add_special_tokens({"pad_token": "<|extratoken_1|>"})
hf_model = AutoModelForCausalLM.from_pretrained(
    "Graphcore/gptj-mnli", pad_token_id=tokenizer.pad_token_id
)
generator = pipeline("text-generation", model=hf_model, tokenizer=tokenizer)

In [None]:
prompt = (
    "mnli hypothesis: Your contributions were of no help with our students' education. "
    "premise: Your contribution helped make it possible for us to provide our students with a quality education. target:"
)

out = generator(prompt, return_full_text=False, max_new_tokens=5, top_k=1)
print(out)
# [{'generated_text': ' contradiction'}]