# GPT-J Finetuning

Copyright (c) 2023 Graphcore Ltd.

[GPT-J](https://huggingface.co/EleutherAI/gpt-j-6B) is a causal decoder-only transformer model which can be used for text-generation.
Causal means that a causal mask is used in the decoder attention, so that each token has visibility on previous tokens only.
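
As a small illustration of the causal mask described above, the sketch below builds the mask for a short sequence in plain Python. The helper name `causal_mask` is hypothetical and is not part of this example's code base; each row `i` marks the positions that token `i` may attend to (itself and earlier tokens).

```python
def causal_mask(seq_len):
    """Return a seq_len x seq_len mask: mask[i][j] is True if token i may attend to token j."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


mask = causal_mask(4)
for row in mask:
    # "x" marks a visible position, "." a masked (future) position
    print(" ".join("x" if allowed else "." for allowed in row))
```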

Language models are very powerful because a huge variety of tasks can be formulated as a text-to-text problem and thus adapted to fit the generative setup, where the model is asked to correctly predict future tokens. This idea has been widely explored in the T5 paper, [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/pdf/1910.10683.pdf).

In this example we apply this idea and finetune GPT-J as a Causal Language Model (CLM) for Text Entailment on the [GLUE MNLI dataset](https://huggingface.co/datasets/glue#mnli).

You can easily adapt this example to do your custom finetuning on several downstream tasks, such as Question Answering, Named Entity Recognition, Sentiment Analysis, Text Classification: you just need to prepare data in the right way.

Our weights are also available as an HF checkpoint at [Graphcore/gptj-mnli](https://huggingface.co/Graphcore/gptj-mnli).

## Paperspace Setup

In [None]:
%%capture
%pip install -r requirements.txt

In [None]:
import os

number_of_ipus = int(os.getenv("NUM_AVAILABLE_IPU", 16))
if number_of_ipus != 16:
    raise ValueError(f"This example needs 16 IPUs to work. Detected {number_of_ipus}")

os.environ["POPART_CACHE_DIR"] = os.getenv("POPLAR_EXECUTABLE_CACHE_DIR", "cache")
checkpoint_dir = os.getenv("CHECKPOINT_DIR", "checkpoints")

## Finetuning

### Configuration
First of all, we need to load the default configuration, defined in `config/finetuning_mnli.yml`.
This file contains configurations optimised to run the model on IPUs.
We need to pick the one suitable for a POD16.

This configuration uses a sequence length of 1024 tokens. GPT-J layers are split across 16 IPUs, using [Tensor Model Parallelism](https://arxiv.org/pdf/1909.08053.pdf). No data parallelism is used (this extra optimisation is available when using a POD64).

The `gptj_fine_tuning_setup` function sets up the specified configuration, configures logging and Weights & Biases, and loads the Hugging Face pretrained model [EleutherAI/gpt-j-6B](https://huggingface.co/EleutherAI/gpt-j-6B).

In [None]:
from utils.setup import gptj_fine_tuning_setup
from config import CONFIG_DIR

> **W&B**: We support logging to Weights & Biases.
If you want to use it, you will first need to manually log in (see the quickstart guide [here](https://docs.wandb.ai/quickstart)).


In [None]:
# Set this to True if you want to use W&B. Be sure to be logged in.
wandb_setup = False

In [None]:
# Choose a configuration
config, args, pretrained = gptj_fine_tuning_setup(
    CONFIG_DIR / "finetuning_mnli.yml", "release", "gptj_6B_1024_pod16", wandb_setup
)

In [None]:
print(config.dumps_yaml())

### Dataset
The MNLI dataset consists of pairs of sentences, a *premise* and a *hypothesis*.
The task is to predict the relation between the premise and the hypothesis, which can be:
- `entailment`: hypothesis follows from the premise,
- `contradiction`: hypothesis contradicts the premise,
- `neutral`: hypothesis neither follows from nor contradicts the premise.

You can explore the [MNLI dataset on Hugging Face](https://huggingface.co/datasets/glue/viewer/mnli/train).
![MNLI dataset](imgs/mnli_dataset.png)

#### Preprocessing
The columns we are interested in are `hypothesis`, `premise` and `label`.

The first step is to form input prompts in the following format:
```
mnli hypothesis: {hypothesis} premise: {premise} target: {class_label}<|endoftext|>
```
For example:
```
mnli hypothesis: Your contributions were of no help with our students' education. premise: Your contribution helped make it possible for us to provide our students with a quality education. target: contradiction<|endoftext|>
```

Then, prompt sentences are tokenized and packed together to form 1024-token sequences, following the [HF packing algorithm](https://github.com/huggingface/transformers/blob/v4.20.1/examples/pytorch/language-modeling/run_clm.py). No padding is used.

Finally, the prompt is split into `input_ids` and `labels`. The input consists of the full sentence except for the last token (`prompt[:-1]`), and the label is the sentence shifted by one (`prompt[1:]`).
Given the training format, no extra care is needed to account for different sequences: the model does not need to know which sentence a token belongs to.
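
The packing and shifting steps described above can be sketched in plain Python. This is only an illustration, not the code used by this example: the function name `pack_and_shift` is hypothetical, and it assumes that chunks of `sequence_length + 1` tokens are cut from the concatenated prompts and that the incomplete tail is dropped, as the HF `run_clm.py` packing does with its remainder.

```python
def pack_and_shift(tokenized_prompts, sequence_length):
    """Concatenate prompts, cut fixed-size chunks, and split each chunk into inputs and labels."""
    flat = [tok for prompt in tokenized_prompts for tok in prompt]
    # One extra token per chunk, so that inputs and labels both have sequence_length tokens.
    chunk = sequence_length + 1
    # Assumption: the incomplete tail is dropped, so no padding is needed.
    usable = (len(flat) // chunk) * chunk
    packed = []
    for start in range(0, usable, chunk):
        seq = flat[start:start + chunk]
        packed.append({"input_ids": seq[:-1], "labels": seq[1:]})
    return packed


# Three toy "tokenized prompts" packed into sequences of 4 tokens.
prompts = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]
for row in pack_and_shift(prompts, sequence_length=4):
    print(row)
```

Note that a packed sequence can span several prompts: the boundary between them is marked only by the `<|endoftext|>` token that ends each prompt.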

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer
import data.hf_data_utils as hf_data_utils
import data.mnli_data as mnli_data

The next two cells are the ones you want to change in a custom finetuning.

We first load the MNLI dataset, and then create a custom preprocessing function to build prompts suitable for a
text-to-text setup.
In a custom finetuning, you will need to choose a format for your prompts and change the `form_text` function.
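
As an illustration of adapting the prompt format to another task, here is a minimal sketch for sentiment analysis. Everything in it is hypothetical: the function name `form_sentiment_text`, the `sentence` and `label` columns, the `sentiment sentence:` prefix and the label names are assumptions made for the example, and you would replace them with the columns and format of your own dataset.

```python
def form_sentiment_text(example):
    """Build a text-to-text prompt from a (hypothetical) sentiment example."""
    # Assumed label encoding: 0 -> negative, 1 -> positive.
    class_label = ["negative", "positive"][example["label"]]
    example["text"] = f"sentiment sentence: {example['sentence']} target: {class_label}<|endoftext|>"
    return example


sample = {"sentence": "A charming film.", "label": 1}
print(form_sentiment_text(sample)["text"])
```

Such a function could be passed to `dataset.map(...)` in the same way as `form_text`, with `remove_columns` adjusted to the columns of the new dataset.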

In [None]:
# Load HF dataset
dataset = load_dataset("glue", "mnli", split="train")

In [None]:
print(dataset[0])

In [None]:
# Form prompts in the format mnli hypothesis: {hypothesis} premise: {premise} target: {class_label}<|endoftext|>
def form_text(example):
    hypothesis = example["hypothesis"]
    premise = example["premise"]
    class_label = ["entailment", "neutral", "contradiction"][example["label"]]

    example["text"] = f"mnli hypothesis: {hypothesis} premise: {premise} target: {class_label}<|endoftext|>"
    return example


In [None]:
dataset = dataset.map(
    form_text,
    remove_columns=["hypothesis", "premise", "label", "idx"],
    load_from_cache_file=False,
    desc="Generating text prompt",
)

In [None]:
# shows first textual prompt
print(dataset[0])

After that, we tokenize the prompts. You won't need to change this step in a custom finetuning.

In [None]:
# Tokenize prompts
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
tokenizer.add_special_tokens({"pad_token": "<|extratoken_1|>"})  # index 50257
dataset = dataset.map(
    mnli_data.tokenizes_text(tokenizer),
    batched=True,
    batch_size=1000,
    num_proc=1,
    remove_columns=dataset.column_names,
    load_from_cache_file=False,
    desc="Tokenizing text",
)

In [None]:
# shows first tokenized prompt
print(dataset[0])

Finally, we use the HF packing algorithm (`group_texts`) to create packed sentences of the specified sequence length,
and separate inputs and labels.
Again, this is a step you are not going to change in a custom finetuning.

In [None]:
# Pack tokenized prompts into sequences and split sequences in input_ids and labels
dataset = dataset.map(
    hf_data_utils.group_texts(config),
    batched=True,
    batch_size=1000,
    num_proc=1,
    load_from_cache_file=False,
    desc="Packing sequences",
)

In [None]:
print(len(dataset))

In [None]:
# Show a portion of first sentence. You can see that the label is the input shifted by one.
print("first 10 tokens of first sentence")
print("input_ids")
print(dataset["input_ids"][0][:10])
print("labels - shifted by one")
print(dataset["labels"][0][:10])

### Customise configuration and create a Trainer

In the cells below we list the parameters you are most likely to play around with when doing a custom finetuning.

These are the training steps, dropout probability and optimizer/learning rate parameters.

Moreover, it is important that you specify **checkpoints** parameters, namely a folder to save the finetuned weights and a periodicity for checkpointing. Be aware that saving checkpoints takes time, so you don't want to save them too often.
To disable intermediate checkpoints, set `config.checkpoint.steps = 0`. The final checkpoint is always saved, provided the save directory is given. Set `config.checkpoint.save` to `None` if you don't want to save weights.

If you do not plan to resume training later on, but you still want to save the model weights at different training steps, you can reduce the time and memory required to save checkpoints by excluding the optimiser state from them (`config.checkpoint.optim_state = False`).

Checkpoints will be saved in the directory given by the environment variable `CHECKPOINT_DIR`, which we saved in `checkpoint_dir` at the beginning.

In [None]:
print(checkpoint_dir)

In [None]:
# Customise training arguments
config.model.dropout_prob = 0.0 
config.training.steps = 400

In [None]:
# Customise optimiser and learning rate schedule
config.training.optimizer.learning_rate.maximum = 5e-06
config.training.optimizer.learning_rate.warmup_proportion = 0.005995
config.training.optimizer.learning_rate.beta1 = 0.9
config.training.optimizer.learning_rate.beta2 = 0.999
config.training.optimizer.learning_rate.weight_decay = 0.0
config.training.optimizer.learning_rate.gradient_clipping = 1.0

In [None]:
# Customise checkpoints
config.checkpoint.save = checkpoint_dir # where the model is saved. None means don't save any checkpoint.
config.checkpoint.steps = 100  # how often you save the model. 0 means only the final checkpoint is saved.
config.checkpoint.to_keep = 4  # maximum number of checkpoints kept on disk
config.checkpoint.optim_state = False # Whether to include the optimiser state in checkpoints.

In [None]:
# Resume training
config.checkpoint.load = None # you can specify a directory containing a previous checkpoint,
                              # os.path.join(checkpoint_dir, ...)

In [None]:
print(config.dumps_yaml())

Once a config is specified, we are ready to create the training session with the help of the 
`MNLIFinetuningTrainer` class.
You need to provide the following arguments:

- *config*: the training configuration
- *pretrained*: the Hugging Face pre-trained model, used to initialise the weights
- *dataset*: the training dataset.

Moreover, you can specify:

- *eval_dataset*: the validation dataset
- *eval_config*: the inference configuration, to be used in validation
- *tokenizer*: the tokenizer, needed by validation

These extra arguments can also be provided later on when calling `trainer.evaluate(...)`.

The first time you run this notebook, it will take around 10 minutes to compile the training model.

In [None]:
from utils.trainer import MNLIFinetuningTrainer

In [None]:
trainer = MNLIFinetuningTrainer(config, pretrained, dataset)

### Run Finetuning
We are done! We can now run training for the number of steps you set in the config.
Checkpoints will be saved in the folder you specified in `config.checkpoint.save`, with the periodicity set by `config.checkpoint.steps`.

In [None]:
trainer.train()

## Validation
We can now validate our model on the [mnli-mismatched](https://huggingface.co/datasets/glue/viewer/mnli_mismatched/test) split of the MNLI dataset.

Generative inference is performed token-by-token using a greedy heuristic: at each step, the next token is the one with the highest logit.
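
The greedy choice amounts to an argmax over the logits of the last position. A minimal sketch in plain Python, with a hypothetical helper name and made-up logits over a four-token vocabulary:

```python
def greedy_next_token(logits):
    """Return the index (token id) of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Toy logits for a vocabulary of four tokens: token 1 has the highest score.
toy_logits = [0.1, 2.5, -1.0, 0.7]
print(greedy_next_token(toy_logits))
```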

### Config
A default configuration for inference-only is available in `config/inference.yml`.

In [None]:
from utils.setup import gptj_config_setup

> **W&B**: We support logging to Weights & Biases.
If you want to use it, you will first need to manually log in (see the quickstart guide [here](https://docs.wandb.ai/quickstart)).

In [None]:
wandb_setup = False

In [None]:
eval_config, args, _ = gptj_config_setup(
    CONFIG_DIR / "inference.yml", "release", "gpt-j-mnli", hf_model_setup=False, wandb_setup=wandb_setup
)

The only interesting parameter for inference is `output_length`. This is the maximum number of tokens you want the model to generate during validation. If you know that the targets are only a few tokens long, it is convenient to set it to a small number.
In our case, we set `output_length` to 5 to accommodate all class labels and the `<|endoftext|>` token.

In [None]:
eval_config.inference.output_length = 5

### Dataset
First of all we need to prepare the prompts, similarly to what we did for training.

In [None]:
eval_dataset = load_dataset("glue", "mnli", split="validation_mismatched")
eval_dataset = eval_dataset.map(form_text,
                                remove_columns=["hypothesis", "premise", "label", "idx"],
                                load_from_cache_file=False,
                               )

In [None]:
print(eval_dataset[0])

Now we want to separate the input prompts, to be fed to the model, from the labels, which we need later to compute the accuracy.

In [None]:
eval_dataset = eval_dataset.map(mnli_data.split_text, load_from_cache_file=False)

In [None]:
print(eval_dataset[0])

Finally, input prompts are tokenized.

In [None]:
def prepare_validation_features(dataset, tokenizer):
    tokenized_examples = []
    for example in dataset["prompt_text"]:
        tokenized_example = tokenizer.encode(example, return_tensors="pt").squeeze()
        tokenized_examples.append(tokenized_example)
    return {"input_ids": tokenized_examples, "class_label": dataset["class_label"]}

eval_dataset = eval_dataset.map(
    prepare_validation_features,
    batched=True,
    remove_columns=eval_dataset.column_names,
    load_from_cache_file=False,
    fn_kwargs={"tokenizer": tokenizer},
)

In [None]:
print(eval_dataset[0])

### Run validation
Now that we have preprocessed the dataset, we can compute the maximum length of sequences, `max_len`, and use this value to define the model `sequence_len`.

Each sequence is right-padded to `max_len + output_len`. We use right padding so that padded tokens are never attended to, thanks to the causal mask.

GPTJTokenizer has no native padding token. However, we can safely use the first extra token, `<|extratoken_1|>`.

Padded sequences are fed to the model and generative inference is performed token-by-token: each time a new token is generated, it replaces a padding token, and the new sequence is fed back to the model.
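
The loop described above can be sketched as follows. This is a simplified illustration and not the implementation behind `trainer.evaluate`: the function `generate` and the toy `next_token_fn` are hypothetical stand-ins for the real model, and the toy model is only shown the tokens before the current position, which mimics what the causal mask guarantees when the full padded sequence is fed.

```python
PAD_ID = 50257  # id of <|extratoken_1|>, as noted in the tokenization cell


def generate(prompt_ids, max_len, output_len, next_token_fn):
    """Right-pad the prompt, then overwrite one padding token per generation step."""
    seq = list(prompt_ids) + [PAD_ID] * (max_len + output_len - len(prompt_ids))
    pos = len(prompt_ids)  # index of the first padding token
    generated = []
    for _ in range(output_len):
        token = next_token_fn(seq[:pos])  # stand-in for a model forward pass
        seq[pos] = token  # the new token replaces a padding token
        generated.append(token)
        pos += 1
    return seq, generated


# Toy "model": predicts the last visible token id plus one.
seq, generated = generate([5, 6, 7], max_len=4, output_len=2, next_token_fn=lambda visible: visible[-1] + 1)
print(seq)
print(generated)
```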

To increase efficiency, we perform inference on micro batches.

Finally, we retrieve literal labels by detokenizing the predictions, and we compute the accuracy by comparing the result with the expected one.
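
The accuracy computation can be sketched as below. The helpers `extract_label` and `accuracy` are hypothetical, and the sketch assumes that the predicted label is the detokenized text that precedes the first `<|endoftext|>`; the actual post-processing in this example may differ.

```python
def extract_label(generated_text, eos="<|endoftext|>"):
    """Keep the text before the first end-of-text marker and strip surrounding whitespace."""
    return generated_text.split(eos)[0].strip()


def accuracy(generated_texts, expected_labels):
    """Fraction of generated texts whose extracted label matches the expected one."""
    correct = sum(extract_label(g) == e for g, e in zip(generated_texts, expected_labels))
    return correct / len(expected_labels)


# Made-up detokenized outputs: the second prediction is wrong.
generated = [" contradiction<|endoftext|>", " neutral<|endoftext|> the", " entailment<|endoftext|>"]
expected = ["contradiction", "entailment", "entailment"]
print(accuracy(generated, expected))
```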

If you want to evaluate a specific checkpoint, you can provide a `ckpt_load_path`. Otherwise, the latest weights will be used.

In [None]:
ckpt_load_path = None # os.path.join(checkpoint_dir, ...)

In [None]:
trainer.evaluate(eval_dataset, eval_config, tokenizer, ckpt_load_path=ckpt_load_path)

## Save HF checkpoint
You can save the trained weights so that they can be uploaded to Hugging Face and used in a Hugging Face torch model.
You can specify a checkpoint path if you want to convert a specific checkpoint, instead of the latest weights.

In [None]:
hf_checkpoint_path = "hf_checkpoint"
ckpt_path = None # os.path.join(checkpoint_dir, ...)

In [None]:
finetuned = trainer.save_hf_checkpoint(hf_checkpoint_path, ckpt_path)

The same model can later be used with the standard HF pipeline on any hardware.

```python
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('EleutherAI/gpt-j-6B')
hf_model = AutoModelForCausalLM.from_pretrained("Graphcore/gptj-mnli", pad_token_id=tokenizer.eos_token_id)
generator = pipeline('text-generation', model=hf_model, tokenizer=tokenizer)

# Note the trailing space after "education.": the two string literals are concatenated as-is.
prompt = "mnli hypothesis: Your contributions were of no help with our students' education. " \
         "premise: Your contribution helped make it possible for us to provide our students with a quality education. target:"

out = generator(prompt, return_full_text=False, max_new_tokens=5, top_k=1)
# [{'generated_text': ' contradiction'}]
```