# Text generation with GPT-J 6B

[GPT-J](https://huggingface.co/EleutherAI/gpt-j-6B) is a causal decoder-only transformer model which can be used for text-generation.
Causal means that a causal mask is used in the decoder attention, so that each token has visibility of previous tokens only.
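
To make the masking idea concrete, here is a minimal sketch in plain Python. The helper name `causal_mask` is hypothetical and is not part of this application; it assumes the usual convention in which each position may attend to itself and to earlier positions.

```python
from typing import List


def causal_mask(seq_len: int) -> List[List[int]]:
    """Return a seq_len x seq_len mask: 1 means "may attend", 0 means "masked"."""
    # Row i is the query position, column j is the key position.
    # Position i may attend to position j only when j <= i.
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


for row in causal_mask(4):
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

The lower-triangular shape is what stops a token from seeing any token that comes after it.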

Language models are very powerful because a huge variety of tasks can be formulated as text-to-text problems and thus adapted to fit the generative setup, where the model is asked to correctly predict future tokens. This idea has been widely explored in the [T5 paper: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/pdf/1910.10683.pdf).

Note that for these kinds of tasks you don't need models the size of GPT-3 175B. GPT-J at 6B has very good language understanding and is suitable for most of these scenarios. Larger models give only a small improvement in language understanding. Mainly, they add more world knowledge and perform better at free text generation, as might be used in an AI assistant or chatbot.

In this notebook you will:

- generate text in 5 lines of code with GPT-J on the Graphcore IPU;
- use GPT-J to answer questions and build prompts to reliably use text-generation for more specific NLP tasks;
- explore the effect on the model of the prompt format and compare the model performance with 0-shot (no examples included in the prompt) and few-shot prompting (where a few typical examples are included in the prompt);
- improve text generation throughput using batched inference;
- understand the limitations of the base GPT-J checkpoint when it comes to more complex NLP tasks;
- use the model to identify whether statements agree or disagree (entailment). For this more complex task, we show the benefit of fine-tuning and load a checkpoint fine-tuned on the MNLI dataset from the Hugging Face Hub which achieves much better performance on this specific task.


The MNLI checkpoint on Hugging Face mentioned above has been fine-tuned on IPU. No fine-tuning is performed in this notebook, but you can learn more about fine-tuning GPT-J in the [fine-tuning notebook](finetuning.ipynb).
By exploring this notebook, you will gain insight into how various tasks can be formulated in a text-to-text format and how this flexibility can be utilized to fine-tune the model using your own dataset.

## Environment setup

In order to run this notebook you will need to be in an environment with the Poplar SDK and PopART installed and enabled - on Paperspace this is handled by default.


In [None]:
%pip install -r requirements.txt

To ensure smooth execution of the notebook, we load and check environment variables.

In [None]:
import os

number_of_ipus = int(os.getenv("NUM_AVAILABLE_IPU", 4))
if number_of_ipus < 4:
    raise ValueError("This notebook is designed to run with at least 4 IPUs")

executable_cache_dir = os.getenv("POPLAR_EXECUTABLE_CACHE_DIR", "./exe_cache/")
os.environ["POPART_CACHE_DIR"] = executable_cache_dir
checkpoint_directory = os.getenv("CHECKPOINT_DIR")

## Running GPT-J on the IPU

We start by running the original [GPT-J model published by Eleuther AI](https://huggingface.co/EleutherAI/gpt-j-6B) on the Graphcore IPU. 
In a few lines of code we load the configuration and use it to create a pipeline object which will allow us to interactively run the model and use it for text generation.

While this application is written in Graphcore's PopXL framework, no knowledge of the framework is required to use or train this application as all parameters are controlled through configuration options.
Base configurations are available in the `config/inference.yml` file and can be loaded as follows:
 <!-- PopXL is a framework which provides fine grained control of execution, memory and parallelism.  -->

In [None]:
# --- Setup ---
import run_inference

config, *_ = run_inference.gptj_config_setup(
    "config/inference.yml", "release", "gpt-j-mnli"
)
print(config.dumps_yaml())

The configuration can be edited and stored in a new file to suit your needs. It contains all the arguments which define the model and control the execution of the application on the IPU.

For inference the main arguments we are interested in modifying are:

In [None]:
# The number of tokens generated before stopping
# Note the model will stop before this if it generates an <|endoftext|> token
config.inference.output_length = 10
# The number of prompts which will be processed at once
config.execution.micro_batch_size = 12
# The maximum tokenized sequence length (input + generated) handled by the model
config.model.sequence_length = 512

Next we're going to combine this configuration with pre-trained weights.
The `pipeline` utility directly accepts the name of a pre-trained checkpoint from the Hugging Face Hub.
The reference pre-trained checkpoint is the [6 billion parameter GPT-J checkpoint from EleutherAI](https://huggingface.co/EleutherAI/gpt-j-6B), which has been trained on [the Pile](https://pile.eleuther.ai/) open source dataset and has not undergone fine-tuning on any specific task. Hence, it is suitable for general text generation but, as you will see, is not very good on downstream tasks (such as question answering or entailment).


Once we have a config and have chosen a pre-trained model, we create a `GPTJPipeline`. If you are not planning to use long
prompts, you can reduce the sequence length either by changing the config or by providing the `sequence_length` argument to the pipeline. Here, we reduce it from the
default value of 1024 to 512. Reducing the sequence length allows you to fit more batches into memory. As you will see in the batched inference section, you can maximise performance by making the model process several prompts at the same time.
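
The trade-off can be made concrete with a little arithmetic. The sketch below assumes, as the config comment above states, that the sequence length covers both the prompt and the generated tokens. The helper `max_prompt_tokens` is hypothetical and is not part of the `api` module.

```python
def max_prompt_tokens(sequence_length: int, output_length: int) -> int:
    """Tokens left for the prompt once room for the generated tokens is reserved."""
    if output_length >= sequence_length:
        raise ValueError("output_length must be smaller than sequence_length")
    return sequence_length - output_length


print(max_prompt_tokens(sequence_length=512, output_length=20))   # 492
print(max_prompt_tokens(sequence_length=1024, output_length=20))  # 1004
```

Under that assumption, a sequence length of 512 with 20 generated tokens leaves 492 tokens for the prompt, which is plenty for the short prompts used in this notebook.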

Creating the pipeline takes a few minutes as the checkpoint is downloaded and loaded into the session:

In [None]:
import api

general_model = api.GPTJPipeline(
    config,
    "EleutherAI/gpt-j-6b",
    sequence_length=512,
    micro_batch_size=12,
    output_length=20,
    print_live=True,
)

You can explore the attributes of the `general_model` pipeline:

- `pretrained` contains the `GPTJForCausalLM` class from the Transformers library which is used to load the weights;
- `tokenizer` contains the tokenizer loaded with the pre-trained checkpoint from the Hugging Face Hub;
- `config` has the input config;
- `session` is the PopXL session which can be run on the IPU.

In [None]:
general_model.tokenizer

You can use the pipeline to do standard text generation starting from an input prompt. In this demo we use **greedy generation**, meaning the most likely token is chosen at each step. This corresponds to running a Hugging Face pipeline with the option `do_sample=False`.
The first execution of the pipeline takes a few minutes because the program is compiled and the weights are downloaded from the Hugging Face Hub and copied to the IPU. Subsequent executions will be much faster.
In the cell below, we ask the model to answer a simple question:

In [None]:
out = general_model("What is the capital of France?")

While the model gets the correct answer, it includes it in a long form answer and continues generating irrelevant text instead of the `<|endoftext|>` token.
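
To illustrate what greedy generation does, here is a toy sketch. It is not the pipeline's implementation: `TOY_MODEL` is a made-up lookup table standing in for a real language model, and `greedy_generate` is a hypothetical helper. It shows the two stopping conditions described earlier: reaching `output_length` or producing the `<|endoftext|>` token.

```python
from typing import Dict, List

# Hypothetical toy "model": maps the last token to next-token probabilities.
TOY_MODEL: Dict[str, Dict[str, float]] = {
    "<start>": {"Paris": 0.6, "London": 0.3, "<|endoftext|>": 0.1},
    "Paris": {"is": 0.5, "<|endoftext|>": 0.4, "London": 0.1},
    "is": {"nice": 0.7, "<|endoftext|>": 0.3},
    "nice": {"<|endoftext|>": 0.9, "is": 0.1},
    "London": {"<|endoftext|>": 1.0},
}


def greedy_generate(first_token: str, output_length: int) -> List[str]:
    """Pick the most likely next token at each step, up to output_length tokens."""
    tokens: List[str] = []
    last = first_token
    for _ in range(output_length):
        probs = TOY_MODEL[last]
        # Greedy choice: take the arg-max instead of sampling from the distribution.
        last = max(probs, key=probs.get)
        if last == "<|endoftext|>":
            break  # stop early when the end-of-text token is produced
        tokens.append(last)
    return tokens


print(greedy_generate("<start>", output_length=10))  # ['Paris', 'is', 'nice']
print(greedy_generate("<start>", output_length=2))   # ['Paris', 'is']
```

Because the arg-max is deterministic, the same prompt always yields the same continuation. If the end-of-text token is never the most likely one, generation only stops at `output_length`, which is consistent with the extra text seen above.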



## Sensitivity to prompt format

The model output is very sensitive to the prompt format: even spaces can make a difference. To get the best results, you have to experiment with the format of your input prompts. Structured prompts can make it easy to post-process the model outputs to extract the relevant pieces of information.

Let's try to give a different structure to our initial question:

In [None]:
out = general_model(
    """Question: What is the capital of Country?
Answer: City
Question: What is the capital of France?
Answer:""",
)
out

Now we get the answer in a short format, which is nice. Irrelevant extra text is still generated though. 
However, we can easily extract the relevant information. First of all, we can reduce the amount of text generated by the model to 10 tokens, which is enough for our answer.
We then extract only the string which answers the question:

In [None]:
out = general_model(
    """Question: What is the capital of Country?
Answer: city
Question: What is the capital of France?
Answer:""",
    output_length=10,
)
capitals = [answer.splitlines()[0].strip() for answer in out]
capitals
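
The post-processing step can be made a little more defensive. The sketch below uses a hypothetical helper, `first_line_answer`, which is not part of this application. It keeps only the first line of the generated text, removes the `<|endoftext|>` token if present, and returns an empty string for an empty generation, a case in which `answer.splitlines()[0]` would raise an `IndexError`.

```python
def first_line_answer(generated: str, stop_token: str = "<|endoftext|>") -> str:
    """Return the first line of the generated text, without the stop token."""
    # Drop everything from the stop token onwards, if it is present.
    text = generated.split(stop_token)[0]
    lines = text.splitlines()
    # An empty generation has no lines, so fall back to an empty answer.
    return lines[0].strip() if lines else ""


print(first_line_answer(" Paris\nQuestion: What is the capital of Spain?"))  # Paris
print(first_line_answer(" Rome<|endoftext|>"))  # Rome
print(repr(first_line_answer("")))  # ''
```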

## Batched text generation

This model configuration supports batched generation: it lets us generate text based on multiple prompts at the same time.
This can be useful to increase throughput when text prompts are queued for processing in a production environment.
This config has a batch size of 12, which means this pipeline can generate answers to 12 questions at a time.

Let's use the model to generate countries' names we can later use to build batches.

In [None]:
prompt_for_countries = "List countries in Europe, America, Asia and Africa: France, "
out = general_model(prompt_for_countries, output_length=120, print_live=True)

In [None]:
countries = [
    c.strip() for c in (prompt_for_countries + out[0]).split(":")[1].split(",")
]
countries

Using these generated countries, we compose a batch of structured prompts and pass it off to the model:

In [None]:
out = general_model(
    [
        f"""Question: What is the capital of China?
Answer: Beijing
Question: What is the capital of {country}?
Answer:"""
        for country in countries
    ],
    print_live=False,
    output_length=10,
)
capitals = [answer.splitlines()[0].strip() for answer in out]
list(zip(countries, capitals))

Thanks to batched inference, 12 prompts can be processed at the same time, which leads to much higher throughput than processing them one by one.
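
When there are more prompts than the batch size, they have to be processed in groups. The sketch below shows one way to do that grouping. It is an illustration only: `make_micro_batches` is a hypothetical helper, and padding the last group with empty prompts is an assumption, not a description of how `GPTJPipeline` handles this internally.

```python
from typing import List


def make_micro_batches(
    prompts: List[str], micro_batch_size: int, pad: str = ""
) -> List[List[str]]:
    """Split prompts into fixed-size groups, padding the last group if needed."""
    batches = []
    for start in range(0, len(prompts), micro_batch_size):
        batch = prompts[start : start + micro_batch_size]
        # Pad the final batch so every batch has exactly micro_batch_size entries.
        batch = batch + [pad] * (micro_batch_size - len(batch))
        batches.append(batch)
    return batches


prompts = [f"Question {i}" for i in range(30)]
batches = make_micro_batches(prompts, micro_batch_size=12)
print(len(batches))               # 3
print([len(b) for b in batches])  # [12, 12, 12]
print(batches[-1].count(""))      # 6
```

With 30 prompts and a batch size of 12, the last batch is only half full, so choosing a prompt count that is a multiple of the batch size avoids wasted work.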

## Determining entailment

In this section we evaluate the ability of our generative pipeline to perform a more complex natural language processing task.
This task will show the limits of the general model, even when few-shot prompting is used.

**Entailment** is the task of determining if two statements agree, disagree or are neutral relative to each other.
A common benchmark dataset is the [MNLI GLUE dataset](https://huggingface.co/datasets/glue).

It consists of pairs of sentences, a *premise* and a *hypothesis*.
The task is to predict the relation between the premise and the hypothesis, which can be:
- `entailment`: hypothesis follows from the premise,
- `contradiction`: hypothesis contradicts the premise,
- `neutral`: hypothesis and premise are unrelated.

As we did for question answering, we can try to use our generative model to tackle this task by creating prompts:


In [None]:
def entailment_prompt(hypothesis, premise, target=""):
    sep = ".\n" if target else ""
    return f"mnli hypothesis: {hypothesis} premise: {premise} target: {target}{sep}"


entailment_prompt("The person is leaving.", "Hello, welcome to the country.")

Valid prompts for this task are not as easy to come up with, so we load validation data from the MNLI task from the GLUE dataset using the 🤗 Datasets library:

In [None]:
import datasets

dataset = datasets.load_dataset("glue", "mnli", split="validation_mismatched")
dataset[0]

We define the `mnli_label_to_target` mapping below to turn integer class labels from the dataset into class names:

In [None]:
mnli_label_to_target = ["entailment", "neutral", "contradiction", "unknown"]
mnli_label_to_target[dataset[1]["label"]]
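
The fourth entry, `"unknown"`, is never selected by the labels 0, 1 and 2. One plausible reason for placing it last (an assumption, not something stated by the notebook) is that unlabeled examples, such as those in the GLUE test splits, carry the label -1, and Python's negative indexing maps -1 to the last list entry:

```python
mnli_label_to_target = ["entailment", "neutral", "contradiction", "unknown"]

for label in (0, 1, 2, -1):
    # Negative indexing makes -1 select the last entry, "unknown".
    print(label, "->", mnli_label_to_target[label])
# 0 -> entailment
# 1 -> neutral
# 2 -> contradiction
# -1 -> unknown
```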

Let's try passing a single prompt to the model. For question answering this was enough to get the right answer:

In [None]:
# Zero-shot: pass a single entailment prompt with no examples
general_model(
    entailment_prompt(dataset[1]["hypothesis"], dataset[1]["premise"]),
    print_live=True,
    output_length=10,
)

We were hoping for the model to predict "contradiction", or an answer which expresses that idea.
The model does not get it right: a single formatted prompt is not enough to get it to perform the entailment task.

We can try to improve the performance by adding instructions and providing examples using the **few-shot prompting** technique. To help with that we create a helper function `mnli_data_to_example` which will turn a single data entry from the MNLI dataset into an example prompt complete with the target prediction:

In [None]:
def mnli_data_to_example(data: dict):
    return entailment_prompt(
        data["hypothesis"], data["premise"], mnli_label_to_target[data["label"]]
    )


mnli_data_to_example(dataset[0])

The idea is to form input prompts that contain a few examples of the task we want the model to perform. 
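
As a minimal sketch of this idea, the hypothetical helper `few_shot_prompt` below concatenates a list of solved examples and then appends the unsolved query. It reuses the prompt format of `entailment_prompt` defined earlier, repeated here so that the snippet runs on its own. Passing an empty list of examples gives the 0-shot prompt.

```python
from typing import List, Tuple


def entailment_prompt(hypothesis: str, premise: str, target: str = "") -> str:
    # Same format as the helper defined earlier in this notebook.
    sep = ".\n" if target else ""
    return f"mnli hypothesis: {hypothesis} premise: {premise} target: {target}{sep}"


def few_shot_prompt(
    examples: List[Tuple[str, str, str]],
    hypothesis: str,
    premise: str,
    instruction: str = "",
) -> str:
    """Concatenate solved examples, then the unsolved query to be completed."""
    parts = [instruction + entailment_prompt(h, p, t) for h, p, t in examples]
    # The query has no target, so the model is expected to generate it.
    parts.append(instruction + entailment_prompt(hypothesis, premise))
    return "".join(parts)


prompt = few_shot_prompt(
    examples=[
        ("Goodbye.", "Hey there.", "contradiction"),
        ("Hello.", "Hey there.", "entailment"),
    ],
    hypothesis="The person is leaving.",
    premise="Hello, welcome.",
)
print(prompt)
```

The printed prompt contains two completed examples followed by a third line ending in `target: `, which is where the model continues.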

Coming up with a prompt which will reliably generate the intended results can be challenging, especially for a more complex task like entailment classification.
In the cell below we show 4 possible ways of generating entailment instructions in the prompt:

In [None]:
# Try asking for the task to be completed
entailment_instructions_v1 = "How are the statements related?\n"

# Try explaining the task with some explicit "Examples"
entailment_instructions_v2 = (
    "Tell me if the next statements entailment, contradiction, neutral.\n"
    + "Example - "
    + entailment_prompt("Goodbye.", "Hey there.", "contradiction")
    + "Example - "
    + entailment_prompt("Hello.", "Hey there.", "entailment")
)

# Reinforce the instructions with every example
entailment_instructions_v3 = (
    "Tell me if the statements entailment, contradiction, neutral.\n"
    + entailment_prompt("Goodbye.", "Hey there.", "contradiction")
    + "Tell me if the statements entailment, contradiction, neutral.\n"
    + entailment_prompt("Hello.", "Hey there.", "entailment")
    + "Tell me if the statements entailment, contradiction, neutral.\n"
    + entailment_prompt("The person is traveling.", "The cat is black.", "neutral")
    + "Tell me if the statements entailment, contradiction, neutral.\n"
)

# Use the dataset instead of handcrafted prompts
entailment_instructions_v4 = (
    "Tell me if the statements entailment, contradiction, neutral.\n"
    + mnli_data_to_example(dataset[0])
    + "Tell me if the statements entailment, contradiction, neutral.\n"
    + mnli_data_to_example(dataset[2])
    + "Tell me if the statements entailment, contradiction, neutral.\n"
    + mnli_data_to_example(dataset[8])
    + "Tell me if the statements entailment, contradiction, neutral.\n"
)

general_model(
    entailment_instructions_v3
    + entailment_prompt("Hello, welcome.", "The person is leaving."),
    print_live=True,
    output_length=40,
)

While the model does pick one of the 3 options, it does not select the right one.
The big challenge with this approach is obtaining reliable results. Apparently inconsequential changes in the phrasing of the examples can lead to very different results.

To evaluate the effectiveness of the engineered prompts we test against a number of unseen examples in the MNLI dataset.
In the cell below we test `entailment_instructions_v4`. Try out the other prompts and see how they perform, or come up with your own:

In [None]:
dataset_sample = dataset[12:48]
out = general_model(
    [
        entailment_instructions_v4 + entailment_prompt(hypothesis, premise)
        for hypothesis, premise in zip(
            dataset_sample["hypothesis"], dataset_sample["premise"]
        )
    ],
    print_live=False,
    output_length=10,
)
# Strip out everything in the output after new lines
processed = [
    (mnli_label_to_target[label], answer.splitlines()[0].strip())
    for label, answer in zip(dataset_sample["label"], out)
]
processed = [(label in answer, label, answer) for label, answer in processed]
print(f"Got {sum(p for p, *_ in processed)} / {len(processed)} correct")
processed[:12]

The instructions in prompt `entailment_instructions_v3` and `entailment_instructions_v4` are enough to capture some elements of the task and let the model guess one of the three target classes every time.
However, they are not enough for the model to correctly classify the relation between the hypothesis and the premise: accuracy oscillates around 33%, which corresponds to a random choice between the three classes.
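
The comparison with chance can be written out explicitly. The sketch below uses made-up answers rather than real model outputs, and the helper `accuracy` is hypothetical. It applies the same substring matching as the evaluation cell above, and the 1/3 baseline assumes the three classes are roughly balanced.

```python
from typing import List


def accuracy(labels: List[str], answers: List[str]) -> float:
    """Fraction of answers that contain the expected label as a substring."""
    correct = sum(label in answer for label, answer in zip(labels, answers))
    return correct / len(labels)


labels = ["entailment", "neutral", "contradiction", "neutral"]
# Made-up outputs for illustration only.
answers = ["entailment.", "contradiction.", "contradiction.", "entailment."]

chance = 1 / 3  # uniform guess among three classes
print(f"accuracy = {accuracy(labels, answers):.2f}, chance = {chance:.2f}")
# accuracy = 0.50, chance = 0.33
```

With only a few dozen samples, an accuracy a few points above or below 33% is not distinguishable from guessing.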

As we can see, few-shot prompting is not sufficient for the GPT-J model to complete this task.
Let's release the IPUs so that we can try a checkpoint of the model fine-tuned to perform this task.

In [None]:
general_model.detach()

## Using a fine-tuned model

In order to complete the entailment task we are going to use a model fine-tuned on the MNLI task of the GLUE dataset. For this model we have a shorter sequence length of 256, and so can fit a larger batch size of 16 into memory.

The checkpoint we will be using was fine-tuned on the Graphcore IPU and is hosted on the 🤗 Hugging Face Hub at [Graphcore/gptj-mnli](https://huggingface.co/Graphcore/gptj-mnli). To see how this checkpoint was generated check out the [fine-tuning notebook](finetuning.ipynb).
As we did before we can load this checkpoint with a single command:

In [None]:
mnli_model = api.GPTJPipeline(
    config,
    "Graphcore/gptj-mnli",
    sequence_length=256,
    micro_batch_size=16,
    print_live=True,
)

While this model was trained on IPU, a checkpoint trained on GPUs could just as well be loaded in a multi-accelerator workflow.

Just like the previous checkpoint, the model can handle arbitrary text generation questions:

In [None]:
mnli_model("Hey there", output_length=30)

However, on the entailment task the performance is much better, even without instructions:

In [None]:
mnli_model(entailment_prompt("The person is leaving.", "Hello, welcome."))

It got it right! Those sentences are contradictory. Now let's try our samples from the GLUE dataset:

In [None]:
out = mnli_model(
    [
        entailment_prompt(hypothesis, premise)
        for hypothesis, premise in zip(
            dataset[:16]["hypothesis"], dataset[:16]["premise"]
        )
    ],
    print_live=True,
    output_length=10,
)
# Strip out everything in the output after new lines
mnli_label_to_target = ["entailment", "neutral", "contradiction", "unknown"]
[
    (
        mnli_label_to_target[label],
        answer.splitlines()[0].strip().replace("<|endoftext|>", ""),
    )
    for label, answer in zip(dataset[:16]["label"], out)
]

It gets almost all of them right: it clearly has some knowledge of the task we need it to complete.

We can create a pipeline specific to this task which handles the prompt pre-processing and the post-processing of the generated text.
We can convert the existing pipeline using the `from_gptj_pipeline` factory method, which gives us a ready-to-use pipeline for entailment:

In [None]:
mnli_pipeline = api.GPTJEntailmentPipeline.from_gptj_pipeline(mnli_model)
mnli_pipeline("Hey there.", "Goodbye.")
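
To illustrate what such a task-specific pipeline does conceptually, here is a self-contained sketch. It is not the implementation of `GPTJEntailmentPipeline`: the class `EntailmentWrapper`, its argument order and the stand-in `fake_generate` function are all hypothetical, chosen so that the example runs without an IPU.

```python
from typing import Callable, List

LABELS = ("entailment", "neutral", "contradiction")


class EntailmentWrapper:
    """Hypothetical wrapper: builds the prompt, calls a generator, cleans the answer."""

    def __init__(self, generate: Callable[[List[str]], List[str]]):
        self.generate = generate

    def __call__(self, premise: str, hypothesis: str) -> str:
        # Pre-process: format the two sentences as an MNLI prompt.
        prompt = f"mnli hypothesis: {hypothesis} premise: {premise} target: "
        raw = self.generate([prompt])[0]
        # Post-process: drop the stop token, then keep the first line only.
        lines = raw.replace("<|endoftext|>", "").splitlines()
        answer = lines[0].strip() if lines else ""
        return answer if answer in LABELS else "unknown"


def fake_generate(prompts: List[str]) -> List[str]:
    # Stand-in for a real model so that the sketch runs anywhere.
    return [" contradiction<|endoftext|>" for _ in prompts]


pipeline = EntailmentWrapper(fake_generate)
print(pipeline(premise="Hey there.", hypothesis="Goodbye."))  # contradiction
```

Keeping the prompt format and the clean-up in one place means callers only deal with sentence pairs and class names.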

To evaluate the model we now run 200 samples of the dataset through our pipeline:

In [None]:
sample_size = 200
out = mnli_pipeline(
    premise=dataset[:sample_size]["premise"],
    hypothesis=dataset[:sample_size]["hypothesis"],
    print_live=False,
    output_length=5,
)

import pandas as pd

results = pd.DataFrame(
    [
        (mnli_label_to_target[label] == answer, mnli_label_to_target[label], answer)
        for label, answer in zip(dataset[:sample_size]["label"], out)
    ],
    columns=["correct", "label", "prediction"],
)
results.head(16)

Let's check the performance on the 200 samples:

In [None]:
print(f"Got {results['correct'].sum()}/{len(results)} correct")

We get approximately 80% correct, as expected from the model card.
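
With 200 samples the accuracy estimate carries some statistical uncertainty. The sketch below computes a rough 95% interval using the normal approximation for a binomial proportion. The count of 160 correct answers is an illustrative value chosen to match 80%, not a measured result, and `accuracy_interval` is a hypothetical helper.

```python
import math


def accuracy_interval(correct: int, total: int, z: float = 1.96) -> tuple:
    """Normal-approximation confidence interval for an accuracy estimate."""
    p = correct / total
    # Standard error of a binomial proportion.
    se = math.sqrt(p * (1 - p) / total)
    return p - z * se, p + z * se


low, high = accuracy_interval(correct=160, total=200)
print(f"{low:.3f} to {high:.3f}")  # 0.745 to 0.855
```

Under these assumptions, a result anywhere between roughly 75% and 85% on 200 samples is consistent with a true accuracy of about 80%.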

In [None]:
mnli_pipeline.detach()

## Conclusion

This notebook has demonstrated how easy it is to run GPT-J on the Graphcore IPU using this implementation of the model and 🤗 Hugging Face Hub checkpoints of the model weights. While not as powerful as larger models for free text-generation, medium-size auto-regressive models like GPT-J can still be successfully fine-tuned to handle a range of NLP tasks such as question answering, sentiment analysis, and named entity recognition.

In fewer than 10 lines of code, we were able to load the model onto the IPU and perform NLP tasks. We showed how the prompt format influences the model output and built structured prompts for question answering. We demonstrated how batched inference can be used to increase throughput by generating multiple answers at a time.

We also tackled the more complex task of determining entailment between statements, and found that the standard GPT-J checkpoint was not effective.
We tried few-shot prompting, concatenating several examples in the input prompt, to give the model instructions for the task. Even with some improvement, this technique wasn't enough for the general model to successfully classify entailment.
Finally, we showed that a model fine-tuned on this specific downstream task (using the MNLI GLUE dataset) performs much better, achieving approximately 80% accuracy on 200 validation samples.

Overall, this notebook showcases the potential for GPT-J to be used effectively and efficiently on several downstream tasks after simple fine-tuning.
Next, find out how to fine-tune GPT-J on the IPU in our notebook on fine-tuning [GPT-J on the MNLI dataset](finetuning.ipynb).