Copyright (c) 2023 Graphcore Ltd. All rights reserved.

# Faster Text Generation with GPT-J using 4-bit Weight Quantization on IPUs

The speed of text generation with large language models is often limited by the time it takes to read a model state from memory. One way to alleviate this issue is to: 
 * compress the model state for storage in low-bandwidth, external memory and for communication with high-bandwidth on-chip memory
 * decompress the model state on-chip into a number format you can compute with (for example float16).
 
Recently, many neural network practitioners have found that compressing model parameters to just 4 bits has minimal effect on the quality of model outputs.

Group quantisation is a simple approach for compressing model parameters to 4 bits with no finetuning and is described in
["FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU"](https://arxiv.org/abs/2303.06865). 

Here we will show you how to apply this technique to GPT-J on IPUs. 

In the notebook "Text Generation with GPT-J 6B on IPUs" (`GPTJ-generative-inference.ipynb`), you learned how to generate text with GPT-J, an accessible 6B parameter language model. You saw:

- how GPT-J performs on NLP tasks using both a base and fine-tuned checkpoint. 
- the effects on output quality from adjustments to prompt structure.
- throughput improvements from batching text queries.

In this notebook you will:

- compress GPT-J weights to 4 bits, using close to 4x less memory than float16.
- speed up GPT-J inference by ~1.5x with minimal degradation of MNLI task performance.
- see the trade-off between speed and accuracy.
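
The memory saving is slightly less than the 16/4 = 4x that the 4-bit codes alone would suggest, because each group of weights also stores a float16 scale and a float16 bias (described later in this notebook). The sketch below works through that arithmetic for a few group sizes. The helper `compression_ratio` is a name introduced here for illustration; it is not part of the notebook's API.

```python
def bits_per_param(group_size: int) -> float:
    """4 bits per weight, plus one float16 scale and one float16 bias per group."""
    return (group_size * 4 + 2 * 16) / group_size


def compression_ratio(group_size: int, baseline_bits: int = 16) -> float:
    """How many times smaller the quantised weights are than a float16 copy."""
    return baseline_bits / bits_per_param(group_size)


for group_size in (16, 32, 64, 128, 256):
    print(
        f"group_size={group_size:>3}: "
        f"{bits_per_param(group_size):.3f} bits/param, "
        f"{compression_ratio(group_size):.2f}x smaller than float16"
    )
```

Larger groups amortise the scale and bias over more weights, so the ratio approaches 4x as the group size grows.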

|  Domain | Tasks | Model | Datasets | Workflow |   Number of IPUs   | Execution time |
|---------|-------|-------|----------|----------|--------------|--------------|
|   NLP   |  Question answering | GPT-J | Glue-MNLI| Inference | 16 | 30min (+1h for final cell)|

[![Join our Slack Community](https://img.shields.io/badge/Slack-Join%20Graphcore's%20Community-blue?style=flat-square&logo=slack)](https://www.graphcore.ai/join-community)

## Dependencies and configuration

In order to improve usability and support for future users, Graphcore would like to collect information about the
applications and code being run in this notebook. The following information will be anonymised before being sent to Graphcore:

- User progression through the notebook
- Notebook details: number of cells, code being run and the output of the cells
- Environment details

You can disable logging at any time by running `%unload_ext graphcore_cloud_tools.notebook_logging.gc_logger` from any cell.

Install the dependencies the notebook needs.

In [None]:
%pip install -r requirements.txt
%load_ext graphcore_cloud_tools.notebook_logging.gc_logger

To make it easier for you to run this demo, we read in some configuration related to the environment you are running the notebook in.

In [None]:
import os

number_of_ipus = int(os.getenv("NUM_AVAILABLE_IPU", 4))
if number_of_ipus < 4:
    raise ValueError("This notebook is designed to run with at least 4 IPUs")

executable_cache_dir = os.getenv("POPLAR_EXECUTABLE_CACHE_DIR", "./exe_cache/")
os.environ["POPART_CACHE_DIR"] = executable_cache_dir
checkpoint_directory = os.getenv("CHECKPOINT_DIR")

In [None]:
!gc-monitor

## How to generate text with GPT-J on the IPU with 4-bit weights

We start by showing you how to generate text using GPT-J with 4-bit weights on the Graphcore IPU. As we did in our previous notebook `GPTJ-generative-inference.ipynb`, we load a configuration to create a pipeline object that we can use for generating text interactively.

In [None]:
# --- Setup ---
import run_inference

config, *_ = run_inference.gptj_config_setup(
    "config/inference.yml", "release", "gpt-j-gq-4bit"
)
print(config.dumps_yaml())

We use the same config as before, with a single change: we set `config.execution.group_quantise_weights` to some value `n` greater than 0. This option divides the columns of each weight matrix into groups of `n`. Values within each group are binned into one of 16 values (4 bits), and we store a scale and a bias factor for each group in float16 so that we can convert from 4 bits back to float16 when we compute. This means the column dimension of each weight matrix must be divisible by `n`. Since native 4-bit formats don't exist, we store four consecutive 4-bit integers in a 16-bit integer value. As a result, `n` must also be divisible by 4.

Since GPT-J weights all have a column dimension of `4096`, a convenient choice for `n` is 64. 
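
To make the scheme concrete, here is a minimal NumPy sketch of group quantisation. It rests on assumptions: it uses a simple min/max (affine) binning per group, and the function names `group_quantise` and `group_dequantise` are made up for this illustration. The implementation used on the IPU may bin values and lay out memory differently.

```python
import numpy as np


def group_quantise(weights: np.ndarray, group_size: int):
    """Quantise each group of `group_size` columns to 4-bit codes (0..15)."""
    rows, cols = weights.shape
    assert cols % group_size == 0, "column dimension must be divisible by group size"
    assert group_size % 4 == 0, "group size must be divisible by 4 for packing"

    groups = weights.astype(np.float32).reshape(rows, cols // group_size, group_size)
    w_min = groups.min(axis=-1, keepdims=True)
    w_max = groups.max(axis=-1, keepdims=True)
    # Assumed binning: 16 evenly spaced levels between the group min and max.
    # The lower bound guards against groups where every value is identical.
    scale = np.maximum(w_max - w_min, 1e-8) / 15.0
    codes = np.clip(np.round((groups - w_min) / scale), 0, 15).astype(np.uint16)

    # Pack four consecutive 4-bit codes into one 16-bit integer.
    c = codes.reshape(rows, cols // 4, 4)
    packed = c[..., 0] | (c[..., 1] << 4) | (c[..., 2] << 8) | (c[..., 3] << 12)
    return packed.astype(np.uint16), scale.astype(np.float16), w_min.astype(np.float16)


def group_dequantise(packed, scale, bias, group_size):
    """Unpack the 4-bit codes and map them back to float16 values."""
    rows = packed.shape[0]
    shifts = np.array([0, 4, 8, 12], dtype=np.uint16)
    codes = (packed[..., None] >> shifts) & 0xF  # shape: (rows, cols // 4, 4)
    codes = codes.reshape(rows, -1, group_size).astype(np.float16)
    return (codes * scale + bias).reshape(rows, -1)


rng = np.random.default_rng(0)
weights = rng.standard_normal((4, 128)).astype(np.float16)

packed, scale, bias = group_quantise(weights, group_size=64)
restored = group_dequantise(packed, scale, bias, group_size=64)

max_error = np.abs(restored.astype(np.float32) - weights.astype(np.float32)).max()
print(packed.shape, packed.dtype)
print(f"max abs reconstruction error: {max_error:.4f}")
```

With this binning, the reconstruction error of each value is bounded by roughly half of its group's scale, which is why smaller groups (with a narrower range of values per group) tend to preserve the weights more faithfully.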

In [None]:
# Size of groups for 4-bit quantization
config.execution.group_quantise_weights = 64

Next, we download and compress EleutherAI's pretrained weights from Hugging Face. Compressing weights takes a bit of time, but you only need to do it once, after you have settled on your quantization hyperparameters.

In [None]:
import api

int4_model = api.GPTJPipeline(
    config,
    "EleutherAI/gpt-j-6b",
    sequence_length=512,
    micro_batch_size=1,
    output_length=20,
    print_live=True,
)

Let's ask our newly quantized GPT-J model the same question we asked in our first notebook.

In [None]:
out = int4_model("What is the capital of France?")

The answer is correct, if a bit repetitive! But this behaviour is expected for models of this size that have not been fine-tuned for particular types of prompts or instructions.

So how fast was it? We added a timer to our pipeline to measure token generation time.

In [None]:
print(
    f"Average token generation time for our compressed model is {int4_model.token_generation_time:.3f} secs"
)

In [None]:
int4_model.detach()

Let's also compare with an uncompressed float16 model to look at the speedup.

In [None]:
config.execution.group_quantise_weights = 0
fp16_model = api.GPTJPipeline(
    config,
    "EleutherAI/gpt-j-6b",
    sequence_length=512,
    micro_batch_size=1,
    output_length=20,
    print_live=True,
)

In [None]:
out = fp16_model("What is the capital of France?")

In [None]:
print(
    f"Average token generation time for base model is {fp16_model.token_generation_time:.3f} secs"
)

In [None]:
speedup = fp16_model.token_generation_time / int4_model.token_generation_time
print(f"int4 model generates tokens {speedup:.3f}x faster than fp16 model!")

In [None]:
fp16_model.detach()

In [None]:
del int4_model, fp16_model

The compressed model is definitely faster! We can see that the answers are similar, but not exactly the same. Both models answered the question correctly, then started producing related text to fill the output quota of 20 tokens.

Next, we will try to quantify the change in model quality from quantising weights.

## Compressing model weights impacts task performance (but not by much!)

Although interactively asking language models simple questions is important for getting a subjective feel for model quality, we would prefer to have more objective assessments where possible. In this section we will return to the MNLI task, in which we use a language model to classify the logical relationship between a premise and a hypothesis (entailment, contradiction, or neutral). Once again we'll download the MNLI fine-tuned weights from Hugging Face, but this time we'll also compress them using the same scheme.

First, let's set our config for the MNLI validation task:

In [None]:
# The number of tokens generated before stopping
# Note the model will stop before this if it generates an <|endoftext|> token
config.inference.output_length = 5
# The number of prompts which will be processed at once
config.execution.micro_batch_size = 12
# The maximum tokenized sequence length (input + generated) handled by the model
config.model.sequence_length = 256

# Size of groups for 4-bit quantization
config.execution.group_quantise_weights = 64

In [None]:
import datasets

dataset = datasets.load_dataset("glue", "mnli", split="validation_mismatched")

In [None]:
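def compute_accuracy(predictions):
    """Return the percentage of predictions that match the MNLI labels."""
    # The position in this list corresponds to the integer label in the dataset
    mnli_classes = ["entailment", "neutral", "contradiction", "unknown"]
    correct = [
        pred == mnli_classes[actual]
        for pred, actual in zip(predictions, dataset[:]["label"])
    ]
    return sum(correct) * 100 / len(predictions)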

In [None]:
mnli_int4_model = api.GPTJPipeline(
    config,
    "Graphcore/gptj-mnli",
    sequence_length=256,
    print_live=True,
)
mnli_int4_pipeline = api.GPTJEntailmentPipeline.from_gptj_pipeline(mnli_int4_model)
int4_out = mnli_int4_pipeline(
    premise=dataset[:]["premise"],
    hypothesis=dataset[:]["hypothesis"],
    print_live=False,
    output_length=5,
)

int4_acc = compute_accuracy(int4_out)
print(f"Compressed model accuracy is {int4_acc:.2f}%")

In [None]:
mnli_int4_model.detach()

In [None]:
config.execution.group_quantise_weights = 0
mnli_fp16_model = api.GPTJPipeline(
    config,
    "Graphcore/gptj-mnli",
    sequence_length=256,
    print_live=True,
)
mnli_fp16_pipeline = api.GPTJEntailmentPipeline.from_gptj_pipeline(mnli_fp16_model)
fp16_out = mnli_fp16_pipeline(
    premise=dataset[:]["premise"],
    hypothesis=dataset[:]["hypothesis"],
    print_live=False,
    output_length=5,
)

fp16_acc = compute_accuracy(fp16_out)
print(f"Base model accuracy is {fp16_acc:.2f}%")

You can see that quantising to 4 bits results in just a 1.27% degradation in accuracy! 

Shall we see what happens to accuracy and speed when we vary the group size? If you decide to run the `for` loop below you will cycle through group sizes of 16, 32, 128, and 256, and rerun our MNLI validation pipeline for each group size. This will take a while as we need to compress the checkpoint again for every new group size! If you just want to see the results, we have plotted them for you here:

<img src="./imgs/gq-speed-accuracy-tradeoff.png">

In [None]:
mnli_fp16_model.detach()

In [None]:
times = {
    0: mnli_fp16_pipeline.token_generation_time,
    64: mnli_int4_pipeline.token_generation_time,
}
accuracies = {0: fp16_acc, 64: int4_acc}

for gs in [16, 32, 128, 256]:
    config.execution.group_quantise_weights = gs
    mnli_int4_model = api.GPTJPipeline(
        config,
        "Graphcore/gptj-mnli",
        sequence_length=256,
        print_live=True,
    )
    mnli_int4_pipeline = api.GPTJEntailmentPipeline.from_gptj_pipeline(mnli_int4_model)
    int4_out = mnli_int4_pipeline(
        premise=dataset[:]["premise"],
        hypothesis=dataset[:]["hypothesis"],
        print_live=False,
        output_length=5,
    )

    int4_acc = compute_accuracy(int4_out)
    times[gs] = mnli_int4_pipeline.token_generation_time
    accuracies[gs] = int4_acc
    mnli_int4_pipeline.detach()
    del mnli_int4_pipeline, mnli_int4_model

In [None]:
import matplotlib.pyplot as plt

keys = list(times.keys())
keys.sort()
for k in keys:
    plt.plot(times[k] * 1000, accuracies[k], marker="o")


def bits_per_param(n):
    # n = group size
    # 4 bits per param, plus float 16 scale and bias for each group
    return (n * 4 + 2 * 16) / n


plt.legend(
    ["16 bits per param (uncompressed)"]
    + [f"{bits_per_param(k)} bits per param (group_size={k})" for k in keys[1:]]
)
plt.xlabel("Batched token generation time (ms)")
plt.ylabel("MNLI accuracy (%)")
plt.title(
    "Finetuned GPT-J (6B) MNLI speed/accuracy \n trade-off for varying int4 quantization group size"
)

You can see that the uncompressed checkpoint produces the most accurate results, but is also the slowest. As you increase `group_size`, token generation time falls and accuracy at first degrades quite smoothly. For group sizes greater than 128, accuracy degrades very sharply without much further improvement in speed. The best trade-off looks to be `group_size=64`: you need a significantly slower model to obtain a marginal improvement in accuracy, whereas an only slightly faster model comes with quite a big drop in accuracy. Depending on your requirements, you might decide differently. For example, you may want a little more quality even if it means your model runs slower.

If you want to try this for your own models and use-cases you can perform an evaluation like this to help you decide!
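
One way to turn such an evaluation into a decision is to fix an accuracy budget and pick the fastest configuration that stays within it. The sketch below assumes dictionaries shaped like `times` and `accuracies` above (keyed by group size, with `0` for the uncompressed model). The function `fastest_within_budget` is a hypothetical helper, and the numbers in the example are made-up placeholders, not measurements.

```python
def fastest_within_budget(times, accuracies, max_accuracy_drop, baseline_key=0):
    """Return the key with the lowest time whose accuracy drop fits the budget."""
    baseline = accuracies[baseline_key]
    candidates = [
        key for key in times if baseline - accuracies[key] <= max_accuracy_drop
    ]
    return min(candidates, key=lambda key: times[key])


# Placeholder values for illustration only; substitute your own measurements.
example_times = {0: 0.030, 32: 0.022, 64: 0.020, 128: 0.019}
example_accuracies = {0: 80.0, 32: 79.5, 64: 79.0, 128: 76.0}

choice = fastest_within_budget(
    example_times, example_accuracies, max_accuracy_drop=1.5
)
print(f"Fastest group size within the accuracy budget: {choice}")
```

A tighter budget pushes the choice towards smaller groups or the uncompressed model, while a looser budget favours larger groups.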

## Conclusions and next steps

In this notebook, we showed that:
1. you can run 4-bit inference on GPT-J by changing the `group_quantise_weights` config option.
2. compressing weights to 4 bits with group quantisation gives a 2x speedup over float16 for text generation on IPUs at a batch size of 1.
3. quantising weights results in only a small drop in accuracy on the MNLI entailment task.
4. groups of size 64 provide a good trade-off between speed and accuracy.

Efficiently decompressing 4-bit integers to float16 values is surprisingly difficult for any hardware. We have some guidelines for [how to write efficient custom C++ code for the IPU](https://docs.graphcore.ai/projects/poplar-user-guide/en/latest/poplar_programs.html), and a description of how we wrote the code for our [custom op](https://github.com/graphcore/popxl-addons/blob/master/popxl_addons/ops/group_quantize_decompress/group_quantize_decompressx.cpp#L51-L109) and the [tile vertex (per thread kernel)](https://github.com/graphcore/popxl-addons/blob/master/popxl_addons/ops/group_quantize_decompress/group_quantize_decompress_codelet.cpp) for quickly decompressing weights. 

You can also run our other notebooks exploring other efficiency wins for deep learning such as:
- [Converting FLAN-T5 XL to float16 for faster inference](https://ipu.dev/tvxZ3Q)
- [A how-to on training GPT models in float8 on IPUs with unit scaling](https://ipu.dev/qXfm2a)
- [Accelerating transformers with packing for fine-tuning and inference](https://ipu.dev/q6HAUX)

Try out the other [IPU-powered Jupyter Notebooks](https://www.graphcore.ai/ipu-jupyter-notebooks) to see how IPUs perform on other tasks.