# Accelerate GPT inference with DeepSpeed-Inference on GPUs

In this session, you will learn how to optimize GPT-2/GPT-J for Inerence using [Hugging Face Transformers](https://huggingface.co/docs/transformers/index) and [DeepSpeed-Inference](https://www.deepspeed.ai/tutorials/inference-tutorial/). The session will show you how to apply state-of-the-art optimization techniques using [DeepSpeed-Inference](https://www.deepspeed.ai/tutorials/inference-tutorial/). 
This session will focus on single GPU inference for GPT-2, GPT-NEO and GPT-J like models
By the end of this session, you will know how to optimize your Hugging Face Transformers models (GPT-2, GPT-J) using DeepSpeed-Inference. We are going to optimize GPT-j 6B for text-generation.

You will learn how to:
1. [Setup Development Environment](#1-Setup-Development-Environment)
2. [Load vanilla GPT-J model and set baseline](#2-Load-vanilla-GPT-J-model-and-set-baseline)
3. [Optimize GPT-J for GPU using DeepSpeeds `InferenceEngine`](#3-Optimize-GPT-J-for-GPU-using-DeepSpeeds-InferenceEngine)
4. [Evaluate the performance and speed](#4-Evaluate-the-performance-and-speed)

Let's get started! 🚀

_This tutorial was created and run on a g4dn.xlarge AWS EC2 Instance including an NVIDIA T4._

---

## Quick Intro: What is DeepSpeed-Inference

[DeepSpeed-Inference](https://www.deepspeed.ai/tutorials/inference-tutorial/) is an extension of the [DeepSpeed](https://www.deepspeed.ai/) framework focused on inference workloads.  [DeepSpeed Inference](https://www.deepspeed.ai/#deepspeed-inference) combines model parallelism technology such as tensor, pipeline-parallelism, with custom optimized cuda kernels.
DeepSpeed provides a seamless inference mode for compatible transformer based models trained using DeepSpeed, Megatron, and HuggingFace. For a list of compatible models please see [here](https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/module_inject/replace_policy.py).
As mentioned DeepSpeed-Inference integrates model-parallelism techniques allowing you to run multi-GPU inference for LLM, like [BLOOM](https://huggingface.co/bigscience/bloom) with 176 billion parameters.
If you want to learn more about DeepSpeed inference: 
* [Paper: DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale](https://arxiv.org/pdf/2207.00032.pdf)
* [Blog: Accelerating large-scale model inference and training via system optimizations and compression](https://www.microsoft.com/en-us/research/blog/deepspeed-accelerating-large-scale-model-inference-and-training-via-system-optimizations-and-compression/)


## 1. Setup Development Environment

Our first step is to install Deepspeed, along with PyTorch, Transfromers and some other libraries. Running the following cell will install all the required packages.

_Note: You need a machine with a GPU and a compatible CUDA installed. You can check this by running `nvidia-smi` in your terminal. If your setup is correct, you should get statistics about your GPU._

In [1]:
!pip install torch==1.11.0 torchvision==0.12.0 --extra-index-url https://download.pytorch.org/whl/cu113 --upgrade -q 
!pip install deepspeed==0.7.0 --upgrade -q 
!pip install transformers[sentencepiece]==4.21.1 --upgrade -q 
!pip install datasets evaluate[evaluator]==0.2.2 seqeval --upgrade -q 

Before we start. Let's make sure all packages are installed correctly.

In [2]:
import re
import torch 

# check deepspeed installation
report = !python3 -m deepspeed.env_report
r = re.compile('.*ninja.*OKAY.*')
assert any(r.match(line) for line in report) == True, "DeepSpeed Inference not correct installed"

# check cuda and torch version
torch_version, cuda_version = torch.__version__.split("+")
torch_version = ".".join(torch_version.split(".")[:2])
cuda_version = f"{cuda_version[2:4]}.{cuda_version[4:]}"
r = re.compile(f'.*torch.*{torch_version}.*')
assert any(r.match(line) for line in report) == True, "Wrong Torch version"
r = re.compile(f'.*cuda.*{cuda_version}.*')
assert any(r.match(line) for line in report) == True, "Wrong Cuda version"


## 2. Load vanilla GPT-J model and set baseline

After we set up our environment, we create a baseline for our model. We use the [EleutherAI/gpt-j-6B](https://huggingface.co/EleutherAI/gpt-j-6B), a GPT-J 6B was trained on the [Pile](https://pile.eleuther.ai/), a large-scale curated dataset created by [EleutherAI](https://www.eleuther.ai/). This model was trained for 402 billion tokens over 383,500 steps on TPU v3-256 pod. It was trained as an autoregressive language model, using cross-entropy loss to maximize the likelihood of predicting the next token correctly.

To create our baseline, we load the model with `transformers` and create a `text-generation` pipeline. We are loading the model by using `fp16` weights. 

In [3]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

# Model Repository on huggingface.co
model_id="EleutherAI/gpt-j-6B"
revision="float16"

# Load Model and Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, revision=revision, torch_dtype=torch.float16, low_cpu_mem_usage=True
)

# Create a pipeline for token classification
generator = pipeline("text-generation", model=model, tokenizer=tokenizer,device=0)

# Test pipeline
example = "My name is Philipp and I"
prediction = generator(example)
print(prediction)


Downloading tokenizer_config.json:   0%|          | 0.00/619 [00:00<?, ?B/s]

Downloading vocab.json:   0%|          | 0.00/779k [00:00<?, ?B/s]

Downloading merges.txt:   0%|          | 0.00/446k [00:00<?, ?B/s]

Downloading tokenizer.json:   0%|          | 0.00/1.31M [00:00<?, ?B/s]

Downloading added_tokens.json:   0%|          | 0.00/3.94k [00:00<?, ?B/s]

Downloading special_tokens_map.json:   0%|          | 0.00/357 [00:00<?, ?B/s]

Downloading config.json:   0%|          | 0.00/836 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/11.3G [00:00<?, ?B/s]

Create a latency baseline we use the `measure_latency` function, which implements a simple python loop to run inference and calculate the avg, mean & p95 latency for our model.

In [None]:
from time import perf_counter
import numpy as np 

def measure_latency(pipe,payload, generation_args={}):
    latencies = []
    # warm up
    for _ in range(10):
        _ = pipe(payload)
    # Timed run
    for _ in range(300):
        start_time = perf_counter()
        _ = pipe(payload)
        latency = perf_counter() - start_time
        latencies.append(latency)
    # Compute run statistics
    time_avg_ms = 1000 * np.mean(latencies)
    time_std_ms = 1000 * np.std(latencies)
    time_p95_ms = 1000 * np.percentile(latencies,95)
    return f"P95 latency (ms) - {time_p95_ms}; Average latency (ms) - {time_avg_ms:.2f} +\- {time_std_ms:.2f};", time_p95_ms

We are going to use greedy search as decoding strategy and will generate 128 new tokens with 128 tokens as input. 

In [None]:
payload="Hello my name is Philipp. I am getting in touch with you because i didn't get a response from you. What do I need to do to get my new card which I have requested 2 weeks ago? Please help me and answer this email in the next 7 days. Best regards and have a nice weekend "*2

print(f'Payload sequence length is: {len(tokenizer(payload)["input_ids"])}')

Our model achieves an f1 score of `95.8%` on the CoNLL-2003 dataset with an average latency across the dataset of `18.9ms`. 

## 3.Optimize GPT-J for GPU using DeepSpeeds `InferenceEngine`

The next and most important step is to optimize our model for GPU inference. This will be done using the DeepSpeed `InferenceEngine`. The `InferenceEngine` is initialized using the `init_inference` method. The `init_inference` method expects as parameters atleast:

* `model`: The model to optimize.
* `mp_size`: The number of GPUs to use.
* `dtype`: The data type to use.
* `replace_with_kernel_inject`: Whether inject custom kernels.

You can find more information about the `init_inference` method in the [DeepSpeed documentation](https://deepspeed.readthedocs.io/en/latest/inference-init.html) or [thier inference blog](https://www.deepspeed.ai/tutorials/inference-tutorial/).

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from transformers import pipeline
from deepspeed.module_inject import HFBertLayerPolicy
import deepspeed

# Model Repository on huggingface.co
model_id="EleutherAI/gpt-j-6B"
revision="float16"

# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained( model_id, revision=revision, torch_dtype=torch.float16, low_cpu_mem_usage=True)


# init deepspeed inference engine
ds_model = deepspeed.init_inference(
    model=model,      # Transformers models
    mp_size=1,        # Number of GPU
    dtype=torch.half, # dtype of the weights (fp16)
    # injection_policy={"BertLayer" : HFBertLayerPolicy}, # replace BertLayer with DS HFBertLayerPolicy
    replace_method="auto", # Lets DS autmatically identify the layer to replace
    replace_with_kernel_inject=True, # replace the model with the kernel injector
)

# create acclerated pipeline
ds_clf = pipeline("token-classification", model=ds_model, tokenizer=tokenizer,device=0)

# Test pipeline
example = "My name is Wolfgang and I live in Berlin"
ner_results = ds_clf(example)
print(ner_results)


We can now inspect our model graph to see that the vanilla `BertLayer` has been replaced with an `HFBertLayer`, which includes the `DeepSpeedTransformerInference` module, a custom `nn.Module` that is optimized for inference by the DeepSpeed Team.

```python
InferenceEngine(
  (module): BertForTokenClassification(
    (bert): BertModel(
      (embeddings): BertEmbeddings(
        (word_embeddings): Embedding(28996, 1024, padding_idx=0)
        (position_embeddings): Embedding(512, 1024)
        (token_type_embeddings): Embedding(2, 1024)
        (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (encoder): BertEncoder(
        (layer): ModuleList(
          (0): DeepSpeedTransformerInference(
            (attention): DeepSpeedSelfAttention()
            (mlp): DeepSpeedMLP()
          )
```

In [21]:
from deepspeed.ops.transformer.inference import DeepSpeedTransformerInference

assert isinstance(ds_model.module.bert.encoder.layer[0], DeepSpeedTransformerInference) == True, "Model not sucessfully initalized"

## 4. Evaluate the performance and speed

As the last step, we want to take a detailed look at the performance of our optimized model. Applying optimization techniques, like graph optimizations or mixed-precision, not only impact performance (latency) those also might have an impact on the accuracy of the model. So accelerating your model comes with a trade-off.

Let's test the performance (latency) of our optimized model. We will use the same generation args as for our vanilla model.

In [22]:
from time import perf_counter
import numpy as np 

payload="Hello my name is Philipp. I am getting in touch with you because i didn't get a response from you. What do I need to do to get my new card which I have requested 2 weeks ago? Please help me and answer this email in the next 7 days. Best regards and have a nice weekend "*2

print(f'Payload sequence length is: {len(tokenizer(payload)["input_ids"])}')

vanilla_model=measure_latency(token_clf)
ds_opt_model=measure_latency(ds_clf)

print(f"Vanilla model: {vanilla_model[0]}")
print(f"Optimized model: {ds_opt_model[0]}")
print(f"Improvement through optimization: {round(vanilla_model[1]/ds_opt_model[1],2)}x")

Payload sequence length is: 128
Vanilla model: P95 latency (ms) - 30.401047450277474; Average latency (ms) - 29.68 +\- 0.54;
Optimized model: P95 latency (ms) - 10.401162500056671; Average latency (ms) - 10.10 +\- 0.17;
Improvement through optimization: 2.92x


We managed to accelerate the `BERT-Large` model latency from `30.4ms` to `10.40ms` or 2.92x for sequence length of 128.

![bert-latency](../assets/bert-inference-latency.png)

## Conclusion

We successfully optimized our BERT-large Transformers with DeepSpeed-inference and managed to decrease our model latency from 30.4ms to 10.4ms or 2.92x while keeping 99.88% of the model accuracy. 
The results are impressive, but applying the optimization was as easy as adding one additional call to `deepspeed.init_inference`. 
But I have to say that this isn't a plug-and-play process you can transfer to any Transformers model, task, or dataset. Also, make sure to check if your model is compatible with DeepSpeed-Inference.