<center> 
    <a href="https://github.com/nebuly-ai/nebullvm#how-nebullvm-works" target="_blank" style="text-decoration: none;"> How Nebullvm Works </a> •
    <a href="https://github.com/nebuly-ai/nebullvm#tutorials" target="_blank" style="text-decoration: none;"> Tutorials </a> •
    <a href="https://github.com/nebuly-ai/nebullvm#benchmarks" target="_blank" style="text-decoration: none;"> Benchmarks </a> •
    <a href="https://github.com/nebuly-ai/nebullvm#installation" target="_blank" style="text-decoration: none;"> Installation </a> •
    <a href="https://github.com/nebuly-ai/nebullvm#get-started" target="_blank" style="text-decoration: none;"> Get Started </a> •
    <a href="https://github.com/nebuly-ai/nebullvm#optimization-examples" target="_blank" style="text-decoration: none;"> Optimization Examples </a>
</center>
<center> 
    <a href="https://discord.com/invite/RbeQMu886J" target="_blank" style="text-decoration: none;"> Discord </a> |
    <a href="https://nebuly.ai/" target="_blank" style="text-decoration: none;"> Website </a> |
    <a href="https://www.linkedin.com/company/72460022/" target="_blank" style="text-decoration: none;"> LinkedIn </a> |
    <a href="https://twitter.com/nebuly_ai" target="_blank" style="text-decoration: none;"> Twitter </a>
</center>

# Accelerate Hugging Face GPT2 and BERT with nebullvm

Hi and welcome 👋

In this notebook we will see how, in just a few steps, you can reduce the inference response time of deep learning models using the open-source library nebullvm.

With nebullvm's latest API, you can speed up models by up to 10 times without any loss of accuracy (option A), or by up to 20-30 times by specifying how much accuracy/precision you are willing to trade for an even lower response time (option B). To accelerate your model, nebullvm relies on deep learning compilers in both options, and additionally on techniques such as quantization and half precision in option B.

Let's jump to the code.

We will first optimize a GPT2 transformer by taking advantage of nebullvm API option A (acceleration without loss of accuracy/precision).

## GPT2 - Import a pre-trained Hugging Face model

We chose GPT2 as the pre-trained model that we want to optimize. Let's download both the pre-trained model and the tokenizer from the Hugging Face model hub.

We will also define one short and one long text for GPT2 to process. We will use them later to test the impact of nebullvm as the number of input tokens changes.

In [None]:
from transformers import GPT2Tokenizer, GPT2Model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')
text = "Short text you wish to process"
long_text = " ".join([text]*100)

Let's run the prediction 100 times to calculate the average response time of the unoptimized model.

In [None]:
import time
import torch

In [None]:
encoded_input = tokenizer(text, return_tensors='pt')
times = []
for _ in range(100):
    st = time.time()
    with torch.no_grad():
        output = model(**encoded_input)
    times.append(time.time()-st)
vanilla_short_token_time = sum(times)/len(times)*1000
print(f"Average response time for GPT2 ({encoded_input['input_ids'].shape[1]} tokens): {vanilla_short_token_time} ms")

In [None]:
long_encoded_input = tokenizer(long_text, return_tensors='pt', truncation=True)
times = []
for _ in range(100):
    st = time.time()
    with torch.no_grad():
        new_out = model(**long_encoded_input)
    times.append(time.time()-st)
vanilla_long_token_time = sum(times)/len(times)*1000
print(f"Average response time for GPT2 ({long_encoded_input['input_ids'].shape[1]} tokens): {vanilla_long_token_time} ms")
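
The two timing loops above follow the same pattern, so they could be folded into a small helper. The sketch below is one possible way to do that, not part of nebullvm: the name `benchmark` and its parameters `n_runs` and `n_warmup` are assumptions made for illustration. It uses `time.perf_counter()`, which is better suited to measuring short intervals than `time.time()`, and it runs a few warm-up calls that are excluded from the statistics.

```python
import statistics
import time


def benchmark(fn, n_runs=100, n_warmup=5):
    """Call fn() n_runs times and return latency statistics in milliseconds.

    Hypothetical helper for this notebook, not a nebullvm function.
    """
    # Warm-up calls are executed but not measured
    for _ in range(n_warmup):
        fn()

    latencies_ms = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        latencies_ms.append((time.perf_counter() - start) * 1000)

    return {
        "mean_ms": statistics.mean(latencies_ms),
        "median_ms": statistics.median(latencies_ms),
        "max_ms": max(latencies_ms),
    }


# Stand-in workload so the sketch runs without downloading a model
stats = benchmark(lambda: sum(range(10_000)), n_runs=20)
print(f"mean {stats['mean_ms']:.3f} ms, median {stats['median_ms']:.3f} ms")
```

With the objects defined in this notebook, the call would look like `benchmark(lambda: model(**encoded_input))` placed inside a `with torch.no_grad():` block. When timing on a GPU, a call to `torch.cuda.synchronize()` inside the timed function is usually needed as well, because CUDA kernels run asynchronously.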

## GPT2 - Speed up inference with nebullvm

It's now time to improve the inference speed. Let's use `nebullvm`.

In [None]:
from nebullvm.api.frontend.huggingface import optimize_huggingface_model

Using nebullvm is simple and straightforward. Just call the `optimize_huggingface_model` function and pass it the model, the tokenizer, example texts for the model input, the batch size, the maximum size of each input (excluding the batch dimension), and a directory in which to save the optimized model.

The function also takes some information about the model's inputs. In this case, for example, we need to specify that the attention mask values can only be 0 or 1 (in the `extra_input_info` list of dictionaries).

In [None]:
optimized_model = optimize_huggingface_model(
    model=model,
    tokenizer=tokenizer,
    input_texts=[text],
    batch_size=1,
    max_input_sizes=[tuple(value.size()[1:]) for value in long_encoded_input.values()],
    save_dir=".",
    extra_input_info=[{}, {"max_value": 1, "min_value": 0}],
    use_torch_api=False,
    tokenizer_args={"truncation": True},
    perf_loss_ths=3,
)
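
The `extra_input_info` argument in this notebook holds one dictionary per model input, in the order the tokenizer returns them: an empty dictionary for `input_ids` and a 0/1 range for the mask-like inputs. As a minimal sketch, that list could be derived from the input names instead of being written by hand. The helper `build_extra_input_info` and the set `binary_inputs` are hypothetical names introduced here, and treating `token_type_ids` as a 0/1 input is an assumption that matches the BERT call later in this notebook.

```python
def build_extra_input_info(input_names):
    """Return one info dict per model input, in the given order.

    Hypothetical helper, not part of nebullvm.
    """
    # Inputs assumed to hold only the values 0 or 1
    binary_inputs = {"attention_mask", "token_type_ids"}
    return [
        {"max_value": 1, "min_value": 0} if name in binary_inputs else {}
        for name in input_names
    ]


# Input names in the order returned by the GPT2 and BERT tokenizers
gpt2_info = build_extra_input_info(["input_ids", "attention_mask"])
bert_info = build_extra_input_info(["input_ids", "token_type_ids", "attention_mask"])
print(gpt2_info)
print(bert_info)
```

In the notebook, the names could be taken from the tokenizer output itself, for example `build_extra_input_info(list(encoded_input.keys()))`.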

Let's run the prediction 100 times to calculate the average response time of the optimized model.

In [None]:
times = []
for _ in range(100):
    st = time.time()
    with torch.no_grad():
        final_out = optimized_model(**encoded_input)
    times.append(time.time()-st)
optimized_short_token_time = sum(times)/len(times)*1000
print(f"Average response time for GPT2 ({encoded_input['input_ids'].shape[1]} tokens): {optimized_short_token_time} ms")

In [None]:
times = []
for _ in range(100):
    st = time.time()
    with torch.no_grad():
        final_new_out = optimized_model(**long_encoded_input)
    times.append(time.time()-st)
optimized_long_token_time = sum(times)/len(times)*1000
print(f"Average response time for GPT2 ({long_encoded_input['input_ids'].shape[1]} tokens): {optimized_long_token_time} ms")

## GPT2 - Print the optimization results

In [None]:
# Enter your username here
your_username = "username"

In [None]:
# Uncomment the following line to install gputil if you are running on an NVIDIA GPU.
#!pip install gputil

In [None]:
import cpuinfo
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
cpu_info = cpuinfo.get_cpu_info()['brand_raw']
gpu_info = "no"
if torch.cuda.is_available():
    import GPUtil
    gpus = GPUtil.getGPUs()
    gpu_info = list(gpus)[0].name

In [None]:
message = f"""
Hello, I'm {your_username}!
I've tested nebullvm on the following setup:

Hardware: {cpu_info} CPU and {gpu_info} GPU.
Model: GPT2 - HuggingFace
Tokens: {encoded_input['input_ids'].shape[1]}
- Vanilla performance: {round(vanilla_short_token_time, 2)}ms
- Optimized performance: {round(optimized_short_token_time, 2)}ms
- Speedup: {round(vanilla_short_token_time/optimized_short_token_time, 1)}x
Tokens: {long_encoded_input['input_ids'].shape[1]}
- Vanilla performance: {round(vanilla_long_token_time, 2)}ms
- Optimized performance: {round(optimized_long_token_time, 2)}ms
- Speedup: {round(vanilla_long_token_time/optimized_long_token_time, 1)}x
"""


print(message)
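
The report has the same shape for every input length and every model, so each "Tokens" block could be produced by a small function. This is only a sketch: `format_result` is a hypothetical name, and the timings passed to it below are made-up values used to show the output format, not measurements.

```python
def format_result(n_tokens, vanilla_ms, optimized_ms):
    """Format one block of the report for a given input length."""
    # Speedup is the ratio of the unoptimized to the optimized latency
    speedup = vanilla_ms / optimized_ms
    return (
        f"Tokens: {n_tokens}\n"
        f"- Vanilla performance: {round(vanilla_ms, 2)}ms\n"
        f"- Optimized performance: {round(optimized_ms, 2)}ms\n"
        f"- Speedup: {round(speedup, 1)}x"
    )


# Illustrative numbers only, not real benchmark results
example = format_result(n_tokens=6, vanilla_ms=30.0, optimized_ms=12.0)
print(example)
```

With the variables from this notebook, the first block would come from `format_result(encoded_input['input_ids'].shape[1], vanilla_short_token_time, optimized_short_token_time)`.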

## BERT - Speed up inference with nebullvm

Let's see how nebullvm performs on another model by optimizing the popular BERT.

In [None]:
from transformers import BertTokenizer, BertModel

In [None]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

In [None]:
text = "Short text you wish to process"
inputs = tokenizer(text, return_tensors="pt")

In [None]:
times = []
for _ in range(100):
    st = time.time()
    with torch.no_grad():
        outputs = model(**inputs)
    times.append(time.time()-st)
vanilla_bert_short = sum(times)/len(times)*1000
print(f"Average response time for BERT: ({inputs['input_ids'].shape[1]} tokens): {vanilla_bert_short} ms")

In [None]:
long_text = ". ".join(["Hello, my dog is cute"]*100)
new_inputs = tokenizer(long_text, return_tensors='pt', padding=True, truncation=True)

In [None]:
times = []
for _ in range(100):
    st = time.time()
    with torch.no_grad():
        new_outputs = model(**new_inputs)
    times.append(time.time()-st)
vanilla_bert_long = sum(times)/len(times)*1000
print(f"Average response time for BERT: ({new_inputs['input_ids'].shape[1]} tokens): {vanilla_bert_long} ms")

In [None]:
optimized_model = optimize_huggingface_model(
    model=model,
    tokenizer=tokenizer,
    input_texts=[text],
    batch_size=1,
    max_input_sizes=[tuple(value.size()[1:]) for value in new_inputs.values()],
    save_dir=".",
    extra_input_info=[{}, {"max_value": 1, "min_value": 0}, {"max_value": 1, "min_value": 0}],
    use_torch_api=False,
    tokenizer_args={"truncation": True},
    perf_loss_ths=3,
)

Let's now calculate the time required to run a prediction as an average over 100 tests.

In [None]:
times = []
for _ in range(100):
    st = time.time()
    with torch.no_grad():
        outputs = optimized_model(**inputs)
    times.append(time.time()-st)
optimized_bert_short = sum(times)/len(times)*1000
print(f"Average response time for BERT: ({inputs['input_ids'].shape[1]} tokens): {optimized_bert_short} ms")

In [None]:
times = []
for _ in range(100):
    st = time.time()
    with torch.no_grad():
        outputs = optimized_model(**new_inputs)
    times.append(time.time()-st)
optimized_bert_long = sum(times)/len(times)*1000
print(f"Average response time for BERT: ({new_inputs['input_ids'].shape[1]} tokens): {optimized_bert_long} ms")

## BERT - Print the optimization results

In [None]:
message = f"""
Hello, I'm {your_username}!
I've tested nebullvm on the following setup:

Hardware: {cpu_info} CPU and {gpu_info} GPU.
Model: BERT - HuggingFace
Tokens: {inputs['input_ids'].shape[1]}
- Vanilla performance: {round(vanilla_bert_short, 2)}ms
- Optimized performance: {round(optimized_bert_short, 2)}ms
- Speedup: {round(vanilla_bert_short/optimized_bert_short, 1)}x
Tokens: {new_inputs['input_ids'].shape[1]}
- Vanilla performance: {round(vanilla_bert_long, 2)}ms
- Optimized performance: {round(optimized_bert_long, 2)}ms
- Speedup: {round(vanilla_bert_long/optimized_bert_long, 1)}x
"""

print(message)

Great! Was it easy? How are the results? Do you have any comments?
Share your optimization results and thoughts with <a href="https://discord.gg/RbeQMu886J" target="_blank"> our community on Discord</a>, where we chat about nebullvm and AI acceleration.

Note that the speedup achieved by nebullvm depends heavily on the hardware configuration and on your AI model. Given the same input model, nebullvm can accelerate it by 10 times on some machines and perform poorly on others.

To learn more about how nebullvm works, or to find other tutorials and performance benchmarks, check out the links below or write to us on Discord.

<center> 
    <a href="https://github.com/nebuly-ai/nebullvm#how-nebullvm-works" target="_blank" style="text-decoration: none;"> How Nebullvm Works </a> •
    <a href="https://github.com/nebuly-ai/nebullvm#tutorials" target="_blank" style="text-decoration: none;"> Tutorials </a> •
    <a href="https://github.com/nebuly-ai/nebullvm#benchmarks" target="_blank" style="text-decoration: none;"> Benchmarks </a> •
    <a href="https://github.com/nebuly-ai/nebullvm#installation" target="_blank" style="text-decoration: none;"> Installation </a> •
    <a href="https://github.com/nebuly-ai/nebullvm#get-started" target="_blank" style="text-decoration: none;"> Get Started </a> •
    <a href="https://github.com/nebuly-ai/nebullvm#optimization-examples" target="_blank" style="text-decoration: none;"> Optimization Examples </a>
</center>
<center> 
    <a href="https://discord.com/invite/RbeQMu886J" target="_blank" style="text-decoration: none;"> Discord </a> |
    <a href="https://nebuly.ai/" target="_blank" style="text-decoration: none;"> Website </a> |
    <a href="https://www.linkedin.com/company/72460022/" target="_blank" style="text-decoration: none;"> LinkedIn </a> |
    <a href="https://twitter.com/nebuly_ai" target="_blank" style="text-decoration: none;"> Twitter </a>
</center>