<img src="http://developer.download.nvidia.com/notebooks/dlsw-notebooks/tensorrt_torchtrt_hf_bert/nvidia_logo.png" width="90px">


# Torch-TensorRT-optimized BERT for Sentence Classificatio


####  Requirements

NVIDIA's NGC provides a PyTorch Docker Container which contains PyTorch and Torch-TensorRT. Starting with version `22.05-py3`, we can make use of [latest pytorch](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) container to run this notebook.


`sudo docker run --gpus all -it -p 8001:8888 --rm nvcr.io/nvidia/pytorch:24.03-py3`


Otherwise, you can follow the steps in `notebooks/README` to prepare a Docker container yourself, within which you can run this demo notebook.

In [4]:
#!pip install datasets

In [1]:
from transformers import BertTokenizer, BertForMaskedLM
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import timeit
import numpy as np
#import torch_tensorrt
#import torch.backends.cudnn as cudnn

In [5]:
from datasets import load_dataset

dataset = load_dataset("carblacac/twitter-sentiment-analysis")

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


Downloading builder script:   0%|          | 0.00/4.38k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/2.06k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/5.44k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/5.38M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/2.23M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/149985 [00:00<?, ? examples/s]

Map:   0%|          | 0/61998 [00:00<?, ? examples/s]

Creating json from Arrow format:   0%|          | 0/120 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/30 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/62 [00:00<?, ?ba/s]

Generating train split:   0%|          | 0/119988 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/29997 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/61998 [00:00<?, ? examples/s]

In [8]:
dataset.column_names

{'train': ['text', 'feeling'],
 'validation': ['text', 'feeling'],
 'test': ['text', 'feeling']}

In [11]:
from torch.utils.data import DataLoader
import torch

dataset.set_format(type="torch", columns=["text", "feeling"])
#dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)

In [30]:
dataset['train'][4]

{'text': "@kathystover Didn't go much of any where - Life took over for a while",
 'feeling': tensor(1)}

In [31]:
df = dataset['train'].to_pandas()
df.shape

(119988, 2)

In [32]:
df

Unnamed: 0,text,feeling
0,@fa6ami86 so happy that salman won. btw the 1...,0
1,@phantompoptart .......oops.... I guess I'm ki...,0
2,@bradleyjp decidedly undecided. Depends on the...,1
3,@Mountgrace lol i know! its so frustrating isn...,1
4,@kathystover Didn't go much of any where - Lif...,1
...,...,...
119983,I so should be in bed but I can't sleep,0
119984,@mickeymab mine's in my profile - '77cb550 and...,1
119985,@stacyreeves Awe... I wish I could. I am here...,0
119986,Is it me or is Vodafone UK business support ru...,0


## BERT for Sentence Classification

```
Example output:
[[
{'label': 'sadness', 'score': 0.0005138228880241513}, 
{'label': 'joy', 'score': 0.9972520470619202}, 
{'label': 'love', 'score': 0.0007443308713845909}, 
{'label': 'anger', 'score': 0.0007404946954920888}, 
{'label': 'fear', 'score': 0.00032938539516180754}, 
{'label': 'surprise', 'score': 0.0004197491507511586}
]]
```


In [117]:
labels = ('sadness', 'joy', 'love', 'anger', 'fear', 'surprise')

In [35]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bhadresh-savani/bert-base-uncased-emotion")
model = AutoModelForSequenceClassification.from_pretrained("bhadresh-savani/bert-base-uncased-emotion", torchscript=True)

model

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [36]:
print(f"Model memory footprint: {model.get_memory_footprint()/1e9:.2f}G")

Model memory footprint: 0.44G


### Model Tracing

Trace a function and return an executable or ScriptFunction that will be optimized using just-in-time compilation.\
Tracing is ideal for code that operates only on Tensor\s and lists, dictionaries, and tuples of Tensor\s.

Using torch.jit.trace and torch.jit.trace_module, you can turn an existing module or Python function into a TorchScript ScriptFunction or ScriptModule. You must provide example inputs, and we run the function, recording the operations performed on all the tensors.

The resulting recording of a standalone function produces ScriptFunction.\
The resulting recording of nn.Module.forward or nn.Module produces ScriptModule.\
This module also contains any parameters that the original module had as well.

In [62]:
# Define PAD Token = EOS Token = 50256
tokenizer.pad_token = tokenizer.eos_token = chr(50256)
model.config.pad_token_id = model.config.eos_token_id

# pad on the left so we can append new tokens on the right
tokenizer.padding_side = "left"
tokenizer.truncation_side = "left"

In [64]:
batch = df.sample(10).text.values.tolist()

batch

['@MirandaBuzz Haha! Violin Hero. Genius parody!!! Rock the violin out! Anyway, awesome show especially the cast and Dan. Good day!!!',
 '...but my hair smells like wood smoke',
 "Hmmm...I guess TwitterFon doesn't like emoji very much. It just comes out as numbers and puctuations. Darn it!",
 "@NeenDhie Hey hun aww bless u. hope u have a great day (: Oh i was watchin the eurovision last night n Denmark's song got so low points",
 'i sooooo wish i could be at the Lakers parade tomorrow!  will someone please give me a ride? http://twurl.nl/3upnus',
 "@boxofcrayons Let's say the 3rd type of person is he who wants to count, but needs some guidance! That's where we come in. Happy Monday!",
 'a long day for me as well tomorrow but a happy one  goodnight.',
 'Want much to go to the Library Mall to show support, but am far too sick.  With you in spirit, though!! Rally starts @ 3pm...',
 '@MariahCarey hi! is it really you?',
 'On my way to work, are we havin a summer this year? Is june and is c

In [75]:
example_inputs = tokenizer(batch, padding='max_length', max_length=512, return_tensors="pt")
example_inputs['input_ids'].size()

torch.Size([10, 512])

In [76]:
example_inputs.keys()


dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

In [77]:
batch_size = 10

tokens_tensor = example_inputs['input_ids']
token_type_tensor = example_inputs['token_type_ids']
attention_masks_tensor = example_inputs['attention_mask']

tokens_tensor.size(), token_type_tensor.size(), attention_masks_tensor.size()

(torch.Size([10, 512]), torch.Size([10, 512]), torch.Size([10, 512]))

In [78]:
traced_model = torch.jit.trace(model, [tokens_tensor, token_type_tensor, attention_masks_tensor])

In [136]:
traced_model.save('models/bert-base-uncased-emotion_traced.pt')

In [79]:
type(traced_model)

torch.jit._trace.TopLevelTracedModule

In [134]:
import torch.nn.functional as nnf

encoded_inputs = tokenizer(batch, return_tensors='pt', padding='max_length', max_length=512)
with torch.no_grad():
    outputs = model(**encoded_inputs)
    probs = nnf.softmax(outputs[0], dim=1)
    for i, sentence in enumerate(batch):
        print(f"{sentence}")
        for j, prob in enumerate(probs[i].tolist()):
            print(f"{labels[j]}:{prob:.2f}", end = '\t')
        print()
    print()
    
            

@MirandaBuzz Haha! Violin Hero. Genius parody!!! Rock the violin out! Anyway, awesome show especially the cast and Dan. Good day!!!
sadness:0.00	joy:1.00	love:0.00	anger:0.00	fear:0.00	surprise:0.00	
...but my hair smells like wood smoke
sadness:0.05	joy:0.01	love:0.00	anger:0.45	fear:0.49	surprise:0.00	
Hmmm...I guess TwitterFon doesn't like emoji very much. It just comes out as numbers and puctuations. Darn it!
sadness:0.01	joy:0.22	love:0.04	anger:0.72	fear:0.01	surprise:0.01	
@NeenDhie Hey hun aww bless u. hope u have a great day (: Oh i was watchin the eurovision last night n Denmark's song got so low points
sadness:0.04	joy:0.94	love:0.01	anger:0.00	fear:0.00	surprise:0.00	
i sooooo wish i could be at the Lakers parade tomorrow!  will someone please give me a ride? http://twurl.nl/3upnus
sadness:0.04	joy:0.80	love:0.08	anger:0.02	fear:0.05	surprise:0.01	
@boxofcrayons Let's say the 3rd type of person is he who wants to count, but needs some guidance! That's where we come in. Happ

In [135]:
# Traced model
with torch.no_grad():
    outputs = traced_model(**encoded_inputs)
    probs = nnf.softmax(outputs[0], dim=1)
    for i, sentence in enumerate(batch):
        print(f"{sentence}")
        for j, prob in enumerate(probs[i].tolist()):
            print(f"{labels[j]}:{prob:.2f}", end = '\t')
        print()
    print()

@MirandaBuzz Haha! Violin Hero. Genius parody!!! Rock the violin out! Anyway, awesome show especially the cast and Dan. Good day!!!
sadness:0.00	joy:1.00	love:0.00	anger:0.00	fear:0.00	surprise:0.00	
...but my hair smells like wood smoke
sadness:0.05	joy:0.01	love:0.00	anger:0.45	fear:0.49	surprise:0.00	
Hmmm...I guess TwitterFon doesn't like emoji very much. It just comes out as numbers and puctuations. Darn it!
sadness:0.01	joy:0.22	love:0.04	anger:0.72	fear:0.01	surprise:0.01	
@NeenDhie Hey hun aww bless u. hope u have a great day (: Oh i was watchin the eurovision last night n Denmark's song got so low points
sadness:0.04	joy:0.94	love:0.01	anger:0.00	fear:0.00	surprise:0.00	
i sooooo wish i could be at the Lakers parade tomorrow!  will someone please give me a ride? http://twurl.nl/3upnus
sadness:0.04	joy:0.80	love:0.08	anger:0.02	fear:0.05	surprise:0.01	
@boxofcrayons Let's say the 3rd type of person is he who wants to count, but needs some guidance! That's where we come in. Happ

### Compiling with Torch-TensorRT

In [None]:
new_level = torch_tensorrt.logging.Level.Error
torch_tensorrt.logging.set_reportable_log_level(new_level)

In [None]:
traced_model.to('cuda')

In [None]:
trt_model = torch_tensorrt.compile(traced_model, 
    inputs= [torch_tensorrt.Input(shape=[batch_size, 512], dtype=torch.int32, device='cuda'),  # input_ids
             torch_tensorrt.Input(shape=[batch_size, 512], dtype=torch.int32, device='cuda'),  # token_type_ids
             torch_tensorrt.Input(shape=[batch_size, 512], dtype=torch.int32, device='cuda')], # attention_mask
    enabled_precisions= {torch.float32}, # Run with 32-bit precision
    workspace_size=2000000000,
    truncate_long_and_double=True
)

In [None]:
# Test
enc_inputs = tokenizer(batch, return_tensors='pt', padding='max_length', max_length=512)
enc_inputs = {k: v.type(torch.int32).cuda() for k, v in enc_inputs.items()}
output_trt = trt_model(enc_inputs['input_ids'], enc_inputs['token_type_ids'], enc_inputs['attention_mask'])
#print(output_trt[0])

most_likely_token_ids_trt = [torch.argmax(output_trt[0][i, pos, :]) for i, pos in enumerate(pos_masks)] 
unmasked_tokens_trt = enc.decode(most_likely_token_ids_trt).split(' ')
unmasked_sentences_trt = [masked_sentences[i].replace('[MASK]', token) for i, token in enumerate(unmasked_tokens_trt)]
for sentence in unmasked_sentences_trt:
    print(sentence)

In [None]:
# Compile again with 16 bit precision

trt_model_fp16 = torch_tensorrt.compile(traced_model, 
    inputs= [torch_tensorrt.Input(shape=[batch_size, 512], dtype=torch.int32),  # input_ids
             torch_tensorrt.Input(shape=[batch_size, 512], dtype=torch.int32),  # token_type_ids
             torch_tensorrt.Input(shape=[batch_size, 512], dtype=torch.int32)], # attention_mask
    enabled_precisions= {torch.half}, # Run with 16-bit precision
    workspace_size=2000000000,
    truncate_long_and_double=True
)

## Bert-base-uncased

First, create a pretrained BERT tokenizer from the `bert-base-uncased` model

In [59]:
enc = BertTokenizer.from_pretrained('bert-base-uncased')

mlm_model_ts = BertForMaskedLM.from_pretrained('bert-base-uncased', torchscript=True)

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForMaskedLM were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Create dummy inputs to generate a traced TorchScript model later

In [70]:
mlm_model_ts

BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_a

In [60]:
print(f"Model memory footprint: {mlm_model_ts.get_memory_footprint()/1e9:.2f}G")

Model memory footprint: 0.53G


In [71]:
batch_size = 4

batched_indexed_tokens = [[101, 64]*64]*batch_size
batched_segment_ids = [[0, 1]*64]*batch_size
batched_attention_masks = [[1, 1]*64]*batch_size

tokens_tensor = torch.tensor(batched_indexed_tokens)
segments_tensor = torch.tensor(batched_segment_ids)
attention_masks_tensor = torch.tensor(batched_attention_masks)

In [72]:
tokens_tensor.size()

torch.Size([4, 128])

In [73]:
attention_masks_tensor.size()

torch.Size([4, 128])

Obtain a BERT masked language model from Hugging Face in the (scripted) TorchScript, then use the dummy inputs to trace it

In [75]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bhadresh-savani/bert-base-uncased-emotion")
model_ts = AutoModelForSequenceClassification.from_pretrained("bhadresh-savani/bert-base-uncased-emotion", torchscript=True)


tokenizer_config.json:   0%|          | 0.00/285 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/935 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

In [76]:
print(f"Model memory footprint: {model_ts.get_memory_footprint()/1e9:.2f}G")

Model memory footprint: 0.44G


In [None]:
# Example tensors


In [None]:
traced_model_ts = torch.jit.trace(model_ts, [tokens_tensor, segments_tensor, attention_masks_tensor])

In [74]:
mlm_model_ts = BertForMaskedLM.from_pretrained('bert-base-uncased', torchscript=True)
traced_mlm_model = torch.jit.trace(mlm_model_ts, [tokens_tensor, segments_tensor, attention_masks_tensor])

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForMaskedLM were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Define 4 masked sentences, with 1 word in each sentence hidden from the model. Fluent English speakers will probably be able to guess the masked words, but just in case, they are `'capital'`, `'language'`, `'innings'`, and `'mathematics'`.

Also create a list containing the position of the masked word within each sentence. Given Python's 0-based indexing convention, the numbers are each higher by 1 than might be expected. This is because the token at index 0 in each sentence is a beginning-of-sentence token, denoted `[CLS]` when entered explicitly. 

In [6]:
masked_sentences = ['Paris is the [MASK] of France.', 
                    'The primary [MASK] of the United States is English.', 
                    'A baseball game consists of at least nine [MASK].', 
                    'Topology is a branch of [MASK] concerned with the properties of geometric objects that remain unchanged under continuous transformations.']
pos_masks = [4, 3, 9, 6]

Pass the masked sentences into the (scripted) TorchScript MLM model and verify that the unmasked sentences yield the expected results.  

Because the sentences are of different lengths, we must specify the `padding` argument in calling our encoder/tokenizer. There are several possible padding strategies, but we'll use `'max_length'` padding with `max_length=128`. Later, when we compile an optimized version of the model with Torch-TensorRT, the optimized model will expect inputs of length 128, hence our choice of padding strategy and length here. 

In [7]:
encoded_inputs = enc(masked_sentences, return_tensors='pt', padding='max_length', max_length=128)
outputs = mlm_model_ts(**encoded_inputs)
most_likely_token_ids = [torch.argmax(outputs[0][i, pos, :]) for i, pos in enumerate(pos_masks)]
unmasked_tokens = enc.decode(most_likely_token_ids).split(' ')
unmasked_sentences = [masked_sentences[i].replace('[MASK]', token) for i, token in enumerate(unmasked_tokens)]
for sentence in unmasked_sentences:
    print(sentence)

Paris is the capital of France.
The primary language of the United States is English.
A baseball game consists of at least nine innings.
Topology is a branch of mathematics concerned with the properties of geometric objects that remain unchanged under continuous transformations.


Pass the masked sentences into the traced MLM model and verify that the unmasked sentences yield the expected results. 

Note the difference in how the `encoded_inputs` are passed into the model in the following cell compared to the previous one. If you examine `encoded_inputs`, you'll find that it's a dictionary with 3 keys, `'input_ids'`, `'token_type_ids'`, and `'attention_mask'`, each with a PyTorch tensor as an associated value. The traced model will accept `**encoded_inputs` as an input, but the Torch-TensorRT-optimized model (to be defined later) will not. 

In [8]:
encoded_inputs = enc(masked_sentences, return_tensors='pt', padding='max_length', max_length=128)
outputs = traced_mlm_model(encoded_inputs['input_ids'], encoded_inputs['token_type_ids'], encoded_inputs['attention_mask'])
most_likely_token_ids = [torch.argmax(outputs[0][i, pos, :]) for i, pos in enumerate(pos_masks)]
unmasked_tokens = enc.decode(most_likely_token_ids).split(' ')
unmasked_sentences = [masked_sentences[i].replace('[MASK]', token) for i, token in enumerate(unmasked_tokens)]
for sentence in unmasked_sentences:
    print(sentence)

Paris is the all of France.
The primary all of the United States is English.
A baseball game consists of at least nine each.
Topology is a branch of each concerned with the properties of geometric objects that remain unchanged under continuous transformations.


<a id="4"></a>
## 4. Compiling with Torch-TensorRT

Change the logging level to avoid long printouts

In [9]:
new_level = torch_tensorrt.logging.Level.Error
torch_tensorrt.logging.set_reportable_log_level(new_level)

Compile the model

In [13]:
traced_mlm_model.to('cuda')

BertForMaskedLM(
  original_name=BertForMaskedLM
  (bert): BertModel(
    original_name=BertModel
    (embeddings): BertEmbeddings(
      original_name=BertEmbeddings
      (word_embeddings): Embedding(original_name=Embedding)
      (position_embeddings): Embedding(original_name=Embedding)
      (token_type_embeddings): Embedding(original_name=Embedding)
      (LayerNorm): LayerNorm(original_name=LayerNorm)
      (dropout): Dropout(original_name=Dropout)
    )
    (encoder): BertEncoder(
      original_name=BertEncoder
      (layer): ModuleList(
        original_name=ModuleList
        (0): BertLayer(
          original_name=BertLayer
          (attention): BertAttention(
            original_name=BertAttention
            (self): BertSelfAttention(
              original_name=BertSelfAttention
              (query): Linear(original_name=Linear)
              (key): Linear(original_name=Linear)
              (value): Linear(original_name=Linear)
              (dropout): Dropout(origina

In [16]:
trt_model = torch_tensorrt.compile(traced_mlm_model, 
    inputs= [torch_tensorrt.Input(shape=[batch_size, 128], dtype=torch.int32, device='cuda'),  # input_ids
             torch_tensorrt.Input(shape=[batch_size, 128], dtype=torch.int32, device='cuda'),  # token_type_ids
             torch_tensorrt.Input(shape=[batch_size, 128], dtype=torch.int32, device='cuda')], # attention_mask
    enabled_precisions= {torch.float32}, # Run with 32-bit precision
    workspace_size=2000000000,
    truncate_long_and_double=True
)



Pass the masked sentences into the compiled model and verify that the unmasked sentences yield the expected results.

In [23]:
enc_inputs = enc(masked_sentences, return_tensors='pt', padding='max_length', max_length=128)
enc_inputs = {k: v.type(torch.int32).cuda() for k, v in enc_inputs.items()}
output_trt = trt_model(enc_inputs['input_ids'], enc_inputs['token_type_ids'], enc_inputs['attention_mask'])
#print(output_trt[0])

most_likely_token_ids_trt = [torch.argmax(output_trt[0][i, pos, :]) for i, pos in enumerate(pos_masks)] 
unmasked_tokens_trt = enc.decode(most_likely_token_ids_trt).split(' ')
unmasked_sentences_trt = [masked_sentences[i].replace('[MASK]', token) for i, token in enumerate(unmasked_tokens_trt)]
for sentence in unmasked_sentences_trt:
    print(sentence)

Paris is the all of France.
The primary all of the United States is English.
A baseball game consists of at least nine each.
Topology is a branch of each concerned with the properties of geometric objects that remain unchanged under continuous transformations.


Compile the model again, this time with 16-bit precision

In [24]:
trt_model_fp16 = torch_tensorrt.compile(traced_mlm_model, 
    inputs= [torch_tensorrt.Input(shape=[batch_size, 128], dtype=torch.int32),  # input_ids
             torch_tensorrt.Input(shape=[batch_size, 128], dtype=torch.int32),  # token_type_ids
             torch_tensorrt.Input(shape=[batch_size, 128], dtype=torch.int32)], # attention_mask
    enabled_precisions= {torch.half}, # Run with 16-bit precision
    workspace_size=2000000000,
    truncate_long_and_double=True
)



<a id="5"></a>
## 5. Benchmarking

In developing this notebook, we conducted our benchmarking on a single NVIDIA A100 GPU. Your results may differ from those shown, particularly on a different GPU.

This function passes the inputs into the model and runs inference `num_loops` times, then returns a list of length containing the amount of time in seconds that each instance of inference took.

In [25]:
def timeGraph(model, input_tensor1, input_tensor2, input_tensor3, num_loops=50):
    print("Warm up ...")
    with torch.no_grad():
        for _ in range(20):
            features = model(input_tensor1, input_tensor2, input_tensor3)

    torch.cuda.synchronize()

    print("Start timing ...")
    timings = []
    with torch.no_grad():
        for i in range(num_loops):
            start_time = timeit.default_timer()
            features = model(input_tensor1, input_tensor2, input_tensor3)
            torch.cuda.synchronize()
            end_time = timeit.default_timer()
            timings.append(end_time - start_time)
            # print("Iteration {}: {:.6f} s".format(i, end_time - start_time))

    return timings

This function prints the number of input batches the model is able to process each second and summary statistics of the model's latency.

In [26]:
def printStats(graphName, timings, batch_size):
    times = np.array(timings)
    steps = len(times)
    speeds = batch_size / times
    time_mean = np.mean(times)
    time_med = np.median(times)
    time_99th = np.percentile(times, 99)
    time_std = np.std(times, ddof=0)
    speed_mean = np.mean(speeds)
    speed_med = np.median(speeds)

    msg = ("\n%s =================================\n"
            "batch size=%d, num iterations=%d\n"
            "  Median text batches/second: %.1f, mean: %.1f\n"
            "  Median latency: %.6f, mean: %.6f, 99th_p: %.6f, std_dev: %.6f\n"
            ) % (graphName,
                batch_size, steps,
                speed_med, speed_mean,
                time_med, time_mean, time_99th, time_std)
    print(msg)

In [27]:
cudnn.benchmark = True

Benchmark the (scripted) TorchScript model on GPU

In [28]:
timings = timeGraph(model.cuda(), enc_inputs['input_ids'], enc_inputs['token_type_ids'], enc_inputs['attention_mask'])

printStats("BERT", timings, batch_size)

Warm up ...
Start timing ...

batch size=4, num iterations=50
  Median text batches/second: 448.6, mean: 446.3
  Median latency: 0.008916, mean: 0.008965, 99th_p: 0.009542, std_dev: 0.000154



Benchmark the traced model on GPU

In [29]:
timings = timeGraph(traced_model.cuda(), enc_inputs['input_ids'], enc_inputs['token_type_ids'], enc_inputs['attention_mask'])

printStats("BERT", timings, batch_size)

Warm up ...
Start timing ...

batch size=4, num iterations=50
  Median text batches/second: 621.5, mean: 610.9
  Median latency: 0.006436, mean: 0.006558, 99th_p: 0.007502, std_dev: 0.000283



Benchmark the compiled FP32 model on GPU

In [30]:
timings = timeGraph(trt_model, enc_inputs['input_ids'], enc_inputs['token_type_ids'], enc_inputs['attention_mask'])

printStats("BERT", timings, batch_size)

Warm up ...
Start timing ...

batch size=4, num iterations=50
  Median text batches/second: 754.3, mean: 754.3
  Median latency: 0.005303, mean: 0.005303, 99th_p: 0.005326, std_dev: 0.000008



Benchmark the compiled FP16 model on GPU

In [31]:
timings = timeGraph(trt_model_fp16, enc_inputs['input_ids'], enc_inputs['token_type_ids'], enc_inputs['attention_mask'])

printStats("BERT", timings, batch_size)

Warm up ...
Start timing ...

batch size=4, num iterations=50
  Median text batches/second: 1688.7, mean: 1692.5
  Median latency: 0.002369, mean: 0.002363, 99th_p: 0.002384, std_dev: 0.000013



<a id="6"></a>
## 6. Conclusion

In this notebook, we have walked through the complete process of compiling TorchScript models with Torch-TensorRT for Masked Language Modeling with Hugging Face's `bert-base-uncased` transformer and testing the performance impact of the optimization. With Torch-TensorRT on an NVIDIA A100 GPU, we observe the speedups indicated below. These acceleration numbers will vary from GPU to GPU (as well as implementation to implementation based on the ops used) and we encorage you to try out latest generation of Data center compute cards for maximum acceleration.

Scripted (GPU): 1.0x
Traced (GPU): 1.62x
Torch-TensorRT (FP32): 2.14x
Torch-TensorRT (FP16): 3.15x

### What's next
Now it's time to try Torch-TensorRT on your own model. If you run into any issues, you can fill them at https://github.com/pytorch/TensorRT. Your involvement will help future development of Torch-TensorRT.

# 