


# Torch-TensorRT-optimized BERT for Sentence Classification


####  Requirements

NVIDIA's NGC provides a PyTorch Docker Container which contains PyTorch and Torch-TensorRT. Starting with version `22.05-py3`, we can make use of [latest pytorch](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) container to run this notebook.


`sudo docker run --gpus all -it -p 8001:8888 --rm nvcr.io/nvidia/pytorch:24.03-py3`


Otherwise, you can follow the steps in `notebooks/README` to prepare a Docker container yourself, within which you can run this demo notebook.

In [3]:
#!pip install datasets
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting transformers
  Downloading transformers-4.40.1-py3-none-any.whl.metadata (137 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m138.0/138.0 kB[0m [31m4.6 MB/s[0m eta [36m0:00:00[0m
Collecting tokenizers<0.20,>=0.19 (from transformers)
  Downloading tokenizers-0.19.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Collecting safetensors>=0.4.1 (from transformers)
  Downloading safetensors-0.4.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.8 kB)
Downloading transformers-4.40.1-py3-none-any.whl (9.0 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m9.0/9.0 MB[0m [31m109.4 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25hDownloading safetensors-0.4.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.2/1.2 MB[0m [31m263.2 MB/s

In [22]:
from transformers import BertTokenizer, BertForMaskedLM
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import timeit
import numpy as np
import torch_tensorrt
import torch.backends.cudnn as cudnn
import torch.nn.functional as nnf

In [5]:
from datasets import load_dataset

dataset = load_dataset("carblacac/twitter-sentiment-analysis")

INFO:datasets:PyTorch version 2.3.0a0+40ec155e58.nv24.3 available.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


Downloading builder script:   0%|          | 0.00/4.38k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/2.06k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/5.44k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/5.38M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/2.23M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/149985 [00:00<?, ? examples/s]

Map:   0%|          | 0/61998 [00:00<?, ? examples/s]

Creating json from Arrow format:   0%|          | 0/120 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/30 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/62 [00:00<?, ?ba/s]

Generating train split:   0%|          | 0/119988 [00:00<?, ? examples/s]

INFO:datasets_modules.datasets.carblacac--twitter-sentiment-analysis.cd65e23e456de6a4f7264e305380b0ffe804d6f5bfd361c0ec0f68d8d1fab95b.twitter-sentiment-analysis:generating examples from = /root/.cache/huggingface/datasets/downloads/twitter-sentiment-analysis-train.jsonl


Generating validation split:   0%|          | 0/29997 [00:00<?, ? examples/s]

INFO:datasets_modules.datasets.carblacac--twitter-sentiment-analysis.cd65e23e456de6a4f7264e305380b0ffe804d6f5bfd361c0ec0f68d8d1fab95b.twitter-sentiment-analysis:generating examples from = /root/.cache/huggingface/datasets/downloads/twitter-sentiment-analysis-validation.jsonl


Generating test split:   0%|          | 0/61998 [00:00<?, ? examples/s]

INFO:datasets_modules.datasets.carblacac--twitter-sentiment-analysis.cd65e23e456de6a4f7264e305380b0ffe804d6f5bfd361c0ec0f68d8d1fab95b.twitter-sentiment-analysis:generating examples from = /root/.cache/huggingface/datasets/downloads/twitter-sentiment-analysis-test.jsonl


In [6]:
dataset.column_names

{'train': ['text', 'feeling'],
 'validation': ['text', 'feeling'],
 'test': ['text', 'feeling']}

In [7]:
from torch.utils.data import DataLoader
import torch

dataset.set_format(type="torch", columns=["text", "feeling"])
#dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)

In [8]:
dataset['train'][4]

{'text': "@kathystover Didn't go much of any where - Life took over for a while",
 'feeling': tensor(1)}

In [9]:
df = dataset['train'].to_pandas()
df.shape

(119988, 2)

In [10]:
df

Unnamed: 0,text,feeling
0,@fa6ami86 so happy that salman won. btw the 1...,0
1,@phantompoptart .......oops.... I guess I'm ki...,0
2,@bradleyjp decidedly undecided. Depends on the...,1
3,@Mountgrace lol i know! its so frustrating isn...,1
4,@kathystover Didn't go much of any where - Lif...,1
...,...,...
119983,I so should be in bed but I can't sleep,0
119984,@mickeymab mine's in my profile - '77cb550 and...,1
119985,@stacyreeves Awe... I wish I could. I am here...,0
119986,Is it me or is Vodafone UK business support ru...,0


## BERT for Sentence Classification

```
Example output:
[[
{'label': 'sadness', 'score': 0.0005138228880241513}, 
{'label': 'joy', 'score': 0.9972520470619202}, 
{'label': 'love', 'score': 0.0007443308713845909}, 
{'label': 'anger', 'score': 0.0007404946954920888}, 
{'label': 'fear', 'score': 0.00032938539516180754}, 
{'label': 'surprise', 'score': 0.0004197491507511586}
]]
```


In [11]:
labels = ('sadness', 'joy', 'love', 'anger', 'fear', 'surprise')

In [12]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bhadresh-savani/bert-base-uncased-emotion")
model = AutoModelForSequenceClassification.from_pretrained("bhadresh-savani/bert-base-uncased-emotion", torchscript=True)

model

tokenizer_config.json:   0%|          | 0.00/285 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/935 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [13]:
print(f"Model memory footprint: {model.get_memory_footprint()/1e9:.2f}G")

Model memory footprint: 0.44G


### Model Tracing

Trace a function and return an executable or ScriptFunction that will be optimized using just-in-time compilation.\
Tracing is ideal for code that operates only on Tensor\s and lists, dictionaries, and tuples of Tensor\s.

Using torch.jit.trace and torch.jit.trace_module, you can turn an existing module or Python function into a TorchScript ScriptFunction or ScriptModule. You must provide example inputs, and we run the function, recording the operations performed on all the tensors.

The resulting recording of a standalone function produces ScriptFunction.\
The resulting recording of nn.Module.forward or nn.Module produces ScriptModule.\
This module also contains any parameters that the original module had as well.

In [14]:
# Define PAD Token = EOS Token = 50256
tokenizer.pad_token = tokenizer.eos_token = chr(50256)
model.config.pad_token_id = model.config.eos_token_id

# pad on the left so we can append new tokens on the right
tokenizer.padding_side = "left"
tokenizer.truncation_side = "left"

In [15]:
batch = df.sample(10).text.values.tolist()

batch

['@susan_adrian A happy Monday here--rainy, but going to lunch with the hubster will put a little Thai-flavored sunshine in my day.',
 'Is it to early to be waking up at 6 and ur sick  i hate it i can barley breath it suxs =(',
 'sun aint shining no more! tired and got work soon',
 "@_hikky I just noticed, I have the same Samsung monitor as you.  Actually I think alot of people have this one. Isn't it nice?",
 "@JayRWren Please don't ask me which one because I am too ashamed to tell you  Perhaps your text would have been better ...",
 'Zahlbar38 is cancelled today. Sorry for all the hundreds of people who where looking forward to this awesome party',
 "A very happy Mother's Day to all of you, with a hug on top and toddler kisses thrown in.",
 "@iancantdecide OMG.. talaga? PE? reflection paper?? what's ur PE ba?  i miss our PE days.. ) palagi wala tayong ginagawa.. )",
 '@Keicyx3 Polly is not on',
 'Day one finished - quite liked it will go back tommorrow']

In [16]:
example_inputs = tokenizer(batch, padding='max_length', max_length=512, return_tensors="pt")
example_inputs['input_ids'].size()

torch.Size([10, 512])

In [17]:
example_inputs.keys()


dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

In [18]:
batch_size = 10

tokens_tensor = example_inputs['input_ids']
token_type_tensor = example_inputs['token_type_ids']
attention_masks_tensor = example_inputs['attention_mask']

tokens_tensor.size(), token_type_tensor.size(), attention_masks_tensor.size()

(torch.Size([10, 512]), torch.Size([10, 512]), torch.Size([10, 512]))

In [19]:
traced_model = torch.jit.trace(model, [tokens_tensor, token_type_tensor, attention_masks_tensor])



In [136]:
traced_model.save('models/bert-base-uncased-emotion_traced.pt')

In [20]:
type(traced_model)

torch.jit._trace.TopLevelTracedModule

In [40]:
encoded_inputs = tokenizer(batch, return_tensors='pt', padding='max_length', max_length=512)

In [73]:
batch
n_tokens = encoded_inputs['input_ids'].size()
tokens_per_batch = n_tokens[0]*n_tokens[1]
tokens_per_batch

5120

In [44]:
def print_outputs(batch, outputs):
    probs = nnf.softmax(outputs[0], dim=1)
    for i, sentence in enumerate(batch):
        print(f"{sentence}")
        for j, prob in enumerate(probs[i].tolist()):
            print(f"{labels[j]}:{prob:.2f}", end = '\t')
        print()
    print()
    

In [45]:
%%time
with torch.no_grad():
    outputs = model(**encoded_inputs)
    print_outputs(batch, outputs)
    
            

@susan_adrian A happy Monday here--rainy, but going to lunch with the hubster will put a little Thai-flavored sunshine in my day.
sadness:0.00	joy:1.00	love:0.00	anger:0.00	fear:0.00	surprise:0.00	
Is it to early to be waking up at 6 and ur sick  i hate it i can barley breath it suxs =(
sadness:0.10	joy:0.00	love:0.00	anger:0.88	fear:0.01	surprise:0.00	
sun aint shining no more! tired and got work soon
sadness:0.47	joy:0.42	love:0.00	anger:0.10	fear:0.01	surprise:0.00	
@_hikky I just noticed, I have the same Samsung monitor as you.  Actually I think alot of people have this one. Isn't it nice?
sadness:0.00	joy:0.89	love:0.09	anger:0.01	fear:0.00	surprise:0.01	
@JayRWren Please don't ask me which one because I am too ashamed to tell you  Perhaps your text would have been better ...
sadness:1.00	joy:0.00	love:0.00	anger:0.00	fear:0.00	surprise:0.00	
Zahlbar38 is cancelled today. Sorry for all the hundreds of people who where looking forward to this awesome party
sadness:0.33	joy:0.64	lov

In [48]:
%%time
# Traced model
with torch.no_grad():
    outputs = traced_model(**encoded_inputs)
    print_outputs(batch, outputs)

@susan_adrian A happy Monday here--rainy, but going to lunch with the hubster will put a little Thai-flavored sunshine in my day.
sadness:0.00	joy:1.00	love:0.00	anger:0.00	fear:0.00	surprise:0.00	
Is it to early to be waking up at 6 and ur sick  i hate it i can barley breath it suxs =(
sadness:0.10	joy:0.00	love:0.00	anger:0.88	fear:0.01	surprise:0.00	
sun aint shining no more! tired and got work soon
sadness:0.47	joy:0.42	love:0.00	anger:0.10	fear:0.01	surprise:0.00	
@_hikky I just noticed, I have the same Samsung monitor as you.  Actually I think alot of people have this one. Isn't it nice?
sadness:0.00	joy:0.89	love:0.09	anger:0.01	fear:0.00	surprise:0.01	
@JayRWren Please don't ask me which one because I am too ashamed to tell you  Perhaps your text would have been better ...
sadness:1.00	joy:0.00	love:0.00	anger:0.00	fear:0.00	surprise:0.00	
Zahlbar38 is cancelled today. Sorry for all the hundreds of people who where looking forward to this awesome party
sadness:0.33	joy:0.64	lov

### Compiling with Torch-TensorRT

In [35]:
new_level = torch_tensorrt.logging.Level.Error
torch_tensorrt.logging.set_reportable_log_level(new_level)

In [49]:
traced_model.to('cuda')

BertForSequenceClassification(
  original_name=BertForSequenceClassification
  (bert): BertModel(
    original_name=BertModel
    (embeddings): BertEmbeddings(
      original_name=BertEmbeddings
      (word_embeddings): Embedding(original_name=Embedding)
      (position_embeddings): Embedding(original_name=Embedding)
      (token_type_embeddings): Embedding(original_name=Embedding)
      (LayerNorm): LayerNorm(original_name=LayerNorm)
      (dropout): Dropout(original_name=Dropout)
    )
    (encoder): BertEncoder(
      original_name=BertEncoder
      (layer): ModuleList(
        original_name=ModuleList
        (0): BertLayer(
          original_name=BertLayer
          (attention): BertAttention(
            original_name=BertAttention
            (self): BertSelfAttention(
              original_name=BertSelfAttention
              (query): Linear(original_name=Linear)
              (key): Linear(original_name=Linear)
              (value): Linear(original_name=Linear)
            

In [37]:
trt_model = torch_tensorrt.compile(traced_model, 
    inputs= [torch_tensorrt.Input(shape=[batch_size, 512], dtype=torch.int32, device='cuda'),  # input_ids
             torch_tensorrt.Input(shape=[batch_size, 512], dtype=torch.int32, device='cuda'),  # token_type_ids
             torch_tensorrt.Input(shape=[batch_size, 512], dtype=torch.int32, device='cuda')], # attention_mask
    enabled_precisions= {torch.float32}, # Run with 32-bit precision
    workspace_size=2000000000,
    truncate_long_and_double=True
)



In [59]:
%%time
enc_inputs = tokenizer(batch, return_tensors='pt', padding='max_length', max_length=512)
enc_inputs = {k: v.type(torch.int32).cuda() for k, v in enc_inputs.items()}
output_trt = trt_model(enc_inputs['input_ids'], enc_inputs['token_type_ids'], enc_inputs['attention_mask'])
print_outputs(batch, output_trt)

@susan_adrian A happy Monday here--rainy, but going to lunch with the hubster will put a little Thai-flavored sunshine in my day.
sadness:0.01	joy:0.91	love:0.02	anger:0.04	fear:0.00	surprise:0.01	
Is it to early to be waking up at 6 and ur sick  i hate it i can barley breath it suxs =(
sadness:0.01	joy:0.92	love:0.02	anger:0.04	fear:0.00	surprise:0.01	
sun aint shining no more! tired and got work soon
sadness:0.01	joy:0.91	love:0.02	anger:0.04	fear:0.00	surprise:0.01	
@_hikky I just noticed, I have the same Samsung monitor as you.  Actually I think alot of people have this one. Isn't it nice?
sadness:0.01	joy:0.91	love:0.02	anger:0.04	fear:0.00	surprise:0.01	
@JayRWren Please don't ask me which one because I am too ashamed to tell you  Perhaps your text would have been better ...
sadness:0.01	joy:0.91	love:0.02	anger:0.04	fear:0.00	surprise:0.01	
Zahlbar38 is cancelled today. Sorry for all the hundreds of people who where looking forward to this awesome party
sadness:0.01	joy:0.91	lov

In [60]:
# Compile again with 16 bit precision

trt_model_fp16 = torch_tensorrt.compile(traced_model, 
    inputs= [torch_tensorrt.Input(shape=[batch_size, 512], dtype=torch.int32),  # input_ids
             torch_tensorrt.Input(shape=[batch_size, 512], dtype=torch.int32),  # token_type_ids
             torch_tensorrt.Input(shape=[batch_size, 512], dtype=torch.int32)], # attention_mask
    enabled_precisions= {torch.half}, # Run with 16-bit precision
    workspace_size=2000000000,
    truncate_long_and_double=True
)



In [9]:
new_level = torch_tensorrt.logging.Level.Error
torch_tensorrt.logging.set_reportable_log_level(new_level)

<a id="5"></a>
## 5. Benchmarking

In developing this notebook, we conducted our benchmarking on a single NVIDIA A100 GPU. Your results may differ from those shown, particularly on a different GPU.

This function passes the inputs into the model and runs inference `num_loops` times, then returns a list of length containing the amount of time in seconds that each instance of inference took.

In [88]:
def timeGraph(model, input_tensor1, input_tensor2, input_tensor3, num_loops=50):
    print("Warm up ...")
    with torch.no_grad():
        for _ in range(20):
            features = model(input_tensor1, input_tensor2, input_tensor3)

    torch.cuda.synchronize()

    print("Start timing ...")
    timings = []
    with torch.no_grad():
        for i in range(num_loops):
            start_time = timeit.default_timer()
            features = model(input_tensor1, input_tensor2, input_tensor3)
            torch.cuda.synchronize()
            end_time = timeit.default_timer()
            timings.append(end_time - start_time)
            tokens_generated = features[0].size()[0]*features[0].size()[1]
            # print("Iteration {}: {:.6f} s".format(i, end_time - start_time))

    return timings

This function prints the number of input batches the model is able to process each second and summary statistics of the model's latency.

In [99]:
# Tokens per batch

num_loops=50
print(f"Tokens per batch: {tokens_per_batch}")

(tokens_per_batch*num_loops)/np.sum(timings)

Tokens per batch: 5120


357733.7941443705

In [62]:
def printStats(graphName, timings, batch_size):
    times = np.array(timings)
    steps = len(times)
    speeds = batch_size / times
    time_mean = np.mean(times)
    time_med = np.median(times)
    time_99th = np.percentile(times, 99)
    time_std = np.std(times, ddof=0)
    speed_mean = np.mean(speeds)
    speed_med = np.median(speeds)
    tokens_mean=np.sum(times)

    msg = ("\n%s =================================\n"
            "batch size=%d, num iterations=%d\n"
            "  Median text batches/second: %.1f, mean: %.1f\n"
            "  Median latency: %.6f, mean: %.6f, 99th_p: %.6f, std_dev: %.6f\n"
            ) % (graphName,
                batch_size, steps,
                speed_med, speed_mean,
                time_med, time_mean, time_99th, time_std)
    print(msg)

In [63]:
cudnn.benchmark = True

Benchmark the (scripted) TorchScript model on GPU

In [77]:
num_loops = 50

torch.Size([10, 512])

#### Base Model on CUDA

In [82]:
%%time
timings = timeGraph(model.cuda(), enc_inputs['input_ids'], enc_inputs['token_type_ids'], enc_inputs['attention_mask'],
                   num_loops=num_loops)

tokens_per_batch = enc_inputs['input_ids'].size()[0]*enc_inputs['input_ids'].size()[1]

printStats("BERT", timings, batch_size)
print(f"Tokens processed: {tokens_per_batch*num_loops}")

Warm up ...
Start timing ...

batch size=10, num iterations=50
  Median text batches/second: 195.3, mean: 194.8
  Median latency: 0.051204, mean: 0.051331, 99th_p: 0.052065, std_dev: 0.000280

Tokens processed: 256000
CPU times: user 3.25 s, sys: 367 ms, total: 3.61 s
Wall time: 3.61 s


Benchmark the traced model on GPU

In [95]:
256000/3.61

70914.12742382272

#### Traced Model on CUDA

In [91]:
%%time
timings = timeGraph(traced_model.cuda(), enc_inputs['input_ids'], enc_inputs['token_type_ids'], enc_inputs['attention_mask'],
                   num_loops=num_loops)

printStats("BERT", timings, batch_size)
print(f"Tokens processed: {tokens_per_batch*num_loops}")

Warm up ...
Start timing ...

batch size=10, num iterations=50
  Median text batches/second: 194.7, mean: 194.2
  Median latency: 0.051362, mean: 0.051489, 99th_p: 0.052154, std_dev: 0.000301

Tokens processed: 256000
CPU times: user 3.18 s, sys: 457 ms, total: 3.64 s
Wall time: 3.63 s


Benchmark the compiled FP32 model on GPU

In [104]:
toks_per_sec = 256000/3.63
toks_per_sec

70523.41597796143

#### Compiled Model

In [92]:
%%time

timings = timeGraph(trt_model, enc_inputs['input_ids'], enc_inputs['token_type_ids'], enc_inputs['attention_mask'],
                   num_loops=num_loops)

printStats("BERT", timings, batch_size)
print(f"Tokens processed: {tokens_per_batch*num_loops}")

Warm up ...
Start timing ...

batch size=10, num iterations=50
  Median text batches/second: 241.7, mean: 240.5
  Median latency: 0.041376, mean: 0.041583, 99th_p: 0.042858, std_dev: 0.000417

Tokens processed: 256000
CPU times: user 2.65 s, sys: 275 ms, total: 2.92 s
Wall time: 2.92 s


Benchmark the compiled FP16 model on GPU

In [107]:
toks_per_sec_compiled = 256000/2.92
print(f"{toks_per_sec_compiled}: {100*(toks_per_sec_compiled-toks_per_sec)/toks_per_sec:.2f}%")

87671.23287671233: 24.32%


#### Compiled Model in Half Precision

In [94]:
%%time
timings = timeGraph(trt_model_fp16, enc_inputs['input_ids'], enc_inputs['token_type_ids'], enc_inputs['attention_mask'],
                   num_loops=num_loops)

printStats("BERT", timings, batch_size)
print(f"Tokens processed: {tokens_per_batch*num_loops}")

Warm up ...
Start timing ...

batch size=10, num iterations=50
  Median text batches/second: 706.8, mean: 699.2
  Median latency: 0.014148, mean: 0.014312, 99th_p: 0.015696, std_dev: 0.000380

Tokens processed: 256000
CPU times: user 923 ms, sys: 79.5 ms, total: 1 s
Wall time: 999 ms


In [109]:
toks_per_sec_compiled_half = 256000/.999

toks_per_sec_compiled = 256000/2.92
print(f"{toks_per_sec_compiled_half}: {100*(toks_per_sec_compiled_half-toks_per_sec_compiled)/toks_per_sec_compiled:.2f}%")
print(f"{toks_per_sec_compiled_half}: {100*(toks_per_sec_compiled_half-toks_per_sec)/toks_per_sec:.2f}%")


256256.25625625625: 192.29%
256256.25625625625: 263.36%


<a id="6"></a>
## 6. Conclusion

In this notebook, we have walked through the complete process of compiling TorchScript models with Torch-TensorRT for Masked Language Modeling with Hugging Face's `bert-base-uncased` transformer and testing the performance impact of the optimization. With Torch-TensorRT on an NVIDIA A100 GPU, we observe the speedups indicated below. These acceleration numbers will vary from GPU to GPU (as well as implementation to implementation based on the ops used) and we encorage you to try out latest generation of Data center compute cards for maximum acceleration.

Scripted (GPU): 1.0x
Traced (GPU): 1.62x
Torch-TensorRT (FP32): 2.14x
Torch-TensorRT (FP16): 3.15x

### What's next
Now it's time to try Torch-TensorRT on your own model. If you run into any issues, you can fill them at https://github.com/pytorch/TensorRT. Your involvement will help future development of Torch-TensorRT.

# 