<div style="line-height:0.5">
<h1 style="color:#BF66F2 "> Transformer Networks in PyTorch 2 </h1>
<h4> Fast Transformer Inference with Better Transformer, using the XLM-RoBERTa model </h4>

<span style="display: inline-block;">
    <h3 style="color: lightblue; display: inline;">Keywords:</h3>
    margin-top in markdown + torchtext models + to_tensor()
</span>
</div>

<h3 style="color:#BF66F2 "> Recap: Better Transformers</h3>
<div style="margin-top: -7px;">
Better Transformer (BT) is a fastpath that accelerates inference of Transformer models on CPU and GPU. <br>
The fastpath works transparently for models built either directly on PyTorch core nn.Module classes or with torchtext. <br>

Models that can use the Better Transformer fastpath are those built from the following PyTorch core torch.nn classes <br>
=> TransformerEncoder, TransformerEncoderLayer, and MultiheadAttention.
</div>
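
A minimal sketch of the setup described above, assuming a toy `nn.TransformerEncoder` instead of XLM-R. The layer sizes, the random input, and the padding mask are illustrative assumptions. Whether the fastpath is actually taken depends on the PyTorch version and on further internal conditions; the sketch only sets up the documented prerequisites (eval mode, gradients disabled, a padding mask).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy encoder built from the core classes listed above (sizes are arbitrary).
layer = nn.TransformerEncoderLayer(
    d_model=16, nhead=2, dim_feedforward=32, batch_first=True
)
encoder = nn.TransformerEncoder(layer, num_layers=2, enable_nested_tensor=True)

# Prerequisite 1: inference mode for the module.
encoder.eval()

# Two sequences of length 5; True marks padded positions in the first one.
tokens = torch.rand(2, 5, 16)
padding_mask = torch.tensor(
    [[False, False, False, True, True],
     [False, False, False, False, False]]
)

# Prerequisite 2: no autograd tracking during the forward pass.
with torch.no_grad():
    out = encoder(tokens, src_key_padding_mask=padding_mask)

print(out.shape)
```

The same two prerequisites (`model.eval()` and `torch.no_grad()`) are what separate the "slow path" and "fast path" runs in the benchmarks of this notebook.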

<h4 style="color:#BF66F2 "> Steps </h4>
<div style="margin-top: -20px;">

- Load a pre-trained model (created before PyTorch 1.12, i.e. without Better Transformer)
- Run and benchmark inference on CPU with and without BT fastpath (native MHA only)
- Run and benchmark inference on (configurable) DEVICE with and without BT fastpath (native MHA only)
- Run and benchmark inference on (configurable) DEVICE with and without BT fastpath (native MHA + sparsity)
</div>

In [1]:
import torch
import torch.nn as nn
import torchtext
from torchtext.models import RobertaClassificationHead
from torchtext.functional import to_tensor

In [2]:
print(f"torch version: {torch.__version__}")

torch version: 2.0.1+cu118


In [3]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
if device.type!='cpu':
  print(f"torch cuda available: {torch.cuda.is_available()}")

device

torch cuda available: True


device(type='cuda')

<h3 style="color:#BF66F2 "> Note: </h3>
<div style="margin-top: -30px;">
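XLM-RoBERTa (XLM-R) is a large pre-trained multilingual encoder, available in TorchText as XLMR_LARGE_ENCODER. The large variant produces 1024-dimensional embeddings, which is why the classification head is created with input_dim = 1024.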
</div>

In [4]:
# Load XLM-RoBERTa
xlmr_large = torchtext.models.XLMR_LARGE_ENCODER

In [5]:
# Create and initialize a classification head for the model
classifier_head = torchtext.models.RobertaClassificationHead(num_classes=2, input_dim = 1024)

In [6]:
# Combine the encoder and classification head to create a complete classification model
model = xlmr_large.get_model(head=classifier_head)

Downloading: "https://download.pytorch.org/models/text/xlmr.large.encoder.pt" to /root/.cache/torch/hub/checkpoints/xlmr.large.encoder.pt
100%|██████████| 2.08G/2.08G [02:08<00:00, 17.4MB/s]


In [7]:
# Create a transformation function to preprocess input data for the model
transform = xlmr_large.transform()

100%|██████████| 5.07M/5.07M [00:01<00:00, 3.66MB/s]
Downloading: "https://download.pytorch.org/models/text/xlmr.vocab.pt" to /root/.cache/torch/hub/checkpoints/xlmr.vocab.pt
100%|██████████| 4.85M/4.85M [00:01<00:00, 3.74MB/s]


<h3 style="color:#BF66F2 "> Recap: TorchText</h3>
<div style="margin-top: -18px;">
TorchText is a PyTorch library that provides tools for handling text data:
<div style="margin-top: -1px;">

- Preprocessing: Tokenization, vocabulary creation, encoding.
- Datasets: Pre-built loaders for some common NLP datasets.
- Dataloaders: Batch generation with padding to handle variable-length sequences.
</div>
</div>
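
The padding mentioned in the last bullet can be sketched in plain Python. This is an illustration only, not TorchText's implementation: `pad_batch` and `padding_fraction` are hypothetical helpers, and the token ids are made up. The padding value of 1 mirrors the `padding_value=1` passed to `to_tensor` in this notebook.

```python
from typing import List


def pad_batch(token_ids: List[List[int]], padding_value: int = 1) -> List[List[int]]:
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in token_ids)
    return [seq + [padding_value] * (max_len - len(seq)) for seq in token_ids]


def padding_fraction(padded: List[List[int]], padding_value: int = 1) -> float:
    """Share of positions in the padded batch that hold only padding."""
    total = sum(len(row) for row in padded)
    pads = sum(row.count(padding_value) for row in padded)
    return pads / total


# Made-up token ids for three sequences of different lengths.
batch = [
    [0, 11, 12, 2],
    [0, 21, 22, 23, 24, 2],
    [0, 31, 2],
]

padded = pad_batch(batch)
fraction = padding_fraction(padded)

for row in padded:
    print(row)
print(f"padding fraction: {fraction:.2f}")
```

The more the sentence lengths in a batch differ, the larger this fraction becomes. This padding is the "sparsity" that the nested-tensor option of Better Transformer is meant to exploit.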

In [8]:
""" Two kinds of input: a small batch, and a bigger batch whose very different sentence lengths cause heavy padding (sparsity). """

small_input_batch = ["Hello world", "How are you!"]

big_input_batch = [
    "Hello world",
    "How are you!",
    "The quick brown fox jumps over the lazy dog.",
    "In the beginning God created the heavens and the earth.",
    "To be or not to be, that is the question.",
    "It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.",
    """I have a dream that one day this nation will rise up and live out the true meaning of its creed:
    'We hold these truths to be self-evident, that all men are created equal.'""",
    """
    Madam, I must implore you to reconsider your position on this matter.
    The consequences of your decision could be catastrophic not only for yourself but for all those who depend on you.
    I urge you to think carefully before taking any further action.
    """
    # No comma here: Python concatenates this string with the next one, so the batch holds 8 entries.
    """
    The verdant hills of the countryside undulated gently in the distance, their slopes adorned with a patchwork quilt of fields and forests, while the tranquil river flowed lazily by,
    its surface shimmering in the golden light of the setting sun, casting long shadows across the landscape.
    """
    ]


In [9]:
input_batch=big_input_batch

model_input = to_tensor(transform(input_batch), padding_value=1)
output = model(model_input)
output.shape

torch.Size([8, 2])

In [10]:
# Benchmark iteration count
ITERATIONS=10
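
As a rough cross-check of the profiler tables, the same iteration count can drive a plain wall-clock measurement. The sketch below is an assumption-laden alternative to the profiler, not part of the original benchmark: `time_calls` and `dummy_workload` are hypothetical names, and the dummy workload stands in for `model(model_input)`. On a GPU the timed function would also need to call `torch.cuda.synchronize()` so that asynchronous kernels are included in the measurement.

```python
import time
from statistics import mean


def time_calls(fn, iterations=10, warmup=1):
    """Return the mean wall-clock time of fn() in seconds over `iterations` runs."""
    # Warm-up runs are executed but not measured.
    for _ in range(warmup):
        fn()
    durations = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return mean(durations)


def dummy_workload():
    # Stand-in for a model forward pass.
    return sum(i * i for i in range(10_000))


ITERATIONS = 10
avg_seconds = time_calls(dummy_workload, iterations=ITERATIONS)
print(f"mean time per call: {avg_seconds * 1e3:.3f} ms")
```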

<h3 style="color:#BF66F2 "> <u> Benchmark # 1: </u> </h3>
Inference on CPU with and without BT fastpath (native MHA only)

In [11]:
print("slow path:")
print("==========")
with torch.autograd.profiler.profile(use_cuda=False) as prof:
  for i in range(ITERATIONS):
    output = model(model_input)
print(prof)

# The BT fastpath is only used for inference: eval() here plus torch.no_grad() below
model.eval()

print("fast path:")
print("==========")
with torch.autograd.profiler.profile(use_cuda=False) as prof:
  with torch.no_grad():
    for i in range(ITERATIONS):
      output = model(model_input)
print(prof)

slow path:
--------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                        Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
--------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                    aten::eq         0.00%      28.000us         0.00%      28.000us      28.000us             1  
                             aten::embedding         0.01%       1.591ms         0.02%       5.080ms       5.080ms             1  
                               aten::reshape         0.01%       1.566ms         0.01%       1.569ms       1.569ms             1  
                        aten::_reshape_alias         0.00%       3.000us         0.00%       3.000us       3.000us             1  
                          aten::index_select         0.01%       1.904ms

  output = torch._nested_tensor_from_mask(output, src_key_padding_mask.logical_not(), mask_check=False)


-------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
-------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                   aten::eq         0.00%      40.000us         0.00%      40.000us      40.000us             1  
                            aten::embedding         0.00%      30.000us         0.00%     343.000us     343.000us             1  
                              aten::reshape         0.00%      14.000us         0.00%      19.000us      19.000us             1  
                       aten::_reshape_alias         0.00%       5.000us         0.00%       5.000us       5.000us             1  
                         aten::index_select         0.00%     269.000us         0.00%     
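
Printing the full profile produces very long tables like the ones above. A shorter view can be obtained by aggregating the events and limiting the number of rows. The sketch below assumes a tiny stand-in model (the layer sizes and input are arbitrary) and uses `key_averages().table(...)` from the autograd profiler; the chosen sort key and row limit are just one reasonable option.

```python
import torch
import torch.nn as nn

# Tiny stand-in model; in the notebook this would be the XLM-R classifier.
toy_model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2)).eval()
toy_input = torch.rand(4, 8)

with torch.autograd.profiler.profile() as prof:
    with torch.no_grad():
        for _ in range(3):
            toy_model(toy_input)

# Aggregate events by operator name and keep only the five most expensive ones.
summary = prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=5)
print(summary)
```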

In [12]:
# Check the BT sparsity setting
model.encoder.transformer.layers.enable_nested_tensor

True

<h3 style="color:#BF66F2 "> <u> Benchmark # 2: </u> </h3>
Inference on DEVICE with and without BT fastpath, with BT sparsity disabled (native MHA only)

In [13]:
model.encoder.transformer.layers.enable_nested_tensor=False

In [14]:
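""" Run and benchmark inference on the configured device (GPU here) """
# Note: use_cuda=False makes the profiler report CPU-side timings only, even when the model runs on the GPU.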
model.to(device)
model_input = model_input.to(device)

print("slow path:")
print("==========")
with torch.autograd.profiler.profile(use_cuda=False) as prof:
  for i in range(ITERATIONS):
    output = model(model_input)
print(prof)

model.eval()

print("fast path:")
print("==========")
with torch.autograd.profiler.profile(use_cuda=False) as prof:
  with torch.no_grad():
    for i in range(ITERATIONS):
      output = model(model_input)
print(prof)

slow path:
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                               aten::eq         0.42%      14.793ms         0.42%      14.830ms      14.830ms             1  
                                       cudaLaunchKernel         0.00%      37.000us         0.00%      37.000us      37.000us             1  
                                        aten::embedding         0.00%      39.000us         0.64%      22.324ms      22.324ms             1  
                                          aten::reshape         0.00%      11.000us         0.00%      14.000us      14.000us            

  return torch._transformer_encoder_layer_fwd(


-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                               aten::eq         0.18%     725.000us         0.19%     764.000us     764.000us             1  
                                       cudaLaunchKernel         0.01%      39.000us         0.01%      39.000us      39.000us             1  
                                        aten::embedding         0.01%      21.000us         0.03%     103.000us     103.000us             1  
                                          aten::reshape         0.00%      12.000us         0.00%      17.000us      17.000us             1  
      

<h3 style="color:#BF66F2 "> <u> Benchmark # 3: </u> </h3>
Inference on DEVICE with and without BT fastpath, with BT sparsity enabled (native MHA + sparsity)

In [15]:
model.encoder.transformer.layers.enable_nested_tensor = True

In [16]:
model.to(device)
model_input = model_input.to(device)

print("slow path:")
print("==========")
with torch.autograd.profiler.profile(use_cuda=True) as prof:
  for i in range(ITERATIONS):
    output = model(model_input)
print(prof)

model.eval()

print("fast path:")
print("==========")
with torch.autograd.profiler.profile(use_cuda=True) as prof:
  with torch.no_grad():
    for i in range(ITERATIONS):
      output = model(model_input)
print(prof)

slow path:
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                               aten::eq         0.02%      82.000us         0.02%     104.000us     104.000us     111.000us         0.01%     111.000us     111.000us             1  
                                       cudaLaunchKernel         0.00%      22.000us         0.00%      22.000us      22.000us       0.000us         0.00%       0.000us       0.000us             1 