# Faster Transformers with PyTorch and TVM

credit: https://github.com/t-vi/pytorch-tvmisc/blob/master/transformers-pytorch-tvm/bert-tvm.ipynb

a tutorial by Thomas Viehmann <tv@lernapparat.de>


Acknowledgement & Disclosure: The creation of this tutorial was sponsored by AMD. Thank you!

Some of the most intriguing applications of Artificial Intelligence have been in Natural Language Processing.
Models like BERT or GPT-2 and their variants can seemingly grasp enough of a text to continue it in a way that needs a second look to recognize as gibberish.

These models belong to a class of neural network architectures called *Transformers*. One of the favourite libraries implementing them is the [HuggingFace transformers library](https://github.com/huggingface/transformers/).

But in contrast to convolutional models or LSTMs, for which we have heavily optimized implementations, this is much less the case for transformers.

In [1]:
# I sometimes need to choose PyTorch...
import inspect
import sys
import torch
import torch.utils.dlpack

# import TVM
import os

os.environ["PATH"]="/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin"

tvm_root = '/home/ubuntu/workspace/tvm/'
tvm_paths = [os.path.join(tvm_root, p) for p in ['python', 'topi/python', 'nnvm/python']]
os.environ['PYTHONPATH'] = ':'.join([os.environ.get('PYTHONPATH', '')] + tvm_paths)
for p in tvm_paths:
    sys.path.insert(0, p)
    

import tvm
import tvm.relay

torch.cuda.get_device_name()

'Tesla T4'

In [2]:
import sys
!{sys.executable} -m pip install -i https://opentuna.cn/pypi/web/simple/ regex sacremoses

Looking in indexes: https://opentuna.cn/pypi/web/simple/


Helpfully, transformers supports tracing their model with the PyTorch JIT. We use their [tutorial on it](https://huggingface.co/transformers/torchscript.html); the following is copied straight from the tutorial.

In [3]:
import transformers

from transformers import BertModel, BertTokenizer, BertConfig
import numpy

import torch

enc = BertTokenizer.from_pretrained("bert-base-uncased")

# Tokenizing input text
text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
tokenized_text = enc.tokenize(text)

# Masking one of the input tokens
masked_index = 8
tokenized_text[masked_index] = '[MASK]'
indexed_tokens = enc.convert_tokens_to_ids(tokenized_text)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

# Creating a dummy input
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
dummy_input = [tokens_tensor, segments_tensors]

# If you are instantiating the model with `from_pretrained` you can also easily set the TorchScript flag
model = BertModel.from_pretrained("bert-base-uncased", torchscript=True)

model.eval()
for p in model.parameters():
    p.requires_grad_(False)

transformers.__version__

'3.1.0'
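The hard-coded `segments_ids` above mark which sentence each token belongs to. As a minimal sketch, they could also be derived from the token list by counting `[SEP]` markers. The helper name `segment_ids_from_tokens` is hypothetical, and the token list below is an assumption about what the `bert-base-uncased` tokenizer produces for this text (14 tokens, matching the input shape).

```python
def segment_ids_from_tokens(tokens, sep_token="[SEP]"):
    """Assign segment 0 up to and including the first [SEP], then 1, and so on."""
    ids, segment = [], 0
    for tok in tokens:
        ids.append(segment)
        if tok == sep_token:
            segment += 1  # tokens after a separator belong to the next segment
    return ids

# Assumed tokenization of the example text (not taken from the notebook output).
tokens = ["[CLS]", "who", "was", "jim", "henson", "?", "[SEP]",
          "jim", "henson", "was", "a", "puppet", "##eer", "[SEP]"]

segments_ids = segment_ids_from_tokens(tokens)
print(segments_ids)
```

Under that assumption this reproduces the seven zeros followed by seven ones used in the cell above.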

Now we can trace our model. As we want to do inference, we put it in evaluation mode and do not require gradients for the parameters.

In [4]:
# Creating the trace
traced_model = torch.jit.trace(model, [tokens_tensor, segments_tensors])
traced_model.eval()
for p in traced_model.parameters():
    p.requires_grad_(False)



Let us try running our model on the GPU:

In [5]:
model.cuda()
tt_c = tokens_tensor.cuda()
st_c = segments_tensors.cuda()
res_pt = model(tt_c, st_c)
torch.cuda.synchronize()

It worked, but is it fast? Let's run it 100 times and see.
When timing CUDA models, it's always good to do some "warm-up", running the model before the measurement, and we need to be sure to synchronize before the start and end of the timing.


In [6]:
def y():
    for i in range(100):
        model(tt_c, st_c)
    torch.cuda.synchronize()

import time
tic = time.time()
y()
toc = time.time()
print("elapsed: {:.03f} ms".format((toc - tic)*1000.0))

elapsed: 922.147 ms


Around 0.91-0.98 seconds for 100 runs means 9.1-9.8ms per run. That's not too bad.
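The warm-up and synchronization advice can be packaged into a small helper. This is only a sketch: `time_per_run_ms` and its parameters are hypothetical names, and the `sync` callback stands in for whatever the device needs (for the PyTorch model above one would pass `torch.cuda.synchronize`).

```python
import time

def time_per_run_ms(fn, sync=lambda: None, n_warmup=10, n_runs=100):
    """Return the average wall-clock time of fn() in milliseconds."""
    for _ in range(n_warmup):
        fn()              # warm-up runs are not timed
    sync()                # make sure warm-up work has finished before the clock starts
    tic = time.perf_counter()
    for _ in range(n_runs):
        fn()
    sync()                # wait for queued work before stopping the clock
    toc = time.perf_counter()
    return (toc - tic) * 1000.0 / n_runs

# CPU-only demonstration; on the GPU this might look like
# time_per_run_ms(lambda: model(tt_c, st_c), sync=torch.cuda.synchronize)
print("{:.4f} ms per run".format(time_per_run_ms(lambda: sum(range(1000)))))
```

Dividing by `n_runs` inside the helper gives the per-run figure directly instead of the total for 100 runs.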

But let us see if TVM can help us to get faster. Let us convert our model to TVM.

In [7]:
shape_list = [(i.debugName().split('.')[0], i.type().sizes()) for i in  list(traced_model.graph.inputs())[1:]]
print(shape_list)

[('input_ids', [1, 14]), ('attention_mask', [1, 14])]
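To make the list comprehension less opaque: the first graph input of a traced module is the module itself, which is why it is skipped with `[1:]`, and the `split('.')` strips a numeric suffix from the debug name. The sketch below imitates this with plain tuples; the debug names `input_ids.1` and `attention_mask.1` are hypothetical stand-ins for what `i.debugName()` might return.

```python
# Hypothetical (debug name, sizes) pairs standing in for traced_model.graph.inputs().
graph_inputs = [
    ("self.1", None),                # the module itself, not a tensor input
    ("input_ids.1", [1, 14]),
    ("attention_mask.1", [1, 14]),
]

# Drop the module entry and keep only the part of the name before the first dot.
shape_list = [(name.split(".")[0], sizes) for name, sizes in graph_inputs[1:]]
print(shape_list)
```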


In [8]:
print("- 0 -", flush=True)
mod_bert, params_bert = tvm.relay.frontend.pytorch.from_pytorch(traced_model,
                        shape_list, default_dtype="float32")
print("- 0 -", flush=True)

- 0 -
- 0 -


That went well! (Be sure to use the TVM model from my git branch.) We can now build and run it. Building follows the standard TVM recipe.

In [9]:
target = "cuda -model=t4"
ctx = tvm.context(target, 0)

In [10]:
target_host = 'llvm'

tt_a = tvm.nd.array(tokens_tensor.numpy(), ctx)
st_a = tvm.nd.array(segments_tensors.numpy(), ctx)

In [11]:
print("- 1 -")
tvm.relay.backend.compile_engine.get().clear() # just to be sure, see https://github.com/apache/incubator-tvm/pull/5724

with tvm.transform.PassContext(opt_level=3):
        graph, lib, params = tvm.relay.build(mod_bert,
                                     target=target,
                                     target_host=target_host,
                                     params=params_bert)
module = tvm.contrib.graph_runtime.create(graph, lib, ctx)
print("- 2 -")

- 1 -
download failed due to URLError(ConnectionRefusedError(111, 'Connection refused')), retrying, 2 attempts left
download failed due to URLError(ConnectionRefusedError(111, 'Connection refused')), retrying, 1 attempt left


  


- 2 -


Uh oh, _may bring great performance regression_. Let us run the model and see if the outputs match:

In [12]:
module.set_input("input_ids", tt_a)
module.set_input("attention_mask", st_a)
module.set_input(**params)
module.run()
o0 = module.get_output(0)
o1 = module.get_output(1)
print(numpy.abs((res_pt[0].cpu().numpy() - o0.asnumpy())).max(), 
      numpy.abs((res_pt[1].cpu().numpy() - o1.asnumpy())).max())

7.6293945e-06 8.34465e-07
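One way to read these two numbers is in units of the float32 grid. The first, 7.6293945e-06, equals $2^{-17}$, and the second is about $7 \cdot 2^{-23}$. The sketch below computes the gap between neighbouring float32 values at a given magnitude; `float32_spacing` is a hypothetical helper, and since the notebook does not show the magnitude of the outputs, the magnitudes used here are only illustrative.

```python
import math

def float32_spacing(x):
    """Gap between adjacent float32 values near x (normal range, x != 0)."""
    _, exponent = math.frexp(abs(x))   # abs(x) = m * 2**exponent with 0.5 <= m < 1
    return 2.0 ** (exponent - 24)      # float32 has a 24-bit significand

print(float32_spacing(1.0))       # spacing for values in [1, 2)
print(float32_spacing(64.0))      # spacing for values in [64, 128)
print(7 * float32_spacing(1.0))   # seven grid steps at magnitude [1, 2)
```

Either way, the differences amount to a handful of float32 steps, which is what one would expect when two implementations sum the same terms in a different order.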


Looks good. Remember that we're computing in float32, so $10^{-6}$ish is a good result. Now that we know it gets the correct result, let us see what the speed is:

In [13]:
def x():
    for i in range(100):
        module.run()
    ctx.sync()
tic = time.time()
x()
toc = time.time()
print("0 elapsed: {:.03f} ms".format((toc - tic)*1000.0))
# get_ipython().run_line_magic('timeit', 'x()')

0 elapsed: 2040.782 ms


Ouch, 20 ms per run of the model. That's slow indeed. But the warning said that it was because it could not find (tuned) configurations. Let us then tune the tasks.
We first extract the tasks.

In [17]:
tasks = tvm.autotvm.task.extract_from_program(mod_bert["main"], target=target, params=params)
print(tasks)

...100%, 0.40 MB, 368 KB/s, 1 seconds passed
[Task(func_name=batch_matmul.cuda, args=(('TENSOR', (1, 14, 3072), 'float32'), ('TENSOR', (1, 768, 3072), 'float32')), kwargs={}, workload=('batch_matmul.cuda', ('TENSOR', (1, 14, 3072), 'float32'), ('TENSOR', (1, 768, 3072), 'float32'))), Task(func_name=batch_matmul.cuda, args=(('TENSOR', (1, 14, 768), 'float32'), ('TENSOR', (1, 3072, 768), 'float32')), kwargs={}, workload=('batch_matmul.cuda', ('TENSOR', (1, 14, 768), 'float32'), ('TENSOR', (1, 3072, 768), 'float32'))), Task(func_name=batch_matmul.cuda, args=(('TENSOR', (1, 14, 768), 'float32'), ('TENSOR', (1, 768, 768), 'float32')), kwargs={}, workload=('batch_matmul.cuda', ('TENSOR', (1, 14, 768), 'float32'), ('TENSOR', (1, 768, 768), 'float32'))), Task(func_name=batch_matmul.cuda, args=(('TENSOR', (12, 14, 14), 'float32'), ('TENSOR', (12, 64, 14), 'float32')), kwargs={}, workload=('batch_matmul.cuda', ('TENSOR', (12, 14, 14), 'float32'), ('TENSOR', (12, 64, 14), 'float32'))), Task(func_

OK, so these are the tasks that we need to be able to perform fast.

Below is the corresponding tuning. We have set `n_trial` to 20 here for you to play along. For serious tuning, you need to set this to 2000 steps. Each task then takes about 1-2 hours (on my computer).

As I wanted this to be runnable from Jupyter, I'm doing a bit of a dance with threading and the tornado IOLoop module. In a regular script, you would only have the call to `tuner.tune` between _do tuning_ and _done tuning_.

In [18]:
log_filename = 'bert-tuning.stage1.log'

In [19]:
n_trial = 20  # for real tuning, make this 2000!

def do_tune(tasks, log_filename):
    tmp_log_file = log_filename + ".tmp"
    for i, tsk in enumerate(reversed(tasks)):
        prefix = "[Task %2d/%2d] " %(i+1, len(tasks))

        # we use threading and tornado here to work around TVM and Jupyter colliding over IOLoops
        # In a regular python command line, you should be able to just call the tuner...
        import threading
        import tornado.ioloop

        # create tuner
        tuner = tvm.autotvm.tuner.XGBTuner(tsk, loss_type='rank')
        if os.path.isfile(tmp_log_file):
            tuner.load_history(tvm.autotvm.record.load_from_file(tmp_log_file))

        # do tuning
        tsk_trial = min(n_trial, len(tsk.config_space))
        def tune_task_fn():
            iol = tornado.ioloop.IOLoop()  # we need an event loop
            tuner.tune(
                n_trial=tsk_trial,  # never ask for more trials than the config space holds
                early_stopping=600,
                measure_option=tvm.autotvm.measure_option(
                    builder=tvm.autotvm.LocalBuilder(timeout=10),
                    runner=tvm.autotvm.LocalRunner(number=20, repeat=3, timeout=4, min_repeat_ms=150)),
                callbacks=[
                    tvm.autotvm.callback.progress_bar(tsk_trial, prefix=prefix),
                    tvm.autotvm.callback.log_to_file(tmp_log_file)
                ])

        tuning_thread = threading.Thread(target=tune_task_fn)  # create a thread, start it, and wait on it
        tuning_thread.start()
        tuning_thread.join()
        # done tuning, on to the next task

    # pick best records to a cache file
    tvm.autotvm.record.pick_best(tmp_log_file, log_filename)

do_tune(tasks, log_filename)

[Task  1/ 6]  Current/Best:   95.46/ 288.23 GFLOPS | Progress: (18/18) | 25.22 s Done.
[Task  2/ 6]  Current/Best:   88.12/  99.91 GFLOPS | Progress: (20/20) | 24.88 s Done.
[Task  3/ 6]  Current/Best:  100.60/ 100.60 GFLOPS | Progress: (20/20) | 24.81 s Done.
[Task  4/ 6]  Current/Best:  613.18/ 712.88 GFLOPS | Progress: (20/20) | 27.42 s Done.
[Task  5/ 6]  Current/Best:  891.24/ 938.43 GFLOPS | Progress: (20/20) | 22.13 s Done.
[Task  6/ 6]  Current/Best:  578.87/ 578.87 GFLOPS | Progress: (20/20) | 22.26 s Done.
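The thread-per-task dance in `do_tune` can be looked at in isolation. The sketch below is a generic illustration, not TVM's or tornado's code: it uses the standard library's `asyncio` as a stand-in for the tornado IOLoop, and `run_in_fresh_thread` and `needs_event_loop` are hypothetical names. The point is that a new thread can be given its own event loop, independent of the one Jupyter already runs in the main thread.

```python
import asyncio
import threading

def run_in_fresh_thread(fn):
    """Run fn() in a new thread that owns a private event loop; return its result."""
    result = {}

    def target():
        loop = asyncio.new_event_loop()   # a loop that belongs to this thread only
        asyncio.set_event_loop(loop)
        try:
            result["value"] = fn()
        finally:
            loop.close()

    thread = threading.Thread(target=target)
    thread.start()
    thread.join()                         # block until the work is done
    return result.get("value")

def needs_event_loop():
    """Stand-in for library code that expects to drive the current thread's loop."""
    loop = asyncio.get_event_loop()
    return loop.run_until_complete(asyncio.sleep(0, result="tuned"))

print(run_in_fresh_thread(needs_event_loop))
```

In a regular script there is no competing loop, so the wrapper would not be needed.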


After this, we can again build the model, this time with the new configuration. This time we should see no comments about missing configurations.

In [20]:
tvm.relay.backend.compile_engine.get().clear()

with tvm.autotvm.apply_history_best(log_filename):
    with tvm.transform.PassContext(opt_level=3):
        graph, lib, params = tvm.relay.build(mod_bert,
                                     target=target,
                                     target_host=target_host,
                                     params=params_bert)
module = tvm.contrib.graph_runtime.create(graph, lib, ctx)

  


In [21]:
module.set_input("input_ids", tt_a)
module.set_input("attention_mask", st_a)
module.set_input(**params)
module.run()
o0 = module.get_output(0)
o1 = module.get_output(1)
print(numpy.abs((res_pt[0].cpu().numpy() - o0.asnumpy())).max(), 
      numpy.abs((res_pt[1].cpu().numpy() - o1.asnumpy())).max())

7.6293945e-06 8.34465e-07


Let's see if the speed improved:

In [22]:
def x():
    for i in range(100):
        module.run()
    ctx.sync()
tic = time.time()
x()
toc = time.time()
print("tvm elapsed: {:.03f} ms".format((toc - tic)*1000.0))

tvm elapsed: 584.373 ms
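To put the three measurements printed in this notebook side by side, here is the arithmetic as a short sketch. The numbers are single measurements copied from the outputs above, so the ratios are indicative only.

```python
n_runs = 100

# Total elapsed times in milliseconds, as printed by the timing cells.
elapsed_ms = {
    "pytorch": 922.147,
    "tvm_untuned": 2040.782,
    "tvm_tuned": 584.373,
}

per_run_ms = {name: total / n_runs for name, total in elapsed_ms.items()}
speedup_vs_pytorch = elapsed_ms["pytorch"] / elapsed_ms["tvm_tuned"]
speedup_vs_untuned = elapsed_ms["tvm_untuned"] / elapsed_ms["tvm_tuned"]

for name, ms in per_run_ms.items():
    print("{:12s} {:6.2f} ms per run".format(name, ms))
print("tuned TVM vs PyTorch:     {:.2f}x".format(speedup_vs_pytorch))
print("tuned TVM vs untuned TVM: {:.2f}x".format(speedup_vs_untuned))
```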


Now it's in the region of 5.5-7ms per run. That's faster than PyTorch. This is what we get from this very elementary optimization of our operators. We can push it a little further, though.