# TVM Tutorial


In this notebook, you will focus on deploying a model with TVM. [TVM](https://tvm.ai) is an open deep learning compiler for CPUs, GPUs, and specialized accelerators. [Amazon SageMaker Neo](https://aws.amazon.com/sagemaker/neo/) provides a compilation service built on top of TVM.

In this tutorial, you'll use the BERT model you created in a previous tutorial for the question-answering task to show how deployment through TVM works. Specifically, you will:

- Learn how to convert an MXNet model to its TVM representation (Relay).
- Learn how to compile and run a TVM model.
- Evaluate the TVM performance.

## Installation

Install TVM using pip (these wheels only work in a SageMaker notebook).

In [1]:
!pip install https://haichen-tvm.s3-us-west-2.amazonaws.com/tvm_cu100-0.6.dev0-cp36-cp36m-linux_x86_64.whl
!pip install https://haichen-tvm.s3-us-west-2.amazonaws.com/topi-0.6.dev0-py3-none-any.whl



You can also compile TVM from source.
```bash
git clone --recursive https://github.com/dmlc/tvm.git
cd tvm && mkdir build && cp cmake/config.cmake build/
cd build
echo "set(USE_LLVM ON)" >> config.cmake # or "set(USE_LLVM /path/to/llvm-config)" to use a specific LLVM build
echo "set(USE_CUDA ON)" >> config.cmake # enable CUDA when an NVIDIA GPU is available
cmake .. && make -j
```

## Convert MXNet model

Load the libraries.

In [1]:
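import collections  # used later for post-processing the predictions
import os  # used later for locating the AutoTVM logs

import tvm
from tvm import relay
from tvm import autotvm
import numpy as np
import mxnet as mx
import gluoncv as gcv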


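Convert a model from MXNet to Relay. The cell below converts a pretrained ResNet-18 from the GluonCV model zoo.

In [3]:
# Use a separate name so that the BERT model in `net` is not overwritten.
resnet = gcv.model_zoo.get_model('resnet18_v1', pretrained=True)
# The converter needs the input name and shape; 'data' with a 224x224 RGB image is assumed here.
resnet_mod, resnet_params = relay.frontend.from_mxnet(resnet, shape={'data': (1, 3, 224, 224)})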



## Compile the MXNet model with TVM

First, convert the MXNet model into Relay. At this step you need to provide a mapping from input names to their shapes, as in the code below. The TVM frontend converter supports both MXNet static graphs (symbolic) and `HybridBlock`s.

In [4]:
shape_dict = {
    'data0': (1, max_seq_length), # inputs
    'data1': (1, max_seq_length), # token types
    'data2': (1,) # sequence length
}
mod, params = relay.frontend.from_mxnet(net, shape_dict)
# uncomment the following line to see the converted model in Relay IR
# print(mod)
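
Because the converter relies on the declared shapes, it can help to check the actual inputs against them before compiling. The following is a minimal sketch of such a check; the helper names `make_shape_dict` and `check_shapes` and the length 384 are hypothetical and used only for illustration, while the input names `data0`, `data1`, and `data2` come from the cell above.

```python
def make_shape_dict(max_seq_length):
    """Return the input-name -> shape mapping for batch size 1."""
    return {
        "data0": (1, max_seq_length),  # inputs
        "data1": (1, max_seq_length),  # token types
        "data2": (1,),                 # sequence length
    }


def check_shapes(shape_dict, actual_shapes):
    """Return the sorted input names whose actual shape differs from the declared one."""
    return sorted(
        name for name, shape in shape_dict.items()
        if tuple(actual_shapes.get(name, ())) != shape
    )


shape_dict = make_shape_dict(384)  # 384 is an assumed example value
mismatched = check_shapes(shape_dict, {
    "data0": (1, 384),
    "data1": (1, 128),  # deliberately wrong: not padded to the full length
    "data2": (1,),
})
print(mismatched)
```

An empty list means every input matches its declared shape.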

### Load the AutoTVM logs and build the module

Next, load the AutoTVM logs produced by a previous tuning run on c5.9x instances.

This tutorial does not cover how to tune kernels with `AutoTVM`. If you are interested, see the [auto tuning tutorial](https://docs.tvm.ai/tutorials/autotvm/tune_relay_x86.html).

In [5]:
log_dir = "autotvm_logs"
logs = [os.path.join(log_dir, f) for f in os.listdir(log_dir)]
autotvm_ctx = autotvm.apply_history_best(None)
for log_file in logs:
    autotvm_ctx.load(log_file)
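
Note that `os.listdir` returns every entry in the directory, including subdirectories, and in no guaranteed order. If the log directory may contain anything besides log files, a small filter such as the sketch below can make the loading step more predictable. The helper name `list_log_files` is hypothetical, and the demonstration uses a temporary directory rather than the real `autotvm_logs` folder.

```python
import os
import tempfile


def list_log_files(log_dir):
    """Return the sorted paths of the regular files directly inside log_dir."""
    paths = (os.path.join(log_dir, name) for name in sorted(os.listdir(log_dir)))
    return [p for p in paths if os.path.isfile(p)]


# Demonstrate on a throwaway directory with two logs and one subdirectory.
with tempfile.TemporaryDirectory() as log_dir:
    for name in ("b.log", "a.log"):
        with open(os.path.join(log_dir, name), "w") as f:
            f.write("")
    os.mkdir(os.path.join(log_dir, "old"))
    found = [os.path.basename(p) for p in list_log_files(log_dir)]

print(found)
```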

Then compile the model. To use the AVX-512 vector instructions, you must specify the target CPU as `skylake-avx512`.

When compiling for another device, such as an ARM CPU, change the target accordingly, for example to "llvm -device=arm_cpu -target=aarch64-linux-gnu".

In [6]:
target = "llvm -mcpu=skylake-avx512"
# change the target when compiling for an ARM CPU
# target = "llvm -device=arm_cpu -target=aarch64-linux-gnu"
with autotvm_ctx:
    with relay.build_config(opt_level=3):
        graph, lib, params = relay.build(mod[mod.entry_func], target, params=params)
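
If the same notebook is used for both kinds of device, the target string can be chosen from the machine architecture instead of being edited by hand. The sketch below is one possible way to do that; the mapping and the helper name `choose_target` are assumptions, and it further assumes that an x86-64 host actually supports AVX-512. Only the two target strings are taken from the text above.

```python
import platform

# Target strings taken from the tutorial; the mapping itself is an assumption.
TARGETS = {
    "x86_64": "llvm -mcpu=skylake-avx512",
    "aarch64": "llvm -device=arm_cpu -target=aarch64-linux-gnu",
}


def choose_target(machine=None):
    """Pick a target string for a machine name (defaults to the current host)."""
    machine = (machine or platform.machine()).lower()
    try:
        return TARGETS[machine]
    except KeyError:
        raise ValueError("no target configured for %r" % machine)


print(choose_target("x86_64"))
print(choose_target("aarch64"))
```

Raising an error for unknown architectures is deliberate: silently falling back to a generic target would hide the loss of vectorized code.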

### Export the library

Lastly, export the library, graph structure, and parameters into files.

In [7]:
lib.export_library("deploy_lib.tar")
with open("deploy_graph.json", "w") as fo:
    fo.write(graph)
with open("deploy_param.params", "wb") as fo:
    fo.write(relay.save_param_dict(params))
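
Before moving the exported files to another machine, it can be useful to confirm that all three artifacts were written. The following sketch checks for missing or empty files; the helper name `missing_artifacts` is hypothetical, the file names are the ones used in the cell above, and the demonstration runs in a temporary directory that contains only one of them.

```python
import os
import tempfile

# File names used by the export cell in this tutorial.
ARTIFACTS = ("deploy_lib.tar", "deploy_graph.json", "deploy_param.params")


def missing_artifacts(directory, names=ARTIFACTS):
    """Return the artifact names that are absent or empty in directory."""
    missing = []
    for name in names:
        path = os.path.join(directory, name)
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(name)
    return missing


# Demonstrate with a directory that holds only the graph file.
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "deploy_graph.json"), "w") as f:
        f.write("{}")
    result = missing_artifacts(d)

print(result)
```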

## Evaluate TVM

Now load back the graph, library, and parameters from the files that were exported earlier, and create the graph runtime to execute the compiled graph.

There are also tutorials that show how to deploy a model on [Android](https://docs.tvm.ai/tutorials/frontend/deploy_model_on_android.html), on a [Raspberry Pi](https://docs.tvm.ai/tutorials/frontend/deploy_model_on_rasp.html), and from [C++](https://github.com/dmlc/tvm/blob/master/apps/howto_deploy/cpp_deploy.cc).

In [8]:
import tvm.contrib.graph_runtime as runtime

with open("deploy_graph.json") as f:
    loaded_graph = f.read()
loaded_lib = tvm.module.load("deploy_lib.tar")
with open("deploy_param.params", "rb") as f:
    loaded_params = bytearray(f.read())

tvm_ctx = tvm.cpu()
ex = runtime.create(loaded_graph, loaded_lib, tvm_ctx)
ex.load_params(loaded_params)

Note that the hybrid BERT model requires fixed-length inputs. Therefore, before you feed in the inputs and token types, you need to pad them to the maximum sequence length.

In [9]:
def pad(arr, length, pad_val, dtype="float32"):
    # Right-pad a (1, n) MXNet NDArray into a (1, length) NumPy array.
    padded = np.full(shape=(1, length), fill_value=pad_val, dtype=dtype)
    padded[0, :arr.shape[1]] = arr.asnumpy()[0]
    return padded

example_ids, inputs, token_types, valid_length, _, _ = next(iter(dev_dataloader))
padded_inputs = pad(inputs, max_seq_length, vocab[vocab.padding_token])
padded_token_types = pad(token_types, max_seq_length, 0)
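
The `pad` helper above depends on MXNet arrays. To illustrate the idea on its own, the sketch below applies the same right-padding to a plain Python list and adds an explicit length check, which is not part of the original helper. The name `pad_tokens`, the token ids, and the padding value 0 are made up for illustration.

```python
def pad_tokens(tokens, length, pad_val):
    """Right-pad a list of token ids to exactly `length` entries."""
    if len(tokens) > length:
        raise ValueError(
            "sequence of %d tokens exceeds the maximum length %d" % (len(tokens), length)
        )
    return list(tokens) + [pad_val] * (length - len(tokens))


tokens = [7, 8, 9]             # hypothetical token ids
padded = pad_tokens(tokens, 6, 0)
valid_length = len(tokens)     # the unpadded length is passed to the model separately
print(padded, valid_length)
```

The valid length is kept alongside the padded sequence so that the padding positions can be told apart from real tokens.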

Now, run the graph runtime.

In [10]:
# Run the graph runtime
ex.set_input(data0=padded_inputs,
             data1=padded_token_types,
             data2=valid_length.astype('float32').asnumpy())
ex.run()
out = ex.get_output(0)

# post-processing
tvm_results = collections.defaultdict(list)
output = np.split(out.asnumpy(), axis=2, indices_or_sections=2)
example_ids = example_ids.asnumpy().tolist()
pred_start = output[0].reshape((1, -1))
pred_end = output[1].reshape((1, -1))
for example_id, start, end in zip(example_ids, pred_start, pred_end):
    tvm_results[example_id].append(PredResult(start=start, end=end))
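
The post-processing step splits the model output along its last axis into start and end scores. The sketch below shows the same split on a small nested list, assuming (as the code above suggests) an output of shape `(1, sequence length, 2)` whose last axis holds the start score followed by the end score. The helper name `split_start_end` and the numbers are invented for illustration.

```python
def split_start_end(logits):
    """Split [batch][position][2] scores into per-example start and end lists."""
    starts = [[pos[0] for pos in example] for example in logits]
    ends = [[pos[1] for pos in example] for example in logits]
    return starts, ends


# One example with three token positions; each pair is (start score, end score).
logits = [[[0.1, 0.9], [2.0, 0.3], [0.5, 1.5]]]
starts, ends = split_start_end(logits)

# Index of the highest-scoring start and end position for the first example.
best_start = max(range(len(starts[0])), key=starts[0].__getitem__)
best_end = max(range(len(ends[0])), key=ends[0].__getitem__)
print(starts, ends, best_start, best_end)
```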

In [11]:
qa_utils.predict(dataset, tvm_results, vocab, number=1)


Context: Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.

Question: which nfl team represented the afc at super bowl 50 ?

Top predictions: 
99.36% 	 Denver Broncos
0.23% 	 The American Football Conference (AFC) champion Denver Broncos
0.20% 	 Broncos



## Benchmark the TVM performance

This benchmark measures the mean inference time of the TVM-compiled model, averaged over 10 runs on random inputs.

In [12]:
inputs = np.random.uniform(size=(1, max_seq_length)).astype('float32')
token_types = np.random.uniform(size=(1, max_seq_length)).astype('float32')
valid_length = np.asarray([max_seq_length]).astype('float32')
ex.set_input(data0=inputs, data1=token_types, data2=valid_length)

ftimer = ex.module.time_evaluator("run", tvm_ctx, number=10)
prof_res = np.array(ftimer().results) * 1000  # convert seconds to milliseconds
print("TVM mean inference time: %.2f ms" % np.mean(prof_res))

TVM mean inference time: 101.87 ms
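
For comparison with other frameworks, the same style of measurement can be approximated in plain Python. The sketch below is only a rough analogue of the averaging done with `number=10`, not TVM's implementation; the helper name `mean_time_ms` and the dummy workload are assumptions.

```python
import time


def mean_time_ms(fn, number=10):
    """Call fn `number` times and return the mean wall-clock time in milliseconds."""
    start = time.perf_counter()
    for _ in range(number):
        fn()
    return (time.perf_counter() - start) / number * 1000.0


calls = []  # records one entry per invocation of the dummy workload
elapsed = mean_time_ms(lambda: calls.append(sum(range(1000))), number=10)
print("mean time: %.4f ms" % elapsed)
```

Timing the whole loop once, rather than each call separately, keeps the timer overhead out of the per-call average.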
