Copyright (c) Microsoft Corporation. All rights reserved.  
Licensed under the MIT License.

# Inference TensorFlow Bert Model for High Performance in ONNX Runtime #

This tutorial shows how to convert the original TensorFlow Bert model to ONNX and then run inference on it with ONNX Runtime, using transformer optimizations for high performance. The following sections use a Bert model trained on the Stanford Question Answering Dataset (SQuAD) as an example. A Bert SQuAD model is used for question answering, where the answer to each question is a segment of text, or span, from the corresponding reading passage, or the question may be unanswerable.
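
To make the notion of a span concrete, the sketch below picks the highest-scoring answer span from per-token start and end scores. It is only an illustration under assumptions: the function name `best_span`, the `max_answer_len` limit, and the example scores are hypothetical and are not taken from the tutorial's model code.

```python
from typing import List, Tuple


def best_span(start_scores: List[float],
              end_scores: List[float],
              max_answer_len: int = 30) -> Tuple[int, int]:
    """Return (start, end) token indices that maximize start + end score.

    Hypothetical helper for illustration; assumes both lists are non-empty
    and hold one score per passage token.
    """
    best = (0, 0)
    best_score = float("-inf")
    for i, s in enumerate(start_scores):
        # The end token may not precede the start token, and the span
        # may cover at most max_answer_len tokens.
        last = min(len(end_scores), i + max_answer_len)
        for j in range(i, last):
            score = s + end_scores[j]
            if score > best_score:
                best_score = score
                best = (i, j)
    return best


# Made-up scores for a five-token passage.
start = [0.1, 2.5, 0.3, 0.2, 0.0]
end = [0.0, 0.4, 0.1, 3.0, 0.2]
span = best_span(start, end)
print(span)
```

The returned pair indexes into the tokenized passage, so the answer text would be the tokens from `span[0]` through `span[1]` inclusive.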

## Step 1 - Convert a TensorFlow Bert model to ONNX

To start, we need an ONNX Bert model, which we can get by converting one from TensorFlow.

Follow steps 1 through 6 of [Converting a Tensorflow Bert model to ONNX](https://github.com/onnx/tensorflow-onnx/blob/master/tutorials/BertTutorial.ipynb) to train a TensorFlow Bert model and export it to ONNX.


## Step 2 - Optimize Model
Once we have the ONNX model, run the graph transformer script on it to produce an optimized graph.

### Download the Bert Optimization Script

In [None]:
!mkdir bert_op_scripts
!wget -O ./bert_op_scripts/bert_model_optimization.py https://raw.githubusercontent.com/microsoft/onnxruntime/master/onnxruntime/python/tools/bert/bert_model_optimization.py
!wget -O ./bert_op_scripts/BertOnnxModelTF.py https://raw.githubusercontent.com/microsoft/onnxruntime/master/onnxruntime/python/tools/bert/BertOnnxModelTF.py
!wget -O ./bert_op_scripts/BertOnnxModel.py https://raw.githubusercontent.com/microsoft/onnxruntime/master/onnxruntime/python/tools/bert/BertOnnxModel.py
!wget -O ./bert_op_scripts/OnnxModel.py https://raw.githubusercontent.com/microsoft/onnxruntime/master/onnxruntime/python/tools/bert/OnnxModel.py
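
Where `wget` is not available (for example on Windows), the same four files can be fetched with Python's standard library. This is a minimal sketch that reuses the URLs from the cell above; the helper names `script_targets` and `download_scripts` are hypothetical.

```python
import os
import urllib.request

BASE_URL = ("https://raw.githubusercontent.com/microsoft/onnxruntime/"
            "master/onnxruntime/python/tools/bert/")
SCRIPTS = [
    "bert_model_optimization.py",
    "BertOnnxModelTF.py",
    "BertOnnxModel.py",
    "OnnxModel.py",
]


def script_targets(dest_dir="bert_op_scripts"):
    """Return (url, local_path) pairs for the optimization scripts."""
    return [(BASE_URL + name, os.path.join(dest_dir, name)) for name in SCRIPTS]


def download_scripts(dest_dir="bert_op_scripts"):
    """Download every script into dest_dir, creating the directory if needed."""
    os.makedirs(dest_dir, exist_ok=True)
    for url, path in script_targets(dest_dir):
        urllib.request.urlretrieve(url, path)


# Uncomment to download (requires network access):
# download_scripts()
```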
    

Run `bert_model_optimization.py` with the `--framework tensorflow` option to optimize the converted model. Other notable options are:
- `--gpu_only`: Allow half-precision float for better performance.
- `--input_int32`: Use int32 tensors instead of the default int64 as input, which avoids unnecessary Cast nodes and improves performance.
- `--float16`: Use float16 tensors instead of the default float32 as input to enable half-precision floats. Recommended for NVIDIA GPUs with Tensor Cores, such as V100 and T4. For older GPUs, float32 is likely faster.
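
The options above combine into a small number of typical command lines. The sketch below assembles them for a CPU target and two GPU targets; the function `optimizer_command`, its `target` labels, and the file names are hypothetical, and only the flags listed above are used.

```python
import sys

SCRIPT = "bert_op_scripts/bert_model_optimization.py"


def optimizer_command(input_path, output_path, target="cpu", input_int32=False):
    """Build the argument list for bert_model_optimization.py.

    target is a hypothetical label: "cpu", "gpu_fp32" or "gpu_fp16".
    """
    if target not in ("cpu", "gpu_fp32", "gpu_fp16"):
        raise ValueError("unknown target: %s" % target)
    cmd = [sys.executable, SCRIPT,
           "--input", input_path,
           "--output", output_path,
           "--framework", "tensorflow"]
    if target in ("gpu_fp32", "gpu_fp16"):
        cmd.append("--gpu_only")
    if target == "gpu_fp16":
        cmd.append("--float16")
    if input_int32:
        cmd.append("--input_int32")
    return cmd


cmd = optimizer_command("bert.onnx", "bert_gpu_fp16.onnx", target="gpu_fp16")
print(" ".join(cmd[1:]))
# To run it: import subprocess; subprocess.run(cmd, check=True)
```

If `--input_int32` is used, the inputs fed to the model at inference time would presumably need to be int32 as well.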

In [None]:
# Below are three examples of running bert_model_optimization.py. Choose one according to your needs and
# replace the <...> placeholders with your own --input and --output paths.

# For CPU
!python bert_op_scripts/bert_model_optimization.py --input <bert.onnx> --output <bert_cpu.onnx> --framework tensorflow

# # For inference on NVIDIA GPUs with Tensor Cores, such as V100 and T4
# !python bert_op_scripts/bert_model_optimization.py --input <bert.onnx> --output <bert_gpu_fp16.onnx> --framework tensorflow --gpu_only --float16

# # For inference on NVIDIA GPUs other than V100 and T4
# !python bert_op_scripts/bert_model_optimization.py --input <bert.onnx> --output <bert_gpu_fp32.onnx> --framework tensorflow --gpu_only



## Step 3 - Inference the Optimized Model with ONNX Runtime

### Install ONNX Runtime
Install the latest ONNX Runtime if you haven't done so already. 

Install exactly one `onnxruntime` Python package: choose `onnxruntime` for CPU execution, or `onnxruntime-gpu` to use GPU execution providers such as CUDA.

In [None]:
import sys

# Install ONNX Runtime for CPU
!{sys.executable} -m pip install -U onnxruntime

## Alternatively, install onnxruntime for GPU
# !{sys.executable} -m pip install -U onnxruntime-gpu
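
After installing, it can be useful to confirm which execution providers the installed package exposes and to prefer the GPU when it is present. The sketch below keeps the selection logic in a plain function so that it runs without ONNX Runtime installed; `pick_providers` is a hypothetical helper, while `get_available_providers()` and the `providers` argument of `InferenceSession` belong to the `onnxruntime` Python API in recent releases.

```python
def pick_providers(available):
    """Order providers so CUDA is tried first when present, with CPU as fallback."""
    preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return [p for p in preferred if p in available]


# Typical use (requires onnxruntime or onnxruntime-gpu to be installed):
#   import onnxruntime as rt
#   providers = pick_providers(rt.get_available_providers())
#   session = rt.InferenceSession("model.onnx", providers=providers)

cpu_only = pick_providers(["CPUExecutionProvider"])
with_gpu = pick_providers(["TensorrtExecutionProvider",
                           "CUDAExecutionProvider",
                           "CPUExecutionProvider"])
print(cpu_only)
print(with_gpu)
```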

Now we're ready to use ONNX Runtime to do inference on the optimized model.

In [None]:
import onnxruntime as rt  
import numpy as np
import time

sess_options = rt.SessionOptions()

# Set graph optimization level to ORT_ENABLE_EXTENDED to enable bert optimization.
sess_options.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_EXTENDED

# Path to the optimized model produced in Step 2; adjust it to match your --output path.
session = rt.InferenceSession("./bert_squad_op.onnx", sess_options)

# Evaluate the model
# Generate dummy inputs to the model. Adjust if necessary
inputs = {
    'input_ids:0':   np.random.randint(0, 256, size=[1, 256], dtype=np.int64), # list of numerical ids for the tokenised text
    'segment_ids:0': np.ones(shape=[1, 256], dtype=np.int64),        # dummy list of ones
    'input_mask:0':  np.ones(shape=[1, 256], dtype=np.int64),        # dummy list of ones
    'unique_ids_raw_output___9:0': np.arange(0, 256, dtype=np.int64)
}

start = time.time()
# Run the optimized model with inputs
output_names = ['unstack:1', 'unstack:0', 'unique_ids:0']
res = session.run(output_names, inputs) 
end = time.time()
print("ONNX Runtime Inference time: ", end - start)
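
A single timed call, as above, also measures one-time costs such as memory allocation, so it can overstate steady-state latency. A common remedy is to discard a few warm-up runs and average over the rest. The following is a sketch of that idea and is not part of the original notebook; `time_call`, `warmup` and `runs` are hypothetical names.

```python
import time


def time_call(fn, warmup=3, runs=10):
    """Return the mean wall-clock seconds per call of fn().

    The first `warmup` calls are executed but not timed.
    """
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs


# Stand-in workload so the sketch runs on its own; with the session above
# one would pass: lambda: session.run(output_names, inputs)
calls = []
mean_seconds = time_call(lambda: calls.append(sum(range(1000))))
print("mean seconds per call:", mean_seconds)
```

The same helper can wrap the TensorFlow `sess.run` call, so that both runtimes are measured in the same way.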

Next, measure the inference time of the original TensorFlow model on the same inputs for comparison.

In [None]:
import tensorflow as tf

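# tf.Session and tf.saved_model.loader are TensorFlow 1.x APIs; in TensorFlow 2.x
# they are available under tf.compat.v1.
# Get input and output keys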
with tf.Session(graph=tf.Graph()) as sess:
    # Load TensorFlow saved model
    metagraph = tf.saved_model.loader.load(sess, 
                                           [tf.saved_model.tag_constants.SERVING], 
                                           "./saved_model")
    
    # Get the input/output names for the saved model.
    inputs_mapping = dict(metagraph.signature_def['serving_default'].inputs)
    input_names = [inputs_mapping[i].name for i in inputs_mapping.keys()]
    print("input names ", input_names)
    start = time.time()
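    # NOTE: the positional mapping below assumes the input order printed above;
    # adjust the indices if your saved model lists its inputs differently.
    out = sess.run(output_names,
                   {input_names[0]: inputs["unique_ids_raw_output___9:0"],
                    input_names[1]: inputs["segment_ids:0"],
                    input_names[2]: inputs["input_ids:0"],
                    input_names[3]: inputs["input_mask:0"]})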
    end = time.time()
    print("\n")
    print("TensorFlow Inference time: ", end - start)
    print("\n")
    print("***** Verifying correctness *****")
    for i in range(3):
        print('TensorFlow and ONNX Runtime matching numbers:', np.allclose(res[i], out[i], rtol=1e-05, atol=1e-02))

    sess.close()
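
For reference, `np.allclose(a, b, rtol, atol)` treats two values as matching when `|a - b| <= atol + rtol * |b|`, so with `atol=1e-02` and `rtol=1e-05` the absolute term dominates unless the values are very large. The sketch below spells that rule out in plain Python on made-up numbers; `is_close` is a hypothetical helper, not a NumPy function.

```python
def is_close(a, b, rtol=1e-05, atol=1e-02):
    """Element-wise rule used by numpy.allclose: |a - b| <= atol + rtol * |b|."""
    return abs(a - b) <= atol + rtol * abs(b)


# Made-up values standing in for one ONNX Runtime / TensorFlow output pair.
inside = is_close(1.234, 1.239)    # difference of about 0.005
outside = is_close(1.234, 1.250)   # difference of about 0.016
print(inside, outside)
```

Printing the maximum absolute difference alongside the boolean result makes any mismatch easier to diagnose.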
