# Optimize ONNX Model With TensorRT
This notebook shows how to convert a well-known network architecture into a highly optimized inference engine using the TensorRT Python API, starting from a network in ONNX format.

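To this end, we train a small MNIST classification network in TensorFlow and export it to ONNX. Pretrained classification networks in ONNX format are also available from the ONNX Model Zoo: https://github.com/onnx/models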

### Intro to ONNX
ONNX (Open Neural Network Exchange) is an open format built to represent machine learning models. ONNX defines a common set of operators - the building blocks of machine learning and deep learning models - and a common file format to enable AI developers to use models with a variety of frameworks, tools, runtimes, and compilers.

### Setup
* tf2onnx - Python library that converts TensorFlow models to ONNX format
* onnx-simplifier - Python library that simplifies the generated ONNX graph

In [None]:
!pip install tf2onnx onnx-simplifier
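
After installing, it can help to record which versions are actually present, since the TensorRT and ONNX APIs change between releases. A minimal sketch using only the standard library; `installed_version` is a hypothetical helper, and the distribution names are assumed to match the pip names used above.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Distribution names as used by pip in the install cell above.
for name in ("tensorflow", "tf2onnx", "onnx", "onnx-simplifier"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```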

Limit GPU memory usage

In [1]:
import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
  # Restrict TensorFlow to allocate at most 2500 MB of memory on the first GPU (memory_limit is in MB)
  try:
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=2500)])
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Virtual devices must be set before GPUs have been initialized
    print(e)

### Model Creation
First, we import the dataset and preprocess it.

In [2]:
# use the public Keras API rather than the private tensorflow.python module path
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.datasets import mnist

def load_dataset():
    # load dataset
    (trainX, trainY), (testX, testY) = mnist.load_data()
    # reshape dataset to have a single channel
    trainX = trainX.reshape((trainX.shape[0], 28, 28, 1))
    testX = testX.reshape((testX.shape[0], 28, 28, 1))
    # one hot encode target values
    trainY = to_categorical(trainY)
    testY = to_categorical(testY)
    return trainX, trainY, testX, testY


x_train, y_train, x_test, y_test = load_dataset()

def prep_pixels(train, test):
    # convert from integers to floats
    train_norm = train.astype('float32')
    test_norm = test.astype('float32')
    # normalize to range 0-1
    train_norm = train_norm / 255.0
    test_norm = test_norm / 255.0
    # return normalized images
    return train_norm, test_norm


x_train, x_test = prep_pixels(x_train, x_test)
print("Done converting to float32 normalized values")

Done converting to float32 normalized values
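
To make the three preprocessing steps concrete, here is a minimal NumPy-only sketch on a toy batch. The random images and the labels are made-up stand-ins for MNIST, and the one-hot step uses `np.eye` as an assumed equivalent of `to_categorical` for 10 classes.

```python
import numpy as np

# Hypothetical stand-in for MNIST: 3 grayscale images of 28x28 uint8 pixels.
images = np.random.randint(0, 256, size=(3, 28, 28), dtype=np.uint8)
labels = np.array([7, 0, 3])

# Add the trailing channel axis expected by Conv2D (NHWC layout).
x = images.reshape((images.shape[0], 28, 28, 1))
# Convert to float32 and scale pixel values into [0, 1].
x = x.astype("float32") / 255.0
# One-hot encode the labels, assuming 10 classes.
y = np.eye(10, dtype="float32")[labels]

print(x.shape, x.dtype, float(x.min()), float(x.max()))
print(y.shape, y[0])
```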


Define the model and train

In [3]:
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu', kernel_initializer='he_uniform', input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(100, activation='relu', kernel_initializer='he_uniform'),
    tf.keras.layers.Dense(10, activation='softmax')
])

opt = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
model.compile(optimizer=opt, loss=tf.keras.losses.CategoricalCrossentropy(), metrics=['accuracy'])

model.fit(x_train, y_train, epochs=2)

Epoch 1/2
Epoch 2/2


<tensorflow.python.keras.callbacks.History at 0x7fc7625d9ee0>
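
As a sanity check on the architecture, the parameter count can be worked out by hand. The sketch below assumes Keras defaults for the layers above ('valid' padding for the convolution, pooling stride equal to the pool size) and should agree with what `model.summary()` reports.

```python
# Layer-by-layer parameter count for the Sequential model defined above.
conv_out_hw = 28 - 3 + 1                         # 3x3 'valid' convolution: 28 -> 26
conv_params = 3 * 3 * 1 * 32 + 32                # kernel weights + biases
pool_out_hw = conv_out_hw // 2                   # 2x2 max pooling: 26 -> 13
flat_features = pool_out_hw * pool_out_hw * 32   # flattened feature vector length
dense1_params = flat_features * 100 + 100        # Dense(100): weights + biases
dense2_params = 100 * 10 + 10                    # Dense(10): weights + biases

total_params = conv_params + dense1_params + dense2_params
print(conv_params, dense1_params, dense2_params, total_params)
```

Almost all of the weights sit in the first dense layer, which is typical for small networks that flatten a feature map directly.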

Run inference on the current model and measure time

In [7]:
import time
import numpy as np

s1 = time.time()
np.argmax(model.predict(np.asarray([x_test[0]])), axis=-1)
e1 = time.time()
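
A single timed call is noisy, and the first call to a model often includes one-time setup work. A minimal sketch of a more robust measurement, with warm-up calls followed by an average over several runs; `benchmark` is a hypothetical helper, and the workload below is only a stand-in for the real `model.predict` call.

```python
import time


def benchmark(fn, warmup=3, runs=20):
    """Return the mean wall-clock time of fn() in seconds after warm-up calls."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs


# Stand-in workload; in the notebook this would be a call such as
# lambda: model.predict(np.asarray([x_test[0]]))
mean_s = benchmark(lambda: sum(i * i for i in range(10_000)))
print(f"mean latency: {mean_s * 1e3:.3f} ms")
```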

### Convert to ONNX
To convert the model to an ONNX file we use the tf2onnx Python library. It traverses the TensorFlow model and encodes it in the ONNX format.

In [8]:
import tf2onnx

input_signature = [tf.TensorSpec([1, 28, 28, 1], tf.float32, name='x')]
onnx_model, _ = tf2onnx.convert.from_keras(model, input_signature, opset=13)
with open("/tmp/model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

print("Saved onnx model to /tmp/model.onnx")

Instructions for updating:
Use `tf.compat.v1.graph_util.extract_sub_graph`
Saved onnx model to /tmp/model.onnx


tf2onnx requires the model's input signature and the target ONNX operator set version (the opset; see https://github.com/onnx/onnx/blob/master/docs/Versioning.md).

After conversion we serialize the ONNX object to a binary file, which can later be loaded by other frameworks or by TensorRT.

It is recommended to simplify the exported model. Often, especially for large models, the output of the tf2onnx exporter (or of any other exporter) is an unnecessarily complicated description of the network. We can use the onnx-simplifier Python library to simplify it.

For example, before simplification:

<img src='images/complex.png' height="50%" width="50%">

After simplification:

<img src='images/simple.png' height="15%" width="15%">

Simplification is recommended because, in practice, TensorRT tends to struggle with large networks that have a complex graph description.

#### Simplify the ONNX model

In [9]:
import onnx
from onnxsim import simplify

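# load the exported ONNX model (a separate name keeps the Keras `model` intact)
onnx_graph = onnx.load('/tmp/model.onnx')

# simplify the model; `check` reports whether the simplified model was validated
model_simp, check = simplify(onnx_graph)
assert check, "Simplified ONNX model could not be validated"

# save the simplified model (model_simp, not the original onnx_model)
with open("/tmp/model-simplified.onnx", "wb") as f:
    f.write(model_simp.SerializeToString())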
    
print("Saved onnx model to /tmp/model-simplified.onnx")

Saved onnx model to /tmp/model-simplified.onnx


### Convert ONNX Model to TensorRT Engine

To convert the ONNX model we can use three methods:
* trtexec cli utility
* TensorRT Python API
* TensorRT C++ API

We're going to show the trtexec method and the Python API method.

#### Convert using trtexec to engine file

In [25]:
!trtexec --onnx=/tmp/model-simplified.onnx --explicitBatch --workspace=64 --buildOnly --saveEngine=optimized.trt

&&&& RUNNING TensorRT.trtexec # trtexec --onnx=/tmp/model-simplified.onnx --explicitBatch --workspace=64 --buildOnly --saveEngine=optimized.trt
[06/07/2021-17:55:38] [I] === Model Options ===
[06/07/2021-17:55:38] [I] Format: ONNX
[06/07/2021-17:55:38] [I] Model: /tmp/model-simplified.onnx
[06/07/2021-17:55:38] [I] Output:
[06/07/2021-17:55:38] [I] === Build Options ===
[06/07/2021-17:55:38] [I] Max batch: explicit
[06/07/2021-17:55:38] [I] Workspace: 64 MiB
[06/07/2021-17:55:38] [I] minTiming: 1
[06/07/2021-17:55:38] [I] avgTiming: 8
[06/07/2021-17:55:38] [I] Precision: FP32
[06/07/2021-17:55:38] [I] Calibration: 
[06/07/2021-17:55:38] [I] Refit: Disabled
[06/07/2021-17:55:38] [I] Safe mode: Disabled
[06/07/2021-17:55:38] [I] Save engine: optimized.trt
[06/07/2021-17:55:38] [I] Load engine: 
[06/07/2021-17:55:38] [I] Builder Cache: Enabled
[06/07/2021-17:55:38] [I] NVTX verbosity: 0
[06/07/2021-17:55:38] [I] Tactic sources: Using default tactic sources
[06/07/2021-17:55:38] [I] Input(

#### Convert ONNX model to TensorRT engine using the Python API

> Note: In TensorRT 7 the ONNX parser requires a network created with an explicit batch dimension. The most common issue when optimizing an ONNX model is therefore forgetting to set the EXPLICIT_BATCH flag (from NetworkDefinitionCreationFlag) when creating the network.

In [17]:
import tensorrt as trt

logger = trt.Logger(trt.Logger.VERBOSE)
logger.min_severity = trt.Logger.Severity.VERBOSE
EXPLICIT_BATCH = []

print('trt version', trt.__version__)
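# compare the major version as an integer; comparing the first character breaks for versions >= 10
if int(trt.__version__.split('.')[0]) >= 7:
    EXPLICIT_BATCH.append(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))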

with trt.Builder(logger) as builder, builder.create_network(*EXPLICIT_BATCH) as network, trt.OnnxParser(network, logger) as parser:
    builder.max_workspace_size = 1 << 28
    builder.max_batch_size = 1
    builder.fp16_mode = False

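    with open('/tmp/model-simplified.onnx', 'rb') as f:
        if not parser.parse(f.read()):
            for error in range(parser.num_errors):
                print(parser.get_error(error))
            # stop here instead of building an engine from a partially parsed network
            raise RuntimeError("Failed to parse the ONNX model")
        print("Parsed ONNX successfully")

    # input shape as parsed from the ONNX model (batch size was fixed to 1 at export time)
    shape = list(network.get_input(0).shape)
    print("building tensorrt engine")
    engine = builder.build_cuda_engine(network)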
    print("Saving serialized model")
    with open('optimized.trt', 'wb') as f:
        f.write(engine.serialize())
        
    print("Done")

trt version 7.2.3.4
Parsed ONNX successfully
building tensorrt engine
Saving serialized model
Done
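
When gating a feature on the TensorRT major version, comparing the major number as an integer is more robust than comparing the first character of the version string. A small sketch of the difference; `major_version` is a hypothetical helper, and the version strings other than 7.2.3.4 are illustrative.

```python
def major_version(version):
    """Parse the leading integer of a dotted version string such as '7.2.3.4'."""
    return int(version.split(".")[0])


results = {}
for v in ("6.0.1", "7.2.3.4", "10.0.1"):
    by_string = v[0] >= "7"              # compares only the first character
    by_number = major_version(v) >= 7    # compares the whole major number
    results[v] = (by_string, by_number)
    print(v, by_string, by_number)
```

The two checks agree for single-digit major versions and disagree once the major version has two digits.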


### Inference with the compiled TensorRT engine

Here we define some utility functions to load the engine and allocate memory for the model's inputs and outputs.

In [26]:
import tensorrt as trt
import pycuda.autoinit
import pycuda.driver as cuda


logger = trt.Logger(trt.Logger.VERBOSE)
logger.min_severity = trt.Logger.Severity.VERBOSE

class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        """Here, host_mem is CPU (host) memory and device_mem is GPU (device) memory."""
        self.host = host_mem 
        self.device = device_mem
    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()
    

def allocate_buffers(engine):
    inputs = []
    outputs = []
    bindings = []
    stream = cuda.Stream()
    for binding in engine:
        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        # Allocate host and device buffers
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        # Append the device buffer to device bindings.
        bindings.append(int(device_mem))
        # Append to the appropriate list.
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    return inputs, outputs, bindings, stream

def load_engine(engine_path):
    with open(engine_path, "rb") as f, trt.Runtime(logger) as runtime:
        return runtime.deserialize_cuda_engine(f.read())

engine = load_engine('./optimized.trt')

ctx = engine.create_execution_context() 

# Allocate buffers for input and output
inputs, outputs, bindings, stream = allocate_buffers(engine)  # inputs/outputs: host+device buffer pairs; bindings: device pointers
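
The buffer sizes computed in `allocate_buffers` can be checked by hand. The sketch below mirrors the element-count arithmetic with NumPy only; the binding shapes are assumptions based on this model's input signature and 10-class output, and `volume` is a hypothetical stand-in for what `trt.volume` computes.

```python
import numpy as np


def volume(shape):
    """Number of elements in a tensor of the given shape (product of its dimensions)."""
    return int(np.prod(shape))


# Assumed binding shapes for this model: NHWC input and 10-class output.
bindings_info = {"input": (1, 28, 28, 1), "output": (1, 10)}
batch = 1  # matches the max batch size the engine was built with

sizes = {}
for name, shape in bindings_info.items():
    n_elements = volume(shape) * batch
    n_bytes = n_elements * np.dtype(np.float32).itemsize
    sizes[name] = (n_elements, n_bytes)
    print(f"{name}: {n_elements} elements, {n_bytes} bytes")
```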



In [27]:
import numpy as np

def do_inference(context, bindings, inputs, outputs, stream, batch_size=1):
    # Transfer data from CPU to the GPU.
    for inp in inputs:
        cuda.memcpy_htod_async(inp.device, inp.host, stream)
    # Run inference.
    context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
    # Transfer predictions back from the GPU.
    for out in outputs:
        cuda.memcpy_dtoh_async(out.host, out.device, stream)
    # Synchronize the stream
    stream.synchronize()
    # Return only the host outputs.
    return [out.host for out in outputs]

def postprocess_the_outputs(h_outputs, shape_of_output):
    h_outputs = h_outputs.reshape(*shape_of_output)
    return h_outputs

max_batch_size = 1 # The input batch size must not exceed the max_batch_size the engine was built with


In [28]:
# use the same test sample as the Keras baseline so the two predictions are comparable
np.copyto(inputs[0].host, np.asanyarray(x_test[0]).ravel())

shape_of_output = (1, 10)
s2 = time.time()
np.argmax(postprocess_the_outputs(do_inference(ctx, bindings, inputs, outputs, stream)[0], shape_of_output))
e2 = time.time()
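
Beyond timing, it is worth checking that the optimized engine still predicts the same class as the original model. A minimal NumPy sketch of that check; the probability values are made up to stand in for the Keras and TensorRT outputs, and `predicted_class` is a hypothetical helper that mirrors the reshape-then-argmax step above.

```python
import numpy as np


def predicted_class(flat_output, shape=(1, 10)):
    """Reshape a flat output buffer and return the arg-max class per batch row."""
    return np.argmax(np.asarray(flat_output).reshape(shape), axis=-1)


# Hypothetical values standing in for the Keras and TensorRT softmax outputs.
keras_probs = np.array(
    [[0.01, 0.01, 0.02, 0.01, 0.01, 0.01, 0.01, 0.90, 0.01, 0.01]], dtype=np.float32
)
trt_flat = keras_probs.ravel() + np.float32(1e-6)  # flat buffer with tiny numeric drift

same_class = np.array_equal(predicted_class(trt_flat), np.argmax(keras_probs, axis=-1))
close = np.allclose(trt_flat.reshape(keras_probs.shape), keras_probs, atol=1e-4)
print(same_class, close)
```

The tolerance of `1e-4` is an assumption suited to FP32 engines; a looser tolerance would likely be needed if the engine were built with reduced precision.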

In [31]:
print("Original model: " + str(e1 - s1))
print("Optimized model: " + str(e2 - s2))

Original model: 0.09577703475952148
Optimized model: 0.0016007423400878906
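
The two timings above can be turned into a speedup factor. The sketch below only does the arithmetic on the printed numbers. Each number comes from a single call, and the first call to a model may include one-time setup cost, so the ratio should be read as a rough indication rather than a benchmark.

```python
# Timings printed by the notebook run above, in seconds.
original_s = 0.09577703475952148
optimized_s = 0.0016007423400878906

speedup = original_s / optimized_s
print(f"original:  {original_s * 1e3:.2f} ms")
print(f"optimized: {optimized_s * 1e3:.2f} ms")
print(f"speedup:   {speedup:.1f}x")
```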
