This notebook takes a frozen TensorFlow model (MobileNet) that has been converted to UFF with the uff_convert tool and builds a TensorRT-optimized plan. It then uses this plan with the TensorRT runtime to run inference natively in TensorRT.



Import what we need to build a network with the TensorRT Python API, interface with CUDA, and manipulate images for preprocessing:

In [None]:
import argparse
import numpy as np
import tensorrt as trt
import time

from PIL import Image

import pycuda.driver as cuda
import pycuda.autoinit

We will pull a pre-converted model file from S3 for this inference engine, along with a set of images to explore:

In [None]:
!bash ./setup.sh

Define our constants:

In [None]:

MAX_BATCH_SIZE = 1
# Workspace size matters! Try a shift of 20 (1 << 20) and look at the output:
# not all tactics will be able to run, as some need scratch space beyond that size.
MAX_WORKSPACE_SIZE = 1 << 30

# With the log level set to INFO, the TensorRT library uses stderr to outline the details of the optimization path.
# Switch to the console to look at the optimizations, fusions, and tactic timings performed on the model.
LOGGER = trt.Logger(trt.Logger.INFO)
DTYPE = trt.float32

# Model
MODEL_FILE = 'mobilenet_v1_1.0_224.uff'
INPUT_NAME = 'input'
INPUT_SHAPE = (3, 224, 224)
OUTPUT_NAME = 'MobilenetV1/Predictions/Reshape_1'
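
MAX_WORKSPACE_SIZE is a byte count, so the shift amount sets the size as a power of two. The sketch below compares the 1 << 30 used here with a much smaller 1 << 20, taking the 20 suggested in the comment as a shift amount (an assumption). `describe_workspace` is a hypothetical helper for illustration, not part of TensorRT.

```python
def describe_workspace(shift):
    """Return (bytes, label) for a workspace of 1 << shift bytes.

    Hypothetical helper for illustration; not a TensorRT API.
    Integer division is exact here because the size is a power of two.
    """
    size = 1 << shift
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    value, idx = size, 0
    while value >= 1024 and idx < len(units) - 1:
        value //= 1024
        idx += 1
    return size, "{} {}".format(value, units[idx])


for shift in (20, 30):
    size, label = describe_workspace(shift)
    print("1 << {} = {} bytes ({})".format(shift, size, label))
```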


In [None]:
def allocate_buffers(engine):
    h_input = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(0)), dtype=trt.nptype(DTYPE))
    h_output = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(1)), dtype=trt.nptype(DTYPE))
    d_input = cuda.mem_alloc(h_input.nbytes)
    d_output = cuda.mem_alloc(h_output.nbytes)

    return h_input, d_input, h_output, d_output


def build_engine(model_file, fp16=False):
    with trt.Builder(LOGGER) as builder, builder.create_network() as network, trt.UffParser() as parser:
        builder.max_workspace_size = MAX_WORKSPACE_SIZE
        builder.max_batch_size = MAX_BATCH_SIZE
        if fp16:
            builder.fp16_mode = True
        parser.register_input(INPUT_NAME, INPUT_SHAPE, trt.UffInputOrder.NCHW)
        parser.register_output(OUTPUT_NAME)
        parser.parse(model_file, network, DTYPE)

        return builder.build_cuda_engine(network)


In [None]:
def load_input(img_path, host_buffer):
    print('load input')

    with Image.open(img_path) as img:
        c, h, w = INPUT_SHAPE
        dtype = trt.nptype(DTYPE)
        # Force 3-channel RGB so grayscale or RGBA files still match INPUT_SHAPE.
        img_array = np.asarray(img.convert('RGB').resize((w, h), Image.BILINEAR)).transpose([2, 0, 1]).astype(dtype).ravel()
        # preprocess for mobilenet
        img_array = img_array / 127.5 - 1.0

    np.copyto(host_buffer, img_array)


def do_inference(n, context, h_input, d_input, h_output, d_output):
    # Transfer input data to the GPU.
    cuda.memcpy_htod(d_input, h_input)

    # Run inference.
    st = time.time()
    context.execute(batch_size=1, bindings=[int(d_input), int(d_output)])
    print('Inference time {}: {} [msec]'.format(n, (time.time() - st)*1000))

    # Transfer predictions back from the GPU.
    cuda.memcpy_dtoh(h_output, d_output)

    return h_output
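
The preprocessing in load_input can be checked on its own, without a GPU or an image file. The sketch below applies the same transpose, scaling, and flattening to a synthetic array. `preprocess` is a hypothetical name, and the resize step is skipped on the assumption that the input is already 224x224.

```python
import numpy as np

INPUT_SHAPE = (3, 224, 224)  # (channels, height, width), as in the constants cell


def preprocess(hwc_uint8):
    """Map an HWC uint8 image to a flat CHW float32 array in [-1, 1].

    Hypothetical helper mirroring the array steps in load_input.
    """
    chw = hwc_uint8.transpose([2, 0, 1]).astype(np.float32)
    return (chw / 127.5 - 1.0).ravel()


# Synthetic image in which every value from 0 to 255 appears.
c, h, w = INPUT_SHAPE
fake_img = (np.arange(h * w * c) % 256).astype(np.uint8).reshape(h, w, c)

flat = preprocess(fake_img)
print(flat.shape, flat.dtype, flat.min(), flat.max())
```

The flattened length must equal the volume of the input binding, otherwise np.copyto into the page-locked host buffer fails.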

In [None]:
!ls
!ls ./calibration_images

In [None]:
LABELS = "mobilenet_labels.txt"
with open(LABELS) as f:
    labels = f.read().split('\n')

    
engine = build_engine(MODEL_FILE)
h_input, d_input, h_output, d_output = allocate_buffers(engine)





In [None]:
img_file = "./calibration_images/dolphin-203875_960_720.jpg"

from IPython.display import Image as Img
display(Img(img_file))

load_input(img_file, h_input)
with engine.create_execution_context() as context:
    
    output = do_inference(1, context, h_input, d_input, h_output, d_output)

    pred_idx = np.argsort(output)[::-1]
    pred_prob = np.sort(output)[::-1]

    print('\nClassification Result:')
    for i in range(5):
        print('{} {} {:.5f}'.format(i + 1, labels[pred_idx[i]], pred_prob[i]))
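
The time printed by do_inference covers a single call, which can include one-off start-up costs. A common way to get a steadier figure is to discard a few warm-up calls and average the rest. A minimal sketch, assuming a hypothetical helper `time_calls` and a stand-in workload in place of the synchronous `context.execute` call:

```python
import time


def time_calls(fn, warmup=3, runs=10):
    """Call fn repeatedly and return the mean time per call in milliseconds.

    Hypothetical helper; the first `warmup` calls are executed but not timed.
    """
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) * 1000.0 / runs


# Stand-in workload; in the notebook this would wrap the
# context.execute(...) call from do_inference.
mean_ms = time_calls(lambda: sum(range(10000)))
print('Mean time per call: {:.3f} [msec]'.format(mean_ms))
```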

Now let's serialize this engine so we can use it later, and also create an FP16 model:

In [None]:
with open('mobilenet.engine32', 'wb') as f:
    f.write(engine.serialize())

In [None]:
engine = build_engine(MODEL_FILE, fp16=True)
with open('mobilenet.engine16', 'wb') as f:
    f.write(engine.serialize())
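
A serialized engine can later be loaded without rebuilding it. The notebook does not show the contents of do_inference.py, so the following is only a sketch of how such a load might look, using `trt.Runtime` and `deserialize_cuda_engine` from the TensorRT Python API. The helper names are hypothetical, and TensorRT is imported inside the function so that the file-reading helper can be tried on a machine without it.

```python
def read_engine_bytes(path):
    """Read a serialized engine file into memory (hypothetical helper)."""
    with open(path, 'rb') as f:
        return f.read()


def load_engine(path):
    """Deserialize a TensorRT engine previously written with engine.serialize().

    Hypothetical helper; requires TensorRT and a CUDA-capable GPU at call time.
    """
    # Imported lazily so read_engine_bytes works without TensorRT installed.
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    runtime = trt.Runtime(logger)
    return runtime.deserialize_cuda_engine(read_engine_bytes(path))


# Usage (assumes mobilenet.engine32 was written earlier in the notebook):
# engine = load_engine('mobilenet.engine32')
```

A serialized engine is generally specific to the TensorRT version and GPU model it was built with, so it should be loaded in a matching environment.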

Create a profile with the Nsight Systems command line to explore the system-GPU interaction. We will need to download the output file mobilenet_fp16.qdrep (or mobilenet_fp32.qdrep for the FP32 engine) locally to open it in the visual explorer.

In [None]:
!nsys profile --show-output true --output mobilenet_fp16 --trace osrt,cuda,cudnn,cublas,nvtx python do_inference.py mobilenet.engine16 ./calibration_images/fish-3322230_960_720.jpg

In [None]:
!nsys profile --show-output true --output mobilenet_fp32 --trace osrt,cuda,cudnn,cublas,nvtx  python do_inference.py mobilenet.engine32 ./calibration_images/fish-3322230_960_720.jpg

Nsight Compute command to look at Tensor Core metrics:
nv-nsight-cu-cli --metrics tensor_precision_fu_utilization python do_inference.py mobilenet.engine32 ./calibration_images/fish-3322230_960_720.jpg

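To run the profiling tool over the entire network (first command) or over a single kernel launch selected with -k, -s, and -c (second command):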
!nv-nsight-cu-cli python do_inference.py mobilenet.engine16 ./calibration_images/fish-3322230_960_720.jpg
!nv-nsight-cu-cli -k fusedConvolutionReluKernel -s 11 -c 1 '/usr/bin/python' do_inference.py mobilenet.engine32 ./calibration_images/fish-3322230_960_720.jpg

To generate an Nsight Compute profile:

In [None]:
!nv-nsight-cu-cli -o profile python do_inference.py mobilenet.engine16 ./calibration_images/fish-3322230_960_720.jpg