<div align="center"><a href="https://www.nvidia.com/en-us/deep-learning-ai/education/"><img src="./assets/DLI_Header.png"></a></div>

<a id="structure"></a>
### Create Model Directory Structure


```
root@server:/models$ tree
.
└── simple-tensorrt-model
    ├── 1
    │   └── model.plan
    └── config.pbtxt

```


Below, we'll create the model directory structure for each of our TensorRT models.

In [None]:
!mkdir -p models/simple-tensorrt-fp32-model/
!mkdir -p models/simple-tensorrt-fp32-model/1/
!mkdir -p models/simple-tensorrt-fp16-model/
!mkdir -p models/simple-tensorrt-fp16-model/1/
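
The same layout can also be created from Python, which makes it easy to confirm that each version directory exists before a plan is written into it. This is a minimal sketch; the helper name `make_model_dirs` and its `version` argument are illustrative and not part of the original notebook.

```python
from pathlib import Path


def make_model_dirs(root, model_names, version="1"):
    """Create <root>/<model>/<version>/ for each model and return the paths."""
    created = []
    for name in model_names:
        version_dir = Path(root) / name / version
        # parents=True mirrors `mkdir -p`; exist_ok=True makes reruns harmless.
        version_dir.mkdir(parents=True, exist_ok=True)
        created.append(version_dir)
    return created
```

For the two models in this section, the call would be `make_model_dirs("models", ["simple-tensorrt-fp32-model", "simple-tensorrt-fp16-model"])`.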

<a id="model"></a>
### Convert ONNX to TensorRT



In [None]:
# !trtexec

In [None]:
!trtexec \
  --onnx=models/simple-onnx-model/1/model.onnx \
  --explicitBatch \
  --optShapes=actual_input_1:16x3x224x224 \
  --maxShapes=actual_input_1:32x3x224x224 \
  --minShapes=actual_input_1:1x3x224x224 \
  --shapes=actual_input_1:1x3x224x224 \
  --saveEngine=models/simple-tensorrt-fp32-model/1/model.plan


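To convert our ONNX representation to a TensorRT plan, we point `trtexec` at our `model.onnx` file with `--onnx` and specify where to write the newly created TensorRT plan with `--saveEngine`. The `--minShapes`, `--optShapes`, and `--maxShapes` flags define the range of input shapes the plan accepts: here batch sizes from 1 to 32, tuned for a batch size of 16.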

By adding the `--fp16` flag, we allow TensorRT to use FP16 precision when it builds the plan. The main benefits of FP16 are that it is faster on GPUs with hardware FP16 support and that it uses less memory, since each value occupies 2 bytes instead of 4.
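
The memory saving can be illustrated with NumPy alone. The sketch below is not part of the TensorRT workflow: it allocates a dummy batch with the `--optShapes` dimensions used above and compares its size in FP32 and FP16, then shows the coarser precision that FP16 trades for that saving. The variable names are illustrative.

```python
import numpy as np

# Dummy batch with the same dimensions as --optShapes (16x3x224x224).
batch_shape = (16, 3, 224, 224)
fp32_batch = np.zeros(batch_shape, dtype=np.float32)
fp16_batch = fp32_batch.astype(np.float16)

# FP16 stores each value in 2 bytes instead of 4, halving the footprint.
print("FP32 bytes:", fp32_batch.nbytes)
print("FP16 bytes:", fp16_batch.nbytes)

# The cost is precision: machine epsilon is much larger for FP16.
print("FP32 eps:", np.finfo(np.float32).eps)
print("FP16 eps:", np.finfo(np.float16).eps)
```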

In [None]:
!trtexec \
  --onnx=models/simple-onnx-model/1/model.onnx \
  --explicitBatch \
  --optShapes=actual_input_1:16x3x224x224 \
  --maxShapes=actual_input_1:32x3x224x224 \
  --minShapes=actual_input_1:1x3x224x224 \
  --shapes=actual_input_1:1x3x224x224 \
  --saveEngine=models/simple-tensorrt-fp16-model/1/model.plan \
  --fp16
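
The two `trtexec` invocations differ only in the output path and the `--fp16` flag, so the argument list can be assembled by a small helper. This is a sketch under that assumption: `shape_flag` and `build_trtexec_args` are hypothetical names, and the helper only emits flags that already appear in the cells above. It builds the list without running anything.

```python
def shape_flag(flag, input_name, dims):
    """Return a shape argument such as --optShapes=name:16x3x224x224."""
    dims_text = "x".join(str(d) for d in dims)
    return f"--{flag}={input_name}:{dims_text}"


def build_trtexec_args(onnx_path, engine_path, input_name,
                       min_dims, opt_dims, max_dims, run_dims, fp16=False):
    """Assemble the trtexec argument list used in this section."""
    args = [
        "trtexec",
        f"--onnx={onnx_path}",
        "--explicitBatch",
        shape_flag("optShapes", input_name, opt_dims),
        shape_flag("maxShapes", input_name, max_dims),
        shape_flag("minShapes", input_name, min_dims),
        shape_flag("shapes", input_name, run_dims),
        f"--saveEngine={engine_path}",
    ]
    if fp16:
        # Only the FP16 build adds this flag.
        args.append("--fp16")
    return args
```

The resulting list could be passed to `subprocess.run(args, check=True)` in an environment where `trtexec` is installed.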


<a id="configuration"></a>
### Create Configuration File


In [None]:
configuration = """
name: "simple-tensorrt-fp32-model"
platform: "tensorrt_plan"
max_batch_size: 32
input [
  {
    name: "actual_input_1"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output1"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
"""

with open('models/simple-tensorrt-fp32-model/config.pbtxt', 'w') as file:
    file.write(configuration)

We'll also create a configuration file for the TensorRT FP16 model. Note that the input and output data types remain FP32: the internal layers and activations of the network use FP16, but the data we send to and receive from the model is still FP32.

In [None]:
configuration = """
name: "simple-tensorrt-fp16-model"
platform: "tensorrt_plan"
max_batch_size: 32
input [
  {
    name: "actual_input_1"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output1"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
"""

with open('models/simple-tensorrt-fp16-model/config.pbtxt', 'w') as file:
    file.write(configuration)
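
Because the two configuration files differ only in the `name` field, they could also be rendered from a single template. The following is a minimal sketch: `CONFIG_TEMPLATE` and `render_config` are illustrative names, and the field values are copied from the cells above.

```python
from string import Template

# $model_name is the only placeholder; everything else matches the cells above.
CONFIG_TEMPLATE = Template("""
name: "$model_name"
platform: "tensorrt_plan"
max_batch_size: 32
input [
  {
    name: "actual_input_1"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output1"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
""")


def render_config(model_name):
    """Return the config.pbtxt text for the given model name."""
    return CONFIG_TEMPLATE.substitute(model_name=model_name)
```

Writing `render_config(name)` to `models/<name>/config.pbtxt` for each model name would reproduce both files.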

<a id="load"></a>
### Load Model in Triton Inference Server




In [None]:
!sleep 45

In [None]:
!curl -v triton:8000/v2/health/ready

The HTTP request returns status 200 if Triton is ready and non-200 if it is not ready.
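
As an alternative to a fixed sleep, the readiness endpoint can be polled until it returns 200. This sketch uses only the Python standard library; the helper names `http_ready` and `wait_until_ready`, and the retry count and delay, are assumptions chosen for illustration.

```python
import time
import urllib.error
import urllib.request


def http_ready(url, timeout=2.0):
    """Return True if a GET on `url` answers with HTTP status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(check, retries=30, delay=1.5):
    """Call `check` up to `retries` times, sleeping `delay` seconds between tries."""
    for _ in range(retries):
        if check():
            return True
        time.sleep(delay)
    return False
```

In this environment the call would look like `wait_until_ready(lambda: http_ready("http://triton:8000/v2/health/ready"))`.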




In [None]:
!curl -v triton:8000/v2/models/simple-tensorrt-fp32-model

In [None]:
!curl -v triton:8000/v2/models/simple-tensorrt-fp16-model

<a id="infer"></a>
### Send Inference Request to Server




In [None]:
import tritonclient.http as tritonhttpclient
from tritonclient.utils import triton_to_np_dtype

In [None]:
import json

with open('./imagenet-simple-labels.json') as file:
    labels = json.load(file)

In [None]:
import numpy as np
from PIL import Image


img_path = './assets/goldfish.jpg'
image_pil = Image.open(img_path)
image_pil

In [None]:
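from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.resnet50 import preprocess_input
import numpy as np


# Load the image resized to the 224x224 resolution the model expects.
img = image.load_img(img_path, target_size=(224, 224))
image_numpy = image.img_to_array(img)              # height x width x channels
image_numpy = np.expand_dims(image_numpy, axis=0)  # add the batch dimension
image_numpy = preprocess_input(image_numpy)        # Keras ResNet50 preprocessing
# Reorder NHWC -> NCHW to match FORMAT_NCHW in config.pbtxt, then divide by 255.
image_numpy = np.transpose(image_numpy, [0, 3, 1, 2]) / 255.
print(image_numpy.shape)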
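
The transpose in the cell above converts the image from channels-last (NHWC) to channels-first (NCHW) layout. The sketch below shows the same axis reordering on a dummy array, so it runs without TensorFlow or an image file; the array contents are placeholders.

```python
import numpy as np

# Stand-in for one 224x224 RGB image in channels-last (NHWC) layout.
nhwc = np.arange(1 * 224 * 224 * 3, dtype=np.float32).reshape(1, 224, 224, 3)

# Move the channel axis (index 3) to position 1: NHWC -> NCHW.
nchw = np.transpose(nhwc, [0, 3, 1, 2])
print(nchw.shape)  # (1, 3, 224, 224)

# The value at row 5, column 7, channel 2 is unchanged; only its index moved.
assert nhwc[0, 5, 7, 2] == nchw[0, 2, 5, 7]
```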

Next, we'll define the input and output names of our model, the model name, the URL where our models are deployed with Triton Inference Server (in this case the host `triton` on port 8000), and the model version.

In [None]:
VERBOSE = False
input_name = 'actual_input_1'
output_name = 'output1'
model_name = 'simple-tensorrt-fp32-model'
url = 'triton:8000'
model_version = '1'

In [None]:
triton_client = tritonhttpclient.InferenceServerClient(url=url, verbose=VERBOSE)
model_metadata = triton_client.get_model_metadata(model_name=model_name, model_version=model_version)
model_config = triton_client.get_model_config(model_name=model_name, model_version=model_version)

In [None]:
# Describe the input tensor: name, shape (a batch of one image), and datatype.
input0 = tritonhttpclient.InferInput(input_name, (1, 3, 224, 224), 'FP32')
input0.set_data_from_numpy(image_numpy, binary_data=False)

output = tritonhttpclient.InferRequestedOutput(output_name, binary_data=False)
response = triton_client.infer(model_name, model_version=model_version,
                               inputs=[input0], outputs=[output])
# Read the result back by the same output name we requested.
logits = response.as_numpy(output_name)
logits = np.asarray(logits, dtype=np.float32)

In [None]:
print(labels[np.argmax(logits)])
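
Beyond the single best label, it can be useful to inspect the top few predictions with their probabilities. This is a minimal sketch, assuming the model output is a vector of raw logits; `top_k_labels` is a hypothetical helper, not part of `tritonclient`.

```python
import numpy as np


def top_k_labels(logits, labels, k=5):
    """Return the k most likely (label, probability) pairs, best first."""
    scores = np.asarray(logits, dtype=np.float32).ravel()
    # Numerically stable softmax: subtract the maximum before exponentiating.
    exp_scores = np.exp(scores - scores.max())
    probs = exp_scores / exp_scores.sum()
    # argsort is ascending, so reverse it and keep the first k indices.
    order = np.argsort(probs)[::-1][:k]
    return [(labels[i], float(probs[i])) for i in order]
```

With the variables from the cells above, `top_k_labels(logits, labels)` would list the five most likely classes.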

In [None]:
VERBOSE = FIXME
input_name = FIXME
input_shape = FIXME
input_dtype = FIXME
output_name = FIXME
model_name = FIXME
url = FIXME
model_version = FIXME

In [None]:
VERBOSE = False
input_name = 'actual_input_1'
input_shape = (1, 3, 224, 224)
input_dtype = 'FP32'
output_name = 'output1'
model_name = 'simple-tensorrt-fp16-model'
url = 'triton:8000'
model_version = '1'

In [None]:
triton_client = FIXME
model_metadata = FIXME
model_config = FIXME

In [None]:
triton_client = tritonhttpclient.InferenceServerClient(url=url, verbose=VERBOSE)
model_metadata = triton_client.get_model_metadata(model_name=model_name, model_version=model_version)
model_config = triton_client.get_model_config(model_name=model_name, model_version=model_version)

In [None]:
import IPython
IPython.Application.instance().kernel.do_shutdown(True)