<p> <center> <a href="../Start_here.ipynb">Home Page</a> </center> </p>

<div>
    <span style="float: left; width: 33%; text-align: left;"><a href="6.Challenge_DeepStream.ipynb">Previous Notebook</a></span>
    <span style="float: left; width: 34%; text-align: center;">
        <a href="1.Data_labeling_and_preprocessing.ipynb">1</a>
        <a href="2.Object_detection_using_TAO_YOLOv4.ipynb">2</a>
        <a href="3.Model_deployment_with_Triton_Inference_Server.ipynb">3</a>
        <a href="4.Model_deployment_with_DeepStream.ipynb">4</a>
        <a href="5.Measure_object_size_using_OpenCV.ipynb">5</a>
        <a href="6.Challenge_DeepStream.ipynb">6</a>
        <a >7</a>
    </span>
</div>

# Exercise: model deployment with Triton Inference Server

***

In this notebook, you will review the concepts learned in [3.Model_deployment_with_Triton_Inference_Server.ipynb](3.Model_deployment_with_Triton_Inference_Server.ipynb) while trying to deploy your NVIDIA® TAO Toolkit model to Triton™ Inference Server and improve performance with inference optimization.

As an exercise, you are asked to re-implement the same HTTP and gRPC inference pipelines that have been analyzed in the tutorial notebook.

<img src="images/triton_inference_server.jpg" width="720">
<div style="font-size:11px">Source: https://developer.nvidia.com/nvidia-triton-inference-server</div><br>

Let us get started with the challenge. You will have to fill in the `COMPLETE THIS SECTION` parts of the code present in the notebook to complete the pipelines. Feel free to refer to the previous notebooks for the commands but make sure to grasp the most important underlying concepts.

## Setup server and client

**Server**

To successfully execute the code in this notebook, you should already have an instance of Triton Inference Server running. Please relaunch the server container following the instructions in the `README` file if you shut it down previously. Remember to use the container in polling mode, so that changes you make to the model repository while running the code cells will be detected periodically and Triton will attempt to load and unload models as necessary based on those changes. If you are using Docker, you can launch the container by running the command below.

```
docker run \
  --gpus=1 --rm \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v /full/path/to/model/repository:/models \
  nvcr.io/nvidia/tritonserver:<yy.mm>-py3 \
  tritonserver \
  --model-repository=/models \
  --exit-on-error=false \
  --model-control-mode=poll \
  --repository-poll-secs 30
```

The `--gpus=1` flag makes one system GPU available to Triton for inference, while `<yy.mm>` is the version of the Triton container image that you want to pull from the NVIDIA NGC container registry (`nvcr.io`). Replace `/full/path/to/model/repository` with the path to your model repository.

**Client**

The Triton client libraries provide application programming interfaces (APIs) that make it easy to communicate with Triton from a C++ or Python application. They are already installed in the environment from which [3.Model_deployment_with_Triton_Inference_Server.ipynb](3.Model_deployment_with_Triton_Inference_Server.ipynb) was executed, so please make sure you are running this exercise from the same virtual environment or container. If in doubt, follow the instructions in the `README` file.

## Create the model repository

Triton Inference Server stores available models in the model repository. The directory where the models reside inside the container is specified when starting the server instance, using the `--model-repository=/models` flag of `tritonserver`. Each model then resides in its own subdirectory within the main model repository (i.e. each directory within `/models` represents a unique model). For example, in this notebook, we will deploy the TensorRT engine generated from the TAO training in the `yolov4_tao_challenge` subdirectory.

The layout of a minimal model repository should look like this:

```
models
└── yolov4_tao_challenge
    ├── 1
    │   └── model.plan
    └── config.pbtxt
```

For more details on how to work with model repositories and model directory structures in Triton Inference Server, please check the documentation [here](https://github.com/triton-inference-server/server/blob/r22.07/docs/model_repository.md).
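
As a quick sanity check of the layout above, the following is a minimal sketch (not part of the bootcamp code) that inspects a single model directory. The helper name `check_model_dir` is hypothetical, and the checks assume the minimal layout shown here: a `config.pbtxt` file plus numbered version folders that each contain `model.plan`.

```python
from pathlib import Path
import tempfile


def check_model_dir(model_dir):
    """Return a list of problems found in one model directory (hypothetical helper)."""
    model_dir = Path(model_dir)
    problems = []
    if not (model_dir / "config.pbtxt").is_file():
        problems.append("missing config.pbtxt")
    # Version folders are numerically named subdirectories such as "1".
    versions = []
    if model_dir.is_dir():
        versions = [d for d in model_dir.iterdir() if d.is_dir() and d.name.isdigit()]
    if not versions:
        problems.append("no numeric version directory")
    for version in versions:
        if not (version / "model.plan").is_file():
            problems.append(f"version {version.name} has no model.plan")
    return problems


# Demonstration on a throwaway directory, so nothing in ../models is touched.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "yolov4_tao_challenge"
    (demo_dir / "1").mkdir(parents=True)
    before = check_model_dir(demo_dir)  # engine and config are still missing
    (demo_dir / "1" / "model.plan").write_bytes(b"")
    (demo_dir / "config.pbtxt").write_text('platform: "tensorrt_plan"\n')
    after = check_model_dir(demo_dir)   # layout is now complete

print(before)
print(after)
```

Pointing the same helper at `../models/yolov4_tao_challenge` once the cells below have run should, under these assumptions, return an empty list.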

Below, we'll create the model directory structure for our TensorRT model and copy the engine we generated in the previous [2.Object_detection_using_TAO_YOLOv4.ipynb](2.Object_detection_using_TAO_YOLOv4.ipynb) notebook to the newly prepared folder.

In [None]:
!mkdir -p ../models/yolov4_tao_challenge/1/
# Copy the TensorRT engine and rename it to match the default name model.plan
!cp ../yolo_v4/export/trt.engine ../models/yolov4_tao_challenge/1/model.plan

## Create configuration file

With our TAO model already defined and exported in TensorRT plan representation, we now focus on creating the configuration file that provides required and optional information about the model.

A minimal model configuration must specify the `platform` and/or `backend` properties, the `max_batch_size` property, and the input and output tensors of the model (name, data type, and shape). A YOLOv4 model has 1 input node `Input` and 4 output nodes `BatchedNMS`, `BatchedNMS_1`, `BatchedNMS_2` and `BatchedNMS_3`.

For more details on how to create model configuration files within Triton Inference Server, please see the documentation [here](https://github.com/triton-inference-server/server/blob/r22.07/docs/model_configuration.md).
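
To illustrate the general shape of the `input` and `output` sections without giving away the exercise, here is a small sketch that renders them for a made-up classifier. The tensor names `images` and `scores`, their dimensions, and the helper functions are assumptions chosen for illustration only; they are not the values of the YOLOv4 model. Note that when `max_batch_size` is greater than 0, the batch dimension is left out of `dims`.

```python
def tensor_entry(name, data_type, dims):
    """Render one tensor description in config.pbtxt text format."""
    dims_text = ", ".join(str(d) for d in dims)
    return (
        "  {\n"
        f'    name: "{name}"\n'
        f"    data_type: {data_type}\n"
        f"    dims: [ {dims_text} ]\n"
        "  }"
    )


def tensor_section(kind, entries):
    """Render an `input [...]` or `output [...]` section from (name, type, dims) tuples."""
    body = ",\n".join(tensor_entry(*entry) for entry in entries)
    return f"{kind} [\n{body}\n]\n"


# Hypothetical classifier, NOT the YOLOv4 model from this exercise.
example = (
    'platform: "tensorrt_plan"\n'
    "max_batch_size: 8\n"
    + tensor_section("input", [("images", "TYPE_FP32", (3, 224, 224))])
    + tensor_section("output", [("scores", "TYPE_FP32", (1000,))])
)
print(example)
```

For the exercise, the real tensor names are listed above and the data types and shapes can be taken from the tutorial notebook.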

In [None]:
############# ~~~~~~~ COMPLETE THIS SECTION ~~~~~~~ #############
configuration = """
platform: "tensorrt_plan"
max_batch_size: 16
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
"""
###################### ~~~~~~~ END ~~~~~~~ ######################

with open("../models/yolov4_tao_challenge/config.pbtxt", 'w') as file:
    file.write(configuration)

## Check loaded model in Triton Inference Server

With the model repository created, the TensorRT model defined and exported, and the configuration file written, we will now wait for Triton Inference Server to load our model. The server was started in polling mode and checks the model repository for changes every 30 seconds, so please run the cell below to allow enough time to pass before proceeding (15 seconds have been added just to be safe).

In [None]:
!sleep 45

At this point, our model should be deployed and ready to use! To confirm Triton Inference Server is up and running, we can inspect the output of a `curl` request to the URL below.

In [None]:
!curl -v localhost:8000/v2/health/ready

The HTTP request returns status 200 if Triton is ready and non-200 if it is not ready.

We can also send a `curl` request to our model endpoints to confirm our model is deployed and ready to use. This `curl` request returns status 200 if the model is ready and non-200 if it is not ready. 

Additionally, we will also see information about our model such as:
- The name of our model.
- The versions available for our model.
- The backend platform (e.g. `tensorrt_plan`).
- The inputs and outputs, with their respective names, data types, and shapes.

In [None]:
!curl -v localhost:8000/v2/models/yolov4_tao_challenge
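
As an alternative to a fixed `sleep`, the readiness endpoints can be polled from Python. The sketch below is one possible way to do that, using only the standard library. The helper names `http_ready` and `wait_until_ready` are hypothetical, and the commented URL assumes the server from this notebook is reachable on `localhost:8000`. The demonstration at the end uses a stand-in probe so that it runs without a server.

```python
import time
import urllib.error
import urllib.request


def http_ready(url, timeout=2.0):
    """Return True if a GET request to url is answered with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx answer: not ready yet.
        return False


def wait_until_ready(probe, attempts=10, delay=5.0, sleep=time.sleep):
    """Call probe() until it returns True; return the attempt number or None."""
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        sleep(delay)
    return None


# Possible usage against the server of this notebook (not executed here):
# url = "http://localhost:8000/v2/models/yolov4_tao_challenge/ready"
# wait_until_ready(lambda: http_ready(url))

# Offline demonstration: a stand-in probe that succeeds on the third call.
answers = iter([False, False, True])
attempts_needed = wait_until_ready(lambda: next(answers), delay=0, sleep=lambda s: None)
print(attempts_needed)
```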

## Send inference request to the server

With our model deployed and ready to use, it is now time to send inference requests to it. We'll start by loading the `tritonclient.http` module and defining a set of variables including the name of our model, the URL where it is deployed, the model version, and paths from which to load the images and where to save the processed outputs. Make sure to use the newly created model, not the one loaded for the previous lab.

In [None]:
import sys
import tritonclient.http as httpclient
from tritonclient.utils import InferenceServerException
sys.path.append("../source_code/N3")
from utils import convert_http_metadata_config

############# ~~~~~~~ COMPLETE THIS SECTION ~~~~~~~ #############
verbose =
url =
model_name =
model_version =
protocol =
batch_size =
###################### ~~~~~~~ END ~~~~~~~ ######################
image_filename = "../data/testing/image_2/"
output_path = "../source_code/challenge_triton/triton_output_http"

import os
if not os.path.exists("../source_code/challenge_triton"):
    !mkdir ../source_code/challenge_triton

Then, we instantiate the Triton Client and get access to additional properties from the model metadata and configuration.

In [None]:
print("Running the inference client \n")

try:
    triton_client = httpclient.InferenceServerClient(
        url=url, verbose=verbose)
except Exception as e:
    print("client creation failed: " + str(e))
    sys.exit(1)

# Make sure the model matches our requirements, and get some
# properties of the model that we need for preprocessing
try:
    model_metadata = triton_client.get_model_metadata(
        model_name=model_name, model_version=model_version)
except InferenceServerException as e:
    print("failed to retrieve the metadata: " + str(e))
    sys.exit(1)

try:
    model_config = triton_client.get_model_config(
        model_name=model_name, model_version=model_version)
except InferenceServerException as e:
    print("failed to retrieve the config: " + str(e))
    sys.exit(1)

model_metadata, model_config = convert_http_metadata_config(
    model_metadata, model_config)

Next, we load the model and process the images from our input directory by converting, resizing, and loading them into a data structure.

In [None]:
from yolov4_model import YOLOv4Model
from tritonclient.utils import triton_to_np_dtype
import os
from frame import Frame

triton_model = YOLOv4Model.from_metadata(model_metadata, model_config)
max_batch_size = triton_model.max_batch_size
target_shape = (triton_model.c, triton_model.h, triton_model.w)
npdtype = triton_to_np_dtype(triton_model.triton_dtype)

print("\nLoading images... \n")

frames = []

if os.path.exists(image_filename):
    # The input is a folder of images
    if os.path.isdir(image_filename):
        frames = [
            Frame(os.path.join(image_filename, f),
                triton_model.data_format,
                npdtype,
                target_shape)
            for f in os.listdir(image_filename)
            if os.path.isfile(os.path.join(image_filename, f)) and
            os.path.splitext(f)[-1] in [".jpg", ".jpeg", ".png"]
        ]
    # The input is an image
    else:
        frames = [
            Frame(os.path.join(image_filename),
                triton_model.data_format,
                npdtype,
                target_shape)
        ]
    print("Done! \n")
else:
    print("No images found, please specify a valid path \n")

Finally, we use a request generator to submit our inputs to the Triton Inference Server using the `triton_client.infer()` method, specifying our model name, version, inputs, and outputs. The responses we get are stored in a list.

In [None]:
from tqdm import tqdm
import numpy as np
from utils import requestGenerator
import time

# Send requests of batch_size images. If the number of
# images isn't an exact multiple of batch_size then just
# start over with the first images until the batch is filled.

print("Sending inference request for batches of data \n")

responses = []
image_idx = 0
last_request = False
sent_count = 0
pbar_total = len(frames)

start_time = time.time()

with tqdm(total=pbar_total) as pbar:
    while not last_request:
        batched_image_data = None

        repeated_image_data = []

        for idx in range(batch_size):
            frame = frames[image_idx]

            img = frame._load_img()
            repeated_image_data.append(img)

            image_idx = (image_idx + 1) % len(frames)
            if image_idx == 0:
                last_request = True

        if max_batch_size > 0:
            batched_image_data = np.stack(repeated_image_data, axis=0)
        else:
            batched_image_data = repeated_image_data[0]

        # Send request
        try:
            req_gen_args = [batched_image_data, triton_model.input_names,
                triton_model.output_names, triton_model.triton_dtype,
                protocol.lower()]
            req_generator = requestGenerator(*req_gen_args)
            for inputs, outputs in req_generator:
                sent_count += 1

                responses.append(
                    triton_client.infer(model_name,
                                        inputs,
                                        request_id=str(sent_count),
                                        model_version=model_version,
                                        outputs=outputs))

        except InferenceServerException as e:
            print("inference failed: " + str(e))
            sys.exit(1)
        
        pbar.update(batch_size)

end_time = time.time()

print("Average latency: ~{} seconds".format((end_time - start_time) / sent_count))
print("Average throughput: ~{} examples / second".format(batch_size * sent_count / (end_time - start_time)))
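
The wrap-around behavior of the request loop above can be seen in isolation with a few lines of Python. This sketch reproduces only the index bookkeeping (the function name `batch_indices` is hypothetical) and shows that the last batch is padded with images from the start of the list when the number of images is not a multiple of `batch_size`.

```python
def batch_indices(n_items, batch_size):
    """Return the image indices placed in each batch by the request loop."""
    batches = []
    idx = 0
    last_request = False
    while not last_request:
        batch = []
        for _ in range(batch_size):
            batch.append(idx)
            idx = (idx + 1) % n_items
            if idx == 0:
                # Every image has been used at least once; finish this batch and stop.
                last_request = True
        batches.append(batch)
    return batches


uneven = batch_indices(5, 2)  # 5 images, batches of 2
even = batch_indices(4, 2)    # 4 images, batches of 2
print(uneven)
print(even)
```

One consequence is that the padded duplicates are included in `batch_size * sent_count`, so the reported throughput counts them as processed examples.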

The responses we get need to be decoded and converted to a NumPy array. Fill in the cell below to examine the shapes of a sample output after the conversion to NumPy.

In [None]:
############# ~~~~~~~ COMPLETE THIS SECTION ~~~~~~~ #############
sample_output =

output_names =
output_array =       
  
for output_name in output_names:
    output_array.append( )

print([a.shape for a in output_array])
###################### ~~~~~~~ END ~~~~~~~ ######################

We recognize the four output shapes of our model but to convert these numbers into an output we can read and visualize, we pass the responses to a specific postprocessor that renders images with bounding boxes at `$output_path/infer_images` and labels in KITTI format at `$output_path/infer_labels`. 

In [None]:
from yolov4_postprocessor import YOLOv4Postprocessor

print("Gathering responses from the server and post-processing the inferenced outputs \n")

args_postprocessor = [
    batch_size, frames, output_path, triton_model.data_format
]

postprocessor = YOLOv4Postprocessor(*args_postprocessor)

processed_request = 0
with tqdm(total=len(frames)) as pbar:
    while processed_request < sent_count:
        response = responses[processed_request]

        this_id = response.get_response()["id"]

        postprocessor.apply(
            response, this_id, render=True
        )
        processed_request += 1
        pbar.update(batch_size)

Let's observe the output on the test images to confirm that the model is working correctly.

In [None]:
# Simple grid visualizer
import matplotlib.pyplot as plt
from math import ceil

def visualize_images(image_dir, num_cols=4, num_images=10):
    num_rows = int(ceil(float(num_images) / float(num_cols)))
    # squeeze=False keeps axarr two-dimensional even with a single row or column
    f, axarr = plt.subplots(num_rows, num_cols, figsize=[80,30], squeeze=False)
    f.tight_layout()
    a = [os.path.join(image_dir, image) for image in os.listdir(image_dir) 
         if os.path.splitext(image)[1].lower() == '.png']
    for idx, img_path in enumerate(a[:num_images]):
        col_id = idx % num_cols
        row_id = idx // num_cols
        img = plt.imread(img_path)
        axarr[row_id, col_id].imshow(img)
        
# Visualizing the sample images
OUTPUT_PATH = os.path.join(output_path, 'infer_images')
COLS = 3 # number of columns in the visualizer grid
IMAGES = 9 # number of images to visualize

visualize_images(OUTPUT_PATH, num_cols=COLS, num_images=IMAGES)

With this, we have successfully run HTTP inference using Triton with our object detection model and rendered the results in a useful format. As you may have noticed, a lot of work is required for preprocessing and postprocessing of the results, while inference itself does not require a lot of code and could be simplified even more. As inference is at the heart of this lab, you are now asked to speed it up even further using inference optimization tricks.

## Improve inference performance

Here we quickly go through a list of things to help deliver maximum performance. These include variable batch size, dynamic batching, gRPC protocol, and asynchronous inference.

### Variable batch size

In our example, we have worked with data inputs that have a batch size of 1. However, we might often want to use different batch sizes such as 4, 8, 16, or even higher. This has a natural tradeoff of latency and throughput. Since our batches are larger, it might take longer to process an individual batch - increasing the latency. However, since the GPU has more data to work with and we're less constrained by networking and I/O, we might see an increase in throughput - or the number of examples that can be processed per second. Depending on the application, this might be a good way to go. Feel free to go back and vary the batch size to see the impact it has on latency and throughput.
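
The tradeoff can be made concrete with the same two formulas used in the timing cell above. The timings in this sketch are hypothetical numbers chosen only to illustrate the arithmetic; they are not measurements.

```python
def summarize(batch_size, n_requests, elapsed_seconds):
    """Return (seconds per request, images per second)."""
    latency = elapsed_seconds / n_requests
    throughput = batch_size * n_requests / elapsed_seconds
    return latency, throughput


# Hypothetical timings for the same 64 images, for illustration only.
small = summarize(batch_size=1, n_requests=64, elapsed_seconds=3.2)
large = summarize(batch_size=8, n_requests=8, elapsed_seconds=1.6)

for label, (latency, throughput) in [("batch_size=1", small), ("batch_size=8", large)]:
    print(f"{label}: {latency:.3f} s per request, {throughput:.1f} images/s")
```

With these made-up numbers, the larger batch takes four times longer per request (0.2 s instead of 0.05 s) yet doubles the throughput (40 instead of 20 images per second).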

### Dynamic batching

For most models, the Triton feature that provides the largest performance improvement is dynamic batching. This is a feature that allows individual inference requests to be combined by the server, creating batches dynamically. As we said just above, creating a batch of requests typically results in increased throughput since it executes much more efficiently on the GPU. To enable dynamic batching, simply add the following:

```
dynamic_batching { }
```

to the model configuration file to enable dynamic batching with all default settings. By default, the dynamic batcher will create batches as large as possible up to the maximum batch size and will not delay when forming batches. 

This behavior can be modified by specifying the `preferred_batch_size` property, which indicates the batch sizes that the dynamic batcher should attempt to create, and the `max_queue_delay_microseconds` property, which sets how long a request may wait in the queue for a batch of a preferred size to form before it is sent as is (even if the batch is not of a preferred size). For more information on this, please check the [model configuration](https://github.com/triton-inference-server/server/blob/r22.07/docs/model_configuration.md) and [model optimization](https://github.com/triton-inference-server/server/blob/r22.07/docs/optimization.md) docs.
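
As a sketch of how these two properties fit into the configuration text, the helper below builds a `dynamic_batching` block as a string that could be appended to a configuration. The function name is hypothetical, and the values `[4, 8]` and `100` are arbitrary examples, not tuned recommendations for this model.

```python
def dynamic_batching_block(preferred_batch_sizes=None, max_queue_delay_us=None):
    """Build the text of a dynamic_batching section for config.pbtxt."""
    lines = []
    if preferred_batch_sizes:
        sizes = ", ".join(str(size) for size in preferred_batch_sizes)
        lines.append(f"  preferred_batch_size: [ {sizes} ]")
    if max_queue_delay_us is not None:
        lines.append(f"  max_queue_delay_microseconds: {max_queue_delay_us}")
    if not lines:
        # No options given: enable dynamic batching with default settings.
        return "dynamic_batching { }\n"
    return "dynamic_batching {\n" + "\n".join(lines) + "\n}\n"


default_block = dynamic_batching_block()
tuned_block = dynamic_batching_block([4, 8], 100)  # example values only
print(default_block)
print(tuned_block)
```

Preferred batch sizes should not exceed `max_batch_size`, and a longer queue delay trades some latency for a better chance of forming larger batches.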

Below, you can modify our model configuration file so that Triton Inference Server will deploy it using dynamic batching.

In [None]:
############# ~~~~~~~ COMPLETE THIS SECTION ~~~~~~~ #############
configuration = """
platform: "tensorrt_plan"
max_batch_size: 16
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
"""
###################### ~~~~~~~ END ~~~~~~~ ######################

with open("../models/yolov4_tao_challenge/config.pbtxt", 'w') as file:
    file.write(configuration)

In [None]:
!sleep 45

### Asynchronous inference

So far, our requests have been submitted to Triton Inference Server synchronously. In other words, we submit a request to Triton, which computes and returns the result, and then we submit our next request. However, it is also possible to submit as many requests as possible, allow Triton to queue requests it hasn't yet processed, and return results as soon as they are computed. This paradigm is known as asynchronous inferencing and can result in some impressive speedups for throughput.
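
The difference between the two paradigms can be illustrated without a server. In this sketch, `fake_infer` is a stand-in that only sleeps to imitate a round trip, and a thread pool plays the role of a server that can work on several queued requests at once. The numbers it prints say nothing about Triton itself; the real gain depends on the model, the hardware, and the server configuration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_infer(request_id, service_time=0.05):
    """Stand-in for a round trip to a server; it only sleeps."""
    time.sleep(service_time)
    return request_id


def run_sync(n_requests):
    """Send one request at a time and wait for each answer."""
    start = time.perf_counter()
    results = [fake_infer(i) for i in range(n_requests)]
    return results, time.perf_counter() - start


def run_async(n_requests, workers=8):
    """Submit all requests up front, then collect the answers."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fake_infer, i) for i in range(n_requests)]
        results = [future.result() for future in futures]
    return results, time.perf_counter() - start


sync_results, sync_time = run_sync(8)
async_results, async_time = run_async(8)
print(f"synchronous:  {sync_time:.2f} s")
print(f"asynchronous: {async_time:.2f} s")
```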

### gRPC protocol

Last but not least, let's say a few words about switching protocol to gRPC. As we discussed, clients can communicate with Triton using either the HTTP or the gRPC protocol. Most people are familiar with HTTP, which is the backbone of the web, but gRPC is a newer, open-source remote procedure call system, initially developed at Google and released in 2015, that uses HTTP/2 for transport and protocol buffers as the interface description language. It is highly efficient and easy to use: all you need to do is switch to the `tritonclient.grpc.InferenceServerClient` class, change the inference server URL, and make a few other small changes to the pipeline. Switching protocol can have a significant impact on latency and throughput, so remember that gRPC exists!
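
One of the small pipeline changes visible in this notebook is how the request id is read from a response: the HTTP cells use `response.get_response()["id"]`, while the gRPC cells use `response.get_response().id`. The sketch below shows a helper that accepts either form. The helper name `response_id` and the two `Fake...` classes are hypothetical stand-ins so that the example runs without the Triton client libraries.

```python
from types import SimpleNamespace


def response_id(response):
    """Read the request id from an HTTP-style or gRPC-style result."""
    payload = response.get_response()
    if isinstance(payload, dict):
        # HTTP-style results expose the response as a dictionary.
        return payload["id"]
    # gRPC-style results expose the response as an object with attributes.
    return payload.id


class FakeHttpResult:
    """Stand-in that mimics only the dictionary access used in the HTTP cells."""

    def get_response(self):
        return {"id": "7"}


class FakeGrpcResult:
    """Stand-in that mimics only the attribute access used in the gRPC cells."""

    def get_response(self):
        return SimpleNamespace(id="7")


http_id = response_id(FakeHttpResult())
grpc_id = response_id(FakeGrpcResult())
print(http_id, grpc_id)
```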


## Analyze the impact of inference optimization

You are now asked to implement the aforementioned strategies in this notebook and see the effect they have on performance. In particular, you will add both asynchronous inference and the gRPC protocol to the pipeline. The model configuration file has already been updated to make use of dynamic batching. Let's import the `tritonclient.grpc` module and set the new URL for gRPC protocol requests.

In [None]:
import tritonclient.grpc as grpcclient

############# ~~~~~~~ COMPLETE THIS SECTION ~~~~~~~ #############
verbose =
url =
protocol =
batch_size =
###################### ~~~~~~~ END ~~~~~~~ ######################
output_path = "../source_code/challenge_triton/triton_output_grpc"

Then, we instantiate the new Triton Client and get access to additional properties from the model metadata and configuration.

In [None]:
print("Running the inference client \n")

try:
    # Create gRPC client for communicating with the server
    triton_client = grpcclient.InferenceServerClient(
        url=url, verbose=verbose)
except Exception as e:
    print("client creation failed: " + str(e))
    sys.exit(1)

try:
    model_metadata = triton_client.get_model_metadata(
        model_name=model_name, model_version=model_version)
except InferenceServerException as e:
    print("failed to retrieve the metadata: " + str(e))
    sys.exit(1)

try:
    model_config = triton_client.get_model_config(
        model_name=model_name, model_version=model_version)
except InferenceServerException as e:
    print("failed to retrieve the config: " + str(e))
    sys.exit(1)

model_config = model_config.config

Images are already loaded, so we can go ahead and submit our inputs to the Triton Inference Server using the `triton_client.async_infer()` method, specifying once again our model name, version, inputs, and outputs. Each request also receives a callback function that places the result in a queue when it completes; once all requests have been sent, the responses are collected from that queue and stored in a list, like before.
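
The callback pattern can be tried out on its own before running it against the server. In this sketch, `Collector` is a hypothetical stand-in for the notebook's `UserData` helper (assumed here to wrap a queue), and `fake_async_infer` imitates an asynchronous call by invoking the callback from another thread. Neither is part of the Triton client libraries.

```python
import queue
import threading
from functools import partial


class Collector:
    """Hypothetical stand-in for UserData: holds a queue of finished requests."""

    def __init__(self):
        self.completed = queue.Queue()


def on_complete(collector, result, error):
    """Callback: store the outcome so the main thread can pick it up later."""
    collector.completed.put((result, error))


def fake_async_infer(request_id, callback):
    """Imitate an asynchronous call by running the callback in another thread."""
    thread = threading.Thread(target=callback, args=(f"result-{request_id}", None))
    thread.start()
    return thread


collector = Collector()
# partial() binds the collector, leaving (result, error) for the caller to supply.
threads = [fake_async_infer(i, partial(on_complete, collector)) for i in range(4)]

received = []
for _ in range(4):
    result, error = collector.completed.get(timeout=5)
    if error is not None:
        raise RuntimeError(error)
    received.append(result)

for thread in threads:
    thread.join()

print(sorted(received))
```

Results may arrive in any order, which is why each request carries a `request_id` that the postprocessor can use to match a response to its batch.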

In [None]:
from user_data import UserData
from functools import partial

def completion_callback(user_data, result, error):
    """Callback function used for async_infer(): queues each completed request."""
    user_data._completed_requests.put((result, error))

print("Sending inference request for batches of data \n")

responses = []
image_idx = 0
last_request = False
user_data = UserData()
sent_count = 0
pbar_total = len(frames)

start_time = time.time()

with tqdm(total=pbar_total) as pbar:
    while not last_request:
        batched_image_data = None

        repeated_image_data = []

        for idx in range(batch_size):
            frame = frames[image_idx]
            
            img = frame._load_img()
            repeated_image_data.append(img)
            
            image_idx = (image_idx + 1) % len(frames)
            if image_idx == 0:
                last_request = True

        if max_batch_size > 0:
            batched_image_data = np.stack(repeated_image_data, axis=0)
        else:
            batched_image_data = repeated_image_data[0]

        # Send request
        try:
            req_gen_args = [batched_image_data, triton_model.input_names,
                triton_model.output_names, triton_model.triton_dtype,
                protocol.lower()]
            req_generator = requestGenerator(*req_gen_args)
            for inputs, outputs in req_generator:
                sent_count += 1

                triton_client.async_infer(
                    model_name,
                    inputs,
                    partial(completion_callback, user_data),
                    request_id=str(sent_count),
                    model_version=model_version,
                    outputs=outputs)

        except InferenceServerException as e:
            print("inference failed: " + str(e))
            sys.exit(1)
        
        pbar.update(batch_size)
    
    processed_count = 0
    while processed_count < sent_count:
        (results, error) = user_data._completed_requests.get()
        processed_count += 1
        if error is not None:
            print("inference failed: " + str(error))
            sys.exit(1)
        responses.append(results)

end_time = time.time()

print("Average latency: ~{} seconds".format((end_time - start_time) / sent_count))
print("Average throughput: ~{} examples / second".format(batch_size * sent_count / (end_time - start_time)))

As you can see, the gain in performance is quite significant, and considering the very small changes we made to the pipeline, it was definitely worth it!

Now we pass the responses to the postprocessor, which renders images with bounding boxes, and display them to make sure nothing has changed compared to the HTTP inference.

In [None]:
############# ~~~~~~~ COMPLETE THIS SECTION ~~~~~~~ #############
print("Gathering responses from the server and post-processing the inferenced outputs \n")

args_postprocessor = [
    #
]

postprocessor =
###################### ~~~~~~~ END ~~~~~~~ ######################

processed_request = 0
with tqdm(total=len(frames)) as pbar:
    while processed_request < sent_count:
        response = responses[processed_request]

        this_id = response.get_response().id

        postprocessor.apply(
            response, this_id, render=True
        )
        processed_request += 1
        pbar.update(batch_size)

In [None]:
# Visualizing the sample images
OUTPUT_PATH = os.path.join(output_path, 'infer_images')
COLS = 3 # number of columns in the visualizer grid
IMAGES = 9 # number of images to visualize

visualize_images(OUTPUT_PATH, num_cols=COLS, num_images=IMAGES)

In this notebook, you have reviewed some concepts related to deployment using Triton Inference Server. Congratulations! With this, you have also finished the challenges we have prepared. We hope they have been helpful in establishing the main concepts.

## Other bootcamps

The contents of this bootcamp originate from the [OpenHackathons GitHub](https://github.com/openhackathons-org). You are welcome to visit the page and search for other material that may interest you.

***

## Licensing

Copyright © 2022 OpenACC-Standard.org. This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). These materials include references to hardware and software developed by other entities; all applicable licensing and copyrights apply.
