<div align="center"><a href="https://www.nvidia.com/en-us/deep-learning-ai/education/"><img src="./assets/DLI_Header.png"></a></div>

# Deploying a Model for Inference at Production Scale

## 06 - Advanced Inference
-------

**Table of Contents**

* [Introduction](#introduction)
* [Housekeeping](#housekeeping)
* [Performance Analyzer](#performance)
* [Model Analyzer](#model)
* [CPU Benchmark](#cpu)
* [Variable Batch Size](#variable)
* [Dynamic Batching](#dynamic-batching)
* [HTTP vs. gRPC](#protocol)
* [Asynchronous Inference](#async)
* [Shared Memory](#shared)
* [Conclusion](#conclusion)

<a id="introduction"></a>
### Introduction

In this notebook, we will explore how to do advanced inferencing with Triton Inference Server. We will explore tools like the Performance Analyzer, the Model Analyzer, how to access metrics, and how to optimize latency and throughput in your applications using the GPU, variable batch size, dynamic batching, different protocols like HTTP and gRPC, asynchronous inference, and shared memory.

<a id="housekeeping"></a>
### Housekeeping

Before we go any further, we'll do some housekeeping and import some of the client libraries we'll be using as well as define some variables we'll use throughout the notebook.

In [None]:
import numpy as np
import time
import tritonclient.http as tritonhttpclient
import tritonclient.grpc as tritongrpcclient
from tqdm import tqdm


http_url = 'triton:8000'
grpc_url = 'triton:8001'
verbose = False
concurrency = 32
model_version = '1'
triton_http_client = tritonhttpclient.InferenceServerClient(url=http_url, verbose=verbose, concurrency=concurrency)
triton_grpc_client = tritongrpcclient.InferenceServerClient(url=grpc_url, verbose=verbose)
input_dtype = 'FP32'

<a id="performance"></a>
### Performance Analyzer

A critical part of optimizing the inference performance of your model is being able to measure changes in performance as you experiment with different optimization strategies. The `perf_analyzer` application (previously known as `perf_client`) performs this task for the Triton Inference Server. The `perf_analyzer` is included with the client examples which are available from several sources.

The `perf_analyzer` application generates inference requests to your model and measures the throughput and latency of those requests. To get representative results, `perf_analyzer` measures the throughput and latency over a time window, and then repeats the measurements until it gets stable values. By default, `perf_analyzer` uses average latency to determine stability, but you can use the `--percentile` flag to stabilize results based on a latency percentile instead. For example, if `--percentile=95` is used, the results will be stabilized using the 95th percentile request latency.
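
To make the difference between the two stability metrics concrete, the sketch below computes the average and the 95th percentile of a handful of per-request latencies with NumPy. The `summarize_latencies` helper and the sample values are hypothetical and made up for illustration; this is not how `perf_analyzer` is implemented, only the kind of statistic it reports.

```python
import numpy as np


def summarize_latencies(latencies_s, percentile=95):
    """Return (mean, percentile) latency in seconds for per-request latencies."""
    arr = np.asarray(latencies_s, dtype=np.float64)
    return float(arr.mean()), float(np.percentile(arr, percentile))


# Hypothetical measurements: nine fast requests and one slow outlier.
sample = [0.010] * 9 + [0.100]
mean_latency, p95_latency = summarize_latencies(sample)
print('mean: {:.4f} s, p95: {:.4f} s'.format(mean_latency, p95_latency))
```

A single slow request barely moves the average but dominates the 95th percentile, which is why a percentile is often the more useful target when tail latency matters.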

For example, we can run any of the following to analyze the performance of our models:

```
perf_analyzer \
  -m simple-tensorflow-model \
  -b 1 \
  --concurrency-range 1:1 \
  --shape input_0:1,224,224,3

perf_analyzer \
  -m simple-pytorch-model \
  -b 1 \
  --concurrency-range 1:1

perf_analyzer \
  -m simple-onnx-model \
  -b 1 \
  --concurrency-range 1:1
  
perf_analyzer \
  -m simple-tensorrt-fp32-model \
  -b 1 \
  --concurrency-range 1:1
  
perf_analyzer \
  -m simple-tensorrt-fp16-model \
  -b 1 \
  --concurrency-range 1:1
```

Unfortunately, we're not able to run `perf_analyzer` while Triton Inference Server is deployed in **polling** mode. However, for more details on `perf_analyzer`, you can find the documentation here: https://github.com/triton-inference-server/server/blob/r20.12/docs/perf_analyzer.md

<a id="model"></a>
### Model Analyzer

The Triton Model Analyzer is a tool that uses Performance Analyzer to send requests to your model while measuring GPU memory and compute utilization. The Model Analyzer is specifically useful for characterizing the GPU memory requirements for your model under different batching and model instance configurations. Once you have this GPU memory usage information you can more intelligently decide on how to combine multiple models on the same GPU while remaining within the memory capacity of the GPU.

For more information see the [Model Analyzer repository](https://github.com/triton-inference-server/model_analyzer) and the detailed explanation provided in [Maximizing Deep Learning Inference Performance with NVIDIA Model Analyzer](https://developer.nvidia.com/blog/maximizing-deep-learning-inference-performance-with-nvidia-model-analyzer).

<a id="cpu"></a>
### CPU Benchmark

Before we get into some of our advanced inferencing techniques, let's first benchmark one of our models on the CPU. Triton Inference Server not only works with many deep learning frameworks, but is also flexible enough to deploy models onto the CPU. To deploy a model on the CPU, simply add:

```
instance_group [
    {
      count: 1
      kind: KIND_CPU
    }
  ]
```

to your configuration file. Below, we copy our `simple-pytorch-model` into a new model directory and modify the model configuration file so that Triton Inference Server will deploy it on the CPU.

In [None]:
!rm -rf models/simple-pytorch-model-cpu/
!cp -R models/simple-pytorch-model/ models/simple-pytorch-model-cpu/

In [None]:
configuration = """
name: "simple-pytorch-model-cpu"
platform: "pytorch_libtorch"
max_batch_size: 32
instance_group [
    {
      count: 1
      kind: KIND_CPU
    }
  ]
input [
 {
    name: "input__0"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [ 3, 224, 224 ]
  }
]
output {
    name: "output__0"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
"""

with open('models/simple-pytorch-model-cpu/config.pbtxt', 'w') as file:
    file.write(configuration)

In [None]:
!sleep 45
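
The fixed sleep above gives Triton time to pick up the new model directory. As an alternative, we could poll until the model reports ready. The sketch below shows the polling pattern only: `wait_until_ready` is a hypothetical helper, and `fake_is_ready` is a stand-in used so the example runs without a server.

```python
import time


def wait_until_ready(is_ready, timeout_s=120.0, poll_interval_s=1.0):
    """Poll is_ready() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if is_ready():
                return True
        except Exception:
            # Treat errors (e.g. connection failures) as "not ready yet" and retry.
            pass
        time.sleep(poll_interval_s)
    return False


# Stand-in readiness check for demonstration: reports ready on the third call.
calls = {'count': 0}


def fake_is_ready():
    calls['count'] += 1
    return calls['count'] >= 3


ready = wait_until_ready(fake_is_ready, timeout_s=5.0, poll_interval_s=0.01)
print('ready:', ready, 'after', calls['count'], 'checks')
```

In this notebook, the check could be something like `lambda: triton_http_client.is_model_ready('simple-pytorch-model-cpu')`, assuming the installed client version provides `is_model_ready`.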

Next, we'll go through our usual process of defining our `InferInput` and `InferRequestedOutput` objects and assign data to our inputs.

In [None]:
input_name = 'input__0'
input_shape = (1, 3, 224, 224)
output_name = 'output__0'
model_name = 'simple-pytorch-model'

input0 = tritonhttpclient.InferInput(input_name, input_shape, input_dtype)
dummy_data = np.ones(shape=input_shape, dtype=np.float32)
input0.set_data_from_numpy(dummy_data, binary_data=True)

output = tritonhttpclient.InferRequestedOutput(output_name, binary_data=True)

Let's submit 1000 requests (each request is batch size 1) to our `simple-pytorch-model`, deployed on the GPU.

In [None]:
# note: batch size 1

start_time = time.time()
requests = []
request_count = 1000
for i in tqdm(range(request_count)):
    requests.append(triton_http_client.infer(model_name, model_version=model_version, 
                                             inputs=[input0], outputs=[output]))
end_time = time.time()

In [None]:
batch_size = 1
print('Average Latency: ~{} seconds'.format((end_time - start_time) / request_count))
print('Average Throughput: ~{} examples / second'.format(batch_size * request_count / (end_time - start_time)))
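
The same timing pattern is repeated throughout this notebook, so it can help to see the arithmetic in one place. The sketch below wraps it in a hypothetical `benchmark` helper; `fake_request` is a stand-in that sleeps instead of calling Triton, so the numbers it prints are not real measurements.

```python
import time


def benchmark(send_request, request_count, batch_size):
    """Send request_count synchronous requests.

    Returns (average latency in seconds, throughput in examples per second).
    """
    start = time.perf_counter()
    for _ in range(request_count):
        send_request()
    elapsed = time.perf_counter() - start
    avg_latency = elapsed / request_count
    throughput = batch_size * request_count / elapsed
    return avg_latency, throughput


def fake_request():
    # Stand-in for a call such as triton_http_client.infer(...).
    time.sleep(0.002)


avg_latency, throughput = benchmark(fake_request, request_count=50, batch_size=4)
print('Average Latency: ~{:.4f} seconds'.format(avg_latency))
print('Average Throughput: ~{:.1f} examples / second'.format(throughput))
```

Because the requests are sent one after another, throughput here is simply the batch size divided by the average latency.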

Next, we'll submit the same 1000 requests to our `simple-pytorch-model-cpu`, deployed on the CPU. The difference is quite stark!

In [None]:
model_name = 'simple-pytorch-model-cpu'

# note: feel free to stop running this cell at any time!

start_time = time.time()
requests = []
request_count = 1000
for i in tqdm(range(request_count)):
    requests.append(triton_http_client.infer(model_name, model_version=model_version, 
                                             inputs=[input0], outputs=[output]))
end_time = time.time()

In [None]:
print('Average Latency: ~{} seconds'.format((end_time - start_time) / request_count))
print('Average Throughput: ~{} examples / second'.format(batch_size * request_count / (end_time - start_time)))

<a id="variable"></a>
### Variable Batch Size

Until now, we've worked with data inputs that have a batch size of 1. However, we might often want to send different batch sizes such as 4, 8, 32, or even higher. This involves a natural tradeoff between latency and throughput. Since our batches are larger, it might take longer to process an individual batch, increasing the latency. However, since the GPU has more data to work with and we're less constrained by networking and I/O, we might see an increase in throughput, the number of examples that can be processed per second.

Below, we'll use our `simple-tensorrt-fp16-model` and pass in 10000 requests of batch size 1. We see this process takes ~45 seconds.

In [None]:
input_name = 'actual_input_1'
input_shape = (1, 3, 224, 224)
output_name = 'output1'
model_name = 'simple-tensorrt-fp16-model'

input0 = tritonhttpclient.InferInput(input_name, input_shape, input_dtype)
dummy_data = np.ones(shape=input_shape, dtype=np.float32)
input0.set_data_from_numpy(dummy_data, binary_data=True)
output = tritonhttpclient.InferRequestedOutput(output_name, binary_data=True)

In [None]:
# note: batch size 1

start_time = time.time()
requests = []
request_count = 10000
for i in tqdm(range(request_count)):
    requests.append(triton_http_client.infer(model_name, model_version=model_version, 
                                             inputs=[input0], outputs=[output]))
end_time = time.time()

In [None]:
batch_size = 1
print('Average Latency: ~{} seconds'.format((end_time - start_time) / request_count))
print('Average Throughput: ~{} examples / second'.format(batch_size * request_count / (end_time - start_time)))

Now, we'll use a batch size of 32 and pass in 300 requests. We see how by increasing the batch size, we increase our average latency but are able to increase the total throughput.

In [None]:
input_shape = (32, 3, 224, 224)
input0 = tritonhttpclient.InferInput(input_name, input_shape, input_dtype)
dummy_data = np.ones(shape=input_shape, dtype=np.float32)
input0.set_data_from_numpy(dummy_data, binary_data=True)
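
The cell above creates the whole batch in one call because the data is synthetic. With real data, a batch is more commonly assembled from individual examples. A minimal sketch, assuming each example is already a preprocessed `(3, 224, 224)` float32 array (the `examples` list below is made-up data):

```python
import numpy as np

# 32 hypothetical preprocessed examples, each filled with its own index.
examples = [np.full((3, 224, 224), fill_value=i, dtype=np.float32) for i in range(32)]

# Stack along a new leading axis to get the (batch, channels, height, width) layout.
batch = np.stack(examples, axis=0)
print(batch.shape, batch.dtype)
```

The resulting array could be passed to `set_data_from_numpy` in place of `dummy_data`, as long as its first dimension does not exceed the model's `max_batch_size`.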

In [None]:
# note: batch size 32

start_time = time.time()
requests = []
request_count = 300
for i in tqdm(range(request_count)):
    requests.append(triton_http_client.infer(model_name, model_version=model_version, 
                                             inputs=[input0], outputs=[output]))
end_time = time.time()

In [None]:
batch_size = 32
print('Average Latency: ~{} seconds'.format((end_time - start_time) / request_count))
print('Average Throughput: ~{} examples / second'.format(batch_size * request_count / (end_time - start_time)))

<a id="dynamic-batching"></a>
### Dynamic Batching

Dynamic batching is a feature of Triton that allows inference requests to be combined by the server, so that a batch is created dynamically. Creating a batch of requests typically results in increased throughput. To enable dynamic batching, simply add:

```
dynamic_batching {
    preferred_batch_size: [ 4, 8 ]
    max_queue_delay_microseconds: 100
  }
```

to your configuration file. The `preferred_batch_size` property indicates the batch sizes that the dynamic batcher should attempt to create. For example, the above configuration enables dynamic batching with preferred batch sizes of 4 and 8.

The dynamic batcher can be configured to allow requests to be delayed for a limited time in the scheduler to allow other requests to join the dynamic batch. For example, the configuration above sets a maximum delay time of 100 microseconds for a request.

The `max_queue_delay_microseconds` property setting changes the dynamic batcher behavior when a batch of a preferred size cannot be created. When a batch of a preferred size cannot be created from the available requests, the dynamic batcher will delay sending the batch as long as no request is delayed longer than the configured `max_queue_delay_microseconds` value. If a new request arrives during this delay and allows the dynamic batcher to form a batch of a preferred batch size, then that batch is sent immediately for inferencing. If the delay expires the dynamic batcher sends the batch as is, even though it is not a preferred size.
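
The delay rule described above can be illustrated with a toy simulation. This is a deliberately simplified model, not Triton's scheduler: it ignores model execution time and instance availability, and it sends a batch as soon as any preferred size is reached. The function name `simulate_dynamic_batcher` and the arrival times are hypothetical.

```python
def simulate_dynamic_batcher(arrival_times_us, preferred_sizes, max_delay_us, max_batch_size):
    """Group request arrival times (in microseconds) into batches.

    Simplified rule: send when a preferred size (or the maximum) is reached,
    or when the oldest pending request has waited longer than max_delay_us.
    """
    batches = []
    pending = []
    for t in sorted(arrival_times_us):
        # The oldest pending request would exceed the delay: send the batch as is.
        if pending and t - pending[0] > max_delay_us:
            batches.append(pending)
            pending = []
        pending.append(t)
        # A preferred (or maximum) batch size was reached: send immediately.
        if len(pending) in preferred_sizes or len(pending) == max_batch_size:
            batches.append(pending)
            pending = []
    if pending:
        batches.append(pending)
    return batches


arrivals = [0, 10, 20, 30, 500, 520, 900]
batches = simulate_dynamic_batcher(arrivals, preferred_sizes=[4, 8],
                                   max_delay_us=100, max_batch_size=32)
sizes = [len(b) for b in batches]
print(sizes)
```

The first four requests arrive close together and form a preferred batch of 4. The next two cannot reach a preferred size within 100 microseconds and are sent as a batch of 2, and the last request is sent alone.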

Below, we copy our `simple-tensorrt-fp16-model` into a new model directory and modify the model configuration file so that Triton Inference Server will deploy it using dynamic batching.

In [None]:
!rm -rf models/dynamic-batching-model/
!cp -R models/simple-tensorrt-fp16-model/ models/dynamic-batching-model/

In [None]:
configuration = """
name: "dynamic-batching-model"
platform: "tensorrt_plan"
dynamic_batching { 
  preferred_batch_size: [ 4, 8, 16, 32 ] 
  max_queue_delay_microseconds: 100 }
max_batch_size: 32
input [
 {
    name: "actual_input_1"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [ 3, 224, 224 ]
  }
]
output {
    name: "output1"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
"""

with open('models/dynamic-batching-model/config.pbtxt', 'w') as file:
    file.write(configuration)

In [None]:
!sleep 45

Below, we'll use our `dynamic-batching-model` and pass in 10000 requests of batch size 1.

In [None]:
input_shape = (1, 3, 224, 224)
model_name = 'dynamic-batching-model'

input0 = tritonhttpclient.InferInput(input_name, input_shape, input_dtype)
dummy_data = np.ones(shape=input_shape, dtype=np.float32)
input0.set_data_from_numpy(dummy_data, binary_data=True)

In [None]:
# note: batch size 1

start_time = time.time()
requests = []
request_count = 10000
for i in tqdm(range(request_count)):
    requests.append(triton_http_client.infer(model_name, model_version=model_version, 
                                             inputs=[input0], outputs=[output]))
end_time = time.time()

In [None]:
batch_size = 1
print('Average Latency: ~{} seconds'.format((end_time - start_time) / request_count))
print('Average Throughput: ~{} examples / second'.format(batch_size * request_count / (end_time - start_time)))

<a id="protocol"></a>
### HTTP vs. gRPC

Clients can communicate with Triton using either an HTTP/REST or gRPC protocol, or through a C API. Most people are familiar with HTTP, which is the backbone of the web. gRPC is a newer, open source remote procedure call system initially developed at Google and released as open source in 2015. It uses HTTP/2 for transport and Protocol Buffers as the interface description language. It is highly efficient and easy to use.

Below, we use the `tritonclient.grpc` module to instantiate new `InferInput` and `InferRequestedOutput` objects, and our `tritonclient.grpc.InferenceServerClient` instance to send 10000 requests of batch size 1 to our `dynamic-batching-model`. We can immediately see that just using a slightly different protocol can have an enormous impact on latency and throughput!

In [None]:
input_shape = (1, 3, 224, 224)
model_name = 'dynamic-batching-model'

input0 = tritongrpcclient.InferInput(input_name, input_shape, input_dtype)
dummy_data = np.ones(shape=input_shape, dtype=np.float32)
input0.set_data_from_numpy(dummy_data)
output = tritongrpcclient.InferRequestedOutput(output_name)

In [None]:
start_time = time.time()
requests = []
request_count = 10000
for i in tqdm(range(request_count)):
    requests.append(triton_grpc_client.infer(model_name, model_version=model_version, 
                                             inputs=[input0], outputs=[output]))
end_time = time.time()

In [None]:
print('Average Latency: ~{} seconds'.format((end_time - start_time) / request_count))
print('Average Throughput: ~{} examples / second'.format(batch_size * request_count / (end_time - start_time)))

<a id="async"></a>
### Asynchronous Inference

So far, our requests have been submitted to Triton Inference Server synchronously. In other words, we submit a request to Triton, Triton then computes and returns the result, and we submit our next request. However, what if we could submit as many requests as possible, allow Triton to queue requests it hasn't yet processed, and return results as soon as they are computed? This paradigm is known as asynchronous inferencing and can result in some dramatic speedups for throughput.

Below, we create a utility `callback` function for handling asynchronous requests and submit 10000 requests of batch size 1 to our `dynamic-batching-model` using the `async_infer` method of our `tritonclient.grpc.InferenceServerClient` instance. Our improvement in throughput is incredible!

In [None]:
from functools import partial


results = []

def callback(user_data, result, error):
    if error:
        user_data.append(error)
    else:
        user_data.append(result)
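
The callback above takes three arguments, but the client only supplies the result and the error. `functools.partial` fills in `user_data` ahead of time. The sketch below shows that binding in isolation with made-up values; it assumes the client invokes the callback with `result` and `error` keyword arguments, and it does not contact a server.

```python
from functools import partial


def callback(user_data, result, error):
    # Store the error if there is one, otherwise the result.
    if error:
        user_data.append(error)
    else:
        user_data.append(result)


collected = []

# Bind the first argument; the returned callable only needs result and error.
bound_callback = partial(callback, collected)

# Simulate what a client would do when a request completes or fails.
bound_callback(result='fake-result', error=None)
bound_callback(result=None, error=RuntimeError('fake failure'))

print(len(collected), type(collected[1]).__name__)
```

Because errors and results end up in the same list, it is worth checking the type of each entry before calling methods such as `as_numpy` on it.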

In [None]:
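start_time = time.time()
async_requests = []
request_count = 10000
completed_before = len(results)
for i in tqdm(range(request_count)):
    # Asynchronous inference call.
    async_requests.append(triton_grpc_client.async_infer(model_name=model_name, inputs=[input0],
                                                         callback=partial(callback, results),
                                                         outputs=[output]))
# Wait for every callback to fire so the timing covers completed requests, not just submission.
while len(results) - completed_before < request_count:
    time.sleep(0.001)
end_time = time.time()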

In [None]:
print('Average Latency: ~{} seconds'.format((end_time - start_time) / request_count))
print('Average Throughput: ~{} examples / second'.format(batch_size * request_count / (end_time - start_time)))

In [None]:
print('Example shape of one example of our output data:', results[0].as_numpy(output_name).shape)

<a id="shared"></a>
### Shared Memory

Using system shared memory and CUDA shared memory to communicate tensors between the client library and Triton can significantly improve performance in some cases. Unfortunately, this area is beyond the scope of this lab but those curious are highly encouraged to check out the documentation and client examples found here: https://github.com/triton-inference-server/server/blob/r20.12/docs/client_examples.md#system-shared-memory

<a id="conclusion"></a>
### Conclusion

In this notebook, we explored how to do advanced inferencing with Triton Inference Server. We explored tools like the Performance Analyzer, the Model Analyzer, how to access metrics, and how to optimize latency and throughput in your applications using the GPU, variable batch size, dynamic batching, different protocols like HTTP and gRPC, asynchronous inference, and shared memory.

Please clean up by running the cell below. This will free up GPU memory for the other sections of the lab.

In [None]:
import IPython
IPython.Application.instance().kernel.do_shutdown(True)
