<a href="https://www.nvidia.com/dli"> <img src="images/DLI_Header.png" alt="Header" style="width: 400px;"/> </a>

# 1.0 Exporting the Model
In this notebook, you'll explore options for exporting a BERT checkpoint trained using PyTorch, to NVIDIA Triton Inference Server.

**[1.1 Overview: Optimization and Performance](#1.1-Overview:-Optimization-and-Performance)<br>**
**[1.2 Export a BERT Checkpoint](#1.2-Export-a-BERT-Checkpoint)<br>**
&nbsp; &nbsp; &nbsp; &nbsp; [1.2.1 Triton Model Repository](#1.2.1-Triton-Model-Repository)<br>
&nbsp; &nbsp; &nbsp; &nbsp; [1.2.2 TorchScript Export](#1.2.2-TorchScript-Export)<br>
**[1.3 Test Our Export](#1.3-Test-Our-Export)<br>**
**[1.4 Beyond TorchScript](#1.4-Beyond-TorchScript)<br>**
&nbsp; &nbsp; &nbsp; &nbsp; [1.4.1 Exercise: Enable TensorRT Optimization](#1.4.1-Exercise:-Enable-TensorRT-Optimization)<br>
**[1.5 Performance Comparison](#1.5-Performance-Comparison)<br>**

# 1.1 Overview: Optimization and Performance
Optimizing a trained model can have a dramatic impact on inference performance, measured as throughput (referred to as bandwidth in this notebook) and latency. Even if the project requirements do not justify investing engineering effort into advanced techniques, such as knowledge distillation or pruning, a substantial performance improvement can be achieved with model optimization tools alone. The diagram below illustrates the difference in inference performance between a model deployed using non-optimized TensorFlow, the same model post-processed with TensorRT, and a model fully optimized with TensorRT.

<img src="images/TFvTRT.jpg" alt="Inference performance of TensorFlow versus TensorRT" style="width: 600px;"/>

Modern inference servers typically support several model formats to cater to a wide range of projects, tools, and preferences. Since in this class we are working with a BERT checkpoint trained using PyTorch, and we are deploying it with Triton Inference Server, we will focus on options for deploying PyTorch-based models. These include:
   - PyTorch JIT / TorchScript
   - ONNX Runtime
   - ONNX-TensorRT
   - TensorRT
    
It's important to point out that Triton Server supports a much broader set of deployment mechanisms including:
   - TensorFlow GraphDef
   - TensorFlow saved model
   - Caffe2 exports
   - Custom models (which can be any custom executable)

In this section we will look at how to deploy a model using some of the deployment engines listed above and the impact each has on performance. We will also experiment with some of the key settings, namely the batch size and numerical precision (FP32 and FP16).

# 1.2 Export a BERT Checkpoint

The BERT model checkpoint we want to deploy, <code>bert_qa.pt</code>, should be located in your `data` directory. 

In [1]:
!ls data/*.pt

data/bert_qa.pt


This file is a standard checkpoint of a BERT-Large network, fine-tuned on the [Stanford Question Answering Dataset (SQuAD)](https://arxiv.org/abs/1606.05250). 

#### Helper Scripts
As we explore various deployment configurations, we'll repeat some steps over and over.  Therefore, we'll use some helper scripts to partially automate the process so that we can focus our attention on the configuration settings and results.  You can explore the code details yourself if you are curious:

- [utilities/wait_for_triton_server.sh](utilities/wait_for_triton_server.sh): Check the "live" and "ready" status of the Triton server via the API
- [deployer/deployer.py](deployer/deployer.py): Convert a checkpoint to a deployable model and export it
- [utilities/run_perf_analyzer_local.sh](utilities/run_perf_analyzer_local.sh): Measure performance with the [perf_analyzer](https://github.com/triton-inference-server/client/blob/main/src/c++/perf_analyzer/README.md) application
- [utilities/run_warmup.sh](utilities/run_warmup.sh): Run some inferences using `perf_analyzer` to warm up the model.  Prewarming the model results in more stable measurements.

The Triton server has been deployed, either in a container or locally, and is available to us on port `8000` at the host name set in the next cell (`127.0.0.1` here). Run the next cell to check for a "200 OK" HTTP response from the API.

In [4]:
# Set the server hostname and check it - you should get a message that "Triton Server is ready!"
tritonServerHostName = "127.0.0.1"
!./utilities/wait_for_triton_server.sh {tritonServerHostName}

Waiting for Triton Server to be ready at 127.0.0.1:8000...
200
Triton Server is ready!
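
The same check can be made directly from Python. The sketch below is not the contents of `wait_for_triton_server.sh`; it assumes the server exposes the standard `v2/health/live` and `v2/health/ready` HTTP endpoints on port 8000, and the function names are hypothetical.

```python
import urllib.error
import urllib.request


def health_url(host: str, port: int = 8000, kind: str = "ready") -> str:
    """Build the health endpoint URL; kind is "live" or "ready"."""
    if kind not in ("live", "ready"):
        raise ValueError("kind must be 'live' or 'ready'")
    return f"http://{host}:{port}/v2/health/{kind}"


def is_healthy(host: str, port: int = 8000, kind: str = "ready",
               timeout: float = 2.0) -> bool:
    """Return True only if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(health_url(host, port, kind),
                                    timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status
        # are all treated as "not healthy".
        return False


if __name__ == "__main__":
    # Host name taken from the cell above; adjust if your server runs elsewhere.
    print("ready:", is_healthy("127.0.0.1"))
```

Polling `is_healthy` in a loop with a short sleep gives roughly the behaviour of the helper script.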


## 1.2.1 Triton Model Repository
When Triton Server is started, it is typically configured to observe a local or remote file system where models are hosted. The directory which is being observed is called a *model repository*. A typical command to start the Triton Server identifies the location of the model repository with an option:<br>
```bash
tritonserver --model-repository="/path/to/model/repository"
```

The model repository needs to have the following layout:

```text
<model-repository-path>/
  <model-name>/
    [config.pbtxt]
    [<output-labels-file> ...]
    <version>/
      <model-definition-file>
    <version>/
      <model-definition-file>
    ...
  <model-name>/
    [config.pbtxt]
    [<output-labels-file> ...]
    <version>/
      <model-definition-file>
    <version>/
      <model-definition-file>
    ...
  ...
```

This lab container is configured to use the <code>./model_repository</code> folder as the model repository, so any change within this folder will affect the behavior of Triton Server.<br/>

In order to expose a new model to Triton you need to: <br/>
   1. Create a new model folder in the model repository. The name of the folder needs to reflect the name of the service you will be exposing to your users/applications.<br/>
   2. Within the model folder, create a <code>config.pbtxt</code> file that contains the basic serving configuration for the model<br/>
   3. Also within the model folder, create at least one folder containing a copy of the model. The name of the folder reflects the version name of the model. You can create and host multiple versions of the same model.<br/>
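
The three steps above can be sketched in a few lines of Python. This is only an illustration of the directory layout, not what `deployer.py` does internally; `stage_model` and its arguments are hypothetical names, and the demo writes to a temporary directory rather than the real `./model_repository`.

```python
import shutil
import tempfile
from pathlib import Path


def stage_model(repo: Path, model_name: str, version: str,
                config_text: str, model_file: Path) -> Path:
    """Create <repo>/<model_name>/{config.pbtxt, <version>/<model file>}."""
    model_dir = repo / model_name            # step 1: model folder
    version_dir = model_dir / version        # step 3: version folder
    version_dir.mkdir(parents=True, exist_ok=True)
    (model_dir / "config.pbtxt").write_text(config_text)   # step 2: config
    shutil.copy2(model_file, version_dir / model_file.name)
    return model_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = Path(tmp)
        dummy_model = tmp_path / "model.pt"
        dummy_model.write_bytes(b"placeholder")   # stands in for a real export
        staged = stage_model(tmp_path / "repo", "my-model", "1",
                             'name: "my-model"\n', dummy_model)
        for path in sorted(staged.rglob("*")):
            print(path.relative_to(tmp_path))
```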
    
Next, we'll walk through the process of exporting the model to Triton.

## 1.2.2 TorchScript Export

In this part of the lab we will:
   - Convert the PyTorch checkpoint into [TorchScript](https://pytorch.org/docs/stable/jit.html#torchscript)
   - Generate the Triton configuration file
   - Deploy the created assets to our model repository

Please execute the cells below. Since we are loading a PyTorch checkpoint and converting it into TorchScript, it might take a minute or two to complete.

In [5]:
modelName = "bertQA-torchscript"

In [6]:
!python ./deployer/deployer.py \
    --ts-script \
    --save-dir ./candidatemodels \
    --triton-model-name {modelName} \
    --triton-model-version 1 \
    --triton-max-batch-size 8 \
    --triton-dyn-batching-delay 0 \
    --triton-engine-count 1 \
    -- --checkpoint "data/bert_qa.pt" \
    --config_file ./bert_config.json \
    --vocab_file ./vocab \
    --predict_file ./squad/v1.1/dev-v1.1.json \
    --do_lower_case \
    --batch_size=8 

deploying model bertQA-torchscript in format pytorch_libtorch

conversion correctness test results
-----------------------------------
maximal absolute error over dataset (L_inf):  0.012563228607177734

average L_inf error over output tensors:  0.008614718914031982
variance of L_inf error over output tensors:  1.08197624560565e-05
stddev of L_inf error over output tensors:  0.003289340732739085

time of error check of native model:  1.416276454925537 seconds
time of error check of ts model:  1.355846881866455 seconds

done
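
The correctness test compares the outputs of the native model with those of the exported model using the L_inf norm, that is, the largest absolute element-wise difference. The sketch below shows how such statistics could be computed; it is not the deployer's code, the function names are hypothetical, and the use of population (rather than sample) variance is an assumption.

```python
import statistics
from typing import Dict, Iterable, Sequence, Tuple


def l_inf(reference: Sequence[float], candidate: Sequence[float]) -> float:
    """Largest absolute element-wise difference between two flat tensors."""
    if len(reference) != len(candidate):
        raise ValueError("tensors must have the same number of elements")
    return max(abs(r - c) for r, c in zip(reference, candidate))


def error_report(
    pairs: Iterable[Tuple[Sequence[float], Sequence[float]]]
) -> Dict[str, float]:
    """Summarize the L_inf error over (native, exported) output pairs."""
    errors = [l_inf(ref, cand) for ref, cand in pairs]
    return {
        "max": max(errors),                         # worst case over the dataset
        "mean": statistics.fmean(errors),           # average per output tensor
        "variance": statistics.pvariance(errors),   # population variance (assumed)
        "stddev": statistics.pstdev(errors),
    }
```

A small maximal error, as in the output above, suggests that the exported model reproduces the native model's outputs closely.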


The `deployer.py` script loads the `bert_qa.pt` checkpoint, deploys it in `ts-script` format into a folder called `bertQA-torchscript`, and marks it as version `1`. We will discuss some of the more advanced settings later. For now, let's inspect the files generated by the script:

In [None]:
!ls -al ./candidatemodels/bertQA-torchscript/
!ls -al ./candidatemodels/bertQA-torchscript/1

As expected, the script exported the model into the TorchScript format and saved it as `model.pt`. It also generated the `config.pbtxt` file. <br> 
Let's take a look:

In [None]:
!cat ./candidatemodels/bertQA-torchscript/config.pbtxt

The configuration file is fairly simple and defines:
   - Name of the model
   - Type of platform to be used for inference; in this case `pytorch_libtorch`
   - Input and output dimensions used by the network
   - Optimizations used; in this case GPU and the default TorchScript optimization 
   - Instance group configuration; in this case instance group count is set to one, meaning that only one copy of the model will be held in GPU memory (GPU 0 is being used).
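
To make the list above concrete, the sketch below pulls the key fields out of a configuration with regular expressions. `SAMPLE_CONFIG` is a hand-written illustration based on the settings described in this notebook, not the file generated by `deployer.py` (the input and output sections are omitted), and `summarize_config` is a hypothetical helper.

```python
import re
from typing import Callable, Dict, Optional

# Illustrative only: a trimmed-down config in Triton's text format.
SAMPLE_CONFIG = """\
name: "bertQA-torchscript"
platform: "pytorch_libtorch"
max_batch_size: 8
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
"""


def summarize_config(text: str) -> Dict[str, Optional[object]]:
    """Extract a few top-level settings; missing fields come back as None."""
    def first(pattern: str, cast: Callable = str):
        match = re.search(pattern, text, re.MULTILINE)
        return cast(match.group(1)) if match else None

    return {
        "name": first(r'^\s*name:\s*"([^"]+)"'),
        "platform": first(r'^\s*platform:\s*"([^"]+)"'),
        "max_batch_size": first(r"^\s*max_batch_size:\s*(\d+)", int),
        "instance_count": first(r"^\s*count:\s*(\d+)", int),
    }


if __name__ == "__main__":
    for key, value in summarize_config(SAMPLE_CONFIG).items():
        print(f"{key}: {value}")
```

A regular-expression scan is enough for a quick look; it is not a full parser for the configuration format.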
    
To deploy the model, move the folder to the Triton model repository:

In [None]:
!mv ./candidatemodels/bertQA-torchscript model_repository/

Congratulations!  You have successfully deployed your first model to Triton Inference Server!

We'll come back to discuss the detailed configuration later, but for now let's see how our model is performing.

# 1.3 Test Our Export
Execute the cells below to start an inference process and make a simple measurement of inference performance. First, we'll set up some configuration. `maxConcurrency` is set to two, meaning that the stress test will be executed twice. The first run will use just a single thread and the second one will use two threads to query the server. Without turning on the concurrent model execution or dynamic batching features, what do you think will be the impact on performance of having two threads query the server at the same time? Do you think:<br/>
- Bandwidth will increase or decrease?<br/>
- Latency will increase or decrease?<br/>

In [None]:
modelVersion = "1"
precision = "fp32"
batchSize = "8"
maxLatency = "500"
maxClientThreads = "10"
maxConcurrency = "2"
dockerBridge = "host"
resultsFolderName = "1"
profilingData = "utilities/profiling_data_int64"
measurement_request_count = 50
percentile_stability = 85
stability_percentage = 50
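
For orientation, the sketch below shows how variables like these could map onto `perf_analyzer` command-line options. The mapping is an assumption made for illustration; the actual command lives in `utilities/run_perf_analyzer_local.sh` and may differ. `build_perf_analyzer_cmd` is a hypothetical helper that only builds the argument list and does not run anything.

```python
import shlex
from typing import List


def build_perf_analyzer_cmd(model_name: str, model_version: str, batch_size: int,
                            max_latency_ms: int, max_client_threads: int,
                            max_concurrency: int, host: str, input_data: str,
                            request_count: int, percentile: int,
                            stability_pct: int, port: int = 8000) -> List[str]:
    """Assemble a perf_analyzer invocation as an argument list."""
    return [
        "perf_analyzer",
        "-m", model_name,                    # model to query
        "-x", str(model_version),            # model version
        "-b", str(batch_size),               # batch size per request
        "-u", f"{host}:{port}",              # server address
        "-i", "HTTP",                        # protocol
        "--concurrency-range", f"1:{max_concurrency}:1",   # sweep 1..max
        "-l", str(max_latency_ms),           # stop once latency exceeds this (ms)
        "--max-threads", str(max_client_threads),
        "--input-data", input_data,
        "--measurement-mode", "count_windows",
        "--measurement-request-count", str(request_count),
        "--percentile", str(percentile),     # latency percentile used for stability
        "-s", str(stability_pct),            # allowed deviation between windows (%)
    ]


if __name__ == "__main__":
    cmd = build_perf_analyzer_cmd("bertQA-torchscript", "1", 8, 500, 10, 2,
                                  "127.0.0.1", "utilities/profiling_data_int64",
                                  50, 85, 50)
    print(shlex.join(cmd))
```

With a concurrency range of `1:2:1`, the tool measures once with one outstanding request and once with two, which matches the two runs described above.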

In [None]:
%%time
modelName = "bertQA-torchscript"
maxConcurrency = "2"
batchSize = "8"
print("Running: "+modelName)
!./utilities/run_perf_analyzer_local.sh \
                    {modelName} \
                    {modelVersion} \
                    {precision} \
                    {batchSize} \
                    {maxLatency} \
                    {maxClientThreads} \
                    {maxConcurrency} \
                    {tritonServerHostName} \
                    {dockerBridge} \
                    {resultsFolderName} \
                    {profilingData} \
                    {measurement_request_count} \
                    {percentile_stability} \
                    {stability_percentage}

If everything went okay you should have been presented with output similar to the following example result, showing the inference performance across two different configurations.<br/>
<img src="images/InferenceJob1.png" alt="Example output of inference job 1" style="width: 1200px;"/>

If you get "error: failed to get model metadata", try running the cell again.


# 1.4 Beyond TorchScript

Let's investigate a different route for model deployment onto Triton, namely <a href="https://onnx.ai">Open Neural Network Exchange (ONNX)</a>. ONNX is an open format for representation and exchange of neural network models. It defines a common set of operators that are used to build common models, as well as a file format for exchanging them. The advantage of ONNX is that it is relatively widely adopted and can be used to exchange models between <a href="https://onnx.ai/supported-tools.html">a wide range of deep learning tools</a>, such as deep learning frameworks or deployment tools. This also includes TensorRT, which can consume ONNX models.<br/>

As before, start by exporting the model, but this time using the ONNX format. We will take advantage of the export tool that we used earlier, but change the export format from <code>ts-script</code> to <code>onnx</code>:

In [None]:
modelName = "bertQA-onnx"
exportFormat = "onnx"

In [None]:
!python ./deployer/deployer.py \
    --{exportFormat} \
    --save-dir ./candidatemodels \
    --triton-model-name {modelName} \
    --triton-model-version 1 \
    --triton-max-batch-size 8 \
    --triton-dyn-batching-delay 0 \
    --triton-engine-count 1 \
    -- --checkpoint ./data/bert_qa.pt \
    --config_file ./bert_config.json \
    --vocab_file ./vocab \
    --predict_file ./squad/v1.1/dev-v1.1.json \
    --do_lower_case \
    --batch_size=8

Similar to the <a href="https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/docs/serialization.md">TorchScript serialization format</a>, the <a href="https://onnx.ai/get-started.html">ONNX format</a> can be inspected quite easily (and parts are human readable). Let's have a look at the assets our export has generated:

In [None]:
!ls -al ./candidatemodels/bertQA-onnx/
!ls -al ./candidatemodels/bertQA-onnx/1

Once again, we have a configuration file as well as a model, this time stored in ONNX format. 

We have a couple of options for executing the ONNX-based export in Triton:
- We can take advantage of ONNX Runtime<br/>
- We can ask TensorRT to parse the ONNX assets in order to generate a TensorRT engine to use instead<br/>

We'll try both approaches and look at the impact this has on inference performance. In order to deploy the current ONNX model, move it to the model repository...

In [None]:
!mv ./candidatemodels/bertQA-onnx model_repository/

...and run our stress testing code across 10 different levels of concurrency:

In [None]:
%%time
modelName = "bertQA-onnx"
maxConcurrency = "10"
batchSize = "8"
print("Running: "+modelName)

!./utilities/run_perf_analyzer_local.sh \
                    {modelName} \
                    {modelVersion} \
                    {precision} \
                    {batchSize} \
                    {maxLatency} \
                    {maxClientThreads} \
                    {maxConcurrency} \
                    {tritonServerHostName} \
                    {dockerBridge} \
                    {resultsFolderName} \
                    {profilingData} \
                    {measurement_request_count} \
                    {percentile_stability} \
                    {stability_percentage}

Have a look at the results. Did we manage to run our benchmark at all 10 concurrency levels (or did the benchmark time out earlier)? What happened to the request latency in relation to the 500 ms time limit we configured?<br/>

Now let's export the ONNX model again, so that we can configure it for TensorRT execution.<br/>

In [None]:
modelName = "bertQA-onnx-trt-fp16"
exportFormat = "onnx"

In [None]:
!python ./deployer/deployer.py \
    --{exportFormat} \
    --save-dir ./candidatemodels \
    --triton-model-name {modelName} \
    --triton-model-version 1 \
    --triton-max-batch-size 8 \
    --triton-dyn-batching-delay 0 \
    --triton-engine-count 1 \
    -- --checkpoint ./data/bert_qa.pt \
    --config_file ./bert_config.json \
    --vocab_file ./vocab \
    --predict_file ./squad/v1.1/dev-v1.1.json \
    --do_lower_case \
    --batch_size=8

Once again the above command should have generated the ONNX export as well as a configuration file: 

In [None]:
!ls -al ./candidatemodels/bertQA-onnx-trt-fp16/

## 1.4.1 Exercise: Enable TensorRT Optimization

To enable TensorRT, we need to extend the <code>optimization</code> section of the "config.pbtxt" configuration file with an <code>execution_accelerators</code> segment:

```text
optimization {
   execution_accelerators {
      gpu_execution_accelerator : [ {
         name : "tensorrt"
         parameters { key: "precision_mode" value: "FP16" }
      }]
   }
   cuda { graphs: 0 }
}
```
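
If you prefer to script the change rather than edit the file by hand, the sketch below shows one possible approach. It assumes that replacing the whole `optimization` section with the block above is acceptable, and appends the block if no such section exists. `replace_optimization` is a hypothetical helper that works on the configuration text only and does not validate the result.

```python
import re

TENSORRT_BLOCK = """\
optimization {
   execution_accelerators {
      gpu_execution_accelerator : [ {
         name : "tensorrt"
         parameters { key: "precision_mode" value: "FP16" }
      }]
   }
   cuda { graphs: 0 }
}
"""


def replace_optimization(config_text: str, new_block: str = TENSORRT_BLOCK) -> str:
    """Swap the existing optimization section for new_block, or append it."""
    match = re.search(r"^[ \t]*optimization\s*\{", config_text, re.MULTILINE)
    if match is None:
        return config_text.rstrip("\n") + "\n" + new_block

    # Walk forward from the opening brace to find its matching close.
    depth = 0
    for index in range(match.end() - 1, len(config_text)):
        char = config_text[index]
        if char == "{":
            depth += 1
        elif char == "}":
            depth -= 1
            if depth == 0:
                end = index + 1
                break
    else:
        raise ValueError("unbalanced braces in optimization section")

    return config_text[:match.start()] + new_block.rstrip("\n") + config_text[end:]
```

To apply it, read `config.pbtxt` as text, pass it through `replace_optimization`, and write the result back before moving the folder to the model repository.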

#### Exercise Steps:
1. Modify [config.pbtxt](candidatemodels/bertQA-onnx-trt-fp16/config.pbtxt) to enable TensorRT. Feel free to look at the [solution](solutions/ex-1-4-1_config.pbtxt) as needed.
2. Once you have saved your changes (Main menu: File -> Save File), move the folder to the model repository using the cell below. 

In [None]:
# Shortcut: overwrite the generated config with the provided solution
!cp solutions/ex-1-4-1_config.pbtxt candidatemodels/bertQA-onnx-trt-fp16/config.pbtxt

In [None]:
!mv ./candidatemodels/bertQA-onnx-trt-fp16 model_repository/

3. Execute our profiling tool in the next cell and investigate the impact on performance. This could take a while to start, as we are waiting for the server to convert the ONNX model into a TensorRT engine.

In [None]:
%%time
modelName = "bertQA-onnx-trt-fp16"
maxConcurrency = "10"
batchSize = "8"
print("Running: " + modelName)

!./utilities/run_perf_analyzer_local.sh \
                    {modelName} \
                    {modelVersion} \
                    {precision} \
                    {batchSize} \
                    {maxLatency} \
                    {maxClientThreads} \
                    {maxConcurrency} \
                    {tritonServerHostName} \
                    {dockerBridge} \
                    {resultsFolderName} \
                    {profilingData} \
                    {measurement_request_count} \
                    {percentile_stability} \
                    {stability_percentage}

# 1.5 Performance Comparison

Finally, let's compare the performance of the TensorRT-accelerated model against the plain ONNX Runtime deployment.
* How did the latency change, especially across larger concurrency runs? 
* How did the bandwidth change? Can you explain the level of bandwidth change observed? 
* Why did the ONNX model time out at a concurrency of less than 10? How does the TensorRT latency at concurrency 10 compare to the latency of pure ONNX Runtime at an earlier concurrency?
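
If you have the results of both runs as CSV text, a short script can line them up per concurrency level. The sketch below assumes the column names `Concurrency`, `Inferences/Second`, and `p95 latency`, as written by `perf_analyzer`'s CSV export; where the helper script stores its results is not covered here, and the function names are hypothetical.

```python
import csv
import io
from typing import Dict, Tuple


def load_results(csv_text: str,
                 latency_column: str = "p95 latency") -> Dict[int, Tuple[float, float]]:
    """Map concurrency level -> (inferences per second, latency)."""
    results = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        results[int(row["Concurrency"])] = (
            float(row["Inferences/Second"]),
            float(row[latency_column]),
        )
    return results


def compare(baseline: Dict[int, Tuple[float, float]],
            candidate: Dict[int, Tuple[float, float]]) -> Dict[int, Dict[str, float]]:
    """Ratios of candidate to baseline for levels present in both runs."""
    comparison = {}
    # A run that stopped early at the latency limit has fewer levels,
    # so only the shared concurrency levels are compared.
    for level in sorted(set(baseline) & set(candidate)):
        base_tp, base_lat = baseline[level]
        cand_tp, cand_lat = candidate[level]
        comparison[level] = {
            "throughput_ratio": cand_tp / base_tp,
            "latency_ratio": cand_lat / base_lat,
        }
    return comparison
```

A throughput ratio above 1 and a latency ratio below 1 at the same concurrency level indicate that the candidate deployment is faster.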

Discuss with the instructor.

<h3 style="color:green;">Congratulations!</h3><br>
You've successfully deployed an NLP model to Triton Server with TorchScript and applied both reduced precision and TensorRT optimizations.
In the next notebook you'll learn how to optimize the model itself and to deploy it in an efficient way. 

Please proceed to the next notebook:<br>
[2.0 Hosting the model](020_HostingTheModel.ipynb)
