SPDX-FileCopyrightText: Copyright (c) 2022 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

SPDX-License-Identifier: MIT

# Demo: Triton Model Analyzer

## Assumptions
You have access to a host where you can run docker containers. The host is connected to the Internet.

## Prepare your model repository


For the next commands to work, it's important to clone this repository to a filesystem that supports symbolic links in a way that is transparent to docker. Any Linux disk partition is sufficient; NFS and NTFS are not.
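A quick way to probe the symbolic link requirement is sketched below. It is a minimal check under stated assumptions: the helper name `supports_symlinks` is hypothetical, and the probe only tells whether the host filesystem accepts symbolic links in the given directory. It does not verify that docker sees them transparently.

```python
import os
import tempfile


def supports_symlinks(directory: str) -> bool:
    """Return True if a symbolic link can be created and resolved in `directory`."""
    # The scratch directory is removed automatically when the block exits.
    with tempfile.TemporaryDirectory(dir=directory) as scratch:
        target = os.path.join(scratch, "target.txt")
        link = os.path.join(scratch, "link.txt")
        with open(target, "w") as handle:
            handle.write("probe")
        try:
            os.symlink(target, link)
        except (OSError, NotImplementedError):
            # The filesystem (or the platform) refused to create the link.
            return False
        # The link must be a real link and must resolve to the probe file.
        return os.path.islink(link) and os.path.exists(link)


if __name__ == "__main__":
    print(supports_symlinks(os.getcwd()))
```

Run it from the directory where the repository was cloned; a `False` result suggests moving the clone to another partition.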

As an example, we're using the [Hi-Fi GAN](https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_hifigan) model from [NeMo](https://github.com/NVIDIA/NeMo).
To get its ONNX export, one should run the NeMo container in the current directory.

In [None]:
docker run --rm --gpus '"device=0"' -it --ipc=host \
-v $HOME/:/ext_home \
-v ${PWD}:${PWD} \
-w ${PWD} \
--name ${USER}_nemo \
nvcr.io/nvidia/nemo:1.3.0 \
-- python get_hifigan.py



The command we run inside the container
```
python get_hifigan.py
```
is equivalent to the next cell:

In [None]:
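from nemo.collections.tts.models import HifiGanModel

# Download the pretrained vocoder and export it to ONNX
# (the export format follows the file extension).
model = HifiGanModel.from_pretrained(model_name="tts_hifigan")
model.export("./hifigan.onnx")

# Load the pretrained model again and export it to TorchScript.
model = HifiGanModel.from_pretrained(model_name="tts_hifigan")
model.export("./hifigan.pt")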

After running the cell above, two files should appear in the current directory: `hifigan.onnx` and `hifigan.pt`.

We'll need the ONNX model to experiment with Model Analyzer, the tool that helps select the optimal inference config within a specific backend. We need to copy `hifigan.onnx` to `model_repository/hifigan/1/model.onnx`:

In [None]:
mkdir -p model_repository/hifigan/1
cp hifigan.onnx model_repository/hifigan/1/model.onnx


TorchScript will be required later, for the Model Navigator experiments. Model Navigator helps select the optimal backend for a specific model.
Once these files exist, the NeMo container can be stopped. Exiting its shell is sufficient, thanks to the `--rm` flag.


## Curl and Perf Analyzer

A quick Triton test launch:


In [None]:
docker run --rm --gpus '"device=0"' -it --ipc=host \
-v $HOME/:/ext_home \
-v ${PWD}:${PWD} \
-w ${PWD} \
-p 8000:8000 \
-p 8001:8001 \
-p 8002:8002 \
--name ${USER}_triton \
nvcr.io/nvidia/tritonserver:22.12-py3 \
tritonserver --model-repository ${PWD}/model_repository --log-verbose 4



Note that this amount of logging can negatively affect performance and is recommended for debugging only.

After this, one can run the following command in another terminal to quickly check whether the server works. One should expect a long JSON document as the output of the command.


In [None]:
curl -kv -X POST 'http://127.0.0.1:8000/v2/models/hifigan/infer' \
 -H 'accept: application/json' \
 -H 'Content-Type: application/octet-stream' \
 -H 'connection: keep-alive' \
 -d @hifigan_curl_data.json


Note that it uses the HTTP protocol, which has quite a high overhead.
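For scripted checks, the same "is the server up" question can be asked from Python. The sketch below polls Triton's HTTP readiness endpoint `/v2/health/ready` on the port published above. The helper names `ready_url` and `is_server_ready` are hypothetical, and the default address assumes the server runs on the local host.

```python
import urllib.error
import urllib.request


def ready_url(base_url: str) -> str:
    """Build the readiness URL for a Triton HTTP endpoint."""
    return base_url.rstrip("/") + "/v2/health/ready"


def is_server_ready(base_url: str = "http://127.0.0.1:8000", timeout: float = 2.0) -> bool:
    """Return True if the readiness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(ready_url(base_url), timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready.
        return False


if __name__ == "__main__":
    print(is_server_ready())
```

This only confirms that the server is ready to accept requests; the `curl` call above remains the actual inference test.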



Now one can launch the Triton SDK container in yet another terminal:

In [None]:
docker run --rm --gpus '"device=0"' -it --ipc=host \
-v $HOME/:/ext_home \
-v ${PWD}:${PWD} \
-w ${PWD} \
--net=host \
--name ${USER}_triton_sdk \
nvcr.io/nvidia/tritonserver:22.12-py3-sdk \
/bin/bash

One can save the perf_analyzer help for later use:

In [None]:
perf_analyzer --help 2>&1 | tee perf_analyzer_help.txt

And then measure the model performance:

In [None]:
perf_analyzer -m hifigan --shape "spec:80,140"

The command above again uses the inefficient HTTP protocol. The optimal launch will use gRPC, shared memory, a batch size other than 1, and several streams:

In [None]:
perf_analyzer -m hifigan --shape "spec:80,140" \
-b 4 \
-i gRPC \
--concurrency-range 1:3 \
--shared-memory "cuda" \
--output-shared-memory-size 60000000

On a V100 the performance gain is 30%. But which set of hyperparameters is optimal? Model Analyzer to the rescue!
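Before handing the search over to Model Analyzer, the manual comparison of the two launches above can be scripted. The sketch below only assembles the `perf_analyzer` command lines from the flags already used in this section; the function name `perf_analyzer_command` and its parameter names are hypothetical, and nothing is executed.

```python
import shlex
from typing import List, Optional


def perf_analyzer_command(
    model: str,
    shape: str,
    batch_size: Optional[int] = None,
    protocol: Optional[str] = None,
    concurrency_range: Optional[str] = None,
    shared_memory: Optional[str] = None,
    output_shared_memory_size: Optional[int] = None,
) -> List[str]:
    """Assemble a perf_analyzer argument list; unset options are left out."""
    command = ["perf_analyzer", "-m", model, "--shape", shape]
    if batch_size is not None:
        command += ["-b", str(batch_size)]
    if protocol is not None:
        command += ["-i", protocol]
    if concurrency_range is not None:
        command += ["--concurrency-range", concurrency_range]
    if shared_memory is not None:
        command += ["--shared-memory", shared_memory]
    if output_shared_memory_size is not None:
        command += ["--output-shared-memory-size", str(output_shared_memory_size)]
    return command


# The two launches from this section: the default one and the tuned one.
baseline = perf_analyzer_command("hifigan", "spec:80,140")
tuned = perf_analyzer_command(
    "hifigan",
    "spec:80,140",
    batch_size=4,
    protocol="gRPC",
    concurrency_range="1:3",
    shared_memory="cuda",
    output_shared_memory_size=60000000,
)

if __name__ == "__main__":
    print(shlex.join(baseline))
    print(shlex.join(tuned))
```

Inside the SDK container, the resulting lists could be passed to `subprocess.run`. Trying more than a handful of combinations this way quickly becomes tedious, which is the gap Model Analyzer fills.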

## Model Analyzer Launch

Model Analyzer is used to select the optimal model config for both offline and online modes. To do so, it creates models with various configurations, launches the Triton container, and uses Perf Analyzer to measure the performance. Model Analyzer is [open source](https://github.com/triton-inference-server/model_analyzer) and written in Python.

It is advised to go through this Notebook in the latest Triton SDK container. One should mount the full path to the Notebook at the same path inside the container, so that Model Analyzer can mount it again into the Triton container. If one launches the container from the directory where this Notebook is, it is recommended to launch it like this:

In [None]:
docker run --rm --gpus '"device=0"' -it --ipc=host \
-v $HOME/:/ext_home \
-v /var/run/docker.sock:/var/run/docker.sock \
-v ${PWD}:${PWD} \
-w ${PWD} \
--net=host \
--name ${USER}_triton_sdk \
nvcr.io/nvidia/tritonserver:22.12-py3-sdk \
/bin/bash

To launch the Notebook from the container, one should additionally install `ipykernel`, but it may turn out to be simpler just to copy all the following commands to the terminal.

In [None]:
pip install ipykernel

Note the docker.sock mount, which enables model_analyzer to launch containers from within other containers.

If the inference machine is accessible only through Kubernetes, that is also [supported](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/kubernetes_deploy.md), but it is out of scope for this demo.

Before launching Model Analyzer, one must make sure the server has no other GPU containers running. Otherwise, the results would be skewed. To achieve this, we should kill our previous Triton container, if it is still running:

In [None]:
docker rm -f ${USER}_triton

## Model Analyzer Config


The models to be measured should be placed in a standard Triton model repository. Our `model_repository` directory contains one hifigan model in a single version: 1.
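To confirm the layout before profiling, the repository can be listed with a few lines of Python. This is a minimal sketch: the helper name `list_model_versions` is hypothetical, and it only inspects directory names. It does not validate model files or `config.pbtxt`.

```python
from pathlib import Path
from typing import Dict, List


def list_model_versions(repository: str) -> Dict[str, List[int]]:
    """Map each model directory to its sorted numeric version subdirectories."""
    result: Dict[str, List[int]] = {}
    for model_dir in sorted(Path(repository).iterdir()):
        if not model_dir.is_dir():
            continue
        # Version directories are named with plain integers, e.g. "1".
        versions = sorted(
            int(child.name)
            for child in model_dir.iterdir()
            if child.is_dir() and child.name.isdecimal()
        )
        result[model_dir.name] = versions
    return result


if __name__ == "__main__":
    print(list_model_versions("model_repository"))
```

For the repository prepared earlier, the expected mapping has a single `hifigan` entry with version 1.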


Model Analyzer has several modes of hyperparameter search: 

* [Automatic Brute Search](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/config_search.md#automatic-brute-search)
* [Manual Brute Search](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/config_search.md#manual-brute-search)
* and [Quick Search Mode](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/config_search.md#quick-search-mode)

Model Analyzer's **brute search mode** will do a brute-force sweep of the cross product of all possible configurations. 

**Automatic brute** configuration search is the default behavior when running Model Analyzer without manually specifying what values to search. The parameters that are automatically searched are max_batch_size and instance_group. Additionally, dynamic_batching will be enabled if it is legal to do so.

Using **manual config search**, you can create custom sweeps for every parameter that can be specified in the model configuration. Model Analyzer only checks the syntax of the model_config_parameters that is specified and cannot guarantee that the configuration that is generated is loadable by Triton.
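The "cross product" in the brute search description can be illustrated in a few lines. The sketch below is not Model Analyzer code: the sweep values and the dictionary keys are hypothetical, chosen only to show how the number of configurations grows with each swept parameter.

```python
from itertools import product
from typing import Dict, Iterable, List


def brute_force_configs(
    max_batch_sizes: Iterable[int], instance_counts: Iterable[int]
) -> List[Dict[str, int]]:
    """Enumerate every combination of the two swept parameters."""
    return [
        {"max_batch_size": batch, "instance_count": count}
        for batch, count in product(max_batch_sizes, instance_counts)
    ]


# Hypothetical sweep values, for illustration only.
configs = brute_force_configs(max_batch_sizes=[1, 2, 4], instance_counts=[1, 2])

if __name__ == "__main__":
    for config in configs:
        print(config)
    print(len(configs))  # 3 batch sizes x 2 instance counts = 6 configurations
```

Each configuration is then measured at several concurrency levels, so narrowing the value lists by hand shortens the run considerably.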

[See the CLI docs for reference](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/cli.md#subcommand-profile)

We will use the manual config search to speed up the demo. Please examine the config file [profile_config_manual.yaml](profile_config_manual.yaml).

Note the shapes specified in the config.

The next command creates an `export_path` and launches Model Analyzer with the prepared manual config.

In [None]:
mkdir -p analyzer_export && model-analyzer profile -f profile_config_manual.yaml && echo -e "\07"

If you would like to pause the command execution, you can send it a `SIGINT` signal. One can do this via `Ctrl+C`, by pressing the stop button next to the cell, or by running the following from another terminal within the same container:

kill -INT $(ps aux | grep model-ana | grep python | sed "s/^[[:alnum:]]*[[:space:]]*\([[:digit:]]*\).*/\1/")

Then the log should read like
```
INFO[analyzer_state_manager.py:174] Received SIGINT 1/3. Will attempt to exit after current measurement.
```
This means model-analyzer is waiting for the current measurement to complete before checkpointing all the results. This checkpoint already allows for preliminary analysis (see below).

After the profiling is completed, one should build the reports. This is very fast and doesn't induce any GPU load.

In [None]:
model-analyzer report -f profile_config_manual.yaml

The most interesting results should be in `analyzer_export/reports/summaries/hifigan/results_summary.pdf`.\
If you're just browsing, see the example reports available for another model in [Model Analyzer/examples](https://github.com/triton-inference-server/model_analyzer/tree/main/examples).

If you use Jupyter, the reports are available to view from the browser:

[analyzer_export/reports/summaries/hifigan/results_summary.pdf](analyzer_export/reports/summaries/hifigan/results_summary.pdf)

The details are in the CSV files:
[analyzer_export/results/metrics-model-gpu.csv](analyzer_export/results/metrics-model-gpu.csv)

[analyzer_export/results/metrics-model-inference.csv](analyzer_export/results/metrics-model-inference.csv)

[analyzer_export/results/metrics-server-only.csv](analyzer_export/results/metrics-server-only.csv)
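The CSV files can also be inspected programmatically. The sketch below ranks rows of a results file by one numeric column using only the standard library. The helper name `top_rows` is hypothetical, and the column name `Throughput (infer/sec)` is an assumption about the CSV header; check the first line of your file and adjust it if needed.

```python
import csv
from typing import Dict, List


def top_rows(csv_path: str, column: str, limit: int = 5) -> List[Dict[str, str]]:
    """Return the `limit` rows with the largest numeric value in `column`."""
    with open(csv_path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    # Values are read as strings, so convert them to float for sorting.
    return sorted(rows, key=lambda row: float(row[column]), reverse=True)[:limit]


if __name__ == "__main__":
    # Assumed column name; verify it against the header of the CSV file.
    best = top_rows(
        "analyzer_export/results/metrics-model-inference.csv",
        "Throughput (infer/sec)",
    )
    for row in best:
        print(row)
```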