## TRT Conversion

**References**
 
https://github.com/deepinsight/insightface

https://github.com/SthPhoenix/InsightFace-REST/tree/master/src/converters


**Pre-requisites**

`$ docker pull nvcr.io/nvidia/tensorrt:20.12-py3`

`$ docker run --gpus all -it --net=host -v /path/to/files:/workspace/insightface nvcr.io/nvidia/tensorrt:20.12-py3`

**Container Installations**

In [None]:
!pip install tqdm

In [None]:
!pip install onnx==1.8.0

In [None]:
!pip install mxnet==1.6.0

In [None]:
!bash /opt/tensorrt/install_opensource.sh -b master

In [None]:
!mkdir -p model_repository/retina_trt_fp16/1
!mkdir -p model_repository/arcface_trt_fp16/1

### Retinaface: Detection

In [None]:
! git clone https://github.com/SthPhoenix/InsightFace-REST

In [None]:
cd InsightFace-REST/src

**Download the required Retinaface Model: https://github.com/deepinsight/insightface/wiki/Model-Zoo**

**Modification**

- Edit `converters/build_retina_trt.py` to add `model_dir`, `model_name`, and `im_size` (`[640, 480]`, given as (W, H)).

**ONNX Conversion**

In [None]:
! python converters/build_retina_trt.py

**TRT Conversion**

- Ignore the .plan file generated by the Python script above.
- Use the .onnx or .onnx.tmp file for the TRT conversion.
- Shapes passed to `trtexec` are in (B, C, H, W) order.
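The converter script takes `im_size` as (W, H), while the shapes passed to `trtexec` use (B, C, H, W), so the two spatial values swap places. A minimal sketch of that mapping follows; `trt_shape` is a hypothetical helper (not part of InsightFace-REST), and it assumes an input tensor named `data` with 3 channels, as in the `trtexec` call in this notebook.

```python
def trt_shape(im_size, batch=1, channels=3, input_name="data"):
    """Format a trtexec shape spec from a (W, H) image size.

    Hypothetical helper for illustration; not part of InsightFace-REST.
    """
    width, height = im_size  # the converter scripts take (W, H)
    # trtexec expects input_name:BxCxHxW, so height comes before width
    return f"{input_name}:{batch}x{channels}x{height}x{width}"


print(trt_shape([640, 480]))            # data:1x3x480x640
print(trt_shape([640, 480], batch=32))  # data:32x3x480x640
```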

In [None]:
!trtexec --onnx=/workspace/Courses/CV/inference/Bharat/FaceRecognition/insightface_onnx_trt_triton/models/onnx/retinaface_r50_v1/retinaface_r50_v1.onnx --saveEngine=/workspace/Courses/CV/inference/Bharat/FaceRecognition/insightface_onnx_trt_triton/model_repository/retina_trt_fp16/1/retinaface_r50_FP16.plan --fp16 --shapes=data:1x3x480x640 --minShapes=data:1x3x480x640 --optShapes=data:1x3x480x640 --maxShapes=data:32x3x480x640

### Arcface: Recognition

**Download the required Arcface Model: https://github.com/deepinsight/insightface/wiki/Model-Zoo**

Example: arcface_r100_v1

**Modification**

- Edit `converters/build_insight_trt.py` to add `model_dir`, `model_name`, and `im_size` (`[112, 112]`).

**ONNX Conversion**

In [None]:
!python converters/build_insight_trt.py

**TRT Conversion**

- Ignore the .plan file generated by the Python script above.
- Use the .onnx or .onnx.tmp file for the TRT conversion.

In [None]:
!trtexec --onnx=/workspace/Courses/CV/inference/Bharat/FaceRecognition/insightface_onnx_trt_triton/models/onnx/arcface_r100_v1/arcface_r100_v1.onnx --saveEngine=/workspace/Courses/CV/inference/Bharat/FaceRecognition/insightface_onnx_trt_triton/model_repository/arcface_trt_fp16/1/arcface_r100_v1_FP16.plan --fp16 --shapes=data:1x3x112x112 --minShapes=data:1x3x112x112 --optShapes=data:1x3x112x112 --maxShapes=data:32x3x112x112
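Both `trtexec` calls in this notebook follow the same pattern: a fixed batch of 1 for `--shapes`, `--minShapes`, and `--optShapes`, and a batch of 32 for `--maxShapes`. The sketch below assembles that argument list in Python. `trtexec_command` is a hypothetical helper, the file names are shortened placeholders rather than the full paths used above, and the input tensor is assumed to be named `data`.

```python
import shlex


def trtexec_command(onnx_path, plan_path, shape_chw, max_batch=32, fp16=True):
    """Build the trtexec argument list used in this notebook.

    Hypothetical helper; it only emits flags that already appear in the
    commands above. shape_chw is (C, H, W).
    """
    c, h, w = shape_chw
    one = f"data:1x{c}x{h}x{w}"            # batch of 1 for shapes/min/opt
    top = f"data:{max_batch}x{c}x{h}x{w}"  # largest batch the engine accepts
    args = [
        "trtexec",
        f"--onnx={onnx_path}",
        f"--saveEngine={plan_path}",
        f"--shapes={one}",
        f"--minShapes={one}",
        f"--optShapes={one}",
        f"--maxShapes={top}",
    ]
    if fp16:
        args.insert(3, "--fp16")  # leave the flag out to keep default precision
    return args


# Placeholder file names; substitute the real .onnx and .plan paths.
cmd = trtexec_command("arcface_r100_v1.onnx", "arcface_r100_v1_FP16.plan", (3, 112, 112))
print(shlex.join(cmd))
```

Inside the TensorRT container the list could be passed to `subprocess.run(cmd, check=True)` instead of being printed.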

# Save the final models in a "model_repository"

### model_repository

```
  model_repository/
    arcface_trt_fp16/
      1/
        arcface_fp16.plan
    retina_trt_fp16/
      1/
        retinaface_fp16.plan

```
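The layout above is one directory per model, one numbered version directory inside it, and the engine file inside that. A small sketch for creating the directories and spotting a model that has no engine yet is shown here; the helper names are hypothetical, and the check only looks for any `*.plan` file rather than a specific file name.

```python
from pathlib import Path

MODELS = ("arcface_trt_fp16", "retina_trt_fp16")


def version_dir(repo_root, model_name, version=1):
    """Return <repo_root>/<model_name>/<version>, where the .plan file lives."""
    return Path(repo_root) / model_name / str(version)


def create_layout(repo_root="model_repository"):
    """Create the version directories (does nothing if they already exist)."""
    created = []
    for name in MODELS:
        directory = version_dir(repo_root, name)
        directory.mkdir(parents=True, exist_ok=True)
        created.append(directory)
    return created


def missing_plans(repo_root="model_repository"):
    """List the models whose version directory holds no .plan file."""
    return [
        name for name in MODELS
        if not list(version_dir(repo_root, name).glob("*.plan"))
    ]


print(version_dir("model_repository", "retina_trt_fp16").as_posix())
```

Calling `missing_plans()` before starting the server gives a quick hint when a `trtexec` run wrote its engine somewhere else.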

### Note: You can create an FP32 TensorRT model as well by omitting `--fp16`.

In [2]:
!rm -rf `find -type d -name .ipynb_checkpoints`

### Run Triton Server

`$ docker run --gpus device=2 --net=host -v /home/path/to/model_repository/:/models --ipc=host nvcr.io/nvidia/tritonserver:20.12-py3 tritonserver --model-repository=/models --strict-model-config=False --log-verbose=2`

### Run Triton Server Client SDK

`$ docker run -it -v /home/path/:/workspace/data --net=host nvcr.io/nvidia/tritonserver:20.12-py3-sdk`

In [None]:
!curl -v localhost:8000/v2/models/arcface_trt_fp16/config 

In [None]:
!curl -v localhost:8000/v2/models/retina_trt_fp16/config 
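The same configuration endpoint can be queried from Python, which is convenient when the result is needed programmatically. This is a sketch only: `config_url` and `fetch_config` are hypothetical names, the host and port match the `curl` calls above, and the response body is assumed to be JSON. `fetch_config` needs a running server, so the example only prints the URLs.

```python
import json
from urllib.request import urlopen


def config_url(model_name, host="localhost", port=8000):
    """URL of the model-configuration endpoint used by the curl calls above."""
    return f"http://{host}:{port}/v2/models/{model_name}/config"


def fetch_config(model_name, timeout=5.0):
    """Return the model configuration as a dict (requires a running server).

    Assumes the endpoint answers with a JSON body.
    """
    with urlopen(config_url(model_name), timeout=timeout) as response:
        return json.load(response)


for name in ("arcface_trt_fp16", "retina_trt_fp16"):
    print(config_url(name))
```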

In [1]:
!perf_analyzer -m arcface_trt_fp16 -b 1 -u localhost:8001 -i grpc --concurrency-range 1

*** Measurement Settings ***
  Batch size: 1
  Measurement window: 5000 msec
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Client: 
    Request count: 2008
    Throughput: 401.6 infer/sec
    Avg latency: 2489 usec (standard deviation 1021 usec)
    p50 latency: 2449 usec
    p90 latency: 2481 usec
    p95 latency: 2499 usec
    p99 latency: 2770 usec
    Avg gRPC time: 2463 usec ((un)marshal request/response 25 usec + response wait 2438 usec)
  Server: 
    Inference count: 2423
    Execution count: 2423
    Successful request count: 2423
    Avg request latency: 1925 usec (overhead 1 usec + queue 19 usec + compute input 985 usec + compute infer 910 usec + compute output 10 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 401.6 infer/sec, latency 2489 usec


In [3]:
!perf_analyzer -m retina_trt_fp16 -b 1 -u localhost:8001 -i grpc --concurrency-range 1

*** Measurement Settings ***
  Batch size: 1
  Measurement window: 5000 msec
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Client: 
    Request count: 482
    Throughput: 96.4 infer/sec
    Avg latency: 10357 usec (standard deviation 2244 usec)
    p50 latency: 9983 usec
    p90 latency: 11858 usec
    p95 latency: 12124 usec
    p99 latency: 12567 usec
    Avg gRPC time: 10279 usec ((un)marshal request/response 583 usec + response wait 9696 usec)
  Server: 
    Inference count: 578
    Execution count: 578
    Successful request count: 578
    Avg request latency: 4814 usec (overhead 3 usec + queue 22 usec + compute input 2237 usec + compute infer 2288 usec + compute output 264 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 96.4 infer/sec, latency 10357 usec
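The reported throughput follows directly from the client request count and the 5000 msec measurement window. With a single synchronous request in flight, it is also close to the inverse of the average latency. The sketch below reproduces both figures from the numbers in the two reports.

```python
WINDOW_SEC = 5000 / 1000  # measurement window of 5000 msec

runs = {
    "arcface_trt_fp16": {"requests": 2008, "avg_latency_usec": 2489},
    "retina_trt_fp16": {"requests": 482, "avg_latency_usec": 10357},
}

for name, run in runs.items():
    # Requests completed by the client divided by the window length
    run["throughput"] = run["requests"] / WINDOW_SEC
    # At concurrency 1, one request finishes per average latency interval
    run["from_latency"] = 1e6 / run["avg_latency_usec"]
    print(f"{name}: {run['throughput']:.1f} infer/sec, "
          f"about {run['from_latency']:.1f} estimated from latency")
```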
