# Performance tricks in OpenVINO for latency mode

This notebook is a step-by-step tutorial on improving inference performance in latency mode. Low latency matters most in real-time applications, where results are needed as soon as possible after the data arrives. The notebook assumes a computer vision workflow and uses the [YOLOv5n](https://github.com/ultralytics/yolov5) model. We simulate a camera application that provides frames one by one.

The performance tips applied in this notebook are summarized in the following figure. Quantization and the pre- and post-processing API are not covered here due to space limitations; examples of applying them to OpenVINO IR files can be found in Notebooks 110-114 and Notebook 118.

![](https://github.com/zhuo-yoyowz/classification/raw/master/images/109-latency.png)

_NOTE: Many of the steps presented below will give you better performance. However, some of them may not change anything if they are strongly dependent on either the hardware or the model. Please run this notebook on your computer with your model to learn which of them make sense in your case._


In [None]:
import os
import sys
import time
from pathlib import Path

sys.path.append("../utils")
import notebook_utils as utils

## Data

For all experiments below we're using the same image.

In [None]:
import numpy as np
import cv2

IMAGE_WIDTH = 640
IMAGE_HEIGHT = 640

image = utils.load_image("../data/image/coco_bike.jpg")
input_image = cv2.resize(image, dsize=(IMAGE_WIDTH, IMAGE_HEIGHT), interpolation=cv2.INTER_AREA)
input_image = np.expand_dims(np.transpose(input_image, axes=(2, 0, 1)), axis=0)
utils.show_array(image)

## Model

The model we selected is for object detection.

In [None]:
import torch

base_model_dir = Path("model")
model_name = "yolov5n"
model_path = base_model_dir / model_name

pytorch_model = torch.hub.load("ultralytics/yolov5", "custom", path=model_path, device='cpu')
pytorch_model.eval()

## Hardware

The code below lists the available devices. These devices are used in the benchmarks that follow.

In [None]:
import openvino.runtime as ov

core = ov.Core()

for device in core.available_devices:
    device_name = core.get_property(device, "FULL_DEVICE_NAME")
    print(f"{device}: {device_name}")

## Optimizations

We define a benchmark function and use it for all the models below. It runs inference 100 times and averages the time.

In [None]:
# todo make it 1000
INFER_NUMBER = 100

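def benchmark_model(model, input, model_name, device="CPU"):
    """Call `model(input)` INFER_NUMBER times and print the average time per image and the FPS."""
    # time the whole loop, not individual calls
    start = time.perf_counter()
    for _ in range(INFER_NUMBER):
        model(input)
    end = time.perf_counter()

    infer_time = end - start  # total time in seconds

    print(f"{model_name} on {device}: {infer_time/INFER_NUMBER:.3f} seconds per image ({INFER_NUMBER/infer_time:.2f} FPS)")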

def show_result(model, result):
    # todo draw results
    # utils.viz_result_image(image, result, resize=True)
    pass
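
The function above reports an average over the whole loop. For latency work it can also help to look at per-call timings after a short warm-up. The following is only a minimal sketch under stated assumptions: `measure_latency` and `fake_infer` are hypothetical names that are not part of this notebook or of OpenVINO, and `fake_infer` stands in for a compiled model call.

```python
import statistics
import time


def measure_latency(infer, data, warmup=5, runs=50):
    """Return a list of per-call latencies (in seconds) for `infer(data)`."""
    # Warm-up calls are executed but not timed.
    for _ in range(warmup):
        infer(data)

    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        infer(data)
        latencies.append(time.perf_counter() - start)
    return latencies


def fake_infer(data):
    # Hypothetical stand-in for a compiled model call.
    return sum(data)


latencies = measure_latency(fake_infer, list(range(1000)))
median_ms = statistics.median(latencies) * 1000
worst_ms = max(latencies) * 1000
print(f"median: {median_ms:.3f} ms, worst: {worst_ms:.3f} ms")
```

The median is less sensitive to occasional slow calls than the mean, while the worst case shows how far a single frame can lag behind.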

### PyTorch model

First, we're benchmarking the original PyTorch model without any optimizations applied.

In [None]:
import torch

with torch.no_grad():
    result = None
    # result = pytorch_model(torch.as_tensor(input_image).float())[0]["boxes"].detach().numpy()
    show_result(pytorch_model, result=result)
    benchmark_model(pytorch_model, input=torch.as_tensor(input_image).float(), model_name="PyTorch model")

### ONNX model

The first optimization is exporting the PyTorch model to ONNX and running it in OpenVINO.

In [None]:
onnx_path = base_model_dir / Path(f"{model_name}_{IMAGE_WIDTH}_{IMAGE_HEIGHT}").with_suffix(".onnx")

if not onnx_path.exists():
    dummy_input = torch.randn(1, 3, IMAGE_HEIGHT, IMAGE_WIDTH)
    torch.onnx.export(pytorch_model, dummy_input, onnx_path)

onnx_model = core.read_model(onnx_path)
onnx_model = core.compile_model(onnx_model, device_name="CPU")

In [None]:
show_result(model=onnx_model, result=result)
benchmark_model(model=onnx_model, input=input_image, model_name="ONNX model")

del onnx_model

### OpenVINO IR model

Let's convert the ONNX model to OpenVINO Intermediate Representation (IR) and run it.

In [None]:
from openvino.tools import mo

ov_model = mo.convert_model(onnx_path)
ov_cpu_model = core.compile_model(ov_model, device_name="CPU")

show_result(model=ov_cpu_model, result=result)
benchmark_model(model=ov_cpu_model, input=input_image, model_name="OpenVINO model")

del ov_cpu_model  # release resources

### OpenVINO IR FP16 model

Reducing the precision is one of the well-known methods for faster inference. Here we compress the model weights to FP16. We could go further and quantize the model, but in that case we should expect a small accuracy drop, so we skip that step in this notebook.

In [None]:
ov_model_fp16 = mo.convert_model(onnx_path, compress_to_fp16=True)
ov_cpu_model_fp16 = core.compile_model(ov_model_fp16, device_name="CPU")

show_result(model=ov_cpu_model_fp16, result=result)
benchmark_model(model=ov_cpu_model_fp16, input=input_image, model_name="OpenVINO FP16 model")

del ov_cpu_model_fp16  # release resources
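
To get a feeling for what FP16 storage does to individual values, the sketch below rounds a few numbers to IEEE 754 half precision and back using only the Python standard library. It does not use OpenVINO, the sample values are made up for illustration, and `to_fp16_and_back` is a hypothetical helper name.

```python
import struct


def to_fp16_and_back(value):
    """Round a Python float to IEEE 754 half precision and return it as a float."""
    return struct.unpack("e", struct.pack("e", value))[0]


weights = [0.1, 1.0, 3.14159, 1234.5678]  # made-up sample values
for w in weights:
    w16 = to_fp16_and_back(w)
    rel_error = abs(w16 - w) / abs(w)
    print(f"{w!r} -> {w16!r} (relative error {rel_error:.2e})")
```

The relative error stays small for values in the normal FP16 range, which is why FP16 compression tends to affect accuracy far less than integer quantization. Whether it actually speeds up inference depends on the device.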

### OpenVINO IR model + additional config

A device-specific config can be passed when compiling a model (for the CPU in this case). Here we set the number of inference threads to the number of logical CPUs reported by `os.cpu_count()`.

In [None]:
num_cores = os.cpu_count()

ov_cpu_config_model = core.compile_model(ov_model, device_name="CPU", config={"INFERENCE_NUM_THREADS": num_cores})

show_result(model=ov_cpu_config_model, result=result)
benchmark_model(model=ov_cpu_config_model, input=input_image, model_name="OpenVINO model + config")

del ov_cpu_config_model  # release resources
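
Note that `os.cpu_count()` can return `None` when the count cannot be determined. A minimal sketch of a defensive way to build the config follows; `pick_num_threads` is a hypothetical helper, not part of OpenVINO or of this notebook.

```python
import os


def pick_num_threads(requested=None):
    """Return a positive thread count, capped at the CPUs the OS reports."""
    available = os.cpu_count() or 1  # os.cpu_count() may return None
    if requested is None:
        return available
    return max(1, min(requested, available))


config = {"INFERENCE_NUM_THREADS": pick_num_threads()}
print(config)
```

The resulting dictionary has the same shape as the `config` argument passed to `core.compile_model` above.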

### OpenVINO IR model on GPU

A GPU is often faster than a CPU, so let's run the above model on the GPU. Please note that you need an Intel GPU with drivers installed to run this step.

In [None]:
if "GPU" in core.available_devices:
    ov_gpu_model = core.compile_model(ov_model, device_name="GPU")

    show_result(model=ov_gpu_model, result=result)
    benchmark_model(model=ov_gpu_model, input=input_image, model_name="OpenVINO model", device="GPU")

    del ov_gpu_model  # release resources

### OpenVINO IR model in latency mode

OpenVINO offers a virtual device called [AUTO](https://docs.openvino.ai/latest/openvino_docs_OV_UG_supported_plugins_AUTO.html), which can select the best device for us based on a performance hint. There are 3 different hints: `LATENCY`, `THROUGHPUT` and `CUMULATIVE_THROUGHPUT`. As this notebook is focused on the latency mode, we're going to use `LATENCY`.

In [None]:
ov_auto_model = core.compile_model(ov_model, device_name="AUTO", config={"PERFORMANCE_HINT": "LATENCY"})

show_result(model=ov_auto_model, result=result)
benchmark_model(model=ov_auto_model, input=input_image, model_name="OpenVINO model", device="AUTO")

### OpenVINO IR model in latency mode + shared memory

OpenVINO is a C++ toolkit with Python wrappers (API). By default, the Python API copies the input to an additional buffer and then runs the processing in C++. This avoids many multiprocessing-related issues, but the copy takes time. Instead, we can create a tensor with shared memory enabled, which skips the copy and improves performance. Keep in mind that the input array must then not be overwritten while it is in use.

In [None]:
c_input_image = np.ascontiguousarray(input_image, dtype=np.float32)  # keep a reference in a variable so the array is not garbage collected
input_tensor = ov.Tensor(c_input_image, shared_memory=True)

show_result(model=ov_auto_model, result=result)
benchmark_model(model=ov_auto_model, input=input_tensor, model_name="OpenVINO model", device="AUTO")
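
The difference between a copied and a shared buffer, and why a shared input must not be overwritten, can be illustrated with the standard library alone. This is only an analogy under stated assumptions: it uses `array` and `memoryview` instead of OpenVINO tensors, and the tiny `frame` buffer is a made-up stand-in for the input image.

```python
from array import array

frame = array("f", [0.0, 1.0, 2.0, 3.0])  # stand-in for the input image buffer

copied = array("f", frame)   # independent copy: costs extra time and memory
shared = memoryview(frame)   # zero-copy view over the same buffer

frame[0] = 42.0  # overwrite the original, e.g. with the next camera frame

print("copy sees:  ", copied[0])
print("shared sees:", shared[0])
```

The copy is unaffected by the write, whereas the shared view observes it immediately. With shared memory, overwriting the input while inference is still using it would change the data being processed.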

### OpenVINO IR model in latency mode + shared memory + asynchronous processing

Asynchronous mode means that OpenVINO returns from an inference call immediately and doesn't wait for the result. It requires writing more concurrent code, but it should make better use of processing time: for example, we can run pre- or post-processing code while waiting for the result. Although we could use async processing directly (the `start_async()` function), it's recommended to use `AsyncInferQueue`, which is an easier way to achieve the same outcome. This class automatically spawns a pool of `InferRequest` objects (also called “jobs”) and provides synchronization mechanisms to control the flow of the pipeline.

In [None]:
from openvino.runtime import AsyncInferQueue

def callback(infer_request, info):
    # put your post-processing here
    pass

infer_queue = AsyncInferQueue(ov_auto_model)
infer_queue.set_callback(callback)  # set callback to post-process results

show_result(model=ov_auto_model, result=result)
benchmark_model(model=infer_queue.start_async, input=input_tensor, model_name="OpenVINO model", device="AUTO")
infer_queue.wait_all()  # wait for in-flight requests; the timing above may not include them

del infer_queue  # release resources
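
The general pattern (submit a job, return immediately, post-process in a callback, and wait for everything before stopping the clock) can be sketched without OpenVINO. The example below is an analogy built on the standard library's `ThreadPoolExecutor`, not the `AsyncInferQueue` implementation; `fake_infer` and `on_done` are hypothetical names, and the 10 ms sleep is a made-up stand-in for device inference.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_infer(frame_id):
    # Hypothetical stand-in for inference running on a device.
    time.sleep(0.01)
    return frame_id * 2


results = {}


def on_done(future, frame_id):
    # Post-processing goes here; this sketch just stores the result.
    results[frame_id] = future.result()


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    for frame_id in range(20):
        future = pool.submit(fake_infer, frame_id)  # returns immediately
        future.add_done_callback(lambda f, i=frame_id: on_done(f, i))
    # Leaving the `with` block waits for all submitted jobs to finish.
elapsed = time.perf_counter() - start

print(f"processed {len(results)} frames in {elapsed:.3f} s")
```

Stopping the timer only after all jobs have finished is what keeps the measurement honest: submission alone is fast, but the results are not ready yet.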

## Conclusions

We have shown the steps needed to improve the performance of an object detection model. Even if you see much better performance after running this notebook, please note that this may not hold for every hardware setup or every model. For the most accurate results, please use the `benchmark_app` [command-line tool](https://docs.openvino.ai/latest/openvino_inference_engine_samples_benchmark_app_README.html). Note that `benchmark_app` cannot measure the impact of some of the tricks above, e.g. shared memory.