# Performance tricks in OpenVINO for latency mode

This notebook is a step-by-step tutorial on improving inference performance in latency mode. Low latency matters most in real-time applications, where results are needed as soon as possible after the data appears. The notebook assumes a computer vision workflow and uses the [YOLOv5n](https://github.com/ultralytics/yolov5) model. We will simulate a camera application that provides frames one by one.

The performance tips applied in this notebook are summarized in the following figure. Quantization and the pre-post-processing API are not covered here because they change the precision (quantization) or the processing graph (prepostprocessor). You can find examples of how to apply them to optimize performance on OpenVINO IR files in [111-detection-quantization](../111-detection-quantization) and [118-optimize-preprocessing](../118-optimize-preprocessing).

![](https://github.com/zhuo-yoyowz/classification/raw/master/images/109-latency-new.png)

> **NOTE**: Many of the steps presented below will give you better performance. However, some of them may not change anything, as their effect depends strongly on the hardware or the model. Please run this notebook on your computer with your model to learn which of them make sense in your case.



In [None]:
import os
import sys
import time
from pathlib import Path
from typing import Any

sys.path.append("../utils")
import notebook_utils as utils

## Data

We will use the same image of the dog sitting on a bicycle for all experiments below. The image is resized and preprocessed to fulfill the requirements of this particular object detection model.

In [None]:
import numpy as np
import cv2

IMAGE_WIDTH = 640
IMAGE_HEIGHT = 480

# load image
image = utils.load_image("../data/image/coco_bike.jpg")
image = cv2.resize(image, dsize=(IMAGE_WIDTH, IMAGE_HEIGHT), interpolation=cv2.INTER_AREA)

# preprocess it for YOLOv5
input_image = image / 255.0
input_image = np.transpose(input_image, axes=(2, 0, 1))
input_image = np.expand_dims(input_image, axis=0)

# show the image
utils.show_array(image)

## Model

We decided to go with [YOLOv5n](https://github.com/ultralytics/yolov5), one of the state-of-the-art object detection models, easily available through the PyTorch Hub and small enough to see the difference in performance.

In [None]:
import torch

# directory for all models
base_model_dir = Path("model")

model_name = "yolov5n"
model_path = base_model_dir / model_name

# load YOLOv5n from PyTorch Hub
pytorch_model = torch.hub.load("ultralytics/yolov5", "custom", path=model_path, device='cpu')
pytorch_model.eval()

## Hardware

The code below lists the available hardware we will use in the benchmarking process.

> **NOTE**: Your hardware is probably different from ours, so you may see completely different results.

In [None]:
import openvino.runtime as ov

# initialize OpenVINO
core = ov.Core()

# print available devices
for device in core.available_devices:
    device_name = core.get_property(device, "FULL_DEVICE_NAME")
    print(f"{device}: {device_name}")

## Helper functions

We're defining a benchmark model function to use for all optimized models below. It runs inference 1000 times, averages the latency time, and prints two measures: seconds per image and frames per second (FPS).

In [None]:
INFER_NUMBER = 1000

def benchmark_model(model: Any, input_data: np.ndarray, benchmark_name: str, device_name: str="CPU"):
    """
    Helper function for benchmarking the model. It measures the time and prints results.
    """
    start = time.perf_counter()
    for _ in range(INFER_NUMBER):
        model(input_data)
    end = time.perf_counter()

    # elapsed time
    infer_time = end - start

    # print seconds per image and FPS
    print(f"{benchmark_name} on {device_name}: {infer_time/INFER_NUMBER:.3f} seconds per image ({INFER_NUMBER/infer_time:.2f} FPS)")
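
The helper above reports a single average over all runs. As a complementary sketch (not part of the notebook's helper), the function below times each call separately, skips a few warm-up calls, and reports the median and 95th percentile alongside the mean. The name `latency_stats` and its parameters are hypothetical, and the lambda is only a stand-in for a compiled model.

```python
import statistics
import time
from typing import Any, Callable, Dict


def latency_stats(fn: Callable[[Any], Any], data: Any, runs: int = 100, warmup: int = 10) -> Dict[str, float]:
    """Time each call separately and summarize the latencies in seconds."""
    # Warm-up calls are executed but not measured.
    for _ in range(warmup):
        fn(data)

    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(data)
        latencies.append(time.perf_counter() - start)

    latencies.sort()
    mean = statistics.fmean(latencies)
    return {
        "mean": mean,
        "median": statistics.median(latencies),
        # Simple nearest-rank approximation of the 95th percentile.
        "p95": latencies[min(runs - 1, int(0.95 * runs))],
        "fps": 1.0 / mean,
    }


# Hypothetical stand-in for a compiled model: any callable works.
stats = latency_stats(lambda x: sum(x), list(range(10_000)), runs=50, warmup=5)
print(f"median {stats['median'] * 1000:.3f} ms, p95 {stats['p95'] * 1000:.3f} ms, {stats['fps']:.1f} FPS")
```

For a latency-focused application, the tail (p95) is often more informative than the mean, since a single slow frame is what the user notices.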


The following functions aim to post-process results and draw boxes on the image.

In [None]:
# https://gist.github.com/AruniRC/7b3dadd004da04c80198557db5da4bda
classes = [
    "person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", "truck", "boat", "traffic light", "fire hydrant",
    "stop sign", "parking meter", "bench", "bird", "cat", "dog", "horse", "sheep", "cow", "elephant", "bear", "zebra",
    "giraffe", "backpack", "umbrella", "handbag", "tie", "suitcase", "frisbee", "skis", "snowboard", "sports ball", "kite",
    "baseball bat", "baseball glove", "skateboard", "surfboard", "tennis racket", "bottle", "wine glass", "cup", "fork",
    "knife", "spoon", "bowl", "banana", "apple", "sandwich", "orange", "broccoli", "carrot", "hot dog", "pizza", "donut",
    "cake", "chair", "couch", "potted plant", "bed", "dining table", "toilet", "tv", "laptop", "mouse", "remote", "keyboard",
    "cell phone", "microwave", "oven", "toaster", "sink", "refrigerator", "book", "clock", "vase", "scissors", "teddy bear",
    "hair drier", "toothbrush"
]

# Colors for the classes above (Rainbow Color Map).
colors = cv2.applyColorMap(
    src=np.arange(0, 255, 255 / len(classes), dtype=np.float32).astype(np.uint8),
    colormap=cv2.COLORMAP_RAINBOW,
).squeeze()


def postprocess(detections: np.ndarray):
    """
    Postprocess the raw results from the model.
    """
    # candidates - probability > 0.25
    detections = detections[detections[..., 4] > 0.25]

    boxes = []
    labels = []
    scores = []
    for obj in detections:
        # YOLOv5 returns boxes as (x_center, y_center, width, height) in pixels.
        x_center, y_center, ww, hh = obj[:4]
        score = obj[4]
        label = np.argmax(obj[5:])
        # Convert to (xmin, ymin, width, height), the format expected by cv2.dnn.NMSBoxes.
        boxes.append(
            tuple(map(int, (x_center - ww // 2, y_center - hh // 2, ww, hh)))
        )
        labels.append(int(label))
        scores.append(float(score))

    # Apply non-maximum suppression to get rid of many overlapping entities.
    # See https://paperswithcode.com/method/non-maximum-suppression
    # This algorithm returns indices of objects to keep.
    indices = cv2.dnn.NMSBoxes(
        bboxes=boxes, scores=scores, score_threshold=0.25, nms_threshold=0.5
    )

    # If there are no boxes.
    if len(indices) == 0:
        return []

    # Filter detected objects.
    return [(labels[idx], scores[idx], boxes[idx]) for idx in indices.flatten()]

def draw_boxes(img: np.ndarray, boxes):
    """
    Draw detected boxes on the image.
    """
    for label, score, box in boxes:
        # Choose color for the label.
        color = tuple(map(int, colors[label]))
        # Draw a box.
        x2 = box[0] + box[2]
        y2 = box[1] + box[3]
        cv2.rectangle(img=img, pt1=box[:2], pt2=(x2, y2), color=color, thickness=2)

        # Draw a label name inside the box.
        cv2.putText(
            img=img,
            text=f"{classes[label]} {score:.2f}",
            org=(box[0] + 10, box[1] + 20),
            fontFace=cv2.FONT_HERSHEY_COMPLEX,
            fontScale=img.shape[1] / 1200,
            color=color,
            thickness=1,
            lineType=cv2.LINE_AA,
        )

def show_result(results: np.ndarray):
    """
    Postprocess the raw results, draw boxes and show the image.
    """
    output_img = image.copy()

    detections = postprocess(results)
    draw_boxes(output_img, detections)

    utils.show_array(output_img)
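
To make the post-processing step more concrete, here is a minimal pure-Python sketch of the two ideas it relies on: converting a box from center format to top-left format, and greedy non-maximum suppression based on intersection over union (IoU). This is a simplified illustration, not the OpenCV implementation behind `cv2.dnn.NMSBoxes`. The function names and sample boxes are made up for the example.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (xmin, ymin, width, height)


def center_to_corner(cx: float, cy: float, w: float, h: float) -> Box:
    """Convert a center-format box to top-left format with integer pixels."""
    return int(cx - w / 2), int(cy - h / 2), int(w), int(h)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two top-left-format boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: Sequence[Box], scores: Sequence[float], iou_threshold: float = 0.5) -> List[int]:
    """Return indices of boxes to keep, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two heavily overlapping boxes and one separate box (illustrative values).
boxes = [
    center_to_corner(100, 100, 80, 80),
    center_to_corner(104, 102, 80, 80),
    center_to_corner(300, 200, 50, 60),
]
scores = [0.9, 0.6, 0.8]
print(greedy_nms(boxes, scores))  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.86, so it is suppressed in favor of the higher-scoring one.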

## Optimizations

### PyTorch model

First, we're benchmarking the original PyTorch model without any optimizations applied.

In [None]:
import torch

with torch.no_grad():
    result = pytorch_model(torch.as_tensor(input_image)).detach().numpy()[0]
    show_result(result)
    benchmark_model(pytorch_model, input_data=torch.as_tensor(input_image).float(), benchmark_name="PyTorch model")

### ONNX model

The first optimization is to export the PyTorch model to ONNX and run it in OpenVINO.

In [None]:
onnx_path = base_model_dir / Path(f"{model_name}_{IMAGE_WIDTH}_{IMAGE_HEIGHT}").with_suffix(".onnx")

if not onnx_path.exists():
    dummy_input = torch.randn(1, 3, IMAGE_HEIGHT, IMAGE_WIDTH)
    torch.onnx.export(pytorch_model, dummy_input, onnx_path)

onnx_model = core.read_model(onnx_path)
onnx_model = core.compile_model(onnx_model, device_name="CPU")

In [None]:
result = onnx_model(input_image)[onnx_model.output(0)][0]
show_result(result)
benchmark_model(model=onnx_model, input_data=input_image, benchmark_name="ONNX model")

del onnx_model

### OpenVINO IR model

Let's convert the ONNX model to OpenVINO Intermediate Representation (IR) and run it.

In [None]:
from openvino.tools import mo

ov_model = mo.convert_model(onnx_path)
ov_cpu_model = core.compile_model(ov_model, device_name="CPU")

result = ov_cpu_model(input_image)[ov_cpu_model.output(0)][0]
show_result(result)
benchmark_model(model=ov_cpu_model, input_data=input_image, benchmark_name="OpenVINO model")

del ov_cpu_model  # release resources

### OpenVINO IR FP16 model

Reducing precision is one of the well-known methods for faster inference. Here we compress the model weights to FP16 during conversion. We could go further with INT8 quantization, but that usually comes with a small accuracy drop, so we skip it in this notebook.

In [None]:
ov_model_fp16 = mo.convert_model(onnx_path, compress_to_fp16=True)
ov_cpu_model_fp16 = core.compile_model(ov_model_fp16, device_name="CPU")

result = ov_cpu_model_fp16(input_image)[ov_cpu_model_fp16.output(0)][0]
show_result(result)
benchmark_model(model=ov_cpu_model_fp16, input_data=input_image, benchmark_name="OpenVINO FP16 model")

del ov_cpu_model_fp16  # release resources
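
To get a feeling for what FP16 storage means numerically, the sketch below rounds a few values to IEEE 754 half precision and back, using only the standard library. It illustrates FP16 rounding in general and does not reproduce what `mo.convert_model` does internally. The sample values are arbitrary.

```python
import struct


def to_fp16_and_back(value: float) -> float:
    """Round a Python float to IEEE 754 half precision and convert it back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


# Hypothetical weight values, chosen only for illustration.
weights = [0.1, 0.333, 1.0, 123.456]
for w in weights:
    w16 = to_fp16_and_back(w)
    print(f"{w!r} -> {w16!r} (abs error {abs(w - w16):.2e})")

# Storage per value: 2 bytes in FP16 versus 4 bytes in FP32.
print(struct.calcsize("<e"), struct.calcsize("<f"))
```

The rounding errors are small relative to the values, which is why FP16 compression typically has little effect on accuracy while halving the size of the stored weights.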

### OpenVINO IR model + additional config

It is possible to pass a config for any device (CPU in this case). Here we set the number of inference threads to the number of logical CPUs reported by `os.cpu_count()`.

In [None]:
num_cores = os.cpu_count()

ov_cpu_config_model = core.compile_model(ov_model, device_name="CPU", config={"INFERENCE_NUM_THREADS": num_cores})

result = ov_cpu_config_model(input_image)[ov_cpu_config_model.output(0)][0]
show_result(result)
benchmark_model(model=ov_cpu_config_model, input_data=input_image, benchmark_name="OpenVINO model + config")

del ov_cpu_config_model  # release resources
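
Note that `os.cpu_count()` reports all logical CPUs in the machine, which can be more than the process is allowed to use (for example, under a CPU affinity mask). A small sketch of a more conservative count is shown below. The helper name `usable_cpu_count` is hypothetical, and whether the result is a better thread count for your setup is something to measure rather than assume.

```python
import os


def usable_cpu_count() -> int:
    """Number of logical CPUs this process is allowed to run on."""
    if hasattr(os, "sched_getaffinity"):
        # Linux: respects CPU affinity masks (for example, set with taskset).
        return len(os.sched_getaffinity(0))
    # Other platforms: fall back to the total number of logical CPUs.
    return os.cpu_count() or 1


print(f"os.cpu_count(): {os.cpu_count()}, usable: {usable_cpu_count()}")
```

The returned value could be passed as the `INFERENCE_NUM_THREADS` setting in place of `num_cores`.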

### OpenVINO IR model on GPU

A GPU is usually faster than a CPU, so let's run the above model on the GPU. Please note that you need an Intel GPU with its drivers installed to run this step.

In [None]:
if "GPU" in core.available_devices:
    ov_gpu_model = core.compile_model(ov_model, device_name="GPU")

    result = ov_gpu_model(input_image)[ov_gpu_model.output(0)][0]
    show_result(result)
    benchmark_model(model=ov_gpu_model, input_data=input_image, benchmark_name="OpenVINO model", device_name="GPU")

    del ov_gpu_model  # release resources

### OpenVINO IR model in latency mode

OpenVINO offers a virtual device called [AUTO](https://docs.openvino.ai/latest/openvino_docs_OV_UG_supported_plugins_AUTO.html), which can select the best device for us based on a performance hint. There are 3 different hints: `LATENCY`, `THROUGHPUT` and `CUMULATIVE_THROUGHPUT`. As this notebook is focused on the latency mode, we're going to use `LATENCY`.

In [None]:
ov_auto_model = core.compile_model(ov_model, device_name="AUTO", config={"PERFORMANCE_HINT": "LATENCY"})

result = ov_auto_model(input_image)[ov_auto_model.output(0)][0]
show_result(result)
benchmark_model(model=ov_auto_model, input_data=input_image, benchmark_name="OpenVINO model", device_name="AUTO")

### OpenVINO IR model in latency mode + shared memory

OpenVINO is a C++ toolkit with Python wrappers (API). By default, the Python API copies the input to an additional buffer and then runs the processing in C++. This avoids many multiprocessing-related issues, but the copy takes time. We can instead create a tensor with shared memory enabled, which skips the copy and improves performance. Keep in mind that the input must then not be overwritten while it is in use.

In [None]:
c_input_image = np.ascontiguousarray(input_image, dtype=np.float32)  # it must be assigned to a variable, not to be garbage collected
input_tensor = ov.Tensor(c_input_image, shared_memory=True)

result = ov_auto_model(input_tensor)[ov_auto_model.output(0)][0]
show_result(result)
benchmark_model(model=ov_auto_model, input_data=input_tensor, benchmark_name="OpenVINO model", device_name="AUTO")
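
The caveat about not overwriting the input can be illustrated with the standard library alone. The sketch below contrasts a copied buffer with a shared view of the same memory. It uses `array` and `memoryview` as stand-ins and is not the OpenVINO tensor API.

```python
from array import array

# A small float32 buffer standing in for an input image.
source = array("f", [0.0, 0.25, 0.5, 0.75])

copied = array("f", source)   # independent copy of the data
shared = memoryview(source)   # view onto the same memory, no copy

# Overwrite the source after the copy and the view were created.
source[0] = 1.0

print(copied[0])  # 0.0, the copy keeps the old value
print(shared[0])  # 1.0, the view reflects the new value
```

The shared view avoids the cost of copying, but any change to the source is visible through it, which is why the input must stay untouched while inference is using it.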

## Conclusions

We have shown the steps needed to improve the performance of an object detection model. Even if you see much better performance after running this notebook, this may not hold for every hardware configuration or every model. For the most accurate results, please use the `benchmark_app` [command-line tool](https://docs.openvino.ai/latest/openvino_inference_engine_samples_benchmark_app_README.html). Note that `benchmark_app` cannot measure the impact of some of the tricks above, e.g. shared memory.