
# Inference Tutorial

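This tutorial describes how to run inference with a compiled model on a Hailo device: directly with HailoRT, through the Dataflow Compiler ```runner.infer()``` API, and with the profiler using runtime data.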

**Requirements:**

* [HailoRT](https://hailo.ai/developer-zone/sw-downloads/) installed in the same virtual environment, or as part of the Hailo SW Suite.
* Run this code in a Jupyter notebook; see the Introduction tutorial for more details.
* Run the [Compilation Tutorial](./DFC_3_Compilation_Tutorial.ipynb) before running this one.

Note:
This section demonstrates PyHailoRT, a Python library for communicating with Hailo devices.
For evaluation purposes, refer to `hailortcli run2 --help` (or the alias `hailo run2 --help`).
For more details, see the Command Line Tools section of the HailoRT User Guide.

In [None]:
# General imports used throughout the tutorial
from multiprocessing import Process

import numpy as np

from hailo_platform import (
    HEF,
    ConfigureParams,
    FormatType,
    HailoSchedulingAlgorithm,
    HailoStreamInterface,
    InferVStreams,
    InputVStreamParams,
    InputVStreams,
    OutputVStreamParams,
    OutputVStreams,
    VDevice,
)

from hailo_sdk_client import ClientRunner, InferenceContext

## Standalone Hardware Deployment

The standalone flow gives direct access to the hardware: applications are developed directly on top of the Hailo core HW using HailoRT.

This way the Hailo hardware can be used without TensorFlow, and even without the Hailo Dataflow Compiler (after the HEF is built).

A HEF is Hailo’s binary format for neural networks. The HEF file contains:

* Low level representation of the model
* Target HW configuration
* Weights
* Metadata for HailoRT (e.g. input/output scaling)
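
The input/output scaling kept in the HEF metadata is what allows conversion between float values and the integer values processed on the device. As a rough illustration, the sketch below applies an affine (scale and zero-point) mapping in NumPy. The `scale` and `zero_point` values and the helper names `quantize` and `dequantize` are made up for this example and are not read from a real HEF; it assumes 8-bit unsigned data and is not HailoRT's own implementation.

```python
import numpy as np


def quantize(values, scale, zero_point):
    """Map float values to uint8 using an affine (scale, zero-point) scheme."""
    quantized = np.round(values / scale + zero_point)
    return np.clip(quantized, 0, 255).astype(np.uint8)


def dequantize(quantized, scale, zero_point):
    """Map uint8 values back to approximate float values."""
    return (quantized.astype(np.float32) - zero_point) * scale


# Illustrative values only; real parameters would come from the HEF metadata.
scale, zero_point = 0.5, 10.0
values = np.array([-5.0, 0.0, 2.0, 20.0], dtype=np.float32)

quantized = quantize(values, scale, zero_point)
restored = dequantize(quantized, scale, zero_point)
print(quantized)
print(restored)
```

Values that fall outside the representable range are clipped, so a round trip is only exact for inputs that land on the quantization grid.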

First create the desired target object.


In [None]:
# Setting VDevice params to disable the HailoRT service feature
params = VDevice.create_params()
params.scheduling_algorithm = HailoSchedulingAlgorithm.NONE

# The target can be used as a context manager ("with" statement) to ensure it's released on time.
# Here it's avoided for the sake of simplicity
target = VDevice(params=params)

# Loading compiled HEFs to device:
model_name = "resnet_v1_18"
hef_path = f"{model_name}.hef"
hef = HEF(hef_path)

# Get the "network groups" (connectivity groups, aka. "different networks") information from the .hef
configure_params = ConfigureParams.create_from_hef(hef=hef, interface=HailoStreamInterface.PCIe)
network_groups = target.configure(hef, configure_params)
network_group = network_groups[0]
network_group_params = network_group.create_params()

# Create input and output virtual streams params
# The quantized argument indicates whether the incoming data is already quantized.
# Data is quantized by HailoRT if and only if quantized == False.
input_vstreams_params = InputVStreamParams.make(network_group, quantized=False, format_type=FormatType.FLOAT32)
output_vstreams_params = OutputVStreamParams.make(network_group, quantized=True, format_type=FormatType.UINT8)

# Define dataset params
input_vstream_info = hef.get_input_vstream_infos()[0]
output_vstream_info = hef.get_output_vstream_infos()[0]
image_height, image_width, channels = input_vstream_info.shape
num_of_images = 10
low, high = 2, 20

# Generate random dataset
dataset = np.random.randint(low, high, (num_of_images, image_height, image_width, channels)).astype(np.float32)

### Running Hardware Inference
Run inference on the model and then display the output shape:

In [None]:
input_data = {input_vstream_info.name: dataset}

with InferVStreams(network_group, input_vstreams_params, output_vstreams_params) as infer_pipeline:
    with network_group.activate(network_group_params):
        infer_results = infer_pipeline.infer(input_data)
        # The result output tensor is infer_results[output_vstream_info.name]
        print(f"Stream output shape is {infer_results[output_vstream_info.name].shape}")
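
For a classification network such as resnet_v1_18, a common next step is to pick the highest-scoring classes for each image. The following is a minimal sketch, assuming the output tensor can be reshaped to `(num_of_images, num_classes)`. The random `scores` array stands in for `infer_results[output_vstream_info.name]`, the 1000-class shape is an assumption, and `top_k_classes` is a hypothetical helper that is not part of PyHailoRT.

```python
import numpy as np


def top_k_classes(scores, k=5):
    """Return the indices of the k highest scores per image, best first."""
    flat = scores.reshape(scores.shape[0], -1)  # (num_images, num_classes)
    order = np.argsort(flat, axis=1)[:, ::-1]  # sort each row in descending order
    return order[:, :k]


# Stand-in for a UINT8 output of 10 images over 1000 classes (assumed shape)
rng = np.random.default_rng(seed=0)
scores = rng.integers(0, 256, size=(10, 1000), dtype=np.uint8)

top5 = top_k_classes(scores, k=5)
print(top5.shape)
```

Because the ranking only depends on the order of the scores, it can be computed on the quantized UINT8 output without converting it to float first.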

## Streaming Inference

This section shows how to run streaming inference using multiple processes in Python.

The ```infer()``` method is not used here. Instead, a send-and-receive model is employed,
in which the send function and the receive function run in separate processes.
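
The send/receive pattern can be illustrated without a device. The sketch below is a simplified stand-in: it uses two threads and a `queue.Queue` in place of separate processes and the Hailo virtual streams, so `producer`, `consumer`, and `frames_queue` are all hypothetical names. It only shows the structure, where a sender and a receiver run concurrently and each handles a fixed number of frames.

```python
import queue
import threading


def producer(frames_queue, num_frames):
    """Push num_frames dummy frames, similar to writing to an input stream."""
    for frame_id in range(num_frames):
        frames_queue.put(frame_id)


def consumer(frames_queue, num_frames, received):
    """Pull num_frames items, similar to reading from an output stream."""
    for _ in range(num_frames):
        received.append(frames_queue.get(timeout=5))


num_frames = 100
frames_queue = queue.Queue(maxsize=8)  # bounded, so the producer blocks when it is full
received = []

consumer_thread = threading.Thread(target=consumer, args=(frames_queue, num_frames, received))
producer_thread = threading.Thread(target=producer, args=(frames_queue, num_frames))
consumer_thread.start()
producer_thread.start()
producer_thread.join()
consumer_thread.join()

print(f"Received {len(received)} frames")
```

Starting the receiver before the sender mirrors the order used in this tutorial, so that a reader is already waiting when the first frame arrives.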

Define the send and receive functions:

In [None]:
def send(configured_network, num_frames):
    vstreams_params = InputVStreamParams.make(configured_network)
    with InputVStreams(configured_network, vstreams_params) as vstreams:
        configured_network.wait_for_activation(1000)
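        # Allocate one frame-sized buffer per input vstream.
        # np.ndarray leaves the buffer uninitialized, so the frame contents are arbitrary.
        vstream_to_buffer = {
            vstream: np.ndarray([1] + list(vstream.shape), dtype=vstream.dtype) for vstream in vstreams
        }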
        for _ in range(num_frames):
            for vstream, buff in vstream_to_buffer.items():
                vstream.send(buff)


def recv(configured_network, num_frames):
    vstreams_params = OutputVStreamParams.make(configured_network)
    configured_network.wait_for_activation(1000)
    with OutputVStreams(configured_network, vstreams_params) as vstreams:
        for _ in range(num_frames):
            for vstream in vstreams:
                _data = vstream.recv()

Define the number of frames to stream and create the send and receive processes, then run them while the network group is activated:

In [None]:
# Define the number of frames to stream
num_of_frames = 1000

# Start the streaming inference
send_process = Process(target=send, args=(network_group, num_of_frames))
recv_process = Process(target=recv, args=(network_group, num_of_frames))
recv_process.start()
send_process.start()
print(f"Starting streaming (hef='{model_name}', num_of_frames={num_of_frames})")
with network_group.activate(network_group_params):
    send_process.join()
    recv_process.join()

# Release the target device
target.release()
print("Done")

## DFC Inference in TensorFlow Environment

Note: This section is not yet supported on the Hailo-15, as it requires the Dataflow Compiler to be installed on the device.

The ```runner.infer()``` method that was used for emulation in the model optimization tutorial can also be used for running inference on the Hailo device inside the ```infer_context``` environment. Before calling this function with a hardware context, make sure a HEF file is loaded to the runner in one of the following ways: calling ```runner.compile()```, loading a compiled HAR using ```runner.load_har()```, or setting the HEF attribute ```runner.hef```.

First, create the runner and load a compiled HAR:

In [None]:
model_name = "resnet_v1_18"
compiled_model_har_path = f"{model_name}_compiled_model.har"
runner = ClientRunner(hw_arch="hailo8", har=compiled_model_har_path)
# For Mini PCIe modules or Hailo-8R devices, use hw_arch='hailo8r'

Call ```runner.infer()``` within the hardware inference context (```InferenceContext.SDK_HAILO_HW```) to run on the Hailo device:

In [None]:
hef_path = f"{model_name}.hef"
hef = HEF(hef_path)
input_vstream_info = hef.get_input_vstream_infos()[0]
image_height, image_width, channels = input_vstream_info.shape
num_of_images = 10
low, high = 2, 20

with runner.infer_context(InferenceContext.SDK_HAILO_HW) as hw_ctx:
    # Running hardware inference:
    for i in range(10):
        dataset = np.random.randint(low, high, (num_of_images, image_height, image_width, channels)).astype(np.uint8)
        results = runner.infer(hw_ctx, dataset)
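
Because the input shape is read from the HEF, it can help to check a batch against that shape before sending it to the device. The following is a small sketch of such a check. `validate_batch` is a hypothetical helper, and the `(224, 224, 3)` frame shape is an assumed example rather than a value read from an actual HEF.

```python
import numpy as np


def validate_batch(batch, frame_shape, dtype=np.uint8):
    """Raise ValueError unless batch has shape (N, *frame_shape) and the given dtype."""
    if batch.ndim != len(frame_shape) + 1 or batch.shape[1:] != tuple(frame_shape):
        raise ValueError(f"Expected frames of shape {tuple(frame_shape)}, got {batch.shape[1:]}")
    if batch.dtype != dtype:
        raise ValueError(f"Expected dtype {np.dtype(dtype)}, got {batch.dtype}")
    return batch


frame_shape = (224, 224, 3)  # assumed (height, width, channels)
rng = np.random.default_rng(seed=0)
# Like np.random.randint, rng.integers excludes the upper bound by default
batch = rng.integers(2, 20, size=(10, *frame_shape)).astype(np.uint8)

validate_batch(batch, frame_shape)
print(batch.shape, batch.dtype, batch.min(), batch.max())
```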

## Profiler with Runtime Data

This section demonstrates the use of the HTML profiler with runtime data.

Note: On the Hailo-15 device:

1. The `hailortcli run2` command should be run on the device itself
2. The created json file should be copied to the Dataflow Compiler environment
3. The `hailo profiler` command should be used

In [None]:
model_name = "resnet_v1_18"
hef_path = f"{model_name}.hef"
compiled_har_path = f"{model_name}_compiled_model.har"
runtime_data_path = f"runtime_data_{model_name}.json"

# Run hailortcli (can use `hailo` instead) to run the .hef on the device, and save runtime statistics to the file at runtime_data_path
!hailortcli run2 -m raw measure-fw-actions --output-path {runtime_data_path} set-net {hef_path}
!hailo profiler {compiled_har_path} --runtime-data {runtime_data_path} --out-path runtime_profiler.html


# Instead, this command could be used: hailo profiler {compiled_har_path} --collect-runtime-data --out-path runtime_profiler.html
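
Outside a notebook, the `!` shell escape is not available. One option is to build the same two commands in Python and pass them to `subprocess.run`. The sketch below only constructs and prints the argument lists, reusing the flags shown above. The helper name `build_profiler_commands` is hypothetical, and actually executing the commands requires the Hailo tools to be installed.

```python
import shlex


def build_profiler_commands(model_name):
    """Return the hailortcli and hailo profiler argument lists for a model."""
    hef_path = f"{model_name}.hef"
    compiled_har_path = f"{model_name}_compiled_model.har"
    runtime_data_path = f"runtime_data_{model_name}.json"

    measure_cmd = [
        "hailortcli", "run2", "-m", "raw", "measure-fw-actions",
        "--output-path", runtime_data_path,
        "set-net", hef_path,
    ]
    profiler_cmd = [
        "hailo", "profiler", compiled_har_path,
        "--runtime-data", runtime_data_path,
        "--out-path", "runtime_profiler.html",
    ]
    return measure_cmd, profiler_cmd


measure_cmd, profiler_cmd = build_profiler_commands("resnet_v1_18")
for cmd in (measure_cmd, profiler_cmd):
    # To execute a command: subprocess.run(cmd, check=True)
    print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting issues if a model name or path contains spaces.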

### Notes on the Profiler with runtime data
resnet_v1_18 is a small network that fits in a single device without context switching (this is called "single context"). Its FPS and latency are always displayed.

The ``--runtime-data`` flag is useful with big models, where the FPS and latency cannot be calculated at compile time. With runtime data, the profiler displays the load, config, and runtime of the contexts, as well as the FPS and latency for multiple batch sizes.

The runtime FPS is also displayed in the hailortcli output.