# Asynchronous inference
In this notebook, we will configure a model deployment for asynchronous inference.
This notebook assumes that you are familiar with model deployments for local inference (if not, please have a look at [notebook 008](008_deploy_project.ipynb) first). 

## Synchronous vs Asynchronous inference
First things first, what is asynchronous inference and why bother to use it? Up to now, we've been strictly using *synchronous* code to run local inference for our Geti models. This means that whenever we make an infer request to our model (using `deployment.infer()`), the code execution blocks and waits for the model to process the image that is being inferred. Since this is a compute-intensive operation, the CPU (or whatever device we use for inference) will be fully occupied at this time to perform the required calculations to compute the model activations for the image that we feed it. So far, so good. 

However, things may be different if you are running your inference code on a machine with multiple CPU cores. In that case, the synchronous call to `deployment.infer()` will fully occupy one of the cores, but leave the others running idle. The reason is that we can't efficiently share memory between processes running on different cores (this introduces overhead), so the computations for a single image to be inferred cannot simply be distributed across all CPU cores. 

If you only care about *latency*, meaning you want to get a result for your single image as quickly as possible, then this is not an issue. However, if *throughput* is an issue (for example for video processing), we may be able to improve the situation by using parallel processing. Instead of blocking execution for each infer request and waiting for it to complete, we can already send the *next* frame to another CPU core that would otherwise be sitting idle. This puts both cores to work at the same time, thereby increasing the rate at which the frames can be processed. This is exactly what we refer to as *asynchronous inference*.

Luckily, OpenVINO takes care of the parallelization and optimization of this process for us. We just have to set up our code for running local model inference a bit differently.
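
To make the difference between latency and throughput concrete, here is a toy sketch that uses only the Python standard library. It does not call Geti or OpenVINO: `fake_infer` is a hypothetical stand-in that simply sleeps for a fixed time, and the thread pool only mimics the idea of having several requests in flight at once.

```python
import time
from concurrent.futures import ThreadPoolExecutor

FRAME_TIME = 0.05  # pretend that each "inference" takes 50 ms
N_FRAMES = 8
N_WORKERS = 4


def fake_infer(frame_index: int) -> int:
    """Stand-in for a model call: sleeps instead of computing anything."""
    time.sleep(FRAME_TIME)
    return frame_index


# Synchronous: each request blocks until it has finished
t0 = time.perf_counter()
sync_results = [fake_infer(i) for i in range(N_FRAMES)]
t_sync = time.perf_counter() - t0

# Asynchronous: up to N_WORKERS requests are in flight at the same time
t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=N_WORKERS) as pool:
    async_results = list(pool.map(fake_infer, range(N_FRAMES)))
t_async = time.perf_counter() - t0

print(f"sync:  {N_FRAMES / t_sync:.1f} fps")
print(f"async: {N_FRAMES / t_async:.1f} fps")
```

In this sketch each frame still takes about 50 ms from submission to result, so the latency does not improve. Only the number of frames completed per second goes up.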

## Contents of this notebook
In this notebook we will go through the following steps:
1. Create a deployment for a Geti project
2. Prepare the deployment for *asynchronous* inference
3. Run a benchmark to measure the inference rate
4. Switch to *synchronous* inference mode
5. Benchmark again and compare the async and sync inference rates

Special topic: Asynchronous video processing

## Step 1: Create deployment
Let's connect to Geti and create a deployment for a project. Here, we'll use the project from [notebook 004](004_create_pipeline_project_from_dataset.ipynb) again, `COCO multitask animal demo`.

This is a multi-task project with a detection and classification task. If you don't have it yet on your Geti instance, you can run notebook 004 to create it. Or, you can use one of your own projects instead.

In [None]:
from geti_sdk import Geti
from geti_sdk.utils import get_server_details_from_env

geti_server_configuration = get_server_details_from_env()

geti = Geti(server_config=geti_server_configuration)

PROJECT_NAME = "COCO multitask animal demo"
project = geti.get_project(PROJECT_NAME)

Now, let's deploy the project and save the deployment for future use.

In [None]:
DEPLOYMENT_FOLDER = "deployments"

deployment = geti.deploy_project(PROJECT_NAME, output_folder=DEPLOYMENT_FOLDER)

## Step 2: Prepare the deployment for asynchronous inference
To use the deployment in asynchronous mode, there are two main things to consider:
1. Specifying the size of the `infer queue` when loading the inference models. The infer queue is essentially a space of shared memory in which infer requests are stored. A request stays in the queue until one of the machine's cores is ready to process it. A larger queue means that requests may be picked up more rapidly, but it also consumes more of the available system memory. Usually, setting the queue size to be roughly equal to the number of CPU cores on your system is a good choice.
2. Defining what should happen when an infer request has finished processing. This is done via a function referred to as a `callback`. The callback executes whenever an infer request has completed and its results are available. In this notebook, we'll set up a callback to print the inference results to the screen and save our image (with prediction overlay) to a folder on disk.
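
The interplay between these two ingredients can be sketched with the standard library alone. The class below is a hypothetical illustration, not how geti-sdk or OpenVINO implement their infer queue: `ToyInferQueue`, `submit` and `wait_all` are made-up names, and the "model" is just a short sleep. It shows a fixed number of request slots, a `submit` call that blocks when all slots are taken, and a callback that fires when a request completes.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable


class ToyInferQueue:
    """Hypothetical illustration of an infer queue with a completion callback."""

    def __init__(self, size: int, callback: Callable[[Any, Any, Any], None]):
        self._slots = threading.BoundedSemaphore(size)  # one slot per request
        self._pool = ThreadPoolExecutor(max_workers=size)
        self._callback = callback

    def _run(self, image: Any, runtime_data: Any) -> None:
        try:
            time.sleep(0.01)  # stand-in for the actual model computation
            prediction = f"prediction for {image}"
            self._callback(image, prediction, runtime_data)
        finally:
            self._slots.release()  # free the slot for the next request

    def submit(self, image: Any, runtime_data: Any) -> None:
        self._slots.acquire()  # blocks while all slots are occupied
        self._pool.submit(self._run, image, runtime_data)

    def wait_all(self) -> None:
        self._pool.shutdown(wait=True)


completed = []
queue = ToyInferQueue(size=4, callback=lambda img, pred, data: completed.append(data))
for index in range(10):
    queue.submit(f"image_{index}", index)
queue.wait_all()
print(sorted(completed))
```

Note that `completed` is sorted before printing: the order in which the callbacks fire is not guaranteed to match the order of submission.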

First of all, let's load the inference models. We'll set the number of infer requests (the infer queue) to be equal to the number of cores on the system. This is done using the parameter `max_async_infer_requests`. 

In addition, we can configure OpenVINO to load our model in such a way that throughput is maximized. This can be specified in the `openvino_configuration` parameter. See how it's done in the cell below:

In [None]:
import os

# os.cpu_count() can return None if the count cannot be determined, so fall back to 1
num_cores = os.cpu_count() or 1
print(f"Detected {num_cores} CPU cores.")

deployment.load_inference_models(
    device="CPU",
    max_async_infer_requests=num_cores,
    openvino_configuration={"PERFORMANCE_HINT": "THROUGHPUT"},
)

You should see some output showing that the models in the deployment are loaded to CPU, with the number of infer requests set equal to the number of CPU cores.
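
On some systems the process is not allowed to use every core that the machine reports, for example when a CPU affinity mask is set. The sketch below shows one possible way to pick a queue size in that case. The helper name `usable_cpu_count` is hypothetical and not part of geti-sdk, and whether this is a better starting point than the plain core count is an assumption worth checking with a benchmark on your own hardware.

```python
import os


def usable_cpu_count() -> int:
    """Best-effort count of the CPU cores this process may actually use."""
    if hasattr(os, "sched_getaffinity"):
        # Available on Linux: respects the CPU affinity mask of this process
        return max(1, len(os.sched_getaffinity(0)))
    # os.cpu_count() may return None if the count cannot be determined
    return os.cpu_count() or 1


queue_size = usable_cpu_count()
print(f"Suggested infer queue size: {queue_size}")
```

The resulting value could then be passed as `max_async_infer_requests` in place of `num_cores`.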

Now, let's define a `callback` function to handle the inference results. The callback function has a particular signature. It should take as its arguments:
- The `image` or video frame that was inferred, as a numpy array
- The `prediction`, which is the result of the model inference
- Any additional `runtime_data`, which was passed along with the infer request

The first two arguments are always the same: the image as a numpy array and the result as a `Prediction` object. The runtime data is more flexible, however. We decide what to pass here, and it can be anything that we want to use in the callback to further process our results, for example a filename, timestamp or index. In this example we will simply use the image index. The callback should not return any value.

In [None]:
import numpy as np

from geti_sdk.data_models import Prediction
from geti_sdk.utils import show_image_with_annotation_scene

# First, we'll specify the output folder and make sure it exists
output_folder = "output"
os.makedirs(output_folder, exist_ok=True)


def handle_results(image: np.ndarray, result: Prediction, runtime_data: int) -> None:
    """
    Handles asynchronous inference results. Gets called after completion of each infer request.
    """
    # First, save the image (with prediction overlay) in the `output_folder`
    filepath = os.path.join(output_folder, f"result_{runtime_data}.jpg")
    show_image_with_annotation_scene(image, result, filepath)

    # Print the number of predicted objects, and the probability score for each label in each object
    predicted_objects = result.annotations
    print(f"Image {runtime_data} contains {len(predicted_objects)} objects:")
    for obj in predicted_objects:
        label_mapping = {lab.name: lab.probability for lab in obj.labels}
        print(f"    {label_mapping}")
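
Because the runtime data can be any object, a callback can also receive richer context than a single index. The sketch below is a hypothetical variation that passes a dictionary with a filename and a submission timestamp. The names `record_results` and `records` and the dictionary keys are made up for illustration, and the callback is called by hand with stand-in values rather than by the deployment.

```python
import time
from typing import Any, Dict, List

records: List[Dict[str, Any]] = []


def record_results(image: Any, result: Any, runtime_data: Dict[str, Any]) -> None:
    """
    Hypothetical callback: stores one summary record per completed request.
    """
    records.append(
        {
            "filename": runtime_data["filename"],
            "seconds_in_flight": time.monotonic() - runtime_data["submitted_at"],
            "result": result,
        }
    )


# Simulate what happens when one infer request completes. In the notebook, the
# deployment would make this call, with a numpy array and a `Prediction` object.
record_results(
    image=None,
    result="stand-in prediction",
    runtime_data={"filename": "dog.jpg", "submitted_at": time.monotonic()},
)
print(records[0]["filename"])
```

Collecting records in a list like this keeps the callback short, and any heavier post-processing can then happen after all requests have completed.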

Now that we have defined the callback, we need to assign it to the deployment. This will switch the deployment over to asynchronous mode. 

In [None]:
deployment.set_asynchronous_callback(handle_results)

If all goes well, you should see a log line output stating that asynchronous inference mode has been enabled. Now, we are ready to infer!

## Step 3: Run a benchmark to measure inference rate
The next section shows how to run inference in asynchronous mode. We will run inference on 50 images from the COCO dataset. In the next cell, we'll select the filepaths for the images to infer.

In [None]:
from geti_sdk.annotation_readers import DatumAnnotationReader
from geti_sdk.demos import get_coco_dataset

n_images = 50

path = get_coco_dataset()
ar = DatumAnnotationReader(path, annotation_format="coco")
ar.filter_dataset(labels=["dog", "horse", "elephant"])
coco_image_filenames = ar.get_all_image_names()
coco_image_filepaths = [
    os.path.join(path, "images", "val2017", fn + ".jpg") for fn in coco_image_filenames
][0:n_images]
print(f"Selected {len(coco_image_filepaths)} images from COCO dataset")

Now, we're ready to run the benchmark! Here we go:

In [None]:
import time

import cv2

t_start_async = time.time()
for img_index, image_path in enumerate(coco_image_filepaths):
    img = cv2.imread(image_path)
    img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    deployment.infer_async(img_rgb, img_index)

# Wait until inference completes
deployment.await_all()
t_elapsed_async = time.time() - t_start_async

print(
    f"Asynchronous mode: Inferred {len(coco_image_filepaths)} images in {t_elapsed_async:.2f} seconds ({len(coco_image_filepaths)/t_elapsed_async:.1f} fps)"
)

You should see the model output printed on the screen for each image. The model detects animals, and classifies them as `wild` or `domestic`. In the printed output, it shows the number of objects (animals) for each image, as well as the labels for each object and the probability associated with it.

In addition, your workspace should now contain a folder called `output`, which contains the result overlay for each image. Each file should be named `result_x.jpg`, where `x` is the index of the image. 

Finally, at the bottom of the printed output you should see a line stating the time it took to run the inference for all images.
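
When timing asynchronous code, the clock has to keep running until all requests have completed. Otherwise only the time needed to *submit* the requests is measured. The benchmark above takes care of this by calling `deployment.await_all()` before computing the elapsed time. As a rough sketch of the same pattern in reusable form, the helper below times an arbitrary workload. `measure_rate` and `fake_workload` are hypothetical names, and the workload is a stand-in that only sleeps.

```python
import time
from typing import Any, Callable, Sequence, Tuple


def measure_rate(
    run: Callable[[Sequence[Any]], None], items: Sequence[Any]
) -> Tuple[float, float]:
    """Time a single call to `run(items)`; return (elapsed seconds, items per second)."""
    t_start = time.perf_counter()  # monotonic, high-resolution clock
    run(items)  # must only return once all work has completed
    elapsed = time.perf_counter() - t_start
    return elapsed, len(items) / elapsed


def fake_workload(items: Sequence[Any]) -> None:
    """Stand-in for the inference loop plus the final wait."""
    for _ in items:
        time.sleep(0.005)


elapsed, rate = measure_rate(fake_workload, list(range(10)))
print(f"Processed 10 items in {elapsed:.2f} seconds ({rate:.1f} per second)")
```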

## Step 4: Switch to *synchronous* execution mode
Let's switch back to the familiar synchronous inference mode. The deployment provides a simple way to do so:

In [None]:
deployment.asynchronous_mode = False

This removes any callback function that we set and allows us to use the regular `deployment.infer` method again.

## Step 5: Running the benchmark in synchronous mode
Now, let's run the same inference code in synchronous execution mode and compare the time required.

In [None]:
t_start_sync = time.time()
for img_index, image_path in enumerate(coco_image_filepaths):
    img = cv2.imread(image_path)
    img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

    # We modify this part only: use `infer` instead of `infer_async`
    result = deployment.infer(img_rgb)
    # Manually call the function that we defined to handle model results
    handle_results(image=img_rgb, result=result, runtime_data=img_index)

# No need to wait anymore: in synchronous mode the loop only finishes once all images have been inferred
t_elapsed_sync = time.time() - t_start_sync

print(
    f"Synchronous mode: Inferred {len(coco_image_filepaths)} images in {t_elapsed_sync:.2f} seconds ({len(coco_image_filepaths)/t_elapsed_sync:.1f} fps)"
)

You should see the same output as before, with the number of objects and probabilities per label printed per image.
Also, the time required for the whole process is printed on the last line, like before. Let's have a look at the speedup we get from using the asynchronous mode by running the cell below.

In [None]:
print(
    f"Synchronous mode: Time elapsed is {t_elapsed_sync:.2f} seconds ({len(coco_image_filepaths)/t_elapsed_sync:.1f} fps)"
)
print(
    f"Asynchronous mode: Time elapsed is {t_elapsed_async:.2f} seconds ({len(coco_image_filepaths)/t_elapsed_async:.1f} fps)"
)
print(
    f"Asynchronous inference is {t_elapsed_sync/t_elapsed_async:.1f} times faster than synchronous inference."
)

## Asynchronous vs synchronous inference
Clearly, asynchronous mode gives a better speedup if you have more cores available. Also, if you care mostly about latency (i.e. minimal inference time for a single image) it is probably not the way to go, since the inference time for a single image can increase a bit due to the added overhead of the asynchronous processing. However, if you care mostly about the average inference time over a lot of images, asynchronous mode will almost always provide an increased inference rate compared to synchronous mode.

One thing you may have noticed is that in asynchronous mode, the output is not necessarily printed in order. Results for images at different indexes might be mixed up, because they are processed in parallel and one might take longer than another. If you simply want to infer a folder with a lot of images, this is most likely not a problem. However, for applications where the order of the images does matter (for example in video processing), extra care needs to be taken to re-order the frames once inference is done.
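
If all results are only needed after inference has finished, one simple way to restore the order is to store each result under the index it was submitted with and sort afterwards. The sketch below assumes that the index is passed along as runtime data. The `collect` function and the hard-coded completion order are hypothetical and only serve to illustrate the idea.

```python
from typing import Dict, List

# Completion order as it might arrive from asynchronous callbacks
completion_order = [2, 0, 1, 4, 3]

results_by_index: Dict[int, str] = {}


def collect(index: int, result: str) -> None:
    """Store each result under the index it was submitted with."""
    results_by_index[index] = result


for index in completion_order:
    collect(index, f"result for frame {index}")

# Once everything has finished, read the results back in submission order
ordered_results: List[str] = [results_by_index[i] for i in sorted(results_by_index)]
print(ordered_results[0])
```

This approach keeps every result in memory until the end, so it suits a folder of images better than a long video stream.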

## Special topic: Asynchronous video processing

To avoid the problem of frames getting mixed up when inferring videos in asynchronous mode, geti-sdk provides a tool that keeps them in order, while still benefitting from the increased throughput offered by the asynchronous inference mode. The `AsyncVideoProcessor` class implements an ordered buffer for the frames and their results, which allows processing them in the correct sequence. This section of the notebook shows how to use it.
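
The idea of an ordered buffer can be sketched in a few lines. The class below is a hypothetical, simplified illustration and not the actual implementation of `AsyncVideoProcessor`: it holds back results that arrive early and releases them as soon as every earlier index has been seen. It is not thread-safe as written, so a real implementation would need a lock around `push`.

```python
import heapq
from typing import Any, Callable, List, Tuple


class ReorderBuffer:
    """Hypothetical ordered buffer: releases items strictly by ascending index."""

    def __init__(self, on_ready: Callable[[int, Any], None]):
        self._heap: List[Tuple[int, Any]] = []
        self._next_index = 0  # index of the item that may be released next
        self._on_ready = on_ready

    def push(self, index: int, item: Any) -> None:
        # Indices are assumed to be unique and to start at 0
        heapq.heappush(self._heap, (index, item))
        # Release every item that is next in line; later ones keep waiting
        while self._heap and self._heap[0][0] == self._next_index:
            ready_index, ready_item = heapq.heappop(self._heap)
            self._on_ready(ready_index, ready_item)
            self._next_index += 1


emitted: List[int] = []
buffer = ReorderBuffer(on_ready=lambda index, item: emitted.append(index))
for index in [2, 0, 1, 4, 3]:  # out-of-order completion
    buffer.push(index, f"frame {index}")
print(emitted)  # [0, 1, 2, 3, 4]
```

Unlike sorting after the fact, a buffer like this can hand frames to a video writer or stream while inference is still running, at the cost of briefly holding back frames that complete early.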

First, let's define a new callback function that collects the indices of the frames, to find out how big the problem really is.

In [None]:
from typing import List, Tuple


def inference_callback(
    image: np.ndarray, prediction: Prediction, runtime_data: Tuple[int, List[int]]
):
    """
    Take the index of the processed frame, and append it to the list of indices
    """
    index, index_list = runtime_data
    index_list.append(index)


deployment.set_asynchronous_callback(inference_callback)

In the inference callback you would normally define some sort of I/O operation for each frame. For example, writing the frame to a video file using opencv, or sending it in a stream. 

However, because we are only interested in the frame processing order, our callback is really simple. `runtime_data` now consists of two objects: The first is an integer representing the index of the current frame, and the second is a list of indices for frames that have already been processed. Within the function, we just add the current frame index to the list of processed frames.

Now we will run the inference again and inspect the list of indices, so that we can see in what order they were processed.

In [None]:
indices_async: List[int] = []
tstart_pure_async = time.time()
for img_index, image_path in enumerate(coco_image_filepaths):
    img = cv2.imread(image_path)
    img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

    # runtime_data is now a tuple of the current index and the list of indices
    runtime_data = (img_index, indices_async)
    deployment.infer_async(img_rgb, runtime_data)

# Wait until inference completes
deployment.await_all()
telapsed_pure_async = time.time() - tstart_pure_async
print(
    f"Pure asynchronous mode: Time elapsed is {telapsed_pure_async:.2f} seconds ({len(coco_image_filepaths)/telapsed_pure_async:.1f} fps)"
)

Let's have a closer look at the list of indices.

In [None]:
def is_list_sorted(input_list: List[int]):
    """
    Return True if the elements of `input_list` are sorted in ascending order, False otherwise
    """
    return all(a <= b for a, b in zip(input_list, input_list[1:]))


print(f"Is the list of indices sorted?: {is_list_sorted(indices_async)}")
print(f"The frames were processed in this order:\n{indices_async}")

You should see clearly now that the frames are not processed in the order of their original index.
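
A sorted/unsorted flag does not tell us *how far* out of order the frames are. As a rough, hypothetical measure, the sketch below computes the largest distance between a frame's index and the position at which it completed. It assumes the list contains each index from 0 to n-1 exactly once. The function name `max_displacement` is made up for this illustration.

```python
from typing import List


def max_displacement(processed_order: List[int]) -> int:
    """Largest distance between a frame's index and the position at which it completed."""
    return max(
        (abs(position - index) for position, index in enumerate(processed_order)),
        default=0,
    )


print(max_displacement([0, 1, 2, 3]))  # 0: already in order
print(max_displacement([2, 0, 1, 4, 3]))  # 2: frame 2 completed two positions early
```

Calling `max_displacement(indices_async)` on the list collected above gives an impression of how many frames a re-ordering step would need to hold back.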

Let's set up the `AsyncVideoProcessor` to do the same experiment, and have a look at the processing order again.

In [None]:
from geti_sdk.demos import AsyncVideoProcessor

# Initialize the processor
video_processor = AsyncVideoProcessor(
    deployment=deployment, processing_function=inference_callback
)

indices_async_vp: List[int] = []
video_processor.start()
tstart_async_vp = time.time()
for img_index, image_path in enumerate(coco_image_filepaths):
    img = cv2.imread(image_path)
    img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    runtime_data = (img_index, indices_async_vp)

    # We now use the video_processor to infer and process the image, instead of the deployment
    video_processor.process(img_rgb, runtime_data)

# Wait until inference completes
video_processor.await_all()
telapsed_async_vp = time.time() - tstart_async_vp

print(
    f"AsyncVideoProcessor: Time elapsed is {telapsed_async_vp:.2f} seconds ({len(coco_image_filepaths)/telapsed_async_vp:.1f} fps)"
)

And let's see the list of indices now.

In [None]:
print(f"Is the list of indices sorted?: {is_list_sorted(indices_async_vp)}")
print(f"The frames were processed in this order:\n{indices_async_vp}")

You should see that the frames are now processed in order! Most likely, the inference rate with the `AsyncVideoProcessor` will be slightly lower than in the 'pure' asynchronous mode of the `deployment` alone. However, it should still be significantly higher than inference in synchronous mode, while avoiding mixing up the order of the frames! 

You can define any post-processing (like showing the prediction results on the frame) and output operations (writing the frames to a video file) you want to do on the frames in the `processing_function` of the video processor.

## Summary
Using the asynchronous inference mode allows you to make more efficient use of the compute capacity you have in your system, by parallelizing infer requests. If throughput is important in your application, using the asynchronous mode is recommended because it can result in a significant increase in the number of frames that can be processed per second. Depending on hardware configuration, an increase of 2x or more in framerate compared to synchronous mode can be achieved.

The asynchronous mode does require a bit more care to set up and use than the synchronous inference mode. In asynchronous mode, inference results are processed via a pre-defined `asynchronous_callback`, which implements the required post processing steps for each inferred frame or image. As soon as the inference for an image or frame completes, the callback is executed. 

One of the key differences between asynchronous and synchronous inference is the following: *There is no guarantee that infer requests will be processed in the same order in which they were submitted.* In synchronous mode this processing order is guaranteed, because we submit the frames for inference one by one, and only submit the next one once the previous one has completed. However, in async mode multiple frames are submitted for inference (almost) at the same time, and processed in parallel. Inference for each frame may complete at any time, so the order of the inferred frames is likely to be mixed up.

For some applications this may not be a problem: Suppose I want to get the inference results for each image in a folder. Most likely I won't care about the order in which those images are processed, as long as I get the results for all of them in the end. However, for applications involving video processing this is a major issue because in video, the order of the frames obviously does matter.

The geti-sdk provides a tool to avoid this problem, the `AsyncVideoProcessor`. It uses an ordered buffer to ensure that video frames are processed in the order in which they are passed. This allows for maximizing the frame rate for inferred video, while avoiding the problem of frames getting mixed up.