# Automatic Device Selection

The Auto Device (AUTO for short) selects the most suitable device from the available compute devices by considering the model's inference precision, power efficiency and processing capability. The inference precision (whether or not the network is quantized) is considered first, to filter out devices that cannot run the network efficiently.

Next, dedicated accelerator devices are preferred, e.g., a discrete GPU, integrated GPU, or VPU. The CPU is used as the default “fallback device”. Note that AUTO makes this selection only once, at network load time.
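
The selection order described above can be illustrated with a small, self-contained sketch. It is not the AUTO plugin's actual implementation: the device labels, the `DEVICE_PRIORITY` list, the `select_device` helper and the capability table are all hypothetical and only show the "filter by precision, then prefer accelerators" idea.

```python
# Simplified illustration of "filter by precision, then prefer accelerators".
# All names and the capability table below are assumptions made for this sketch.
DEVICE_PRIORITY = ["dGPU", "iGPU", "VPU", "CPU"]  # CPU comes last as the fallback


def select_device(available, supported_precisions, model_precision):
    """Return the highest-priority available device that supports the precision."""
    for device in DEVICE_PRIORITY:
        if device in available and model_precision in supported_precisions.get(device, ()):
            return device
    # Nothing passed the precision filter: fall back to the CPU.
    return "CPU"


# Made-up capability table, used only as example input.
capabilities = {
    "dGPU": {"FP32", "FP16", "INT8"},
    "iGPU": {"FP32", "FP16"},
    "CPU": {"FP32", "FP16", "INT8"},
}

print(select_device(["iGPU", "CPU"], capabilities, "FP16"))
print(select_device(["iGPU", "CPU"], capabilities, "INT8"))
```

In this toy table the integrated GPU is skipped for the INT8 model because it does not list that precision, so the CPU fallback is used instead.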

When an accelerator device such as a GPU is chosen, loading the network to it may take a long time. To address this for applications that require a fast initial inference response, AUTO starts inferencing immediately on the CPU and then transparently shifts inferencing to the GPU once it is ready, dramatically reducing the time to first inference.
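
The "start on the CPU, switch to the GPU once it is ready" behavior can be mimicked with a toy model. The sketch below uses no OpenVINO code at all: `StartupFallbackRunner` is a hypothetical class, the GPU load is simulated with `time.sleep`, and the "inference" is a dummy multiplication.

```python
import threading
import time


class StartupFallbackRunner:
    """Toy model: answer on the 'CPU' until the slow 'GPU' has finished loading."""

    def __init__(self, gpu_load_seconds):
        self._gpu_ready = threading.Event()
        self._loader = threading.Thread(
            target=self._load_gpu, args=(gpu_load_seconds,), daemon=True
        )
        self._loader.start()  # loading runs in the background

    def _load_gpu(self, seconds):
        time.sleep(seconds)  # stands in for the slow GPU model compilation
        self._gpu_ready.set()

    def infer(self, value):
        """Run a dummy 'inference' and report which device served it."""
        device = "GPU" if self._gpu_ready.is_set() else "CPU"
        return device, value * 2

    def wait_until_gpu_ready(self):
        self._loader.join()


runner = StartupFallbackRunner(gpu_load_seconds=0.2)
first_device, _ = runner.infer(21)  # served right away, before the GPU is ready
runner.wait_until_gpu_ready()
later_device, _ = runner.infer(21)  # served by the GPU once loading has finished
print(first_device, later_device)
```

The caller never changes how it calls `infer`; only the device behind it changes, which is the sense in which the switch is transparent.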

For more information, see the [Automatic Device Selection documentation](https://docs.openvino.ai/latest/openvino_docs_IE_DG_supported_plugins_AUTO.html).

![Auto Device Selection logic](data/auto_device_selection.png "Auto Device Selection")


## Prepare the network model files
The following demonstrations use the [googlenet-v1](https://docs.openvino.ai/latest/omz_models_model_googlenet_v1.html) model from the [Open Model Zoo](https://github.com/openvinotoolkit/open_model_zoo/). The googlenet-v1 model is the first of the Inception family of models designed to perform image classification. Like the other Inception models, the googlenet-v1 model has been pre-trained on the ImageNet image database. For details about this family of models, check out the paper.

The following code downloads the googlenet-v1 network model files and converts them to IR files (model/public/googlenet-v1/FP16/googlenet-v1.xml). For more details about the network model tools, refer to [104-model-tools](../104-model-tools/README.md).

In [None]:
from pathlib import Path
from IPython.display import Markdown, display

model_name = "googlenet-v1"
base_model_dir = Path("./model").expanduser()
precision = "FP16"

download_command = (
    f"omz_downloader --name {model_name} --output_dir {base_model_dir}"
)
display(Markdown(f"Download command: `{download_command}`"))
display(Markdown(f"Downloading {model_name}..."))

# Depending on your network connection, a proxy may be required.
# Uncomment the following 2 lines and set the correct proxies if needed.
# %env https_proxy=http://proxy
# %env http_proxy=http://proxy
! $download_command

convert_command = f"omz_converter --name {model_name} --precisions {precision} --download_dir {base_model_dir}"
display(Markdown(f"Convert command: `{convert_command}`"))
display(Markdown(f"Converting {model_name}..."))

! $convert_command
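
Before re-running the download and conversion, it can be useful to check whether the IR files already exist. The sketch below assumes the directory layout shown above (`<base>/public/<model>/<precision>/<model>.xml` plus the matching `.bin` weights file); `ir_paths` and `ir_is_ready` are hypothetical helper names, not part of the Open Model Zoo tools.

```python
from pathlib import Path


def ir_paths(base_dir, model_name, precision):
    """Return the expected .xml and .bin paths of the converted IR model."""
    ir_dir = Path(base_dir) / "public" / model_name / precision
    return ir_dir / f"{model_name}.xml", ir_dir / f"{model_name}.bin"


def ir_is_ready(base_dir, model_name, precision):
    """True only if both IR files already exist on disk."""
    return all(path.is_file() for path in ir_paths(base_dir, model_name, precision))


xml_path, bin_path = ir_paths("./model", "googlenet-v1", "FP16")
print(xml_path.as_posix(), ir_is_ready("./model", "googlenet-v1", "FP16"))
```

In the notebook, the two commands could be skipped when `ir_is_ready(base_model_dir, model_name, precision)` returns `True`.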

## Imports

In [None]:
import cv2
import matplotlib.pyplot as plt
import numpy as np
from openvino.runtime import Core, CompiledModel, AsyncInferQueue, InferRequest
import sys
import time

## Load the model with AUTO device
### Default behavior of Core::compile_model API without device_name
By default, the compile_model API selects AUTO as device_name if it is not specified.

In [None]:
ie = Core()

# set LOG_LEVEL to LOG_INFO
ie.set_property("AUTO", {"LOG_LEVEL":"LOG_INFO"})

# read the model into ngraph representation
model = ie.read_model(model="model/public/googlenet-v1/FP16/googlenet-v1.xml")

# load the model to target device
compiled_model = ie.compile_model(model=model)

if isinstance(compiled_model, CompiledModel):
    print("Successfully compiled model without a device_name.")

In [None]:
del compiled_model  # Deleting the model waits until the selected device has finished compiling the network.
print("Deleted compiled_model")
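
To see which devices AUTO can choose from on a given machine, the devices known to the runtime can be listed. This is a minimal sketch, assuming that the `Core` object exposes `available_devices` and `get_property(device, "FULL_DEVICE_NAME")` as in recent `openvino.runtime` releases; `describe_devices` is a hypothetical helper name, and the import is guarded so the snippet still loads where OpenVINO is not installed.

```python
try:
    from openvino.runtime import Core
except ImportError:
    Core = None  # OpenVINO is not installed; the helper works with any compatible object


def describe_devices(core):
    """Map each device name reported by `core` to its full, human-readable name."""
    return {
        device: core.get_property(device, "FULL_DEVICE_NAME")
        for device in core.available_devices
    }


if Core is not None:
    for device, full_name in describe_devices(Core()).items():
        print(f"{device}: {full_name}")
```

In the notebook, the existing `ie` object could be passed instead of creating a new `Core()`.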

### Explicitly pass AUTO as device_name to Core::compile_model API
This is optional, but explicitly passing AUTO as device_name may improve code readability.

In [None]:
compiled_model = ie.compile_model(model=model, device_name="AUTO")

if isinstance(compiled_model, CompiledModel):
    print("Successfully compiled model with AUTO.")


In [None]:
del compiled_model  # Deleting the model waits until the selected device has finished compiling the network.
print("Deleted compiled_model")

## First inference latency benefit with AUTO
One of the key performance benefits of AUTO is on first inference latency (FIL = model compilation time + first inference execution time). Using the CPU device directly would produce the shortest first inference latency, as the OpenVINO graph representation can be JIT-compiled for the CPU very quickly. The challenge is with the GPU: the OpenCL compilation of the graph to GPU-optimized kernels takes a few seconds to complete on this platform. If AUTO selects GPU as the device, this initialization time may be intolerable for some applications, which is why AUTO transparently uses the CPU as the first inference device until the GPU is ready.
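
The FIL definition above (compile time plus first inference time) can be measured with a small timing helper. This is a sketch only: `first_inference_latency` is a hypothetical helper, and the two stand-in callables use `time.sleep` so the example runs without any device.

```python
import time


def first_inference_latency(compile_fn, infer_fn):
    """Return (compile_seconds, first_infer_seconds, total_seconds)."""
    start = time.perf_counter()
    compiled = compile_fn()
    compiled_at = time.perf_counter()
    infer_fn(compiled)
    done = time.perf_counter()
    return compiled_at - start, done - compiled_at, done - start


# Stand-in callables so the sketch runs without OpenVINO or a GPU.
def fake_compile():
    time.sleep(0.05)
    return "compiled-model"


def fake_infer(compiled):
    time.sleep(0.01)


compile_s, infer_s, fil_s = first_inference_latency(fake_compile, fake_infer)
print(f"compile: {compile_s:.3f}s, first inference: {infer_s:.3f}s, FIL: {fil_s:.3f}s")
```

In the notebook, `compile_fn` could be `lambda: ie.compile_model(model=model, device_name="GPU")` and `infer_fn` could be `lambda cm: cm([input_image])`.
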
### Load an Image

In [None]:
# For demo purposes, load the model to the CPU and get the input information needed to prepare the input buffer.
compiled_model = ie.compile_model(model=model, device_name="CPU")

input_layer_ir = next(iter(compiled_model.inputs))

# Expect image in BGR format
image = cv2.imread("../001-hello-world/data/coco.jpg")

# N, C, H, W = batch size, number of channels, height, width
N, C, H, W = input_layer_ir.shape

# Resize image to meet network expected input sizes
resized_image = cv2.resize(image, (W, H))

# Reshape to network input shape
input_image = np.expand_dims(resized_image.transpose(2, 0, 1), 0)

plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

del compiled_model

### Load network model to GPU Device and do one inference

In [None]:
# Start to compile model, time point 1
gpu_load_start_time = time.perf_counter()
compiled_model = ie.compile_model(model=model, device_name="GPU")  # load to GPU

# get input and output nodes
input_layer = compiled_model.input(0)
output_layer = compiled_model.output(0)

# get the first inference result
results = compiled_model([input_image])[output_layer]

# Get 1st inference, time point 2
gpu_fil_end_time = time.perf_counter()
gpu_fil_span = gpu_fil_end_time - gpu_load_start_time
print(f"Loaded model to GPU and got the first inference result in {gpu_fil_span:.2f} seconds.")
del compiled_model

### Load network model to AUTO Device and do one inference
Since GPU is the selected device, the CPU is used to speed up startup: the first inference and some of the following inferences are executed on the CPU until the GPU is ready (model compilation is done).

In [None]:
# Start to compile model, time point 1
auto_load_start_time = time.perf_counter()
compiled_model = ie.compile_model(model=model)  # device_name is AUTO by default

# get input and output nodes
input_layer = compiled_model.input(0)
output_layer = compiled_model.output(0)

# get the first inference result
results = compiled_model([input_image])[output_layer]


# Get 1st inference, time point 2
auto_fil_end_time = time.perf_counter()
auto_fil_span = auto_fil_end_time - auto_load_start_time
print(f"Loaded model to AUTO and got the first inference result in {auto_fil_span:.2f} seconds.")

In [None]:
del compiled_model  # Deleting the model waits until the selected device has finished compiling the network.
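
The two measured time spans can be put side by side to make the benefit explicit. The helper below is a hypothetical convenience function, and the numbers passed to it are placeholders for illustration, not measured results.

```python
def compare_fil(gpu_seconds, auto_seconds):
    """Summarize how the AUTO first inference latency relates to the GPU one."""
    if gpu_seconds <= 0 or auto_seconds <= 0:
        raise ValueError("both time spans must be positive")
    saved = gpu_seconds - auto_seconds
    speedup = gpu_seconds / auto_seconds
    return (
        f"AUTO: {auto_seconds:.2f}s, GPU: {gpu_seconds:.2f}s, "
        f"saved {saved:.2f}s ({speedup:.1f}x)"
    )


# Placeholder numbers for illustration only, not measured results.
print(compare_fil(gpu_seconds=4.0, auto_seconds=0.5))
```

In the notebook, the call would be `compare_fil(gpu_fil_span, auto_fil_span)`.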

## Performance hint
The next highlight is how performance hints differentiate AUTO's behavior. By specifying the LATENCY hint or the THROUGHPUT hint, AUTO optimizes performance toward the desired metric. The THROUGHPUT hint delivers much higher frames per second (FPS) than the LATENCY hint, whereas the LATENCY hint delivers much lower latency than the THROUGHPUT hint. Notice that the hints do not require low-level device-specific settings and are completely portable between devices, which allows AUTO to simply pass the hint value on to the selected device.

For more information about using AUTO with performance hints, see [AUTO#performance-hints](https://docs.openvino.ai/latest/openvino_docs_OV_UG_supported_plugins_AUTO.html#performance-hints).
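
The trade-off between the two hints can be reasoned about with a back-of-the-envelope relation (Little's law): in steady state, throughput is roughly the number of requests in flight divided by the latency of one request. A throughput-oriented configuration typically keeps several requests in flight at the cost of a higher latency for each of them. The figures below are hypothetical and `estimated_fps` is an illustrative helper, not an OpenVINO function.

```python
def estimated_fps(requests_in_flight, latency_ms):
    """Steady-state throughput estimate: requests in flight / latency per request."""
    if requests_in_flight < 1 or latency_ms <= 0:
        raise ValueError("need at least one request and a positive latency")
    return requests_in_flight * 1000.0 / latency_ms


# Hypothetical figures: one request at a time vs. several requests in parallel.
latency_style = estimated_fps(requests_in_flight=1, latency_ms=5.0)
throughput_style = estimated_fps(requests_in_flight=4, latency_ms=12.0)
print(f"{latency_style:.1f} fps vs {throughput_style:.1f} fps")
```

With these made-up numbers, the parallel setup reaches a higher FPS even though each individual request takes more than twice as long.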

### Class and callback definition

In [None]:
class PerformanceMetrics:
    """
    Record the latest performance metrics (fps and latency), update the metrics every @interval seconds
    :member: fps: Frames per second, the average number of inferences completed per second during the last @interval seconds.
    :member: latency: Average latency of the inferences completed in the last @interval seconds.
    :member: start_time: Start timestamp of the ongoing @interval seconds period.
    :member: latency_list: Latency of each inference completed during the ongoing @interval seconds period.
    :member: interval: The metrics are updated every @interval seconds.
    """
    def __init__(self, interval):
        """
        Create and initialize one instance of class PerformanceMetrics.
        :param: interval: The metrics will be updated in each @interval seconds
        :returns:
            Instance of PerformanceMetrics
        """
        self.fps = 0
        self.latency = 0
        
        self.start_time = time.perf_counter()
        self.latency_list = []
        self.interval = interval
        
    def update(self, infer_request: InferRequest) -> bool:
        """
        Record the latency of the completed request, then update the metrics if the ongoing @interval seconds period has expired.
        :param: infer_request: InferRequest returned from inference callback, which includes the result of inference request.
        :returns:
            True, if metrics are updated.
            False, if @interval seconds duration is not expired and metrics are not updated.
        """
        self.latency_list.append(infer_request.latency)
        exec_time = time.perf_counter() - self.start_time
        if exec_time >= self.interval:
            # update the performance metrics
            self.start_time = time.perf_counter()
            self.fps = len(self.latency_list) / exec_time
            self.latency = sum(self.latency_list) / len(self.latency_list)
            print(f"fps: {self.fps: .2f}, latency: {self.latency: .2f}, time taken:{exec_time: .2f}s")
            sys.stdout.flush()
            self.latency_list = []
            return True
        else:
            return False
        
class InferContext:
    """
    Inference context. Record and update performance metrics via @metrics, set @feed_inference to False once @remaining_update_num <= 0
    :member: metrics: instance of class PerformanceMetrics
    :member: remaining_update_num: the remaining number of performance metrics updates.
    :member: feed_inference: if feed inference request is required or not.
    """
    def __init__(self, update_interval, num):
        """
        Create and initialize one instance of class InferContext.
        :param: update_interval: The performance metrics are updated every @update_interval seconds. This parameter is passed directly to class PerformanceMetrics.
        :param: num: The number of times for performance metrics updating.
        :returns:
            Instance of InferContext.
        """
        self.metrics = PerformanceMetrics(update_interval)
        self.remaining_update_num = num
        self.feed_inference = True
        
    def update(self, infer_request: InferRequest):
        """
        Update the context. Set @feed_inference to False once the remaining number of performance metrics updates (@remaining_update_num) reaches 0
        :param: infer_request: InferRequest returned from inference callback, which includes the result of inference request.
        :returns: None
        """
        if self.remaining_update_num <= 0 :
            self.feed_inference = False
            
        if self.metrics.update(infer_request) :
            self.remaining_update_num = self.remaining_update_num - 1
            if self.remaining_update_num <= 0 :
                self.feed_inference = False
                
def completion_callback(infer_request: InferRequest, context) -> None:
    """
    Callback for the inference request. Passes the @infer_request to @context for updating.
    :param: infer_request: InferRequest returned for the callback, which includes the result of the inference request.
    :param: context: user data passed as the 2nd parameter to AsyncInferQueue.start_async()
    :returns: None
    """
    context.update(infer_request)
            
# Performance metrics update interval (seconds) and times
metrics_update_inerval = 10
metrics_update_num = 6
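
The arithmetic performed at the end of each interval can be checked in isolation. The sketch below reproduces only the two formulas (completed requests divided by elapsed time, and the mean of the recorded latencies); `summarize_interval` is a hypothetical helper and the sample latencies are made up.

```python
def summarize_interval(latencies_ms, elapsed_seconds):
    """Return (fps, mean latency in ms) for the requests completed in one interval."""
    if not latencies_ms or elapsed_seconds <= 0:
        return 0.0, 0.0
    fps = len(latencies_ms) / elapsed_seconds
    mean_latency = sum(latencies_ms) / len(latencies_ms)
    return fps, mean_latency


# Three made-up request latencies (in milliseconds) completed within 1.5 seconds.
fps, latency = summarize_interval([10.0, 20.0, 30.0], elapsed_seconds=1.5)
print(f"fps: {fps:.2f}, latency: {latency:.2f}")
```

Note that FPS depends on how many requests finish per second, not on the latency of a single request, so several requests running in parallel can raise FPS while the mean latency also rises.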

### Inference with THROUGHPUT hint

Run inference in a loop and update the FPS and latency every @metrics_update_inerval seconds.

In [None]:
THROUGHPUT_hint_context = InferContext(metrics_update_inerval, metrics_update_num)

print("Compiling Model for AUTO Device with THROUGHPUT hint")
sys.stdout.flush()

compiled_model = ie.compile_model(model=model, config={"PERFORMANCE_HINT":"THROUGHPUT"})

infer_queue = AsyncInferQueue(compiled_model, 0)  # 0 lets the queue choose the optimal number of requests
infer_queue.set_callback(completion_callback)

print(f"Start inference, {metrics_update_num} groups of FPS/latency will be printed at {metrics_update_inerval}s intervals")
sys.stdout.flush()

while THROUGHPUT_hint_context.feed_inference:
    infer_queue.start_async({input_layer_ir.any_name: input_image}, THROUGHPUT_hint_context)
    
infer_queue.wait_all()

# Take the fps and latency of the latest period
THROUGHPUT_hint_fps = THROUGHPUT_hint_context.metrics.fps
THROUGHPUT_hint_latency = THROUGHPUT_hint_context.metrics.latency

print("Done")

del compiled_model
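
The loop above relies on a common pattern: submit work to a bounded pool, let a completion callback update a shared context, then wait for everything to finish. The toy class below imitates that pattern with the standard library only. It is a stand-in written for this explanation, not how `AsyncInferQueue` is implemented; `ToyAsyncQueue` and its summing "inference" are hypothetical, and unlike a real queue it cannot be reused after `wait_all()`.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ToyAsyncQueue:
    """Minimal stand-in for a pool of requests with a completion callback."""

    def __init__(self, jobs):
        self._pool = ThreadPoolExecutor(max_workers=jobs)
        self._free_slots = threading.Semaphore(jobs)
        self._callback_lock = threading.Lock()
        self._callback = None

    def set_callback(self, callback):
        self._callback = callback

    def start_async(self, data, userdata):
        self._free_slots.acquire()  # block while every slot is busy
        self._pool.submit(self._run, data, userdata)

    def _run(self, data, userdata):
        try:
            result = sum(data)  # stands in for the actual inference
            if self._callback is not None:
                with self._callback_lock:  # serialize callbacks for simplicity
                    self._callback(result, userdata)
        finally:
            self._free_slots.release()

    def wait_all(self):
        self._pool.shutdown(wait=True)  # returns once all submitted jobs are done


collected = []
queue = ToyAsyncQueue(jobs=2)
queue.set_callback(lambda result, userdata: userdata.append(result))
for batch in ([1, 2], [3, 4], [5, 6]):
    queue.start_async(batch, collected)
queue.wait_all()
print(sorted(collected))
```

The `userdata` argument plays the role of the context object in the notebook: it travels with each request and is handed back to the callback on completion.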

### Inference with LATENCY hint

Run inference in a loop and update the FPS and latency every @metrics_update_inerval seconds.

In [None]:
LATENCY_hint_context = InferContext(metrics_update_inerval, metrics_update_num)

print("Compiling Model for AUTO Device with LATENCY hint")
sys.stdout.flush()

compiled_model = ie.compile_model(model=model, config={"PERFORMANCE_HINT":"LATENCY"})

infer_queue = AsyncInferQueue(compiled_model, 0)  # 0 lets the queue choose the optimal number of requests
infer_queue.set_callback(completion_callback)

print(f"Start inference, {metrics_update_num} groups of FPS/latency will be printed at {metrics_update_inerval}s intervals")
sys.stdout.flush()

while LATENCY_hint_context.feed_inference:
    infer_queue.start_async({input_layer_ir.any_name: input_image}, LATENCY_hint_context)
    
infer_queue.wait_all()

# Take the fps and latency of the latest period
LATENCY_hint_fps = LATENCY_hint_context.metrics.fps
LATENCY_hint_latency = LATENCY_hint_context.metrics.latency

print("Done")

del compiled_model

### FPS and latency difference


In [None]:
# output the difference
TPUT = 0
LAT = 1
labels = ["THROUGHPUT hint", "LATENCY hint"]

width = 0.4
fig, ax = plt.subplots(1,2)

rects1 = ax[0].bar([0], THROUGHPUT_hint_fps, width, label=labels[TPUT], color='#557f2d')
rects2 = ax[0].bar([width], LATENCY_hint_fps, width, label=labels[LAT])
ax[0].set_ylabel("frames per second")
ax[0].set_xticks([width / 2]) 
ax[0].set_xticklabels(["fps"])
ax[0].set_xlabel("Higher is better")

rects1 = ax[1].bar([0], THROUGHPUT_hint_latency, width, label=labels[TPUT], color='#557f2d')
rects2 = ax[1].bar([width], LATENCY_hint_latency, width, label=labels[LAT])
ax[1].set_ylabel("milliseconds")
ax[1].set_xticks([width / 2])
ax[1].set_xticklabels(["latency (ms)"])
ax[1].set_xlabel("Lower is better")

fig.suptitle('Performance Hints')
fig.legend(labels)
fig.tight_layout()

plt.show()