# Automatic Device Selection

Auto Device (AUTO for short) selects the most suitable device from the available compute devices by considering the network precision, power efficiency, and processing capability. The network precision (whether the network is quantized or not) is the first consideration and filters out devices that cannot run the network efficiently.

Next, dedicated accelerator devices are preferred, e.g., discrete GPU, integrated GPU, or VPU. The CPU is used as the default “fallback device”. Note that AUTO makes this selection only once, at network load time.
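
The two-step logic described above (filter by precision, then prefer accelerators, with the CPU as fallback) can be illustrated with a small pure-Python sketch. This is not the AUTO implementation: the `Device` class, the `select_device` function, the device labels, and the precision sets are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical preference order taken from the text above: dedicated
# accelerators first, CPU as the fallback.
PREFERENCE = ["dGPU", "iGPU", "VPU", "CPU"]


@dataclass
class Device:
    name: str                        # illustrative label, not an OpenVINO device name
    supported_precisions: frozenset  # made-up set of precisions the device runs well


def select_device(devices, network_precision):
    """Pick one device: filter by precision, then apply the preference order."""
    candidates = [d for d in devices if network_precision in d.supported_precisions]
    for preferred in PREFERENCE:
        for device in candidates:
            if device.name == preferred:
                return device.name
    return "CPU"  # default fallback device


# Assumed capabilities, for illustration only.
devices = [
    Device("CPU", frozenset({"FP32", "FP16", "INT8"})),
    Device("iGPU", frozenset({"FP32", "FP16"})),
]
print(select_device(devices, "FP16"))  # the accelerator passes the filter and is preferred
print(select_device(devices, "INT8"))  # only the CPU passes the precision filter
```

The real plugin weighs more factors than this sketch does; the point is only the order of the two steps.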

When an accelerator device such as a GPU is chosen, loading the network to it may take a long time. To address this for applications that require a fast initial inference response, AUTO starts inference immediately on the CPU and then transparently shifts inference to the GPU once it is ready, dramatically reducing the time to first inference.
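
The idea of answering on a fast-loading device while a slower one loads in the background can be simulated without any hardware. The sketch below is a simplified model under stated assumptions, not how AUTO is implemented: `FastStartRunner` is a hypothetical class, and the GPU load is replaced by a short `time.sleep`.

```python
import threading
import time


class FastStartRunner:
    """Serve requests on a fast-loading backend until a slow one is ready."""

    def __init__(self, slow_load_seconds):
        self._ready = threading.Event()
        # Simulate the long accelerator load in a background thread.
        self._loader = threading.Thread(
            target=self._load_slow_backend, args=(slow_load_seconds,), daemon=True
        )
        self._loader.start()

    def _load_slow_backend(self, seconds):
        time.sleep(seconds)  # stand-in for compiling the network for the GPU
        self._ready.set()

    def infer(self, request):
        # Use the accelerator once it is loaded, otherwise answer on the CPU.
        backend = "GPU" if self._ready.is_set() else "CPU"
        return backend, request

    def wait_until_ready(self):
        self._loader.join()


runner = FastStartRunner(slow_load_seconds=0.2)
first_backend, _ = runner.infer("frame-0")  # served right away
runner.wait_until_ready()
later_backend, _ = runner.infer("frame-1")  # served after the load finished
print(first_backend, later_backend)
```

The caller never chooses a backend explicitly, which mirrors the "transparent" switch described above.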

![Auto Device Selection logic](data/auto_device_selection.png "Auto Device Selection")


## Prepare the network model files
The following demonstrations use the [googlenet-v1](https://docs.openvino.ai/latest/omz_models_model_googlenet_v1.html) model from the [Open Model Zoo](https://github.com/openvinotoolkit/open_model_zoo/). The googlenet-v1 model is the first of the Inception family of models designed to perform image classification. Like the other Inception models, the googlenet-v1 model has been pre-trained on the ImageNet image database. For details about this family of models, check out the paper.

The following code downloads the googlenet-v1 network model files and converts them to IR files (model/public/googlenet-v1/FP16/googlenet-v1.xml). For more details about the network model tools, refer to [104-model-tools](../104-model-tools/README.md).
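
The IR path quoted above follows a `<base dir>/public/<model name>/<precision>/<model name>.xml` pattern. As a minimal sketch, assuming that layout holds for this model, the path can be built and checked with `pathlib`; the helper name `ir_xml_path` is hypothetical.

```python
from pathlib import Path


def ir_xml_path(base_model_dir, model_name, precision):
    """Build the IR .xml path using the layout quoted in the text above."""
    return Path(base_model_dir) / "public" / model_name / precision / f"{model_name}.xml"


ir_path = ir_xml_path("model", "googlenet-v1", "FP16")
print(ir_path.as_posix())
# The file only exists after the download and convert steps have been run.
print("IR found" if ir_path.exists() else "IR not found yet")
```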

In [None]:
from pathlib import Path
from IPython.display import Markdown, display

model_name = "googlenet-v1"
base_model_dir = Path("./model").expanduser()
precision = "FP16"

download_command = (
    f"omz_downloader --name {model_name} --output_dir {base_model_dir}"
)
display(Markdown(f"Download command: `{download_command}`"))
display(Markdown(f"Downloading {model_name}..."))

# Depending on your network connection, a proxy may be required.
# Uncomment the following 2 lines and set the correct proxies if required.
# %env https_proxy=http://proxy
# %env http_proxy=http://proxy

! $download_command

convert_command = f"omz_converter --name {model_name} --precisions {precision} --download_dir {base_model_dir}"
display(Markdown(f"Convert command: `{convert_command}`"))
display(Markdown(f"Converting {model_name}..."))

! $convert_command

## Imports

In [None]:
import cv2
import matplotlib.pyplot as plt
import numpy as np
from openvino.runtime import Core, CompiledModel, AsyncInferQueue, InferRequest
import sys
import time

## Load the model with AUTO device
### Default behavior of compile_model without device_name
By default, compile_model selects AUTO as device_name if it is not specified.

In [None]:
ie = Core()
# read the model into ngraph representation
model = ie.read_model(model="model/public/googlenet-v1/FP16/googlenet-v1.xml")
# load the model to target device
compiled_model = ie.compile_model(model=model, config={"LOG_LEVEL":"LOG_INFO"})

if isinstance(compiled_model, CompiledModel):
    print("Successfully compiled model without a device_name.")

del compiled_model  # Deleting the model waits until the selected device has finished compiling the network.
print("Deleted compiled_model")

### Explicitly load network model to AUTO device
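
Besides the plain `AUTO` name used in this section, the OpenVINO documentation also describes a device string with an explicit candidate list, such as `AUTO:GPU,CPU`. The sketch below only builds such a string; the helper name `auto_device_name` is hypothetical, and whether a given candidate list is accepted depends on the installed OpenVINO version and the devices present.

```python
def auto_device_name(priorities=()):
    """Return "AUTO", or "AUTO:<dev1>,<dev2>,..." for a candidate list."""
    cleaned = [p.strip().upper() for p in priorities if p.strip()]
    return "AUTO:" + ",".join(cleaned) if cleaned else "AUTO"


print(auto_device_name())
print(auto_device_name(["GPU", "CPU"]))

# Intended use (assumes OpenVINO is installed and `ie` and `model` exist):
# compiled_model = ie.compile_model(model=model, device_name=auto_device_name(["GPU", "CPU"]))
```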

In [None]:
compiled_model = ie.compile_model(model=model, device_name="AUTO")

if isinstance(compiled_model, CompiledModel):
    print("Successfully compiled model with AUTO.")

del compiled_model  # Deleting the model waits until the selected device has finished compiling the network.
print("Deleted compiled_model")

## First inference latency benefit with AUTO
One of the key performance benefits of AUTO is first inference latency (FIL = compile model time + first inference execution time). Using the CPU device directly would produce the shortest first inference latency, as the OpenVINO graph representation can be JIT-compiled for the CPU very quickly. The challenge is with the GPU: the OpenCL compilation of the graph to GPU-optimized kernels takes a few seconds to complete on this platform. If AUTO selects the GPU as the device, this initialization time may be intolerable to some applications, which is why AUTO transparently uses the CPU as the first inference device until the GPU is ready.
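
The FIL definition above is simply the sum of two timed spans. A minimal sketch of measuring it with `time.perf_counter` follows; the function `measure_first_inference_latency` and the two stand-in callables are hypothetical and only sleep, so the numbers say nothing about real devices.

```python
import time


def measure_first_inference_latency(compile_fn, infer_fn):
    """Return (compile_seconds, first_infer_seconds, fil_seconds)."""
    start = time.perf_counter()
    compiled = compile_fn()
    compiled_at = time.perf_counter()
    infer_fn(compiled)
    done = time.perf_counter()
    return compiled_at - start, done - compiled_at, done - start


# Stand-ins that only sleep; replace them with real compile and infer calls.
def fake_compile():
    time.sleep(0.05)
    return "compiled-model"


def fake_infer(compiled):
    time.sleep(0.01)


compile_s, infer_s, fil_s = measure_first_inference_latency(fake_compile, fake_infer)
print(f"compile {compile_s:.3f}s + first inference {infer_s:.3f}s = FIL {fil_s:.3f}s")
```
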
### Load an Image
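
The image cell in this section converts an OpenCV image (height, width, channels) into the N, C, H, W layout the network expects. The layout change can be checked on a synthetic array without OpenCV or a model; the 224 x 224 size below is an assumption for illustration, and the real size should be read from the model input shape.

```python
import numpy as np

# Synthetic stand-in for an image from cv2.imread: height x width x channels (BGR).
height, width = 224, 224  # assumed input size, for illustration only
hwc_image = np.zeros((height, width, 3), dtype=np.uint8)

# Move channels first (C, H, W), then add a batch dimension (N, C, H, W).
chw_image = hwc_image.transpose(2, 0, 1)
nchw_image = np.expand_dims(chw_image, 0)

print(hwc_image.shape, "->", nchw_image.shape)
```

Note that `cv2.resize` takes its target size as (width, height), which is why the cell passes `(W, H)` rather than `(H, W)`.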

In [None]:
compiled_model = ie.compile_model(model=model, device_name="CPU")

input_layer_ir = next(iter(compiled_model.inputs))

# Expect image in BGR format
image = cv2.imread("../001-hello-world/data/coco.jpg")

# N, C, H, W = batch size, number of channels, height, width
N, C, H, W = input_layer_ir.shape

# Resize image to meet network expected input sizes
resized_image = cv2.resize(image, (W, H))

# Reshape to network input shape
input_image = np.expand_dims(resized_image.transpose(2, 0, 1), 0)

plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB));

del compiled_model

### Load network model to GPU Device and do first inference

In [None]:

# Start to compile model, time point 1
gpu_load_start_time = time.perf_counter()
compiled_model = ie.compile_model(model=model, device_name="GPU")  # load to GPU

# get input and output nodes
input_layer = compiled_model.input(0)
output_layer = compiled_model.output(0)

# get the first inference result
results = compiled_model([input_image])[output_layer]

# First inference finished, time point 2
gpu_fil_end_time = time.perf_counter()
gpu_fil_span = gpu_fil_end_time - gpu_load_start_time
print(f"Loaded model to GPU and got first inference in {gpu_fil_span:.2f} seconds.")
del compiled_model

### Load network model to AUTO Device and do first inference

In [None]:
# Start to compile model, time point 1
auto_load_start_time = time.perf_counter()
compiled_model = ie.compile_model(model=model)  # device_name is AUTO by default

# get input and output nodes
input_layer = compiled_model.input(0)
output_layer = compiled_model.output(0)

# get the first inference result
results = compiled_model([input_image])[output_layer]


# First inference finished, time point 2
auto_fil_end_time = time.perf_counter()
auto_fil_span = auto_fil_end_time - auto_load_start_time
print(f"Loaded model to AUTO and got first inference in {auto_fil_span:.2f} seconds.")
del compiled_model

### First inference latency benefit 

In [None]:
# Output the latency difference
device_list = ["GPU", "AUTO"]
load_and_fil_list = [gpu_fil_span, auto_fil_span]
plt.barh(range(len(load_and_fil_list)), load_and_fil_list, tick_label=device_list)
plt.show()

## Performance hint
The next highlight is how AUTO differentiates performance hints. By specifying the LATENCY hint or the THROUGHPUT hint, AUTO optimizes performance toward the desired metric. The THROUGHPUT hint delivers much higher frames per second (FPS) than the LATENCY hint. In contrast, the LATENCY hint delivers much lower latency than the THROUGHPUT hint. Note that the hints do not require low-level device-specific settings and are completely portable between devices, which allows AUTO to pass the hint value directly to the selected device.
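
The trade-off between the two hints can be made concrete with simple arithmetic: if several requests are kept in flight at once, each request may take longer, yet more frames complete per second. The sketch below assumes a steady state and uses made-up numbers; `steady_state_fps` is a hypothetical helper, not an OpenVINO function.

```python
def steady_state_fps(parallel_requests, latency_ms):
    """Approximate FPS when `parallel_requests` requests are always in flight."""
    return parallel_requests * 1000.0 / latency_ms


# Made-up numbers for illustration, not measurements.
latency_mode_fps = steady_state_fps(parallel_requests=1, latency_ms=5.0)
throughput_mode_fps = steady_state_fps(parallel_requests=4, latency_ms=12.0)

print(f"one request at a time, 5 ms each:  {latency_mode_fps:.1f} fps")
print(f"four requests in flight, 12 ms each: {throughput_mode_fps:.1f} fps")
```

In this example the per-request latency more than doubles, while the frame rate still rises because four requests overlap.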

### Inference with THROUGHPUT hint

Run inference in a loop and output the FPS and latency every @period_seconds seconds.
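
The per-period numbers reported by the loop reduce to two formulas: FPS is the number of completed requests divided by the elapsed period time, and latency is the mean of the per-request latencies. A standalone sketch with made-up latencies follows; `period_stats` is a hypothetical helper used only to show the arithmetic.

```python
def period_stats(latencies_ms, period_exec_time_s):
    """Return (fps, mean latency in ms) for one reporting period."""
    if not latencies_ms:
        return 0.0, 0.0
    fps = len(latencies_ms) / period_exec_time_s
    mean_latency_ms = sum(latencies_ms) / len(latencies_ms)
    return fps, mean_latency_ms


# Made-up sample: 4 requests completed within a 2-second period.
fps, mean_latency_ms = period_stats([10.0, 12.0, 14.0, 12.0], period_exec_time_s=2.0)
print(f"fps: {fps:.2f}, latency: {mean_latency_ms:.2f} ms")
```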

In [None]:
# output period (seconds)
period_seconds = 10
end_after_periods = 6  # Total run time = period_seconds x end_after_periods

class InferContext:
    def __init__(self):
        self.reset()
        
    def reset(self):
        self.period_fps = 0
        self.period_latency = 0
        self.period_start_time = time.perf_counter()
        self.period_count = 0
        self.latency_list = []
        self.overall_latency_list = []
        self.feed_inference = True

class InferJob:
    def __init__(self, id, context):
        self.id = id
        self.context = context

context = InferContext()

print("Compiling Model for AUTO Device with THROUGHPUT hint")
sys.stdout.flush()

compiled_model = ie.compile_model(model=model, config={"PERFORMANCE_HINT":"THROUGHPUT", "LOG_LEVEL":"LOG_INFO"})


def completion_callback(infer_request: InferRequest, job) -> None:
    context = job.context
    
    context.latency_list.append(infer_request.latency)
    context.overall_latency_list.append(infer_request.latency)
    period_exec_time = time.perf_counter() - context.period_start_time
    if period_exec_time >= period_seconds:
        context.period_start_time = time.perf_counter()
        context.period_fps = len(context.latency_list) / period_exec_time
        context.period_latency = sum(context.latency_list) / len(context.latency_list)
        print(f"fps: {context.period_fps: .2f}, latency: {context.period_latency: .2f}, period time:{period_exec_time: .2f}s")
        sys.stdout.flush()
        context.latency_list = []
        context.period_count = context.period_count + 1
        if context.period_count >= end_after_periods:  # Stop feeding inference requests
            context.feed_inference = False


infer_queue = AsyncInferQueue(compiled_model, 0)  # 0 lets the queue pick the optimal number of requests
infer_queue.set_callback(completion_callback)

print(f"Start inference, {end_after_periods: .0f} groups fps/latency will be out with {period_seconds: .0f}s interval")
sys.stdout.flush()

# Initialization for inference with THROUGHPUT hint
context.reset()

job_id = 0
while context.feed_inference:
    infer_queue.start_async({input_layer_ir.any_name: input_image}, InferJob(job_id, context))
    job_id += 1
    
infer_queue.wait_all()

# Take the fps and latency of latest period
THROUGHPUT_fps = context.period_fps
THROUGHPUT_latency = context.period_latency

print("Done")
sys.stdout.flush()
# print(overall_latency_list)
del compiled_model

### Inference with LATENCY hint

Run inference in a loop and output the FPS and latency every @period_seconds seconds.

In [None]:
print("Compiling Model for AUTO Device with LATENCY hint")
sys.stdout.flush()

compiled_model = ie.compile_model(model=model, device_name="AUTO", config={"PERFORMANCE_HINT":"LATENCY"})

infer_queue = AsyncInferQueue(compiled_model, 0)  # 0 lets the queue pick the optimal number of requests
infer_queue.set_callback(completion_callback)

print(f"Start inference, {end_after_periods:.0f} groups fps/latency will be out with {period_seconds:.0f}s interval")
sys.stdout.flush()

# Initialization for inference with LATENCY hint
context.reset()

job_id = 0
while context.feed_inference:
    infer_queue.start_async({input_layer_ir.any_name: input_image}, InferJob(job_id, context))
    job_id += 1
    
infer_queue.wait_all()

# Take the fps and latency of latest period
LATENCY_fps = context.period_fps
LATENCY_latency = context.period_latency

print("Done")
sys.stdout.flush()
# print(overall_latency_list)
del compiled_model

### FPS and latency difference


In [None]:
# output the difference
labels = ["fps", "latency"]
THROUGHPUT = [THROUGHPUT_fps, THROUGHPUT_latency]
LATENCY = [LATENCY_fps, LATENCY_latency]

width = 0.4
fig, ax = plt.subplots(1,2)

rects1 = ax[0].bar([0], THROUGHPUT_fps, width, label='THROUGHPUT', color='#557f2d')
rects2 = ax[0].bar([width], LATENCY_fps, width, label='LATENCY')
ax[0].set_ylabel("frames per second")
ax[0].set_xticks([width / 2]) 
ax[0].set_xticklabels(["fps"])

rects1 = ax[1].bar([0], THROUGHPUT_latency, width, label='THROUGHPUT', color='#557f2d')
rects2 = ax[1].bar([width], LATENCY_latency, width, label='LATENCY')
ax[1].set_ylabel("milliseconds")
ax[1].set_xticks([width / 2]) 
ax[1].set_xticklabels(["latency (ms)"])

fig.suptitle('Performance Hints')
ax[1].legend()
fig.tight_layout()

plt.show()