Create infer request per inference to enable concurrency #494

Draft
wants to merge 3 commits into main

Conversation


@mzegla mzegla commented Dec 21, 2023

This is just a draft PR for now to start a discussion.

It modifies the forward calls to create an inference request on every call instead of reusing the single request created along with the model. This way multiple inferences can run at the same time, allowing higher overall throughput.
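
Below is a minimal sketch of the per-call request idea using the OpenVINO Python API; the model path and the helper function are illustrative assumptions, not code from this PR:

```python
from openvino.runtime import Core  # assumes the OpenVINO 2023.x Python API

core = Core()
# Hypothetical model path; any OpenVINO model would do.
compiled_model = core.compile_model("model.xml", "CPU")

def run_inference(inputs):
    # Each call gets its own infer request, so several threads can call
    # run_inference() concurrently instead of serializing on the single
    # request that is otherwise created once at compile time.
    request = compiled_model.create_infer_request()
    return request.infer(inputs)
```

With a single shared request, concurrent callers have to wait for each other; with a request per call, the compiled model can schedule the inferences in parallel, at the cost of creating a request object each time.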

@@ -19,6 +19,7 @@
 from pathlib import Path
 from tempfile import TemporaryDirectory, gettempdir
 from typing import Any, Dict, List, Optional, Union
+import copy

Contributor:

Unused?

@ngaloppo (Contributor) commented Jan 3, 2024

Why create a new inference request every time forward() is called? I think there's some overhead with that... not sure if it matters. Can't we just create a request when the model is compiled and keep it as a member variable?

@jiwaszki (Contributor) commented Jan 5, 2024

@mzegla what is the example code for running in such a "mode"? Does it require multiple objects/calls into optimum in separate threads? Do you have any specific benchmarks to back up the gains?

Originally, the "compile model approach" was introduced to remove the overhead of creating multiple request instances (as @ngaloppo mentioned; here is the exact PR #265). Right now it is not clear to me how your changes should be used in higher-level user code.

Another idea that comes to mind is introducing a parallel_requests=True/False flag in either the __call__ or even the __init__ function of the model. This would let pipelines be adjusted to the expected behavior and decide which gains are more important -- sequential gains (one-time IR creation) or parallel gains (multiple requests on demand; BTW, even the use of AsyncInferQueue as a pool of requests can be considered here, but this depends on the scenario).
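
For reference, a minimal sketch of the AsyncInferQueue pool idea mentioned above; the model path, pool size, and input handling are illustrative assumptions, not code from this PR:

```python
from openvino.runtime import AsyncInferQueue, Core  # assumes the OpenVINO 2023.x Python API

core = Core()
compiled_model = core.compile_model("model.xml", "CPU")  # hypothetical model path

# A fixed pool of 4 infer requests; idle requests are reused instead of
# creating a new one per call.
queue = AsyncInferQueue(compiled_model, 4)

results = []
# The callback fires when a request completes; it receives the finished
# request and the userdata passed to start_async().
queue.set_callback(lambda request, userdata: results.append(request.results))

batches = []  # fill with input dicts, e.g. {"input_ids": numpy array}
for batch in batches:
    queue.start_async(batch)  # blocks only when all requests in the pool are busy
queue.wait_all()              # wait for every submitted inference to finish
```

Whether the per-call request in this PR or a request pool like this fits better likely depends on how many concurrent callers the pipeline expects.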
