Create infer request per inference to enable concurrency #494

Draft
wants to merge 3 commits into main

Conversation


@mzegla mzegla commented Dec 21, 2023

This is just a draft PR for now to start a discussion.

It modifies the forward calls to create an inference request on every call instead of reusing the single request created along with the model. This way multiple inferences can run at the same time, allowing higher overall throughput.
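
Below is a minimal sketch of the per-call request idea using the OpenVINO Python API; the model path and the helper function are illustrative assumptions, not code from this PR:

```python
from openvino.runtime import Core  # assumes the OpenVINO 2023.x Python API

core = Core()
# Hypothetical model path; any OpenVINO model would do.
compiled_model = core.compile_model("model.xml", "CPU")

def run_inference(inputs):
    # Each call gets its own infer request, so several threads can call
    # run_inference() concurrently instead of serializing on the single
    # request that is otherwise created once at compile time.
    request = compiled_model.create_infer_request()
    return request.infer(inputs)
```

With a single shared request, concurrent callers have to wait for each other; with a request per call, the compiled model can schedule the inferences in parallel, at the cost of creating a request object each time.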

@@ -19,6 +19,7 @@
 from pathlib import Path
 from tempfile import TemporaryDirectory, gettempdir
 from typing import Any, Dict, List, Optional, Union
+import copy

Contributor:

Unused?

@ngaloppo (Contributor) commented Jan 3, 2024

Why create a new inference request every time forward() is called? I think there's some overhead with that... not sure if it matters. Can't we just create a request when the model is compiled and keep it as a member variable?

@jiwaszki (Contributor) commented Jan 5, 2024

@mzegla what is the example code for running in such a "mode"? Does it require multiple objects/calls into optimum in separate threads? Do you have any specific benchmarks to back up the gains?

Originally, the "compile model approach" was introduced to remove the overhead of creating multiple request instances (as @ngaloppo mentioned; here is the exact PR #265). Right now it is not clear to me how your changes should be used in higher-level user code.

Another idea that comes to mind is introducing a parallel_requests=True/False flag in either the __call__ or even the __init__ function of the model. This would let pipelines be adjusted to the expected behavior and decide which gains are more important -- sequential gains (one-time IR creation) or parallel gains (multiple requests on demand; BTW, even the use of AsyncInferQueue as a pool of requests can be considered here, but this depends on the scenario).
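
For reference, a minimal sketch of the AsyncInferQueue pool idea mentioned above; the model path, pool size, and input handling are illustrative assumptions, not code from this PR:

```python
from openvino.runtime import AsyncInferQueue, Core  # assumes the OpenVINO 2023.x Python API

core = Core()
compiled_model = core.compile_model("model.xml", "CPU")  # hypothetical model path

# A fixed pool of 4 infer requests; idle requests are reused instead of
# creating a new one per call.
queue = AsyncInferQueue(compiled_model, 4)

results = []
# The callback fires when a request completes; it receives the finished
# request and the userdata passed to start_async().
queue.set_callback(lambda request, userdata: results.append(request.results))

batches = []  # fill with input dicts, e.g. {"input_ids": numpy array}
for batch in batches:
    queue.start_async(batch)  # blocks only when all requests in the pool are busy
queue.wait_all()              # wait for every submitted inference to finish
```

Whether the per-call request in this PR or a request pool like this fits better likely depends on how many concurrent callers the pipeline expects.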
