Single Operator Execution Interface #4453
Conversation
The current design implements a new ExecutionFrame that is used to execute the op. This is less than optimal; I will attempt to change this in the future. The API will also have to be extended to support providers other than CPU.
With this change, the ExecutableKernelContextImpl is initialized at kernel creation rather than at compute time, which should remove some overhead. This allows multiple calls with different data to be made using the same kernel. Furthermore, the main graph of the different op kernels is now shared through OrtKernelSession.
* Reuse providers across kernels
* Support CUDA providers
This looks interesting to expose in the Java API, but is there a reason why the input and output arguments are specified separately from the call to compute? In session.run they are supplied to that call, and I feel like that maps a little more naturally.
I'm not particularly tied to this exact API; if exposed to the Java or Python API, it could be implemented similarly to
/azp run Linux CPU CI Pipeline,Linux CPU x64 NoContribops CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,MacOS CI Pipeline,MacOS NoContribops CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline
Pull request contains merge conflicts.
/azp run orttraining-linux-ci-pipeline,orttraining-mac-ci-pipeline,orttraining-linux-gpu-ci-pipeline,centos7_cpu,Linux OpenVINO CI Pipeline
Pull request contains merge conflicts.
# Conflicts:
#	include/onnxruntime/core/session/onnxruntime_c_api.h
#	onnxruntime/core/session/onnxruntime_c_api.cc
#	onnxruntime/core/session/ort_apis.h
/azp run Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline, centos7_cpu, centos7_cpu (linux_centos_ci Debug), centos7_cpu (linux_centos_ci Release), orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline
/azp run Linux CPU CI Pipeline, Linux CPU x64 NoContribops CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux OpenVINO CI Pipeline, MacOS CI Pipeline, MacOS NoContribops CI Pipeline, Windows CPU CI Pipeline
Azure Pipelines successfully started running 5 pipeline(s).
Azure Pipelines successfully started running 8 pipeline(s).
Some CUDA ops static_cast the context to OpKernelContextInternal. Switching to this required initializing a session state.
Hi @jywu-msft / @orausch, either we get traction on this PR or we close it. Can you please drive this to closure? It has been outstanding for a while.
Thanks for following up @codemzs. I think a good next step would be to get a review in from someone on the ORT team. Let me know if there is any other way I can help drive this forward.
@orausch I believe Pranav from the ORT team will be looking at this.
+1
@pranavsharma @jywu-msft @RyanUnderhill Can one of you (or another ORT team member) review this? Looking forward to this getting in and then exposed in the Java API ( @Craigacp ). Thanks in advance.
Thanks for your contribution.
The number of APIs required to execute one kernel seems quite a lot. I wonder if we can reduce this number and simplify it. Need some more thought on this.
@@ -69,6 +69,8 @@ class IExecutionFrame {

  Status ReleaseMLValue(int ort_value_idx);

  Status SetOrtValue(OrtValue &value, int ort_value_idx);
value can be passed by const-ref.
@@ -0,0 +1,161 @@
#pragma once
Needs license header.
@@ -0,0 +1,648 @@
// Licensed under the MIT License.
Full license header needed here.
/*model_functions=*/std::initializer_list<ONNX_NAMESPACE::FunctionProto>{},
/*logger=*/logging::LoggingManager::DefaultLogger());

KernelSessionImpl *session = new KernelSessionImpl(std::move(model));
Consider using unique_ptr to avoid the potential of a mem leak.
ORT_API2_STATUS(CreateExecutableKernel,
                _Inout_ OrtKernelSession* session,
                _In_ OrtExecutableKernelContext* context,
                size_t provider_id,
How will the user know which id corresponds to which provider?
ORT_ENFORCE(provider_id < session->provider_list.size(),
            "provider_id (" + std::to_string(provider_id) + ") must be less than the provider list size (" + std::to_string(session->provider_list.size()) + ").");

SingleKernelExecutionFrame* frame;
Another potential for mem leak.
std::unique_ptr<NodeIndexInfo> node_index_info_;

std::vector<int> input_index_to_mlvalue_map_;
these don't look like maps
std::vector<int> fetches_mlvalue_idxs_;
std::vector<OrtValue> fetches_;
std::vector<int> feed_mlvalue_idxs_;
std::vector<OrtValue> feeds_;
why do we need to store the feeds and fetches?
}

// create the context info
std::unique_ptr<SingleKernelExecutionFrame::Info> info = onnxruntime::make_unique<SingleKernelExecutionFrame::Info>(
Do we need to expose this Info object outside? Looks like this will require an unnecessary heap allocation.
ORT_API_STATUS_IMPL(OrtApis::ExecutableKernel_Compute,
                    _Inout_ OrtExecutableKernel *kernel_) {
  API_IMPL_BEGIN
  SingleKernelExecutionFrame* kernel = reinterpret_cast<SingleKernelExecutionFrame*>(kernel_);
This is a bit confusing. The distinction between a kernel and a frame is lost here.
Thanks @pranavsharma.
@EmergentOrder, this is still on my radar, but I'll likely only get around to this in April.
This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.
Bumping to prevent auto-closure
checking in, @orausch could you take another pass at this?
Hey, I discussed this with @souptc and it seems like the "most mergeable" way forward is to instead expose the ORT eager mode (the one that is already committed) as a C API. While this will have more overhead than the solution proposed here, some caching of the constructed graph should hopefully bring latency down far enough to be useful for performance-oriented use cases. This work will be done in new PRs, and is largely unrelated to the solution presented here.
This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.
Description: This PR adds an interface to the C ABI that enables the execution of single ONNX nodes without the overhead of graph construction and memory allocation.
Motivation and Context
The alternative way to execute single operators/nodes is to create an ONNX graph containing a single node only. However, this (understandably) adds a lot of overhead, as can be seen in the plot below.
Here is an example of how the API can be used (UPDATE: new API for adding attributes):
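The PR's actual example was not captured in this thread; the following is a reconstructed sketch based only on the function names visible in the diff. `CreateExecutableKernel` and `ExecutableKernel_Compute` appear in the diff; everything else (the `api` handle, how the session and context are created, the binding calls elided in comments) is an assumption and may not match the PR:

```cpp
// Reconstructed sketch -- not the PR's original example.
OrtKernelSession* session = /* created from a model via the new API */;
OrtExecutableKernelContext* context = /* describes the op type, attributes,
                                         inputs, and outputs */;

// provider_id selects an execution provider from the session's provider
// list (0 assumed to be CPU here).
OrtExecutableKernel* kernel = nullptr;
api->CreateExecutableKernel(session, context, /*provider_id=*/0, &kernel);

// Inputs/outputs are bound to the kernel up front, so Compute can then be
// called repeatedly with different data without rebuilding the kernel.
api->ExecutableKernel_Compute(kernel);
```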
It has been tested with the CPU and CUDA execution providers.