
Single Operator Execution Interface #4453

Open · wants to merge 27 commits into base: main

Conversation

@orausch (Contributor) commented Jul 8, 2020

Description: This PR adds an interface to the C ABI that enables the execution of single ONNX nodes, without the overhead of graph construction and memory allocation.

Motivation and Context
The alternative way to execute single operators/nodes is to create an ONNX graph containing a single node only. However, this (understandably) adds a lot of overhead, as can be seen in the plot below.

[Plot: latency of executing a single node via a full single-node graph vs. the proposed single-operator interface; the graph-based path shows substantially higher overhead.]
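
For reference, a rough sketch of that baseline path. This is only illustrative: the model file "single_conv.onnx" and the tensor names "X", "W", "Y" are assumptions, and it reuses the CheckStatus helper and OrtValue handles from the example further below.

// Conventional path this PR avoids: wrap the node in a full model, create a
// session (this is where the graph-construction overhead comes from), then run.
OrtEnv* env;
CheckStatus(OrtApi->CreateEnv(ORT_LOGGING_LEVEL_WARNING, "single-op", &env));
OrtSessionOptions* options;
CheckStatus(OrtApi->CreateSessionOptions(&options));
OrtSession* session;
CheckStatus(OrtApi->CreateSession(env, "single_conv.onnx", options, &session));

const char* input_names[] = {"X", "W"};
const char* output_names[] = {"Y"};
const OrtValue* inputs[] = {ort_value_input_X, ort_value_input_W};
OrtValue* output = NULL;
CheckStatus(OrtApi->Run(session, /*run_options=*/NULL,
                        input_names, inputs, 2,
                        output_names, 1, &output));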

Here is an example of how the API can be used (UPDATE: new API for adding attributes):

// Setup the context
OrtExecutableKernelContext* kernel_context;
CheckStatus(OrtApi->CreateExecutableKernelContext("MyConv", "Conv", &kernel_context));

// Add parameter X
CheckStatus(OrtApi->ExecutableKernelContext_AddInput(kernel_context, ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT));
// Add parameter W
CheckStatus(OrtApi->ExecutableKernelContext_AddInput(kernel_context, ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT));
// Add parameter Y
CheckStatus(OrtApi->ExecutableKernelContext_AddOutput(kernel_context, ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT));

// Setup attributes
CheckStatus(OrtApi->ExecutableKernelContext_AddAttributeString(kernel_context, "auto_pad", "NOTSET"));
CheckStatus(OrtApi->ExecutableKernelContext_AddAttributeInt(kernel_context, "group", 1));
{
	// Setup attribute strides
	int64_t values[2];
	values[0] = 2;
	values[1] = 2;

	CheckStatus(OrtApi->ExecutableKernelContext_AddAttributeInts(kernel_context, "strides", values, 2));
}
// Create the executable kernel
OrtExecutableKernel* kernel;
CheckStatus(OrtApi->CreateExecutableKernel(__ort_session, kernel_context, /*provider_index=*/0, &kernel));

// Execute the kernel
CheckStatus(OrtApi->ExecutableKernel_SetInput(kernel, 0, ort_value_input_X));
CheckStatus(OrtApi->ExecutableKernel_SetInput(kernel, 1, ort_value_input_W));
CheckStatus(OrtApi->ExecutableKernel_SetOutput(kernel, 0, ort_value_output_Y));
CheckStatus(OrtApi->ExecutableKernel_Compute(kernel));

It has been tested with the CPU and CUDA execution providers.

The current design implements a new ExecutionFrame that is used to
execute the op. This is less than optimal; I will attempt to change this
in the future.

The API will also have to be extended to support providers other than CPU.
With this change, the ExecutableKernelContextImpl is initialized at kernel
creation, not at compute time, which should remove some overhead. This
allows multiple calls with different data to be made using the same kernel,
as sketched below.
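
For example, the kernel built above can be re-run on fresh data (a sketch; ort_value_input_X2, ort_value_input_W2, and ort_value_output_Y2 are hypothetical OrtValues prepared by the caller):

// Rebind new inputs/outputs on the same kernel and compute again;
// nothing is re-created between calls.
CheckStatus(OrtApi->ExecutableKernel_SetInput(kernel, 0, ort_value_input_X2));
CheckStatus(OrtApi->ExecutableKernel_SetInput(kernel, 1, ort_value_input_W2));
CheckStatus(OrtApi->ExecutableKernel_SetOutput(kernel, 0, ort_value_output_Y2));
CheckStatus(OrtApi->ExecutableKernel_Compute(kernel));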

Furthermore, the main graph of the different op kernels is now shared through
OrtKernelSession.
* Reuse providers across Kernels
* Support CUDA providers
@orausch requested a review from a team as a code owner · July 8, 2020 12:40
@ghost commented Jul 8, 2020

CLA assistant check
All CLA requirements met.

@Craigacp (Contributor) commented Jul 8, 2020

This looks interesting to expose into the Java API, but is there a reason why the input and output arguments are specified separately from the call to compute? In session.run they are supplied to that call, and I feel like that maps a little more naturally.

@orausch (Contributor, Author) commented Jul 8, 2020

> This looks interesting to expose into the Java API, but is there a reason why the input and output arguments are specified separately from the call to compute? In session.run they are supplied to that call, and I feel like that maps a little more naturally.

I'm not particularly tied to this exact API; if exposed to the Java or Python API, it could be implemented similarly to session.run. The current option was just easy to implement, and it has the performance advantage of not creating any intermediate data structures.
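
To illustrate what a session.run-style shape could look like (purely hypothetical; this function is not part of this PR):

// Hypothetical combined call, sketched for discussion: inputs and outputs
// are supplied to the compute call itself, as with session.run.
OrtValue* inputs[] = {ort_value_input_X, ort_value_input_W};
OrtValue* outputs[] = {ort_value_output_Y};
CheckStatus(OrtApi->ExecutableKernel_ComputeWithValues(  // hypothetical API
    kernel, inputs, /*num_inputs=*/2, outputs, /*num_outputs=*/1));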

@hariharans29 (Member) commented

/azp run Linux CPU CI Pipeline,Linux CPU x64 NoContribops CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,MacOS CI Pipeline,MacOS NoContribops CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline

@azure-pipelines commented

Pull request contains merge conflicts.

@hariharans29 (Member) commented

/azp run orttraining-linux-ci-pipeline,orttraining-mac-ci-pipeline,orttraining-linux-gpu-ci-pipeline,centos7_cpu,Linux OpenVINO CI Pipeline

@azure-pipelines commented

Pull request contains merge conflicts.

# Conflicts:
#	include/onnxruntime/core/session/onnxruntime_c_api.h
#	onnxruntime/core/session/onnxruntime_c_api.cc
#	onnxruntime/core/session/ort_apis.h
@faxu (Contributor) commented Aug 17, 2020

/azp run Windows GPU CI Pipeline, WIndows GPU TensorRT CI Pipeline, centos7_cpu, centos7_cpu (linux_centos_ci Debug), centos7_cpu (linux_centos_ci Release), orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline

@faxu (Contributor) commented Aug 17, 2020

/azp run Linux CPU CI Pipeline, Linux CPU x64 NoContribops CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux OpenVINO CI Pipeline, MacOS CI Pipeline, MacOS NoContribops CI Pipeline, Windows CPU CI Pipeline

@azure-pipelines commented

Azure Pipelines successfully started running 5 pipeline(s).

@azure-pipelines commented

Azure Pipelines successfully started running 8 pipeline(s).

Some CUDA ops static_cast the context to OpKernelContextInternal.
Switching to this required initializing a session state.
@codemzs (Member) commented Sep 21, 2020

Hi @jywu-msft / @orausch, either we get traction on this PR or we close it. Can you please drive this to closure? It has been outstanding for a while.

@orausch (Contributor, Author) commented Sep 22, 2020

Thanks for following up @codemzs. I think a good next step would be to get a review in from someone on the ORT team.

Let me know if there is any other way I can help drive this forward.

@codemzs (Member) commented Sep 22, 2020

@orausch I believe Pranav from the ORT team will be looking at this.

@EmergentOrder commented

> This looks interesting to expose into the Java API, but is there a reason why the input and output arguments are specified separately from the call to compute? In session.run they are supplied to that call, and I feel like that maps a little more naturally.

+1
This is exactly the kind of interface I have in ONNX-Scala, graph overhead included.
I recently switched to using the official ORT Java API there; I have encountered some issues, which I will report separately.
It would be really great to see this merged and then exposed in Java.

@EmergentOrder commented Nov 14, 2020

@pranavsharma @jywu-msft @RyanUnderhill Can one of you (or another ORT team member) review this? Looking forward to this getting in and then exposed in the Java API ( @Craigacp ). Thanks in advance.

@pranavsharma (Contributor) left a comment


Thanks for your contribution.

The number of APIs required to execute one kernel seems quite high. I wonder if we can reduce this number and simplify the interface. This needs some more thought.

@@ -69,6 +69,8 @@ class IExecutionFrame {

Status ReleaseMLValue(int ort_value_idx);

Status SetOrtValue(OrtValue &value, int ort_value_idx);
Review comment:

value can be passed by const-ref.
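
A sketch of the suggested signature:

// value is only read here (the frame can copy the handle),
// so a const reference is sufficient.
Status SetOrtValue(const OrtValue& value, int ort_value_idx);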

@@ -0,0 +1,161 @@
#pragma once
Review comment:

Needs license header.

@@ -0,0 +1,648 @@
// Licensed under the MIT License.
Review comment:

Full license header needed here.

/*model_functions=*/std::initializer_list<ONNX_NAMESPACE::FunctionProto>{},
/*logger=*/logging::LoggingManager::DefaultLogger());

KernelSessionImpl *session = new KernelSessionImpl(std::move(model));
Review comment:

Consider using unique_ptr to avoid the potential of a mem leak.
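
A sketch of that suggestion, using the onnxruntime::make_unique helper that appears elsewhere in this diff:

// Ownership is held by unique_ptr, so the allocation is freed automatically
// if a later step fails; call release() only when handing the raw pointer
// across the C API boundary.
std::unique_ptr<KernelSessionImpl> session =
    onnxruntime::make_unique<KernelSessionImpl>(std::move(model));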

ORT_API2_STATUS(CreateExecutableKernel,
_Inout_ OrtKernelSession* session,
_In_ OrtExecutableKernelContext* context,
size_t provider_id,
Review comment:

How will the user know which id corresponds to which provider?

ORT_ENFORCE(provider_id < session->provider_list.size(),
"provider_id (" + std::to_string(provider_id) + ")must be less than the provider list size (" + std::to_string(session->provider_list.size()) + ").");

SingleKernelExecutionFrame* frame;
Review comment:

Another potential for mem leak.


std::unique_ptr<NodeIndexInfo> node_index_info_;

std::vector<int> input_index_to_mlvalue_map_;
Review comment:

These don't look like maps.

std::vector<int> fetches_mlvalue_idxs_;
std::vector<OrtValue> fetches_;
std::vector<int> feed_mlvalue_idxs_;
std::vector<OrtValue> feeds_;
Review comment:

Why do we need to store the feeds and fetches?

}

// create the context info
std::unique_ptr<SingleKernelExecutionFrame::Info> info = onnxruntime::make_unique<SingleKernelExecutionFrame::Info>(
Review comment:

Do we need to expose this Info object outside? Looks like this will require an unnecessary heap allocation.

ORT_API_STATUS_IMPL(OrtApis::ExecutableKernel_Compute,
_Inout_ OrtExecutableKernel *kernel_) {
API_IMPL_BEGIN
SingleKernelExecutionFrame* kernel = reinterpret_cast<SingleKernelExecutionFrame*>(kernel_);
Review comment:

This is a bit confusing. The distinction between a kernel and a frame is lost here.

@EmergentOrder commented

Thanks @pranavsharma .
@orausch, could you take another pass at this?

@orausch (Contributor, Author) commented Mar 12, 2021

@EmergentOrder, this is still on my radar, but I'll likely only get around to it in April.

@stale (bot) commented Jun 11, 2021

This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

@stale bot added and then removed the stale label on Jun 11, 2021.
@EmergentOrder commented

Bumping to prevent auto-closure

@EmergentOrder commented

Checking in: @orausch, could you take another pass at this?

@orausch (Contributor, Author) commented Oct 20, 2021

Hey, I discussed this with @souptc, and it seems like the "most mergeable" way forward is to instead expose the ORT eager mode (the one that is already committed) as a C API.

While this will have more overhead than the solution proposed here, some caching of the constructed graph should hopefully bring latency down far enough to be useful for performance-oriented use cases.

This work will be done in new PRs, and is largely unrelated to the solution presented here.

@stale (bot) commented Apr 16, 2022

This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

Labels: api (issues related to all other APIs: C, C++, Python, etc.), stale (issues that have not been addressed in a while; categorized by a bot)
Projects: None yet

10 participants