Honor disable_synchronize_execution_providers for CUDA graph replay #28686
Open
tianleiwu wants to merge 1 commit into
When disable_synchronize_execution_providers=1 is set in RunOptions, CUDA graph replay now skips cudaStreamSynchronize after cudaGraphLaunch, enabling fully async execution with IO Binding and pre-bound GPU buffers.
Previously, CUDA graph replay always called cudaStreamSynchronize regardless of the disable_synchronize_execution_providers setting. This was the only remaining synchronization point preventing fully async Run() with IO Binding + CUDA graph.
Changes:
- Add bool sync parameter (default true) to IExecutionProvider::ReplayGraph
- Thread the parameter through CUDAExecutionProvider and plugin CUDA EP
- Session-level graph replay reads the run option to determine sync
- OnRunEnd capture-end replay uses the existing sync_stream flag
- All other EP overrides updated for signature compatibility
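To illustrate why a defaulted trailing parameter keeps existing call sites working, here is a minimal, self-contained sketch. The types MockExecutionProvider, MockCudaEp, and MockOtherEp and the log member are hypothetical stand-ins invented for illustration; they are not the real ONNX Runtime classes, and the real method returns a status rather than void.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for IExecutionProvider; not the real ONNX Runtime class.
struct MockExecutionProvider {
  virtual ~MockExecutionProvider() = default;

  // The new trailing parameter defaults to true, so callers that pass only
  // the annotation id keep the old blocking behavior.
  virtual void ReplayGraph(int graph_annotation_id, bool sync = true) {
    (void)graph_annotation_id;
    (void)sync;
  }

  std::vector<std::string> log;  // records what each replay did
};

// Models an EP that honors the flag (the role the CUDA EP plays in this PR).
struct MockCudaEp : MockExecutionProvider {
  void ReplayGraph(int graph_annotation_id, bool sync = true) override {
    log.push_back("launch:" + std::to_string(graph_annotation_id));
    if (sync) log.push_back("stream_sync");
  }
};

// Models an EP that accepts the flag only for signature compatibility.
struct MockOtherEp : MockExecutionProvider {
  void ReplayGraph(int graph_annotation_id, bool /*sync*/ = true) override {
    log.push_back("replay:" + std::to_string(graph_annotation_id));
  }
};
```

C++ resolves default arguments from the static type of the call expression, which is why the sketch repeats `= true` on each override. A call such as `ep.ReplayGraph(0)` still synchronizes, while `ep.ReplayGraph(0, false)` only launches.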
Pull request overview
Threads a sync flag through IExecutionProvider::ReplayGraph so that CUDA graph replay honors the existing disable_synchronize_execution_providers RunOption. Previously, even when the option was set, the session-level replay path always synchronized the CUDA stream after cudaGraphLaunch, defeating fully async IO-binding workflows.
Changes:
- Added a backward-compatible bool sync = true parameter to IExecutionProvider::ReplayGraph and its overrides across CUDA, plugin CUDA, TensorRT, NV TensorRT RTX, DML, JS, and WebGPU EPs.
- InferenceSession::RunImpl now reads disable_synchronize_execution_providers and passes the derived flag to ReplayGraph; CUDA EP also forwards sync_stream when replaying after first-capture in OnRunEnd.
- Plugin EP bridge launches the OrtEp graph without sync and then calls Sync() only when sync=true (note: device-wide sync, since the C API OrtEp::ReplayGraph cannot be ABI-extended).
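The plugin bridge behavior described in the last bullet can be modeled roughly as follows. This is a sketch under stated assumptions: MockDevice, ReplayGraphFn, PluginReplayGraphImpl, and MockPluginBridge are hypothetical names, the counters stand in for real CUDA calls, and the integer status is a simplification of the real status type.

```cpp
#include <cassert>

// Hypothetical device model that only counts calls; not a real CUDA type.
struct MockDevice {
  int graph_launches = 0;
  int device_syncs = 0;
};

// Stand-in for a fixed C ABI entry point. It has no sync parameter, so the
// plugin side can only launch the graph and return.
using ReplayGraphFn = int (*)(MockDevice* device, int graph_annotation_id);

int PluginReplayGraphImpl(MockDevice* device, int /*graph_annotation_id*/) {
  ++device->graph_launches;  // launch only, no synchronization
  return 0;                  // 0 means success in this sketch
}

// Stand-in for the bridge that adapts the C entry point to the C++ virtual.
struct MockPluginBridge {
  MockDevice* device;
  ReplayGraphFn replay_graph;

  // Models a device-wide synchronization, broader than a single stream.
  void Sync() { ++device->device_syncs; }

  int ReplayGraph(int graph_annotation_id, bool sync = true) {
    const int status = replay_graph(device, graph_annotation_id);
    if (status != 0) return status;  // propagate launch failure unchanged
    if (sync) Sync();                // blocking path only
    return 0;
  }
};
```

Keeping the synchronization in the bridge means the C entry point stays untouched, at the cost of a broader sync on the blocking path.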
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| include/onnxruntime/core/framework/execution_provider.h | Adds sync parameter (default true) and doc to virtual ReplayGraph. |
| onnxruntime/core/session/inference_session.h | Forwards sync through cached-EP graph-replay helper. |
| onnxruntime/core/session/inference_session.cc | Derives sync_graph_replay from RunOptions and passes to ReplayGraph. |
| onnxruntime/core/providers/cuda/cuda_execution_provider.{h,cc} | Threads sync through CUDA EP and PerThreadContext::ReplayGraph; uses sync_stream in OnRunEnd first-capture replay. |
| onnxruntime/core/providers/cuda/plugin/cuda_ep.cc | ReplayGraphImpl always launches without sync; bridge handles sync. |
| onnxruntime/core/session/plugin_ep/ep_plugin_provider_interfaces.{h,cc} | Plugin bridge: launches via C API then calls Sync() when sync=true. |
| onnxruntime/core/providers/{tensorrt,nv_tensorrt_rtx,dml,js,webgpu}/... | Signature updates; sync parameter accepted but unused. |
Description
When using IO Binding with pre-allocated GPU buffers and disable_synchronize_execution_providers=1 in RunOptions, CUDA graph replay was the only remaining synchronization point that prevented fully async Session::Run(). This PR threads the sync flag through the ReplayGraph virtual so that CUDA graph replay respects the same run option.
Motivation
For latency-sensitive inference pipelines, users want Run() to be fully async when using IO Binding with pre-bound GPU buffers.
Before this change, even with disable_synchronize_execution_providers=1, CUDA graph replay always called cudaStreamSynchronize after cudaGraphLaunch (hardcoded sync_status_flag=true). This forced a host-GPU sync on every replay, defeating the purpose of the async config.
Behavior Change
| Config | Before | After |
|---|---|---|
| Default (disable_synchronize_execution_providers unset or "0") | cudaStreamSynchronize after graph launch | cudaStreamSynchronize after graph launch |
| disable_synchronize_execution_providers = "1" | cudaStreamSynchronize after graph launch (ignored the config) | cudaGraphLaunch returns immediately, fully async |
Key Changes
- IExecutionProvider::ReplayGraph: Added bool sync = true parameter to the virtual method (backward-compatible default)
- InferenceSession::RunImpl: Session-level graph replay path now reads disable_synchronize_execution_providers and passes sync=false when set
- CUDAExecutionProvider::OnRunEnd: First-capture replay passes existing sync_stream flag (already derived from the run option)
- CUDAExecutionProvider::ReplayGraph → PerThreadContext::ReplayGraph → CUDAGraphManager::Replay: sync flag threaded through the entire chain
- Plugin CUDA EP: ReplayGraphImpl launches graph without sync; PluginExecutionProvider::ReplayGraph bridge calls Sync() only when sync=true
- Other EPs (TensorRT, NV TensorRT RTX, DML, JS, WebGPU): sync parameter accepted but unused (these EPs have their own sync semantics)
Usage Example
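A minimal, self-contained model of the call pattern, assuming a mock session rather than the real ONNX Runtime API. MockRunOptions, MockStream, ShouldSyncGraphReplay, ReplayGraph, and Run are hypothetical names; only the config key string comes from this PR. In the real C++ API the entry would be set with Ort::RunOptions::AddConfigEntry. The sketch treats only the value "1" as disabling sync, which is an assumption about parsing, not a statement of the actual implementation.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for run option config entries.
struct MockRunOptions {
  std::unordered_map<std::string, std::string> config;
  void AddConfigEntry(const std::string& key, const std::string& value) {
    config[key] = value;
  }
};

// Hypothetical stream model that counts launches and host-side syncs.
struct MockStream {
  int launches = 0;
  int syncs = 0;
};

// Assumption: sync stays on unless the option is explicitly "1".
bool ShouldSyncGraphReplay(const MockRunOptions& options) {
  const auto it = options.config.find("disable_synchronize_execution_providers");
  return it == options.config.end() || it->second != "1";
}

// Models graph replay: always launch, synchronize only when asked.
void ReplayGraph(MockStream& stream, bool sync = true) {
  ++stream.launches;
  if (sync) ++stream.syncs;
}

// Models the session-level replay path deriving the flag from run options.
void Run(MockStream& stream, const MockRunOptions& options) {
  ReplayGraph(stream, ShouldSyncGraphReplay(options));
}
```

With the option set to "1", Run() returns after the launch, so the caller is responsible for synchronizing the stream before reading the bound output buffers.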
Notes
- The plugin EP bridge uses cudaDeviceSynchronize (via Sync()) for the default sync path instead of stream-level sync. This is because the C API OrtEp::ReplayGraph signature cannot be extended with a sync parameter without a versioned ABI change. Functionally correct; slightly broader than stream sync but only matters on the default (blocking) path.
- The first-capture replay in OnRunEnd was already gated by sync_stream, which is derived from the same run option, so no additional change was needed there beyond passing it through.
Testing
- No cudaStreamSynchronize should appear between cudaGraphLaunch calls when the option is set