Honor disable_synchronize_execution_providers for CUDA graph replay by tianleiwu · Pull Request #28686 · microsoft/onnxruntime

tianleiwu · 2026-05-27T06:06:18Z

Description

When using IO Binding with pre-allocated GPU buffers and disable_synchronize_execution_providers=1 in RunOptions, CUDA graph replay was the only remaining synchronization point that prevented fully async Session::Run(). This PR threads the sync flag through the ReplayGraph virtual so that CUDA graph replay respects the same run option.

Motivation

For latency-sensitive inference pipelines, users want to:

Bind inputs/outputs to fixed GPU memory (IO Binding)
Set a custom compute stream
Use CUDA graph capture for reduced kernel launch overhead
Run fully async — no host-side synchronization during Run()

Before this change, even with disable_synchronize_execution_providers=1, CUDA graph replay always called cudaStreamSynchronize after cudaGraphLaunch (hardcoded sync_status_flag=true). This forced a host-GPU sync on every replay, defeating the purpose of the async config.

Behavior Change

Configuration	Before	After
Default (`disable_synchronize_execution_providers` unset or `"0"`)	`cudaStreamSynchronize` after graph launch	Same — `cudaStreamSynchronize` after graph launch
`disable_synchronize_execution_providers = "1"`	`cudaStreamSynchronize` after graph launch (ignored the config)	No sync — `cudaGraphLaunch` returns immediately, fully async
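
The table above reduces to a single boolean decision. Here is a minimal Python sketch of that mapping, not the actual C++ in InferenceSession::RunImpl. The helper name `should_sync_after_replay` and the dict-based run config are illustrative assumptions, and treating any value other than "1" as "sync" is also an assumption, since the table only covers unset, "0", and "1".

```python
def should_sync_after_replay(run_config: dict) -> bool:
    """Return True when graph replay should block on the stream (hypothetical helper)."""
    # Unset or "0" keeps the blocking default; only "1" disables the sync.
    return run_config.get("disable_synchronize_execution_providers", "0") != "1"


print(should_sync_after_replay({}))
print(should_sync_after_replay({"disable_synchronize_execution_providers": "0"}))
print(should_sync_after_replay({"disable_synchronize_execution_providers": "1"}))
```

The first two calls correspond to the default row of the table, and the third to the async row.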

Key Changes

IExecutionProvider::ReplayGraph — Added bool sync = true parameter to the virtual method (backward-compatible default)
InferenceSession::RunImpl — Session-level graph replay path now reads disable_synchronize_execution_providers and passes sync=false when set
CUDAExecutionProvider::OnRunEnd — First-capture replay passes existing sync_stream flag (already derived from the run option)
CUDAExecutionProvider::ReplayGraph → PerThreadContext::ReplayGraph → CUDAGraphManager::Replay — sync flag threaded through the entire chain
Plugin CUDA EP — ReplayGraphImpl launches graph without sync; PluginExecutionProvider::ReplayGraph bridge calls Sync() only when sync=true
Other EPs (TensorRT, DML, JS, WebGPU, NV TensorRT RTX) — Signature updated for compilation; sync parameter accepted but unused (these EPs have their own sync semantics)
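
The toy Python model below illustrates how a defaulted sync flag can be threaded through the CUDA EP chain without breaking existing callers. The class names mirror the C++ types listed above, but the bodies are stand-ins that only record which CUDA calls would be made. None of this is the actual ONNX Runtime code.

```python
class CudaGraphManager:
    """Toy stand-in for CUDAGraphManager; records calls instead of touching CUDA."""

    def __init__(self):
        self.calls = []

    def replay(self, sync=True):
        self.calls.append("cudaGraphLaunch")
        if sync:
            self.calls.append("cudaStreamSynchronize")


class PerThreadContext:
    """Toy stand-in that forwards the flag to the graph manager."""

    def __init__(self, manager):
        self.manager = manager

    def replay_graph(self, sync=True):
        self.manager.replay(sync)


class CudaExecutionProvider:
    """Toy stand-in for the EP-level entry point."""

    def __init__(self, context):
        self.context = context

    def replay_graph(self, sync=True):
        # The default of True keeps callers that omit the argument on the blocking path.
        self.context.replay_graph(sync)


manager = CudaGraphManager()
ep = CudaExecutionProvider(PerThreadContext(manager))

ep.replay_graph()            # existing call site: launch, then stream sync
ep.replay_graph(sync=False)  # async call site: launch only
print(manager.calls)
```

Because every layer defaults to `sync=True`, only the call sites that explicitly pass `False` change behavior.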

Usage Example

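import numpy as np
import onnxruntime as ort
import torch

# Placeholder shape and tensor names; use the model's actual inputs and outputs.
shape = (1, 3, 224, 224)
input_tensor = torch.empty(shape, dtype=torch.float16, device="cuda")
output_tensor = torch.empty(shape, dtype=torch.float16, device="cuda")

# Run ORT on the current PyTorch stream and enable CUDA graph capture
stream_ptr = torch.cuda.current_stream().cuda_stream
providers = [("CUDAExecutionProvider", {"user_compute_stream": str(stream_ptr), "enable_cuda_graph": "1"})]
session = ort.InferenceSession("model.onnx", providers=providers)
io_binding = session.io_binding()

# Bind pre-allocated GPU buffers
io_binding.bind_input("input", "cuda", 0, np.float16, shape, input_tensor.data_ptr())
io_binding.bind_output("output", "cuda", 0, np.float16, shape, output_tensor.data_ptr())

# Fully async run: no host sync during Run()
run_options = ort.RunOptions()
run_options.add_run_config_entry("disable_synchronize_execution_providers", "1")
session.run_with_iobinding(io_binding, run_options)

# Sync only when consuming output
torch.cuda.current_stream().synchronize()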

Notes

The plugin CUDA EP uses cudaDeviceSynchronize (via Sync()) for the default sync path instead of a stream-level sync. This is because the C API OrtEp::ReplayGraph signature cannot be extended with a sync parameter without a versioned ABI change. The result is functionally correct: a device-wide sync is slightly broader than a stream sync, but the difference only matters on the default (blocking) path.
CUDA graph capture-end replay in OnRunEnd was already gated by sync_stream, which is derived from the same run option — no additional change needed there beyond passing it through.
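
The plugin EP note can be pictured with a small Python model. It assumes a plugin whose replay entry point takes no sync argument (mirroring the fixed C API signature), so the bridge has to apply the flag from the outside. `ToyPluginEp` and `ToyPluginBridge` are hypothetical names, and the recorded strings only indicate which CUDA call each step stands for.

```python
class ToyPluginEp:
    """Stand-in for a plugin EP behind the C API; replay takes no sync flag."""

    def __init__(self):
        self.calls = []

    def replay_graph_impl(self):
        # Always launches without synchronizing.
        self.calls.append("cudaGraphLaunch")

    def sync(self):
        # Device-wide sync, broader than a stream-level sync.
        self.calls.append("cudaDeviceSynchronize")


class ToyPluginBridge:
    """Stand-in for the bridge that adapts the sync flag to the fixed plugin API."""

    def __init__(self, ep):
        self.ep = ep

    def replay_graph(self, sync=True):
        self.ep.replay_graph_impl()
        if sync:
            self.ep.sync()


plugin = ToyPluginEp()
bridge = ToyPluginBridge(plugin)
bridge.replay_graph()            # default path: launch, then device-wide sync
bridge.replay_graph(sync=False)  # async path: launch only
print(plugin.calls)
```

Keeping the sync in the bridge leaves the plugin-facing signature untouched, at the cost of the broader device-wide sync on the blocking path.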

Testing

Build passes with CUDA 13.0
Existing CUDA graph tests continue to pass (default sync=true behavior unchanged)
Async behavior can be verified with nsys profiling: no cudaStreamSynchronize should appear between cudaGraphLaunch calls when the option is set
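
The nsys check described above can be automated once the CUDA API call names have been extracted from a profile into an ordered list. That export step is not shown here and is assumed. The helper `syncs_between_launches` is a hypothetical name, and the two traces below are hand-written illustrations, not real profiler output.

```python
def syncs_between_launches(api_calls):
    """Count cudaStreamSynchronize calls between the first and last cudaGraphLaunch."""
    launch_idx = [i for i, name in enumerate(api_calls) if name == "cudaGraphLaunch"]
    if len(launch_idx) < 2:
        return 0
    first, last = launch_idx[0], launch_idx[-1]
    return sum(1 for name in api_calls[first:last] if name == "cudaStreamSynchronize")


# Hand-written example traces (not real profiler output).
async_trace = ["cudaGraphLaunch", "cudaGraphLaunch", "cudaGraphLaunch", "cudaStreamSynchronize"]
blocking_trace = ["cudaGraphLaunch", "cudaStreamSynchronize", "cudaGraphLaunch", "cudaStreamSynchronize"]

print(syncs_between_launches(async_trace))
print(syncs_between_launches(blocking_trace))
```

A count of zero for a run with the option set matches the expected async behavior. The final sync after the last launch is not counted, because that is where the caller consumes the output.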

When disable_synchronize_execution_providers=1 is set in RunOptions, CUDA graph replay now skips cudaStreamSynchronize after cudaGraphLaunch, enabling fully async execution with IO Binding and pre-bound GPU buffers.

Previously, CUDA graph replay always called cudaStreamSynchronize regardless of the disable_synchronize_execution_providers setting. This was the only remaining synchronization point preventing fully async Run() with IO Binding + CUDA graph.

Changes:
- Add bool sync parameter (default true) to IExecutionProvider::ReplayGraph
- Thread the parameter through CUDAExecutionProvider and plugin CUDA EP
- Session-level graph replay reads the run option to determine sync
- OnRunEnd capture-end replay uses the existing sync_stream flag
- All other EP overrides updated for signature compatibility

Copilot

Pull request overview

Threads a sync flag through IExecutionProvider::ReplayGraph so that CUDA graph replay honors the existing disable_synchronize_execution_providers RunOption. Previously, even when the option was set, the session-level replay path always synchronized the CUDA stream after cudaGraphLaunch, defeating fully async IO-binding workflows.

Changes:

Added a backward-compatible bool sync = true parameter to IExecutionProvider::ReplayGraph and its overrides across CUDA, plugin CUDA, TensorRT, NV TensorRT RTX, DML, JS, and WebGPU EPs.
InferenceSession::RunImpl now reads disable_synchronize_execution_providers and passes the derived flag to ReplayGraph; CUDA EP also forwards sync_stream when replaying after first-capture in OnRunEnd.
Plugin EP bridge launches the OrtEp graph without sync and then calls Sync() only when sync=true (note: device-wide sync, since the C API OrtEp::ReplayGraph cannot be ABI-extended).

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
include/onnxruntime/core/framework/execution_provider.h	Adds `sync` parameter (default true) and doc to virtual `ReplayGraph`.
onnxruntime/core/session/inference_session.h	Forwards `sync` through cached-EP graph-replay helper.
onnxruntime/core/session/inference_session.cc	Derives `sync_graph_replay` from RunOptions and passes to `ReplayGraph`.
onnxruntime/core/providers/cuda/cuda_execution_provider.{h,cc}	Threads `sync` through CUDA EP and `PerThreadContext::ReplayGraph`; uses `sync_stream` in `OnRunEnd` first-capture replay.
onnxruntime/core/providers/cuda/plugin/cuda_ep.cc	`ReplayGraphImpl` always launches without sync; bridge handles sync.
onnxruntime/core/session/plugin_ep/ep_plugin_provider_interfaces.{h,cc}	Plugin bridge: launches via C API then calls `Sync()` when `sync=true`.
onnxruntime/core/providers/{tensorrt,nv_tensorrt_rtx,dml,js,webgpu}/...	Signature updates; sync parameter accepted but unused.

tianleiwu mentioned this pull request May 27, 2026

[Feature Request] Allow IO binding on run async #28539

Open

tianleiwu requested review from Copilot and yuslepukhin May 27, 2026 06:14

tianleiwu requested review from edgchen1 and hariharans29 May 27, 2026 23:00

tianleiwu wants to merge 1 commit into main from tlwu/async-cuda-graph-replay
