Honor disable_synchronize_execution_providers for CUDA graph replay #28686

Open
tianleiwu wants to merge 1 commit into main from tlwu/async-cuda-graph-replay

@tianleiwu
Contributor

Description

When using IO Binding with pre-allocated GPU buffers and disable_synchronize_execution_providers=1 in RunOptions, CUDA graph replay was the only remaining synchronization point that prevented fully async Session::Run(). This PR threads the sync flag through the ReplayGraph virtual so that CUDA graph replay respects the same run option.

Motivation

For latency-sensitive inference pipelines, users want to:

  1. Bind inputs/outputs to fixed GPU memory (IO Binding)
  2. Set a custom compute stream
  3. Use CUDA graph capture for reduced kernel launch overhead
  4. Run fully async — no host-side synchronization during Run()

Before this change, even with disable_synchronize_execution_providers=1, CUDA graph replay always called cudaStreamSynchronize after cudaGraphLaunch (hardcoded sync_status_flag=true). This forced a host-GPU sync on every replay, defeating the purpose of the async config.

Behavior Change

| Configuration | Before | After |
|---|---|---|
| Default (disable_synchronize_execution_providers unset or "0") | cudaStreamSynchronize after graph launch | Same: cudaStreamSynchronize after graph launch |
| disable_synchronize_execution_providers = "1" | cudaStreamSynchronize after graph launch (ignored the config) | No sync: cudaGraphLaunch returns immediately, fully async |
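
The table reduces to one decision on the run-config value. A minimal Python sketch of that decision follows, assuming the run config is available as a plain dict of strings. The helper name should_sync_graph_replay is hypothetical and is not an ONNX Runtime API. The table only covers unset, "0", and "1"; treating every other value as the blocking default is an assumption of this sketch.

```python
# Hypothetical model of the decision in the table above; not ONNX Runtime code.
DISABLE_SYNC_KEY = "disable_synchronize_execution_providers"


def should_sync_graph_replay(run_config: dict) -> bool:
    """Return True when graph replay should block on the stream after launch."""
    # Unset or "0" keeps the blocking default; only "1" turns the sync off.
    # Assumption: any other value is treated like the default.
    return run_config.get(DISABLE_SYNC_KEY, "0") != "1"
```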

Key Changes

  • IExecutionProvider::ReplayGraph: added a bool sync = true parameter to the virtual method (backward-compatible default)
  • InferenceSession::RunImpl: the session-level graph replay path now reads disable_synchronize_execution_providers and passes sync=false when it is set
  • CUDAExecutionProvider::OnRunEnd: the first-capture replay passes the existing sync_stream flag (already derived from the run option)
  • CUDAExecutionProvider::ReplayGraph → PerThreadContext::ReplayGraph → CUDAGraphManager::Replay: sync flag threaded through the entire chain
  • Plugin CUDA EP: ReplayGraphImpl launches the graph without sync; the PluginExecutionProvider::ReplayGraph bridge calls Sync() only when sync=true
  • Other EPs (TensorRT, DML, JS, WebGPU, NV TensorRT RTX): signature updated for compilation; sync parameter accepted but unused (these EPs have their own sync semantics)
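
To make the threading of the flag concrete, here is a small Python model of the call chain described above. It is only a sketch of the control flow, not the C++ implementation: the classes mirror the names in the list, FakeStream is a hypothetical stand-in that records calls instead of touching a GPU, and run_with_graph_replay is an invented name for the session-level path.

```python
class FakeStream:
    """Hypothetical stand-in for a CUDA stream; records calls in order."""

    def __init__(self):
        self.calls = []

    def graph_launch(self):
        self.calls.append("cudaGraphLaunch")

    def synchronize(self):
        self.calls.append("cudaStreamSynchronize")


class GraphManager:
    """Models CUDAGraphManager::Replay: launch, then sync only if asked."""

    def __init__(self, stream):
        self.stream = stream

    def replay(self, sync=True):
        self.stream.graph_launch()
        if sync:
            self.stream.synchronize()


class PerThreadContext:
    """Models PerThreadContext::ReplayGraph: forwards the flag unchanged."""

    def __init__(self, manager):
        self.manager = manager

    def replay_graph(self, sync=True):
        self.manager.replay(sync)


class ExecutionProvider:
    """Models the EP-level ReplayGraph; sync=True keeps old callers blocking."""

    def __init__(self, context):
        self.context = context

    def replay_graph(self, sync=True):
        self.context.replay_graph(sync)


def run_with_graph_replay(ep, run_config):
    """Models the session-level replay path reading the run option."""
    key = "disable_synchronize_execution_providers"
    sync = run_config.get(key, "0") != "1"
    ep.replay_graph(sync=sync)
```

A caller that never passes the flag still gets the blocking behavior, which is the point of the default value.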

Usage Example

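import numpy as np
import onnxruntime as ort
import torch

# Caller-supplied placeholders: shape, input_ptr, and output_ptr describe
# pre-allocated GPU buffers; stream_ptr is the raw handle of the CUDA stream
# to run on, e.g. torch.cuda.current_stream().cuda_stream.
providers = [("CUDAExecutionProvider", {
    "user_compute_stream": str(stream_ptr),  # run on the caller's stream
    "enable_cuda_graph": "1",                # capture once, replay afterwards
})]
session = ort.InferenceSession("model.onnx", providers=providers)
io_binding = session.io_binding()

# Bind pre-allocated GPU buffers
io_binding.bind_input("input", "cuda", 0, np.float16, shape, input_ptr)
io_binding.bind_output("output", "cuda", 0, np.float16, shape, output_ptr)

# Fully async run — no host sync during Run()
run_options = ort.RunOptions()
run_options.add_run_config_entry("disable_synchronize_execution_providers", "1")
session.run_with_iobinding(io_binding, run_options)

# Sync only when consuming output (assumes stream_ptr is torch's current stream)
torch.cuda.current_stream().synchronize()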

Notes

  • The plugin CUDA EP uses cudaDeviceSynchronize (via Sync()) for the default sync path instead of stream-level sync. This is because the C API OrtEp::ReplayGraph signature cannot be extended with a sync parameter without a versioned ABI change. Functionally correct; slightly broader than stream sync but only matters on the default (blocking) path.
  • CUDA graph capture-end replay in OnRunEnd was already gated by sync_stream, which is derived from the same run option — no additional change needed there beyond passing it through.
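
The plugin bridge behavior in the first note can be modeled the same way. This is a hedged Python sketch, not the bridge code: FakePluginEp and bridge_replay_graph are hypothetical names, and the recorded strings simply label which CUDA call the note says each step maps to.

```python
class FakePluginEp:
    """Hypothetical stand-in for the plugin CUDA EP; records calls in order."""

    def __init__(self):
        self.calls = []

    def replay_graph_impl(self):
        # Per the note, the plugin-side replay only launches the graph.
        self.calls.append("cudaGraphLaunch")

    def sync(self):
        # Per the note, Sync() is device-wide rather than stream-level.
        self.calls.append("cudaDeviceSynchronize")


def bridge_replay_graph(ep, sync=True):
    """Models the bridge: launch via the plugin, then sync only when asked."""
    ep.replay_graph_impl()
    if sync:
        ep.sync()
```

With sync=False the bridge records a launch and nothing else, so the broader device-wide sync is only paid on the default blocking path.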

Testing

  • Build passes with CUDA 13.0
  • Existing CUDA graph tests continue to pass (default sync=true behavior unchanged)
  • Async behavior can be verified with nsys profiling: no cudaStreamSynchronize should appear between cudaGraphLaunch calls when the option is set
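
The nsys check in the last bullet can be automated once the CUDA API call names have been extracted from a trace in order. The sketch below assumes that extraction has already been done by some other means and only implements the check itself; the function name is hypothetical, and the input should cover steady-state replay runs only, since the capture run may legitimately synchronize.

```python
def syncs_between_graph_launches(api_calls):
    """Return indices of cudaStreamSynchronize calls between graph launches.

    api_calls is an ordered list of CUDA API call names from a trace.
    A sync after the final launch is not reported, because the caller is
    expected to synchronize before consuming the output.
    """
    launches = [i for i, name in enumerate(api_calls) if name == "cudaGraphLaunch"]
    if len(launches) < 2:
        return []
    first, last = launches[0], launches[-1]
    return [
        i
        for i in range(first + 1, last)
        if api_calls[i] == "cudaStreamSynchronize"
    ]
```

An empty result for a trace with several replays is consistent with the async behavior described above; any returned index points at a sync that still sits between two launches.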

When disable_synchronize_execution_providers=1 is set in RunOptions,
CUDA graph replay now skips cudaStreamSynchronize after cudaGraphLaunch,
enabling fully async execution with IO Binding and pre-bound GPU buffers.

Previously, CUDA graph replay always called cudaStreamSynchronize
regardless of the disable_synchronize_execution_providers setting.
This was the only remaining synchronization point preventing fully
async Run() with IO Binding + CUDA graph.

Changes:
- Add bool sync parameter (default true) to IExecutionProvider::ReplayGraph
- Thread the parameter through CUDAExecutionProvider and plugin CUDA EP
- Session-level graph replay reads the run option to determine sync
- OnRunEnd capture-end replay uses the existing sync_stream flag
- All other EP overrides updated for signature compatibility
Contributor

Copilot AI left a comment

Pull request overview

Threads a sync flag through IExecutionProvider::ReplayGraph so that CUDA graph replay honors the existing disable_synchronize_execution_providers RunOption. Previously, even when the option was set, the session-level replay path always synchronized the CUDA stream after cudaGraphLaunch, defeating fully async IO-binding workflows.

Changes:

  • Added a backward-compatible bool sync = true parameter to IExecutionProvider::ReplayGraph and its overrides across CUDA, plugin CUDA, TensorRT, NV TensorRT RTX, DML, JS, and WebGPU EPs.
  • InferenceSession::RunImpl now reads disable_synchronize_execution_providers and passes the derived flag to ReplayGraph; CUDA EP also forwards sync_stream when replaying after first-capture in OnRunEnd.
  • Plugin EP bridge launches the OrtEp graph without sync and then calls Sync() only when sync=true (note: device-wide sync, since the C API OrtEp::ReplayGraph cannot be ABI-extended).

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated no comments.

| File | Description |
|---|---|
| include/onnxruntime/core/framework/execution_provider.h | Adds sync parameter (default true) and doc to virtual ReplayGraph. |
| onnxruntime/core/session/inference_session.h | Forwards sync through cached-EP graph-replay helper. |
| onnxruntime/core/session/inference_session.cc | Derives sync_graph_replay from RunOptions and passes to ReplayGraph. |
| onnxruntime/core/providers/cuda/cuda_execution_provider.{h,cc} | Threads sync through CUDA EP and PerThreadContext::ReplayGraph; uses sync_stream in OnRunEnd first-capture replay. |
| onnxruntime/core/providers/cuda/plugin/cuda_ep.cc | ReplayGraphImpl always launches without sync; bridge handles sync. |
| onnxruntime/core/session/plugin_ep/ep_plugin_provider_interfaces.{h,cc} | Plugin bridge: launches via C API then calls Sync() when sync=true. |
| onnxruntime/core/providers/{tensorrt,nv_tensorrt_rtx,dml,js,webgpu}/... | Signature updates; sync parameter accepted but unused. |
