Add per-session thread pool work callbacks API #27253

Merged
sdotpeng merged 20 commits into microsoft:main from sdotpeng:sdotpeng/ThreadPoolCallbacks on Mar 30, 2026

@sdotpeng (Contributor) commented Feb 5, 2026

Description

Adds per-session thread pool work callbacks, allowing callers to hook into the enqueue/start/stop/abandon lifecycle of thread pool work items. The feature is gated behind a build flag (--enable_session_threadpool_callbacks) with zero overhead when disabled.

API additions

  • C API: OrtApi::SetPerSessionThreadPoolCallbacks — stores an OrtThreadPoolCallbacksConfig on the OrtEnv, applied to per-session thread pools
  • C++ wrapper: Ort::Env::SetPerSessionThreadPoolCallbacks
  • Versioned C config struct OrtThreadPoolCallbacksConfig with fields: on_enqueue, on_start_work, on_stop_work, on_abandon, user_context
  • Four callback typedefs: OrtThreadPoolWorkEnqueueFn, OrtThreadPoolWorkStartFn, OrtThreadPoolWorkStopFn, OrtThreadPoolWorkAbandonFn
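To illustrate how the four lifecycle hooks relate to one another, here is a minimal single-threaded sketch. The thread names the config fields but does not show the callback signatures, so every type below (`DemoCallbacksConfig`, `DemoQueue`, the `Demo*Fn` aliases) is hypothetical and the signatures are assumptions, not the ONNX Runtime API. It also assumes that state returned from the enqueue hook is handed back to the later hooks, and that either the stop hook or the abandon hook is the place to release it.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>

// Assumed signatures (not taken from the PR): enqueue returns opaque per-item state.
using DemoEnqueueFn = void* (*)(void* user_context);
using DemoStartFn = void (*)(void* user_context, void* item_state);
using DemoStopFn = void (*)(void* user_context, void* item_state);
using DemoAbandonFn = void (*)(void* user_context, void* item_state);

// Hypothetical stand-in mirroring the field names listed in the PR description.
struct DemoCallbacksConfig {
  DemoEnqueueFn on_enqueue = nullptr;
  DemoStartFn on_start_work = nullptr;
  DemoStopFn on_stop_work = nullptr;
  DemoAbandonFn on_abandon = nullptr;
  void* user_context = nullptr;
};

struct Counters { int enqueued = 0, started = 0, stopped = 0, abandoned = 0, live_states = 0; };

// Toy queue: runs work on the calling thread and abandons leftovers on shutdown.
class DemoQueue {
 public:
  explicit DemoQueue(const DemoCallbacksConfig& cfg) : cfg_(cfg) {}

  void Enqueue(std::function<void()> task) {
    void* state = cfg_.on_enqueue ? cfg_.on_enqueue(cfg_.user_context) : nullptr;
    pending_.push_back(Item{std::move(task), state});
  }

  bool RunOne() {
    if (pending_.empty()) return false;
    Item item = std::move(pending_.front());
    pending_.pop_front();
    // This demo passes the state through unchanged, even when it is null.
    if (cfg_.on_start_work) cfg_.on_start_work(cfg_.user_context, item.state);
    item.task();
    if (cfg_.on_stop_work) cfg_.on_stop_work(cfg_.user_context, item.state);
    return true;
  }

  void Shutdown() {
    // Work that never ran is reported through on_abandon so its state can be freed.
    for (Item& item : pending_) {
      if (cfg_.on_abandon) cfg_.on_abandon(cfg_.user_context, item.state);
    }
    pending_.clear();
  }

 private:
  struct Item {
    std::function<void()> task;
    void* state;
  };
  DemoCallbacksConfig cfg_;
  std::deque<Item> pending_;
};

// Enqueues three items, runs two, and abandons the third at shutdown.
inline Counters RunLifecycleDemo() {
  Counters counters;
  DemoCallbacksConfig cfg;
  cfg.user_context = &counters;
  cfg.on_enqueue = [](void* ctx) -> void* {
    auto* c = static_cast<Counters*>(ctx);
    ++c->enqueued;
    ++c->live_states;
    return new int(c->enqueued);  // per-item state owned by the callbacks
  };
  cfg.on_start_work = [](void* ctx, void*) { ++static_cast<Counters*>(ctx)->started; };
  cfg.on_stop_work = [](void* ctx, void* state) {
    auto* c = static_cast<Counters*>(ctx);
    ++c->stopped;
    --c->live_states;
    delete static_cast<int*>(state);
  };
  cfg.on_abandon = [](void* ctx, void* state) {
    auto* c = static_cast<Counters*>(ctx);
    ++c->abandoned;
    --c->live_states;
    delete static_cast<int*>(state);
  };

  DemoQueue queue(cfg);
  for (int i = 0; i < 3; ++i) queue.Enqueue([] {});
  queue.RunOne();
  queue.RunOne();
  queue.Shutdown();
  return counters;
}
```

Calling `RunLifecycleDemo()` from a `main` function exercises all four hooks. The `live_states` counter is there to show that state allocated at enqueue time is released on both the run path and the abandon path in this sketch.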

Implementation

  • EigenNonBlockingThreadPool.h: Introduced a policy-based design with two compile-time callback policies:
    • WorkNoCallbackPolicy: Work = std::function<void()>, all callback methods are trivial inlines eliminated by the compiler. Zero overhead for non-callback builds.
    • WorkWithCallbackPolicy: Work = WorkItem bundling tasks with callback data; invokes user callbacks around task execution via MakeWork/Execute/OnEnqueue/OnAbandon methods.
    • ThreadPoolTempl<Environment, CallbackPolicy> uses the policy for all callback-related operations.
    • RunQueue::RevokeWithTag calls policy_->OnAbandon(e.w) on successful revocation; the policy implementation decides whether to invoke user callbacks.
  • threadpool.h: extended_eigen_threadpool_ changed to unique_ptr<ExtendedThreadPoolInterface> for type erasure across policy instantiations. EnableSpinning/DisableSpinning added to the virtual interface.
  • threadpool.cc: Single #ifdef selects policy at ThreadPoolTempl instantiation.
  • environment.h/.cc: Added SetPerSessionWorkCallbacks/GetPerSessionWorkCallbacks on Environment.
  • inference_session.cc: Propagates callbacks from Environment to per-session thread pool options.
  • thread_utils.h/.cc: Added callback fields to OrtThreadPoolParams and wiring in CreateThreadPoolHelper.
  • env.h: OrtThreadPoolCallbacksConfig* pointer in ThreadOptions.
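The policy-based design described above can be sketched roughly as follows. This is not the code from EigenNonBlockingThreadPool.h: `DemoPool`, `DemoNoCallbackPolicy`, `DemoWithCallbackPolicy`, `DemoTrace`, and the `DEMO_ENABLE_CALLBACKS` macro are hypothetical names, and the pool runs work on the calling thread. It only illustrates the idea that the pool template delegates every callback-related step to a policy type, so the no-callback policy reduces to empty inline functions.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <type_traits>
#include <utility>
#include <vector>

// Policy 1: Work is a plain std::function and every hook is an empty inline.
struct DemoNoCallbackPolicy {
  using Work = std::function<void()>;
  Work MakeWork(std::function<void()> fn) const { return fn; }
  void OnEnqueue(Work&) const {}
  void Execute(Work& w) const { w(); }
  void OnAbandon(Work&) const {}
};

struct DemoTrace { std::vector<std::string> events; };

// Policy 2: Work bundles the task, and hooks record events around execution.
struct DemoWithCallbackPolicy {
  struct Work { std::function<void()> fn; };
  explicit DemoWithCallbackPolicy(DemoTrace* t) : trace(t) {}
  Work MakeWork(std::function<void()> fn) const { return Work{std::move(fn)}; }
  void OnEnqueue(Work&) const { trace->events.push_back("enqueue"); }
  void Execute(Work& w) const {
    trace->events.push_back("start");
    w.fn();
    trace->events.push_back("stop");
  }
  void OnAbandon(Work&) const { trace->events.push_back("abandon"); }
  DemoTrace* trace;
};

// The pool only talks to the policy; it never checks whether callbacks exist.
template <typename Policy>
class DemoPool {
 public:
  explicit DemoPool(Policy policy) : policy_(std::move(policy)) {}

  void Schedule(std::function<void()> fn) {
    typename Policy::Work w = policy_.MakeWork(std::move(fn));
    policy_.OnEnqueue(w);
    queue_.push_back(std::move(w));
  }
  void RunAll() {
    for (auto& w : queue_) policy_.Execute(w);
    queue_.clear();
  }
  void AbandonAll() {
    for (auto& w : queue_) policy_.OnAbandon(w);
    queue_.clear();
  }

 private:
  Policy policy_;
  std::vector<typename Policy::Work> queue_;
};

// A single preprocessor switch picks the policy at instantiation time.
#ifdef DEMO_ENABLE_CALLBACKS
using DemoSelectedPolicy = DemoWithCallbackPolicy;
#else
using DemoSelectedPolicy = DemoNoCallbackPolicy;
#endif

static_assert(std::is_same<DemoNoCallbackPolicy::Work, std::function<void()>>::value,
              "the no-callback policy stores bare std::function work items");

// Runs one task to completion and abandons a second one, returning the trace.
inline std::vector<std::string> RunPolicyDemo() {
  DemoTrace trace;
  DemoPool<DemoWithCallbackPolicy> pool{DemoWithCallbackPolicy{&trace}};
  pool.Schedule([&trace] { trace.events.push_back("task"); });
  pool.RunAll();
  pool.Schedule([] {});
  pool.AbandonAll();
  return trace.events;
}
```

Because the policy is a template parameter, the two instantiations are distinct types. That is consistent with the description's note that threadpool.h holds the pool through a `unique_ptr` to a common interface for type erasure.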

Build

  • CMake option onnxruntime_ENABLE_SESSION_THREADPOOL_CALLBACKS; build.py argument --enable_session_threadpool_callbacks

Tests

  • 8 callback-specific tests: Schedule, OnEnqueueOnly, NoCallbacks, ParallelFor, ParallelSection, Abandon, EnqueueReturnsNull, NoEnqueueWithStartStop
  • End-to-end C API test (SetPerSessionThreadPoolCallbacks via ModelBuilder with 1M-element Mul)
  • All 73 existing ThreadPool tests pass unchanged in both builds: 81/81 with callbacks enabled (73 existing plus the 8 new tests) and 73/73 with callbacks disabled

Motivation and Context

Thread pool work callbacks enable telemetry, tracing, and resource management by providing visibility into when work is enqueued, executed, and abandoned in per-session thread pools. This is needed for production diagnostics and performance instrumentation scenarios.

@sdotpeng sdotpeng marked this pull request as draft February 5, 2026 10:45
@sdotpeng (Contributor, Author) commented Feb 5, 2026

@microsoft-github-policy-service agree

@chwarr (Member) left a comment

Looks pretty good. I see some issues around on_enqueue returning NULL.

@chwarr (Member) left a comment

Also consider what happens to any callback state allocated in on_enqueue when the thread pool is shut down and the work items do not run.

@sdotpeng sdotpeng force-pushed the sdotpeng/ThreadPoolCallbacks branch from d1534de to 4cdf234 Compare February 25, 2026 18:30
@sdotpeng sdotpeng requested a review from chwarr February 25, 2026 18:32
Copilot AI (Contributor) left a comment

Pull request overview

This PR adds a new opt-in C API (SetDefaultThreadPoolCallbacks) and corresponding C++ wrapper (Ort::Env::SetDefaultThreadPoolCallbacks) to register lifecycle callbacks for per-session thread pool work items. When enabled via the --session_threadpool_callbacks build flag, callbacks can observe when work is enqueued, started, stopped, or abandoned in per-session thread pools, enabling profiling, tracing, and custom scheduling instrumentation.

Changes:

  • New CMake option onnxruntime_SESSION_THREADPOOL_CALLBACKS and build script argument --session_threadpool_callbacks to opt into the feature
  • New callback types (OrtThreadPoolWorkEnqueueFn, etc.) in the C API header, with SetDefaultThreadPoolCallbacks added to the OrtApi struct (v1.25)
  • Thread pool implementation updated: introduces a WorkItem wrapper type bundling task + callback data, with InvokeOnEnqueue/InvokeWorkItem/InvokeOnAbandon helpers and revocation propagation

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| include/onnxruntime/core/session/onnxruntime_c_api.h | Defines new callback typedefs and adds SetDefaultThreadPoolCallbacks to OrtApi v1.25 |
| include/onnxruntime/core/session/onnxruntime_cxx_api.h | Adds C++ Env::SetDefaultThreadPoolCallbacks declaration |
| include/onnxruntime/core/session/onnxruntime_cxx_inline.h | Implements the C++ Env::SetDefaultThreadPoolCallbacks wrapper |
| include/onnxruntime/core/session/environment.h | Adds ThreadPoolWorkCallbacks struct and default_session_work_callbacks_ field to Environment |
| include/onnxruntime/core/platform/EigenNonBlockingThreadPool.h | Core implementation: WorkItem type, callback invocation helpers, queue revocation propagation |
| onnxruntime/core/session/environment.cc | Implements SetDefaultSessionWorkCallbacks |
| onnxruntime/core/session/onnxruntime_c_api.cc | Implements OrtApis::SetDefaultThreadPoolCallbacks C API entry point |
| onnxruntime/core/session/ort_apis.h | Declares SetDefaultThreadPoolCallbacks in the OrtApis namespace |
| onnxruntime/core/session/inference_session.cc | Propagates env-level callbacks to per-session thread pool options |
| onnxruntime/core/util/thread_utils.h / .cc | Adds callback fields to OrtThreadPoolParams and wires them into thread pool creation |
| onnxruntime/core/platform/env.h | Adds ThreadPoolWorkCallbacks struct and work_callbacks field to ThreadOptions |
| cmake/CMakeLists.txt / adjust_global_compile_flags.cmake | CMake option and compile definition for the feature flag |
| tools/ci_build/build_args.py / build.py | Build script support for --session_threadpool_callbacks |
| onnxruntime/test/platform/threadpool_test.cc | Unit tests for all callback scenarios |


@sdotpeng sdotpeng marked this pull request as ready for review March 5, 2026 04:47
@sdotpeng sdotpeng requested a review from Copilot March 5, 2026 05:00
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.



@yuslepukhin (Member)

The PR makes TP perf dependent on the nature of callbacks. Are there any perf numbers that could characterize the perf under the supposed usage?

@sdotpeng sdotpeng requested a review from skottmckay March 9, 2026 16:39
@sdotpeng (Contributor, Author)

> The PR makes TP perf dependent on the nature of callbacks. Are there any perf numbers that could characterize the perf under the supposed usage?

Yes, the overhead does depend on the callback implementation. We benchmarked the three relevant configurations to characterize this:

  1. Flag OFF: baseline, identical to ORT main
  2. Flag ON, no callbacks registered: measures structural overhead (extra pointer in work item, null-check branches)
  3. Flag ON, WinML callbacks registered: the intended usage, where each callback performs a lightweight NT kernel call on thread-local state

Methodology: Built ARM64 Release with --build_micro_benchmarks for each configuration. Ran onnxruntime_benchmark.exe threadpool microbenchmarks (BM_ThreadPoolParallelFor) with --benchmark_repetitions=3 on Snapdragon X Elite (12 cores, 2976 MHz). The benchmark parameters are iteration count (work volume) and cost (per-element cost that controls work partitioning across threads).

Results (mean real_time, percentages relative to Flag OFF):

| Benchmark | Flag OFF (ns) | Flag ON, no callbacks (ns) | Flag ON, WinML callbacks (ns) |
| --- | --- | --- | --- |
| ParallelFor 100/1 | 236 | 260 (+10%) | 399 (+69%) |
| ParallelFor 100/400 | 225 | 253 (+12%) | 328 (+46%) |
| ParallelFor 1K/1 | 2,114 | 2,229 (+5%) | 2,853 (+35%) |
| ParallelFor 1K/200 | 2,046 | 2,237 (+9%) | 2,527 (+24%) |
| ParallelFor 10K/1 | 19,516 | 22,026 (+13%) | 22,624 (+16%) |
| ParallelFor 10K/200 | 30,423 | 30,031 (-1%) | 32,546 (+7%) |
| ParallelFor 20K/200 | 57,804 | 58,457 (+1%) | 61,884 (+7%) |
| ParallelFor 40K/200 | 114,663 | 119,569 (+4%) | 123,416 (+8%) |
| ParallelFor 80K/200 | 225,654 | 233,793 (+4%) | 248,085 (+10%) |
| ParallelFor 160K/200 | 454,959 | 475,879 (+5%) | 463,962 (+2%) |

Flag OFF and Flag ON numbers were collected in separate runs, so +/-5% variation is expected run-to-run noise.

Summary: The callback overhead is a fixed cost of roughly 100-500 ns per dispatch, from three kernel calls per work item. On very short loops (100 iterations, about 230 ns total), this dominates. On larger workloads (10K+ iterations), the measured overhead is 2-10% in all but one case (10K/1, at +16%). In practice, ORT inference kernels run in the hundreds-of-microseconds to milliseconds range, making the per-dispatch callback cost negligible. Builds without --session_threadpool_callbacks have exactly zero overhead because the feature is entirely compiled out.
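For readers who want to re-derive the percentages in the results, the arithmetic is sketched below. The helper names (`OverheadPercent`, `AbsoluteDeltaNs`) are hypothetical and not part of the benchmark harness; the inputs in the usage note are mean real_time values copied from the results above.

```cpp
#include <cassert>
#include <cmath>

// Relative change versus the Flag OFF baseline, rounded to a whole percent.
inline int OverheadPercent(double baseline_ns, double measured_ns) {
  return static_cast<int>(
      std::lround((measured_ns - baseline_ns) / baseline_ns * 100.0));
}

// Absolute difference in nanoseconds between a measurement and the baseline.
inline double AbsoluteDeltaNs(double baseline_ns, double measured_ns) {
  return measured_ns - baseline_ns;
}
```

For example, `OverheadPercent(236, 399)` reproduces the +69% reported for ParallelFor 100/1 with WinML callbacks, and `AbsoluteDeltaNs(236, 399)` gives the corresponding 163 ns difference.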

skottmckay previously approved these changes Mar 10, 2026
@yuslepukhin (Member) left a comment

🕐

@sdotpeng sdotpeng requested a review from yuslepukhin March 20, 2026 20:23
@sdotpeng sdotpeng requested a review from yuslepukhin March 20, 2026 23:21
yuslepukhin previously approved these changes Mar 20, 2026
@yuslepukhin (Member) left a comment

:shipit:

@sdotpeng sdotpeng requested review from chwarr and yuslepukhin March 23, 2026 19:35
@yuslepukhin (Member) left a comment

:shipit:

@eserscor (Contributor)

/azp run Linux_TRT_Minimal_CUDA_Test_CI

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@sdotpeng sdotpeng merged commit f869122 into microsoft:main Mar 30, 2026
96 of 103 checks passed

6 participants