Sync with Microsoft ONNX Runtime - 02042026#1010

Merged
ankitm3k merged 31 commits into ovep-develop from sync_msft_02042026
Apr 2, 2026

Conversation

@ai-fw-intg

Automated daily backmerge from ORT main to ovep-develop. No conflicts detected. Do NOT squash or rebase - use merge commit only.

Copilot AI and others added 30 commits March 25, 2026 15:17
…#27342)

### Description

Moves the `--build_wasm_static_lib → --build_wasm` implication from
`build.py` into `build_args.py`'s post-processing, **before** the cmake
generator selection. Previously, `build_args.py` chose the generator
based on `args.build_wasm` (still `False`), and `build.py` only set it
to `True` afterwards—too late.

- **`tools/ci_build/build_args.py`**: Set `args.build_wasm = True` when
`args.build_wasm_static_lib` is set, prior to generator and
cross-compilation logic.
- **`tools/ci_build/build.py`**: Remove the now-redundant identical
check.

### Motivation and Context

Using `--build_wasm_static_lib` without `--build_wasm` caused cmake to
use the wrong generator (e.g., Visual Studio instead of Ninja on
Windows) and miss Emscripten-specific configuration, leading to build
failures like missing `libiconv`.
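
The ordering fix can be illustrated with a minimal argparse sketch (names and generator strings are illustrative, not the actual `build_args.py` code):

```python
import argparse

def parse_build_args(argv):
    """Illustrative sketch of argument post-processing ordering."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--build_wasm", action="store_true")
    parser.add_argument("--build_wasm_static_lib", action="store_true")
    args = parser.parse_args(argv)

    # Apply the implication BEFORE anything reads args.build_wasm,
    # e.g. cmake generator selection. Applying it later (as build.py
    # did) means the generator is chosen while build_wasm is still False.
    if args.build_wasm_static_lib:
        args.build_wasm = True

    generator = "Ninja" if args.build_wasm else "Visual Studio 17 2022"
    return args, generator
```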

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: fs-eire <7679871+fs-eire@users.noreply.github.com>
…n MatMulNBits (microsoft#27820)

### Description

Routes fp16 `HQNBIT_CompInt8` through the fp32 MLAS path
(`SQNBIT_CompInt8`) at the operator level for both 4-bit and 8-bit
MatMulNBits, then removes the ~370 lines of dead HQ CompInt8 wrapper
code from MLAS.

**Operator changes (matmul_nbits.cc):**
- PrePack: Uses `SQNBIT_CompInt8` for sizing/packing, pre-converts fp16
scales and bias to fp32, computes BZpCorr for asymmetric KleidiAI on
ARM64.
- ComputeBPacked: Bulk fp16→fp32 conversion of A, calls
`MlasQNBitGemmBatch<float>` with `SQNBIT_CompInt8`, bulk fp32→fp16
conversion of C.

**MLAS cleanup (qnbitgemm.cpp, qnbitgemm_kernel_neon.cpp):**
- Removed `HQ4BitGemm_CompInt8`, `HQ8BitGemm_CompInt8`,
`HQ8BitCompInt8PerGemmWorkspace`, associated enum values, dispatch
branches, workspace entries, and `HQNBIT_CompInt8` NEON kernel
conditions.
- Added `HQNBIT_CompInt8` → `SQNBIT_CompInt8` redirect in
`MlasIsQNBitGemmAvailable` for `GetComputeType<MLFloat16>`
compatibility.

### Motivation and Context

The HQ CompInt8 kernels are wrappers that convert fp16→fp32 per-tile
before calling the same SQ fp32 kernels. This change:
1. **Eliminates per-tile overhead** via bulk conversion at the operator
level.
2. **Enables KleidiAI for fp16 4-bit** — previously bypassed by the
`HQNBIT_CompInt8` path.
3. **Removes ~370 lines of dead wrapper code** from MLAS.
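
The operator-level change amounts to bulk conversion wrapped around the fp32 compute path; a rough NumPy sketch of the idea (illustrative only, not the MLAS code):

```python
import numpy as np

def fp16_matmul_via_fp32_kernel(a_fp16, b_fp32):
    # Bulk fp16 -> fp32 conversion of A in one pass over the whole
    # matrix, instead of converting each tile inside the kernel.
    a_fp32 = a_fp16.astype(np.float32)
    # The same fp32 compute path (SQNBIT_CompInt8 in MLAS) does the work.
    c_fp32 = a_fp32 @ b_fp32
    # Bulk fp32 -> fp16 conversion of the output C.
    return c_fp32.astype(np.float16)
```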

### Improvements
Measured on `Snapdragon X Elite - X1E78100 - Qualcomm Oryon CPU`

**Asymmetric:**

| Model | Seq Len | Acc1/Acc4 (before) | Acc1/Acc4 (after) | Acc4 speedup | Acc4 latency (after) |
|-------|---------|-------------------|------------------|--------------|----------------------|
| Qwen 1.5B | 256 | 1.28× | 1.55× | **1.26×** | 1187.5ms |
| Qwen 1.5B | 512 | 1.14× | 1.63× | **1.55×** | 2257.2ms |
| Qwen 3B | 256 | 1.32× | 1.82× | **1.29×** | 2351.3ms |
| Qwen 3B | 512 | 1.38× | 1.70× | **1.28×** | 4777.2ms |
| Qwen 7B | 256 | 1.58× | 2.26× | **1.40×** | 4094.5ms |
| Qwen 7B | 512 | 1.49× | 2.23× | **1.52×** | 8002.6ms |

**Symmetric:**

| Model | Seq Len | Acc1/Acc4 (before) | Acc1/Acc4 (after) | Acc4 speedup | Acc4 latency (after) |
|-------|---------|-------------------|------------------|--------------|----------------------|
| Qwen 1.5B | 256 | 0.95× | 1.45× | **1.67×** | 1255.5ms |
| Qwen 1.5B | 512 | 1.04× | 1.52× | **1.55×** | 2406.7ms |
| Qwen 3B | 256 | 1.39× | 1.88× | **1.32×** | 2215.0ms |
| Qwen 3B | 512 | 1.42× | 1.85× | **1.31×** | 4318.3ms |
| Qwen 7B | 256 | 1.66× | 2.58× | **1.55×** | 3564.4ms |
| Qwen 7B | 512 | 1.57× | 2.60× | **1.64×** | 7227.9ms |

**NOTE**: The 8-bit accuracy level 4 path shows some regression (5–25%
on 1.5B/3B models, neutral on 7B) due to the bulk fp16↔fp32 conversion
overhead replacing the old per-tile approach. The old HQ CompInt8
wrappers kept small tiles cache-hot, while the new unified path does
full-matrix conversion passes. This trade-off is acceptable since 4-bit
is the dominant quantization format (gaining 26–67%), 8-bit acc4 still
outperforms acc1 by 1.7–2.2×, and the regression is most pronounced at
smaller model sizes where absolute latencies are already low. A proper
fix would be 8-bit KleidiAI-style kernels rather than restoring the
wrapper code.
…rt. (microsoft#27825)

### Description
Support for AArch64 SME intrinsics was added in MSVC version 19.40,
while ONNX Runtime's stated minimum supported Visual Studio 2022 version
predates 19.40.

This patch modifies cmake/CMakeLists.txt to check the MSVC version when
MSVC is the target compiler; for versions earlier than 19.40, KleidiAI
is disabled in the build.
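
The version gate reduces to a small predicate (hypothetical helper, not the actual CMake logic):

```python
def kleidiai_supported(msvc_version):
    # AArch64 SME intrinsics landed in MSVC 19.40, so older MSVC builds
    # must disable KleidiAI. msvc_version is a (major, minor) tuple, or
    # None for a non-MSVC target compiler (hypothetical convention).
    if msvc_version is None:
        return True
    return msvc_version >= (19, 40)
```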

### Motivation and Context
This issue was raised when cross compiling 1.24 for Windows on Arm.
microsoft#27304

---------

Signed-off-by: Colm Donelan <coldon01@e135129.arm.com>
Co-authored-by: Colm Donelan <coldon01@e135129.arm.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
)

### Description
Enable ccache and vcpkg caching for Linux workflows that use
`reusable_linux_build.yml`. Saves ~15–20 minutes on a 100% cache hit.
Also parallelizes tests, saving ~6 minutes.

Additionally, enable vcpkg and ccache for other Linux workflows. No
numbers available for comparison.

### Motivation and Context

This change reduces wasted CO2 and time.

### Known Issues

Benign: the Android workflow doesn't seem to be populating its ccache.
### Description
See below



### Motivation and Context
Summary: The vulnerability lies in ONNX Runtime's validate_package.py
script, which uses unsanitized string concatenation with os.system() to
construct shell commands. This allows attackers to inject arbitrary
shell commands via the --package_name argument, leading to potential
remote code execution. The issue affects the release validation
pipeline, which operates with elevated privileges, exposing sensitive
credentials and secrets. The root cause is the lack of input
sanitization and the use of os.system() for command execution.

Affected code locations:

tools/nuget/validate_package.py line 241: os.system("tar zxvf " +
package_name)
tools/nuget/validate_package.py line 339: os.system("copy " +
full_nuget_path + " " + nupkg_copy_name)
Suggested fix: Replace os.system() with subprocess.run() using argument
lists (no shell interpolation):

```python
import shutil
import subprocess

# Instead of: os.system("tar zxvf " + package_name)
subprocess.run(["tar", "zxvf", package_name], check=True)

# Instead of: os.system("copy " + full_nuget_path + " " + nupkg_copy_name)
shutil.copy2(full_nuget_path, nupkg_copy_name)
```
Align maxStorageBufferBindingSize down to the nearest multiple of
minStorageBufferOffsetAlignment after querying device limits. This
ensures that when large buffers are split into segments, each segment's
byte offset satisfies WebGPU's bind group offset alignment requirement
(typically 256 bytes).
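
The alignment itself is simple integer arithmetic; a small sketch (the binding-size value below is hypothetical, not a real device limit):

```python
def align_down(value, alignment):
    # Round down to the nearest multiple of `alignment`, so every
    # segment offset derived from `value` satisfies the bind group
    # offset alignment requirement.
    return (value // alignment) * alignment

MIN_STORAGE_BUFFER_OFFSET_ALIGNMENT = 256  # typical WebGPU limit
queried_max_binding_size = 134217730       # hypothetical device limit

usable = align_down(queried_max_binding_size,
                    MIN_STORAGE_BUFFER_OFFSET_ALIGNMENT)
```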
### Description

This PR updates the pattern matchings to perform multi-head attention
fusion for the conformer encoder inside [Nemotron
speech](https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b).

<img width="550" height="976" alt="image"
src="https://github.com/user-attachments/assets/a194308e-ce69-4128-9389-aae2a64b312f"
/>

### Motivation and Context

These changes allow the `MultiHeadAttention` op to appear in the encoder
ONNX model.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
…t#27823)

### Description
DmlOperatorQuantization21 was missing the tensor reshaping logic that
the older DmlOperatorElementwiseQLinear already had.

Scalar scale tensors get padded to 4D, but a 5D input stays 5D. DML
rejects the dimension mismatch with E_INVALIDARG, and the resulting
exception unwind triggers a sized-delete bug in WRL's MakeAllocator
which address sanitizer detects. The fix is to port the same logic from
the DmlOperatorElementwiseQLinear into this path, so that the dimensions
match.

### Motivation and Context
This is required to ensure the DML EP correctly handles this scenario.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
### Description
This change tries to address a problem in the DML EP where AlignToPow2
rounded up tensorByteSize to a 4-byte boundary before the data was read
from the source buffer. This caused CreateCpuResource, CreateResource,
WriteToFile, and the inputRawData vector construction to read 1–3 bytes
past the end of the original tensor data.

CreateResource and CreateCpuResource already independently align the
D3D12 resource descriptor size, so they work correctly with the original
(unaligned) byte count. The fix is to move the alignment to the location
where it's needed.

### Motivation and Context
This is required because it addresses a crash / incorrect behavior in
the DML EP.
…ft#27595)

This pull request introduces support for node "layering annotations" and
improves resource accounting and memory management during graph
partitioning in ONNX Runtime. The changes add new mechanisms for
annotating nodes, filtering nodes by annotation during partitioning, and
efficiently accounting for resources in fused nodes. Several APIs are
extended to support these features, and new configuration options are
introduced to guide layer assignment.

**Layering annotations & partitioning:**

* Added `layering_annotation_` member and associated getter/setter/clear
methods to the `Node` class, allowing nodes to be annotated for layer
assignment. Also added a method to clear these annotations after
partitioning to save memory. (`include/onnxruntime/core/graph/graph.h`)
[[1]](diffhunk://#diff-aaea1507ec81a94c72a1fa72ce320df712156b665f7798573be3f7e439bb4c37R177-R184)
[[2]](diffhunk://#diff-aaea1507ec81a94c72a1fa72ce320df712156b665f7798573be3f7e439bb4c37R266-R272)
[[3]](diffhunk://#diff-aaea1507ec81a94c72a1fa72ce320df712156b665f7798573be3f7e439bb4c37R702-R703)
* Extended the graph partitioning logic to support filtering nodes by
their layering annotation using a `LayeringIndex`, ensuring only nodes
matching the current execution provider's assignment are considered
during partitioning. (`onnxruntime/core/framework/graph_partitioner.cc`)
[[1]](diffhunk://#diff-e2d3910ae7593ee7ba4fd74e53f738fa973ae2fc32c069f1088ba458b91f8d4bR155)
[[2]](diffhunk://#diff-e2d3910ae7593ee7ba4fd74e53f738fa973ae2fc32c069f1088ba458b91f8d4bR199-R286)
[[3]](diffhunk://#diff-e2d3910ae7593ee7ba4fd74e53f738fa973ae2fc32c069f1088ba458b91f8d4bL244-R357)
[[4]](diffhunk://#diff-e2d3910ae7593ee7ba4fd74e53f738fa973ae2fc32c069f1088ba458b91f8d4bL433-R545)
[[5]](diffhunk://#diff-e2d3910ae7593ee7ba4fd74e53f738fa973ae2fc32c069f1088ba458b91f8d4bL451-R564)
[[6]](diffhunk://#diff-e2d3910ae7593ee7ba4fd74e53f738fa973ae2fc32c069f1088ba458b91f8d4bL477-R591)
* Added a new session option `kOrtSessionOptionsLayerAssignmentSettings`
to configure layer assignment using annotation prefixes per device.
(`include/onnxruntime/core/session/onnxruntime_session_options_config_keys.h`)

**Resource accounting improvements:**

* Improved the `IResourceAccountant` interface to allow resetting and
committing pending weights per node, and updated resource accounting
logic to correctly sum and commit costs for all constituent nodes in
fused nodes, preventing double-counting or undercounting.
(`include/onnxruntime/core/framework/resource_accountant.h`,
`include/onnxruntime/core/graph/indexed_sub_graph.h`,
`onnxruntime/core/framework/graph_partitioner.cc`)
[[1]](diffhunk://#diff-7b1c9ef14536f9a66ed370cb729b6609d12c5907b460d8f145a7ad5a401e0fb6L48-R72)
[[2]](diffhunk://#diff-3f09a80586759ee33e272477c3eb96f28d9b37f1e8251d13f1211c0450945135L89-R114)
[[3]](diffhunk://#diff-e2d3910ae7593ee7ba4fd74e53f738fa973ae2fc32c069f1088ba458b91f8d4bL391-L397)

**API and code organization:**

* Updated the `Graph` class and related APIs to propagate layering
annotations during function inlining and to provide a method for
removing all layering annotations after partitioning.
(`include/onnxruntime/core/graph/graph.h`)
[[1]](diffhunk://#diff-aaea1507ec81a94c72a1fa72ce320df712156b665f7798573be3f7e439bb4c37R1341-R1346)
[[2]](diffhunk://#diff-aaea1507ec81a94c72a1fa72ce320df712156b665f7798573be3f7e439bb4c37R1590-R1594)
* Moved the `CreateAccountants` function out of the `NodeStatsRecorder`
class to the namespace level for clarity.
(`include/onnxruntime/core/framework/resource_accountant.h`)

These changes enable more flexible and memory-efficient graph
partitioning, particularly for scenarios involving hardware-specific
layer assignments and dynamic resource constraints.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…icrosoft#27699)

### Description
If the ONNX file is malformed, it could lead to an incorrect memory
access. This change adds validation to ensure that does not happen.



### Motivation and Context
security issue

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
…icrosoft#27778)

This PR is on top of a previous PR and fixes the remaining issues.
microsoft#27706

All tests here should now be passing over WebGPU:

https://wpt.live/webnn/conformance_tests/dequantizeLinear.https.any.html?gpu

---------

Co-authored-by: edgchen1 <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
)

### Description

Add a pre-check for zero values in the divisor tensor for integral types
in `Mod`. Returns an error `Status` instead of hitting undefined
behavior (SIGFPE / structured exception).

- **`element_wise_ops.cc`**: Added `CheckZeroDivisorImpl` as a single
template struct in the `mod_internal` namespace using `if constexpr
(std::is_integral<T>::value)` to guard the check — no-op for non-integer
types. The struct's `operator()` returns `Status` (via `ORT_RETURN_IF`)
and is dispatched with `InvokeRet<Status>`. When the divisor is a
constant initializer, `TryGetConstantInput` validates for zeros once at
kernel creation time in the out-of-line constructor (using
`ORT_THROW_IF_ERROR`), avoiding per-`Compute` overhead. A
`divisor_is_validated_constant_` flag tracks whether the one-time check
was performed. In `Compute`, non-constant divisors are scanned via the
type dispatcher (using `ORT_RETURN_IF_ERROR`) before calling
`CallModImpl`, skipping the check when the constant was already
validated. The Mod constructor is defined out-of-line after the
`mod_internal` namespace to keep it contiguous.
- **`element_wise_ops_test.cc`**: Added `Mod_int8_by_zero`,
`Mod_int32_by_zero`, `Mod_int64_by_zero_scalar` tests covering tensor
and scalar divisor cases, plus `Mod_int32_by_zero_constant_initializer`
to exercise the `TryGetConstantInput` constructor path with
`is_initializer = true`.

### Motivation and Context

Integer modulo by zero is UB in C++ and causes a hardware exception that
crashes the process. Float types produce NaN naturally via `std::fmod`,
but int8/int16/int32/int64/uint* types do not. This is the same class of
issue that was fixed for the `Div` operator in microsoft#27693, now applied to
the `Mod` operator.
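
The shape of the pre-check can be sketched in Python (illustrative; the real implementation is the C++ `CheckZeroDivisorImpl` described above):

```python
def check_zero_divisor(divisor, is_integral):
    # Only integral element types are guarded: integer modulo by zero
    # is UB in C++ and raises a hardware exception, while float types
    # naturally produce NaN via std::fmod, so they pass through.
    if is_integral and any(v == 0 for v in divisor):
        raise ValueError("Mod: integer divisor tensor contains a zero")
```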

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: edgchen1 <18449977+edgchen1@users.noreply.github.com>
## Description

Adds per-session thread pool work callbacks, allowing callers to hook
into the enqueue/start/stop/abandon lifecycle of thread pool work items.
The feature is gated behind a build flag
(`--enable_session_threadpool_callbacks`) with zero overhead when
disabled.

## API additions

- C API: `OrtApi::SetPerSessionThreadPoolCallbacks` — stores an
`OrtThreadPoolCallbacksConfig` on the `OrtEnv`, applied to per-session
thread pools
- C++ wrapper: `Ort::Env::SetPerSessionThreadPoolCallbacks`
- Versioned C config struct `OrtThreadPoolCallbacksConfig` with fields:
`on_enqueue`, `on_start_work`, `on_stop_work`, `on_abandon`,
`user_context`
- Four callback typedefs: `OrtThreadPoolWorkEnqueueFn`,
`OrtThreadPoolWorkStartFn`, `OrtThreadPoolWorkStopFn`,
`OrtThreadPoolWorkAbandonFn`

## Implementation

- `EigenNonBlockingThreadPool.h`: Introduced a policy-based design with
two compile-time callback policies:
- `WorkNoCallbackPolicy`: `Work = std::function<void()>`, all callback
methods are trivial inlines eliminated by the compiler. Zero overhead
for non-callback builds.
- `WorkWithCallbackPolicy`: `Work = WorkItem` bundling tasks with
callback data; invokes user callbacks around task execution via
`MakeWork`/`Execute`/`OnEnqueue`/`OnAbandon` methods.
- `ThreadPoolTempl<Environment, CallbackPolicy>` uses the policy for all
callback-related operations.
- `RunQueue::RevokeWithTag` calls `policy_->OnAbandon(e.w)` on
successful revocation; the policy implementation decides whether to
invoke user callbacks.
- `threadpool.h`: `extended_eigen_threadpool_` changed to
`unique_ptr<ExtendedThreadPoolInterface>` for type erasure across policy
instantiations. `EnableSpinning`/`DisableSpinning` added to the virtual
interface.
- `threadpool.cc`: Single `#ifdef` selects policy at `ThreadPoolTempl`
instantiation.
- `environment.h/.cc`: Added
`SetPerSessionWorkCallbacks`/`GetPerSessionWorkCallbacks` on
`Environment`.
- `inference_session.cc`: Propagates callbacks from `Environment` to
per-session thread pool options.
- `thread_utils.h/.cc`: Added callback fields to `OrtThreadPoolParams`
and wiring in `CreateThreadPoolHelper`.
- `env.h`: `OrtThreadPoolCallbacksConfig*` pointer in `ThreadOptions`.

## Build

- CMake option `onnxruntime_ENABLE_SESSION_THREADPOOL_CALLBACKS`;
`build.py` argument `--enable_session_threadpool_callbacks`

## Tests

- 8 callback-specific tests: Schedule, OnEnqueueOnly, NoCallbacks,
ParallelFor, ParallelSection, Abandon, EnqueueReturnsNull,
NoEnqueueWithStartStop
- End-to-end C API test (`SetPerSessionThreadPoolCallbacks` via
ModelBuilder with 1M-element Mul)
- All 73 existing ThreadPool tests pass unchanged with both
callback-enabled and callback-disabled builds (81/81 and 73/73
respectively)

## Motivation and Context

Thread pool work callbacks enable telemetry, tracing, and resource
management by providing visibility into when work is enqueued, executed,
and abandoned in per-session thread pools. This is needed for production
diagnostics and performance instrumentation scenarios.
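
The enqueue/start/stop wrapping can be sketched with a plain Python thread pool (hypothetical names; the real API is the C `OrtThreadPoolCallbacksConfig` struct described above):

```python
from concurrent.futures import ThreadPoolExecutor

class WorkCallbacks:
    """Hypothetical analogue of the OrtThreadPoolCallbacksConfig fields."""
    def __init__(self, on_enqueue=None, on_start=None, on_stop=None):
        noop = lambda: None
        self.on_enqueue = on_enqueue or noop
        self.on_start = on_start or noop
        self.on_stop = on_stop or noop

def submit_with_callbacks(pool, callbacks, task):
    # Wrap the task so start/stop fire around execution, mirroring the
    # WorkWithCallbackPolicy behaviour; with no callbacks configured the
    # wrapper reduces to the bare task (the zero-overhead policy).
    def wrapped():
        callbacks.on_start()
        try:
            return task()
        finally:
            callbacks.on_stop()
    callbacks.on_enqueue()
    return pool.submit(wrapped)
```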

---------

Co-authored-by: Siyuan Peng <siyuanpeng@microsoft.com>
…icrosoft#27834)

Use tile_size_k_vec=32 (instead of 16) for MatMulNBits default kernel,
doubling the number of threads working on K-dimension reduction per
output row. This improves token generation throughput by ~3% on NVIDIA
GPUs by better utilizing memory bandwidth.

Intel devices retain tile_size_k_vec=16 due to different subgroup and
cache characteristics.

Changes:
- matmul_nbits.h: Add tile_size_k_vec parameter (default 16) to
MatMulNBitsProgram constructor.
- matmul_nbits.cc: Select tile_size_k_vec=32 for non-Intel vendors, pass
to program constructor.
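
The vendor-based selection reduces to a one-line predicate (illustrative sketch, not the actual C++ code):

```python
def select_tile_size_k_vec(adapter_vendor):
    # Doubling the K-dimension tile doubles the threads reducing along K
    # per output row, which better uses memory bandwidth on NVIDIA GPUs;
    # Intel keeps 16 due to different subgroup and cache characteristics.
    return 16 if adapter_vendor.lower() == "intel" else 32
```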
### Description

Run-level profiling (introduced in PR microsoft#26846) does not currently capture
profiling events for operators inside subgraphs. This PR fixes that by
threading the `run_profiler` pointer through `OpKernelContextInternal`
to subgraph execution, following the same pattern as `terminate_flag`.

### Root Cause

`utils::ExecuteSubgraph()` had no `run_profiler` parameter and always
passed `nullptr` to `ExecuteGraphImpl`, so nested operators (inside If,
Loop, Scan, BeamSearch, GreedySearch) were never profiled at the run
level.

### Fix

1. **`OpKernelContextInternal`** — Added `run_profiler_` member and
`GetRunProfiler()` accessor.
2. **`SessionScope` / `ExecuteKernel()`** — Pass the run profiler into
`OpKernelContextInternal`.
3. **`ExecuteSubgraph()`** — Added `profiling::Profiler* run_profiler =
nullptr` parameter, forwarded to `ExecuteGraphImpl()`.
4. **Control flow ops** (`if.cc`, `loop.cc`, `scan_utils.cc`) — Pass
`context_.GetRunProfiler()` to `ExecuteSubgraph()`.
5. **Contrib transformer ops** (`beam_search_impl_gpt.h`,
`beam_search_impl_t5.h`, `beam_search_impl_whisper.h`,
`greedy_search_impl_gpt.h`) — All 8 `ExecuteSubgraph()` call sites
updated to pass `this->context_.GetRunProfiler()`.

Plugin EP control flow kernels (`PluginEpIfKernelImpl`, etc.) delegate
to the same internal kernels, so the fix propagates automatically.
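
The fix pattern, forwarding the profiler instead of dropping it at subgraph boundaries, can be sketched as (illustrative Python, not the ORT code):

```python
def execute_graph(nodes, run_profiler=None):
    # Minimal model of nested graph execution: each executed node is
    # recorded, and control-flow nodes recurse into their subgraph.
    for node in nodes:
        if run_profiler is not None:
            run_profiler.append(node["name"])
        # The bug was equivalent to passing run_profiler=None here, so
        # ops inside If/Loop/Scan subgraphs never reached the profiler.
        execute_graph(node.get("subgraph", []), run_profiler=run_profiler)
```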

### Tests

- **`CheckRunProfilerWithSubgraph`** (`inference_session_test.cc`) —
Runs `if_mul.onnx`, enables run profiling, asserts `mul_0` (inside If's
then-branch) appears in the profile JSON.
- **`CheckRunProfilerWithBeamSearch`** (`beam_search_test.cc`) — Runs
`tiny_gpt2_beamsearch.onnx`, enables run profiling, asserts decoder
subgraph Node entries (beyond the top-level BeamSearch op) appear in the
profile JSON.

### Files Changed (12 files)

| File | Change |
|------|--------|
| `core/framework/op_kernel_context_internal.h` | Added `run_profiler_` member, `GetRunProfiler()`, constructor param |
| `core/framework/sequential_executor.cc` | `SessionScope::GetRunProfiler()`, pass to `OpKernelContextInternal` |
| `core/framework/utils.h` / `utils.cc` | `run_profiler` param on `ExecuteSubgraph()` |
| `core/providers/cpu/controlflow/if.cc` | Forward `GetRunProfiler()` |
| `core/providers/cpu/controlflow/loop.cc` | Forward `GetRunProfiler()` |
| `core/providers/cpu/controlflow/scan_utils.cc` | Forward `GetRunProfiler()` |
| `contrib_ops/cpu/transformers/beam_search_impl_gpt.h` | 2 call sites |
| `contrib_ops/cpu/transformers/beam_search_impl_t5.h` | 2 call sites |
| `contrib_ops/cpu/transformers/beam_search_impl_whisper.h` | 2 call sites |
| `contrib_ops/cpu/transformers/greedy_search_impl_gpt.h` | 2 call sites |
| `test/framework/inference_session_test.cc` | `CheckRunProfilerWithSubgraph` test |
| `test/contrib_ops/beam_search_test.cc` | `CheckRunProfilerWithBeamSearch` test |
### Description

Replace `actions/cache@v4` with `actions/cache@v5`.

### Motivation and Context

`actions/cache@v4` uses node 20, which is deprecated.
This pull request introduces a new synchronization API for plugin
execution providers (EPs) in ONNX Runtime, and adds comprehensive test
infrastructure to verify its usage. The main theme is enabling EPs to
synchronize device operations, which is particularly important for IO
binding and async execution scenarios. The changes also update the test
framework to support and validate this new capability.

**Synchronization API for Plugin EPs:**

* Added a new optional `Sync` method to the `OrtEp` C API interface,
allowing EPs to block until all preceding device tasks are complete.
This is primarily used by IO binding to ensure device inputs are ready
before execution.
(`include/onnxruntime/core/session/onnxruntime_ep_c_api.h`)
* Implemented the `Sync` method in the example plugin EP, with a test
hook that increments a counter for verification purposes.
(`onnxruntime/test/autoep/library/example_plugin_ep/ep.cc`,
`onnxruntime/test/autoep/library/example_plugin_ep/ep.h`)
[[1]](diffhunk://#diff-60ddcfdf7fe7273a7f06c4c1eb39933737e6fe8c2f00bdf2e5f49c2d1f911fa4R187)
[[2]](diffhunk://#diff-60ddcfdf7fe7273a7f06c4c1eb39933737e6fe8c2f00bdf2e5f49c2d1f911fa4R589-R601)
[[3]](diffhunk://#diff-5e9391ab7d2d558c5fa992b5fc373add5c52225aa43ce1af323ffbd8c2b86733R105-R106)

**Test Infrastructure and Verification:**

* Added test hooks (`ExampleEpTestHooks_ResetSyncCount`,
`ExampleEpTestHooks_GetSyncCount`) to the example plugin EP, allowing
tests to reset and retrieve the sync call count.
(`onnxruntime/test/autoep/library/example_plugin_ep/ep_test_hooks.h`,
`onnxruntime/test/autoep/library/example_plugin_ep/ep_test_hooks.cc`)
[[1]](diffhunk://#diff-a587d529618260bec7cbecf107513dacb795fff9fb34ae99c3a2db36bdcc8befR1-R23)
[[2]](diffhunk://#diff-7123fbca69d2580f0483d6589817e275c05b086c1fb56281a83f0fb895bdc06fR1-R11)
* Updated test execution logic to load these hooks dynamically and
verify that the `Sync` method is called exactly once during inference
with IO binding. (`onnxruntime/test/autoep/test_execution.cc`)
[[1]](diffhunk://#diff-3e289607015487374dcf7d9ab1d73a2ca3c3e5a44cab5958e4334afcdd5f4e28R299-R358)
[[2]](diffhunk://#diff-3e289607015487374dcf7d9ab1d73a2ca3c3e5a44cab5958e4334afcdd5f4e28R1099-R1119)

**Plugin EP Interface Updates:**

* Extended the `PluginExecutionProvider` C++ interface to support the
new `Sync` method, delegating to the plugin EP if implemented.
(`onnxruntime/core/session/plugin_ep/ep_plugin_provider_interfaces.h`,
`onnxruntime/core/session/plugin_ep/ep_plugin_provider_interfaces.cc`)
[[1]](diffhunk://#diff-db92123bb63f8b1cc0a776ba3dcad95118826d031c8f65e79969cfaddb8c3e0aR117-R118)
[[2]](diffhunk://#diff-6dac10650c4e1c5a55b95378173b33e95b300bf7c2350d8476088693b98652a5R632-R638)

**Performance Test Framework Enhancements:**

* Added logic to detect if a plugin EP uses an NVIDIA GPU device,
enabling CUDA IO binding automatically in performance tests when
appropriate. (`onnxruntime/test/perftest/common_utils.cc`,
`onnxruntime/test/perftest/utils.h`,
`onnxruntime/test/perftest/ort_test_session.cc`)
[[1]](diffhunk://#diff-2b8b7de0106a523d40c40f901f6ff170bff722b0c147fbfec36b269e21c9526bR203-R221)
[[2]](diffhunk://#diff-228a0b2557ae67945d94db8f9e74bb523517c2aa738db91fcfdda0958fa65f6cR40-R41)
[[3]](diffhunk://#diff-f3908cf4d2982fd2b008beb8c951d9cab67b28cb07077493b8d1dd8b448d6e18R98-R108)
* Ensured that async execution is used in performance tests with IO
binding, relying on the new synchronization mechanism.
(`onnxruntime/test/perftest/ort_test_session.cc`)
[[1]](diffhunk://#diff-f3908cf4d2982fd2b008beb8c951d9cab67b28cb07077493b8d1dd8b448d6e18L57)
[[2]](diffhunk://#diff-f3908cf4d2982fd2b008beb8c951d9cab67b28cb07077493b8d1dd8b448d6e18R66-R69)

These changes collectively improve device synchronization support for
plugin EPs and provide robust testing to ensure correct behavior.

This pull request also introduces support for synchronizing plugin
execution providers, especially for NVIDIA GPU devices, and refines the
logic for CUDA I/O binding in performance tests. The main changes
include adding a new `Sync` API for execution providers, updating the
plugin EP interface to use this API, and improving test session
configuration for CUDA devices.

### API and Interface Updates

* Added a new optional `Sync` method to the `OrtEp` struct in
`onnxruntime_ep_c_api.h`, allowing execution providers to block until
all device tasks are complete. This is primarily used to ensure inputs
are copied to the device before execution starts.
* Implemented the `Sync` method in the `PluginExecutionProvider` class
and its interface, enabling plugin EPs to support device synchronization
if available.
[[1]](diffhunk://#diff-6dac10650c4e1c5a55b95378173b33e95b300bf7c2350d8476088693b98652a5R632-R638)
[[2]](diffhunk://#diff-db92123bb63f8b1cc0a776ba3dcad95118826d031c8f65e79969cfaddb8c3e0aR117-R118)

### Performance Test Improvements

* Added a utility function `UsesNvidiaDevice` to detect if any
registered plugin EP uses an NVIDIA GPU device, improving test
configuration logic.
[[1]](diffhunk://#diff-2b8b7de0106a523d40c40f901f6ff170bff722b0c147fbfec36b269e21c9526bR203-R221)
[[2]](diffhunk://#diff-228a0b2557ae67945d94db8f9e74bb523517c2aa738db91fcfdda0958fa65f6cR40-R41)

## Description

This PR adds a standalone CUDA Plugin Execution Provider
(`CudaPluginExecutionProvider`) built as a dynamically loadable shared
library (`libonnxruntime_providers_cuda_plugin.so`) on top of the ORT EP
Plugin API. The implementation reuses the existing CUDA kernel stack
through adapter/shim layers (force-included headers and macro-based
registration overrides), eliminating the need to maintain a parallel
copy of 100+ CUDA kernels. CUDA Graph capture/replay is intentionally
deferred until the plugin-facing EP API exposes the required session
callbacks.

## Summary of Changes

### Build system and CMake

| File | Change |
|------|--------|
| `cmake/CMakeLists.txt` | Adds `onnxruntime_BUILD_CUDA_EP_AS_PLUGIN`
build option, records plugin build info, and includes the
plugin-specific CMake file. |
| `cmake/onnxruntime_providers_cuda_plugin.cmake` | **New.** Defines the
plugin shared-library target: collects `.cc`/`.cu` sources from
`core/providers/cuda/` and `contrib_ops/cuda/`, applies exclusion
filters for incompatible files (tunable, controlflow, registration
tables), force-includes adapter headers, and links CUDA/cuDNN/ORT
components. |
| `cmake/onnxruntime_providers_cuda.cmake` | Minor additions to expose
include paths needed by plugin builds. |
| `cmake/onnxruntime_unittests.cmake` | Enables dynamic plugin EP usage
in provider tests and fills in missing CUDA include/link settings for
the plugin configuration. |
| `cmake/external/cuda_configuration.cmake` | Adds CUDA configuration
support for the plugin build path. |

### Plugin runtime implementation (new files)

| File | Purpose |
|------|---------|
| `plugin/cuda_ep_factory.cc/.h` | Implements `OrtEpFactory` — device
enumeration, session-option parsing, allocator registration, kernel
registry creation, and all static C-compatible plugin callbacks.
Thread-safe lazy kernel registry initialization. |
| `plugin/cuda_ep.cc/.h` | Plugin-side CUDA EP object deriving from
`ep::adapter::Ep`. Carries session-specific `Config` (NHWC preference,
TF32, cuDNN algorithm selection, convolution workspace, attention
kernels). |
| `plugin/cuda_allocator_plugin.cc/.h` | Plugin allocators for device
and pinned memory, exposed through the EP API. |
| `plugin/cuda_stream_plugin.cc/.h` | Plugin-owned CUDA stream, cuBLAS,
cuBLASLt, and cuDNN handle management. Provides two stream adapter modes
(`PluginStreamShim` for `.cc`, `OrtStreamAdapter` for `.cu`/`.cc`
contexts). |
| `plugin/cuda_data_transfer_plugin.cc/.h` | Data transfer bridge for
host↔device copies used by plugin-backed tensors and Python bindings. |
| `plugin/cuda_memcpy_plugin.cc` | MemcpyToHost / MemcpyFromHost kernel
implementations for the plugin path. |
| `plugin/cuda_controlflow_plugin.cc/.cu/.h` | Plugin-native `If`,
`Loop`, and `Scan` wrappers that delegate to `OrtEpApi` control-flow
hooks instead of inheriting from in-tree CPU base implementations. |
| `plugin/cuda_plugin_ep.cc` | Exports the DLL entry points
(`OrtCreateEpFactory` / `OrtReleaseEpFactory`) used by ORT to create and
release the CUDA EP factory. |
| `plugin/cuda_kernel_adapter.h` | **Core shim** (1088 lines). Provides
`CudaKernel` base class, error-return macros, type helpers
(`ToCudaType`), handle-management abstractions, and stream adapters.
Force-included in all plugin `.cc` files to transparently adapt existing
kernel code. |
| `plugin/cuda_plugin_kernels.cu/.h` | Aggregates self-registered kernel
definitions via `PluginKernelCollector` macro overrides, replacing the
centralized registration tables used in the bundled build. |
| `plugin/cuda_plugin_utils.h` | Shared utility helpers for the plugin
(logging, error checking, config parsing). |
| `plugin/provider_api_shims.cc` | Stub implementations for
shared-provider bridge functions that are not needed in the plugin path. |
| `plugin/cuda_plugin_ep_symbols.def` | Windows symbol export
definitions for the plugin DLL. |

### EP adapter and API extensions

| File | Change |
|------|--------|
| `include/onnxruntime/ep/api.h` | Makes plugin API initialization
thread-safe; preserves access to ORT, EP, and model editor API tables
during plugin loading. |
| `include/onnxruntime/ep/adapter/node.h` | Adds node metadata accessors
(operator domain, optional-output handling) needed by reused CUDA
kernels. |
| `include/onnxruntime/ep/adapter/op_kernel.h` | Adds
`RequiredInput`/`RequiredOutput` helpers and adapter fixes so existing
CUDA kernels run against plugin adapter contexts. |
| `include/onnxruntime/ep/adapter/op_kernel_info.h` | Extends adapter
kernel-info with attribute and config accessors required by migrated
kernels. |
| `include/onnxruntime/ep/adapter/allocator.h` | Minor allocator adapter
adjustments for plugin compatibility. |
| `include/onnxruntime/ep/adapter/kernel_def_builder.h` | Adds kernel
definition builder hooks for plugin registration. |
| `include/onnxruntime/core/framework/tensor.h` | Restores a plugin-only
`Tensor::Create` compatibility path for kernels relying on the older
static factory form. |
| `onnxruntime/core/providers/shared_library/provider_api.h` | Turns the
shared-provider bridge into a no-op for plugin builds so the EP adapter
facade owns type resolution. |

### CUDA kernel compatibility migration

- Adapts ~80 core CUDA and contrib CUDA kernel source files to compile
under the plugin build via macro-based registration overrides and
targeted compatibility fixes (not operator rewrites).
- Moves or templates reusable helper logic in shared CPU/CUDA headers
(`ConstantOfShapeBase`, `PadBase`, `SliceBase`, `SplitBase`,
`ScatterND`, `UpsampleBase`, `DeformConvAttributes`) so kernels compile
in adapter mode.
- Key contrib kernel adaptations: attention variants (MHA, GQA, paged,
sparse, packed), skip-layer-norm, group-norm, MoE, fused-conv, inverse,
bias-dropout, matmul-nbits, qordered ops.
- Key core kernel adaptations: softmax, topk, conv/conv-transpose,
batch-norm, instance-norm, pool, RNN, reduction, einsum, matmul, cumsum,
identity, pad, split, scatter-nd, slice, upsample, tile, unsqueeze,
gather-nd, concat, dropout, non-max-suppression.

### Python integration

| File | Change |
|------|--------|
| `onnxruntime/python/onnxruntime_pybind_module.cc` | Extends
`get_available_providers()` to surface dynamically registered plugin EPs
discovered from `OrtEpDevice` enumeration. |
| `onnxruntime/python/onnxruntime_pybind_state.cc` | Allows Python
session creation to instantiate providers from registered plugin EP
devices, including `device_id` selection, instead of only built-in or
legacy dynamic-load EP paths. |
| `onnxruntime/python/onnxruntime_pybind_schema.cc` | Adds schema query
support for plugin-registered operators. |

### Testing and validation

| File | Change |
|------|--------|
| `test/python/transformers/test_cuda_plugin_ep.py` | **New** (1861
lines). Comprehensive test suite covering 5 stages: registration, ONNX
ops, NHWC layout preference, contrib ops, and op-level validation. |
| `test/python/transformers/cuda_plugin_ep_helper.py` | **New** (192
lines). Utility for transparently routing existing tests to the plugin
EP. |
| `test/python/transformers/test_gqa.py` | Fixes `total_sequence_length`
tensor placement from CUDA to CPU (was causing failures under the plugin
EP's stricter memory layout); routes tests through plugin EP. |
| `test/python/transformers/test_moe_cuda.py` | Routes through plugin EP
when available. |
| `test/framework/dynamic_plugin_ep_test.cc` | **New** (120 lines). C++
unit test exercising dynamic plugin EP loading and device enumeration. |
| `test/unittest_util/base_tester.cc` | Routes CUDA test requests to
`CudaPluginExecutionProvider` when registered, allowing existing CUDA
provider tests to exercise the plugin path. |
| `tools/ci_build/cuda_plugin_parity_report.py` | **New** (737 lines).
Comparison script that produces a parity report of ops in bundled-only
vs. plugin-only vs. both builds, via static parsing or runtime registry
interrogation. |
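The bundled-vs-plugin comparison such a report produces boils down to set differences; a minimal sketch (the op names below are made-up stand-ins, not real registry output):

```python
# Classify ops as bundled-only, plugin-only, or present in both builds.
def parity_report(bundled_ops, plugin_ops):
    bundled, plugin = set(bundled_ops), set(plugin_ops)
    return {
        "bundled_only": sorted(bundled - plugin),
        "plugin_only": sorted(plugin - bundled),
        "both": sorted(bundled & plugin),
    }

report = parity_report(
    ["Conv", "MatMul", "CudaGraphOp"],   # made-up bundled registry
    ["Conv", "MatMul", "NewPluginOp"],   # made-up plugin registry
)
print(report["bundled_only"])  # ['CudaGraphOp']
print(report["plugin_only"])   # ['NewPluginOp']
```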

### Documentation

| File | Change |
|------|--------|
| `docs/cuda_plugin_ep/cuda_plugin_ep_design.md` | **New** (990 lines).
Plugin architecture, build/deployment flow, operator exclusions, adapter
design, and the decision to defer CUDA Graph support. |
| `docs/cuda_plugin_ep/QUICK_START.md` | **New** (108 lines). Build
instructions, C++ and Python usage examples, and known limitations. |

### Other

| File | Change |
|------|--------|
| `tools/python/gen_opkernel_doc.py` | Extended to generate
documentation for plugin-registered kernels. |
| `orttraining/.../reduction_ops.cc` | Minor compatibility fix for
training reduction ops under the plugin build configuration. |

## Testing

- **Build**: Configure with `--build_cuda_ep_as_plugin` (or
`onnxruntime_BUILD_CUDA_EP_AS_PLUGIN=ON`); verify
`libonnxruntime_providers_cuda_plugin.so` is produced alongside existing
CUDA provider artifacts.
- **C++ unit tests**: Run `onnxruntime_provider_test` — `BaseTester`
routes CUDA coverage through `CudaPluginExecutionProvider`. Run the new
`dynamic_plugin_ep_test` for load/enumerate validation.
- **Python tests**: Register the plugin library, confirm
`onnxruntime.get_available_providers()` includes
`CudaPluginExecutionProvider`, and run `test_cuda_plugin_ep.py` (5-stage
suite: registration → ONNX ops → NHWC → contrib ops → op validation).
- **Parity report**: Run `tools/ci_build/cuda_plugin_parity_report.py`
to verify kernel coverage parity between bundled and plugin builds.
- **Backward compatibility**: Verify unchanged behavior for the in-tree
CUDA EP build path (`onnxruntime_BUILD_CUDA_EP_AS_PLUGIN=OFF`).
- **Known limitation**: CUDA graph support remains disabled in the
plugin path and is documented as deferred.

## Motivation and Context

The CUDA EP is currently compiled into the ORT runtime binary, tightly
coupling its release cycle to the core runtime. This PR creates a path
to decouple CUDA EP delivery by implementing it as a standalone plugin
using the EP Plugin API. The key design tradeoff is reusing the existing
~100+ CUDA kernel implementations through force-include adapter headers
and macro-based registration overrides, rather than rewriting them. This
approach validates the plugin EP against current CUDA coverage without
maintaining a second kernel stack, at the cost of introducing
adapter/shim complexity. CUDA Graph support is explicitly deferred until
the EP Plugin API can represent the capture/replay lifecycle.

**Related**: PR microsoft#27817 (CUDA Plugin EP: Test Coverage & Bug Fixes) is
squash-merged into this branch.

## Checklist

- [x] Tests added/updated
- [x] Documentation updated (if applicable)
- [x] No breaking changes (or documented in description)
- [ ] CI passes
…ft#27914)

### Description

Specify `main` as the target branch for the release candidate cron job.

### Motivation and Context

Pipeline won't work without a branch specifier.
…rosoft#27713)

### Description

Adds C/C++ APIs to the `OrtEpApi` that allow plugin EPs to query ONNX
operator schemas from ORT's global schema registry. This enables EPs to
programmatically discover operator metadata (input/output names, type
constraints, allowed types, since_version) needed to correctly build
kernel definitions with proper type constraints.

### Motivation

Resolves microsoft#27680. Plugin EPs must provide exact type constraint names
(e.g., `"T"`, `"T1"`) and allowed types when calling
`KernelDefBuilder::AddTypeConstraint()`. Without schema access, EPs must
either hard-code these names or skip type constraints entirely, leading
to potentially incorrect kernel selection and data type mismatches at
runtime.

**Why can't an EP library just link to its own ONNX library?** The ONNX
`OpSchemaRegistry` is a Meyers singleton (`static` local in
`Instance()`). Each shared library gets its own copy of that static
variable: on Windows each DLL is isolated by default, on macOS two-level
namespaces have the same effect, and on Linux behavior depends on
`dlopen` flags (`RTLD_LOCAL` isolates, `RTLD_GLOBAL` creates
unpredictable interposition). Even when isolation doesn't occur, the
EP's registry would lack ORT's contrib and internal schemas, and version
mismatches between the EP's ONNX library and ORT's vendored copy could
cause silent divergence. A C API through ORT is the only reliable,
portable way to query the schemas ORT actually uses.
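A rough Python analogue of this isolation (not the C++ mechanism itself): loading one module source under two different names yields two independent "singleton" registries, much as each shared library gets its own copy of a Meyers-singleton static.

```python
# Load the same module source twice under different names; each load
# gets its own module-level "registry" singleton.
import importlib.util
import os
import tempfile

SRC = "registry = {'schemas': []}\n"

def load_copy(name, path):
    spec = importlib.util.spec_from_file_location(name, path)
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    return mod

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "schema_registry.py")
    with open(path, "w") as f:
        f.write(SRC)
    a = load_copy("registry_in_ort", path)
    b = load_copy("registry_in_ep", path)
    a.registry["schemas"].append("Add")  # registered only in ORT's copy
    print(a.registry is b.registry)      # False: two isolated registries
    print(b.registry["schemas"])         # []: the EP's copy never sees "Add"
```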

### Changes

**New opaque types:**
- `OrtOpSchema` — owning opaque struct wrapping an `onnx::OpSchema*`
with precomputed type constraint data. Allocated by `GetOpSchema`,
released by `ReleaseOpSchema`.
- `OrtOpSchemaTypeConstraint` — non-owning opaque entity representing a
single type constraint (e.g., "T"). Lifetime is tied to the parent
`OrtOpSchema`. Each constraint carries its name, allowed types, and
input/output index mappings.

**New C APIs added to `OrtEpApi` (Version 1.25, 15 functions):**

| Function | Description |
|---|---|
| `GetOpSchema` | Look up a schema by name, max opset version, and
domain. Accepts `""` or `"ai.onnx"` for standard ONNX ops,
`"ai.onnx.ml"` for ML ops, `"com.microsoft"` for contrib ops. |
| `ReleaseOpSchema` | Release an `OrtOpSchema` allocated by
`GetOpSchema`. |
| `OpSchema_GetSinceVersion` | Get the opset version that introduced the
schema. |
| `OpSchema_GetNumInputs` / `GetNumOutputs` | Input/output counts. |
| `OpSchema_GetInputName` / `GetOutputName` | Formal parameter names. |
| `OpSchema_GetInputTypeConstraint` / `GetOutputTypeConstraint` | Get
the type constraint for a given input/output (O(1) lookup). Returns
`nullptr` if the input/output has no type constraint. Shared constraints
return the same pointer (pointer identity = shared type). |
| `OpSchema_GetTypeConstraintCount` | Number of unique type constraints. |
| `OpSchema_GetTypeConstraint` | Get the i-th type constraint by index. |
| `OpSchemaTypeConstraint_GetTypeParamName` | Get the type parameter
name (e.g., `"T"`, `"T1"`). |
| `OpSchemaTypeConstraint_GetAllowedTypes` | Get the allowed type
strings (e.g., `"tensor(float)"`). |
| `OpSchemaTypeConstraint_GetInputIndices` | Get input indices using
this constraint. |
| `OpSchemaTypeConstraint_GetOutputIndices` | Get output indices using
this constraint. |

**C++ wrappers:**
- `Ort::OpSchema` — owning wrapper around `OrtOpSchema*` (move-only,
auto-releases).
- `Ort::ConstOpSchemaTypeConstraint` — non-owning wrapper around `const
OrtOpSchemaTypeConstraint*`.
- `Ort::GetOpSchema()` — free function to query the registry.

**Design highlights:**
- Type constraints are eagerly precomputed during `GetOpSchema` — all
subsequent accessors are O(1) with no allocation.
- `GetInputTypeConstraint`/`GetOutputTypeConstraint` return the full
constraint object directly (not just a string), enabling a 2-call
workflow: `GetInputTypeConstraint(0)` → `GetAllowedTypes()`.
- Pointer identity: inputs sharing a constraint (e.g., both inputs of
`Add` use `"T"`) return the same `OrtOpSchemaTypeConstraint*`.
- Domain `"ai.onnx"` is normalized to `""` (the canonical ONNX domain)
for transparent lookup.
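A toy Python model of these highlights (not the real `OrtOpSchema` implementation): constraints are precomputed once, input→constraint lookup is O(1), and inputs sharing a constraint return the same object.

```python
class TypeConstraint:
    def __init__(self, name, allowed_types, input_indices, output_indices):
        self.type_param_name = name
        self.allowed_types = allowed_types
        self.input_indices = input_indices
        self.output_indices = output_indices

class OpSchema:
    def __init__(self, constraints, input_to_constraint):
        self._constraints = constraints
        self._input_to_constraint = input_to_constraint  # precomputed map

    def get_input_type_constraint(self, i):
        # O(1); None when the input has no type constraint
        return self._input_to_constraint.get(i)

# "Add": both inputs share constraint "T".
t = TypeConstraint("T", ["tensor(float)", "tensor(int32)"], [0, 1], [0])
add_schema = OpSchema([t], {0: t, 1: t})

# Two-call workflow: constraint for input 0, then its allowed types.
c = add_schema.get_input_type_constraint(0)
print(c.allowed_types)  # ['tensor(float)', 'tensor(int32)']
# Identity: input 0 and input 1 resolve to the same constraint object.
print(add_schema.get_input_type_constraint(0)
      is add_schema.get_input_type_constraint(1))  # True
```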

**Tests:** 14 unit tests covering known/unknown ops, version boundaries,
wrong domains, `"ai.onnx"` alias, schema properties (Add, Relu, LSTM),
type constraint access, pointer identity for shared constraints,
out-of-range errors, and the input→constraint→allowed-types workflow.

### Files

| File | Description |
|---|---|
| `onnxruntime/core/session/abi_opschema.h` | Internal struct
definitions for `OrtOpSchemaTypeConstraint` and `OrtOpSchema`. |
| `include/.../onnxruntime_ep_c_api.h` | Public C API: function
signatures, doc comments, opaque type declarations. |
| `onnxruntime/core/session/plugin_ep/ep_api.h` | Internal function
declarations. |
| `onnxruntime/core/session/plugin_ep/ep_api.cc` | Implementation of all
15 functions + API struct initializer. |
| `include/.../onnxruntime_cxx_api.h` | C++ wrapper class declarations. |
| `include/.../onnxruntime_cxx_inline.h` | C++ wrapper inline
implementations. |
| `onnxruntime/test/framework/ep_plugin_provider_test.cc` | Unit tests. |

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This pull request adjusts the tiling strategy for small matrix sizes in
the DP4A matmul kernel. The changes are aimed at improving performance
and compatibility, especially for specific GPU vendors.

On Qualcomm, this improves token generation from ~20 tps to ~25 tps.
### Description
Update logger object in QnnBackendManager::SetupBackend.



### Motivation and Context
While generating a weight-sharing context binary, an Inference Session is
created once for each graph. Each Inference Session creates a Logger
object and passes it to QnnBackendManager, which stores the pointer in
logger_ and holds it long after the Inference Session destroys the
Logger. On the next Inference Session another Logger object is created,
but QnnBackendManager does not use it because backend_setup_completed_ is
already set; dereferencing the stale pointer causes a use-after-free
(UAF).

Co-authored-by: Trishansh Bhardwaj <quic_tbhardwa@quicinc.com>
### Description
This PR contains fixes to various big endian support issues in
onnxruntime, both in libraries and tests.

### Motivation and Context
Currently, some tests from the onnxruntime test suite fail on big-endian
systems. This change fixes all of those tests when onnxruntime is built
without training support. It also includes a fix for a linking issue.

Following tests are fixed on s390x:
OrtModelOnlyTests.ValidateOrtFormatModelDoesNotRunOptimizersInFullBuild
FlatbufferUtilsTest.ExternalWriteReadWithLoadInitializers
SparseTensorConversionTests.SparseTensorProtoToDense_Rank1Indices64
SparseTensorConversionTests.SparseTensorProtoToDense_Rank1Indices32
SparseTensorConversionTests.SparseTensorProtoToDense_Rank1Indices16
SparseTensorConversionTests.SparseTensorProtoToDense_Rank1Indices8
SparseTensorConversionTests.SparseTensorProtoToDense_Rank2Indices_COO
SparseTensorConversionTests.TestConstantNodeConversion
OrtModelOnlyTests.SparseInitializerHandling
SparseTensorConversionTests.TestConstantNodeConversion
SparseTensorConversionTests.TestDenseToSparseConversion
ExecutionFrameTestInit.SparseInitializerAsOutput
CApiTest.SparseOutputModel
### Description
#### TLDR

This PR ports the existing C++
[EpProfiler](https://github.com/microsoft/onnxruntime/blob/faad20f9d3264c7f3b6d4e4398990e13ee864512/include/onnxruntime/core/framework/execution_provider.h#L359)
interfaces used by provider-bridge EPs to the binary-stable C APIs for
plugin EPs. It introduces C/C++ APIs for creating/querying profiling
events, a container for appending EP events, and callback hooks
(`StartEvent`/`StopEvent`) that give EPs access to ORT event metadata in
real-time.

#### Changes to the original C++ API

The original `EpProfiler` C++ interface was adapted for the C API with
the following intentional changes:

1. **`StartProfiling`** now receives an offset indicating the elapsed
time since profiling started, as opposed to receiving an
absolute/epoch-dependent profiling start time. This prevents EPs from
having to do epoch conversions. Credit to @edgchen1 for the idea.
2. **`StartEvent`/`StopEvent` receive an absolute, epoch-based
correlation ID (`ort_event_correlation_id`)** instead of a relative ORT
event ID. The `PluginEpProfiler` bridge layer automatically converts the
C++ `relative_ort_event_id` (microseconds since profiling start) to an
absolute `ort_event_correlation_id` by adding the epoch-based profiling
start time. This means plugin EPs can use the correlation ID directly
with profiling utilities like CUPTI or ROCTracer without computing the
conversion themselves.
3. **`StopEvent` now receives the completed ORT event as a parameter.**
This allows EPs to optionally inspect ORT event metadata (e.g.,
`op_name`, `event_name`) at the time the event ends, facilitating
annotation of correlated EP events.
4. **`EndProfiling` only allows EPs to *append* events (via
`OrtProfilingEventsContainer`), not read or modify the full events
array.** This is motivated by:
- Prevent any one EP from modifying events generated by ORT or another
EP.
- Certain EPs (VitisAI and WebGPU) already only append events without
reading the entire events array.
- The CUDA EP reads the entire events array solely to merge/sort its own
EP events next to correlated ORT events and add `parent_name`/`op_name`
metadata. However:
- Merging/sorting is mostly unnecessary since trace viewers that load
these files do their own event sorting.
- This merging/sorting step was previously required to augment CUDA EP
events with metadata from the correlated ORT event. However, that can
now be obtained more simply via the new `StopEvent` parameter that
provides the EP with the full correlated ORT event.
- The [merge algorithm used by CUDA
EP](https://github.com/microsoft/onnxruntime/blob/faad20f9d3264c7f3b6d4e4398990e13ee864512/include/onnxruntime/core/common/gpu_profiler_common.h#L391-L397)
**incorrectly** assumes ORT events are sorted by non-decreasing *start*
time, but they are actually sorted by [non-decreasing *end*
time](https://github.com/microsoft/onnxruntime/blob/faad20f9d3264c7f3b6d4e4398990e13ee864512/onnxruntime/core/common/profiler.cc#L91)
(also see
microsoft#13706 (comment)).
Fixing this would require sorting the entire Events array before asking
a provider-bridge EP to merge in its events into the global events
array. Not sure this is worth the runtime cost.
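The mismatch can be seen with two made-up overlapping events: sorting by end time does not leave start times sorted.

```python
# Two events whose durations overlap.
events = [
    {"name": "short_op", "ts": 50, "dur": 10},  # ends at 60
    {"name": "long_op",  "ts": 10, "dur": 55},  # ends at 65
]
# ORT orders events by non-decreasing *end* time.
by_end = sorted(events, key=lambda e: e["ts"] + e["dur"])
starts = [e["ts"] for e in by_end]
print(starts)                    # [50, 10]
print(starts == sorted(starts))  # False: start times are NOT sorted
```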

#### Naming conventions for ORT event IDs

- **C++ `EpProfiler` interface** (existing): Uses
`relative_ort_event_id` — a timestamp offset in microseconds relative to
profiling start.
- **C API `OrtEpProfilerImpl`** (new in this PR): Uses
`ort_event_correlation_id` — an absolute, epoch-based timestamp in
microseconds computed from `std::chrono::high_resolution_clock`
(platform-defined epoch). Unique across concurrent profiling sessions
within the same process.
- **Conversion**: The `PluginEpProfiler` bridge class (in
`ep_event_profiling.cc`) performs `ort_event_correlation_id =
relative_ort_event_id + profiling_start_time_epoch_us_`, mirroring the
pattern in `GPUTracerManager::PushCorrelation`.
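The conversion itself amounts to one addition; a minimal sketch with made-up timestamps:

```python
def to_correlation_id(relative_ort_event_id_us, profiling_start_epoch_us):
    # relative id: microseconds since profiling started
    # result: absolute, epoch-based id an EP can hand to CUPTI/ROCTracer
    return relative_ort_event_id_us + profiling_start_epoch_us

start_epoch_us = 1_700_000_000_000_000  # made-up epoch timestamp
print(to_correlation_id(2_500, start_epoch_us))  # 1700000000002500
```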

### New C APIs

| API | Description |
|-----|-------------|
| `CreateProfilingEvent` | Create a profiling event with category,
process/thread IDs, name, timestamp, duration, and key-value args |
| `ReleaseProfilingEvent` | Release a profiling event |
| `ProfilingEvent_GetCategory` | Get event category (`SESSION`, `NODE`,
`KERNEL`, `API`) |
| `ProfilingEvent_GetName` | Get event name |
| `ProfilingEvent_GetTimestampUs` | Get event start timestamp (µs) |
| `ProfilingEvent_GetDurationUs` | Get event duration (µs) |
| `ProfilingEvent_GetArgValue` | Get an event argument value by key |
| `ProfilingEventsContainer_AddEvents` | Append an array of EP events to
the output container |
| `OrtEp::CreateProfiler` | Returns an instance of the EP's profiler
implementation |
| `OrtEpProfilerImpl::StartProfiling` | Called by ORT to start a
profiling session. Receives elapsed time offset (ns) since ORT profiling
started |
| `OrtEpProfilerImpl::StartEvent` | Called by ORT to notify that an ORT
event has started. Receives an absolute `ort_event_correlation_id` |
| `OrtEpProfilerImpl::StopEvent` | Called by ORT to notify that an ORT
event has ended. Receives the same `ort_event_correlation_id` and ORT
event metadata |
| `OrtEpProfilerImpl::EndProfiling` | Called by ORT to end the profiling
session and collect EP events into the output container |
| `OrtEpProfilerImpl::Release` | Release the profiler instance |

### New C++ wrapper classes

| Class | Description |
|-------|-------------|
| `Ort::ConstProfilingEvent` | Non-owning const wrapper for reading
fields from an `OrtProfilingEvent` (e.g., in `StopEvent`) |
| `Ort::ProfilingEvent` | Owning wrapper that creates and manages an
`OrtProfilingEvent` (e.g., for `EndProfiling`) |
| `Ort::UnownedProfilingEventsContainer` | Non-owning wrapper for adding
events to an `OrtProfilingEventsContainer` during `EndProfiling` |

### Example EP profiling implementation
This PR updates an example plugin EP to use the new profiling APIs:
- Plugin EP code:
[test/autoep/library/example_plugin_ep_kernel_registry](https://github.com/microsoft/onnxruntime/tree/adrianl/PluginEp_ProfilingApis/onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry)
- `OrtEpProfilerImpl` implementation:
[ep_profiling.h](https://github.com/microsoft/onnxruntime/blob/adrianl/PluginEp_ProfilingApis/onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry/ep_profiling.h)
/
[ep_profiling.cc](https://github.com/microsoft/onnxruntime/blob/adrianl/PluginEp_ProfilingApis/onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry/ep_profiling.cc)
- `OrtEp::CreateProfiler()` implementation:
[ep.cc](https://github.com/microsoft/onnxruntime/blob/adrianl/PluginEp_ProfilingApis/onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry/ep.cc)

### Existing bugs found
Not fixed in this PR.

- The [merge algorithm used by CUDA
EP](https://github.com/microsoft/onnxruntime/blob/faad20f9d3264c7f3b6d4e4398990e13ee864512/include/onnxruntime/core/common/gpu_profiler_common.h#L391-L397)
**incorrectly** assumes ORT events are sorted by non-decreasing *start*
time, but they are actually sorted by [non-decreasing *end*
time](https://github.com/microsoft/onnxruntime/blob/faad20f9d3264c7f3b6d4e4398990e13ee864512/onnxruntime/core/common/profiler.cc#L91)
(also see
microsoft#13706 (comment)).
- Run profilers do not handle subgraphs (e.g., subgraph of a
control-flow operator). Has been the case since run profilers were
[introduced](microsoft#26846).

### Motivation and Context
Allows plugin EPs to generate profiling events, further closing the
functionality gap between provider-bridge EPs and plugin EPs.

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
…a layout on Avx512 (microsoft#27874)

### Description
Adds a special AVX512 kernel for depthwise conv with multiplier = 2.
These improve the performance of 3 costly conv operations (7x7 kernels)
in the MobileClip model by approximately 2.4x (MLAS benchmark numbers
below).

These are 3 ops with
1) Cin=64, Cout=128, group=64, H=64, W=64, kH=7, kW=7
2) Cin=128, Cout=256, group=128, H=32, W=32, kH=7, kW=7
3) Cin=256, Cout=512, group=256, H=16, W=16, kH=7, kW=7

These Conv operations cannot be dispatched to NCHWc because the Cout per
group is smaller than the block size. On AVX512 the block size is 16,
while the Cout per group is only 2. There is a special depthwise kernel
in the NCHWc suite, but it can only handle Cout per group = 1.
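For reference, the semantics of depthwise conv with channel multiplier = 2 (a scalar toy sketch with tiny made-up shapes, not the AVX512 kernel): each input channel c produces output channels 2c and 2c+1, each with its own kernel.

```python
def depthwise_conv2d_mult2(x, w):
    # x: [Cin][H][W], w: [Cin][2][kH][kW] -> out: [2*Cin][H-kH+1][W-kW+1]
    cin, h, wd = len(x), len(x[0]), len(x[0][0])
    kh, kw = len(w[0][0]), len(w[0][0][0])
    oh, ow = h - kh + 1, wd - kw + 1
    out = [[[0.0] * ow for _ in range(oh)] for _ in range(2 * cin)]
    for c in range(cin):
        for m in range(2):  # the multiplier dimension
            for i in range(oh):
                for j in range(ow):
                    acc = 0.0
                    for u in range(kh):
                        for v in range(kw):
                            acc += x[c][i + u][j + v] * w[c][m][u][v]
                    out[2 * c + m][i][j] = acc
    return out

x = [[[1.0, 2.0], [3.0, 4.0]]]  # Cin=1, 2x2 input
w = [[[[1.0]], [[2.0]]]]        # 1x1 kernels, multiplier 2
out = depthwise_conv2d_mult2(x, w)
print(out[0])  # [[1.0, 2.0], [3.0, 4.0]]  (kernel = scale 1)
print(out[1])  # [[2.0, 4.0], [6.0, 8.0]]  (kernel = scale 2)
```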

MLAS Benchmark Before and After comparison:

| Benchmark | BEFORE mean (ns) | AFTER mean (ns) | Speedup |
|---|---:|---:|---:|
| SCONV_NCHW G64 | 3,151,190 | 1,391,419 | 2.26x |
| SCONV_NCHW G128 | 1,646,040 | 824,654 | 2.00x |
| SCONV_NCHW G256 | 978,843 | 533,375 | 1.84x |
| SCONV_NCHW_THREADED G64 | 873,283 | 367,722 | 2.37x |
| SCONV_NCHW_THREADED G128 | 445,786 | 226,777 | 1.97x |
| SCONV_NCHW_THREADED G256 | 264,473 | 147,997 | 1.79x |

### Motivation and Context
Just by optimizing these 3 conv operations, MobileClip is about
700us-850us faster and the entire model is <14ms on an AVX512 machine.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
### Description
<!-- Describe your changes. -->

Fix int overflow issues in original implementation.

Add some additional tests.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Fix some int overflow issues.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@ankitm3k ankitm3k merged commit b20f392 into ovep-develop Apr 2, 2026
5 of 7 checks passed
@ankitm3k ankitm3k deleted the sync_msft_02042026 branch April 2, 2026 04:21