Sync with Microsoft ONNX Runtime - 02042026#1010

Merged
ankitm3k merged 31 commits into ovep-develop from sync_msft_02042026
Apr 2, 2026

Conversation

@ai-fw-intg

Automated daily backmerge from ORT main to ovep-develop. No conflicts detected. Do NOT squash or rebase - use merge commit only.

Copilot AI and others added 30 commits March 25, 2026 15:17
…#27342)

### Description

Moves the `--build_wasm_static_lib → --build_wasm` implication from
`build.py` into `build_args.py`'s post-processing, **before** the cmake
generator selection. Previously, `build_args.py` chose the generator
based on `args.build_wasm` (still `False`), and `build.py` only set it
to `True` afterwards—too late.

- **`tools/ci_build/build_args.py`**: Set `args.build_wasm = True` when
`args.build_wasm_static_lib` is set, prior to generator and
cross-compilation logic.
- **`tools/ci_build/build.py`**: Remove the now-redundant identical
check.

### Motivation and Context

Using `--build_wasm_static_lib` without `--build_wasm` caused cmake to
use the wrong generator (e.g., Visual Studio instead of Ninja on
Windows) and miss Emscripten-specific configuration, leading to build
failures like missing `libiconv`.
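
The ordering fix can be illustrated with a minimal argparse sketch (names and generator strings are illustrative, not the actual `build_args.py` code):

```python
import argparse

def parse_build_args(argv):
    """Illustrative sketch of argument post-processing ordering."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--build_wasm", action="store_true")
    parser.add_argument("--build_wasm_static_lib", action="store_true")
    args = parser.parse_args(argv)

    # Apply the implication BEFORE anything reads args.build_wasm,
    # e.g. cmake generator selection. Applying it later (as build.py
    # did) means the generator is chosen while build_wasm is still False.
    if args.build_wasm_static_lib:
        args.build_wasm = True

    generator = "Ninja" if args.build_wasm else "Visual Studio 17 2022"
    return args, generator
```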

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: fs-eire <7679871+fs-eire@users.noreply.github.com>
…n MatMulNBits (microsoft#27820)

### Description

Routes fp16 `HQNBIT_CompInt8` through the fp32 MLAS path
(`SQNBIT_CompInt8`) at the operator level for both 4-bit and 8-bit
MatMulNBits, then removes the ~370 lines of dead HQ CompInt8 wrapper
code from MLAS.

**Operator changes (matmul_nbits.cc):**
- PrePack: Uses `SQNBIT_CompInt8` for sizing/packing, pre-converts fp16
scales and bias to fp32, computes BZpCorr for asymmetric KleidiAI on
ARM64.
- ComputeBPacked: Bulk fp16→fp32 conversion of A, calls
`MlasQNBitGemmBatch<float>` with `SQNBIT_CompInt8`, bulk fp32→fp16
conversion of C.

**MLAS cleanup (qnbitgemm.cpp, qnbitgemm_kernel_neon.cpp):**
- Removed `HQ4BitGemm_CompInt8`, `HQ8BitGemm_CompInt8`,
`HQ8BitCompInt8PerGemmWorkspace`, associated enum values, dispatch
branches, workspace entries, and `HQNBIT_CompInt8` NEON kernel
conditions.
- Added `HQNBIT_CompInt8` → `SQNBIT_CompInt8` redirect in
`MlasIsQNBitGemmAvailable` for `GetComputeType<MLFloat16>`
compatibility.

### Motivation and Context

The HQ CompInt8 kernels are wrappers that convert fp16→fp32 per-tile
before calling the same SQ fp32 kernels. This change:
1. **Eliminates per-tile overhead** via bulk conversion at the operator
level.
2. **Enables KleidiAI for fp16 4-bit** — previously bypassed by the
`HQNBIT_CompInt8` path.
3. **Removes ~370 lines of dead wrapper code** from MLAS.
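
The operator-level change amounts to bulk conversion wrapped around the fp32 compute path; a rough NumPy sketch of the idea (illustrative only, not the MLAS code):

```python
import numpy as np

def fp16_matmul_via_fp32_kernel(a_fp16, b_fp32):
    # Bulk fp16 -> fp32 conversion of A in one pass over the whole
    # matrix, instead of converting each tile inside the kernel.
    a_fp32 = a_fp16.astype(np.float32)
    # The same fp32 compute path (SQNBIT_CompInt8 in MLAS) does the work.
    c_fp32 = a_fp32 @ b_fp32
    # Bulk fp32 -> fp16 conversion of the output C.
    return c_fp32.astype(np.float16)
```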

### Improvements
Measured on `Snapdragon X Elite - X1E78100 - Qualcomm Oryon CPU`

**Asymmetric:**

| Model | Seq Len | Acc1/Acc4 (before) | Acc1/Acc4 (after) | Acc4 speedup | Acc4 latency (after) |
|-------|---------|-------------------|------------------|--------------|----------------------|
| Qwen 1.5B | 256 | 1.28× | 1.55× | **1.26×** | 1187.5ms |
| Qwen 1.5B | 512 | 1.14× | 1.63× | **1.55×** | 2257.2ms |
| Qwen 3B | 256 | 1.32× | 1.82× | **1.29×** | 2351.3ms |
| Qwen 3B | 512 | 1.38× | 1.70× | **1.28×** | 4777.2ms |
| Qwen 7B | 256 | 1.58× | 2.26× | **1.40×** | 4094.5ms |
| Qwen 7B | 512 | 1.49× | 2.23× | **1.52×** | 8002.6ms |

**Symmetric:**

| Model | Seq Len | Acc1/Acc4 (before) | Acc1/Acc4 (after) | Acc4 speedup | Acc4 latency (after) |
|-------|---------|-------------------|------------------|--------------|----------------------|
| Qwen 1.5B | 256 | 0.95× | 1.45× | **1.67×** | 1255.5ms |
| Qwen 1.5B | 512 | 1.04× | 1.52× | **1.55×** | 2406.7ms |
| Qwen 3B | 256 | 1.39× | 1.88× | **1.32×** | 2215.0ms |
| Qwen 3B | 512 | 1.42× | 1.85× | **1.31×** | 4318.3ms |
| Qwen 7B | 256 | 1.66× | 2.58× | **1.55×** | 3564.4ms |
| Qwen 7B | 512 | 1.57× | 2.60× | **1.64×** | 7227.9ms |

**NOTE**: The 8-bit accuracy level 4 path shows some regression (5–25%
on 1.5B/3B models, neutral on 7B) due to the bulk fp16↔fp32 conversion
overhead replacing the old per-tile approach. The old HQ CompInt8
wrappers kept small tiles cache-hot, while the new unified path does
full-matrix conversion passes. This trade-off is acceptable since 4-bit
is the dominant quantization format (gaining 26–67%), 8-bit acc4 still
outperforms acc1 by 1.7–2.2×, and the regression is most pronounced at
smaller model sizes where absolute latencies are already low. A proper
fix would be 8-bit KleidiAI-style kernels rather than restoring the
wrapper code.
…rt. (microsoft#27825)

### Description
Support for AArch64 SME intrinsics was added in MSVC version 19.40,
while ONNX Runtime's stated minimum supported Visual Studio 2022 version
predates 19.40.

This patch modifies cmake/CMakeLists.txt to check the MSVC version when
MSVC is the target compiler; for versions earlier than 19.40, KleidiAI
is disabled in the build.
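
The version gate reduces to a small predicate (hypothetical helper, not the actual CMake logic):

```python
def kleidiai_supported(msvc_version):
    # AArch64 SME intrinsics landed in MSVC 19.40, so older MSVC builds
    # must disable KleidiAI. msvc_version is a (major, minor) tuple, or
    # None for a non-MSVC target compiler (hypothetical convention).
    if msvc_version is None:
        return True
    return msvc_version >= (19, 40)
```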

### Motivation and Context
This issue was raised when cross compiling 1.24 for Windows on Arm.
microsoft#27304

---------

Signed-off-by: Colm Donelan <coldon01@e135129.arm.com>
Co-authored-by: Colm Donelan <coldon01@e135129.arm.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
)

### Description
Enable ccache and vcpkg caching for Linux workflows that use
`reusable_linux_build.yml`. Saves ~15–20 minutes on a 100% cache hit.
Also parallelizes tests, saving ~6 minutes.

Additionally, enable vcpkg and ccache for other Linux workflows. No
numbers available for comparison.

### Motivation and Context

This change reduces wasted CO2 and time.

### Known Issues

Benign: the Android workflow doesn't seem to be populating its ccache.
### Description
See below



### Motivation and Context
Summary: The vulnerability lies in ONNX Runtime's validate_package.py
script, which uses unsanitized string concatenation with os.system() to
construct shell commands. This allows attackers to inject arbitrary
shell commands via the --package_name argument, leading to potential
remote code execution. The issue affects the release validation
pipeline, which operates with elevated privileges, exposing sensitive
credentials and secrets. The root cause is the lack of input
sanitization and the use of os.system() for command execution.

Affected code locations:

tools/nuget/validate_package.py line 241: os.system("tar zxvf " +
package_name)
tools/nuget/validate_package.py line 339: os.system("copy " +
full_nuget_path + " " + nupkg_copy_name)
Suggested fix: Replace os.system() with subprocess.run() using argument
lists (no shell interpolation):

```python
import shutil
import subprocess

# Instead of: os.system("tar zxvf " + package_name)
subprocess.run(["tar", "zxvf", package_name], check=True)

# Instead of: os.system("copy " + full_nuget_path + " " + nupkg_copy_name)
shutil.copy2(full_nuget_path, nupkg_copy_name)
```
Align maxStorageBufferBindingSize down to the nearest multiple of
minStorageBufferOffsetAlignment after querying device limits. This
ensures that when large buffers are split into segments, each segment's
byte offset satisfies WebGPU's bind group offset alignment requirement
(typically 256 bytes).
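
The alignment itself is simple integer arithmetic; a small sketch (the binding-size value below is hypothetical, not a real device limit):

```python
def align_down(value, alignment):
    # Round down to the nearest multiple of `alignment`, so every
    # segment offset derived from `value` satisfies the bind group
    # offset alignment requirement.
    return (value // alignment) * alignment

MIN_STORAGE_BUFFER_OFFSET_ALIGNMENT = 256  # typical WebGPU limit
queried_max_binding_size = 134217730       # hypothetical device limit

usable = align_down(queried_max_binding_size,
                    MIN_STORAGE_BUFFER_OFFSET_ALIGNMENT)
```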
### Description

This PR updates the pattern matchings to perform multi-head attention
fusion for the conformer encoder inside [Nemotron
speech](https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b).

<img width="550" height="976" alt="image"
src="https://github.com/user-attachments/assets/a194308e-ce69-4128-9389-aae2a64b312f"
/>

### Motivation and Context

These changes allow the `MultiHeadAttention` op to appear in the encoder
ONNX model.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
…t#27823)

### Description
DmlOperatorQuantization21 was missing the tensor reshaping logic that
the older DmlOperatorElementwiseQLinear already had.

Scalar scale tensors get padded to 4D, but a 5D input stays 5D. DML
rejects the dimension mismatch with E_INVALIDARG, and the resulting
exception unwind triggers a sized-delete bug in WRL's MakeAllocator
which address sanitizer detects. The fix is to port the same logic from
the DmlOperatorElementwiseQLinear into this path, so that the dimensions
match.

### Motivation and Context
This is required to ensure the DML EP correctly handles this scenario.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
### Description
This change tries to address a problem in the DML EP where AlignToPow2
rounded up tensorByteSize to a 4-byte boundary before the data was read
from the source buffer. This caused CreateCpuResource, CreateResource,
WriteToFile, and the inputRawData vector construction to read 1–3 bytes
past the end of the original tensor data.

CreateResource and CreateCpuResource already independently align the
D3D12 resource descriptor size, so they work correctly with the original
(unaligned) byte count. The fix is to move the alignment to the location
where it's needed.

### Motivation and Context
This is required because it addresses a crash / incorrect behavior in
the DML EP.
…ft#27595)

This pull request introduces support for node "layering annotations" and
improves resource accounting and memory management during graph
partitioning in ONNX Runtime. The changes add new mechanisms for
annotating nodes, filtering nodes by annotation during partitioning, and
efficiently accounting for resources in fused nodes. Several APIs are
extended to support these features, and new configuration options are
introduced to guide layer assignment.

**Layering annotations & partitioning:**

* Added `layering_annotation_` member and associated getter/setter/clear
methods to the `Node` class, allowing nodes to be annotated for layer
assignment. Also added a method to clear these annotations after
partitioning to save memory. (`include/onnxruntime/core/graph/graph.h`)
[[1]](diffhunk://#diff-aaea1507ec81a94c72a1fa72ce320df712156b665f7798573be3f7e439bb4c37R177-R184)
[[2]](diffhunk://#diff-aaea1507ec81a94c72a1fa72ce320df712156b665f7798573be3f7e439bb4c37R266-R272)
[[3]](diffhunk://#diff-aaea1507ec81a94c72a1fa72ce320df712156b665f7798573be3f7e439bb4c37R702-R703)
* Extended the graph partitioning logic to support filtering nodes by
their layering annotation using a `LayeringIndex`, ensuring only nodes
matching the current execution provider's assignment are considered
during partitioning. (`onnxruntime/core/framework/graph_partitioner.cc`)
[[1]](diffhunk://#diff-e2d3910ae7593ee7ba4fd74e53f738fa973ae2fc32c069f1088ba458b91f8d4bR155)
[[2]](diffhunk://#diff-e2d3910ae7593ee7ba4fd74e53f738fa973ae2fc32c069f1088ba458b91f8d4bR199-R286)
[[3]](diffhunk://#diff-e2d3910ae7593ee7ba4fd74e53f738fa973ae2fc32c069f1088ba458b91f8d4bL244-R357)
[[4]](diffhunk://#diff-e2d3910ae7593ee7ba4fd74e53f738fa973ae2fc32c069f1088ba458b91f8d4bL433-R545)
[[5]](diffhunk://#diff-e2d3910ae7593ee7ba4fd74e53f738fa973ae2fc32c069f1088ba458b91f8d4bL451-R564)
[[6]](diffhunk://#diff-e2d3910ae7593ee7ba4fd74e53f738fa973ae2fc32c069f1088ba458b91f8d4bL477-R591)
* Added a new session option `kOrtSessionOptionsLayerAssignmentSettings`
to configure layer assignment using annotation prefixes per device.
(`include/onnxruntime/core/session/onnxruntime_session_options_config_keys.h`)

**Resource accounting improvements:**

* Improved the `IResourceAccountant` interface to allow resetting and
committing pending weights per node, and updated resource accounting
logic to correctly sum and commit costs for all constituent nodes in
fused nodes, preventing double-counting or undercounting.
(`include/onnxruntime/core/framework/resource_accountant.h`,
`include/onnxruntime/core/graph/indexed_sub_graph.h`,
`onnxruntime/core/framework/graph_partitioner.cc`)
[[1]](diffhunk://#diff-7b1c9ef14536f9a66ed370cb729b6609d12c5907b460d8f145a7ad5a401e0fb6L48-R72)
[[2]](diffhunk://#diff-3f09a80586759ee33e272477c3eb96f28d9b37f1e8251d13f1211c0450945135L89-R114)
[[3]](diffhunk://#diff-e2d3910ae7593ee7ba4fd74e53f738fa973ae2fc32c069f1088ba458b91f8d4bL391-L397)

**API and code organization:**

* Updated the `Graph` class and related APIs to propagate layering
annotations during function inlining and to provide a method for
removing all layering annotations after partitioning.
(`include/onnxruntime/core/graph/graph.h`)
[[1]](diffhunk://#diff-aaea1507ec81a94c72a1fa72ce320df712156b665f7798573be3f7e439bb4c37R1341-R1346)
[[2]](diffhunk://#diff-aaea1507ec81a94c72a1fa72ce320df712156b665f7798573be3f7e439bb4c37R1590-R1594)
* Moved the `CreateAccountants` function out of the `NodeStatsRecorder`
class to the namespace level for clarity.
(`include/onnxruntime/core/framework/resource_accountant.h`)

These changes enable more flexible and memory-efficient graph
partitioning, particularly for scenarios involving hardware-specific
layer assignments and dynamic resource constraints.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…icrosoft#27699)

### Description
If the ONNX file is malformed, it could lead to an incorrect memory
access. This change adds validation to ensure that does not happen.



### Motivation and Context
security issue

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
…icrosoft#27778)

This PR is on top of a previous PR and fixes the remaining issues.
microsoft#27706

All tests here should now be passing over WebGPU:

https://wpt.live/webnn/conformance_tests/dequantizeLinear.https.any.html?gpu

---------

Co-authored-by: edgchen1 <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
)

### Description

Add a pre-check for zero values in the divisor tensor for integral types
in `Mod`. Returns an error `Status` instead of hitting undefined
behavior (SIGFPE / structured exception).

- **`element_wise_ops.cc`**: Added `CheckZeroDivisorImpl` as a single
template struct in the `mod_internal` namespace using `if constexpr
(std::is_integral<T>::value)` to guard the check — no-op for non-integer
types. The struct's `operator()` returns `Status` (via `ORT_RETURN_IF`)
and is dispatched with `InvokeRet<Status>`. When the divisor is a
constant initializer, `TryGetConstantInput` validates for zeros once at
kernel creation time in the out-of-line constructor (using
`ORT_THROW_IF_ERROR`), avoiding per-`Compute` overhead. A
`divisor_is_validated_constant_` flag tracks whether the one-time check
was performed. In `Compute`, non-constant divisors are scanned via the
type dispatcher (using `ORT_RETURN_IF_ERROR`) before calling
`CallModImpl`, skipping the check when the constant was already
validated. The Mod constructor is defined out-of-line after the
`mod_internal` namespace to keep it contiguous.
- **`element_wise_ops_test.cc`**: Added `Mod_int8_by_zero`,
`Mod_int32_by_zero`, `Mod_int64_by_zero_scalar` tests covering tensor
and scalar divisor cases, plus `Mod_int32_by_zero_constant_initializer`
to exercise the `TryGetConstantInput` constructor path with
`is_initializer = true`.

### Motivation and Context

Integer modulo by zero is UB in C++ and causes a hardware exception that
crashes the process. Float types produce NaN naturally via `std::fmod`,
but int8/int16/int32/int64/uint* types do not. This is the same class of
issue that was fixed for the `Div` operator in microsoft#27693, now applied to
the `Mod` operator.
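
The shape of the pre-check can be sketched in Python (illustrative; the real implementation is the C++ `CheckZeroDivisorImpl` described above):

```python
def check_zero_divisor(divisor, is_integral):
    # Only integral element types are guarded: integer modulo by zero
    # is UB in C++ and raises a hardware exception, while float types
    # naturally produce NaN via std::fmod, so they pass through.
    if is_integral and any(v == 0 for v in divisor):
        raise ValueError("Mod: integer divisor tensor contains a zero")
```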

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: edgchen1 <18449977+edgchen1@users.noreply.github.com>
## Description

Adds per-session thread pool work callbacks, allowing callers to hook
into the enqueue/start/stop/abandon lifecycle of thread pool work items.
The feature is gated behind a build flag
(`--enable_session_threadpool_callbacks`) with zero overhead when
disabled.

## API additions

- C API: `OrtApi::SetPerSessionThreadPoolCallbacks` — stores an
`OrtThreadPoolCallbacksConfig` on the `OrtEnv`, applied to per-session
thread pools
- C++ wrapper: `Ort::Env::SetPerSessionThreadPoolCallbacks`
- Versioned C config struct `OrtThreadPoolCallbacksConfig` with fields:
`on_enqueue`, `on_start_work`, `on_stop_work`, `on_abandon`,
`user_context`
- Four callback typedefs: `OrtThreadPoolWorkEnqueueFn`,
`OrtThreadPoolWorkStartFn`, `OrtThreadPoolWorkStopFn`,
`OrtThreadPoolWorkAbandonFn`

## Implementation

- `EigenNonBlockingThreadPool.h`: Introduced a policy-based design with
two compile-time callback policies:
- `WorkNoCallbackPolicy`: `Work = std::function<void()>`, all callback
methods are trivial inlines eliminated by the compiler. Zero overhead
for non-callback builds.
- `WorkWithCallbackPolicy`: `Work = WorkItem` bundling tasks with
callback data; invokes user callbacks around task execution via
`MakeWork`/`Execute`/`OnEnqueue`/`OnAbandon` methods.
- `ThreadPoolTempl<Environment, CallbackPolicy>` uses the policy for all
callback-related operations.
- `RunQueue::RevokeWithTag` calls `policy_->OnAbandon(e.w)` on
successful revocation; the policy implementation decides whether to
invoke user callbacks.
- `threadpool.h`: `extended_eigen_threadpool_` changed to
`unique_ptr<ExtendedThreadPoolInterface>` for type erasure across policy
instantiations. `EnableSpinning`/`DisableSpinning` added to the virtual
interface.
- `threadpool.cc`: Single `#ifdef` selects policy at `ThreadPoolTempl`
instantiation.
- `environment.h/.cc`: Added
`SetPerSessionWorkCallbacks`/`GetPerSessionWorkCallbacks` on
`Environment`.
- `inference_session.cc`: Propagates callbacks from `Environment` to
per-session thread pool options.
- `thread_utils.h/.cc`: Added callback fields to `OrtThreadPoolParams`
and wiring in `CreateThreadPoolHelper`.
- `env.h`: `OrtThreadPoolCallbacksConfig*` pointer in `ThreadOptions`.

## Build

- CMake option `onnxruntime_ENABLE_SESSION_THREADPOOL_CALLBACKS`;
`build.py` argument `--enable_session_threadpool_callbacks`

## Tests

- 8 callback-specific tests: Schedule, OnEnqueueOnly, NoCallbacks,
ParallelFor, ParallelSection, Abandon, EnqueueReturnsNull,
NoEnqueueWithStartStop
- End-to-end C API test (`SetPerSessionThreadPoolCallbacks` via
ModelBuilder with 1M-element Mul)
- All 73 existing ThreadPool tests pass unchanged with both
callback-enabled and callback-disabled builds (81/81 and 73/73
respectively)

## Motivation and Context

Thread pool work callbacks enable telemetry, tracing, and resource
management by providing visibility into when work is enqueued, executed,
and abandoned in per-session thread pools. This is needed for production
diagnostics and performance instrumentation scenarios.
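
The enqueue/start/stop wrapping can be sketched with a plain Python thread pool (hypothetical names; the real API is the C `OrtThreadPoolCallbacksConfig` struct described above):

```python
from concurrent.futures import ThreadPoolExecutor

class WorkCallbacks:
    """Hypothetical analogue of the OrtThreadPoolCallbacksConfig fields."""
    def __init__(self, on_enqueue=None, on_start=None, on_stop=None):
        noop = lambda: None
        self.on_enqueue = on_enqueue or noop
        self.on_start = on_start or noop
        self.on_stop = on_stop or noop

def submit_with_callbacks(pool, callbacks, task):
    # Wrap the task so start/stop fire around execution, mirroring the
    # WorkWithCallbackPolicy behaviour; with no callbacks configured the
    # wrapper reduces to the bare task (the zero-overhead policy).
    def wrapped():
        callbacks.on_start()
        try:
            return task()
        finally:
            callbacks.on_stop()
    callbacks.on_enqueue()
    return pool.submit(wrapped)
```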

---------

Co-authored-by: Siyuan Peng <siyuanpeng@microsoft.com>
…icrosoft#27834)

Use tile_size_k_vec=32 (instead of 16) for MatMulNBits default kernel,
doubling the number of threads working on K-dimension reduction per
output row. This improves token generation throughput by ~3% on NVIDIA
GPUs by better utilizing memory bandwidth.

Intel devices retain tile_size_k_vec=16 due to different subgroup and
cache characteristics.

Changes:
- matmul_nbits.h: Add tile_size_k_vec parameter (default 16) to
MatMulNBitsProgram constructor.
- matmul_nbits.cc: Select tile_size_k_vec=32 for non-Intel vendors, pass
to program constructor.
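
The vendor-based selection reduces to a one-line predicate (illustrative sketch, not the actual C++ code):

```python
def select_tile_size_k_vec(adapter_vendor):
    # Doubling the K-dimension tile doubles the threads reducing along K
    # per output row, which better uses memory bandwidth on NVIDIA GPUs;
    # Intel keeps 16 due to different subgroup and cache characteristics.
    return 16 if adapter_vendor.lower() == "intel" else 32
```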
### Description

Run-level profiling (introduced in PR microsoft#26846) does not currently capture
profiling events for operators inside subgraphs. This PR fixes that by
threading the `run_profiler` pointer through `OpKernelContextInternal`
to subgraph execution, following the same pattern as `terminate_flag`.

### Root Cause

`utils::ExecuteSubgraph()` had no `run_profiler` parameter and always
passed `nullptr` to `ExecuteGraphImpl`, so nested operators (inside If,
Loop, Scan, BeamSearch, GreedySearch) were never profiled at the run
level.

### Fix

1. **`OpKernelContextInternal`** — Added `run_profiler_` member and
`GetRunProfiler()` accessor.
2. **`SessionScope` / `ExecuteKernel()`** — Pass the run profiler into
`OpKernelContextInternal`.
3. **`ExecuteSubgraph()`** — Added `profiling::Profiler* run_profiler =
nullptr` parameter, forwarded to `ExecuteGraphImpl()`.
4. **Control flow ops** (`if.cc`, `loop.cc`, `scan_utils.cc`) — Pass
`context_.GetRunProfiler()` to `ExecuteSubgraph()`.
5. **Contrib transformer ops** (`beam_search_impl_gpt.h`,
`beam_search_impl_t5.h`, `beam_search_impl_whisper.h`,
`greedy_search_impl_gpt.h`) — All 8 `ExecuteSubgraph()` call sites
updated to pass `this->context_.GetRunProfiler()`.

Plugin EP control flow kernels (`PluginEpIfKernelImpl`, etc.) delegate
to the same internal kernels, so the fix propagates automatically.
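
The fix pattern, forwarding the profiler instead of dropping it at subgraph boundaries, can be sketched as (illustrative Python, not the ORT code):

```python
def execute_graph(nodes, run_profiler=None):
    # Minimal model of nested graph execution: each executed node is
    # recorded, and control-flow nodes recurse into their subgraph.
    for node in nodes:
        if run_profiler is not None:
            run_profiler.append(node["name"])
        # The bug was equivalent to passing run_profiler=None here, so
        # ops inside If/Loop/Scan subgraphs never reached the profiler.
        execute_graph(node.get("subgraph", []), run_profiler=run_profiler)
```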

### Tests

- **`CheckRunProfilerWithSubgraph`** (`inference_session_test.cc`) —
Runs `if_mul.onnx`, enables run profiling, asserts `mul_0` (inside If's
then-branch) appears in the profile JSON.
- **`CheckRunProfilerWithBeamSearch`** (`beam_search_test.cc`) — Runs
`tiny_gpt2_beamsearch.onnx`, enables run profiling, asserts decoder
subgraph Node entries (beyond the top-level BeamSearch op) appear in the
profile JSON.

### Files Changed (12 files)

| File | Change |
|------|--------|
| `core/framework/op_kernel_context_internal.h` | Added `run_profiler_` member, `GetRunProfiler()`, constructor param |
| `core/framework/sequential_executor.cc` | `SessionScope::GetRunProfiler()`, pass to `OpKernelContextInternal` |
| `core/framework/utils.h` / `utils.cc` | `run_profiler` param on `ExecuteSubgraph()` |
| `core/providers/cpu/controlflow/if.cc` | Forward `GetRunProfiler()` |
| `core/providers/cpu/controlflow/loop.cc` | Forward `GetRunProfiler()` |
| `core/providers/cpu/controlflow/scan_utils.cc` | Forward `GetRunProfiler()` |
| `contrib_ops/cpu/transformers/beam_search_impl_gpt.h` | 2 call sites |
| `contrib_ops/cpu/transformers/beam_search_impl_t5.h` | 2 call sites |
| `contrib_ops/cpu/transformers/beam_search_impl_whisper.h` | 2 call sites |
| `contrib_ops/cpu/transformers/greedy_search_impl_gpt.h` | 2 call sites |
| `test/framework/inference_session_test.cc` | `CheckRunProfilerWithSubgraph` test |
| `test/contrib_ops/beam_search_test.cc` | `CheckRunProfilerWithBeamSearch` test |
### Description

Replace `actions/cache@v4` with `actions/cache@v5`.

### Motivation and Context

`actions/cache@v4` uses node 20, which is deprecated.
This pull request introduces a new synchronization API for plugin
execution providers (EPs) in ONNX Runtime, and adds comprehensive test
infrastructure to verify its usage. The main theme is enabling EPs to
synchronize device operations, which is particularly important for IO
binding and async execution scenarios. The changes also update the test
framework to support and validate this new capability.

**Synchronization API for Plugin EPs:**

* Added a new optional `Sync` method to the `OrtEp` C API interface,
allowing EPs to block until all preceding device tasks are complete.
This is primarily used by IO binding to ensure device inputs are ready
before execution.
(`include/onnxruntime/core/session/onnxruntime_ep_c_api.h`)
* Implemented the `Sync` method in the example plugin EP, with a test
hook that increments a counter for verification purposes.
(`onnxruntime/test/autoep/library/example_plugin_ep/ep.cc`,
`onnxruntime/test/autoep/library/example_plugin_ep/ep.h`)
[[1]](diffhunk://#diff-60ddcfdf7fe7273a7f06c4c1eb39933737e6fe8c2f00bdf2e5f49c2d1f911fa4R187)
[[2]](diffhunk://#diff-60ddcfdf7fe7273a7f06c4c1eb39933737e6fe8c2f00bdf2e5f49c2d1f911fa4R589-R601)
[[3]](diffhunk://#diff-5e9391ab7d2d558c5fa992b5fc373add5c52225aa43ce1af323ffbd8c2b86733R105-R106)

**Test Infrastructure and Verification:**

* Added test hooks (`ExampleEpTestHooks_ResetSyncCount`,
`ExampleEpTestHooks_GetSyncCount`) to the example plugin EP, allowing
tests to reset and retrieve the sync call count.
(`onnxruntime/test/autoep/library/example_plugin_ep/ep_test_hooks.h`,
`onnxruntime/test/autoep/library/example_plugin_ep/ep_test_hooks.cc`)
[[1]](diffhunk://#diff-a587d529618260bec7cbecf107513dacb795fff9fb34ae99c3a2db36bdcc8befR1-R23)
[[2]](diffhunk://#diff-7123fbca69d2580f0483d6589817e275c05b086c1fb56281a83f0fb895bdc06fR1-R11)
* Updated test execution logic to load these hooks dynamically and
verify that the `Sync` method is called exactly once during inference
with IO binding. (`onnxruntime/test/autoep/test_execution.cc`)
[[1]](diffhunk://#diff-3e289607015487374dcf7d9ab1d73a2ca3c3e5a44cab5958e4334afcdd5f4e28R299-R358)
[[2]](diffhunk://#diff-3e289607015487374dcf7d9ab1d73a2ca3c3e5a44cab5958e4334afcdd5f4e28R1099-R1119)

**Plugin EP Interface Updates:**

* Extended the `PluginExecutionProvider` C++ interface to support the
new `Sync` method, delegating to the plugin EP if implemented.
(`onnxruntime/core/session/plugin_ep/ep_plugin_provider_interfaces.h`,
`onnxruntime/core/session/plugin_ep/ep_plugin_provider_interfaces.cc`)
[[1]](diffhunk://#diff-db92123bb63f8b1cc0a776ba3dcad95118826d031c8f65e79969cfaddb8c3e0aR117-R118)
[[2]](diffhunk://#diff-6dac10650c4e1c5a55b95378173b33e95b300bf7c2350d8476088693b98652a5R632-R638)

**Performance Test Framework Enhancements:**

* Added logic to detect if a plugin EP uses an NVIDIA GPU device,
enabling CUDA IO binding automatically in performance tests when
appropriate. (`onnxruntime/test/perftest/common_utils.cc`,
`onnxruntime/test/perftest/utils.h`,
`onnxruntime/test/perftest/ort_test_session.cc`)
[[1]](diffhunk://#diff-2b8b7de0106a523d40c40f901f6ff170bff722b0c147fbfec36b269e21c9526bR203-R221)
[[2]](diffhunk://#diff-228a0b2557ae67945d94db8f9e74bb523517c2aa738db91fcfdda0958fa65f6cR40-R41)
[[3]](diffhunk://#diff-f3908cf4d2982fd2b008beb8c951d9cab67b28cb07077493b8d1dd8b448d6e18R98-R108)
* Ensured that async execution is used in performance tests with IO
binding, relying on the new synchronization mechanism.
(`onnxruntime/test/perftest/ort_test_session.cc`)
[[1]](diffhunk://#diff-f3908cf4d2982fd2b008beb8c951d9cab67b28cb07077493b8d1dd8b448d6e18L57)
[[2]](diffhunk://#diff-f3908cf4d2982fd2b008beb8c951d9cab67b28cb07077493b8d1dd8b448d6e18R66-R69)

These changes collectively improve device synchronization support for
plugin EPs and provide robust testing to ensure correct behavior.

This pull request also introduces support for synchronizing plugin
execution providers, especially for NVIDIA GPU devices, and refines the
logic for CUDA I/O binding in performance tests. The main changes
include adding a new `Sync` API for execution providers, updating the
plugin EP interface to use this API, and improving test session
configuration for CUDA devices.

### API and Interface Updates

* Added a new optional `Sync` method to the `OrtEp` struct in
`onnxruntime_ep_c_api.h`, allowing execution providers to block until
all device tasks are complete. This is primarily used to ensure inputs
are copied to the device before execution starts.
* Implemented the `Sync` method in the `PluginExecutionProvider` class
and its interface, enabling plugin EPs to support device synchronization
if available.
[[1]](diffhunk://#diff-6dac10650c4e1c5a55b95378173b33e95b300bf7c2350d8476088693b98652a5R632-R638)
[[2]](diffhunk://#diff-db92123bb63f8b1cc0a776ba3dcad95118826d031c8f65e79969cfaddb8c3e0aR117-R118)

### Performance Test Improvements

* Added a utility function `UsesNvidiaDevice` to detect if any
registered plugin EP uses an NVIDIA GPU device, improving test
configuration logic.
[[1]](diffhunk://#diff-2b8b7de0106a523d40c40f901f6ff170bff722b0c147fbfec36b269e21c9526bR203-R221)
[[2]](diffhunk://#diff-228a0b2557ae67945d94db8f9e74bb523517c2aa738db91fcfdda0958fa65f6cR40-R41)

## Description

This PR adds a standalone CUDA Plugin Execution Provider
(`CudaPluginExecutionProvider`) built as a dynamically loadable shared
library (`libonnxruntime_providers_cuda_plugin.so`) on top of the ORT EP
Plugin API. The implementation reuses the existing CUDA kernel stack
through adapter/shim layers (force-included headers and macro-based
registration overrides), eliminating the need to maintain a parallel
copy of 100+ CUDA kernels. CUDA Graph capture/replay is intentionally
deferred until the plugin-facing EP API exposes the required session
callbacks.

## Summary of Changes

### Build system and CMake

| File | Change |
|------|--------|
| `cmake/CMakeLists.txt` | Adds `onnxruntime_BUILD_CUDA_EP_AS_PLUGIN`
build option, records plugin build info, and includes the
plugin-specific CMake file. |
| `cmake/onnxruntime_providers_cuda_plugin.cmake` | **New.** Defines the
plugin shared-library target: collects `.cc`/`.cu` sources from
`core/providers/cuda/` and `contrib_ops/cuda/`, applies exclusion
filters for incompatible files (tunable, controlflow, registration
tables), force-includes adapter headers, and links CUDA/cuDNN/ORT
components. |
| `cmake/onnxruntime_providers_cuda.cmake` | Minor additions to expose
include paths needed by plugin builds. |
| `cmake/onnxruntime_unittests.cmake` | Enables dynamic plugin EP usage
in provider tests and fills in missing CUDA include/link settings for
the plugin configuration. |
| `cmake/external/cuda_configuration.cmake` | Adds CUDA configuration
support for the plugin build path. |

### Plugin runtime implementation (new files)

| File | Purpose |
|------|---------|
| `plugin/cuda_ep_factory.cc/.h` | Implements `OrtEpFactory` — device
enumeration, session-option parsing, allocator registration, kernel
registry creation, and all static C-compatible plugin callbacks.
Thread-safe lazy kernel registry initialization. |
| `plugin/cuda_ep.cc/.h` | Plugin-side CUDA EP object deriving from
`ep::adapter::Ep`. Carries session-specific `Config` (NHWC preference,
TF32, cuDNN algorithm selection, convolution workspace, attention
kernels). |
| `plugin/cuda_allocator_plugin.cc/.h` | Plugin allocators for device
and pinned memory, exposed through the EP API. |
| `plugin/cuda_stream_plugin.cc/.h` | Plugin-owned CUDA stream, cuBLAS,
cuBLASLt, and cuDNN handle management. Provides two stream adapter modes
(`PluginStreamShim` for `.cc`, `OrtStreamAdapter` for `.cu`/`.cc`
contexts). |
| `plugin/cuda_data_transfer_plugin.cc/.h` | Data transfer bridge for
host↔device copies used by plugin-backed tensors and Python bindings. |
| `plugin/cuda_memcpy_plugin.cc` | MemcpyToHost / MemcpyFromHost kernel
implementations for the plugin path. |
| `plugin/cuda_controlflow_plugin.cc/.cu/.h` | Plugin-native `If`,
`Loop`, and `Scan` wrappers that delegate to `OrtEpApi` control-flow
hooks instead of inheriting from in-tree CPU base implementations. |
| `plugin/cuda_plugin_ep.cc` | Exports the DLL entry points
(`OrtCreateEpFactory` / `OrtReleaseEpFactory`) used by ORT to create and
release the CUDA EP factory. |
| `plugin/cuda_kernel_adapter.h` | **Core shim** (1088 lines). Provides
`CudaKernel` base class, error-return macros, type helpers
(`ToCudaType`), handle-management abstractions, and stream adapters.
Force-included in all plugin `.cc` files to transparently adapt existing
kernel code. |
| `plugin/cuda_plugin_kernels.cu/.h` | Aggregates self-registered kernel
definitions via `PluginKernelCollector` macro overrides, replacing the
centralized registration tables used in the bundled build. |
| `plugin/cuda_plugin_utils.h` | Shared utility helpers for the plugin
(logging, error checking, config parsing). |
| `plugin/provider_api_shims.cc` | Stub implementations for
shared-provider bridge functions that are not needed in the plugin path. |
| `plugin/cuda_plugin_ep_symbols.def` | Windows symbol export
definitions for the plugin DLL. |

### EP adapter and API extensions

| File | Change |
|------|--------|
| `include/onnxruntime/ep/api.h` | Makes plugin API initialization
thread-safe; preserves access to ORT, EP, and model editor API tables
during plugin loading. |
| `include/onnxruntime/ep/adapter/node.h` | Adds node metadata accessors
(operator domain, optional-output handling) needed by reused CUDA
kernels. |
| `include/onnxruntime/ep/adapter/op_kernel.h` | Adds
`RequiredInput`/`RequiredOutput` helpers and adapter fixes so existing
CUDA kernels run against plugin adapter contexts. |
| `include/onnxruntime/ep/adapter/op_kernel_info.h` | Extends adapter
kernel-info with attribute and config accessors required by migrated
kernels. |
| `include/onnxruntime/ep/adapter/allocator.h` | Minor allocator adapter
adjustments for plugin compatibility. |
| `include/onnxruntime/ep/adapter/kernel_def_builder.h` | Adds kernel
definition builder hooks for plugin registration. |
| `include/onnxruntime/core/framework/tensor.h` | Restores a plugin-only
`Tensor::Create` compatibility path for kernels relying on the older
static factory form. |
| `onnxruntime/core/providers/shared_library/provider_api.h` | Turns the
shared-provider bridge into a no-op for plugin builds so the EP adapter
facade owns type resolution. |

### CUDA kernel compatibility migration

- Adapts ~80 core CUDA and contrib CUDA kernel source files to compile
under the plugin build via macro-based registration overrides and
targeted compatibility fixes (not operator rewrites).
- Moves or templates reusable helper logic in shared CPU/CUDA headers
(`ConstantOfShapeBase`, `PadBase`, `SliceBase`, `SplitBase`,
`ScatterND`, `UpsampleBase`, `DeformConvAttributes`) so kernels compile
in adapter mode.
- Key contrib kernel adaptations: attention variants (MHA, GQA, paged,
sparse, packed), skip-layer-norm, group-norm, MoE, fused-conv, inverse,
bias-dropout, matmul-nbits, qordered ops.
- Key core kernel adaptations: softmax, topk, conv/conv-transpose,
batch-norm, instance-norm, pool, RNN, reduction, einsum, matmul, cumsum,
identity, pad, split, scatter-nd, slice, upsample, tile, unsqueeze,
gather-nd, concat, dropout, non-max-suppression.

### Python integration

| File | Change |
|------|--------|
| `onnxruntime/python/onnxruntime_pybind_module.cc` | Extends
`get_available_providers()` to surface dynamically registered plugin EPs
discovered from `OrtEpDevice` enumeration. |
| `onnxruntime/python/onnxruntime_pybind_state.cc` | Allows Python
session creation to instantiate providers from registered plugin EP
devices, including `device_id` selection, instead of only built-in or
legacy dynamic-load EP paths. |
| `onnxruntime/python/onnxruntime_pybind_schema.cc` | Adds schema query
support for plugin-registered operators. |

### Testing and validation

| File | Change |
|------|--------|
| `test/python/transformers/test_cuda_plugin_ep.py` | **New** (1861
lines). Comprehensive test suite covering 5 stages: registration, ONNX
ops, NHWC layout preference, contrib ops, and op-level validation. |
| `test/python/transformers/cuda_plugin_ep_helper.py` | **New** (192
lines). Utility for transparently routing existing tests to the plugin
EP. |
| `test/python/transformers/test_gqa.py` | Fixes `total_sequence_length`
tensor placement from CUDA to CPU (was causing failures under the plugin
EP's stricter memory layout); routes tests through plugin EP. |
| `test/python/transformers/test_moe_cuda.py` | Routes through plugin EP
when available. |
| `test/framework/dynamic_plugin_ep_test.cc` | **New** (120 lines). C++
unit test exercising dynamic plugin EP loading and device enumeration. |
| `test/unittest_util/base_tester.cc` | Routes CUDA test requests to
`CudaPluginExecutionProvider` when registered, allowing existing CUDA
provider tests to exercise the plugin path. |
| `tools/ci_build/cuda_plugin_parity_report.py` | **New** (737 lines).
Comparison script that produces a parity report of ops in bundled-only
vs. plugin-only vs. both builds, via static parsing or runtime registry
interrogation. |
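The bundled-vs-plugin comparison such a report produces boils down to set differences; a minimal sketch (the op names below are made-up stand-ins, not real registry output):

```python
# Classify ops as bundled-only, plugin-only, or present in both builds.
def parity_report(bundled_ops, plugin_ops):
    bundled, plugin = set(bundled_ops), set(plugin_ops)
    return {
        "bundled_only": sorted(bundled - plugin),
        "plugin_only": sorted(plugin - bundled),
        "both": sorted(bundled & plugin),
    }

report = parity_report(
    ["Conv", "MatMul", "CudaGraphOp"],   # made-up bundled registry
    ["Conv", "MatMul", "NewPluginOp"],   # made-up plugin registry
)
print(report["bundled_only"])  # ['CudaGraphOp']
print(report["plugin_only"])   # ['NewPluginOp']
```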

### Documentation

| File | Change |
|------|--------|
| `docs/cuda_plugin_ep/cuda_plugin_ep_design.md` | **New** (990 lines).
Plugin architecture, build/deployment flow, operator exclusions, adapter
design, and the decision to defer CUDA Graph support. |
| `docs/cuda_plugin_ep/QUICK_START.md` | **New** (108 lines). Build
instructions, C++ and Python usage examples, and known limitations. |

### Other

| File | Change |
|------|--------|
| `tools/python/gen_opkernel_doc.py` | Extended to generate
documentation for plugin-registered kernels. |
| `orttraining/.../reduction_ops.cc` | Minor compatibility fix for
training reduction ops under the plugin build configuration. |

## Testing

- **Build**: Configure with `--build_cuda_ep_as_plugin` (or
`onnxruntime_BUILD_CUDA_EP_AS_PLUGIN=ON`); verify
`libonnxruntime_providers_cuda_plugin.so` is produced alongside existing
CUDA provider artifacts.
- **C++ unit tests**: Run `onnxruntime_provider_test` — `BaseTester`
routes CUDA coverage through `CudaPluginExecutionProvider`. Run the new
`dynamic_plugin_ep_test` for load/enumerate validation.
- **Python tests**: Register the plugin library, confirm
`onnxruntime.get_available_providers()` includes
`CudaPluginExecutionProvider`, and run `test_cuda_plugin_ep.py` (5-stage
suite: registration → ONNX ops → NHWC → contrib ops → op validation).
- **Parity report**: Run `tools/ci_build/cuda_plugin_parity_report.py`
to verify kernel coverage parity between bundled and plugin builds.
- **Backward compatibility**: Verify unchanged behavior for the in-tree
CUDA EP build path (`onnxruntime_BUILD_CUDA_EP_AS_PLUGIN=OFF`).
- **Known limitation**: CUDA graph support remains disabled in the
plugin path and is documented as deferred.

## Motivation and Context

The CUDA EP is currently compiled into the ORT runtime binary, tightly
coupling its release cycle to the core runtime. This PR creates a path
to decouple CUDA EP delivery by implementing it as a standalone plugin
using the EP Plugin API. The key design tradeoff is reusing the existing
~100+ CUDA kernel implementations through force-include adapter headers
and macro-based registration overrides, rather than rewriting them. This
approach validates the plugin EP against current CUDA coverage without
maintaining a second kernel stack, at the cost of introducing
adapter/shim complexity. CUDA Graph support is explicitly deferred until
the EP Plugin API can represent the capture/replay lifecycle.

**Related**: PR microsoft#27817 (CUDA Plugin EP: Test Coverage & Bug Fixes) is
squash-merged into this branch.

## Checklist

- [x] Tests added/updated
- [x] Documentation updated (if applicable)
- [x] No breaking changes (or documented in description)
- [ ] CI passes
…ft#27914)

### Description

Specify `main` as the target branch for the release candidate cron job.

### Motivation and Context

Pipeline won't work without a branch specifier.
…rosoft#27713)

### Description

Adds C/C++ APIs to the `OrtEpApi` that allow plugin EPs to query ONNX
operator schemas from ORT's global schema registry. This enables EPs to
programmatically discover operator metadata (input/output names, type
constraints, allowed types, since_version) needed to correctly build
kernel definitions with proper type constraints.

### Motivation

Resolves microsoft#27680. Plugin EPs must provide exact type constraint names
(e.g., `"T"`, `"T1"`) and allowed types when calling
`KernelDefBuilder::AddTypeConstraint()`. Without schema access, EPs must
either hard-code these names or skip type constraints entirely, leading
to potentially incorrect kernel selection and data type mismatches at
runtime.

**Why can't an EP library just link to its own ONNX library?** The ONNX
`OpSchemaRegistry` is a Meyers singleton (`static` local in
`Instance()`). Each shared library gets its own copy of that static
variable: on Windows each DLL is isolated by default, on macOS two-level
namespaces have the same effect, and on Linux behavior depends on
`dlopen` flags (`RTLD_LOCAL` isolates, `RTLD_GLOBAL` creates
unpredictable interposition). Even when isolation doesn't occur, the
EP's registry would lack ORT's contrib and internal schemas, and version
mismatches between the EP's ONNX library and ORT's vendored copy could
cause silent divergence. A C API through ORT is the only reliable,
portable way to query the schemas ORT actually uses.
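A rough Python analogue of this isolation (not the C++ mechanism itself): loading one module source under two different names yields two independent "singleton" registries, much as each shared library gets its own copy of a Meyers-singleton static.

```python
# Load the same module source twice under different names; each load
# gets its own module-level "registry" singleton.
import importlib.util
import os
import tempfile

SRC = "registry = {'schemas': []}\n"

def load_copy(name, path):
    spec = importlib.util.spec_from_file_location(name, path)
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    return mod

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "schema_registry.py")
    with open(path, "w") as f:
        f.write(SRC)
    a = load_copy("registry_in_ort", path)
    b = load_copy("registry_in_ep", path)
    a.registry["schemas"].append("Add")  # registered only in ORT's copy
    print(a.registry is b.registry)      # False: two isolated registries
    print(b.registry["schemas"])         # []: the EP's copy never sees "Add"
```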

### Changes

**New opaque types:**
- `OrtOpSchema` — owning opaque struct wrapping an `onnx::OpSchema*`
with precomputed type constraint data. Allocated by `GetOpSchema`,
released by `ReleaseOpSchema`.
- `OrtOpSchemaTypeConstraint` — non-owning opaque entity representing a
single type constraint (e.g., "T"). Lifetime is tied to the parent
`OrtOpSchema`. Each constraint carries its name, allowed types, and
input/output index mappings.

**New C APIs added to `OrtEpApi` (Version 1.25, 15 functions):**

| Function | Description |
|---|---|
| `GetOpSchema` | Look up a schema by name, max opset version, and
domain. Accepts `""` or `"ai.onnx"` for standard ONNX ops,
`"ai.onnx.ml"` for ML ops, `"com.microsoft"` for contrib ops. |
| `ReleaseOpSchema` | Release an `OrtOpSchema` allocated by
`GetOpSchema`. |
| `OpSchema_GetSinceVersion` | Get the opset version that introduced the
schema. |
| `OpSchema_GetNumInputs` / `GetNumOutputs` | Input/output counts. |
| `OpSchema_GetInputName` / `GetOutputName` | Formal parameter names. |
| `OpSchema_GetInputTypeConstraint` / `GetOutputTypeConstraint` | Get
the type constraint for a given input/output (O(1) lookup). Returns
`nullptr` if the input/output has no type constraint. Shared constraints
return the same pointer (pointer identity = shared type). |
| `OpSchema_GetTypeConstraintCount` | Number of unique type constraints. |
| `OpSchema_GetTypeConstraint` | Get the i-th type constraint by index. |
| `OpSchemaTypeConstraint_GetTypeParamName` | Get the type parameter
name (e.g., `"T"`, `"T1"`). |
| `OpSchemaTypeConstraint_GetAllowedTypes` | Get the allowed type
strings (e.g., `"tensor(float)"`). |
| `OpSchemaTypeConstraint_GetInputIndices` | Get input indices using
this constraint. |
| `OpSchemaTypeConstraint_GetOutputIndices` | Get output indices using
this constraint. |

**C++ wrappers:**
- `Ort::OpSchema` — owning wrapper around `OrtOpSchema*` (move-only,
auto-releases).
- `Ort::ConstOpSchemaTypeConstraint` — non-owning wrapper around `const
OrtOpSchemaTypeConstraint*`.
- `Ort::GetOpSchema()` — free function to query the registry.

**Design highlights:**
- Type constraints are eagerly precomputed during `GetOpSchema` — all
subsequent accessors are O(1) with no allocation.
- `GetInputTypeConstraint`/`GetOutputTypeConstraint` return the full
constraint object directly (not just a string), enabling a 2-call
workflow: `GetInputTypeConstraint(0)` → `GetAllowedTypes()`.
- Pointer identity: inputs sharing a constraint (e.g., both inputs of
`Add` use `"T"`) return the same `OrtOpSchemaTypeConstraint*`.
- Domain `"ai.onnx"` is normalized to `""` (the canonical ONNX domain)
for transparent lookup.
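A toy Python model of these highlights (not the real `OrtOpSchema` implementation): constraints are precomputed once, input→constraint lookup is O(1), and inputs sharing a constraint return the same object.

```python
class TypeConstraint:
    def __init__(self, name, allowed_types, input_indices, output_indices):
        self.type_param_name = name
        self.allowed_types = allowed_types
        self.input_indices = input_indices
        self.output_indices = output_indices

class OpSchema:
    def __init__(self, constraints, input_to_constraint):
        self._constraints = constraints
        self._input_to_constraint = input_to_constraint  # precomputed map

    def get_input_type_constraint(self, i):
        # O(1); None when the input has no type constraint
        return self._input_to_constraint.get(i)

# "Add": both inputs share constraint "T".
t = TypeConstraint("T", ["tensor(float)", "tensor(int32)"], [0, 1], [0])
add_schema = OpSchema([t], {0: t, 1: t})

# Two-call workflow: constraint for input 0, then its allowed types.
c = add_schema.get_input_type_constraint(0)
print(c.allowed_types)  # ['tensor(float)', 'tensor(int32)']
# Identity: input 0 and input 1 resolve to the same constraint object.
print(add_schema.get_input_type_constraint(0)
      is add_schema.get_input_type_constraint(1))  # True
```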

**Tests:** 14 unit tests covering known/unknown ops, version boundaries,
wrong domains, `"ai.onnx"` alias, schema properties (Add, Relu, LSTM),
type constraint access, pointer identity for shared constraints,
out-of-range errors, and the input→constraint→allowed-types workflow.

### Files

| File | Description |
|---|---|
| `onnxruntime/core/session/abi_opschema.h` | Internal struct
definitions for `OrtOpSchemaTypeConstraint` and `OrtOpSchema`. |
| `include/.../onnxruntime_ep_c_api.h` | Public C API: function
signatures, doc comments, opaque type declarations. |
| `onnxruntime/core/session/plugin_ep/ep_api.h` | Internal function
declarations. |
| `onnxruntime/core/session/plugin_ep/ep_api.cc` | Implementation of all
15 functions + API struct initializer. |
| `include/.../onnxruntime_cxx_api.h` | C++ wrapper class declarations. |
| `include/.../onnxruntime_cxx_inline.h` | C++ wrapper inline
implementations. |
| `onnxruntime/test/framework/ep_plugin_provider_test.cc` | Unit tests. |

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This pull request adjusts the tiling strategy for small matrix sizes in
the DP4A matmul kernel. The changes are aimed at improving performance
and compatibility, especially for specific GPU vendors.

On Qualcomm, this improves token generation from ~20 tps to ~25 tps.
### Description
Update logger object in QnnBackendManager::SetupBackend.



### Motivation and Context
While generating a weight-sharing context binary, an Inference Session is
created once for each graph. Each Inference Session creates a Logger
object and passes it to QnnBackendManager, which stores the pointer in
logger_ and holds it long after the Inference Session destroys the
Logger. On the next Inference Session another Logger object is created,
but QnnBackendManager does not use it because backend_setup_completed_ is
already set; dereferencing the stale pointer causes a use-after-free
(UAF).

Co-authored-by: Trishansh Bhardwaj <quic_tbhardwa@quicinc.com>
### Description
This PR contains fixes to various big endian support issues in
onnxruntime, both in libraries and tests.

### Motivation and Context
Currently, some tests from the onnxruntime test suite fail on big-endian
systems. This change fixes all of those tests when onnxruntime is built
without training support. It also includes a fix for a linking issue.

Following tests are fixed on s390x:
OrtModelOnlyTests.ValidateOrtFormatModelDoesNotRunOptimizersInFullBuild
FlatbufferUtilsTest.ExternalWriteReadWithLoadInitializers
SparseTensorConversionTests.SparseTensorProtoToDense_Rank1Indices64
SparseTensorConversionTests.SparseTensorProtoToDense_Rank1Indices32
SparseTensorConversionTests.SparseTensorProtoToDense_Rank1Indices16
SparseTensorConversionTests.SparseTensorProtoToDense_Rank1Indices8
SparseTensorConversionTests.SparseTensorProtoToDense_Rank2Indices_COO
SparseTensorConversionTests.TestConstantNodeConversion
OrtModelOnlyTests.SparseInitializerHandling
SparseTensorConversionTests.TestConstantNodeConversion
SparseTensorConversionTests.TestDenseToSparseConversion
ExecutionFrameTestInit.SparseInitializerAsOutput
CApiTest.SparseOutputModel
### Description
#### TLDR

This PR ports the existing C++
[EpProfiler](https://github.com/microsoft/onnxruntime/blob/faad20f9d3264c7f3b6d4e4398990e13ee864512/include/onnxruntime/core/framework/execution_provider.h#L359)
interfaces used by provider-bridge EPs to the binary-stable C APIs for
plugin EPs. It introduces C/C++ APIs for creating/querying profiling
events, a container for appending EP events, and callback hooks
(`StartEvent`/`StopEvent`) that give EPs access to ORT event metadata in
real-time.

#### Changes to the original C++ API

The original `EpProfiler` C++ interface was adapted for the C API with
the following intentional changes:

1. **`StartProfiling`** now receives an offset indicating the elapsed
time since profiling started, as opposed to receiving an
absolute/epoch-dependent profiling start time. This prevents EPs from
having to do epoch conversions. Credit to @edgchen1 for the idea.
2. **`StartEvent`/`StopEvent` receive an absolute, epoch-based
correlation ID (`ort_event_correlation_id`)** instead of a relative ORT
event ID. The `PluginEpProfiler` bridge layer automatically converts the
C++ `relative_ort_event_id` (microseconds since profiling start) to an
absolute `ort_event_correlation_id` by adding the epoch-based profiling
start time. This means plugin EPs can use the correlation ID directly
with profiling utilities like CUPTI or ROCTracer without computing the
conversion themselves.
3. **`StopEvent` now receives the completed ORT event as a parameter.**
This allows EPs to optionally inspect ORT event metadata (e.g.,
`op_name`, `event_name`) at the time the event ends, facilitating
annotation of correlated EP events.
4. **`EndProfiling` only allows EPs to *append* events (via
`OrtProfilingEventsContainer`), not read or modify the full events
array.** This is motivated by:
- Prevent any one EP from modifying events generated by ORT or another
EP.
- Certain EPs (VitisAI and WebGPU) already only append events without
reading the entire events array.
- The CUDA EP reads the entire events array solely to merge/sort its own
EP events next to correlated ORT events and add `parent_name`/`op_name`
metadata. However:
- Merging/sorting is mostly unnecessary since trace viewers that load
these files do their own event sorting.
- This merging/sorting step was previously required to augment CUDA EP
events with metadata from the correlated ORT event. However, that can
now be obtained more simply via the new `StopEvent` parameter that
provides the EP with the full correlated ORT event.
- The [merge algorithm used by CUDA
EP](https://github.com/microsoft/onnxruntime/blob/faad20f9d3264c7f3b6d4e4398990e13ee864512/include/onnxruntime/core/common/gpu_profiler_common.h#L391-L397)
**incorrectly** assumes ORT events are sorted by non-decreasing *start*
time, but they are actually sorted by [non-decreasing *end*
time](https://github.com/microsoft/onnxruntime/blob/faad20f9d3264c7f3b6d4e4398990e13ee864512/onnxruntime/core/common/profiler.cc#L91)
(also see
microsoft#13706 (comment)).
Fixing this would require sorting the entire Events array before asking
a provider-bridge EP to merge in its events into the global events
array. Not sure this is worth the runtime cost.
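The mismatch can be seen with two made-up overlapping events: sorting by end time does not leave start times sorted.

```python
# Two events whose durations overlap.
events = [
    {"name": "short_op", "ts": 50, "dur": 10},  # ends at 60
    {"name": "long_op",  "ts": 10, "dur": 55},  # ends at 65
]
# ORT orders events by non-decreasing *end* time.
by_end = sorted(events, key=lambda e: e["ts"] + e["dur"])
starts = [e["ts"] for e in by_end]
print(starts)                    # [50, 10]
print(starts == sorted(starts))  # False: start times are NOT sorted
```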

#### Naming conventions for ORT event IDs

- **C++ `EpProfiler` interface** (existing): Uses
`relative_ort_event_id` — a timestamp offset in microseconds relative to
profiling start.
- **C API `OrtEpProfilerImpl`** (new in this PR): Uses
`ort_event_correlation_id` — an absolute, epoch-based timestamp in
microseconds computed from `std::chrono::high_resolution_clock`
(platform-defined epoch). Unique across concurrent profiling sessions
within the same process.
- **Conversion**: The `PluginEpProfiler` bridge class (in
`ep_event_profiling.cc`) performs `ort_event_correlation_id =
relative_ort_event_id + profiling_start_time_epoch_us_`, mirroring the
pattern in `GPUTracerManager::PushCorrelation`.
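The conversion itself amounts to one addition; a minimal sketch with made-up timestamps:

```python
def to_correlation_id(relative_ort_event_id_us, profiling_start_epoch_us):
    # relative id: microseconds since profiling started
    # result: absolute, epoch-based id an EP can hand to CUPTI/ROCTracer
    return relative_ort_event_id_us + profiling_start_epoch_us

start_epoch_us = 1_700_000_000_000_000  # made-up epoch timestamp
print(to_correlation_id(2_500, start_epoch_us))  # 1700000000002500
```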

### New C APIs

| API | Description |
|-----|-------------|
| `CreateProfilingEvent` | Create a profiling event with category,
process/thread IDs, name, timestamp, duration, and key-value args |
| `ReleaseProfilingEvent` | Release a profiling event |
| `ProfilingEvent_GetCategory` | Get event category (`SESSION`, `NODE`,
`KERNEL`, `API`) |
| `ProfilingEvent_GetName` | Get event name |
| `ProfilingEvent_GetTimestampUs` | Get event start timestamp (µs) |
| `ProfilingEvent_GetDurationUs` | Get event duration (µs) |
| `ProfilingEvent_GetArgValue` | Get an event argument value by key |
| `ProfilingEventsContainer_AddEvents` | Append an array of EP events to
the output container |
| `OrtEp::CreateProfiler` | Returns an instance of the EP's profiler
implementation |
| `OrtEpProfilerImpl::StartProfiling` | Called by ORT to start a
profiling session. Receives elapsed time offset (ns) since ORT profiling
started |
| `OrtEpProfilerImpl::StartEvent` | Called by ORT to notify that an ORT
event has started. Receives an absolute `ort_event_correlation_id` |
| `OrtEpProfilerImpl::StopEvent` | Called by ORT to notify that an ORT
event has ended. Receives the same `ort_event_correlation_id` and ORT
event metadata |
| `OrtEpProfilerImpl::EndProfiling` | Called by ORT to end the profiling
session and collect EP events into the output container |
| `OrtEpProfilerImpl::Release` | Release the profiler instance |

### New C++ wrapper classes

| Class | Description |
|-------|-------------|
| `Ort::ConstProfilingEvent` | Non-owning const wrapper for reading
fields from an `OrtProfilingEvent` (e.g., in `StopEvent`) |
| `Ort::ProfilingEvent` | Owning wrapper that creates and manages an
`OrtProfilingEvent` (e.g., for `EndProfiling`) |
| `Ort::UnownedProfilingEventsContainer` | Non-owning wrapper for adding
events to an `OrtProfilingEventsContainer` during `EndProfiling` |

### Example EP profiling implementation
This PR updates an example plugin EP to use the new profiling APIs:
- Plugin EP code:
[test/autoep/library/example_plugin_ep_kernel_registry](https://github.com/microsoft/onnxruntime/tree/adrianl/PluginEp_ProfilingApis/onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry)
- `OrtEpProfilerImpl` implementation:
[ep_profiling.h](https://github.com/microsoft/onnxruntime/blob/adrianl/PluginEp_ProfilingApis/onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry/ep_profiling.h)
/
[ep_profiling.cc](https://github.com/microsoft/onnxruntime/blob/adrianl/PluginEp_ProfilingApis/onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry/ep_profiling.cc)
- `OrtEp::CreateProfiler()` implementation:
[ep.cc](https://github.com/microsoft/onnxruntime/blob/adrianl/PluginEp_ProfilingApis/onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry/ep.cc)

### Existing bugs found
Not fixed in this PR.

- The [merge algorithm used by CUDA
EP](https://github.com/microsoft/onnxruntime/blob/faad20f9d3264c7f3b6d4e4398990e13ee864512/include/onnxruntime/core/common/gpu_profiler_common.h#L391-L397)
**incorrectly** assumes ORT events are sorted by non-decreasing *start*
time, but they are actually sorted by [non-decreasing *end*
time](https://github.com/microsoft/onnxruntime/blob/faad20f9d3264c7f3b6d4e4398990e13ee864512/onnxruntime/core/common/profiler.cc#L91)
(also see
microsoft#13706 (comment)).
- Run profilers do not handle subgraphs (e.g., subgraph of a
control-flow operator). Has been the case since run profilers were
[introduced](microsoft#26846).

### Motivation and Context
Allows plugin EPs to generate profiling events, further closing the
functionality gap between provider-bridge EPs and plugin EPs.

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
…a layout on Avx512 (microsoft#27874)

### Description
Adds a special AVX512 kernel for depthwise conv with multiplier = 2.
These improve the performance of 3 costly conv operations (7x7 kernels)
in the MobileClip model by approximately 2.4x (MLAS benchmark numbers
below).

These are 3 ops with
1) Cin=64, Cout=128, group=64, H=64, W=64, kH=7, kW=7
2) Cin=128, Cout=256, group=128, H=32, W=32, kH=7, kW=7
3) Cin=256, Cout=512, group=256, H=16, W=16, kH=7, kW=7

These Conv operations cannot be dispatched to NCHWc because the Cout per
group is smaller than the block size. On AVX512 the block size is 16,
while the Cout per group is only 2. There is a special depthwise kernel
in the NCHWc suite, but it can only handle Cout per group = 1.
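For reference, the semantics of depthwise conv with channel multiplier = 2 (a scalar toy sketch with tiny made-up shapes, not the AVX512 kernel): each input channel c produces output channels 2c and 2c+1, each with its own kernel.

```python
def depthwise_conv2d_mult2(x, w):
    # x: [Cin][H][W], w: [Cin][2][kH][kW] -> out: [2*Cin][H-kH+1][W-kW+1]
    cin, h, wd = len(x), len(x[0]), len(x[0][0])
    kh, kw = len(w[0][0]), len(w[0][0][0])
    oh, ow = h - kh + 1, wd - kw + 1
    out = [[[0.0] * ow for _ in range(oh)] for _ in range(2 * cin)]
    for c in range(cin):
        for m in range(2):  # the multiplier dimension
            for i in range(oh):
                for j in range(ow):
                    acc = 0.0
                    for u in range(kh):
                        for v in range(kw):
                            acc += x[c][i + u][j + v] * w[c][m][u][v]
                    out[2 * c + m][i][j] = acc
    return out

x = [[[1.0, 2.0], [3.0, 4.0]]]  # Cin=1, 2x2 input
w = [[[[1.0]], [[2.0]]]]        # 1x1 kernels, multiplier 2
out = depthwise_conv2d_mult2(x, w)
print(out[0])  # [[1.0, 2.0], [3.0, 4.0]]  (kernel = scale 1)
print(out[1])  # [[2.0, 4.0], [6.0, 8.0]]  (kernel = scale 2)
```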

MLAS Benchmark Before and After comparison:

| Benchmark | BEFORE mean (ns) | AFTER mean (ns) | Speedup |
|---|---:|---:|---:|
| SCONV_NCHW G64 | 3,151,190 | 1,391,419 | 2.26x |
| SCONV_NCHW G128 | 1,646,040 | 824,654 | 2.00x |
| SCONV_NCHW G256 | 978,843 | 533,375 | 1.84x |
| SCONV_NCHW_THREADED G64 | 873,283 | 367,722 | 2.37x |
| SCONV_NCHW_THREADED G128 | 445,786 | 226,777 | 1.97x |
| SCONV_NCHW_THREADED G256 | 264,473 | 147,997 | 1.79x |

### Motivation and Context
Just by optimizing these 3 conv operations, MobileClip is about
700us-850us faster and the entire model is <14ms on an AVX512 machine.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
### Description
<!-- Describe your changes. -->

Fix int overflow issues in original implementation.

Add some additional tests.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Fix some int overflow issues.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@ankitm3k ankitm3k merged commit b20f392 into ovep-develop Apr 2, 2026
5 of 7 checks passed
@ankitm3k ankitm3k deleted the sync_msft_02042026 branch April 2, 2026 04:21