FF rel-1.25.0 to latest main for cut #28000
Closed
sanaa-hamel-microsoft wants to merge 56 commits into rel-1.25.0 from
### Description

Fix int overflow issues in the original implementation. Add some additional tests.

### Motivation and Context

Fix some int overflow issues.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
### Description

Fix typo in CApiTest.VersionConsistencyWithApiVersion. It wasn't checking the third version string component.

### Motivation and Context

Fix test.
### Description

Disable 4-D MVN tests as there seems to be some issue within DirectML such that the tests pass sometimes on the same machine SKU but fail at other times. The 5-D tests always seem to pass (based on limited eyeballing of several runs). If that changes in the future, we can disable the 5-D tests too.

The bug might be narrower than just 4-D cases. I think it occurs for 4-D inputs that also include 0 in the `axes` parameter, but that claim needs more evidence to support it and investigation from someone familiar with the core DML stack. Pending that, I am disabling the 4-D input tests for the MVN op from running using the DML EP.

Without this fix:
- Sample "passing" run: https://github.com/microsoft/onnxruntime/actions/runs/23826498144/job/69450582096#step:13:21501
- Sample "failing" run: https://github.com/microsoft/onnxruntime/actions/runs/23831205376/job/69484574894#step:13:22000

### Motivation and Context

Mitigate DML EP MVN failures on CI. Temporarily mitigates the issue described in #27933.
### Description

Fix `int` overflow issue in `ComputeAttentionSoftmaxInplace<MLFloat16>()` by using `size_t` and `SafeInt` instead.

### Motivation and Context

Fix overflow issue.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…7945)

### Description

Add support for specifying dynamic plugin EP configuration via a JSON file path in the ORT_UNIT_TEST_MAIN_DYNAMIC_PLUGIN_EP_CONFIG_JSON_FILE environment variable. This is mutually exclusive with specifying inline JSON via the existing ORT_UNIT_TEST_MAIN_DYNAMIC_PLUGIN_EP_CONFIG_JSON environment variable.

### Motivation and Context

Allow more flexibility in specifying configuration. It may be impractical to put everything in an environment variable.
### Description

Address some good leftover comments from the PR that added EP APIs to retrieve operator schemas: #27713

### Motivation and Context

Clean up as promised.
### Description

This pull request adds support for Conv3D operations to the WebGPU execution provider in ONNX Runtime. The main changes include implementing a new naive Conv3D shader, updating the convolution logic to handle 3D convolutions, and enabling relevant tests for Conv3D on WebGPU. Grouped Conv3D is not yet supported.

**Conv3D WebGPU support:**

* Added a new `Conv3DNaiveProgram` class (`conv3d_naive.h`, `conv3d_naive.cc`) that implements a per-element Conv3D shader for WebGPU, supporting both "channels last" and "channels first" layouts, with optional bias and activation.
* Updated the main convolution logic in `conv.cc` to detect 5D tensors (Conv3D), construct the appropriate shader program, and pass spatial/stride/dilation parameters as uniforms. Grouped Conv3D is explicitly disallowed for now.
* Included the new `conv3d_naive.h` header in the main convolution implementation.

**Test coverage:**

* Enabled Conv3D tests for the WebGPU provider by removing it from the excluded execution providers in several Conv3D test cases (`conv_op_test.cc`).
* Added a note to the Conv3D fp16 test indicating that enabling it for WebGPU will require additional infrastructure to conditionally skip based on device capabilities.

### Motivation and Context

Support additional cases in WebGPU EP Conv kernel.
…ernel::UseSharePrePackedBuffers` (#27924)

### Description

Consolidate `OpKernel::UseSharedPrePackedBuffers` and `OpKernel::UseSharedPrePackedBuffers_V2` into a single virtual method, resolving the TODO in `op_kernel.h`.

#### Background

The `OpKernel` class previously had two virtual methods for consuming shared pre-packed weight buffers:

- **`UseSharedPrePackedBuffers`** (V1) — 3 params: `prepacked_buffers`, `input_idx`, `used_shared_buffers`
- **`UseSharedPrePackedBuffers_V2`** — 4 params: added `prepacked_buffer_sizes` (a `gsl::span<const size_t>`)

V2 was introduced to pass buffer sizes alongside the buffers. Its default implementation forwarded to V1 for backward compatibility. The framework (`session_state.cc`) only ever called V2.

#### Changes

Merged both methods into a single `UseSharedPrePackedBuffers` using the V2 signature:

```cpp
virtual Status UseSharedPrePackedBuffers(std::vector<BufferUniquePtr>& prepacked_buffers,
                                         gsl::span<const size_t> prepacked_buffer_sizes,
                                         int input_idx,
                                         /*out*/ bool& used_shared_buffers);
```

Updated **27 files** across the codebase:

| Category | Files | Change |
|----------|-------|--------|
| Base class | `op_kernel.h` | Removed V1 + V2; single 4-param method |
| Framework | `session_state.cc` | Renamed `_V2` call |
| Plugin EP bridge | `ep_kernel_registration.cc` | Renamed override |
| QMoECPU | `moe_quantization_cpu.h/.cc` | Renamed V2 override + template instantiations |
| CPU provider (8 kernels) | `gemm`, `matmul`, `conv_transpose`, `fp16_conv`, `qlinearconv`, `matmul_integer_base`, `deep_cpu_lstm`, `deep_cpu_gru` | Added `prepacked_buffer_sizes` param |
| ACL provider (2 kernels) | `acl/conv`, `acl/matmul` | Added param |
| Contrib ops (4 kernels) | `matmul_nbits`, `dynamic_quantize_lstm`, `attention_quant`, `bert/attention` | Added param |
| Tests | `session_state_test.cc` | Updated test kernel override |

#### Notes

- Existing V1 overrides add the new `prepacked_buffer_sizes` parameter as **unnamed/unused** (`/*prepacked_buffer_sizes*/`) — no logic changes in those kernels.
- The C API (`SetSharedPrePackedWeight` in `onnxruntime_ep_c_api.h`) already passes buffer sizes, so **no C API changes** were needed.
- Private helper functions (e.g., `UseSharedPrePackedBuffersImpl` in LSTM/GRU) are not virtual overrides and were **not modified**.

### Motivation and Context

Addresses the TODO at `include/onnxruntime/core/framework/op_kernel.h:139`:

> TODO: Consolidate UseSharedPrePackedBuffers and UseSharedPrePackedBuffers_V2 into a single function, which will require updating kernel-based provider-bridge EPs (cpu, cuda, webgpu).
### Description

Update the Attention Fusion optimizer to help fuse the Attention subgraph pattern in the MobileClip model. The perf gain from this itself is paltry (mostly from not having to launch many kernels), but the real gain will come after this fusion, i.e., from tuning the performance of the MHA kernel for the problem shapes seen in this model. There are 2 Attention blocks found in the model and this update fuses both of them.

### Motivation and Context

Improve performance of MobileClip model.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Fix #27586

This pull request updates ONNX Runtime to support ONNX opset 26, including new operator implementations and related infrastructure changes. The most important changes are the upgrade of the ONNX dependency, addition of new opset 26 kernels (such as `CumProd` and `BitCast`), and updates to macros and versioning to ensure compatibility. Below are the key changes grouped by theme:

**ONNX Dependency Upgrade:**

* Updated ONNX submodule and source references to the latest commit supporting opset 26, and changed versioning in `vcpkg.json` from 1.20.1 to 1.21.0. (`cmake/deps.txt`, `cmake/external/onnx`, `cmake/vcpkg-ports/onnx/portfile.cmake`, `cmake/vcpkg-ports/onnx/vcpkg.json`)

**Opset 26 Kernel Support:**

* Registered new opset 26 kernels for `BitCast` and all supported types of `CumProd` in the CPU execution provider, including their instantiation and build logic. (`onnxruntime/core/providers/cpu/cpu_execution_provider.cc`, `onnxruntime/core/providers/cpu/math/cumprod.cc`, `onnxruntime/core/providers/cpu/math/cumprod.h`)
* Increased the maximum supported opset version in the optimizer API from 25 to 26. (`onnxruntime/core/optimizer/transpose_optimization/optimizer_api.h`)

**Build and Patch Updates:**

* Added a new `ONNX_MINIMAL_BUILD` option to the ONNX CMake configuration and updated patch files for compatibility with the new ONNX version. (`cmake/patches/onnx/onnx.patch`, `cmake/vcpkg-ports/onnx/binskim.patch`)

**Macro Improvements:**

* Updated operator schema macros to use `[[maybe_unused]]` instead of the deprecated `ONNX_UNUSED` attribute, improving code clarity and modernizing macro usage. (`onnxruntime/core/graph/contrib_ops/contrib_defs.h`, `onnxruntime/core/graph/dml_ops/dml_defs.h`)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…r fully-masked batches (#27831)

### Summary

Fixes three issues in the CUDA ONNX Attention operator and improves spec compliance:

1. min_bias_align crash on SM<80: The alignment check for Memory Efficient Attention (MEA) bias used 4*sizeof(T) (bytes), but the check is against element counts. Fixed to 4 elements, matching CUTLASS kMinimumAlignment. This prevented valid MEA dispatch on SM<80.
2. MEA NaN for fully-masked batches: When nonpad_kv_seqlen=0, CUTLASS MEA computes 1/s_prime where s_prime=0, producing NaN. Added ZeroOutputForFullyMaskedBatches kernel (MEA path only) to zero output for these batches. Uses int64_t for element count to prevent overflow at large context lengths.
3. Flash rejects attn_mask for spec compliance: Flash Attention's paged KV cache produces spec-divergent present_key/present_value layout when used with attn_mask + past_key. Flash now requires attn_mask == nullptr — cases with bool mask + past_key fall to the unfused runner, which handles them spec-correctly. Removed ~137 lines of dead code (ConvertMaskToSeqlensKernel, LaunchConvertMaskToFlashSeqlensK) no longer needed after this change.

### Known limitation

- GQA + bool attn_mask + past_key currently has no runner (Flash rejected, unfused doesn't support GQA, MEA blocked by past_key). Tracked via TODO — PR #27851 (MEA with past_key support) will close this gap.

### Related

- Issue #27885: Flash Attention bool attn_mask semantic divergence (root cause documented)
- PR #27851: MEA with past_key support (will close GQA gap)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
### Description

WebGPU EP: C++20 housekeeping. Modernize WebGPU EP code after the repo-wide C++20 migration. 19 files changed, net -3 lines.

C++20 idiom modernization:

- find() != end() / count() > 0 → contains() (4 sites)
- .size() == 0 / .size() > 0 → .empty() (6 sites)
- std::enable_if_t SFINAE → requires clauses (4 sites)
- typedef struct → plain struct (1 site)
- erase-remove idiom → std::erase_if (1 site)
- Redundant std::move in rvalue-qualified return removed (1 site)
- + vs << inconsistency in stream operator chains fixed (2 sites)

Strict aliasing fixes (removed 3 #pragma GCC diagnostic ignored "-Wstrict-aliasing" blocks):

- program.h: Replaced C-style (int&)a | (int&)b reference casts in ProgramTensorMetadataDependency bitwise operators with static_cast
- shader_variable.h: Replaced (uint32_t&)a.usage |= ... reference casts in ShaderUsage bitwise operators with value-based operations
- unary_elementwise_ops.cc: Replaced *reinterpret_cast<const float*>(attr) with std::memcpy for MLFloat16[2] → float type punning (consistent with pad.cc)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
### Description

Fixes the DLL import-name conversion logic in `DetermineLoadLibraryError()` in `onnxruntime/core/platform/windows/dll_load_error.cc`. The changes:

- Replace `LoadLibrary(dll_name)` with `LoadLibraryW(...)` so the code no longer passes a `const char*` import-table name to the wide-character Windows API under `UNICODE` builds.
- Replace the naive `char*` to `std::wstring` construction with `MultiByteToWideChar(CP_ACP, ...)`, which is appropriate for ANSI PE import names.
- Handle conversion failures by skipping the import entry instead of continuing with an invalid or empty converted name.
- Fix the follow-up buffer sizing issue in the conversion path by allocating space for the terminating null, checking the conversion result, and then resizing the `std::wstring` back down to exclude the terminator.

These updates keep the change focused on the Windows DLL dependency diagnostic path while ensuring the new conversion logic is memory-safe.

### Motivation and Context

The original code mixed narrow and wide string handling in a Windows-specific path that reports missing dependent DLLs. Under standard `UNICODE` builds, that could cause the dependency lookup to fail because a narrow import name was effectively being passed to a wide-character API.

The follow-up review also identified an off-by-one write risk in the first version of the conversion fix: `MultiByteToWideChar(..., -1, ...)` returns a length including the null terminator, so the destination buffer must reserve space for that terminator before conversion. This update corrects that issue and preserves the intended diagnostic behavior.
### Test

```
"C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\Build\vcvars64.bat"
git checkout tlwu/fix-dll-name-string-conversion-test
cd 测试目录
test_dll_load.cmd
```

Output is like

```
====== Caught expected error ======
[ONNXRuntimeError] : 1 : FAIL : Error loading "E:\git\onnxruntime\测试目录\test_main.dll" which depends on "missing_dep.dll" which is missing. (Error 126: "The specified module could not be found.")
===================================

✅ SUCCESS: The fix works! 'missing_dep.dll' is correctly identified and reported in the error message.
```

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
### Description

Fixes ICM issue https://portal.microsofticm.com/imp/v5/incidents/details/31000000562663/summary

### Motivation and Context

Fix ICMs.
This pull request introduces a new documentation page, `PartitioningWithAnnotationsAndMemoryConstraints.md`, which explains advanced ONNX Runtime features for partitioning model graphs across devices with explicit control. The doc covers how to annotate model layers for device assignment, collect per-node memory statistics, and enforce GPU memory budgets during partitioning. These features enable precise control over device placement and memory usage for large models. The most important changes are:

**New Documentation: Advanced Partitioning Features**

* Adds a comprehensive guide (`PartitioningWithAnnotationsAndMemoryConstraints.md`) describing how to use ONNX Runtime's layer annotation and memory constraint features for graph partitioning.

**Layer Assignment via Annotations**

* Explains how to annotate ONNX model nodes with `layer_ann` metadata, including manual annotation and automated annotation using Olive's `CaptureLayerAnnotations` pass.
* Provides configuration examples for mapping annotation patterns to devices at runtime using the `session.layer_assignment_settings` session option.

**Capacity-Aware Partitioning**

* Details a two-phase workflow for profiling per-node memory usage and then enforcing a memory budget with the `session.resource_cuda_partitioning_settings` session option.
* Covers both profiling-based and ad-hoc (estimation-only) approaches for memory-constrained partitioning.

This is a follow-up for #27595.
### Description

Increase version number to 1.26.0. The rel-1.25.0 release branch has been cut.

### Changes

- VERSION_NUMBER: 1.25.0 → 1.26.0
- ORT_API_VERSION: 25 → 26 (header + C API struct rename)
- Python, JS, docs version strings updated via update_version.py
- C# NativeTrainingMethods ORT_API_VERSION: 23 → 26
- samples/cxx/README.md example paths updated
- docs/Versioning.md example updated

### Motivation and Context

Per release process: bump main branch version immediately after cutting the release branch.
Proposal for CausalConvWithState and LinearAttention onnxruntime custom operator. This follows the proposal in onnx/onnx#7767.
Add ORT_ENFORCE checks in the SVMRegressor constructor to validate that coefficients, support_vectors, and rho attribute array sizes are consistent with the declared n_supports dimension. Without this validation, a crafted model with undersized arrays causes the GEMM inner loop to read past buffer boundaries. This mirrors the existing validation already present in SVMClassifier.

- Validate rho is non-empty (accessed as rho_[0] in LINEAR mode, passed to GEMM as bias in SVC mode)
- Validate coefficients.size() >= vector_count_ in SVC mode
- Validate feature_count_ > 0 after support_vectors division
- Add two unit tests for undersized coefficients and support_vectors

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…alues (#27789)

### Description

Fixes a heap out-of-bounds write (underflow) in the `Attention` contrib operator's `PrepareMask` function. Negative values in the 1D `mask_index` tensor were used directly as loop start indices without bounds checking, allowing writes at negative offsets before the `p_mask` buffer.

In `PrepareMask()` (`attention_helper.h`), `end_position` is read from `mask_index[b_i]` and used as the starting index in a write loop with no lower-bound validation. When `end_position` is negative, the loop writes `mask_filter_value` at negative offsets — a heap buffer underflow. In contrast, `start_position` had partial clamping via `std::min()` but lacked a lower bound as well.
For WebGPU EP:

- onnx rotary-embedding op
- onnx rmsnorm
- reshape → opset 25
- transpose → opset 24
### Description

Extends the CUDA Transpose kernel registration from opset 23 to opset 25.

- **`transpose.cc`**: Cap existing opset 23 kernel to versioned `(23, 24)`, add new non-versioned kernel at opset 25
- **`cuda_execution_provider.cc`**: Update forward declarations and `BuildKernelCreateInfo` entries to match; add new `// Opset 25` section
- **`docs/OperatorKernels.md`**: Update CUDA Transpose entry from `23+` to `25+` with new `[23, 24]` versioned range

No functional or type constraint changes — the kernel implementation is identical across these opsets.

### Motivation and Context

CUDA EP's Transpose registration stopped at opset 23 while the ONNX spec defines it through opset 25. This is one of the P1 gaps tracked in #27729, following the same pattern as #27728.

### Limitation

This PR does not add support for new data types for Transpose:

- int2 (opset 25)
- float8e8m0 (opset 24)
- float4e2m1 (opset 23)
- float8e4m3fn, float8e4m3fnuz, float8e5m2, float8e5m2fnuz, uint4, int4 (opset 21)

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
## Summary

- Add `__array__`, `__dlpack__`, `__dlpack_device__`, and `from_dlpack` to the public `OrtValue` class
- Enable standard Python interoperability protocols (numpy array protocol + DLPack) on `OrtValue`
- Auto-detect boolean dtype from source objects in `from_dlpack` to avoid the uint8/bool ambiguity in older DLPack versions

## Motivation

Fixes #24071

The C-level `C.OrtValue` already supports `__dlpack__`, `__dlpack_device__`, and `from_dlpack`, but the public Python wrapper `OrtValue` class does not expose them. Users currently have to access the private `_ortvalue` attribute (e.g. `ortvalue._ortvalue.__dlpack__()`) for DLPack interop. Similarly, `np.asarray(ortvalue)` doesn't work because `__array__` is not implemented.

This makes `OrtValue` a well-behaved tensor type that works out of the box with:

- `np.asarray(ortvalue)` / `np.array(ortvalue)` via `__array__`
- `torch.from_dlpack(ortvalue)` via `__dlpack__` / `__dlpack_device__`
- `OrtValue.from_dlpack(torch_tensor)` via the `from_dlpack` classmethod

## Changes

**`onnxruntime/python/onnxruntime_inference_collection.py`**:

- `__array__(dtype, copy)`: Delegates to `self.numpy()` with optional dtype conversion. Supports numpy 2.0 `copy` semantics while remaining compatible with older numpy versions.
- `__dlpack__(*, stream)`: Thin wrapper over the C-level `__dlpack__`.
- `__dlpack_device__()`: Thin wrapper over the C-level `__dlpack_device__`.
- `from_dlpack(data)`: Classmethod that accepts any `__dlpack__`-compatible object or raw DLPack capsule. Detects boolean dtype from the source object's `dtype` attribute or `data_type()` method, avoiding the uint8/bool false-positive that `is_dlpack_uint8_tensor` would produce on genuine uint8 data.

**`onnxruntime/test/python/onnxruntime_test_python.py`**:

- `test_ort_value_array_protocol`: Tests `np.asarray`/`np.array` with float32, int64, bool dtypes, and dtype conversion.
- `test_ort_value_dlpack_protocol`: Tests `__dlpack__` and `__dlpack_device__` on the public class.
- `test_ort_value_from_dlpack_protocol_object`: Tests `from_dlpack` with numpy arrays and OrtValue-to-OrtValue round-trip, verifying zero-copy (shared memory).
- `test_ort_value_from_dlpack_bool`: Tests bool round-trip and verifies uint8 is not falsely detected as bool.

## Test Plan

- [x] `ruff check` passes on both modified files
- [x] `ruff format --check` passes on both modified files
- [x] `lintrunner` reports no issues
- [x] Existing `test_ort_value_dlpack` test continues to pass
- [x] All logic paths verified against C-level bindings (bool detection, dtype conversion, shared memory)
- [ ] CI: new tests pass against a full build with DLPack enabled
### Description

This PR introduces a packaging pipeline for the ONNX Runtime WebGPU EP plugin for the following platforms:

- win/x64
- win/arm64
- linux/x64
- mac/arm64

Key changes:

**CI/CD Pipeline Additions and Improvements:**

* Added a new Azure Pipelines YAML pipeline (`plugin-webgpu-pipeline.yml`) to automate building and packaging the WebGPU plugin for Windows, Linux, and macOS, with parameterized builds for architecture, API version, package version, and build type. The pipeline validates parameter combinations and orchestrates platform-specific packaging stages.
* Introduced modular pipeline stage templates for Linux (`plugin-linux-webgpu-stage.yml`), macOS (`plugin-mac-webgpu-stage.yml`), and a top-level packaging stage (`plugin-webgpu-packaging-stage.yml`) to manage platform-specific build, artifact staging, and publishing processes.

**Plugin Versioning and Build Configuration:**

* Updated `onnxruntime_providers_webgpu.cmake` to set the plugin EP version to match `ORT_VERSION` by default, unless explicitly specified, and to pass this version via a preprocessor definition for consistent version reporting.
* Changed the plugin's reported version in `Factory::GetVersionImpl` to use the new `ORT_PLUGIN_EP_VERSION` macro, ensuring the runtime-reported version matches the build configuration.

**Codebase Maintenance:**

* Added a missing `<mutex>` include to `allocator.h` to ensure thread safety and proper compilation.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
### Description

The Vulkan interop works in a similar way as D3D12 interop. The shared handles from Vulkan work for CUDA the same way as D3D12 handles. For Linux, we can use file descriptors. As a sync primitive we use Vulkan timeline semaphores. They are widely supported since Vulkan 1.2 and work in a similar way as the existing `ID3D12Fence`s.

### Motivation and Context

This change allows graphics interop to be used on Vulkan and on Linux as well. It addresses a TODO in the external memory API.
## Summary

This PR improves CUDA Plugin EP development and validation in three areas:

- Fixes the Windows CUDA Plugin EP build so the plugin can be compiled successfully with MSVC.
- Adds dedicated Windows and Linux GitHub Actions workflows for building and testing the CUDA Plugin EP.
- Expands the quick start documentation with instructions for running the CUDA Plugin EP Python tests locally.

## Changes

### Windows build fixes

- Update the CUDA plugin CMake configuration to use the correct forced-include flags on Windows/MSVC.
- Keep the existing forced-include behavior for non-MSVC toolchains.
- Add the missing `GetEnvironmentVar(const std::string&)` forward declaration needed by plugin builds on Windows.

### CI coverage for CUDA Plugin EP

Add a Windows CUDA Plugin EP workflow that:

- builds ONNX Runtime with onnxruntime_BUILD_CUDA_EP_AS_PLUGIN=ON
- uploads build artifacts
- installs the built wheel
- sets ORT_CUDA_PLUGIN_PATH
- runs test_cuda_plugin_ep.py

Add a similar Linux CUDA Plugin EP workflow.

### Documentation updates

- Add a Running Tests section to the CUDA Plugin EP quick start.
- Document test prerequisites, dependency installation, and ORT_CUDA_PLUGIN_PATH.
- Clarify that CPU-only PyTorch is sufficient for test_cuda_plugin_ep.py because it is used for CPU-side reference computations.
… enabled (#27798) ### Description `arena_extend_strategy` was defined in separate `#if defined(USE_CUDA)` and `#if defined(USE_MIGRAPHX)` preprocessor blocks, causing a redefinition error when both providers are enabled. Consolidates the definition and extern declaration into a single combined guard: ```cpp #if defined(USE_MIGRAPHX) || defined(USE_CUDA) || defined(USE_CUDA_PROVIDER_INTERFACE) onnxruntime::ArenaExtendStrategy arena_extend_strategy = onnxruntime::ArenaExtendStrategy::kNextPowerOfTwo; #endif ``` Both files updated: - `onnxruntime/python/onnxruntime_pybind_state_common.cc` — variable definition - `onnxruntime/python/onnxruntime_pybind_state_common.h` — extern declaration No other duplicate symbols found across CUDA/MiGraphX conditional blocks. ### Motivation and Context Building with `--use_cuda --use_migraphx` fails with: ``` error: redefinition of 'onnxruntime::ArenaExtendStrategy onnxruntime::python::arena_extend_strategy' ``` Both providers share this config variable; it just needs a unified guard. <!-- START COPILOT ORIGINAL PROMPT --> <details> <summary>Original prompt</summary> > > ---- > > *This section details on the original issue you should resolve* > > <issue_title>[Build] Cannot build with cuda and migraphx</issue_title> > <issue_description>### Describe the issue > > My laptop has an nvidia dgpu and an amd igpg. 
To build with cuda and migraphx I had to edit onnxruntime/python/onnxruntime_pybind_state_common.cc to remove one of the two onnxruntime::ArenaExtendStrategy arena_extend_strategy = onnxruntime::ArenaExtendStrategy::kNextPowerOfTwo; > > ``` > #if defined(USE_CUDA) || defined(USE_CUDA_PROVIDER_INTERFACE) > // TODO remove deprecated global config > OrtCudnnConvAlgoSearch cudnn_conv_algo_search = OrtCudnnConvAlgoSearchExhaustive; > // TODO remove deprecated global config > bool do_copy_in_default_stream = true; > // TODO remove deprecated global config > onnxruntime::cuda::TunableOpInfo tunable_op{}; > onnxruntime::CUDAExecutionProviderExternalAllocatorInfo external_allocator_info{}; > // TODO remove deprecated global config > onnxruntime::ArenaExtendStrategy arena_extend_strategy = onnxruntime::ArenaExtendStrategy::kNextPowerOfTwo; > #endif > > #if defined(USE_MIGRAPHX) > // TODO remove deprecated global config > onnxruntime::ArenaExtendStrategy arena_extend_strategy = onnxruntime::ArenaExtendStrategy::kNextPowerOfTwo; > #endif > ``` > > I think it could be changed to > ``` > #if defined(USE_CUDA) || defined(USE_CUDA_PROVIDER_INTERFACE) > // TODO remove deprecated global config > OrtCudnnConvAlgoSearch cudnn_conv_algo_search = OrtCudnnConvAlgoSearchExhaustive; > // TODO remove deprecated global config > bool do_copy_in_default_stream = true; > // TODO remove deprecated global config > onnxruntime::cuda::TunableOpInfo tunable_op{}; > onnxruntime::CUDAExecutionProviderExternalAllocatorInfo external_allocator_info{}; > // TODO remove deprecated global config > #endif > > #if defined(USE_MIGRAPHX) || defined(USE_CUDA) || defined(USE_CUDA_PROVIDER_INTERFACE) > // TODO remove deprecated global config > onnxruntime::ArenaExtendStrategy arena_extend_strategy = onnxruntime::ArenaExtendStrategy::kNextPowerOfTwo; > #endif > ``` > > ### Urgency > > _No response_ > > ### Target platform > > AMD + Nvidia setups > > ### Build script > > ./build.sh --config Release 
--build_shared_lib --build_wheel --parallel 12 --skip_tests --use_cuda --cuda_home /opt/cuda --cudnn_home /opt/cuda --use_tensorrt --tensorrt_home /usr --use_openvino CPU --use_migraphx --migraphx_home /opt/rocm --cmake_extra_defines CMAKE_CXX_FLAGS="-Wno-maybe-uninitialized -Wno-free-nonheap-object" CMAKE_CUDA_ARCHITECTURES="120" > > ### Error / output > > /home/someone/onnxruntime-3.13/onnxruntime/onnxruntime/python/onnxruntime_pybind_state_common.cc:36:34: error: redefinition of ‘onnxruntime::ArenaExtendStrategy onnxruntime::python::arena_extend_strategy’ > 36 | onnxruntime::ArenaExtendStrategy arena_extend_strategy = onnxruntime::ArenaExtendStrategy::kNextPowerOfTwo; > | ^~~~~~~~~~~~~~~~~~~~~ > /home/someone/onnxruntime-3.13/onnxruntime/onnxruntime/python/onnxruntime_pybind_state_common.cc:31:34: note: ‘onnxruntime::ArenaExtendStrategy onnxruntime::python::arena_extend_strategy’ previously defined here > 31 | onnxruntime::ArenaExtendStrategy arena_extend_strategy = onnxruntime::ArenaExtendStrategy::kNextPowerOfTwo; > | ^~~~~~~~~~~~~~~~~~~~~ > > ### Visual Studio Version > > na > > ### GCC / Compiler Version > > gcc version 15.2.1 20260209 (GCC)</issue_description> > > <agent_instructions>Please create a pull request, and check whether there are other places that enabling both cuda and migraphx might cause build errors.</agent_instructions> > > ## Comments on the Issue (you are @copilot in this section) > > <comments> > </comments> > </details> - Fixes #27797 --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
## Description This change wires the CUDA plugin EP into ORT's sync surface (see #27538) so `IOBinding` can safely coordinate device work when inputs and outputs are bound on CUDA. It also clarifies the split between EP-level and factory-level sync-stream creation in the design doc and adds Python coverage to validate the new path with simple CUDA-bound models. ## Summary of Changes ### CUDA plugin EP implementation | File | Change | |------|--------| | `onnxruntime/core/providers/cuda/plugin/cuda_ep.cc` | Registers `OrtEp::CreateSyncStreamForDevice` and `OrtEp::Sync` in `CudaEp`, adds per-session CUDA sync-stream creation, and implements a conservative device-wide sync via `cudaSetDevice` + `cudaDeviceSynchronize()`. | | `onnxruntime/core/providers/cuda/plugin/cuda_ep.h` | Declares the new `CreateSyncStreamForDeviceImpl` and `SyncImpl` entry points on `CudaEp`. | ### Tests | File | Change | |------|--------| | `onnxruntime/test/python/transformers/test_cuda_plugin_ep.py` | Adds a helper to resolve the CUDA ordinal from plugin device metadata and adds `IOBinding`-based Add and MatMul tests that bind CUDA inputs/outputs and exercise the plugin EP sync path. | ### Documentation | File | Change | |------|--------| | `docs/cuda_plugin_ep/cuda_plugin_ep_design.md` | Documents that `CudaEp` owns the preferred `OrtEp::CreateSyncStreamForDevice` and `OrtEp::Sync` implementations, while `CudaEpFactory::CreateSyncStreamForDevice` remains a fallback path; also records the new `IOBinding` test coverage. | ## Testing - Set `ORT_CUDA_PLUGIN_PATH` to the rebuilt CUDA plugin library under `build/cuda/Release` and run `python -m pytest onnxruntime/test/python/transformers/test_cuda_plugin_ep.py`. - Verify the new `IOBinding` Add and MatMul tests pass with CUDA-bound `OrtValue` inputs and outputs. - Confirm existing CUDA plugin EP behavior is unchanged for non-`IOBinding` execution paths. 
## Motivation and Context `IOBinding` relies on provider synchronization to ensure asynchronous device copies are complete before dependent kernel execution continues. The CUDA plugin EP already supported sync-stream creation at the factory layer, but the staged changes connect the per-session `OrtEp` callbacks that ORT prefers when coordinating bound CUDA execution. The documentation updates make that ownership model explicit so future plugin work does not conflate the fallback factory hook with the primary EP hook. ## Checklist - [x] Tests added/updated - [x] Documentation updated (if applicable) - [x] No breaking changes - [ ] CI passes
## Description This PR reduces the amount of CUDA plugin-specific compatibility code by moving reusable validation and attribute-reading logic into shared helper paths that work for both bundled and plugin builds. It also fills in a missing allocator hook in the EP adapter so plugin kernels can reuse the same initialization path as the in-tree CUDA EP, which simplifies maintenance and improves behavior parity. The follow-up changes update the CUDA plugin design doc to reflect the new shared-helper model and add focused plugin regression tests for the two runtime paths that changed most materially. ## Summary of Changes ### EP adapter and shared helper extraction | File | Change | |------|--------| | `ep/adapter/op_kernel_info.h` | Adds `OpKernelInfo::GetAllocator(OrtMemType)` so adapter-based kernels can request device or CPU temp allocators in plugin builds. | | `cpu/tensor/scatter_nd.h` | Extracts shape validation into `scatter_nd_internal::ValidateShapes` so the same logic can be reused outside the CPU `ScatterND` class. | | `cpu/tensor/space_depth_ops.h` | Moves blocksize parsing, mode parsing, and dimension validation into `space_depth_internal` helpers that can be shared by CUDA kernels. | ### CUDA kernel cleanup and plugin parity | File | Change | |------|--------| | `cuda/tensor/scatter_nd.cc` | Removes the plugin-only `ScatterND` validation duplicate and reuses the shared helper implementation. | | `cuda/tensor/scatter_nd.h` | Drops the old conditional include split now that validation is shared through the common helper path. | | `cuda/tensor/space_depth_ops.h` | Deletes the plugin-only `SpaceToDepth`/`DepthToSpace` reimplementation and inherits from the shared base/helper logic in all builds. | | `cuda/tensor/upsample.cc` | Reuses the normal antialias lookup-table allocation/caching path in plugin builds via the new allocator adapter support. 
| | `cuda/tensor/upsample.h` | Keeps the persistent device lookup-table member available in plugin builds as well. | ### Shared-provider and diagnostics alignment | File | Change | |------|--------| | `cpu/cpu_provider_shared.cc` | Routes shared-provider `ScatterND` shape validation through the extracted helper. | | `provider_bridge_provider.cc` | Updates the bridge-side `ScatterND::ValidateShapes` implementation to call the shared helper directly. | | `cuda/cudnn_common.h` | Preserves the batch-norm epsilon warning path in plugin builds instead of suppressing it. | | `cuda/nn/conv.cc` | Removes plugin-specific shortened cuDNN frontend errors so bundled and plugin builds both include frontend JSON in failures. | | `cuda/nn/conv_transpose.cc` | Extends cuDNN frontend failures to include frontend JSON for easier debugging, matching the `Conv` behavior. | ### Documentation and regression coverage | File | Change | |------|--------| | `cuda_plugin_ep_design.md` | Updates the design doc to reflect that `ScatterND`, `SpaceDepth`, and `Upsample` now use shared adapter-safe helper paths instead of plugin-only fallback branches. | | `test_cuda_plugin_ep.py` | Adds plugin regression coverage for antialias `Resize`/`Upsample` and `ScatterND`, covering the new allocator-backed lookup-table path and the shared `ScatterND` validation helper. | ## Testing - Build with `onnxruntime_BUILD_CUDA_EP_AS_PLUGIN=ON` and verify the affected CUDA provider sources compile without the removed plugin-only fallback paths. - Run targeted CUDA provider coverage for `ScatterND`, `SpaceToDepth`/`DepthToSpace`, `Resize`/`Upsample`, `Conv`, and `ConvTranspose` in both plugin and bundled CUDA configurations. - Confirm antialias upsample still initializes and uses the shared lookup table correctly in plugin builds. - Run the new plugin tests for antialias `Resize` and `ScatterND` in `onnxruntime/test/python/transformers/test_cuda_plugin_ep.py`. 
- Confirm cuDNN frontend failure paths now emit the same diagnostic detail in plugin and non-plugin builds. ## Motivation and Context The initial CUDA plugin enablement introduced several localized `#ifdef BUILD_CUDA_EP_AS_PLUGIN` branches and helper copies to get kernels compiling under the adapter path. This cleanup pays down that compatibility debt by extracting the truly shared pieces into reusable helpers and by teaching the adapter `OpKernelInfo` how to provide the allocators those kernels already expect. The result is less duplicated logic, fewer plugin-only code paths to keep in sync, and better debugging consistency between the plugin EP and the built-in CUDA EP. ## Checklist - [x] Tests added/updated - [x] Documentation updated (if applicable) - [x] No breaking changes (or documented in description)
Bumps [lodash](https://github.com/lodash/lodash) from 4.17.23 to 4.18.1. <details> <summary>Release notes</summary> <p><em>Sourced from <a href="https://github.com/lodash/lodash/releases">lodash's releases</a>.</em></p> <blockquote> <h2>4.18.1</h2> <h2>Bugs</h2> <p>Fixes a <code>ReferenceError</code> issue in <code>lodash</code> <code>lodash-es</code> <code>lodash-amd</code> and <code>lodash.template</code> when using the <code>template</code> and <code>fromPairs</code> functions from the modular builds. See <a href="https://redirect.github.com/lodash/lodash/issues/6167#issuecomment-4165269769">lodash/lodash#6167</a></p> <p>These defects were related to how lodash distributions are built from the main branch using <a href="https://github.com/lodash-archive/lodash-cli">https://github.com/lodash-archive/lodash-cli</a>. When internal dependencies change inside lodash functions, equivalent updates need to be made to a mapping in the lodash-cli. (hey, it was ahead of its time once upon a time!). We know this, but we missed it in the last release. 
It's the kind of thing that passes in CI, but fails bc the build is not the same thing you tested.</p> <p>There is no diff on main for this, but you can see the diffs for each of the npm packages on their respective branches:</p> <ul> <li><code>lodash</code>: <a href="https://github.com/lodash/lodash/compare/4.18.0-npm...4.18.1-npm">https://github.com/lodash/lodash/compare/4.18.0-npm...4.18.1-npm</a></li> <li><code>lodash-es</code>: <a href="https://github.com/lodash/lodash/compare/4.18.0-es...4.18.1-es">https://github.com/lodash/lodash/compare/4.18.0-es...4.18.1-es</a></li> <li><code>lodash-amd</code>: <a href="https://github.com/lodash/lodash/compare/4.18.0-amd...4.18.1-amd">https://github.com/lodash/lodash/compare/4.18.0-amd...4.18.1-amd</a></li> <li><code>lodash.template</code><a href="https://github.com/lodash/lodash/compare/4.18.0-npm-packages...4.18.1-npm-packages">https://github.com/lodash/lodash/compare/4.18.0-npm-packages...4.18.1-npm-packages</a></li> </ul> <h2>4.18.0</h2> <h2>v4.18.0</h2> <p><strong>Full Changelog</strong>: <a href="https://github.com/lodash/lodash/compare/4.17.23...4.18.0">https://github.com/lodash/lodash/compare/4.17.23...4.18.0</a></p> <h3>Security</h3> <p><strong><code>_.unset</code> / <code>_.omit</code></strong>: Fixed prototype pollution via <code>constructor</code>/<code>prototype</code> path traversal (<a href="https://github.com/lodash/lodash/security/advisories/GHSA-f23m-r3pf-42rh">GHSA-f23m-r3pf-42rh</a>, <a href="https://github.com/lodash/lodash/commit/fe8d32eda854377349a4f922ab7655c8e5df9a0b">fe8d32e</a>). Previously, array-wrapped path segments and primitive roots could bypass the existing guards, allowing deletion of properties from built-in prototypes. Now <code>constructor</code> and <code>prototype</code> are blocked unconditionally as non-terminal path keys, matching <code>baseSet</code>. 
Calls that previously returned <code>true</code> and deleted the property now return <code>false</code> and leave the target untouched.</p> <p><strong><code>_.template</code></strong>: Fixed code injection via <code>imports</code> keys (<a href="https://github.com/lodash/lodash/security/advisories/GHSA-r5fr-rjxr-66jc">GHSA-r5fr-rjxr-66jc</a>, CVE-2026-4800, <a href="https://github.com/lodash/lodash/commit/879aaa93132d78c2f8d20c60279da9f8b21576d6">879aaa9</a>). Fixes an incomplete patch for CVE-2021-23337. The <code>variable</code> option was validated against <code>reForbiddenIdentifierChars</code> but <code>importsKeys</code> was left unguarded, allowing code injection via the same <code>Function()</code> constructor sink. <code>imports</code> keys containing forbidden identifier characters now throw <code>"Invalid imports option passed into _.template"</code>.</p> <h3>Docs</h3> <ul> <li>Add security notice for <code>_.template</code> in threat model and API docs (<a href="https://redirect.github.com/lodash/lodash/pull/6099">#6099</a>)</li> <li>Document <code>lower > upper</code> behavior in <code>_.random</code> (<a href="https://redirect.github.com/lodash/lodash/pull/6115">#6115</a>)</li> <li>Fix quotes in <code>_.compact</code> jsdoc (<a href="https://redirect.github.com/lodash/lodash/pull/6090">#6090</a>)</li> </ul> <h3><code>lodash.*</code> modular packages</h3> <p><a href="https://redirect.github.com/lodash/lodash/pull/6157">Diff</a></p> <p>We have also regenerated and published a select number of the <code>lodash.*</code> modular packages.</p> <p>These modular packages had fallen out of sync significantly from the minor/patch updates to lodash. 
Specifically, we have brought the following packages up to parity w/ the latest lodash release because they have had CVEs on them in the past:</p> <ul> <li><a href="https://www.npmjs.com/package/lodash.orderby">lodash.orderby</a></li> <li><a href="https://www.npmjs.com/package/lodash.tonumber">lodash.tonumber</a></li> <li><a href="https://www.npmjs.com/package/lodash.trim">lodash.trim</a></li> <li><a href="https://www.npmjs.com/package/lodash.trimend">lodash.trimend</a></li> <li><a href="https://www.npmjs.com/package/lodash.sortedindexby">lodash.sortedindexby</a></li> <li><a href="https://www.npmjs.com/package/lodash.zipobjectdeep">lodash.zipobjectdeep</a></li> <li><a href="https://www.npmjs.com/package/lodash.unset">lodash.unset</a></li> <li><a href="https://www.npmjs.com/package/lodash.omit">lodash.omit</a></li> <li><a href="https://www.npmjs.com/package/lodash.template">lodash.template</a></li> </ul> </blockquote> </details> <details> <summary>Commits</summary> <ul> <li><a href="https://github.com/lodash/lodash/commit/cb0b9b9212521c08e3eafe7c8cb0af1b42b6649e"><code>cb0b9b9</code></a> release(patch): bump main to 4.18.1 (<a href="https://redirect.github.com/lodash/lodash/issues/6177">#6177</a>)</li> <li><a href="https://github.com/lodash/lodash/commit/75535f57883b7225adb96de1cfc1cd4169cfcb51"><code>75535f5</code></a> chore: prune stale advisory refs (<a href="https://redirect.github.com/lodash/lodash/issues/6170">#6170</a>)</li> <li><a href="https://github.com/lodash/lodash/commit/62e91bc6a39c98d85b9ada8c44d40593deaf82a4"><code>62e91bc</code></a> docs: remove n_ Node.js < 6 REPL note from README (<a href="https://redirect.github.com/lodash/lodash/issues/6165">#6165</a>)</li> <li><a href="https://github.com/lodash/lodash/commit/59be2de61f8aa9461c7856533b51d31b7d8babc4"><code>59be2de</code></a> release(minor): bump to 4.18.0 (<a href="https://redirect.github.com/lodash/lodash/issues/6161">#6161</a>)</li> <li><a 
href="https://github.com/lodash/lodash/commit/af634573030f979194871da7c68f79420992f53d"><code>af63457</code></a> fix: broken tests for _.template 879aaa9</li> <li><a href="https://github.com/lodash/lodash/commit/1073a7693e1727e0cf3641e5f71f75ddcf8de7c0"><code>1073a76</code></a> fix: linting issues</li> <li><a href="https://github.com/lodash/lodash/commit/879aaa93132d78c2f8d20c60279da9f8b21576d6"><code>879aaa9</code></a> fix: validate imports keys in _.template</li> <li><a href="https://github.com/lodash/lodash/commit/fe8d32eda854377349a4f922ab7655c8e5df9a0b"><code>fe8d32e</code></a> fix: block prototype pollution in baseUnset via constructor/prototype traversal</li> <li><a href="https://github.com/lodash/lodash/commit/18ba0a32f42fd02117f096b032f89c984173462d"><code>18ba0a3</code></a> refactor(fromPairs): use baseAssignValue for consistent assignment (<a href="https://redirect.github.com/lodash/lodash/issues/6153">#6153</a>)</li> <li><a href="https://github.com/lodash/lodash/commit/b8190803d48d60b8c80ad45d39125f32fa618cb2"><code>b819080</code></a> ci: add dist sync validation workflow (<a href="https://redirect.github.com/lodash/lodash/issues/6137">#6137</a>)</li> <li>Additional commits viewable in <a href="https://github.com/lodash/lodash/compare/4.17.23...4.18.1">compare view</a></li> </ul> </details> <br /> [](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. 
[//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- <details> <summary>Dependabot commands and options</summary> <br /> You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/microsoft/onnxruntime/network/alerts). </details> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
tianleiwu
requested changes
Apr 7, 2026
Contributor
The "Bump version to 1.26.0" commit shall be excluded. You can revert it with `git revert edd9f58`.
We also need to double-check that the new APIs are placed in version 25 instead of 26. @adrianlizarraga to review.
Adds custom CUDA and CPU kernels for linear attention and causal 1D convolution with state, enabling efficient inference of Qwen3.5 hybrid decoder models in ONNX Runtime. ### New Operators **`LinearAttention`** — Implements the GatedDeltaNet recurrent linear attention mechanism: - Fused kernel computing gated delta-rule update of a recurrent state matrix - Supports both prefill (multi-token) and decode (single-token) paths - Inputs: Q, K, V, decay (alpha), beta gating, optional initial recurrent state - Outputs: attention output, updated recurrent state - CUDA implementation with per-head parallelism; CPU implementation with Eigen **`CausalConvWithState`** — Implements causal 1D convolution with persistent state for autoregressive decoding: - Supports prefill (full convolution) and decode (state-based sliding window) - Inputs: input tensor, conv weights, optional bias, optional initial conv state - Outputs: convolution output, updated conv state ### Op Definitions - Registered in `com.microsoft` domain (opset 1) - Full shape inference and type constraints in `bert_defs.cc` ### Testing - Parity test (`test_parity_linear_attention_causal_conv.py`) validates CUDA and CPU kernels against PyTorch reference implementations from the FLA (Flash Linear Attention) library --------- Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
### Description Initial pass at some AI agent instructions for this repo. Following some conventions: - https://agents.md/ - https://agentskills.io/ Add repository instructions for AI agents covering: - Build system phases (`--update`/`--build`/`--test`) and key flags - High-level architecture (graph, optimizer, session, providers) - C++ conventions (error macros, container types, span/string_view preferences) - Python conventions and testing patterns - C API guidelines and PR expectations Sourced from existing docs (`CONTRIBUTING.md`, `docs/Coding_Conventions_and_Standards.md`, `docs/PR_Guidelines.md`, `docs/C_API_Guidelines.md`) and codebase inspection. Copilot put most of this together. ### Motivation and Context Trying to help AI agents work better in this repo. Just getting some initial instructions in place. We should refine it. --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
### Description
<!-- Describe your changes. -->
Replace `int64_t err_index = 0` sentinel in PrepareForCompute with
`std::atomic<const Tind*> invalid_index{nullptr}`. The old sentinel
failed to detect out-of-bounds index 0 when a dimension has size 0,
since the final check `err_index == 0` treated it as success.
This also fixes a pre-existing data race where multiple threads in
TryParallelFor could write to err_index without synchronization.
Add GatherND_zero_dim_error regression test.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fix GatherND index validation issue.
This pull request makes several improvements to the CUDA plugin Execution Provider (EP) in ONNX Runtime, with a major focus on integrating advanced memory arena support, improving CMake build logic, and updating documentation to reflect these enhancements. The most important changes are summarized below. **Memory Arena Integration and Documentation:** * Added a full BFC-style arena (`CudaArenaAllocator` / `ArenaImpl`) and a CUDA native mempool allocator (`CudaMempoolOrtAllocator`) to the plugin, with detailed documentation of their design, integration with the ONNX Runtime core, and file layout. This enables stream-aware, shrinkable memory allocation for the CUDA plugin EP, matching ORT core capabilities. * Updated the file layout and design documentation to reflect the new arena and allocator classes, and documented how plugin allocators are now visible to session-level arena management via the standard `IArena` interface. * Removed the previous "future work" section about memory arena parity, as it is now implemented, and updated the "future work" section to focus on remaining tasks such as profiling, stream/adapter parity, and kernel registration validation. **CMake Build Improvements:** * Refactored force-include logic for adapter headers in `cmake/onnxruntime_providers_cuda_plugin.cmake` to use generator expressions, removing redundant MSVC/GCC/Clang blocks and simplifying the build configuration. * Mirrored the source directory structure in the Visual Studio solution for better organization, and set the solution folder for the CUDA plugin target. [[1]](diffhunk://#diff-38eb6ad4f3ce15c7a2395d8eb4edcf95f415e3e557eae50da2590e9e0bbccf8fR114-R117) [[2]](diffhunk://#diff-38eb6ad4f3ce15c7a2395d8eb4edcf95f415e3e557eae50da2590e9e0bbccf8fL290-R294) * Added compile options to force `/permissive` and include `iso646.h` for MSVC builds, ensuring compatibility with CUTLASS headers and alternative C++ tokens. 
* Updated unit test CMake to include plugin-specific CUDA test sources when the plugin is enabled. **Documentation Fixes:** * Fixed heading levels and updated descriptions for device synchronization and NHWC layout transformation support in the CUDA plugin EP design document. [[1]](diffhunk://#diff-7bfdbe8f16ed58d5800f488aeb1f7882d96827ea3a2c429b58a743811f9d371cL273-R273) [[2]](diffhunk://#diff-7bfdbe8f16ed58d5800f488aeb1f7882d96827ea3a2c429b58a743811f9d371cL324-R324) * Clarified which files have been replaced or refactored in the plugin (e.g., `cuda_mempool_arena.cc` is now replaced by plugin-native arena allocator files). These changes collectively improve the maintainability, feature parity, and documentation quality of the CUDA plugin EP in ONNX Runtime.
- Migrate inline C++ shader to WGSL templates
- Add bias and weight index support for gpt_oss_20b
- Enable xe-3lpg config for PTL
…ruption (#27948) ### Description BufferManager::Upload() used NormalizeBufferSize() (16-byte alignment) to determine both the staging buffer size and the CopyBufferToBuffer copy size. When the actual data size was not a multiple of 16, the extra padding bytes in the staging buffer were uninitialized, and CopyBufferToBuffer would copy those garbage bytes into the destination GPU buffer beyond the intended range. This caused data corruption when external code (e.g., onnxruntime-genai) uploaded partial data to a pre-allocated static GPU buffer using ORT's CopyTensors API. For example, uploading 24 bytes (3 x int64) of attention mask data would copy 32 bytes (rounded to 16), writing 8 garbage bytes at position 24-31 of the destination buffer, corrupting the 4th element. This manifested as a 'device lost' crash in FlashAttention when running LLM inference with graph capture enabled and odd prompt lengths (e.g., 1 or 3 tokens), because the corrupted attention mask caused ReduceSum to produce wrong seqlen_k values, leading to out-of-bounds GPU memory access. ### Fix: - Use NormalizeCopySize() (4-byte alignment, the WebGPU minimum for CopyBufferToBuffer) instead of NormalizeBufferSize() (16-byte alignment) for both the staging buffer allocation and the copy command. - Zero any padding bytes between actual size and copy size to prevent garbage from being written to the destination buffer. - Apply the same 4-byte alignment fix to MemCpy() for consistency.
The ONNX specification (opset 11+) lists INT8, INT16, and UINT8 as valid input types for the TopK operator, but ONNX Runtime's CPU execution provider only registered float, double, int32, and int64 kernels. This causes a NOT_IMPLEMENTED error when running models that produce TopK nodes with these smaller integer types (e.g., PP-DocLayoutV2 exported via torch.onnx.export(dynamo=True)). This commit adds kernel registrations and template specializations for int8_t, int16_t, and uint8_t in opset 11-23 and opset 24, along with unit tests covering largest, smallest, negative values, explicit axis, and opset 24 scenarios. ### Description Add kernel registrations and template specializations for `int8_t`, `int16_t`, and `uint8_t` types in the CPU TopK operator (opset 11-23 and opset 24). Fixes #27859 **Changed files:** - `onnxruntime/core/providers/cpu/math/top_k.cc` — template specializations + kernel registrations - `onnxruntime/core/providers/cpu/cpu_execution_provider.cc` — forward declarations + BuildKernelCreateInfo entries - `onnxruntime/test/providers/cpu/math/topk_op_test.cc` — 8 new test cases (largest, smallest, negative values, explicit axis, opset 24) ### Motivation and Context The [ONNX specification (opset 11+)](https://github.com/onnx/onnx/blob/main/docs/Operators.md#TopK) lists `INT8`, `INT16`, and `UINT8` as valid input types for the TopK operator. However, ONNX Runtime's CPU execution provider only had kernels registered for `float`, `double`, `int32`, and `int64`, causing a `NOT_IMPLEMENTED` error: ``` ONNXRuntimeError: Could not find an implementation for TopK(11) node with name '...' ``` This issue is encountered when running models like PP-DocLayoutV2 exported via `torch.onnx.export(dynamo=True)`, which produces a `Cast(bool → INT8) → TopK` pattern. The existing CPU TopK implementation is fully template-based, so no algorithmic changes were needed — only kernel registration and template instantiation for the missing types. 
All 64 TopK tests pass (including 8 new tests for the added types).
### Description
Two small fixes to JavaScript code.
The quote in pre.js is important: Closure Compiler skips
minifying property names when they are quoted. Without the quote,
the following code is generated:
```
var SharedArrayBuffer=globalThis.SharedArrayBuffer??(new WebAssembly.Memory({initial:0,maximum:0,Be:!0})).buffer.constructor;
```
property `shared` is rewritten to `Be`, which `WebAssembly.Memory()`
ignores, so `shared` falls back to its default value of `false`.
### Description
To support the model package design, one of the goals for ORT is to
automatically select the most suitable compiled EPContext binary from a
collection of precompiled variants based on the EP, provider options,
metadata, and available devices.
This PR adds first-phase model package support to ORT. There may be
other follow-up PRs in the future.
A model package is a collection of models, binaries, and metadata files
organized in a hierarchically structured directory.
The directory structure is not yet finalized, so the following is just a
simple example of a model package directory:
````
<model>.ortpackage/
├── manifest.json
├── pipeline.json
├── configs/
│   ├── genai_config.json
│   └── chat_template.jinja
└── models/
    └── model_name/
        ├── metadata.json
        │     Contains general information on the component model,
        │     and specific information about each model variant
        │     such as data types, quantization algo, EP, etc. that
        │     is updated on add/remove of model variant
        ├── shared_weights/  (shared weights from all variants)
        │   ├── <checksum of weights file A>/
        │   │   └── model.data
        │   ├── <checksum of weights file B>/
        │   │   └── model.data
        │   └── ...
        ├── base model/
        │   └── model.onnx
        ├── variant A/
        │   ├── optimized model.onnx  (contains EPContext nodes)
        │   └── [Compilation artifacts]
        └── variant B/
            ├── optimized model.onnx  (contains EPContext nodes)
            └── [Compilation artifacts]
````
#### Spec and Format:
See
[here](https://github.com/microsoft/onnxruntime/blob/07e55627e75da24099c582331a0f786090e6382a/onnxruntime/core/session/model_package/README.md)
#### Definitions:
- Model Package
- A model package defines the overall logical ‘model’
- A model package contains one or more ‘component models’
- Component Model
- A component model comprises one or more ‘model variants’
- Model Variant
- A ‘model variant’ is a single ONNX or ORT format model
#### manifest.json and metadata.json
A manifest.json may look like:
````
{
"model_name": <logical_model_name>,
"component_models": [
<component_model_name_1>,
<component_model_name_2>
]
}
````
A metadata.json for a component model may look like:
````
{
"component_model_name": <component_model_name_1>,
"model_variants": {
<variant_name_1>: {
"file": <ep_context_model_1 onnx file>,
"constraints": {
"ep": <ep_name>,
"device": <device_type>,
"architecture": <hardware_architecture>
}
},
<variant_name_2>: {
"file": <ep_context_model_2 onnx file>,
"constraints": {
"ep": <ep_name>,
"device": <device_type>,
"architecture": <hardware_architecture>
}
}
}
}
````
#### Model Selection
The selection logic is implemented in `MatchesVariant()`, which
evaluates the following constraints:
(Note: A constraint refers to a value under the "constraints" field in
either manifest.json or metadata.json.)
- Check ep constraint
- Check device constraint
- Some provider-bridge EPs may not implement
`OrtEpFactory::GetSupportedDevices`, so ORT won't have the supported
device information for those EPs. In that case, ORT skips the device
constraint validation for them.
- If the provider options contain a key related to the device type, its
value must match the device constraint, if one is present.
- Check ep_compatibility_info constraint
- ORT does not directly evaluate the architecture constraint. Instead,
it relies on the ep_compatibility_info constraint, which may encode
architecture information if needed.
- The ep_compatibility_info value is expected to match the EP
compatibility string stored in the EPContext model metadata. (See
OrtEp::GetCompiledModelCompatibilityInfo() for how this string is
generated.)
- The EP implementation of
EpFactory::ValidateCompiledModelCompatibilityInfo() is responsible for
validating the compatibility string against the target device (i.e.
OrtHardwareDevice) and returning the compatibility result.
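The constraint checks above can be sketched in Python. This is a hypothetical simplification of `MatchesVariant()` against the metadata.json layout shown earlier; `matches_variant` and `select_variant` are illustrative names, not the actual ORT API, and the compatibility-string validation is reduced to a plain equality check.

```python
def matches_variant(constraints, ep_name, device_type=None, compat_info=None):
    """Return True if a variant's constraints are satisfied (illustrative sketch)."""
    if constraints.get("ep") not in (None, ep_name):
        return False
    # The device constraint is skipped when the EP did not report its devices
    # (e.g., a provider-bridge EP without OrtEpFactory::GetSupportedDevices).
    want_device = constraints.get("device")
    if want_device is not None and device_type is not None and want_device != device_type:
        return False
    # Real ORT delegates this to the EP's compatibility validation; here we
    # just compare against the EPContext model's compatibility string.
    want_compat = constraints.get("ep_compatibility_info")
    if want_compat is not None and want_compat != compat_info:
        return False
    return True


def select_variant(metadata, ep_name, device_type=None, compat_info=None):
    """Pick the first model variant whose constraints match, or None."""
    for name, variant in metadata["model_variants"].items():
        if matches_variant(variant.get("constraints", {}), ep_name, device_type, compat_info):
            return name, variant["file"]
    return None
```

A variant with no `device` constraint matches any device, which mirrors the "skip validation when information is missing" behavior described above.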
#### Note
Check the unit test
[here](https://github.com/microsoft/onnxruntime/pull/27786/changes#diff-bfa4122a85543ae2d80bf4cf6d9f85248e51c2276a5956af32f9bd8c8983d23a)
to better understand how to use a model package.
#### Code Change
This pull request introduces significant enhancements to the execution
provider (EP) selection and management infrastructure in ONNX Runtime.
The main focus is on supporting more sophisticated device selection and
manifest-based model packaging, as well as refactoring provider
selection logic for modularity and future extensibility.
Key changes include:
- Introduction of model package context and manifest parsing to support
selecting model components based on device and EP constraints.
- Refactoring of the execution provider interface and related classes to
support multiple devices per provider.
- Modularization of EP/device selection, creation, and registration
logic in the provider policy context.
The most important changes are:
**Model Package Context and Manifest Support**
- Added new files `model_package_context.h` and
`model_package_context.cc` to implement manifest parsing, device/EP
constraint matching, and component selection logic for model packages.
This enables ONNX Runtime to select the most appropriate model variant
based on available hardware and EP configuration.
[[1]](diffhunk://#diff-006078879d52b421c973e2880c65db474aad6b21ad81ba69d387df8661bafeb2R1-R78)
[[2]](diffhunk://#diff-45c29f481077e424c8969dc2198a8b40ab5908cf3b0bbf25dbeaca3ec51935d5R1-R279)
**Execution Provider Interface Enhancements**
- Updated the `IExecutionProvider` class to support construction with a
list of `OrtEpDevice` pointers, and added a `GetEpDevices()` method to
retrieve the supported devices. This allows plugin and bridge EPs to
expose multiple devices.
[[1]](diffhunk://#diff-e15769e35b807986b812aae3ff7192269e171c5846b2ff4d8ec571ec8ed57aa4R87-R104)
[[2]](diffhunk://#diff-e15769e35b807986b812aae3ff7192269e171c5846b2ff4d8ec571ec8ed57aa4R203-R207)
- Updated plugin EP construction to pass the list of supported devices
to the base class.
**Provider Policy Context Refactoring**
- Refactored provider policy context logic to modularize device
ordering, device selection, telemetry logging, EP creation, and
registration. This includes splitting the monolithic
`SelectEpsForSession` into smaller methods: `OrderDevices`,
`SelectEpDevices`, `LogTelemetry`, `CreateExecutionProviders`,
`RegisterExecutionProviders`, and a new flow for model package-based EP
selection.
[[1]](diffhunk://#diff-dd9f398bec3f054aed2c930af620e3e1bfcc5b4a5d5667c4b0cd1f60ddfffda0R53-R58)
[[2]](diffhunk://#diff-dd9f398bec3f054aed2c930af620e3e1bfcc5b4a5d5667c4b0cd1f60ddfffda0L118-L156)
[[3]](diffhunk://#diff-dd9f398bec3f054aed2c930af620e3e1bfcc5b4a5d5667c4b0cd1f60ddfffda0L225-R199)
[[4]](diffhunk://#diff-dd9f398bec3f054aed2c930af620e3e1bfcc5b4a5d5667c4b0cd1f60ddfffda0R254-R365)
These changes collectively lay the groundwork for more flexible, robust,
and extensible device and EP selection in ONNX Runtime, especially in
scenarios involving packaged models with multiple variants and complex
hardware environments.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
The test was originally added to cover model selection based on the "device type" provider option. However, when the provider-option check was removed from the selection logic, the related test was not removed.
### Description
Fix a possible out-of-bounds target class id in TreeEnsemble.
### Motivation and Context
Security issue.
…#27674) This patch sets `is_channels_last` to true by default in the parameter of `ComputeMatMul` and ignores it in `UseSplitK` when there is no `bias`.
### Description
Improve DeformConv op performance.

### Motivation and Context
This PR consolidates a series of optimizations targeting the `DeformConv` (Deformable Convolution) operator across both CPU and CUDA execution providers.

* **For CPU:** The previous implementation suffered from bottlenecks due to redundant computations, lack of vectorization in bilinear sampling, and sub-optimal thread pool utilization. This overhaul redesigns the memory layout and execution pipeline to maximize SIMD opportunities and harden memory safety.
* **For GPU:** The batched GEMM operation previously relied on an intermediate buffer and a custom scatter kernel to format the output, which consumed extra memory and kernel launch overhead. This update introduces a zero-copy approach.

---

#### 1. CPU Optimizations & Refactoring
The CPU execution path has been heavily refactored to minimize branching in hot paths, maximize vectorization, and safely handle edge cases.

| Feature / Optimization | Description | Key Benefit |
| :--- | :--- | :--- |
| **AoSoA Bilinear Sampling Plan** | Replaced on-the-fly interpolation with a precomputed sampling plan using an 8-lane Array-of-Structures-of-Arrays (AoSoA) layout (`kPlanAoSoALanes`). | Perfectly aligns with 256-bit AVX2 vectors, enabling highly efficient SIMD unrolling during the `im2col` gathering phase. |
| **Kernel Metadata Caching** | Introduced `DeformConvKernelMetaCacheData` to cache static convolution geometry (e.g., `kH`, `kW`, `padding`, `dilation`). | Eliminates the O(kernel_size) overhead of reallocating and recomputing base offsets on every single `Compute()` step. |
| **Fast Math & Branchless Logic** | Implemented a custom `DeformConvFastFloor` and utilized an inverted bounds check with bitwise operations to evaluate all four corners simultaneously. | Removes expensive `std::floor` calls and unpredictable branches from the operator's hottest path. |
| **Enhanced Parallelization** | Flattened the bilinear sampling plan build tasks across spatial pixels. | Allows `concurrency::ThreadPool::TryParallelFor` to split fine-grained work effectively, drastically improving thread pool scaling. |
| **Hardened Bounds Checking** | Introduced compute-time bounds checks using `CheckedMulSizeT` and `CheckedBatchSpan`. | Ensures batch indexing and stride calculations stay within the addressable `size_t` range, preventing integer overflow vulnerabilities. |
| **Bias Addition Refactoring** | Refactored bias addition to avoid expensive `div`/`mod` operations, applying `ORT_CPU_RESTRICT` and force-inlining. | Maximizes memory throughput and instruction pipelining during the final bias addition phase. |

---

#### 2. GPU (CUDA) Optimizations
The CUDA implementation was optimized to reduce memory footprint and eliminate unnecessary kernel launches.

* **Zero-Copy GEMM Output:** Removed the temporary `gemm_output_buffer` allocation entirely. By carefully configuring the `stride_c` parameter (`stride_c_y = M * output_image_size`), the `cublasGemmStridedBatchedHelper` now writes the computed output directly into the correct NCHW memory layout of the final `Y` tensor.
* **Kernel Elimination:** Completely removed the `DeformConvCopyGemmOutputRowMajorToNCHW` custom kernel and its associated dispatch logic. This reduces kernel launch overhead, lowers GPU memory bandwidth pressure, and simplifies the overall CUDA execution pipeline.
* **Reduced Memory Footprint:** Updated the `bytes_per_image` calculation for workspace memory to reflect the removal of the GEMM output buffer. This allows the operator to potentially process more images in parallel under the same memory constraints.

---

#### 3. Changed
- **Batch chunking:** Chunk size `k` is chosen so that the number of outer rounds is minimized under the temp-memory cap; **`k` does not have to divide `N`**. The host loop uses `cur_parallel = min(k, N - b)`, so the last chunk may be smaller. This is the intended default behavior for this EP (not yet in a formal release).
- **Kernel-size templates:** Im2col is specialized for **1×1, 3×3, and 7×7**; other sizes (including **5×5**) use the **dynamic** `kH`/`kW` path. Rationale: 5×5 is less common in current stacks (often replaced by stacked 3×3); specializing 7×7 targets common large-kernel cases. Older DCN/detection models that still use **5×5** deformable conv will take the dynamic path: correctness is unchanged; only compile-time unrolling differs.
- **Aliasing flags:** Updated DeformConv aliasing comments to make the stronger guarantee explicit: if output `Y` overlaps any input buffer, results can be incorrect regardless of `restrict`, because output writes may clobber source elements before they are fully consumed. `restrict` further tightens this by introducing undefined behavior when aliasing assumptions are violated.

---

### Summary
In the current implementation, CPU performance is 33x (main branch is 15x) that of TorchVision. If we were to implement AVX2/AVX512 optimizations from scratch, we could achieve a 36x performance boost. However, I haven’t found any similar reference code in the ONNX Runtime repository.

This PR also significantly improves parallelism:
<img width="540" height="332" alt="image" src="https://github.com/user-attachments/assets/d4f670bd-dde3-43f1-b597-4471bfde005b" />
_Both ort and tv are configured with 16 threads_

### Open Question for Reviewers
**Regarding CUDA Temporary Memory Allocation:** Currently, the effective maximum temporary memory for CUDA is calculated using a heuristic (`total_global_mem * 0.1` or similar logic in `GetDeformConvEffectiveMaxTempBytes`). While the removal of `gemm_output_buffer` has reduced the memory footprint per image, I am not entirely certain whether this 10% threshold is still the most appropriate value for balancing parallel image processing (`n_parallel_imgs`) against overall VRAM consumption in large models. I would appreciate any feedback or suggestions on whether we should tune this threshold, or whether there is a more robust way to dynamically determine the optimal temporary workspace size for `DeformConv` in ORT.
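The fast-floor and four-corner bounds-check ideas mentioned above can be rendered in Python for illustration. This is a conceptual sketch only: the real implementation is branch-free C++ with AVX2, and `fast_floor` / `corners_in_bounds` are illustrative names, not the actual `DeformConvFastFloor` code.

```python
def fast_floor(x):
    """Floor without std::floor: truncate, then adjust negatives with a fraction."""
    i = int(x)  # truncation rounds toward zero
    return i - 1 if x < i else i


def corners_in_bounds(h_low, w_low, height, width):
    """A bilinear sample touches (h_low, w_low)..(h_low+1, w_low+1).

    Checking only the low corner against [-1, height) x [-1, width) covers all
    four corners at once; the C++ version does this with an unsigned-compare
    (bitwise) trick instead of short-circuit branches.
    """
    return -1 <= h_low < height and -1 <= w_low < width
```

The point of the single range check is that a low corner of exactly -1 or `height - 1` still contributes one valid row of the 2x2 neighborhood, so out-of-image corners can be zero-weighted rather than branched on individually.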
### Description
In `Dequantize4BitsKernelReOrder` (CPU and CUDA EP), values from the `g_idx` tensor are used directly as array indices into the `scales` and `zero_points` buffers without bounds checking. This PR adds value-range validation and tests for the `g_idx` input tensor in the `MatMulNBits` operator.
### Motivation and Context
---------
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
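The kind of value-range validation described above can be sketched in a few lines. This is a hypothetical Python helper for illustration, not the actual kernel code; `validate_g_idx` and `n_groups` are assumed names.

```python
def validate_g_idx(g_idx, n_groups):
    """Reject group indices that would index outside the scales/zero_points buffers."""
    for i, g in enumerate(g_idx):
        if not 0 <= g < n_groups:
            raise ValueError(f"g_idx[{i}] = {g} is out of range [0, {n_groups})")
    return g_idx
```

Without such a check, a crafted model could make the dequantization kernel read arbitrary out-of-bounds memory through attacker-controlled `g_idx` values.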
### Description
Add input validation to the LinearClassifier operator to prevent an out-of-bounds heap read in GEMM when a crafted model provides mismatched coefficients/intercepts sizes. Fixes https://portal.microsofticm.com/imp/v5/incidents/details/31000000559851/summary
### Changes
- **Constructor**: Validate `class_count_ > 0` and `coefficients_.size() % class_count_ == 0`
- **Compute()**: Validate `coefficients_.size() == class_count * num_features` before GEMM call
- **Tests**: Two regression tests for invalid coefficient sizes
### Motivation and Context
MSRC case 109185 (VULN-176698): OOB read via GEMM from a crafted model in the LinearClassifier operator. The root cause is missing validation that the coefficients vector size matches `[class_count, num_features]` before passing raw pointers to GEMM.
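The two validation points listed above can be sketched together in Python. This mirrors the checks conceptually; `check_linear_classifier` is an illustrative name, not the C++ implementation.

```python
def check_linear_classifier(coefficients, class_count, num_features):
    """Validate that coefficients really is a [class_count, num_features] matrix."""
    # Constructor-time checks: class_count positive, size divisible by class_count.
    if class_count <= 0:
        raise ValueError("class_count must be positive")
    if len(coefficients) % class_count != 0:
        raise ValueError("coefficients size must be a multiple of class_count")
    # Compute-time check: size matches the input's feature dimension exactly,
    # so the raw pointers handed to GEMM cannot run past the buffer.
    if len(coefficients) != class_count * num_features:
        raise ValueError(
            f"expected {class_count * num_features} coefficients, "
            f"got {len(coefficients)}")
```

The divisibility check alone is not enough, because `num_features` comes from the runtime input shape; the exact-size check in `Compute()` is what actually bounds the GEMM read.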
### Description
Add a pre-commit [git hook](https://git-scm.com/book/en/v2/Customizing-Git-Git-Hooks) that runs lintrunner on staged files, catching lint and formatting issues before they reach CI. The hook runs lintrunner in check-only mode (no auto-fix) to avoid issues with partial staging. If lint issues are found, the commit is blocked and the developer is prompted to run `lintrunner -a` to fix. The hook is opt-in; contributors enable it with `git config core.hooksPath .githooks`.
### Motivation and Context
Follow-up from #27856. Catching lint issues at commit time saves CI cycles and review time.
WebGPU support for Qwen3.5, adding `LinearAttention` and `CausalConvWithState` ops based on the proposal in onnx/onnx#7767.

The model can be created with the model builder from https://github.com/microsoft/onnxruntime-genai/blob/main/src/python/py/models/builder.py. For example, for the text-only flavor:
```
python builder.py -m Qwen/Qwen3.5-0.8B -o Qwen3.5-0.8B -e webgpu -p int4 --extra_options int4_accuracy_level=4 exclude_embeds=False
```
### Description
Add Arm64 BF16 fast-math convolution support in MLAS:
- direct NCHW conv
- depthwise 3x3 NCHWc conv
- pointwise 1x1 NCHWc conv

This change adds new AArch64 BF16 asm kernels, wires them into MLAS platform dispatch, keeps accumulated pointwise batches on the custom BF16 path instead of falling back to generic SBGEMM, and adds the required BF16 build flags. The new paths are only used when Arm64 BF16 fast-math is enabled via the existing session option. Baseline FP32 behavior is unchanged.

### Performance
Individual convolution improvements when running on a `c8g` AWS instance, where "base" is FP32 execution, "fast-math" is fast-math enabled without this PR, and "PR" is fast-math with this change:

| Type | Shape | fast-math vs base | PR w/ fast-math vs base | PR w/ fast-math vs fast-math |
|---|---|---:|---:|---:|
| depthwise | N1 IC32 OC32 H112xW112->112x112 K3x3 S1x1 D1x1 P1/1/1/1 G32 | 0.991x | 1.047x | 1.057x |
| depthwise | N1 IC96 OC96 H112xW112->56x56 K3x3 S2x2 D1x1 P1/1/1/1 G96 | 1.015x | 1.015x | 1.000x |
| depthwise | N1 IC144 OC144 H56xW56->28x28 K3x3 S2x2 D1x1 P1/1/1/1 G144 | 1.020x | 1.004x | 0.984x |
| depthwise | N1 IC144 OC144 H56xW56->56x56 K3x3 S1x1 D1x1 P1/1/1/1 G144 | 1.034x | 1.138x | 1.101x |
| depthwise | N1 IC192 OC192 H28xW28->28x28 K3x3 S1x1 D1x1 P1/1/1/1 G192 | 0.997x | 1.033x | 1.037x |
| depthwise | N1 IC384 OC384 H28xW28->14x14 K3x3 S2x2 D1x1 P1/1/1/1 G384 | 1.016x | 1.021x | 1.005x |
| depthwise | N1 IC384 OC384 H28xW28->28x28 K3x3 S1x1 D1x1 P1/1/1/1 G384 | 1.011x | 1.090x | 1.077x |
| depthwise | N1 IC576 OC576 H14xW14->7x7 K3x3 S2x2 D1x1 P1/1/1/1 G576 | 1.029x | 0.995x | 0.967x |
| depthwise | N1 IC576 OC576 H14xW14->14x14 K3x3 S1x1 D1x1 P1/1/1/1 G576 | 1.025x | 1.006x | 0.982x |
| depthwise | N1 IC960 OC960 H7xW7->7x7 K3x3 S1x1 D1x1 P1/1/1/1 G960 | 1.002x | 0.941x | 0.939x |
| nchw | N1 IC3 OC32 H224xW224->112x112 K3x3 S2x2 D1x1 P1/1/1/1 G1 | 1.001x | 1.058x | 1.058x |
| pointwise | N1 IC16 OC96 H112xW112->112x112 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.213x | 1.328x | 1.095x |
| pointwise | N1 IC32 OC16 H112xW112->112x112 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.020x | 1.019x | 0.998x |
| pointwise | N1 IC32 OC32 H112xW112->112x112 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.118x | 1.196x | 1.069x |
| pointwise | N1 IC32 OC144 H56xW56->56x56 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.220x | 1.528x | 1.252x |
| pointwise | N1 IC32 OC192 H28xW28->28x28 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.199x | 1.418x | 1.183x |
| pointwise | N1 IC64 OC384 H28xW28->28x28 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.294x | 1.938x | 1.497x |
| pointwise | N1 IC96 OC32 H56xW56->56x56 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.080x | 1.426x | 1.320x |
| pointwise | N1 IC96 OC576 H14xW14->14x14 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.280x | 1.961x | 1.532x |
| pointwise | N1 IC144 OC32 H28xW28->28x28 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.132x | 1.351x | 1.193x |
| pointwise | N1 IC144 OC32 H56xW56->56x56 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.073x | 1.374x | 1.281x |
| pointwise | N1 IC160 OC960 H7xW7->7x7 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.133x | 1.744x | 1.539x |
| pointwise | N1 IC192 OC32 H28xW28->28x28 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.166x | 1.411x | 1.210x |
| pointwise | N1 IC192 OC64 H28xW28->28x28 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.212x | 1.763x | 1.454x |
| pointwise | N1 IC320 OC1280 H7xW7->7x7 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.136x | 2.059x | 1.812x |
| pointwise | N1 IC384 OC64 H28xW28->28x28 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.256x | 1.904x | 1.516x |
| pointwise | N1 IC384 OC96 H14xW14->14x14 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.206x | 1.929x | 1.600x |
| pointwise | N1 IC576 OC96 H14xW14->14x14 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.250x | 2.055x | 1.644x |
| pointwise | N1 IC576 OC160 H7xW7->7x7 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 0.902x | 1.423x | 1.577x |
| pointwise | N1 IC960 OC160 H7xW7->7x7 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 0.915x | 1.527x | 1.668x |
| pointwise | N1 IC960 OC320 H7xW7->7x7 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 1.020x | 1.756x | 1.723x |
| pointwise | N1 IC1280 OC1008 H1xW1->1x1 K1x1 S1x1 D1x1 P0/0/0/0 G1 | 0.747x | 1.149x | 1.538x |

When running full models, the performance improvements on `c8g` (AWS Graviton 4) and `Standard_D32plds_v6` (Azure Cobalt-100), running [MobileNet v2.7](https://github.com/onnx/models/blob/main/validated/vision/classification/mobilenet/model/mobilenetv2-7.onnx) with 8 threads, are:

| Instance | PR w/ fast-math vs base | PR w/ fast-math vs fast-math |
|---|---|---|
| `c8g` | 1.892x | 1.647x |
| `Standard_D32plds_v6` | 2.884x | 1.692x |

(cc: @Rohanjames1997 @snadampal)

---------

Signed-off-by: Milos Puzovic <milos.puzovic@arm.com>
### Description
Fix ICM issue: https://portal.microsofticm.com/imp/v5/incidents/details/31000000567822/summary

The ICM is mainly about two issues in `validate_package.py`, which were fixed by #27840. But the ICM also references another issue in `whisper_jump_times.py`, which is what this PR fixes.
### Motivation and Context
ICM fixes.
## Description
Ports graph capture/replay APIs (e.g., CUDA Graph) to the Plugin EP (`OrtEp`) C API so that plugin-based execution providers can participate in ORT-managed graph capture and replay.

### What changed
**New Plugin EP C API functions** (`onnxruntime_ep_c_api.h`):
- `OrtEp::IsGraphCaptureEnabled` — indicates whether the EP has graph capture enabled.
- `OrtEp::IsGraphCaptured` — indicates whether a graph has been captured for a given annotation ID.
- `OrtEp::ReplayGraph` — replays a previously captured graph.
- `OrtEp::GetGraphCaptureNodeAssignmentPolicy` — returns the node assignment validation policy for graph capture.

All four are optional (NULL defaults to safe behavior) and version-gated (`ort_version_supported >= 26`). If `IsGraphCaptureEnabled` returns true, `IsGraphCaptured` and `ReplayGraph` must also be implemented; otherwise `PluginExecutionProvider` logs a warning and disables graph capture for that EP.

**New `OrtGraphCaptureNodeAssignmentPolicy` enum** (`onnxruntime_ep_c_api.h`):
Replaces the hardcoded EP-name checks in `InferenceSession::Initialize()` with a policy-based approach:
- `ALL_NODES_ON_EP` — all nodes must be on the target EP (e.g., TensorRT).
- `ALLOW_CPU_FOR_SHAPES` — CPU nodes allowed for shape computation if no memcpy nodes exist (e.g., CUDA, WebGPU, DML).

**Refactored `InferenceSession` graph capture selection** (`inference_session.cc`):
- Removed the hardcoded `graph_support_ep_list` and per-EP `strcmp` checks.
- Now iterates over all registered EPs and uses `IsGraphCaptureEnabled()` + `GetGraphCaptureNodeAssignmentPolicy()` to select and validate the graph-capturing EP.
- `AreAllComputeNodesAssignedToCudaOrJsOrDmlEpWebGpuEp()` → generalized to `AreAllComputeNodesAssignedToEpOrCpu()`, which also requires at least one node on the target EP.
- `IExecutionProvider::GetGraphCaptureNodeAssignmentPolicy()` added to the base class (defaults to `ALL_NODES_ON_EP`).

**Bounded graph capture recursion** (`inference_session.cc/h`):
- `Run()` now delegates to `RunImpl()` with a `graph_capture_depth` parameter.
- Caps internal run attempts at `kMaxGraphCaptureRunAttempts = 8`, returning a clear error if the EP never reports `IsGraphCaptured() == true`.

**EP implementations**:
- **WebGPU plugin EP**: Fully implements all four graph capture APIs by forwarding to the underlying `IExecutionProvider`.
- **CUDA plugin EP**: Stubs with TODOs (returns disabled/not-implemented).
- **NvTensorRTRTX EP**: `IsGraphCaptureEnabled()` now returns `false` since this EP manages graph capture internally (not via ORT).

**C++ wrapper** (`onnxruntime_cxx_api.h` / `onnxruntime_cxx_inline.h`):
- Added `Ort::Env::CopyTensor()` convenience overload for copying a single tensor (wraps `CopyTensors` with `num_tensors=1`).

### Tests
- **`ep_plugin_provider_test.cc`**: Unit tests for each new `PluginExecutionProvider` graph capture method, including NULL function pointer defaults, version < 26 backward compatibility, and validation that `IsGraphCaptureEnabled()` returns false when `IsGraphCaptured` or `ReplayGraph` are NULL.
- **`test_graph_capture.cc`**: End-to-end test for WebGPU plugin EP graph capture/replay using IO binding (warm-up + capture run, then replay with different inputs).

### Motivation and Context
Previously, graph capture support was limited to a hardcoded list of EPs (`kCudaExecutionProvider`, `kTensorrtExecutionProvider`, `kJsExecutionProvider`, `kWebGpuExecutionProvider`, `kDmlExecutionProvider`) with EP-specific validation logic in `InferenceSession`. This made it impossible for plugin EPs to participate in ORT-managed graph capture/replay without modifying the core session code. This PR makes graph capture/replay extensible to any EP, including out-of-tree plugin EPs, by exposing it through the `OrtEp` C API.
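The two node-assignment policies can be sketched as follows. This is a hypothetical simplification: the real validation lives in `InferenceSession` and also accounts for memcpy nodes, and `graph_capture_allowed` is an illustrative name.

```python
ALL_NODES_ON_EP = "ALL_NODES_ON_EP"
ALLOW_CPU_FOR_SHAPES = "ALLOW_CPU_FOR_SHAPES"


def graph_capture_allowed(node_eps, target_ep, policy):
    """node_eps: the EP name each graph node was assigned to after partitioning."""
    # At least one node must actually run on the graph-capturing EP.
    if target_ep not in node_eps:
        return False
    if policy == ALL_NODES_ON_EP:
        return all(ep == target_ep for ep in node_eps)
    # ALLOW_CPU_FOR_SHAPES: CPU nodes are tolerated (e.g., shape computation).
    return all(ep in (target_ep, "CPUExecutionProvider") for ep in node_eps)
```

With `ALLOW_CPU_FOR_SHAPES`, a model whose shape subgraph falls back to CPU can still be captured, whereas a TensorRT-style `ALL_NODES_ON_EP` policy rejects any fallback node.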
### Description
<!-- Describe your changes. -->
- Update `WhereDummyDq` QDQ transformer to be more selective before
inserting a dummy `DequantizeLinear` around `Where`.
- `SatisfyCondition` now requires the `Where` output to have exactly one
consumer and that consumer must be `QuantizeLinear` (Q). Otherwise, the
transform is skipped.
- `InsertDummyDQ` additionally checks element type consistency between
the upstream DQ input tensor type and the downstream Q output tensor
type; if they differ, the transform returns without modifying the graph.
- Update the implementation of `WhereDummyDq` to avoid negative or zero
`scale` values. The change maps the float value to the **boundary** of the
integer domain to ensure the `scale` value is positive.
- If `WhereOp` gets a float scalar `xf` and a `DequantizeLinear` as its
two inputs, `WhereDummyDq` inserts a DQ to ensure `xf = DQ(xq, scale, zp)`.
- The `xq`, `scale` and `zp` values are determined by the following table.
| | uint8 | uint16 | int8 | int16 |
|-----------------|--------------|---------------|-------------|---------------|
| xf > 0 | | | | |
| xq | 255 | 65535 | 127 | 32767 |
| zp | 127 | 32767 | 0 | 0 |
| xf < 0 | | | | |
| xq | 0 | 0 | -128 | -32768 |
| zp | 127 | 32767 | 0 | 0 |
| xf = 0 | | | | |
| xq | 127 | 32767 | 0 | 0 |
| zp | 127 | 32767 | 0 | 0 |
- `scale = xf / (xq - zp)` if `xq != zp` else `1`
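The table and scale formula above can be sketched directly in Python. This is illustrative only, not the transformer's actual code; `dummy_dq_params` and the `_QPARAMS` lookup are assumed names.

```python
_QPARAMS = {
    # dtype: (qmax, qmin, zp) — per the xq/zp table above
    "uint8":  (255, 0, 127),
    "uint16": (65535, 0, 32767),
    "int8":   (127, -128, 0),
    "int16":  (32767, -32768, 0),
}


def dummy_dq_params(xf, dtype):
    """Return (xq, scale, zp) such that xf == (xq - zp) * scale with scale > 0."""
    qmax, qmin, zp = _QPARAMS[dtype]
    if xf > 0:
        xq = qmax        # map positive values to the upper boundary
    elif xf < 0:
        xq = qmin        # map negative values to the lower boundary
    else:
        xq = zp          # xf == 0: xq == zp, scale degenerates to 1
    scale = xf / (xq - zp) if xq != zp else 1.0
    return xq, scale, zp
```

Because `xq` sits on the boundary farthest from `zp` in the sign direction of `xf`, the quotient `xf / (xq - zp)` is always strictly positive when `xf != 0`.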
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
- Negative or zero scale values are not friendly to various EPs and
backends such as QNN-EP.
- Inserting an additional DQ is only useful when it forms a valid QDQ
“node unit” pattern. If the `Where` output is not followed by a single
`QuantizeLinear` (e.g., multiple consumers or a non-Q consumer), adding
a dummy DQ cannot create the intended pattern and may lead to
non-fusible/undesired graph structures.
````
@@ -1 +1 @@
-1.25.0
+1.26.0
````
Contributor
This is probably not right.
Contributor
The following PR needs to be excluded as it adds APIs for ORT 1.26.0: #27958
## Description
This PR brings CUDA graph capture/replay to the CUDA plugin execution provider so plugin-based CUDA deployments can get the same reduced CPU launch overhead that the in-tree CUDA EP already supports. It also adds the ORT framework and plugin-C-API plumbing needed to let graph-enabled plugin EPs participate safely in warmup, capture, and replay, while preserving compatibility with older plugins through version-gated fallbacks.

## Summary of Changes

### CUDA plugin EP runtime and allocator integration

| File | Change |
|------|--------|
| `onnxruntime/core/providers/cuda/plugin/cuda_ep.cc` | Implements plugin-side graph capture lifecycle callbacks, per-thread graph context management, graph replay, and stream selection for graph-enabled runs. |
| `onnxruntime/core/providers/cuda/plugin/cuda_ep.h` | Adds CUDA graph configuration/state to the plugin EP, including per-thread graph context ownership. |
| `onnxruntime/core/providers/cuda/plugin/cuda_graph_plugin.cc` | Adds `CudaGraphSet`/`CudaGraphManager` to own captured graphs and coordinate warmup, capture, and replay by annotation ID. |
| `onnxruntime/core/providers/cuda/plugin/cuda_graph_plugin.h` | Declares the new graph manager types and graph-related constants. |
| `onnxruntime/core/providers/cuda/plugin/cuda_stream_plugin.cc` | Adds external-stream wrapping so graph-enabled runs can reuse the thread’s graph stream without taking ownership of it. |
| `onnxruntime/core/providers/cuda/plugin/cuda_stream_plugin.h` | Declares the external-stream initialization path and stream ownership tracking. |
| `onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc` | Parses `enable_cuda_graph` and `min_num_runs_before_cuda_graph_capture` provider/session options for the plugin EP. |
| `onnxruntime/core/providers/cuda/plugin/cuda_mempool_allocator_plugin.cc` | Updates allocator behavior needed for CUDA native mempool compatibility during graph capture/replay. |
| `onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h` | Adjusts plugin kernel/device helpers used by the graph-enabled execution path. |
| `onnxruntime/core/providers/cuda/plugin/cuda_plugin_utils.h` | Adds supporting helpers used by the plugin CUDA graph flow. |

### ORT framework and plugin API support for graph replay

| File | Change |
|------|--------|
| `include/onnxruntime/core/session/onnxruntime_ep_c_api.h` | Documents and extends the plugin EP contract for graph-enabled runs, including replay behavior relative to `OnRunStart`/`OnRunEnd`. |
| `include/onnxruntime/core/framework/execution_provider.h` | Adds graph-capture node-assignment policy support to the execution provider interface. |
| `onnxruntime/core/session/inference_session.cc` | Generalizes the session replay path and warmup/capture retry loop so ORT can drive graph replay for graph-capable EPs. |
| `onnxruntime/core/session/inference_session.h` | Updates replay-related messaging and supporting declarations for the new run flow. |
| `onnxruntime/core/framework/session_state.cc` | Makes device-stream collection reuse thread-affine so warmup/capture/replay reuse stays on the owning thread. |
| `onnxruntime/core/framework/session_state.h` | Adds supporting state for the thread-affine stream collection pool. |
| `onnxruntime/core/session/plugin_ep/ep_plugin_provider_interfaces.cc` | Bridges the new graph callbacks, hardens validation of plugin graph support, and exposes effective plugin provider options gathered from session config. |
| `onnxruntime/core/session/plugin_ep/ep_plugin_provider_interfaces.h` | Stores provider options and declares the new accessor/graph bridge behavior. |
| `onnxruntime/core/providers/webgpu/webgpu_execution_provider.h` | Aligns graph-capture policy support with the new execution-provider interface. |
| `onnxruntime/core/providers/js/js_execution_provider.h` | Aligns graph-capture policy support with the new execution-provider interface. |

### Tests and validation coverage

| File | Change |
|------|--------|
| `onnxruntime/test/python/transformers/test_cuda_plugin_ep.py` | Adds end-to-end CUDA graph tests for warmup/capture/replay, replay after input updates, CUDA mempool mode, multiple graph annotation IDs, multi-GPU/device-id coverage, and a simple Add model. |

### Documentation

| File | Change |
|------|--------|
| `docs/cuda_plugin_ep/cuda_graph_for_cuda_plugin.md` | Adds a dedicated design/implementation document covering architecture, lifecycle, allocator interaction, concurrency, and verification guidance. |
| `docs/cuda_plugin_ep/cuda_plugin_ep_design.md` | Updates the broader plugin EP design doc to reflect that CUDA graph support is implemented and documents the framework-level changes. |
| `docs/cuda_plugin_ep/QUICK_START.md` | Updates quick-start/testing guidance and removes the outdated “no CUDA Graph support” limitation. |

## Testing
- Build ONNX Runtime with `onnxruntime_BUILD_CUDA_EP_AS_PLUGIN=ON`, install the generated wheel, and deploy the CUDA plugin shared library as described in `docs/cuda_plugin_ep/QUICK_START.md`.
- Run `python onnxruntime/test/python/transformers/test_cuda_plugin_ep.py`.
- Pay particular attention to the new CUDA graph scenarios in that suite: warmup/capture/replay, replay after in-place input updates, CUDA mempool mode, multiple `gpu_graph_id` captures, and the second-device path when multiple GPUs are available.
- Verify backward compatibility by confirming older plugins still load safely through the version-gated graph callback bridge, and that graph-disabled runs continue through the normal execution path.

## Motivation and Context
The CUDA plugin EP exists to decouple CUDA EP delivery from core ONNX Runtime releases, but that model only works well if important runtime optimizations are also available through the plugin path. CUDA graph replay is one of the highest-value CUDA execution optimizations because it eliminates repeated kernel-launch overhead after capture, especially for steady-state inference workloads.

Supporting that in the plugin EP required more than adding plugin-local capture code. ORT also needed a framework-level replay flow that works for plugin EPs, a plugin C API contract for graph support and node-assignment policy, and thread-affine stream reuse so captured graph resources and stream wrappers are not reused across unrelated threads. This PR packages those pieces together and documents the resulting behavior for future plugin EP work. It also depends on earlier plugin allocator work so warmup can stabilize allocations before capture begins.

## Checklist
- [x] Tests added/updated
- [x] Documentation updated (if applicable)
- [x] No breaking changes (or documented in description)
## Description

This fixes a flaky failure in the plugin EP profiling tests on macOS, where reconstructed plugin event timestamps could land a few microseconds outside the correlated ORT parent event interval.

The current example plugin profiler reconstructs EP-relative timestamps by combining ORT's profiling-start offset with elapsed time from the EP clock. That reconstruction is close but not exact across clocks, and on macOS the skew was enough to fail the strict containment checks in `KernelPluginEp_SessionProfiling` with cases like `ep_start < parent_start` by a small margin.

Instead of weakening the test, this change keeps the strict contract and fixes the profiler output so child EP events are always emitted within the correlated ORT parent event interval.

## Key Changes

| File | Change |
|------|--------|
| `onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry/ep_profiling.h` | Stores the correlated ORT parent event start timestamp and duration on each collected EP event, and adds the helper signature updates needed to propagate that metadata. |
| `onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry/ep_profiling.cc` | Captures parent event timing from `Ort::ConstProfilingEvent`, attaches it to EP events during `StopEventImpl`, and clamps the reconstructed EP start/end interval to the parent ORT interval before emitting the final profiling event. |

## Why This Change Is Needed

- The plugin EP profiling tests intentionally require strict nesting: EP child events must stay within the ORT parent event interval.
- The existing implementation reconstructs EP timestamps from two different clocks, which can drift by a few microseconds depending on platform timing behavior.
- macOS exposed that drift often enough to make `KernelPluginEp_SessionProfiling` flaky even though the logical event ordering was correct.
- Clamping the emitted child interval to the already-correlated parent interval preserves the expected semantics and removes the platform-specific skew from the final profiling output.

## Testing

- `ninja -C build/cuda/Debug onnxruntime_autoep_test`
- `cd build/cuda/Debug && ./onnxruntime_autoep_test --gtest_filter=OrtEpLibrary.KernelPluginEp_SessionProfiling`
- `cd build/cuda/Debug && ./onnxruntime_autoep_test --gtest_filter=OrtEpLibrary.KernelPluginEp_RunProfiling`

## Notes For Reviewers

- This is intentionally scoped to the example plugin EP profiling path used by the AutoEP tests.
- The change avoids relaxing any assertions in `test_execution.cc`; it fixes the emitted profiling data instead.
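The clamping step the commit describes is straightforward to illustrate. The sketch below is not the actual `ep_profiling.cc` code; the `Interval` struct and `ClampToParent` helper are hypothetical names showing how a reconstructed child interval can be forced into the correlated parent interval so a few microseconds of cross-clock skew cannot violate strict nesting.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Illustrative sketch: clamp a reconstructed child EP event interval to the
// correlated ORT parent event interval [parent_start, parent_start + duration].
struct Interval {
  uint64_t start_us;
  uint64_t end_us;
};

inline Interval ClampToParent(Interval child, uint64_t parent_start_us,
                              uint64_t parent_duration_us) {
  const uint64_t parent_end_us = parent_start_us + parent_duration_us;
  Interval out;
  // Pin the child start into the parent window first, then pin the child end
  // between the (possibly adjusted) start and the parent end.
  out.start_us = std::clamp(child.start_us, parent_start_us, parent_end_us);
  out.end_us = std::clamp(child.end_us, out.start_us, parent_end_us);
  return out;
}
```

A child that already nests correctly passes through unchanged; only skewed intervals are adjusted, which is why the strict test assertions can stay as-is.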
### Description

Set the pointer to nullptr immediately after `UnloadDynamicLibrary`.

### Motivation and Context

After unloading the library, set the function pointers to nullptr to avoid dangling pointers. Otherwise, the following scenario may cause errors:

```
RegisterExecutionProviderLibrary()
SessionOptions::AppendExecutionProvider_VitisAI()
```

In this scenario, the OrtVitisAIEpAPI will call `initialize_vitisai_ep` once but call `deinitialize_vitisai_ep` twice. During deinitialization, functions such as `deinitialize_onnxruntime_vitisai_ep` are invoked, which leads to errors.
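The double-deinitialization hazard above can be sketched in a few lines. This is an illustrative model, not the actual `OrtVitisAIEpAPI` layout: the `EpApi` struct, `FakeDeinit`, and the global call counter are hypothetical, and the real code would call `UnloadDynamicLibrary` where the comment indicates.

```cpp
#include <cassert>

// Hypothetical stand-ins for a library deinit entry point, used to observe
// how many times deinitialization actually runs.
inline int g_deinit_calls = 0;
inline void FakeDeinit() { ++g_deinit_calls; }

struct EpApi {
  void (*deinitialize_ep)() = nullptr;
  void* handle = nullptr;

  void Unload() {
    // UnloadDynamicLibrary(handle) would go here in the real code.
    handle = nullptr;
    deinitialize_ep = nullptr;  // the fix: no dangling pointer survives unload
  }

  // Safe even if invoked twice: after the first call nulled the pointer,
  // the second call becomes a no-op instead of jumping through a dangling
  // address into unloaded code.
  void Deinitialize() {
    if (deinitialize_ep != nullptr) {
      deinitialize_ep();
    }
    Unload();
  }
};
```

Without the nulling in `Unload`, the second `Deinitialize` call would invoke a function pointer into a library that is no longer mapped.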
### Description

Centralise feed authentication and setup for build systems on ADO build pipelines.

### Motivation and Context

SDL requires that official build pipelines use a single controlled feed for external resources.

---------

Co-authored-by: Sanaa Hamel <sanaahamel@microsoft.com>
### Description

Fast forward for actual cut point.