Backmerging with Msft Commits #724

jatinwadhwa921 · 2025-07-02T06:10:01Z

Backmerging with Msft Commits

### Description

This PR disables support for the `DecoderMaskedMultiHeadAttention` (DMMHA) kernel inside `MultiHeadAttention` (MHA) by default.

### Motivation and Context

Models containing the extra inputs for DMMHA (i.e. `past_sequence_length` and `cache_indirection`) have some runtime issues. Additionally, not all execution providers implement the DMMHA kernel inside MHA, so they do not support these extra inputs.

…#25140)

This PR optimizes the Intel path for subgroup_matrix_matmul_nbits by removing the per-thread load of matrix A and instead using subgroupMatrixLoad directly from global memory, reducing SLM usage and bandwidth pressure.

- Removed var<workgroup> tile_A and the loadSHMA helper function.
- Updated inner loop to compute a global offset and call subgroupMatrixLoad on input_a.
- Adjusted indexing and stride parameters to match the global layout.

### Description

This change replaces the previous zero-extend + 16-bit accumulation sequence with a single wasm_i32x4_relaxed_dot_i8x16_i7x16_add operation to compute row sums directly on 8-bit data.

### Motivation and Context

This update eliminates unpacking overhead and lifts the former constraints on k stride.
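The row-sum trick can be illustrated with a minimal pure-Python sketch. It assumes that each 32-bit output lane of the dot-product-accumulate adds up the products of four adjacent 8-bit pairs, so multiplying by a vector of ones yields partial row sums. `dot_i8x16_add` and `row_sum` are hypothetical names introduced here for illustration; this is not the MLAS kernel, and it does not model the implementation-defined ("relaxed") behavior of the real instruction.

```python
def dot_i8x16_add(a, b, acc):
    """Hypothetical scalar emulation: lane i = acc[i] + sum(a[4i+j] * b[4i+j], j = 0..3)."""
    assert len(a) == len(b) == 16 and len(acc) == 4
    return [
        acc[i] + sum(a[4 * i + j] * b[4 * i + j] for j in range(4))
        for i in range(4)
    ]


def row_sum(row):
    """Sum a row of 8-bit values 16 at a time, without widening each element first."""
    ones = [1] * 16          # multiplying by ones turns the dot product into a sum
    acc = [0, 0, 0, 0]       # four 32-bit accumulator lanes
    for k in range(0, len(row), 16):
        chunk = row[k:k + 16]
        chunk = chunk + [0] * (16 - len(chunk))  # zero-pad the tail; zeros do not change the sum
        acc = dot_i8x16_add(chunk, ones, acc)
    return sum(acc)          # horizontal add of the four lanes
```

Because the accumulation happens inside the dot product, no separate zero-extend step is needed before summing.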

Add support for reverse slice and enable all unit tests for it. This will fix microsoft#24744 with the new WebGPU EP. I need to make a similar fix for JSEP.
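As a rough illustration of what a reverse slice is expected to return, here is a minimal 1-D sketch using Python's built-in slicing, which clamps out-of-range bounds in a NumPy-like way. `slice_1d` is a hypothetical helper, and the assumption that the operator follows these semantics for negative steps is not taken from this PR; the sketch is not the WebGPU kernel.

```python
def slice_1d(data, start, end, step):
    """Slice a 1-D list, relying on Python's clamping of out-of-range bounds."""
    if step == 0:
        raise ValueError("step must be non-zero")
    return data[slice(start, end, step)]


INT64_MIN = -(2 ** 63)
x = [10, 20, 30, 40, 50]

# A very negative end clamps to "before index 0", so index 0 is included.
full_reverse = slice_1d(x, -1, INT64_MIN, -1)
# The end bound is exclusive, so index 0 is skipped here.
partial_reverse = slice_1d(x, 3, 0, -1)
# Negative steps other than -1 walk backwards with a stride.
strided_reverse = slice_1d(x, 4, INT64_MIN, -2)
```

The three cases cover the usual edge conditions for negative steps: a clamped end bound, an exclusive end bound, and a stride larger than one.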

### Description

Re-enable unit tests in Android CI build.

### Motivation and Context

The CI build is not running the unit tests. It should run them.

- Implemented MeanOpBuilder to support ONNX Mean operator in QNN EP.
- Decomposed Mean into a sequence of element-wise Add operations followed by a Div Op.
- Added unit tests for Mean op running on HTP

### Description

Adds support for the ONNX Mean operator in QNN EP via Add + Div decomposition.

### Motivation and Context

Enables execution of models using Mean on QNN backend, improving Op support.
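The Add + Div decomposition can be sketched in a few lines of Python. This is only an illustration of the arithmetic, assuming all inputs have the same shape (no broadcasting) and are flat lists; the helper names are hypothetical and do not correspond to the QNN EP code.

```python
from functools import reduce


def add(a, b):
    """Element-wise Add of two equally sized lists."""
    return [x + y for x, y in zip(a, b)]


def div(a, divisor):
    """Element-wise Div of a list by a scalar."""
    return [x / divisor for x in a]


def mean_via_add_div(inputs):
    """Mean(x1, ..., xn) computed as a chain of Adds followed by one Div by n."""
    if not inputs:
        raise ValueError("Mean needs at least one input")
    total = reduce(add, inputs)  # ((x1 + x2) + x3) + ...
    return div(total, float(len(inputs)))
```

For example, `mean_via_add_div([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])` chains two Adds and then divides by 3.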

…t#25173)

### Description

Per Windows team's CyberEO requirement, do not disable the warnings at the project level.

…rosoft#25172)

### Description

Fixed onnxruntime_mlas_test requiring /bigobj in MSVC Debug mode

### Motivation and Context

microsoft#24741 microsoft#25169

### Description

Adding the following ORT EP APIs:

- `GetPreferredDataLayout()`
- `SetDynamicOptions()`
- `OnRunStart()`
- `OnRunEnd()`

### Motivation and Context

Expose additional EP APIs.

…ft#25171)

### Description

* Re-enable tests and remove workarounds that were introduced as part of a QNN <= 2.31 upgrade but are no longer necessary.

### Motivation and Context

QNN/QAIRT releases about once a month. As ONNX Runtime adopts these new versions, some number of tests are often found to be impacted. Consequently, tests are skipped and tolerances are loosened. This change reverts as many of those workarounds as possible that were made for QNN upgrades between 2.17 and 2.31, inclusive.

The most recent few releases were intentionally not examined to minimize impact on users on old versions and to avoid lock-in to the bleeding edge.

---------

Co-authored-by: Jeff Kilpatrick <jkilpat@qti.qualcomm.com>

…ecified in GetCapability (microsoft#25137)

### Description

- Add ability to drop constant initializers for fused nodes specified in GetCapability.
- Rework how an EP specifies nodes that should be fused into one node within GetCapability.
- Instead of passing the set of nodes as arguments to `GraphSupportInfo_AddNodesToFuse()`, the EP creates an `OrtNodeFusionOptions` object to specify the nodes and other relevant options. This makes it easier to extend the API in the future since we can't add more parameters to an existing function, but we can add more functions that modify an options object.

### Motivation and Context

Add more functionality missing from GetCapability() in the EP ABI.

### Description

In TensorRT 10.12, weakly-typed network and related APIs have been marked deprecated. Ignore these deprecated API warnings for the Windows build.

---------

Signed-off-by: Kevin Chen <kevinch@nvidia.com>

…soft#25176) Remove --enable_wcos. The flag is for the old WinML code only.

microsoft#25159)

### Description

Updates the `OrtGraph` implementation to take advantage of the work done in PR microsoft#23979, which sets up the infrastructure to store initializers as `OrtValue` instances in the `onnxruntime::Graph`. A second part to the [aforementioned PR](microsoft#23979) is still needed to ensure that all initializers are stored as `OrtValue`s in the Graph.

### Motivation and Context

### Description

Support bfloat16 for MatMulNBits in CUDA.

### Motivation and Context

For LLM models with the bfloat16 data type.

### Description

Use lintrunner to format *.cu and *.cuh files.

### Motivation and Context

Some CUDA code is not formatted. This will make the style consistent.

@fs-eire

1. Delete the ROCM EP, because there is no active development and we have another AMD GPU EP (MIGraphX) to use.
2. Delete the WASM64 build option, because the feature was incomplete. We will likely need to reimplement it later, but we will delete it for now (I already discussed this with @fs-eire).
3. Delete the kernel explorer Python extension, which was solely used by the ROCM EP.
4. Delete the Triton-related build options, which weren't really put into use.
5. Add a pull request pipeline for the MIGraphX EP.

The following cmake options are removed:

- onnxruntime_USE_ROCM
- onnxruntime_ENABLE_WEBASSEMBLY_MEMORY64
- onnxruntime_ENABLE_TRITON
- onnxruntime_USE_COMPOSABLE_KERNEL
- onnxruntime_USE_COMPOSABLE_KERNEL_CK_TILE
- onnxruntime_USE_ROCBLAS_EXTENSION_API
- onnxruntime_USE_TRITON_KERNEL
- onnxruntime_BUILD_KERNEL_EXPLORER
- onnxruntime_BUILD_CACHE
- MSVC_Z7_OVERRIDE

### Description

Add allocator and data transfer infrastructure for plugin EP API.

Allocators are created via the OrtEpFactory using OrtMemoryInfo that is added to the OrtEpDevice instances the factory returns. This allows allocators to be created outside of an inference session and shared.

When a library is loaded, a default instance of each allocator is added to the shared allocators if there is no existing allocator (e.g. a user-provided custom allocator). CreateSharedAllocator can be used to replace this default instance with a user-configured one, e.g. to add an arena or provide other configuration options that are passed through to the OrtEpFactory's CreateAllocator function.

Similarly, IDataTransfer is supported by the factory implementing OrtDataTransferImpl, which will also enable data transfer outside of a session. That will be added in a future PR as the synchronization requirements need to be figured out and will affect the public API.

### Motivation and Context

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

### Description

Exclude lean attention from the Linux build.

### Motivation and Context

Previously, lean attention was built in Linux but not in Windows. It is not used in Gen AI so far, so we disable it in the build to reduce binary size and build time.

…icrosoft#25017)

### Description

MatMul+Add->Gemm fusion when AttentionFusion isn't enabled.

### Motivation and Context

The graph transformation [MatMulAddFusion](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/optimizer/matmul_add_fusion.cc) folds `ONNX::MatMul` followed by `ONNX::Add` into `ONNX::GEMM`; however, it [intentionally skips the portion that belongs to the "Attention Pattern"](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/optimizer/matmul_add_fusion.cc#L21). This results in poor performance on QNN EP (and other EPs that do not run *AttentionFusion transformers) due to unfused MatMul + Add pairs.

![image](https://github.com/user-attachments/assets/cad0b2c6-ab07-4ced-a647-396c04fed365)

With this change, additional GEMMs are fused *post* AttentionFusions.
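The equivalence that makes this fusion valid can be checked with a small pure-Python sketch. It assumes 2-D inputs and a 1-D bias broadcast across rows, with Gemm's `alpha` and `beta` both equal to 1; the helper names are hypothetical and this is not the optimizer code, which has additional preconditions.

```python
def matmul(a, b):
    """Plain 2-D matrix product A @ B."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def add_bias(y, bias):
    """Element-wise Add with a 1-D bias broadcast across rows."""
    return [[v + bias[j] for j, v in enumerate(row)] for row in y]


def gemm(a, b, c, alpha=1.0, beta=1.0):
    """Gemm: alpha * (A @ B) + beta * C, with C broadcast across rows."""
    product = matmul(a, b)
    return [[alpha * v + beta * c[j] for j, v in enumerate(row)] for row in product]


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
bias = [0.5, -0.5]

unfused = add_bias(matmul(A, B), bias)  # MatMul followed by Add
fused = gemm(A, B, bias)                # single Gemm node
```

With these inputs both paths produce the same matrix, which is why a backend that has a fast Gemm benefits from the pair being fused into one node.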

This PR adds an additional Node_GetAttributes C API for EP ABI use. It's based on microsoft#24887.

To simplify the ONNX patches, delete the legacy protobuf-related patches that were added in microsoft#14279 and microsoft#15878.

### Description

Debug Windows build fails with unreachable code warning due to change added in microsoft#25161. Use an `else` to avoid the warning.

```
\onnxruntime\test\contrib_ops\matmul_4bits_test.cc(559,1): error C2220: the following warning is treated as an error [\build\Windows.vs22\Debug\onnxruntime_test_all.vcxproj]
\onnxruntime\test\contrib_ops\matmul_4bits_test.cc(559,1): warning C4702: unreachable code [\build\Windows.vs22\Debug\onnxruntime_test_all.vcxproj]
\onnxruntime\test\contrib_ops\matmul_4bits_test.cc(561,1): warning C4702: unreachable code [\build\Windows.vs22\Debug\onnxruntime_test_all.vcxproj]
...
```

### Motivation and Context

Update pytorch to address https://nvd.nist.gov/vuln/detail/CVE-2025-32434

…yout sensitive ops (microsoft#25147)

### Description

Add `IExecutionProvider::ShouldConvertDataLayoutForOp()` to allow EPs to customize layout sensitive ops. Move existing hardcoded EP-specific logic out of layout transformer code.

Add `OrtEp::ShouldConvertDataLayoutForOp` to ABI EP API to allow similar customization by plugin EPs.

### Motivation and Context

Enable layout sensitive op customization through internal EP interface and the ABI EP API.

This automated commit updates the vcpkg dependency to version 2025.06.13 and its corresponding commit hash ef7dbf94b919.

…crosoft#25200)

### Description

Add back the linker option to make stack non-executable, which was accidentally lost here: microsoft#22646

This just adds back the option in the same place where it was.

### Motivation and Context

After upgrading to 1.22.0 we saw this warning:

OpenJDK 64-Bit Server VM warning: You have loaded library /opt/vespa-deps/lib64/libonnxruntime.so.1.22.0 which might have disabled stack guard. The VM will try to fix the stack guard now. It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.

…torV2 (microsoft#25220)

### Description

Distinguish between the memory types when creating a shared environment allocator for CUDAExecutionProvider

### Motivation and Context

Fixed microsoft#25211

…oft#25218)

### Description

Added the missing "mem_info" parameter to the CPUAllocator constructor.

### Motivation and Context

Without the correct mem_info, the CudaPinned allocator is mapped with the wrong (default) "Cpu" memory_info.

microsoft#25222)

### Description

EP implementations need to be able to read vendor id and device id to implement OrtDataTransferImpl::CanCopy correctly.

### Motivation and Context

### Description

Add the LUID to metadata.

### Motivation and Context

Requested by partners.

### Description

Currently only the build is enabled. The testing step is failing because of an error like this:

```
1: /onnxruntime_src/onnxruntime/core/providers/webgpu/webgpu_context.cc:87 onnxruntime::webgpu::WebGpuContext::Initialize(const onnxruntime::webgpu::WebGpuBufferCacheConfig&, int, bool)::<lambda()>::<lambda(wgpu::RequestAdapterStatus, wgpu::Adapter, wgpu::StringView, wgpu::Adapter*)> status == wgpu::RequestAdapterStatus::Success was false. Failed to get a WebGPU adapter: No supported adapters
1:
```

### Description

Add telemetry error logging to InferenceSession::Run() methods to track runtime errors that occur during inference execution.

### Motivation and Context

Currently, we do not have any telemetry error logging in the InferenceSession class to track runtime errors that occur during inference execution. This data would allow us to prioritize and identify the most frequent errors.

fix build error and typo

### Description

Revise existing PoolOpBuilder to support rank-5 inputs. Note that HTP only supports PoolAvg3d but not PoolMax3d.

### Motivation and Context

Enable HSM-Net model support, which contains 3D AveragePool.

This pull request introduces telemetry enhancements for logging execution provider (EP) auto-selection events in ONNX Runtime. The changes include adding new methods to log EP selection data, updating classes to support session ID retrieval, and integrating telemetry logging into the session provider policy context.

### Telemetry Enhancements:

* **Added `LogAutoEpSelection` Method**: Introduced a new virtual method in `Telemetry` and its override in `WindowsTelemetry` to log session-specific EP auto-selection data, including requested and available EP IDs and selection policies.

### Session ID Support:

* **Added `GetCurrentSessionId` Method**: Added a method to `InferenceSession` to retrieve the current session ID, enabling session-specific telemetry logging.

### Integration with Provider Policy Context:

* **Telemetry Logging in EP Selection**: Integrated telemetry logging into `ProviderPolicyContext::SelectEpsForSession`, capturing requested and available EP IDs, along with the selection policy type, and invoking the telemetry provider's `LogAutoEpSelection` method.

…icrosoft#25188)

### Description

When EPContext model generation is enabled and some nodes fall back to CPU, those CPU nodes may depend on external data. In that case ORT forces all external data to be embedded into the newly generated EPContext model by default.

ORT used to create a dummy external initializer file with the maximum size threshold to force all initializer data to be dumped into the generated ONNX model file. Internally, a "./model_ext_ini.bin" file was created and then removed at the end of the call, which causes problems if multiple sessions do the same thing.

This fix avoids creating the temporary empty external initializer file by adding a flag that forces all external data to be embedded into the newly generated EPContext model.

### Description

This PR adds an additional OpAttr_GetName C API for EP ABI use.

kunal-vaishnavi and others added 30 commits June 25, 2025 09:51

fix reverse slice and enable all ut (microsoft#25160)

6f89697

Add support for reverse slice and enable all unit tests for it. This will fix microsoft#24744 with the new WebGPU EP. I need to make a similar fix for JSEP.

[build] do not disable 4244/4267 warning when building Tint (microsof…

ea6d8ee

…t#25173)

### Description

Per Windows team's CyberEO requirement, do not disable the warnings at the project level.

Fixed onnxruntime_mlas_test requiring /bigobj in MSVC Debug mode (mic…

5c92b73

…rosoft#25172)

### Description

Fixed onnxruntime_mlas_test requiring /bigobj in MSVC Debug mode

### Motivation and Context

microsoft#24741 microsoft#25169

[TRT-EP] Ignore deprecated warnings for TRT APIs (microsoft#25105)

04cdb69

### Description

In TensorRT 10.12, weakly-typed network and related APIs have been marked deprecated. Ignore these deprecated API warnings for the Windows build.

---------

Signed-off-by: Kevin Chen <kevinch@nvidia.com>

Update custom-nuget-packaging-pipeline.yml for Azure Pipelines (micro…

ff41305

…soft#25176) Remove --enable_wcos. The flag is for the old WinML code only.

[CUDA] bfloat16 MatMulNBits (microsoft#25161)

7fdd386

### Description

Support bfloat16 for MatMulNBits in CUDA.

### Motivation and Context

For LLM models with the bfloat16 data type.

[web] fix IO binding for WebGPU EP (microsoft#25190)

505b135

Format *.cu and *.cuh with lintrunner (microsoft#25189)

7a6cef6

### Description

Use lintrunner to format *.cu and *.cuh files.

### Motivation and Context

Some CUDA code is not formatted. This will make the style consistent.

Add Node_GetAttributes C API for EP ABI (microsoft#25143)

5ddd34e

This PR adds an additional Node_GetAttributes C API for EP ABI use. It's based on microsoft#24887.

Simplify onnx.patch (microsoft#25204)

849eee8

To simplify the ONNX patches, delete the legacy protobuf-related patches that were added in microsoft#14279 and microsoft#15878.

Update pytorch > 2.6.0 (microsoft#25174)

6c4f2ff

Update pytorch to address https://nvd.nist.gov/vuln/detail/CVE-2025-32434

Update vcpkg to version 2025.06.13 (microsoft#25209)

a135796

This automated commit updates the vcpkg dependency to version 2025.06.13 and its corresponding commit hash ef7dbf94b919.

Support CUDA pinned allocator in Environment::CreateAndRegisterAlloca…

cf7bf3a

…torV2 (microsoft#25220)

### Description

Distinguish between the memory types when creating a shared environment allocator for CUDAExecutionProvider

### Motivation and Context

Fixed microsoft#25211

skottmckay and others added 10 commits July 1, 2025 07:12

[TRT RTX EP] fix build error and typo (microsoft#25153)

9534ab4

fix build error and typo

Add OpAttr_GetName C API for EP ABI (microsoft#25224)

1da50f0

### Description

This PR adds an additional OpAttr_GetName C API for EP ABI use.

Merge branch 'master' into ovep-develop

f3672df

jatinwadhwa921 requested a review from ankitm3k July 2, 2025 06:10

ankitm3k approved these changes Jul 2, 2025

ankitm3k merged commit c3284bc into ovep-develop Jul 2, 2025
4 of 7 checks passed

ankitm3k deleted the sync_msft_2_7_27 branch July 2, 2025 06:55

27 participants
