@jatinwadhwa921
Backmerging with Msft Commits

kunal-vaishnavi and others added 30 commits June 25, 2025 09:51
### Description
This PR disables support for the `DecoderMaskedMultiHeadAttention`
(DMMHA) kernel inside `MultiHeadAttention` (MHA) by default.

### Motivation and Context
The models containing the extra inputs for DMMHA (i.e.
`past_sequence_length` and `cache_indirection`) have some runtime
issues. Additionally, not all execution providers implement the DMMHA
kernel inside MHA and will therefore not support these extra inputs.
…#25140)

This PR optimizes the Intel path for subgroup_matrix_matmul_nbits by
removing the per-thread load of matrix A and instead using
subgroupMatrixLoad directly from global memory, reducing SLM usage and
bandwidth pressure.

- Removed var<workgroup> tile_A and the loadSHMA helper function.
- Updated inner loop to compute a global offset and call
subgroupMatrixLoad on input_a.
- Adjusted indexing and stride parameters to match the global layout.
### Description
This change replaces the previous zero-extend + 16-bit accumulation
sequence with a single wasm_i32x4_relaxed_dot_i8x16_i7x16_add operation
to compute row sums directly on 8-bit data.

### Motivation and Context
This update eliminates unpacking overhead and lifts the former
constraints on k stride.
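
As a rough illustration of the row-sum idea, the following is a scalar Python emulation, not the actual kernel. It assumes the instruction multiplies 16 signed 8-bit lanes pairwise, sums each group of four products into one 32-bit lane, and adds an accumulator. The function names are hypothetical, and the sketch ignores the implementation-defined behavior that relaxed SIMD allows when the second operand does not fit in 7 bits.

```python
def dot_i8x16_add(a, b, acc):
    """Emulate a 16-lane i8 dot product accumulated into 4 i32 lanes."""
    assert len(a) == len(b) == 16 and len(acc) == 4
    out = list(acc)
    for lane in range(4):
        for j in range(4):
            out[lane] += a[4 * lane + j] * b[4 * lane + j]
    return out


def row_sum(row):
    """Sum a row of int8 values, 16 at a time, via a dot product with ones."""
    assert len(row) % 16 == 0  # assumption: the row is padded to a multiple of 16
    ones = [1] * 16            # the all-ones operand fits in 7 bits
    acc = [0, 0, 0, 0]
    for k in range(0, len(row), 16):
        acc = dot_i8x16_add(row[k:k + 16], ones, acc)
    return sum(acc)            # horizontal add of the 4 partial sums
```

Because the products are accumulated directly into 32-bit lanes, no separate zero-extend or 16-bit accumulation step is needed in this sketch.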
Add support for reverse slice and enable all unit tests for it.

This will fix microsoft#24744 with
the new webgpu ep.
I need to make a similar fix for jsep.
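
For context, a reverse slice is a Slice with a negative step. The sketch below is a 1-D Python reference based on my reading of the ONNX Slice clamping rules, not the WebGPU EP implementation; `onnx_slice_1d` is a hypothetical helper name.

```python
def onnx_slice_1d(data, start, end, step):
    """Reference 1-D slice that also handles negative steps (reverse slice)."""
    n = len(data)
    if step == 0:
        raise ValueError("step must be non-zero")
    # Negative indices count from the end of the axis.
    if start < 0:
        start += n
    if end < 0:
        end += n
    if step > 0:
        start = min(max(start, 0), n)
        end = min(max(end, 0), n)
    else:
        # For negative stepping, start is clamped to [0, n-1]
        # and end to [-1, n-1], so end == -1 means "past the first element".
        start = min(max(start, 0), n - 1)
        end = min(max(end, -1), n - 1)
    return [data[i] for i in range(start, end, step)]
```

A full reversal is then expressed as `onnx_slice_1d(data, -1, -2**63, -1)`, where the very small `end` is clamped to -1.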
### Description
<!-- Describe your changes. -->

Re-enable unit tests in Android CI build.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

The CI build is not running the unit tests. It should run them.
- Implemented MeanOpBuilder to support ONNX Mean operator in QNN EP.
- Decomposed Mean into a sequence of element-wise Add operations followed by a Div Op.
- Added unit tests for Mean op running on HTP

### Description
Adds support for the ONNX Mean operator in QNN EP via Add + Div decomposition.
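
The decomposition can be sketched in plain Python as follows. This is only an illustration of the arithmetic, assuming all inputs have the same shape (no broadcasting); the helper names are hypothetical and this is not the `MeanOpBuilder` code.

```python
def elementwise_add(a, b):
    """Element-wise Add of two equally sized 1-D tensors."""
    return [x + y for x, y in zip(a, b)]


def elementwise_div(a, divisor):
    """Element-wise Div of a 1-D tensor by a scalar."""
    return [x / divisor for x in a]


def mean_via_add_div(inputs):
    """Compute Mean(inputs) as a chain of Add ops followed by one Div."""
    if not inputs:
        raise ValueError("Mean needs at least one input")
    total = inputs[0]
    for tensor in inputs[1:]:
        total = elementwise_add(total, tensor)   # N-1 Add ops
    return elementwise_div(total, len(inputs))   # single Div by N
```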



### Motivation and Context
Enables execution of models using Mean on QNN backend, improving Op
support.
…t#25173)

### Description

Per the Windows team's CyberEO requirement, do not disable the warnings
at the project level.
…rosoft#25172)

### Description
Fixed onnxruntime_mlas_test requiring /bigobj in MSVC Debug mode


### Motivation and Context
microsoft#24741
microsoft#25169
### Description
<!-- Describe your changes. -->

Adding the following ORT EP APIs:
- `GetPreferredDataLayout()`
- `SetDynamicOptions()`
- `OnRunStart()`
- `OnRunEnd()`

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Expose additional EP APIs.
…ft#25171)

### Description

* Re-enable tests and remove workarounds that were introduced as part of a QNN <= 2.31 upgrade but are no longer necessary.


### Motivation and Context

QNN/QAIRT releases about once a month. As ONNX Runtime adopts these new versions, some number of tests are often found to be impacted.
Consequently, tests are skipped and tolerances are loosened. This change reverts as many of those workarounds as possible that were made for QNN upgrades between 2.17 and 2.31, inclusive. The most recent few releases were intentionally not examined to minimize impact on users on old versions and to avoid lock-in to the bleeding edge.

---------

Co-authored-by: Jeff Kilpatrick <jkilpat@qti.qualcomm.com>
…ecified in GetCapability (microsoft#25137)

### Description
- Add ability to drop constant initializers for fused nodes specified in
GetCapability.
- Rework how an EP specifies nodes that should be fused into one node
within GetCapability.
- Instead of passing the set of nodes as arguments to
`GraphSupportInfo_AddNodesToFuse()`, the EP creates an
`OrtNodeFusionOptions` object to specify the nodes and other relevant
options. This makes it easier to extend the API in the future since we
can't add more parameters to an existing function, but we can add more
functions that modify an options object.
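
The extensibility argument can be illustrated with a small Python sketch. Everything here is hypothetical: `NodeFusionOptions` and `add_nodes_to_fuse` only mimic the shape of the design and are not the real `OrtNodeFusionOptions` or `GraphSupportInfo_AddNodesToFuse()` signatures.

```python
from dataclasses import dataclass, field


@dataclass
class NodeFusionOptions:
    """Hypothetical options object describing one group of nodes to fuse."""
    nodes: list = field(default_factory=list)
    # A later addition: callers that never set it keep the old behavior,
    # and the signature of add_nodes_to_fuse does not change.
    drop_constant_initializers: bool = False


def add_nodes_to_fuse(support_info, options):
    """Record a fusion request; takes one options object, not a growing argument list."""
    support_info.append({
        "nodes": list(options.nodes),
        "drop_constant_initializers": options.drop_constant_initializers,
    })
```

New settings become new fields (or, in a C ABI, new setter functions) on the options object, so existing callers are unaffected.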



### Motivation and Context
Add more functionality missing from GetCapability() in the EP ABI.
### Description
In TensorRT 10.12, weakly-typed network and related APIs have been
marked deprecated. Ignore these deprecated API warnings for the Windows
build.

---------

Signed-off-by: Kevin Chen <kevinch@nvidia.com>
…soft#25176)

Remove --enable_wcos. The flag is for the old WinML code only.
microsoft#25159)

### Description
Updates the `OrtGraph` implementation to take advantage of the work done
in PR microsoft#23979, which sets up
the infrastructure to store initializers as `OrtValue` instances in the
`onnxruntime::Graph`.

There still needs to be a second part to the [aforementioned
PR](microsoft#23979) to ensure that
all initializers are stored as `OrtValue`s in the Graph.



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description

Support bfloat16 for MatMulNBits in CUDA.


### Motivation and Context

For LLM models that use the bfloat16 data type.
### Description

Use lintrunner to format *.cu and *.cuh files.

### Motivation and Context

Some cuda code is not formatted. This will make the style consistent.
1. Delete the ROCM EP, because there is no active development and we have
another AMD GPU EP (MIGraphX) to use.
2. Delete the WASM64 build option, because the feature was incomplete.
We will likely need to reimplement it later, but we will delete it for
now (I already discussed this with @fs-eire).
3. Delete the kernel explorer Python extension, which was solely used by
the ROCM EP.
4. Delete the Triton-related build options, which weren't really put into
use.
5. Add a pull request pipeline for the MIGraphX EP.

The following cmake options are removed:

- onnxruntime_USE_ROCM
- onnxruntime_ENABLE_WEBASSEMBLY_MEMORY64
- onnxruntime_ENABLE_TRITON
- onnxruntime_USE_COMPOSABLE_KERNEL
- onnxruntime_USE_COMPOSABLE_KERNEL_CK_TILE
- onnxruntime_USE_ROCBLAS_EXTENSION_API
- onnxruntime_USE_TRITON_KERNEL
- onnxruntime_BUILD_KERNEL_EXPLORER
- onnxruntime_BUILD_CACHE
- MSVC_Z7_OVERRIDE
### Description
<!-- Describe your changes. -->
Add allocator and data transfer infrastructure for plugin EP API

Allocators are created via the OrtEpFactory using the OrtMemoryInfo that is
added to the OrtEpDevice instances the factory returns. This allows
allocators to be created outside of an inference session and shared.

When a library is loaded, a default instance of each allocator is added
to the shared allocators if there is no existing allocator (e.g. a
user-provided custom allocator).
CreateSharedAllocator can be used to replace this default instance with
a user-configured one, e.g. to add an arena or provide other configuration
options that are passed through to the OrtEpFactory's CreateAllocator
function.
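
The default-then-replace behavior described above can be sketched as a small registry. This is a conceptual Python model only; the class and method names are hypothetical and do not correspond to the actual ORT C API signatures.

```python
class SharedAllocators:
    """Hypothetical registry of shared allocators keyed by memory info."""

    def __init__(self):
        self._by_mem_info = {}

    def register_default(self, mem_info, allocator):
        # On library load: add a default only if no allocator exists yet
        # for this memory info (e.g. a user-provided custom allocator).
        self._by_mem_info.setdefault(mem_info, allocator)

    def create_shared_allocator(self, mem_info, factory_create, options=None):
        # Replace the current instance with a user-configured one; the
        # options are passed through to the factory's create function.
        allocator = factory_create(mem_info, options or {})
        self._by_mem_info[mem_info] = allocator
        return allocator

    def get(self, mem_info):
        return self._by_mem_info[mem_info]
```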
 
Similarly IDataTransfer is supported by the factory implementing
OrtDataTransferImpl, which will also enable data transfer outside of a
session. That will be added in a future PR as the synchronization
requirements need to be figured out and will affect the public API.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
### Description

Exclude lean attention from linux build.

### Motivation and Context

Previously, lean attention was built on Linux but not on Windows.
It is not used by Gen AI so far, so we disable it in the build to reduce
binary size and build time.
…icrosoft#25017)

### Description
MatMul+Add->Gemm fusion when AttentionFusion isn't enabled.

### Motivation and Context

The graph transformation
[MatMulAddFusion](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/optimizer/matmul_add_fusion.cc)
folds `ONNX::MatMul` followed by `ONNX::Add` into `ONNX::GEMM`; however, it [intentionally skips the portion that belongs to the "Attention Pattern"](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/optimizer/matmul_add_fusion.cc#L21).
This results in poor performance on QNN EP (and other EPs that do not run the *AttentionFusion transformers) due to unfused MatMul + Add pairs.


![image](https://github.com/user-attachments/assets/cad0b2c6-ab07-4ced-a647-396c04fed365)

With this change, the remaining MatMul + Add pairs are fused into GEMM
*after* the AttentionFusion transformers have run.
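
To make the pattern concrete, here is a toy Python pass that fuses a MatMul whose only consumer is an Add into a single Gemm node. It is a simplified sketch, not the real MatMulAddFusion: the graph is a topologically ordered list of dicts, the bias is assumed to be an initializer or graph input, and the shape checks and the attention-pattern handling are left out.

```python
def fuse_matmul_add(nodes):
    """Fuse MatMul -> Add pairs into Gemm nodes in a toy, topologically sorted graph."""
    # Map each tensor name to the nodes that consume it.
    consumers = {}
    for node in nodes:
        for name in node["inputs"]:
            consumers.setdefault(name, []).append(node)

    fused, removed = [], set()
    for node in nodes:
        if id(node) in removed:
            continue  # this Add was already folded into a Gemm
        if node["op"] == "MatMul":
            out = node["outputs"][0]
            users = consumers.get(out, [])
            if len(users) == 1 and users[0]["op"] == "Add":
                add = users[0]
                bias = [name for name in add["inputs"] if name != out]
                if len(bias) == 1:
                    fused.append({
                        "op": "Gemm",
                        "inputs": node["inputs"] + bias,
                        "outputs": add["outputs"],
                    })
                    removed.add(id(add))
                    continue
        fused.append(node)
    return fused
```

A MatMul whose output feeds more than one node is left untouched, since its intermediate result is still needed elsewhere.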
This PR adds an additional `Node_GetAttributes` C API for EP ABI use.
It's based on microsoft#24887
Delete the legacy patches related to protobuf, which were added in
microsoft#14279 and
microsoft#15878, to simplify the ONNX
patches.
### Description
<!-- Describe your changes. -->
The Debug Windows build fails with an unreachable code warning due to a
change added in microsoft#25161.

Use an `else` to avoid the warning.

```
\onnxruntime\test\contrib_ops\matmul_4bits_test.cc(559,1): error C2220: the following warning is treated as an error [\build\Windows.vs22\Debug\onnxruntime_test_all.vcxproj]
\onnxruntime\test\contrib_ops\matmul_4bits_test.cc(559,1): warning C4702: unreachable code [\build\Windows.vs22\Debug\onnxruntime_test_all.vcxproj] 
\onnxruntime\test\contrib_ops\matmul_4bits_test.cc(561,1): warning C4702: unreachable code [\build\Windows.vs22\Debug\onnxruntime_test_all.vcxproj] 
...
```

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
…yout sensitive ops (microsoft#25147)

### Description
<!-- Describe your changes. -->

Add `IExecutionProvider::ShouldConvertDataLayoutForOp()` to allow EPs to
customize layout sensitive ops. Move existing hardcoded EP-specific
logic out of layout transformer code.

Add `OrtEp::ShouldConvertDataLayoutForOp` to ABI EP API to allow similar
customization by plugin EPs.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Enable layout sensitive op customization through internal EP interface
and the ABI EP API.
This automated commit updates the vcpkg dependency to version 2025.06.13
and its corresponding commit hash ef7dbf94b919.
…crosoft#25200)

### Description
<!-- Describe your changes. -->
Add back the linker option to make stack non-executable, which was
accidentally lost here:
microsoft#22646

This just adds back the option in the same place where it was.

### Motivation and Context
After upgrading to 1.22.0 we saw this warning:

OpenJDK 64-Bit Server VM warning: You have loaded library
/opt/vespa-deps/lib64/libonnxruntime.so.1.22.0 which might have disabled
stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c
<libfile>', or link it with '-z noexecstack'.
…torV2 (microsoft#25220)

### Description
Distinguish between the memory types when creating a shared environment
allocator for CUDAExecutionProvider

### Motivation and Context
Fixed microsoft#25211
…oft#25218)

### Description
Added missing "mem_info" parameter into CPUAllocator constructor


### Motivation and Context
Without the correct mem_info, the CudaPinned allocator is mapped to the
wrong (default) "Cpu" memory_info.
skottmckay and others added 10 commits July 1, 2025 07:12
microsoft#25222)

### Description
<!-- Describe your changes. -->
EP implementations need to be able to read vendor id and device id to
implement OrtDataTransferImpl::CanCopy correctly.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
Add the LUID to metadata.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Requested by partners.
### Description

Currently only the build is enabled. The testing step is failing because
of an error like this:

```
1: /onnxruntime_src/onnxruntime/core/providers/webgpu/webgpu_context.cc:87 onnxruntime::webgpu::WebGpuContext::Initialize(const onnxruntime::webgpu::WebGpuBufferCacheConfig&, int, bool)::<lambda()>::<lambda(wgpu::RequestAdapterStatus, wgpu::Adapter, wgpu::StringView, wgpu::Adapter*)> status == wgpu::RequestAdapterStatus::Success was false. Failed to get a WebGPU adapter: No supported adapters
1: 
```
### Description
Add telemetry error logging to InferenceSession::Run() methods to track
runtime errors that occur during inference execution.

### Motivation and Context
Currently, we do not have any telemetry error logging in the
InferenceSession class to track runtime errors that occur during
inference execution. This data would allow us to prioritize and identify
the most frequent errors.
### Description
<!-- Describe your changes. -->
Revise existing PoolOpBuilder to support rank-5 inputs.
Note that HTP only supports PoolAvg3d but not PoolMax3d.

### Motivation and Context
Enable HSM-Net model support, which contains 3D AveragePool.
This pull request introduces telemetry enhancements for logging
execution provider (EP) auto-selection events in ONNX Runtime. The
changes include adding new methods to log EP selection data, updating
classes to support session ID retrieval, and integrating telemetry
logging into the session provider policy context.

### Telemetry Enhancements:
* **Added `LogAutoEpSelection` Method**: Introduced a new virtual method
in `Telemetry` and its override in `WindowsTelemetry` to log
session-specific EP auto-selection data, including requested and
available EP IDs and selection policies.

### Session ID Support:
* **Added `GetCurrentSessionId` Method**: Added a method to
`InferenceSession` to retrieve the current session ID, enabling
session-specific telemetry logging.

### Integration with Provider Policy Context:
* **Telemetry Logging in EP Selection**: Integrated telemetry logging
into `ProviderPolicyContext::SelectEpsForSession`, capturing requested
and available EP IDs, along with the selection policy type, and invoking
the telemetry provider's `LogAutoEpSelection` method.
…icrosoft#25188)

### Description
When EPContext model generation is enabled and some nodes fall back to CPU, those CPU nodes may depend on external data. By default, ORT forces all external data to be embedded into the newly generated EPContext model.
ORT used to create a dummy external initializer file with the maximum size threshold to force all initializer data to be dumped into the generated ONNX model file. Internally, a "./model_ext_ini.bin" file was created and then removed at the end of the call, which causes problems if multiple sessions do the same thing.
This fix avoids creating the temporary empty external initializer file by adding a flag that forces all external data to be embedded into the newly generated EPContext model.
### Description
This PR adds an additional `OpAttr_GetName` C API for EP ABI use.
@jatinwadhwa921 jatinwadhwa921 requested a review from ankitm3k July 2, 2025 06:10
@ankitm3k ankitm3k merged commit c3284bc into ovep-develop Jul 2, 2025
4 of 7 checks passed
@ankitm3k ankitm3k deleted the sync_msft_2_7_27 branch July 2, 2025 06:55