@jatinwadhwa921
Backmerging with Msft commits

quic-hungjuiw and others added 30 commits June 11, 2025 08:47
…24952)

### Description
- In the InstanceNormOpBuilder, when inputs[1] (scale) is constant and running on the NPU backend, transform the S8 scale into a U8 scale, since QNN HTP does not support an SFIXED InstanceNorm scale.
- Allow RunInstanceNormQDQTest to specify different data types for inputs[0] (data) and inputs[1] (scale).
- Add a corresponding unit test, QnnHTPBackendTests.InstanceNormU8S8.

### Motivation and Context
- Prevent InstanceNormOp from falling back to CPU on an SFIXED scale.
### Description

This PR is a redo of the reverted PR
microsoft#24935.

  - Upgrade the version of Dawn to 991810463a.
  - Compared to the previous commit c3999d7e3, this includes the fix for GCC 13, so the Linux build now passes.
    - Includes previous merged patches:
      - https://dawn-review.googlesource.com/c/dawn/+/242335
      - https://dawn-review.googlesource.com/c/dawn/+/242634
      - https://dawn-review.googlesource.com/c/dawn/+/243614

  - Remaining patches:
    - `dawn_destroy_buffer_on_destructor.patch`
      **Allow the WGPUBufferImpl class to destroy the buffer in the destructor**
In the native implementation, wgpuBufferRelease triggers buffer destruction (when the refcount drops to 0), but in the emwgpu implementation it does not. This change adds a destructor to the buffer class that destroys the buffer when the refcount reaches 0 for non-external buffers.
      
    - `dawn_force_enable_f16_nvidia_vulkan.patch`
      **Force enable f16 support for NVIDIA Vulkan**
Dawn disabled f16 support for NVIDIA Vulkan by default because of
crashes in f16 CTS tests (crbug.com/tint/2164).
Since the crashes are limited to specific GPU models, we patched Dawn to
remove the restriction.
      
    - `dawn_allow_non_uniform_for_subgroup_matrix_args.patch` (new)
**Revert change "[tint] Check uniformity for subgroup matrix builtin
arguments"**
The following change was made upstream to force the subgroup matrix
builtin arguments to be uniform:
https://dawn-review.googlesource.com/c/dawn/+/236054
(https://issues.chromium.org/issues/403611487)
Since we use `subgroup_id` as the subgroup matrix builtin argument, we
have to revert this change to allow the subgroup matrix builtin
arguments to be non-uniform.
### Description
Update the QNN default version to 2.35.0.250530.
…rosoft#24949)

### Description
Use onnx.version_converter.convert_version to update the opset version in static_quantize_runner.

### Motivation and Context
The current method updates the opset version by directly modifying the model's opset field, which does not ensure the model remains valid.
### Description
This PR creates and saves the JSON files needed to run the Whisper
models that are exported with `--no_beam_search_op` in ONNX Runtime
GenAI.

### Motivation and Context
The JSON files need to be dynamically created based on the settings
provided during export.
### Description
Enable boolean support for the Expand op.

### Motivation and Context
Additional op data type coverage.
### Description
This PR converts TensorProto graph initializers to TensorProto/OrtValue
pairs.
Currently, we only split the outputs of some optimizers into the above
pairs.
Eventually, we should be able to convert all initializers to OrtValues
on load.
Small weights will continue to be an exception as they are sometimes
required by ONNX inference functions.
Some of the graph API leaks to EPs, so we are not able to remove it at
present; this constrains our ability to convert everything at once.

### Motivation and Context
Lay the groundwork for proper layer separation. Eventually eliminate
weight copies in the EPs.
### Description
This PR changes the class from which the Whisper tokenizer is loaded.

### Motivation and Context
By using `AutoTokenizer` instead of `WhisperTokenizer`, the
`save_pretrained` method will also save the `tokenizer.json` file. This
file is used in ONNX Runtime GenAI's tokenizers.
### Description
Suppress some build warnings on Windows for sm=90.

### Motivation and Context
I saw some build warnings when building for Windows with sm=90 and CUDA
12.8. They are caused by cutlass and are safe to suppress.
…#25024)

### Description
- Add argument "--dynamic_input_shapes" to fix dynamic input shapes in qnn.preprocess

### Motivation and Context
- Because QNN-EP doesn't support dynamic shapes, this argument lets a model with dynamic input dims get static input shapes after qnn.preprocess.
### Description


Add retries to the node test data download. This should help fix the
random pipeline failures caused by download errors.
### Description
Adds a PagedAttention op, which enables a paged KV cache. Inputs to this
op are unpadded (packed/varlen), so cumulative sequence lengths are a
required input.


### Motivation and Context
Adding this op to ONNXRuntime is necessary to allow the GenAI team to
enable a continuous batching server API.
…oft#25027)

The util `GetWebNNOpFirstInputName` previously accepted a WebNN op name
as a parameter, which would always return "input" because the keys of
`op_inputs_map` are ONNX op types.

This PR fixes the bug by:
- Changing the parameter of `GetWebNNOpFirstInputName` to an ONNX op
- Changing `decomposed_op_map` to an ONNX-to-ONNX op mapping, to improve
the search complexity
…oft#25031)

Modified the script to explicitly add a build-time dependency on onnxruntime.dll.
…ft#25022)

Using the WebNN EP from C++ code requires calling
`Module.webnnRegisterMLContext` from an EM_ASM block.
The build time for the Windows CUDA job is reduced from 3.5 hours to 1 hour.
Also adds `--use_vcpkg` there.
### Description
Make OrtDevice generic by removing vendor-specific hardcoded memory
types and replacing them with generic DEFAULT/HOST_ACCESSIBLE values and
an explicit vendor id to differentiate.

Remove the unnecessary device id from OrtMemoryInfo; the OrtDevice in
OrtMemoryInfo provides this.

### Motivation and Context
Support plugin EPs
This disables the uniformity checks for SubgroupMatrix in the WebGPU EP.
### Description
This change cherry-picks telemetry changes from win-onnxruntime to
improve telemetry data collection for ONNX Runtime on Windows.

### Motivation and Context
These changes are already present in win-onnxruntime, so they are
cherry-picked here for Windows use cases that rely on the public ONNX
Runtime.
…crosoft#25055)

### Description

Revert [Improve Windows ETW callback registration and fix
issues](microsoft#24877) to unblock the Python packaging pipeline.

### Motivation and Context

The Python packaging pipeline is failing due to the changes in that PR.

---------

Co-authored-by: Dmitri Smirnov <dmitrism@microsoft.com>
…#25025)

### Description
This commit refactors the `DP4AMatMulNBitsSmallMProgram` to allow both
`tile_size_k_vec` and `tile_size` to be configured. This change allows
more flexibility for performance tuning without altering the core shader
functionality.

There is no functional change in this commit.



### Motivation and Context
This is a preparatory change to enable `DP4AMatMulNBitsSmallMProgram`
performance optimization work in subsequent commits.

---------

Co-authored-by: Jiajia Qin <jiajiaqin@microsoft.com>
### Description

The result of the Clip op is incorrect when min > max on CUDA. This
fixes the implementation to align with the operator spec:
https://onnx.ai/onnx/operators/onnx__Clip.html

### Motivation and Context

ONNX backend test failure with onnx 1.18:

```
python onnx_backend_test_series.py
======================================================================
FAIL: test_clip_min_greater_than_max_cuda (__main__.OnnxBackendNodeModelTest)
----------------------------------------------------------------------
DESIRED: array([1., 1., 1.], dtype=float32)
ACTUAL: array([2., 2., 1.], dtype=float32)
```
This patch enables the Intel subgroup matrix path in the MatMulNBits
operator. It is temporarily supported on the Vulkan backend and the
Xe-2LPG architecture; we will extend it to more subgroup matrix
configurations and platforms.

…25056)

Addresses test failures missed from the ONNX 1.18 integration.

1. The RMSNorm test models are invalid in terms of the ONNX spec (e.g., model rms_normalization_2d_axis1_expanded fails to load: Node () Op (Range) [ShapeInferenceError] Input to 'Range' op should be scalars (Tensor with only one element and shape empty)).
![Screenshot 2025-06-13 120308](https://github.com/user-attachments/assets/cbc513ee-51d2-4d71-94e4-0133f712cc2d)

2. TopK with uint64 is currently not supported.
…t#24238)

### Description
Allows users to configure and enable the global thread pool via Python,
and have inference sessions use it instead of session-local thread
pools.

### Motivation and Context
Forked off of microsoft#23495 to take over implementation, see issue microsoft#23523.

Our particular use case involves a single service instance serving
thousands of individual models, each relatively small (e.g. small
decision trees). Creating individual services for each model is too much
overhead, and attempting to start several thousand thread-pools is a
non-starter. We could possibly have each session be single-threaded, but
we would like to be able to separate the request handler thread count
from the compute thread count (e.g. 2 handler threads but 4 intra-op
ones).

---------

Co-authored-by: alex-halpin <alex.halpin@prizepicks.com>
Replace `ToStatus` with `ToStatusAndRelease`, which also releases the `OrtStatus*`.
…t#25026)

Resolves the below error for the florence2 model by using the WebGPU
built-in sqrt(x) function instead of pow(x, y) when the exponent is 0.5.
The sqrt(x) built-in is both faster and more stable than pow(x, 0.5).

```
Non-zero status code returned while running Reshape node. Name:'/Reshape_1' Status 
Message: /Users/runner/work/1/s/onnxruntime/core/providers/cpu/tensor/reshape_helper.h:47 
  onnxruntime::ReshapeHelper::ReshapeHelper(const TensorShape &, TensorShapeVector &, bool) input_shape_size == size was false. 
  The input tensor cannot be reshaped to the requested shape. Input shape:{1,576,32}, requested shape:{1,23,23,32}
```
### Description
Fix the CANN compilation error in microsoft#25074.
Fixes a misspelling in microsoft#25056: TopK with uint64 is currently not supported in ORT.
guschmue and others added 27 commits June 18, 2025 12:33
- Implemented ReciprocalOpBuilder to support the ONNX Reciprocal op in QNN EP.
- Decomposed Reciprocal into a Div op.
- Added unit tests that run the Reciprocal op on HTP.

### Description
Adds support for the ONNX Reciprocal operator in QNN EP via Div decomposition.

### Motivation and Context
Enables execution of models using Reciprocal on QNN backend, improving Op support.
### Description
This PR sets the foundation for the EP ABI, which allows plugin EPs to
interface with ORT through a binary-stable interface. A plugin EP can be
built separately from ORT and is not tied to a specific ORT commit.

Currently, this PR adds basic APIs necessary to allow an example
plugin-EP to compile and run a simple model with a single `Mul` node.

- Example plugin-EP implementation:
https://github.com/microsoft/onnxruntime/blob/adrianl/ep-abi/onnxruntime/test/autoep/library/example_plugin_ep.cc
- APIs: 
- Graph IR:
https://github.com/microsoft/onnxruntime/blob/adrianl/ep-abi/include/onnxruntime/core/session/onnxruntime_c_api.h#L5290-L5439
- Plugin EP:
https://github.com/microsoft/onnxruntime/blob/adrianl/ep-abi/include/onnxruntime/core/session/onnxruntime_c_api.h#L6177-L6481
- Example app code (from unit tests):
https://github.com/microsoft/onnxruntime/blob/adrianl/ep-abi/onnxruntime/test/autoep/test_autoep_selection.cc#L614

### Motivation and Context
Based on microsoft#21450

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
Add an early exit in ComputeModelGraphHash when EPContext nodes are
present, returning "0" to indicate a pre-compiled model. Conditionally
skip ComputeModelWeightHash when the graph hash is "0" to avoid
unnecessary computation for pre-compiled models.

This optimization reduces overhead for models containing EPContext
nodes, which represent execution provider pre-compiled subgraphs.

### Motivation and Context
Currently, the hash generated by ComputeModelGraphHash when the graph
contains EPContext nodes does not correctly represent the graph, because
we do not hash the contents of the context pointed to by the EPContext
node. Thus, it makes more sense to skip hashing for cases involving
EPContext nodes.
### Description
Missed a few documentation errors in the [previous
PR](microsoft#24887). This PR fixes
the C/C++ API documentation generation action:
https://github.com/microsoft/onnxruntime/actions/runs/15749541590



### Motivation and Context
Fix the C/C++ API documentation generation GitHub action.
### Description
Some initializers are stored as in-memory external data; the WebNN EP
should support these initializers.

### Motivation and Context
This PR:
- Added a `HasExternalDataInMemory` check for external data to avoid
unexpected errors.
- Wrapped `UnpackInitializerData` to make it compatible with external
data.

Fixed microsoft#25078
### Description

This PR allows using the WebGPU EP in the `onnxruntime-web` NPM package.

### Migration Plan

Currently, there are two different EPs implementing the WebGPU backend of
onnxruntime-web: JSEP and the WebGPU EP. The migration plan is to
replace JSEP with the WebGPU EP and eventually remove JSEP. The plan
contains the following stages:

- STAGE 1: enable WebGPU EP on onnxruntime-web in local build. (Done)
- **STAGE 2: enable WebGPU EP on onnxruntime-web in the public package.
(This PR)**
- STAGE 3: remove JSEP from onnxruntime-web.

### Package consumption changes

- The default import (`import 'onnxruntime-web'`) and the CPU-only import
(`import 'onnxruntime-web/wasm'`) keep their previous behavior.

- The WebGPU import (`import 'onnxruntime-web/webgpu'`) will now use the
WebGPU EP instead of JSEP. Previously it was the same as the default import.

- The WebGPU import will use a different suffix for the .mjs and .wasm
names (previously `.jsep`):
    - ort-wasm-simd-threaded<b>.asyncify</b>.mjs
    - ort-wasm-simd-threaded<b>.asyncify</b>.wasm

- The suffix `.asyncify` is used because `.jspi` is planned for the
future. They are two different Emscripten approaches to async
implementation (synchronous C++ code calling async JS functions).
1. Updated abseil to the latest version, which fixed all BinSkim
warnings so that we no longer need to patch it.
2. However, the new abseil itself has some extra warnings, so I have to
patch it.
3. Added "--compile_no_warning_as_error" to windows 32-bit build. I will
fix the warnings later.
### Description
This change implements a **metadata-based hash override** optimization in
the model hashing logic. It checks for "model_hash" in the model
metadata before computing hashes. If present, both model_graph_hash and
model_weight_hash are set to the metadata value, bypassing computation
entirely.

### Motivation and Context
ONNX models generated using the Olive toolchain now have the option to
include the model hash as part of the ONNX metadata. For such models, it
would be beneficial to use the provided hash instead of computing it
from scratch.
### Description
This change introduces a 6x8 QGEMM micro-kernel for the WASM relaxed-SIMD
build.

### Motivation and Context
This change optimizes the performance of QGEMM on x64 devices with
AVX-VNNI.

| Mlas bench/RPL laptop/node v24.1.0 | baseline | opt | diff |
|---|---|---|---|
| QGEMM/UnsignedANoPackB/M:384/N:1024/K:1024/Batch:1/Threads:4/real_time | 2452212 | 1708338 | 44% |
| QGEMM/UnsignedANoPackB/M:384/N:1024/K:3072/Batch:1/Threads:4/real_time | 9053789 | 6395584 | 42% |
| QGEMM/UnsignedANoPackB/M:384/N:1024/K:4096/Batch:1/Threads:4/real_time | 12109727 | 8189719 | 48% |
| QGEMM/UnsignedANoPackB/M:384/N:4096/K:1024/Batch:1/Threads:4/real_time | 11787607 | 7926226 | 49% |
Enhance MatMulNBits CUDA kernel testing:
1. Add kernel tests for the different CUDA kernels used in MatMulNBits.
2. Refactor the GEMM profiler to use the CUDA allocator.
3. Add verbose logging macros.
4. Adjust the build to speed up compilation when sm90 is excluded.

Example kernel test output:

![image](https://github.com/user-attachments/assets/c6d3f6fa-ef84-4bfb-b0b9-acd9b6717980)
microsoft#25117)

Fix: delay CUDADriverWrapper instantiation to avoid uncaught exceptions
when CUDA is unavailable

### Description

This PR moves the static instantiation of CUDADriverWrapper from a
class-level static field to a function-local static inside
CUDADriverWrapper::GetInstance(). This change ensures that the CUDA
driver is only loaded when the instance is actually needed, rather than
at static initialization time. It preserves the singleton behavior while
deferring instantiation to runtime.

### Motivation and Context

When libcuda.so.1 is not available on the system, the constructor of
CUDADriverWrapper throws an exception. Previously, this exception was
triggered during static initialization, leading to an uncatchable
std::terminate() and process termination. By moving the instance into
GetInstance() as a function-local static, the exception can now be
caught by client code (e.g., in try/catch), allowing graceful fallback
when CUDA is unavailable.
)

* Change 90 to 90a, because FpA IntB Gemm uses accelerated features
(like WGMMA, TMA and setmaxnreg).
* Change 90 to `90a-real;90a-virtual`, because a recent change in
[cuda_configuration.cmake](https://github.com/microsoft/onnxruntime/blob/main/cmake/external/cuda_configuration.cmake)
will replace "90" with "90a-real", and we need to explicitly add the
virtual one, `90a-virtual`, to the list.

Note that our pipelines are still using CUDA 12.2, so we are not able to
add 100 and 120 to the list. We can add them when we upgrade to CUDA 13
in the future.
…ft#25102)

### Description
- Add the argument "--exclude_initializer_from_input" to remove initializers from graph inputs if model.ir_version >= 4.
- Add a function argument to remove_initializer_from_input.
- Modify onnxruntime_python.cmake to include remove_initializer_from_input.py.

### Motivation and Context
- To solve the issue `Initializer <name> appears in graph inputs and will not be treated as constant value/weight.`, remove_initializer_from_input needs to be included in `qnn.preprocess`.
…icrosoft#24869)

Enable NV TRT RTX EP engines to always be weight-stripped when using EP
Context.

We want to always use weight-stripped engines for EP Context to reduce
the disk footprint on end-user systems. With this, there are two ways to
load weights:
1. provide weights via a bytestream (recommended)
2. keep the original `model.onnx` in the same folder as the
`model_ctx.onnx`

```cpp
std::vector<char> model_bytes = ReadFileFromDisk("model.onnx");

// weight refitting using a bytestream
std::unordered_map<std::string, std::string> rtx_ep_options;
rtx_ep_options[onnxruntime::nv::provider_option_names::kONNXBytestream] = std::to_string(reinterpret_cast<size_t>(model_bytes.data()));
rtx_ep_options[onnxruntime::nv::provider_option_names::kONNXBytestreamSize] = std::to_string(model_bytes.size());
```
…5132)

### Description
Adds the start time of WebGPU kernel profiling to the logging output.



### Motivation and Context
To aid in performance analysis, this change includes the kernel
profiling start time in addition to the existing execution time. This
allows for a more detailed understanding of kernel performance and
scheduling.
### Description
Add UDO support in QNN

Example usage:
```
./onnx_test_runner -v -e qnn -j 1 -i "backend_path|./libQnnCpu.so
op_packages|<op_type>:<op_package_path>:<interface_symbol_name>[:<target>],<op_type2>:<op_package_path2>:<interface_symbol_name2>[:<target2>]" <models>
```

### Motivation and Context
Add QNN EP UDO support.
For more information, see [op packages](https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-50/op_packages.html).
### Description

This PR makes the intermediate generated buffers static in GQA for the
static KV cache, so that it is possible to use the graph capture
capability on LLMs.

The changes may improve the buffer cache hit rate but also slightly
increase average GPU memory usage.
…5090)

### Description
In the original script, the order of inputs in the generated ONNX model follows the order in the JSON file instead of the id.
Therefore, the script now first sorts the inputs by ID before wrapping the context into the new model.

### Motivation and Context
If we use gen_qnn_ctx_onnx_model.py, it wraps the context based on the order in qnn_model_net.json instead of the id in qnn_model_net.json, which may result in the ONNX model having an incorrect order of inputs.

---------

Co-authored-by: chuteng <chuteng@qti.qualcomm.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
### Description
This change optimizes MlasGemmQuantKernel for the WASM SIMD build by
introducing a 4x8 micro-kernel.

### Motivation and Context
This change optimizes the performance of QGEMM on x64 devices using the
WASM SIMD build.

| Mlas bench/LNL laptop/node v24.2.0 | improvement |
|---|---|
| QGEMM/UnsignedANoPackB/M:384/N:1024/K:1024/Batch:1/Threads:4/real_time | 51% |
| QGEMM/UnsignedANoPackB/M:384/N:1024/K:3072/Batch:1/Threads:4/real_time | 50% |
| QGEMM/UnsignedANoPackB/M:384/N:1024/K:4096/Batch:1/Threads:4/real_time | 51% |
| QGEMM/UnsignedANoPackB/M:384/N:4096/K:1024/Batch:1/Threads:4/real_time | 71% |
### Description

This PR updates the default value for `past_present_share_buffer` in the
GenAI config for Whisper.

### Motivation and Context

ONNX Runtime GenAI does not currently support buffer sharing during beam
search. Whisper is often used with beam search, so this should be set to
false by default. However, the CUDA model is an exception, as the
`DecoderMaskedMultiHeadAttention` kernel inside `MultiHeadAttention`
requires buffer sharing and manages beam search through its
`cache_indirection`.
…tvh (microsoft#25148)

Per the Windows team's CyberEO requirement, teams are required to enable
warnings such as 4018, 4146, 4244, 4267, 4302, 4308, 4509, 4532, 4533,
4700, 4789, 4995, and 4996.
### Description

Add ONNX RotaryEmbedding(23) following
https://github.com/onnx/onnx/blob/main/docs/Operators.md#RotaryEmbedding.
The PR uses the contrib op RotaryEmbedding implementation under the hood.

The main difference between this op and the contrib op is that
`position_ids` in ONNX RotaryEmbedding is optional. When it is not
provided, `cos_cache` and `sin_cache` should be 3D.

### Motivation and Context

Fix microsoft#24556
Reference microsoft#23507
### Description
Enables the VTCM backup buffering feature on QNN EP, assuming all graphs run sequentially, with the input of the next graph being the output of the current graph.

Under these assumptions, rather than allocating buffers for all inputs and outputs, a single buffer can be shared among all graphs.

### Motivation and Context
This will allow larger LLM models to be run.
### Description
- Updates `OrtEp::Compile()` to allow a plugin EP to create and return
EPContext nodes.
- Updates the example EP plugin to generate an example EPContext model:
<img width="747" alt="image"
src="https://github.com/user-attachments/assets/e5d98a10-ec15-45aa-bfaf-887d3b6226e2"
/>


### Motivation and Context
Adds more of the functionality missing from the EP ABI used for plugin
EPs.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
@jatinwadhwa921 jatinwadhwa921 requested a review from ankitm3k June 25, 2025 11:00
@ankitm3k ankitm3k merged commit a95ed23 into ovep-develop Jun 25, 2025
4 of 7 checks passed