
Conversation

@Jaswanth51

Description

Synchronizing intel/onnxruntime ovep-develop branch with latest changes from microsoft/onnxruntime master branch.

guschmue and others added 20 commits October 10, 2025 15:25
Fixes gather_nd on the WebGPU EP
(found by transformers.js for the vision encoder of Docling)
### Description
- Added support for the `--cmake_deps_mirror_dir` option to allow users
to specify a custom local directory for CMake dependencies.
- Improved logging to show the source of `FetchContent` in CMake.

### Motivation and Context
- Previously, ONNX Runtime searched for CMake dependencies only in the
default `<repo_root>/mirror` directory.
- This change enables users to configure an alternative location for
storing CMake dependencies, offering greater flexibility in build
environments.
The WebNN implementation of Gemm's C operand now supports
unidirectional broadcasting, which aligns with the ONNX spec. Remove the
constraints on Gemm's C input, as they should be covered by the ORT kernel.
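For reference, a minimal numpy sketch of the Gemm semantics involved (shapes are hypothetical): C is unidirectionally broadcast to the (M, N) output, so a 1-D bias of shape (N,) is valid per the ONNX spec.

```python
import numpy as np

# Gemm: Y = alpha * (A @ B) + beta * C, with C unidirectionally
# broadcast to the (M, N) output shape.
M, K, N = 3, 4, 5
A = np.random.rand(M, K).astype(np.float32)
B = np.random.rand(K, N).astype(np.float32)
C = np.random.rand(N).astype(np.float32)  # (N,) broadcasts across the M rows

Y = 1.0 * (A @ B) + 1.0 * C
print(Y.shape)  # (3, 5)
```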
### Description
The argument order of np.testing was incorrect.

### Motivation and Context
Previously, the expected result and the actual result were reversed.
<img width="1285" height="697" alt="image"
src="https://github.com/user-attachments/assets/0a464008-9704-46f3-a04d-912ba5b41892"
/>
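The exact call site isn't shown above; a minimal sketch of the corrected ordering, assuming `np.testing.assert_allclose` (which takes `(actual, desired)` in that order):

```python
import numpy as np

expected = np.array([1.0, 2.0, 3.0])
actual = expected + 1e-9  # pretend this came from the op under test

# Wrong: the failure message would label the expected values as "actual".
# np.testing.assert_allclose(expected, actual)

# Right: actual first, desired second.
np.testing.assert_allclose(actual, expected)
```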
### Description
From an internal user, we see that sparse attention has a memory issue
similar to microsoft#22290, so we follow that PR to make the change.

### Motivation and Context
SparseAttention memory issue.
Add Windows Server to the supported list to avoid confusing users:

Marketing Name | Internal Version | platform.release().lower() | Release Year | Based on
-- | -- | -- | -- | --
Windows Server 2025 | 10.0.26100+ | "2025server" | 2024–2025 | Windows 11 (24H2)
Windows Server 2022 | 10.0.20348 | "2022server" | 2021 | Windows 10 (21H2)
Windows Server 2019 | 10.0.17763 | "2019server" | 2018 | Windows 10 (1809)
Windows Server 2016 | 10.0.14393 | "2016server" | 2016 | Windows 10 (1607)
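A minimal sketch of the kind of check this table feeds; the actual supported-release set in the setup scripts is not shown here, so the values below are illustrative only.

```python
import platform

# Illustrative supported set derived from the table above (hypothetical).
SUPPORTED_WINDOWS_RELEASES = {
    "10", "11",
    "2016server", "2019server", "2022server", "2025server",
}

release = platform.release().lower()
if release in SUPPORTED_WINDOWS_RELEASES:
    print(f"Windows release '{release}' is recognized as supported")
else:
    print(f"Windows release '{release}' is not in the supported list")
```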
…rosoft#26166)

### **Key changes**
This PR makes changes to the KleidiAI integration within the existing
sgemm_kleidiai.cpp implementation.

It was noted during internal testing that memory allocation overhead
due to repeated allocations of vectors was having a negative impact on
performance figures.

The changes introduce thread-local buffers for reusing memory during
inference.

Android platforms are particularly sensitive to this; we have observed
inference times being significantly impacted by memory allocation
overheads.
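The actual change is C++ in sgemm_kleidiai.cpp; purely as an illustration of the pattern, a thread-local reusable scratch buffer can be sketched in Python with `threading.local` (all names below are hypothetical):

```python
import threading
import numpy as np

_tls = threading.local()

def _get_scratch(num_bytes: int) -> np.ndarray:
    # Reuse one buffer per thread, growing it only when a larger request
    # arrives, instead of allocating a fresh buffer on every call.
    buf = getattr(_tls, "scratch", None)
    if buf is None or buf.nbytes < num_bytes:
        buf = np.empty(num_bytes, dtype=np.uint8)
        _tls.scratch = buf
    return buf

def pack_rhs(b: np.ndarray) -> np.ndarray:
    # Hypothetical stand-in for a packing routine that needs temporary
    # storage proportional to its input.
    flat = np.ascontiguousarray(b).view(np.uint8).reshape(-1)
    scratch = _get_scratch(flat.size)
    scratch[: flat.size] = flat
    return scratch[: flat.size]
```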
### Example performance
All runs were captured using onnxruntime_perf_test, e.g.
`onnxruntime_perf_test -v -e cpu -I -m times -x 1 -y 1 -r 1000`
**Android Platform**
<img width="996" height="286" alt="image"
src="https://github.com/user-attachments/assets/252165af-c864-4b24-b1f2-c28ada208b06"
/>

In addition, on the M4 we have also observed slight improvements on
models; however, the gain is not as significant, since the allocation
overhead accounts for a smaller share of total time on that platform.

**Mac Mini M4**
<img width="741" height="153" alt="image"
src="https://github.com/user-attachments/assets/93e6c545-96fd-4bfc-b90f-3a845a1551bc"
/>

**Onnxruntime MLAS Benchmark**
The MLAS benchmark was executed on a Mac Mini M4 with SME2 instructions. The code was tested with and without the changes in this PR, and the following results were observed (subset shown); the comparison was generated using compare.py from the Google Benchmark repo tools.
`./onnxruntime_mlas_benchmark --benchmark_filter="SGEMM/NORMAL*" --benchmark_repetitions=100`

```

Benchmark                                                             Time             CPU      Time Old      Time New       CPU Old       CPU New
--------------------------------------------------------------------------------------------------------------------------------------------------
SGEMM/NORMAL_NoTrans/M:63/N:63/K:63/real_time                      -0.1897         -0.1897          3270          2650          3270          2650
SGEMM/NORMAL_NoTrans/M:255/N:63/K:63/real_time                     -0.1468         -0.1469          8383          7152          8382          7151
SGEMM/NORMAL_NoTrans/M:1023/N:63/K:63/real_time                    -0.1506         -0.1506         19072         16200         19072         16200
SGEMM/NORMAL_NoTrans/M:63/N:255/K:63/real_time                     -0.1957         -0.1957          7742          6227          7742          6227
SGEMM/NORMAL_NoTrans/M:255/N:255/K:63/real_time                    -0.1032         -0.1032         14323         12845         14322         12845
SGEMM/NORMAL_TransB/M:63/N:63/K:63/real_time                       -0.2221         -0.2221          3356          2611          3356          2610
SGEMM/NORMAL_TransB/M:255/N:63/K:63/real_time                      -0.0439         -0.0438          8602          8224          8601          8224
SGEMM/NORMAL_TransB/M:1023/N:63/K:63/real_time                     +0.0436         +0.0436         16488         17206         16487         17206
SGEMM/NORMAL_TransB/M:63/N:255/K:63/real_time                      -0.2000         -0.1999          8046          6437          8046          6437
SGEMM/NORMAL_TransB/M:255/N:255/K:63/real_time                     -0.0979         -0.0979         14131         12747         14130         12747
SGEMM/NORMAL_TransB/M:1023/N:255/K:63/real_time                    -0.2836         -0.2836         62540         44802         62540         44802
SGEMM/NORMAL_TransB/M:63/N:1023/K:63/real_time                     -0.2183         -0.2183         15342         11993         15342       
```

Some small regressions have been seen but are difficult to explain;
machine variance during the run is suspected to account for results like
```
SGEMM/NORMAL_TransB/M:1023/N:63/K:63/real_time                     +0.0436         +0.0436         16488         17206         16487         17206
```
For example, as part of testing these results, sgemm_kleidiai.cpp was
instrumented (after the previous benchmark results) with timer code in
MlasGemmBatch, MlasGemmPackB, and MlasGemmPackBSize.
This produced the following, indicating that the code performs better
in this case on average than the baseline currently in main:
```
Head of main
Function           Count         Avg (ns)     Avg (pretty)
----------------------------------------------------------
MlasGemmBatch      42664        19601.015     19.601 us
MlasGemmPackB      42664          373.943    373.943 ns
MlasGemmPackBSize  42664           17.179     17.179 ns

TLB changes
Function           Count         Avg (ns)     Avg (pretty)
----------------------------------------------------------
MlasGemmBatch      55492        16985.256     16.985 us
MlasGemmPackB      55492          344.800    344.800 ns
MlasGemmPackBSize  55492           16.788     16.788 ns
```

---------

Signed-off-by: Jonathan Clohessy <Jonathan.Clohessy@arm.com>
…t#26267)

This upgrades CUDA 12.2 + cuDNN 9.5 to CUDA 12.8 + cuDNN 9.8 in CI
pipelines, so that we can build 120-real to support Blackwell GPUs.

To speed up the build, we also disable relocatable-device-code.

MSVC is updated to the latest version for some Windows build pipelines.

#### Known issues

Some ONNX models (YOLOv3, YOLOv4, MobileNet v1) failed to run because the
cuDNN frontend failed to find an engine plan. We will try upgrading the
cuDNN frontend later. The related failing tests are disabled for now.

---------

Co-authored-by: Changming Sun <chasun@microsoft.com>
…oft#26231)

Do this so that MIGraphX can accept fp4 types for input/output tensors
and then use them to perform inference via the MIGraphX API.

### Description
Mirrored changes going into the ROCm 7.1 build. Cherry-picked mainline
OnnxRT changes to get fp4 tensor support before adding this on top.

Moving this to mainline OnnxRT enables the MIGraphX EP to allow fp4
input/output tensors.
ROCm#176

### Motivation and Context
Add fp4 support to MIGraphX EP
…osoft#26264)

### Description
This PR fixes an issue where running

```bash
bash build.sh ...... --parallel 1 ......
```

still triggers a parallel build.

The previous logic only added `-j` when `num_parallel_jobs != 1`, which
caused Ninja/Make/Xcode to use all CPU cores by default.

### Motivation and Context
When building ONNX Runtime, using `--parallel 4` caused an out-of-memory
(OOM) error on my computer. However, changing it to `--parallel 1` still
triggered parallel compilation and caused OOM again.
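A rough sketch of the corrected behavior (illustrative only, not the actual tools/ci_build/build.py code): an explicit job count is always forwarded to the underlying generator, so `--parallel 1` really builds serially.

```python
def build_parallel_args(num_parallel_jobs: int) -> list[str]:
    # 0 (or unset) means "let CMake use all available cores";
    # any explicit value, including 1, is passed through as-is.
    if num_parallel_jobs <= 0:
        return ["--parallel"]
    return ["--parallel", str(num_parallel_jobs)]

print(build_parallel_args(1))  # ['--parallel', '1'] -> serial build
print(build_parallel_args(4))  # ['--parallel', '4']
```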
~~Test rel-1.19.1~~

Bump to ONNX==1.19.1
This pull request introduces support for indirect dispatch in the WebGPU
FlashAttention implementation, enabling more dynamic and efficient
kernel launches based on runtime sequence lengths. The changes add new
logic and parameters to propagate sequence length information and
indirect dispatch buffers through the attention pipeline, with
conditional code paths to maintain compatibility with the existing
direct dispatch approach.

It's part of the work to enable graph capture in phi4
microsoft#25868
…eragePool (microsoft#26162)

### Description
Add support for the QLinearGlobalAveragePool and QLinearAveragePool
operators in the MIGraphX EP.


### Motivation and Context
We want support for these operators through the MIGraphX EP and MIGraphX.
- Allow empty axes input
- When axes is empty and `noop_with_empty_axes` is true, WebNN should set axes to [] (see the numpy sketch below)
- Simplify the code
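For reference, the ONNX reduce semantics this follows can be sketched in numpy, using ReduceSum as the example:

```python
import numpy as np

def reduce_sum(data, axes=None, keepdims=1, noop_with_empty_axes=0):
    # Empty/absent axes + noop_with_empty_axes=1 -> identity (no reduction);
    # empty/absent axes + noop_with_empty_axes=0 -> reduce over all axes.
    if axes is None or len(axes) == 0:
        if noop_with_empty_axes:
            return data
        axes = list(range(data.ndim))
    return np.sum(data, axis=tuple(axes), keepdims=bool(keepdims))

x = np.arange(6, dtype=np.float32).reshape(2, 3)
print(reduce_sum(x, axes=[], noop_with_empty_axes=1).shape)  # (2, 3): unchanged
print(reduce_sum(x, axes=[]).shape)                          # (1, 1): reduce all
```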
…6263)

## Description

Fixes microsoft#26261

This PR resolves a regression introduced in v1.23.0 where models with
Constant nodes containing tensors larger than 127 bytes fail to load
with a shape inference error.

### Root Cause

Commit 3b97d79 (PR microsoft#25320) introduced an optimization to convert
large Constant node tensors (> 127 bytes) into OrtValues with in-memory
external data references for better memory management. However, ONNX
shape inference cannot distinguish between in-memory and file-based
external data, and rejects any TensorProto with `data_location =
EXTERNAL`.
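The fix itself is in C++, but the condition shape inference trips over can be illustrated with the onnx Python helpers ("model.onnx" below is a hypothetical path): any initializer whose `data_location` is `EXTERNAL` is treated as external data, whether the bytes live in a file or in memory.

```python
import onnx
from onnx.external_data_helper import uses_external_data

model = onnx.load("model.onnx")  # hypothetical model path
for init in model.graph.initializer:
    # uses_external_data() simply checks data_location == TensorProto.EXTERNAL;
    # there is no separate marker for "external but resident in memory",
    # which is why shape inference rejects the in-memory case as well.
    if uses_external_data(init):
        print(f"{init.name}: data_location=EXTERNAL (rejected by ONNX shape inference)")
```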

### The Fix

Modified `InferenceContextImpl::getInputData()` to:
1. Detect tensors with in-memory external data using
`utils::HasExternalDataInMemory()`
2. Retrieve the corresponding OrtValue
3. Create a temporary TensorProto with embedded data (not external
reference)
4. Provide this temporary proto to ONNX shape inference

This allows ONNX shape inference to access the actual tensor data
without rejecting it as external.

### Memory Impact

This fix introduces a minor and temporary increase in memory usage
during the model loading phase.

- **When:** The additional memory is allocated only when the shape
inference engine needs to access the data of a constant tensor that is
larger than 127 bytes. This is a one-time event during the initial
analysis of the model.
- **What:** The fix creates a temporary in-memory copy of the tensor
data.
- **Duration:** This temporary copy is released as soon as shape
inference is complete.

The impact on the overall peak memory usage of the application is
expected to be negligible. The memory usage during inference is not
affected. While it is theoretically possible for the temporary tensor to
be large if a multi-gigabyte constant tensor is used for shape
inference, this is a highly unlikely scenario in practice for
well-designed models.

### Testing

- Tested with the problematic model from issue microsoft#26261
- All optimization levels now work correctly (DISABLE_ALL, BASIC,
EXTENDED, ALL)
- Unit tests to be added

### Changes

- **onnxruntime/core/graph/graph.cc**:
  - Modified the `getInputData()` method in the `InferenceContextImpl` class
  - Added a `temp_tensor_protos_` member to store temporary TensorProtos during shape inference

## TODO

- [ ] Add unit tests
- [ ] Run full test suite

---------

Co-authored-by: Dmitri Smirnov <dmitrism@microsoft.com>
### Description
Fix a bug in the TRT Execution Provider where the DDS output tensor was
not bound after an engine update.


### Motivation and Context
The `dds_output_allocator_map` is not cleared on engine update, so the
output is mis-recognized as a known DDS output and the output
allocation is not bound.

Script to reproduce the issue:
```python
# create an onnx model with:
# inputs: data -> NonZeros(data) -> GatherND -> output
# then run the model with onnxruntime

def create_model():
    import onnx
    from onnx import helper, TensorProto

    input = helper.make_tensor_value_info("data", TensorProto.FLOAT, ["d1", "d2"])
    output = helper.make_tensor_value_info("output", TensorProto.FLOAT, ["nzr"])

    nonzeros_node = helper.make_node("NonZero", ["data"], ["nonzeros"], "nonzeros_node")
    transpose_node = helper.make_node(
        "Transpose", ["nonzeros"], ["nonzeros_t"], "transpose_node"
    )
    gathernd_node = helper.make_node(
        "GatherND", ["data", "nonzeros_t"], ["output"], "gathernd_node"
    )

    value_info = [
        helper.make_tensor_value_info("nonzeros", TensorProto.INT64, [2, "nzr"]),
        helper.make_tensor_value_info("nonzeros_t", TensorProto.INT64, ["nzr", 2]),
    ]

    graph = helper.make_graph(
        [nonzeros_node, transpose_node, gathernd_node],
        "test_graph",
        [input],
        [output],
        value_info=value_info,
    )

    model = helper.make_model(graph)
    onnx.save(model, "model_dds.onnx")


def run_model():
    import onnxruntime as ort
    import numpy as np

    sess = ort.InferenceSession("model_dds.onnx", providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"])

    print("Running with data shape (3,4)")
    data = np.random.randn(3, 4).astype(np.float32)
    sess.run(None, {"data": data})

    print("Running with data shape (5,6)")
    data = np.random.randn(5, 6).astype(np.float32)
    sess.run(None, {"data": data})


create_model()
run_model()
```

Before the change:
> IExecutionContext::enqueueV3: Error Code 3: API Usage Error (Parameter
check failed, condition:
mContext.profileObliviousBindings.at(profileObliviousIndex) ||
getPtrOrNull(mOutputAllocators, profileObliviousIndex). Neither address
or allocator is set for output tensor scores. Call
setOutputTensorAddress, setTensorAddress or setOutputAllocator before
enqueue/execute.) ... Status Message: TensorRT EP execution context
enqueue failed.
This pull request extends the WebGPU execution provider to support int64
data type casting in the `Cast` operator, with conditional support based
on whether graph capture is enabled. It refactors kernel registration to
allow toggling int64 support and updates the shader code and kernel
logic to handle int64 tensors efficiently.

It's part of the work to enable graph capture in phi4
microsoft#25868
…oft#26315)

To fix the build pipeline error `ModuleNotFoundError: No module named
'onnxscript._framework_apis.torch_2_9'` after the recent torch 2.9 release.

This locks the torch version to 2.8, and also updates onnxscript and onnx-ir
to the latest versions.

I locked the torchvision version since it is usually installed together
with torch. If torch and torchvision are not compatible, there might be
errors in the transformers scripts.
@Jaswanth51 Jaswanth51 requested a review from ankitm3k October 16, 2025 03:42
@ankitm3k ankitm3k merged commit f7483e6 into ovep-develop Oct 16, 2025
6 of 8 checks passed
@ankitm3k ankitm3k deleted the sync_msft_16102025 branch October 16, 2025 05:26