
Conversation

@Jaswanth51

Description

Synchronizing intel/onnxruntime ovep-develop branch with latest changes from microsoft/onnxruntime master branch.

guschmue and others added 20 commits October 10, 2025 15:25
Fixes gather_nd on the WebGPU EP
(found by transformers.js for the vision encoder of Docling)
### Description
- Added support for the `--cmake_deps_mirror_dir` option to allow users
to specify a custom local directory for CMake dependencies.
- Improved logging to show the source of `FetchContent` in CMake.

### Motivation and Context
- Previously, ONNX Runtime searched for CMake dependencies only in the
default `<repo_root>/mirror` directory.
- This change enables users to configure an alternative location for
storing CMake dependencies, offering greater flexibility in build
environments.
The WebNN implementation of Gemm's C operand now supports
unidirectional broadcasting, which aligns with the ONNX spec. Remove the
constraints on Gemm's C input, as they should be covered by the ORT kernel.
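For reference, a minimal numpy sketch of the Gemm semantics involved (shapes are hypothetical): C is unidirectionally broadcast to the (M, N) output, so a 1-D bias of shape (N,) is valid per the ONNX spec.

```python
import numpy as np

# Gemm: Y = alpha * (A @ B) + beta * C, with C unidirectionally
# broadcast to the (M, N) output shape.
M, K, N = 3, 4, 5
A = np.random.rand(M, K).astype(np.float32)
B = np.random.rand(K, N).astype(np.float32)
C = np.random.rand(N).astype(np.float32)  # (N,) broadcasts across the M rows

Y = 1.0 * (A @ B) + 1.0 * C
print(Y.shape)  # (3, 5)
```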
### Description
The argument order of np.testing was incorrect.

### Motivation and Context
Previously, the expected result and the actual result were reversed.
<img width="1285" height="697" alt="image"
src="https://github.com/user-attachments/assets/0a464008-9704-46f3-a04d-912ba5b41892"
/>
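The exact call site isn't shown above; a minimal sketch of the corrected ordering, assuming `np.testing.assert_allclose` (which takes `(actual, desired)` in that order):

```python
import numpy as np

expected = np.array([1.0, 2.0, 3.0])
actual = expected + 1e-9  # pretend this came from the op under test

# Wrong: the failure message would label the expected values as "actual".
# np.testing.assert_allclose(expected, actual)

# Right: actual first, desired second.
np.testing.assert_allclose(actual, expected)
```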
### Description
From an internal user, we see that sparse attention has a memory issue
similar to microsoft#22290, so we follow that PR to make the change.

### Motivation and Context
SparseAttention memory issue.
Add Windows Server to the supported list to avoid confusing users:

Marketing Name | Internal Version | platform.release().lower() | Release Year | Based on
-- | -- | -- | -- | --
Windows Server 2025 | 10.0.26100+ | "2025server" | 2024–2025 | Windows 11 (24H2)
Windows Server 2022 | 10.0.20348 | "2022server" | 2021 | Windows 10 (21H2)
Windows Server 2019 | 10.0.17763 | "2019server" | 2018 | Windows 10 (1809)
Windows Server 2016 | 10.0.14393 | "2016server" | 2016 | Windows 10 (1607)
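A minimal sketch of the kind of check this table feeds; the actual supported-release set in the setup scripts is not shown here, so the values below are illustrative only.

```python
import platform

# Illustrative supported set derived from the table above (hypothetical).
SUPPORTED_WINDOWS_RELEASES = {
    "10", "11",
    "2016server", "2019server", "2022server", "2025server",
}

release = platform.release().lower()
if release in SUPPORTED_WINDOWS_RELEASES:
    print(f"Windows release '{release}' is recognized as supported")
else:
    print(f"Windows release '{release}' is not in the supported list")
```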
…rosoft#26166)

### **Key changes**
This PR makes changes to the KleidiAI integration within the existing
sgemm_kleidiai.cpp implementation.

It was noted during internal testing that memory allocation overhead
due to repeated allocations of vectors was having a negative impact on
performance figures.

The changes introduce thread-local buffers for reusing memory during
inference.

Android platforms are particularly sensitive to this; we have observed
inference times being significantly impacted by memory allocation
overheads.
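The actual change is C++ in sgemm_kleidiai.cpp; purely as an illustration of the pattern, a thread-local reusable scratch buffer can be sketched in Python with `threading.local` (all names below are hypothetical):

```python
import threading
import numpy as np

_tls = threading.local()

def _get_scratch(num_bytes: int) -> np.ndarray:
    # Reuse one buffer per thread, growing it only when a larger request
    # arrives, instead of allocating a fresh buffer on every call.
    buf = getattr(_tls, "scratch", None)
    if buf is None or buf.nbytes < num_bytes:
        buf = np.empty(num_bytes, dtype=np.uint8)
        _tls.scratch = buf
    return buf

def pack_rhs(b: np.ndarray) -> np.ndarray:
    # Hypothetical stand-in for a packing routine that needs temporary
    # storage proportional to its input.
    flat = np.ascontiguousarray(b).view(np.uint8).reshape(-1)
    scratch = _get_scratch(flat.size)
    scratch[: flat.size] = flat
    return scratch[: flat.size]
```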
### Example performance
All runs were captured using onnxruntime_perf_test, e.g.
`onnxruntime_perf_test -v -e cpu -I -m times -x 1 -y 1 -r 1000`
**Android Platform**
<img width="996" height="286" alt="image"
src="https://github.com/user-attachments/assets/252165af-c864-4b24-b1f2-c28ada208b06"
/>

In addition, on the M4 we have also observed slight improvements on
models; however, the gain is not as significant, since the allocation
overhead accounts for a smaller share of total time on that platform.

**Mac Mini M4**
<img width="741" height="153" alt="image"
src="https://github.com/user-attachments/assets/93e6c545-96fd-4bfc-b90f-3a845a1551bc"
/>

**Onnxruntime MLAS Benchmark**
The MLAS benchmark was executed on a Mac Mini M4 with SME2 instructions. The code was tested with and without the changes in this PR, and the following results were observed (subset shown); the comparison was generated using compare.py from the Google Benchmark repo tools.
`./onnxruntime_mlas_benchmark --benchmark_filter="SGEMM/NORMAL*" --benchmark_repetitions=100`

```

Benchmark                                                             Time             CPU      Time Old      Time New       CPU Old       CPU New
--------------------------------------------------------------------------------------------------------------------------------------------------
SGEMM/NORMAL_NoTrans/M:63/N:63/K:63/real_time                      -0.1897         -0.1897          3270          2650          3270          2650
SGEMM/NORMAL_NoTrans/M:255/N:63/K:63/real_time                     -0.1468         -0.1469          8383          7152          8382          7151
SGEMM/NORMAL_NoTrans/M:1023/N:63/K:63/real_time                    -0.1506         -0.1506         19072         16200         19072         16200
SGEMM/NORMAL_NoTrans/M:63/N:255/K:63/real_time                     -0.1957         -0.1957          7742          6227          7742          6227
SGEMM/NORMAL_NoTrans/M:255/N:255/K:63/real_time                    -0.1032         -0.1032         14323         12845         14322         12845
SGEMM/NORMAL_TransB/M:63/N:63/K:63/real_time                       -0.2221         -0.2221          3356          2611          3356          2610
SGEMM/NORMAL_TransB/M:255/N:63/K:63/real_time                      -0.0439         -0.0438          8602          8224          8601          8224
SGEMM/NORMAL_TransB/M:1023/N:63/K:63/real_time                     +0.0436         +0.0436         16488         17206         16487         17206
SGEMM/NORMAL_TransB/M:63/N:255/K:63/real_time                      -0.2000         -0.1999          8046          6437          8046          6437
SGEMM/NORMAL_TransB/M:255/N:255/K:63/real_time                     -0.0979         -0.0979         14131         12747         14130         12747
SGEMM/NORMAL_TransB/M:1023/N:255/K:63/real_time                    -0.2836         -0.2836         62540         44802         62540         44802
SGEMM/NORMAL_TransB/M:63/N:1023/K:63/real_time                     -0.2183         -0.2183         15342         11993         15342       
```

Some small regressions have been seen but are difficult to explain;
machine variance during the run is suspected to account for results like
```
SGEMM/NORMAL_TransB/M:1023/N:63/K:63/real_time                     +0.0436         +0.0436         16488         17206         16487         17206
```
For example, as part of testing these results, sgemm_kleidiai.cpp was
instrumented (after the previous benchmark results) with timer code in
MlasGemmBatch, MlasGemmPackB, and MlasGemmPackBSize.
This produced the following, indicating that the code performs better
in this case on average than the baseline currently in main:
```
Head of main
Function           Count         Avg (ns)     Avg (pretty)
----------------------------------------------------------
MlasGemmBatch      42664        19601.015     19.601 us
MlasGemmPackB      42664          373.943    373.943 ns
MlasGemmPackBSize  42664           17.179     17.179 ns

TLB changes
Function           Count         Avg (ns)     Avg (pretty)
----------------------------------------------------------
MlasGemmBatch      55492        16985.256     16.985 us
MlasGemmPackB      55492          344.800    344.800 ns
MlasGemmPackBSize  55492           16.788     16.788 ns
```

---------

Signed-off-by: Jonathan Clohessy <Jonathan.Clohessy@arm.com>
…t#26267)

This upgrades CUDA 12.2 + cuDNN 9.5 to CUDA 12.8 + cuDNN 9.8 in CI
pipelines, so that we can build 120-real to support Blackwell GPUs.

To speed up the build, we also disable relocatable-device-code.

MSVC is updated to the latest version for some Windows build pipelines.

#### Known issues

Some ONNX models (YOLOv3, YOLOv4, MobileNet v1) failed to run because the
cuDNN frontend failed to find an engine plan. We will try upgrading the
cuDNN frontend later. The related failing tests are disabled for now.

---------

Co-authored-by: Changming Sun <chasun@microsoft.com>
…oft#26231)

Do this so that MIGraphX can accept fp4 types for input/output tensors
and then use them to perform inference via the MIGraphX API.

### Description
Mirrored changes going into the ROCm 7.1 build. Cherry-picked mainline
OnnxRT changes to get fp4 tensor support before adding this on top.

Moving this to mainline OnnxRT enables the MIGraphX EP to allow fp4
input/output tensors.
ROCm#176

### Motivation and Context
Add fp4 support to MIGraphX EP
…osoft#26264)

### Description
This PR fixes an issue where running

```bash
bash build.sh ...... --parallel 1 ......
```

still triggers a parallel build.

The previous logic only added `-j` when `num_parallel_jobs != 1`, which
caused Ninja/Make/Xcode to use all CPU cores by default.

### Motivation and Context
When building ONNX Runtime, using `--parallel 4` caused an out-of-memory
(OOM) error on my computer. However, changing it to `--parallel 1` still
triggered parallel compilation and caused OOM again.
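A rough sketch of the corrected behavior (illustrative only, not the actual tools/ci_build/build.py code): an explicit job count is always forwarded to the underlying generator, so `--parallel 1` really builds serially.

```python
def build_parallel_args(num_parallel_jobs: int) -> list[str]:
    # 0 (or unset) means "let CMake use all available cores";
    # any explicit value, including 1, is passed through as-is.
    if num_parallel_jobs <= 0:
        return ["--parallel"]
    return ["--parallel", str(num_parallel_jobs)]

print(build_parallel_args(1))  # ['--parallel', '1'] -> serial build
print(build_parallel_args(4))  # ['--parallel', '4']
```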
~~Test rel-1.19.1~~

Bump to ONNX==1.19.1
This pull request introduces support for indirect dispatch in the WebGPU
FlashAttention implementation, enabling more dynamic and efficient
kernel launches based on runtime sequence lengths. The changes add new
logic and parameters to propagate sequence length information and
indirect dispatch buffers through the attention pipeline, with
conditional code paths to maintain compatibility with the existing
direct dispatch approach.

It's part of the work to enable graph capture in phi4
microsoft#25868
…eragePool (microsoft#26162)

### Description
Add support for the QLinearGlobalAveragePool and QLinearAveragePool
operators in the MIGraphX EP.


### Motivation and Context
We want support for these operators through the MIGraphX EP and MIGraphX.
- Allow empty axes input
- When axes is empty and `noop_with_empty_axes` is true, WebNN should set axes to [] (see the numpy sketch below)
- Simplify the code
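For reference, the ONNX reduce semantics this follows can be sketched in numpy, using ReduceSum as the example:

```python
import numpy as np

def reduce_sum(data, axes=None, keepdims=1, noop_with_empty_axes=0):
    # Empty/absent axes + noop_with_empty_axes=1 -> identity (no reduction);
    # empty/absent axes + noop_with_empty_axes=0 -> reduce over all axes.
    if axes is None or len(axes) == 0:
        if noop_with_empty_axes:
            return data
        axes = list(range(data.ndim))
    return np.sum(data, axis=tuple(axes), keepdims=bool(keepdims))

x = np.arange(6, dtype=np.float32).reshape(2, 3)
print(reduce_sum(x, axes=[], noop_with_empty_axes=1).shape)  # (2, 3): unchanged
print(reduce_sum(x, axes=[]).shape)                          # (1, 1): reduce all
```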
…6263)

## Description

Fixes microsoft#26261

This PR resolves a regression introduced in v1.23.0 where models with
Constant nodes containing tensors larger than 127 bytes fail to load
with a shape inference error.

### Root Cause

Commit 3b97d79 (PR microsoft#25320) introduced an optimization to convert
large Constant node tensors (> 127 bytes) into OrtValues with in-memory
external data references for better memory management. However, ONNX
shape inference cannot distinguish between in-memory and file-based
external data, and rejects any TensorProto with `data_location =
EXTERNAL`.
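The fix itself is in C++, but the condition shape inference trips over can be illustrated with the onnx Python helpers ("model.onnx" below is a hypothetical path): any initializer whose `data_location` is `EXTERNAL` is treated as external data, whether the bytes live in a file or in memory.

```python
import onnx
from onnx.external_data_helper import uses_external_data

model = onnx.load("model.onnx")  # hypothetical model path
for init in model.graph.initializer:
    # uses_external_data() simply checks data_location == TensorProto.EXTERNAL;
    # there is no separate marker for "external but resident in memory",
    # which is why shape inference rejects the in-memory case as well.
    if uses_external_data(init):
        print(f"{init.name}: data_location=EXTERNAL (rejected by ONNX shape inference)")
```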

### The Fix

Modified `InferenceContextImpl::getInputData()` to:
1. Detect tensors with in-memory external data using
`utils::HasExternalDataInMemory()`
2. Retrieve the corresponding OrtValue
3. Create a temporary TensorProto with embedded data (not external
reference)
4. Provide this temporary proto to ONNX shape inference

This allows ONNX shape inference to access the actual tensor data
without rejecting it as external.

### Memory Impact

This fix introduces a minor and temporary increase in memory usage
during the model loading phase.

- **When:** The additional memory is allocated only when the shape
inference engine needs to access the data of a constant tensor that is
larger than 127 bytes. This is a one-time event during the initial
analysis of the model.
- **What:** The fix creates a temporary in-memory copy of the tensor
data.
- **Duration:** This temporary copy is released as soon as shape
inference is complete.

The impact on the overall peak memory usage of the application is
expected to be negligible. The memory usage during inference is not
affected. While it is theoretically possible for the temporary tensor to
be large if a multi-gigabyte constant tensor is used for shape
inference, this is a highly unlikely scenario in practice for
well-designed models.

### Testing

- Tested with the problematic model from issue microsoft#26261
- All optimization levels now work correctly (DISABLE_ALL, BASIC,
EXTENDED, ALL)
- Unit tests to be added

### Changes

- **onnxruntime/core/graph/graph.cc**:
  - Modified the `getInputData()` method in the `InferenceContextImpl` class
  - Added a `temp_tensor_protos_` member to store temporary TensorProtos during shape inference

## TODO

- [ ] Add unit tests
- [ ] Run full test suite

---------

Co-authored-by: Dmitri Smirnov <dmitrism@microsoft.com>
### Description
Fix a bug in the TRT Execution Provider where the DDS output tensor was
not bound after an engine update.


### Motivation and Context
The `dds_output_allocator_map` is not cleared on engine update, so the
output is mis-recognized as a known DDS output and the output
allocation is not bound.

Script to reproduce the issue:
```python
# create an onnx model with:
# inputs: data -> NonZeros(data) -> GatherND -> output
# then run the model with onnxruntime

def create_model():
    import onnx
    from onnx import helper, TensorProto

    input = helper.make_tensor_value_info("data", TensorProto.FLOAT, ["d1", "d2"])
    output = helper.make_tensor_value_info("output", TensorProto.FLOAT, ["nzr"])

    nonzeros_node = helper.make_node("NonZero", ["data"], ["nonzeros"], "nonzeros_node")
    transpose_node = helper.make_node(
        "Transpose", ["nonzeros"], ["nonzeros_t"], "transpose_node"
    )
    gathernd_node = helper.make_node(
        "GatherND", ["data", "nonzeros_t"], ["output"], "gathernd_node"
    )

    value_info = [
        helper.make_tensor_value_info("nonzeros", TensorProto.INT64, [2, "nzr"]),
        helper.make_tensor_value_info("nonzeros_t", TensorProto.INT64, ["nzr", 2]),
    ]

    graph = helper.make_graph(
        [nonzeros_node, transpose_node, gathernd_node],
        "test_graph",
        [input],
        [output],
        value_info=value_info,
    )

    model = helper.make_model(graph)
    onnx.save(model, "model_dds.onnx")


def run_model():
    import onnxruntime as ort
    import numpy as np

    sess = ort.InferenceSession("model_dds.onnx", providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"])

    print("Running with data shape (3,4)")
    data = np.random.randn(3, 4).astype(np.float32)
    sess.run(None, {"data": data})

    print("Running with data shape (5,6)")
    data = np.random.randn(5, 6).astype(np.float32)
    sess.run(None, {"data": data})


create_model()
run_model()
```

Before the change:
> IExecutionContext::enqueueV3: Error Code 3: API Usage Error (Parameter
check failed, condition:
mContext.profileObliviousBindings.at(profileObliviousIndex) ||
getPtrOrNull(mOutputAllocators, profileObliviousIndex). Neither address
or allocator is set for output tensor scores. Call
setOutputTensorAddress, setTensorAddress or setOutputAllocator before
enqueue/execute.) ... Status Message: TensorRT EP execution context
enqueue failed.
This pull request extends the WebGPU execution provider to support int64
data type casting in the `Cast` operator, with conditional support based
on whether graph capture is enabled. It refactors kernel registration to
allow toggling int64 support and updates the shader code and kernel
logic to handle int64 tensors efficiently.

It's part of the work to enable graph capture in phi4
microsoft#25868
…oft#26315)

To fix the build pipeline error `ModuleNotFoundError: No module named
'onnxscript._framework_apis.torch_2_9'` after the recent torch 2.9 release.

This locks the torch version to 2.8, and also updates onnxscript and onnx-ir
to the latest versions.

I locked the torchvision version since it is usually installed together
with torch. If torch and torchvision are not compatible, there might be
errors in the transformers scripts.
@Jaswanth51 Jaswanth51 requested a review from ankitm3k October 16, 2025 03:42
@ankitm3k ankitm3k merged commit f7483e6 into ovep-develop Oct 16, 2025
6 of 8 checks passed
@ankitm3k ankitm3k deleted the sync_msft_16102025 branch October 16, 2025 05:26