Sync with Microsoft ONNX Runtime - 16/10/2025 #833
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Conversation
Fixes GatherND on the WebGPU EP (found by transformers.js for the vision encoder of Docling).
### Description
- Added support for the `--cmake_deps_mirror_dir` option to allow users to specify a custom local directory for CMake dependencies.
- Improved logging to show the source of `FetchContent` in CMake.

### Motivation and Context
- Previously, ONNX Runtime searched for CMake dependencies only in the default `<repo_root>/mirror` directory.
- This change enables users to configure an alternative location for storing CMake dependencies, offering greater flexibility in build environments.
The WebNN implementation of Gemm's C operand now supports unidirectional broadcasting, in line with the ONNX spec. This removes the constraints on Gemm's C input, which are covered by the ORT kernel.
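For context, ONNX Gemm computes `Y = alpha * A' * B' + beta * C`, where C may be unidirectionally broadcast to the shape of `A' * B'`. A minimal NumPy sketch of that semantics (shapes and values chosen only for illustration):

```python
import numpy as np

M, K, N = 2, 3, 4
alpha, beta = 1.0, 1.0
A = np.random.randn(M, K).astype(np.float32)
B = np.random.randn(K, N).astype(np.float32)
C = np.random.randn(N).astype(np.float32)  # shape (N,), unidirectionally broadcast to (M, N)

# ONNX Gemm semantics with transA = transB = 0:
Y = alpha * (A @ B) + beta * C
print(Y.shape)  # (2, 4)
```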
### Description
The argument order of `np.testing` was incorrect.

### Motivation and Context
Before this change, the expected result and the actual result were reversed.

<img width="1285" height="697" alt="image" src="https://github.com/user-attachments/assets/0a464008-9704-46f3-a04d-912ba5b41892" />
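For reference, `np.testing.assert_allclose` takes the actual result first and the desired (expected) result second; swapping them makes failure messages mislabel the two. A minimal illustration (array values are made up):

```python
import numpy as np

actual = np.array([1.0, 2.0, 3.001])   # e.g. output produced by the kernel under test
expected = np.array([1.0, 2.0, 3.0])   # reference value

# Correct order: actual first, desired second, so a failure report
# labels "ACTUAL" and "DESIRED" correctly.
np.testing.assert_allclose(actual, expected, rtol=1e-3)
```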
### Description
An internal user reported that SparseAttention has a memory issue similar to microsoft#22290, so we follow that PR and make the same change.

### Motivation and Context
SparseAttention memory issue.
Add Windows Server to the supported list to avoid confusing users (see the check sketched below):

| Marketing Name | Internal Version | platform.release().lower() | Release Year | Based on |
| -- | -- | -- | -- | -- |
| Windows Server 2025 | 10.0.26100+ | "2025server" | 2024–2025 | Windows 11 (24H2) |
| Windows Server 2022 | 10.0.20348 | "2022server" | 2021 | Windows 10 (21H2) |
| Windows Server 2019 | 10.0.17763 | "2019server" | 2018 | Windows 10 (1809) |
| Windows Server 2016 | 10.0.14393 | "2016server" | 2016 | Windows 10 (1607) |
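A minimal sketch of how the `platform.release().lower()` values from the table above could be matched against a supported list. The helper name is hypothetical, not ONNX Runtime's actual check, and the client values `"10"`/`"11"` are assumptions for illustration (what `platform.release()` reports can vary with the Python/OS combination):

```python
import platform

# Server values are taken from the table above; client values are illustrative.
SUPPORTED_WINDOWS_RELEASES = {
    "10", "11",
    "2016server", "2019server", "2022server", "2025server",
}

def is_supported_windows_release() -> bool:
    # Hypothetical helper for illustration only.
    return platform.release().lower() in SUPPORTED_WINDOWS_RELEASES

print(is_supported_windows_release())
```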
…rosoft#26166)

### **Key changes**
This PR changes the KleidiAI integration within the existing sgemm_kleidiai.cpp implementation. During internal testing it was noted that memory allocation overhead from repeatedly allocating vectors was having a negative impact on performance figures. The changes introduce thread-local buffers that reuse memory during inference (the reuse pattern is sketched below). Android platforms are particularly sensitive to this; we have observed inference times being significantly impacted by memory allocation overhead.

### Example performance
All runs were captured using onnxruntime_perf_test, e.g. `onnxruntime_perf_test -v -e cpu -I -m times -x 1 -y 1 -r 1000`

**Android Platform**

<img width="996" height="286" alt="image" src="https://github.com/user-attachments/assets/252165af-c864-4b24-b1f2-c28ada208b06" />

In addition, on M4 we have also observed slight improvements on models; however, the gain is not as significant, since allocation overhead accounts for a smaller share of total time on that platform.

**Mac Mini M4**

<img width="741" height="153" alt="image" src="https://github.com/user-attachments/assets/93e6c545-96fd-4bfc-b90f-3a845a1551bc" />

**Onnxruntime Mlas Benchmark**

The MLAS benchmark was executed on a Mac Mini M4 with SME2 instructions. The code was tested with and without the changes in this PR and the following results were observed (subset shown); the comparison was generated using compare.py located in the google benchmark repo tools.

`./onnxruntime_mlas_benchmark --benchmark_filter="SGEMM/NORMAL*" --benchmark_repetitions=100`

```
Benchmark                                           Time      CPU   Time Old   Time New   CPU Old   CPU New
------------------------------------------------------------------------------------------------------------
SGEMM/NORMAL_NoTrans/M:63/N:63/K:63/real_time    -0.1897  -0.1897       3270       2650      3270      2650
SGEMM/NORMAL_NoTrans/M:255/N:63/K:63/real_time   -0.1468  -0.1469       8383       7152      8382      7151
SGEMM/NORMAL_NoTrans/M:1023/N:63/K:63/real_time  -0.1506  -0.1506      19072      16200     19072     16200
SGEMM/NORMAL_NoTrans/M:63/N:255/K:63/real_time   -0.1957  -0.1957       7742       6227      7742      6227
SGEMM/NORMAL_NoTrans/M:255/N:255/K:63/real_time  -0.1032  -0.1032      14323      12845     14322     12845
SGEMM/NORMAL_TransB/M:63/N:63/K:63/real_time     -0.2221  -0.2221       3356       2611      3356      2610
SGEMM/NORMAL_TransB/M:255/N:63/K:63/real_time    -0.0439  -0.0438       8602       8224      8601      8224
SGEMM/NORMAL_TransB/M:1023/N:63/K:63/real_time   +0.0436  +0.0436      16488      17206     16487     17206
SGEMM/NORMAL_TransB/M:63/N:255/K:63/real_time    -0.2000  -0.1999       8046       6437      8046      6437
SGEMM/NORMAL_TransB/M:255/N:255/K:63/real_time   -0.0979  -0.0979      14131      12747     14130     12747
SGEMM/NORMAL_TransB/M:1023/N:255/K:63/real_time  -0.2836  -0.2836      62540      44802     62540     44802
SGEMM/NORMAL_TransB/M:63/N:1023/K:63/real_time   -0.2183  -0.2183      15342      11993     15342
```

Some small regressions have been seen but are difficult to explain; machine variance during the run could account for entries like

```
SGEMM/NORMAL_TransB/M:1023/N:63/K:63/real_time   +0.0436  +0.0436      16488      17206     16487     17206
```

For example, as part of testing these results, sgemm_kleidi.cpp was instrumented (after the previous benchmark results) with timer code in MlasGemmBatch, MlasGemmPackB, and MlasGemmPackBSize.
This produced the following, indicating that the code performs better in this case on average than the baseline currently in main:

```
Head of main
Function            Count    Avg (ns)     Avg (pretty)
----------------------------------------------------------
MlasGemmBatch       42664    19601.015    19.601 us
MlasGemmPackB       42664    373.943      373.943 ns
MlasGemmPackBSize   42664    17.179       17.179 ns

TLB changes
Function            Count    Avg (ns)     Avg (pretty)
----------------------------------------------------------
MlasGemmBatch       55492    16985.256    16.985 us
MlasGemmPackB       55492    344.800      344.800 ns
MlasGemmPackBSize   55492    16.788       16.788 ns
```

---------

Signed-off-by: Jonathan Clohessy <Jonathan.Clohessy@arm.com>
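The actual change is in C++ (sgemm_kleidiai.cpp); the following Python sketch only illustrates the thread-local buffer-reuse pattern described above, with hypothetical names, and is not ONNX Runtime code:

```python
import threading
import numpy as np

_tls = threading.local()  # per-thread scratch storage, reused across calls

def _get_scratch(size: int) -> np.ndarray:
    # Reuse a per-thread buffer and only grow it when a larger size is needed,
    # avoiding a fresh allocation on every call (the pattern the PR applies to
    # the packing buffers, shown here in Python for illustration).
    buf = getattr(_tls, "scratch", None)
    if buf is None or buf.size < size:
        buf = np.empty(size, dtype=np.float32)
        _tls.scratch = buf
    return buf[:size]

# Repeated calls on the same thread reuse the same backing allocation.
a = _get_scratch(1024)
b = _get_scratch(512)      # no new allocation: 512 <= 1024
assert b.base is a.base    # both are views of the same reused buffer
```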
…t#26267)

This upgrades CUDA 12.2 + cuDNN 9.5 to CUDA 12.8 + cuDNN 9.8 in CI pipelines, so that we can build with 120-real to support Blackwell GPUs. To speed up the build, we also disable relocatable device code. MSVC is updated to the latest version for some Windows build pipelines.

#### Known issues
Some ONNX models (YOLO v3, YOLO v4, MobileNet v1) fail to run because the cuDNN frontend fails to find an engine plan. We will try upgrading the cuDNN frontend later. The related failing tests are disabled for now.

---------

Co-authored-by: Changming Sun <chasun@microsoft.com>
…oft#26231)

This allows MIGraphX to take fp4 types from input/output tensors and use them to perform inference via the MIGraphX API.

### Description
Mirrored changes going into the ROCm 7.1 build. Cherry-picked mainline ONNX Runtime changes to get fp4 tensor support before adding this on top. Moving this to mainline ONNX Runtime enables the MIGraphX EP to allow fp4 input/output tensors (ROCm#176).

### Motivation and Context
Add fp4 support to the MIGraphX EP.
…osoft#26264)

### Description
This PR fixes an issue where running

```bash
bash build.sh ...... --parallel 1 ......
```

still triggers a parallel build. The previous logic only added `-j` when `num_parallel_jobs != 1`, which caused Ninja/Make/Xcode to use all CPU cores by default (see the sketch below).

### Motivation and Context
When building ONNX Runtime, using `--parallel 4` caused an out-of-memory (OOM) error on my machine. However, changing it to `--parallel 1` still triggered a parallel compilation and caused OOM again.
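A minimal sketch of the kind of logic change described above: always forward the requested job count to the underlying build tool instead of omitting it when it equals 1, so the generator does not fall back to its own (parallel) default. The helper name and argument layout are hypothetical, not the actual build.py code:

```python
def make_build_args(num_parallel_jobs: int) -> list[str]:
    # Hypothetical helper for illustration only.
    args = ["cmake", "--build", "build/Release"]
    if num_parallel_jobs > 0:
        # Previously the job count was only passed when num_parallel_jobs != 1,
        # so "--parallel 1" let Ninja/Make pick their own parallel defaults.
        args += ["--parallel", str(num_parallel_jobs)]
    return args

print(make_build_args(1))  # ['cmake', '--build', 'build/Release', '--parallel', '1']
```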
~~Test rel-1.19.1~~ Bump to ONNX==1.19.1
This pull request introduces support for indirect dispatch in the WebGPU FlashAttention implementation, enabling more dynamic and efficient kernel launches based on runtime sequence lengths. The changes add new logic and parameters to propagate sequence length information and indirect dispatch buffers through the attention pipeline, with conditional code paths to maintain compatibility with the existing direct dispatch approach. It's part of the work to enable graph capture in phi4 microsoft#25868
…eragePool (microsoft#26162)

### Description
Add support for the QLinearGlobalAveragePool and QLinearAveragePool operators in the MIGraphX EP.

### Motivation and Context
We want support for these operators through the MIGraphX EP and MIGraphX.
- Allow empty `axes` input.
- When `axes` is empty and `noop_with_empty_axes` is true, WebNN should set `axes` to `[]` (see the sketch below).
- Simplify the code.
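For context, per the ONNX spec a Reduce op with an empty `axes` input and `noop_with_empty_axes = 1` is a no-op that returns the input unchanged. A small illustration with ReduceSum (opset and values chosen only for illustration):

```python
import numpy as np
import onnx
from onnx import helper, TensorProto
import onnxruntime as ort

# ReduceSum with an explicit empty `axes` input and noop_with_empty_axes=1:
# per the ONNX spec this should return the input unchanged.
axes = helper.make_tensor("axes", TensorProto.INT64, [0], [])
node = helper.make_node("ReduceSum", ["x", "axes"], ["y"], noop_with_empty_axes=1)
graph = helper.make_graph(
    [node], "reduce_noop",
    [helper.make_tensor_value_info("x", TensorProto.FLOAT, [2, 3])],
    [helper.make_tensor_value_info("y", TensorProto.FLOAT, [2, 3])],
    initializer=[axes],
)
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 18)])

sess = ort.InferenceSession(model.SerializeToString(), providers=["CPUExecutionProvider"])
x = np.arange(6, dtype=np.float32).reshape(2, 3)
(y,) = sess.run(None, {"x": x})
np.testing.assert_allclose(y, x)  # identity when axes is empty and noop_with_empty_axes=1
```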
…6263)

## Description
Fixes microsoft#26261

This PR resolves a regression introduced in v1.23.0 where models with Constant nodes containing tensors larger than 127 bytes fail to load with a shape inference error.

### Root Cause
Commit 3b97d79 (PR microsoft#25320) introduced an optimization to convert large Constant node tensors (> 127 bytes) into OrtValues with in-memory external data references for better memory management. However, ONNX shape inference cannot distinguish between in-memory and file-based external data, and rejects any TensorProto with `data_location = EXTERNAL`.

### The Fix
Modified `InferenceContextImpl::getInputData()` to:
1. Detect tensors with in-memory external data using `utils::HasExternalDataInMemory()`
2. Retrieve the corresponding OrtValue
3. Create a temporary TensorProto with embedded data (not an external reference)
4. Provide this temporary proto to ONNX shape inference

This allows ONNX shape inference to access the actual tensor data without rejecting it as external.

### Memory Impact
This fix introduces a minor and temporary increase in memory usage during the model loading phase.

- **When:** The additional memory is allocated only when the shape inference engine needs to access the data of a constant tensor that is larger than 127 bytes. This is a one-time event during the initial analysis of the model.
- **What:** The fix creates a temporary in-memory copy of the tensor data.
- **Duration:** This temporary copy is released as soon as shape inference is complete.

The impact on the overall peak memory usage of the application is expected to be negligible. Memory usage during inference is not affected. While it is theoretically possible for the temporary tensor to be large if a multi-gigabyte constant tensor is used for shape inference, this is a highly unlikely scenario in practice for well-designed models.

### Testing
- Tested with the problematic model from issue microsoft#26261
- All optimization levels now work correctly (DISABLE_ALL, BASIC, EXTENDED, ALL)
- Unit tests to be added

### Changes
- **onnxruntime/core/graph/graph.cc**:
  - Modified the `getInputData()` method in the `InferenceContextImpl` class
  - Added a `temp_tensor_protos_` member to store temporary TensorProtos during shape inference

## TODO
- [ ] Add unit tests
- [ ] Run full test suite

---------

Co-authored-by: Dmitri Smirnov <dmitrism@microsoft.com>
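For the Constant-node regression described above, a minimal hypothetical sketch (not the model from microsoft#26261) of the kind of graph affected: a Constant node carrying a tensor larger than 127 bytes, which v1.23.0 converts to an in-memory external-data reference during loading. Whether the shape-inference rejection is actually hit can depend on graph structure and optimization level; with the fix, such models load normally.

```python
import numpy as np
import onnx
from onnx import helper, TensorProto, numpy_helper
import onnxruntime as ort

const_value = numpy_helper.from_array(
    np.arange(64, dtype=np.float32), name="const_value"  # 256 bytes > 127-byte threshold
)
const_node = helper.make_node("Constant", [], ["c"], value=const_value)
add_node = helper.make_node("Add", ["x", "c"], ["y"])
graph = helper.make_graph(
    [const_node, add_node], "large_constant",
    [helper.make_tensor_value_info("x", TensorProto.FLOAT, [64])],
    [helper.make_tensor_value_info("y", TensorProto.FLOAT, [64])],
)
onnx.save(helper.make_model(graph), "large_constant.onnx")

sess = ort.InferenceSession("large_constant.onnx", providers=["CPUExecutionProvider"])
print(sess.run(None, {"x": np.ones(64, dtype=np.float32)})[0][:4])
```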
### Description
Fix a bug in the TRT Execution Provider where the DDS output tensor was
not bound after an engine update.
### Motivation and Context
The `dds_output_allocator_map` is not cleared on engine update, so the output is mis-recognized as a known DDS output and the output allocation is not bound.
Script to reproduce the issue:
```python
# Create an ONNX model with:
#   inputs: data -> NonZero(data) -> Transpose -> GatherND -> output
# then run the model with onnxruntime.
def create_model():
    import onnx
    from onnx import helper, TensorProto

    input = helper.make_tensor_value_info("data", TensorProto.FLOAT, ["d1", "d2"])
    output = helper.make_tensor_value_info("output", TensorProto.FLOAT, ["nzr"])
    nonzeros_node = helper.make_node("NonZero", ["data"], ["nonzeros"], "nonzeros_node")
    transpose_node = helper.make_node(
        "Transpose", ["nonzeros"], ["nonzeros_t"], "transpose_node"
    )
    gathernd_node = helper.make_node(
        "GatherND", ["data", "nonzeros_t"], ["output"], "gathernd_node"
    )
    value_info = [
        helper.make_tensor_value_info("nonzeros", TensorProto.INT64, [2, "nzr"]),
        helper.make_tensor_value_info("nonzeros_t", TensorProto.INT64, ["nzr", 2]),
    ]
    graph = helper.make_graph(
        [nonzeros_node, transpose_node, gathernd_node],
        "test_graph",
        [input],
        [output],
        value_info=value_info,
    )
    model = helper.make_model(graph)
    onnx.save(model, "model_dds.onnx")


def run_model():
    import onnxruntime as ort
    import numpy as np

    sess = ort.InferenceSession(
        "model_dds.onnx",
        providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
    )
    print("Running with data shape (3,4)")
    data = np.random.randn(3, 4).astype(np.float32)
    sess.run(None, {"data": data})
    print("Running with data shape (5,6)")
    data = np.random.randn(5, 6).astype(np.float32)
    sess.run(None, {"data": data})


create_model()
run_model()
```
Before the change:
> IExecutionContext::enqueueV3: Error Code 3: API Usage Error (Parameter
check failed, condition:
mContext.profileObliviousBindings.at(profileObliviousIndex) ||
getPtrOrNull(mOutputAllocators, profileObliviousIndex). Neither address
or allocator is set for output tensor scores. Call
setOutputTensorAddress, setTensorAddress or setOutputAllocator before
enqueue/execute.) ... Status Message: TensorRT EP execution context
enqueue failed.
This pull request extends the WebGPU execution provider to support int64 data type casting in the `Cast` operator, with conditional support based on whether graph capture is enabled. It refactors kernel registration to allow toggling int64 support and updates the shader code and kernel logic to handle int64 tensors efficiently. It's part of the work to enable graph capture in phi4 microsoft#25868
…oft#26315)

To fix the build pipeline error `ModuleNotFoundError: No module named 'onnxscript._framework_apis.torch_2_9'` after the recent torch 2.9 release. This pins the torch version to 2.8 and also updates onnxscript and onnx-ir to their latest versions. The torchvision version is pinned as well, since it is usually installed together with torch; if torch and torchvision are not compatible, there might be errors in the transformers scripts.
ankitm3k approved these changes on Oct 16, 2025.
Description
Synchronizing the intel/onnxruntime ovep-develop branch with the latest changes from the microsoft/onnxruntime master branch.