forked from microsoft/onnxruntime
Sync with Microsoft ONNX Runtime - 03/11/2025 #843
Merged
Conversation
### Description
Fix a build error in the tvOS build.

### Motivation and Context
Currently the build fails:
```
CMake Error at /Users/runner/work/onnxruntime-custom/onnxruntime-custom/tvOS/RelWithDebInfo/_deps/flatbuffers-src/CMakeLists.txt:636 (install):
  install TARGETS given no BUNDLE DESTINATION for MACOSX_BUNDLE executable target "flatc".
-- CMAKE_CXX_FLAGS: -DNDEBUG -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fstack-protector-strong -O3 -pipe -g -ffunction-sections -fdata-sections -fvisibility=hidden -fvisibility-inlines-hidden -DCPUINFO_SUPPORTED
```
… EP python wheel (microsoft#26115)

### Description
- CUDA Runtime added as a dependency of the NV EP python wheel.
- TRT RTX DLLs and license copied into the wheel, as public wheels for TRT RTX are currently unavailable.
- Use `onnxruntime.preload_dlls()` to load the CUDA Runtime DLL.
- The python package name for the NV EP is `onnxruntime-trt-rtx`.

### Motivation and Context
Enables out-of-the-box usage of the NV TensorRT RTX EP python wheel. @gaugarg-nv @thevishalagarwal @gedoensmax @ishwar-raut1
### Description
Picked other models to replace those that were removed.

### Motivation and Context
Measure
Upgrade cpuinfo to a newer version to keep it up to date.
### Description
Reduce the time blocked waiting for shaders to be compiled.

### Motivation and Context
Optimize the responsiveness of the application when running ort-web in the main thread. See microsoft#25882.
…soft#26400)

### Description
This PR fuses GeneratePositionIDs into FusedQKRotaryEmbedding, which eliminates one kernel call.

### Motivation and Context
Previously, for GQA, the processing flow was:
`SplitPackedQKVProgram -> GeneratePositionIDs -> FusedQKRotaryEmbedding -> FlashAttention`
After this change, the pipeline becomes:
`SplitPackedQKVProgram -> FusedQKRotaryEmbedding -> FlashAttention`
On an NV 5080, token generation speed improved ~4% (128 tps -> 133 tps).
…er support (microsoft#26394)

### Description
Currently the files needed for wasm aren't exported from the package at all. The files in question are:
```
ort-wasm-simd-threaded.wasm
ort-wasm-simd-threaded.jsep.wasm
ort-wasm-simd-threaded.asyncify.wasm
ort-wasm-simd-threaded.mjs
ort-wasm-simd-threaded.jsep.mjs
ort-wasm-simd-threaded.asyncify.mjs
```
This PR changes that and adds them to the `exports` field in the `package.json`.

### Motivation and Context
Bundlers like `webpack` use the `copyPlugin` to move those files into the `public` directory so the files can be accessed at a stable URL. However, more advanced, "state of the art" bundlers like `vite` are able to [import asset URLs directly](https://vite.dev/guide/assets.html#explicit-url-imports). Vite takes the asset and moves it to the public assets folder (possibly renaming the asset, adding a hash, etc.). The imported value is then the bundled asset's final URL. Those URLs can be used directly in `env.wasm.wasmPaths`. In Vite's case, the full code example is:
```js
import wasmUrl from 'onnxruntime-web/ort-wasm-simd-threaded.wasm?url';
import mjsUrl from 'onnxruntime-web/ort-wasm-simd-threaded.mjs?url';

env.wasm.wasmPaths = {
  wasm: wasmUrl,
  mjs: mjsUrl,
};
```
With these added exports we can leverage more of the bundler's capabilities, and in Vite's case there is no need for any additional config. It just works. When importing we also get proper suggestions:
<img width="1604" height="498" alt="imports" src="https://github.com/user-attachments/assets/2678ccc2-ae46-4289-aa6e-607ecbc5388b" />

----

I would like additional tests to ensure that the exports are available, but I couldn't make the `e2e` tests work on my system. I would appreciate some guidance on that topic.
### Description
- Updates the `ORT_API_VERSION` value in onnxruntime_c_api.h to `24`.
- Edits the documentation for the `TensorTypeAndShape_HasShape` API function to indicate the correct API version (1.24).
- Adds a `static_assert` to ensure that API functions for 1.23 are not added or removed (see the sketch below).

### Motivation and Context
The version of ORT was previously updated to 1.24.0, but we forgot to update `ORT_API_VERSION`.
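The PR's actual `static_assert` isn't shown here; as a rough illustration of the idea (the struct, names, and layout below are hypothetical, not ORT's real API table), an assert on a member offset can freeze a released API surface:

```c++
// Hypothetical sketch, not ORT's actual code: pin the 1.23 portion of an
// API function table so inserting or removing a member breaks the build.
#include <cstddef>

struct ApiTable {
  void (*FuncA)();                        // released in 1.23
  void (*FuncB)();                        // released in 1.23 (last 1.23 entry)
  void (*TensorTypeAndShape_HasShape)();  // new in 1.24: append-only
};

// If anyone adds, removes, or reorders a 1.23 function, this offset changes
// and compilation fails with a clear message.
static_assert(offsetof(ApiTable, TensorTypeAndShape_HasShape) == 2 * sizeof(void*),
              "1.23 API functions must not be added or removed");
```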
…rosoft#26420) Add nightly GPU pipelines for CUDA 13.
### Description
[js] Upgrade ESLint from v8 to v9 for the /js/ subfolder.
Previously, `model_proto` was passed by name, which triggered a copy-constructor call instead of move construction. Using `std::move(model_proto)` ensures that the object is constructed via move semantics, reducing unnecessary memory allocation and copy overhead.

Co-authored-by: liumingyue <mingyue@xilinx.com>
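As a minimal, self-contained sketch of the pattern (with a stand-in `ModelProto` type, since the real one is a protobuf message):

```c++
#include <string>
#include <utility>
#include <vector>

struct ModelProto {
  std::vector<std::string> big_payload;  // stand-in for a large protobuf message
};

class Model {
 public:
  // Taking the parameter by value and moving it into the member lets the
  // caller decide: pass an lvalue to copy, or an rvalue to move.
  explicit Model(ModelProto model_proto) : model_proto_(std::move(model_proto)) {}

 private:
  ModelProto model_proto_;
};

int main() {
  ModelProto model_proto;
  model_proto.big_payload.assign(1000, std::string(1024, 'x'));

  // Before: Model model(model_proto);   // copies the whole payload
  Model model(std::move(model_proto));   // moves; no payload copy
}
```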
### Description
Change the Xcode version from 16.4 to 16.2, according to what's available on the build agent image. Adjust casing in the `xcodebuild -destination` argument.

### Motivation and Context
Fix iOS packaging pipeline issues.
…26422)

### Description
Use the MayInplace hint for the FusedConv NCHW CPU kernel, which allows the allocation planner to reuse the optional "sum" input buffer as the kernel's output buffer. It potentially saves [this copy](https://github.com/microsoft/onnxruntime/blob/06004826cc99dd8b8b92dbf000db3d3525716f22/onnxruntime/core/providers/cpu/nn/conv.cc#L209) when the buffers can be shared (which is the case most of the time).

### Motivation and Context
The kernel already had logic to save the copy, but the kernel def was missing the hint to the allocation planner. Found while investigating improvements to a Conv model's performance on ARM64.
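The real change adds a `MayInplace(input_index, output_index)`-style entry to the kernel definition; as a self-contained sketch of what that hint buys (the "sum" input index and the epilogue shape here are assumptions for illustration, not the actual FusedConv code):

```c++
// When the planner aliases the "sum" input Z to the output Y, the kernel can
// accumulate into Z's buffer directly instead of computing Y separately and
// then copying Z's contribution into a fresh output buffer.
#include <cstddef>
#include <vector>

void fused_conv_epilogue(const std::vector<float>& conv_out,
                         std::vector<float>& z_reused_as_output) {
  // Without the hint: y = conv_out + z, then copy y into the output buffer.
  // With the hint: z's buffer *is* the output; add conv_out in place.
  for (std::size_t i = 0; i < z_reused_as_output.size(); ++i) {
    z_reused_as_output[i] += conv_out[i];
  }
}
```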
### Description
Add unidirectional support for MHA. Also updated `GetCapability` to make sure Attention only runs on WebGPU under certain conditions.
This pull request enables conditionally registering GQA with total_sequence_length on GPU. It resolves the issue that a MemcpyToHost node is generated when graph capture is enabled (refer to microsoft#25868). This is the last piece of functionality needed to support graph capture in the WebGPU EP in ORT. The main changes ensure that when graph capture is enabled, sequence length information is read from GPU buffers instead of CPU memory, and shader code generation adapts accordingly. This enables more efficient execution and compatibility with graph-captured models. In this PR, we still get the total sequence length from the `seqlen_k` tensor rather than the `total_seqlen_tensor` tensor, to stay consistent with other parts of the code. In a follow-up PR, we can refactor all places to use `total_seqlen_tensor` directly instead of `seqlen_k` when graph capture is enabled.
ONNX's ScatterND and ScatterElements limit their indices input to int64, but some WebNN backends only support int32 indices. As a workaround for such backends, we can insert a Cast operation to convert the data type.
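As a minimal sketch of the narrowing such an inserted Cast performs (a plain standalone function, not the WebNN EP's actual implementation), int64 ONNX indices are converted to the int32 the backend supports, which is safe whenever every index fits in 32 bits:

```c++
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <vector>

std::vector<int32_t> cast_indices_to_int32(const std::vector<int64_t>& indices) {
  std::vector<int32_t> out;
  out.reserve(indices.size());
  for (int64_t v : indices) {
    // Guard the narrowing: valid whenever the indexed dimension is < 2^31.
    if (v > std::numeric_limits<int32_t>::max() ||
        v < std::numeric_limits<int32_t>::min()) {
      throw std::range_error("index does not fit in int32");
    }
    out.push_back(static_cast<int32_t>(v));
  }
  return out;
}
```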
### Description
We don't want to adjust the dispatch size when running `Conv1d`.
…#26234)

### Description
Adds APIs to allow a plugin EP to create a virtual `OrtHardwareDevice` that can be used for model cross-compilation. For example, this allows an EP to create a compiled model for NPU on a device that does not have an NPU.

#### Application code
An application must explicitly allow registered plugin EPs to create virtual devices. This is currently done by using a registration name that ends in the `".virtual"` suffix. Ex:
```c++
#include <algorithm>
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

#include "onnxruntime_cxx_api.h"
#include "onnxruntime_ep_device_ep_metadata_keys.h"

const char* ep_registration_name = "my_ep_lib.virtual";  // IMPORTANT: ".virtual" suffix is a signal to EP library
ort_env->RegisterExecutionProviderLibrary(ep_registration_name, "my_ep.dll");

std::vector<Ort::ConstEpDevice> ep_devices = ort_env->GetEpDevices();

// ep_devices includes an OrtEpDevice from "my_ep.dll" that uses a virtual OrtHardwareDevice.
auto it = std::find_if(ep_devices.begin(), ep_devices.end(),
                       [](const Ort::ConstEpDevice& device) {
                         return device.EpName() == std::string("MyEpName");
                       });
Ort::ConstEpDevice virtual_ep_device = *it;  // assumes the device was found

// App can look in HW metadata to check if it is virtual
Ort::ConstHardwareDevice virtual_hw_device = virtual_ep_device.Device();
std::unordered_map<std::string, std::string> metadata = virtual_hw_device.Metadata().GetKeyValuePairs();
assert(metadata[kOrtHardwareDevice_MetadataKey_IsVirtual] == "1");

// App can use the virtual OrtEpDevice in a session to, for example, compile a model
// ...
```

#### Plugin EP code
This PR introduces a new _optional_ C API function in the `OrtEpFactory` struct called `SetEnvironmentOptions` that allows ORT to pass options (as key/value pairs) to an EP factory. Currently, the only key supported is `"allow_virtual_devices"`, which indicates to the EP factory that creating virtual devices is allowed.

When the application registers a plugin EP library, ORT creates the library's EP factories and checks if they implement the `SetEnvironmentOptions` API function. If so, ORT calls `ep_factory.SetEnvironmentOptions` with `"allow_virtual_devices"` set to `"1"` if the EP registration name set by the application ends in the `".virtual"` suffix (or `"0"` otherwise).

Here's an example implementation of `OrtEpFactory::SetEnvironmentOptions` taken from a [test plugin EP that supports a virtual GPU](https://github.com/microsoft/onnxruntime/tree/adrianl/plugin-ep-specify-ort-hw-device/onnxruntime/test/autoep/library/example_plugin_ep_virt_gpu):
```c++
/*static*/
OrtStatus* ORT_API_CALL EpFactoryVirtualGpu::SetEnvironmentOptionsImpl(OrtEpFactory* this_ptr,
                                                                       const OrtKeyValuePairs* options) noexcept {
  auto* factory = static_cast<EpFactoryVirtualGpu*>(this_ptr);

  const char* value = factory->ort_api_.GetKeyValue(options, "allow_virtual_devices");
  if (value != nullptr) {
    factory->allow_virtual_devices_ = strcmp(value, "1") == 0;
  }

  return nullptr;
}
```
An EP factory can create a virtual hardware device within `OrtEpFactory::GetSupportedDevices` by using a new API function called `CreateHardwareDevice`. The EP factory is expected to own the hardware device instance, which should be released when the factory is destroyed via `ReleaseHardwareDevice`. The [test plugin EP shows an implementation](https://github.com/microsoft/onnxruntime/blob/d87f8b86406525f5801a7a9933b1ced1eb40940c/onnxruntime/test/autoep/library/example_plugin_ep_virt_gpu/ep_factory.cc#L86) of `OrtEpFactory::GetSupportedDevices` that creates a virtual GPU device.
```c++
/*static*/
OrtStatus* ORT_API_CALL EpFactoryVirtualGpu::GetSupportedDevicesImpl(OrtEpFactory* this_ptr,
                                                                     const OrtHardwareDevice* const* /*devices*/,
                                                                     size_t /*num_devices*/,
                                                                     OrtEpDevice** ep_devices,
                                                                     size_t max_ep_devices,
                                                                     size_t* p_num_ep_devices) noexcept {
  size_t& num_ep_devices = *p_num_ep_devices;
  auto* factory = static_cast<EpFactoryVirtualGpu*>(this_ptr);

  num_ep_devices = 0;

  // Create a virtual OrtHardwareDevice if the application indicated it is allowed (e.g., for cross-compiling).
  // This example EP creates a virtual GPU OrtHardwareDevice and adds a new OrtEpDevice that uses the virtual GPU.
  if (factory->allow_virtual_devices_ && num_ep_devices < max_ep_devices) {
    OrtKeyValuePairs* hw_metadata = nullptr;
    factory->ort_api_.CreateKeyValuePairs(&hw_metadata);
    factory->ort_api_.AddKeyValuePair(hw_metadata, kOrtHardwareDevice_MetadataKey_IsVirtual, "1");

    auto* status = factory->ep_api_.CreateHardwareDevice(OrtHardwareDeviceType::OrtHardwareDeviceType_GPU,
                                                         factory->vendor_id_,
                                                         /*device_id*/ 0,
                                                         factory->vendor_.c_str(),
                                                         hw_metadata,
                                                         &factory->virtual_hw_device_);
    // ...

    OrtEpDevice* virtual_ep_device = nullptr;
    status = factory->ort_api_.GetEpApi()->CreateEpDevice(factory, factory->virtual_hw_device_,
                                                          ep_metadata, ep_options, &virtual_ep_device);
    // ...

    ep_devices[num_ep_devices++] = virtual_ep_device;
```
…lues early (microsoft#26345)

### Description
Converts weights early and reverts "Properly remove in-memory references (microsoft#25652)". This reverts commit 3ca49d8 and makes appropriate adjustments for the current state of the code.

This PR is made possible by, and follows on the heels of, microsoft#26263 and microsoft#25833. Previous history: microsoft#23979, microsoft#25320, microsoft#25626, microsoft#25652.

The first change (microsoft#26263) allows us to convert initializers to OrtValues early and save lots of memory at model loading time. Specifically, for the Phi-4-mini-instruct-INT4 model, before and after look like this:

**Before**
<img width="1204" height="124" alt="Before change DEBUG 2025-10-16 144819" src="https://github.com/user-attachments/assets/674ff75b-057f-498a-a906-0140d59d46e6" />

**After**
<img width="997" height="114" alt="After change DEBUG 2025-10-16 144819" src="https://github.com/user-attachments/assets/df1783af-7f50-4cd2-b3ad-6868f23be53f" />

The two peaks represent memory usage at optimization time (8.1 GB before) and after weights memory mapping (6.5 GB). After this change, the corresponding numbers are 3.5 GB and 4.7 GB respectively. Most of the savings during the optimization phase come from `ConstantFolding`, where we are able to reuse the resulting OrtValues directly for the new initializers.

This PR concludes a series of PRs converting initializers to OrtValues. Memory consumption before the conversion began was 9.3 GB and 6.7 GB respectively. We are saving almost 6 GB during optimization and 2 GB at steady state.

<img width="1175" height="139" alt="image" src="https://github.com/user-attachments/assets/80e7d228-8a8e-4316-8e04-b02c2be30f04" />

The model also loads about 12 seconds faster.

An example of ConstantFolding being one of the top contributors, where we duplicate memory for a higher peak before Resolve takes care of no-longer-used initializers:

<img width="1100" height="558" alt="Snapshot 3 Peak on ConstantFolding Transpose Optimizer" src="https://github.com/user-attachments/assets/95545abd-3f99-46d9-862e-bbf27cbb5b40" />
<img width="1060" height="600" alt="Snapshot 4 Peak AddInitializer from ConstantFolding" src="https://github.com/user-attachments/assets/dd457ec6-23ee-4efd-8c60-625d5faad61e" />
<img width="325" height="160" alt="image" src="https://github.com/user-attachments/assets/37c1194d-f683-49a7-afb1-073dfbb9bbfc" />

### Motivation and Context
Reduce memory usage.
…ovider-bridge API (microsoft#26448)

### Description
There is a memory leak whenever an EP uses the provider-bridge API to create a `std::unique_ptr<onnx::TypeProto>`. The `onnx::TypeProto` is not properly deleted due to a missing `operator delete()` override for the `TypeProto` wrapper class. This delete-operator override is necessary because the onnx library may use custom allocators. A sketch of the pattern follows the list of affected EPs below.

Affected EPs (run `git grep -rn "TypeProto::Create()" onnxruntime/core/providers/`):
- QNN EP: happens when QNN EP creates an EPContext model. To reproduce, run the `onnxruntime_provider_tests` with `--gtest_filter=*ContextBinary*`.
  - https://github.com/microsoft/onnxruntime/blob/860d0853a6eefdf19e21b0e9982bde2ffbc8a65d/onnxruntime/core/providers/qnn/builder/onnx_ctx_model_helper.cc#L73
- OpenVINO EP: happens during QDQ stripping:
  - https://github.com/microsoft/onnxruntime/blob/860d0853a6eefdf19e21b0e9982bde2ffbc8a65d/onnxruntime/core/providers/openvino/qdq_transformations/qdq_stripping.cc#L76
  - https://github.com/microsoft/onnxruntime/blob/860d0853a6eefdf19e21b0e9982bde2ffbc8a65d/onnxruntime/core/providers/openvino/qdq_transformations/qdq_stripping.cc#L473
  - https://github.com/microsoft/onnxruntime/blob/860d0853a6eefdf19e21b0e9982bde2ffbc8a65d/onnxruntime/core/providers/openvino/qdq_transformations/qdq_stripping.cc#L654
- NV EP:
  - https://github.com/microsoft/onnxruntime/blob/860d0853a6eefdf19e21b0e9982bde2ffbc8a65d/onnxruntime/core/providers/nv_tensorrt_rtx/nv_execution_provider_helper.cc#L213
- VitisAI EP:
  - https://github.com/microsoft/onnxruntime/blob/860d0853a6eefdf19e21b0e9982bde2ffbc8a65d/onnxruntime/core/providers/vitisai/imp/node_arg.cc#L94
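A hypothetical sketch of the bug pattern (class and allocator names are invented for illustration, not ORT's actual wrapper): if a class routes `operator new` through a custom allocator but does not override `operator delete`, `delete p` (and thus `unique_ptr`'s destructor) frees through the global deallocator, mismatching the allocation.

```c++
#include <cstddef>
#include <cstdlib>
#include <memory>

void* custom_alloc(std::size_t n) { return std::malloc(n); }  // stand-in for
void custom_free(void* p) { std::free(p); }                   // onnx's allocators

class TypeProtoWrapper {
 public:
  static void* operator new(std::size_t n) { return custom_alloc(n); }

  // The fix: a matching override so deletion goes through the same allocator.
  // Without it, ::operator delete is used, which does not match custom_alloc.
  static void operator delete(void* p) { custom_free(p); }
};

int main() {
  std::unique_ptr<TypeProtoWrapper> p(new TypeProtoWrapper());
}  // unique_ptr's destructor now calls the matching operator delete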
…microsoft#26439)

### Description
Fixes microsoft#26294.

When using the old model compilation approach (session option configuration), ORT should verify that the generated output model does not already exist. Importantly, this check should be done _before_ calling an EP's compile() method. This PR fixes the check, which was unintentionally disabled by a previous PR (microsoft#25455). Note that this check also (correctly) happens _after_ calling the EP's compile() method, but it is better to catch the error early if we can.

### Motivation and Context
Fixes a regression in the older compilation workflow.
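A hedged sketch of the early check (the function shape and error handling are illustrative, not ORT's actual code): reject an existing output model before the EP's compile() is invoked, not only after.

```c++
#include <filesystem>
#include <stdexcept>
#include <string>

void CompileModel(const std::string& input_model_path,
                  const std::string& output_model_path) {
  // Fail fast, before any expensive EP compilation work happens.
  if (std::filesystem::exists(output_model_path)) {
    throw std::runtime_error("Output model already exists: " + output_model_path);
  }
  // ... invoke the EP's compile() and write the output model ...
}
```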
### Description
This change fixes a bug that causes a crash on macOS (and potentially other platforms using libc) at `OrtReleaseEnv`. Instead of using static variables, the objects are now function-local statics so that the compiler can handle the destruction order correctly.

### Motivation and Context
Fixes microsoft#24579

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
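A minimal sketch of the pattern (types and names are illustrative): a function-local static is constructed on first use and destroyed in reverse order of construction, so teardown at `OrtReleaseEnv` time can no longer touch an already-destroyed object.

```c++
#include <string>

struct Registry {
  std::string name{"ort-global-state"};
};

// Before (problematic): a namespace-scope static; destruction order across
// translation units is unspecified, so other teardown code may use it after
// it has been destroyed.
//   static Registry g_registry;

// After: constructed on first call; the runtime sequences its destruction
// relative to other function-local statics.
Registry& GetRegistry() {
  static Registry registry;
  return registry;
}

int main() {
  return GetRegistry().name.empty() ? 1 : 0;
}
```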
### Description
Adds `Elu`, `Exp`, and `Softplus` to the CoreML EP.

### Motivation and Context
We use them in LeelaChessZero.

---------

Co-authored-by: borg323 <borg323@users.noreply.github.com>
### Description
.NET MAUI 8 is out of support. See: https://dotnet.microsoft.com/en-us/platform/support/policy/maui

We started seeing errors about this in the NuGet packaging pipeline:
```
##[error]C:\Program Files\dotnet\sdk\9.0.306\Sdks\Microsoft.NET.Sdk\targets\Microsoft.NET.EolTargetFrameworks.targets(38,5): Error NETSDK1202: The workload 'net8.0-ios' is out of support and will not receive security updates in the future. Please refer to https://aka.ms/maui-support-policy for more information about the support policy.
```
This change updates net8.0 mobile target framework monikers to net9.0.

### Motivation and Context
Fix packaging pipeline.
### Description
- Use total inference time instead of submission time for output statistics calculation.

### Motivation and Context
- The min, max, and other statistics reported for inference were using device submission time instead of the inference time. @ishwar-raut1 @gaugarg-nv @thevishalagarwal @umangb-09 @gedoensmax
1. Add the safeint header include path to MLAS.
2. Fix syntax errors in sqnbitgemm and LASX code.
Fixing the dump-model-ops feature for the MIGraphX EP on Windows. The feature wasn't functional because the file-saving rules on Windows are the opposite of those on Linux. The feature now gives us the opportunity to generate and save the ONNX model after subgraph optimizations, before compiling it. This way we can inspect what the model graph looks like after optimizations and reuse the optimized model.

---------

Co-authored-by: Uros Petkovic <urpektov@amd.com>
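The PR doesn't spell out which saving rule differs, so the following is an assumption on my part, not confirmed by the PR: a classic Windows-vs-Linux difference when saving a serialized model is CRLF translation in text mode, avoided by opening the stream in binary mode.

```c++
// Assumption for illustration only: write the serialized ONNX protobuf bytes
// unmodified on both platforms by forcing binary mode (text mode on Windows
// would translate line endings and corrupt the file).
#include <fstream>
#include <string>

void SaveModel(const std::string& path, const std::string& serialized_onnx) {
  std::ofstream out(path, std::ios::out | std::ios::binary);
  out.write(serialized_onnx.data(),
            static_cast<std::streamsize>(serialized_onnx.size()));
}
```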
Add support for bias and weight_index, move subgroup_matrix_matmul_nbits to a template, and make the program callable from other ops.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
ankitm3k approved these changes on Nov 3, 2025.
Description
Synchronizing the intel/onnxruntime ovep-develop branch with the latest changes from the microsoft/onnxruntime master branch.