Sync with Microsoft ONNX Runtime - 28/08/2025 #797

Jaswanth51 · 2025-08-28T03:58:11Z

Description

Synchronizing intel/onnxruntime ovep-develop branch with latest changes from microsoft/onnxruntime master branch.

### Description

Fixed the macro `ORT_API_CALL` by replacing `_stdcall` with `__stdcall`.

### Motivation and Context

Recently, I found an issue that prevents ONNX Runtime from being built using the MinGW toolchain on Windows. After investigating, I discovered that the ONNX Runtime C API header contains a typo in the `ORT_API_CALL` preprocessor macro. It is incorrectly defined as `_stdcall` instead of the correct `__stdcall` (with two leading underscores). This causes build failures on compilers like MinGW that are strict about this syntax.

…ft#25782)

### Description

Allow custom `CMAKE_C_STANDARD` and `CMAKE_CXX_STANDARD`.

Fixes microsoft#25756

### Description

This change adds support for Q2-quantized `MatMulNBits` in WebGPU.

### Motivation and Context

An alternate way to support bitnets is to add support for lower bit widths in `MatMulNBits`. This reuses our shaders and is more maintainable than a separate op. The model size grows a bit; however, for a 2B-parameter model, the size difference between 1.58 bpw and 2 bpw is just 100 MB. The simpler dequantization also improves performance: on an Intel Xe GPU, matmul looks to be 20% faster using Q2 weights than Q4 weights for the same matrix dimensions.

The Q2 version of the bitnet model is here: https://huggingface.co/sushraja/bitnet-b1.58-2B-4T-fp16-onnx/tree/main/bitnet_q2
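The size figure above can be sanity-checked with quick arithmetic. This is a back-of-the-envelope sketch only: it assumes exactly 2 billion weights and ignores scales, zero points, and any tensors that are not quantized. The function name is illustrative and does not come from the PR.

```python
def packed_weight_bytes(num_params: int, bits_per_weight: float) -> float:
    """Bytes needed to store num_params weights at the given bit width."""
    return num_params * bits_per_weight / 8


# Assumption: "2B parameter model" is taken as exactly 2e9 quantized weights.
NUM_PARAMS = 2_000_000_000

size_q2 = packed_weight_bytes(NUM_PARAMS, 2.0)
size_ternary = packed_weight_bytes(NUM_PARAMS, 1.58)

# Difference in decimal megabytes (1 MB = 1e6 bytes).
extra_mb = (size_q2 - size_ternary) / 1e6

print(f"extra size at 2 bpw vs 1.58 bpw: {extra_mb:.0f} MB")
```

Under these assumptions the difference comes out at roughly 105 MB, in line with the "just 100 MB" quoted in the description.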

In the flash attention algorithm, each thread in a subgroup needs to access the same range (0-15) of data in workgroup memory `q_tile` and `v_tile`. If we use `subgroupShuffle`, there will be bank conflicts for `var k_local = k_tile[capped_sg_id][i];`, since the sg_size is 32 and threads 16 to 31 access the same bank address. To avoid the bank conflicts, all threads can directly access the same address in workgroup memory, which is a broadcast and is well optimized on NVIDIA GPUs.

This gives a ~10% improvement for phi4 prefill (1K) on an NVIDIA RTX 2000 Ada. As the input gets longer (total_sequence_length), the optimization effect gets better (~12% for 2K).

Before

```
Batch size: 1, prompt tokens: 1000, tokens to generate: 128
Prompt processing (time to first token):
    avg (us):       2.0991e+06
    avg (tokens/s): 476.394
    p50 (us):       2.08457e+06
    stddev (us):    36140.3
    n:              5 * 1000 token(s)
Token generation:
    avg (us):       25477.8
    avg (tokens/s): 39.2498
    p50 (us):       25028.2
    stddev (us):    4841.89
    n:              635 * 1 token(s)
```

After

```
Batch size: 1, prompt tokens: 1000, tokens to generate: 128
Prompt processing (time to first token):
    avg (us):       1.91138e+06
    avg (tokens/s): 523.183
    p50 (us):       1.92379e+06
    stddev (us):    44768
    n:              5 * 1000 token(s)
Token generation:
    avg (us):       25237.2
    avg (tokens/s): 39.624
    p50 (us):       24860.9
    stddev (us):    4874.52
    n:              635 * 1 token(s)
```
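The quoted improvement can be recomputed from the throughput figures in the two benchmark runs. The snippet below is only a small arithmetic check; the helper name is hypothetical and the input numbers are copied from the Before/After output above.

```python
def throughput_gain_percent(before_tps: float, after_tps: float) -> float:
    """Relative throughput change in percent (positive means faster)."""
    return (after_tps / before_tps - 1.0) * 100.0


# avg (tokens/s) for prompt processing, before and after the change.
prefill_gain = throughput_gain_percent(476.394, 523.183)

# avg (tokens/s) for token generation, before and after the change.
decode_gain = throughput_gain_percent(39.2498, 39.624)

print(f"prefill: {prefill_gain:+.1f}%  token generation: {decode_gain:+.1f}%")
```

The prefill gain lands just under 10%, matching the "~10%" claim, while token generation changes by only about 1%, which is small compared with the reported standard deviation.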

Quoting cppreference.com:

```
(the [[noreturn]] attribute) Indicates that the function will not return control flow to the calling function after it finishes (e.g. functions that terminate the application, throw exceptions, loop indefinitely, etc.). This attribute applies to the name of the function being declared in function declarations only. If a function previously declared with `[[noreturn]]` is invoked and that invocation eventually returns, the behavior is runtime-undefined.
```

The `SafeIntOn*` member functions immediately throw, so if they are used in a function with a non-void return type, g++ 14 issues a warning that there exist control paths in the function where no value is returned. Fix this by marking the member functions explicitly `[[noreturn]]`. This is needed so that onnxruntime builds correctly with `-Wall -Wextra`.

…icrosoft#25832)

This PR addresses accessibility issues with focus indicators on the ONNX Runtime website documentation where contrast ratios were insufficient for keyboard navigation users. The accessibility audit revealed that focus states for key navigation elements like "Learn more about ONNX Runtime & Generative AI", "Quickstart", "Tutorials", "Install ONNX Runtime", and "Hardware Acceleration" had contrast ratios as low as 1.152:1, well below the WCAG 2.1 AA requirement of 3:1 for UI components.

## Changes Made

### 1. Enhanced List Group Item Focus Contrast

- **Before**: `color: #555` on `background-color: #f5f5f5` (6.8:1 ratio)
- **After**: `color: #333` on `background-color: #f5f5f5` (**11.6:1 ratio**)

### 2. Improved Info List Group Item Focus Contrast

- **Before**: `color: #31708f` on `background-color: #c4e3f3` (4.1:1 ratio)
- **After**: `color: #1e4a5f` on `background-color: #c4e3f3` (**7.1:1 ratio**)

### 3. Added Visible Focus Indicators for Form Inputs

Previously, search and filter inputs only removed the default outline (`outline: 0`) without providing alternative focus indicators, making them inaccessible to keyboard users.

- **Added**: `border: 2px solid #0050C5` and `background-color: #f8f9fa` on focus
- **Contrast ratio**: **6.7:1** (exceeds requirements)

## Accessibility Compliance

All changes now exceed WCAG 2.1 AA standards:

- ✅ **3:1 minimum** for UI components and focus indicators
- ✅ **4.5:1 minimum** for normal text (all exceed 7:1)
- ✅ **Keyboard navigation** fully supported with visible focus indicators
- ✅ **Screen reader compatibility** improved with clear focus states

## Impact

- Low vision users can now clearly see focused elements during keyboard navigation
- All mentioned navigation elements meet accessibility standards
- No functionality broken - purely visual accessibility enhancements
- Compliance with MAS 1.4.11 Non-text Contrast requirements

## Files Modified

- `csharp/ApiDocs/_exported_templates/default/styles/docfx.css` - Enhanced input focus indicators
- `csharp/ApiDocs/_exported_templates/default/styles/docfx.vendor.css` - Improved text contrast ratios

Fixes microsoft#24995.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: MaanavD <24942306+MaanavD@users.noreply.github.com>
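The contrast ratios quoted above follow the WCAG 2.x definition: (L1 + 0.05) / (L2 + 0.05), where L1 and L2 are the relative luminances of the lighter and darker color. The following is a minimal sketch of that calculation, useful for re-checking the list-group figures; the function names are illustrative and are not part of the PR.

```python
def _expand_hex(hex_color: str) -> str:
    """Normalize '#555' or '#555555' to a six-digit hex string."""
    h = hex_color.lstrip("#")
    if len(h) == 3:  # CSS shorthand: each digit is doubled
        h = "".join(ch * 2 for ch in h)
    if len(h) != 6:
        raise ValueError(f"unsupported color: {hex_color!r}")
    return h


def _linearize(channel_8bit: int) -> float:
    """Convert one 8-bit sRGB channel to linear light (WCAG 2.x formula)."""
    c = channel_8bit / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    h = _expand_hex(hex_color)
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)


def contrast_ratio(foreground: str, background: str) -> float:
    l1 = relative_luminance(foreground)
    l2 = relative_luminance(background)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


# List group item focus colors from section 1.
before = contrast_ratio("#555", "#f5f5f5")
after = contrast_ratio("#333", "#f5f5f5")
print(f"before: {before:.1f}:1  after: {after:.1f}:1")
```

For the list group item colors this reproduces the 6.8:1 and 11.6:1 figures given in section 1.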

…ft#25819)

The DocFX tab controls on onnxruntime.ai were not accessible via keyboard navigation, violating MAS 2.1.1 keyboard accessibility requirements. Users could not navigate between language tabs (Python, C#, Java, JavaScript, C++) using keyboard-only input.

## Problem

The existing implementation in `docfx.js` only handled mouse click events but lacked keyboard event handlers. This prevented keyboard users from:

- Navigating between tabs using arrow keys
- Activating tabs using Enter/Space keys
- Jumping to first/last tabs using Home/End keys

## Solution

Added comprehensive keyboard navigation support following the WAI-ARIA tabs design pattern:

```javascript
// Added keyboard event listener alongside existing click handler
container.addEventListener('keydown', function (event) {
  return handleKeyDown(event, state);
});
```

The `handleKeyDown` function implements:

- **Arrow key navigation**: Left/Right and Up/Down keys move focus between tabs with wrapping
- **Tab activation**: Enter and Space keys activate the focused tab
- **Quick navigation**: Home/End keys jump to first/last tabs
- **Proper focus management**: Only the active tab has `tabIndex="0"`, others have `tabIndex="-1"`
- **Event handling**: `preventDefault()` and `stopPropagation()` for handled keys

## Accessibility Features

- Follows WAI-ARIA tabs pattern specifications
- Maintains proper ARIA attributes (`role="tab"`, `aria-selected`, etc.)
- Provides visual focus indicators via existing CSS
- Supports both horizontal and vertical arrow key navigation
- Implements circular navigation (wrapping at boundaries)

## Testing

Validated functionality with comprehensive keyboard navigation tests:

- ✅ Arrow keys navigate between tabs with proper wrapping
- ✅ Enter/Space keys activate focused tabs and switch content panels
- ✅ Home/End keys jump to first/last tabs correctly
- ✅ Focus management works with proper `tabIndex` handling
- ✅ Visual feedback shows focused vs selected tab states

This ensures keyboard users can fully access all tab functionality without requiring mouse interaction.

Fixes microsoft#24997.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: MaanavD <24942306+MaanavD@users.noreply.github.com>
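The wrapping and roving-`tabIndex` behavior described above can be modeled independently of the DOM. The sketch below is written in Python rather than the page's JavaScript and is not the code from `docfx.js`; the function names are hypothetical, and the key strings are the standard `KeyboardEvent.key` values.

```python
from typing import List


def next_tab_index(key: str, current: int, count: int) -> int:
    """Return the index of the tab that should receive focus after a key press."""
    if count <= 0:
        raise ValueError("a tab list needs at least one tab")
    if key in ("ArrowRight", "ArrowDown"):
        return (current + 1) % count  # wraps from the last tab to the first
    if key in ("ArrowLeft", "ArrowUp"):
        return (current - 1) % count  # wraps from the first tab to the last
    if key == "Home":
        return 0
    if key == "End":
        return count - 1
    return current  # unhandled keys leave focus where it is


def roving_tab_indices(active: int, count: int) -> List[int]:
    """tabIndex values: 0 for the active tab, -1 for every other tab."""
    return [0 if i == active else -1 for i in range(count)]


# Five language tabs (Python, C#, Java, JavaScript, C++), focus on the last one.
focused = next_tab_index("ArrowRight", current=4, count=5)
print(focused, roving_tab_indices(focused, 5))
```

Keeping exactly one tab at `tabIndex` 0 means the Tab key enters and leaves the tab list in a single stop, while the arrow keys move within it.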

This pull request introduces several improvements and refactorings to the quantized Mixture-of-Experts (QMoE) operator in ONNX Runtime, focusing on enhanced support for FP32 mode, improved SwiGLU activation handling, and better test coverage. The most important changes are grouped below by theme.

### Operator Registration and Type Support

- Added explicit registration and support for `QMoE` operator with both `MLFloat16` and `float` data types, enabling FP32 (non-quantized) mode in addition to quantized modes. This includes updates to kernel registration and schema/type constraints.

### SwiGLU Activation Improvements

- Refactored `ApplySwiGLUActivation` to accept configurable `activation_alpha` and `activation_beta` parameters, matching CUDA behavior and allowing flexibility in activation function tuning. Also, dropped support for non-interleaved memory layouts (now not implemented).
- Now reads `activation_alpha` and `activation_beta` attributes from operator parameters, defaulting to values appropriate for SwiGLU.

### QMoE Operator Implementation Refactor

- Refactored the QMoE operator to clarify separation between quantized and FP32 implementations, and restructured internal methods for better maintainability. Added template parameterization for data types and improved handling of expert weights and biases.

### Shape Checking and Layout

- Removed legacy shape/layout support in QMoE input validation, enforcing only the new memory layout for expert weights and improving consistency and forward compatibility.

### Test and Documentation Updates

- Updated unit tests for QMoE to use correct zero-point values for quantized weights (e.g., 0x88 for int4, 128 for int8), ensuring that test cases accurately reflect expected zero-output behavior for zero weights. Also clarified comments and expected outputs for SwiGLU and quantized scenarios.

These changes collectively improve the flexibility, correctness, and maintainability of the QMoE operator in ONNX Runtime.

Unit test result

```
sRunning test: batch_size=1, sequence_length=8, quant_bits=4, use_swiglu=True, swiglu_interleaved=True
Parity check - SwiGLU(interleaved=True) 4-bit: max_diff = 0.000372
.Running test: batch_size=1, sequence_length=8, quant_bits=8, use_swiglu=True, swiglu_interleaved=True
Parity check - SwiGLU(interleaved=True) 8-bit: max_diff = 0.000392
.Running test: batch_size=1, sequence_length=32, quant_bits=4, use_swiglu=True, swiglu_interleaved=True
Parity check - SwiGLU(interleaved=True) 4-bit: max_diff = 0.000470
.Running test: batch_size=1, sequence_length=32, quant_bits=8, use_swiglu=True, swiglu_interleaved=True
Parity check - SwiGLU(interleaved=True) 8-bit: max_diff = 0.000442
.Running test: batch_size=4, sequence_length=8, quant_bits=4, use_swiglu=True, swiglu_interleaved=True
Parity check - SwiGLU(interleaved=True) 4-bit: max_diff = 0.000470
.Running test: batch_size=4, sequence_length=8, quant_bits=8, use_swiglu=True, swiglu_interleaved=True
Parity check - SwiGLU(interleaved=True) 8-bit: max_diff = 0.000442
.Running test: batch_size=4, sequence_length=32, quant_bits=4, use_swiglu=True, swiglu_interleaved=True
Parity check - SwiGLU(interleaved=True) 4-bit: max_diff = 0.000609
.Running test: batch_size=4, sequence_length=32, quant_bits=8, use_swiglu=True, swiglu_interleaved=True
Parity check - SwiGLU(interleaved=True) 8-bit: max_diff = 0.000702
.
----------------------------------------------------------------------
Ran 9 tests in 46.754s

OK (skipped=1)
```

---------

Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
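To see why 0x88 and 128 are the right "all zero" encodings mentioned under Test and Documentation Updates, here is a small sketch. It assumes unsigned storage with a fixed midpoint zero point (8 for 4-bit, 128 for 8-bit) and two 4-bit values packed per byte. The helper names and the nibble order are illustrative assumptions, not the operator's actual code.

```python
from typing import Tuple

INT4_ZERO_POINT = 8    # assumption: midpoint of the unsigned 4-bit range 0..15
INT8_ZERO_POINT = 128  # assumption: midpoint of the unsigned 8-bit range 0..255


def unpack_int4_pair(byte: int) -> Tuple[int, int]:
    """Split one packed byte into (low nibble, high nibble)."""
    return byte & 0x0F, (byte >> 4) & 0x0F


def dequantize(q: int, zero_point: int, scale: float) -> float:
    """Affine dequantization: subtract the zero point, then apply the scale."""
    return (q - zero_point) * scale


low, high = unpack_int4_pair(0x88)
int4_values = [dequantize(q, INT4_ZERO_POINT, scale=0.5) for q in (low, high)]
int8_value = dequantize(128, INT8_ZERO_POINT, scale=0.25)

print(int4_values, int8_value)
```

Under these assumptions, a raw byte of 0x00 would decode to negative weights rather than zero, which is why a test that expects zero output has to fill the weight buffer with the zero-point pattern instead.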

…ile build (microsoft#25849)

### Description

`ABSL_FLAGS_STRIP_NAMES` is set to 1 by default to disable flag registration when building for Android, iPhone, and "embedded devices". As a result, when `onnxruntime_perf_test` runs on Android, its flags are not registered.

<img width="872" height="182" alt="image (2)" src="https://github.com/user-attachments/assets/eb6a6772-cdff-4d60-a3c7-4352477e956c" />

Set `ABSL_FLAGS_STRIP_NAMES` to 0 by default for all builds.

### Description

The phi4 mini model in Edge uses ai.onnx v21. Without this change, a `MemcpyToHost` is inserted, which slows generation.

This change uses `subgroupShuffle` for sg_size=64 to perform the matmul. It also uses a loop instead of loop unrolling to reduce register pressure. Phi4 prefill for 1K tokens drops from 11.32s to 8.8s on a Qualcomm Adreno X1-85 GPU.

### Description

This change adds CUDA Graph support to the NV TensorRT RTX Execution Provider (EP).

### Motivation and Context

Integrating CUDA Graphs into the NV TRT RTX EP provides:

- Lower latency by minimizing per-kernel launch overhead.
- Better throughput for repeated inference runs.
- Improved efficiency on GPUs that are sensitive to kernel launch overhead.

---------

Co-authored-by: Maximilian Mueller <maximilianm@nvidia.com>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>

### Description

Enable the einsum op with QK equations for attention in the QNN EP.

### Motivation and Context

The current einsum op in QNN doesn't support equations with uppercase letters. Loosen this constraint to allow more use cases.

Signed-off-by: Mu-Chein Hsu <quic_muchhsu@quicinc.com>

…#25833)

### Description

While memory profiling some models, I noticed multiple file mapping failures. While `WindowsEnv::MapFileIntoMemory()` properly checks that the mapping offset is granularity aligned, it calculates it as page aligned.

Also, when saving external tensors, we do not need to align big tensors to the Windows allocation granularity or to anything else that is platform dependent. Set the alignment to 4096 for all platforms. Granularity matters only for calculating the mapping address.

### Motivation and Context

Multiple file mapping failures for certain models. This also saves some hundreds of MB for some models.
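The distinction between page alignment and allocation granularity can be illustrated with plain integer arithmetic. This is a sketch of the general technique, not the ONNX Runtime implementation. The helper names are hypothetical, and the 64 KiB granularity used in the example is the value typically reported on Windows; Python exposes the platform's actual value as `mmap.ALLOCATIONGRANULARITY`.

```python
import mmap
from typing import Tuple

PAGE_SIZE = 4096            # alignment used when saving external tensors
WINDOWS_GRANULARITY = 65536  # assumption: typical Windows allocation granularity


def split_mapping_offset(offset: int,
                         granularity: int = mmap.ALLOCATIONGRANULARITY) -> Tuple[int, int]:
    """Return (aligned_offset, delta): map at aligned_offset, read data at +delta."""
    aligned = offset - (offset % granularity)
    return aligned, offset - aligned


def align_up(value: int, alignment: int = PAGE_SIZE) -> int:
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


# A tensor stored at a page-aligned offset that is NOT granularity aligned.
tensor_offset = 3 * PAGE_SIZE
aligned, delta = split_mapping_offset(tensor_offset, WINDOWS_GRANULARITY)
print(aligned, delta)
```

An offset such as 12288 is page aligned but not a multiple of 64 KiB, so the mapping has to start at the preceding granularity boundary, with the tensor data read `delta` bytes into the view. Writing tensors at 4096-byte rather than 64 KiB boundaries wastes less padding per tensor, which is consistent with the size savings mentioned above.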

### Description

Fix packaging pipelines.

### Motivation and Context

During CI and local builds, `Ort::Status()` gets inherited from the base due to using directives; however, that does not work for the packaging pipelines. Having a default constructor is important for storing `Status` in containers if needed.

…at info (microsoft#25841)

### Description

This PR adds a new API that applications can use to verify compatibility of a precompiled model with the underlying system, using only the compatibility info string from the model's metadata.

### Motivation and Context

- This is a feature to enable apps to check compatibility of a precompiled model without necessarily having the model locally on the device. This enables precompiled models to be stored remotely and downloaded once the application has been able to confirm the validity of a given model with EPs on the device.

### Testing

- New unit tests pass
- For regression testing, built a private version of WinML + AMD NPU EP with these changes. Ran the Cpp Selfcontained Desktop sample successfully; ran with compilation and also re-ran using the already-compiled model to verify that session initialization continued to work as expected.

---------

Co-authored-by: Aditya Rastogi <adityar@ntdev.microsoft.com>

)

### Description

According to the [WebNN spec](https://www.w3.org/TR/webnn/#api-mlgraphbuilder-batchnorm), batchNorm should have input names "mean" and "variance" instead of "input_mean" and "input_var".

### Motivation and Context

This issue causes any BatchNorm with mean/variance inputs to fall back to wasm.

…dLayerNorm (microsoft#25850)

### Description

Use shaders similar to those of SkipSimplifiedLayerNorm in SimplifiedLayerNorm, to fix the performance issues with SimplifiedLayerNorm.

### Motivation and Context

Prior to this change, generation in Bitnet was bottlenecked on SimplifiedLayerNorm:

<img width="332" height="378" alt="image" src="https://github.com/user-attachments/assets/3bc16ac1-ef7d-46bf-b403-92fc9192a2df" />

With this change, performance has improved to match SkipSimplifiedLayerNorm:

<img width="699" height="179" alt="image" src="https://github.com/user-attachments/assets/30009d85-d5d9-4585-987a-b39ecf52e0b5" />

…s int32 (microsoft#25646)

### Description

This PR makes DequantizeLinear support a non-zero zero_point when the input data type is int32.

### Motivation and Context

For the WebNN use case, there are scenarios in which the input data type is int32 and the zero_point for DequantizeLinear is not zero.
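For reference, the ONNX DequantizeLinear operator computes `y = (x - x_zero_point) * x_scale`. The sketch below shows that formula for the per-tensor case with int32 inputs, using plain Python lists. It is an illustration only, not the kernel touched by this PR, and the function name is hypothetical.

```python
from typing import List, Sequence


def dequantize_linear(x: Sequence[int], scale: float, zero_point: int = 0) -> List[float]:
    """Per-tensor dequantization: y = (x - zero_point) * scale."""
    return [(value - zero_point) * scale for value in x]


# int32 input with a non-zero zero point, the case this change enables.
quantized = [10, 12, 8]
dequantized = dequantize_linear(quantized, scale=0.5, zero_point=10)
print(dequantized)

# With zero_point left at 0 the same input yields different values,
# so the zero point cannot simply be ignored.
print(dequantize_linear(quantized, scale=0.5))
```

The two printed lists differ, which is why a kernel that assumes a zero `zero_point` for int32 inputs gives wrong results in the scenarios described above.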

preetha-intel

Backmerging with Master

dependabot bot and others added 23 commits August 25, 2025 09:13

Bump @babel/helpers from 7.25.6 to 7.26.10 in /js/react_native/e2e (m…

0482251

…icrosoft#23993)

[webgpu] Expand Unsqueeze version to 23 (microsoft#25858)

f58f7eb

### Description

The phi4 mini model in Edge uses ai.onnx v21. Without this change, a `MemcpyToHost` is inserted, which slows generation.

[WebNN] Support Round op (microsoft#25810)

1d07e94

Merge branch 'master' into sync_msft_28082025

676a4d2

Jaswanth51 requested a review from ankitm3k August 28, 2025 03:58

preetha-intel approved these changes Aug 28, 2025

Jaswanth51 merged commit b9a1885 into ovep-develop Aug 28, 2025
6 of 8 checks passed

preetha-intel deleted the sync_msft_28082025 branch August 28, 2025 04:43


17 participants
