forked from microsoft/onnxruntime
Backmerging with Msft commits #708
Merged
Conversation
### Description Do a manual load of dxcore.dll so that old Windows versions are still supported. ### Motivation and Context microsoft#24771 --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
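A minimal sketch of the manual-load pattern the description refers to; the wrapper name and error handling below are illustrative, not the PR's actual code:
```cpp
// Sketch of lazily loading dxcore.dll so the binary has no hard import on it.
#include <windows.h>

HRESULT TryCreateDXCoreFactory(REFIID riid, void** factory) {
  // LoadLibraryExW fails gracefully on Windows versions that ship without dxcore.dll.
  static HMODULE dxcore_dll =
      ::LoadLibraryExW(L"dxcore.dll", nullptr, LOAD_LIBRARY_SEARCH_SYSTEM32);
  if (!dxcore_dll) return HRESULT_FROM_WIN32(ERROR_MOD_NOT_FOUND);

  using CreateFactoryFn = HRESULT(WINAPI*)(REFIID, void**);
  auto create_factory = reinterpret_cast<CreateFactoryFn>(
      ::GetProcAddress(dxcore_dll, "DXCoreCreateAdapterFactory"));
  if (!create_factory) return HRESULT_FROM_WIN32(ERROR_PROC_NOT_FOUND);

  return create_factory(riid, factory);
}
```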
…4877) ### Description - `EtwRegistrationManager`: make sure all fields are initialized by the constructor. - Register a callback object instead of a pointer to it, and store it in the map with a session-unique key. - Register `ML_Ort_Provider_Etw_Callback` once for all sessions. The first session registers the callback and the last one to go away removes it, so all sessions are logged. For this, the callbacks are ref-counted inside the map they are stored in. This prevents a deadlock where `active_sessions_mutex_` and `callback_mutex_` are acquired from different threads in a different order. - Create a registration guard to remove callbacks in case the `InferenceSession` constructor does not finish. ### Motivation and Context This PR is inspired by microsoft#24773. The current code exhibits multiple issues: - The `EtwRegistrationManager` constructor does not initialize all of the fields, including `InitializationStatus`. - The global callback object is registered and re-created by every session. Customers sometimes run thousands of models in the same process, which results in quadratic ETW costs; the callback object is destroyed and recreated every time a session is created. - There is a chance that the InferenceSession constructor does not finish, leaving the callback registered, which may result in intermittent, hard-to-diagnose bugs. - `active_sessions_lock_` and the `callback` lock are not acquired/released in the same order by different threads, which is a classic deadlock scenario.
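A rough sketch of the ref-counting and registration-guard idea described above; all class and member names here are hypothetical, not ORT's actual implementation:
```cpp
#include <functional>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

// Ref-counted callback map: the first Register() for a key stores the callback,
// later ones share it; the entry disappears once no session holds a reference.
class CallbackRegistry {
 public:
  using Callback = std::function<void()>;

  std::shared_ptr<Callback> Register(const std::string& key, Callback cb) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto& weak = callbacks_[key];
    if (auto existing = weak.lock()) return existing;
    auto entry = std::make_shared<Callback>(std::move(cb));
    weak = entry;
    return entry;
  }

  void Unregister(const std::string& key) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = callbacks_.find(key);
    if (it != callbacks_.end() && it->second.expired()) callbacks_.erase(it);
  }

 private:
  std::mutex mutex_;
  std::unordered_map<std::string, std::weak_ptr<Callback>> callbacks_;
};

// RAII guard: if InferenceSession construction throws, the destructor still
// drops this session's reference and removes the stale map entry.
class RegistrationGuard {
 public:
  RegistrationGuard(CallbackRegistry& registry, std::string key, CallbackRegistry::Callback cb)
      : registry_(registry), key_(std::move(key)), handle_(registry.Register(key_, std::move(cb))) {}
  ~RegistrationGuard() {
    handle_.reset();             // drop our reference first
    registry_.Unregister(key_);  // erase the entry if no other session holds it
  }

 private:
  CallbackRegistry& registry_;
  std::string key_;
  std::shared_ptr<CallbackRegistry::Callback> handle_;
};
```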
### Description This PR extracts the core implementations into gemm_utils.cc, which is used to generate shaders for both the GEMM and MatMul ops. The core implementations include scalar and vec4 versions of GEMM and MatMul. ### Motivation and Context GEMM and MatMul share a lot of code, so we want to extract the common code to unify their implementations. --------- Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
### Description Update unit tests for DNNL. (1) Increase the QAttentionTest threshold when DNNL is used. (2) Skip some failing tests when DNNL is used. ### Motivation and Context When I build the main branch for DNNL in an Azure Linux VM, some tests fail:
```
pip install --user -r tools/ci_build/github/linux/python/requirements.txt
python3 tools/ci_build/build.py --build_dir build/Release --config Release --cmake_generator Ninja --skip_submodule_sync --build_shared_lib --parallel --use_vcpkg --use_binskim_compliant_compile_flags --build_wheel --build_nuget --use_dnnl
```
```
[ FAILED ] NhwcTransformerTests.ConvSplitQLinearConcat
[ FAILED ] NhwcTransformerTests.Conv
[ FAILED ] NhwcTransformerTests.ConvBlockBinary
[ FAILED ] NhwcTransformerTests.ConvMaxPool
[ FAILED ] NhwcTransformerTests.ConvAveragePool
[ FAILED ] NhwcTransformerTests.ConvPad
[ FAILED ] NhwcTransformerTests.ConvBlockActivation
[ FAILED ] QDQTransformerTests.Conv_U8X8U8
[ FAILED ] QDQTransformerTests.ConvMaxPoolReshape_UInt8
[ FAILED ] QDQTransformerTests.ConvMaxPoolReshape_Int8
[ FAILED ] QDQTransformerTests.ConvRelu
[ FAILED ] QDQTransformerTests.ConvAveragePoolReshape_UInt8
[ FAILED ] QDQTransformerTests.ConvAveragePoolReshape_Int8
[ FAILED ] QDQTransformerTests.ConvTranspose_QBackward
[ FAILED ] QDQTransformerTests.QBackward_MutilpleSteps
[ FAILED ] QDQTransformerTests.ConvTranspose_DQForward
[ FAILED ] QDQTransformerTests.DQForward_MutilpleSteps
[ FAILED ] InferenceSessionTests.ModelMetadata
[ FAILED ] ActivationOpTest.LeakyRelu_bfloat16
[ FAILED ] QAttentionTest.QAttentionDNNLBatch1
[ FAILED ] QAttentionTest.QAttentionDNNLBatch2
[ FAILED ] QAttentionTest.QAttentionDNNLMaskPartialSequence
[ FAILED ] QAttentionTest.QAttentionNoMaskIndex
[ FAILED ] QAttentionTest.QAttentionPrunedModel
```
…#24548) ### Description Update README.md: remove the build pipeline status section because it is out of date. Most of our pipelines are on GitHub now.
…24896) ### Description Replace Upsample with Resize during quantization to avoid producing an invalid graph. ### Motivation and Context After quantization, if the opset of the original ONNX model is less than 10, the opset of the QDQ model is upgraded to 11. However, Upsample is deprecated in opset 11, which would make the ONNX model invalid. So we replace Upsample with Resize when the opset needs to be upgraded to 11. --------- Co-authored-by: chuteng <chuteng@qti.qualcomm.com>
### Description Fuse transposed channel shuffle pattern into QNN op -- ONNX does not have native ChannelShuffle op. ### Motivation and Context Improves performance on QNN EP.
Fixes pipeline error in MacOS_C_API_Packaging_CPU_x86_64 by disabling softmax NaN test for CoreML EP since it does not handle NaN.
### Description Implement the fpA intB gemm preprocess in a CUDA kernel to speed up weight prepacking. ### Motivation and Context The original preprocess code (in microsoft#24854) runs on the CPU, which is slow and needs an extra memory copy between CPU and GPU.
### Description
Added a graph transform for mixed precision graphs when FP16 compute is
unavailable. At session creation, this graph transform converts FP16
initializers (_which were changed into FP16-to-FP32 Cast nodes_) to FP32
initializers and fuses them with their next FP32 nodes.
- Behavior before this change:
"fp16 initializers -> cast_from_fp16_to_fp32 -> fp32 node/s"
- Behavior after this change:
"fp16 initializers converted to fp32 initializers then fused with fp32
node/s"
### Motivation and Context
This change aims to run the FP16 models without the repetitive casting
of FP16 initializers to FP32 initializers, by fusing FP32 initializers
with their next nodes, when FP16 compute is not available.
> For naming purposes, the newly added Graph Transforms in long form is
called "Fused Initializers Graph Transforms", and in short form is
called "FIGT".
### Working
Currently, the Fuse Initializers Graph Transform fuses Cast nodes that
cast from FP16 to FP32 back into their next/output nodes. Below is an
explanation of how this transform works. It depends on ```InsertCastTransforms```
to produce the intermediate representation from which it fuses the
initializers (which are Cast nodes with zero inputs, one initializer,
and one output) back into the next/output node. After fusion, the
link/edge between such a Cast node and the next/output node is removed.
The Cast nodes are removed as well.
```
"Input Graph" "Intermediate Representation" "FIGT Transforms"
-------- -------- -------- -------- --------
| X_Fp16 | | X_Fp16 | | W_Fp16 | | B_Fp16 | | X_Fp16 |
-------- -------- -------- -------- --------
| | | | |
| | | | |
| V V V V
| | Cast | | Cast | | Cast | | Cast |
| | Fp16 | | Fp16 | | Fp16 | | Fp16 |
| | To | | To | | To | | To |
| | Fp32 | | Fp32 | | Fp32 | | Fp32 |
| | | | |
| | | | |
V V V V V
---------------------------- ----------------------------------------- ----------------------------
| Conv_Fp16 | | | | Conv_Fp32 |
| --W_Fp16-- | ==> | Conv_Fp32 | ==> | --W_Fp32-- |
| --B_Fp16-- | | | | --B_Fp32-- |
---------------------------- ----------------------------------------- ----------------------------
| | |
| | |
| V V
| | Cast | | Cast |
| | Fp32 | | Fp32 |
| | To | | To |
| | Fp16 | | Fp16 |
| | |
| | |
V V V
-------- -------- --------
| Y_Fp16 | | Y_Fp16 | | Y_Fp16 |
-------- -------- --------
```
The newly added Graph Transforms perform the following actions.
* Detect Cast node/s with single FP16 initializer converting to FP32.
* Convert all such FP16 initializer/s to FP32 initializer/s.
* Fuse newly created FP32 initializer/s to relative FP32 node/s.
* Remove FP16 to FP32 Cast node/s.
This is run in a loop as follows. It excludes Level 1 and Partitioning
optimizations.
```
Level 2 --> Level 3 --> InsertCastTransforms --> FIGT
^ |
| "LOOP" |
| |
-------------------------------------------------
```
### Adding FIGT as a Level-4 Graph Transform.
This will have the following benefits.
1. Ability to turn off (any/all) Level 4 optimizations. We can use
the `disable optimizers` functionality to turn off one such
optimization during testing, or use the `-o` switch to turn off all
Level 4 optimizations when executing a model from the command line or
from Python (or any other) scripts.
2. The ability to rerun Level 2 and Level 3 optimizations remains intact
after Level 4 optimizations are applied. Adding Level 4 ensures that
FIGT (or any similar optimization) always runs after
InsertCastTransforms.
3. It keeps the current graph manipulations untouched and gives us more
flexibility to add future optimizations like adding `Int8 to Int32`
upconvert or `FP8 to FP16` upconvert under Level 4. Level 4 can, as of
now, work as a placeholder for any other such upcoming Graph
optimizations.
```
Level 2 --> Level 3 --> InsertCastTransforms --> Level 4
^ |
| "LOOP" |
| |
--------------------------------------------------
```
> Added a placeholder for Level-4 for graph transforms utils under
orttraining. This helps resolve any exceptions that may be encountered
during training sessions.
#### Re-running Level 2+ optimizations after Level 4 / FIGT
The idea behind re-running Level2+ graph transforms is that, after the
fusion of initializers with their respective nodes, the nodes are now in
a format that might be supported by other graph transforms that were
previously skipped. Hence, some of the transformations previously unable
to be applied are now valid and can be applied to create a more optimal
graph for execution.
### Added a new session option
"kOrtSessionOptionsGraphOptimizationsLoopLevel" to handle the graph
optimization loop.
* When set to 2 or above, it loops until no more optimizations are
applied at any level from Level 2 upward.
```
Level 2 --> Level 3 --> InsertCastTransforms --> Level 4
^ |
| "Loop" |
| |
---------------------------------------------------
```
* When set to 1 (default) it will loop until no more optimizations are
applied at Level 4 only.
```
Level 2 --> Level 3 --> InsertCastTransforms --> Level 4
^ |
| "Loop only depending on Level 4" |
| |
---------------------------------------------------
```
* When set to 0 it disables the loop.
```
Level 2 --> Level 3 --> InsertCastTransforms --> Level 4
^ |
| "No Loop" |
| |
X xxxxxxxxxxx X
```
### Documentation
We have not added any details related to Level 4 in the [Graph
Optimizations in ONNX
Runtime](https://onnxruntime.ai/docs/performance/model-optimizations/graph-optimizations.html)
documentation.
### OLD PR
This PR is created following a thorough discussion on the [OLD
PR](microsoft#24175).
Signed-off-by: Sunny Shukla <sunny.shukla@intel.com>
### Description
We might have a case where multiple Cast nodes in the chain cast back to
the original type. This fusion will remove extra nodes.
E.g.
`A ('float32') -> Cast (to='float16') -> Cast (to='int4') -> Cast (to='float32') -> Cast (to='float16') -> B`
will reduce to
`A ('float32') -> Cast (to='float16') -> B`
All the Cast nodes throughout the path need to have one input and one
output to be considered for the fusion.
### Motivation and Context
Gemma3 ONNX models used to have double casting, and many new models
created by the model builder might have it as well. Extra Casts might
reduce accuracy and increase inference time.
The original code also had a divide-by-zero error.
…soft#24895) ### Description Remove ep_weight_sharing_ctx_gen tool from QNN EP python wheel
* Add fp8 and int4 types to the supported list for the ONNX Runtime EP * Add support for int4 inputs. Map things to int8 for now, as we don't explicitly set an int4 input type or pack/unpack int4 operands * Add a flag to allow fp8 quantization through the ONNX Runtime API * Add fp8 quantization to the compile stage of the MIGraphX EP. Mirror the same calibration code we use for int8 and just change which quantize we call through the MIGraphX API * Clean up logging * Clean up and encapsulate the quantization/compile functions - Add additional fp8 flags that are shared with int8 - Add a lockout warning message when int8/fp8 are used at the same time * Run lintrunner pass * Fix session options inputs and add better logging. Previous runs using session options failed because we were not pulling in inputs from the Python interface. This, plus additional logging, allowed me to track which options were invoked via environment variables and which were added at the start of an inference session * Fix naming of the save/load path variables to be consistent with enable * Print only the environment variables that are set, as warnings, so the user knows which environment variables are active in the background and can ensure consistency between runs --------- ### Description Changes to clean up the MIGraphX EP quantization code as well as adding fp8 quantization support along with int4 support. The cleanup also handles a few issues seen with the Python interface when taking in provider options. ### Motivation and Context Required as we fix ignored flags when using provider_options for the MIGraphX EP. Adds fp8 quantization through the MIGraphX API. Adds int4 weight support for packed int4 weights for MIGraphX inference. --------- Co-authored-by: Ted Themistokleous <tedthemistokleous@amd.com>
### Description 1. Support activation broadcasting in the XNNPACK MatMul. 2. Fix a subtle bug when the activation is 1-D. Per the existing gating logic, 1-D activations were allowed, but the batch being passed through did not account for it: the batch size passed in was always `a->Shape()[0]`, which is actually the reduction dimension (K). This is incorrect because for a 1-D activation input a `1` is prepended to the shape, which means we should actually have passed `1` for the batch. This passed the relevant test, but I think it would have written outside the bounds of the output buffer because of the non-unit batch being passed through. ### Motivation and Context Resolve microsoft#24107 --------- Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
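The shape rule above, as a small illustrative helper (hypothetical name, not the EP's actual code):
```cpp
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

// A 1-D activation of shape [K] is treated as [1, K], so the batch (M) must be 1, not K.
int64_t GetBatchSize(const std::vector<int64_t>& activation_shape) {
  if (activation_shape.size() <= 1) return 1;  // [K] -> [1, K]
  // For rank >= 2, M is the product of all dimensions except the last one (K).
  return std::accumulate(activation_shape.begin(), activation_shape.end() - 1,
                         int64_t{1}, std::multiplies<int64_t>());
}
```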
### Description Add support for `bool` type to address the issue below. ### Motivation and Context This PR fixes microsoft#12286 Co-authored-by: Mauricio Cortazar <mcortazar@truora.com>
### Description Extend IAllocator to get allocator statistics: - Add `OrtAllocator::GetStats` and `AllocatorGetStats` C-API. - Add `Ort::Allocator::GetStats` Cxx API to parse the string and return it as a map. - Add UT. ### Motivation and Context Our system integrates multiple models for inference, each with varying memory demands. Providing a mechanism to retrieve detailed memory statistics would be useful for analyzing memory usage across models and devices more effectively.
### Description It seems like microsoft#24509 added a guard for the 8-bit MatMul tests that depends on an MLAS macro being set in order to compile and run on CPUs. However, that guard itself was preventing the inclusion of the MLAS header where the macro would have been set, so the 8-bit MatMul tests were not being compiled and run in CPU builds. ### Motivation and Context Improve test coverage for CPU builds
### Description For TreeEnsemble, onnxruntime tries to fuse multiple BRANCH_EQ nodes into one BRANCH_MEMBER node. When a tree contains only BRANCH_EQ nodes, the final tree can end up as a mix of BRANCH_EQ and BRANCH_MEMBER. To be more efficient, onnxruntime detects that all the nodes use the same rule and avoids checking that value for every node while getting the final leaf. This detection happened before the fusion into BRANCH_MEMBER. This PR makes sure the check is done again after the fusion. The extra cost only applies when a tree contains only BRANCH_EQ nodes, should be small, and is only incurred during initialization. ### Motivation and Context Fixes issue microsoft#24636. --------- Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
### Description Skip the `_tpause` call for `_M_ARM64EC` in spin_pause.cc. ### Motivation and Context Fix link error on ARM64EC for the `_tpause` unresolved external symbol.
…CI (microsoft#24948) ### Description A temporary fix to unblock the React Native Android CI. ### Motivation and Context After microsoft#24726 was merged to main, the React Native Android CI started failing with errors like ``` /mnt/vss/_work/1/s/js/react_native/e2e/node_modules/onnxruntime-react-native/android/src/main/java/ai/onnxruntime/reactnative/OnnxruntimeModule.java:329: error: cannot find symbol {"layout", SessionOptions.OptLevel.LAYOUT_OPT}, ^ symbol: variable LAYOUT_OPT location: class OptLevel ``` LAYOUT_OPT is defined in https://github.com/microsoft/onnxruntime/blob/8b3326e53249edb610cfe1648aff5c88f28b65f4/java/src/main/java/ai/onnxruntime/OrtSession.java#L656. The root cause of the build error is unknown. Since the layout level was only just added, it is not yet used by users, so it is safe to comment out the line to unblock the pipeline.
…icrosoft#24692) ### Description Add `kleidiai` to `onnxruntime_EXTERNAL_LIBRARIES` in `setup_kleidiai()`. This is important when building an Apple static framework. If `kleidiai` is not in `onnxruntime_EXTERNAL_LIBRARIES`, we may encounter link errors when using the static framework, e.g., when building an iOS app targeting ARM64 iphoneos. It happened to work in the iOS packages built by the packaging pipeline because those are built with the XNNPACK EP enabled. In that case, `kleidiai` is added to `onnxruntime_EXTERNAL_LIBRARIES` elsewhere. https://github.com/microsoft/onnxruntime/blob/0aaccafd41eca1580ec409d4ccd32cd1288c7e05/cmake/external/xnnpack.cmake#L95 https://github.com/microsoft/onnxruntime/blob/0aaccafd41eca1580ec409d4ccd32cd1288c7e05/cmake/external/onnxruntime_external_deps.cmake#L569-L572 `onnxruntime_EXTERNAL_LIBRARIES` should tolerate duplicate values. But `kleidiai` needs to be included at least once if we use it. ### Motivation and Context Fix Apple static framework build with KleidiAI when the XNNPACK EP is not enabled.
### Description Try to fix microsoft#24941 --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…RM64 (microsoft#24947) ## Problem The `libonnxruntime4j_jni.so` native library was incompatible with the 16KB page size configuration on ARM64 Android devices, while the main `libonnxruntime.so` was already compatible. This affected: - Modern Android devices using the 16KB page configuration - Apple Silicon Macs running Android emulators - Any ARM64 system configured with 16KB pages ## Root Cause The issue occurred because: 1. The main `libonnxruntime.so` is built as a SHARED library and inherits `CMAKE_SHARED_LINKER_FLAGS`, which contains the 16KB alignment flag (`-Wl,-z,max-page-size=16384`) 2. The `libonnxruntime4j_jni.so` is built as a MODULE library via the `onnxruntime_add_shared_library_module()` function 3. `CMAKE_SHARED_LINKER_FLAGS` only applies to SHARED libraries, not MODULE libraries 4. Therefore, the JNI library was missing the required 16KB alignment ## Solution Added `CMAKE_MODULE_LINKER_FLAGS` alongside the existing `CMAKE_SHARED_LINKER_FLAGS` in `cmake/adjust_global_compile_flags.cmake` to ensure MODULE libraries also receive the 16KB alignment flag on Android builds.
```cmake
if (ANDROID)
  # Build shared libraries with support for 16 KB ELF alignment
  # https://source.android.com/docs/core/architecture/16kb-page-size/16kb#build-lib-16kb-alignment
  set(CMAKE_SHARED_LINKER_FLAGS "${CMAKE_SHARED_LINKER_FLAGS} -Wl,-z,max-page-size=16384")
  # Also apply to MODULE libraries (like libonnxruntime4j_jni.so)
  set(CMAKE_MODULE_LINKER_FLAGS "${CMAKE_MODULE_LINKER_FLAGS} -Wl,-z,max-page-size=16384")
endif()
```
## Impact - ✅ `libonnxruntime.so`: Already compatible (no change) - ✅ `libonnxruntime4j_jni.so`: Now compatible (fixed) - ✅ All provider libraries: Compatible (inherit global flags) - ✅ Zero impact on non-Android platforms - ✅ Minimal change: only 2 lines added ## Testing The fix has been validated to: - Apply 16KB alignment to both SHARED and MODULE libraries on Android - Only affect Android builds (properly guarded by `if (ANDROID)`) - Follow existing CMake patterns in the codebase - Preserve all existing functionality Fixes microsoft#24902. --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: vraspar <51386888+vraspar@users.noreply.github.com>
### Description Support opset 23 RMSNormalization with CPU and CUDA kernels. https://github.com/onnx/onnx/blob/main/docs/Operators.md#RMSNormalization The PR uses LayerNormalization(simplified=True) under the hood. ### Motivation and Context Fix microsoft#24555
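For reference, RMSNormalization as described in the ONNX operator spec, which is what LayerNormalization with `simplified=True` computes (no mean subtraction or bias), over the normalized axes of size n:
```math
\mathrm{RMSNorm}(x)_i = \gamma_i \cdot \frac{x_i}{\sqrt{\tfrac{1}{n}\sum_{j=1}^{n} x_j^{2} + \epsilon}}
```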
### Description * Enable fp16 intB gemm kernels when zero points are not provided. * Minor changes to `fpA_intB_gemv/dispatcher.h` to fix a build error for sm < 5.3. * Minor changes to `fpA_intB_gemm_preprocessors_impl.h` to fix unreachable-code warnings in debug builds. Note that existing test cases like `MatMulNBits.Fp16_Int4_NoZeroPoint` cover this. ### Motivation and Context The zero point input is optional for MatMulNBits. In microsoft#24854, we only enabled fp16 intB gemm when zero points are provided.
…#24886) ### Description Add MIGraphX EP support for skipLayerNormalization via the supported-op list in the execution provider. ### Motivation and Context Turns on skipLayerNormalization instead of filtering it out of the ops supported by MIGraphX. Co-authored-by: Ted Themistokleous <tedthemistokleous@amd.com>
…4885) * Use the latest hipify-perl for ROCm 7.0 instead of a pinned version. Required so we can convert kernels with the latest hipify that supports the latest hipblas changes - Remove the hipify-perl version from the rocm-6.3.0-14776 build - Use the argument pushed to the amd_hipify.py script. Related to ROCm#69 * Remove the roctracer_hcc.h include. Not needed as roctracer_hip.h supersedes it * Remove the hipblas_v2_api reference. Just use hipblas directly * Use the local system hipify; no os path join needed * Add a HIPIFY path log message for the build ### Description Update to use the local version of hipify from the ROCm release. Relevant since we'd like to ensure the latest hipify changes are being used and tested when using the ROCm EP. ### Motivation and Context Latest and greatest for the ROCm EP to convert kernels via the hipify-perl mechanism Co-authored-by: Ted Themistokleous <tedthemistokleous@amd.com>
### Description - QNN's 16x16 FC doesn't support asymmetric int16 weights - Insert a Convert op to convert the asymmetric uint16 weight to a symmetric int16 weight - Add unit tests to verify the 16x16 Gemm translation. ### Motivation and Context This fix schedules 16x16 Gemm ops on the QNN HTP accelerator, which improves inference time of models containing 16x16 Gemm operators.
### Description - QNN's 16x16 FC doesn't support asymmetric int16 weights. - QNN's 16x16 MatMul doesn't support an asymmetric int16 weight initializer. - Insert a Convert op to convert the asymmetric uint16 weight to a symmetric int16 weight. - Add unit tests to verify the 16x16 MatMul translations. ### Motivation and Context - This fix schedules 16x16 MatMul ops on the QNN HTP accelerator. - This improves inference time of models containing 16x16 MatMul operators.
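In general terms, converting an asymmetric uint16 encoding (scale s_u, zero point z_u) to a symmetric int16 encoding (scale s_s, zero point 0) is the requantization below; this is the generic quantization math, not necessarily QNN's exact Convert-op semantics:
```math
r = s_u\,(q_u - z_u), \qquad q_s = \operatorname{clamp}\!\left(\operatorname{round}\!\left(\frac{r}{s_s}\right),\, -32768,\, 32767\right)
```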
…microsoft#24966) ### Description This reverts commit 11bcce3. [This change](https://dawn-review.googlesource.com/c/dawn/+/236054) from Dawn breaks some existing subgroup-related shaders. Follow-up: redo the upgrade once the upstream problem is resolved.
### Description This change restores FP16-math-based FlashAttention. ### Motivation and Context Earlier we noticed quality issues with deepseek-r1 attributed to overflow of the QK computation when performing math in FP16 precision. microsoft#24723 addressed it by promoting the math to FP32 to avoid the precision issue. However, the question remained: these models are trained with FP8 precision, so why does inference run into precision issues with FP16 math? Using FP32 math also resulted in a slight performance degradation. In this follow-up investigation, one issue identified is that we multiply the scale for GQA quite late in the computation. The scale is 0.088 for deepseek-r1, and multiplying the scale upfront seems to prevent the overflow issues. For now only the prefill shaders are updated to use this approach. Pending feedback on impact across models, the generation shader can also be restored to FP16 math. --------- Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
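Why scaling upfront is safe: by linearity the attention result is unchanged, but the intermediate products stay within FP16 range when the scale α < 1 is applied to Q before the matmul:
```math
\operatorname{softmax}\big((\alpha Q)K^{\top}\big)V \;=\; \operatorname{softmax}\big(\alpha\,(QK^{\top})\big)V
```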
This PR reduces the chance of crashes when a process is shutting down. The main idea: if we know the process is shutting down (or that the ORT DLL won't be reloaded), we do not need to run C++ object destructors. The approach is recommended by the official Windows SDK documentation and many Windows developers. For example, 18 years ago Raymond Chen wrote a blog post, [The old-fashioned theory on how processes exit](https://devblogs.microsoft.com/oldnewthing/20070502-00/?p=27023), which describes ORT's current behavior, and he also wrote [what a better approach is](https://devblogs.microsoft.com/oldnewthing/20120105-00/?p=8683). In our case, when onnxruntime is built as a Python package, the DLL (onnxruntime_pybind11_state.pyd) will never be manually unloaded. The same applies on Linux: Python does not unload DLLs on exit. Therefore, we do not need to worry about potential memory leaks caused by global variables, we do not need to call OrtEnv's destructor, and we do not need to unload any EP DLLs. In most cases, people do not unload DLLs on Windows, and on Linux it is even more complicated because GCC needs to maintain a unique table to avoid ODR violations, a feature that means most C++ shared libraries cannot be unloaded. So this change detects whether the OS is Windows and whether the process is shutting down when destructors are called; if so, the destructor does nothing. After this change, on Windows OrtEnv will not be destroyed in most cases. The only exception is when someone manually loads and unloads the DLL and does not have a global thread pool; such an advanced user should know to destroy all inference session objects and the OrtEnv singleton before unloading the DLL. Besides, if they have enabled the global thread pool, the DLL won't be unloaded unless they have shut down the thread pool and deleted the OrtEnv object. And even if the user has manually loaded/unloaded the DLL, there would still be some memory leaks (not related to this change). It's hard to get 100% clean.
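A minimal sketch of the standard Windows pattern for detecting process termination (the flag and its name are illustrative, not ORT's actual code): in `DllMain`, a non-null `lpvReserved` on `DLL_PROCESS_DETACH` means the whole process is exiting, so destructors can safely do nothing.
```cpp
#include <windows.h>
#include <atomic>

static std::atomic<bool> g_process_is_terminating{false};

BOOL WINAPI DllMain(HINSTANCE /*hinst*/, DWORD reason, LPVOID reserved) {
  if (reason == DLL_PROCESS_DETACH && reserved != nullptr) {
    // Non-null 'reserved' means the process itself is terminating,
    // not a manual FreeLibrary; cleanup can be safely skipped.
    g_process_is_terminating = true;
  }
  return TRUE;
}

// Example use inside the destructor of a long-lived singleton:
// if (g_process_is_terminating) return;  // let the OS reclaim everything
```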
### Description Add a function that checks whether all node input tensor ranks are supported by WebNN. Add the check for binary op types; checks for other special ops will be added in later PRs.
### Description Moved the dimension limit because it seems to apply only to Conv operations (texture memory is typically used for Conv operations on the GPU because it has slow writes but fast reads). A ChromaDB model had a Slice operation with an input > 16384, and the operation worked fine after the dimension check was moved. Also added extra checks for Softmax on MLProgram that allow more Softmax nodes to be placed on CoreML.
…24950) ### Description The motivation is to allow Windows 10 LTSC 2019 (currently the only LTSC supported until 2029) to run ONNX Runtime and the DML provider. Inspired by microsoft#24845 to remove the dxcore.dll dependency. Currently confirmed to work in a VM; real tests on hardware with DML-compatible devices will be performed very soon. --------- Co-authored-by: Julien Maille <julien.maille@data-pixel.com>
### Description Upgrade cudnn front end to 1.12 ### Motivation and Context https://github.com/NVIDIA/cudnn-frontend/releases/tag/v1.12.0 This replaces previous attempt to upgrade cudnn front end to 1.11: microsoft#24189
…23404) * Adding OrtValue support for the MGX EP --------- Authored-by: Uros Petkovic <urpektov@amd.com> --------- Co-authored-by: urpetkov-amd <127323899+urpetkov-amd@users.noreply.github.com>
### Description fix for [Linux CPU E2E pipeline](https://github.com/microsoft/onnxruntime/actions/runs/15484124592/job/43595217388#step:6:2678)
…osoft#24984) # Description This pull request refactors the default aligned memory allocation functions (AllocatorDefaultAllocAligned and AllocatorDefaultFreeAligned) to use modern C++17 standard features, removing the need for platform-specific preprocessor directives. # Motivation The existing implementation relies on #ifdef _MSC_VER to switch between _aligned_malloc/_aligned_free on Windows and posix_memalign/free on other platforms. While functional, this approach has several drawbacks: It increases code complexity and reduces readability. It relies on legacy C functions instead of standard C++ features. It requires manual error handling for each platform-specific path. By switching to C++17's type-safe aligned allocation, we can achieve the same functionality with a single, portable, and more maintainable implementation. --------- Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
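A minimal sketch of the portable C++17 approach described above, using over-aligned `operator new`/`operator delete`; the function names are illustrative, not ORT's exact implementation:
```cpp
#include <cstddef>
#include <new>

void* AllocAligned(std::size_t size, std::size_t alignment) {
  // Over-aligned operator new (C++17) replaces _aligned_malloc / posix_memalign.
  return ::operator new(size, std::align_val_t{alignment});
}

void FreeAligned(void* p, std::size_t alignment) noexcept {
  // The matching aligned delete must receive the same alignment.
  ::operator delete(p, std::align_val_t{alignment});
}
```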
Added support for Turing Arch. The changes make the ORT NV TensorRT RTX EP use the --computeCapabilities=1 flag by default so it can support engine builds for the Turing architecture as well.
…4994) This PR restores FP16 math in flash attention generation shader. It follows the changes in microsoft#24953 to use scale to multiply Q first instead of calculating it after QK to avoid data overflow in FP16.
### Description In Transforms.js, the `sequentially_access_by_threads` flag should be set to `true` **only** when the GPU vendor is Intel, as experiments have shown that Intel GPUs perform better with this setting enabled. Currently, ORT sets `sequentially_access_by_threads` to `true` regardless of the GPU vendor. However, based on my local testing, setting `sequentially_access_by_threads` to `false` consistently results in better performance across all platforms. In ONNX Runtime (ORT), this flag is only applied to Conv operators that are not using `vec4` packing (i.e., `MakeMatMulPackedSource`). For GEMM/MatMul operators without `vec4`, the flag remains `false`. Therefore, this change will only affect Conv test cases without `vec4`. This PR leads to performance improvements in certain convolution cases. ### Motivation and Context I tested with a local conv model (x[1,256,224,224], weight[63, 256, 3, 3], which doesn't use vec4); the results are:

| (ms) | M3Max | NVIDIA P620 | NVIDIA 5080 | Intel |
|------|-------|-------------|-------------|-------|
| sequentially_access_by_threads == true | 11.2 | 112 | 2.88 | 85.9 |
| sequentially_access_by_threads == false | **7** | **66** | **1.90** | **53.4** |
`is_channels_last` is being passed to MatMulProgram but not to MatMulNaiveProgram, causing issues for the musicgen model.
### Description The WebAssembly build needs the latest version so that build breaks with LLVM v20.1.4 (Emscripten 4.0.10) are fixed.
Follow-up to microsoft#24449. This PR integrates with ONNX 1.18.0.
### Description Split the DML nuget packaging jobs into a dedicated pipeline. Remove Windows 32-bit packages from the nuget pipelines. ### Motivation and Context To make the "Zip-Nuget-Java-Nodejs Packaging Pipeline" lighter.
### Description - Don't use the CUDA runtime API to set the device when a stream is already provided. - Expose an option to set a limit on the maximum shared memory TensorRT can use. - Fixed compilation issues for the deprecated APIs. - Small test fix. --------- Co-authored-by: Ankan Banerjee <anbanerjee@nvidia.com>
### Description In the TopK op builder, add Transposes around TopK to permute the target axis to the last position before the op and permute it back after. Additionally, since TopK's second output is indices, which may have INT64 dtype, add a Cast to cast the transformed INT32 back to INT64 if it is a graph output. ### Motivation and Context QNN only accepts TopK on the last axis, but ONNX/ORT's TopK has an axis attribute. Extend the TopK op builder to avoid falling back to CPU for non-last-axis TopK.
ankitm3k approved these changes on Jun 13, 2025