@jatinwadhwa921

Backmerging with Msft commits

skottmckay and others added 30 commits June 2, 2025 17:26
### Description
<!-- Describe your changes. -->
Do a manual load of dxcore.dll so that old Windows versions are still
supported.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
microsoft#24771

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…4877)

### Description

- `EtwRegistrationManager`: make sure all fields are initialized by the
constructor.
- Register a callback object instead of a pointer to it. Store it in the
map with a session-unique key.
- Register `ML_Ort_Provider_Etw_Callback` once for all sessions: the
first session registers it and the last one to go away removes it, and
the callback logs all sessions. For this we make callbacks ref-counted
inside the map they are stored in. This prevents a deadlock where
`active_sessions_mutex_` and `callback_mutex_` are acquired from
different threads in a different order.
- Create a registration guard to remove callbacks in case the
`InferenceSession` constructor does not finish (a sketch follows below).
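
A minimal sketch of the ref-counting plus registration-guard idea described
above. Names such as `CallbackRegistry` and `CallbackRegistrationGuard` are
illustrative, not the actual ORT internals:

```cpp
#include <functional>
#include <map>
#include <mutex>
#include <string>

// Sketch: callbacks are stored as ref-counted entries so the first
// registration creates them and the last unregistration removes them.
class CallbackRegistry {
 public:
  using Callback = std::function<void()>;

  void Register(const std::string& key, Callback cb) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = entries_.find(key);
    if (it == entries_.end()) {
      entries_.emplace(key, Entry{std::move(cb), 1});
    } else {
      ++it->second.ref_count;  // already registered; just bump the count
    }
  }

  void Unregister(const std::string& key) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = entries_.find(key);
    if (it != entries_.end() && --it->second.ref_count == 0) {
      entries_.erase(it);  // the last session to go away removes the callback
    }
  }

 private:
  struct Entry {
    Callback cb;
    int ref_count;
  };
  std::mutex mutex_;
  std::map<std::string, Entry> entries_;
};

// RAII guard: if the owning constructor does not finish, the callback is
// still unregistered when the guard goes out of scope.
class CallbackRegistrationGuard {
 public:
  CallbackRegistrationGuard(CallbackRegistry& registry, std::string key)
      : registry_(registry), key_(std::move(key)) {}
  ~CallbackRegistrationGuard() {
    if (!dismissed_) registry_.Unregister(key_);
  }
  void Dismiss() { dismissed_ = true; }  // call once ownership is handed over

 private:
  CallbackRegistry& registry_;
  std::string key_;
  bool dismissed_ = false;
};
```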


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This PR is inspired by
microsoft#24773.

The current code exhibits multiple issues:
- The `EtwRegistrationManager` constructor does not initialize all of the
fields, including the `InitializationStatus`.
- The global callback object is registered and re-created by every session.
Customers sometimes run thousands of models in the same process, which
results in quadratic ETW costs. The callback object is destroyed and
recreated every time a session is created.
- There is a chance that the `InferenceSession` constructor does not finish,
in which case the callback would remain registered. This may result in
intermittent, hard-to-diagnose bugs.
- `active_sessions_lock_` and the callback lock are not acquired/released
in the same order by different threads, which is a classic deadlock
scenario.
### Description
This PR extracts the core implementations into gemm_utils.cc, which is
used to generate shaders for both the GEMM and MatMul ops. The core
implementations include scalar and vec4 versions of GEMM and MatMul.



### Motivation and Context
There is a lot of common code between GEMM and MatMul, so we want to
extract the common code to unify their implementations.
![Blank diagram
(1)](https://github.com/user-attachments/assets/45f8d7ac-6705-4cea-8b8c-966ded6a6ca5)

---------

Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
### Description

Update unit tests for DNNL.
(1) Increase QAttentionTest threshold when DNNL is  used.
(2) Skip some failed tests when DNNL is used. 

### Motivation and Context

When I build the main branch with DNNL in an Azure Linux VM, some tests fail:
```
pip install --user -r tools/ci_build/github/linux/python/requirements.txt

python3 tools/ci_build/build.py --build_dir build/Release --config Release --cmake_generator Ninja --skip_submodule_sync --build_shared_lib --parallel --use_vcpkg --use_binskim_compliant_compile_flags --build_wheel --build_nuget --use_dnnl
```

```
[  FAILED  ] NhwcTransformerTests.ConvSplitQLinearConcat
[  FAILED  ] NhwcTransformerTests.Conv
[  FAILED  ] NhwcTransformerTests.ConvBlockBinary
[  FAILED  ] NhwcTransformerTests.ConvMaxPool
[  FAILED  ] NhwcTransformerTests.ConvAveragePool
[  FAILED  ] NhwcTransformerTests.ConvPad
[  FAILED  ] NhwcTransformerTests.ConvBlockActivation
[  FAILED  ] QDQTransformerTests.Conv_U8X8U8
[  FAILED  ] QDQTransformerTests.ConvMaxPoolReshape_UInt8
[  FAILED  ] QDQTransformerTests.ConvMaxPoolReshape_Int8
[  FAILED  ] QDQTransformerTests.ConvRelu
[  FAILED  ] QDQTransformerTests.ConvAveragePoolReshape_UInt8
[  FAILED  ] QDQTransformerTests.ConvAveragePoolReshape_Int8
[  FAILED  ] QDQTransformerTests.ConvTranspose_QBackward
[  FAILED  ] QDQTransformerTests.QBackward_MutilpleSteps
[  FAILED  ] QDQTransformerTests.ConvTranspose_DQForward
[  FAILED  ] QDQTransformerTests.DQForward_MutilpleSteps
[  FAILED  ] InferenceSessionTests.ModelMetadata
[  FAILED  ] ActivationOpTest.LeakyRelu_bfloat16
[  FAILED  ] QAttentionTest.QAttentionDNNLBatch1
[  FAILED  ] QAttentionTest.QAttentionDNNLBatch2
[  FAILED  ] QAttentionTest.QAttentionDNNLMaskPartialSequence
[  FAILED  ] QAttentionTest.QAttentionNoMaskIndex
[  FAILED  ] QAttentionTest.QAttentionPrunedModel
```
…#24548)

### Description
Update README.md: remove the build pipeline status section because it is
out of date.

Most of our pipelines are in GitHub now.
…24896)

### Description
Replace Upsample with Resize during quantization to avoid producing an invalid graph.

### Motivation and Context
After quantization, if the opset of the original ONNX model is less than 10, the opset of the QDQ model will be upgraded to 11.
However, Upsample is deprecated in opset 11, which would make the ONNX model invalid.
So we replace Upsample with Resize when the opset needs to be upgraded to 11.

---------

Co-authored-by: chuteng <chuteng@qti.qualcomm.com>
### Description
Fuse the transposed channel shuffle pattern into a QNN op -- ONNX does not have a native ChannelShuffle op.

### Motivation and Context
Improves performance on QNN EP.
Also fixes a pipeline error in MacOS_C_API_Packaging_CPU_x86_64 by disabling the
softmax NaN test for the CoreML EP, since it does not handle NaN.
### Description

Implement fpA intB gemm preprocess in cuda kernel to speed up weight
prepacking.

### Motivation and Context

The original preprocessing code (in
microsoft#24854) runs on the CPU, which
is slow and needs an extra memory copy between CPU and GPU.
### Description

Added a graph transform for mixed-precision graphs when FP16 compute is
unavailable. At session creation, this graph transform converts FP16
initializers (_which were changed to FP16-to-FP32 Cast nodes_) to FP32
initializers and fuses them with their following FP32 nodes.
 

- Behavior before this change:
"fp16 initializers -> cast_from_fp16_to_fp32 -> fp32 node/s"
 
- Behavior after this change:
"fp16 initializers converted to fp32 initializers then fused with fp32
node/s"

### Motivation and Context

This change aims to run the FP16 models without the repetitive casting
of FP16 initializers to FP32 initializers, by fusing FP32 initializers
with their next nodes, when FP16 compute is not available.

> For naming purposes, the newly added Graph Transforms in long form is
called "Fused Initializers Graph Transforms", and in short form is
called "FIGT".

### Working

Currently, the Fuse Initializers Graph Transform fuses Cast nodes that
cast from FP16 to FP32 back into their
next/output nodes. Below is an explanation of how this transform works.
It depends on ```InsertCastTransforms```
to produce the intermediate representation from which it fuses the
initializers (which are the Cast nodes with
zero inputs, one initializer, and one output) back into the next/output
node. After fusion, the link/edge between such
a Cast node and the next/output node is removed. The Cast nodes
are removed as well.

```
        "Input Graph"                       "Intermediate Representation"                 "FIGT Transforms"

          --------                   --------        --------        --------                 --------
         | X_Fp16 |                 | X_Fp16 |      | W_Fp16 |      | B_Fp16 |               | X_Fp16 |
          --------                   --------        --------        --------                 --------
             |                          |               |               |                        |
             |                          |               |               |                        |
             |                          V               V               V                        V
             |                       | Cast |        | Cast |        | Cast |                 | Cast |
             |                       | Fp16 |        | Fp16 |        | Fp16 |                 | Fp16 |
             |                       |  To  |        |  To  |        |  To  |                 |  To  |
             |                       | Fp32 |        | Fp32 |        | Fp32 |                 | Fp32 |
             |                          |               |               |                        |
             |                          |               |               |                        |
             V                          V               V               V                        V
 ----------------------------       -----------------------------------------       ----------------------------
|        Conv_Fp16           |     |                                         |     |         Conv_Fp32          |
|        --W_Fp16--          | ==> |                Conv_Fp32                | ==> |         --W_Fp32--         |
|        --B_Fp16--          |     |                                         |     |         --B_Fp32--         |
 ----------------------------       -----------------------------------------       ----------------------------
             |                                          |                                        |
             |                                          |                                        |
             |                                          V                                        V
             |                                       | Cast |                                 | Cast |
             |                                       | Fp32 |                                 | Fp32 |
             |                                       |  To  |                                 |  To  |
             |                                       | Fp16 |                                 | Fp16 |
             |                                          |                                        |
             |                                          |                                        |
             V                                          V                                        V
          --------                                   --------                                 --------
         | Y_Fp16 |                                 | Y_Fp16 |                               | Y_Fp16 |
          --------                                   --------                                 --------
```

The newly added Graph Transforms perform the following actions.

* Detect Cast node/s with single FP16 initializer converting to FP32.
* Convert all such FP16 initializer/s to FP32 initializer/s.
* Fuse newly created FP32 initializer/s to relative FP32 node/s.
* Remove FP16 to FP32 Cast node/s.

This is run in a loop as follows. It excludes Level 1 and Partitioning
optimizations.

```
 Level 2 --> Level 3 --> InsertCastTransforms --> FIGT
   ^                                                |
   |                 "LOOP"                         |
   |                                                |
   --------------------------------------------------
```

### Adding FIGT as a Level-4 Graph Transform.

This will have the following benefits.

1. The ability to turn off any or all of the Level 4 optimizations. We can
use the `disable optimizers` functionality to turn off one such
optimization during testing, or use the `-o` switch to turn off all
Level 4 optimizations when executing a model from the command line or
from Python (or any other) scripts.

2. The ability to rerun Level 2 and Level 3 optimizations remains intact
after Level 4 optimizations are applied. Adding Level 4 ensures that
FIGT (or any similar optimization) always runs after
InsertCastTransforms.

3. It keeps the current graph manipulations untouched and gives us more
flexibility to add future optimizations, such as `Int8 to Int32` or
`FP8 to FP16` up-conversion, under Level 4. Level 4 can, for now, serve
as a placeholder for other such upcoming graph optimizations.

```
 Level 2 --> Level 3 --> InsertCastTransforms --> Level 4
   ^                                                  |
   |                  "LOOP"                          |
   |                                                  |
   ----------------------------------------------------
```
> Added a placeholder for Level 4 in the graph transform utils under
orttraining. This helps resolve any exceptions that may be encountered
during training sessions.

#### Re-running Level 2+ optimizations after Level 4 / FIGT

The idea behind re-running Level2+ graph transforms is that, after the
fusion of initializers with their respective nodes, the nodes are now in
a format that might be supported by other graph transforms that were
previously skipped. Hence, some of the transformations previously unable
to be applied are now valid and can be applied to create a more optimal
graph for execution.

### Added a new session option
"kOrtSessionOptionsGraphOptimizationsLoopLevel" to handle the graph
optimization loop.

* When set to 2 or above, it will loop until no more optimizations are
applied at any level from Level 2 upward.

```
  Level 2 --> Level 3 --> InsertCastTransforms --> Level 4
    ^                                                 |
    |                 "Loop"                          |
    |                                                 |
    ---------------------------------------------------
```

* When set to 1 (the default), it will loop until no more optimizations
are applied at Level 4 only.

```
  Level 2 --> Level 3 --> InsertCastTransforms --> Level 4
    ^                                                 |
    |        "Loop only depending on Level 4"         |
    |                                                 |
    ---------------------------------------------------
```

* When set to 0 it disables the loop.

```
  Level 2 --> Level 3 --> InsertCastTransforms --> Level 4
    ^                                                 |
    |                 "No Loop"                       |
    |                                                 |
    X                xxxxxxxxxxx                      X
```
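
A hedged usage sketch with the C++ API. The config key constant is the one
named above; the header it lives in and the exact key string are assumptions
here, not a confirmed public surface:

```cpp
#include <onnxruntime_cxx_api.h>
// Assumption: the new constant sits alongside the other kOrtSessionOptions*
// keys in this header.
#include <onnxruntime_session_options_config_keys.h>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "figt-demo");
  Ort::SessionOptions so;

  // 0 = no loop, 1 (default) = loop while Level 4 keeps applying changes,
  // 2+ = loop while any level from Level 2 upward keeps applying changes.
  so.AddConfigEntry(kOrtSessionOptionsGraphOptimizationsLoopLevel, "2");

  Ort::Session session(env, ORT_TSTR("model.onnx"), so);
  return 0;
}
```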

### Documentation

We have not added any details related to Level 4 in the [Graph
Optimizations in ONNX
Runtime](https://onnxruntime.ai/docs/performance/model-optimizations/graph-optimizations.html)
documentation.

### OLD PR

This PR is created following a thorough discussion on the [OLD
PR](microsoft#24175).

Signed-off-by: Sunny Shukla <sunny.shukla@intel.com>
### Description
<!-- Describe your changes. -->
We might have a case where multiple Cast nodes in a chain cast back to
the original type. This fusion removes the extra nodes.
E.g.
`A ('float32') -> Cast (to='float16') -> Cast (to='int4') -> Cast
(to='float32') -> Cast (to='float16') -> B`
will reduce to
`A ('float32') -> Cast (to='float16') -> B`
Every Cast node along the path needs to have exactly one input and one
output to be considered for the fusion (a sketch of the rule follows
below).
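
Not the actual transformer code — a minimal sketch of the chain-collapsing
rule over a toy representation (the real pass also enforces the graph-level
conditions above):

```cpp
#include <string>
#include <vector>

// Toy model of a linear chain: the element type of A, then the 'to' type of
// each Cast node in order, ending at B.
struct CastChain {
  std::string input_type;          // e.g. "float32"
  std::vector<std::string> casts;  // e.g. {"float16", "int4", "float32", "float16"}
};

// Collapse the chain to at most one Cast. The real fusion pass also checks
// that every Cast has exactly one input and one output; this only shows the
// shape of the rewrite.
CastChain CollapseCasts(const CastChain& chain) {
  CastChain out{chain.input_type, {}};
  if (chain.casts.empty()) return out;
  const std::string& final_type = chain.casts.back();
  if (final_type != chain.input_type) {
    out.casts.push_back(final_type);  // keep a single Cast to the final type
  }                                   // otherwise all Casts are removed
  return out;
}

// CollapseCasts({"float32", {"float16", "int4", "float32", "float16"}})
//   -> {"float32", {"float16"}}   i.e.  A -> Cast(to=float16) -> B
```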


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Gemma3 ONNX models used to have double casting, and many new models
created by the model builder might have it as well. Extra Casts might
reduce accuracy and increase inference time.
The original code has a divide-by-zero error.
…soft#24895)

### Description
Remove ep_weight_sharing_ctx_gen tool from QNN EP python wheel
* Add fp8 and int4 types in supported list for Onnxruntime EP

* Add support for int4 inputs

Map things to int8 right now as we don't explicitly set an int4 input
type and pack/unpack int4 operands

* Add flag to allow for fp8 quantization through Onnxruntime API

* Add fp8 quantization to the compile stage of the MIGraphX EP

Mirror the same calibration code we use for int8 and just change which
quantize we call through the MIGraphx API

* cleanup logging

* Cleanup and encapsulate quantization / compile functions

- Add additional flags for fp8 that are shared with int8

- Add a lockout warning message when int8 and fp8 are used at the same time

* Run lintrunner pass

* Fix session options inputs + add better logging.

Previous runs using session options failed because we were not pulling in
inputs from the Python interface. This, plus additional logging, allowed
me to track which options were invoked via env and which were added during
the start of an inference session.

* Fix naming of the save/load path variables to be consistent with enable.

* Print only the environment variables that are set, as warnings

This is needed so the user knows which environment variables are active in
the background, to ensure consistency between runs.

---------

### Description
<!-- Describe your changes. -->
Changes to clean up the MIGraphX EP quantization code, as well as adding
fp8 quantization support along with int4 support.

The cleanup changes handle a few instances of issues seen with the Python
interface when taking in provider options.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Required as we fix ignored flags when using provider_options for the
MIGraphX EP.
Adds fp8 quantization through the MIGraphX API.
Adds int4 weight support for packed int4 weights for MIGraphX
inference.

---------

Co-authored-by: Ted Themistokleous <tedthemistokleous@amd.com>
### Description
1. Support activation broadcasting in the XNNPACK MatMul
2. Fix a subtle bug when the activation is 1-D
Per the existing gating logic, 1-D activations were allowed, but the
batch size being passed through did not account for it. The batch size
passed in was always `a->Shape()[0]`, which for a 1-D activation is
actually the reduction dimension (K). This is incorrect: for a 1-D
activation input, a `1` is to be prepended to the shape, which means we
should have actually passed `1` for the batch. This passed the relevant
test, but I think it would have written outside the bounds of the output
buffer because of the non-unit batch being passed through (see the
sketch below).
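
A small sketch of the batch bookkeeping the fix implies (hypothetical
helper, not the XNNPACK EP code): a 1-D activation of shape [K] is treated
as [1, K], so the batch is 1, not K:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// For MatMul-style activations, ONNX semantics prepend a 1 to a 1-D input,
// so an activation of shape [K] is treated as [1, K].
struct ActivationDims {
  int64_t batch;  // product of all dims except the last
  int64_t K;      // reduction dimension
};

// Assumes rank >= 1.
ActivationDims GetActivationDims(const std::vector<int64_t>& a_shape) {
  ActivationDims d{1, a_shape.back()};
  if (a_shape.size() == 1) {
    // 1-D activation: the batch is 1, NOT a_shape[0] (which is actually K).
    return d;
  }
  for (std::size_t i = 0; i + 1 < a_shape.size(); ++i) {
    d.batch *= a_shape[i];
  }
  return d;
}
```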


### Motivation and Context
Resolve microsoft#24107

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
### Description
Add support for `bool` type to address the issue below.

### Motivation and Context
This PR fixes microsoft#12286

Co-authored-by: Mauricio Cortazar <mcortazar@truora.com>
### Description
<!-- Describe your changes. -->
Extend IAllocator to report allocator statistics:
- Add `OrtAllocator::GetStats` and an `AllocatorGetStats` C API.
- Add an `Ort::Allocator::GetStats` C++ API that parses the stats string and
returns it as a map.
- Add UTs.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Our system integrates multiple models for inference, each with varying
memory demands. Providing a mechanism to retrieve detailed memory
statistics would be useful for analyzing memory usage across models and
devices more effectively.
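
A hypothetical usage sketch only. The PR text above says the C++ wrapper
parses the stats and returns them as a map, so the return type and key names
here are assumptions rather than the confirmed API:

```cpp
#include <iostream>
#include <onnxruntime_cxx_api.h>

// Hypothetical sketch: query per-allocator statistics after running a model.
void DumpAllocatorStats(Ort::Session& session) {
  Ort::MemoryInfo mem_info =
      Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
  Ort::Allocator allocator(session, mem_info);

  // Assumed signature: returns key/value statistics (e.g. "NumAllocs",
  // "MaxInUse") parsed into a map-like container.
  auto stats = allocator.GetStats();
  for (const auto& kv : stats) {
    std::cout << kv.first << " = " << kv.second << "\n";
  }
}
```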
### Description

It seems that microsoft#24509 added
a guard for the 8-bit MatMul tests that depends on an MLAS macro being
set in order to compile and run on CPUs, but that guard itself was
preventing the inclusion of the MLAS header where the macro would have
been set, so the 8-bit MatMul tests were not being compiled and run in
CPU builds.

### Motivation and Context
Improve test coverage for CPU builds
### Description

For TreeEnsemble, onnxruntime tries to fuse multiple BRANCH_EQ nodes
into one BRANCH_MEMBER node. When a tree only contains BRANCH_EQ nodes,
the final tree can be a mix of BRANCH_EQ and BRANCH_MEMBER. To be more
efficient, onnxruntime detects when all the nodes use the same rule and
avoids checking that value for every node while getting the final leaf.
This detection happened before the fusion into BRANCH_MEMBER. This PR
re-runs that check after the fusion. The extra cost only applies when a
tree contains only BRANCH_EQ nodes, should not be significant, and only
happens during initialization.

### Motivation and Context
Fixes issue microsoft#24636.

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
### Description
<!-- Describe your changes. -->

Skip `_tpause` call for `_M_ARM64EC` in spin_pause.cc.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Fix link error on ARM64EC for `_tpause` unresolved external symbol.
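
A minimal illustration of the kind of guard involved (not the exact
spin_pause.cc diff): `_tpause` is an x86 intrinsic, and ARM64EC builds define
the x64 macros while compiling to ARM64, so the call has to be excluded there:

```cpp
// Illustrative preprocessor shape only; the real change is in spin_pause.cc.
#if defined(_MSC_VER) && defined(_M_AMD64) && !defined(_M_ARM64EC)
#include <intrin.h>
#define SKETCH_HAS_X64_PAUSE 1
#endif

void SpinPause() {
#if defined(SKETCH_HAS_X64_PAUSE)
  // Genuine x64: pause/tpause-style wait hints are available and link fine.
  _mm_pause();
#else
  // ARM64EC defines the x64 macros but has no _tpause to link against,
  // so the call is skipped there.
#endif
}
```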
…CI (microsoft#24948)

### Description

A temporary fix to unblock react native android CI.

### Motivation and Context

After microsoft#24726 was merged to
main, the React Native Android CI started failing with an error like

```
/mnt/vss/_work/1/s/js/react_native/e2e/node_modules/onnxruntime-react-native/android/src/main/java/ai/onnxruntime/reactnative/OnnxruntimeModule.java:329: error: cannot find symbol
              {"layout", SessionOptions.OptLevel.LAYOUT_OPT},
                                                ^
  symbol:   variable LAYOUT_OPT
  location: class OptLevel
```

The LAYOUT_OPT is defined in
https://github.com/microsoft/onnxruntime/blob/8b3326e53249edb610cfe1648aff5c88f28b65f4/java/src/main/java/ai/onnxruntime/OrtSession.java#L656.

The root cause of the build error is unknown. Since the layout level was
just added, it is not yet used by users. It is safe to comment out the
line to unblock the pipeline.
…icrosoft#24692)

### Description
<!-- Describe your changes. -->

Add `kleidiai` to `onnxruntime_EXTERNAL_LIBRARIES` in
`setup_kleidiai()`.

This is important when building an Apple static framework.
If `kleidiai` is not in `onnxruntime_EXTERNAL_LIBRARIES`, we may
encounter link errors when using the static framework, e.g., when
building an iOS app targeting ARM64 iphoneos.

It happened to work in the iOS packages built by the packaging pipeline
because those are built with the XNNPACK EP enabled. In that case,
`kleidiai` is added to `onnxruntime_EXTERNAL_LIBRARIES` elsewhere.

https://github.com/microsoft/onnxruntime/blob/0aaccafd41eca1580ec409d4ccd32cd1288c7e05/cmake/external/xnnpack.cmake#L95

https://github.com/microsoft/onnxruntime/blob/0aaccafd41eca1580ec409d4ccd32cd1288c7e05/cmake/external/onnxruntime_external_deps.cmake#L569-L572

`onnxruntime_EXTERNAL_LIBRARIES` should tolerate duplicate values. But
`kleidiai` needs to be included at least once if we use it.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Fix Apple static framework build with KleidiAI when the XNNPACK EP is
not enabled.
### Description
<!-- Describe your changes. -->

Try to fix microsoft#24941

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…RM64 (microsoft#24947)

## Problem

The `libonnxruntime4j_jni.so` native library was incompatible with 16KB
page size configuration on ARM64 Android devices, while the main
`libonnxruntime.so` was already compatible. This affected:

- Modern Android devices using 16KB page configuration
- Apple Silicon Macs running Android emulators
- Any ARM64 system configured with 16KB pages

## Root Cause

The issue occurred because:

1. The main `libonnxruntime.so` is built as a SHARED library and
inherits `CMAKE_SHARED_LINKER_FLAGS` which contains the 16KB alignment
flag (`-Wl,-z,max-page-size=16384`)
2. The `libonnxruntime4j_jni.so` is built as a MODULE library via
`onnxruntime_add_shared_library_module()` function
3. `CMAKE_SHARED_LINKER_FLAGS` only applies to SHARED libraries, not
MODULE libraries
4. Therefore, the JNI library was missing the required 16KB alignment

## Solution

Added `CMAKE_MODULE_LINKER_FLAGS` alongside the existing
`CMAKE_SHARED_LINKER_FLAGS` in `cmake/adjust_global_compile_flags.cmake`
to ensure MODULE libraries also receive the 16KB alignment flag on
Android builds.

```cmake
if (ANDROID)
  # Build shared libraries with support for 16 KB ELF alignment
  # https://source.android.com/docs/core/architecture/16kb-page-size/16kb#build-lib-16kb-alignment
  set(CMAKE_SHARED_LINKER_FLAGS "${CMAKE_SHARED_LINKER_FLAGS} -Wl,-z,max-page-size=16384")
  # Also apply to MODULE libraries (like libonnxruntime4j_jni.so)
  set(CMAKE_MODULE_LINKER_FLAGS "${CMAKE_MODULE_LINKER_FLAGS} -Wl,-z,max-page-size=16384")
endif()
```

## Impact

- ✅ `libonnxruntime.so`: Already compatible (no change)
- ✅ `libonnxruntime4j_jni.so`: Now compatible (fixed)
- ✅ All provider libraries: Compatible (inherit global flags)
- ✅ Zero impact on non-Android platforms
- ✅ Minimal change: only 2 lines added

## Testing

The fix has been validated to:
- Apply 16KB alignment to both SHARED and MODULE libraries on Android
- Only affect Android builds (properly guarded by `if (ANDROID)`)
- Follow existing CMake patterns in the codebase
- Preserve all existing functionality

Fixes microsoft#24902.


---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: vraspar <51386888+vraspar@users.noreply.github.com>
### Description
<!-- Describe your changes. -->

Support opset 23 RMSNormalization with CPU and CUDA kernels.

https://github.com/onnx/onnx/blob/main/docs/Operators.md#RMSNormalization

The PR uses LayerNormalization(simplified=True) under the hood.
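
For reference, LayerNormalization with `simplified=True` is RMS
normalization: over the normalization axes it computes

$$Y = \frac{X}{\sqrt{\operatorname{mean}(X^2) + \epsilon}} \cdot \text{scale}$$

i.e. no mean subtraction and no bias, which matches the opset 23
RMSNormalization definition.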

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Fix microsoft#24555
### Description
* Enable fp16 intB gemm kernels when zero points are not provided.
* Minor changes to `fpA_intB_gemv/dispatcher.h` to fix a build error for
sm < 5.3.
* Minor changes to `fpA_intB_gemm_preprocessors_impl.h` to fix
unreachable-code warnings in debug builds.

Note that existing test cases such as
`MatMulNBits.Fp16_Int4_NoZeroPoint` already cover this.

### Motivation and Context
The zero point input is optional for MatMulNBits. In
microsoft#24854, we only enabled fp16
intB gemm when zero points are provided.
…#24886)

### Description
<!-- Describe your changes. -->
Add MIGraphX EP support for SkipLayerNormalization via the supported-op
list in the execution provider.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Turns on SkipLayerNormalization instead of filtering it out of the ops
supported by MIGraphX.

Co-authored-by: Ted Themistokleous <tedthemistokleous@amd.com>
…4885)

* Use the latest hipify-perl for ROCm 7.0 instead of pinned version

Required so we can convert kernels with the latest hipify, which supports
the latest hipblas changes.

- Remove the hipify-perl version pin from the rocm-6.3.0-14776 build
- Use the argument passed to the amd_hipify.py script.

related to ROCm#69

* Remove roctracer_hcc.h include

Not needed as roctracer_hip.h supersedes it

* Removal of hipblas_v2_api reference

Just make this use hipblas directly

* Use local system hipify

no os path join needed

* Add a HIPIFY path log message for the build

### Description
<!-- Describe your changes. -->
Update to use the local version of hipify from the ROCm release.
Relevant since we'd like to ensure the latest hipify changes are being
used and tested when using the ROCm EP.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Latest and greatest for ROCm EP to convert kernels via hipify-perl
mechanism

Co-authored-by: Ted Themistokleous <tedthemistokleous@amd.com>
quic-tirupath and others added 27 commits June 5, 2025 11:36
### Description
 - QNN's 16x16 FC doesn't support asymmetric int16 weights
- Insert a Convert op to convert the asymmetric uint16 weight to a symmetric int16 weight
 - Add unit tests to verify the 16x16 Gemm translation.


### Motivation and Context
This fix schedules 16x16 Gemm ops on the QNN HTP accelerator.
This improves the inference time of models that contain 16x16 Gemm operators.
### Description
 - QNN's 16x16 FC doesn't support asymmetric int16 weights
- QNN's 16x16 MatMul doesn't support asymmetric int16 weight initializers.
- Insert a Convert op to convert the asymmetric uint16 weight to a symmetric int16 weight.
 - Add unit tests to verify the 16x16 MatMul translations.



### Motivation and Context
- This fix schedules 16x16 MatMul ops on the QNN HTP accelerator.
- This improves the inference time of models that contain 16x16 MatMul operators.
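
In general terms, the inserted Convert is a requantization from one
(scale, zero-point) pair to another (hedging on QNN's exact Convert
semantics), based on the identity:

$$s_{\text{old}}\,(q_{\text{old}} - z_{\text{old}}) = s_{\text{new}}\,(q_{\text{new}} - z_{\text{new}}) \;\Rightarrow\; q_{\text{new}} = \operatorname{round}\!\left(\frac{s_{\text{old}}}{s_{\text{new}}}\,(q_{\text{old}} - z_{\text{old}})\right) + z_{\text{new}}$$

with $z_{\text{new}} = 0$ for the symmetric int16 target.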
…microsoft#24966)

### Description

This reverts commit 11bcce3.

[This change](https://dawn-review.googlesource.com/c/dawn/+/236054) from
Dawn breaks some existing shaders related to subgroups.

Follow-up: redo the upgrade when the upstream issue is resolved.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
This change restores the FP16-math-based FlashAttention.

### Motivation and Context
Earlier we noticed quality issues with deepseek-r1, attributed to
overflow of the QK computation when performing the math in FP16 precision.
microsoft#24723 addressed it by
promoting the math to FP32 to avoid the precision issue.

However, the question remained: these models are trained with FP8
precision, so how is it that inference runs into precision issues with
FP16 math? Using FP32 math also resulted in a slight performance
degradation.

In this follow-up investigation, one issue identified is that we
multiply by the scale quite late in the GQA computation. The scale is
0.088 for deepseek-r1. Multiplying by the scale upfront seems to prevent
the overflow issues.

For now only the prefill shaders are updated to use this approach.
Pending feedback on the impact across models, the generation shader can
also be restored to FP16 math.
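
A short worked note on why scaling first helps (numbers illustrative, not
from the model): since the scale $s$ is a scalar,

$$\operatorname{softmax}\big(s\,(Q K^{\top})\big) = \operatorname{softmax}\big((sQ)\,K^{\top}\big),$$

so $s$ can be folded into $Q$ before the dot products. With head dim 128 and
FP16 activations around magnitude 10, a raw $QK^{\top}$ element can reach on
the order of $128 \times 10 \times 10 = 12{,}800$, uncomfortably close to
FP16's maximum of 65504; pre-multiplying by $s \approx 0.088$ keeps the
partial sums roughly an order of magnitude smaller.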

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
This PR reduces the chance of crashes when a process is shutting down.
The main idea is: if we know the process is shutting down (or if we know
the ORT DLL won't be reloaded), we do not need to run C++ object
destructors. The approach is recommended by the Windows SDK's official
documentation and by many Windows developers. For example, 18 years ago
Raymond Chen wrote a blog post, [The old-fashioned theory on how processes
exit](https://devblogs.microsoft.com/oldnewthing/20070502-00/?p=27023),
which describes ORT's current behavior. Raymond Chen also wrote a blog
post on [what a better approach
is](https://devblogs.microsoft.com/oldnewthing/20120105-00/?p=8683).

In our case, when onnxruntime is built as a Python package, the DLL
(onnxruntime_pybind11_state.pyd) will never be manually unloaded. Same on
Linux: Python does not unload the DLLs on exit. Therefore we do not need
to worry about potential memory leaks caused by global variables, we do
not need to call OrtEnv's destructor, and we do not need to unload any EP
DLLs.

In most cases, people do not unload DLLs on Windows. On Linux it is even
more complicated, because GCC needs to maintain a unique table to avoid
ODR violations, and this feature means most C++ shared libraries cannot
be unloaded.

So this change detects whether the OS is Windows and whether the process
is shutting down when the destructors are called. If so, the destructors
do nothing (a sketch of the pattern is shown at the end of this
description).

After this change, on Windows, OrtEnv will not be destroyed in most
cases. The only exception: someone manually loads the DLL, manually
unloads it, and does not use a global thread pool. In that case the user
is an advanced user, and they should know to destroy all inference
session objects and the OrtEnv singleton before unloading the DLL.
Besides, if they have enabled the global thread pool, the DLL won't be
unloaded until they shut down the thread pool and delete the OrtEnv
object. And even if the user has manually loaded/unloaded the DLL, there
would still be some memory leaks (not related to this change); it's hard
to get 100% clean.
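
A sketch of the standard Windows pattern this description alludes to (ORT's
actual detection may differ): in `DllMain`, a non-null `lpvReserved` on
`DLL_PROCESS_DETACH` means the whole process is terminating, so destructors
can be skipped:

```cpp
// Sketch only; not ORT's actual shutdown code.
#include <atomic>
#ifdef _WIN32
#include <windows.h>

std::atomic<bool> g_process_is_terminating{false};

BOOL WINAPI DllMain(HINSTANCE /*hinst*/, DWORD reason, LPVOID reserved) {
  if (reason == DLL_PROCESS_DETACH && reserved != nullptr) {
    // Non-null `reserved` means the process is exiting, not FreeLibrary.
    g_process_is_terminating = true;
  }
  return TRUE;
}
#endif

struct EnvLike {
  ~EnvLike() {
#ifdef _WIN32
    if (g_process_is_terminating) return;  // let the OS reclaim everything
#endif
    // ... normal cleanup for the explicit-unload case ...
  }
};
```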
### Description
Add a function that checks whether all node input tensor ranks are
supported by WebNN. Add the check for binary op types; checks for other
special ops will be added in later PRs.
### Description
Moved the dimension limit check because it seems to only apply to conv
operations (texture memory is typically used for conv operations on the
GPU because it has slow writes but fast reads). A ChromaDB model had a
Slice operation with an input > 16384, and the operation worked fine
after the dimension check was moved.

Also added extra checks for Softmax on MLProgram that allow more
Softmax nodes to be moved to CoreML.
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
…24950)

### Description
The motivation is to allow Windows 10 LTSC 2019 (currently the only LTSC
version supported, until 2029) to run ONNX Runtime and the DML provider.
Inspired by microsoft#24845 to
remove the dxcore.dll dependency.

Currently confirmed to work in a VM.
Real tests on hardware with DML-compatible devices will be performed
very soon.

---------

Co-authored-by: Julien Maille <julien.maille@data-pixel.com>
### Description

Upgrade cudnn front end to 1.12

### Motivation and Context

https://github.com/NVIDIA/cudnn-frontend/releases/tag/v1.12.0

This replaces previous attempt to upgrade cudnn front end to 1.11:
microsoft#24189
…23404)

* Adding OrtValue support for MGX EP

---------

authored-by: Uros Petkovic <urpektov@amd.com>

### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: urpetkov-amd <127323899+urpetkov-amd@users.noreply.github.com>
…osoft#24984)

# Description
This pull request refactors the default aligned memory allocation
functions (AllocatorDefaultAllocAligned and AllocatorDefaultFreeAligned)
to use modern C++17 standard features, removing the need for
platform-specific preprocessor directives.

# Motivation
The existing implementation relies on `#ifdef _MSC_VER` to switch between
`_aligned_malloc`/`_aligned_free` on Windows and `posix_memalign`/`free` on
other platforms. While functional, this approach has several drawbacks:

- It increases code complexity and reduces readability.
- It relies on legacy C functions instead of standard C++ features.
- It requires manual error handling for each platform-specific path.

By switching to C++17's type-safe aligned allocation, we can achieve the
same functionality with a single, portable, and more maintainable
implementation.
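
A minimal sketch of the C++17 approach described above (assumed shape, not
the exact ORT functions): aligned `operator new`/`operator delete` with
`std::align_val_t` replaces the `_aligned_malloc` / `posix_memalign` split:

```cpp
#include <cstddef>
#include <new>

// Allocate `size` bytes aligned to `alignment` (must be a power of two)
// using C++17 aligned operator new.
void* AllocAligned(std::size_t size, std::size_t alignment) {
  // Non-throwing form: returns nullptr on failure, mirroring malloc-style APIs.
  return ::operator new(size, std::align_val_t{alignment}, std::nothrow);
}

void FreeAligned(void* p, std::size_t alignment) {
  // The matching aligned operator delete must be used for aligned allocations.
  ::operator delete(p, std::align_val_t{alignment});
}
```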

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Added support for the Turing arch.

The change makes the ORT NV TensorRT RTX EP use the
--computeCapabilities=1 flag by default so it can support engine builds
for the Turing arch as well.
…4994)

This PR restores FP16 math in the flash attention generation shader. It
follows the changes in microsoft#24953 to multiply Q by the scale first
instead of applying it after QK, to avoid data overflow in FP16.
### Description
<!-- Describe your changes. -->
In Transforms.js, the `sequentially_access_by_threads` flag should be
set to `true` **only** when the GPU vendor is Intel, as experiments have
shown that Intel GPUs perform better with this setting enabled.

Currently, ORT sets `sequentially_access_by_threads` to `true`
regardless of the GPU vendor.

However, based on my local testing, setting
`sequentially_access_by_threads` to `false` consistently results in
better performance across all platforms.

In ONNX Runtime (ORT), this flag is only applied to Conv operators that
are not using `vec4` packing (i.e., `MakeMatMulPackedSource`). For
GEMM/MatMul operators without `vec4`, the flag remains `false`.
Therefore, this change will only affect Conv test cases without `vec4`.
This PR leads to performance improvements in certain convolution cases.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

I tested with a local conv model (x[1,256,224,224], weight[63,256,3,3],
which doesn't use vec4); the results are:

| (ms) | M3 Max | NVIDIA P620 | NVIDIA 5080 | Intel |
|----------------|-------|------------|------------|-------|
| sequentially_access_by_threads == true | 11.2 | 112 | 2.88 | 85.9 |
| sequentially_access_by_threads == false | **7** | **66** | **1.90** | **53.4** |
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
`is_channels_last` is being passed to MatMulProgram but not to
MatMulNaiveProgram, causing issues for the musicgen model.
### Description

The WebAssembly build needs the latest version so that build breaks with
LLVM v20.1.4 (Emscripten 4.0.10) are fixed.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Follow up microsoft#24449 
This PR integrates with ONNX 1.18.0.
### Description
Split DML nuget packaging jobs to a dedicated pipeline
Remove Windows 32-bit packages from nuget pipelines.

### Motivation and Context
To make the "Zip-Nuget-Java-Nodejs Packaging Pipeline" lighter.
### Description
<!-- Describe your changes. -->

- Don't use the CUDA runtime API to set the device when a stream is
already provided.
- Expose an option to set a limit on the max shared memory TensorRT can use.

- Fixed the compilation issues for the deprecated APIs.
- Small test fix.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: Ankan Banerjee <anbanerjee@nvidia.com>
### Description
In the TopK op builder, add Transpose around TopK to permute the target axis to the last position before, and permute it back after.
Additionally, since TopK's second output is indices, which may have INT64 dtype, add a Cast to cast the transformed INT32 back to INT64 if it is a graph output.

### Motivation and Context
QNN only accepts TopK on the last axis, but ONNX/ORT's TopK has an axis attribute. Complement the TopK op builder to avoid falling back to CPU for non-last-axis TopK (a small permutation sketch follows below).
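
An illustrative helper (not the QNN builder code) for the permutation that
moves the TopK axis to the last position; the inverse permutation restores
the layout after TopK:

```cpp
#include <cstddef>
#include <vector>

// Build the Transpose perm that moves `axis` to the last position.
// e.g. rank 4, axis 1 -> {0, 2, 3, 1}; the inverse perm is used after TopK.
std::vector<std::size_t> MoveAxisToLastPerm(std::size_t rank, std::size_t axis) {
  std::vector<std::size_t> perm;
  perm.reserve(rank);
  for (std::size_t i = 0; i < rank; ++i) {
    if (i != axis) perm.push_back(i);
  }
  perm.push_back(axis);
  return perm;
}
```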
@ankitm3k ankitm3k merged commit f86768b into ovep-develop Jun 13, 2025
6 of 8 checks passed
@ankitm3k ankitm3k deleted the sync_msft_13_6_25 branch June 13, 2025 10:07