@jatinwadhwa921

Backmerging with Msft commits

skottmckay and others added 30 commits June 2, 2025 17:26
### Description
<!-- Describe your changes. -->
Do a manual load of dxcore.dll so that old Windows versions are still
supported.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
microsoft#24771

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…4877)

### Description

- `EtwRegistrationManager`: make sure all fields are initialized by the
constructor.
- Register a callback object instead of a pointer to it. Store it in the
map with a session-unique key.
- Register `ML_Ort_Provider_Etw_Callback` once for all sessions: the
first session registers it and the last one to go away removes it, and
the callback logs all sessions. For this we make callbacks ref-counted
inside the map they are stored in. This prevents a deadlock where
`active_sessions_mutex_` and `callback_mutex_` are acquired from
different threads in a different order.
- Create a registration guard to remove callbacks in case the
`InferenceSession` constructor does not finish (a sketch follows below).
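
A minimal sketch of the ref-counting plus registration-guard idea described
above. Names such as `CallbackRegistry` and `CallbackRegistrationGuard` are
illustrative, not the actual ORT internals:

```cpp
#include <functional>
#include <map>
#include <mutex>
#include <string>

// Sketch: callbacks are stored as ref-counted entries so the first
// registration creates them and the last unregistration removes them.
class CallbackRegistry {
 public:
  using Callback = std::function<void()>;

  void Register(const std::string& key, Callback cb) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = entries_.find(key);
    if (it == entries_.end()) {
      entries_.emplace(key, Entry{std::move(cb), 1});
    } else {
      ++it->second.ref_count;  // already registered; just bump the count
    }
  }

  void Unregister(const std::string& key) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = entries_.find(key);
    if (it != entries_.end() && --it->second.ref_count == 0) {
      entries_.erase(it);  // the last session to go away removes the callback
    }
  }

 private:
  struct Entry {
    Callback cb;
    int ref_count;
  };
  std::mutex mutex_;
  std::map<std::string, Entry> entries_;
};

// RAII guard: if the owning constructor does not finish, the callback is
// still unregistered when the guard goes out of scope.
class CallbackRegistrationGuard {
 public:
  CallbackRegistrationGuard(CallbackRegistry& registry, std::string key)
      : registry_(registry), key_(std::move(key)) {}
  ~CallbackRegistrationGuard() {
    if (!dismissed_) registry_.Unregister(key_);
  }
  void Dismiss() { dismissed_ = true; }  // call once ownership is handed over

 private:
  CallbackRegistry& registry_;
  std::string key_;
  bool dismissed_ = false;
};
```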


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This PR is inspired by
microsoft#24773.

The current code exhibits multiple issues:
- The `EtwRegistrationManager` constructor does not initialize all of the
fields, including the `InitializationStatus`.
- The global callback object is registered and re-created by every session.
Customers sometimes run thousands of models in the same process, which
results in quadratic ETW costs. The callback object is destroyed and
recreated every time a session is created.
- There is a chance that the `InferenceSession` constructor does not finish,
in which case the callback would remain registered. This may result in
intermittent, hard-to-diagnose bugs.
- `active_sessions_lock_` and the callback lock are not acquired/released
in the same order by different threads, which is a classic deadlock
scenario.
### Description
This PR extracts the core implementations into gemm_utils.cc, which is
used to generate shaders for both the GEMM and MatMul ops. The core
implementations include scalar and vec4 versions of GEMM and MatMul.



### Motivation and Context
There is a lot of common code between GEMM and MatMul, so we want to
extract the common code to unify their implementations.
![Blank diagram
(1)](https://github.com/user-attachments/assets/45f8d7ac-6705-4cea-8b8c-966ded6a6ca5)

---------

Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
### Description

Update unit tests for DNNL.
(1) Increase QAttentionTest threshold when DNNL is  used.
(2) Skip some failed tests when DNNL is used. 

### Motivation and Context

When I build the main branch with DNNL in an Azure Linux VM, some tests fail:
```
pip install --user -r tools/ci_build/github/linux/python/requirements.txt

python3 tools/ci_build/build.py --build_dir build/Release --config Release --cmake_generator Ninja --skip_submodule_sync --build_shared_lib --parallel --use_vcpkg --use_binskim_compliant_compile_flags --build_wheel --build_nuget --use_dnnl
```

```
[  FAILED  ] NhwcTransformerTests.ConvSplitQLinearConcat
[  FAILED  ] NhwcTransformerTests.Conv
[  FAILED  ] NhwcTransformerTests.ConvBlockBinary
[  FAILED  ] NhwcTransformerTests.ConvMaxPool
[  FAILED  ] NhwcTransformerTests.ConvAveragePool
[  FAILED  ] NhwcTransformerTests.ConvPad
[  FAILED  ] NhwcTransformerTests.ConvBlockActivation
[  FAILED  ] QDQTransformerTests.Conv_U8X8U8
[  FAILED  ] QDQTransformerTests.ConvMaxPoolReshape_UInt8
[  FAILED  ] QDQTransformerTests.ConvMaxPoolReshape_Int8
[  FAILED  ] QDQTransformerTests.ConvRelu
[  FAILED  ] QDQTransformerTests.ConvAveragePoolReshape_UInt8
[  FAILED  ] QDQTransformerTests.ConvAveragePoolReshape_Int8
[  FAILED  ] QDQTransformerTests.ConvTranspose_QBackward
[  FAILED  ] QDQTransformerTests.QBackward_MutilpleSteps
[  FAILED  ] QDQTransformerTests.ConvTranspose_DQForward
[  FAILED  ] QDQTransformerTests.DQForward_MutilpleSteps
[  FAILED  ] InferenceSessionTests.ModelMetadata
[  FAILED  ] ActivationOpTest.LeakyRelu_bfloat16
[  FAILED  ] QAttentionTest.QAttentionDNNLBatch1
[  FAILED  ] QAttentionTest.QAttentionDNNLBatch2
[  FAILED  ] QAttentionTest.QAttentionDNNLMaskPartialSequence
[  FAILED  ] QAttentionTest.QAttentionNoMaskIndex
[  FAILED  ] QAttentionTest.QAttentionPrunedModel
```
…#24548)

### Description
Update README.md: remove the build pipeline status section because it is
out of date.

Most of our pipelines are in GitHub now.
…24896)

### Description
Replace Upsample with Resize during quantization to avoid producing an invalid graph.

### Motivation and Context
After quantization, if the opset of the original ONNX model is less than 10, the opset of the QDQ model will be upgraded to 11.
However, Upsample is deprecated in opset 11, which would make the ONNX model invalid.
So we replace Upsample with Resize when the opset needs to be upgraded to 11.

---------

Co-authored-by: chuteng <chuteng@qti.qualcomm.com>
### Description
Fuse the transposed channel shuffle pattern into a QNN op -- ONNX does not have a native ChannelShuffle op.

### Motivation and Context
Improves performance on QNN EP.
Also fixes a pipeline error in MacOS_C_API_Packaging_CPU_x86_64 by disabling the
softmax NaN test for the CoreML EP, since it does not handle NaN.
### Description

Implement fpA intB gemm preprocess in cuda kernel to speed up weight
prepacking.

### Motivation and Context

The original preprocessing code (in
microsoft#24854) runs on the CPU, which
is slow and needs an extra memory copy between CPU and GPU.
### Description

Added a graph transform for mixed-precision graphs when FP16 compute is
unavailable. At session creation, this graph transform converts FP16
initializers (_which were changed to FP16-to-FP32 Cast nodes_) to FP32
initializers and fuses them with their following FP32 nodes.
 

- Behavior before this change:
"fp16 initializers -> cast_from_fp16_to_fp32 -> fp32 node/s"
 
- Behavior after this change:
"fp16 initializers converted to fp32 initializers then fused with fp32
node/s"

### Motivation and Context

This change aims to run the FP16 models without the repetitive casting
of FP16 initializers to FP32 initializers, by fusing FP32 initializers
with their next nodes, when FP16 compute is not available.

> For naming purposes, the newly added Graph Transforms in long form is
called "Fused Initializers Graph Transforms", and in short form is
called "FIGT".

### Working

Currently, the Fuse Initializers Graph Transform fuses Cast nodes that
cast from FP16 to FP32 back into their
next/output nodes. Below is an explanation of how this transform works.
It depends on ```InsertCastTransforms```
to produce the intermediate representation from which it fuses the
initializers (which are the Cast nodes with
zero inputs, one initializer, and one output) back into the next/output
node. After fusion, the link/edge between such
a Cast node and the next/output node is removed. The Cast nodes
are removed as well.

```
        "Input Graph"                       "Intermediate Representation"                 "FIGT Transforms"

          --------                   --------        --------        --------                 --------
         | X_Fp16 |                 | X_Fp16 |      | W_Fp16 |      | B_Fp16 |               | X_Fp16 |
          --------                   --------        --------        --------                 --------
             |                          |               |               |                        |
             |                          |               |               |                        |
             |                          V               V               V                        V
             |                       | Cast |        | Cast |        | Cast |                 | Cast |
             |                       | Fp16 |        | Fp16 |        | Fp16 |                 | Fp16 |
             |                       |  To  |        |  To  |        |  To  |                 |  To  |
             |                       | Fp32 |        | Fp32 |        | Fp32 |                 | Fp32 |
             |                          |               |               |                        |
             |                          |               |               |                        |
             V                          V               V               V                        V
 ----------------------------       -----------------------------------------       ----------------------------
|        Conv_Fp16           |     |                                         |     |         Conv_Fp32          |
|        --W_Fp16--          | ==> |                Conv_Fp32                | ==> |         --W_Fp32--         |
|        --B_Fp16--          |     |                                         |     |         --B_Fp32--         |
 ----------------------------       -----------------------------------------       ----------------------------
             |                                          |                                        |
             |                                          |                                        |
             |                                          V                                        V
             |                                       | Cast |                                 | Cast |
             |                                       | Fp32 |                                 | Fp32 |
             |                                       |  To  |                                 |  To  |
             |                                       | Fp16 |                                 | Fp16 |
             |                                          |                                        |
             |                                          |                                        |
             V                                          V                                        V
          --------                                   --------                                 --------
         | Y_Fp16 |                                 | Y_Fp16 |                               | Y_Fp16 |
          --------                                   --------                                 --------
```

The newly added Graph Transforms perform the following actions.

* Detect Cast node/s with single FP16 initializer converting to FP32.
* Convert all such FP16 initializer/s to FP32 initializer/s.
* Fuse newly created FP32 initializer/s to relative FP32 node/s.
* Remove FP16 to FP32 Cast node/s.

This is run in a loop as follows. It excludes Level 1 and Partitioning
optimizations.

```
 Level 2 --> Level 3 --> InsertCastTransforms --> FIGT
   ^                                                |
   |                 "LOOP"                         |
   |                                                |
   --------------------------------------------------
```

### Adding FIGT as a Level-4 Graph Transform.

This will have the following benefits.

1. The ability to turn off any or all of the Level 4 optimizations. We can
use the `disable optimizers` functionality to turn off one such
optimization during testing, or use the `-o` switch to turn off all
Level 4 optimizations when executing a model from the command line or
from Python (or any other) scripts.

2. The ability to rerun Level 2 and Level 3 optimizations remains intact
after Level 4 optimizations are applied. Adding Level 4 ensures that
FIGT (or any similar optimization) always runs after
InsertCastTransforms.

3. It keeps the current graph manipulations untouched and gives us more
flexibility to add future optimizations, such as `Int8 to Int32` or
`FP8 to FP16` up-conversion, under Level 4. Level 4 can, for now, serve
as a placeholder for other such upcoming graph optimizations.

```
 Level 2 --> Level 3 --> InsertCastTransforms --> Level 4
   ^                                                  |
   |                  "LOOP"                          |
   |                                                  |
   ----------------------------------------------------
```
> Added a placeholder for Level 4 in the graph transform utils under
orttraining. This helps resolve any exceptions that may be encountered
during training sessions.

#### Re-running Level 2+ optimizations after Level 4 / FIGT

The idea behind re-running Level2+ graph transforms is that, after the
fusion of initializers with their respective nodes, the nodes are now in
a format that might be supported by other graph transforms that were
previously skipped. Hence, some of the transformations previously unable
to be applied are now valid and can be applied to create a more optimal
graph for execution.

### Added a new session option
"kOrtSessionOptionsGraphOptimizationsLoopLevel" to handle the graph
optimization loop.

* When set to 2 or above, it will loop until no more optimizations are
applied at any level from Level 2 upward.

```
  Level 2 --> Level 3 --> InsertCastTransforms --> Level 4
    ^                                                 |
    |                 "Loop"                          |
    |                                                 |
    ---------------------------------------------------
```

* When set to 1 (the default), it will loop until no more optimizations
are applied at Level 4 only.

```
  Level 2 --> Level 3 --> InsertCastTransforms --> Level 4
    ^                                                 |
    |        "Loop only depending on Level 4"         |
    |                                                 |
    ---------------------------------------------------
```

* When set to 0 it disables the loop.

```
  Level 2 --> Level 3 --> InsertCastTransforms --> Level 4
    ^                                                 |
    |                 "No Loop"                       |
    |                                                 |
    X                xxxxxxxxxxx                      X
```
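
A hedged usage sketch with the C++ API. The config key constant is the one
named above; the header it lives in and the exact key string are assumptions
here, not a confirmed public surface:

```cpp
#include <onnxruntime_cxx_api.h>
// Assumption: the new constant sits alongside the other kOrtSessionOptions*
// keys in this header.
#include <onnxruntime_session_options_config_keys.h>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "figt-demo");
  Ort::SessionOptions so;

  // 0 = no loop, 1 (default) = loop while Level 4 keeps applying changes,
  // 2+ = loop while any level from Level 2 upward keeps applying changes.
  so.AddConfigEntry(kOrtSessionOptionsGraphOptimizationsLoopLevel, "2");

  Ort::Session session(env, ORT_TSTR("model.onnx"), so);
  return 0;
}
```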

### Documentation

We have not added any details related to Level 4 in the [Graph
Optimizations in ONNX
Runtime](https://onnxruntime.ai/docs/performance/model-optimizations/graph-optimizations.html)
documentation.

### OLD PR

This PR is created following a thorough discussion on the [OLD
PR](microsoft#24175).

Signed-off-by: Sunny Shukla <sunny.shukla@intel.com>
### Description
<!-- Describe your changes. -->
We might have a case where multiple Cast nodes in a chain cast back to
the original type. This fusion removes the extra nodes.
E.g.
`A ('float32') -> Cast (to='float16') -> Cast (to='int4') -> Cast
(to='float32') -> Cast (to='float16') -> B`
will reduce to
`A ('float32') -> Cast (to='float16') -> B`
Every Cast node along the path needs to have exactly one input and one
output to be considered for the fusion (a sketch of the rule follows
below).
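
Not the actual transformer code — a minimal sketch of the chain-collapsing
rule over a toy representation (the real pass also enforces the graph-level
conditions above):

```cpp
#include <string>
#include <vector>

// Toy model of a linear chain: the element type of A, then the 'to' type of
// each Cast node in order, ending at B.
struct CastChain {
  std::string input_type;          // e.g. "float32"
  std::vector<std::string> casts;  // e.g. {"float16", "int4", "float32", "float16"}
};

// Collapse the chain to at most one Cast. The real fusion pass also checks
// that every Cast has exactly one input and one output; this only shows the
// shape of the rewrite.
CastChain CollapseCasts(const CastChain& chain) {
  CastChain out{chain.input_type, {}};
  if (chain.casts.empty()) return out;
  const std::string& final_type = chain.casts.back();
  if (final_type != chain.input_type) {
    out.casts.push_back(final_type);  // keep a single Cast to the final type
  }                                   // otherwise all Casts are removed
  return out;
}

// CollapseCasts({"float32", {"float16", "int4", "float32", "float16"}})
//   -> {"float32", {"float16"}}   i.e.  A -> Cast(to=float16) -> B
```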


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Gemma3 ONNX models used to have double casting, and many new models
created by the model builder might have it as well. Extra Casts might
reduce accuracy and increase inference time.
The original code has a divide-by-zero error.
…soft#24895)

### Description
Remove ep_weight_sharing_ctx_gen tool from QNN EP python wheel
* Add fp8 and int4 types in supported list for Onnxruntime EP

* Add support for int4 inputs

Map things to int8 right now as we don't explicitly set an int4 input
type and pack/unpack int4 operands

* Add flag to allow for fp8 quantization through Onnxruntime API

* Add fp8 quantization to the compile stage of the MIGraphX EP

Mirror the same calibration code we use for int8 and just change which
quantize we call through the MIGraphx API

* cleanup logging

* Cleanup and encapsulate quantization / compile functions

- Add additional flags for fp8 that are shared with int8

- Add a lockout warning message when int8 and fp8 are used at the same time

* Run lintrunner pass

* Fix session options inputs + add better logging.

Previous runs using session options failed because we were not pulling in
inputs from the Python interface. This, plus additional logging, allowed
me to track which options were invoked via env and which were added during
the start of an inference session.

* Fix naming of the save/load path variables to be consistent with enable.

* Print only the environment variables that are set, as warnings

This is needed so the user knows which environment variables are active in
the background, to ensure consistency between runs.

---------

### Description
<!-- Describe your changes. -->
Changes to clean up the MIGraphX EP quantization code, as well as adding
fp8 quantization support along with int4 support.

The cleanup changes handle a few instances of issues seen with the Python
interface when taking in provider options.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Required as we fix ignored flags when using provider_options for the
MIGraphX EP.
Adds fp8 quantization through the MIGraphX API.
Adds int4 weight support for packed int4 weights for MIGraphX
inference.

---------

Co-authored-by: Ted Themistokleous <tedthemistokleous@amd.com>
### Description
1. Support activation broadcasting in the XNNPACK MatMul
2. Fix a subtle bug when the activation is 1-D
Per the existing gating logic, 1-D activations were allowed, but the
batch size being passed through did not account for it. The batch size
passed in was always `a->Shape()[0]`, which for a 1-D activation is
actually the reduction dimension (K). This is incorrect: for a 1-D
activation input, a `1` is to be prepended to the shape, which means we
should have actually passed `1` for the batch. This passed the relevant
test, but I think it would have written outside the bounds of the output
buffer because of the non-unit batch being passed through (see the
sketch below).
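
A small sketch of the batch bookkeeping the fix implies (hypothetical
helper, not the XNNPACK EP code): a 1-D activation of shape [K] is treated
as [1, K], so the batch is 1, not K:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// For MatMul-style activations, ONNX semantics prepend a 1 to a 1-D input,
// so an activation of shape [K] is treated as [1, K].
struct ActivationDims {
  int64_t batch;  // product of all dims except the last
  int64_t K;      // reduction dimension
};

// Assumes rank >= 1.
ActivationDims GetActivationDims(const std::vector<int64_t>& a_shape) {
  ActivationDims d{1, a_shape.back()};
  if (a_shape.size() == 1) {
    // 1-D activation: the batch is 1, NOT a_shape[0] (which is actually K).
    return d;
  }
  for (std::size_t i = 0; i + 1 < a_shape.size(); ++i) {
    d.batch *= a_shape[i];
  }
  return d;
}
```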


### Motivation and Context
Resolve microsoft#24107

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
### Description
Add support for `bool` type to address the issue below.

### Motivation and Context
This PR fixes microsoft#12286

Co-authored-by: Mauricio Cortazar <mcortazar@truora.com>
### Description
<!-- Describe your changes. -->
Extend IAllocator to report allocator statistics:
- Add `OrtAllocator::GetStats` and an `AllocatorGetStats` C API.
- Add an `Ort::Allocator::GetStats` C++ API that parses the stats string and
returns it as a map.
- Add UTs.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Our system integrates multiple models for inference, each with varying
memory demands. Providing a mechanism to retrieve detailed memory
statistics would be useful for analyzing memory usage across models and
devices more effectively.
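
A hypothetical usage sketch only. The PR text above says the C++ wrapper
parses the stats and returns them as a map, so the return type and key names
here are assumptions rather than the confirmed API:

```cpp
#include <iostream>
#include <onnxruntime_cxx_api.h>

// Hypothetical sketch: query per-allocator statistics after running a model.
void DumpAllocatorStats(Ort::Session& session) {
  Ort::MemoryInfo mem_info =
      Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
  Ort::Allocator allocator(session, mem_info);

  // Assumed signature: returns key/value statistics (e.g. "NumAllocs",
  // "MaxInUse") parsed into a map-like container.
  auto stats = allocator.GetStats();
  for (const auto& kv : stats) {
    std::cout << kv.first << " = " << kv.second << "\n";
  }
}
```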
### Description

It seems that microsoft#24509 added
a guard for the 8-bit MatMul tests that depends on an MLAS macro being
set in order to compile and run on CPUs, but that guard itself was
preventing the inclusion of the MLAS header where the macro would have
been set, so the 8-bit MatMul tests were not being compiled and run in
CPU builds.

### Motivation and Context
Improve test coverage for CPU builds
### Description

For TreeEnsemble, onnxruntime tries to fuse multiple BRANCH_EQ nodes
into one BRANCH_MEMBER node. When a tree only contains BRANCH_EQ nodes,
the final tree can be a mix of BRANCH_EQ and BRANCH_MEMBER. To be more
efficient, onnxruntime detects when all the nodes use the same rule and
avoids checking that value for every node while getting the final leaf.
This detection happened before the fusion into BRANCH_MEMBER. This PR
re-runs that check after the fusion. The extra cost only applies when a
tree contains only BRANCH_EQ nodes, should not be significant, and only
happens during initialization.

### Motivation and Context
Fixes issue microsoft#24636.

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
### Description
<!-- Describe your changes. -->

Skip `_tpause` call for `_M_ARM64EC` in spin_pause.cc.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Fix link error on ARM64EC for `_tpause` unresolved external symbol.
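
A minimal illustration of the kind of guard involved (not the exact
spin_pause.cc diff): `_tpause` is an x86 intrinsic, and ARM64EC builds define
the x64 macros while compiling to ARM64, so the call has to be excluded there:

```cpp
// Illustrative preprocessor shape only; the real change is in spin_pause.cc.
#if defined(_MSC_VER) && defined(_M_AMD64) && !defined(_M_ARM64EC)
#include <intrin.h>
#define SKETCH_HAS_X64_PAUSE 1
#endif

void SpinPause() {
#if defined(SKETCH_HAS_X64_PAUSE)
  // Genuine x64: pause/tpause-style wait hints are available and link fine.
  _mm_pause();
#else
  // ARM64EC defines the x64 macros but has no _tpause to link against,
  // so the call is skipped there.
#endif
}
```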
…CI (microsoft#24948)

### Description

A temporary fix to unblock react native android CI.

### Motivation and Context

After microsoft#24726 was merged to
main, the React Native Android CI started failing with an error like

```
/mnt/vss/_work/1/s/js/react_native/e2e/node_modules/onnxruntime-react-native/android/src/main/java/ai/onnxruntime/reactnative/OnnxruntimeModule.java:329: error: cannot find symbol
              {"layout", SessionOptions.OptLevel.LAYOUT_OPT},
                                                ^
  symbol:   variable LAYOUT_OPT
  location: class OptLevel
```

The LAYOUT_OPT is defined in
https://github.com/microsoft/onnxruntime/blob/8b3326e53249edb610cfe1648aff5c88f28b65f4/java/src/main/java/ai/onnxruntime/OrtSession.java#L656.

The root cause of the build error is unknown. Since the layout level was
just added, it is not yet used by users. It is safe to comment out the
line to unblock the pipeline.
…icrosoft#24692)

### Description
<!-- Describe your changes. -->

Add `kleidiai` to `onnxruntime_EXTERNAL_LIBRARIES` in
`setup_kleidiai()`.

This is important when building an Apple static framework.
If `kleidiai` is not in `onnxruntime_EXTERNAL_LIBRARIES`, we may
encounter link errors when using the static framework, e.g., when
building an iOS app targeting ARM64 iphoneos.

It happened to work in the iOS packages built by the packaging pipeline
because those are built with the XNNPACK EP enabled. In that case,
`kleidiai` is added to `onnxruntime_EXTERNAL_LIBRARIES` elsewhere.

https://github.com/microsoft/onnxruntime/blob/0aaccafd41eca1580ec409d4ccd32cd1288c7e05/cmake/external/xnnpack.cmake#L95

https://github.com/microsoft/onnxruntime/blob/0aaccafd41eca1580ec409d4ccd32cd1288c7e05/cmake/external/onnxruntime_external_deps.cmake#L569-L572

`onnxruntime_EXTERNAL_LIBRARIES` should tolerate duplicate values. But
`kleidiai` needs to be included at least once if we use it.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Fix Apple static framework build with KleidiAI when the XNNPACK EP is
not enabled.
### Description
<!-- Describe your changes. -->

Try to fix microsoft#24941

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…RM64 (microsoft#24947)

## Problem

The `libonnxruntime4j_jni.so` native library was incompatible with 16KB
page size configuration on ARM64 Android devices, while the main
`libonnxruntime.so` was already compatible. This affected:

- Modern Android devices using 16KB page configuration
- Apple Silicon Macs running Android emulators
- Any ARM64 system configured with 16KB pages

## Root Cause

The issue occurred because:

1. The main `libonnxruntime.so` is built as a SHARED library and
inherits `CMAKE_SHARED_LINKER_FLAGS` which contains the 16KB alignment
flag (`-Wl,-z,max-page-size=16384`)
2. The `libonnxruntime4j_jni.so` is built as a MODULE library via
`onnxruntime_add_shared_library_module()` function
3. `CMAKE_SHARED_LINKER_FLAGS` only applies to SHARED libraries, not
MODULE libraries
4. Therefore, the JNI library was missing the required 16KB alignment

## Solution

Added `CMAKE_MODULE_LINKER_FLAGS` alongside the existing
`CMAKE_SHARED_LINKER_FLAGS` in `cmake/adjust_global_compile_flags.cmake`
to ensure MODULE libraries also receive the 16KB alignment flag on
Android builds.

```cmake
if (ANDROID)
  # Build shared libraries with support for 16 KB ELF alignment
  # https://source.android.com/docs/core/architecture/16kb-page-size/16kb#build-lib-16kb-alignment
  set(CMAKE_SHARED_LINKER_FLAGS "${CMAKE_SHARED_LINKER_FLAGS} -Wl,-z,max-page-size=16384")
  # Also apply to MODULE libraries (like libonnxruntime4j_jni.so)
  set(CMAKE_MODULE_LINKER_FLAGS "${CMAKE_MODULE_LINKER_FLAGS} -Wl,-z,max-page-size=16384")
endif()
```

## Impact

- ✅ `libonnxruntime.so`: Already compatible (no change)
- ✅ `libonnxruntime4j_jni.so`: Now compatible (fixed)
- ✅ All provider libraries: Compatible (inherit global flags)
- ✅ Zero impact on non-Android platforms
- ✅ Minimal change: only 2 lines added

## Testing

The fix has been validated to:
- Apply 16KB alignment to both SHARED and MODULE libraries on Android
- Only affect Android builds (properly guarded by `if (ANDROID)`)
- Follow existing CMake patterns in the codebase
- Preserve all existing functionality

Fixes microsoft#24902.


---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: vraspar <51386888+vraspar@users.noreply.github.com>
### Description
<!-- Describe your changes. -->

Support opset 23 RMSNormalization with CPU and CUDA kernels.

https://github.com/onnx/onnx/blob/main/docs/Operators.md#RMSNormalization

The PR uses LayerNormalization(simplified=True) under the hood.
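
For reference, LayerNormalization with `simplified=True` is RMS
normalization: over the normalization axes it computes

$$Y = \frac{X}{\sqrt{\operatorname{mean}(X^2) + \epsilon}} \cdot \text{scale}$$

i.e. no mean subtraction and no bias, which matches the opset 23
RMSNormalization definition.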

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Fix microsoft#24555
### Description
* Enable fp16 intB gemm kernels when zero points are not provided.
* Minor changes to `fpA_intB_gemv/dispatcher.h` to fix a build error for
sm < 5.3.
* Minor changes to `fpA_intB_gemm_preprocessors_impl.h` to fix
unreachable-code warnings in debug builds.

Note that existing test cases such as
`MatMulNBits.Fp16_Int4_NoZeroPoint` already cover this.

### Motivation and Context
The zero point input is optional for MatMulNBits. In
microsoft#24854, we only enabled fp16
intB gemm when zero points are provided.
…#24886)

### Description
<!-- Describe your changes. -->
Add MIGraphX EP support for SkipLayerNormalization via the supported-op
list in the execution provider.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Turns on SkipLayerNormalization instead of filtering it out of the ops
supported by MIGraphX.

Co-authored-by: Ted Themistokleous <tedthemistokleous@amd.com>
…4885)

* Use the latest hipify-perl for ROCm 7.0 instead of pinned version

Required so we can convert kernels with the latest hipify, which supports
the latest hipblas changes.

- Remove the hipify-perl version pin from the rocm-6.3.0-14776 build
- Use the argument passed to the amd_hipify.py script.

related to ROCm#69

* Remove roctracer_hcc.h include

Not needed as roctracer_hip.h supersedes it

* Removal of hipblas_v2_api reference

Just make this use hipblas directly

* Use local system hipify

no os path join needed

* Add a HIPIFY path log message for the build

### Description
<!-- Describe your changes. -->
Update to use the local version of hipify from the ROCm release.
Relevant since we'd like to ensure the latest hipify changes are being
used and tested when using the ROCm EP.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Latest and greatest for ROCm EP to convert kernels via hipify-perl
mechanism

Co-authored-by: Ted Themistokleous <tedthemistokleous@amd.com>
quic-tirupath and others added 27 commits June 5, 2025 11:36
### Description
 - QNN's 16x16 FC doesn't support asymmetric int16 weights
- Insert a Convert op to convert the asymmetric uint16 weight to a symmetric int16 weight
 - Add unit tests to verify the 16x16 Gemm translation.


### Motivation and Context
This fix schedules 16x16 Gemm ops on the QNN HTP accelerator.
This improves the inference time of models that contain 16x16 Gemm operators.
### Description
 - QNN's 16x16 FC doesn't support asymmetric int16 weights
- QNN's 16x16 MatMul doesn't support asymmetric int16 weight initializers.
- Insert a Convert op to convert the asymmetric uint16 weight to a symmetric int16 weight.
 - Add unit tests to verify the 16x16 MatMul translations.



### Motivation and Context
- This fix schedules 16x16 MatMul ops on the QNN HTP accelerator.
- This improves the inference time of models that contain 16x16 MatMul operators.
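
In general terms, the inserted Convert is a requantization from one
(scale, zero-point) pair to another (hedging on QNN's exact Convert
semantics), based on the identity:

$$s_{\text{old}}\,(q_{\text{old}} - z_{\text{old}}) = s_{\text{new}}\,(q_{\text{new}} - z_{\text{new}}) \;\Rightarrow\; q_{\text{new}} = \operatorname{round}\!\left(\frac{s_{\text{old}}}{s_{\text{new}}}\,(q_{\text{old}} - z_{\text{old}})\right) + z_{\text{new}}$$

with $z_{\text{new}} = 0$ for the symmetric int16 target.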
…microsoft#24966)

### Description

This reverts commit 11bcce3.

[This change](https://dawn-review.googlesource.com/c/dawn/+/236054) from
Dawn breaks some existing shaders related to subgroups.

Follow-up: redo the upgrade when the upstream issue is resolved.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
This change restores the FP16-math-based FlashAttention.

### Motivation and Context
Earlier we noticed quality issues with deepseek-r1, attributed to
overflow of the QK computation when performing the math in FP16 precision.
microsoft#24723 addressed it by
promoting the math to FP32 to avoid the precision issue.

However, the question remained: these models are trained with FP8
precision, so how is it that inference runs into precision issues with
FP16 math? Using FP32 math also resulted in a slight performance
degradation.

In this follow-up investigation, one issue identified is that we
multiply by the scale quite late in the GQA computation. The scale is
0.088 for deepseek-r1. Multiplying by the scale upfront seems to prevent
the overflow issues.

For now only the prefill shaders are updated to use this approach.
Pending feedback on the impact across models, the generation shader can
also be restored to FP16 math.
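
A short worked note on why scaling first helps (numbers illustrative, not
from the model): since the scale $s$ is a scalar,

$$\operatorname{softmax}\big(s\,(Q K^{\top})\big) = \operatorname{softmax}\big((sQ)\,K^{\top}\big),$$

so $s$ can be folded into $Q$ before the dot products. With head dim 128 and
FP16 activations around magnitude 10, a raw $QK^{\top}$ element can reach on
the order of $128 \times 10 \times 10 = 12{,}800$, uncomfortably close to
FP16's maximum of 65504; pre-multiplying by $s \approx 0.088$ keeps the
partial sums roughly an order of magnitude smaller.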

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
This PR reduces the chance of crashes when a process is shutting down.
The main idea is: if we know the process is shutting down (or if we know
the ORT DLL won't be reloaded), we do not need to run C++ object
destructors. The approach is recommended by the Windows SDK's official
documentation and by many Windows developers. For example, 18 years ago
Raymond Chen wrote a blog post, [The old-fashioned theory on how processes
exit](https://devblogs.microsoft.com/oldnewthing/20070502-00/?p=27023),
which describes ORT's current behavior. Raymond Chen also wrote a blog
post on [what a better approach
is](https://devblogs.microsoft.com/oldnewthing/20120105-00/?p=8683).

In our case, when onnxruntime is built as a Python package, the DLL
(onnxruntime_pybind11_state.pyd) will never be manually unloaded. Same on
Linux: Python does not unload the DLLs on exit. Therefore we do not need
to worry about potential memory leaks caused by global variables, we do
not need to call OrtEnv's destructor, and we do not need to unload any EP
DLLs.

In most cases, people do not unload DLLs on Windows. On Linux it is even
more complicated, because GCC needs to maintain a unique table to avoid
ODR violations, and this feature means most C++ shared libraries cannot
be unloaded.

So this change detects whether the OS is Windows and whether the process
is shutting down when the destructors are called. If so, the destructors
do nothing (a sketch of the pattern is shown at the end of this
description).

After this change, on Windows, OrtEnv will not be destroyed in most
cases. The only exception: someone manually loads the DLL, manually
unloads it, and does not use a global thread pool. In that case the user
is an advanced user, and they should know to destroy all inference
session objects and the OrtEnv singleton before unloading the DLL.
Besides, if they have enabled the global thread pool, the DLL won't be
unloaded until they shut down the thread pool and delete the OrtEnv
object. And even if the user has manually loaded/unloaded the DLL, there
would still be some memory leaks (not related to this change); it's hard
to get 100% clean.
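
A sketch of the standard Windows pattern this description alludes to (ORT's
actual detection may differ): in `DllMain`, a non-null `lpvReserved` on
`DLL_PROCESS_DETACH` means the whole process is terminating, so destructors
can be skipped:

```cpp
// Sketch only; not ORT's actual shutdown code.
#include <atomic>
#ifdef _WIN32
#include <windows.h>

std::atomic<bool> g_process_is_terminating{false};

BOOL WINAPI DllMain(HINSTANCE /*hinst*/, DWORD reason, LPVOID reserved) {
  if (reason == DLL_PROCESS_DETACH && reserved != nullptr) {
    // Non-null `reserved` means the process is exiting, not FreeLibrary.
    g_process_is_terminating = true;
  }
  return TRUE;
}
#endif

struct EnvLike {
  ~EnvLike() {
#ifdef _WIN32
    if (g_process_is_terminating) return;  // let the OS reclaim everything
#endif
    // ... normal cleanup for the explicit-unload case ...
  }
};
```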
### Description
Add a function that checks whether all node input tensor ranks are
supported by WebNN. Add the check for binary op types; checks for other
special ops will be added in later PRs.
### Description
Moved the dimension limit check because it seems to only apply to conv
operations (texture memory is typically used for conv operations on the
GPU because it has slow writes but fast reads). A ChromaDB model had a
Slice operation with an input > 16384, and the operation worked fine
after the dimension check was moved.

Also added extra checks for Softmax on MLProgram that allow more
Softmax nodes to be moved to CoreML.
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
…24950)

### Description
The motivation is to allow Windows 10 LTSC 2019 (currently the only LTSC
version supported, until 2029) to run ONNX Runtime and the DML provider.
Inspired by microsoft#24845 to
remove the dxcore.dll dependency.

Currently confirmed to work in a VM.
Real tests on hardware with DML-compatible devices will be performed
very soon.

---------

Co-authored-by: Julien Maille <julien.maille@data-pixel.com>
### Description

Upgrade cudnn front end to 1.12

### Motivation and Context

https://github.com/NVIDIA/cudnn-frontend/releases/tag/v1.12.0

This replaces previous attempt to upgrade cudnn front end to 1.11:
microsoft#24189
…23404)

* Adding OrtValue support for MGX EP

---------

authored-by: Uros Petkovic <urpektov@amd.com>

### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: urpetkov-amd <127323899+urpetkov-amd@users.noreply.github.com>
…osoft#24984)

# Description
This pull request refactors the default aligned memory allocation
functions (AllocatorDefaultAllocAligned and AllocatorDefaultFreeAligned)
to use modern C++17 standard features, removing the need for
platform-specific preprocessor directives.

# Motivation
The existing implementation relies on `#ifdef _MSC_VER` to switch between
`_aligned_malloc`/`_aligned_free` on Windows and `posix_memalign`/`free` on
other platforms. While functional, this approach has several drawbacks:

- It increases code complexity and reduces readability.
- It relies on legacy C functions instead of standard C++ features.
- It requires manual error handling for each platform-specific path.

By switching to C++17's type-safe aligned allocation, we can achieve the
same functionality with a single, portable, and more maintainable
implementation.
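
A minimal sketch of the C++17 approach described above (assumed shape, not
the exact ORT functions): aligned `operator new`/`operator delete` with
`std::align_val_t` replaces the `_aligned_malloc` / `posix_memalign` split:

```cpp
#include <cstddef>
#include <new>

// Allocate `size` bytes aligned to `alignment` (must be a power of two)
// using C++17 aligned operator new.
void* AllocAligned(std::size_t size, std::size_t alignment) {
  // Non-throwing form: returns nullptr on failure, mirroring malloc-style APIs.
  return ::operator new(size, std::align_val_t{alignment}, std::nothrow);
}

void FreeAligned(void* p, std::size_t alignment) {
  // The matching aligned operator delete must be used for aligned allocations.
  ::operator delete(p, std::align_val_t{alignment});
}
```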

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Added support for the Turing arch.

The change makes the ORT NV TensorRT RTX EP use the
--computeCapabilities=1 flag by default so it can support engine builds
for the Turing arch as well.
…4994)

This PR restores FP16 math in the flash attention generation shader. It
follows the changes in microsoft#24953 to multiply Q by the scale first
instead of applying it after QK, to avoid data overflow in FP16.
### Description
<!-- Describe your changes. -->
In Transforms.js, the `sequentially_access_by_threads` flag should be
set to `true` **only** when the GPU vendor is Intel, as experiments have
shown that Intel GPUs perform better with this setting enabled.

Currently, ORT sets `sequentially_access_by_threads` to `true`
regardless of the GPU vendor.

However, based on my local testing, setting
`sequentially_access_by_threads` to `false` consistently results in
better performance across all platforms.

In ONNX Runtime (ORT), this flag is only applied to Conv operators that
are not using `vec4` packing (i.e., `MakeMatMulPackedSource`). For
GEMM/MatMul operators without `vec4`, the flag remains `false`.
Therefore, this change will only affect Conv test cases without `vec4`.
This PR leads to performance improvements in certain convolution cases.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

I tested with a local conv model (x[1,256,224,224], weight[63,256,3,3],
which doesn't use vec4); the results are:

| (ms) | M3 Max | NVIDIA P620 | NVIDIA 5080 | Intel |
|----------------|-------|------------|------------|-------|
| sequentially_access_by_threads == true | 11.2 | 112 | 2.88 | 85.9 |
| sequentially_access_by_threads == false | **7** | **66** | **1.90** | **53.4** |
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
`is_channels_last` is being passed to MatMulProgram but not to
MatMulNaiveProgram, causing issues for the musicgen model.
### Description

The WebAssembly build needs the latest version so that build breaks with
LLVM v20.1.4 (Emscripten 4.0.10) are fixed.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Follow up microsoft#24449 
This PR integrates with ONNX 1.18.0.
### Description
Split DML nuget packaging jobs to a dedicated pipeline
Remove Windows 32-bit packages from nuget pipelines.

### Motivation and Context
To make the "Zip-Nuget-Java-Nodejs Packaging Pipeline" lighter.
### Description
<!-- Describe your changes. -->

- Don't use the CUDA runtime API to set the device when a stream is
already provided.
- Expose an option to set a limit on the max shared memory TensorRT can use.

- Fixed the compilation issues for the deprecated APIs.
- Small test fix.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: Ankan Banerjee <anbanerjee@nvidia.com>
### Description
In the TopK op builder, add Transpose around TopK to permute the target axis to the last position before, and permute it back after.
Additionally, since TopK's second output is indices, which may have INT64 dtype, add a Cast to cast the transformed INT32 back to INT64 if it is a graph output.

### Motivation and Context
QNN only accepts TopK on the last axis, but ONNX/ORT's TopK has an axis attribute. Complement the TopK op builder to avoid falling back to CPU for non-last-axis TopK (a small permutation sketch follows below).
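
An illustrative helper (not the QNN builder code) for the permutation that
moves the TopK axis to the last position; the inverse permutation restores
the layout after TopK:

```cpp
#include <cstddef>
#include <vector>

// Build the Transpose perm that moves `axis` to the last position.
// e.g. rank 4, axis 1 -> {0, 2, 3, 1}; the inverse perm is used after TopK.
std::vector<std::size_t> MoveAxisToLastPerm(std::size_t rank, std::size_t axis) {
  std::vector<std::size_t> perm;
  perm.reserve(rank);
  for (std::size_t i = 0; i < rank; ++i) {
    if (i != axis) perm.push_back(i);
  }
  perm.push_back(axis);
  return perm;
}
```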
@ankitm3k ankitm3k merged commit f86768b into ovep-develop Jun 13, 2025
6 of 8 checks passed
@ankitm3k ankitm3k deleted the sync_msft_13_6_25 branch June 13, 2025 10:07