@jatinwadhwa921

Backmerging with Msft commits

snnn and others added 30 commits April 10, 2025 08:26
The Azure DevOps pipeline template
[/nuget/templates/dml-vs-2022.yml](https://github.com/microsoft/onnxruntime/blob/main/tools/ci_build/github/azure-pipelines/nuget/templates/dml-vs-2022.yml)
is used to build the ONNX Runtime DirectML (DML) components. It
historically contained two potential mechanisms for creating NuGet
packages:

1. Invoking `python tools/ci_build/build.py` with the `--build_nuget`
flag.
2. Executing a specific `NuPackScript` (usually calling `msbuild
/t:CreatePackage`).

This redundancy created a significant problem during release builds
(when the pipeline parameter IsReleaseBuild is set to true). Here's why:
- Duplicate Package Creation: Both packaging methods would execute.
  - `build.py --build_nuget` created a package with a
    development/pre-release version suffix (e.g.,
    Microsoft.ML.OnnxRuntime.DirectML.1.21.1-dev-20250408-0849-84808eb710.nupkg).
  - The NuPackScript's msbuild call, influenced by IsReleaseBuild=true,
    created the clean release version package (e.g.,
    Microsoft.ML.OnnxRuntime.DirectML.1.21.1.nupkg).
- ren Command Failure: For the x86 and arm64 builds, the NuPackScript
contains a command like:
    ```Bash
    ren Microsoft.ML.OnnxRuntime.DirectML.* win-dml-x86.zip
    ``` 
This command fails when two files match the pattern
Microsoft.ML.OnnxRuntime.DirectML.* (the dev package and the release
package), as ren requires a single source file when using wildcards for
renaming.
- Result: This caused build failures specifically when attempting to
create release candidates or final release builds for x86 and arm64 DML
components. This issue did not typically occur in regular nightly builds
(IsReleaseBuild: false) because only one package (the dev version) was
likely produced, allowing the ren command to succeed. Therefore we only
found the problem when doing a patch release for ONNX Runtime 1.21.
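
To illustrate the failure mode, here is a minimal Python sketch (the package file names are the examples from above; this is not part of the actual pipeline script) showing why the wildcard rename becomes ambiguous once both packages exist:

```python
import glob

# After a release build, both the dev-suffixed package and the clean release
# package match the wildcard used by the NuPackScript's `ren` command.
matches = glob.glob("Microsoft.ML.OnnxRuntime.DirectML.*")
# e.g. ['Microsoft.ML.OnnxRuntime.DirectML.1.21.1-dev-20250408-0849-84808eb710.nupkg',
#       'Microsoft.ML.OnnxRuntime.DirectML.1.21.1.nupkg']

# `ren <pattern> win-dml-x86.zip` needs the pattern to resolve to a single
# file; with two matches there is no unambiguous rename target.
if len(matches) != 1:
    raise RuntimeError(f"expected exactly one package, found {len(matches)}: {matches}")
```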

(@amarin16, the release manager of ONNX Runtime 1.21, found the issue
and explained to us why the pipeline was not working.)

The change is relatively simple. This PR removes the `--build_nuget`
flag from the `python tools/ci_build/build.py` command within the
dml-vs-2022.yml template. By removing the redundant packaging step from
build.py, only the NuPackScript's msbuild command generates a package
file. This ensures only one file matches the
Microsoft.ML.OnnxRuntime.DirectML.* pattern, allowing the subsequent ren
command in the x86 and arm64 scripts to execute successfully during
release builds.

# Background (how the DML packaging pipeline works)

The build has two stages:

1. Individual Architecture Builds (Using dml-vs-2022.yml): Each stage
(x64, x86, arm64) runs, now reliably using only its specific
NuPackScript to generate its artifact without the risk of the ren
command failing during release.
   - x64 produces: Microsoft.ML.OnnxRuntime.DirectML.[version].nupkg
   - x86 produces: win-dml-x86.zip
   - arm64 produces: win-dml-arm64.zip
   - (arm32 is not built/included.)
2. Final Packaging Stage (e.g., stages/nuget_dml_packaging_stage.yml):
Downloads these artifacts and combines them by unpacking the base x64
.nupkg, injecting the contents of the .zip files into the appropriate
runtimes/ directories (e.g., runtimes/win-x86/native/,
runtimes/win-arm64/native/), and re-packing the final,
multi-architecture Microsoft.ML.OnnxRuntime.DirectML.nupkg.

In stage 1 only x64 produces a NuGet package. The MSBuild parameter
`/p:IsReleaseBuild=${{ parameters.IsReleaseBuild }}` is passed to all
architectures' MSBuild calls, while `/p:CurrentData=$(BuildDate)
/p:CurrentTime=$(BuildTime)` is passed only in the x64 script.
Incidentally, the property name "CurrentData" appears to be a typo; it
should be `CurrentDate`.
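
For illustration only, here is a simplified Python sketch of what the final packaging stage conceptually does with the three artifacts listed above (the real pipeline uses its own scripts; the helper name and staging paths are hypothetical):

```python
import shutil
import zipfile
from pathlib import Path


def repack_dml_nupkg(x64_nupkg: str, arch_zips: dict[str, str], out_nupkg: str) -> None:
    """Inject per-architecture binaries into the x64 base package (illustrative only)."""
    staging = Path("nupkg_staging")
    if staging.exists():
        shutil.rmtree(staging)

    # A .nupkg is a zip archive; unpack the x64 base package.
    with zipfile.ZipFile(x64_nupkg) as base:
        base.extractall(staging)

    # Drop each architecture's binaries into runtimes/<rid>/native/.
    for rid, zip_path in arch_zips.items():
        native_dir = staging / "runtimes" / rid / "native"
        native_dir.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(zip_path) as arch:
            arch.extractall(native_dir)

    # Re-pack the combined, multi-architecture package.
    archive = shutil.make_archive("Microsoft.ML.OnnxRuntime.DirectML.combined", "zip", staging)
    shutil.move(archive, out_nupkg)


repack_dml_nupkg(
    "Microsoft.ML.OnnxRuntime.DirectML.1.21.1.nupkg",
    {"win-x86": "win-dml-x86.zip", "win-arm64": "win-dml-arm64.zip"},
    "Microsoft.ML.OnnxRuntime.DirectML.nupkg",
)
```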
…#24348)

### Description
<!-- Describe your changes. -->
Make test `CApiTest.RequestLoadCancellation` deterministic by removing
the `terminator` thread.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
The test contributes to CI failures
### Description

This change allows NPM tests to run the Node.js binding for WebGPU. This
makes it much easier to debug test failures, because WebAssembly is
generally very difficult to debug.

Steps to debug:

1. build
   - {ORT_ROOT}> build --config Debug --use_webgpu --build_nodejs
   - {ORT_ROOT}\js\web> npm ci
   - {ORT_ROOT}\js\web> npm run pull:wasm
2. run `npm test -- <args> -b=webgpu -e=node` once. (This command
generates the necessary .js files and `testdata-config.json`.)
3. use native debugger to debug:
   ```
C:\Program Files\nodejs\node.exe
{ORT_ROOT}\js\node_modules\mocha\bin\_mocha --timeout 999999 --colors -r
{ORT_ROOT}\js/web/dist/ort.node.min.js {ORT_ROOT}\js/web/test/test-main
   ```
### Description

`MlasTranspose` was previously running single-threaded, which resulted
in suboptimal performance on multi-threaded CPUs. To address this, I
have modified it to utilize multi-threading.

### Motivation and Context

We encountered this issue while running the
[multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large),
which was converted to ONNX format and executed on a multi-core CPU
(Xeon 6338). Below are the performance metrics before and after the
modification:

|        | INTER_NUM_THREADS | INTRA_NUM_THREADS | INPUT_LENGTH | BATCH_SIZE | Duration time [sec] |
| ------ | ----------------- | ----------------- | ------------ | ---------- | ------------------- |
| BEFORE | 1                 | 16                | 512          | 4          | 1.24                |
| AFTER  | 1                 | 16                | 512          | 4          | 1.09                |

Condition
- FP32
- CPUExecutionProvider

This change resulted in a performance improvement of approximately 14%.
The stand-alone performance improvement of MlasTranspose is as follows:

|                      | INTRA_NUM_THREADS | BEFORE      | AFTER      |
| -------------------- | ----------------- | ----------- | ---------- |
| MlasTranspose [msec] | 16                | 182.55 [ms] | 11.60 [ms] |

`MlasTranspose` is roughly 15-16x faster.
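
For reference, the end-to-end measurement above corresponds to session settings along these lines (a rough Python sketch; the model path and input names are placeholders, not part of this PR):

```python
import time

import numpy as np
import onnxruntime as ort

so = ort.SessionOptions()
so.inter_op_num_threads = 1   # INTER_NUM_THREADS in the table above
so.intra_op_num_threads = 16  # INTRA_NUM_THREADS in the table above

# Placeholder path for the ONNX export of multilingual-e5-large.
sess = ort.InferenceSession("multilingual-e5-large.onnx", so,
                            providers=["CPUExecutionProvider"])

# Placeholder input names; batch size 4 and sequence length 512 as in the table.
inputs = {
    "input_ids": np.ones((4, 512), dtype=np.int64),
    "attention_mask": np.ones((4, 512), dtype=np.int64),
}

start = time.perf_counter()
sess.run(None, inputs)
print(f"duration: {time.perf_counter() - start:.2f} s")
```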
…24286)

On Qualcomm Adreno X1 GPUs, the previous implementation of the
FlashAttentionProgram shader in the WebGPU backend was causing high
register pressure, leading to performance degradation. This PR uses
workgroup memory to reduce the register pressure and improve
performance.

TTFT for phi4 with 1K inputs drops from 40s to 10s on the Qualcomm
Adreno X1 GPU.
)

### Description
1. Transform INT64 shape of Expand Op to INT32 shape.
2. Add Unit test to check INT64 Shape conversion to INT32 by QNN EP.


### Motivation and Context
QNN doesn't support an INT64 shape input for the Expand op. This commit
converts the shape to INT32 so that Expand ops with INT64 shapes can be
placed on the QNN EP, which improves inference time.
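
Conceptually, the transformation rewrites the INT64 shape tensor feeding Expand as INT32. A standalone Python sketch of that idea (illustrative only; this is not the QNN EP's actual implementation, and the helper name is hypothetical):

```python
import numpy as np
import onnx
from onnx import numpy_helper


def shape_initializer_to_int32(model: onnx.ModelProto, name: str) -> None:
    """Cast a shape initializer (e.g. the second input of Expand) from int64 to int32."""
    for idx, init in enumerate(model.graph.initializer):
        if init.name == name:
            shape = numpy_helper.to_array(init).astype(np.int32)
            model.graph.initializer[idx].CopyFrom(
                numpy_helper.from_array(shape, name=name))
            return
    raise ValueError(f"initializer {name!r} not found")
```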
### Description

- fix a bug in ConvTranspose

This bug causes `input_channels_per_group_int` to be `-3` for a test
case, which later causes a loop of `4294967293` iterations
(`uint32_t(-3)`) and a timeout (see the short sketch below).

- fix cache hint of Conv2dMMProgram

After fixing the bug in ConvTranspose, more cache hint inconsistencies
are revealed. This change fixes channel_last missing in the cache hint
of Conv2dMMProgram.
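
As a quick illustration of the loop count mentioned above:

```python
# Casting -3 to an unsigned 32-bit integer wraps around modulo 2**32,
# which is where the 4294967293 loop iterations came from.
loop_count = -3 & 0xFFFFFFFF  # same value as uint32_t(-3) in C++
print(loop_count)             # 4294967293
```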
1. Migrate the OpenVINO pipeline to GitHub Actions.
2. Update the OpenVINO pipeline's Dockerfile to use AlmaLinux 8 instead
of Ubuntu, to be aligned with the other Linux CI pipelines. (We cannot
pull images from Docker Hub because that requires a paid account.)
### Description
Add InstanceNormalization operator to WebGPU EP.



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
…ite-default (microsoft#24396)

Bumps [vite](https://github.com/vitejs/vite/tree/HEAD/packages/vite)
from 6.2.5 to 6.2.6.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/vitejs/vite/releases">vite's
releases</a>.</em></p>
<blockquote>
<h2>v6.2.6</h2>
<p>Please refer to <a
href="https://github.com/vitejs/vite/blob/v6.2.6/packages/vite/CHANGELOG.md">CHANGELOG.md</a>
for details.</p>
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/vitejs/vite/blob/v6.2.6/packages/vite/CHANGELOG.md">vite's
changelog</a>.</em></p>
<blockquote>
<h2><!-- raw HTML omitted -->6.2.6 (2025-04-10)<!-- raw HTML omitted
--></h2>
<ul>
<li>fix: reject requests with <code>#</code> in request-target (<a
href="https://github.com/vitejs/vite/tree/HEAD/packages/vite/issues/19830">#19830</a>)
(<a
href="https://github.com/vitejs/vite/commit/3bb0883d22d59cfd901ff18f338e8b4bf11395f7">3bb0883</a>),
closes <a
href="https://redirect.github.com/vitejs/vite/issues/19830">#19830</a></li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/vitejs/vite/commit/d3dbf25fd5e21448f9ea6cec8fb5ac45d220037b"><code>d3dbf25</code></a>
release: v6.2.6</li>
<li><a
href="https://github.com/vitejs/vite/commit/3bb0883d22d59cfd901ff18f338e8b4bf11395f7"><code>3bb0883</code></a>
fix: reject requests with <code>#</code> in request-target (<a
href="https://github.com/vitejs/vite/tree/HEAD/packages/vite/issues/19830">#19830</a>)</li>
<li>See full diff in <a
href="https://github.com/vitejs/vite/commits/v6.2.6/packages/vite">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=vite&package-manager=npm_and_yarn&previous-version=6.2.5&new-version=6.2.6)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the
[Security Alerts
page](https://github.com/microsoft/onnxruntime/network/alerts).

</details>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
### Description
Update protobuf-java to 3.25.5



### Motivation and Context
To fix the [CG
issue](https://aiinfra.visualstudio.com/Lotus/_componentGovernance/218239/alert/12112143?typeId=29309793&pipelinesTrackingFilter=0).

Change file links

- [x] java_linux_final_test.sh -> java-cuda-packaging-stage.yml
(Jar_Packaging_GPU stage from Zip-Nuget)
- [ ] final-jar-testing.yml (Final_Jar_Testing_$ stages)
### Description
- Adds C/C++ API functionality to compile a model (i.e., generate a
model with EPContext nodes) using explicit APIs.
- Adds support for compiling when input or output models are in memory
(not just files).
- Allows specifying the threshold for when initializers are stored in an
external file.
- Allows file paths of arbitrary lengths (session_option key/value
configs limited string length to 2048).

List of C API functions:
```C++
ORT_API(const OrtCompileApi*, GetCompileApi);

ORT_API(void, ReleaseModelCompilationOptions, _Frees_ptr_opt_ OrtModelCompilationOptions*);
ORT_API2_STATUS(CreateModelCompilationOptionsFromSessionOptions, _In_ const OrtEnv* env,
                _In_ const OrtSessionOptions* session_options, _Outptr_ OrtModelCompilationOptions** out);
ORT_API2_STATUS(ModelCompilationOptions_SetInputModelPath, _In_ OrtModelCompilationOptions* model_compile_options,
                _In_ const ORTCHAR_T* input_model_path);
ORT_API2_STATUS(ModelCompilationOptions_SetInputModelFromBuffer, _In_ OrtModelCompilationOptions* model_compile_options,
                _In_ const void* input_model_data, size_t input_model_data_size);
ORT_API2_STATUS(ModelCompilationOptions_SetOutputModelPath, _In_ OrtModelCompilationOptions* model_compile_options,
                _In_ const ORTCHAR_T* output_model_path);
ORT_API2_STATUS(ModelCompilationOptions_SetOutputModelExternalInitializersFile,
                _In_ OrtModelCompilationOptions* model_compile_options,
                _In_ const ORTCHAR_T* external_initializers_file_path,
                size_t external_initializer_size_threshold);
ORT_API2_STATUS(ModelCompilationOptions_SetOutputModelBuffer, _In_ OrtModelCompilationOptions* model_compile_options,
                _Inout_ OrtAllocator* allocator, void** output_model_buffer_ptr, size_t* output_model_buffer_size_ptr);
ORT_API2_STATUS(ModelCompilationOptions_SetEpContextEmbedMode, _In_ OrtModelCompilationOptions* model_compile_options,
                bool embed_ep_context_in_model);
ORT_API2_STATUS(CompileModel, _In_ const OrtEnv* env, _In_ const OrtModelCompilationOptions* model_options);
```

Example (see unit tests for others):
```C++
#include "onnxruntime_cxx_api.h"

// Test using the CompileModel() API with settings:
//   - input model from buffer
//   - output model file
//   - EPContext nodes in output model use embedded binary blobs.
TEST_F(QnnHTPBackendTests, CompileApi_FromSessionOptions_InputModelAsBuffer_Embedded) {
  const ORTCHAR_T* output_model_file = ORT_TSTR("./qnn_context_binary_multi_partition_test.onnx");
  std::filesystem::remove(output_model_file);

  // Initialize session options with QNN EP
  Ort::SessionOptions session_options;
  ProviderOptions provider_options;
#if defined(_WIN32)
  provider_options["backend_path"] = "QnnHtp.dll";
#else
  provider_options["backend_path"] = "libQnnHtp.so";
#endif
  provider_options["offload_graph_io_quantization"] = "0";
  session_options.AppendExecutionProvider("QNN", provider_options);

  // Create model compilation options from the session options.
  Ort::ModelCompilationOptions compile_options(*ort_env, session_options);
  compile_options.SetInputModelFromBuffer(reinterpret_cast<const void*>(model_data.data()), model_data.size());
  compile_options.SetOutputModelPath(output_model_file);
  compile_options.SetEpContextEmbedMode(true);

  // Compile the model.
  Ort::Status status = Ort::CompileModel(*ort_env, compile_options);
  ASSERT_TRUE(status.IsOK());

  // Make sure the compiled model was generated and has the expected number of EPContext nodes.
  ASSERT_TRUE(std::filesystem::exists(output_model_file));
  CheckEpContextNodeCounts(output_model_file, 2, 2);
}
```


### Motivation and Context
Improve compilation workflow and add new capabilities.

---------

Co-authored-by: Scott McKay <skottmckay@gmail.com>
### Description
Add 8-bit support to the MatMulNBits quantizer. matmul_4bits_quantizer
can now quantize a constant B input of a MatMul into an 8-bit initializer.

### Motivation and Context
MatMul4Bits has an accuracy issue for the phi-4 model used for Foundry
Local. An early prototype showed that >= 6 bits can fix the issue. To
mitigate the issue as soon as possible, 8-bit support is added to
MatMulNBits.
…t#24400)

### Description

There are 2 benefits to this change:
- The comments contain "Σ", a Unicode character that causes `std::wclog`
to fail and stop outputting any further logs in a Windows native app
unless UTF-8 is enabled explicitly via
`std::wclog.imbue(std::locale(".UTF-8"));`. Moving the comments out of
the WGSL code resolves the problem.
- It makes the WGSL code slightly shorter.
### Description
<!-- Describe your changes. -->
Replace use of gsl::narrow with narrow to build for xnnpack with
exceptions disabled @snnn


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Address issue microsoft#24383
### Description
Support mixed precision in quantization for RTN



### Motivation and Context
More flexible for quantization
Usage:
```python
from onnxruntime.quantization import matmul_4bits_quantizer

# onnx_model, layers_to_exclude and nodes_to_exclude are defined by the caller.
customized_weight_config = {}

# Force the excluded layers to 8-bit; the remaining MatMuls keep the default 4-bit config.
for i in layers_to_exclude:
    customized_weight_config["/model/layers." + str(i) + "/MatMul"] = {"bits": 8}

algo_config = matmul_4bits_quantizer.RTNWeightOnlyQuantConfig(
    customized_weight_config=customized_weight_config)
quant = matmul_4bits_quantizer.MatMul4BitsQuantizer(
    model=onnx_model,
    block_size=32,
    is_symmetric=False,
    accuracy_level=4,
    nodes_to_exclude=nodes_to_exclude,
    algo_config=algo_config,
)
```
…crosoft#24385)

### Description
<!-- Describe your changes. -->

This PR adds support for the Resize operator in cubic mode without
antialiasing (antialias=0). It supports scaling constraints of the form
[1, scale_h, scale_w, 1], where scale_h ≥ 1 and scale_w ≥ 1.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

The ARM64 Conv supports FP16, and we have an NhwcTransformer that
transforms FP16 Conv to FP16 NhwcFusedConv. As a result, the subsequent
Resize op also uses the NHWC format.
…4408)

With this PR, the generation speed for phi4 improves 2x on Qualcomm
Adreno X1 GPU (11.1 tps -> 23.2 tps for simple inputs).
### Description

pin triton to v3.2.0

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fix typo in option text s/buildings/bindings

Signed-off-by: Andrew Davis <afd@ti.com>
Signed-off-by: Clément Péron <peron.clem@gmail.com>
Co-authored-by: Andrew Davis <afd@ti.com>
…24420)

### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
Fix doc gen issue
### Description
Add MLProgram implementation for Gather

To support this change, I also added handling for converting int64 to
int32 in model builder

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
…#24399)

### Description
 Support shared memory version of ReduceOps



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->

Change `status=completed` to `status=success`, because a cancelled job
is also considered "completed".

See
https://docs.github.com/en/rest/actions/workflow-runs?apiVersion=2022-11-28#list-workflow-runs-for-a-workflow--parameters
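
For context, a small Python sketch of the corresponding REST query (the workflow file name is illustrative):

```python
import requests

# List only runs that finished successfully; "completed" would also match
# cancelled runs, which is what this change avoids.
resp = requests.get(
    "https://api.github.com/repos/microsoft/onnxruntime/actions/workflows/"
    "windows_webgpu.yml/runs",
    params={"status": "success", "per_page": 5},
    headers={"Accept": "application/vnd.github+json"},
    timeout=30,
)
for run in resp.json().get("workflow_runs", []):
    print(run["id"], run["conclusion"])
```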
### Description
Fix the Python API docs update pipeline.
Add back the removed files in /tools/doc/ folder
### Description
<!-- Describe your changes. -->

Fix a bug when `sessionOptions.externalData === undefined` for Node.js
binding.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
'global_idx' should be used to calculate the output indices.
### Description
<!-- Describe your changes. -->
- Add flag to determine whether to save inference results.
- Implement infrastructure to transform OrtValue into TensorProto
- Update the README with corresponding descriptions.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

- The PR aims to save the inference results of onnx-test-runner so that they can be inspected after a run.
- Developers can then proceed with custom metrics and verifications.
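
For reference, the same idea expressed in Python (a sketch of the concept, not the onnx-test-runner implementation; the model path and input shape are placeholders): a session output can be converted to a TensorProto and written to disk like this:

```python
import numpy as np
import onnxruntime as ort
from onnx import numpy_helper

# Placeholder model and input; the real tool does this inside the test runner.
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name
outputs = sess.run(None, {input_name: np.random.rand(1, 3, 224, 224).astype(np.float32)})

# Serialize each output as a TensorProto, mirroring the output_*.pb layout
# used by ONNX test data sets.
for i, (meta, value) in enumerate(zip(sess.get_outputs(), outputs)):
    tensor_proto = numpy_helper.from_array(value, name=meta.name)
    with open(f"output_{i}.pb", "wb") as f:
        f.write(tensor_proto.SerializeToString())
```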
fs-eire and others added 19 commits April 15, 2025 16:13
### Description

This PR makes changes to the installation script of ONNX Runtime Node.js
binding.

#### Background

Because of the max size limit of NPM registry, the Node.js binding NPM
package does not include some of the binaries, eg. the CUDA EP binaries
for Linux/x64.

To make it working smoothly for CUDA EP users on Linux/x64, we need a
script to download the binaries from somewhere in the process of NPM
installation.

#### Problems

Before this PR, the script downloaded the binaries from GitHub Release.
This works well but has 2 problems:
- There is a gap between the release of the binaries and the release of
the NPM package. The GitHub release is always the final step of the
release process, so there is usually a delay of a few hours to a few
days between the release of the NPM package and the release of the
binaries on GitHub.
- GitHub releases do not work for dev/nightly builds.

#### Solution

We found that using the NuGet feed resolves the above problems:
- anonymous download is allowed
- NuGet publish can be moved ahead of NPM publish in the release process
- ONNX Runtime has a nightly NuGet feed

This PR changes the script to download the binaries from the NuGet package.
### Description
Address additional review comments on
microsoft#24207:
- Remove use of `#ifdef ORT_MINIMAL_BUILD` in public C/C++ API headers
for Compile API
- Use `AllocatorPtr` internally to ensure memory is properly released if
an exception is thrown while serializing the output model to the user's
buffer.
- Improve C API function documentation.
- Clean up internal `ModelCompilationOptions` class



### Motivation and Context
Useful review comments were left on the original PR after merge. This
addresses those comments.
…icrosoft#24433)

### Description
Updates the `SessionOptionsAppendExecutionProvider` function to also
support full canonical provider names (e.g., QNNExecutionProvider) in
addition to the short names (e.g., QNN).



### Motivation and Context
There's an inconsistency in how ORT names EPs. The
`SessionOptionsAppendExecutionProvider` C API function uses short names
(e.g., "QNN"), but other ORT APIs use longer names (e.g.,
"QNNExecutionProvider").
- Python APIs to add EP to session uses "QNNExecutionProvider" (not
"QNN")
- Python and C APIs to GetAvailableProviders use "QNNExecutionProvider"
- Internal ORT code uses "QNNExecutionProvider" when assigning nodes to
the EP.
- Only `SessionOptionsAppendExecutionProvider` uses short names like
"QNN".
### Description
<!-- Describe your changes. -->
- Add a general command-line tool for static quantization
- Support loading TensorQuantOverride from json file
- Add the corresponding README

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
- Currently, developers are able to use the preprocess tool from the
command line
    - https://onnxruntime.ai/docs/performance/model-optimizations/quantization.html#pre-processing
    - `python -m onnxruntime.quantization.preprocess --help`
- The PR aims to provide similar usage for static quantization.
    - `python -m onnxruntime.quantization.static_quantize_runner --help`
- Existing command-line examples in onnxruntime-inference-examples are
not general enough for arbitrary ONNX models.
### Description


Implemented a thread-safe OrtInstanceData to support the Node.js binding
in multiple envs, and added an E2E test for running in a worker.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Enable the GPU backend for the onnxruntime QNN EP.

### Motivation and Context
This allows the QNN EP to also run on the GPU backend. With this change,
many models can now run fully on the QNN EP GPU backend, such as
resnet_50, google_vit_base_fp32, and squeezenet1.0-7. The pass rates of
the onnxruntime node tests and versioned operator tests on the GPU
backend are now comparable to the HTP backend.
Note: Currently QNN_LOG_LEVEL_DEBUG needs to be enabled to run correctly.
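
For example, selecting the GPU backend from Python might look like the sketch below (hedged: the backend library file name and the use of verbose ORT logging to satisfy the debug-logging note are assumptions, and the model path is a placeholder):

```python
import onnxruntime as ort

so = ort.SessionOptions()
# Per the note above, QNN debug logging currently has to be enabled for the
# GPU backend; verbose ORT logging is one way to request it.
so.log_severity_level = 0

sess = ort.InferenceSession(
    "model.onnx",
    sess_options=so,
    providers=[("QNNExecutionProvider",
                {"backend_path": "QnnGpu.dll"}),  # assumed GPU backend library name
               "CPUExecutionProvider"],
)
```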
### Description
Update QNN version to 2.33.2
### Description
This pull request combines multiple improvements and bug fixes for the
OpenVINO Execution Provider (OVEP). The changes are summarized as
follows:

1) Introduced Intel compiler level optimizations for QDQ models.
2) Added support to select intel devices based on the LUID. 
3) Code refactoring for improvement in querying the available devices
and setting it.
4) Load_config feature improvement to support AUTO, HETERO and MULTI
plugin.
5) Memory optimization during model compilation.
6) EPCtx optimizations.
7) Bug fixes.

---------

Signed-off-by: Jianhui Dai <jianhui.j.dai@intel.com>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: jatinwadhwa921 <110383850+jatinwadhwa921@users.noreply.github.com>
Co-authored-by: n1harika <niharika.sathish@intel.com>
Co-authored-by: Ankit Maheshkar <ankit.maheshkar@intel.com>
Co-authored-by: sfatimar <sahar.fatima@intel.com>
Co-authored-by: Jaskaran Singh Nagi <jaskaran.singh.nagi@intel.com>
Co-authored-by: Eric Crawford <eric.r.crawford@intel.com>
Co-authored-by: Sushanth Rajasankar <44513542+sushraja-msft@users.noreply.github.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: Seungtaek Kim <seungtaek.kim.94@gmail.com>
Co-authored-by: co63oc <co63oc@users.noreply.github.com>
Co-authored-by: Jambay Kinley <jambaykinley@microsoft.com>
Co-authored-by: Hector Li <hecli@microsoft.com>
Co-authored-by: Jian Chen <cjian@microsoft.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Jiajia Qin <jiajiaqin@microsoft.com>
Co-authored-by: Alessio Soldano <services@soldano.it>
Co-authored-by: Changming Sun <chasun@microsoft.com>
Co-authored-by: Ashish Garg <quic_ashigarg@quicinc.com>
Co-authored-by: Ashish Garg <ashigarg@qti.qualcomm.com>
Co-authored-by: Jie Chen <jie.a.chen@intel.com>
Co-authored-by: wp <webgraphics@intel.com>
Co-authored-by: Satya Kumar Jandhyala <satya.k.jandhyala@gmail.com>
Co-authored-by: Prathik Rao <prathik.rao@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: Jianhui Dai <jianhui.j.dai@intel.com>
Co-authored-by: xhcao <xinghua.cao@intel.com>
Co-authored-by: Wanming Lin <wanming.lin@intel.com>
Co-authored-by: Mark Schofield <mschofie@microsoft.com>
Co-authored-by: jiangzhaoming <zhaoming.jiang@microsoft.com>
Co-authored-by: Yi-Hong Lyu <yilyu@microsoft.com>
Co-authored-by: vraspar <vrajang@outlook.com>
Co-authored-by: Chi Lo <54722500+chilo-ms@users.noreply.github.com>
Co-authored-by: saurabh <saurabh1.kale@intel.com>
Co-authored-by: Ranjit Ranjan <165394499+ranjitshs@users.noreply.github.com>
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com>
Co-authored-by: Pallavi Gupta <pallavi.gupta@intel.com>
Co-authored-by: Nikolay Proshunin <nikolay.proshunin@intel.com>
### Description
<!-- Describe your changes. -->
Most models can benefit from fusing the pre-GQA nodes into a single
MatMul or MatMulNBits. This change detects the fusable patterns and
performs the fusion on the CUDA EP.


### Motivation and Context
This will enable publishing of a single GPU model going forward.

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
### Description

Update N-API version to 6.

- NAPI v6 is required for `napi_set_instance_data` and
`napi_get_instance_data`, as used by microsoft#24366
- Adding the "binary" field in package.json for CMake-js to work
correctly. (was unintentially removed in microsoft#24418)

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Fix compilation issue (undeclared identifier) in Azure EP unit test.



### Motivation and Context
A previous PR caused a compilation issue in the Azure EP unit test:
microsoft#24433

Our PR CI pipelines did not catch it. It was caught by our post-merge
packaging pipelines.

```shell
D:\a\_work\1\s\onnxruntime\test\providers\azure\azure_basic_test.cc(28,3): error C2065: 'session_options': undeclared identifier [D:\a\_work\1\b\RelWithDebInfo\onnxruntime_test_all.vcxproj]
D:\a\_work\1\s\onnxruntime\test\providers\azure\azure_basic_test.cc(29,3): error C2065: 'session_options': undeclared identifier [D:\a\_work\1\b\RelWithDebInfo\onnxruntime_test_all.vcxproj]
D:\a\_work\1\s\onnxruntime\test\providers\azure\azure_basic_test.cc(30,3): error C2065: 'session_options': undeclared identifier [D:\a\_work\1\b\RelWithDebInfo\onnxruntime_test_all.vcxproj]
```
### Description
If it would improve performance, this patch moves outputs to MLTensor
backed Tensors.

### Motivation and Context
We are currently performing an extra copy on output tensors located in
the CPU when using the WebNN EP (MLTensor -(copy)-> wasm heap -(copy)->
JS). This patch removes this copy by moving the readback to JS instead
of wasm. As an extra benefit, we can also start the readbacks and wait
for them in parallel.

This change is similar to microsoft#23073
### Description

Fix Nodejs binding build for Linux.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
…24373)

### Description

MatmulTransposeFusion does not work correctly when inputs A and B are the
same for a `MatMul` node.


![image](https://github.com/user-attachments/assets/48a6afd8-13d0-48d4-b86f-53a866c47803)

Fixes microsoft#24341

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
zeros_ memory buffer was uninitialized, but it must be initialized to
zero.


### Motivation and Context
A memory allocator change in GenAI started crashing in FlashAttention,
and this was eventually tracked down as the cause. The allocator change
was innocent. I'm not sure how this didn't fail previously, or whether
it did and we simply weren't getting reports about it.

Co-authored-by: Ryan Hill <{ID}+{username}@users.noreply.github.com>
…ft#24444)

### Description
Mapping ORT verbose logging back to QnnGpu Debug logging.

### Motivation and Context
Why is this change required? What problem does it solve?
As of now, this change is required for the QnnGpu backend to run models
correctly. Its necessity is mentioned in this commit:

microsoft@b4b5a79

It temporarily reverts the following commit, for the GPU case only, due
to a loss of functionality:

microsoft@9d45b9a
…24452)

### Description

 update Node.js binding document for 1.22 release



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Handle empty input cases in the native reduce kernel.



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
@jatinwadhwa921 jatinwadhwa921 requested a review from ankitm3k April 17, 2025 07:25
@jatinwadhwa921
Author

The internal CI support in ovep-develop will be added as part of the next PR. This PR has already been validated; the run for it is available here: https://github.com/intel/onnxruntime/actions/runs/14512989745/job/40715701698?pr=664

@jatinwadhwa921 jatinwadhwa921 merged commit 21c7ab3 into ovep-develop Apr 17, 2025
5 of 7 checks passed