@jatinwadhwa921

Backmerging with Msft commits

snnn and others added 30 commits April 10, 2025 08:26
The Azure DevOps pipeline template
[/nuget/templates/dml-vs-2022.yml](https://github.com/microsoft/onnxruntime/blob/main/tools/ci_build/github/azure-pipelines/nuget/templates/dml-vs-2022.yml)
is used to build the ONNX Runtime DirectML (DML) components. It
historically contained two potential mechanisms for creating NuGet
packages:

1. Invoking `python tools/ci_build/build.py` with the `--build_nuget`
flag.
2. Executing a specific `NuPackScript` (usually calling `msbuild
/t:CreatePackage`).

This redundancy created a significant problem during release builds
(when the pipeline parameter IsReleaseBuild is set to true). Here's why:
- Duplicate Package Creation: Both packaging methods would execute.
  - `build.py --build_nuget` created a package with a
    development/pre-release version suffix (e.g.,
    Microsoft.ML.OnnxRuntime.DirectML.1.21.1-dev-20250408-0849-84808eb710.nupkg).
  - The NuPackScript's msbuild call, influenced by IsReleaseBuild=true,
    created the clean release version package (e.g.,
    Microsoft.ML.OnnxRuntime.DirectML.1.21.1.nupkg).
- ren Command Failure: For the x86 and arm64 builds, the NuPackScript
contains a command like:
    ```Bash
    ren Microsoft.ML.OnnxRuntime.DirectML.* win-dml-x86.zip
    ``` 
This command fails when two files match the pattern
Microsoft.ML.OnnxRuntime.DirectML.* (the dev package and the release
package), as ren requires a single source file when using wildcards for
renaming.
- Result: This caused build failures specifically when attempting to
create release candidates or final release builds for x86 and arm64 DML
components. This issue did not typically occur in regular nightly builds
(IsReleaseBuild: false) because only one package (the dev version) was
likely produced, allowing the ren command to succeed. Therefore we only
found the problem when doing a patch release for ONNX Runtime 1.21.
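
To illustrate the failure mode, here is a minimal Python sketch (the package file names are the examples from above; this is not part of the actual pipeline script) showing why the wildcard rename becomes ambiguous once both packages exist:

```python
import glob

# After a release build, both the dev-suffixed package and the clean release
# package match the wildcard used by the NuPackScript's `ren` command.
matches = glob.glob("Microsoft.ML.OnnxRuntime.DirectML.*")
# e.g. ['Microsoft.ML.OnnxRuntime.DirectML.1.21.1-dev-20250408-0849-84808eb710.nupkg',
#       'Microsoft.ML.OnnxRuntime.DirectML.1.21.1.nupkg']

# `ren <pattern> win-dml-x86.zip` needs the pattern to resolve to a single
# file; with two matches there is no unambiguous rename target.
if len(matches) != 1:
    raise RuntimeError(f"expected exactly one package, found {len(matches)}: {matches}")
```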

(@amarin16, the release manager of ONNX Runtime 1.21, found the issue
and explained to us why the pipeline was not working.)

The change is relatively simple. This PR removes the `--build_nuget`
flag from the `python tools/ci_build/build.py` command within the
dml-vs-2022.yml template. By removing the redundant packaging step from
build.py, only the NuPackScript's msbuild command generates a package
file. This ensures only one file matches the
Microsoft.ML.OnnxRuntime.DirectML.* pattern, allowing the subsequent ren
command in the x86 and arm64 scripts to execute successfully during
release builds.

# Background (how the DML packaging pipeline works)

The build has two stages:

1. Individual Architecture Builds (Using dml-vs-2022.yml): Each stage
(x64, x86, arm64) runs, now reliably using only its specific
NuPackScript to generate its artifact without the risk of the ren
command failing during release.
   - x64 produces: Microsoft.ML.OnnxRuntime.DirectML.[version].nupkg
   - x86 produces: win-dml-x86.zip
   - arm64 produces: win-dml-arm64.zip
   - (arm32 is not built/included.)
2. Final Packaging Stage (e.g., stages/nuget_dml_packaging_stage.yml):
Downloads these artifacts and combines them by unpacking the base x64
.nupkg, injecting the contents of the .zip files into the appropriate
runtimes/ directories (e.g., runtimes/win-x86/native/,
runtimes/win-arm64/native/), and re-packing the final,
multi-architecture Microsoft.ML.OnnxRuntime.DirectML.nupkg.

In stage 1 only x64 produces a NuGet package. The MSBuild parameter
`/p:IsReleaseBuild=${{ parameters.IsReleaseBuild }}` is passed to all
architectures' MSBuild calls, while `/p:CurrentData=$(BuildDate)
/p:CurrentTime=$(BuildTime)` is passed only in the x64 script.
Incidentally, the property name "CurrentData" appears to be a typo; it
should be `CurrentDate`.
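
For illustration only, here is a simplified Python sketch of what the final packaging stage conceptually does with the three artifacts listed above (the real pipeline uses its own scripts; the helper name and staging paths are hypothetical):

```python
import shutil
import zipfile
from pathlib import Path


def repack_dml_nupkg(x64_nupkg: str, arch_zips: dict[str, str], out_nupkg: str) -> None:
    """Inject per-architecture binaries into the x64 base package (illustrative only)."""
    staging = Path("nupkg_staging")
    if staging.exists():
        shutil.rmtree(staging)

    # A .nupkg is a zip archive; unpack the x64 base package.
    with zipfile.ZipFile(x64_nupkg) as base:
        base.extractall(staging)

    # Drop each architecture's binaries into runtimes/<rid>/native/.
    for rid, zip_path in arch_zips.items():
        native_dir = staging / "runtimes" / rid / "native"
        native_dir.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(zip_path) as arch:
            arch.extractall(native_dir)

    # Re-pack the combined, multi-architecture package.
    archive = shutil.make_archive("Microsoft.ML.OnnxRuntime.DirectML.combined", "zip", staging)
    shutil.move(archive, out_nupkg)


repack_dml_nupkg(
    "Microsoft.ML.OnnxRuntime.DirectML.1.21.1.nupkg",
    {"win-x86": "win-dml-x86.zip", "win-arm64": "win-dml-arm64.zip"},
    "Microsoft.ML.OnnxRuntime.DirectML.nupkg",
)
```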
…#24348)

### Description
<!-- Describe your changes. -->
Make test `CApiTest.RequestLoadCancellation` deterministic by removing
the `terminator` thread.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
The test contributes to CI failures
### Description

This change allows NPM tests to run the Node.js binding for WebGPU. This
makes it much easier to debug test failures, because WebAssembly is
generally very difficult to debug.

Steps to debug:

1. build
   - {ORT_ROOT}> build --config Debug --use_webgpu --build_nodejs
   - {ORT_ROOT}\js\web> npm ci
   - {ORT_ROOT}\js\web> npm run pull:wasm
2. run `npm test -- <args> -b=webgpu -e=node` once. (This command
generates the necessary .js files and `testdata-config.json`.)
3. use native debugger to debug:
   ```
C:\Program Files\nodejs\node.exe
{ORT_ROOT}\js\node_modules\mocha\bin\_mocha --timeout 999999 --colors -r
{ORT_ROOT}\js/web/dist/ort.node.min.js {ORT_ROOT}\js/web/test/test-main
   ```
### Description

`MlasTranspose` was previously running single-threaded, which resulted
in suboptimal performance on multi-threaded CPUs. To address this, I
have modified it to utilize multi-threading.

### Motivation and Context

We encountered this issue while running the
[multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large),
which was converted to ONNX format and executed on a multi-core CPU
(Xeon 6338). Below are the performance metrics before and after the
modification:

|        | INTER_NUM_THREADS | INTRA_NUM_THREADS | INPUT_LENGTH | BATCH_SIZE | Duration time [sec] |
| ------ | ----------------- | ----------------- | ------------ | ---------- | ------------------- |
| BEFORE | 1                 | 16                | 512          | 4          | 1.24                |
| AFTER  | 1                 | 16                | 512          | 4          | 1.09                |

Condition
- FP32
- CPUExecutionProvider

This change resulted in a performance improvement of approximately 14%.
The stand-alone performance improvement of MlasTranspose is as follows:

|                      | INTRA_NUM_THREADS | BEFORE      | AFTER      |
| -------------------- | ----------------- | ----------- | ---------- |
| MlasTranspose [msec] | 16                | 182.55 [ms] | 11.60 [ms] |

`MlasTranspose` is roughly 15-16x faster.
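
For reference, the end-to-end measurement above corresponds to session settings along these lines (a rough Python sketch; the model path and input names are placeholders, not part of this PR):

```python
import time

import numpy as np
import onnxruntime as ort

so = ort.SessionOptions()
so.inter_op_num_threads = 1   # INTER_NUM_THREADS in the table above
so.intra_op_num_threads = 16  # INTRA_NUM_THREADS in the table above

# Placeholder path for the ONNX export of multilingual-e5-large.
sess = ort.InferenceSession("multilingual-e5-large.onnx", so,
                            providers=["CPUExecutionProvider"])

# Placeholder input names; batch size 4 and sequence length 512 as in the table.
inputs = {
    "input_ids": np.ones((4, 512), dtype=np.int64),
    "attention_mask": np.ones((4, 512), dtype=np.int64),
}

start = time.perf_counter()
sess.run(None, inputs)
print(f"duration: {time.perf_counter() - start:.2f} s")
```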
…24286)

On Qualcomm Adreno X1 GPUs, the previous implementation of the
FlashAttentionProgram shader in the WebGPU backend was causing high
register pressure, leading to performance degradation. This PR uses
workgroup memory to reduce the register pressure and improve
performance.

TTFT for phi4 with 1K inputs drops from 40s to 10s on the Qualcomm
Adreno X1 GPU.
)

### Description
1. Transform INT64 shape of Expand Op to INT32 shape.
2. Add Unit test to check INT64 Shape conversion to INT32 by QNN EP.


### Motivation and Context
QNN doesn't support an INT64 shape input for the Expand op. This commit
converts the shape to INT32 so that Expand ops with INT64 shapes can be
placed on the QNN EP, which improves inference time.
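
Conceptually, the transformation rewrites the INT64 shape tensor feeding Expand as INT32. A standalone Python sketch of that idea (illustrative only; this is not the QNN EP's actual implementation, and the helper name is hypothetical):

```python
import numpy as np
import onnx
from onnx import numpy_helper


def shape_initializer_to_int32(model: onnx.ModelProto, name: str) -> None:
    """Cast a shape initializer (e.g. the second input of Expand) from int64 to int32."""
    for idx, init in enumerate(model.graph.initializer):
        if init.name == name:
            shape = numpy_helper.to_array(init).astype(np.int32)
            model.graph.initializer[idx].CopyFrom(
                numpy_helper.from_array(shape, name=name))
            return
    raise ValueError(f"initializer {name!r} not found")
```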
### Description

- fix a bug in ConvTranspose

This bug causes `input_channels_per_group_int` to be `-3` for a test
case, which later causes a loop of `4294967293` iterations
(`uint32_t(-3)`) and a timeout (see the short sketch below).

- fix cache hint of Conv2dMMProgram

After fixing the bug in ConvTranspose, more cache hint inconsistencies
are revealed. This change fixes channel_last missing in the cache hint
of Conv2dMMProgram.
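
As a quick illustration of the loop count mentioned above:

```python
# Casting -3 to an unsigned 32-bit integer wraps around modulo 2**32,
# which is where the 4294967293 loop iterations came from.
loop_count = -3 & 0xFFFFFFFF  # same value as uint32_t(-3) in C++
print(loop_count)             # 4294967293
```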
1. Migrate the OpenVINO pipeline to GitHub Actions.
2. Update the OpenVINO pipeline's Dockerfile to use AlmaLinux 8 instead
of Ubuntu, to be aligned with the other Linux CI pipelines. (We cannot
pull images from Docker Hub because that requires a paid account.)
### Description
Add InstanceNormalization operator to WebGPU EP.



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
…ite-default (microsoft#24396)

Bumps [vite](https://github.com/vitejs/vite/tree/HEAD/packages/vite)
from 6.2.5 to 6.2.6.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/vitejs/vite/releases">vite's
releases</a>.</em></p>
<blockquote>
<h2>v6.2.6</h2>
<p>Please refer to <a
href="https://github.com/vitejs/vite/blob/v6.2.6/packages/vite/CHANGELOG.md">CHANGELOG.md</a>
for details.</p>
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/vitejs/vite/blob/v6.2.6/packages/vite/CHANGELOG.md">vite's
changelog</a>.</em></p>
<blockquote>
<h2><!-- raw HTML omitted -->6.2.6 (2025-04-10)<!-- raw HTML omitted
--></h2>
<ul>
<li>fix: reject requests with <code>#</code> in request-target (<a
href="https://github.com/vitejs/vite/tree/HEAD/packages/vite/issues/19830">#19830</a>)
(<a
href="https://github.com/vitejs/vite/commit/3bb0883d22d59cfd901ff18f338e8b4bf11395f7">3bb0883</a>),
closes <a
href="https://redirect.github.com/vitejs/vite/issues/19830">#19830</a></li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/vitejs/vite/commit/d3dbf25fd5e21448f9ea6cec8fb5ac45d220037b"><code>d3dbf25</code></a>
release: v6.2.6</li>
<li><a
href="https://github.com/vitejs/vite/commit/3bb0883d22d59cfd901ff18f338e8b4bf11395f7"><code>3bb0883</code></a>
fix: reject requests with <code>#</code> in request-target (<a
href="https://github.com/vitejs/vite/tree/HEAD/packages/vite/issues/19830">#19830</a>)</li>
<li>See full diff in <a
href="https://github.com/vitejs/vite/commits/v6.2.6/packages/vite">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=vite&package-manager=npm_and_yarn&previous-version=6.2.5&new-version=6.2.6)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the
[Security Alerts
page](https://github.com/microsoft/onnxruntime/network/alerts).

</details>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
### Description
Update protobuf-java to 3.25.5



### Motivation and Context
To fix the [CG
issue](https://aiinfra.visualstudio.com/Lotus/_componentGovernance/218239/alert/12112143?typeId=29309793&pipelinesTrackingFilter=0).

Change file links

- [x] java_linux_final_test.sh -> java-cuda-packaging-stage.yml
(Jar_Packaging_GPU stage from Zip-Nuget)
- [ ] final-jar-testing.yml (Final_Jar_Testing_$ stages)
### Description
- Adds C/C++ API functionality to compile a model (i.e., generate a
model with EPContext nodes) using explicit APIs.
- Adds support for compiling when input or output models are in memory
(not just files).
- Allows specifying the threshold for when initializers are stored in an
external file.
- Allows file paths of arbitrary lengths (session_option key/value
configs limited string length to 2048).

List of C API functions:
```C++
ORT_API(const OrtCompileApi*, GetCompileApi);

ORT_API(void, ReleaseModelCompilationOptions, _Frees_ptr_opt_ OrtModelCompilationOptions*);
ORT_API2_STATUS(CreateModelCompilationOptionsFromSessionOptions, _In_ const OrtEnv* env,
                _In_ const OrtSessionOptions* session_options, _Outptr_ OrtModelCompilationOptions** out);
ORT_API2_STATUS(ModelCompilationOptions_SetInputModelPath, _In_ OrtModelCompilationOptions* model_compile_options,
                _In_ const ORTCHAR_T* input_model_path);
ORT_API2_STATUS(ModelCompilationOptions_SetInputModelFromBuffer, _In_ OrtModelCompilationOptions* model_compile_options,
                _In_ const void* input_model_data, size_t input_model_data_size);
ORT_API2_STATUS(ModelCompilationOptions_SetOutputModelPath, _In_ OrtModelCompilationOptions* model_compile_options,
                _In_ const ORTCHAR_T* output_model_path);
ORT_API2_STATUS(ModelCompilationOptions_SetOutputModelExternalInitializersFile,
                _In_ OrtModelCompilationOptions* model_compile_options,
                _In_ const ORTCHAR_T* external_initializers_file_path,
                size_t external_initializer_size_threshold);
ORT_API2_STATUS(ModelCompilationOptions_SetOutputModelBuffer, _In_ OrtModelCompilationOptions* model_compile_options,
                _Inout_ OrtAllocator* allocator, void** output_model_buffer_ptr, size_t* output_model_buffer_size_ptr);
ORT_API2_STATUS(ModelCompilationOptions_SetEpContextEmbedMode, _In_ OrtModelCompilationOptions* model_compile_options,
                bool embed_ep_context_in_model);
ORT_API2_STATUS(CompileModel, _In_ const OrtEnv* env, _In_ const OrtModelCompilationOptions* model_options);
```

Example (see unit tests for others):
```C++
#include "onnxruntime_cxx_api.h"

// Test using the CompileModel() API with settings:
//   - input model from buffer
//   - output model file
//   - EPContext nodes in output model use embedded binary blobs.
TEST_F(QnnHTPBackendTests, CompileApi_FromSessionOptions_InputModelAsBuffer_Embedded) {
  const ORTCHAR_T* output_model_file = ORT_TSTR("./qnn_context_binary_multi_partition_test.onnx");
  std::filesystem::remove(output_model_file);

  // Initialize session options with QNN EP
  Ort::SessionOptions session_options;
  ProviderOptions provider_options;
#if defined(_WIN32)
  provider_options["backend_path"] = "QnnHtp.dll";
#else
  provider_options["backend_path"] = "libQnnHtp.so";
#endif
  provider_options["offload_graph_io_quantization"] = "0";
  session_options.AppendExecutionProvider("QNN", provider_options);

  // Create model compilation options from the session options.
  Ort::ModelCompilationOptions compile_options(*ort_env, session_options);
  compile_options.SetInputModelFromBuffer(reinterpret_cast<const void*>(model_data.data()), model_data.size());
  compile_options.SetOutputModelPath(output_model_file);
  compile_options.SetEpContextEmbedMode(true);

  // Compile the model.
  Ort::Status status = Ort::CompileModel(*ort_env, compile_options);
  ASSERT_TRUE(status.IsOK());

  // Make sure the compiled model was generated and has the expected number of EPContext nodes.
  ASSERT_TRUE(std::filesystem::exists(output_model_file));
  CheckEpContextNodeCounts(output_model_file, 2, 2);
}
```


### Motivation and Context
Improve compilation workflow and add new capabilities.

---------

Co-authored-by: Scott McKay <skottmckay@gmail.com>
### Description
Add 8-bit support to the MatMulNBits quantizer. matmul_4bits_quantizer
can now quantize a constant B input of a MatMul into an 8-bit initializer.

### Motivation and Context
MatMul4Bits has an accuracy issue for the phi-4 model used for Foundry
Local. An early prototype showed that >= 6 bits can fix the issue. To
mitigate the issue as soon as possible, 8-bit support is added to
MatMulNBits.
…t#24400)

### Description

There are 2 benefits to this change:
- The comments contain "Σ", a Unicode character that causes `std::wclog`
to fail and stop outputting any further logs in a Windows native app
unless UTF-8 is enabled explicitly via
`std::wclog.imbue(std::locale(".UTF-8"));`. Moving the comments out of
the WGSL code resolves the problem.
- It makes the WGSL code slightly shorter.
### Description
<!-- Describe your changes. -->
Replace use of gsl::narrow with narrow to build for xnnpack with
exceptions disabled @snnn


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Address issue microsoft#24383
### Description
Support mixed precision in quantization for RTN



### Motivation and Context
More flexible for quantization
Usage:
```python
from onnxruntime.quantization import matmul_4bits_quantizer

# onnx_model, layers_to_exclude and nodes_to_exclude are defined by the caller.
customized_weight_config = {}

# Force the excluded layers to 8-bit; the remaining MatMuls keep the default 4-bit config.
for i in layers_to_exclude:
    customized_weight_config["/model/layers." + str(i) + "/MatMul"] = {"bits": 8}

algo_config = matmul_4bits_quantizer.RTNWeightOnlyQuantConfig(
    customized_weight_config=customized_weight_config)
quant = matmul_4bits_quantizer.MatMul4BitsQuantizer(
    model=onnx_model,
    block_size=32,
    is_symmetric=False,
    accuracy_level=4,
    nodes_to_exclude=nodes_to_exclude,
    algo_config=algo_config,
)
```
…crosoft#24385)

### Description
<!-- Describe your changes. -->

This PR adds support for the Resize operator in cubic mode without
antialiasing (antialias=0). It supports scaling constraints of the form
[1, scale_h, scale_w, 1], where scale_h ≥ 1 and scale_w ≥ 1.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

The ARM64 Conv supports FP16, and we have an NhwcTransformer that
transforms FP16 Conv to FP16 NhwcFusedConv. As a result, the subsequent
Resize op also uses the NHWC format.
…4408)

With this PR, the generation speed for phi4 improves 2x on Qualcomm
Adreno X1 GPU (11.1 tps -> 23.2 tps for simple inputs).
### Description

pin triton to v3.2.0

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fix typo in option text s/buildings/bindings

Signed-off-by: Andrew Davis <afd@ti.com>
Signed-off-by: Clément Péron <peron.clem@gmail.com>
Co-authored-by: Andrew Davis <afd@ti.com>
…24420)

### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
Fix doc gen issue
### Description
Add MLProgram implementation for Gather

To support this change, I also added handling for converting int64 to
int32 in model builder

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
…#24399)

### Description
 Support shared memory version of ReduceOps



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->

Change `status=completed` to `status=success`, because a cancelled job
is also considered "completed".

See
https://docs.github.com/en/rest/actions/workflow-runs?apiVersion=2022-11-28#list-workflow-runs-for-a-workflow--parameters
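
For context, a small Python sketch of the corresponding REST query (the workflow file name is illustrative):

```python
import requests

# List only runs that finished successfully; "completed" would also match
# cancelled runs, which is what this change avoids.
resp = requests.get(
    "https://api.github.com/repos/microsoft/onnxruntime/actions/workflows/"
    "windows_webgpu.yml/runs",
    params={"status": "success", "per_page": 5},
    headers={"Accept": "application/vnd.github+json"},
    timeout=30,
)
for run in resp.json().get("workflow_runs", []):
    print(run["id"], run["conclusion"])
```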
### Description
Fix the Python API docs update pipeline.
Add back the removed files in /tools/doc/ folder
### Description
<!-- Describe your changes. -->

Fix a bug when `sessionOptions.externalData === undefined` for Node.js
binding.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
'global_idx' should be used to calculate the output indices.
### Description
<!-- Describe your changes. -->
- Add flag to determine whether to save inference results.
- Implement infrastructure to transform OrtValue into TensorProto
- Update the README with corresponding descriptions.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

- The PR aims to save the inference results of onnx-test-runner so that they can be inspected after a run.
- Developers can then proceed with custom metrics and verifications.
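
For reference, the same idea expressed in Python (a sketch of the concept, not the onnx-test-runner implementation; the model path and input shape are placeholders): a session output can be converted to a TensorProto and written to disk like this:

```python
import numpy as np
import onnxruntime as ort
from onnx import numpy_helper

# Placeholder model and input; the real tool does this inside the test runner.
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name
outputs = sess.run(None, {input_name: np.random.rand(1, 3, 224, 224).astype(np.float32)})

# Serialize each output as a TensorProto, mirroring the output_*.pb layout
# used by ONNX test data sets.
for i, (meta, value) in enumerate(zip(sess.get_outputs(), outputs)):
    tensor_proto = numpy_helper.from_array(value, name=meta.name)
    with open(f"output_{i}.pb", "wb") as f:
        f.write(tensor_proto.SerializeToString())
```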
fs-eire and others added 19 commits April 15, 2025 16:13
### Description

This PR makes changes to the installation script of ONNX Runtime Node.js
binding.

#### Background

Because of the max size limit of NPM registry, the Node.js binding NPM
package does not include some of the binaries, eg. the CUDA EP binaries
for Linux/x64.

To make it working smoothly for CUDA EP users on Linux/x64, we need a
script to download the binaries from somewhere in the process of NPM
installation.

#### Problems

Before this PR, the script downloaded the binaries from GitHub Release.
This works well but has 2 problems:
- There is a gap between the release of the binaries and the release of
the NPM package. The GitHub release is always the final step of the
release process, so there is usually a delay of a few hours to a few
days between the release of the NPM package and the release of the
binaries on GitHub.
- GitHub releases do not work for dev/nightly builds.

#### Solution

We found that using the NuGet feed resolves the above problems:
- anonymous download is allowed
- NuGet publish can be moved ahead of NPM publish in the release process
- ONNX Runtime has a nightly NuGet feed

This PR changes the script to download the binaries from the NuGet package.
### Description
Address additional review comments on
microsoft#24207:
- Remove use of `#ifdef ORT_MINIMAL_BUILD` in public C/C++ API headers
for Compile API
- Use `AllocatorPtr` internally to ensure memory is properly released if
an exception is thrown while serializing the output model to the user's
buffer.
- Improve C API function documentation.
- Clean up internal `ModelCompilationOptions` class



### Motivation and Context
Useful review comments were left on the original PR after merge. This
addresses those comments.
…icrosoft#24433)

### Description
Updates the `SessionOptionsAppendExecutionProvider` function to also
support full canonical provider names (e.g., QNNExecutionProvider) in
addition to the short names (e.g., QNN).



### Motivation and Context
There's an inconsistency in how ORT names EPs. The
`SessionOptionsAppendExecutionProvider` C API function uses short names
(e.g., "QNN"), but other ORT APIs use longer names (e.g.,
"QNNExecutionProvider").
- Python APIs to add EP to session uses "QNNExecutionProvider" (not
"QNN")
- Python and C APIs to GetAvailableProviders use "QNNExecutionProvider"
- Internal ORT code uses "QNNExecutionProvider" when assigning nodes to
the EP.
- Only `SessionOptionsAppendExecutionProvider` uses short names like
"QNN".
### Description
<!-- Describe your changes. -->
- Add a general command-line tool for static quantization
- Support loading TensorQuantOverride from json file
- Add the corresponding README

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
- Currently, developers are able to use the preprocess tool from the
command line
    - https://onnxruntime.ai/docs/performance/model-optimizations/quantization.html#pre-processing
    - `python -m onnxruntime.quantization.preprocess --help`
- The PR aims to provide similar usage for static quantization.
    - `python -m onnxruntime.quantization.static_quantize_runner --help`
- Existing command-line examples in onnxruntime-inference-examples are
not general enough for arbitrary ONNX models.
### Description


Implemented a thread-safe OrtInstanceData to support the Node.js binding
in multiple envs, and added an E2E test for running in a worker.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Enable the GPU backend for the onnxruntime QNN EP.

### Motivation and Context
This allows the QNN EP to also run on the GPU backend. With this change,
many models can now run fully on the QNN EP GPU backend, such as
resnet_50, google_vit_base_fp32, and squeezenet1.0-7. The pass rates of
the onnxruntime node tests and versioned operator tests on the GPU
backend are now comparable to the HTP backend.
Note: Currently QNN_LOG_LEVEL_DEBUG needs to be enabled to run correctly.
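
For example, selecting the GPU backend from Python might look like the sketch below (hedged: the backend library file name and the use of verbose ORT logging to satisfy the debug-logging note are assumptions, and the model path is a placeholder):

```python
import onnxruntime as ort

so = ort.SessionOptions()
# Per the note above, QNN debug logging currently has to be enabled for the
# GPU backend; verbose ORT logging is one way to request it.
so.log_severity_level = 0

sess = ort.InferenceSession(
    "model.onnx",
    sess_options=so,
    providers=[("QNNExecutionProvider",
                {"backend_path": "QnnGpu.dll"}),  # assumed GPU backend library name
               "CPUExecutionProvider"],
)
```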
### Description
Update QNN version to 2.33.2
### Description
This pull request combines multiple improvements and bug fixes for the
OpenVINO Execution Provider (OVEP). The changes are summarized as
follows:

1) Introduced Intel compiler level optimizations for QDQ models.
2) Added support to select intel devices based on the LUID. 
3) Code refactoring for improvement in querying the available devices
and setting it.
4) Load_config feature improvement to support AUTO, HETERO and MULTI
plugin.
5) Memory optimization during model compilation.
6) EPCtx optimizations.
7) Bug fixes.

---------

Signed-off-by: Jianhui Dai <jianhui.j.dai@intel.com>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: jatinwadhwa921 <110383850+jatinwadhwa921@users.noreply.github.com>
Co-authored-by: n1harika <niharika.sathish@intel.com>
Co-authored-by: Ankit Maheshkar <ankit.maheshkar@intel.com>
Co-authored-by: sfatimar <sahar.fatima@intel.com>
Co-authored-by: Jaskaran Singh Nagi <jaskaran.singh.nagi@intel.com>
Co-authored-by: Eric Crawford <eric.r.crawford@intel.com>
Co-authored-by: Sushanth Rajasankar <44513542+sushraja-msft@users.noreply.github.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: Seungtaek Kim <seungtaek.kim.94@gmail.com>
Co-authored-by: co63oc <co63oc@users.noreply.github.com>
Co-authored-by: Jambay Kinley <jambaykinley@microsoft.com>
Co-authored-by: Hector Li <hecli@microsoft.com>
Co-authored-by: Jian Chen <cjian@microsoft.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Jiajia Qin <jiajiaqin@microsoft.com>
Co-authored-by: Alessio Soldano <services@soldano.it>
Co-authored-by: Changming Sun <chasun@microsoft.com>
Co-authored-by: Ashish Garg <quic_ashigarg@quicinc.com>
Co-authored-by: Ashish Garg <ashigarg@qti.qualcomm.com>
Co-authored-by: Jie Chen <jie.a.chen@intel.com>
Co-authored-by: wp <webgraphics@intel.com>
Co-authored-by: Satya Kumar Jandhyala <satya.k.jandhyala@gmail.com>
Co-authored-by: Prathik Rao <prathik.rao@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: Jianhui Dai <jianhui.j.dai@intel.com>
Co-authored-by: xhcao <xinghua.cao@intel.com>
Co-authored-by: Wanming Lin <wanming.lin@intel.com>
Co-authored-by: Mark Schofield <mschofie@microsoft.com>
Co-authored-by: jiangzhaoming <zhaoming.jiang@microsoft.com>
Co-authored-by: Yi-Hong Lyu <yilyu@microsoft.com>
Co-authored-by: vraspar <vrajang@outlook.com>
Co-authored-by: Chi Lo <54722500+chilo-ms@users.noreply.github.com>
Co-authored-by: saurabh <saurabh1.kale@intel.com>
Co-authored-by: Ranjit Ranjan <165394499+ranjitshs@users.noreply.github.com>
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com>
Co-authored-by: Pallavi Gupta <pallavi.gupta@intel.com>
Co-authored-by: Nikolay Proshunin <nikolay.proshunin@intel.com>
### Description
<!-- Describe your changes. -->
Most models can benefit from fusing the pre-GQA nodes into a single
MatMul or MatMulNBits. This change detects the fusable patterns and
performs the fusion on the CUDA EP.


### Motivation and Context
This will enable publishing of a single GPU model going forward.

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
### Description

Update N-API version to 6.

- NAPI v6 is required for `napi_set_instance_data` and
`napi_get_instance_data`, as used by microsoft#24366
- Adding the "binary" field in package.json for CMake-js to work
correctly. (was unintentially removed in microsoft#24418)

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Fix compilation issue (undeclared identifier) in Azure EP unit test.



### Motivation and Context
A previous PR caused a compilation issue in the Azure EP unit test:
microsoft#24433

Our PR CI pipelines did not catch it. It was caught by our post-merge
packaging pipelines.

```shell
D:\a\_work\1\s\onnxruntime\test\providers\azure\azure_basic_test.cc(28,3): error C2065: 'session_options': undeclared identifier [D:\a\_work\1\b\RelWithDebInfo\onnxruntime_test_all.vcxproj]
D:\a\_work\1\s\onnxruntime\test\providers\azure\azure_basic_test.cc(29,3): error C2065: 'session_options': undeclared identifier [D:\a\_work\1\b\RelWithDebInfo\onnxruntime_test_all.vcxproj]
D:\a\_work\1\s\onnxruntime\test\providers\azure\azure_basic_test.cc(30,3): error C2065: 'session_options': undeclared identifier [D:\a\_work\1\b\RelWithDebInfo\onnxruntime_test_all.vcxproj]
```
### Description
If it would improve performance, this patch moves outputs to MLTensor
backed Tensors.

### Motivation and Context
We are currently performing an extra copy on output tensors located in
the CPU when using the WebNN EP (MLTensor -(copy)-> wasm heap -(copy)->
JS). This patch removes this copy by moving the readback to JS instead
of wasm. As an extra benefit, we can also start the readbacks and wait
for them in parallel.

This change is similar to microsoft#23073
### Description

Fix Nodejs binding build for Linux.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
…24373)

### Description

MatmulTransposeFusion does not work correctly when inputs A and B are the
same for a `MatMul` node.


![image](https://github.com/user-attachments/assets/48a6afd8-13d0-48d4-b86f-53a866c47803)

Fixes microsoft#24341

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
zeros_ memory buffer was uninitialized, but it must be initialized to
zero.


### Motivation and Context
A memory allocator change in GenAI started crashing in FlashAttention,
and this was eventually tracked down as the cause. The allocator change
was innocent. I'm not sure how this didn't fail previously, or whether
it did and we simply weren't getting reports about it.

Co-authored-by: Ryan Hill <{ID}+{username}@users.noreply.github.com>
…ft#24444)

### Description
Mapping ORT verbose logging back to QnnGpu Debug logging.

### Motivation and Context
Why is this change required? What problem does it solve?
As of now, this change is required for the QnnGpu backend to run models
correctly. Its necessity is mentioned in this commit:

microsoft@b4b5a79

It temporarily reverts the following commit, for the GPU case only, due
to a loss of functionality:

microsoft@9d45b9a
…24452)

### Description

 update Node.js binding document for 1.22 release



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Handle empty input cases in the native reduce kernel.



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
@jatinwadhwa921 jatinwadhwa921 requested a review from ankitm3k April 17, 2025 07:25
@jatinwadhwa921
Author

The internal CI support in ovep-develop will be added as part of the next PR. This PR has already been validated; the run for it is available here: https://github.com/intel/onnxruntime/actions/runs/14512989745/job/40715701698?pr=664

@jatinwadhwa921 jatinwadhwa921 merged commit 21c7ab3 into ovep-develop Apr 17, 2025
5 of 7 checks passed