Backmerging with Msft commits #692

jatinwadhwa921 · 2025-05-16T05:46:06Z

Backmerging with Msft commits

### Description
Integrate some neural compressor code since the ORT side in the repo is in maintenance mode.

### Motivation and Context
Enable k-quant quantization.

### Description
Add initial selection policy implementations.

Update device discovery:
- get vendor and vendor id for CPU from cpuid_info
- trim metadata to known useful fields
- NPU detection via dxcore only

Bug fixes/updates from PRs for C# and Python bindings.

Add some tests for selection policy:
- TODO: Add more tests

### Motivation and Context
Desire to boil oceans.

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

### Description
C# API updates for auto ep selection and the compilation API. Also includes a bugfix to OrtKeyValuePairs::Remove.

### Motivation and Context

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

### Description
As titled.

### Motivation and Context
The dependency is no longer needed.

…icrosoft#24629)

### Description
- Enables automatic selection of QNN EP for PREFER_NPU policy
- Fixes cpuid vendor id for Qualcomm to be `'Q' | ('C' << 8) | ('O' << 16) | ('M' << 24);`

Sample code from unit test:

```c++
// Tests autoEP feature to automatically select an EP that supports the NPU.
// Currently only works on Windows.
TEST_F(QnnHTPBackendTests, AutoEp_PreferNpu) {
  ASSERT_ORTSTATUS_OK(Ort::GetApi().RegisterExecutionProviderLibrary(*ort_env, kQnnExecutionProvider,
                                                                     ORT_TSTR("onnxruntime_providers_qnn.dll")));

  Ort::SessionOptions so;
  so.SetEpSelectionPolicy(OrtExecutionProviderDevicePolicy_PREFER_NPU);

  const ORTCHAR_T* ort_model_path = ORT_MODEL_FOLDER "nhwc_resize_sizes_opset18.quant.onnx";
  Ort::Session session(*ort_env, ort_model_path, so);
  EXPECT_TRUE(SessionHasEp(session, kQnnExecutionProvider));

  ASSERT_ORTSTATUS_OK(Ort::GetApi().UnregisterExecutionProviderLibrary(*ort_env, kQnnExecutionProvider));
}
```

### Motivation and Context
A recent feature allows ORT to automatically select an EP according to policies set by the user (e.g., prefer npu or prefer gpu). This PR allows QNN EP to be potentially selected when the user sets the `PREFER_NPU` policy.
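For reference, the vendor id expression above packs four ASCII characters into one 32-bit value, with the first character in the lowest byte. A minimal Python sketch of the same arithmetic follows; `pack_vendor_id` is a hypothetical helper written for illustration and is not part of ORT.

```python
import struct


def pack_vendor_id(tag: str) -> int:
    """Pack a 4-character ASCII tag into a 32-bit integer.

    The first character lands in the lowest byte, mirroring
    'Q' | ('C' << 8) | ('O' << 16) | ('M' << 24).
    """
    if len(tag) != 4:
        raise ValueError("tag must be exactly 4 characters")
    value = 0
    for position, ch in enumerate(tag):
        value |= ord(ch) << (8 * position)
    return value


QCOM_VENDOR_ID = pack_vendor_id("QCOM")
print(hex(QCOM_VENDOR_ID))

# Serializing the value as a little-endian uint32 recovers the tag bytes.
print(struct.pack("<I", QCOM_VENDOR_ID))
```

Because the first character occupies the low byte, the value reads as `QCOM` when laid out in little-endian byte order.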

…the same type… (microsoft#24633)

### Description
Fix debug assertion when there are two devices of the same type that don't match the vendor, e.g. WebGPU and DML.

### Motivation and Context

When under wasm we can't check for metal by looking at backend because it will always be WEBGPU. Because of this we'll take the DP4A path on metal that results in sub-optimal performance. Use vendor to check for metal instead.

…microsoft#24640)

### Description
Enable use_vcpkg for QNN Nuget package build and Python arm64ec build.

### Description
Python API updates for auto ep selection and the compilation API.
- Adds Python API `SessionOptions.add_provider()` (equivalent to C API's `SessionOptionsAppendExecutionProvider`)
- Adds Python API `SessionOptions.add_provider_for_devices()` (equivalent to C API's `SessionOptionsAppendExecutionProvider_V2`)
- Adds Python API `SessionOptions.set_provider_selection_policy()` (equivalent to C API's `SessionOptionsSetEpSelectionPolicy`)
- Adds Python API class `ModelCompiler` to compile models (wraps C API's `OrtModelCompilationOptions` and `CompileModel()`)
- TODO: Finish delegate callback. Need to add a `void*` parameter to delegate function.

### Sample program that uses autoep APIs
Adapted from a unit test.

```python
def test_cuda_prefer_gpu_and_inference(self):
    """
    Test selecting CUDA EP via the PREFER_GPU policy and running inference.
    """
    ep_lib_path = "onnxruntime_providers_cuda.dll"
    ep_registration_name = "CUDAExecutionProvider"

    if sys.platform != "win32":
        self.skipTest("Skipping test because device discovery is only supported on Windows")

    if not os.path.exists(ep_lib_path):
        self.skipTest(f"Skipping test because EP library '{ep_lib_path}' cannot be found")

    onnxrt.register_execution_provider_library(ep_registration_name, os.path.realpath(ep_lib_path))

    # Set a policy to prefer GPU. Cuda should be selected.
    sess_options = onnxrt.SessionOptions()
    sess_options.set_provider_selection_policy(onnxrt.OrtExecutionProviderDevicePolicy.PREFER_GPU)
    self.assertTrue(sess_options.has_providers())

    # Run sample model and check output
    sess = onnxrt.InferenceSession(get_name("mul_1.onnx"), sess_options=sess_options)

    x = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], dtype=np.float32)
    input_name = sess.get_inputs()[0].name
    res = sess.run([], {input_name: x})
    output_expected = np.array([[1.0, 4.0], [9.0, 16.0], [25.0, 36.0]], dtype=np.float32)
    np.testing.assert_allclose(output_expected, res[0], rtol=1e-05, atol=1e-08)
```

### Sample program that uses compile APIs
Adapted from a unit test that compiles using EP selection policy.

```python
def test_compile_with_files_prefer_npu_policy(self):
    """
    Tests compiling a model (to/from files) using an EP selection policy (PREFER_NPU).
    """
    ep_lib_path = "onnxruntime_providers_qnn.dll"
    ep_registration_name = "QNNExecutionProvider"
    onnxrt.register_execution_provider_library(ep_registration_name, ep_lib_path)

    input_model_path = get_name("nhwc_resize_scales_opset18.onnx")
    output_model_path = os.path.join(self._tmp_dir_path, "model.compiled0.onnx")

    session_options = onnxrt.SessionOptions()
    session_options.set_provider_selection_policy(onnxrt.OrtExecutionProviderDevicePolicy.PREFER_NPU)

    model_compiler = onnxrt.ModelCompiler(
        session_options,
        input_model_path,
        embed_compiled_data_into_model=True,
        external_initializers_file_path=None,
    )
    model_compiler.compile_to_file(output_model_path)
    self.assertTrue(os.path.exists(output_model_path))
    onnxrt.unregister_execution_provider_library(ep_registration_name)
```

Adapted from a unit test that compiles using explicit EPs.

```python
def test_compile_with_input_and_output_files(self):
    """
    Tests compiling a model (to/from files) using explicit EP.
    """
    provider = None
    provider_options = dict()
    if "QNNExecutionProvider" in available_providers:
        provider = "QNNExecutionProvider"
        provider_options["backend_type"] = "htp"
    # TODO(adrianlizarraga): Allow test to run for other compiling EPs (e.g., OpenVINO)

    input_model_path = get_name("nhwc_resize_scales_opset18.onnx")
    output_model_path = os.path.join(self._tmp_dir_path, "model.compiled1.onnx")

    session_options = onnxrt.SessionOptions()
    if provider:
        session_options.add_provider(provider, provider_options)

    model_compiler = onnxrt.ModelCompiler(
        session_options,
        input_model_path,
        embed_compiled_data_into_model=True,
        external_initializers_file_path=None,
    )
    model_compiler.compile_to_file(output_model_path)
    self.assertTrue(os.path.exists(output_model_path))
```

### Motivation and Context

### Description
As titled.

…ft#24641)

### Description
This PR adds a package version check for the dev channel. It should help avoid automatically publishing packages like "-rc.*" to the dev channel.

### Motivation and Context

### Description
Add support for selection policy delegate:
- split API function into one for the policy enum and one for the delegate
- add `void*` for user state
  - required to wire up using the delegate in other languages.

Add C# support for specifying the selection policy delegate.

Address comments from initial C# autoep support PR.

### Motivation and Context

### Description
This PR adds support for 8-bit quantization in the `MatMulNBits` operation in WebGPU. It does the following:
1. Unifies on `MatMulNBitsProgram` as the fallback path, which was originally the generation path for block size = 32. It now supports any block size without limitations, and the original complicated programs are removed.
2. Enables `MatMulNBitsWideTileProgram` for all platforms.

### Description
If indices is a scalar (0-dimensional tensor), the Gather op produces an incorrect output shape. Fix the Gather op bug in VSINPU EP.

### Motivation and Context

Signed-off-by: Kee <xuke537@hotmail.com>
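For context, the ONNX Gather shape rule replaces the gathered axis of the data shape with the full indices shape, so a scalar index removes that axis entirely. A minimal Python sketch of the rule follows; `gather_output_shape` is a hypothetical helper for illustration, not code from the VSINPU EP.

```python
def gather_output_shape(data_shape, indices_shape, axis=0):
    """Return the ONNX Gather output shape.

    The gathered axis of data_shape is replaced by indices_shape.
    A scalar index has shape (), so the axis simply disappears.
    """
    rank = len(data_shape)
    if not -rank <= axis < rank:
        raise ValueError("axis out of range")
    axis %= rank  # normalize negative axes
    return tuple(data_shape[:axis]) + tuple(indices_shape) + tuple(data_shape[axis + 1:])


# Scalar indices: the gathered axis is dropped, not kept as size 1.
print(gather_output_shape((3, 4), (), axis=0))
# 1-D indices: the gathered axis takes the length of the indices.
print(gather_output_shape((3, 4), (2,), axis=1))
```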

### Description
Fix type mismatch using float in place of unsigned int.

### Motivation and Context

fix shader compile; don't know how this made it past ci

…24645)

### Description
Python Cuda Publishing pipeline references old test pipeline

### Description
The random failure on Web CI is hard to investigate because it's not reproducible. Add this step to upload the log to help investigate the issue.

…ft#24650)

### Description
Fix the outputSize computation causing duplicate indices. The outputSize should be the size of the indices tensor without counting the last dimension.

### Motivation and Context
Fix the issue microsoft#24070
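A rough illustration of the corrected count, assuming the usual ND-indexing layout in which the last dimension of the indices tensor holds the coordinates of a single index tuple. `num_index_tuples` is a hypothetical name, not the kernel's actual code.

```python
from math import prod


def num_index_tuples(indices_shape):
    """Count index tuples: the product of all dimensions except the last.

    Assumption: the last dimension stores the coordinates of one tuple,
    so including it would visit every tuple more than once.
    """
    if len(indices_shape) < 1:
        raise ValueError("indices must have rank >= 1")
    return prod(indices_shape[:-1])


# Four 2-D coordinates: four units of work, not eight.
print(num_index_tuples((4, 2)))
print(num_index_tuples((2, 3, 2)))
```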

### Description
Header file "dawn/dawn_proc.h" is only used in a non-monolithic build of dawn.

The patch optimizes pool operators when the output size is small and the kernel size is big.

### Description

### Motivation and Context

…crosoft#24634)

### Description
Follow up to microsoft#24614

Example Python program (adapted from unit tests) that specifies a custom EP selection function to select OrtEpDevice(s) for compiling:

```python
def test_compile_with_ep_selection_delegate(self):
    # ...

    # User's custom EP selection function.
    def my_delegate(
        ep_devices: Sequence[onnxrt.OrtEpDevice],
        model_metadata: dict[str, str],
        runtime_metadata: dict[str, str],
        max_selections: int,
    ) -> Sequence[onnxrt.OrtEpDevice]:
        self.assertTrue(len(model_metadata) > 0)
        self.assertTrue(ep_devices and max_selections > 0)

        # Select the first and last devices (if there are more than one)
        selected_devices = [ep_devices[0]]
        if max_selections > 2 and len(ep_devices) > 1:
            selected_devices.append(ep_devices[-1])  # ORT CPU EP is always last

        return selected_devices

    session_options = onnxrt.SessionOptions()
    session_options.set_provider_selection_policy_delegate(my_delegate)

    model_compiler = onnxrt.ModelCompiler(
        session_options,
        input_model_path,
        embed_compiled_data_into_model=True,
        external_initializers_file_path=None,
    )
    model_compiler.compile_to_file(output_model_path)
```

How to raise an exception from the Python EP selection function:

```python
# User's custom EP selection function.
custom_error_message = "MY ERROR"

def my_delegate_that_fails(
    ep_devices: Sequence[onnxrt.OrtEpDevice],
    model_metadata: dict[str, str],
    runtime_metadata: dict[str, str],
    max_selections: int,
) -> Sequence[onnxrt.OrtEpDevice]:
    self.assertTrue(len(ep_devices) >= 1)
    raise ValueError(custom_error_message)

sess_options = onnxrt.SessionOptions()
sess_options.set_provider_selection_policy_delegate(my_delegate_that_fails)

# Create session and expect ORT to raise a Fail exception that contains our message.
with self.assertRaises(Fail) as context:
    onnxrt.InferenceSession(get_name("mul_1.onnx"), sess_options=sess_options)
self.assertIn(custom_error_message, str(context.exception))
```

### Motivation and Context

…e APIs (microsoft#24661)

### Description
Fixes documentation errors in comments within onnxruntime_c_api.h and onnxruntime_cxx_api.h.

### Motivation and Context
The [Generate C/C++ API docs](https://github.com/microsoft/onnxruntime/actions/runs/14855108283/job/41706460753#logs) action is failing with error:

```shell
Run mkdir -p build/doxygen
/mnt/vss/_work/onnxruntime/onnxruntime/include/onnxruntime/core/session/onnxruntime_cxx_api.h:775: error: explicit link request to 'OrtKeyValuePair' could not be resolved (warning treated as error, aborting now)
```

### Description
Added ScatterND operator to Native WebGPU EP.

### Motivation and Context
Required to increase coverage.

…icrosoft#24666)

### Description
Handle the user selection policy delegate throwing or returning too many selections in C# code, and create an error message.

### Motivation and Context

… unit tests (microsoft#24667)

### Description
Cleans up the usage of `ep_name` and `ep_registration_name` in the autoEP Python unit tests.

### Motivation and Context
Addresses comments from a previous PR: microsoft#24634

> nit: the registration name and EP names don't need to match. could we call this 'ep_name' to avoid potentially creating an assumption that they always do?

- Use ResizeNearestNeighbor Op for Resize with interpolation_mode=Nearest and rank-4 inputs.
- Add a unit test to verify the modified translation.

### Description
ResizeNearestNeighbor Op is faster for Resize with interpolation_mode=Nearest and rank-4 inputs.

### Motivation and Context
This commit matches Resize Op behavior in QNN-EP with the QNN Offline converter path. This fix also improves inference time.

### Description
Builds are not reproducible. Remove information that depends on the local environment from the build.

### Motivation and Context
Reproducible builds are important to ensure the package is reliable.

Signed-off-by: Andrew Davis <afd@ti.com>
Signed-off-by: Clément Péron <peron.clem@gmail.com>
Co-authored-by: Andrew Davis <afd@ti.com>

…4746) fixes error for https://huggingface.co/Xenova/musicgen-small on webgpu native ![4574e02e-fb31-41a9-9257-bd389373f64f](https://github.com/user-attachments/assets/d71b07db-863b-40ad-8478-08d94ac74f69)

### Description
Publish Windows debug symbols not only to Azure DevOps but also to msdl.microsoft.com. See https://learn.microsoft.com/en-us/windows-hardware/drivers/debugger/symbol-path for how to consume them.

### Description
Fix typos in multiple files

### Motivation and Context

### Description
Expose and test GetTensorSizeInBytes in C# and Python

### Motivation and Context
microsoft#24680

To include the following change: dmlc/dlpack#165

### Description
See the [example repository](https://github.com/jordanozang/onnxruntime_minimal_static) for a minimal example of using the static CMake Config.

- Add libraries built during the static onnxruntime build (onnxruntime_common, onnxruntime_mlas, etc.) to the onnxruntime export set. Additionally, add an onnxruntime::onnxruntime interface target that behaves much the same as that target in the shared build case. ``find_package(onnxruntime REQUIRED)`` ``target_link_libraries(example PRIVATE onnxruntime::onnxruntime)`` should now work.
- Minor modifications to ensure that dependency targets like Boost::mp11 are treated as imported targets and not part of the build interface.
- Static webgpu builds will currently not generate this CMake export

### Motivation and Context
- Resolves Issue microsoft#21351
- Builds on Pull Request microsoft#21348

WebNN doesn't provide a dedicated op for `MatMulInteger`. This PR supports `MatMulInteger` by decomposing it into `DequantizeLinear A, B -> MatMul -> Cast (to int32)`, and also makes some code optimizations.
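The decomposition can be checked against the direct integer definition with a small pure-Python sketch. This illustrates the math only and is not the WebNN EP code: all function names are hypothetical, and the use of a scale of 1.0 in the dequantize step is an assumption, since the description does not state it.

```python
def dequantize_linear(q, scale, zero_point):
    """DequantizeLinear on a 2-D list: (q - zero_point) * scale."""
    return [[(v - zero_point) * scale for v in row] for row in q]


def matmul(a, b):
    """Plain 2-D matrix multiply on nested lists."""
    columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns] for row in a]


def matmul_integer_reference(a, b, a_zero_point=0, b_zero_point=0):
    """Direct definition: integer matmul of the zero-point-shifted inputs."""
    shifted_a = [[v - a_zero_point for v in row] for row in a]
    shifted_b = [[v - b_zero_point for v in row] for row in b]
    return matmul(shifted_a, shifted_b)


def matmul_integer_decomposed(a, b, a_zero_point=0, b_zero_point=0):
    """DequantizeLinear A, B -> MatMul -> Cast (to int32), with scale 1.0 (assumed)."""
    float_a = dequantize_linear(a, 1.0, a_zero_point)
    float_b = dequantize_linear(b, 1.0, b_zero_point)
    return [[int(v) for v in row] for row in matmul(float_a, float_b)]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(matmul_integer_reference(a, b, 1, 2))
print(matmul_integer_decomposed(a, b, 1, 2))
```

The two paths agree as long as the floating-point accumulation stays exact, which holds for the small values here but may not hold for long reductions in lower-precision floats.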

microsoft#24757)

…24753)

### Description
Some CPUs don't show up in SetupApi info for some reason. Create a default entry if that is the case.

Manually tested by disabling the lookup of GUID_DEVCLASS_PROCESSOR info. Not sure of a better way to test.

### Motivation and Context
Fix crash, as other code assumes there will always be a CPU device.

This PR is a follow-up of microsoft#24547. Previously the pipelines had some issues that prevented me from modifying these files. Now the issue is solved.

…osoft#24761) Adjust fix for microsoft#24746

The [RotaryEmbedding](https://onnx.ai/onnx/operators/onnx__RotaryEmbedding.html#rotaryembedding) op has been released in opset 23 and has some differences compared to the original contributed op:
- The order of input indexes changed
- The position_ids input is optional
- If the input is 3D, the num_heads must be provided
- If it is full rotation, we need to slice the gathered cosine/sine to get the shape [batch_size, sequence_length, rotary_embedding_dim / 2]

### Description
Fix typos in multiple files

### Motivation and Context

Signed-off-by: co63oc <co63oc@users.noreply.github.com>

### Description
Fix typos in bert_defs.cc and contrib_defs.cc

### Motivation and Context

Signed-off-by: co63oc <co63oc@users.noreply.github.com>

### Description
Use aligned loads and preloading. This gives a ~10% token generation speedup.

### Motivation and Context
Optimize perf

- Currently QNN EP only supports MaxPool for rank 4.
- This change adds support for rank 3 input by adding Reshapes before and after the Op to ensure that the MaxPool gets input rank 4.
- Updated all attributes if converting rank 3 input to rank 4 by updating stride, pads, dilations and kernel size.
- Added unit tests which take input rank 3 to validate MaxPool on NPU.

### Description
This change extends the support of QNN EP's MaxPool operation to handle input tensors of rank 3. To achieve this, Reshape Ops are added before and after the MaxPool Op to ensure that the input to MaxPool is always of rank 4, as required. Additionally, the attributes such as stride, pads, dilations, and kernel size are updated accordingly to accommodate the conversion of rank 3 inputs to rank 4. Unit tests have been added to validate the functionality of MaxPool on NPU with rank 3 inputs.

### Motivation and Context
This change is required to enhance the flexibility and usability of QNN EP's MaxPool operation by supporting a broader range of input tensor ranks. Previously, the operation was limited to rank 4 inputs, which restricted support in certain scenarios. Adding support for rank 3 inputs solves the problem of limited compatibility and makes sure that the MaxPool op offloads to the NPU (QNN HTP Backend).
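One way to picture the attribute update is the sketch below, which lifts 1-D MaxPool attributes to their 2-D equivalents by inserting a unit height dimension. This is an illustration under assumptions, not the QNN EP implementation: the function name is hypothetical, and placing the unit dimension before the length (`[N, C, L]` to `[N, C, 1, L]`) is a guess. The pads follow the ONNX layout of all begin values followed by all end values.

```python
def lift_maxpool_1d_to_2d(input_shape, kernel_shape, strides=(1,), pads=(0, 0), dilations=(1,)):
    """Rewrite rank-3 MaxPool attributes for an equivalent rank-4 MaxPool.

    Assumption: a unit height dimension is inserted, so [N, C, L] becomes
    [N, C, 1, L]. Pooling over that unit dimension with kernel 1, stride 1,
    dilation 1 and no padding leaves the result unchanged.
    """
    if len(input_shape) != 3:
        raise ValueError("expected a rank-3 input shape [N, C, L]")
    n, c, length = input_shape
    return {
        "input_shape": (n, c, 1, length),
        "kernel_shape": (1, kernel_shape[0]),
        "strides": (1, strides[0]),
        # ONNX pads layout: [h_begin, w_begin, h_end, w_end]
        "pads": (0, pads[0], 0, pads[1]),
        "dilations": (1, dilations[0]),
    }


attrs = lift_maxpool_1d_to_2d((1, 8, 16), kernel_shape=(3,), strides=(2,), pads=(1, 1))
print(attrs)
```

A second Reshape after the pool would then drop the unit dimension again to restore a rank-3 output.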

Remove onnxruntime-mlas section

The `symbolFolder` parameter in publish-symbolrequestprod-api.yml was actually unused. The yaml was copied from another project and I didn't check the code. Delete tools/ci_build/github/azure-pipelines/templates/py-packaging-training-cuda-stage.yml.

…iven. (microsoft#24781)

### Description
Always write to profiling file if `profiling_file_path` is given.

### Motivation and Context
Previously, on Windows, if the ETW path is enabled, the profiling data will not be written to the file even if `profiling_file_path` is given. I thought that this behavior was confusing.

…ft#24779)

### Description
Adds MultiHeadAttention operator support to MIGraphX EP to leverage the existing MIGraphX parser and implementation.

### Motivation and Context
Needed for model enablement

### Description
Adds enablement for MIGraphX EP to use MIGraphX's QuickGelu parser and op

### Motivation and Context
Required for model support

### Description
Memtype MemHandle is applicable only to graph I/O tensors. For other tensors we can leave it as RAW.

### Motivation and Context
Compose failed for some models because Memtype was set to MemHandle for static tensors.

…soft#24752)

Enable MaxPool Op with "auto_pad" param set as VALID. VALID runs with all pad values set to 0.

### Description
Remove the assert from QNN_EP for MaxPool Op with "auto_pad" as VALID, since the Op with this config is supported on the QNN backend.

### Motivation and Context
QNN_EP rejects MaxPool Op with "auto_pad" as VALID with a message that the QNN Pool does not support this config. QNN Pool Op supports auto_pad=VALID, and all the pad values are set to 0.

Signed-off-by: quic-ankus <quic_ankus@quicinc.com>
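Since VALID means no padding, the output extent along each spatial axis follows directly from the input size, kernel, stride and dilation. A small sketch of that arithmetic under the ONNX shape rule follows; `valid_pool_output_size` is a hypothetical helper, not QNN EP code.

```python
def valid_pool_output_size(in_size, kernel, stride=1, dilation=1):
    """Output length of one spatial axis for auto_pad=VALID (all pads are 0)."""
    effective_kernel = (kernel - 1) * dilation + 1
    if in_size < effective_kernel:
        raise ValueError("input is smaller than the effective kernel")
    # Only windows that fit entirely inside the input are counted.
    return (in_size - effective_kernel) // stride + 1


print(valid_pool_output_size(7, kernel=3, stride=2))
print(valid_pool_output_size(8, kernel=3, stride=2))
print(valid_pool_output_size(10, kernel=3, stride=1, dilation=2))
```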

jiafatom and others added 30 commits May 2, 2025 21:38

K quant (microsoft#24615)

25c43c1

### Description
Integrate some neural compressor code since the ORT side in the repo is in maintenance mode.

### Motivation and Context
Enable k-quant quantization.

Remove neural_compressor dependency in MatMulNBits (microsoft#24627)

6fa8ba1

### Description
As titled.

### Motivation and Context
The dependency is no longer needed.

Enable use_vcpkg for QNN Nuget package build and Python arm64ec build (…

8bf5362

…microsoft#24640)

### Description
Enable use_vcpkg for QNN Nuget package build and Python arm64ec build.

Publish debug symbols for windows (microsoft#24643)

1ef7b1b

k_quant should have zero_point (microsoft#24647)

bb5d2c2

### Description
As titled.

[webgpu] fix compile errors in instancenorm (microsoft#24639)

7942b0c

fix shader compile; don't know how this made it past ci

Fix source name in CUDA publishing pipeline configuration (microsoft#…

cab3c42

…24645)

### Description
Python Cuda Publishing pipeline references old test pipeline

allow upload log on failure for further investigating (microsoft#24649)

1f4ca88

### Description
The random failure on Web CI is hard to investigate because it's not reproducible. Add this step to upload the log to help investigate the issue.

fix header include in webgpu_context.cc (microsoft#24648)

7bec521

### Description
Header file "dawn/dawn_proc.h" is only used in a non-monolithic build of dawn.

Use build id when publishing symbols (microsoft#24662)

8e7c0ac

clementperon and others added 26 commits May 13, 2025 17:51

Fix policheck error in string_normalizer.h (microsoft#24669)

d23eb9e

[WebGPU EP] add output_size==0 check in transpose kernel (microsoft#2…

8f0865d

…4746) fixes error for https://huggingface.co/Xenova/musicgen-small on webgpu native ![4574e02e-fb31-41a9-9257-bd389373f64f](https://github.com/user-attachments/assets/d71b07db-863b-40ad-8478-08d94ac74f69)

Publish symbols to MSDL (microsoft#24748)

a755ebf

### Description
Publish Windows debug symbols not only to Azure DevOps but also to msdl.microsoft.com. See https://learn.microsoft.com/en-us/windows-hardware/drivers/debugger/symbol-path for how to consume them.

[C#] Fix typos in multiple files (microsoft#24665)

80c390f

### Description
Fix typos in multiple files

### Motivation and Context

Expose and test GetTensorSizeInBytes in C# and Python (microsoft#24728)

da3c24b

### Description
Expose and test GetTensorSizeInBytes in C# and Python

### Motivation and Context
microsoft#24680

Update dlpack to a newer version (microsoft#24760)

d7e49c7

To include the following change: dmlc/dlpack#165

[WebNN] Support MatMulInteger op (microsoft#24687)

d186e28

WebNN doesn't provide a dedicated op for `MatMulInteger`. This PR supports `MatMulInteger` by decomposing it into `DequantizeLinear A, B -> MatMul -> Cast (to int32)`, and also makes some code optimizations.

Skipping newly added MHA test as this isn't supported by ROCm EP like… (

de0415e

microsoft#24757)

Update pull request pipeline triggers (microsoft#24762)

c4495a9

This PR is a follow-up of microsoft#24547. Previously the pipelines had some issues that prevented me from modifying these files. Now the issue is solved.

[WebGPU EP] move transpose output_size check to computeinternal (micr…

e877726

…osoft#24761) Adjust fix for microsoft#24746

[x86] matmul8bit memory loading perf tuning (microsoft#24732)

fd3e0e8

### Description
Use aligned loads and preloading. This gives a ~10% token generation speedup.

### Motivation and Context
Optimize perf

Update CODEOWNERS (microsoft#24780)

79ff843

Remove onnxruntime-mlas section

Merge branch 'master' into sync_msft_16_5_25

e6dd15b

jatinwadhwa921 requested a review from ankitm3k May 16, 2025 06:19

ankitm3k approved these changes May 16, 2025

ankitm3k merged commit 080f66b into ovep-develop May 16, 2025
6 of 8 checks passed

ankitm3k deleted the sync_msft_16_5_25 branch May 16, 2025 06:42

36 participants