forked from microsoft/onnxruntime
Backmerging with Msft commits #721
Merged
Conversation
…24952)
### Description
- In InstanceNormOpBuilder, when inputs[1] (scale) is constant and running on the NPU backend, transform the S8 scale into a U8 scale, since QNN HTP does not support an SFIXED InstanceNorm scale.
- Allow RunInstanceNormQDQTest to specify different data types for inputs[0] (data) and inputs[1] (scale).
- Add the corresponding unit test QnnHTPBackendTests.InstanceNormU8S8.
### Motivation and Context
- Prevent the InstanceNorm op from falling back to CPU on an SFIXED scale.
### Description
This PR is a redo of the reverted PR microsoft#24935.
- Upgrade the version of Dawn to 991810463a.
- Compared to the previous commit c3999d7e3, it includes the fix for GCC 13, so the Linux build now passes.
- Includes previously merged patches:
  - https://dawn-review.googlesource.com/c/dawn/+/242335
  - https://dawn-review.googlesource.com/c/dawn/+/242634
  - https://dawn-review.googlesource.com/c/dawn/+/243614
- Remaining patches:
  - `dawn_destroy_buffer_on_destructor.patch` **Allow the WGPUBufferImpl class to destroy the buffer in the destructor**
    In the native implementation, wgpuBufferRelease triggers the buffer destroy (if the refcount drops to 0), but in the emwgpu implementation the buffer destroy does not happen. This change adds a destructor to the buffer class to destroy the buffer when the refcount is 0 for non-external buffers.
  - `dawn_force_enable_f16_nvidia_vulkan.patch` **Force enable f16 support for NVIDIA Vulkan**
    Dawn disabled f16 support for NVIDIA Vulkan by default because of crashes in f16 CTS tests (crbug.com/tint/2164). Since the crashes are limited to specific GPU models, we patched Dawn to remove the restriction.
  - `dawn_allow_non_uniform_for_subgroup_matrix_args.patch` (new) **Revert change "[tint] Check uniformity for subgroup matrix builtin arguments"**
    The following upstream change forces the subgroup matrix builtin arguments to be uniform: https://dawn-review.googlesource.com/c/dawn/+/236054 (https://issues.chromium.org/issues/403611487). Since we use `subgroup_id` as a subgroup matrix builtin argument, we revert this change to allow the subgroup matrix builtin arguments to be non-uniform.
### Description Update Qnn default version to 2.35.0.250530
…rosoft#24949) ### Description Replace the method of updating the opset version with onnx.version_converter.convert_version in static_quantize_runner. ### Motivation and Context The current method of updating the opset version involves directly modifying the model's opset version, which does not ensure the model's validity.
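As a hedged illustration of the replacement approach (not the static_quantize_runner code itself), the ONNX version converter can be used as follows; the file names and target opset below are placeholders:

```python
import onnx
from onnx import version_converter

# Sketch: upgrade the default opset via the converter instead of editing
# opset_import entries directly, then verify the result is still a valid model.
model = onnx.load("model.onnx")                           # placeholder path
converted = version_converter.convert_version(model, 21)  # example target opset
onnx.checker.check_model(converted)
onnx.save(converted, "model_opset21.onnx")
```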
### Description This PR creates and saves the JSON files needed to run the Whisper models that are exported with `--no_beam_search_op` in ONNX Runtime GenAI. ### Motivation and Context The JSON files need to be dynamically created based on the settings provided during export.
### Description Enable boolean support for expand op ### Motivation and Context Additional op data type coverage.
### Description This PR converts TensorProto graph initializers to TensorProto/OrtValue pairs. Currently, we only split the output for some optimizers into the above pairs. Eventually, we should be able to convert all initializers to OrtValues on load. Small weights will continue to be an exception, as they are sometimes required by ONNX inference functions. Some of the graph API leaks to EPs, so we are not able to remove it at present, and this constrains our ability to convert everything at once. ### Motivation and Context Lay the groundwork for proper layer separation and eventually eliminate weight copies in the EPs.
### Description This PR changes the class from which the Whisper tokenizer is loaded. ### Motivation and Context By using `AutoTokenizer` instead of `WhisperTokenizer`, the `save_pretrained` method will also save the `tokenizer.json` file. This file is used in ONNX Runtime GenAI's tokenizers.
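A minimal sketch of the loading change described above; the model id and output directory are examples, not values taken from the PR:

```python
from transformers import AutoTokenizer

# AutoTokenizer resolves to the fast tokenizer when available, and its
# save_pretrained also writes tokenizer.json, which ONNX Runtime GenAI consumes.
tokenizer = AutoTokenizer.from_pretrained("openai/whisper-tiny")  # example model id
tokenizer.save_pretrained("./whisper_tokenizer")                  # example output dir
```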
### Description Suppress some build warnings on Windows for sm=90. ### Motivation and Context I saw some build warnings when building on Windows for sm=90 with CUDA 12.8. They are caused by CUTLASS and are safe to suppress.
…#25024) ### Description - Add the argument "--dynamic_input_shapes" to fix dynamic input shapes in qnn.preprocess. ### Motivation and Context - Because QNN-EP doesn't support dynamic shapes, this argument lets a model with dynamic input dims end up with static input shapes after qnn.preprocess.
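For context, a hedged sketch of what fixing a dynamic input shape amounts to, written directly against the onnx API rather than the qnn.preprocess implementation; the dim names and values are assumptions:

```python
import onnx

model = onnx.load("model.onnx")                          # placeholder path
overrides = {"batch_size": 1, "sequence_length": 128}    # assumed symbolic dims
for value_info in model.graph.input:
    for dim in value_info.type.tensor_type.shape.dim:
        if dim.dim_param in overrides:
            # Setting dim_value clears dim_param, making the shape static.
            dim.dim_value = overrides[dim.dim_param]
onnx.save(model, "model_static.onnx")
```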
### Description Add retries to node test data downloading. This should help fix random pipeline failures caused by download failures.
### Description Adds a PagedAttention op which enables a paged KV cache. Inputs to this op are unpadded (packed / varlen), so cumulative sequence lengths are a required input. ### Motivation and Context Adding this op to ONNX Runtime is necessary to allow the GenAI team to enable a continuous batching server API.
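To illustrate the cumulative sequence lengths input mentioned above (an illustrative sketch, not the op's kernel code):

```python
import numpy as np

# Packed/varlen layout: tokens of all sequences are concatenated along one axis,
# and cu_seqlens marks each sequence's start/end offsets in that packed axis.
seq_lens = np.array([3, 5, 2], dtype=np.int32)   # example per-sequence lengths
cu_seqlens = np.concatenate(([0], np.cumsum(seq_lens))).astype(np.int32)
# cu_seqlens == [0, 3, 8, 10]; sequence i occupies rows cu_seqlens[i]:cu_seqlens[i+1]
```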
…oft#25027) The util `GetWebNNOpFirstInputName` previously accepted a WebNN op name as its parameter, which always returned "input" because the key of `op_inputs_map` is the ONNX op type. This PR fixes the bug by: - Changing the parameter of `GetWebNNOpFirstInputName` to an ONNX op - Changing `decomposed_op_map` to an ONNX-to-ONNX op mapping, to improve the search complexity.
…oft#25031) Modified the script to explicitly add a build time dependency on onnxruntime.dll
…ft#25022) Using the WebNN EP from C++ code requires calling Module.webnnRegisterMLContext from an EM_ASM block.
The build time for the Windows CUDA job is reduced from 3.5 hours to 1 hour. Also adds --use_vcpkg there.
### Description Make OrtDevice generic by removing vendor-specific hardcoded memory types and replacing them with generic DEFAULT/HOST_ACCESSIBLE values and an explicit vendor id to differentiate. Remove the unnecessary device id from OrtMemoryInfo; the OrtDevice in OrtMemoryInfo already provides it. ### Motivation and Context Support plugin EPs
This disables the uniformity checks for SubgroupMatrix in the WebGPU EP.
### Description This change cherry-picks telemetry changes from win-onnxruntime to improve telemetry data collection for ONNX Runtime on Windows. ### Motivation and Context These changes are already present in win-onnxruntime, so cherry-picking these changes here for Windows use cases that rely on public ONNX Runtime.
…crosoft#25055) ### Description Revert [Improve Windows ETW callback registration and fix issues](microsoft#24877) to unblock python packaging pipeline. ### Motivation and Context Python packaging pipeline is failing due to the changes in the PR. --------- Co-authored-by: Dmitri Smirnov <dmitrism@microsoft.com>
…#25025) ### Description This commit refactors the `DP4AMatMulNBitsSmallMProgram` to allow both `tile_size_k_vec` and `tile_size` to be configured. This change allows more flexibility for performance tuning without altering the core shader functionality. There is no functional change in this commit. ### Motivation and Context This is a preparatory change to enable `DP4AMatMulNBitsSmallMProgram` performance optimization work in subsequent commits. --------- Co-authored-by: Jiajia Qin <jiajiaqin@microsoft.com>
### Description The result of the Clip op is incorrect when min > max on CUDA. This fixes the implementation to align with the operator spec: https://onnx.ai/onnx/operators/onnx__Clip.html ### Motivation and Context ONNX backend test failure with ONNX 1.18 (python onnx_backend_test_series.py): FAIL: test_clip_min_greater_than_max_cuda (__main__.OnnxBackendNodeModelTest); DESIRED: array([1., 1., 1.], dtype=float32), ACTUAL: array([2., 2., 1.], dtype=float32)
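A small reference sketch of the spec-consistent behavior (illustrative values, not the backend test's exact inputs): clamping to min first and then to max means every element saturates to max when min > max.

```python
import numpy as np

def clip_reference(x, min_v, max_v):
    # ONNX-style Clip: apply the lower bound first, then the upper bound.
    return np.minimum(np.maximum(x, min_v), max_v)

x = np.array([0.0, 2.0, 4.0], dtype=np.float32)
print(clip_reference(x, min_v=2.0, max_v=1.0))  # [1. 1. 1.], matching the DESIRED output
```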
### Description The patch enables the Intel subgroup matrix path on the matmul_bits operator. It is temporarily supported only on the Vulkan backend and the Xe-2LPG architecture; we will extend it to more subgroup matrix configs and platforms.
…25056) Missing test failures from ONNX 1.18 integration. 1. The RMSNorm test models are invalid in terms of the ONNX spec (e.g. Model rms_normalization_2d_axis1_expanded failed to load: Node () Op (Range) [ShapeInferenceError] Input to 'Range' op should be scalars (Tensor with only one element and shape empty)). 2. TopK with uint64 is currently not supported.
…t#24238) ### Description Allows users to configure and enable the global thread pool via Python, and have inference sessions use it instead of session-local thread pools. ### Motivation and Context Forked off of microsoft#23495 to take over the implementation; see issue microsoft#23523. Our particular use case involves a single service instance serving thousands of individual models, each relatively small (e.g. small decision trees). Creating individual services for each model is too much overhead, and attempting to start several thousand thread pools is a non-starter. We could possibly have each session be single-threaded, but we would like to be able to separate the request handler thread count from the compute thread count (e.g. 2 handler threads but 4 intra-op ones). --------- Co-authored-by: alex-halpin <alex.halpin@prizepicks.com>
Replace `ToStatus` with `ToStatusAndRelease` which also releases the `OrtStatus*`.
…t#25026) Resolves the below error for the florence2 model by using the WebGPU built-in sqrt(x) function instead of pow(x, y) when the exponent is 0.5. The sqrt(x) built-in is both faster and more stable than using pow(x, 0.5).
```
Non-zero status code returned while running Reshape node. Name:'/Reshape_1' Status Message: /Users/runner/work/1/s/onnxruntime/core/providers/cpu/tensor/reshape_helper.h:47 onnxruntime::ReshapeHelper::ReshapeHelper(const TensorShape &, TensorShapeVector &, bool) input_shape_size == size was false. The input tensor cannot be reshaped to the requested shape. Input shape:{1,576,32}, requested shape:{1,23,23,32}
```
### Description Fix the CANN compilation error in microsoft#25074
Fix a misspelling in microsoft#25056: TopK with uint64 is currently not supported in ORT.
### Description Adds support for the ONNX Reciprocal operator in QNN EP via Div decomposition. - Implemented ReciprocalOpBuilder to support the ONNX Reciprocal op in QNN EP. - Decomposed Reciprocal into a Div op. - Added unit tests to run the Reciprocal op on HTP. ### Motivation and Context Enables execution of models using Reciprocal on the QNN backend, improving op support.
### Description This PR sets the foundation for the EP ABI, which allows plugin-EPs to interface with ORT using a binary stable interface. A plugin-EP can be built separately from ORT and is not tied to a specific commit of ORT. Currently, this PR adds basic APIs necessary to allow an example plugin-EP to compile and run a simple model with a single `Mul` node. - Example plugin-EP implementation: https://github.com/microsoft/onnxruntime/blob/adrianl/ep-abi/onnxruntime/test/autoep/library/example_plugin_ep.cc - APIs: - Graph IR: https://github.com/microsoft/onnxruntime/blob/adrianl/ep-abi/include/onnxruntime/core/session/onnxruntime_c_api.h#L5290-L5439 - Plugin EP: https://github.com/microsoft/onnxruntime/blob/adrianl/ep-abi/include/onnxruntime/core/session/onnxruntime_c_api.h#L6177-L6481 - Example app code (from unit tests): https://github.com/microsoft/onnxruntime/blob/adrianl/ep-abi/onnxruntime/test/autoep/test_autoep_selection.cc#L614 ### Motivation and Context Based on microsoft#21450 --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description Add early exit in ComputeModelGraphHash when EPContext nodes are present, returning "0" to indicate pre-compiled model state. Conditionally skip ComputeModelWeightHash when graph hash is "0" to avoid unnecessary computation for pre-compiled models. This optimization reduces overhead for models containing EPContext nodes, which represent execution provider pre-compiled subgraphs. ### Motivation and Context Currently, the hash generated by ComputeModelGraphHash function when the graph contains EPContext nodes does not correctly represent the graph because we do not hash the contents of the context pointed to by the EPContext node. Thus, it makes more sense to skip hashing for cases involving EPContext nodes.
### Description Missed a few documentation errors in the [previous PR](microsoft#24887). This PR fixes the C/C++ API documentation generation action: https://github.com/microsoft/onnxruntime/actions/runs/15749541590 ### Motivation and Context Fix the C/C++ API documentation generation GitHub action.
### Description Some initializers are stored as in-memory external data, WebNN EP should support these initializers. ### Motivation and Context This PR: - Added `HasExternalDataInMemory` check for external data to avoid unexpected error. - Wrapped the `UnpackInitializerData` to make it compatible with external data. Fixed microsoft#25078
### Description
This PR allows to use WebGPU EP in `onnxruntime-web` NPM package.
### Migration Plan
Currently, there are 2 different EPs implementing the WebGPU backend of
onnxruntime-web: JSEP and the WebGPU EP. The migration plan is to replace
JSEP with the WebGPU EP and eventually remove JSEP. The plan contains the
following stages:
- STAGE 1: enable WebGPU EP on onnxruntime-web in local build. (Done)
- **STAGE 2: enable WebGPU EP on onnxruntime-web in the public package.
(This PR)**
- STAGE 3: remove JSEP from onnxruntime-web.
### Package consumption changes
- Default import (`import 'onnxruntime-web'`) and CPU-only import
(`import 'onnxruntime-web/wasm'`) keep their previous behavior.
- WebGPU import (`import 'onnxruntime-web/webgpu'`) will now use the WebGPU
EP instead of JSEP. Previously it was the same as the default import.
- WebGPU import will use a different suffix for the .mjs and .wasm name
(which was `.jsep`):
- ort-wasm-simd-threaded<b>.asyncify</b>.mjs
- ort-wasm-simd-threaded<b>.asyncify</b>.wasm
- The suffix `.asyncify` is used because `.jspi` is planned for the future.
They are 2 different Emscripten mechanisms for async support (a synchronous
C++ function calling an async JS function).
1. Updated abseil to the latest version, which fixed all BinSkim warnings, so we no longer need to patch it for those. 2. However, the new abseil itself has some extra warnings, so I had to patch it for them. 3. Added "--compile_no_warning_as_error" to the Windows 32-bit build. I will fix the warnings later.
### Description This change implements **Metadata-based hash override** optimization to the model hashing logic. Added logic to check for "model_hash" in model metadata before computing hashes. If present, both model_graph_hash and model_weight_hash are set to the metadata value, bypassing computation entirely. ### Motivation and Context ONNX models generated using the Olive toolchain now have the option to include the model hash as part of the ONNX metadata. For such models, it would be beneficial to use the provided hash instead of computing it from scratch.
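A hedged sketch of the metadata convention described above, using the onnx Python API; the file name and hash value are placeholders:

```python
import onnx

model = onnx.load("model.onnx")                               # placeholder path
# Producer side (e.g. Olive): store a precomputed hash in the model metadata.
entry = model.metadata_props.add()
entry.key, entry.value = "model_hash", "d41d8cd98f00b204e9800998ecf8427e"  # example value

# Consumer side: if the key is present, use it instead of re-hashing graph and weights.
metadata = {p.key: p.value for p in model.metadata_props}
precomputed_hash = metadata.get("model_hash")
```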
### Description This change introduces a 6x8 QGEMM micro kernel for the WASM relaxed SIMD build. ### Motivation and Context This change optimizes the performance of QGEMM on x64 devices with AVX-VNNI.

| Mlas bench/RPL laptop/node v24.1.0 | baseline | opt | diff |
|---|---|---|---|
| QGEMM/UnsignedANoPackB/M:384/N:1024/K:1024/Batch:1/Threads:4/real_time | 2452212 | 1708338 | 44% |
| QGEMM/UnsignedANoPackB/M:384/N:1024/K:3072/Batch:1/Threads:4/real_time | 9053789 | 6395584 | 42% |
| QGEMM/UnsignedANoPackB/M:384/N:1024/K:4096/Batch:1/Threads:4/real_time | 12109727 | 8189719 | 48% |
| QGEMM/UnsignedANoPackB/M:384/N:4096/K:1024/Batch:1/Threads:4/real_time | 11787607 | 7926226 | 49% |
Enhance MatMulNBits CUDA kernel testing: (1) Add a kernel test for the different CUDA kernels used in MatMulNBits. (2) Refactor the GEMM profiler to use the CUDA allocator. (3) Add verbose logging macros. (4) Adjustments to speed up compilation when sm90 is excluded from the build.
microsoft#25117) Fix: delay CUDADriverWrapper instantiation to avoid uncaught exceptions when CUDA is unavailable ### Description This PR moves the static instantiation of CUDADriverWrapper from a class-level static field to a function-local static inside CUDADriverWrapper::GetInstance(). This change ensures that the CUDA driver is only loaded when the instance is actually needed, rather than at static initialization time. It preserves the singleton behavior while deferring instantiation to runtime. ### Motivation and Context When libcuda.so.1 is not available on the system, the constructor of CUDADriverWrapper throws an exception. Previously, this exception was triggered during static initialization, leading to an uncatchable std::terminate() and process termination. By moving the instance into GetInstance() as a function-local static, the exception can now be caught by client code (e.g., in try/catch), allowing graceful fallback when CUDA is unavailable.
)
* Change 90 to 90a, because FpA IntB Gemm uses accelerated features (like WGMMA, TMA and setmaxnreg).
* Change 90 to `90a-real;90a-virtual`, because a recent change in [cuda_configuration.cmake](https://github.com/microsoft/onnxruntime/blob/main/cmake/external/cuda_configuration.cmake) replaces "90" with "90a-real", and we need to explicitly add the virtual one, `90a-virtual`, to the list.

Note that our pipelines are still using CUDA 12.2, so we are not able to add 100 and 120 to the list. We can add them when we upgrade to CUDA 13 in the future.
…ft#25102) ### Description - Add the argument "--exclude_initializer_from_input" to remove initializers from graph inputs if model.ir_version >= 4. - Add a function argument to remove_initializer_from_input. - Modify onnxruntime_python.cmake to include remove_initializer_from_input.py. ### Motivation and Context - To solve the issue `Initializer <name> appears in graph inputs and will not be treated as constant value/weight.`, we need remove_initializer_from_input to be included in `qnn.preprocess`.
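A minimal sketch of the transformation the script performs, assuming the standard onnx in-memory representation; file names are placeholders:

```python
import onnx

model = onnx.load("model.onnx")                     # placeholder path
if model.ir_version >= 4:
    # From IR version 4 on, initializers do not need matching graph inputs,
    # so dropping them silences the "appears in graph inputs" warning.
    initializer_names = {init.name for init in model.graph.initializer}
    kept = [vi for vi in model.graph.input if vi.name not in initializer_names]
    del model.graph.input[:]
    model.graph.input.extend(kept)
onnx.save(model, "model_no_init_inputs.onnx")
```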
…icrosoft#24869) Enable NV TRT RTX EP engines to always be weight-stripped when using EP Context. We want to always use weight-stripped engines for EP Context to reduce the disk footprint on end-user systems. With this, there are two ways to load weights: 1. provide weights via bytestream (recommended) 2. have the original `model.onnx` present in the same folder as the `model_ctx.onnx`
```cpp
std::vector<char> model_bytes = ReadFileFromDisk("model.onnx");
// weight refitting using bytestream
std::unordered_map<std::string, std::string> rtx_ep_options;
rtx_ep_options[onnxruntime::nv::provider_option_names::kONNXBytestream] = std::to_string(reinterpret_cast<size_t>(model_bytes.data()));
rtx_ep_options[onnxruntime::nv::provider_option_names::kONNXBytestreamSize] = std::to_string(model_bytes.size());
```
…5132) ### Description Adds the start time of WebGPU kernel profiling to the logging output. ### Motivation and Context To aid in performance analysis, this change includes the kernel profiling start time in addition to the existing execution time. This allows for a more detailed understanding of kernel performance and scheduling.
fix build error
### Description Add UDO support in QNN. Example usage: ` ./onnx_test_runner -v -e qnn -j 1 -i "backend_path|./libQnnCpu.so op_packages|<op_type>:<op_package_path>:<interface_symbol_name>[:<target>],<op_type2>:<op_package_path2>:<interface_symbol_name2>[:<target2>]" <models> ` ### Motivation and Context Add QNN EP UDO support. For more information, see [op packages](https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-50/op_packages.html)
### Description This PR makes the intermediate generated buffers static in GQA for the static kv cache so that it is possible to use the graph capture capability on LLMs. The changes may improve the buffer cache hit rate but also slightly increase the average GPU memory usage.
…5090) ### Description In the original script, the order of inputs in the generated ONNX model follows the order in the JSON file instead of the id. Therefore, the script will first sort the inputs by ID before wrapping the context into the new model. ### Motivation and Context If we use gen_qnn_ctx_onnx_model.py, it will wrap the context based on the order in qnn_model_net.json instead of the id in qnn_model_net.json, which may result in the onnx model having an incorrect order of inputs. --------- Co-authored-by: chuteng <chuteng@qti.qualcomm.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
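A hedged sketch of the reordering step; the JSON field names ("graph", "inputs", "id") are assumptions about the qnn_model_net.json layout, not a verified schema:

```python
import json

with open("qnn_model_net.json") as f:
    net = json.load(f)

inputs = net["graph"]["inputs"]                        # assumed layout
# Wrap the context using the ids, not the order the entries appear in the file.
inputs_sorted = sorted(inputs, key=lambda entry: entry["id"])
```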
### Description This change optimizes MlasGemmQuantKernel for the WASM SIMD build by introducing a 4x8 micro kernel. ### Motivation and Context This change optimizes the performance of QGEMM on x64 devices using the WASM SIMD build.

| Mlas bench/LNL laptop/node v24.2.0 | improvement |
|---|---|
| QGEMM/UnsignedANoPackB/M:384/N:1024/K:1024/Batch:1/Threads:4/real_time | 51% |
| QGEMM/UnsignedANoPackB/M:384/N:1024/K:3072/Batch:1/Threads:4/real_time | 50% |
| QGEMM/UnsignedANoPackB/M:384/N:1024/K:4096/Batch:1/Threads:4/real_time | 51% |
| QGEMM/UnsignedANoPackB/M:384/N:4096/K:1024/Batch:1/Threads:4/real_time | 71% |
### Description This PR updates the default value for `past_present_share_buffer` in the GenAI config for Whisper. ### Motivation and Context ONNX Runtime GenAI does not currently support buffer sharing during beam search. Whisper is often used for beam search so this should be set to false by default. However, the CUDA model is an exception as the `DecoderMaskedMultiHeadAttention` kernel inside `MultiHeadAttention` requires buffer sharing and manages beam search through its `cache_indirection`.
…tvh (microsoft#25148) Per the Windows team's CyberEO requirement, teams are required to enable warnings such as 4018, 4146, 4244, 4267, 4302, 4308, 4509, 4532, 4533, 4700, 4789, 4995, and 4996.
### Description Add ONNX RotaryEmbedding(23) following https://github.com/onnx/onnx/blob/main/docs/Operators.md#RotaryEmbedding. The PR uses the contrib op RotaryEmbedding implementation under the hood. The main difference between this op and the contrib op is that `position_ids` in ONNX RotaryEmbedding is optional. When it is not provided, `cos_cache` and `sin_cache` should be 3D. ### Motivation and Context Fix microsoft#24556 Reference microsoft#23507
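For reference, an illustrative non-interleaved rotary-embedding sketch (not the ORT kernel, and not an exact transcription of the ONNX spec): the last dimension is split in half and rotated by per-position cos/sin values.

```python
import numpy as np

def rotary(x, cos, sin):
    # x: (..., seq, head_dim); cos/sin: (seq, head_dim // 2), already gathered for
    # the relevant positions (explicit position_ids or the implicit range 0..seq-1).
    half = x.shape[-1] // 2
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
```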
### Description Enable the VTCM backup buffering feature on QNN EP, assuming all graphs run sequentially and the input of the next graph is the output of the current graph. Under these assumptions, rather than allocating buffers for all inputs and outputs, a single buffer can be shared between all graphs. ### Motivation and Context This will allow larger LLM models to be run.
### Description - Updates `OrtEp::Compile()` to allow a plugin EP to create and return EPContext nodes. - Updates the example EP plugin to generate an example EPContext model: <img width="747" alt="image" src="https://github.com/user-attachments/assets/e5d98a10-ec15-45aa-bfaf-887d3b6226e2" /> ### Motivation and Context Adds more of the functionality missing from the EP ABI used for plugin EPs. --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
ankitm3k approved these changes on Jun 25, 2025.