forked from microsoft/onnxruntime
Sync with Microsoft ONNX Runtime - 25/11/2025 #863
Merged
Conversation
…to BFCArena (microsoft#26535)
### Description
This change allows users to better control GPU memory in shared environments with multiple tenants or multiple inference sessions per process. The CUDA-based memory pool features native allocations on streams, allows trimming the memory on Shrink if enabled, and releases memory back to the system based on user-specified parameters. In my limited testing, latencies were comparable with running on BFCArena, although your mileage and requirements may vary. CudaMemoryPoolArena is enabled via OrtArenaCfg by introducing three new parameters (a configuration sketch follows this entry):
- `use_cuda_mempool`: set to 1 to enable
- `cuda_mempool_release_threshold`: the amount of memory to keep cached
- `cuda_mempool_bytes_to_keep_on_shrink`: the amount of memory to keep on Shrink when being trimmed; allocated memory is not affected

### Motivation and Context
Better GPU memory control in multi-tenant environments. This PR introduces some new options for `onnxruntime_perf_test` that may help clients figure out the best settings for their case:
- `--enable_cuda_mempool 209715200;1048576` with the first parameter being `cuda_mempool_release_threshold`. The second, `cuda_mempool_bytes_to_keep_on_shrink`, can be zero if shrink is not enabled.
- `--shrink_arena_between_runs gpu:0` measures perf and memory consumption with shrink. Strictly speaking, this new allocator does not need `Shrink()`, since the CUDA mempool may release memory on the go according to `cuda_mempool_release_threshold`.

Here are some performance numbers gathered when running the HF_Bart model. If the CUDA mempool release threshold is set too low, latency increases because the system ends up constantly allocating and releasing memory. But as we raise the threshold and allow more memory to stay allocated, latency improves, and we end up using only about half as much memory between runs compared to BFCArena.

Running the default setup with BFCArena:
> onnxruntime_perf_test -s -e cuda -I -S 10 -m times -r 100 "hf_Bart_torchscript.onnx"

Average inference time cost total: 66.493545 ms
P99 Latency: 0.0805385 s
Total memory allocated: 1,409,286,144

200 MB release threshold:
> onnxruntime_perf_test -s -e cuda --enable_cuda_mempool 209715200;0 -I -S 10 -m times -r 100 hf_Bart_torchscript.onnx

Average inference time cost total: 77.367473 ms
P99 Latency: 0.0931895 s

0.5 GB release threshold:
> onnxruntime_perf_test -s -e cuda --enable_cuda_mempool 536870912;0 -I -S 10 -m times -r 100 hf_Bart_torchscript.onnx

Average inference time cost total: 75.112840 ms
P99 Latency: 0.0910992 s

1 GB release threshold:
> onnxruntime_perf_test -s -e cuda --enable_cuda_mempool 1073741824;0 -I -S 10 -m times -r 100 hf_Bart_torchscript.onnx

Average inference time cost total: 66.533892 ms
P99 Latency: 0.0761336 s

Enabling shrink shows we are retaining only about half the memory compared to BFCArena between inference runs.
> CudaMempoolArena::Shrink: pool current_in_use: 709,603,688 reserved size after trim: 738,197,504 bytes.

---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
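For orientation, here is a minimal configuration sketch in C++ using the existing `CreateArenaCfgV2` C API. The three key names come from the PR description; whether `CreateArenaCfgV2` accepts them once this change is in, and how the resulting `OrtArenaCfg` gets attached to the CUDA EP (e.g. via `OrtCUDAProviderOptions::default_memory_arena_cfg`), are assumptions of this sketch rather than code from the PR.

```cpp
// Sketch only: build an OrtArenaCfg that requests the CUDA memory pool arena.
// The key spellings are taken from the PR description; they are assumed to be
// accepted by CreateArenaCfgV2 once this change is available.
#include <array>
#include <onnxruntime_c_api.h>

OrtArenaCfg* MakeCudaMempoolArenaCfg(const OrtApi& api) {
  std::array<const char*, 3> keys = {"use_cuda_mempool",
                                     "cuda_mempool_release_threshold",
                                     "cuda_mempool_bytes_to_keep_on_shrink"};
  std::array<size_t, 3> values = {1,           // enable the CUDA mempool arena
                                  536870912,   // keep up to 0.5 GB cached before releasing
                                  0};          // nothing extra to keep on Shrink
  OrtArenaCfg* arena_cfg = nullptr;
  if (OrtStatus* status = api.CreateArenaCfgV2(keys.data(), values.data(), keys.size(), &arena_cfg)) {
    api.ReleaseStatus(status);  // report the error as appropriate for your application
    return nullptr;
  }
  // The cfg would then be attached to the CUDA EP before session creation,
  // e.g. through OrtCUDAProviderOptions::default_memory_arena_cfg (assumed wiring).
  return arena_cfg;
}
```

With the values above, up to 0.5 GB stays cached before memory is released back to the system, matching the 0.5 GB perf run shown in this entry.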
Attention input handling updates:
* Corrected the input index for `past` from `input[5]` to `input[4]` in the fallback logic, ensuring the code reflects the actual input order. With this change, the Attention ops in phi-4-mm-vision.onnx can run on the GPU instead of the CPU.
### Description
This patch implements the `Split-K` optimization for `Conv|MatMul`. With `Split-K` we can re-arrange the computation into multiple workgroups when `K` is large, to increase parallelism on the platforms where `Split-K` is confirmed to be useful.
1. Support `Split-K` in `MakeMatMulPackedVec4Source()` to split a workgroup with large K into smaller ones. In this patch we only support `Split-K` with `batch_size == 1` and `vec4` on `Conv|MatMul`.
2. Support `Split-K` in `MatMulWriteFnSource()` (add the partial result to the output with atomic built-in functions).
3. Implement `SplitKConfig` to decide whether `Split-K` should be used, and all the related thresholds.
4. Implement `MatMulFillBiasBeforeSplitKProgram` to initialize the output with `bias` or 0 when `Split-K` is used.

### Motivation and Context
In the current implementation, when `K` or `dim_inner` is large, each invocation does the computation one by one in a very large loop, which may not make full use of all EUs on a GPU. With `Split-K` we can split such a large amount of computation (`K`) into multiple workgroups with less computation each (`kSplitK`, smaller than `K`), which can greatly improve parallelism. With this patch we get about a 15% performance improvement on `efficientnet-lite-f16-demo` and 9% on `mobilenetv2-12-f16-demo` on Lunar Lake and Meteor Lake. A conceptual CPU-side sketch of the decomposition appears after this entry.
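As a conceptual illustration only (the real implementation is a WGSL shader generated via `MakeMatMulPackedVec4Source()`), the plain C++ sketch below shows the Split-K idea: K is partitioned into chunks, each chunk produces a partial product, and the partials are accumulated into a pre-initialized output, which is the role the atomic adds in `MatMulWriteFnSource()` and the `MatMulFillBiasBeforeSplitKProgram` initialization play on the GPU.

```cpp
// Conceptual Split-K sketch in plain C++ (not the WGSL shader from this patch).
// C = A (M x K) * B (K x N); K is split into chunks of size split_k and each
// chunk's partial product is accumulated into C.
#include <algorithm>
#include <vector>

void MatMulSplitK(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C, int M, int N, int K, int split_k) {
  C.assign(static_cast<size_t>(M) * N, 0.0f);  // output pre-initialized (bias or 0)
  for (int k0 = 0; k0 < K; k0 += split_k) {    // each chunk maps to its own workgroup on the GPU
    const int k1 = std::min(k0 + split_k, K);
    for (int m = 0; m < M; ++m) {
      for (int n = 0; n < N; ++n) {
        float partial = 0.0f;
        for (int k = k0; k < k1; ++k) {
          partial += A[m * K + k] * B[k * N + n];
        }
        C[m * N + n] += partial;  // on the GPU this accumulation uses atomic built-ins
      }
    }
  }
}
```

Because each `k0` chunk can run in its own workgroup, a single large-K loop becomes several smaller ones executing in parallel, which is where the extra EU utilization comes from.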
### Description
This PR makes ORT prefer the initializer allocator when calling `OpKernel::PrePack`. If an EP does not register an initializer allocator (currently only WebGPU does), the behavior is unchanged.
### Motivation and Context
Helps improve memory usage when doing prepack.
### Description
On AIX, dladdr() is not supported, so the call to the dladdr API is blocked under _AIX.
The cpuinfo package is also not supported on AIX, which generates a warning at runtime.
This PR fixes the issues mentioned above.
### Motivation and Context
1. Fix for the compilation error below:
```
/home/buildusr/jenkins/workspace/onnxruntime-openxl/onnxruntime/onnxruntime/core/platform/posix/env.cc:562:9: error: unknown type name 'Dl_info'
562 | if (Dl_info dl_info{};
```
2. Fix for the warning below during test application execution:
`2025-11-06 07:23:44.176700000 [W:onnxruntime:Default, cpuid_info.cc:95
LogEarlyWarning] Unknown CPU vendor. cpuinfo_vendor value: 0`
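A minimal sketch of the kind of guard the description implies; the exact shape of the change in `env.cc` and the helper name below are assumptions for illustration.

```cpp
// Sketch: skip the dladdr-based lookup on AIX, where Dl_info/dladdr are unavailable.
// (On glibc, dladdr additionally requires _GNU_SOURCE.)
#include <string>
#if !defined(_AIX)
#include <dlfcn.h>
#endif

std::string SharedLibraryPathFor(const void* addr) {
#if !defined(_AIX)
  if (Dl_info dl_info{}; dladdr(addr, &dl_info) != 0 && dl_info.dli_fname != nullptr) {
    return dl_info.dli_fname;  // path of the shared object containing addr
  }
#else
  (void)addr;  // dladdr/Dl_info do not exist on AIX, so the lookup is skipped entirely
#endif
  return {};
}
```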
Only list tests that do not fall back and that pass for WebNN, as a prerequisite for enabling WebNN CI tests in the future.
…crosoft#26616)
### Description
This change adds a well-known key name (`os_driver_version`) corresponding to the OS driver version associated with an EP. We will eventually flesh this out to enable retrieving it from the `OrtEpDevice` if it's been populated, but for starters we reserve the name.
### Motivation and Context
We have a scenario in WebNN where the browser would like to get the driver version associated with a given EP (this is to enable policy against the driver, e.g. for maintaining a blocklist if a particular driver version has a scenario-blocking bug in it). Having a mechanism to retrieve the driver version via ORT would help with implementing this feature.
---------
Co-authored-by: Aditya Rastogi <adityar@ntdev.microsoft.com>
### Description
Support the `local_window_size` attribute in the **GroupQueryAttention** operator, which is designed for sliding window attention and may influence the attention mask pattern.
For a local window size not equal to -1, a new attention mask pattern will be created as follows to apply the sliding window.
```
condition_1 (old attn_mask) ---> CumSum (axis=3, exclusive=true, reversed=true)
| |
| Lesser <--- local_window_size
| |
LogicalAnd <----------------- condition_2
|
new attn_mask
```
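For intuition (illustrative only, not code from this change), here is the same computation for a single mask row in C++: a reversed, exclusive cumulative sum counts how many allowed key positions lie strictly after each position, and keeping only positions where that count is below `local_window_size`, ANDed with the old mask, produces the sliding-window mask.

```cpp
// Illustrative sketch of the mask update above for one attention-mask row.
#include <vector>

std::vector<int> ApplySlidingWindow(const std::vector<int>& old_mask, int local_window_size) {
  const int n = static_cast<int>(old_mask.size());
  // Reversed, exclusive cumulative sum along the key axis:
  // cumsum[j] = number of allowed key positions strictly after j.
  std::vector<int> cumsum(n, 0);
  for (int j = n - 2; j >= 0; --j) {
    cumsum[j] = cumsum[j + 1] + old_mask[j + 1];
  }
  // condition_2 = Lesser(cumsum, local_window_size); new mask = LogicalAnd(old mask, condition_2).
  std::vector<int> new_mask(n, 0);
  for (int j = 0; j < n; ++j) {
    new_mask[j] = (old_mask[j] != 0) && (cumsum[j] < local_window_size);
  }
  return new_mask;
}
```

For a causal row `1 1 1 1 0 0` with `local_window_size = 2`, this yields `0 0 1 1 0 0`, i.e. attention restricted to the last two visible key positions.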
…plitPackedQKVWithRotaryEmbeddingAndCopyKV (microsoft#26563)
### Description
Create a fully fused path called SplitPackedQKVWithRotaryEmbeddingAndCopyKV, which fuses SplitPackedQKVWithRotaryEmbedding and CopyKVCache. It is run when flash attention is used and the static KV cache is enabled. We did the following:
- Support components for the existing SplitPackedQKVWithRotaryEmbedding
- Fuse it and CopyKVCache into the new SplitPackedQKVWithRotaryEmbeddingAndCopyKV
### Motivation and Context
On NV5080, the token generation speed improves by ~4%.

| generation tps | Before | After |
|--------|--------|-------|
| NV5080 | 135 | **141** |
| Intel | 15.3 | 15.4 |
| Mac | 71.2 | 71.8 |
…6623)
### Description
Since this is an MS component, I think vcpkg is already updated.
### Motivation and Context
The older URL is now failing.
…ft#26604)
### Description
This change extends the `ConvInteger` implementation to match the [ONNX operator spec](https://onnx.ai/onnx/operators/onnx__ConvInteger.html), which allows both `int8` and `uint8` for the input tensors:
- The ONNX `ConvInteger` schema defines:
  - `T1`: `tensor(int8)` or `tensor(uint8)`
  - `T2`: `tensor(int8)` or `tensor(uint8)`
  - `T3`: `tensor(int32)`
- Previously, only the `uint8` × `uint8` combination was supported.
- This PR adds support for all 8-bit combinations:
  - `uint8` × `uint8` (existing behavior)
  - `uint8` × `int8`
  - `int8` × `uint8`
  - `int8` × `int8`

### Motivation and Context
Fixes microsoft#24183
Fixes microsoft#15888
Fixes microsoft#12558
Fixes microsoft#3130
Fixes microsoft#12362

The ONNX ConvInteger operator schema allows both int8 and uint8 element types for its inputs, but the current implementation only supports uint8 × uint8. This leads to a gap where valid ONNX models using ConvInteger with int8 tensors cannot be executed. This PR closes that gap by:
- Aligning the implementation with the official ConvInteger type constraints.
- Enabling models that use int8 (or mixed int8/uint8) for X and W to run without needing operator rewrites or additional custom kernels.
- Keeping existing uint8 behavior unchanged, so the change is backwards compatible for current users.

### Implementation details
1. Templated core implementation (ComputeInner). The core logic of ConvInteger::Compute is moved into a templated helper:
```text
class ConvInteger : public OpKernel {
 public:
  ...
 private:
  template <typename XT, typename WT>
  Status ComputeInner(OpKernelContext* context) const;
};
```
XT is the element type of X (uint8_t or int8_t). WT is the element type of W (uint8_t or int8_t).
2. Zero-point handling. Zero points are still treated as per-tensor scalar values, with the same validation. The values are read via `DataRaw()` and stored as 8-bit scalars, preserving the previous behavior. Interpretation of these raw bytes as signed or unsigned is delegated to the GEMM implementation via explicit signedness flags (see below).
3. Im2col templated on XT. The Im2col call now uses the runtime input type XT.
4. Quantized GEMM with signedness flags:
```text
gemm_shape.AIsSigned = W->IsDataType<int8_t>();
gemm_shape.BIsSigned = X->IsDataType<int8_t>();
```
AIsSigned and BIsSigned are derived from the runtime types of W and X. Data for A and B is passed as raw bytes; the GEMM implementation uses the signedness flags to interpret them correctly (in a manner similar to the implementation in `MatMulInteger`).
5. Runtime dispatch in Compute(). The public Compute method becomes a thin dispatcher that selects the appropriate ComputeInner<XT, WT> instantiation based on the actual input types; a sketch of this dispatch appears after this entry.

In addition, a small set of unit tests is added on top of the existing ConvInteger tests to cover the new type combinations, including cases where the first input tensor contains negative values (for the int8 × int8 path).
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
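The dispatch described in point 5 could look roughly like the following sketch (illustrative; the exact code in the PR may differ).

```cpp
// Sketch of the runtime dispatch: pick the ComputeInner<XT, WT> instantiation
// from the actual element types of X and W.
Status ConvInteger::Compute(OpKernelContext* context) const {
  const auto* X = context->Input<Tensor>(0);
  const auto* W = context->Input<Tensor>(1);
  const bool x_signed = X->IsDataType<int8_t>();
  const bool w_signed = W->IsDataType<int8_t>();
  if (x_signed) {
    return w_signed ? ComputeInner<int8_t, int8_t>(context)
                    : ComputeInner<int8_t, uint8_t>(context);
  }
  return w_signed ? ComputeInner<uint8_t, int8_t>(context)
                  : ComputeInner<uint8_t, uint8_t>(context);
}
```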
## Description
This PR adds a new API function `KernelInfo_GetConfigEntries` that allows custom operators to access all configuration entries from the `OrtKernelInfo` object during kernel construction.

## Motivation and Context
Custom operators may need to access session configuration options to adjust their behavior. Previously, there was no way to retrieve all config entries from `KernelInfo`. This PR provides a convenient method to get all configuration key-value pairs that were set on the `OrtSessionOptions`.

## Changes
### API Additions
- **C API**: Added `KernelInfo_GetConfigEntries` function to `OrtApi` (Version 1.24)
  - Takes an `OrtKernelInfo*` as input
  - Returns all config entries as `OrtKeyValuePairs*`
  - Properly documented with usage examples
- **C++ API**: Added `GetConfigEntries()` method to the `KernelInfoImpl` template class
  - Returns a `KeyValuePairs` object
  - Follows existing C++ wrapper patterns

### Implementation
- Implemented in `onnxruntime/core/session/custom_ops.cc`
- Iterates through `config_options_map` from `OpKernelInfo`
- Creates and populates `OrtKeyValuePairs` with all configuration entries

### Testing
- Updated `shape_inference_test.cc` with a test case
- Verifies config entries can be retrieved in a custom kernel constructor
- Tests both existing and non-existing config keys

## Files Changed
- `include/onnxruntime/core/session/onnxruntime_c_api.h` - API declaration
- `include/onnxruntime/core/session/onnxruntime_cxx_api.h` - C++ wrapper declaration
- `include/onnxruntime/core/session/onnxruntime_cxx_inline.h` - C++ wrapper implementation
- `onnxruntime/core/session/custom_ops.cc` - Core implementation
- `onnxruntime/core/session/onnxruntime_c_api.cc` - API registration
- `onnxruntime/core/session/ort_apis.h` - API header declaration
- `onnxruntime/test/framework/shape_inference_test.cc` - Test coverage

## API Version
This change is part of ORT API Version 1.24.

## Breaking Changes
None. This is a backward-compatible addition to the API.
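A minimal usage sketch from a custom-op kernel constructor. `KernelInfo_GetConfigEntries` is the function added by this PR; the exact out-parameter shape, the enumeration via `GetKeyValuePairs`, and the ownership/release convention shown here are assumptions of this sketch.

```cpp
// Hypothetical custom-op kernel that reads session config entries in its constructor.
#include <cstdio>
#include <onnxruntime_c_api.h>

struct MyCustomKernel {
  MyCustomKernel(const OrtApi& api, const OrtKernelInfo* info) {
    OrtKeyValuePairs* entries = nullptr;  // assumed to be an owned copy (released below)
    if (OrtStatus* status = api.KernelInfo_GetConfigEntries(info, &entries)) {
      api.ReleaseStatus(status);  // handle the error as appropriate
      return;
    }
    const char* const* keys = nullptr;
    const char* const* values = nullptr;
    size_t num_entries = 0;
    api.GetKeyValuePairs(entries, &keys, &values, &num_entries);  // assumed enumeration API
    for (size_t i = 0; i < num_entries; ++i) {
      std::printf("session config: %s=%s\n", keys[i], values[i]);
    }
    api.ReleaseKeyValuePairs(entries);
  }
};
```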
### Description
- ONNX models exported with older opset versions contain the Gelu operator decomposed into multiple operators (Div, Erf, Add, Mul).
- QNN doesn't support the Erf operator but supports the Gelu operator.
- Since QNN doesn't support Erf, graphs containing the Gelu pattern get partitioned between the QNN and CPU EPs, degrading the inference time.
### Motivation and Context
- Identifying and fusing the Gelu pattern into a QNN Gelu node improves the inference time.
---------
Co-authored-by: Tirupathi Reddy T <tirupath@qti.qualcomm.com>
### Description
- Updates the `ep_weight_sharing_ctx_gen` tool to support specifying a plugin EP configuration (via JSON).
- Marks the `ep_weight_sharing_ctx_gen` tool as deprecated and adds a note to the README recommending use of the public Python ORT APIs instead.
  - Note we no longer publish a binary for this tool [as of ORT 1.22.2](microsoft#24895).
  - Added an example Python script in the README.
- Added a Python unit test that tests compiling models with weight sharing using an example plugin EP.

#### Tool usage
Create a JSON file that contains information about the plugin EP to load/use (e.g., `example_plugin_ep_config.json`):
```json
{
  "ep_library_registration_name": "example_plugin_ep",
  "ep_library_path": "example_plugin_ep.dll",
  "selected_ep_name": "example_plugin_ep",
  "default_ep_options": {
    "option_key": "option_value"
  }
}
```
Call the `ep_weight_sharing_ctx_gen` tool with the `-p` command-line option to specify the location of the above configuration file:
```console
$ ep_weight_sharing_ctx_gen.exe -p example_plugin_ep_config.json model_1.onnx,model_2.onnx
```
### Motivation and Context
Close the functionality gap between traditional provider-bridge EPs and plugin EPs. This PR allows using plugin EPs with the tool that compiles models with weight sharing.
…crosoft#26575)
### Description
Making cache objects of packed data thread_local rather than static.
### Motivation and Context
Both LHS and RHS packing utilize a cache mechanism based on a static unordered map. There's the potential for interference between parallel inference sessions. Made both structures thread_local.
Signed-off-by: Colm Donelan <colm.donelan@arm.com>
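For illustration only (not the actual packing code touched here), the difference the change makes, sketched on a generic cache keyed by a hash:

```cpp
// Illustrative sketch: a packed-data cache. A `static` map would be shared by every
// thread and parallel inference session; `thread_local` gives each thread its own
// map, removing the potential for interference described above.
#include <cstdint>
#include <unordered_map>
#include <vector>

using PackedBuffer = std::vector<uint8_t>;

PackedBuffer& GetPackedCacheEntry(uint64_t key) {
  thread_local std::unordered_map<uint64_t, PackedBuffer> cache;  // was conceptually `static`
  return cache[key];
}
```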
preetha-intel
approved these changes
Nov 25, 2025
preetha-intel
left a comment
LGTM
Synchronizing intel/onnxruntime ovep-develop branch with latest changes from microsoft/onnxruntime master branch.