@Jaswanth51
Synchronizing intel/onnxruntime ovep-develop branch with latest changes from microsoft/onnxruntime master branch.

fs-eire and others added 18 commits November 18, 2025 20:27
### Description



### Motivation and Context
…to BFCArena (microsoft#26535)

### Description
This change allows users to better control GPU memory in shared
environments with multiple tenants or multiple inference sessions per
process.

The CUDA-based memory pool provides native stream-ordered allocations, allows
trimming the memory on Shrink if enabled, and releases memory back to the
system based on the user-specified parameters.

In my limited testing, latencies were comparable with running on BFCArena,
although your mileage and requirements may vary.

CudaMemoryPoolArena is enabled via `OrtArenaCfg` by introducing 3 new
parameters (a configuration sketch follows the list):
- `use_cuda_mempool`: set to 1 to enable
- `cuda_mempool_release_threshold`: the amount of memory to keep cached
- `cuda_mempool_bytes_to_keep_on_shrink`: the amount of memory to keep on
Shrink when being trimmed; allocated memory is not affected.
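
A minimal C-API sketch of wiring these parameters up. The key names come from this PR; whether `CreateArenaCfgV2` accepts them, and how the config is attached to the CUDA provider options, should be checked against your ORT build:

```cpp
#include <cstddef>
#include <onnxruntime_c_api.h>

// Hedged sketch: build an OrtArenaCfg that enables the CUDA memory-pool arena.
// The three keys are the ones introduced in this PR; acceptance by
// CreateArenaCfgV2 is an assumption to verify.
OrtArenaCfg* MakeCudaMempoolArenaCfg(const OrtApi* api) {
  const char* keys[] = {"use_cuda_mempool",
                        "cuda_mempool_release_threshold",
                        "cuda_mempool_bytes_to_keep_on_shrink"};
  const size_t values[] = {1, 209715200 /* ~200 MB kept cached */, 0};
  OrtArenaCfg* arena_cfg = nullptr;
  if (OrtStatus* status = api->CreateArenaCfgV2(keys, values, 3, &arena_cfg)) {
    api->ReleaseStatus(status);  // handle the error as appropriate
    return nullptr;
  }
  // e.g. assign to OrtCUDAProviderOptions::default_memory_arena_cfg before
  // appending the CUDA execution provider.
  return arena_cfg;
}
```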

### Motivation and Context
Better GPU memory control in multitenant environments.

This PR also introduces some new options for `onnxruntime_perf_test`, which
may help clients figure out the best settings for their case:
- `--enable_cuda_mempool 209715200;1048576`, with the first parameter being
`cuda_mempool_release_threshold` and the second
`cuda_mempool_bytes_to_keep_on_shrink` (which can be zero if shrink is not
enabled).
- `--shrink_arena_between_runs gpu:0` measures perf and memory consumption
with shrink. Strictly speaking, this new allocator does not need `Shrink()`,
since the CUDA mempool may release memory on the go according to
`cuda_mempool_release_threshold`.

Here are some performance numbers gathered when running the HF_Bart model.

If the CudaMempool release threshold is set too low, latency increases
because the system ends up constantly allocating and releasing memory.
But as we raise the threshold and allow more memory to stay allocated,
latency improves—and we end up using only about half as much memory
between runs compared to BFCArena.

Running default setup with BFCArena
> onnxruntime_perf_test -s -e cuda -I -S 10 -m times -r 100
"hf_Bart_torchscript.onnx"
Average inference time cost total: 66.493545 ms
P99 Latency: 0.0805385 s
Total memory allocated: 1,409,286,144

200 MB release threshold
> onnxruntime_perf_test -s -e cuda --enable_cuda_mempool 209715200;0 -I
-S 10 -m times -r 100 hf_Bart_torchscript.onnx
Average inference time cost total: 77.367473 ms
P99 Latency: 0.0931895 s

0.5 GB release threshold
> onnxruntime_perf_test -s -e cuda --enable_cuda_mempool 536870912;0 -I
-S 10 -m times -r 100 hf_Bart_torchscript.onnx
Average inference time cost total: 75.112840 ms
P99 Latency: 0.0910992 s

1 GB release threshold
> onnxruntime_perf_test -s -e cuda --enable_cuda_mempool 1073741824;0 -I
-S 10 -m times -r 100 hf_Bart_torchscript.onnx
Average inference time cost total: 66.533892 ms
P99 Latency: 0.0761336 s

Enabling shrink shows we're retaining only about half the memory compared to
BFCArena between inference runs.

>CudaMempoolArena::Shrink: pool current_in_use: 709,603,688 reserved
size after trim : 738,197,504 bytes.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Attention input handling updates:

* Corrected the input indices for `past` from `input[5]` to `input[4]`
in the fallback logic, ensuring the code reflects the actual input
order.

With this change, the Attention ops in phi-4-mm-vision.onnx can run on the
GPU instead of the CPU.
### Description
This patch implements the `Split-K` optimization on `Conv|MatMul`. With
`Split-K` we can re-arrange the computation into multiple workgroups
when `K` is large, to increase parallelism on the platforms where
`Split-K` is confirmed to be useful.

1. Support `Split-K` in `MakeMatMulPackedVec4Source()` to split a
workgroup with large K into smaller ones. In this patch we only support
`Split-K` with `batch_size == 1` and `vec4` on `Conv|MatMul`.
2. Support `Split-K` in `MatMulWriteFnSource()` (add the partial result
to output with atomic built-in functions)
3. Implement `SplitKConfig` to decide whether `Split-K` should be used
or not, and all the related thresholds.
4. Implement `MatMulFillBiasBeforeSplitKProgram` to initialize the
output with `bias` or 0 when `Split-K` is used.

### Motivation and Context
In the current implementation, when `K` (`dim_inner`) is large, each
invocation does the whole computation sequentially in one very large loop,
which may not make full use of all the EUs on a GPU.

With `Split-K` we can split such a large amount of computation (`K`) into
multiple workgroups with less computation each (`kSplitK`, smaller than `K`),
which can greatly improve parallelism.
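
For intuition, here is a hedged CPU-side sketch of the Split-K idea; the actual change lives in the WGSL matmul shader templates (`MakeMatMulPackedVec4Source()` and `MatMulWriteFnSource()`), where each K-slice maps to an extra workgroup and partial results are accumulated with atomics into an output pre-filled by `MatMulFillBiasBeforeSplitKProgram`:

```cpp
#include <algorithm>
#include <cstddef>

// Hedged illustration only: split the K reduction into `split_k` slices and
// accumulate each slice's partial result into C, which is pre-filled with
// bias (or zeros). On the GPU each slice is a separate workgroup and the
// accumulation uses atomic built-ins.
void MatMulSplitK(const float* a, const float* b, float* c,
                  size_t M, size_t N, size_t K, size_t split_k) {
  const size_t k_per_slice = (K + split_k - 1) / split_k;
  for (size_t slice = 0; slice < split_k; ++slice) {  // one "workgroup" per slice
    const size_t k_begin = slice * k_per_slice;
    const size_t k_end = std::min(K, k_begin + k_per_slice);
    for (size_t m = 0; m < M; ++m) {
      for (size_t n = 0; n < N; ++n) {
        float partial = 0.0f;
        for (size_t k = k_begin; k < k_end; ++k) {
          partial += a[m * K + k] * b[k * N + n];
        }
        c[m * N + n] += partial;  // GPU path: atomicAdd into the output
      }
    }
  }
}
```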

With this patch we can get about 15% performance improvement on
`efficientnet-lite-f16-demo` and 9% improvement on
`mobilenetv2-12-f16-demo` on Lunar Lake and Meteor Lake.
### Description

This PR makes ORT prefer the initializer allocator when calling
`OpKernel::PrePack`.

If an EP does not register an initializer allocator (currently only
WebGPU does this), the behavior is kept unchanged.

### Motivation and Context

Helps improve memory usage during prepack.
### Description
On AIX, `dladdr()` is not supported, so the call to the dladdr API is now
blocked under `_AIX`. We also don't have support for the cpuinfo package on
AIX, which generates a warning at runtime.

This PR fixes the issues mentioned above.
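
A minimal sketch of the guard pattern described here, assuming a hypothetical helper name; the actual patch in `core/platform/posix/env.cc` may differ:

```cpp
#include <dlfcn.h>  // on glibc, dladdr() also needs _GNU_SOURCE

// Hedged sketch: resolve the shared-object name for an address via dladdr(),
// but compile the lookup out on AIX where Dl_info/dladdr are unavailable.
static const char* SharedObjectNameForAddress(const void* address) {
#if !defined(_AIX)
  Dl_info dl_info{};
  if (dladdr(address, &dl_info) != 0 && dl_info.dli_fname != nullptr) {
    return dl_info.dli_fname;
  }
#else
  (void)address;  // unused on AIX
#endif
  return nullptr;  // unknown on AIX or when dladdr() fails
}
```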


### Motivation and Context

1. Fix for the compilation error below:
```
/home/buildusr/jenkins/workspace/onnxruntime-openxl/onnxruntime/onnxruntime/core/platform/posix/env.cc:562:9: error: unknown type name 'Dl_info'
  562 |     if (Dl_info dl_info{};
```

2. Fix for the following warning during test application runs:

`2025-11-06 07:23:44.176700000 [W:onnxruntime:Default, cpuid_info.cc:95
LogEarlyWarning] Unknown CPU vendor. cpuinfo_vendor value: 0`
For WebNN, only list tests that pass without falling back, as a prerequisite
for enabling WebNN CI tests in the future.
…crosoft#26616)

### Description
This change adds a well-known key name (`os_driver_version`)
corresponding to the OS driver version associated with an EP. We will
eventually flesh this out to enable retrieving it from the `OrtEpDevice`
if it's been populated, but for starters we reserve the name.

### Motivation and Context
We have a scenario in WebNN where the browser would like to get the
driver version associated with a given EP (this is to enable policy
against the driver, e.g. for maintaining a blocklist if a particular
driver version has a scenario-blocking bug in it). Having a mechanism to
retrieve the driver version via ORT would help with implementing this
feature.
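
Once the value is actually populated on the `OrtEpDevice`, a lookup might look roughly like the hedged sketch below. The key name is the one reserved by this PR, while the surrounding `GetEpDevices` / `EpDevice_EpMetadata` / `GetKeyValuePair` calls are assumptions based on the existing EP-device C API and should be checked against the headers:

```cpp
#include <cstddef>
#include <onnxruntime_c_api.h>

// Hedged sketch only: read the reserved "os_driver_version" metadata key from
// each EP device, assuming it has been populated (this PR only reserves the name).
void CheckDriverVersions(const OrtApi* api, const OrtEnv* env) {
  const OrtEpDevice* const* devices = nullptr;
  size_t num_devices = 0;
  if (OrtStatus* status = api->GetEpDevices(env, &devices, &num_devices)) {
    api->ReleaseStatus(status);
    return;
  }
  for (size_t i = 0; i < num_devices; ++i) {
    const OrtKeyValuePairs* metadata = api->EpDevice_EpMetadata(devices[i]);
    const char* driver_version = api->GetKeyValuePair(metadata, "os_driver_version");
    if (driver_version != nullptr) {
      // e.g. compare against a browser-maintained driver blocklist
    }
  }
}
```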

---------

Co-authored-by: Aditya Rastogi <adityar@ntdev.microsoft.com>
### Description
Support the `local_window_size` attribute in the **GroupQueryAttention**
operator, which is designed for sliding-window attention and may
influence the attention mask pattern.

When `local_window_size` is not equal to -1, a new attention mask pattern is
created as follows to apply the sliding window (a procedural sketch follows
the diagram).
```
     condition_1 (old attn_mask) ---> CumSum (axis=3, exclusive=true, reversed=true)
          |                             |
          |                           Lesser <--- local_window_size
          |                             |
      LogicalAnd <----------------- condition_2
          |
    new attn_mask
```
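
A hedged plain-loop sketch of the same construction, assuming a row-major mask where 1 means "attend" and using hypothetical helper and variable names:

```cpp
#include <cstdint>
#include <vector>

// Hedged sketch: reversed exclusive CumSum of the old mask along the key axis,
// compared against local_window_size (Lesser), then ANDed with the old mask.
std::vector<uint8_t> ApplySlidingWindow(const std::vector<uint8_t>& old_mask,
                                        int seq_q, int seq_k,
                                        int local_window_size) {
  std::vector<uint8_t> new_mask(old_mask.size());
  for (int i = 0; i < seq_q; ++i) {
    int cumsum = 0;  // exclusive + reversed: counts allowed keys to the right of j
    for (int j = seq_k - 1; j >= 0; --j) {
      const bool condition_2 = cumsum < local_window_size;  // Lesser
      new_mask[i * seq_k + j] =
          static_cast<uint8_t>(old_mask[i * seq_k + j] && condition_2);  // LogicalAnd
      cumsum += old_mask[i * seq_k + j];  // add after use => exclusive CumSum
    }
  }
  return new_mask;
}
```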


### Motivation and Context
…plitPackedQKVWithRotaryEmbeddingAndCopyKV (microsoft#26563)

### Description

Create a fully fused path called
SplitPackedQKVWithRotaryEmbeddingAndCopyKV, which fuses
SplitPackedQKVWithRotaryEmbedding and CopyKVCache. It is used when flash
attention is used and static KV cache is enabled.

We did the following things:
- Support components for the existing SplitPackedQKVWithRotaryEmbedding
- Fuse it and CopyKVCache into the new
SplitPackedQKVWithRotaryEmbeddingAndCopyKV

### Motivation and Context

On NV5080, the token generation speed improves by ~4%.
|    generation tps    | Before | After |
|--------|--------|-------|
| NV5080 | 135    | **141**   |
| Intel  | 15.3   | 15.4  |
| Mac    | 71.2   | 71.8  |
…6623)

### Description
Since this is an MS component, I think vcpkg is already updated.

### Motivation and Context
The older URL is now failing.
…ft#26604)

### Description



### Motivation and Context
### Description
This change extends the `ConvInteger` implementation to match the [ONNX
operator spec](https://onnx.ai/onnx/operators/onnx__ConvInteger.html),
which allows both `int8` and `uint8` for the input tensors:

- The ONNX `ConvInteger` schema defines:
  - `T1`: `tensor(int8)` or `tensor(uint8)`
  - `T2`: `tensor(int8)` or `tensor(uint8)`
  - `T3`: `tensor(int32)`
- Previously, only the `uint8` × `uint8` combination was supported.
- This PR adds support for all 8-bit combinations:
  - `uint8` × `uint8` (existing behavior)
  - `uint8` × `int8`
  - `int8` × `uint8`
  - `int8` × `int8`

### Motivation and Context
Fixes microsoft#24183
Fixes microsoft#15888
Fixes microsoft#12558
Fixes microsoft#3130
Fixes microsoft#12362

The ONNX ConvInteger operator schema allows both int8 and uint8 element
types for its inputs, but the current implementation only supports uint8
× uint8. This leads to a gap where valid ONNX models using ConvInteger
with int8 tensors cannot be executed.
This PR closes that gap by:
- Aligning the implementation with the official ConvInteger type constraints.
- Enabling models that use int8 (or mixed int8/uint8) for X and W to run
without needing operator rewrites or additional custom kernels.
- Keeping existing uint8 behavior unchanged, so the change is backwards
compatible for current users.

### Implementation details

1. Templated core implementation (ComputeInner)
The core logic of ConvInteger::Compute is moved into a templated helper:
```cpp
class ConvInteger : public OpKernel {
 public:
  ...
 private:
  template <typename XT, typename WT>
  Status ComputeInner(OpKernelContext* context) const;
};
```
XT is the element type of X (uint8_t or int8_t).
WT is the element type of W (uint8_t or int8_t).

2. Zero-point handling
Zero points are still treated as per-tensor scalar values, with the same
validation. The values are read via `DataRaw()` and stored as 8-bit scalars,
preserving the previous behavior.
Interpretation of these raw bytes as signed or unsigned is delegated to
the GEMM implementation via explicit signedness flags (see below).

3. Im2col templated on XT
The Im2col call now uses the runtime input type XT.

4. Quantized GEMM with signedness flags:
```cpp
gemm_shape.AIsSigned = W->IsDataType<int8_t>();
gemm_shape.BIsSigned = X->IsDataType<int8_t>();
```
AIsSigned and BIsSigned are derived from the runtime types of W and X.
Data for A and B is passed as raw bytes; the GEMM implementation uses
the signedness flags to interpret them correctly (in a manner similar to
the implementation in `MatMulInteger`).

5. Runtime dispatch in Compute()
The public Compute method becomes a thin dispatcher that selects the
appropriate ComputeInner<XT, WT> instantiation based on the actual input
types.
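
A hedged sketch of what such a dispatcher can look like, reusing the ORT kernel types from the snippet above; the real method in the CPU provider may differ in detail:

```cpp
// Hedged sketch: pick the ComputeInner<XT, WT> instantiation from the runtime
// element types of X and W. Assumes the ORT kernel framework types shown above.
Status ConvInteger::Compute(OpKernelContext* context) const {
  const Tensor* X = context->Input<Tensor>(0);
  const Tensor* W = context->Input<Tensor>(1);
  const bool x_signed = X->IsDataType<int8_t>();
  const bool w_signed = W->IsDataType<int8_t>();

  if (x_signed && w_signed) return ComputeInner<int8_t, int8_t>(context);
  if (x_signed) return ComputeInner<int8_t, uint8_t>(context);
  if (w_signed) return ComputeInner<uint8_t, int8_t>(context);
  return ComputeInner<uint8_t, uint8_t>(context);
}
```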

In addition, a small set of unit tests is added on top of the existing
ConvInteger tests to cover the new type combinations, including cases
where the first input tensor contains negative values (for the int8 ×
int8 path).

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
## Description

This PR adds a new API function `KernelInfo_GetConfigEntries` that
allows custom operators to access all configuration entries from the
`OrtKernelInfo` object during kernel construction.

## Motivation and Context

Custom operators may need to access session configuration options to
adjust their behavior. Previously, there was no way to retrieve all
config entries from `KernelInfo`. This PR provides a convenient method
to get all configuration key-value pairs that were set on the
`OrtSessionOptions`.
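
As a rough usage sketch from a custom-op kernel constructor: the function name `KernelInfo_GetConfigEntries` and the `OrtKeyValuePairs` type are from this PR, while the exact out-parameter signature and the per-key `GetKeyValuePair` lookup are assumptions to verify against `onnxruntime_c_api.h`:

```cpp
// Hedged sketch: read a session config entry inside a custom-op kernel ctor.
void ReadSessionConfig(const OrtApi* api, const OrtKernelInfo* info) {
  OrtKeyValuePairs* entries = nullptr;
  OrtStatus* status = api->KernelInfo_GetConfigEntries(info, &entries);  // assumed signature
  if (status != nullptr) {
    api->ReleaseStatus(status);
    return;
  }
  // Assumed helper: look up a single entry by key.
  const char* value = api->GetKeyValuePair(entries, "my.custom_op.some_option");
  if (value != nullptr) {
    // ... adjust kernel behavior based on `value` ...
  }
  api->ReleaseKeyValuePairs(entries);
}
```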

## Changes

### API Additions

- **C API**: Added `KernelInfo_GetConfigEntries` function to `OrtApi`
(Version 1.24)
  - Takes an `OrtKernelInfo*` as input
  - Returns all config entries as `OrtKeyValuePairs*`
  - Properly documented with usage examples

- **C++ API**: Added `GetConfigEntries()` method to `KernelInfoImpl`
template class
  - Returns `KeyValuePairs` object
  - Follows existing C++ wrapper patterns

### Implementation

- Implemented in `onnxruntime/core/session/custom_ops.cc`
- Iterates through `config_options_map` from `OpKernelInfo`
- Creates and populates `OrtKeyValuePairs` with all configuration
entries

### Testing

- Updated `shape_inference_test.cc` with test case
- Verifies config entries can be retrieved in custom kernel constructor
- Tests both existing and non-existing config keys

## Files Changed

- `include/onnxruntime/core/session/onnxruntime_c_api.h` - API
declaration
- `include/onnxruntime/core/session/onnxruntime_cxx_api.h` - C++ wrapper
declaration
- `include/onnxruntime/core/session/onnxruntime_cxx_inline.h` - C++
wrapper implementation
- `onnxruntime/core/session/custom_ops.cc` - Core implementation
- `onnxruntime/core/session/onnxruntime_c_api.cc` - API registration
- `onnxruntime/core/session/ort_apis.h` - API header declaration
- `onnxruntime/test/framework/shape_inference_test.cc` - Test coverage

## API Version

This change is part of ORT API Version 1.24.

## Breaking Changes

None. This is a backward-compatible addition to the API.
### Description
- ONNX models exported with older opset versions contain the Gelu operator
decomposed into multiple operators (Div, Erf, Add, Mul); the decomposed
pattern is sketched below.
- QNN doesn't support the Erf operator but does support the Gelu operator.
- Because QNN doesn't support Erf, graphs containing the Gelu pattern get
partitioned between the QNN and CPU EPs, degrading inference time.
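
For reference, the decomposed subgraph corresponds to Gelu(x) = 0.5 * x * (1 + erf(x / sqrt(2))); the linear chain below is an illustrative sketch rather than the exact pattern matched by the fusion:

```
x ──> Div(x, sqrt(2)) ──> Erf ──> Add(+1) ──> Mul(x) ──> Mul(0.5)  ≈  Gelu(x)
```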



### Motivation and Context
- Identifying and fusing the Gelu pattern into a single QNN Gelu node improves
inference time.

---------

Co-authored-by: Tirupathi Reddy T <tirupath@qti.qualcomm.com>
### Description
- Updates the `ep_weight_sharing_ctx_gen` tool to support specifying a
plugin EP configuration (via JSON).
- Mark the `ep_weight_sharing_ctx_gen` tool as deprecated and add a note to
the README recommending the use of the public Python ORT APIs instead.
- Note we no longer publish a binary for this tool [as of ORT
1.22.2](microsoft#24895).
- Added an example Python script in the README.
- Added a Python unit test that tests compiling models with weight
sharing using an example plugin EP.

#### Tool usage
Create a JSON file that contains information about the plugin EP to
load/use (e.g., `example_plugin_ep_config.json`):
```json
{
    "ep_library_registration_name": "example_plugin_ep",
    "ep_library_path": "example_plugin_ep.dll",
    "selected_ep_name": "example_plugin_ep",
    "default_ep_options": { "option_key": "option_value" }
}
```

Call the `ep_weight_sharing_ctx_gen` tool with the `-p` command-line
option to specify the location of the above configuration file:

```console
$ ep_weight_sharing_ctx_gen.exe -p example_plugin_ep_config.json model_1.onnx,model_2.onnx
```

### Motivation and Context
Close the functionality gap between traditional provider-bridge EPs and
plugin EPs. This PR allows using plugin EPs with the tool that compiles
models with weight sharing.
…crosoft#26575)

### Description
Making cache objects of packed data thread_local rather than static.

### Motivation and Context
Both LHS and RHS packing utilize a cache mechanism based on a static
unordered map, which creates the potential for interference between parallel
inference sessions. Both structures are now thread_local.
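
A hedged, self-contained illustration of the pattern (names here are made up, not the actual KleidiAI integration code):

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

using PackedBuffer = std::vector<uint8_t>;

// Hedged sketch: a per-thread packing cache. Previously this would have been
// `static` (one map shared by every thread); `thread_local` gives each thread
// its own map, so parallel inference sessions cannot interfere.
PackedBuffer& GetCachedPackedWeights(uint64_t key) {
  thread_local std::unordered_map<uint64_t, PackedBuffer> cache;
  return cache[key];  // each thread packs and reuses its own copy
}
```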

Signed-off-by: Colm Donelan <colm.donelan@arm.com>

@preetha-intel preetha-intel left a comment


LGTM

@Jaswanth51 Jaswanth51 merged commit bb738ba into ovep-develop Nov 25, 2025
6 of 8 checks passed
@Jaswanth51 Jaswanth51 deleted the sync_msft_25112025 branch November 25, 2025 07:36