@jatinwadhwa921

Backmerging with Msft commits

skottmckay and others added 30 commits July 24, 2025 08:10
### Description
Add new allocator type of OrtReadOnlyAllocator to enable providing a
separate allocator that is only used for initializers.

Update the SessionState logic to support this allocator type being
provided, and use it when doing device allocations for initializers.

### Motivation and Context
Performance.
### Description
This PR patches the features introduced in microsoft#25476 and provides a
stable fix for the GPU plugin with the upcoming OV toolkit v2025.2.1.

---------

Signed-off-by: Jianhui Dai <jianhui.j.dai@intel.com>
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: bfilipek <bartlomiej.filipek@intel.com>
Co-authored-by: jatinwadhwa921 <110383850+jatinwadhwa921@users.noreply.github.com>
Co-authored-by: n1harika <niharika.sathish@intel.com>
Co-authored-by: sfatimar <sahar.fatima@intel.com>
Co-authored-by: Jaskaran Singh Nagi <jaskaran.singh.nagi@intel.com>
Co-authored-by: Eric Crawford <eric.r.crawford@intel.com>
Co-authored-by: Sushanth Rajasankar <44513542+sushraja-msft@users.noreply.github.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: Seungtaek Kim <seungtaek.kim.94@gmail.com>
Co-authored-by: co63oc <co63oc@users.noreply.github.com>
Co-authored-by: Jambay Kinley <jambaykinley@microsoft.com>
Co-authored-by: Hector Li <hecli@microsoft.com>
Co-authored-by: Jian Chen <cjian@microsoft.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Jiajia Qin <jiajiaqin@microsoft.com>
Co-authored-by: Alessio Soldano <services@soldano.it>
Co-authored-by: Changming Sun <chasun@microsoft.com>
Co-authored-by: Ashish Garg <quic_ashigarg@quicinc.com>
Co-authored-by: Ashish Garg <ashigarg@qti.qualcomm.com>
Co-authored-by: Jie Chen <jie.a.chen@intel.com>
Co-authored-by: wp <webgraphics@intel.com>
Co-authored-by: Satya Kumar Jandhyala <satya.k.jandhyala@gmail.com>
Co-authored-by: Prathik Rao <prathik.rao@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: Jianhui Dai <jianhui.j.dai@intel.com>
Co-authored-by: xhcao <xinghua.cao@intel.com>
Co-authored-by: Wanming Lin <wanming.lin@intel.com>
Co-authored-by: Mark Schofield <mschofie@microsoft.com>
Co-authored-by: jiangzhaoming <zhaoming.jiang@microsoft.com>
Co-authored-by: Yi-Hong Lyu <yilyu@microsoft.com>
Co-authored-by: vraspar <vrajang@outlook.com>
Co-authored-by: Chi Lo <54722500+chilo-ms@users.noreply.github.com>
Co-authored-by: saurabh <saurabh1.kale@intel.com>
Co-authored-by: Ranjit Ranjan <165394499+ranjitshs@users.noreply.github.com>
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com>
Co-authored-by: Pallavi Gupta <pallavi.gupta@intel.com>
Co-authored-by: Nikolay Proshunin <nikolay.proshunin@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
Co-authored-by: Javier Martinez <javier.e.martinez@intel.com>
Co-authored-by: Bartlomiej Filipek <bartlomiej.filipek@intel.com>
Co-authored-by: bopeng1234 <bo.peng@intel.com>
Co-authored-by: MayureshV1 <47039074+MayureshV1@users.noreply.github.com>
Co-authored-by: TejalKhade28 <tejal.khade@intel.com>
Co-authored-by: Vishnudas Thaniel S <vishnudas.thaniel.s@intel.com>
Co-authored-by: Yaru Du <yaru.du@intel.com>
Co-authored-by: Ryan Metcalfe <107415876+RyanMetcalfeInt8@users.noreply.github.com>
Co-authored-by: Dvoretckii, Mikhail <mikhail.dvoretckii@intel.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
Co-authored-by: Fei Chen <feich@microsoft.com>
Co-authored-by: qti-yuduo <yuduow@qti.qualcomm.com>
Co-authored-by: Akupadhye <aupadhye@qti.qualcomm.com>
Co-authored-by: Wang Ning <ning4.wang@intel.com>
Co-authored-by: Maximilian Müller <44298237+gedoensmax@users.noreply.github.com>
Co-authored-by: George Wu <jywu@microsoft.com>
Co-authored-by: quic-calvnguy <quic_calvnguy@quicinc.com>
Co-authored-by: Wei-Sheng Chin <wschin@outlook.com>
Co-authored-by: quic-hungjuiw <quic_hungjuiw@quicinc.com>
Co-authored-by: Ian Hunter <ianfhunter@gmail.com>
Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
Co-authored-by: Jeff Kilpatrick <jkilpatrick@qti.qualcomm.com>
Co-authored-by: Jeff Kilpatrick <jkilpat@qti.qualcomm.com>
Co-authored-by: Nenad Banfic <46795300+nenad1002@users.noreply.github.com>
Co-authored-by: derdeljan-msft <derdeljan@microsoft.com>
Co-authored-by: Ryan Metcalfe <ryan.metcalfe@intel.com>
Previously the machine pool had a User-assigned managed identity (UMI)
which was used for accessing the blob storage. The UMI has now been
removed to improve security, so we baked the data into the VM image
instead.
…tValues (microsoft#25482)

### Description
- Adds APIs to get information (file path, file offset, byte size) for
initializers with data in external files. This allows EPs to do their
own custom memory-mapping of initializer data. By default, EPs that
don't have specific requirements can still use
`ValueInfo_GetInitializerValue` to get an `OrtValue` with memory-mapped
initializer data.
- Updates `OrtGraph` to only load `OrtValue` for external initializers
on demand. This prevents having to memory map all external initializers
before the first call to `OrtEp::GetCapability`.

Follow up to microsoft#25320

New API functions:

| Function | Summary |
|----------|---------|
| `ValueInfo_GetExternalInitializerInfo` | Get `OrtExternalInitializerInfo` from `OrtValueInfo` (or `NULL`). Must be released with `ReleaseExternalInitializerInfo` |
| `ReleaseExternalInitializerInfo` | Releases the `OrtExternalInitializerInfo` instance |
| `ExternalInitializerInfo_GetFilePath` | Returns the relative path to the file that stores the initializer's data |
| `ExternalInitializerInfo_GetFileOffset` | Returns the byte offset within the file where the initializer's data is stored |
| `ExternalInitializerInfo_GetByteSize` | Returns the size in bytes of the initializer's data within the file |
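
As a rough illustration of what custom memory-mapping could look like once an EP has the three values these functions expose (file path, file offset, byte size), here is a minimal Python sketch. It does not call the ORT C API. `read_external_initializer` and its parameter names are hypothetical, and the sketch assumes the relative path has already been resolved against the model's directory.

```python
import mmap
import os
import tempfile


def read_external_initializer(file_path, file_offset, byte_size):
    """Return the raw bytes of one initializer stored in an external data file."""
    if byte_size == 0:
        return b""
    # mmap offsets must be a multiple of the allocation granularity, so map
    # from the nearest aligned offset and skip the leading padding bytes.
    aligned = (file_offset // mmap.ALLOCATIONGRANULARITY) * mmap.ALLOCATIONGRANULARITY
    delta = file_offset - aligned
    with open(file_path, "rb") as f:
        with mmap.mmap(f.fileno(), delta + byte_size,
                       access=mmap.ACCESS_READ, offset=aligned) as view:
            return bytes(view[delta:delta + byte_size])


# Usage: write a small stand-in "external data" file and read one slice back.
with tempfile.TemporaryDirectory() as tmp_dir:
    data_path = os.path.join(tmp_dir, "weights.bin")
    with open(data_path, "wb") as f:
        f.write(bytes(range(256)) * 4)
    first_slice = read_external_initializer(data_path, 10, 4)
    second_slice = read_external_initializer(data_path, 300, 3)
```

A real EP would more likely keep the mapping alive and hand out views instead of copying, but the offset alignment step is the same.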


### Motivation and Context

---------

Co-authored-by: Dmitri Smirnov <dmitrism@microsoft.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
### Description
Use the license file from QNN SDK to make sure it's up to date.

---------

Co-authored-by: adrianlizarraga <adlizarraga@microsoft.com>
…nt. (microsoft#25465)

### Description
Add arena that uses EP API so that an EP library can be self-sufficient.
Remove cross stream sharing from BFCArena. Nothing is using it and it
creates a dependency on synchronizing streams inside the arena
implementation.
Tried to simplify the Stream/Notification usage. 

The current setup adds an AllocOnStream to OrtAllocator. There is no
stream-aware Free at this point because ORT does not attach the Stream to
the memory usage, so it cannot pass it to the Free call.

### Motivation and Context
If ORT adds BFCArena to an OrtAllocator from the EP we have OrtAllocator
-> IAllocator wrapper -> BFCArena IAllocator [-> OrtAllocator wrapper
for external usage].

The EP managing its own arena is much simpler.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…icrosoft#25390)

### Description

Adjusts concat operator to batch inputs based on
maxStorageBuffersPerShaderStage to allow unlimited number of inputs.
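
The batching idea can be sketched as follows. This is a simplified Python illustration, not the WebGPU EP code: `plan_concat_batches` is a hypothetical helper, and it assumes each shader dispatch reserves one storage buffer binding for the output and uses the rest for inputs.

```python
def plan_concat_batches(num_inputs, max_storage_buffers_per_stage):
    """Split input indices into [start, end) ranges, one per shader dispatch."""
    # Assumption: one storage buffer binding is taken by the output,
    # leaving the remaining bindings available for inputs.
    inputs_per_dispatch = max_storage_buffers_per_stage - 1
    if inputs_per_dispatch < 1:
        raise ValueError("need at least two storage buffers per shader stage")
    return [
        (start, min(start + inputs_per_dispatch, num_inputs))
        for start in range(0, num_inputs, inputs_per_dispatch)
    ]


# Usage: 10 inputs with a limit of 4 storage buffers per stage.
batches = plan_concat_batches(num_inputs=10, max_storage_buffers_per_stage=4)
# batches == [(0, 3), (3, 6), (6, 9), (9, 10)]
```

Because the number of dispatches grows with the input count, the device limit no longer caps how many inputs the operator accepts.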

### Motivation and Context

Fixes patchtst model for transformers.js
<img width="960" height="367"
alt="{31C75CD1-7A7D-48E3-A090-FB153925D165}"
src="https://github.com/user-attachments/assets/f5772709-80b7-4a05-8927-40f496be908c"
/>
### Description
Implements Attention(23) for CPU.

The backend tests from onnx were wrong for Attention (see
onnx/onnx#7142). The onnx version needs to be
updated to make all tests pass. The implementation matches the reference
implementation after onnx was fixed.

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Ti-Tai Wang <titaiwang@microsoft.com>
Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
…t are not correctly excluded (microsoft#25502)

### Description

This change respects initializers that are external but already loaded
in memory. This is required due to an optimization that leaves it to the
backend to read a mapped memory area.

@chilo-ms can you help run the CI and merge this change?

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
### Description
1. Implemented the required changes for the EP factory. 


### Motivation and Context
These changes are required for WinML GA.
### Description
Fixes for the OrtGraphToProto utilities that EPs can copy and modify:
- When serializing `OrtGraph` to ONNX protobuf, do not set an
`onnx::TensorShapeProto` for `onnx::ValueInfo` if the shape has no
dimension entries. Otherwise, the shape incorrectly looks like a scalar.
- Add `ORT_OP_ATTR_GRAPH` to the enum values returned by the
`OpAttr_GetType` C API function. This allows the OrtGraphToProto
utilities to skip processing subgraph attributes, which can be retrieved
via a different API, but return an error on any unsupported attribute
type.



### Motivation and Context
…ft#25534)

### Description
1. Upgrade onnxruntime-Ubuntu2204-AMD-CPU machine pool to Ubuntu 24.04,
which can fix some vulnerability management issues.
2. Fix some packaging pipeline issues and remove some unused code blocks
from dml-vs-2022.yml
…#25484)

WebNN requires the zeroPoint and scale of a QDQ op to have the same
shape. However, ONNX allows [1] as a scalar shape, and some models may
use [1] as the shape for x_zero_point. We should explicitly set the
shape of scale to match that of x_zero_point.
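
A minimal Python sketch of the shape alignment described above. `align_scale_shape` is a hypothetical helper for illustration, not the WebNN EP code. It assumes the scale can be reshaped only when its element count equals that of the zero point.

```python
from math import prod


def align_scale_shape(scale_shape, zero_point_shape):
    """Return the shape the scale should be reshaped to so it matches the zero point."""
    # A reshape is only valid when the element count is unchanged,
    # e.g. scalar [] (1 element) -> [1] (1 element).
    if prod(scale_shape) != prod(zero_point_shape):
        raise ValueError("scale and zero point have different element counts")
    return list(zero_point_shape)


# Usage: a scalar scale paired with a zero point of shape [1].
new_scale_shape = align_scale_shape([], [1])
```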
Co-authored-by: microsoft-github-policy-service[bot] <77245923+microsoft-github-policy-service[bot]@users.noreply.github.com>
…KleidiAI (microsoft#25187)

This PR introduces the initial integration of KleidiAI-optimized
microkernels into ONNX Runtime's MLAS backend, focusing on support for:

- SGEMM
- IGEMM
- Dynamic Quantized MatMuls

Key changes:
- Implements overrides for MlasGemmBatch, MlasGemmPackBSize, and
MlasGemmPackB using KleidiAI where applicable.
- Applies dispatch logic based on TransA == CblasNoTrans and SME2
availability.
- Supports float32 and int8 GEMM workloads with conditionally invoked SME2
paths.
- Maintains fallback paths to default MLAS implementations to ensure
coverage and stability.
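
The dispatch-with-fallback pattern can be sketched in a few lines of Python. This is only an illustration of the decision described above: `select_gemm_backend` and the string labels are hypothetical, and it assumes both conditions (no transpose on A, SME2 available) must hold before the KleidiAI path is taken.

```python
def select_gemm_backend(trans_a_is_no_trans, has_sme2):
    """Pick the GEMM path: KleidiAI override when eligible, default MLAS otherwise."""
    # Assumption: the override applies only when A is not transposed
    # and the CPU reports SME2 support.
    if trans_a_is_no_trans and has_sme2:
        return "kleidiai"
    # Fallback keeps every other case on the existing MLAS implementation.
    return "mlas_default"


# Usage: the same call site works on hardware with and without SME2.
backend_with_sme2 = select_gemm_backend(trans_a_is_no_trans=True, has_sme2=True)
backend_without_sme2 = select_gemm_backend(trans_a_is_no_trans=True, has_sme2=False)
```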

**Known Issues / Next Steps:**
Requesting feedback specifically on the API structure:
Does the new MLAS interface design align with long-term extensibility?
Are the dispatch points and override boundaries well-structured?

Indicative Performance figures:
The kernels added are particularly effective for Conv2D operators:
* Based on KleidiAI SME running mobilenet_v1_ssd_f32 on Mac Mini M4 on a
single thread
<img width="815" height="308" alt="image"
src="https://github.com/user-attachments/assets/e39a7fef-1370-4332-83a3-1f3a80b29da4"
/>

---------

Signed-off-by: Damien Dooley <damien.dooley@arm.com>
Co-authored-by: Jonathan Clohessy <jonathan.clohessy@arm.com>
Co-authored-by: Declan Flavin <declan.flavin@arm.com>
Co-authored-by: Colm Donelan <colm.donelan@arm.com>
Co-authored-by: Damien Dooley <damdoo01@ip-10-249-28-46.eu-west-1.compute.internal>
### Description
- LPBQ encoding is Qualcomm's alternative quantization encoding format
for Block Quantization
- Add translation logic to read LPBQ pattern on MatMul weights in a QDQ
ONNX model exported by AIMET Quantizer
- Prepare the corresponding QNN Quantization param for applying
LowPowerBlockQuantization on MatMul weights
- Apply LPBQ Fusions only for NPU Backend as currently only NPU backend
supports LPBQ encoding format


### Motivation and Context
- This is required to efficiently accelerate accuracy-sensitive large
language models like Phi-3.5 on Qualcomm's NPU accelerator.
### Description
Corrected dtype_name for the respective float16 implementations.
Previously MLFloat16 would return bf16 rather than fp16, and vice versa.
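
To show why swapping the two names matters, the following Python sketch decodes the same 16-bit pattern as IEEE half precision (fp16: 1 sign, 5 exponent, 10 mantissa bits) and as bfloat16 (bf16: 1 sign, 8 exponent, 7 mantissa bits). The helper names are illustrative and unrelated to the ORT code.

```python
import struct


def decode_fp16(bits):
    """Interpret a 16-bit pattern as IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def decode_bf16(bits):
    """Interpret a 16-bit pattern as bfloat16 (the top half of a float32)."""
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]


pattern = 0x3F80
as_fp16 = decode_fp16(pattern)  # 1.875
as_bf16 = decode_bf16(pattern)  # 1.0
```

The same bits give different values under the two formats, so a mislabeled dtype name can silently misdescribe the data.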


### Motivation and Context
It looked wrong but passed the tests. I don't understand the test suite
well enough to try to improve it, but I'd be willing to act on any
pointers.
It reduces the pipeline time by about 30 minutes. The tests still take
about 1 hour, which should be reduced.
…ded but not constant. (microsoft#25544)

### Description

In DynamicQuantizeMatMul KleidiAI-specific prepacking logic, handle case
where B zero point input is provided but not constant. In this case, we
should not prepack.

Add some unit tests that test the prepacking code path.

Add check for ARM SME instructions in DynamicQuantizeMatMul before
calling `MlasDynamicQGemmBatch()` and associated functions.
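
The prepacking condition can be sketched as a small predicate. This Python sketch is illustrative only: `should_prepack_b` and its parameters are hypothetical names, and it additionally assumes that B itself must be a constant initializer and that SME must be available before the KleidiAI path is considered.

```python
def should_prepack_b(b_is_constant, b_zero_point_provided,
                     b_zero_point_is_constant, has_arm_sme):
    """Decide whether the KleidiAI prepacking path may be used for input B."""
    # Assumption: the KleidiAI path needs ARM SME and a constant B.
    if not has_arm_sme or not b_is_constant:
        return False
    # The case this change handles: a zero point that is provided but only
    # known at run time cannot be baked into the packed buffer.
    if b_zero_point_provided and not b_zero_point_is_constant:
        return False
    return True


# Usage: zero point supplied as a non-constant graph input.
prepack_dynamic_zp = should_prepack_b(True, True, False, True)
prepack_no_zp = should_prepack_b(True, False, False, True)
```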

### Motivation and Context

Follow up to microsoft#25187
### Description

### Motivation and Context

Fix the build break on Windows+Ninja
### Description

Fixes the packaging pipeline.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This PR uses the existing RunOption `gpu_graph_id` to control whether to
skip graph capture. When the WebGPU EP option `enableGraphCapture` is
enabled, setting `gpu_graph_id = -1` in RunOption skips graph capture;
any other value takes the graph capture path for each session.run.
If `gpu_graph_id` is not specified in RunOption, the value of
`enableGraphCapture` decides whether to take the graph capture path.
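
The decision can be summarized with a short Python sketch. `use_graph_capture` is a hypothetical function, not the WebGPU EP code, and it assumes that graph capture is never used when `enableGraphCapture` is disabled, regardless of `gpu_graph_id`.

```python
from typing import Optional

SKIP_GRAPH_CAPTURE = -1  # gpu_graph_id value that opts a run out of capture


def use_graph_capture(enable_graph_capture: bool,
                      gpu_graph_id: Optional[int] = None) -> bool:
    """Return True if this session.run should take the graph capture path."""
    # Assumption: with the EP option disabled, capture is never used.
    if not enable_graph_capture:
        return False
    # gpu_graph_id not set in RunOption: follow the EP option.
    if gpu_graph_id is None:
        return True
    return gpu_graph_id != SKIP_GRAPH_CAPTURE


# Usage: the EP option is on, but one run opts out with gpu_graph_id = -1.
capture_default = use_graph_capture(True)
capture_skipped = use_graph_capture(True, gpu_graph_id=-1)
```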
### Description
Refactor to split out classes and make things easier to find. 

### Motivation and Context
Cleanup
…ml (microsoft#25552)

### Description
Yesterday I updated the machine images, which now have Python
preinstalled, so the Python installation steps are no longer needed.
Remove those steps to avoid conflicts.
Also refactor the YAML file slightly: templates now use parameterized
Python versions instead of a matrix strategy.
Additional equation support for QNN EP on einsum op.
- **DynamicQuantizeMatMul - handle case where B zero point input is
provided but not constant. (microsoft#25544)**
- **Refactor plugin EP support (microsoft#25541)**
- **Remove the python installation steps from
win-qnn-arm64-ci-pipeline.yml (microsoft#25552)**
### Description
This change is based on microsoft#25135.

Upgrade xnnpack and several related third-party dependencies, including
pthreadpool, cpuinfo, and kleidiai. This change also updates the xnnpack
execution provider code to accommodate changes in the xnnpack api.
Average pooling qu8 is removed as the corresponding microkernel seems to
no longer exist in xnnpack.
shaoboyan091 and others added 4 commits July 28, 2025 15:08
…te (microsoft#25553)

This PR fixes webgpu_fix_frame_generator by adding present mode to the
surface configuration. This new attribute is required by the latest Dawn
to render frames.
### Description

This implements the SwiGLU activation for MoE and qMoE. The activation
corresponds to
https://github.com/triton-lang/triton/blob/main/python/triton_kernels/triton_kernels/swiglu.py.

Also update test_parity_moe.py to enable test for qMoE in CI pipelines.

### Motivation and Context

This is a naive implementation of the activation. Since the activation
reduces each row's length by half, we cannot directly use an epilogue.
The current implementation needs an extra buffer to run the SwiGLU kernel.

In the future, we might look at alternatives that do not need an extra
buffer.
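
For readers unfamiliar with the activation, here is a minimal Python sketch of the textbook SwiGLU form, which shows why the output row is half as long as the input row. It assumes the gate values occupy the first half of the row and the linear values the second half. The linked Triton kernel and the ORT kernel may differ in layout (for example interleaving), scaling, or clamping, so treat this as an illustration only.

```python
import math


def swiglu_row(row):
    """Apply SwiGLU to one row: silu(gate) * linear, halving the row length."""
    if len(row) % 2 != 0:
        raise ValueError("row length must be even")
    half = len(row) // 2
    gate, linear = row[:half], row[half:]
    # silu(g) = g * sigmoid(g) = g / (1 + exp(-g))
    return [g / (1.0 + math.exp(-g)) * x for g, x in zip(gate, linear)]


# Usage: a row of 4 values produces 2 outputs.
activated = swiglu_row([0.0, 1.0, 2.0, 3.0])
```

Since the output has a different width from the GEMM result that feeds it, it is written to a separate buffer here, which mirrors the extra-buffer constraint mentioned above.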
### Description
Fixes documentation error in onnxruntime_c_api.h: parameter name
mismatch for `Graph_GetGraphView`



### Motivation and Context
Fix errors in the GitHub action for generating the C/C++ documentation
from public header files.
@jatinwadhwa921 jatinwadhwa921 requested a review from ankitm3k July 29, 2025 06:34
@ankitm3k ankitm3k merged commit 420ec3a into ovep-develop Jul 29, 2025
6 of 8 checks passed
@ankitm3k ankitm3k deleted the synccc_msft_29_7_25 branch July 29, 2025 08:20