forked from microsoft/onnxruntime
Backmerging with Msft commits #762
Merged
Conversation
### Description
Add a new allocator type, `OrtReadOnlyAllocator`, to enable providing a separate allocator that is used only for initializers. Update the SessionState logic to support this allocator type being provided, and use it when doing device allocations for initializers.

### Motivation and Context
Performance.
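For illustration, the new allocator type could be referenced from the C++ API roughly as follows (a sketch only: how such an allocator is registered and picked up by SessionState for initializer allocations is not shown here):

```cpp
// Sketch only: declares a memory info tagged with the initializer-only allocator type.
// The wiring of this allocator into a session/EP is an assumption left out of this snippet.
#include <onnxruntime_cxx_api.h>

Ort::MemoryInfo MakeReadOnlyCpuMemoryInfo() {
  // OrtReadOnlyAllocator marks the allocator as intended for initializers only,
  // leaving activations and outputs to the regular device allocator.
  return Ort::MemoryInfo("Cpu", OrtReadOnlyAllocator, /*device_id*/ 0, OrtMemTypeDefault);
}
```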
### Description
This PR patches the features provided in microsoft#25476, providing a stable fix for the GPU plugin with the upcoming OV toolkit v2025.2.1.

---
Signed-off-by: Jianhui Dai <jianhui.j.dai@intel.com>
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: bfilipek <bartlomiej.filipek@intel.com>
Co-authored-by: jatinwadhwa921 <110383850+jatinwadhwa921@users.noreply.github.com>
Co-authored-by: n1harika <niharika.sathish@intel.com>
Co-authored-by: sfatimar <sahar.fatima@intel.com>
Co-authored-by: Jaskaran Singh Nagi <jaskaran.singh.nagi@intel.com>
Co-authored-by: Eric Crawford <eric.r.crawford@intel.com>
Co-authored-by: Sushanth Rajasankar <44513542+sushraja-msft@users.noreply.github.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: Seungtaek Kim <seungtaek.kim.94@gmail.com>
Co-authored-by: co63oc <co63oc@users.noreply.github.com>
Co-authored-by: Jambay Kinley <jambaykinley@microsoft.com>
Co-authored-by: Hector Li <hecli@microsoft.com>
Co-authored-by: Jian Chen <cjian@microsoft.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Jiajia Qin <jiajiaqin@microsoft.com>
Co-authored-by: Alessio Soldano <services@soldano.it>
Co-authored-by: Changming Sun <chasun@microsoft.com>
Co-authored-by: Ashish Garg <quic_ashigarg@quicinc.com>
Co-authored-by: Ashish Garg <ashigarg@qti.qualcomm.com>
Co-authored-by: Jie Chen <jie.a.chen@intel.com>
Co-authored-by: wp <webgraphics@intel.com>
Co-authored-by: Satya Kumar Jandhyala <satya.k.jandhyala@gmail.com>
Co-authored-by: Prathik Rao <prathik.rao@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: Jianhui Dai <jianhui.j.dai@intel.com>
Co-authored-by: xhcao <xinghua.cao@intel.com>
Co-authored-by: Wanming Lin <wanming.lin@intel.com>
Co-authored-by: Mark Schofield <mschofie@microsoft.com>
Co-authored-by: jiangzhaoming <zhaoming.jiang@microsoft.com>
Co-authored-by: Yi-Hong Lyu <yilyu@microsoft.com>
Co-authored-by: vraspar <vrajang@outlook.com>
Co-authored-by: Chi Lo <54722500+chilo-ms@users.noreply.github.com>
Co-authored-by: saurabh <saurabh1.kale@intel.com>
Co-authored-by: Ranjit Ranjan <165394499+ranjitshs@users.noreply.github.com>
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com>
Co-authored-by: Pallavi Gupta <pallavi.gupta@intel.com>
Co-authored-by: Nikolay Proshunin <nikolay.proshunin@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
Co-authored-by: Javier Martinez <javier.e.martinez@intel.com>
Co-authored-by: Bartlomiej Filipek <bartlomiej.filipek@intel.com>
Co-authored-by: bopeng1234 <bo.peng@intel.com>
Co-authored-by: MayureshV1 <47039074+MayureshV1@users.noreply.github.com>
Co-authored-by: TejalKhade28 <tejal.khade@intel.com>
Co-authored-by: Vishnudas Thaniel S <vishnudas.thaniel.s@intel.com>
Co-authored-by: Yaru Du <yaru.du@intel.com>
Co-authored-by: Ryan Metcalfe <107415876+RyanMetcalfeInt8@users.noreply.github.com>
Co-authored-by: Dvoretckii, Mikhail <mikhail.dvoretckii@intel.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
Co-authored-by: Fei Chen <feich@microsoft.com>
Co-authored-by: qti-yuduo <yuduow@qti.qualcomm.com>
Co-authored-by: Akupadhye <aupadhye@qti.qualcomm.com>
Co-authored-by: Wang Ning <ning4.wang@intel.com>
Co-authored-by: Maximilian Müller <44298237+gedoensmax@users.noreply.github.com>
Co-authored-by: George Wu <jywu@microsoft.com>
Co-authored-by: quic-calvnguy <quic_calvnguy@quicinc.com>
Co-authored-by: Wei-Sheng Chin <wschin@outlook.com>
Co-authored-by: quic-hungjuiw <quic_hungjuiw@quicinc.com>
Co-authored-by: Ian Hunter <ianfhunter@gmail.com>
Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
Co-authored-by: Jeff Kilpatrick <jkilpatrick@qti.qualcomm.com>
Co-authored-by: Jeff Kilpatrick <jkilpat@qti.qualcomm.com>
Co-authored-by: Nenad Banfic <46795300+nenad1002@users.noreply.github.com>
Co-authored-by: derdeljan-msft <derdeljan@microsoft.com>
Co-authored-by: Ryan Metcalfe <ryan.metcalfe@intel.com>
Previously the machine pool had a user-assigned managed identity (UMI) that was used for accessing the blob storage. The UMI has now been removed to improve security, so we baked the data into the VM image instead.
…tValues (microsoft#25482)

### Description
- Adds APIs to get information (file path, file offset, byte size) for initializers with data in external files. This allows EPs to do their own custom memory-mapping of initializer data. By default, EPs that don't have specific requirements can still use `ValueInfo_GetInitializerValue` to get an `OrtValue` with memory-mapped initializer data.
- Updates `OrtGraph` to only load an `OrtValue` for external initializers on demand. This prevents having to memory-map all external initializers before the first call to `OrtEp::GetCapability`. Follow-up to microsoft#25320.

New API functions:

| Function | Summary |
|----------|---------|
| `ValueInfo_GetExternalInitializerInfo` | Gets `OrtExternalInitializerInfo` from `OrtValueInfo` (or `NULL`). Must be released with `ReleaseExternalInitializerInfo`. |
| `ReleaseExternalInitializerInfo` | Releases the `OrtExternalInitializerInfo` instance. |
| `ExternalInitializerInfo_GetFilePath` | Returns the relative path to the file that stores the initializer's data. |
| `ExternalInitializerInfo_GetFileOffset` | Returns the byte offset within the file where the initializer's data is stored. |
| `ExternalInitializerInfo_GetByteSize` | Returns the size in bytes of the initializer's data within the file. |

---
Co-authored-by: Dmitri Smirnov <dmitrism@microsoft.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
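For illustration, an EP could combine these functions roughly as follows (a sketch only: the exact signatures are assumptions inferred from the summaries above, not copied from the header):

```cpp
#include <onnxruntime_c_api.h>

// Sketch only: `value_info` would come from the OrtGraph the EP is compiling.
void MapExternalInitializer(const OrtApi* ort, const OrtValueInfo* value_info) {
  OrtExternalInitializerInfo* ext_info = nullptr;
  OrtStatus* status = ort->ValueInfo_GetExternalInitializerInfo(value_info, &ext_info);
  if (status == nullptr && ext_info != nullptr) {
    // Location of the raw initializer bytes inside the external data file.
    const ORTCHAR_T* file_path = ort->ExternalInitializerInfo_GetFilePath(ext_info);
    int64_t file_offset = ort->ExternalInitializerInfo_GetFileOffset(ext_info);
    size_t byte_size = ort->ExternalInitializerInfo_GetByteSize(ext_info);

    // ... the EP can now memory-map [file_offset, file_offset + byte_size) from file_path ...
    (void)file_path;

    ort->ReleaseExternalInitializerInfo(ext_info);
  } else if (status != nullptr) {
    ort->ReleaseStatus(status);
  }
}
```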
### Description
Use the license file from QNN SDK to make sure it's up to date.

---
Co-authored-by: adrianlizarraga <adlizarraga@microsoft.com>
…nt. (microsoft#25465)

### Description
Add an arena that uses the EP API so that an EP library can be self-sufficient.

Remove cross-stream sharing from BFCArena. Nothing is using it, and it creates a dependency on synchronizing streams inside the arena implementation.

Tried to simplify the Stream/Notification usage. The current setup adds an `AllocOnStream` to `OrtAllocator`. There's no stream-aware `Free` at this point, as ORT does not attach the Stream to the memory usage and so can't pass it in to the `Free` call.

### Motivation and Context
If ORT adds a BFCArena to an OrtAllocator from the EP, we have OrtAllocator -> IAllocator wrapper -> BFCArena IAllocator [-> OrtAllocator wrapper for external usage]. The EP managing its own arena is much simpler.

---
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…icrosoft#25390)

### Description
Adjusts the Concat operator to batch inputs based on `maxStorageBuffersPerShaderStage`, allowing an unlimited number of inputs.

### Motivation and Context
Fixes the patchtst model for transformers.js. [Screenshot attached to the PR.]
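The batching idea, sketched without any WebGPU specifics (names are illustrative, not the EP's actual code): reserve one binding for the output and split the inputs into groups that fit the per-stage storage-buffer limit.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sketch only: splits N concat inputs into passes no larger than the device limit,
// keeping one storage-buffer binding free for the output; each pass concatenates its
// group into the output at the appropriate offset.
std::vector<std::vector<std::size_t>> BatchConcatInputs(std::size_t num_inputs,
                                                        std::size_t max_storage_buffers_per_stage) {
  const std::size_t per_pass =
      max_storage_buffers_per_stage > 1 ? max_storage_buffers_per_stage - 1 : 1;
  std::vector<std::vector<std::size_t>> batches;
  for (std::size_t start = 0; start < num_inputs; start += per_pass) {
    const std::size_t end = std::min(start + per_pass, num_inputs);
    std::vector<std::size_t> batch;
    for (std::size_t i = start; i < end; ++i) batch.push_back(i);
    batches.push_back(std::move(batch));
  }
  return batches;
}
```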
### Description
Implements Attention(23) for CPU. The backend tests from onnx were wrong for Attention (see onnx/onnx#7142). The onnx version needs to be updated to make all tests pass. The implementation matches the reference implementation after onnx was fixed.

---
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Ti-Tai Wang <titaiwang@microsoft.com>
Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
…t are not correctly excluded (microsoft#25502)

### Description
This change respects initializers that are external but already loaded in memory. This is required due to an optimization that leaves it to the backend to read a mapped memory area.

@chilo-ms, can you help run the CI and merge this change?

---
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
### Description
1. Implemented the required changes for the EP factory.

### Motivation and Context
These changes are required for WinML GA.
### Description
Fixes for the OrtGraphToProto utilities that EPs can copy and modify:
- When serializing `OrtGraph` to ONNX protobuf, do not set an `onnx::TensorShapeProto` for an `onnx::ValueInfo` if the shape has no dimension entries. Otherwise, the shape incorrectly looks like a scalar.
- Add `ORT_OP_ATTR_GRAPH` to the enum values returned by the `OpAttr_GetType` C API function. This allows the OrtGraphToProto utilities to skip processing subgraph attributes (which can be retrieved via a different API) while still returning an error on any unsupported attribute type.
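For illustration, a utility walking node attributes could use the new enum value roughly like this (a sketch only: the initial `ORT_OP_ATTR_UNDEFINED` value and the error handling are assumptions, not the shipped utility code):

```cpp
#include <onnxruntime_c_api.h>

// Sketch only: skip subgraph-valued attributes, treat anything else unrecognized as an error.
bool ShouldSerializeAttribute(const OrtApi* ort, const OrtOpAttr* attr) {
  OrtOpAttrType attr_type = ORT_OP_ATTR_UNDEFINED;
  OrtStatus* status = ort->OpAttr_GetType(attr, &attr_type);
  if (status != nullptr) {
    ort->ReleaseStatus(status);
    return false;  // unsupported attribute type: the caller reports an error
  }
  // Subgraph attributes are retrieved through a separate graph API,
  // so the proto serializer simply skips them instead of erroring out.
  return attr_type != ORT_OP_ATTR_GRAPH;
}
```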
…ft#25534)

### Description
1. Upgrade the onnxruntime-Ubuntu2204-AMD-CPU machine pool to Ubuntu 24.04, which fixes some vulnerability management issues.
2. Fix some packaging pipeline issues and remove some unused code blocks from dml-vs-2022.yml.
…#25484)

WebNN requires the shapes of zeroPoint and scale for a QDQ op to be the same. However, ONNX allows [1] as a scalar shape, and some models may use [1] as the shape of x_zero_point. We should explicitly set the shape of scale to match x_zero_point.
Co-authored-by: microsoft-github-policy-service[bot] <77245923+microsoft-github-policy-service[bot]@users.noreply.github.com>
…KleidiAI (microsoft#25187)

This PR introduces the initial integration of KleidiAI-optimized microkernels into ONNX Runtime's MLAS backend, focusing on support for:
- SGEMM
- IGEMM
- Dynamic quantized MatMuls

Key changes:
- Implements overrides for MlasGemmBatch, MlasGemmPackBSize, and MlasGemmPackB using KleidiAI where applicable.
- Applies dispatch logic based on TransA == CblasNoTrans and SME2 availability.
- Supports float32 and int8 GEMM workloads with conditionally invoked SME2 paths.
- Maintains fallback paths to the default MLAS implementations to ensure coverage and stability.

**Known Issues / Next Steps:**
Requesting feedback specifically on the API structure:
- Does the new MLAS interface design align with long-term extensibility?
- Are the dispatch points and override boundaries well-structured?

Indicative performance figures: the kernels added are particularly effective for Conv2D operators.
- Based on KleidiAI SME running mobilenet_v1_ssd_f32 on a Mac Mini M4 on a single thread. [Performance chart attached to the PR.]

---
Signed-off-by: Damien Dooley <damien.dooley@arm.com>
Co-authored-by: Jonathan Clohessy <jonathan.clohessy@arm.com>
Co-authored-by: Declan Flavin <declan.flavin@arm.com>
Co-authored-by: Colm Donelan <colm.donelan@arm.com>
Co-authored-by: Damien Dooley <damdoo01@ip-10-249-28-46.eu-west-1.compute.internal>
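For illustration, the dispatch condition described above reduces to a guard like the following (the enum is redeclared only to keep the snippet self-contained, and the helper name is hypothetical rather than actual MLAS code):

```cpp
// Sketch only: the KleidiAI SGEMM/IGEMM overrides apply when A is not transposed and SME2 is
// available; every other case falls back to the default MLAS kernels.
enum CBLAS_TRANSPOSE { CblasNoTrans = 111, CblasTrans = 112, CblasConjTrans = 113 };

inline bool ShouldDispatchToKleidiAI(CBLAS_TRANSPOSE trans_a, bool sme2_available) {
  return trans_a == CblasNoTrans && sme2_available;
}
```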
### Description
- LPBQ encoding is Qualcomm's alternative quantization encoding format for block quantization.
- Add translation logic to read the LPBQ pattern on MatMul weights in a QDQ ONNX model exported by the AIMET quantizer.
- Prepare the corresponding QNN quantization params for applying LowPowerBlockQuantization on MatMul weights.
- Apply LPBQ fusions only for the NPU backend, as currently only the NPU backend supports the LPBQ encoding format.

### Motivation and Context
- This is required to accelerate accuracy-sensitive large language models like Phi-3.5 efficiently on Qualcomm's NPU accelerator.
### Description
Corrected dtype_name for the respective float16 implementations; previously MLFloat16 would return bf16 rather than fp16, and vice versa.

### Motivation and Context
It looked wrong but passed the tests. I don't fully understand what the test suite is doing well enough to improve it, but I'd be willing to implement any pointers.
It reduces the pipeline time by about 30 minutes. The tests still take about 1 hour, which should be reduced further.
…ded but not constant. (microsoft#25544)

### Description
In the DynamicQuantizeMatMul KleidiAI-specific prepacking logic, handle the case where the B zero point input is provided but not constant. In this case, we should not prepack. Add some unit tests that exercise the prepacking code path.

Add a check for ARM SME instructions in DynamicQuantizeMatMul before calling `MlasDynamicQGemmBatch()` and associated functions.

### Motivation and Context
Follow-up to microsoft#25187.
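A sketch of the guard described above (the helper and its flags are hypothetical, for illustration only; the real logic lives in the DynamicQuantizeMatMul prepacking code):

```cpp
// Hypothetical guard: pre-packing bakes B and its zero point into a fixed layout ahead of time,
// so it is only valid when the zero point is absent or a constant initializer, and only when the
// CPU has the ARM SME support the MlasDynamicQGemm* path requires.
bool CanUseKleidiAiPrepack(bool b_zero_point_provided,
                           bool b_zero_point_is_constant,
                           bool cpu_has_arm_sme) {
  if (b_zero_point_provided && !b_zero_point_is_constant) {
    return false;  // zero point may change between runs; use the non-prepacked path
  }
  return cpu_has_arm_sme;
}
```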
### Description
### Motivation and Context
Fix the build break on Windows + Ninja.
### Description
Fixes the packaging pipeline.

---
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This PR uses the existing RunOption `gpu_graph_id` to control whether to skip graph capture. When the webgpu EP option `enableGraphCapture` is enabled, setting `gpu_graph_id` to -1 in the RunOptions skips graph capture; otherwise, the graph capture path is taken for each session.run. If `gpu_graph_id` is not specified in the RunOptions, `enableGraphCapture`'s value determines whether the graph capture path is taken.
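For the C/C++ API, the run-level switch described above would look roughly like this (a sketch: it assumes the key is passed as a RunOptions config entry named `gpu_graph_id`, mirroring the description; the web-side `enableGraphCapture` session option is not shown):

```cpp
#include <onnxruntime_cxx_api.h>

// Sketch only: disables graph capture for one particular Run() while the session-level
// graph-capture option stays enabled. The key name is taken from the description above.
void RunWithoutGraphCapture(Ort::Session& session,
                            const char* const* input_names, const Ort::Value* inputs, size_t input_count,
                            const char* const* output_names, size_t output_count) {
  Ort::RunOptions run_options;
  run_options.AddConfigEntry("gpu_graph_id", "-1");  // -1 => skip graph capture for this run
  auto outputs = session.Run(run_options, input_names, inputs, input_count,
                             output_names, output_count);
  (void)outputs;
}
```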
### Description
Refactor to split out classes and make things easier to find.

### Motivation and Context
Cleanup.
…ml (microsoft#25552)

### Description
Yesterday I updated the machine images, and they now have Python preinstalled, so we no longer need to install it ourselves. Remove the installation steps to avoid conflicts. Also refactor the yaml file a little: the templates now use parameterized Python versions instead of a matrix strategy.
Additional equation support for QNN EP on einsum op.
- **DynamicQuantizeMatMul - handle case where B zero point input is provided but not constant. (microsoft#25544)**
- **Refactor plugin EP support (microsoft#25541)**
- **Remove the python installation steps from win-qnn-arm64-ci-pipeline.yml (microsoft#25552)**
### Description
This change is based on microsoft#25135. Upgrade xnnpack and several related third-party dependencies, including pthreadpool, cpuinfo, and kleidiai. This change also updates the xnnpack execution provider code to accommodate changes in the xnnpack API. QU8 average pooling is removed, as the corresponding microkernel no longer seems to exist in xnnpack.
…te (microsoft#25553)

This PR fixes webgpu_fix_frame_generator by adding a present mode to the surface configuration. This new attribute is required by the latest Dawn to render frames.
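For reference, configuring a surface with a present mode in Dawn's C++ WebGPU API looks roughly like this (a sketch only: the format, usage, and Fifo present mode are placeholder choices, not necessarily what the frame generator uses):

```cpp
#include <webgpu/webgpu_cpp.h>

// Sketch only: recent Dawn requires presentMode to be set when configuring a surface.
void ConfigureSurface(wgpu::Surface& surface, wgpu::Device& device,
                      uint32_t width, uint32_t height) {
  wgpu::SurfaceConfiguration config = {};
  config.device = device;
  config.format = wgpu::TextureFormat::BGRA8Unorm;       // placeholder format
  config.usage = wgpu::TextureUsage::RenderAttachment;
  config.width = width;
  config.height = height;
  config.presentMode = wgpu::PresentMode::Fifo;          // the newly required attribute
  surface.Configure(&config);
}
```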
### Description
This implements the SwiGLU activation for MoE and qMoE. The activation corresponds to https://github.com/triton-lang/triton/blob/main/python/triton_kernels/triton_kernels/swiglu.py. Also update test_parity_moe.py to enable the qMoE test in CI pipelines.

### Motivation and Context
This is a naive implementation of the activation. Since the activation reduces each row length to half, we cannot use the epilogue directly, so the current implementation needs an extra buffer to run the SwiGLU kernel. In the future, we might look at alternatives that do not need the extra buffer.
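For reference, a minimal scalar sketch of a SwiGLU row computation (the alpha scaling and clamping used by the referenced Triton kernel are omitted, and the [gate | linear] split convention is an assumption for illustration):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch only: each 2*d input row produces d outputs, which is why the kernel needs a
// separate output buffer instead of being folded into the GEMM epilogue.
std::vector<float> SwiGLURow(const std::vector<float>& row) {  // row = [gate | linear], size 2*d
  const std::size_t d = row.size() / 2;
  std::vector<float> out(d);
  for (std::size_t i = 0; i < d; ++i) {
    const float gate = row[i];
    const float linear = row[d + i];
    const float swish = gate / (1.0f + std::exp(-gate));  // gate * sigmoid(gate)
    out[i] = swish * linear;
  }
  return out;
}
```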
### Description
Fixes a documentation error in onnxruntime_c_api.h: parameter name mismatch for `Graph_GetGraphView`.

### Motivation and Context
Fix errors in the GitHub action that generates the C/C++ documentation from the public header files.
ankitm3k approved these changes on Jul 29, 2025.