MLAS/POWER10: Optimize Sgemm PackA kernel using VSX intrinsics and assembly.#27575
Conversation
|
could you review this PR @hariharans29 |
|
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline |
|
Azure Pipelines successfully started running 4 pipeline(s). |
There was a problem hiding this comment.
Pull request overview
This PR improves POWER10 SGEMM (single-precision GEMM) performance by introducing an explicit PackA stage and updating the MMA compute kernel to consume the packed-A layout, with an optimized assembly PackA implementation on non-AIX platforms.
Changes:
- Add a new POWER10 PackA implementation (C++ fallback + optional assembly fast-path) and route SGEMM through it for the MMA kernel paths.
- Refactor
MlasSgemmMMAProcessCountto consume packed A (Pa) instead of reading A directly withlda. - Update the MLAS CMake configuration to enable ASM and build the new
.Sfile on non-AIX POWER10 builds.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
onnxruntime/core/mlas/lib/power/SgemmKernelPOWER10.cpp |
Switch MMA kernel to packed-A input; add C++ PackA routine, prefetching, and assembly PackA hook. |
onnxruntime/core/mlas/lib/power/SgemmKernelPackA.S |
New POWER10 assembly kernel to pack A efficiently for 4- or 8-row blocks. |
onnxruntime/core/mlas/lib/power/asmmacro.h |
New shared assembly macro header providing a function entry macro. |
cmake/onnxruntime_mlas.cmake |
Enable ASM for POWER10 and conditionally compile the new PackA assembly source (non-AIX). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Can you please address Copilot's comments ? |
Thanks @hariharans29 . I have addressed the Copilot's comments. Please review it. |
|
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline |
|
Azure Pipelines successfully started running 4 pipeline(s). |
|
Please wait for and eventually rebase to include #27618 - I think that will solve failing build. |
…sembly Introduce an optimized POWER10 PackA implementation leveraging VSX builtins and assembly to pre-pack 8 rows of matrix A, packing 64 bytes per row per iteration. Performance improvements observed in prompt processing: - 14% speedup (batch size 1) - 6% speedup (batch size 4) - 4% speedup (batch size 8) Tested with granite-3.1-8b Signed-off-by: Mahesh Bodapati <bmahi496@linux.ibm.com>
1. Removed the memset — unnecessary for CountM == 8 2. Replaced CountM with explicit literals 8 and 4 in the PackAKernelPOWER10 calls — purely a readability fix, no behavioral change. 3. Update the header comment of file SgemmKernelPackA.S 4. Update the PackAKernelPOWER10 declaration.
Head branch was pushed to by a user without write access
9edea2c to
a7b3d36
Compare
Thanks. I have rebased my branch. |
|
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline |
|
Azure Pipelines successfully started running 4 pipeline(s). |
|
@hariharans29 Thanks. I’d like to understand whether backporting patches to past releases is allowed. If so, could you please clarify what kinds of changes are eligible for backporting and what the process looks like? |
Generally backporting to an existing release is not allowed. Only when we plan new patch releases on top of existing releases, we take it commits. But the bar to go for patch release very high. |
Description
Introduce an optimized POWER10 PackA implementation leveraging VSX builtins and assembly to pre-pack 8 rows of matrix A, packing 64 bytes per row per iteration.
Motivation and Context
Performance improvements observed in prompt processing:
Tested with granite-3.1-8b