MLAS/POWER10: Optimize Sgemm PackA kernel using VSX intrinsics and assembly. #27575

Merged
hariharans29 merged 2 commits into microsoft:main from BODAPATIMAHESH:main_Sgemm_PackA on Mar 12, 2026

Conversation

@BODAPATIMAHESH
Contributor

Description

Introduce an optimized POWER10 PackA implementation leveraging VSX builtins and assembly to pre-pack 8 rows of matrix A, packing 64 bytes per row per iteration.
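
To make the packing step concrete, here is a minimal scalar sketch of pre-packing an 8-row panel of A, 16 floats (64 bytes) per row per iteration. It is not the PR's kernel: the name `PackA8Reference`, the constants, and the packed layout (the 8 row values of each column stored contiguously) are assumptions chosen for illustration, and the real implementation uses VSX builtins and assembly.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference packer, NOT the PR's PackAKernelPOWER10.
// Assumed layout: for each column k, the 8 row values are stored
// contiguously, so a compute kernel can read packed A with unit stride.
constexpr std::size_t kPanelRows = 8;       // rows packed together (from the PR description)
constexpr std::size_t kFloatsPerStep = 16;  // 64 bytes per row per iteration = 16 floats

std::vector<float> PackA8Reference(const float* A, std::size_t lda, std::size_t CountK) {
    // This sketch only handles whole 64-byte steps; tails are out of scope.
    assert(CountK % kFloatsPerStep == 0);
    std::vector<float> packed(kPanelRows * CountK);
    for (std::size_t k0 = 0; k0 < CountK; k0 += kFloatsPerStep) {
        // One outer iteration reads 16 floats (64 bytes) from each of the 8 rows.
        for (std::size_t k = k0; k < k0 + kFloatsPerStep; ++k) {
            for (std::size_t r = 0; r < kPanelRows; ++r) {
                packed[k * kPanelRows + r] = A[r * lda + k];
            }
        }
    }
    return packed;
}
```

Under these assumptions, each outer iteration moves 8 x 64 = 512 bytes of A into the packed buffer.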

Motivation and Context

Performance improvements observed in prompt processing:

  • 14% speedup (batch size 1)
  • 6% speedup (batch size 4)
  • 4% speedup (batch size 8)

Tested with granite-3.1-8b

@BODAPATIMAHESH
Contributor Author

BODAPATIMAHESH commented Mar 10, 2026

Could you review this PR, @hariharans29?

@hariharans29 hariharans29 requested a review from Copilot March 10, 2026 17:14
@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

Contributor

Copilot AI left a comment

Pull request overview

This PR improves POWER10 SGEMM (single-precision GEMM) performance by introducing an explicit PackA stage and updating the MMA compute kernel to consume the packed-A layout, with an optimized assembly PackA implementation on non-AIX platforms.

Changes:

  • Add a new POWER10 PackA implementation (C++ fallback + optional assembly fast-path) and route SGEMM through it for the MMA kernel paths.
  • Refactor MlasSgemmMMAProcessCount to consume packed A (Pa) instead of reading A directly with lda.
  • Update the MLAS CMake configuration to enable ASM and build the new .S file on non-AIX POWER10 builds.
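
The interface change in the second bullet can be sketched with scalar stand-ins. Everything here is hypothetical: `PackPanel`, `KernelDirect`, `KernelPacked`, and the packed layout are illustrative assumptions, and the real `MlasSgemmMMAProcessCount` uses POWER10 MMA rather than plain loops. The sketch only shows that a kernel consuming packed A (`Pa`) no longer needs `lda` and produces the same result as one reading A directly.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kPanelRows = 8;  // assumed panel height

// Hypothetical packer: column k of the panel occupies Pa[k*8 .. k*8+7].
std::vector<float> PackPanel(const float* A, std::size_t lda, std::size_t CountK) {
    std::vector<float> Pa(kPanelRows * CountK);
    for (std::size_t k = 0; k < CountK; ++k)
        for (std::size_t r = 0; r < kPanelRows; ++r)
            Pa[k * kPanelRows + r] = A[r * lda + k];
    return Pa;
}

// "Before" shape: the kernel strides through A using lda.
void KernelDirect(const float* A, std::size_t lda, const float* B, std::size_t ldb,
                  float* C, std::size_t ldc, std::size_t CountK, std::size_t CountN) {
    for (std::size_t r = 0; r < kPanelRows; ++r) {
        for (std::size_t n = 0; n < CountN; ++n) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < CountK; ++k)
                acc += A[r * lda + k] * B[k * ldb + n];
            C[r * ldc + n] = acc;
        }
    }
}

// "After" shape: the kernel walks Pa sequentially and takes no lda.
void KernelPacked(const float* Pa, const float* B, std::size_t ldb,
                  float* C, std::size_t ldc, std::size_t CountK, std::size_t CountN) {
    for (std::size_t r = 0; r < kPanelRows; ++r)
        for (std::size_t n = 0; n < CountN; ++n)
            C[r * ldc + n] = 0.0f;
    for (std::size_t k = 0; k < CountK; ++k) {
        const float* a = Pa + k * kPanelRows;  // 8 contiguous values of column k
        for (std::size_t r = 0; r < kPanelRows; ++r)
            for (std::size_t n = 0; n < CountN; ++n)
                C[r * ldc + n] += a[r] * B[k * ldb + n];
    }
}
```

Packing once and reusing `Pa` across all column blocks of B is the usual reason for this split: the strided reads of A happen a single time instead of once per block.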

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| onnxruntime/core/mlas/lib/power/SgemmKernelPOWER10.cpp | Switch MMA kernel to packed-A input; add C++ PackA routine, prefetching, and assembly PackA hook. |
| onnxruntime/core/mlas/lib/power/SgemmKernelPackA.S | New POWER10 assembly kernel to pack A efficiently for 4- or 8-row blocks. |
| onnxruntime/core/mlas/lib/power/asmmacro.h | New shared assembly macro header providing a function entry macro. |
| cmake/onnxruntime_mlas.cmake | Enable ASM for POWER10 and conditionally compile the new PackA assembly source (non-AIX). |

@hariharans29
Member

Can you please address Copilot's comments?

@BODAPATIMAHESH
Contributor Author

Can you please address Copilot's comments?

Thanks, @hariharans29. I have addressed Copilot's comments. Please review.

@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@hariharans29 hariharans29 enabled auto-merge (squash) March 11, 2026 21:30
@hariharans29
Member

Please wait for #27618 and then rebase to include it. I think that will fix the failing build.

…sembly

Introduce an optimized POWER10 PackA implementation leveraging
VSX builtins and assembly to pre-pack 8 rows of matrix A, packing
64 bytes per row per iteration.

Performance improvements observed in prompt processing:
- 14% speedup (batch size 1)
- 6% speedup (batch size 4)
- 4% speedup (batch size 8)

Tested with granite-3.1-8b

Signed-off-by: Mahesh Bodapati <bmahi496@linux.ibm.com>

1. Removed the memset, which is unnecessary for CountM == 8.
2. Replaced CountM with the explicit literals 8 and 4 in the PackAKernelPOWER10 calls. This is purely a readability fix with no behavioral change.
3. Updated the header comment of SgemmKernelPackA.S.
4. Updated the PackAKernelPOWER10 declaration.

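
The reasoning behind item 1 can be illustrated with a small sketch: when all 8 rows of a panel are packed, every slot of the buffer is overwritten, so zeroing it first does no useful work. The packer below is hypothetical (its name, the `zeroFirst` flag, and the layout are assumptions), and it makes no claim about how the PR handles panels with fewer rows.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

constexpr std::size_t kPanelRows = 8;  // assumed panel height

// Hypothetical packer: writes rows [0, CountM) of every packed column and
// leaves the remaining slots untouched unless zeroFirst is set.
void PackPanel(const float* A, std::size_t lda, std::size_t CountM, std::size_t CountK,
               float* packed, bool zeroFirst) {
    assert(CountM <= kPanelRows);
    if (zeroFirst) {
        // All-zero bytes represent 0.0f for IEEE-754 floats.
        std::memset(packed, 0, kPanelRows * CountK * sizeof(float));
    }
    for (std::size_t k = 0; k < CountK; ++k)
        for (std::size_t r = 0; r < CountM; ++r)
            packed[k * kPanelRows + r] = A[r * lda + k];
}

// Counts slots that still hold the sentinel, i.e. were never written.
std::size_t CountUntouched(const std::vector<float>& packed, float sentinel) {
    std::size_t n = 0;
    for (float v : packed)
        if (v == sentinel) ++n;
    return n;
}
```

Filling the buffer with a sentinel and packing with `CountM == 8` and `zeroFirst == false` leaves no sentinel behind, which is why the memset is redundant in that case. In this sketch only, a 4-row pack into an 8-row panel would leave half the slots unwritten.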
auto-merge was automatically disabled March 12, 2026 05:40

Head branch was pushed to by a user without write access

@BODAPATIMAHESH
Contributor Author

Please wait for #27618 and then rebase to include it. I think that will fix the failing build.

Thanks. I have rebased my branch.

@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@hariharans29 hariharans29 enabled auto-merge (squash) March 12, 2026 18:06
@hariharans29 hariharans29 merged commit 5274c19 into microsoft:main Mar 12, 2026
89 checks passed
@BODAPATIMAHESH
Contributor Author

@hariharans29 Thanks. I’d like to understand whether backporting patches to past releases is allowed. If so, could you please clarify what kinds of changes are eligible for backporting and what the process looks like?

@hariharans29
Member

@hariharans29 Thanks. I’d like to understand whether backporting patches to past releases is allowed. If so, could you please clarify what kinds of changes are eligible for backporting and what the process looks like?

Generally, backporting to an existing release is not allowed. We take in commits only when we plan a new patch release on top of an existing release, and the bar for a patch release is very high.
