[MLAS][KleidiAI]Catlaw01/sgemm epilogue neon opt #27609

Merged
hariharans29 merged 4 commits into microsoft:main from Laan33:catlaw01/sgemm-epilogue-neon-opt
Mar 19, 2026

@Laan33
Contributor

@Laan33 Laan33 commented Mar 10, 2026

Description

This change updates the KleidiAI SGEMM post-processing path in onnxruntime/core/mlas/lib/kleidiai/sgemm_kleidiai.cpp with two parts:

  • Correctness fix: in the alpha == 0 || K == 0 fast path, beta handling is now applied for every batch entry (not just batch 0), so batched SGEMM behaviour is correct.
  • NEON SGEMM epilogue optimisation: adds a vectorised alpha/beta post-processing path for contiguous outputs, with guarded fallback to scalar for non-contiguous or small cases. The 2D epilogue path also routes contiguous tiles through the contiguous 1D epilogue path to enable vectorisation.
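A minimal sketch of what a vectorised alpha/beta epilogue with a scalar fallback could look like for a contiguous output. The function name `ApplyAlphaBetaContiguous` and its signature are assumptions made for illustration and are not taken from `sgemm_kleidiai.cpp`; the NEON body is guarded so that the same code still builds on non-ARM targets.

```cpp
#include <cstddef>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper (not the MLAS function): computes
//   c[i] = alpha * ab[i] + beta * c[i]
// over n contiguous floats, where 'ab' holds the raw A*B product.
inline void ApplyAlphaBetaContiguous(const float* ab, float* c, std::size_t n,
                                     float alpha, float beta) {
    std::size_t i = 0;
#if defined(__ARM_NEON)
    const float32x4_t valpha = vdupq_n_f32(alpha);
    const float32x4_t vbeta = vdupq_n_f32(beta);
    // Main loop: four floats per iteration.
    for (; i + 4 <= n; i += 4) {
        const float32x4_t prod = vmulq_f32(vld1q_f32(ab + i), valpha);
        // vmlaq_f32(a, b, c) returns a + b * c.
        const float32x4_t acc = vmlaq_f32(prod, vld1q_f32(c + i), vbeta);
        vst1q_f32(c + i, acc);
    }
#endif
    // Scalar tail, which is also the whole loop when NEON is unavailable.
    for (; i < n; ++i) {
        c[i] = alpha * ab[i] + beta * c[i];
    }
}
```

The scalar loop doubles as the fallback for small inputs, since anything shorter than four elements never enters the vector loop.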

Motivation and Context

This change addresses correctness and performance in the SGEMM post-processing stage:

  • The batched alpha == 0 || K == 0 path previously used only Data[0], which could produce incorrect results for BatchSize > 1.
  • The post-processing loop (C = alpha * (A*B) + beta * C) is a known latency contributor when memcpy fast paths are not applicable. The NEON epilogue changes are intended to reduce this cost on supported ARM platforms while preserving existing fallback behaviour.
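To illustrate the batched fast-path issue: when alpha == 0 or K == 0 the product term vanishes, so each batch entry reduces to C = beta * C using that entry's own parameters. The sketch below uses a hypothetical `BatchEntry` struct and helper name as stand-ins; they are assumptions for illustration, not the MLAS data structures.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one batch entry's output parameters.
struct BatchEntry {
    float* C;         // output matrix for this entry
    std::size_t ldc;  // row stride of C, in elements
    float beta;       // this entry's beta
};

// Beta-only path: scale every entry's M x N output by that entry's beta.
inline void ScaleAllBatchesByBeta(const std::vector<BatchEntry>& data,
                                  std::size_t M, std::size_t N) {
    for (const BatchEntry& e : data) {  // every entry, not only data[0]
        for (std::size_t m = 0; m < M; ++m) {
            float* row = e.C + m * e.ldc;
            for (std::size_t n = 0; n < N; ++n) {
                // With beta == 0 the old contents of C are overwritten with zero.
                row[n] = (e.beta == 0.0f) ? 0.0f : row[n] * e.beta;
            }
        }
    }
}
```

Reading the parameters only from the first entry would leave the remaining outputs unscaled, or scaled by the wrong beta, whenever BatchSize > 1.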

@Laan33
Contributor Author

Laan33 commented Mar 10, 2026

@microsoft-github-policy-service agree company="Arm"

@Laan33 Laan33 changed the title Catlaw01/sgemm epilogue neon opt [MLAS][KleidiAI] Catlaw01/sgemm epilogue neon opt Mar 10, 2026
@Laan33 Laan33 changed the title [MLAS][KleidiAI] Catlaw01/sgemm epilogue neon opt [MLAS][KleidiAI]Catlaw01/sgemm epilogue neon opt Mar 10, 2026
@hariharans29 hariharans29 requested a review from Copilot March 10, 2026 17:04
@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

Contributor

Copilot AI left a comment

Pull request overview

Updates the KleidiAI-backed SGEMM post-processing (alpha/beta “epilogue”) in MLAS to fix batched correctness in an early-exit path and to improve ARM NEON performance for contiguous outputs.

Changes:

  • Fix batched alpha == 0 || K == 0 fast path to apply beta reduction for every batch entry.
  • Add a contiguous-only NEON-vectorized alpha/beta epilogue path with scalar fallback for small/non-contiguous cases.
  • Route contiguous 2D tiles through the 1D contiguous path to reuse vectorization.
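The routing in the last point can be sketched as follows, assuming a tile counts as contiguous when both row strides equal N. `Epilogue1D` and `Epilogue2D` are hypothetical names, and the 1D helper is kept scalar here for brevity; this is an illustration of the idea, not the code in the PR.

```cpp
#include <cstddef>

// Hypothetical 1D helper over n contiguous floats. Scalar here; a vectorised
// body could be substituted without changing the callers.
inline void Epilogue1D(const float* ab, float* c, std::size_t n,
                       float alpha, float beta) {
    for (std::size_t i = 0; i < n; ++i) {
        c[i] = alpha * ab[i] + beta * c[i];
    }
}

// Hypothetical 2D epilogue over an M x N tile. ld_ab and ldc are row strides
// in elements for the product buffer and the output respectively.
inline void Epilogue2D(const float* ab, std::size_t ld_ab,
                       float* c, std::size_t ldc,
                       std::size_t M, std::size_t N,
                       float alpha, float beta) {
    if (ld_ab == N && ldc == N) {
        // Rows sit back to back, so the tile is one flat run of M * N floats.
        Epilogue1D(ab, c, M * N, alpha, beta);
        return;
    }
    // Strided tile: process row by row and leave the padding untouched.
    for (std::size_t m = 0; m < M; ++m) {
        Epilogue1D(ab + m * ld_ab, c + m * ldc, N, alpha, beta);
    }
}
```

Collapsing the contiguous case into a single call gives the 1D path one long run to work on instead of M short ones.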

@Laan33 Laan33 force-pushed the catlaw01/sgemm-epilogue-neon-opt branch from becb89f to 7b594fc Compare March 11, 2026 11:48
@hariharans29
Member

Please include this - #27618 when it goes through eventually

@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@hariharans29
Member

Can you please fix the CI issues?

Laan33 added 4 commits March 18, 2026 17:08
Signed-off-by: Cathal Lawlor <cathal.lawlor@arm.com>
Signed-off-by: Cathal Lawlor <cathal.lawlor@arm.com>
Fix batched fast-path handling in KleidiAI SGemm by avoiding alpha checks on batch-0 only:
  - handle K==0 as a per-batch beta-only path
  - only take alpha==0 fast path when all batch entries have alpha==0

Add non-long SGemm regression coverage for BatchSize>1 with mixed alpha/beta combinations, including a batched K==0 case.
Update ApplyAlphaBeta2D comments to match current contiguous-tile control flow.

Signed-off-by: Cathal Lawlor <cathal.lawlor@arm.com>
Signed-off-by: Cathal Lawlor <cathal.lawlor@arm.com>
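
The fast-path rule described in the third commit message above (K == 0 always takes the beta-only path; alpha == 0 takes it only when every batch entry has alpha == 0) could be expressed roughly as below. `GemmParams`, `Path`, and `ChoosePath` are hypothetical names introduced for this sketch and do not come from the PR.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical per-entry scalars; not the MLAS parameter struct.
struct GemmParams {
    float alpha;
    float beta;
};

enum class Path { BetaOnly, FullGemm };

// Decide whether the whole batch may skip the matrix product.
inline Path ChoosePath(const std::vector<GemmParams>& batch, std::size_t K) {
    if (K == 0) {
        // Empty reduction dimension: A*B is zero for every entry.
        return Path::BetaOnly;
    }
    // Inspect all entries rather than batch[0] alone.
    const bool all_alpha_zero =
        std::all_of(batch.begin(), batch.end(),
                    [](const GemmParams& p) { return p.alpha == 0.0f; });
    return all_alpha_zero ? Path::BetaOnly : Path::FullGemm;
}
```

A batch that mixes alpha == 0 with non-zero alpha therefore falls through to the full GEMM path, which is the kind of case the added regression coverage exercises.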
@Laan33 Laan33 force-pushed the catlaw01/sgemm-epilogue-neon-opt branch from 2b591f7 to f1d1717 Compare March 18, 2026 17:09
@Laan33
Contributor Author

Laan33 commented Mar 18, 2026

Pushed new commits to fix issues introduced when I merged main into the branch.

@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@hariharans29 hariharans29 enabled auto-merge (squash) March 18, 2026 19:16
@hariharans29 hariharans29 merged commit 3bb9e95 into microsoft:main Mar 19, 2026
89 checks passed