[MLAS][KleidiAI]Catlaw01/sgemm epilogue neon opt by Laan33 · Pull Request #27609 · microsoft/onnxruntime

Laan33 · 2026-03-10T12:14:28Z

Description

This change updates the KleidiAI SGEMM post-processing path in onnxruntime/core/mlas/lib/kleidiai/sgemm_kleidiai.cpp with two parts:

- Correctness fix: in the alpha == 0 || K == 0 fast path, beta handling is now applied for every batch entry (not just batch 0), so batched SGEMM behaviour is correct.
- NEON SGEMM epilogue optimisation: adds a vectorised alpha/beta post-processing path for contiguous outputs, with guarded fallback to scalar for non-contiguous or small cases. The 2D epilogue path also routes contiguous tiles through the contiguous 1D epilogue path to enable vectorisation.
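For readers who want the correctness fix in concrete terms, here is a minimal sketch of a beta-only fast path that visits every batch entry. `BatchEntry`, `ScaleCByBeta` and `BetaOnlyFastPath` are hypothetical names for illustration and are not taken from sgemm_kleidiai.cpp; writing zeros when beta == 0 follows the usual BLAS convention and is an assumption about the real code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-batch descriptor (illustration only, not the MLAS struct).
struct BatchEntry {
    float* C;         // row-major output matrix
    std::size_t ldc;  // distance in elements between consecutive rows of C
    float beta;       // scale applied to the existing contents of C
};

// Scale an M x N matrix in place by beta. For beta == 0 the output is
// overwritten with zeros instead of multiplied, so stale NaN/Inf values
// in C cannot leak into the result.
inline void ScaleCByBeta(float* C, std::size_t M, std::size_t N,
                         std::size_t ldc, float beta) {
    if (beta == 1.0f) {
        return;  // nothing to do
    }
    for (std::size_t m = 0; m < M; ++m) {
        float* row = C + m * ldc;
        for (std::size_t n = 0; n < N; ++n) {
            row[n] = (beta == 0.0f) ? 0.0f : row[n] * beta;
        }
    }
}

// When alpha == 0 or K == 0 the product term vanishes and C = beta * C.
// The loop covers every batch entry rather than only the first one.
inline void BetaOnlyFastPath(const std::vector<BatchEntry>& batch,
                             std::size_t M, std::size_t N) {
    for (const BatchEntry& entry : batch) {
        assert(entry.C != nullptr && entry.ldc >= N);
        ScaleCByBeta(entry.C, M, N, entry.ldc, entry.beta);
    }
}
```

Each entry carries its own beta here, which is why reading only the first entry would give wrong results for BatchSize > 1.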

Motivation and Context

This change addresses correctness and performance in the SGEMM post-processing stage:

- The batched alpha == 0 || K == 0 path previously used only Data[0], which could produce incorrect results for BatchSize > 1.
- The post-processing loop (C = alpha * (A*B) + beta * C) is a known latency contributor when memcpy fast paths are not applicable. The NEON epilogue changes are intended to reduce this cost on supported ARM platforms while preserving existing fallback behaviour.
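As an illustration of the post-processing loop described above, the following sketch applies C = alpha * (A*B) + beta * C over a contiguous buffer, four floats at a time with NEON where available and with a scalar loop otherwise. It is not the code from this PR: `EpilogueContiguous`, the `ab` temporary holding A*B, and the `SKETCH_HAS_NEON` macro are assumptions, and beta == 0 is deliberately not special-cased here.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__ARM_NEON) || defined(__aarch64__)
#include <arm_neon.h>
#define SKETCH_HAS_NEON 1  // hypothetical macro, local to this sketch
#endif

// Computes c[i] = alpha * ab[i] + beta * c[i] for n contiguous floats.
// `ab` is assumed to hold the already-computed A*B product.
inline void EpilogueContiguous(const float* ab, float* c, std::size_t n,
                               float alpha, float beta) {
    assert(ab != nullptr && c != nullptr);
    std::size_t i = 0;
#if defined(SKETCH_HAS_NEON)
    const float32x4_t valpha = vdupq_n_f32(alpha);
    const float32x4_t vbeta = vdupq_n_f32(beta);
    // Vector body: process four floats per iteration.
    for (; i + 4 <= n; i += 4) {
        float32x4_t acc = vmulq_f32(vld1q_f32(ab + i), valpha);  // alpha * ab
        acc = vmlaq_f32(acc, vld1q_f32(c + i), vbeta);           // += beta * c
        vst1q_f32(c + i, acc);
    }
#endif
    // Scalar tail, and the whole loop on targets without NEON.
    for (; i < n; ++i) {
        c[i] = alpha * ab[i] + beta * c[i];
    }
}
```

The scalar tail doubles as the fallback for small inputs, which mirrors the "guarded fallback" idea in the description without claiming to match its thresholds.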

Laan33 · 2026-03-10T12:27:46Z

@microsoft-github-policy-service agree company="Arm"

hariharans29 · 2026-03-10T17:05:27Z

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

azure-pipelines · 2026-03-10T17:05:49Z

Azure Pipelines successfully started running 4 pipeline(s).

Copilot

Pull request overview

Updates the KleidiAI-backed SGEMM post-processing (alpha/beta “epilogue”) in MLAS to fix batched correctness in an early-exit path and to improve ARM NEON performance for contiguous outputs.

Changes:

- Fix batched alpha == 0 || K == 0 fast path to apply beta reduction for every batch entry.
- Add a contiguous-only NEON-vectorized alpha/beta epilogue path with scalar fallback for small/non-contiguous cases.
- Route contiguous 2D tiles through the 1D contiguous path to reuse vectorization.
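The routing in the last point can be sketched as follows: when both the product tile and the output tile are stored without row padding, the M x N tile is one run of M*N floats and can be handed to the 1D routine in a single call. `Epilogue1D` and `Epilogue2D` are hypothetical names, the 1D routine is kept scalar for brevity, and the contiguity test (leading dimension equal to N) is an assumption about how the real check works.

```cpp
#include <cassert>
#include <cstddef>

// 1D epilogue over n contiguous floats: c = alpha * ab + beta * c.
// A vectorised body would live here; it is scalar to keep the sketch short.
inline void Epilogue1D(const float* ab, float* c, std::size_t n,
                       float alpha, float beta) {
    for (std::size_t i = 0; i < n; ++i) {
        c[i] = alpha * ab[i] + beta * c[i];
    }
}

// 2D epilogue over an M x N tile with row strides ldab and ldc (in elements).
inline void Epilogue2D(const float* ab, std::size_t ldab,
                       float* c, std::size_t ldc,
                       std::size_t M, std::size_t N,
                       float alpha, float beta) {
    assert(ldab >= N && ldc >= N);
    if (ldab == N && ldc == N) {
        // No padding between rows: treat the tile as one flat array.
        Epilogue1D(ab, c, M * N, alpha, beta);
        return;
    }
    // Strided tile: each row is contiguous on its own, so process row by row.
    for (std::size_t m = 0; m < M; ++m) {
        Epilogue1D(ab + m * ldab, c + m * ldc, N, alpha, beta);
    }
}
```

Collapsing the tile gives the 1D routine the longest possible run, so fewer elements fall into a short-row tail.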

onnxruntime/core/mlas/lib/kleidiai/sgemm_kleidiai.cpp

hariharans29 · 2026-03-11T22:11:34Z

Please include this - #27618 when it goes through eventually

hariharans29 · 2026-03-13T17:47:12Z

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

azure-pipelines · 2026-03-13T17:47:34Z

Azure Pipelines successfully started running 4 pipeline(s).

hariharans29 · 2026-03-16T17:49:52Z

Can you please fix the CI issues?

Signed-off-by: Cathal Lawlor <cathal.lawlor@arm.com>

Fix batched fast-path handling in KleidiAI SGemm by avoiding alpha checks on batch-0 only:
- handle K==0 as a per-batch beta-only path
- only take alpha==0 fast path when all batch entries have alpha==0

Add non-long SGemm regression coverage for BatchSize>1 with mixed alpha/beta combinations, including a batched K==0 case.

Update ApplyAlphaBeta2D comments to match current contiguous-tile control flow.

Signed-off-by: Cathal Lawlor <cathal.lawlor@arm.com>

Signed-off-by: Cathal Lawlor <cathal.lawlor@arm.com>
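The fast-path rule in the commit message above (K==0 is always beta-only; alpha==0 qualifies only when it holds for every batch entry) can be sketched like this. `BatchParams`, `Path` and `SelectPath` are hypothetical names for illustration and do not appear in the PR.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical per-batch scalars (illustration only).
struct BatchParams {
    float alpha;
    float beta;
};

enum class Path { BetaOnly, FullGemm };

// Decide whether the whole batch may skip the matrix product.
inline Path SelectPath(const std::vector<BatchParams>& batch, std::size_t K) {
    if (K == 0) {
        // An empty inner dimension makes A*B zero for every entry,
        // so only the per-batch beta scaling remains.
        return Path::BetaOnly;
    }
    // Checking batch[0].alpha alone would skip the product for entries
    // whose alpha is non-zero; require alpha == 0 across the whole batch.
    const bool all_alpha_zero = std::all_of(
        batch.begin(), batch.end(),
        [](const BatchParams& p) { return p.alpha == 0.0f; });
    return all_alpha_zero ? Path::BetaOnly : Path::FullGemm;
}
```

A batch with mixed alpha values therefore takes the full GEMM path, which is the case the added regression tests are described as covering.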

Laan33 · 2026-03-18T17:10:54Z

Pushed new commits to fix issues caused by merging main into the branch.

hariharans29 · 2026-03-18T18:17:20Z

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

azure-pipelines · 2026-03-18T18:17:41Z

Azure Pipelines successfully started running 4 pipeline(s).

Laan33 changed the title ~~Catlaw01/sgemm epilogue neon opt~~ [MLAS][KleidiAI] Catlaw01/sgemm epilogue neon opt Mar 10, 2026

Laan33 changed the title ~~[MLAS][KleidiAI] Catlaw01/sgemm epilogue neon opt~~ [MLAS][KleidiAI]Catlaw01/sgemm epilogue neon opt Mar 10, 2026

hariharans29 requested a review from Copilot March 10, 2026 17:04

Copilot started reviewing on behalf of hariharans29 March 10, 2026 17:05

Copilot AI reviewed Mar 10, 2026

onnxruntime/core/mlas/lib/kleidiai/sgemm_kleidiai.cpp (resolved)

onnxruntime/core/mlas/lib/kleidiai/sgemm_kleidiai.cpp (outdated, resolved)

hariharans29 reviewed Mar 10, 2026

onnxruntime/core/mlas/lib/kleidiai/sgemm_kleidiai.cpp (outdated, resolved)

Laan33 force-pushed the catlaw01/sgemm-epilogue-neon-opt branch from becb89f to 7b594fc March 11, 2026 11:48

hariharans29 reviewed Mar 11, 2026

onnxruntime/core/mlas/lib/kleidiai/sgemm_kleidiai.cpp (outdated, resolved)

Laan33 added 4 commits March 18, 2026 17:08

fix: Add ApplyBetaToC helper function and handle multiple batch sizes.

c2baed2

Signed-off-by: Cathal Lawlor <cathal.lawlor@arm.com>

feat: Optimize ApplyAlphaBetaStrided for ARM NEON with vectorized path

086ef85

Signed-off-by: Cathal Lawlor <cathal.lawlor@arm.com>

Remove conditional compilation for ARM_NEON in sgemm_kleidiai.cpp

f1d1717

Signed-off-by: Cathal Lawlor <cathal.lawlor@arm.com>

Laan33 force-pushed the catlaw01/sgemm-epilogue-neon-opt branch from 2b591f7 to f1d1717 March 18, 2026 17:09

hariharans29 approved these changes Mar 18, 2026

hariharans29 enabled auto-merge (squash) March 18, 2026 19:16

hariharans29 merged commit 3bb9e95 into microsoft:main Mar 19, 2026
89 checks passed
