[MLAS] Removed memcpy step by storing result in C if possible by JonathanC-ARM · Pull Request #27367 · microsoft/onnxruntime

JonathanC-ARM · 2026-02-17T17:05:29Z

Summary

This change removes the memcpy step in sgemm_kleidiai where possible by writing the result directly to C.

Testing

| Model | Baseline avg (ms) | Current avg (ms) | Δ ms | Δ % |
| --- | --- | --- | --- | --- |
| Transformer_complex_f32.onnx | 2.929885 | 2.701083 | -0.228802 | -7.81% |
| bert_tiny_f32.onnx | 0.279675 | 0.273928 | -0.005747 | -2.05% |
| de_efficientnetlitev3_f32.onnx | 80.038132 | 78.560747 | -1.477385 | -1.85% |
| deeplabv3_mobilenetv2_f32.onnx | 48.565125 | 46.446841 | -2.118284 | -4.36% |
| imagetransformnet_f32.onnx | 303.835868 | 302.553625 | -1.282243 | -0.42% |
| mobilenet_v1_f32.onnx | 4.379468 | 4.163018 | -0.216450 | -4.94% |
| mobilenetv1_ssd_f32.onnx | 9.245055 | 8.881198 | -0.363857 | -3.94% |
| openposev2_vgg19_f32.onnx | 210.981128 | 209.199398 | -1.781730 | -0.84% |
| retinaface_f32.onnx | 42.326391 | 38.454346 | -3.872045 | -9.15% |
| rfdn_f32.onnx | 13.929565 | 13.679875 | -0.249690 | -1.79% |
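
For readers checking the numbers, the two delta columns appear to follow the usual convention: Δ ms is current minus baseline, and Δ % is that difference relative to the baseline. The PR does not state the formulas, so this is an inference from the reported values. The `Delta` struct and `latency_delta` helper below are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper, not part of the PR: reproduces the table's delta columns
// under the assumption that they are computed relative to the baseline.
struct Delta {
    double ms;       // current - baseline, in milliseconds (negative = faster)
    double percent;  // ms / baseline * 100
};

inline Delta latency_delta(double baseline_ms, double current_ms) {
    const double diff = current_ms - baseline_ms;
    return Delta{diff, diff / baseline_ms * 100.0};
}
```

Calling `latency_delta(2.929885, 2.701083)` with the Transformer_complex_f32.onnx row gives a difference of -0.228802 ms, which rounds to the -7.81% shown in the table.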

Signed-off-by: Jonathan Clohessy <Jonathan.Clohessy@arm.com>

hariharans29 · 2026-02-17T20:24:04Z

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

azure-pipelines · 2026-02-17T20:24:24Z

Azure Pipelines successfully started running 4 pipeline(s).

Copilot

Pull request overview

This PR optimizes MLAS KleidiAI's SGEMM implementation by eliminating an unnecessary memcpy step. When alpha=1.0 and beta=0.0, the result is now written directly to the output matrix C instead of first writing to a temporary buffer and then copying.

Changes:

- Simplified the direct-write condition from checking multiple constraints (ldc==TileSizeN, boundary checks, zero size checks) to only checking alpha==1.0 and beta==0.0
- Leverages KleidiAI's run_matmul capability to write directly to non-contiguous output via the row stride parameter
- Eliminates temporary buffer allocation and memcpy when alpha/beta scaling is not needed
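
The idea behind these changes can be sketched in plain C++. This is an illustrative sketch only, not the code from the PR: `tile_matmul` is a hypothetical stand-in for a kernel that accepts an output row stride (the role the review attributes to KleidiAI's run_matmul), and `sgemm_tile` is a hypothetical wrapper. When alpha is 1 and beta is 0, the kernel output is already the final result, so it can be written into C using ldc as the row stride. Otherwise a dense temporary tile is still needed so the result can be scaled and blended with the existing contents of C.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a tile kernel: computes an m x n tile of A*B and
// stores it at dst, with consecutive output rows dst_stride floats apart.
static void tile_matmul(const float* A, std::size_t lda,
                        const float* B, std::size_t ldb,
                        std::size_t m, std::size_t n, std::size_t k,
                        float* dst, std::size_t dst_stride) {
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {
                acc += A[i * lda + p] * B[p * ldb + j];
            }
            dst[i * dst_stride + j] = acc;
        }
    }
}

// Hypothetical wrapper: C = alpha * (A*B) + beta * C for one m x n tile.
static void sgemm_tile(const float* A, std::size_t lda,
                       const float* B, std::size_t ldb,
                       std::size_t m, std::size_t n, std::size_t k,
                       float alpha, float beta, float* C, std::size_t ldc) {
    if (alpha == 1.0f && beta == 0.0f) {
        // Direct-write path: no scaling is needed, so the kernel stores into C
        // with ldc as the row stride. No temporary buffer, no copy.
        tile_matmul(A, lda, B, ldb, m, n, k, C, ldc);
        return;
    }
    // General path: compute into a dense temporary tile, then scale and blend.
    std::vector<float> tmp(m * n);
    tile_matmul(A, lda, B, ldb, m, n, k, tmp.data(), n);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            // With beta == 0 the old contents of C are ignored rather than
            // multiplied by zero, so uninitialized values cannot leak through.
            const float prev = (beta == 0.0f) ? 0.0f : beta * C[i * ldc + j];
            C[i * ldc + j] = alpha * tmp[i * n + j] + prev;
        }
    }
}
```

In this sketch the row stride is what makes the direct path independent of whether ldc equals the tile width: elements of C that lie between the end of one tile row and the start of the next are simply never touched.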

Removed memcpy step by storing result in C if possible

d06244f

Signed-off-by: Jonathan Clohessy <Jonathan.Clohessy@arm.com>

hariharans29 changed the title ~~Removed memcpy step by storing result in C if possible~~ [MLAS] Removed memcpy step by storing result in C if possible Feb 17, 2026

hariharans29 requested a review from Copilot February 17, 2026 20:25

Copilot started reviewing on behalf of hariharans29 February 17, 2026 20:25

Copilot AI reviewed Feb 17, 2026

hariharans29 enabled auto-merge (squash) February 17, 2026 20:48

hariharans29 approved these changes Feb 17, 2026

hariharans29 merged commit 36cbdb4 into microsoft:main Feb 17, 2026
94 checks passed

hariharans29 merged 1 commit into microsoft:main from JonathanC-ARM:origin/jonclo01_sgemm_memcpy_removal
