[MLAS] Removed memcpy step by storing result in C if possible #27367

Merged
hariharans29 merged 1 commit into microsoft:main from JonathanC-ARM:origin/jonclo01_sgemm_memcpy_removal
Feb 17, 2026

@JonathanC-ARM
Contributor

Summary

This change removes the memcpy step in sgemm_kleidiai where possible by writing the result directly to the output matrix C.

Testing

| Model | Baseline avg (ms) | Current avg (ms) | Δ ms | Δ % |
| --- | ---: | ---: | ---: | ---: |
| Transformer_complex_f32.onnx | 2.929885 | 2.701083 | -0.228802 | -7.81% |
| bert_tiny_f32.onnx | 0.279675 | 0.273928 | -0.005747 | -2.05% |
| de_efficientnetlitev3_f32.onnx | 80.038132 | 78.560747 | -1.477385 | -1.85% |
| deeplabv3_mobilenetv2_f32.onnx | 48.565125 | 46.446841 | -2.118284 | -4.36% |
| imagetransformnet_f32.onnx | 303.835868 | 302.553625 | -1.282243 | -0.42% |
| mobilenet_v1_f32.onnx | 4.379468 | 4.163018 | -0.216450 | -4.94% |
| mobilenetv1_ssd_f32.onnx | 9.245055 | 8.881198 | -0.363857 | -3.94% |
| openposev2_vgg19_f32.onnx | 210.981128 | 209.199398 | -1.781730 | -0.84% |
| retinaface_f32.onnx | 42.326391 | 38.454346 | -3.872045 | -9.15% |
| rfdn_f32.onnx | 13.929565 | 13.679875 | -0.249690 | -1.79% |
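
For readers checking the figures, the Δ columns are consistent with Δ ms = current − baseline and Δ % = 100 × Δ ms / baseline, so negative values mean the current build is faster. A minimal sketch of that arithmetic follows; the `Delta` struct and `latency_delta` function are hypothetical names used only for illustration and are not part of the benchmark tooling.

```cpp
#include <cassert>

// Hypothetical helper: difference between two average latencies.
struct Delta {
    double ms;       // current - baseline, in milliseconds
    double percent;  // the same difference relative to the baseline
};

Delta latency_delta(double baseline_ms, double current_ms) {
    assert(baseline_ms > 0.0);  // a relative change needs a positive baseline
    const double diff = current_ms - baseline_ms;
    return Delta{diff, 100.0 * diff / baseline_ms};
}
```

Calling `latency_delta(2.929885, 2.701083)` reproduces the Transformer_complex_f32.onnx row to the precision shown in the table.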

Signed-off-by: Jonathan Clohessy <Jonathan.Clohessy@arm.com>
@hariharans29 changed the title from "Removed memcpy step by storing result in C if possible" to "[MLAS] Removed memcpy step by storing result in C if possible" on Feb 17, 2026
@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

Contributor

Copilot AI left a comment

Pull request overview

This PR optimizes MLAS KleidiAI's SGEMM implementation by eliminating an unnecessary memcpy step. When alpha=1.0 and beta=0.0, the result is now written directly to the output matrix C instead of first writing to a temporary buffer and then copying.

Changes:

  • Simplified the direct-write condition from checking multiple constraints (ldc==TileSizeN, boundary checks, zero size checks) to only checking alpha==1.0 and beta==0.0
  • Leverages KleidiAI's run_matmul capability to write directly to non-contiguous output via the row stride parameter
  • Eliminates temporary buffer allocation and memcpy when alpha/beta scaling is not needed
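
The direct-write condition described in these points can be sketched roughly as below. This is an illustration under stated assumptions, not the MLAS or KleidiAI code: `compute_tile` is a hypothetical stand-in for a kernel that accepts a destination row stride, `sgemm_tile` is a hypothetical wrapper, and matrices are assumed to be row-major with strides given in elements.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a tile kernel that takes a destination row stride.
// Computes dst = A * B for an m x n tile; dst rows are dst_stride elements apart.
void compute_tile(const float* a, const float* b, float* dst,
                  std::size_t m, std::size_t n, std::size_t k,
                  std::size_t lda, std::size_t ldb, std::size_t dst_stride) {
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {
                acc += a[i * lda + p] * b[p * ldb + j];
            }
            dst[i * dst_stride + j] = acc;
        }
    }
}

// Hypothetical wrapper: C = alpha * A * B + beta * C for one tile.
// Returns true when the result was written straight into C.
bool sgemm_tile(const float* a, const float* b, float* c,
                std::size_t m, std::size_t n, std::size_t k,
                std::size_t lda, std::size_t ldb, std::size_t ldc,
                float alpha, float beta) {
    assert(ldc >= n);  // each row of C must be able to hold n results

    if (alpha == 1.0f && beta == 0.0f) {
        // No scaling needed: pass ldc as the row stride and skip the
        // temporary buffer and the copy entirely.
        compute_tile(a, b, c, m, n, k, lda, ldb, ldc);
        return true;
    }

    // Scaling needed: compute into a dense temporary, then combine with C.
    std::vector<float> tmp(m * n);
    compute_tile(a, b, tmp.data(), m, n, k, lda, ldb, n);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            // When beta is zero, C is not read, so stale values cannot leak in.
            const float prior = (beta == 0.0f) ? 0.0f : beta * c[i * ldc + j];
            c[i * ldc + j] = alpha * tmp[i * n + j] + prior;
        }
    }
    return false;
}
```

In this sketch the stride argument is what makes the direct path safe when `ldc` is larger than the tile width: elements of C outside the tile's columns are never touched.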

@hariharans29 enabled auto-merge (squash) on February 17, 2026 20:48
@hariharans29 merged commit 36cbdb4 into microsoft:main on Feb 17, 2026
94 checks passed