[MLAS] KleidiAI fix igemm regression by martin-klacer-arm · Pull Request #28571 · microsoft/onnxruntime

martin-klacer-arm · 2026-05-19T15:32:18Z

Description

This PR fixes a convolution performance regression affecting some OCR models with large-kernel convolutions when the KleidiAI SME IGEMM convolution path is selected.

The change has 2 parts:

updates to the KleidiAI IGEMM LHS packing to pack rows in bounded chunks instead of packing the full LHS buffer up front, which reduces memory usage and improves cache locality for large convolutions,
a new route selection function ArmKleidiAI::SelectConvRoute that decides between Igemm, GemmFallback and None based on convolution parameters and a workload size-based heuristic.

The function CheckCapabilitiesSme runs SelectConvRoute and only returns true if the selected route is Igemm. The patch also adds a standard GEMM fallback to the ConvRoute possibilities, and runs MlasGemm if said fallback is selected. If the function selects None, then the convolution falls back to MlasSgemmOperation.

Motivation and Context

Fixes #27633.

previously, the convolution kernel for KLEIDIAI would allocate a large contiguous buffer for the LHS (left-hand-side) matrix packing, which could consume excessive memory and reduce cache efficiency. This patch modifies the packing strategy to use a chunked approach: - Introduce a compile-time upper bound for temporary LHS packing buffers - Allocate a moderate-sized temporary buffer once. - Pack LHS rows in chunks, perform computation, then reuse the buffer for the next chunk. Benefits: - Significantly reduces peak memory usage. - Improves cache utilization and overall computation efficiency. - Avoids potential memory allocation failures for large convolutions. Performance improvement: - Test with model https://huggingface.co/garavv/arcface-onnx on MTK D9500 Before this patch ``` ./build/RelWithDebInfo/onnxruntime_perf_test -x 1 -r 1000 arc.onnx Number of inferences per second: 4.25327 ``` After this patch ``` ./build/RelWithDebInfo/onnxruntime_perf_test -x 1 -r 1000 arc.onnx Number of inferences per second: 5.03257 ``` ---------------------------------------------------------------------------------------------------------------- sme_with_patch | sme_without_patch | Diff (μs) | Change % | Node Name ---------------------------------------------------------------------------------------------------------------- 11975.10 | 31235.30 | 19260.20 | 160.84% | StatefulPartitionedCall/ResNet34/conv2_block1_1_conv/Conv2D 4514.80 | 7691.70 | 3176.90 | 70.37% | StatefulPartitionedCall/ResNet34/conv2_block2_1_conv/Conv2D 4220.20 | 7120.70 | 2900.50 | 68.73% | StatefulPartitionedCall/ResNet34/conv2_block3_1_conv/Conv2D 5429.20 | 8279.60 | 2850.40 | 52.50% | StatefulPartitionedCall/ResNet34/conv3_block1_1_conv/Conv2D 4497.80 | 5478.40 | 980.60 | 21.80% | StatefulPartitionedCall/ResNet34/conv4_block1_1_conv/Conv2D 3474.30 | 4351.80 | 877.50 | 25.26% | StatefulPartitionedCall/ResNet34/conv3_block3_1_conv/Conv2D 3627.30 | 4504.00 | 876.70 | 24.17% | StatefulPartitionedCall/ResNet34/conv3_block4_1_conv/Conv2D 5244.20 | 5961.10 | 716.90 | 13.67% | StatefulPartitionedCall/ResNet34/conv1_conv/Conv2D 3439.80 | 4050.90 | 611.10 | 17.77% | StatefulPartitionedCall/ResNet34/conv3_block2_1_conv/Conv2D 9749.80 | 10195.50 | 445.70 | 4.57% | StatefulPartitionedCall/ResNet34/conv2_block2_2_conv/Conv2D 3814.00 | 4209.80 | 395.80 | 10.38% | StatefulPartitionedCall/ResNet34/conv5_block2_2_conv/Conv2D 2715.90 | 3034.70 | 318.80 | 11.74% | StatefulPartitionedCall/ResNet34/conv4_block6_1_conv/Conv2D 4089.10 | 4367.80 | 278.70 | 6.82% | StatefulPartitionedCall/ResNet34/conv5_block1_1_conv/Conv2D 2698.00 | 2959.50 | 261.50 | 9.69% | StatefulPartitionedCall/ResNet34/conv4_block5_1_conv/Conv2D 3869.20 | 4102.80 | 233.60 | 6.04% | StatefulPartitionedCall/ResNet34/conv5_block3_2_conv/Conv2D 2767.90 | 2966.80 | 198.90 | 7.19% | StatefulPartitionedCall/ResNet34/conv4_block4_1_conv/Conv2D 9652.10 | 9816.60 | 164.50 | 1.70% | StatefulPartitionedCall/ResNet34/conv2_block3_2_conv/Conv2D 2897.50 | 3054.60 | 157.10 | 5.42% | StatefulPartitionedCall/ResNet34/conv4_block3_1_conv/Conv2D 4601.20 | 4748.60 | 147.40 | 3.20% | StatefulPartitionedCall/ResNet34/conv5_block1_2_conv/Conv2D 3134.00 | 3246.10 | 112.10 | 3.58% | StatefulPartitionedCall/ResNet34/conv4_block2_1_conv/Conv2D Signed-off-by: Qxiang Xu <Qixiang.Xu@arm.com> Signed-off-by: Jonathan Clohessy <Jonathan.Clohessy@arm.com>

- Added SelectConvRoute function to mlasi_kleidiai.h to decide between GemmFallback and Igemm based on the convolution workload parameters - Updated CheckCapabilitiesSme function in convolve_kleidiai.cpp to use the new SelectConvRoute function Co-authored-by: Damien Dooley <damien.dooley@arm.com> Signed-off-by: Martin Klacer <martin.klacer@arm.com>

Copilot

Pull request overview

This PR addresses a CPU convolution performance regression on ARM64 when the KleidiAI SME IGEMM path is selected, by introducing a new convolution route-selection heuristic and changing the KleidiAI LHS packing strategy to reduce peak temporary memory and improve cache locality on large workloads.

Changes:

Added ArmKleidiAI::SelectConvRoute (with ConvRoute options Igemm, GemmFallback, None) to decide whether to run KleidiAI IGEMM, fall back to GEMM, or use the existing path.
Updated KleidiAI convolution to pack the IGEMM LHS in bounded chunks instead of packing the full LHS up front.
In the generic MLAS convolution implementation, routed certain workloads to MlasGemm (instead of MlasSgemmOperation) when SelectConvRoute chooses GemmFallback.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File	Description
onnxruntime/core/mlas/lib/kleidiai/mlasi_kleidiai.h	Adds `ConvRoute` + `SelectConvRoute` heuristic helpers for choosing conv execution route.
onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp	Reworks KleidiAI SME IGEMM LHS packing into chunked packing and integrates route selection into capability checks.
onnxruntime/core/mlas/lib/convolve.cpp	Uses `SelectConvRoute` to optionally switch im2col+SGEMM slices to `MlasGemm` for the fallback route.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

martin-klacer-arm · 2026-05-20T09:29:13Z

@microsoft-github-policy-service agree company="Arm"

hariharans29 · 2026-05-20T20:53:29Z

Can you please address the Copliot comments ?

martin-klacer-arm · 2026-05-21T16:07:24Z

Hi, thanks for the reminder. The comments Copilot left are sensible, I'm addressing them with a new patch that's currently work in progress. Once it's done and clears internal review, I'll add it to the PR.

hariharans29 · 2026-05-21T18:48:49Z

Hi, thanks for the reminder. The comments Copilot left are sensible, I'm addressing them with a new patch that's currently work in progress. Once it's done and clears internal review, I'll add it to the PR.

Thank you!

Signed-off-by: Martin Klacer <martin.klacer@arm.com>

martin-klacer-arm · 2026-05-27T09:33:46Z

Hi, I pushed a new patch addressing the Copilot comments, and left a reply with further details on one of the comments.

hariharans29 · 2026-05-27T15:59:08Z

Review — PR #28571 `[MLAS] KleidiAI fix igemm regression`

Verdict: good direction, accept after a few fixes. The two changes (chunked LHS packing + workload-based route selection) are the right shape for fixing #27633. There are some real concerns around the heuristic, the new per-tile allocation, and use of ORT_ENFORCE inside a parallel worker — but nothing structural.

Substantive concerns

1. ORT_ENFORCE inside MlasTrySimpleParallel. The inner loop in ConvolveSme now does:

ORT_ENFORCE(bytes_per_m_step != 0,
            "KleidiAI LHS packed size must be non-zero.");

inside the worker lambda. MlasTrySimpleParallel does not catch exceptions thrown from workers — in exception-disabled builds ORT_ENFORCE calls abort(), and in exception-enabled builds the throw will likely terminate the process from a worker thread. Also: bytes_per_m_step depends only on m_step, d_kh*d_kw, ci — none of which change between iterations. Compute it (and check it) once before entering the parallel region. CheckCapabilitiesSme already validates this same quantity for the Igemm route, so by the time we reach ConvolveSme it cannot be zero; the enforce here is dead weight. Hoist the call out of the lambda and either drop the check or convert to an assertion.

2. Per-tile std::make_unique<std::byte[]> allocation. Each parallel worker now does:

auto lhs = std::make_unique<std::byte[]>(lhs_buffer_bytes);

inside the MlasTrySimpleParallel lambda, so this is one heap allocation per (m_tile, n_tile). For large co/n_tile_step, this is a lot of churn that the old code avoided by allocating the full LHS once. A thread_local grow-only std::vector<std::byte> (mirroring the pad_ptr pattern just above) would eliminate it. Yes, that re-introduces the cross-call state pattern this codebase has been bitten by (see #28654), so document the invariant clearly — but the cost of per-tile allocation in a hot conv loop is not free.

3. MAX_LHS_CHUNK_BYTES = 2 MiB is a magic constant. It's commented as "cache-friendly," but there is no cache-size discovery and no per-CPU tuning — Apple M-series, Snapdragon X, AWS Graviton, and Ampere Altra all have very different L2 sizes (and SME implementations). At minimum:

Add a comment explaining why 2 MiB specifically (target L2? something else?).
Consider deriving from MlasGetPreferredBufferAlignment or a cpuinfo-derived size.
Add a runtime override (env var / session option) for tuning experiments before this is baked in.

Also: the chunk computation

size_t m_chunk = std::max<size_t>(m_step,
                                  MAX_LHS_CHUNK_BYTES / bytes_per_m_step * m_step);

silently truncates MAX_LHS_CHUNK_BYTES / bytes_per_m_step to zero when bytes_per_m_step > MAX_LHS_CHUNK_BYTES, then the std::max rescues it back to m_step. That works, but it means the "2 MiB cap" silently does nothing for any workload whose per-m_step packed size exceeds 2 MiB. Make this explicit with a comment, or restructure so the intent is obvious.

4. ConvIgemmMaxWork = 1'000'000 threshold. This is the heuristic boundary between Igemm and GemmFallback. There is no data in the PR description showing how this was chosen (which models, which CPUs, which kernel sizes). The linked issue #27633 specifically mentions Apple M-series and 9×9 OCR kernels; what does the curve look like at, say, output_m × effective_k × filter_count near the threshold? At minimum, please add a comment citing the measurement methodology, and ideally make this overridable.

The formula

const auto igemm_max_output_m = (ConvIgemmMaxWork / effective_k / filter_count);
if (output_m > igemm_max_output_m) return ConvRoute::GemmFallback;

is equivalent to output_m * effective_k * filter_count > 1'000'000 modulo integer-division rounding. The integer-division form silently lowers the effective threshold when effective_k * filter_count doesn't divide cleanly into 1M. Use SafeInt/mul_overflow_size_t_builtin and a single comparison:

size_t total_work;
if (mul_overflow_size_t_builtin(output_m, effective_k, &total_work) ||
    mul_overflow_size_t_builtin(total_work, filter_count, &total_work)) {
    return ConvRoute::GemmFallback;  // assume too large
}
if (total_work > ConvIgemmMaxWork) return ConvRoute::GemmFallback;

5. New cross-file coupling. convolve.cpp now #includes kleidiai/mlasi_kleidiai.h and calls ArmKleidiAI::SelectConvRoute directly:

#if defined(USE_KLEIDIAI)
#include "kleidiai/mlasi_kleidiai.h"
#endif
...
#if defined(USE_KLEIDIAI)
use_gemm_batch_override =
    ArmKleidiAI::SelectConvRoute(Parameters) == ArmKleidiAI::ConvRoute::GemmFallback;
#endif

This means the generic MLAS convolve path now has compile-time and call-site knowledge of one specific backend. The existing pattern in MLAS for this is the dispatch override / MlasPlatform selector. Wrap this behind a MLAS_CONV_ROUTE_OVERRIDE function pointer set up in platform.cpp, the same way every other backend hook works. That keeps convolve.cpp backend-agnostic and avoids more #if defined(USE_KLEIDIAI) scaffolding creeping in.

Also: SelectConvRoute is now called twice per Compute() — once in CheckCapabilitiesSme and once at the top of the convolve op (via the new flag). It does non-trivial overflow-checked arithmetic. Cache the result on MLAS_CONV_PARAMETERS (there is already a BackendKernelSelectorConfig pointer there) so it isn't recomputed.

6. MlasGemm vs MlasSgemmOperation substitution semantics. The change does:

if (use_gemm_batch_override) {
    MlasGemm(CblasNoTrans, CblasNoTrans, FilterCount, CountN, CountK, 1.0f,
             Filter + k, K, ColumnBuffer, CountN, beta,
             SegmentOutput, OutputSize, nullptr, Parameters->BackendKernelSelectorConfig);
} else {
    MlasSgemmOperation(...);
}

MlasGemm accepts a threadpool (passed as nullptr here) and may parallelize internally, while MlasSgemmOperation is a single-segment operation that the caller (MlasConvOperation) is already running inside its own segment loop. Confirm that calling MlasGemm(..., nullptr, ...) from inside the per-thread segment loop doesn't accidentally re-enter the threadpool or serialize what was previously parallelized. A quick perf check on the regression case from #27633 would also confirm the fallback actually wins where the heuristic claims it does.

7. mul_overflow_size_t_builtin forward declaration in a public-ish header. mlasi_kleidiai.h now has:

inline bool mul_overflow_size_t_builtin(size_t a, size_t b, size_t* out);

with the comment "implementation is at the bottom of the file." Forward-declaring an inline function whose definition lives in the same TU is fine, but per AGENTS.md there is already MlasMultiplyOverflowsSizeT in mlasi.h from PR #28416. Use that and delete the local helper (PR #28487 already introduces a third copy as MlasTryMultiplySizeT). Three names for the same operation is too many.

8. CheckCapabilitiesSme redundant work. The new code path:

const auto route = ArmKleidiAI::SelectConvRoute(Parameters);
if (route == ArmKleidiAI::ConvRoute::Igemm) {
    const size_t d_kh = ComputeKernelSize(...);
    const size_t d_kw = ComputeKernelSize(...);
    const size_t m_step = imatmul_conv.ukernel.get_m_step();
    const size_t bytes_per_m_step = kai_get_lhs_packed_size_...(m_step, d_kh*d_kw, ...);
    if (bytes_per_m_step == 0) { ... return false; }
    return true;
}

SelectConvRoute already computes effective_kernel_h/w. Returning the dilated kernel sizes from SelectConvRoute (or expanding to a struct return) would avoid recomputation. Minor.

Nits

mlasi_kleidiai.h: the #include <iostream> move into the KLEIDIAI_DEBUG_LOGGING || KLEIDIAI_KERNEL_LOGGING block is good. Drive-by cleanup, +1.
TryComputeConvOutputSize returns true with *result = 0 for the padded_input < kernel case and for stride == 0. The caller (SelectConvRoute) then checks output_m == 0 and returns ConvRoute::None. That works but is non-obvious — the function returns true on what is semantically failure. Consider returning false for these (or rename the function to reflect "compute output size, with zero meaning 'no valid output'").
SelectConvRoute rejects filter_count == 1 outright. Comment why (presumably because the SME imatmul has min-N requirements). Future readers will wonder.
The comment //RhsPackWeightsBiasSme() - to perform dilation increases kernel size and masks unused weights was deleted — fine, since it was just a note, but the new code still does the same dilation expansion; consider keeping the note near ComputeKernelSize calls.
MIdx/NIdx → m_idx/n_idx rename is good; please apply the same to other CamelCase locals (ATile, BTile, CTile, TileSize*) for consistency in the touched block.
ConvRoute::None value ordering: None, Igemm, GemmFallback. Consider documenting the contract — most readers will expect None to mean "this PR's logic doesn't apply, defer to caller's default" rather than "fail." That's how convolve.cpp interprets it, but the enum name doesn't make it obvious.
inline constexpr size_t ConvIgemmMaxWork = 1000000ULL; — please use 1'000'000 for readability.

Things to verify before merging

Re-run the regression model from [Performance] CPU Conv kernel regression for large kernels (9x9) on macOS ARM64 in ORT 1.24.x #27633 on both Apple M-series and Snapdragon X with the chunked packing; confirm the regression is gone and there is no slowdown on the cases that previously hit the IGEMM path cleanly.
Confirm MlasGemm(..., nullptr, ...) from inside MlasConvOperation's segment loop doesn't cause re-entrant threadpool dispatch.
Run existing MLAS conv unit tests under TSAN if available — the thread_local caches plus chunked packing change cross-thread access patterns.
Confirm the bytes_per_m_step == 0 early-out in CheckCapabilitiesSme makes the ORT_ENFORCE in ConvolveSme provably unreachable, then drop the enforce.

Bottom line

Approve after:

Hoist or drop the ORT_ENFORCE from the worker lambda.
Replace per-tile make_unique with TLS scratch (or document why per-tile is acceptable).
Route SelectConvRoute through the existing dispatch-override pattern instead of #if defined(USE_KLEIDIAI) in convolve.cpp.
Document/justify the 2 MiB and 1M thresholds (or make them overridable).
Converge on one mul_overflow_size_t helper.

The chunked-packing idea and the route-selection split are both correct fixes for the reported regression. With the above tightening this should be safe to land.

martin-klacer-arm · 2026-05-28T11:30:27Z

Thank you for the thorough review! The comments are sensible and understandable. I will update the PR as soon as we have a patch to address them finished and internally reviewed.

JonathanC-ARM and others added 2 commits May 18, 2026 16:49

hariharans29 requested a review from Copilot May 19, 2026 16:07

Copilot started reviewing on behalf of hariharans29 May 19, 2026 16:08 View session

Copilot AI reviewed May 19, 2026

View reviewed changes

Comment thread onnxruntime/core/mlas/lib/kleidiai/mlasi_kleidiai.h Outdated

Comment thread onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp

Comment thread onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp

Comment thread onnxruntime/core/mlas/lib/convolve.cpp

hariharans29 changed the title ~~KleidiAI fix igemm regression~~ [MLAS] KleidiAI fix igemm regression May 19, 2026

Addressed upstream Copilot comments

fcc0e95

Signed-off-by: Martin Klacer <martin.klacer@arm.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[MLAS] KleidiAI fix igemm regression#28571

[MLAS] KleidiAI fix igemm regression#28571
martin-klacer-arm wants to merge 3 commits into
microsoft:mainfrom
martin-klacer-arm:markla01_fix_igemm_regression

martin-klacer-arm commented May 19, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

martin-klacer-arm commented May 20, 2026

Uh oh!

hariharans29 commented May 20, 2026

Uh oh!

martin-klacer-arm commented May 21, 2026

Uh oh!

hariharans29 commented May 21, 2026

Uh oh!

martin-klacer-arm commented May 27, 2026

Uh oh!

hariharans29 commented May 27, 2026

Uh oh!

martin-klacer-arm commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

martin-klacer-arm commented May 19, 2026

Description

Motivation and Context

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

martin-klacer-arm commented May 20, 2026

Uh oh!

hariharans29 commented May 20, 2026

Uh oh!

martin-klacer-arm commented May 21, 2026

Uh oh!

hariharans29 commented May 21, 2026

Uh oh!

martin-klacer-arm commented May 27, 2026

Uh oh!

hariharans29 commented May 27, 2026

Review — PR #28571 [MLAS] KleidiAI fix igemm regression

Substantive concerns

Nits

Things to verify before merging

Bottom line

Uh oh!

martin-klacer-arm commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Review — PR #28571 `[MLAS] KleidiAI fix igemm regression`