Skip to content

[MLAS] KleidiAI fix igemm regression#28571

Open
martin-klacer-arm wants to merge 3 commits into
microsoft:mainfrom
martin-klacer-arm:markla01_fix_igemm_regression
Open

[MLAS] KleidiAI fix igemm regression#28571
martin-klacer-arm wants to merge 3 commits into
microsoft:mainfrom
martin-klacer-arm:markla01_fix_igemm_regression

Conversation

@martin-klacer-arm
Copy link
Copy Markdown

Description

This PR fixes a convolution performance regression affecting some OCR models with large-kernel convolutions when the KleidiAI SME IGEMM convolution path is selected.

The change has 2 parts:

  1. updates to the KleidiAI IGEMM LHS packing to pack rows in bounded chunks instead of packing the full LHS buffer up front, which reduces memory usage and improves cache locality for large convolutions,
  2. a new route selection function ArmKleidiAI::SelectConvRoute that decides between Igemm, GemmFallback and None based on convolution parameters and a workload size-based heuristic.

The function CheckCapabilitiesSme runs SelectConvRoute and only returns true if the selected route is Igemm. The patch also adds a standard GEMM fallback to the ConvRoute possibilities, and runs MlasGemm if said fallback is selected. If the function selects None, then the convolution falls back to MlasSgemmOperation.

Motivation and Context

Fixes #27633.

JonathanC-ARM and others added 2 commits May 18, 2026 16:49
previously, the convolution kernel for KLEIDIAI would allocate a large contiguous buffer
for the LHS (left-hand-side) matrix packing, which could consume excessive memory
and reduce cache efficiency.

This patch modifies the packing strategy to use a chunked approach:
- Introduce a compile-time upper bound for temporary LHS packing buffers
- Allocate a moderate-sized temporary buffer once.
- Pack LHS rows in chunks, perform computation, then reuse the buffer for the next chunk.

Benefits:
- Significantly reduces peak memory usage.
- Improves cache utilization and overall computation efficiency.
- Avoids potential memory allocation failures for large convolutions.

Performance improvement:
- Test with model https://huggingface.co/garavv/arcface-onnx on MTK D9500

    Before this patch
    ```
    ./build/RelWithDebInfo/onnxruntime_perf_test -x 1  -r 1000 arc.onnx

    Number of inferences per second: 4.25327
    ```

    After this patch
    ```
    ./build/RelWithDebInfo/onnxruntime_perf_test -x 1  -r 1000 arc.onnx

    Number of inferences per second: 5.03257
    ```

----------------------------------------------------------------------------------------------------------------
sme_with_patch | sme_without_patch  | Diff (μs) | Change % | Node Name
----------------------------------------------------------------------------------------------------------------
  11975.10 |    31235.30 |     19260.20 |    160.84% | StatefulPartitionedCall/ResNet34/conv2_block1_1_conv/Conv2D
   4514.80 |     7691.70 |      3176.90 |     70.37% | StatefulPartitionedCall/ResNet34/conv2_block2_1_conv/Conv2D
   4220.20 |     7120.70 |      2900.50 |     68.73% | StatefulPartitionedCall/ResNet34/conv2_block3_1_conv/Conv2D
   5429.20 |     8279.60 |      2850.40 |     52.50% | StatefulPartitionedCall/ResNet34/conv3_block1_1_conv/Conv2D
   4497.80 |     5478.40 |       980.60 |     21.80% | StatefulPartitionedCall/ResNet34/conv4_block1_1_conv/Conv2D
   3474.30 |     4351.80 |       877.50 |     25.26% | StatefulPartitionedCall/ResNet34/conv3_block3_1_conv/Conv2D
   3627.30 |     4504.00 |       876.70 |     24.17% | StatefulPartitionedCall/ResNet34/conv3_block4_1_conv/Conv2D
   5244.20 |     5961.10 |       716.90 |     13.67% | StatefulPartitionedCall/ResNet34/conv1_conv/Conv2D
   3439.80 |     4050.90 |       611.10 |     17.77% | StatefulPartitionedCall/ResNet34/conv3_block2_1_conv/Conv2D
   9749.80 |    10195.50 |       445.70 |      4.57% | StatefulPartitionedCall/ResNet34/conv2_block2_2_conv/Conv2D
   3814.00 |     4209.80 |       395.80 |     10.38% | StatefulPartitionedCall/ResNet34/conv5_block2_2_conv/Conv2D
   2715.90 |     3034.70 |       318.80 |     11.74% | StatefulPartitionedCall/ResNet34/conv4_block6_1_conv/Conv2D
   4089.10 |     4367.80 |       278.70 |      6.82% | StatefulPartitionedCall/ResNet34/conv5_block1_1_conv/Conv2D
   2698.00 |     2959.50 |       261.50 |      9.69% | StatefulPartitionedCall/ResNet34/conv4_block5_1_conv/Conv2D
   3869.20 |     4102.80 |       233.60 |      6.04% | StatefulPartitionedCall/ResNet34/conv5_block3_2_conv/Conv2D
   2767.90 |     2966.80 |       198.90 |      7.19% | StatefulPartitionedCall/ResNet34/conv4_block4_1_conv/Conv2D
   9652.10 |     9816.60 |       164.50 |      1.70% | StatefulPartitionedCall/ResNet34/conv2_block3_2_conv/Conv2D
   2897.50 |     3054.60 |       157.10 |      5.42% | StatefulPartitionedCall/ResNet34/conv4_block3_1_conv/Conv2D
   4601.20 |     4748.60 |       147.40 |      3.20% | StatefulPartitionedCall/ResNet34/conv5_block1_2_conv/Conv2D
   3134.00 |     3246.10 |       112.10 |      3.58% | StatefulPartitionedCall/ResNet34/conv4_block2_1_conv/Conv2D

Signed-off-by: Qxiang Xu <Qixiang.Xu@arm.com>
Signed-off-by: Jonathan Clohessy <Jonathan.Clohessy@arm.com>
 - Added SelectConvRoute function to mlasi_kleidiai.h to decide between
   GemmFallback and Igemm based on the convolution workload parameters
 - Updated CheckCapabilitiesSme function in convolve_kleidiai.cpp
   to use the new SelectConvRoute function

Co-authored-by: Damien Dooley <damien.dooley@arm.com>
Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a CPU convolution performance regression on ARM64 when the KleidiAI SME IGEMM path is selected, by introducing a new convolution route-selection heuristic and changing the KleidiAI LHS packing strategy to reduce peak temporary memory and improve cache locality on large workloads.

Changes:

  • Added ArmKleidiAI::SelectConvRoute (with ConvRoute options Igemm, GemmFallback, None) to decide whether to run KleidiAI IGEMM, fall back to GEMM, or use the existing path.
  • Updated KleidiAI convolution to pack the IGEMM LHS in bounded chunks instead of packing the full LHS up front.
  • In the generic MLAS convolution implementation, routed certain workloads to MlasGemm (instead of MlasSgemmOperation) when SelectConvRoute chooses GemmFallback.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
onnxruntime/core/mlas/lib/kleidiai/mlasi_kleidiai.h Adds ConvRoute + SelectConvRoute heuristic helpers for choosing conv execution route.
onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp Reworks KleidiAI SME IGEMM LHS packing into chunked packing and integrates route selection into capability checks.
onnxruntime/core/mlas/lib/convolve.cpp Uses SelectConvRoute to optionally switch im2col+SGEMM slices to MlasGemm for the fallback route.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread onnxruntime/core/mlas/lib/kleidiai/mlasi_kleidiai.h Outdated
Comment thread onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp
Comment thread onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp
Comment thread onnxruntime/core/mlas/lib/convolve.cpp
@hariharans29 hariharans29 changed the title KleidiAI fix igemm regression [MLAS] KleidiAI fix igemm regression May 19, 2026
@martin-klacer-arm
Copy link
Copy Markdown
Author

@microsoft-github-policy-service agree company="Arm"

@hariharans29
Copy link
Copy Markdown
Member

Can you please address the Copliot comments ?

@martin-klacer-arm
Copy link
Copy Markdown
Author

Hi, thanks for the reminder. The comments Copilot left are sensible, I'm addressing them with a new patch that's currently work in progress. Once it's done and clears internal review, I'll add it to the PR.

@hariharans29
Copy link
Copy Markdown
Member

Hi, thanks for the reminder. The comments Copilot left are sensible, I'm addressing them with a new patch that's currently work in progress. Once it's done and clears internal review, I'll add it to the PR.

Thank you!

Signed-off-by: Martin Klacer <martin.klacer@arm.com>
@martin-klacer-arm
Copy link
Copy Markdown
Author

Hi, I pushed a new patch addressing the Copilot comments, and left a reply with further details on one of the comments.

@hariharans29
Copy link
Copy Markdown
Member

Review — PR #28571 [MLAS] KleidiAI fix igemm regression

Verdict: good direction, accept after a few fixes. The two changes (chunked LHS packing + workload-based route selection) are the right shape for fixing #27633. There are some real concerns around the heuristic, the new per-tile allocation, and use of ORT_ENFORCE inside a parallel worker — but nothing structural.

Substantive concerns

1. ORT_ENFORCE inside MlasTrySimpleParallel. The inner loop in ConvolveSme now does:

ORT_ENFORCE(bytes_per_m_step != 0,
            "KleidiAI LHS packed size must be non-zero.");

inside the worker lambda. MlasTrySimpleParallel does not catch exceptions thrown from workers — in exception-disabled builds ORT_ENFORCE calls abort(), and in exception-enabled builds the throw will likely terminate the process from a worker thread. Also: bytes_per_m_step depends only on m_step, d_kh*d_kw, ci — none of which change between iterations. Compute it (and check it) once before entering the parallel region. CheckCapabilitiesSme already validates this same quantity for the Igemm route, so by the time we reach ConvolveSme it cannot be zero; the enforce here is dead weight. Hoist the call out of the lambda and either drop the check or convert to an assertion.

2. Per-tile std::make_unique<std::byte[]> allocation. Each parallel worker now does:

auto lhs = std::make_unique<std::byte[]>(lhs_buffer_bytes);

inside the MlasTrySimpleParallel lambda, so this is one heap allocation per (m_tile, n_tile). For large co/n_tile_step, this is a lot of churn that the old code avoided by allocating the full LHS once. A thread_local grow-only std::vector<std::byte> (mirroring the pad_ptr pattern just above) would eliminate it. Yes, that re-introduces the cross-call state pattern this codebase has been bitten by (see #28654), so document the invariant clearly — but the cost of per-tile allocation in a hot conv loop is not free.

3. MAX_LHS_CHUNK_BYTES = 2 MiB is a magic constant. It's commented as "cache-friendly," but there is no cache-size discovery and no per-CPU tuning — Apple M-series, Snapdragon X, AWS Graviton, and Ampere Altra all have very different L2 sizes (and SME implementations). At minimum:

  • Add a comment explaining why 2 MiB specifically (target L2? something else?).
  • Consider deriving from MlasGetPreferredBufferAlignment or a cpuinfo-derived size.
  • Add a runtime override (env var / session option) for tuning experiments before this is baked in.

Also: the chunk computation

size_t m_chunk = std::max<size_t>(m_step,
                                  MAX_LHS_CHUNK_BYTES / bytes_per_m_step * m_step);

silently truncates MAX_LHS_CHUNK_BYTES / bytes_per_m_step to zero when bytes_per_m_step > MAX_LHS_CHUNK_BYTES, then the std::max rescues it back to m_step. That works, but it means the "2 MiB cap" silently does nothing for any workload whose per-m_step packed size exceeds 2 MiB. Make this explicit with a comment, or restructure so the intent is obvious.

4. ConvIgemmMaxWork = 1'000'000 threshold. This is the heuristic boundary between Igemm and GemmFallback. There is no data in the PR description showing how this was chosen (which models, which CPUs, which kernel sizes). The linked issue #27633 specifically mentions Apple M-series and 9×9 OCR kernels; what does the curve look like at, say, output_m × effective_k × filter_count near the threshold? At minimum, please add a comment citing the measurement methodology, and ideally make this overridable.

The formula

const auto igemm_max_output_m = (ConvIgemmMaxWork / effective_k / filter_count);
if (output_m > igemm_max_output_m) return ConvRoute::GemmFallback;

is equivalent to output_m * effective_k * filter_count > 1'000'000 modulo integer-division rounding. The integer-division form silently lowers the effective threshold when effective_k * filter_count doesn't divide cleanly into 1M. Use SafeInt/mul_overflow_size_t_builtin and a single comparison:

size_t total_work;
if (mul_overflow_size_t_builtin(output_m, effective_k, &total_work) ||
    mul_overflow_size_t_builtin(total_work, filter_count, &total_work)) {
    return ConvRoute::GemmFallback;  // assume too large
}
if (total_work > ConvIgemmMaxWork) return ConvRoute::GemmFallback;

5. New cross-file coupling. convolve.cpp now #includes kleidiai/mlasi_kleidiai.h and calls ArmKleidiAI::SelectConvRoute directly:

#if defined(USE_KLEIDIAI)
#include "kleidiai/mlasi_kleidiai.h"
#endif
...
#if defined(USE_KLEIDIAI)
use_gemm_batch_override =
    ArmKleidiAI::SelectConvRoute(Parameters) == ArmKleidiAI::ConvRoute::GemmFallback;
#endif

This means the generic MLAS convolve path now has compile-time and call-site knowledge of one specific backend. The existing pattern in MLAS for this is the dispatch override / MlasPlatform selector. Wrap this behind a MLAS_CONV_ROUTE_OVERRIDE function pointer set up in platform.cpp, the same way every other backend hook works. That keeps convolve.cpp backend-agnostic and avoids more #if defined(USE_KLEIDIAI) scaffolding creeping in.

Also: SelectConvRoute is now called twice per Compute() — once in CheckCapabilitiesSme and once at the top of the convolve op (via the new flag). It does non-trivial overflow-checked arithmetic. Cache the result on MLAS_CONV_PARAMETERS (there is already a BackendKernelSelectorConfig pointer there) so it isn't recomputed.

6. MlasGemm vs MlasSgemmOperation substitution semantics. The change does:

if (use_gemm_batch_override) {
    MlasGemm(CblasNoTrans, CblasNoTrans, FilterCount, CountN, CountK, 1.0f,
             Filter + k, K, ColumnBuffer, CountN, beta,
             SegmentOutput, OutputSize, nullptr, Parameters->BackendKernelSelectorConfig);
} else {
    MlasSgemmOperation(...);
}

MlasGemm accepts a threadpool (passed as nullptr here) and may parallelize internally, while MlasSgemmOperation is a single-segment operation that the caller (MlasConvOperation) is already running inside its own segment loop. Confirm that calling MlasGemm(..., nullptr, ...) from inside the per-thread segment loop doesn't accidentally re-enter the threadpool or serialize what was previously parallelized. A quick perf check on the regression case from #27633 would also confirm the fallback actually wins where the heuristic claims it does.

7. mul_overflow_size_t_builtin forward declaration in a public-ish header. mlasi_kleidiai.h now has:

inline bool mul_overflow_size_t_builtin(size_t a, size_t b, size_t* out);

with the comment "implementation is at the bottom of the file." Forward-declaring an inline function whose definition lives in the same TU is fine, but per AGENTS.md there is already MlasMultiplyOverflowsSizeT in mlasi.h from PR #28416. Use that and delete the local helper (PR #28487 already introduces a third copy as MlasTryMultiplySizeT). Three names for the same operation is too many.

8. CheckCapabilitiesSme redundant work. The new code path:

const auto route = ArmKleidiAI::SelectConvRoute(Parameters);
if (route == ArmKleidiAI::ConvRoute::Igemm) {
    const size_t d_kh = ComputeKernelSize(...);
    const size_t d_kw = ComputeKernelSize(...);
    const size_t m_step = imatmul_conv.ukernel.get_m_step();
    const size_t bytes_per_m_step = kai_get_lhs_packed_size_...(m_step, d_kh*d_kw, ...);
    if (bytes_per_m_step == 0) { ... return false; }
    return true;
}

SelectConvRoute already computes effective_kernel_h/w. Returning the dilated kernel sizes from SelectConvRoute (or expanding to a struct return) would avoid recomputation. Minor.

Nits

  • mlasi_kleidiai.h: the #include <iostream> move into the KLEIDIAI_DEBUG_LOGGING || KLEIDIAI_KERNEL_LOGGING block is good. Drive-by cleanup, +1.
  • TryComputeConvOutputSize returns true with *result = 0 for the padded_input < kernel case and for stride == 0. The caller (SelectConvRoute) then checks output_m == 0 and returns ConvRoute::None. That works but is non-obvious — the function returns true on what is semantically failure. Consider returning false for these (or rename the function to reflect "compute output size, with zero meaning 'no valid output'").
  • SelectConvRoute rejects filter_count == 1 outright. Comment why (presumably because the SME imatmul has min-N requirements). Future readers will wonder.
  • The comment //RhsPackWeightsBiasSme() - to perform dilation increases kernel size and masks unused weights was deleted — fine, since it was just a note, but the new code still does the same dilation expansion; consider keeping the note near ComputeKernelSize calls.
  • MIdx/NIdxm_idx/n_idx rename is good; please apply the same to other CamelCase locals (ATile, BTile, CTile, TileSize*) for consistency in the touched block.
  • ConvRoute::None value ordering: None, Igemm, GemmFallback. Consider documenting the contract — most readers will expect None to mean "this PR's logic doesn't apply, defer to caller's default" rather than "fail." That's how convolve.cpp interprets it, but the enum name doesn't make it obvious.
  • inline constexpr size_t ConvIgemmMaxWork = 1000000ULL; — please use 1'000'000 for readability.

Things to verify before merging

  1. Re-run the regression model from [Performance] CPU Conv kernel regression for large kernels (9x9) on macOS ARM64 in ORT 1.24.x #27633 on both Apple M-series and Snapdragon X with the chunked packing; confirm the regression is gone and there is no slowdown on the cases that previously hit the IGEMM path cleanly.
  2. Confirm MlasGemm(..., nullptr, ...) from inside MlasConvOperation's segment loop doesn't cause re-entrant threadpool dispatch.
  3. Run existing MLAS conv unit tests under TSAN if available — the thread_local caches plus chunked packing change cross-thread access patterns.
  4. Confirm the bytes_per_m_step == 0 early-out in CheckCapabilitiesSme makes the ORT_ENFORCE in ConvolveSme provably unreachable, then drop the enforce.

Bottom line

Approve after:

  • Hoist or drop the ORT_ENFORCE from the worker lambda.
  • Replace per-tile make_unique with TLS scratch (or document why per-tile is acceptable).
  • Route SelectConvRoute through the existing dispatch-override pattern instead of #if defined(USE_KLEIDIAI) in convolve.cpp.
  • Document/justify the 2 MiB and 1M thresholds (or make them overridable).
  • Converge on one mul_overflow_size_t helper.

The chunked-packing idea and the route-selection split are both correct fixes for the reported regression. With the above tightening this should be safe to land.

@martin-klacer-arm
Copy link
Copy Markdown
Author

Thank you for the thorough review! The comments are sensible and understandable. I will update the PR as soon as we have a patch to address them finished and internally reviewed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Performance] CPU Conv kernel regression for large kernels (9x9) on macOS ARM64 in ORT 1.24.x

4 participants