RFC: Backend-Selected Split-K Factor via Encoding #24022

hanhanW · 2026-04-06T22:35:53Z

RFC: Backend-Selected Split-K Factor via Encoding

Author: hanhanW

Problem

Today the split-k factor is decided in DispatchCreation by static
heuristics in SetSplitReductionSizesPass. The pass walks
PartialReductionOpInterface ops, applies one of four target-agnostic
heuristics (outer-reduction, convolution, matmul-like, arg_compare),
and writes a static iree_linalg_ext.split_reduction = [<int>]
attribute. FormSplitReductionDispatchesPass then tiles using that
constant.

Two consequences:

- Target-blind decisions. The heuristic constants
  (largeOutputSize = 2048 * 4096, largeKSize = 18000,
  ratioThreshold = 48, output-size buckets, ...) are baked into
  DispatchCreation. Different backends want different factors for the
  same op, but DispatchCreation runs once per program, before HAL
  device assignment is honored by codegen-side decisions. The only
  way for a backend to disagree today is to add another bucket to the
  shared heuristic file.
- Heuristic complexity grows on the wrong side of the boundary.
  Each new op family or target needs another branch in
  SetSplitReductionSizes.cpp. Backend authors who know their
  hardware cannot express that knowledge without modifying a
  target-agnostic pass.

The static factor is otherwise fine: codegen produces high-quality IR
because the scf.forall step is a constant. We want to keep that
quality but move the decision to the encoding resolver.

Proposal

Defer the split-k decision to the per-target encoding resolver, the
same mechanism data-tiling uses for layout selection.

Three new pieces, all in the Encoding dialect:

1. `#iree_encoding.split_k_encoding` attribute

A new attribute that implements SerializableAttr. The existing data-tiling
encoding resolvers (#iree_gpu.gpu_encoding_resolver, #iree_cpu.cpu_encoding_resolver,
...) extend their getLayout() to handle it. Carries the metadata the
resolver needs:

#iree_encoding.split_k_encoding<
    op_type = matmul,
    reduction_dims = [2],
    element_types = [f32, f32, f32],
    user_indexing_maps = [#map, #map1, #map2],
    split_k_factors = [?]>

The split_k_factors field starts as all kDynamic (verbose). After
SpecializeEncodingsPass runs, it contains static integers (serialized). Same
verbose-vs-serialized pattern as #iree_encoding.encoding.

Note: the field list is not final. Fields can be added or removed as needed,
as long as they carry only static information.
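The verbose-vs-serialized distinction can be modelled in a few lines. This is only an illustrative Python sketch, not the proposed C++/TableGen definition: the class name, the `K_DYNAMIC` stand-in for the `?` sentinel, and the `serialize` helper are all assumptions made for the example.

```python
from dataclasses import dataclass, replace
from typing import Optional, Sequence, Tuple

# Stand-in for the `?` / kDynamic sentinel (a modelling choice, not IREE's value).
K_DYNAMIC = None


@dataclass(frozen=True)
class SplitKEncoding:
    """Toy model of the proposed attribute; the field list mirrors the RFC example."""
    op_type: str
    reduction_dims: Tuple[int, ...]
    element_types: Tuple[str, ...]
    split_k_factors: Tuple[Optional[int], ...]

    def is_serialized(self) -> bool:
        # Serialized means every factor has been replaced by a static integer.
        return all(f is not K_DYNAMIC for f in self.split_k_factors)


def verbose_matmul_encoding() -> SplitKEncoding:
    """Verbose form: one unresolved factor per reduction dim."""
    reduction_dims = (2,)
    return SplitKEncoding(
        op_type="matmul",
        reduction_dims=reduction_dims,
        element_types=("f32", "f32", "f32"),
        split_k_factors=(K_DYNAMIC,) * len(reduction_dims),
    )


def serialize(enc: SplitKEncoding, factors: Sequence[int]) -> SplitKEncoding:
    """Hypothetical helper: what a resolver's getLayout() would hand back."""
    if len(factors) != len(enc.reduction_dims):
        raise ValueError("need exactly one factor per reduction dim")
    return replace(enc, split_k_factors=tuple(factors))


verbose = verbose_matmul_encoding()
serialized = serialize(verbose, [320])
```

Here `verbose.is_serialized()` is false and `serialized.is_serialized()` is true; everything except `split_k_factors` is unchanged between the two forms.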

2. `iree_encoding.query_split_k_size` op

Standalone op that yields the split-k tile sizes as index values:

%splitk = iree_encoding.query_split_k_size(%M, %N, %K)
    #iree_encoding.split_k_encoding<...> -> index

- One result per entry in reduction_dims (variadic).
- The operands are the iteration sizes (M, N, K), carried for the
  resolver and for future runtime-computed factors.
- Created in host code by FormSplitReductionDispatchesPass when
  it sees the ? sentinel in iree_linalg_ext.split_reduction.

The op never appears inside dispatch bodies. Tile sizes flow into the
dispatch as workload arguments.

3. Resolution: SpecializeEncodings → EncodeHostTensors

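Two existing Stream passes resolve the op, mirroring how
stream.tensor.sizeof is resolved for data-tiling. A third existing pass,
FoldUniformOperands, then inlines the resulting constant and needs no changes: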

| Pass | Action |
| --- | --- |
| `SpecializeEncodings` | Calls `getLayout()` on the encoding via the device-affinity resolver. The resolver returns a `split_k_encoding` with static `split_k_factors`. Op stays. |
| `EncodeHostTensors` | Reads the static factors from the serialized encoding via a new pattern, replaces the op with `arith.constant` results. Canonicalization folds downstream `ceildivui`/`tensor.empty`/`flow.dispatch` shapes. |
| `FoldUniformOperands` | (Existing, no changes.) Inlines the constant workload arg into dispatch bodies. |

After these three passes, the dispatch body has scf.forall (...) step (%c320) — bit-identical to today's static path. Codegen needs
no changes.
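The division of labour between the two resolving passes can be sketched as plain functions. This is a rough Python model under stated assumptions, not the pass implementations: the function names only echo the pass names, `None` stands in for the `?` sentinel, and `hypothetical_gpu_resolver` with its factor of 320 is made up to match the running example.

```python
from typing import Callable, List, Optional, Sequence

# A resolver maps iteration sizes (M, N, K) to one factor per reduction dim.
Resolver = Callable[[Sequence[int]], List[int]]


def specialize_encodings(
    factors: Sequence[Optional[int]],
    iteration_sizes: Sequence[int],
    resolver: Resolver,
) -> List[Optional[int]]:
    """Model of SpecializeEncodings: ask the resolver only if factors are unresolved."""
    if all(f is not None for f in factors):
        return list(factors)
    return list(resolver(iteration_sizes))


def encode_host_tensors(factors: Sequence[Optional[int]]) -> List[int]:
    """Model of EncodeHostTensors: the query becomes constants, but only if serialized."""
    if any(f is None for f in factors):
        raise ValueError("encoding is still verbose; nothing to fold")
    return [int(f) for f in factors]


def ceildiv(a: int, b: int) -> int:
    """Integer ceiling division, the role arith.ceildivui plays in the IR."""
    return -(-a // b)


def hypothetical_gpu_resolver(iteration_sizes: Sequence[int]) -> List[int]:
    # Placeholder decision; a real backend would inspect the sizes and the target.
    return [320]


sizes = (32, 32, 40960)  # M, N, K of the running example
resolved = specialize_encodings([None], sizes, hypothetical_gpu_resolver)
constants = encode_host_tensors(resolved)
nparts = ceildiv(sizes[2], constants[0])
```

With these inputs `constants` is `[320]` and `nparts` is 128, which is the shape arithmetic the canonicalizer is expected to fold once the query op has become a constant.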

Mechanism

+----------------------------------------------------------------------+
|  DispatchCreation                                                    |
|    SetSplitReductionSizes annotates the op with [? : index]          |
|    FormSplitReductionDispatches creates the placeholder + tiles      |
|                                                                      |
|    %splitk = iree_encoding.query_split_k_size(%c32, %c32, %c40960)   |
|        #split_k_encoding<..., split_k_factors=[?]> -> index          |
|    %nparts = arith.ceildivui %c40960, %splitk                        |
|    flow.dispatch.region {                                            |
|      scf.forall (%iv) = (0) to (40960) step (%splitk) { ... }        |
|    } -> tensor<32x32x?xf32>{%nparts}                                 |
+----------------------------------------------------------------------+
                                  |
                                  v
+----------------------------------------------------------------------+
|  Stream::SpecializeEncodings                                         |
|    Calls LayoutResolverAttr::getLayout() via device-affinity         |
|    Backend resolver returns encoding with static split_k_factors     |
|                                                                      |
|    %splitk = iree_encoding.query_split_k_size(%c32, %c32, %c40960)   |
|        #split_k_encoding<..., split_k_factors=[320]> -> index        |
|                                          ^^^^^                       |
|                                          serialized                  |
+----------------------------------------------------------------------+
                                  |
                                  v
+----------------------------------------------------------------------+
|  Stream::EncodeHostTensors                                           |
|    New pattern reads static factors from the serialized encoding     |
|    Replaces query op with arith.constant; canonicalization folds     |
|    ceildivui(40960, 320) -> 128                                      |
|                                                                      |
|    %c320 = arith.constant 320 : index                                |
|    stream.tensor.dispatch [%c320](...) -> tensor<32x32x128xf32>      |
+----------------------------------------------------------------------+
                                  |
                                  v
+----------------------------------------------------------------------+
|  Stream::FoldUniformOperands  (existing pass, no changes)            |
|    Inlines constant workload into dispatch body                      |
|                                                                      |
|    Inside dispatch:                                                  |
|      scf.forall (%iv) = (0) to (40960) step (%c320) { ... }          |
|                                                                      |
|  Codegen sees fully static IR — same as today's static path.         |
+----------------------------------------------------------------------+
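For readers less familiar with split-k itself, the effect of the `scf.forall ... step (%splitk)` loop can be illustrated on a one-dimensional reduction. The sketch below is a toy Python model, not IREE code: it splits a dot product along K into tiles of `split_k` elements, keeps one partial result per tile, and reduces the partials in a second step, which is why the intermediate tensor gains a dimension of size ceildiv(K, split_k).

```python
from typing import List, Sequence, Tuple


def split_k_dot(a: Sequence[int], b: Sequence[int], split_k: int) -> Tuple[List[int], int]:
    """Return (partials, total) for a dot product tiled along K."""
    if len(a) != len(b):
        raise ValueError("operands must have the same K extent")
    if split_k <= 0:
        raise ValueError("split_k must be positive")
    k = len(a)
    partials = []
    # First step: one partial reduction per K tile (the forall loop).
    for start in range(0, k, split_k):
        end = min(start + split_k, k)  # the last tile may be short
        partials.append(sum(x * y for x, y in zip(a[start:end], b[start:end])))
    # Second step: reduce the partial results.
    return partials, sum(partials)


a = list(range(10))
b = [2] * 10
partials, total = split_k_dot(a, b, split_k=4)

# Tile count for the RFC's running example: K = 40960, factor = 320.
tiles_in_example = len(range(0, 40960, 320))
```

For the small input, K = 10 with a factor of 4 gives three partials (the last tile covers only two elements) and the same total as the unsplit dot product. For the running example the loop has 128 iterations, matching the `tensor<32x32x128xf32>` result in the diagram.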

Why these design choices

Why two passes (SpecializeEncodings + EncodeHostTensors) instead of
one? This matches the existing data-tiling resolution pipeline:
SpecializeEncodings serializes encodings (verbose → target-specific),
EncodeHostTensors lowers serialized encodings to concrete arithmetic.
Treating the new op the same way avoids inventing a third resolution
phase and reuses the existing affinity-analysis machinery.

Why scope to static-shape inputs? With static iteration sizes, the
resolved factor is always a compile-time constant; the SSA chain
folds; codegen sees fully static IR. Dynamic inputs require a more
general mechanism (see Future Work) and we want to avoid coupling
that work to the immediate goal of letting backends own the heuristic.

Observations

- The intermediate tensor needs no encoding. Unlike data-tiling,
  the split-k factor doesn't change data layout, only the partition
  count. The intermediate tensor between the two dispatches is a
  plain tensor<MxNx?xf32>{%num_partitions}. Standard
  stream.tensor.sizeof works without changes.
- Codegen is unchanged. The whole resolution happens at Stream
  scope. By the time codegen runs, the IR looks identical to today's
  static split-k path.
- The existing iree_linalg_ext.split_reduction = [?] sentinel
  composes with the static path. Setting [320] keeps today's
  behavior; setting [?] activates the new path.
- Heuristics move, not disappear. The current heuristics in
  SetSplitReductionSizes.cpp get ported into the per-backend
  resolver implementations (GPUEncodingExternalModels,
  CPUEncodingExternalModels). DispatchCreation no longer chooses,
  but the same logic still runs, now at the right scope.
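To make "heuristics move, not disappear" concrete, here is a small Python sketch of per-backend ownership. Everything in it is hypothetical: the registry, the backend names, and the thresholds are invented for illustration and are not the heuristics currently in SetSplitReductionSizes.cpp or any planned resolver.

```python
from typing import Callable, Dict, Tuple

IterationSizes = Tuple[int, int, int]  # (M, N, K)
SplitKResolver = Callable[[IterationSizes], int]

# Hypothetical registry keyed by backend name, standing in for device affinity.
_RESOLVERS: Dict[str, SplitKResolver] = {}


def register_resolver(backend: str):
    """Decorator that records a backend-owned split-k heuristic."""
    def decorator(fn: SplitKResolver) -> SplitKResolver:
        _RESOLVERS[backend] = fn
        return fn
    return decorator


@register_resolver("hypothetical_gpu")
def _gpu_split_k(sizes: IterationSizes) -> int:
    _, _, k = sizes
    # Illustrative buckets only.
    if k < 1024:
        return 32
    if k < 16384:
        return 128
    return 320


@register_resolver("hypothetical_cpu")
def _cpu_split_k(sizes: IterationSizes) -> int:
    _, _, k = sizes
    # A different backend is free to disagree for the same op.
    return 64 if k < 16384 else 256


def resolve_split_k(backend: str, sizes: IterationSizes) -> int:
    """Pick the factor using the heuristic owned by the given backend."""
    try:
        resolver = _RESOLVERS[backend]
    except KeyError:
        raise ValueError(f"no split-k resolver registered for {backend!r}") from None
    return resolver(sizes)
```

The point of the sketch is the shape of the design: the shared caller only looks up a resolver, and each backend changes its own function without touching a target-agnostic file.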

Future work

Dynamic-shape inputs

Generalize via a new iree_encoding.physical_dim op (verbose form, in
DispatchCreation) that converts to stream.tensor.dim (lowered form, in
Stream phase). Both %splitk (the tile-size factor) and %nparts (the
partition count) become physical_dim queries on the encoded intermediate
tensor type. A new SerializableAttr::getPhysicalDim(builder, loc, kind, operands)
interface method handles the lowering.

For split_k_encoding the physical-dim kinds are:

- factor(N): the split-k factor for the Nth entry in reduction_dims
  (the tile size used as the scf.forall step).
- dim(N): the size of tensor dim N of the partial-result carrier. For
  the partition dim this is ceildivui(iter_size, factor).

For a fully dynamic matmul, the intermediate tensor of partial results
is tensor<?x?x?xf32> (M × N × num_partitions). Both queries:

// Verbose: split_k_factors is unresolved.
%splitk = iree_encoding.physical_dim factor(0) sizes(%M, %N, %K)
    : tensor<?x?x?xf32, #iree_encoding.split_k_encoding<
        op_type = matmul,
        reduction_dims = [2],
        element_types = [f32, f32, f32],
        user_indexing_maps = [#map, #map1, #map2],
        split_k_factors = [?]>>
    -> index
%nparts = iree_encoding.physical_dim dim(2) sizes(%M, %N, %K)
    : tensor<?x?x?xf32, #iree_encoding.split_k_encoding<...>>
    -> index

After SpecializeEncodings, the encoding's split_k_factors carries a
list of (condition, factor) pairs — the same shape as
iree_codegen.specialization_ranges — gated on a named iteration-space
dim. The input = 2 field anchors the conditions to the K iteration
size (so lt 1024 reads as "K < 1024"):

// Serialized: backend resolver picked three factor buckets gated on K.
%splitk = iree_encoding.physical_dim factor(0) sizes(%M, %N, %K)
    : tensor<?x?x?xf32, #iree_encoding.split_k_encoding<
        op_type = matmul,
        reduction_dims = [2],
        element_types = [f32, f32, f32],
        user_indexing_maps = [#map, #map1, #map2],
        split_k_factors = [#iree_encoding.specialized_value<
            input = 2,                                  // K is iteration dim 2
            ranges = [(lt 1024, 32),
                      (lt 16384, 128),
                      (default, 320)]>]>>
    -> index
%nparts = iree_encoding.physical_dim dim(2) sizes(%M, %N, %K)
    : tensor<?x?x?xf32, #iree_encoding.split_k_encoding<...>>
    -> index

After EncodeHostTensors lowers each query via getPhysicalDim, both
produce nested arith.select chains. To preserve first-match-wins
semantics, the highest-priority condition is the outermost select;
the default branch is the innermost else. The chain is Pure and
region-free, so the standard cse pass deduplicates the factor chain
between the two lowerings:

%c1024  = arith.constant 1024 : index
%c16384 = arith.constant 16384 : index
%c32    = arith.constant 32 : index
%c128   = arith.constant 128 : index
%c320   = arith.constant 320 : index
%cmp_small = arith.cmpi slt, %K, %c1024  : index
%cmp_med   = arith.cmpi slt, %K, %c16384 : index
// First-match-wins: nest highest-priority condition outermost.
%inner     = arith.select %cmp_med,   %c128, %c320  : index
%splitk    = arith.select %cmp_small, %c32,  %inner : index
// `nparts` reuses the same factor chain (CSE'd):
%nparts    = arith.ceildivui %K, %splitk : index

The dispatch body still receives %splitk as a workload arg, but now
it is computed at host runtime instead of being a compile-time constant.
FoldUniformOperands cannot inline a non-constant workload, so codegen
sees a dynamic step in scf.forall. This is the cost of the dynamic
case; the static-input path above keeps the constant-step quality.
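The first-match-wins nesting can be checked with a small model. The Python sketch below is an assumption-laden illustration, not the proposed getPhysicalDim implementation: `Select`, `lower_ranges`, and `evaluate` are invented names, conditions are restricted to "K < bound", and `None` marks the default entry.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union


@dataclass(frozen=True)
class Select:
    """Models `arith.select (K < bound), if_true, if_false`."""
    bound: int
    if_true: "Expr"
    if_false: "Expr"


Expr = Union[int, Select]


def lower_ranges(ranges: List[Tuple[Optional[int], int]]) -> Expr:
    """Lower (upper_bound, factor) pairs, listed in priority order, to nested selects.

    The last entry must be the default (bound None). Building from the back
    puts the highest-priority condition outermost.
    """
    *conditions, (default_bound, default_factor) = ranges
    if default_bound is not None:
        raise ValueError("last range must be the default")
    expr: Expr = default_factor
    for bound, factor in reversed(conditions):
        expr = Select(bound, factor, expr)
    return expr


def evaluate(expr: Expr, k: int) -> int:
    """Walk the select chain for a concrete K."""
    while isinstance(expr, Select):
        expr = expr.if_true if k < expr.bound else expr.if_false
    return expr


def ceildiv(a: int, b: int) -> int:
    return -(-a // b)


chain = lower_ranges([(1024, 32), (16384, 128), (None, 320)])
factors = {k: evaluate(chain, k) for k in (512, 1024, 4096, 16384, 20000)}
nparts = {k: ceildiv(k, f) for k, f in factors.items()}
```

The resulting `chain` has the K < 1024 test outermost and the default 320 as the innermost else, mirroring the `%inner` / `%splitk` pair in the IR. Boundary values fall through to the next bucket: K = 1024 selects 128 and K = 16384 selects 320.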

Retire `iree_linalg_ext.split_reduction` attribute

Once the encoding-based path is the default, the LinalgExt attribute
can be removed in favor of the encoding-driven flow.

A concrete IR walkthrough across pipeline stages, for a static
32x40960 × 40960x32 matmul, can be found at
rfc_split_k_v2_ir_walkthrough.md.

hanhanW · 2026-04-06T22:38:26Z

cc @bangtianliu @yzhang93, because @MaheshRavishankar told me that you've been working in this area. In my mind, this is the next evolution of split reduction. I'll need your help reviewing the code when I implement it. I'm happy to delegate the implementation to you if you're interested.

Happy to chat on VC if there are any questions.

1 reply

hanhanW Apr 6, 2026
Collaborator Author

The dynamic-shape part of the future work is just a POC. You don't need to review all the details.
