cc @bangtianliu @yzhang93, because @MaheshRavishankar told me that you've been working on this area. This is the next evolution for split reduction in my mind. I'll need you to help review code when I implement it. I'm happy to delegate the implementation to you if you're interested in it. Happy to chat on VC if there are any questions.
RFC: Backend-Selected Split-K Factor via Encoding
Author: hanhanW
Problem
Today the split-k factor is decided in DispatchCreation by static
heuristics in `SetSplitReductionSizesPass`. The pass walks
`PartialReductionOpInterface` ops, applies one of four target-agnostic
heuristics (outer-reduction, convolution, matmul-like, arg_compare),
and writes a static `iree_linalg_ext.split_reduction = [<int>]`
attribute. `FormSplitReductionDispatchesPass` then tiles using that
constant.
Two consequences:
1. Target-blind decisions. The heuristic constants
   (`largeOutputSize = 2048 * 4096`, `largeKSize = 18000`,
   `ratioThreshold = 48`, output-size buckets, ...) are baked into
   DispatchCreation. Different backends want different factors for the
   same op, but DispatchCreation runs once per program, before HAL
   device assignment is honored by codegen-side decisions. The only
   way for a backend to disagree today is to add another bucket to the
   shared heuristic file.
2. Heuristic complexity grows on the wrong side of the boundary.
   Each new op family or target needs another branch in
   `SetSplitReductionSizes.cpp`. Backend authors who know their
   hardware cannot express that knowledge without modifying a
   target-agnostic pass.
The static factor is otherwise fine: codegen produces high-quality IR
because the `scf.forall` step is a constant. We want to keep that
quality but move the decision to the encoding resolver.
Proposal
Defer the split-k decision to the per-target encoding resolver, the
same mechanism data-tiling uses for layout selection.
Three new pieces, all in the Encoding dialect:
1. `#iree_encoding.split_k_encoding` attribute
A new attribute that implements `SerializableAttr`. The existing
data-tiling encoding resolvers (`#iree_gpu.gpu_encoding_resolver`,
`#iree_cpu.cpu_encoding_resolver`, ...) extend their `getLayout()` to
handle it. It carries the metadata the resolver needs.
The `split_k_factors` field starts as all `kDynamic` (verbose). After
`SpecializeEncodingsPass` runs, it contains static integers
(serialized). This is the same verbose-vs-serialized pattern as
`#iree_encoding.encoding`.
Note: the fields are not finalized; fields can be added or removed as
needed, as long as they carry static information.
2. `iree_encoding.query_split_k_size` op
Standalone op that yields the split-k tile sizes as `index` values:
- … `reduction_dims` (variadic).
- … resolver and for future runtime-computed factors.
- … `FormSplitReductionDispatchesPass` when it sees the `?` sentinel in
  `iree_linalg_ext.split_reduction`.
The op never appears inside dispatch bodies. Tile sizes flow into the
dispatch as workload arguments.
3. Resolution: SpecializeEncodings → EncodeHostTensors
Two existing Stream passes handle the op, mirroring how
`stream.tensor.sizeof` is resolved for data-tiling, and a third pass
cleans up afterwards:
- `SpecializeEncodings`: calls `getLayout()` on the encoding via the
  device-affinity resolver. The resolver returns a `split_k_encoding`
  with static `split_k_factors`. Op stays.
- `EncodeHostTensors`: replaces the op with `arith.constant` results.
  Canonicalization folds downstream `ceildivui` / `tensor.empty` /
  `flow.dispatch` shapes.
- `FoldUniformOperands`: inlines the constant workload into the
  dispatch body.
After these three passes, the dispatch body has
`scf.forall (...) step (%c320)`, bit-identical to today's static path.
Codegen needs no changes.
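The constant folding described above can be sketched numerically. This is a minimal Python model, not IREE code: the helper name `resolved_static_shapes` is hypothetical, the factor 320 is taken from the `%c320` step above, and pairing it with the static `32x40960 × 40960x32` matmul from the IR walkthrough is an assumption made for illustration.

```python
def ceildiv(a: int, b: int) -> int:
    """Ceiling division on non-negative integers, mirroring ceildivui."""
    return (a + b - 1) // b


def resolved_static_shapes(m: int, n: int, k: int, factor: int) -> dict:
    """Model what remains once the resolved factor is a constant.

    Hypothetical helper: with a static K and a static factor, the
    partition count and the partial-result shape are constants too.
    """
    num_partitions = ceildiv(k, factor)
    return {
        "forall_step": factor,
        "num_partitions": num_partitions,
        # Partial results are laid out as M x N x num_partitions.
        "partial_result_shape": (m, n, num_partitions),
    }


# Assumed pairing of the walkthrough matmul with the 320 factor.
shapes = resolved_static_shapes(m=32, n=32, k=40960, factor=320)
print(shapes)
```

Every value in `shapes` depends only on compile-time integers, which is why the static path can hand codegen a constant `scf.forall` step.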
Mechanism
Why these design choices
Why two passes (SpecializeEncodings + EncodeHostTensors) instead of
one? This matches the existing data-tiling resolution pipeline:
SpecializeEncodings serializes encodings (verbose → target-specific),
EncodeHostTensors lowers serialized encodings to concrete arithmetic.
Treating the new op the same way avoids inventing a third resolution
phase and reuses the existing affinity-analysis machinery.
Why scope to static-shape inputs? With static iteration sizes, the
resolved factor is always a compile-time constant; the SSA chain
folds; codegen sees fully static IR. Dynamic inputs require a more
general mechanism (see Future Work) and we want to avoid coupling
that work to the immediate goal of letting backends own the heuristic.
Observations
- … the split-k factor doesn't change data layout, only the partition
  count. The intermediate tensor between the two dispatches is a
  plain `tensor<MxNx?xf32>{%num_partitions}`. Standard
  `stream.tensor.sizeof` works without changes.
- … scope. By the time codegen runs, the IR looks identical to today's
  static split-k path.
- The `iree_linalg_ext.split_reduction = [?]` sentinel composes with
  the static path. Setting `[320]` keeps today's behavior; setting
  `[?]` activates the new path.
- … `SetSplitReductionSizes.cpp` get ported into the per-backend
  resolver implementations (GPUEncodingExternalModels,
  CPUEncodingExternalModels). DispatchCreation no longer chooses,
  but the same logic still runs, at the right scope.
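How the sentinel composes with the static path can be illustrated with a small Python model. It is only a sketch under stated assumptions: `pick_tile_sizes`, `toy_backend`, and the factor 64 are hypothetical names and values, and the callback merely stands in for the per-target encoding resolver.

```python
SENTINEL = "?"  # stand-in for the `?` entry in split_reduction


def pick_tile_sizes(split_reduction, query_backend):
    """Return one tile size per entry of the attribute.

    Static entries are used as-is (today's path). Sentinel entries are
    deferred to `query_backend`, a hypothetical callback that stands in
    for the per-target encoding resolver.
    """
    sizes = []
    for position, entry in enumerate(split_reduction):
        if entry == SENTINEL:
            sizes.append(query_backend(position))
        else:
            sizes.append(int(entry))
    return sizes


def toy_backend(position: int) -> int:
    """Hypothetical backend choice; 64 is not an IREE default."""
    return 64


static_sizes = pick_tile_sizes([320], toy_backend)         # static path
deferred_sizes = pick_tile_sizes([SENTINEL], toy_backend)  # new path
print(static_sizes, deferred_sizes)
```

In this model the backend is never consulted for a static entry, so existing uses of `[320]` behave exactly as before.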
Future work
Dynamic-shape inputs
Generalize via a new `iree_encoding.physical_dim` op (verbose form, in
DispatchCreation) that converts to `stream.tensor.dim` (lowered form,
in Stream phase). Both `%splitk` (the tile-size factor) and `%nparts`
(the partition count) become physical_dim queries on the encoded
intermediate tensor type. A new
`SerializableAttr::getPhysicalDim(builder, loc, kind, operands)`
interface method handles the lowering.
For `split_k_encoding` the physical-dim kinds are:
- `factor(N)`: the split-k factor for the Nth entry in
  `reduction_dims` (the tile size used as the `scf.forall` step).
- `dim(N)`: the size of tensor dim N of the partial-result carrier. For
  the partition dim this is `ceildivui(iter_size, factor)`.
For a fully dynamic matmul, the intermediate tensor of partial results
is `tensor<?x?x?xf32>` (M × N × num_partitions). Both queries are made
against this tensor type.
After `SpecializeEncodings`, the encoding's `split_k_factors` carries a
list of `(condition, factor)` pairs, the same shape as
`iree_codegen.specialization_ranges`, gated on a named iteration-space
dim. The `input = 2` field anchors the conditions to the K iteration
size (so `lt 1024` reads as "K < 1024").
After `EncodeHostTensors` lowers each query via `getPhysicalDim`, both
produce nested `arith.select` chains. To preserve first-match-wins
semantics, the highest-priority condition is the outermost select;
the default branch is the innermost else. The chain is `Pure` and
region-free, so the standard `cse` pass deduplicates the factor chain
between the two lowerings.
The dispatch body still receives `%splitk` as a workload arg, but now
it is computed at host runtime instead of being a compile-time
constant. `FoldUniformOperands` cannot inline a non-constant workload,
so codegen sees a dynamic step in `scf.forall`. This is the cost of the
dynamic case; the static-input path above keeps the constant-step
quality.
Retire `iree_linalg_ext.split_reduction` attribute
Once the encoding-based path is the default, the LinalgExt attribute
can be removed in favor of the encoding-driven flow.
A concrete IR walkthrough across pipeline stages, for a static
`32x40960 × 40960x32` matmul, can be found at
rfc_split_k_v2_ir_walkthrough.md.