[asm] Replace RawOp with typed SALU ops + PackOp in SRD construction by Hardcode84 · Pull Request #1188 · iree-org/wave

Hardcode84 · 2026-03-25T13:00:32Z

Eliminate all raw assembly string manipulation from SRD (Shader Resource Descriptor) construction, replacing it with typed MLIR ops visible to the register allocator.

What changed

Core refactor: SRD construction in handleFatRawBufferCast, emitSRDBaseAdjustment, and handleMakeBufferRsrc now uses ExtractOp to decompose source SRDs, typed SALU ops (S_MOV_B32, S_AND_B32, S_OR_B32, S_ADD_U32, S_ADDC_U32) to build individual words, and PackOp to assemble the result. Regalloc assigns contiguous registers to pack inputs, eliminating all hardcoded register index arithmetic and string-based RawOp("s_mov_b32 sN, ...") patterns.

Prologue SRD setup (in-place writes to precolored registers) uses typed ops targeting PSRegType + DCEProtectOp instead of RawOp.

Infrastructure: srdValueMap/getSRDValue/setSRDValue added to TranslationContext for SSA-based SRD lookup. emitSrdNumRecords split into buildSrdWord2 (returns free Value for PackOp) and patchSrdWord2InPlace. Dead getNextSwizzleSRDIndex/getFirstFreeSgprAfterSRDs removed.

Follow-up fixes

Moving from opaque RawOp strings to SSA-visible typed ops exposed several interactions with existing passes:

SRD lookup in stores: handleVectorStore had an inline SRD lookup checking getSRDIndex but not getSRDValue. Replaced with lookupSRD().
CSE duplicate PackOp inputs: ScopedCSE merges identical S_MOV_B32 ops, creating shared SSA values across PackOp slots. Liveness pass now inserts copies so each slot gets a distinct physical register.
SRD liveness across prologue-to-adjustment gap: emitSRDBaseAdjustment used disconnected PrecoloredSRegOp instead of the prologue's SSA value, allowing regalloc to reuse kernel-arg SRD registers. Propagated srcSrdValue through PendingSRDBaseAdjust and connected S_LOAD_DWORDX2 results to S_MOV_B64 sources.
VALU->readfirstlane hazard detection: Hazard mitigation missed conflicts when producer/consumer were different SSA values at the same physical VGPR, or when non-emitting ops sat between them. Now compares physical register indices and skips non-emitting ops.
ExtractOp-into-PackOp register conflict: ExtractOp is non-emitting (aliases source sub-register). When used as PackOp input, the post-allocation pass could assign it to a different register than the extract source. Liveness pass now inserts explicit copies for ExtractOp inputs to PackOps.

waveasm/lib/Transforms/TranslateFromLLVMDialect.cpp

waveasm/lib/Transforms/HazardMitigation.cpp

harsh-nod · 2026-03-27T15:23:17Z

waveasm/lib/Transforms/HazardMitigation.cpp

+/// These ops are lowered to register aliases or eliminated entirely,
+/// so they don't create real inter-instruction delays.
+bool isNonEmittingOp(Operation *op) {
+  return isa<ExtractOp, PackOp, PrecoloredSRegOp, PrecoloredVRegOp, ConstantOp,


What do you think about replacing this with an OpTrait / Interface?

Added a TODO for now

waveasm/lib/Transforms/HazardMitigation.cpp

harsh-nod · 2026-03-27T15:27:12Z

waveasm/lib/Transforms/HazardMitigation.cpp

+      Operation *op = ops[i];
+
+      if (isReadfirstlaneOp(op)) {
+        // Walk backwards to find the nearest emitting predecessor.


In the backward scan, how do we know that a candidate is in the same region as the current op?

multiregion candidates can actually create real hazards (e.g. last op inside the if can create hazard for the first op after the if), fixed and added some tests

waveasm/lib/Transforms/handlers/AMDGPUHandlers.cpp

waveasm/lib/Transforms/LinearScanPass.cpp

harsh-nod · 2026-03-27T15:33:47Z

waveasm/lib/Transforms/handlers/AMDGPUHandlers.cpp

            srdBase = indexAttr.getInt();
      if (srdBase >= 0)
-        emitSrdNumRecords(builder, loc, srdBase, op, ctx);
+        patchSrdWord2InPlace(builder, loc, srdBase, op, ctx);


In the suppress-swizzle path, patchSrdWord2InPlace attempts to extract a physical SGPR base index from the source SRD. If the source SRD was built via PackOp (type SRegType, defining op waveasm.pack), both the PSRegType cast and the precolored.sreg name check fail. srdBase stays -1 and the patch is silently skipped.

When hasNonMaxValidBytes && !adj && suppressWord3Swizzle is true, the caller expects word 2 (num_records) to be overwritten with tighter OOB bounds. Skipping this leaves the source SRD's num_records intact (e.g., 0xFFFFFFFF), potentially making out-of-bounds accesses appear in-bounds.

Also, this path modifies a sub-register in-place without going through PackOp, creating a split personality where some SRD words are SSA-tracked and some are imperatively modified.

Should be fixed (workarounded) for now. Going forward, we will probably need insert op too to update PackOp result inplace.

makes sense

harsh-nod · 2026-03-27T16:33:28Z

The removal of the RawOp is a big win here, thanks for fixing that. Most of the comments are minor, so should be good to land after addressing those.

Eliminate all raw assembly string manipulation from SRD (Shader Resource Descriptor) construction, replacing it with typed MLIR ops that are visible to the register allocator. For new SRD construction (handleFatRawBufferCast, emitSRDBaseAdjustment, handleMakeBufferRsrc): extract source SRD words via ExtractOp, build each word with S_MOV_B32/S_AND_B32/S_OR_B32/S_ADD_U32/S_ADDC_U32, and assemble via PackOp. Regalloc assigns contiguous registers to pack inputs, eliminating hardcoded register index arithmetic. For prologue SRD setup (in-place writes to precolored registers): replace RawOp with typed S_MOV_B32/S_MOV_B64 targeting PSRegType + DCEProtectOp. Infrastructure changes: - Add srdValueMap/getSRDValue/setSRDValue to TranslationContext for SSA-based SRD lookup (parallel to the existing index-based API). - Split emitSrdNumRecords into buildSrdWord2 (returns free Value for PackOp) and patchSrdWord2InPlace (writes to pinned register). - Remove dead getNextSwizzleSRDIndex and getFirstFreeSgprAfterSRDs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

…actor The previous commit replaced RawOp with typed SALU ops + PackOp for SRD construction, introducing two bugs: 1. handleVectorStore had an inline SRD lookup that checked getSRDIndex but not getSRDValue. Since emitSRDBaseAdjustment now uses setSRDValue, subsequent stores fell back to the unadjusted prologue SRD. Fix by using lookupSRD(), matching the load and masked-store handlers. 2. ScopedCSE can merge identical S_MOV_B32 ops across PackOps, creating a shared SSA value. Since one value can only have one physical register, the last PackOp assignment won, leaving earlier PackOps with an uninitialized SRD word. Fix by detecting duplicate PackOp inputs in the liveness pass and inserting copies (S_MOV_B32/V_MOV_B32) to give each slot its own distinct value. Made-with: Cursor Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

Add two regression tests for bugs fixed in e9a9fca: 1. buffer-ops-store-reuses-srd.mlir: vector.store on the same adjusted memref as a prior vector.load must reuse the PackOp-produced SRD via getSRDValue, not fall back to the unadjusted prologue SRD. 2. lit-regalloc-pack-cse-dedup.mlir: when a single SSA value is shared across multiple PackOp slots (same value across packs, or duplicated within one pack), the liveness pass must insert copies so each slot gets a distinct physical register. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

…ection Two issues caused the bufops e2e test to fail after the typed-op refactor: 1. SRD base adjustment used disconnected PrecoloredSRegOp instead of the prologue's SSA value, allowing the register allocator to reuse kernel-arg SRD registers. Propagate srcSrdValue through PendingSRDBaseAdjust and connect S_LOAD_DWORDX2 results to S_MOV_B64 sources in the prologue. 2. Hazard mitigation missed VALU→v_readfirstlane conflicts when (a) the producer and consumer were different SSA values at the same physical VGPR, or (b) non-emitting ops (extract, constant) sat between them in the MLIR. Compare physical register indices instead of SSA identity, and skip non-emitting ops when searching for the preceding instruction. Made-with: Cursor Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

ExtractOp is non-emitting (aliases source[offset] without a copy instruction). When used as a PackOp input, the post-allocation pass assigns it to result_base + i, which can differ from the extract's physical register, silently corrupting the packed value. Insert an explicit s_mov_b32/v_mov_b32 copy for ExtractOp inputs to PackOps, matching the existing handling for duplicate inputs. Fixes test_dynamic_copy_water_waveasm[shape0] where the SRD stride word (word 2) was left with garbage after the register allocator placed the new SRD at a different base than the source. Made-with: Cursor Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

…comments - Add TODO for threading SCC from S_ADD_U32 into S_ADDC_U32 (needs TableGen change to add SCC input operand). - Document VRegType-vs-PVRegType handling in hasVGPRConflict: virtual registers have no physical index, so mixed pairs rely on SSA identity. - Update hazard-mitigation.mlir: first test uses pvreg (post-regalloc form), gfx1250 test documents SSA-identity fallback path. - Add TODO for replacing isNonEmittingOp with an OpTrait/Interface. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

Collapse getVGPRDefs/getVGPRUses/hasVGPRConflict into a single function that iterates over producer results and consumer operands directly. Both sets are 1-3 elements, so a plain nested loop beats DenseSet hashing overhead. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

The backward scan for preceding emitting instructions now skips YieldOp (non-emitting terminator), allowing it to find VALU ops inside if/loop bodies that are the actual predecessors in execution. Extract findPrecedingEmittingOp helper to deduplicate the backward scan in both the VALU->readfirstlane and Trans->VALU hazard checks. Add region boundary tests: if-only, if/else, loop, and nested if. Document known limitation: flattened-list scan is single-path and may miss hazards from the then-branch when else is last in the list (safe in practice due to implicit branch instructions). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

Factor out the repeated ConstantOp + S_MOV_B32 pattern into a static helper. Used in buildSrdWord2 (two sites) and word3 flags construction. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

If the liveness pass fails to deduplicate shared SSA values across PackOp slots, the last setPhysReg call silently wins and earlier slots get the wrong register. Assert this invariant in debug builds. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

When the suppress-swizzle path encountered a PackOp-built source SRD (SRegType, no physical register index), it silently skipped the num_records patch, leaving stale OOB bounds. Two fixes: 1. The suppress path now falls through to the full SRD rebuild when it cannot extract a physical register base (PackOp SRDs). 2. The full rebuild now recognizes srcMapped as a usable 4-wide SRD when all other lookups (getSRDValue, getSRDIndex, precolored) fail. Add regression test: chained fat_raw_buffer_cast where the second cast enters the suppress path with a PackOp source and tight validBytes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

Rewrite Arith/Affine/AMDGPU/MemRef handlers to use auto-select emit helpers (emitAdd, emitMul, emitScalarCmp, etc.) that route scalar operands through SALU instructions. Keep scalar kernel arguments in SGPRs and promote uniform address arithmetic to SALU, reducing VGPR pressure. Fuse arith.cmpi + arith.select into s_cmp + s_cselect. Rewrite SRD prologue with all-or-nothing preloading strategy and overflow SGPR slots. Modify SRDs in-place to reduce SGPR pressure. Add handleROCDLSchedBarrier and maskedload without mask support. Replace SRD prologue RawOps with typed S_LOAD_DWORDX2, S_MOV_B64, S_MOV_B32, and ExtractOp via SSA value reuse pattern. Adapt upstream PackOp-based SRD construction (PR iree-org#1188) to use explicit SCCType for S_AND_B32, S_OR_B32, and S_ADDC_U32 call sites. Made-with: Cursor

Rewrite Arith/Affine/AMDGPU/MemRef handlers to use auto-select emit helpers (emitAdd, emitMul, emitScalarCmp, etc.) that route scalar operands through SALU instructions. Keep scalar kernel arguments in SGPRs and promote uniform address arithmetic to SALU, reducing VGPR pressure. Fuse arith.cmpi + arith.select into s_cmp + s_cselect. Rewrite SRD prologue with all-or-nothing preloading strategy and overflow SGPR slots. Modify SRDs in-place to reduce SGPR pressure. Add handleROCDLSchedBarrier and maskedload without mask support. Replace SRD prologue RawOps with typed S_LOAD_DWORDX2, S_MOV_B64, S_MOV_B32, and ExtractOp via SSA value reuse pattern. Adapt upstream PackOp-based SRD construction (PR iree-org#1188) to use explicit SCCType for S_AND_B32, S_OR_B32, and S_ADDC_U32 call sites. Made-with: Cursor Signed-off-by: Sanket Pandit <sanket.pandit@amd.com>

Hardcode84 force-pushed the waveasm-srd-contruction-refac branch from a527eda to f06e145 Compare March 26, 2026 12:45

Hardcode84 changed the title ~~WIP: Replace RawOp with typed SALU ops + PackOp in SRD construction~~ Replace RawOp with typed SALU ops + PackOp in SRD construction Mar 26, 2026

Hardcode84 changed the title ~~Replace RawOp with typed SALU ops + PackOp in SRD construction~~ [asm] Replace RawOp with typed SALU ops + PackOp in SRD construction Mar 26, 2026

Hardcode84 requested a review from harsh-nod March 26, 2026 13:22