Skip to content

[asm] Replace RawOp with typed SALU ops + PackOp in SRD construction#1188

Merged
Hardcode84 merged 12 commits intoiree-org:mainfrom
Hardcode84:waveasm-srd-contruction-refac
Mar 30, 2026
Merged

[asm] Replace RawOp with typed SALU ops + PackOp in SRD construction#1188
Hardcode84 merged 12 commits intoiree-org:mainfrom
Hardcode84:waveasm-srd-contruction-refac

Conversation

@Hardcode84
Copy link
Copy Markdown
Contributor

@Hardcode84 Hardcode84 commented Mar 25, 2026

Eliminate all raw assembly string manipulation from SRD (Shader Resource Descriptor) construction, replacing it with typed MLIR ops visible to the register allocator.

What changed

Core refactor: SRD construction in handleFatRawBufferCast, emitSRDBaseAdjustment, and handleMakeBufferRsrc now uses ExtractOp to decompose source SRDs, typed SALU ops (S_MOV_B32, S_AND_B32, S_OR_B32, S_ADD_U32, S_ADDC_U32) to build individual words, and PackOp to assemble the result. Regalloc assigns contiguous registers to pack inputs, eliminating all hardcoded register index arithmetic and string-based RawOp("s_mov_b32 sN, ...") patterns.

Prologue SRD setup (in-place writes to precolored registers) uses typed ops targeting PSRegType + DCEProtectOp instead of RawOp.

Infrastructure: srdValueMap/getSRDValue/setSRDValue added to TranslationContext for SSA-based SRD lookup. emitSrdNumRecords split into buildSrdWord2 (returns free Value for PackOp) and patchSrdWord2InPlace. Dead getNextSwizzleSRDIndex/getFirstFreeSgprAfterSRDs removed.

Follow-up fixes

Moving from opaque RawOp strings to SSA-visible typed ops exposed several interactions with existing passes:

  • SRD lookup in stores: handleVectorStore had an inline SRD lookup checking getSRDIndex but not getSRDValue. Replaced with lookupSRD().

  • CSE duplicate PackOp inputs: ScopedCSE merges identical S_MOV_B32 ops, creating shared SSA values across PackOp slots. Liveness pass now inserts copies so each slot gets a distinct physical register.

  • SRD liveness across prologue-to-adjustment gap: emitSRDBaseAdjustment used disconnected PrecoloredSRegOp instead of the prologue's SSA value, allowing regalloc to reuse kernel-arg SRD registers. Propagated srcSrdValue through PendingSRDBaseAdjust and connected S_LOAD_DWORDX2 results to S_MOV_B64 sources.

  • VALU->readfirstlane hazard detection: Hazard mitigation missed conflicts when producer/consumer were different SSA values at the same physical VGPR, or when non-emitting ops sat between them. Now compares physical register indices and skips non-emitting ops.

  • ExtractOp-into-PackOp register conflict: ExtractOp is non-emitting (aliases source sub-register). When used as PackOp input, the post-allocation pass could assign it to a different register than the extract source. Liveness pass now inserts explicit copies for ExtractOp inputs to PackOps.

@Hardcode84 Hardcode84 force-pushed the waveasm-srd-contruction-refac branch from a527eda to f06e145 Compare March 26, 2026 12:45
@Hardcode84 Hardcode84 changed the title WIP: Replace RawOp with typed SALU ops + PackOp in SRD construction Replace RawOp with typed SALU ops + PackOp in SRD construction Mar 26, 2026
@Hardcode84 Hardcode84 changed the title Replace RawOp with typed SALU ops + PackOp in SRD construction [asm] Replace RawOp with typed SALU ops + PackOp in SRD construction Mar 26, 2026
@Hardcode84 Hardcode84 requested a review from harsh-nod March 26, 2026 13:22
/// These ops are lowered to register aliases or eliminated entirely,
/// so they don't create real inter-instruction delays.
bool isNonEmittingOp(Operation *op) {
return isa<ExtractOp, PackOp, PrecoloredSRegOp, PrecoloredVRegOp, ConstantOp,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about replacing this with an OpTrait / Interface?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a TODO for now

Operation *op = ops[i];

if (isReadfirstlaneOp(op)) {
// Walk backwards to find the nearest emitting predecessor.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the backward scan, how do we know that a candidate is in the same region as the current op?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

multiregion candidates can actually create real hazards (e.g. last op inside the if can create hazard for the first op after the if), fixed and added some tests

srdBase = indexAttr.getInt();
if (srdBase >= 0)
emitSrdNumRecords(builder, loc, srdBase, op, ctx);
patchSrdWord2InPlace(builder, loc, srdBase, op, ctx);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the suppress-swizzle path, patchSrdWord2InPlace attempts to extract a physical SGPR base index from the source SRD. If the source SRD was built via PackOp (type SRegType, defining op waveasm.pack), both the PSRegType cast and the precolored.sreg name check fail. srdBase stays -1 and the patch is silently skipped.

When hasNonMaxValidBytes && !adj && suppressWord3Swizzle is true, the caller expects word 2 (num_records) to be overwritten with tighter OOB bounds. Skipping this leaves the source SRD's num_records intact (e.g., 0xFFFFFFFF), potentially making out-of-bounds accesses appear in-bounds.

Also, this path modifies a sub-register in-place without going through PackOp, creating a split personality where some SRD words are SSA-tracked and some are imperatively modified.

Copy link
Copy Markdown
Contributor Author

@Hardcode84 Hardcode84 Mar 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be fixed (workarounded) for now. Going forward, we will probably need insert op too to update PackOp result inplace.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense

@harsh-nod
Copy link
Copy Markdown
Collaborator

The removal of the RawOp is a big win here, thanks for fixing that. Most of the comments are minor, so should be good to land after addressing those.

Hardcode84 and others added 12 commits March 29, 2026 12:43
Eliminate all raw assembly string manipulation from SRD (Shader Resource
Descriptor) construction, replacing it with typed MLIR ops that are
visible to the register allocator.

For new SRD construction (handleFatRawBufferCast, emitSRDBaseAdjustment,
handleMakeBufferRsrc): extract source SRD words via ExtractOp, build
each word with S_MOV_B32/S_AND_B32/S_OR_B32/S_ADD_U32/S_ADDC_U32, and
assemble via PackOp. Regalloc assigns contiguous registers to pack
inputs, eliminating hardcoded register index arithmetic.

For prologue SRD setup (in-place writes to precolored registers): replace
RawOp with typed S_MOV_B32/S_MOV_B64 targeting PSRegType + DCEProtectOp.

Infrastructure changes:
- Add srdValueMap/getSRDValue/setSRDValue to TranslationContext for
  SSA-based SRD lookup (parallel to the existing index-based API).
- Split emitSrdNumRecords into buildSrdWord2 (returns free Value for
  PackOp) and patchSrdWord2InPlace (writes to pinned register).
- Remove dead getNextSwizzleSRDIndex and getFirstFreeSgprAfterSRDs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
…actor

The previous commit replaced RawOp with typed SALU ops + PackOp for SRD
construction, introducing two bugs:

1. handleVectorStore had an inline SRD lookup that checked getSRDIndex
   but not getSRDValue. Since emitSRDBaseAdjustment now uses setSRDValue,
   subsequent stores fell back to the unadjusted prologue SRD. Fix by
   using lookupSRD(), matching the load and masked-store handlers.

2. ScopedCSE can merge identical S_MOV_B32 ops across PackOps, creating
   a shared SSA value. Since one value can only have one physical
   register, the last PackOp assignment won, leaving earlier PackOps
   with an uninitialized SRD word. Fix by detecting duplicate PackOp
   inputs in the liveness pass and inserting copies (S_MOV_B32/V_MOV_B32)
   to give each slot its own distinct value.

Made-with: Cursor
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Add two regression tests for bugs fixed in e9a9fca:

1. buffer-ops-store-reuses-srd.mlir: vector.store on the same adjusted
   memref as a prior vector.load must reuse the PackOp-produced SRD via
   getSRDValue, not fall back to the unadjusted prologue SRD.

2. lit-regalloc-pack-cse-dedup.mlir: when a single SSA value is shared
   across multiple PackOp slots (same value across packs, or duplicated
   within one pack), the liveness pass must insert copies so each slot
   gets a distinct physical register.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
…ection

Two issues caused the bufops e2e test to fail after the typed-op refactor:

1. SRD base adjustment used disconnected PrecoloredSRegOp instead of the
   prologue's SSA value, allowing the register allocator to reuse
   kernel-arg SRD registers. Propagate srcSrdValue through
   PendingSRDBaseAdjust and connect S_LOAD_DWORDX2 results to S_MOV_B64
   sources in the prologue.

2. Hazard mitigation missed VALU→v_readfirstlane conflicts when (a) the
   producer and consumer were different SSA values at the same physical
   VGPR, or (b) non-emitting ops (extract, constant) sat between them in
   the MLIR. Compare physical register indices instead of SSA identity,
   and skip non-emitting ops when searching for the preceding instruction.

Made-with: Cursor
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
ExtractOp is non-emitting (aliases source[offset] without a copy
instruction). When used as a PackOp input, the post-allocation pass
assigns it to result_base + i, which can differ from the extract's
physical register, silently corrupting the packed value.

Insert an explicit s_mov_b32/v_mov_b32 copy for ExtractOp inputs to
PackOps, matching the existing handling for duplicate inputs.

Fixes test_dynamic_copy_water_waveasm[shape0] where the SRD stride
word (word 2) was left with garbage after the register allocator
placed the new SRD at a different base than the source.

Made-with: Cursor
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
…comments

- Add TODO for threading SCC from S_ADD_U32 into S_ADDC_U32 (needs
  TableGen change to add SCC input operand).
- Document VRegType-vs-PVRegType handling in hasVGPRConflict: virtual
  registers have no physical index, so mixed pairs rely on SSA identity.
- Update hazard-mitigation.mlir: first test uses pvreg (post-regalloc
  form), gfx1250 test documents SSA-identity fallback path.
- Add TODO for replacing isNonEmittingOp with an OpTrait/Interface.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Collapse getVGPRDefs/getVGPRUses/hasVGPRConflict into a single function
that iterates over producer results and consumer operands directly.
Both sets are 1-3 elements, so a plain nested loop beats DenseSet
hashing overhead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
The backward scan for preceding emitting instructions now skips
YieldOp (non-emitting terminator), allowing it to find VALU ops
inside if/loop bodies that are the actual predecessors in execution.

Extract findPrecedingEmittingOp helper to deduplicate the backward
scan in both the VALU->readfirstlane and Trans->VALU hazard checks.

Add region boundary tests: if-only, if/else, loop, and nested if.
Document known limitation: flattened-list scan is single-path and
may miss hazards from the then-branch when else is last in the list
(safe in practice due to implicit branch instructions).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Factor out the repeated ConstantOp + S_MOV_B32 pattern into a static
helper. Used in buildSrdWord2 (two sites) and word3 flags construction.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
If the liveness pass fails to deduplicate shared SSA values across
PackOp slots, the last setPhysReg call silently wins and earlier
slots get the wrong register. Assert this invariant in debug builds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
When the suppress-swizzle path encountered a PackOp-built source SRD
(SRegType, no physical register index), it silently skipped the
num_records patch, leaving stale OOB bounds.

Two fixes:
1. The suppress path now falls through to the full SRD rebuild when
   it cannot extract a physical register base (PackOp SRDs).
2. The full rebuild now recognizes srcMapped as a usable 4-wide SRD
   when all other lookups (getSRDValue, getSRDIndex, precolored) fail.

Add regression test: chained fat_raw_buffer_cast where the second cast
enters the suppress path with a PackOp source and tight validBytes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
@Hardcode84 Hardcode84 merged commit e94fc5a into iree-org:main Mar 30, 2026
19 checks passed
@Hardcode84 Hardcode84 deleted the waveasm-srd-contruction-refac branch March 30, 2026 19:02
panditsa added a commit to panditsa/wave that referenced this pull request Mar 31, 2026
Rewrite Arith/Affine/AMDGPU/MemRef handlers to use auto-select emit
helpers (emitAdd, emitMul, emitScalarCmp, etc.) that route scalar
operands through SALU instructions. Keep scalar kernel arguments in
SGPRs and promote uniform address arithmetic to SALU, reducing VGPR
pressure. Fuse arith.cmpi + arith.select into s_cmp + s_cselect.
Rewrite SRD prologue with all-or-nothing preloading strategy and
overflow SGPR slots. Modify SRDs in-place to reduce SGPR pressure.
Add handleROCDLSchedBarrier and maskedload without mask support.

Replace SRD prologue RawOps with typed S_LOAD_DWORDX2, S_MOV_B64,
S_MOV_B32, and ExtractOp via SSA value reuse pattern. Adapt upstream
PackOp-based SRD construction (PR iree-org#1188) to use explicit SCCType for
S_AND_B32, S_OR_B32, and S_ADDC_U32 call sites.

Made-with: Cursor
panditsa added a commit to panditsa/wave that referenced this pull request Apr 6, 2026
Rewrite Arith/Affine/AMDGPU/MemRef handlers to use auto-select emit
helpers (emitAdd, emitMul, emitScalarCmp, etc.) that route scalar
operands through SALU instructions. Keep scalar kernel arguments in
SGPRs and promote uniform address arithmetic to SALU, reducing VGPR
pressure. Fuse arith.cmpi + arith.select into s_cmp + s_cselect.
Rewrite SRD prologue with all-or-nothing preloading strategy and
overflow SGPR slots. Modify SRDs in-place to reduce SGPR pressure.
Add handleROCDLSchedBarrier and maskedload without mask support.

Replace SRD prologue RawOps with typed S_LOAD_DWORDX2, S_MOV_B64,
S_MOV_B32, and ExtractOp via SSA value reuse pattern. Adapt upstream
PackOp-based SRD construction (PR iree-org#1188) to use explicit SCCType for
S_AND_B32, S_OR_B32, and S_ADDC_U32 call sites.

Made-with: Cursor
panditsa added a commit to panditsa/wave that referenced this pull request Apr 6, 2026
Rewrite Arith/Affine/AMDGPU/MemRef handlers to use auto-select emit
helpers (emitAdd, emitMul, emitScalarCmp, etc.) that route scalar
operands through SALU instructions. Keep scalar kernel arguments in
SGPRs and promote uniform address arithmetic to SALU, reducing VGPR
pressure. Fuse arith.cmpi + arith.select into s_cmp + s_cselect.
Rewrite SRD prologue with all-or-nothing preloading strategy and
overflow SGPR slots. Modify SRDs in-place to reduce SGPR pressure.
Add handleROCDLSchedBarrier and maskedload without mask support.

Replace SRD prologue RawOps with typed S_LOAD_DWORDX2, S_MOV_B64,
S_MOV_B32, and ExtractOp via SSA value reuse pattern. Adapt upstream
PackOp-based SRD construction (PR iree-org#1188) to use explicit SCCType for
S_AND_B32, S_OR_B32, and S_ADDC_U32 call sites.

Made-with: Cursor
Signed-off-by: Sanket Pandit <sanket.pandit@amd.com>
panditsa added a commit to panditsa/wave that referenced this pull request Apr 7, 2026
Rewrite Arith/Affine/AMDGPU/MemRef handlers to use auto-select emit
helpers (emitAdd, emitMul, emitScalarCmp, etc.) that route scalar
operands through SALU instructions. Keep scalar kernel arguments in
SGPRs and promote uniform address arithmetic to SALU, reducing VGPR
pressure. Fuse arith.cmpi + arith.select into s_cmp + s_cselect.
Rewrite SRD prologue with all-or-nothing preloading strategy and
overflow SGPR slots. Modify SRDs in-place to reduce SGPR pressure.
Add handleROCDLSchedBarrier and maskedload without mask support.

Replace SRD prologue RawOps with typed S_LOAD_DWORDX2, S_MOV_B64,
S_MOV_B32, and ExtractOp via SSA value reuse pattern. Adapt upstream
PackOp-based SRD construction (PR iree-org#1188) to use explicit SCCType for
S_AND_B32, S_OR_B32, and S_ADDC_U32 call sites.

Made-with: Cursor
Signed-off-by: Sanket Pandit <sanket.pandit@amd.com>
panditsa added a commit to panditsa/wave that referenced this pull request Apr 10, 2026
Rewrite Arith/Affine/AMDGPU/MemRef handlers to use auto-select emit
helpers (emitAdd, emitMul, emitScalarCmp, etc.) that route scalar
operands through SALU instructions. Keep scalar kernel arguments in
SGPRs and promote uniform address arithmetic to SALU, reducing VGPR
pressure. Fuse arith.cmpi + arith.select into s_cmp + s_cselect.
Rewrite SRD prologue with all-or-nothing preloading strategy and
overflow SGPR slots. Modify SRDs in-place to reduce SGPR pressure.
Add handleROCDLSchedBarrier and maskedload without mask support.

Replace SRD prologue RawOps with typed S_LOAD_DWORDX2, S_MOV_B64,
S_MOV_B32, and ExtractOp via SSA value reuse pattern. Adapt upstream
PackOp-based SRD construction (PR iree-org#1188) to use explicit SCCType for
S_AND_B32, S_OR_B32, and S_ADDC_U32 call sites.

Made-with: Cursor
Signed-off-by: Sanket Pandit <sanket.pandit@amd.com>
panditsa added a commit to panditsa/wave that referenced this pull request Apr 10, 2026
Rewrite Arith/Affine/AMDGPU/MemRef handlers to use auto-select emit
helpers (emitAdd, emitMul, emitScalarCmp, etc.) that route scalar
operands through SALU instructions. Keep scalar kernel arguments in
SGPRs and promote uniform address arithmetic to SALU, reducing VGPR
pressure. Fuse arith.cmpi + arith.select into s_cmp + s_cselect.
Rewrite SRD prologue with all-or-nothing preloading strategy and
overflow SGPR slots. Modify SRDs in-place to reduce SGPR pressure.
Add handleROCDLSchedBarrier and maskedload without mask support.

Replace SRD prologue RawOps with typed S_LOAD_DWORDX2, S_MOV_B64,
S_MOV_B32, and ExtractOp via SSA value reuse pattern. Adapt upstream
PackOp-based SRD construction (PR iree-org#1188) to use explicit SCCType for
S_AND_B32, S_OR_B32, and S_ADDC_U32 call sites.

Made-with: Cursor
Signed-off-by: Sanket Pandit <sanket.pandit@amd.com>
panditsa added a commit to panditsa/wave that referenced this pull request Apr 10, 2026
Rewrite Arith/Affine/AMDGPU/MemRef handlers to use auto-select emit
helpers (emitAdd, emitMul, emitScalarCmp, etc.) that route scalar
operands through SALU instructions. Keep scalar kernel arguments in
SGPRs and promote uniform address arithmetic to SALU, reducing VGPR
pressure. Fuse arith.cmpi + arith.select into s_cmp + s_cselect.
Rewrite SRD prologue with all-or-nothing preloading strategy and
overflow SGPR slots. Modify SRDs in-place to reduce SGPR pressure.
Add handleROCDLSchedBarrier and maskedload without mask support.

Replace SRD prologue RawOps with typed S_LOAD_DWORDX2, S_MOV_B64,
S_MOV_B32, and ExtractOp via SSA value reuse pattern. Adapt upstream
PackOp-based SRD construction (PR iree-org#1188) to use explicit SCCType for
S_AND_B32, S_OR_B32, and S_ADDC_U32 call sites.

Made-with: Cursor
Signed-off-by: Sanket Pandit <sanket.pandit@amd.com>
panditsa added a commit to panditsa/wave that referenced this pull request Apr 10, 2026
Rewrite Arith/Affine/AMDGPU/MemRef handlers to use auto-select emit
helpers (emitAdd, emitMul, emitScalarCmp, etc.) that route scalar
operands through SALU instructions. Keep scalar kernel arguments in
SGPRs and promote uniform address arithmetic to SALU, reducing VGPR
pressure. Fuse arith.cmpi + arith.select into s_cmp + s_cselect.
Rewrite SRD prologue with all-or-nothing preloading strategy and
overflow SGPR slots. Modify SRDs in-place to reduce SGPR pressure.
Add handleROCDLSchedBarrier and maskedload without mask support.

Replace SRD prologue RawOps with typed S_LOAD_DWORDX2, S_MOV_B64,
S_MOV_B32, and ExtractOp via SSA value reuse pattern. Adapt upstream
PackOp-based SRD construction (PR iree-org#1188) to use explicit SCCType for
S_AND_B32, S_OR_B32, and S_ADDC_U32 call sites.

Made-with: Cursor
Signed-off-by: Sanket Pandit <sanket.pandit@amd.com>
panditsa added a commit to panditsa/wave that referenced this pull request Apr 10, 2026
Rewrite Arith/Affine/AMDGPU/MemRef handlers to use auto-select emit
helpers (emitAdd, emitMul, emitScalarCmp, etc.) that route scalar
operands through SALU instructions. Keep scalar kernel arguments in
SGPRs and promote uniform address arithmetic to SALU, reducing VGPR
pressure. Fuse arith.cmpi + arith.select into s_cmp + s_cselect.
Rewrite SRD prologue with all-or-nothing preloading strategy and
overflow SGPR slots. Modify SRDs in-place to reduce SGPR pressure.
Add handleROCDLSchedBarrier and maskedload without mask support.

Replace SRD prologue RawOps with typed S_LOAD_DWORDX2, S_MOV_B64,
S_MOV_B32, and ExtractOp via SSA value reuse pattern. Adapt upstream
PackOp-based SRD construction (PR iree-org#1188) to use explicit SCCType for
S_AND_B32, S_OR_B32, and S_ADDC_U32 call sites.

Made-with: Cursor
Signed-off-by: Sanket Pandit <sanket.pandit@amd.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants