[asm] Replace RawOp with typed SALU ops + PackOp in SRD construction#1188
[asm] Replace RawOp with typed SALU ops + PackOp in SRD construction#1188Hardcode84 merged 12 commits intoiree-org:mainfrom
Conversation
a527eda to
f06e145
Compare
| /// These ops are lowered to register aliases or eliminated entirely, | ||
| /// so they don't create real inter-instruction delays. | ||
| bool isNonEmittingOp(Operation *op) { | ||
| return isa<ExtractOp, PackOp, PrecoloredSRegOp, PrecoloredVRegOp, ConstantOp, |
There was a problem hiding this comment.
What do you think about replacing this with an OpTrait / Interface?
There was a problem hiding this comment.
Added a TODO for now
| Operation *op = ops[i]; | ||
|
|
||
| if (isReadfirstlaneOp(op)) { | ||
| // Walk backwards to find the nearest emitting predecessor. |
There was a problem hiding this comment.
In the backward scan, how do we know that a candidate is in the same region as the current op?
There was a problem hiding this comment.
multiregion candidates can actually create real hazards (e.g. last op inside the if can create hazard for the first op after the if), fixed and added some tests
| srdBase = indexAttr.getInt(); | ||
| if (srdBase >= 0) | ||
| emitSrdNumRecords(builder, loc, srdBase, op, ctx); | ||
| patchSrdWord2InPlace(builder, loc, srdBase, op, ctx); |
There was a problem hiding this comment.
In the suppress-swizzle path, patchSrdWord2InPlace attempts to extract a physical SGPR base index from the source SRD. If the source SRD was built via PackOp (type SRegType, defining op waveasm.pack), both the PSRegType cast and the precolored.sreg name check fail. srdBase stays -1 and the patch is silently skipped.
When hasNonMaxValidBytes && !adj && suppressWord3Swizzle is true, the caller expects word 2 (num_records) to be overwritten with tighter OOB bounds. Skipping this leaves the source SRD's num_records intact (e.g., 0xFFFFFFFF), potentially making out-of-bounds accesses appear in-bounds.
Also, this path modifies a sub-register in-place without going through PackOp, creating a split personality where some SRD words are SSA-tracked and some are imperatively modified.
There was a problem hiding this comment.
Should be fixed (workarounded) for now. Going forward, we will probably need insert op too to update PackOp result inplace.
|
The removal of the RawOp is a big win here, thanks for fixing that. Most of the comments are minor, so should be good to land after addressing those. |
Eliminate all raw assembly string manipulation from SRD (Shader Resource Descriptor) construction, replacing it with typed MLIR ops that are visible to the register allocator. For new SRD construction (handleFatRawBufferCast, emitSRDBaseAdjustment, handleMakeBufferRsrc): extract source SRD words via ExtractOp, build each word with S_MOV_B32/S_AND_B32/S_OR_B32/S_ADD_U32/S_ADDC_U32, and assemble via PackOp. Regalloc assigns contiguous registers to pack inputs, eliminating hardcoded register index arithmetic. For prologue SRD setup (in-place writes to precolored registers): replace RawOp with typed S_MOV_B32/S_MOV_B64 targeting PSRegType + DCEProtectOp. Infrastructure changes: - Add srdValueMap/getSRDValue/setSRDValue to TranslationContext for SSA-based SRD lookup (parallel to the existing index-based API). - Split emitSrdNumRecords into buildSrdWord2 (returns free Value for PackOp) and patchSrdWord2InPlace (writes to pinned register). - Remove dead getNextSwizzleSRDIndex and getFirstFreeSgprAfterSRDs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
…actor The previous commit replaced RawOp with typed SALU ops + PackOp for SRD construction, introducing two bugs: 1. handleVectorStore had an inline SRD lookup that checked getSRDIndex but not getSRDValue. Since emitSRDBaseAdjustment now uses setSRDValue, subsequent stores fell back to the unadjusted prologue SRD. Fix by using lookupSRD(), matching the load and masked-store handlers. 2. ScopedCSE can merge identical S_MOV_B32 ops across PackOps, creating a shared SSA value. Since one value can only have one physical register, the last PackOp assignment won, leaving earlier PackOps with an uninitialized SRD word. Fix by detecting duplicate PackOp inputs in the liveness pass and inserting copies (S_MOV_B32/V_MOV_B32) to give each slot its own distinct value. Made-with: Cursor Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Add two regression tests for bugs fixed in e9a9fca: 1. buffer-ops-store-reuses-srd.mlir: vector.store on the same adjusted memref as a prior vector.load must reuse the PackOp-produced SRD via getSRDValue, not fall back to the unadjusted prologue SRD. 2. lit-regalloc-pack-cse-dedup.mlir: when a single SSA value is shared across multiple PackOp slots (same value across packs, or duplicated within one pack), the liveness pass must insert copies so each slot gets a distinct physical register. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
…ection Two issues caused the bufops e2e test to fail after the typed-op refactor: 1. SRD base adjustment used disconnected PrecoloredSRegOp instead of the prologue's SSA value, allowing the register allocator to reuse kernel-arg SRD registers. Propagate srcSrdValue through PendingSRDBaseAdjust and connect S_LOAD_DWORDX2 results to S_MOV_B64 sources in the prologue. 2. Hazard mitigation missed VALU→v_readfirstlane conflicts when (a) the producer and consumer were different SSA values at the same physical VGPR, or (b) non-emitting ops (extract, constant) sat between them in the MLIR. Compare physical register indices instead of SSA identity, and skip non-emitting ops when searching for the preceding instruction. Made-with: Cursor Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
ExtractOp is non-emitting (aliases source[offset] without a copy instruction). When used as a PackOp input, the post-allocation pass assigns it to result_base + i, which can differ from the extract's physical register, silently corrupting the packed value. Insert an explicit s_mov_b32/v_mov_b32 copy for ExtractOp inputs to PackOps, matching the existing handling for duplicate inputs. Fixes test_dynamic_copy_water_waveasm[shape0] where the SRD stride word (word 2) was left with garbage after the register allocator placed the new SRD at a different base than the source. Made-with: Cursor Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
…comments - Add TODO for threading SCC from S_ADD_U32 into S_ADDC_U32 (needs TableGen change to add SCC input operand). - Document VRegType-vs-PVRegType handling in hasVGPRConflict: virtual registers have no physical index, so mixed pairs rely on SSA identity. - Update hazard-mitigation.mlir: first test uses pvreg (post-regalloc form), gfx1250 test documents SSA-identity fallback path. - Add TODO for replacing isNonEmittingOp with an OpTrait/Interface. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Collapse getVGPRDefs/getVGPRUses/hasVGPRConflict into a single function that iterates over producer results and consumer operands directly. Both sets are 1-3 elements, so a plain nested loop beats DenseSet hashing overhead. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
The backward scan for preceding emitting instructions now skips YieldOp (non-emitting terminator), allowing it to find VALU ops inside if/loop bodies that are the actual predecessors in execution. Extract findPrecedingEmittingOp helper to deduplicate the backward scan in both the VALU->readfirstlane and Trans->VALU hazard checks. Add region boundary tests: if-only, if/else, loop, and nested if. Document known limitation: flattened-list scan is single-path and may miss hazards from the then-branch when else is last in the list (safe in practice due to implicit branch instructions). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Factor out the repeated ConstantOp + S_MOV_B32 pattern into a static helper. Used in buildSrdWord2 (two sites) and word3 flags construction. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
If the liveness pass fails to deduplicate shared SSA values across PackOp slots, the last setPhysReg call silently wins and earlier slots get the wrong register. Assert this invariant in debug builds. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
When the suppress-swizzle path encountered a PackOp-built source SRD (SRegType, no physical register index), it silently skipped the num_records patch, leaving stale OOB bounds. Two fixes: 1. The suppress path now falls through to the full SRD rebuild when it cannot extract a physical register base (PackOp SRDs). 2. The full rebuild now recognizes srcMapped as a usable 4-wide SRD when all other lookups (getSRDValue, getSRDIndex, precolored) fail. Add regression test: chained fat_raw_buffer_cast where the second cast enters the suppress path with a PackOp source and tight validBytes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
568cbe6 to
39e3ed0
Compare
Rewrite Arith/Affine/AMDGPU/MemRef handlers to use auto-select emit helpers (emitAdd, emitMul, emitScalarCmp, etc.) that route scalar operands through SALU instructions. Keep scalar kernel arguments in SGPRs and promote uniform address arithmetic to SALU, reducing VGPR pressure. Fuse arith.cmpi + arith.select into s_cmp + s_cselect. Rewrite SRD prologue with all-or-nothing preloading strategy and overflow SGPR slots. Modify SRDs in-place to reduce SGPR pressure. Add handleROCDLSchedBarrier and maskedload without mask support. Replace SRD prologue RawOps with typed S_LOAD_DWORDX2, S_MOV_B64, S_MOV_B32, and ExtractOp via SSA value reuse pattern. Adapt upstream PackOp-based SRD construction (PR iree-org#1188) to use explicit SCCType for S_AND_B32, S_OR_B32, and S_ADDC_U32 call sites. Made-with: Cursor
Rewrite Arith/Affine/AMDGPU/MemRef handlers to use auto-select emit helpers (emitAdd, emitMul, emitScalarCmp, etc.) that route scalar operands through SALU instructions. Keep scalar kernel arguments in SGPRs and promote uniform address arithmetic to SALU, reducing VGPR pressure. Fuse arith.cmpi + arith.select into s_cmp + s_cselect. Rewrite SRD prologue with all-or-nothing preloading strategy and overflow SGPR slots. Modify SRDs in-place to reduce SGPR pressure. Add handleROCDLSchedBarrier and maskedload without mask support. Replace SRD prologue RawOps with typed S_LOAD_DWORDX2, S_MOV_B64, S_MOV_B32, and ExtractOp via SSA value reuse pattern. Adapt upstream PackOp-based SRD construction (PR iree-org#1188) to use explicit SCCType for S_AND_B32, S_OR_B32, and S_ADDC_U32 call sites. Made-with: Cursor
Rewrite Arith/Affine/AMDGPU/MemRef handlers to use auto-select emit helpers (emitAdd, emitMul, emitScalarCmp, etc.) that route scalar operands through SALU instructions. Keep scalar kernel arguments in SGPRs and promote uniform address arithmetic to SALU, reducing VGPR pressure. Fuse arith.cmpi + arith.select into s_cmp + s_cselect. Rewrite SRD prologue with all-or-nothing preloading strategy and overflow SGPR slots. Modify SRDs in-place to reduce SGPR pressure. Add handleROCDLSchedBarrier and maskedload without mask support. Replace SRD prologue RawOps with typed S_LOAD_DWORDX2, S_MOV_B64, S_MOV_B32, and ExtractOp via SSA value reuse pattern. Adapt upstream PackOp-based SRD construction (PR iree-org#1188) to use explicit SCCType for S_AND_B32, S_OR_B32, and S_ADDC_U32 call sites. Made-with: Cursor Signed-off-by: Sanket Pandit <sanket.pandit@amd.com>
Rewrite Arith/Affine/AMDGPU/MemRef handlers to use auto-select emit helpers (emitAdd, emitMul, emitScalarCmp, etc.) that route scalar operands through SALU instructions. Keep scalar kernel arguments in SGPRs and promote uniform address arithmetic to SALU, reducing VGPR pressure. Fuse arith.cmpi + arith.select into s_cmp + s_cselect. Rewrite SRD prologue with all-or-nothing preloading strategy and overflow SGPR slots. Modify SRDs in-place to reduce SGPR pressure. Add handleROCDLSchedBarrier and maskedload without mask support. Replace SRD prologue RawOps with typed S_LOAD_DWORDX2, S_MOV_B64, S_MOV_B32, and ExtractOp via SSA value reuse pattern. Adapt upstream PackOp-based SRD construction (PR iree-org#1188) to use explicit SCCType for S_AND_B32, S_OR_B32, and S_ADDC_U32 call sites. Made-with: Cursor Signed-off-by: Sanket Pandit <sanket.pandit@amd.com>
Rewrite Arith/Affine/AMDGPU/MemRef handlers to use auto-select emit helpers (emitAdd, emitMul, emitScalarCmp, etc.) that route scalar operands through SALU instructions. Keep scalar kernel arguments in SGPRs and promote uniform address arithmetic to SALU, reducing VGPR pressure. Fuse arith.cmpi + arith.select into s_cmp + s_cselect. Rewrite SRD prologue with all-or-nothing preloading strategy and overflow SGPR slots. Modify SRDs in-place to reduce SGPR pressure. Add handleROCDLSchedBarrier and maskedload without mask support. Replace SRD prologue RawOps with typed S_LOAD_DWORDX2, S_MOV_B64, S_MOV_B32, and ExtractOp via SSA value reuse pattern. Adapt upstream PackOp-based SRD construction (PR iree-org#1188) to use explicit SCCType for S_AND_B32, S_OR_B32, and S_ADDC_U32 call sites. Made-with: Cursor Signed-off-by: Sanket Pandit <sanket.pandit@amd.com>
Rewrite Arith/Affine/AMDGPU/MemRef handlers to use auto-select emit helpers (emitAdd, emitMul, emitScalarCmp, etc.) that route scalar operands through SALU instructions. Keep scalar kernel arguments in SGPRs and promote uniform address arithmetic to SALU, reducing VGPR pressure. Fuse arith.cmpi + arith.select into s_cmp + s_cselect. Rewrite SRD prologue with all-or-nothing preloading strategy and overflow SGPR slots. Modify SRDs in-place to reduce SGPR pressure. Add handleROCDLSchedBarrier and maskedload without mask support. Replace SRD prologue RawOps with typed S_LOAD_DWORDX2, S_MOV_B64, S_MOV_B32, and ExtractOp via SSA value reuse pattern. Adapt upstream PackOp-based SRD construction (PR iree-org#1188) to use explicit SCCType for S_AND_B32, S_OR_B32, and S_ADDC_U32 call sites. Made-with: Cursor Signed-off-by: Sanket Pandit <sanket.pandit@amd.com>
Rewrite Arith/Affine/AMDGPU/MemRef handlers to use auto-select emit helpers (emitAdd, emitMul, emitScalarCmp, etc.) that route scalar operands through SALU instructions. Keep scalar kernel arguments in SGPRs and promote uniform address arithmetic to SALU, reducing VGPR pressure. Fuse arith.cmpi + arith.select into s_cmp + s_cselect. Rewrite SRD prologue with all-or-nothing preloading strategy and overflow SGPR slots. Modify SRDs in-place to reduce SGPR pressure. Add handleROCDLSchedBarrier and maskedload without mask support. Replace SRD prologue RawOps with typed S_LOAD_DWORDX2, S_MOV_B64, S_MOV_B32, and ExtractOp via SSA value reuse pattern. Adapt upstream PackOp-based SRD construction (PR iree-org#1188) to use explicit SCCType for S_AND_B32, S_OR_B32, and S_ADDC_U32 call sites. Made-with: Cursor Signed-off-by: Sanket Pandit <sanket.pandit@amd.com>
Rewrite Arith/Affine/AMDGPU/MemRef handlers to use auto-select emit helpers (emitAdd, emitMul, emitScalarCmp, etc.) that route scalar operands through SALU instructions. Keep scalar kernel arguments in SGPRs and promote uniform address arithmetic to SALU, reducing VGPR pressure. Fuse arith.cmpi + arith.select into s_cmp + s_cselect. Rewrite SRD prologue with all-or-nothing preloading strategy and overflow SGPR slots. Modify SRDs in-place to reduce SGPR pressure. Add handleROCDLSchedBarrier and maskedload without mask support. Replace SRD prologue RawOps with typed S_LOAD_DWORDX2, S_MOV_B64, S_MOV_B32, and ExtractOp via SSA value reuse pattern. Adapt upstream PackOp-based SRD construction (PR iree-org#1188) to use explicit SCCType for S_AND_B32, S_OR_B32, and S_ADDC_U32 call sites. Made-with: Cursor Signed-off-by: Sanket Pandit <sanket.pandit@amd.com>
Rewrite Arith/Affine/AMDGPU/MemRef handlers to use auto-select emit helpers (emitAdd, emitMul, emitScalarCmp, etc.) that route scalar operands through SALU instructions. Keep scalar kernel arguments in SGPRs and promote uniform address arithmetic to SALU, reducing VGPR pressure. Fuse arith.cmpi + arith.select into s_cmp + s_cselect. Rewrite SRD prologue with all-or-nothing preloading strategy and overflow SGPR slots. Modify SRDs in-place to reduce SGPR pressure. Add handleROCDLSchedBarrier and maskedload without mask support. Replace SRD prologue RawOps with typed S_LOAD_DWORDX2, S_MOV_B64, S_MOV_B32, and ExtractOp via SSA value reuse pattern. Adapt upstream PackOp-based SRD construction (PR iree-org#1188) to use explicit SCCType for S_AND_B32, S_OR_B32, and S_ADDC_U32 call sites. Made-with: Cursor Signed-off-by: Sanket Pandit <sanket.pandit@amd.com>
Eliminate all raw assembly string manipulation from SRD (Shader Resource Descriptor) construction, replacing it with typed MLIR ops visible to the register allocator.
What changed
Core refactor: SRD construction in
handleFatRawBufferCast,emitSRDBaseAdjustment, andhandleMakeBufferRsrcnow uses ExtractOp to decompose source SRDs, typed SALU ops (S_MOV_B32, S_AND_B32, S_OR_B32, S_ADD_U32, S_ADDC_U32) to build individual words, and PackOp to assemble the result. Regalloc assigns contiguous registers to pack inputs, eliminating all hardcoded register index arithmetic and string-basedRawOp("s_mov_b32 sN, ...")patterns.Prologue SRD setup (in-place writes to precolored registers) uses typed ops targeting PSRegType + DCEProtectOp instead of RawOp.
Infrastructure:
srdValueMap/getSRDValue/setSRDValueadded to TranslationContext for SSA-based SRD lookup.emitSrdNumRecordssplit intobuildSrdWord2(returns free Value for PackOp) andpatchSrdWord2InPlace. DeadgetNextSwizzleSRDIndex/getFirstFreeSgprAfterSRDsremoved.Follow-up fixes
Moving from opaque RawOp strings to SSA-visible typed ops exposed several interactions with existing passes:
SRD lookup in stores:
handleVectorStorehad an inline SRD lookup checkinggetSRDIndexbut notgetSRDValue. Replaced withlookupSRD().CSE duplicate PackOp inputs: ScopedCSE merges identical S_MOV_B32 ops, creating shared SSA values across PackOp slots. Liveness pass now inserts copies so each slot gets a distinct physical register.
SRD liveness across prologue-to-adjustment gap:
emitSRDBaseAdjustmentused disconnected PrecoloredSRegOp instead of the prologue's SSA value, allowing regalloc to reuse kernel-arg SRD registers. PropagatedsrcSrdValuethroughPendingSRDBaseAdjustand connected S_LOAD_DWORDX2 results to S_MOV_B64 sources.VALU->readfirstlane hazard detection: Hazard mitigation missed conflicts when producer/consumer were different SSA values at the same physical VGPR, or when non-emitting ops sat between them. Now compares physical register indices and skips non-emitting ops.
ExtractOp-into-PackOp register conflict: ExtractOp is non-emitting (aliases source sub-register). When used as PackOp input, the post-allocation pass could assign it to a different register than the extract source. Liveness pass now inserts explicit copies for ExtractOp inputs to PackOps.