[AMDGPU] Lazily emit waitcnts on function entry #73122

jayfoad · 2023-11-22T14:02:38Z

Instead of emitting s_waitcnt 0 on entry to a function, just remember
that the counters are in an unknown state and let the pass emit the
waitcnts naturally before the first use of any register that might need
it.

Delay the s_waitcnts could have a small performance improvement, and in
some cases allows them to be combined with other waits which can have a
small code size improvement.

jayfoad · 2023-11-22T14:03:49Z

This is a WIP. Lots of tests still need to be updated and some of the diffs look dubious.

jayfoad · 2023-11-22T14:05:10Z

llvm/test/CodeGen/AMDGPU/extract-subvector-16bit.ll

+; SI-NEXT:    s_mov_b64 s[4:5], 0
+; SI-NEXT:    s_andn2_b64 vcc, exec, s[4:5]
+; SI-NEXT:    s_waitcnt lgkmcnt(0)
+; SI-NEXT:    s_mov_b64 vcc, vcc


This is very poor code. Maybe the placement of the waitcnt interfered with some peephole optimization that was suppose to clean it up?

Right, SIInsertWaitcnts::insertWaitcntInBlock generates s_waitcnt lgkmcnt(0) and s_mov_b64 vcc, vcc to work around a bug on GFX6 where vccz could be clobbered by completion of an SMEM load. Then SIPreEmitPeephole::optimizeVccBranch fails to look through the mov.

This only affects GFX6 so I don't think it's worth tring to fix it.

jayfoad · 2023-11-22T14:06:04Z

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.set.inactive.chain.arg.ll

 ; GFX11-NEXT:    s_not_b32 exec_lo, exec_lo
 ; GFX11-NEXT:    v_mov_b32_e32 v0, v10
 ; GFX11-NEXT:    s_not_b32 exec_lo, exec_lo
 ; GFX11-NEXT:    global_store_b32 v[8:9], v0, off
-; GFX11-NEXT:    s_nop 0
-; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)


This needs investigating.

This change is good. This patch naturally fixes the same bug that #72245 fixes.

arsenm · 2023-11-28T13:59:30Z

llvm/test/CodeGen/AMDGPU/GlobalISel/call-outgoing-stack-args.ll

 ; MUBUF-NEXT:    s_mov_b32 s4, s33
 ; MUBUF-NEXT:    s_mov_b32 s33, s32


This is assuming that SP/FP are never written by a memory instruction, which is probably an OK assumption but we should probably document it

I don't want to change the ABI in this patch so I have "fixed" this by checking for reserved SGPRs and marking them as possibly depending on any of the counters.

arsenm · 2023-11-28T14:00:03Z

llvm/test/CodeGen/AMDGPU/GlobalISel/clamp-fmed3-const-combine.ll

 ; GFX10-NEXT:    v_mul_f32_e64 v0, v0, 2.0 clamp
+; GFX10-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)


We can't guarantee that v0 isn't being written by a pending load

Good point. This was a major thinko. I am now marking live-ins to the entry block as possibly depending on any of the counters.

rovka · 2023-12-01T13:03:07Z

llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp

+      UB = std::max(UB, Brackets.getWaitCountMax(T));
+
+    for (auto T : inst_counter_types())
+      Brackets.setScoreUB(T, UB);


Why do we need to use the max over all inst_counter_types, rather than setting the UB for each T with its own limit (i.e. Brackets.setScoreUB(T, Brackets.getWaitCountMax(T))?

Good question. We probably don't need to. I thought that WaitcntBrackets::merge relied on different counter types having a consistent notion of upper bound, but I've looked at the code again and I guess I was wrong.

…Class This means that getRegInterval no longer depends on the MCInstrDesc, so it could be simplified further to take just a MachineOperand or just a physical register. NFCI.

Instead of emitting s_waitcnt 0 on entry to a function, just remember that the counters are in an unknown state and let the pass emit the waitcnts naturally before the first use of any register that might need it. Delaying the s_waitcnts could have a small performance improvement, and in some cases allows them to be combined with other waits which can have a small code size improvement.

jayfoad · 2023-12-05T17:04:37Z

llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp

+
+    // Reserved SGPRs (e.g. stack pointer or scratch descriptor) may depend on
+    // any counter.
+    // FIXME: Why are these not live-in to the function and/or the entry BB?


Is there a good reason for this?

SP is always manually managed without memory operations, so I don't think there's any useful way it can depend on a counter. If the caller didn't have FP/BP and was using it as a general SGPR, it's maybe possible they were used with a memory operation we need to complete before initializing them in the prolog

jayfoad · 2023-12-05T17:06:10Z

llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp

+
+    // Live-in registers may depend on any counter.
+    const MachineBasicBlock &EntryMBB = MF.front();
+    for (auto [Reg, Mask] : EntryMBB.liveins()) {


I tried reimplementing this part using MRI.liveins instead of EntryMBB.liveins but apparently VGPR function arguments are not included in the former. Is that expected?

MRI.liveins is way less reliable. For example, we're missing verification that everything in it is mirrored in the entry block live ins. Additionally GlobalISel doesn't even populate it, and it's apparently not an issue

rovka

Should callWaitsOnFunctionEntry still return true all the time?

rovka · 2023-12-06T11:45:50Z

llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp

+    // FIXME: Why are these not live-in to the function and/or the entry BB?
+    for (unsigned I = 0, E = ST->getMaxNumSGPRs(MF); I != E; ++I) {
+      MCRegister Reg = AMDGPU::SGPR_32RegClass.getRegister(I);
+      if (MRI.getReservedRegs()[Reg])


Nit: Maybe hoist the getReservedRegs call out of the loop.

It's a trivial accessor defined in the header so I don't think there's much point.

rovka · 2023-12-06T11:57:29Z

llvm/test/CodeGen/AMDGPU/GlobalISel/add.v2i16.ll

@@ -171,15 +171,15 @@ define <2 x i16> @v_add_v2i16_neg_inline_imm_splat(<2 x i16> %a) {
 ;
 ; GFX9-LABEL: v_add_v2i16_neg_inline_imm_splat:
 ; GFX9:       ; %bb.0:
-; GFX9-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX9-NEXT:    v_mov_b32_e32 v1, 0xffc0ffc0


Do we need to wait for expcnt before the v_mov? IIUC if there's an export or GDS instruction that reads the VGPRs in the caller, we're supposed to wait for EXPcnt before we write the VGPRs.

Yes I think you're right. I guess that's another fundamental problem with this patch :(

Stepping back a bit, it might make more sense to change the ABI to say that expcnt must be zero at a function call boundary. I suspect there aren't any realistic cases where it would be useful to make a function call with an un-completed export in flight. But I don't want to change the ABI in this patch.

rovka · 2023-12-06T12:01:06Z

llvm/test/CodeGen/AMDGPU/GlobalISel/assert-align.ll

 ; CHECK-NEXT:    s_mov_b32 s16, s33
 ; CHECK-NEXT:    s_mov_b32 s33, s32
+; CHECK-NEXT:    s_waitcnt expcnt(0)


Why are we waiting for expcnt here? (as opposed to after changing EXEC)

Quoting from SIInsertWaitcnts:

// Export & GDS instructions do not read the EXEC mask until after the export // is granted (which can occur well after the instruction is issued). // The shader program must flush all EXP operations on the export-count // before overwriting the EXEC mask.

Right, thanks! Found it in the SPG too, it just wasn't in the table where I was originally looking.

rovka · 2023-12-06T12:09:28Z

llvm/test/CodeGen/AMDGPU/GlobalISel/assert-align.ll

@@ -44,7 +46,6 @@ entry:
 define ptr addrspace(1) @tail_call_assert_align() {
 ; CHECK-LABEL: tail_call_assert_align:
 ; CHECK:       ; %bb.0: ; %entry
-; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)


Why do these drop completely here, but in the next test (br-constant-invalid-sgpr-copy.ll) they just move to the end of the function? Is it because this function returns something, so it's the caller's job to wait before reading the results?

This function ends with a tail call, not a return. The ABI says that counters are undefined on function entry but must be 0 on return.

Ok, I completely filtered out that it was a tail call. Makes sense now.

rovka · 2023-12-06T12:11:23Z

llvm/test/CodeGen/AMDGPU/GlobalISel/br-constant-invalid-sgpr-copy.ll

 ; WAVE64-NEXT:  .LBB0_1: ; %bb0
 ; WAVE64-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; WAVE64-NEXT:    s_mov_b32 s4, 1
 ; WAVE64-NEXT:    s_cmp_lg_u32 s4, 0
 ; WAVE64-NEXT:    s_cbranch_scc1 .LBB0_1
 ; WAVE64-NEXT:  ; %bb.2: ; %.exit5
+; WAVE64-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)


Why exactly do we need to wait here? We only wrote some SGPRs that we're not returning

As above, counters are undefined on function entry but must be 0 on return.

jayfoad · 2023-12-06T14:56:46Z

Should callWaitsOnFunctionEntry still return true all the time?

Yes. It's really saying whether we should wait before making a call (i.e. waitcnts guaranteed to be zero on entry to the callee) or after entering the callee (i.e. waitcnts undefined on entry to the callee), and this patch does not change that.

jayfoad commented Nov 22, 2023

View reviewed changes

jayfoad mentioned this pull request Nov 22, 2023

[AMDGPU][SIInsertWaitcnts] Do not add s_waitcnt when the counters are known to be 0 already #72830

Merged

arsenm reviewed Nov 28, 2023

View reviewed changes

rovka reviewed Dec 1, 2023

View reviewed changes

jayfoad added 3 commits December 4, 2023 13:52

[AMDGPU] Simplify WaitcntBrackets::getRegInterval with getPhysRegBase…

13449c8

…Class This means that getRegInterval no longer depends on the MCInstrDesc, so it could be simplified further to take just a MachineOperand or just a physical register. NFCI.

Add a getRegInterval overload that just takes a Register

13f634f

jayfoad commented Dec 5, 2023

View reviewed changes

rovka reviewed Dec 6, 2023

View reviewed changes

jayfoad closed this by deleting the head repository Jan 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[AMDGPU] Lazily emit waitcnts on function entry #73122

[AMDGPU] Lazily emit waitcnts on function entry #73122

jayfoad commented Nov 22, 2023

jayfoad commented Nov 22, 2023

jayfoad Nov 22, 2023

jayfoad Nov 23, 2023

jayfoad Nov 22, 2023

jayfoad Nov 23, 2023

arsenm Nov 28, 2023

jayfoad Dec 5, 2023

arsenm Nov 28, 2023

jayfoad Dec 5, 2023

rovka Dec 1, 2023

jayfoad Dec 1, 2023

jayfoad Dec 5, 2023

arsenm Dec 6, 2023

jayfoad Dec 5, 2023

arsenm Dec 6, 2023

rovka left a comment

rovka Dec 6, 2023

jayfoad Dec 6, 2023

rovka Dec 6, 2023

jayfoad Dec 6, 2023

jayfoad Dec 6, 2023

rovka Dec 6, 2023

jayfoad Dec 6, 2023

rovka Dec 7, 2023

rovka Dec 6, 2023

jayfoad Dec 6, 2023

rovka Dec 7, 2023

rovka Dec 6, 2023

jayfoad Dec 6, 2023

jayfoad commented Dec 6, 2023

		; MUBUF-NEXT: s_mov_b32 s4, s33
		; MUBUF-NEXT: s_mov_b32 s33, s32

		; GFX10-NEXT: v_mul_f32_e64 v0, v0, 2.0 clamp
		; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)

[AMDGPU] Lazily emit waitcnts on function entry #73122

[AMDGPU] Lazily emit waitcnts on function entry #73122

Conversation

jayfoad commented Nov 22, 2023

jayfoad commented Nov 22, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rovka left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jayfoad commented Dec 6, 2023