-
Notifications
You must be signed in to change notification settings - Fork 11.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[AMDGPU] Lazily emit waitcnts on function entry #73122
Conversation
This is a WIP. Lots of tests still need to be updated and some of the diffs look dubious. |
; SI-NEXT: s_mov_b64 s[4:5], 0 | ||
; SI-NEXT: s_andn2_b64 vcc, exec, s[4:5] | ||
; SI-NEXT: s_waitcnt lgkmcnt(0) | ||
; SI-NEXT: s_mov_b64 vcc, vcc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very poor code. Maybe the placement of the waitcnt interfered with some peephole optimization that was suppose to clean it up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, SIInsertWaitcnts::insertWaitcntInBlock
generates s_waitcnt lgkmcnt(0)
and s_mov_b64 vcc, vcc
to work around a bug on GFX6 where vccz could be clobbered by completion of an SMEM load. Then SIPreEmitPeephole::optimizeVccBranch
fails to look through the mov.
This only affects GFX6 so I don't think it's worth tring to fix it.
; GFX11-NEXT: s_not_b32 exec_lo, exec_lo | ||
; GFX11-NEXT: v_mov_b32_e32 v0, v10 | ||
; GFX11-NEXT: s_not_b32 exec_lo, exec_lo | ||
; GFX11-NEXT: global_store_b32 v[8:9], v0, off | ||
; GFX11-NEXT: s_nop 0 | ||
; GFX11-NEXT: s_sendmsg sendmsg(MSG_DEALLOC_VGPRS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs investigating.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is good. This patch naturally fixes the same bug that #72245 fixes.
; MUBUF-NEXT: s_mov_b32 s4, s33 | ||
; MUBUF-NEXT: s_mov_b32 s33, s32 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is assuming that SP/FP are never written by a memory instruction, which is probably an OK assumption but we should probably document it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want to change the ABI in this patch so I have "fixed" this by checking for reserved SGPRs and marking them as possibly depending on any of the counters.
; GFX10-NEXT: v_mul_f32_e64 v0, v0, 2.0 clamp | ||
; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't guarantee that v0 isn't being written by a pending load
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. This was a major thinko. I am now marking live-ins to the entry block as possibly depending on any of the counters.
UB = std::max(UB, Brackets.getWaitCountMax(T)); | ||
|
||
for (auto T : inst_counter_types()) | ||
Brackets.setScoreUB(T, UB); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to use the max over all inst_counter_types, rather than setting the UB for each T with its own limit (i.e. Brackets.setScoreUB(T, Brackets.getWaitCountMax(T))
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. We probably don't need to. I thought that WaitcntBrackets::merge
relied on different counter types having a consistent notion of upper bound, but I've looked at the code again and I guess I was wrong.
…Class This means that getRegInterval no longer depends on the MCInstrDesc, so it could be simplified further to take just a MachineOperand or just a physical register. NFCI.
Instead of emitting s_waitcnt 0 on entry to a function, just remember that the counters are in an unknown state and let the pass emit the waitcnts naturally before the first use of any register that might need it. Delaying the s_waitcnts could have a small performance improvement, and in some cases allows them to be combined with other waits which can have a small code size improvement.
|
||
// Reserved SGPRs (e.g. stack pointer or scratch descriptor) may depend on | ||
// any counter. | ||
// FIXME: Why are these not live-in to the function and/or the entry BB? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a good reason for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SP is always manually managed without memory operations, so I don't think there's any useful way it can depend on a counter. If the caller didn't have FP/BP and was using it as a general SGPR, it's maybe possible they were used with a memory operation we need to complete before initializing them in the prolog
|
||
// Live-in registers may depend on any counter. | ||
const MachineBasicBlock &EntryMBB = MF.front(); | ||
for (auto [Reg, Mask] : EntryMBB.liveins()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried reimplementing this part using MRI.liveins
instead of EntryMBB.liveins
but apparently VGPR function arguments are not included in the former. Is that expected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MRI.liveins is way less reliable. For example, we're missing verification that everything in it is mirrored in the entry block live ins. Additionally GlobalISel doesn't even populate it, and it's apparently not an issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should callWaitsOnFunctionEntry
still return true all the time?
// FIXME: Why are these not live-in to the function and/or the entry BB? | ||
for (unsigned I = 0, E = ST->getMaxNumSGPRs(MF); I != E; ++I) { | ||
MCRegister Reg = AMDGPU::SGPR_32RegClass.getRegister(I); | ||
if (MRI.getReservedRegs()[Reg]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Maybe hoist the getReservedRegs call out of the loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a trivial accessor defined in the header so I don't think there's much point.
@@ -171,15 +171,15 @@ define <2 x i16> @v_add_v2i16_neg_inline_imm_splat(<2 x i16> %a) { | |||
; | |||
; GFX9-LABEL: v_add_v2i16_neg_inline_imm_splat: | |||
; GFX9: ; %bb.0: | |||
; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) | |||
; GFX9-NEXT: v_mov_b32_e32 v1, 0xffc0ffc0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to wait for expcnt before the v_mov? IIUC if there's an export or GDS instruction that reads the VGPRs in the caller, we're supposed to wait for EXPcnt before we write the VGPRs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I think you're right. I guess that's another fundamental problem with this patch :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stepping back a bit, it might make more sense to change the ABI to say that expcnt must be zero at a function call boundary. I suspect there aren't any realistic cases where it would be useful to make a function call with an un-completed export in flight. But I don't want to change the ABI in this patch.
; CHECK-NEXT: s_mov_b32 s16, s33 | ||
; CHECK-NEXT: s_mov_b32 s33, s32 | ||
; CHECK-NEXT: s_waitcnt expcnt(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we waiting for expcnt here? (as opposed to after changing EXEC)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quoting from SIInsertWaitcnts
:
// Export & GDS instructions do not read the EXEC mask until after the export
// is granted (which can occur well after the instruction is issued).
// The shader program must flush all EXP operations on the export-count
// before overwriting the EXEC mask.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, thanks! Found it in the SPG too, it just wasn't in the table where I was originally looking.
@@ -44,7 +46,6 @@ entry: | |||
define ptr addrspace(1) @tail_call_assert_align() { | |||
; CHECK-LABEL: tail_call_assert_align: | |||
; CHECK: ; %bb.0: ; %entry | |||
; CHECK-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do these drop completely here, but in the next test (br-constant-invalid-sgpr-copy.ll
) they just move to the end of the function? Is it because this function returns something, so it's the caller's job to wait before reading the results?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function ends with a tail call, not a return. The ABI says that counters are undefined on function entry but must be 0 on return.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I completely filtered out that it was a tail call. Makes sense now.
; WAVE64-NEXT: .LBB0_1: ; %bb0 | ||
; WAVE64-NEXT: ; =>This Inner Loop Header: Depth=1 | ||
; WAVE64-NEXT: s_mov_b32 s4, 1 | ||
; WAVE64-NEXT: s_cmp_lg_u32 s4, 0 | ||
; WAVE64-NEXT: s_cbranch_scc1 .LBB0_1 | ||
; WAVE64-NEXT: ; %bb.2: ; %.exit5 | ||
; WAVE64-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why exactly do we need to wait here? We only wrote some SGPRs that we're not returning
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above, counters are undefined on function entry but must be 0 on return.
Yes. It's really saying whether we should wait before making a call (i.e. waitcnts guaranteed to be zero on entry to the callee) or after entering the callee (i.e. waitcnts undefined on entry to the callee), and this patch does not change that. |
Instead of emitting s_waitcnt 0 on entry to a function, just remember
that the counters are in an unknown state and let the pass emit the
waitcnts naturally before the first use of any register that might need
it.
Delay the s_waitcnts could have a small performance improvement, and in
some cases allows them to be combined with other waits which can have a
small code size improvement.