Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AMDGPU] Lazily emit waitcnts on function entry #73122

Closed
wants to merge 3 commits into from
Closed

[AMDGPU] Lazily emit waitcnts on function entry #73122

wants to merge 3 commits into from

Conversation

jayfoad
Copy link
Contributor

@jayfoad jayfoad commented Nov 22, 2023

Instead of emitting s_waitcnt 0 on entry to a function, just remember
that the counters are in an unknown state and let the pass emit the
waitcnts naturally before the first use of any register that might need
it.

Delay the s_waitcnts could have a small performance improvement, and in
some cases allows them to be combined with other waits which can have a
small code size improvement.

@jayfoad
Copy link
Contributor Author

jayfoad commented Nov 22, 2023

This is a WIP. Lots of tests still need to be updated and some of the diffs look dubious.

; SI-NEXT: s_mov_b64 s[4:5], 0
; SI-NEXT: s_andn2_b64 vcc, exec, s[4:5]
; SI-NEXT: s_waitcnt lgkmcnt(0)
; SI-NEXT: s_mov_b64 vcc, vcc
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very poor code. Maybe the placement of the waitcnt interfered with some peephole optimization that was suppose to clean it up?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, SIInsertWaitcnts::insertWaitcntInBlock generates s_waitcnt lgkmcnt(0) and s_mov_b64 vcc, vcc to work around a bug on GFX6 where vccz could be clobbered by completion of an SMEM load. Then SIPreEmitPeephole::optimizeVccBranch fails to look through the mov.

This only affects GFX6 so I don't think it's worth tring to fix it.

; GFX11-NEXT: s_not_b32 exec_lo, exec_lo
; GFX11-NEXT: v_mov_b32_e32 v0, v10
; GFX11-NEXT: s_not_b32 exec_lo, exec_lo
; GFX11-NEXT: global_store_b32 v[8:9], v0, off
; GFX11-NEXT: s_nop 0
; GFX11-NEXT: s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs investigating.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is good. This patch naturally fixes the same bug that #72245 fixes.

Comment on lines 226 to 228
; MUBUF-NEXT: s_mov_b32 s4, s33
; MUBUF-NEXT: s_mov_b32 s33, s32
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is assuming that SP/FP are never written by a memory instruction, which is probably an OK assumption but we should probably document it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to change the ABI in this patch so I have "fixed" this by checking for reserved SGPRs and marking them as possibly depending on any of the counters.

; GFX10-NEXT: v_mul_f32_e64 v0, v0, 2.0 clamp
; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't guarantee that v0 isn't being written by a pending load

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. This was a major thinko. I am now marking live-ins to the entry block as possibly depending on any of the counters.

UB = std::max(UB, Brackets.getWaitCountMax(T));

for (auto T : inst_counter_types())
Brackets.setScoreUB(T, UB);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to use the max over all inst_counter_types, rather than setting the UB for each T with its own limit (i.e. Brackets.setScoreUB(T, Brackets.getWaitCountMax(T))?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. We probably don't need to. I thought that WaitcntBrackets::merge relied on different counter types having a consistent notion of upper bound, but I've looked at the code again and I guess I was wrong.

…Class

This means that getRegInterval no longer depends on the MCInstrDesc, so
it could be simplified further to take just a MachineOperand or just a
physical register. NFCI.
Instead of emitting s_waitcnt 0 on entry to a function, just remember
that the counters are in an unknown state and let the pass emit the
waitcnts naturally before the first use of any register that might need
it.

Delaying the s_waitcnts could have a small performance improvement, and
in some cases allows them to be combined with other waits which can have
a small code size improvement.

// Reserved SGPRs (e.g. stack pointer or scratch descriptor) may depend on
// any counter.
// FIXME: Why are these not live-in to the function and/or the entry BB?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a good reason for this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SP is always manually managed without memory operations, so I don't think there's any useful way it can depend on a counter. If the caller didn't have FP/BP and was using it as a general SGPR, it's maybe possible they were used with a memory operation we need to complete before initializing them in the prolog


// Live-in registers may depend on any counter.
const MachineBasicBlock &EntryMBB = MF.front();
for (auto [Reg, Mask] : EntryMBB.liveins()) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried reimplementing this part using MRI.liveins instead of EntryMBB.liveins but apparently VGPR function arguments are not included in the former. Is that expected?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MRI.liveins is way less reliable. For example, we're missing verification that everything in it is mirrored in the entry block live ins. Additionally GlobalISel doesn't even populate it, and it's apparently not an issue

Copy link
Collaborator

@rovka rovka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should callWaitsOnFunctionEntry still return true all the time?

// FIXME: Why are these not live-in to the function and/or the entry BB?
for (unsigned I = 0, E = ST->getMaxNumSGPRs(MF); I != E; ++I) {
MCRegister Reg = AMDGPU::SGPR_32RegClass.getRegister(I);
if (MRI.getReservedRegs()[Reg])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Maybe hoist the getReservedRegs call out of the loop.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a trivial accessor defined in the header so I don't think there's much point.

@@ -171,15 +171,15 @@ define <2 x i16> @v_add_v2i16_neg_inline_imm_splat(<2 x i16> %a) {
;
; GFX9-LABEL: v_add_v2i16_neg_inline_imm_splat:
; GFX9: ; %bb.0:
; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX9-NEXT: v_mov_b32_e32 v1, 0xffc0ffc0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to wait for expcnt before the v_mov? IIUC if there's an export or GDS instruction that reads the VGPRs in the caller, we're supposed to wait for EXPcnt before we write the VGPRs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I think you're right. I guess that's another fundamental problem with this patch :(

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stepping back a bit, it might make more sense to change the ABI to say that expcnt must be zero at a function call boundary. I suspect there aren't any realistic cases where it would be useful to make a function call with an un-completed export in flight. But I don't want to change the ABI in this patch.

; CHECK-NEXT: s_mov_b32 s16, s33
; CHECK-NEXT: s_mov_b32 s33, s32
; CHECK-NEXT: s_waitcnt expcnt(0)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we waiting for expcnt here? (as opposed to after changing EXEC)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quoting from SIInsertWaitcnts:

  // Export & GDS instructions do not read the EXEC mask until after the export
  // is granted (which can occur well after the instruction is issued).
  // The shader program must flush all EXP operations on the export-count
  // before overwriting the EXEC mask.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, thanks! Found it in the SPG too, it just wasn't in the table where I was originally looking.

@@ -44,7 +46,6 @@ entry:
define ptr addrspace(1) @tail_call_assert_align() {
; CHECK-LABEL: tail_call_assert_align:
; CHECK: ; %bb.0: ; %entry
; CHECK-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do these drop completely here, but in the next test (br-constant-invalid-sgpr-copy.ll) they just move to the end of the function? Is it because this function returns something, so it's the caller's job to wait before reading the results?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function ends with a tail call, not a return. The ABI says that counters are undefined on function entry but must be 0 on return.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I completely filtered out that it was a tail call. Makes sense now.

; WAVE64-NEXT: .LBB0_1: ; %bb0
; WAVE64-NEXT: ; =>This Inner Loop Header: Depth=1
; WAVE64-NEXT: s_mov_b32 s4, 1
; WAVE64-NEXT: s_cmp_lg_u32 s4, 0
; WAVE64-NEXT: s_cbranch_scc1 .LBB0_1
; WAVE64-NEXT: ; %bb.2: ; %.exit5
; WAVE64-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why exactly do we need to wait here? We only wrote some SGPRs that we're not returning

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, counters are undefined on function entry but must be 0 on return.

@jayfoad
Copy link
Contributor Author

jayfoad commented Dec 6, 2023

Should callWaitsOnFunctionEntry still return true all the time?

Yes. It's really saying whether we should wait before making a call (i.e. waitcnts guaranteed to be zero on entry to the callee) or after entering the callee (i.e. waitcnts undefined on entry to the callee), and this patch does not change that.

@jayfoad jayfoad closed this by deleting the head repository Jan 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants