Description
We (Modular) picked up LLVM commit 4d42a0c3f139f41fb7409e7831f21ab9bca40a0c, which includes the changes from #166483. With it, code generation for one of the routines in our GPU print() path has regressed.
See https://gist.github.com/raiseirql/4b1815844ee452f68f7b2c4dd8625feb for a reduced repro; the relevant block is the one bracketed by llvm.debugtrap. This code is generated from https://github.com/modular/modular/blob/e8cce25027e913cdb54baf2aedc1789a11aa5301/mojo/stdlib/stdlib/builtin/_format_float.mojo#L174. Each lane prints a float value, and this code counts how many characters are needed. The counts are expected to diverge across lanes, since they depend on the per-lane float value. With the regression, the printed output is corrupted.
If we change SITargetLowering::getRegClassFor so that it no longer returns an AV class, the correct code is produced:
if (TRI->isSGPRClass(RC) && isDivergent) {
// Disable the new code to fix codegen.
#if 0
if (Subtarget->hasGFX90AInsts())
return TRI->getEquivalentAVClass(RC);
#endif
return TRI->getEquivalentVGPRClass(RC);
}
The working codegen:
s_add_u32 s4, s4, 1
v_bitop3_b16 v4, v5, v8, s10 bitop3:0xec
v_lshlrev_b16_e32 v3, 8, v3
s_addc_u32 s5, s5, 0
v_lshlrev_b32_e32 v4, 16, v4
v_bitop3_b16 v3, v7, v3, s10 bitop3:0xec
s_or_b64 s[6:7], vcc, s[6:7]
v_mov_b64_e32 v[68:69], s[4:5] <<<<< this captures the loop index that diverges across threads
v_or_b32_sdwa v66, v3, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
s_andn2_b64 exec, exec, s[6:7]
s_cbranch_execnz .LBB5_30
s_or_b64 exec, exec, s[6:7]
.LBB5_32:
s_or_b64 exec, exec, s[2:3]
s_mov_b64 s[22:23], 0
v_cmp_lt_i64_e32 vcc, 0, v[68:69]
s_mov_b64 s[0:1], -1
s_mov_b64 s[26:27], 0
s_trap 3
The broken codegen:
s_add_u32 s4, s4, 1
v_bitop3_b16 v4, v5, v8, s10 bitop3:0xec
v_lshlrev_b16_e32 v3, 8, v3
s_addc_u32 s5, s5, 0
v_lshlrev_b32_e32 v4, 16, v4
v_bitop3_b16 v3, v7, v3, s10 bitop3:0xec
s_or_b64 s[6:7], vcc, s[6:7]
v_or_b32_sdwa v68, v3, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
s_andn2_b64 exec, exec, s[6:7]
s_cbranch_execnz .LBB5_30
; %bb.31: ; %Flow35
s_or_b64 exec, exec, s[6:7]
v_mov_b64_e32 v[70:71], s[4:5] <<<<< this move should be inside the above loop, effectively captures max(idx)
.LBB5_32: ; %Flow36
s_or_b64 exec, exec, s[2:3]
s_mov_b64 s[22:23], 0
v_cmp_lt_i64_e32 vcc, 0, v[70:71]
s_mov_b64 s[0:1], -1
s_mov_b64 s[26:27], 0
s_trap 3
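To make the difference between the two listings concrete, here is a minimal host-side sketch in C++ that models it. It is not the compiler's code and not the Mojo routine: the 4-lane wave, the names `captureInsideLoop` and `captureAfterLoop`, and the per-lane trip counts are all hypothetical, and the model assumes every lane enters the loop and runs at least one iteration. The uniform counter stands in for s[4:5], and the per-lane array stands in for the VGPR pair that receives the v_mov_b64.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical 4-lane "wave"; real AMDGPU waves have 32 or 64 lanes.
constexpr std::size_t kLanes = 4;
using LaneValues = std::array<std::int64_t, kLanes>;

// Assumption of this model: every lane runs the loop body at least once.
inline void checkTripCounts(const LaneValues &tripCounts) {
  for (std::int64_t trips : tripCounts)
    assert(trips >= 1);
}

// Models the working listing: the uniform counter is copied into a per-lane
// register on every iteration, before the exiting lanes are masked off, so
// each lane keeps the counter value from its own last iteration.
LaneValues captureInsideLoop(const LaneValues &tripCounts) {
  checkTripCounts(tripCounts);
  LaneValues perLane{};
  std::array<bool, kLanes> active;
  active.fill(true);
  std::int64_t counter = 0;  // stands in for the SGPR pair s[4:5]
  bool anyActive = true;
  while (anyActive) {
    ++counter;  // s_add_u32 / s_addc_u32
    anyActive = false;
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
      if (!active[lane])
        continue;                 // lane already left the loop
      perLane[lane] = counter;    // v_mov_b64 executed under the exec mask
      if (counter >= tripCounts[lane])
        active[lane] = false;     // s_andn2_b64 exec, exec, ...
      else
        anyActive = true;
    }
  }
  return perLane;
}

// Models the broken listing: the copy happens once, after the loop, when all
// lanes have been re-enabled, so every lane sees the final counter value.
LaneValues captureAfterLoop(const LaneValues &tripCounts) {
  checkTripCounts(tripCounts);
  std::array<bool, kLanes> active;
  active.fill(true);
  std::int64_t counter = 0;
  bool anyActive = true;
  while (anyActive) {
    ++counter;
    anyActive = false;
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
      if (!active[lane])
        continue;
      if (counter >= tripCounts[lane])
        active[lane] = false;
      else
        anyActive = true;
    }
  }
  LaneValues perLane;
  perLane.fill(counter);  // single v_mov_b64 after exec is restored
  return perLane;
}
```

With per-lane trip counts of {1, 3, 2, 5}, `captureInsideLoop` returns {1, 3, 2, 5}, while `captureAfterLoop` returns {5, 5, 5, 5}. The latter is the max(idx) behavior noted in the broken listing.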