Description
Register spills generated by the backend can be scheduled in regions of code where exec=0, so the instructions are not executed. Kernels with these spills then crash or produce incorrect results.
Apologies for the long repro case: this is an attention kernel generated by the Mojo compiler. I tried to create a simplified repro case but could not hit the condition.
In the repro, the kernel uses readfirstlane for the scalar buffer resource. The recently added amdgpu-uniform-intrinsic-combine pass correctly determines that these values are uniform and removes the readfirstlane intrinsics. However, due to later instruction scheduling, si-fix-sgpr-copies generates code that assumes the resource is not uniform, and a loop like this is generated (I'm opening a separate issue for the readfirstlane problem):
s_cmp_lg_u32 s55, 0
s_mov_b64 exec, s[38:39]
s_cselect_b32 s55, 1, 0
s_mov_b64 s[38:39], exec
.LBB0_17: ; Parent Loop BB0_7 Depth=1
; => This Inner Loop Header: Depth=2
v_readfirstlane_b32 s4, v0
v_readfirstlane_b32 s5, v1
v_readfirstlane_b32 s6, v254
v_readfirstlane_b32 s7, v255
v_cmp_eq_u64_e32 vcc, s[4:5], v[0:1]
s_nop 0
v_cmp_eq_u64_e64 s[2:3], s[6:7], v[254:255]
s_and_b64 s[2:3], vcc, s[2:3]
s_and_saveexec_b64 s[2:3], s[2:3]
buffer_load_dwordx4 v[2:5], v26, s[4:7], s48 offen
s_xor_b64 exec, exec, s[2:3]
s_cbranch_execnz .LBB0_17
; %bb.18: ; in Loop: Header=BB0_7 Depth=1
v_accvgpr_write_b32 a97, v13
v_accvgpr_write_b32 a88, v10
s_cmp_lg_u32 s55, 0
s_mov_b64 exec, s[38:39]
s_cselect_b32 s55, 1, 0
s_mov_b64 s[38:39], exec
The problem here is that the v_accvgpr_write_b32 instructions are in a section of code where the exec register is zero, so these instructions are masked off. If I edit the final assembly to move these instructions after the s_mov_b64 exec, ... that restores the exec mask, the kernel runs fine. There are multiple occurrences of this in this example. I have also observed cases on MI300 where scratch instructions are generated in this region.
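To illustrate why exec is zero at that point, here is a minimal Python model of the loop above. It is only a sketch, not LLVM or hardware code: the names `waterfall_loop` and `masked_write` are hypothetical, the wave is shortened to 8 lanes, and a single value per lane stands in for the 128-bit resource in v[0:1] and v[254:255].

```python
def waterfall_loop(values, exec_mask):
    """Model the readfirstlane loop; return (exec at loop exit, iteration count)."""
    iterations = 0
    while exec_mask != 0:
        # v_readfirstlane_b32: take the value from the lowest active lane.
        first_lane = (exec_mask & -exec_mask).bit_length() - 1
        uniform = values[first_lane]
        # v_cmp_eq: set a bit for each active lane holding the same value.
        match = 0
        for lane, value in enumerate(values):
            if (exec_mask >> lane) & 1 and value == uniform:
                match |= 1 << lane
        saved = exec_mask          # s_and_saveexec_b64: save exec ...
        exec_mask = saved & match  # ... and keep only the matching lanes
        # buffer_load_dwordx4 would run here for the matching lanes.
        exec_mask ^= saved         # s_xor_b64: lanes still left to process
        iterations += 1            # s_cbranch_execnz: loop while exec != 0
    return exec_mask, iterations


def masked_write(dst, src, exec_mask):
    """Model a VALU write such as v_accvgpr_write_b32: only active lanes update."""
    return [s if (exec_mask >> lane) & 1 else d
            for lane, (d, s) in enumerate(zip(dst, src))]


if __name__ == "__main__":
    entry_exec = 0xFF                         # value saved in s[38:39]
    resource = [7, 7, 9, 7, 9, 3, 3, 7]       # deliberately non-uniform
    exit_exec, iterations = waterfall_loop(resource, entry_exec)
    spill_src = list(range(10, 18))
    lost = masked_write([0] * 8, spill_src, exit_exec)   # before exec restore
    kept = masked_write([0] * 8, spill_src, entry_exec)  # after exec restore
    print(exit_exec, iterations, lost, kept)
```

In this model the loop can only exit with exec equal to zero, so any vector write placed between the loop exit and the s_mov_b64 that restores exec updates no lanes. The same write issued after the restore updates every lane that was active on entry, which matches the behavior seen when hand-editing the assembly.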
The excessive register usage here is due to this being an attention op for head_size=256. The kernel itself is not touching the exec register behind the back of the compiler. The expectation would be that the above transforms would generate slow but functional code.