[AMDGPU] register spill instructions are generated inside control flow with exec=0 #166657

@raiseirql

Register spills generated by the backend can be scheduled in regions of code where exec=0, so the instructions are not executed. Kernels with these spills then crash or produce incorrect results.

Apologies for the long repro case: this is an attention kernel generated by the Mojo compiler. I tried to create a simplified repro case but could not reproduce the condition.

In the repro, the kernel uses readfirstlane for the scalar buffer resource. The recently added amdgpu-uniform-intrinsic-combine pass correctly determines that these values are uniform and removes the readfirstlane intrinsics. After later instruction scheduling, however, si-fix-sgpr-copies generates code that assumes the value is not uniform, and a loop like this is generated (I'm opening a separate issue for the readfirstlane problem):

        s_cmp_lg_u32 s55, 0
        s_mov_b64 exec, s[38:39]
        s_cselect_b32 s55, 1, 0
        s_mov_b64 s[38:39], exec
.LBB0_17:                               ;   Parent Loop BB0_7 Depth=1
                                        ; =>  This Inner Loop Header: Depth=2
        v_readfirstlane_b32 s4, v0
        v_readfirstlane_b32 s5, v1
        v_readfirstlane_b32 s6, v254
        v_readfirstlane_b32 s7, v255
        v_cmp_eq_u64_e32 vcc, s[4:5], v[0:1]
        s_nop 0
        v_cmp_eq_u64_e64 s[2:3], s[6:7], v[254:255]
        s_and_b64 s[2:3], vcc, s[2:3]
        s_and_saveexec_b64 s[2:3], s[2:3]
        buffer_load_dwordx4 v[2:5], v26, s[4:7], s48 offen
        s_xor_b64 exec, exec, s[2:3]
        s_cbranch_execnz .LBB0_17
; %bb.18:                               ;   in Loop: Header=BB0_7 Depth=1
        v_accvgpr_write_b32 a97, v13
        v_accvgpr_write_b32 a88, v10
        s_cmp_lg_u32 s55, 0
        s_mov_b64 exec, s[38:39]
        s_cselect_b32 s55, 1, 0
        s_mov_b64 s[38:39], exec

The problem here is that the v_accvgpr_write_b32 instructions sit in a section of code where the exec register is zero: the loop only falls through s_cbranch_execnz once exec has reached zero, and exec is not restored until the following s_mov_b64 exec, s[38:39]. These instructions are therefore masked off. If I edit the final assembly to move these instructions after the s_mov_b64 exec, ... that restores the exec mask, the kernel runs fine. This happens in multiple places in this example. On MI300 I have also observed scratch instructions generated in this region.
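
To illustrate why the placement matters, here is a toy Python model of the loop above. It is only a sketch, not real ISA semantics: the helper names `waterfall` and `masked_write`, the 8-lane wave, and the sample values are all made up for illustration. It shows that exec is all zeros when the loop exits, so a vector write issued before exec is restored updates no lanes.

```python
# Toy model of the loop in the listing (hypothetical helpers, 8 lanes assumed).

def waterfall(values, exec_mask):
    """Handle lanes in groups that share a value; return exec at loop exit."""
    exec_mask = list(exec_mask)
    while any(exec_mask):  # s_cbranch_execnz: loop while any lane is active
        # v_readfirstlane: take the value held by the first active lane.
        first = values[exec_mask.index(True)]
        # v_cmp_eq + s_and_saveexec: only matching lanes perform the load.
        matching = [e and v == first for e, v in zip(exec_mask, values)]
        # s_xor exec, exec, saved: drop the lanes that were just handled.
        exec_mask = [e and not m for e, m in zip(exec_mask, matching)]
    return exec_mask


def masked_write(dest, src, exec_mask):
    """Vector write: only lanes enabled in exec_mask are updated."""
    return [s if e else d for d, s, e in zip(dest, src, exec_mask)]


saved_exec = [True] * 8                    # s_mov_b64 s[38:39], exec
values = [3, 3, 7, 7, 7, 1, 3, 1]          # non-uniform per-lane values
exit_exec = waterfall(values, saved_exec)  # every lane is off at loop exit

spill_src = list(range(100, 108))
agpr = [0] * 8
lost = masked_write(agpr, spill_src, exit_exec)   # write before exec restore
kept = masked_write(agpr, spill_src, saved_exec)  # write after exec restore
print(lost)
print(kept)
```

In this model `lost` keeps the old register contents in every lane, which matches the observed behavior of the spill being silently dropped, while `kept` corresponds to the hand-edited assembly where the write follows the exec restore.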

The excessive register usage here is due to this being an attention op for head_size=256. The kernel itself is not touching the exec register behind the back of the compiler. The expectation would be that the above transforms would generate slow but functional code.
