[AMDGPU] register spill instructions are generated inside control flow with `exec=0`

Register spills generated by the backend can be scheduled in regions of code where `exec=0`, so the instructions are not executed. Kernels with these spills then crash or produce incorrect results.

Apologies for the long repro case: this is an attention kernel generated by the Mojo compiler. I tried to create a simplified repro case but could not trigger the condition.

In the repro, the kernel uses `readfirstlane` for the scalar buffer resource. The recently added `amdgpu-uniform-intrinsic-combine` pass correctly determines that these values are uniform and removes the `readfirstlane` intrinsics. But due to later instruction scheduling, `si-fix-sgpr-copies` generates code that assumes the values are not uniform, and a loop like this is generated (I'm opening a separate issue for the `readfirstlane` problem):

```assembly
        s_cmp_lg_u32 s55, 0
        s_mov_b64 exec, s[38:39]
        s_cselect_b32 s55, 1, 0
        s_mov_b64 s[38:39], exec
.LBB0_17:                               ;   Parent Loop BB0_7 Depth=1
                                        ; =>  This Inner Loop Header: Depth=2
        v_readfirstlane_b32 s4, v0
        v_readfirstlane_b32 s5, v1
        v_readfirstlane_b32 s6, v254
        v_readfirstlane_b32 s7, v255
        v_cmp_eq_u64_e32 vcc, s[4:5], v[0:1]
        s_nop 0
        v_cmp_eq_u64_e64 s[2:3], s[6:7], v[254:255]
        s_and_b64 s[2:3], vcc, s[2:3]
        s_and_saveexec_b64 s[2:3], s[2:3]
        buffer_load_dwordx4 v[2:5], v26, s[4:7], s48 offen
        s_xor_b64 exec, exec, s[2:3]
        s_cbranch_execnz .LBB0_17
; %bb.18:                               ;   in Loop: Header=BB0_7 Depth=1
        v_accvgpr_write_b32 a97, v13
        v_accvgpr_write_b32 a88, v10
        s_cmp_lg_u32 s55, 0
        s_mov_b64 exec, s[38:39]
        s_cselect_b32 s55, 1, 0
        s_mov_b64 s[38:39], exec
```

The problem here is that the `v_accvgpr_write_b32` instructions are in a section of code where the `exec` register is now zero, so these instructions are masked off. If I edit the final assembly to move these instructions after the `s_mov_b64 exec, ...` that restores the `exec` mask, then the kernel runs fine. This happens in multiple places in this example. On MI300 I have also observed scratch instructions generated in this region.
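To make the failure mode concrete, here is a rough Python model of the loop above. It is only a sketch under stated assumptions: the 8-lane wave, the helper names `masked_write` and `waterfall`, and the sample values are all hypothetical, and it models only the `exec` bookkeeping, not real hardware behavior. It shows that `exec` is always zero when the loop falls through, so any vector write placed before the restoring `s_mov_b64 exec, ...` is dropped.

```python
WAVE_SIZE = 8  # assumption: small wave for readability (hardware uses 32 or 64)


def masked_write(dst, src, exec_mask):
    """Model a vector write: only lanes whose exec bit is set are updated."""
    return [s if (exec_mask >> lane) & 1 else d
            for lane, (d, s) in enumerate(zip(dst, src))]


def waterfall(values, exec_mask):
    """Model the readfirstlane loop; returns (exec at loop exit, iterations)."""
    iterations = 0
    while exec_mask != 0:                                  # s_cbranch_execnz
        first = (exec_mask & -exec_mask).bit_length() - 1  # lowest active lane
        uniform = values[first]                            # v_readfirstlane_b32
        match = 0
        for lane in range(WAVE_SIZE):                      # v_cmp_eq on active lanes
            if (exec_mask >> lane) & 1 and values[lane] == uniform:
                match |= 1 << lane
        saved = exec_mask                                  # s_and_saveexec_b64
        exec_mask &= match
        # ... the buffer_load would run here for the matching lanes ...
        exec_mask ^= saved                                 # s_xor_b64 exec, exec, saved
        iterations += 1
    return exec_mask, iterations


full = (1 << WAVE_SIZE) - 1
saved_outer = full                                   # s_mov_b64 s[38:39], exec
exec_after, iters = waterfall([7] * WAVE_SIZE, full)  # uniform resource descriptor

acc = [0] * WAVE_SIZE
src = [13] * WAVE_SIZE
dropped = masked_write(acc, src, exec_after)   # write placed before exec is restored
kept = masked_write(acc, src, saved_outer)     # write placed after exec is restored
```

In this model `dropped` is left unchanged while `kept` receives the new values, which matches the observation that moving the `v_accvgpr_write_b32` instructions after the `exec` restore fixes the kernel.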

The excessive register usage here is due to this being an attention op for `head_size=256`. The kernel itself is not touching the `exec` register behind the back of the compiler. The expectation would be that the above transforms would generate slow but functional code.

[AMDGPU] register spill instructions are generated inside control flow with `exec=0` #166657
