
[AMDGPU][GFX11] Do not rewrite V_FMA/FMAC_* to V_FMAAK_F16_t16 on operand legalization. #66202

Merged: 1 commit into llvm:main on Oct 4, 2023

Conversation

@kosarev (Collaborator) commented Sep 13, 2023

V_FMAAK_F16_t16 takes VGPR_32_Lo128 operands whereas the original instructions would have VGPR_32 operands. Switching the opcodes without updating the operands' register classes leads to MachineVerifier complaining about the classes not matching the instruction definitions. The problem only reveals itself in builds with expensive checks enabled because the test was missing -verify-machineinstrs.

This is the third attempt to update CodeGen/AMDGPU/fma.f16.ll to run for GFX11, following the second attempt in a1e38e0, partially reverted in eaf737a.
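
For context, handling the rewrite instead of bailing out would roughly mean narrowing the affected virtual registers to VGPR_32_Lo128 at the fold site. Below is a minimal sketch of that idea; it is not part of this patch, assumes the SIInstrInfo.cpp surroundings, and assumes AMDGPU::VGPR_32_Lo128RegClass is the TableGen-generated class object.

```cpp
// Hypothetical alternative to bailing out, sketched for illustration only:
// before switching the opcode to V_FMAAK_F16_t16, try to narrow every
// VGPR_32 operand of UseMI to VGPR_32_Lo128 and give up if that fails.
static bool tryConstrainToLo128(MachineInstr &UseMI, MachineRegisterInfo &MRI) {
  for (MachineOperand &MO : UseMI.operands()) {
    if (!MO.isReg() || !MO.getReg().isVirtual())
      continue;
    if (MRI.getRegClass(MO.getReg()) != &AMDGPU::VGPR_32RegClass)
      continue; // only the 32-bit VGPR operands need narrowing
    // constrainRegClass returns null when the register cannot be narrowed.
    if (!MRI.constrainRegClass(MO.getReg(), &AMDGPU::VGPR_32_Lo128RegClass))
      return false;
  }
  return true;
}
```

As the review discussion below points out, doing this pre-RA would also restrict allocation to the low 128 VGPRs, so the patch simply bails out.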

…rand legalization.

V_FMAAK_F16_t16 takes VGPR_32_Lo128 operands whereas the original
instructions would have VGPR_32 operands. Switching the opcodes without
updating operands' register classes leads to MachineVerifier complaining
about the classes not matching instruction definitions. The problem only
reveals itself in builds with expensive checks enabled because the test
was missing -verify-machineinstrs.

This is the third attempt to update CodeGen/AMDGPU/fma.f16.ll to run for
GFX11, following the second attempt in a1e38e0, partially reverted
in eaf737a.
@llvmbot (Collaborator) commented Sep 13, 2023

@llvm/pr-subscribers-backend-amdgpu

Changes: V_FMAAK_F16_t16 takes VGPR_32_Lo128 operands whereas the original instructions would have VGPR_32 operands. Switching the opcodes without updating the operands' register classes leads to MachineVerifier complaining about the classes not matching the instruction definitions. The problem only reveals itself in builds with expensive checks enabled because the test was missing -verify-machineinstrs.

This is the third attempt to update CodeGen/AMDGPU/fma.f16.ll to run for GFX11, following the second attempt in a1e38e0, partially reverted in eaf737a.

Full diff: https://github.com/llvm/llvm-project/pull/66202.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.cpp (+6)
  • (modified) llvm/test/CodeGen/AMDGPU/fma.f16.ll (+95-4)
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index 38b5e0114903cdf..a887f7567d76c89 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -3318,6 +3318,12 @@ bool SIInstrInfo::FoldImmediate(MachineInstr &UseMI, MachineInstr &DefMI,
       if (pseudoToMCOpcode(NewOpc) == -1)
         return false;
 
+      // V_FMAAK_F16_t16 takes VGPR_32_Lo128 operands, so the rewrite
+      // would also require restricting their register classes. For now
+      // just bail out.
+      if (NewOpc == AMDGPU::V_FMAAK_F16_t16)
+        return false;
+
       const int64_t Imm = ImmOp->getImm();
 
       // FIXME: This would be a lot easier if we could return a new instruction
diff --git a/llvm/test/CodeGen/AMDGPU/fma.f16.ll b/llvm/test/CodeGen/AMDGPU/fma.f16.ll
index 23971f2b681cb0c..e8423ce9fbc36a2 100644
--- a/llvm/test/CodeGen/AMDGPU/fma.f16.ll
+++ b/llvm/test/CodeGen/AMDGPU/fma.f16.ll
@@ -1,8 +1,10 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -global-isel=0 -march=amdgcn -mcpu=gfx900 < %s | FileCheck %s -check-prefixes=GFX9,GFX9-SDAG
-; RUN: llc -global-isel=1 -march=amdgcn -mcpu=gfx900 < %s | FileCheck %s -check-prefixes=GFX9,GFX9-GISEL
-; RUN: llc -global-isel=0 -march=amdgcn -mcpu=gfx1010 < %s | FileCheck %s -check-prefixes=GFX10,GFX10-SDAG
-; RUN: llc -global-isel=1 -march=amdgcn -mcpu=gfx1010 < %s | FileCheck %s -check-prefixes=GFX10,GFX10-GISEL
+; RUN: llc -global-isel=0 -march=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck %s -check-prefixes=GFX9,GFX9-SDAG
+; RUN: llc -global-isel=1 -march=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck %s -check-prefixes=GFX9,GFX9-GISEL
+; RUN: llc -global-isel=0 -march=amdgcn -mcpu=gfx1010 -verify-machineinstrs < %s | FileCheck %s -check-prefixes=GFX10,GFX10-SDAG
+; RUN: llc -global-isel=1 -march=amdgcn -mcpu=gfx1010 -verify-machineinstrs < %s | FileCheck %s -check-prefixes=GFX10,GFX10-GISEL
+; RUN: llc -global-isel=0 -march=amdgcn -mcpu=gfx1100 -verify-machineinstrs < %s | FileCheck %s -check-prefixes=GFX11,GFX11-SDAG
+; RUN: llc -global-isel=1 -march=amdgcn -mcpu=gfx1100 -verify-machineinstrs < %s | FileCheck %s -check-prefixes=GFX11,GFX11-GISEL
 
 declare half @llvm.fma.f16(half, half, half)
 declare half @llvm.maxnum.f16(half, half)
@@ -19,6 +21,12 @@ define half @test_fma(half %x, half %y, half %z) {
 ; GFX10-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX10-NEXT:    v_fma_f16 v0, v0, v1, v2
 ; GFX10-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX11-LABEL: test_fma:
+; GFX11:       ; %bb.0:
+; GFX11-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX11-NEXT:    v_fma_f16 v0, v0, v1, v2
+; GFX11-NEXT:    s_setpc_b64 s[30:31]
   %r = call half @llvm.fma.f16(half %x, half %y, half %z)
   ret half %r
 }
@@ -36,6 +44,12 @@ define half @test_fmac(half %x, half %y, half %z) {
 ; GFX10-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX10-NEXT:    v_fmac_f16_e32 v0, v1, v2
 ; GFX10-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX11-LABEL: test_fmac:
+; GFX11:       ; %bb.0:
+; GFX11-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX11-NEXT:    v_fmac_f16_e32 v0, v1, v2
+; GFX11-NEXT:    s_setpc_b64 s[30:31]
   %r = call half @llvm.fma.f16(half %y, half %z, half %x)
   ret half %r
 }
@@ -61,6 +75,12 @@ define half @test_fmaak(half %x, half %y, half %z) {
 ; GFX10-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX10-NEXT:    v_fmaak_f16 v0, v0, v1, 0x4200
 ; GFX10-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX11-LABEL: test_fmaak:
+; GFX11:       ; %bb.0:
+; GFX11-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX11-NEXT:    v_fmaak_f16 v0, v0, v1, 0x4200
+; GFX11-NEXT:    s_setpc_b64 s[30:31]
   %r = call half @llvm.fma.f16(half %x, half %y, half 0xH4200)
   ret half %r
 }
@@ -86,6 +106,12 @@ define half @test_fmamk(half %x, half %y, half %z) {
 ; GFX10-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX10-NEXT:    v_fmamk_f16 v0, v0, 0x4200, v2
 ; GFX10-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX11-LABEL: test_fmamk:
+; GFX11:       ; %bb.0:
+; GFX11-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX11-NEXT:    v_fmamk_f16 v0, v0, 0x4200, v2
+; GFX11-NEXT:    s_setpc_b64 s[30:31]
   %r = call half @llvm.fma.f16(half %x, half 0xH4200, half %z)
   ret half %r
 }
@@ -139,6 +165,33 @@ define i32 @test_D139469_f16(half %arg) {
 ; GFX10-GISEL-NEXT:    s_or_b32 s4, vcc_lo, s4
 ; GFX10-GISEL-NEXT:    v_cndmask_b32_e64 v0, 0, 1, s4
 ; GFX10-GISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX11-SDAG-LABEL: test_D139469_f16:
+; GFX11-SDAG:       ; %bb.0: ; %bb
+; GFX11-SDAG-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX11-SDAG-NEXT:    v_mov_b32_e32 v1, 0x211e
+; GFX11-SDAG-NEXT:    v_mul_f16_e32 v2, 0x291e, v0
+; GFX11-SDAG-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-SDAG-NEXT:    v_fmac_f16_e32 v1, 0x291e, v0
+; GFX11-SDAG-NEXT:    v_min_f16_e32 v0, v2, v1
+; GFX11-SDAG-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX11-SDAG-NEXT:    v_cmp_gt_f16_e32 vcc_lo, 0, v0
+; GFX11-SDAG-NEXT:    v_cndmask_b32_e64 v0, 0, 1, vcc_lo
+; GFX11-SDAG-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX11-GISEL-LABEL: test_D139469_f16:
+; GFX11-GISEL:       ; %bb.0: ; %bb
+; GFX11-GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX11-GISEL-NEXT:    s_movk_i32 s0, 0x291e
+; GFX11-GISEL-NEXT:    v_mul_f16_e32 v1, 0x291e, v0
+; GFX11-GISEL-NEXT:    v_fmaak_f16 v0, s0, v0, 0x211e
+; GFX11-GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-GISEL-NEXT:    v_cmp_gt_f16_e32 vcc_lo, 0, v1
+; GFX11-GISEL-NEXT:    v_cmp_gt_f16_e64 s0, 0, v0
+; GFX11-GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GFX11-GISEL-NEXT:    s_or_b32 s0, vcc_lo, s0
+; GFX11-GISEL-NEXT:    v_cndmask_b32_e64 v0, 0, 1, s0
+; GFX11-GISEL-NEXT:    s_setpc_b64 s[30:31]
 bb:
   %i = fmul contract half %arg, 0xH291E
   %i1 = fcmp olt half %i, 0xH0000
@@ -213,6 +266,44 @@ define <2 x i32> @test_D139469_v2f16(<2 x half> %arg) {
 ; GFX10-GISEL-NEXT:    s_or_b32 s4, s6, s5
 ; GFX10-GISEL-NEXT:    v_cndmask_b32_e64 v1, 0, 1, s4
 ; GFX10-GISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX11-SDAG-LABEL: test_D139469_v2f16:
+; GFX11-SDAG:       ; %bb.0: ; %bb
+; GFX11-SDAG-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX11-SDAG-NEXT:    s_movk_i32 s0, 0x211e
+; GFX11-SDAG-NEXT:    v_pk_mul_f16 v1, 0x291e, v0 op_sel_hi:[0,1]
+; GFX11-SDAG-NEXT:    v_pk_fma_f16 v0, 0x291e, v0, s0 op_sel_hi:[0,1,0]
+; GFX11-SDAG-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-SDAG-NEXT:    v_pk_min_f16 v0, v1, v0
+; GFX11-SDAG-NEXT:    v_lshrrev_b32_e32 v1, 16, v0
+; GFX11-SDAG-NEXT:    v_cmp_gt_f16_e32 vcc_lo, 0, v0
+; GFX11-SDAG-NEXT:    v_cndmask_b32_e64 v0, 0, 1, vcc_lo
+; GFX11-SDAG-NEXT:    s_delay_alu instid0(VALU_DEP_3)
+; GFX11-SDAG-NEXT:    v_cmp_gt_f16_e32 vcc_lo, 0, v1
+; GFX11-SDAG-NEXT:    v_cndmask_b32_e64 v1, 0, 1, vcc_lo
+; GFX11-SDAG-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX11-GISEL-LABEL: test_D139469_v2f16:
+; GFX11-GISEL:       ; %bb.0: ; %bb
+; GFX11-GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX11-GISEL-NEXT:    s_mov_b32 s0, 0x291e291e
+; GFX11-GISEL-NEXT:    v_pk_mul_f16 v1, v0, 0x291e op_sel_hi:[1,0]
+; GFX11-GISEL-NEXT:    v_pk_fma_f16 v0, v0, s0, 0x211e op_sel_hi:[1,1,0]
+; GFX11-GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-GISEL-NEXT:    v_lshrrev_b32_e32 v2, 16, v1
+; GFX11-GISEL-NEXT:    v_lshrrev_b32_e32 v3, 16, v0
+; GFX11-GISEL-NEXT:    v_cmp_gt_f16_e32 vcc_lo, 0, v1
+; GFX11-GISEL-NEXT:    v_cmp_gt_f16_e64 s0, 0, v0
+; GFX11-GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-GISEL-NEXT:    v_cmp_gt_f16_e64 s1, 0, v2
+; GFX11-GISEL-NEXT:    v_cmp_gt_f16_e64 s2, 0, v3
+; GFX11-GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GFX11-GISEL-NEXT:    s_or_b32 s0, vcc_lo, s0
+; GFX11-GISEL-NEXT:    v_cndmask_b32_e64 v0, 0, 1, s0
+; GFX11-GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GFX11-GISEL-NEXT:    s_or_b32 s0, s1, s2
+; GFX11-GISEL-NEXT:    v_cndmask_b32_e64 v1, 0, 1, s0
+; GFX11-GISEL-NEXT:    s_setpc_b64 s[30:31]
 bb:
   %i = fmul contract <2 x half> %arg, <half 0xH291E, half 0xH291E>
   %i1 = fcmp olt <2 x half> %i, <half 0xH0000, half 0xH0000>

@jayfoad (Contributor) commented Sep 13, 2023

I don't understand. Is there some mismatch between the way the t16 forms of V_FMA/FMAC_* are defined, vs. the t16 form of V_FMAAK_F16?

What is the exact opcode that was rewritten to V_FMAAK_F16_t16 in the failing case?

@@ -3318,6 +3318,12 @@ bool SIInstrInfo::FoldImmediate(MachineInstr &UseMI, MachineInstr &DefMI,
if (pseudoToMCOpcode(NewOpc) == -1)
return false;

// V_FMAAK_F16_t16 takes VGPR_32_Lo128 operands, so the rewrite
// would also require restricting their register classes. For now
Contributor:

So just handle this?

kosarev (Collaborator, Author):

I'm not sure it is of any practical value. For fake True16 instructions, it seems we already have enough support and generally don't want to invest more in that. For real True16 instructions, that new code wouldn't be used anyway; VGPR_16_Lo128 is not even allocatable downstream.

Contributor:

I don't think the reasoning for real True16 is right, but the conclusion that we don't want to shrink is right. VGPR_16_Lo128 (or something derived from it) would still be used downstream:
dag InOperandList = (ins VSrcT_f16_Lo128_Deferred:$src0, VGPRSrc_16_Lo128:$src1, KImmFP16:$imm);

We don't want to shrink pre-RA because it would restrict register allocation to using only the bottom 128 VGPRs.

kosarev (Collaborator, Author):

This looks irrelevant to me. The comment aims to explain why we bail out for fake True16 and VGPR_32_Lo128, not VGPR_16_Lo128. The instruction in the condition is supposed to be changed to V_FMAAK_F16_fake16 when we have one, and then removed along with the rest of the fake16 support.

@jayfoad (Contributor) commented Sep 13, 2023

> I don't understand. Is there some mismatch between the way the t16 forms of V_FMA/FMAC_* are defined, vs. the t16 form of V_FMAAK_F16?
>
> What is the exact opcode that was rewritten to V_FMAAK_F16_t16 in the failing case?

Oh I see, it's because we are folding a VOP3 instruction (V_FMAC_F16_t16_e64) to V_FMAAK_F16_t16 which is VOP2 only.

@kosarev (Collaborator, Author) commented Sep 13, 2023

> Oh I see, it's because we are folding a VOP3 instruction (V_FMAC_F16_t16_e64) to V_FMAAK_F16_t16 which is VOP2 only.

Yes, which I understand makes the shrinking pass the right place to rewrite to V_FMAAK_F16_t16, should we want that.

@kosarev kosarev requested a review from arsenm September 13, 2023 13:16
@jayfoad (Contributor) commented Sep 13, 2023

> Oh I see, it's because we are folding a VOP3 instruction (V_FMAC_F16_t16_e64) to V_FMAAK_F16_t16 which is VOP2 only.
>
> Yes, which I understand makes the shrinking pass the right place to rewrite to V_FMAAK_F16_t16, should we want that.

Unfortunately it is not that simple. It would be nice if folding immediates and shrinking VOP3 to VOP2 were orthogonal and could be implemented in separate passes, but this is one of the awkward cases where there is some interaction - you can only fold if you also shrink.

In our compiler we do folding first and then shrinking. So the options are:

  1. When folding, be prepared to shrink the instruction to enable that (which is what we do here).
  2. When shrinking, see if it enables any extra folding.
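
A minimal sketch of how option 1 shapes the code in SIInstrInfo::FoldImmediate; the control flow is simplified, the helper name and the use of hasTrue16BitInsts() to pick the opcode are my assumptions, and only the two early-exit checks mirror the actual patch.

```cpp
// Option 1 in sketch form: the fold commits to the shrunk VOP2 opcode up
// front, so it must prove the shrink is legal before rewriting anything.
// Illustrative helper; the real logic lives inline in FoldImmediate.
static bool canFoldImmToFMAAK(const SIInstrInfo &TII, const GCNSubtarget &ST) {
  unsigned NewOpc = ST.hasTrue16BitInsts() ? AMDGPU::V_FMAAK_F16_t16
                                           : AMDGPU::V_FMAAK_F16;
  // No encodable VOP2 form on this subtarget: cannot shrink, so cannot fold.
  if (TII.pseudoToMCOpcode(NewOpc) == -1)
    return false;
  // This patch: the t16 form also changes the operand register classes,
  // so refuse to fold-and-shrink here and leave it to later passes.
  if (NewOpc == AMDGPU::V_FMAAK_F16_t16)
    return false;
  return true;
}
```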

@kosarev (Collaborator, Author) commented Sep 18, 2023

Ping.

@kosarev kosarev requested a review from Sisyph October 3, 2023 14:21
@Sisyph (Contributor) commented Oct 3, 2023

Is it true that this function FoldImmediate is only called pre-RA? It seems from my cursory check just now that is the case. If so, then I agree that bailing out on FMAAK_F16_t16 and FMAMK_F16_t16 is correct. If not, then I think we would want to check the assigned registers similar to what is done in SIShrinkInstructions::shouldShrinkTrue16.
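
For reference, the kind of post-RA check being alluded to, in the spirit of SIShrinkInstructions::shouldShrinkTrue16, might look roughly like the sketch below. The helper name is made up; only the register-class containment test reflects my reading of how the existing check works.

```cpp
// Post-RA legality sketch for shrinking to a t16 VOP2 form: every 32-bit
// VGPR operand must already be assigned within the low 128 VGPRs
// (v0..v127), because that is all VGPR_32_Lo128 can encode.
static bool allVGPROperandsAreLo128(const MachineInstr &MI) {
  for (const MachineOperand &MO : MI.operands()) {
    if (!MO.isReg() || !MO.getReg().isPhysical())
      continue;
    Register Reg = MO.getReg();
    if (AMDGPU::VGPR_32RegClass.contains(Reg) &&
        !AMDGPU::VGPR_32_Lo128RegClass.contains(Reg))
      return false; // assigned above v127; the t16 encoding cannot reach it
  }
  return true;
}
```

Since FoldImmediate runs pre-RA only, bailing out as this patch does is the simpler choice.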

@jayfoad (Contributor) commented Oct 3, 2023

> Is it true that this function FoldImmediate is only called pre-RA?

Yes.

@Sisyph (Contributor) left a comment

LGTM

@kosarev kosarev merged commit cf80def into llvm:main Oct 4, 2023
2 of 3 checks passed
@kosarev kosarev deleted the fix_passing_fma_f16 branch October 4, 2023 11:41