Skip to content

[AMDGPU] regression from SIFixSGPRCopies::processPHINode change #169552

@raiseirql

Description

@raiseirql

The recent change to SIFixSGPRCopies::processPHINode in #169038 addressed the problem that we reported in #168761, but this caused a regression in our matmul kernel where PHI nodes that had av_class_128 were then converted to vreg_128 with copies introduced to move back to areg_128 registers.

I put a repro case of the kernel I'm looking at here: https://gist.github.com/raiseirql/ebe539468b5a741ff561a85832205a3c

Should the change in #169038 be refined or should a later pass have cleaned up the codegen to remove the copies.

In order to get AGPRs to be used due to other recent backend changes, we have needed to apply this patch. @arsenm had been looking into how to fix this properly.

diff --git a/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp b/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
index b398db4f7caf..42af5c19cae2 100644
--- a/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
@@ -90,7 +90,11 @@ SIMachineFunctionInfo::SIMachineFunctionInfo(const Function &F,
     auto [MinNumAGPRAttr, MaxNumAGPRAttr] =
         AMDGPU::getIntegerPairAttribute(F, "amdgpu-agpr-alloc", {~0u, ~0u},
                                         /*OnlyFirstRequired=*/true);
-    MinNumAGPRs = MinNumAGPRAttr;
+    MinNumAGPRs = ~0u;
+    if (MFMAVGPRForm ||
+         (ST.getMaxNumVGPRs(F) <= ST.getAddressableNumArchVGPRs() &&
+          !mayUseAGPRs(F)))
+      MinNumAGPRs = 0;
   }

   if (AMDGPU::isChainCC(CC)) {

Before si-fix-sgpr-copies in the matmul inner loop. Same basic pattern after si-fix-sgpr-copies without #169038.

bb.2 (%ir-block.634):
; predecessors: %bb.1

...
  %706:areg_128_align2 = PHI %352:av_128_align2, %bb.1
  %707:areg_128_align2 = PHI %353:av_128_align2, %bb.1
...
  %1147:areg_128_align2 = V_MFMA_F32_16X16X32_BF16_mac_e64 %778:av_128_align2, %770:av_128_align2, %706:areg_128_align2(tied-def 0), 0, 0, 0, implicit $mode, implicit $exec
  %1148:areg_128_align2 = V_MFMA_F32_16X16X32_BF16_mac_e64 %779:av_128_align2, %770:av_128_align2, %707:areg_128_align2(tied-def 0), 0, 0, 0, implicit $mode, implicit $exec

After si-fix-sgpr-copies with #169038:

  %2483:vgpr_32 = COPY %791:av_32, implicit $exec
  %2485:vgpr_32 = COPY %791:av_32, implicit $exec
  %2487:vgpr_32 = COPY %791:av_32, implicit $exec
  %2489:vgpr_32 = COPY %791:av_32, implicit $exec
  %2491:vgpr_32 = COPY %791:av_32, implicit $exec
  %2493:vgpr_32 = COPY %791:av_32, implicit $exec
  %2495:vgpr_32 = COPY %791:av_32, implicit $exec
  %2497:vgpr_32 = COPY %791:av_32, implicit $exec

bb.1 (%ir-block.184):
; predecessors: %bb.0, %bb.1
  successors: %bb.2(0x04000000), %bb.1(0x7c000000); %bb.2(3.12%), %bb.1(96.88%)
...
  %328:av_32 = PHI %2483:vgpr_32, %bb.0, %2484:vgpr_32, %bb.1
  %329:av_32 = PHI %2485:vgpr_32, %bb.0, %2486:vgpr_32, %bb.1
  %330:av_32 = PHI %2487:vgpr_32, %bb.0, %2488:vgpr_32, %bb.1
  %331:av_32 = PHI %2489:vgpr_32, %bb.0, %2490:vgpr_32, %bb.1
  %332:av_32 = PHI %2491:vgpr_32, %bb.0, %2492:vgpr_32, %bb.1
  %333:av_32 = PHI %2493:vgpr_32, %bb.0, %2494:vgpr_32, %bb.1
  %334:av_32 = PHI %2495:vgpr_32, %bb.0, %2496:vgpr_32, %bb.1
  %335:av_32 = PHI %2497:vgpr_32, %bb.0, %2498:vgpr_32, %bb.1
...
  %908:areg_128_align2 = REG_SEQUENCE %332:av_32, %subreg.sub0, %333:av_32, %subreg.sub1, %334:av_32, %subreg.sub2, %335:av_32, %subreg.sub3
  %909:areg_128_align2 = REG_SEQUENCE %328:av_32, %subreg.sub0, %329:av_32, %subreg.sub1, %330:av_32, %subreg.sub2, %331:av_32, %subreg.sub3
...
  %988:areg_128_align2 = V_MFMA_F32_16X16X32_BF16_mac_e64 %55:av_128_align2, %71:av_128_align2, %908:areg_128_align2(tied-def 0), 0, 0, 0, implicit $mode, implicit $exec
  %989:areg_128_align2 = V_MFMA_F32_16X16X32_BF16_mac_e64 %54:av_128_align2, %71:av_128_align2, %909:areg_128_align2(tied-def 0), 0, 0, 0, implicit $mode, implicit $exec

The end effect of all this is the matmul kernel for a test shape of 8Kx8Kx8K goes from 0.85ms to over 15ms. The inner loop is has several scratch reads and copies between vgpr/agpr:

        v_mfma_f32_16x16x32_bf16 a[216:219], v[162:165], v[54:57], a[216:219]
        s_nop 4
        v_accvgpr_read_b32 v241, a135
        v_accvgpr_read_b32 v240, a134
        v_accvgpr_read_b32 v239, a133
        v_mfma_f32_16x16x32_bf16 a[152:155], v[162:165], v[46:49], a[172:175]
        v_accvgpr_read_b32 v238, a132
        v_mfma_f32_16x16x32_bf16 a[128:131], v[162:165], v[50:53], a[128:131]
        scratch_load_dwordx4 v[166:169], off, off offset:144

cc @jayfoad

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions