[AMDGPU] Use correct VGPR threshold for flagging ExcessRP regions in unified register file case #85860

jrbyrnes · 2024-03-19T20:15:09Z

ST.getMaxNumVGPRs(MF) lowers to AMDGPUBaseInfo.cpp:getTotalNumVGPRs which returns 512 for gfx90a. This is subsequently limited by AMDGPUBaseInfo:getAddressableNumVGPRs(), which also returns 512 for gfx90a. The ISA states we can have a total of 512 registers, but a maximum of only 256 of each of AGPR and VGPR (gfx90a 3.6.4).

Therefore, in unified register file case, ST.getMaxNumVGPRs(MF) calculates the maximum number of combined VGPR + AGPR. But, it is currently used as the limit for accvgpr and as the limit for archvgpr.

This patch adds a parameter to getMaxNumVGPRs, which the caller can use to control whether to return the maximum total VGPRs or maxmimum acc/arch VGPRs. These values will only differ when we have the unified register file case (i.e. FeatureGFX90AInsts). While this doesn't really seem to have an impact based on lit tests, it does have an impact on several real kernels I'm working with.

It is not unreasonable to think other clients of getTotalNumVGPRs are using it in the wrong way.

Still working on test cases.

llvmbot · 2024-03-19T20:15:44Z

@llvm/pr-subscribers-backend-amdgpu

Author: Jeffrey Byrnes (jrbyrnes)

Changes

ST.getMaxNumVGPRs(MF) lowers to AMDGPUBaseInfo.cpp:getTotalNumVGPRs which returns 512 for gfx90a. This is subsequently limited by AMDGPUBaseInfo:getAddressableNumVGPRs(), which also returns 512 for gfx90a. The ISA states we can have up to 256 of each of AGPR and VGPR, for a total of 512 (gfx90a 3.6.4).

ST.getMaxNumVGPRs(MF) therefore calculates the maximum possible number of combined VGPR + AGPR.

However, ST.getMaxNumVGPRs(MF) is being used in the scheduler as the maximum number of specifically accvgpr and the maxinum number of specifically archvgpr.

By adding a parameter to getMaxNumVGPRs, the caller can control whether to return the maximum total VGPRs or maxmimum acc/arch VGPRs. These values will only differ when we have the unified register file case and getTotalNumVGPRs(STI) / WavesPerEU > getAddressableNumArchVGPRs(STI) (that is, when WavesPerEu == 1). While this doesn't really seem to have an impact based on lit tests, it does have an impact on several real kernels I'm working with.

It is not unreasonable to think other clients of getTotalNumVGPRs are using it in the wrong way.

Still working on test cases.

Full diff: https://github.com/llvm/llvm-project/pull/85860.diff

7 Files Affected:

(modified) llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp (+11-7)
(modified) llvm/lib/Target/AMDGPU/GCNRegPressure.cpp (+2-2)
(modified) llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp (+5-3)
(modified) llvm/lib/Target/AMDGPU/GCNSubtarget.h (+14-7)
(modified) llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp (+5-2)
(modified) llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.h (+2-1)
(modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.iglp.opt.single.2b.mir (+2)

diff --git a/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp b/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
index fa77b94fc22def..248713bca8a688 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
@@ -809,11 +809,13 @@ unsigned GCNSubtarget::getMaxNumSGPRs(const Function &F) const {
                             getReservedNumSGPRs(F));
 }
 
-unsigned GCNSubtarget::getBaseMaxNumVGPRs(
-    const Function &F, std::pair<unsigned, unsigned> WavesPerEU) const {
+unsigned
+GCNSubtarget::getBaseMaxNumVGPRs(const Function &F,
+                                 std::pair<unsigned, unsigned> WavesPerEU,
+                                 bool WholeRegisterFile) const {
   // Compute maximum number of VGPRs function can use using default/requested
   // minimum number of waves per execution unit.
-  unsigned MaxNumVGPRs = getMaxNumVGPRs(WavesPerEU.first);
+  unsigned MaxNumVGPRs = getMaxNumVGPRs(WavesPerEU.first, WholeRegisterFile);
 
   // Check if maximum number of VGPRs was explicitly requested using
   // "amdgpu-num-vgpr" attribute.
@@ -839,14 +841,16 @@ unsigned GCNSubtarget::getBaseMaxNumVGPRs(
   return MaxNumVGPRs;
 }
 
-unsigned GCNSubtarget::getMaxNumVGPRs(const Function &F) const {
-  return getBaseMaxNumVGPRs(F, getWavesPerEU(F));
+unsigned GCNSubtarget::getMaxNumVGPRs(const Function &F,
+                                      bool WholeRegisterFile) const {
+  return getBaseMaxNumVGPRs(F, getWavesPerEU(F), WholeRegisterFile);
 }
 
-unsigned GCNSubtarget::getMaxNumVGPRs(const MachineFunction &MF) const {
+unsigned GCNSubtarget::getMaxNumVGPRs(const MachineFunction &MF,
+                                      bool WholeRegisterFile) const {
   const Function &F = MF.getFunction();
   const SIMachineFunctionInfo &MFI = *MF.getInfo<SIMachineFunctionInfo>();
-  return getBaseMaxNumVGPRs(F, MFI.getWavesPerEU());
+  return getBaseMaxNumVGPRs(F, MFI.getWavesPerEU(), WholeRegisterFile);
 }
 
 void GCNSubtarget::adjustSchedDependency(SUnit *Def, int DefOpIdx, SUnit *Use,
diff --git a/llvm/lib/Target/AMDGPU/GCNRegPressure.cpp b/llvm/lib/Target/AMDGPU/GCNRegPressure.cpp
index 5c394e6d6296d0..b6fd5deb880f76 100644
--- a/llvm/lib/Target/AMDGPU/GCNRegPressure.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNRegPressure.cpp
@@ -110,7 +110,7 @@ bool GCNRegPressure::less(const MachineFunction &MF, const GCNRegPressure &O,
   if (Occ != OtherOcc)
     return Occ > OtherOcc;
 
-  unsigned MaxVGPRs = ST.getMaxNumVGPRs(MF);
+  unsigned MaxVGPRs = ST.getMaxNumVGPRs(MF, /*WholeRegisterFile*/ true);
   unsigned MaxSGPRs = ST.getMaxNumSGPRs(MF);
 
   // SGPR excess pressure conditions
@@ -124,7 +124,7 @@ bool GCNRegPressure::less(const MachineFunction &MF, const GCNRegPressure &O,
   unsigned OtherVGPRForSGPRSpills =
       (OtherExcessSGPR + (WaveSize - 1)) / WaveSize;
 
-  unsigned MaxArchVGPRs = ST.getAddressableNumArchVGPRs();
+  unsigned MaxArchVGPRs = ST.getMaxNumVGPRs(MF, /*WholeRegisterFile*/ false);
 
   // Unified excess pressure conditions, accounting for VGPRs used for SGPR
   // spills
diff --git a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
index 9f419a7fbf6834..7e73f5ce49e9b8 100644
--- a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
@@ -974,11 +974,13 @@ void GCNSchedStage::checkScheduling() {
                       << DAG.MinOccupancy << ".\n");
   }
 
-  unsigned MaxVGPRs = ST.getMaxNumVGPRs(MF);
+  unsigned MaxVGPRs = ST.getMaxNumVGPRs(MF, /*WholeRegisterFile*/ true);
+  unsigned MaxArchVGPRs = ST.getMaxNumVGPRs(MF, /*WholeRegisterFile*/ false);
   unsigned MaxSGPRs = ST.getMaxNumSGPRs(MF);
 
-  if (PressureAfter.getVGPRNum(false) > MaxVGPRs ||
-      PressureAfter.getAGPRNum() > MaxVGPRs ||
+  if (PressureAfter.getVGPRNum(ST.hasGFX90AInsts()) > MaxVGPRs ||
+      PressureAfter.getVGPRNum(false) > MaxArchVGPRs ||
+      PressureAfter.getAGPRNum() > MaxArchVGPRs ||
       PressureAfter.getSGPRNum() > MaxSGPRs) {
     DAG.RescheduleRegions[RegionIdx] = true;
     DAG.RegionsWithHighRP[RegionIdx] = true;
diff --git a/llvm/lib/Target/AMDGPU/GCNSubtarget.h b/llvm/lib/Target/AMDGPU/GCNSubtarget.h
index ca51da659c3311..fd3d124e257414 100644
--- a/llvm/lib/Target/AMDGPU/GCNSubtarget.h
+++ b/llvm/lib/Target/AMDGPU/GCNSubtarget.h
@@ -1414,26 +1414,32 @@ class GCNSubtarget final : public AMDGPUGenSubtargetInfo,
 
   /// \returns the maximum number of VGPRs that can be used and still achieved
   /// at least the specified number of waves \p WavesPerEU.
-  unsigned getMaxNumVGPRs(unsigned WavesPerEU) const {
-    return AMDGPU::IsaInfo::getMaxNumVGPRs(this, WavesPerEU);
+  unsigned getMaxNumVGPRs(unsigned WavesPerEU,
+                          bool WholeRegisterFile = true) const {
+    return AMDGPU::IsaInfo::getMaxNumVGPRs(this, WavesPerEU, WholeRegisterFile);
   }
 
   /// \returns max num VGPRs. This is the common utility function
   /// called by MachineFunction and Function variants of getMaxNumVGPRs.
   unsigned getBaseMaxNumVGPRs(const Function &F,
-                              std::pair<unsigned, unsigned> WavesPerEU) const;
+                              std::pair<unsigned, unsigned> WavesPerEU,
+                              bool WholeRegisterFile) const;
   /// \returns Maximum number of VGPRs that meets number of waves per execution
   /// unit requirement for function \p F, or number of VGPRs explicitly
   /// requested using "amdgpu-num-vgpr" attribute attached to function \p F.
+  /// If \p WholeRegisterFile is false and our target has a unified register
+  /// file, getMaxNumVGPRs will instead \return the maxmium number of ArchVGPRs.
   ///
   /// \returns Value that meets number of waves per execution unit requirement
   /// if explicitly requested value cannot be converted to integer, violates
   /// subtarget's specifications, or does not meet number of waves per execution
   /// unit requirement.
-  unsigned getMaxNumVGPRs(const Function &F) const;
+  unsigned getMaxNumVGPRs(const Function &F,
+                          bool WholeRegisterFile = true) const;
 
-  unsigned getMaxNumAGPRs(const Function &F) const {
-    return getMaxNumVGPRs(F);
+  unsigned getMaxNumAGPRs(const Function &F,
+                          bool WholeRegisterFile = true) const {
+    return getMaxNumVGPRs(F, WholeRegisterFile);
   }
 
   /// \returns Maximum number of VGPRs that meets number of waves per execution
@@ -1444,7 +1450,8 @@ class GCNSubtarget final : public AMDGPUGenSubtargetInfo,
   /// if explicitly requested value cannot be converted to integer, violates
   /// subtarget's specifications, or does not meet number of waves per execution
   /// unit requirement.
-  unsigned getMaxNumVGPRs(const MachineFunction &MF) const;
+  unsigned getMaxNumVGPRs(const MachineFunction &MF,
+                          bool WholeRegisterFile = true) const;
 
   void getPostRAMutations(
       std::vector<std::unique_ptr<ScheduleDAGMutation>> &Mutations)
diff --git a/llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp b/llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp
index 6d53f68ace70df..a75b080b5850af 100644
--- a/llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp
@@ -1155,12 +1155,15 @@ unsigned getMinNumVGPRs(const MCSubtargetInfo *STI, unsigned WavesPerEU) {
   return std::min(MinNumVGPRs, AddrsableNumVGPRs);
 }
 
-unsigned getMaxNumVGPRs(const MCSubtargetInfo *STI, unsigned WavesPerEU) {
+unsigned getMaxNumVGPRs(const MCSubtargetInfo *STI, unsigned WavesPerEU,
+                        bool WholeRegisterFile) {
   assert(WavesPerEU != 0);
 
   unsigned MaxNumVGPRs = alignDown(getTotalNumVGPRs(STI) / WavesPerEU,
                                    getVGPRAllocGranule(STI));
-  unsigned AddressableNumVGPRs = getAddressableNumVGPRs(STI);
+  unsigned AddressableNumVGPRs = WholeRegisterFile
+                                     ? getAddressableNumVGPRs(STI)
+                                     : getAddressableNumArchVGPRs(STI);
   return std::min(MaxNumVGPRs, AddressableNumVGPRs);
 }
 
diff --git a/llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.h b/llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.h
index 29ac402d953513..f49bb26e060e37 100644
--- a/llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.h
+++ b/llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.h
@@ -308,7 +308,8 @@ unsigned getMinNumVGPRs(const MCSubtargetInfo *STI, unsigned WavesPerEU);
 
 /// \returns Maximum number of VGPRs that meets given number of waves per
 /// execution unit requirement for given subtarget \p STI.
-unsigned getMaxNumVGPRs(const MCSubtargetInfo *STI, unsigned WavesPerEU);
+unsigned getMaxNumVGPRs(const MCSubtargetInfo *STI, unsigned WavesPerEU,
+                        bool WholeRegisterFile);
 
 /// \returns Number of waves reachable for a given \p NumVGPRs usage for given
 /// subtarget \p STI.
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.iglp.opt.single.2b.mir b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.iglp.opt.single.2b.mir
index 091b29c23d60e2..dc5df7f039b04c 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.iglp.opt.single.2b.mir
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.iglp.opt.single.2b.mir
@@ -4,6 +4,8 @@
 --- |
   define amdgpu_kernel void @single-wave-phase-2b(ptr addrspace(3) noalias %in0, ptr addrspace(3) noalias %in1, ptr addrspace(3) noalias %in2, ptr addrspace(3) noalias %in3, ptr addrspace(3) noalias %in4, ptr addrspace(3) noalias %in5, ptr addrspace(3) noalias %in6, ptr addrspace(3) noalias %in7, ptr addrspace(3) noalias %in8, ptr addrspace(3) noalias %in9, ptr addrspace(3) noalias %in10, ptr addrspace(3) noalias %in11, ptr addrspace(7) noalias %in12, ptr addrspace(7) noalias %in13, ptr addrspace(7) noalias %in14, ptr addrspace(7) noalias %in15, ptr addrspace(7) noalias %in16, ptr addrspace(7) noalias %in17, ptr addrspace(7) noalias %in18, ptr addrspace(7) noalias %in19, ptr addrspace(7) noalias %in20, ptr addrspace(7) noalias %in21, ptr addrspace(7) noalias %in22, ptr addrspace(7) noalias %in23, ptr addrspace(7) noalias %in24, ptr addrspace(7) noalias %in25, ptr addrspace(7) noalias %in26, ptr addrspace(7) noalias %in27, ptr addrspace(7) noalias %in28, ptr addrspace(7) noalias %in29) #0 { ret void }
 
+  attributes #0 = { nounwind "amdgpu-waves-per-eu"="1,1"  "amdgpu-flat-work-group-size"="1,256" }
+
   !0 = distinct !{!0}
   !1 = !{!1, !0}
 ...

rampitec · 2024-03-19T20:25:12Z

This was done conservatively on purpose. That is true we have total of 512 VGPRs, but only half of these a general purpose. Doubling it will result in a higher register pressure on some programs and spilling because most of the instructions may not use AGPRs. I.e. you may have improvements in some cases you are working on, but there will be bigger regressions in some other cases.

If you'd like to do it, I would at the very least guard this accounting with a call to MFI::usesAGPRs().

jrbyrnes · 2024-03-19T22:25:47Z

Just fix a bug -- still need to address comment.

github-actions · 2024-03-19T22:28:03Z

✅ With the latest revision this PR passed the C/C++ code formatter.

jrbyrnes · 2024-03-20T20:57:57Z

This was done conservatively on purpose. That is true we have total of 512 VGPRs, but only half of these a general purpose. Doubling it will result in a higher register pressure on some programs and spilling because most of the instructions may not use AGPRs.

I'm afraid I don't follow. This is not doubling, but rather halving from 512 to 256. The goal is to reduce RP and avoid spilling.

This will make the scheduler more conservative in terms of RP. AVReg are tracked as VGPR, so a limit of 256 will only be accurate if 100% of AVRegs are allocated as VGPR.

I don't understand how the current implementation is supposed to work. It seems to me, comparing pressure from RegKind::VGPR32 to 512 (assuming min-waves is 1) only makes sense if we add pressure from RegKind::AGPR32 , but even then, that would assume perfect allocation of AVRegs. We don't do that, though. We compare pressure from RegKind::VGPR32 to 512, then we compare pressure from RegKind::AGPR32 to 512

rampitec · 2024-03-20T21:00:48Z

This was done conservatively on purpose. That is true we have total of 512 VGPRs, but only half of these a general purpose. Doubling it will result in a higher register pressure on some programs and spilling because most of the instructions may not use AGPRs.

I'm afraid I don't follow. This is not doubling, but rather halving from 512 to 256. The goal is to reduce RP and avoid spilling.

This will make the scheduler more conservative in terms of RP. AVReg are tracked as VGPR, so a limit of 256 will only be accurate if 100% of AVRegs are allocated as VGPR.

I don't understand how the current implementation is supposed to work. It seems to me, comparing pressure from RegKind::VGPR32 to 512 (assuming min-waves is 1) only makes sense if we add pressure from RegKind::AGPR32 , but even then, that would assume perfect allocation of AVRegs. We don't do that, though. We compare pressure from RegKind::VGPR32 to 512, then we compare pressure from RegKind::AGPR32 to 512

I see. Then I misread the description. Halving should be fine.

llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp

llvm/lib/Target/AMDGPU/GCNSubtarget.h

jrbyrnes · 2024-03-21T18:30:54Z

Halving should be fine.

Thanks for taking another look.

jrbyrnes · 2024-03-21T21:05:37Z

Latest iteration just uses getTotalNumVGPRs for MaxNumVGPRs and clamps to getAddressableNumArchVGPRs based on WholeRegisterFile as suggested.

Allows for unequal allocation (with a per register type max of 256) and PressureAfter.getVGPRNum(ST.hasGFX90AInsts()) > ST.getMaxNumVGPRs(MF, /WholeRegisterFile/ true) captures the unified limit

llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp

llvm/lib/Target/AMDGPU/GCNRegPressure.cpp

…unified register file case Change-Id: I5a4ddfbf7fcea931b47c18b0b5421cada7ea9541

Change-Id: Iadeff5844c3e7baece274b03838f0457b1523c3b

rampitec

LGTM

…unified register file case (llvm#85860) `ST.getMaxNumVGPRs(MF)` lowers to `AMDGPUBaseInfo.cpp:getTotalNumVGPRs` which returns 512 for gfx90a. This is subsequently limited by `AMDGPUBaseInfo:getAddressableNumVGPRs()`, which also returns 512 for gfx90a. The ISA states we can have a total of 512 registers, but a maximum of only 256 of each of AGPR and VGPR (gfx90a 3.6.4). Therefore, in unified register file case, `ST.getMaxNumVGPRs(MF)` calculates the maximum number of combined VGPR + AGPR. But, it is currently used as the limit for accvgpr and as the limit for archvgpr. This patch uses it as the combined limit, and accounts for the maximum addressable arch/acc VGPRs when calculating the per RegClass limits. It is not unreasonable to think other clients of getTotalNumVGPRs are using it in the wrong way. Change-Id: Iee00cb03a64eab4a4d0b97454e5f48c57534f565

jrbyrnes requested review from arsenm, kerbowa and rampitec March 19, 2024 20:15

llvmbot added the backend:AMDGPU label Mar 19, 2024

jrbyrnes force-pushed the ExcessVGPR branch from 1bf2b9e to db3214c Compare March 19, 2024 22:29

rampitec reviewed Mar 20, 2024

View reviewed changes

llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp Outdated Show resolved Hide resolved

llvm/lib/Target/AMDGPU/GCNSubtarget.h Outdated Show resolved Hide resolved

rampitec reviewed Mar 21, 2024

View reviewed changes

llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp Outdated Show resolved Hide resolved

arsenm reviewed Mar 22, 2024

View reviewed changes

llvm/lib/Target/AMDGPU/GCNRegPressure.cpp Outdated Show resolved Hide resolved

llvm/lib/Target/AMDGPU/GCNRegPressure.cpp Outdated Show resolved Hide resolved

[AMDGPU] Use correct VGPR threshold for flagging ExcessRP regions in …

5cba833

…unified register file case Change-Id: I5a4ddfbf7fcea931b47c18b0b5421cada7ea9541

jrbyrnes force-pushed the ExcessVGPR branch from becee75 to 5cba833 Compare March 22, 2024 18:05

add comment explaining the conditions

00ab315

Change-Id: Iadeff5844c3e7baece274b03838f0457b1523c3b

jrbyrnes force-pushed the ExcessVGPR branch from dad901e to 00ab315 Compare March 22, 2024 18:13

rampitec approved these changes Mar 22, 2024

View reviewed changes

jrbyrnes merged commit b761137 into llvm:main Mar 25, 2024
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[AMDGPU] Use correct VGPR threshold for flagging ExcessRP regions in unified register file case #85860

[AMDGPU] Use correct VGPR threshold for flagging ExcessRP regions in unified register file case #85860

jrbyrnes commented Mar 19, 2024 •

edited

llvmbot commented Mar 19, 2024

rampitec commented Mar 19, 2024

jrbyrnes commented Mar 19, 2024

github-actions bot commented Mar 19, 2024 •

edited

jrbyrnes commented Mar 20, 2024

rampitec commented Mar 20, 2024

jrbyrnes commented Mar 21, 2024

jrbyrnes commented Mar 21, 2024

rampitec left a comment

[AMDGPU] Use correct VGPR threshold for flagging ExcessRP regions in unified register file case #85860

[AMDGPU] Use correct VGPR threshold for flagging ExcessRP regions in unified register file case #85860

Conversation

jrbyrnes commented Mar 19, 2024 • edited

llvmbot commented Mar 19, 2024

rampitec commented Mar 19, 2024

jrbyrnes commented Mar 19, 2024

github-actions bot commented Mar 19, 2024 • edited

jrbyrnes commented Mar 20, 2024

rampitec commented Mar 20, 2024

jrbyrnes commented Mar 21, 2024

jrbyrnes commented Mar 21, 2024

rampitec left a comment

Choose a reason for hiding this comment

jrbyrnes commented Mar 19, 2024 •

edited

github-actions bot commented Mar 19, 2024 •

edited