Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AMDGPU] Use correct VGPR threshold for flagging ExcessRP regions in unified register file case #85860

Merged
merged 2 commits into from
Mar 25, 2024

Conversation

jrbyrnes
Copy link
Contributor

@jrbyrnes jrbyrnes commented Mar 19, 2024

ST.getMaxNumVGPRs(MF) lowers to AMDGPUBaseInfo.cpp:getTotalNumVGPRs which returns 512 for gfx90a. This is subsequently limited by AMDGPUBaseInfo:getAddressableNumVGPRs(), which also returns 512 for gfx90a. The ISA states we can have a total of 512 registers, but a maximum of only 256 of each of AGPR and VGPR (gfx90a 3.6.4).

Therefore, in unified register file case, ST.getMaxNumVGPRs(MF) calculates the maximum number of combined VGPR + AGPR. But, it is currently used as the limit for accvgpr and as the limit for archvgpr.

This patch adds a parameter to getMaxNumVGPRs, which the caller can use to control whether to return the maximum total VGPRs or maxmimum acc/arch VGPRs. These values will only differ when we have the unified register file case (i.e. FeatureGFX90AInsts). While this doesn't really seem to have an impact based on lit tests, it does have an impact on several real kernels I'm working with.

It is not unreasonable to think other clients of getTotalNumVGPRs are using it in the wrong way.

Still working on test cases.

@llvmbot
Copy link
Collaborator

llvmbot commented Mar 19, 2024

@llvm/pr-subscribers-backend-amdgpu

Author: Jeffrey Byrnes (jrbyrnes)

Changes

ST.getMaxNumVGPRs(MF) lowers to AMDGPUBaseInfo.cpp:getTotalNumVGPRs which returns 512 for gfx90a. This is subsequently limited by AMDGPUBaseInfo:getAddressableNumVGPRs(), which also returns 512 for gfx90a. The ISA states we can have up to 256 of each of AGPR and VGPR, for a total of 512 (gfx90a 3.6.4).

ST.getMaxNumVGPRs(MF) therefore calculates the maximum possible number of combined VGPR + AGPR.

However, ST.getMaxNumVGPRs(MF) is being used in the scheduler as the maximum number of specifically accvgpr and the maxinum number of specifically archvgpr.

By adding a parameter to getMaxNumVGPRs, the caller can control whether to return the maximum total VGPRs or maxmimum acc/arch VGPRs. These values will only differ when we have the unified register file case and getTotalNumVGPRs(STI) / WavesPerEU > getAddressableNumArchVGPRs(STI) (that is, when WavesPerEu == 1). While this doesn't really seem to have an impact based on lit tests, it does have an impact on several real kernels I'm working with.

It is not unreasonable to think other clients of getTotalNumVGPRs are using it in the wrong way.

Still working on test cases.


Full diff: https://github.com/llvm/llvm-project/pull/85860.diff

7 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp (+11-7)
  • (modified) llvm/lib/Target/AMDGPU/GCNRegPressure.cpp (+2-2)
  • (modified) llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp (+5-3)
  • (modified) llvm/lib/Target/AMDGPU/GCNSubtarget.h (+14-7)
  • (modified) llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp (+5-2)
  • (modified) llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.h (+2-1)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.iglp.opt.single.2b.mir (+2)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp b/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
index fa77b94fc22def..248713bca8a688 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
@@ -809,11 +809,13 @@ unsigned GCNSubtarget::getMaxNumSGPRs(const Function &F) const {
                             getReservedNumSGPRs(F));
 }
 
-unsigned GCNSubtarget::getBaseMaxNumVGPRs(
-    const Function &F, std::pair<unsigned, unsigned> WavesPerEU) const {
+unsigned
+GCNSubtarget::getBaseMaxNumVGPRs(const Function &F,
+                                 std::pair<unsigned, unsigned> WavesPerEU,
+                                 bool WholeRegisterFile) const {
   // Compute maximum number of VGPRs function can use using default/requested
   // minimum number of waves per execution unit.
-  unsigned MaxNumVGPRs = getMaxNumVGPRs(WavesPerEU.first);
+  unsigned MaxNumVGPRs = getMaxNumVGPRs(WavesPerEU.first, WholeRegisterFile);
 
   // Check if maximum number of VGPRs was explicitly requested using
   // "amdgpu-num-vgpr" attribute.
@@ -839,14 +841,16 @@ unsigned GCNSubtarget::getBaseMaxNumVGPRs(
   return MaxNumVGPRs;
 }
 
-unsigned GCNSubtarget::getMaxNumVGPRs(const Function &F) const {
-  return getBaseMaxNumVGPRs(F, getWavesPerEU(F));
+unsigned GCNSubtarget::getMaxNumVGPRs(const Function &F,
+                                      bool WholeRegisterFile) const {
+  return getBaseMaxNumVGPRs(F, getWavesPerEU(F), WholeRegisterFile);
 }
 
-unsigned GCNSubtarget::getMaxNumVGPRs(const MachineFunction &MF) const {
+unsigned GCNSubtarget::getMaxNumVGPRs(const MachineFunction &MF,
+                                      bool WholeRegisterFile) const {
   const Function &F = MF.getFunction();
   const SIMachineFunctionInfo &MFI = *MF.getInfo<SIMachineFunctionInfo>();
-  return getBaseMaxNumVGPRs(F, MFI.getWavesPerEU());
+  return getBaseMaxNumVGPRs(F, MFI.getWavesPerEU(), WholeRegisterFile);
 }
 
 void GCNSubtarget::adjustSchedDependency(SUnit *Def, int DefOpIdx, SUnit *Use,
diff --git a/llvm/lib/Target/AMDGPU/GCNRegPressure.cpp b/llvm/lib/Target/AMDGPU/GCNRegPressure.cpp
index 5c394e6d6296d0..b6fd5deb880f76 100644
--- a/llvm/lib/Target/AMDGPU/GCNRegPressure.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNRegPressure.cpp
@@ -110,7 +110,7 @@ bool GCNRegPressure::less(const MachineFunction &MF, const GCNRegPressure &O,
   if (Occ != OtherOcc)
     return Occ > OtherOcc;
 
-  unsigned MaxVGPRs = ST.getMaxNumVGPRs(MF);
+  unsigned MaxVGPRs = ST.getMaxNumVGPRs(MF, /*WholeRegisterFile*/ true);
   unsigned MaxSGPRs = ST.getMaxNumSGPRs(MF);
 
   // SGPR excess pressure conditions
@@ -124,7 +124,7 @@ bool GCNRegPressure::less(const MachineFunction &MF, const GCNRegPressure &O,
   unsigned OtherVGPRForSGPRSpills =
       (OtherExcessSGPR + (WaveSize - 1)) / WaveSize;
 
-  unsigned MaxArchVGPRs = ST.getAddressableNumArchVGPRs();
+  unsigned MaxArchVGPRs = ST.getMaxNumVGPRs(MF, /*WholeRegisterFile*/ false);
 
   // Unified excess pressure conditions, accounting for VGPRs used for SGPR
   // spills
diff --git a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
index 9f419a7fbf6834..7e73f5ce49e9b8 100644
--- a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
@@ -974,11 +974,13 @@ void GCNSchedStage::checkScheduling() {
                       << DAG.MinOccupancy << ".\n");
   }
 
-  unsigned MaxVGPRs = ST.getMaxNumVGPRs(MF);
+  unsigned MaxVGPRs = ST.getMaxNumVGPRs(MF, /*WholeRegisterFile*/ true);
+  unsigned MaxArchVGPRs = ST.getMaxNumVGPRs(MF, /*WholeRegisterFile*/ false);
   unsigned MaxSGPRs = ST.getMaxNumSGPRs(MF);
 
-  if (PressureAfter.getVGPRNum(false) > MaxVGPRs ||
-      PressureAfter.getAGPRNum() > MaxVGPRs ||
+  if (PressureAfter.getVGPRNum(ST.hasGFX90AInsts()) > MaxVGPRs ||
+      PressureAfter.getVGPRNum(false) > MaxArchVGPRs ||
+      PressureAfter.getAGPRNum() > MaxArchVGPRs ||
       PressureAfter.getSGPRNum() > MaxSGPRs) {
     DAG.RescheduleRegions[RegionIdx] = true;
     DAG.RegionsWithHighRP[RegionIdx] = true;
diff --git a/llvm/lib/Target/AMDGPU/GCNSubtarget.h b/llvm/lib/Target/AMDGPU/GCNSubtarget.h
index ca51da659c3311..fd3d124e257414 100644
--- a/llvm/lib/Target/AMDGPU/GCNSubtarget.h
+++ b/llvm/lib/Target/AMDGPU/GCNSubtarget.h
@@ -1414,26 +1414,32 @@ class GCNSubtarget final : public AMDGPUGenSubtargetInfo,
 
   /// \returns the maximum number of VGPRs that can be used and still achieved
   /// at least the specified number of waves \p WavesPerEU.
-  unsigned getMaxNumVGPRs(unsigned WavesPerEU) const {
-    return AMDGPU::IsaInfo::getMaxNumVGPRs(this, WavesPerEU);
+  unsigned getMaxNumVGPRs(unsigned WavesPerEU,
+                          bool WholeRegisterFile = true) const {
+    return AMDGPU::IsaInfo::getMaxNumVGPRs(this, WavesPerEU, WholeRegisterFile);
   }
 
   /// \returns max num VGPRs. This is the common utility function
   /// called by MachineFunction and Function variants of getMaxNumVGPRs.
   unsigned getBaseMaxNumVGPRs(const Function &F,
-                              std::pair<unsigned, unsigned> WavesPerEU) const;
+                              std::pair<unsigned, unsigned> WavesPerEU,
+                              bool WholeRegisterFile) const;
   /// \returns Maximum number of VGPRs that meets number of waves per execution
   /// unit requirement for function \p F, or number of VGPRs explicitly
   /// requested using "amdgpu-num-vgpr" attribute attached to function \p F.
+  /// If \p WholeRegisterFile is false and our target has a unified register
+  /// file, getMaxNumVGPRs will instead \return the maxmium number of ArchVGPRs.
   ///
   /// \returns Value that meets number of waves per execution unit requirement
   /// if explicitly requested value cannot be converted to integer, violates
   /// subtarget's specifications, or does not meet number of waves per execution
   /// unit requirement.
-  unsigned getMaxNumVGPRs(const Function &F) const;
+  unsigned getMaxNumVGPRs(const Function &F,
+                          bool WholeRegisterFile = true) const;
 
-  unsigned getMaxNumAGPRs(const Function &F) const {
-    return getMaxNumVGPRs(F);
+  unsigned getMaxNumAGPRs(const Function &F,
+                          bool WholeRegisterFile = true) const {
+    return getMaxNumVGPRs(F, WholeRegisterFile);
   }
 
   /// \returns Maximum number of VGPRs that meets number of waves per execution
@@ -1444,7 +1450,8 @@ class GCNSubtarget final : public AMDGPUGenSubtargetInfo,
   /// if explicitly requested value cannot be converted to integer, violates
   /// subtarget's specifications, or does not meet number of waves per execution
   /// unit requirement.
-  unsigned getMaxNumVGPRs(const MachineFunction &MF) const;
+  unsigned getMaxNumVGPRs(const MachineFunction &MF,
+                          bool WholeRegisterFile = true) const;
 
   void getPostRAMutations(
       std::vector<std::unique_ptr<ScheduleDAGMutation>> &Mutations)
diff --git a/llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp b/llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp
index 6d53f68ace70df..a75b080b5850af 100644
--- a/llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp
@@ -1155,12 +1155,15 @@ unsigned getMinNumVGPRs(const MCSubtargetInfo *STI, unsigned WavesPerEU) {
   return std::min(MinNumVGPRs, AddrsableNumVGPRs);
 }
 
-unsigned getMaxNumVGPRs(const MCSubtargetInfo *STI, unsigned WavesPerEU) {
+unsigned getMaxNumVGPRs(const MCSubtargetInfo *STI, unsigned WavesPerEU,
+                        bool WholeRegisterFile) {
   assert(WavesPerEU != 0);
 
   unsigned MaxNumVGPRs = alignDown(getTotalNumVGPRs(STI) / WavesPerEU,
                                    getVGPRAllocGranule(STI));
-  unsigned AddressableNumVGPRs = getAddressableNumVGPRs(STI);
+  unsigned AddressableNumVGPRs = WholeRegisterFile
+                                     ? getAddressableNumVGPRs(STI)
+                                     : getAddressableNumArchVGPRs(STI);
   return std::min(MaxNumVGPRs, AddressableNumVGPRs);
 }
 
diff --git a/llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.h b/llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.h
index 29ac402d953513..f49bb26e060e37 100644
--- a/llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.h
+++ b/llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.h
@@ -308,7 +308,8 @@ unsigned getMinNumVGPRs(const MCSubtargetInfo *STI, unsigned WavesPerEU);
 
 /// \returns Maximum number of VGPRs that meets given number of waves per
 /// execution unit requirement for given subtarget \p STI.
-unsigned getMaxNumVGPRs(const MCSubtargetInfo *STI, unsigned WavesPerEU);
+unsigned getMaxNumVGPRs(const MCSubtargetInfo *STI, unsigned WavesPerEU,
+                        bool WholeRegisterFile);
 
 /// \returns Number of waves reachable for a given \p NumVGPRs usage for given
 /// subtarget \p STI.
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.iglp.opt.single.2b.mir b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.iglp.opt.single.2b.mir
index 091b29c23d60e2..dc5df7f039b04c 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.iglp.opt.single.2b.mir
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.iglp.opt.single.2b.mir
@@ -4,6 +4,8 @@
 --- |
   define amdgpu_kernel void @single-wave-phase-2b(ptr addrspace(3) noalias %in0, ptr addrspace(3) noalias %in1, ptr addrspace(3) noalias %in2, ptr addrspace(3) noalias %in3, ptr addrspace(3) noalias %in4, ptr addrspace(3) noalias %in5, ptr addrspace(3) noalias %in6, ptr addrspace(3) noalias %in7, ptr addrspace(3) noalias %in8, ptr addrspace(3) noalias %in9, ptr addrspace(3) noalias %in10, ptr addrspace(3) noalias %in11, ptr addrspace(7) noalias %in12, ptr addrspace(7) noalias %in13, ptr addrspace(7) noalias %in14, ptr addrspace(7) noalias %in15, ptr addrspace(7) noalias %in16, ptr addrspace(7) noalias %in17, ptr addrspace(7) noalias %in18, ptr addrspace(7) noalias %in19, ptr addrspace(7) noalias %in20, ptr addrspace(7) noalias %in21, ptr addrspace(7) noalias %in22, ptr addrspace(7) noalias %in23, ptr addrspace(7) noalias %in24, ptr addrspace(7) noalias %in25, ptr addrspace(7) noalias %in26, ptr addrspace(7) noalias %in27, ptr addrspace(7) noalias %in28, ptr addrspace(7) noalias %in29) #0 { ret void }
 
+  attributes #0 = { nounwind "amdgpu-waves-per-eu"="1,1"  "amdgpu-flat-work-group-size"="1,256" }
+
   !0 = distinct !{!0}
   !1 = !{!1, !0}
 ...

@rampitec
Copy link
Collaborator

This was done conservatively on purpose. That is true we have total of 512 VGPRs, but only half of these a general purpose. Doubling it will result in a higher register pressure on some programs and spilling because most of the instructions may not use AGPRs. I.e. you may have improvements in some cases you are working on, but there will be bigger regressions in some other cases.

If you'd like to do it, I would at the very least guard this accounting with a call to MFI::usesAGPRs().

@jrbyrnes
Copy link
Contributor Author

Just fix a bug -- still need to address comment.

Copy link

github-actions bot commented Mar 19, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@jrbyrnes
Copy link
Contributor Author

This was done conservatively on purpose. That is true we have total of 512 VGPRs, but only half of these a general purpose. Doubling it will result in a higher register pressure on some programs and spilling because most of the instructions may not use AGPRs.

I'm afraid I don't follow. This is not doubling, but rather halving from 512 to 256. The goal is to reduce RP and avoid spilling.

This will make the scheduler more conservative in terms of RP. AVReg are tracked as VGPR, so a limit of 256 will only be accurate if 100% of AVRegs are allocated as VGPR.

I don't understand how the current implementation is supposed to work. It seems to me, comparing pressure from RegKind::VGPR32 to 512 (assuming min-waves is 1) only makes sense if we add pressure from RegKind::AGPR32 , but even then, that would assume perfect allocation of AVRegs. We don't do that, though. We compare pressure from RegKind::VGPR32 to 512, then we compare pressure from RegKind::AGPR32 to 512

@rampitec
Copy link
Collaborator

This was done conservatively on purpose. That is true we have total of 512 VGPRs, but only half of these a general purpose. Doubling it will result in a higher register pressure on some programs and spilling because most of the instructions may not use AGPRs.

I'm afraid I don't follow. This is not doubling, but rather halving from 512 to 256. The goal is to reduce RP and avoid spilling.

This will make the scheduler more conservative in terms of RP. AVReg are tracked as VGPR, so a limit of 256 will only be accurate if 100% of AVRegs are allocated as VGPR.

I don't understand how the current implementation is supposed to work. It seems to me, comparing pressure from RegKind::VGPR32 to 512 (assuming min-waves is 1) only makes sense if we add pressure from RegKind::AGPR32 , but even then, that would assume perfect allocation of AVRegs. We don't do that, though. We compare pressure from RegKind::VGPR32 to 512, then we compare pressure from RegKind::AGPR32 to 512

I see. Then I misread the description. Halving should be fine.

@jrbyrnes
Copy link
Contributor Author

Halving should be fine.

Thanks for taking another look.

@jrbyrnes
Copy link
Contributor Author

Latest iteration just uses getTotalNumVGPRs for MaxNumVGPRs and clamps to getAddressableNumArchVGPRs based on WholeRegisterFile as suggested.

Allows for unequal allocation (with a per register type max of 256) and PressureAfter.getVGPRNum(ST.hasGFX90AInsts()) > ST.getMaxNumVGPRs(MF, /WholeRegisterFile/ true) captures the unified limit

llvm/lib/Target/AMDGPU/GCNRegPressure.cpp Outdated Show resolved Hide resolved
llvm/lib/Target/AMDGPU/GCNRegPressure.cpp Outdated Show resolved Hide resolved
…unified register file case

Change-Id: I5a4ddfbf7fcea931b47c18b0b5421cada7ea9541
Change-Id: Iadeff5844c3e7baece274b03838f0457b1523c3b
Copy link
Collaborator

@rampitec rampitec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jrbyrnes jrbyrnes merged commit b761137 into llvm:main Mar 25, 2024
4 checks passed
jrbyrnes added a commit to jrbyrnes/llvm-project that referenced this pull request Apr 26, 2024
…unified register file case (llvm#85860)

`ST.getMaxNumVGPRs(MF)` lowers to `AMDGPUBaseInfo.cpp:getTotalNumVGPRs`
which returns 512 for gfx90a. This is subsequently limited by
`AMDGPUBaseInfo:getAddressableNumVGPRs()`, which also returns 512 for
gfx90a. The ISA states we can have a total of 512 registers, but a
maximum of only 256 of each of AGPR and VGPR (gfx90a 3.6.4).

Therefore, in unified register file case, `ST.getMaxNumVGPRs(MF)`
calculates the maximum number of combined VGPR + AGPR. But, it is
currently used as the limit for accvgpr and as the limit for archvgpr.

This patch uses it as the combined limit, and accounts for the maximum addressable arch/acc VGPRs when calculating the per RegClass limits.

It is not unreasonable to think other clients of getTotalNumVGPRs are
using it in the wrong way.

Change-Id: Iee00cb03a64eab4a4d0b97454e5f48c57534f565
jrbyrnes added a commit to jrbyrnes/llvm-project that referenced this pull request May 7, 2024
…unified register file case (llvm#85860)

`ST.getMaxNumVGPRs(MF)` lowers to `AMDGPUBaseInfo.cpp:getTotalNumVGPRs`
which returns 512 for gfx90a. This is subsequently limited by
`AMDGPUBaseInfo:getAddressableNumVGPRs()`, which also returns 512 for
gfx90a. The ISA states we can have a total of 512 registers, but a
maximum of only 256 of each of AGPR and VGPR (gfx90a 3.6.4).

Therefore, in unified register file case, `ST.getMaxNumVGPRs(MF)`
calculates the maximum number of combined VGPR + AGPR. But, it is
currently used as the limit for accvgpr and as the limit for archvgpr.

This patch uses it as the combined limit, and accounts for the maximum addressable arch/acc VGPRs when calculating the per RegClass limits.

It is not unreasonable to think other clients of getTotalNumVGPRs are
using it in the wrong way.

Change-Id: Iee00cb03a64eab4a4d0b97454e5f48c57534f565
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants