
[AMDGPU] RFC: Add and optionally use GCNIterativeRPTrackers #88797

Closed
wants to merge 9 commits

Conversation

jrbyrnes
Contributor

Having walked through several interesting lit changes, I think it's a good time to put up a WIP PR for any high level comments.

There are still a few things left to do, namely:

  1. Verify more lit changes when enabled
  2. Add tests
  3. Compatibility with schedulers in GCNIterativeScheduler.cpp

Nonetheless, this patch adds the GCNIterativeRPTrackers. We can optionally use these instead of generic RPTrackers during scheduling. The benefits include: 1. better subreg liveness modeling, 2. actually tracking %av registers (counted as VGPRs), and 3. having accurate global live-through RP.

I've introduced these as separate classes since there is other outstanding work on RPTrackers. The plan is to merge the two projects once at a steady state.

Change-Id: I433c6d19d79e7bf8ee1fa1d99ca948d5e1411ff8
Change-Id: I3e184df1ca349433db6abbeb9d28eed2fea5640b
Change-Id: I194d9121d725d55cb5fba609267b86031dc3a179
Change-Id: I1f2d93f75881a87434033e7bb387b3342309118b
Change-Id: Ia353f09d201e2b5a472b73930e3dde1fc51da363
Change-Id: I15025afa46ae092aa67f42984ee52154c5f4dba4
@llvmbot
Collaborator

llvmbot commented Apr 15, 2024

@llvm/pr-subscribers-backend-amdgpu

Author: Jeffrey Byrnes (jrbyrnes)

Changes



Patch is 21.84 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/88797.diff

4 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/GCNRegPressure.cpp (+163)
  • (modified) llvm/lib/Target/AMDGPU/GCNRegPressure.h (+37-38)
  • (modified) llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp (+127-34)
  • (modified) llvm/lib/Target/AMDGPU/GCNSchedStrategy.h (+33-2)
diff --git a/llvm/lib/Target/AMDGPU/GCNRegPressure.cpp b/llvm/lib/Target/AMDGPU/GCNRegPressure.cpp
index 5c394e6d6296d0..06be02d1dc9cb6 100644
--- a/llvm/lib/Target/AMDGPU/GCNRegPressure.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNRegPressure.cpp
@@ -343,6 +343,48 @@ void GCNRPTracker::reset(const MachineInstr &MI,
   MaxPressure = CurPressure = getRegPressure(*MRI, LiveRegs);
 }
 
+DenseMap<int, GCNRPTracker::LiveRegSet>
+llvm::getLiveRegMap(DenseMap<MachineInstr *, int> &R, bool After,
+                    LiveIntervals &LIS) {
+  std::vector<SlotIndex> Indexes;
+  // Indexes.reserve(R.size());
+  auto &SII = *LIS.getSlotIndexes();
+  for (std::pair<MachineInstr *, int> &Entry : R) {
+    auto SI = SII.getInstructionIndex(*Entry.first);
+    Indexes.push_back(After ? SI.getDeadSlot() : SI.getBaseIndex());
+  }
+  llvm::sort(Indexes);
+
+  auto &MRI = (*R.begin()).first->getParent()->getParent()->getRegInfo();
+  DenseMap<int, GCNRPTracker::LiveRegSet> LiveRegMap;
+  SmallVector<SlotIndex, 32> LiveIdxs, SRLiveIdxs;
+  for (unsigned I = 0, E = MRI.getNumVirtRegs(); I != E; ++I) {
+    auto Reg = Register::index2VirtReg(I);
+    if (!LIS.hasInterval(Reg))
+      continue;
+    auto &LI = LIS.getInterval(Reg);
+    LiveIdxs.clear();
+    if (!LI.findIndexesLiveAt(Indexes, std::back_inserter(LiveIdxs)))
+      continue;
+    if (!LI.hasSubRanges()) {
+      for (auto SI : LiveIdxs) {
+        auto Idx = R[SII.getInstructionFromIndex(SI)];
+        LiveRegMap[Idx][Reg] = MRI.getMaxLaneMaskForVReg(Reg);
+      }
+    } else
+      for (const auto &S : LI.subranges()) {
+        // constrain search for subranges by indexes live at main range
+        SRLiveIdxs.clear();
+        S.findIndexesLiveAt(LiveIdxs, std::back_inserter(SRLiveIdxs));
+        for (auto SI : SRLiveIdxs) {
+          auto Idx = R[SII.getInstructionFromIndex(SI)];
+          LiveRegMap[Idx][Reg] |= S.LaneMask;
+        }
+      }
+  }
+  return LiveRegMap;
+}
+
 ////////////////////////////////////////////////////////////////////////////////
 // GCNUpwardRPTracker
 
@@ -570,6 +612,127 @@ bool GCNUpwardRPTracker::isValid() const {
   return true;
 }
 
+////////////////////////////////////////////////////////////////////////////////
+// GCNIterativeRPTrackers
+
+void GCNIterativeRPTracker::reset(const MachineRegisterInfo *MRI_,
+                                  const LiveRegSet *LiveRegsCopy) {
+
+  MRI = MRI_;
+  if (LiveRegsCopy && &LiveRegs != LiveRegsCopy)
+    LiveRegs = *LiveRegsCopy;
+  if (!LiveRegsCopy)
+    LiveRegs.clear();
+  MaxPressure = CurPressure = getRegPressure(*MRI, LiveRegs);
+}
+
+// Mostly copy+paste from GCNUpwardRPTracker::recede
+void GCNIterativeUpwardRPTracker::recede(const MachineInstr &MI,
+                                         LiveIntervals *LIS) {
+  assert(MRI && "call reset first");
+
+  if (MI.isDebugInstr())
+    return;
+
+  SmallVector<RegisterMaskPair, 8> RegUses;
+  collectVirtualRegUses(RegUses, MI, *LIS, *MRI);
+
+  // calc pressure at the MI (defs + uses)
+  auto AtMIPressure = CurPressure;
+  for (const auto &U : RegUses) {
+    auto LiveMask = LiveRegs[U.RegUnit];
+    AtMIPressure.inc(U.RegUnit, LiveMask, LiveMask | U.LaneMask, *MRI);
+  }
+  // update max pressure
+  MaxPressure = max(AtMIPressure, MaxPressure);
+
+  for (const auto &MO : MI.all_defs()) {
+    if (!MO.getReg().isVirtual() || MO.isDead())
+      continue;
+
+    auto Reg = MO.getReg();
+    auto I = LiveRegs.find(Reg);
+    if (I == LiveRegs.end())
+      continue;
+    auto &LiveMask = I->second;
+    auto PrevMask = LiveMask;
+    LiveMask &= ~getDefRegMask(MO, *MRI);
+    CurPressure.inc(Reg, PrevMask, LiveMask, *MRI);
+    if (LiveMask.none())
+      LiveRegs.erase(I);
+  }
+  for (const auto &U : RegUses) {
+    auto &LiveMask = LiveRegs[U.RegUnit];
+    auto PrevMask = LiveMask;
+    LiveMask |= U.LaneMask;
+    CurPressure.inc(U.RegUnit, PrevMask, LiveMask, *MRI);
+  }
+  assert(CurPressure == getRegPressure(*MRI, LiveRegs));
+}
+
+// Mostly copy+paste from GCNDownwardRPTracker::(advanceBeforeNext +
+// advanceToNext)
+void GCNIterativeDownwardRPTracker::advance(const MachineInstr &MI,
+                                            LiveIntervals *LIS) {
+  assert(MRI && "call reset first");
+  // Add new registers or mask bits.
+  for (const auto &MO : MI.all_defs()) {
+    Register Reg = MO.getReg();
+    if (!Reg.isVirtual())
+      continue;
+    if (MO.isDead())
+      continue;
+    auto &LiveMask = LiveRegs[Reg];
+    auto PrevMask = LiveMask;
+    LiveMask |= getDefRegMask(MO, *MRI);
+    CurPressure.inc(Reg, PrevMask, LiveMask, *MRI);
+  }
+
+  SlotIndex SI = LIS->getInstructionIndex(MI).getBoundaryIndex();
+  assert(SI.isValid());
+
+  // Remove dead registers or mask bits.
+  SmallSet<Register, 8> SeenRegs;
+  for (auto &MO : MI.operands()) {
+    if (!MO.isReg() || !MO.getReg().isVirtual())
+      continue;
+    if (!MO.isUse())
+      continue;
+    if (!MO.readsReg())
+      continue;
+    if (!SeenRegs.insert(MO.getReg()).second)
+      continue;
+
+    const LiveInterval &LI = LIS->getInterval(MO.getReg());
+    if (LI.hasSubRanges()) {
+      auto It = LiveRegs.end();
+      for (const auto &S : LI.subranges()) {
+        if (S.expiredAt(SI)) {
+          if (It == LiveRegs.end()) {
+            It = LiveRegs.find(MO.getReg());
+            if (It == LiveRegs.end())
+              llvm_unreachable("register isn't live");
+          }
+          auto PrevMask = It->second;
+          It->second &= ~S.LaneMask;
+          CurPressure.inc(MO.getReg(), PrevMask, It->second, *MRI);
+        }
+      }
+      if (It != LiveRegs.end() && It->second.none()) {
+        LiveRegs.erase(It);
+      }
+    } else if (LI.expiredAt(SI)) {
+      auto It = LiveRegs.find(MO.getReg());
+      if (It == LiveRegs.end())
+        llvm_unreachable("register isn't live");
+      CurPressure.inc(MO.getReg(), It->second, LaneBitmask::getNone(), *MRI);
+      LiveRegs.erase(It);
+    }
+  }
+
+  MaxPressure = max(MaxPressure, CurPressure);
+}
+
 Printable llvm::print(const GCNRPTracker::LiveRegSet &LiveRegs,
                       const MachineRegisterInfo &MRI) {
   return Printable([&LiveRegs, &MRI](raw_ostream &OS) {
diff --git a/llvm/lib/Target/AMDGPU/GCNRegPressure.h b/llvm/lib/Target/AMDGPU/GCNRegPressure.h
index 752f53752fa68b..3991a51ff09a3b 100644
--- a/llvm/lib/Target/AMDGPU/GCNRegPressure.h
+++ b/llvm/lib/Target/AMDGPU/GCNRegPressure.h
@@ -259,6 +259,41 @@ class GCNDownwardRPTracker : public GCNRPTracker {
                const LiveRegSet *LiveRegsCopy = nullptr);
 };
 
+class GCNIterativeRPTracker {
+public:
+  using LiveRegSet = DenseMap<unsigned, LaneBitmask>;
+
+protected:
+  LiveRegSet LiveRegs;
+  GCNRegPressure CurPressure, MaxPressure;
+
+  mutable const MachineRegisterInfo *MRI = nullptr;
+
+  GCNIterativeRPTracker() {};
+
+public:
+  void reset(const MachineRegisterInfo *MRI_, const LiveRegSet *LiveRegsCopy);
+
+  GCNRegPressure getPressure() const { return CurPressure; }
+  GCNRegPressure getMaxPressure() const { return MaxPressure; }
+};
+
+class GCNIterativeUpwardRPTracker : public GCNIterativeRPTracker {
+public:
+  GCNIterativeUpwardRPTracker() {};
+
+  // Move to the state just before the MI.
+  void recede(const MachineInstr &MI, LiveIntervals *TheLIS);
+};
+
+class GCNIterativeDownwardRPTracker : public GCNIterativeRPTracker {
+public:
+  GCNIterativeDownwardRPTracker() {};
+
+  // Move to the state just after the MI.
+  void advance(const MachineInstr &MI, LiveIntervals *TheLIS);
+};
+
 LaneBitmask getLiveLaneMask(unsigned Reg,
                             SlotIndex SI,
                             const LiveIntervals &LIS,
@@ -275,44 +310,8 @@ GCNRPTracker::LiveRegSet getLiveRegs(SlotIndex SI, const LiveIntervals &LIS,
 /// After - upon entry or exit of every instruction
 /// Note: there is no entry in the map for instructions with empty live reg set
 /// Complexity = O(NumVirtRegs * averageLiveRangeSegmentsPerReg * lg(R))
-template <typename Range>
-DenseMap<MachineInstr*, GCNRPTracker::LiveRegSet>
-getLiveRegMap(Range &&R, bool After, LiveIntervals &LIS) {
-  std::vector<SlotIndex> Indexes;
-  Indexes.reserve(std::distance(R.begin(), R.end()));
-  auto &SII = *LIS.getSlotIndexes();
-  for (MachineInstr *I : R) {
-    auto SI = SII.getInstructionIndex(*I);
-    Indexes.push_back(After ? SI.getDeadSlot() : SI.getBaseIndex());
-  }
-  llvm::sort(Indexes);
-
-  auto &MRI = (*R.begin())->getParent()->getParent()->getRegInfo();
-  DenseMap<MachineInstr *, GCNRPTracker::LiveRegSet> LiveRegMap;
-  SmallVector<SlotIndex, 32> LiveIdxs, SRLiveIdxs;
-  for (unsigned I = 0, E = MRI.getNumVirtRegs(); I != E; ++I) {
-    auto Reg = Register::index2VirtReg(I);
-    if (!LIS.hasInterval(Reg))
-      continue;
-    auto &LI = LIS.getInterval(Reg);
-    LiveIdxs.clear();
-    if (!LI.findIndexesLiveAt(Indexes, std::back_inserter(LiveIdxs)))
-      continue;
-    if (!LI.hasSubRanges()) {
-      for (auto SI : LiveIdxs)
-        LiveRegMap[SII.getInstructionFromIndex(SI)][Reg] =
-          MRI.getMaxLaneMaskForVReg(Reg);
-    } else
-      for (const auto &S : LI.subranges()) {
-        // constrain search for subranges by indexes live at main range
-        SRLiveIdxs.clear();
-        S.findIndexesLiveAt(LiveIdxs, std::back_inserter(SRLiveIdxs));
-        for (auto SI : SRLiveIdxs)
-          LiveRegMap[SII.getInstructionFromIndex(SI)][Reg] |= S.LaneMask;
-      }
-  }
-  return LiveRegMap;
-}
+DenseMap<int, GCNRPTracker::LiveRegSet>
+getLiveRegMap(DenseMap<MachineInstr *, int> &R, bool After, LiveIntervals &LIS);
 
 inline GCNRPTracker::LiveRegSet getLiveRegsAfter(const MachineInstr &MI,
                                                  const LiveIntervals &LIS) {
diff --git a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
index 94d93390d0916f..4019dcebd8d5c5 100644
--- a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
@@ -58,6 +58,11 @@ static cl::opt<bool>
                         "Wave Limited (amdgpu-limit-wave-threshold)."),
                cl::init(false));
 
+static cl::opt<bool> GCNTrackers(
+    "amdgpu-use-gcn-iterative-trackers", cl::Hidden,
+    cl::desc("Use the GCN specific iterative RPTrackers during scheduling"),
+    cl::init(false));
+
 const unsigned ScheduleMetrics::ScaleFactor = 100;
 
 GCNSchedStrategy::GCNSchedStrategy(const MachineSchedContext *C)
@@ -128,23 +133,46 @@ void GCNSchedStrategy::initCandidate(SchedCandidate &Cand, SUnit *SU,
   if (!DAG->isTrackingPressure())
     return;
 
-  // getDownwardPressure() and getUpwardPressure() make temporary changes to
-  // the tracker, so we need to pass those function a non-const copy.
-  RegPressureTracker &TempTracker = const_cast<RegPressureTracker&>(RPTracker);
+  unsigned NewSGPRPressure, NewVGPRPressure;
+  if (!GCNTrackers) {
+    // getDownwardPressure() and getUpwardPressure() make temporary changes to
+    // the tracker, so we need to pass those function a non-const copy.
+    RegPressureTracker &TempTracker =
+        const_cast<RegPressureTracker &>(RPTracker);
+
+    Pressure.clear();
+    MaxPressure.clear();
+
+    if (AtTop)
+      TempTracker.getDownwardPressure(SU->getInstr(), Pressure, MaxPressure);
+    else {
+      // FIXME: I think for bottom up scheduling, the register pressure is
+      // cached and can be retrieved by DAG->getPressureDif(SU).
+      TempTracker.getUpwardPressure(SU->getInstr(), Pressure, MaxPressure);
+    }
+    NewSGPRPressure = Pressure[AMDGPU::RegisterPressureSets::SReg_32];
+    NewVGPRPressure = Pressure[AMDGPU::RegisterPressureSets::VGPR_32];
+  }
 
-  Pressure.clear();
-  MaxPressure.clear();
+  if (GCNTrackers) {
+    if (AtTop) {
+      GCNIterativeDownwardRPTracker TempTopTracker(TheTracker);
+      auto MI = SU->getInstr();
+      TempTopTracker.advance(*MI, DAG->getLIS());
 
-  if (AtTop)
-    TempTracker.getDownwardPressure(SU->getInstr(), Pressure, MaxPressure);
-  else {
-    // FIXME: I think for bottom up scheduling, the register pressure is cached
-    // and can be retrieved by DAG->getPressureDif(SU).
-    TempTracker.getUpwardPressure(SU->getInstr(), Pressure, MaxPressure);
-  }
+      NewSGPRPressure = TempTopTracker.getPressure().getSGPRNum();
+      NewVGPRPressure = TempTopTracker.getPressure().getVGPRNum(false);
+    }
+
+    else {
+      GCNIterativeUpwardRPTracker TempBotTracker(TheUpwardTracker);
+      auto MI = SU->getInstr();
+      TempBotTracker.recede(*MI, DAG->getLIS());
 
-  unsigned NewSGPRPressure = Pressure[AMDGPU::RegisterPressureSets::SReg_32];
-  unsigned NewVGPRPressure = Pressure[AMDGPU::RegisterPressureSets::VGPR_32];
+      NewSGPRPressure = TempBotTracker.getPressure().getSGPRNum();
+      NewVGPRPressure = TempBotTracker.getPressure().getVGPRNum(false);
+    }
+  }
 
   // If two instructions increase the pressure of different register sets
   // by the same amount, the generic scheduler will prefer to schedule the
@@ -213,12 +241,20 @@ void GCNSchedStrategy::pickNodeFromQueue(SchedBoundary &Zone,
   unsigned SGPRPressure = 0;
   unsigned VGPRPressure = 0;
   if (DAG->isTrackingPressure()) {
-    SGPRPressure = Pressure[AMDGPU::RegisterPressureSets::SReg_32];
-    VGPRPressure = Pressure[AMDGPU::RegisterPressureSets::VGPR_32];
+    SGPRPressure =
+        GCNTrackers
+            ? (Zone.isTop() ? TheTracker.getPressure().getSGPRNum()
+                            : TheUpwardTracker.getPressure().getSGPRNum())
+            : Pressure[AMDGPU::RegisterPressureSets::SReg_32];
+    VGPRPressure =
+        GCNTrackers
+            ? (Zone.isTop() ? TheTracker.getPressure().getVGPRNum(false)
+                            : TheUpwardTracker.getPressure().getVGPRNum(false))
+            : Pressure[AMDGPU::RegisterPressureSets::VGPR_32];
   }
+
   ReadyQueue &Q = Zone.Available;
   for (SUnit *SU : Q) {
-
     SchedCandidate TryCand(ZonePolicy);
     initCandidate(TryCand, SU, Zone.isTop(), RPTracker, SRI,
                   SGPRPressure, VGPRPressure);
@@ -312,6 +348,16 @@ SUnit *GCNSchedStrategy::pickNodeBidirectional(bool &IsTopNode) {
   return Cand.SU;
 }
 
+void GCNSchedStrategy::schedNode(SUnit *SU, bool IsTopNode) {
+  if (GCNTrackers) {
+    MachineInstr *MI = SU->getInstr();
+    IsTopNode ? TheTracker.advance(*MI, DAG->getLIS())
+              : TheUpwardTracker.recede(*MI, DAG->getLIS());
+  }
+
+  return GenericScheduler::schedNode(SU, IsTopNode);
+}
+
 // This function is mostly cut and pasted from
 // GenericScheduler::pickNode()
 SUnit *GCNSchedStrategy::pickNode(bool &IsTopNode) {
@@ -565,7 +611,7 @@ void GCNScheduleDAGMILive::computeBlockPressure(unsigned RegionIdx,
     MBBLiveIns.erase(LiveInIt);
   } else {
     I = Rgn.first;
-    auto LRS = BBLiveInMap.lookup(NonDbgMI);
+    auto LRS = BBLiveInMap.lookup(CurRegion);
 #ifdef EXPENSIVE_CHECKS
     assert(isEqual(getLiveRegsBefore(*NonDbgMI, *LIS), LRS));
 #endif
@@ -599,23 +645,42 @@ void GCNScheduleDAGMILive::computeBlockPressure(unsigned RegionIdx,
   }
 }
 
-DenseMap<MachineInstr *, GCNRPTracker::LiveRegSet>
+DenseMap<int, GCNRPTracker::LiveRegSet>
 GCNScheduleDAGMILive::getBBLiveInMap() const {
   assert(!Regions.empty());
-  std::vector<MachineInstr *> BBStarters;
-  BBStarters.reserve(Regions.size());
-  auto I = Regions.rbegin(), E = Regions.rend();
-  auto *BB = I->first->getParent();
-  do {
-    auto *MI = &*skipDebugInstructionsForward(I->first, I->second);
-    BBStarters.push_back(MI);
-    do {
-      ++I;
-    } while (I != E && I->first->getParent() == BB);
-  } while (I != E);
+  DenseMap<MachineInstr *, int> BBStarters;
+  for (int I = Regions.size() - 1; I >= 0; I--) {
+    auto Rgn = Regions[I];
+    auto *MI = &*skipDebugInstructionsForward(Rgn.first, Rgn.second);
+    BBStarters.insert({MI, I});
+  }
   return getLiveRegMap(BBStarters, false /*After*/, *LIS);
 }
 
+DenseMap<int, GCNRPTracker::LiveRegSet>
+GCNScheduleDAGMILive::getBBLiveOutMap() const {
+  assert(!Regions.empty());
+  DenseMap<MachineInstr *, int> BBEnders;
+  for (int I = Regions.size() - 1; I >= 0; I--) {
+    auto Rgn = Regions[I];
+    auto TheBB = Rgn.first->getParent();
+    if (Rgn.second != TheBB->end() && !Rgn.second->isDebugInstr()) {
+      BBEnders.insert({&*Rgn.second, I});
+      continue;
+    }
+    if (Rgn.second == TheBB->end()) {
+      auto *MI = &*prev_nodbg(Rgn.second, Rgn.first);
+      BBEnders.insert({&*MI, I});
+      continue;
+    }
+
+    auto *MI = &*skipDebugInstructionsBackward(Rgn.second, Rgn.first);
+    BBEnders.insert({MI, I});
+  }
+
+  return getLiveRegMap(BBEnders, true /*After*/, *LIS);
+}
+
 void GCNScheduleDAGMILive::finalizeSchedule() {
   // Start actual scheduling here. This function is called by the base
   // MachineScheduler after all regions have been recorded by
@@ -639,9 +704,14 @@ void GCNScheduleDAGMILive::finalizeSchedule() {
 void GCNScheduleDAGMILive::runSchedStages() {
   LLVM_DEBUG(dbgs() << "All regions recorded, starting actual scheduling.\n");
 
-  if (!Regions.empty())
+  if (!Regions.empty()) {
     BBLiveInMap = getBBLiveInMap();
 
+    if (GCNTrackers) {
+      BBLiveOutMap = getBBLiveOutMap();
+    }
+  }
+
   GCNSchedStrategy &S = static_cast<GCNSchedStrategy &>(*SchedImpl);
   while (S.advanceStage()) {
     auto Stage = createSchedStage(S.getCurrentStage());
@@ -658,6 +728,27 @@ void GCNScheduleDAGMILive::runSchedStages() {
         continue;
       }
 
+      if (GCNTrackers) {
+        GCNIterativeDownwardRPTracker *TheTracker = S.getTracker();
+        GCNIterativeUpwardRPTracker *TheUpwardTracker = S.getUpwardTracker();
+        auto LiveInEntry = BBLiveInMap.find(Stage->getRegionIdx());
+        GCNRPTracker::LiveRegSet *LiveIns =
+            LiveInEntry != BBLiveInMap.end() ? &LiveInEntry->second : nullptr;
+        auto LiveOutEntry = BBLiveOutMap.find(Stage->getRegionIdx());
+        GCNRPTracker::LiveRegSet *LiveOuts = LiveOutEntry != BBLiveOutMap.end()
+                                                 ? &LiveOutEntry->second
+                                                 : nullptr;
+        TheTracker->reset(
+            &Regions[Stage->getRegionIdx()].first->getMF()->getRegInfo(),
+            LiveIns);
+        TheUpwardTracker->reset(
+            &Regions[Stage->getRegionIdx()].first->getMF()->getRegInfo(),
+            LiveOuts);
+
+        S.setTracker(*TheTracker);
+        S.setUpwardTracker(*TheUpwardTracker);
+      }
+
       ScheduleDAGMILive::schedule();
       Stage->finalizeGCNRegion();
     }
@@ -1479,9 +1570,6 @@ bool PreRARematStage::sinkTriviallyRematInsts(const GCNSubtarget &ST,
     MachineInstr *MI = Entry.first;
     MachineInstr *OldMI = Entry.second;
 
-    // Remove OldMI from BBLiveInMap since we are sinking it from its MBB.
-    DAG.BBLiveInMap.erase(OldMI);
-
     // Remove OldMI and update LIS
     Register Reg = MI->getOperand(0).getReg();
     LIS->RemoveMachineInstrFromMaps(*OldMI);
@@ -1493,9 +1581,14 @@ bool PreRARematStage::sinkTriviallyRematInsts(const GCNSubtarget &ST,
   // Update live-ins, register pressure, and regions caches.
   for (auto Idx : ImpactedRegions) {
     DAG.LiveIns[Idx] = NewLiveIns[Idx];
+    DAG.BBLiveInMap[Idx] = NewLiveIns[Idx];
     DAG.Pressure[Idx] = NewPressure[Idx];
     DAG.MBBLiveIns.erase(DAG.Regions[Idx].first->getParent());
   }
+
+  if (GCNTrackers)
+    DAG.BBLiveOutMap = DAG.getBBLiveOutMap();
+
   DAG.Regions = NewRegions;
   DAG.RescheduleRegions = NewRescheduleRegions;
 
diff --git a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.h b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.h
index 2084aae4128ff3..e463bd76c1124e 100644
--- a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.h
+++ b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.h
@@ -70,6 +70,12 @@ class GCNSchedStrategy : public GenericScheduler {
   // Pointer to the current SchedStageID.
   SmallVectorImpl<GCNSchedStageID>::iterator CurrentStage = nullptr;
 
+  // GCN RP Tracker for top-down scheduling
+  GCNIterativeDownwardRPTracker TheTracker;
+
+  // GCN RP Tracker for bottom-up scheduling
+  GCNIterativeUpwardRPTracker TheUpwardTracker;
+
 public:
   // schedule() have seen register pressure over the critical limits and had to
   // track register pressure for actual scheduling heuristics.
@@ -102,6 +108,8 @@ class GCNSchedStrategy : public GenericScheduler {
 
   SUnit *pickNode(bool &IsTopNode) override;
 
+  void schedNode(SUnit *SU, bool IsTopNode) override;
+
   void initialize(ScheduleDAGMI *DAG) override...
[truncated]


github-actions bot commented Apr 15, 2024

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff 470aefb240dee7d791875284b9917bf641ca971a dd877ee1048acb7f3a3fb5ea27494e9b58c23f57 -- llvm/lib/Target/AMDGPU/GCNIterativeScheduler.cpp llvm/lib/Target/AMDGPU/GCNRegPressure.cpp llvm/lib/Target/AMDGPU/GCNRegPressure.h llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp llvm/lib/Target/AMDGPU/GCNSchedStrategy.h
View the diff from clang-format here.
diff --git a/llvm/lib/Target/AMDGPU/GCNRegPressure.h b/llvm/lib/Target/AMDGPU/GCNRegPressure.h
index 69a04e1818..b45936c63d 100644
--- a/llvm/lib/Target/AMDGPU/GCNRegPressure.h
+++ b/llvm/lib/Target/AMDGPU/GCNRegPressure.h
@@ -311,7 +311,8 @@ GCNRPTracker::LiveRegSet getLiveRegs(SlotIndex SI, const LiveIntervals &LIS,
 /// Note: there is no entry in the map for instructions with empty live reg set
 /// Complexity = O(NumVirtRegs * averageLiveRangeSegmentsPerReg * lg(R))
 DenseMap<unsigned, GCNRPTracker::LiveRegSet>
-getLiveRegMap(DenseMap<MachineInstr *, unsigned> &R, bool After, LiveIntervals &LIS);
+getLiveRegMap(DenseMap<MachineInstr *, unsigned> &R, bool After,
+              LiveIntervals &LIS);
 
 inline GCNRPTracker::LiveRegSet getLiveRegsAfter(const MachineInstr &MI,
                                                  const LiveIntervals &LIS) {

@vpykhtin
Contributor

Hi @jrbyrnes, it would be nice if you could add a few words explaining the overall approach. I'm curious why these new trackers are "iterative": is it just a name to distinguish them from the current ones?

@@ -343,6 +343,48 @@ void GCNRPTracker::reset(const MachineInstr &MI,
MaxPressure = CurPressure = getRegPressure(*MRI, LiveRegs);
}

DenseMap<int, GCNRPTracker::LiveRegSet>
Contributor


I'm not sure what the key is here. I'm also concerned by building maps for liveness info, that's already encoded in the LiveIntervals. Most of this code is also replicating logic from the generic tracker, and all of this is really hard to get correct.

Contributor Author


I'm not sure what to do with this comment.

I'm not sure what the key is here.

Currently, these maps are keyed on the initial first instruction of each region. However, if we are going to requery, we need a different key, as these first instructions are likely to change. I've instead used RegionIdx, which tracks the index into the Regions array. RegionIdx is similarly used to index these structures: https://github.com/llvm/llvm-project/blob/a5044e6d505deb79f1b00bb39d11096d29b9c910/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp#L623C3-L623C10 I guess we could restructure this as a vector.

I'm concerned by building maps for liveness info, that's already encoded in the LiveIntervals.

While this patch does introduce a LiveOutMap, it doesn't introduce the notion of using a Map to cache live-ins -- this already exists in the code. In this patch, these maps optimize compile time by avoiding recalculation of the live regs.

Most of this code is also replicating logic from the generic tracker, and all of this is really hard to get correct.

This is mostly copy-paste from existing code in GCNRegPressure, but nonetheless we're running a CQE cycle to root out any correctness / performance issues. Per discussion, I think it's better to slightly duplicate the logic for the time being, and potentially rework the generic tracker as a long-term project.

Contributor Author


I'm not sure what the key is here. I'm also concerned by building maps for liveness info, that's already encoded in the LiveIntervals.

I've tried to address this in #93089

I've left getLiveRegMap unmodified and used the last MI per region as the key to the liveness info (as is similarly done for BBLiveInMap currently). The result is that we do not introduce any new way of building maps for liveness info. The maps are not strictly necessary, and we could recompute them each time we encounter the same region during scheduling, but they do eliminate redundant calculations.

Change-Id: I5522bc12692d9c0d406f0d6e3fe002ff6ff40494
@jrbyrnes
Contributor Author

jrbyrnes commented Apr 18, 2024

Would be nice if you add a few words explaining overall approach. I'm curious why these new trackers are iterative, is it just a name to distinguish them from current?

I'll explain the approach by defining the issues.

GCNSchedStrategy uses GCNDownwardRPTracker in a couple different places to calculate the pressure of a region. At these points, the schedule is in a steady state, so the tracker can scan through the region in order. Thus, it is okay to internalize the iterators (LastTrackedMI, NextMI, MBBEnd), which the tracker maintains through calls to advance() / advanceBeforeNext() / advanceToNext(). Obviously during scheduling the order is subject to change, so these internal iterators don't make sense -- at least not in the way they are currently maintained. Moreover, with these, there is no good way to query the impact a candidate SU will have on RP (initCandidate) since calls to advance* will update the iterators.

On the other hand, GCNIterativeDownwardRPTracker does not assume an in-order scan of a region, so it does not suffer from the above problems. I named it iterative because we are incrementally constructing the order, but I'm not very confident about the naming. Note that the GCNUpwardRPTracker has a more friendly interface for this use-case, but I also implemented GCNIterativeUpwardRPTracker since scheduling is bi-directional, and having a shared base class is the cleaner approach.

As noted in the code, the actual RP calculation implementation is largely copy & paste from the existing trackers. Note: when calculating RP, we call GCNRegPressure::getRegKind, which treats %AV registers as VGPR. During scheduling, we use the trackers to update the state of RP when we schedNode, and use this state to measure the impact of a candidate SU in initCandidate. When first scheduling a region, we initialize the trackers with the LiveIn (Downward) / LiveOut (Upward) registers as calculated by the existing getLiveRegMap. This calculation is global live-through aware, as opposed to the generic tracker, which isn't ("/// The register tracker is unaware of global liveness so ignores normal ...").

The generic trackers still maintain RP through scheduling, but the meaningful use case of this is when picking between candidate SUs, and that use case has been replaced with the RP from our trackers (initCandidate).

Change-Id: I1ce9d57c2971f0b498144f353de0e112a9b60a78
@vpykhtin
Contributor

Ok, thanks. Just on a historical note: the upward and downward trackers were introduced for different purposes. The upward tracker was designed with two ideas in mind:

  1. To be able to track an unimplemented schedule, that is, one whose instruction order is maintained externally, for example in an array.
  2. The LIS isn't updated for the unimplemented schedule.

In this sense it is iterative, because it was intended to be used on iterations over different schedules.

Currently I need to be able to collect the live registers at the point of maximum pressure over an instruction. It seems that this is not the same set that would be collected by iterating over slot index R, because an instruction defines and uses registers at slot index R, and we would consider both the 'dest' and 'source' registers as live at a COPY instruction, the mistake that the upward tracker had previously.

For that reason I would add some interface to be able to collect live registers at different stages of instruction execution.

Change-Id: Ibde513bc45e08966b471b7189b6dc8953bd2d198
@jrbyrnes
Contributor Author

jrbyrnes commented May 7, 2024

Ping for any other high level comments while I transition this to ready-for-review?

@jrbyrnes jrbyrnes changed the title [AMDGPU] Add and optionally use GCNIterativeRPTrackers [AMDGPU] RFC: Add and optionally use GCNIterativeRPTrackers May 7, 2024
@vpykhtin
Contributor

Sorry, but this is very hard to review. For example, you've moved getLiveRegMap to a new location and modified it at the same time. Also, I'm starting to think it wasn't a good idea to duplicate the tracker's code, because I cannot see the diffs.

If I understood correctly, you can reuse the GCN upward tracker in its current state? You only need to introduce an iterative downward tracker?

@jrbyrnes
Contributor Author

I plan to split up the PR and refactor a bit to make it easier on the reviewer.

@jrbyrnes
Contributor Author

Closing in favor of the stack of PRs:

Create BBLiveOutMap using existing getLiveRegMap: #93089
Extend Current Trackers for External iterator: #93088
Scheduler changes necessary to use new trackers: #93090

@jrbyrnes jrbyrnes closed this May 22, 2024