[AMDGPU][MISCHED] GCNBalancedSchedStrategy. #66634

Closed

Conversation

Contributor

@alex-t alex-t commented Sep 18, 2023

This change implements a scheduling strategy that aims to find a reasonable trade-off between ILP and occupancy.
For that purpose, it computes a heuristic metric to decide whether the current schedule is worth keeping.
This is an attempt to use the same idea as in https://reviews.llvm.org/D139710 to replace the shouldRevertScheduling function.
Unlike https://reviews.llvm.org/D139710, the heuristic is applied to all scheduling stages.
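
For illustration only, here is a small self-contained sketch of the heuristic as it appears in the patch: a stall-based metric (stall cycles per ScaleFactor schedule cycles) is compared before and after a stage, weighted by the occupancy ratio, and the schedule is reverted when the combined profit falls below ScaleFactor. The helper names, the main() driver, and the ScheduleMetricBias value are assumptions for the example, not code from the patch.

// Standalone illustration of the profit heuristic (not part of the patch).
// All values are scaled integers to avoid floating point.
#include <cstdio>

constexpr unsigned ScaleFactor = 100;        // same scale as in the patch
constexpr unsigned ScheduleMetricBias = 10;  // assumed value, for illustration

// Metric = stall cycles per ScaleFactor schedule cycles (lower is better);
// clamp to 1 so later divisions stay well defined.
static unsigned stallMetric(unsigned StallCycles, unsigned ScheduleLength) {
  unsigned M = (StallCycles * ScaleFactor) / ScheduleLength;
  return M ? M : 1;
}

// Returns true if the new schedule should be reverted: the occupancy ratio
// times the metric improvement must reach at least ScaleFactor (i.e. 100%).
static bool shouldRevert(unsigned WavesBefore, unsigned WavesAfter,
                         unsigned OldMetric, unsigned NewMetric) {
  unsigned Profit =
      ((WavesAfter * ScaleFactor) / WavesBefore *
       ((OldMetric + ScheduleMetricBias) * ScaleFactor) / NewMetric) /
      ScaleFactor;
  return Profit < ScaleFactor;
}

int main() {
  // Occupancy drops from 8 to 7 waves, but stalls shrink from 40 to 20 per
  // 100 cycles; the latency win outweighs the wave loss, so keep the schedule.
  unsigned OldMetric = stallMetric(40, 100); // 40
  unsigned NewMetric = stallMetric(20, 100); // 20
  printf("revert: %d\n", shouldRevert(8, 7, OldMetric, NewMetric)); // revert: 0
}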

Collaborator

llvmbot commented Sep 18, 2023

@llvm/pr-subscribers-backend-amdgpu

@llvm/pr-subscribers-llvm-globalisel

Changes


Patch is 521.23 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/66634.diff

23 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp (+23)
  • (modified) llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp (+74-168)
  • (modified) llvm/lib/Target/AMDGPU/GCNSchedStrategy.h (+99-40)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/combine-fma-add-fma-mul.ll (+52-52)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/udiv.i64.ll (+461-461)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/urem.i64.ll (+444-444)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/usubsat.ll (+22-22)
  • (modified) llvm/test/CodeGen/AMDGPU/bf16.ll (+32-32)
  • (modified) llvm/test/CodeGen/AMDGPU/debug-value-scheduler.mir (-2)
  • (modified) llvm/test/CodeGen/AMDGPU/fcanonicalize.f16.ll (+114-117)
  • (modified) llvm/test/CodeGen/AMDGPU/function-args.ll (+445-450)
  • (modified) llvm/test/CodeGen/AMDGPU/function-returns.ll (+9-9)
  • (modified) llvm/test/CodeGen/AMDGPU/gfx-callable-return-types.ll (+148-150)
  • (modified) llvm/test/CodeGen/AMDGPU/half.ll (+62-63)
  • (modified) llvm/test/CodeGen/AMDGPU/insert_vector_dynelt.ll (+594-594)
  • (modified) llvm/test/CodeGen/AMDGPU/load-constant-i1.ll (+433-431)
  • (modified) llvm/test/CodeGen/AMDGPU/load-constant-i16.ll (+233-227)
  • (modified) llvm/test/CodeGen/AMDGPU/load-constant-i32.ll (+89-91)
  • (modified) llvm/test/CodeGen/AMDGPU/load-constant-i8.ll (+256-247)
  • (modified) llvm/test/CodeGen/AMDGPU/load-global-i16.ll (+295-292)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-scheduler-sink-trivial-remats.mir (-48)
  • (modified) llvm/test/CodeGen/AMDGPU/scc-clobbered-sgpr-to-vmem-spill.ll (+115-115)
  • (modified) llvm/test/CodeGen/AMDGPU/sched-assert-dead-def-subreg-use-other-subreg.mir (+2-2)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
index 481fbaf1543a4ea..04493c62d2e7a3d 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
@@ -343,6 +343,12 @@ static cl::opt<bool> EnableRewritePartialRegUses(
     cl::desc("Enable rewrite partial reg uses pass"), cl::init(false),
     cl::Hidden);
 
+static cl::opt<bool> EnableBalancedSchedStrategy(
+    "amdgpu-enable-balanced-scheduling-strategy",
+    cl::desc(
+        "Enable scheduling strategy to tradeoff between ILP and occupancy."),
+    cl::Hidden, cl::init(true));
+
 extern "C" LLVM_EXTERNAL_VISIBILITY void LLVMInitializeAMDGPUTarget() {
   // Register the target
   RegisterTargetMachine<R600TargetMachine> X(getTheR600Target());
@@ -448,6 +454,20 @@ createGCNMaxILPMachineScheduler(MachineSchedContext *C) {
   return DAG;
 }
 
+static ScheduleDAGInstrs *
+createGCNBalancedMachineScheduler(MachineSchedContext *C) {
+  const GCNSubtarget &ST = C->MF->getSubtarget<GCNSubtarget>();
+  ScheduleDAGMILive *DAG =
+    new GCNScheduleDAGMILive(C, std::make_unique<GCNBalancedSchedStrategy>(C));
+  DAG->addMutation(createLoadClusterDAGMutation(DAG->TII, DAG->TRI));
+  if (ST.shouldClusterStores())
+    DAG->addMutation(createStoreClusterDAGMutation(DAG->TII, DAG->TRI));
+  DAG->addMutation(createIGroupLPDAGMutation());
+  DAG->addMutation(createAMDGPUMacroFusionDAGMutation());
+  DAG->addMutation(createAMDGPUExportClusteringDAGMutation());
+  return DAG;
+}
+
 static ScheduleDAGInstrs *
 createIterativeGCNMaxOccupancyMachineScheduler(MachineSchedContext *C) {
   const GCNSubtarget &ST = C->MF->getSubtarget<GCNSubtarget>();
@@ -1126,6 +1146,9 @@ ScheduleDAGInstrs *GCNPassConfig::createMachineScheduler(
   if (EnableMaxIlpSchedStrategy)
     return createGCNMaxILPMachineScheduler(C);
 
+  if (EnableBalancedSchedStrategy)
+    return createGCNBalancedMachineScheduler(C);
+
   return createGCNMaxOccupancyMachineScheduler(C);
 }
 
diff --git a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
index 994cfea1fd7db67..9d8504af38c8c74 100644
--- a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
@@ -52,8 +52,6 @@ static cl::opt<bool>
                         "Wave Limited (amdgpu-limit-wave-threshold)."),
                cl::init(false));
 
-const unsigned ScheduleMetrics::ScaleFactor = 100;
-
 GCNSchedStrategy::GCNSchedStrategy(const MachineSchedContext *C)
     : GenericScheduler(C), TargetOccupancy(0), MF(nullptr),
       HasHighPressure(false) {}
@@ -703,7 +701,7 @@ bool UnclusteredHighRPStage::initGCNSchedStage() {
   if (!GCNSchedStage::initGCNSchedStage())
     return false;
 
-  if (DAG.RegionsWithHighRP.none() && DAG.RegionsWithExcessRP.none())
+  if (DAG.RegionsWithExcessRP.none())
     return false;
 
   SavedMutations.swap(DAG.Mutations);
@@ -839,6 +837,7 @@ bool GCNSchedStage::initGCNRegion() {
 
   S.HasHighPressure = false;
   S.KnownExcessRP = isRegionWithExcessRP();
+  S.clearMetric();
 
   if (DAG.RegionsWithIGLPInstrs[RegionIdx] &&
       StageID != GCNSchedStageID::UnclusteredHighRPReschedule) {
@@ -970,9 +969,7 @@ void GCNSchedStage::checkScheduling() {
     DAG.RegionsWithExcessRP[RegionIdx] = true;
   }
 
-  // Revert if this region's schedule would cause a drop in occupancy or
-  // spilling.
-  if (shouldRevertScheduling(WavesAfter)) {
+  if (shouldRevertScheduling(WavesAfter, WavesBefore)) {
     revertScheduling();
   } else {
     DAG.Pressure[RegionIdx] = PressureAfter;
@@ -981,193 +978,52 @@ void GCNSchedStage::checkScheduling() {
   }
 }
 
-unsigned
-GCNSchedStage::computeSUnitReadyCycle(const SUnit &SU, unsigned CurrCycle,
-                                      DenseMap<unsigned, unsigned> &ReadyCycles,
-                                      const TargetSchedModel &SM) {
-  unsigned ReadyCycle = CurrCycle;
-  for (auto &D : SU.Preds) {
-    if (D.isAssignedRegDep()) {
-      MachineInstr *DefMI = D.getSUnit()->getInstr();
-      unsigned Latency = SM.computeInstrLatency(DefMI);
-      unsigned DefReady = ReadyCycles[DAG.getSUnit(DefMI)->NodeNum];
-      ReadyCycle = std::max(ReadyCycle, DefReady + Latency);
-    }
-  }
-  ReadyCycles[SU.NodeNum] = ReadyCycle;
-  return ReadyCycle;
-}
-
-#ifndef NDEBUG
-struct EarlierIssuingCycle {
-  bool operator()(std::pair<MachineInstr *, unsigned> A,
-                  std::pair<MachineInstr *, unsigned> B) const {
-    return A.second < B.second;
-  }
-};
-
-static void printScheduleModel(std::set<std::pair<MachineInstr *, unsigned>,
-                                        EarlierIssuingCycle> &ReadyCycles) {
-  if (ReadyCycles.empty())
-    return;
-  unsigned BBNum = ReadyCycles.begin()->first->getParent()->getNumber();
-  dbgs() << "\n################## Schedule time ReadyCycles for MBB : " << BBNum
-         << " ##################\n# Cycle #\t\t\tInstruction          "
-            "             "
-            "                            \n";
-  unsigned IPrev = 1;
-  for (auto &I : ReadyCycles) {
-    if (I.second > IPrev + 1)
-      dbgs() << "****************************** BUBBLE OF " << I.second - IPrev
-             << " CYCLES DETECTED ******************************\n\n";
-    dbgs() << "[ " << I.second << " ]  :  " << *I.first << "\n";
-    IPrev = I.second;
-  }
-}
-#endif
-
-ScheduleMetrics
-GCNSchedStage::getScheduleMetrics(const std::vector<SUnit> &InputSchedule) {
-#ifndef NDEBUG
-  std::set<std::pair<MachineInstr *, unsigned>, EarlierIssuingCycle>
-      ReadyCyclesSorted;
-#endif
-  const TargetSchedModel &SM = ST.getInstrInfo()->getSchedModel();
-  unsigned SumBubbles = 0;
-  DenseMap<unsigned, unsigned> ReadyCycles;
-  unsigned CurrCycle = 0;
-  for (auto &SU : InputSchedule) {
-    unsigned ReadyCycle =
-        computeSUnitReadyCycle(SU, CurrCycle, ReadyCycles, SM);
-    SumBubbles += ReadyCycle - CurrCycle;
-#ifndef NDEBUG
-    ReadyCyclesSorted.insert(std::make_pair(SU.getInstr(), ReadyCycle));
-#endif
-    CurrCycle = ++ReadyCycle;
-  }
-#ifndef NDEBUG
-  LLVM_DEBUG(
-      printScheduleModel(ReadyCyclesSorted);
-      dbgs() << "\n\t"
-             << "Metric: "
-             << (SumBubbles
-                     ? (SumBubbles * ScheduleMetrics::ScaleFactor) / CurrCycle
-                     : 1)
-             << "\n\n");
-#endif
-
-  return ScheduleMetrics(CurrCycle, SumBubbles);
-}
-
-ScheduleMetrics
-GCNSchedStage::getScheduleMetrics(const GCNScheduleDAGMILive &DAG) {
-#ifndef NDEBUG
-  std::set<std::pair<MachineInstr *, unsigned>, EarlierIssuingCycle>
-      ReadyCyclesSorted;
-#endif
-  const TargetSchedModel &SM = ST.getInstrInfo()->getSchedModel();
-  unsigned SumBubbles = 0;
-  DenseMap<unsigned, unsigned> ReadyCycles;
-  unsigned CurrCycle = 0;
-  for (auto &MI : DAG) {
-    SUnit *SU = DAG.getSUnit(&MI);
-    if (!SU)
-      continue;
-    unsigned ReadyCycle =
-        computeSUnitReadyCycle(*SU, CurrCycle, ReadyCycles, SM);
-    SumBubbles += ReadyCycle - CurrCycle;
-#ifndef NDEBUG
-    ReadyCyclesSorted.insert(std::make_pair(SU->getInstr(), ReadyCycle));
-#endif
-    CurrCycle = ++ReadyCycle;
-  }
-#ifndef NDEBUG
-  LLVM_DEBUG(
-      printScheduleModel(ReadyCyclesSorted);
-      dbgs() << "\n\t"
-             << "Metric: "
-             << (SumBubbles
-                     ? (SumBubbles * ScheduleMetrics::ScaleFactor) / CurrCycle
-                     : 1)
-             << "\n\n");
-#endif
-
-  return ScheduleMetrics(CurrCycle, SumBubbles);
-}
-
-bool GCNSchedStage::shouldRevertScheduling(unsigned WavesAfter) {
+bool GCNSchedStage::shouldRevertScheduling(unsigned WavesAfter,
+                                           unsigned WavesBefore) {
   if (WavesAfter < DAG.MinOccupancy)
     return true;
 
   return false;
 }
 
-bool OccInitialScheduleStage::shouldRevertScheduling(unsigned WavesAfter) {
+bool OccInitialScheduleStage::shouldRevertScheduling(unsigned WavesAfter,
+                                                     unsigned WavesBefore) {
   if (PressureAfter == PressureBefore)
     return false;
 
-  if (GCNSchedStage::shouldRevertScheduling(WavesAfter))
-    return true;
-
   if (mayCauseSpilling(WavesAfter))
     return true;
 
-  return false;
+  return S.computeScheduleMetric(RegionIdx, WavesAfter, WavesBefore);
 }
 
-bool UnclusteredHighRPStage::shouldRevertScheduling(unsigned WavesAfter) {
-  // If RP is not reduced in the unclustred reschedule stage, revert to the
-  // old schedule.
-  if ((WavesAfter <= PressureBefore.getOccupancy(ST) &&
-       mayCauseSpilling(WavesAfter)) ||
-      GCNSchedStage::shouldRevertScheduling(WavesAfter)) {
-    LLVM_DEBUG(dbgs() << "Unclustered reschedule did not help.\n");
-    return true;
-  }
-
-  // Do not attempt to relax schedule even more if we are already spilling.
-  if (isRegionWithExcessRP())
-    return false;
-
-  LLVM_DEBUG(
-      dbgs()
-      << "\n\t      *** In shouldRevertScheduling ***\n"
-      << "      *********** BEFORE UnclusteredHighRPStage ***********\n");
-  ScheduleMetrics MBefore =
-      getScheduleMetrics(DAG.SUnits);
-  LLVM_DEBUG(
-      dbgs()
-      << "\n      *********** AFTER UnclusteredHighRPStage ***********\n");
-  ScheduleMetrics MAfter = getScheduleMetrics(DAG);
-  unsigned OldMetric = MBefore.getMetric();
-  unsigned NewMetric = MAfter.getMetric();
-  unsigned WavesBefore =
-      std::min(S.getTargetOccupancy(), PressureBefore.getOccupancy(ST));
-  unsigned Profit =
-      ((WavesAfter * ScheduleMetrics::ScaleFactor) / WavesBefore *
-       ((OldMetric + ScheduleMetricBias) * ScheduleMetrics::ScaleFactor) /
-       NewMetric) /
-      ScheduleMetrics::ScaleFactor;
-  LLVM_DEBUG(dbgs() << "\tMetric before " << MBefore << "\tMetric after "
-                    << MAfter << "Profit: " << Profit << "\n");
-  return Profit < ScheduleMetrics::ScaleFactor;
+bool UnclusteredHighRPStage::shouldRevertScheduling(unsigned WavesAfter,
+                                                    unsigned WavesBefore) {
+  // Revert if may cause spilling. Otherwise rely on the metric computed by the
+  // strategy class. Exception: does not make sense to revert the unclustered
+  // schedule if we are still in excess RP state as it will not become better.
+  return GCNSchedStage::shouldRevertScheduling(WavesAfter, WavesBefore) ||
+         (S.computeScheduleMetric(RegionIdx, WavesAfter, WavesBefore) &&
+          !isRegionWithExcessRP());
 }
 
-bool ClusteredLowOccStage::shouldRevertScheduling(unsigned WavesAfter) {
+bool ClusteredLowOccStage::shouldRevertScheduling(unsigned WavesAfter,
+                                                  unsigned WavesBefore) {
   if (PressureAfter == PressureBefore)
     return false;
 
-  if (GCNSchedStage::shouldRevertScheduling(WavesAfter))
+  if (GCNSchedStage::shouldRevertScheduling(WavesAfter, WavesBefore))
     return true;
 
   if (mayCauseSpilling(WavesAfter))
     return true;
 
-  return false;
+  return S.computeScheduleMetric(RegionIdx, WavesAfter, WavesBefore);
 }
 
-bool PreRARematStage::shouldRevertScheduling(unsigned WavesAfter) {
-  if (GCNSchedStage::shouldRevertScheduling(WavesAfter))
+bool PreRARematStage::shouldRevertScheduling(unsigned WavesAfter,
+                                             unsigned WavesBefore) {
+  if (GCNSchedStage::shouldRevertScheduling(WavesAfter, WavesBefore))
     return true;
 
   if (mayCauseSpilling(WavesAfter))
@@ -1176,7 +1032,8 @@ bool PreRARematStage::shouldRevertScheduling(unsigned WavesAfter) {
   return false;
 }
 
-bool ILPInitialScheduleStage::shouldRevertScheduling(unsigned WavesAfter) {
+bool ILPInitialScheduleStage::shouldRevertScheduling(unsigned WavesAfter,
+                                                     unsigned WavesBefore) {
   if (mayCauseSpilling(WavesAfter))
     return true;
 
@@ -1570,3 +1427,52 @@ void GCNPostScheduleDAGMILive::finalizeSchedule() {
 
   ScheduleDAGMI::finalizeSchedule();
 }
+
+unsigned GCNBalancedSchedStrategy::computeSUnitReadyCycle(const SUnit &SU) {
+  unsigned ReadyCycle = CurrCycle;
+  for (auto &D : SU.Preds) {
+    if (D.isAssignedRegDep()) {
+      MachineInstr *DefMI = D.getSUnit()->getInstr();
+      unsigned Latency = SM->computeInstrLatency(DefMI);
+      unsigned DefReady = ReadyCycles[DAG->getSUnit(DefMI)->NodeNum];
+      ReadyCycle = std::max(ReadyCycle, DefReady + Latency);
+    }
+  }
+  ReadyCycles[SU.NodeNum] = ReadyCycle;
+  return ReadyCycle;
+}
+
+bool llvm::GCNBalancedSchedStrategy::computeScheduleMetric(
+    unsigned RegionIdx, unsigned WavesAfter, unsigned WavesBefore) {
+  bool Result = false;
+  unsigned PrevMetric = 0;
+  if (Metrics.count(RegionIdx)) {
+    PrevMetric = Metrics[RegionIdx].back();
+  }
+  for (auto &MI : *DAG) {
+    SUnit *SU = DAG->getSUnit(&MI);
+    if (!SU)
+      continue;
+    unsigned ReadyCycle = computeSUnitReadyCycle(*SU);
+    StallTotal += ReadyCycle - CurrCycle;
+#ifndef NDEBUG
+    PrintableSchedule.insert(std::make_pair(SU->getInstr(), ReadyCycle));
+#endif
+    CurrCycle = ++ReadyCycle;
+  }
+  unsigned Metric = StallTotal ? StallTotal * ScaleFactor / CurrCycle : 1;
+#ifndef NDEBUG
+  LLVM_DEBUG(printSchedule());
+#endif
+  if (PrevMetric) {
+    unsigned Profit =
+        ((WavesAfter * ScaleFactor) / WavesBefore *
+         ((PrevMetric + ScheduleMetricBias) * ScaleFactor) / Metric) /
+        ScaleFactor;
+    Result = Profit < ScaleFactor;
+  }
+  if (!Result)
+    Metrics[RegionIdx].push_back(Metric);
+  clearMetric();
+  return Result;
+}
diff --git a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.h b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.h
index 7862ec1e894b62e..84dbf0ce3e7913c 100644
--- a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.h
+++ b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.h
@@ -116,6 +116,12 @@ class GCNSchedStrategy : public GenericScheduler {
   bool hasNextStage() const;
 
   GCNSchedStageID getNextStage() const;
+
+  virtual bool computeScheduleMetric(unsigned RegionIdx, unsigned WavesAfter,
+                                    unsigned WavesBefore) {
+    return false;
+  }
+  virtual void clearMetric(){};
 };
 
 /// The goal of this scheduling strategy is to maximize kernel occupancy (i.e.
@@ -136,33 +142,6 @@ class GCNMaxILPSchedStrategy final : public GCNSchedStrategy {
   GCNMaxILPSchedStrategy(const MachineSchedContext *C);
 };
 
-class ScheduleMetrics {
-  unsigned ScheduleLength;
-  unsigned BubbleCycles;
-
-public:
-  ScheduleMetrics() {}
-  ScheduleMetrics(unsigned L, unsigned BC)
-      : ScheduleLength(L), BubbleCycles(BC) {}
-  unsigned getLength() const { return ScheduleLength; }
-  unsigned getBubbles() const { return BubbleCycles; }
-  unsigned getMetric() const {
-    unsigned Metric = (BubbleCycles * ScaleFactor) / ScheduleLength;
-    // Metric is zero if the amount of bubbles is less than 1% which is too
-    // small. So, return 1.
-    return Metric ? Metric : 1;
-  }
-  static const unsigned ScaleFactor;
-};
-
-inline raw_ostream &operator<<(raw_ostream &OS, const ScheduleMetrics &Sm) {
-  dbgs() << "\n Schedule Metric (scaled by "
-         << ScheduleMetrics::ScaleFactor
-         << " ) is: " << Sm.getMetric() << " [ " << Sm.getBubbles() << "/"
-         << Sm.getLength() << " ]\n";
-  return OS;
-}
-
 class GCNScheduleDAGMILive final : public ScheduleDAGMILive {
   friend class GCNSchedStage;
   friend class OccInitialScheduleStage;
@@ -296,15 +275,9 @@ class GCNSchedStage {
   // Check result of scheduling.
   void checkScheduling();
 
-  // computes the given schedule virtual execution time in clocks
-  ScheduleMetrics getScheduleMetrics(const std::vector<SUnit> &InputSchedule);
-  ScheduleMetrics getScheduleMetrics(const GCNScheduleDAGMILive &DAG);
-  unsigned computeSUnitReadyCycle(const SUnit &SU, unsigned CurrCycle,
-                                  DenseMap<unsigned, unsigned> &ReadyCycles,
-                                  const TargetSchedModel &SM);
-
   // Returns true if scheduling should be reverted.
-  virtual bool shouldRevertScheduling(unsigned WavesAfter);
+  virtual bool shouldRevertScheduling(unsigned WavesAfter,
+                                      unsigned WavesBefore);
 
   // Returns true if current region has known excess pressure.
   bool isRegionWithExcessRP() const {
@@ -324,7 +297,8 @@ class GCNSchedStage {
 
 class OccInitialScheduleStage : public GCNSchedStage {
 public:
-  bool shouldRevertScheduling(unsigned WavesAfter) override;
+  bool shouldRevertScheduling(unsigned WavesAfter,
+                              unsigned WavesBefore) override;
 
   OccInitialScheduleStage(GCNSchedStageID StageID, GCNScheduleDAGMILive &DAG)
       : GCNSchedStage(StageID, DAG) {}
@@ -342,7 +316,8 @@ class UnclusteredHighRPStage : public GCNSchedStage {
 
   bool initGCNRegion() override;
 
-  bool shouldRevertScheduling(unsigned WavesAfter) override;
+  bool shouldRevertScheduling(unsigned WavesAfter,
+                              unsigned WavesBefore) override;
 
   UnclusteredHighRPStage(GCNSchedStageID StageID, GCNScheduleDAGMILive &DAG)
       : GCNSchedStage(StageID, DAG) {}
@@ -357,7 +332,8 @@ class ClusteredLowOccStage : public GCNSchedStage {
 
   bool initGCNRegion() override;
 
-  bool shouldRevertScheduling(unsigned WavesAfter) override;
+  bool shouldRevertScheduling(unsigned WavesAfter,
+                              unsigned WavesBefore) override;
 
   ClusteredLowOccStage(GCNSchedStageID StageID, GCNScheduleDAGMILive &DAG)
       : GCNSchedStage(StageID, DAG) {}
@@ -393,7 +369,8 @@ class PreRARematStage : public GCNSchedStage {
 
   bool initGCNRegion() override;
 
-  bool shouldRevertScheduling(unsigned WavesAfter) override;
+  bool shouldRevertScheduling(unsigned WavesAfter,
+                              unsigned WavesBefore = 0) override;
 
   PreRARematStage(GCNSchedStageID StageID, GCNScheduleDAGMILive &DAG)
       : GCNSchedStage(StageID, DAG) {}
@@ -401,7 +378,8 @@ class PreRARematStage : public GCNSchedStage {
 
 class ILPInitialScheduleStage : public GCNSchedStage {
 public:
-  bool shouldRevertScheduling(unsigned WavesAfter) override;
+  bool shouldRevertScheduling(unsigned WavesAfter,
+                              unsigned WavesBefore) override;
 
   ILPInitialScheduleStage(GCNSchedStageID StageID, GCNScheduleDAGMILive &DAG)
       : GCNSchedStage(StageID, DAG) {}
@@ -423,6 +401,87 @@ class GCNPostScheduleDAGMILive final : public ScheduleDAGMI {
                            bool RemoveKillFlags);
 };
 
+#ifndef NDEBUG
+struct EarlierIssuingCycle {
+  bool operator()(std::pair<MachineInstr *, unsigned> A,
+                  std::pair<MachineInstr *, unsigned> B) const {
+    return A.second < B.second;
+  }
+};
+#endif
+
+/// The goal of this scheduling strategy is to find a reasonable tradeoff
+/// between kernel occupancy (i.e. the maximum number of waves per SIMD) and
+/// ILP (i.e. minimizing the number of stall cycles through better latency
+/// covering).
+class GCNBalancedSchedStrategy final : public GCNSchedStrategy {
+
+  const unsigned ScaleFactor = 100;
+  unsigned StallTotal = 0;
+  unsigned CurrCycle = 0;
+  DenseMap<unsigned, unsigned> ReadyCycles;
+  DenseMap<unsigned, SmallVector<unsigned, 4>> Metrics;
+  const TargetSchedModel *SM;
+  unsigned computeSUnitReadyCycle(const SUnit &SU);
+
+  void clearMetric() override {
+    StallTotal = 0;
+    CurrCycle = 0;
+    ReadyCycles.clear();
+#ifndef NDEBUG
+    PrintableSchedule.clear();
+#endif
+  }
+
+#ifndef NDEBUG
+  std::set<std::pair<MachineInstr *, unsigned>, EarlierIssuingCycle> PrintableSchedule;
+
+  void printSchedule() {
+    if (PrintableSchedule.empty())
+      return;
+
+    unsigned BBNum = PrintableSchedule.begin()->first->getParent()->getNumber();
+    dbgs() << "\n################## Schedule time ReadyCycles for MBB : "
+           << BBNum
+           << " ##################\n# Cycle #\t\t\tInstruction          "
+              "             "
+              "                            \n";
+    unsigned IPrev = 1;
+    for (auto &I : PrintableSchedule) {
+      if (I.second > IPrev + 1)
+        dbgs() << "****************************** BUBBLE OF "
+               << I.second - IPrev
+               << " CYCLES DETECTED ******************************\n\n";
+      dbgs() << "[ " << I.second << " ]  :  " << *I.first << "\n";
+      IPrev = I.second;
+    }
+    dbgs() << "\n\t"
+             << "Metric: "
+             << (StallTotal
+                     ? (StallTotal * ScaleFactor) / CurrCycle
+ ...
[truncated]

Contributor

@jrbyrnes jrbyrnes left a comment


I have two remaining points to discuss.

We bypass the UnclusteredHighRPStage (in its initGCNRegion) for regions with the following condition:

  if ((!DAG.RegionsWithMinOcc[RegionIdx] ||
       DAG.MinOccupancy <= InitialOccupancy) &&
      !DAG.RegionsWithExcessRP[RegionIdx])
    return false;

My question: for the balanced scheduler, should we actually run this phase for RegionsWithMinOcc && !RegionsWithExcessRP?

Should we run ClusteredLowOccStage for any region so long as we have observed an occupancy drop (at least for the experimental version)?
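
To make the two gating policies in the first question explicit, here is a minimal sketch; the helper names and the boolean/unsigned parameters are hypothetical stand-ins for DAG.RegionsWithMinOcc[RegionIdx], DAG.RegionsWithExcessRP[RegionIdx], DAG.MinOccupancy, and InitialOccupancy, and this is not code from the patch.

// Current upstream gating, rewritten in positive form: the stage runs for a
// region only when it has excess RP, or when it is a min-occupancy region and
// DAG.MinOccupancy is above InitialOccupancy.
static bool runsToday(bool RegionWithMinOcc, bool RegionWithExcessRP,
                      unsigned MinOccupancy, unsigned InitialOccupancy) {
  return RegionWithExcessRP ||
         (RegionWithMinOcc && MinOccupancy > InitialOccupancy);
}

// The relaxation the question asks about: additionally run the stage for
// min-occupancy regions that have no excess RP.
static bool proposedRelaxation(bool RegionWithMinOcc, bool RegionWithExcessRP) {
  return RegionWithExcessRP || RegionWithMinOcc;
}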

   if (WavesAfter < DAG.MinOccupancy)
     return true;
 
   return false;
 }
 
-bool OccInitialScheduleStage::shouldRevertScheduling(unsigned WavesAfter) {
+bool OccInitialScheduleStage::shouldRevertScheduling(unsigned WavesAfter,
+                                                     unsigned WavesBefore) {
   if (PressureAfter == PressureBefore)
Contributor


For the balanced scheduler, should we really early exit on RP?

   if (PressureAfter == PressureBefore)
     return false;
 
-  if (GCNSchedStage::shouldRevertScheduling(WavesAfter))
Contributor


We need to keep this in order to preserve MaxOccupancy scheduler's behavior, but it should not be the priority for the balanced scheduler. Maybe the balanced scheduler should just have a different initial stage entirely.

Contributor Author


We need to keep this in order to preserve MaxOccupancy scheduler's behavior, but it should not be the priority for the balanced scheduler. Maybe the balanced scheduler should just have a different initial stage entirely.

If we decide to preserve the MaxOccupancy scheduler's behavior exactly, we'd better write completely new code for most things. We could try to have a separate set of stages for the balanced scheduler, but it would still have to communicate with interfaces that are focused on occupancy. So I agree that the approach of minimizing changes to the existing code may not be viable.

+  // Revert if may cause spilling. Otherwise rely on the metric computed by the
+  // strategy class. Exception: does not make sense to revert the unclustered
+  // schedule if we are still in excess RP state as it will not become better.
+  return GCNSchedStage::shouldRevertScheduling(WavesAfter, WavesBefore) ||
Contributor


Does this preserve the behavior of the MaxOccupancy scheduler?

Contributor Author

alex-t commented Sep 19, 2023

I have two remaining points to discuss.

We bypass the UnclusteredHighRPStage (in its initGCNRegion) for regions with the following condition:

  if ((!DAG.RegionsWithMinOcc[RegionIdx] ||
       DAG.MinOccupancy <= InitialOccupancy) &&
      !DAG.RegionsWithExcessRP[RegionIdx])
    return false;

My question: for the balanced scheduler, should we actually run this phase for RegionsWithMinOcc && !RegionsWithExcessRP?

The idea of running this stage only for the ExcessRP regions is correct. This is (I hope temporarily) left aside to avoid vast changes throughout the scheduler: the spill-controlling mechanisms rely heavily on the MinOCC set, and as soon as we touch that, we immediately have to rewrite a lot of things.

Should we run ClusteredLowOccStage for any region so long as we have observed an occupancy drop (at least for the experimental version)?

The fact that we observed an occupancy drop does not make our schedule invalid, right? So maybe the chance to improve ILP at the cost of worsening occupancy even further is not that bad? We need more traces collected from real program runs to gather statistics.
