[AMDGPU] Break Large PHIs: Take whole PHI chains into account
Previous heuristics had a big flaw: they only looked at a single PHI at a time and didn't take the whole "chain" into account.
The concept of "chain" is important because if we only break a chain partially, we risk forcing regalloc to reserve twice as many registers for that vector.
We also risk adding a lot of copies that shouldn't be there and can inhibit backend optimizations.

The solution I found is to consider the whole "PHI chain" when looking at a PHI.
That is, we recursively look at the PHI's incoming values & users for other PHIs, then make a decision about the chain as a whole.

The current threshold requires that at least `ceil(chain size * (2/3))` PHIs have at least one interesting incoming value.
In simple terms, two-thirds (rounded up) of the PHIs should be breakable.

This seems to work well. A lower threshold such as 50% is too aggressive: chains often have 7 or 9 PHIs, and breaking 3+ or 4+ PHIs in those cases often causes performance issues.
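Concretely, the new code computes this threshold as `alignTo(WorkList.size() * 2, 3) / 3` (see the diff below), which is the integer form of `ceil(2K/3)` for a chain of K PHIs. The snippet below is a minimal standalone sketch of that arithmetic, using a hypothetical `breakThreshold` helper rather than the in-tree implementation:

// Standalone sketch (not the in-tree code): the same rounding the pass does
// with llvm::alignTo(K * 2, 3) / 3, i.e. ceil(2K / 3) in integer arithmetic.
#include <cstdio>

static unsigned breakThreshold(unsigned K) {
  // Round 2K up to the next multiple of 3, then divide by 3.
  return (K * 2 + 2) / 3;
}

int main() {
  // For the chain sizes mentioned above: a 7-PHI chain needs 5 breakable
  // PHIs, a 9-PHI chain needs 6.
  for (unsigned K : {2u, 7u, 9u})
    std::printf("chain of %u PHIs -> threshold %u\n", K, breakThreshold(K));
  return 0;
}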

Fixes SWDEV-409648, SWDEV-398393, SWDEV-413487

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D156414
Pierre-vh committed Aug 3, 2023
1 parent 46642cc commit 62ea799
Showing 2 changed files with 568 additions and 52 deletions.
llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp (122 changes: 77 additions & 45 deletions)
@@ -52,17 +52,17 @@ static cl::opt<bool> Widen16BitOps(
     cl::init(true));
 
 static cl::opt<bool>
-    ScalarizeLargePHIs("amdgpu-codegenprepare-break-large-phis",
-                       cl::desc("Break large PHI nodes for DAGISel"),
-                       cl::ReallyHidden, cl::init(true));
+    BreakLargePHIs("amdgpu-codegenprepare-break-large-phis",
+                   cl::desc("Break large PHI nodes for DAGISel"),
+                   cl::ReallyHidden, cl::init(true));
 
 static cl::opt<bool>
-    ForceScalarizeLargePHIs("amdgpu-codegenprepare-force-break-large-phis",
-                            cl::desc("For testing purposes, always break large "
-                                     "PHIs even if it isn't profitable."),
-                            cl::ReallyHidden, cl::init(false));
+    ForceBreakLargePHIs("amdgpu-codegenprepare-force-break-large-phis",
+                        cl::desc("For testing purposes, always break large "
+                                 "PHIs even if it isn't profitable."),
+                        cl::ReallyHidden, cl::init(false));
 
-static cl::opt<unsigned> ScalarizeLargePHIsThreshold(
+static cl::opt<unsigned> BreakLargePHIsThreshold(
     "amdgpu-codegenprepare-break-large-phis-threshold",
     cl::desc("Minimum type size in bits for breaking large PHI nodes"),
     cl::ReallyHidden, cl::init(32));
@@ -1777,47 +1777,79 @@ static bool isInterestingPHIIncomingValue(const Value *V) {
   return false;
 }
 
+static void collectPHINodes(const PHINode &I,
+                            SmallPtrSet<const PHINode *, 8> &SeenPHIs) {
+  const auto [It, Inserted] = SeenPHIs.insert(&I);
+  if (!Inserted)
+    return;
+
+  for (const Value *Inc : I.incoming_values()) {
+    if (const auto *PhiInc = dyn_cast<PHINode>(Inc))
+      collectPHINodes(*PhiInc, SeenPHIs);
+  }
+
+  for (const User *U : I.users()) {
+    if (const auto *PhiU = dyn_cast<PHINode>(U))
+      collectPHINodes(*PhiU, SeenPHIs);
+  }
+}
+
 bool AMDGPUCodeGenPrepareImpl::canBreakPHINode(const PHINode &I) {
-  // Check in the cache, or add an entry for this node.
-  //
-  // We init with false because we consider all PHI nodes unbreakable until we
-  // reach a conclusion. Doing the opposite - assuming they're break-able until
-  // proven otherwise - can be harmful in some pathological cases so we're
-  // conservative for now.
-  const auto [It, DidInsert] = BreakPhiNodesCache.insert({&I, false});
-  if (!DidInsert)
+  // Check in the cache first.
+  if (const auto It = BreakPhiNodesCache.find(&I);
+      It != BreakPhiNodesCache.end())
     return It->second;
 
-  // This function may recurse, so to guard against infinite looping, this PHI
-  // is conservatively considered unbreakable until we reach a conclusion.
-
-  // Don't break PHIs that have no interesting incoming values. That is, where
-  // there is no clear opportunity to fold the "extractelement" instructions we
-  // would add.
-  //
-  // Note: IC does not run after this pass, so we're only interested in the
-  // foldings that the DAG combiner can do.
-  if (none_of(I.incoming_values(),
-              [&](Value *V) { return isInterestingPHIIncomingValue(V); }))
-    return false;
-
-  // Now, check users for unbreakable PHI nodes. If we have an unbreakable PHI
-  // node as user, we don't want to break this PHI either because it's unlikely
-  // to be beneficial. We would just explode the vector and reassemble it
-  // directly, wasting instructions.
-  //
-  // In the case where multiple users are PHI nodes, we want at least half of
-  // them to be breakable.
-  int Score = 0;
-  for (const Value *U : I.users()) {
-    if (const auto *PU = dyn_cast<PHINode>(U))
-      Score += canBreakPHINode(*PU) ? 1 : -1;
-  }
-
-  if (Score < 0)
-    return false;
+  // We consider PHI nodes as part of "chains", so given a PHI node I, we
+  // recursively consider all its users and incoming values that are also PHI
+  // nodes. We then make a decision about all of those PHIs at once. Either they
+  // all get broken up, or none of them do. That way, we avoid cases where a
+  // single PHI is/is not broken and we end up reforming/exploding a vector
+  // multiple times, or even worse, doing it in a loop.
+  SmallPtrSet<const PHINode *, 8> WorkList;
+  collectPHINodes(I, WorkList);
+
+#ifndef NDEBUG
+  // Check that none of the PHI nodes in the worklist are in the map. If some of
+  // them are, it means we're not good enough at collecting related PHIs.
+  for (const PHINode *WLP : WorkList) {
+    assert(BreakPhiNodesCache.count(WLP) == 0);
+  }
+#endif
+
+  // To consider a PHI profitable to break, we need to see some interesting
+  // incoming values. At least 2/3rd (rounded up) of all PHIs in the worklist
+  // must have one to consider all PHIs breakable.
+  //
+  // This threshold has been determined through performance testing.
+  //
+  // Note that the computation below is equivalent to
+  //
+  //    (unsigned)ceil((K / 3.0) * 2)
+  //
+  // It's simply written this way to avoid mixing integral/FP arithmetic.
+  const auto Threshold = (alignTo(WorkList.size() * 2, 3) / 3);
+  unsigned NumBreakablePHIs = 0;
+  bool CanBreak = false;
+  for (const PHINode *Cur : WorkList) {
+    // Don't break PHIs that have no interesting incoming values. That is, where
+    // there is no clear opportunity to fold the "extractelement" instructions
+    // we would add.
+    //
+    // Note: IC does not run after this pass, so we're only interested in the
+    // foldings that the DAG combiner can do.
+    if (any_of(Cur->incoming_values(), isInterestingPHIIncomingValue)) {
+      if (++NumBreakablePHIs >= Threshold) {
+        CanBreak = true;
+        break;
+      }
+    }
+  }
+
+  for (const PHINode *Cur : WorkList)
+    BreakPhiNodesCache[Cur] = CanBreak;
 
-  return BreakPhiNodesCache[&I] = true;
+  return CanBreak;
 }
 
 /// Helper class for "break large PHIs" (visitPHINode).
@@ -1898,14 +1930,15 @@ bool AMDGPUCodeGenPrepareImpl::visitPHINode(PHINode &I) {
   // operations with most elements being "undef". This inhibits a lot of
   // optimization opportunities and can result in unreasonably high register
   // pressure and the inevitable stack spilling.
-  if (!ScalarizeLargePHIs || getCGPassBuilderOption().EnableGlobalISelOption)
+  if (!BreakLargePHIs || getCGPassBuilderOption().EnableGlobalISelOption)
     return false;
 
   FixedVectorType *FVT = dyn_cast<FixedVectorType>(I.getType());
-  if (!FVT || DL->getTypeSizeInBits(FVT) <= ScalarizeLargePHIsThreshold)
+  if (!FVT || FVT->getNumElements() == 1 ||
+      DL->getTypeSizeInBits(FVT) <= BreakLargePHIsThreshold)
     return false;
 
-  if (!ForceScalarizeLargePHIs && !canBreakPHINode(I))
+  if (!ForceBreakLargePHIs && !canBreakPHINode(I))
     return false;
 
   std::vector<VectorSlice> Slices;
@@ -1930,8 +1963,7 @@ bool AMDGPUCodeGenPrepareImpl::visitPHINode(PHINode &I) {
       Slices.emplace_back(EltTy, Idx, 1);
   }
 
-  if (Slices.size() == 1)
-    return false;
+  assert(Slices.size() > 1);
 
   // Create one PHI per vector piece. The "VectorSlice" class takes care of
   // creating the necessary instruction to extract the relevant slices of each