[AMDGPU] Break Large PHIs: Take whole PHI chains into account
Previous heuristics had a big flaw: they only looked at a single PHI at a time and didn't take the whole "chain" into account.
The concept of "chain" is important because if we only break a chain partially, we risk forcing regalloc to reserve twice as many registers for that vector.
We also risk adding a lot of copies that shouldn't be there and can inhibit backend optimizations.

The solution I found is to consider the whole "PHI chain" when looking at a PHI.
That is, we recursively look at the PHI's incoming values & users for other PHIs, then make a decision about the chain as a whole.

The current threshold requires that at least `ceil(chain size * (2/3))` PHIs have at least one interesting incoming value.
In simple terms, two-thirds (rounded up) of the PHIs should be breakable.

This seems to work well. A lower threshold such as 50% is too aggressive: chains often have 7 or 9 PHIs, and breaking 3+ or 4+ PHIs in those cases often causes performance issues.
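Concretely, the new code computes this threshold as `alignTo(WorkList.size() * 2, 3) / 3` (see the diff below), which is the integer form of `ceil(2K/3)` for a chain of K PHIs. The snippet below is a minimal standalone sketch of that arithmetic, using a hypothetical `breakThreshold` helper rather than the in-tree implementation:

// Standalone sketch (not the in-tree code): the same rounding the pass does
// with llvm::alignTo(K * 2, 3) / 3, i.e. ceil(2K / 3) in integer arithmetic.
#include <cstdio>

static unsigned breakThreshold(unsigned K) {
  // Round 2K up to the next multiple of 3, then divide by 3.
  return (K * 2 + 2) / 3;
}

int main() {
  // For the chain sizes mentioned above: a 7-PHI chain needs 5 breakable
  // PHIs, a 9-PHI chain needs 6.
  for (unsigned K : {2u, 7u, 9u})
    std::printf("chain of %u PHIs -> threshold %u\n", K, breakThreshold(K));
  return 0;
}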

Fixes SWDEV-409648, SWDEV-398393, SWDEV-413487

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D156414
Pierre-vh committed Aug 3, 2023
1 parent 46642cc commit 62ea799
Showing 2 changed files with 568 additions and 52 deletions.
llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp (122 changes: 77 additions & 45 deletions)
@@ -52,17 +52,17 @@ static cl::opt<bool> Widen16BitOps(
     cl::init(true));
 
 static cl::opt<bool>
-    ScalarizeLargePHIs("amdgpu-codegenprepare-break-large-phis",
-                       cl::desc("Break large PHI nodes for DAGISel"),
-                       cl::ReallyHidden, cl::init(true));
+    BreakLargePHIs("amdgpu-codegenprepare-break-large-phis",
+                   cl::desc("Break large PHI nodes for DAGISel"),
+                   cl::ReallyHidden, cl::init(true));
 
 static cl::opt<bool>
-    ForceScalarizeLargePHIs("amdgpu-codegenprepare-force-break-large-phis",
-                            cl::desc("For testing purposes, always break large "
-                                     "PHIs even if it isn't profitable."),
-                            cl::ReallyHidden, cl::init(false));
+    ForceBreakLargePHIs("amdgpu-codegenprepare-force-break-large-phis",
+                        cl::desc("For testing purposes, always break large "
+                                 "PHIs even if it isn't profitable."),
+                        cl::ReallyHidden, cl::init(false));
 
-static cl::opt<unsigned> ScalarizeLargePHIsThreshold(
+static cl::opt<unsigned> BreakLargePHIsThreshold(
     "amdgpu-codegenprepare-break-large-phis-threshold",
     cl::desc("Minimum type size in bits for breaking large PHI nodes"),
     cl::ReallyHidden, cl::init(32));
@@ -1777,47 +1777,79 @@ static bool isInterestingPHIIncomingValue(const Value *V) {
   return false;
 }
 
+static void collectPHINodes(const PHINode &I,
+                            SmallPtrSet<const PHINode *, 8> &SeenPHIs) {
+  const auto [It, Inserted] = SeenPHIs.insert(&I);
+  if (!Inserted)
+    return;
+
+  for (const Value *Inc : I.incoming_values()) {
+    if (const auto *PhiInc = dyn_cast<PHINode>(Inc))
+      collectPHINodes(*PhiInc, SeenPHIs);
+  }
+
+  for (const User *U : I.users()) {
+    if (const auto *PhiU = dyn_cast<PHINode>(U))
+      collectPHINodes(*PhiU, SeenPHIs);
+  }
+}
+
 bool AMDGPUCodeGenPrepareImpl::canBreakPHINode(const PHINode &I) {
-  // Check in the cache, or add an entry for this node.
-  //
-  // We init with false because we consider all PHI nodes unbreakable until we
-  // reach a conclusion. Doing the opposite - assuming they're break-able until
-  // proven otherwise - can be harmful in some pathological cases so we're
-  // conservative for now.
-  const auto [It, DidInsert] = BreakPhiNodesCache.insert({&I, false});
-  if (!DidInsert)
+  // Check in the cache first.
+  if (const auto It = BreakPhiNodesCache.find(&I);
+      It != BreakPhiNodesCache.end())
     return It->second;
 
-  // This function may recurse, so to guard against infinite looping, this PHI
-  // is conservatively considered unbreakable until we reach a conclusion.
-
-  // Don't break PHIs that have no interesting incoming values. That is, where
-  // there is no clear opportunity to fold the "extractelement" instructions we
-  // would add.
-  //
-  // Note: IC does not run after this pass, so we're only interested in the
-  // foldings that the DAG combiner can do.
-  if (none_of(I.incoming_values(),
-              [&](Value *V) { return isInterestingPHIIncomingValue(V); }))
-    return false;
-
-  // Now, check users for unbreakable PHI nodes. If we have an unbreakable PHI
-  // node as user, we don't want to break this PHI either because it's unlikely
-  // to be beneficial. We would just explode the vector and reassemble it
-  // directly, wasting instructions.
-  //
-  // In the case where multiple users are PHI nodes, we want at least half of
-  // them to be breakable.
-  int Score = 0;
-  for (const Value *U : I.users()) {
-    if (const auto *PU = dyn_cast<PHINode>(U))
-      Score += canBreakPHINode(*PU) ? 1 : -1;
-  }
-
-  if (Score < 0)
-    return false;
+  // We consider PHI nodes as part of "chains", so given a PHI node I, we
+  // recursively consider all its users and incoming values that are also PHI
+  // nodes. We then make a decision about all of those PHIs at once. Either they
+  // all get broken up, or none of them do. That way, we avoid cases where a
+  // single PHI is/is not broken and we end up reforming/exploding a vector
+  // multiple times, or even worse, doing it in a loop.
+  SmallPtrSet<const PHINode *, 8> WorkList;
+  collectPHINodes(I, WorkList);
+
+#ifndef NDEBUG
+  // Check that none of the PHI nodes in the worklist are in the map. If some of
+  // them are, it means we're not good enough at collecting related PHIs.
+  for (const PHINode *WLP : WorkList) {
+    assert(BreakPhiNodesCache.count(WLP) == 0);
+  }
+#endif
+
+  // To consider a PHI profitable to break, we need to see some interesting
+  // incoming values. At least 2/3rd (rounded up) of all PHIs in the worklist
+  // must have one to consider all PHIs breakable.
+  //
+  // This threshold has been determined through performance testing.
+  //
+  // Note that the computation below is equivalent to
+  //
+  //    (unsigned)ceil((K / 3.0) * 2)
+  //
+  // It's simply written this way to avoid mixing integral/FP arithmetic.
+  const auto Threshold = (alignTo(WorkList.size() * 2, 3) / 3);
+  unsigned NumBreakablePHIs = 0;
+  bool CanBreak = false;
+  for (const PHINode *Cur : WorkList) {
+    // Don't break PHIs that have no interesting incoming values. That is, where
+    // there is no clear opportunity to fold the "extractelement" instructions
+    // we would add.
+    //
+    // Note: IC does not run after this pass, so we're only interested in the
+    // foldings that the DAG combiner can do.
+    if (any_of(Cur->incoming_values(), isInterestingPHIIncomingValue)) {
+      if (++NumBreakablePHIs >= Threshold) {
+        CanBreak = true;
+        break;
+      }
+    }
+  }
+
+  for (const PHINode *Cur : WorkList)
+    BreakPhiNodesCache[Cur] = CanBreak;
 
-  return BreakPhiNodesCache[&I] = true;
+  return CanBreak;
 }
 
 /// Helper class for "break large PHIs" (visitPHINode).
@@ -1898,14 +1930,15 @@ bool AMDGPUCodeGenPrepareImpl::visitPHINode(PHINode &I) {
   // operations with most elements being "undef". This inhibits a lot of
   // optimization opportunities and can result in unreasonably high register
   // pressure and the inevitable stack spilling.
-  if (!ScalarizeLargePHIs || getCGPassBuilderOption().EnableGlobalISelOption)
+  if (!BreakLargePHIs || getCGPassBuilderOption().EnableGlobalISelOption)
     return false;
 
   FixedVectorType *FVT = dyn_cast<FixedVectorType>(I.getType());
-  if (!FVT || DL->getTypeSizeInBits(FVT) <= ScalarizeLargePHIsThreshold)
+  if (!FVT || FVT->getNumElements() == 1 ||
+      DL->getTypeSizeInBits(FVT) <= BreakLargePHIsThreshold)
     return false;
 
-  if (!ForceScalarizeLargePHIs && !canBreakPHINode(I))
+  if (!ForceBreakLargePHIs && !canBreakPHINode(I))
     return false;
 
   std::vector<VectorSlice> Slices;
@@ -1930,8 +1963,7 @@ bool AMDGPUCodeGenPrepareImpl::visitPHINode(PHINode &I) {
       Slices.emplace_back(EltTy, Idx, 1);
   }
 
-  if (Slices.size() == 1)
-    return false;
+  assert(Slices.size() > 1);
 
   // Create one PHI per vector piece. The "VectorSlice" class takes care of
   // creating the necessary instruction to extract the relevant slices of each