Skip to content

Commit

Permalink
[PowerPC] Add option to disable perfect shuffle
Browse files Browse the repository at this point in the history
Perfect shuffle was introduced into PowerPC backend years ago, and only
available in big-endian subtargets. This optimization has good effects
in simple cases, but brings serious negative impact in large programs
with many shuffle instructions sharing the same mask.

Here introduces a temporary backend hidden option to control it until we
implemented better way to fix the gap in vectorshuffle decomposition.

Reviewed By: jsji

Differential Revision: https://reviews.llvm.org/D120072
  • Loading branch information
ecnelises committed Feb 20, 2022
1 parent 8ef3e89 commit 43d48ed
Showing 1 changed file with 55 additions and 47 deletions.
102 changes: 55 additions & 47 deletions llvm/lib/Target/PowerPC/PPCISelLowering.cpp
Expand Up @@ -126,6 +126,11 @@ static cl::opt<bool> EnableQuadwordAtomics(
cl::desc("enable quadword lock-free atomic operations"), cl::init(false),
cl::Hidden);

static cl::opt<bool>
DisablePerfectShuffle("ppc-disable-perfect-shuffle",
cl::desc("disable vector permute decomposition"),
cl::init(false), cl::Hidden);

STATISTIC(NumTailCalls, "Number of tail calls");
STATISTIC(NumSiblingCalls, "Number of sibling calls");
STATISTIC(ShufflesHandledWithVPERM, "Number of shuffles lowered to a VPERM");
Expand Down Expand Up @@ -10071,56 +10076,59 @@ SDValue PPCTargetLowering::LowerVECTOR_SHUFFLE(SDValue Op,
// perfect shuffle table to emit an optimal matching sequence.
ArrayRef<int> PermMask = SVOp->getMask();

unsigned PFIndexes[4];
bool isFourElementShuffle = true;
for (unsigned i = 0; i != 4 && isFourElementShuffle; ++i) { // Element number
unsigned EltNo = 8; // Start out undef.
for (unsigned j = 0; j != 4; ++j) { // Intra-element byte.
if (PermMask[i*4+j] < 0)
continue; // Undef, ignore it.

unsigned ByteSource = PermMask[i*4+j];
if ((ByteSource & 3) != j) {
isFourElementShuffle = false;
break;
}
if (!DisablePerfectShuffle && !isLittleEndian) {
unsigned PFIndexes[4];
bool isFourElementShuffle = true;
for (unsigned i = 0; i != 4 && isFourElementShuffle;
++i) { // Element number
unsigned EltNo = 8; // Start out undef.
for (unsigned j = 0; j != 4; ++j) { // Intra-element byte.
if (PermMask[i * 4 + j] < 0)
continue; // Undef, ignore it.

unsigned ByteSource = PermMask[i * 4 + j];
if ((ByteSource & 3) != j) {
isFourElementShuffle = false;
break;
}

if (EltNo == 8) {
EltNo = ByteSource/4;
} else if (EltNo != ByteSource/4) {
isFourElementShuffle = false;
break;
if (EltNo == 8) {
EltNo = ByteSource / 4;
} else if (EltNo != ByteSource / 4) {
isFourElementShuffle = false;
break;
}
}
PFIndexes[i] = EltNo;
}

// If this shuffle can be expressed as a shuffle of 4-byte elements, use the
// perfect shuffle vector to determine if it is cost effective to do this as
// discrete instructions, or whether we should use a vperm.
// For now, we skip this for little endian until such time as we have a
// little-endian perfect shuffle table.
if (isFourElementShuffle) {
// Compute the index in the perfect shuffle table.
unsigned PFTableIndex = PFIndexes[0] * 9 * 9 * 9 + PFIndexes[1] * 9 * 9 +
PFIndexes[2] * 9 + PFIndexes[3];

unsigned PFEntry = PerfectShuffleTable[PFTableIndex];
unsigned Cost = (PFEntry >> 30);

// Determining when to avoid vperm is tricky. Many things affect the cost
// of vperm, particularly how many times the perm mask needs to be
// computed. For example, if the perm mask can be hoisted out of a loop or
// is already used (perhaps because there are multiple permutes with the
// same shuffle mask?) the vperm has a cost of 1. OTOH, hoisting the
// permute mask out of the loop requires an extra register.
//
// As a compromise, we only emit discrete instructions if the shuffle can
// be generated in 3 or fewer operations. When we have loop information
// available, if this block is within a loop, we should avoid using vperm
// for 3-operation perms and use a constant pool load instead.
if (Cost < 3)
return GeneratePerfectShuffle(PFEntry, V1, V2, DAG, dl);
}
PFIndexes[i] = EltNo;
}

// If this shuffle can be expressed as a shuffle of 4-byte elements, use the
// perfect shuffle vector to determine if it is cost effective to do this as
// discrete instructions, or whether we should use a vperm.
// For now, we skip this for little endian until such time as we have a
// little-endian perfect shuffle table.
if (isFourElementShuffle && !isLittleEndian) {
// Compute the index in the perfect shuffle table.
unsigned PFTableIndex =
PFIndexes[0]*9*9*9+PFIndexes[1]*9*9+PFIndexes[2]*9+PFIndexes[3];

unsigned PFEntry = PerfectShuffleTable[PFTableIndex];
unsigned Cost = (PFEntry >> 30);

// Determining when to avoid vperm is tricky. Many things affect the cost
// of vperm, particularly how many times the perm mask needs to be computed.
// For example, if the perm mask can be hoisted out of a loop or is already
// used (perhaps because there are multiple permutes with the same shuffle
// mask?) the vperm has a cost of 1. OTOH, hoisting the permute mask out of
// the loop requires an extra register.
//
// As a compromise, we only emit discrete instructions if the shuffle can be
// generated in 3 or fewer operations. When we have loop information
// available, if this block is within a loop, we should avoid using vperm
// for 3-operation perms and use a constant pool load instead.
if (Cost < 3)
return GeneratePerfectShuffle(PFEntry, V1, V2, DAG, dl);
}

// Lower this to a VPERM(V1, V2, V3) expression, where V3 is a constant
Expand Down

0 comments on commit 43d48ed

Please sign in to comment.