[sve] clang failed to tail folding optimization compare to gcc #63616
Minimal C code:

    #define ARRAY_ALIGNMENT 64
    #define LEN_1D 8000
    // array definitions
    __attribute__((aligned(ARRAY_ALIGNMENT))) float a[LEN_1D];
    float s311(struct args_t * func_args)
    {
        float sum = 0.;
        for (int i = 0; i < LEN_1D; i++) {
            sum += a[i];
        }
        return sum;
    }

common options:
I don't think this is a bug; it's expected behaviour. I believe clang is doing the right thing here because we can prove the loop predicate p0 is always all-true. I think clang is choosing the better form of the loop: it avoids the while instruction that would be needed to maintain a predicate and uses the cheaper add instruction instead.
I think part of the problem is that we don't clean up the code after vectorization. The vectorizer knows that vscale is a power of 2, so doesn't need to fold the tail. The rest of the pass pipeline doesn't know that though, so doesn't know that the checks for the scalar remainder will always be false.
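To make the divisibility argument concrete, here is a minimal sketch in C. It assumes 4 float lanes per 128-bit SVE granule (so VF = 4 * vscale) and a maximum vscale of 16; the helper names are hypothetical and are not part of LLVM.

```c
#include <stddef.h>

/* Assumption: 4 float lanes per 128-bit SVE granule, so VF = 4 * vscale. */
enum { LANES_PER_GRANULE = 4 };

/* Returns 1 if vscale is a non-zero power of two, else 0. */
int is_power_of_two(unsigned vscale) {
    return vscale != 0 && (vscale & (vscale - 1)) == 0;
}

/* Iterations that would be left over for a scalar remainder loop. */
size_t tail_iterations(size_t trip_count, unsigned vscale) {
    size_t vf = (size_t)LANES_PER_GRANULE * vscale;
    return trip_count % vf;
}

/* Returns 1 if no power-of-two vscale in [1, max_vscale] leaves a tail. */
int no_tail_for_pow2_vscale(size_t trip_count, unsigned max_vscale) {
    for (unsigned vscale = 1; vscale <= max_vscale; vscale++) {
        if (is_power_of_two(vscale) && tail_iterations(trip_count, vscale) != 0)
            return 0;
    }
    return 1;
}
```

Under these assumptions, `no_tail_for_pow2_vscale(8000, 16)` holds because 8000 is a multiple of 64, while a non-power-of-two value such as vscale = 3 (VF = 12) would leave a tail. That is why the power-of-two guarantee matters for this loop.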
@davemgreen Are you sure that vscale is a power of two? Isn't it possible for it to be an arbitrary integer? Here is an excerpt from "Introduction to SVE"
Does LLVM constrain vscale to be a power of two, or is it Linux?
See the summary of https://reviews.llvm.org/D141486
Thanks for the pointer. So actually there are two problems for the issue:
Intermediate IR after
    BasicBlock *InnerLoopVectorizer::completeLoopSkeleton() {
      // The trip counts should be cached by now.
      Value *Count = getTripCount();
      Value *VectorTripCount = getOrCreateVectorTripCount(LoopVectorPreHeader);

      auto *ScalarLatchTerm = OrigLoop->getLoopLatch()->getTerminator();

      // Add a check in the middle block to see if we have completed
      // all of the iterations in the first vector loop. Three cases:
      // 1) If we require a scalar epilogue, there is no conditional branch as
      //    we unconditionally branch to the scalar preheader. Do nothing.
      // 2) If (N - N%VF) == N, then we *don't* need to run the remainder.
      //    Thus if tail is to be folded, we know we don't need to run the
      //    remainder and we can use the previous value for the condition (true).
      // 3) Otherwise, construct a runtime check.
      if (!Cost->requiresScalarEpilogue(VF.isVector()) &&
          !Cost->foldTailByMasking()) {
        Instruction *CmpN = CmpInst::Create(Instruction::ICmp, CmpInst::ICMP_EQ,
                                            Count, VectorTripCount, "cmp.n",
                                            LoopMiddleBlock->getTerminator());

        // Here we use the same DebugLoc as the scalar loop latch terminator instead
        // of the corresponding compare because they may have ended up with
        // different line numbers and we want to avoid awkward line stepping while
        // debugging. Eg. if the compare has got a line number inside the loop.
        CmpN->setDebugLoc(ScalarLatchTerm->getDebugLoc());
        cast<BranchInst>(LoopMiddleBlock->getTerminator())->setCondition(CmpN);
      }

    #ifdef EXPENSIVE_CHECKS
      assert(DT->verify(DominatorTree::VerificationLevel::Fast));
    #endif

      return LoopVectorPreHeader;
    }
We can modify this part, or we can add information to Count and VectorTripCount to allow analyses to detect the divisibility.
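For readers unfamiliar with the middle-block check, the following C sketch models what the "cmp.n" comparison decides at run time. It is a scalar simulation with hypothetical names, not the vectorizer's generated code: the inner loop stands in for the vector lanes, and `vf` is assumed to be non-zero.

```c
#include <stddef.h>

/* Vector trip count as described in the comment above: N - N % VF. */
size_t vector_trip_count(size_t n, size_t vf) {
    return n - n % vf;
}

/* Sums a[0..n) with a "vector" body of width vf followed by an optional
 * scalar remainder. *scalar_iters reports how many remainder iterations ran. */
float sum_vector_then_scalar(const float *a, size_t n, size_t vf,
                             size_t *scalar_iters) {
    size_t vtc = vector_trip_count(n, vf);
    float sum = 0.0f;
    size_t i = 0;

    /* "Vector" loop: each outer iteration covers vf elements. */
    for (; i < vtc; i += vf)
        for (size_t lane = 0; lane < vf; lane++)
            sum += a[i + lane];

    /* Middle block: models cmp.n, i.e. Count == VectorTripCount. */
    *scalar_iters = 0;
    if (n != vtc) {
        /* Scalar remainder loop, only reached when cmp.n is false. */
        for (; i < n; i++) {
            sum += a[i];
            (*scalar_iters)++;
        }
    }
    return sum;
}
```

When `n` is a multiple of `vf`, as with 8000 and any power-of-two VF up to 64, the comparison is always true and the remainder loop is dead. The later passes can only delete it if they can prove that equality.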
I suspect the following code does this kind of divisibility analysis (but for something else):
llvm-project/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
Lines 5180 to 5211 in b89b3cd
      // Avoid tail folding if the trip count is known to be a multiple of any VF
      // we choose.
      std::optional<unsigned> MaxPowerOf2RuntimeVF =
          MaxFactors.FixedVF.getFixedValue();
      if (MaxFactors.ScalableVF) {
        std::optional<unsigned> MaxVScale = getMaxVScale(*TheFunction, TTI);
        if (MaxVScale && TTI.isVScaleKnownToBeAPowerOfTwo()) {
          MaxPowerOf2RuntimeVF = std::max<unsigned>(
              *MaxPowerOf2RuntimeVF,
              *MaxVScale * MaxFactors.ScalableVF.getKnownMinValue());
        } else
          MaxPowerOf2RuntimeVF = std::nullopt; // Stick with tail-folding for now.
      }

      if (MaxPowerOf2RuntimeVF && *MaxPowerOf2RuntimeVF > 0) {
        assert((UserVF.isNonZero() || isPowerOf2_32(*MaxPowerOf2RuntimeVF)) &&
               "MaxFixedVF must be a power of 2");
        unsigned MaxVFtimesIC =
            UserIC ? *MaxPowerOf2RuntimeVF * UserIC : *MaxPowerOf2RuntimeVF;
        ScalarEvolution *SE = PSE.getSE();
        const SCEV *BackedgeTakenCount = PSE.getBackedgeTakenCount();
        const SCEV *ExitCount = SE->getAddExpr(
            BackedgeTakenCount, SE->getOne(BackedgeTakenCount->getType()));
        const SCEV *Rem = SE->getURemExpr(
            SE->applyLoopGuards(ExitCount, TheLoop),
            SE->getConstant(BackedgeTakenCount->getType(), MaxVFtimesIC));
        if (Rem->isZero()) {
          // Accept MaxFixedVF if we do not have a tail.
          LLVM_DEBUG(dbgs() << "LV: No tail will remain for any chosen VF.\n");
          return MaxFactors;
        }
      }
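The arithmetic behind the `Rem->isZero()` test can be sketched in plain C. This is a simplified model with hypothetical names: it works on concrete integers rather than SCEV expressions, and it ignores the fixed-width VF that the snippet above folds in with `std::max`.

```c
#include <stddef.h>

/* Returns 1 if trip_count is a multiple of the largest runtime VF
 * (max_vscale * known_min_lanes), scaled by user_ic when it is non-zero.
 * Because every candidate VF is a power of two no larger than that maximum,
 * a zero remainder here means no candidate VF leaves a tail. */
int no_tail_for_any_vf(size_t trip_count, unsigned max_vscale,
                       unsigned known_min_lanes, unsigned user_ic) {
    size_t max_vf = (size_t)max_vscale * known_min_lanes;
    size_t max_vf_times_ic = user_ic ? max_vf * user_ic : max_vf;
    if (max_vf_times_ic == 0)
        return 0; /* Nothing can be concluded without a known maximum VF. */
    return trip_count % max_vf_times_ic == 0;
}
```

With the values from this issue, assuming a maximum vscale of 16 and 4 float lanes per granule, the largest VF is 64 and 8000 % 64 is 0, so the check succeeds. A user-specified interleave count of 2 would change the divisor to 128, which does not divide 8000.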
Tried the above improvement with https://reviews.llvm.org/D154314
david-arm rightfully wants to be sure this removal of the loop remainder is not something that occurs elsewhere, especially at the
New candidate's MR: https://reviews.llvm.org/D154953
…s always true
In D146199 we check whether the loop trip count is known to be a power of 2 to determine whether the tail loop can be eliminated. However, the remainder loop of a masked scalable loop can also be removed if we know the mask is always going to be true for every vector iteration. Depends on the power-of-two vscale assumption from D155350.
Proofs: https://alive2.llvm.org/ce/z/bT62Wa
Fix #63616.
Reviewed By: goldstein.w.n, nikic, david-arm, paulwalker-arm
Differential Revision: https://reviews.llvm.org/D154953