[SVE] clang fails to apply tail-folding optimization compared to gcc #63616

Closed
vfdff opened this issue Jun 30, 2023 · 11 comments
Comments

@vfdff
Contributor

vfdff commented Jun 30, 2023

real_t s311(struct args_t * func_args)
{
    real_t sum = (real_t)0.;
    for (int i = 0; i < LEN_1D; i++) {
        sum += a[i];
    }

    return sum;
}
  • gcc: uses whilelo to fold the tail loop
.L2:
        ld1w    z31.s, p7/z, [x2, x0, lsl 2]
        add     x0, x0, x3
        fadda   s0, p7, s0, z31.s
        whilelo p7.s, w0, w1
        b.any   .L2
  • clang: uses a normal branch for the kernel loop body .LBB0_1
.LBB0_1:                                // =>This Inner Loop Header: Depth=1
        ld1w    { z1.s }, p0/z, [x12, x10, lsl #2]
        add     x10, x10, x9
        cmp     x13, x10
        fadda   s0, p0, s0, z1.s
        b.ne    .LBB0_1
@v01dXYZ

v01dXYZ commented Jun 30, 2023

Minimal C code:

#define ARRAY_ALIGNMENT 64
#define LEN_1D 8000

// array definitions
__attribute__((aligned(ARRAY_ALIGNMENT))) float a[LEN_1D];

float s311(struct args_t * func_args)
{
    float sum = 0.;
    for (int i = 0; i < LEN_1D; i++) {
        sum += a[i];
    }

    return sum;
}

common options: -march=armv8-a+sve -O3
clang/llvm options: -mllvm -prefer-predicate-over-epilogue=predicate-else-scalar-epilogue

@EugeneZelenko EugeneZelenko added backend:AArch64 SVE ARM Scalable Vector Extensions and removed new issue labels Jun 30, 2023
@llvmbot
Collaborator

llvmbot commented Jun 30, 2023

@llvm/issue-subscribers-backend-aarch64


@david-arm
Contributor

I don't think this is a bug - it's expected behaviour. I believe clang is doing the right thing here because we can prove the loop predicate p0 is always all-true. I think clang is choosing the more optimal form of the loop: rather than maintaining a predicate with the while instruction, it uses the cheaper add instruction.

@davemgreen
Collaborator

davemgreen commented Jul 3, 2023

I think part of the problem is that we don't clean up the code after vectorization. The vectorizer knows that vscale is a power of 2, so it doesn't need to fold the tail. The rest of the pass pipeline doesn't know that, though, so it doesn't know that the checks for the scalar remainder will always be false.

@v01dXYZ

v01dXYZ commented Jul 3, 2023

@davemgreen Are you sure vscale is a power of two? Isn't it possible for it to be simply any integer?

Here is an excerpt from "Introduction to SVE":

* An implementation must allow the vector length to be constrained to any power of two.
* An implementation allows the vector length to be constrained to multiples of 128 that are not a power of two.

Does LLVM constrain vscale to be a power of two, or is it Linux?

@davemgreen
Collaborator

See the summary of https://reviews.llvm.org/D141486

@v01dXYZ

v01dXYZ commented Jul 4, 2023

Thanks for the pointer.

So there are actually two parts to this issue:

  1. add + b.ne is used instead of whilelo: this is not a problem, since, as david-arm said, the instruction selector recognises that the predicate is always all-true (ALL) and selects the lower-cost instructions.
  2. dead branch: there is no tail, so the remainder check and the scalar remainder loop should be cleaned up.

Intermediate IR after loop-vectorize

define dso_local float @s_8000(ptr nocapture noundef readnone %0) local_unnamed_addr #0 {
vector.ph:
  %1 = tail call i64 @llvm.vscale.i64()
  %2 = shl nuw nsw i64 %1, 2
  %n.mod.vf = urem i64 8000, %2
  %n.vec = sub nuw nsw i64 8000, %n.mod.vf
  %3 = tail call i64 @llvm.vscale.i64()
  %4 = shl nuw nsw i64 %3, 2
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.phi = phi float [ 0.000000e+00, %vector.ph ], [ %6, %vector.body ]
  %5 = getelementptr inbounds [8000 x float], ptr @a, i64 0, i64 %index
  %wide.load = load <vscale x 4 x float>, ptr %5, align 16
  %6 = tail call float @llvm.vector.reduce.fadd.nxv4f32(float %vec.phi, <vscale x 4 x float> %wide.load)
  %index.next = add nuw i64 %index, %4
  %7 = icmp eq i64 %index.next, %n.vec
  br i1 %7, label %middle.block, label %vector.body, !llvm.loop !0

middle.block:                                     ; preds = %vector.body
  %cmp.n = icmp eq i64 %n.mod.vf, 0
  br i1 %cmp.n, label %.loopexit, label %scalar.ph

.loopexit:                                        ; preds = %scalar.ph, %middle.block
  %.lcssa = phi float [ %6, %middle.block ], [ %10, %scalar.ph ]
  ret float %.lcssa

scalar.ph:                                        ; preds = %middle.block, %scalar.ph
  %indvars.iv = phi i64 [ %indvars.iv.next, %scalar.ph ], [ %n.vec, %middle.block ]
  %.045 = phi float [ %10, %scalar.ph ], [ %6, %middle.block ]
  %8 = getelementptr inbounds [8000 x float], ptr @a, i64 0, i64 %indvars.iv
  %9 = load float, ptr %8, align 4
  %10 = fadd float %.045, %9
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond.not = icmp eq i64 %indvars.iv.next, 8000
  br i1 %exitcond.not, label %.loopexit, label %scalar.ph, !llvm.loop !3
}

%1 is vscale, i.e. 2^n with n an integer in [0, 4], so %2 = 4 * %1 = 2^(n+2), which always divides 64; since 64 divides 8000, %n.mod.vf is always 0 and the branch to the remainder loop can never be taken.
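
As a standalone sanity check of that arithmetic (a minimal sketch, assuming vscale is a power of two in [1, 16] as discussed above), the vector step vscale * 4 divides 8000 for every admissible value, so no scalar remainder iterations are ever needed:

#include <cassert>
#include <cstdio>

int main() {
  const unsigned long TripCount = 8000; // LEN_1D
  // vscale restricted to powers of two in [1, 16], as assumed above.
  for (unsigned long VScale = 1; VScale <= 16; VScale <<= 1) {
    unsigned long Step = VScale * 4; // elements per <vscale x 4 x float>
    std::printf("vscale=%2lu step=%2lu 8000 %% step = %lu\n", VScale, Step,
                TripCount % Step);
    assert(TripCount % Step == 0 && "a scalar remainder loop would be needed");
  }
  return 0;
}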

Do you think it would be an easy task to add this information to the cleanup after the loop vectoriser and remove the dead branch?

Code Analysis

This part is responsible for introducing the conditional branch.

BasicBlock *InnerLoopVectorizer::completeLoopSkeleton() {
  // The trip counts should be cached by now.
  Value *Count = getTripCount();
  Value *VectorTripCount = getOrCreateVectorTripCount(LoopVectorPreHeader);

  auto *ScalarLatchTerm = OrigLoop->getLoopLatch()->getTerminator();

  // Add a check in the middle block to see if we have completed
  // all of the iterations in the first vector loop. Three cases:
  // 1) If we require a scalar epilogue, there is no conditional branch as
  //    we unconditionally branch to the scalar preheader. Do nothing.
  // 2) If (N - N%VF) == N, then we *don't* need to run the remainder.
  //    Thus if tail is to be folded, we know we don't need to run the
  //    remainder and we can use the previous value for the condition (true).
  // 3) Otherwise, construct a runtime check.
  if (!Cost->requiresScalarEpilogue(VF.isVector()) &&
      !Cost->foldTailByMasking()) {
    Instruction *CmpN = CmpInst::Create(Instruction::ICmp, CmpInst::ICMP_EQ,
                                        Count, VectorTripCount, "cmp.n",
                                        LoopMiddleBlock->getTerminator());

    // Here we use the same DebugLoc as the scalar loop latch terminator instead
    // of the corresponding compare because they may have ended up with
    // different line numbers and we want to avoid awkward line stepping while
    // debugging. Eg. if the compare has got a line number inside the loop.
    CmpN->setDebugLoc(ScalarLatchTerm->getDebugLoc());
    cast<BranchInst>(LoopMiddleBlock->getTerminator())->setCondition(CmpN);
  }

#ifdef EXPENSIVE_CHECKS
  assert(DT->verify(DominatorTree::VerificationLevel::Fast));
#endif

  return LoopVectorPreHeader;
}

We can modify this part or we can add information to Count and VectorTripCount to allow analyses to detect the divisibility.

I suspect the following code does this kind of divisibility analysis (but for something else):

// Avoid tail folding if the trip count is known to be a multiple of any VF
// we choose.
std::optional<unsigned> MaxPowerOf2RuntimeVF =
    MaxFactors.FixedVF.getFixedValue();
if (MaxFactors.ScalableVF) {
  std::optional<unsigned> MaxVScale = getMaxVScale(*TheFunction, TTI);
  if (MaxVScale && TTI.isVScaleKnownToBeAPowerOfTwo()) {
    MaxPowerOf2RuntimeVF = std::max<unsigned>(
        *MaxPowerOf2RuntimeVF,
        *MaxVScale * MaxFactors.ScalableVF.getKnownMinValue());
  } else
    MaxPowerOf2RuntimeVF = std::nullopt; // Stick with tail-folding for now.
}

if (MaxPowerOf2RuntimeVF && *MaxPowerOf2RuntimeVF > 0) {
  assert((UserVF.isNonZero() || isPowerOf2_32(*MaxPowerOf2RuntimeVF)) &&
         "MaxFixedVF must be a power of 2");
  unsigned MaxVFtimesIC =
      UserIC ? *MaxPowerOf2RuntimeVF * UserIC : *MaxPowerOf2RuntimeVF;
  ScalarEvolution *SE = PSE.getSE();
  const SCEV *BackedgeTakenCount = PSE.getBackedgeTakenCount();
  const SCEV *ExitCount = SE->getAddExpr(
      BackedgeTakenCount, SE->getOne(BackedgeTakenCount->getType()));
  const SCEV *Rem = SE->getURemExpr(
      SE->applyLoopGuards(ExitCount, TheLoop),
      SE->getConstant(BackedgeTakenCount->getType(), MaxVFtimesIC));
  if (Rem->isZero()) {
    // Accept MaxFixedVF if we do not have a tail.
    LLVM_DEBUG(dbgs() << "LV: No tail will remain for any chosen VF.\n");
    return MaxFactors;
  }
}
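
A rough sketch of how that SCEV-based check might be reused when completing the loop skeleton, so the cmp.n guard could be folded to false and the scalar remainder dropped as dead. This is only an illustration, not the actual D154953 change: the helper name, its signature, and the way the power-of-two VF*UF constant is obtained are assumptions, and it relies on the usual LoopVectorize.cpp headers being available.

// Illustrative helper only: returns true when ScalarEvolution can prove the
// exit count is a multiple of the maximum power-of-two runtime VF * UF.
static bool exitCountIsMultipleOfVF(PredicatedScalarEvolution &PSE,
                                    Loop *OrigLoop, unsigned MaxVFTimesUF) {
  ScalarEvolution *SE = PSE.getSE();
  const SCEV *BTC = PSE.getBackedgeTakenCount();
  // Exit count = backedge-taken count + 1.
  const SCEV *ExitCount = SE->getAddExpr(BTC, SE->getOne(BTC->getType()));
  // Apply dominating guards (e.g. a check that the length is a multiple of
  // 64) before testing divisibility.
  const SCEV *Rem = SE->getURemExpr(
      SE->applyLoopGuards(ExitCount, OrigLoop),
      SE->getConstant(BTC->getType(), MaxVFTimesUF));
  return Rem->isZero();
}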

@vfdff
Contributor Author

vfdff commented Jul 6, 2023

Tried the above improvement with https://reviews.llvm.org/D154314

@v01dXYZ

v01dXYZ commented Jul 8, 2023

david-arm rightfully wants to be sure this removal of the loop remainder is not something that already happens elsewhere, especially in InstCombine. Looking at how InstCombine interacts with ScalarEvolution, it seems ScalarEvolution is not used there at all, so a divisibility analysis is unlikely to be done by InstCombine unless it reimplements a partial scalar-evolution analysis. The proposed change could therefore also handle cases where the array size is only guaranteed to be a multiple of 64, especially when it is not a constant, which would not be possible at all with InstCombine. It could in principle also happen in SimplifyCFG, but after checking, I don't think that is the case either.
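
Purely as an illustration of the non-constant case mentioned above (the function name and guard are hypothetical), the length here is only known to be a multiple of 64 through a dominating check, which is the kind of fact SCEV's applyLoopGuards could exploit but InstCombine could not:

float sum_multiple_of_64(const float *a, long n) {
  float sum = 0.f;
  // Guard: past this point n is known to be a multiple of 64.
  if (n % 64 != 0)
    return 0.f;
  for (long i = 0; i < n; i++)
    sum += a[i];
  return sum;
}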

@vfdff
Contributor Author

vfdff commented Jul 26, 2023

New candidate patch: https://reviews.llvm.org/D154953

@vfdff vfdff closed this as completed in 3e386b2 Aug 1, 2023
@EugeneZelenko EugeneZelenko added llvm:optimizations and removed backend:AArch64 SVE ARM Scalable Vector Extensions labels Aug 1, 2023
vfdff added a commit that referenced this issue Aug 1, 2023
…s always true

In D146199 we check that the loop trip count is known to be a power of 2 to determine
whether the tail loop can be eliminated.
However, the remainder loop of a masked scalable loop can also be removed
if we know the mask is always going to be true for every vector iteration.
This depends on the power-of-two vscale assumption from D155350.

proofs: https://alive2.llvm.org/ce/z/bT62Wa

Fix #63616.

Reviewed By: goldstein.w.n, nikic, david-arm, paulwalker-arm
Differential Revision: https://reviews.llvm.org/D154953