[SVE] clang fails to apply tail-folding optimization compared to gcc #63616

Closed
vfdff opened this issue Jun 30, 2023 · 11 comments
Comments

@vfdff
Contributor

vfdff commented Jun 30, 2023

real_t s311(struct args_t * func_args)
{
    real_t sum = (real_t)0.;
    for (int i = 0; i < LEN_1D; i++) {
        sum += a[i];
    }

    return sum;
}
  • gcc: uses whilelo to fold the tail loop
.L2:
        ld1w    z31.s, p7/z, [x2, x0, lsl 2]
        add     x0, x0, x3
        fadda   s0, p7, s0, z31.s
        whilelo p7.s, w0, w1
        b.any   .L2
  • clang: uses a normal branch for the kernel loop body .LBB0_1
.LBB0_1:                                // =>This Inner Loop Header: Depth=1
        ld1w    { z1.s }, p0/z, [x12, x10, lsl #2]
        add     x10, x10, x9
        cmp     x13, x10
        fadda   s0, p0, s0, z1.s
        b.ne    .LBB0_1
@v01dXYZ

v01dXYZ commented Jun 30, 2023

Minimal C code:

#define ARRAY_ALIGNMENT 64
#define LEN_1D 8000

// array definitions
__attribute__((aligned(ARRAY_ALIGNMENT))) float a[LEN_1D];

float s311(struct args_t * func_args)
{
    float sum = 0.;
    for (int i = 0; i < LEN_1D; i++) {
        sum += a[i];
    }

    return sum;
}

common options: -march=armv8-a+sve -O3
clang/llvm options: -mllvm -prefer-predicate-over-epilogue=predicate-else-scalar-epilogue

@EugeneZelenko EugeneZelenko added backend:AArch64 SVE ARM Scalable Vector Extensions and removed new issue labels Jun 30, 2023
@llvmbot
Collaborator

llvmbot commented Jun 30, 2023

@llvm/issue-subscribers-backend-aarch64


@david-arm
Contributor

I don't think this is a bug - it's expected behaviour. I believe clang is doing the right thing here because we can prove the loop predicate p0 is always all-true. I think clang is choosing the more optimal form of the loop: rather than maintaining a predicate with the while instruction, it uses the cheaper add instruction.

@davemgreen
Collaborator

davemgreen commented Jul 3, 2023

I think part of the problem is that we don't clean up the code after vectorization. The vectorizer knows that vscale is a power of 2, so it doesn't need to fold the tail. The rest of the pass pipeline doesn't know that, though, so it doesn't know that the checks for the scalar remainder will always be false.

@v01dXYZ

v01dXYZ commented Jul 3, 2023

@davemgreen Are you sure vscale is a power of two? Isn't it possible for it to be simply any integer?

Here is an excerpt from "Introduction to SVE":

* An implementation must allow the vector length to be constrained to any power of two.
* An implementation allows the vector length to be constrained to multiples of 128 that are not a power of two.

Does LLVM constrain vscale to be a power of two, or is it Linux?

@davemgreen
Collaborator

See the summary of https://reviews.llvm.org/D141486

@v01dXYZ

v01dXYZ commented Jul 4, 2023

Thanks for the pointer.

So there are actually two parts to this issue:

  1. add + b.ne is used instead of whilelo: this is not a problem, since, as david-arm said, the instruction selector recognises that the predicate is always all-true (ALL) and selects the lower-cost instructions.
  2. dead branch: there is no tail, so the remainder check and the scalar remainder loop should be cleaned up.

Intermediate IR after loop-vectorize

define dso_local float @s_8000(ptr nocapture noundef readnone %0) local_unnamed_addr #0 {
vector.ph:
  %1 = tail call i64 @llvm.vscale.i64()
  %2 = shl nuw nsw i64 %1, 2
  %n.mod.vf = urem i64 8000, %2
  %n.vec = sub nuw nsw i64 8000, %n.mod.vf
  %3 = tail call i64 @llvm.vscale.i64()
  %4 = shl nuw nsw i64 %3, 2
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.phi = phi float [ 0.000000e+00, %vector.ph ], [ %6, %vector.body ]
  %5 = getelementptr inbounds [8000 x float], ptr @a, i64 0, i64 %index
  %wide.load = load <vscale x 4 x float>, ptr %5, align 16
  %6 = tail call float @llvm.vector.reduce.fadd.nxv4f32(float %vec.phi, <vscale x 4 x float> %wide.load)
  %index.next = add nuw i64 %index, %4
  %7 = icmp eq i64 %index.next, %n.vec
  br i1 %7, label %middle.block, label %vector.body, !llvm.loop !0

middle.block:                                     ; preds = %vector.body
  %cmp.n = icmp eq i64 %n.mod.vf, 0
  br i1 %cmp.n, label %.loopexit, label %scalar.ph

.loopexit:                                        ; preds = %scalar.ph, %middle.block
  %.lcssa = phi float [ %6, %middle.block ], [ %10, %scalar.ph ]
  ret float %.lcssa

scalar.ph:                                        ; preds = %middle.block, %scalar.ph
  %indvars.iv = phi i64 [ %indvars.iv.next, %scalar.ph ], [ %n.vec, %middle.block ]
  %.045 = phi float [ %10, %scalar.ph ], [ %6, %middle.block ]
  %8 = getelementptr inbounds [8000 x float], ptr @a, i64 0, i64 %indvars.iv
  %9 = load float, ptr %8, align 4
  %10 = fadd float %.045, %9
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond.not = icmp eq i64 %indvars.iv.next, 8000
  br i1 %exitcond.not, label %.loopexit, label %scalar.ph, !llvm.loop !3
}

%1 is vscale, i.e. 2^n with n an integer in [0, 4], so %2 = 4 * %1 = 2^(n+2), which always divides 64; since 64 divides 8000, %n.mod.vf is always 0 and the branch to the remainder loop can never be taken.
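
As a standalone sanity check of that arithmetic (a minimal sketch, assuming vscale is a power of two in [1, 16] as discussed above), the vector step vscale * 4 divides 8000 for every admissible value, so no scalar remainder iterations are ever needed:

#include <cassert>
#include <cstdio>

int main() {
  const unsigned long TripCount = 8000; // LEN_1D
  // vscale restricted to powers of two in [1, 16], as assumed above.
  for (unsigned long VScale = 1; VScale <= 16; VScale <<= 1) {
    unsigned long Step = VScale * 4; // elements per <vscale x 4 x float>
    std::printf("vscale=%2lu step=%2lu 8000 %% step = %lu\n", VScale, Step,
                TripCount % Step);
    assert(TripCount % Step == 0 && "a scalar remainder loop would be needed");
  }
  return 0;
}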

Do you think it would be an easy task to add this information to the cleanup after the loop vectoriser and remove the dead branch?

Code Analysis

This part is responsible for introducing the conditional branch.

BasicBlock *InnerLoopVectorizer::completeLoopSkeleton() {
  // The trip counts should be cached by now.
  Value *Count = getTripCount();
  Value *VectorTripCount = getOrCreateVectorTripCount(LoopVectorPreHeader);

  auto *ScalarLatchTerm = OrigLoop->getLoopLatch()->getTerminator();

  // Add a check in the middle block to see if we have completed
  // all of the iterations in the first vector loop. Three cases:
  // 1) If we require a scalar epilogue, there is no conditional branch as
  //    we unconditionally branch to the scalar preheader. Do nothing.
  // 2) If (N - N%VF) == N, then we *don't* need to run the remainder.
  //    Thus if tail is to be folded, we know we don't need to run the
  //    remainder and we can use the previous value for the condition (true).
  // 3) Otherwise, construct a runtime check.
  if (!Cost->requiresScalarEpilogue(VF.isVector()) &&
      !Cost->foldTailByMasking()) {
    Instruction *CmpN = CmpInst::Create(Instruction::ICmp, CmpInst::ICMP_EQ,
                                        Count, VectorTripCount, "cmp.n",
                                        LoopMiddleBlock->getTerminator());

    // Here we use the same DebugLoc as the scalar loop latch terminator instead
    // of the corresponding compare because they may have ended up with
    // different line numbers and we want to avoid awkward line stepping while
    // debugging. Eg. if the compare has got a line number inside the loop.
    CmpN->setDebugLoc(ScalarLatchTerm->getDebugLoc());
    cast<BranchInst>(LoopMiddleBlock->getTerminator())->setCondition(CmpN);
  }

#ifdef EXPENSIVE_CHECKS
  assert(DT->verify(DominatorTree::VerificationLevel::Fast));
#endif

  return LoopVectorPreHeader;
}

We can modify this part or we can add information to Count and VectorTripCount to allow analyses to detect the divisibility.

I suspect the following code does this kind of divisibility analysis (but for something else):

// Avoid tail folding if the trip count is known to be a multiple of any VF
// we choose.
std::optional<unsigned> MaxPowerOf2RuntimeVF =
    MaxFactors.FixedVF.getFixedValue();
if (MaxFactors.ScalableVF) {
  std::optional<unsigned> MaxVScale = getMaxVScale(*TheFunction, TTI);
  if (MaxVScale && TTI.isVScaleKnownToBeAPowerOfTwo()) {
    MaxPowerOf2RuntimeVF = std::max<unsigned>(
        *MaxPowerOf2RuntimeVF,
        *MaxVScale * MaxFactors.ScalableVF.getKnownMinValue());
  } else
    MaxPowerOf2RuntimeVF = std::nullopt; // Stick with tail-folding for now.
}

if (MaxPowerOf2RuntimeVF && *MaxPowerOf2RuntimeVF > 0) {
  assert((UserVF.isNonZero() || isPowerOf2_32(*MaxPowerOf2RuntimeVF)) &&
         "MaxFixedVF must be a power of 2");
  unsigned MaxVFtimesIC =
      UserIC ? *MaxPowerOf2RuntimeVF * UserIC : *MaxPowerOf2RuntimeVF;
  ScalarEvolution *SE = PSE.getSE();
  const SCEV *BackedgeTakenCount = PSE.getBackedgeTakenCount();
  const SCEV *ExitCount = SE->getAddExpr(
      BackedgeTakenCount, SE->getOne(BackedgeTakenCount->getType()));
  const SCEV *Rem = SE->getURemExpr(
      SE->applyLoopGuards(ExitCount, TheLoop),
      SE->getConstant(BackedgeTakenCount->getType(), MaxVFtimesIC));
  if (Rem->isZero()) {
    // Accept MaxFixedVF if we do not have a tail.
    LLVM_DEBUG(dbgs() << "LV: No tail will remain for any chosen VF.\n");
    return MaxFactors;
  }
}
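
A rough sketch of how that SCEV-based check might be reused when completing the loop skeleton, so the cmp.n guard could be folded to false and the scalar remainder dropped as dead. This is only an illustration, not the actual D154953 change: the helper name, its signature, and the way the power-of-two VF*UF constant is obtained are assumptions, and it relies on the usual LoopVectorize.cpp headers being available.

// Illustrative helper only: returns true when ScalarEvolution can prove the
// exit count is a multiple of the maximum power-of-two runtime VF * UF.
static bool exitCountIsMultipleOfVF(PredicatedScalarEvolution &PSE,
                                    Loop *OrigLoop, unsigned MaxVFTimesUF) {
  ScalarEvolution *SE = PSE.getSE();
  const SCEV *BTC = PSE.getBackedgeTakenCount();
  // Exit count = backedge-taken count + 1.
  const SCEV *ExitCount = SE->getAddExpr(BTC, SE->getOne(BTC->getType()));
  // Apply dominating guards (e.g. a check that the length is a multiple of
  // 64) before testing divisibility.
  const SCEV *Rem = SE->getURemExpr(
      SE->applyLoopGuards(ExitCount, OrigLoop),
      SE->getConstant(BTC->getType(), MaxVFTimesUF));
  return Rem->isZero();
}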

@vfdff
Contributor Author

vfdff commented Jul 6, 2023

Tried the above improvement with https://reviews.llvm.org/D154314

@v01dXYZ

v01dXYZ commented Jul 8, 2023

david-arm rightfully wants to be sure this removal of the loop remainder is not something that already happens elsewhere, especially in InstCombine. Looking at how InstCombine interacts with ScalarEvolution, it seems ScalarEvolution is not used there at all, so a divisibility analysis is unlikely to be done by InstCombine unless it reimplements a partial scalar-evolution analysis. The proposed change could therefore also handle cases where the array size is only guaranteed to be a multiple of 64, especially when it is not a constant, which would not be possible at all with InstCombine. It could in principle also happen in SimplifyCFG, but after checking, I don't think that is the case either.
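
Purely as an illustration of the non-constant case mentioned above (the function name and guard are hypothetical), the length here is only known to be a multiple of 64 through a dominating check, which is the kind of fact SCEV's applyLoopGuards could exploit but InstCombine could not:

float sum_multiple_of_64(const float *a, long n) {
  float sum = 0.f;
  // Guard: past this point n is known to be a multiple of 64.
  if (n % 64 != 0)
    return 0.f;
  for (long i = 0; i < n; i++)
    sum += a[i];
  return sum;
}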

@vfdff
Contributor Author

vfdff commented Jul 26, 2023

New candidate patch: https://reviews.llvm.org/D154953

@vfdff vfdff closed this as completed in 3e386b2 Aug 1, 2023
@EugeneZelenko EugeneZelenko added llvm:optimizations and removed backend:AArch64 SVE ARM Scalable Vector Extensions labels Aug 1, 2023
vfdff added a commit that referenced this issue Aug 1, 2023
…s always true

In D146199 we check that the loop trip count is known to be a power of 2 to determine
whether the tail loop can be eliminated.
However, the remainder loop of a masked scalable loop can also be removed
if we know the mask is always going to be true for every vector iteration.
This depends on the power-of-two vscale assumption from D155350.

proofs: https://alive2.llvm.org/ce/z/bT62Wa

Fix #63616.

Reviewed By: goldstein.w.n, nikic, david-arm, paulwalker-arm
Differential Revision: https://reviews.llvm.org/D154953