loop vectorizer unable to reason with trip count bounds #40306

llvmbot · 2019-03-04T22:18:31Z


Bugzilla Link	40961
Version	trunk
OS	All
Reporter	LLVM Bugzilla Contributor
CC	@efriedma-quic,@fhahn,@hfinkel,@hidekisaito,@RKSimon

Extended Description

I see bugs 38280, 37423, but this one is simpler still. Consider the following:

void topup(int a[], unsigned long i)
{
    for (; i < 16; i++) {
        a[i] = 1;
    }
}

The result, compiled with -O -march=haswell

loop is vectorized despite the small absolute limit on trip count
vectorized part has loop step of 256 (!) and is actually unreachable

The text was updated successfully, but these errors were encountered:

fhahn · 2019-03-04T22:39:42Z

What's happening depends on the target: for x86_64-apple-macos we generate a check if I < 16 and a call to memset_pattern16, for x86_64-unknown-linux we generate a big vectorized version as reported.

I guess the problem is the unknown start value of the induction variable.

hidekisaito · 2019-03-04T23:57:42Z

This is coming from LoopUtil's getLoopEstimatedTripCount returning "None".

So, some part of the compiler (such as apple-macos calling memset_pattern16) knows trip count of up to 16 is special and other parts (like LV) does not.

I think the notion of estimated trip count information needs to be extended beyond single value.

What can be described by Intel's loopcount pragma may be a good starting point for the discussion: It can represent min/max/ave as well as individual known constant values.

Here, we have 16 or less (no wrap case) ---- or possibly big number (wrap case). I could see why we'd want to distinguish this from i=0;i<N;i++ where N can be an arbitrary big number.

llvmbot · 2019-03-05T00:29:45Z

With simple loops, but where the exact trip count is unknown, it should still be possible to perform an analysis on the initial loop condition: value range of either side of the (a < b) relation can restrict the full range of the type, yielding upper, and lower, limits to the trip count.

I don't know how easy it is to get at...

Upper limit may be useful as a heuristic: loop step of approx. sqrt(limit) to minimize the overall branch count.

hidekisaito · 2019-03-05T01:13:09Z

What I'm saying is that this is most likely the issue that can affect other loop optimizations as well. So, any solution we come up with should not be just vectorizer only. We better have some discussions with many loop optimization stakeholders ---- in llvm-dev. Would you like me to start a thread?

efriedma-quic · 2019-03-05T03:09:19Z

I don't think there's any fundamnental issue to bring up on llvmdev. The vectorizer tries to handle this sort of construct by calling ScalarEvolution::getSmallConstantMaxTripCount(); it just isn't succeeding here for some reason.

hidekisaito · 2019-03-05T19:03:50Z

Sorry, you are right. I was thinking "i != 16" case.

llvmbot · 2019-09-02T06:06:23Z

I am trying to find out why ScalarEvolution is not able to give correct back edge taken count for an expression.

So in this case flow reaches to howFarToZero() and in that function, I have following expressions as SCEV

Start = (15 + (-1 * %i) (which is set to Distance SCEV)
Step = 1

now, first of all, should I expect Start as ConstantSCEV (15) instead of the whole expression
the problem here is getUnsignedRangeMax(Distance) returns very large number because of -1 in the SCEV.

How we can make this work? Here we can clearly say that after 15 steps this expression will be 0
and thus we have a value for backedge taken count.

llvmbot · 2019-09-03T14:03:58Z

experimental fix
Please look at the proposed fix and if this seems in the correct direction I can send it for review.

efriedma-quic · 2019-09-03T20:30:03Z

That patch is taking the wrong approach.

The IR should look something like this:

define void @topup(i32* nocapture %a, i64 %i) {
entry:
%cmp3 = icmp ult i64 %i, 16
br i1 %cmp3, label %for.body, label %for.end

for.body:
%i.addr.04 = phi i64 [ %inc, %for.body ], [ %i, %entry ]
%arrayidx = getelementptr inbounds i32, i32* %a, i64 %i.addr.04
store i32 1, i32* %arrayidx, align 4, !tbaa !2
%inc = add nuw nsw i64 %i.addr.04, 1
%exitcond = icmp eq i64 %inc, 16
br i1 %exitcond, label %for.end, label %for.body

for.end:
ret void
}

If you look at the IR, you have a loop which is roughly of the form "for (; i != 16; ++i)". In general, for loops of that form, the backedge-taken count is unrestricted: unsigned arithmetic wraps. (For example, if i is -10, the loop runs 26 times.)

There are two factors you could use to prove the max backedge-taken count. One, the induction variable is marked "nuw". Two, the loop is guarded by an "icmp ult" which restricts the initial value of the induction variable. No patch can work correctly without using one of those bits of information.

fhahn · 2019-09-03T22:33:27Z

I experimented with folding the information from the preheader into the expression to compute the max backedge taken count today. I’ll try to share a patch tomorrow.

llvmbot · 2019-09-04T11:39:36Z

That patch is taking the wrong approach.

The IR should look something like this:

define void @topup(i32* nocapture %a, i64 %i) {
entry:
%cmp3 = icmp ult i64 %i, 16
br i1 %cmp3, label %for.body, label %for.end

for.body:
%i.addr.04 = phi i64 [ %inc, %for.body ], [ %i, %entry ]
%arrayidx = getelementptr inbounds i32, i32* %a, i64 %i.addr.04
store i32 1, i32* %arrayidx, align 4, !tbaa !2
%inc = add nuw nsw i64 %i.addr.04, 1
%exitcond = icmp eq i64 %inc, 16
br i1 %exitcond, label %for.end, label %for.body

for.end:
ret void
}

If you look at the IR, you have a loop which is roughly of the form "for (;
i != 16; ++i)". In general, for loops of that form, the backedge-taken
count is unrestricted: unsigned arithmetic wraps. (For example, if i is
-10, the loop runs 26 times.)

There are two factors you could use to prove the max backedge-taken count.
One, the induction variable is marked "nuw". Two, the loop is guarded by an
"icmp ult" which restricts the initial value of the induction variable. No
patch can work correctly without using one of those bits of information.

Here i is declared as unsigned so max back-edge taken count should be 16 because for any value greater than 16 loops should not get executed and there will not be unsigned wrapping of i. So I think without "nuw" based on the loop guard we should be able to figure out max back edge taken count.

fhahn · 2019-09-04T15:19:38Z

I think we could achieve the desired result by folding information from the loop guards into the expression we use to compute the max backedge taken count. I've put up a initial patch, which just deals with getting a more precise max count https://reviews.llvm.org/D67178

efriedma-quic · 2019-09-04T19:18:12Z

So I think without "nuw" based on the loop guard we should be able to figure < > out max back edge taken count.

I thought that's what I said; you need either the nuw or the loop guard (but not both).

fhahn · 2020-10-08T12:22:58Z

Just a quick update. The loop vectorizer now only uses a vectorization factor of 8 for the example, but the unroller unrolls with vector body 8 times later on

https://godbolt.org/z/MdWWbd

fhahn · 2021-11-27T03:32:52Z

mentioned in issue llvm/llvm-bugzilla-archive#47247

llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021

fhahn mentioned this issue Aug 16, 2022

Excessive loop unrolling #42332

Open

fhahn closed this as completed in 68469a8 Jan 6, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

loop vectorizer unable to reason with trip count bounds #40306

loop vectorizer unable to reason with trip count bounds #40306

llvmbot commented Mar 4, 2019

fhahn commented Mar 4, 2019

hidekisaito commented Mar 4, 2019

llvmbot commented Mar 5, 2019

hidekisaito commented Mar 5, 2019

efriedma-quic commented Mar 5, 2019

hidekisaito commented Mar 5, 2019

llvmbot commented Sep 2, 2019

llvmbot commented Sep 3, 2019

efriedma-quic commented Sep 3, 2019

fhahn commented Sep 3, 2019

llvmbot commented Sep 4, 2019

fhahn commented Sep 4, 2019

efriedma-quic commented Sep 4, 2019

fhahn commented Oct 8, 2020

fhahn commented Nov 27, 2021

loop vectorizer unable to reason with trip count bounds #40306

loop vectorizer unable to reason with trip count bounds #40306

Comments

llvmbot commented Mar 4, 2019

Extended Description

fhahn commented Mar 4, 2019

hidekisaito commented Mar 4, 2019

llvmbot commented Mar 5, 2019

hidekisaito commented Mar 5, 2019

efriedma-quic commented Mar 5, 2019

hidekisaito commented Mar 5, 2019

llvmbot commented Sep 2, 2019

llvmbot commented Sep 3, 2019

efriedma-quic commented Sep 3, 2019

fhahn commented Sep 3, 2019

llvmbot commented Sep 4, 2019

fhahn commented Sep 4, 2019

efriedma-quic commented Sep 4, 2019

fhahn commented Oct 8, 2020

fhahn commented Nov 27, 2021