Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

loop vectorizer unable to reason with trip count bounds #40306

Closed
llvmbot opened this issue Mar 4, 2019 · 15 comments
Closed

loop vectorizer unable to reason with trip count bounds #40306

llvmbot opened this issue Mar 4, 2019 · 15 comments
Labels
bugzilla Issues migrated from bugzilla loopoptim

Comments

@llvmbot
Copy link
Collaborator

llvmbot commented Mar 4, 2019

Bugzilla Link 40961
Version trunk
OS All
Reporter LLVM Bugzilla Contributor
CC @efriedma-quic,@fhahn,@hfinkel,@hidekisaito,@RKSimon

Extended Description

I see bugs 38280, 37423, but this one is simpler still. Consider the following:

void topup(int a[], unsigned long i)
{
    for (; i < 16; i++) {
        a[i] = 1;
    }
}

The result, compiled with -O -march=haswell

  • loop is vectorized despite the small absolute limit on trip count
  • vectorized part has loop step of 256 (!) and is actually unreachable
@fhahn
Copy link
Contributor

fhahn commented Mar 4, 2019

What's happening depends on the target: for x86_64-apple-macos we generate a check if I < 16 and a call to memset_pattern16, for x86_64-unknown-linux we generate a big vectorized version as reported.

I guess the problem is the unknown start value of the induction variable.

@hidekisaito
Copy link
Contributor

This is coming from LoopUtil's getLoopEstimatedTripCount returning "None".

So, some part of the compiler (such as apple-macos calling memset_pattern16) knows trip count of up to 16 is special and other parts (like LV) does not.

I think the notion of estimated trip count information needs to be extended beyond single value.

What can be described by Intel's loopcount pragma may be a good starting point for the discussion: It can represent min/max/ave as well as individual known constant values.

Here, we have 16 or less (no wrap case) ---- or possibly big number (wrap case). I could see why we'd want to distinguish this from i=0;i<N;i++ where N can be an arbitrary big number.

@llvmbot
Copy link
Collaborator Author

llvmbot commented Mar 5, 2019

With simple loops, but where the exact trip count is unknown, it should still be possible to perform an analysis on the initial loop condition: value range of either side of the (a < b) relation can restrict the full range of the type, yielding upper, and lower, limits to the trip count.

I don't know how easy it is to get at...

Upper limit may be useful as a heuristic: loop step of approx. sqrt(limit) to minimize the overall branch count.

@hidekisaito
Copy link
Contributor

What I'm saying is that this is most likely the issue that can affect other loop optimizations as well. So, any solution we come up with should not be just vectorizer only. We better have some discussions with many loop optimization stakeholders ---- in llvm-dev. Would you like me to start a thread?

@efriedma-quic
Copy link
Collaborator

I don't think there's any fundamnental issue to bring up on llvmdev. The vectorizer tries to handle this sort of construct by calling ScalarEvolution::getSmallConstantMaxTripCount(); it just isn't succeeding here for some reason.

@hidekisaito
Copy link
Contributor

Sorry, you are right. I was thinking "i != 16" case.

@llvmbot
Copy link
Collaborator Author

llvmbot commented Sep 2, 2019

I am trying to find out why ScalarEvolution is not able to give correct back edge taken count for an expression.

So in this case flow reaches to howFarToZero() and in that function, I have following expressions as SCEV

Start = (15 + (-1 * %i) (which is set to Distance SCEV)
Step = 1

now, first of all, should I expect Start as ConstantSCEV (15) instead of the whole expression
the problem here is getUnsignedRangeMax(Distance) returns very large number because of -1 in the SCEV.

How we can make this work? Here we can clearly say that after 15 steps this expression will be 0
and thus we have a value for backedge taken count.

@llvmbot
Copy link
Collaborator Author

llvmbot commented Sep 3, 2019

experimental fix
Please look at the proposed fix and if this seems in the correct direction I can send it for review.

@efriedma-quic
Copy link
Collaborator

That patch is taking the wrong approach.

The IR should look something like this:

define void @​topup(i32* nocapture %a, i64 %i) {
entry:
%cmp3 = icmp ult i64 %i, 16
br i1 %cmp3, label %for.body, label %for.end

for.body:
%i.addr.04 = phi i64 [ %inc, %for.body ], [ %i, %entry ]
%arrayidx = getelementptr inbounds i32, i32* %a, i64 %i.addr.04
store i32 1, i32* %arrayidx, align 4, !tbaa !​2
%inc = add nuw nsw i64 %i.addr.04, 1
%exitcond = icmp eq i64 %inc, 16
br i1 %exitcond, label %for.end, label %for.body

for.end:
ret void
}

If you look at the IR, you have a loop which is roughly of the form "for (; i != 16; ++i)". In general, for loops of that form, the backedge-taken count is unrestricted: unsigned arithmetic wraps. (For example, if i is -10, the loop runs 26 times.)

There are two factors you could use to prove the max backedge-taken count. One, the induction variable is marked "nuw". Two, the loop is guarded by an "icmp ult" which restricts the initial value of the induction variable. No patch can work correctly without using one of those bits of information.

@fhahn
Copy link
Contributor

fhahn commented Sep 3, 2019

I experimented with folding the information from the preheader into the expression to compute the max backedge taken count today. I’ll try to share a patch tomorrow.

@llvmbot
Copy link
Collaborator Author

llvmbot commented Sep 4, 2019

That patch is taking the wrong approach.

The IR should look something like this:

define void @​topup(i32* nocapture %a, i64 %i) {
entry:
%cmp3 = icmp ult i64 %i, 16
br i1 %cmp3, label %for.body, label %for.end

for.body:
%i.addr.04 = phi i64 [ %inc, %for.body ], [ %i, %entry ]
%arrayidx = getelementptr inbounds i32, i32* %a, i64 %i.addr.04
store i32 1, i32* %arrayidx, align 4, !tbaa !​2
%inc = add nuw nsw i64 %i.addr.04, 1
%exitcond = icmp eq i64 %inc, 16
br i1 %exitcond, label %for.end, label %for.body

for.end:
ret void
}

If you look at the IR, you have a loop which is roughly of the form "for (;
i != 16; ++i)". In general, for loops of that form, the backedge-taken
count is unrestricted: unsigned arithmetic wraps. (For example, if i is
-10, the loop runs 26 times.)

There are two factors you could use to prove the max backedge-taken count.
One, the induction variable is marked "nuw". Two, the loop is guarded by an
"icmp ult" which restricts the initial value of the induction variable. No
patch can work correctly without using one of those bits of information.

Here i is declared as unsigned so max back-edge taken count should be 16 because for any value greater than 16 loops should not get executed and there will not be unsigned wrapping of i. So I think without "nuw" based on the loop guard we should be able to figure out max back edge taken count.

@fhahn
Copy link
Contributor

fhahn commented Sep 4, 2019

I think we could achieve the desired result by folding information from the loop guards into the expression we use to compute the max backedge taken count. I've put up a initial patch, which just deals with getting a more precise max count https://reviews.llvm.org/D67178

@efriedma-quic
Copy link
Collaborator

So I think without "nuw" based on the loop guard we should be able to figure < > out max back edge taken count.

I thought that's what I said; you need either the nuw or the loop guard (but not both).

@fhahn
Copy link
Contributor

fhahn commented Oct 8, 2020

Just a quick update. The loop vectorizer now only uses a vectorization factor of 8 for the example, but the unroller unrolls with vector body 8 times later on

https://godbolt.org/z/MdWWbd

@fhahn
Copy link
Contributor

fhahn commented Nov 27, 2021

mentioned in issue llvm/llvm-bugzilla-archive#47247

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021
@fhahn fhahn closed this as completed in 68469a8 Jan 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugzilla Issues migrated from bugzilla loopoptim
Projects
None yet
Development

No branches or pull requests

4 participants