-
Notifications
You must be signed in to change notification settings - Fork 10.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SCEV][LV] Add Stride equal to one Predicate to enable strided access versioning #77287
base: main
Are you sure you want to change the base?
Conversation
@llvm/pr-subscribers-llvm-transforms @llvm/pr-subscribers-llvm-analysis Author: None (ShivaChen) ChangesThere is a case in TSVC didn't be vectorized due to the BECount is unknown.
By assuming the stride as one and generating the runtime checking to guard the vectorized loop, it seems the case can be vectorized. Full diff: https://github.com/llvm/llvm-project/pull/77287.diff 2 Files Affected:
diff --git a/llvm/lib/Analysis/ScalarEvolution.cpp b/llvm/lib/Analysis/ScalarEvolution.cpp
index 623814c038a78f..3c712ead953186 100644
--- a/llvm/lib/Analysis/ScalarEvolution.cpp
+++ b/llvm/lib/Analysis/ScalarEvolution.cpp
@@ -12778,10 +12778,23 @@ ScalarEvolution::howManyLessThans(const SCEV *LHS, const SCEV *RHS,
// The positive stride case is the same as isKnownPositive(Stride) returning
// true (original behavior of the function).
//
- if (PredicatedIV || !NoWrap || !loopIsFiniteByAssumption(L) ||
+ if (PredicatedIV || !loopIsFiniteByAssumption(L) ||
!loopHasNoAbnormalExits(L))
return getCouldNotCompute();
+ // Adding Stride equal to one Predicate when there is no wrap flags.
+ // It might enable strided access versioning in LAA and calculate BECount
+ // with Stride = 1.
+ if (!NoWrap) {
+ if (AllowPredicates) {
+ const auto *One =
+ static_cast<const SCEVConstant *>(getOne(Stride->getType()));
+ Predicates.insert(getEqualPredicate(Stride, One));
+ Stride = One;
+ } else
+ return getCouldNotCompute();
+ }
+
if (!isKnownNonZero(Stride)) {
// If we have a step of zero, and RHS isn't invariant in L, we don't know
// if it might eventually be greater than start and if so, on which
diff --git a/llvm/test/Transforms/LoopVectorize/version-mem-access.ll b/llvm/test/Transforms/LoopVectorize/version-mem-access.ll
index 7bf4fbd89b0eea..f1283365ef52a4 100644
--- a/llvm/test/Transforms/LoopVectorize/version-mem-access.ll
+++ b/llvm/test/Transforms/LoopVectorize/version-mem-access.ll
@@ -92,3 +92,55 @@ for.end.loopexit:
for.end:
ret void
}
+
+; We can vectorize the loop by using stride = 1 to calculate iteration count
+; and generate the runtime check to guard the vectorized loop.
+
+; CHECK-LABEL: s172
+; CHECK-DAG: icmp ne i32 %xb, 1
+; CHECK: vector.body
+
+@b = global [32000 x float] zeroinitializer, align 64
+@a = global [32000 x float] zeroinitializer, align 64
+
+; for (int i = xa - 1; i < 32000; i += xb)
+; a[i] += b[i];
+;
+define float @s172(i32 signext %xa, i32 signext %xb) mustprogress {
+entry:
+ %cmp214 = icmp slt i32 %xa, 32001
+ br i1 %cmp214, label %for.body.us.preheader, label %for.cond.cleanup
+
+for.body.us.preheader: ; preds = %entry
+ %sub = add i32 %xa, -1
+ %0 = sext i32 %sub to i64
+ %1 = sext i32 %xb to i64
+ br label %for.body.us
+
+for.body.us: ; preds = %for.body.us.preheader, %for.cond1.for.cond.cleanup3_crit_edge.us
+ %nl.016.us = phi i32 [ %inc.us, %for.cond1.for.cond.cleanup3_crit_edge.us ], [ 0, %for.body.us.preheader ]
+ br label %for.body4.us
+
+for.body4.us: ; preds = %for.body.us, %for.body4.us
+ %indvars.iv = phi i64 [ %0, %for.body.us ], [ %indvars.iv.next, %for.body4.us ]
+ %arrayidx.us = getelementptr inbounds [32000 x float], ptr @b, i64 0, i64 %indvars.iv
+ %2 = load float, ptr %arrayidx.us, align 4
+ %arrayidx6.us = getelementptr inbounds [32000 x float], ptr @a, i64 0, i64 %indvars.iv
+ %3 = load float, ptr %arrayidx6.us, align 4
+ %add.us = fadd fast float %3, %2
+ store float %add.us, ptr %arrayidx6.us, align 4
+ %indvars.iv.next = add i64 %indvars.iv, %1
+ %cmp2.us = icmp slt i64 %indvars.iv.next, 32000
+ br i1 %cmp2.us, label %for.body4.us, label %for.cond1.for.cond.cleanup3_crit_edge.us
+
+for.cond1.for.cond.cleanup3_crit_edge.us: ; preds = %for.body4.us
+ %inc.us = add nuw nsw i32 %nl.016.us, 1
+ %exitcond.not = icmp eq i32 %inc.us, 100000
+ br i1 %exitcond.not, label %for.cond.cleanup.loopexit, label %for.body.us
+
+for.cond.cleanup.loopexit: ; preds = %for.cond1.for.cond.cleanup3_crit_edge.us
+ br label %for.cond.cleanup
+
+for.cond.cleanup: ; preds = %for.cond.cleanup.loopexit, %entry
+ ret float undef
+}
|
Thanks for the patch! |
|
||
; CHECK-LABEL: s172 | ||
; CHECK-DAG: icmp ne i32 %xb, 1 | ||
; CHECK: vector.body |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: the checks are a bit minimal, personally I would like to see a bit more context, but I see there's precedent in this file for just checking for stride 1 compare.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Sjoerdmeijer,
I added more check lines to bring more context.
Thanks for the review. :-)
Looks like a good patch to me, but I will let @fhahn sign off on it. |
… versioning This commit enable the vectorization for the case from llvm#71517. float s172(int xa, int xb) { for (int i = xa - 1; i < 32000; i += xb) a[i] += b[i]; } By assuming the stride as one and generating the runtime checking to guard the vectorized loop, it seems the case can be vectorized.
2fa239c
to
8917ec0
Compare
I have updated the description in the commit and the PR. Thanks for the suggestion! |
Thanks for the patch! This is an interesting issue. One thing I am not sure yet if this has potential to clash with the other stride versioning logic in LAA. For that particular case at hand, I think we may not need to version, as it looks like the wrap flags get dropped during |
This is slightly off-topic for this patch, but GCC has a loop-versioning pass running before vectorisation to deal with these sort of cases, whereas in our case we have some logic sprinkled around to deal with strided accesses. My curiousity and question @fhahn, is if you would see value in a separate loop-versioning pass like GCC? |
This commit enables the vectorization for the case from #71517.
The loop can't be vectorized due to the BECount is unknown.
By assuming the stride as one and generating the runtime checking to guard the vectorized loop, it seems the case can be vectorized.