Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SCEV][LV] Add Stride equal to one Predicate to enable strided access versioning #77287

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

ShivaChen
Copy link
Collaborator

@ShivaChen ShivaChen commented Jan 8, 2024

This commit enables the vectorization for the case from #71517.
The loop can't be vectorized due to the BECount is unknown.

float  s172(int xa, int xb)  {
  for (int i = xa - 1; i < 32000; i += xb)
     a[i] += b[i];
}

By assuming the stride as one and generating the runtime checking to guard the vectorized loop, it seems the case can be vectorized.

@llvmbot
Copy link
Collaborator

llvmbot commented Jan 8, 2024

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-llvm-analysis

Author: None (ShivaChen)

Changes

There is a case in TSVC didn't be vectorized due to the BECount is unknown.

float  s172(int xa, int xb)  {
  for (int i = xa - 1; i &lt; 32000; i += xb)
     a[i] += b[i];
}

By assuming the stride as one and generating the runtime checking to guard the vectorized loop, it seems the case can be vectorized.


Full diff: https://github.com/llvm/llvm-project/pull/77287.diff

2 Files Affected:

  • (modified) llvm/lib/Analysis/ScalarEvolution.cpp (+14-1)
  • (modified) llvm/test/Transforms/LoopVectorize/version-mem-access.ll (+52)
diff --git a/llvm/lib/Analysis/ScalarEvolution.cpp b/llvm/lib/Analysis/ScalarEvolution.cpp
index 623814c038a78f..3c712ead953186 100644
--- a/llvm/lib/Analysis/ScalarEvolution.cpp
+++ b/llvm/lib/Analysis/ScalarEvolution.cpp
@@ -12778,10 +12778,23 @@ ScalarEvolution::howManyLessThans(const SCEV *LHS, const SCEV *RHS,
     // The positive stride case is the same as isKnownPositive(Stride) returning
     // true (original behavior of the function).
     //
-    if (PredicatedIV || !NoWrap || !loopIsFiniteByAssumption(L) ||
+    if (PredicatedIV || !loopIsFiniteByAssumption(L) ||
         !loopHasNoAbnormalExits(L))
       return getCouldNotCompute();
 
+    // Adding Stride equal to one Predicate when there is no wrap flags.
+    // It might enable strided access versioning in LAA and calculate BECount
+    // with Stride = 1.
+    if (!NoWrap) {
+      if (AllowPredicates) {
+        const auto *One =
+            static_cast<const SCEVConstant *>(getOne(Stride->getType()));
+        Predicates.insert(getEqualPredicate(Stride, One));
+        Stride = One;
+      } else
+        return getCouldNotCompute();
+    }
+
     if (!isKnownNonZero(Stride)) {
       // If we have a step of zero, and RHS isn't invariant in L, we don't know
       // if it might eventually be greater than start and if so, on which
diff --git a/llvm/test/Transforms/LoopVectorize/version-mem-access.ll b/llvm/test/Transforms/LoopVectorize/version-mem-access.ll
index 7bf4fbd89b0eea..f1283365ef52a4 100644
--- a/llvm/test/Transforms/LoopVectorize/version-mem-access.ll
+++ b/llvm/test/Transforms/LoopVectorize/version-mem-access.ll
@@ -92,3 +92,55 @@ for.end.loopexit:
 for.end:
   ret void
 }
+
+; We can vectorize the loop by using stride = 1 to calculate iteration count
+; and generate the runtime check to guard the vectorized loop.
+
+; CHECK-LABEL: s172
+; CHECK-DAG: icmp ne i32 %xb, 1
+; CHECK: vector.body
+
+@b = global [32000 x float] zeroinitializer, align 64
+@a = global [32000 x float] zeroinitializer, align 64
+
+; for (int i = xa - 1; i < 32000; i += xb)
+;   a[i] += b[i];
+;
+define float @s172(i32 signext %xa, i32 signext %xb) mustprogress {
+entry:
+  %cmp214 = icmp slt i32 %xa, 32001
+  br i1 %cmp214, label %for.body.us.preheader, label %for.cond.cleanup
+
+for.body.us.preheader:                            ; preds = %entry
+  %sub = add i32 %xa, -1
+  %0 = sext i32 %sub to i64
+  %1 = sext i32 %xb to i64
+  br label %for.body.us
+
+for.body.us:                                      ; preds = %for.body.us.preheader, %for.cond1.for.cond.cleanup3_crit_edge.us
+  %nl.016.us = phi i32 [ %inc.us, %for.cond1.for.cond.cleanup3_crit_edge.us ], [ 0, %for.body.us.preheader ]
+  br label %for.body4.us
+
+for.body4.us:                                     ; preds = %for.body.us, %for.body4.us
+  %indvars.iv = phi i64 [ %0, %for.body.us ], [ %indvars.iv.next, %for.body4.us ]
+  %arrayidx.us = getelementptr inbounds [32000 x float], ptr @b, i64 0, i64 %indvars.iv
+  %2 = load float, ptr %arrayidx.us, align 4
+  %arrayidx6.us = getelementptr inbounds [32000 x float], ptr @a, i64 0, i64 %indvars.iv
+  %3 = load float, ptr %arrayidx6.us, align 4
+  %add.us = fadd fast float %3, %2
+  store float %add.us, ptr %arrayidx6.us, align 4
+  %indvars.iv.next = add i64 %indvars.iv, %1
+  %cmp2.us = icmp slt i64 %indvars.iv.next, 32000
+  br i1 %cmp2.us, label %for.body4.us, label %for.cond1.for.cond.cleanup3_crit_edge.us
+
+for.cond1.for.cond.cleanup3_crit_edge.us:         ; preds = %for.body4.us
+  %inc.us = add nuw nsw i32 %nl.016.us, 1
+  %exitcond.not = icmp eq i32 %inc.us, 100000
+  br i1 %exitcond.not, label %for.cond.cleanup.loopexit, label %for.body.us
+
+for.cond.cleanup.loopexit:                        ; preds = %for.cond1.for.cond.cleanup3_crit_edge.us
+  br label %for.cond.cleanup
+
+for.cond.cleanup:                                 ; preds = %for.cond.cleanup.loopexit, %entry
+  ret float undef
+}

@nikic nikic requested a review from fhahn January 8, 2024 09:10
@sjoerdmeijer
Copy link
Collaborator

Thanks for the patch!
It's worth mentioning (in the description) that this fixes #71517


; CHECK-LABEL: s172
; CHECK-DAG: icmp ne i32 %xb, 1
; CHECK: vector.body
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: the checks are a bit minimal, personally I would like to see a bit more context, but I see there's precedent in this file for just checking for stride 1 compare.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Sjoerdmeijer,

I added more check lines to bring more context.
Thanks for the review. :-)

@sjoerdmeijer
Copy link
Collaborator

Looks like a good patch to me, but I will let @fhahn sign off on it.

… versioning

This commit enable the vectorization for the case from
llvm#71517.

  float  s172(int xa, int xb)  {
    for (int i = xa - 1; i < 32000; i += xb)
           a[i] += b[i];
  }

By assuming the stride as one and generating the runtime checking to guard
the vectorized loop, it seems the case can be vectorized.
@ShivaChen
Copy link
Collaborator Author

Thanks for the patch! It's worth mentioning (in the description) that this fixes #71517

I have updated the description in the commit and the PR. Thanks for the suggestion!

@fhahn
Copy link
Contributor

fhahn commented Jan 16, 2024

Thanks for the patch! This is an interesting issue. One thing I am not sure yet if this has potential to clash with the other stride versioning logic in LAA.

For that particular case at hand, I think we may not need to version, as it looks like the wrap flags get dropped during IndVars before LV, but we may be able to retain the flags. This is something I am currently looking into.

@sjoerdmeijer
Copy link
Collaborator

Thanks for the patch! This is an interesting issue. One thing I am not sure yet if this has potential to clash with the other stride versioning logic in LAA.

For that particular case at hand, I think we may not need to version, as it looks like the wrap flags get dropped during IndVars before LV, but we may be able to retain the flags. This is something I am currently looking into.

This is slightly off-topic for this patch, but GCC has a loop-versioning pass running before vectorisation to deal with these sort of cases, whereas in our case we have some logic sprinkled around to deal with strided accesses. My curiousity and question @fhahn, is if you would see value in a separate loop-versioning pass like GCC?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants