
[LoopInterchange] Fix depends() check parameters #77719

Open · wants to merge 2 commits into main
Conversation

@ShivaChen (Collaborator) commented Jan 11, 2024

This commit enables loop-interchange with -enable-loopinterchange for the case in #71519.
With the interchange applied, the loop can be vectorized.

for (int nl = 0; nl < 10000000/256; nl++)
  for (int i = 0; i < 256; ++i)
    for (int j = 1; j < 256; j++)
      aa[j][i] = aa[j - 1][i] + bb[j][i];

The commit addresses the issue that the reversed parameter order of depends() caused the distance between aa[j][i] and aa[j - 1][i] to be -1 instead of 1.

@llvmbot (Collaborator) commented Jan 11, 2024

@llvm/pr-subscribers-llvm-transforms

Author: None (ShivaChen)

Changes

This commit enables loop-interchange with -enable-loopinterchange for the case in #71519.
With the interchange applied, the loop can be vectorized.

for (int nl = 0; nl < 10000000/256; nl++)
  for (int i = 0; i < 256; ++i)
    for (int j = 1; j < 256; j++)
      aa[j][i] = aa[j - 1][i] + bb[j][i];

The commit addresses the issues that:

  1. populateDependencyMatrix determines that aa[j][i] has an output dependency with itself.
  2. The reversed parameter order of depends() caused the distance between aa[j][i] and aa[j - 1][i] to be -1 instead of 1.

Full diff: https://github.com/llvm/llvm-project/pull/77719.diff

2 Files Affected:

  • (modified) llvm/lib/Transforms/Scalar/LoopInterchange.cpp (+3-3)
  • (added) llvm/test/Transforms/LoopInterchange/interchange-s231.ll (+59)
diff --git a/llvm/lib/Transforms/Scalar/LoopInterchange.cpp b/llvm/lib/Transforms/Scalar/LoopInterchange.cpp
index 277f530ee25fc1..d4b78f4f04599c 100644
--- a/llvm/lib/Transforms/Scalar/LoopInterchange.cpp
+++ b/llvm/lib/Transforms/Scalar/LoopInterchange.cpp
@@ -113,10 +113,10 @@ static bool populateDependencyMatrix(CharMatrix &DepMatrix, unsigned Level,
   ValueVector::iterator I, IE, J, JE;
 
   for (I = MemInstr.begin(), IE = MemInstr.end(); I != IE; ++I) {
-    for (J = I, JE = MemInstr.end(); J != JE; ++J) {
+    for (J = I + 1, JE = MemInstr.end(); J != JE; ++J) {
       std::vector<char> Dep;
-      Instruction *Src = cast<Instruction>(*I);
-      Instruction *Dst = cast<Instruction>(*J);
+      Instruction *Src = cast<Instruction>(*J);
+      Instruction *Dst = cast<Instruction>(*I);
       // Ignore Input dependencies.
       if (isa<LoadInst>(Src) && isa<LoadInst>(Dst))
         continue;
diff --git a/llvm/test/Transforms/LoopInterchange/interchange-s231.ll b/llvm/test/Transforms/LoopInterchange/interchange-s231.ll
new file mode 100644
index 00000000000000..5b51d2169fe510
--- /dev/null
+++ b/llvm/test/Transforms/LoopInterchange/interchange-s231.ll
@@ -0,0 +1,59 @@
+; REQUIRES: asserts
+; RUN: opt < %s -passes=loop-interchange -cache-line-size=64 -verify-dom-info -verify-loop-info \
+; RUN:     -S -debug 2>&1 | FileCheck %s
+
+@aa = global [256 x [256 x float]] zeroinitializer, align 64
+@bb = global [256 x [256 x float]] zeroinitializer, align 64
+
+;;  for (int nl = 0; nl < 10000000/256; nl++)
+;;    for (int i = 0; i < 256; ++i)
+;;      for (int j = 1; j < 256; j++)
+;;        aa[j][i] = aa[j - 1][i] + bb[j][i];
+
+; CHECK: Found flow dependency between Src and Dst
+; CHECK:  Src:  store float %add, ptr %arrayidx18, align 4
+; CHECK:  Dst:  %1 = load float, ptr %arrayidx10, align 4
+; CHECK: Processing InnerLoopId = 2 and OuterLoopId = 1
+; CHECK: Loops interchanged.
+
+define float @s231() {
+entry:
+  br label %for.cond1.preheader
+
+; Loop:
+for.cond1.preheader:                              ; preds = %entry, %for.cond.cleanup3
+  %nl.036 = phi i32 [ 0, %entry ], [ %inc23, %for.cond.cleanup3 ]
+  br label %for.cond5.preheader
+
+for.cond.cleanup3:                                ; preds = %for.cond.cleanup7
+  %inc23 = add nuw nsw i32 %nl.036, 1
+  %exitcond41 = icmp ne i32 %inc23, 39062
+  br i1 %exitcond41, label %for.cond1.preheader, label %for.cond.cleanup
+
+for.cond.cleanup7:                                ; preds = %for.body8
+  %indvars.iv.next39 = add nuw nsw i64 %indvars.iv38, 1
+  %exitcond40 = icmp ne i64 %indvars.iv.next39, 256
+  br i1 %exitcond40, label %for.cond5.preheader, label %for.cond.cleanup3
+
+for.body8:                                        ; preds = %for.cond5.preheader, %for.body8
+  %indvars.iv = phi i64 [ 1, %for.cond5.preheader ], [ %indvars.iv.next, %for.body8 ]
+  %0 = add nsw i64 %indvars.iv, -1
+  %arrayidx10 = getelementptr inbounds [256 x [256 x float]], ptr @aa, i64 0, i64 %0, i64 %indvars.iv38
+  %1 = load float, ptr %arrayidx10, align 4
+  %arrayidx14 = getelementptr inbounds [256 x [256 x float]], ptr @bb, i64 0, i64 %indvars.iv, i64 %indvars.iv38
+  %2 = load float, ptr %arrayidx14, align 4
+  %add = fadd fast float %2, %1
+  %arrayidx18 = getelementptr inbounds [256 x [256 x float]], ptr @aa, i64 0, i64 %indvars.iv, i64 %indvars.iv38
+  store float %add, ptr %arrayidx18, align 4
+  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+  %exitcond = icmp ne i64 %indvars.iv.next, 256
+  br i1 %exitcond, label %for.body8, label %for.cond.cleanup7
+
+for.cond5.preheader:                              ; preds = %for.cond1.preheader, %for.cond.cleanup7
+  %indvars.iv38 = phi i64 [ 0, %for.cond1.preheader ], [ %indvars.iv.next39, %for.cond.cleanup7 ]
+  br label %for.body8
+
+; Exit blocks
+for.cond.cleanup:                                 ; preds = %for.cond.cleanup3
+  ret float undef
+}

@Meinersbur (Member) commented:

  1. populateDependencyMatrix MUST check for self-dependencies, i.e. StoreInsts that overwrite memory of previous iterations. These must be kept in-order so the last overwrite is the value in memory when the loop nest finishes.

  2. The order of I/J does not matter because of D->normalize(SE).

@@ -113,10 +113,10 @@ static bool populateDependencyMatrix(CharMatrix &DepMatrix, unsigned Level,
ValueVector::iterator I, IE, J, JE;

for (I = MemInstr.begin(), IE = MemInstr.end(); I != IE; ++I) {
for (J = I, JE = MemInstr.end(); J != JE; ++J) {
for (J = I + 1, JE = MemInstr.end(); J != JE; ++J) {
@Meinersbur (Member):

https://godbolt.org/z/Tacx7M7M7

There is just a single MemInst (a StoreInst) in the loop body, hence I == J is the only check that would prevent the loop being interchanged.

@ShivaChen (Collaborator, Author) replied:

> 1. populateDependencyMatrix MUST check for self-dependencies, i.e. StoreInsts that overwrite memory of previous iterations. These must be kept in-order so the last overwrite is the value in memory when the loop nest finishes.
> 2. The order of I/J does not matter because of D->normalize(SE).

Hi Meinersbur,

Thanks for providing the example in https://godbolt.org/z/Tacx7M7M7; it makes it easier to follow the reason for the store self-check. :-)

  1. It seems the interchange in that case is invalid because the store type is wider than the array element type. Could we detect the type size to prevent the interchange, instead of relying on the output dependency to itself? The loop in s231() doesn't have a store that crosses multiple iterations, yet it is bailed out by the store self-check. PR #77885 ([LoopInterchange] Forbid interchange if load store type wider than element type) addresses this issue.
  2. According to isDirectionNegative(), it will normalize if it finds ">". However, s231 is a 3-level loop nest. There is no dependency carried by the Level 1 loop, so its Direction entry is initialized as "*", which prevents the normalization.

  for (int nl = 0; nl < 10000000/256; nl++) // Level 1
    for (int i = 0; i < 256; ++i)           // Level 2
      for (int j = 1; j < 256; j++)         // Level 3
        aa[j][i] = aa[j-1][i] + bb[j][i];

The index SCEV of aa[j] is {1,+,1}; the index SCEV of aa[j-1] is {0,+,1}.
The depends(aa[j][i], aa[j-1][i]) call site in LoopCacheAnalysis returns the correct distance 1.
The call site in LoopInterchange has depends(aa[j-1][i], aa[j][i]) and returns the distance as -1.
Should LoopInterchange use the same parameter order as LoopCacheAnalysis?

@ShivaChen (Collaborator, Author) commented:


It seems LoopCacheAnalysis only considers whether |distance| fits in a cache line, not the direction. Depending on program order, the opposite case could get the wrong direction:

for (int j = 1; j < M; j++) 
  for (int i = 1; i < N; i++) {
    aa[1][j-1] += bb[i][j];
    cc[i][j] = aa[1][j];
  }

What about making the call site depends(store, load)? Then we can catch the distance of the cross-iteration flow dependency.

depends(aa[j][i], aa[j-1][i]) returns distance 1 for j in s231(), and depends(aa[1][j-1], aa[1][j]) returns distance -1 for j in the case above.

@Meinersbur (Member) commented Jan 22, 2024

> Could we detect the type size to prevent the interchange instead of output dependency to itself? […]

The property that you are looking for is that after a product-order-preserving reordering (as loop interchange is), the last overwrite remains the same. Any previous writes don't matter since they are overwritten. Be sure that you comment on why something is correct in the general case. This patch does not give me confidence that the problem is well understood.

Traditionally, dependence analysis does not assume that a store completely overwrites a previous value; some bits may remain. This is what my example is showing. Removing stores that will be overwritten anyway is what a 'dead store elimination' optimization would do. You could do this by adding an "is completely overwritten" accessor or similar to FullDependence, testing for all required conditions. #77885 is insufficient, as I will explain there.

> It seems LoopCacheAnalysis only consider |distance| to fit in cache line instead of direction. […] What about make call site as depends(store, load)?

LoopCacheAnalysis does not care about the order of the accesses, only whether the values are close enough to fit into the cache or a cache line. E.g., if both accesses are in the same cache line, then the second access will already have been established in the L1 cache and can therefore be served with lower latency. Which of the accesses is first or second does not matter; accessing either one will establish the other in the cache line [1].

Footnotes

  1. LoopCacheAnalysis does not consider cache line boundaries and data alignment, but if two neighboring addresses do not fall into the same cache line, the processor may have prefetched the neighboring cache line. Hence, it should be a good enough heuristic.

@ShivaChen (Collaborator, Author) commented Jan 25, 2024

> Which of the accesses is first/second does not matter, accessing one of them will establish the other in the cache line.

Agreed, LoopCacheAnalysis only needs to consider the absolute value |distance|, so it might not be a good analogue for LoopInterchange, where the direction matters. LoopInterchange should be able to catch the correct direction if the normalization occurs as expected.

@ShivaChen (Collaborator, Author) commented:

Loop Access Analysis (LAA) uses the distance's direction for validation. LAA computes the distance as SinkSCEV - SrcSCEV, while the Dependence Analysis used by loop interchange computes SrcSCEV - DstSCEV (the Dst being the sink). So reversing Dst and Src at the call site would sync with LAA.

Another observation: LAA uses LoopBlocksRPO to traverse blocks in program order, and GCC uses get_loop_body_in_dom_order to traverse blocks in program order. GCC's build_classic_dist_vector keeps the leftmost distance always positive by reversing the direction, recording DDR_REVERSED_P to note whether the vector has been reversed. In gimple-loop-interchange.cc, GCC then consults DDR_REVERSED_P; that is, it uses the original distance's sign to determine validity.

An alternative method would rely on normalize to reverse the distance: #78951.
