
[LoopInterchange] Fix depends() check parameters #77719

Open · wants to merge 2 commits into main
Conversation

@ShivaChen (Collaborator) commented Jan 11, 2024

This commit enables loop-interchange with -enable-loopinterchange for the case in #71519.
With the interchange applied, the loop can be vectorized.

for (int nl = 0; nl < 10000000/256; nl++)
  for (int i = 0; i < 256; ++i)
    for (int j = 1; j < 256; j++)
      aa[j][i] = aa[j - 1][i] + bb[j][i];

The commit addresses the issue that the reversed parameter order of depends() caused the distance between aa[j][i] and aa[j - 1][i] to be -1 instead of 1.

@llvmbot (Collaborator) commented Jan 11, 2024

@llvm/pr-subscribers-llvm-transforms

Author: None (ShivaChen)

Changes

This commit enables loop-interchange with -enable-loopinterchange for the case in #71519.
With the interchange applied, the loop can be vectorized.

for (int nl = 0; nl < 10000000/256; nl++)
  for (int i = 0; i < 256; ++i)
    for (int j = 1; j < 256; j++)
      aa[j][i] = aa[j - 1][i] + bb[j][i];

The commit addresses the issues that:

  1. populateDependencyMatrix determines that aa[j][i] has an output dependency with itself.
  2. The reversed parameter order of depends() caused the distance between aa[j][i] and aa[j - 1][i] to be -1 instead of 1.

Full diff: https://github.com/llvm/llvm-project/pull/77719.diff

2 Files Affected:

  • (modified) llvm/lib/Transforms/Scalar/LoopInterchange.cpp (+3-3)
  • (added) llvm/test/Transforms/LoopInterchange/interchange-s231.ll (+59)
diff --git a/llvm/lib/Transforms/Scalar/LoopInterchange.cpp b/llvm/lib/Transforms/Scalar/LoopInterchange.cpp
index 277f530ee25fc1..d4b78f4f04599c 100644
--- a/llvm/lib/Transforms/Scalar/LoopInterchange.cpp
+++ b/llvm/lib/Transforms/Scalar/LoopInterchange.cpp
@@ -113,10 +113,10 @@ static bool populateDependencyMatrix(CharMatrix &DepMatrix, unsigned Level,
   ValueVector::iterator I, IE, J, JE;
 
   for (I = MemInstr.begin(), IE = MemInstr.end(); I != IE; ++I) {
-    for (J = I, JE = MemInstr.end(); J != JE; ++J) {
+    for (J = I + 1, JE = MemInstr.end(); J != JE; ++J) {
       std::vector<char> Dep;
-      Instruction *Src = cast<Instruction>(*I);
-      Instruction *Dst = cast<Instruction>(*J);
+      Instruction *Src = cast<Instruction>(*J);
+      Instruction *Dst = cast<Instruction>(*I);
       // Ignore Input dependencies.
       if (isa<LoadInst>(Src) && isa<LoadInst>(Dst))
         continue;
diff --git a/llvm/test/Transforms/LoopInterchange/interchange-s231.ll b/llvm/test/Transforms/LoopInterchange/interchange-s231.ll
new file mode 100644
index 00000000000000..5b51d2169fe510
--- /dev/null
+++ b/llvm/test/Transforms/LoopInterchange/interchange-s231.ll
@@ -0,0 +1,59 @@
+; REQUIRES: asserts
+; RUN: opt < %s -passes=loop-interchange -cache-line-size=64 -verify-dom-info -verify-loop-info \
+; RUN:     -S -debug 2>&1 | FileCheck %s
+
+@aa = global [256 x [256 x float]] zeroinitializer, align 64
+@bb = global [256 x [256 x float]] zeroinitializer, align 64
+
+;;  for (int nl = 0; nl < 10000000/256; nl++)
+;;    for (int i = 0; i < 256; ++i)
+;;      for (int j = 1; j < 256; j++)
+;;        aa[j][i] = aa[j - 1][i] + bb[j][i];
+
+; CHECK: Found flow dependency between Src and Dst
+; CHECK:  Src:  store float %add, ptr %arrayidx18, align 4
+; CHECK:  Dst:  %1 = load float, ptr %arrayidx10, align 4
+; CHECK: Processing InnerLoopId = 2 and OuterLoopId = 1
+; CHECK: Loops interchanged.
+
+define float @s231() {
+entry:
+  br label %for.cond1.preheader
+
+; Loop:
+for.cond1.preheader:                              ; preds = %entry, %for.cond.cleanup3
+  %nl.036 = phi i32 [ 0, %entry ], [ %inc23, %for.cond.cleanup3 ]
+  br label %for.cond5.preheader
+
+for.cond.cleanup3:                                ; preds = %for.cond.cleanup7
+  %inc23 = add nuw nsw i32 %nl.036, 1
+  %exitcond41 = icmp ne i32 %inc23, 39062
+  br i1 %exitcond41, label %for.cond1.preheader, label %for.cond.cleanup
+
+for.cond.cleanup7:                                ; preds = %for.body8
+  %indvars.iv.next39 = add nuw nsw i64 %indvars.iv38, 1
+  %exitcond40 = icmp ne i64 %indvars.iv.next39, 256
+  br i1 %exitcond40, label %for.cond5.preheader, label %for.cond.cleanup3
+
+for.body8:                                        ; preds = %for.cond5.preheader, %for.body8
+  %indvars.iv = phi i64 [ 1, %for.cond5.preheader ], [ %indvars.iv.next, %for.body8 ]
+  %0 = add nsw i64 %indvars.iv, -1
+  %arrayidx10 = getelementptr inbounds [256 x [256 x float]], ptr @aa, i64 0, i64 %0, i64 %indvars.iv38
+  %1 = load float, ptr %arrayidx10, align 4
+  %arrayidx14 = getelementptr inbounds [256 x [256 x float]], ptr @bb, i64 0, i64 %indvars.iv, i64 %indvars.iv38
+  %2 = load float, ptr %arrayidx14, align 4
+  %add = fadd fast float %2, %1
+  %arrayidx18 = getelementptr inbounds [256 x [256 x float]], ptr @aa, i64 0, i64 %indvars.iv, i64 %indvars.iv38
+  store float %add, ptr %arrayidx18, align 4
+  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+  %exitcond = icmp ne i64 %indvars.iv.next, 256
+  br i1 %exitcond, label %for.body8, label %for.cond.cleanup7
+
+for.cond5.preheader:                              ; preds = %for.cond1.preheader, %for.cond.cleanup7
+  %indvars.iv38 = phi i64 [ 0, %for.cond1.preheader ], [ %indvars.iv.next39, %for.cond.cleanup7 ]
+  br label %for.body8
+
+; Exit blocks
+for.cond.cleanup:                                 ; preds = %for.cond.cleanup3
+  ret float undef
+}

@Meinersbur (Member) commented:

  1. populateDependencyMatrix MUST check for self-dependencies, i.e. StoreInsts that overwrite memory of previous iterations. These must be kept in-order so the last overwrite is the value in memory when the loop nest finishes.

  2. The order of I/J does not matter because of D->normalize(SE).

@@ -113,10 +113,10 @@ static bool populateDependencyMatrix(CharMatrix &DepMatrix, unsigned Level,
ValueVector::iterator I, IE, J, JE;

for (I = MemInstr.begin(), IE = MemInstr.end(); I != IE; ++I) {
for (J = I, JE = MemInstr.end(); J != JE; ++J) {
for (J = I + 1, JE = MemInstr.end(); J != JE; ++J) {
@Meinersbur (Member):

https://godbolt.org/z/Tacx7M7M7

There is just a single MemInst (a StoreInst) in the loop body, hence I == J is the only check that would prevent the loop being interchanged.

@ShivaChen (Collaborator, Author) replied:

> 1. populateDependencyMatrix MUST check for self-dependencies, i.e. StoreInsts that overwrite memory of previous iterations. These must be kept in-order so the last overwrite is the value in memory when the loop nest finishes.
> 2. The order of I/J does not matter because of D->normalize(SE).

Hi Meinersbur,

Thanks for providing the example in https://godbolt.org/z/Tacx7M7M7; it makes it easier to follow the reason for the store self-check. :-)

  1. It seems the interchange in that case is invalid because the store type is wider than the array element type. Could we detect the type size to prevent the interchange, instead of relying on the output dependency to itself? The loop in s231() doesn't have a store that crosses multiple iterations, yet it is bailed out by the store self-check. PR #77885 ([LoopInterchange] Forbid interchange if load store type wider than element type) addresses this issue.
  2. According to isDirectionNegative(), it will normalize if it finds ">". However, s231 is a 3-level loop nest. There is no dependency carried by the Level 1 loop, so its Direction entry is initialized as "*", which prevents the normalization.

  for (int nl = 0; nl < 10000000/256; nl++) // Level 1
    for (int i = 0; i < 256; ++i)           // Level 2
      for (int j = 1; j < 256; j++)         // Level 3
        aa[j][i] = aa[j-1][i] + bb[j][i];

The index SCEV of aa[j] is {1,+,1}; the index SCEV of aa[j-1] is {0,+,1}.
The depends(aa[j][i], aa[j-1][i]) call site in LoopCacheAnalysis returns the correct distance 1.
The call site in LoopInterchange has depends(aa[j-1][i], aa[j][i]) and returns the distance as -1.
Should LoopInterchange use the same parameter order as LoopCacheAnalysis?

@ShivaChen (Collaborator, Author) commented:


It seems LoopCacheAnalysis only considers whether |distance| fits in a cache line, not the direction. Depending on program order, the opposite case could get the wrong direction:

for (int j = 1; j < M; j++) 
  for (int i = 1; i < N; i++) {
    aa[1][j-1] += bb[i][j];
    cc[i][j] = aa[1][j];
  }

What about making the call site depends(store, load)? Then we can catch the distance of the cross-iteration flow dependency.

depends(aa[j][i], aa[j-1][i]) returns distance 1 for j in s231(), and depends(aa[1][j-1], aa[1][j]) returns distance -1 for j in the case above.

@Meinersbur (Member) commented Jan 22, 2024

> Could we detect the type size to prevent the interchange instead of output dependency to itself? […]

The property that you are looking for is that after a product-order-preserving reordering (as loop interchange is), the last overwrite remains the same. Any previous writes don't matter since they are overwritten. Be sure that you comment on why something is correct in the general case. This patch does not give me confidence that the problem is well understood.

Traditionally, dependence analysis does not assume that a store completely overwrites a previous value; some bits may remain. This is what my example is showing. Removing stores that will be overwritten anyway is what a 'dead store elimination' optimization would do. You could do this by adding an "is completely overwritten" accessor or similar to FullDependence, testing for all required conditions. #77885 is insufficient, as I will explain there.

> It seems LoopCacheAnalysis only consider |distance| to fit in cache line instead of direction. […] What about make call site as depends(store, load)?

LoopCacheAnalysis does not care about the order of the accesses, only whether the values are close enough to fit into the cache or a cache line. E.g., if both accesses are in the same cache line, then the second access will already have been established in the L1 cache and can therefore be served with lower latency. Which of the accesses is first or second does not matter; accessing either one will establish the other in the cache line [1].

Footnotes

  1. LoopCacheAnalysis does not consider cache line boundaries and data alignment, but if two neighboring addresses do not fall into the same cache line, the processor may have prefetched the neighboring cache line. Hence, it should be a good enough heuristic.

@ShivaChen (Collaborator, Author) commented Jan 25, 2024

> Which of the accesses is first/second does not matter, accessing one of them will establish the other in the cache line.

Agreed, LoopCacheAnalysis only needs to consider the absolute value |distance|, so it might not be a good analogue for LoopInterchange, where the direction matters. LoopInterchange should be able to catch the correct direction if the normalization occurs as expected.

@ShivaChen (Collaborator, Author) commented:

Loop Access Analysis (LAA) uses the distance's direction for validation. LAA computes the distance as SinkSCEV - SrcSCEV, while the Dependence Analysis used by loop interchange computes SrcSCEV - DstSCEV (the Dst being the sink). So reversing Dst and Src at the call site would sync with LAA.

Another observation: LAA uses LoopBlocksRPO to traverse blocks in program order, and GCC uses get_loop_body_in_dom_order to traverse blocks in program order. GCC's build_classic_dist_vector keeps the leftmost distance always positive by reversing the direction, recording DDR_REVERSED_P to note whether the vector has been reversed. In gimple-loop-interchange.cc, GCC then consults DDR_REVERSED_P; that is, it uses the original distance's sign to determine validity.

An alternative method would rely on normalize to reverse the distance: #78951.
