[LoopInterchange] Consider forward/backward dependency in vectorize heuristic #133672

kasuga-fj · 2025-03-31T01:56:22Z

The vectorization heuristic of LoopInterchange attempts to move a vectorizable loop to the innermost position. Before this patch, a loop was deemed vectorizable only if there were no loop-carried dependencies induced by the loop.
This patch extends the vectorization heuristic by introducing the concept of forward and backward dependencies, inspired by LoopAccessAnalysis. Specifically, an additional element is appended to each direction vector to indicate whether it represents a forward dependency (<) or not (*). Only the forward dependencies (i.e., those whose last element is <) are treated specially by the vectorization heuristic: they are not considered to prevent vectorization. The check is conservative, and a dependency is considered forward only when this can be proven. Currently, we only support perfectly nested loops whose body consists of a single basic block. For other cases, dependencies are pessimistically treated as non-forward.
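
To make the encoding concrete, here is a minimal sketch of the check described above. It is not the code from the patch: it assumes each row of the dependency matrix is a string of direction characters followed by one trailing element (`<` for proven-forward, `*` otherwise), and `DepRow` and `canVectorizeSketch` are hypothetical names.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical row layout: one direction character per loop level, followed
// by one trailing element: '<' if every dependency in the row is proven
// forward, '*' otherwise.
using DepRow = std::string;

// Returns true if no row prevents vectorizing the loop at LoopId.
bool canVectorizeSketch(const std::vector<DepRow> &DepMatrix, unsigned LoopId) {
  for (const DepRow &Row : DepMatrix) {
    assert(LoopId + 1 < Row.size() && "LoopId must index a loop level");
    char Dir = Row[LoopId];
    char DepType = Row.back();
    // No dependency is carried by this loop.
    if (Dir == '=' || Dir == 'I')
      continue;
    // Carried, but proven forward: does not block vectorization.
    if (Dir == '<' && DepType == '<')
      continue;
    return false;
  }
  return true;
}
```

With this layout, a row such as `[= < <]` leaves the second loop vectorizable, while `[= < *]` does not.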

llvmbot · 2025-03-31T01:56:52Z

@llvm/pr-subscribers-llvm-transforms

Author: Ryotaro Kasuga (kasuga-fj)

Changes

The vectorization profitability heuristic checks whether a given loop can be vectorized. Because this check is conservative, a loop that can actually be vectorized may be deemed non-vectorizable, which can trigger unnecessary interchanges.
This patch improves the profitability decision by mitigating such misjudgments. Before this patch, we considered a loop to be vectorizable only when there were no loop-carried dependencies with the IV of the loop. However, a loop-carried dependency doesn't prevent vectorization if the distance is positive. This patch makes the vectorization check more accurate by allowing loops with such positive dependencies. Note that it is difficult to decide completely whether a loop can be vectorized; doing so would require checking the vector width and the dependency distance.

Full diff: https://github.com/llvm/llvm-project/pull/133672.diff

2 Files Affected:

(modified) llvm/lib/Transforms/Scalar/LoopInterchange.cpp (+103-25)
(modified) llvm/test/Transforms/LoopInterchange/profitability-vectorization-heuristic.ll (+3-5)

diff --git a/llvm/lib/Transforms/Scalar/LoopInterchange.cpp b/llvm/lib/Transforms/Scalar/LoopInterchange.cpp
index b6b0b7d7a947a..0c3a9cbfeed5f 100644
--- a/llvm/lib/Transforms/Scalar/LoopInterchange.cpp
+++ b/llvm/lib/Transforms/Scalar/LoopInterchange.cpp
@@ -17,8 +17,8 @@
 #include "llvm/ADT/SmallSet.h"
 #include "llvm/ADT/SmallVector.h"
 #include "llvm/ADT/Statistic.h"
+#include "llvm/ADT/StringMap.h"
 #include "llvm/ADT/StringRef.h"
-#include "llvm/ADT/StringSet.h"
 #include "llvm/Analysis/DependenceAnalysis.h"
 #include "llvm/Analysis/LoopCacheAnalysis.h"
 #include "llvm/Analysis/LoopInfo.h"
@@ -80,6 +80,21 @@ enum class RuleTy {
   ForVectorization,
 };
 
+/// Store the information about if corresponding direction vector was negated
+/// by normalization or not. This is necessary to restore the original one from
+/// a row of a dependency matrix because we only manage normalized direction
+/// vectors. Also, duplicate vectors are eliminated, so there may be both
+/// original and negated vectors for a single entry (a row of dependency
+/// matrix). E.g., if there are two direction vectors `[< =]` and `[> =]`, the
+/// later one will be converted to the same as former one by normalization, so
+/// only `[< =]` would be retained in the final result.
+struct NegatedStatus {
+  bool Original = false;
+  bool Negated = false;
+
+  bool isNonNegativeDir(char Dir) const;
+};
+
 } // end anonymous namespace
 
 // Minimum loop depth supported.
@@ -126,9 +141,10 @@ static void printDepMatrix(CharMatrix &DepMatrix) {
 }
 #endif
 
-static bool populateDependencyMatrix(CharMatrix &DepMatrix, unsigned Level,
-                                     Loop *L, DependenceInfo *DI,
-                                     ScalarEvolution *SE,
+static bool populateDependencyMatrix(CharMatrix &DepMatrix,
+                                     std::vector<NegatedStatus> &NegStatusVec,
+                                     unsigned Level, Loop *L,
+                                     DependenceInfo *DI, ScalarEvolution *SE,
                                      OptimizationRemarkEmitter *ORE) {
   using ValueVector = SmallVector<Value *, 16>;
 
@@ -167,7 +183,9 @@ static bool populateDependencyMatrix(CharMatrix &DepMatrix, unsigned Level,
     return false;
   }
   ValueVector::iterator I, IE, J, JE;
-  StringSet<> Seen;
+
+  // Manage all found direction vectors. and map it to the index of DepMatrix.
+  StringMap<unsigned> Seen;
 
   for (I = MemInstr.begin(), IE = MemInstr.end(); I != IE; ++I) {
     for (J = I, JE = MemInstr.end(); J != JE; ++J) {
@@ -182,7 +200,8 @@ static bool populateDependencyMatrix(CharMatrix &DepMatrix, unsigned Level,
         assert(D->isOrdered() && "Expected an output, flow or anti dep.");
         // If the direction vector is negative, normalize it to
         // make it non-negative.
-        if (D->normalize(SE))
+        bool Normalized = D->normalize(SE);
+        if (Normalized)
           LLVM_DEBUG(dbgs() << "Negative dependence vector normalized.\n");
         LLVM_DEBUG(StringRef DepType =
                        D->isFlow() ? "flow" : D->isAnti() ? "anti" : "output";
@@ -214,8 +233,17 @@ static bool populateDependencyMatrix(CharMatrix &DepMatrix, unsigned Level,
         }
 
         // Make sure we only add unique entries to the dependency matrix.
-        if (Seen.insert(StringRef(Dep.data(), Dep.size())).second)
+        unsigned Index = DepMatrix.size();
+        auto [Ite, Inserted] =
+            Seen.try_emplace(StringRef(Dep.data(), Dep.size()), Index);
+        if (Inserted) {
           DepMatrix.push_back(Dep);
+          NegStatusVec.push_back(NegatedStatus{});
+        } else
+          Index = Ite->second;
+
+        NegatedStatus &Status = NegStatusVec[Index];
+        (Normalized ? Status.Negated : Status.Original) = true;
       }
     }
   }
@@ -400,6 +428,7 @@ class LoopInterchangeProfitability {
   bool isProfitable(const Loop *InnerLoop, const Loop *OuterLoop,
                     unsigned InnerLoopId, unsigned OuterLoopId,
                     CharMatrix &DepMatrix,
+                    const std::vector<NegatedStatus> &NegStatusVec,
                     const DenseMap<const Loop *, unsigned> &CostMap,
                     std::unique_ptr<CacheCost> &CC);
 
@@ -409,9 +438,10 @@ class LoopInterchangeProfitability {
       const DenseMap<const Loop *, unsigned> &CostMap,
       std::unique_ptr<CacheCost> &CC);
   std::optional<bool> isProfitablePerInstrOrderCost();
-  std::optional<bool> isProfitableForVectorization(unsigned InnerLoopId,
-                                                   unsigned OuterLoopId,
-                                                   CharMatrix &DepMatrix);
+  std::optional<bool>
+  isProfitableForVectorization(unsigned InnerLoopId, unsigned OuterLoopId,
+                               CharMatrix &DepMatrix,
+                               const std::vector<NegatedStatus> &NegStatusVec);
   Loop *OuterLoop;
   Loop *InnerLoop;
 
@@ -503,8 +533,9 @@ struct LoopInterchange {
                       << "\n");
 
     CharMatrix DependencyMatrix;
+    std::vector<NegatedStatus> NegStatusVec;
     Loop *OuterMostLoop = *(LoopList.begin());
-    if (!populateDependencyMatrix(DependencyMatrix, LoopNestDepth,
+    if (!populateDependencyMatrix(DependencyMatrix, NegStatusVec, LoopNestDepth,
                                   OuterMostLoop, DI, SE, ORE)) {
       LLVM_DEBUG(dbgs() << "Populating dependency matrix failed\n");
       return false;
@@ -543,8 +574,8 @@ struct LoopInterchange {
     for (unsigned j = SelecLoopId; j > 0; j--) {
       bool ChangedPerIter = false;
       for (unsigned i = SelecLoopId; i > SelecLoopId - j; i--) {
-        bool Interchanged =
-            processLoop(LoopList, i, i - 1, DependencyMatrix, CostMap);
+        bool Interchanged = processLoop(LoopList, i, i - 1, DependencyMatrix,
+                                        NegStatusVec, CostMap);
         ChangedPerIter |= Interchanged;
         Changed |= Interchanged;
       }
@@ -559,6 +590,8 @@ struct LoopInterchange {
   bool processLoop(SmallVectorImpl<Loop *> &LoopList, unsigned InnerLoopId,
                    unsigned OuterLoopId,
                    std::vector<std::vector<char>> &DependencyMatrix,
+
+                   const std::vector<NegatedStatus> &NegStatusVec,
                    const DenseMap<const Loop *, unsigned> &CostMap) {
     Loop *OuterLoop = LoopList[OuterLoopId];
     Loop *InnerLoop = LoopList[InnerLoopId];
@@ -572,7 +605,7 @@ struct LoopInterchange {
     LLVM_DEBUG(dbgs() << "Loops are legal to interchange\n");
     LoopInterchangeProfitability LIP(OuterLoop, InnerLoop, SE, ORE);
     if (!LIP.isProfitable(InnerLoop, OuterLoop, InnerLoopId, OuterLoopId,
-                          DependencyMatrix, CostMap, CC)) {
+                          DependencyMatrix, NegStatusVec, CostMap, CC)) {
       LLVM_DEBUG(dbgs() << "Interchanging loops not profitable.\n");
       return false;
     }
@@ -1197,27 +1230,71 @@ LoopInterchangeProfitability::isProfitablePerInstrOrderCost() {
   return std::nullopt;
 }
 
+static char flipDirection(char Dir) {
+  switch (Dir) {
+  case '<':
+    return '>';
+  case '>':
+    return '<';
+  case '=':
+  case 'I':
+  case '*':
+    return Dir;
+  default:
+    llvm_unreachable("Unknown direction");
+  }
+}
+
+/// Ensure that there are no negative direction dependencies corresponding to \p
+/// Dir.
+bool NegatedStatus::isNonNegativeDir(char Dir) const {
+  assert((Original || Negated) && "Cannot restore the original direction");
+
+  // If both flag is true, it means that there is both as-is and negated
+  // direction. In this case only `=` or `I` don't have negative direction
+  // dependency.
+  if (Original && Negated)
+    return Dir == '=' || Dir == 'I';
+
+  char Restored = Negated ? flipDirection(Dir) : Dir;
+  return Restored == '=' || Restored == 'I' || Restored == '<';
+}
+
 /// Return true if we can vectorize the loop specified by \p LoopId.
-static bool canVectorize(const CharMatrix &DepMatrix, unsigned LoopId) {
+static bool canVectorize(const CharMatrix &DepMatrix,
+                         const std::vector<NegatedStatus> &NegStatusVec,
+                         unsigned LoopId) {
+  // The loop can be vectorized if there are no negative dependencies. Consider
+  // the dependency of `j` in the following example.
+  //
+  //   Positive: ... = A[i][j]       Negative: ... = A[i][j-1]
+  //             A[i][j-1] = ...               A[i][j] = ...
+  //
+  // In the right case, vectorizing the loop can change the loaded value from
+  // `A[i][j-1]`. At the moment we don't take into account the distance of the
+  // dependency and vector width.
+  // TODO: Considering the dependency distance and the vector width can give a
+  // more accurate result. For example, the following loop can be vectorized if
+  // the vector width is less than or equal to 4 x sizeof(A[0][0]).
   for (unsigned I = 0; I != DepMatrix.size(); I++) {
     char Dir = DepMatrix[I][LoopId];
-    if (Dir != 'I' && Dir != '=')
+    if (!NegStatusVec[I].isNonNegativeDir(Dir))
       return false;
   }
   return true;
 }
 
 std::optional<bool> LoopInterchangeProfitability::isProfitableForVectorization(
-    unsigned InnerLoopId, unsigned OuterLoopId, CharMatrix &DepMatrix) {
-  // If the outer loop is not loop independent it is not profitable to move
-  // this to inner position, since doing so would not enable inner loop
-  // parallelism.
-  if (!canVectorize(DepMatrix, OuterLoopId))
+    unsigned InnerLoopId, unsigned OuterLoopId, CharMatrix &DepMatrix,
+    const std::vector<NegatedStatus> &NegStatusVec) {
+  // If the outer loop cannot be vectorized, it is not profitable to move this
+  // to inner position.
+  if (!canVectorize(DepMatrix, NegStatusVec, OuterLoopId))
     return false;
 
-  // If inner loop has dependence and outer loop is loop independent then it is
+  // If inner loop cannot be vectorized and outer loop can be then it is
   // profitable to interchange to enable inner loop parallelism.
-  if (!canVectorize(DepMatrix, InnerLoopId))
+  if (!canVectorize(DepMatrix, NegStatusVec, InnerLoopId))
     return true;
 
   // TODO: Estimate the cost of vectorized loop body when both the outer and the
@@ -1228,6 +1305,7 @@ std::optional<bool> LoopInterchangeProfitability::isProfitableForVectorization(
 bool LoopInterchangeProfitability::isProfitable(
     const Loop *InnerLoop, const Loop *OuterLoop, unsigned InnerLoopId,
     unsigned OuterLoopId, CharMatrix &DepMatrix,
+    const std::vector<NegatedStatus> &NegStatusVec,
     const DenseMap<const Loop *, unsigned> &CostMap,
     std::unique_ptr<CacheCost> &CC) {
   // isProfitable() is structured to avoid endless loop interchange. If the
@@ -1249,8 +1327,8 @@ bool LoopInterchangeProfitability::isProfitable(
       shouldInterchange = isProfitablePerInstrOrderCost();
       break;
     case RuleTy::ForVectorization:
-      shouldInterchange =
-          isProfitableForVectorization(InnerLoopId, OuterLoopId, DepMatrix);
+      shouldInterchange = isProfitableForVectorization(InnerLoopId, OuterLoopId,
+                                                       DepMatrix, NegStatusVec);
       break;
     }
 
diff --git a/llvm/test/Transforms/LoopInterchange/profitability-vectorization-heuristic.ll b/llvm/test/Transforms/LoopInterchange/profitability-vectorization-heuristic.ll
index b82dd5141a6b2..b83b6b37a6eda 100644
--- a/llvm/test/Transforms/LoopInterchange/profitability-vectorization-heuristic.ll
+++ b/llvm/test/Transforms/LoopInterchange/profitability-vectorization-heuristic.ll
@@ -64,15 +64,13 @@ exit:
 ;   for (int j = 1; j < 256; j++)
 ;     A[i][j-1] = A[i][j] + B[i][j];
 ;
-; FIXME: These loops are exchanged at this time due to the problem of
-; profitablity heuristic for vectorization.
 
-; CHECK:      --- !Passed
+; CHECK:      --- !Missed
 ; CHECK-NEXT: Pass:            loop-interchange
-; CHECK-NEXT: Name:            Interchanged
+; CHECK-NEXT: Name:            InterchangeNotProfitable
 ; CHECK-NEXT: Function:        interchange_unnecesasry_for_vectorization
 ; CHECK-NEXT: Args:
-; CHECK-NEXT:   - String:          Loop interchanged with enclosing loop.
+; CHECK-NEXT:   - String:          Insufficient information to calculate the cost of loop for interchange.
 define void @interchange_unnecesasry_for_vectorization() {
 entry:
   br label %for.i.header

kasuga-fj · 2025-03-31T08:59:43Z

Depends on #133665 and #133667

sjoerdmeijer · 2025-04-02T12:47:12Z

llvm/lib/Transforms/Scalar/LoopInterchange.cpp

  ForVectorization,
 };

+/// Store the information about if corresponding direction vector was negated


Before I keep reading the rest of this patch, just wanted to share this first question that I had. I was initially a bit confused about this, and was wondering why we need 2 booleans and 4 states if a direction vector's negated status can only be true or false. But I now guess that the complication here is the unique entries in the dependency matrix, is that right? If that is the case, then I am wondering if it isn't easier to keep all the entries and don't make them unique? Making them unique was a little optimisation that I added recently because I thought that would help, but if this is now complicating things and we need to do all sorts of gymnastics we might as well keep all entries.

But I now guess that the complication here is the unique entries in the dependency matrix, is that right?

Yes. (But holding two boolean values is a bit redundant. What is actually needed are three states. If both of them are false, it is an illegal state.)

I am wondering if it isn't easier to keep all the entries and don't make them unique?

I think it would be simpler. Also, there is no need to stop making entries unique altogether. If duplicate direction vectors are allowed, I think the simplest implementation would be to keep pairs of a direction vector and a boolean value indicating whether the corresponding vector is negated. However, I'm not sure how effective it is to make direction vectors unique. In the worst case, holding pairs of a vector and a boolean value instead of a single vector doubles the number of entries. Is this allowed?

I think duplicated direction vectors are always allowed. They don't add new or different information, so it shouldn't affect the interpretation of the dependence analysis in any way. The only thing that it affects is processing the same information again and again, so the only benefit of making them unique is to avoid that. But if keeping all entries makes the logic easier, there is a good reason to not make them unique. I think adding all the state here complicates things, and if a simple map of original to negated helps, you've certainly got my vote to simplify this.

I think duplicated direction vectors are always allowed. They don't add new or different information, so it shouldn't affect the interpretation of the dependence analysis in any way. The only thing that it affects is processing the same information again and again, so the only benefit of making them unique is to avoid that.

I agree with this, and am only concerned about the compile time degradation. So I will try to compare the difference of the number of entries with or without making them unique. Thank you for your opinion!

I did a quick check using llvm-test-suite, and found that using (direction vector, boolean value) pairs as unique keys makes little difference in the number of entries (although in some cases the number of entries increased enormously if I stopped making them unique). So I changed the patch to keep the negated and non-negated vectors separate.
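
For readers following along, keying uniqueness on the (direction vector, negated) pair could look roughly like the sketch below. This is an illustration under assumptions, not the code that landed: `DepEntry` and `addUnique` are hypothetical names, and plain standard-library containers stand in for the LLVM ADTs.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical entry: a normalized direction vector plus a flag recording
// whether normalization negated it.
struct DepEntry {
  std::string Dir;
  bool Negated;
};

using SeenMap = std::map<std::pair<std::string, bool>, unsigned>;

// Appends Entry only if the same (Dir, Negated) pair was not seen before.
// Returns the index of the row that represents Entry.
unsigned addUnique(std::vector<DepEntry> &Matrix, SeenMap &Seen,
                   const DepEntry &Entry) {
  unsigned Index = static_cast<unsigned>(Matrix.size());
  auto Result = Seen.insert(
      std::make_pair(std::make_pair(Entry.Dir, Entry.Negated), Index));
  if (Result.second)
    Matrix.push_back(Entry);
  return Result.first->second;
}
```

A vector that occurs both as-is and negated then occupies two rows, so each row has exactly one negation status and no multi-state flag is needed.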

Meinersbur

The PR is described as allowing some loop-carried dependencies, but the only test case is fixing a false-positive which is rejected because "Insufficient information to calculate the cost of loop for interchange". Can you add a positive test with such a positive dependence distance that can now be detected?

sjoerdmeijer · 2025-04-07T19:15:03Z

(I am away from keyboard for 1.5 weeks and can pick this up later, but it's a good thing Michael is looking now too.)

kasuga-fj · 2025-04-08T15:05:15Z

Can you add a positive test with such a positive dependence distance that can now be detected?

That makes sense, added it.

kasuga-fj · 2025-05-29T12:55:56Z

gentle ping

Meinersbur

Whether a loop with loop-carried positive dependence distance is vectorizable is unfortunately more complicated.

First, dependencies are either always positive or the dependency is carried by a surrounding loop (legality check).
Then, the dependence distance has to be either larger than the vector size (which we cannot determine) OR it has to be "lexically forward" (which is difficult to define). It looks like your patch tries to determine "lexically forward" from whether the dependence has been normalized. This derives from the order in which the instructions are iterated over. This is OK within a BB, but BBs themselves can be ordered arbitrarily, and their order does not indicate actual execution order. I think the test should be more conservative in the case of control flow in the loop body. Also, please add comments about the intent to match LAA's forward dependency.

Meinersbur · 2025-06-04T11:49:58Z

llvm/lib/Transforms/Scalar/LoopInterchange.cpp

+  //   Positive: ... = A[i][j]       Negative: ... = A[i][j-1]
+  //             A[i][j-1] = ...               A[i][j] = ...


[serious] Both of these have positive dependence distance, just the first is a WAR (anti-)dependence, the second is a RAW (flow-)dependence.

The i-loop variable is irrelevant here since always the same, so the dependence distance can only be non-negative.

I think I'm confusing the terms "positive/negative" and "forward/backward". IIUC, I meant to say here that a forward RAW dependence doesn't prevent vectorization.

positive/negative refers to the difference of (normalized) loop induction values (i, j)

forward/backward refers to the execution order of statements within the loop body. A forward dependence is just the normal execution flow, e.g.

for (int i = 1; i < n; ++i) {
  A[i-1] = 42; // A[i] would still be a read-after-write forward dependency
  use(A[i]);
}

A backward dependence is when the source of a dependence is a statement that is located after the destination in the loop body, necessarily from a previous iteration:

for (int i = 1; i < n; ++i) {
  use(A[i]);
  A[i-1] = 42; // A[i] would make this a write-after-read forward dependency
}

In the polyhedral model one just assigns numbers to the sequence of statements in the loop which allows doing calculations over statement order as if it was another loop:

for (int i = 0; i < n; ++i) {
  for (int k = 0; k < 2; ++k) {
    switch (k) {
    case 0: use(A[i]); break;
    case 1: A[i-1] = 42; break;
    }
  }
}

In this view, a forward dependency is when the innermost distance vector element (i.e. k) is positive, and a backward dependency is when the innermost dependence vector element is negative. I find this view helpful.
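
As a small illustration of this view (a sketch under assumptions, not anything from the patch), the statement position can be treated as an extra innermost dimension and the sign of its distance read off directly. `DepKind`, `executesBefore` and `classify` are hypothetical names, and the body is assumed to be straight-line code.

```cpp
#include <cassert>

enum class DepKind { Forward, Backward, SameStatement };

// Returns true if (IterDistance, StmtDistance) is lexicographically positive,
// i.e. the source really executes before the destination.
bool executesBefore(int IterDistance, int StmtDistance) {
  if (IterDistance != 0)
    return IterDistance > 0;
  return StmtDistance > 0;
}

// IterDistance is the distance in the real loop dimension (i). SrcStmt and
// DstStmt are the statement positions (k) within the straight-line body.
DepKind classify(int IterDistance, int SrcStmt, int DstStmt) {
  int StmtDistance = DstStmt - SrcStmt;
  assert(executesBefore(IterDistance, StmtDistance) &&
         "source must execute before destination");
  if (StmtDistance > 0)
    return DepKind::Forward;
  if (StmtDistance < 0)
    return DepKind::Backward;
  return DepKind::SameStatement;
}
```

A dependency whose statement distance is zero (a statement depending on itself in an earlier iteration) is reported separately here, since forward/backward is not meaningful for it.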

As mentioned, the execution order of statements is ambiguous if the body is not straight-line code. I had lengthy discussions on the OpenMP language committee about it. For instance, assume an if/else construct:

for (int i = 0; i < n; ++i) {
  if (i%2==0)
    use(A[i]);
  else
    A[i-1] = 42;
}

Is this a forward or backward dependency? It is kind of ill-defined because within the same body iteration, only one of the statements is ever executed, so there is no order between them. This becomes clearer if you consider that the following has the very same semantics:

for (int i = 0; i < n; ++i) {
  if (i%2!=0)
    A[i-1] = 42;
  else
    use(A[i]);
}

So it does not matter? Well it does if you vectorize using predicated instructions. You can vectorize the latter, with simd width of 2 (and more if you allow store-to-load forwarding), but you cannot vectorize the former, at least not by just replacing every statement with its vector equivalent. If you want to keep it simple, only consider straight-line code within a single BB.
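
The claim that the two if/else forms have the same scalar semantics can be checked with a small harness. This is only a sketch: `runOriginal` and `runSwapped` are hypothetical names, and `use(A[i])` is modeled by recording the observed value.

```cpp
#include <cassert>
#include <vector>

// First form: if (i % 2 == 0) use(A[i]); else A[i-1] = 42;
// Observed collects the values passed to the stand-in for use().
std::vector<int> runOriginal(std::vector<int> A, std::vector<int> &Observed) {
  for (int i = 0; i < static_cast<int>(A.size()); ++i) {
    if (i % 2 == 0)
      Observed.push_back(A[i]);
    else
      A[i - 1] = 42;
  }
  return A;
}

// Second form: if (i % 2 != 0) A[i-1] = 42; else use(A[i]);
std::vector<int> runSwapped(std::vector<int> A, std::vector<int> &Observed) {
  for (int i = 0; i < static_cast<int>(A.size()); ++i) {
    if (i % 2 != 0)
      A[i - 1] = 42;
    else
      Observed.push_back(A[i]);
  }
  return A;
}
```

Both functions produce the same final array and the same sequence of observed values, so the textual order of the two branches carries no semantic information by itself.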

The entire forward/backward nomenclature also breaks down if you allow non-perfectly nested loops.

I also dislike calling those "lexically" forward/backward. I can use e.g. gotos to change the lexical order of statements:

for (int i = 1; i < n; ++i) {
  goto T;
S:
  A[i] = 42;
  goto NEXT;
T:
  use(A[i]);
  goto S;
NEXT:;
}

Thanks a lot for the really thorough explanation! It took me some time to fully understand it, but I believe I got what you meant.

In the polyhedral model one just assigns numbers to the sequence of statements in the loop which allows doing calculations over statement order as if it was another loop:

for (int i = 0; i < n; ++i) {
  for (int k = 0; k < 2; ++k) {
    switch (k) {
    case 0: use(A[i]); break;
    case 1: A[i-1] = 42; break;
    }
  }
}

In this view, a forward dependency is when the innermost distance vector element (i.e. k) is positive, and a backward dependency is when the innermost dependence vector element is negative. I find this view helpful.

Yes, that made it much clearer. Thanks! And thanks to this, I finally see why you recommended to represent the information about whether a dependency is forward or not by the last element of the direction vector.

The entire forward/backward nomenclature also breaks down if you allow non-perfectly nested loops.

I also dislike calling those "lexically" forward/backward. I can use e.g. gotos to change the lexical order of statements:

for (int i = 1; i < n; ++i) {
  goto T;
S:
  A[i] = 42;
  goto NEXT;
T:
  use(A[i]);
  goto S;
NEXT:;
}

(This is probably off-topic, but seeing this made me realize that I was almost confusing the "lexicographical order" in the context of direction vectors with the "lexical forward/backward" in LAA.)

(This is probably off-topic, but seeing this made me realize that I was almost confusing the "lexicographical order" in the context of direction vectors with the "lexical forward/backward" in LAA.)

They are related:

lexicographic: Order within a dictionary

lexical (in opposition to semantical): Order within a text (not well established, confusingly also used as synonym of lexicographic, but "lexicographically forward/backward" makes no sense, as if we would sort by variable names).

Maybe I'm biased, but when I hear "dictionary order", I kind of imagine the order on a Cartesian product...

Meinersbur · 2025-06-04T12:44:02Z

Proposal: Instead of duplicating all dependencies, use the flag to mean "all dependencies of this vector are forward dependencies". It is reset whenever a dependency does not guarantee that the Dst is executed after Src (in the same loop iteration). This includes if Dst and Src are in different basic blocks (unless you can prove dominance).

You could go as far as to encode it as < and * in another trailing element of the dependence vector.
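
One way to read this proposal is sketched below. This is an interpretation under assumptions, not the implementation: `addDependence` is a hypothetical name, rows are plain strings, and the trailing element starts as `<` and is downgraded to `*` as soon as one dependency with the same direction vector is not proven forward.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Adds a direction vector (given without the trailing element) to the
// matrix. The stored row keeps a trailing '<' only while every dependency
// with that vector is proven forward; otherwise it becomes '*'.
void addDependence(std::vector<std::string> &Matrix,
                   std::map<std::string, unsigned> &Seen,
                   const std::string &Dir, bool IsForward) {
  unsigned Index = static_cast<unsigned>(Matrix.size());
  auto Result = Seen.insert(std::make_pair(Dir, Index));
  if (Result.second) {
    // First occurrence: create the row with its initial trailing element.
    Matrix.push_back(Dir + (IsForward ? '<' : '*'));
    return;
  }
  // Duplicate: a single non-forward dependency resets the flag for the row.
  if (!IsForward)
    Matrix[Result.first->second].back() = '*';
}
```

Rows stay unique this way, and the flag never moves back from `*` to `<`, which keeps the check conservative.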

kasuga-fj · 2025-06-04T13:49:39Z

Oops! You're right, I only considered the execution order within a BB and didn't take the order of BBs into account. Thanks for the detailed explanation, it is very helpful.

Proposal: Instead of duplicating all dependencies, use the flag to mean "all dependencies of this vector are forward dependencies". It is reset whenever a dependency does not guarantee that the Dst is executed after Src (in the same loop iteration). This includes if Dst and Src are in different basic blocks (unless you can prove dominance).

You could go as far as to encode it as < and * in another trailing element of the dependence vector.

Thank you, this proposal makes sense to me. For now, I'll take a closer look at LAA and reconsider my approach.

…fitable-vectorization

llvm/lib/Transforms/Scalar/LoopInterchange.cpp

kasuga-fj · 2025-06-06T11:31:40Z

llvm/lib/Transforms/Scalar/LoopInterchange.cpp

+    // If both Dir and DepType are '<', it means that the all dependencies are
+    // lexically forward. Such dependencies don't prevent vectorization.
+    if (Dir == '<' && DepType == '<')
+      continue;


A similar fact holds when Dir is > and all dependencies are lexically backward? (even if this is true, I don't intend to address it in this PR).

no; that's actually impossible. D->normalize(SE) would have reversed it.

I intend to reverse the last element at the same time. Even so, is it still impossible?

When Dir is >, it is reversed by ->normalize() independently of the last element that DependenceAnalysis does not even know about.

I was considering a case where the original dependence vector is something like [> <] (which will be normalized to [< >]). In this case, representing a backward dependency like [< > >] instead of [< > *] looked reasonable to me in some situations, but I couldn't come up with any particularly useful examples...

llvm/lib/Transforms/Scalar/LoopInterchange.cpp

Meinersbur · 2025-07-14T13:23:59Z

llvm/lib/Transforms/Scalar/LoopInterchange.cpp

+    // If both Dir and DepType are '<', it means that the all dependencies are
+    // lexically forward. Such dependencies don't prevent vectorization.
+    if (Dir == '<' && DepType == '<')
+      continue;


no; that's actually impossible. D->normalize(SE) would have reversed it.

Meinersbur · 2025-07-14T14:46:41Z

llvm/lib/Transforms/Scalar/LoopInterchange.cpp

+        bool Normalized = D->normalize(SE);
+        if (Normalized) {
          LLVM_DEBUG(dbgs() << "Negative dependence vector normalized.\n");
+          IsForward = false;


There are some deduction steps needed for this, please add some explanation into a comment:

1. If Src and Dst are not in the same BB, line 207 will consider it "not forward"

2. If Src and Dst are in the same BB, DI->depends will be called with Src being the first, Dst the second in the BB due to how we iterate over the instructions, i.e. assuming a forward dependency
2a. dependence vector is positive: assumption was true
2b. dependence vector is negative: If it actually is not a forward dependency, the dependence vector will be negative, and D->normalize reverses the dependency (to make the dependence vector positive). That is, it becomes a backward dependence and D->normalize returns true.
2c. If the dependency vector is 0, i.e. the dependency is not loop carried, D->normalize will not reverse the dependency. Because we called DI->depends with execution order of Src/Dst, we have a forward dependency

3. If Src==Dst (e.g. a single StoreInst depending on itself from a previous iteration, a WAW-dependency), the concept of forward/backward dependency is ill-defined. I think we should optimistically assume a forward dependency
3a. dependence vector is positive: DI->depends(Src, Dst) probably can only return a positive dependence vector(?) that does not need to be normalized
3b. dependence vector is negative: probably cannot happen as discussed (add assertion?)
3c. dependence vector is zero: By atomicity of an instruction, cannot happen
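
The cases above collapse into a very small decision, sketched here under assumptions. `isForwardSketch` is a hypothetical helper: `SameBlock` stands for "Src and Dst are in the same basic block, with Src visited first or Src == Dst", and `Normalized` stands for the return value of `D->normalize(SE)`.

```cpp
#include <cassert>

// Hypothetical summary of the deduction steps:
//   case 1:        different blocks         -> not forward
//   cases 2a/2c/3a: same block, not reversed -> forward
//   case 2b:       same block, reversed     -> not forward
bool isForwardSketch(bool SameBlock, bool Normalized) {
  if (!SameBlock)
    return false;
  return !Normalized;
}
```

Cases 3b and 3c are expected not to occur, so they do not need a branch of their own in this sketch.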

Okay, I'll add comments.

3. If Src==Dst (e.g. two StoreInst of a WAW-dependency), the concept of forward/backward dependency is ill-defined. I think we should optimistically assume a forward dependency
3a. dependence vector is positive: DI->depends(Src, Dst) probably can only return a positive dependence vector(?) that does not need to be normalized
3b. dependence vector is negative: probably cannot happen as discussed (add assertion?)
3c. dependence vector is zero: By atomicity of an instruction, cannot happen

I believe 3b cannot happen.

This is a bit of a tangent, but seeing this reminded me of something. Recently, I’ve been thinking that maybe [* >] should actually be normalized to [* <] (if doing so, I think 3b can happen). If you don’t mind, I'd like to hear what you think about it.

Assuming > here means the dependence-vector part of it (since your current encoding puts * for backward dependencies):

An analysis returning [* >] is unlikely, but could be possible because pessimizing [< >] to [* >] should be conservatively correct. It is not reversed though, because FullDependence::isDirectionNegative stops at *-like dependencies¹.

¹ exactly for that reason

Assuming > here means the dependence-vector part of it (since your current encoding puts * for backward dependencies):

Correct, I was talking about the original dependence vector returned by DI->depends.

An analysis returning [* >] is unlikely, but could be possible because pessimizing [< >] to [* >] should be conservatively correct. It is not reversed though, because FullDependence::isDirectionNegative stops at *-like dependencies.

What I meant is that it might be more convenient for LoopInterchange if [* >] is normalized to [* <]. In other words, it may be useful if FullDependence::isDirectionNegative doesn’t stop at *-like dependencies.

I don’t think it’s very rare to have a dependence vector with * as its head element. For example, consider a case where the outermost loop has scalar dependencies (I don't know if ), like in the following example (I found such cases while investigating TSVC):

for (int n_times = 0; n_times < NTIMES; ++n_times)
  for (int i = 0; i < N; ++i)
    for (int j = 1; j < M; ++j)
      aa[j][i] = aa[j - 1][i] + 1; // This statement itself doesn't depend on `n_times`

The direction vector in the above example is [* = >]. Interchanging the i-loop and j-loop is legal (I believe), but it is currently rejected because [= >] is lexicographically negative. Alternatively, if the outermost one is not counted as a loop, the direction vector would be normalized to [= <] and the interchange would be legal.
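As a sanity check on the legality claim above, the following sketch runs the i/j nest in both orders on a small array and compares the results. The extents `N` and `M`, the initial values, and all helper names are made up for illustration. Agreement on one input is only evidence, not a proof of legality.

```cpp
#include <array>

constexpr int N = 4; // extent of the i-loop (illustrative)
constexpr int M = 5; // extent of the j-loop (illustrative)
using Grid = std::array<std::array<int, N>, M>; // indexed as aa[j][i]

// Fill the grid with distinct values so that ordering mistakes show up.
Grid makeInput() {
  Grid aa{};
  for (int j = 0; j < M; ++j)
    for (int i = 0; i < N; ++i)
      aa[j][i] = 10 * j + i;
  return aa;
}

// Original order: i outer, j inner.
Grid runOriginal(Grid aa) {
  for (int i = 0; i < N; ++i)
    for (int j = 1; j < M; ++j)
      aa[j][i] = aa[j - 1][i] + 1;
  return aa;
}

// Interchanged order: j outer, i inner.
Grid runInterchanged(Grid aa) {
  for (int j = 1; j < M; ++j)
    for (int i = 0; i < N; ++i)
      aa[j][i] = aa[j - 1][i] + 1;
  return aa;
}

// True if both loop orders produce the same array for this input.
bool sameResult() {
  return runOriginal(makeInput()) == runInterchanged(makeInput());
}
```

Calling `sameResult()` compares the two orders. Each column `i` is an independent chain along `j`, which is why the order of the i-loop and j-loop does not change the outcome here.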

So, I'm thinking that it may be better if direction vectors like [* = >] were normalized to [* = <]. This could probably be done by changing FullDependence::isDirectionNegative so that it doesn't stop at *. And what I'd like to ask is: Are there any concerns that come to mind? One thing that comes to my mind is that it can change the type of the dependence (flow/anti/output), but it is not very important for LoopInterchange, which is currently the only client of Dependence::normalize.
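To make the two normalization behaviours concrete, here is a minimal sketch over direction vectors written as strings. It is a simplified model, not the code of `FullDependence::isDirectionNegative` or `Dependence::normalize`. The names `Sign`, `leadingSign`, `reversed`, and `normalized`, and the `StopAtStar` flag, are hypothetical.

```cpp
#include <cassert>
#include <string>

enum class Sign { Positive, Negative, Zero, Unknown };

// Classify a direction vector written as a string of '<', '=', '>' and '*'.
// If StopAtStar is true, a '*' met before any '<' or '>' yields Unknown,
// modelling an analysis that refuses to look past an unanalyzable level.
// If StopAtStar is false, '*' entries are skipped like '='.
Sign leadingSign(const std::string &Dir, bool StopAtStar) {
  for (char C : Dir) {
    assert(C == '<' || C == '=' || C == '>' || C == '*');
    if (C == '<')
      return Sign::Positive;
    if (C == '>')
      return Sign::Negative;
    if (C == '*' && StopAtStar)
      return Sign::Unknown;
  }
  return Sign::Zero;
}

// Reverse every direction, as swapping Src and Dst would do.
std::string reversed(std::string Dir) {
  for (char &C : Dir) {
    if (C == '<')
      C = '>';
    else if (C == '>')
      C = '<';
  }
  return Dir;
}

// Normalize: flip the vector only when it is classified as negative.
std::string normalized(const std::string &Dir, bool StopAtStar) {
  return leadingSign(Dir, StopAtStar) == Sign::Negative ? reversed(Dir) : Dir;
}
```

With `StopAtStar` set, `normalized("*=>", true)` leaves the vector unchanged, while `normalized("*=>", false)` produces `*=<`. Whether skipping `*` is sound is the open question in this part of the thread.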

* may just mean "not analyzable", and as mentioned, it might effectively be <, but DA was not able to detect it as such. Reversing the dependence vector would be wrong.

* might also mean "sometimes <, sometimes >" depending on control flow. In that case there is no single correct normalization of the dependence vector; both (as-is and reversed) would be time-negative in some cases.

for (int i = 0; i < 100; ++i) {
  A[99 - i] = ..;
  use(A[i]); // flow dependency with i >= 50, anti-dependency with i < 50
}

Because of this, one cannot assume that the dependence vector is positive even after normalization. It could be considered a heuristic, and it might be reasonable to assume that the * is a non-detected = direction due to the symmetry of < and >. Could we do that in a different patch? It feels risky.

A better modeling might actually be that a FullDependence represents two dependencies: from Src to Dst, and the anti-dependency from Dst to Src. One of them may actually be empty, because with [= <] there is no dependency from Dst to Src, and with [= >] there is no dependency from Src to Dst. But with [*], neither direction would be ruled out and one has to pessimistically assume both.
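The `A[99 - i]` example above can be checked by brute force. The sketch below classifies, for each iteration that reads an element, whether the store to that element ran earlier (flow) or later (anti). The function and struct names are hypothetical, and the trip count is a parameter so the split point can be inspected.

```cpp
#include <cassert>

struct Counts {
  int Flow = 0; // the store executes before the read of the same element
  int Anti = 0; // the read executes before the store to the same element
};

// For `for (i = 0; i < N; ++i) { A[N - 1 - i] = ..; use(A[i]); }`, classify
// the dependence on the element read in iteration ReadIter.
// Returns +1 for flow, -1 for anti, 0 if store and read share an iteration.
int classify(int N, int ReadIter) {
  assert(0 <= ReadIter && ReadIter < N);
  int WriteIter = N - 1 - ReadIter; // iteration that stores to A[ReadIter]
  if (WriteIter < ReadIter)
    return +1;
  if (WriteIter > ReadIter)
    return -1;
  return 0;
}

// Count how many reading iterations fall into each class.
Counts countAll(int N) {
  Counts C;
  for (int I = 0; I < N; ++I) {
    int Kind = classify(N, I);
    if (Kind > 0)
      ++C.Flow;
    else if (Kind < 0)
      ++C.Anti;
  }
  return C;
}
```

For `N = 100`, `classify` returns `-1` up to iteration 49 and `+1` from iteration 50 on, so neither the vector as-is nor its reversal describes every pair of accesses.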

Thanks for your opinion! This is something I've been thinking about lately, and I don't intend to include this change in the current patch. As the discussion touched on a similar topic, I took the opportunity to raise a related question, just in case you happen to know any background or historical context behind it. Since there appear to be multiple approaches and potentially some edge cases, I'll give it some more thought. Your input was really helpful, thanks again!

* might also mean "sometimes <, sometimes >" depending on control flow. In that case there is no single correct normalization of the dependence vector

Considering this, it makes me wonder if the existence of the function normalize might be a bit misleading...

Considering this, it makes me wonder if the existence of the function normalize might be a bit misleading...

Definitely. When it was introduced, I thought that callers should be able to handle the direction as-is since the caller has chosen Src and Dst. normalize retroactively swaps the arguments. But it also makes some sense since you do not want to call DA::depends again with Src/Dst swapped, paying the computational cost again.

But it also makes some sense since you do not want to call DA::depends again with Src/Dst swapped, paying the computational cost again.

If I don't miss anything, simply copying the object looks to resolve that issue. Since there's a unique_ptr member in FullDependence, I don't think we can copy this as-is, but it probably doesn't need to be a unique_ptr.

Coming back to the original topic, added comments, and moved the process toward the bottom of the while-loop.

3b. dependence vector is negative: probably cannot happen as discussed (add assertion?)

I considered adding the assertion earlier, but realized it should be done right after calling normalize, so I didn’t add it at that point.

llvm/lib/Transforms/Scalar/LoopInterchange.cpp

Meinersbur

Some nitpicks after which I will LGTM

llvm/lib/Transforms/Scalar/LoopInterchange.cpp

Meinersbur · 2025-07-24T12:24:53Z

llvm/lib/Transforms/Scalar/LoopInterchange.cpp

    return false;

-  // If inner loop has dependence and outer loop is loop independent then it is
+  // If inner loop cannot be vectorized and outer loop can be then it is


Suggested change

// If inner loop cannot be vectorized and outer loop can be then it is

// If the inner loop cannot be vectorized but the outer loop can be then it is

[grammar]

Fixed.

By the way, this would be nitpicky as well, but do you think the original comment is accurate? What I'm trying to say is that even if canVectorize were a perfectly accurate function (neither false-positive nor false-negative), I'm starting to think that interchanging the loops here is not necessarily profitable for enabling inner loop parallelism. For example, in the following code:

for (int i = 1; i < N; i++)
  for (int j = 0; j < N; j++)
    for (int k = 0; k < N; k++) {
      // Assume f and g don't have side effects
      use(A[i][j][f(k)]);
      A[i + 1][j][g(k)] = ...;
    }

For the k-loop, canVectorize would return false if f and g are sufficiently complex. However, in principle, parallelizing the k-loop still seems legal in the original one. Therefore, a more accurate comment might be something like "... can be profitable to interchange the loops to enable inner loop parallelism"? (Apparently, I wrote the original comment too, so either past me or present me is wrong...)

What is "sufficiently complex"? If DA returns "confused" then canVectorize has to return false. If it returns [< = *] the dependency is carried by the outermost loop, it does not matter what the inner loop does.
I actually don't know/understand why canVectorize does not look at the parent loop dependencies. Possibly because which loops are the outer loops changes with interchange. At least the loops that surround both outer and inner could be considered.

The case you mention is interesting because it is a counterexample to the assumption that if canVectorize is pessimistic (never says a loop can be vectorized even though LoopVectorize will not for some reason), it will not cause loop exchanges that would not happen if it was not pessimistic. Anyway, in this case the j-loop looks more likely to be profitably vectorized because f(k)/g(k) indices would require more complex memory accesses. LoopVectorize can better handle i as a "strided access pattern".

I think the comment itself is correct: If the outer one could be vectorized (if moved to the inner position) but the current inner one cannot, swap the outer one to the vectorizable position. For "vectorizable" it just assumes the definition of canVectorize. Generally, even if a loop is vectorizable in terms of dependencies, LoopVectorize may still consider it unprofitable to vectorize because of the instructions it contains, or the code may actually run slower after vectorization, so "profitable" was never meant in absolute terms and is hopefully understood as such by the reader. "can" does not add new information here unless we would mention such concrete situations.
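The rule described in that comment, combined with the forward marker from the PR description, can be modelled roughly as follows. This is a simplified sketch and not the code in LoopInterchange.cpp: the string encoding, `DepMatrix`, `canVectorizeSketch`, and `profitableToInterchange` are assumptions made for illustration, and the real pass uses its own direction-matrix encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One string per dependence: one direction per loop level (outermost first),
// followed by a trailing marker: '<' if proven forward, '*' otherwise.
using DepMatrix = std::vector<std::string>;

// A loop is treated as vectorizable if every dependence is either not carried
// at that level ('=') or is proven to be a forward dependence.
bool canVectorizeSketch(const DepMatrix &Deps, std::size_t Level) {
  for (const std::string &Dir : Deps) {
    assert(Level + 1 < Dir.size() && "missing level or trailing marker");
    bool Forward = Dir.back() == '<';
    if (Dir[Level] != '=' && !Forward)
      return false;
  }
  return true;
}

// Interchange is considered profitable for vectorization when the inner loop
// cannot be vectorized but the outer loop can be.
bool profitableToInterchange(const DepMatrix &Deps, std::size_t Outer,
                             std::size_t Inner) {
  return !canVectorizeSketch(Deps, Inner) && canVectorizeSketch(Deps, Outer);
}
```

Under this model, a single dependence `=<*` (carried by the inner loop, not proven forward) makes interchange look profitable, while `=<<` (same directions, proven forward) does not, because the inner loop already counts as vectorizable.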

What is "sufficiently complex"? If DA returns "confused" then canVectorize has to return false. If it returns [< = *] the dependency is carried by the outermost loop, it does not matter what the inner loop does.

I tried to say the latter one. Just as you mentioned, I was assuming a case where DA returns [< = *].

I hadn't really been conscious of it, but as you pointed out, this is a case where pessimistic heuristics lead to an interchange that wouldn't have happened if they hadn't been pessimistic (and in this specific case, moving the j-loop would be profitable for vectorization because the memory access pattern is simpler). I personally think that the interchange should not happen in this case, since we currently don't take the vectorization cost into account. Checking dependencies of the surrounding loops seems basically like a good idea, but I'm not confident whether that might lead to other unintended transformations. Using the same cost model as LoopVectorize seems like an ideal solution, but it feels challenging.

For "vectorizable" it just assumes the definition of canVectorize.

As for the comment here, this explanation made the most sense to me. Thanks for clarifying!

I personally think that the interchange should not happen in this case, since we currently don't take the vectorization cost into account.

I agree, but there are limits on what we can do. In the end it is just a heuristic.

Checking dependencies of the surrounding loops seems basically like a good idea, but I'm not confident whether that might lead to other unintended transformations. Using the same cost model as LoopVectorize seems like an ideal solution, but it feels challenging.

This is a common problem that LoopDistribute also has: it is intended to enable vectorization of one or more distributed loops, but does not know whether they actually are vectorized. In other words, it has no cost model. Because of this, it does not do anything unless explicitly told to do so.

Using the profitability heuristic from LoopVectorize itself, even if it was easy, might also not be what we want: its computational cost is immense (building an entire new IR representation called VPlan), which we would not do speculatively on all loops without actually vectorizing.

Unless you have a universal cost model that takes everything into account and predicts the execution time, each pass needs its own heuristic for what it is optimizing for. E.g. the vectorizer optimizes cycles but does not consider cache effects.

When you put it that way, it hardly seems feasible (well, if it were feasible, it would probably have been done already).

No typo; the patch tries to teach DependenceAnalysis to determine dependencies after loop fusion has taken place without applying loop fusion. Now also do that for interchange, distribution, vectorization, ....

After reading this comment, I noticed that the patch introduces additional analysis for loop fusion even though the client doesn't require it. I initially expected an argument to be added (such as depends(Src, Dst, /*ForFusion=*/true)), but that doesn't seem to be the case. Though, controlling the analysis behavior via flags could complicate caching and reusing results across different passes.

By the way, I've recently been reading DependenceAnalysis.cpp, and noticed that it's already quite complex and potentially buggy. I'm fairly certain it should be refactored before adding any new features.

UnrollAndJam is disabled by default. Its heuristic also does not take vectorization into account, but tries to maximize L1i cache usage.

Optimal outcome would be if the vectorizer supported outer-loop vectorization.

I don't know much about the details of the UnrollAndJam pass, but it appears to work (unintentionally?) as if outer-loop vectorization is applied in some cases, especially when combined with the SLPVectorizer (of course, I needed to specify the unroll count explicitly by pragma). So, I just thought that it might make more sense to enhance UnrollAndJam instead of interchange, for cases where outer-loop is vectorizable but inner-loop is not. And, as you said, it would be the best solution to support outer-loop vectorization in the vectorizer.
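For readers unfamiliar with the transformation, the following hand-written sketch shows what unroll-and-jam by a factor of 4 does to a nest whose inner loop carries a dependence while the outer loop does not. It reuses the `aa[j][i] = aa[j - 1][i] + 1` statement from earlier in the thread. The extents, helper names, and unroll factor are assumptions. Whether the UnrollAndJam pass would produce this form, or whether the SLPVectorizer would vectorize it, is not something the sketch can show.

```cpp
#include <array>

constexpr int N = 8; // extent of the i-loop; a multiple of 4 in this sketch
constexpr int M = 6; // extent of the j-loop
using Grid = std::array<std::array<int, N>, M>; // indexed as aa[j][i]

// Distinct initial values so that a wrong transformation changes the result.
Grid makeGrid() {
  Grid aa{};
  for (int j = 0; j < M; ++j)
    for (int i = 0; i < N; ++i)
      aa[j][i] = 100 * j + i;
  return aa;
}

// Reference nest: the j-loop carries the dependence, the i-loop does not.
Grid runReference(Grid aa) {
  for (int i = 0; i < N; ++i)
    for (int j = 1; j < M; ++j)
      aa[j][i] = aa[j - 1][i] + 1;
  return aa;
}

// Outer loop unrolled by 4 and the copies jammed into one inner loop. The four
// statements are independent of each other and touch adjacent elements.
Grid runUnrollAndJam(Grid aa) {
  static_assert(N % 4 == 0, "this sketch has no remainder loop");
  for (int i = 0; i < N; i += 4)
    for (int j = 1; j < M; ++j) {
      aa[j][i + 0] = aa[j - 1][i + 0] + 1;
      aa[j][i + 1] = aa[j - 1][i + 1] + 1;
      aa[j][i + 2] = aa[j - 1][i + 2] + 1;
      aa[j][i + 3] = aa[j - 1][i + 3] + 1;
    }
  return aa;
}
```

Comparing `runReference(makeGrid())` with `runUnrollAndJam(makeGrid())` confirms that the jammed form computes the same values. The four adjacent, isomorphic statements in the inner body are the kind of pattern that resembles outer-loop vectorization once combined.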

After reading this comment, I noticed that the patch introduces additional analysis for loop fusion even though the client doesn't require it. I initially expected an argument to be added (such as depends(Src, Dst, /*ForFusion=*/true)), but that doesn't seem to be the case. Though, controlling the analysis behavior via flags could complicate caching and reusing results across different passes.

Whether it is for fusion is not yet decided when calling depends, but FullDependence stores the analysis for both.

By the way, I've recently been reading DependenceAnalysis.cpp, and noticed that it's already quite complex and potentially buggy. I'm fairly certain it should be refactored before adding any new features.

The principle is straightforward; when processing one of the two fused loops, process them as the same. Since an expression can only be in one of the loops, no ambiguity arises. Only when processing the relationship between two statements you need to decide whether you want to treat them as the same or sequential loops.

I am not sure refactoring helps. A big part of why it is difficult to understand is the math. The Pair also makes it look complex, but it is just matching the access subscript dimensions after delinearization. But I am also not very happy about adding special cases to an already complex analysis. If you do loop fusion, you may want to support more cases than loops that have exactly the same trip count.

Whether it is for fusion is not yet decided when calling depends, but FullDependence stores the analysis for both.

IIUC, FullDependence objects are not cached anywhere. DependenceInfo is nearly stateless. Furthermore, DependenceInfo::depends returns a unique_ptr, hence we cannot cache the result as it is. That is, I think we know whether the caller is fusion or not when calling DependenceInfo::depends.

I am not sure refactoring helps. A big part of why it is difficult to understand is the math. The Pair also makes it look complex, but it is just matching the access subscript dimensions after delinearization. But I am also not very happy about adding special cases to an already complex analysis. If you do loop fusion, you may want to support more cases than loops that have exactly the same trip count.

I agree that we can't do much about the mathematical complexity, but I believe the code could be made simpler. It looks to me like there's a fair amount of code duplication, especially when the same processes are executed for SrcXXX and DstXXX (e.g., here). I'm not sure whether this duplication makes the code harder to understand, but I do think it hurts maintainability. I don't believe "Don't Repeat Yourself" is always the right principle, but in this case, I think there are parts of the logic where it does apply.

However, I think the most significant problem is that we don't take wrapping into account. The approach in #116632 seems incorrect to me. We probably need to be more pessimistic with respect to wrapping. I think it makes sense to insert checks for wrap flags where necessary, which would complicate the code. I'm not sure if #146383 applies in that case, but generally speaking, adding a new feature could increase the number of factors we need to consider.

In fact, there's a case where DependenceAnalysis misses a dependency, probably due to ignoring wraps, as shown below (godbolt: https://godbolt.org/z/hsxWve8s6).

; for (i = 0; i < 4; i++)
;   a[i & 1][i & 1] = 0;
define void @f(ptr %a) {
entry:
  br label %loop
loop:
  %i = phi i64 [ 0, %entry ], [ %i.next, %loop ]
  %and = and i64 %i, 1
  %idx = getelementptr [4 x [4 x i8]], ptr %a, i64 0, i64 %and, i64 %and
  store i8 0, ptr %idx
  %i.next = add i64 %i, 1
  %exitcond.not = icmp slt i64 %i.next, 8
  br i1 %exitcond.not, label %loop, label %exit
exit:
  ret void
}

Printing analysis 'Dependence Analysis' for function 'f':
Src:  store i8 0, ptr %idx, align 1 --> Dst:  store i8 0, ptr %idx, align 1
  da analyze - none!
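That the reported `none!` misses a real dependence can be seen by enumerating the addresses the store touches. The sketch below mirrors the `[4 x [4 x i8]]` layout and counts stores that hit an address already written by an earlier iteration. The helper names are hypothetical, and the trip count is a parameter because the source comment says `i < 4` while the IR compares against 8.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Byte offset of a[X][Y] inside a [4 x [4 x i8]] array.
std::size_t offsetOf(std::size_t X, std::size_t Y) {
  assert(X < 4 && Y < 4);
  return X * 4 + Y;
}

// Count iterations of `for (i = 0; i < TripCount; i++) a[i & 1][i & 1] = 0;`
// that store to an address already written by an earlier iteration.
int countRepeatedStores(int TripCount) {
  std::set<std::size_t> Seen;
  int Repeats = 0;
  for (int I = 0; I < TripCount; ++I) {
    std::size_t Off = offsetOf(I & 1, I & 1);
    if (!Seen.insert(Off).second)
      ++Repeats; // same address as some earlier iteration: output dependence
  }
  return Repeats;
}
```

Any trip count above 2 yields repeated stores, since iterations 0 and 2 both write `a[0][0]`. The store therefore has an output dependence on itself across iterations.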

Can you create an issue # for that case? (Or I can do so) It doesn't look nsw/nuw related though, the subscripts are well within i64 range.

I remember having had issues with #116632 but apparently I have been convinced otherwise.

Would be looking forward to cleanup PRs on DA.

Ah, sorry, the issue already exists: #148435 (comment)

I think this is a kind of wrapping problem. IIRC, the %and is represented as {false,+,true}<%loop>, which would wrap. But DA casts it to i64 and ultimately overlooks the wrapping.
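The wrapping explanation above can be illustrated numerically. The sketch compares a 1-bit recurrence with start 0 and step 1, which wraps modulo 2, against the same start and step reinterpreted in 64 bits. This only models the arithmetic in the comment (which is itself prefixed with "IIRC"); it is not a confirmed description of what DependenceAnalysis does internally, and the function names are hypothetical.

```cpp
#include <cstdint>

// Value of the 1-bit recurrence {false,+,true} at iteration N: the addition
// wraps modulo 2, so the sequence is 0, 1, 0, 1, ...
std::uint64_t oneBitRecurrence(std::uint64_t N) { return N & 1; }

// Value if the same start and step are naively reinterpreted in 64 bits, i.e.
// {0,+,1}: the sequence is 0, 1, 2, 3, ... and never revisits a value.
std::uint64_t widenedRecurrence(std::uint64_t N) { return N; }

// First iteration below Limit at which the two interpretations disagree, or
// Limit itself if they agree on every iteration checked.
std::uint64_t firstMismatch(std::uint64_t Limit) {
  for (std::uint64_t N = 0; N < Limit; ++N)
    if (oneBitRecurrence(N) != widenedRecurrence(N))
      return N;
  return Limit;
}
```

The two sequences agree for iterations 0 and 1 and diverge at iteration 2, which is the first iteration where the store revisits an earlier address. An analysis that reasons about the widened form would conclude that no two iterations share a subscript.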

(While I'm at it, I'll share the other issues I found: #149977, #149501, #149991).

Would be looking forward to cleanup PRs on DA.

👍

llvm/lib/Transforms/Scalar/LoopInterchange.cpp

Meinersbur

LGTM

kasuga-fj · 2025-07-25T13:37:03Z

Thanks for your reviews!

llvm-ci · 2025-07-25T13:47:03Z

LLVM Buildbot has detected a new failure on builder openmp-offload-amdgpu-runtime-2 running on rocm-worker-hw-02 while building llvm at step 6 "test-openmp".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/10/builds/10184

Here is the relevant piece of the build log for the reference

Step 6 (test-openmp) failure: test (failure)
******************** TEST 'libarcher :: races/task-two.c' FAILED ********************
Exit Code: 1

Command Output (stdout):
--
# RUN: at line 13
/home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/./bin/clang -fopenmp  -gdwarf-4 -O1 -fsanitize=thread  -I /home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.src/openmp/tools/archer/tests -I /home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -L /home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -Wl,-rpath,/home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/runtimes/runtimes-bins/openmp/runtime/src   /home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.src/openmp/tools/archer/tests/races/task-two.c -o /home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/runtimes/runtimes-bins/openmp/tools/archer/tests/races/Output/task-two.c.tmp -latomic && env TSAN_OPTIONS='ignore_noninstrumented_modules=0:ignore_noninstrumented_modules=1' /home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.src/openmp/tools/archer/tests/deflake.bash /home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/runtimes/runtimes-bins/openmp/tools/archer/tests/races/Output/task-two.c.tmp 2>&1 | tee /home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/runtimes/runtimes-bins/openmp/tools/archer/tests/races/Output/task-two.c.tmp.log | /home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/./bin/FileCheck /home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.src/openmp/tools/archer/tests/races/task-two.c
# executed command: /home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/./bin/clang -fopenmp -gdwarf-4 -O1 -fsanitize=thread -I /home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.src/openmp/tools/archer/tests -I /home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -L /home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -Wl,-rpath,/home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/runtimes/runtimes-bins/openmp/runtime/src /home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.src/openmp/tools/archer/tests/races/task-two.c -o /home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/runtimes/runtimes-bins/openmp/tools/archer/tests/races/Output/task-two.c.tmp -latomic
# note: command had no output on stdout or stderr
# executed command: env TSAN_OPTIONS=ignore_noninstrumented_modules=0:ignore_noninstrumented_modules=1 /home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.src/openmp/tools/archer/tests/deflake.bash /home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/runtimes/runtimes-bins/openmp/tools/archer/tests/races/Output/task-two.c.tmp
# note: command had no output on stdout or stderr
# executed command: tee /home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/runtimes/runtimes-bins/openmp/tools/archer/tests/races/Output/task-two.c.tmp.log
# note: command had no output on stdout or stderr
# executed command: /home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/./bin/FileCheck /home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.src/openmp/tools/archer/tests/races/task-two.c
# note: command had no output on stdout or stderr
# RUN: at line 14
/home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/./bin/clang -fopenmp  -gdwarf-4 -O1 -fsanitize=thread  -I /home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.src/openmp/tools/archer/tests -I /home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -L /home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -Wl,-rpath,/home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/runtimes/runtimes-bins/openmp/runtime/src   /home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.src/openmp/tools/archer/tests/races/task-two.c -o /home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/runtimes/runtimes-bins/openmp/tools/archer/tests/races/Output/task-two.c.tmp -latomic && env ARCHER_OPTIONS="ignore_serial=1 report_data_leak=1" env TSAN_OPTIONS='ignore_noninstrumented_modules=0:ignore_noninstrumented_modules=1' /home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.src/openmp/tools/archer/tests/deflake.bash /home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/runtimes/runtimes-bins/openmp/tools/archer/tests/races/Output/task-two.c.tmp 2>&1 | tee /home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/runtimes/runtimes-bins/openmp/tools/archer/tests/races/Output/task-two.c.tmp.log | /home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/./bin/FileCheck /home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.src/openmp/tools/archer/tests/races/task-two.c
# executed command: /home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/./bin/clang -fopenmp -gdwarf-4 -O1 -fsanitize=thread -I /home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.src/openmp/tools/archer/tests -I /home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -L /home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -Wl,-rpath,/home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/runtimes/runtimes-bins/openmp/runtime/src /home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.src/openmp/tools/archer/tests/races/task-two.c -o /home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/runtimes/runtimes-bins/openmp/tools/archer/tests/races/Output/task-two.c.tmp -latomic
# note: command had no output on stdout or stderr
# executed command: env 'ARCHER_OPTIONS=ignore_serial=1 report_data_leak=1' env TSAN_OPTIONS=ignore_noninstrumented_modules=0:ignore_noninstrumented_modules=1 /home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.src/openmp/tools/archer/tests/deflake.bash /home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/runtimes/runtimes-bins/openmp/tools/archer/tests/races/Output/task-two.c.tmp
# note: command had no output on stdout or stderr
# executed command: tee /home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/runtimes/runtimes-bins/openmp/tools/archer/tests/races/Output/task-two.c.tmp.log
# note: command had no output on stdout or stderr
# executed command: /home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/./bin/FileCheck /home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.src/openmp/tools/archer/tests/races/task-two.c
# .---command stderr------------
# | /home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.src/openmp/tools/archer/tests/races/task-two.c:44:11: error: CHECK: expected string not found in input
# | // CHECK: ThreadSanitizer: reported {{[0-9]+}} warnings
# |           ^
# | <stdin>:27:5: note: scanning from here
# | DONE
# |     ^
# | <stdin>:28:1: note: possible intended match here
# | ThreadSanitizer: thread T4 finished with ignores enabled, created at:
# | ^
# | 
# | Input file: <stdin>
# | Check file: /home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.src/openmp/tools/archer/tests/races/task-two.c
# | 
# | -dump-input=help explains the following input dump.
# | 
# | Input was:
# | <<<<<<
# |             .
# |             .
# |             .
# |            22:  #0 pthread_create /home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.src/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:1090:3 (task-two.c.tmp+0xa34ba) 
# |            23:  #1 __kmp_create_worker z_Linux_util.cpp (libomp.so+0xcb262) 
# |            24:  
# |            25: SUMMARY: ThreadSanitizer: data race /home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.src/openmp/tools/archer/tests/races/task-two.c:30:10 in .omp_outlined. 
# |            26: ================== 
...

llvm-ci · 2025-07-26T06:31:40Z

LLVM Buildbot has detected a new failure on builder llvm-clang-x86_64-expensive-checks-debian running on gribozavr4 while building llvm at step 6 "test-build-unified-tree-check-all".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/16/builds/23268

Here is the relevant piece of the build log for the reference

Step 6 (test-build-unified-tree-check-all) failure: test (failure)
******************** TEST 'LLVM :: TableGen/RuntimeLibcallEmitter.td' FAILED ********************
Exit Code: 1

Command Output (stderr):
--
/b/1/llvm-clang-x86_64-expensive-checks-debian/build/bin/llvm-tblgen -gen-runtime-libcalls -I /b/1/llvm-clang-x86_64-expensive-checks-debian/llvm-project/llvm/test/TableGen/../../include /b/1/llvm-clang-x86_64-expensive-checks-debian/llvm-project/llvm/test/TableGen/RuntimeLibcallEmitter.td | /b/1/llvm-clang-x86_64-expensive-checks-debian/build/bin/FileCheck /b/1/llvm-clang-x86_64-expensive-checks-debian/llvm-project/llvm/test/TableGen/RuntimeLibcallEmitter.td # RUN: at line 1
+ /b/1/llvm-clang-x86_64-expensive-checks-debian/build/bin/llvm-tblgen -gen-runtime-libcalls -I /b/1/llvm-clang-x86_64-expensive-checks-debian/llvm-project/llvm/test/TableGen/../../include /b/1/llvm-clang-x86_64-expensive-checks-debian/llvm-project/llvm/test/TableGen/RuntimeLibcallEmitter.td
+ /b/1/llvm-clang-x86_64-expensive-checks-debian/build/bin/FileCheck /b/1/llvm-clang-x86_64-expensive-checks-debian/llvm-project/llvm/test/TableGen/RuntimeLibcallEmitter.td
/b/1/llvm-clang-x86_64-expensive-checks-debian/llvm-project/llvm/test/TableGen/RuntimeLibcallEmitter.td:98:16: error: CHECK-NEXT: expected string not found in input
// CHECK-NEXT: sqrtl_f80 = 7, // sqrtl
               ^
<stdin>:32:23: note: scanning from here
 calloc = 6, // calloc
                      ^
<stdin>:34:2: note: possible intended match here
 sqrtl_f80 = 8, // sqrtl
 ^

Input file: <stdin>
Check file: /b/1/llvm-clang-x86_64-expensive-checks-debian/llvm-project/llvm/test/TableGen/RuntimeLibcallEmitter.td

-dump-input=help explains the following input dump.

Input was:
<<<<<<
           .
           .
           .
          27:  ___memcpy = 1, // ___memcpy 
          28:  ___memset = 2, // ___memset 
          29:  __ashlsi3 = 3, // __ashlsi3 
          30:  __lshrdi3 = 4, // __lshrdi3 
          31:  bzero = 5, // bzero 
          32:  calloc = 6, // calloc 
next:98'0                           X error: no match found
          33:  sqrtl_f128 = 7, // sqrtl 
next:98'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~
          34:  sqrtl_f80 = 8, // sqrtl 
next:98'0     ~~~~~~~~~~~~~~~~~~~~~~~~~
next:98'1      ?                        possible intended match
          35:  NumLibcallImpls = 9 
next:98'0     ~~~~~~~~~~~~~~~~~~~~~
          36: }; 
next:98'0     ~~~
          37: } // End namespace RTLIB 
next:98'0     ~~~~~~~~~~~~~~~~~~~~~~~~~
          38: } // End namespace llvm 
next:98'0     ~~~~~~~~~~~~~~~~~~~~~~~~
          39: #endif 
next:98'0     ~~~~~~~
...

llvmbot added the llvm:transforms label Mar 31, 2025

kasuga-fj force-pushed the users/kasuga-fj/loop-interchange-fix-profitable-vectorization branch from 2db59e8 to ae5d9cf Compare April 2, 2025 07:12

kasuga-fj mentioned this pull request Apr 2, 2025

[LoopInterchange] Add tests for the vectorization profitability (NFC) #133665

Merged

kasuga-fj force-pushed the users/kasuga-fj/loop-interchange-improve-profitable-vectorization branch from cdec72a to 72b48ba Compare April 2, 2025 07:13

kasuga-fj requested review from Meinersbur, madhur13490 and sjoerdmeijer April 2, 2025 07:16

kasuga-fj force-pushed the users/kasuga-fj/loop-interchange-fix-profitable-vectorization branch from ae5d9cf to bd84ddc Compare April 2, 2025 07:29

kasuga-fj force-pushed the users/kasuga-fj/loop-interchange-improve-profitable-vectorization branch from 72b48ba to 692e4de Compare April 2, 2025 07:29

kasuga-fj force-pushed the users/kasuga-fj/loop-interchange-fix-profitable-vectorization branch from bd84ddc to b1f0744 Compare April 2, 2025 12:13

kasuga-fj force-pushed the users/kasuga-fj/loop-interchange-improve-profitable-vectorization branch from 692e4de to 1a1c1f6 Compare April 2, 2025 12:14

sjoerdmeijer reviewed Apr 2, 2025

Base automatically changed from users/kasuga-fj/loop-interchange-fix-profitable-vectorization to main April 3, 2025 07:21

kasuga-fj force-pushed the users/kasuga-fj/loop-interchange-improve-profitable-vectorization branch from 1a1c1f6 to b1c4248 Compare April 3, 2025 08:00

Handle negated and non negated direction vectors separately.

8f4f814

Meinersbur reviewed Apr 7, 2025

Add test that has positive dependencies

cad4db9

Meinersbur reviewed Jun 4, 2025

kasuga-fj added 2 commits June 6, 2025 20:20

Merge branch 'main' into users/kasuga-fj/loop-interchange-improve-pro…

42a19fb

…fitable-vectorization

Add "lexically forward" flag for vectorization profitability check

6a0a868

kasuga-fj commented Jun 6, 2025

Meinersbur reviewed Jul 14, 2025

kasuga-fj added 4 commits July 24, 2025 09:56

Merge branch 'main' into users/kasuga-fj/loop-interchange-improve-pro…

4f5a8c0

…fitable-vectorization

Fix comments

ced443b

Modify forward dependency check

211be9e

Revise tests

c21efda

kasuga-fj changed the title ~~[LoopInterchange] Improve profitability check for vectorization~~ [LoopInterchange] Consider forward/backward dependency in vectorize heuristic Jul 24, 2025

Meinersbur reviewed Jul 24, 2025

kasuga-fj added 2 commits July 24, 2025 13:30

Address review comments

e324e62

Revert unnecessary change

e8ef29a

Meinersbur approved these changes Jul 25, 2025

kasuga-fj merged commit b75530f into main Jul 25, 2025
9 checks passed

kasuga-fj deleted the users/kasuga-fj/loop-interchange-improve-profitable-vectorization branch July 25, 2025 13:37

		// Positive: ... = A[i][j] Negative: ... = A[i][j-1]
		// A[i][j-1] = ... A[i][j] = ...
