
[LoopUnroll] Simplify reduction operations after a loop unroll #84805

Open — wants to merge 3 commits into main
Conversation

rj-jesus
Contributor

@rj-jesus rj-jesus commented Mar 11, 2024

Try to simplify reductions (e.g. chains of floating-point adds) into independent operations after unrolling a loop. The idea is to break simple reduction loops such as:

 loop:
   PN = PHI [SUM2, loop], ...
   X = ...
   SUM1 = ADD (X, PN)
   Y = ...
   SUM2 = ADD (Y, SUM1)
   br loop

into independent sums of the form:

loop:
  PN1 = PHI [SUM1, loop], ...
  PN2 = PHI [SUM2, loop], ...
  X = ...
  SUM1 = ADD (X, PN1)
  Y = ...
  SUM2 = ADD (Y, PN2)
  <Reductions>
  br loop

where <Reductions> are new instructions used to compute the final values of the reduction from the partial sums we introduced, in this case:

<Reductions> =
  PN.red = ADD (PN1, PN2)
  SUM1.red = ADD (SUM1, PN2)
  SUM2.red = ADD (SUM1, SUM2)

This is a very common pattern in unrolled loops that compute dot products (for example) and can help with vectorisation.

This fixes a performance issue with the Polybench 3MM kernel on Grace (see the attached test case), leading to a speedup of about 90%.
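As a scalar illustration of the transform (a hedged sketch in plain C, not part of the patch): a 2x-unrolled dot product with one chained accumulator versus the same loop with independent partial sums combined after the loop. The function names are made up for this example. Note that the reassociation is exact for integers but, for floating point, is only legal under fast-math flags — which is why the pass copies the flags from the original adds.

```c
/* Chained form: each add depends on the previous one, mirroring
 *   SUM1 = ADD(X, PN); SUM2 = ADD(Y, SUM1)
 * from the PR description. */
double dot_chained(const double *a, const double *b, int n) {
  double sum = 0.0;
  for (int k = 0; k < n; k += 2) { /* unrolled by 2 */
    sum += a[k] * b[k];            /* SUM1 = ADD(X, PN)   */
    sum += a[k + 1] * b[k + 1];    /* SUM2 = ADD(Y, SUM1) */
  }
  return sum;
}

/* Split form: two independent accumulators (the new PHIs), reduced
 * once after the loop, mirroring SUM2.red = ADD(SUM1, SUM2). */
double dot_split(const double *a, const double *b, int n) {
  double s1 = 0.0, s2 = 0.0;       /* PN1, PN2 */
  for (int k = 0; k < n; k += 2) {
    s1 += a[k] * b[k];             /* SUM1 = ADD(X, PN1) */
    s2 += a[k + 1] * b[k + 1];     /* SUM2 = ADD(Y, PN2) */
  }
  return s1 + s2;                  /* SUM2.red = ADD(SUM1, SUM2) */
}
```

The split form has no loop-carried dependence between the two adds, which is what lets later passes vectorise it — the effect seen in the attached 3MM test case.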

Try to simplify reductions (e.g. chains of floating-point adds) into
independent operations after unrolling a loop. This is a very common
pattern in unrolled loops that compute dot products (for example) and
can help with vectorisation.
@llvmbot
Collaborator

llvmbot commented Mar 11, 2024

@llvm/pr-subscribers-backend-systemz

@llvm/pr-subscribers-llvm-transforms

Author: Ricardo Jesus (rj-jesus)

Changes

Try to simplify reductions (e.g. chains of floating-point adds) into independent operations after unrolling a loop. The idea is to break simple reduction loops such as:

 loop:
   PN = PHI [SUM2, loop], ...
   X = ...
   SUM1 = ADD (X, PN)
   Y = ...
   SUM2 = ADD (Y, SUM1)
   br loop

into independent sums of the form:

loop:
  PN1 = PHI [SUM1, loop], ...
  PN2 = PHI [SUM2, loop], ...
  X = ...
  SUM1 = ADD (X, PN1)
  Y = ...
  SUM2 = ADD (Y, PN2)
  <Reductions>
  br loop

where <Reductions> are new instructions used to compute the final values of the reduction from the partial sums we introduced, in this case:

<Reductions> =
  PN.red = ADD (PN1, PN2)
  SUM1.red = ADD (SUM1, PN2)
  SUM2.red = ADD (SUM1, SUM2)

This is a very common pattern in unrolled loops that compute dot products (for example) and can help with vectorisation.


Patch is 147.60 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/84805.diff

9 Files Affected:

  • (modified) llvm/lib/Transforms/Utils/LoopUnroll.cpp (+267)
  • (added) llvm/test/CodeGen/AArch64/polybench-3mm.ll (+55)
  • (modified) llvm/test/Transforms/LoopUnroll/AArch64/falkor-prefetch.ll (+20)
  • (modified) llvm/test/Transforms/LoopUnroll/ARM/instr-size-costs.ll (+12-8)
  • (modified) llvm/test/Transforms/LoopUnroll/X86/znver3.ll (+409-141)
  • (modified) llvm/test/Transforms/LoopUnroll/runtime-loop5.ll (+13-7)
  • (modified) llvm/test/Transforms/LoopUnroll/runtime-unroll-remainder.ll (+13-7)
  • (added) llvm/test/Transforms/LoopUnroll/simplify-reductions.ll (+301)
  • (modified) llvm/test/Transforms/PhaseOrdering/SystemZ/sub-xor.ll (+108-55)
diff --git a/llvm/lib/Transforms/Utils/LoopUnroll.cpp b/llvm/lib/Transforms/Utils/LoopUnroll.cpp
index 6f0d000815726e..b14d05d642e275 100644
--- a/llvm/lib/Transforms/Utils/LoopUnroll.cpp
+++ b/llvm/lib/Transforms/Utils/LoopUnroll.cpp
@@ -84,6 +84,11 @@ STATISTIC(NumUnrolled, "Number of loops unrolled (completely or otherwise)");
 STATISTIC(NumUnrolledNotLatch, "Number of loops unrolled without a conditional "
                                "latch (completely or otherwise)");
 
+static cl::opt<bool>
+UnrollSimplifyReductions("unroll-simplify-reductions", cl::init(true),
+                         cl::Hidden, cl::desc("Try to simplify reductions "
+                                              "after unrolling a loop."));
+
 static cl::opt<bool>
 UnrollRuntimeEpilog("unroll-runtime-epilog", cl::init(false), cl::Hidden,
                     cl::desc("Allow runtime unrolled loops to be unrolled "
@@ -209,6 +214,258 @@ static bool isEpilogProfitable(Loop *L) {
   return false;
 }
 
+/// This function tries to break apart simple reduction loops like the one
+/// below:
+///
+/// loop:
+///   PN = PHI [SUM2, loop], ...
+///   X = ...
+///   SUM1 = ADD (X, PN)
+///   Y = ...
+///   SUM2 = ADD (Y, SUM1)
+///   br loop
+///
+/// into independent sums of the form:
+///
+/// loop:
+///   PN1 = PHI [SUM1, loop], ...
+///   PN2 = PHI [SUM2, loop], ...
+///   X = ...
+///   SUM1 = ADD (X, PN1)
+///   Y = ...
+///   SUM2 = ADD (Y, PN2)
+///   <Reductions>
+///   br loop
+///
+/// where <Reductions> are new instructions inserted to compute the final
+/// values of the reduction from the partial sums we introduced, in this case:
+///
+/// <Reductions> =
+///   PN.red = ADD (PN1, PN2)
+///   SUM1.red = ADD (SUM1, PN2)
+///   SUM2.red = ADD (SUM1, SUM2)
+///
+/// In practice, in most cases only one or two of the reduced values are
+/// required outside the loop, so most of the reduction instructions do not
+/// need to be added into the loop. Moreover, these instructions can be sunk
+/// out of the loop, which happens in later passes.
+///
+/// This is a very common pattern in unrolled loops that compute dot products
+/// (for example) and breaking apart the reduction chains can help greatly with
+/// vectorisation.
+static bool trySimplifyReductions(Instruction &I) {
+  // Check if I is a PHINode (potentially the start of a reduction chain).
+  // Note: For simplicity we only consider loops that consist of a single
+  // basic block that branches to itself.
+  BasicBlock *BB = I.getParent();
+  PHINode *PN = dyn_cast<PHINode>(&I);
+  if (!PN || PN->getBasicBlockIndex(BB) == -1)
+    return false;
+
+  // Attempt to construct a list of instructions that are chained together
+  // (i.e. that perform a reduction).
+  SmallVector<BinaryOperator *, 16> Ops;
+  for (Instruction *Cur = PN, *Next = nullptr; /* true */; Cur = Next,
+                                                           Next = nullptr) {
+    // Try to find the next element in the reduction chain.
+    for (auto *U : Cur->users()) {
+      auto *Candidate = dyn_cast<Instruction>(U);
+      if (Candidate && Candidate->getParent() == BB) {
+        // If we've already found a candidate element for the chain and we find
+        // *another* candidate we bail out as this means the intermediate
+        // values of the reduction are needed within the loop, and so there is
+        // no point in breaking the reduction apart.
+        if (Next)
+          return false;
+        Next = Candidate;
+      }
+    }
+    // If we've reached the start, i.e. the next element in the chain would be
+    // the PN we started with, we are done.
+    if (Next == PN)
+      break;
+    // Otherwise, check whether we found a candidate at all and, if so,
+    // whether it is a binary operator.
+    if (!Next || !isa<BinaryOperator>(Next))
+      return false;
+    // If everything checks out, add the new element to the chain.
+    Ops.push_back(cast<BinaryOperator>(Next));
+  }
+
+  // Ensure the reduction comprises at least two instructions, otherwise this
+  // is a trivial reduction of a single element that doesn't need to be
+  // simplified.
+  if (Ops.size() < 2)
+    return false;
+
+  LLVM_DEBUG(
+  dbgs() << "Found candidate reduction: " << I << "\n";
+  for (auto const *Op : Ops)
+    dbgs() << "                         | " << *Op << "\n";
+  );
+
+  // Ensure all instructions perform the same operation and that the operation
+  // is associative and commutative so that we can break the chain apart and
+  // reassociate the Ops.
+  Instruction::BinaryOps const Opcode = Ops[0]->getOpcode();
+  for (auto const *Op : Ops)
+    if (Op->getOpcode() != Opcode || !Op->isAssociative() ||
+        !Op->isCommutative())
+      return false;
+
+  // Define the neutral element of the reduction or bail out if we don't have
+  // one defined.
+  // TODO: This could be generalised to other operations (e.g. MULs).
+  Value *NeutralElem = nullptr;
+  switch (Opcode) {
+  case Instruction::BinaryOps::Add:
+  case Instruction::BinaryOps::Or:
+  case Instruction::BinaryOps::Xor:
+  case Instruction::BinaryOps::FAdd:
+    NeutralElem = Constant::getNullValue(PN->getType());
+    break;
+  case Instruction::BinaryOps::And:
+    NeutralElem = Constant::getAllOnesValue(PN->getType());
+    break;
+  case Instruction::BinaryOps::Mul:
+  case Instruction::BinaryOps::FMul:
+  default:
+    return false;
+  }
+  assert(NeutralElem && "Neutral element of reduction undefined.");
+
+  // --------------------------------------------------------------------- //
+  // At this point Ops is a list of chained binary operations performing a //
+  // reduction that we know we can break apart.                            //
+  // --------------------------------------------------------------------- //
+
+  // For shorthand, let N be the length of the chain.
+  unsigned const N = Ops.size();
+  LLVM_DEBUG(dbgs() << "Simplifying reduction of length " << N << ".\n");
+
+  // Create new phi nodes for all but the first element in the chain.
+  SmallVector<PHINode *, 16> Phis{PN};
+  for (unsigned i = 1; i < N; i++) {
+    PHINode *NewPN = PHINode::Create(PN->getType(), PN->getNumIncomingValues(),
+                                     PN->getName());
+    // Copy incoming blocks from the first/original PN to the new Phi and set
+    // their incoming values to the neutral element of the reduction.
+    for (auto *IncomingBB : PN->blocks())
+      NewPN->addIncoming(NeutralElem, IncomingBB);
+    NewPN->insertAfter(Phis.back());
+    Phis.push_back(NewPN);
+  }
+
+  // Set the chained operands of the Ops to the Phis and the incoming values of
+  // the Phis (for this BB) to the Ops.
+  for (unsigned i = 0; i < N; i++) {
+    PHINode *Phi = Phis[i];
+    Instruction *Op = Ops[i];
+
+    // Find the index of the operand of Op to replace. The first Op reads its
+    // value from the first Phi node. The other Ops read their value from the
+    // previous Op.
+    Value *OperandToReplace = i == 0 ? cast<Value>(PN) : Ops[i-1];
+    unsigned OperandIdx = Op->getOperand(0) == OperandToReplace ? 0 : 1;
+    assert(Op->getOperand(OperandIdx) == OperandToReplace &&
+           "Operand mismatch. Perhaps a malformed chain?");
+
+    // Set the operand of Op to Phi and the incoming value of Phi for BB to Op.
+    Op->setOperand(OperandIdx, Phi);
+    Phi->setIncomingValueForBlock(BB, Op);
+  }
+
+  // Replace old uses of PN and Ops outside this BB with the updated totals.
+  // The "old" total corresponding to PN now corresponds to the sum of all
+  // Phis. Similarly, the old totals in Ops correspond to the sum of the
+  // partial results in the new Ops up to the index of the Op we want to
+  // compute, plus the sum of the Phis from that index onwards.
+  //
+  // More rigorously, the totals can be computed as follows.
+  // 1. Let k be an index in the list of length N+1 below of the variables we
+  //    want to compute the new totals for:
+  //      { PN, Ops[0], Ops[1], ... }
+  // 2. Let Sum(k) denote the new total to compute for the k-th variable in the
+  //    list above. Then,
+  //      Sum(0) = Sum(PN) = \sum_{0 <= i < N} Phis[i],
+  //      Sum(1) = Sum(Ops[0]) = \sum_{0 <= i < 1} Ops[i] +
+  //                             \sum_{1 <= i < N} Phis[i],
+  //      ...
+  //      Sum(N) = Sum(Ops[N-1]) = \sum_{0 <= i < N} Ops[i].
+  // 3. More generally,
+  //      Sum(k) = Sum(PN) if k == 0 else Sum(Ops[k-1])
+  //             = \sum_{0 <= i < k} Ops[i] +
+  //               \sum_{k <= i < N} Phis[i],
+  //      for 0 <= k <= N.
+  // 4. Finally, if we name the sums in Ops and Phis separately, i.e.
+  //      SOps(k) = \sum_{0 <= i < k} Ops[i],
+  //      SPhis(k) = \sum_{k <= i < N} Phis[i],
+  //    then
+  //      Sum(k) = SOps(k) + SPhis(k), 0 <= k <= N.
+  // .
+
+  // Helper function to create a new binary op.
+  // Note: We copy the flags from Ops[0]. Could this be too permissive?
+  auto CreateBinOp = [&](Value *V1, Value *V2) {
+    auto Name = PN->getName()+".red";
+    return BinaryOperator::CreateWithCopiedFlags(Opcode, V1, V2, Ops[0],
+                                                 Name, &BB->back());
+  };
+
+  // Compute the partial sums of the Ops:
+  //   SOps[k] = \sum_{0 <= i < k} Ops[i], 0 <= k <= N.
+  // For 1 <= k <= N we have:
+  //   SOps[k] = Ops[k-1] + \sum_{0 <= i < k-1} Ops[i]
+  //           = Ops[k-1] + SOps[k-1],
+  // so if we compute SOps in order (i.e. from 0 to N) we can reuse partial
+  // results.
+  SmallVector<Value *, 16> SOps(N+1);
+  SOps[0] = nullptr;  // alternatively we could use NeutralElem
+  SOps[1] = Ops.front();
+  for (unsigned k = 2; k <= N; k++)
+    SOps[k] = CreateBinOp(SOps[k-1], Ops[k-1]);
+
+  // Compute the partial sums of the Phis:
+  //   SPhis[k] = \sum_{k <= i < N} Phis[i], 0 <= k <= N.
+  // Similarly, for 0 <= k <= N-1 we have:
+  //   SPhis[k] = Phis[k] + \sum_{k+1 <= i < N} Phis[i]
+  //            = Phis[k] + SPhis[k+1],
+  // so if we compute SPhis in reverse (i.e. from N down to 0) we can reuse the
+  // partial sums computed thus far.
+  SmallVector<Value *, 16> SPhis(N+1);
+  SPhis[N] = nullptr;  // alternatively we could use NeutralElem
+  SPhis[N-1] = Phis.back();
+  for (signed k = N-2; k >= 0; k--)
+    SPhis[k] = CreateBinOp(SPhis[k+1], Phis[k]);
+
+  // Finally, compute the total sums for PN and Ops from:
+  //   Sums[k] = SOps[k] + SPhis[k], 0 <= k <= N.
+  // These sums might be dead, so we add them to a weak tracking vector for
+  // cleanup afterwards.
+  SmallVector<WeakTrackingVH, 16> Sums(N+1);
+  for (unsigned k = 0; k <= N; k++) {
+    // Pick the Op we want to compute the new total for.
+    Value *Op = k == 0 ? cast<Value>(PN) : Ops[k-1];
+
+    Value *SOp = SOps[k], *SPhi = SPhis[k];
+    if (SOp && SPhi)
+      Sums[k] = CreateBinOp(SOp, SPhi);
+    else if (SOp)
+      Sums[k] = SOp;
+    else
+      Sums[k] = SPhi;
+
+    // Replace uses of the old total with the new total.
+    Op->replaceUsesOutsideBlock(Sums[k], BB);
+  }
+
+  // Drop dead totals. In case the totals *are* used they could and should be
+  // sunk, but this happens in later passes so we don't bother doing it here.
+  RecursivelyDeleteTriviallyDeadInstructionsPermissive(Sums);
+
+  return true;
+}
+
 /// Perform some cleanup and simplifications on loops after unrolling. It is
 /// useful to simplify the IV's in the new loop, as well as do a quick
 /// simplify/dce pass of the instructions.
@@ -272,6 +529,16 @@ void llvm::simplifyLoopAfterUnroll(Loop *L, bool SimplifyIVs, LoopInfo *LI,
     // have a phi which (potentially indirectly) uses instructions later in
     // the block we're iterating through.
     RecursivelyDeleteTriviallyDeadInstructions(DeadInsts);
+    // Try to simplify reductions (e.g. chains of floating-point adds) into
+    // independent operations (see more at trySimplifyReductions). This is a
+    // very common pattern in unrolled loops that compute dot products (for
+    // example).
+    //
+    // We do this outside the loop over the instructions above to let
+    // instsimplify kick in before trying to apply this transform.
+    if (UnrollSimplifyReductions)
+      for (PHINode &PN : BB->phis())
+        trySimplifyReductions(PN);
   }
 }
 
diff --git a/llvm/test/CodeGen/AArch64/polybench-3mm.ll b/llvm/test/CodeGen/AArch64/polybench-3mm.ll
new file mode 100644
index 00000000000000..309fa0e9a305f7
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/polybench-3mm.ll
@@ -0,0 +1,55 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: opt -passes=loop-unroll,instcombine -unroll-count=2 %s | llc --mattr=,+neon | FileCheck %s
+
+target triple = "aarch64"
+
+; This is a reduced example adapted from the Polybench 3MM kernel.
+; We are doing something similar to:
+;   double dot = 0.0;
+;   for (long k = 0; k < 1000; k++)
+;     dot += A[k] * B[k*nb];
+;   return dot;
+
+define double @test(ptr %A, ptr %B, i64 %nb) {
+; CHECK-LABEL: test:
+; CHECK:       // %bb.0: // %entry
+; CHECK-NEXT:    movi d0, #0000000000000000
+; CHECK-NEXT:    movi d1, #0000000000000000
+; CHECK-NEXT:    lsl x8, x2, #4
+; CHECK-NEXT:    mov x9, xzr
+; CHECK-NEXT:  .LBB0_1: // %loop
+; CHECK-NEXT:    // =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    add x10, x0, x9, lsl #3
+; CHECK-NEXT:    ldr d2, [x1]
+; CHECK-NEXT:    ldr d5, [x1, x2, lsl #3]
+; CHECK-NEXT:    add x9, x9, #2
+; CHECK-NEXT:    add x1, x1, x8
+; CHECK-NEXT:    ldp d3, d4, [x10]
+; CHECK-NEXT:    cmp x9, #1000
+; CHECK-NEXT:    fmadd d0, d2, d3, d0
+; CHECK-NEXT:    fmadd d1, d5, d4, d1
+; CHECK-NEXT:    b.ne .LBB0_1
+; CHECK-NEXT:  // %bb.2: // %exit
+; CHECK-NEXT:    fadd d0, d0, d1
+; CHECK-NEXT:    ret
+entry:
+  br label %loop
+
+loop:
+  %k = phi i64 [ %k.next, %loop ], [ 0, %entry ]
+  %dot = phi double [ %dot.next, %loop ], [ 0.000000e+00, %entry ]
+  %A.gep = getelementptr inbounds double, ptr %A, i64 %k
+  %A.val = load double, ptr %A.gep, align 8
+  %B.idx = mul nsw i64 %k, %nb
+  %B.gep = getelementptr inbounds double, ptr %B, i64 %B.idx
+  %B.val = load double, ptr %B.gep, align 8
+  %fmul = fmul fast double %B.val, %A.val
+  %dot.next = fadd fast double %fmul, %dot
+  %k.next = add nuw nsw i64 %k, 1
+  %cmp = icmp eq i64 %k.next, 1000
+  br i1 %cmp, label %exit, label %loop
+
+exit:
+  %dot.next.lcssa = phi double [ %dot.next, %loop ]
+  ret double %dot.next.lcssa
+}
diff --git a/llvm/test/Transforms/LoopUnroll/AArch64/falkor-prefetch.ll b/llvm/test/Transforms/LoopUnroll/AArch64/falkor-prefetch.ll
index 045b1c72321a97..874a4aa800c411 100644
--- a/llvm/test/Transforms/LoopUnroll/AArch64/falkor-prefetch.ll
+++ b/llvm/test/Transforms/LoopUnroll/AArch64/falkor-prefetch.ll
@@ -73,6 +73,13 @@ exit:
 ; NOHWPF-LABEL: loop2:
 ; NOHWPF-NEXT: phi
 ; NOHWPF-NEXT: phi
+; NOHWPF-NEXT: phi
+; NOHWPF-NEXT: phi
+; NOHWPF-NEXT: phi
+; NOHWPF-NEXT: phi
+; NOHWPF-NEXT: phi
+; NOHWPF-NEXT: phi
+; NOHWPF-NEXT: phi
 ; NOHWPF-NEXT: getelementptr
 ; NOHWPF-NEXT: load
 ; NOHWPF-NEXT: add
@@ -106,6 +113,13 @@ exit:
 ; NOHWPF-NEXT: add
 ; NOHWPF-NEXT: add
 ; NOHWPF-NEXT: icmp
+; NOHWPF-NEXT: add
+; NOHWPF-NEXT: add
+; NOHWPF-NEXT: add
+; NOHWPF-NEXT: add
+; NOHWPF-NEXT: add
+; NOHWPF-NEXT: add
+; NOHWPF-NEXT: add
 ; NOHWPF-NEXT: br
 ; NOHWPF-NEXT-LABEL: exit2:
 ;
@@ -113,6 +127,9 @@ exit:
 ; CHECK-LABEL: loop2:
 ; CHECK-NEXT: phi
 ; CHECK-NEXT: phi
+; CHECK-NEXT: phi
+; CHECK-NEXT: phi
+; CHECK-NEXT: phi
 ; CHECK-NEXT: getelementptr
 ; CHECK-NEXT: load
 ; CHECK-NEXT: add
@@ -130,6 +147,9 @@ exit:
 ; CHECK-NEXT: add
 ; CHECK-NEXT: add
 ; CHECK-NEXT: icmp
+; CHECK-NEXT: add
+; CHECK-NEXT: add
+; CHECK-NEXT: add
 ; CHECK-NEXT: br
 ; CHECK-NEXT-LABEL: exit2:
 
diff --git a/llvm/test/Transforms/LoopUnroll/ARM/instr-size-costs.ll b/llvm/test/Transforms/LoopUnroll/ARM/instr-size-costs.ll
index 216bf489bc66ec..59208a6da76f62 100644
--- a/llvm/test/Transforms/LoopUnroll/ARM/instr-size-costs.ll
+++ b/llvm/test/Transforms/LoopUnroll/ARM/instr-size-costs.ll
@@ -195,14 +195,15 @@ define i32 @test_i32_select_optsize(ptr %a, ptr %b, ptr %c) #0 {
 ; CHECK-V8-NEXT:    br label [[LOOP:%.*]]
 ; CHECK-V8:       loop:
 ; CHECK-V8-NEXT:    [[IV:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[COUNT_1:%.*]], [[LOOP]] ]
-; CHECK-V8-NEXT:    [[ACC:%.*]] = phi i32 [ 0, [[ENTRY]] ], [ [[ACC_NEXT_1:%.*]], [[LOOP]] ]
+; CHECK-V8-NEXT:    [[ACC:%.*]] = phi i32 [ 0, [[ENTRY]] ], [ [[ACC_NEXT:%.*]], [[LOOP]] ]
+; CHECK-V8-NEXT:    [[ACC1:%.*]] = phi i32 [ 0, [[ENTRY]] ], [ [[ACC_NEXT_1:%.*]], [[LOOP]] ]
 ; CHECK-V8-NEXT:    [[ADDR_A:%.*]] = getelementptr i32, ptr [[A:%.*]], i32 [[IV]]
 ; CHECK-V8-NEXT:    [[ADDR_B:%.*]] = getelementptr i32, ptr [[B:%.*]], i32 [[IV]]
 ; CHECK-V8-NEXT:    [[DATA_A:%.*]] = load i32, ptr [[ADDR_A]], align 4
 ; CHECK-V8-NEXT:    [[DATA_B:%.*]] = load i32, ptr [[ADDR_B]], align 4
 ; CHECK-V8-NEXT:    [[UGT:%.*]] = icmp ugt i32 [[DATA_A]], [[DATA_B]]
 ; CHECK-V8-NEXT:    [[UMAX:%.*]] = select i1 [[UGT]], i32 [[DATA_A]], i32 [[DATA_B]]
-; CHECK-V8-NEXT:    [[ACC_NEXT:%.*]] = add i32 [[UMAX]], [[ACC]]
+; CHECK-V8-NEXT:    [[ACC_NEXT]] = add i32 [[UMAX]], [[ACC]]
 ; CHECK-V8-NEXT:    [[ADDR_C:%.*]] = getelementptr i32, ptr [[C:%.*]], i32 [[IV]]
 ; CHECK-V8-NEXT:    store i32 [[UMAX]], ptr [[ADDR_C]], align 4
 ; CHECK-V8-NEXT:    [[COUNT:%.*]] = add nuw nsw i32 [[IV]], 1
@@ -212,14 +213,15 @@ define i32 @test_i32_select_optsize(ptr %a, ptr %b, ptr %c) #0 {
 ; CHECK-V8-NEXT:    [[DATA_B_1:%.*]] = load i32, ptr [[ADDR_B_1]], align 4
 ; CHECK-V8-NEXT:    [[UGT_1:%.*]] = icmp ugt i32 [[DATA_A_1]], [[DATA_B_1]]
 ; CHECK-V8-NEXT:    [[UMAX_1:%.*]] = select i1 [[UGT_1]], i32 [[DATA_A_1]], i32 [[DATA_B_1]]
-; CHECK-V8-NEXT:    [[ACC_NEXT_1]] = add i32 [[UMAX_1]], [[ACC_NEXT]]
+; CHECK-V8-NEXT:    [[ACC_NEXT_1]] = add i32 [[UMAX_1]], [[ACC1]]
 ; CHECK-V8-NEXT:    [[ADDR_C_1:%.*]] = getelementptr i32, ptr [[C]], i32 [[COUNT]]
 ; CHECK-V8-NEXT:    store i32 [[UMAX_1]], ptr [[ADDR_C_1]], align 4
 ; CHECK-V8-NEXT:    [[COUNT_1]] = add nuw nsw i32 [[IV]], 2
 ; CHECK-V8-NEXT:    [[END_1:%.*]] = icmp ne i32 [[COUNT_1]], 100
+; CHECK-V8-NEXT:    [[ACC_RED:%.*]] = add i32 [[ACC_NEXT]], [[ACC_NEXT_1]]
 ; CHECK-V8-NEXT:    br i1 [[END_1]], label [[LOOP]], label [[EXIT:%.*]]
 ; CHECK-V8:       exit:
-; CHECK-V8-NEXT:    [[ACC_NEXT_LCSSA:%.*]] = phi i32 [ [[ACC_NEXT_1]], [[LOOP]] ]
+; CHECK-V8-NEXT:    [[ACC_NEXT_LCSSA:%.*]] = phi i32 [ [[ACC_RED]], [[LOOP]] ]
 ; CHECK-V8-NEXT:    ret i32 [[ACC_NEXT_LCSSA]]
 ;
 entry:
@@ -251,14 +253,15 @@ define i32 @test_i32_select_minsize(ptr %a, ptr %b, ptr %c) #1 {
 ; CHECK-V8-NEXT:    br label [[LOOP:%.*]]
 ; CHECK-V8:       loop:
 ; CHECK-V8-NEXT:    [[IV:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[COUNT_1:%.*]], [[LOOP]] ]
-; CHECK-V8-NEXT:    [[ACC:%.*]] = phi i32 [ 0, [[ENTRY]] ], [ [[ACC_NEXT_1:%.*]], [[LOOP]] ]
+; CHECK-V8-NEXT:    [[ACC:%.*]] = phi i32 [ 0, [[ENTRY]] ], [ [[ACC_NEXT:%.*]], [[LOOP]] ]
+; CHECK-V8-NEXT:    [[ACC1:%.*]] = phi i32 [ 0, [[ENTRY]] ], [ [[ACC_NEXT_1:%.*]], [[LOOP]] ]
 ; CHECK-V8-NEXT:    [[ADDR_A:%.*]] = getelementptr i32, ptr [[A:%.*]], i32 [[IV]]
 ; CHECK-V8-NEXT:    [[ADDR_B:%.*]] = getelementptr i32, ptr [[B:%.*]], i32 [[IV]]
 ; CHECK-V8-NEXT:    [[DATA_A:%.*]] = load i32, ptr [[ADDR_A]], align 4
 ; CHECK-V8-NEXT:    [[DATA_B:%.*]] = load i32, ptr [[ADDR_B]], align 4
 ; CHECK-V8-NEXT:    [[UGT:%.*]] = icmp ugt i32 [[DATA_A]], [[DATA_B]]
 ; CHECK-V8-NEXT:    [[UMAX:%.*]] = select i1 [[UGT]], i32 [[DATA_A]], i32 [[DATA_B]]
-; CHECK-V8-NEXT:    [[ACC_NEXT:%.*]] = add i32 [[UMAX]], [[ACC]]
+; CHECK-V8-NEXT:    [[ACC_NEXT]] = add i32 [[UMAX]], [[ACC]]
 ; CHECK-V8-NEXT:    [[ADDR_C:%.*]] = getelementptr i32, ptr [[C:%.*]], i32 [[IV]]
 ; CHECK-V8-NEXT:    store i32 [[UMAX]], ptr [[ADDR_C]], align 4
 ; CHECK-V8-NEXT:    [[COUNT:%.*]] = add nuw nsw i32 [[IV]], 1
@@ -268,14 +271,15 @@ define i32 @test_i32_select_minsize(ptr %a, ptr %b, ptr %c) #1 {
 ; CHECK-V8-NEXT:    [[DATA_B_1:%.*]] = load i32, ptr [[ADDR_B_1]], align 4
 ; CHECK-V8-NEXT:    [[UGT_1:%.*]] = icmp ugt i32 [[DATA_A_1]], [[DATA_B_1]]
 ; CHECK-V8-NEXT:    [[UMAX_1:%.*]] = select i1 [[UGT_1]], i32 [[DATA_A_1]], i32 [[DATA_B_1]]
-; CHECK-V8-NEXT:    [[ACC_NEXT_1]] = add i32 [[UMAX_1]], [[ACC_NEXT]]
+; CHECK-V8-NEXT:    [[ACC_NEXT_1]] = add i32 [[UMAX_1]], [[ACC1]]
 ; CHECK-V8-NEXT:    [[ADDR_C_1:%.*]] = getelementptr i32, ptr [[C]], i32 [[COUNT]]
 ; CHECK-V8-NEXT:    store i32 [[UMAX_1]], ptr [[ADDR_C_1]], align 4
 ; CHECK-V8-NEXT:    [[COUNT_1]] = add nuw nsw i32 [[IV]], 2
 ; CHECK-V8-NEXT:...
[truncated]
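The Sum(k) identity that the patch's comment block derives — Sum(k) = SOps(k) + SPhis(k) — can be sanity-checked in isolation. The sketch below is hypothetical plain C, not LLVM code; the helper names `sum_k` and `chained_value` are invented for this illustration. It models one iteration of the transformed loop, where each new partial result is `ops[i] = phis[i] + inc[i]`, and checks that the reconstruction matches what the original chain would have held after its k-th add (with the old PN equal to the sum of all the new PHIs).

```c
#define N 3

/* Sum(k) = SOps(k) + SPhis(k) as defined in the patch comments. */
long sum_k(const long ops[N], const long phis[N], int k) {
  long sops = 0, sphis = 0;
  for (int i = 0; i < k; i++)
    sops += ops[i];   /* SOps(k)  = \sum_{0 <= i < k} Ops[i]  */
  for (int i = k; i < N; i++)
    sphis += phis[i]; /* SPhis(k) = \sum_{k <= i < N} Phis[i] */
  return sops + sphis;
}

/* Reference: the value the ORIGINAL chain would carry after its k-th
 * add in this iteration. The old PN equals the sum of all the new
 * PHIs, and the chain then applies the first k increments to it. */
long chained_value(const long phis[N], const long inc[N], int k) {
  long v = 0;
  for (int i = 0; i < N; i++)
    v += phis[i];     /* old PN at loop entry */
  for (int i = 0; i < k; i++)
    v += inc[i];      /* adds applied so far  */
  return v;
}
```

Since the two agree for every 0 <= k <= N, replacing out-of-loop uses of PN (k = 0) and Ops[k-1] (k >= 1) with Sum(k) preserves the reduction's final values, as the patch intends.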

@llvmbot
Collaborator

llvmbot commented Mar 11, 2024

@llvm/pr-subscribers-backend-aarch64

+  //    list above. Then,
+  //      Sum(0) = Sum(PN) = \sum_{0 <= i < N} Phis[i],
+  //      Sum(1) = Sum(Ops[0]) = \sum_{0 <= i < 1} Ops[i] +
+  //                             \sum_{1 <= i < N} Phis[i],
+  //      ...
+  //      Sum(N) = Sum(Ops[N-1]) = \sum_{0 <= i < N} Ops[i].
+  // 3. More generally,
+  //      Sum(k) = Sum(PN) if k == 0 else Sum(Ops[k-1])
+  //             = \sum_{0 <= i < k} Ops[i] +
+  //               \sum_{k <= i < N} Phis[i],
+  //      for 0 <= k <= N.
+  // 4. Finally, if we name the sums in Ops and Phis separately, i.e.
+  //      SOps(k) = \sum_{0 <= i < k} Ops[i],
+  //      SPhis(k) = \sum_{k <= i < N} Phis[i],
+  //    then
+  //      Sum(k) = SOps(k) + SPhis(k), 0 <= k <= N.
+  // .
+
+  // Helper function to create a new binary op.
+  // Note: We copy the flags from Ops[0]. Could this be too permissive?
+  auto CreateBinOp = [&](Value *V1, Value *V2) {
+    auto Name = PN->getName()+".red";
+    return BinaryOperator::CreateWithCopiedFlags(Opcode, V1, V2, Ops[0],
+                                                 Name, &BB->back());
+  };
+
+  // Compute the partial sums of the Ops:
+  //   SOps[k] = \sum_{0 <= i < k} Ops[i], 0 <= k <= N.
+  // For 1 <= k <= N we have:
+  //   SOps[k] = Ops[k-1] + \sum_{0 <= i < k-1} Ops[i]
+  //           = Ops[k-1] + SOps[k-1],
+  // so if we compute SOps in order (i.e. from 0 to N) we can reuse partial
+  // results.
+  SmallVector<Value *, 16> SOps(N+1);
+  SOps[0] = nullptr;  // alternatively we could use NeutralElem
+  SOps[1] = Ops.front();
+  for (unsigned k = 2; k <= N; k++)
+    SOps[k] = CreateBinOp(SOps[k-1], Ops[k-1]);
+
+  // Compute the partial sums of the Phis:
+  //   SPhis[k] = \sum_{k <= i < N} Phis[i], 0 <= k <= N.
+  // Similarly, for 0 <= k <= N-1 we have:
+  //   SPhis[k] = Phis[k] + \sum_{k+1 <= i < N} Phis[i]
+  //            = Phis[k] + SPhis[k+1],
+  // so if we compute SPhis in reverse (i.e. from N down to 0) we can reuse the
+  // partial sums computed thus far.
+  SmallVector<Value *, 16> SPhis(N+1);
+  SPhis[N] = nullptr;  // alternatively we could use NeutralElem
+  SPhis[N-1] = Phis.back();
+  for (signed k = N-2; k >= 0; k--)
+    SPhis[k] = CreateBinOp(SPhis[k+1], Phis[k]);
+
+  // Finally, compute the total sums for PN and Ops from:
+  //   Sums[k] = SOps[k] + SPhis[k], 0 <= k <= N.
+  // These sums might be dead, so we add them to a weak tracking vector for
+  // cleanup afterwards.
+  SmallVector<WeakTrackingVH, 16> Sums(N+1);
+  for (unsigned k = 0; k <= N; k++) {
+    // Pick the Op we want to compute the new total for.
+    Value *Op = k == 0 ? cast<Value>(PN) : Ops[k-1];
+
+    Value *SOp = SOps[k], *SPhi = SPhis[k];
+    if (SOp && SPhi)
+      Sums[k] = CreateBinOp(SOp, SPhi);
+    else if (SOp)
+      Sums[k] = SOp;
+    else
+      Sums[k] = SPhi;
+
+    // Replace uses of the old total with the new total.
+    Op->replaceUsesOutsideBlock(Sums[k], BB);
+  }
+
+  // Drop dead totals. In case the totals *are* used they could and should be
+  // sunk, but this happens in later passes so we don't bother doing it here.
+  RecursivelyDeleteTriviallyDeadInstructionsPermissive(Sums);
+
+  return true;
+}
+
 /// Perform some cleanup and simplifications on loops after unrolling. It is
 /// useful to simplify the IV's in the new loop, as well as do a quick
 /// simplify/dce pass of the instructions.
@@ -272,6 +529,16 @@ void llvm::simplifyLoopAfterUnroll(Loop *L, bool SimplifyIVs, LoopInfo *LI,
     // have a phi which (potentially indirectly) uses instructions later in
     // the block we're iterating through.
     RecursivelyDeleteTriviallyDeadInstructions(DeadInsts);
+    // Try to simplify reductions (e.g. chains of floating-point adds) into
+    // independent operations (see more at trySimplifyReductions). This is a
+    // very common pattern in unrolled loops that compute dot products (for
+    // example).
+    //
+    // We do this outside the loop over the instructions above to let
+    // instsimplify kick in before trying to apply this transform.
+    if (UnrollSimplifyReductions)
+      for (PHINode &PN : BB->phis())
+        trySimplifyReductions(PN);
   }
 }
 
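The reassociation that `trySimplifyReductions` performs — splitting one chained accumulator into N independent ones seeded with the neutral element, then recombining — can be sanity-checked with a short standalone sketch. This is illustrative only: the function names below are invented for the example and do not appear in the patch, and plain integer addition stands in for the IR binary ops (for floating-point adds the transform is only sound under fast-math, since it changes the rounding order).

```python
def chained_reduction(init, xs):
    # The pre-unroll shape: one accumulator threaded through every add,
    # i.e. SUM1 = ADD(X, PN); SUM2 = ADD(Y, SUM1); ...
    acc = init
    for x in xs:
        acc = acc + x
    return acc

def split_reduction(init, xs, n):
    # The post-transform shape: one phi per unrolled lane. All but the
    # first start at the neutral element of ADD (0); the first carries the
    # original incoming value.
    assert len(xs) % n == 0
    phis = [init] + [0] * (n - 1)
    for i in range(0, len(xs), n):
        for lane in range(n):
            phis[lane] = phis[lane] + xs[i + lane]
    # <Reductions>: combine the independent partial sums into the total,
    # mirroring Sum(N) = SOps(N) in the comments above.
    total = phis[0]
    for p in phis[1:]:
        total = total + p
    return total

xs = list(range(1, 17))
assert chained_reduction(5, xs) == split_reduction(5, xs, 2)
assert chained_reduction(5, xs) == split_reduction(5, xs, 4)
```

For integer add the two forms are exactly equal for any unroll factor that divides the trip count, which is the associativity/commutativity requirement the patch checks before transforming.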
diff --git a/llvm/test/CodeGen/AArch64/polybench-3mm.ll b/llvm/test/CodeGen/AArch64/polybench-3mm.ll
new file mode 100644
index 00000000000000..309fa0e9a305f7
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/polybench-3mm.ll
@@ -0,0 +1,55 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: opt -passes=loop-unroll,instcombine -unroll-count=2 %s | llc --mattr=,+neon | FileCheck %s
+
+target triple = "aarch64"
+
+; This is a reduced example adapted from the Polybench 3MM kernel.
+; We are doing something similar to:
+;   double dot = 0.0;
+;   for (long k = 0; k < 1000; k++)
+;     dot += A[k] * B[k*nb];
+;   return dot;
+
+define double @test(ptr %A, ptr %B, i64 %nb) {
+; CHECK-LABEL: test:
+; CHECK:       // %bb.0: // %entry
+; CHECK-NEXT:    movi d0, #0000000000000000
+; CHECK-NEXT:    movi d1, #0000000000000000
+; CHECK-NEXT:    lsl x8, x2, #4
+; CHECK-NEXT:    mov x9, xzr
+; CHECK-NEXT:  .LBB0_1: // %loop
+; CHECK-NEXT:    // =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    add x10, x0, x9, lsl #3
+; CHECK-NEXT:    ldr d2, [x1]
+; CHECK-NEXT:    ldr d5, [x1, x2, lsl #3]
+; CHECK-NEXT:    add x9, x9, #2
+; CHECK-NEXT:    add x1, x1, x8
+; CHECK-NEXT:    ldp d3, d4, [x10]
+; CHECK-NEXT:    cmp x9, #1000
+; CHECK-NEXT:    fmadd d0, d2, d3, d0
+; CHECK-NEXT:    fmadd d1, d5, d4, d1
+; CHECK-NEXT:    b.ne .LBB0_1
+; CHECK-NEXT:  // %bb.2: // %exit
+; CHECK-NEXT:    fadd d0, d0, d1
+; CHECK-NEXT:    ret
+entry:
+  br label %loop
+
+loop:
+  %k = phi i64 [ %k.next, %loop ], [ 0, %entry ]
+  %dot = phi double [ %dot.next, %loop ], [ 0.000000e+00, %entry ]
+  %A.gep = getelementptr inbounds double, ptr %A, i64 %k
+  %A.val = load double, ptr %A.gep, align 8
+  %B.idx = mul nsw i64 %k, %nb
+  %B.gep = getelementptr inbounds double, ptr %B, i64 %B.idx
+  %B.val = load double, ptr %B.gep, align 8
+  %fmul = fmul fast double %B.val, %A.val
+  %dot.next = fadd fast double %fmul, %dot
+  %k.next = add nuw nsw i64 %k, 1
+  %cmp = icmp eq i64 %k.next, 1000
+  br i1 %cmp, label %exit, label %loop
+
+exit:
+  %dot.next.lcssa = phi double [ %dot.next, %loop ]
+  ret double %dot.next.lcssa
+}
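The test above checks the AArch64 output (two independent `fmadd`s into `d0`/`d1`, recombined by one `fadd` in `%exit`). A small model of the same computation — with made-up input values, not taken from the test — shows what the unroll-by-2 plus reduction split computes. Note the original IR uses `fadd fast`, which is what licenses this reassociation; the sample values below are small integers in floats, so the two orders happen to agree exactly.

```python
def dot_chained(A, B, nb):
    # The source loop: dot += A[k] * B[k*nb], one serial accumulator.
    dot = 0.0
    for k in range(len(A)):
        dot += A[k] * B[k * nb]
    return dot

def dot_split(A, B, nb):
    # After unroll-by-2 and reduction splitting: two independent
    # accumulators (the two fmadds), combined once at loop exit.
    assert len(A) % 2 == 0
    d0 = d1 = 0.0
    for k in range(0, len(A), 2):
        d0 += A[k] * B[k * nb]            # fmadd d0, ...
        d1 += A[k + 1] * B[(k + 1) * nb]  # fmadd d1, ...
    return d0 + d1                        # fadd d0, d0, d1 in %exit

A = [1.0, 2.0, 3.0, 4.0]
B = [float(i) for i in range(16)]
assert dot_chained(A, B, 3) == dot_split(A, B, 3)
```

Breaking the serial dependence between the two `fadd`s is what removes the long latency chain and enables the ~90% speedup reported on Grace.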
diff --git a/llvm/test/Transforms/LoopUnroll/AArch64/falkor-prefetch.ll b/llvm/test/Transforms/LoopUnroll/AArch64/falkor-prefetch.ll
index 045b1c72321a97..874a4aa800c411 100644
--- a/llvm/test/Transforms/LoopUnroll/AArch64/falkor-prefetch.ll
+++ b/llvm/test/Transforms/LoopUnroll/AArch64/falkor-prefetch.ll
@@ -73,6 +73,13 @@ exit:
 ; NOHWPF-LABEL: loop2:
 ; NOHWPF-NEXT: phi
 ; NOHWPF-NEXT: phi
+; NOHWPF-NEXT: phi
+; NOHWPF-NEXT: phi
+; NOHWPF-NEXT: phi
+; NOHWPF-NEXT: phi
+; NOHWPF-NEXT: phi
+; NOHWPF-NEXT: phi
+; NOHWPF-NEXT: phi
 ; NOHWPF-NEXT: getelementptr
 ; NOHWPF-NEXT: load
 ; NOHWPF-NEXT: add
@@ -106,6 +113,13 @@ exit:
 ; NOHWPF-NEXT: add
 ; NOHWPF-NEXT: add
 ; NOHWPF-NEXT: icmp
+; NOHWPF-NEXT: add
+; NOHWPF-NEXT: add
+; NOHWPF-NEXT: add
+; NOHWPF-NEXT: add
+; NOHWPF-NEXT: add
+; NOHWPF-NEXT: add
+; NOHWPF-NEXT: add
 ; NOHWPF-NEXT: br
 ; NOHWPF-NEXT-LABEL: exit2:
 ;
@@ -113,6 +127,9 @@ exit:
 ; CHECK-LABEL: loop2:
 ; CHECK-NEXT: phi
 ; CHECK-NEXT: phi
+; CHECK-NEXT: phi
+; CHECK-NEXT: phi
+; CHECK-NEXT: phi
 ; CHECK-NEXT: getelementptr
 ; CHECK-NEXT: load
 ; CHECK-NEXT: add
@@ -130,6 +147,9 @@ exit:
 ; CHECK-NEXT: add
 ; CHECK-NEXT: add
 ; CHECK-NEXT: icmp
+; CHECK-NEXT: add
+; CHECK-NEXT: add
+; CHECK-NEXT: add
 ; CHECK-NEXT: br
 ; CHECK-NEXT-LABEL: exit2:
 
diff --git a/llvm/test/Transforms/LoopUnroll/ARM/instr-size-costs.ll b/llvm/test/Transforms/LoopUnroll/ARM/instr-size-costs.ll
index 216bf489bc66ec..59208a6da76f62 100644
--- a/llvm/test/Transforms/LoopUnroll/ARM/instr-size-costs.ll
+++ b/llvm/test/Transforms/LoopUnroll/ARM/instr-size-costs.ll
@@ -195,14 +195,15 @@ define i32 @test_i32_select_optsize(ptr %a, ptr %b, ptr %c) #0 {
 ; CHECK-V8-NEXT:    br label [[LOOP:%.*]]
 ; CHECK-V8:       loop:
 ; CHECK-V8-NEXT:    [[IV:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[COUNT_1:%.*]], [[LOOP]] ]
-; CHECK-V8-NEXT:    [[ACC:%.*]] = phi i32 [ 0, [[ENTRY]] ], [ [[ACC_NEXT_1:%.*]], [[LOOP]] ]
+; CHECK-V8-NEXT:    [[ACC:%.*]] = phi i32 [ 0, [[ENTRY]] ], [ [[ACC_NEXT:%.*]], [[LOOP]] ]
+; CHECK-V8-NEXT:    [[ACC1:%.*]] = phi i32 [ 0, [[ENTRY]] ], [ [[ACC_NEXT_1:%.*]], [[LOOP]] ]
 ; CHECK-V8-NEXT:    [[ADDR_A:%.*]] = getelementptr i32, ptr [[A:%.*]], i32 [[IV]]
 ; CHECK-V8-NEXT:    [[ADDR_B:%.*]] = getelementptr i32, ptr [[B:%.*]], i32 [[IV]]
 ; CHECK-V8-NEXT:    [[DATA_A:%.*]] = load i32, ptr [[ADDR_A]], align 4
 ; CHECK-V8-NEXT:    [[DATA_B:%.*]] = load i32, ptr [[ADDR_B]], align 4
 ; CHECK-V8-NEXT:    [[UGT:%.*]] = icmp ugt i32 [[DATA_A]], [[DATA_B]]
 ; CHECK-V8-NEXT:    [[UMAX:%.*]] = select i1 [[UGT]], i32 [[DATA_A]], i32 [[DATA_B]]
-; CHECK-V8-NEXT:    [[ACC_NEXT:%.*]] = add i32 [[UMAX]], [[ACC]]
+; CHECK-V8-NEXT:    [[ACC_NEXT]] = add i32 [[UMAX]], [[ACC]]
 ; CHECK-V8-NEXT:    [[ADDR_C:%.*]] = getelementptr i32, ptr [[C:%.*]], i32 [[IV]]
 ; CHECK-V8-NEXT:    store i32 [[UMAX]], ptr [[ADDR_C]], align 4
 ; CHECK-V8-NEXT:    [[COUNT:%.*]] = add nuw nsw i32 [[IV]], 1
@@ -212,14 +213,15 @@ define i32 @test_i32_select_optsize(ptr %a, ptr %b, ptr %c) #0 {
 ; CHECK-V8-NEXT:    [[DATA_B_1:%.*]] = load i32, ptr [[ADDR_B_1]], align 4
 ; CHECK-V8-NEXT:    [[UGT_1:%.*]] = icmp ugt i32 [[DATA_A_1]], [[DATA_B_1]]
 ; CHECK-V8-NEXT:    [[UMAX_1:%.*]] = select i1 [[UGT_1]], i32 [[DATA_A_1]], i32 [[DATA_B_1]]
-; CHECK-V8-NEXT:    [[ACC_NEXT_1]] = add i32 [[UMAX_1]], [[ACC_NEXT]]
+; CHECK-V8-NEXT:    [[ACC_NEXT_1]] = add i32 [[UMAX_1]], [[ACC1]]
 ; CHECK-V8-NEXT:    [[ADDR_C_1:%.*]] = getelementptr i32, ptr [[C]], i32 [[COUNT]]
 ; CHECK-V8-NEXT:    store i32 [[UMAX_1]], ptr [[ADDR_C_1]], align 4
 ; CHECK-V8-NEXT:    [[COUNT_1]] = add nuw nsw i32 [[IV]], 2
 ; CHECK-V8-NEXT:    [[END_1:%.*]] = icmp ne i32 [[COUNT_1]], 100
+; CHECK-V8-NEXT:    [[ACC_RED:%.*]] = add i32 [[ACC_NEXT]], [[ACC_NEXT_1]]
 ; CHECK-V8-NEXT:    br i1 [[END_1]], label [[LOOP]], label [[EXIT:%.*]]
 ; CHECK-V8:       exit:
-; CHECK-V8-NEXT:    [[ACC_NEXT_LCSSA:%.*]] = phi i32 [ [[ACC_NEXT_1]], [[LOOP]] ]
+; CHECK-V8-NEXT:    [[ACC_NEXT_LCSSA:%.*]] = phi i32 [ [[ACC_RED]], [[LOOP]] ]
 ; CHECK-V8-NEXT:    ret i32 [[ACC_NEXT_LCSSA]]
 ;
 entry:
@@ -251,14 +253,15 @@ define i32 @test_i32_select_minsize(ptr %a, ptr %b, ptr %c) #1 {
 ; CHECK-V8-NEXT:    br label [[LOOP:%.*]]
 ; CHECK-V8:       loop:
 ; CHECK-V8-NEXT:    [[IV:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[COUNT_1:%.*]], [[LOOP]] ]
-; CHECK-V8-NEXT:    [[ACC:%.*]] = phi i32 [ 0, [[ENTRY]] ], [ [[ACC_NEXT_1:%.*]], [[LOOP]] ]
+; CHECK-V8-NEXT:    [[ACC:%.*]] = phi i32 [ 0, [[ENTRY]] ], [ [[ACC_NEXT:%.*]], [[LOOP]] ]
+; CHECK-V8-NEXT:    [[ACC1:%.*]] = phi i32 [ 0, [[ENTRY]] ], [ [[ACC_NEXT_1:%.*]], [[LOOP]] ]
 ; CHECK-V8-NEXT:    [[ADDR_A:%.*]] = getelementptr i32, ptr [[A:%.*]], i32 [[IV]]
 ; CHECK-V8-NEXT:    [[ADDR_B:%.*]] = getelementptr i32, ptr [[B:%.*]], i32 [[IV]]
 ; CHECK-V8-NEXT:    [[DATA_A:%.*]] = load i32, ptr [[ADDR_A]], align 4
 ; CHECK-V8-NEXT:    [[DATA_B:%.*]] = load i32, ptr [[ADDR_B]], align 4
 ; CHECK-V8-NEXT:    [[UGT:%.*]] = icmp ugt i32 [[DATA_A]], [[DATA_B]]
 ; CHECK-V8-NEXT:    [[UMAX:%.*]] = select i1 [[UGT]], i32 [[DATA_A]], i32 [[DATA_B]]
-; CHECK-V8-NEXT:    [[ACC_NEXT:%.*]] = add i32 [[UMAX]], [[ACC]]
+; CHECK-V8-NEXT:    [[ACC_NEXT]] = add i32 [[UMAX]], [[ACC]]
 ; CHECK-V8-NEXT:    [[ADDR_C:%.*]] = getelementptr i32, ptr [[C:%.*]], i32 [[IV]]
 ; CHECK-V8-NEXT:    store i32 [[UMAX]], ptr [[ADDR_C]], align 4
 ; CHECK-V8-NEXT:    [[COUNT:%.*]] = add nuw nsw i32 [[IV]], 1
@@ -268,14 +271,15 @@ define i32 @test_i32_select_minsize(ptr %a, ptr %b, ptr %c) #1 {
 ; CHECK-V8-NEXT:    [[DATA_B_1:%.*]] = load i32, ptr [[ADDR_B_1]], align 4
 ; CHECK-V8-NEXT:    [[UGT_1:%.*]] = icmp ugt i32 [[DATA_A_1]], [[DATA_B_1]]
 ; CHECK-V8-NEXT:    [[UMAX_1:%.*]] = select i1 [[UGT_1]], i32 [[DATA_A_1]], i32 [[DATA_B_1]]
-; CHECK-V8-NEXT:    [[ACC_NEXT_1]] = add i32 [[UMAX_1]], [[ACC_NEXT]]
+; CHECK-V8-NEXT:    [[ACC_NEXT_1]] = add i32 [[UMAX_1]], [[ACC1]]
 ; CHECK-V8-NEXT:    [[ADDR_C_1:%.*]] = getelementptr i32, ptr [[C]], i32 [[COUNT]]
 ; CHECK-V8-NEXT:    store i32 [[UMAX_1]], ptr [[ADDR_C_1]], align 4
 ; CHECK-V8-NEXT:    [[COUNT_1]] = add nuw nsw i32 [[IV]], 2
 ; CHECK-V8-NEXT:...
[truncated]

github-actions bot commented Mar 11, 2024

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff 034cc2f5d0abcf7a465665246f16a1b75fbde93a b259694086a5c79991ee1a0bcf6b6562e49c4fbf -- llvm/lib/Transforms/Utils/LoopUnroll.cpp
View the diff from clang-format here.
diff --git a/llvm/lib/Transforms/Utils/LoopUnroll.cpp b/llvm/lib/Transforms/Utils/LoopUnroll.cpp
index e30547d8ee..002e4d0c44 100644
--- a/llvm/lib/Transforms/Utils/LoopUnroll.cpp
+++ b/llvm/lib/Transforms/Utils/LoopUnroll.cpp
@@ -110,10 +110,10 @@ UnrollVerifyLoopInfo("unroll-verify-loopinfo", cl::Hidden,
                     );
 
 static cl::opt<bool>
-UnrollSimplifyReductions("unroll-simplify-reductions", cl::init(true),
-                         cl::Hidden, cl::desc("Try to simplify reductions "
-                                              "after unrolling a loop."));
-
+    UnrollSimplifyReductions("unroll-simplify-reductions", cl::init(true),
+                             cl::Hidden,
+                             cl::desc("Try to simplify reductions "
+                                      "after unrolling a loop."));
 
 /// Check if unrolling created a situation where we need to insert phi nodes to
 /// preserve LCSSA form.
@@ -362,7 +362,7 @@ static bool trySimplifyReductions(Instruction &I) {
     // Find the index of the operand of Op to replace. The first Op reads its
     // value from the first Phi node. The other Ops read their value from the
     // previous Op.
-    Value *OperandToReplace = i == 0 ? cast<Value>(PN) : Ops[i-1];
+    Value *OperandToReplace = i == 0 ? cast<Value>(PN) : Ops[i - 1];
     unsigned OperandIdx = Op->getOperand(0) == OperandToReplace ? 0 : 1;
     assert(Op->getOperand(OperandIdx) == OperandToReplace &&
            "Operand mismatch. Perhaps a malformed chain?");
@@ -416,11 +416,11 @@ static bool trySimplifyReductions(Instruction &I) {
   //           = Ops[k-1] + SOps[k-1],
   // so if we compute SOps in order (i.e. from 0 to N) we can reuse partial
   // results.
-  SmallVector<Value *, 16> SOps(N+1);
+  SmallVector<Value *, 16> SOps(N + 1);
   SOps[0] = nullptr; // alternatively we could use NeutralElem
   SOps[1] = Ops.front();
   for (unsigned k = 2; k <= N; k++)
-    SOps[k] = CreateBinOp(SOps[k-1], Ops[k-1]);
+    SOps[k] = CreateBinOp(SOps[k - 1], Ops[k - 1]);
 
   // Compute the partial sums of the Phis:
   //   SPhis[k] = \sum_{k <= i < N} Phis[i], 0 <= k <= N.
@@ -429,20 +429,20 @@ static bool trySimplifyReductions(Instruction &I) {
   //            = Phis[k] + SPhis[k+1],
   // so if we compute SPhis in reverse (i.e. from N down to 0) we can reuse the
   // partial sums computed thus far.
-  SmallVector<Value *, 16> SPhis(N+1);
+  SmallVector<Value *, 16> SPhis(N + 1);
   SPhis[N] = nullptr; // alternatively we could use NeutralElem
-  SPhis[N-1] = Phis.back();
-  for (signed k = N-2; k >= 0; k--)
-    SPhis[k] = CreateBinOp(SPhis[k+1], Phis[k]);
+  SPhis[N - 1] = Phis.back();
+  for (signed k = N - 2; k >= 0; k--)
+    SPhis[k] = CreateBinOp(SPhis[k + 1], Phis[k]);
 
   // Finally, compute the total sums for PN and Ops from:
   //   Sums[k] = SOps[k] + SPhis[k], 0 <= k <= N.
  // These sums might be dead, so we add them to a weak tracking vector for
  // cleanup afterwards.
-  SmallVector<WeakTrackingVH, 16> Sums(N+1);
+  SmallVector<WeakTrackingVH, 16> Sums(N + 1);
   for (unsigned k = 0; k <= N; k++) {
     // Pick the Op we want to compute the new total for.
-    Value *Op = k == 0 ? cast<Value>(PN) : Ops[k-1];
+    Value *Op = k == 0 ? cast<Value>(PN) : Ops[k - 1];
 
     Value *SOp = SOps[k], *SPhi = SPhis[k];
     if (SOp && SPhi)

@rj-jesus rj-jesus requested a review from preames March 11, 2024 18:23
@rj-jesus
Contributor Author

The failed clangd test appears to be unrelated to this transform:
Clangd Unit Tests :: ./ClangdTests/DiagnosticTest/RespectsDiagnosticConfig
