
[RISCV] Vectorize phi for loop carried @llvm.vector.reduce.fadd #78244

Merged: 5 commits, Jan 18, 2024

Conversation

@lukel97 (Contributor) commented Jan 16, 2024

LLVM vector reduction intrinsics return a scalar result, but on RISC-V the vector reduction instructions write the result to the first element of a vector register. So when a reduction in a loop uses a scalar phi, we end up with unnecessary scalar moves:

loop:
    vfmv.s.f v10, fa0
    vfredosum.vs v8, v8, v10
    vfmv.f.s fa0, v8

This mainly affects ordered fadd reductions, since other types of reduction typically use element-wise vectorisation in the loop body. This tries to vectorize any scalar phis that feed into a fadd reduction in RISCVCodeGenPrepare, converting:

loop:
%phi = phi float [ ..., %entry ], [ %acc, %loop ]
%acc = call float @llvm.vector.reduce.fadd.nxv2f32(float %phi, <vscale x 2 x float> %vec)

to

loop:
%phi = phi <vscale x 2 x float> [ ..., %entry ], [ %acc.vec, %loop ]
%phi.scalar = extractelement <vscale x 2 x float> %phi, i64 0
%acc = call float @llvm.vector.reduce.fadd.nxv2f32(float %phi.scalar, <vscale x 2 x float> %vec)
%acc.vec = insertelement <vscale x 2 x float> poison, float %acc, i64 0

This eliminates the scalar -> vector -> scalar crossing during instruction selection.

Note

The tests were added in a separate commit in this PR, in case they need review too. The codegen diff should be viewable between the two commits.

@llvmbot (Collaborator) commented Jan 16, 2024

@llvm/pr-subscribers-backend-risc-v

Author: Luke Lau (lukel97)

Changes

LLVM vector reduction intrinsics return a scalar result, but on RISC-V the vector reduction instructions write the result to the first element of a vector register. So when a reduction in a loop uses a scalar phi, we end up with unnecessary scalar moves:

loop:
    vfmv.s.f v10, fa0
    vfredosum.vs v8, v8, v10
    vfmv.f.s fa0, v8

This mainly affects ordered fadd reductions, since other types of reduction typically use element-wise vectorisation in the loop body. This tries to vectorize any scalar phis that feed into a fadd reduction in RISCVCodeGenPrepare, converting:

loop:
%phi = phi float [ ..., %entry ], [ %acc, %loop ]
%acc = call float @llvm.vector.reduce.fadd.nxv2f32(float %phi, <vscale x 2 x float> %vec)

to

loop:
%phi = phi <vscale x 2 x float> [ ..., %entry ], [ %acc.vec, %loop ]
%phi.scalar = extractelement <vscale x 2 x float> %phi, i64 0
%acc = call float @llvm.vector.reduce.fadd.nxv2f32(float %phi.scalar, <vscale x 2 x float> %vec)
%acc.vec = insertelement <vscale x 2 x float> poison, float %acc, i64 0

This eliminates the scalar -> vector -> scalar crossing during instruction selection.

> [!NOTE]
> The tests were added in a separate commit in this PR, in case they need review too. The codegen diff should be viewable between the two commits.


Full diff: https://github.com/llvm/llvm-project/pull/78244.diff

3 Files Affected:

  • (modified) llvm/lib/Target/RISCV/RISCVCodeGenPrepare.cpp (+56-1)
  • (added) llvm/test/CodeGen/RISCV/rvv/riscv-codegenprepare-asm.ll (+133)
  • (added) llvm/test/CodeGen/RISCV/rvv/riscv-codegenprepare.ll (+109)
diff --git a/llvm/lib/Target/RISCV/RISCVCodeGenPrepare.cpp b/llvm/lib/Target/RISCV/RISCVCodeGenPrepare.cpp
index f9d8401bab7b39..9f7486fc7d68e1 100644
--- a/llvm/lib/Target/RISCV/RISCVCodeGenPrepare.cpp
+++ b/llvm/lib/Target/RISCV/RISCVCodeGenPrepare.cpp
@@ -18,9 +18,10 @@
 #include "llvm/ADT/Statistic.h"
 #include "llvm/Analysis/ValueTracking.h"
 #include "llvm/CodeGen/TargetPassConfig.h"
+#include "llvm/IR/IRBuilder.h"
 #include "llvm/IR/InstVisitor.h"
-#include "llvm/IR/PatternMatch.h"
 #include "llvm/InitializePasses.h"
+#include "llvm/IR/Intrinsics.h"
 #include "llvm/Pass.h"
 
 using namespace llvm;
@@ -51,6 +52,7 @@ class RISCVCodeGenPrepare : public FunctionPass,
 
   bool visitInstruction(Instruction &I) { return false; }
   bool visitAnd(BinaryOperator &BO);
+  bool visitIntrinsicInst(IntrinsicInst &I);
 };
 
 } // end anonymous namespace
@@ -103,6 +105,59 @@ bool RISCVCodeGenPrepare::visitAnd(BinaryOperator &BO) {
   return true;
 }
 
+// LLVM vector reduction intrinsics return a scalar result, but on RISC-V vector
+// reduction instructions write the result in the first element of a vector
+// register. So when a reduction in a loop uses a scalar phi, we end up with
+// unnecessary scalar moves:
+//
+// loop:
+// vfmv.s.f v10, fa0
+// vfredosum.vs v8, v8, v10
+// vfmv.f.s fa0, v8
+//
+// This mainly affects ordered fadd reductions, since other types of reduction
+// typically use element-wise vectorisation in the loop body. This tries to
+// vectorize any scalar phis that feed into a fadd reduction:
+//
+// loop:
+// %phi = phi <float> [ ..., %entry ], [ %acc, %loop]
+// %acc = call float @llvm.vector.reduce.fadd.nxv4f32(float %phi, <vscale x 2 x float> %vec)
+//
+// ->
+//
+// loop:
+// %phi = phi <vscale x 2 x float> [ ..., %entry ], [ %acc.vec, %loop]
+// %phi.scalar = extractelement <vscale x 2 x float> %phi, i64 0
+// %acc = call float @llvm.vector.reduce.fadd.nxv4f32(float %x, <vscale x 2 x float> %vec)
+// %acc.vec = insertelement <vscale x 2 x float> poison, float %acc.next, i64 0
+//
+// Which eliminates the scalar -> vector -> scalar crossing during instruction
+// selection.
+bool RISCVCodeGenPrepare::visitIntrinsicInst(IntrinsicInst &I) {
+  if (I.getIntrinsicID() != Intrinsic::vector_reduce_fadd)
+    return false;
+
+  auto *PHI = dyn_cast<PHINode>(I.getOperand(0));
+  if (!PHI || !llvm::is_contained(PHI->incoming_values(), &I))
+    return false;
+
+  Type *VecTy = I.getOperand(1)->getType();
+  IRBuilder<> Builder(PHI);
+  auto *VecPHI = Builder.CreatePHI(VecTy, PHI->getNumIncomingValues());
+
+  for (auto *BB : PHI->blocks()) {
+    Builder.SetInsertPoint(BB->getTerminator());
+    Value *InsertElt = Builder.CreateInsertElement(
+        VecTy, PHI->getIncomingValueForBlock(BB), (uint64_t)0);
+    VecPHI->addIncoming(InsertElt, BB);
+  }
+
+  Builder.SetInsertPoint(&I);
+  I.setOperand(0, Builder.CreateExtractElement(VecPHI, (uint64_t)0));
+
+  return true;
+}
+
 bool RISCVCodeGenPrepare::runOnFunction(Function &F) {
   if (skipFunction(F))
     return false;
diff --git a/llvm/test/CodeGen/RISCV/rvv/riscv-codegenprepare-asm.ll b/llvm/test/CodeGen/RISCV/rvv/riscv-codegenprepare-asm.ll
new file mode 100644
index 00000000000000..6f8aa6e5a441d8
--- /dev/null
+++ b/llvm/test/CodeGen/RISCV/rvv/riscv-codegenprepare-asm.ll
@@ -0,0 +1,133 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 4
+; RUN: llc < %s -mtriple=riscv64 -mattr=+v | FileCheck %s
+
+declare i64 @llvm.vscale.i64()
+declare float @llvm.vector.reduce.fadd.nxv4f32(float, <vscale x 4 x float>)
+
+define float @reduce_fadd(ptr nocapture noundef readonly %f, i32 noundef signext %N) {
+; CHECK-LABEL: reduce_fadd:
+; CHECK:       # %bb.0: # %entry
+; CHECK-NEXT:    blez a1, .LBB0_3
+; CHECK-NEXT:  # %bb.1: # %for.body.preheader
+; CHECK-NEXT:    addi sp, sp, -48
+; CHECK-NEXT:    .cfi_def_cfa_offset 48
+; CHECK-NEXT:    sd ra, 40(sp) # 8-byte Folded Spill
+; CHECK-NEXT:    sd s0, 32(sp) # 8-byte Folded Spill
+; CHECK-NEXT:    sd s1, 24(sp) # 8-byte Folded Spill
+; CHECK-NEXT:    sd s2, 16(sp) # 8-byte Folded Spill
+; CHECK-NEXT:    sd s3, 8(sp) # 8-byte Folded Spill
+; CHECK-NEXT:    .cfi_offset ra, -8
+; CHECK-NEXT:    .cfi_offset s0, -16
+; CHECK-NEXT:    .cfi_offset s1, -24
+; CHECK-NEXT:    .cfi_offset s2, -32
+; CHECK-NEXT:    .cfi_offset s3, -40
+; CHECK-NEXT:    csrr s1, vlenb
+; CHECK-NEXT:    srli s0, s1, 1
+; CHECK-NEXT:    bgeu a1, s0, .LBB0_4
+; CHECK-NEXT:  # %bb.2:
+; CHECK-NEXT:    li a2, 0
+; CHECK-NEXT:    fmv.w.x fa0, zero
+; CHECK-NEXT:    j .LBB0_7
+; CHECK-NEXT:  .LBB0_3:
+; CHECK-NEXT:    fmv.w.x fa0, zero
+; CHECK-NEXT:    ret
+; CHECK-NEXT:  .LBB0_4: # %vector.ph
+; CHECK-NEXT:    srli a2, s1, 3
+; CHECK-NEXT:    lui a3, 524288
+; CHECK-NEXT:    addiw a3, a3, -4
+; CHECK-NEXT:    mv s2, a0
+; CHECK-NEXT:    mv a0, a2
+; CHECK-NEXT:    mv s3, a1
+; CHECK-NEXT:    mv a1, a3
+; CHECK-NEXT:    call __muldi3
+; CHECK-NEXT:    mv a1, s3
+; CHECK-NEXT:    mv a2, a0
+; CHECK-NEXT:    mv a0, s2
+; CHECK-NEXT:    and a2, a2, s3
+; CHECK-NEXT:    vsetvli a3, zero, e32, m1, ta, ma
+; CHECK-NEXT:    vmv.s.x v8, zero
+; CHECK-NEXT:    slli a3, s1, 1
+; CHECK-NEXT:    mv a4, s2
+; CHECK-NEXT:    mv a5, a2
+; CHECK-NEXT:  .LBB0_5: # %vector.body
+; CHECK-NEXT:    # =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    vl2re32.v v10, (a4)
+; CHECK-NEXT:    vsetvli a6, zero, e32, m2, ta, ma
+; CHECK-NEXT:    vfredosum.vs v8, v10, v8
+; CHECK-NEXT:    sub a5, a5, s0
+; CHECK-NEXT:    add a4, a4, a3
+; CHECK-NEXT:    bnez a5, .LBB0_5
+; CHECK-NEXT:  # %bb.6: # %middle.block
+; CHECK-NEXT:    vfmv.f.s fa0, v8
+; CHECK-NEXT:    beq a2, a1, .LBB0_9
+; CHECK-NEXT:  .LBB0_7: # %for.body.preheader7
+; CHECK-NEXT:    slli a2, a2, 2
+; CHECK-NEXT:    add a2, a0, a2
+; CHECK-NEXT:    slli a1, a1, 2
+; CHECK-NEXT:    add a0, a0, a1
+; CHECK-NEXT:  .LBB0_8: # %for.body
+; CHECK-NEXT:    # =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    flw fa5, 0(a2)
+; CHECK-NEXT:    addi a2, a2, 4
+; CHECK-NEXT:    fadd.s fa0, fa0, fa5
+; CHECK-NEXT:    bne a2, a0, .LBB0_8
+; CHECK-NEXT:  .LBB0_9:
+; CHECK-NEXT:    ld ra, 40(sp) # 8-byte Folded Reload
+; CHECK-NEXT:    ld s0, 32(sp) # 8-byte Folded Reload
+; CHECK-NEXT:    ld s1, 24(sp) # 8-byte Folded Reload
+; CHECK-NEXT:    ld s2, 16(sp) # 8-byte Folded Reload
+; CHECK-NEXT:    ld s3, 8(sp) # 8-byte Folded Reload
+; CHECK-NEXT:    addi sp, sp, 48
+; CHECK-NEXT:    ret
+entry:
+  %cmp4 = icmp sgt i32 %N, 0
+  br i1 %cmp4, label %for.body.preheader, label %for.cond.cleanup
+
+for.body.preheader:                               ; preds = %entry
+  %wide.trip.count = zext nneg i32 %N to i64
+  %0 = tail call i64 @llvm.vscale.i64()
+  %1 = shl nuw nsw i64 %0, 2
+  %min.iters.check = icmp ugt i64 %1, %wide.trip.count
+  br i1 %min.iters.check, label %for.body.preheader7, label %vector.ph
+
+vector.ph:                                        ; preds = %for.body.preheader
+  %2 = tail call i64 @llvm.vscale.i64()
+  %.neg = mul nuw nsw i64 %2, 2147483644
+  %n.vec = and i64 %.neg, %wide.trip.count
+  %3 = tail call i64 @llvm.vscale.i64()
+  %4 = shl nuw nsw i64 %3, 2
+  br label %vector.body
+
+vector.body:                                      ; preds = %vector.body, %vector.ph
+  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
+  %vec.phi = phi float [ 0.000000e+00, %vector.ph ], [ %6, %vector.body ]
+  %5 = getelementptr inbounds float, ptr %f, i64 %index
+  %wide.load = load <vscale x 4 x float>, ptr %5, align 4
+  %6 = tail call float @llvm.vector.reduce.fadd.nxv4f32(float %vec.phi, <vscale x 4 x float> %wide.load)
+  %index.next = add nuw i64 %index, %4
+  %7 = icmp eq i64 %index.next, %n.vec
+  br i1 %7, label %middle.block, label %vector.body
+
+middle.block:                                     ; preds = %vector.body
+  %cmp.n = icmp eq i64 %n.vec, %wide.trip.count
+  br i1 %cmp.n, label %for.cond.cleanup, label %for.body.preheader7
+
+for.body.preheader7:                              ; preds = %for.body.preheader, %middle.block
+  %indvars.iv.ph = phi i64 [ 0, %for.body.preheader ], [ %n.vec, %middle.block ]
+  %sum.05.ph = phi float [ 0.000000e+00, %for.body.preheader ], [ %6, %middle.block ]
+  br label %for.body
+
+for.cond.cleanup:                                 ; preds = %for.body, %middle.block, %entry
+  %sum.0.lcssa = phi float [ 0.000000e+00, %entry ], [ %6, %middle.block ], [ %add, %for.body ]
+  ret float %sum.0.lcssa
+
+for.body:                                         ; preds = %for.body.preheader7, %for.body
+  %indvars.iv = phi i64 [ %indvars.iv.next, %for.body ], [ %indvars.iv.ph, %for.body.preheader7 ]
+  %sum.05 = phi float [ %add, %for.body ], [ %sum.05.ph, %for.body.preheader7 ]
+  %arrayidx = getelementptr inbounds float, ptr %f, i64 %indvars.iv
+  %8 = load float, ptr %arrayidx, align 4
+  %add = fadd float %sum.05, %8
+  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+  %exitcond.not = icmp eq i64 %indvars.iv.next, %wide.trip.count
+  br i1 %exitcond.not, label %for.cond.cleanup, label %for.body
+}
diff --git a/llvm/test/CodeGen/RISCV/rvv/riscv-codegenprepare.ll b/llvm/test/CodeGen/RISCV/rvv/riscv-codegenprepare.ll
new file mode 100644
index 00000000000000..bf4aa35ae8eb25
--- /dev/null
+++ b/llvm/test/CodeGen/RISCV/rvv/riscv-codegenprepare.ll
@@ -0,0 +1,109 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 4
+; RUN: opt %s -S -riscv-codegenprepare -mtriple=riscv64 -mattr=+v | FileCheck %s
+
+declare i64 @llvm.vscale.i64()
+declare float @llvm.vector.reduce.fadd.nxv4f32(float, <vscale x 4 x float>)
+
+define float @reduce_fadd(ptr nocapture noundef readonly %f, i32 noundef signext %N) {
+; CHECK-LABEL: define float @reduce_fadd(
+; CHECK-SAME: ptr nocapture noundef readonly [[F:%.*]], i32 noundef signext [[N:%.*]]) #[[ATTR2:[0-9]+]] {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[CMP4:%.*]] = icmp sgt i32 [[N]], 0
+; CHECK-NEXT:    br i1 [[CMP4]], label [[FOR_BODY_PREHEADER:%.*]], label [[FOR_COND_CLEANUP:%.*]]
+; CHECK:       for.body.preheader:
+; CHECK-NEXT:    [[WIDE_TRIP_COUNT:%.*]] = zext nneg i32 [[N]] to i64
+; CHECK-NEXT:    [[TMP0:%.*]] = tail call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP1:%.*]] = shl nuw nsw i64 [[TMP0]], 2
+; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ugt i64 [[TMP1]], [[WIDE_TRIP_COUNT]]
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label [[FOR_BODY_PREHEADER7:%.*]], label [[VECTOR_PH:%.*]]
+; CHECK:       vector.ph:
+; CHECK-NEXT:    [[TMP2:%.*]] = tail call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[DOTNEG:%.*]] = mul nuw nsw i64 [[TMP2]], 2147483644
+; CHECK-NEXT:    [[N_VEC:%.*]] = and i64 [[DOTNEG]], [[WIDE_TRIP_COUNT]]
+; CHECK-NEXT:    [[TMP3:%.*]] = tail call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP4:%.*]] = shl nuw nsw i64 [[TMP3]], 2
+; CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
+; CHECK:       vector.body:
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[TMP5:%.*]] = phi <vscale x 4 x float> [ insertelement (<vscale x 4 x float> poison, float 0.000000e+00, i64 0), [[VECTOR_PH]] ], [ [[TMP10:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI:%.*]] = phi float [ 0.000000e+00, [[VECTOR_PH]] ], [ [[TMP8:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[TMP6:%.*]] = getelementptr inbounds float, ptr [[F]], i64 [[INDEX]]
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <vscale x 4 x float>, ptr [[TMP6]], align 4
+; CHECK-NEXT:    [[TMP7:%.*]] = extractelement <vscale x 4 x float> [[TMP5]], i64 0
+; CHECK-NEXT:    [[TMP8]] = tail call float @llvm.vector.reduce.fadd.nxv4f32(float [[TMP7]], <vscale x 4 x float> [[WIDE_LOAD]])
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP4]]
+; CHECK-NEXT:    [[TMP9:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:    [[TMP10]] = insertelement <vscale x 4 x float> poison, float [[TMP8]], i64 0
+; CHECK-NEXT:    br i1 [[TMP9]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]]
+; CHECK:       middle.block:
+; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[N_VEC]], [[WIDE_TRIP_COUNT]]
+; CHECK-NEXT:    br i1 [[CMP_N]], label [[FOR_COND_CLEANUP]], label [[FOR_BODY_PREHEADER7]]
+; CHECK:       for.body.preheader7:
+; CHECK-NEXT:    [[INDVARS_IV_PH:%.*]] = phi i64 [ 0, [[FOR_BODY_PREHEADER]] ], [ [[N_VEC]], [[MIDDLE_BLOCK]] ]
+; CHECK-NEXT:    [[SUM_05_PH:%.*]] = phi float [ 0.000000e+00, [[FOR_BODY_PREHEADER]] ], [ [[TMP8]], [[MIDDLE_BLOCK]] ]
+; CHECK-NEXT:    br label [[FOR_BODY:%.*]]
+; CHECK:       for.cond.cleanup:
+; CHECK-NEXT:    [[SUM_0_LCSSA:%.*]] = phi float [ 0.000000e+00, [[ENTRY:%.*]] ], [ [[TMP8]], [[MIDDLE_BLOCK]] ], [ [[ADD:%.*]], [[FOR_BODY]] ]
+; CHECK-NEXT:    ret float [[SUM_0_LCSSA]]
+; CHECK:       for.body:
+; CHECK-NEXT:    [[INDVARS_IV:%.*]] = phi i64 [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ], [ [[INDVARS_IV_PH]], [[FOR_BODY_PREHEADER7]] ]
+; CHECK-NEXT:    [[SUM_05:%.*]] = phi float [ [[ADD]], [[FOR_BODY]] ], [ [[SUM_05_PH]], [[FOR_BODY_PREHEADER7]] ]
+; CHECK-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds float, ptr [[F]], i64 [[INDVARS_IV]]
+; CHECK-NEXT:    [[TMP11:%.*]] = load float, ptr [[ARRAYIDX]], align 4
+; CHECK-NEXT:    [[ADD]] = fadd float [[SUM_05]], [[TMP11]]
+; CHECK-NEXT:    [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
+; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], [[WIDE_TRIP_COUNT]]
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label [[FOR_COND_CLEANUP]], label [[FOR_BODY]]
+;
+entry:
+  %cmp4 = icmp sgt i32 %N, 0
+  br i1 %cmp4, label %for.body.preheader, label %for.cond.cleanup
+
+for.body.preheader:                               ; preds = %entry
+  %wide.trip.count = zext nneg i32 %N to i64
+  %0 = tail call i64 @llvm.vscale.i64()
+  %1 = shl nuw nsw i64 %0, 2
+  %min.iters.check = icmp ugt i64 %1, %wide.trip.count
+  br i1 %min.iters.check, label %for.body.preheader7, label %vector.ph
+
+vector.ph:                                        ; preds = %for.body.preheader
+  %2 = tail call i64 @llvm.vscale.i64()
+  %.neg = mul nuw nsw i64 %2, 2147483644
+  %n.vec = and i64 %.neg, %wide.trip.count
+  %3 = tail call i64 @llvm.vscale.i64()
+  %4 = shl nuw nsw i64 %3, 2
+  br label %vector.body
+
+vector.body:                                      ; preds = %vector.body, %vector.ph
+  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
+  %vec.phi = phi float [ 0.000000e+00, %vector.ph ], [ %6, %vector.body ]
+  %5 = getelementptr inbounds float, ptr %f, i64 %index
+  %wide.load = load <vscale x 4 x float>, ptr %5, align 4
+  %6 = tail call float @llvm.vector.reduce.fadd.nxv4f32(float %vec.phi, <vscale x 4 x float> %wide.load)
+  %index.next = add nuw i64 %index, %4
+  %7 = icmp eq i64 %index.next, %n.vec
+  br i1 %7, label %middle.block, label %vector.body
+
+middle.block:                                     ; preds = %vector.body
+  %cmp.n = icmp eq i64 %n.vec, %wide.trip.count
+  br i1 %cmp.n, label %for.cond.cleanup, label %for.body.preheader7
+
+for.body.preheader7:                              ; preds = %for.body.preheader, %middle.block
+  %indvars.iv.ph = phi i64 [ 0, %for.body.preheader ], [ %n.vec, %middle.block ]
+  %sum.05.ph = phi float [ 0.000000e+00, %for.body.preheader ], [ %6, %middle.block ]
+  br label %for.body
+
+for.cond.cleanup:                                 ; preds = %for.body, %middle.block, %entry
+  %sum.0.lcssa = phi float [ 0.000000e+00, %entry ], [ %6, %middle.block ], [ %add, %for.body ]
+  ret float %sum.0.lcssa
+
+for.body:                                         ; preds = %for.body.preheader7, %for.body
+  %indvars.iv = phi i64 [ %indvars.iv.next, %for.body ], [ %indvars.iv.ph, %for.body.preheader7 ]
+  %sum.05 = phi float [ %add, %for.body ], [ %sum.05.ph, %for.body.preheader7 ]
+  %arrayidx = getelementptr inbounds float, ptr %f, i64 %indvars.iv
+  %8 = load float, ptr %arrayidx, align 4
+  %add = fadd float %sum.05, %8
+  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+  %exitcond.not = icmp eq i64 %indvars.iv.next, %wide.trip.count
+  br i1 %exitcond.not, label %for.cond.cleanup, label %for.body
+}


github-actions bot commented Jan 16, 2024

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff a02a0e806fab01f4cf4307443cdaed76a2488752 85b741017544c63b3da591929d08070fb42f5fe6 -- llvm/lib/Target/RISCV/RISCVCodeGenPrepare.cpp
View the diff from clang-format here.
diff --git a/llvm/lib/Target/RISCV/RISCVCodeGenPrepare.cpp b/llvm/lib/Target/RISCV/RISCVCodeGenPrepare.cpp
index 6434532afd..278f2b2654 100644
--- a/llvm/lib/Target/RISCV/RISCVCodeGenPrepare.cpp
+++ b/llvm/lib/Target/RISCV/RISCVCodeGenPrepare.cpp
@@ -121,15 +121,17 @@ bool RISCVCodeGenPrepare::visitAnd(BinaryOperator &BO) {
 //
 // loop:
 // %phi = phi <float> [ ..., %entry ], [ %acc, %loop ]
-// %acc = call float @llvm.vector.reduce.fadd.nxv4f32(float %phi, <vscale x 2 x float> %vec)
+// %acc = call float @llvm.vector.reduce.fadd.nxv4f32(float %phi, <vscale x 2 x
+// float> %vec)
 //
 // ->
 //
 // loop:
 // %phi = phi <vscale x 2 x float> [ ..., %entry ], [ %acc.vec, %loop ]
 // %phi.scalar = extractelement <vscale x 2 x float> %phi, i64 0
-// %acc = call float @llvm.vector.reduce.fadd.nxv4f32(float %x, <vscale x 2 x float> %vec)
-// %acc.vec = insertelement <vscale x 2 x float> poison, float %acc.next, i64 0
+// %acc = call float @llvm.vector.reduce.fadd.nxv4f32(float %x, <vscale x 2 x
+// float> %vec) %acc.vec = insertelement <vscale x 2 x float> poison, float
+// %acc.next, i64 0
 //
 // Which eliminates the scalar -> vector -> scalar crossing during instruction
 // selection.

; CHECK-NEXT: vfredosum.vs v8, v8, v10
; CHECK-NEXT: vfmv.f.s fa0, v8
; CHECK-NEXT: vl2re32.v v10, (a4)
; CHECK-NEXT: vsetvli a6, zero, e32, m2, ta, ma
@lukel97 (Contributor, Author), Jan 16, 2024

This looks like a gap in RISCVInsertVSETVLI. Previously we were doing PRE, but now that we have a vector instruction (vmv.s.x) in the predecessor, it no longer kicks in.

This happens with and without 286a366

Resolved review thread on llvm/lib/Target/RISCV/RISCVCodeGenPrepare.cpp (outdated)
return false;

auto *PHI = dyn_cast<PHINode>(I.getOperand(0));
if (!PHI || !llvm::is_contained(PHI->incoming_values(), &I))
Member:

Should we handle cases where Op0 is not an indvar?
Example: https://godbolt.org/z/aTYcocfee

lukel97 (Contributor, Author):

Sure, we could try to generalise it to all reduction intrinsics then. We would need to double check that wrapping every reduction in extract_element/insert_element doesn't negatively affect codegen in other places though.

lukel97 (Contributor, Author):

Actually, ignore the above. I think llvm.vector.reduce.fadd is the only applicable intrinsic here: we expand llvm.vector.reduce.fmul, and llvm.vector.reduce.fmax/fmin don't take an accumulator argument.
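
For reference, the relevant intrinsic signatures (as documented in the LangRef) show why: only the fadd and fmul reductions take a leading scalar start/accumulator operand, while fmax/fmin take only a vector.

declare float @llvm.vector.reduce.fadd.nxv4f32(float, <vscale x 4 x float>)  ; scalar start value, then vector
declare float @llvm.vector.reduce.fmul.nxv4f32(float, <vscale x 4 x float>)  ; takes a start value, but is expanded on RISC-V
declare float @llvm.vector.reduce.fmax.nxv4f32(<vscale x 4 x float>)         ; no accumulator operand
declare float @llvm.vector.reduce.fmin.nxv4f32(<vscale x 4 x float>)         ; no accumulator operand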

Resolved review thread on llvm/lib/Target/RISCV/RISCVCodeGenPrepare.cpp
@@ -157,6 +155,9 @@ bool RISCVCodeGenPrepare::visitIntrinsicInst(IntrinsicInst &I) {
Builder.SetInsertPoint(&I);
I.setOperand(0, Builder.CreateExtractElement(VecPHI, (uint64_t)0));

if (PHI->hasNUses(0))
Collaborator:

use_empty()?


for (auto *BB : PHI->blocks()) {
Builder.SetInsertPoint(BB->getTerminator());
Value *InsertElt = Builder.CreateInsertElement(
@topperc (Collaborator), Jan 16, 2024

Doesn't look like we verified that we're in a loop. If the phi is just the merge of control flow from an if/else or something, does this unnecessarily hoist an insertelement into the if/else?
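
A hedged sketch (hypothetical IR, not from the patch's tests) of the kind of non-loop merge being asked about: the phi merges two values from an if/else and feeds the reduction, so a transform without a loop/cycle check would have to place an insertelement into each arm.

declare float @llvm.vector.reduce.fadd.nxv4f32(float, <vscale x 4 x float>)

define float @diamond(i1 %c, float %a, float %b, <vscale x 4 x float> %v) {
entry:
  br i1 %c, label %then, label %else

then:
  br label %merge

else:
  br label %merge

merge:
  ; the reduction is not one of the phi's incoming values here
  %start = phi float [ %a, %then ], [ %b, %else ]
  %red = call float @llvm.vector.reduce.fadd.nxv4f32(float %start, <vscale x 4 x float> %v)
  ret float %red
}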

lukel97 (Contributor, Author):

The initial revision of this PR checked that the reduction was one of the incoming values in phi: 05488f6

but I relaxed it to address the comment in #78244 (comment)

I haven't included any tests for non-loop control flow though. @dtcxzyw should we maybe start with this patch restricted to loops and then relax it in a follow up patch?

Collaborator:

Please revert to the loop-specific version. Please also restrict this, in the initial patch, to the case where the only use of the phi is the intrinsic. Let's start with something very narrowly focused on the motivating case and generalize in separate patches if desired.
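
As an illustration of that restriction (hypothetical IR, not one of the committed tests), a phi like the one below has a second use besides the reduction, so the narrowed patch would leave it alone rather than introduce an extra extractelement for the store.

declare float @llvm.vector.reduce.fadd.nxv4f32(float, <vscale x 4 x float>)

define void @phi_with_extra_use(ptr %p, <vscale x 4 x float> %v, i64 %n) {
entry:
  br label %loop

loop:
  %i = phi i64 [ 0, %entry ], [ %i.next, %loop ]
  %acc = phi float [ 0.000000e+00, %entry ], [ %acc.next, %loop ]
  store float %acc, ptr %p                    ; second use of %acc besides the reduction
  %acc.next = call float @llvm.vector.reduce.fadd.nxv4f32(float %acc, <vscale x 4 x float> %v)
  %i.next = add nuw i64 %i, 1
  %done = icmp eq i64 %i.next, %n
  br i1 %done, label %exit, label %loop

exit:
  ret void
}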

Member:

I haven't included any tests for non-loop control flow though. @dtcxzyw should we maybe start with this patch restricted to loops and then relax it in a follow up patch?

Yeah, we can generalize it in the future if we find it exists in some real-world applications.

@preames (Collaborator) left a comment

LGTM w/minor comment addressed.

; CHECK-NEXT: addi sp, sp, 48
; CHECK-NEXT: ret
entry:
%cmp4 = icmp sgt i32 %N, 0
Collaborator:

For both of these tests, you don't need the whole loop structure emitted by the loop vectorizer. You just need the main vector loop.
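
A hedged sketch of what such a reduced test might look like, keeping only the main vector loop (the function name and the trip-count/VF parameters are illustrative, not the committed test):

declare float @llvm.vector.reduce.fadd.nxv4f32(float, <vscale x 4 x float>)

define float @reduce_fadd_vector_loop_only(ptr %f, i64 %n.vec, i64 %vf) {
entry:
  br label %vector.body

vector.body:
  %index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]
  %vec.phi = phi float [ 0.000000e+00, %entry ], [ %acc, %vector.body ]
  %gep = getelementptr inbounds float, ptr %f, i64 %index
  %wide.load = load <vscale x 4 x float>, ptr %gep, align 4
  %acc = call float @llvm.vector.reduce.fadd.nxv4f32(float %vec.phi, <vscale x 4 x float> %wide.load)
  %index.next = add nuw i64 %index, %vf
  %done = icmp eq i64 %index.next, %n.vec
  br i1 %done, label %exit, label %vector.body

exit:
  ret float %acc
}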


auto *PHI = dyn_cast<PHINode>(I.getOperand(0));
if (!PHI || !PHI->hasOneUse() ||
!llvm::is_contained(PHI->incoming_values(), &I))
Collaborator:

Just for the record... This isn't checking that the phi is in a loop. It's checking it is in a cycle, which is not quite the same thing. In particular, we allow both non-loop cycles, and loops in unreachable code which aren't "loops". Not a problem here, just pointing it out for future reference.
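
A hedged example of that distinction (hypothetical IR, not in the tests): a block that is unreachable from the entry but branches to itself still forms a cycle, so the incoming-value check would accept it even though no loop ever runs.

declare float @llvm.vector.reduce.fadd.nxv4f32(float, <vscale x 4 x float>)

define float @dead_cycle(<vscale x 4 x float> %v) {
entry:
  ret float 0.000000e+00

dead:                                 ; unreachable, but its only predecessor is itself
  %phi = phi float [ %acc, %dead ]
  %acc = call float @llvm.vector.reduce.fadd.nxv4f32(float %phi, <vscale x 4 x float> %v)
  br label %dead
}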

@lukel97 lukel97 merged commit 15b0fab into llvm:main Jan 18, 2024
2 of 4 checks passed
ampandey-1995 pushed a commit to ampandey-1995/llvm-project that referenced this pull request Jan 19, 2024
…#78244)

LLVM vector reduction intrinsics return a scalar result, but on RISC-V
vector reduction instructions write the result in the first element of a
vector register. So when a reduction in a loop uses a scalar phi, we end
up with unnecessary scalar moves:

loop:
    vfmv.s.f v10, fa0
    vfredosum.vs v8, v8, v10
    vfmv.f.s fa0, v8

This mainly affects ordered fadd reductions, which have a scalar accumulator
operand. This tries to vectorize any scalar phis that feed into a fadd reduction
in RISCVCodeGenPrepare, converting:

loop:
%phi = phi float [ ..., %entry ], [ %acc, %loop ]
%acc = call float @llvm.vector.reduce.fadd.nxv2f32(float %phi, <vscale x 2 x float> %vec)

to

loop:
%phi = phi <vscale x 2 x float> [ ..., %entry ], [ %acc.vec, %loop]
%phi.scalar = extractelement <vscale x 2 x float> %phi, i64 0
%acc = call float @llvm.vector.reduce.fadd.nxv2f32(float %phi.scalar, <vscale x 2 x float> %vec)
%acc.vec = insertelement <vscale x 2 x float> poison, float %acc, i64 0

This eliminates the scalar -> vector -> scalar crossing during
instruction selection.