
[RISCV] Vectorize phi for loop carried @llvm.vector.reduce.fadd #78244

Merged: 5 commits, Jan 18, 2024

Conversation

@lukel97 (Contributor) commented Jan 16, 2024

LLVM vector reduction intrinsics return a scalar result, but on RISC-V the vector reduction instructions write the result to the first element of a vector register. So when a reduction in a loop uses a scalar phi, we end up with unnecessary scalar moves:

loop:
    vfmv.s.f v10, fa0
    vfredosum.vs v8, v8, v10
    vfmv.f.s fa0, v8

This mainly affects ordered fadd reductions, since other types of reduction typically use element-wise vectorisation in the loop body. This tries to vectorize any scalar phis that feed into a fadd reduction in RISCVCodeGenPrepare, converting:

loop:
%phi = phi float [ ..., %entry ], [ %acc, %loop ]
%acc = call float @llvm.vector.reduce.fadd.nxv2f32(float %phi, <vscale x 2 x float> %vec)

to

loop:
%phi = phi <vscale x 2 x float> [ ..., %entry ], [ %acc.vec, %loop ]
%phi.scalar = extractelement <vscale x 2 x float> %phi, i64 0
%acc = call float @llvm.vector.reduce.fadd.nxv2f32(float %phi.scalar, <vscale x 2 x float> %vec)
%acc.vec = insertelement <vscale x 2 x float> poison, float %acc, i64 0

This eliminates the scalar -> vector -> scalar crossing during instruction selection.

Note

The tests were added in a separate commit in this PR, in case they need review too. The codegen diff should be viewable between the two commits.

@llvmbot (Collaborator) commented Jan 16, 2024

@llvm/pr-subscribers-backend-risc-v

Author: Luke Lau (lukel97)

Changes

LLVM vector reduction intrinsics return a scalar result, but on RISC-V the vector reduction instructions write the result to the first element of a vector register. So when a reduction in a loop uses a scalar phi, we end up with unnecessary scalar moves:

loop:
    vfmv.s.f v10, fa0
    vfredosum.vs v8, v8, v10
    vfmv.f.s fa0, v8

This mainly affects ordered fadd reductions, since other types of reduction typically use element-wise vectorisation in the loop body. This tries to vectorize any scalar phis that feed into a fadd reduction in RISCVCodeGenPrepare, converting:

loop:
%phi = phi float [ ..., %entry ], [ %acc, %loop ]
%acc = call float @llvm.vector.reduce.fadd.nxv2f32(float %phi, <vscale x 2 x float> %vec)

to

loop:
%phi = phi <vscale x 2 x float> [ ..., %entry ], [ %acc.vec, %loop ]
%phi.scalar = extractelement <vscale x 2 x float> %phi, i64 0
%acc = call float @llvm.vector.reduce.fadd.nxv2f32(float %phi.scalar, <vscale x 2 x float> %vec)
%acc.vec = insertelement <vscale x 2 x float> poison, float %acc, i64 0

This eliminates the scalar -> vector -> scalar crossing during instruction selection.

> [!NOTE]
> The tests were added in a separate commit in this PR, in case they need review too. The codegen diff should be viewable between the two commits.


Full diff: https://github.com/llvm/llvm-project/pull/78244.diff

3 Files Affected:

  • (modified) llvm/lib/Target/RISCV/RISCVCodeGenPrepare.cpp (+56-1)
  • (added) llvm/test/CodeGen/RISCV/rvv/riscv-codegenprepare-asm.ll (+133)
  • (added) llvm/test/CodeGen/RISCV/rvv/riscv-codegenprepare.ll (+109)
diff --git a/llvm/lib/Target/RISCV/RISCVCodeGenPrepare.cpp b/llvm/lib/Target/RISCV/RISCVCodeGenPrepare.cpp
index f9d8401bab7b39..9f7486fc7d68e1 100644
--- a/llvm/lib/Target/RISCV/RISCVCodeGenPrepare.cpp
+++ b/llvm/lib/Target/RISCV/RISCVCodeGenPrepare.cpp
@@ -18,9 +18,10 @@
 #include "llvm/ADT/Statistic.h"
 #include "llvm/Analysis/ValueTracking.h"
 #include "llvm/CodeGen/TargetPassConfig.h"
+#include "llvm/IR/IRBuilder.h"
 #include "llvm/IR/InstVisitor.h"
-#include "llvm/IR/PatternMatch.h"
 #include "llvm/InitializePasses.h"
+#include "llvm/IR/Intrinsics.h"
 #include "llvm/Pass.h"
 
 using namespace llvm;
@@ -51,6 +52,7 @@ class RISCVCodeGenPrepare : public FunctionPass,
 
   bool visitInstruction(Instruction &I) { return false; }
   bool visitAnd(BinaryOperator &BO);
+  bool visitIntrinsicInst(IntrinsicInst &I);
 };
 
 } // end anonymous namespace
@@ -103,6 +105,59 @@ bool RISCVCodeGenPrepare::visitAnd(BinaryOperator &BO) {
   return true;
 }
 
+// LLVM vector reduction intrinsics return a scalar result, but on RISC-V vector
+// reduction instructions write the result in the first element of a vector
+// register. So when a reduction in a loop uses a scalar phi, we end up with
+// unnecessary scalar moves:
+//
+// loop:
+// vfmv.s.f v10, fa0
+// vfredosum.vs v8, v8, v10
+// vfmv.f.s fa0, v8
+//
+// This mainly affects ordered fadd reductions, since other types of reduction
+// typically use element-wise vectorisation in the loop body. This tries to
+// vectorize any scalar phis that feed into a fadd reduction:
+//
+// loop:
+// %phi = phi <float> [ ..., %entry ], [ %acc, %loop]
+// %acc = call float @llvm.vector.reduce.fadd.nxv4f32(float %phi, <vscale x 2 x float> %vec)
+//
+// ->
+//
+// loop:
+// %phi = phi <vscale x 2 x float> [ ..., %entry ], [ %acc.vec, %loop]
+// %phi.scalar = extractelement <vscale x 2 x float> %phi, i64 0
+// %acc = call float @llvm.vector.reduce.fadd.nxv4f32(float %x, <vscale x 2 x float> %vec)
+// %acc.vec = insertelement <vscale x 2 x float> poison, float %acc.next, i64 0
+//
+// Which eliminates the scalar -> vector -> scalar crossing during instruction
+// selection.
+bool RISCVCodeGenPrepare::visitIntrinsicInst(IntrinsicInst &I) {
+  if (I.getIntrinsicID() != Intrinsic::vector_reduce_fadd)
+    return false;
+
+  auto *PHI = dyn_cast<PHINode>(I.getOperand(0));
+  if (!PHI || !llvm::is_contained(PHI->incoming_values(), &I))
+    return false;
+
+  Type *VecTy = I.getOperand(1)->getType();
+  IRBuilder<> Builder(PHI);
+  auto *VecPHI = Builder.CreatePHI(VecTy, PHI->getNumIncomingValues());
+
+  for (auto *BB : PHI->blocks()) {
+    Builder.SetInsertPoint(BB->getTerminator());
+    Value *InsertElt = Builder.CreateInsertElement(
+        VecTy, PHI->getIncomingValueForBlock(BB), (uint64_t)0);
+    VecPHI->addIncoming(InsertElt, BB);
+  }
+
+  Builder.SetInsertPoint(&I);
+  I.setOperand(0, Builder.CreateExtractElement(VecPHI, (uint64_t)0));
+
+  return true;
+}
+
 bool RISCVCodeGenPrepare::runOnFunction(Function &F) {
   if (skipFunction(F))
     return false;
diff --git a/llvm/test/CodeGen/RISCV/rvv/riscv-codegenprepare-asm.ll b/llvm/test/CodeGen/RISCV/rvv/riscv-codegenprepare-asm.ll
new file mode 100644
index 00000000000000..6f8aa6e5a441d8
--- /dev/null
+++ b/llvm/test/CodeGen/RISCV/rvv/riscv-codegenprepare-asm.ll
@@ -0,0 +1,133 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 4
+; RUN: llc < %s -mtriple=riscv64 -mattr=+v | FileCheck %s
+
+declare i64 @llvm.vscale.i64()
+declare float @llvm.vector.reduce.fadd.nxv4f32(float, <vscale x 4 x float>)
+
+define float @reduce_fadd(ptr nocapture noundef readonly %f, i32 noundef signext %N) {
+; CHECK-LABEL: reduce_fadd:
+; CHECK:       # %bb.0: # %entry
+; CHECK-NEXT:    blez a1, .LBB0_3
+; CHECK-NEXT:  # %bb.1: # %for.body.preheader
+; CHECK-NEXT:    addi sp, sp, -48
+; CHECK-NEXT:    .cfi_def_cfa_offset 48
+; CHECK-NEXT:    sd ra, 40(sp) # 8-byte Folded Spill
+; CHECK-NEXT:    sd s0, 32(sp) # 8-byte Folded Spill
+; CHECK-NEXT:    sd s1, 24(sp) # 8-byte Folded Spill
+; CHECK-NEXT:    sd s2, 16(sp) # 8-byte Folded Spill
+; CHECK-NEXT:    sd s3, 8(sp) # 8-byte Folded Spill
+; CHECK-NEXT:    .cfi_offset ra, -8
+; CHECK-NEXT:    .cfi_offset s0, -16
+; CHECK-NEXT:    .cfi_offset s1, -24
+; CHECK-NEXT:    .cfi_offset s2, -32
+; CHECK-NEXT:    .cfi_offset s3, -40
+; CHECK-NEXT:    csrr s1, vlenb
+; CHECK-NEXT:    srli s0, s1, 1
+; CHECK-NEXT:    bgeu a1, s0, .LBB0_4
+; CHECK-NEXT:  # %bb.2:
+; CHECK-NEXT:    li a2, 0
+; CHECK-NEXT:    fmv.w.x fa0, zero
+; CHECK-NEXT:    j .LBB0_7
+; CHECK-NEXT:  .LBB0_3:
+; CHECK-NEXT:    fmv.w.x fa0, zero
+; CHECK-NEXT:    ret
+; CHECK-NEXT:  .LBB0_4: # %vector.ph
+; CHECK-NEXT:    srli a2, s1, 3
+; CHECK-NEXT:    lui a3, 524288
+; CHECK-NEXT:    addiw a3, a3, -4
+; CHECK-NEXT:    mv s2, a0
+; CHECK-NEXT:    mv a0, a2
+; CHECK-NEXT:    mv s3, a1
+; CHECK-NEXT:    mv a1, a3
+; CHECK-NEXT:    call __muldi3
+; CHECK-NEXT:    mv a1, s3
+; CHECK-NEXT:    mv a2, a0
+; CHECK-NEXT:    mv a0, s2
+; CHECK-NEXT:    and a2, a2, s3
+; CHECK-NEXT:    vsetvli a3, zero, e32, m1, ta, ma
+; CHECK-NEXT:    vmv.s.x v8, zero
+; CHECK-NEXT:    slli a3, s1, 1
+; CHECK-NEXT:    mv a4, s2
+; CHECK-NEXT:    mv a5, a2
+; CHECK-NEXT:  .LBB0_5: # %vector.body
+; CHECK-NEXT:    # =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    vl2re32.v v10, (a4)
+; CHECK-NEXT:    vsetvli a6, zero, e32, m2, ta, ma
+; CHECK-NEXT:    vfredosum.vs v8, v10, v8
+; CHECK-NEXT:    sub a5, a5, s0
+; CHECK-NEXT:    add a4, a4, a3
+; CHECK-NEXT:    bnez a5, .LBB0_5
+; CHECK-NEXT:  # %bb.6: # %middle.block
+; CHECK-NEXT:    vfmv.f.s fa0, v8
+; CHECK-NEXT:    beq a2, a1, .LBB0_9
+; CHECK-NEXT:  .LBB0_7: # %for.body.preheader7
+; CHECK-NEXT:    slli a2, a2, 2
+; CHECK-NEXT:    add a2, a0, a2
+; CHECK-NEXT:    slli a1, a1, 2
+; CHECK-NEXT:    add a0, a0, a1
+; CHECK-NEXT:  .LBB0_8: # %for.body
+; CHECK-NEXT:    # =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    flw fa5, 0(a2)
+; CHECK-NEXT:    addi a2, a2, 4
+; CHECK-NEXT:    fadd.s fa0, fa0, fa5
+; CHECK-NEXT:    bne a2, a0, .LBB0_8
+; CHECK-NEXT:  .LBB0_9:
+; CHECK-NEXT:    ld ra, 40(sp) # 8-byte Folded Reload
+; CHECK-NEXT:    ld s0, 32(sp) # 8-byte Folded Reload
+; CHECK-NEXT:    ld s1, 24(sp) # 8-byte Folded Reload
+; CHECK-NEXT:    ld s2, 16(sp) # 8-byte Folded Reload
+; CHECK-NEXT:    ld s3, 8(sp) # 8-byte Folded Reload
+; CHECK-NEXT:    addi sp, sp, 48
+; CHECK-NEXT:    ret
+entry:
+  %cmp4 = icmp sgt i32 %N, 0
+  br i1 %cmp4, label %for.body.preheader, label %for.cond.cleanup
+
+for.body.preheader:                               ; preds = %entry
+  %wide.trip.count = zext nneg i32 %N to i64
+  %0 = tail call i64 @llvm.vscale.i64()
+  %1 = shl nuw nsw i64 %0, 2
+  %min.iters.check = icmp ugt i64 %1, %wide.trip.count
+  br i1 %min.iters.check, label %for.body.preheader7, label %vector.ph
+
+vector.ph:                                        ; preds = %for.body.preheader
+  %2 = tail call i64 @llvm.vscale.i64()
+  %.neg = mul nuw nsw i64 %2, 2147483644
+  %n.vec = and i64 %.neg, %wide.trip.count
+  %3 = tail call i64 @llvm.vscale.i64()
+  %4 = shl nuw nsw i64 %3, 2
+  br label %vector.body
+
+vector.body:                                      ; preds = %vector.body, %vector.ph
+  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
+  %vec.phi = phi float [ 0.000000e+00, %vector.ph ], [ %6, %vector.body ]
+  %5 = getelementptr inbounds float, ptr %f, i64 %index
+  %wide.load = load <vscale x 4 x float>, ptr %5, align 4
+  %6 = tail call float @llvm.vector.reduce.fadd.nxv4f32(float %vec.phi, <vscale x 4 x float> %wide.load)
+  %index.next = add nuw i64 %index, %4
+  %7 = icmp eq i64 %index.next, %n.vec
+  br i1 %7, label %middle.block, label %vector.body
+
+middle.block:                                     ; preds = %vector.body
+  %cmp.n = icmp eq i64 %n.vec, %wide.trip.count
+  br i1 %cmp.n, label %for.cond.cleanup, label %for.body.preheader7
+
+for.body.preheader7:                              ; preds = %for.body.preheader, %middle.block
+  %indvars.iv.ph = phi i64 [ 0, %for.body.preheader ], [ %n.vec, %middle.block ]
+  %sum.05.ph = phi float [ 0.000000e+00, %for.body.preheader ], [ %6, %middle.block ]
+  br label %for.body
+
+for.cond.cleanup:                                 ; preds = %for.body, %middle.block, %entry
+  %sum.0.lcssa = phi float [ 0.000000e+00, %entry ], [ %6, %middle.block ], [ %add, %for.body ]
+  ret float %sum.0.lcssa
+
+for.body:                                         ; preds = %for.body.preheader7, %for.body
+  %indvars.iv = phi i64 [ %indvars.iv.next, %for.body ], [ %indvars.iv.ph, %for.body.preheader7 ]
+  %sum.05 = phi float [ %add, %for.body ], [ %sum.05.ph, %for.body.preheader7 ]
+  %arrayidx = getelementptr inbounds float, ptr %f, i64 %indvars.iv
+  %8 = load float, ptr %arrayidx, align 4
+  %add = fadd float %sum.05, %8
+  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+  %exitcond.not = icmp eq i64 %indvars.iv.next, %wide.trip.count
+  br i1 %exitcond.not, label %for.cond.cleanup, label %for.body
+}
diff --git a/llvm/test/CodeGen/RISCV/rvv/riscv-codegenprepare.ll b/llvm/test/CodeGen/RISCV/rvv/riscv-codegenprepare.ll
new file mode 100644
index 00000000000000..bf4aa35ae8eb25
--- /dev/null
+++ b/llvm/test/CodeGen/RISCV/rvv/riscv-codegenprepare.ll
@@ -0,0 +1,109 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 4
+; RUN: opt %s -S -riscv-codegenprepare -mtriple=riscv64 -mattr=+v | FileCheck %s
+
+declare i64 @llvm.vscale.i64()
+declare float @llvm.vector.reduce.fadd.nxv4f32(float, <vscale x 4 x float>)
+
+define float @reduce_fadd(ptr nocapture noundef readonly %f, i32 noundef signext %N) {
+; CHECK-LABEL: define float @reduce_fadd(
+; CHECK-SAME: ptr nocapture noundef readonly [[F:%.*]], i32 noundef signext [[N:%.*]]) #[[ATTR2:[0-9]+]] {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[CMP4:%.*]] = icmp sgt i32 [[N]], 0
+; CHECK-NEXT:    br i1 [[CMP4]], label [[FOR_BODY_PREHEADER:%.*]], label [[FOR_COND_CLEANUP:%.*]]
+; CHECK:       for.body.preheader:
+; CHECK-NEXT:    [[WIDE_TRIP_COUNT:%.*]] = zext nneg i32 [[N]] to i64
+; CHECK-NEXT:    [[TMP0:%.*]] = tail call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP1:%.*]] = shl nuw nsw i64 [[TMP0]], 2
+; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ugt i64 [[TMP1]], [[WIDE_TRIP_COUNT]]
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label [[FOR_BODY_PREHEADER7:%.*]], label [[VECTOR_PH:%.*]]
+; CHECK:       vector.ph:
+; CHECK-NEXT:    [[TMP2:%.*]] = tail call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[DOTNEG:%.*]] = mul nuw nsw i64 [[TMP2]], 2147483644
+; CHECK-NEXT:    [[N_VEC:%.*]] = and i64 [[DOTNEG]], [[WIDE_TRIP_COUNT]]
+; CHECK-NEXT:    [[TMP3:%.*]] = tail call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP4:%.*]] = shl nuw nsw i64 [[TMP3]], 2
+; CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
+; CHECK:       vector.body:
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[TMP5:%.*]] = phi <vscale x 4 x float> [ insertelement (<vscale x 4 x float> poison, float 0.000000e+00, i64 0), [[VECTOR_PH]] ], [ [[TMP10:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI:%.*]] = phi float [ 0.000000e+00, [[VECTOR_PH]] ], [ [[TMP8:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[TMP6:%.*]] = getelementptr inbounds float, ptr [[F]], i64 [[INDEX]]
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <vscale x 4 x float>, ptr [[TMP6]], align 4
+; CHECK-NEXT:    [[TMP7:%.*]] = extractelement <vscale x 4 x float> [[TMP5]], i64 0
+; CHECK-NEXT:    [[TMP8]] = tail call float @llvm.vector.reduce.fadd.nxv4f32(float [[TMP7]], <vscale x 4 x float> [[WIDE_LOAD]])
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP4]]
+; CHECK-NEXT:    [[TMP9:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:    [[TMP10]] = insertelement <vscale x 4 x float> poison, float [[TMP8]], i64 0
+; CHECK-NEXT:    br i1 [[TMP9]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]]
+; CHECK:       middle.block:
+; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[N_VEC]], [[WIDE_TRIP_COUNT]]
+; CHECK-NEXT:    br i1 [[CMP_N]], label [[FOR_COND_CLEANUP]], label [[FOR_BODY_PREHEADER7]]
+; CHECK:       for.body.preheader7:
+; CHECK-NEXT:    [[INDVARS_IV_PH:%.*]] = phi i64 [ 0, [[FOR_BODY_PREHEADER]] ], [ [[N_VEC]], [[MIDDLE_BLOCK]] ]
+; CHECK-NEXT:    [[SUM_05_PH:%.*]] = phi float [ 0.000000e+00, [[FOR_BODY_PREHEADER]] ], [ [[TMP8]], [[MIDDLE_BLOCK]] ]
+; CHECK-NEXT:    br label [[FOR_BODY:%.*]]
+; CHECK:       for.cond.cleanup:
+; CHECK-NEXT:    [[SUM_0_LCSSA:%.*]] = phi float [ 0.000000e+00, [[ENTRY:%.*]] ], [ [[TMP8]], [[MIDDLE_BLOCK]] ], [ [[ADD:%.*]], [[FOR_BODY]] ]
+; CHECK-NEXT:    ret float [[SUM_0_LCSSA]]
+; CHECK:       for.body:
+; CHECK-NEXT:    [[INDVARS_IV:%.*]] = phi i64 [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ], [ [[INDVARS_IV_PH]], [[FOR_BODY_PREHEADER7]] ]
+; CHECK-NEXT:    [[SUM_05:%.*]] = phi float [ [[ADD]], [[FOR_BODY]] ], [ [[SUM_05_PH]], [[FOR_BODY_PREHEADER7]] ]
+; CHECK-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds float, ptr [[F]], i64 [[INDVARS_IV]]
+; CHECK-NEXT:    [[TMP11:%.*]] = load float, ptr [[ARRAYIDX]], align 4
+; CHECK-NEXT:    [[ADD]] = fadd float [[SUM_05]], [[TMP11]]
+; CHECK-NEXT:    [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
+; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], [[WIDE_TRIP_COUNT]]
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label [[FOR_COND_CLEANUP]], label [[FOR_BODY]]
+;
+entry:
+  %cmp4 = icmp sgt i32 %N, 0
+  br i1 %cmp4, label %for.body.preheader, label %for.cond.cleanup
+
+for.body.preheader:                               ; preds = %entry
+  %wide.trip.count = zext nneg i32 %N to i64
+  %0 = tail call i64 @llvm.vscale.i64()
+  %1 = shl nuw nsw i64 %0, 2
+  %min.iters.check = icmp ugt i64 %1, %wide.trip.count
+  br i1 %min.iters.check, label %for.body.preheader7, label %vector.ph
+
+vector.ph:                                        ; preds = %for.body.preheader
+  %2 = tail call i64 @llvm.vscale.i64()
+  %.neg = mul nuw nsw i64 %2, 2147483644
+  %n.vec = and i64 %.neg, %wide.trip.count
+  %3 = tail call i64 @llvm.vscale.i64()
+  %4 = shl nuw nsw i64 %3, 2
+  br label %vector.body
+
+vector.body:                                      ; preds = %vector.body, %vector.ph
+  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
+  %vec.phi = phi float [ 0.000000e+00, %vector.ph ], [ %6, %vector.body ]
+  %5 = getelementptr inbounds float, ptr %f, i64 %index
+  %wide.load = load <vscale x 4 x float>, ptr %5, align 4
+  %6 = tail call float @llvm.vector.reduce.fadd.nxv4f32(float %vec.phi, <vscale x 4 x float> %wide.load)
+  %index.next = add nuw i64 %index, %4
+  %7 = icmp eq i64 %index.next, %n.vec
+  br i1 %7, label %middle.block, label %vector.body
+
+middle.block:                                     ; preds = %vector.body
+  %cmp.n = icmp eq i64 %n.vec, %wide.trip.count
+  br i1 %cmp.n, label %for.cond.cleanup, label %for.body.preheader7
+
+for.body.preheader7:                              ; preds = %for.body.preheader, %middle.block
+  %indvars.iv.ph = phi i64 [ 0, %for.body.preheader ], [ %n.vec, %middle.block ]
+  %sum.05.ph = phi float [ 0.000000e+00, %for.body.preheader ], [ %6, %middle.block ]
+  br label %for.body
+
+for.cond.cleanup:                                 ; preds = %for.body, %middle.block, %entry
+  %sum.0.lcssa = phi float [ 0.000000e+00, %entry ], [ %6, %middle.block ], [ %add, %for.body ]
+  ret float %sum.0.lcssa
+
+for.body:                                         ; preds = %for.body.preheader7, %for.body
+  %indvars.iv = phi i64 [ %indvars.iv.next, %for.body ], [ %indvars.iv.ph, %for.body.preheader7 ]
+  %sum.05 = phi float [ %add, %for.body ], [ %sum.05.ph, %for.body.preheader7 ]
+  %arrayidx = getelementptr inbounds float, ptr %f, i64 %indvars.iv
+  %8 = load float, ptr %arrayidx, align 4
+  %add = fadd float %sum.05, %8
+  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+  %exitcond.not = icmp eq i64 %indvars.iv.next, %wide.trip.count
+  br i1 %exitcond.not, label %for.cond.cleanup, label %for.body
+}


github-actions bot commented Jan 16, 2024

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff a02a0e806fab01f4cf4307443cdaed76a2488752 85b741017544c63b3da591929d08070fb42f5fe6 -- llvm/lib/Target/RISCV/RISCVCodeGenPrepare.cpp
View the diff from clang-format here.
diff --git a/llvm/lib/Target/RISCV/RISCVCodeGenPrepare.cpp b/llvm/lib/Target/RISCV/RISCVCodeGenPrepare.cpp
index 6434532afd..278f2b2654 100644
--- a/llvm/lib/Target/RISCV/RISCVCodeGenPrepare.cpp
+++ b/llvm/lib/Target/RISCV/RISCVCodeGenPrepare.cpp
@@ -121,15 +121,17 @@ bool RISCVCodeGenPrepare::visitAnd(BinaryOperator &BO) {
 //
 // loop:
 // %phi = phi <float> [ ..., %entry ], [ %acc, %loop ]
-// %acc = call float @llvm.vector.reduce.fadd.nxv4f32(float %phi, <vscale x 2 x float> %vec)
+// %acc = call float @llvm.vector.reduce.fadd.nxv4f32(float %phi, <vscale x 2 x
+// float> %vec)
 //
 // ->
 //
 // loop:
 // %phi = phi <vscale x 2 x float> [ ..., %entry ], [ %acc.vec, %loop ]
 // %phi.scalar = extractelement <vscale x 2 x float> %phi, i64 0
-// %acc = call float @llvm.vector.reduce.fadd.nxv4f32(float %x, <vscale x 2 x float> %vec)
-// %acc.vec = insertelement <vscale x 2 x float> poison, float %acc.next, i64 0
+// %acc = call float @llvm.vector.reduce.fadd.nxv4f32(float %x, <vscale x 2 x
+// float> %vec) %acc.vec = insertelement <vscale x 2 x float> poison, float
+// %acc.next, i64 0
 //
 // Which eliminates the scalar -> vector -> scalar crossing during instruction
 // selection.

; CHECK-NEXT: vfredosum.vs v8, v8, v10
; CHECK-NEXT: vfmv.f.s fa0, v8
; CHECK-NEXT: vl2re32.v v10, (a4)
; CHECK-NEXT: vsetvli a6, zero, e32, m2, ta, ma
@lukel97 (Contributor, Author), Jan 16, 2024

This looks like a gap in RISCVInsertVSETVLI. Previously we were doing PRE, but now that we have a vector instruction (vmv.s.x) in the predecessor, it no longer kicks in.

This happens with and without 286a366

Resolved review thread on llvm/lib/Target/RISCV/RISCVCodeGenPrepare.cpp (outdated)
return false;

auto *PHI = dyn_cast<PHINode>(I.getOperand(0));
if (!PHI || !llvm::is_contained(PHI->incoming_values(), &I))
Member:

Should we handle cases where Op0 is not an indvar?
Example: https://godbolt.org/z/aTYcocfee

lukel97 (Contributor, Author):

Sure, we could try to generalise it to all reduction intrinsics then. We would need to double check that wrapping every reduction in extract_element/insert_element doesn't negatively affect codegen in other places though.

lukel97 (Contributor, Author):

Actually, ignore the above. I think llvm.vector.reduce.fadd is the only applicable intrinsic here: we expand llvm.vector.reduce.fmul, and llvm.vector.reduce.fmax/fmin don't take an accumulator argument.
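
For reference, the relevant intrinsic signatures (as documented in the LangRef) show why: only the fadd and fmul reductions take a leading scalar start/accumulator operand, while fmax/fmin take only a vector.

declare float @llvm.vector.reduce.fadd.nxv4f32(float, <vscale x 4 x float>)  ; scalar start value, then vector
declare float @llvm.vector.reduce.fmul.nxv4f32(float, <vscale x 4 x float>)  ; takes a start value, but is expanded on RISC-V
declare float @llvm.vector.reduce.fmax.nxv4f32(<vscale x 4 x float>)         ; no accumulator operand
declare float @llvm.vector.reduce.fmin.nxv4f32(<vscale x 4 x float>)         ; no accumulator operand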

Resolved review thread on llvm/lib/Target/RISCV/RISCVCodeGenPrepare.cpp
@@ -157,6 +155,9 @@ bool RISCVCodeGenPrepare::visitIntrinsicInst(IntrinsicInst &I) {
Builder.SetInsertPoint(&I);
I.setOperand(0, Builder.CreateExtractElement(VecPHI, (uint64_t)0));

if (PHI->hasNUses(0))
Collaborator:

use_empty()?


for (auto *BB : PHI->blocks()) {
Builder.SetInsertPoint(BB->getTerminator());
Value *InsertElt = Builder.CreateInsertElement(
@topperc (Collaborator), Jan 16, 2024

Doesn't look like we verified that we're in a loop. If the phi is just the merge of control flow from an if/else or something, does this unnecessarily hoist an insertelement into the if/else?
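
A hedged sketch (hypothetical IR, not from the patch's tests) of the kind of non-loop merge being asked about: the phi merges two values from an if/else and feeds the reduction, so a transform without a loop/cycle check would have to place an insertelement into each arm.

declare float @llvm.vector.reduce.fadd.nxv4f32(float, <vscale x 4 x float>)

define float @diamond(i1 %c, float %a, float %b, <vscale x 4 x float> %v) {
entry:
  br i1 %c, label %then, label %else

then:
  br label %merge

else:
  br label %merge

merge:
  ; the reduction is not one of the phi's incoming values here
  %start = phi float [ %a, %then ], [ %b, %else ]
  %red = call float @llvm.vector.reduce.fadd.nxv4f32(float %start, <vscale x 4 x float> %v)
  ret float %red
}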

lukel97 (Contributor, Author):

The initial revision of this PR checked that the reduction was one of the incoming values in phi: 05488f6

but I relaxed it to address the comment in #78244 (comment)

I haven't included any tests for non-loop control flow though. @dtcxzyw should we maybe start with this patch restricted to loops and then relax it in a follow up patch?

Collaborator:

Please revert to the loop-specific version. Please also restrict this, in the initial patch, to the case where the only use of the phi is the intrinsic. Let's start with something very narrowly focused on the motivating case and generalize in separate patches if desired.
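
As an illustration of that restriction (hypothetical IR, not one of the committed tests), a phi like the one below has a second use besides the reduction, so the narrowed patch would leave it alone rather than introduce an extra extractelement for the store.

declare float @llvm.vector.reduce.fadd.nxv4f32(float, <vscale x 4 x float>)

define void @phi_with_extra_use(ptr %p, <vscale x 4 x float> %v, i64 %n) {
entry:
  br label %loop

loop:
  %i = phi i64 [ 0, %entry ], [ %i.next, %loop ]
  %acc = phi float [ 0.000000e+00, %entry ], [ %acc.next, %loop ]
  store float %acc, ptr %p                    ; second use of %acc besides the reduction
  %acc.next = call float @llvm.vector.reduce.fadd.nxv4f32(float %acc, <vscale x 4 x float> %v)
  %i.next = add nuw i64 %i, 1
  %done = icmp eq i64 %i.next, %n
  br i1 %done, label %exit, label %loop

exit:
  ret void
}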

Member:

I haven't included any tests for non-loop control flow though. @dtcxzyw should we maybe start with this patch restricted to loops and then relax it in a follow up patch?

Yeah, we can generalize it in the future if we find it exists in some real-world applications.

@preames (Collaborator) left a comment

LGTM w/minor comment addressed.

; CHECK-NEXT: addi sp, sp, 48
; CHECK-NEXT: ret
entry:
%cmp4 = icmp sgt i32 %N, 0
Collaborator:

For both of these tests, you don't need the whole loop structure emitted by the loop vectorizer. You just need the main vector loop.
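
A hedged sketch of what such a reduced test might look like, keeping only the main vector loop (the function name and the trip-count/VF parameters are illustrative, not the committed test):

declare float @llvm.vector.reduce.fadd.nxv4f32(float, <vscale x 4 x float>)

define float @reduce_fadd_vector_loop_only(ptr %f, i64 %n.vec, i64 %vf) {
entry:
  br label %vector.body

vector.body:
  %index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]
  %vec.phi = phi float [ 0.000000e+00, %entry ], [ %acc, %vector.body ]
  %gep = getelementptr inbounds float, ptr %f, i64 %index
  %wide.load = load <vscale x 4 x float>, ptr %gep, align 4
  %acc = call float @llvm.vector.reduce.fadd.nxv4f32(float %vec.phi, <vscale x 4 x float> %wide.load)
  %index.next = add nuw i64 %index, %vf
  %done = icmp eq i64 %index.next, %n.vec
  br i1 %done, label %exit, label %vector.body

exit:
  ret float %acc
}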


auto *PHI = dyn_cast<PHINode>(I.getOperand(0));
if (!PHI || !PHI->hasOneUse() ||
!llvm::is_contained(PHI->incoming_values(), &I))
Collaborator:

Just for the record... This isn't checking that the phi is in a loop. It's checking it is in a cycle, which is not quite the same thing. In particular, we allow both non-loop cycles, and loops in unreachable code which aren't "loops". Not a problem here, just pointing it out for future reference.
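
A hedged example of that distinction (hypothetical IR, not in the tests): a block that is unreachable from the entry but branches to itself still forms a cycle, so the incoming-value check would accept it even though no loop ever runs.

declare float @llvm.vector.reduce.fadd.nxv4f32(float, <vscale x 4 x float>)

define float @dead_cycle(<vscale x 4 x float> %v) {
entry:
  ret float 0.000000e+00

dead:                                 ; unreachable, but its only predecessor is itself
  %phi = phi float [ %acc, %dead ]
  %acc = call float @llvm.vector.reduce.fadd.nxv4f32(float %phi, <vscale x 4 x float> %v)
  br label %dead
}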

@lukel97 lukel97 merged commit 15b0fab into llvm:main Jan 18, 2024
2 of 4 checks passed
ampandey-1995 pushed a commit to ampandey-1995/llvm-project that referenced this pull request Jan 19, 2024
…#78244)

LLVM vector reduction intrinsics return a scalar result, but on RISC-V
vector reduction instructions write the result in the first element of a
vector register. So when a reduction in a loop uses a scalar phi, we end
up with unnecessary scalar moves:

loop:
    vfmv.s.f v10, fa0
    vfredosum.vs v8, v8, v10
    vfmv.f.s fa0, v8

This mainly affects ordered fadd reductions, which have a scalar accumulator
operand. This tries to vectorize any scalar phis that feed into a fadd reduction
in RISCVCodeGenPrepare, converting:

loop:
%phi = phi float [ ..., %entry ], [ %acc, %loop ]
%acc = call float @llvm.vector.reduce.fadd.nxv2f32(float %phi, <vscale x 2 x float> %vec)

to

loop:
%phi = phi <vscale x 2 x float> [ ..., %entry ], [ %acc.vec, %loop]
%phi.scalar = extractelement <vscale x 2 x float> %phi, i64 0
%acc = call float @llvm.vector.reduce.fadd.nxv2f32(float %phi.scalar, <vscale x 2 x float> %vec)
%acc.vec = insertelement <vscale x 2 x float> poison, float %acc, i64 0

This eliminates the scalar -> vector -> scalar crossing during
instruction selection.