[RISCV] Vectorize phi for loop carried @llvm.vector.reduce.fadd #78244
Conversation
@llvm/pr-subscribers-backend-risc-v

Author: Luke Lau (lukel97)

Changes

LLVM vector reduction intrinsics return a scalar result, but on RISC-V, vector reduction instructions write the result to the first element of a vector register. So when a reduction in a loop uses a scalar phi, we end up with unnecessary scalar moves:

```
loop:
  vfmv.s.f v10, fa0
  vfredosum.vs v8, v8, v10
  vfmv.f.s fa0, v8
```

This mainly affects ordered fadd reductions, since other types of reduction typically use element-wise vectorisation in the loop body. This tries to vectorize any scalar phis that feed into a fadd reduction in RISCVCodeGenPrepare, converting:

```
loop:
  %phi = phi float [ ..., %entry ], [ %acc, %loop ]
  %acc = call float @llvm.vector.reduce.fadd.nxv2f32(float %phi, <vscale x 2 x float> %vec)
```

to

```
loop:
  %phi = phi <vscale x 2 x float> [ ..., %entry ], [ %acc.vec, %loop ]
  %phi.scalar = extractelement <vscale x 2 x float> %phi, i64 0
  %acc = call float @llvm.vector.reduce.fadd.nxv2f32(float %phi.scalar, <vscale x 2 x float> %vec)
  %acc.vec = insertelement <vscale x 2 x float> poison, float %acc, i64 0
```

which eliminates the scalar -> vector -> scalar crossing during instruction selection.

> [!NOTE]
> Full diff: https://github.com/llvm/llvm-project/pull/78244.diff

3 Files Affected:
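For context, the motivating pattern at the source level is a strict floating-point sum. The sketch below is an assumption of what such a source loop looks like (the PR's tests start from equivalent IR, not C++); without fast-math the additions must stay in source order, so the vectorizer emits an ordered llvm.vector.reduce.fadd with a loop-carried scalar accumulator phi:

```c++
// A sketch, assuming a plain strict-FP sum; not taken from the PR.
// Compiled without fast-math, the loop vectorizer keeps the additions
// in source order by carrying a scalar float accumulator and calling
// the ordered @llvm.vector.reduce.fadd intrinsic each iteration.
float reduce_fadd(const float *f, int n) {
  float sum = 0.0f;
  for (int i = 0; i < n; ++i)
    sum += f[i]; // strict FP: no reassociation allowed
  return sum;
}
```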
diff --git a/llvm/lib/Target/RISCV/RISCVCodeGenPrepare.cpp b/llvm/lib/Target/RISCV/RISCVCodeGenPrepare.cpp
index f9d8401bab7b39..9f7486fc7d68e1 100644
--- a/llvm/lib/Target/RISCV/RISCVCodeGenPrepare.cpp
+++ b/llvm/lib/Target/RISCV/RISCVCodeGenPrepare.cpp
@@ -18,9 +18,10 @@
#include "llvm/ADT/Statistic.h"
#include "llvm/Analysis/ValueTracking.h"
#include "llvm/CodeGen/TargetPassConfig.h"
+#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/InstVisitor.h"
-#include "llvm/IR/PatternMatch.h"
#include "llvm/InitializePasses.h"
+#include "llvm/IR/Intrinsics.h"
#include "llvm/Pass.h"
using namespace llvm;
@@ -51,6 +52,7 @@ class RISCVCodeGenPrepare : public FunctionPass,
bool visitInstruction(Instruction &I) { return false; }
bool visitAnd(BinaryOperator &BO);
+ bool visitIntrinsicInst(IntrinsicInst &I);
};
} // end anonymous namespace
@@ -103,6 +105,59 @@ bool RISCVCodeGenPrepare::visitAnd(BinaryOperator &BO) {
return true;
}
+// LLVM vector reduction intrinsics return a scalar result, but on RISC-V vector
+// reduction instructions write the result in the first element of a vector
+// register. So when a reduction in a loop uses a scalar phi, we end up with
+// unnecessary scalar moves:
+//
+// loop:
+// vfmv.s.f v10, fa0
+// vfredosum.vs v8, v8, v10
+// vfmv.f.s fa0, v8
+//
+// This mainly affects ordered fadd reductions, since other types of reduction
+// typically use element-wise vectorisation in the loop body. This tries to
+// vectorize any scalar phis that feed into a fadd reduction:
+//
+// loop:
+// %phi = phi float [ ..., %entry ], [ %acc, %loop ]
+// %acc = call float @llvm.vector.reduce.fadd.nxv2f32(float %phi, <vscale x 2 x float> %vec)
+//
+// ->
+//
+// loop:
+// %phi = phi <vscale x 2 x float> [ ..., %entry ], [ %acc.vec, %loop ]
+// %phi.scalar = extractelement <vscale x 2 x float> %phi, i64 0
+// %acc = call float @llvm.vector.reduce.fadd.nxv2f32(float %phi.scalar, <vscale x 2 x float> %vec)
+// %acc.vec = insertelement <vscale x 2 x float> poison, float %acc, i64 0
+//
+// Which eliminates the scalar -> vector -> scalar crossing during instruction
+// selection.
+bool RISCVCodeGenPrepare::visitIntrinsicInst(IntrinsicInst &I) {
+  if (I.getIntrinsicID() != Intrinsic::vector_reduce_fadd)
+    return false;
+
+  auto *PHI = dyn_cast<PHINode>(I.getOperand(0));
+  if (!PHI || !llvm::is_contained(PHI->incoming_values(), &I))
+    return false;
+
+  Type *VecTy = I.getOperand(1)->getType();
+  IRBuilder<> Builder(PHI);
+  auto *VecPHI = Builder.CreatePHI(VecTy, PHI->getNumIncomingValues());
+
+  for (auto *BB : PHI->blocks()) {
+    Builder.SetInsertPoint(BB->getTerminator());
+    Value *InsertElt = Builder.CreateInsertElement(
+        VecTy, PHI->getIncomingValueForBlock(BB), (uint64_t)0);
+    VecPHI->addIncoming(InsertElt, BB);
+  }
+
+  Builder.SetInsertPoint(&I);
+  I.setOperand(0, Builder.CreateExtractElement(VecPHI, (uint64_t)0));
+
+  return true;
+}
+
bool RISCVCodeGenPrepare::runOnFunction(Function &F) {
if (skipFunction(F))
return false;
diff --git a/llvm/test/CodeGen/RISCV/rvv/riscv-codegenprepare-asm.ll b/llvm/test/CodeGen/RISCV/rvv/riscv-codegenprepare-asm.ll
new file mode 100644
index 00000000000000..6f8aa6e5a441d8
--- /dev/null
+++ b/llvm/test/CodeGen/RISCV/rvv/riscv-codegenprepare-asm.ll
@@ -0,0 +1,133 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 4
+; RUN: llc < %s -mtriple=riscv64 -mattr=+v | FileCheck %s
+
+declare i64 @llvm.vscale.i64()
+declare float @llvm.vector.reduce.fadd.nxv4f32(float, <vscale x 4 x float>)
+
+define float @reduce_fadd(ptr nocapture noundef readonly %f, i32 noundef signext %N) {
+; CHECK-LABEL: reduce_fadd:
+; CHECK: # %bb.0: # %entry
+; CHECK-NEXT: blez a1, .LBB0_3
+; CHECK-NEXT: # %bb.1: # %for.body.preheader
+; CHECK-NEXT: addi sp, sp, -48
+; CHECK-NEXT: .cfi_def_cfa_offset 48
+; CHECK-NEXT: sd ra, 40(sp) # 8-byte Folded Spill
+; CHECK-NEXT: sd s0, 32(sp) # 8-byte Folded Spill
+; CHECK-NEXT: sd s1, 24(sp) # 8-byte Folded Spill
+; CHECK-NEXT: sd s2, 16(sp) # 8-byte Folded Spill
+; CHECK-NEXT: sd s3, 8(sp) # 8-byte Folded Spill
+; CHECK-NEXT: .cfi_offset ra, -8
+; CHECK-NEXT: .cfi_offset s0, -16
+; CHECK-NEXT: .cfi_offset s1, -24
+; CHECK-NEXT: .cfi_offset s2, -32
+; CHECK-NEXT: .cfi_offset s3, -40
+; CHECK-NEXT: csrr s1, vlenb
+; CHECK-NEXT: srli s0, s1, 1
+; CHECK-NEXT: bgeu a1, s0, .LBB0_4
+; CHECK-NEXT: # %bb.2:
+; CHECK-NEXT: li a2, 0
+; CHECK-NEXT: fmv.w.x fa0, zero
+; CHECK-NEXT: j .LBB0_7
+; CHECK-NEXT: .LBB0_3:
+; CHECK-NEXT: fmv.w.x fa0, zero
+; CHECK-NEXT: ret
+; CHECK-NEXT: .LBB0_4: # %vector.ph
+; CHECK-NEXT: srli a2, s1, 3
+; CHECK-NEXT: lui a3, 524288
+; CHECK-NEXT: addiw a3, a3, -4
+; CHECK-NEXT: mv s2, a0
+; CHECK-NEXT: mv a0, a2
+; CHECK-NEXT: mv s3, a1
+; CHECK-NEXT: mv a1, a3
+; CHECK-NEXT: call __muldi3
+; CHECK-NEXT: mv a1, s3
+; CHECK-NEXT: mv a2, a0
+; CHECK-NEXT: mv a0, s2
+; CHECK-NEXT: and a2, a2, s3
+; CHECK-NEXT: vsetvli a3, zero, e32, m1, ta, ma
+; CHECK-NEXT: vmv.s.x v8, zero
+; CHECK-NEXT: slli a3, s1, 1
+; CHECK-NEXT: mv a4, s2
+; CHECK-NEXT: mv a5, a2
+; CHECK-NEXT: .LBB0_5: # %vector.body
+; CHECK-NEXT: # =>This Inner Loop Header: Depth=1
+; CHECK-NEXT: vl2re32.v v10, (a4)
+; CHECK-NEXT: vsetvli a6, zero, e32, m2, ta, ma
+; CHECK-NEXT: vfredosum.vs v8, v10, v8
+; CHECK-NEXT: sub a5, a5, s0
+; CHECK-NEXT: add a4, a4, a3
+; CHECK-NEXT: bnez a5, .LBB0_5
+; CHECK-NEXT: # %bb.6: # %middle.block
+; CHECK-NEXT: vfmv.f.s fa0, v8
+; CHECK-NEXT: beq a2, a1, .LBB0_9
+; CHECK-NEXT: .LBB0_7: # %for.body.preheader7
+; CHECK-NEXT: slli a2, a2, 2
+; CHECK-NEXT: add a2, a0, a2
+; CHECK-NEXT: slli a1, a1, 2
+; CHECK-NEXT: add a0, a0, a1
+; CHECK-NEXT: .LBB0_8: # %for.body
+; CHECK-NEXT: # =>This Inner Loop Header: Depth=1
+; CHECK-NEXT: flw fa5, 0(a2)
+; CHECK-NEXT: addi a2, a2, 4
+; CHECK-NEXT: fadd.s fa0, fa0, fa5
+; CHECK-NEXT: bne a2, a0, .LBB0_8
+; CHECK-NEXT: .LBB0_9:
+; CHECK-NEXT: ld ra, 40(sp) # 8-byte Folded Reload
+; CHECK-NEXT: ld s0, 32(sp) # 8-byte Folded Reload
+; CHECK-NEXT: ld s1, 24(sp) # 8-byte Folded Reload
+; CHECK-NEXT: ld s2, 16(sp) # 8-byte Folded Reload
+; CHECK-NEXT: ld s3, 8(sp) # 8-byte Folded Reload
+; CHECK-NEXT: addi sp, sp, 48
+; CHECK-NEXT: ret
+entry:
+ %cmp4 = icmp sgt i32 %N, 0
+ br i1 %cmp4, label %for.body.preheader, label %for.cond.cleanup
+
+for.body.preheader: ; preds = %entry
+ %wide.trip.count = zext nneg i32 %N to i64
+ %0 = tail call i64 @llvm.vscale.i64()
+ %1 = shl nuw nsw i64 %0, 2
+ %min.iters.check = icmp ugt i64 %1, %wide.trip.count
+ br i1 %min.iters.check, label %for.body.preheader7, label %vector.ph
+
+vector.ph: ; preds = %for.body.preheader
+ %2 = tail call i64 @llvm.vscale.i64()
+ %.neg = mul nuw nsw i64 %2, 2147483644
+ %n.vec = and i64 %.neg, %wide.trip.count
+ %3 = tail call i64 @llvm.vscale.i64()
+ %4 = shl nuw nsw i64 %3, 2
+ br label %vector.body
+
+vector.body: ; preds = %vector.body, %vector.ph
+ %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
+ %vec.phi = phi float [ 0.000000e+00, %vector.ph ], [ %6, %vector.body ]
+ %5 = getelementptr inbounds float, ptr %f, i64 %index
+ %wide.load = load <vscale x 4 x float>, ptr %5, align 4
+ %6 = tail call float @llvm.vector.reduce.fadd.nxv4f32(float %vec.phi, <vscale x 4 x float> %wide.load)
+ %index.next = add nuw i64 %index, %4
+ %7 = icmp eq i64 %index.next, %n.vec
+ br i1 %7, label %middle.block, label %vector.body
+
+middle.block: ; preds = %vector.body
+ %cmp.n = icmp eq i64 %n.vec, %wide.trip.count
+ br i1 %cmp.n, label %for.cond.cleanup, label %for.body.preheader7
+
+for.body.preheader7: ; preds = %for.body.preheader, %middle.block
+ %indvars.iv.ph = phi i64 [ 0, %for.body.preheader ], [ %n.vec, %middle.block ]
+ %sum.05.ph = phi float [ 0.000000e+00, %for.body.preheader ], [ %6, %middle.block ]
+ br label %for.body
+
+for.cond.cleanup: ; preds = %for.body, %middle.block, %entry
+ %sum.0.lcssa = phi float [ 0.000000e+00, %entry ], [ %6, %middle.block ], [ %add, %for.body ]
+ ret float %sum.0.lcssa
+
+for.body: ; preds = %for.body.preheader7, %for.body
+ %indvars.iv = phi i64 [ %indvars.iv.next, %for.body ], [ %indvars.iv.ph, %for.body.preheader7 ]
+ %sum.05 = phi float [ %add, %for.body ], [ %sum.05.ph, %for.body.preheader7 ]
+ %arrayidx = getelementptr inbounds float, ptr %f, i64 %indvars.iv
+ %8 = load float, ptr %arrayidx, align 4
+ %add = fadd float %sum.05, %8
+ %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+ %exitcond.not = icmp eq i64 %indvars.iv.next, %wide.trip.count
+ br i1 %exitcond.not, label %for.cond.cleanup, label %for.body
+}
diff --git a/llvm/test/CodeGen/RISCV/rvv/riscv-codegenprepare.ll b/llvm/test/CodeGen/RISCV/rvv/riscv-codegenprepare.ll
new file mode 100644
index 00000000000000..bf4aa35ae8eb25
--- /dev/null
+++ b/llvm/test/CodeGen/RISCV/rvv/riscv-codegenprepare.ll
@@ -0,0 +1,109 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 4
+; RUN: opt %s -S -riscv-codegenprepare -mtriple=riscv64 -mattr=+v | FileCheck %s
+
+declare i64 @llvm.vscale.i64()
+declare float @llvm.vector.reduce.fadd.nxv4f32(float, <vscale x 4 x float>)
+
+define float @reduce_fadd(ptr nocapture noundef readonly %f, i32 noundef signext %N) {
+; CHECK-LABEL: define float @reduce_fadd(
+; CHECK-SAME: ptr nocapture noundef readonly [[F:%.*]], i32 noundef signext [[N:%.*]]) #[[ATTR2:[0-9]+]] {
+; CHECK-NEXT: entry:
+; CHECK-NEXT: [[CMP4:%.*]] = icmp sgt i32 [[N]], 0
+; CHECK-NEXT: br i1 [[CMP4]], label [[FOR_BODY_PREHEADER:%.*]], label [[FOR_COND_CLEANUP:%.*]]
+; CHECK: for.body.preheader:
+; CHECK-NEXT: [[WIDE_TRIP_COUNT:%.*]] = zext nneg i32 [[N]] to i64
+; CHECK-NEXT: [[TMP0:%.*]] = tail call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP1:%.*]] = shl nuw nsw i64 [[TMP0]], 2
+; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ugt i64 [[TMP1]], [[WIDE_TRIP_COUNT]]
+; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[FOR_BODY_PREHEADER7:%.*]], label [[VECTOR_PH:%.*]]
+; CHECK: vector.ph:
+; CHECK-NEXT: [[TMP2:%.*]] = tail call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[DOTNEG:%.*]] = mul nuw nsw i64 [[TMP2]], 2147483644
+; CHECK-NEXT: [[N_VEC:%.*]] = and i64 [[DOTNEG]], [[WIDE_TRIP_COUNT]]
+; CHECK-NEXT: [[TMP3:%.*]] = tail call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP4:%.*]] = shl nuw nsw i64 [[TMP3]], 2
+; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
+; CHECK: vector.body:
+; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT: [[TMP5:%.*]] = phi <vscale x 4 x float> [ insertelement (<vscale x 4 x float> poison, float 0.000000e+00, i64 0), [[VECTOR_PH]] ], [ [[TMP10:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT: [[VEC_PHI:%.*]] = phi float [ 0.000000e+00, [[VECTOR_PH]] ], [ [[TMP8:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT: [[TMP6:%.*]] = getelementptr inbounds float, ptr [[F]], i64 [[INDEX]]
+; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 4 x float>, ptr [[TMP6]], align 4
+; CHECK-NEXT: [[TMP7:%.*]] = extractelement <vscale x 4 x float> [[TMP5]], i64 0
+; CHECK-NEXT: [[TMP8]] = tail call float @llvm.vector.reduce.fadd.nxv4f32(float [[TMP7]], <vscale x 4 x float> [[WIDE_LOAD]])
+; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP4]]
+; CHECK-NEXT: [[TMP9:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT: [[TMP10]] = insertelement <vscale x 4 x float> poison, float [[TMP8]], i64 0
+; CHECK-NEXT: br i1 [[TMP9]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]]
+; CHECK: middle.block:
+; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[N_VEC]], [[WIDE_TRIP_COUNT]]
+; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_COND_CLEANUP]], label [[FOR_BODY_PREHEADER7]]
+; CHECK: for.body.preheader7:
+; CHECK-NEXT: [[INDVARS_IV_PH:%.*]] = phi i64 [ 0, [[FOR_BODY_PREHEADER]] ], [ [[N_VEC]], [[MIDDLE_BLOCK]] ]
+; CHECK-NEXT: [[SUM_05_PH:%.*]] = phi float [ 0.000000e+00, [[FOR_BODY_PREHEADER]] ], [ [[TMP8]], [[MIDDLE_BLOCK]] ]
+; CHECK-NEXT: br label [[FOR_BODY:%.*]]
+; CHECK: for.cond.cleanup:
+; CHECK-NEXT: [[SUM_0_LCSSA:%.*]] = phi float [ 0.000000e+00, [[ENTRY:%.*]] ], [ [[TMP8]], [[MIDDLE_BLOCK]] ], [ [[ADD:%.*]], [[FOR_BODY]] ]
+; CHECK-NEXT: ret float [[SUM_0_LCSSA]]
+; CHECK: for.body:
+; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ], [ [[INDVARS_IV_PH]], [[FOR_BODY_PREHEADER7]] ]
+; CHECK-NEXT: [[SUM_05:%.*]] = phi float [ [[ADD]], [[FOR_BODY]] ], [ [[SUM_05_PH]], [[FOR_BODY_PREHEADER7]] ]
+; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds float, ptr [[F]], i64 [[INDVARS_IV]]
+; CHECK-NEXT: [[TMP11:%.*]] = load float, ptr [[ARRAYIDX]], align 4
+; CHECK-NEXT: [[ADD]] = fadd float [[SUM_05]], [[TMP11]]
+; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
+; CHECK-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], [[WIDE_TRIP_COUNT]]
+; CHECK-NEXT: br i1 [[EXITCOND_NOT]], label [[FOR_COND_CLEANUP]], label [[FOR_BODY]]
+;
+entry:
+ %cmp4 = icmp sgt i32 %N, 0
+ br i1 %cmp4, label %for.body.preheader, label %for.cond.cleanup
+
+for.body.preheader: ; preds = %entry
+ %wide.trip.count = zext nneg i32 %N to i64
+ %0 = tail call i64 @llvm.vscale.i64()
+ %1 = shl nuw nsw i64 %0, 2
+ %min.iters.check = icmp ugt i64 %1, %wide.trip.count
+ br i1 %min.iters.check, label %for.body.preheader7, label %vector.ph
+
+vector.ph: ; preds = %for.body.preheader
+ %2 = tail call i64 @llvm.vscale.i64()
+ %.neg = mul nuw nsw i64 %2, 2147483644
+ %n.vec = and i64 %.neg, %wide.trip.count
+ %3 = tail call i64 @llvm.vscale.i64()
+ %4 = shl nuw nsw i64 %3, 2
+ br label %vector.body
+
+vector.body: ; preds = %vector.body, %vector.ph
+ %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
+ %vec.phi = phi float [ 0.000000e+00, %vector.ph ], [ %6, %vector.body ]
+ %5 = getelementptr inbounds float, ptr %f, i64 %index
+ %wide.load = load <vscale x 4 x float>, ptr %5, align 4
+ %6 = tail call float @llvm.vector.reduce.fadd.nxv4f32(float %vec.phi, <vscale x 4 x float> %wide.load)
+ %index.next = add nuw i64 %index, %4
+ %7 = icmp eq i64 %index.next, %n.vec
+ br i1 %7, label %middle.block, label %vector.body
+
+middle.block: ; preds = %vector.body
+ %cmp.n = icmp eq i64 %n.vec, %wide.trip.count
+ br i1 %cmp.n, label %for.cond.cleanup, label %for.body.preheader7
+
+for.body.preheader7: ; preds = %for.body.preheader, %middle.block
+ %indvars.iv.ph = phi i64 [ 0, %for.body.preheader ], [ %n.vec, %middle.block ]
+ %sum.05.ph = phi float [ 0.000000e+00, %for.body.preheader ], [ %6, %middle.block ]
+ br label %for.body
+
+for.cond.cleanup: ; preds = %for.body, %middle.block, %entry
+ %sum.0.lcssa = phi float [ 0.000000e+00, %entry ], [ %6, %middle.block ], [ %add, %for.body ]
+ ret float %sum.0.lcssa
+
+for.body: ; preds = %for.body.preheader7, %for.body
+ %indvars.iv = phi i64 [ %indvars.iv.next, %for.body ], [ %indvars.iv.ph, %for.body.preheader7 ]
+ %sum.05 = phi float [ %add, %for.body ], [ %sum.05.ph, %for.body.preheader7 ]
+ %arrayidx = getelementptr inbounds float, ptr %f, i64 %indvars.iv
+ %8 = load float, ptr %arrayidx, align 4
+ %add = fadd float %sum.05, %8
+ %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+ %exitcond.not = icmp eq i64 %indvars.iv.next, %wide.trip.count
+ br i1 %exitcond.not, label %for.cond.cleanup, label %for.body
+}
You can test this locally with the following command:

git-clang-format --diff a02a0e806fab01f4cf4307443cdaed76a2488752 85b741017544c63b3da591929d08070fb42f5fe6 -- llvm/lib/Target/RISCV/RISCVCodeGenPrepare.cpp

View the diff from clang-format here.

diff --git a/llvm/lib/Target/RISCV/RISCVCodeGenPrepare.cpp b/llvm/lib/Target/RISCV/RISCVCodeGenPrepare.cpp
index 6434532afd..278f2b2654 100644
--- a/llvm/lib/Target/RISCV/RISCVCodeGenPrepare.cpp
+++ b/llvm/lib/Target/RISCV/RISCVCodeGenPrepare.cpp
@@ -121,15 +121,17 @@ bool RISCVCodeGenPrepare::visitAnd(BinaryOperator &BO) {
//
// loop:
// %phi = phi <float> [ ..., %entry ], [ %acc, %loop ]
-// %acc = call float @llvm.vector.reduce.fadd.nxv4f32(float %phi, <vscale x 2 x float> %vec)
+// %acc = call float @llvm.vector.reduce.fadd.nxv4f32(float %phi, <vscale x 2 x
+// float> %vec)
//
// ->
//
// loop:
// %phi = phi <vscale x 2 x float> [ ..., %entry ], [ %acc.vec, %loop ]
// %phi.scalar = extractelement <vscale x 2 x float> %phi, i64 0
-// %acc = call float @llvm.vector.reduce.fadd.nxv4f32(float %x, <vscale x 2 x float> %vec)
-// %acc.vec = insertelement <vscale x 2 x float> poison, float %acc.next, i64 0
+// %acc = call float @llvm.vector.reduce.fadd.nxv4f32(float %x, <vscale x 2 x
+// float> %vec) %acc.vec = insertelement <vscale x 2 x float> poison, float
+// %acc.next, i64 0
//
// Which eliminates the scalar -> vector -> scalar crossing during instruction
// selection.
Force-pushed from c19b05b to 910dc9f.
; CHECK-NEXT: vfredosum.vs v8, v8, v10
; CHECK-NEXT: vfmv.f.s fa0, v8
; CHECK-NEXT: vl2re32.v v10, (a4)
; CHECK-NEXT: vsetvli a6, zero, e32, m2, ta, ma
This looks like a gap in RISCVInsertVSETVLI. Previously we were doing PRE, but now that we have a vector instruction (vmv.s.x) in the predecessor, it no longer kicks in.
This happens with and without 286a366
return false;

auto *PHI = dyn_cast<PHINode>(I.getOperand(0));
if (!PHI || !llvm::is_contained(PHI->incoming_values(), &I)) |
Should we handle cases where Op0 is not an indvar?
Example: https://godbolt.org/z/aTYcocfee
Sure, we could try to generalise it to all reduction intrinsics then. We would need to double check that wrapping every reduction in extract_element/insert_element doesn't negatively affect codegen in other places though.
Actually ignore the above, I think llvm.vector.reduce.fadd is the only intrinsic applicable here: we expand llvm.vector.reduce.fmul, and llvm.vector.reduce.fmax/fmin don't have a non-fast-math accumulator argument.
@@ -157,6 +155,9 @@ bool RISCVCodeGenPrepare::visitIntrinsicInst(IntrinsicInst &I) {
   Builder.SetInsertPoint(&I);
   I.setOperand(0, Builder.CreateExtractElement(VecPHI, (uint64_t)0));

+  if (PHI->hasNUses(0))
use_empty()?
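A minimal sketch of the spelling being suggested; the deletion of the now-dead phi is an assumption, since the quoted hunk cuts off after the condition:

```c++
// Reviewer's suggested form: Value::use_empty() states "no uses left"
// directly instead of spelling it as hasNUses(0).
if (PHI->use_empty())
  PHI->eraseFromParent(); // assumed cleanup of the dead scalar phi
```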
for (auto *BB : PHI->blocks()) {
  Builder.SetInsertPoint(BB->getTerminator());
  Value *InsertElt = Builder.CreateInsertElement(
Doesn't look like we verified that we're in a loop. If the phi is just the merge of control flow from an if/else or something does this unnecessarily hoist an insertelement into the if/else?
The initial revision of this PR checked that the reduction was one of the incoming values in phi: 05488f6
but I relaxed it to address the comment in #78244 (comment)
I haven't included any tests for non-loop control flow though. @dtcxzyw should we maybe start with this patch restricted to loops and then relax it in a follow up patch?
Please revert to the loop-specific version. Please also restrict this - in the initial patch - to the case where the only use of the phi is the intrinsic. Let's start with something very narrowly focused on the motivating case and generalize in separate patches if desired.
> I haven't included any tests for non-loop control flow though. @dtcxzyw should we maybe start with this patch restricted to loops and then relax it in a follow up patch?
Yeah, we can generalize it in the future if we find it exists in some real-world applications.
LGTM w/minor comment addressed.
; CHECK-NEXT: addi sp, sp, 48
; CHECK-NEXT: ret
entry:
%cmp4 = icmp sgt i32 %N, 0 |
For both of these tests, you don't need the whole loop structure emitted by the loop vectorizer. You just need the main vector loop.
auto *PHI = dyn_cast<PHINode>(I.getOperand(0));
if (!PHI || !PHI->hasOneUse() ||
    !llvm::is_contained(PHI->incoming_values(), &I))
Just for the record... This isn't checking that the phi is in a loop. It's checking it is in a cycle, which is not quite the same thing. In particular, we allow both non-loop cycles, and loops in unreachable code which aren't "loops". Not a problem here, just pointing it out for future reference.
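To illustrate the distinction, here is a minimal sketch (not part of the patch) of what a natural-loop check via LoopInfo would look like, in contrast to the cycle check above; isLoopHeaderPHI is a hypothetical helper:

```c++
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/IR/Instructions.h"
using namespace llvm;

// LoopInfo only models natural loops in reachable code, so this check
// rejects both non-loop cycles and "loops" in unreachable blocks.
static bool isLoopHeaderPHI(const PHINode &PHI, const LoopInfo &LI) {
  const BasicBlock *BB = PHI.getParent();
  const Loop *L = LI.getLoopFor(BB);
  return L && L->getHeader() == BB;
}
```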
Note
The tests were added in a separate commit in this PR, in case they need review too. The codegen diff should be viewable between the two commits.