[InstCombine/RISCV] Constant-fold bitcast(vmv.v.x) #182630
Conversation
@llvm/pr-subscribers-llvm-transforms @llvm/pr-subscribers-backend-risc-v

Author: Ramkumar Ramachandra (artagnon)

Changes

The motivating example is: https://godbolt.org/z/vnb3ETsbc

There is an issue with extra vsetvli instructions: RISCVInsertVSETVLI looks at instructions that demand VL and optimizes based on that, so due to the non-unit VL in vmv.v.x, we wastefully insert vsetvli instructions even when all operands are constant. To avoid this, constant-fold vmv.v.x, using a bitcast to handle the type conversion.

llc run showing the vsetvli eliminated: https://godbolt.org/z/KEPTxTPcb

Full diff: https://github.com/llvm/llvm-project/pull/182630.diff

2 Files Affected:
diff --git a/llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp b/llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp
index 6476a38b8a545..f7b2ac2ca96c8 100644
--- a/llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp
+++ b/llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp
@@ -48,6 +48,7 @@
#include "llvm/IR/IntrinsicsAMDGPU.h"
#include "llvm/IR/IntrinsicsARM.h"
#include "llvm/IR/IntrinsicsHexagon.h"
+#include "llvm/IR/IntrinsicsRISCV.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Metadata.h"
#include "llvm/IR/PatternMatch.h"
@@ -4262,6 +4263,46 @@ Instruction *InstCombinerImpl::visitCallInst(CallInst &CI) {
*II, Builder.CreateZExtOrTrunc(II->getArgOperand(0), II->getType()));
return nullptr;
}
+ case Intrinsic::riscv_vmv_v_x: {
+ // If all operands are constant, constant-fold with a bitcast. The
+ // rationale is to reduce the number of vsetvli instructions inserted by
+ // RISCVInsertVSETVLI.
+ const APInt *Scalar, *VL;
+ if (!match(II, m_Intrinsic<Intrinsic::riscv_vmv_v_x>(
+ m_Poison(), m_APInt(Scalar), m_APInt(VL))) ||
+ VL->isOne() || Scalar->getBitWidth() > VL->getBitWidth())
+ return nullptr;
+ auto *VecTy = cast<VectorType>(II->getType());
+ bool IsScalable = VecTy->isScalableTy();
+ ElementCount EC = VecTy->getElementCount();
+ ElementCount ScaleFactor =
+ ElementCount::get(VL->getZExtValue(), IsScalable);
+ auto *EltTy = cast<IntegerType>(VecTy->getScalarType());
+ auto *NewEltTy = IntegerType::get(
+ CI.getContext(), EltTy->getScalarSizeInBits() * VL->getZExtValue());
+ if (!EC.hasKnownScalarFactor(ScaleFactor) ||
+ NewEltTy->getBitWidth() > VL->getBitWidth())
+ return nullptr;
+ ElementCount NewEC =
+ ElementCount::get(EC.getKnownScalarFactor(ScaleFactor), IsScalable);
+ Type *RetTy = VectorType::get(NewEltTy, NewEC);
+ assert(VecTy->canLosslesslyBitCastTo(RetTy) &&
+ "Lossless bitcast between types expected");
+ APInt ScalarExt = Scalar->abs().zext(NewEltTy->getBitWidth());
+ APInt NewScalar(ScalarExt.getBitWidth(), 0);
+ for (unsigned Idx : seq(VL->getZExtValue()))
+ NewScalar |= ScalarExt << Scalar->getBitWidth() * Idx;
+ if (Scalar->isSignBitSet())
+ NewScalar.setSignBit();
+ return replaceInstUsesWith(
+ *II,
+ Builder.CreateBitCast(
+ Builder.CreateIntrinsic(
+ RetTy, Intrinsic::riscv_vmv_v_x,
+ {PoisonValue::get(RetTy), ConstantInt::get(NewEltTy, NewScalar),
+ ConstantInt::get(II->getOperand(2)->getType(), 1)}),
+ VecTy));
+ }
default: {
// Handle target specific intrinsics
std::optional<Instruction *> V = targetInstCombineIntrinsic(*II);
diff --git a/llvm/test/Transforms/InstCombine/RISCV/riscv-vmv-v-x.ll b/llvm/test/Transforms/InstCombine/RISCV/riscv-vmv-v-x.ll
new file mode 100644
index 0000000000000..cc8afb5f4b89e
--- /dev/null
+++ b/llvm/test/Transforms/InstCombine/RISCV/riscv-vmv-v-x.ll
@@ -0,0 +1,103 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 6
+; RUN: opt -p instcombine -mtriple=riscv32 -mattr=+v -S %s | FileCheck %s
+; RUN: opt -p instcombine -mtriple=riscv64 -mattr=+v -S %s | FileCheck %s
+
+define <8 x i8> @fixed() {
+; CHECK-LABEL: define <8 x i8> @fixed(
+; CHECK-SAME: ) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT: [[TMP1:%.*]] = call <2 x i32> @llvm.riscv.vmv.v.x.v2i32.i64(<2 x i32> poison, i32 1431655765, i64 1)
+; CHECK-NEXT: [[A:%.*]] = bitcast <2 x i32> [[TMP1]] to <8 x i8>
+; CHECK-NEXT: ret <8 x i8> [[A]]
+;
+ %a = call <8 x i8> @llvm.riscv.vmv.v.x.v8i8(<8 x i8> poison, i8 85, i64 4)
+ ret <8 x i8> %a
+}
+
+define <vscale x 8 x i8> @scalable() {
+; CHECK-LABEL: define <vscale x 8 x i8> @scalable(
+; CHECK-SAME: ) #[[ATTR0]] {
+; CHECK-NEXT: [[TMP1:%.*]] = call <vscale x 2 x i32> @llvm.riscv.vmv.v.x.nxv2i32.i64(<vscale x 2 x i32> poison, i32 1431655765, i64 1)
+; CHECK-NEXT: [[A:%.*]] = bitcast <vscale x 2 x i32> [[TMP1]] to <vscale x 8 x i8>
+; CHECK-NEXT: ret <vscale x 8 x i8> [[A]]
+;
+ %a = call <vscale x 8 x i8> @llvm.riscv.vmv.v.x.nxv8i8(<vscale x 8 x i8> poison, i8 85, i64 4)
+ ret <vscale x 8 x i8> %a
+}
+
+define <8 x i8> @small_scalar() {
+; CHECK-LABEL: define <8 x i8> @small_scalar(
+; CHECK-SAME: ) #[[ATTR0]] {
+; CHECK-NEXT: [[TMP1:%.*]] = call <2 x i32> @llvm.riscv.vmv.v.x.v2i32.i64(<2 x i32> poison, i32 50529027, i64 1)
+; CHECK-NEXT: [[A:%.*]] = bitcast <2 x i32> [[TMP1]] to <8 x i8>
+; CHECK-NEXT: ret <8 x i8> [[A]]
+;
+ %a = call <8 x i8> @llvm.riscv.vmv.v.x.v8i8(<8 x i8> poison, i8 3, i64 4)
+ ret <8 x i8> %a
+}
+
+define <64 x i1> @users_with_bitcast() {
+; CHECK-LABEL: define <64 x i1> @users_with_bitcast(
+; CHECK-SAME: ) #[[ATTR0]] {
+; CHECK-NEXT: [[TMP1:%.*]] = call <2 x i32> @llvm.riscv.vmv.v.x.v2i32.i64(<2 x i32> poison, i32 1431655765, i64 1)
+; CHECK-NEXT: [[TMP2:%.*]] = call <2 x i32> @llvm.riscv.vmv.v.x.v2i32.i64(<2 x i32> poison, i32 -698984874, i64 1)
+; CHECK-NEXT: [[RET1:%.*]] = xor <2 x i32> [[TMP1]], [[TMP2]]
+; CHECK-NEXT: [[RET:%.*]] = bitcast <2 x i32> [[RET1]] to <64 x i1>
+; CHECK-NEXT: ret <64 x i1> [[RET]]
+;
+ %vmv.1 = call <8 x i8> @llvm.riscv.vmv.v.x.v8i8(<8 x i8> poison, i8 85, i64 4)
+ %cast.1 = bitcast <8 x i8> %vmv.1 to <64 x i1>
+ %vmv.2 = call <8 x i8> @llvm.riscv.vmv.v.x.v8i8(<8 x i8> poison, i8 -86, i64 4)
+ %cast.2 = bitcast <8 x i8> %vmv.2 to <64 x i1>
+ %ret = xor <64 x i1> %cast.1, %cast.2
+ ret <64 x i1> %ret
+}
+
+define <8 x i8> @passthru_non_poison(<8 x i8> %x) {
+; CHECK-LABEL: define <8 x i8> @passthru_non_poison(
+; CHECK-SAME: <8 x i8> [[X:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[A:%.*]] = call <8 x i8> @llvm.riscv.vmv.v.x.v8i8.i64(<8 x i8> [[X]], i8 85, i64 4)
+; CHECK-NEXT: ret <8 x i8> [[A]]
+;
+ %a = call <8 x i8> @llvm.riscv.vmv.v.x.v8i8(<8 x i8> %x, i8 85, i64 4)
+ ret <8 x i8> %a
+}
+
+define <8 x i8> @scalar_non_constant(i8 %scalar) {
+; CHECK-LABEL: define <8 x i8> @scalar_non_constant(
+; CHECK-SAME: i8 [[SCALAR:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[A:%.*]] = call <8 x i8> @llvm.riscv.vmv.v.x.v8i8.i64(<8 x i8> poison, i8 [[SCALAR]], i64 4)
+; CHECK-NEXT: ret <8 x i8> [[A]]
+;
+ %a = call <8 x i8> @llvm.riscv.vmv.v.x.v8i8(<8 x i8> poison, i8 %scalar, i64 4)
+ ret <8 x i8> %a
+}
+
+define <8 x i8> @vl_non_constant(i64 %vl) {
+; CHECK-LABEL: define <8 x i8> @vl_non_constant(
+; CHECK-SAME: i64 [[VL:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[A:%.*]] = call <8 x i8> @llvm.riscv.vmv.v.x.v8i8.i64(<8 x i8> poison, i8 85, i64 [[VL]])
+; CHECK-NEXT: ret <8 x i8> [[A]]
+;
+ %a = call <8 x i8> @llvm.riscv.vmv.v.x.v8i8(<8 x i8> poison, i8 85, i64 %vl)
+ ret <8 x i8> %a
+}
+
+define <1 x i128> @scalar_operand_too_large() {
+; CHECK-LABEL: define <1 x i128> @scalar_operand_too_large(
+; CHECK-SAME: ) #[[ATTR0]] {
+; CHECK-NEXT: [[A:%.*]] = call <1 x i128> @llvm.riscv.vmv.v.x.v1i128.i64(<1 x i128> poison, i128 85, i64 4)
+; CHECK-NEXT: ret <1 x i128> [[A]]
+;
+ %a = call <1 x i128> @llvm.riscv.vmv.v.x.v1i128(<1 x i128> poison, i128 85, i64 4)
+ ret <1 x i128> %a
+}
+
+define <8 x i8> @vl_too_large() {
+; CHECK-LABEL: define <8 x i8> @vl_too_large(
+; CHECK-SAME: ) #[[ATTR0]] {
+; CHECK-NEXT: [[A:%.*]] = call <8 x i8> @llvm.riscv.vmv.v.x.v8i8.i64(<8 x i8> poison, i8 85, i64 128)
+; CHECK-NEXT: ret <8 x i8> [[A]]
+;
+ %a = call <8 x i8> @llvm.riscv.vmv.v.x.v8i8(<8 x i8> poison, i8 85, i64 128)
+ ret <8 x i8> %a
+}
✅ With the latest revision this PR passed the C/C++ code formatter.
🐧 Linux x64 Test Results
✅ The build succeeded and all tests passed.
How can we know that this will reduce vsetvlis unless we know the result is bitcasted? Inserting a bitcast rather than removing a bitcast is guaranteed to create vsetvlis in other cases.
Right, my bad. Fixed now.
I don't understand this example. It's creating a 32 bit mask with every other bit set, but then passing it to an instruction with a VL of 1. So only the first mask bit matters.
It was reduced to just show the optimization -- maybe the less reduced version (https://godbolt.org/z/qWP6vYMK8) helps?
So the non-reduced case produces 32 times as many mask bits as are really needed? The SEW=32 vmv has the same VL as the strided store. Optimizing this without potentially breaking other things requires knowing more than just the bitcast.
🪟 Windows x64 Test Results
✅ The build succeeded and all tests passed.
Oops, I think I might have given you the gcc link instead of the clang one: https://godbolt.org/z/77xnq9han. Yeah, it's quite hard to optimize, but I think folding bitcast(vmv.v.x) will help a bit -- I checked the IR. I think for the vmnot, I have to look into the RISCVVLOptimizer. Not sure what else is possible without too much effort?
Is this code already in Eigen?
Yeah, it's reduced from Eigen.
Can you provide a link to the source?
Ah, sorry -- it's from a downstream hand-optimization that hasn't been upstreamed yet. The downstream repo is https://github.com/ChipKerchner/Eigen |
I believe the code is producing a mask with 32 times as many bits as are needed. It would be best to use a different formulation. Using it would produce a mask with only 8 times as many bits as needed. It uses the same VL as the strided store, uses an i8 constant that's easier to materialize, and reduces the amount of work done on hardware that has DLEN < VLEN. Can the source be changed to this?
If it's a "hand-optimization", why does the compiler need to fix it?
Yes, there's no problem changing the source: the example was artificially crafted to find a missed optimization, hence this patch. Does this bitcast(vmv.v.x) const-fold ever really appear in real-world code?
I'm sure there are cases where there are bitcasts. But I don't know that there is a generic algorithm for figuring out what scaling factor to use. Your original motivating example has a bitcast to an i1 type, which isn't a valid SEW. Other cases might cast to a valid SEW, and some of those cases probably want to scale to that SEW. I see you've updated the code to use the result type of the bitcast now, but you have no tests for i1 vectors. Does it still work on your original motivating case?
I've stripped the motivating case now -- like I said, it was just some toy code to experiment with finding compiler optimizations, and we found one successfully. I think the patch stands on its own merit for removing redundant masks in some cases -- let's not worry about the toy experiment. Is there any pending work on the patch itself?
Constant-fold bitcast(vmv.v.x) to avoid creating redundant masks.
llc run showing vsetvli eliminated: https://godbolt.org/z/d1Gx3KqaT