
[InstCombine/RISCV] Constant-fold bitcast(vmv.v.x)#182630

Open

artagnon wants to merge 9 commits into llvm:main from artagnon:ic-rv-vmv-constfold

Conversation

@artagnon
Contributor

@artagnon artagnon commented Feb 21, 2026

Constant-fold bitcast(vmv.v.x) to avoid creating redundant masks.

llc run showing vsetvli eliminated: https://godbolt.org/z/d1Gx3KqaT

@artagnon artagnon requested a review from nikic as a code owner February 21, 2026 00:25
@llvmbot llvmbot added backend:RISC-V llvm:instcombine Covers the InstCombine, InstSimplify and AggressiveInstCombine passes llvm:transforms labels Feb 21, 2026
@llvmbot
Member

llvmbot commented Feb 21, 2026

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-backend-risc-v

Author: Ramkumar Ramachandra (artagnon)

Changes

The motivating example is: https://godbolt.org/z/vnb3ETsbc

There is an issue with extra vsetvli instructions: RISCVInsertVSETVLI asks which instructions demand VL and optimizes based on that. Due to the non-unit VL in vmv.v.x, we wastefully insert vsetvli instructions even when all operands are constant. To avoid this, constant-fold vmv.v.x to a unit-VL splat of a wider element type, with a bitcast to handle the type conversion.
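The wide-constant computation the fold performs can be sketched in isolation. The following is a hypothetical standalone model of the non-negative-scalar path only (the patch handles negative scalars differently, via abs and setSignBit), not the compiler code itself: replicating an N-bit scalar VL times yields the constant for the unit-VL splat at N*VL bits.

```python
def pack_splat(scalar: int, elt_bits: int, vl: int) -> int:
    """Replicate a non-negative elt_bits-wide scalar vl times into one
    wider integer, mirroring the loop over seq(VL) in the patch."""
    assert 0 <= scalar < (1 << elt_bits), "model covers non-negative scalars only"
    packed = 0
    for idx in range(vl):
        packed |= scalar << (elt_bits * idx)
    return packed

# i8 85 (0x55) splatted with VL=4 becomes the i32 constant 0x55555555
# (1431655765), matching the @fixed test below; i8 3 gives 0x03030303
# (50529027), matching @small_scalar.
print(pack_splat(85, 8, 4))
print(pack_splat(3, 8, 4))
```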

llc run showing vsetvli eliminated: https://godbolt.org/z/KEPTxTPcb


Full diff: https://github.com/llvm/llvm-project/pull/182630.diff

2 Files Affected:

  • (modified) llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp (+41)
  • (added) llvm/test/Transforms/InstCombine/RISCV/riscv-vmv-v-x.ll (+103)
diff --git a/llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp b/llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp
index 6476a38b8a545..f7b2ac2ca96c8 100644
--- a/llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp
+++ b/llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp
@@ -48,6 +48,7 @@
 #include "llvm/IR/IntrinsicsAMDGPU.h"
 #include "llvm/IR/IntrinsicsARM.h"
 #include "llvm/IR/IntrinsicsHexagon.h"
+#include "llvm/IR/IntrinsicsRISCV.h"
 #include "llvm/IR/LLVMContext.h"
 #include "llvm/IR/Metadata.h"
 #include "llvm/IR/PatternMatch.h"
@@ -4262,6 +4263,46 @@ Instruction *InstCombinerImpl::visitCallInst(CallInst &CI) {
           *II, Builder.CreateZExtOrTrunc(II->getArgOperand(0), II->getType()));
     return nullptr;
   }
+  case Intrinsic::riscv_vmv_v_x: {
+    // If all operands are constant, constant-fold with a bitcast. The
+    // rationale is to reduce the number of vsetvli instructions inserted
+    // by RISCVInsertVSETVLI.
+    const APInt *Scalar, *VL;
+    if (!match(II, m_Intrinsic<Intrinsic::riscv_vmv_v_x>(
+                       m_Poison(), m_APInt(Scalar), m_APInt(VL))) ||
+        VL->isOne() || Scalar->getBitWidth() > VL->getBitWidth())
+      return nullptr;
+    auto *VecTy = cast<VectorType>(II->getType());
+    bool IsScalable = VecTy->isScalableTy();
+    ElementCount EC = VecTy->getElementCount();
+    ElementCount ScaleFactor =
+        ElementCount::get(VL->getZExtValue(), IsScalable);
+    auto *EltTy = cast<IntegerType>(VecTy->getScalarType());
+    auto *NewEltTy = IntegerType::get(
+        CI.getContext(), EltTy->getScalarSizeInBits() * VL->getZExtValue());
+    if (!EC.hasKnownScalarFactor(ScaleFactor) ||
+        NewEltTy->getBitWidth() > VL->getBitWidth())
+      return nullptr;
+    ElementCount NewEC =
+        ElementCount::get(EC.getKnownScalarFactor(ScaleFactor), IsScalable);
+    Type *RetTy = VectorType::get(NewEltTy, NewEC);
+    assert(VecTy->canLosslesslyBitCastTo(RetTy) &&
+           "Lossless bitcast between types expected");
+    APInt ScalarExt = Scalar->abs().zext(NewEltTy->getBitWidth());
+    APInt NewScalar(ScalarExt.getBitWidth(), 0);
+    for (unsigned Idx : seq(VL->getZExtValue()))
+      NewScalar |= ScalarExt << Scalar->getBitWidth() * Idx;
+    if (Scalar->isSignBitSet())
+      NewScalar.setSignBit();
+    return replaceInstUsesWith(
+        *II,
+        Builder.CreateBitCast(
+            Builder.CreateIntrinsic(
+                RetTy, Intrinsic::riscv_vmv_v_x,
+                {PoisonValue::get(RetTy), ConstantInt::get(NewEltTy, NewScalar),
+                 ConstantInt::get(II->getOperand(2)->getType(), 1)}),
+            VecTy));
+  }
   default: {
     // Handle target specific intrinsics
     std::optional<Instruction *> V = targetInstCombineIntrinsic(*II);
diff --git a/llvm/test/Transforms/InstCombine/RISCV/riscv-vmv-v-x.ll b/llvm/test/Transforms/InstCombine/RISCV/riscv-vmv-v-x.ll
new file mode 100644
index 0000000000000..cc8afb5f4b89e
--- /dev/null
+++ b/llvm/test/Transforms/InstCombine/RISCV/riscv-vmv-v-x.ll
@@ -0,0 +1,103 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 6
+; RUN: opt -p instcombine -mtriple=riscv32 -mattr=+v -S %s | FileCheck %s
+; RUN: opt -p instcombine -mtriple=riscv64 -mattr=+v -S %s | FileCheck %s
+
+define <8 x i8> @fixed() {
+; CHECK-LABEL: define <8 x i8> @fixed(
+; CHECK-SAME: ) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT:    [[TMP1:%.*]] = call <2 x i32> @llvm.riscv.vmv.v.x.v2i32.i64(<2 x i32> poison, i32 1431655765, i64 1)
+; CHECK-NEXT:    [[A:%.*]] = bitcast <2 x i32> [[TMP1]] to <8 x i8>
+; CHECK-NEXT:    ret <8 x i8> [[A]]
+;
+  %a = call <8 x i8> @llvm.riscv.vmv.v.x.v8i8(<8 x i8> poison, i8 85, i64 4)
+  ret <8 x i8> %a
+}
+
+define <vscale x 8 x i8> @scalable() {
+; CHECK-LABEL: define <vscale x 8 x i8> @scalable(
+; CHECK-SAME: ) #[[ATTR0]] {
+; CHECK-NEXT:    [[TMP1:%.*]] = call <vscale x 2 x i32> @llvm.riscv.vmv.v.x.nxv2i32.i64(<vscale x 2 x i32> poison, i32 1431655765, i64 1)
+; CHECK-NEXT:    [[A:%.*]] = bitcast <vscale x 2 x i32> [[TMP1]] to <vscale x 8 x i8>
+; CHECK-NEXT:    ret <vscale x 8 x i8> [[A]]
+;
+  %a = call <vscale x 8 x i8> @llvm.riscv.vmv.v.x.nxv8i8(<vscale x 8 x i8> poison, i8 85, i64 4)
+  ret <vscale x 8 x i8> %a
+}
+
+define <8 x i8> @small_scalar() {
+; CHECK-LABEL: define <8 x i8> @small_scalar(
+; CHECK-SAME: ) #[[ATTR0]] {
+; CHECK-NEXT:    [[TMP1:%.*]] = call <2 x i32> @llvm.riscv.vmv.v.x.v2i32.i64(<2 x i32> poison, i32 50529027, i64 1)
+; CHECK-NEXT:    [[A:%.*]] = bitcast <2 x i32> [[TMP1]] to <8 x i8>
+; CHECK-NEXT:    ret <8 x i8> [[A]]
+;
+  %a = call <8 x i8> @llvm.riscv.vmv.v.x.v8i8(<8 x i8> poison, i8 3, i64 4)
+  ret <8 x i8> %a
+}
+
+define <64 x i1> @users_with_bitcast() {
+; CHECK-LABEL: define <64 x i1> @users_with_bitcast(
+; CHECK-SAME: ) #[[ATTR0]] {
+; CHECK-NEXT:    [[TMP1:%.*]] = call <2 x i32> @llvm.riscv.vmv.v.x.v2i32.i64(<2 x i32> poison, i32 1431655765, i64 1)
+; CHECK-NEXT:    [[TMP2:%.*]] = call <2 x i32> @llvm.riscv.vmv.v.x.v2i32.i64(<2 x i32> poison, i32 -698984874, i64 1)
+; CHECK-NEXT:    [[RET1:%.*]] = xor <2 x i32> [[TMP1]], [[TMP2]]
+; CHECK-NEXT:    [[RET:%.*]] = bitcast <2 x i32> [[RET1]] to <64 x i1>
+; CHECK-NEXT:    ret <64 x i1> [[RET]]
+;
+  %vmv.1 = call <8 x i8> @llvm.riscv.vmv.v.x.v8i8(<8 x i8> poison, i8 85, i64 4)
+  %cast.1 = bitcast <8 x i8> %vmv.1 to <64 x i1>
+  %vmv.2 = call <8 x i8> @llvm.riscv.vmv.v.x.v8i8(<8 x i8> poison, i8 -86, i64 4)
+  %cast.2 = bitcast <8 x i8> %vmv.2 to <64 x i1>
+  %ret = xor <64 x i1> %cast.1, %cast.2
+  ret <64 x i1> %ret
+}
+
+define <8 x i8> @passthru_non_poison(<8 x i8> %x) {
+; CHECK-LABEL: define <8 x i8> @passthru_non_poison(
+; CHECK-SAME: <8 x i8> [[X:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT:    [[A:%.*]] = call <8 x i8> @llvm.riscv.vmv.v.x.v8i8.i64(<8 x i8> [[X]], i8 85, i64 4)
+; CHECK-NEXT:    ret <8 x i8> [[A]]
+;
+  %a = call <8 x i8> @llvm.riscv.vmv.v.x.v8i8(<8 x i8> %x, i8 85, i64 4)
+  ret <8 x i8> %a
+}
+
+define <8 x i8> @scalar_non_constant(i8 %scalar) {
+; CHECK-LABEL: define <8 x i8> @scalar_non_constant(
+; CHECK-SAME: i8 [[SCALAR:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT:    [[A:%.*]] = call <8 x i8> @llvm.riscv.vmv.v.x.v8i8.i64(<8 x i8> poison, i8 [[SCALAR]], i64 4)
+; CHECK-NEXT:    ret <8 x i8> [[A]]
+;
+  %a = call <8 x i8> @llvm.riscv.vmv.v.x.v8i8(<8 x i8> poison, i8 %scalar, i64 4)
+  ret <8 x i8> %a
+}
+
+define <8 x i8> @vl_non_constant(i64 %vl) {
+; CHECK-LABEL: define <8 x i8> @vl_non_constant(
+; CHECK-SAME: i64 [[VL:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT:    [[A:%.*]] = call <8 x i8> @llvm.riscv.vmv.v.x.v8i8.i64(<8 x i8> poison, i8 85, i64 [[VL]])
+; CHECK-NEXT:    ret <8 x i8> [[A]]
+;
+  %a = call <8 x i8> @llvm.riscv.vmv.v.x.v8i8(<8 x i8> poison, i8 85, i64 %vl)
+  ret <8 x i8> %a
+}
+
+define <1 x i128> @scalar_operand_too_large() {
+; CHECK-LABEL: define <1 x i128> @scalar_operand_too_large(
+; CHECK-SAME: ) #[[ATTR0]] {
+; CHECK-NEXT:    [[A:%.*]] = call <1 x i128> @llvm.riscv.vmv.v.x.v1i128.i64(<1 x i128> poison, i128 85, i64 4)
+; CHECK-NEXT:    ret <1 x i128> [[A]]
+;
+  %a = call <1 x i128> @llvm.riscv.vmv.v.x.v1i128(<1 x i128> poison, i128 85, i64 4)
+  ret <1 x i128> %a
+}
+
+define <8 x i8> @vl_too_large() {
+; CHECK-LABEL: define <8 x i8> @vl_too_large(
+; CHECK-SAME: ) #[[ATTR0]] {
+; CHECK-NEXT:    [[A:%.*]] = call <8 x i8> @llvm.riscv.vmv.v.x.v8i8.i64(<8 x i8> poison, i8 85, i64 128)
+; CHECK-NEXT:    ret <8 x i8> [[A]]
+;
+  %a = call <8 x i8> @llvm.riscv.vmv.v.x.v8i8(<8 x i8> poison, i8 85, i64 128)
+  ret <8 x i8> %a
+}

@github-actions

github-actions bot commented Feb 21, 2026

✅ With the latest revision this PR passed the C/C++ code formatter.

@artagnon artagnon force-pushed the ic-rv-vmv-constfold branch from d0e4495 to 4a544a7 Compare February 21, 2026 00:48
@github-actions

github-actions bot commented Feb 21, 2026

🐧 Linux x64 Test Results

  • 191100 tests passed
  • 4964 tests skipped

✅ The build succeeded and all tests passed.

@topperc
Collaborator

topperc commented Feb 21, 2026

How can we know that this will reduce vsetvlis unless we know the result is bitcasted? Inserting a bitcast rather than removing a bitcast is guaranteed to create vsetvlis in other cases.

@artagnon artagnon changed the title [InstCombine] Constant-fold riscv.vmv.v.x [InstCombine/RISCV] Constant-fold vmv.v.x Feb 21, 2026
@artagnon
Contributor Author

How can we know that this will reduce vsetvlis unless we know the result is bitcasted? Inserting a bitcast rather than removing a bitcast is guaranteed to create vsetvlis in other cases.

Right, my bad. Fixed now.

@topperc
Collaborator

topperc commented Feb 24, 2026

The motivating example is: https://godbolt.org/z/vnb3ETsbc

I don't understand this example. It's creating a 32-bit mask with every other bit set, but then passing it to an instruction with a VL of 1. So only the first mask bit matters.

@artagnon
Contributor Author

The motivating example is: https://godbolt.org/z/vnb3ETsbc

I don't understand this example. It's creating a 32-bit mask with every other bit set, but then passing it to an instruction with a VL of 1. So only the first mask bit matters.

It was reduced to just show the optimization -- maybe the less reduced version (https://godbolt.org/z/qWP6vYMK8) helps?

@topperc
Collaborator

topperc commented Feb 24, 2026

The motivating example is: https://godbolt.org/z/vnb3ETsbc

I don't understand this example. It's creating a 32-bit mask with every other bit set, but then passing it to an instruction with a VL of 1. So only the first mask bit matters.

It was reduced to just show the optimization -- maybe the less reduced version (https://godbolt.org/z/qWP6vYMK8) helps?

So the non-reduced case produces 32 times as many mask bits as are really needed? SEW=32 vmv has the same VL as the strided store. Optimizing this without potentially breaking other things requires knowing more than just the bitcast.

@artagnon artagnon changed the title [InstCombine/RISCV] Constant-fold vmv.v.x [InstCombine/RISCV] Constant-fold bitcast(vmv.v.x) Feb 25, 2026
@artagnon artagnon force-pushed the ic-rv-vmv-constfold branch from b1bba34 to 3d35ea8 Compare February 25, 2026 10:33
@github-actions

github-actions bot commented Feb 25, 2026

🪟 Windows x64 Test Results

  • 131368 tests passed
  • 2950 tests skipped

✅ The build succeeded and all tests passed.

@artagnon
Contributor Author

So the non-reduced case produces 32 times as many mask bits as are really needed? SEW=32 vmv has the same VL as the strided store. Optimizing this without potentially breaking other things requires knowing more than just the bitcast.

Oops, I think I might have given you the gcc link instead of the clang one: https://godbolt.org/z/77xnq9han. Yeah, it's quite hard to optimize, but I think folding bitcast(vmv.v.x) will help a bit -- I checked the IR. I think for the vmnot, I have to look into the RISCVVLOptimizer. Not sure what else is possible without too much effort?

@topperc
Collaborator

topperc commented Feb 25, 2026

So the non-reduced case produces 32 times as many mask bits as are really needed? SEW=32 vmv has the same VL as the strided store. Optimizing this without potentially breaking other things requires knowing more than just the bitcast.

Oops, I think I might have given you the gcc link instead of the clang one: https://godbolt.org/z/77xnq9han. Yeah, it's quite hard to optimize, but I think folding bitcast(vmv.v.x) will help a bit -- I checked the IR. I think for the vmnot, I have to look into the RISCVVLOptimizer. Not sure what else is possible without too much effort?

Is this code already in Eigen?

@artagnon
Contributor Author

Is this code already in Eigen?

Yeah, it's reduced from Eigen.

@topperc
Collaborator

topperc commented Feb 25, 2026

Is this code already in Eigen?

Yeah, it's reduced from Eigen.

Can you provide a link to the source?

@artagnon
Contributor Author

Is this code already in Eigen?

Yeah, it's reduced from Eigen.

Can you provide a link to the source?

Ah, sorry -- it's from a downstream hand-optimization that hasn't been upstreamed yet. The downstream repo is https://github.com/ChipKerchner/Eigen

@topperc
Collaborator

topperc commented Feb 25, 2026

I believe the code is producing a mask with 32 times as many bits as are needed. It would be best to do:

const PacketMask32 mask = __riscv_vreinterpret_v_i8m1_b32(
    __riscv_vlmul_ext_v_i8mf4_i8m1(__riscv_vmv_v_x_i8mf4(
        static_cast<char>(0x55), unpacket_traits<Packet1Xi>::size)));

Using this would produce a mask with only 8 times as many bits as needed. It uses the same VL as the strided store, uses an i8 constant that's easier to materialize, and reduces the amount of work done on hardware that has DLEN < VLEN. Can the source be changed to this?

@topperc
Collaborator

topperc commented Feb 25, 2026

Is this code already in Eigen?

Yeah, it's reduced from Eigen.

Can you provide a link to the source?

Ah, sorry -- it's from a downstream hand-optimization that hasn't been upstreamed yet. The downstream repo is https://github.com/ChipKerchner/Eigen

If it's a "hand-optimization", why does the compiler need to fix it?

@artagnon
Contributor Author

I believe the code is producing a mask with 32 times as many bits as are needed. It would be best to do:

Yes, there's no problem changing the source: the example was artificially crafted to find a missed optimization, hence this patch. Does this bitcast(vmv.v.x) const-fold really never appear in real-world code?

@topperc
Collaborator

topperc commented Feb 25, 2026

I believe the code is producing a mask with 32 times as many bits as are needed. It would be best to do:

Yes, there's no problem changing the source: the example was artificially crafted to find a missed optimization, hence this patch. Does this bitcast(vmv.v.x) const-fold really never appear in real-world code?

I'm sure there are cases where there are bitcasts. But I don't know that there is a generic algorithm for figuring out what scaling factor to use. Your original motivating example has a bitcast to an i1 type, which isn't a valid SEW. Other cases might cast to a valid SEW. Some of those cases probably want to scale to that SEW.

I see you've updated the code to use the result type of the bitcast now, but you have no tests for i1 vectors. Does it still work on your original motivating case?

@artagnon
Contributor Author

I see you've updated the code to use result type of the bitcast now, but you have no tests for i1 vectors. Does it still work on your motivating case?

I've stripped the motivating case now -- like I said, it was just some toy code to experiment with finding compiler optimizations, and we found one successfully. I think the patch stands on its own merit for removing redundant masks in some cases -- let's not worry about the toy experiment. Is there any pending work for the patch itself?

