
[RISCV] Combine build_vector of extract_vector_elts to vector_shuffle #80883

Closed

Conversation

lukel97
Contributor

@lukel97 lukel97 commented Feb 6, 2024

When a shufflevector instruction has a different result type than its input
vectors, SelectionDAGBuilder may create a build_vector of extract_vector_elts
instead of a vector_shuffle (since vector_shuffle requires the operand and
result types to be the same).

When this happens, DAGCombiner::reduceBuildVecToShuffle normally folds it back
into a vector_shuffle and we don't notice it. However, it can fail if
TLI.isExtractSubvectorCheap returns false, resulting in a surprising sequence
of vslidedowns.

For example, in the test case added in bc569f6,
isExtractSubvectorCheap returns false because the index is > 31 and doesn't fit
into the uimm5 of vslidedown.vi.

The generic combine is quite geared towards x86, which has actual shuffle
instructions and expects vectors to be quite small, so it doesn't always deem
the combine profitable. But on RISC-V a vector_shuffle (which is at worst
a vrgather) is almost always going to be cheaper than a build_vector (which
will most likely be a sequence of vslidedowns).

This patch adds a separate combine for RISC-V that folds such build_vectors
into vector_shuffles without checking whether extract_subvector is cheap, so
we're no longer implicitly relying on the generic DAG combine, which isn't
always guaranteed to kick in.

Another way to look at this is that this implements the transform as a
canonicalization, rather than an optimisation.
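
To make the pattern concrete, here is a minimal IR sketch (the wrapper function name is made up for illustration; the shufflevector itself mirrors the shuffle_v64i8_v8i8 test updated in the diff below). The <8 x i8> result type differs from the <64 x i8> operand type, so SelectionDAGBuilder scalarizes it, and indices such as 32..56 don't fit in the uimm5 of vslidedown.vi:

; Hypothetical wrapper around the strided-gather shuffle from the test diff.
define <8 x i8> @strided_gather_example(<64 x i8> %wide.vec) {
  %s = shufflevector <64 x i8> %wide.vec, <64 x i8> poison, <8 x i32> <i32 0, i32 8, i32 16, i32 24, i32 32, i32 40, i32 48, i32 56>
  ret <8 x i8> %s
}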

@llvmbot
Collaborator

llvmbot commented Feb 6, 2024

@llvm/pr-subscribers-backend-risc-v

Author: Luke Lau (lukel97)

Changes

Patch is 52.13 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/80883.diff

3 Files Affected:

  • (modified) llvm/lib/Target/RISCV/RISCVISelLowering.cpp (+71-5)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-shuffles.ll (+9-91)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-interleaved-access.ll (+339-559)
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index 27037f4d5c5c8..e6e14bb8592c0 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -14933,11 +14933,68 @@ static SDValue performSELECTCombine(SDNode *N, SelectionDAG &DAG,
   return tryFoldSelectIntoOp(N, DAG, FalseVal, TrueVal, /*Swapped*/true);
 }
 
-/// If we have a build_vector where each lane is binop X, C, where C
-/// is a constant (but not necessarily the same constant on all lanes),
-/// form binop (build_vector x1, x2, ...), (build_vector c1, c2, c3, ..).
-/// We assume that materializing a constant build vector will be no more
-/// expensive that performing O(n) binops.
+/// Canonicalize
+///
+/// (build_vector (extract_vector_elt v, i0),
+///               (extract_vector_elt v, i1),
+///               (extract_vector_elt v, i2),
+///               (extract_vector_elt v, i3))
+///
+/// to
+///
+/// (vector_shuffle<i0, i1, i2, i3> v, undef)
+///
+/// shufflevectors may be lowered to the build_vector pattern above if the
+/// vector types don't match, so try and recover the shuffle to avoid
+/// scalarization.
+///
+/// Note that this combine exists in a similar form in DAGCombiner.cpp, but it
+/// depends on extract_subvector being cheap. On RISC-V we always want to use
+/// the vector_shuffle form, so we perform it unconditionally here.
+static SDValue combineBuildVectorToShuffle(SDNode *N, SelectionDAG &DAG,
+                                           const RISCVTargetLowering &TLI) {
+  EVT VT = N->getValueType(0);
+  SDLoc DL(N);
+  if (!TLI.isTypeLegal(VT))
+    return SDValue();
+
+  SDValue ExtractVec;
+  SmallVector<int> Mask(N->getNumOperands(), -1);
+  for (auto [i, Op] : enumerate(N->op_values())) {
+    if (Op.isUndef())
+      continue;
+    if (Op.getOpcode() != ISD::EXTRACT_VECTOR_ELT)
+      return SDValue();
+    if (!isa<ConstantSDNode>(Op.getOperand(1)))
+      return SDValue();
+    if (!ExtractVec) {
+      ExtractVec = Op.getOperand(0);
+      Mask[i] = Op.getConstantOperandVal(1);
+      continue;
+    }
+    if (Op.getOperand(0) != ExtractVec)
+      return SDValue();
+    Mask[i] = Op.getConstantOperandVal(1);
+  }
+
+  EVT ExtractVT = ExtractVec.getValueType();
+  if (ExtractVT.isScalableVector())
+    return SDValue();
+
+  // extract_vector_elt can extend the type, skip this case.
+  if (ExtractVT.getVectorElementType() != VT.getVectorElementType())
+    return SDValue();
+
+  if (ExtractVT.getVectorNumElements() < VT.getVectorNumElements())
+    return SDValue();
+
+  Mask.resize(ExtractVT.getVectorNumElements(), -1);
+  SDValue Shuffle = DAG.getVectorShuffle(ExtractVT, DL, ExtractVec,
+                                         DAG.getUNDEF(ExtractVT), Mask);
+  return DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, VT, Shuffle,
+                     DAG.getVectorIdxConstant(0, DL));
+}
+
 static SDValue performBUILD_VECTORCombine(SDNode *N, SelectionDAG &DAG,
                                           const RISCVSubtarget &Subtarget,
                                           const RISCVTargetLowering &TLI) {
@@ -14949,6 +15006,15 @@ static SDValue performBUILD_VECTORCombine(SDNode *N, SelectionDAG &DAG,
   if (VT.getVectorNumElements() == 1)
     return SDValue();
 
+  if (SDValue V = combineBuildVectorToShuffle(N, DAG, TLI))
+    return V;
+
+  /// If we have a build_vector where each lane is binop X, C, where C
+  /// is a constant (but not necessarily the same constant on all lanes),
+  /// form binop (build_vector x1, x2, ...), (build_vector c1, c2, c3, ..).
+  /// We assume that materializing a constant build vector will be no more
+  /// expensive that performing O(n) binops.
+
   const unsigned Opcode = N->op_begin()->getNode()->getOpcode();
   if (!TLI.isBinOp(Opcode))
     return SDValue();
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-shuffles.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-shuffles.ll
index acad71bb59590..a4c99c5a7a822 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-shuffles.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-shuffles.ll
@@ -722,97 +722,15 @@ define <8 x i32> @shuffle_v8i32_2(<8 x i32> %x, <8 x i32> %y) {
 
 ; FIXME: This could be expressed as a vrgather.vv
 define <8 x i8> @shuffle_v64i8_v8i8(<64 x i8> %wide.vec) {
-; RV32-LABEL: shuffle_v64i8_v8i8:
-; RV32:       # %bb.0:
-; RV32-NEXT:    addi sp, sp, -128
-; RV32-NEXT:    .cfi_def_cfa_offset 128
-; RV32-NEXT:    sw ra, 124(sp) # 4-byte Folded Spill
-; RV32-NEXT:    sw s0, 120(sp) # 4-byte Folded Spill
-; RV32-NEXT:    .cfi_offset ra, -4
-; RV32-NEXT:    .cfi_offset s0, -8
-; RV32-NEXT:    addi s0, sp, 128
-; RV32-NEXT:    .cfi_def_cfa s0, 0
-; RV32-NEXT:    andi sp, sp, -64
-; RV32-NEXT:    li a0, 64
-; RV32-NEXT:    mv a1, sp
-; RV32-NEXT:    vsetvli zero, a0, e8, m4, ta, ma
-; RV32-NEXT:    vse8.v v8, (a1)
-; RV32-NEXT:    vsetivli zero, 1, e8, m1, ta, ma
-; RV32-NEXT:    vslidedown.vi v10, v8, 8
-; RV32-NEXT:    vmv.x.s a0, v10
-; RV32-NEXT:    vmv.x.s a1, v8
-; RV32-NEXT:    vsetivli zero, 8, e8, mf2, ta, ma
-; RV32-NEXT:    vmv.v.x v10, a1
-; RV32-NEXT:    vslide1down.vx v10, v10, a0
-; RV32-NEXT:    vsetivli zero, 1, e8, m2, ta, ma
-; RV32-NEXT:    vslidedown.vi v12, v8, 16
-; RV32-NEXT:    vmv.x.s a0, v12
-; RV32-NEXT:    vsetivli zero, 8, e8, mf2, ta, ma
-; RV32-NEXT:    vslide1down.vx v10, v10, a0
-; RV32-NEXT:    vsetivli zero, 1, e8, m2, ta, ma
-; RV32-NEXT:    vslidedown.vi v8, v8, 24
-; RV32-NEXT:    vmv.x.s a0, v8
-; RV32-NEXT:    vsetivli zero, 8, e8, mf2, ta, ma
-; RV32-NEXT:    vslide1down.vx v8, v10, a0
-; RV32-NEXT:    lbu a0, 32(sp)
-; RV32-NEXT:    lbu a1, 40(sp)
-; RV32-NEXT:    lbu a2, 48(sp)
-; RV32-NEXT:    lbu a3, 56(sp)
-; RV32-NEXT:    vslide1down.vx v8, v8, a0
-; RV32-NEXT:    vslide1down.vx v8, v8, a1
-; RV32-NEXT:    vslide1down.vx v8, v8, a2
-; RV32-NEXT:    vslide1down.vx v8, v8, a3
-; RV32-NEXT:    addi sp, s0, -128
-; RV32-NEXT:    lw ra, 124(sp) # 4-byte Folded Reload
-; RV32-NEXT:    lw s0, 120(sp) # 4-byte Folded Reload
-; RV32-NEXT:    addi sp, sp, 128
-; RV32-NEXT:    ret
-;
-; RV64-LABEL: shuffle_v64i8_v8i8:
-; RV64:       # %bb.0:
-; RV64-NEXT:    addi sp, sp, -128
-; RV64-NEXT:    .cfi_def_cfa_offset 128
-; RV64-NEXT:    sd ra, 120(sp) # 8-byte Folded Spill
-; RV64-NEXT:    sd s0, 112(sp) # 8-byte Folded Spill
-; RV64-NEXT:    .cfi_offset ra, -8
-; RV64-NEXT:    .cfi_offset s0, -16
-; RV64-NEXT:    addi s0, sp, 128
-; RV64-NEXT:    .cfi_def_cfa s0, 0
-; RV64-NEXT:    andi sp, sp, -64
-; RV64-NEXT:    li a0, 64
-; RV64-NEXT:    mv a1, sp
-; RV64-NEXT:    vsetvli zero, a0, e8, m4, ta, ma
-; RV64-NEXT:    vse8.v v8, (a1)
-; RV64-NEXT:    vsetivli zero, 1, e8, m1, ta, ma
-; RV64-NEXT:    vslidedown.vi v10, v8, 8
-; RV64-NEXT:    vmv.x.s a0, v10
-; RV64-NEXT:    vmv.x.s a1, v8
-; RV64-NEXT:    vsetivli zero, 8, e8, mf2, ta, ma
-; RV64-NEXT:    vmv.v.x v10, a1
-; RV64-NEXT:    vslide1down.vx v10, v10, a0
-; RV64-NEXT:    vsetivli zero, 1, e8, m2, ta, ma
-; RV64-NEXT:    vslidedown.vi v12, v8, 16
-; RV64-NEXT:    vmv.x.s a0, v12
-; RV64-NEXT:    vsetivli zero, 8, e8, mf2, ta, ma
-; RV64-NEXT:    vslide1down.vx v10, v10, a0
-; RV64-NEXT:    vsetivli zero, 1, e8, m2, ta, ma
-; RV64-NEXT:    vslidedown.vi v8, v8, 24
-; RV64-NEXT:    vmv.x.s a0, v8
-; RV64-NEXT:    vsetivli zero, 8, e8, mf2, ta, ma
-; RV64-NEXT:    vslide1down.vx v8, v10, a0
-; RV64-NEXT:    lbu a0, 32(sp)
-; RV64-NEXT:    lbu a1, 40(sp)
-; RV64-NEXT:    lbu a2, 48(sp)
-; RV64-NEXT:    lbu a3, 56(sp)
-; RV64-NEXT:    vslide1down.vx v8, v8, a0
-; RV64-NEXT:    vslide1down.vx v8, v8, a1
-; RV64-NEXT:    vslide1down.vx v8, v8, a2
-; RV64-NEXT:    vslide1down.vx v8, v8, a3
-; RV64-NEXT:    addi sp, s0, -128
-; RV64-NEXT:    ld ra, 120(sp) # 8-byte Folded Reload
-; RV64-NEXT:    ld s0, 112(sp) # 8-byte Folded Reload
-; RV64-NEXT:    addi sp, sp, 128
-; RV64-NEXT:    ret
+; CHECK-LABEL: shuffle_v64i8_v8i8:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    li a0, 64
+; CHECK-NEXT:    vsetvli zero, a0, e8, m4, ta, ma
+; CHECK-NEXT:    vid.v v12
+; CHECK-NEXT:    vsll.vi v16, v12, 3
+; CHECK-NEXT:    vrgather.vv v12, v8, v16
+; CHECK-NEXT:    vmv1r.v v8, v12
+; CHECK-NEXT:    ret
   %s = shufflevector <64 x i8> %wide.vec, <64 x i8> poison, <8 x i32> <i32 0, i32 8, i32 16, i32 24, i32 32, i32 40, i32 48, i32 56>
   ret <8 x i8> %s
 }
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-interleaved-access.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-interleaved-access.ll
index e27ff0a573d5f..d9cfc778de30a 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-interleaved-access.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-interleaved-access.ll
@@ -163,417 +163,266 @@ define {<8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>} @load_
 ; RV32-NEXT:    mul a2, a2, a3
 ; RV32-NEXT:    sub sp, sp, a2
 ; RV32-NEXT:    .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0x38, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 56 * vlenb
-; RV32-NEXT:    addi a3, a1, 256
-; RV32-NEXT:    li a2, 32
-; RV32-NEXT:    vsetvli zero, a2, e32, m8, ta, ma
-; RV32-NEXT:    vle32.v v16, (a3)
-; RV32-NEXT:    csrr a3, vlenb
-; RV32-NEXT:    slli a3, a3, 5
-; RV32-NEXT:    add a3, sp, a3
-; RV32-NEXT:    addi a3, a3, 16
-; RV32-NEXT:    vs8r.v v16, (a3) # Unknown-size Folded Spill
-; RV32-NEXT:    addi a3, a1, 128
-; RV32-NEXT:    vsetivli zero, 16, e32, m4, ta, ma
-; RV32-NEXT:    vslideup.vi v8, v16, 4
-; RV32-NEXT:    csrr a4, vlenb
-; RV32-NEXT:    slli a4, a4, 4
-; RV32-NEXT:    add a4, sp, a4
-; RV32-NEXT:    addi a4, a4, 16
-; RV32-NEXT:    vs4r.v v8, (a4) # Unknown-size Folded Spill
-; RV32-NEXT:    lui a4, 12
-; RV32-NEXT:    vmv.s.x v1, a4
-; RV32-NEXT:    vsetivli zero, 16, e32, m8, ta, ma
-; RV32-NEXT:    vslidedown.vi v16, v16, 16
-; RV32-NEXT:    csrr a4, vlenb
-; RV32-NEXT:    li a5, 24
-; RV32-NEXT:    mul a4, a4, a5
-; RV32-NEXT:    add a4, sp, a4
-; RV32-NEXT:    addi a4, a4, 16
-; RV32-NEXT:    vs8r.v v16, (a4) # Unknown-size Folded Spill
-; RV32-NEXT:    vsetivli zero, 16, e32, m4, ta, mu
-; RV32-NEXT:    vmv1r.v v0, v1
-; RV32-NEXT:    csrr a4, vlenb
-; RV32-NEXT:    slli a4, a4, 2
-; RV32-NEXT:    add a4, sp, a4
-; RV32-NEXT:    addi a4, a4, 16
-; RV32-NEXT:    vs1r.v v1, (a4) # Unknown-size Folded Spill
-; RV32-NEXT:    vslideup.vi v8, v16, 10, v0.t
-; RV32-NEXT:    csrr a4, vlenb
-; RV32-NEXT:    li a5, 20
-; RV32-NEXT:    mul a4, a4, a5
-; RV32-NEXT:    add a4, sp, a4
-; RV32-NEXT:    addi a4, a4, 16
-; RV32-NEXT:    vs4r.v v8, (a4) # Unknown-size Folded Spill
-; RV32-NEXT:    lui a4, %hi(.LCPI6_0)
-; RV32-NEXT:    addi a4, a4, %lo(.LCPI6_0)
-; RV32-NEXT:    vsetvli zero, a2, e32, m8, ta, mu
-; RV32-NEXT:    vle16.v v8, (a4)
-; RV32-NEXT:    csrr a4, vlenb
-; RV32-NEXT:    slli a4, a4, 3
-; RV32-NEXT:    add a4, sp, a4
-; RV32-NEXT:    addi a4, a4, 16
-; RV32-NEXT:    vs4r.v v8, (a4) # Unknown-size Folded Spill
-; RV32-NEXT:    lui a4, %hi(.LCPI6_1)
-; RV32-NEXT:    addi a4, a4, %lo(.LCPI6_1)
-; RV32-NEXT:    lui a5, 1
-; RV32-NEXT:    vle16.v v8, (a4)
-; RV32-NEXT:    addi a4, sp, 16
-; RV32-NEXT:    vs4r.v v8, (a4) # Unknown-size Folded Spill
-; RV32-NEXT:    vle32.v v16, (a1)
-; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a4, 40
-; RV32-NEXT:    mul a1, a1, a4
-; RV32-NEXT:    add a1, sp, a1
-; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vs8r.v v16, (a1) # Unknown-size Folded Spill
-; RV32-NEXT:    vle32.v v24, (a3)
-; RV32-NEXT:    csrr a1, vlenb
+; RV32-NEXT:    addi a2, a1, 256
+; RV32-NEXT:    vsetivli zero, 16, e16, m2, ta, ma
+; RV32-NEXT:    vle64.v v8, (a2)
+; RV32-NEXT:    csrr a2, vlenb
+; RV32-NEXT:    li a3, 40
+; RV32-NEXT:    mul a2, a2, a3
+; RV32-NEXT:    add a2, sp, a2
+; RV32-NEXT:    addi a2, a2, 16
+; RV32-NEXT:    vs8r.v v8, (a2) # Unknown-size Folded Spill
+; RV32-NEXT:    addi a2, a1, 128
+; RV32-NEXT:    vle64.v v24, (a2)
+; RV32-NEXT:    csrr a2, vlenb
 ; RV32-NEXT:    li a3, 48
-; RV32-NEXT:    mul a1, a1, a3
-; RV32-NEXT:    add a1, sp, a1
-; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vs8r.v v24, (a1) # Unknown-size Folded Spill
-; RV32-NEXT:    addi a1, a5, -64
-; RV32-NEXT:    vmv.s.x v0, a1
-; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a3, 12
-; RV32-NEXT:    mul a1, a1, a3
-; RV32-NEXT:    add a1, sp, a1
-; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vs1r.v v0, (a1) # Unknown-size Folded Spill
-; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    slli a1, a1, 3
-; RV32-NEXT:    add a1, sp, a1
-; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vl4r.v v4, (a1) # Unknown-size Folded Reload
-; RV32-NEXT:    vrgatherei16.vv v8, v16, v4
-; RV32-NEXT:    addi a1, sp, 16
-; RV32-NEXT:    vl4r.v v16, (a1) # Unknown-size Folded Reload
-; RV32-NEXT:    vrgatherei16.vv v8, v24, v16, v0.t
-; RV32-NEXT:    vsetivli zero, 12, e32, m4, tu, ma
-; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a3, 20
-; RV32-NEXT:    mul a1, a1, a3
-; RV32-NEXT:    add a1, sp, a1
-; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vl4r.v v12, (a1) # Unknown-size Folded Reload
-; RV32-NEXT:    vmv.v.v v12, v8
-; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a3, 20
-; RV32-NEXT:    mul a1, a1, a3
-; RV32-NEXT:    add a1, sp, a1
-; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vs4r.v v12, (a1) # Unknown-size Folded Spill
-; RV32-NEXT:    vsetivli zero, 16, e32, m4, ta, mu
+; RV32-NEXT:    mul a2, a2, a3
+; RV32-NEXT:    add a2, sp, a2
+; RV32-NEXT:    addi a2, a2, 16
+; RV32-NEXT:    vs8r.v v24, (a2) # Unknown-size Folded Spill
+; RV32-NEXT:    vle64.v v8, (a1)
+; RV32-NEXT:    vid.v v16
+; RV32-NEXT:    li a1, 6
+; RV32-NEXT:    vmul.vx v2, v16, a1
+; RV32-NEXT:    vsetvli zero, zero, e64, m8, ta, ma
+; RV32-NEXT:    vrgatherei16.vv v16, v8, v2
 ; RV32-NEXT:    csrr a1, vlenb
 ; RV32-NEXT:    slli a1, a1, 5
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vl8r.v v16, (a1) # Unknown-size Folded Reload
-; RV32-NEXT:    vslideup.vi v8, v16, 2
+; RV32-NEXT:    vs8r.v v8, (a1) # Unknown-size Folded Spill
+; RV32-NEXT:    vsetvli zero, zero, e16, m2, ta, ma
+; RV32-NEXT:    li a1, 56
+; RV32-NEXT:    vmv.s.x v1, a1
+; RV32-NEXT:    vadd.vi v10, v2, -16
+; RV32-NEXT:    vsetvli zero, zero, e64, m8, ta, mu
 ; RV32-NEXT:    vmv1r.v v0, v1
+; RV32-NEXT:    vrgatherei16.vv v16, v24, v10, v0.t
+; RV32-NEXT:    lui a1, 160
+; RV32-NEXT:    addi a1, a1, 4
+; RV32-NEXT:    vsetivli zero, 8, e32, m2, ta, ma
+; RV32-NEXT:    li a2, 192
+; RV32-NEXT:    vmv.s.x v4, a2
+; RV32-NEXT:    vmv.v.x v10, a1
+; RV32-NEXT:    vsetivli zero, 16, e64, m8, ta, mu
+; RV32-NEXT:    vmv1r.v v0, v4
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a3, 24
-; RV32-NEXT:    mul a1, a1, a3
-; RV32-NEXT:    add a1, sp, a1
-; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vl8r.v v16, (a1) # Unknown-size Folded Reload
-; RV32-NEXT:    vslideup.vi v8, v16, 8, v0.t
-; RV32-NEXT:    vmv.v.v v20, v8
-; RV32-NEXT:    lui a1, %hi(.LCPI6_2)
-; RV32-NEXT:    addi a1, a1, %lo(.LCPI6_2)
-; RV32-NEXT:    vsetvli zero, a2, e32, m8, ta, mu
-; RV32-NEXT:    lui a3, %hi(.LCPI6_3)
-; RV32-NEXT:    addi a3, a3, %lo(.LCPI6_3)
-; RV32-NEXT:    vle16.v v24, (a1)
-; RV32-NEXT:    vle16.v v8, (a3)
-; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    slli a1, a1, 3
-; RV32-NEXT:    add a1, sp, a1
-; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vs4r.v v8, (a1) # Unknown-size Folded Spill
-; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a3, 40
-; RV32-NEXT:    mul a1, a1, a3
-; RV32-NEXT:    add a1, sp, a1
-; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vl8r.v v0, (a1) # Unknown-size Folded Reload
-; RV32-NEXT:    vrgatherei16.vv v8, v0, v24
-; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a3, 12
-; RV32-NEXT:    mul a1, a1, a3
-; RV32-NEXT:    add a1, sp, a1
-; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vl1r.v v0, (a1) # Unknown-size Folded Reload
-; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a3, 48
-; RV32-NEXT:    mul a1, a1, a3
+; RV32-NEXT:    li a2, 40
+; RV32-NEXT:    mul a1, a1, a2
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
 ; RV32-NEXT:    vl8r.v v24, (a1) # Unknown-size Folded Reload
+; RV32-NEXT:    vrgatherei16.vv v16, v24, v10, v0.t
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    slli a1, a1, 3
-; RV32-NEXT:    add a1, sp, a1
-; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vl4r.v v4, (a1) # Unknown-size Folded Reload
-; RV32-NEXT:    vrgatherei16.vv v8, v24, v4, v0.t
-; RV32-NEXT:    vsetivli zero, 12, e32, m4, tu, ma
-; RV32-NEXT:    vmv.v.v v20, v8
-; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a3, 12
-; RV32-NEXT:    mul a1, a1, a3
+; RV32-NEXT:    slli a1, a1, 4
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vs4r.v v20, (a1) # Unknown-size Folded Spill
-; RV32-NEXT:    lui a1, %hi(.LCPI6_4)
-; RV32-NEXT:    addi a1, a1, %lo(.LCPI6_4)
-; RV32-NEXT:    vsetivli zero, 16, e32, m4, ta, mu
-; RV32-NEXT:    vle16.v v8, (a1)
+; RV32-NEXT:    vs8r.v v16, (a1) # Unknown-size Folded Spill
+; RV32-NEXT:    vsetvli zero, zero, e16, m2, ta, ma
+; RV32-NEXT:    vadd.vi v6, v2, 1
+; RV32-NEXT:    vsetvli zero, zero, e64, m8, ta, ma
 ; RV32-NEXT:    csrr a1, vlenb
 ; RV32-NEXT:    slli a1, a1, 5
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vl8r.v v24, (a1) # Unknown-size Folded Reload
-; RV32-NEXT:    vrgatherei16.vv v12, v24, v8
-; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    slli a1, a1, 2
-; RV32-NEXT:    add a1, sp, a1
-; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vl1r.v v1, (a1) # Unknown-size Folded Reload
+; RV32-NEXT:    vl8r.v v16, (a1) # Unknown-size Folded Reload
+; RV32-NEXT:    vrgatherei16.vv v8, v16, v6
+; RV32-NEXT:    vsetvli zero, zero, e16, m2, ta, ma
+; RV32-NEXT:    vadd.vi v6, v2, -15
+; RV32-NEXT:    vsetvli zero, zero, e64, m8, ta, mu
 ; RV32-NEXT:    vmv1r.v v0, v1
-; RV32-NEXT:    vslideup.vi v12, v16, 6, v0.t
-; RV32-NEXT:    vmv.v.v v4, v12
-; RV32-NEXT:    lui a1, %hi(.LCPI6_5)
-; RV32-NEXT:    addi a1, a1, %lo(.LCPI6_5)
-; RV32-NEXT:    vsetvli zero, a2, e32, m8, ta, mu
-; RV32-NEXT:    lui a3, %hi(.LCPI6_6)
-; RV32-NEXT:    addi a3, a3, %lo(.LCPI6_6)
-; RV32-NEXT:    vle16.v v24, (a1)
-; RV32-NEXT:    vle16.v v8, (a3)
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    slli a1, a1, 3
-; RV32-NEXT:    add a1, sp, a1
-; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vs4r.v v8, (a1) # Unknown-size Folded Spill
-; RV32-NEXT:    li a1, 960
-; RV32-NEXT:    vmv.s.x v2, a1
-; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a3, 40
-; RV32-NEXT:    mul a1, a1, a3
+; RV32-NEXT:    li a2, 48
+; RV32-NEXT:    mul a1, a1, a2
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
 ; RV32-NEXT:    vl8r.v v16, (a1) # Unknown-size Folded Reload
-; RV32-NEXT:    vrgatherei16.vv v8, v16, v24
-; RV32-NEXT:    vmv1r.v v0, v2
-; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a3, 48
-; RV32-NEXT:    mul a1, a1, a3
-; RV32-NEXT:    add a1, sp, a1
-; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vl8r.v v24, (a1) # Unknown-size Folded Reload
+; RV32-NEXT:    vrgatherei16.vv v8, v16, v6, v0.t
+; RV32-NEXT:    lui a1, 176
+; RV32-NEXT:    addi a1, a1, 5
+; RV32-NEXT:    vsetivli zero, 8, e32, m2, ta, ma
+; RV32-NEXT:    vmv.v.x v6, a1
+; RV32-NEXT:    vsetivli zero, 16, e64, m8, ta, mu
+; RV32-NEXT:    vmv1r.v v0, v4
+; RV32-NEXT:    vrgatherei16.vv v8, v24, v6, v0.t
 ; RV32-NEXT:    csrr a1, vlenb
 ; RV32-NEXT:    slli a1, a1, 3
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vl4r.v v16, (a1) # Unknown-size Folded Reload
-; RV32-NEXT:    vrgatherei16.vv v8, v24, v16, v0.t
-; RV32-NEXT:    vsetivli zero, 10, e32, m4, tu, ma
-; RV32-NEXT:    vmv.v.v v4, v8
+; RV32-NEXT:    vs8r.v v8, (a1) # Unknown-size Folded Spill
+; RV32-NEXT:    vsetvli zero, zero, e16, m2, ta, ma
+; RV32-NEXT:    vadd.vi v8, v2, 2
+; RV32-NEXT:    vsetvli zero, zero, e64, m8, ta, ma
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    slli a1, a1, 3
+; RV32-NEXT:    slli a1, a1, 5
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vs4r.v v4, (a1) # Unknown-size Folded Spill
-; RV32-NEXT:    lui a1, %hi(.LCPI6_7)
-; RV32-NEXT:    addi a1, a1, %lo(.LCPI6_7)
-; RV32-NEXT:    vsetivli zero, 16, e32, m4, ta, mu
-; RV32-NEXT:    vle16.v v8, (a1)
+; RV32-NEXT:    vl8r.v v16, (a1) # Unknown-size Folded Reload
+; RV32-NEXT:    vrgatherei16.vv v24, v16, v8
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    slli...
[truncated]

@preames
Collaborator

preames commented Feb 6, 2024

Did you explore the impact of adjusting SDAG construction to pad the mask with undefs in this case? That would seem like a more reasonable canonical form than the explode and build. We could maybe have it behind a hook to preserve existing target behavior if needed.

@lukel97
Contributor Author

lukel97 commented Feb 7, 2024

Did you explore the impact of adjusting SDAG construction to pad the mask with undefs in this case? That would seem like a more reasonable canonical form than the explode and build. We could maybe have it behind a hook to preserve existing target behavior if needed.

I checked and the blast radius is quite large on other targets as we might expect. But even on RISC-V this breaks another lowering pattern for deinterleave shuffles, where we are relying on the build_vector -> vector_shuffle transform to emit shuffles of the form:

vector_shuffle (extract_subvector v, 0), (extract_subvector v, 4), <0, 2, 4, 6>

The same shuffle, if we build it directly in SelectionDAGBuilder, is

extract_subvector (vector_shuffle v, undef, <0, 2, 4, 6, u, u, u, u>), 0

We could try to add another combine for this, or teach our deinterleave shuffle lowering to recognize this specific case, but that seems like more code. An IR sketch of the case in question is below.
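
For reference, here is a minimal IR sketch of that kind of deinterleave shuffle (the i32 element type and function name are arbitrary, purely for illustration). Depending on how SelectionDAGBuilder builds it, it ends up in one of the two DAG forms above:

; Hypothetical example: extract the even elements of an 8-element vector
; into a narrower 4-element result.
define <4 x i32> @deinterleave_even_example(<8 x i32> %v) {
  %d = shufflevector <8 x i32> %v, <8 x i32> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
  ret <4 x i32> %d
}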

}

EVT ExtractVT = ExtractVec.getValueType();
if (ExtractVT.isScalableVector())
Collaborator

You probably should check that ExtractVT is legal here.

@preames
Collaborator

preames commented Feb 7, 2024

Luke, it sounds like you went further than I'd intended. I was really only suggesting replacing the explode/build fallback with a packed shuffle fallback. However, let's focus on the current patch.

For an m1 vector, replacing a set of extracts and inserts with a shuffle is always going to be a win. The VL=1 case might still be worthwhile as the extract and insert idiom, but I don't think that case is worth worrying about too much.

However, if the source vector is > m1, then the vrgather is (in general) quadratic. Given that the cost of the extract and insert will be either O(NumElems) for the exact-vlen case or O(NumElems * LMUL) for the default case, we have the possibility of introducing a regression here.

Please limit the initial version of this transform to m1.

@preames
Collaborator

preames commented Feb 7, 2024

Alternative fix: #81032

I prefer the alternate.

@lukel97
Contributor Author

lukel97 commented Feb 8, 2024

I was really only suggesting replacing the explode/build fallback with a packed shuffle fallback.

To clarify, that's the only change I made.

However, if the source vector is > m1, then the vrgather is (in general) quadratic. Given that the cost of the extract and insert will be either O(NumElems) for the exact-vlen case or O(NumElems * LMUL) for the default case, we have the possibility of introducing a regression here.

Indeed, I was hoping our shuffle lowering would have some sort of LMUL splitting strategy for this, but it looks like we currently only do it for the exact VLEN case.

As a side note, I noticed that we're extracting a v8i8 vector but we still set VL=64.

; CHECK-NEXT:    li a0, 64
; CHECK-NEXT:    vsetvli zero, a0, e8, m4, ta, ma
; CHECK-NEXT:    vid.v v12
; CHECK-NEXT:    vsll.vi v16, v12, 3
; CHECK-NEXT:    vrgather.vv v12, v8, v16
; CHECK-NEXT:    vmv1r.v v8, v12
; CHECK-NEXT:    ret

I think we can do this with VL=8 because, as per the spec, "The source vector can be read at any index < VLMAX regardless of vl."

@lukel97
Contributor Author

lukel97 commented Feb 8, 2024

Superseded by #81032

@lukel97 lukel97 closed this Feb 8, 2024