
[RISCV] Optimize gather/scatter to unit-stride memop + shuffle #66279

Closed
wants to merge 2 commits

Conversation

preames
Collaborator

@preames preames commented Sep 13, 2023

If we have a gather or a scatter whose index describes a permutation of the lanes, we can lower this as a shuffle + a unit strided memory operation. For RISCV, this replaces an indexed load/store with a unit strided memory operation and a vrgather (at worst).

I did not bother to implement the vp.scatter and vp.gather variants of these transforms because they'd only be legal when EVL was VLMAX. Given that, they should have been transformed to the non-vp variants anyways. I haven't checked to see if they actually are.
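For a concrete picture, here is the reversed-index gather from the new mgather_shuffle_reverse test added by this patch, together with the RV32 code it now generates; both snippets are taken from the diff below (%allones is the all-true mask built at the top of the test):

  %ptrs = getelementptr inbounds i16, ptr %base, <8 x i64> <i64 7, i64 6, i64 5, i64 4, i64 3, i64 2, i64 1, i64 0>
  %v = call <8 x i16> @llvm.masked.gather.v8i16.v8p0(<8 x ptr> %ptrs, i32 4, <8 x i1> %allones, <8 x i16> poison)

; RV32: # %bb.0:
; RV32-NEXT: vsetivli zero, 8, e16, m1, ta, ma
; RV32-NEXT: vle16.v v9, (a0)
; RV32-NEXT: vid.v v8
; RV32-NEXT: vrsub.vi v10, v8, 7
; RV32-NEXT: vrgather.vv v8, v9, v10
; RV32-NEXT: ret

The index constants are all multiples of the i16 element size, so matchIndexAsShuffle divides each by VT.getScalarStoreSize(), checks that every lane is touched, and uses the resulting <7, 6, ..., 0> directly as the shuffle mask: the indexed load becomes a unit-stride vle16.v plus a vrgather.vv. On RV64V the reversed access is folded even further, into a negative-stride vlse16.v, per the checks in the same test.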

@llvmbot
Collaborator

llvmbot commented Sep 13, 2023

@llvm/pr-subscribers-backend-risc-v

Changes

If we have a gather or a scatter whose index describes a permutation of the lanes, we can lower this as a shuffle + a unit strided memory operation. For RISCV, this replaces an indexed load/store with a unit strided memory operation and a vrgather (at worst).

I did not bother to implement the vp.scatter and vp.gather variants of these transforms because they'd only be legal when EVL was VLMAX. Given that, they should have been transformed to the non-vp variants anyways. I haven't checked to see if they actually are.
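The scatter side benefits in the same way. As an illustration, the new mscatter_unit_stride test in the diff below scatters through consecutive i16 pointers, and with this patch the RV32 output is a plain unit-stride store (%allones is the all-true mask built at the top of the test; RV64 selects a stride-2 vsse16.v for the same IR):

  %ptrs = getelementptr inbounds i16, ptr %base, <8 x i64> <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>
  call void @llvm.masked.scatter.v8i16.v8p0(<8 x i16> %val, <8 x ptr> %ptrs, i32 2, <8 x i1> %allones)

; RV32: # %bb.0:
; RV32-NEXT: vsetivli zero, 8, e16, m1, ta, ma
; RV32-NEXT: vse16.v v8, (a0)
; RV32-NEXT: ret

An identity index is the degenerate case of the transform: the matched mask is <0, 1, ..., 7>, the shuffle folds away, and only the unit-stride access remains.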

Patch is 26.41 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/66279.diff

3 Files Affected:

  • (modified) llvm/lib/Target/RISCV/RISCVISelLowering.cpp (+61)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-masked-gather.ll (+271-21)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-masked-scatter.ll (+198)

diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index a470ceae90ce591..e2ef1c2079fb7d1 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -13510,6 +13510,40 @@ static bool legalizeScatterGatherIndexType(SDLoc DL, SDValue &Index,
   return true;
 }

+/// Match the index vector of a scatter or gather node as the shuffle mask
+/// which performs the rearrangement if possible. Will only match if
+/// all lanes are touched, and thus replacing the scatter or gather with
+/// a unit strided access and shuffle is legal.
+static bool matchIndexAsShuffle(EVT VT, SDValue Index, SDValue Mask,
+                                SmallVector<int> &ShuffleMask) {
+  if (!ISD::isConstantSplatVectorAllOnes(Mask.getNode()))
+    return false;
+  if (!ISD::isBuildVectorOfConstantSDNodes(Index.getNode()))
+    return false;
+
+  const unsigned ElementSize = VT.getScalarStoreSize();
+  const unsigned NumElems = VT.getVectorNumElements();
+
+  // Create the shuffle mask and check all bits active
+  assert(ShuffleMask.empty());
+  BitVector ActiveLanes(NumElems);
+  for (const auto Idx : enumerate(Index->ops())) {
+    // TODO: We've found an active bit of UB, and could be
+    // more aggressive here if desired.
+    if (Index->getOperand(Idx.index())->isUndef())
+      return false;
+    uint64_t C = Index->getConstantOperandVal(Idx.index());
+    if (C % ElementSize != 0)
+      return false;
+    C = C / ElementSize;
+    if (C >= NumElems)
+      return false;
+    ShuffleMask.push_back(C);
+    ActiveLanes.set(C);
+  }
+  return ActiveLanes.all();
+}

 SDValue RISCVTargetLowering::PerformDAGCombine(SDNode *N,
                                                DAGCombinerInfo &DCI) const {
   SelectionDAG &DAG = DCI.DAG;
@@ -13857,6 +13891,7 @@ SDValue RISCVTargetLowering::PerformDAGCombine(SDNode *N,
   }
   case ISD::MGATHER: {
     const auto *MGN = dyn_cast<MaskedGatherSDNode>(N);
+    const EVT VT = N->getValueType(0);
     SDValue Index = MGN->getIndex();
     SDValue ScaleOp = MGN->getScale();
     ISD::MemIndexType IndexType = MGN->getIndexType();
@@ -13870,6 +13905,19 @@ SDValue RISCVTargetLowering::PerformDAGCombine(SDNode *N,
           {MGN->getChain(), MGN->getPassThru(), MGN->getMask(),
            MGN->getBasePtr(), Index, ScaleOp},
           MGN->getMemOperand(), IndexType, MGN->getExtensionType());
+
+    SmallVector<int> ShuffleMask;
+    if (MGN->getExtensionType() == ISD::NON_EXTLOAD &&
+        matchIndexAsShuffle(VT, Index, MGN->getMask(), ShuffleMask)) {
+      SDValue Load = DAG.getMaskedLoad(VT, DL, MGN->getChain(),
+                                       MGN->getBasePtr(), DAG.getUNDEF(XLenVT),
+                                       MGN->getMask(), DAG.getUNDEF(VT),
+                                       MGN->getMemoryVT(), MGN->getMemOperand(),
+                                       ISD::UNINDEXED, ISD::NON_EXTLOAD);
+      SDValue Shuffle =
+        DAG.getVectorShuffle(VT, DL, Load, DAG.getUNDEF(VT), ShuffleMask);
+      return DAG.getMergeValues({Shuffle, Load.getValue(1)}, DL);
+    }
     break;
   }
   case ISD::MSCATTER:{
@@ -13887,6 +13935,19 @@ SDValue RISCVTargetLowering::PerformDAGCombine(SDNode *N,
           {MSN->getChain(), MSN->getValue(), MSN->getMask(), MSN->getBasePtr(),
            Index, ScaleOp},
           MSN->getMemOperand(), IndexType, MSN->isTruncatingStore());
+
+    EVT VT = MSN->getValue()->getValueType(0);
+    SmallVector<int> ShuffleMask;
+    if (!MSN->isTruncatingStore() &&
+        matchIndexAsShuffle(VT, Index, MSN->getMask(), ShuffleMask)) {
+      SDValue Shuffle = DAG.getVectorShuffle(VT, DL, MSN->getValue(),
+                                             DAG.getUNDEF(VT), ShuffleMask);
+      return DAG.getMaskedStore(MSN->getChain(), DL, Shuffle, MSN->getBasePtr(),
+                                DAG.getUNDEF(XLenVT), MSN->getMask(),
+                                MSN->getMemoryVT(), MSN->getMemOperand(),
+                                ISD::UNINDEXED, false);
+    }
+    break;
   }
   case ISD::VP_GATHER: {
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-masked-gather.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-masked-gather.ll
index f3af177ac0ff27e..438b49826dfe295 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-masked-gather.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-masked-gather.ll
@@ -13016,11 +13016,8 @@ define <4 x i32> @mgather_unit_stride_load(ptr %base) {
;
; RV64V-LABEL: mgather_unit_stride_load:
; RV64V: # %bb.0:
-; RV64V-NEXT: vsetivli zero, 4, e64, m2, ta, ma
-; RV64V-NEXT: vid.v v8
-; RV64V-NEXT: vsll.vi v10, v8, 2
-; RV64V-NEXT: vsetvli zero, zero, e32, m1, ta, ma
-; RV64V-NEXT: vluxei64.v v8, (a0), v10
+; RV64V-NEXT: vsetivli zero, 4, e32, m1, ta, ma
+; RV64V-NEXT: vle32.v v8, (a0)
; RV64V-NEXT: ret
;
; RV64ZVE32F-LABEL: mgather_unit_stride_load:
@@ -13154,18 +13151,13 @@ define <4 x i32> @mgather_unit_stride_load_narrow_idx(ptr %base) {
; RV32-LABEL: mgather_unit_stride_load_narrow_idx:
; RV32: # %bb.0:
; RV32-NEXT: vsetivli zero, 4, e32, m1, ta, ma
-; RV32-NEXT: vid.v v8
-; RV32-NEXT: vsll.vi v8, v8, 2
-; RV32-NEXT: vluxei32.v v8, (a0), v8
+; RV32-NEXT: vle32.v v8, (a0)
; RV32-NEXT: ret
;
; RV64V-LABEL: mgather_unit_stride_load_narrow_idx:
; RV64V: # %bb.0:
-; RV64V-NEXT: vsetivli zero, 4, e64, m2, ta, ma
-; RV64V-NEXT: vid.v v8
-; RV64V-NEXT: vsll.vi v10, v8, 2
-; RV64V-NEXT: vsetvli zero, zero, e32, m1, ta, ma
-; RV64V-NEXT: vluxei64.v v8, (a0), v10
+; RV64V-NEXT: vsetivli zero, 4, e32, m1, ta, ma
+; RV64V-NEXT: vle32.v v8, (a0)
; RV64V-NEXT: ret
;
; RV64ZVE32F-LABEL: mgather_unit_stride_load_narrow_idx:
@@ -13225,18 +13217,13 @@ define <4 x i32> @mgather_unit_stride_load_wide_idx(ptr %base) {
; RV32-LABEL: mgather_unit_stride_load_wide_idx:
; RV32: # %bb.0:
; RV32-NEXT: vsetivli zero, 4, e32, m1, ta, ma
-; RV32-NEXT: vid.v v8
-; RV32-NEXT: vsll.vi v8, v8, 2
-; RV32-NEXT: vluxei32.v v8, (a0), v8
+; RV32-NEXT: vle32.v v8, (a0)
; RV32-NEXT: ret
;
; RV64V-LABEL: mgather_unit_stride_load_wide_idx:
; RV64V: # %bb.0:
-; RV64V-NEXT: vsetivli zero, 4, e64, m2, ta, ma
-; RV64V-NEXT: vid.v v8
-; RV64V-NEXT: vsll.vi v10, v8, 2
-; RV64V-NEXT: vsetvli zero, zero, e32, m1, ta, ma
-; RV64V-NEXT: vluxei64.v v8, (a0), v10
+; RV64V-NEXT: vsetivli zero, 4, e32, m1, ta, ma
+; RV64V-NEXT: vle32.v v8, (a0)
; RV64V-NEXT: ret
;
; RV64ZVE32F-LABEL: mgather_unit_stride_load_wide_idx:
@@ -13601,3 +13588,266 @@ define <8 x i16> @mgather_gather_2xSEW(ptr %base) {
  ret <8 x i16> %v
}

+define <8 x i16> @mgather_shuffle_reverse(ptr %base) {
+; RV32-LABEL: mgather_shuffle_reverse:
+; RV32: # %bb.0:
+; RV32-NEXT: vsetivli zero, 8, e16, m1, ta, ma
+; RV32-NEXT: vle16.v v9, (a0)
+; RV32-NEXT: vid.v v8
+; RV32-NEXT: vrsub.vi v10, v8, 7
+; RV32-NEXT: vrgather.vv v8, v9, v10
+; RV32-NEXT: ret
+;
+; RV64V-LABEL: mgather_shuffle_reverse:
+; RV64V: # %bb.0:
+; RV64V-NEXT: addi a0, a0, 14
+; RV64V-NEXT: li a1, -2
+; RV64V-NEXT: vsetivli zero, 8, e16, m1, ta, ma
+; RV64V-NEXT: vlse16.v v8, (a0), a1
+; RV64V-NEXT: ret
+;
+; RV64ZVE32F-LABEL: mgather_shuffle_reverse:
+; RV64ZVE32F: # %bb.0:
+; RV64ZVE32F-NEXT: addi a0, a0, 14
+; RV64ZVE32F-NEXT: li a1, -2
+; RV64ZVE32F-NEXT: vsetivli zero, 8, e16, m1, ta, ma
+; RV64ZVE32F-NEXT: vlse16.v v8, (a0), a1
+; RV64ZVE32F-NEXT: ret

+  %head = insertelement <8 x i1> poison, i1 true, i16 0
+  %allones = shufflevector <8 x i1> %head, <8 x i1> poison, <8 x i32> zeroinitializer
+  %ptrs = getelementptr inbounds i16, ptr %base, <8 x i64> <i64 7, i64 6, i64 5, i64 4, i64 3, i64 2, i64 1, i64 0>
+  %v = call <8 x i16> @llvm.masked.gather.v8i16.v8p0(<8 x ptr> %ptrs, i32 4, <8 x i1> %allones, <8 x i16> poison)
+  ret <8 x i16> %v
+}

+define <8 x i16> @mgather_shuffle_rotate(ptr %base) {
+; RV32-LABEL: mgather_shuffle_rotate:
+; RV32: # %bb.0:
+; RV32-NEXT: vsetivli zero, 8, e16, m1, ta, ma
+; RV32-NEXT: vle16.v v9, (a0)
+; RV32-NEXT: vslidedown.vi v8, v9, 4
+; RV32-NEXT: vslideup.vi v8, v9, 4
+; RV32-NEXT: ret
+;
+; RV64V-LABEL: mgather_shuffle_rotate:
+; RV64V: # %bb.0:
+; RV64V-NEXT: vsetivli zero, 8, e16, m1, ta, ma
+; RV64V-NEXT: vle16.v v9, (a0)
+; RV64V-NEXT: vslidedown.vi v8, v9, 4
+; RV64V-NEXT: vslideup.vi v8, v9, 4
+; RV64V-NEXT: ret
+;
+; RV64ZVE32F-LABEL: mgather_shuffle_rotate:
+; RV64ZVE32F: # %bb.0:
+; RV64ZVE32F-NEXT: vsetivli zero, 8, e8, mf2, ta, ma
+; RV64ZVE32F-NEXT: vmset.m v8
+; RV64ZVE32F-NEXT: vmv.x.s a1, v8
+; RV64ZVE32F-NEXT: # implicit-def: $v8
+; RV64ZVE32F-NEXT: beqz zero, .LBB110_9
+; RV64ZVE32F-NEXT: # %bb.1: # %else
+; RV64ZVE32F-NEXT: andi a2, a1, 2
+; RV64ZVE32F-NEXT: bnez a2, .LBB110_10
+; RV64ZVE32F-NEXT: .LBB110_2: # %else2
+; RV64ZVE32F-NEXT: andi a2, a1, 4
+; RV64ZVE32F-NEXT: bnez a2, .LBB110_11
+; RV64ZVE32F-NEXT: .LBB110_3: # %else5
+; RV64ZVE32F-NEXT: andi a2, a1, 8
+; RV64ZVE32F-NEXT: bnez a2, .LBB110_12
+; RV64ZVE32F-NEXT: .LBB110_4: # %else8
+; RV64ZVE32F-NEXT: andi a2, a1, 16
+; RV64ZVE32F-NEXT: bnez a2, .LBB110_13
+; RV64ZVE32F-NEXT: .LBB110_5: # %else11
+; RV64ZVE32F-NEXT: andi a2, a1, 32
+; RV64ZVE32F-NEXT: bnez a2, .LBB110_14
+; RV64ZVE32F-NEXT: .LBB110_6: # %else14
+; RV64ZVE32F-NEXT: andi a2, a1, 64
+; RV64ZVE32F-NEXT: bnez a2, .LBB110_15
+; RV64ZVE32F-NEXT: .LBB110_7: # %else17
+; RV64ZVE32F-NEXT: andi a1, a1, -128
+; RV64ZVE32F-NEXT: bnez a1, .LBB110_16
+; RV64ZVE32F-NEXT: .LBB110_8: # %else20
+; RV64ZVE32F-NEXT: ret
+; RV64ZVE32F-NEXT: .LBB110_9: # %cond.load
+; RV64ZVE32F-NEXT: addi a2, a0, 8
+; RV64ZVE32F-NEXT: vlse16.v v8, (a2), zero
+; RV64ZVE32F-NEXT: andi a2, a1, 2
+; RV64ZVE32F-NEXT: beqz a2, .LBB110_2
+; RV64ZVE32F-NEXT: .LBB110_10: # %cond.load1
+; RV64ZVE32F-NEXT: addi a2, a0, 10
+; RV64ZVE32F-NEXT: lh a2, 0(a2)
+; RV64ZVE32F-NEXT: vsetvli zero, zero, e16, m1, ta, ma
+; RV64ZVE32F-NEXT: vmv.s.x v9, a2
+; RV64ZVE32F-NEXT: vsetivli zero, 2, e16, m1, tu, ma
+; RV64ZVE32F-NEXT: vslideup.vi v8, v9, 1
+; RV64ZVE32F-NEXT: andi a2, a1, 4
+; RV64ZVE32F-NEXT: beqz a2, .LBB110_3
+; RV64ZVE32F-NEXT: .LBB110_11: # %cond.load4
+; RV64ZVE32F-NEXT: addi a2, a0, 12
+; RV64ZVE32F-NEXT: lh a2, 0(a2)
+; RV64ZVE32F-NEXT: vsetivli zero, 3, e16, m1, tu, ma
+; RV64ZVE32F-NEXT: vmv.s.x v9, a2
+; RV64ZVE32F-NEXT: vslideup.vi v8, v9, 2
+; RV64ZVE32F-NEXT: andi a2, a1, 8
+; RV64ZVE32F-NEXT: beqz a2, .LBB110_4
+; RV64ZVE32F-NEXT: .LBB110_12: # %cond.load7
+; RV64ZVE32F-NEXT: addi a2, a0, 14
+; RV64ZVE32F-NEXT: lh a2, 0(a2)
+; RV64ZVE32F-NEXT: vsetivli zero, 4, e16, m1, tu, ma
+; RV64ZVE32F-NEXT: vmv.s.x v9, a2
+; RV64ZVE32F-NEXT: vslideup.vi v8, v9, 3
+; RV64ZVE32F-NEXT: andi a2, a1, 16
+; RV64ZVE32F-NEXT: beqz a2, .LBB110_5
+; RV64ZVE32F-NEXT: .LBB110_13: # %cond.load10
+; RV64ZVE32F-NEXT: lh a2, 0(a0)
+; RV64ZVE32F-NEXT: vsetivli zero, 5, e16, m1, tu, ma
+; RV64ZVE32F-NEXT: vmv.s.x v9, a2
+; RV64ZVE32F-NEXT: vslideup.vi v8, v9, 4
+; RV64ZVE32F-NEXT: andi a2, a1, 32
+; RV64ZVE32F-NEXT: beqz a2, .LBB110_6
+; RV64ZVE32F-NEXT: .LBB110_14: # %cond.load13
+; RV64ZVE32F-NEXT: addi a2, a0, 2
+; RV64ZVE32F-NEXT: lh a2, 0(a2)
+; RV64ZVE32F-NEXT: vsetivli zero, 6, e16, m1, tu, ma
+; RV64ZVE32F-NEXT: vmv.s.x v9, a2
+; RV64ZVE32F-NEXT: vslideup.vi v8, v9, 5
+; RV64ZVE32F-NEXT: andi a2, a1, 64
+; RV64ZVE32F-NEXT: beqz a2, .LBB110_7
+; RV64ZVE32F-NEXT: .LBB110_15: # %cond.load16
+; RV64ZVE32F-NEXT: addi a2, a0, 4
+; RV64ZVE32F-NEXT: lh a2, 0(a2)
+; RV64ZVE32F-NEXT: vsetivli zero, 7, e16, m1, tu, ma
+; RV64ZVE32F-NEXT: vmv.s.x v9, a2
+; RV64ZVE32F-NEXT: vslideup.vi v8, v9, 6
+; RV64ZVE32F-NEXT: andi a1, a1, -128
+; RV64ZVE32F-NEXT: beqz a1, .LBB110_8
+; RV64ZVE32F-NEXT: .LBB110_16: # %cond.load19
+; RV64ZVE32F-NEXT: addi a0, a0, 6
+; RV64ZVE32F-NEXT: lh a0, 0(a0)
+; RV64ZVE32F-NEXT: vsetivli zero, 8, e16, m1, ta, ma
+; RV64ZVE32F-NEXT: vmv.s.x v9, a0
+; RV64ZVE32F-NEXT: vslideup.vi v8, v9, 7
+; RV64ZVE32F-NEXT: ret

+  %head = insertelement <8 x i1> poison, i1 true, i16 0
+  %allones = shufflevector <8 x i1> %head, <8 x i1> poison, <8 x i32> zeroinitializer
+  %ptrs = getelementptr inbounds i16, ptr %base, <8 x i64> <i64 4, i64 5, i64 6, i64 7, i64 0, i64 1, i64 2, i64 3>
+  %v = call <8 x i16> @llvm.masked.gather.v8i16.v8p0(<8 x ptr> %ptrs, i32 4, <8 x i1> %allones, <8 x i16> poison)
+  ret <8 x i16> %v
+}

+define <8 x i16> @mgather_shuffle_vrgather(ptr %base) {
+; RV32-LABEL: mgather_shuffle_vrgather:
+; RV32: # %bb.0:
+; RV32-NEXT: vsetivli zero, 8, e16, m1, ta, ma
+; RV32-NEXT: vle16.v v9, (a0)
+; RV32-NEXT: lui a0, %hi(.LCPI111_0)
+; RV32-NEXT: addi a0, a0, %lo(.LCPI111_0)
+; RV32-NEXT: vle16.v v10, (a0)
+; RV32-NEXT: vrgather.vv v8, v9, v10
+; RV32-NEXT: ret
+;
+; RV64V-LABEL: mgather_shuffle_vrgather:
+; RV64V: # %bb.0:
+; RV64V-NEXT: vsetivli zero, 8, e16, m1, ta, ma
+; RV64V-NEXT: vle16.v v9, (a0)
+; RV64V-NEXT: lui a0, %hi(.LCPI111_0)
+; RV64V-NEXT: addi a0, a0, %lo(.LCPI111_0)
+; RV64V-NEXT: vle16.v v10, (a0)
+; RV64V-NEXT: vrgather.vv v8, v9, v10
+; RV64V-NEXT: ret
+;
+; RV64ZVE32F-LABEL: mgather_shuffle_vrgather:
+; RV64ZVE32F: # %bb.0:
+; RV64ZVE32F-NEXT: vsetivli zero, 8, e8, mf2, ta, ma
+; RV64ZVE32F-NEXT: vmset.m v8
+; RV64ZVE32F-NEXT: vmv.x.s a1, v8
+; RV64ZVE32F-NEXT: # implicit-def: $v8
+; RV64ZVE32F-NEXT: beqz zero, .LBB111_9
+; RV64ZVE32F-NEXT: # %bb.1: # %else
+; RV64ZVE32F-NEXT: andi a2, a1, 2
+; RV64ZVE32F-NEXT: bnez a2, .LBB111_10
+; RV64ZVE32F-NEXT: .LBB111_2: # %else2
+; RV64ZVE32F-NEXT: andi a2, a1, 4
+; RV64ZVE32F-NEXT: bnez a2, .LBB111_11
+; RV64ZVE32F-NEXT: .LBB111_3: # %else5
+; RV64ZVE32F-NEXT: andi a2, a1, 8
+; RV64ZVE32F-NEXT: bnez a2, .LBB111_12
+; RV64ZVE32F-NEXT: .LBB111_4: # %else8
+; RV64ZVE32F-NEXT: andi a2, a1, 16
+; RV64ZVE32F-NEXT: bnez a2, .LBB111_13
+; RV64ZVE32F-NEXT: .LBB111_5: # %else11
+; RV64ZVE32F-NEXT: andi a2, a1, 32
+; RV64ZVE32F-NEXT: bnez a2, .LBB111_14
+; RV64ZVE32F-NEXT: .LBB111_6: # %else14
+; RV64ZVE32F-NEXT: andi a2, a1, 64
+; RV64ZVE32F-NEXT: bnez a2, .LBB111_15
+; RV64ZVE32F-NEXT: .LBB111_7: # %else17
+; RV64ZVE32F-NEXT: andi a1, a1, -128
+; RV64ZVE32F-NEXT: bnez a1, .LBB111_16
+; RV64ZVE32F-NEXT: .LBB111_8: # %else20
+; RV64ZVE32F-NEXT: ret
+; RV64ZVE32F-NEXT: .LBB111_9: # %cond.load
+; RV64ZVE32F-NEXT: vlse16.v v8, (a0), zero
+; RV64ZVE32F-NEXT: andi a2, a1, 2
+; RV64ZVE32F-NEXT: beqz a2, .LBB111_2
+; RV64ZVE32F-NEXT: .LBB111_10: # %cond.load1
+; RV64ZVE32F-NEXT: addi a2, a0, 4
+; RV64ZVE32F-NEXT: lh a2, 0(a2)
+; RV64ZVE32F-NEXT: vsetvli zero, zero, e16, m1, ta, ma
+; RV64ZVE32F-NEXT: vmv.s.x v9, a2
+; RV64ZVE32F-NEXT: vsetivli zero, 2, e16, m1, tu, ma
+; RV64ZVE32F-NEXT: vslideup.vi v8, v9, 1
+; RV64ZVE32F-NEXT: andi a2, a1, 4
+; RV64ZVE32F-NEXT: beqz a2, .LBB111_3
+; RV64ZVE32F-NEXT: .LBB111_11: # %cond.load4
+; RV64ZVE32F-NEXT: addi a2, a0, 6
+; RV64ZVE32F-NEXT: lh a2, 0(a2)
+; RV64ZVE32F-NEXT: vsetivli zero, 3, e16, m1, tu, ma
+; RV64ZVE32F-NEXT: vmv.s.x v9, a2
+; RV64ZVE32F-NEXT: vslideup.vi v8, v9, 2
+; RV64ZVE32F-NEXT: andi a2, a1, 8
+; RV64ZVE32F-NEXT: beqz a2, .LBB111_4
+; RV64ZVE32F-NEXT: .LBB111_12: # %cond.load7
+; RV64ZVE32F-NEXT: addi a2, a0, 2
+; RV64ZVE32F-NEXT: lh a2, 0(a2)
+; RV64ZVE32F-NEXT: vsetivli zero, 4, e16, m1, tu, ma
+; RV64ZVE32F-NEXT: vmv.s.x v9, a2
+; RV64ZVE32F-NEXT: vslideup.vi v8, v9, 3
+; RV64ZVE32F-NEXT: andi a2, a1, 16
+; RV64ZVE32F-NEXT: beqz a2, .LBB111_5
+; RV64ZVE32F-NEXT: .LBB111_13: # %cond.load10
+; RV64ZVE32F-NEXT: addi a2, a0, 8
+; RV64ZVE32F-NEXT: lh a2, 0(a2)
+; RV64ZVE32F-NEXT: vsetivli zero, 5, e16, m1, tu, ma
+; RV64ZVE32F-NEXT: vmv.s.x v9, a2
+; RV64ZVE32F-NEXT: vslideup.vi v8, v9, 4
+; RV64ZVE32F-NEXT: andi a2, a1, 32
+; RV64ZVE32F-NEXT: beqz a2, .LBB111_6
+; RV64ZVE32F-NEXT: .LBB111_14: # %cond.load13
+; RV64ZVE32F-NEXT: addi a2, a0, 10
+; RV64ZVE32F-NEXT: lh a2, 0(a2)
+; RV64ZVE32F-NEXT: vsetivli zero, 6, e16, m1, tu, ma
+; RV64ZVE32F-NEXT: vmv.s.x v9, a2
+; RV64ZVE32F-NEXT: vslideup.vi v8, v9, 5
+; RV64ZVE32F-NEXT: andi a2, a1, 64
+; RV64ZVE32F-NEXT: beqz a2, .LBB111_7
+; RV64ZVE32F-NEXT: .LBB111_15: # %cond.load16
+; RV64ZVE32F-NEXT: addi a2, a0, 12
+; RV64ZVE32F-NEXT: lh a2, 0(a2)
+; RV64ZVE32F-NEXT: vsetivli zero, 7, e16, m1, tu, ma
+; RV64ZVE32F-NEXT: vmv.s.x v9, a2
+; RV64ZVE32F-NEXT: vslideup.vi v8, v9, 6
+; RV64ZVE32F-NEXT: andi a1, a1, -128
+; RV64ZVE32F-NEXT: beqz a1, .LBB111_8
+; RV64ZVE32F-NEXT: .LBB111_16: # %cond.load19
+; RV64ZVE32F-NEXT: addi a0, a0, 14
+; RV64ZVE32F-NEXT: lh a0, 0(a0)
+; RV64ZVE32F-NEXT: vsetivli zero, 8, e16, m1, ta, ma
+; RV64ZVE32F-NEXT: vmv.s.x v9, a0
+; RV64ZVE32F-NEXT: vslideup.vi v8, v9, 7
+; RV64ZVE32F-NEXT: ret

+  %head = insertelement <8 x i1> poison, i1 true, i16 0
+  %allones = shufflevector <8 x i1> %head, <8 x i1> poison, <8 x i32> zeroinitializer
+  %ptrs = getelementptr inbounds i16, ptr %base, <8 x i64> <i64 0, i64 2, i64 3, i64 1, i64 4, i64 5, i64 6, i64 7>
+  %v = call <8 x i16> @llvm.masked.gather.v8i16.v8p0(<8 x ptr> %ptrs, i32 4, <8 x i1> %allones, <8 x i16> poison)
+  ret <8 x i16> %v
+}
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-masked-scatter.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-masked-scatter.ll
index 4c7b6db0d41c522..86ae2bb729ba9fa 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-masked-scatter.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-masked-scatter.ll
@@ -11292,3 +11292,201 @@ define void @mscatter_baseidx_v32i8(<32 x i8> %val, ptr %base, <32 x i8> %idxs,
  call void @llvm.masked.scatter.v32i8.v32p0(<32 x i8> %val, <32 x ptr> %ptrs, i32 1, <32 x i1> %m)
  ret void
}

+define void @mscatter_unit_stride(<8 x i16> %val, ptr %base) {
+; RV32-LABEL: mscatter_unit_stride:
+; RV32: # %bb.0:
+; RV32-NEXT: vsetivli zero, 8, e16, m1, ta, ma
+; RV32-NEXT: vse16.v v8, (a0)
+; RV32-NEXT: ret
+;
+; RV64-LABEL: mscatter_unit_stride:
+; RV64: # %bb.0:
+; RV64-NEXT: li a1, 2
+; RV64-NEXT: vsetivli zero, 8, e16, m1, ta, ma
+; RV64-NEXT: vsse16.v v8, (a0), a1
+; RV64-NEXT: ret
+;
+; RV64ZVE32F-LABEL: mscatter_unit_stride:
+; RV64ZVE32F: # %bb.0:
+; RV64ZVE32F-NEXT: li a1, 2
+; RV64ZVE32F-NEXT: vsetivli zero, 8, e16, m1, ta, ma
+; RV64ZVE32F-NEXT: vsse16.v v8, (a0), a1
+; RV64ZVE32F-NEXT: ret

+  %head = insertelement <8 x i1> poison, i1 true, i16 0
+  %allones = shufflevector <8 x i1> %head, <8 x i1> poison, <8 x i32> zeroinitializer
+  %ptrs = getelementptr inbounds i16, ptr %base, <8 x i64> <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>
+  call void @llvm.masked.scatter.v8i16.v8p0(<8 x i16> %val, <8 x ptr> %ptrs, i32 2, <8 x i1> %allones)
+  ret void
+}

+define void @mscatter_unit_stride_with_offset(<8 x i16> %val, ptr %base) {
+; RV32-LABEL: mscatter_unit_stride_with_offset:
+; RV32: # %bb.0:
+; RV32-NEXT: vsetivli zero, 8, e32, m2, ta, ma
+; RV32-NEXT: vid.v v10
+; RV32-NEXT: vadd.vv v10, v10, v10
+; RV32-NEXT: vadd.vi v10, v10, 10
+; RV32-NEXT: vsetvli zero, zero, e16, m1, ta, ma
+; RV32-NEXT: vsoxei32.v v8, (a0)...

// Create the shuffle mask and check all bits active
assert(ShuffleMask.empty());
BitVector ActiveLanes(NumElems);
for (const auto Idx : enumerate(Index->ops())) {
Collaborator:
Why are we using the ops iterator but then discarding the value() part? Can we just use for (unsigned i = 0; i < Index->getNumOperands(); ++i)?

preames (Collaborator, Author):

I pushed a change to fix this.

@topperc (Collaborator) left a comment:

LGTM

preames added a commit that referenced this pull request Sep 15, 2023
If we have a gather or a scatter whose index describes a permutation of the
lanes, we can lower this as a shuffle + a unit strided memory operation.  For
RISCV, this replaces an indexed load/store with a unit strided memory operation
and a vrgather (at worst).

I did not bother to implement the vp.scatter and vp.gather variants of these
transforms because they'd only be legal when EVL was VLMAX.  Given that, they
should have been transformed to the non-vp variants anyways.  I haven't checked
to see if they actually are.
@preames
Collaborator Author

preames commented Sep 18, 2023

Pushed as ff2622b.

@preames preames closed this Sep 18, 2023
@preames preames deleted the pr-riscv-gather-via-shuffle branch September 18, 2023 15:46
ZijunZhaoCCK pushed a commit to ZijunZhaoCCK/llvm-project that referenced this pull request Sep 19, 2023
…66279)

If we have a gather or a scatter whose index describes a permutation of the
lanes, we can lower this as a shuffle + a unit strided memory operation.  For
RISCV, this replaces an indexed load/store with a unit strided memory operation
and a vrgather (at worst).

I did not bother to implement the vp.scatter and vp.gather variants of these
transforms because they'd only be legal when EVL was VLMAX.  Given that, they
should have been transformed to the non-vp variants anyways.  I haven't checked
to see if they actually are.
zahiraam pushed a commit to tahonermann/llvm-project that referenced this pull request Oct 24, 2023
…66279)

If we have a gather or a scatter whose index describes a permutation of the
lanes, we can lower this as a shuffle + a unit strided memory operation.  For
RISCV, this replaces an indexed load/store with a unit strided memory operation
and a vrgather (at worst).

I did not bother to implement the vp.scatter and vp.gather variants of these
transforms because they'd only be legal when EVL was VLMAX.  Given that, they
should have been transformed to the non-vp variants anyways.  I haven't checked
to see if they actually are.
zahiraam pushed a commit to tahonermann/llvm-project that referenced this pull request Oct 24, 2023
…66279)

If we have a gather or a scatter whose index describes a permutation of the
lanes, we can lower this as a shuffle + a unit strided memory operation.  For
RISCV, this replaces an indexed load/store with a unit strided memory operation
and a vrgather (at worst).

I did not bother to implement the vp.scatter and vp.gather variants of these
transforms because they'd only be legal when EVL was VLMAX.  Given that, they
should have been transformed to the non-vp variants anyways.  I haven't checked
to see if they actually are.