
[RISCV] Optimize gather/scatter to unit-stride memop + shuffle #66279

Closed
wants to merge 2 commits

Conversation

preames
Collaborator

@preames preames commented Sep 13, 2023

If we have a gather or a scatter whose index describes a permutation of the lanes, we can lower this as a shuffle + a unit strided memory operation. For RISCV, this replaces an indexed load/store with a unit strided memory operation and a vrgather (at worst).

I did not bother to implement the vp.scatter and vp.gather variants of these transforms because they'd only be legal when EVL was VLMAX. Given that, they should have been transformed to the non-vp variants anyways. I haven't checked to see if they actually are.
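For a concrete picture, here is the reversed-index gather from the new mgather_shuffle_reverse test added by this patch, together with the RV32 code it now generates; both snippets are taken from the diff below (%allones is the all-true mask built at the top of the test):

  %ptrs = getelementptr inbounds i16, ptr %base, <8 x i64> <i64 7, i64 6, i64 5, i64 4, i64 3, i64 2, i64 1, i64 0>
  %v = call <8 x i16> @llvm.masked.gather.v8i16.v8p0(<8 x ptr> %ptrs, i32 4, <8 x i1> %allones, <8 x i16> poison)

; RV32: # %bb.0:
; RV32-NEXT: vsetivli zero, 8, e16, m1, ta, ma
; RV32-NEXT: vle16.v v9, (a0)
; RV32-NEXT: vid.v v8
; RV32-NEXT: vrsub.vi v10, v8, 7
; RV32-NEXT: vrgather.vv v8, v9, v10
; RV32-NEXT: ret

The index constants are all multiples of the i16 element size, so matchIndexAsShuffle divides each by VT.getScalarStoreSize(), checks that every lane is touched, and uses the resulting <7, 6, ..., 0> directly as the shuffle mask: the indexed load becomes a unit-stride vle16.v plus a vrgather.vv. On RV64V the reversed access is folded even further, into a negative-stride vlse16.v, per the checks in the same test.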

@llvmbot
Collaborator

llvmbot commented Sep 13, 2023

@llvm/pr-subscribers-backend-risc-v

Changes

If we have a gather or a scatter whose index describes a permutation of the lanes, we can lower this as a shuffle + a unit strided memory operation. For RISCV, this replaces an indexed load/store with a unit strided memory operation and a vrgather (at worst).

I did not bother to implement the vp.scatter and vp.gather variants of these transforms because they'd only be legal when EVL was VLMAX. Given that, they should have been transformed to the non-vp variants anyways. I haven't checked to see if they actually are.
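The scatter side benefits in the same way. As an illustration, the new mscatter_unit_stride test in the diff below scatters through consecutive i16 pointers, and with this patch the RV32 output is a plain unit-stride store (%allones is the all-true mask built at the top of the test; RV64 selects a stride-2 vsse16.v for the same IR):

  %ptrs = getelementptr inbounds i16, ptr %base, <8 x i64> <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>
  call void @llvm.masked.scatter.v8i16.v8p0(<8 x i16> %val, <8 x ptr> %ptrs, i32 2, <8 x i1> %allones)

; RV32: # %bb.0:
; RV32-NEXT: vsetivli zero, 8, e16, m1, ta, ma
; RV32-NEXT: vse16.v v8, (a0)
; RV32-NEXT: ret

An identity index is the degenerate case of the transform: the matched mask is <0, 1, ..., 7>, the shuffle folds away, and only the unit-stride access remains.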

Patch is 26.41 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/66279.diff

3 Files Affected:

  • (modified) llvm/lib/Target/RISCV/RISCVISelLowering.cpp (+61)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-masked-gather.ll (+271-21)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-masked-scatter.ll (+198)

diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index a470ceae90ce591..e2ef1c2079fb7d1 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -13510,6 +13510,40 @@ static bool legalizeScatterGatherIndexType(SDLoc DL, SDValue &Index,
   return true;
 }

+/// Match the index vector of a scatter or gather node as the shuffle mask
+/// which performs the rearrangement if possible. Will only match if
+/// all lanes are touched, and thus replacing the scatter or gather with
+/// a unit strided access and shuffle is legal.
+static bool matchIndexAsShuffle(EVT VT, SDValue Index, SDValue Mask,
+                                SmallVector<int> &ShuffleMask) {
+  if (!ISD::isConstantSplatVectorAllOnes(Mask.getNode()))
+    return false;
+  if (!ISD::isBuildVectorOfConstantSDNodes(Index.getNode()))
+    return false;
+
+  const unsigned ElementSize = VT.getScalarStoreSize();
+  const unsigned NumElems = VT.getVectorNumElements();
+
+  // Create the shuffle mask and check all bits active
+  assert(ShuffleMask.empty());
+  BitVector ActiveLanes(NumElems);
+  for (const auto Idx : enumerate(Index->ops())) {
+    // TODO: We've found an active bit of UB, and could be
+    // more aggressive here if desired.
+    if (Index->getOperand(Idx.index())->isUndef())
+      return false;
+    uint64_t C = Index->getConstantOperandVal(Idx.index());
+    if (C % ElementSize != 0)
+      return false;
+    C = C / ElementSize;
+    if (C >= NumElems)
+      return false;
+    ShuffleMask.push_back(C);
+    ActiveLanes.set(C);
+  }
+  return ActiveLanes.all();
+}

 SDValue RISCVTargetLowering::PerformDAGCombine(SDNode *N,
                                                DAGCombinerInfo &DCI) const {
   SelectionDAG &DAG = DCI.DAG;
@@ -13857,6 +13891,7 @@ SDValue RISCVTargetLowering::PerformDAGCombine(SDNode *N,
   }
   case ISD::MGATHER: {
     const auto *MGN = dyn_cast<MaskedGatherSDNode>(N);
+    const EVT VT = N->getValueType(0);
     SDValue Index = MGN->getIndex();
     SDValue ScaleOp = MGN->getScale();
     ISD::MemIndexType IndexType = MGN->getIndexType();
@@ -13870,6 +13905,19 @@ SDValue RISCVTargetLowering::PerformDAGCombine(SDNode *N,
           {MGN->getChain(), MGN->getPassThru(), MGN->getMask(),
            MGN->getBasePtr(), Index, ScaleOp},
           MGN->getMemOperand(), IndexType, MGN->getExtensionType());
+
+    SmallVector<int> ShuffleMask;
+    if (MGN->getExtensionType() == ISD::NON_EXTLOAD &&
+        matchIndexAsShuffle(VT, Index, MGN->getMask(), ShuffleMask)) {
+      SDValue Load = DAG.getMaskedLoad(VT, DL, MGN->getChain(),
+                                       MGN->getBasePtr(), DAG.getUNDEF(XLenVT),
+                                       MGN->getMask(), DAG.getUNDEF(VT),
+                                       MGN->getMemoryVT(), MGN->getMemOperand(),
+                                       ISD::UNINDEXED, ISD::NON_EXTLOAD);
+      SDValue Shuffle =
+        DAG.getVectorShuffle(VT, DL, Load, DAG.getUNDEF(VT), ShuffleMask);
+      return DAG.getMergeValues({Shuffle, Load.getValue(1)}, DL);
+    }
     break;
   }
   case ISD::MSCATTER:{
@@ -13887,6 +13935,19 @@ SDValue RISCVTargetLowering::PerformDAGCombine(SDNode *N,
           {MSN->getChain(), MSN->getValue(), MSN->getMask(), MSN->getBasePtr(),
            Index, ScaleOp},
           MSN->getMemOperand(), IndexType, MSN->isTruncatingStore());
+
+    EVT VT = MSN->getValue()->getValueType(0);
+    SmallVector<int> ShuffleMask;
+    if (!MSN->isTruncatingStore() &&
+        matchIndexAsShuffle(VT, Index, MSN->getMask(), ShuffleMask)) {
+      SDValue Shuffle = DAG.getVectorShuffle(VT, DL, MSN->getValue(),
+                                             DAG.getUNDEF(VT), ShuffleMask);
+      return DAG.getMaskedStore(MSN->getChain(), DL, Shuffle, MSN->getBasePtr(),
+                                DAG.getUNDEF(XLenVT), MSN->getMask(),
+                                MSN->getMemoryVT(), MSN->getMemOperand(),
+                                ISD::UNINDEXED, false);
+    }
+    break;
   }
   case ISD::VP_GATHER: {
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-masked-gather.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-masked-gather.ll
index f3af177ac0ff27e..438b49826dfe295 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-masked-gather.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-masked-gather.ll
@@ -13016,11 +13016,8 @@ define <4 x i32> @mgather_unit_stride_load(ptr %base) {
;
; RV64V-LABEL: mgather_unit_stride_load:
; RV64V: # %bb.0:
-; RV64V-NEXT: vsetivli zero, 4, e64, m2, ta, ma
-; RV64V-NEXT: vid.v v8
-; RV64V-NEXT: vsll.vi v10, v8, 2
-; RV64V-NEXT: vsetvli zero, zero, e32, m1, ta, ma
-; RV64V-NEXT: vluxei64.v v8, (a0), v10
+; RV64V-NEXT: vsetivli zero, 4, e32, m1, ta, ma
+; RV64V-NEXT: vle32.v v8, (a0)
; RV64V-NEXT: ret
;
; RV64ZVE32F-LABEL: mgather_unit_stride_load:
@@ -13154,18 +13151,13 @@ define <4 x i32> @mgather_unit_stride_load_narrow_idx(ptr %base) {
; RV32-LABEL: mgather_unit_stride_load_narrow_idx:
; RV32: # %bb.0:
; RV32-NEXT: vsetivli zero, 4, e32, m1, ta, ma
-; RV32-NEXT: vid.v v8
-; RV32-NEXT: vsll.vi v8, v8, 2
-; RV32-NEXT: vluxei32.v v8, (a0), v8
+; RV32-NEXT: vle32.v v8, (a0)
; RV32-NEXT: ret
;
; RV64V-LABEL: mgather_unit_stride_load_narrow_idx:
; RV64V: # %bb.0:
-; RV64V-NEXT: vsetivli zero, 4, e64, m2, ta, ma
-; RV64V-NEXT: vid.v v8
-; RV64V-NEXT: vsll.vi v10, v8, 2
-; RV64V-NEXT: vsetvli zero, zero, e32, m1, ta, ma
-; RV64V-NEXT: vluxei64.v v8, (a0), v10
+; RV64V-NEXT: vsetivli zero, 4, e32, m1, ta, ma
+; RV64V-NEXT: vle32.v v8, (a0)
; RV64V-NEXT: ret
;
; RV64ZVE32F-LABEL: mgather_unit_stride_load_narrow_idx:
@@ -13225,18 +13217,13 @@ define <4 x i32> @mgather_unit_stride_load_wide_idx(ptr %base) {
; RV32-LABEL: mgather_unit_stride_load_wide_idx:
; RV32: # %bb.0:
; RV32-NEXT: vsetivli zero, 4, e32, m1, ta, ma
-; RV32-NEXT: vid.v v8
-; RV32-NEXT: vsll.vi v8, v8, 2
-; RV32-NEXT: vluxei32.v v8, (a0), v8
+; RV32-NEXT: vle32.v v8, (a0)
; RV32-NEXT: ret
;
; RV64V-LABEL: mgather_unit_stride_load_wide_idx:
; RV64V: # %bb.0:
-; RV64V-NEXT: vsetivli zero, 4, e64, m2, ta, ma
-; RV64V-NEXT: vid.v v8
-; RV64V-NEXT: vsll.vi v10, v8, 2
-; RV64V-NEXT: vsetvli zero, zero, e32, m1, ta, ma
-; RV64V-NEXT: vluxei64.v v8, (a0), v10
+; RV64V-NEXT: vsetivli zero, 4, e32, m1, ta, ma
+; RV64V-NEXT: vle32.v v8, (a0)
; RV64V-NEXT: ret
;
; RV64ZVE32F-LABEL: mgather_unit_stride_load_wide_idx:
@@ -13601,3 +13588,266 @@ define <8 x i16> @mgather_gather_2xSEW(ptr %base) {
  ret <8 x i16> %v
}

+define <8 x i16> @mgather_shuffle_reverse(ptr %base) {
+; RV32-LABEL: mgather_shuffle_reverse:
+; RV32: # %bb.0:
+; RV32-NEXT: vsetivli zero, 8, e16, m1, ta, ma
+; RV32-NEXT: vle16.v v9, (a0)
+; RV32-NEXT: vid.v v8
+; RV32-NEXT: vrsub.vi v10, v8, 7
+; RV32-NEXT: vrgather.vv v8, v9, v10
+; RV32-NEXT: ret
+;
+; RV64V-LABEL: mgather_shuffle_reverse:
+; RV64V: # %bb.0:
+; RV64V-NEXT: addi a0, a0, 14
+; RV64V-NEXT: li a1, -2
+; RV64V-NEXT: vsetivli zero, 8, e16, m1, ta, ma
+; RV64V-NEXT: vlse16.v v8, (a0), a1
+; RV64V-NEXT: ret
+;
+; RV64ZVE32F-LABEL: mgather_shuffle_reverse:
+; RV64ZVE32F: # %bb.0:
+; RV64ZVE32F-NEXT: addi a0, a0, 14
+; RV64ZVE32F-NEXT: li a1, -2
+; RV64ZVE32F-NEXT: vsetivli zero, 8, e16, m1, ta, ma
+; RV64ZVE32F-NEXT: vlse16.v v8, (a0), a1
+; RV64ZVE32F-NEXT: ret

+  %head = insertelement <8 x i1> poison, i1 true, i16 0
+  %allones = shufflevector <8 x i1> %head, <8 x i1> poison, <8 x i32> zeroinitializer
+  %ptrs = getelementptr inbounds i16, ptr %base, <8 x i64> <i64 7, i64 6, i64 5, i64 4, i64 3, i64 2, i64 1, i64 0>
+  %v = call <8 x i16> @llvm.masked.gather.v8i16.v8p0(<8 x ptr> %ptrs, i32 4, <8 x i1> %allones, <8 x i16> poison)
+  ret <8 x i16> %v
+}

+define <8 x i16> @mgather_shuffle_rotate(ptr %base) {
+; RV32-LABEL: mgather_shuffle_rotate:
+; RV32: # %bb.0:
+; RV32-NEXT: vsetivli zero, 8, e16, m1, ta, ma
+; RV32-NEXT: vle16.v v9, (a0)
+; RV32-NEXT: vslidedown.vi v8, v9, 4
+; RV32-NEXT: vslideup.vi v8, v9, 4
+; RV32-NEXT: ret
+;
+; RV64V-LABEL: mgather_shuffle_rotate:
+; RV64V: # %bb.0:
+; RV64V-NEXT: vsetivli zero, 8, e16, m1, ta, ma
+; RV64V-NEXT: vle16.v v9, (a0)
+; RV64V-NEXT: vslidedown.vi v8, v9, 4
+; RV64V-NEXT: vslideup.vi v8, v9, 4
+; RV64V-NEXT: ret
+;
+; RV64ZVE32F-LABEL: mgather_shuffle_rotate:
+; RV64ZVE32F: # %bb.0:
+; RV64ZVE32F-NEXT: vsetivli zero, 8, e8, mf2, ta, ma
+; RV64ZVE32F-NEXT: vmset.m v8
+; RV64ZVE32F-NEXT: vmv.x.s a1, v8
+; RV64ZVE32F-NEXT: # implicit-def: $v8
+; RV64ZVE32F-NEXT: beqz zero, .LBB110_9
+; RV64ZVE32F-NEXT: # %bb.1: # %else
+; RV64ZVE32F-NEXT: andi a2, a1, 2
+; RV64ZVE32F-NEXT: bnez a2, .LBB110_10
+; RV64ZVE32F-NEXT: .LBB110_2: # %else2
+; RV64ZVE32F-NEXT: andi a2, a1, 4
+; RV64ZVE32F-NEXT: bnez a2, .LBB110_11
+; RV64ZVE32F-NEXT: .LBB110_3: # %else5
+; RV64ZVE32F-NEXT: andi a2, a1, 8
+; RV64ZVE32F-NEXT: bnez a2, .LBB110_12
+; RV64ZVE32F-NEXT: .LBB110_4: # %else8
+; RV64ZVE32F-NEXT: andi a2, a1, 16
+; RV64ZVE32F-NEXT: bnez a2, .LBB110_13
+; RV64ZVE32F-NEXT: .LBB110_5: # %else11
+; RV64ZVE32F-NEXT: andi a2, a1, 32
+; RV64ZVE32F-NEXT: bnez a2, .LBB110_14
+; RV64ZVE32F-NEXT: .LBB110_6: # %else14
+; RV64ZVE32F-NEXT: andi a2, a1, 64
+; RV64ZVE32F-NEXT: bnez a2, .LBB110_15
+; RV64ZVE32F-NEXT: .LBB110_7: # %else17
+; RV64ZVE32F-NEXT: andi a1, a1, -128
+; RV64ZVE32F-NEXT: bnez a1, .LBB110_16
+; RV64ZVE32F-NEXT: .LBB110_8: # %else20
+; RV64ZVE32F-NEXT: ret
+; RV64ZVE32F-NEXT: .LBB110_9: # %cond.load
+; RV64ZVE32F-NEXT: addi a2, a0, 8
+; RV64ZVE32F-NEXT: vlse16.v v8, (a2), zero
+; RV64ZVE32F-NEXT: andi a2, a1, 2
+; RV64ZVE32F-NEXT: beqz a2, .LBB110_2
+; RV64ZVE32F-NEXT: .LBB110_10: # %cond.load1
+; RV64ZVE32F-NEXT: addi a2, a0, 10
+; RV64ZVE32F-NEXT: lh a2, 0(a2)
+; RV64ZVE32F-NEXT: vsetvli zero, zero, e16, m1, ta, ma
+; RV64ZVE32F-NEXT: vmv.s.x v9, a2
+; RV64ZVE32F-NEXT: vsetivli zero, 2, e16, m1, tu, ma
+; RV64ZVE32F-NEXT: vslideup.vi v8, v9, 1
+; RV64ZVE32F-NEXT: andi a2, a1, 4
+; RV64ZVE32F-NEXT: beqz a2, .LBB110_3
+; RV64ZVE32F-NEXT: .LBB110_11: # %cond.load4
+; RV64ZVE32F-NEXT: addi a2, a0, 12
+; RV64ZVE32F-NEXT: lh a2, 0(a2)
+; RV64ZVE32F-NEXT: vsetivli zero, 3, e16, m1, tu, ma
+; RV64ZVE32F-NEXT: vmv.s.x v9, a2
+; RV64ZVE32F-NEXT: vslideup.vi v8, v9, 2
+; RV64ZVE32F-NEXT: andi a2, a1, 8
+; RV64ZVE32F-NEXT: beqz a2, .LBB110_4
+; RV64ZVE32F-NEXT: .LBB110_12: # %cond.load7
+; RV64ZVE32F-NEXT: addi a2, a0, 14
+; RV64ZVE32F-NEXT: lh a2, 0(a2)
+; RV64ZVE32F-NEXT: vsetivli zero, 4, e16, m1, tu, ma
+; RV64ZVE32F-NEXT: vmv.s.x v9, a2
+; RV64ZVE32F-NEXT: vslideup.vi v8, v9, 3
+; RV64ZVE32F-NEXT: andi a2, a1, 16
+; RV64ZVE32F-NEXT: beqz a2, .LBB110_5
+; RV64ZVE32F-NEXT: .LBB110_13: # %cond.load10
+; RV64ZVE32F-NEXT: lh a2, 0(a0)
+; RV64ZVE32F-NEXT: vsetivli zero, 5, e16, m1, tu, ma
+; RV64ZVE32F-NEXT: vmv.s.x v9, a2
+; RV64ZVE32F-NEXT: vslideup.vi v8, v9, 4
+; RV64ZVE32F-NEXT: andi a2, a1, 32
+; RV64ZVE32F-NEXT: beqz a2, .LBB110_6
+; RV64ZVE32F-NEXT: .LBB110_14: # %cond.load13
+; RV64ZVE32F-NEXT: addi a2, a0, 2
+; RV64ZVE32F-NEXT: lh a2, 0(a2)
+; RV64ZVE32F-NEXT: vsetivli zero, 6, e16, m1, tu, ma
+; RV64ZVE32F-NEXT: vmv.s.x v9, a2
+; RV64ZVE32F-NEXT: vslideup.vi v8, v9, 5
+; RV64ZVE32F-NEXT: andi a2, a1, 64
+; RV64ZVE32F-NEXT: beqz a2, .LBB110_7
+; RV64ZVE32F-NEXT: .LBB110_15: # %cond.load16
+; RV64ZVE32F-NEXT: addi a2, a0, 4
+; RV64ZVE32F-NEXT: lh a2, 0(a2)
+; RV64ZVE32F-NEXT: vsetivli zero, 7, e16, m1, tu, ma
+; RV64ZVE32F-NEXT: vmv.s.x v9, a2
+; RV64ZVE32F-NEXT: vslideup.vi v8, v9, 6
+; RV64ZVE32F-NEXT: andi a1, a1, -128
+; RV64ZVE32F-NEXT: beqz a1, .LBB110_8
+; RV64ZVE32F-NEXT: .LBB110_16: # %cond.load19
+; RV64ZVE32F-NEXT: addi a0, a0, 6
+; RV64ZVE32F-NEXT: lh a0, 0(a0)
+; RV64ZVE32F-NEXT: vsetivli zero, 8, e16, m1, ta, ma
+; RV64ZVE32F-NEXT: vmv.s.x v9, a0
+; RV64ZVE32F-NEXT: vslideup.vi v8, v9, 7
+; RV64ZVE32F-NEXT: ret

+  %head = insertelement <8 x i1> poison, i1 true, i16 0
+  %allones = shufflevector <8 x i1> %head, <8 x i1> poison, <8 x i32> zeroinitializer
+  %ptrs = getelementptr inbounds i16, ptr %base, <8 x i64> <i64 4, i64 5, i64 6, i64 7, i64 0, i64 1, i64 2, i64 3>
+  %v = call <8 x i16> @llvm.masked.gather.v8i16.v8p0(<8 x ptr> %ptrs, i32 4, <8 x i1> %allones, <8 x i16> poison)
+  ret <8 x i16> %v
+}

+define <8 x i16> @mgather_shuffle_vrgather(ptr %base) {
+; RV32-LABEL: mgather_shuffle_vrgather:
+; RV32: # %bb.0:
+; RV32-NEXT: vsetivli zero, 8, e16, m1, ta, ma
+; RV32-NEXT: vle16.v v9, (a0)
+; RV32-NEXT: lui a0, %hi(.LCPI111_0)
+; RV32-NEXT: addi a0, a0, %lo(.LCPI111_0)
+; RV32-NEXT: vle16.v v10, (a0)
+; RV32-NEXT: vrgather.vv v8, v9, v10
+; RV32-NEXT: ret
+;
+; RV64V-LABEL: mgather_shuffle_vrgather:
+; RV64V: # %bb.0:
+; RV64V-NEXT: vsetivli zero, 8, e16, m1, ta, ma
+; RV64V-NEXT: vle16.v v9, (a0)
+; RV64V-NEXT: lui a0, %hi(.LCPI111_0)
+; RV64V-NEXT: addi a0, a0, %lo(.LCPI111_0)
+; RV64V-NEXT: vle16.v v10, (a0)
+; RV64V-NEXT: vrgather.vv v8, v9, v10
+; RV64V-NEXT: ret
+;
+; RV64ZVE32F-LABEL: mgather_shuffle_vrgather:
+; RV64ZVE32F: # %bb.0:
+; RV64ZVE32F-NEXT: vsetivli zero, 8, e8, mf2, ta, ma
+; RV64ZVE32F-NEXT: vmset.m v8
+; RV64ZVE32F-NEXT: vmv.x.s a1, v8
+; RV64ZVE32F-NEXT: # implicit-def: $v8
+; RV64ZVE32F-NEXT: beqz zero, .LBB111_9
+; RV64ZVE32F-NEXT: # %bb.1: # %else
+; RV64ZVE32F-NEXT: andi a2, a1, 2
+; RV64ZVE32F-NEXT: bnez a2, .LBB111_10
+; RV64ZVE32F-NEXT: .LBB111_2: # %else2
+; RV64ZVE32F-NEXT: andi a2, a1, 4
+; RV64ZVE32F-NEXT: bnez a2, .LBB111_11
+; RV64ZVE32F-NEXT: .LBB111_3: # %else5
+; RV64ZVE32F-NEXT: andi a2, a1, 8
+; RV64ZVE32F-NEXT: bnez a2, .LBB111_12
+; RV64ZVE32F-NEXT: .LBB111_4: # %else8
+; RV64ZVE32F-NEXT: andi a2, a1, 16
+; RV64ZVE32F-NEXT: bnez a2, .LBB111_13
+; RV64ZVE32F-NEXT: .LBB111_5: # %else11
+; RV64ZVE32F-NEXT: andi a2, a1, 32
+; RV64ZVE32F-NEXT: bnez a2, .LBB111_14
+; RV64ZVE32F-NEXT: .LBB111_6: # %else14
+; RV64ZVE32F-NEXT: andi a2, a1, 64
+; RV64ZVE32F-NEXT: bnez a2, .LBB111_15
+; RV64ZVE32F-NEXT: .LBB111_7: # %else17
+; RV64ZVE32F-NEXT: andi a1, a1, -128
+; RV64ZVE32F-NEXT: bnez a1, .LBB111_16
+; RV64ZVE32F-NEXT: .LBB111_8: # %else20
+; RV64ZVE32F-NEXT: ret
+; RV64ZVE32F-NEXT: .LBB111_9: # %cond.load
+; RV64ZVE32F-NEXT: vlse16.v v8, (a0), zero
+; RV64ZVE32F-NEXT: andi a2, a1, 2
+; RV64ZVE32F-NEXT: beqz a2, .LBB111_2
+; RV64ZVE32F-NEXT: .LBB111_10: # %cond.load1
+; RV64ZVE32F-NEXT: addi a2, a0, 4
+; RV64ZVE32F-NEXT: lh a2, 0(a2)
+; RV64ZVE32F-NEXT: vsetvli zero, zero, e16, m1, ta, ma
+; RV64ZVE32F-NEXT: vmv.s.x v9, a2
+; RV64ZVE32F-NEXT: vsetivli zero, 2, e16, m1, tu, ma
+; RV64ZVE32F-NEXT: vslideup.vi v8, v9, 1
+; RV64ZVE32F-NEXT: andi a2, a1, 4
+; RV64ZVE32F-NEXT: beqz a2, .LBB111_3
+; RV64ZVE32F-NEXT: .LBB111_11: # %cond.load4
+; RV64ZVE32F-NEXT: addi a2, a0, 6
+; RV64ZVE32F-NEXT: lh a2, 0(a2)
+; RV64ZVE32F-NEXT: vsetivli zero, 3, e16, m1, tu, ma
+; RV64ZVE32F-NEXT: vmv.s.x v9, a2
+; RV64ZVE32F-NEXT: vslideup.vi v8, v9, 2
+; RV64ZVE32F-NEXT: andi a2, a1, 8
+; RV64ZVE32F-NEXT: beqz a2, .LBB111_4
+; RV64ZVE32F-NEXT: .LBB111_12: # %cond.load7
+; RV64ZVE32F-NEXT: addi a2, a0, 2
+; RV64ZVE32F-NEXT: lh a2, 0(a2)
+; RV64ZVE32F-NEXT: vsetivli zero, 4, e16, m1, tu, ma
+; RV64ZVE32F-NEXT: vmv.s.x v9, a2
+; RV64ZVE32F-NEXT: vslideup.vi v8, v9, 3
+; RV64ZVE32F-NEXT: andi a2, a1, 16
+; RV64ZVE32F-NEXT: beqz a2, .LBB111_5
+; RV64ZVE32F-NEXT: .LBB111_13: # %cond.load10
+; RV64ZVE32F-NEXT: addi a2, a0, 8
+; RV64ZVE32F-NEXT: lh a2, 0(a2)
+; RV64ZVE32F-NEXT: vsetivli zero, 5, e16, m1, tu, ma
+; RV64ZVE32F-NEXT: vmv.s.x v9, a2
+; RV64ZVE32F-NEXT: vslideup.vi v8, v9, 4
+; RV64ZVE32F-NEXT: andi a2, a1, 32
+; RV64ZVE32F-NEXT: beqz a2, .LBB111_6
+; RV64ZVE32F-NEXT: .LBB111_14: # %cond.load13
+; RV64ZVE32F-NEXT: addi a2, a0, 10
+; RV64ZVE32F-NEXT: lh a2, 0(a2)
+; RV64ZVE32F-NEXT: vsetivli zero, 6, e16, m1, tu, ma
+; RV64ZVE32F-NEXT: vmv.s.x v9, a2
+; RV64ZVE32F-NEXT: vslideup.vi v8, v9, 5
+; RV64ZVE32F-NEXT: andi a2, a1, 64
+; RV64ZVE32F-NEXT: beqz a2, .LBB111_7
+; RV64ZVE32F-NEXT: .LBB111_15: # %cond.load16
+; RV64ZVE32F-NEXT: addi a2, a0, 12
+; RV64ZVE32F-NEXT: lh a2, 0(a2)
+; RV64ZVE32F-NEXT: vsetivli zero, 7, e16, m1, tu, ma
+; RV64ZVE32F-NEXT: vmv.s.x v9, a2
+; RV64ZVE32F-NEXT: vslideup.vi v8, v9, 6
+; RV64ZVE32F-NEXT: andi a1, a1, -128
+; RV64ZVE32F-NEXT: beqz a1, .LBB111_8
+; RV64ZVE32F-NEXT: .LBB111_16: # %cond.load19
+; RV64ZVE32F-NEXT: addi a0, a0, 14
+; RV64ZVE32F-NEXT: lh a0, 0(a0)
+; RV64ZVE32F-NEXT: vsetivli zero, 8, e16, m1, ta, ma
+; RV64ZVE32F-NEXT: vmv.s.x v9, a0
+; RV64ZVE32F-NEXT: vslideup.vi v8, v9, 7
+; RV64ZVE32F-NEXT: ret

+  %head = insertelement <8 x i1> poison, i1 true, i16 0
+  %allones = shufflevector <8 x i1> %head, <8 x i1> poison, <8 x i32> zeroinitializer
+  %ptrs = getelementptr inbounds i16, ptr %base, <8 x i64> <i64 0, i64 2, i64 3, i64 1, i64 4, i64 5, i64 6, i64 7>
+  %v = call <8 x i16> @llvm.masked.gather.v8i16.v8p0(<8 x ptr> %ptrs, i32 4, <8 x i1> %allones, <8 x i16> poison)
+  ret <8 x i16> %v
+}
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-masked-scatter.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-masked-scatter.ll
index 4c7b6db0d41c522..86ae2bb729ba9fa 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-masked-scatter.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-masked-scatter.ll
@@ -11292,3 +11292,201 @@ define void @mscatter_baseidx_v32i8(<32 x i8> %val, ptr %base, <32 x i8> %idxs,
  call void @llvm.masked.scatter.v32i8.v32p0(<32 x i8> %val, <32 x ptr> %ptrs, i32 1, <32 x i1> %m)
  ret void
}

+define void @mscatter_unit_stride(<8 x i16> %val, ptr %base) {
+; RV32-LABEL: mscatter_unit_stride:
+; RV32: # %bb.0:
+; RV32-NEXT: vsetivli zero, 8, e16, m1, ta, ma
+; RV32-NEXT: vse16.v v8, (a0)
+; RV32-NEXT: ret
+;
+; RV64-LABEL: mscatter_unit_stride:
+; RV64: # %bb.0:
+; RV64-NEXT: li a1, 2
+; RV64-NEXT: vsetivli zero, 8, e16, m1, ta, ma
+; RV64-NEXT: vsse16.v v8, (a0), a1
+; RV64-NEXT: ret
+;
+; RV64ZVE32F-LABEL: mscatter_unit_stride:
+; RV64ZVE32F: # %bb.0:
+; RV64ZVE32F-NEXT: li a1, 2
+; RV64ZVE32F-NEXT: vsetivli zero, 8, e16, m1, ta, ma
+; RV64ZVE32F-NEXT: vsse16.v v8, (a0), a1
+; RV64ZVE32F-NEXT: ret

+  %head = insertelement <8 x i1> poison, i1 true, i16 0
+  %allones = shufflevector <8 x i1> %head, <8 x i1> poison, <8 x i32> zeroinitializer
+  %ptrs = getelementptr inbounds i16, ptr %base, <8 x i64> <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>
+  call void @llvm.masked.scatter.v8i16.v8p0(<8 x i16> %val, <8 x ptr> %ptrs, i32 2, <8 x i1> %allones)
+  ret void
+}

+define void @mscatter_unit_stride_with_offset(<8 x i16> %val, ptr %base) {
+; RV32-LABEL: mscatter_unit_stride_with_offset:
+; RV32: # %bb.0:
+; RV32-NEXT: vsetivli zero, 8, e32, m2, ta, ma
+; RV32-NEXT: vid.v v10
+; RV32-NEXT: vadd.vv v10, v10, v10
+; RV32-NEXT: vadd.vi v10, v10, 10
+; RV32-NEXT: vsetvli zero, zero, e16, m1, ta, ma
+; RV32-NEXT: vsoxei32.v v8, (a0)...

// Create the shuffle mask and check all bits active
assert(ShuffleMask.empty());
BitVector ActiveLanes(NumElems);
for (const auto Idx : enumerate(Index->ops())) {
Collaborator:
Why are we using the ops iterator but then discarding the value() part? Can we just use for (unsigned i = 0; i < Index->getNumOperands(); ++i)?

preames (Collaborator, Author):

I pushed a change to fix this.

@topperc (Collaborator) left a comment:

LGTM

preames added a commit that referenced this pull request Sep 15, 2023
If we have a gather or a scatter whose index describes a permutation of the
lanes, we can lower this as a shuffle + a unit strided memory operation.  For
RISCV, this replaces an indexed load/store with a unit strided memory operation
and a vrgather (at worst).

I did not bother to implement the vp.scatter and vp.gather variants of these
transforms because they'd only be legal when EVL was VLMAX.  Given that, they
should have been transformed to the non-vp variants anyways.  I haven't checked
to see if they actually are.
@preames
Collaborator Author

preames commented Sep 18, 2023

Pushed as ff2622b.

@preames preames closed this Sep 18, 2023
@preames preames deleted the pr-riscv-gather-via-shuffle branch September 18, 2023 15:46
ZijunZhaoCCK pushed a commit to ZijunZhaoCCK/llvm-project that referenced this pull request Sep 19, 2023
…66279)

If we have a gather or a scatter whose index describes a permutation of the
lanes, we can lower this as a shuffle + a unit strided memory operation.  For
RISCV, this replaces an indexed load/store with a unit strided memory operation
and a vrgather (at worst).

I did not bother to implement the vp.scatter and vp.gather variants of these
transforms because they'd only be legal when EVL was VLMAX.  Given that, they
should have been transformed to the non-vp variants anyways.  I haven't checked
to see if they actually are.
zahiraam pushed a commit to tahonermann/llvm-project that referenced this pull request Oct 24, 2023
…66279)

If we have a gather or a scatter whose index describes a permutation of the
lanes, we can lower this as a shuffle + a unit strided memory operation.  For
RISCV, this replaces an indexed load/store with a unit strided memory operation
and a vrgather (at worst).

I did not bother to implement the vp.scatter and vp.gather variants of these
transforms because they'd only be legal when EVL was VLMAX.  Given that, they
should have been transformed to the non-vp variants anyways.  I haven't checked
to see if they actually are.
zahiraam pushed a commit to tahonermann/llvm-project that referenced this pull request Oct 24, 2023
…66279)

If we have a gather or a scatter whose index describes a permutation of the
lanes, we can lower this as a shuffle + a unit strided memory operation.  For
RISCV, this replaces an indexed load/store with a unit strided memory operation
and a vrgather (at worst).

I did not bother to implement the vp.scatter and vp.gather variants of these
transforms because they'd only be legal when EVL was VLMAX.  Given that, they
should have been transformed to the non-vp variants anyways.  I haven't checked
to see if they actually are.