[RISCV] Lower fixed vectors extract_vector_elt through stack at high LMUL

This is the extract side of D159332. The goal is to avoid non-linear costing on patterns where an entire vector is split back into scalars. This is an idiomatic pattern for SLP.

Each vslide operation is linear in LMUL on common hardware. (For instance, the sifive-x280 cost model models slides this way.) If we do VL unique extracts, each with a cost linear in LMUL, the overall cost is O(LMUL^2 * VLEN/ETYPE). To avoid this degenerate case, fall back to the stack once we're beyond LMUL2.
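
To make the asymptotics concrete, here is a back-of-the-envelope comparison under assumed unit costs (one unit per LMUL for a slide or a whole-group store, one unit per scalar load); VLEN, ETYPE, and the costs are illustrative assumptions, not numbers from the patch or any real cost model:

  // Illustrative only: unit costs, VLEN, and ETYPE are assumptions.
  #include <cstdio>

  int main() {
    const int VLEN = 512, ETYPE = 64, LMUL = 8;
    const int N = VLEN * LMUL / ETYPE; // elements in the group = 64

    // N extracts, each paying a slide linear in LMUL:
    const int SlideCost = N * LMUL;    // O(LMUL^2 * VLEN/ETYPE) = 512 units

    // One whole-group store (linear in LMUL) reused by N scalar loads:
    const int StackCost = LMUL + N;    // 72 units

    std::printf("slides: %d units, stack: %d units\n", SlideCost, StackCost);
    return 0;
  }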

There's a subtlety here. For this to work, we're *relying* on an optimization in LegalizeDAG that tries to reuse the stack slot from a previous extract. In practice, this appears to trigger for patterns within a single block, but if we ended up with an explode idiom split across multiple blocks, we'd still be in quadratic territory. I don't think that variant is fixable within SDAG.
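
For illustration, a minimal sketch of the caching behavior being relied on, assuming a hypothetical cache keyed by the source vector (conceptual only; the names and data structures are mine, not the actual LegalizeDAG code):

  // Conceptual sketch only -- not SDAG code; names and types are hypothetical.
  #include <cstdint>
  #include <map>
  #include <vector>

  using Vec = std::vector<int64_t>;

  // One stack slot per source vector, created on the first extract and
  // reused by every later extract of the same vector.
  static std::map<const Vec *, Vec> StackSlots;

  int64_t extractThroughStack(const Vec &V, unsigned Idx) {
    auto [It, Inserted] = StackSlots.try_emplace(&V);
    if (Inserted)
      It->second = V;         // the single O(LMUL) store
    return It->second[Idx];   // each extract is then one scalar load
  }

Calling extractThroughStack(V, i) for every i of the same V pays the copy once and then does N indexed reads, which is the one-store-many-loads shape the generic expansion produces.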

It's tempting to think we can do better than going through the stack, but if a better lowering exists, I haven't found it yet. Here are the llvm-mca results for sifive-x280 on all the variants I wrote (all 16 x i64 with V); a sketch of the winning stack_element_by_element shape follows the numbers:

output/sifive-x280/linear_decomp_with_slidedown.mca:Total Cycles:      20703
output/sifive-x280/linear_decomp_with_vrgather.mca:Total Cycles:      23903
output/sifive-x280/naive_linear_with_slidedown.mca:Total Cycles:      21604
output/sifive-x280/naive_linear_with_vrgather.mca:Total Cycles:      22804
output/sifive-x280/recursive_decomp_with_slidedown.mca:Total Cycles:      15204
output/sifive-x280/recursive_decomp_with_vrgather.mca:Total Cycles:      18404
output/sifive-x280/stack_by_vreg.mca:Total Cycles:      12104
output/sifive-x280/stack_element_by_element.mca:Total Cycles:      4304
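
For reference, the stack_element_by_element shape looks roughly like the following (a sketch using the __riscv_-prefixed RVV C intrinsics; the function name, buffer, and choice of LMUL are my own assumptions, not taken from the benchmark sources):

  // Sketch of the stack-based explode; assumes <riscv_vector.h> and a VLEN
  // for which 16 x i64 fits at m8. Not the generated code from the patch.
  #include <riscv_vector.h>
  #include <stdint.h>

  void explode16(const int64_t *src, int64_t out[16]) {
    size_t vl = __riscv_vsetvl_e64m8(16);
    vint64m8_t v = __riscv_vle64_v_i64m8(src, vl);

    // One vector store to a stack buffer (cost linear in LMUL)...
    int64_t buf[16];
    __riscv_vse64_v_i64m8(buf, v, vl);

    // ...then every "extract" is an ordinary scalar load.
    for (int i = 0; i < 16; ++i)
      out[i] = buf[i];
  }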

I am deliberately excluding scalable vectors. The lowering functionally works for them, but frankly, the code quality for an idiomatic explode loop is so terrible either way that it felt better to leave that for future work.

Differential Revision: https://reviews.llvm.org/D159375
preames committed Sep 11, 2023
1 parent 070c257 commit 299d710
Showing 4 changed files with 472 additions and 280 deletions.
16 changes: 16 additions & 0 deletions llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -7588,6 +7588,22 @@ SDValue RISCVTargetLowering::lowerEXTRACT_VECTOR_ELT(SDValue Op,
}
}

// If after narrowing, the required slide is still greater than LMUL2,
// fallback to generic expansion and go through the stack. This is done
// for a subtle reason: extracting *all* elements out of a vector is
// widely expected to be linear in vector size, but because vslidedown
// is linear in LMUL, performing N extracts using vslidedown becomes
// O(n^2) / (VLEN/ETYPE) work. On the surface, going through the stack
// seems to have the same problem (the store is linear in LMUL), but the
// generic expansion *memoizes* the store, and thus for many extracts of
// the same vector we end up with one store and a bunch of loads.
// TODO: We don't have the same code for insert_vector_elt because we
// have BUILD_VECTOR and handle the degenerate case there. Should we
// consider adding an inverse BUILD_VECTOR node?
MVT LMUL2VT = getLMUL1VT(ContainerVT).getDoubleNumVectorElementsVT();
if (ContainerVT.bitsGT(LMUL2VT) && VecVT.isFixedLengthVector())
return SDValue();

// If the index is 0, the vector is already in the right position.
if (!isNullConstant(Idx)) {
// Use a VL of 1 to avoid processing more elements than we need.
204 changes: 175 additions & 29 deletions llvm/test/CodeGen/RISCV/rvv/fixed-vectors-extract.ll
@@ -244,32 +244,89 @@ define i64 @extractelt_v3i64(ptr %x) nounwind {

; A LMUL8 type
define i32 @extractelt_v32i32(ptr %x) nounwind {
; CHECK-LABEL: extractelt_v32i32:
; CHECK: # %bb.0:
; CHECK-NEXT: li a1, 32
; CHECK-NEXT: vsetvli zero, a1, e32, m8, ta, ma
; CHECK-NEXT: vle32.v v8, (a0)
; CHECK-NEXT: vsetivli zero, 1, e32, m8, ta, ma
; CHECK-NEXT: vslidedown.vi v8, v8, 31
; CHECK-NEXT: vmv.x.s a0, v8
; CHECK-NEXT: ret
; RV32-LABEL: extractelt_v32i32:
; RV32: # %bb.0:
; RV32-NEXT: addi sp, sp, -256
; RV32-NEXT: sw ra, 252(sp) # 4-byte Folded Spill
; RV32-NEXT: sw s0, 248(sp) # 4-byte Folded Spill
; RV32-NEXT: addi s0, sp, 256
; RV32-NEXT: andi sp, sp, -128
; RV32-NEXT: li a1, 32
; RV32-NEXT: vsetvli zero, a1, e32, m8, ta, ma
; RV32-NEXT: vle32.v v8, (a0)
; RV32-NEXT: mv a0, sp
; RV32-NEXT: vse32.v v8, (a0)
; RV32-NEXT: lw a0, 124(sp)
; RV32-NEXT: addi sp, s0, -256
; RV32-NEXT: lw ra, 252(sp) # 4-byte Folded Reload
; RV32-NEXT: lw s0, 248(sp) # 4-byte Folded Reload
; RV32-NEXT: addi sp, sp, 256
; RV32-NEXT: ret
;
; RV64-LABEL: extractelt_v32i32:
; RV64: # %bb.0:
; RV64-NEXT: addi sp, sp, -256
; RV64-NEXT: sd ra, 248(sp) # 8-byte Folded Spill
; RV64-NEXT: sd s0, 240(sp) # 8-byte Folded Spill
; RV64-NEXT: addi s0, sp, 256
; RV64-NEXT: andi sp, sp, -128
; RV64-NEXT: li a1, 32
; RV64-NEXT: vsetvli zero, a1, e32, m8, ta, ma
; RV64-NEXT: vle32.v v8, (a0)
; RV64-NEXT: mv a0, sp
; RV64-NEXT: vse32.v v8, (a0)
; RV64-NEXT: lw a0, 124(sp)
; RV64-NEXT: addi sp, s0, -256
; RV64-NEXT: ld ra, 248(sp) # 8-byte Folded Reload
; RV64-NEXT: ld s0, 240(sp) # 8-byte Folded Reload
; RV64-NEXT: addi sp, sp, 256
; RV64-NEXT: ret
%a = load <32 x i32>, ptr %x
%b = extractelement <32 x i32> %a, i32 31
ret i32 %b
}

; Exercise type legalization for type beyond LMUL8
define i32 @extractelt_v64i32(ptr %x) nounwind {
; CHECK-LABEL: extractelt_v64i32:
; CHECK: # %bb.0:
; CHECK-NEXT: addi a0, a0, 128
; CHECK-NEXT: li a1, 32
; CHECK-NEXT: vsetvli zero, a1, e32, m8, ta, ma
; CHECK-NEXT: vle32.v v8, (a0)
; CHECK-NEXT: vsetivli zero, 1, e32, m8, ta, ma
; CHECK-NEXT: vslidedown.vi v8, v8, 31
; CHECK-NEXT: vmv.x.s a0, v8
; CHECK-NEXT: ret
; RV32-LABEL: extractelt_v64i32:
; RV32: # %bb.0:
; RV32-NEXT: addi sp, sp, -256
; RV32-NEXT: sw ra, 252(sp) # 4-byte Folded Spill
; RV32-NEXT: sw s0, 248(sp) # 4-byte Folded Spill
; RV32-NEXT: addi s0, sp, 256
; RV32-NEXT: andi sp, sp, -128
; RV32-NEXT: addi a0, a0, 128
; RV32-NEXT: li a1, 32
; RV32-NEXT: vsetvli zero, a1, e32, m8, ta, ma
; RV32-NEXT: vle32.v v8, (a0)
; RV32-NEXT: mv a0, sp
; RV32-NEXT: vse32.v v8, (a0)
; RV32-NEXT: lw a0, 124(sp)
; RV32-NEXT: addi sp, s0, -256
; RV32-NEXT: lw ra, 252(sp) # 4-byte Folded Reload
; RV32-NEXT: lw s0, 248(sp) # 4-byte Folded Reload
; RV32-NEXT: addi sp, sp, 256
; RV32-NEXT: ret
;
; RV64-LABEL: extractelt_v64i32:
; RV64: # %bb.0:
; RV64-NEXT: addi sp, sp, -256
; RV64-NEXT: sd ra, 248(sp) # 8-byte Folded Spill
; RV64-NEXT: sd s0, 240(sp) # 8-byte Folded Spill
; RV64-NEXT: addi s0, sp, 256
; RV64-NEXT: andi sp, sp, -128
; RV64-NEXT: addi a0, a0, 128
; RV64-NEXT: li a1, 32
; RV64-NEXT: vsetvli zero, a1, e32, m8, ta, ma
; RV64-NEXT: vle32.v v8, (a0)
; RV64-NEXT: mv a0, sp
; RV64-NEXT: vse32.v v8, (a0)
; RV64-NEXT: lw a0, 124(sp)
; RV64-NEXT: addi sp, s0, -256
; RV64-NEXT: ld ra, 248(sp) # 8-byte Folded Reload
; RV64-NEXT: ld s0, 240(sp) # 8-byte Folded Reload
; RV64-NEXT: addi sp, sp, 256
; RV64-NEXT: ret
%a = load <64 x i32>, ptr %x
%b = extractelement <64 x i32> %a, i32 63
ret i32 %b
@@ -548,16 +605,105 @@ define i64 @extractelt_v3i64_idx(ptr %x, i32 zeroext %idx) nounwind {
}

define i32 @extractelt_v32i32_idx(ptr %x, i32 zeroext %idx) nounwind {
; CHECK-LABEL: extractelt_v32i32_idx:
; CHECK: # %bb.0:
; CHECK-NEXT: li a2, 32
; CHECK-NEXT: vsetvli zero, a2, e32, m8, ta, ma
; CHECK-NEXT: vle32.v v8, (a0)
; CHECK-NEXT: vadd.vv v8, v8, v8
; CHECK-NEXT: vsetivli zero, 1, e32, m8, ta, ma
; CHECK-NEXT: vslidedown.vx v8, v8, a1
; CHECK-NEXT: vmv.x.s a0, v8
; CHECK-NEXT: ret
; RV32NOM-LABEL: extractelt_v32i32_idx:
; RV32NOM: # %bb.0:
; RV32NOM-NEXT: addi sp, sp, -256
; RV32NOM-NEXT: sw ra, 252(sp) # 4-byte Folded Spill
; RV32NOM-NEXT: sw s0, 248(sp) # 4-byte Folded Spill
; RV32NOM-NEXT: sw s2, 244(sp) # 4-byte Folded Spill
; RV32NOM-NEXT: addi s0, sp, 256
; RV32NOM-NEXT: andi sp, sp, -128
; RV32NOM-NEXT: mv s2, a0
; RV32NOM-NEXT: andi a0, a1, 31
; RV32NOM-NEXT: li a1, 4
; RV32NOM-NEXT: call __mulsi3@plt
; RV32NOM-NEXT: li a1, 32
; RV32NOM-NEXT: vsetvli zero, a1, e32, m8, ta, ma
; RV32NOM-NEXT: vle32.v v8, (s2)
; RV32NOM-NEXT: mv a1, sp
; RV32NOM-NEXT: add a0, a1, a0
; RV32NOM-NEXT: vadd.vv v8, v8, v8
; RV32NOM-NEXT: vse32.v v8, (a1)
; RV32NOM-NEXT: lw a0, 0(a0)
; RV32NOM-NEXT: addi sp, s0, -256
; RV32NOM-NEXT: lw ra, 252(sp) # 4-byte Folded Reload
; RV32NOM-NEXT: lw s0, 248(sp) # 4-byte Folded Reload
; RV32NOM-NEXT: lw s2, 244(sp) # 4-byte Folded Reload
; RV32NOM-NEXT: addi sp, sp, 256
; RV32NOM-NEXT: ret
;
; RV32M-LABEL: extractelt_v32i32_idx:
; RV32M: # %bb.0:
; RV32M-NEXT: addi sp, sp, -256
; RV32M-NEXT: sw ra, 252(sp) # 4-byte Folded Spill
; RV32M-NEXT: sw s0, 248(sp) # 4-byte Folded Spill
; RV32M-NEXT: addi s0, sp, 256
; RV32M-NEXT: andi sp, sp, -128
; RV32M-NEXT: andi a1, a1, 31
; RV32M-NEXT: li a2, 32
; RV32M-NEXT: vsetvli zero, a2, e32, m8, ta, ma
; RV32M-NEXT: vle32.v v8, (a0)
; RV32M-NEXT: slli a1, a1, 2
; RV32M-NEXT: mv a0, sp
; RV32M-NEXT: or a1, a0, a1
; RV32M-NEXT: vadd.vv v8, v8, v8
; RV32M-NEXT: vse32.v v8, (a0)
; RV32M-NEXT: lw a0, 0(a1)
; RV32M-NEXT: addi sp, s0, -256
; RV32M-NEXT: lw ra, 252(sp) # 4-byte Folded Reload
; RV32M-NEXT: lw s0, 248(sp) # 4-byte Folded Reload
; RV32M-NEXT: addi sp, sp, 256
; RV32M-NEXT: ret
;
; RV64NOM-LABEL: extractelt_v32i32_idx:
; RV64NOM: # %bb.0:
; RV64NOM-NEXT: addi sp, sp, -256
; RV64NOM-NEXT: sd ra, 248(sp) # 8-byte Folded Spill
; RV64NOM-NEXT: sd s0, 240(sp) # 8-byte Folded Spill
; RV64NOM-NEXT: sd s2, 232(sp) # 8-byte Folded Spill
; RV64NOM-NEXT: addi s0, sp, 256
; RV64NOM-NEXT: andi sp, sp, -128
; RV64NOM-NEXT: mv s2, a0
; RV64NOM-NEXT: andi a0, a1, 31
; RV64NOM-NEXT: li a1, 4
; RV64NOM-NEXT: call __muldi3@plt
; RV64NOM-NEXT: li a1, 32
; RV64NOM-NEXT: vsetvli zero, a1, e32, m8, ta, ma
; RV64NOM-NEXT: vle32.v v8, (s2)
; RV64NOM-NEXT: mv a1, sp
; RV64NOM-NEXT: add a0, a1, a0
; RV64NOM-NEXT: vadd.vv v8, v8, v8
; RV64NOM-NEXT: vse32.v v8, (a1)
; RV64NOM-NEXT: lw a0, 0(a0)
; RV64NOM-NEXT: addi sp, s0, -256
; RV64NOM-NEXT: ld ra, 248(sp) # 8-byte Folded Reload
; RV64NOM-NEXT: ld s0, 240(sp) # 8-byte Folded Reload
; RV64NOM-NEXT: ld s2, 232(sp) # 8-byte Folded Reload
; RV64NOM-NEXT: addi sp, sp, 256
; RV64NOM-NEXT: ret
;
; RV64M-LABEL: extractelt_v32i32_idx:
; RV64M: # %bb.0:
; RV64M-NEXT: addi sp, sp, -256
; RV64M-NEXT: sd ra, 248(sp) # 8-byte Folded Spill
; RV64M-NEXT: sd s0, 240(sp) # 8-byte Folded Spill
; RV64M-NEXT: addi s0, sp, 256
; RV64M-NEXT: andi sp, sp, -128
; RV64M-NEXT: andi a1, a1, 31
; RV64M-NEXT: li a2, 32
; RV64M-NEXT: vsetvli zero, a2, e32, m8, ta, ma
; RV64M-NEXT: vle32.v v8, (a0)
; RV64M-NEXT: slli a1, a1, 2
; RV64M-NEXT: mv a0, sp
; RV64M-NEXT: or a1, a0, a1
; RV64M-NEXT: vadd.vv v8, v8, v8
; RV64M-NEXT: vse32.v v8, (a0)
; RV64M-NEXT: lw a0, 0(a1)
; RV64M-NEXT: addi sp, s0, -256
; RV64M-NEXT: ld ra, 248(sp) # 8-byte Folded Reload
; RV64M-NEXT: ld s0, 240(sp) # 8-byte Folded Reload
; RV64M-NEXT: addi sp, sp, 256
; RV64M-NEXT: ret
%a = load <32 x i32>, ptr %x
%b = add <32 x i32> %a, %a
%c = extractelement <32 x i32> %b, i32 %idx