[InterleavedAccess] Construct interleaved access store with shuffles #167737
Conversation
@llvm/pr-subscribers-llvm-transforms @llvm/pr-subscribers-backend-aarch64

Author: Ramkrishnan (ram-NK)

Changes: Re-land of reverted PR #164000. Added the reported failure cases and fixed the issues.

Patch is 41.91 KiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/167737.diff

8 Files Affected:
diff --git a/llvm/include/llvm/CodeGen/TargetLowering.h b/llvm/include/llvm/CodeGen/TargetLowering.h
index cec7d09f494d6..87809a27fe913 100644
--- a/llvm/include/llvm/CodeGen/TargetLowering.h
+++ b/llvm/include/llvm/CodeGen/TargetLowering.h
@@ -3232,6 +3232,9 @@ class LLVM_ABI TargetLoweringBase {
/// Default to be the minimum interleave factor: 2.
virtual unsigned getMaxSupportedInterleaveFactor() const { return 2; }
+ /// Return true if the target has interleave with shuffles.
+ virtual bool hasInterleaveWithGatherScatter() const { return false; }
+
/// Lower an interleaved load to target specific intrinsics. Return
/// true on success.
///
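For illustration, this is how a target opts into the new hook alongside its existing maximum-factor query; the class name here is hypothetical, and the AArch64 override later in this patch follows the same pattern:

```cpp
// Minimal sketch (hypothetical target) of opting into shuffle-based
// lowering of interleave factors above the hardware maximum.
class MyTargetLowering : public TargetLowering {
public:
  // Tell InterleavedAccessPass that factors beyond the hardware limit can
  // still be handled by pre-shuffling the data into smaller groups.
  bool hasInterleaveWithGatherScatter() const override { return true; }

  // Hardware stN/ldN instructions support at most four registers.
  unsigned getMaxSupportedInterleaveFactor() const override { return 4; }
};
```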
diff --git a/llvm/lib/CodeGen/InterleavedAccessPass.cpp b/llvm/lib/CodeGen/InterleavedAccessPass.cpp
index 5c27a20869f81..bfb875a01f29f 100644
--- a/llvm/lib/CodeGen/InterleavedAccessPass.cpp
+++ b/llvm/lib/CodeGen/InterleavedAccessPass.cpp
@@ -239,7 +239,8 @@ static bool isDeInterleaveMask(ArrayRef<int> Mask, unsigned &Factor,
/// I.e. <0, LaneLen, ... , LaneLen*(Factor - 1), 1, LaneLen + 1, ...>
/// E.g. For a Factor of 2 (LaneLen=4): <0, 4, 1, 5, 2, 6, 3, 7>
static bool isReInterleaveMask(ShuffleVectorInst *SVI, unsigned &Factor,
- unsigned MaxFactor) {
+ unsigned MaxFactor,
+ bool InterleaveWithShuffles) {
unsigned NumElts = SVI->getShuffleMask().size();
if (NumElts < 4)
return false;
@@ -250,6 +251,13 @@ static bool isReInterleaveMask(ShuffleVectorInst *SVI, unsigned &Factor,
return true;
}
+ if (InterleaveWithShuffles) {
+ for (unsigned i = 1; MaxFactor * i <= 16; i *= 2) {
+ Factor = i * MaxFactor;
+ if (SVI->isInterleave(Factor))
+ return true;
+ }
+ }
return false;
}
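To make the new probing loop concrete: with MaxFactor = 4 (AArch64's stN limit), it tries factors 4, 8, and 16, doubling until MaxFactor * i exceeds 16. A standalone illustration (not part of the patch):

```cpp
// Which interleave factors the new loop in isReInterleaveMask probes
// when getMaxSupportedInterleaveFactor() returns 4, as on AArch64.
#include <cstdio>

int main() {
  const unsigned MaxFactor = 4;
  // Mirrors: for (unsigned i = 1; MaxFactor * i <= 16; i *= 2)
  for (unsigned i = 1; MaxFactor * i <= 16; i *= 2)
    std::printf("probe Factor = %u\n", i * MaxFactor); // prints 4, 8, 16
}
```

For Factor = 8 on a <32 x i32> store, SVI->isInterleave(8) accepts a mask interleaving eight 4-lane sub-vectors, i.e. <0, 4, 8, ..., 28, 1, 5, ..., 29, ...>; this is exactly the %interleaved.vec mask built in the store_factor8 test further down.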
@@ -528,7 +536,8 @@ bool InterleavedAccessImpl::lowerInterleavedStore(
cast<FixedVectorType>(SVI->getType())->getNumElements();
// Check if the shufflevector is RE-interleave shuffle.
unsigned Factor;
- if (!isReInterleaveMask(SVI, Factor, MaxFactor))
+ if (!isReInterleaveMask(SVI, Factor, MaxFactor,
+ TLI->hasInterleaveWithGatherScatter()))
return false;
assert(NumStoredElements % Factor == 0 &&
"number of stored element should be a multiple of Factor");
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index eaa10ef031989..996d3a22fd5f1 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -97,6 +97,7 @@
#include <cctype>
#include <cstdint>
#include <cstdlib>
+#include <deque>
#include <iterator>
#include <limits>
#include <optional>
@@ -18010,12 +18011,14 @@ bool AArch64TargetLowering::lowerInterleavedStore(Instruction *Store,
ShuffleVectorInst *SVI,
unsigned Factor,
const APInt &GapMask) const {
-
- assert(Factor >= 2 && Factor <= getMaxSupportedInterleaveFactor() &&
- "Invalid interleave factor");
+ assert(Factor >= 2 && "Invalid interleave factor");
auto *SI = dyn_cast<StoreInst>(Store);
if (!SI)
return false;
+
+ if (Factor > getMaxSupportedInterleaveFactor())
+ return lowerInterleavedStoreWithShuffle(SI, SVI, Factor);
+
assert(!LaneMask && GapMask.popcount() == Factor &&
"Unexpected mask on store");
@@ -18161,6 +18164,146 @@ bool AArch64TargetLowering::lowerInterleavedStore(Instruction *Store,
return true;
}
+/// If the interleaved vector elements are greater than supported MaxFactor,
+/// interleaving the data with additional shuffles can be used to
+/// achieve the same.
+///
+/// Consider the following data with 8 interleaves which are shuffled to store
+/// stN instructions. Data needs to be stored in this order:
+/// [v0, v1, v2, v3, v4, v5, v6, v7]
+///
+/// v0 v4 v2 v6 v1 v5 v3 v7
+/// | | | | | | | |
+/// \ / \ / \ / \ /
+/// [zip v0,v4] [zip v2,v6] [zip v1,v5] [zip v3,v7] ==> stN = 4
+/// | | | |
+/// \ / \ /
+/// \ / \ /
+/// \ / \ /
+/// [zip [v0,v2,v4,v6]] [zip [v1,v3,v5,v7]] ==> stN = 2
+///
+/// For stN = 4, upper half of interleaved data V0, V1, V2, V3 is stored
+/// with one st4 instruction. Lower half, i.e, V4, V5, V6, V7 is stored with
+/// another st4.
+///
+/// For stN = 2, upper half of interleaved data V0, V1 is stored
+/// with one st2 instruction. Second set V2, V3 is stored with another st2.
+/// Total of 4 st2's are required here.
+bool AArch64TargetLowering::lowerInterleavedStoreWithShuffle(
+ StoreInst *SI, ShuffleVectorInst *SVI, unsigned Factor) const {
+ unsigned MaxSupportedFactor = getMaxSupportedInterleaveFactor();
+
+ auto *VecTy = cast<FixedVectorType>(SVI->getType());
+ assert(VecTy->getNumElements() % Factor == 0 && "Invalid interleaved store");
+
+ unsigned LaneLen = VecTy->getNumElements() / Factor;
+ Type *EltTy = VecTy->getElementType();
+ auto *SubVecTy = FixedVectorType::get(EltTy, Factor);
+
+ const DataLayout &DL = SI->getModule()->getDataLayout();
+ bool UseScalable;
+
+ // Skip if we do not have NEON and skip illegal vector types. We can
+ // "legalize" wide vector types into multiple interleaved accesses as long as
+ // the vector types are divisible by 128.
+ if (!Subtarget->hasNEON() ||
+ !isLegalInterleavedAccessType(SubVecTy, DL, UseScalable))
+ return false;
+
+ if (UseScalable)
+ return false;
+
+ std::deque<Value *> Shuffles;
+ Shuffles.push_back(SVI);
+ unsigned ConcatLevel = Factor;
+ // Getting all the interleaved operands.
+ while (ConcatLevel > 1) {
+ unsigned InterleavedOperands = Shuffles.size();
+ for (unsigned Ops = 0; Ops < InterleavedOperands; Ops++) {
+ auto *V = Shuffles.front();
+ Shuffles.pop_front();
+ if (isa<ConstantAggregateZero, UndefValue>(V)) {
+ VectorType *Ty = cast<VectorType>(V->getType());
+ auto *HalfTy = VectorType::getHalfElementsVectorType(Ty);
+ Value *SplitValue = nullptr;
+ if (isa<ConstantAggregateZero>(V))
+ SplitValue = ConstantAggregateZero::get(HalfTy);
+ else if (isa<PoisonValue>(V))
+ SplitValue = PoisonValue::get(HalfTy);
+ else if (isa<UndefValue>(V))
+ SplitValue = UndefValue::get(HalfTy);
+ Shuffles.push_back(SplitValue);
+ Shuffles.push_back(SplitValue);
+ continue;
+ }
+
+ ShuffleVectorInst *SFL = dyn_cast<ShuffleVectorInst>(V);
+ if (!SFL)
+ return false;
+ if (SVI != SFL && !SFL->isConcat())
+ return false;
+
+ Value *Op0 = SFL->getOperand(0);
+ Value *Op1 = SFL->getOperand(1);
+
+ Shuffles.push_back(dyn_cast<Value>(Op0));
+ Shuffles.push_back(dyn_cast<Value>(Op1));
+ }
+ ConcatLevel >>= 1;
+ }
+
+ IRBuilder<> Builder(SI);
+ auto Mask = createInterleaveMask(LaneLen, 2);
+ SmallVector<int, 16> UpperHalfMask(LaneLen), LowerHalfMask(LaneLen);
+ for (unsigned Idx = 0; Idx < LaneLen; Idx++) {
+ LowerHalfMask[Idx] = Mask[Idx];
+ UpperHalfMask[Idx] = Mask[Idx + LaneLen];
+ }
+
+ unsigned InterleaveFactor = Factor >> 1;
+ while (InterleaveFactor >= MaxSupportedFactor) {
+ std::deque<Value *> ShufflesIntermediate;
+ ShufflesIntermediate.resize(Factor);
+ for (unsigned Idx = 0; Idx < Factor; Idx += (InterleaveFactor * 2)) {
+ for (unsigned GroupIdx = 0; GroupIdx < InterleaveFactor; GroupIdx++) {
+ auto *Shuffle = Builder.CreateShuffleVector(
+ Shuffles[Idx + GroupIdx],
+ Shuffles[Idx + GroupIdx + InterleaveFactor], LowerHalfMask);
+ ShufflesIntermediate[Idx + GroupIdx] = Shuffle;
+ Shuffle = Builder.CreateShuffleVector(
+ Shuffles[Idx + GroupIdx],
+ Shuffles[Idx + GroupIdx + InterleaveFactor], UpperHalfMask);
+ ShufflesIntermediate[Idx + GroupIdx + InterleaveFactor] = Shuffle;
+ }
+ }
+ Shuffles = ShufflesIntermediate;
+ InterleaveFactor >>= 1;
+ }
+
+ Type *PtrTy = SI->getPointerOperandType();
+ auto *STVTy = FixedVectorType::get(SubVecTy->getElementType(), LaneLen);
+
+ Value *BaseAddr = SI->getPointerOperand();
+ Function *StNFunc = getStructuredStoreFunction(
+ SI->getModule(), MaxSupportedFactor, UseScalable, STVTy, PtrTy);
+ for (unsigned N = 0; N < (Factor / MaxSupportedFactor); N++) {
+ SmallVector<Value *, 5> Ops;
+ for (unsigned OpIdx = 0; OpIdx < MaxSupportedFactor; OpIdx++)
+ Ops.push_back(Shuffles[N * MaxSupportedFactor + OpIdx]);
+
+ if (N > 0) {
+ // We will compute the pointer operand of each store from the original
+ // base address using GEPs. Cast the base address to a pointer to the
+ // scalar element type.
+ BaseAddr = Builder.CreateConstGEP1_32(
+ SubVecTy->getElementType(), BaseAddr, LaneLen * MaxSupportedFactor);
+ }
+ Ops.push_back(Builder.CreateBitCast(BaseAddr, PtrTy));
+ Builder.CreateCall(StNFunc, Ops);
+ }
+ return true;
+}
+
bool AArch64TargetLowering::lowerDeinterleaveIntrinsicToLoad(
Instruction *Load, Value *Mask, IntrinsicInst *DI) const {
const unsigned Factor = getDeinterleaveIntrinsicFactor(DI->getIntrinsicID());
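To see the resulting pairing order concretely, the following standalone sketch (an illustration only, not part of the patch) replays the zip loop above on symbolic vector names for Factor = 8 and MaxSupportedFactor = 4:

```cpp
// Replays the zip tree built by lowerInterleavedStoreWithShuffle for
// Factor = 8, MaxSupportedFactor = 4, using strings instead of IR values.
#include <cstdio>
#include <string>
#include <vector>

int main() {
  const unsigned Factor = 8, MaxSupportedFactor = 4;
  std::vector<std::string> Shuffles;
  for (unsigned I = 0; I < Factor; ++I)
    Shuffles.push_back("v" + std::to_string(I));

  // Zip slot G with slot G + IF; the lower half stays in slot G and the
  // upper half goes to slot G + IF, mirroring the while-loop in the patch.
  for (unsigned IF = Factor >> 1; IF >= MaxSupportedFactor; IF >>= 1) {
    std::vector<std::string> Next(Factor);
    for (unsigned Idx = 0; Idx < Factor; Idx += IF * 2) {
      for (unsigned G = 0; G < IF; ++G) {
        Next[Idx + G] = "zip1(" + Shuffles[Idx + G] + "," +
                        Shuffles[Idx + G + IF] + ")";
        Next[Idx + G + IF] = "zip2(" + Shuffles[Idx + G] + "," +
                             Shuffles[Idx + G + IF] + ")";
      }
    }
    Shuffles = Next;
  }

  // One stN call per group of MaxSupportedFactor operands.
  for (unsigned N = 0; N < Factor / MaxSupportedFactor; ++N) {
    std::printf("st4 {");
    for (unsigned Op = 0; Op < MaxSupportedFactor; ++Op)
      std::printf(" %s", Shuffles[N * MaxSupportedFactor + Op].c_str());
    std::printf(" }\n");
  }
}
```

This prints st4 { zip1(v0,v4) zip1(v1,v5) zip1(v2,v6) zip1(v3,v7) } followed by the matching zip2 group, the two st4 calls seen in the store_factor8 CHECK lines below.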
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.h b/llvm/lib/Target/AArch64/AArch64ISelLowering.h
index 70bfae717fb76..2793022d9e240 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.h
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.h
@@ -229,6 +229,8 @@ class AArch64TargetLowering : public TargetLowering {
bool hasPairedLoad(EVT LoadedType, Align &RequiredAlignment) const override;
+ bool hasInterleaveWithGatherScatter() const override { return true; }
+
unsigned getMaxSupportedInterleaveFactor() const override { return 4; }
bool lowerInterleavedLoad(Instruction *Load, Value *Mask,
@@ -239,6 +241,9 @@ class AArch64TargetLowering : public TargetLowering {
ShuffleVectorInst *SVI, unsigned Factor,
const APInt &GapMask) const override;
+ bool lowerInterleavedStoreWithShuffle(StoreInst *SI, ShuffleVectorInst *SVI,
+ unsigned Factor) const;
+
bool lowerDeinterleaveIntrinsicToLoad(Instruction *Load, Value *Mask,
IntrinsicInst *DI) const override;
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
index 197aae6e03cb1..5d2f33ab8d62e 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
@@ -4922,11 +4922,35 @@ InstructionCost AArch64TTIImpl::getInterleavedMemoryOpCost(
if (!VecTy->isScalableTy() && (UseMaskForCond || UseMaskForGaps))
return InstructionCost::getInvalid();
- if (!UseMaskForGaps && Factor <= TLI->getMaxSupportedInterleaveFactor()) {
+ unsigned NumLoadStores = 1;
+ InstructionCost ShuffleCost = 0;
+ bool isInterleaveWithShuffle = false;
+ unsigned MaxSupportedFactor = TLI->getMaxSupportedInterleaveFactor();
+
+ auto *SubVecTy =
+ VectorType::get(VecVTy->getElementType(),
+ VecVTy->getElementCount().divideCoefficientBy(Factor));
+
+ if (TLI->hasInterleaveWithGatherScatter() && Opcode == Instruction::Store &&
+ (0 == Factor % MaxSupportedFactor) && Factor > MaxSupportedFactor) {
+ isInterleaveWithShuffle = true;
+ SmallVector<int, 16> Mask;
+ // preparing interleave Mask.
+ for (unsigned i = 0; i < VecVTy->getElementCount().getKnownMinValue() / 2;
+ i++) {
+ for (unsigned j = 0; j < 2; j++)
+ Mask.push_back(j * Factor + i);
+ }
+
+ NumLoadStores = Factor / MaxSupportedFactor;
+ ShuffleCost =
+ (Factor * getShuffleCost(TargetTransformInfo::SK_Splice, VecVTy, VecVTy,
+ Mask, CostKind, 0, SubVecTy));
+ }
+
+ if (!UseMaskForGaps &&
+ (Factor <= MaxSupportedFactor || isInterleaveWithShuffle)) {
unsigned MinElts = VecVTy->getElementCount().getKnownMinValue();
- auto *SubVecTy =
- VectorType::get(VecVTy->getElementType(),
- VecVTy->getElementCount().divideCoefficientBy(Factor));
// ldN/stN only support legal vector types of size 64 or 128 in bits.
// Accesses having vector types that are a multiple of 128 bits can be
@@ -4934,7 +4958,10 @@ InstructionCost AArch64TTIImpl::getInterleavedMemoryOpCost(
bool UseScalable;
if (MinElts % Factor == 0 &&
TLI->isLegalInterleavedAccessType(SubVecTy, DL, UseScalable))
- return Factor * TLI->getNumInterleavedAccesses(SubVecTy, DL, UseScalable);
+ return (Factor *
+ TLI->getNumInterleavedAccesses(SubVecTy, DL, UseScalable) *
+ NumLoadStores) +
+ ShuffleCost;
}
return BaseT::getInterleavedMemoryOpCost(Opcode, VecTy, Factor, Indices,
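As a worked example of the modified cost expression under assumed unit costs: for the <32 x i32>, Factor = 8 store exercised by store_factor8 below, SubVecTy is <4 x i32> (one legal 128-bit access), NumLoadStores is 8 / 4 = 2, and eight zip shuffles are emitted:

```cpp
// Illustration only: plugging the store_factor8 numbers into the new cost
// expression, assuming getNumInterleavedAccesses() == 1 and a unit
// SK_Splice shuffle cost (actual values are target- and cost-kind-specific).
constexpr unsigned Factor = 8, MaxSupportedFactor = 4;
constexpr unsigned NumInterleavedAccesses = 1; // <4 x i32> fits one st4
constexpr unsigned NumLoadStores = Factor / MaxSupportedFactor; // two st4s
constexpr unsigned ShuffleCost = Factor * 1;   // eight zips, cost 1 each
static_assert(Factor * NumInterleavedAccesses * NumLoadStores + ShuffleCost ==
                  24,
              "16 for the stN accesses plus 8 for the shuffles");
```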
diff --git a/llvm/test/CodeGen/AArch64/vldn_shuffle.ll b/llvm/test/CodeGen/AArch64/vldn_shuffle.ll
index 3685e9cf85bd6..7af1c8e9365ed 100644
--- a/llvm/test/CodeGen/AArch64/vldn_shuffle.ll
+++ b/llvm/test/CodeGen/AArch64/vldn_shuffle.ll
@@ -730,6 +730,244 @@ entry:
ret void
}
+define void @store_factor8(ptr %ptr, <4 x i32> %a0, <4 x i32> %a1, <4 x i32> %a2, <4 x i32> %a3,
+ <4 x i32> %a4, <4 x i32> %a5, <4 x i32> %a6, <4 x i32> %a7) {
+; CHECK-LABEL: store_factor8:
+; CHECK: .Lfunc_begin17:
+; CHECK-NEXT: .cfi_startproc
+; CHECK-NEXT: // %bb.0:
+; CHECK: zip1 [[V1:.*s]], [[I1:.*s]], [[I5:.*s]]
+; CHECK-NEXT: zip2 [[V5:.*s]], [[I1]], [[I5]]
+; CHECK-NEXT: zip1 [[V2:.*s]], [[I2:.*s]], [[I6:.*s]]
+; CHECK-NEXT: zip2 [[V6:.*s]], [[I2]], [[I6]]
+; CHECK-NEXT: zip1 [[V3:.*s]], [[I3:.*s]], [[I7:.*s]]
+; CHECK-NEXT: zip2 [[V7:.*s]], [[I3]], [[I7]]
+; CHECK-NEXT: zip1 [[V4:.*s]], [[I4:.*s]], [[I8:.*s]]
+; CHECK-NEXT: zip2 [[V8:.*s]], [[I4]], [[I8]]
+; CHECK-NEXT: st4 { [[V1]], [[V2]], [[V3]], [[V4]] }, [x0], #64
+; CHECK-NEXT: st4 { [[V5]], [[V6]], [[V7]], [[V8]] }, [x0]
+; CHECK-NEXT: ret
+
+ %v0 = shufflevector <4 x i32> %a0, <4 x i32> %a1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+ %v1 = shufflevector <4 x i32> %a2, <4 x i32> %a3, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+ %v2 = shufflevector <4 x i32> %a4, <4 x i32> %a5, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+ %v3 = shufflevector <4 x i32> %a6, <4 x i32> %a7, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+
+ %s0 = shufflevector <8 x i32> %v0, <8 x i32> %v1, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+ %s1 = shufflevector <8 x i32> %v2, <8 x i32> %v3, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+
+ %interleaved.vec = shufflevector <16 x i32> %s0, <16 x i32> %s1, <32 x i32> <i32 0, i32 4, i32 8, i32 12, i32 16, i32 20, i32 24, i32 28, i32 1, i32 5, i32 9, i32 13, i32 17, i32 21, i32 25, i32 29, i32 2, i32 6, i32 10, i32 14, i32 18, i32 22, i32 26, i32 30, i32 3, i32 7, i32 11, i32 15, i32 19, i32 23, i32 27, i32 31>
+ store <32 x i32> %interleaved.vec, ptr %ptr, align 4
+ ret void
+}
+
+define void @store_factor16(ptr %ptr, <4 x i32> %a0, <4 x i32> %a1, <4 x i32> %a2, <4 x i32> %a3,
+ <4 x i32> %a4, <4 x i32> %a5, <4 x i32> %a6, <4 x i32> %a7,
+ <4 x i32> %a8, <4 x i32> %a9, <4 x i32> %a10, <4 x i32> %a11,
+ <4 x i32> %a12, <4 x i32> %a13, <4 x i32> %a14, <4 x i32> %a15) {
+; CHECK-LABEL: store_factor16:
+; CHECK: .Lfunc_begin18:
+; CHECK-NEXT: .cfi_startproc
+; CHECK-NEXT: // %bb.0:
+; CHECK: zip1 [[V05:.*s]], [[I05:.*s]], [[I13:.*s]]
+; CHECK-NEXT: zip1 [[V01:.*s]], [[I01:.*s]], [[I09:.*s]]
+; CHECK-NEXT: zip1 [[V02:.*s]], [[I02:.*s]], [[I10:.*s]]
+; CHECK-NEXT: zip1 [[V06:.*s]], [[I06:.*s]], [[I14:.*s]]
+; CHECK-NEXT: zip1 [[V07:.*s]], [[I07:.*s]], [[I15:.*s]]
+; CHECK-NEXT: zip2 [[V09:.*s]], [[I01]], [[I09]]
+; CHECK-NEXT: zip2 [[V13:.*s]], [[I05]], [[I13]]
+; CHECK-NEXT: zip1 [[V03:.*s]], [[I03:.*s]], [[I11:.*s]]
+; CHECK-NEXT: zip1 [[V04:.*s]], [[I04:.*s]], [[I12:.*s]]
+; CHECK-NEXT: zip1 [[V08:.*s]], [[I08:.*s]], [[I16:.*s]]
+; CHECK-NEXT: zip2 [[V10:.*s]], [[I02]], [[I10]]
+; CHECK-NEXT: zip2 [[V14:.*s]], [[I06]], [[I14]]
+; CHECK-NEXT: zip2 [[V11:.*s]], [[I03]], [[I11]]
+; CHECK-NEXT: zip1 [[V17:.*s]], [[V01]], [[V05]]
+; CHECK-NEXT: zip2 [[V15:.*s]], [[I07]], [[I15]]
+; CHECK-NEXT: zip2 [[V21:.*s]], [[V01]], [[V05]]
+; CHECK-NEXT: zip1 [[V18:.*s]], [[V02]], [[V06]]
+; CHECK-NEXT: zip2 [[V12:.*s]], [[I04]], [[I12]]
+; CHECK-NEXT: zip2 [[V16:.*s]], [[I08]], [[I16]]
+; CHECK-NEXT: zip1 [[V19:.*s]], [[V03]], [[V07]]
+; CHECK-NEXT: zip2 [[V22:.*s]], [[V02]], [[V06]]
+; CHECK-NEXT: zip1 [[V25:.*s]], [[V09]], [[V13]]
+; CHECK-NEXT: zip1 [[V20:.*s]], [[V04]], [[V08]]
+; CHECK-NEXT: zip2 [[V23:.*s]], [[V03]], [[V07]]
+; CHECK-NEXT: zip1 [[V26:.*s]], [[V10]], [[V14]]
+; CHECK-NEXT: zip2 [[V29:.*s]], [[V09]], [[V13]]
+; CHECK-NEXT: zip2 [[V24:.*s]], [[V04]], [[V08]]
+; CHECK-NEXT: zip1 [[V27:.*s]], [[V11]], [[V15]]
+; CHECK-NEXT: zip2 [[V30:.*s]], [[V10]], [[V14]]
+; CHECK-NEXT: zip1 [[V28:.*s]], [[V12]], [[V16]]
+; CHECK-NEXT: zip2 [[V31:.*s]], [[V11]], [[V15]]
+; CHECK-NEXT: zip2 [[V32:.*s]], [[V12]], [[V16]]
+; CHECK-NEXT: st4 { [[V17]], [[V18]], [[V19]], [[V20]] }, [x8], #64
+; CHECK-NEXT: ldp d9, d8, [sp, #48] // 16-byte Folded Reload
+; CHECK-NEXT: ldp d11, d10, [sp, #32] // 16-byte Folded Reload
+; CHECK-NEXT: st4 { [[V21]], [[V22]], [[V23]], [[V24]] }, [x8]
+; CHECK-NEXT: add x8, x0, #128
+; CHECK-NEXT: ldp d13, d12, [sp, #16] // 16-byte Folded Reload
+; CHECK-NEXT: st4 { [[V25]], [[V26]], [[V27]], [[V28]] }, [x8]
+; CHECK-NEXT: add x8, x0, #192
+; CHECK-NEXT: st4 { [[V29]], [[V30]], [[V31]], [[V32]] }, [x8]
+; CHECK-NEXT: ldp d15, d14, [sp], #64 // 16-byte Folded Reload
+; CHECK-NEXT: ret
+
+ %v0 = shufflevector <4 x i32> %a0, <4 x i32> %a1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+ %v1 = shufflevector <4 x i32> %a2, <4 x i32> %a3, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+ %v2 = shufflevector <4 x i32> %a4, <4 x i32> %a5, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+ %v3 = shufflevector <4 x i32> %a6, <4 x i32> %a7, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+ %v4 = shufflevector <4 x i32> %a8, <4 x i32> %a9, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+ %v5 = shufflevector <4 x i32> %a10, <4 x i32> %a11, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+ %v6 = shufflevector <4 x i32> %a12, <4 x i32> %a13, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+ %v7 = shufflevector <4 x i32> %a14, <4 x i32> %a15, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+
+ %s0 = shufflevector <8 x i32> %v0, <8 x i32> %v1, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+ %s1 = shufflevector <8 x i32> %v2, <8 x i32> %v3, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+ %s2 = shufflevector <8 x i32> %v4, <8 x i32> %v5, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+ %s3 = shufflevector <8 x i32> %v6, <8 x i32> %v7, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+
+ %d0 = shufflevector <16 x i32> %s0, <16 x i32> %s1, <32 x i32> <i32 0, i32 1, i32 2, i32 3,...
[truncated]
✅ With the latest revision this PR passed the undef deprecator.
Force-pushed from 0efaebf to fdcb54e.
Rajveer100 left a comment:
Can you add a few test cases that caused the assertion failure, including the original reproducer?
- [AArch64]: Interleaved access store can handle more elements than
target supported maximum interleaved factor with shuffles.
Force-pushed from fdcb54e to 102b917.
🐧 Linux x64 Test Results
Force-pushed from 4fa2bbd to 5122611.
- [AArch64]: Interleaved access store can handle more elements than
target supported maximum interleaved factor with shuffles.
Force-pushed from 5122611 to d4b120b.
@Rajveer100, more failure test cases have been added. @fhahn, one of the loops in @scalar_store_cost_after_discarding_interleave_group() in llvm/test/Transforms/LoopVectorize/AArch64/replicating-load-store-costs.ll is vectorized now. The corresponding IR has been added as @store_factor8_1() in llvm/test/CodeGen/AArch64/vldn_shuffle.ll for verification. Better assembly is generated with this patch.