[CostModel][X86] getScalarizationOverhead - improve extraction costs for > 128-bit vectors

We were using the default getScalarizationOverhead expansion for extraction costs, which adds up all the individual element extraction costs.

This is fine for 128-bit vectors, but for 256/512-bit vectors each element extraction also has to account for extracting the upper 128-bit subvector before it can extract the element itself. For scalarization costs we only need to extract each demanded subvector once.

Differential Revision: https://reviews.llvm.org/D125527
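
The difference between the two accounting schemes described above can be sketched in a small standalone model. This is only an illustration, not the LLVM implementation: the function names `perElementCost` and `perLaneCost`, the `uint64_t` demand mask, and the unit costs are all assumptions, and extracting from the lowest 128-bit lane is assumed to be free.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical unit costs, chosen only for illustration.
constexpr unsigned kElementExtractCost = 1;   // one element out of a 128-bit lane
constexpr unsigned kSubvectorExtractCost = 1; // one upper 128-bit lane out of a wider vector

// Old accounting: every demanded element that lives in an upper lane pays for
// its own subvector extraction in addition to the element extraction.
unsigned perElementCost(uint64_t demanded, unsigned numElts, unsigned eltsPerLane) {
  assert(eltsPerLane != 0 && numElts <= 64 && numElts % eltsPerLane == 0);
  unsigned cost = 0;
  for (unsigned i = 0; i != numElts; ++i) {
    if (!((demanded >> i) & 1))
      continue;
    if (i / eltsPerLane != 0) // assumption: lane 0 needs no subvector extraction
      cost += kSubvectorExtractCost;
    cost += kElementExtractCost;
  }
  return cost;
}

// New accounting: each demanded upper lane is extracted once, then every
// demanded element is extracted from its 128-bit lane.
unsigned perLaneCost(uint64_t demanded, unsigned numElts, unsigned eltsPerLane) {
  assert(eltsPerLane != 0 && numElts <= 64 && numElts % eltsPerLane == 0);
  unsigned cost = 0;
  bool laneSeen[64] = {false};
  for (unsigned i = 0; i != numElts; ++i) {
    if (!((demanded >> i) & 1))
      continue;
    unsigned lane = i / eltsPerLane;
    if (lane != 0 && !laneSeen[lane]) { // pay for each upper lane only once
      laneSeen[lane] = true;
      cost += kSubvectorExtractCost;
    }
    cost += kElementExtractCost;
  }
  return cost;
}
```

Under these assumed unit costs, demanding all eight elements of a vector with two lanes of four elements (`perElementCost(0xFF, 8, 4)` versus `perLaneCost(0xFF, 8, 4)`) gives 12 for the per-element scheme and 9 for the per-lane scheme. The real cost values come from the target's cost tables, so these numbers are not expected to match the test updates in this commit.
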
RKSimon committed May 24, 2022
1 parent 1586e1d commit 6c80267
Showing 120 changed files with 1,836 additions and 1,794 deletions.
54 changes: 48 additions & 6 deletions llvm/lib/Target/X86/X86TargetTransformInfo.cpp
@@ -3779,15 +3779,19 @@ InstructionCost X86TTIImpl::getScalarizationOverhead(VectorType *Ty,
                                                   const APInt &DemandedElts,
                                                   bool Insert,
                                                   bool Extract) {
+  assert(DemandedElts.getBitWidth() ==
+             cast<FixedVectorType>(Ty)->getNumElements() &&
+         "Vector size mismatch");
+
+  std::pair<InstructionCost, MVT> LT = TLI->getTypeLegalizationCost(DL, Ty);
+  MVT MScalarTy = LT.second.getScalarType();
+  unsigned SizeInBits = LT.second.getSizeInBits();
+
   InstructionCost Cost = 0;
 
   // For insertions, a ISD::BUILD_VECTOR style vector initialization can be much
   // cheaper than an accumulation of ISD::INSERT_VECTOR_ELT.
   if (Insert) {
-    std::pair<InstructionCost, MVT> LT = TLI->getTypeLegalizationCost(DL, Ty);
-    MVT MScalarTy = LT.second.getScalarType();
-    unsigned SizeInBits = LT.second.getSizeInBits();
-
     if ((MScalarTy == MVT::i16 && ST->hasSSE2()) ||
         (MScalarTy.isInteger() && ST->hasSSE41()) ||
         (MScalarTy == MVT::f32 && ST->hasSSE41())) {
@@ -3865,8 +3869,46 @@ InstructionCost X86TTIImpl::getScalarizationOverhead(VectorType *Ty,
       return MOVMSKCost;
     }
 
-    // TODO: Use default extraction for now, but we should investigate extending
-    // this to handle repeated subvector extraction.
+    if (LT.second.isVector()) {
+      int CostValue = *LT.first.getValue();
+      assert(CostValue >= 0 && "Negative cost!");
+
+      unsigned NumElts = LT.second.getVectorNumElements() * CostValue;
+      assert(NumElts >= DemandedElts.getBitWidth() &&
+             "Vector has been legalized to smaller element count");
+
+      // If we're extracting elements from a 128-bit subvector lane, we only need
+      // to extract each lane once, not for every element.
+      if (SizeInBits > 128) {
+        assert((SizeInBits % 128) == 0 && "Illegal vector");
+        unsigned NumLegal128Lanes = SizeInBits / 128;
+        unsigned Num128Lanes = NumLegal128Lanes * CostValue;
+        APInt WidenedDemandedElts = DemandedElts.zext(NumElts);
+        unsigned Scale = NumElts / Num128Lanes;
+
+        // Add cost for each demanded 128-bit subvector extraction.
+        // Luckily this is a lot easier than for insertion.
+        APInt DemandedUpper128Lanes =
+            APIntOps::ScaleBitMask(WidenedDemandedElts, Num128Lanes);
+        auto *Ty128 = FixedVectorType::get(Ty->getElementType(), Scale);
+        for (unsigned I = 0; I != Num128Lanes; ++I)
+          if (DemandedUpper128Lanes[I])
+            Cost += getShuffleCost(TTI::SK_ExtractSubvector, Ty, None,
+                                   I * Scale, Ty128);
+
+        // Add all the demanded element extractions together, but adjust the
+        // index to use the equivalent of the bottom 128 bit lane.
+        for (unsigned I = 0; I != NumElts; ++I)
+          if (WidenedDemandedElts[I]) {
+            unsigned Idx = I % Scale;
+            Cost += getVectorInstrCost(Instruction::ExtractElement, Ty, Idx);
+          }
+
+        return Cost;
+      }
+    }
+
+    // Fallback to default extraction.
     Cost += BaseT::getScalarizationOverhead(Ty, DemandedElts, false, Extract);
   }
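
The two mask manipulations in the hunk above (collapsing the per-element demand mask into a per-lane mask, and remapping an element index into the bottom lane) can be sketched without LLVM types. This is a simplified stand-in, assuming the narrowing behaviour of `APIntOps::ScaleBitMask` is "a lane bit is set if any element bit in that lane is set"; `demandedLanes` and `indexInLane` are hypothetical names, and a `uint64_t` replaces `APInt`.

```cpp
#include <cassert>
#include <cstdint>

// Collapse a per-element demand mask into a per-lane demand mask.
// A lane is demanded if any of its elements is demanded (assumed to mirror
// how the patch uses APIntOps::ScaleBitMask when narrowing).
uint64_t demandedLanes(uint64_t demandedElts, unsigned numElts, unsigned numLanes) {
  assert(numLanes != 0 && numElts <= 64 && numElts % numLanes == 0);
  unsigned scale = numElts / numLanes; // elements per lane
  uint64_t lanes = 0;
  for (unsigned i = 0; i != numElts; ++i)
    if ((demandedElts >> i) & 1)
      lanes |= 1ULL << (i / scale);
  return lanes;
}

// Map an element index in the wide vector to the equivalent index in the
// bottom lane, as the patch does with `I % Scale`.
unsigned indexInLane(unsigned eltIdx, unsigned scale) {
  assert(scale != 0);
  return eltIdx % scale;
}
```

For example, with eight elements in two lanes, a demand mask of `0x21` (elements 0 and 5) touches both lanes, and element 5 is costed as if it were element 1 of a 128-bit vector.
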

16 changes: 8 additions & 8 deletions llvm/test/Analysis/CostModel/X86/arith-fp.ll
@@ -663,23 +663,23 @@ define i32 @frem(i32 %arg) {
 ; AVX-LABEL: 'frem'
 ; AVX-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %F32 = frem float undef, undef
 ; AVX-NEXT: Cost Model: Found an estimated cost of 14 for instruction: %V4F32 = frem <4 x float> undef, undef
-; AVX-NEXT: Cost Model: Found an estimated cost of 34 for instruction: %V8F32 = frem <8 x float> undef, undef
-; AVX-NEXT: Cost Model: Found an estimated cost of 68 for instruction: %V16F32 = frem <16 x float> undef, undef
+; AVX-NEXT: Cost Model: Found an estimated cost of 31 for instruction: %V8F32 = frem <8 x float> undef, undef
+; AVX-NEXT: Cost Model: Found an estimated cost of 62 for instruction: %V16F32 = frem <16 x float> undef, undef
 ; AVX-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %F64 = frem double undef, undef
 ; AVX-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %V2F64 = frem <2 x double> undef, undef
-; AVX-NEXT: Cost Model: Found an estimated cost of 15 for instruction: %V4F64 = frem <4 x double> undef, undef
-; AVX-NEXT: Cost Model: Found an estimated cost of 30 for instruction: %V8F64 = frem <8 x double> undef, undef
+; AVX-NEXT: Cost Model: Found an estimated cost of 14 for instruction: %V4F64 = frem <4 x double> undef, undef
+; AVX-NEXT: Cost Model: Found an estimated cost of 28 for instruction: %V8F64 = frem <8 x double> undef, undef
 ; AVX-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
 ;
 ; AVX512-LABEL: 'frem'
 ; AVX512-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %F32 = frem float undef, undef
 ; AVX512-NEXT: Cost Model: Found an estimated cost of 14 for instruction: %V4F32 = frem <4 x float> undef, undef
-; AVX512-NEXT: Cost Model: Found an estimated cost of 34 for instruction: %V8F32 = frem <8 x float> undef, undef
-; AVX512-NEXT: Cost Model: Found an estimated cost of 72 for instruction: %V16F32 = frem <16 x float> undef, undef
+; AVX512-NEXT: Cost Model: Found an estimated cost of 31 for instruction: %V8F32 = frem <8 x float> undef, undef
+; AVX512-NEXT: Cost Model: Found an estimated cost of 63 for instruction: %V16F32 = frem <16 x float> undef, undef
 ; AVX512-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %F64 = frem double undef, undef
 ; AVX512-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %V2F64 = frem <2 x double> undef, undef
-; AVX512-NEXT: Cost Model: Found an estimated cost of 15 for instruction: %V4F64 = frem <4 x double> undef, undef
-; AVX512-NEXT: Cost Model: Found an estimated cost of 33 for instruction: %V8F64 = frem <8 x double> undef, undef
+; AVX512-NEXT: Cost Model: Found an estimated cost of 14 for instruction: %V4F64 = frem <4 x double> undef, undef
+; AVX512-NEXT: Cost Model: Found an estimated cost of 30 for instruction: %V8F64 = frem <8 x double> undef, undef
 ; AVX512-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
 ;
 ; SLM-LABEL: 'frem'