[AArch64][TTI] Improve `LegalVF` when computing gather-loads cost #69617

antoniofrighetto · 2023-10-19T17:14:52Z

After determining the cost of loads that could not be coalesced into VectorizedLoads, computing the cost of a gather-vectorized load is carried out. Take into account the case where the type of a group of loads, whose type is a vector of size dependent upon VF, may be legalized into a scalar value.

Current number of loads is held in LT.first.

Fixes: #68953.

llvmbot · 2023-10-19T17:15:54Z

@llvm/pr-subscribers-llvm-analysis
@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-backend-aarch64

Author: Antonio Frighetto (antoniofrighetto)

Changes

After determining the cost of loads that could not be coalesced into VectorizedLoads, computing the cost of a gather-vectorized load is carried out. Take into account the case where the type of a group of loads, whose type is a vector of size dependent upon VF, may be legalized into a scalar value.

Current number of loads is held in LT.first.

Fixes: #68953.

Full diff: https://github.com/llvm/llvm-project/pull/69617.diff

2 Files Affected:

(modified) llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp (+8-1)
(added) llvm/test/Transforms/SLPVectorizer/AArch64/gather-load-fp128.ll (+37)

diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
index d8a0e68d7123759..40ed3e6d327d882 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
@@ -2989,7 +2989,14 @@ InstructionCost AArch64TTIImpl::getGatherScatterOpCost(
       ElementCount::getScalable(1))
     return InstructionCost::getInvalid();
 
-  ElementCount LegalVF = LT.second.getVectorElementCount();
+  ElementCount LegalVF;
+  if (LT.second.isVector()) {
+    LegalVF = LT.second.getVectorElementCount();
+  } else {
+    // If the legalized type is a simple type, treat it as a 1-element vector.
+    LegalVF = ElementCount::getFixed(1);
+  }
+
   InstructionCost MemOpCost =
       getMemoryOpCost(Opcode, VT->getElementType(), Alignment, 0, CostKind,
                       {TTI::OK_AnyValue, TTI::OP_None}, I);
diff --git a/llvm/test/Transforms/SLPVectorizer/AArch64/gather-load-fp128.ll b/llvm/test/Transforms/SLPVectorizer/AArch64/gather-load-fp128.ll
new file mode 100644
index 000000000000000..84de45bde06a62f
--- /dev/null
+++ b/llvm/test/Transforms/SLPVectorizer/AArch64/gather-load-fp128.ll
@@ -0,0 +1,37 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
+; RUN: opt -S -passes=slp-vectorizer -mtriple=aarch64-unknown-linux-gnu -mcpu=neoverse-512tvb -pass-remarks-output=%t.yaml < %s | FileCheck %s
+; RUN: cat %t.yaml | FileCheck -check-prefix=REMARK %s
+
+; REMARK-LABEL: Function:        gather_load_fp128
+; REMARK:       Args:
+; REMARK-NEXT:    - String:          'List vectorization was possible but not beneficial with cost '
+; REMARK-NEXT:    - Cost:            '0'
+; REMARK-NEXT:    - String:          ' >= '
+; REMARK-NEXT:    - Treshold:        '0'
+
+define void @gather_load_fp128(ptr %arg) #0 {
+; CHECK-LABEL: @gather_load_fp128(
+; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i8, ptr [[ARG:%.*]], i64 16
+; CHECK-NEXT:    [[LOAD0:%.*]] = load fp128, ptr [[ARG]], align 1
+; CHECK-NEXT:    [[LOAD1:%.*]] = load fp128, ptr [[GEP]], align 1
+; CHECK-NEXT:    [[LOAD2:%.*]] = load fp128, ptr null, align 1
+; CHECK-NEXT:    [[LOAD3:%.*]] = load fp128, ptr null, align 1
+; CHECK-NEXT:    [[FCMP0:%.*]] = fcmp oeq fp128 [[LOAD0]], 0xL00000000000000000000000000000000
+; CHECK-NEXT:    [[FCMP1:%.*]] = fcmp oeq fp128 [[LOAD1]], 0xL00000000000000000000000000000000
+; CHECK-NEXT:    [[FCMP2:%.*]] = fcmp oeq fp128 [[LOAD2]], 0xL00000000000000000000000000000000
+; CHECK-NEXT:    [[FCMP3:%.*]] = fcmp oeq fp128 [[LOAD3]], 0xL00000000000000000000000000000000
+; CHECK-NEXT:    ret void
+;
+  %gep = getelementptr i8, ptr %arg, i64 16
+  %load0 = load fp128, ptr %arg, align 1
+  %load1 = load fp128, ptr %gep, align 1
+  %load2 = load fp128, ptr null, align 1
+  %load3 = load fp128, ptr null, align 1
+  %fcmp0 = fcmp oeq fp128 %load0, 0xL0
+  %fcmp1 = fcmp oeq fp128 %load1, 0xL0
+  %fcmp2 = fcmp oeq fp128 %load2, 0xL0
+  %fcmp3 = fcmp oeq fp128 %load3, 0xL0
+  ret void
+}
+
+attributes #0 = { vscale_range(2,2) }

alexey-bataev · 2023-10-19T17:20:31Z

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

+  ElementCount LegalVF;
+  if (LT.second.isVector()) {
+    LegalVF = LT.second.getVectorElementCount();
+  } else {
+    // If the legalized type is a simple type, treat it as a 1-element vector.
+    LegalVF = ElementCount::getFixed(1);
+  }
+


Shall isLegalMaskedGather() just return false for such types instead?

May that be the case though? Currently, few lines above, if useNeonVector is available, the cost is computed via BaseT::getGatherScatterOpCost before type-legalization.

If the vector type gets scalarized (like in your check), isLegalMaskedGather() shall return false for such types, even though useNeonVector() returns true. SLP shall not assume emitting masked gather here at all, the scalarization increases the cost significantly, we should end up with the simple gather here

davemgreen · 2023-10-19T17:56:14Z

llvm/test/Transforms/SLPVectorizer/AArch64/gather-load-fp128.ll

@@ -0,0 +1,37 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
+; RUN: opt -S -passes=slp-vectorizer -mtriple=aarch64-unknown-linux-gnu -mcpu=neoverse-512tvb -pass-remarks-output=%t.yaml < %s | FileCheck %s
+; RUN: cat %t.yaml | FileCheck -check-prefix=REMARK %s


Is testing the remarks necessary, along with the output?

davemgreen · 2023-10-19T17:56:37Z

llvm/test/Transforms/SLPVectorizer/AArch64/gather-load-fp128.ll

+; REMARK-NEXT:    - String:          ' >= '
+; REMARK-NEXT:    - Treshold:        '0'
+
+define void @gather_load_fp128(ptr %arg) #0 {


Can you add a variant with i128 too?

davemgreen · 2023-10-19T18:02:49Z

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

@@ -2989,7 +2989,14 @@ InstructionCost AArch64TTIImpl::getGatherScatterOpCost(
      ElementCount::getScalable(1))


I think it might be worth treating !LT.second.isVector() the same as 1 x scalable vectors. Either that or something should be checking that the element type isElementTypeLegalForScalableVector.

If we bail out when isLegalMaskedGather returns false, which indeed would happen with such types, – as per @alexey-bataev suggestion above – I don't think we would ever be reaching this point.

It didn't look like that function was called when I tried it, and it is best if these functions work correctly in isolation. It's possible to call it throught the cost model: https://godbolt.org/z/z4635vodq.

I think that because the gather will not become a scalable SVE gather in this case, the line above with useNeonVector should be returning true so we use the BaseT::getGatherScatterOpCost for the cost.

antoniofrighetto · 2023-10-27T14:27:15Z

Addressed reviews, thanks!

davemgreen · 2023-10-27T15:03:20Z

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

@@ -2996,6 +2996,9 @@ static unsigned getSVEGatherScatterOverhead(unsigned Opcode) {
 InstructionCost AArch64TTIImpl::getGatherScatterOpCost(
    unsigned Opcode, Type *DataTy, const Value *Ptr, bool VariableMask,
    Align Alignment, TTI::TargetCostKind CostKind, const Instruction *I) {
+  if (!isLegalMaskedGatherScatter(DataTy))


This doesn't sound right - to return an invalid cost. If the gather/scatter can be scalarized then it should have some cost, even if it's quite high. There are some scalable vectors that can't be scalarized, but in general for fixed-width it should be OK.

Is the isElementTypeLegalForScalableVector check below enough to fix the issues you were seeing? We should probably have some fixed-length cost checks too.

If not, I think doing this would make sense, so that it returned the base cost:

if (useNeonVector(DataTy) || !isLegalMaskedGatherScatter(DataTy)) return BaseT::getGatherScatterOpCost(...)

Thought we wanted to favour an invalid cost in those cases, but makes sense. I confirm isElementTypeLegalForScalableVector is fixing the issues previously seen.

davemgreen

Thanks, this LGTM. Thanks for the fix!

After determining the cost of loads that could not be coalesced into `VectorizedLoads` in SLP, computing the cost of a gather-vectorized load is carried out. Favour a potentially high valid cost when the type of a group of loads, whose type is a vector of size dependent upon `VF`, may be legalized into a scalar value. Fixes: llvm#68953.

antoniofrighetto requested review from fhahn, rengolin, alexey-bataev, davemgreen, CarolineConcatto and david-arm October 19, 2023 17:14

llvmbot added backend:AArch64 llvm:transforms labels Oct 19, 2023

alexey-bataev reviewed Oct 19, 2023

View reviewed changes

davemgreen reviewed Oct 19, 2023

View reviewed changes

antoniofrighetto force-pushed the feature/slp_aarch64_tti_improve_vf branch from 731ed9d to 7b5fa1d Compare October 27, 2023 14:26

llvmbot added the llvm:analysis label Oct 27, 2023

antoniofrighetto requested review from davemgreen and alexey-bataev October 27, 2023 14:27

davemgreen reviewed Oct 27, 2023

View reviewed changes

antoniofrighetto force-pushed the feature/slp_aarch64_tti_improve_vf branch from 7b5fa1d to c828e0d Compare October 27, 2023 15:39

antoniofrighetto requested a review from davemgreen October 27, 2023 15:41

davemgreen approved these changes Oct 27, 2023

View reviewed changes

antoniofrighetto force-pushed the feature/slp_aarch64_tti_improve_vf branch from c828e0d to 138e6c1 Compare October 27, 2023 18:25

antoniofrighetto merged commit 138e6c1 into llvm:main Oct 27, 2023
2 of 3 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[AArch64][TTI] Improve `LegalVF` when computing gather-loads cost #69617

[AArch64][TTI] Improve `LegalVF` when computing gather-loads cost #69617

antoniofrighetto commented Oct 19, 2023

llvmbot commented Oct 19, 2023 •

edited

alexey-bataev Oct 19, 2023

antoniofrighetto Oct 19, 2023

alexey-bataev Oct 19, 2023 •

edited

davemgreen Oct 19, 2023

davemgreen Oct 19, 2023

davemgreen Oct 19, 2023

antoniofrighetto Oct 19, 2023

davemgreen Oct 20, 2023

antoniofrighetto commented Oct 27, 2023

davemgreen Oct 27, 2023

antoniofrighetto Oct 27, 2023

davemgreen left a comment

		@@ -2989,7 +2989,14 @@ InstructionCost AArch64TTIImpl::getGatherScatterOpCost(
		ElementCount::getScalable(1))

[AArch64][TTI] Improve LegalVF when computing gather-loads cost #69617

[AArch64][TTI] Improve LegalVF when computing gather-loads cost #69617

Conversation

antoniofrighetto commented Oct 19, 2023

llvmbot commented Oct 19, 2023 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alexey-bataev Oct 19, 2023 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

antoniofrighetto commented Oct 27, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

davemgreen left a comment

Choose a reason for hiding this comment

[AArch64][TTI] Improve `LegalVF` when computing gather-loads cost #69617

[AArch64][TTI] Improve `LegalVF` when computing gather-loads cost #69617

llvmbot commented Oct 19, 2023 •

edited

alexey-bataev Oct 19, 2023 •

edited