Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PowerPC] adjust cost for vector insert/extract with non const index #79092

Merged
merged 3 commits into from
Feb 19, 2024

Conversation

chenzheng1030
Copy link
Collaborator

P9 has vxform Vector Extract Element Instructions like vextuwrx and P10 has vxform Vector Insert Element instructions like vinsd. Update the instruction cost reflecting these instructions.

Fixes #50249

@llvmbot
Copy link
Collaborator

llvmbot commented Jan 23, 2024

@llvm/pr-subscribers-llvm-analysis

Author: Chen Zheng (chenzheng1030)

Changes

P9 has vxform Vector Extract Element Instructions like vextuwrx and P10 has vxform Vector Insert Element instructions like vinsd. Update the instruction cost reflecting these instructions.

Fixes #50249


Full diff: https://github.com/llvm/llvm-project/pull/79092.diff

2 Files Affected:

  • (modified) llvm/lib/Target/PowerPC/PPCTargetTransformInfo.cpp (+33-23)
  • (modified) llvm/test/Analysis/CostModel/PowerPC/insert_extract.ll (+18-18)
diff --git a/llvm/lib/Target/PowerPC/PPCTargetTransformInfo.cpp b/llvm/lib/Target/PowerPC/PPCTargetTransformInfo.cpp
index 062b53e24a0d79..4768d0a1ba693e 100644
--- a/llvm/lib/Target/PowerPC/PPCTargetTransformInfo.cpp
+++ b/llvm/lib/Target/PowerPC/PPCTargetTransformInfo.cpp
@@ -697,34 +697,44 @@ InstructionCost PPCTTIImpl::getVectorInstrCost(unsigned Opcode, Type *Val,
 
     return Cost;
 
-  } else if (Val->getScalarType()->isIntegerTy() && Index != -1U) {
+  } else if (Val->getScalarType()->isIntegerTy()) {
     unsigned EltSize = Val->getScalarSizeInBits();
     // Computing on 1 bit values requires extra mask or compare operations.
     unsigned MaskCost = VecMaskCost && EltSize == 1 ? 1 : 0;
     if (ST->hasP9Altivec()) {
-      if (ISD == ISD::INSERT_VECTOR_ELT)
-        // A move-to VSR and a permute/insert.  Assume vector operation cost
-        // for both (cost will be 2x on P9).
-        return 2 * CostFactor;
-
-      // It's an extract.  Maybe we can do a cheap move-from VSR.
-      unsigned EltSize = Val->getScalarSizeInBits();
-      if (EltSize == 64) {
-        unsigned MfvsrdIndex = ST->isLittleEndian() ? 1 : 0;
-        if (Index == MfvsrdIndex)
-          return 1;
-      } else if (EltSize == 32) {
-        unsigned MfvsrwzIndex = ST->isLittleEndian() ? 2 : 1;
-        if (Index == MfvsrwzIndex)
-          return 1;
-      }
-
-      // We need a vector extract (or mfvsrld).  Assume vector operation cost.
-      // The cost of the load constant for a vector extract is disregarded
-      // (invariant, easily schedulable).
-      return CostFactor + MaskCost;
+      // P10 has vxform insert which can handle non const index. The MaskCost is
+      // for masking the index.
+      // P9 has insert for const index. A move-to VSR and a permute/insert.
+      // Assume vector operation cost for both (cost will be 2x on P9).
+      if (ISD == ISD::INSERT_VECTOR_ELT) {
+        if (ST->isISA3_1())
+          return CostFactor + MaskCost;
+        else if (Index != -1U)
+          return 2 * CostFactor;
+      } else if (ISD == ISD::EXTRACT_VECTOR_ELT) {
+        // It's an extract.  Maybe we can do a cheap move-from VSR.
+        unsigned EltSize = Val->getScalarSizeInBits();
+        if (EltSize == 64) {
+          // FIXME: no need to worry about endian, P9 has both mfvsrd/mfvsrld.
+          unsigned MfvsrdIndex = ST->isLittleEndian() ? 1 : 0;
+          if (Index == MfvsrdIndex)
+            return 1;
+        } else if (EltSize == 32) {
+          unsigned MfvsrwzIndex = ST->isLittleEndian() ? 2 : 1;
+          if (Index == MfvsrwzIndex)
+            return 1;
+
+          // For other indexs like non const, P9 has vxform extract. The
+          // MaskCost is for masking the index.
+          return CostFactor + MaskCost;
+        }
 
-    } else if (ST->hasDirectMove()) {
+        // We need a vector extract (or mfvsrld).  Assume vector operation cost.
+        // The cost of the load constant for a vector extract is disregarded
+        // (invariant, easily schedulable).
+        return CostFactor + MaskCost;
+      }
+    } else if (ST->hasDirectMove() && Index != -1U) {
       // Assume permute has standard cost.
       // Assume move-to/move-from VSR have 2x standard cost.
       if (ISD == ISD::INSERT_VECTOR_ELT)
diff --git a/llvm/test/Analysis/CostModel/PowerPC/insert_extract.ll b/llvm/test/Analysis/CostModel/PowerPC/insert_extract.ll
index 607d15790b5f31..512c83f0182377 100644
--- a/llvm/test/Analysis/CostModel/PowerPC/insert_extract.ll
+++ b/llvm/test/Analysis/CostModel/PowerPC/insert_extract.ll
@@ -27,7 +27,7 @@ define i32 @insert(i32 %arg) {
 ; CHECK-P9LE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
 ;
 ; CHECK-P10-LABEL: 'insert'
-; CHECK-P10-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %x = insertelement <4 x i32> undef, i32 %arg, i32 0
+; CHECK-P10-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %x = insertelement <4 x i32> undef, i32 %arg, i32 0
 ; CHECK-P10-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
 ;
   %x = insertelement <4 x i32> undef, i32 %arg, i32 0
@@ -109,7 +109,7 @@ define void @test4xi32(<4 x i32> %v1, i32 %x1) {
 ; CHECK-P9LE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
 ;
 ; CHECK-P10-LABEL: 'test4xi32'
-; CHECK-P10-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v2 = insertelement <4 x i32> %v1, i32 %x1, i32 2
+; CHECK-P10-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %v2 = insertelement <4 x i32> %v1, i32 %x1, i32 2
 ; CHECK-P10-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
 ;
   %v2 = insertelement <4 x i32> %v1, i32 %x1, i32 2
@@ -239,7 +239,7 @@ define <2 x i64> @insert_i64_x(<2 x i64> %dest, i64 %arg, i32 %idx) {
 ; CHECK-P9LE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i64> %x
 ;
 ; CHECK-P10-LABEL: 'insert_i64_x'
-; CHECK-P10-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %x = insertelement <2 x i64> %dest, i64 %arg, i32 %idx
+; CHECK-P10-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %x = insertelement <2 x i64> %dest, i64 %arg, i32 %idx
 ; CHECK-P10-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i64> %x
 ;
   %x = insertelement <2 x i64> %dest, i64 %arg, i32 %idx
@@ -264,7 +264,7 @@ define <4 x i32> @insert_i32_x(<4 x i32> %dest, i32 %arg, i32 %idx) {
 ; CHECK-P9LE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i32> %x
 ;
 ; CHECK-P10-LABEL: 'insert_i32_x'
-; CHECK-P10-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %x = insertelement <4 x i32> %dest, i32 %arg, i32 %idx
+; CHECK-P10-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %x = insertelement <4 x i32> %dest, i32 %arg, i32 %idx
 ; CHECK-P10-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i32> %x
 ;
   %x = insertelement <4 x i32> %dest, i32 %arg, i32 %idx
@@ -289,7 +289,7 @@ define  <8 x i16> @insert_i16_x(<8 x i16> %dest, i16 %arg, i32 %idx) {
 ; CHECK-P9LE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <8 x i16> %x
 ;
 ; CHECK-P10-LABEL: 'insert_i16_x'
-; CHECK-P10-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %x = insertelement <8 x i16> %dest, i16 %arg, i32 %idx
+; CHECK-P10-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %x = insertelement <8 x i16> %dest, i16 %arg, i32 %idx
 ; CHECK-P10-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <8 x i16> %x
 ;
   %x = insertelement <8 x i16> %dest, i16 %arg, i32 %idx
@@ -314,7 +314,7 @@ define  <16 x i8> @insert_i8_x(<16 x i8> %dest, i8 %arg, i32 %idx) {
 ; CHECK-P9LE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <16 x i8> %x
 ;
 ; CHECK-P10-LABEL: 'insert_i8_x'
-; CHECK-P10-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %x = insertelement <16 x i8> %dest, i8 %arg, i32 %idx
+; CHECK-P10-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %x = insertelement <16 x i8> %dest, i8 %arg, i32 %idx
 ; CHECK-P10-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <16 x i8> %x
 ;
   %x = insertelement <16 x i8> %dest, i8 %arg, i32 %idx
@@ -331,15 +331,15 @@ define i64 @extract_i64_x(<2 x i64> %arg, i32 %idx) {
 ; CHECK-P8LE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i64 %x
 ;
 ; CHECK-P9BE-LABEL: 'extract_i64_x'
-; CHECK-P9BE-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %x = extractelement <2 x i64> %arg, i32 %idx
+; CHECK-P9BE-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %x = extractelement <2 x i64> %arg, i32 %idx
 ; CHECK-P9BE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i64 %x
 ;
 ; CHECK-P9LE-LABEL: 'extract_i64_x'
-; CHECK-P9LE-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %x = extractelement <2 x i64> %arg, i32 %idx
+; CHECK-P9LE-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %x = extractelement <2 x i64> %arg, i32 %idx
 ; CHECK-P9LE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i64 %x
 ;
 ; CHECK-P10-LABEL: 'extract_i64_x'
-; CHECK-P10-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %x = extractelement <2 x i64> %arg, i32 %idx
+; CHECK-P10-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %x = extractelement <2 x i64> %arg, i32 %idx
 ; CHECK-P10-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i64 %x
 ;
   %x = extractelement <2 x i64> %arg, i32 %idx
@@ -356,15 +356,15 @@ define i32 @extract_i32_x(<4 x i32> %arg, i32 %idx) {
 ; CHECK-P8LE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 %x
 ;
 ; CHECK-P9BE-LABEL: 'extract_i32_x'
-; CHECK-P9BE-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %x = extractelement <4 x i32> %arg, i32 %idx
+; CHECK-P9BE-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %x = extractelement <4 x i32> %arg, i32 %idx
 ; CHECK-P9BE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 %x
 ;
 ; CHECK-P9LE-LABEL: 'extract_i32_x'
-; CHECK-P9LE-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %x = extractelement <4 x i32> %arg, i32 %idx
+; CHECK-P9LE-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %x = extractelement <4 x i32> %arg, i32 %idx
 ; CHECK-P9LE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 %x
 ;
 ; CHECK-P10-LABEL: 'extract_i32_x'
-; CHECK-P10-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %x = extractelement <4 x i32> %arg, i32 %idx
+; CHECK-P10-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %x = extractelement <4 x i32> %arg, i32 %idx
 ; CHECK-P10-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 %x
 ;
   %x = extractelement <4 x i32> %arg, i32 %idx
@@ -381,15 +381,15 @@ define i16 @extract_i16_x(<8 x i16> %arg, i32 %idx) {
 ; CHECK-P8LE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i16 %x
 ;
 ; CHECK-P9BE-LABEL: 'extract_i16_x'
-; CHECK-P9BE-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %x = extractelement <8 x i16> %arg, i32 %idx
+; CHECK-P9BE-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %x = extractelement <8 x i16> %arg, i32 %idx
 ; CHECK-P9BE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i16 %x
 ;
 ; CHECK-P9LE-LABEL: 'extract_i16_x'
-; CHECK-P9LE-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %x = extractelement <8 x i16> %arg, i32 %idx
+; CHECK-P9LE-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %x = extractelement <8 x i16> %arg, i32 %idx
 ; CHECK-P9LE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i16 %x
 ;
 ; CHECK-P10-LABEL: 'extract_i16_x'
-; CHECK-P10-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %x = extractelement <8 x i16> %arg, i32 %idx
+; CHECK-P10-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %x = extractelement <8 x i16> %arg, i32 %idx
 ; CHECK-P10-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i16 %x
 ;
   %x = extractelement <8 x i16> %arg, i32 %idx
@@ -406,15 +406,15 @@ define i8 @extract_i8_x(<16 x i8> %arg, i32 %idx) {
 ; CHECK-P8LE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i8 %x
 ;
 ; CHECK-P9BE-LABEL: 'extract_i8_x'
-; CHECK-P9BE-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %x = extractelement <16 x i8> %arg, i32 %idx
+; CHECK-P9BE-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %x = extractelement <16 x i8> %arg, i32 %idx
 ; CHECK-P9BE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i8 %x
 ;
 ; CHECK-P9LE-LABEL: 'extract_i8_x'
-; CHECK-P9LE-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %x = extractelement <16 x i8> %arg, i32 %idx
+; CHECK-P9LE-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %x = extractelement <16 x i8> %arg, i32 %idx
 ; CHECK-P9LE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i8 %x
 ;
 ; CHECK-P10-LABEL: 'extract_i8_x'
-; CHECK-P10-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %x = extractelement <16 x i8> %arg, i32 %idx
+; CHECK-P10-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %x = extractelement <16 x i8> %arg, i32 %idx
 ; CHECK-P10-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i8 %x
 ;
   %x = extractelement <16 x i8> %arg, i32 %idx

// Assume vector operation cost for both (cost will be 2x on P9).
if (ISD == ISD::INSERT_VECTOR_ELT) {
if (ST->isISA3_1())
return CostFactor + MaskCost;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MaskCost is dependent on element type - if the issue is the index it will be applied inconsistently

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for review @RolandF77 .

For the "legal" types for insertelment, like i8/i16/i32/i64, on P10, see https://godbolt.org/z/jcor7KEWr , all of them can be represented with a vector instruction + some arithmetic instructions.

Are there other types I should test?

The issue #50249 is not for P10, it is for P9 instead.

Before the change, on P9, extractelement with non-const index will have the cost calculated at line 750. That cost is based on loading the required element from the memory where the bigger data is stored. This would inaccurately increase the cost for such extractelement.

Copy link
Collaborator

@RolandF77 RolandF77 Jan 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The basic idea of the change makes sense. The issue is specifically the use of MaskCost to account for indexing overhead. MaskCost is 0 unless the element type to the vector is i1, which does not look expected here. It was added to deal with the overhead of sub-byte values extracted from i1 vectors. I think you should probably use a different computation which is 0 unless the index is -1.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. Sorry, I misunderstood your previous comment. Patch updated.

// The cost of the load constant for a vector extract is disregarded
// (invariant, easily schedulable).
return CostFactor + MaskCost;
// P10 has vxform insert which can handle non const index. The MaskCost is
Copy link
Collaborator

@RolandF77 RolandF77 Jan 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: MaskCost => MaskCostForIdx

Copy link
Collaborator

@RolandF77 RolandF77 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@chenzheng1030 chenzheng1030 merged commit 80f3bb4 into llvm:main Feb 19, 2024
3 of 4 checks passed
@chenzheng1030 chenzheng1030 deleted the issue50249 branch February 19, 2024 01:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

slow code for absolute value of int8 x 16 vector on POWER9 at -O3
3 participants