Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[X86] Disable the vpdpwssd -> vpmaddwd + vpaddd combiner pattern on AMD Zen4 #84347

Closed
wants to merge 1 commit into from

Conversation

bjacob
Copy link
Contributor

@bjacob bjacob commented Mar 7, 2024

This tweaks the existing vpdpwssd -> vpmaddwd + vpaddd machine combiner pattern to not kick in on AMD Zen4, fixing Issue #84182.

This pattern was introduced in 8f7f9d8 , and I have expressed in #84182 (comment) some generic concerns about it, which are not even AMD-specific: my actual first choice would be to simply revert that commit. This PR only exists on the assumption that we don't want to do that, that we want to keep the current behavior outside of the AMD Zen4 target where Issue #84182 shows it to be a clear regression. To be clear though, what I am saying in #84182 (comment) is that it also is potentially (depending on scheduling info) a regression on non-AMD CPUs, on different kinds of test cases, one being the simple one-intrinsic testcase added in this PR (having only one intrinsic it clearly falls out of the scope of the rationale for 8f7f9d8, so the fact that that commit did affect it seems unintentional) and another being real world optimized SIMD code that, unlike the code motivating this in the commit message, would use enough registers to not be gated on instruction latency.

This PR also fixes what seems like a bug(?) in 8f7f9d8 : in

if (Subtarget.hasBWI())
Patterns.push_back(MachineCombinerPattern::DPWSSD);
return true;

the return true; was not part of the if branch, so the function would return there even if the if was not taken, preventing the TargetInstrInfo::getMachineCombinerPatterns from being reached. This seems unintentional and seems to be a departure from all the other getMachineCombinerPatterns functions that I can find, where either the target-specific case or the fallback TargetInstrInfo::getMachineCombinerPatterns is taken.

@llvmbot
Copy link
Collaborator

llvmbot commented Mar 7, 2024

@llvm/pr-subscribers-backend-x86

Author: Benoit Jacob (bjacob)

Changes

This tweaks the existing vpdpwssd -> vpmaddwd + vpaddd machine combiner pattern to not kick in on AMD Zen4, fixing Issue #84182.

This pattern was introduced in 8f7f9d8 , and I have expressed in #84182 (comment) some generic concerns about it, which are not even AMD-specific: my actual first choice would be to simply revert that commit. This PR only exists on the assumption that we don't want to do that, that we want to keep the current behavior outside of the AMD Zen4 target where Issue #84182 shows it to be a clear regression. To be clear though, what I am saying in #84182 (comment) is that it also is a regression on non-AMD CPUs, on different kinds of test cases, one being the simple one-intrinsic testcase add in this PR (having only one intrinsic it clearly falls out of the scope of the rationale for 8f7f9d8, so the fact that that commit did affect it seems unintentional) and another being real world optimized SIMD code that, unlike the code motivating this in the commit message, would use enough registers to not be gated on instruction latency.

This PR also fixes what seems like a bug(?) in 8f7f9d8 : in

if (Subtarget.hasBWI())
Patterns.push_back(MachineCombinerPattern::DPWSSD);
return true;

the return true; was not part of the if branch, so the function would return there even if the if was not taken, preventing the TargetInstrInfo::getMachineCombinerPatterns from being reached. This seems unintentional and seems to be a departure from all the other getMachineCombinerPatterns functions that I can find, where either the target-specific case or the fallback TargetInstrInfo::getMachineCombinerPatterns is taken.


Full diff: https://github.com/llvm/llvm-project/pull/84347.diff

2 Files Affected:

  • (modified) llvm/lib/Target/X86/X86InstrInfo.cpp (+13-6)
  • (added) llvm/test/CodeGen/X86/znver4-vpdpwssd.ll (+15)
diff --git a/llvm/lib/Target/X86/X86InstrInfo.cpp b/llvm/lib/Target/X86/X86InstrInfo.cpp
index 3f0557e651f89b..1834d72a62cf63 100644
--- a/llvm/lib/Target/X86/X86InstrInfo.cpp
+++ b/llvm/lib/Target/X86/X86InstrInfo.cpp
@@ -10563,17 +10563,20 @@ void X86InstrInfo::buildClearRegister(Register Reg, MachineBasicBlock &MBB,
 bool X86InstrInfo::getMachineCombinerPatterns(
     MachineInstr &Root, SmallVectorImpl<MachineCombinerPattern> &Patterns,
     bool DoRegPressureReduce) const {
+  bool EnableVPDPWSSTPatterns = !Subtarget.getCPU().starts_with("znver");
   unsigned Opc = Root.getOpcode();
   switch (Opc) {
   default:
-    return TargetInstrInfo::getMachineCombinerPatterns(Root, Patterns,
-                                                       DoRegPressureReduce);
+    break;
   case X86::VPDPWSSDrr:
   case X86::VPDPWSSDrm:
   case X86::VPDPWSSDYrr:
   case X86::VPDPWSSDYrm: {
-    Patterns.push_back(MachineCombinerPattern::DPWSSD);
-    return true;
+    if (EnableVPDPWSSTPatterns) {
+      Patterns.push_back(MachineCombinerPattern::DPWSSD);
+      return true;
+    }
+    break;
   }
   case X86::VPDPWSSDZ128r:
   case X86::VPDPWSSDZ128m:
@@ -10581,11 +10584,15 @@ bool X86InstrInfo::getMachineCombinerPatterns(
   case X86::VPDPWSSDZ256m:
   case X86::VPDPWSSDZr:
   case X86::VPDPWSSDZm: {
-    if (Subtarget.hasBWI())
+    if (EnableVPDPWSSTPatterns && Subtarget.hasBWI()) {
       Patterns.push_back(MachineCombinerPattern::DPWSSD);
-    return true;
+      return true;
+    }
+    break;
   }
   }
+  return TargetInstrInfo::getMachineCombinerPatterns(Root, Patterns,
+                                                     DoRegPressureReduce);
 }
 
 static void
diff --git a/llvm/test/CodeGen/X86/znver4-vpdpwssd.ll b/llvm/test/CodeGen/X86/znver4-vpdpwssd.ll
new file mode 100644
index 00000000000000..2958c73835e433
--- /dev/null
+++ b/llvm/test/CodeGen/X86/znver4-vpdpwssd.ll
@@ -0,0 +1,15 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=znver4 | FileCheck %s
+
+define <16 x i32> @vpdpwssd_test(<16 x i32> %0, <16 x i32> %1, <16 x i32> %2) {
+; CHECK-LABEL: vpdpwssd_test:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vpdpwssd %zmm2, %zmm1, %zmm0
+; CHECK-NEXT:    retq
+  %4 = tail call <16 x i32> @llvm.x86.avx512.vpdpwssd.512(<16 x i32> %0, <16 x i32> %1, <16 x i32> %2)
+  ret <16 x i32> %4
+}
+
+declare <16 x i32> @llvm.x86.avx512.vpdpwssd.512(<16 x i32>, <16 x i32>, <16 x i32>) #1
+
+attributes #1 = { mustprogress nocallback nofree nosync nounwind willreturn memory(none) }

Copy link
Collaborator

@RKSimon RKSimon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm hoping we can address this for the general case in #84182 - but in the meantime, we shouldn't be using CPU name matching - its better to create a "TuningFastVNNI" flag in X86.td and add it to the ZN4Tuning flags and use that to control the MC pattern.

But hopefully this patch won't be necessary..........

@bjacob
Copy link
Contributor Author

bjacob commented Mar 7, 2024

I see. I'm curious what "address this for the general case in #84182" would look like. Certainly if this actually comes down to a znver4 scheduling model inaccuracy, it would be great to simply fix that, but at the moment I can't find an inaccuracy there that would be relevant to this particular issue.

I've just connected with @ganeshgit on this so I might just be able to bow out of this and let the experts weigh in :-D Closing this PR for now.

@bjacob bjacob closed this Mar 7, 2024
@RKSimon RKSimon reopened this Mar 10, 2024
@RKSimon
Copy link
Collaborator

RKSimon commented Mar 10, 2024

@bjacob Reopening this - please can you replace the "EnableVPDPWSSTPatterns" with a Tuning flag as suggested on #84182

@ganeshgit
Copy link
Contributor

@bjacob Reopening this - please can you replace the "EnableVPDPWSSTPatterns" with a Tuning flag as suggested on #84182
It's on me. I will submit a patch.

@RKSimon RKSimon closed this Mar 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants