[X86] Disable the `vpdpwssd -> vpmaddwd + vpaddd` combiner pattern on AMD Zen4 #84347

bjacob · 2024-03-07T17:29:50Z

This tweaks the existing vpdpwssd -> vpmaddwd + vpaddd machine combiner pattern to not kick in on AMD Zen4, fixing Issue #84182.
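For readers unfamiliar with the two instruction sequences, here is a minimal scalar sketch of a single 32-bit lane, showing why the combiner may treat `vpdpwssd` and `vpmaddwd + vpaddd` as interchangeable in value. The helper names (`pack16`, `pmaddwd_lane`, `paddd_lane`, `dpwssd_lane`, `expanded_lane`) are made up for illustration and are not LLVM or intrinsic names. The sketch assumes the non-saturating, wrap-around behavior of these instructions and says nothing about latency or throughput, which is what this PR is actually about.

```cpp
#include <cstdint>

// Pack two signed 16-bit values into one 32-bit lane (illustrative helper).
uint32_t pack16(int16_t lo, int16_t hi) {
  return (static_cast<uint32_t>(static_cast<uint16_t>(hi)) << 16) |
         static_cast<uint32_t>(static_cast<uint16_t>(lo));
}

// Signed 16-bit halves of a 32-bit lane, widened to 32 bits.
int32_t lo16(uint32_t v) {
  return static_cast<int16_t>(static_cast<uint16_t>(v & 0xFFFFu));
}
int32_t hi16(uint32_t v) {
  return static_cast<int16_t>(static_cast<uint16_t>(v >> 16));
}

// One lane of vpmaddwd: a.lo*b.lo + a.hi*b.hi, wrapping modulo 2^32.
uint32_t pmaddwd_lane(uint32_t a, uint32_t b) {
  uint32_t p0 = static_cast<uint32_t>(lo16(a) * lo16(b));
  uint32_t p1 = static_cast<uint32_t>(hi16(a) * hi16(b));
  return p0 + p1;
}

// One lane of vpaddd: plain 32-bit wrapping add.
uint32_t paddd_lane(uint32_t x, uint32_t y) { return x + y; }

// One lane of vpdpwssd: accumulate both products in a single instruction.
uint32_t dpwssd_lane(uint32_t acc, uint32_t a, uint32_t b) {
  return acc + static_cast<uint32_t>(lo16(a) * lo16(b)) +
         static_cast<uint32_t>(hi16(a) * hi16(b));
}

// The two-instruction expansion produced by the combiner pattern.
uint32_t expanded_lane(uint32_t acc, uint32_t a, uint32_t b) {
  return paddd_lane(acc, pmaddwd_lane(a, b));
}
```

Both forms give the same lane value for every input, so the choice between them is purely a performance question that depends on the target's scheduling characteristics.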

This pattern was introduced in 8f7f9d8, and I have expressed in #84182 (comment) some generic concerns about it that are not AMD-specific; my first choice would be to simply revert that commit. This PR exists only on the assumption that we don't want to do that, and that we want to keep the current behavior everywhere except the AMD Zen4 target, where Issue #84182 shows it to be a clear regression. To be clear, what I am saying in #84182 (comment) is that the pattern is also potentially (depending on scheduling info) a regression on non-AMD CPUs, on other kinds of test cases. One is the simple one-intrinsic testcase added in this PR: with only one intrinsic, it clearly falls outside the rationale for 8f7f9d8, so the fact that that commit affected it seems unintentional. Another is real-world optimized SIMD code that, unlike the code motivating 8f7f9d8 in its commit message, uses enough registers to not be gated on instruction latency.

This PR also fixes what seems like a bug(?) in 8f7f9d8: in

llvm-project/llvm/lib/Target/X86/X86InstrInfo.cpp

Lines 10584 to 10586 in 3714f93

    if (Subtarget.hasBWI())
      Patterns.push_back(MachineCombinerPattern::DPWSSD);
    return true;

the return true; was not part of the if branch, so the function returned there even when the if was not taken, which prevented the fallback TargetInstrInfo::getMachineCombinerPatterns from being reached. This seems unintentional, and it departs from every other getMachineCombinerPatterns function that I can find, where either the target-specific case or the fallback TargetInstrInfo::getMachineCombinerPatterns is taken.
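The control-flow issue described above can be reproduced in a small standalone model. This is only a sketch: `Pattern`, `genericPatterns`, `beforeFix`, and `afterFix` are hypothetical stand-ins rather than LLVM code, and the stub fallback returning `false` (meaning "no patterns found") is an assumption made for illustration.

```cpp
#include <vector>

// Hypothetical stand-in for MachineCombinerPattern.
enum class Pattern { DPWSSD };

// Hypothetical stand-in for the target-independent fallback.
// Assumption: it finds nothing and reports that by returning false.
bool genericPatterns(std::vector<Pattern> &Patterns) {
  (void)Patterns;
  return false;
}

// Shape of the code before this PR: the return is NOT guarded by the if,
// so the function reports success even when no pattern was pushed.
bool beforeFix(bool HasBWI, std::vector<Pattern> &Patterns) {
  if (HasBWI)
    Patterns.push_back(Pattern::DPWSSD);
  return true; // reached on both paths
}

// Shape of the code after this PR: return only when a pattern was added,
// otherwise fall through to the generic handling.
bool afterFix(bool HasBWI, std::vector<Pattern> &Patterns) {
  if (HasBWI) {
    Patterns.push_back(Pattern::DPWSSD);
    return true;
  }
  return genericPatterns(Patterns);
}
```

With `HasBWI == false`, `beforeFix` returns `true` while leaving `Patterns` empty, whereas `afterFix` defers to the fallback.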

llvmbot · 2024-03-07T17:30:11Z

@llvm/pr-subscribers-backend-x86

Author: Benoit Jacob (bjacob)

Changes

This tweaks the existing vpdpwssd -> vpmaddwd + vpaddd machine combiner pattern to not kick in on AMD Zen4, fixing Issue #84182.

This pattern was introduced in 8f7f9d8, and I have expressed in #84182 (comment) some generic concerns about it that are not AMD-specific; my first choice would be to simply revert that commit. This PR exists only on the assumption that we don't want to do that, and that we want to keep the current behavior everywhere except the AMD Zen4 target, where Issue #84182 shows it to be a clear regression. To be clear, what I am saying in #84182 (comment) is that the pattern is also a regression on non-AMD CPUs, on other kinds of test cases. One is the simple one-intrinsic testcase added in this PR: with only one intrinsic, it clearly falls outside the rationale for 8f7f9d8, so the fact that that commit affected it seems unintentional. Another is real-world optimized SIMD code that, unlike the code motivating 8f7f9d8 in its commit message, uses enough registers to not be gated on instruction latency.

This PR also fixes what seems like a bug(?) in 8f7f9d8: in

llvm-project/llvm/lib/Target/X86/X86InstrInfo.cpp

Lines 10584 to 10586 in 3714f93

    if (Subtarget.hasBWI())
      Patterns.push_back(MachineCombinerPattern::DPWSSD);
    return true;

the return true; was not part of the if branch, so the function returned there even when the if was not taken, which prevented the fallback TargetInstrInfo::getMachineCombinerPatterns from being reached. This seems unintentional, and it departs from every other getMachineCombinerPatterns function that I can find, where either the target-specific case or the fallback TargetInstrInfo::getMachineCombinerPatterns is taken.

Full diff: https://github.com/llvm/llvm-project/pull/84347.diff

2 Files Affected:

(modified) llvm/lib/Target/X86/X86InstrInfo.cpp (+13-6)
(added) llvm/test/CodeGen/X86/znver4-vpdpwssd.ll (+15)

diff --git a/llvm/lib/Target/X86/X86InstrInfo.cpp b/llvm/lib/Target/X86/X86InstrInfo.cpp
index 3f0557e651f89b..1834d72a62cf63 100644
--- a/llvm/lib/Target/X86/X86InstrInfo.cpp
+++ b/llvm/lib/Target/X86/X86InstrInfo.cpp
@@ -10563,17 +10563,20 @@ void X86InstrInfo::buildClearRegister(Register Reg, MachineBasicBlock &MBB,
 bool X86InstrInfo::getMachineCombinerPatterns(
     MachineInstr &Root, SmallVectorImpl<MachineCombinerPattern> &Patterns,
     bool DoRegPressureReduce) const {
+  bool EnableVPDPWSSTPatterns = !Subtarget.getCPU().starts_with("znver");
   unsigned Opc = Root.getOpcode();
   switch (Opc) {
   default:
-    return TargetInstrInfo::getMachineCombinerPatterns(Root, Patterns,
-                                                       DoRegPressureReduce);
+    break;
   case X86::VPDPWSSDrr:
   case X86::VPDPWSSDrm:
   case X86::VPDPWSSDYrr:
   case X86::VPDPWSSDYrm: {
-    Patterns.push_back(MachineCombinerPattern::DPWSSD);
-    return true;
+    if (EnableVPDPWSSTPatterns) {
+      Patterns.push_back(MachineCombinerPattern::DPWSSD);
+      return true;
+    }
+    break;
   }
   case X86::VPDPWSSDZ128r:
   case X86::VPDPWSSDZ128m:
@@ -10581,11 +10584,15 @@ bool X86InstrInfo::getMachineCombinerPatterns(
   case X86::VPDPWSSDZ256m:
   case X86::VPDPWSSDZr:
   case X86::VPDPWSSDZm: {
-    if (Subtarget.hasBWI())
+    if (EnableVPDPWSSTPatterns && Subtarget.hasBWI()) {
       Patterns.push_back(MachineCombinerPattern::DPWSSD);
-    return true;
+      return true;
+    }
+    break;
   }
   }
+  return TargetInstrInfo::getMachineCombinerPatterns(Root, Patterns,
+                                                     DoRegPressureReduce);
 }
 
 static void
diff --git a/llvm/test/CodeGen/X86/znver4-vpdpwssd.ll b/llvm/test/CodeGen/X86/znver4-vpdpwssd.ll
new file mode 100644
index 00000000000000..2958c73835e433
--- /dev/null
+++ b/llvm/test/CodeGen/X86/znver4-vpdpwssd.ll
@@ -0,0 +1,15 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=znver4 | FileCheck %s
+
+define <16 x i32> @vpdpwssd_test(<16 x i32> %0, <16 x i32> %1, <16 x i32> %2) {
+; CHECK-LABEL: vpdpwssd_test:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vpdpwssd %zmm2, %zmm1, %zmm0
+; CHECK-NEXT:    retq
+  %4 = tail call <16 x i32> @llvm.x86.avx512.vpdpwssd.512(<16 x i32> %0, <16 x i32> %1, <16 x i32> %2)
+  ret <16 x i32> %4
+}
+
+declare <16 x i32> @llvm.x86.avx512.vpdpwssd.512(<16 x i32>, <16 x i32>, <16 x i32>) #1
+
+attributes #1 = { mustprogress nocallback nofree nosync nounwind willreturn memory(none) }

RKSimon

I'm hoping we can address this for the general case in #84182, but in the meantime we shouldn't be using CPU name matching. It's better to create a "TuningFastVNNI" flag in X86.td, add it to the ZN4Tuning flags, and use that to control the MC pattern.

But hopefully this patch won't be necessary..........
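To make the difference between the two gating approaches concrete, here is a rough standalone sketch. `MockSubtarget` and both functions are hypothetical and are not the LLVM `X86Subtarget` API; the field `TuningFastVNNI` merely borrows the flag name proposed in the review, and how such a flag would be declared in X86.td and exposed to C++ is not shown.

```cpp
#include <string>

// Hypothetical, simplified stand-in for a subtarget description.
struct MockSubtarget {
  std::string CPU;      // value passed via -mcpu
  bool TuningFastVNNI;  // assumed per-CPU tuning bit (name from the review)
};

// Approach in this PR's diff: enable the pattern unless the CPU name
// begins with "znver". This matches every "znver*" name, not only znver4.
bool enableByCpuName(const MockSubtarget &ST) {
  return ST.CPU.compare(0, 5, "znver") != 0;
}

// Approach suggested in review: enable the pattern unless the CPU's
// tuning flags say that VNNI is fast.
bool enableByTuningFlag(const MockSubtarget &ST) {
  return !ST.TuningFastVNNI;
}
```

Under this model the flag-based check tracks the tuning list for each CPU, so the decision does not depend on how a CPU happens to be named.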

bjacob · 2024-03-07T18:19:25Z

I see. I'm curious what "address this for the general case in #84182" would look like. Certainly if this actually comes down to a znver4 scheduling model inaccuracy, it would be great to simply fix that, but at the moment I can't find an inaccuracy there that would be relevant to this particular issue.

I've just connected with @ganeshgit on this so I might just be able to bow out of this and let the experts weigh in :-D Closing this PR for now.

RKSimon · 2024-03-10T22:36:21Z

@bjacob Reopening this - please can you replace the "EnableVPDPWSSTPatterns" with a Tuning flag as suggested on #84182

ganeshgit · 2024-03-11T03:15:18Z

> @bjacob Reopening this - please can you replace the "EnableVPDPWSSTPatterns" with a Tuning flag as suggested on #84182

It's on me. I will submit a patch.

vpdpwssd

7121747

bjacob added backend:X86 performance labels Mar 7, 2024

bjacob requested a review from RKSimon March 7, 2024 17:29

bjacob mentioned this pull request Mar 7, 2024

vpdpwssd instruction not generated despite giving better performance than vpmaddwd+vpaddd expansion #84182

Closed

bjacob marked this pull request as ready for review March 7, 2024 17:38

RKSimon reviewed Mar 7, 2024

View reviewed changes

bjacob closed this Mar 7, 2024

RKSimon reopened this Mar 10, 2024

RKSimon closed this Mar 11, 2024
