-
Notifications
You must be signed in to change notification settings - Fork 10.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[X86] Disable the vpdpwssd -> vpmaddwd + vpaddd
combiner pattern on AMD Zen4
#84347
Conversation
@llvm/pr-subscribers-backend-x86 Author: Benoit Jacob (bjacob) ChangesThis tweaks the existing This pattern was introduced in 8f7f9d8 , and I have expressed in #84182 (comment) some generic concerns about it, which are not even AMD-specific: my actual first choice would be to simply revert that commit. This PR only exists on the assumption that we don't want to do that, that we want to keep the current behavior outside of the AMD Zen4 target where Issue #84182 shows it to be a clear regression. To be clear though, what I am saying in #84182 (comment) is that it also is a regression on non-AMD CPUs, on different kinds of test cases, one being the simple one-intrinsic testcase add in this PR (having only one intrinsic it clearly falls out of the scope of the rationale for 8f7f9d8, so the fact that that commit did affect it seems unintentional) and another being real world optimized SIMD code that, unlike the code motivating this in the commit message, would use enough registers to not be gated on instruction latency. This PR also fixes what seems like a bug(?) in 8f7f9d8 : in llvm-project/llvm/lib/Target/X86/X86InstrInfo.cpp Lines 10584 to 10586 in 3714f93
the return true; was not part of the if branch, so the function would return there even if the if was not taken, preventing the TargetInstrInfo::getMachineCombinerPatterns from being reached. This seems unintentional and seems to be a departure from all the other getMachineCombinerPatterns functions that I can find, where either the target-specific case or the fallback TargetInstrInfo::getMachineCombinerPatterns is taken.
Full diff: https://github.com/llvm/llvm-project/pull/84347.diff 2 Files Affected:
diff --git a/llvm/lib/Target/X86/X86InstrInfo.cpp b/llvm/lib/Target/X86/X86InstrInfo.cpp
index 3f0557e651f89b..1834d72a62cf63 100644
--- a/llvm/lib/Target/X86/X86InstrInfo.cpp
+++ b/llvm/lib/Target/X86/X86InstrInfo.cpp
@@ -10563,17 +10563,20 @@ void X86InstrInfo::buildClearRegister(Register Reg, MachineBasicBlock &MBB,
bool X86InstrInfo::getMachineCombinerPatterns(
MachineInstr &Root, SmallVectorImpl<MachineCombinerPattern> &Patterns,
bool DoRegPressureReduce) const {
+ bool EnableVPDPWSSTPatterns = !Subtarget.getCPU().starts_with("znver");
unsigned Opc = Root.getOpcode();
switch (Opc) {
default:
- return TargetInstrInfo::getMachineCombinerPatterns(Root, Patterns,
- DoRegPressureReduce);
+ break;
case X86::VPDPWSSDrr:
case X86::VPDPWSSDrm:
case X86::VPDPWSSDYrr:
case X86::VPDPWSSDYrm: {
- Patterns.push_back(MachineCombinerPattern::DPWSSD);
- return true;
+ if (EnableVPDPWSSTPatterns) {
+ Patterns.push_back(MachineCombinerPattern::DPWSSD);
+ return true;
+ }
+ break;
}
case X86::VPDPWSSDZ128r:
case X86::VPDPWSSDZ128m:
@@ -10581,11 +10584,15 @@ bool X86InstrInfo::getMachineCombinerPatterns(
case X86::VPDPWSSDZ256m:
case X86::VPDPWSSDZr:
case X86::VPDPWSSDZm: {
- if (Subtarget.hasBWI())
+ if (EnableVPDPWSSTPatterns && Subtarget.hasBWI()) {
Patterns.push_back(MachineCombinerPattern::DPWSSD);
- return true;
+ return true;
+ }
+ break;
}
}
+ return TargetInstrInfo::getMachineCombinerPatterns(Root, Patterns,
+ DoRegPressureReduce);
}
static void
diff --git a/llvm/test/CodeGen/X86/znver4-vpdpwssd.ll b/llvm/test/CodeGen/X86/znver4-vpdpwssd.ll
new file mode 100644
index 00000000000000..2958c73835e433
--- /dev/null
+++ b/llvm/test/CodeGen/X86/znver4-vpdpwssd.ll
@@ -0,0 +1,15 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=znver4 | FileCheck %s
+
+define <16 x i32> @vpdpwssd_test(<16 x i32> %0, <16 x i32> %1, <16 x i32> %2) {
+; CHECK-LABEL: vpdpwssd_test:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vpdpwssd %zmm2, %zmm1, %zmm0
+; CHECK-NEXT: retq
+ %4 = tail call <16 x i32> @llvm.x86.avx512.vpdpwssd.512(<16 x i32> %0, <16 x i32> %1, <16 x i32> %2)
+ ret <16 x i32> %4
+}
+
+declare <16 x i32> @llvm.x86.avx512.vpdpwssd.512(<16 x i32>, <16 x i32>, <16 x i32>) #1
+
+attributes #1 = { mustprogress nocallback nofree nosync nounwind willreturn memory(none) }
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm hoping we can address this for the general case in #84182 - but in the meantime, we shouldn't be using CPU name matching - its better to create a "TuningFastVNNI" flag in X86.td and add it to the ZN4Tuning flags and use that to control the MC pattern.
But hopefully this patch won't be necessary..........
I see. I'm curious what "address this for the general case in #84182" would look like. Certainly if this actually comes down to a znver4 scheduling model inaccuracy, it would be great to simply fix that, but at the moment I can't find an inaccuracy there that would be relevant to this particular issue. I've just connected with @ganeshgit on this so I might just be able to bow out of this and let the experts weigh in :-D Closing this PR for now. |
This tweaks the existing
vpdpwssd -> vpmaddwd + vpaddd
machine combiner pattern to not kick in on AMD Zen4, fixing Issue #84182.This pattern was introduced in 8f7f9d8 , and I have expressed in #84182 (comment) some generic concerns about it, which are not even AMD-specific: my actual first choice would be to simply revert that commit. This PR only exists on the assumption that we don't want to do that, that we want to keep the current behavior outside of the AMD Zen4 target where Issue #84182 shows it to be a clear regression. To be clear though, what I am saying in #84182 (comment) is that it also is potentially (depending on scheduling info) a regression on non-AMD CPUs, on different kinds of test cases, one being the simple one-intrinsic testcase added in this PR (having only one intrinsic it clearly falls out of the scope of the rationale for 8f7f9d8, so the fact that that commit did affect it seems unintentional) and another being real world optimized SIMD code that, unlike the code motivating this in the commit message, would use enough registers to not be gated on instruction latency.
This PR also fixes what seems like a bug(?) in 8f7f9d8 : in
llvm-project/llvm/lib/Target/X86/X86InstrInfo.cpp
Lines 10584 to 10586 in 3714f93
the
return true;
was not part of theif
branch, so the function would return there even if theif
was not taken, preventing theTargetInstrInfo::getMachineCombinerPatterns
from being reached. This seems unintentional and seems to be a departure from all the othergetMachineCombinerPatterns
functions that I can find, where either the target-specific case or the fallbackTargetInstrInfo::getMachineCombinerPatterns
is taken.