
[X86] Don't always separate conditions in (br (and/or cond0, cond1)) into separate branches #81689

Closed
wants to merge 1 commit

Conversation

goldsteinn (Contributor) commented Feb 13, 2024

Changes in Recommit:
1) Fix non-determinism by using SmallMapVector instead of
SmallPtrSet (see the sketch below).
2) Fix a bug in dependency pruning where we discounted the actual
and/or instruction combining the two conditions. This led to over-pruning.
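
As a minimal sketch of the determinism point in item 1 (the walk helpers here are invented for illustration): SmallPtrSet iterates in an order derived from the pointer values themselves, so a cost walk over it can visit dependencies in a different order on otherwise identical runs, while SmallMapVector iterates in insertion order.

  #include "llvm/ADT/MapVector.h"
  #include "llvm/ADT/SmallPtrSet.h"
  #include "llvm/IR/Instruction.h"
  using namespace llvm;

  // Pointer-keyed set: visit order depends on where the Instructions happen
  // to be allocated, which varies from run to run.
  static void walkSet(const SmallPtrSet<const Instruction *, 8> &Deps) {
    for (const Instruction *I : Deps)
      (void)I; // nondeterministic order
  }

  // MapVector keeps a parallel vector of entries, so iteration follows
  // insertion order and the pruning/cost walk becomes deterministic.
  static void walkMap(const SmallMapVector<const Instruction *, bool, 8> &Deps) {
    for (const auto &Entry : Deps)
      (void)Entry.first; // stable insertion order
  }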

It makes sense to split if the cost of computing cond1 is high
(proportionally to how likely cond0 is), but it doesn't really make
sense to introduce a second branch if it's only a few instructions.

Splitting can also get in the way of potentially foldable patterns.

This patch introduces logic to check whether the cost of computing
cond1 is relatively low and, if so, to keep the branches together
instead of splitting them.
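
As a hypothetical illustration of the tradeoff (the assembly in the comments is a hand-written sketch of the two strategies, not actual output of this patch):

  // Both sub-conditions are trivially cheap, so one branch is preferable.
  bool bothEqual(int A, int B, int C, int D) {
    // IR shape: br (and (icmp eq A, B), (icmp eq C, D))
    //
    // Split lowering -- two conditional branches, two chances to mispredict:
    //   cmpl %esi, %edi
    //   jne  .Lfalse
    //   cmpl %ecx, %edx
    //   jne  .Lfalse
    //
    // Merged lowering -- a couple of extra ALU ops but a single branch:
    //   cmpl  %esi, %edi
    //   sete  %al
    //   cmpl  %ecx, %edx
    //   sete  %cl
    //   testb %al, %cl
    //   je    .Lfalse
    return A == B && C == D;
  }

Splitting pays off when computing the second condition is expensive relative to how often the first condition short-circuits; that is what the cost threshold in the patch below tries to model.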

Modest improvement on clang bootstrap build:
https://llvm-compile-time-tracker.com/compare.php?from=79ce933114e46c891a5632f7ad4a004b93a5b808&to=978278eabc0bafe2f390ca8fcdad24154f954020&stat=cycles
Average stage2-O3: 0.59% Improvement (cycles)
Average stage2-O0-g: 1.20% Improvement (cycles)

Likewise, llvm-test-suite on SKX saw a net 0.84% improvement (cycles).

There is also a modest compile time improvement with this patch:
https://llvm-compile-time-tracker.com/compare.php?from=79ce933114e46c891a5632f7ad4a004b93a5b808&to=978278eabc0bafe2f390ca8fcdad24154f954020&stat=instructions%3Au

Note that the stage2 instruction count increase is expected; this
patch trades instructions for fewer branch misses (which are
proportionately lower):
https://llvm-compile-time-tracker.com/compare.php?from=79ce933114e46c891a5632f7ad4a004b93a5b808&to=978278eabc0bafe2f390ca8fcdad24154f954020&stat=branch-misses

NB: This will also likely help for APX targets with the new CCMP and
CTEST instructions.

llvmbot added the backend:X86 and llvm:SelectionDAG labels Feb 13, 2024
llvmbot (Collaborator) commented Feb 13, 2024

@llvm/pr-subscribers-llvm-selectiondag

Author: None (goldsteinn)

Patch is 104.14 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/81689.diff

27 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/TargetLowering.h (+7)
  • (modified) llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp (+5-2)
  • (modified) llvm/lib/Target/X86/X86ISelLowering.cpp (+159)
  • (modified) llvm/lib/Target/X86/X86ISelLowering.h (+5)
  • (modified) llvm/test/CodeGen/X86/2006-04-27-ISelFoldingBug.ll (+6-5)
  • (modified) llvm/test/CodeGen/X86/2007-08-09-IllegalX86-64Asm.ll (+56-53)
  • (modified) llvm/test/CodeGen/X86/2007-12-18-LoadCSEBug.ll (+5-8)
  • (modified) llvm/test/CodeGen/X86/2008-02-18-TailMergingBug.ll (+3-2)
  • (modified) llvm/test/CodeGen/X86/avx-cmp.ll (+10-17)
  • (modified) llvm/test/CodeGen/X86/cmp.ll (+30-22)
  • (modified) llvm/test/CodeGen/X86/dagcombine-and-setcc.ll (+2-1)
  • (modified) llvm/test/CodeGen/X86/div-rem-pair-recomposition-unsigned.ll (+154-152)
  • (modified) llvm/test/CodeGen/X86/inline-spiller-impdef-on-implicit-def-regression.ll (+31-39)
  • (modified) llvm/test/CodeGen/X86/lsr-addrecloops.ll (+94-4)
  • (modified) llvm/test/CodeGen/X86/movmsk-cmp.ll (+42-55)
  • (modified) llvm/test/CodeGen/X86/or-branch.ll (+5-4)
  • (modified) llvm/test/CodeGen/X86/peephole-na-phys-copy-folding.ll (+72-70)
  • (modified) llvm/test/CodeGen/X86/pr33747.ll (+9-8)
  • (modified) llvm/test/CodeGen/X86/pr37025.ll (+22-16)
  • (modified) llvm/test/CodeGen/X86/pr38795.ll (+58-73)
  • (modified) llvm/test/CodeGen/X86/setcc-logic.ll (+30-46)
  • (modified) llvm/test/CodeGen/X86/swifterror.ll (-10)
  • (modified) llvm/test/CodeGen/X86/tail-dup-merge-loop-headers.ll (+50-69)
  • (modified) llvm/test/CodeGen/X86/tail-opts.ll (+2-3)
  • (modified) llvm/test/CodeGen/X86/tailcall-extract.ll (+4-4)
  • (modified) llvm/test/CodeGen/X86/test-shrink-bug.ll (+18-16)
  • (modified) llvm/test/CodeGen/X86/x86-shrink-wrap-unwind.ll (+28-24)
diff --git a/llvm/include/llvm/CodeGen/TargetLowering.h b/llvm/include/llvm/CodeGen/TargetLowering.h
index 612433b54f6e44..85288a3ae519ee 100644
--- a/llvm/include/llvm/CodeGen/TargetLowering.h
+++ b/llvm/include/llvm/CodeGen/TargetLowering.h
@@ -596,6 +596,13 @@ class TargetLoweringBase {
   /// avoided.
   bool isJumpExpensive() const { return JumpIsExpensive; }
 
+  virtual bool keepJumpConditionsTogether(const FunctionLoweringInfo &,
+                                          const BranchInst &,
+                                          Instruction::BinaryOps, const Value *,
+                                          const Value *) const {
+    return false;
+  }
+
   /// Return true if selects are only cheaper than branches if the branch is
   /// unlikely to be predicted right.
   bool isPredictableSelectExpensive() const {
diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
index 28664b2ed9052d..12a6ec5f7dfb3b 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
@@ -2646,8 +2646,11 @@ void SelectionDAGBuilder::visitBr(const BranchInst &I) {
     else if (match(BOp, m_LogicalOr(m_Value(BOp0), m_Value(BOp1))))
       Opcode = Instruction::Or;
 
-    if (Opcode && !(match(BOp0, m_ExtractElt(m_Value(Vec), m_Value())) &&
-                    match(BOp1, m_ExtractElt(m_Specific(Vec), m_Value())))) {
+    if (Opcode &&
+        !(match(BOp0, m_ExtractElt(m_Value(Vec), m_Value())) &&
+          match(BOp1, m_ExtractElt(m_Specific(Vec), m_Value()))) &&
+        !DAG.getTargetLoweringInfo().keepJumpConditionsTogether(
+            FuncInfo, I, Opcode, BOp0, BOp1)) {
       FindMergedConditions(BOp, Succ0MBB, Succ1MBB, BrMBB, BrMBB, Opcode,
                            getEdgeProbability(BrMBB, Succ0MBB),
                            getEdgeProbability(BrMBB, Succ1MBB),
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 18f9871b2bd0c3..8ff48630cfc923 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -27,9 +27,12 @@
 #include "llvm/ADT/StringExtras.h"
 #include "llvm/ADT/StringSwitch.h"
 #include "llvm/Analysis/BlockFrequencyInfo.h"
+#include "llvm/Analysis/BranchProbabilityInfo.h"
 #include "llvm/Analysis/ObjCARCUtil.h"
 #include "llvm/Analysis/ProfileSummaryInfo.h"
+#include "llvm/Analysis/TargetTransformInfo.h"
 #include "llvm/Analysis/VectorUtils.h"
+#include "llvm/CodeGen/FunctionLoweringInfo.h"
 #include "llvm/CodeGen/IntrinsicLowering.h"
 #include "llvm/CodeGen/MachineFrameInfo.h"
 #include "llvm/CodeGen/MachineFunction.h"
@@ -55,6 +58,7 @@
 #include "llvm/MC/MCContext.h"
 #include "llvm/MC/MCExpr.h"
 #include "llvm/MC/MCSymbol.h"
+#include "llvm/Support/BranchProbability.h"
 #include "llvm/Support/CommandLine.h"
 #include "llvm/Support/Debug.h"
 #include "llvm/Support/ErrorHandling.h"
@@ -77,6 +81,24 @@ static cl::opt<int> ExperimentalPrefInnermostLoopAlignment(
         "alignment set by x86-experimental-pref-loop-alignment."),
     cl::Hidden);
 
+static cl::opt<int> BrMergingBaseCostThresh(
+    "x86-cond-base", cl::init(1),
+    cl::desc(
+        "Base."),
+    cl::Hidden);
+
+static cl::opt<int> BrMergingLikelyBias(
+    "x86-cond-likely-bias", cl::init(0),
+    cl::desc(
+        "Likely."),
+    cl::Hidden);
+
+static cl::opt<int> BrMergingUnlikelyBias(
+    "x86-cond-unlikely-bias", cl::init(1),
+    cl::desc(
+        "Unlikely."),
+    cl::Hidden);
+
 static cl::opt<bool> MulConstantOptimization(
     "mul-constant-optimization", cl::init(true),
     cl::desc("Replace 'mul x, Const' with more effective instructions like "
@@ -3339,6 +3361,143 @@ unsigned X86TargetLowering::preferedOpcodeForCmpEqPiecesOfOperand(
   return ISD::SRL;
 }
 
+// Collect dependings on V recursively. This is used for the cost analysis in
+// `keepJumpConditionsTogether`.
+static bool
+collectDeps(SmallPtrSet<const Instruction *, 8> *Deps, const Value *V,
+            SmallPtrSet<const Instruction *, 8> *Necessary = nullptr,
+            unsigned Depth = 0) {
+  // Return false if we have an incomplete count.
+  if (Depth >= 6)
+    return false;
+
+  auto *I = dyn_cast<Instruction>(V);
+  if (I == nullptr)
+    return true;
+
+  if (Necessary != nullptr) {
+    // This instruction is necessary for the other side of the condition so
+    // don't count it.
+    if (Necessary->contains(I))
+      return true;
+  }
+
+  // Already added this dep.
+  if (!Deps->insert(I).second)
+    return true;
+
+  for (unsigned OpIdx = 0; OpIdx < I->getNumOperands(); ++OpIdx)
+    if (!collectDeps(Deps, I->getOperand(OpIdx), Necessary, Depth + 1))
+      return false;
+  return true;
+}
+
+bool X86TargetLowering::keepJumpConditionsTogether(
+    const FunctionLoweringInfo &FuncInfo, const BranchInst &I,
+    Instruction::BinaryOps Opc, const Value *Lhs, const Value *Rhs) const {
+  using namespace llvm::PatternMatch;
+  if (I.getNumSuccessors() != 2)
+    return false;
+
+  // Baseline cost. This is properly arbitrary.
+  InstructionCost CostThresh = BrMergingBaseCostThresh.getValue();
+  if (BrMergingBaseCostThresh.getValue() < 0)
+    return false;
+
+  // a == b && c == d can be efficiently combined.
+  ICmpInst::Predicate Pred;
+  if (Opc == Instruction::And &&
+      match(Lhs, m_ICmp(Pred, m_Value(), m_Value())) &&
+      Pred == ICmpInst::ICMP_EQ &&
+      match(Rhs, m_ICmp(Pred, m_Value(), m_Value())) &&
+      Pred == ICmpInst::ICMP_EQ)
+    CostThresh += 1;
+
+  BranchProbabilityInfo *BPI = nullptr;
+  if (BrMergingLikelyBias.getValue() || BrMergingUnlikelyBias.getValue())
+    BPI = FuncInfo.BPI;
+  if (BPI != nullptr) {
+    BasicBlock *IfFalse = I.getSuccessor(0);
+    BasicBlock *IfTrue = I.getSuccessor(1);
+
+    std::optional<bool> Likely;
+    if (BPI->isEdgeHot(I.getParent(), IfTrue))
+      Likely = true;
+    else if (BPI->isEdgeHot(I.getParent(), IfFalse))
+      Likely = false;
+
+    if (Likely) {
+      if (Opc == (*Likely ? Instruction::And : Instruction::Or))
+        // Its likely we will have to compute both lhs and rhs of condition
+        CostThresh += BrMergingLikelyBias.getValue();
+      else {
+        // Its likely we will get an early out.
+        CostThresh -= BrMergingUnlikelyBias.getValue();
+        if (BrMergingUnlikelyBias.getValue() < 0) {
+          return false;
+        }
+      }
+    }
+  }
+
+  if (CostThresh <= 0)
+    return false;
+
+  // Collect "all" instructions that lhs condition is dependent on.
+  SmallPtrSet<const Instruction *, 8> LhsDeps, RhsDeps;
+  collectDeps(&LhsDeps, Lhs);
+  // Collect "all" instructions that rhs condition is dependent on AND are
+  // dependencies of lhs. This gives us an estimate on which instructions we
+  // stand to save by splitting the condition.
+  if (!collectDeps(&RhsDeps, Rhs, &LhsDeps))
+    return false;
+  // Add the compare instruction itself unless its a dependency on the LHS.
+  if (const auto *RhsI = dyn_cast<Instruction>(Rhs))
+    if (!LhsDeps.contains(RhsI))
+      RhsDeps.insert(RhsI);
+  const auto &TTI = getTargetMachine().getTargetTransformInfo(*I.getFunction());
+
+  InstructionCost CostOfIncluding = 0;
+  // See if this instruction will need to computed independently of whether RHS
+  // is.
+  auto ShouldCountInsn = [&RhsDeps](const Instruction *Ins) {
+    for (const auto *U : Ins->users()) {
+      // If user is independent of RHS calculation we don't need to count it.
+      if (auto *UIns = dyn_cast<Instruction>(U))
+        if (!RhsDeps.contains(UIns))
+          return false;
+    }
+    return true;
+  };
+
+  // Prune instructions from RHS Deps that are dependencies of unrelated
+  // instructions.
+  const unsigned MaxPruneIters = 8;
+  // Stop after a certain point. No incorrectness from including too many
+  // instructions.
+  for (unsigned PruneIters = 0; PruneIters < MaxPruneIters; ++PruneIters) {
+    const Instruction *ToDrop = nullptr;
+    for (const auto *Ins : RhsDeps) {
+      if (!ShouldCountInsn(Ins)) {
+        ToDrop = Ins;
+        break;
+      }
+    }
+    if (ToDrop == nullptr)
+      break;
+    RhsDeps.erase(ToDrop);
+  }
+
+  for (const auto *Ins : RhsDeps) {
+    CostOfIncluding +=
+        TTI.getInstructionCost(Ins, TargetTransformInfo::TCK_Latency);
+
+    if (CostOfIncluding > CostThresh)
+      return false;
+  }
+  return true;
+}
+
 bool X86TargetLowering::preferScalarizeSplat(SDNode *N) const {
   return N->getOpcode() != ISD::FP_EXTEND;
 }
diff --git a/llvm/lib/Target/X86/X86ISelLowering.h b/llvm/lib/Target/X86/X86ISelLowering.h
index f93c54781846bf..f536c8a6766b68 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.h
+++ b/llvm/lib/Target/X86/X86ISelLowering.h
@@ -1150,6 +1150,11 @@ namespace llvm {
 
     bool preferScalarizeSplat(SDNode *N) const override;
 
+    bool keepJumpConditionsTogether(const FunctionLoweringInfo &,
+                                    const BranchInst &, Instruction::BinaryOps,
+                                    const Value *,
+                                    const Value *) const override;
+
     bool shouldFoldConstantShiftPairToMask(const SDNode *N,
                                            CombineLevel Level) const override;
 
diff --git a/llvm/test/CodeGen/X86/2006-04-27-ISelFoldingBug.ll b/llvm/test/CodeGen/X86/2006-04-27-ISelFoldingBug.ll
index 0044d1c3568377..e6f28c2057f775 100644
--- a/llvm/test/CodeGen/X86/2006-04-27-ISelFoldingBug.ll
+++ b/llvm/test/CodeGen/X86/2006-04-27-ISelFoldingBug.ll
@@ -18,15 +18,16 @@ define i1 @loadAndRLEsource_no_exit_2E_1_label_2E_0(i32 %tmp.21.reload, i32 %tmp
 ; CHECK-NEXT:    movl _block, %esi
 ; CHECK-NEXT:    movb %al, 1(%esi,%edx)
 ; CHECK-NEXT:    cmpl %ecx, _last
-; CHECK-NEXT:    jge LBB0_3
-; CHECK-NEXT:  ## %bb.1: ## %label.0
+; CHECK-NEXT:    setl %cl
 ; CHECK-NEXT:    cmpl $257, %eax ## imm = 0x101
-; CHECK-NEXT:    je LBB0_3
-; CHECK-NEXT:  ## %bb.2: ## %label.0.no_exit.1_crit_edge.exitStub
+; CHECK-NEXT:    setne %al
+; CHECK-NEXT:    testb %al, %cl
+; CHECK-NEXT:    je LBB0_2
+; CHECK-NEXT:  ## %bb.1: ## %label.0.no_exit.1_crit_edge.exitStub
 ; CHECK-NEXT:    movb $1, %al
 ; CHECK-NEXT:    popl %esi
 ; CHECK-NEXT:    retl
-; CHECK-NEXT:  LBB0_3: ## %codeRepl5.exitStub
+; CHECK-NEXT:  LBB0_2: ## %codeRepl5.exitStub
 ; CHECK-NEXT:    xorl %eax, %eax
 ; CHECK-NEXT:    popl %esi
 ; CHECK-NEXT:    retl
diff --git a/llvm/test/CodeGen/X86/2007-08-09-IllegalX86-64Asm.ll b/llvm/test/CodeGen/X86/2007-08-09-IllegalX86-64Asm.ll
index 7bdc4e19a1cf66..28b4541c1bfc7f 100644
--- a/llvm/test/CodeGen/X86/2007-08-09-IllegalX86-64Asm.ll
+++ b/llvm/test/CodeGen/X86/2007-08-09-IllegalX86-64Asm.ll
@@ -44,7 +44,7 @@ define ptr @ubyte_divmod(ptr %a, ptr %b) {
 ; CHECK-NEXT:    leaq {{[0-9]+}}(%rsp), %rsi
 ; CHECK-NEXT:    callq __ubyte_convert_to_ctype
 ; CHECK-NEXT:    testl %eax, %eax
-; CHECK-NEXT:    js LBB0_4
+; CHECK-NEXT:    js LBB0_6
 ; CHECK-NEXT:  ## %bb.1: ## %cond_next.i
 ; CHECK-NEXT:    leaq {{[0-9]+}}(%rsp), %rsi
 ; CHECK-NEXT:    movq %rbx, %rdi
@@ -53,81 +53,84 @@ define ptr @ubyte_divmod(ptr %a, ptr %b) {
 ; CHECK-NEXT:    sarl $31, %ecx
 ; CHECK-NEXT:    andl %eax, %ecx
 ; CHECK-NEXT:    cmpl $-2, %ecx
-; CHECK-NEXT:    je LBB0_8
+; CHECK-NEXT:    je LBB0_10
 ; CHECK-NEXT:  ## %bb.2: ## %cond_next.i
 ; CHECK-NEXT:    cmpl $-1, %ecx
-; CHECK-NEXT:    jne LBB0_6
-; CHECK-NEXT:  LBB0_3: ## %bb4
+; CHECK-NEXT:    jne LBB0_3
+; CHECK-NEXT:  LBB0_8: ## %bb4
 ; CHECK-NEXT:    movq _PyArray_API@GOTPCREL(%rip), %rax
 ; CHECK-NEXT:    movq (%rax), %rax
 ; CHECK-NEXT:    movq 16(%rax), %rax
-; CHECK-NEXT:    jmp LBB0_10
-; CHECK-NEXT:  LBB0_4: ## %_ubyte_convert2_to_ctypes.exit
+; CHECK-NEXT:    jmp LBB0_9
+; CHECK-NEXT:  LBB0_6: ## %_ubyte_convert2_to_ctypes.exit
 ; CHECK-NEXT:    cmpl $-2, %eax
-; CHECK-NEXT:    je LBB0_8
-; CHECK-NEXT:  ## %bb.5: ## %_ubyte_convert2_to_ctypes.exit
+; CHECK-NEXT:    je LBB0_10
+; CHECK-NEXT:  ## %bb.7: ## %_ubyte_convert2_to_ctypes.exit
 ; CHECK-NEXT:    cmpl $-1, %eax
-; CHECK-NEXT:    je LBB0_3
-; CHECK-NEXT:  LBB0_6: ## %bb35
+; CHECK-NEXT:    je LBB0_8
+; CHECK-NEXT:  LBB0_3: ## %bb35
 ; CHECK-NEXT:    movq _PyUFunc_API@GOTPCREL(%rip), %r14
 ; CHECK-NEXT:    movq (%r14), %rax
 ; CHECK-NEXT:    callq *216(%rax)
 ; CHECK-NEXT:    movzbl {{[0-9]+}}(%rsp), %edx
 ; CHECK-NEXT:    testb %dl, %dl
-; CHECK-NEXT:    je LBB0_11
-; CHECK-NEXT:  ## %bb.7: ## %cond_false.i
+; CHECK-NEXT:    je LBB0_4
+; CHECK-NEXT:  ## %bb.12: ## %cond_false.i
+; CHECK-NEXT:    setne %dil
 ; CHECK-NEXT:    movzbl {{[0-9]+}}(%rsp), %esi
 ; CHECK-NEXT:    movzbl %sil, %ecx
 ; CHECK-NEXT:    movl %ecx, %eax
 ; CHECK-NEXT:    divb %dl
 ; CHECK-NEXT:    movl %eax, %r15d
 ; CHECK-NEXT:    testb %cl, %cl
-; CHECK-NEXT:    jne LBB0_12
-; CHECK-NEXT:    jmp LBB0_14
-; CHECK-NEXT:  LBB0_8: ## %bb17
+; CHECK-NEXT:    setne %al
+; CHECK-NEXT:    testb %dil, %al
+; CHECK-NEXT:    jne LBB0_5
+; CHECK-NEXT:  LBB0_13: ## %cond_true.i200
+; CHECK-NEXT:    testb %dl, %dl
+; CHECK-NEXT:    jne LBB0_15
+; CHECK-NEXT:  ## %bb.14: ## %cond_true14.i
+; CHECK-NEXT:    movl $4, %edi
+; CHECK-NEXT:    callq _feraiseexcept
+; CHECK-NEXT:  LBB0_15: ## %ubyte_ctype_remainder.exit
+; CHECK-NEXT:    xorl %ebx, %ebx
+; CHECK-NEXT:    jmp LBB0_16
+; CHECK-NEXT:  LBB0_10: ## %bb17
 ; CHECK-NEXT:    callq _PyErr_Occurred
 ; CHECK-NEXT:    testq %rax, %rax
-; CHECK-NEXT:    jne LBB0_27
-; CHECK-NEXT:  ## %bb.9: ## %cond_next
+; CHECK-NEXT:    jne LBB0_23
+; CHECK-NEXT:  ## %bb.11: ## %cond_next
 ; CHECK-NEXT:    movq _PyArray_API@GOTPCREL(%rip), %rax
 ; CHECK-NEXT:    movq (%rax), %rax
 ; CHECK-NEXT:    movq 80(%rax), %rax
-; CHECK-NEXT:  LBB0_10: ## %bb4
+; CHECK-NEXT:  LBB0_9: ## %bb4
 ; CHECK-NEXT:    movq 96(%rax), %rax
 ; CHECK-NEXT:    movq %r14, %rdi
 ; CHECK-NEXT:    movq %rbx, %rsi
 ; CHECK-NEXT:    callq *40(%rax)
-; CHECK-NEXT:    jmp LBB0_28
-; CHECK-NEXT:  LBB0_11: ## %cond_true.i
+; CHECK-NEXT:    jmp LBB0_24
+; CHECK-NEXT:  LBB0_4: ## %cond_true.i
 ; CHECK-NEXT:    movl $4, %edi
 ; CHECK-NEXT:    callq _feraiseexcept
 ; CHECK-NEXT:    movzbl {{[0-9]+}}(%rsp), %edx
 ; CHECK-NEXT:    movzbl {{[0-9]+}}(%rsp), %esi
-; CHECK-NEXT:    xorl %r15d, %r15d
 ; CHECK-NEXT:    testb %sil, %sil
-; CHECK-NEXT:    je LBB0_14
-; CHECK-NEXT:  LBB0_12: ## %cond_false.i
+; CHECK-NEXT:    sete %al
 ; CHECK-NEXT:    testb %dl, %dl
-; CHECK-NEXT:    je LBB0_14
-; CHECK-NEXT:  ## %bb.13: ## %cond_next17.i
+; CHECK-NEXT:    sete %cl
+; CHECK-NEXT:    xorl %r15d, %r15d
+; CHECK-NEXT:    orb %al, %cl
+; CHECK-NEXT:    jne LBB0_13
+; CHECK-NEXT:  LBB0_5: ## %cond_next17.i
 ; CHECK-NEXT:    movzbl %sil, %eax
 ; CHECK-NEXT:    divb %dl
 ; CHECK-NEXT:    movzbl %ah, %ebx
-; CHECK-NEXT:    jmp LBB0_18
-; CHECK-NEXT:  LBB0_14: ## %cond_true.i200
-; CHECK-NEXT:    testb %dl, %dl
-; CHECK-NEXT:    jne LBB0_17
-; CHECK-NEXT:  ## %bb.16: ## %cond_true14.i
-; CHECK-NEXT:    movl $4, %edi
-; CHECK-NEXT:    callq _feraiseexcept
-; CHECK-NEXT:  LBB0_17: ## %ubyte_ctype_remainder.exit
-; CHECK-NEXT:    xorl %ebx, %ebx
-; CHECK-NEXT:  LBB0_18: ## %ubyte_ctype_remainder.exit
+; CHECK-NEXT:  LBB0_16: ## %ubyte_ctype_remainder.exit
 ; CHECK-NEXT:    movq (%r14), %rax
 ; CHECK-NEXT:    callq *224(%rax)
 ; CHECK-NEXT:    testl %eax, %eax
-; CHECK-NEXT:    je LBB0_21
-; CHECK-NEXT:  ## %bb.19: ## %cond_true61
+; CHECK-NEXT:    je LBB0_19
+; CHECK-NEXT:  ## %bb.17: ## %cond_true61
 ; CHECK-NEXT:    movl %eax, %ebp
 ; CHECK-NEXT:    movq (%r14), %rax
 ; CHECK-NEXT:    movq _.str5@GOTPCREL(%rip), %rdi
@@ -136,8 +139,8 @@ define ptr @ubyte_divmod(ptr %a, ptr %b) {
 ; CHECK-NEXT:    leaq {{[0-9]+}}(%rsp), %rcx
 ; CHECK-NEXT:    callq *200(%rax)
 ; CHECK-NEXT:    testl %eax, %eax
-; CHECK-NEXT:    js LBB0_27
-; CHECK-NEXT:  ## %bb.20: ## %cond_next73
+; CHECK-NEXT:    js LBB0_23
+; CHECK-NEXT:  ## %bb.18: ## %cond_next73
 ; CHECK-NEXT:    movl $1, {{[0-9]+}}(%rsp)
 ; CHECK-NEXT:    movq (%r14), %rax
 ; CHECK-NEXT:    movq {{[0-9]+}}(%rsp), %rsi
@@ -146,13 +149,13 @@ define ptr @ubyte_divmod(ptr %a, ptr %b) {
 ; CHECK-NEXT:    movl %ebp, %edx
 ; CHECK-NEXT:    callq *232(%rax)
 ; CHECK-NEXT:    testl %eax, %eax
-; CHECK-NEXT:    jne LBB0_27
-; CHECK-NEXT:  LBB0_21: ## %cond_next89
+; CHECK-NEXT:    jne LBB0_23
+; CHECK-NEXT:  LBB0_19: ## %cond_next89
 ; CHECK-NEXT:    movl $2, %edi
 ; CHECK-NEXT:    callq _PyTuple_New
 ; CHECK-NEXT:    testq %rax, %rax
-; CHECK-NEXT:    je LBB0_27
-; CHECK-NEXT:  ## %bb.22: ## %cond_next97
+; CHECK-NEXT:    je LBB0_23
+; CHECK-NEXT:  ## %bb.20: ## %cond_next97
 ; CHECK-NEXT:    movq %rax, %r14
 ; CHECK-NEXT:    movq _PyArray_API@GOTPCREL(%rip), %r12
 ; CHECK-NEXT:    movq (%r12), %rax
@@ -160,8 +163,8 @@ define ptr @ubyte_divmod(ptr %a, ptr %b) {
 ; CHECK-NEXT:    xorl %esi, %esi
 ; CHECK-NEXT:    callq *304(%rdi)
 ; CHECK-NEXT:    testq %rax, %rax
-; CHECK-NEXT:    je LBB0_25
-; CHECK-NEXT:  ## %bb.23: ## %cond_next135
+; CHECK-NEXT:    je LBB0_21
+; CHECK-NEXT:  ## %bb.25: ## %cond_next135
 ; CHECK-NEXT:    movb %r15b, 16(%rax)
 ; CHECK-NEXT:    movq %rax, 24(%r14)
 ; CHECK-NEXT:    movq (%r12), %rax
@@ -169,22 +172,22 @@ define ptr @ubyte_divmod(ptr %a, ptr %b) {
 ; CHECK-NEXT:    xorl %esi, %esi
 ; CHECK-NEXT:    callq *304(%rdi)
 ; CHECK-NEXT:    testq %rax, %rax
-; CHECK-NEXT:    je LBB0_25
-; CHECK-NEXT:  ## %bb.24: ## %cond_next182
+; CHECK-NEXT:    je LBB0_21
+; CHECK-NEXT:  ## %bb.26: ## %cond_next182
 ; CHECK-NEXT:    movb %bl, 16(%rax)
 ; CHECK-NEXT:    movq %rax, 32(%r14)
 ; CHECK-NEXT:    movq %r14, %rax
-; CHECK-NEXT:    jmp LBB0_28
-; CHECK-NEXT:  LBB0_25: ## %cond_true113
+; CHECK-NEXT:    jmp LBB0_24
+; CHECK-NEXT:  LBB0_21: ## %cond_true113
 ; CHECK-NEXT:    decq (%r14)
-; CHECK-NEXT:    jne LBB0_27
-; CHECK-NEXT:  ## %bb.26: ## %cond_true126
+; CHECK-NEXT:    jne LBB0_23
+; CHECK-NEXT:  ## %bb.22: ## %cond_true126
 ; CHECK-NEXT:    movq 8(%r14), %rax
 ; CHECK-NEXT:    movq %r14, %rdi
 ; CHECK-NEXT:    callq *48(%rax)
-; CHECK-NEXT:  LBB0_27: ## %UnifiedReturnBlock
+; CHECK-NEXT:  LBB0_23: ## %UnifiedReturnBlock
 ; CHECK-NEXT:    xorl %eax, %eax
-; CHECK-NEXT:  LBB0_28: ## %UnifiedReturnBlock
+; CHECK-NEXT:  LBB0_24: ## %UnifiedReturnBlock
 ; CHECK-NEXT:    addq $32, %rsp
 ; CHECK-NEXT:    popq %rbx
 ; CHECK-NEXT:    popq %r12
diff --git a/llvm/test/CodeGen/X86/2007-12-18-LoadCSEBug.ll b/llvm/test/CodeGen/X86/2007-12-18-LoadCSEBug.ll
index 4482c5aec8e816..d9d4424267d733 100644
--- a/llvm/test/CodeGen/X86/2007-12-18-LoadCSEBug.ll
+++ b/llvm/test/CodeGen/X86/2007-12-18-LoadCSEBug.ll
@@ -16,15 +16,12 @@ define void @_ada_c34007g() {
 ; CHECK-NEXT:    andl $-8, %esp
 ; CHECK-NEXT:    subl $8, %esp
 ; CHECK-NEXT:    movl (%esp), %eax
+; CHECK-NEXT:    movl {{[0-9]+}}(%esp), %ecx
+; CHECK-NEXT:    orl %eax, %ecx
+; CHECK-NEXT:    sete %cl
 ; CHECK-NEXT:    testl %eax, %eax
-; CHECK-NEXT:    je .LBB0_3
-; CHECK-NEXT:  # %bb.1: # %entry
-; CHECK-NEXT:    orl {{[0-9]+}}(%esp), %eax
-; CHECK-NEXT:    jne .LBB0_3
-; CHECK-NEXT:  # %bb.2: # %entry
-; CHECK-NEXT:    movb $1, %al
-; CHECK-NEXT:    testb %al, %al
-; CHECK-NEXT:  .LBB0_3: # %bb5507
+; CHECK-NEXT:    setne %al
+; CHECK-NEXT:    testb %cl, %al
 ; CHECK-NEXT:    movl %ebp, %esp
 ; CHECK-NEXT:    popl %ebp
 ; CHECK-NEXT:    .cfi_def_cfa %esp, 4
diff --git a/llvm/test/CodeGen/X86/2008-02-18-TailMergingBug.ll b/llvm/test/CodeGen/X86/2008-02-18-TailMergingBug.ll
index dd60e641df2543..9147692db15645 100644
--- a/llvm/test/CodeGen/X86/2008-02-18-TailMergingBug.ll
+++ b/llvm/test/CodeGen/X86/2008-02-18-TailMergingBug.ll
@@ -1,5 +1,6 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 4
 ; REQUIRES: asserts
-; RUN: llc < %s -mtriple=i686-- -mcpu=yonah -stats 2>&1 | grep "Number of block tails merged" | grep 16
+; RUN: llc < %s -mtriple=i686-- -mcpu=yonah -stats 2>&1 | grep "Number of block tails merged" | grep 9
 ; PR1909
 
 @.str = internal constant [48 x i8] c"transformed bounds: (%.2f, %.2f), (%.2f, %.2f)\0A\00"		; <ptr> [#uses=1]
@@ -217,4 +218,4 @@ bb456:		; preds = %bb448, %bb425, %bb417, %bb395, %bb385, %bb371
 	ret void
 }
 
-declare i32 @printf(ptr, ...) nounwind 
+declare i32 @printf(ptr, ...) nounwind
diff --git a/llvm/test/CodeGen/X86/avx-cmp.ll b/llvm/test/CodeGen/X86/avx-cmp.ll
index 502bbf3f5d118b..4ab9c545ed90da 100644
--- a/llvm/test/CodeGen/X86/avx-cmp.ll
+++ b/llvm/test/CodeGen/X86/avx-cmp.ll
@@ -26,40 +26,33 @@ declare void @scale() nounwind
 define v...
[truncated]

llvmbot (Collaborator) commented Feb 13, 2024

@llvm/pr-subscribers-backend-x86

Author: None (goldsteinn)

(The patch summary, file list, and truncated diff are identical to the llvm-selectiondag subscriber comment above.)

github-actions bot commented Feb 13, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

goldsteinn (Contributor Author) commented:

NB: This is nearly the same as: https://reviews.llvm.org/D157581

RKSimon (Collaborator) left a comment:

Some initial comments - but I can't see why so much logic is in x86 directly. The keepJumpConditionsTogether implementation appears very generic, and I'd expect it to be in DAGBuilder and x86 just has a simple TLI flag callback to enable it.

@@ -596,6 +596,13 @@ class TargetLoweringBase {
/// avoided.
bool isJumpExpensive() const { return JumpIsExpensive; }

virtual bool keepJumpConditionsTogether(const FunctionLoweringInfo &,
Collaborator:

Add doxygen description

llvm/lib/Target/X86/X86ISelLowering.cpp (resolved)
llvm/lib/Target/X86/X86ISelLowering.cpp (resolved)
else {
// Its likely we will get an early out.
CostThresh -= BrMergingUnlikelyBias.getValue();
if (BrMergingUnlikelyBias.getValue() < 0) {
Collaborator:

Attempt this earlier?

goldsteinn (Contributor Author) commented:

Some initial comments - but I can't see why so much logic is in x86 directly. The keepJumpConditionsTogether implementation appears very generic, and I'd expect it to be in DAGBuilder and x86 just has a simple TLI flag callback to enable it.

Fair enough. I'll put the code somewhere generic and make it a call in.

goldsteinn (Contributor Author) commented:

Have moved the code to SelectionDAG, with targets providing the tuning params. I kept the special case for (a == b && c == d) in x86 as I'm not sure that's really general.
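
For reference, the reworked hook presumably ends up looking something like this sketch in TargetLoweringBase (the hook name getJumpConditionMergingParams and the default return are assumptions; only the CondMergingParams struct is quoted in the review below):

  // Sketch, not the verbatim patch:
  struct CondMergingParams {
    int BaseCost;     // base latency budget for keeping conditions merged
    int LikelyBias;   // extra budget when both conditions likely execute
    int UnlikelyBias; // penalty when an early out is likely
  };

  // Hypothetical target hook: a negative BaseCost always splits,
  // preserving the old default behavior for targets that don't opt in.
  virtual CondMergingParams
  getJumpConditionMergingParams(Instruction::BinaryOps Opc, const Value *Lhs,
                                const Value *Rhs) const {
    return {-1, -1, -1};
  }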

goldsteinn (Contributor Author) commented:

ping

struct CondMergingParams {
int BaseCost;
int LikelyBias;
int UnlikelyBias;
Collaborator:

Add comments

int UnlikelyBias;
};
// Return params for deciding if we should keep two branch conditions merged
// or split them into two seperate branches.
Collaborator:

Add parameter descriptions to the comment

Contributor Author:

What do you mean by this?

Contributor:

Typo "separate"

Collaborator:

Describe what the 3 function parameters do

@@ -2432,6 +2434,141 @@ SelectionDAGBuilder::EmitBranchForMergedCondition(const Value *Cond,
SL->SwitchCases.push_back(CB);
}

// Collect dependings on V recursively. This is used for the cost analysis in
Collaborator:

dependings -> dependencies?

if (!Deps->insert(I).second)
return true;

for (unsigned OpIdx = 0; OpIdx < I->getNumOperands(); ++OpIdx)
Collaborator:

(style) for (unsigned OpIdx = 0, E = I->getNumOperands(); OpIdx < E; ++OpIdx)


// Prune instructions from RHS Deps that are dependencies of unrelated
// instructions.
const unsigned MaxPruneIters = 8;
Collaborator:

Magic number?

Contributor Author:

Yeah, changed to SelectionDAG::MaxRecursionDepth and added a comment indicating its relative arbitrariness.


for (const auto *Ins : RhsDeps) {
CostOfIncluding +=
TTI.getInstructionCost(Ins, TargetTransformInfo::TCK_Latency);
Collaborator:

Why TCK_Latency?

Contributor Author:

My thought is that we are computing a dependency chain, so latency was the right measurement. We could be a bit smarter and take ILP into account, but that seems overkill when the BaseCost bound is generally relatively low.

Contributor Author:

Adding a comment to that effect.
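
For context, a sketch of how the latency summation above is meant to read, alongside the cost kinds TargetTransformInfo offers (summaries paraphrased; the helper function is invented for illustration):

  #include "llvm/ADT/ArrayRef.h"
  #include "llvm/Analysis/TargetTransformInfo.h"
  #include "llvm/IR/Instruction.h"
  using namespace llvm;

  // TCK_RecipThroughput: cost of overlappable, independent work.
  // TCK_Latency: length of the serial dependency chain -- the fit here,
  // since the RHS deps must all resolve before the merged branch can.
  // TCK_CodeSize / TCK_SizeAndLatency: size-oriented metrics.
  static InstructionCost
  latencyOfDeps(const TargetTransformInfo &TTI,
                ArrayRef<const Instruction *> RhsDeps) {
    InstructionCost Cost = 0;
    for (const Instruction *Ins : RhsDeps)
      Cost += TTI.getInstructionCost(Ins, TargetTransformInfo::TCK_Latency);
    return Cost;
  }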

@@ -55,6 +58,7 @@
#include "llvm/MC/MCContext.h"
#include "llvm/MC/MCExpr.h"
#include "llvm/MC/MCSymbol.h"
#include "llvm/Support/BranchProbability.h"
Collaborator:

Do you still need all these extra includes?

@@ -1,3 +1,4 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 4
Collaborator:

Regenerate tests as a pre-commit so we see the actual diff

@@ -1,5 +1,6 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 4
Collaborator:

This didn't use update_llc_test_checks?

goldsteinn (Contributor Author) commented:

ping

jayfoad (Contributor) commented Feb 26, 2024

[X86] Don't always seperate conditions in (br (and/or cond0, cond1)) into seperate branches

Typo "separate" (twice).

Collaborator

@RKSimon RKSimon left a comment

Mostly there; my remaining concern is how this is supposed to interact with middle-end passes like SimplifyCFG, which are handling something similar.

@@ -596,6 +596,39 @@ class TargetLoweringBase {
/// avoided.
bool isJumpExpensive() const { return JumpIsExpensive; }

// Costs paramateres used by
Collaborator

paramateres -> parameters

int UnlikelyBias;
};
// Return params for deciding if we should keep two branch conditions merged
// or split them into two seperate branches.
Collaborator

Describe what the 3 function parameters do

@goldsteinn goldsteinn changed the title [X86] Don't always seperate conditions in (br (and/or cond0, cond1)) into seperate branches [X86] Don't always separate conditions in (br (and/or cond0, cond1)) into separate branches Feb 26, 2024
@goldsteinn
Contributor Author

Mostly there; my remaining concern is how this is supposed to interact with middle-end passes like SimplifyCFG, which are handling something similar.

I would think this just works to better honor the middle-end's decisions.
I.e., if SimplifyCFG splits the two conditions, we aren't recombining them here. If SimplifyCFG keeps them together, this patch adds backend knowledge that will either confirm that decision or reject it (the current default behavior).
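For a concrete picture, an illustrative (hypothetical, not from the patch) source pattern:

// SimplifyCFG will often fold the two tests into one block ending in
// `br (and cond0, cond1)`; this patch then decides in the backend whether
// that merged branch stays merged or gets re-split.
bool use(long x, long y);
bool both(long x, long y) {
  if (x != 0 && y != 0) // cond0 = (x != 0), cond1 = (y != 0)
    return use(x, y);
  return false;
}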

@goldsteinn
Contributor Author

ping

Collaborator

@RKSimon RKSimon left a comment

LGTM, cheers (with a couple of comments that we should follow up on - happy if you raise issues or want to look at them in the future).

Any other comments?

; CHECK-NEXT: .LBB0_2:
; CHECK-NEXT: setne %cl
; CHECK-NEXT: andb %al, %cl
; CHECK-NEXT: cmpb $1, %cl
Collaborator

Any idea why the and+cmp hasn't folded to test? Not related to this patch, but maybe worth a follow-up investigation?

Contributor Author

Seems to be because of the atomic.

SelectionDAG has 21 nodes:
  t0: ch,glue = EntryToken
    t2: i64,ch = CopyFromReg t0, Register:i64 %0
  t8: i8,ch = llvm.x86.atomic.sub.cc<(volatile load store (s64) on %ir.0, align 64)> t0, TargetConstant:i64<12197>, t2, Constant:i64<1>, TargetConstant:i32<4>
            t4: i64,ch = CopyFromReg t0, Register:i64 %1
          t12: i1 = setcc t4, Constant:i64<0>, setne:ch
          t9: i1 = truncate t8
        t14: i1 = select t12, t9, Constant:i1<0>
      t16: i1 = xor t14, Constant:i1<-1>
    t18: ch = brcond t8:1, t16, BasicBlock:ch< 0x55ddb51cd310>
  t20: ch = br t18, BasicBlock:ch< 0x55ddb51cd210>

We plumb llvm.x86.atomic.sub.cc as i8 instead of i1 and don't fold past the truncate.
Is there any reason llvm.x86.atomic.sub.cc needs to be i8?

; CHECK-NEXT: movl $4, %eax
; CHECK-NEXT: retq
; CHECK-NEXT: .LBB14_2: # %return
; CHECK-NEXT: movl $192, %eax
Collaborator

Pity this didn't become a cmov :(

Contributor Author

Interestingly, none of the new branches became cmov. I bet that when we look for cmov opportunities this is still in (and cond0, cond1) form, and we only form cmov from setcc.
I'll look into this.

@goldsteinn
Contributor Author

LGTM, cheers (with a couple of comments that we should follow up on - happy if you raise issues or want to look at them in the future).

Any other comments?

Thank you for the review; I'll look into the lack of cmov/test. I have an idea for how to fix them.

@goldsteinn
Contributor Author

@nikic
NOTE: going to push this; it will increase instruction count on stage2 compile-time benchmarks (but saves branch misses).

@goldsteinn goldsteinn closed this in ae76dfb Mar 1, 2024
@KanRobert
Contributor

This will also likely help for APX targets with the new CCMP and
CTEST instructions.

In fact, this causes some downstream CCMP tests to fail because the branch is already eliminated. @goldsteinn Is there a flag to undo the change in the tests?

@goldsteinn
Contributor Author

This will also likely help for APX targets with the new CCMP and
CTEST instructions.

In fact, this causes some downstream CCMP tests to fail because the branch is already eliminated. @goldsteinn Is there a flag to undo the change in the tests?

static cl::opt<int> BrMergingBaseCostThresh(
    "x86-br-merging-base-cost", cl::init(1),
    cl::desc(
        "Sets the cost threshold for when multiple conditionals will be merged "
        "into one branch versus be split in multiple branches. Merging "
        "conditionals saves branches at the cost of additional instructions. "
        "This value sets the instruction cost limit, below which conditionals "
        "will be merged, and above which conditionals will be split."),
    cl::Hidden);

If you set that to -1 it will disable the branch merging.
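For example (assuming an IR reproducer named test.ll), `llc -x86-br-merging-base-cost=-1 test.ll` restores the old always-split behavior; when driving through clang, the same option can be passed as `-mllvm -x86-br-merging-base-cost=-1`.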

Contributor

@chapuni chapuni left a comment

This change introduces nondeterminism.

return false;

// Collect "all" instructions that lhs condition is dependent on.
SmallPtrSet<const Instruction *, 8> LhsDeps, RhsDeps;
Contributor

SmallPtrSet should not be iterable.

Contributor Author

Ah, will have a fix up shortly.
I think it's only the first iteration; the second iteration will go through all elements, so order shouldn't matter.
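Roughly, the recommit's fix (a sketch based on the commit message, not the exact code):

// SmallMapVector iterates in insertion order, so the pruning loop no longer
// depends on pointer values the way SmallPtrSet iteration does.
SmallMapVector<const Instruction *, bool, 8> LhsDeps, RhsDeps;
for (const auto &InsPair : RhsDeps) {
  const Instruction *Ins = InsPair.first; // deterministic visit order
  // ... pruning / cost logic ...
}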

Contributor Author

The code simplifies when using a small vector for iteration as well :)

Contributor Author

See: #83687

for (const auto *U : Ins->users()) {
// If user is independent of RHS calculation we don't need to count it.
if (auto *UIns = dyn_cast<Instruction>(U))
if (!RhsDeps.contains(UIns))
Contributor Author

@goldsteinn goldsteinn Mar 3, 2024

Realized there is a bug here: we should be checking that the use also doesn't equal I.getCondition(), otherwise we can end up over-pruning. I have a fix; just testing whether it implies any changes to the tuning. Will post tomorrow.
Edit: not a correctness bug.
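A sketch of the described fix (assuming `I` is the branch being lowered; names follow the surrounding snippets, not the exact landed code):

auto ShouldCountInsn = [&](const Instruction *Ins) {
  for (const User *U : Ins->users()) {
    if (auto *UIns = dyn_cast<Instruction>(U))
      // The combining and/or (I.getCondition()) is not "unrelated" work;
      // without this exclusion we over-prune the RHS dependency set.
      if (UIns != I.getCondition() && !RhsDeps.contains(UIns))
        return false; // computed for unrelated code anyway; don't count it
  }
  return true;
};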

chapuni added a commit that referenced this pull request Mar 3, 2024
… cond1))` into separate branches"

This has been buggy for a while.

Reverts #81689
This reverts commit ae76dfb.
@chapuni
Contributor

chapuni commented Mar 3, 2024

@goldsteinn Excuse me, I have reverted this. Could you work on it here, please?

@chapuni chapuni reopened this Mar 3, 2024
@goldsteinn
Contributor Author

@goldsteinn Excuse me, I have reverted this. Could you work on it here, please?

Ah, sure :)

…` into separate branches

Changes in Recommit:
    1) Fix non-determinism by using `SmallMapVector` instead of
       `SmallPtrSet`.
    2) Fix bug in dependency pruning where we discounted the actual
       `and/or` combining the two conditions. This led to over-pruning.

It makes sense to split if the cost of computing `cond1` is high
(proportionally to how likely `cond0` is), but it doesn't really make
sense to introduce a second branch if it's only a few instructions.

Splitting can also get in the way of potentially folding patterns.

This patch introduces some logic to try to check if the cost of
computing `cond1` is relatively low, and if so don't split the
branches.

Modest improvement on clang bootstrap build:
https://llvm-compile-time-tracker.com/compare.php?from=79ce933114e46c891a5632f7ad4a004b93a5b808&to=978278eabc0bafe2f390ca8fcdad24154f954020&stat=cycles
Average stage2-O3:   0.59% Improvement (cycles)
Average stage2-O0-g: 1.20% Improvement (cycles)

Likewise on llvm-test-suite on SKX saw a net 0.84% improvement  (cycles)

There is also a modest compile time improvement with this patch:
https://llvm-compile-time-tracker.com/compare.php?from=79ce933114e46c891a5632f7ad4a004b93a5b808&to=978278eabc0bafe2f390ca8fcdad24154f954020&stat=instructions%3Au

Note that the stage2 instruction count increase is expected; this
patch trades instructions for decreasing branch-misses (which is
proportionately lower):
https://llvm-compile-time-tracker.com/compare.php?from=79ce933114e46c891a5632f7ad4a004b93a5b808&to=978278eabc0bafe2f390ca8fcdad24154f954020&stat=branch-misses

NB: This will also likely help for APX targets with the new `CCMP` and
`CTEST` instructions.
@nikic
Contributor

nikic commented Mar 3, 2024

Modest improvement on clang bootstrap build:
https://llvm-compile-time-tracker.com/compare.php?from=79ce933114e46c891a5632f7ad4a004b93a5b808&to=978278eabc0bafe2f390ca8fcdad24154f954020&stat=cycles
Average stage2-O3: 0.59% Improvement (cycles)
Average stage2-O0-g: 1.20% Improvement (cycles)

FWIW, this is most likely noise. At least I didn't see any stable improvement to cycles after this change landed.

@goldsteinn
Contributor Author

goldsteinn commented Mar 3, 2024

Modest improvement on clang bootstrap build:
https://llvm-compile-time-tracker.com/compare.php?from=79ce933114e46c891a5632f7ad4a004b93a5b808&to=978278eabc0bafe2f390ca8fcdad24154f954020&stat=cycles
Average stage2-O3: 0.59% Improvement (cycles)
Average stage2-O0-g: 1.20% Improvement (cycles)

FWIW, this is most likely noise. At least I didn't see any stable improvement to cycles after this change landed.

Ah, yeah, looked through it; I didn't realize cycle variance was so high. I'll run SPEC locally and update the message.

Across the LLVM benchmark suite + SPEC this gets a whopping 2.2% improvement (geometric mean of N=3 runs) on SKX.
SPEC alone is less impressive at a 0.2% improvement.

Contributor

@chapuni chapuni left a comment

Thanks. This works for me.

for (const auto &InsPair : RhsDeps) {
if (!ShouldCountInsn(InsPair.first)) {
ToDrop = InsPair.first;
break;
Contributor

Just a question: does it work, or improve things, if we erase all of the found Instructions, not only the first one found?
(Then it should work without the MapVector.)

// Sketch: first collect everything to be erased, then erase in bulk.
SmallVector<const Instruction *, 8> ToErase;
for (const auto &InsPair : RhsDeps)
  if (!ShouldCountInsn(InsPair.first))
    ToErase.push_back(InsPair.first);
for (const Instruction *I : ToErase)
  RhsDeps.erase(I);

Contributor Author

Yes, we could do it in bulk, but we'd still need to repeat after each change (or until the max prune iteration count is reached). I think that since we have a maximum number of pruning iterations, we could still have issues if the iteration order is not fixed.

@goldsteinn goldsteinn closed this in a4951ec Mar 4, 2024