
[X86] Don't always separate conditions in (br (and/or cond0, cond1)) into separate branches #81689

Closed
wants to merge 1 commit

Conversation

goldsteinn (Contributor) commented Feb 13, 2024

Changes in Recommit:
1) Fix non-determinism by using SmallMapVector instead of
SmallPtrSet (see the sketch below).
2) Fix a bug in dependency pruning where we discounted the actual
and/or instruction combining the two conditions. This led to over-pruning.
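
As a minimal sketch of the determinism point in item 1 (the walk helpers here are invented for illustration): SmallPtrSet iterates in an order derived from the pointer values themselves, so a cost walk over it can visit dependencies in a different order on otherwise identical runs, while SmallMapVector iterates in insertion order.

  #include "llvm/ADT/MapVector.h"
  #include "llvm/ADT/SmallPtrSet.h"
  #include "llvm/IR/Instruction.h"
  using namespace llvm;

  // Pointer-keyed set: visit order depends on where the Instructions happen
  // to be allocated, which varies from run to run.
  static void walkSet(const SmallPtrSet<const Instruction *, 8> &Deps) {
    for (const Instruction *I : Deps)
      (void)I; // nondeterministic order
  }

  // MapVector keeps a parallel vector of entries, so iteration follows
  // insertion order and the pruning/cost walk becomes deterministic.
  static void walkMap(const SmallMapVector<const Instruction *, bool, 8> &Deps) {
    for (const auto &Entry : Deps)
      (void)Entry.first; // stable insertion order
  }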

It makes sense to split if the cost of computing cond1 is high
(proportionally to how likely cond0 is), but it doesn't really make
sense to introduce a second branch if it's only a few instructions.

Splitting can also get in the way of potentially foldable patterns.

This patch introduces logic to check whether the cost of computing
cond1 is relatively low and, if so, to keep the branches together
instead of splitting them.
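
As a hypothetical illustration of the tradeoff (the assembly in the comments is a hand-written sketch of the two strategies, not actual output of this patch):

  // Both sub-conditions are trivially cheap, so one branch is preferable.
  bool bothEqual(int A, int B, int C, int D) {
    // IR shape: br (and (icmp eq A, B), (icmp eq C, D))
    //
    // Split lowering -- two conditional branches, two chances to mispredict:
    //   cmpl %esi, %edi
    //   jne  .Lfalse
    //   cmpl %ecx, %edx
    //   jne  .Lfalse
    //
    // Merged lowering -- a couple of extra ALU ops but a single branch:
    //   cmpl  %esi, %edi
    //   sete  %al
    //   cmpl  %ecx, %edx
    //   sete  %cl
    //   testb %al, %cl
    //   je    .Lfalse
    return A == B && C == D;
  }

Splitting pays off when computing the second condition is expensive relative to how often the first condition short-circuits; that is what the cost threshold in the patch below tries to model.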

Modest improvement on clang bootstrap build:
https://llvm-compile-time-tracker.com/compare.php?from=79ce933114e46c891a5632f7ad4a004b93a5b808&to=978278eabc0bafe2f390ca8fcdad24154f954020&stat=cycles
Average stage2-O3: 0.59% Improvement (cycles)
Average stage2-O0-g: 1.20% Improvement (cycles)

Likewise, llvm-test-suite on SKX saw a net 0.84% improvement (cycles).

There is also a modest compile time improvement with this patch:
https://llvm-compile-time-tracker.com/compare.php?from=79ce933114e46c891a5632f7ad4a004b93a5b808&to=978278eabc0bafe2f390ca8fcdad24154f954020&stat=instructions%3Au

Note that the stage2 instruction count increase is expected; this
patch trades instructions for fewer branch misses (which are
proportionately lower):
https://llvm-compile-time-tracker.com/compare.php?from=79ce933114e46c891a5632f7ad4a004b93a5b808&to=978278eabc0bafe2f390ca8fcdad24154f954020&stat=branch-misses

NB: This will also likely help for APX targets with the new CCMP and
CTEST instructions.

llvmbot added the backend:X86 and llvm:SelectionDAG labels Feb 13, 2024
llvmbot (Collaborator) commented Feb 13, 2024

@llvm/pr-subscribers-llvm-selectiondag

Author: None (goldsteinn)

Patch is 104.14 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/81689.diff

27 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/TargetLowering.h (+7)
  • (modified) llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp (+5-2)
  • (modified) llvm/lib/Target/X86/X86ISelLowering.cpp (+159)
  • (modified) llvm/lib/Target/X86/X86ISelLowering.h (+5)
  • (modified) llvm/test/CodeGen/X86/2006-04-27-ISelFoldingBug.ll (+6-5)
  • (modified) llvm/test/CodeGen/X86/2007-08-09-IllegalX86-64Asm.ll (+56-53)
  • (modified) llvm/test/CodeGen/X86/2007-12-18-LoadCSEBug.ll (+5-8)
  • (modified) llvm/test/CodeGen/X86/2008-02-18-TailMergingBug.ll (+3-2)
  • (modified) llvm/test/CodeGen/X86/avx-cmp.ll (+10-17)
  • (modified) llvm/test/CodeGen/X86/cmp.ll (+30-22)
  • (modified) llvm/test/CodeGen/X86/dagcombine-and-setcc.ll (+2-1)
  • (modified) llvm/test/CodeGen/X86/div-rem-pair-recomposition-unsigned.ll (+154-152)
  • (modified) llvm/test/CodeGen/X86/inline-spiller-impdef-on-implicit-def-regression.ll (+31-39)
  • (modified) llvm/test/CodeGen/X86/lsr-addrecloops.ll (+94-4)
  • (modified) llvm/test/CodeGen/X86/movmsk-cmp.ll (+42-55)
  • (modified) llvm/test/CodeGen/X86/or-branch.ll (+5-4)
  • (modified) llvm/test/CodeGen/X86/peephole-na-phys-copy-folding.ll (+72-70)
  • (modified) llvm/test/CodeGen/X86/pr33747.ll (+9-8)
  • (modified) llvm/test/CodeGen/X86/pr37025.ll (+22-16)
  • (modified) llvm/test/CodeGen/X86/pr38795.ll (+58-73)
  • (modified) llvm/test/CodeGen/X86/setcc-logic.ll (+30-46)
  • (modified) llvm/test/CodeGen/X86/swifterror.ll (-10)
  • (modified) llvm/test/CodeGen/X86/tail-dup-merge-loop-headers.ll (+50-69)
  • (modified) llvm/test/CodeGen/X86/tail-opts.ll (+2-3)
  • (modified) llvm/test/CodeGen/X86/tailcall-extract.ll (+4-4)
  • (modified) llvm/test/CodeGen/X86/test-shrink-bug.ll (+18-16)
  • (modified) llvm/test/CodeGen/X86/x86-shrink-wrap-unwind.ll (+28-24)
diff --git a/llvm/include/llvm/CodeGen/TargetLowering.h b/llvm/include/llvm/CodeGen/TargetLowering.h
index 612433b54f6e44..85288a3ae519ee 100644
--- a/llvm/include/llvm/CodeGen/TargetLowering.h
+++ b/llvm/include/llvm/CodeGen/TargetLowering.h
@@ -596,6 +596,13 @@ class TargetLoweringBase {
   /// avoided.
   bool isJumpExpensive() const { return JumpIsExpensive; }
 
+  virtual bool keepJumpConditionsTogether(const FunctionLoweringInfo &,
+                                          const BranchInst &,
+                                          Instruction::BinaryOps, const Value *,
+                                          const Value *) const {
+    return false;
+  }
+
   /// Return true if selects are only cheaper than branches if the branch is
   /// unlikely to be predicted right.
   bool isPredictableSelectExpensive() const {
diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
index 28664b2ed9052d..12a6ec5f7dfb3b 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
@@ -2646,8 +2646,11 @@ void SelectionDAGBuilder::visitBr(const BranchInst &I) {
     else if (match(BOp, m_LogicalOr(m_Value(BOp0), m_Value(BOp1))))
       Opcode = Instruction::Or;
 
-    if (Opcode && !(match(BOp0, m_ExtractElt(m_Value(Vec), m_Value())) &&
-                    match(BOp1, m_ExtractElt(m_Specific(Vec), m_Value())))) {
+    if (Opcode &&
+        !(match(BOp0, m_ExtractElt(m_Value(Vec), m_Value())) &&
+          match(BOp1, m_ExtractElt(m_Specific(Vec), m_Value()))) &&
+        !DAG.getTargetLoweringInfo().keepJumpConditionsTogether(
+            FuncInfo, I, Opcode, BOp0, BOp1)) {
       FindMergedConditions(BOp, Succ0MBB, Succ1MBB, BrMBB, BrMBB, Opcode,
                            getEdgeProbability(BrMBB, Succ0MBB),
                            getEdgeProbability(BrMBB, Succ1MBB),
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 18f9871b2bd0c3..8ff48630cfc923 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -27,9 +27,12 @@
 #include "llvm/ADT/StringExtras.h"
 #include "llvm/ADT/StringSwitch.h"
 #include "llvm/Analysis/BlockFrequencyInfo.h"
+#include "llvm/Analysis/BranchProbabilityInfo.h"
 #include "llvm/Analysis/ObjCARCUtil.h"
 #include "llvm/Analysis/ProfileSummaryInfo.h"
+#include "llvm/Analysis/TargetTransformInfo.h"
 #include "llvm/Analysis/VectorUtils.h"
+#include "llvm/CodeGen/FunctionLoweringInfo.h"
 #include "llvm/CodeGen/IntrinsicLowering.h"
 #include "llvm/CodeGen/MachineFrameInfo.h"
 #include "llvm/CodeGen/MachineFunction.h"
@@ -55,6 +58,7 @@
 #include "llvm/MC/MCContext.h"
 #include "llvm/MC/MCExpr.h"
 #include "llvm/MC/MCSymbol.h"
+#include "llvm/Support/BranchProbability.h"
 #include "llvm/Support/CommandLine.h"
 #include "llvm/Support/Debug.h"
 #include "llvm/Support/ErrorHandling.h"
@@ -77,6 +81,24 @@ static cl::opt<int> ExperimentalPrefInnermostLoopAlignment(
         "alignment set by x86-experimental-pref-loop-alignment."),
     cl::Hidden);
 
+static cl::opt<int> BrMergingBaseCostThresh(
+    "x86-cond-base", cl::init(1),
+    cl::desc(
+        "Base."),
+    cl::Hidden);
+
+static cl::opt<int> BrMergingLikelyBias(
+    "x86-cond-likely-bias", cl::init(0),
+    cl::desc(
+        "Likely."),
+    cl::Hidden);
+
+static cl::opt<int> BrMergingUnlikelyBias(
+    "x86-cond-unlikely-bias", cl::init(1),
+    cl::desc(
+        "Unlikely."),
+    cl::Hidden);
+
 static cl::opt<bool> MulConstantOptimization(
     "mul-constant-optimization", cl::init(true),
     cl::desc("Replace 'mul x, Const' with more effective instructions like "
@@ -3339,6 +3361,143 @@ unsigned X86TargetLowering::preferedOpcodeForCmpEqPiecesOfOperand(
   return ISD::SRL;
 }
 
+// Collect dependings on V recursively. This is used for the cost analysis in
+// `keepJumpConditionsTogether`.
+static bool
+collectDeps(SmallPtrSet<const Instruction *, 8> *Deps, const Value *V,
+            SmallPtrSet<const Instruction *, 8> *Necessary = nullptr,
+            unsigned Depth = 0) {
+  // Return false if we have an incomplete count.
+  if (Depth >= 6)
+    return false;
+
+  auto *I = dyn_cast<Instruction>(V);
+  if (I == nullptr)
+    return true;
+
+  if (Necessary != nullptr) {
+    // This instruction is necessary for the other side of the condition so
+    // don't count it.
+    if (Necessary->contains(I))
+      return true;
+  }
+
+  // Already added this dep.
+  if (!Deps->insert(I).second)
+    return true;
+
+  for (unsigned OpIdx = 0; OpIdx < I->getNumOperands(); ++OpIdx)
+    if (!collectDeps(Deps, I->getOperand(OpIdx), Necessary, Depth + 1))
+      return false;
+  return true;
+}
+
+bool X86TargetLowering::keepJumpConditionsTogether(
+    const FunctionLoweringInfo &FuncInfo, const BranchInst &I,
+    Instruction::BinaryOps Opc, const Value *Lhs, const Value *Rhs) const {
+  using namespace llvm::PatternMatch;
+  if (I.getNumSuccessors() != 2)
+    return false;
+
+  // Baseline cost. This is properly arbitrary.
+  InstructionCost CostThresh = BrMergingBaseCostThresh.getValue();
+  if (BrMergingBaseCostThresh.getValue() < 0)
+    return false;
+
+  // a == b && c == d can be efficiently combined.
+  ICmpInst::Predicate Pred;
+  if (Opc == Instruction::And &&
+      match(Lhs, m_ICmp(Pred, m_Value(), m_Value())) &&
+      Pred == ICmpInst::ICMP_EQ &&
+      match(Rhs, m_ICmp(Pred, m_Value(), m_Value())) &&
+      Pred == ICmpInst::ICMP_EQ)
+    CostThresh += 1;
+
+  BranchProbabilityInfo *BPI = nullptr;
+  if (BrMergingLikelyBias.getValue() || BrMergingUnlikelyBias.getValue())
+    BPI = FuncInfo.BPI;
+  if (BPI != nullptr) {
+    BasicBlock *IfFalse = I.getSuccessor(0);
+    BasicBlock *IfTrue = I.getSuccessor(1);
+
+    std::optional<bool> Likely;
+    if (BPI->isEdgeHot(I.getParent(), IfTrue))
+      Likely = true;
+    else if (BPI->isEdgeHot(I.getParent(), IfFalse))
+      Likely = false;
+
+    if (Likely) {
+      if (Opc == (*Likely ? Instruction::And : Instruction::Or))
+        // Its likely we will have to compute both lhs and rhs of condition
+        CostThresh += BrMergingLikelyBias.getValue();
+      else {
+        // Its likely we will get an early out.
+        CostThresh -= BrMergingUnlikelyBias.getValue();
+        if (BrMergingUnlikelyBias.getValue() < 0) {
+          return false;
+        }
+      }
+    }
+  }
+
+  if (CostThresh <= 0)
+    return false;
+
+  // Collect "all" instructions that lhs condition is dependent on.
+  SmallPtrSet<const Instruction *, 8> LhsDeps, RhsDeps;
+  collectDeps(&LhsDeps, Lhs);
+  // Collect "all" instructions that rhs condition is dependent on AND are
+  // dependencies of lhs. This gives us an estimate on which instructions we
+  // stand to save by splitting the condition.
+  if (!collectDeps(&RhsDeps, Rhs, &LhsDeps))
+    return false;
+  // Add the compare instruction itself unless its a dependency on the LHS.
+  if (const auto *RhsI = dyn_cast<Instruction>(Rhs))
+    if (!LhsDeps.contains(RhsI))
+      RhsDeps.insert(RhsI);
+  const auto &TTI = getTargetMachine().getTargetTransformInfo(*I.getFunction());
+
+  InstructionCost CostOfIncluding = 0;
+  // See if this instruction will need to computed independently of whether RHS
+  // is.
+  auto ShouldCountInsn = [&RhsDeps](const Instruction *Ins) {
+    for (const auto *U : Ins->users()) {
+      // If user is independent of RHS calculation we don't need to count it.
+      if (auto *UIns = dyn_cast<Instruction>(U))
+        if (!RhsDeps.contains(UIns))
+          return false;
+    }
+    return true;
+  };
+
+  // Prune instructions from RHS Deps that are dependencies of unrelated
+  // instructions.
+  const unsigned MaxPruneIters = 8;
+  // Stop after a certain point. No incorrectness from including too many
+  // instructions.
+  for (unsigned PruneIters = 0; PruneIters < MaxPruneIters; ++PruneIters) {
+    const Instruction *ToDrop = nullptr;
+    for (const auto *Ins : RhsDeps) {
+      if (!ShouldCountInsn(Ins)) {
+        ToDrop = Ins;
+        break;
+      }
+    }
+    if (ToDrop == nullptr)
+      break;
+    RhsDeps.erase(ToDrop);
+  }
+
+  for (const auto *Ins : RhsDeps) {
+    CostOfIncluding +=
+        TTI.getInstructionCost(Ins, TargetTransformInfo::TCK_Latency);
+
+    if (CostOfIncluding > CostThresh)
+      return false;
+  }
+  return true;
+}
+
 bool X86TargetLowering::preferScalarizeSplat(SDNode *N) const {
   return N->getOpcode() != ISD::FP_EXTEND;
 }
diff --git a/llvm/lib/Target/X86/X86ISelLowering.h b/llvm/lib/Target/X86/X86ISelLowering.h
index f93c54781846bf..f536c8a6766b68 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.h
+++ b/llvm/lib/Target/X86/X86ISelLowering.h
@@ -1150,6 +1150,11 @@ namespace llvm {
 
     bool preferScalarizeSplat(SDNode *N) const override;
 
+    bool keepJumpConditionsTogether(const FunctionLoweringInfo &,
+                                    const BranchInst &, Instruction::BinaryOps,
+                                    const Value *,
+                                    const Value *) const override;
+
     bool shouldFoldConstantShiftPairToMask(const SDNode *N,
                                            CombineLevel Level) const override;
 
diff --git a/llvm/test/CodeGen/X86/2006-04-27-ISelFoldingBug.ll b/llvm/test/CodeGen/X86/2006-04-27-ISelFoldingBug.ll
index 0044d1c3568377..e6f28c2057f775 100644
--- a/llvm/test/CodeGen/X86/2006-04-27-ISelFoldingBug.ll
+++ b/llvm/test/CodeGen/X86/2006-04-27-ISelFoldingBug.ll
@@ -18,15 +18,16 @@ define i1 @loadAndRLEsource_no_exit_2E_1_label_2E_0(i32 %tmp.21.reload, i32 %tmp
 ; CHECK-NEXT:    movl _block, %esi
 ; CHECK-NEXT:    movb %al, 1(%esi,%edx)
 ; CHECK-NEXT:    cmpl %ecx, _last
-; CHECK-NEXT:    jge LBB0_3
-; CHECK-NEXT:  ## %bb.1: ## %label.0
+; CHECK-NEXT:    setl %cl
 ; CHECK-NEXT:    cmpl $257, %eax ## imm = 0x101
-; CHECK-NEXT:    je LBB0_3
-; CHECK-NEXT:  ## %bb.2: ## %label.0.no_exit.1_crit_edge.exitStub
+; CHECK-NEXT:    setne %al
+; CHECK-NEXT:    testb %al, %cl
+; CHECK-NEXT:    je LBB0_2
+; CHECK-NEXT:  ## %bb.1: ## %label.0.no_exit.1_crit_edge.exitStub
 ; CHECK-NEXT:    movb $1, %al
 ; CHECK-NEXT:    popl %esi
 ; CHECK-NEXT:    retl
-; CHECK-NEXT:  LBB0_3: ## %codeRepl5.exitStub
+; CHECK-NEXT:  LBB0_2: ## %codeRepl5.exitStub
 ; CHECK-NEXT:    xorl %eax, %eax
 ; CHECK-NEXT:    popl %esi
 ; CHECK-NEXT:    retl
diff --git a/llvm/test/CodeGen/X86/2007-08-09-IllegalX86-64Asm.ll b/llvm/test/CodeGen/X86/2007-08-09-IllegalX86-64Asm.ll
index 7bdc4e19a1cf66..28b4541c1bfc7f 100644
--- a/llvm/test/CodeGen/X86/2007-08-09-IllegalX86-64Asm.ll
+++ b/llvm/test/CodeGen/X86/2007-08-09-IllegalX86-64Asm.ll
@@ -44,7 +44,7 @@ define ptr @ubyte_divmod(ptr %a, ptr %b) {
 ; CHECK-NEXT:    leaq {{[0-9]+}}(%rsp), %rsi
 ; CHECK-NEXT:    callq __ubyte_convert_to_ctype
 ; CHECK-NEXT:    testl %eax, %eax
-; CHECK-NEXT:    js LBB0_4
+; CHECK-NEXT:    js LBB0_6
 ; CHECK-NEXT:  ## %bb.1: ## %cond_next.i
 ; CHECK-NEXT:    leaq {{[0-9]+}}(%rsp), %rsi
 ; CHECK-NEXT:    movq %rbx, %rdi
@@ -53,81 +53,84 @@ define ptr @ubyte_divmod(ptr %a, ptr %b) {
 ; CHECK-NEXT:    sarl $31, %ecx
 ; CHECK-NEXT:    andl %eax, %ecx
 ; CHECK-NEXT:    cmpl $-2, %ecx
-; CHECK-NEXT:    je LBB0_8
+; CHECK-NEXT:    je LBB0_10
 ; CHECK-NEXT:  ## %bb.2: ## %cond_next.i
 ; CHECK-NEXT:    cmpl $-1, %ecx
-; CHECK-NEXT:    jne LBB0_6
-; CHECK-NEXT:  LBB0_3: ## %bb4
+; CHECK-NEXT:    jne LBB0_3
+; CHECK-NEXT:  LBB0_8: ## %bb4
 ; CHECK-NEXT:    movq _PyArray_API@GOTPCREL(%rip), %rax
 ; CHECK-NEXT:    movq (%rax), %rax
 ; CHECK-NEXT:    movq 16(%rax), %rax
-; CHECK-NEXT:    jmp LBB0_10
-; CHECK-NEXT:  LBB0_4: ## %_ubyte_convert2_to_ctypes.exit
+; CHECK-NEXT:    jmp LBB0_9
+; CHECK-NEXT:  LBB0_6: ## %_ubyte_convert2_to_ctypes.exit
 ; CHECK-NEXT:    cmpl $-2, %eax
-; CHECK-NEXT:    je LBB0_8
-; CHECK-NEXT:  ## %bb.5: ## %_ubyte_convert2_to_ctypes.exit
+; CHECK-NEXT:    je LBB0_10
+; CHECK-NEXT:  ## %bb.7: ## %_ubyte_convert2_to_ctypes.exit
 ; CHECK-NEXT:    cmpl $-1, %eax
-; CHECK-NEXT:    je LBB0_3
-; CHECK-NEXT:  LBB0_6: ## %bb35
+; CHECK-NEXT:    je LBB0_8
+; CHECK-NEXT:  LBB0_3: ## %bb35
 ; CHECK-NEXT:    movq _PyUFunc_API@GOTPCREL(%rip), %r14
 ; CHECK-NEXT:    movq (%r14), %rax
 ; CHECK-NEXT:    callq *216(%rax)
 ; CHECK-NEXT:    movzbl {{[0-9]+}}(%rsp), %edx
 ; CHECK-NEXT:    testb %dl, %dl
-; CHECK-NEXT:    je LBB0_11
-; CHECK-NEXT:  ## %bb.7: ## %cond_false.i
+; CHECK-NEXT:    je LBB0_4
+; CHECK-NEXT:  ## %bb.12: ## %cond_false.i
+; CHECK-NEXT:    setne %dil
 ; CHECK-NEXT:    movzbl {{[0-9]+}}(%rsp), %esi
 ; CHECK-NEXT:    movzbl %sil, %ecx
 ; CHECK-NEXT:    movl %ecx, %eax
 ; CHECK-NEXT:    divb %dl
 ; CHECK-NEXT:    movl %eax, %r15d
 ; CHECK-NEXT:    testb %cl, %cl
-; CHECK-NEXT:    jne LBB0_12
-; CHECK-NEXT:    jmp LBB0_14
-; CHECK-NEXT:  LBB0_8: ## %bb17
+; CHECK-NEXT:    setne %al
+; CHECK-NEXT:    testb %dil, %al
+; CHECK-NEXT:    jne LBB0_5
+; CHECK-NEXT:  LBB0_13: ## %cond_true.i200
+; CHECK-NEXT:    testb %dl, %dl
+; CHECK-NEXT:    jne LBB0_15
+; CHECK-NEXT:  ## %bb.14: ## %cond_true14.i
+; CHECK-NEXT:    movl $4, %edi
+; CHECK-NEXT:    callq _feraiseexcept
+; CHECK-NEXT:  LBB0_15: ## %ubyte_ctype_remainder.exit
+; CHECK-NEXT:    xorl %ebx, %ebx
+; CHECK-NEXT:    jmp LBB0_16
+; CHECK-NEXT:  LBB0_10: ## %bb17
 ; CHECK-NEXT:    callq _PyErr_Occurred
 ; CHECK-NEXT:    testq %rax, %rax
-; CHECK-NEXT:    jne LBB0_27
-; CHECK-NEXT:  ## %bb.9: ## %cond_next
+; CHECK-NEXT:    jne LBB0_23
+; CHECK-NEXT:  ## %bb.11: ## %cond_next
 ; CHECK-NEXT:    movq _PyArray_API@GOTPCREL(%rip), %rax
 ; CHECK-NEXT:    movq (%rax), %rax
 ; CHECK-NEXT:    movq 80(%rax), %rax
-; CHECK-NEXT:  LBB0_10: ## %bb4
+; CHECK-NEXT:  LBB0_9: ## %bb4
 ; CHECK-NEXT:    movq 96(%rax), %rax
 ; CHECK-NEXT:    movq %r14, %rdi
 ; CHECK-NEXT:    movq %rbx, %rsi
 ; CHECK-NEXT:    callq *40(%rax)
-; CHECK-NEXT:    jmp LBB0_28
-; CHECK-NEXT:  LBB0_11: ## %cond_true.i
+; CHECK-NEXT:    jmp LBB0_24
+; CHECK-NEXT:  LBB0_4: ## %cond_true.i
 ; CHECK-NEXT:    movl $4, %edi
 ; CHECK-NEXT:    callq _feraiseexcept
 ; CHECK-NEXT:    movzbl {{[0-9]+}}(%rsp), %edx
 ; CHECK-NEXT:    movzbl {{[0-9]+}}(%rsp), %esi
-; CHECK-NEXT:    xorl %r15d, %r15d
 ; CHECK-NEXT:    testb %sil, %sil
-; CHECK-NEXT:    je LBB0_14
-; CHECK-NEXT:  LBB0_12: ## %cond_false.i
+; CHECK-NEXT:    sete %al
 ; CHECK-NEXT:    testb %dl, %dl
-; CHECK-NEXT:    je LBB0_14
-; CHECK-NEXT:  ## %bb.13: ## %cond_next17.i
+; CHECK-NEXT:    sete %cl
+; CHECK-NEXT:    xorl %r15d, %r15d
+; CHECK-NEXT:    orb %al, %cl
+; CHECK-NEXT:    jne LBB0_13
+; CHECK-NEXT:  LBB0_5: ## %cond_next17.i
 ; CHECK-NEXT:    movzbl %sil, %eax
 ; CHECK-NEXT:    divb %dl
 ; CHECK-NEXT:    movzbl %ah, %ebx
-; CHECK-NEXT:    jmp LBB0_18
-; CHECK-NEXT:  LBB0_14: ## %cond_true.i200
-; CHECK-NEXT:    testb %dl, %dl
-; CHECK-NEXT:    jne LBB0_17
-; CHECK-NEXT:  ## %bb.16: ## %cond_true14.i
-; CHECK-NEXT:    movl $4, %edi
-; CHECK-NEXT:    callq _feraiseexcept
-; CHECK-NEXT:  LBB0_17: ## %ubyte_ctype_remainder.exit
-; CHECK-NEXT:    xorl %ebx, %ebx
-; CHECK-NEXT:  LBB0_18: ## %ubyte_ctype_remainder.exit
+; CHECK-NEXT:  LBB0_16: ## %ubyte_ctype_remainder.exit
 ; CHECK-NEXT:    movq (%r14), %rax
 ; CHECK-NEXT:    callq *224(%rax)
 ; CHECK-NEXT:    testl %eax, %eax
-; CHECK-NEXT:    je LBB0_21
-; CHECK-NEXT:  ## %bb.19: ## %cond_true61
+; CHECK-NEXT:    je LBB0_19
+; CHECK-NEXT:  ## %bb.17: ## %cond_true61
 ; CHECK-NEXT:    movl %eax, %ebp
 ; CHECK-NEXT:    movq (%r14), %rax
 ; CHECK-NEXT:    movq _.str5@GOTPCREL(%rip), %rdi
@@ -136,8 +139,8 @@ define ptr @ubyte_divmod(ptr %a, ptr %b) {
 ; CHECK-NEXT:    leaq {{[0-9]+}}(%rsp), %rcx
 ; CHECK-NEXT:    callq *200(%rax)
 ; CHECK-NEXT:    testl %eax, %eax
-; CHECK-NEXT:    js LBB0_27
-; CHECK-NEXT:  ## %bb.20: ## %cond_next73
+; CHECK-NEXT:    js LBB0_23
+; CHECK-NEXT:  ## %bb.18: ## %cond_next73
 ; CHECK-NEXT:    movl $1, {{[0-9]+}}(%rsp)
 ; CHECK-NEXT:    movq (%r14), %rax
 ; CHECK-NEXT:    movq {{[0-9]+}}(%rsp), %rsi
@@ -146,13 +149,13 @@ define ptr @ubyte_divmod(ptr %a, ptr %b) {
 ; CHECK-NEXT:    movl %ebp, %edx
 ; CHECK-NEXT:    callq *232(%rax)
 ; CHECK-NEXT:    testl %eax, %eax
-; CHECK-NEXT:    jne LBB0_27
-; CHECK-NEXT:  LBB0_21: ## %cond_next89
+; CHECK-NEXT:    jne LBB0_23
+; CHECK-NEXT:  LBB0_19: ## %cond_next89
 ; CHECK-NEXT:    movl $2, %edi
 ; CHECK-NEXT:    callq _PyTuple_New
 ; CHECK-NEXT:    testq %rax, %rax
-; CHECK-NEXT:    je LBB0_27
-; CHECK-NEXT:  ## %bb.22: ## %cond_next97
+; CHECK-NEXT:    je LBB0_23
+; CHECK-NEXT:  ## %bb.20: ## %cond_next97
 ; CHECK-NEXT:    movq %rax, %r14
 ; CHECK-NEXT:    movq _PyArray_API@GOTPCREL(%rip), %r12
 ; CHECK-NEXT:    movq (%r12), %rax
@@ -160,8 +163,8 @@ define ptr @ubyte_divmod(ptr %a, ptr %b) {
 ; CHECK-NEXT:    xorl %esi, %esi
 ; CHECK-NEXT:    callq *304(%rdi)
 ; CHECK-NEXT:    testq %rax, %rax
-; CHECK-NEXT:    je LBB0_25
-; CHECK-NEXT:  ## %bb.23: ## %cond_next135
+; CHECK-NEXT:    je LBB0_21
+; CHECK-NEXT:  ## %bb.25: ## %cond_next135
 ; CHECK-NEXT:    movb %r15b, 16(%rax)
 ; CHECK-NEXT:    movq %rax, 24(%r14)
 ; CHECK-NEXT:    movq (%r12), %rax
@@ -169,22 +172,22 @@ define ptr @ubyte_divmod(ptr %a, ptr %b) {
 ; CHECK-NEXT:    xorl %esi, %esi
 ; CHECK-NEXT:    callq *304(%rdi)
 ; CHECK-NEXT:    testq %rax, %rax
-; CHECK-NEXT:    je LBB0_25
-; CHECK-NEXT:  ## %bb.24: ## %cond_next182
+; CHECK-NEXT:    je LBB0_21
+; CHECK-NEXT:  ## %bb.26: ## %cond_next182
 ; CHECK-NEXT:    movb %bl, 16(%rax)
 ; CHECK-NEXT:    movq %rax, 32(%r14)
 ; CHECK-NEXT:    movq %r14, %rax
-; CHECK-NEXT:    jmp LBB0_28
-; CHECK-NEXT:  LBB0_25: ## %cond_true113
+; CHECK-NEXT:    jmp LBB0_24
+; CHECK-NEXT:  LBB0_21: ## %cond_true113
 ; CHECK-NEXT:    decq (%r14)
-; CHECK-NEXT:    jne LBB0_27
-; CHECK-NEXT:  ## %bb.26: ## %cond_true126
+; CHECK-NEXT:    jne LBB0_23
+; CHECK-NEXT:  ## %bb.22: ## %cond_true126
 ; CHECK-NEXT:    movq 8(%r14), %rax
 ; CHECK-NEXT:    movq %r14, %rdi
 ; CHECK-NEXT:    callq *48(%rax)
-; CHECK-NEXT:  LBB0_27: ## %UnifiedReturnBlock
+; CHECK-NEXT:  LBB0_23: ## %UnifiedReturnBlock
 ; CHECK-NEXT:    xorl %eax, %eax
-; CHECK-NEXT:  LBB0_28: ## %UnifiedReturnBlock
+; CHECK-NEXT:  LBB0_24: ## %UnifiedReturnBlock
 ; CHECK-NEXT:    addq $32, %rsp
 ; CHECK-NEXT:    popq %rbx
 ; CHECK-NEXT:    popq %r12
diff --git a/llvm/test/CodeGen/X86/2007-12-18-LoadCSEBug.ll b/llvm/test/CodeGen/X86/2007-12-18-LoadCSEBug.ll
index 4482c5aec8e816..d9d4424267d733 100644
--- a/llvm/test/CodeGen/X86/2007-12-18-LoadCSEBug.ll
+++ b/llvm/test/CodeGen/X86/2007-12-18-LoadCSEBug.ll
@@ -16,15 +16,12 @@ define void @_ada_c34007g() {
 ; CHECK-NEXT:    andl $-8, %esp
 ; CHECK-NEXT:    subl $8, %esp
 ; CHECK-NEXT:    movl (%esp), %eax
+; CHECK-NEXT:    movl {{[0-9]+}}(%esp), %ecx
+; CHECK-NEXT:    orl %eax, %ecx
+; CHECK-NEXT:    sete %cl
 ; CHECK-NEXT:    testl %eax, %eax
-; CHECK-NEXT:    je .LBB0_3
-; CHECK-NEXT:  # %bb.1: # %entry
-; CHECK-NEXT:    orl {{[0-9]+}}(%esp), %eax
-; CHECK-NEXT:    jne .LBB0_3
-; CHECK-NEXT:  # %bb.2: # %entry
-; CHECK-NEXT:    movb $1, %al
-; CHECK-NEXT:    testb %al, %al
-; CHECK-NEXT:  .LBB0_3: # %bb5507
+; CHECK-NEXT:    setne %al
+; CHECK-NEXT:    testb %cl, %al
 ; CHECK-NEXT:    movl %ebp, %esp
 ; CHECK-NEXT:    popl %ebp
 ; CHECK-NEXT:    .cfi_def_cfa %esp, 4
diff --git a/llvm/test/CodeGen/X86/2008-02-18-TailMergingBug.ll b/llvm/test/CodeGen/X86/2008-02-18-TailMergingBug.ll
index dd60e641df2543..9147692db15645 100644
--- a/llvm/test/CodeGen/X86/2008-02-18-TailMergingBug.ll
+++ b/llvm/test/CodeGen/X86/2008-02-18-TailMergingBug.ll
@@ -1,5 +1,6 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 4
 ; REQUIRES: asserts
-; RUN: llc < %s -mtriple=i686-- -mcpu=yonah -stats 2>&1 | grep "Number of block tails merged" | grep 16
+; RUN: llc < %s -mtriple=i686-- -mcpu=yonah -stats 2>&1 | grep "Number of block tails merged" | grep 9
 ; PR1909
 
 @.str = internal constant [48 x i8] c"transformed bounds: (%.2f, %.2f), (%.2f, %.2f)\0A\00"		; <ptr> [#uses=1]
@@ -217,4 +218,4 @@ bb456:		; preds = %bb448, %bb425, %bb417, %bb395, %bb385, %bb371
 	ret void
 }
 
-declare i32 @printf(ptr, ...) nounwind 
+declare i32 @printf(ptr, ...) nounwind
diff --git a/llvm/test/CodeGen/X86/avx-cmp.ll b/llvm/test/CodeGen/X86/avx-cmp.ll
index 502bbf3f5d118b..4ab9c545ed90da 100644
--- a/llvm/test/CodeGen/X86/avx-cmp.ll
+++ b/llvm/test/CodeGen/X86/avx-cmp.ll
@@ -26,40 +26,33 @@ declare void @scale() nounwind
 define v...
[truncated]

llvmbot (Collaborator) commented Feb 13, 2024

@llvm/pr-subscribers-backend-x86

Author: None (goldsteinn)

(The patch summary, file list, and truncated diff are identical to the llvm-selectiondag subscriber comment above.)

github-actions bot commented Feb 13, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

goldsteinn (Contributor Author) commented:

NB: This is nearly the same as: https://reviews.llvm.org/D157581

RKSimon (Collaborator) left a comment:

Some initial comments - but I can't see why so much logic is in x86 directly. The keepJumpConditionsTogether implementation appears very generic, and I'd expect it to be in DAGBuilder and x86 just has a simple TLI flag callback to enable it.

@@ -596,6 +596,13 @@ class TargetLoweringBase {
/// avoided.
bool isJumpExpensive() const { return JumpIsExpensive; }

virtual bool keepJumpConditionsTogether(const FunctionLoweringInfo &,
Collaborator:

Add doxygen description

llvm/lib/Target/X86/X86ISelLowering.cpp (resolved)
llvm/lib/Target/X86/X86ISelLowering.cpp (resolved)
else {
// Its likely we will get an early out.
CostThresh -= BrMergingUnlikelyBias.getValue();
if (BrMergingUnlikelyBias.getValue() < 0) {
Collaborator:

Attempt this earlier?

goldsteinn (Contributor Author) commented:

Some initial comments - but I can't see why so much logic is in x86 directly. The keepJumpConditionsTogether implementation appears very generic, and I'd expect it to be in DAGBuilder and x86 just has a simple TLI flag callback to enable it.

Fair enough. I'll put the code somewhere generic and make it a call in.

goldsteinn (Contributor Author) commented:

Have moved the code to SelectionDAG, with targets providing the tuning params. I kept the special case for (a == b && c == d) in x86 as I'm not sure that's really general.
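
For reference, the reworked hook presumably ends up looking something like this sketch in TargetLoweringBase (the hook name getJumpConditionMergingParams and the default return are assumptions; only the CondMergingParams struct is quoted in the review below):

  // Sketch, not the verbatim patch:
  struct CondMergingParams {
    int BaseCost;     // base latency budget for keeping conditions merged
    int LikelyBias;   // extra budget when both conditions likely execute
    int UnlikelyBias; // penalty when an early out is likely
  };

  // Hypothetical target hook: a negative BaseCost always splits,
  // preserving the old default behavior for targets that don't opt in.
  virtual CondMergingParams
  getJumpConditionMergingParams(Instruction::BinaryOps Opc, const Value *Lhs,
                                const Value *Rhs) const {
    return {-1, -1, -1};
  }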

goldsteinn (Contributor Author) commented:

ping

struct CondMergingParams {
int BaseCost;
int LikelyBias;
int UnlikelyBias;
Collaborator:

Add comments

int UnlikelyBias;
};
// Return params for deciding if we should keep two branch conditions merged
// or split them into two seperate branches.
Collaborator:

Add parameter descriptions to the comment

Contributor Author:

What do you mean by this?

Contributor:

Typo "separate"

Collaborator:

Describe what the 3 function parameters do

@@ -2432,6 +2434,141 @@ SelectionDAGBuilder::EmitBranchForMergedCondition(const Value *Cond,
SL->SwitchCases.push_back(CB);
}

// Collect dependings on V recursively. This is used for the cost analysis in
Collaborator:

dependings -> dependencies?

if (!Deps->insert(I).second)
return true;

for (unsigned OpIdx = 0; OpIdx < I->getNumOperands(); ++OpIdx)
Collaborator:

(style) for (unsigned OpIdx = 0, E = I->getNumOperands(); OpIdx < E; ++OpIdx)


// Prune instructions from RHS Deps that are dependencies of unrelated
// instructions.
const unsigned MaxPruneIters = 8;
Collaborator:

Magic number?

Contributor Author:

Yeah, changed to SelectionDAG::MaxRecursionDepth and added a comment indicating its relative arbitrariness.


for (const auto *Ins : RhsDeps) {
CostOfIncluding +=
TTI.getInstructionCost(Ins, TargetTransformInfo::TCK_Latency);
Collaborator:

Why TCK_Latency?

Contributor Author:

My thought is that we are computing a dependency chain, so latency was the right measurement. We could be a bit smarter and take ILP into account, but that seems overkill when the BaseCost bound is generally relatively low.

Contributor Author:

Adding a comment to that effect.
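
For context, a sketch of how the latency summation above is meant to read, alongside the cost kinds TargetTransformInfo offers (summaries paraphrased; the helper function is invented for illustration):

  #include "llvm/ADT/ArrayRef.h"
  #include "llvm/Analysis/TargetTransformInfo.h"
  #include "llvm/IR/Instruction.h"
  using namespace llvm;

  // TCK_RecipThroughput: cost of overlappable, independent work.
  // TCK_Latency: length of the serial dependency chain -- the fit here,
  // since the RHS deps must all resolve before the merged branch can.
  // TCK_CodeSize / TCK_SizeAndLatency: size-oriented metrics.
  static InstructionCost
  latencyOfDeps(const TargetTransformInfo &TTI,
                ArrayRef<const Instruction *> RhsDeps) {
    InstructionCost Cost = 0;
    for (const Instruction *Ins : RhsDeps)
      Cost += TTI.getInstructionCost(Ins, TargetTransformInfo::TCK_Latency);
    return Cost;
  }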

@@ -55,6 +58,7 @@
#include "llvm/MC/MCContext.h"
#include "llvm/MC/MCExpr.h"
#include "llvm/MC/MCSymbol.h"
#include "llvm/Support/BranchProbability.h"
Collaborator:

Do you still need all these extra includes?

@@ -1,3 +1,4 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 4
Collaborator:

Regenerate tests as a pre-commit so we see the actual diff

@@ -1,5 +1,6 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 4
Collaborator:

This didn't use update_llc_test_checks?

goldsteinn (Contributor Author) commented:

ping

jayfoad (Contributor) commented Feb 26, 2024

[X86] Don't always seperate conditions in (br (and/or cond0, cond1)) into seperate branches

Typo "separate" (twice).

Collaborator

@RKSimon RKSimon left a comment

Mostly there; my remaining concern is how this is supposed to interact with middle-end passes like SimplifyCFG, which are handling something similar.

@@ -596,6 +596,39 @@ class TargetLoweringBase {
/// avoided.
bool isJumpExpensive() const { return JumpIsExpensive; }

// Costs paramateres used by
Collaborator

paramateres -> parameters

int UnlikelyBias;
};
// Return params for deciding if we should keep two branch conditions merged
// or split them into two seperate branches.
Collaborator

Describe what the 3 function parameters do

@goldsteinn goldsteinn changed the title [X86] Don't always seperate conditions in (br (and/or cond0, cond1)) into seperate branches [X86] Don't always separate conditions in (br (and/or cond0, cond1)) into separate branches Feb 26, 2024
@goldsteinn
Contributor Author

Mostly there; my remaining concern is how this is supposed to interact with middle-end passes like SimplifyCFG, which are handling something similar.

I would think this just works to better honor the middle-end's decisions.
I.e., if SimplifyCFG splits the two conditions, we aren't recombining them here. If SimplifyCFG keeps them together, this patch adds backend knowledge that will either confirm that decision or reject it (the current default behavior).
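For a concrete picture, an illustrative (hypothetical, not from the patch) source pattern:

// SimplifyCFG will often fold the two tests into one block ending in
// `br (and cond0, cond1)`; this patch then decides in the backend whether
// that merged branch stays merged or gets re-split.
bool use(long x, long y);
bool both(long x, long y) {
  if (x != 0 && y != 0) // cond0 = (x != 0), cond1 = (y != 0)
    return use(x, y);
  return false;
}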

@goldsteinn
Contributor Author

ping

Collaborator

@RKSimon RKSimon left a comment

LGTM, cheers (with a couple of comments that we should follow up on - happy if you raise issues or want to look at them in the future).

Any other comments?

; CHECK-NEXT: .LBB0_2:
; CHECK-NEXT: setne %cl
; CHECK-NEXT: andb %al, %cl
; CHECK-NEXT: cmpb $1, %cl
Collaborator

Any idea why the and+cmp hasn't folded to test? Not related to this patch, but maybe worth a follow-up investigation?

Contributor Author

Seems to be because of the atomic.

SelectionDAG has 21 nodes:
  t0: ch,glue = EntryToken
    t2: i64,ch = CopyFromReg t0, Register:i64 %0
  t8: i8,ch = llvm.x86.atomic.sub.cc<(volatile load store (s64) on %ir.0, align 64)> t0, TargetConstant:i64<12197>, t2, Constant:i64<1>, TargetConstant:i32<4>
            t4: i64,ch = CopyFromReg t0, Register:i64 %1
          t12: i1 = setcc t4, Constant:i64<0>, setne:ch
          t9: i1 = truncate t8
        t14: i1 = select t12, t9, Constant:i1<0>
      t16: i1 = xor t14, Constant:i1<-1>
    t18: ch = brcond t8:1, t16, BasicBlock:ch< 0x55ddb51cd310>
  t20: ch = br t18, BasicBlock:ch< 0x55ddb51cd210>

We plumb llvm.x86.atomic.sub.cc as i8 instead of i1 and don't fold past the truncate.
Is there any reason llvm.x86.atomic.sub.cc needs to be i8?

; CHECK-NEXT: movl $4, %eax
; CHECK-NEXT: retq
; CHECK-NEXT: .LBB14_2: # %return
; CHECK-NEXT: movl $192, %eax
Collaborator

Pity this didn't become a cmov :(

Contributor Author

Interestingly, none of the new branches became cmov. I bet that when we look for cmov opportunities this is still in (and cond0, cond1) form, and we only form cmov from setcc.
I'll look into this.

@goldsteinn
Contributor Author

LGTM, cheers (with a couple of comments that we should follow up on - happy if you raise issues or want to look at them in the future).

Any other comments?

Thank you for the review; I'll look into the lack of cmov/test. I have an idea for how to fix them.

@goldsteinn
Contributor Author

@nikic
NOTE: going to push this; it will increase instruction count on stage2 compile-time benchmarks (but saves branch misses).

@goldsteinn goldsteinn closed this in ae76dfb Mar 1, 2024
@KanRobert
Contributor

This will also likely help for APX targets with the new CCMP and
CTEST instructions.

In fact, this causes some downstream CCMP tests to fail because the branch is already eliminated. @goldsteinn Is there a flag to undo the change in the tests?

@goldsteinn
Contributor Author

This will also likely help for APX targets with the new CCMP and
CTEST instructions.

In fact, this causes some downstream CCMP tests to fail because the branch is already eliminated. @goldsteinn Is there a flag to undo the change in the tests?

static cl::opt<int> BrMergingBaseCostThresh(
    "x86-br-merging-base-cost", cl::init(1),
    cl::desc(
        "Sets the cost threshold for when multiple conditionals will be merged "
        "into one branch versus be split in multiple branches. Merging "
        "conditionals saves branches at the cost of additional instructions. "
        "This value sets the instruction cost limit, below which conditionals "
        "will be merged, and above which conditionals will be split."),
    cl::Hidden);

If you set that to -1 it will disable the branch merging.
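For example (assuming an IR reproducer named test.ll), `llc -x86-br-merging-base-cost=-1 test.ll` restores the old always-split behavior; when driving through clang, the same option can be passed as `-mllvm -x86-br-merging-base-cost=-1`.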

Contributor

@chapuni chapuni left a comment

This change introduces nondeterminism.

return false;

// Collect "all" instructions that lhs condition is dependent on.
SmallPtrSet<const Instruction *, 8> LhsDeps, RhsDeps;
Contributor

SmallPtrSet should not be iterable.

Contributor Author

Ah, will have a fix up shortly.
I think it's only the first iteration; the second iteration will go through all elements, so order shouldn't matter.
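Roughly, the recommit's fix (a sketch based on the commit message, not the exact code):

// SmallMapVector iterates in insertion order, so the pruning loop no longer
// depends on pointer values the way SmallPtrSet iteration does.
SmallMapVector<const Instruction *, bool, 8> LhsDeps, RhsDeps;
for (const auto &InsPair : RhsDeps) {
  const Instruction *Ins = InsPair.first; // deterministic visit order
  // ... pruning / cost logic ...
}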

Contributor Author

The code simplifies when using a small vector for iteration as well :)

Contributor Author

See: #83687

for (const auto *U : Ins->users()) {
// If user is independent of RHS calculation we don't need to count it.
if (auto *UIns = dyn_cast<Instruction>(U))
if (!RhsDeps.contains(UIns))
Contributor Author

@goldsteinn goldsteinn Mar 3, 2024

Realized there is a bug here: we should be checking that the use also doesn't equal I.getCondition(), otherwise we can end up over-pruning. I have a fix; just testing whether it implies any changes to the tuning. Will post tomorrow.
Edit: not a correctness bug.
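A sketch of the described fix (assuming `I` is the branch being lowered; names follow the surrounding snippets, not the exact landed code):

auto ShouldCountInsn = [&](const Instruction *Ins) {
  for (const User *U : Ins->users()) {
    if (auto *UIns = dyn_cast<Instruction>(U))
      // The combining and/or (I.getCondition()) is not "unrelated" work;
      // without this exclusion we over-prune the RHS dependency set.
      if (UIns != I.getCondition() && !RhsDeps.contains(UIns))
        return false; // computed for unrelated code anyway; don't count it
  }
  return true;
};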

chapuni added a commit that referenced this pull request Mar 3, 2024
… cond1))` into separate branches"

This has been buggy for a while.

Reverts #81689
This reverts commit ae76dfb.
@chapuni
Contributor

chapuni commented Mar 3, 2024

@goldsteinn Excuse me, I have reverted this. Could you work on it here, please?

@chapuni chapuni reopened this Mar 3, 2024
@goldsteinn
Contributor Author

@goldsteinn Excuse me, I have reverted this. Could you work on it here, please?

Ah, sure :)

…` into separate branches

Changes in Recommit:
    1) Fix non-determinism by using `SmallMapVector` instead of
       `SmallPtrSet`.
    2) Fix bug in dependency pruning where we discounted the actual
       `and/or` combining the two conditions. This led to over-pruning.

It makes sense to split if the cost of computing `cond1` is high
(proportionally to how likely `cond0` is), but it doesn't really make
sense to introduce a second branch if it's only a few instructions.

Splitting can also get in the way of potentially folding patterns.

This patch introduces some logic to try to check if the cost of
computing `cond1` is relatively low, and if so don't split the
branches.

Modest improvement on clang bootstrap build:
https://llvm-compile-time-tracker.com/compare.php?from=79ce933114e46c891a5632f7ad4a004b93a5b808&to=978278eabc0bafe2f390ca8fcdad24154f954020&stat=cycles
Average stage2-O3:   0.59% Improvement (cycles)
Average stage2-O0-g: 1.20% Improvement (cycles)

Likewise on llvm-test-suite on SKX saw a net 0.84% improvement  (cycles)

There is also a modest compile time improvement with this patch:
https://llvm-compile-time-tracker.com/compare.php?from=79ce933114e46c891a5632f7ad4a004b93a5b808&to=978278eabc0bafe2f390ca8fcdad24154f954020&stat=instructions%3Au

Note that the stage2 instruction count increase is expected; this
patch trades instructions for decreasing branch-misses (which is
proportionately lower):
https://llvm-compile-time-tracker.com/compare.php?from=79ce933114e46c891a5632f7ad4a004b93a5b808&to=978278eabc0bafe2f390ca8fcdad24154f954020&stat=branch-misses

NB: This will also likely help for APX targets with the new `CCMP` and
`CTEST` instructions.
@nikic
Contributor

nikic commented Mar 3, 2024

Modest improvement on clang bootstrap build:
https://llvm-compile-time-tracker.com/compare.php?from=79ce933114e46c891a5632f7ad4a004b93a5b808&to=978278eabc0bafe2f390ca8fcdad24154f954020&stat=cycles
Average stage2-O3: 0.59% Improvement (cycles)
Average stage2-O0-g: 1.20% Improvement (cycles)

FWIW, this is most likely noise. At least I didn't see any stable improvement to cycles after this change landed.

@goldsteinn
Contributor Author

goldsteinn commented Mar 3, 2024

Modest improvement on clang bootstrap build:
https://llvm-compile-time-tracker.com/compare.php?from=79ce933114e46c891a5632f7ad4a004b93a5b808&to=978278eabc0bafe2f390ca8fcdad24154f954020&stat=cycles
Average stage2-O3: 0.59% Improvement (cycles)
Average stage2-O0-g: 1.20% Improvement (cycles)

FWIW, this is most likely noise. At least I didn't see any stable improvement to cycles after this change landed.

Ah, yeah, looked through it; I didn't realize cycle variance was so high. I'll run SPEC locally and update the message.

Across the LLVM benchmark suite + SPEC this gets a whopping 2.2% improvement (geometric mean of N=3 runs) on SKX.
SPEC alone is less impressive at a 0.2% improvement.

Contributor

@chapuni chapuni left a comment

Thanks. This works for me.

for (const auto &InsPair : RhsDeps) {
if (!ShouldCountInsn(InsPair.first)) {
ToDrop = InsPair.first;
break;
Contributor

Just a question: does it work, or improve things, if we erase all of the found Instructions, not only the first one found?
(Then it should work without the MapVector.)

// Sketch: first collect everything to be erased, then erase in bulk.
SmallVector<const Instruction *, 8> ToErase;
for (const auto &InsPair : RhsDeps)
  if (!ShouldCountInsn(InsPair.first))
    ToErase.push_back(InsPair.first);
for (const Instruction *I : ToErase)
  RhsDeps.erase(I);

Contributor Author

Yes, we could do it in bulk, but we'd still need to repeat after each change (or until the max prune iteration count is reached). I think that since we have a maximum number of pruning iterations, we could still have issues if the iteration order is not fixed.

@goldsteinn goldsteinn closed this in a4951ec Mar 4, 2024