MachineBlockPlacement: Add tolerance to comparisons #67197

Closed
wants to merge 2 commits into from

Conversation

MatzeB
Contributor

@MatzeB MatzeB commented Sep 22, 2023

The LLVM BFI data is susceptible to numerical "noise" in the lower bits.
To get a feeling for this, consider that we store BlockFrequency as
64-bit integers but branch weights as 32-bit integers. A natural
numerical difference arises when branch-weight divisions are not exactly
representable as integers, and because block frequencies are scaled by
loop factors, the absolute values of those errors can appear large.
Regardless, the difference should remain small relative to the frequency
values themselves.

This change makes the MachineBlockPlacement algorithm more tolerant
of small relative changes in block frequency numbers:

  • Add BlockFrequency::greaterWithTolerance and
    BlockFrequency::lessWithTolerance functions that allow greater/less
    comparisons restricted to the top N significant bits, excluding
    leading zeros.
  • Change MachineBlockPlacement::selectBestCandidateBlock and
    MachineBlockPlacement::findBestLoopTopHelper to only consider
    a candidate better when it passes the tolerance check.
  • Blocks that were already in place prior to the algorithm are picked
    as the best candidate when they are not worse than the current best
    within tolerance. This provides an incentive to keep the placement
    as-is when multiple candidates have similar frequencies.

The current value of 20 significant bits works well to reduce
unnecessary changes in my upcoming BFI precision changes. It is likely a
good idea to reduce this number further to accommodate noise in PGO
data, but I did not perform any deeper analysis on what number would
best fit this goal.

@llvmbot
Collaborator

llvmbot commented Sep 22, 2023

@llvm/pr-subscribers-llvm-support
@llvm/pr-subscribers-backend-amdgpu
@llvm/pr-subscribers-backend-arm
@llvm/pr-subscribers-backend-aarch64

@llvm/pr-subscribers-backend-x86



Patch is 253.87 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/67197.diff

75 Files Affected:

  • (modified) llvm/include/llvm/Support/BlockFrequency.h (+36)
  • (modified) llvm/lib/CodeGen/MachineBlockPlacement.cpp (+36-10)
  • (modified) llvm/test/CodeGen/AArch64/cfi-fixup.ll (+12-12)
  • (modified) llvm/test/CodeGen/AArch64/machine-licm-sub-loop.ll (+33-34)
  • (modified) llvm/test/CodeGen/AArch64/ragreedy-csr.ll (+117-117)
  • (modified) llvm/test/CodeGen/AArch64/sink-and-fold.ll (+24-24)
  • (modified) llvm/test/CodeGen/AMDGPU/divergent-branch-uniform-condition.ll (+16-15)
  • (modified) llvm/test/CodeGen/AMDGPU/loop_header_nopred.mir (+55-28)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-loop-var-out-of-divergent-loop-swdev407790.ll (+35-35)
  • (modified) llvm/test/CodeGen/AMDGPU/memcpy-crash-issue63986.ll (+34-34)
  • (modified) llvm/test/CodeGen/AMDGPU/multilevel-break.ll (+25-26)
  • (modified) llvm/test/CodeGen/AMDGPU/optimize-negated-cond.ll (+104-23)
  • (modified) llvm/test/CodeGen/AMDGPU/promote-constOffset-to-imm.ll (+65-60)
  • (modified) llvm/test/CodeGen/ARM/code-placement.ll (+6-5)
  • (modified) llvm/test/CodeGen/ARM/loop-indexing.ll (+13-10)
  • (modified) llvm/test/CodeGen/AVR/rot.ll (+16-16)
  • (modified) llvm/test/CodeGen/AVR/rotate.ll (+10-8)
  • (modified) llvm/test/CodeGen/AVR/shift.ll (+8-8)
  • (modified) llvm/test/CodeGen/Generic/machine-function-splitter.ll (-1)
  • (modified) llvm/test/CodeGen/Hexagon/lsr-postinc-nested-loop.ll (+2-2)
  • (modified) llvm/test/CodeGen/Hexagon/prof-early-if.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-multi-loops.ll (+4-4)
  • (modified) llvm/test/CodeGen/Mips/indirect-jump-hazard/jumptables.ll (+104-104)
  • (modified) llvm/test/CodeGen/Mips/jump-table-mul.ll (+16-16)
  • (modified) llvm/test/CodeGen/Mips/nacl-align.ll (+7-3)
  • (modified) llvm/test/CodeGen/Mips/pseudo-jump-fill.ll (+7-7)
  • (modified) llvm/test/CodeGen/PowerPC/2007-11-16-landingpad-split.ll (+15-13)
  • (modified) llvm/test/CodeGen/PowerPC/branch-opt.ll (+5-4)
  • (modified) llvm/test/CodeGen/PowerPC/jump-tables-collapse-rotate.ll (+19-20)
  • (modified) llvm/test/CodeGen/PowerPC/licm-tocReg.ll (+22-22)
  • (modified) llvm/test/CodeGen/PowerPC/lsr-profitable-chain.ll (+17-17)
  • (modified) llvm/test/CodeGen/PowerPC/p10-spill-crgt.ll (+4-4)
  • (modified) llvm/test/CodeGen/PowerPC/p10-spill-crlt.ll (+10-10)
  • (modified) llvm/test/CodeGen/PowerPC/p10-spill-crun.ll (+16-16)
  • (modified) llvm/test/CodeGen/PowerPC/pr43527.ll (+4-4)
  • (modified) llvm/test/CodeGen/PowerPC/pr45448.ll (+7-7)
  • (modified) llvm/test/CodeGen/PowerPC/pr46759.ll (+9-7)
  • (modified) llvm/test/CodeGen/PowerPC/pr48519.ll (+22-22)
  • (modified) llvm/test/CodeGen/PowerPC/stack-clash-prologue.ll (+42-39)
  • (modified) llvm/test/CodeGen/RISCV/shrinkwrap-jump-table.ll (+10-10)
  • (modified) llvm/test/CodeGen/Thumb/pr42760.ll (+18-15)
  • (modified) llvm/test/CodeGen/Thumb2/LowOverheadLoops/memcall.ll (+25-25)
  • (modified) llvm/test/CodeGen/Thumb2/bti-indirect-branches.ll (+10-27)
  • (modified) llvm/test/CodeGen/Thumb2/constant-hoisting.ll (+9-9)
  • (modified) llvm/test/CodeGen/Thumb2/mve-blockplacement.ll (+36-36)
  • (modified) llvm/test/CodeGen/Thumb2/mve-float16regloops.ll (+34-34)
  • (modified) llvm/test/CodeGen/Thumb2/mve-float32regloops.ll (+67-65)
  • (modified) llvm/test/CodeGen/Thumb2/mve-gather-scatter-optimisation.ll (+75-78)
  • (modified) llvm/test/CodeGen/Thumb2/mve-memtp-branch.ll (+39-39)
  • (modified) llvm/test/CodeGen/Thumb2/mve-memtp-loop.ll (+30-34)
  • (modified) llvm/test/CodeGen/Thumb2/mve-postinc-lsr.ll (+35-35)
  • (modified) llvm/test/CodeGen/VE/Scalar/br_jt.ll (+73-72)
  • (modified) llvm/test/CodeGen/X86/2007-01-13-StackPtrIndex.ll (+54-51)
  • (modified) llvm/test/CodeGen/X86/2008-04-17-CoalescerBug.ll (+10-9)
  • (modified) llvm/test/CodeGen/X86/2009-08-12-badswitch.ll (+3-3)
  • (modified) llvm/test/CodeGen/X86/block-placement.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/callbr-asm-outputs.ll (+10-10)
  • (modified) llvm/test/CodeGen/X86/code_placement_loop_rotation2.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/dag-update-nodetomatch.ll (+23-24)
  • (modified) llvm/test/CodeGen/X86/dup-cost.ll (+41-13)
  • (modified) llvm/test/CodeGen/X86/hoist-invariant-load.ll (+17-13)
  • (modified) llvm/test/CodeGen/X86/loop-strength-reduce7.ll (+6-5)
  • (modified) llvm/test/CodeGen/X86/mul-constant-result.ll (+56-56)
  • (modified) llvm/test/CodeGen/X86/pic.ll (+11-11)
  • (modified) llvm/test/CodeGen/X86/postalloc-coalescing.ll (+5-4)
  • (modified) llvm/test/CodeGen/X86/pr38743.ll (+5-5)
  • (modified) llvm/test/CodeGen/X86/pr38795.ll (+41-38)
  • (modified) llvm/test/CodeGen/X86/pr63692.ll (+5-4)
  • (modified) llvm/test/CodeGen/X86/ragreedy-hoist-spill.ll (+4-4)
  • (modified) llvm/test/CodeGen/X86/speculative-load-hardening-indirect.ll (+12-12)
  • (modified) llvm/test/CodeGen/X86/speculative-load-hardening.ll (+16-16)
  • (modified) llvm/test/CodeGen/X86/switch.ll (+10-11)
  • (modified) llvm/test/CodeGen/X86/tail-dup-merge-loop-headers.ll (+42-42)
  • (modified) llvm/test/CodeGen/X86/win-catchpad.ll (+4-4)
  • (modified) llvm/unittests/Support/BlockFrequencyTest.cpp (+50)
diff --git a/llvm/include/llvm/Support/BlockFrequency.h b/llvm/include/llvm/Support/BlockFrequency.h
index 12a753301b36aba..8930bf536bff8ce 100644
--- a/llvm/include/llvm/Support/BlockFrequency.h
+++ b/llvm/include/llvm/Support/BlockFrequency.h
@@ -15,6 +15,7 @@
 
 #include <cassert>
 #include <cstdint>
+#include <limits.h>
 #include <optional>
 
 namespace llvm {
@@ -93,6 +94,41 @@ class BlockFrequency {
     return *this;
   }
 
+  /// Returns true if `this` is greater than `other` and the magnitude of the
+  /// difference falls within the topmost `SignificantBits` of `this`.
+  ///
+  /// The formula is derived when comparing a "relative change" to a threshold
+  /// which is chosen as a negative power-of-two so we can compute things
+  /// cheaply:
+  /// A = this->Frequency; B = Other.Frequency; T = SignificantBits - 1;
+  /// relative_change(A, B) = (A - B) / A
+  ///         (A - B) / A  >  2**(-T)
+  ///   <=>   A - B        >  2**(-T) * A
+  ///   <=>   A - B        >  shr(A, T)
+  bool greaterWithTolerance(BlockFrequency Other,
+                            unsigned SignificantBits) const {
+    assert(0 < SignificantBits &&
+           SignificantBits <= sizeof(Frequency) * CHAR_BIT &&
+           "invalid SignificantBits value");
+    return Frequency > Other.Frequency &&
+           (Frequency - Other.Frequency) >=
+               (Frequency >> (SignificantBits - 1));
+  }
+
+  /// Returns true if `this` is less than `other` and the magnitude of the
+  /// difference falls within the topmost `SignificantBits` of `this`.
+  ///
+  /// See `greaterWithTolerance` for how this is related to the
+  /// "relative change" formula.
+  bool lessWithTolerance(BlockFrequency Other, unsigned SignificantBits) const {
+    assert(0 < SignificantBits &&
+           SignificantBits <= sizeof(Frequency) * CHAR_BIT &&
+           "invalid SignificantBits value");
+    return Frequency < Other.Frequency &&
+           (Other.Frequency - Frequency) >=
+               (Frequency >> (SignificantBits - 1));
+  }
+
   bool operator<(BlockFrequency RHS) const {
     return Frequency < RHS.Frequency;
   }
diff --git a/llvm/lib/CodeGen/MachineBlockPlacement.cpp b/llvm/lib/CodeGen/MachineBlockPlacement.cpp
index 24f0197b419794b..540308a9b0b5bd3 100644
--- a/llvm/lib/CodeGen/MachineBlockPlacement.cpp
+++ b/llvm/lib/CodeGen/MachineBlockPlacement.cpp
@@ -213,6 +213,9 @@ static cl::opt<bool> RenumberBlocksBeforeView(
         "into a dot graph. Only used when a function is being printed."),
     cl::init(false), cl::Hidden);
 
+// Number of significant bits when comparing BlockFrequencies with tolerance.
+static constexpr unsigned BlockFreqCompBits = 20;
+
 namespace llvm {
 extern cl::opt<bool> EnableExtTspBlockPlacement;
 extern cl::opt<bool> ApplyExtTspWithoutProfile;
@@ -1715,6 +1718,7 @@ MachineBasicBlock *MachineBlockPlacement::selectBestCandidateBlock(
 
   bool IsEHPad = WorkList[0]->isEHPad();
 
+  MachineBasicBlock *ChainEnd = *std::prev(Chain.end());
   MachineBasicBlock *BestBlock = nullptr;
   BlockFrequency BestFreq;
   for (MachineBasicBlock *MBB : WorkList) {
@@ -1730,7 +1734,9 @@ MachineBasicBlock *MachineBlockPlacement::selectBestCandidateBlock(
 
     BlockFrequency CandidateFreq = MBFI->getBlockFreq(MBB);
     LLVM_DEBUG(dbgs() << "    " << getBlockName(MBB) << " -> ";
-               MBFI->printBlockFreq(dbgs(), CandidateFreq) << " (freq)\n");
+               MBFI->printBlockFreq(dbgs(), CandidateFreq)
+               << " (freq) layoutsucc " << ChainEnd->isLayoutSuccessor(MBB)
+               << '\n');
 
     // For ehpad, we layout the least probable first as to avoid jumping back
     // from least probable landingpads to more probable ones.
@@ -1750,11 +1756,26 @@ MachineBasicBlock *MachineBlockPlacement::selectBestCandidateBlock(
     //                 +-------------------------------------+
     //                 V                                     |
     // OuterLp -> OuterCleanup -> Resume     InnerLp -> InnerCleanup
-    if (BestBlock && (IsEHPad ^ (BestFreq >= CandidateFreq)))
-      continue;
-
-    BestBlock = MBB;
-    BestFreq = CandidateFreq;
+    bool Better = false;
+    if (BestBlock == nullptr) {
+      Better = true;
+    } else if (!IsEHPad) {
+      // Be tolerant to numerical noise: pick a block if clearly better, or
+      // pick the current layout successor if it is not clearly worse.
+      if (CandidateFreq.greaterWithTolerance(BestFreq, BlockFreqCompBits) ||
+          (ChainEnd->isLayoutSuccessor(MBB) &&
+           !CandidateFreq.lessWithTolerance(BestFreq, BlockFreqCompBits)))
+        Better = true;
+    } else {
+      if (CandidateFreq.lessWithTolerance(BestFreq, BlockFreqCompBits) ||
+          (ChainEnd->isLayoutSuccessor(MBB) &&
+           !CandidateFreq.greaterWithTolerance(BestFreq, BlockFreqCompBits)))
+        Better = true;
+    }
+    if (Better) {
+      BestBlock = MBB;
+      BestFreq = CandidateFreq;
+    }
   }
 
   return BestBlock;
@@ -2095,8 +2116,8 @@ MachineBlockPlacement::findBestLoopTopHelper(
     if (Pred == L.getHeader())
       continue;
     LLVM_DEBUG(dbgs() << "   old top pred: " << getBlockName(Pred) << ", has "
-                      << Pred->succ_size() << " successors, ";
-               MBFI->printBlockFreq(dbgs(), Pred) << " freq\n");
+                      << Pred->succ_size() << " successors, freq ";
+               MBFI->printBlockFreq(dbgs(), Pred) << '\n');
     if (Pred->succ_size() > 2)
       continue;
 
@@ -2112,8 +2133,13 @@ MachineBlockPlacement::findBestLoopTopHelper(
 
     BlockFrequency Gains = FallThroughGains(Pred, OldTop, OtherBB,
                                             LoopBlockSet);
-    if ((Gains > 0) && (Gains > BestGains ||
-        ((Gains == BestGains) && Pred->isLayoutSuccessor(OldTop)))) {
+    LLVM_DEBUG(dbgs() << "    Candidate: " << getBlockName(Pred) << " freq ";
+               MBFI->printBlockFreq(dbgs(), Pred);
+               dbgs() << " layoutsucc " << Pred->isLayoutSuccessor(OldTop)
+                      << " gains " << Gains.getFrequency() << '\n');
+    if (Gains.greaterWithTolerance(BestGains, BlockFreqCompBits) ||
+        (Pred->isLayoutSuccessor(OldTop) &&
+         !Gains.lessWithTolerance(BestGains, BlockFreqCompBits))) {
       BestPred = Pred;
       BestGains = Gains;
     }
diff --git a/llvm/test/CodeGen/AArch64/cfi-fixup.ll b/llvm/test/CodeGen/AArch64/cfi-fixup.ll
index 9a4ad3bb07ee364..e746ea745b92a8c 100644
--- a/llvm/test/CodeGen/AArch64/cfi-fixup.ll
+++ b/llvm/test/CodeGen/AArch64/cfi-fixup.ll
@@ -8,13 +8,13 @@ define i32 @f0(i32 %x) #0 {
 ; CHECK-NEXT:    .cfi_def_cfa_offset 16
 ; CHECK-NEXT:    .cfi_offset w30, -16
 ; CHECK-NEXT:    .cfi_remember_state
-; CHECK-NEXT:    cbz w0, .LBB0_4
+; CHECK-NEXT:    cbz w0, .LBB0_6
 ; CHECK-NEXT:  // %bb.1: // %entry
 ; CHECK-NEXT:    cmp w0, #2
-; CHECK-NEXT:    b.eq .LBB0_5
+; CHECK-NEXT:    b.eq .LBB0_4
 ; CHECK-NEXT:  // %bb.2: // %entry
 ; CHECK-NEXT:    cmp w0, #1
-; CHECK-NEXT:    b.ne .LBB0_6
+; CHECK-NEXT:    b.ne .LBB0_5
 ; CHECK-NEXT:  // %bb.3: // %if.then2
 ; CHECK-NEXT:    bl g1
 ; CHECK-NEXT:    add w0, w0, #1
@@ -22,27 +22,27 @@ define i32 @f0(i32 %x) #0 {
 ; CHECK-NEXT:    .cfi_def_cfa_offset 0
 ; CHECK-NEXT:    .cfi_restore w30
 ; CHECK-NEXT:    ret
-; CHECK-NEXT:  .LBB0_4:
+; CHECK-NEXT:  .LBB0_4: // %if.then5
 ; CHECK-NEXT:    .cfi_restore_state
 ; CHECK-NEXT:    .cfi_remember_state
-; CHECK-NEXT:    mov w0, #1
+; CHECK-NEXT:    bl g0
+; CHECK-NEXT:    mov w8, #1 // =0x1
+; CHECK-NEXT:    sub w0, w8, w0
 ; CHECK-NEXT:    ldr x30, [sp], #16 // 8-byte Folded Reload
 ; CHECK-NEXT:    .cfi_def_cfa_offset 0
 ; CHECK-NEXT:    .cfi_restore w30
 ; CHECK-NEXT:    ret
-; CHECK-NEXT:  .LBB0_5: // %if.then5
+; CHECK-NEXT:  .LBB0_5: // %if.end7
 ; CHECK-NEXT:    .cfi_restore_state
 ; CHECK-NEXT:    .cfi_remember_state
-; CHECK-NEXT:    bl g0
-; CHECK-NEXT:    mov w8, #1
-; CHECK-NEXT:    sub w0, w8, w0
+; CHECK-NEXT:    mov w0, wzr
 ; CHECK-NEXT:    ldr x30, [sp], #16 // 8-byte Folded Reload
 ; CHECK-NEXT:    .cfi_def_cfa_offset 0
 ; CHECK-NEXT:    .cfi_restore w30
 ; CHECK-NEXT:    ret
-; CHECK-NEXT:  .LBB0_6: // %if.end7
+; CHECK-NEXT:  .LBB0_6:
 ; CHECK-NEXT:    .cfi_restore_state
-; CHECK-NEXT:    mov w0, wzr
+; CHECK-NEXT:    mov w0, #1 // =0x1
 ; CHECK-NEXT:    ldr x30, [sp], #16 // 8-byte Folded Reload
 ; CHECK-NEXT:    .cfi_def_cfa_offset 0
 ; CHECK-NEXT:    .cfi_restore w30
@@ -115,7 +115,7 @@ define i32 @f2(i32 %x) #0 {
 ; CHECK-NEXT:    cbz w0, .LBB2_2
 ; CHECK-NEXT:  // %bb.1: // %if.end
 ; CHECK-NEXT:    bl g1
-; CHECK-NEXT:    mov w8, #1
+; CHECK-NEXT:    mov w8, #1 // =0x1
 ; CHECK-NEXT:    sub w0, w8, w0
 ; CHECK-NEXT:    ldr x30, [sp], #16 // 8-byte Folded Reload
 ; CHECK-NEXT:    .cfi_def_cfa_offset 0
diff --git a/llvm/test/CodeGen/AArch64/machine-licm-sub-loop.ll b/llvm/test/CodeGen/AArch64/machine-licm-sub-loop.ll
index d4da4f28bb4a3c7..04fd80cd8213fdd 100644
--- a/llvm/test/CodeGen/AArch64/machine-licm-sub-loop.ll
+++ b/llvm/test/CodeGen/AArch64/machine-licm-sub-loop.ll
@@ -15,31 +15,45 @@ define void @foo(i32 noundef %limit, ptr %out, ptr %y) {
 ; CHECK-NEXT:    and x12, x10, #0xfffffff0
 ; CHECK-NEXT:    add x13, x1, #32
 ; CHECK-NEXT:    add x14, x2, #16
-; CHECK-NEXT:    b .LBB0_3
-; CHECK-NEXT:  .LBB0_2: // %for.cond1.for.cond.cleanup3_crit_edge.us
-; CHECK-NEXT:    // in Loop: Header=BB0_3 Depth=1
+; CHECK-NEXT:    b .LBB0_6
+; CHECK-NEXT:  .LBB0_2: // in Loop: Header=BB0_6 Depth=1
+; CHECK-NEXT:    mov x18, xzr
+; CHECK-NEXT:  .LBB0_3: // %for.body4.us.preheader
+; CHECK-NEXT:    // in Loop: Header=BB0_6 Depth=1
+; CHECK-NEXT:    add x16, x18, x8
+; CHECK-NEXT:    add x17, x2, x18, lsl #1
+; CHECK-NEXT:    sub x18, x10, x18
+; CHECK-NEXT:    add x16, x1, x16, lsl #2
+; CHECK-NEXT:  .LBB0_4: // %for.body4.us
+; CHECK-NEXT:    // Parent Loop BB0_6 Depth=1
+; CHECK-NEXT:    // => This Inner Loop Header: Depth=2
+; CHECK-NEXT:    ldrsh w3, [x17], #2
+; CHECK-NEXT:    ldr w4, [x16]
+; CHECK-NEXT:    subs x18, x18, #1
+; CHECK-NEXT:    madd w3, w3, w15, w4
+; CHECK-NEXT:    str w3, [x16], #4
+; CHECK-NEXT:    b.ne .LBB0_4
+; CHECK-NEXT:  .LBB0_5: // %for.cond1.for.cond.cleanup3_crit_edge.us
+; CHECK-NEXT:    // in Loop: Header=BB0_6 Depth=1
 ; CHECK-NEXT:    add x9, x9, #1
 ; CHECK-NEXT:    add x13, x13, x11
 ; CHECK-NEXT:    add x8, x8, x10
 ; CHECK-NEXT:    cmp x9, x10
 ; CHECK-NEXT:    b.eq .LBB0_10
-; CHECK-NEXT:  .LBB0_3: // %for.cond1.preheader.us
+; CHECK-NEXT:  .LBB0_6: // %for.cond1.preheader.us
 ; CHECK-NEXT:    // =>This Loop Header: Depth=1
-; CHECK-NEXT:    // Child Loop BB0_6 Depth 2
-; CHECK-NEXT:    // Child Loop BB0_9 Depth 2
+; CHECK-NEXT:    // Child Loop BB0_8 Depth 2
+; CHECK-NEXT:    // Child Loop BB0_4 Depth 2
 ; CHECK-NEXT:    ldrsh w15, [x2, x9, lsl #1]
 ; CHECK-NEXT:    cmp w0, #16
-; CHECK-NEXT:    b.hs .LBB0_5
-; CHECK-NEXT:  // %bb.4: // in Loop: Header=BB0_3 Depth=1
-; CHECK-NEXT:    mov x18, xzr
-; CHECK-NEXT:    b .LBB0_8
-; CHECK-NEXT:  .LBB0_5: // %vector.ph
-; CHECK-NEXT:    // in Loop: Header=BB0_3 Depth=1
+; CHECK-NEXT:    b.lo .LBB0_2
+; CHECK-NEXT:  // %bb.7: // %vector.ph
+; CHECK-NEXT:    // in Loop: Header=BB0_6 Depth=1
 ; CHECK-NEXT:    mov x16, x14
 ; CHECK-NEXT:    mov x17, x13
 ; CHECK-NEXT:    mov x18, x12
-; CHECK-NEXT:  .LBB0_6: // %vector.body
-; CHECK-NEXT:    // Parent Loop BB0_3 Depth=1
+; CHECK-NEXT:  .LBB0_8: // %vector.body
+; CHECK-NEXT:    // Parent Loop BB0_6 Depth=1
 ; CHECK-NEXT:    // => This Inner Loop Header: Depth=2
 ; CHECK-NEXT:    dup v0.8h, w15
 ; CHECK-NEXT:    ldp q1, q4, [x16, #-16]
@@ -53,28 +67,13 @@ define void @foo(i32 noundef %limit, ptr %out, ptr %y) {
 ; CHECK-NEXT:    smlal v6.4s, v0.4h, v4.4h
 ; CHECK-NEXT:    stp q3, q2, [x17, #-32]
 ; CHECK-NEXT:    stp q6, q5, [x17], #64
-; CHECK-NEXT:    b.ne .LBB0_6
-; CHECK-NEXT:  // %bb.7: // %middle.block
-; CHECK-NEXT:    // in Loop: Header=BB0_3 Depth=1
+; CHECK-NEXT:    b.ne .LBB0_8
+; CHECK-NEXT:  // %bb.9: // %middle.block
+; CHECK-NEXT:    // in Loop: Header=BB0_6 Depth=1
 ; CHECK-NEXT:    cmp x12, x10
 ; CHECK-NEXT:    mov x18, x12
-; CHECK-NEXT:    b.eq .LBB0_2
-; CHECK-NEXT:  .LBB0_8: // %for.body4.us.preheader
-; CHECK-NEXT:    // in Loop: Header=BB0_3 Depth=1
-; CHECK-NEXT:    add x16, x18, x8
-; CHECK-NEXT:    add x17, x2, x18, lsl #1
-; CHECK-NEXT:    sub x18, x10, x18
-; CHECK-NEXT:    add x16, x1, x16, lsl #2
-; CHECK-NEXT:  .LBB0_9: // %for.body4.us
-; CHECK-NEXT:    // Parent Loop BB0_3 Depth=1
-; CHECK-NEXT:    // => This Inner Loop Header: Depth=2
-; CHECK-NEXT:    ldrsh w3, [x17], #2
-; CHECK-NEXT:    ldr w4, [x16]
-; CHECK-NEXT:    subs x18, x18, #1
-; CHECK-NEXT:    madd w3, w3, w15, w4
-; CHECK-NEXT:    str w3, [x16], #4
-; CHECK-NEXT:    b.ne .LBB0_9
-; CHECK-NEXT:    b .LBB0_2
+; CHECK-NEXT:    b.ne .LBB0_3
+; CHECK-NEXT:    b .LBB0_5
 ; CHECK-NEXT:  .LBB0_10: // %for.cond.cleanup
 ; CHECK-NEXT:    ret
 entry:
diff --git a/llvm/test/CodeGen/AArch64/ragreedy-csr.ll b/llvm/test/CodeGen/AArch64/ragreedy-csr.ll
index 5b501762418ef50..fa0ec49907658b5 100644
--- a/llvm/test/CodeGen/AArch64/ragreedy-csr.ll
+++ b/llvm/test/CodeGen/AArch64/ragreedy-csr.ll
@@ -24,7 +24,7 @@ define fastcc i32 @prune_match(ptr nocapture readonly %a, ptr nocapture readonly
 ; CHECK-NEXT:    ldrh w8, [x0]
 ; CHECK-NEXT:    ldrh w9, [x1]
 ; CHECK-NEXT:    cmp w8, w9
-; CHECK-NEXT:    b.ne LBB0_47
+; CHECK-NEXT:    b.ne LBB0_2
 ; CHECK-NEXT:  ; %bb.1: ; %if.end
 ; CHECK-NEXT:    sub sp, sp, #64
 ; CHECK-NEXT:    stp x29, x30, [sp, #48] ; 16-byte Folded Spill
@@ -41,8 +41,12 @@ define fastcc i32 @prune_match(ptr nocapture readonly %a, ptr nocapture readonly
 ; CHECK-NEXT:  Lloh1:
 ; CHECK-NEXT:    ldr x14, [x14, __DefaultRuneLocale@GOTPAGEOFF]
 ; CHECK-NEXT:    ldrsb x8, [x9, x11]
-; CHECK-NEXT:    tbz x8, #63, LBB0_3
-; CHECK-NEXT:  LBB0_2: ; %cond.false.i.i
+; CHECK-NEXT:    tbz x8, #63, LBB0_7
+; CHECK-NEXT:    b LBB0_6
+; CHECK-NEXT:  LBB0_2:
+; CHECK-NEXT:    mov w0, wzr
+; CHECK-NEXT:    ret
+; CHECK-NEXT:  LBB0_3: ; %cond.false.i.i219
 ; CHECK-NEXT:    stp x9, x0, [sp, #32] ; 16-byte Folded Spill
 ; CHECK-NEXT:    mov w0, w8
 ; CHECK-NEXT:    mov w1, #32768 ; =0x8000
@@ -61,32 +65,17 @@ define fastcc i32 @prune_match(ptr nocapture readonly %a, ptr nocapture readonly
 ; CHECK-NEXT:    ldr w12, [sp, #4] ; 4-byte Folded Reload
 ; CHECK-NEXT:    ldr x10, [sp, #8] ; 8-byte Folded Reload
 ; CHECK-NEXT:    ldr x0, [sp, #40] ; 8-byte Folded Reload
-; CHECK-NEXT:    cbz w8, LBB0_4
-; CHECK-NEXT:    b LBB0_6
-; CHECK-NEXT:  LBB0_3: ; %cond.true.i.i
-; CHECK-NEXT:    add x8, x14, x8, lsl #2
-; CHECK-NEXT:    ldr w8, [x8, #60]
-; CHECK-NEXT:    and w8, w8, #0x8000
-; CHECK-NEXT:    cbnz w8, LBB0_6
-; CHECK-NEXT:  LBB0_4: ; %lor.rhs
-; CHECK-NEXT:    ldrsb x8, [x10, x11]
-; CHECK-NEXT:    tbnz x8, #63, LBB0_8
-; CHECK-NEXT:  ; %bb.5: ; %cond.true.i.i217
-; CHECK-NEXT:    add x8, x14, x8, lsl #2
-; CHECK-NEXT:    ldr w8, [x8, #60]
-; CHECK-NEXT:    and w8, w8, #0x8000
-; CHECK-NEXT:    cbz w8, LBB0_9
-; CHECK-NEXT:  LBB0_6: ; %while.body
+; CHECK-NEXT:    cbz w8, LBB0_10
+; CHECK-NEXT:  LBB0_4: ; %while.body
 ; CHECK-NEXT:    ldrb w8, [x9, x11]
 ; CHECK-NEXT:    ldrb w15, [x10, x11]
 ; CHECK-NEXT:    cmp w8, w15
-; CHECK-NEXT:    b.ne LBB0_42
-; CHECK-NEXT:  ; %bb.7: ; %if.end17
+; CHECK-NEXT:    b.ne LBB0_43
+; CHECK-NEXT:  ; %bb.5: ; %if.end17
 ; CHECK-NEXT:    add x11, x11, #1
 ; CHECK-NEXT:    ldrsb x8, [x9, x11]
-; CHECK-NEXT:    tbz x8, #63, LBB0_3
-; CHECK-NEXT:    b LBB0_2
-; CHECK-NEXT:  LBB0_8: ; %cond.false.i.i219
+; CHECK-NEXT:    tbz x8, #63, LBB0_7
+; CHECK-NEXT:  LBB0_6: ; %cond.false.i.i
 ; CHECK-NEXT:    stp x9, x0, [sp, #32] ; 16-byte Folded Spill
 ; CHECK-NEXT:    mov w0, w8
 ; CHECK-NEXT:    mov w1, #32768 ; =0x8000
@@ -105,163 +94,174 @@ define fastcc i32 @prune_match(ptr nocapture readonly %a, ptr nocapture readonly
 ; CHECK-NEXT:    ldr w12, [sp, #4] ; 4-byte Folded Reload
 ; CHECK-NEXT:    ldr x10, [sp, #8] ; 8-byte Folded Reload
 ; CHECK-NEXT:    ldr x0, [sp, #40] ; 8-byte Folded Reload
-; CHECK-NEXT:    cbnz w8, LBB0_6
-; CHECK-NEXT:  LBB0_9: ; %while.end
+; CHECK-NEXT:    cbz w8, LBB0_8
+; CHECK-NEXT:    b LBB0_4
+; CHECK-NEXT:  LBB0_7: ; %cond.true.i.i
+; CHECK-NEXT:    add x8, x14, x8, lsl #2
+; CHECK-NEXT:    ldr w8, [x8, #60]
+; CHECK-NEXT:    and w8, w8, #0x8000
+; CHECK-NEXT:    cbnz w8, LBB0_4
+; CHECK-NEXT:  LBB0_8: ; %lor.rhs
+; CHECK-NEXT:    ldrsb x8, [x10, x11]
+; CHECK-NEXT:    tbnz x8, #63, LBB0_3
+; CHECK-NEXT:  ; %bb.9: ; %cond.true.i.i217
+; CHECK-NEXT:    add x8, x14, x8, lsl #2
+; CHECK-NEXT:    ldr w8, [x8, #60]
+; CHECK-NEXT:    and w8, w8, #0x8000
+; CHECK-NEXT:    cbnz w8, LBB0_4
+; CHECK-NEXT:  LBB0_10: ; %while.end
 ; CHECK-NEXT:    orr w8, w13, w12
-; CHECK-NEXT:    cbnz w8, LBB0_24
-; CHECK-NEXT:  ; %bb.10: ; %if.then23
+; CHECK-NEXT:    cbnz w8, LBB0_25
+; CHECK-NEXT:  ; %bb.11: ; %if.then23
 ; CHECK-NEXT:    ldr x12, [x0, #16]
 ; CHECK-NEXT:    ldrb w8, [x9, x11]
 ; CHECK-NEXT:    ldrb w13, [x12]
 ; CHECK-NEXT:    cmp w13, #83
-; CHECK-NEXT:    b.eq LBB0_19
-; CHECK-NEXT:  LBB0_11: ; %while.cond59.preheader
-; CHECK-NEXT:    cbz w8, LBB0_23
-; CHECK-NEXT:  LBB0_12: ; %land.rhs.preheader
+; CHECK-NEXT:    b.eq LBB0_20
+; CHECK-NEXT:  LBB0_12: ; %while.cond59.preheader
+; CHECK-NEXT:    cbz w8, LBB0_24
+; CHECK-NEXT:  LBB0_13: ; %land.rhs.preheader
 ; CHECK-NEXT:    add x12, x9, x11
 ; CHECK-NEXT:    add x9, x10, x11
 ; CHECK-NEXT:    add x10, x12, #1
-; CHECK-NEXT:  LBB0_13: ; %land.rhs
+; CHECK-NEXT:  LBB0_14: ; %land.rhs
 ; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    ldrb w11, [x9], #1
-; CHECK-NEXT:    cbz w11, LBB0_23
-; CHECK-NEXT:  ; %bb.14: ; %while.body66
-; CHECK-NEXT:    ; in Loop: Header=BB0_13 Depth=1
-; CHECK-NEXT:    cmp w8, #42
-; CHECK-NEXT:    b.eq LBB0_18
+; CHECK-NEXT:    cbz w11, LBB0_24
 ; CHECK-NEXT:  ; %bb.15: ; %while.body66
-; CHECK-NEXT:    ; in Loop: Header=BB0_13 Depth=1
+; CHECK-NEXT:    ; in Loop: Header=BB0_14 Depth=1
+; CHECK-NEXT:    cmp w8, #42
+; CHECK-NEXT:    b.eq LBB0_19
+; CHECK-NEXT:  ; %bb.16: ; %while.body66
+; CHECK-NEXT:    ; in Loop: Header=BB0_14 Depth=1
 ; CHECK-NEXT:    cmp w11, #42
-; CHECK-NEXT:    b.eq LBB0_18
-; CHECK-NEXT:  ; %bb.16: ; %lor.lhs.false74
-; CHECK-NEXT:    ; in Loop: Header=BB0_13 Depth=1
+; CHECK-NEXT:    b.eq LBB0_19
+; CHECK-NEXT:  ; %bb.17: ; %lor.lhs.false74
+; CHECK-NEXT:    ; in Loop: Header=BB0_14 Depth=1
 ; CHECK-NEXT:    cmp w8, w11
 ; CHECK-NEXT:    mov w0, wzr
-; CHECK-NEXT:    b.ne LBB0_43
-; CHECK-NEXT:  ; %bb.17: ; %lor.lhs.false74
-; CHECK-NEXT:    ; in Loop: Header=BB0_13 Depth=1
+; CHECK-NEXT:    b.ne LBB0_44
+; CHECK-NEXT:  ; %bb.18: ; %lor.lhs.false74
+; CHECK-NEXT:    ; in Loop: Header=BB0_14 Depth=1
 ; CHECK-NEXT:    cmp w8, #94
-; CHECK-NEXT:    b.eq LBB0_43
-; CHECK-NEXT:  LBB0_18: ; %if.then83
-; CHECK-NEXT:    ; in Loop: Header=BB0_13 Depth=1
+; CHECK-NEXT:    b.eq LBB0_44
+; CHECK-NEXT:  LBB0_19: ; %if.then83
+; CHECK-NEXT:    ; in Loop: Header=BB0_14 Depth=1
 ; CHECK-NEXT:    ldrb w8, [x10], #1
 ; CHECK-NEXT:    mov w0, #1 ; =0x1
-; CHECK-NEXT:    cbnz w8, LBB0_13
-; CHECK-NEXT:    b LBB0_43
-; CHECK-NEXT:  LBB0_19: ; %land.lhs.true28
-; CHECK-NEXT:    cbz w8, LBB0_23
-; CHECK-NEXT:  ; %bb.20: ; %land.lhs.true28
+; CHECK-NEXT:    cbnz w8, LBB0_14
+; CHECK-NEXT:    b LBB0_44
+; CHECK-NEXT:  LBB0_20: ; %land.lhs.true28
+; CHECK-NEXT:    cbz w8, LBB0_24
+; CHECK-NEXT:  ; %bb.21: ; %land.lhs.true28
 ; CHECK-NEXT:    cmp w8, #112
-; CHECK-NEXT:    b.ne LBB0_12
-; CHECK-NEXT:  ; %bb.21: ; %land.lhs.true35
+; CHECK-NEXT:    b.ne LBB0_13
+; CHECK-NEXT:  ; %bb.22: ; %land.lhs.true35
 ; CHECK-NEXT:    ldrb w13, [x10, x11]
 ; CHECK-NEXT:    cmp w13, #112
-; CHECK-NEXT:    b.ne LBB0_12
-; CHECK-NEXT:  ; %bb.22: ; %land.lhs.true43
+; CHECK-NEXT:    b.ne LBB0_13
+; CHECK-NEXT:  ; %bb.23: ; %land.lhs.true43
 ; CHECK-NEXT:    sub x12, x9, x12
 ; CHECK-NEXT:    add x12, x12, x11
 ; CHECK-NEXT:    cmp x12, #1
-; CHECK-NEXT:    b.ne LBB0_44
-; CHECK-NEXT:  LBB0_23:
+; CHECK-NEXT:    b.ne LBB0_45
+; CHECK-NEXT:  LBB0_24:
 ; CHECK-NEXT:    mov w0, #1 ; =0x1
-; CHECK-NEXT:    b LBB0_43
-; CHECK-NEXT:  LBB0_24: ; %if.else88
+; CHECK-NEXT:    b LBB0_44
+; CHECK-NEXT:  LBB0_25: ; %if.else88
 ; CHECK-NEXT:    cmp w12, #1
-; CHECK-NEXT:    b.ne LBB0_33
-; CHECK-NEXT:  ; %bb.25: ; %if.else88
+; CHECK-NEXT:    b.ne LBB0_34
+; CHECK-NEXT:  ; %bb.26: ; %if.else88
 ; CHECK-NEXT:    cmp w13, #2
-; CHECK-NEXT:    b.ne LBB0_33
-; CHECK-NEXT:  ; ...
[truncated]

@MatzeB MatzeB mentioned this pull request Sep 22, 2023
@spupyrev
Contributor

I am curious if there is a way to teach BFI to return appropriately rounded values, so that we don't modify downstream algorithms? Similarly, do you think the lower bits in the frequencies provide any signal, or are they just noise that always needs to be discarded?

@MatzeB
Contributor Author

MatzeB commented Sep 25, 2023

I am curious if there is a way to teach BFI to return appropriately rounded values, so that we don't modify downstream algorithms?

Yes, in fact I started out with code "rounding" all BFI values and still have it around here. But initially I wanted this to be more of a drive-by fix rather than a challenge to base-layer assumptions, which would probably need more discussion from stakeholders...

Also, part of the motivation for the current fix is plainly that #66285 forces me to update hundreds of unit tests, and this seemed like an easier "drive-by" change that lets me frontload a number of those updates...

So yeah, that may be better, but I would like some more feedback/support if we want to go in that direction. In the meantime I think this change shouldn't hurt?

@chandlerc chandlerc removed their request for review September 25, 2023 23:30
@WenleiHe
Member

I am curious if there is a way to teach BFI to return appropriately rounded values, so that we don't modify downstream algorithms?

Yes in fact I started out with code "rounding" all BFI values and still have it around here.

I also liked the idea of building the rounding into BFI. Initially the rounding can be controlled/enabled by a switch, so we don't suddenly introduce big churns to all PGO users.

Beyond your need for test case stability, another real benefit of rounding within BFI is that it can help reduce the binary level instability for sample PGO due to nature of sampling.

But initially I wanted this more of a drive-by fix rather than challenging base layer assumptions which would probably need more discussion from stakeholders...

You are already challenging the base layer in #66285, in a good way though. :) I'd suggest let's do it properly within BFI, rather than in downstream optimization.

@david-xl
Contributor

Are there any measurable performance impact with this change?

@MatzeB
Copy link
Contributor Author

MatzeB commented Sep 26, 2023

Is there any measurable performance impact with this change?

I don't expect any changes, as this should only affect cases where frequencies are very close to each other, so any choice should have a similar effect. I haven't benchmarked this but I could run it through the llvm-test-suite if desired. (I'm not sure I can sensibly measure it with our production workloads, as BOLT will reorder all the basic blocks again anyway for us.)

The motivation here is to reduce the amount of reordering that is triggered by inconsequentially small changes in BFI data. I did this because I had to touch most of these tests for #66285 anyway, and it seemed like a good idea to make the algorithm more stable to hopefully reduce churn in the future too.

I could also see this being useful to reduce day-to-day churn as sampling PGO data changes, though that likely needs tuning and tighter values for the BlockFreqCompBits constant added here.

@WenleiHe
Copy link
Member

I'm not sure I can sensibly measure it with our production workloads as BOLT will reorder all the basic blocks again anyway for us

We can do prod measurement with BOLT turned off. I don't expect meaningful perf movement, but it's indeed a good sanity check.

I could also see this being useful to reduce day-to-day churn as sampling PGO data changes, though that likely needs tuning and tighter values for the BlockFreqCompBits constant added here.

Yes. It needs to be 1) off by default initially, 2) tunable for the rounding. And just to be clear, I'm mostly thinking about rounding within BFI, not this change as is.

@david-xl
Copy link
Contributor

I'm not sure I can sensibly measure it with our production workloads as BOLT will reorder all the basic blocks again anyway for us

We can do prod measurement with BOLT turned off. I don't expect meaningful perf movement, but it's indeed a good sanity check.

I could also see this being useful to reduce day-to-day churn as sampling PGO data changes, though that likely needs tuning and tighter values for the BlockFreqCompBits constant added here.

Yes. It needs to be 1) off by default initially, 2) tunable for the rounding. And just to be clear, I'm mostly thinking about rounding within BFI, not this change as is.

Sanity check would be great to avoid unexpected downward performance change (regression) as well.

@rlavaee
Copy link
Contributor

rlavaee commented Sep 28, 2023

IIUC, these functions do not satisfy a strict weak ordering. Suppose a series of similar BFI edge frequencies is considered where
w_1 < w_2 < w_3 < ... < w_n
and
w_1 !lessWithTolerance w_2 !lessWithTolerance ... !lessWithTolerance w_n

We can still have w_1 lessWithTolerance w_n.

In that case, is it a good idea to expose these as pseudo-comparators?

@MatzeB
Copy link
Contributor Author

MatzeB commented Oct 2, 2023

Sanity check would be great to avoid unexpected downward performance change (regression) as well.

Managed to get one build from our internal "unicorn" service. It shows no regression (a tiny improvement that is likely benchmarking noise).

@MatzeB
Copy link
Contributor Author

MatzeB commented Oct 2, 2023

IIUC, these functions do not satisfy weak ordering.

...

In that case, is it a good idea to expose these as pseudo-comparators?

Yes, it seems this is only a strict partial order, not a weak ordering. But is this actually a problem? Would people infer this from the function names and get confused if it isn't? Are you saying we should find a weak ordering here? I don't immediately see any benefits from that, at least for my use case.

Or are you proposing that we should find a different name for these functions to set other expectations? But then what name should we choose? I personally didn't have expectations one way or another...

@rlavaee
Copy link
Contributor

rlavaee commented Oct 2, 2023

IIUC, these functions do not satisfy weak ordering.

...

In that case, is it a good idea to expose these as pseudo-comparators?

Yes, it seems this is only a strict partial order, not a weak ordering. But is this actually a problem? Would people infer this from the function names and get confused if it isn't? Are you saying we should find a weak ordering here? I don't immediately see any benefits from that, at least for my use case.

Or are you proposing that we should find a different name for these functions to set other expectations? But then what name should we choose? I personally didn't have expectations one way or another...

Yes. I suggest we use different naming and also make these non-member functions.

The current naming does not match what these functions actually compute. I think they determine whether one frequency is significantly greater or less than the other.
My suggestion is to define a class for this so we can explicitly document that it does not satisfy transitivity and therefore does not induce a strict total ordering. This also allows us to move the implementation detail about the significant bits (which, by the way, I find more related to the number of non-significant bits) into this class.

class SignificantRelativeComparator {
  int NonSignificantBits = 20;

public:
  bool Less(const BlockFrequency &LHS, const BlockFrequency &RHS) {
    // implementation
  }
  bool Greater(const BlockFrequency &LHS, const BlockFrequency &RHS) {
    // implementation
  }
};

@MatzeB MatzeB force-pushed the block_placement_tolerance branch 2 times, most recently from d983c7b to ad0c45b Compare October 3, 2023 22:18
@MatzeB
Copy link
Contributor Author

MatzeB commented Oct 3, 2023

Thanks for the feedback. I contemplated the comments about this not being a weak ordering, and there are indeed cases where picking the "best" candidate out of a list would give different results based on the list order, which isn't desirable. So I reworked the approach to perform a two-pass algorithm: we first search for the biggest frequency (without any tolerances) and then, in a second pass, check all blocks that are almost equal to the biggest frequency. This also means that I now have an almostEqual function (and no longer have less/greaterWithTolerance).

I also think that this approach is slightly better than pre-rounding BFI numbers: with pre-rounding we may still have two candidate blocks with very similar frequencies that happen to sit at a border where one gets rounded up and the other rounded down, which doesn't happen with this approach.

@github-actions
Copy link

github-actions bot commented Oct 3, 2023

✅ With the latest revision this PR passed the C/C++ code formatter.

@MatzeB
Copy link
Contributor Author

MatzeB commented Oct 3, 2023

FWIW: I also realized that the vast majority of the instability I was tackling came from MachineBlockPlacement::selectBestCandidateBlock, so I decided to revert the changes to MachineBlockPlacement::findBestLoopTopHelper (there is still a debug print and cleanup there, but no actual change anymore).

@rlavaee
Copy link
Contributor

rlavaee commented Oct 5, 2023

Thanks for the feedback. I contemplated the comments about this not being a weak ordering, and there are indeed cases where picking the "best" candidate out of a list would give different results based on the list order, which isn't desirable. So I reworked the approach to perform a two-pass algorithm: we first search for the biggest frequency (without any tolerances) and then, in a second pass, check all blocks that are almost equal to the biggest frequency. This also means that I now have an almostEqual function (and no longer have less/greaterWithTolerance).

I also think that this approach is slightly better than pre-rounding BFI numbers: with pre-rounding we may still have two candidate blocks with very similar frequencies that happen to sit at a border where one gets rounded up and the other rounded down, which doesn't happen with this approach.

I like the new approach. Some general comments:

  1. Do you have any idea why this impacts so many of the block ordering tests?
  2. Do you intend to limit this functionality to the baseline ordering only (i.e., does applyExtTsp() remain sensitive to small variations)?
  3. Is this functionality supposed to be controlled by a flag?

@MatzeB
Copy link
Contributor Author

MatzeB commented Oct 5, 2023

Do you have any idea why this impacts so many of the block ordering tests?

Most tests have no profile information and so we end up assigning equal branch weights in many cases. This means multiple successors with the same BlockFrequency are pretty common.

To make matters worse, not every integer divides cleanly. Say a block has a BlockFrequency value of 7 and two successors with equal branch weights. We will assign frequencies 3 and 4 to the successors, introducing extra numerical inaccuracy that isn't meaningful for the code.

  1. Do you intend to limit this functionality to the baseline ordering only (i.e., does applyExtTsp() remain sensitive to small variations)?

I was hoping that this is mostly a drive-by change to make my life easier for #66285. I did not get around to learning about the ext-tsp code, so it won't be affected by this change. It would be good to introduce an equivalent change there too at some point.

  1. Is this functionality supposed to be controlled by a flag?

I wasn't sure if anybody would ever use such a flag. I did have one during my personal experiments, but I felt we could save a couple of lines of code when upstreaming and maybe gain a tiny bit of perf from hardcoded values. That said, I am happy to bring back a flag if you prefer that.

@spupyrev
Copy link
Contributor

I wonder why the existing implementation of block placement relies on block frequencies (that are relative to the entry block) instead of block counts? Would using counts be equivalent (or very similar) to the rounding suggested in the diff?

@MatzeB
Copy link
Contributor Author

MatzeB commented Oct 12, 2023

I wonder why the existing implementation of block placement relies on block frequencies (that are relative to the entry block) instead of block counts? Would using counts be equivalent (or very similar) to the rounding suggested in the diff?

The block counts are not available. We only have a single count for the function entry plus branch weights. Functions like MachineBlockFrequencyInfo::getBlockProfileCount just "emulate" a count by multiplying the entry count with the block frequency (which was in turn estimated from the branch weights). So the information in the counts isn't any better than the frequencies for this use case.

@spupyrev
Copy link
Contributor

The block counts are not available. We only have a single count for the function entry plus branch weights. Functions like MachineBlockFrequencyInfo::getBlockProfileCount just "emulate" a count by multiplying the entry count with the block frequency (which was in turn estimated from the branch weights). So the information in the counts isn't any better than the frequencies for this use case.

That's exactly what I'm asking: why not use the "emulated/estimated" counts? You get the same rounding effect as suggested in the diff, but in addition the raw counts have an intuitive interpretation ("how many times was a block/jump executed?").
We can separately work on making this estimation more precise, if it's not already good enough.

The LLVM BFI data is susceptible to numerical "noise" in the lower bits.
To get a feeling for this, consider that we store `BlockFrequency` as
64-bit integers and branch weights as 32-bit. There is a natural
numerical difference when branch-weight divisions are not exactly
representable as integers, and because block frequencies are scaled by
loop factors, the absolute values of those errors can appear big.
Regardless, it should be a small difference relative to the
frequency numbers.

- Add a `BlockFrequency::almostEqual` function to check whether two
  frequencies are close enough to each other that the topmost N
  significant bits are equal and only lower bits differ.
- Change `MachineBlockPlacement::selectBestCandidateBlock` to a
  two-pass algorithm that first determines the candidate with the
  biggest frequency and then performs a second pass over blocks with an
  "almostEqual" frequency, preferring the one that is already the
  layout successor or first in the worklist.

The current value of 20 significant bits works well to reduce
unnecessary changes in my upcoming BFI precision changes.
@spupyrev
Copy link
Contributor

Why not use the "emulated/estimated" counts? You get the same rounding effect as suggested in the diff, but in addition the raw counts have an intuitive interpretation ("how many times was a block/jump executed?").

Any other thoughts on whether we should/could use profile counts in MachineBlockPlacement (and perhaps other optimizations) instead of frequencies? Intuitively it makes sense to me, and it would resolve some of the stability issues discussed in the diff. It seems there is already such an option for tail duplication (https://reviews.llvm.org/D83265). I'm curious why it is not applied to other placement heuristics? cc @david-xl

@rlavaee
Copy link
Contributor

rlavaee commented Oct 18, 2023

I was hoping that this is mostly a drive-by change to make my life easier for #66285. I did not get around to learning about the ext-tsp code, so it won't be affected by this change. It would be good to introduce an equivalent change there too at some point.

The question is whether we could have a better approach which covers both at the same time (IIUC, rounding handles this, but does not have the nice relativity feature that we have in your PR).

@rlavaee
Copy link
Contributor

rlavaee commented Oct 18, 2023

Another general question: do you have cases where this precision problem manifests itself (e.g., unstable/non-deterministic builds)?

@MatzeB
Copy link
Contributor Author

MatzeB commented Oct 18, 2023

Do you have cases where this precision problem manifests itself (e.g., unstable/non-deterministic builds)?

No, I don't have any real-world evidence other than flip-flopping test cases. Builds are of course deterministic without this patch already, in the sense that given the same inputs you get the same binary. On the other hand, code can still look wildly different for minor changes in the profile inputs, as other areas like inlining likely show similar effects.

The question is whether we could have a better approach which covers both at the same time (IIUC, rounding handles this, but does not have the nice relativity feature that we have in your PR).

Yes, I appreciate that. This has sparked a lot more (good) discussion than I expected, and it seems a good solution here is more than a "drive-by" patch like this. So I am less sure I want to go down this road and will probably just put this on pause for now (and update the extra test cases in #66285).

@MatzeB MatzeB closed this Oct 23, 2023