MachineBlockPlacement: Add tolerance to comparisons #67197
Conversation
@llvm/pr-subscribers-llvm-support @llvm/pr-subscribers-backend-x86

Changes

The LLVM BFI data is susceptible to numerical "noise" in the lower bits. This change makes the MachineBlockPlacement algorithm more tolerant against small relative changes in block frequency numbers.
The current value of 20 significant bits works well to reduce unnecessary changes in my upcoming BFI precision changes. Patch is 253.87 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/67197.diff 75 Files Affected:
diff --git a/llvm/include/llvm/Support/BlockFrequency.h b/llvm/include/llvm/Support/BlockFrequency.h
index 12a753301b36aba..8930bf536bff8ce 100644
--- a/llvm/include/llvm/Support/BlockFrequency.h
+++ b/llvm/include/llvm/Support/BlockFrequency.h
@@ -15,6 +15,7 @@
#include <cassert>
#include <cstdint>
+#include <limits.h>
#include <optional>
namespace llvm {
@@ -93,6 +94,41 @@ class BlockFrequency {
return *this;
}
+ /// Returns true if `this` is greater than `other` and the magnitude of the
+ /// difference falls within the topmost `SignificantBits` of `this`.
+ ///
+ /// The formula is derived when comparing a "relative change" to a threshold
+ /// which is chosen as a negative power-of-two so we can compute things
+ /// cheaply:
+ /// A = this->Frequency; B = Other.Frequency; T = SignificantBits - 1;
+ /// relative_change(A, B) = (A - B) / A
+ /// (A - B) / A > 2**(-T)
+ /// <=> A - B > 2**(-T) * A
+ /// <=> A - B > shr(A, T)
+ bool greaterWithTolerance(BlockFrequency Other,
+ unsigned SignificantBits) const {
+ assert(0 < SignificantBits &&
+ SignificantBits <= sizeof(Frequency) * CHAR_BIT &&
+ "invalid SignificantBits value");
+ return Frequency > Other.Frequency &&
+ (Frequency - Other.Frequency) >=
+ (Frequency >> (SignificantBits - 1));
+ }
+
+ /// Returns true if `this` is less than `other` and the magnitude of the
+ /// difference falls within the topmost `SignificantBits` of `this`.
+ ///
+ /// See `greaterWithTolerance` for how this is related to the
+ /// "relative change" formula.
+ bool lessWithTolerance(BlockFrequency Other, unsigned SignificantBits) const {
+ assert(0 < SignificantBits &&
+ SignificantBits <= sizeof(Frequency) * CHAR_BIT &&
+ "invalid SignificantBits value");
+ return Frequency < Other.Frequency &&
+ (Other.Frequency - Frequency) >=
+ (Frequency >> (SignificantBits - 1));
+ }
+
bool operator<(BlockFrequency RHS) const {
return Frequency < RHS.Frequency;
}
diff --git a/llvm/lib/CodeGen/MachineBlockPlacement.cpp b/llvm/lib/CodeGen/MachineBlockPlacement.cpp
index 24f0197b419794b..540308a9b0b5bd3 100644
--- a/llvm/lib/CodeGen/MachineBlockPlacement.cpp
+++ b/llvm/lib/CodeGen/MachineBlockPlacement.cpp
@@ -213,6 +213,9 @@ static cl::opt<bool> RenumberBlocksBeforeView(
"into a dot graph. Only used when a function is being printed."),
cl::init(false), cl::Hidden);
+// Number of significant bits when comparing BlockFrequencies with tolerance.
+static constexpr unsigned BlockFreqCompBits = 20;
+
namespace llvm {
extern cl::opt<bool> EnableExtTspBlockPlacement;
extern cl::opt<bool> ApplyExtTspWithoutProfile;
@@ -1715,6 +1718,7 @@ MachineBasicBlock *MachineBlockPlacement::selectBestCandidateBlock(
bool IsEHPad = WorkList[0]->isEHPad();
+ MachineBasicBlock *ChainEnd = *std::prev(Chain.end());
MachineBasicBlock *BestBlock = nullptr;
BlockFrequency BestFreq;
for (MachineBasicBlock *MBB : WorkList) {
@@ -1730,7 +1734,9 @@ MachineBasicBlock *MachineBlockPlacement::selectBestCandidateBlock(
BlockFrequency CandidateFreq = MBFI->getBlockFreq(MBB);
LLVM_DEBUG(dbgs() << " " << getBlockName(MBB) << " -> ";
- MBFI->printBlockFreq(dbgs(), CandidateFreq) << " (freq)\n");
+ MBFI->printBlockFreq(dbgs(), CandidateFreq)
+ << " (freq) layoutsucc " << ChainEnd->isLayoutSuccessor(MBB)
+ << '\n');
// For ehpad, we layout the least probable first as to avoid jumping back
// from least probable landingpads to more probable ones.
@@ -1750,11 +1756,26 @@ MachineBasicBlock *MachineBlockPlacement::selectBestCandidateBlock(
// +-------------------------------------+
// V |
// OuterLp -> OuterCleanup -> Resume InnerLp -> InnerCleanup
- if (BestBlock && (IsEHPad ^ (BestFreq >= CandidateFreq)))
- continue;
-
- BestBlock = MBB;
- BestFreq = CandidateFreq;
+ bool Better = false;
+ if (BestBlock == nullptr) {
+ Better = true;
+ } else if (!IsEHPad) {
+      // Be tolerant to numerical noise: Pick the block if it is clearly
+      // better, or go for the current layout successor if it is not clearly
+      // worse.
+ if (CandidateFreq.greaterWithTolerance(BestFreq, BlockFreqCompBits) ||
+ (ChainEnd->isLayoutSuccessor(MBB) &&
+ !CandidateFreq.lessWithTolerance(BestFreq, BlockFreqCompBits)))
+ Better = true;
+ } else {
+ if (CandidateFreq.lessWithTolerance(BestFreq, BlockFreqCompBits) ||
+ (ChainEnd->isLayoutSuccessor(MBB) &&
+ !CandidateFreq.greaterWithTolerance(BestFreq, BlockFreqCompBits)))
+ Better = true;
+ }
+ if (Better) {
+ BestBlock = MBB;
+ BestFreq = CandidateFreq;
+ }
}
return BestBlock;
@@ -2095,8 +2116,8 @@ MachineBlockPlacement::findBestLoopTopHelper(
if (Pred == L.getHeader())
continue;
LLVM_DEBUG(dbgs() << " old top pred: " << getBlockName(Pred) << ", has "
- << Pred->succ_size() << " successors, ";
- MBFI->printBlockFreq(dbgs(), Pred) << " freq\n");
+ << Pred->succ_size() << " successors, freq ";
+ MBFI->printBlockFreq(dbgs(), Pred) << '\n');
if (Pred->succ_size() > 2)
continue;
@@ -2112,8 +2133,13 @@ MachineBlockPlacement::findBestLoopTopHelper(
BlockFrequency Gains = FallThroughGains(Pred, OldTop, OtherBB,
LoopBlockSet);
- if ((Gains > 0) && (Gains > BestGains ||
- ((Gains == BestGains) && Pred->isLayoutSuccessor(OldTop)))) {
+ LLVM_DEBUG(dbgs() << " Candidate: " << getBlockName(Pred) << " freq ";
+ MBFI->printBlockFreq(dbgs(), Pred);
+ dbgs() << " layoutsucc " << Pred->isLayoutSuccessor(OldTop)
+ << " gains " << Gains.getFrequency() << '\n');
+ if (Gains.greaterWithTolerance(BestGains, BlockFreqCompBits) ||
+ (Pred->isLayoutSuccessor(OldTop) &&
+ !Gains.lessWithTolerance(BestGains, BlockFreqCompBits))) {
BestPred = Pred;
BestGains = Gains;
}
diff --git a/llvm/test/CodeGen/AArch64/cfi-fixup.ll b/llvm/test/CodeGen/AArch64/cfi-fixup.ll
index 9a4ad3bb07ee364..e746ea745b92a8c 100644
--- a/llvm/test/CodeGen/AArch64/cfi-fixup.ll
+++ b/llvm/test/CodeGen/AArch64/cfi-fixup.ll
@@ -8,13 +8,13 @@ define i32 @f0(i32 %x) #0 {
; CHECK-NEXT: .cfi_def_cfa_offset 16
; CHECK-NEXT: .cfi_offset w30, -16
; CHECK-NEXT: .cfi_remember_state
-; CHECK-NEXT: cbz w0, .LBB0_4
+; CHECK-NEXT: cbz w0, .LBB0_6
; CHECK-NEXT: // %bb.1: // %entry
; CHECK-NEXT: cmp w0, #2
-; CHECK-NEXT: b.eq .LBB0_5
+; CHECK-NEXT: b.eq .LBB0_4
; CHECK-NEXT: // %bb.2: // %entry
; CHECK-NEXT: cmp w0, #1
-; CHECK-NEXT: b.ne .LBB0_6
+; CHECK-NEXT: b.ne .LBB0_5
; CHECK-NEXT: // %bb.3: // %if.then2
; CHECK-NEXT: bl g1
; CHECK-NEXT: add w0, w0, #1
@@ -22,27 +22,27 @@ define i32 @f0(i32 %x) #0 {
; CHECK-NEXT: .cfi_def_cfa_offset 0
; CHECK-NEXT: .cfi_restore w30
; CHECK-NEXT: ret
-; CHECK-NEXT: .LBB0_4:
+; CHECK-NEXT: .LBB0_4: // %if.then5
; CHECK-NEXT: .cfi_restore_state
; CHECK-NEXT: .cfi_remember_state
-; CHECK-NEXT: mov w0, #1
+; CHECK-NEXT: bl g0
+; CHECK-NEXT: mov w8, #1 // =0x1
+; CHECK-NEXT: sub w0, w8, w0
; CHECK-NEXT: ldr x30, [sp], #16 // 8-byte Folded Reload
; CHECK-NEXT: .cfi_def_cfa_offset 0
; CHECK-NEXT: .cfi_restore w30
; CHECK-NEXT: ret
-; CHECK-NEXT: .LBB0_5: // %if.then5
+; CHECK-NEXT: .LBB0_5: // %if.end7
; CHECK-NEXT: .cfi_restore_state
; CHECK-NEXT: .cfi_remember_state
-; CHECK-NEXT: bl g0
-; CHECK-NEXT: mov w8, #1
-; CHECK-NEXT: sub w0, w8, w0
+; CHECK-NEXT: mov w0, wzr
; CHECK-NEXT: ldr x30, [sp], #16 // 8-byte Folded Reload
; CHECK-NEXT: .cfi_def_cfa_offset 0
; CHECK-NEXT: .cfi_restore w30
; CHECK-NEXT: ret
-; CHECK-NEXT: .LBB0_6: // %if.end7
+; CHECK-NEXT: .LBB0_6:
; CHECK-NEXT: .cfi_restore_state
-; CHECK-NEXT: mov w0, wzr
+; CHECK-NEXT: mov w0, #1 // =0x1
; CHECK-NEXT: ldr x30, [sp], #16 // 8-byte Folded Reload
; CHECK-NEXT: .cfi_def_cfa_offset 0
; CHECK-NEXT: .cfi_restore w30
@@ -115,7 +115,7 @@ define i32 @f2(i32 %x) #0 {
; CHECK-NEXT: cbz w0, .LBB2_2
; CHECK-NEXT: // %bb.1: // %if.end
; CHECK-NEXT: bl g1
-; CHECK-NEXT: mov w8, #1
+; CHECK-NEXT: mov w8, #1 // =0x1
; CHECK-NEXT: sub w0, w8, w0
; CHECK-NEXT: ldr x30, [sp], #16 // 8-byte Folded Reload
; CHECK-NEXT: .cfi_def_cfa_offset 0
diff --git a/llvm/test/CodeGen/AArch64/machine-licm-sub-loop.ll b/llvm/test/CodeGen/AArch64/machine-licm-sub-loop.ll
index d4da4f28bb4a3c7..04fd80cd8213fdd 100644
--- a/llvm/test/CodeGen/AArch64/machine-licm-sub-loop.ll
+++ b/llvm/test/CodeGen/AArch64/machine-licm-sub-loop.ll
@@ -15,31 +15,45 @@ define void @foo(i32 noundef %limit, ptr %out, ptr %y) {
; CHECK-NEXT: and x12, x10, #0xfffffff0
; CHECK-NEXT: add x13, x1, #32
; CHECK-NEXT: add x14, x2, #16
-; CHECK-NEXT: b .LBB0_3
-; CHECK-NEXT: .LBB0_2: // %for.cond1.for.cond.cleanup3_crit_edge.us
-; CHECK-NEXT: // in Loop: Header=BB0_3 Depth=1
+; CHECK-NEXT: b .LBB0_6
+; CHECK-NEXT: .LBB0_2: // in Loop: Header=BB0_6 Depth=1
+; CHECK-NEXT: mov x18, xzr
+; CHECK-NEXT: .LBB0_3: // %for.body4.us.preheader
+; CHECK-NEXT: // in Loop: Header=BB0_6 Depth=1
+; CHECK-NEXT: add x16, x18, x8
+; CHECK-NEXT: add x17, x2, x18, lsl #1
+; CHECK-NEXT: sub x18, x10, x18
+; CHECK-NEXT: add x16, x1, x16, lsl #2
+; CHECK-NEXT: .LBB0_4: // %for.body4.us
+; CHECK-NEXT: // Parent Loop BB0_6 Depth=1
+; CHECK-NEXT: // => This Inner Loop Header: Depth=2
+; CHECK-NEXT: ldrsh w3, [x17], #2
+; CHECK-NEXT: ldr w4, [x16]
+; CHECK-NEXT: subs x18, x18, #1
+; CHECK-NEXT: madd w3, w3, w15, w4
+; CHECK-NEXT: str w3, [x16], #4
+; CHECK-NEXT: b.ne .LBB0_4
+; CHECK-NEXT: .LBB0_5: // %for.cond1.for.cond.cleanup3_crit_edge.us
+; CHECK-NEXT: // in Loop: Header=BB0_6 Depth=1
; CHECK-NEXT: add x9, x9, #1
; CHECK-NEXT: add x13, x13, x11
; CHECK-NEXT: add x8, x8, x10
; CHECK-NEXT: cmp x9, x10
; CHECK-NEXT: b.eq .LBB0_10
-; CHECK-NEXT: .LBB0_3: // %for.cond1.preheader.us
+; CHECK-NEXT: .LBB0_6: // %for.cond1.preheader.us
; CHECK-NEXT: // =>This Loop Header: Depth=1
-; CHECK-NEXT: // Child Loop BB0_6 Depth 2
-; CHECK-NEXT: // Child Loop BB0_9 Depth 2
+; CHECK-NEXT: // Child Loop BB0_8 Depth 2
+; CHECK-NEXT: // Child Loop BB0_4 Depth 2
; CHECK-NEXT: ldrsh w15, [x2, x9, lsl #1]
; CHECK-NEXT: cmp w0, #16
-; CHECK-NEXT: b.hs .LBB0_5
-; CHECK-NEXT: // %bb.4: // in Loop: Header=BB0_3 Depth=1
-; CHECK-NEXT: mov x18, xzr
-; CHECK-NEXT: b .LBB0_8
-; CHECK-NEXT: .LBB0_5: // %vector.ph
-; CHECK-NEXT: // in Loop: Header=BB0_3 Depth=1
+; CHECK-NEXT: b.lo .LBB0_2
+; CHECK-NEXT: // %bb.7: // %vector.ph
+; CHECK-NEXT: // in Loop: Header=BB0_6 Depth=1
; CHECK-NEXT: mov x16, x14
; CHECK-NEXT: mov x17, x13
; CHECK-NEXT: mov x18, x12
-; CHECK-NEXT: .LBB0_6: // %vector.body
-; CHECK-NEXT: // Parent Loop BB0_3 Depth=1
+; CHECK-NEXT: .LBB0_8: // %vector.body
+; CHECK-NEXT: // Parent Loop BB0_6 Depth=1
; CHECK-NEXT: // => This Inner Loop Header: Depth=2
; CHECK-NEXT: dup v0.8h, w15
; CHECK-NEXT: ldp q1, q4, [x16, #-16]
@@ -53,28 +67,13 @@ define void @foo(i32 noundef %limit, ptr %out, ptr %y) {
; CHECK-NEXT: smlal v6.4s, v0.4h, v4.4h
; CHECK-NEXT: stp q3, q2, [x17, #-32]
; CHECK-NEXT: stp q6, q5, [x17], #64
-; CHECK-NEXT: b.ne .LBB0_6
-; CHECK-NEXT: // %bb.7: // %middle.block
-; CHECK-NEXT: // in Loop: Header=BB0_3 Depth=1
+; CHECK-NEXT: b.ne .LBB0_8
+; CHECK-NEXT: // %bb.9: // %middle.block
+; CHECK-NEXT: // in Loop: Header=BB0_6 Depth=1
; CHECK-NEXT: cmp x12, x10
; CHECK-NEXT: mov x18, x12
-; CHECK-NEXT: b.eq .LBB0_2
-; CHECK-NEXT: .LBB0_8: // %for.body4.us.preheader
-; CHECK-NEXT: // in Loop: Header=BB0_3 Depth=1
-; CHECK-NEXT: add x16, x18, x8
-; CHECK-NEXT: add x17, x2, x18, lsl #1
-; CHECK-NEXT: sub x18, x10, x18
-; CHECK-NEXT: add x16, x1, x16, lsl #2
-; CHECK-NEXT: .LBB0_9: // %for.body4.us
-; CHECK-NEXT: // Parent Loop BB0_3 Depth=1
-; CHECK-NEXT: // => This Inner Loop Header: Depth=2
-; CHECK-NEXT: ldrsh w3, [x17], #2
-; CHECK-NEXT: ldr w4, [x16]
-; CHECK-NEXT: subs x18, x18, #1
-; CHECK-NEXT: madd w3, w3, w15, w4
-; CHECK-NEXT: str w3, [x16], #4
-; CHECK-NEXT: b.ne .LBB0_9
-; CHECK-NEXT: b .LBB0_2
+; CHECK-NEXT: b.ne .LBB0_3
+; CHECK-NEXT: b .LBB0_5
; CHECK-NEXT: .LBB0_10: // %for.cond.cleanup
; CHECK-NEXT: ret
entry:
diff --git a/llvm/test/CodeGen/AArch64/ragreedy-csr.ll b/llvm/test/CodeGen/AArch64/ragreedy-csr.ll
index 5b501762418ef50..fa0ec49907658b5 100644
--- a/llvm/test/CodeGen/AArch64/ragreedy-csr.ll
+++ b/llvm/test/CodeGen/AArch64/ragreedy-csr.ll
@@ -24,7 +24,7 @@ define fastcc i32 @prune_match(ptr nocapture readonly %a, ptr nocapture readonly
; CHECK-NEXT: ldrh w8, [x0]
; CHECK-NEXT: ldrh w9, [x1]
; CHECK-NEXT: cmp w8, w9
-; CHECK-NEXT: b.ne LBB0_47
+; CHECK-NEXT: b.ne LBB0_2
; CHECK-NEXT: ; %bb.1: ; %if.end
; CHECK-NEXT: sub sp, sp, #64
; CHECK-NEXT: stp x29, x30, [sp, #48] ; 16-byte Folded Spill
@@ -41,8 +41,12 @@ define fastcc i32 @prune_match(ptr nocapture readonly %a, ptr nocapture readonly
; CHECK-NEXT: Lloh1:
; CHECK-NEXT: ldr x14, [x14, __DefaultRuneLocale@GOTPAGEOFF]
; CHECK-NEXT: ldrsb x8, [x9, x11]
-; CHECK-NEXT: tbz x8, #63, LBB0_3
-; CHECK-NEXT: LBB0_2: ; %cond.false.i.i
+; CHECK-NEXT: tbz x8, #63, LBB0_7
+; CHECK-NEXT: b LBB0_6
+; CHECK-NEXT: LBB0_2:
+; CHECK-NEXT: mov w0, wzr
+; CHECK-NEXT: ret
+; CHECK-NEXT: LBB0_3: ; %cond.false.i.i219
; CHECK-NEXT: stp x9, x0, [sp, #32] ; 16-byte Folded Spill
; CHECK-NEXT: mov w0, w8
; CHECK-NEXT: mov w1, #32768 ; =0x8000
@@ -61,32 +65,17 @@ define fastcc i32 @prune_match(ptr nocapture readonly %a, ptr nocapture readonly
; CHECK-NEXT: ldr w12, [sp, #4] ; 4-byte Folded Reload
; CHECK-NEXT: ldr x10, [sp, #8] ; 8-byte Folded Reload
; CHECK-NEXT: ldr x0, [sp, #40] ; 8-byte Folded Reload
-; CHECK-NEXT: cbz w8, LBB0_4
-; CHECK-NEXT: b LBB0_6
-; CHECK-NEXT: LBB0_3: ; %cond.true.i.i
-; CHECK-NEXT: add x8, x14, x8, lsl #2
-; CHECK-NEXT: ldr w8, [x8, #60]
-; CHECK-NEXT: and w8, w8, #0x8000
-; CHECK-NEXT: cbnz w8, LBB0_6
-; CHECK-NEXT: LBB0_4: ; %lor.rhs
-; CHECK-NEXT: ldrsb x8, [x10, x11]
-; CHECK-NEXT: tbnz x8, #63, LBB0_8
-; CHECK-NEXT: ; %bb.5: ; %cond.true.i.i217
-; CHECK-NEXT: add x8, x14, x8, lsl #2
-; CHECK-NEXT: ldr w8, [x8, #60]
-; CHECK-NEXT: and w8, w8, #0x8000
-; CHECK-NEXT: cbz w8, LBB0_9
-; CHECK-NEXT: LBB0_6: ; %while.body
+; CHECK-NEXT: cbz w8, LBB0_10
+; CHECK-NEXT: LBB0_4: ; %while.body
; CHECK-NEXT: ldrb w8, [x9, x11]
; CHECK-NEXT: ldrb w15, [x10, x11]
; CHECK-NEXT: cmp w8, w15
-; CHECK-NEXT: b.ne LBB0_42
-; CHECK-NEXT: ; %bb.7: ; %if.end17
+; CHECK-NEXT: b.ne LBB0_43
+; CHECK-NEXT: ; %bb.5: ; %if.end17
; CHECK-NEXT: add x11, x11, #1
; CHECK-NEXT: ldrsb x8, [x9, x11]
-; CHECK-NEXT: tbz x8, #63, LBB0_3
-; CHECK-NEXT: b LBB0_2
-; CHECK-NEXT: LBB0_8: ; %cond.false.i.i219
+; CHECK-NEXT: tbz x8, #63, LBB0_7
+; CHECK-NEXT: LBB0_6: ; %cond.false.i.i
; CHECK-NEXT: stp x9, x0, [sp, #32] ; 16-byte Folded Spill
; CHECK-NEXT: mov w0, w8
; CHECK-NEXT: mov w1, #32768 ; =0x8000
@@ -105,163 +94,174 @@ define fastcc i32 @prune_match(ptr nocapture readonly %a, ptr nocapture readonly
; CHECK-NEXT: ldr w12, [sp, #4] ; 4-byte Folded Reload
; CHECK-NEXT: ldr x10, [sp, #8] ; 8-byte Folded Reload
; CHECK-NEXT: ldr x0, [sp, #40] ; 8-byte Folded Reload
-; CHECK-NEXT: cbnz w8, LBB0_6
-; CHECK-NEXT: LBB0_9: ; %while.end
+; CHECK-NEXT: cbz w8, LBB0_8
+; CHECK-NEXT: b LBB0_4
+; CHECK-NEXT: LBB0_7: ; %cond.true.i.i
+; CHECK-NEXT: add x8, x14, x8, lsl #2
+; CHECK-NEXT: ldr w8, [x8, #60]
+; CHECK-NEXT: and w8, w8, #0x8000
+; CHECK-NEXT: cbnz w8, LBB0_4
+; CHECK-NEXT: LBB0_8: ; %lor.rhs
+; CHECK-NEXT: ldrsb x8, [x10, x11]
+; CHECK-NEXT: tbnz x8, #63, LBB0_3
+; CHECK-NEXT: ; %bb.9: ; %cond.true.i.i217
+; CHECK-NEXT: add x8, x14, x8, lsl #2
+; CHECK-NEXT: ldr w8, [x8, #60]
+; CHECK-NEXT: and w8, w8, #0x8000
+; CHECK-NEXT: cbnz w8, LBB0_4
+; CHECK-NEXT: LBB0_10: ; %while.end
; CHECK-NEXT: orr w8, w13, w12
-; CHECK-NEXT: cbnz w8, LBB0_24
-; CHECK-NEXT: ; %bb.10: ; %if.then23
+; CHECK-NEXT: cbnz w8, LBB0_25
+; CHECK-NEXT: ; %bb.11: ; %if.then23
; CHECK-NEXT: ldr x12, [x0, #16]
; CHECK-NEXT: ldrb w8, [x9, x11]
; CHECK-NEXT: ldrb w13, [x12]
; CHECK-NEXT: cmp w13, #83
-; CHECK-NEXT: b.eq LBB0_19
-; CHECK-NEXT: LBB0_11: ; %while.cond59.preheader
-; CHECK-NEXT: cbz w8, LBB0_23
-; CHECK-NEXT: LBB0_12: ; %land.rhs.preheader
+; CHECK-NEXT: b.eq LBB0_20
+; CHECK-NEXT: LBB0_12: ; %while.cond59.preheader
+; CHECK-NEXT: cbz w8, LBB0_24
+; CHECK-NEXT: LBB0_13: ; %land.rhs.preheader
; CHECK-NEXT: add x12, x9, x11
; CHECK-NEXT: add x9, x10, x11
; CHECK-NEXT: add x10, x12, #1
-; CHECK-NEXT: LBB0_13: ; %land.rhs
+; CHECK-NEXT: LBB0_14: ; %land.rhs
; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1
; CHECK-NEXT: ldrb w11, [x9], #1
-; CHECK-NEXT: cbz w11, LBB0_23
-; CHECK-NEXT: ; %bb.14: ; %while.body66
-; CHECK-NEXT: ; in Loop: Header=BB0_13 Depth=1
-; CHECK-NEXT: cmp w8, #42
-; CHECK-NEXT: b.eq LBB0_18
+; CHECK-NEXT: cbz w11, LBB0_24
; CHECK-NEXT: ; %bb.15: ; %while.body66
-; CHECK-NEXT: ; in Loop: Header=BB0_13 Depth=1
+; CHECK-NEXT: ; in Loop: Header=BB0_14 Depth=1
+; CHECK-NEXT: cmp w8, #42
+; CHECK-NEXT: b.eq LBB0_19
+; CHECK-NEXT: ; %bb.16: ; %while.body66
+; CHECK-NEXT: ; in Loop: Header=BB0_14 Depth=1
; CHECK-NEXT: cmp w11, #42
-; CHECK-NEXT: b.eq LBB0_18
-; CHECK-NEXT: ; %bb.16: ; %lor.lhs.false74
-; CHECK-NEXT: ; in Loop: Header=BB0_13 Depth=1
+; CHECK-NEXT: b.eq LBB0_19
+; CHECK-NEXT: ; %bb.17: ; %lor.lhs.false74
+; CHECK-NEXT: ; in Loop: Header=BB0_14 Depth=1
; CHECK-NEXT: cmp w8, w11
; CHECK-NEXT: mov w0, wzr
-; CHECK-NEXT: b.ne LBB0_43
-; CHECK-NEXT: ; %bb.17: ; %lor.lhs.false74
-; CHECK-NEXT: ; in Loop: Header=BB0_13 Depth=1
+; CHECK-NEXT: b.ne LBB0_44
+; CHECK-NEXT: ; %bb.18: ; %lor.lhs.false74
+; CHECK-NEXT: ; in Loop: Header=BB0_14 Depth=1
; CHECK-NEXT: cmp w8, #94
-; CHECK-NEXT: b.eq LBB0_43
-; CHECK-NEXT: LBB0_18: ; %if.then83
-; CHECK-NEXT: ; in Loop: Header=BB0_13 Depth=1
+; CHECK-NEXT: b.eq LBB0_44
+; CHECK-NEXT: LBB0_19: ; %if.then83
+; CHECK-NEXT: ; in Loop: Header=BB0_14 Depth=1
; CHECK-NEXT: ldrb w8, [x10], #1
; CHECK-NEXT: mov w0, #1 ; =0x1
-; CHECK-NEXT: cbnz w8, LBB0_13
-; CHECK-NEXT: b LBB0_43
-; CHECK-NEXT: LBB0_19: ; %land.lhs.true28
-; CHECK-NEXT: cbz w8, LBB0_23
-; CHECK-NEXT: ; %bb.20: ; %land.lhs.true28
+; CHECK-NEXT: cbnz w8, LBB0_14
+; CHECK-NEXT: b LBB0_44
+; CHECK-NEXT: LBB0_20: ; %land.lhs.true28
+; CHECK-NEXT: cbz w8, LBB0_24
+; CHECK-NEXT: ; %bb.21: ; %land.lhs.true28
; CHECK-NEXT: cmp w8, #112
-; CHECK-NEXT: b.ne LBB0_12
-; CHECK-NEXT: ; %bb.21: ; %land.lhs.true35
+; CHECK-NEXT: b.ne LBB0_13
+; CHECK-NEXT: ; %bb.22: ; %land.lhs.true35
; CHECK-NEXT: ldrb w13, [x10, x11]
; CHECK-NEXT: cmp w13, #112
-; CHECK-NEXT: b.ne LBB0_12
-; CHECK-NEXT: ; %bb.22: ; %land.lhs.true43
+; CHECK-NEXT: b.ne LBB0_13
+; CHECK-NEXT: ; %bb.23: ; %land.lhs.true43
; CHECK-NEXT: sub x12, x9, x12
; CHECK-NEXT: add x12, x12, x11
; CHECK-NEXT: cmp x12, #1
-; CHECK-NEXT: b.ne LBB0_44
-; CHECK-NEXT: LBB0_23:
+; CHECK-NEXT: b.ne LBB0_45
+; CHECK-NEXT: LBB0_24:
; CHECK-NEXT: mov w0, #1 ; =0x1
-; CHECK-NEXT: b LBB0_43
-; CHECK-NEXT: LBB0_24: ; %if.else88
+; CHECK-NEXT: b LBB0_44
+; CHECK-NEXT: LBB0_25: ; %if.else88
; CHECK-NEXT: cmp w12, #1
-; CHECK-NEXT: b.ne LBB0_33
-; CHECK-NEXT: ; %bb.25: ; %if.else88
+; CHECK-NEXT: b.ne LBB0_34
+; CHECK-NEXT: ; %bb.26: ; %if.else88
; CHECK-NEXT: cmp w13, #2
-; CHECK-NEXT: b.ne LBB0_33
-; CHECK-NEXT: ; ...
[truncated]
I am curious if there is a way to teach BFI to return appropriately rounded values, so that we don't modify downstream algorithms? Similarly, do you think the lower bits in the frequencies provide any signal, or are they just noise that always needs to be discarded?
Yes, in fact I started out with code "rounding" all BFI values and still have it around here. But initially I wanted this to be more of a drive-by fix rather than challenging base-layer assumptions, which would probably need more discussion from stakeholders... Also, part of the motivation for the current fix is plainly that #66285 forces me to update hundreds of unit tests, and this seemed like an easier "drive-by" change so I can frontload a number of those updates... So yeah, rounding in BFI may be better, but I would like some more feedback / support if we want to go in that direction. In the meantime I think this change shouldn't hurt?
I also liked the idea of building the rounding into BFI. Initially the rounding can be controlled/enabled by a switch, so we don't suddenly introduce big churn for all PGO users. Beyond your need for test-case stability, another real benefit of rounding within BFI is that it can help reduce binary-level instability for sample PGO, due to the nature of sampling.
You are already challenging the base layer in #66285, in a good way though. :) I'd suggest we do it properly within BFI, rather than in a downstream optimization.
Is there any measurable performance impact with this change?
I don't expect any changes, as this should only affect cases where frequencies are very close to each other, so any choice should have a similar effect. I haven't benchmarked this but I could run it through the llvm-test-suite if desired. (I'm not sure I can sensibly measure it with our production workloads, as BOLT will reorder all the basic blocks again anyway for us.) The motivation here is to reduce the amount of reordering that is triggered by inconsequentially small changes in BFI data. I did this because I had to touch most of these tests for #66285 anyway, and it seemed like a good idea to make the algorithm more stable to hopefully reduce churn in the future too. I could also see this being useful to reduce day-to-day churn as sampling PGO data changes, though that likely needs tuning and tighter values for the
We can do prod measurement with BOLT turned off. I don't expect meaningful perf movement, but it's indeed a good sanity check.
Yes. It needs to be 1) off by default initially, and 2) tunable for the rounding. And just to be clear, I'm mostly thinking about rounding within BFI, not this change as is.
A sanity check would be great to avoid an unexpected downward performance change (regression) as well.
IIUC, these functions do not satisfy a weak ordering. Suppose a series of similar BFI edge frequencies w_1, ..., w_n is considered, where each consecutive pair is within tolerance of each other. We can still have w_1 lessWithTolerance w_n. In that case, is it a good idea to expose these as pseudo-comparators?
Managed to get one build from our internal "unicorn" service. It shows no regression (a tiny improvement that is likely benchmarking noise).
Yes, it seems this is only a strict partial order but not a weak ordering. But is this actually a problem? Would people infer this from the function names and get confused if it isn't? Are you saying we should find a weak ordering here? I don't immediately see any benefits from that, at least for my use case. Or are you proposing that we should find a different name for these functions to set other expectations? But then what name should we choose? I personally didn't have expectations one way or another...
Yes. I suggest we use different naming and also make these non-member functions. The current naming does not resonate with the definitions of these functions. I think these functions define whether one frequency is significantly greater or less than the other.
|
Force-pushed from d983c7b to ad0c45b (compare)
Thanks for the feedback. I contemplated the comments about this not being a weak ordering, and there are indeed cases where picking the "best" candidate out of a list will give different results based on the list order, which isn't desirable. So I reworked the approach to perform a two-pass algorithm: we first search for the biggest frequency (without any tolerances) and then, in a 2nd pass, check all blocks that are almost equal to the biggest frequency. This also means that I now have an `almostEqual` check instead of the pseudo-comparators. I also think that this approach is slightly better than pre-rounding BFI numbers: with pre-rounding we may still have two candidate blocks with very similar frequencies that happen to sit at the border where one gets rounded up and the other rounded down, which doesn't happen with this approach.
✅ With the latest revision this PR passed the C/C++ code formatter.
FWIW: I also realized that the large majority of the instability I was tackling came from
Force-pushed from ad0c45b to fc034b7 (compare)
I like the new approach. Some general comments:
Most tests have no profile information, and so we end up assigning equal branch weights in many cases. This means multiple successors with the same BlockFrequency are pretty common. To make matters worse, not every integer divides cleanly. Say a block has a BlockFrequency value of 7 and two successors with equal branch weights. We will assign frequencies 3 and 4 to the successors, introducing a kind of extra numerical inaccuracy that isn't meaningful for the code.
I was hoping that this is mostly a drive-by change to make my life easier for #66285. I did not get around to learning about the ext-tsp code, so it won't be affected by this change. It would be good to introduce an equivalent change there too at some point.
I wasn't sure if anybody would ever use such a flag. I did have a flag during my personal experiments, but I felt we could rather save a couple of lines of code when upstreaming, and maybe gain a tiny bit of perf from hardcoded values. That said, I am happy to bring back a flag if you prefer that.
I wonder why the existing implementation of block placement relies on block frequencies (which are relative to the entry block) instead of block counts. Would using counts be equivalent (or very similar) to the rounding suggested in the diff?
The block counts are not available. We only have a single count for the function entry plus branch weights. Functions like
That's exactly what I'm asking: why not use the "emulated/estimated" counts? You get the same rounding effect as suggested in the diff, but in addition the raw counts have an intuitive interpretation ("how many times was a block/jump executed?").
The LLVM BFI data is susceptible to numerical "noise" in the lower bits: To get some feeling for this, consider that we store `BlockFrequency` as 64-bit integers and branch weights as 32-bit. There is a natural numerical difference when the branch weight divisions are not representable as integers, and because block frequencies are scaled by loop factors, the absolute values of those errors can appear big. Regardless, it should be a small difference relative to the frequency numbers.
- Add a `BlockFrequency::almostEqual` function to check whether two frequencies are close enough to each other that the topmost N bits are equal and only lower bits differ.
- Change `MachineBlockPlacement::selectBestCandidateBlock` to be a two-pass algorithm that first determines the candidate with the biggest frequency and then performs a 2nd pass among blocks with "almostEqual" frequency, preferring the one that is already the layout successor or first in the worklist.
The current value of 20 significant bits works well to reduce unnecessary changes in my upcoming BFI precision changes.
Force-pushed from fc034b7 to ebe61e1 (compare)
Any other thoughts on whether we should/could use profile counts in MachineBlockPlacement (and perhaps other optimizations) instead of frequencies? Intuitively it makes sense to me, and it would resolve some of the stability issues discussed in the diff. It seems there is already such an option for tail duplication (https://reviews.llvm.org/D83265). I'm curious why it is not applied for other placement heuristics? cc @david-xl
The question is whether we could have a better approach which covers both at the same time (IIUC, rounding handles this, but does not have the nice relativity feature that we have in your PR).
Another general question: do you have cases where this precision problem manifests itself (e.g., unstable/non-deterministic builds)?
No, I don't have any real-world evidence other than flip-flopping test cases. Builds are of course deterministic without this patch already, in the sense that given the same inputs you get the same binary. On the other hand, code can still look wildly different for minor changes in the profile inputs, and other areas like inlining likely show similar effects.
Yes, I appreciate that. This has sparked a lot more (good) discussion than I expected, and it seems a good solution here is more than a "drive-by" patch like this. So I am less sure I want to go down this road and will probably just put this on pause for now (and update the extra test cases in #66285).
The LLVM BFI data is susceptible to numerical "noise" in the lower bits:
To get some feeling for this, consider that we store `BlockFrequency` as
64-bit integers and branch weights as 32-bit. There is a natural
numerical difference when the branch weight divisions are not
representable as integers, and because block frequencies are scaled by
loop factors the absolute values of those errors can appear big.
Regardless, it should be a small difference relative to the frequency
numbers.

This change makes the MachineBlockPlacement algorithm more tolerant
against small relative changes in block frequency numbers:

- Add `BlockFrequency::greaterWithTolerance` and
  `BlockFrequency::lessWithTolerance` functions that allow for
  greater/smaller comparisons on the top N significant bits only,
  excluding leading zeros.
- Change `MachineBlockPlacement::selectBestCandidateBlock` and
  `MachineBlockPlacement::findBestLoopTopHelper` to only consider
  something a better candidate when it passes the tolerance check.
- Layout successors are picked as best candidate when they are not worse
  than the current best with tolerance. This incentivizes keeping the
  placement as-is when multiple candidates have similar frequencies.

The current value of 20 significant bits works well to reduce
unnecessary changes in my upcoming BFI precision changes. It is likely a
good idea to reduce this number further to accommodate noise in PGO
data, but I did not perform any deeper analysis on what number would fit
this goal.