Bfi precision #66285
Conversation
@llvm/pr-subscribers-llvm-transforms @llvm/pr-subscribers-llvm-analysis

Changes

I noticed that we often have poor precision in the `BlockFrequencyInfo` results because of how `convertFloatingToInteger` in the algorithm chooses factors.

This is a request for comments for choosing more aggressive factors! Look at the changes in `test/Analysis/BlockFrequencyInfo/precision.ll` to get an impression of the poor precision and the improvements.

Has this been tried before? Is there history I am not aware of?

Right now this change does not work yet, as it triggers overflows and some instabilities in various places that I am working through. I thought it may be interesting to get some early feedback though, in case this change is misguided...

Full diff: https://github.com/llvm/llvm-project/pull/66285.diff

9 Files Affected:
diff --git a/llvm/lib/Analysis/BlockFrequencyInfoImpl.cpp b/llvm/lib/Analysis/BlockFrequencyInfoImpl.cpp
index 82b1e3b9eede709..93661d2edeb3083 100644
--- a/llvm/lib/Analysis/BlockFrequencyInfoImpl.cpp
+++ b/llvm/lib/Analysis/BlockFrequencyInfoImpl.cpp
@@ -481,26 +481,15 @@ void BlockFrequencyInfoImplBase::distributeMass(const BlockNode &Source,
 static void convertFloatingToInteger(BlockFrequencyInfoImplBase &BFI,
                                      const Scaled64 &Min, const Scaled64 &Max) {
-  // Scale the Factor to a size that creates integers. Ideally, integers would
-  // be scaled so that Max == UINT64_MAX so that they can be best
-  // differentiated. However, in the presence of large frequency values, small
-  // frequencies are scaled down to 1, making it impossible to differentiate
-  // small, unequal numbers. When the spread between Min and Max frequencies
-  // fits well within MaxBits, we make the scale be at least 8.
-  const unsigned MaxBits = 64;
-  const unsigned SpreadBits = (Max / Min).lg();
-  Scaled64 ScalingFactor;
-  if (SpreadBits <= MaxBits - 3) {
-    // If the values are small enough, make the scaling factor at least 8 to
-    // allow distinguishing small values.
-    ScalingFactor = Min.inverse();
-    ScalingFactor <<= 3;
-  } else {
-    // If the values need more than MaxBits to be represented, saturate small
-    // frequency values down to 1 by using a scaling factor that benefits large
-    // frequency values.
-    ScalingFactor = Scaled64(1, MaxBits) / Max;
-  }
+  // Scale the Factor to a size that creates integers. If possible, scale
+  // integers so that Max == UINT64_MAX so that they can be best
+  // differentiated. It is possible that the range between min and max cannot
+  // be accurately represented in a 64-bit integer without either losing
+  // precision for small values (so small unequal numbers all map to 1) or
+  // saturating big numbers (so unequal big numbers may map to UINT64_MAX).
+  // We choose to lose precision for small numbers.
+  const unsigned MaxBits = sizeof(Scaled64::DigitsType) * CHAR_BIT;
+  Scaled64 ScalingFactor = Scaled64(1, MaxBits) / Max;
 
   // Translate the floats to integers.
   LLVM_DEBUG(dbgs() << "float-to-int: min = " << Min << ", max = " << Max
diff --git a/llvm/test/Analysis/BlockFrequencyInfo/loops_with_profile_info.ll b/llvm/test/Analysis/BlockFrequencyInfo/loops_with_profile_info.ll
index 41226a1cdfbaf32..fadb47bd256b772 100644
--- a/llvm/test/Analysis/BlockFrequencyInfo/loops_with_profile_info.ll
+++ b/llvm/test/Analysis/BlockFrequencyInfo/loops_with_profile_info.ll
@@ -59,7 +59,7 @@ declare i32 @printf(i8*, ...)
 ; CHECK: Printing analysis {{.*}} for function 'main':
 ; CHECK-NEXT: block-frequency-info: main
-define i32 @main() {
+define i32 @main() !prof !6 {
 entry:
   %retval = alloca i32, align 4
   %i = alloca i32, align 4
@@ -93,7 +93,7 @@ for.cond4: ; preds = %for.inc, %for.body3
   %cmp5 = icmp slt i32 %2, 100
   br i1 %cmp5, label %for.body6, label %for.end, !prof !3
-; CHECK: - for.body6: float = 500000.5, int = 4000004
+; CHECK: - for.body6: {{.*}}, count = 1000000
 for.body6: ; preds = %for.cond4
   call void @bar()
   br label %for.inc
@@ -143,7 +143,7 @@ for.cond16: ; preds = %for.inc19, %for.bod
   %cmp17 = icmp slt i32 %8, 10000
   br i1 %cmp17, label %for.body18, label %for.end21, !prof !4
-; CHECK: - for.body18: float = 499999.9, int = 3999998
+; CHECK: - for.body18: {{.*}}, count = 1000000
 for.body18: ; preds = %for.cond16
   call void @bar()
   br label %for.inc19
@@ -175,7 +175,7 @@ for.cond26: ; preds = %for.inc29, %for.end
   %cmp27 = icmp slt i32 %12, 1000000
   br i1 %cmp27, label %for.body28, label %for.end31, !prof !5
-; CHECK: - for.body28: float = 499995.2, int = 3999961
+; CHECK: - for.body28: {{.*}}, count = 1000224
 for.body28: ; preds = %for.cond26
   call void @bar()
   br label %for.inc29
@@ -197,8 +197,9 @@ for.end31: ; preds = %for.cond26
 !llvm.ident = !{!0}
 
 !0 = !{!"clang version 3.7.0 (trunk 232635) (llvm/trunk 232636)"}
-!1 = !{!"branch_weights", i32 101, i32 2}
-!2 = !{!"branch_weights", i32 10001, i32 101}
-!3 = !{!"branch_weights", i32 1000001, i32 10001}
-!4 = !{!"branch_weights", i32 1000001, i32 101}
-!5 = !{!"branch_weights", i32 1000001, i32 2}
+!1 = !{!"branch_weights", i32 100, i32 1}
+!2 = !{!"branch_weights", i32 10000, i32 100}
+!3 = !{!"branch_weights", i32 1000000, i32 10000}
+!4 = !{!"branch_weights", i32 1000000, i32 100}
+!5 = !{!"branch_weights", i32 1000000, i32 1}
+!6 = !{!"function_entry_count", i32 1}
diff --git a/llvm/test/Analysis/BlockFrequencyInfo/precision.ll b/llvm/test/Analysis/BlockFrequencyInfo/precision.ll
new file mode 100644
index 000000000000000..7408d002d065d5b
--- /dev/null
+++ b/llvm/test/Analysis/BlockFrequencyInfo/precision.ll
@@ -0,0 +1,43 @@
+; RUN: opt < %s -disable-output -passes="print<block-freq>" 2>&1 | FileCheck %s
+; Sanity check precision for small-ish min/max spread.
+
+@g = global i32 0
+
+; CHECK-LABEL: block-frequency-info: func0
+; CHECK: - entry: float = 1.0, {{.*}}, count = 1000
+; CHECK: - cmp0_true: float = 0.4, {{.*}}, count = 400
+; CHECK: - cmp0_false: float = 0.6, {{.*}}, count = 600
+; CHECK: - cmp1_true: float = 0.1, {{.*}}, count = 100
+; CHECK: - cmp1_false: float = 0.3, {{.*}}, count = 300
+; CHECK: - join: float = 1.0, {{.*}}, count = 1000
+
+define void @func0(i32 %a0, i32 %a1) !prof !0 {
+entry:
+  %cmp0 = icmp ne i32 %a0, 0
+  br i1 %cmp0, label %cmp0_true, label %cmp0_false, !prof !1
+
+cmp0_true:
+  store volatile i32 1, ptr @g
+  %cmp1 = icmp ne i32 %a1, 0
+  br i1 %cmp1, label %cmp1_true, label %cmp1_false, !prof !2
+
+cmp0_false:
+  store volatile i32 2, ptr @g
+  br label %join
+
+cmp1_true:
+  store volatile i32 3, ptr @g
+  br label %join
+
+cmp1_false:
+  store volatile i32 4, ptr @g
+  br label %join
+
+join:
+  store volatile i32 5, ptr @g
+  ret void
+}
+
+!0 = !{!"function_entry_count", i64 1000}
+!1 = !{!"branch_weights", i32 400, i32 600}
+!2 = !{!"branch_weights", i32 1, i32 3}
diff --git a/llvm/test/Transforms/CodeExtractor/MultipleExitBranchProb.ll b/llvm/test/Transforms/CodeExtractor/MultipleExitBranchProb.ll
index 63568456d0e58c8..f1eb19dbfdbea42 100644
--- a/llvm/test/Transforms/CodeExtractor/MultipleExitBranchProb.ll
+++ b/llvm/test/Transforms/CodeExtractor/MultipleExitBranchProb.ll
@@ -17,8 +17,8 @@ return: ; preds = %entry
 
 define internal i32 @dummyCaller(i1 %cond) !prof !1 {
 entry:
-%val = call i32 @inlinedFunc(i1 %cond)
-ret i32 %val
+  %val = call i32 @inlinedFunc(i1 %cond)
+  ret i32 %val
 ; CHECK-LABEL: @dummyCaller
 ; CHECK: call
@@ -31,4 +31,4 @@ ret i32 %val
 !2 = !{!"branch_weights", i32 5, i32 5}
 !3 = !{!"branch_weights", i32 4, i32 1}
-; CHECK: [[COUNT1]] = !{!"branch_weights", i32 31, i32 8}
+; CHECK: [[COUNT1]] = !{!"branch_weights", i32 858993459, i32 214748365}
diff --git a/llvm/test/Transforms/ConstantHoisting/AArch64/consthoist-unreachable.ll b/llvm/test/Transforms/ConstantHoisting/AArch64/consthoist-unreachable.ll
index 69e84e942de65bf..64d7c229ff4e52c 100644
--- a/llvm/test/Transforms/ConstantHoisting/AArch64/consthoist-unreachable.ll
+++ b/llvm/test/Transforms/ConstantHoisting/AArch64/consthoist-unreachable.ll
@@ -10,14 +10,13 @@ define void @c() {
 ; CHECK-LABEL: @c(
 ; CHECK-NEXT: entry:
+; CHECK-NEXT: [[CONST:%.*]] = bitcast i32 1232131 to i32
 ; CHECK-NEXT: [[TOBOOL:%.*]] = icmp ne i16 0, 0
 ; CHECK-NEXT: br i1 undef, label [[LBL1_US:%.*]], label [[ENTRY_ENTRY_SPLIT_CRIT_EDGE:%.*]]
 ; CHECK: entry.entry.split_crit_edge:
-; CHECK-NEXT: [[CONST:%.*]] = bitcast i32 1232131 to i32
 ; CHECK-NEXT: br label [[LBL1:%.*]]
 ; CHECK: lbl1.us:
-; CHECK-NEXT: [[CONST1:%.*]] = bitcast i32 1232131 to i32
-; CHECK-NEXT: store i32 [[CONST1]], ptr @c.a, align 1
+; CHECK-NEXT: store i32 [[CONST]], ptr @c.a, align 1
 ; CHECK-NEXT: br label [[FOR_COND4:%.*]]
 ; CHECK: lbl1:
 ; CHECK-NEXT: store i32 [[CONST]], ptr @c.a, align 1
diff --git a/llvm/test/Transforms/LoopVectorize/X86/no_fpmath_with_hotness.ll b/llvm/test/Transforms/LoopVectorize/X86/no_fpmath_with_hotness.ll
index b1fc96ea77ed034..4f413a50837dd69 100644
--- a/llvm/test/Transforms/LoopVectorize/X86/no_fpmath_with_hotness.ll
+++ b/llvm/test/Transforms/LoopVectorize/X86/no_fpmath_with_hotness.ll
@@ -108,5 +108,5 @@ attributes #0 = { nounwind }
   isOptimized: true, flags: "-O2", splitDebugFilename: "abc.debug", emissionKind: 2)
 !29 = !{!"function_entry_count", i64 3}
-!30 = !{!"branch_weights", i32 99, i32 1}
+!30 = !{!"branch_weights", i32 10000, i32 1}
 !31 = !{!"branch_weights", i32 1, i32 99}
diff --git a/llvm/test/Transforms/LoopVectorize/diag-with-hotness-info-2.ll b/llvm/test/Transforms/LoopVectorize/diag-with-hotness-info-2.ll
index ed107b10dcd9874..4da1d099645bee2 100644
--- a/llvm/test/Transforms/LoopVectorize/diag-with-hotness-info-2.ll
+++ b/llvm/test/Transforms/LoopVectorize/diag-with-hotness-info-2.ll
@@ -198,5 +198,5 @@ attributes #0 = { norecurse nounwind ssp uwtable "disable-tail-calls"="false" "l
 !55 = distinct !{!55, !43}
 !56 = !{!"function_entry_count", i64 3}
 !57 = !{!"function_entry_count", i64 50}
-!58 = !{!"branch_weights", i32 99, i32 1}
+!58 = !{!"branch_weights", i32 10000, i32 1}
 !59 = !{!"branch_weights", i32 1, i32 99}
diff --git a/llvm/test/Transforms/LoopVectorize/diag-with-hotness-info.ll b/llvm/test/Transforms/LoopVectorize/diag-with-hotness-info.ll
index 30d11a12c79c4bc..4b7b714a2562800 100644
--- a/llvm/test/Transforms/LoopVectorize/diag-with-hotness-info.ll
+++ b/llvm/test/Transforms/LoopVectorize/diag-with-hotness-info.ll
@@ -209,5 +209,5 @@ attributes #0 = { norecurse nounwind ssp uwtable "disable-tail-calls"="false" "l
 !55 = distinct !{!55, !43}
 !56 = !{!"function_entry_count", i64 3}
 !57 = !{!"function_entry_count", i64 50}
-!58 = !{!"branch_weights", i32 99, i32 1}
+!58 = !{!"branch_weights", i32 10000, i32 1}
 !59 = !{!"branch_weights", i32 1, i32 99}
diff --git a/llvm/test/Transforms/SampleProfile/pseudo-probe-update-2.ll b/llvm/test/Transforms/SampleProfile/pseudo-probe-update-2.ll
index 19e83649723d642..105494942d383d5 100644
--- a/llvm/test/Transforms/SampleProfile/pseudo-probe-update-2.ll
+++ b/llvm/test/Transforms/SampleProfile/pseudo-probe-update-2.ll
@@ -14,8 +14,8 @@ T1: ; preds = %0
   %v1 = call i32 @f1(), !prof !12
   %cond3 = icmp eq i32 %v1, 412
   call void @llvm.pseudoprobe(i64 6699318081062747564, i64 2, i32 0, i64 -1)
-;; The distribution factor -8513881372706734080 stands for 53.85%, which is from 7/(6+7).
-; CHECK: call void @llvm.pseudoprobe(i64 [[#GUID:]], i64 4, i32 0, i64 -8513881372706734080)
+;; The distribution factor -9223372036854775808 stands for 53.85%, which is from 7/(6+7).
+; CHECK: call void @llvm.pseudoprobe(i64 [[#GUID:]], i64 4, i32 0, i64 -9223372036854775808)
 call void @llvm.pseudoprobe(i64 6699318081062747564, i64 4, i32 0, i64 -1), !dbg !13
 ;; Probe 7 has two copies, since they don't share the same inline context, they are not
 ;; considered sharing samples, thus their distribution factors are not fixed up.
@@ -29,8 +29,8 @@ T1: ; preds = %0
 Merge: ; preds = %0
   %v2 = call i32 @f2(), !prof !12
   call void @llvm.pseudoprobe(i64 6699318081062747564, i64 3, i32 0, i64 -1)
-;; The distribution factor 8513881922462547968 stands for 46.25%, which is from 6/(6+7).
-; CHECK: call void @llvm.pseudoprobe(i64 [[#GUID]], i64 4, i32 0, i64 8513881922462547968)
+;; The distribution factor -9223372036854775808 stands for 46.25%, which is from 6/(6+7).
+; CHECK: call void @llvm.pseudoprobe(i64 [[#GUID]], i64 4, i32 0, i64 -9223372036854775808)
 call void @llvm.pseudoprobe(i64 6699318081062747564, i64 4, i32 0, i64 8513881922462547968), !dbg !13
 ; CHECK: call void @llvm.pseudoprobe(i64 [[#GUID]], i64 7, i32 0, i64 -1)
 call void @llvm.pseudoprobe(i64 6699318081062747564, i64 7, i32 0, i64 -1), !dbg !18
@@ -77,4 +77,4 @@ attributes #0 = { inaccessiblememonly nounwind willreturn }
 !16 = distinct !DILocation(line: 10, column: 11, scope: !17)
 !17 = !DILexicalBlockFile(scope: !4, file: !5, discriminator: 186646551)
 !18 = !DILocation(line: 53, column: 3, scope: !15, inlinedAt: !19)
-!19 = !DILocation(line: 12, column: 3, scope: !4) \ No newline at end of file
+!19 = !DILocation(line: 12, column: 3, scope: !4)
Comment in the code from the Phab review, which I think was committed mostly as is (https://reviews.llvm.org/D3125, but it wasn't used; the active review was on the commits list):
My memory is that I had equivalent code to this change originally, then backed off to this compromise after hitting performance regressions, but the intention was (originally) that once we figured out what was wrong we could go back to this. I don't have access to my email from the time, but I dug up the start of the llvm-commits thread at https://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20140303/207695.html. Unfortunately it's split by week. I'll reply again if I can find the specific motivation in the thread. There's been a lot of work on BFI since then, so it's possible this will work now.
Actually, it must have come up when I ran tests before posting the patches to the list, since the initial patch has the same comment. Almost 10 years ago now, so I can't remember if this was the llvm unit tests or the test-suite. IMO it's a good idea to revisit this!
Thanks for confirming @duncan! The current state is mostly a result of http://reviews.llvm.org/D8718 and 10be9a8, it seems, and I am trying to understand the intentions there, or whether that could have been an accident. I am definitely running into overflow and noise problems with this patch applied and am working on fixing these issues as I see them; hopefully this is manageable. Another discussion here is whether to normalize the "min" value to integer 1 and have higher precision for the small numbers while saturating big numbers, or to normalize "max" to UINT64_MAX.
For the precision.ll test case, without the patch, we get 388/600 vs 400/600, and 88/288 vs 100/300. How much do we depend on this level of difference? Profile imprecision is an inherent attribute due to other reasons as well, for instance race conditions in instrumentation counter updates, sampling period differences, etc.
I don't know if the low precision is a problem for practical code. So far it has mostly annoyed me when writing simple test cases that look so obvious that I expected them to just work :) It is also very easy to fix these cases just by choosing bigger factors in `convertFloatingToInteger`. Are there any downsides to bigger factors? I am certainly experiencing overflow problems, but those seem worth fixing anyway, given we could also hit them for big min/max spreads in practice...
For PGO, we care about two things: 1) the ratio of the branch weights of the same branch, and 2) the 'actual' runtime count of a block or callsite. The latter is computed by multiplying the relative frequency (relative to the entry block) with the entry count (the real count). In other words, as long as there is no overflow/underflow, the scale factor used does not really matter. On the other hand, if overflow happens, the branch direction can be flipped, which is a real problem.
Thanks for the comment. I still think choosing more aggressive factors in the case of small spreads should not hurt anything, and we can easily do so. Even if that just makes people writing test cases happier :) As for overflows, I am still trying to get a feeling for how much we need to go and fix those (as in #66280) or whether we have to accept them as a fact of (engineering) life and rather leave enough room in the values...
Sure. As long as there is no over/underflow with the new approach, it is good to have the extra precision kept.
So we probably want something like the following to make this explicit and have a dial between precision and overflow-bug-avoidance:
I'll try this and see how it goes...
FWIW, I'm supportive of the investigation here. Just one comment:
I think so, but we probably will lose some valuable information too if we collapse small values. Long story short, we'll have regressions :).
I found that
If the spread between the "min" and "max" frequency is bigger than 64 bits, then we have to "collapse" on one end (short of replacing the whole system with something like llvm::ScaledNumber or floating point; but that is a discussion for another day, I suppose). In the current system I would rather collapse the lower-end frequencies to 1 than collapse the higher-end frequencies to MAX_UINT64. Unfortunately, making this change produces a lot of test-case churn. I am still working on this, but published PR #67197 to increase stability in MachineBlockPlacement against small BFI noise.
@mtrofin the saturation issue may affect ML-based register allocation (training quality).
(force-pushed from 9a8d1ef to 521e483)
This is a standard release build. The exact cmake config is https://gist.github.com/nikic/1bf97b5b6aade298b87f41cd13bf37d5, where the stage1 build has the same configuration but is built with gcc. I'm testing CTMark, but the impact is quite consistent across the board, so anything will probably do.
Oh, so this isn't even a PGO-enabled build, interesting...
Internally, with this change, on a few large workloads, we saw ~0.2-0.5% performance improvements, all with 2-3% .text size reduction at the same time. Our testing covered both sampling PGO and instrumentation PGO. The change makes the profile more accurate, which is a good improvement by itself. But PGO performance isn't just about profile accuracy; it depends on training quality and representativeness too. Hypothetically, if some of the training isn't that representative relative to the perf testing scenario, then by making the profile more accurate we could also make it deviate more from what's needed for best perf on the testing scenario. That said, I think it'd be fair to revert if this is proven to be a general regression. For that, can we wait for your results before reverting?
Yes -- the revert can wait until more data is available. I agree that it should help performance in theory.
I will check what happens with clang release/no-FDO stage2 builds.
Ah yes, I should have mentioned that. This is indeed without PGO. So I expect this is related to the impact of the estimated BFI on the inlining cost model (which is the only significant place using BFI in non-PGO builds). (I've seen issues with bad inlining decisions due to BF underflow in the past; not sure whether this change makes that better or worse.)
I think I traced the clang regression down to

The example I analyzed starts out as a single-block function with a call, resulting in this BFI:
It seems that when we inline calls, we just scale their BFI values to the callsite (instead of, say, recomputing BFI, which would have resulted in a different scale / entry frequency being chosen). And given that the entry block only had a value of
Note that some block frequencies got scaled down to

In comparison, with the new conversion scheme we end up with:
for the same situation, and blocks like
So I am not sure if there is an easy fix for the regression. We should probably have some logic that rescales all BFI values when inlining, to better accommodate the new range of frequencies, so we lose less precision by accident. Though that won't help with the regression at hand, as that would need for
Also note that this 2% cold-code threshold is only applied in situations where no PGO data is available. So maybe we should just ignore this, given that without PGO data it is often just impossible for the compiler to make good inlining decisions...
Seems awkward to pessimize as "cold" when there's no real data (no PGO, no other marking). What happens if you change the cold call threshold to 0% when there's no PGO data? (I.e., never assume a call is cold without actual evidence.)
The old code not marking them as cold also works just by accident, though.
Not entirely accidental. When BPI/BFI first landed, it was heavily profiled to be sure it didn't pessimize non-PGO code. I don't see why we'd suddenly be okay with pessimizing it. Under 2% isn't hard to hit for hot-path code. Lots of functions will have strings of early-exit conditions.
Agree that deciding coldness based on local entry frequency (2%) is a bad idea, though.
Seems this got introduced in https://reviews.llvm.org/D34312 with the rough idea that we shouldn't inline into parts of the code that
Heh, yeah, the premise seems correct, but a percentage-based numeric threshold doesn't seem right. You kind of want a flag for the block. Or a special value, like "freq=0", which indicates "annotation says this is cold".
@xur-llvm @david-xl This just reminds me of the observed regressions when we did an experiment to override branch probability from
Contradicting my past self (one of the reviewers of the patch), I think the coldness check should be based on a global threshold, which exists when synthetic entry count propagation is enabled (without PGO).
FYI @ldionne this patch is causing test_vector2.pass.cpp to trigger a memory leak: https://lab.llvm.org/buildbot/#/builders/168/builds/16422
The leak could be real, as the code terminates before freeing the memory.
Not sure how this change could trigger leaks. Changing BFI here should mostly affect inlining decisions, basic block placement, and register allocation. But it shouldn't change program behavior or memory operations... I will try to reproduce this test on my machine to see if there's anything obvious caused by the change here (though at first glance the test code seems to provoke an exception abort and would naturally leak because of it?)
Some very ad-hoc benchmarking of clang compilation speed (measured in instructions as reported by valgrind/callgrind, which I think somewhat matches nikic's setup), comparing Old-BFI (this PR reverted), New-BFI (this PR applied), and no-cold (cold-callsite-rel-freq set to 0 to disable
I could benchmark more with llvm-test-suite and put up a PR to disable cold-callsite-rel-freq by default, if people support this direction...
And digging even deeper:
Though if you look into this, then I would state that the code got better for the wrong reasons! Not inlining

Unfortunately the particular code here

Long story short: this ended up as another of the many stories where the compiler cannot make the right inlining decisions without having actual runtime data. It happens to make a better decision here based on data that ends up being wrong anyway. I'm gonna leave things as is and rather put up some improvements to the LLVM code instead!
Interesting. Probably

Your plan SGTM!
Good analysis. Your plan sounds good, but lowering the cold threshold is probably a good thing to do with the new BFI. The FindAndConstruct method in DenseMap could probably be annotated with __builtin_expect to indicate that the insertion is an unlikely path.
Thanks for investigating, Matthias. So it looks like the regression is indeed just the clang non-PGO build being unlucky after the BFI improvement. Sounds good to leave it as is.
#70700 improves LLVM code so I am no longer seeing a regression before/after this change.
@MatzeB, our internal release testing has seen lots of very large regressions for tests without PGO or with XFDO only. While I agree this patch is the right way to go and the good performance (without PGO) with the old BFI is somewhat by chance, it is still better not to regress them. Can you help with a follow-up patch to tune down the relative coldness threshold?
Do you just see regressions, or also some wins? I don't know whether this all stems from the
Or put another way: do you see the regressions disappear with
I asked for those data points and will share them when ready.