Bfi precision #66285

Merged
merged 4 commits on Oct 25, 2023

Conversation

@MatzeB
Contributor

MatzeB commented Sep 13, 2023

I noticed that we often have poor precision in the BlockFrequencyInfo results because of how convertFloatingToInteger in the algorithm chooses factors.

This is a request for comments on choosing more aggressive factors! Look at the changes in test/Analysis/BlockFrequencyInfo/precision.ll to get an impression of the poor precision and the improvements.

Has this been tried before? Is there history I am not aware of?

Right now this change does not work yet, as it triggers overflows and some instabilities in various places that I am working through. I thought it may be interesting to get some early feedback though, in case this change is misguided...
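
To get a feel for where the imprecision comes from, here is a small standalone toy (not the LLVM implementation; numbers are made up for illustration): BFI hands out an integer per block, and a profile count is recovered roughly as entry_count * int(block) / int(entry), so a small integer on the entry block quantizes every recovered count.

#include <cstdint>
#include <cstdio>

// Recover a count from integer block frequencies (rounded division; the toy
// assumes the products stay well below UINT64_MAX).
static uint64_t recoverCount(uint64_t BlockInt, uint64_t EntryInt,
                             uint64_t EntryCount) {
  return (BlockInt * EntryCount + EntryInt / 2) / EntryInt;
}

int main() {
  const uint64_t EntryCount = 1000; // function_entry_count
  const long double Freq = 0.4L;    // a block that really runs 40% of the time

  for (uint64_t EntryInt : {8ULL, 1ULL << 20, 1ULL << 54}) {
    uint64_t BlockInt = (uint64_t)(Freq * EntryInt); // what the conversion stores
    std::printf("int(entry) = %llu -> recovered count %llu (exact: 400)\n",
                (unsigned long long)EntryInt,
                (unsigned long long)recoverCount(BlockInt, EntryInt, EntryCount));
  }
  return 0;
}

With int(entry) = 8 this prints 375 instead of 400; with a large entry integer the recovered count is exact.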

@llvmbot
Collaborator

llvmbot commented Sep 13, 2023

@llvm/pr-subscribers-llvm-transforms
@llvm/pr-subscribers-llvm-support
@llvm/pr-subscribers-pgo
@llvm/pr-subscribers-backend-x86
@llvm/pr-subscribers-backend-aarch64
@llvm/pr-subscribers-backend-arm
@llvm/pr-subscribers-backend-amdgpu
@llvm/pr-subscribers-lto

@llvm/pr-subscribers-llvm-analysis

Changes

I noticed that we often have poor precision in the `BlockFrequencyInfo` results because of how `convertFloatingToInteger` in the algorithm chooses factors.

This is a request for comments on choosing more aggressive factors! Look at the changes in test/Analysis/BlockFrequencyInfo/precision.ll to get an impression of the poor precision and the improvements.

Has this been tried before? Is there history I am not aware of?

Right now this change does not work yet, as it triggers overflows and some instabilities in various places that I am working through. I thought it may be interesting to get some early feedback though, in case this change is misguided...

Full diff: https://github.com/llvm/llvm-project/pull/66285.diff

9 Files Affected:

  • (modified) llvm/lib/Analysis/BlockFrequencyInfoImpl.cpp (+9-20)
  • (modified) llvm/test/Analysis/BlockFrequencyInfo/loops_with_profile_info.ll (+10-9)
  • (added) llvm/test/Analysis/BlockFrequencyInfo/precision.ll (+43)
  • (modified) llvm/test/Transforms/CodeExtractor/MultipleExitBranchProb.ll (+3-3)
  • (modified) llvm/test/Transforms/ConstantHoisting/AArch64/consthoist-unreachable.ll (+2-3)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/no_fpmath_with_hotness.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/diag-with-hotness-info-2.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/diag-with-hotness-info.ll (+1-1)
  • (modified) llvm/test/Transforms/SampleProfile/pseudo-probe-update-2.ll (+5-5)
diff --git a/llvm/lib/Analysis/BlockFrequencyInfoImpl.cpp b/llvm/lib/Analysis/BlockFrequencyInfoImpl.cpp
index 82b1e3b9eede709..93661d2edeb3083 100644
--- a/llvm/lib/Analysis/BlockFrequencyInfoImpl.cpp
+++ b/llvm/lib/Analysis/BlockFrequencyInfoImpl.cpp
@@ -481,26 +481,15 @@ void BlockFrequencyInfoImplBase::distributeMass(const BlockNode &Source,
 
 static void convertFloatingToInteger(BlockFrequencyInfoImplBase &BFI,
                                      const Scaled64 &Min, const Scaled64 &Max) {
-  // Scale the Factor to a size that creates integers.  Ideally, integers would
-  // be scaled so that Max == UINT64_MAX so that they can be best
-  // differentiated.  However, in the presence of large frequency values, small
-  // frequencies are scaled down to 1, making it impossible to differentiate
-  // small, unequal numbers. When the spread between Min and Max frequencies
-  // fits well within MaxBits, we make the scale be at least 8.
-  const unsigned MaxBits = 64;
-  const unsigned SpreadBits = (Max / Min).lg();
-  Scaled64 ScalingFactor;
-  if (SpreadBits <= MaxBits - 3) {
-    // If the values are small enough, make the scaling factor at least 8 to
-    // allow distinguishing small values.
-    ScalingFactor = Min.inverse();
-    ScalingFactor <<= 3;
-  } else {
-    // If the values need more than MaxBits to be represented, saturate small
-    // frequency values down to 1 by using a scaling factor that benefits large
-    // frequency values.
-    ScalingFactor = Scaled64(1, MaxBits) / Max;
-  }
+  // Scale the Factor to a size that creates integers.  If possible scale
+  // integers so that Max == UINT64_MAX so that they can be best differentiated.
+  // It is possible that the range between min and max cannot be accurately
+  // represented in a 64-bit integer without either losing precision for small
+  // values (so small unequal numbers all map to 1) or saturating big values
+  // (so unequal big numbers may map to UINT64_MAX). We choose to lose
+  // precision for small numbers.
+  const unsigned MaxBits = sizeof(Scaled64::DigitsType) * CHAR_BIT;
+  Scaled64 ScalingFactor = Scaled64(1, MaxBits) / Max;
 
   // Translate the floats to integers.
   LLVM_DEBUG(dbgs() << "float-to-int: min = " << Min << ", max = " << Max
diff --git a/llvm/test/Analysis/BlockFrequencyInfo/loops_with_profile_info.ll b/llvm/test/Analysis/BlockFrequencyInfo/loops_with_profile_info.ll
index 41226a1cdfbaf32..fadb47bd256b772 100644
--- a/llvm/test/Analysis/BlockFrequencyInfo/loops_with_profile_info.ll
+++ b/llvm/test/Analysis/BlockFrequencyInfo/loops_with_profile_info.ll
@@ -59,7 +59,7 @@ declare i32 @printf(i8*, ...)
 
 ; CHECK: Printing analysis {{.*}} for function 'main':
 ; CHECK-NEXT: block-frequency-info: main
-define i32 @main() {
+define i32 @main() !prof !6 {
 entry:
   %retval = alloca i32, align 4
   %i = alloca i32, align 4
@@ -93,7 +93,7 @@ for.cond4:                                        ; preds = %for.inc, %for.body3
   %cmp5 = icmp slt i32 %2, 100
   br i1 %cmp5, label %for.body6, label %for.end, !prof !3
 
-; CHECK: - for.body6: float = 500000.5, int = 4000004
+; CHECK: - for.body6: {{.*}}, count = 1000000
 for.body6:                                        ; preds = %for.cond4
   call void @bar()
   br label %for.inc
@@ -143,7 +143,7 @@ for.cond16:                                       ; preds = %for.inc19, %for.bod
   %cmp17 = icmp slt i32 %8, 10000
   br i1 %cmp17, label %for.body18, label %for.end21, !prof !4
 
-; CHECK: - for.body18: float = 499999.9, int = 3999998
+; CHECK: - for.body18: {{.*}}, count = 1000000
 for.body18:                                       ; preds = %for.cond16
   call void @bar()
   br label %for.inc19
@@ -175,7 +175,7 @@ for.cond26:                                       ; preds = %for.inc29, %for.end
   %cmp27 = icmp slt i32 %12, 1000000
   br i1 %cmp27, label %for.body28, label %for.end31, !prof !5
 
-; CHECK: - for.body28: float = 499995.2, int = 3999961
+; CHECK: - for.body28: {{.*}}, count = 1000224
 for.body28:                                       ; preds = %for.cond26
   call void @bar()
   br label %for.inc29
@@ -197,8 +197,9 @@ for.end31:                                        ; preds = %for.cond26
 !llvm.ident = !{!0}
 
 !0 = !{!"clang version 3.7.0 (trunk 232635) (llvm/trunk 232636)"}
-!1 = !{!"branch_weights", i32 101, i32 2}
-!2 = !{!"branch_weights", i32 10001, i32 101}
-!3 = !{!"branch_weights", i32 1000001, i32 10001}
-!4 = !{!"branch_weights", i32 1000001, i32 101}
-!5 = !{!"branch_weights", i32 1000001, i32 2}
+!1 = !{!"branch_weights", i32 100, i32 1}
+!2 = !{!"branch_weights", i32 10000, i32 100}
+!3 = !{!"branch_weights", i32 1000000, i32 10000}
+!4 = !{!"branch_weights", i32 1000000, i32 100}
+!5 = !{!"branch_weights", i32 1000000, i32 1}
+!6 = !{!"function_entry_count", i32 1}
diff --git a/llvm/test/Analysis/BlockFrequencyInfo/precision.ll b/llvm/test/Analysis/BlockFrequencyInfo/precision.ll
new file mode 100644
index 000000000000000..7408d002d065d5b
--- /dev/null
+++ b/llvm/test/Analysis/BlockFrequencyInfo/precision.ll
@@ -0,0 +1,43 @@
+; RUN: opt < %s -disable-output -passes="print<block-freq>" 2>&1 | FileCheck %s
+; Sanity check precision for small-ish min/max spread.
+
+@g = global i32 0
+
+; CHECK-LABEL: block-frequency-info: func0
+; CHECK: - entry: float = 1.0, {{.*}}, count = 1000
+; CHECK: - cmp0_true: float = 0.4, {{.*}}, count = 400
+; CHECK: - cmp0_false: float = 0.6, {{.*}}, count = 600
+; CHECK: - cmp1_true: float = 0.1, {{.*}}, count = 100
+; CHECK: - cmp1_false: float = 0.3, {{.*}}, count = 300
+; CHECK: - join: float = 1.0, {{.*}}, count = 1000
+
+define void @func0(i32 %a0, i32 %a1) !prof !0 {
+entry:
+  %cmp0 = icmp ne i32 %a0, 0
+  br i1 %cmp0, label %cmp0_true, label %cmp0_false, !prof !1
+
+cmp0_true:
+  store volatile i32 1, ptr @g
+  %cmp1 = icmp ne i32 %a1, 0
+  br i1 %cmp1, label %cmp1_true, label %cmp1_false, !prof !2
+
+cmp0_false:
+  store volatile i32 2, ptr @g
+  br label %join
+
+cmp1_true:
+  store volatile i32 3, ptr @g
+  br label %join
+
+cmp1_false:
+  store volatile i32 4, ptr @g
+  br label %join
+
+join:
+  store volatile i32 5, ptr @g
+  ret void
+}
+
+!0 = !{!"function_entry_count", i64 1000}
+!1 = !{!"branch_weights", i32 400, i32 600}
+!2 = !{!"branch_weights", i32 1, i32 3}
diff --git a/llvm/test/Transforms/CodeExtractor/MultipleExitBranchProb.ll b/llvm/test/Transforms/CodeExtractor/MultipleExitBranchProb.ll
index 63568456d0e58c8..f1eb19dbfdbea42 100644
--- a/llvm/test/Transforms/CodeExtractor/MultipleExitBranchProb.ll
+++ b/llvm/test/Transforms/CodeExtractor/MultipleExitBranchProb.ll
@@ -17,8 +17,8 @@ return:             ; preds = %entry
 
 define internal i32 @dummyCaller(i1 %cond) !prof !1 {
 entry:
-%val = call i32 @inlinedFunc(i1 %cond)
-ret i32 %val
+  %val = call i32 @inlinedFunc(i1 %cond)
+  ret i32 %val
 
 ; CHECK-LABEL: @dummyCaller
 ; CHECK: call
@@ -31,4 +31,4 @@ ret i32 %val
 !2 = !{!"branch_weights", i32 5, i32 5}
 !3 = !{!"branch_weights", i32 4, i32 1}
 
-; CHECK: [[COUNT1]] = !{!"branch_weights", i32 31, i32 8}
+; CHECK: [[COUNT1]] = !{!"branch_weights", i32 858993459, i32 214748365}
diff --git a/llvm/test/Transforms/ConstantHoisting/AArch64/consthoist-unreachable.ll b/llvm/test/Transforms/ConstantHoisting/AArch64/consthoist-unreachable.ll
index 69e84e942de65bf..64d7c229ff4e52c 100644
--- a/llvm/test/Transforms/ConstantHoisting/AArch64/consthoist-unreachable.ll
+++ b/llvm/test/Transforms/ConstantHoisting/AArch64/consthoist-unreachable.ll
@@ -10,14 +10,13 @@
 define void @c() {
 ; CHECK-LABEL: @c(
 ; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[CONST:%.*]] = bitcast i32 1232131 to i32
 ; CHECK-NEXT:    [[TOBOOL:%.*]] = icmp ne i16 0, 0
 ; CHECK-NEXT:    br i1 undef, label [[LBL1_US:%.*]], label [[ENTRY_ENTRY_SPLIT_CRIT_EDGE:%.*]]
 ; CHECK:       entry.entry.split_crit_edge:
-; CHECK-NEXT:    [[CONST:%.*]] = bitcast i32 1232131 to i32
 ; CHECK-NEXT:    br label [[LBL1:%.*]]
 ; CHECK:       lbl1.us:
-; CHECK-NEXT:    [[CONST1:%.*]] = bitcast i32 1232131 to i32
-; CHECK-NEXT:    store i32 [[CONST1]], ptr @c.a, align 1
+; CHECK-NEXT:    store i32 [[CONST]], ptr @c.a, align 1
 ; CHECK-NEXT:    br label [[FOR_COND4:%.*]]
 ; CHECK:       lbl1:
 ; CHECK-NEXT:    store i32 [[CONST]], ptr @c.a, align 1
diff --git a/llvm/test/Transforms/LoopVectorize/X86/no_fpmath_with_hotness.ll b/llvm/test/Transforms/LoopVectorize/X86/no_fpmath_with_hotness.ll
index b1fc96ea77ed034..4f413a50837dd69 100644
--- a/llvm/test/Transforms/LoopVectorize/X86/no_fpmath_with_hotness.ll
+++ b/llvm/test/Transforms/LoopVectorize/X86/no_fpmath_with_hotness.ll
@@ -108,5 +108,5 @@ attributes #0 = { nounwind }
                              isOptimized: true, flags: "-O2",
                              splitDebugFilename: "abc.debug", emissionKind: 2)
 !29 = !{!"function_entry_count", i64 3}
-!30 = !{!"branch_weights", i32 99, i32 1}
+!30 = !{!"branch_weights", i32 10000, i32 1}
 !31 = !{!"branch_weights", i32 1, i32 99}
diff --git a/llvm/test/Transforms/LoopVectorize/diag-with-hotness-info-2.ll b/llvm/test/Transforms/LoopVectorize/diag-with-hotness-info-2.ll
index ed107b10dcd9874..4da1d099645bee2 100644
--- a/llvm/test/Transforms/LoopVectorize/diag-with-hotness-info-2.ll
+++ b/llvm/test/Transforms/LoopVectorize/diag-with-hotness-info-2.ll
@@ -198,5 +198,5 @@ attributes #0 = { norecurse nounwind ssp uwtable "disable-tail-calls"="false" "l
 !55 = distinct !{!55, !43}
 !56 = !{!"function_entry_count", i64 3}
 !57 = !{!"function_entry_count", i64 50}
-!58 = !{!"branch_weights", i32 99, i32 1}
+!58 = !{!"branch_weights", i32 10000, i32 1}
 !59 = !{!"branch_weights", i32 1, i32 99}
diff --git a/llvm/test/Transforms/LoopVectorize/diag-with-hotness-info.ll b/llvm/test/Transforms/LoopVectorize/diag-with-hotness-info.ll
index 30d11a12c79c4bc..4b7b714a2562800 100644
--- a/llvm/test/Transforms/LoopVectorize/diag-with-hotness-info.ll
+++ b/llvm/test/Transforms/LoopVectorize/diag-with-hotness-info.ll
@@ -209,5 +209,5 @@ attributes #0 = { norecurse nounwind ssp uwtable "disable-tail-calls"="false" "l
 !55 = distinct !{!55, !43}
 !56 = !{!"function_entry_count", i64 3}
 !57 = !{!"function_entry_count", i64 50}
-!58 = !{!"branch_weights", i32 99, i32 1}
+!58 = !{!"branch_weights", i32 10000, i32 1}
 !59 = !{!"branch_weights", i32 1, i32 99}
diff --git a/llvm/test/Transforms/SampleProfile/pseudo-probe-update-2.ll b/llvm/test/Transforms/SampleProfile/pseudo-probe-update-2.ll
index 19e83649723d642..105494942d383d5 100644
--- a/llvm/test/Transforms/SampleProfile/pseudo-probe-update-2.ll
+++ b/llvm/test/Transforms/SampleProfile/pseudo-probe-update-2.ll
@@ -14,8 +14,8 @@ T1:                                               ; preds = %0
   %v1 = call i32 @f1(), !prof !12
   %cond3 = icmp eq i32 %v1, 412
   call void @llvm.pseudoprobe(i64 6699318081062747564, i64 2, i32 0, i64 -1)
-;; The distribution factor -8513881372706734080 stands for 53.85%, whic is from 7/6+7.
-; CHECK: call void @llvm.pseudoprobe(i64 [[#GUID:]], i64 4, i32 0, i64 -8513881372706734080)
+;; The distribution factor -9223372036854775808 stands for 53.85%, which is from 7/6+7.
+; CHECK: call void @llvm.pseudoprobe(i64 [[#GUID:]], i64 4, i32 0, i64 -9223372036854775808)
   call void @llvm.pseudoprobe(i64 6699318081062747564, i64 4, i32 0, i64 -1), !dbg !13
 ;; Probe 7 has two copies, since they don't share the same inline context, they are not
 ;; considered sharing samples, thus their distribution factors are not fixed up.
@@ -29,8 +29,8 @@ T1:                                               ; preds = %0
 Merge:                                            ; preds = %0
   %v2 = call i32 @f2(), !prof !12
   call void @llvm.pseudoprobe(i64 6699318081062747564, i64 3, i32 0, i64 -1)
-;; The distribution factor 8513881922462547968 stands for 46.25%, which is from 6/6+7.
-; CHECK: call void @llvm.pseudoprobe(i64 [[#GUID]], i64 4, i32 0, i64 8513881922462547968)
+;; The distribution factor  -9223372036854775808 stands for 46.25%, which is from 6/6+7.
+; CHECK: call void @llvm.pseudoprobe(i64 [[#GUID]], i64 4, i32 0, i64  -9223372036854775808)
   call void @llvm.pseudoprobe(i64 6699318081062747564, i64 4, i32 0, i64 8513881922462547968), !dbg !13
 ; CHECK: call void @llvm.pseudoprobe(i64 [[#GUID]], i64 7, i32 0, i64 -1)
   call void @llvm.pseudoprobe(i64 6699318081062747564, i64 7, i32 0, i64 -1), !dbg !18
@@ -77,4 +77,4 @@ attributes #0 = { inaccessiblememonly nounwind willreturn }
 !16 = distinct !DILocation(line: 10, column: 11, scope: !17)
 !17 = !DILexicalBlockFile(scope: !4, file: !5, discriminator: 186646551)
 !18 = !DILocation(line: 53, column: 3, scope: !15, inlinedAt: !19)
-!19 = !DILocation(line: 12, column: 3, scope: !4)
\ No newline at end of file
+!19 = !DILocation(line: 12, column: 3, scope: !4)

@dexonsmith
Collaborator

Comment in the code from the Phab review, which I think was committed mostly as-is (https://reviews.llvm.org/D3125, though that review wasn't used; the active review happened on the commits list):

  // Scale the Factor to a size that creates integers.  Ideally, integers would
  // be scaled so that Max == UINT64_MAX so that they can be best
  // differentiated.  However, the register allocator currently deals poorly
  // with large numbers.  Instead, push Min up a little from 1 to give some
  // room to differentiate small, unequal numbers.
  //
  // TODO: fix issues downstream so that ScalingFactor can be Float(1,64)/Max.

My memory is that I originally had code equivalent to this change, then backed off to this compromise after hitting performance regressions, but the intention was (originally) to figure out what was wrong so we could go back to this.

I don't have access to my email from the time, but I dug up the start of the llvm-commits thread at https://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20140303/207695.html. Unfortunately it's split by week. I'll reply again if I can find the specific motivation in the thread.

There's been a lot of work on BFI since then so it's possible this will work now.

@dexonsmith
Collaborator

I don't have access to my email from the time, but I dug up the start of the llvm-commits thread at https://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20140303/207695.html. Unfortunately it's split by week. I'll reply again if I can find the specific motivation in the thread.

Actually, it must have come up when I ran tests before posting the patches to the list, since the initial patch has the same comment. Almost 10 years ago now, so I can't remember if this was the llvm unit tests or the test-suite.

IMO it's a good idea to revisit this!

@MatzeB
Contributor Author

MatzeB commented Sep 13, 2023

Thanks for confirming @duncan!

The current state seems to be mostly a result of http://reviews.llvm.org/D8718 and 10be9a8, and I am trying to understand the intentions there, or whether it could have been an accident.

I am definitely running into overflow and noise problems with this patch applied and am working on fixing these issues as I see them; hopefully this is manageable.

Another discussion here is whether to normalize the "min" value to integer 1 and keep higher precision for the small numbers while saturating big numbers, or to normalize "max" to UINT64_MAX (which is what I do here), which may map multiple smaller values to integer 1. I am still collecting experience here, but my gut feeling would be that we would rather differentiate the big values than the small ones...
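
As a toy illustration of the two options (made-up numbers spanning ~70 bits; not LLVM code):

#include <algorithm>
#include <cstdint>
#include <cstdio>

// Clamp a scaled value into the 1..UINT64_MAX integer range, like the toy
// conversion would.
static uint64_t toInt(long double X) {
  if (X >= 18446744073709551616.0L) // >= 2^64: saturate
    return UINT64_MAX;
  return std::max<uint64_t>(1, (uint64_t)X); // collapse tiny values to 1
}

int main() {
  const long double Min = 0x1p-40L, Max = 0x1p30L; // ~70 bits of spread
  const long double Freqs[] = {Min, 2 * Min, Max / 2, Max};

  const long double NormMin = 8.0L / Min;                    // "min -> 8" choice
  const long double NormMax = 18446744073709551616.0L / Max; // "max -> 2^64" choice

  for (long double F : Freqs)
    std::printf("freq %.3Lg: min-normalized %llu, max-normalized %llu\n", F,
                (unsigned long long)toInt(F * NormMin),
                (unsigned long long)toInt(F * NormMax));
  return 0;
}

With min-normalization the two hot blocks both saturate to UINT64_MAX and become indistinguishable; with max-normalization the two cold blocks both collapse to 1 instead.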

@david-xl
Contributor

I noticed that we often have poor precision in the BlockFrequencyInfo results because of how convertFloatingToInteger in the algorithm chooses factors.

This is a request for comments on choosing more aggressive factors! Look at the changes in test/Analysis/BlockFrequencyInfo/precision.ll to get an impression of the poor precision and the improvements.

Has this been tried before? Is there history I am not aware of?

Right now this change does not work yet, as it triggers overflows and some instabilities in various places that I am working through. I thought it may be interesting to get some early feedback though, in case this change is misguided...

For the precision.ll test case, without the patch we get 388/600 vs 400/600, and 88/288 vs 100/300. How much do we depend on this level of difference? Profile imprecision is an inherent attribute due to other reasons as well, for instance race conditions in instrumentation counter updates, sampling period differences, etc.

@MatzeB
Contributor Author

MatzeB commented Sep 13, 2023

I don't know if the low precision is a problem for practical code. So far it was mostly annoying me when writing simple testcases that look so obvious that I expected them to just work :)

It is also very easy to fix these cases just by choosing bigger factors in convertFloatingToInteger...

Are there any downsides to bigger factors? I am certainly experiencing overflow problems, but those seem worth fixing anyway given we could also hit them for big min/max spreads in practice...

@david-xl
Contributor

I don't know if the low precision is a problem for practical code. So far it was mostly annoying me when writing simple testcases that look so obvious that I expected them to just work :)

It is also very easy to fix these cases just by choosing bigger factors in convertFloatingToInteger...

Are there any downsides to bigger factors? I am certainly experiencing overflow problems, but those seem worth fixing anyway given we could also hit them for big min/max spreads in practice...

For PGO, we really only care about two things: 1) the ratio of the branch weights of the same branch, and 2) the 'actual' runtime count of a block or callsite. The latter is computed by multiplying the relative frequency (relative to the entry block) by the entry count (the real count). In other words, as long as there is no overflow/underflow, the scale factor used does not really matter.

On the other hand, if overflow happens, the branch direction can be flipped, which is a real problem.
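
A tiny sketch of that flipping hazard (made-up numbers, assuming plain wrapping 64-bit arithmetic somewhere downstream rather than a saturating type):

#include <cstdint>
#include <cstdio>

int main() {
  const uint64_t Hot = 12000000000000000000ull;  // hotter successor
  const uint64_t Cold = 9000000000000000000ull;  // colder successor
  const uint64_t Scale = 2;                      // e.g. some downstream cost factor

  uint64_t HotScaled = Hot * Scale;   // wraps past UINT64_MAX
  uint64_t ColdScaled = Cold * Scale; // still fits
  std::printf("hot*2 = %llu, cold*2 = %llu -> order %s\n",
              (unsigned long long)HotScaled, (unsigned long long)ColdScaled,
              HotScaled < ColdScaled ? "flipped" : "preserved");
  return 0;
}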

@MatzeB
Contributor Author

MatzeB commented Sep 13, 2023

Thanks for the comment.

I still think choosing more aggressive factors in case of small spreads should not hurt anything and we can easily do so. Even if that just makes people writing test cases happier :)

As for overflows I am still trying to get a feeling for how much we need to go and fix those (as in #66280 ) or whether we have to accept them as a fact of (engineering-)life and rather leave enough room in the values...

@david-xl
Contributor

Thanks for the comment.

I still think choosing more aggressive factors in case of small spreads should not hurt anything and we can easily do so. Even if that just makes people writing test cases happier :)

As for overflows I am still trying to get a feeling for how much we need to go and fix those (as in #66280 ) or whether we have to accept them as a fact of (engineering-)life and rather leave enough room in the values...

Sure. As long as there is no over/underflow with the new approach, it is good to keep the extra precision.

@MatzeB
Contributor Author

MatzeB commented Sep 13, 2023

So we probably want something like the following to make this explicit and have a dial between precision and overflow-bug-avoidance:

...
uint64_t Slack = 8;
// We leave some "slack" room to avoid accidental overflows in user code when
// integers become too big.
uint64_t MaxBits = 64 - Slack;
Scaled64 ScalingFactor = Scaled64(1, MaxBits) / Max;
...

I'll try this and see how it goes...
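
Rough back-of-the-envelope for what the slack buys (my own numbers, not from the patch): leaving S spare bits means downstream code can multiply the largest frequency by factors up to roughly 2^S before overflowing 64 bits.

#include <cstdint>
#include <cstdio>

int main() {
  const uint64_t Cost = 100; // e.g. an instruction/spill cost weight in regalloc
  for (unsigned Slack : {0u, 8u}) {
    const unsigned MaxBits = 64 - Slack;
    const uint64_t MaxFreq =
        (MaxBits == 64) ? UINT64_MAX : ((1ull << MaxBits) - 1);
    const bool Overflows = MaxFreq > UINT64_MAX / Cost;
    std::printf("slack %u: max freq ~2^%u, times %llu %s\n", Slack, MaxBits,
                (unsigned long long)Cost,
                Overflows ? "overflows" : "still fits in 64 bits");
  }
  return 0;
}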

@qcolombet
Collaborator

FWIW, I'm supportive of the investigation here.

Just one comment:

I am still collecting experiences here, but my gut feeling would be that we rather want to differentiate the big values than the small ones...

I think so, but we will probably also lose some valuable information if we collapse small values.
I expect this will affect hot/cold placement if, e.g., two cold blocks now end up with the same weight, and ditto for spill placement in regalloc. I guess it really depends on how we "collapse" frequencies (e.g., do we lose 5% accuracy or do we lose 2x for these). In any case, the cases where this matters and overflow occurs are hopefully going to be rare.

Long story short, we'll have regressions :).

@MatzeB
Contributor Author

MatzeB commented Sep 22, 2023

I found that BlockFrequency saturation happens a lot in practice. Trying this on an IPGO-enabled service here, I saw hundreds of functions hitting the saturation. This was mostly due to the fact that even nesting two try/catch statements pushes the minimum block frequency so far below 1.0 that the current ScalingFactor = Min.inverse(); already leads to overflow/MAX_UINT saturation on the entry block. To make matters worse, I noticed the register allocator often multiplies frequencies with instruction costs, and in those cases we lose all of that cost information because the numbers are already saturated. I am measuring a real performance improvement (0.3-0.4%) in this service when I switch to the code from my comment above (with 10 slack bits), so the register allocator and other passes can multiply without triggering immediate saturation.

In any case, the cases where this matters and overflow occurs are hopefully going to be rare.

If the spread between the "min" and "max" frequency is bigger than 64 bits then we have to "collapse" on one end (short of replacing the whole system with something like llvm::ScaledNumber or floating point; but that is a discussion for another day I suppose). In the current system I would rather collapse the lower-end frequencies to 1 than collapse the higher-end frequencies to MAX_UINT64.
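
To state the trade-off as an inequality (with $S$ standing for whatever scaling factor we pick):

$$\frac{\mathrm{Max}}{\mathrm{Min}} > 2^{64} \;\Longrightarrow\; S \cdot \mathrm{Min} < 1 \;\text{ or }\; S \cdot \mathrm{Max} > 2^{64},$$

since $S \cdot \mathrm{Min} \ge 1$ would force $S \ge 1/\mathrm{Min}$ and hence $S \cdot \mathrm{Max} \ge \mathrm{Max}/\mathrm{Min} > 2^{64}$.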

Unfortunately making this change produces a lot of test-case churn. I am still working on this, but published a PR #67197 to increase stability in MachineBlockPlacement against small BFI noise.

@david-xl
Contributor

@mtrofin: the saturation issue may affect ML-based register allocation (training quality).

@nikic
Contributor

nikic commented Oct 25, 2023

This is a standard release build. The exact cmake config is https://gist.github.com/nikic/1bf97b5b6aade298b87f41cd13bf37d5 where the stage1 build has the same configuration but is built with gcc. I'm testing CTMark, but the impact is quite consistent across the board, so anything will probably do.

@MatzeB
Contributor Author

MatzeB commented Oct 25, 2023

Oh so this isn't even a PGO enabled build, interesting...

@WenleiHe
Member

This is concerning. Can this be reverted for now and we can help with some internal performance testing. @xur-llvm

Internally, with this change, on a few large workloads, we saw ~0.2-0.5% performance improvements, all with 2-3% .text size reduction at the same time. Our testing covered both sampling pgo and instrumentation pgo.

The change makes the profile more accurate, which is a good improvement by itself. But PGO performance isn't just about profile accuracy; it also depends on training quality and representativeness. Hypothetically, if some of the training isn't that representative of the perf testing scenario, then by making the profile more accurate we could also make it deviate more from what's needed for the best perf in the testing scenario.

That said, I think it'd be fair to revert if this is proven to be a general regression. For that, can we wait for your results before reverting?

@david-xl
Contributor

Yes -- the revert can wait until more data is available. I agree that it should help performance in theory.

@MatzeB
Contributor Author

MatzeB commented Oct 25, 2023

I will check what happens with clang release/no-FDO stage2 builds.

@nikic
Contributor

nikic commented Oct 25, 2023

Oh so this isn't even a PGO enabled build, interesting...

Ah yes, I should have mentioned that. This is indeed without PGO. So I expect this is related to the impact of the estimated BFI on the inlining cost model (which is the only significant place using BFI in non-PGO builds). (I've seen issues with bad inlining decisions due to BF underflow in the past, not sure whether this change makes that better or worse.)

@MatzeB
Contributor Author

MatzeB commented Oct 26, 2023

I think I traced the clang regression down to InlineCostCallAnalyzer::isColdCallSite returning true in more instances. It previously did not do that because of what feels more like an accidental loss of precision:

The example I analyzed starts out as a single-block function with a call resulting in this BFI:

 - entry: float = 1.0, int = 8

It seems that when we inline calls we just scale their BFI values to the callsite (instead of, say, recomputing BFI, which would have resulted in different scales / entry frequencies being chosen). And given that the entry block only had a value of int = 8, we have basically no precision left on the low end. So after inlining a bit more, the BFI in my example looks like this:

 - entry: float = 1.0, int = 8
- if.then.i: float = 0.0, int = 0
- if.else.i: float = 0.0, int = 7
- if.end15.sink.split.i: float = 0.0, int = 0
- if.end15.i: float = 0.0, int = 8
- if.then19.i: float = 0.0, int = 5
- _ZN4llvm12DenseMapBaseINS_8DenseMapIPNS_10StructTypeEPNS_12StructLayoutENS_12DenseMapInfoIS3_vEENS_6detail12DenseMapPairIS3_S5_EEEES3_S5_S7_SA_E20InsertIntoBucketImplIS3_EEPSA_RKS3_RKT_SE_.exit: float = 0.0, int = 8

Note that some block frequencies got scaled down to 0. But more importantly, no matter what we do, it is impossible to have any code marked as "cold" in this situation, because the cold-code threshold is computed as 2% of the entry block frequency. With an entry block value of int = 8, the 2% cold threshold is integer 0 and it is effectively impossible for any block to be below that threshold.

In comparison with the new conversion scheme we end up with:

 - entry: float = 1.0, int = 18014398509481984
- if.then.i: float = 0.0, int = 9002696048640
- if.else.i: float = 0.0, int = 18005395813433344
- if.end15.sink.split.i: float = 0.0, int = 18000892999733
- if.end15.i: float = 0.0, int = 18014398509481984
- if.then19.i: float = 0.0, int = 11258999068426240
- _ZN4llvm12DenseMapBaseINS_8DenseMapIPNS_10StructTypeEPNS_12StructLayoutENS_12DenseMapInfoIS3_vEENS_6detail12DenseMapPairIS3_S5_EEEES3_S5_S7_SA_E20InsertIntoBucketImplIS3_EEPSA_RKS3_RKT_SE_.exit: float = 0.0, int = 18014398509481984

for the same situation and blocks like if.end15.sink.split.i end up being classified as cold code now.
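
To spell out the threshold arithmetic (my understanding of it, with an assumed plain 2% integer cutoff; not the exact InlineCost code):

#include <cstdint>
#include <cstdio>

// "Cold" here means: block frequency below 2% of the entry frequency
// (integer math, overflow ignored for this sketch).
static bool looksCold(uint64_t BlockFreq, uint64_t EntryFreq) {
  const uint64_t Threshold = EntryFreq * 2 / 100;
  return BlockFreq < Threshold;
}

int main() {
  // Old conversion: entry int = 8 -> threshold 8*2/100 == 0, nothing is cold.
  std::printf("old scheme: cold? %d\n", looksCold(0, 8));
  // New conversion: entry int = 18014398509481984 (2^54).
  std::printf("new scheme: cold? %d\n",
              looksCold(18000892999733ull, 18014398509481984ull));
  return 0;
}

With int(entry) = 8 the computed threshold is 0, so nothing can ever fall below it; with int(entry) = 2^54 the threshold is large enough that if.end15.sink.split.i falls below it.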

@MatzeB
Contributor Author

MatzeB commented Oct 26, 2023

So not sure if there is an easy fix for the regression.

We should probably have some logic that rescales all BFI values when inlining to better accommodate the new range of frequencies, so we lose less precision by accident. Though that won't help with the regression at hand, as that regression relies on isColdCallSite accidentally not working as before in order to get more inlining...

@MatzeB
Contributor Author

MatzeB commented Oct 26, 2023

Also note that this 2% cold-code threshold is only applied in situations where no PGO data is available. So maybe we should just ignore this, given that without PGO data it is often just impossible for the compiler to make good inlining decisions...

@dexonsmith
Collaborator

Seems awkward to pessimize as "cold" when there's no real data (no PGO, no other marking). What happens if you change the cold call threshold to 0% when there's no PGO data? (I.e., never assume a call is cold without actual evidence)

@david-xl
Contributor

The old code not marking them as cold also worked just by accident, though.

@dexonsmith
Collaborator

Not entirely accidental. When BPI/BFI first landed it was heavily profiled to be sure it didn't pessimize non-PGO code. I don't see why we'd suddenly be okay with pessimizing it.

Under 2% isn't hard to hit for hot path code. Lots of functions will have strings of early exit conditions.

@david-xl
Contributor

Agree that deciding coldness based on local entry frequency (2%) is a bad idea though.

@MatzeB
Contributor Author

MatzeB commented Oct 26, 2023

Seems this got introduced in https://reviews.llvm.org/D34312 with the rough idea that we shouldn't inline into parts of the code that __builtin_expect(...) deems unlikely. Which makes sense when you express it like that, but I guess numeric thresholds can go wrong...

@dexonsmith
Collaborator

Seems this got introduced in https://reviews.llvm.org/D34312 with the rough idea that we shouldn't inline into parts of the code that __builtin_expect(...) deems unlikely. Which makes sense when you express it like that, but I guess numeric thresholds can go wrong...

Heh, yeah, the premise seems correct, but a percentage-based numeric threshold doesn't seem right. You kind of want a flag for the block. Or a special value, like "freq=0", which indicates "annotation says this is cold".

@minglotus-6
Contributor

Seems this got introduced in https://reviews.llvm.org/D34312 with the rough idea that we shouldn't inline into parts of the code that __builtin_expect(...) deems unlikely

@xur-llvm @david-xl This just reminds me of the regressions we observed when we did an experiment to override branch probabilities from absl_predict_{true,false} with (manually collected, IIRC) sampled PGO profiles.

@david-xl
Contributor

Contradicting my past self (one of the reviewers of the patch), I think the coldness check should be based on a global threshold, which exists when synthetic entry count propagation is enabled (without PGO).

@vitalybuka
Collaborator

FYI @ldionne

This patch is causing test_vector2.pass.cpp to trigger a memory leak: https://lab.llvm.org/buildbot/#/builders/168/builds/16422
It's possible this was already leaking and lsan just didn't see it. Or it's some regression in lsan.

ldionne referenced this pull request Oct 27, 2023
Leak could be real, as the code terminates before freeing the memory.
@MatzeB
Contributor Author

MatzeB commented Oct 27, 2023

Not sure how this change could trigger leaks. Changing BFI here should mostly affect inlining decisions, basic block placement, and register allocation. But it shouldn't change program behavior or memory operations...

Will try to reproduce this test on my machine to see if there's anything obvious related to the change here (though at first glance the test code seems to provoke an exception abort and would naturally leak because of it?).

@MatzeB
Contributor Author

MatzeB commented Oct 27, 2023

Some very ad-hoc benchmarking of clang compilation speed (measured in instructions as reported by valgrind/callgrind, which I think somewhat matches nikic's setup), compiling sqlite3 from CTMark:

Old-BFI (this PR reverted), New-BFI (this PR applied), no-cold (cold-callsite-rel-freq set to 0 to disable the InlineCostCallAnalyzer::isColdCallSite behavior for non-PGO builds); also captured .text section sizes:

Old-BFI:          size:  59,802,772  (baseline)  insn: 2,812,833,144  (baseline)
New-BFI:          size:  60,247,076  +0.74%      insn: 2,818,639,641  +0.21%
Old-BFI, no-cold: size:  60,773,988  +1.62%      insn: 2,806,521,932  -0.22%
New-BFI, no-cold: size:  60,741,700  +1.57%      insn: 2,803,489,298  -0.33%

I could benchmark more with llvm-test-suite and put up a PR to disable cold-callsite-rel-freq by default if people support this direction...

@MatzeB
Contributor Author

MatzeB commented Oct 27, 2023

And digging even deeper:

  • FWIW I noticed that I only used clang -c as benchmark previously, should have used clang -c -O3 resulting in this:
Old-BFI:          insn: 37,821,687,947  (baseline)
New-BFI:          insn: 38,133,312,923  +0.82%
Old-BFI, no-cold: insn: 37,423,365,184  -1.05%
New-BFI, no-cold: insn: 37,438,736,713  -1.01%
  • The problematic part of the code that is inlined/not-inlined is actually marked as LLVM_UNLIKELY here: https://github.com/llvm/llvm-project/blob/main/llvm/include/llvm/ADT/DenseMap.h#L607 and intuitively one would think that it is probably best to not inline grow (as will happen with the new BFI version)
  • In practice not-inlining grow mostly ends up being worse because of knock-on effects as far as I can tell. These are the inlining decisions I noticed for the most prominent situation:
    - `grow` inlined (Old-BFI):
         - Instruction::getMetadataImpl
             -> Value::getMetadata           not inlined
             -> DenseMap::operator[]         inlined
             -> DenseMap::FindAndConstruct   inlined
             -> DenseMap::InsertIntoBucket   not inlined, size
                   likely too big with `grow` inlined here
             -> DenseMap::grow               inlined
    - `grow` not inlined (New-BFI):
         - Instruction::getMetadataImpl
             -> Value::getMetadata          inlined
             -> DenseMap::operator[]        inlined
             -> DenseMap::FindAndConstruct  not inlined, size
             -> DenseMap::InsertIntoBucket  inlined
             -> DenseMap::grow              not inlined

Though if you look into this, I would say that the code got better for the wrong reasons! Not inlining grow is a sensible decision in this context and the LLVM_UNLIKELY annotation makes sense (I actually added some counters and saw the unlikely branch taken somewhere between 1% and 10% of the time depending on inputs, so that seems reasonable).

Unfortunately, the particular code here (getMetadataImpl) never inserts new things into the map, but operator[] gives you that behavior by default, so nobody noticed. So not inlining InsertIntoBucket happened to be a great decision previously that the compiler made by accident, without good data to support it. Now, with better but still insufficient data (as this is not PGO), we happen to end up inlining InsertIntoBucket, wasting code size and leading to worse inlining decisions further up the stack...

Long story short: this ended up being another of the many stories where the compiler cannot make the right inlining decisions without actual runtime data. It now makes a better-looking decision based on the data it has, but that decision still ends up being wrong here. I'm going to leave things as is and rather put up some improvements to the LLVM code instead!

@dexonsmith
Collaborator

Interesting. Probably Value::getMetadata() could/should call DenseMap::find() instead of operator[]() and assert that it's found before dereferencing, because Value::hasMetadata() (which, IIRC, consults a bit stored in Value) has already promised something will be there. Probably you'll find me on the git-blame for this...
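
Roughly this shape, as a sketch (hypothetical helper name and simplified map type; not the actual Value::getMetadata code):

#include <cassert>
#include "llvm/ADT/DenseMap.h"
#include "llvm/IR/Metadata.h"

// Look up metadata that hasMetadata() already promised is present: find()
// never inserts or grows the map, unlike operator[].
static llvm::MDNode *
lookupExisting(const llvm::DenseMap<unsigned, llvm::MDNode *> &MDMap,
               unsigned KindID) {
  auto It = MDMap.find(KindID);
  assert(It != MDMap.end() && "hasMetadata() promised an entry");
  return It->second;
}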

Your plan SGTM!

@david-xl
Contributor

And digging even deeper:

  • FWIW I noticed that I only used clang -c as benchmark previously, should have used clang -c -O3 resulting in this:
Old-BFI:          insn: 37,821,687,947  (baseline)
New-BFI:          insn: 38,133,312,923  +0.82%
Old-BFI, no-cold: insn: 37,423,365,184  -1.05%
New-BFI, no-cold: insn: 37,438,736,713  -1.01%
  • The problematic part of the code that is inlined/not-inlined is actually marked as LLVM_UNLIKELY here: https://github.com/llvm/llvm-project/blob/main/llvm/include/llvm/ADT/DenseMap.h#L607 and intuitively one would think that it is probably best to not inline grow (as will happen with the new BFI version)
  • In practice not-inlining grow mostly ends up being worse because of knock-on effects as far as I can tell. These are the inlining decisions I noticed for the most prominent situation:
    - `grow` inlined (Old-BFI):
         - Instruction::getMetadataImpl
             -> Value::getMetadata           not inlined
             -> DenseMap::operator[]         inlined
             -> DenseMap::FindAndConstruct   inlined
             -> DenseMap::InsertIntoBucket   not inlined, size
                   likely too big with `grow` inlined here
             -> DenseMap::grow               inlined
    - `grow` not inlined (New-BFI):
         - Instruction::getMetadataImpl
             -> Value::getMetadata          inlined
             -> DenseMap::operator[]        inlined
             -> DenseMap::FindAndConstruct  not inlined, size
             -> DenseMap::InsertIntoBucket  inlined
             -> DenseMap::grow              not inlined

Though if you look into this, I would say that the code got better for the wrong reasons! Not inlining grow is a sensible decision in this context and the LLVM_UNLIKELY annotation makes sense (I actually added some counters and saw the unlikely branch taken somewhere between 1% and 10% of the time depending on inputs, so that seems reasonable).

Unfortunately, the particular code here (getMetadataImpl) never inserts new things into the map, but operator[] gives you that behavior by default, so nobody noticed. So not inlining InsertIntoBucket happened to be a great decision previously that the compiler made by accident, without good data to support it. Now, with better but still insufficient data (as this is not PGO), we happen to end up inlining InsertIntoBucket, wasting code size and leading to worse inlining decisions further up the stack...

Long story short: this ended up being another of the many stories where the compiler cannot make the right inlining decisions without actual runtime data. It now makes a better-looking decision based on the data it has, but that decision still ends up being wrong here. I'm going to leave things as is and rather put up some improvements to the LLVM code instead!

Good analysis.

Your plan sounds good, but lowering the cold threshold is probably still a good thing to do with the new BFI. The FindAndConstruct method in DenseMap can probably be annotated with __builtin_expect to indicate that the insertion is an unlikely path.
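
Something along these lines, as a hand-written sketch (not a verified patch to DenseMap.h; the real FindAndConstruct and InsertIntoBucket have more going on):

  value_type &FindAndConstruct(const KeyT &Key) {
    BucketT *TheBucket;
    // LLVM_LIKELY expands to __builtin_expect(..., true): treat the "key is
    // already present" path as hot and the insert/grow path as cold.
    if (LLVM_LIKELY(LookupBucketFor(Key, TheBucket)))
      return *TheBucket;
    return *InsertIntoBucket(TheBucket, Key); // may call grow()
  }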

@WenleiHe
Member

Thanks for investigating, Matthias. So it looks like the regression is indeed just the clang non-PGO build being unlucky after the BFI improvement. Sounds good to leave it as is.

@MatzeB
Contributor Author

MatzeB commented Oct 30, 2023

#70700 improves the LLVM code, so I am no longer seeing a regression before/after this change.

@david-xl
Contributor

david-xl commented Nov 1, 2023

@MatzeB, our internal release testing has seen lots of very large regressions for tests without PGO or with XFDO only. While I agree this patch is the right way to go and the good performance (without PGO) with the old BFI is somewhat by chance, it is still better not to regress them. Can you help with a follow-up patch to tune down the relative coldness threshold?

@MatzeB
Contributor Author

MatzeB commented Nov 1, 2023

Do you just see regressions, or also some wins? I don't know whether this all stems from the isColdCallSite behavior, or is just the usual "some things get better, some things get worse" situation when making inlining changes.

@MatzeB
Contributor Author

MatzeB commented Nov 1, 2023

Or put another way: do you see the regressions disappear with -mllvm -cold-callsite-rel-freq=0?

@david-xl
Contributor

david-xl commented Nov 1, 2023

I asked for those data points and will share them when ready.
