
[CSSPGO] Reject high checksum mismatched profile #84097

Merged · 7 commits into llvm:main · Mar 27, 2024

Conversation

wlei-llvm (Contributor) commented on Mar 6, 2024

Error out the build if the checksum mismatch is extremely high; it is better to drop the profile than to apply a bad one.
Note that the check is at the module level: a user could make big changes to the functions in a single module, yet those changes might not be performance-significant for the whole binary. We therefore want to be conservative and only try to catch big performance regressions. To do this, we select a set of "hot" functions for the check, using two parameters (hot-func-cutoff-for-staleness-error and min-functions-for-staleness-error) to control the selection so that the selected functions are hot enough and their number is not too small.
We tuned the parameters on our internal services; the check catches big performance regressions caused by high mismatch.
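
A minimal sketch of the decision logic described above (parameter names here are illustrative; the actual flag names and implementation are in the diff below):

    #include <cstdint>

    // Select hot functions elsewhere, then decide whether to reject the
    // profile: require a minimum number of selected functions and compare
    // the mismatch ratio against a percentage threshold.
    bool shouldRejectStaleProfile(uint64_t NumSelectedFuncs,
                                  uint64_t NumMismatchedFuncs,
                                  uint64_t MinSelectedFuncs,
                                  uint64_t ErrorThresholdPct) {
      // Too few hot functions selected: a benign local change could dominate
      // the ratio, so skip the check entirely.
      if (NumSelectedFuncs < MinSelectedFuncs)
        return false;
      return NumMismatchedFuncs * 100 >= NumSelectedFuncs * ErrorThresholdPct;
    }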

llvmbot (Collaborator) commented on Mar 6, 2024

@llvm/pr-subscribers-llvm-transforms

Author: Lei Wang (wlei-llvm)

Changes

Error out the build if the checksum mismatch is extremely high; it is better to drop the profile than to apply a bad one.
Note that the check is at the module level: a user could make big changes to the functions in a single module, yet those changes might not be performance-significant for the whole binary. We therefore want to be conservative and only try to catch big performance regressions. To do this, we select a set of "hot" functions for the check, using two parameters (checksum-mismatch-func-hot-block-skip and checksum-mismatch-num-func-skip) to control the selection so that the selected functions are hot enough and their number is not too small.
We tuned the parameters on our internal services; the check catches big performance regressions caused by high mismatch.


Full diff: https://github.com/llvm/llvm-project/pull/84097.diff

2 Files Affected:

  • (modified) llvm/lib/Transforms/IPO/SampleProfile.cpp (+81)
  • (added) llvm/test/Transforms/SampleProfile/pseudo-probe-profile-mismatch-error.ll (+5)
diff --git a/llvm/lib/Transforms/IPO/SampleProfile.cpp b/llvm/lib/Transforms/IPO/SampleProfile.cpp
index ffc26f1a0a972d..9ec224bb32a551 100644
--- a/llvm/lib/Transforms/IPO/SampleProfile.cpp
+++ b/llvm/lib/Transforms/IPO/SampleProfile.cpp
@@ -234,6 +234,24 @@ static cl::opt<unsigned> ProfileICPRelativeHotnessSkip(
     cl::desc(
         "Skip relative hotness check for ICP up to given number of targets."));
 
+static cl::opt<unsigned> ChecksumMismatchFuncHotBlockSkip(
+    "checksum-mismatch-func-hot-block-skip", cl::Hidden, cl::init(100),
+    cl::desc("For checksum-mismatch error check, skip checking the function "
+             "whose num of hot(on average) blocks is smaller than the "
+             "given number."));
+
+static cl::opt<unsigned> ChecksumMismatchNumFuncSkip(
+    "checksum-mismatch-num-func-skip", cl::Hidden, cl::init(50),
+    cl::desc("For checksum-mismatch error check, skip the check if the total "
+             "number of selected function is smaller than the given number."));
+
+static cl::opt<unsigned> ChecksumMismatchErrorThreshold(
+    "checksum-mismatch-error-threshold", cl::Hidden, cl::init(80),
+    cl::desc(
+        "For checksum-mismatch error check, error out if the percentage of "
+        "function mismatched-checksum is higher than the given percentage "
+        "threshold"));
+
 static cl::opt<bool> CallsitePrioritizedInline(
     "sample-profile-prioritized-inline", cl::Hidden,
 
@@ -630,6 +648,8 @@ class SampleProfileLoader final : public SampleProfileLoaderBaseImpl<Function> {
   std::vector<Function *> buildFunctionOrder(Module &M, LazyCallGraph &CG);
   std::unique_ptr<ProfiledCallGraph> buildProfiledCallGraph(Module &M);
   void generateMDProfMetadata(Function &F);
+  bool errorIfHighChecksumMismatch(Module &M, ProfileSummaryInfo *PSI,
+                                   const SampleProfileMap &Profiles);
 
   /// Map from function name to Function *. Used to find the function from
   /// the function name. If the function name contains suffix, additional
@@ -2184,6 +2204,61 @@ bool SampleProfileLoader::doInitialization(Module &M,
   return true;
 }
 
+// Note that this is a module-level check. Even if one module is errored out,
+// the entire build will be errored out. However, the user could make big
+// changes to functions in single module but those changes might not be
+// performance significant to the whole binary. Therefore, we use a conservative
+// approach to make sure we only error out if it globally impacts the binary
+// performance. To achieve this, we use heuristics to select a reasonable
+// big set of functions that are supposed to be globally performance
+// significant, only compute and check the mismatch within those functions. The
+// function selection is based on two criteria: 1) The function is "hot" enough,
+// which is tuned by a hotness-based flag(ChecksumMismatchFuncHotBlockSkip). 2)
+// The num of function is large enough which is tuned by the
+// ChecksumMismatchNumFuncSkip flag.
+bool SampleProfileLoader::errorIfHighChecksumMismatch(
+    Module &M, ProfileSummaryInfo *PSI, const SampleProfileMap &Profiles) {
+  assert(FunctionSamples::ProfileIsProbeBased &&
+         "Only support for probe-based profile");
+  uint64_t TotalSelectedFunc = 0;
+  uint64_t NumMismatchedFunc = 0;
+  for (const auto &I : Profiles) {
+    const auto &FS = I.second;
+    const auto *FuncDesc = ProbeManager->getDesc(FS.getGUID());
+    if (!FuncDesc)
+      continue;
+
+    // We want to select a set of functions that are globally performance
+    // significant, in other words, if those functions profiles are
+    // checksum-mismatched and dropped, the whole binary will likely be
+    // impacted, so here we use a hotness-based threshold to control the
+    // selection.
+    if (FS.getTotalSamples() <
+        ChecksumMismatchFuncHotBlockSkip * PSI->getOrCompHotCountThreshold())
+      continue;
+
+    TotalSelectedFunc++;
+    if (ProbeManager->profileIsHashMismatched(*FuncDesc, FS))
+      NumMismatchedFunc++;
+  }
+  // Make sure that the num of selected function is not too small to distinguish
+  // from the user's benign changes.
+  if (TotalSelectedFunc < ChecksumMismatchNumFuncSkip)
+    return false;
+
+  // Finally check the mismatch percentage against the threshold.
+  if (NumMismatchedFunc * 100 >=
+      TotalSelectedFunc * ChecksumMismatchErrorThreshold) {
+    auto &Ctx = M.getContext();
+    const char *Msg =
+        "The FDO profile is too old and will cause big performance regression, "
+        "please drop the profile and collect a new one.";
+    Ctx.diagnose(DiagnosticInfoSampleProfile(M.getModuleIdentifier(), Msg));
+    return true;
+  }
+  return false;
+}
+
 void SampleProfileMatcher::findIRAnchors(
     const Function &F, std::map<LineLocation, StringRef> &IRAnchors) {
   // For inlined code, recover the original callsite and callee by finding the
@@ -2715,6 +2790,12 @@ bool SampleProfileLoader::runOnModule(Module &M, ModuleAnalysisManager *AM,
                         ProfileSummary::PSK_Sample);
     PSI->refresh();
   }
+
+  // Error out if the profile checksum mismatch is too high.
+  if (FunctionSamples::ProfileIsProbeBased &&
+      errorIfHighChecksumMismatch(M, PSI, Reader->getProfiles()))
+    return false;
+
   // Compute the total number of samples collected in this profile.
   for (const auto &I : Reader->getProfiles())
     TotalCollectedSamples += I.second.getTotalSamples();
diff --git a/llvm/test/Transforms/SampleProfile/pseudo-probe-profile-mismatch-error.ll b/llvm/test/Transforms/SampleProfile/pseudo-probe-profile-mismatch-error.ll
new file mode 100644
index 00000000000000..84a2a0ff2eb6ad
--- /dev/null
+++ b/llvm/test/Transforms/SampleProfile/pseudo-probe-profile-mismatch-error.ll
@@ -0,0 +1,5 @@
+; REQUIRES: x86_64-linux
+; RUN: opt < %S/pseudo-probe-profile-mismatch.ll -passes=sample-profile -sample-profile-file=%S/Inputs/pseudo-probe-profile-mismatch.prof -checksum-mismatch-func-hot-block-skip=0 -checksum-mismatch-num-func-skip=1 -checksum-mismatch-error-threshold=1 -S 2>%t -o %t.ll
+; RUN: FileCheck %s --input-file %t
+
+; CHECK: error: [[*]]: The FDO profile is too old and will cause big performance regression, please drop the profile and collect a new one.

Review thread on llvm/lib/Transforms/IPO/SampleProfile.cpp:

    // impacted, so here we use a hotness-based threshold to control the
    // selection.
    if (FS.getTotalSamples() <
        ChecksumMismatchFuncHotBlockSkip * PSI->getOrCompHotCountThreshold())
Member:

  1. The multiplier doesn't really have anything to do with "blocks" of the function, so the name can be confusing?
  2. What do you think of using isFunctionHotInCallGraphNthPercentile or isHotCountNthPercentile with a percentile cutoff? We can tune the percentile to get what we need.

Contributor Author (wlei-llvm):

> 1. The multiplier doesn't really have anything to do with "blocks" of the function, so the name can be confusing?
> 2. What do you think of using isFunctionHotInCallGraphNthPercentile or isHotCountNthPercentile with a percentile cutoff? We can tune the percentile to get what we need.

Let me try to clarify my initial thought: IIUC, all the "HotCount" machinery in PSI is a block-level (or body-sample-level) concept. The isFunctionHotXXX functions check the entry count of the function, which is also block-level (the first block). What I use here is getTotalSamples(), which I intend to include all the callee/inlinee samples (the entry count could be cold while the function still does a lot of hot inlining). Since total_samples is roughly the sum of all the block counts of the function, I used a multiplier meant to express how many blocks, on average, are hot in this function.

So if we pass a sum of block counts (getTotalSamples) to isFunctionHotInCallGraphNthPercentile or isHotCountNthPercentile, it might still be confusing?

I feel we don't actually have a function-level notion for this hotness check, so I was also thinking of adding a new helper to PSI, something like the following (shown here as a free helper taking PSI; not sure if it's generally useful):

int getNumHotBlocks(const FunctionSamples &FS, ProfileSummaryInfo &PSI) {
  int Num = 0;
  // Count the body samples (roughly, blocks) whose count clears PSI's
  // hot-count threshold.
  for (const auto &I : FS.getBodySamples())
    if (PSI.isHotCount(I.second.getSamples()))
      Num++;
  return Num;
}

Any thoughts? Thanks!
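
For illustration only, such a helper could then replace the multiplier-based condition inside the selection loop of the diff above (hypothetical usage, not what the PR ends up doing):

    // Hypothetical: require at least N hot blocks instead of N times the
    // hot-count threshold worth of total samples.
    if (getNumHotBlocks(FS, *PSI) < ChecksumMismatchFuncHotBlockSkip)
      continue;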

Member:

> The isFunctionHotXXX functions check the entry count of the function, which is also block-level (the first block). What I use here is getTotalSamples(), which I intend to include all the callee/inlinee samples (the entry count could be cold while the function still does a lot of hot inlining).

So there are different ways to decide whether a function is hot. Here we want to select a few hot functions and see how many of them have a checksum mismatch. Checking total samples is one way, and it differs from the existing isFunctionHotXXX. But do we really need to add another way of deciding whether a function is hot? I don't know the answer, but maybe using isFunctionHotInCallGraphNthPercentile to decide whether a function is hot is OK for our purpose too?

I agree that passing total samples into isHotCountNthPercentile may not be a good idea, since the latter expects block counts.

Contributor Author (wlei-llvm):

I see, sounds good to use total_samples to decide if the function is hot.

I took a look at isFunctionHotInCallGraphNthPercentile; it looks like its parameter is a Function (isFunctionHotOrColdInCallGraphNthPercentile(int PercentileCutoff, const FuncT *F, BFIT &FI), https://github.com/llvm/llvm-project/blob/main/llvm/include/llvm/Analysis/ProfileSummaryInfo.h#L285), and it looks for the annotations on the IR to check hotness. However, our check here runs before the sample annotation, so we can't use that function directly.
Actually, I found that isFunctionHotOrColdInCallGraphNthPercentile also uses isHotCountNthPercentile on TotalCallCount, which doesn't seem to be a block-level count either. We probably don't have an existing function for total_samples right now, so to keep it simple, maybe we can just use isHotCountNthPercentile on total_samples? (Or add an overload like isFunctionHotInCallGraphNthPercentile(..., total_samples) that accepts total_samples, but I think that would really just be a wrapper around isHotCountNthPercentile.)
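
A minimal sketch of that proposal (assuming ProfileSummaryInfo's existing isHotCountNthPercentile(PercentileCutoff, Count) overload; the HotFuncCutoff value is illustrative, and such cutoffs are expressed in parts per million, e.g. 990000 for the 99th percentile):

    #include "llvm/Analysis/ProfileSummaryInfo.h"
    #include "llvm/ProfileData/SampleProf.h"

    using namespace llvm;
    using namespace llvm::sampleprof;

    // Treat a function as hot for the staleness check if its total samples
    // clear the Nth-percentile count threshold.
    static bool isHotForStalenessCheck(const FunctionSamples &FS,
                                       ProfileSummaryInfo *PSI,
                                       int HotFuncCutoff) {
      return PSI->isHotCountNthPercentile(HotFuncCutoff, FS.getTotalSamples());
    }

This appears to match the direction suggested by the hot-func-cutoff-for-staleness-error flag mentioned in the PR description.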

Contributor Author (wlei-llvm) commented on Mar 19, 2024:

Digging more, I found that callsiteIsHot (https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Utils/SampleProfileLoaderBaseUtil.cpp#L64) also uses total_samples to decide hotness.

Maybe a different topic, but I'm surprised it uses total_samples to make inlining decisions.

Member:

Sounds good to just use isHotCountNthPercentile against total samples for now.

> Maybe a different topic, but I'm surprised it uses total_samples to make inlining decisions.

Yes, this is not ideal. We intentionally switched away from that for CSSPGO. So if you check, the path that uses callsiteIsHot should be AutoFDO-only.

Contributor Author (wlei-llvm):

> Sounds good to just use isHotCountNthPercentile against total samples for now.
>
> > Maybe a different topic, but I'm surprised it uses total_samples to make inlining decisions.
>
> Yes, this is not ideal. We intentionally switched away from that for CSSPGO. So if you check, the path that uses callsiteIsHot should be AutoFDO-only.

Good to know. I checked that the usages of callsiteIsHot are indeed on the AutoFDO path.

github-actions bot commented on Mar 15, 2024:

✅ With the latest revision this PR passed the C/C++ code formatter.

WenleiHe (Member):

The current version looks good. For testing, have you tried to build with some workloads to make sure we don't have false positives?


wlei-llvm (Contributor Author):

> The current version looks good. For testing, have you tried to build with some workloads to make sure we don't have false positives?

Yeah, I'm going to do another round of tuning for the "HotFuncCutoffForStalenessError" flag value.

WenleiHe (Member) left a comment:

LGTM with nits, thanks.

wlei-llvm (Contributor Author):

> The current version looks good. For testing, have you tried to build with some workloads to make sure we don't have false positives?

Update on the flag tuning: the signal looks quite distinguishable. On one ads service, with a super-bad profile (applying both #79919 and #75092), surprisingly even a 300000 (30%) hot cutoff still triggers the error, while 20% doesn't... On the other side, with a regular profile (1-2 weeks old), even a 99.9% cutoff doesn't trigger any false positives... so there is a pretty large margin for this error check. For now I set it to 80%, leaning towards a lower bar to avoid false negatives.

@wlei-llvm changed the title from "[CSSPGO] Error out if the checksum mismatch is high" to "[CSSPGO] Add support to reject high checksum mismatched profile" on Mar 26, 2024
@wlei-llvm changed the title from "[CSSPGO] Add support to reject high checksum mismatched profile" to "[CSSPGO] Reject high checksum mismatched profile" on Mar 27, 2024
@wlei-llvm merged commit 2598aa6 into llvm:main on Mar 27, 2024
4 checks passed
@wlei-llvm deleted the error-hash-mismatch branch on Mar 28, 2024