[LoopUnroll] Introduce PragmaUnrollFullMaxIterations as a hard cap on how many iterations we try to unroll #78648
Conversation
@llvm/pr-subscribers-llvm-transforms
Author: None (modiking)
Changes: Fixes PR77842, where UBSAN causes pragma full unroll to try and unroll INT_MAX times. This sets a cap to make sure we don't attempt this and crash the compiler.
Testing: ninja check-all with the new test.
Full diff: https://github.com/llvm/llvm-project/pull/78648.diff
2 Files Affected:
diff --git a/llvm/lib/Transforms/Utils/LoopUnroll.cpp b/llvm/lib/Transforms/Utils/LoopUnroll.cpp
index ee6f7b35750af0f..8529fa1db18e187 100644
--- a/llvm/lib/Transforms/Utils/LoopUnroll.cpp
+++ b/llvm/lib/Transforms/Utils/LoopUnroll.cpp
@@ -109,6 +109,10 @@ UnrollVerifyLoopInfo("unroll-verify-loopinfo", cl::Hidden,
#endif
);
+static cl::opt<unsigned>
+ UnrollMaxIterations("unroll-max-iterations", cl::init(1'000'000),
+ cl::Hidden,
+ cl::desc("Maximum allowed iterations to unroll."));
/// Check if unrolling created a situation where we need to insert phi nodes to
/// preserve LCSSA form.
@@ -453,6 +457,14 @@ LoopUnrollResult llvm::UnrollLoop(Loop *L, UnrollLoopOptions ULO, LoopInfo *LI,
}
}
+ // Certain cases with UBSAN can cause trip count to be calculated as INT_MAX,
+ // Block unrolling at a reasonable limit so that the compiler doesn't hang
+ // trying to unroll the loop. See PR77842
+ if (ULO.Count > UnrollMaxIterations) {
+ LLVM_DEBUG(dbgs() << "Won't unroll; trip count is too large\n");
+ return LoopUnrollResult::Unmodified;
+ }
+
using namespace ore;
// Report the unrolling decision.
if (CompletelyUnroll) {
diff --git a/llvm/test/Transforms/LoopUnroll/pr77842.ll b/llvm/test/Transforms/LoopUnroll/pr77842.ll
new file mode 100644
index 000000000000000..834033bbe3618eb
--- /dev/null
+++ b/llvm/test/Transforms/LoopUnroll/pr77842.ll
@@ -0,0 +1,37 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
+; RUN: opt -passes=loop-unroll -disable-output -debug-only=loop-unroll %s 2>&1 | FileCheck %s
+
+; Validate that loop unroll full doesn't try to fully unroll values whose trip counts are too large.
+
+; CHECK: Exiting block %cont23: TripCount=2147483648, TripMultiple=0, BreakoutTrip=0
+; CHECK-NEXT: Won't unroll; trip count is too large
+
+target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
+target triple = "x86_64-redhat-linux-gnu"
+
+define void @foo(i64 %end) {
+entry:
+ br label %loopheader
+
+loopheader:
+ %iv = phi i64 [ 0, %entry ], [ %iv_new, %backedge ]
+ %exit = icmp eq i64 %iv, %end
+ br i1 %exit, label %for.cond.cleanup.loopexit, label %cont23
+
+for.cond.cleanup.loopexit:
+ ret void
+
+cont23:
+ %exitcond241 = icmp eq i64 %iv, 2147483647
+ br i1 %exitcond241, label %handler.add_overflow, label %backedge
+
+handler.add_overflow:
+ unreachable
+
+backedge: ; preds = %cont23
+ %iv_new = add i64 %iv, 1
+ br label %loopheader, !llvm.loop !0
+}
+
+!0 = distinct !{!0, !1}
+!1 = !{!"llvm.loop.unroll.full"}
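For context, a source-level pattern along these lines can produce IR like the test above. This is a hypothetical sketch, not taken from the original bug report: the function name mirrors the test's @foo(i64 %end), and it assumes the file is built with a UBSAN overflow check (e.g. -fsanitize=signed-integer-overflow) so that the check on the induction variable introduces the extra exit.

// Hypothetical reproducer sketch for the PR77842 pattern (not the original report).
// With the signed-overflow check enabled, incrementing `i` gets an overflow test at
// i == INT_MAX, and the trip count over that exit is computed as 2147483648; the
// full-unroll pragma then asks the unroller for that many copies of the body.
void foo(long end) {
#pragma clang loop unroll(full)
  for (int i = 0; i != end; ++i) {
    // loop body elided
  }
}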
if (UP.Count > UnrollMaxIterations) {
  LLVM_DEBUG(dbgs() << "Won't unroll; trip count is too large\n");
  return LoopUnrollResult::Unmodified;
}
This works, but I don't think it fits well into the way the unroll count is currently computed. Rather than checking this after the fact, we should integrate this into the unroll count calculation. In particular, inside https://github.com/llvm/llvm-project/blob/88b1087035fd397996837e35d579d808d6b3f28c/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp#L784-L785 we should apply a pragma-specific iteration limit. This will also ensure that even if we decline to perform a full unroll to an unreasonable count, we can still choose to perform a partial or runtime unroll to a lower iteration count.
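A rough sketch of that approach (hypothetical code, not the committed patch; the option name is borrowed from the eventual PR title, and the surrounding logic is simplified):

// Hypothetical sketch: cap the pragma-driven full-unroll count inside
// computeUnrollCount() in LoopUnrollPass.cpp instead of rejecting the loop
// later in UnrollLoop().
static cl::opt<unsigned> PragmaUnrollFullMaxIterations(
    "pragma-unroll-full-max-iterations", cl::init(1'000'000), cl::Hidden,
    cl::desc("Maximum allowed iterations under a pragma full unroll."));

// ... later, inside computeUnrollCount(), when a full-unroll pragma is present:
if (PragmaFullUnroll && TripCount &&
    TripCount <= PragmaUnrollFullMaxIterations) {
  UP.Count = TripCount; // full unroll is still allowed under the cap
}
// Otherwise fall through, so a partial or runtime unroll to a smaller
// count can still be chosen instead of giving up entirely.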
@nikic Appreciate the thorough review on the best way to get this done. Right now, with locking the full unroll under the pragma, the repro unrolls 2048 times, which is fine but fairly slow to complete (155s on my Skylake). Thoughts on a way to cap this in some way, or is this alright?
Friendly ping @nikic
I think the general behavior is fine -- we still get the aggressive unrolling expected from the pragma. For the test case, it should be possible to reduce the amount of unrolling by setting the new limit to a lower value. Though I'm surprised that unrolling 2048 times takes 155s -- is this with a debug build of LLVM by chance?
I checked again and I think you're right on that; with a release build it's much faster.
I still think you should use … A possible alternative would be to use the …
Makes sense. Good suggestion with …
[LoopUnroll] Introduce PragmaUnrollFullMaxIterations as a hard cap on how many iterations we try to unroll
@nikic Hmm, interestingly enough it's still trying to partially unroll with the new check in place:
Because TripCount == 2147483648, MaxTripCount == 0, while partial unroll sets UP.Count == 2048. Scanning through, I think we should also catch this case and add a check for TripCount as well.
@modiking Yeah, I think that would make sense. Assuming it doesn't break something else...
@nikic Change passes unit tests; also checked that both are needed, as it fails a unit test if we only switch to checking TripCount.
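For reference, the check that this converged on (as quoted in the downstream BPF report later in this thread) is in tryToUnrollLoop():

// Do not attempt partial/runtime unrolling in FullLoopUnrolling
if (OnlyFullUnroll && (UP.Count < TripCount || UP.Count < MaxTripCount)) {
  LLVM_DEBUG(
      dbgs() << "Not attempting partial/runtime unroll in FullLoopUnroll.\n");
  return LoopUnrollResult::Unmodified;
}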
LGTM
Yeah, currently we need both as they are mutually exclusive -- we only set MaxTripCount if TripCount is not available. Does make me wonder if it would make things cleaner if we just always computed MaxTripCount. But that's unrelated to this patch...
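Roughly the structure being described, as a simplified sketch (not the actual LoopUnrollPass code; the ScalarEvolution calls are just illustrative):

// Simplified sketch of the mutual exclusivity described above: an exact
// TripCount is preferred; MaxTripCount is only computed as a fallback.
unsigned TripCount = SE.getSmallConstantTripCount(L, ExitingBlock); // exact count, 0 if unknown
unsigned MaxTripCount = 0;
if (!TripCount)
  MaxTripCount = SE.getSmallConstantMaxTripCount(L); // upper bound, 0 if unknown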
Thanks for the review!
That would be cleaner. But yeah, a different patch.
Co-authored-by: Nikita Popov <github@npopov.com>
We're seeing a failure with this patch internally (don't have a reproducer yet, just letting you know in case it's actionable in the meantime while we continue investigating). An existing loop with a static bound (256) and a #pragma unroll is no longer being fully unrolled. The contents of the loop do include a couple of gotos, a continue and a return (all under various conditions), if that's relevant/helpful.
Interesting, no certainty on the answer here, but there are 2 changes here:
1. The new PragmaUnrollFullMaxIterations cap on how many iterations we try to unroll.
2. The change to no longer attempt partial/runtime unrolling in the full-unroll pass when the pragma count doesn't cover the full trip count.
Could be worthwhile to see which one of these is causing the issue. Also possible this is an existing issue that was just exposed by this change.
It might be that this test times out if the loop isn't fully unrolled. (Removing the unroll pragma causes the program to time out - FWIW, this is something BPF related - there's some comment about the pragma being required for some verification - I don't really understand BPF enough to explain it.) If I remove this part of your change (the OnlyFullUnroll check that skips partial/runtime unrolling in the full-unroll pass) and switch it back to the old code, the program finishes in roughly the expected time (~3-4 minutes, rather than hitting our 10min test timeout).
Gotcha, that seems to be the issue then. To me this seems like it's a source-side fix to ensure full unrolling, via specifying the trip count or using an explicit unroll count.
So not sure I'm following. The condition that was added in this patch that seems to be relevant, if I'm understanding correctly, is the added UP.Count < TripCount check. But there is no count on the unroll pragma in this case - so why would that condition fire? (I'm poking around in a debugger now to better understand it.) (& I tried to find some documentation, but couldn't, to figure out what #pragma unroll without a count is supposed to mean.)
Oh, I found https://releases.llvm.org/4.0.0/tools/clang/docs/AttributeReference.html#pragma-unroll-pragma-nounroll which says "Specifying #pragma unroll without a parameter directs the loop unroller to attempt to fully unroll the loop if the trip count is known at compile time and attempt to partially unroll the loop if the trip count is not known at compile time". The loop condition is simple/known here (though the conditional goto/returns - maybe those complicate things?). Oh, the docs also say "#pragma unroll and #pragma unroll value have identical semantics to #pragma clang loop unroll(full) and #pragma clang loop unroll_count(value) respectively." - so that doesn't sound like …
Oh interesting, I wasn't quite certain myself on what the exact semantics are. Thanks for digging it up.
The way the code is structured is that the unroll trip count is either stored in TripCount or in MaxTripCount. Actually, I wonder if that's what's causing the change where previously it would try to do a partial unroll but that no longer happens. What's the output with -debug-only=loop-unroll?
I haven't done proper A/B debugging before and after this patch yet - but you might be right that the code was previously falling under partial unrolling - but I don't know why it would count as partial unrolling? The code here (llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp, lines 791 to 805 at f872706):
If I'm reading this right, it doesn't look like it treats PragmaFullUnroll and PragmaEnableUnroll equally - or perhaps clang is mis-lowering the pragma without a bound & it should be lowering it to pragma full unroll? The LLVM docs (https://llvm.org/docs/TransformMetadata.html#loop-unrolling) don't seem to be super clear about what the different IR metadata mean?
You're looking at some very old docs there. The current ones are https://clang.llvm.org/docs/AttributeReference.html#pragma-unroll-pragma-nounroll and they have the s/full/enable typo fixed. The unrolling heuristics are very complex and it's hard to make any definitive statements without a reproducer. Based on what you say, I suspect that this patch has moved a partial unroll from happening in the full unroll pass to the runtime unroll pass and that exposed a second-order compile-time issue in your case. Or it might be something else entirely...
Still - going to that documentation and then down into the linked LanguageExtensions documentation, it mentions this: "If unroll(full) is specified the unroller will attempt to fully unroll the loop if the trip count is known at compile time identically to unroll(enable). However, with unroll(full) the loop will not be unrolled if the loop count is not known at compile time." So that sounds like "unroll(full)" should unroll in fewer cases than "unroll(enable)"? But in this case, with this patch applied, it seems like the loop gets unrolled only with "unroll(full)" and not with "unroll(enable)".
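Putting the quoted documentation together, the mapping under discussion is roughly as follows (hypothetical example code; the mapping follows the corrected clang docs cited above, and the 256 bound mirrors the internal loop described earlier):

// #pragma unroll (no parameter) == #pragma clang loop unroll(enable):
//   fully unroll if the trip count is known at compile time,
//   otherwise attempt a partial unroll.
// #pragma unroll N == #pragma clang loop unroll_count(N).
// #pragma clang loop unroll(full): only unrolls when the trip count is known.
void zero(int *a) {
#pragma unroll
  for (int i = 0; i < 256; ++i)
    a[i] = 0;
}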
Yep, we're working on reproducers & trying to talk through what we have in the meantime, in case it's helpful.
TLDR
----
The verif_scale_pyperf600 test fails when compiled using recent clang. Basic block rearrangement causes a "jumps is too complex" error. Investigation shows that nothing is wrong with clang or the verifier, and the simplest way to fix the regression is to increase the max allowed jump history.

What happened
-------------
The verif_scale_pyperf600 test fails to verify when compiled by recent clang revisions. The last known good revision is [0], the first known bad revision is [1]. Revision [1] comes from the pull request [2].

Verifier error when using revision [1]:

    ...
    ; if (frame->co_name) @ pyperf.h:118
    25460: (79) r3 = *(u64 *)(r10 -32)      ; R3_w=scalar() R10=fp0 fp-32=mmmmmmmm
    25461: (15) if r3 == 0x0 goto pc+7
    The sequence of 8193 jumps is too complex.
    verification time 822174 usec
    stack depth 360

[0] Last good revision:
    c3291253c3b5 ("Revert "[scudo] [MTE] resize stack depot for allocation ring buffer" (#80777)")
[1] First broken revision:
    99ddd77ed9e1 ("[LoopUnroll] Introduce PragmaUnrollFullMaxIterations as a hard cap on how many iterations we try to unroll (#78648)")
[2] Pull request for first broken revision:
    llvm/llvm-project#78648

LLVM change description
-----------------------
The relevant part of [1] is:

    --- a/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
    +++ b/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
    @@ -1282,7 +1295,7 @@ tryToUnrollLoop(Loop *L, DominatorTree &DT, LoopInfo *LI, ScalarEvolution &SE,
       }

       // Do not attempt partial/runtime unrolling in FullLoopUnrolling
    -  if (OnlyFullUnroll && !(UP.Count >= MaxTripCount)) {
    +  if (OnlyFullUnroll && (UP.Count < TripCount || UP.Count < MaxTripCount)) {
         LLVM_DEBUG(
             dbgs() << "Not attempting partial/runtime unroll in FullLoopUnroll.\n");
         return LoopUnrollResult::Unmodified;

- `UP.Count` is the preferred number of iterations to be unrolled, it is 150 for pyperf600;
- `TripCount` is the predicted number of loop iterations, it is 600 for pyperf600.

The hunk above does exactly what the comment says: it prevents partial unrolling of the main pyperf600 loop in the full unrolling pass. There is also a partial unrolling pass done later in the pipeline.

pyperf600 structure
-------------------
The relevant parts of the test look as follows:

    static __always_inline bool get_frame_data(...)
    {
        ...
        if (!frame->f_code)
            return false;
        ...
        if (frame->co_filename) { ... }
        if (frame->co_name) { ... }
        return true;
    }

    int __on_event(...)
    {
        ...
    #pragma clang loop unroll(UNROLL_COUNT)        // UNROLL_COUNT == 150
        for (int i = 0; i < STACK_MAX_LEN; ++i)    // STACK_MAX_LEN == 600
            if (frame_ptr && get_frame_data(...)) {
                if (!symbol_id) { ... }
                if (*symbol_id == new_symbol_id) { ... }
                ...
            }
        ...
    }

    SEC("raw_tracepoint/kfree_skb")
    int on_event(struct bpf_raw_tracepoint_args* ctx)
    {
        ...
        __on_event(...);
        __on_event(...);
        __on_event(...);
        __on_event(...);
        __on_event(...);
        ...
    }

The call to get_frame_data() is inlined. The main takeaways are:
- the BPF program consists of five calls to __on_event();
- __on_event() has a big loop inside;
- the loop body has 5 conditionals (when counted with the conditionals in get_frame_data()).

LLVM change impact on pyperf600
-------------------------------
Prior to [1] the loop in pyperf600 was unrolled by the full unrolling pass, after [1] it is unrolled by the partial unrolling pass. Such a change causes a subtle rearrangement of basic blocks inside the loop which turns out to be important for the verifier. The rearrangement occurs inside the inlined body of get_frame_data():

    static __always_inline bool get_frame_data(...)
    {
        ...
        if (!frame->f_code)
            return false;
        ...
    }

    Translation before [1]:                 Translation after [1]:

    ; if (!frame->f_code)                   ; if (!frame->f_code)
    r3 = *(u64 *)(r10 - 0x30)               r3 = *(u64 *)(r10 - 0x30)
    if r3 != 0x0 goto +0x2 <LBB0_19>        if r3 == 0x0 goto +0x4b <LBB0_39>

Before [1] the fall-through path is for `return false`, after [1] the fall-through path is to the rest of the get_frame_data() body. The `if (!frame->f_code)` is the first conditional in the loop body and it guards all other conditionals in the body (when !frame->f_code == 1 the rest of the conditionals is skipped).

LLVM change impact on verifier
------------------------------
(Below is my speculative understanding, it is valid only to a certain degree.)

Before [1] the verifier would process pyperf600 in the following order:
- __on_event(): process the loop 600 times:
  - `if (!frame->f_code) return false`:
    - fall-through is to `return false`;
    - push one jump to the jump history;
    - assume the fall-through branch and skip the rest of the loop body;
- __on_event(): same thing, push 600 jumps to the jump history;
- __on_event(): same thing, push 600 jumps to the jump history;
- __on_event(): same thing, push 600 jumps to the jump history;
- __on_event(): this is the last call to __on_event(), all branches within it are verified before proceeding with branches pushed for previous calls.

When the loop inside the last call to __on_event() is verified, a checkpoint at its start becomes viable. Branches pushed when previous calls to __on_event() were processed would eventually hit this checkpoint and the whole process would converge eventually. Thus, at its peak the jump history length would be ~600*5 == 3000.

However, after [1] the fall-through path for the `if (!frame->f_code)` leads to the other conditionals in the loop body, pushing up to 5 conditionals to the jump history for each iteration. Hence, the peak jump history length would be something like ~600*5*5 == 15000, which is outside of the current limits for the verifier.

This analysis does not 100% reflect reality; the test passes only when the jump history limit is increased up to 24K entries.

Mitigation strategies
---------------------
The following strategies were considered:

a. Change the pyperf600 basic block layout: using some inline assembly it is possible to get the old basic block layout, however this seems very fragile.
b. Non-DFS verifier logic for choosing the next branch: when inside a loop, don't explore the fall-through branch first; instead predict which branch would push fewer conditionals onto the jump history and explore that first. A variation of this was explored as [3], picking branches closer to the 'exit' instruction, using distance in basic blocks in a reverse control flow graph as a metric. This variation fixes pyperf600 but causes a few other tests to fail because of limits issues. Seems too complex to proceed.
c. Increase the jump history limit. This change is simple and preserves the purpose of the jump history limit:
   - caps memory usage;
   - still exits early for too-complex BPF programs that would otherwise hit the 1M instructions limit some time later.

[3] https://github.com/eddyz87/bpf/tree/branch-visit-order-wip

Performance impact
------------------

    | configuration                       | instructions | time               |
    |-------------------------------------+--------------+--------------------|
    | LLVM18 + bpf-next                   |  31457711580 | 3.87599  (+-0.15%) |
    | LLVM/git + bpf-next/24k             |  18836657690 | 2.19285  (+-0.23%) |
    | LLVM/git + bpf-next                 |    687041950 | 0.117966 (+-0.32%) |
    | LLVM/git + bpf-next/24k/bigger test |   1824485084 | 0.281100 (+-0.23%) |

Time measured using the following command:
- perf stat -r 10 -e instructions:k ./veristat -q cpuv4/pyperf600.bpf.o

Notes:
- "LLVM/git" means llvm compiled from the current main branch [4];
- "bpf-next/24k" means bpf-next with this patch on top;
- "bpf-next/24k/bigger test" means bpf-next with this patch on top and the following addition to the pyperf600 test:

    --- a/tools/testing/selftests/bpf/progs/pyperf.h
    +++ b/tools/testing/selftests/bpf/progs/pyperf.h
    @@ -352,6 +352,8 @@ int on_event(struct bpf_raw_tracepoint_args* ctx)
         ret |= __on_event(ctx);
         ret |= __on_event(ctx);
         ret |= __on_event(ctx);
    +    ret |= __on_event(ctx);
    +    ret |= __on_event(ctx);

         return ret;
     }

[4] LLVM revision used for performance impact testing:
    55eb93b2688d ("[RISCV] Remove RISCVISD::FP_EXTEND_BF16. (#106939)")

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Fixes PR77842 where UBSAN causes pragma full unroll to try and unroll INT_MAX times. This sets a cap to make sure we don't attempt this and crash the compiler.
Testing:
ninja check-all with new test