[LoopUnroll] Introduce PragmaUnrollFullMaxIterations as a hard cap on how many iterations we try to unroll #78648
Conversation
@llvm/pr-subscribers-llvm-transforms
Author: None (modiking)
Changes: Fixes PR77842, where UBSAN causes pragma full unroll to try and unroll INT_MAX times. This sets a cap to make sure we don't attempt this and crash the compiler.
Testing: ninja check-all with the new test.
Full diff: https://github.com/llvm/llvm-project/pull/78648.diff
2 Files Affected:
diff --git a/llvm/lib/Transforms/Utils/LoopUnroll.cpp b/llvm/lib/Transforms/Utils/LoopUnroll.cpp
index ee6f7b35750af0f..8529fa1db18e187 100644
--- a/llvm/lib/Transforms/Utils/LoopUnroll.cpp
+++ b/llvm/lib/Transforms/Utils/LoopUnroll.cpp
@@ -109,6 +109,10 @@ UnrollVerifyLoopInfo("unroll-verify-loopinfo", cl::Hidden,
#endif
);
+static cl::opt<unsigned>
+ UnrollMaxIterations("unroll-max-iterations", cl::init(1'000'000),
+ cl::Hidden,
+ cl::desc("Maximum allowed iterations to unroll."));
/// Check if unrolling created a situation where we need to insert phi nodes to
/// preserve LCSSA form.
@@ -453,6 +457,14 @@ LoopUnrollResult llvm::UnrollLoop(Loop *L, UnrollLoopOptions ULO, LoopInfo *LI,
}
}
+ // Certain cases with UBSAN can cause trip count to be calculated as INT_MAX,
+ // Block unrolling at a reasonable limit so that the compiler doesn't hang
+ // trying to unroll the loop. See PR77842
+ if (ULO.Count > UnrollMaxIterations) {
+ LLVM_DEBUG(dbgs() << "Won't unroll; trip count is too large\n");
+ return LoopUnrollResult::Unmodified;
+ }
+
using namespace ore;
// Report the unrolling decision.
if (CompletelyUnroll) {
diff --git a/llvm/test/Transforms/LoopUnroll/pr77842.ll b/llvm/test/Transforms/LoopUnroll/pr77842.ll
new file mode 100644
index 000000000000000..834033bbe3618eb
--- /dev/null
+++ b/llvm/test/Transforms/LoopUnroll/pr77842.ll
@@ -0,0 +1,37 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
+; RUN: opt -passes=loop-unroll -disable-output -debug-only=loop-unroll %s 2>&1 | FileCheck %s
+
+; Validate that loop unroll full doesn't try to fully unroll values whose trip counts are too large.
+
+; CHECK: Exiting block %cont23: TripCount=2147483648, TripMultiple=0, BreakoutTrip=0
+; CHECK-NEXT: Won't unroll; trip count is too large
+
+target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
+target triple = "x86_64-redhat-linux-gnu"
+
+define void @foo(i64 %end) {
+entry:
+ br label %loopheader
+
+loopheader:
+ %iv = phi i64 [ 0, %entry ], [ %iv_new, %backedge ]
+ %exit = icmp eq i64 %iv, %end
+ br i1 %exit, label %for.cond.cleanup.loopexit, label %cont23
+
+for.cond.cleanup.loopexit:
+ ret void
+
+cont23:
+ %exitcond241 = icmp eq i64 %iv, 2147483647
+ br i1 %exitcond241, label %handler.add_overflow, label %backedge
+
+handler.add_overflow:
+ unreachable
+
+backedge: ; preds = %cont23
+ %iv_new = add i64 %iv, 1
+ br label %loopheader, !llvm.loop !0
+}
+
+!0 = distinct !{!0, !1}
+!1 = !{!"llvm.loop.unroll.full"}
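For context, a source-level pattern along these lines can produce IR like the test above. This is a hypothetical sketch, not taken from the original bug report: the function name mirrors the test's @foo(i64 %end), and it assumes the file is built with a UBSAN overflow check (e.g. -fsanitize=signed-integer-overflow) so that the check on the induction variable introduces the extra exit.

// Hypothetical reproducer sketch for the PR77842 pattern (not the original report).
// With the signed-overflow check enabled, incrementing `i` gets an overflow test at
// i == INT_MAX, and the trip count over that exit is computed as 2147483648; the
// full-unroll pragma then asks the unroller for that many copies of the body.
void foo(long end) {
#pragma clang loop unroll(full)
  for (int i = 0; i != end; ++i) {
    // loop body elided
  }
}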
if (UP.Count > UnrollMaxIterations) {
  LLVM_DEBUG(dbgs() << "Won't unroll; trip count is too large\n");
  return LoopUnrollResult::Unmodified;
}
This works, but I don't think it fits well into the way the unroll count is currently computed. Rather than checking this after the fact, we should integrate this into the unroll count calculation. In particular, inside https://github.com/llvm/llvm-project/blob/88b1087035fd397996837e35d579d808d6b3f28c/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp#L784-L785 we should apply a pragma-specific iteration limit. This will also ensure that even if we decline to perform a full unroll to an unreasonable count, we can still choose to perform a partial or runtime unroll to a lower iteration count.
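A rough sketch of that approach (hypothetical code, not the committed patch; the option name is borrowed from the eventual PR title, and the surrounding logic is simplified):

// Hypothetical sketch: cap the pragma-driven full-unroll count inside
// computeUnrollCount() in LoopUnrollPass.cpp instead of rejecting the loop
// later in UnrollLoop().
static cl::opt<unsigned> PragmaUnrollFullMaxIterations(
    "pragma-unroll-full-max-iterations", cl::init(1'000'000), cl::Hidden,
    cl::desc("Maximum allowed iterations under a pragma full unroll."));

// ... later, inside computeUnrollCount(), when a full-unroll pragma is present:
if (PragmaFullUnroll && TripCount &&
    TripCount <= PragmaUnrollFullMaxIterations) {
  UP.Count = TripCount; // full unroll is still allowed under the cap
}
// Otherwise fall through, so a partial or runtime unroll to a smaller
// count can still be chosen instead of giving up entirely.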
@nikic Appreciate the thorough review on the best way to get this done. Right now, with locking the full unroll under the pragma, the repro unrolls 2048 times, which is fine but fairly slow to complete (155s on my Skylake). Thoughts on a way to cap this in some way, or is this alright?
Friendly ping @nikic
I think the general behavior is fine -- we still get the aggressive unrolling expected from the pragma. For the test case, it should be possible to reduce the amount of unrolling by setting the new limit to a lower value. Though I'm surprised that unrolling 2048 times takes 155s -- is this with a debug build of LLVM by chance?
I checked again and I think you're right on that; with a release build it's much faster.
I still think you should use … A possible alternative would be to use the …
Makes sense. Good suggestion with …
[LoopUnroll] Introduce PragmaUnrollFullMaxIterations as a hard cap on how many iterations we try to unroll
@nikic Hmm, interestingly enough it's still trying to partially unroll with the new check in place:
Because TripCount == 2147483648, MaxTripCount == 0, while partial unroll sets UP.Count == 2048. Scanning through, I think we should also catch this case and add a check for TripCount as well.
@modiking Yeah, I think that would make sense. Assuming it doesn't break something else...
@nikic Change passes unit tests; also checked that both are needed, as it fails a unit test if we only switch to checking TripCount.
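For reference, the check that this converged on (as quoted in the downstream BPF report later in this thread) is in tryToUnrollLoop():

// Do not attempt partial/runtime unrolling in FullLoopUnrolling
if (OnlyFullUnroll && (UP.Count < TripCount || UP.Count < MaxTripCount)) {
  LLVM_DEBUG(
      dbgs() << "Not attempting partial/runtime unroll in FullLoopUnroll.\n");
  return LoopUnrollResult::Unmodified;
}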
LGTM
Yeah, currently we need both as they are mutually exclusive -- we only set MaxTripCount if TripCount is not available. Does make me wonder if it would make things cleaner if we just always computed MaxTripCount. But that's unrelated to this patch...
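Roughly the structure being described, as a simplified sketch (not the actual LoopUnrollPass code; the ScalarEvolution calls are just illustrative):

// Simplified sketch of the mutual exclusivity described above: an exact
// TripCount is preferred; MaxTripCount is only computed as a fallback.
unsigned TripCount = SE.getSmallConstantTripCount(L, ExitingBlock); // exact count, 0 if unknown
unsigned MaxTripCount = 0;
if (!TripCount)
  MaxTripCount = SE.getSmallConstantMaxTripCount(L); // upper bound, 0 if unknown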
Thanks for the review!
That would be cleaner. But yeah, a different patch.
Co-authored-by: Nikita Popov <github@npopov.com>
We're seeing a failure with this patch internally (don't have a reproducer yet, just letting you know in case it's actionable in the meantime while we continue investigating). An existing loop with a static bound (256) and a #pragma unroll is no longer being fully unrolled. The contents of the loop do include a couple of gotos, a continue and a return (all under various conditions), if that's relevant/helpful.
Interesting, no certainty on the answer here, but there are 2 changes here:
1. The new PragmaUnrollFullMaxIterations cap on how many iterations we try to unroll.
2. The change to no longer attempt partial/runtime unrolling in the full-unroll pass when the pragma count doesn't cover the full trip count.
Could be worthwhile to see which one of these is causing the issue. Also possible this is an existing issue that was just exposed by this change.
It might be that this test times out if the loop isn't fully unrolled. (Removing the unroll pragma causes the program to time out - FWIW, this is something BPF related - there's some comment about the pragma being required for some verification - I don't really understand BPF enough to explain it.) If I remove this part of your change (the OnlyFullUnroll check that skips partial/runtime unrolling in the full-unroll pass) and switch it back to the old code, the program finishes in roughly the expected time (~3-4 minutes, rather than hitting our 10min test timeout).
Gotcha, that seems to be the issue then. To me this seems like it's a source-side fix to ensure full unrolling, via specifying the trip count or using an explicit unroll count.
So not sure I'm following. The condition that was added in this patch that seems to be relevant, if I'm understanding correctly, is the added UP.Count < TripCount check. But there is no count on the unroll pragma in this case - so why would that condition fire? (I'm poking around in a debugger now to better understand it.) (& I tried to find some documentation, but couldn't, to figure out what #pragma unroll without a count is supposed to mean.)
Oh, I found https://releases.llvm.org/4.0.0/tools/clang/docs/AttributeReference.html#pragma-unroll-pragma-nounroll which says "Specifying #pragma unroll without a parameter directs the loop unroller to attempt to fully unroll the loop if the trip count is known at compile time and attempt to partially unroll the loop if the trip count is not known at compile time". The loop condition is simple/known here (though the conditional goto/returns - maybe those complicate things?). Oh, the docs also say "#pragma unroll and #pragma unroll value have identical semantics to #pragma clang loop unroll(full) and #pragma clang loop unroll_count(value) respectively." - so that doesn't sound like …
Oh interesting, I wasn't quite certain myself on what the exact semantics are. Thanks for digging it up.
The way the code is structured is that the unroll trip count is either stored in TripCount or in MaxTripCount. Actually, I wonder if that's what's causing the change where previously it would try to do a partial unroll but that no longer happens. What's the output with -debug-only=loop-unroll?
I haven't done proper A/B debugging before and after this patch yet - but you might be right that the code was previously falling under partial unrolling - but I don't know why it would count as partial unrolling? The code here (llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp, lines 791 to 805 at f872706):
If I'm reading this right, it doesn't look like it treats PragmaFullUnroll and PragmaEnableUnroll equally - or perhaps clang is mis-lowering the pragma without a bound & it should be lowering it to pragma full unroll? The LLVM docs (https://llvm.org/docs/TransformMetadata.html#loop-unrolling) don't seem to be super clear about what the different IR metadata mean?
You're looking at some very old docs there. The current ones are https://clang.llvm.org/docs/AttributeReference.html#pragma-unroll-pragma-nounroll and they have the s/full/enable typo fixed. The unrolling heuristics are very complex and it's hard to make any definitive statements without a reproducer. Based on what you say, I suspect that this patch has moved a partial unroll from happening in the full unroll pass to the runtime unroll pass and that exposed a second-order compile-time issue in your case. Or it might be something else entirely...
Still - going to that documentation and then down into the linked LanguageExtensions documentation, it mentions this: "If unroll(full) is specified the unroller will attempt to fully unroll the loop if the trip count is known at compile time identically to unroll(enable). However, with unroll(full) the loop will not be unrolled if the loop count is not known at compile time." So that sounds like "unroll(full)" should unroll in fewer cases than "unroll(enable)"? But in this case, with this patch applied, it seems like the loop gets unrolled only with "unroll(full)" and not with "unroll(enable)".
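Putting the quoted documentation together, the mapping under discussion is roughly as follows (hypothetical example code; the mapping follows the corrected clang docs cited above, and the 256 bound mirrors the internal loop described earlier):

// #pragma unroll (no parameter) == #pragma clang loop unroll(enable):
//   fully unroll if the trip count is known at compile time,
//   otherwise attempt a partial unroll.
// #pragma unroll N == #pragma clang loop unroll_count(N).
// #pragma clang loop unroll(full): only unrolls when the trip count is known.
void zero(int *a) {
#pragma unroll
  for (int i = 0; i < 256; ++i)
    a[i] = 0;
}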
Yep, we're working on reproducers & trying to talk through what we have in the meantime, in case it's helpful.
TLDR
----
The verif_scale_pyperf600 test fails when compiled using recent clang. Basic block rearrangement causes a "jumps is too complex" error. Investigation shows that nothing is wrong with clang or the verifier, and the simplest way to fix the regression is to increase the max allowed jump history.

What happened
-------------
The verif_scale_pyperf600 test fails to verify when compiled by recent clang revisions. The last known good revision is [0], the first known bad revision is [1]. Revision [1] comes from the pull request [2].

Verifier error when using revision [1]:

    ...
    ; if (frame->co_name) @ pyperf.h:118
    25460: (79) r3 = *(u64 *)(r10 -32)      ; R3_w=scalar() R10=fp0 fp-32=mmmmmmmm
    25461: (15) if r3 == 0x0 goto pc+7
    The sequence of 8193 jumps is too complex.
    verification time 822174 usec
    stack depth 360

[0] Last good revision:
    c3291253c3b5 ("Revert "[scudo] [MTE] resize stack depot for allocation ring buffer" (#80777)")
[1] First broken revision:
    99ddd77ed9e1 ("[LoopUnroll] Introduce PragmaUnrollFullMaxIterations as a hard cap on how many iterations we try to unroll (#78648)")
[2] Pull request for first broken revision:
    llvm/llvm-project#78648

LLVM change description
-----------------------
The relevant part of [1] is:

    --- a/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
    +++ b/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
    @@ -1282,7 +1295,7 @@ tryToUnrollLoop(Loop *L, DominatorTree &DT, LoopInfo *LI, ScalarEvolution &SE,
       }

       // Do not attempt partial/runtime unrolling in FullLoopUnrolling
    -  if (OnlyFullUnroll && !(UP.Count >= MaxTripCount)) {
    +  if (OnlyFullUnroll && (UP.Count < TripCount || UP.Count < MaxTripCount)) {
         LLVM_DEBUG(
             dbgs() << "Not attempting partial/runtime unroll in FullLoopUnroll.\n");
         return LoopUnrollResult::Unmodified;

- `UP.Count` is the preferred number of iterations to be unrolled, it is 150 for pyperf600;
- `TripCount` is the predicted number of loop iterations, it is 600 for pyperf600.

The hunk above does exactly what the comment says: it prevents partial unrolling of the main pyperf600 loop in the full unrolling pass. There is also a partial unrolling pass done later in the pipeline.

pyperf600 structure
-------------------
The relevant parts of the test look as follows:

    static __always_inline bool get_frame_data(...)
    {
        ...
        if (!frame->f_code)
            return false;
        ...
        if (frame->co_filename) { ... }
        if (frame->co_name) { ... }
        return true;
    }

    int __on_event(...)
    {
        ...
    #pragma clang loop unroll(UNROLL_COUNT)        // UNROLL_COUNT == 150
        for (int i = 0; i < STACK_MAX_LEN; ++i)    // STACK_MAX_LEN == 600
            if (frame_ptr && get_frame_data(...)) {
                if (!symbol_id) { ... }
                if (*symbol_id == new_symbol_id) { ... }
                ...
            }
        ...
    }

    SEC("raw_tracepoint/kfree_skb")
    int on_event(struct bpf_raw_tracepoint_args* ctx)
    {
        ...
        __on_event(...);
        __on_event(...);
        __on_event(...);
        __on_event(...);
        __on_event(...);
        ...
    }

The call to get_frame_data() is inlined. The main takeaways are:
- the BPF program consists of five calls to __on_event();
- __on_event() has a big loop inside;
- the loop body has 5 conditionals (when counted with the conditionals in get_frame_data()).

LLVM change impact on pyperf600
-------------------------------
Prior to [1] the loop in pyperf600 was unrolled by the full unrolling pass, after [1] it is unrolled by the partial unrolling pass. Such a change causes a subtle rearrangement of basic blocks inside the loop which turns out to be important for the verifier. The rearrangement occurs inside the inlined body of get_frame_data():

    static __always_inline bool get_frame_data(...)
    {
        ...
        if (!frame->f_code)
            return false;
        ...
    }

    Translation before [1]:                 Translation after [1]:

    ; if (!frame->f_code)                   ; if (!frame->f_code)
    r3 = *(u64 *)(r10 - 0x30)               r3 = *(u64 *)(r10 - 0x30)
    if r3 != 0x0 goto +0x2 <LBB0_19>        if r3 == 0x0 goto +0x4b <LBB0_39>

Before [1] the fall-through path is for `return false`, after [1] the fall-through path is to the rest of the get_frame_data() body. The `if (!frame->f_code)` is the first conditional in the loop body and it guards all other conditionals in the body (when !frame->f_code == 1 the rest of the conditionals is skipped).

LLVM change impact on verifier
------------------------------
(Below is my speculative understanding, it is valid only to a certain degree.)

Before [1] the verifier would process pyperf600 in the following order:
- __on_event(): process the loop 600 times:
  - `if (!frame->f_code) return false`:
    - fall-through is to `return false`;
    - push one jump to the jump history;
    - assume the fall-through branch and skip the rest of the loop body;
- __on_event(): same thing, push 600 jumps to the jump history;
- __on_event(): same thing, push 600 jumps to the jump history;
- __on_event(): same thing, push 600 jumps to the jump history;
- __on_event(): this is the last call to __on_event(), all branches within it are verified before proceeding with branches pushed for previous calls.

When the loop inside the last call to __on_event() is verified, a checkpoint at its start becomes viable. Branches pushed when previous calls to __on_event() were processed would eventually hit this checkpoint and the whole process would converge eventually. Thus, at its peak the jump history length would be ~600*5 == 3000.

However, after [1] the fall-through path for the `if (!frame->f_code)` leads to the other conditionals in the loop body, pushing up to 5 conditionals to the jump history for each iteration. Hence, the peak jump history length would be something like ~600*5*5 == 15000, which is outside of the current limits for the verifier.

This analysis does not 100% reflect reality; the test passes only when the jump history limit is increased up to 24K entries.

Mitigation strategies
---------------------
The following strategies were considered:

a. Change the pyperf600 basic block layout: using some inline assembly it is possible to get the old basic block layout, however this seems very fragile.
b. Non-DFS verifier logic for choosing the next branch: when inside a loop, don't explore the fall-through branch first; instead predict which branch would push fewer conditionals onto the jump history and explore that first. A variation of this was explored as [3], picking branches closer to the 'exit' instruction, using distance in basic blocks in a reverse control flow graph as a metric. This variation fixes pyperf600 but causes a few other tests to fail because of limits issues. Seems too complex to proceed.
c. Increase the jump history limit. This change is simple and preserves the purpose of the jump history limit:
   - caps memory usage;
   - still exits early for too-complex BPF programs that would otherwise hit the 1M instructions limit some time later.

[3] https://github.com/eddyz87/bpf/tree/branch-visit-order-wip

Performance impact
------------------

    | configuration                       | instructions | time               |
    |-------------------------------------+--------------+--------------------|
    | LLVM18 + bpf-next                   |  31457711580 | 3.87599  (+-0.15%) |
    | LLVM/git + bpf-next/24k             |  18836657690 | 2.19285  (+-0.23%) |
    | LLVM/git + bpf-next                 |    687041950 | 0.117966 (+-0.32%) |
    | LLVM/git + bpf-next/24k/bigger test |   1824485084 | 0.281100 (+-0.23%) |

Time measured using the following command:
- perf stat -r 10 -e instructions:k ./veristat -q cpuv4/pyperf600.bpf.o

Notes:
- "LLVM/git" means llvm compiled from the current main branch [4];
- "bpf-next/24k" means bpf-next with this patch on top;
- "bpf-next/24k/bigger test" means bpf-next with this patch on top and the following addition to the pyperf600 test:

    --- a/tools/testing/selftests/bpf/progs/pyperf.h
    +++ b/tools/testing/selftests/bpf/progs/pyperf.h
    @@ -352,6 +352,8 @@ int on_event(struct bpf_raw_tracepoint_args* ctx)
         ret |= __on_event(ctx);
         ret |= __on_event(ctx);
         ret |= __on_event(ctx);
    +    ret |= __on_event(ctx);
    +    ret |= __on_event(ctx);

         return ret;
     }

[4] LLVM revision used for performance impact testing:
    55eb93b2688d ("[RISCV] Remove RISCVISD::FP_EXTEND_BF16. (#106939)")

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Fixes PR77842 where UBSAN causes pragma full unroll to try and unroll INT_MAX times. This sets a cap to make sure we don't attempt this and crash the compiler.
Testing:
ninja check-all with new test