pkg/bisect: improve success rate #1051

dvyukov · 2019-03-12T15:57:57Z

Umbrella bug for collecting suboptimal bisection results and, hopefully, figuring out how to improve them.

https://syzkaller.appspot.com/bug?extid=fdce8f2a8903f3ba0e6b
https://syzkaller.appspot.com/x/bisect.txt?x=16e68713200000
Bisected to wrong commit.
Root cause: an unrelated bug that happens on all tests.
Bisection switches to an unrelated bug, "INFO: trying to register non-static key in can_notifier", which seems to happen on all tests. This then leads to a completely random commit.
https://syzkaller.appspot.com/bug?extid=b92e4f1472a54e1c7dec
https://syzkaller.appspot.com/x/bisect.txt?x=1681996f200000
Same as the first case above.
https://syzkaller.appspot.com/bug?extid=fa11f9da42b46cea3b4a
https://syzkaller.appspot.com/x/bisect.txt?x=155bf5db200000
https://groups.google.com/d/msg/syzkaller-bugs/c6i429YYtqk/eSMAl3hDCQAJ
Bisected correctly, but the bug was fixed several months ago and syzbot wasn't aware of the fix.
https://syzkaller.appspot.com/bug?extid=1505c80c74256c6118a5
https://syzkaller.appspot.com/x/bisect.txt?x=13220283200000
A mix of problems: an unrelated bug triggered by the same repro ("WARNING: ODEBUG bug in netdev_freemem"); lots of infrastructure failures ("failed to copy test binary to VM"); also, the original failure seems to be flaky. All this contributed to pointing to a random commit.
Al Viro points out that the commit only touches comments, so we could mark the end result as suspicious.
Tetsuo points out that if lots (say, 7/8) of the tests failed with infra problems, then we should retry/skip or something. Such failures zero the effect of having multiple independent tests.
https://syzkaller.appspot.com/bug?extid=65788f9af9d54844389e
https://syzkaller.appspot.com/x/bisect.txt?x=121f55db200000
Bisected to wrong commit.
Root cause: it seems that everything went well except for the bug being too hard to trigger, so somewhere along the way bisection slightly diverged from the right course.

We could somehow estimate how hard it is to reproduce a bug and increase the number of instances and/or the testing time for hard-to-reproduce bugs. Also see #885.
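A minimal sketch of that estimate, not syzkaller's actual logic. It assumes runs are independent and that the per-run reproduction probability has already been measured on a known-bad commit. If a run reproduces the crash with probability p, then n runs all miss it with probability (1 - p)^n, so n must be at least log(eps) / log(1 - p) to keep the miss probability below eps. The name `runs_needed` and the 5% default are hypothetical.

```python
import math


def runs_needed(repro_prob: float, max_miss_prob: float = 0.05) -> int:
    """Smallest n such that (1 - repro_prob) ** n <= max_miss_prob.

    Assumes independent runs and a repro_prob estimated beforehand,
    for example by repeatedly testing the known-bad commit.
    """
    if not 0.0 < repro_prob <= 1.0:
        raise ValueError("repro_prob must be in (0, 1]")
    if not 0.0 < max_miss_prob < 1.0:
        raise ValueError("max_miss_prob must be in (0, 1)")
    if repro_prob == 1.0:
        return 1  # a deterministic repro needs a single run
    n = math.ceil(math.log(max_miss_prob) / math.log(1.0 - repro_prob))
    return max(1, n)
```

For a crash that reproduces in 1 out of 10 runs, `runs_needed(0.1)` gives 29 runs per commit for a 95% chance of seeing it at least once.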

https://groups.google.com/d/msg/syzkaller-bugs/38HP_pUXJ3s/cxDiQRvSDAAJ
2 potential improvements:

(simpler) noting in the bisection log things like disabled configs, cherry-picked fixes, etc.
(harder) try to figure out that the bug actually depends on the disabled config

https://syzkaller.appspot.com/bug?extid=5cd33f0e6abe2bb3e397
https://syzkaller.appspot.com/x/bisect.txt?x=15330437200000
and:
https://syzkaller.appspot.com/bug?extid=c4c4b2bb358bb936ad7e
https://syzkaller.appspot.com/x/bisect.txt?x=10c0298b200000
The crash is reproduced with very low probability (1/10), so bisection diverges.
https://groups.google.com/d/msg/syzkaller-bugs/zgIMBABdi-I/AnNztytDBQAJ
Thomas suggests reverting the guilty commit on top of the bisection start commit (or the original crash commit?) and checking whether a crash still happens. If the kernel still crashes, the bisection result may be marked as "suspicious", but it's unclear what to do in that case. This check will also be flaky, so in some sense we are just adding more flakiness on top of flakiness...
https://syzkaller.appspot.com/bug?extid=6f39a9deb697359fe520
https://syzkaller.appspot.com/x/bisect.txt?x=17f1bacd200000

testing commit 669de8bda87b92ab9a2fc663b3f5743c2ad1ae9f with gcc (GCC) 8.1.0
run #0: crashed: WARNING: locking bug in flush_workqueue
run #1: basic kernel testing failed: timed out
run #2: basic kernel testing failed: timed out
run #3: basic kernel testing failed: timed out
run #4: basic kernel testing failed: timed out
run #5: basic kernel testing failed: timed out
run #6: basic kernel testing failed: timed out
run #7: basic kernel testing failed: timed out
run #8: basic kernel testing failed: timed out
run #9: basic kernel testing failed: timed out
# git bisect bad 669de8bda87b92ab9a2fc663b3f5743c2ad1ae9f

If the kernel hung in these cases, we need to provide a better explanation. If this is an infra failure, we need to understand why it happens and try to improve.

Eric also proposed an idea of confidence levels: don't mail results if (a) we have high confidence that an unrelated bug has diverted the process; (b) we have high confidence that flakiness has diverted the process (or low confidence that it has not?); or (c) we bisected to a comment-only change, a merge, or a change to another arch.
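A rough sketch of how the infra-failure and confidence ideas could combine into a per-commit verdict. This is an illustration under stated assumptions, not what pkg/bisect does: the `Run` and `Verdict` categories, the `judge` function, and the 50% infra threshold are all hypothetical. It also assumes that runs showing only an unrelated crash carry no usable signal, because that crash may mask the one being bisected.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class Run(Enum):
    OK = "ok"        # test ran to completion without a crash
    CRASH = "crash"  # the crash being bisected reproduced
    OTHER = "other"  # a different crash title was reported
    INFRA = "infra"  # e.g. "failed to copy test binary to VM"


class Verdict(Enum):
    GOOD = "good"
    BAD = "bad"
    SKIP = "skip"    # no usable signal; retry or skip this commit


@dataclass
class CommitResult:
    verdict: Verdict
    low_confidence: bool


def judge(runs: Sequence[Run], max_infra_frac: float = 0.5) -> CommitResult:
    """Combine independent runs on one commit into a verdict."""
    usable = [r for r in runs if r is not Run.INFRA]
    infra_frac = 1.0 - len(usable) / len(runs) if runs else 1.0
    mostly_infra = infra_frac > max_infra_frac
    if Run.CRASH in usable:
        # A matching crash is positive evidence, but it rests on few runs
        # when most of them were lost to infrastructure failures.
        return CommitResult(Verdict.BAD, low_confidence=mostly_infra)
    if mostly_infra or not usable:
        return CommitResult(Verdict.SKIP, low_confidence=True)
    if Run.OTHER in usable:
        # Assumption: an unrelated crash may hide the target bug.
        return CommitResult(Verdict.SKIP, low_confidence=True)
    return CommitResult(Verdict.GOOD, low_confidence=False)
```

If the nine timeouts in the log above are counted as infra failures, `judge([Run.CRASH] + [Run.INFRA] * 9)` still returns `BAD`, but with `low_confidence` set, which a reporting step could use to hold back the email.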


dvyukov · 2019-03-27T17:21:12Z

FTR, the analysis of 118 bisection results:
https://groups.google.com/d/msg/syzkaller/sR8aAXaWEF4/tTWYRgvmAwAJ
https://docs.google.com/spreadsheets/d/1WdBAN54-csaZpD3LgmTcIMR7NDFuQoOZZqPZ-CUqQgA
This code was used to produce the spreadsheet:
dvyukov@8b1f44b

josharian · 2019-03-31T21:48:13Z

It sounds to me like a new general purpose tool is called for. Here's a sketch.

The tool is for running git bisect on hard-to-reproduce bugs. It requires fully scripted reproduction steps. It assumes that false negatives are possible (bug did not reproduce even though present) but that false positives are not. The script is given a git commit. (The git commit is provided to the script, rather than giving the script a working tree, so that the script can cache intermediate artifacts.) The script attempts to reproduce the bug at that commit, and generates one of a few outcomes: (1) bug did not reproduce, (2) bug did reproduce, (3) some other permanent failure occurred, such as a syntax error in the code at that commit, (4) some other possibly transient failure occurred, such as a timeout or some other flaky failure.

The runner script runs indefinitely, doing a kind of ratcheting bisection. The result of every run is accumulated into an overall set of stats.

Start by running a bunch of iterations of the known bad commit to estimate the failure rate. Do a regular bisection using enough runs that failures are likely (if not guaranteed). That gives you a decent starting place to investigate as a human while the tool continues to narrow the field.

Next, keep track of the first commit in which the bug ever reproduced. Select commits to test using an exponential distribution, such that you're more likely to select commits near your first known bad commit. Over time, you will either (a) move the first known bad commit closer or (b) gather a lot of evidence that that commit really was the first known bad commit.
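The selection step could look roughly like the sketch below. It is one possible reading of the proposal, not an existing tool: commits are assumed to be indexed from oldest (0) to newest, and `pick_commit`, `ratchet`, and the `scale` parameter are hypothetical names. Like the proposal, it assumes a reproduction is never a false positive.

```python
import math
import random
from typing import Optional


def pick_commit(first_bad: int, scale: float = 8.0,
                rng: Optional[random.Random] = None) -> int:
    """Pick the index of a commit older than first_bad to test next.

    A candidate at distance d = first_bad - i gets weight exp(-d / scale),
    so commits just before the first known bad commit are picked most often.
    """
    if first_bad <= 0:
        raise ValueError("no commits older than first_bad")
    rng = rng or random.Random()
    candidates = list(range(first_bad))
    weights = [math.exp(-(first_bad - i) / scale) for i in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


def ratchet(first_bad: int, tested: int, reproduced: bool) -> int:
    """Move the first known bad commit earlier when an older one reproduces."""
    if reproduced and tested < first_bad:
        return tested
    return first_bad
```

A smaller `scale` concentrates testing right next to the current boundary; a larger one spends more runs on older commits, which helps when the boundary is still far from the true first bad commit.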

Make a nice visualization of the number of runs and the outcome distribution of those runs available to the user for inspection and human intelligence. As a bonus, let the user nudge the tool while it runs by providing human annotations about known good/bad commits.

None of this is perfect. For example, the reproduction rate might vary over time due to other changes, and the failure rate might be so low and reproduction time be so high that this brute force approach takes too long to stabilize.

Nevertheless, this should provide a systematic (and generically useful) way to approach the problem, and shift effort from humans to CPUs.

dvyukov · 2019-04-01T14:58:30Z

A few notes to make things more interesting ;)

False positives do occur, and they are by far the most common source of problems.
The more you test, the higher the chances of false positives.
The current bisection, with 1 try per commit and optimal logN time, can take days. If this takes significantly more than that, the results may not be useful.
Users won't be happy about lots of emails with partial/incorrect results.
Users may not be willing to interact with the system (especially in a way it understands).
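The tension between the first two notes and repeated testing can be made concrete with a small calculation. This is an illustrative model only: it assumes independent runs, that any crash marks a commit as bad, and a hypothetical per-run probability `unrelated_prob` of hitting some unrelated crash.

```python
from typing import Tuple


def misjudge_probs(runs: int, repro_prob: float,
                   unrelated_prob: float) -> Tuple[float, float]:
    """Return (P[bad commit looks good], P[good commit looks bad]).

    A bad commit looks good when every run misses the crash; a good commit
    looks bad when at least one run hits an unrelated crash.
    """
    miss = (1.0 - repro_prob) ** runs
    false_alarm = 1.0 - (1.0 - unrelated_prob) ** runs
    return miss, false_alarm
```

Raising `runs` drives the first number down and the second one up, so more testing only helps if unrelated crashes are filtered out by some other means.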

josharian · 2019-04-04T20:45:27Z

Wow. Yeah, that's going to be hard. :)

(I still think the tool I describe might actually be useful in an easier problem domain like regular Go bugs. I currently use stress and manually manage the bisection. cc @mvdan who likes thinking about tools)

dvyukov · 2019-04-05T07:36:09Z

I am thinking about increasing the number of independent tests per commit. It should help with false negatives and is about a 1-line change. A more complex version would be to estimate how flaky the crash is (which we want to do for other reasons too), and then choose the number of tests based on that.

I would expect Go bugs to have almost none of these problems. There are no tens of thousands of bugs at any point in time, no prolonged broken builds/boots (e.g. runtime bugs), and most crashes manifest in the same way. Also, doing more independent tests is much simpler (no need to boot unreliable VMs for each test). So I would expect that doing 50 tests per commit and normal scripted bisection should be good enough. There is a long tail of corner cases that only humans can deal with anyway.
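A sketch of what "N tests per commit plus normal scripted bisection" could look like. The function name `bisect_first_bad` and the `reproduces` callback are hypothetical. The sketch assumes commits are ordered oldest to newest, that the first one is good and the last one is bad, and that a reproduction is never a false positive (which, per the notes above, does not hold for kernel bisection).

```python
from typing import Callable, Sequence


def bisect_first_bad(commits: Sequence[str],
                     reproduces: Callable[[str], bool],
                     tries: int = 50) -> str:
    """Binary-search for the first bad commit, testing each one `tries` times.

    A commit counts as bad as soon as any single run reproduces the bug.
    """
    if len(commits) < 2:
        raise ValueError("need at least one good and one bad commit")
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        # any() stops at the first reproduction, so bad commits are cheap.
        if any(reproduces(commits[mid]) for _ in range(tries)):
            hi = mid
        else:
            lo = mid
    return commits[hi]
```

The cost is asymmetric: bad commits are usually confirmed after a few runs, while every good commit costs the full `tries` runs.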

mvdan · 2019-04-14T00:52:00Z

@josharian thanks for the ping - highly unlikely I'll have time to think about yet another tool this summer, though :)

dvyukov · 2019-06-17T07:21:28Z

FTR, a case where the reproduction rate seems to drop dramatically at some point:
https://syzkaller.appspot.com/bug?extid=f8bb48225fbdb35f81f5
https://syzkaller.appspot.com/bug?extid=14cb52733e7d8c5cc675
The fix says the bug was introduced 9 years ago:
https://patchwork.ozlabs.org/patch/1116484/
Both bisections are incorrect:
https://syzkaller.appspot.com/x/bisect.txt?x=130b1b66a00000
https://syzkaller.appspot.com/x/bisect.txt?x=16b35349a00000
In both cases the crash was reproduced with 90-100% probability for v5.1..v4.7, but then the rate seems to drop to a few percent at most.

dvyukov · 2019-07-29T11:14:18Z

FTR, mini-analysis of memory leak bisections:
https://groups.google.com/forum/#!topic/syzkaller/sR8aAXaWEF4
Short version: too many unrelated leaks and other bugs diverge bisection.

dvyukov mentioned this issue Jun 24, 2019

pkg/bisect: "failed to copy test binary to VM" errors #1250

Open

dvyukov mentioned this issue Jul 5, 2019

pkg/bisect: detect incorrect bisections #1271

Closed

dvyukov mentioned this issue Nov 6, 2022

pkg/bisect: bisect release tags #3376

Closed
