Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pkg/bisect: improve success rate #1051

Open
dvyukov opened this issue Mar 12, 2019 · 6 comments

Comments

Projects
None yet
3 participants
@dvyukov
Copy link
Collaborator

commented Mar 12, 2019

Umbrella bug for collection of suboptimal bisection results and hopefully then figuring out ways how to improve it.

  1. https://syzkaller.appspot.com/bug?extid=fdce8f2a8903f3ba0e6b
    https://syzkaller.appspot.com/x/bisect.txt?x=16e68713200000
    Bisected to wrong commit.
    Root cause: unrelated bug that happen on all tests.
    Bisection switches to an unrelated bug "INFO: trying to register non-static key in can_notifier" which seem to happen on all tests. This then leads to a completely random commit.

  2. https://syzkaller.appspot.com/bug?extid=b92e4f1472a54e1c7dec
    https://syzkaller.appspot.com/x/bisect.txt?x=1681996f200000
    Same as 1.

  3. https://syzkaller.appspot.com/bug?extid=fa11f9da42b46cea3b4a
    https://syzkaller.appspot.com/x/bisect.txt?x=155bf5db200000
    https://groups.google.com/d/msg/syzkaller-bugs/c6i429YYtqk/eSMAl3hDCQAJ
    Bisected correctly, but the bug was fixed several months ago, syzbot wasn't aware.

  4. https://syzkaller.appspot.com/bug?extid=1505c80c74256c6118a5
    https://syzkaller.appspot.com/x/bisect.txt?x=13220283200000
    A mix of problems: unrelated bug triggered by the same repro ("WARNING: ODEBUG bug in netdev_freemem"); lots of infrastructure failures ("failed to copy test binary to VM"; also the original failure seems to be flaky. All this contributed to pointing to a random commit.
    Al Viro points out that the commit only touches comments, so we could mark the end result as suspicious.
    Tetsuo points out that if lots (say, 7/8) tests failed with infra problems, then we should retry/skip or something. This zeroes the effect of having multiple independent tests.

  5. https://syzkaller.appspot.com/bug?extid=65788f9af9d54844389e
    https://syzkaller.appspot.com/x/bisect.txt?x=121f55db200000
    Bisected to wrong commit.
    Root cause: it seems that everything went well except for bug being too hard to trigger, so somewhere along the way bisection slightly diverged from the right course.

We could somehow estimate how hard it is to reproduce a bug and increase number of instances and/or testing time for hard to reproduce bugs. Also see #885.

  1. https://groups.google.com/d/msg/syzkaller-bugs/38HP_pUXJ3s/cxDiQRvSDAAJ
    2 potential improvements:
  • (simpler) noting in the bisection log things like disabled configs, cherry-picked fixes, etc.
  • (harder) try to figure out that the bug actually depends on the disabled config
  1. https://syzkaller.appspot.com/bug?extid=5cd33f0e6abe2bb3e397
    https://syzkaller.appspot.com/x/bisect.txt?x=15330437200000
    and:
    https://syzkaller.appspot.com/bug?extid=c4c4b2bb358bb936ad7e
    https://syzkaller.appspot.com/x/bisect.txt?x=10c0298b200000
    Crash is reproduced with very low probability (1/10) so bisection diverges.

  2. https://groups.google.com/d/msg/syzkaller-bugs/zgIMBABdi-I/AnNztytDBQAJ
    Thomas suggest to revert the guilty commit on top of the bisection start commit (or original crash commit?), and see if a crash still happens. If kernel still crashes, then the bisection result may be marked as "suspicious". But it's unclear what to do in this case. Also this check will be flaky too, so in some sense we are just adding more flakiness on top of flakiness...

  3. https://syzkaller.appspot.com/bug?extid=6f39a9deb697359fe520
    https://syzkaller.appspot.com/x/bisect.txt?x=17f1bacd200000

testing commit 669de8bda87b92ab9a2fc663b3f5743c2ad1ae9f with gcc (GCC) 8.1.0
run #0: crashed: WARNING: locking bug in flush_workqueue
run #1: basic kernel testing failed: timed out
run #2: basic kernel testing failed: timed out
run #3: basic kernel testing failed: timed out
run #4: basic kernel testing failed: timed out
run #5: basic kernel testing failed: timed out
run #6: basic kernel testing failed: timed out
run #7: basic kernel testing failed: timed out
run #8: basic kernel testing failed: timed out
run #9: basic kernel testing failed: timed out
# git bisect bad 669de8bda87b92ab9a2fc663b3f5743c2ad1ae9f

If kernel hanged in these cases, we need to provide better explanation. If this is an infra failure, we need to understand why this happens and try to improve.

Eric also proposed an idea of confidence levels: don't mail results if we have high confidence that an unrelated bug has diverged the process; if we have high confidence that flakiness has diverged the process (or low confidence that it had not diverged the process?); if we bisected to a comment-only change or a merge or a change to another arch.

@dvyukov

This comment has been minimized.

Copy link
Collaborator Author

commented Mar 27, 2019

@josharian

This comment has been minimized.

Copy link

commented Mar 31, 2019

It sounds to me like a new general purpose tool is called for. Here's a sketch.

The tool is for running git bisect on hard-to-reproduce bugs. It requires fully scripted reproduction steps. It assumes that false negatives are possible (bug did not reproduce even though present) but that false positives are not. The script is given a git commit. (The git commit is provided to the script, rather than giving the script a working tree, so that the script can cache intermediate artifacts.) The script attempts to reproduce the bug at that commit, and generates one of a few outcomes: (1) bug did not reproduce, (2) bug did reproduce, (3) some other permanent failure occurred, such as a syntax error in the code at that commit, (4) some other possibly transient failure occurred, such as a timeout or some other flaky failure.

The runner script runs indefinitely, doing a kind of ratcheting bisection. The result of every run is accumulated into an overall set of stats.

Start by running a bunch of iterations of the known bad commit to estimate the failure rate. Do a regular bisection using enough runs that failures are likely (if not guaranteed). That gives you a decent starting place to investigate as a human while the tool continues to narrow the field.

Next, keep track of the first commit in which the bug ever reproduced. Select commits to test using an exponential distribution, such that you're more likely to select commits near your first known bad commit. Over time, you will either (a) move the first known bad commit closer or (b) gather a lot of evidence that that commit really was the first known bad commit.

Make a nice visualization of the number of runs and the outcome distribution of those runs available to the user for inspection and human intelligence. As a bonus, let the user nudge the tool while it runs by providing human annotations about known good/bad commits.

None of this is perfect. For example, the reproduction rate might vary over time due to other changes, and the failure rate might be so low and reproduction time be so high that this brute force approach takes too long to stabilize.

Nevertheless, this should provide a systematic (and generically useful) way to approach the problem, and shift effort from humans to CPUs.

@dvyukov

This comment has been minimized.

Copy link
Collaborator Author

commented Apr 1, 2019

Few notes to make things more interesting ;)

  1. False positives are there and by far the most common source of problems.
  2. The more you test, the higher chances of false positives.
  3. Current bisection with 1 try per commit and optimal logN time can take days. If this takes significantly more than that, the results may not be useful.
  4. Users won't be happy about lots of emails with partial/incorrect results.
  5. Users may not be willing to interact with the system (esp in a way it understands).
@josharian

This comment has been minimized.

Copy link

commented Apr 4, 2019

Wow. Yeah, that's going to be hard. :)

(I still think the tool I describe might actually be useful in an easier problem domain like regular Go bugs. I currently use stress and manually manage the bisection. cc @mvdan who likes thinking about tools)

@dvyukov

This comment has been minimized.

Copy link
Collaborator Author

commented Apr 5, 2019

I am thinking about increasing number of independent tests per commit. It should help with false negatives and is like 1 line change. A more complex version would be to estimate how flaky the crash is (which we want to do for other reasons too), and then choose number of tests based on that.

I would expect that Go bugs should not have almost any of these problems. There are no tens of thousands of bugs at any point in time, no prolonged broken builds/boots (e.g. runtime bugs), most crashes manifest in the same way. Also doing more independent tests is much simpler (no need to boot unreliable VMs for each test). So I would expect that doing 50 tests per commit and normal scripted bisection should be good enough. There is long tail of corner cases that only humans can deal with anyway.

@mvdan

This comment has been minimized.

Copy link

commented Apr 14, 2019

@josharian thanks for the ping - highly unlikely I'll have time to think about yet another tool this summer, though :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.