pkg/bisect: improve success rate #1051
FTR, the analysis of 118 bisection results:
It sounds to me like a new general-purpose tool is called for. Here's a sketch.

The tool is for running git bisect on hard-to-reproduce bugs. It requires fully scripted reproduction steps. It assumes that false negatives are possible (the bug did not reproduce even though present) but that false positives are not. The script is given a git commit. (The git commit is provided to the script, rather than giving the script a working tree, so that the script can cache intermediate artifacts.) The script attempts to reproduce the bug at that commit and reports one of a few outcomes: (1) the bug did not reproduce, (2) the bug did reproduce, (3) some other permanent failure occurred, such as a syntax error in the code at that commit, (4) some other possibly transient failure occurred, such as a timeout or some other flaky failure.

The runner runs indefinitely, doing a kind of ratcheting bisection. The result of every run is accumulated into an overall set of stats. Start by running a bunch of iterations of the known bad commit to estimate the failure rate. Do a regular bisection using enough runs that failures are likely (if not guaranteed). That gives you a decent starting place to investigate as a human while the tool continues to narrow the field. Next, keep track of the first commit in which the bug ever reproduced. Select commits to test using an exponential distribution, such that you're more likely to select commits near your first known bad commit. Over time, you will either (a) move the first known bad commit closer or (b) gather a lot of evidence that that commit really was the first known bad commit.

Make a nice visualization of the number of runs and the outcome distribution of those runs available to the user for inspection and human intelligence. As a bonus, let the user nudge the tool while it runs by providing human annotations about known good/bad commits.

None of this is perfect. For example, the reproduction rate might vary over time due to other changes, and the failure rate might be so low and the reproduction time so high that this brute-force approach takes too long to stabilize. Nevertheless, this should provide a systematic (and generically useful) way to approach the problem, and shift effort from humans to CPUs.
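A minimal Go sketch of the runner loop described above, to make the idea concrete. Everything here is hypothetical: the `./repro.sh` script, its exit-code convention (0 = no repro, 1 = repro, 2 = permanent failure, 3 = transient failure), and the placeholder commit list are assumptions for illustration, not an existing tool or interface.

```go
// ratchet.go: illustrative sketch of a "ratcheting bisection" runner.
package main

import (
	"fmt"
	"math/rand"
	"os/exec"
)

// stats accumulates per-commit outcomes across runs.
type stats struct{ runs, repro, noRepro, perm, transient int }

// runOnce invokes the (hypothetical) reproduction script for a commit and maps
// its exit code to an outcome: 0 = no repro, 1 = repro, 2 = permanent failure,
// 3 = transient failure.
func runOnce(commit string) int {
	cmd := exec.Command("./repro.sh", commit)
	if err := cmd.Run(); err != nil {
		if ee, ok := err.(*exec.ExitError); ok {
			return ee.ExitCode()
		}
		return 3 // could not even start the script: treat as transient
	}
	return 0
}

func main() {
	// commits[0] is the known-good end, the last element is the known-bad end.
	commits := []string{"c0", "c1", "c2", "c3", "c4", "c5"} // placeholder hashes
	results := map[string]*stats{}
	firstBad := len(commits) - 1 // earliest commit where the bug ever reproduced

	for i := 1; ; i++ {
		// Bias selection towards the current first-known-bad commit using an
		// exponential distribution over the distance from it.
		idx := firstBad - int(rand.ExpFloat64()*2)
		if idx < 1 {
			idx = 1 // never test the known-good end
		}
		commit := commits[idx]
		s := results[commit]
		if s == nil {
			s = &stats{}
			results[commit] = s
		}
		s.runs++
		switch runOnce(commit) {
		case 1:
			s.repro++
			if idx < firstBad {
				firstBad = idx // ratchet: the bug reproduced earlier than before
			}
		case 2:
			s.perm++
		case 3:
			s.transient++
		default:
			s.noRepro++
		}
		if i%10 == 0 {
			// Stand-in for the "nice visualization" of runs and outcomes.
			fmt.Printf("first known bad so far: %s\n", commits[firstBad])
			for c, st := range results {
				fmt.Printf("  %s: %+v\n", c, st)
			}
		}
	}
}
```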
A few notes to make things more interesting ;)
Wow. Yeah, that's going to be hard. :) (I still think the tool I describe might actually be useful in an easier problem domain like regular Go bugs. I currently use stress and manually manage the bisection. cc @mvdan who likes thinking about tools)
I am thinking about increasing the number of independent tests per commit. It should help with false negatives and is roughly a one-line change. A more complex version would be to estimate how flaky the crash is (which we want to do for other reasons too), and then choose the number of tests based on that. I would expect Go bugs to have almost none of these problems: there are no tens of thousands of bugs at any point in time, no prolonged broken builds/boots (e.g. runtime bugs), and most crashes manifest in the same way. Also, doing more independent tests is much simpler (no need to boot unreliable VMs for each test). So I would expect that doing 50 tests per commit with a normal scripted bisection should be good enough. There is a long tail of corner cases that only humans can deal with anyway.
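For the "estimate how flaky the crash is and then choose the number of tests based on that" idea, one possible way to pick the per-commit test count is the standard bound n >= log(1 - confidence) / log(1 - p), where p is the per-run reproduction probability measured on the known-bad commit. The function name and numbers below are illustrative, not what pkg/bisect does today.

```go
// Sketch: compute how many independent tests to run per commit so that, if the
// bug is present, at least one run reproduces it with the desired confidence.
package main

import (
	"fmt"
	"math"
)

// testsPerCommit returns the smallest n such that 1-(1-p)^n >= confidence.
func testsPerCommit(p, confidence float64) int {
	if p <= 0 || p >= 1 {
		return 1
	}
	n := math.Log(1-confidence) / math.Log(1-p)
	return int(math.Ceil(n))
}

func main() {
	// A crash that reproduces in ~10% of runs needs ~44 runs for 99% confidence.
	fmt.Println(testsPerCommit(0.10, 0.99)) // 44
	fmt.Println(testsPerCommit(0.50, 0.99)) // 7
}
```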
@josharian thanks for the ping - highly unlikely I'll have time to think about yet another tool this summer, though :)
FTR, a case where the reproduction rate seems to drop dramatically at some point:
FTR, a mini-analysis of memory leak bisections:
Umbrella bug for collecting suboptimal bisection results and, hopefully, figuring out ways to improve the process.
1. https://syzkaller.appspot.com/bug?extid=fdce8f2a8903f3ba0e6b
https://syzkaller.appspot.com/x/bisect.txt?x=16e68713200000
Bisected to the wrong commit.
Root cause: an unrelated bug that happens on all tests.
The bisection switches to an unrelated bug, "INFO: trying to register non-static key in can_notifier", which seems to happen on all tests. This then leads to a completely random commit.
2. https://syzkaller.appspot.com/bug?extid=b92e4f1472a54e1c7dec
https://syzkaller.appspot.com/x/bisect.txt?x=1681996f200000
Same as 1.
3. https://syzkaller.appspot.com/bug?extid=fa11f9da42b46cea3b4a
https://syzkaller.appspot.com/x/bisect.txt?x=155bf5db200000
https://groups.google.com/d/msg/syzkaller-bugs/c6i429YYtqk/eSMAl3hDCQAJ
Bisected correctly, but the bug was fixed several months ago and syzbot wasn't aware of the fix.
4. https://syzkaller.appspot.com/bug?extid=1505c80c74256c6118a5
https://syzkaller.appspot.com/x/bisect.txt?x=13220283200000
A mix of problems: an unrelated bug triggered by the same repro ("WARNING: ODEBUG bug in netdev_freemem"); lots of infrastructure failures ("failed to copy test binary to VM"); and the original failure itself seems to be flaky. All of this contributed to pointing to a random commit.
Al Viro points out that the commit only touches comments, so we could mark the end result as suspicious.
Tetsuo points out that if most tests (say, 7 of 8) fail with infra problems, we should retry or skip the commit, since such failures zero out the benefit of running multiple independent tests.
5. https://syzkaller.appspot.com/bug?extid=65788f9af9d54844389e
https://syzkaller.appspot.com/x/bisect.txt?x=121f55db200000
Bisected to the wrong commit.
Root cause: it seems that everything went well except that the bug was too hard to trigger, so somewhere along the way the bisection slightly diverged from the right course.
Two potential improvements: estimate how hard a bug is to reproduce, and increase the number of instances and/or the testing time for hard-to-reproduce bugs. Also see #885.
6. https://syzkaller.appspot.com/bug?extid=5cd33f0e6abe2bb3e397
https://syzkaller.appspot.com/x/bisect.txt?x=15330437200000
and:
https://syzkaller.appspot.com/bug?extid=c4c4b2bb358bb936ad7e
https://syzkaller.appspot.com/x/bisect.txt?x=10c0298b200000
The crash reproduces with very low probability (about 1 in 10), so the bisection diverges.
https://groups.google.com/d/msg/syzkaller-bugs/zgIMBABdi-I/AnNztytDBQAJ
Thomas suggests reverting the guilty commit on top of the bisection start commit (or the original crash commit?) and checking whether the crash still happens. If the kernel still crashes, the bisection result may be marked as "suspicious". But it's unclear what to do in that case. Also, this check will be flaky too, so in some sense we are just adding more flakiness on top of flakiness...
7. https://syzkaller.appspot.com/bug?extid=6f39a9deb697359fe520
https://syzkaller.appspot.com/x/bisect.txt?x=17f1bacd200000
If the kernel hung in these cases, we need to provide a better explanation. If this is an infra failure, we need to understand why it happens and try to improve.
Eric also proposed the idea of confidence levels: don't mail results if we have high confidence that an unrelated bug diverted the process; if we have high confidence that flakiness diverted the process (or low confidence that it did not); or if we bisected to a comment-only change, a merge, or a change to another arch.
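To make the confidence-level idea concrete, here is a hedged sketch of what a "should we mail this result?" check could look like. All field names and thresholds are invented for illustration; this is not the actual pkg/bisect or syzbot API.

```go
// Sketch of a confidence check for bisection results before mailing them.
package main

import "fmt"

// bisectResult is a hypothetical summary of a finished bisection.
type bisectResult struct {
	commentOnlyCommit  bool    // culprit only touches comments
	mergeCommit        bool    // culprit is a merge
	otherArchOnly      bool    // culprit only touches another arch
	unrelatedCrashFrac float64 // fraction of runs that hit an unrelated bug
	reproProbability   float64 // estimated per-run reproduction probability
}

// shouldMail returns false (plus a reason) when confidence in the result is
// low and it should be held back for human review instead of being mailed.
func shouldMail(r bisectResult) (bool, string) {
	switch {
	case r.commentOnlyCommit:
		return false, "bisected to a comment-only change"
	case r.mergeCommit:
		return false, "bisected to a merge commit"
	case r.otherArchOnly:
		return false, "bisected to a change for another arch"
	case r.unrelatedCrashFrac > 0.5:
		return false, "an unrelated bug likely diverted the bisection"
	case r.reproProbability < 0.1:
		return false, "crash too flaky; bisection likely diverged"
	}
	return true, ""
}

func main() {
	ok, reason := shouldMail(bisectResult{mergeCommit: true})
	fmt.Println(ok, reason) // false bisected to a merge commit
}
```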