pkg/bisect: improve success rate #1051
Umbrella bug for collecting suboptimal bisection results and, hopefully, figuring out ways to improve them.
We could somehow estimate how hard it is to reproduce a bug and increase the number of instances and/or the testing time for hard-to-reproduce bugs. Also see #885.
If the kernel hung in these cases, we need to provide a better explanation. If this is an infra failure, we need to understand why it happens and try to improve it.
Eric also proposed an idea of confidence levels: don't mail results if we have high confidence that an unrelated bug has diverted the process; if we have high confidence that flakiness has diverted the process (or low confidence that it has not?); or if we bisected to a comment-only change, a merge, or a change to another arch.
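As a rough illustration of the confidence-level idea, here is a minimal sketch of a filter that collects reasons not to mail a result. Everything in it is hypothetical: the `BisectionResult` fields, the 0..1 confidence scores, and the `threshold` value are assumptions made for the example, not anything that exists in pkg/bisect.

```python
from dataclasses import dataclass


@dataclass
class BisectionResult:
    # Hypothetical 0..1 scores; how to compute them is the open question.
    unrelated_bug_confidence: float = 0.0
    flakiness_confidence: float = 0.0
    # Hypothetical properties of the commit the bisection pointed at.
    comment_only: bool = False
    is_merge: bool = False
    other_arch: bool = False


def reasons_not_to_mail(result, threshold=0.8):
    """Return the reasons to hold a result back; an empty list means mail it."""
    reasons = []
    if result.unrelated_bug_confidence >= threshold:
        reasons.append("unrelated bug likely diverted the bisection")
    if result.flakiness_confidence >= threshold:
        reasons.append("flakiness likely diverted the bisection")
    if result.comment_only:
        reasons.append("bisected to a comment-only change")
    if result.is_merge:
        reasons.append("bisected to a merge commit")
    if result.other_arch:
        reasons.append("bisected to a change for another arch")
    return reasons
```

Returning the list of reasons rather than a bare boolean would also make it easy to log why a result was suppressed.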
FTR, the analysis of 118 bisection results:
It sounds to me like a new general purpose tool is called for. Here's a sketch.
The tool is for running git bisect on hard-to-reproduce bugs. It requires fully scripted reproduction steps. It assumes that false negatives are possible (bug did not reproduce even though present) but that false positives are not. The script is given a git commit. (The git commit is provided to the script, rather than giving the script a working tree, so that the script can cache intermediate artifacts.) The script attempts to reproduce the bug at that commit, and generates one of a few outcomes: (1) bug did not reproduce, (2) bug did reproduce, (3) some other permanent failure occurred, such as a syntax error in the code at that commit, (4) some other possibly transient failure occurred, such as a timeout or some other flaky failure.
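The four outcomes and the "no false positives" assumption could be modelled roughly like this. It is only a sketch: the `Outcome` and `CommitStats` names and the verdict strings are made up for illustration.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    NOT_REPRODUCED = 1     # (1) bug did not reproduce
    REPRODUCED = 2         # (2) bug did reproduce
    PERMANENT_FAILURE = 3  # (3) e.g. the code at that commit does not build
    TRANSIENT_FAILURE = 4  # (4) e.g. a timeout or other flaky failure


class CommitStats:
    """Accumulates the outcomes of every run at a single commit."""

    def __init__(self):
        self.counts = Counter()

    def record(self, outcome):
        self.counts[outcome] += 1

    def verdict(self):
        # False positives are assumed impossible, so a single
        # reproduction is enough to mark the commit as bad.
        if self.counts[Outcome.REPRODUCED] > 0:
            return "bad"
        if self.counts[Outcome.PERMANENT_FAILURE] > 0:
            return "untestable"
        # False negatives are possible, so clean runs are only evidence.
        if self.counts[Outcome.NOT_REPRODUCED] > 0:
            return "possibly good"
        # Only transient failures (or nothing) so far: retry later.
        return "unknown"
```

Note that a commit never becomes definitively "good" here; more clean runs only increase confidence.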
The runner script runs indefinitely, doing a kind of ratcheting bisection. The result of every run is accumulated into an overall set of stats.
Start by running a bunch of iterations of the known bad commit to estimate the failure rate. Do a regular bisection using enough runs that failures are likely (if not guaranteed). That gives you a decent starting place to investigate as a human while the tool continues to narrow the field.
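To make "enough runs that failures are likely" concrete: if runs are independent and the bug reproduces with probability p per run, then n runs all miss it with probability (1 - p)^n, so you need n >= log(1 - c) / log(1 - p) runs to see it with confidence c. A small sketch, assuming independence (which real runs may violate); the function names are illustrative only.

```python
import math


def estimate_repro_rate(reproduced, total):
    """Naive point estimate from runs of the known bad commit."""
    if total <= 0:
        raise ValueError("need at least one run")
    return reproduced / total


def runs_needed(repro_rate, confidence=0.95):
    """Smallest n such that 1 - (1 - repro_rate)**n >= confidence."""
    if not 0.0 < repro_rate <= 1.0:
        raise ValueError("repro_rate must be in (0, 1]")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be in (0, 1)")
    if repro_rate == 1.0:
        return 1
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - repro_rate))


# Example: the bug reproduced in 4 of 20 runs at the known bad commit.
rate = estimate_repro_rate(4, 20)
print(runs_needed(rate, confidence=0.95))
```

With a low reproduction rate the required run count grows quickly, which is where the "takes too long" caveat comes from.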
Next, keep track of the first commit in which the bug ever reproduced. Select commits to test using an exponential distribution, such that you're more likely to select commits near your first known bad commit. Over time, you will either (a) move the first known bad commit closer or (b) gather a lot of evidence that that commit really was the first known bad commit.
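The ratcheting step might look something like the sketch below. It assumes commits are indexed from 0 (oldest) upward and that `reproduces(index)` is a caller-supplied function that runs the reproduction script once; both are assumptions for illustration, as is the `mean_distance` parameter.

```python
import random
from collections import Counter


def pick_commit(first_bad, mean_distance=4.0, rng=random):
    """Pick an older commit, exponentially biased towards first_bad."""
    if first_bad <= 0:
        return None  # nothing older left to test
    # expovariate takes the rate, i.e. 1 / mean.
    distance = 1 + int(rng.expovariate(1.0 / mean_distance))
    return max(0, first_bad - distance)


def ratchet(first_bad, reproduces, iterations, rng=random):
    """Repeatedly test commits older than first_bad and move first_bad
    back whenever the bug reproduces. Returns the final first_bad and
    the number of runs spent on each commit."""
    runs = Counter()
    for _ in range(iterations):
        index = pick_commit(first_bad, rng=rng)
        if index is None:
            break
        runs[index] += 1
        if reproduces(index):
            first_bad = index  # ratchet: only ever moves earlier
    return first_bad, runs
```

Because false positives are assumed impossible, `first_bad` can never move past the true culprit; the run counts in `runs` are the evidence that it has stopped moving.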
Make a nice visualization of the number of runs and the outcome distribution of those runs available to the user for inspection and human intelligence. As a bonus, let the user nudge the tool while it runs by providing human annotations about known good/bad commits.
None of this is perfect. For example, the reproduction rate might vary over time due to other changes, and the failure rate might be so low and the reproduction time so high that this brute-force approach takes too long to stabilize.
Nevertheless, this should provide a systematic (and generically useful) way to approach the problem, and shift effort from humans to CPUs.
A few notes to make things more interesting ;)
I am thinking about increasing the number of independent tests per commit. It should help with false negatives and is like a 1-line change. A more complex version would be to estimate how flaky the crash is (which we want to do for other reasons too), and then choose the number of tests based on that.
I would expect Go bugs to have almost none of these problems. There are not tens of thousands of bugs at any point in time, no prolonged broken builds/boots (e.g. runtime bugs), and most crashes manifest in the same way. Also, doing more independent tests is much simpler (no need to boot unreliable VMs for each test). So I would expect that doing 50 tests per commit and normal scripted bisection should be good enough. There is a long tail of corner cases that only humans can deal with anyway.
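For a feel of what 50 tests per commit buys, here is a back-of-the-envelope sketch. It assumes independent tests with a fixed per-test reproduction rate and, pessimistically, that every bisection step lands on a bad commit; the function names are illustrative.

```python
import math


def miss_probability(repro_rate, tests_per_commit=50):
    """Chance that every test at a bad commit fails to reproduce the bug."""
    return (1.0 - repro_rate) ** tests_per_commit


def bisection_success_lower_bound(repro_rate, n_commits, tests_per_commit=50):
    """Lower bound on the chance that no bisection step is a false negative,
    assuming the worst case where all ~log2(n_commits) steps test bad commits."""
    steps = math.ceil(math.log2(n_commits))
    return (1.0 - miss_probability(repro_rate, tests_per_commit)) ** steps


for rate in (0.5, 0.1, 0.02):
    print(rate, miss_probability(rate), bisection_success_lower_bound(rate, 1024))
```

Under these assumptions, a crash that reproduces 10% of the time is missed by all 50 tests only about 0.5% of the time, while for a 2% crash 50 tests are clearly not enough.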