testing: possible fuzzing strategies #46507
As of golang.org/cl/312009 the Go fuzzer implements a basic coverage-based fuzzing strategy, where mutated inputs are considered worthwhile for further fuzzing if they cause previously unexecuted instrumented blocks to be executed. While this method provides a good baseline, there are a number of more advanced strategies, used by other existing fuzzers or suggested in research, which may provide better results.
This document aims to list a number of possible strategies to investigate, and to outline a framework for evaluating their effectiveness. Due to the inherently random nature of fuzzers, the naive approach of simply testing the delta in performance of a single target across two implementations is unlikely to be representative of the actual difference. The 2018 paper Evaluating Fuzz Testing provides a useful overview of the difficulties with this kind of testing.
Bucketing Coverage Counters
An approach taken by both libFuzzer and AFL (albeit with slightly different implementations) is to quantize block coverage counters into a series of buckets, and consider a mutated input to increase coverage if the input causes a counter to jump from one bucket to another. The rationale is that causing the number of times a block is executed to change by some factor is likely as interesting as causing a new block to be executed.
The AFL approach to this is documented in part in section 2 of their whitepaper.
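As a rough sketch, the bucket boundaries below follow AFL's scheme (1, 2, 3, 4–7, 8–15, 16–31, 32–127, 128+), but the surrounding bookkeeping is illustrative rather than any fuzzer's actual implementation:

```go
package main

import "fmt"

// bucket quantizes a raw block-execution counter into one of eight
// coarse buckets, loosely following AFL's scheme. A mutated input is
// considered interesting if any counter lands in a bucket not
// previously observed for that block.
func bucket(count uint32) uint8 {
	switch {
	case count == 0:
		return 0
	case count == 1:
		return 1
	case count == 2:
		return 2
	case count == 3:
		return 3
	case count <= 7:
		return 4
	case count <= 15:
		return 5
	case count <= 31:
		return 6
	case count <= 127:
		return 7
	default:
		return 8
	}
}

// newCoverage reports whether any counter in cur falls into a bucket
// not previously seen for that block, updating the per-block bucket
// bitmap in seen as it goes.
func newCoverage(seen []uint16, cur []uint32) bool {
	interesting := false
	for i, c := range cur {
		b := uint16(1) << bucket(c)
		if seen[i]&b == 0 {
			seen[i] |= b
			interesting = true
		}
	}
	return interesting
}

func main() {
	seen := make([]uint16, 4)
	fmt.Println(newCoverage(seen, []uint32{1, 0, 0, 0})) // first hit: true
	fmt.Println(newCoverage(seen, []uint32{1, 0, 0, 0})) // same buckets: false
	fmt.Println(newCoverage(seen, []uint32{9, 0, 0, 0})) // counter jumped buckets: true
}
```

Note that with this scheme an input that merely executes a known block many more times still registers as new coverage, which is the point of bucketing.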
Score Based Execution Time
The current Go fuzzer will fuzz an input from the queue until either a mutation which finds new coverage is discovered or a timeout is hit. AFL takes a somewhat different approach: an input is fuzzed for a certain number of cycles, calculated from how properties of the input compare to the averages across all inputs in the queue (documented in fuzz.c). For instance, if an input is smaller than average, faster than average, or has more coverage than average, it will be fuzzed for longer, on the assumption that these are beneficial characteristics which are likely to produce similarly desirable mutations. Rather than returning as soon as an interesting mutation is found, AFL will actually increase the number of executions when it finds one, on the assumption that an input which produces one interesting mutation may produce more, and that continuing to fuzz it is more valuable than fuzzing another input in the queue that may not yield anything.
Queue Prioritization (AFL style smallest covering set)
After fuzzing an input AFL will re-process its queue of inputs, attempting to reduce the set of inputs it considers for fuzzing to only the most ‘interesting’. AFL maintains a list which maps each counter in the coverage map to the smallest, fastest-executing input which triggers that counter; it then picks the smallest set of inputs from this list which covers all of the currently triggered counters and marks them as prioritized (this process is documented in section 4 of the AFL whitepaper). When selecting the next input from the queue to fuzz, prioritized inputs are picked for further fuzzing with considerably higher odds.
In theory this should allow the fuzzer to focus on a much smaller subset of generated inputs, which both produce more interesting results and don’t bog the fuzzer down.
Queue Prioritization (rare branches)
The FairFuzz paper suggests prioritizing inputs which trigger branches which are rarely triggered by other inputs, as well as determining which bytes of the input may be mutated while still causing these rare branches to be executed. Information about how the input can be mutated is then taken into account by the mutator to prevent bytes from being mutated which may cause a regression in coverage.
Deterministic Mutations
Before resorting to randomized stacked mutations, AFL runs a series of deterministic mutations against the input. Michał Zalewski suggests that this results in discovering ~140 new execution paths per million inputs over fully randomized mutations.
For each generated input the deterministic stage only needs to be executed once, and then subsequent fuzzing of the input can be entirely randomized.
Dictionaries
AFL and libFuzzer make use of dictionaries of ‘interesting’ syntax, which can be inserted into inputs as a possible mutation.
AFL additionally makes use of the coverage instrumentation to generate its own dictionaries from mutated inputs: while performing a walking bit flip of the input, it checks whether repeated flips cause a consistent change in coverage, which suggests an underlying comparison against a span of the input (this is discussed in section 7 of the AFL whitepaper and in a blog post by Michał Zalewski).
libFuzzer takes a similar approach by instrumenting CMP instructions, and using the values being compared against to construct a dictionary of interesting syntax to insert during mutations (this is discussed in the Tracing data flow section of the LLVM SanitizerCoverage documentation). The Go libFuzzer implementation currently has support for this type of instrumentation, but the Go fuzzer does not.
Another approach that may be viable when fuzzing text-based parsers (such as the golang.org/x/net/html parser) would be to instrument comparisons against strings, since these are likely to yield valuable keywords (this would also capture usage of bytes.Equal). A method similar to the one used for instrumenting comparisons of (constant) integers should be relatively simple to implement.
The REDQUEEN paper suggests not just capturing the inputs to comparison functions, but additionally attempting to map the arguments back to the processed input, creating mutations which explicitly satisfy the comparison by replacing the portions of the input being checked with the expected values. Two interesting optimization techniques are also suggested: pre-processing the input to increase its entropy, so it is easier to determine where the input needs to be mutated, and using a handful of encoding techniques to determine where in the input the argument to the comparison function is taken from.
Evaluation
FuzzBench is the closest project to what we are looking for, but given that we are only planning on testing two implementations against each other, and only using Go-specific fuzz targets, it is likely easier to build our own tooling to run these tests than to try to retrofit their framework to fit our needs.
At least initially, the main metric to focus on when evaluating the performance of a fuzzer implementation is coverage, since the more of the program the fuzzer can cover, the more likely it is to discover bugs in the code. Eventually we will likely want to assemble a set of programs with known bugs, or design a program with synthetic bugs, and use the number of discovered bugs in addition to coverage, but this is a significantly more time-intensive approach.
As suggested in Evaluating Fuzz Testing, since code coverage across multiple runs is likely to be nonparametric, we can compare performance of a series of runs with a proposed strategy against runs of the baseline using the Mann–Whitney U statistical test.
Effectiveness of different strategies is likely to depend, in some cases, on the type of code being fuzzed. In order to properly evaluate proposed strategies we should use multiple targets which cover the most common use cases. FuzzBench takes the approach of using a large number of targets which cover a range of applications. Using the standard library as a source of targets we should be able to create a wide range of targets which cover the spectrum of parser designs. These packages are likely to provide a good starting point:
Similarly to the selection of targets, we should also evaluate performance with different types of input corpora. The three most common corpus strategies are using an empty seed, a single basic seed, and a highly saturated corpus. This tests the ability of the fuzzer to identify paths from scratch, evolve paths from a basic validly formatted input, and find any remaining undiscovered paths given a corpus that highly exercises the program.
An open question is how many runs of each target we should do, and for how long each run should last. Evaluating Fuzz Testing makes somewhat vague suggestions here, without particularly empirical backing, of ~30 runs of each test for at least 24 hours. FuzzBench takes a similar approach, doing 20 runs of 23 hours each. The number of runs is a relatively easy question to answer: it mainly boils down to how many compute resources we want to devote to testing, where more is probably better in terms of getting a more accurate result from the statistical tests. Run time is a harder question. 24 hours is likely to provide interesting results, but may not be appropriate for all tests. In particular, we may want to vary the run time for the various corpora tests: in the empty and basic seed cases, coverage is likely to increase logarithmically and stall out after a number of hours, so a shorter run time may be as informative as a longer one, whereas with a highly saturated corpus, increases in coverage are likely to happen quite rarely, so a much longer run time may be more appropriate.