This document describes several recommended strategies for splitting a single fuzzer-generated input into several parts (sub-inputs).
Splitting a fuzzer-generated input into several independent parts (sub-inputs) is required very often. Some examples:
- Fuzzing a regular expression library requires
  - The regular expression (RE)
  - Flags for RE compilation and matching
  - A string to search the RE in
- Fuzzing an audio/video format decoder often requires
  - Decoding flags
  - Several frames
- Fuzzing an XSLT or CSS library requires
  - The stylesheet input
  - The XML/HTML input
- Fuzzing a font-rendering library requires
  - The font file
  - The text to render
  - The rendering flags
- Fuzzing a database library may require
  - The query text
  - The database state
TODO: more examples?
When trying to split the fuzzer-generated input into several parts, the first question to ask is whether the input format is common, i.e. whether it is used or processed by other libraries, APIs, or fuzz targets.
If the data format is common (e.g. a widely used media format or network packet format) then it is highly desirable for a fuzz target to consume exactly this data format, and not some custom modification. This way it will be easier to procure a seed corpus for this fuzz target and to use the generated corpus to test/fuzz other targets.
If the data format may be processed by a fuzz target in a small number of different ways, it is often the best approach to split the fuzz target into several ones, each processing the input in exactly one way.
Make sure to cross-pollinate the corpora between these targets. OSS-Fuzz does that automatically.
When a fuzz target for a common data format requires some flags, options, or additional auxiliary sub-input(s), it is sometimes possible to embed the extra input inside a custom section or a comment of the main data format.
Examples:
- PNG allows custom "chunks", so a fuzz target for a PNG decoder can hide the flags used during PNG processing in a separate PNG chunk, e.g. fUZz (example).
- When fuzzing C/C++/Java/JavaScript inputs one may hide a sub-input in a single-line // comment. TODO: example?
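A minimal sketch of the comment-based approach, assuming the sub-input lives in a leading `//` line of the source text. The helper `ExtractLeadingComment`, the stand-in `ParseSource`, and the "strict" keyword convention are all hypothetical names chosen for illustration, not part of any real API.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: returns the text of a leading "//" comment line
// (without the slashes and the newline), or an empty string if the input
// does not start with such a comment.
std::string ExtractLeadingComment(const uint8_t *Data, size_t Size) {
  if (Size < 2 || Data[0] != '/' || Data[1] != '/')
    return std::string();
  const uint8_t *Begin = Data + 2;
  const uint8_t *End = Data + Size;
  const uint8_t *Newline = std::find(Begin, End, static_cast<uint8_t>('\n'));
  return std::string(Begin, Newline);
}

// Hypothetical stand-in for the real API under test.
static void ParseSource(const uint8_t *Data, size_t Size, bool StrictMode) {
  (void)Data;
  (void)Size;
  (void)StrictMode;
}

extern "C" int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t Size) {
  // Assumed convention: the word "strict" in the leading comment sets a flag.
  const std::string Options = ExtractLeadingComment(Data, Size);
  const bool StrictMode = Options.find("strict") != std::string::npos;
  // The parser still receives the whole input, comment included, so the
  // corpus stays in the common source format.
  ParseSource(Data, Size, StrictMode);
  return 0;
}
```

Because the comment is valid syntax in the main format, corpus files remain usable by other targets that consume the same language.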
When only one small fixed-size sub-input is required (such as flags / options), the fuzz target may compute a hash function on the full input and use it as the flag bits. This option is very easy to implement, but its applicability is limited to relatively simple cases. The major problem is that a small local mutation of the input leads to a large change in the sub-input, which often makes fuzzing less efficient. Try this approach if the flags are individual bits and the input type tolerates some bit flips in the inputs (e.g. plain text).
TODO: example.
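A minimal sketch of the hash-based approach described above. The choice of 32-bit FNV-1a is an assumption (any deterministic hash would do), and the `Flags` struct and the stand-in `MatchWithFlags` are hypothetical names used only for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// 32-bit FNV-1a over the whole input. The hash choice is an assumption;
// the only requirement is that it is deterministic.
uint32_t Fnv1a(const uint8_t *Data, size_t Size) {
  uint32_t Hash = 2166136261u;
  for (size_t I = 0; I < Size; ++I) {
    Hash ^= Data[I];
    Hash *= 16777619u;
  }
  return Hash;
}

// Hypothetical set of single-bit options for the API under test.
struct Flags {
  bool CaseInsensitive;
  bool Multiline;
  bool DotAll;
};

// Derives the flag bits from the low bits of the hash of the full input.
Flags FlagsFromInput(const uint8_t *Data, size_t Size) {
  const uint32_t Hash = Fnv1a(Data, Size);
  return Flags{(Hash & 1u) != 0, (Hash & 2u) != 0, (Hash & 4u) != 0};
}

// Hypothetical stand-in for the real API under test.
static void MatchWithFlags(const uint8_t *Data, size_t Size, const Flags &F) {
  (void)Data;
  (void)Size;
  (void)F;
}

extern "C" int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t Size) {
  // The full input is both hashed (for the flags) and used as the main input.
  MatchWithFlags(Data, Size, FlagsFromInput(Data, Size));
  return 0;
}
```

Note that changing a single input byte changes all derived flags at once, which is the efficiency cost mentioned above.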
If you do not intend to share the corpus with any other API or fuzz targets, then a custom serialization format might be a good option for a multi-input fuzz target.
When only one fixed-size sub-input is required (such as flags / options), it is possible to treat the first (or last) K bytes of the input as the sub-input, and the rest of the bytes as the main input. Just remember to copy the main input into a separate heap buffer of Size - K bytes, so that buffer under/overflows on the main input are detected.
TODO: example.
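A minimal sketch of the first-K-bytes approach, assuming K = 4 and a 32-bit flags word. The names `kFlagBytes`, `FlagsAndMain`, `SplitFlagsAndMain`, and the stand-in `Decode` are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr size_t kFlagBytes = 4;  // K: an assumed size for the flags sub-input.
static_assert(sizeof(uint32_t) == kFlagBytes, "flags must occupy K bytes");

struct FlagsAndMain {
  uint32_t Flags = 0;
  // Separate heap buffer of exactly Size - K bytes, so that AddressSanitizer
  // can catch under/overflows on the main input.
  std::vector<uint8_t> Main;
};

// Returns false if the input is too short to contain the flags.
bool SplitFlagsAndMain(const uint8_t *Data, size_t Size, FlagsAndMain *Out) {
  if (Size < kFlagBytes)
    return false;
  std::memcpy(&Out->Flags, Data, kFlagBytes);
  Out->Main = std::vector<uint8_t>(Data + kFlagBytes, Data + Size);
  return true;
}

// Hypothetical stand-in for the real API under test.
static void Decode(const uint8_t *Data, size_t Size, uint32_t Flags) {
  (void)Data;
  (void)Size;
  (void)Flags;
}

extern "C" int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t Size) {
  FlagsAndMain Parts;
  if (!SplitFlagsAndMain(Data, Size, &Parts))
    return 0;  // Too short to be interesting.
  Decode(Parts.Main.data(), Parts.Main.size(), Parts.Flags);
  return 0;
}
```

Passing `Data + kFlagBytes` directly to the target would skip the copy, but then a read before the start of the main input would land in the flag bytes and go unnoticed.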
Choose a 4-byte (or 8-byte) magic constant that will serve as a separator between the inputs. In the fuzz target, split the input using this separator. Use memmem to find the separator in the input -- memmem is known to be friendly to fuzzing engines, at least to libFuzzer.
Example (see full code here):
// Splits [Data, Data+Size) into a vector of byte vectors using a "magic" Separator.
std::vector<std::vector<uint8_t>> SplitInput(const uint8_t *Data, size_t Size,
                                             const uint8_t *Separator,
                                             size_t SeparatorSize) { ... }

extern "C" int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t Size) {
  const uint8_t Separator[] = {0xDE, 0xAD, 0xBE, 0xEF};
  auto Inputs = SplitInput(Data, Size, Separator, sizeof(Separator));
  // Use Inputs.size(), Inputs[0], Inputs[1], ...
  return 0;
}
It is relatively easy for a modern fuzzing engine to discover the separator, but we nevertheless recommend providing several seed inputs with the desired number of separators.
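The body of SplitInput is elided in the example above. The following is a sketch of one possible implementation, not necessarily the one in the linked code. It assumes memmem is available (it is a GNU/BSD extension rather than standard C++), and it makes one design choice of its own: a trailing separator does not produce an empty final part.

```cpp
#include <string.h>  // memmem (GNU/BSD extension, not standard C++)

#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Splits [Data, Data+Size) into byte vectors at each occurrence of Separator.
std::vector<std::vector<uint8_t>> SplitInput(const uint8_t *Data, size_t Size,
                                             const uint8_t *Separator,
                                             size_t SeparatorSize) {
  assert(SeparatorSize > 0);
  std::vector<std::vector<uint8_t>> Result;
  const uint8_t *Begin = Data;
  const uint8_t *End = Data + Size;
  // Each iteration copies the bytes before the next separator and skips it.
  while (const uint8_t *Pos = static_cast<const uint8_t *>(
             memmem(Begin, End - Begin, Separator, SeparatorSize))) {
    Result.emplace_back(Begin, Pos);
    Begin = Pos + SeparatorSize;
  }
  // Whatever follows the last separator (if anything) is the final part.
  if (Begin < End)
    Result.emplace_back(Begin, End);
  return Result;
}
```

Each part is copied into its own heap-allocated vector, so AddressSanitizer can detect overflows on individual parts. A matching seed input for a two-part target would be the first part, the four separator bytes, then the second part.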
FuzzedDataProvider (FDP) is a single-header C++ library that is helpful for
splitting a fuzz input into multiple parts of various types. It is a part of
LLVM and can be included via the #include <fuzzer/FuzzedDataProvider.h>
directive.
If your compiler doesn't have this header (in case it's an older Clang version
or some other compiler), you can copy the header from here and add it to your
project manually. It should just work, as the header doesn't depend on LLVM.
Both an advantage and a disadvantage of this library is that the input splitting happens dynamically, i.e. you don't need to define any structure for the input. This can be very helpful in certain cases, but it also means the corpus is no longer in any particular format. For example, if you fuzz an image parser and split the fuzz input into several parts, the corpus elements will no longer be valid image files, and you won't be able to simply add image files to your corpus.
- FuzzedDataProvider is a class whose constructor accepts const uint8_t*, size_t arguments. Usually, you would call it in the beginning of your LLVMFuzzerTestOneInput and pass the data, size parameters provided by the fuzzing engine.
- Once an FDP object is constructed using the fuzz input, you can consume the data from the input by calling the FDP methods listed below.
- If there is not enough data left to consume, FDP will consume all the remaining bytes. For example, if you call ConsumeBytes(10) when there are only 4 bytes left in the fuzz input, FDP will return a vector of length 4.
- If there is no data left, FDP will return the default value for the requested type or an empty container (when consuming a sequence of bytes).
- If you consume data from FDP in a loop, make sure to check the value returned by remaining_bytes() between loop iterations.
- Do not use the methods that return std::string unless your API requires a string object or a C-style string with a trailing null byte. This is a common mistake that hides off-by-one buffer overflows from AddressSanitizer.
- ConsumeBool, ConsumeIntegral, ConsumeIntegralInRange methods are helpful for extracting a single boolean or integer value (the exact type is defined by a template parameter), e.g. some flag for the target API, or a number of iterations for a loop, or length of a part of the fuzz input.
- ConsumeProbability, ConsumeFloatingPoint, ConsumeFloatingPointInRange methods are very similar to the ones mentioned above. The difference is that these methods return a floating point value.
- ConsumeEnum and PickValueInArray methods are handy when the fuzz input needs to be selected from a predefined set of values, such as an enum or an array.
These methods derive the requested values from the last bytes of the fuzz input. In some cases, this makes it possible to use valid files or existing test files as a seed corpus.
Many of these methods have a length argument. You can always find out how many bytes
are left inside the provider object by calling the remaining_bytes()
method on it.
- ConsumeBytes and ConsumeBytesWithTerminator methods return a std::vector of the requested size. These methods are helpful when you know how long a certain part of the fuzz input should be. Use the .data() and .size() methods of the resulting object if your API works with raw memory arguments.
- ConsumeBytesAsString method returns a std::string of the requested length. This is useful when you need a null-terminated C-string. Calling c_str() on the resulting object is the best way to obtain it.
- ConsumeRandomLengthString method returns a std::string as well, but its length is derived from the fuzz input and typically is hard to predict, though always deterministic. The caller can provide the max length argument.
- ConsumeRemainingBytes and ConsumeRemainingBytesAsString methods return std::vector and std::string objects respectively, initialized with all the bytes from the fuzz input that are left unused.
- ConsumeData method copies the requested number of bytes from the fuzz input to the given pointer (void *destination). The method is useful when you need to fill an existing buffer or object (e.g. a struct) with fuzzing data.
For more information about the methods, their arguments and implementation details, please refer to the FuzzedDataProvider source code. Every method has a detailed comment in that file, and the implementation is relatively small.
- net_verify_name_match_fuzzer splits the fuzz input into two parts.
- net_http2_frame_decoder_fuzzer reads data in small chunks in a loop in order to emulate a sequence of frames coming from the network connection.
- net_crl_set_fuzzer initializes multiple parameters and uses the rest of the fuzz input for the main argument (i.e. data to be parsed / processed). Note that using Protobuf-based fuzzing might be more efficient for such a target.
- net_parse_cookie_line_fuzzer is a slightly more sophisticated fuzz target that emulates different actions with different parameters initialized with the fuzz input.
A custom type-length-value (TLV) format may sound like a good solution. However, we typically do not recommend using a custom TLV to split your fuzzer-generated input for the following reasons:
- This is more test-only code for you to maintain, and easy to get wrong
- Typical mutations performed by fuzzing engines, such as inserting a byte, will break the TLV structure too often, making fuzzing less efficient
However, a TLV input combined with a custom mutator might be a good option. See Structure-Aware Fuzzing.
Yet another option is to use one of the general-purpose serialization formats, such as Protobufs, in combination with a custom mutator. See Structure-Aware Fuzzing.