x/fuzzdata: new repository for fuzzing corpus data #31215

thepudds · 2019-04-02T20:07:32Z

Summary

In #30719 and #30979, dvyukov/go-fuzz compatible fuzz functions were landed in:

the standard library in src/image/png/fuzz.go
golang.org/x in golang.org/x/image/tiff/fuzz.go

The follow-up item in this issue here is to add a new golang.org/x/corpus repository to hold an example fuzzing corpus for portions of the standard library and x subrepos. This is to help with the exploration requested by the core Go team in discussion of the #19109 proposal to "make fuzzing a first class citizen".

The name for the new corpus repo alternatively could be golang.org/x/fuzz or something else; some additional naming discussion below.

Note that this issue is at least currently intended to be solely about creating the repository itself, and this issue does not cover checking in any corresponding fuzzing corpus (which is likely to be a follow-up issue).

Background

See the "Background" section of #30719 or #19109 (comment).

Additional Details

As part of the #19109 proposal discussion, there were multiple comments/requests from the core Go team asking to develop a better understanding of how a fuzzing corpus looks and behaves when it resides in a repository. For example, this March 2017 request in #19109 (comment) from Russ:

add fuzz tests to at least the x subrepos and maybe the standard library, so that we can understand the implications of having them in the source repos (including how much space we're going to spend on testdata/fuzz corpus directories

In the March 2017 proposal document in #19109 (comment), @dvyukov proposed:

For the standard library it is proposed to check in corpus into golang.org/x/fuzz repo.

golang.org/x/fuzz seems to be a very reasonable name. Two alternative possible names:

golang.org/x/corpus
golang.org/x/fuzzcorpus

Personally, I think x/corpus or x/fuzzcorpus might be more evocative than x/fuzz (e.g., someone might incorrectly think x/fuzz is where the fuzzing implementation or fuzzing functions live), but I do not have a strong opinion on the name and suspect others will have stronger opinions. Until there is additional feedback on the name, the rest of this comment here will use the term golang.org/x/corpus.

Populating the corpus

The initial seeding of the corpus can likely come from https://github.com/dvyukov/go-fuzz-corpus for a given Fuzz function. If that happens, then right now, there would be two corpus directories populated: go-fuzz-corpus/png/corpus and go-fuzz-corpus/tiff/corpus.

In parallel, multiple people are making progress on integrating dvyukov/go-fuzz into oss-fuzz in google/oss-fuzz#36, google/oss-fuzz#2188, dvyukov/go-fuzz#213 and elsewhere. Most likely, it will make sense to periodically update the golang.org/x/corpus repo with the output from oss-fuzz (which otherwise by default is stored in a Google Cloud Storage bucket).

However, the exact mechanism of populating and updating golang.org/x/corpus repo I think can be discussed outside of this particular issue. (Reason: In general, it seems more tractable to make progress if things are broken down into more manageable discrete chunks of work, especially to break out into separate steps the things that must be done by someone on the core Go team, vs. could be done by someone from the broader community. This issue is focused on the creation of the repo, which presumably must be done by someone on the core Go team, whereas populating the corpus can be done by a greater range of people from the broader community in consultation with the core Go team).

I am of course happy to discuss anything here, and happy to be corrected if any of the above is different than how people would like to proceed.

CC @dvyukov @josharian @FiloSottile @bradfitz @acln0

The text was updated successfully, but these errors were encountered:

FiloSottile · 2019-04-02T20:43:28Z

LGTM. I suspect bringing the full oss-fuzz corpus back into the repo might bloat it too much, but we definitely need a place for seed corpus, so this sounds good.

dmitshur · 2019-04-02T20:45:52Z

Related issue: #14304.

FiloSottile · 2019-04-02T20:53:00Z

Oh, how about x/fuzzdata? x/corpus does not imply fuzzing to everyone I suspect. (Bikeshed!)

kybin · 2019-04-02T20:58:03Z

will the repository used only for fuzzing?

thepudds · 2019-04-02T21:00:10Z

For reasons of clarity, I personally like x/fuzzdata or x/fuzzcorpus over x/corpus or x/fuzz. I led off the issue using x/corpus mainly because it seems the core Go team generally places great weight on brevity.

bradfitz · 2019-04-02T21:03:19Z

I was also going to suggest fuzzdata.

thepudds · 2019-04-02T21:09:28Z

@kybin

will the repository used only for fuzzing?

That is the current intent of the suggestion in this issue.

dvyukov · 2019-04-03T05:50:04Z

LGTM. I suspect bringing the full oss-fuzz corpus back into the repo might bloat it too much, but we definitely need a place for seed corpus, so this sounds good.

If it's only for seed corpus, then seed corpus may belong to the main repo better. We do want to use it in unit tests too! See dvyukov/go-fuzz#218 (comment) for more explanation.

dvyukov · 2019-04-03T08:04:54Z

To clarify, I mean only seed corpus, i.e. small number of high-quality inputs without significant churn (no more than what happens with code). For a fuzzer-generated corpus we do need a separate repo.

thepudds · 2019-04-03T08:14:58Z

Part of what is tricky is there are really 2 things running in parallel, I think:

Actually fuzzing the stdlib much more (mainly via oss-fuzz).
Use the stdlib and x subrepos as a “pilot” for the “first class fuzzing” proposal (at Russ’ suggestion).

For 2., it seems important that golang.org/x/fuzzdata gets created so that the stdlib and x subrepos can be a pilot of a decent size project that stores a decent sized corpus with some churn in a repo... and I think that is important even under the split corpus idea in dvyukov/go-fuzz#218 (comment), which I commented on there yesterday (probably too enthusiastically, sorry).

thepudds · 2021-02-20T20:25:29Z

Hi @katiehockman, I had a question for you about the generated corpus under the current draft design for first class fuzzing, which in turn might have some implications for what to do this with issue.

The draft includes:

The generated corpus will be in a new directory within $GOCACHE, in the form $GOCACHE/fuzz/$pkg/$test/$name,

and also:

A generated corpus will be managed by the fuzzing engine and will live outside the module in a subdirectory of $GOCACHE. This generated corpus will grow as the fuzzing engine discovers new coverage.

The details of how the corpus is built and processed should be unimportant to users.

It makes sense that the details of what is inside the generated corpus can effectively be opaque to ordinary users, but the generated corpus itself is fairly valuable, and often worthy of keeping around & building up over time.

Given that generated corpus is valuable, it can be fairly useful to allow more direct control of the location of the generated corpus (to allow for easier storage in a separate repo, or cloud storage, or a shared filesystem, or easier integration with the diversity of CI systems in use, or purpose built fuzzing systems like OSS-Fuzz)

The prior first class fuzzing proposal went through a couple iterations on corpus behavior, but the behavior in the fzgo prototype of that proposal ultimately defaulted to reading from testdata (e.g., for a seed corpus), wrote to GOPATH/pkg/fuzz by default as a local cache for the generated corpus, but also allowed more direct control of the generated corpus via an optional -fuzzdir flag to support use cases such as those mentioned just above.

GOCACHE certainly can be made to work in different scenarios, but what are your thoughts around a more convenient flag or fuzz-specific env variable to allow more direct control of the location of the generated corpus?

Thanks, and exciting to see the progress you have been making! 🚀

josharian · 2021-02-21T02:54:41Z

I’d like to second the notion that corpuses are valuable and should be (able to be) retained and shared. It’s not just computational resources to generate them. Some corpuses I’ve developed alongside the code, so they contain good coverage already for alternative implementations. And other corpuses are seeded manually, which can make a dramatic difference to fuzzing effectiveness.

katiehockman · 2021-02-23T17:50:57Z

It makes sense that the details of what is inside the generated corpus can effectively be opaque to ordinary users, but the generated corpus itself is fairly valuable, and often worthy of keeping around & building up over time.

Agreed. Even though it defaults to$GOCACHE, we are still making sure that go clean would not delete it unless specified. I added those details to the design draft, so thanks for bringing that up. From @jayconrod's CL where he implemented this:

Although files are written into a subdirectory of GOCACHE, they are
not removed automatically, nor are they removed by 'go clean -cache'.
Instead, they may be removed with 'go clean -fuzzcache'.

But to your point about making this customizable, I've gone ahead and added that to the open issues section of the design draft. It's something that we'll likely be considering in the future, it is just unlikely to make it into the first experimental release: https://go.googlesource.com/proposal/+/master/design/draft-fuzzing.md

FiloSottile changed the title ~~create a 'golang.org/x/corpus' repo for prototype of "fuzzing as a first class citizen"~~ x/corpus: new repository for fuzzing corpus data Apr 2, 2019

gopherbot added this to the Unreleased milestone Apr 2, 2019

FiloSottile added the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label Apr 2, 2019

FiloSottile changed the title ~~x/corpus: new repository for fuzzing corpus data~~ x/fuzzdata: new repository for fuzzing corpus data Apr 2, 2019

thepudds mentioned this issue Apr 6, 2019

encoding/json: add sample fuzz test for prototype of "fuzzing as a first class citizen" #31309

Closed

thepudds mentioned this issue May 2, 2019

Golang internal library fuzzers google/oss-fuzz#2188

Merged

thepudds mentioned this issue Dec 9, 2021

net/netip: add fuzz test comparing with net.ParseIP #49367

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

x/fuzzdata: new repository for fuzzing corpus data #31215

x/fuzzdata: new repository for fuzzing corpus data #31215

thepudds commented Apr 2, 2019 •

edited

Loading

FiloSottile commented Apr 2, 2019

dmitshur commented Apr 2, 2019

FiloSottile commented Apr 2, 2019 •

edited

Loading

kybin commented Apr 2, 2019

thepudds commented Apr 2, 2019

bradfitz commented Apr 2, 2019

thepudds commented Apr 2, 2019

dvyukov commented Apr 3, 2019

dvyukov commented Apr 3, 2019

thepudds commented Apr 3, 2019 •

edited

Loading

thepudds commented Feb 20, 2021

josharian commented Feb 21, 2021

katiehockman commented Feb 23, 2021

x/fuzzdata: new repository for fuzzing corpus data #31215

x/fuzzdata: new repository for fuzzing corpus data #31215

Comments

thepudds commented Apr 2, 2019 • edited Loading

Summary

Background

Additional Details

Populating the corpus

FiloSottile commented Apr 2, 2019

dmitshur commented Apr 2, 2019

FiloSottile commented Apr 2, 2019 • edited Loading

kybin commented Apr 2, 2019

thepudds commented Apr 2, 2019

bradfitz commented Apr 2, 2019

thepudds commented Apr 2, 2019

dvyukov commented Apr 3, 2019

dvyukov commented Apr 3, 2019

thepudds commented Apr 3, 2019 • edited Loading

thepudds commented Feb 20, 2021

josharian commented Feb 21, 2021

katiehockman commented Feb 23, 2021

thepudds commented Apr 2, 2019 •

edited

Loading

FiloSottile commented Apr 2, 2019 •

edited

Loading

thepudds commented Apr 3, 2019 •

edited

Loading