Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/fuzzdata: new repository for fuzzing corpus data #31215

Open
thepudds opened this issue Apr 2, 2019 · 13 comments
Open

x/fuzzdata: new repository for fuzzing corpus data #31215

thepudds opened this issue Apr 2, 2019 · 13 comments
Labels
NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made.
Milestone

Comments

@thepudds
Copy link
Contributor

thepudds commented Apr 2, 2019

Summary

In #30719 and #30979, dvyukov/go-fuzz compatible fuzz functions were landed in:

The follow-up item in this issue here is to add a new golang.org/x/corpus repository to hold an example fuzzing corpus for portions of the standard library and x subrepos. This is to help with the exploration requested by the core Go team in discussion of the #19109 proposal to "make fuzzing a first class citizen".

The name for the new corpus repo alternatively could be golang.org/x/fuzz or something else; some additional naming discussion below.

Note that this issue is at least currently intended to be solely about creating the repository itself, and this issue does not cover checking in any corresponding fuzzing corpus (which is likely to be a follow-up issue).

Background

See the "Background" section of #30719 or #19109 (comment).

Additional Details

As part of the #19109 proposal discussion, there were multiple comments/requests from the core Go team asking to develop a better understanding of how a fuzzing corpus looks and behaves when it resides in a repository. For example, this March 2017 request in #19109 (comment) from Russ:

add fuzz tests to at least the x subrepos and maybe the standard library, so that we can understand the implications of having them in the source repos (including how much space we're going to spend on testdata/fuzz corpus directories

In the March 2017 proposal document in #19109 (comment), @dvyukov proposed:

For the standard library it is proposed to check in corpus into golang.org/x/fuzz repo.

golang.org/x/fuzz seems to be a very reasonable name. Two alternative possible names:

  • golang.org/x/corpus
  • golang.org/x/fuzzcorpus

Personally, I think x/corpus or x/fuzzcorpus might be more evocative than x/fuzz (e.g., someone might incorrectly think x/fuzz is where the fuzzing implementation or fuzzing functions live), but I do not have a strong opinion on the name and suspect others will have stronger opinions. Until there is additional feedback on the name, the rest of this comment here will use the term golang.org/x/corpus.

Populating the corpus

The initial seeding of the corpus can likely come from https://github.com/dvyukov/go-fuzz-corpus for a given Fuzz function. If that happens, then right now, there would be two corpus directories populated: go-fuzz-corpus/png/corpus and go-fuzz-corpus/tiff/corpus.

In parallel, multiple people are making progress on integrating dvyukov/go-fuzz into oss-fuzz in google/oss-fuzz#36, google/oss-fuzz#2188, dvyukov/go-fuzz#213 and elsewhere. Most likely, it will make sense to periodically update the golang.org/x/corpus repo with the output from oss-fuzz (which otherwise by default is stored in a Google Cloud Storage bucket).

However, the exact mechanism of populating and updating golang.org/x/corpus repo I think can be discussed outside of this particular issue. (Reason: In general, it seems more tractable to make progress if things are broken down into more manageable discrete chunks of work, especially to break out into separate steps the things that must be done by someone on the core Go team, vs. could be done by someone from the broader community. This issue is focused on the creation of the repo, which presumably must be done by someone on the core Go team, whereas populating the corpus can be done by a greater range of people from the broader community in consultation with the core Go team).

I am of course happy to discuss anything here, and happy to be corrected if any of the above is different than how people would like to proceed.

CC @dvyukov @josharian @FiloSottile @bradfitz @acln0

@FiloSottile FiloSottile changed the title create a 'golang.org/x/corpus' repo for prototype of "fuzzing as a first class citizen" x/corpus: new repository for fuzzing corpus data Apr 2, 2019
@gopherbot gopherbot added this to the Unreleased milestone Apr 2, 2019
@FiloSottile FiloSottile added the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label Apr 2, 2019
@FiloSottile
Copy link
Contributor

LGTM. I suspect bringing the full oss-fuzz corpus back into the repo might bloat it too much, but we definitely need a place for seed corpus, so this sounds good.

@dmitshur
Copy link
Contributor

dmitshur commented Apr 2, 2019

Related issue: #14304.

@FiloSottile
Copy link
Contributor

FiloSottile commented Apr 2, 2019

Oh, how about x/fuzzdata? x/corpus does not imply fuzzing to everyone I suspect. (Bikeshed!)

@kybin
Copy link
Contributor

kybin commented Apr 2, 2019

will the repository used only for fuzzing?

@thepudds
Copy link
Contributor Author

thepudds commented Apr 2, 2019

For reasons of clarity, I personally like x/fuzzdata or x/fuzzcorpus over x/corpus or x/fuzz. I led off the issue using x/corpus mainly because it seems the core Go team generally places great weight on brevity.

@bradfitz
Copy link
Contributor

bradfitz commented Apr 2, 2019

I was also going to suggest fuzzdata.

@thepudds
Copy link
Contributor Author

thepudds commented Apr 2, 2019

@kybin

will the repository used only for fuzzing?

That is the current intent of the suggestion in this issue.

@FiloSottile FiloSottile changed the title x/corpus: new repository for fuzzing corpus data x/fuzzdata: new repository for fuzzing corpus data Apr 2, 2019
@dvyukov
Copy link
Member

dvyukov commented Apr 3, 2019

LGTM. I suspect bringing the full oss-fuzz corpus back into the repo might bloat it too much, but we definitely need a place for seed corpus, so this sounds good.

If it's only for seed corpus, then seed corpus may belong to the main repo better. We do want to use it in unit tests too! See dvyukov/go-fuzz#218 (comment) for more explanation.

@dvyukov
Copy link
Member

dvyukov commented Apr 3, 2019

To clarify, I mean only seed corpus, i.e. small number of high-quality inputs without significant churn (no more than what happens with code). For a fuzzer-generated corpus we do need a separate repo.

@thepudds
Copy link
Contributor Author

thepudds commented Apr 3, 2019

Part of what is tricky is there are really 2 things running in parallel, I think:

  1. Actually fuzzing the stdlib much more (mainly via oss-fuzz).

  2. Use the stdlib and x subrepos as a “pilot” for the “first class fuzzing” proposal (at Russ’ suggestion).

For 2., it seems important that golang.org/x/fuzzdata gets created so that the stdlib and x subrepos can be a pilot of a decent size project that stores a decent sized corpus with some churn in a repo... and I think that is important even under the split corpus idea in dvyukov/go-fuzz#218 (comment), which I commented on there yesterday (probably too enthusiastically, sorry).

@thepudds
Copy link
Contributor Author

Hi @katiehockman, I had a question for you about the generated corpus under the current draft design for first class fuzzing, which in turn might have some implications for what to do this with issue.

The draft includes:

The generated corpus will be in a new directory within $GOCACHE, in the form $GOCACHE/fuzz/$pkg/$test/$name,

and also:

A generated corpus will be managed by the fuzzing engine and will live outside the module in a subdirectory of $GOCACHE. This generated corpus will grow as the fuzzing engine discovers new coverage.

The details of how the corpus is built and processed should be unimportant to users.

It makes sense that the details of what is inside the generated corpus can effectively be opaque to ordinary users, but the generated corpus itself is fairly valuable, and often worthy of keeping around & building up over time.

Given that generated corpus is valuable, it can be fairly useful to allow more direct control of the location of the generated corpus (to allow for easier storage in a separate repo, or cloud storage, or a shared filesystem, or easier integration with the diversity of CI systems in use, or purpose built fuzzing systems like OSS-Fuzz)

The prior first class fuzzing proposal went through a couple iterations on corpus behavior, but the behavior in the fzgo prototype of that proposal ultimately defaulted to reading from testdata (e.g., for a seed corpus), wrote to GOPATH/pkg/fuzz by default as a local cache for the generated corpus, but also allowed more direct control of the generated corpus via an optional -fuzzdir flag to support use cases such as those mentioned just above.

GOCACHE certainly can be made to work in different scenarios, but what are your thoughts around a more convenient flag or fuzz-specific env variable to allow more direct control of the location of the generated corpus?

Thanks, and exciting to see the progress you have been making! 🚀

@josharian
Copy link
Contributor

I’d like to second the notion that corpuses are valuable and should be (able to be) retained and shared. It’s not just computational resources to generate them. Some corpuses I’ve developed alongside the code, so they contain good coverage already for alternative implementations. And other corpuses are seeded manually, which can make a dramatic difference to fuzzing effectiveness.

@katiehockman
Copy link
Contributor

It makes sense that the details of what is inside the generated corpus can effectively be opaque to ordinary users, but the generated corpus itself is fairly valuable, and often worthy of keeping around & building up over time.

Agreed. Even though it defaults to$GOCACHE, we are still making sure that go clean would not delete it unless specified. I added those details to the design draft, so thanks for bringing that up. From @jayconrod's CL where he implemented this:

Although files are written into a subdirectory of GOCACHE, they are
not removed automatically, nor are they removed by 'go clean -cache'.
Instead, they may be removed with 'go clean -fuzzcache'.

But to your point about making this customizable, I've gone ahead and added that to the open issues section of the design draft. It's something that we'll likely be considering in the future, it is just unlikely to make it into the first experimental release: https://go.googlesource.com/proposal/+/master/design/draft-fuzzing.md

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made.
Projects
None yet
Development

No branches or pull requests

9 participants