
new_edge and new_features stat may be incorrect (e.g. -max_len is used or -use_value_profile) #802

Open
3 of 6 tasks
Dor1s opened this issue Aug 8, 2019 · 11 comments
Labels
bug Something isn't working

Comments

Dor1s commented Aug 8, 2019

  1. If we fuzz with some -max_len value, the initial_edge_coverage stat will not represent the actual edge coverage of the full corpus, because libFuzzer ignores everything past the first max_len bytes of every input.

  2. During the merge, we deliberately remove the -max_len argument, in order to avoid truncating the corpus and losing coverage we already have.

  3. The resulting edge_coverage is extracted from the merge log and might be greater than the initial_edge_coverage even if we didn't generate any inputs, simply because -max_len caused the initial coverage to be incomplete.

  4. The resulting new_edge metric is calculated as the difference between these two (excluding the corpus subset strategy, where we noticed the very same problem and fixed it some time ago).

I don't have a good solution in mind. We could skip the new_edge metric for the random max length strategy, but that would mean losing the signal for that strategy.

My best idea so far is to make libFuzzer's merge give us all the numbers (i.e. initial coverage + new coverage), but that would complicate libFuzzer's interface, so I don't really like it either.
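
To make the bias concrete, here is a minimal sketch of the arithmetic (the names and numbers are illustrative only, not the actual ClusterFuzz launcher code):

```python
# Illustrative sketch only; not the real ClusterFuzz code.

def compute_new_edges(initial_edge_coverage, final_edge_coverage):
  """new_edges as currently derived from the two values parsed from the logs."""
  return final_edge_coverage - initial_edge_coverage

# Fuzzing run started with -max_len=64: inputs were truncated, so only part
# of the corpus coverage was observed before fuzzing began.
initial_edge_coverage = 1200  # under-counted because of -max_len truncation

# The merge runs without -max_len, so the full corpus is exercised.
final_edge_coverage = 1350

# Reported as 150 "new" edges even if fuzzing discovered nothing new.
print(compute_new_edges(initial_edge_coverage, final_edge_coverage))
```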

UPD: we don't use value profiling during the merge, which means we might be missing some features.

UPD 2: One more thing that came up during discussion with @jonathanmetzman today. With the new merge workflow, we might want to run gsutil cp and gsutil rm after every fuzzing run to effectively "prune" the corpus. That way the recurring corpus pruning task would carry much less load and become more of a clean-up / stats collection task than a real, heavy corpus pruning job.

Action Items

  • 1. Modify libFuzzer's merge process so that it can provide accurate stats. Done in https://reviews.llvm.org/D66107.
  • 2. Switch ClusterFuzz's launcher code to use the new merge process.
  • 3. Monitor the new behavior, debug and fix any failures.
  • 4. Revisit the old code for parsing merge stats (and other overlapping stats); consider removing it (e.g. the merge_edge_coverage column) and the sha1 check for corpus filenames.
  • 5. Rewrite _is_multistep_merge_supported using a less hacky approach (check the help message); see the sketch after this list.
  • 6. Update Android test binaries: Add two-step merge testing for Android #1732.
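
A possible shape for item 5, sketched under the assumption that the check simply scans the target's -help=1 output for a flag the two-step merge relies on (the flag choice and the rest are assumptions, not the actual ClusterFuzz implementation):

```python
# Hypothetical sketch; the real launcher helper may look different.
import subprocess

def _is_multistep_merge_supported(target_path):
  """Return True if the libFuzzer target lists the merge_control_file flag,
  which the two-step merge relies on, in its -help=1 output."""
  try:
    result = subprocess.run([target_path, '-help=1'],
                            capture_output=True, text=True, timeout=30)
  except (OSError, subprocess.TimeoutExpired):
    return False
  return 'merge_control_file' in result.stdout + result.stderr
```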
Dor1s added the bug (Something isn't working) label on Aug 8, 2019
Dor1s commented Aug 8, 2019

@mbarbella-chromium @mukundv-chrome for the short term, should we exclude random max length strategy? What'll be the best option for your ongoing experiment(s)?

Another bad idea is to do an extra run with -runs=0 on the full corpus to get the real initial_edge_coverage. We could do it only for the runs where random max length is used, but that would be quite expensive and unwise anyway :(
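
For reference only, such an extra pass would look roughly like the sketch below: run the target over the full corpus with -runs=0 and read the coverage value from the INITED line (names are illustrative and the exact log wording may differ):

```python
# Sketch of the dismissed idea: a separate -runs=0 pass over the full corpus
# (without -max_len) to observe the real initial edge coverage.
import re
import subprocess

def real_initial_edge_coverage(target_path, corpus_dir):
  proc = subprocess.run([target_path, '-runs=0', corpus_dir],
                        capture_output=True, text=True)
  # libFuzzer logs to stderr, e.g. "#1234 INITED cov: 567 ft: 890 ...".
  match = re.search(r'INITED\s+cov:\s+(\d+)', proc.stderr)
  return int(match.group(1)) if match else None
```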

Dor1s commented Aug 8, 2019

> @mbarbella-chromium @mukundv-chrome for the short term, should we exclude random max length strategy? What'll be the best option for your ongoing experiment(s)?

Discussed offline. The experimental data should be good for now, so I'll upload a CL to record new_edges = 0 when the random max length strategy is used.

Dor1s self-assigned this on Aug 8, 2019
Dor1s commented Aug 9, 2019

@inferno-chromium suggested a nice solution on the libFuzzer side (print stats after reading the first corpus dir). I've uploaded a CL as https://reviews.llvm.org/D66020, but we probably need to choose some other keyword: INITED is supposed to indicate that ALL input files have been read, so in our case it's kind of PARTIALLY INITED :)
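
For context, the launcher would then be able to diff two stats lines from a single merge log: one printed after the initial (output) corpus dir is read, and one printed at the end. A rough parsing sketch follows; the exact wording of the log line is an assumption here:

```python
# Sketch only; the exact MERGE-OUTER line format is approximate.
import re

MERGE_STATS_RE = re.compile(
    r'MERGE-OUTER:\s+(\d+) new files with (\d+) new features added; '
    r'(\d+) new coverage edges')

def extract_merge_stats(merge_log):
  """Return (new_files, new_features, new_edges) from the last stats line."""
  matches = MERGE_STATS_RE.findall(merge_log)
  if not matches:
    return None
  # With the D66020 change there can be an intermediate stats line as well;
  # the last occurrence reflects the completed merge.
  return tuple(int(value) for value in matches[-1])
```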

dtzWill pushed a commit to llvm-mirror/compiler-rt that referenced this issue Aug 9, 2019
Summary:
The purpose is to be able to extract the number of new edges added to
the original (i.e. output) corpus directory after doing the merge. Use case
example: in ClusterFuzz, we do merge after every fuzzing session, to avoid
uploading too many corpus files, and we also record coverage stats at that
point. Having a separate line indicating stats after reading the initial output
corpus directory would make the stats extraction easier for both humans and
parsing scripts.

Context: google/clusterfuzz#802.

Reviewers: morehouse, hctim

Reviewed By: hctim

Subscribers: delcypher, #sanitizers, llvm-commits, kcc

Tags: #llvm, #sanitizers

Differential Revision: https://reviews.llvm.org/D66020

git-svn-id: https://llvm.org/svn/llvm-project/compiler-rt/trunk@368461 91177308-0d34-0410-b5e6-96231b3b80d8
llvm-git-migration pushed a commit to llvm/llvm-project that referenced this issue Aug 9, 2019
(Same commit message as above; llvm-svn: 368461.)
Dor1s commented Sep 3, 2019

One more relevant problem: we don't use -use_value_profile=1 during the merge, which means we discard some features that could have been discovered during fuzzing. That explains why the new_features stat so often has non-zero values.

I'll fix it as well when making the other merge changes.
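
Roughly speaking, the fix boils down to passing the same -use_value_profile setting to the merge invocation that was used during fuzzing. A minimal sketch, with made-up argument plumbing rather than the actual launcher code:

```python
# Illustrative only; argument plumbing is hypothetical.
def build_merge_args(target_path, output_corpus_dir, new_units_dir,
                     use_value_profile=False):
  args = [target_path, '-merge=1']
  if use_value_profile:
    # Keep value profiling consistent between fuzzing and merge so features
    # found with -use_value_profile=1 are not dropped at merge time.
    args.append('-use_value_profile=1')
  # The first directory is the one units are merged into; the second holds
  # the newly generated units.
  args += [output_corpus_dir, new_units_dir]
  return args
```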

Dor1s changed the title from "new_edge stat may be incorrect when -max_len is used" to "new_edge and new_features stat may be incorrect (e.g. -max_len is used or -use_value_profile)" on Sep 3, 2019
Dor1s commented Sep 6, 2019

One more thing that came up during discussion with @jonathanmetzman today. With the new merge workflow, we might want to run gsutil cp and gsutil rm after every fuzzing run to effectively "prune" the corpus. That way the recurring corpus pruning task would carry much less load and become more of a clean-up / stats collection task than a real, heavy corpus pruning job.
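
A minimal sketch of what that per-run sync could look like, assuming plain gsutil invocations from the bot (the helper name and layout are made up for illustration):

```python
# Illustrative sketch only; helper name and layout are hypothetical.
import os
import subprocess

def sync_corpus_after_run(local_corpus_dir, gcs_corpus_url):
  """Upload units the merge kept; a matching `gsutil rm` pass (not shown)
  would delete the remote units the merge pruned."""
  files = [os.path.join(local_corpus_dir, f)
           for f in sorted(os.listdir(local_corpus_dir))]
  if files:
    # -n skips files that already exist at the destination.
    subprocess.check_call(['gsutil', '-m', 'cp', '-n'] + files +
                          [gcs_corpus_url])
```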

Dor1s commented Sep 19, 2019

FTR, the most recent Clang roll in Chromium should include my -merge=1 change, so that is not a blocker anymore, but there are still some other things that can delay switching to the new merge logic.

Dor1s added a commit that referenced this issue Dec 11, 2019
libFuzzer: use two step merge after fuzzing for accurate stats (#802). (#815)

* Use the two step merge process to collect accurate stats for the new corpus units.

* Working on the tests

* add files for two merge steps

* Make the tests work with two step process

* improve the tests, add Minijail, Fuchsia, and Android support.

* self review and minor clean up

* add default values for device specific dirs

* Address comments by Jonathan

* Refactor minimize_corpus and revert to the original interface

* Address other review comments

* Minor fixes / comments clarifications.

* Adjust the timeout after the first step

* address final comments, remove old stats code, test everywhere

* Fix old stats test
Dor1s added a commit that referenced this issue Dec 11, 2019
Dor1s added a commit that referenced this issue Dec 12, 2019
Dor1s added a commit that referenced this issue Dec 12, 2019
#802). (#815)"

(The revert quotes the original PR description, as above.)
Dor1s added a commit that referenced this issue Dec 12, 2019
#802). (#815)" (#1261)

* Reland "libFuzzer: use two step merge after fuzzing for accurate stats (#802). (#815)"

* (The original PR's commit list is repeated here, as above.)

* Add a fallback to the single step merge + convert assertions into errors

* Patch existing tests to work

* Update integration test and add a new one.

* add dict and options file for the old fuzzer

* make the check work for Fuchsia

* Another fix for Fuchsia
Dor1s commented Dec 16, 2019

I've checked BigQuery stats for the past 15 days for both Chromium ClusterFuzz and OSS-Fuzz.

The change has definitely had an effect -- I can share stats with Googlers.

One noticeable and expected change is that the average new_edges value has dropped.

Some other movements are also evident, though they are hard to comment on, as the stats were previously incomplete because we skipped recording them for certain strategies.

Dor1s added a commit that referenced this issue Dec 17, 2019
Use correct exception for timeout detected in _minimize_corpus_two_step (#802).

I think this case is very unlikely to happen, but better safe than sorry I guess.
Dor1s added a commit that referenced this issue Dec 17, 2019
Use correct exception for timeout detected in _minimize_corpus_two_step (#802). (#1289)

* Use correct exception for timeout detected in _minimize_corpus_two_step (#802).

I think this case is very unlikely to happen, but better safe than sorry I guess.

* Also add logging of failed merge output in _merge_new_units

* fix a typo
Dor1s commented Dec 17, 2019

Per Abhishek's suggestion, I've checked whether the new implementation affected the number of post-fuzzing merge timeouts. Looking at the logs for the past 7+ days, the overall number of timeouts is not affected, which is expected, since the overall complexity of performing a merge hasn't changed much.

# Chromium ClusterFuzz, number of "Merging new testcases timed out" errors a day.
2019-12-10 : 1423
2019-12-11 : 1246
2019-12-12 : 1435 # two step merge was deployed at the end of this day
2019-12-13 : 1497
2019-12-14 : 1330
2019-12-15 : 1363
2019-12-16 : 1446
2019-12-17 : 1128 # not a full day

Dor1s commented Dec 26, 2019

Re-evaluating some strategies / metrics after the change has been deployed for ~2 weeks (access to the links is restricted for non-Googlers):

  1. DFT on OSS-Fuzz: https://docs.google.com/spreadsheets/d/14hAEuRsfofSiInUTR7KA48SbGY4QPuxHOzObjVPInQk/edit#gid=1395928105

     On average, the DFT strategy is still losing to the runs where it's not being used.

  2. ML RNN on ClusterFuzz: https://docs.google.com/spreadsheets/d/13te_2ZcRQhbm25B_h2Xg8-TFV0TBn4TSSBmbTNLy2_E/edit#gid=0

     ML RNN looks pretty good and seems to be useful: 66.7% of fuzz targets perform better with ML RNN, while only 32% of targets perform better without it (all on average).

  3. MAB probabilities: https://datastudio.google.com/c/u/0/reporting/1crV7jDmoFh25j2ZMATEMa3X0HfsKqGOQ/page/QAox

     The graphs seem to be broken, possibly due to #1304.

I've run some queries myself in https://docs.google.com/spreadsheets/d/1dNgK9xg9zkdds6ckFalNWniVI2FtVZylkfcFKR52kX0/edit#gid=0

It might be a little hard to read, but a great observation here is how the fork strategy has been showing up among half of the top strategies since Dec 16th. Until then, the fork strategy was never in the top 10, simply because we did not have good stats for it. The ML RNN probability has also jumped by ~10-20%, but that might be a temporary fluctuation.

Once the Data Studio charts are fixed, we should be able to observe even more changes.

li-zhipeng pushed a commit to webrtc-lizp/chromium-llvm-project-compiler-rt-lib-fuzzer that referenced this issue Jan 6, 2020
(Same commit message as above.)
nwellnhof commented:
> One more relevant problem: we don't use -use_value_profile=1 during the merge, which means we discard some features that could have been discovered during fuzzing. That explains why the new_features stat so often has non-zero values.
>
> I'll fix it as well when making the other merge changes.

Has this been fixed?

Dor1s commented Feb 26, 2024

> One more relevant problem: we don't use -use_value_profile=1 during the merge, which means we discard some features that could have been discovered during fuzzing. That explains why the new_features stat so often has non-zero values.
>
> I'll fix it as well when making the other merge changes.
>
> Has this been fixed?

I think so: #1915
