Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/go: uses of os/exec can trigger fuzz failures when stopping via ^C #50228

Open
mvdan opened this issue Dec 16, 2021 · 11 comments
Open

cmd/go: uses of os/exec can trigger fuzz failures when stopping via ^C #50228

mvdan opened this issue Dec 16, 2021 · 11 comments

Comments

@mvdan
Copy link
Member

@mvdan mvdan commented Dec 16, 2021

$ go version
go version devel go1.18-9d0ca262bb Wed Dec 15 00:33:55 2021 +0000 linux/amd64
$ cat fuzz_test.go 
package main

import (
	"os/exec"
	"testing"
)

func FuzzFoo(f *testing.F) {
	f.Fuzz(func(t *testing.T, n uint64) {
		if _, err := exec.Command("sleep", "30").CombinedOutput(); err != nil {
			t.Fatal(err)
		}
	})
}
$ go test -fuzz=.
warning: starting with empty corpus
fuzz: elapsed: 0s, execs: 0 (0/sec), new interesting: 0 (total: 0)
^Cfuzz: elapsed: 1s, execs: 1 (1/sec), new interesting: 0 (total: 0)
--- FAIL: FuzzFoo (1.21s)
    --- FAIL: FuzzFoo (1.07s)
        fuzz_test.go:11: signal: interrupt
    
    Failing input written to testdata/fuzz/FuzzFoo/3fe7fc020b056d7f28108224509370cb47eb5c199b3fcfd6fea2d67aaa6db9f2
    To re-run:
    go test -run=FuzzFoo/3fe7fc020b056d7f28108224509370cb47eb5c199b3fcfd6fea2d67aaa6db9f2
FAIL
exit status 1
FAIL	test	1.213s

In this case, I hit ^C a.k.a. SIGINT after about one second. I can reproduce this kind of error about 20% of the time; I was able to reproduce with this minimal fuzz func on the fourth try, for instance.

This is a minimal reproducer, using /bin/sleep for the sake of simplicity. My real fuzz func calls a reasonable program that exits within milliseconds. But if I stop the fuzzer just at the right moment with ^C, I might get a failure that's not really a relevant failure. The code is here: https://github.com/mvdan/sh/blob/4a4c1600341f6413719f5e8a50105b8d13a96d6d/syntax/fuzz_test.go#L82-L94

I don't think my code is to blame here. I could possibly expand my err != nil check to ignore signal: interrupt errors, but that doesn't seem right either - any use of os/exec would need to also be adapted to handle such a case.

I think the solution should come from cmd/go's fuzzing - if the user hits ^C, any fuzz failures that happen afterwards should be ignored entirely.

cc @katiehockman

@mvdan
Copy link
Member Author

@mvdan mvdan commented Dec 16, 2021

I tried to write a reproducer that fails ~100% of the time, but so far I've failed.

@cherrymui cherrymui added this to the Backlog milestone Dec 16, 2021
@cherrymui cherrymui added the fuzz label Dec 16, 2021
@cherrymui cherrymui removed this from the Backlog milestone Dec 16, 2021
@cherrymui cherrymui added this to the Go1.18 milestone Dec 16, 2021
@katiehockman
Copy link
Member

@katiehockman katiehockman commented Dec 20, 2021

@cespare
Copy link
Contributor

@cespare cespare commented Dec 20, 2021

When you hit ^C you're sending SIGINT to the foreground process group, including all the sleep processes. So it's a race as to whether the killed sleep triggers a failure first or whether go test exits first.

If instead of ^C you use kill -int to selectively kill a sleep process, you can make the fuzzer fail 100% of the time.

if the user hits ^C, any fuzz failures that happen afterwards should be ignored entirely.

I don't think this eliminates the race. The fuzz failure can happen before go test sees SIGINT.

To extend the idea, when the fuzzer is exiting due to a signal, it can suppress any failures that occurred during the last, say, 50ms. That should reduce the race window to ~0 but it seems really hacky.

(In general using -fuzztime rather than ^C as much as possible seems like a good idea.)

@mvdan
Copy link
Member Author

@mvdan mvdan commented Jan 4, 2022

(In general using -fuzztime rather than ^C as much as possible seems like a good idea.)

I'm not sure I agree. It can be hard to estimate how long one should fuzz for. Sometimes I fuzz for five minutes, and the corpus stops expanding almost entirely. Sometimes I fuzz for twenty minutes, and the corpus keeps expanding. Being able to stop the fuzzing process interactively is helpful.

When you hit ^C you're sending SIGINT to the foreground process group, including all the sleep processes. So it's a race as to whether the killed sleep triggers a failure first or whether go test exits first.

Good observation, I hadn't thought of that.

To extend the idea, when the fuzzer is exiting due to a signal, it can suppress any failures that occurred during the last, say, 50ms. That should reduce the race window to ~0 but it seems really hacky.

From the user's perspective, this solution seems fine. If I'm manually hitting ^C, I really don't care that much about a few hundred milliseconds being lost. That's usually what happens with "cleanup" until the process actually exits, anyway. It may not be a foolproof mechanism, but if it reduces the odds of false positive failures to near-zero, that seems like a big step forward.

@rogpeppe
Copy link
Contributor

@rogpeppe rogpeppe commented Jan 4, 2022

Another possibility might just be to start the process in a new process group.

@cespare
Copy link
Contributor

@cespare cespare commented Jan 4, 2022

Another possibility might just be to start the process in a new process group.

Which process? The sleep process that FuzzFoo creates, or the parent process that go test creates?

@rolandshoemaker
Copy link
Member

@rolandshoemaker rolandshoemaker commented Jan 5, 2022

This is a bit complicated. I thought we explicitly recommended against using fuzzing in this way. We don't, but we probably should. With the way the fuzzing engine is designed, fuzzing things outside of the runtime is likely to have other strange behaviors other than this anyway.

Throwing away things that happen in some interval before exiting could work, but it would also end up masking some real failures. I also cannot think of an elegant implementation of this that doesn't simply rely on buffering findings, which would end up adding a not insignificant amount of complexity.

Given that this is a pretty narrow issue, and that we don't have all that much time left, I think for 1.18 we should just document that this is not a recommended usage of the fuzzer, and is likely to lead to to such issues, and consider if there is a larger structural change we could make for this for 1.19.

@rolandshoemaker rolandshoemaker removed this from the Go1.18 milestone Jan 5, 2022
@rolandshoemaker rolandshoemaker added this to the Backlog milestone Jan 5, 2022
@cespare
Copy link
Contributor

@cespare cespare commented Jan 5, 2022

I thought we explicitly recommended against using fuzzing in this way. We don't, but we probably should. With the way the fuzzing engine is designed, fuzzing things outside of the runtime is likely to have other strange behaviors other than this anyway.

What exactly do "this way" and "fuzzing things outside of the runtime" mean? (IOW, what are you proposing to recommend against for Go 1.18?)

Do you have examples of the "other strange behaviors"?

@cespare
Copy link
Contributor

@cespare cespare commented Jan 5, 2022

Throwing away things that happen in some interval before exiting could work, but it would also end up masking some real failures.

As long as the suppression window is a lot smaller than the fuzzing duration, this doesn't seem significant. For example, suppose you suppress failures from the last 50ms of the fuzzing run. In this scenario, running the fuzzer for 5 seconds would be equivalent to running the fuzzer for 4.95s without suppression.

Also, you wouldn't suppress results when exiting due to -fuzztime, so you only "lose out" on a small amount of fuzzing time when you're relying on ^C to end the fuzzing run, which is inherently inexact anyway.

@thepudds
Copy link

@thepudds thepudds commented Jan 5, 2022

implementation of this that doesn't simply rely on buffering findings

Given finding crashers is rare, rather than continually buffering, would it be reasonable to instead effectively pause when a crasher is found but before reporting it? In other words, the new behavior could kick in only when it would have been about to report a crasher? Or maybe that’s what you were already saying…

@thepudds
Copy link

@thepudds thepudds commented Jan 5, 2022

And sorry for the double post, but just one more quick comment to agree with @mvdan that it would be nice to handle this gracefully eventually. It can be helpful to be able to execute an external program for comparative fuzzing, even if you don’t get coverage with the external binary.

For example, from the go-fuzz-corpus:

https://github.com/dvyukov/go-fuzz-corpus/blob/master/gotypes/main.go#L225

which found a bunch of real bugs comparing go/types, cmd/compile, and gccgo, where only go/types had coverage instrumentation in that example (I think).

Alll that said, it seems reasonable to do whatever is expedient for 1.18, which might just be a note in the documentation or similar.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
7 participants