-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pkg/fuzzer: retry inputs from crashed VMs #4676
Conversation
4812cbb
to
05ae0c3
Compare
Non-finished requests at the time of a crash are dangerous because one of them is likely to crash the instance again. Let's give these inputs one more chance, but under certain conditions: 1) The VM has been running long enough, so we may risk crashing it. The PR sets the restart budget to 10%. 2) Don't feed more than 1 unsafe input per 30 seconds
05ae0c3
to
3bf2780
Compare
@dvyukov I've just pushed a second commit with an experimental implementation of crash avoidance. Wdyt about this approach? |
f90449e
to
155ec81
Compare
What I see from local runs:
It looks like we'd better be able to dynamically enable/disable calls during fuzzing. E.g. keep two choice tables in
But:
|
This approach also seems to work quite well: We give a crash budget (0.001 for non-risky VMs, 0.01 for risk-ready VMs) and, using the estimated probability for every program, sample them to fit the risk into the budget. Cons:
|
Another, probably even easier, approach could be to just add some |
Or more generally: a function that transforms a program into a "safe" version. We already have something similar for argument sanitation. |
syz-manager/rpc.go
Outdated
return false | ||
} | ||
|
||
// Don't sent too many risky inputs at once, otherwise we won't know which one was truly bad. | ||
const riskyOnceIn = time.Minute / 2 | ||
const riskyOnceIn = time.Second |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of guessing this value I think it's better to explicitly keep track if a VM already has 1 risky input or not. We know when we gave it one, we know when it finished running it.
Some VMs may be very slow, yet set high procs, then they will cache multiple ricky programs.
pkg/fuzzer/retry.go
Outdated
return input | ||
} | ||
retryer.statRiskyDiscarded.Add(1) | ||
retryer.toBacklog(input, false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we say that it's not important?
pkg/fuzzer/retry.go
Outdated
func() int { | ||
return ret.delayed.Len() | ||
}, stats.StackedGraph("prog reruns")) | ||
|
||
go func() { | ||
for range time.NewTicker(time.Minute).C { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's expose it in the web interface instead. There is too much information in the whole manager to print it all periodically. Most of it is not interesting for most users in most cases.
Web interface allows to look specifically at the info one want to look at, with all verbosity, exactly when one want to look at it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like it belongs to /syscalls page. That table will also allow to sort by the value.
|
||
func (ce *crashEstimator) save(p *prog.Prog, prob float64, tentative bool) { | ||
if !ce.mu.TryLock() { | ||
if tentative { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any calls with tentative=true.
|
||
type crashEstimator struct { | ||
mu sync.RWMutex | ||
callProbs map[*prog.Syscall]*stats.AverageValue[float64] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This type can be made much simpler, faster and consume less memory if we use an array of calls (they have dense IDs and we know the max ID), don't allocate AverageValue lazily (we will allocate all of them anyway), AverageValue also has own mutex which we don't need/use here.
There are lots of small heap allocations, indirections, synchronization, etc.
It can be just []struct { crahed, ok atomic.Uint64 }
.
pkg/fuzzer/retry.go
Outdated
// We're okay if the instance crashes, so no checks are needed. | ||
return input | ||
} | ||
if attempts == 2 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like this. We pull more than we need, queue separately, but don't pull more than 2 because handle excess is bad, so we don't want to pull too much. So we get 2 crashing in the row, this still does not avoid the crash.
What about the idea of making programs safe? Something like:
if !mayCrash {
input.Prog.MakeSafe()
}
Looks more reasonable.
However, I am not sure what to do with non fuzz/gen program in this case (shouldn't modify them).
Alternatively, I think we should classify programs as risky earlier and queue them into separate queues.
155ec81
to
1bee55e
Compare
I've pushed an updated approach:
a) Precious -- if they contain dangerous calls, we want to still execute them, but probably later. If they were on a crashed VM, we give them one more chance. It's triage and hints. |
Ah, there must also be two choice tables in this case -- otherwise we won't give much more chance to those disabled calls. |
1bee55e
to
d2b482b
Compare
This is an experimental approach. It needs more evaluation.
d2b482b
to
de63f06
Compare
The retrying functionality was done in #4762. |
Non-finished requests at the time of a crash are dangerous because one of them is likely to crash the instance again.
Let's give these inputs one more chance, but under certain conditions:
The PR sets the restart budget to 10%.
This is another way to implement #4666