pkg/fuzzer: retry inputs from crashed VMs #4676

Closed
wants to merge 2 commits

Conversation

a-nogikh
Collaborator

Unfinished requests at the time of a crash are dangerous because one of them is likely to crash the instance again.

Let's give these inputs one more chance, but only under certain conditions:

  1. The VM has been running long enough, so we may risk crashing it.
    The PR sets the restart budget to 10%.
  2. Don't feed more than one unsafe input per 30 seconds.

This is another way to implement #4666.
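To make the conditions above concrete, here is a minimal Go sketch of the gating logic; the names unsafeRetryer, mayRetryUnsafe and restartCost are hypothetical and not taken from the actual PR code:

package fuzzer

import "time"

const (
	restartBudget = 0.10             // restarts may consume at most 10% of a VM's uptime (assumption)
	riskyOnceIn   = 30 * time.Second // at most one unsafe input per 30 seconds
)

type unsafeRetryer struct {
	lastUnsafe time.Time
}

// mayRetryUnsafe reports whether an unfinished (potentially crashing) input
// may be rerun on a VM, given its uptime and the average cost of a restart.
func (r *unsafeRetryer) mayRetryUnsafe(vmUptime, restartCost time.Duration) bool {
	// Condition 1: the VM has been running long enough that one more
	// restart stays within the restart budget.
	if float64(restartCost) > float64(vmUptime)*restartBudget {
		return false
	}
	// Condition 2: don't feed more than one unsafe input per 30 seconds.
	if time.Since(r.lastUnsafe) < riskyOnceIn {
		return false
	}
	r.lastUnsafe = time.Now()
	return true
}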

@a-nogikh
Collaborator Author

@dvyukov I've just pushed a second commit with an experimental implementation of crash avoidance. Wdyt about this approach?

@a-nogikh
Collaborator Author

a-nogikh commented Apr 11, 2024

What I see from local runs:

  • In general, individual calls do seem to be quite well associated with the probability of causing a crash.
  • At least on v6.9-rc3, the number of suspicious calls is large (10-20?).
  • If we evaluate every input from the fuzzer after it was generated and the number of bad calls is large, we have to discard/postpone too many programs.
    • Even if I wait only 5*bootTime before running risky programs and schedule a risky program every second, the backlog queue just keeps growing.

It looks like we'd better be able to dynamically enable/disable calls during fuzzing, e.g. by keeping two choice tables in Fuzzer:

  • One that only enables safe syscalls. It's used in smash jobs and in most exec fuzz / exec gen.
  • One with all calls. It's only used for a fraction of exec fuzz / exec gen for VMs that may take risky calls.

pkg/fuzzer records all statistics and once in a while regenerates the first choice table.

But:

  • The banned calls still remain in the corpus and may leak from there to VMs.
  • It won't scale well if we ever make the criteria more fine-grained (e.g. call combinations or arg values).
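The two-table idea above could look roughly like the sketch below; the types and method names are illustrative stand-ins, not syzkaller's actual prog.ChoiceTable API:

package fuzzer

import "sync"

// Illustrative stand-ins for syzkaller's syscall and choice-table types.
type Syscall struct{ Name string }
type ChoiceTable struct{ enabled map[*Syscall]bool }

type Fuzzer struct {
	mu     sync.Mutex
	safeCT *ChoiceTable // dangerous calls disabled; used for smash and most exec fuzz / exec gen
	fullCT *ChoiceTable // all calls enabled; used only for requests sent to risk-ready VMs
}

// choiceTable returns the table to use for a particular request.
func (f *Fuzzer) choiceTable(vmMayCrash bool) *ChoiceTable {
	f.mu.Lock()
	defer f.mu.Unlock()
	if vmMayCrash {
		return f.fullCT
	}
	return f.safeCT
}

// regenSafeTable is called once in a while with the calls that currently look
// statistically dangerous and rebuilds the safe table without them.
func (f *Fuzzer) regenSafeTable(all []*Syscall, dangerous map[*Syscall]bool) {
	enabled := make(map[*Syscall]bool)
	for _, call := range all {
		if !dangerous[call] {
			enabled[call] = true
		}
	}
	f.mu.Lock()
	f.safeCT = &ChoiceTable{enabled: enabled}
	f.mu.Unlock()
}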

@a-nogikh
Collaborator Author

a-nogikh commented Apr 11, 2024

This approach also seems to work quite well:
755e185

We give a crash budget (0.001 for non-risky VMs, 0.01 for risk-ready VMs) and, using the estimated crash probability of every program, sample programs so that the overall risk fits into the budget.

Cons:

  • It may "fake" quite a lot of crashes, so maybe our job.go must be more crash-tolerant itself (e.g. don't abort smash jobs on crashes, make triage not fail on a single crash, etc.).
  • Three attempts are not enough in 15% of cases (the risky progs fallback stat). Maybe it will be better with higher crash risk budgets.
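One way to read "sample them to fit the risk into the budget" is probabilistic acceptance, sketched below; this is an assumption about the mechanism, not the code in 755e185, and acceptProg is a hypothetical name:

package fuzzer

import "math/rand"

const (
	normalCrashBudget = 0.001 // for VMs we would rather not crash
	riskyCrashBudget  = 0.01  // for VMs that may take risks
)

// acceptProg decides whether to run a program with the given estimated crash
// probability on a VM with the given crash budget. Rejected programs would be
// postponed or discarded by the caller.
func acceptProg(rnd *rand.Rand, crashProb, budget float64) bool {
	if crashProb <= budget {
		return true // cheap enough to always run
	}
	// Accept with probability budget/crashProb so that the expected crash
	// rate of accepted runs stays around the budget.
	return rnd.Float64() < budget/crashProb
}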

@a-nogikh
Collaborator Author

Another, probably even easier, approach could be to just add some (skip) call attribute and assign it to individual program calls in this wrapper code. So if a call is statistically dangerous, it will just be skipped and there will be no signal/coverage from it, but the rest of the program will still be executed.

@dvyukov
Collaborator

dvyukov commented Apr 12, 2024

Another, probably even easier, approach could be to just add some (skip)

Or more generally: a function that transforms a program into a "safe" version. We already have something similar for argument sanitization.
Why a "skip" attribute rather than removing the syscall?

return false
}

// Don't send too many risky inputs at once, otherwise we won't know which one was truly bad.
const riskyOnceIn = time.Minute / 2
const riskyOnceIn = time.Second
Collaborator

Instead of guessing this value, I think it's better to explicitly keep track of whether a VM already has a risky input or not. We know when we gave it one, and we know when it finished running it.
Some VMs may be very slow yet have a high procs setting; then they will cache multiple risky programs.
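A small sketch of what such explicit tracking could look like, assuming hypothetical names (riskyTracker, giveRisky, riskyDone):

package fuzzer

import "sync"

// riskyTracker remembers whether each VM currently has an outstanding risky
// input, instead of relying on a time-based guess.
type riskyTracker struct {
	mu       sync.Mutex
	inFlight map[int]bool // VM index -> has a risky input in flight
}

func newRiskyTracker() *riskyTracker {
	return &riskyTracker{inFlight: make(map[int]bool)}
}

// giveRisky marks a risky input as handed to the VM and returns false if the
// VM already has one in flight.
func (t *riskyTracker) giveRisky(vm int) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.inFlight[vm] {
		return false
	}
	t.inFlight[vm] = true
	return true
}

// riskyDone is called when the VM reports the result of its risky input
// (or when the VM is restarted after a crash).
func (t *riskyTracker) riskyDone(vm int) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.inFlight, vm)
}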

return input
}
retryer.statRiskyDiscarded.Add(1)
retryer.toBacklog(input, false)
Collaborator

Why do we say that it's not important?

func() int {
return ret.delayed.Len()
}, stats.StackedGraph("prog reruns"))

go func() {
for range time.NewTicker(time.Minute).C {
Collaborator

Let's expose it in the web interface instead. There is too much information in the whole manager to print it all periodically, and most of it is not interesting for most users in most cases.
The web interface allows one to look specifically at the info one wants to see, with full verbosity, exactly when one wants to look at it.

Collaborator

It looks like this belongs on the /syscalls page. That table will also allow sorting by the value.


func (ce *crashEstimator) save(p *prog.Prog, prob float64, tentative bool) {
	if !ce.mu.TryLock() {
		if tentative {
Collaborator

I don't see any calls with tentative=true.


type crashEstimator struct {
	mu        sync.RWMutex
	callProbs map[*prog.Syscall]*stats.AverageValue[float64]
Collaborator

This type can be made much simpler and faster, and it can consume less memory, if we use an array of calls (they have dense IDs and we know the max ID) and don't allocate AverageValue lazily (we will allocate all of them anyway); AverageValue also has its own mutex, which we don't need/use here.
There are lots of small heap allocations, indirections, synchronization, etc.
It can be just []struct { crashed, ok atomic.Uint64 }.
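A sketch of that flat layout might look like the following; record and prob are hypothetical method names, and only the []struct { crashed, ok atomic.Uint64 } layout comes from the comment above:

package fuzzer

import "sync/atomic"

// callCrashStats holds per-syscall counters, indexed by the syscall's dense ID.
type callCrashStats struct {
	crashed atomic.Uint64 // times an in-flight program with this call crashed the VM
	ok      atomic.Uint64 // times it completed normally
}

type crashEstimator struct {
	calls []callCrashStats // allocated once up front for all syscalls
}

func newCrashEstimator(numCalls int) *crashEstimator {
	return &crashEstimator{calls: make([]callCrashStats, numCalls)}
}

func (ce *crashEstimator) record(callID int, crashed bool) {
	if crashed {
		ce.calls[callID].crashed.Add(1)
	} else {
		ce.calls[callID].ok.Add(1)
	}
}

// prob returns the observed crash probability for the call.
func (ce *crashEstimator) prob(callID int) float64 {
	c := ce.calls[callID].crashed.Load()
	ok := ce.calls[callID].ok.Load()
	if c+ok == 0 {
		return 0
	}
	return float64(c) / float64(c+ok)
}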

// We're okay if the instance crashes, so no checks are needed.
return input
}
if attempts == 2 {
Collaborator

I don't like this. We pull more than we need and queue it separately, but we don't pull more than 2 because handling the excess is bad, so we don't want to pull too much. And if we get 2 crashing programs in a row, this still does not avoid the crash.

What about the idea of making programs safe? Something like:

if !mayCrash {
  input.Prog.MakeSafe()
}

Looks more reasonable.

However, I am not sure what to do with non-fuzz/gen programs in this case (we shouldn't modify them).

Alternatively, I think we should classify programs as risky earlier and queue them into separate queues.

@a-nogikh
Collaborator Author

I've pushed an updated approach:

  1. We track the crash probability of every call.
  2. Every X seconds, we pick the most dangerous calls and update the choice table so that they are not generated.
  3. Additionally, we split all requests into two categories:

a) Precious -- if they contain dangerous calls, we still want to execute them, but probably later. If they were on a crashed VM, we give them one more chance. These are triage and hints requests.
b) Non-precious -- if they contain dangerous calls and there's no suitable VM that may take risks, we can just discard them. Also, if they were on a crashed VM, we don't want to retry them.
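A hypothetical sketch of that classification; the request kinds and function names below are illustrative, and only the precious = triage/hints split comes from the description above:

package fuzzer

type requestKind int

const (
	kindExecFuzz requestKind = iota
	kindExecGen
	kindSmash
	kindTriage
	kindHints
)

// precious reports whether a request must eventually be executed even if it
// contains statistically dangerous calls.
func precious(kind requestKind) bool {
	return kind == kindTriage || kind == kindHints
}

// onCrashedVM decides whether a request that was in flight when its VM
// crashed should be retried: precious requests get one more chance.
func onCrashedVM(kind requestKind, alreadyRetried bool) bool {
	return precious(kind) && !alreadyRetried
}

// onDangerousCalls decides what to do with a request containing dangerous
// calls before execution: postpone precious ones until a risk-ready VM is
// available, discard non-precious ones if no such VM exists.
func onDangerousCalls(kind requestKind, haveRiskReadyVM bool) (postpone, discard bool) {
	if haveRiskReadyVM {
		return false, false
	}
	if precious(kind) {
		return true, false
	}
	return false, true
}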

@a-nogikh
Collaborator Author

Ah, there must also be two choice tables in this case -- otherwise those disabled calls won't get another chance.

This is an experimental approach. It needs more evaluation.
@a-nogikh
Collaborator Author

The retrying functionality was done in #4762.
Crash avoidance will be posted separately.

a-nogikh closed this May 16, 2024