pkg/fuzzer: retry inputs from crashed VMs #4676

Closed
wants to merge 2 commits

Conversation

a-nogikh
Collaborator

Unfinished requests at the time of a crash are dangerous because one of them is likely to crash the instance again.

Let's give these inputs one more chance, but only under certain conditions:

  1. The VM has been running long enough, so we may risk crashing it.
    The PR sets the restart budget to 10%.
  2. Don't feed more than one unsafe input per 30 seconds.

This is another way to implement #4666.
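To make the conditions above concrete, here is a minimal Go sketch of the gating logic; the names unsafeRetryer, mayRetryUnsafe and restartCost are hypothetical and not taken from the actual PR code:

package fuzzer

import "time"

const (
	restartBudget = 0.10             // restarts may consume at most 10% of a VM's uptime (assumption)
	riskyOnceIn   = 30 * time.Second // at most one unsafe input per 30 seconds
)

type unsafeRetryer struct {
	lastUnsafe time.Time
}

// mayRetryUnsafe reports whether an unfinished (potentially crashing) input
// may be rerun on a VM, given its uptime and the average cost of a restart.
func (r *unsafeRetryer) mayRetryUnsafe(vmUptime, restartCost time.Duration) bool {
	// Condition 1: the VM has been running long enough that one more
	// restart stays within the restart budget.
	if float64(restartCost) > float64(vmUptime)*restartBudget {
		return false
	}
	// Condition 2: don't feed more than one unsafe input per 30 seconds.
	if time.Since(r.lastUnsafe) < riskyOnceIn {
		return false
	}
	r.lastUnsafe = time.Now()
	return true
}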

@a-nogikh
Collaborator Author

@dvyukov I've just pushed a second commit with an experimental implementation of crash avoidance. Wdyt about this approach?

@a-nogikh
Collaborator Author

a-nogikh commented Apr 11, 2024

What I see from local runs:

  • In general, individual calls do seem to be quite well associated with the probability of causing a crash.
  • At least on v6.9-rc3, the number of suspicious calls is large (10-20?).
  • If we evaluate every input from the fuzzer after it was generated and the number of bad calls is large, we have to discard/postpone too many programs.
    • Even if I wait only 5*bootTime before running risky programs and schedule a risky program every second, the backlog queue just keeps growing.

It looks like we'd better be able to dynamically enable/disable calls during fuzzing, e.g. by keeping two choice tables in Fuzzer:

  • One that only enables safe syscalls. It's used in smash jobs and in most exec fuzz / exec gen.
  • One with all calls. It's only used for a fraction of exec fuzz / exec gen for VMs that may take risky calls.

pkg/fuzzer records all statistics and once in a while regenerates the first choice table.

But:

  • The banned calls still remain in the corpus and may leak from there to VMs.
  • It won't scale well if we ever make the criteria more fine-grained (e.g. call combinations or arg values).
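The two-table idea above could look roughly like the sketch below; the types and method names are illustrative stand-ins, not syzkaller's actual prog.ChoiceTable API:

package fuzzer

import "sync"

// Illustrative stand-ins for syzkaller's syscall and choice-table types.
type Syscall struct{ Name string }
type ChoiceTable struct{ enabled map[*Syscall]bool }

type Fuzzer struct {
	mu     sync.Mutex
	safeCT *ChoiceTable // dangerous calls disabled; used for smash and most exec fuzz / exec gen
	fullCT *ChoiceTable // all calls enabled; used only for requests sent to risk-ready VMs
}

// choiceTable returns the table to use for a particular request.
func (f *Fuzzer) choiceTable(vmMayCrash bool) *ChoiceTable {
	f.mu.Lock()
	defer f.mu.Unlock()
	if vmMayCrash {
		return f.fullCT
	}
	return f.safeCT
}

// regenSafeTable is called once in a while with the calls that currently look
// statistically dangerous and rebuilds the safe table without them.
func (f *Fuzzer) regenSafeTable(all []*Syscall, dangerous map[*Syscall]bool) {
	enabled := make(map[*Syscall]bool)
	for _, call := range all {
		if !dangerous[call] {
			enabled[call] = true
		}
	}
	f.mu.Lock()
	f.safeCT = &ChoiceTable{enabled: enabled}
	f.mu.Unlock()
}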

@a-nogikh
Collaborator Author

a-nogikh commented Apr 11, 2024

This approach also seems to work quite well:
755e185

We give a crash budget (0.001 for non-risky VMs, 0.01 for risk-ready VMs) and, using the estimated crash probability of every program, sample programs so that the overall risk fits into the budget.

Cons:

  • It may "fake" quite a lot of crashes, so maybe our job.go must be more crash-tolerant itself (e.g. don't abort smash jobs on crashes, make triage not fail on a single crash, etc.).
  • Three attempts are not enough in 15% of cases (the risky progs fallback stat). Maybe it will be better with higher crash risk budgets.
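One way to read "sample them to fit the risk into the budget" is probabilistic acceptance, sketched below; this is an assumption about the mechanism, not the code in 755e185, and acceptProg is a hypothetical name:

package fuzzer

import "math/rand"

const (
	normalCrashBudget = 0.001 // for VMs we would rather not crash
	riskyCrashBudget  = 0.01  // for VMs that may take risks
)

// acceptProg decides whether to run a program with the given estimated crash
// probability on a VM with the given crash budget. Rejected programs would be
// postponed or discarded by the caller.
func acceptProg(rnd *rand.Rand, crashProb, budget float64) bool {
	if crashProb <= budget {
		return true // cheap enough to always run
	}
	// Accept with probability budget/crashProb so that the expected crash
	// rate of accepted runs stays around the budget.
	return rnd.Float64() < budget/crashProb
}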

@a-nogikh
Collaborator Author

Another, probably even easier, approach could be to just add some (skip) call attribute and assign it to individual program calls in this wrapper code. So if a call is statistically dangerous, it will just be skipped and there will be no signal/coverage from it, but the rest of the program will still be executed.

@dvyukov
Collaborator

dvyukov commented Apr 12, 2024

Another, probably even easier, approach could be to just add some (skip)

Or more generally: a function that transforms a program into a "safe" version. We already have something similar for argument sanitization.
Why a "skip" attribute rather than removing the syscall?

return false
}

// Don't send too many risky inputs at once, otherwise we won't know which one was truly bad.
const riskyOnceIn = time.Minute / 2
const riskyOnceIn = time.Second
Collaborator

Instead of guessing this value, I think it's better to explicitly keep track of whether a VM already has a risky input or not. We know when we gave it one, and we know when it finished running it.
Some VMs may be very slow yet have a high procs setting; then they will cache multiple risky programs.
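A small sketch of what such explicit tracking could look like, assuming hypothetical names (riskyTracker, giveRisky, riskyDone):

package fuzzer

import "sync"

// riskyTracker remembers whether each VM currently has an outstanding risky
// input, instead of relying on a time-based guess.
type riskyTracker struct {
	mu       sync.Mutex
	inFlight map[int]bool // VM index -> has a risky input in flight
}

func newRiskyTracker() *riskyTracker {
	return &riskyTracker{inFlight: make(map[int]bool)}
}

// giveRisky marks a risky input as handed to the VM and returns false if the
// VM already has one in flight.
func (t *riskyTracker) giveRisky(vm int) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.inFlight[vm] {
		return false
	}
	t.inFlight[vm] = true
	return true
}

// riskyDone is called when the VM reports the result of its risky input
// (or when the VM is restarted after a crash).
func (t *riskyTracker) riskyDone(vm int) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.inFlight, vm)
}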

return input
}
retryer.statRiskyDiscarded.Add(1)
retryer.toBacklog(input, false)
Collaborator

Why do we say that it's not important?

func() int {
return ret.delayed.Len()
}, stats.StackedGraph("prog reruns"))

go func() {
for range time.NewTicker(time.Minute).C {
Collaborator

Let's expose it in the web interface instead. There is too much information in the whole manager to print it all periodically, and most of it is not interesting for most users in most cases.
The web interface allows one to look specifically at the info one wants to see, with full verbosity, exactly when one wants to look at it.

Collaborator

It looks like this belongs on the /syscalls page. That table will also allow sorting by the value.


func (ce *crashEstimator) save(p *prog.Prog, prob float64, tentative bool) {
	if !ce.mu.TryLock() {
		if tentative {
Collaborator

I don't see any calls with tentative=true.


type crashEstimator struct {
	mu        sync.RWMutex
	callProbs map[*prog.Syscall]*stats.AverageValue[float64]
Collaborator

This type can be made much simpler and faster, and it can consume less memory, if we use an array of calls (they have dense IDs and we know the max ID) and don't allocate AverageValue lazily (we will allocate all of them anyway); AverageValue also has its own mutex, which we don't need/use here.
There are lots of small heap allocations, indirections, synchronization, etc.
It can be just []struct { crashed, ok atomic.Uint64 }.
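A sketch of that flat layout might look like the following; record and prob are hypothetical method names, and only the []struct { crashed, ok atomic.Uint64 } layout comes from the comment above:

package fuzzer

import "sync/atomic"

// callCrashStats holds per-syscall counters, indexed by the syscall's dense ID.
type callCrashStats struct {
	crashed atomic.Uint64 // times an in-flight program with this call crashed the VM
	ok      atomic.Uint64 // times it completed normally
}

type crashEstimator struct {
	calls []callCrashStats // allocated once up front for all syscalls
}

func newCrashEstimator(numCalls int) *crashEstimator {
	return &crashEstimator{calls: make([]callCrashStats, numCalls)}
}

func (ce *crashEstimator) record(callID int, crashed bool) {
	if crashed {
		ce.calls[callID].crashed.Add(1)
	} else {
		ce.calls[callID].ok.Add(1)
	}
}

// prob returns the observed crash probability for the call.
func (ce *crashEstimator) prob(callID int) float64 {
	c := ce.calls[callID].crashed.Load()
	ok := ce.calls[callID].ok.Load()
	if c+ok == 0 {
		return 0
	}
	return float64(c) / float64(c+ok)
}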

// We're okay if the instance crashes, so no checks are needed.
return input
}
if attempts == 2 {
Collaborator

I don't like this. We pull more than we need and queue it separately, but we don't pull more than 2 because handling the excess is bad, so we don't want to pull too much. And if we get 2 crashing programs in a row, this still does not avoid the crash.

What about the idea of making programs safe? Something like:

if !mayCrash {
  input.Prog.MakeSafe()
}

Looks more reasonable.

However, I am not sure what to do with non-fuzz/gen programs in this case (we shouldn't modify them).

Alternatively, I think we should classify programs as risky earlier and queue them into separate queues.

@a-nogikh
Collaborator Author

I've pushed an updated approach:

  1. We track the crash probability of every call.
  2. Every X seconds, we pick the most dangerous calls and update the choice table so that they are not generated.
  3. Additionally, we split all requests into two categories:

a) Precious -- if they contain dangerous calls, we still want to execute them, but probably later. If they were on a crashed VM, we give them one more chance. These are triage and hints requests.
b) Non-precious -- if they contain dangerous calls and there's no suitable VM that may take risks, we can just discard them. Also, if they were on a crashed VM, we don't want to retry them.
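A hypothetical sketch of that classification; the request kinds and function names below are illustrative, and only the precious = triage/hints split comes from the description above:

package fuzzer

type requestKind int

const (
	kindExecFuzz requestKind = iota
	kindExecGen
	kindSmash
	kindTriage
	kindHints
)

// precious reports whether a request must eventually be executed even if it
// contains statistically dangerous calls.
func precious(kind requestKind) bool {
	return kind == kindTriage || kind == kindHints
}

// onCrashedVM decides whether a request that was in flight when its VM
// crashed should be retried: precious requests get one more chance.
func onCrashedVM(kind requestKind, alreadyRetried bool) bool {
	return precious(kind) && !alreadyRetried
}

// onDangerousCalls decides what to do with a request containing dangerous
// calls before execution: postpone precious ones until a risk-ready VM is
// available, discard non-precious ones if no such VM exists.
func onDangerousCalls(kind requestKind, haveRiskReadyVM bool) (postpone, discard bool) {
	if haveRiskReadyVM {
		return false, false
	}
	if precious(kind) {
		return true, false
	}
	return false, true
}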

@a-nogikh
Collaborator Author

Ah, there must also be two choice tables in this case -- otherwise those disabled calls won't get another chance.

This is an experimental approach. It needs more evaluation.
@a-nogikh
Collaborator Author

The retrying functionality was done in #4762.
Crash avoidance will be posted separately.

a-nogikh closed this May 16, 2024