prog: smarter mutations #534

dvyukov · 2018-03-08T10:36:32Z

Brain dump of ideas for improving mutation efficiency:

argument mutation priorities (int64 should be mutated more frequently than bool8)
call insertion/mutation priorities (IPT_SO_SET_REPLACE should be inserted/mutated more frequently than sched_yield) based on argument complexity
special mutation for binary flags (if flags values are all single bits, we should combine random sets of them)
resource-centric generation/mutation
program priority in corpus based on amount of new coverage (e.g. 1 new basic block vs large chunk of code covered)
better estimate state of the system before each syscall (e.g. if we created a file "foo", it makes sense to try to open it, otherwise there is little sense in invoking syscalls that require existence of file "foo"; if we opened a socket on port X, it makes sense to connect to that port, etc)
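
To make the binary-flags idea above concrete, here is a minimal sketch, not syzkaller's actual code. The helper names `is_single_bit` and `mutate_flags` and the 50% inclusion probability are assumptions made for illustration: if every declared flag value is a single bit, a random subset is OR-ed together; otherwise the values are treated as an enum and one is picked.

```python
import random

def is_single_bit(value: int) -> bool:
    """True if exactly one bit is set in value."""
    return value > 0 and value & (value - 1) == 0

def mutate_flags(flag_values, rng):
    """Hypothetical flag mutator.

    If all declared values are single bits, combine a random subset of them.
    Otherwise fall back to picking one declared value unchanged.
    """
    if flag_values and all(is_single_bit(v) for v in flag_values):
        result = 0
        for v in flag_values:
            # Assumption: each bit is included independently with p = 0.5.
            if rng.random() < 0.5:
                result |= v
        return result
    return rng.choice(flag_values)

if __name__ == "__main__":
    rng = random.Random(0)
    print([hex(mutate_flags([0x1, 0x2, 0x4, 0x8], rng)) for _ in range(5)])
```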

lcytxw · 2018-03-14T09:16:00Z

Hi, dvyukov.
I have some questions about the signal returned by the function execute1(). In syzkaller, a new input is added to the corpus only if it contains new signal, and this can cause problems to some extent. For example, suppose one input exercises a first path with signal [A, C, D] and another input exercises a second path with signal [B, C, E]. When a new input exercises a path with signal [A, C, E], syzkaller will not consider it a new path. Isn't this a problem?

dvyukov · 2018-03-14T09:19:35Z

I don't see how this is a problem. Why do you think it could be a problem?

lcytxw · 2018-03-14T11:26:36Z

Sorry, I was just remembering the path calculation method in AFL; you are absolutely right.

dvyukov · 2018-03-15T08:35:16Z

I see what you mean. But it's not that simple.
There is a spectrum of feedback signal sensitivity. On one end of the spectrum we have plain basic-block code coverage, then we have edges, longer paths, counters, stacks, pairs of basic blocks, and ultimately using the whole trace as the signal (i.e. a hash of all traced PCs), or even a hash of the input (i.e. different inputs are always memorized in the corpus).
The more sensitive the signal, the more useful inputs we add to the corpus. But at the same time we add even more useless inputs. An overly bloated corpus starts to have a negative effect at some point. Where the sweet spot lies is a hard question.
syz-manager has a special benchmarking mode (-bench flag). If you can tune the signal and prove with numbers that one option is better than another, that would be great.
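
The sensitivity spectrum can be illustrated with the [A, C, D] / [B, C, E] / [A, C, E] example from earlier in the thread. The sketch below is not how syzkaller computes signal; the functions `block_signal`, `edge_signal`, `trace_signal` and `has_new_signal` are hypothetical names, and traces are modeled as plain lists of labels. It compares three points on the spectrum: basic blocks, edges, and a hash of the whole trace.

```python
import hashlib

def block_signal(trace):
    """Least sensitive: the set of basic blocks that were hit."""
    return set(trace)

def edge_signal(trace):
    """More sensitive: the set of consecutive (from, to) pairs."""
    return set(zip(trace, trace[1:]))

def trace_signal(trace):
    """Most sensitive here: a hash of the whole ordered trace."""
    digest = hashlib.sha1("->".join(trace).encode()).hexdigest()
    return {digest}

def has_new_signal(corpus_traces, new_trace, signal_fn):
    """True if new_trace yields any signal element not seen in the corpus."""
    seen = set()
    for t in corpus_traces:
        seen |= signal_fn(t)
    return bool(signal_fn(new_trace) - seen)

corpus = [["A", "C", "D"], ["B", "C", "E"]]
candidate = ["A", "C", "E"]

if __name__ == "__main__":
    for fn in (block_signal, edge_signal, trace_signal):
        print(fn.__name__, has_new_signal(corpus, candidate, fn))
```

In this toy model the candidate adds nothing under block or edge signal, because A->C and C->E were each seen before, and only the whole-trace hash treats it as new. The same whole-trace signal would also accept every reordering of already-known blocks, which is the corpus-bloat trade-off described above.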

dvyukov · 2018-06-15T09:22:53Z

from @lcytxw

I collect every call series from the corpus. For example, from the corpus
{A->B->C->D, A->E->C, B->A->B, C->E->A, C->D->D, D->E->C, E->C},
I mark the series as {(A->B, 2), (A->E, 1), (B->A, 1), (B->C, 1), (C->E, 1),
(C->D, 2), (D->C, 1), (D->D, 1), (D->E, 1), (E->A, 1), (E->C, 3)}, and then
turn them into a Markov matrix, so we can choose a call sequence of any length from
this Markov chain. It is very useful: coverage increased by 30%
with this method.
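
A minimal sketch of the pair-counting and Markov-chain sampling described above, under the assumption that programs are just lists of call names. The function names are hypothetical, and this is neither the original implementation nor syzkaller's dynamic priorities. Counts are derived from the corpus instead of being hardcoded; note that counting adjacent pairs in the example corpus as written yields no D->C pair.

```python
import random
from collections import Counter, defaultdict

def count_transitions(corpus):
    """Count how often call b directly follows call a across all programs."""
    counts = Counter()
    for prog in corpus:
        for a, b in zip(prog, prog[1:]):
            counts[(a, b)] += 1
    return counts

def transition_matrix(counts):
    """Normalize counts per source call into transition probabilities."""
    totals = defaultdict(int)
    for (a, _), n in counts.items():
        totals[a] += n
    matrix = defaultdict(dict)
    for (a, b), n in counts.items():
        matrix[a][b] = n / totals[a]
    return matrix

def sample_sequence(matrix, start, length, rng):
    """Walk the chain from start until length calls are produced
    or a call with no known successor is reached."""
    seq = [start]
    while len(seq) < length and seq[-1] in matrix:
        row = matrix[seq[-1]]
        calls = list(row)
        seq.append(rng.choices(calls, weights=[row[c] for c in calls])[0])
    return seq

corpus = [list("ABCD"), list("AEC"), list("BAB"), list("CEA"),
          list("CDD"), list("DEC"), list("EC")]
counts = count_transitions(corpus)
matrix = transition_matrix(counts)

if __name__ == "__main__":
    print(sample_sequence(matrix, "A", 6, random.Random(1)))
```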

daydayup40 · 2019-01-20T12:04:36Z

from @lcytxw

I collect every call series from the corpus. For example, from the corpus
{A->B->C->D, A->E->C, B->A->B, C->E->A, C->D->D, D->E->C, E->C},
I mark the series as {(A->B, 2), (A->E, 1), (B->A, 1), (B->C, 1), (C->E, 1),
(C->D, 2), (D->C, 1), (D->D, 1), (D->E, 1), (E->A, 1), (E->C, 3)}, and then
turn them into a Markov matrix, so we can choose a call sequence of any length from
this Markov chain. It is very useful: coverage increased by 30%
with this method.

I see that this is now the same as how dynamic prios are implemented. Right?

dvyukov · 2019-01-20T12:57:18Z

Yes, dynamic prios are similar.

dvyukov · 2019-02-20T08:53:04Z

A bit on resource-centric generation/mutation from a mailing list:

I have a long-standing idea to rewrite program generation/mutation
around the concept of resources, rather than syscalls. Say, if we have
a socket created in a program, we consider what else we can do with
that socket (and have some tables as to what can be done with
sockets). This also allows much smarter program splicing. For example,
we can take an initialized resource from one program and plug it (with
the whole syscall sequence that initializes it) into another program.
This should give us much more efficient generation/mutation, because
what matters in the end is resources (that's the unit of state
accumulation in the kernel).
I think this should also solve the A->B->C problem you mentioned. For
example, syscall A creates a resource of type X, and then we know that
B and C are syscalls that act on a resource of type X, so we will
consider adding them with higher probability.
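
One way to picture the "tables as to what can be done with" a resource is the sketch below. Everything in it is an assumption for illustration: the tables `CREATORS`, `USERS` and `UNRELATED`, the `boost` factor, and the function names are not part of syzkaller's prog package. Calls that act on a resource already live in the program get a higher weight, and calls whose resource does not exist yet get weight zero.

```python
import random

# Hypothetical tables: which calls create and which calls use each resource type.
CREATORS = {"sock": ["socket"], "fd": ["open"]}
USERS = {"sock": ["bind", "listen", "connect"], "fd": ["read", "write"]}
UNRELATED = ["sched_yield", "getpid"]

def call_weights(live_resources, boost=10.0):
    """Weight each call given the set of resource types already created."""
    weights = {}
    for calls in CREATORS.values():
        for c in calls:
            weights[c] = 1.0
    for c in UNRELATED:
        weights[c] = 1.0
    for res, calls in USERS.items():
        for c in calls:
            # Assumption: users of a live resource are boosted, others disabled.
            weights[c] = boost if res in live_resources else 0.0
    return weights

def pick_next_call(live_resources, rng):
    """Sample the next call name according to call_weights."""
    weights = call_weights(live_resources)
    calls = [c for c, w in weights.items() if w > 0]
    return rng.choices(calls, weights=[weights[c] for c in calls])[0]

if __name__ == "__main__":
    rng = random.Random(0)
    print([pick_next_call({"sock"}, rng) for _ in range(8)])
```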

xairy · 2019-08-06T12:36:29Z

An idea on how to quickly get an estimate for mutation parameters (priorities). Instead of making incremental changes to the parameters and monitoring the difference in coverage size (or number of crashes) after fuzzing for a certain period of time, we can compute those parameters based on the corpus that we already have on syzbot. The idea is to split the corpus into two parts in some way, and try to figure out the optimal values of the mutation parameters for generating the first part of the corpus from the other one via mutations as efficiently as possible.

For example, to calculate argument mutation priorities we can do the following. Split all programs in the corpus into syscalls, group all syscalls into buckets based on syscall type (name), and then calculate the shortest sequence of mutations for each pair of syscalls in each group. Then take some kind of average (proportionally to the number of pairs in each bucket?) of the mutation types that we had to do, and use the resulting values as mutation priorities.
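
A rough sketch of the bucketing step, with a deliberate simplification: instead of a true shortest mutation sequence, it just counts which argument positions differ between two calls of the same name, and pools all pairs so larger buckets contribute proportionally more. The representation of calls as `(name, args)` tuples, the `arg_types` table, and the function name are assumptions for illustration only.

```python
from collections import Counter, defaultdict
from itertools import combinations

def estimate_arg_priorities(calls, arg_types):
    """calls: list of (name, args tuple); arg_types: name -> tuple of type names.

    Returns the fraction of observed per-argument differences by argument type.
    """
    buckets = defaultdict(list)
    for name, args in calls:
        buckets[name].append(args)
    diffs = Counter()
    for name, group in buckets.items():
        types = arg_types[name]
        for a, b in combinations(group, 2):
            # Simplification: one "mutation" per differing argument position.
            for t, x, y in zip(types, a, b):
                if x != y:
                    diffs[t] += 1
    total = sum(diffs.values())
    return {t: n / total for t, n in diffs.items()} if total else {}

calls = [
    ("open", ("foo", 0x2, 0o644)),
    ("open", ("bar", 0x2, 0o644)),
    ("open", ("foo", 0x42, 0o644)),
    ("sched_yield", ()),
]
arg_types = {"open": ("filename", "flags", "mode"), "sched_yield": ()}
prios = estimate_arg_priorities(calls, arg_types)

if __name__ == "__main__":
    print(prios)
```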

harperchen · 2020-07-14T09:59:29Z

from @lcytxw

I collect every call series from the corpus. For example, from the corpus
{A->B->C->D, A->E->C, B->A->B, C->E->A, C->D->D, D->E->C, E->C},
I mark the series as {(A->B, 2), (A->E, 1), (B->A, 1), (B->C, 1), (C->E, 1),
(C->D, 2), (D->C, 1), (D->D, 1), (D->E, 1), (E->A, 1), (E->C, 3)}, and then
turn them into a Markov matrix, so we can choose a call sequence of any length from
this Markov chain. It is very useful: coverage increased by 30%
with this method.

Hi, may I ask how to reproduce the 30% increase after introducing dynamic priorities?

dvyukov added the enhancement label Mar 8, 2018

melver mentioned this issue May 24, 2019

prog: generate new programs by crossover #1198

Open

dvyukov mentioned this issue Aug 19, 2019

syz-manager: corpus rotation #1348

Open

This was referenced Sep 3, 2019

prog: resource-centric generation/mutation #1355

Merged

prog: mutation heuristics #1306

Merged

veronicaradu mentioned this issue Sep 4, 2019

syz-fuzzer: add program priority in corpus #1379

Merged

dvyukov mentioned this issue Sep 4, 2019

prog: better call-to-call priority calculation #1380

Open

veronicaradu pushed a commit to veronicaradu/syzkaller that referenced this issue Sep 24, 2019

syz-fuzzer: add program priority in corpus

dfdd1df

Update google#534

dvyukov pushed a commit that referenced this issue Sep 24, 2019

syz-fuzzer: add program priority in corpus

2cad5aa

Update #534

xairy added the enhancement:mutations label Oct 1, 2019

dvyukov mentioned this issue Jul 17, 2020

prog: use reinforcement learning #1950

Open

gwangmu mentioned this issue Apr 12, 2021

Input prioritization by its amount of coverage, rather than "new" coverage? #2533

Closed
