Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prog: smarter mutations #534

Open
dvyukov opened this issue Mar 8, 2018 · 10 comments
Open

prog: smarter mutations #534

dvyukov opened this issue Mar 8, 2018 · 10 comments

Comments

@dvyukov
Copy link
Collaborator

dvyukov commented Mar 8, 2018

Brain dump of ideas for improving mutation efficiency:

  • argument mutation priorities (int64 should be mutated more frequently than bool8)
  • call insertion/mutation priorities (IPT_SO_SET_REPLACE should be inserted/mutated more frequently than sched_yield) based on argument complexity
  • special mutation for binary flags (if flags values are all single bits, we should combine random sets of them)
  • resource-centeric generation/mutation
  • program priority in corpus based on amount of new coverage (e.g. 1 new basic block vs large chunk of code covered)
  • better estimate state of the system before each syscall (e.g. if we created a file "foo", it makes sense to try to open it, otherwise there is little sense in invoking syscalls that require existence of file "foo"; if we opened a socket on port X, it makes sense to connect to that port, etc)
@lcytxw
Copy link

lcytxw commented Mar 14, 2018

Hi, dvyukov.
I have some questions about the signal returned by the function execute1(). In syzkaller, new input added to corpus should contain new signals, this will cause problems to some extent. For example, if there is a input exercise first path with signal [A, C, D], and a input exercise second path with signal [B, C, E]. when a new input exercise a path with signal [A, C, E], this path will not be considered a new path by syzkaller. Is this not a problem?

@dvyukov
Copy link
Collaborator Author

dvyukov commented Mar 14, 2018

I don't see how this is a problem. Why do you think it could be a problem?

@lcytxw
Copy link

lcytxw commented Mar 14, 2018

Sorry, I just remember the path calculation method in afl and you are absolute right .

@dvyukov
Copy link
Collaborator Author

dvyukov commented Mar 15, 2018

I see what you mean. But it's not that simple.
There is spectrum of feedback signal sensitivity. One one end of the spectrum we have plain basic-block code coverage, then we have edges, longer paths, counters, stacks, pairs of basic-blocks, and ultimately using whole trace as signal (i.e. hash of all traced PCs), or even hash of input (i.e. different inputs are always memorized in corpus).
The more sensitive the signal, the more useful inputs we add to the corpus. But at the same time we add even more unuseful inputs. Too bloated corpus starts giving negative effect at some point. What's the sweet spot here is a hard question.
syz-manager has a special benchmarking mode (-bench flag). If you can tune signal and prove with numbers that one option is better then another, that would be great.

@dvyukov
Copy link
Collaborator Author

dvyukov commented Jun 15, 2018

from @lcytxw

I collect every call series form the corpus, for example: form corpus
{A->B->C->D, A->E->C, B->A->B, C->E->A, C->D->D, D->E->C, E->C},
mark the series as {(A->B, 2), (A->E, 1), (B->A, 1), (B->C, 1), (C->E, 1),
(C->D, 2), (D->C, 1), (D->D, 1), (D->E, 1), (E->A, 1), (E->C, 3)}, and then
make them as a Markov matrix, we can choose any length function from
this Markov chain. It is very useful because the coverage increased 30%
through this method.

4 _cb rxx ys4f x sv

@daydayup40
Copy link

from @lcytxw

I collect every call series form the corpus, for example: form corpus
{A->B->C->D, A->E->C, B->A->B, C->E->A, C->D->D, D->E->C, E->C},
mark the series as {(A->B, 2), (A->E, 1), (B->A, 1), (B->C, 1), (C->E, 1),
(C->D, 2), (D->C, 1), (D->D, 1), (D->E, 1), (E->A, 1), (E->C, 3)}, and then
make them as a Markov matrix, we can choose any length function from
this Markov chain. It is very useful because the coverage increased 30%
through this method.

4 _cb rxx ys4f x sv

I see that this is same as the implementation method of dynamic prios now. Right?

@dvyukov
Copy link
Collaborator Author

dvyukov commented Jan 20, 2019

Yes, dynamic prios are similar.

@dvyukov
Copy link
Collaborator Author

dvyukov commented Feb 20, 2019

A bit on resource-centeric generation/mutation from a mailing list:

I have a long-standing idea to rewrite program generation/mutation
around the concept of resources, rather that syscalls. Say, if we have
a socket created in a program, we consider what else we can do with
that socket (and have some tables as to what can be done with
sockets). This also allows much smarter program splicing. For example,
we can take an initialized resource from one program and plug it (with
the whole syscall sequences that initializes it) into another program.
This should give us much more efficient generation/mutation, because
what matters in the end is resources (that's the unit of state
accumulation in kernel).
I think this should also solve A->B->C problem you mentioned. For
example, syscall A creates a resource of type X, and then we know that
B and C are syscalls that act on resource of type X, so we will
consider adding them with higher probability.

@xairy
Copy link
Collaborator

xairy commented Aug 6, 2019

An idea on how to quickly get an estimate for mutation parameters (priorities). Instead of doing some incremental changes to the parameters and monitoring the difference in coverage size (or number of crashes) after fuzzing for a certain period of time, we can compute those parameters based on the corpus that we already have on syzbot. The idea is to split the corpus into two parts in some way, and try to figure out what are the most optimal values for mutation parameters to generate the first part of the corpus from the other one via mutations as efficient as possible.

For example to calculate arguments mutation priorities we can do the following. Split all programs in the corpus into syscalls, group all syscalls into buckets based on syscall type (name), and then calculate the shortest sequence of mutations for each pair of syscalls in each group. Then take some kind of average (proportionally to the number of pairs in each bucket?) of the mutation types that we had to do, and use the result values as mutation priorities.

@harperchen
Copy link

from @lcytxw

I collect every call series form the corpus, for example: form corpus
{A->B->C->D, A->E->C, B->A->B, C->E->A, C->D->D, D->E->C, E->C},
mark the series as {(A->B, 2), (A->E, 1), (B->A, 1), (B->C, 1), (C->E, 1),
(C->D, 2), (D->C, 1), (D->D, 1), (D->E, 1), (E->A, 1), (E->C, 3)}, and then
make them as a Markov matrix, we can choose any length function from
this Markov chain. It is very useful because the coverage increased 30%
through this method.

4 _cb rxx ys4f x sv

Hi, May I ask how to reproduce the 30% increase after introducing the dynamic priority?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants