Implement fuzzer for JavaCC grammars #140
Comments
Hi
Why limit this to just random inputs?
My dream would be to have JavaCC generate a full set of inputs covering all of the syntax…
Regards
Marc
@vlsi The next JavaCC release, 8.0.0, splits the JavaCC parser from the code generator, so I think you could develop a dedicated code generator that produces fuzzer code.
@MarcMazas, the full set of inputs is often infinite, so I do not see how you could generate an infinite set of inputs.
I'd like to generate a set of grammar inputs in which the tokens need not be fully covered (so random selection is fine, or any other strategy), but the syntax (the combinations of productions) is fully covered. For example, for SQL, have one or a few random examples of each combination of clauses in a statement and of each combination of statements in a block. The number of these combinations may be high, but it is not infinite if you stop recursing in the recursive parts.
is infinite :( How do you know when to stop? Note: sometimes characters in the literal have special meaning. For instance:
I would limit the combinations:
They say randomization + coverage guidance helps to generate semantically valid inputs: http://lcamtuf.coredump.cx/afl/ |
OK. The fuzzer in #160 works for certain cases. Lookahead is not yet implemented; however, it does generate tokens. Here are the samples with
I moved the fuzzer development to vavrcc/vavrcc#1
As of now, JavaCC generates parsers.
However, it would be nice if it could generate fuzzers as well.
In other words, it should take a Random instance and produce a randomized sample as if it had been parsed according to the grammar.
For instance, https://github.com/javacc/javacc/tree/master/examples/Interpreter generates a parser that takes a stream as input and produces objects like ASTCompilationUnit.
However, it would be great if JavaCC could generate a randomized generator of ASTCompilationUnit instances.
It looks like JavaCC has quite good information on the output tree structure, so it could randomize between alternations.
For instance, Apache Calcite uses JavaCC for SQL parsing. It would be nice to be able to generate randomized inputs so we could test the SQL execution and optimization engine.
Challenges
Whitespace
Parsers skip whitespace and do not care about it. What they care about is tokens.
However, whitespace becomes important in the unparse scenario.
For instance, select name from emps is valid SQL, whereas selectnamefromemps is not.
That means the fuzzer must insert whitespace somehow.
A workaround could be to insert a space after each token.
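The workaround above could look roughly like the following sketch. TokenUnparser and its unparse method are hypothetical names used only for illustration; they are not part of JavaCC.

```java
import java.util.List;

final class TokenUnparser {
    // Joins the generated token images with a single space so that adjacent
    // keywords and identifiers do not fuse into one token when re-parsed.
    // (Hypothetical helper, not a JavaCC API.)
    static String unparse(List<String> tokenImages) {
        return String.join(" ", tokenImages);
    }

    public static void main(String[] args) {
        // prints "select name from emps"
        System.out.println(unparse(List.of("select", "name", "from", "emps")));
    }
}
```

A single separator is the simplest choice; it never produces inputs that exercise unusual whitespace, which may or may not matter for a given test.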
Production weights
For instance, the expression grammar has trees, and there should be a way to specify the desired expression depth.
In other words, if the fuzzer generates a+b*(c+d-(k+m/(3-5))), then there should be some way to limit the depth of the expression.
Can the weights be embedded into the original grammar?
For instance, the SQL grammar is about 8,000 lines (200,000 bytes). It would be sad to maintain multiple grammar files (e.g. one for regular parsing and a second one for fuzzing), and it would take time to keep them in sync.
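To make the idea of weights plus a depth limit concrete, here is a minimal hand-written sketch for a toy expression grammar. Everything in it is an assumption for illustration: the class name WeightedExpressionFuzzer, the three alternatives, and the weights 2/2/1 are hypothetical, and nothing here is generated by JavaCC or reflects how weights would be written in a grammar file.

```java
import java.util.Random;

final class WeightedExpressionFuzzer {
    // Hypothetical weights for the alternatives: operand, binary operation, parenthesized.
    private static final int[] WEIGHTS = {2, 2, 1};
    private static final String[] OPERANDS = {"a", "b", "c", "3", "5"};
    private static final String[] OPERATORS = {"+", "-", "*", "/"};

    private final Random random;
    private final int maxDepth;

    WeightedExpressionFuzzer(Random random, int maxDepth) {
        this.random = random;
        this.maxDepth = maxDepth;
    }

    // Picks an index with probability proportional to its (non-negative) weight.
    static int pickWeighted(Random random, int[] weights) {
        int total = 0;
        for (int w : weights) {
            total += w;
        }
        int roll = random.nextInt(total);
        for (int i = 0; i < weights.length; i++) {
            roll -= weights[i];
            if (roll < 0) {
                return i;
            }
        }
        throw new IllegalStateException("weights must be non-negative with a positive sum");
    }

    String expression() {
        return expression(0);
    }

    private String expression(int depth) {
        // Once the depth budget is used up, only the non-recursive alternative is allowed.
        if (depth >= maxDepth) {
            return operand();
        }
        switch (pickWeighted(random, WEIGHTS)) {
            case 0:
                return operand();
            case 1:
                return expression(depth + 1) + operator() + expression(depth + 1);
            default:
                return "(" + expression(depth + 1) + ")";
        }
    }

    private String operand() {
        return OPERANDS[random.nextInt(OPERANDS.length)];
    }

    private String operator() {
        return OPERATORS[random.nextInt(OPERATORS.length)];
    }
}
```

For example, new WeightedExpressionFuzzer(new Random(42), 3).expression() yields an expression whose parentheses nest at most three levels deep. Whether such weights and limits can live inside the grammar file itself is exactly the open question above.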
Repetition counts: parser, tokenizer
For instance, the Apache Calcite SQL grammar has the following:
Unfortunately, it gives no information on the expected length of the tokens.
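One possible way around the missing length information is to let the caller supply a bound for each repetition. The sketch below assumes a generic identifier shape, a letter followed by zero or more letters or digits; that shape is an assumption for illustration and is not Calcite's actual token definition. BoundedTokenFuzzer and maxRepeat are hypothetical names.

```java
import java.util.Random;

final class BoundedTokenFuzzer {
    private static final String LETTERS = "abcdefghijklmnopqrstuvwxyz";
    private static final String LETTERS_AND_DIGITS = LETTERS + "0123456789";

    // Generates letter (letter | digit)* where the repetition count is drawn
    // uniformly from [0, maxRepeat]. The bound comes from the caller because
    // the token definition itself says nothing about the expected length.
    static String identifier(Random random, int maxRepeat) {
        StringBuilder sb = new StringBuilder();
        sb.append(LETTERS.charAt(random.nextInt(LETTERS.length())));
        int repeat = random.nextInt(maxRepeat + 1);
        for (int i = 0; i < repeat; i++) {
            sb.append(LETTERS_AND_DIGITS.charAt(random.nextInt(LETTERS_AND_DIGITS.length())));
        }
        return sb.toString();
    }
}
```

A uniform distribution is only one option; a fuzzer might prefer a distribution skewed toward short tokens with an occasional long one.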
Return values and arguments of production rules
JavaCC allows productions to return values and receive arguments:
This is great for parsers; however, it seems to prevent the strategy of using Arbitrary.
In other words, compiling Arbitrary<List<String>> Ids(Arbitrary<String>) probably won't work, since the user code accesses the objects in an arbitrary fashion.
That means the fuzzer would probably need to keep the signatures intact.
Early termination
Grammars like ArithmeticExpression might produce a stack overflow, since the generator does not really know when to stop.
It looks like the iteration counts (and the alternatives taken) should depend on the sequence generated so far.
In other words, the probability of taking the same token again should decrease.
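A minimal sketch of that decreasing probability, applied to the iteration count of a (...)* loop. The names DecayingRepetition, initialProbability, decay and hardLimit are hypothetical, and the geometric decay is just one assumed policy; it covers loops only, so recursive productions would still need something like a depth budget.

```java
import java.util.Random;

final class DecayingRepetition {
    // Decides how many times a (...)* loop iterates. The chance of taking one
    // more iteration starts at initialProbability and is multiplied by decay
    // after each iteration taken; hardLimit is a safety cap.
    static int iterations(Random random, double initialProbability, double decay, int hardLimit) {
        int count = 0;
        double probability = initialProbability;
        while (count < hardLimit && random.nextDouble() < probability) {
            count++;
            probability *= decay;
        }
        return count;
    }

    // Toy use: generates "a", "a + b", "a + b + c", ... with a decaying
    // chance of appending each further term.
    static String sum(Random random) {
        StringBuilder sb = new StringBuilder("a");
        int extraTerms = iterations(random, 0.9, 0.7, 20);
        for (int i = 0; i < extraTerms; i++) {
            sb.append(" + ").append((char) ('b' + i));
        }
        return sb.toString();
    }
}
```

With a decay below 1 the loop ends with probability 1 even without the cap; the cap only makes the worst case explicit.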
Summary
WDYT?