Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement fuzzer for JavaCC grammars #140

Open
vlsi opened this issue Jan 20, 2020 · 9 comments
Open

Implement fuzzer for JavaCC grammars #140

vlsi opened this issue Jan 20, 2020 · 9 comments

Comments

@vlsi
Copy link
Contributor

vlsi commented Jan 20, 2020

As of now, JavaCC generates parsers.

However, it would be nice if it could generate fuzzers as well.

In other words, it should take a Random instance, and produce a randomized sample as if it was parsed according to the grammar.

For instance, https://github.com/javacc/javacc/tree/master/examples/Interpreter generates a parser that takes stream as an input and produces objects like ASTCompilationUnit.
However, it would be great if JavaCC could generate a randomized generator of ASTCompilationUnit instances.

It looks like JavaCC has quite good information on the output tree structure, so it could randomize between alternations.

For instance, Apache Calcite uses JavaCC for SQL parsing. It would be nice to be able to generate randomized inputs so we could test SQL execution and optimization engine.

Challenges

Whitespace

Parsers skip whitespace, and they do not care. What they care is tokens.
However, whitespace becomes important for unparse scenario.

For instance select name from emps is valid SQL, however, selectnamefromemps is not.

That means fuzzer must insert whitespace somehow.
A workaround could be to insert a space after each token.

Production weights

For instance, expression grammar has trees, and there should be a way to specify the desired expression depth.
In other words, if the fuzzer generates a+b*(c+d-(k+m/(3-5))), then there should be some way to limit the depth for expression.

Can the weights be embedded into the original grammar?
For instance, SQL grammar is like 8000 lines, 200'000 bytes. It would be sad to maintain multiple grammar files (e.g. one for regular parsing, and the second one for fuzzing), and it would take time them in sync.

Repetition counts: parser, tokenizer

For instance, Apache Calcite SQL grammar has the following:

    < DECIMAL_NUMERIC_LITERAL:
    (["0"-"9"])+(".")?(["0"-"9"])*
    | "."(["0"-"9"])+
    >

Unfortunately, it gives no information on the expected length of the tokens.

Return values and arguments of production rules

JavaCC allows production to return values and receive arguments:

List<String> Ids(String arg) : ...

This is great for parsers, however, it seems to prevent the strategy of using Arbitrary.

In other words, compiling Arbitrary<List<String>> Ids(Arbitrary<String>) won't probably work, since the user code accesses the objects in an arbitrary fashion.

That means the fuzzer would probably need to keep the signatures intact.

Early termination

Grammars like ArithmeticExpression might produce stackoverflow since generator does not really know when to stop.

It looks like the iteration counts (and the taken alternatives) should be subject to the currently generated sequence.
In other words, the probability of taking the same token should reduce.

Summary

WDYT?

@MarcMazas
Copy link
Collaborator

MarcMazas commented Jan 20, 2020 via email

@zosrothko
Copy link
Member

@vlsi javacc next release 8.0.0 split the javacc parser from the code generator... I think you could develop a specific code generator to generate fuzzer code

@vlsi
Copy link
Contributor Author

vlsi commented Jan 25, 2020

Why limit to just random inputs ?

@MarcMazas , the full set of inputs is often infinite, so I do not see how you can generate infinite set of inputs.

@MarcMazas
Copy link
Collaborator

I'd like to generate a set of grammar inputs where the tokens may be not fully covered (so random is ok, or any other strategy), but the syntax (the combination of the productions) is fully covered. For example, for SQL, have an (one or a few random) example(s) of each combination of clauses in a statement, each combination of statements in a block. The number of these combinations may be high but not infinite if you stop recursing if there are recursing parts.
Then I would use this set as positive test cases, and build new test cases changing some parts (replace some identifiers with keywords or other strings, change the order of clauses or statements, add new ones, delete some, ...) and use them as negative test cases, to see that my grammar is not incomplete.

@vlsi
Copy link
Contributor Author

vlsi commented Jan 27, 2020

select '1' from dual
select '2' from dual
...
select 'i am literal' from dual
select 'select' from dual

is infinite :(

How do you know when to stop?

Note: sometimes characters in the literal have special meaning.

For instance:

select 1 from emps where name like '_%' escape '_' -- it is equivalent to name='%'

@MarcMazas
Copy link
Collaborator

I would limit the combinations :
[] : 2 combinations, with and without
() : 1 combination
()+ : 1 combination
()* : 2 combinations, with and without
( l() | r() ) : 2 combinations, left and right cases
( l() | m() | r() ) : like ( ( l() | m() ) | r() ): 2+1=3 combinations, left, midle and right cases (and so on)
A() B() : if A() has 4 subcombinations A1 A2 A3 A4 and B() 2 B1 B2, I would generate max(4, 2) combinations in a circular manner : A1 B1, A2 B2, A3 B1, A4 B2.
So if the grammar is "select" (Num() | Lit() | "dual' | "select" ) "from" Id() , I would generate : select 'Lit1' from dual and select 'Lit2' from Id1 and select 1 from dual and select 'select' from Id2.
I would not try to generate something that has semantic meaning, as I just want to build a set of JavaCC parsing test cases, not SQL test cases in your example.
So I would generate select 1 from Id1 where Id2 like 'Lit1' and select 2 from Id3 where Id4 like 'Lit2' escape 'a'.
If you want to generate test cases with semantic in them, I'm fine with this idea, but I don't know how to do this; I doubt that using randomly generated litterals would allow to achieve this (... where Id1 like '' escape '' in your SQL example would not meet the target if is not the first char of ).

@vlsi
Copy link
Contributor Author

vlsi commented Jan 27, 2020

If you want to generate test cases with semantic in them, I'm fine with this idea, but I don't know how to do this

They say randomization + coverage guidance helps to generate semantically valid inputs: http://lcamtuf.coredump.cx/afl/

@vlsi
Copy link
Contributor Author

vlsi commented Feb 1, 2020

Ok. The fuzzer in #160 works for certain cases. Lookahead is not yet implemented, however, it does generate tokens.

Here are the samples with grammars/PlSql.jj:

CALL "" . gXyMM . bpr5 @ ! ;
FUNCTION kAnwU$CK RETURN "!��" . "�" @ MDpZs#_ . oMms . t#_ . mHQJXJ %TYPE IS
BEGIN
  RETURN ;
  NULL ;
END ;
FUNCTION EkJO ( fwifM_0d# IN OUT pD$R , gVTR#__# QbPpK_#4 %TYPE , sqwcL9J$ OUT TABLE OF utE %TYPE ) RETURN BOOLEAN ;
PROCEDURE WBfsP56G_ IS
BEGIN NULL ;
EXCEPTION
 WHEN P$__$ OR qmq OR v_$$z OR sMhAU##
    THEN << xVezFy4_ >>
 WHEN HQhq#r OR lnUt
    THEN NULL ;
END 

@vlsi
Copy link
Contributor Author

vlsi commented Feb 12, 2020

I moved the fuzzer development to vavrcc/vavrcc#1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants