Implement fuzzer for JavaCC grammars #140

vlsi · 2020-01-20T09:29:28Z

As of now, JavaCC generates parsers.

However, it would be nice if it could generate fuzzers as well.

In other words, it should take a Random instance, and produce a randomized sample as if it was parsed according to the grammar.

For instance, https://github.com/javacc/javacc/tree/master/examples/Interpreter generates a parser that takes stream as an input and produces objects like ASTCompilationUnit.
However, it would be great if JavaCC could generate a randomized generator of ASTCompilationUnit instances.

It looks like JavaCC has quite good information on the output tree structure, so it could randomize between alternations.

For instance, Apache Calcite uses JavaCC for SQL parsing. It would be nice to be able to generate randomized inputs so we could test SQL execution and optimization engine.

Challenges

Whitespace

Parsers skip whitespace, and they do not care. What they care is tokens.
However, whitespace becomes important for unparse scenario.

For instance select name from emps is valid SQL, however, selectnamefromemps is not.

That means fuzzer must insert whitespace somehow.
A workaround could be to insert a space after each token.

Production weights

For instance, expression grammar has trees, and there should be a way to specify the desired expression depth.
In other words, if the fuzzer generates a+b*(c+d-(k+m/(3-5))), then there should be some way to limit the depth for expression.

Can the weights be embedded into the original grammar?
For instance, SQL grammar is like 8000 lines, 200'000 bytes. It would be sad to maintain multiple grammar files (e.g. one for regular parsing, and the second one for fuzzing), and it would take time them in sync.

Repetition counts: parser, tokenizer

For instance, Apache Calcite SQL grammar has the following:

    < DECIMAL_NUMERIC_LITERAL:
    (["0"-"9"])+(".")?(["0"-"9"])*
    | "."(["0"-"9"])+
    >

Unfortunately, it gives no information on the expected length of the tokens.

Return values and arguments of production rules

JavaCC allows production to return values and receive arguments:

List<String> Ids(String arg) : ...

This is great for parsers, however, it seems to prevent the strategy of using Arbitrary.

In other words, compiling Arbitrary<List<String>> Ids(Arbitrary<String>) won't probably work, since the user code accesses the objects in an arbitrary fashion.

That means the fuzzer would probably need to keep the signatures intact.

Early termination

Grammars like ArithmeticExpression might produce stackoverflow since generator does not really know when to stop.

It looks like the iteration counts (and the taken alternatives) should be subject to the currently generated sequence.
In other words, the probability of taking the same token should reduce.

Summary

WDYT?

The text was updated successfully, but these errors were encountered:

MarcMazas · 2020-01-20T11:27:20Z

Hi Why limit to just random inputs ? I would dream to have JavaCC generate a full set of inputs covering all the syntax … Regards Marc De : Vladimir Sitnikov <notifications@github.com> Envoyé : lundi 20 janvier 2020 10:30 À : javacc/javacc <javacc@noreply.github.com> Cc : Subscribed <subscribed@noreply.github.com> Objet : [javacc/javacc] Implement fuzzer for JavaCC grammars (#140) As of now, JavaCC generates parsers. However, it would be nice if it could generate fuzzers as well. In other words, it should take a Random instance, and produce a randomized sample as if it was parsed according to the grammar. For instance, https://github.com/javacc/javacc/tree/master/examples/Interpreter generates a parser that takes stream as an input and produces objects like ASTCompilationUnit. However, it would be great if JavaCC could generate a randomized generator of ASTCompilationUnit instances. It looks like JavaCC has quite good information on the output tree structure, so it could randomize between alternations. For instance, Apache Calcite uses JavaCC for SQL parsing. It would be nice to be able to generate randomized inputs so we could test SQL execution and optimization engine. WDYT? — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub<#140?email_source=notifications&email_token=AFQZRC6ZQ7WHEFAT734VJM3Q6VVHVA5CNFSM4KJA7ZKKYY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4IHJREMQ>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AFQZRC5QN2PUNK27BLK3QW3Q6VVHVANCNFSM4KJA7ZKA>.

zosrothko · 2020-01-20T18:43:39Z

@vlsi javacc next release 8.0.0 split the javacc parser from the code generator... I think you could develop a specific code generator to generate fuzzer code

vlsi · 2020-01-25T22:26:02Z

Why limit to just random inputs ?

@MarcMazas , the full set of inputs is often infinite, so I do not see how you can generate infinite set of inputs.

MarcMazas · 2020-01-27T09:16:49Z

I'd like to generate a set of grammar inputs where the tokens may be not fully covered (so random is ok, or any other strategy), but the syntax (the combination of the productions) is fully covered. For example, for SQL, have an (one or a few random) example(s) of each combination of clauses in a statement, each combination of statements in a block. The number of these combinations may be high but not infinite if you stop recursing if there are recursing parts.
Then I would use this set as positive test cases, and build new test cases changing some parts (replace some identifiers with keywords or other strings, change the order of clauses or statements, add new ones, delete some, ...) and use them as negative test cases, to see that my grammar is not incomplete.

vlsi · 2020-01-27T09:37:12Z

select '1' from dual
select '2' from dual
...
select 'i am literal' from dual
select 'select' from dual

is infinite :(

How do you know when to stop?

Note: sometimes characters in the literal have special meaning.

For instance:

select 1 from emps where name like '_%' escape '_' -- it is equivalent to name='%'

MarcMazas · 2020-01-27T10:49:44Z

I would limit the combinations :
[] : 2 combinations, with and without
() : 1 combination
()+ : 1 combination
()* : 2 combinations, with and without
( l() | r() ) : 2 combinations, left and right cases
( l() | m() | r() ) : like ( ( l() | m() ) | r() ): 2+1=3 combinations, left, midle and right cases (and so on)
A() B() : if A() has 4 subcombinations A1 A2 A3 A4 and B() 2 B1 B2, I would generate max(4, 2) combinations in a circular manner : A1 B1, A2 B2, A3 B1, A4 B2.
So if the grammar is "select" (Num() | Lit() | "dual' | "select" ) "from" Id() , I would generate : select 'Lit1' from dual and select 'Lit2' from Id1 and select 1 from dual and select 'select' from Id2.
I would not try to generate something that has semantic meaning, as I just want to build a set of JavaCC parsing test cases, not SQL test cases in your example.
So I would generate select 1 from Id1 where Id2 like 'Lit1' and select 2 from Id3 where Id4 like 'Lit2' escape 'a'.
If you want to generate test cases with semantic in them, I'm fine with this idea, but I don't know how to do this; I doubt that using randomly generated litterals would allow to achieve this (... where Id1 like '' escape '' in your SQL example would not meet the target if is not the first char of ).

vlsi · 2020-01-27T11:15:33Z

If you want to generate test cases with semantic in them, I'm fine with this idea, but I don't know how to do this

They say randomization + coverage guidance helps to generate semantically valid inputs: http://lcamtuf.coredump.cx/afl/

vlsi · 2020-02-01T15:14:18Z

Ok. The fuzzer in #160 works for certain cases. Lookahead is not yet implemented, however, it does generate tokens.

Here are the samples with grammars/PlSql.jj:

CALL "" . gXyMM . bpr5 @ ! ;

FUNCTION kAnwU$CK RETURN "!��" . "�" @ MDpZs#_ . oMms . t#_ . mHQJXJ %TYPE IS
BEGIN
  RETURN ;
  NULL ;
END ;

FUNCTION EkJO ( fwifM_0d# IN OUT pD$R , gVTR#__# QbPpK_#4 %TYPE , sqwcL9J$ OUT TABLE OF utE %TYPE ) RETURN BOOLEAN ;

PROCEDURE WBfsP56G_ IS
BEGIN NULL ;
EXCEPTION
 WHEN P$__$ OR qmq OR v_$$z OR sMhAU##
    THEN << xVezFy4_ >>
 WHEN HQhq#r OR lnUt
    THEN NULL ;
END

vlsi · 2020-02-12T20:35:32Z

I moved the fuzzer development to vavrcc/vavrcc#1

vlsi mentioned this issue Jan 20, 2020

Support coverage-guided generation jqwik-team/jqwik#84

Open

vlsi mentioned this issue Jan 25, 2020

Discuss: javax.tools.JavaCompiler or Janino for grammar tests #145

Closed

This was referenced Jan 26, 2020

Fix HelloWorld sample in readme agrosner/KPoet#3

Open

Add support for raw (opaque String) parameter list square/javapoet#760

Closed

vlsi mentioned this issue Jan 27, 2020

$ makes uses from Kotlin weird square/javapoet#761

Closed

vlsi mentioned this issue Feb 1, 2020

WIP: fuzzer for JavaCC grammars #160

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement fuzzer for JavaCC grammars #140

Implement fuzzer for JavaCC grammars #140

vlsi commented Jan 20, 2020 •

edited

MarcMazas commented Jan 20, 2020 via email

zosrothko commented Jan 20, 2020

vlsi commented Jan 25, 2020

MarcMazas commented Jan 27, 2020

vlsi commented Jan 27, 2020 •

edited

MarcMazas commented Jan 27, 2020

vlsi commented Jan 27, 2020

vlsi commented Feb 1, 2020

vlsi commented Feb 12, 2020

Implement fuzzer for JavaCC grammars #140

Implement fuzzer for JavaCC grammars #140

Comments

vlsi commented Jan 20, 2020 • edited

Challenges

Whitespace

Production weights

Repetition counts: parser, tokenizer

Return values and arguments of production rules

Early termination

Summary

MarcMazas commented Jan 20, 2020 via email

zosrothko commented Jan 20, 2020

vlsi commented Jan 25, 2020

MarcMazas commented Jan 27, 2020

vlsi commented Jan 27, 2020 • edited

MarcMazas commented Jan 27, 2020

vlsi commented Jan 27, 2020

vlsi commented Feb 1, 2020

vlsi commented Feb 12, 2020

vlsi commented Jan 20, 2020 •

edited

vlsi commented Jan 27, 2020 •

edited