Grammar-based fuzzer based on pest grammars #209
This is undecidable in general, I suspect. Along these lines: use the alphabet {0, 1}. Take two indexed families of nonempty strings {a₁, ⋯, aₙ} and {b₁, ⋯, bₙ}, both indexed by 1, ⋯, n. Write cᵢ = rev(bᵢ), the string reverse of each bᵢ. Then one can write a grammar whose language is nonempty exactly when some sequence of indices makes the concatenation of the chosen aᵢ equal the concatenation of the corresponding bᵢ, that is, exactly when this instance of the Post correspondence problem has a solution, which is undecidable.
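A hypothetical sketch of such a grammar in pest syntax (my own construction, not necessarily the one originally posted; the literals `"A1"`, `"A2"`, `"C1"`, `"C2"` are placeholders for concrete 0/1 strings). The positive lookahead `&` intersects two languages, and the intersection is nonempty exactly when the correspondence instance has a solution:

```
// Placeholders: "A1", "A2" stand for a1, a2 and "C1", "C2" for
// c1 = rev(b1), c2 = rev(b2); "#" is a middle marker.
pairs  = { ("A1" ~ pairs ~ "C1") | ("A2" ~ pairs ~ "C2") | "#" }
// mirror matches exactly the strings of the form w # rev(w).
mirror = { ("0" ~ mirror ~ "0") | ("1" ~ mirror ~ "1") | "#" }
// A string matches `start` iff it spells a(i1)…a(ik) # rev(b(i1)…b(ik))
// AND has the form w # rev(w), i.e. iff the index sequence is a solution
// of the correspondence problem.
start  = { &(pairs ~ EOI) ~ mirror ~ EOI }
```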
@wirelyre Intuitively, I would have said that the undecidability of language disjointness also applies to PEGs. Your proof looks fine. But I don't see what this has to do with fuzzing. As I understand it, the fuzzer should construct an ever-growing set of strings from the language defined by a pest grammar, which sounds feasible.
@dragostis The implication is that, for any fuzzer, there exist grammars that will cause it to run forever without ever producing a single matching string. And it is not possible to identify which grammars will do that. Maybe that's okay behavior. I'm bringing it up because I have a hunch that there is a simple and practical restriction on lookahead that makes such cases trivial to detect. Then a search through the language space could construct new words in bounded time.
Constructing a string matching a grammar is the other side of the coin of recognizing a string against a grammar.
Obviously some kind of safeguard would be needed to avoid generating overly large texts (or running forever).
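A minimal sketch of such a safeguard (a toy generator of my own, not pest's API): bound the recursion depth and, once the budget runs out, restrict the choice to non-recursive alternatives so generation always terminates.

```python
import random

# Toy grammar, expressed as nonterminal -> list of alternatives, where each
# alternative is a sequence of symbols. Hypothetical example:
#   expr = num | "(" expr "+" expr ")"
#   num  = "0" | "1"
GRAMMAR = {
    "expr": [["num"], ["(", "expr", "+", "expr", ")"]],
    "num":  [["0"], ["1"]],
}

def generate(rng, symbol="expr", depth=8):
    if symbol not in GRAMMAR:   # terminal: emit it verbatim
        return symbol
    alts = GRAMMAR[symbol]
    if depth <= 0:
        # Budget exhausted: drop directly self-recursive alternatives.
        # (A real tool would also have to detect indirect recursion.)
        alts = [a for a in alts if symbol not in a]
    choice = rng.choice(alts)
    return "".join(generate(rng, s, depth - 1) for s in choice)

rng = random.Random(42)
for _ in range(3):
    print(generate(rng))
```

Because the depth budget shrinks on every call and recursive alternatives are cut off at zero, this terminates even on grammars whose words are mostly produced through recursion, at the cost of biasing toward short outputs near the budget boundary.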
@bd82 I think the question might be a bit more complicated than that. pest rules can have negative lookaheads and use the stack to match different things from the grammar.
Yeah, that would make it more complicated... I was thinking in terms of pure EBNF...
I was wrong. A 2004 article shows how to convert any well-formed PEG (one that doesn't admit the empty string) into another one that doesn't use the predicate operators.
@bd82 Even without lookaheads and the stack, ordered choice is already really hard. Suppose I want to exercise the second branch of an ordered choice: any string I generate from the second alternative must also not be claimed by the first. It seems obvious that there should be an algorithm that works well if you give it a sane grammar. There probably is! So my question is: what is a "sane grammar"?
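To illustrate the ordered-choice problem, here is a toy example of my own (not a pest grammar): to generate a string that exercises a rule's second alternative, it isn't enough to sample from the second alternative's language; the result must also escape the first alternative.

```python
import random

# Hypothetical rule:  rule = "ab" / ("a" ~ "b"*)
# Branch 2's language is { "a", "ab", "abb", ... }, but under ordered
# choice any input starting with "ab" is claimed by branch 1 first, so
# only "a" actually exercises branch 2.

def branch1_matches(s):
    return s.startswith("ab")           # first alternative: literal "ab"

def gen_branch2(rng):
    return "a" + "b" * rng.randint(0, 3)  # sample from a ~ b*

def gen_second_branch(rng, tries=100):
    """Rejection sampling: discard candidates the first branch would take."""
    for _ in range(tries):
        s = gen_branch2(rng)
        if not branch1_matches(s):
            return s
    return None

print(gen_second_branch(random.Random(0)))
```

Rejection sampling like this works for toy cases, but when the first alternative covers almost all of the second one's language, the acceptance rate collapses, which is exactly why "sane grammar" needs a definition.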
I normally work with LL(k) grammars, not PEG; maybe that makes a difference to the complexity?
In PEG, if two alternatives can match exactly the same thing, the first one is accepted and the grammar is still valid?
Let's assume that, due to edge cases, the text generator cannot always guarantee the generated text is valid. So maybe to provide 100 valid inputs we need to generate 120 candidates. Maybe if we relax the constraint so that, say, only 50% of generated inputs need to be valid, the definition of a "sane" grammar becomes much less restrictive.
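A sketch of that overshoot strategy (the names `gen_candidate` and `is_valid` are stand-ins of mine, not a real API): keep generating candidates until enough valid ones accumulate, with a hard cap on attempts so a pathological grammar can't loop forever.

```python
import random

def gen_candidate(rng):
    """Stand-in for a grammar-driven generator."""
    return "".join(rng.choice("ab") for _ in range(4))

def is_valid(s):
    """Stand-in for running the real parser; accepts ~50% of candidates."""
    return s.startswith("a")

def collect_valid(rng, want=100, max_attempts=1000):
    valid, attempts = [], 0
    while len(valid) < want and attempts < max_attempts:
        attempts += 1
        s = gen_candidate(rng)
        if is_valid(s):
            valid.append(s)
    return valid, attempts

valid, attempts = collect_valid(random.Random(1), want=100)
print(len(valid), attempts)
```

The `max_attempts` cap turns "the fuzzer might run forever" into "the fuzzer reports a low acceptance rate", which is a much friendlier failure mode for grammars near the undecidable edge.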
Earlier this year I released a project at DEFCON called synfuzz, https://www.github.com/jrozner/synfuzz (written in Rust), that may help with accomplishing this once there is a firmer plan in place. From what I can tell, it takes a similar approach to the project listed above. The goal is to provide a common platform that various frontends for different parser-generator definition languages can target. I'd potentially be interested in helping with this.
The purpose of this kind of fuzzer would be to automatically generate "passing" cases, that is, examples of strings that match the grammar given as input. Is that right?

I ask because, if we assume Pest's parsing to be correct (which it should more or less be by now, after a good amount of time and use), then any non-matching string for a given grammar would be automatically rejected. Therefore, I think the most useful fuzzer would generate matching strings, which would of course pass Pest's filtering and produce a parse tree. These parse trees would then be valid input for your AST builder, allowing you to fuzz-test it.

I think fuzz testing the AST-building step is a worthwhile goal. The more complex your grammar is, the more paths there are to test, and maybe one of them makes your parser crash. I'd love this feature <3
Also: maybe this could be implemented with a PEG -> BNF translator, and then using the BNF representation to call the generators from an existing BNF-based tool.
That's an interesting idea, though potentially problematic: PEG is deterministic while pure BNF is ambiguous in some cases.

**Clarification on what that means:** The PEG grammar:

```
production = { keyword | identifier }
keyword    = { "keyword" }
identifier = { ('a'..'z' | 'A'..'Z')+ }
```

versus the "equivalent" EBNF:

```
production = keyword | identifier ;
keyword = "keyword" ;
identifier = ?letter?, { ?letter? } ;
```

(I didn't want to write out the EBNF for `?letter?`.)

The input string `keyword` is ambiguous in the EBNF, where both alternatives derive it, but not in the PEG, where ordered choice always takes the first alternative.

For the most part, programming languages intend to be fairly free of ambiguity, and PEG accomplishes this by simply always taking the first choice. It'd be interesting if inputs generated from the (E)BNF that aren't accepted by the PEG parser, or that are ambiguous in the (E)BNF, were surfaced to the user of the tool, who could then decide whether they should parse and which way they should go. It could help eliminate the biggest "surprise" potential of using PEG.

But yeah, fuzzing based off of grammars is about fuzzing the recipient of the parsed tree more so than the parser itself (though that does help increase confidence in the correctness of the parser under more exotic optimization).
You raise a fair point. Is there literature on PEG -> BNF transformations? I'm thinking that maybe the BNF could be built while preserving the unambiguity of the source PEG.
Wait, optimization? I was thinking of post-semantic-analysis optimization, which is probably unrelated to what you meant. Do you mean optimizations in the sense that we want to know how correct the output is when building Pest with optimizations enabled?
It would be really neat to have a library to do this. Here's an example of another project that does a similar thing: https://github.com/d0c-s4vage/gramfuzz