Giza is a parser, but not in the same vein as what you're used to. For one, it's not a parser-generator. Rather than turning a grammar into a bunch of source code, it instead converts the grammar into a ready-to-use form, and can then start parsing immediately. No compiling necessary!
One of the cool benefits of this is that you can try out and edit your grammar in real time. Take a look at the included
giza program. It's actually a REPL for editing grammars and parsing text. How cool is that!?
Giza would probably be classified as a kind of GLR parser. It goes through all possible paths without the need for backtracking. It can also handle ambiguities, by producing multiple parse trees, if needed. The format of its grammars is not quite BNF; instead the syntax is more inspired by programming languages and regexes. The syntax of grammars is described entirely by the "Supergrammar", which is written in it's own syntax. Take a look at
Giza maintains a distinction between "tokenized" and "non-tokenized" processing. The former is what parsers usually do, constructing a parse tree from a stream of input tokens. The latter is more akin to what regexes do, matching against a stream of input characters instead. Herein, non-tokenized processing often goes by the name "spanning" and is done by a "spanner".
Here's how you would work with a grammar in the REPL:
>>> expr = ( mult-expr | add-expr | sub-expr ); >>> mult-expr = sub-expr ( [*/%] sub-expr )+; >>> add-expr = ( mult-expr | sub-expr ) ( [-+] ( mult-expr | sub-expr ) )+; >>> sub-expr = ( number | var | paren ); >>> <token> number = [\d]+; >>> <token> var = [\l]+;
Then you can check the definitions for errors like this:
>>> check --tokenized There are errors in the grammar: Definition 'sub-expr' references a definition 'paren' which is not defined.
Oops, we forgot to define
>>> paren = '(' expr ')'; >>> check --tokenized There are no errors or warnings.
Once that's settled, we can parse some text:
>>> parse expr '1 + 2 * 3 - four / 5 % 6 + seven' There is 1 valid parse of the input.
To get more info, use the
--verbose flag to print out the parse tree:
>>> parse expr --verbose '1 + 2 * 3 - four / 5 % 6 + seven' There is 1 valid parse of the input. expr add-expr sub-expr 1 number + $implicit char class +- mult-expr sub-expr 2 number * $implicit char class %*/ sub-expr 3 number - $implicit char class +- mult-expr sub-expr four var / $implicit char class %*/ sub-expr 5 number % $implicit char class %*/ sub-expr 6 number + $implicit char class +- sub-expr seven var
In the left column, you see the sequence of tokens in the input. On the right is the parse tree with indentation to show hierarchy, with each matching node on the same line of text as the token that matched it.
But that's not all. There's tons of stuff the REPL can do. It has a built-in help system to explain everything:
>>> help Usage: >>> [options] >>> help [command_or_topic] >>> command [args...] Commands: help Display general help, or help on a specific topic. list List all of the definitions currently defined. print Print definitions as text in giza grammar format delete Delete the specified definitions. save Save definitions to a file as text in giza grammar format load Load definitions from a file check Check definitions for errors parse Parse one or more inputs with a tokenized grammar, starting with a given definition, and print how many valid parse trees are found span Span one or more inputs with a non-tokenized grammar, starting with a given definition, and print how many valid span trees are found render Convert definitions to state machine format and render the state machines to a C# class.
Example in Code
Once you've worked out your language's grammar, you typically want to use it in some other application. To do that, follow these steps:
- Save all relevant definitions to a file. By convention, the file extension is
.giza, but you can use anything. See an example here.
- Use the
rendercommand. This takes your grammar, converts all of the definitions into state machine format, and then emits the C# code for a class that creates the same state machine representation1. See example here.
- Create a class that takes your
*Grammarclass and plugs it into a
Parserobject. Whenever your want to parse some input text, pass it to the
Parser.Parsemethod to get a list of parse trees. See example here.
- [Optional] If there's more than one parse tree, then there's some ambiguity in your grammar with that particular input. You can do some kind of semantic analysis to choose which of the parse trees is the 'right' one. Or you can just pick the first one, if you're lazy.
- Convert the generalized parse tree(s) into whatever domain objects you need. See example here.
1 This kinda violates the "not turning a grammar into a bunch of source code" assertion above, but not entirely. The code so generated is not capable of parsing anything. It's basically a data structure serialized as C#. We could just as well store the raw text of the grammar file and run it through
SupergrammarSpanner to generate the state machine representation, which is what the
render command does for you. Whatever. We're working on making it prettier.
State of the project
Unfortunately, you caught us right in the middle of a major architecture overhaul. Much of the internals of how the system work are being completely re-worked. In particular, we're trying to treat tokenized and non-tokenized processing as special cases of a generalized pattern-matching system. Hence, the hideous
Spanner2 dichotomy. It's a work in progress. Code and API may change at any time, although the command-line tool will be pretty stable. There's a lot of cosmetic stuff to be fixed up as well, such as the fact that
Supergrammar.txt should really be named
Supergrammar.giza. Stay tuned!