Arnaud Bailly edited this page Jan 13, 2017 · 7 revisions

jparsec Overview

In a typical parser program written in jparsec, the programmer creates a bunch of Parser objects and combines them together. These Parser objects each represent a piece of parsing logic.

How do I start?

With jparsec, one constructs Parser objects in terms of the production rules of the grammar. Once a Parser object is created, it can be used as in:

parser.parse("code to parse");

Depending on your need, this return value can be either the calculation result or an abstract syntax tree.

So how does one create a Parser object? The following are the most important classes:

  • Parser: A parser encapsulates a piece of parsing logic, and simple Parser objects can be combined to create more complex parsers.
  • Parsers: Implementations of common parsers
  • Scanners: A scanner is a parser that scans the source string and recognizes certain patterns.
  • Terminals: Provides tokenizers and lexers for common terminals such as identifier, integer, scientific number etc.
  • OperatorTable: Supports operator precedence grammars. The programmer declares operators in an OperatorTable and the framework will take care of constructing a full-blown expression parser.

What are the top 5 combinators that I need to familiarize myself with?

  • or: This is the logical alternative combinator. The production rule A ::= B|C|D can be represented as:

    Parser<Foo> a = Parsers.or(b, c, d);
  • next/sequence: This is the sequential combinator. The production rule A ::= BCD can be represented as:

    Parser<Foo> a = Parsers.sequence(b,c,d);

    With alternative and sequential combinators, one can start building sophisticated parsers.

  • map/sequence: When constructing a parser, we typically want to not only recognize a certain grammar, but also to build some object or perform some computation based on the recognized grammar. This family of map/sequence combinators can be used to perform such computations. For example, in order to use parser results of types B, C, D to create an object of type A, one can implement the callback interface Map3, which accepts the parser results of B, C and D as input parameters and returns an object of type A as a result.

    Implementing anonymous class for the Map interfaces could be verbose though. A convenience Mapper class is provided to simplify the syntax. It requires an additional dependency on cglib

  • many/many1: These combinators implement the "kleene star" and "kleene cross" logic in BNF. A ::= B* is represented as:

    Parser<Foo> foo = ...;
    Parser<Void> a = foo.skipMany();


    Parser<Foo> foo = ...;
    Parser<List<Foo>> a = foo.many();

    where the latter will additionally return a list of Foo objects as the parser result.

  • lazy: Production rules can involve recursion. (for example, an expression with binary operators is represented recursively in a production rule). The lazy combinator allows a parser to reference a parser that will be set later.

Lexical analysis vs. syntactical analysis

In a simple scenario, all the work can be done in the scanning phase. For example:

Parser<List<String>> numbers = Scanners.INTEGER.sepBy(Scanners.isChar(','));
assertEquals(Arrays.asList("1", "2", "3"), numbers.parse("1,2,3"));

However, when the complexity of the grammar rule scales up and there are whitespaces and comments to ignore from the grammar, one-pass parsing becomes awkward. A 2-pass approach can then be used. That is, a lexical analysis phase scans the source as a list of Tokens and then a second syntactical analysis phase parses the tokens.

The Terminals class provides common tokenizers that scans the source string and turns them into tokens. It also provides corresponding syntactic parsers that recognize these tokens in the syntactical analysis phase.

A syntactical parser takes a list of tokens as input; this list needs to come from the output of a lexer. The Parser.from() API can be used to chain a syntactical parser with a lexer.

What are the typical steps in creating a conventional 2-pass parser?

Step 1: Terminals

Use the pre-defined tokenizers and terminal syntactical parsers in Terminals to define the atoms of your language.

For example, the following parser parses a list of integers separated by a comma, with whitespaces and block comments ignored.

Terminals operators = Terminals.operators(","); // only one operator supported so far
Parser<?> integerTokenizer = Terminals.IntegerLiteral.TOKENIZER;
Parser<String> integerSyntacticParser = Terminals.IntegerLiteral.PARSER;
Parser<?> ignored = Parsers.or(Scanners.JAVA_BLOCK_COMMENT, Scanners.WHITESPACES);
Parser<?> tokenizer = Parsers.or(operators.tokenizer(), integerTokenizer); // tokenizes the operators and integer
Parser<List<String>> integers = integerSyntacticParser.sepBy(operators.token(","))
    .from(tokenizer, ignored.skipMany());
assertEquals(Arrays.asList("1", "2", "3"), integers.parse("1, /*this is comment*/2, 3");

Step 2: Production rule

The next step is to build the syntactical parser following production rules. The "integers" parser used above is a simple example. Real parsers can be arbitrarily complex. For operator precedence grammars, OperatorTable can be used to declare operator precedences and associativities and construct a parser based on the declaration.

As in most recursive descent parsers, left-recursion needs to be avoided. Avoid writing a parser like this:

Parser.Reference<Expr> ref = Parser.newReference();
Parser<Expr> expr = Parsers.sequence(ref.lazy(), operators.token("+"), number); // left recursion!

It will fail with stack overflow!

A less obvious left-recursion is a production rule that looks like:

Parser.Reference<Expr> ref = Parser.newReference();
Parser<Expr> expr = Parsers.sequence(operators.token("-").many(), ref.lazy());

As many can occur 0 times, we have a potential left recursion here.

Although left recursive grammar isn't generally supported, the most common case of left recursion stems from left associative binary operator, which is handled by [OperatorTable] (https://github.com/abailly/jparsec/blob/master/jparsec/src/main/java/org/jparsec/OperatorTable.java).

Please see jparsec Tips for tips and catches.