Tutorial
In this tutorial, we will use the classic example of a calculator to show how jparsec's declarative API can be used to construct a parser.
Our calculator should be able to:
- support calculations on decimal numbers.
- support the operators '+', '-', '*', '/'.
- allow '(' and ')' to group expressions.
- To make the parser more interesting, we'll allow Java-style single-line comments starting with "//" and block comments enclosed by "/*" and "*/".
- In math, we typically omit the '*' operator for multiplication. For example, "3a" is equivalent to "3*a". Though for simplicity we don't include symbols in the calculator, we still want to allow whitespace as an alternative syntax for multiplication.
A typical parsing process takes 2 steps:
- Scan the input and create a list of tokens. These tokens include keywords, literals, identifiers, operators etc.
- The program that recognizes a single token is called a "tokenizer".
- Parse the tokens based on the grammar of the language.
First, let's create our tokenizer. The tokenizer should be able to recognize decimal numbers, parentheses and the operators. The input character sequence will be recognized and translated into tokens.
The prebuilt Terminals.DecimalLiteral can be used to tokenize decimal literals.
An instance of Terminals can be created to manage our operators:
private static final Terminals OPERATORS =
Terminals.operators("+","-","*","/", "(", ")");
This calculator allows any amount of whitespace before and after any number or operator.
Single-line comments and block comments are also allowed and treated as whitespace.
static final Parser<Void> IGNORED = Parsers.or(
    Scanners.JAVA_LINE_COMMENT,
    Scanners.JAVA_BLOCK_COMMENT,
    Scanners.WHITESPACES).skipMany();
With the number literal tokenizer and the OPERATORS constant, our tokenizer will be:
static final Parser<?> TOKENIZER = OPERATORS.tokenizer().cast()
.or(Terminals.DecimalLiteral.TOKENIZER);
Note the use of cast(), which (ab)uses Java generics to force the compiler to handle this definition properly.
And given any syntactic parser, we will be able to chain it with the tokenizer and ignore the whitespace and comments:
Parser<T> grammar = ...;
Parser<T> parser = grammar.from(TOKENIZER, IGNORED);
With the tokenizer taking care of lexical analysis, we can now focus on grammar rules.
From the tokenizer, we get two different kinds of tokens: decimal numbers and operators. To convert a decimal number token to a Double object, we use the corresponding syntactic parser for decimal literals and transform its string result:
static final Parser<Double> NUMBER =
Terminals.DecimalLiteral.PARSER.map(Double::valueOf);
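Although the full grammar is not in place yet, we can already sanity-check this piece by chaining it with the tokenizer as described above. A minimal sketch (the input string is just an arbitrary example):
Parser<Double> number = NUMBER.from(TOKENIZER, IGNORED);
System.out.println(number.parse("3.14 /* pi, roughly */"));  // prints 3.14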
All the operators are managed by the OPERATORS constant. The Terminals.token()
method returns a parser that recognizes these operators. A convenience method can be created on top of it to save us a few keystrokes:
static Parser<?> term(String... names) {
  return OPERATORS.token(names);
}
For each operator, the parser first recognizes the token using the term() method and then returns the corresponding BinaryOperator/UnaryOperator instance, as in:
static <T> Parser<T> op(String name, T value) {
  return term(name).retn(value);
}
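For instance, the addition operator used later in the operator table would be built like this (BinaryOperator here is the standard java.util.function.BinaryOperator; the variable name is just for illustration):
// Consumes a "+" token and yields the addition function as the parse result.
Parser<BinaryOperator<Double>> plus = op("+", (l, r) -> l + r);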
At the beginning, we said we want to allow whitespace as an alternative syntax for multiplication. Well, formally speaking, whitespace by itself is not enough to make an operator. For example, "2 -3" should still be parsed as a subtraction, not a multiplication of "2" and "-3". Therefore, our whitespace operator should only apply when none of "+", "-", "*", "/" is present.
So we have:
static final Parser<?> WHITESPACE_MUL = term("+","-","*","/").not();
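To see the difference this makes, here is a small sketch that exercises the final CALCULATOR parser built at the end of this tutorial (expected values worked out by hand):
System.out.println(CALCULATOR.parse("2 -3"));  // -1.0: parsed as subtraction, not 2 * (-3)
System.out.println(CALCULATOR.parse("2 3"));   // 6.0: bare whitespace acts as multiplication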
In a classic recursive-descent parser, operators with different precedence are handled by defining different production rules, for example:
term_expr ::= ...
muldiv_expr ::= muldiv_expr ('*'|'/') term_expr
expr ::= expr ('+'|'-') muldiv_expr
This solution can lead to messy production rules when the number of operators and precedence levels grows.
It is more desirable to be able to specify the operator precedence declaratively.
Another drawback of recursive-descent parsers is left recursion.
In order to preserve the left-associative nature of the operators, the above production rules are defined left-recursively (the first symbol on the right-hand side of muldiv_expr is muldiv_expr itself).
Such a left-recursive definition leads to infinite recursion in a naive recursive-descent parser implementation.
Jparsec provides support for operator precedence grammars and automatically handles the left recursion incurred by left-associative operators.
Our target syntax should look like:
Parser<Double> unit = ...;
Parser<Double> parser = new OperatorTable<Double>()
    .infixl(op("+", (l, r) -> l + r), 10)
    .infixl(op("-", (l, r) -> l - r), 10)
    .infixl(Parsers.or(term("*"), WHITESPACE_MUL).retn((l, r) -> l * r), 20)
    .infixl(op("/", (l, r) -> l / r), 20)
    .prefix(op("-", v -> -v), 30)
    .build(unit);
The higher the precedence number, the more tightly it binds the operands.
In order to support parentheses, we need recursion. An expression inside a pair of parentheses is itself another expression, which can be nested inside yet another pair of parentheses. The way to handle recursion is to use Parser.Reference:
Parser.Reference<Double> ref = Parser.newReference();
Parser<Double> parenthesized = ref.lazy().between(term("("), term(")"));
If you run the "parenthesized" parser above, it will fail with an error message saying the reference is not set yet. And that is exactly what we are missing here. The reference needs to eventually be set to the parser for expressions because, well, literally any expression can be nested within a pair of parentheses.
So, putting it together with the operator table and whitespace operator, we have:
static Parser<Double> calculator(Parser<Double> atom) {
  Parser.Reference<Double> ref = Parser.newReference();
  Parser<Double> unit = ref.lazy().between(term("("), term(")")).or(atom);
  Parser<Double> parser = new OperatorTable<Double>()
      .infixl(op("+", (l, r) -> l + r), 10)
      .infixl(op("-", (l, r) -> l - r), 10)
      .infixl(Parsers.or(term("*"), WHITESPACE_MUL).retn((l, r) -> l * r), 20)
      .infixl(op("/", (l, r) -> l / r), 20)
      .prefix(op("-", v -> -v), 30)
      .build(unit);
  ref.set(parser); // DO NOT FORGET THIS!
  return parser;
}
And that is almost everything we need. Oh, wait, let's pass the NUMBER parser in and hook it up with the lexer to get the final parser:
static final Parser<Double> CALCULATOR =
calculator(NUMBER).from(TOKENIZER, IGNORED);
And it doesn't hurt to run it and see how it happily calculates, does it?
System.out.println(CALCULATOR.parse("1 + 2 (4 - 3) /*comment*/"));
The final parser code is as follows:
public class Calculator {
  static final Parser<Double> NUMBER =
      Terminals.DecimalLiteral.PARSER.map(Double::valueOf);
  private static final Terminals OPERATORS =
      Terminals.operators("+", "-", "*", "/", "(", ")");
  static final Parser<Void> IGNORED = Parsers.or(
      Scanners.JAVA_LINE_COMMENT,
      Scanners.JAVA_BLOCK_COMMENT,
      Scanners.WHITESPACES).skipMany();
  static final Parser<?> TOKENIZER =
      Parsers.or(Terminals.DecimalLiteral.TOKENIZER, OPERATORS.tokenizer());

  static Parser<?> term(String... names) {
    return OPERATORS.token(names);
  }

  static final Parser<?> WHITESPACE_MUL = term("+", "-", "*", "/").not();

  static <T> Parser<T> op(String name, T value) {
    return term(name).retn(value);
  }

  static Parser<Double> calculator(Parser<Double> atom) {
    Parser.Reference<Double> ref = Parser.newReference();
    Parser<Double> unit = ref.lazy().between(term("("), term(")")).or(atom);
    Parser<Double> parser = new OperatorTable<Double>()
        .infixl(op("+", (l, r) -> l + r), 10)
        .infixl(op("-", (l, r) -> l - r), 10)
        .infixl(Parsers.or(term("*"), WHITESPACE_MUL).retn((l, r) -> l * r), 20)
        .infixl(op("/", (l, r) -> l / r), 20)
        .prefix(op("-", v -> -v), 30)
        .build(unit);
    ref.set(parser);
    return parser;
  }

  public static final Parser<Double> CALCULATOR =
      calculator(NUMBER).from(TOKENIZER, IGNORED);
}
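Here is a short, hand-checked sketch of exercising the finished parser; the expected values in the comments are computed by hand:
System.out.println(Calculator.CALCULATOR.parse("1 + 2 * 3"));    // 7.0: '*' binds tighter than '+'
System.out.println(Calculator.CALCULATOR.parse("(1 + 2) * 3"));  // 9.0: parentheses override precedence
System.out.println(Calculator.CALCULATOR.parse("-6 / 2"));       // -3.0: prefix '-' applied first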
In this tutorial, we chose to run the calculation directly in the semantic actions.
We could defer this calculation and create an intermediate abstract syntax tree for further analysis and processing (such as type checking and code optimisation). But that makes no real difference in the parser code. We just need to replace the direct calculation code with code that creates the abstract syntax tree (of course, we also need to replace <Double> with whatever our expression type is).
This short tutorial cannot cover every detail of the jparsec API. There are many other useful utility methods provided to help the process of constructing a complex parser.
Please refer to the jparsec API Documentation for further information.
You are welcome to shoot a note to Ben Yu should you have any comments or questions.