Skip to content

Parsing Expression Grammars

pkoppstein edited this page Dec 26, 2022 · 18 revisions

jq as a PEG engine

jq contains all the ingredients required for conveniently writing PEGs and PEG parsers. In effect, one can write a PEG parser for a PEG grammar simply by transcribing the grammar into jq, though some care with respect to jq's "declare before use" rule may be required.

Other tasks, such as evaluation of arithmetic expressions defined by a PEG grammar, can be achieved with either no or only minimal changes to the transcribed grammar. This is illustrated in the second example below.

jq also has the machinery required to add parameterized non-terminals, as illustrated in Example 3 below.

For an introduction to Parsing Expression Grammars, see for example https://en.wikipedia.org/wiki/Parsing_expression_grammar

Table of Contents

Transcribing a PEG grammar to jq

Transcription to jq is possible because each of the seven basic[^1] PEG operations can be mapped directly to jq filters with the same semantics. Here is the mapping that will be used in this article:

 Sequence: e1 e2             e1 | e2
 Ordered choice: e1 / e2     e1 // e2
 Zero-or-more: e*            star(E)
 One-or-more: e+             plus(E)
 Optional: e?                optional(E)
 And-predicate: &e           amp(E) 
 Not-predicate: !e           neg(E)

where:

def star(E): (E | star(E)) // .;

def plus(E): E | (plus(E) // . );

def optional(E): E // .;

def amp(E): . as $in | E | $in;

def neg(E): select( [E] == [] );

Example 0: Balanced Square Brackets

A string consisting of nothing but square brackets ("[" and "]") may be said to be "balanced" if the repeated application of gsub("[[][]]"; "") yields the empty string. To obtain a jq program that checks whether or not a string of square brackets is balanced can be directly derived from a literal transcription of a PEG grammar for such strings as follows.
The definitions of the generic eos (end-of-string) and consume filters are given in the Appendix.


# a single expression
def balancedExpression: 
  consume("[[]") | optional(balancedExpression) | consume("[]]") ;

# The empty string is balanced:
def balanced:
  star(balancedExpression) | eos;
def isBalanced:
  {remainder: .} | (balanced | true) // false;

Example:

("", "[][]", "[[[[]", "]]][[[" )
| isBalanced

Example 1: a^n b^n c^n : n >= 1

Following the article on PEGs, the PEG for the string expressions a^n b^n c^n where n >= 1 can be written as follows:

 S ← &(A 'c') 'a'+ B !.
 A ← 'a' A? 'b'
 B ← 'b' B? 'c'

This is a nice example as it uses five of the seven basic PEG operations, including the two look-ahead operations & and !. In addition, it uses '.' to signify that the input is nonempty.

The PEG can be expressed in jq by simple transcription:

def A: literal("a") | optional(A) | literal("b");
def B: literal("b") | optional(B) | literal("c");
def S: amp(A | literal("c")) | plus(literal("a")) | B | neg(nonempty);

assuming that literal and nonempty have been suitably defined. Their implementations will depend on the task at hand. In the following we consider two specific tasks: recognition, and tokenization.

Recognition

For recognition of an input string, the following would suffice:

def literal($s): select(startswith($s)) | .[$s|length :];

def nonempty: select(length > 0);

If we want the output to be true or false depending on whether the input string conforms to the grammar, we could write:

(S | true) // false

Example:

Assuming all the above jq code has been assembled into a file, parse.jq:

   $ jq -f parse.jq <<< abc
   true

Tokenization

Let's suppose we wish to build up an array of the recognized terminals. In that case, we need only modify the functions that consume or examine the string input, i.e. literal and nonempty.

For convenience, on the jq side, we'll pass a JSON object of the form {"remainder": _, "result": _} through the filter.

The definitions of literal and nonempty thus become:

def literal($s):
  select(.remainder | startswith($s))
  | .result += [$s]
  | .remainder |= .[$s | length :] ;


def nonempty: select( (.remainder | length) > 0 );

And for convenience, we could define:

def parse: {remainder: .} | S | .result;

So, with parse as our main program, a run would now look like this:

jq -R -cf abc.jq <<< abc
["a","b","c"]

Declare-before-use

In jq, a filter can only be invoked after it has been declared. This means that when transcribing PEG rules into jq filters, the order is important. To handle circularly-defined PEG expressions, one must take advantage of jq's support for subfunctions, that is, defs within defs. This is illustrated in the next example, also taken from the Wikipedia article.

Example 2: arithmetic

As presented in the aforementioned Wikipedia article, a PEG for recognizing simple arithmetic expressions for addition, subtraction, multiplication and division applied to non-negative integers is as follows:

Expr    ← Sum
Value   ← [0-9]+ / '(' Expr ')'
Product ← Value (('*' / '/') Value)*
Sum     ← Product (('+' / '-') Product)*

In this case, for the reason mentioned in the previous section, care must be taken with transcription. Since the main goal is in any case to define Expr, it makes sense to define all the other PEG expressions as subfunctions of Expr:

# Expr    ← Sum
def Expr:

  # Value   ← [0-9]+ / '(' Expr ')'
  def Value: parseNumber( "[0-9]+" ) // (literal("(") | Expr | literal(")"));
   
  # Product ← Value (('*' / '/') Value)*
  def Product: (Value | star( (literal("*") // literal("/")) | Value));

  # Sum     ← Product (('+' / '-') Product)*
  def Sum: (Product | star( (literal("+") // literal("-")) | Product));

  Sum ;

With literal as previously defined for a recognition engine, and parseNumber as defined in the Appendix, we thus have a recognition engine.

Evaluation

In this section, we'll illustrate how the transcribed PEG can be adapted to evaluate the recognized arithmetic expressions.

The strategy will be to pass along an object {"remainder":, "result":} as with the previous example, except that when a parenthesized subexpression is encountered, the result will be augemented by an array. Thus the def of Value above will be modified to:

def Value: parseNumber( "[0-9]+" ) // (consume("[(]") | box(Expr) | consume("[)["));

where consume/1 simply consumes the string matching the regex input, and box/1 records Expr in an array. These helper functions, together with an eval function for evaluating the parsed result, are given in the Appendix below.

We can then define a jq function to evalute a string representing strings defined by the PEG as follows:

def evaluate:
  {remainder: .}
  | Expr.result
  | eval ;

Example:

   $ jq -f arithmetic.jq <<< "(1+2)*(3+4)+(1*1)"
   22

Example 3: [a-z]+ing

Because a PEG parser does not backtrack, the task of determining the first match of the regex [a-z]+ing in a string cannot be accomplished simply by writing a PEG rule such as:

plus(parse("[a-z]")) | literal("ing")

because the first expression will gobble up all the "i-n-g" characters, resulting in failure.

An appropriate PEG grammar for the task at hand could be achieved in jq as follows:

def atoz: parse("[a-z]");
def ing:  literal("ing");

plus(neg(ing) | atoz) | ing

Better yet, we could define a parameterized non-terminal to capture the pattern for reusability:

def Aplus_B(A; B): plus(neg(B) | A) | B;

So the PEG grammar now looks like the corresponding regular expression:

Aplus_B( atoz; ing )

Example 4: CSV-to-JSON

Here is a PEG-based program for converting a string of comma-separated values to a JSON array of the corresponding fields. This program can be used as simple CSV-to-TSV translator by adding a call to @tsv. The definitions of the helper functions (parse and consume) are given in the Appendix.

# White space
def ws: consume(" *");

def quoted_field_content:
  parse("((\"\")|([^\"]))*")
  | .result[-1] |= gsub("\"\""; "\"");

def unquoted_field: parse("[^,\"]*");

def quoted_field: consume("\"") | quoted_field_content | consume("\"");

def field: (ws | quoted_field | ws) // unquoted_field;

def record: field | star(consume(",") | field);

def csv2json:
  {remainder: .} | record | .result;

Appendix - Helper Functions

# end of string
def eos:
  select((.remainder|length == 0));

def eval:
  if type == "array" then
    if length == 0 then null
    else .[-1] |= eval
    | if length == 1 then .[0]
      else (.[:-2] | eval) as $v
      | if   .[-2] == "*" then $v * .[-1]
        elif .[-2] == "/" then $v / .[-1]
        elif .[-2] == "+" then $v + .[-1] 
        elif .[-2] == "-" then $v - .[-1] 
        else tostring|error
	end
      end
    end
  else .
  end ;

def box(E):
   ((.result = null) | E) as $e
   | .remainder = $e.remainder
   | .result += [$e.result]  # the magic sauce
   ;

# Consume a regular expression rooted at the start of .remainder, or emit empty;
# on success, update .remainder and set .match but do NOT update .result
def consume($re):
  # on failure, match yields empty
  (.remainder | match("^" + $re)) as $match
  | .remainder |= .[$match.length :]
  | .match = $match.string ;

def parse($re):
  consume($re)
  | .result = .result + [.match] ;

def parseNumber($re):
  consume($re)
  | .result = .result + [.match|tonumber] ;

[^1]: Note that some of these can be written in terms of the others, e.g. E+ is just EE*, so "basic" is not meant in the sense of "fundamental".

Clone this wiki locally