Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP
421 lines (319 sloc) 15.395 kb

Grammars organize regexes, just like classes organize methods. The following example demonstrates how to parse JSON, a data exchange format already introduced (see sec:multis).

A grammar contains various named regex. Regex names may be constructed the same as subroutine names or method names. While regex names are completely up to the grammar writer, a regex named TOP will, by default, be invoked when the .parse() method is executed on a grammarThe name of the regex that is automatically invoked may also be specified as a parameter to .parse(). For instance, JSON::Tiny::Grammar.parse($tester, :rule<object>) will start parsing at the regex named object . So, the call to JSON::Tiny::Grammar.parse($tester) attempts to match the regex named TOP to the string $tester.

However, the example code does not seem to have a regex named TOP; it has a rule named TOP. Looking at the grammar, you'll also note that there are also token declarations as well. What's the difference? Each of the token and rule declarations are just regex declarations with special default behavior. A token declares a regex that does not backtrack by default, so that when a partial pattern match fails, the regex engine will not go back up and try another alternative (this is equivalent to using the :ratchet modifier). A rule will also not backtrack by default, additionally, sequences of whitespace are deemed significant and will match actual whitespace within the string using the built-in <ws> rule (this is equivalent to using the :sigspace modifier).

Add code to POD processor that does the right thing with S?? links.

See sec:regexes for more information on modifiers and how they may be used or look at S05.

Usually, when talking about regex that are rules or tokens, we tend to call them rules or tokens rather than the more general term "regex", to distinguish their special behaviors.

In this example, the TOP rule anchors the match to the start (with ^) and end of the string (with $), so that the whole string has to be in valid JSON format for the match to succeed. After matching the anchor at the start of the string, the regex attempts to match either an <array> or an <object>. Enclosing a regex name in angle brackets causes the regex engine to attempt to match a regex by that name within the same grammar. Subsequent matches are straightforward and reflect the structure in which JSON components can appear.

Regexes can be recursive. An array contains value. In turn a value can be an array. This will not cause an infinite loop as long as at least one regex per recursive call consumes at least one character. If a set of regexes were to call each other recursively without progressing in the string, the recursion could go on infinitely and never proceed to other parts of the grammar.

The example JSON grammar introduces the goal matching syntax which can be presented abstractly as: A ~ B C. In JSON::Tiny::Grammar, A is '{', B is '}' and C is <pairlist>. The atom on the left of the tilde (A) is matched normally, but the atom to the right of the tilde (B) is set as the goal, and then the final atom (C) is matched. Once the final atom matches, the regex engine attempts to match the goal (B). This has the effect of switching the match order of the final two atoms (B and C), but since Perl knows that the regex engine should be looking for the goal, a better error message can be given when the goal does not match. This is very helpful for bracketing constructs as it puts the brackets near one another.

Another novelty is the declaration of a proto token:

The proto token syntax indicates that value will be a set of alternatives instead of a single regex. Each alternative has a name of the form token value:sym<thing>, which can be read as alternative of value with parameter sym set to thing. The body of such an alternative is a normal regex, where the call <sym> matches the value of the parameter, in this example thing.

When calling the rule <value>, the grammar engine attempts to match the alternatives in parallel and the longest match wins. This is exactly like normal alternation, but as we'll see in the next section, has the advantage of being extensible.

Grammar Inheritance

The similarity of grammars to classes goes deeper than storing regexes in a namespace as a class might store methods. You can inherit from and extend grammars, mix roles into them, and take advantage of polymorphism. In fact, a grammar is a class which by default inherits from Grammar instead of Any. The Grammar base grammar contains broadly useful rules predefined. For instance, there is a rule to match alphabetic characters (<alpha>), and another to match digits (<digit>), and another to match whitespace (<ws>), etc.

Suppose you want to enhance the JSON grammar to allow single-line C++ or JavaScript comments, which begin with // and continue until the end of the line. The simplest enhancement is to allow such a comment in any place where whitespace is valid.

However, JSON::Tiny::Grammar only implicitly matches whitespace through the use of rules. Implicit whitespace is matched with the inherited regex <ws>, so the simplest approach to enable single- line comments is to override that named regex:

The first two lines introduce a grammar that inherits from JSON::Tiny::Grammar. Just as subclasses inherit methods from superclasses, so grammars inherit rules from its base grammar. Any rule used within the grammar will be looked for first in the grammar in which it was used, then within its parent(s).

In this minimal JSON grammar, whitespace is never mandatory, so ws can match nothing at all. After optional spaces, two slashes '//' introduce a comment, after which must follow an arbitrary number of non- newline characters, and then a newline. In prose, the comment starts with '//' and extends to the rest of the line.

Inherited grammars may also add variants to proto tokens:

In this grammar, a call to <value> matches either one of the newly added alternatives, or any of the old alternatives from the parent grammar JSON::Tiny::Grammar. Such extensibility is difficult to achieve with ordinary, | delimited alternatives.

Extracting data

The .parse method of a grammar returns a Match object through which you can access all the relevant information of the match. Named regex that match within the grammar may be accessed via the Match object similar to a hash where the keys are the regex names and the values are the Match object that represents that part of the overall regex match. Similarly, portions of the match that are captured with parentheses are available as positional elements of the Match object (as if it were an array).

Once you have the Match object, what can you do with it? You could recursively traverse this object and create data structures based on what you find or execute code. Perl 6 provides another alternative: action methods.

This example passes an actions object to the grammar's parse method. Whenever the grammar engine finishes parsing a regex, it calls a method on the actions object with the same name as the regex. If no such method exists, the grammar engine continues parsing the rest of the grammar. If a method does exist, the grammar engine passes the current match object as a positional argument.

Each match object has a slot called ast (short for abstract syntax tree) for a payload object. This slot can hold a custom data structure that you create from the action methods. Calling make $thing in an action method sets the ast attribute of the current match object to $thing.

In the case of the JSON parser, the payload is the data structure that the JSON string represents. For each matching rule, the grammar engine calls an action method to populate the ast slot of the match object. This process transforms the match tree into a different tree--in this case, the actual JSON tree.

Although the rules and action methods live in different namespaces (and in a real-world project probably even in separate files), here they are adjacent to demonstrate their correspondence:

The make explanation is fuzzy. The rest of this chapter assumes some implicit knowledge that readers likely won't have now. The real insight for me was realizing that transforming trees is the best way to write a compiler, but I don't expect readers to have gone through the trouble of writing compilers the hard way first.

The TOP rule has an alternation with two branches, object and array. Both have a named capture. $/.values returns a list of all captures, here either the object or the array capture.

The action method takes the AST attached to the match object of that sub capture, and promotes it as its own AST by calling make.

The reduction method for object extracts the AST of the pairlist submatch and turns it into a hash by calling its hash method.

The pairlist rule matches multiple comma-separated pairs. The reduction method calls the .ast method on each matched pair and installs the result list in its own AST.

A pair consists of a string key and a value, so the action method constructs a Perl 6 pair with the => operator.

The other action methods work the same way. They transform the information they extract from the match object into native Perl 6 data structures, and call make to set those native structures as their own ASTs.

The action methods for proto tokens include the full name of each individual rule, including the sym part:

When a <value> call matches, the action method with the same symbol as the matching subrule executes.

POD ERRORS

Hey! The above document had some coding errors, which are explained below:

Around line 1:

Unknown directive: =head0

Around line 3:

A non-empty Z<>

Around line 5:

Deleting unknown formatting code A<>

Around line 74:

Deleting unknown formatting code N<>

Around line 102:

=end author without matching =begin. (Stack: [empty])

Around line 360:

=end for without matching =begin. (Stack: [empty])

Jump to Line
Something went wrong with that request. Please try again.