Skip to content
This repository

HTTPS clone URL

Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP
branch: master
Jonathan Scott Duff
file 320 lines (245 sloc) 14.211 kb

This text assumes that you have a cursory understanding of Perl 6 and some knowledge of Perl 6 regex and are familiar with the concepts of object oriented programming (but not necessarily in Perl 6). No deep knowledge of these subjects is required; most concepts will be explained in detail as if the reader has very little understanding of them. At the end of this introduction, the reader should be able to undertand and use grammars in Perl 6.

The examples in this introduction were all validated against the Rakudo Perl 6 compiler which can be found at http://rakudo.org/downloads and http://github.com/rakudo/rakudo/downloads. There is also a more complete Perl 6 distribution built on top of Rakudo called Rakudo Star. This distribution may be found at http://github.com/star/dowloads and comes with several useful modules, some of which illustrate Perl 6 Grammars.

The foundation of grammars in Perl 6 is the ability to do pattern matching on strings. This is accomplished in Perl 6 through the use of regex. A regex in Perl 6 parlance is the natural evolution of regular expressions. Regular expressions are a small, compact language used to describe strings of text. How they differ is that while regular expressions are declarative in nature, regex may have some procedural aspects.

Below are some examples of regex in Perl 6:

    / abc /         # match the string "abc"
    / a b* /        # match a string containing "a" and any number of "b"
                    # (including none)
    / ^ a* $ /      # match a string consisting entirely of "a" characters
    / \w+ /         # match a string containing a sequence of word characters
    / \d+ /         # match a string containing a sequence of digits

Regex are traditionally delimited by slash characters (/), but Perl 6 allows you to use many alternate delimiters. As we'll see below, popular delimiters are curly braces ({ and }).

The result of a pattern match using a regex is a Match object which can provide the results of the match in various ways. In a boolean context it just returns a true or false value to tell you whether the match succeeded or not. And in a list context you get all of the portions of the string that were captured within the pattern. See below for more information.

Applying regex to a string is useful in that it gives you a yes/no response answering the question, "Does the string match this pattern?" However, as regex become more complicated, it makes sense that you would want to give meaningful names to portions of the regex. For instance, to match a person's name you could use a regex like this:

    / \w+ \s+ \w+ \s+ \w+ /     # first name, middle name, last name

But wouldn't it be a little more useful to the person reading that regex if the pieces were named something like this:

    / <first-name> \s+ <middle-name> \s+ <last-name> /

Sure, that's a little more verbose, but it eliminates some commentary and makes the code a little more self-documenting. Now when someone changes the code they are also changing comments. Which, of course, eliminates the possibilty of them becoming out of sync.

Perl 6 allows you to declare a meaningful name for your regex just as you would give meaningful names to your subroutines. In fact, any legal identifier that can be used for a subroutine name may also be used for the name of your regex. So, the above regex may have been declared thusly:

    regex person's-name { <first-name> \s+ <middle-name> \s+ <last-name> }
    regex first-name { \w+ }
    regex middle-name { \w+ }
    regex last-name { \w+ }

Since named regex are analogous to subroutines, we've switched to using curly braces as the regex delimiter and using an explicit declarator regex. Before, when we used the / ... / syntax, it was a short-hand for auto-declaring a regex as if we'd said regex { ... }. The slash syntax exists for historical reasons and because it's such a common occurence to declare and immediately use the regex that it deserves a short syntaxThis is a principle know as Huffman Encoding. David Huffman showed that the maximum compression technique utilizing "dictionaries" would encode the most frequent token with the shortest sequence. Perl 6 applies this principle to language design so that the more frequently used constructs are relatively short and constructs that should be used less frequently are longer..

Another benefit, besides self documentation is that when a pattern match succeeds, you may ask about portions of the match in a meaningful way. Once you've matched the person's-name, how do you know their last name? With a named regex, you can extract that part by name.

Remember before when we said that the result of matching a regex to a string is a Match object that can provide match results in various ways? With a named regex, the Match object can be treated like a hash with the name of the regex as the key and the portion of the string that was matched by that named regex as the value.

At last we come to it. Grammars. Naming regex is fine, but just as subroutines can be grouped into logically cohesive groups (modules), so too can regex. This is the function of a grammar.

A grammar allows you to think of higher-level concepts as a unit. Rather than just a regex to match a person's name and a regex to match an address and a regex to match a phone number and so forth, you can have a grammar that matches an employee record. Rather than just a regex that matches tags and a regex that matches attributes, you can have a grammar that matches XML.

Grammars are declared in Perl, oddly enough, with the grammar declarator:

    grammar Some-Name-Here {
        # named regex here
    }

Any of the named regex declared within the block "belong" to the grammar. When they are mentioned within other named regex, Perl will look within the grammar for a regex with the appropriate nameThis isn't the whole truth, but that's why this is just an introduction and not a reference. This is very similar to the way objects work. Grammars are analogous to classes and the named regex within are analogous to methods.

So, how do you "mention" a named regex? Just as in the earlier examples, if you enclose the name in angle brackets (< and >), Perl will try to match that named regex at that point in the regex. It's almost as if you substituted the name with the actual regex.

As a side-effect of using a named regex, the portion of the string that matches will be saved as part of the Match object and the Match object will obtain a hash-like interface where the keys are the names of the regex and the values are the portion of the string that matched.

If you don't want this capturing behaviour, but still want the benefit of named regex, you can call the regex with a leading dot ., like so:

    regex foo { <.bar> }

Why would you want to turn off the capturing behavior? For one thing, the grammar engine has more bookkeeping to do in order for portions of the string to be captured. So, there's a bit of an efficiency gain to not capture. Also, there are times when a named regex really isn't a regex at all, but rather just a means to execute some arbitrary Perl code. Capturing the result of that execution may not be desirable or meaningful.

Later we'll see more complete examples that utilize all of the things we've talked about so far.

Okay ... so far we've danced around the declarational details for grammars, but then what? What's the syntax for matching a string against a grammar? Each grammar automatically gets a method called .parse() that allows you to do just that:

    my $match = YourGrammar.parse($some-string);

Afterwards, $match will contain the Match object that will allow you to access the parts of the string that were captured via a capturing group (parentheses) or a named regex.

By default, calling .parse() as above will try to match the string against a regex named TOP within the grammar. If the grammar has no regex named TOP, then an error is generated. TOP is considered the entry point to the grammar. But, you aren't stuck with that name. If you want to use a different named regex as the starting point for parsing a string with a grammar, you can specify it in the call to .parse():

    my $match = YourGrammar.parse($some-string, :rule<fred>);

This invocation will use the regex named fred within YourGrammar to parse $some-string.

The analog between grammars and classes runs quite deep. So deep, that grammars can inherit from other grammars just like classes may inherit from other classes. This is, in fact, how the built-in named regex are available to you. All grammars are automatically derived from the base Grammar class which has the default definitions for the built-in named regex.

More over you can also compose grammars just like you'd compose classes by using roles. If you find yourself using a particular subset of a grammar over and over again, you could factor it out into a role and then compose that role into your grammar just like you would compose a role into a class:

Just being able to parse a string and being able to understand its components is great, but often times, you want to do something as you parse. And more often than not, what you do depends on the text that you've parsed.

Like anything else in Perl, there's more than one way to do it, and which way you choose will depend on your exact needs. The simplest way is probably just to include the code directly in the regex that make up your grammar.

    grammar Foo {
        regex foo  { blah blah { say "hi" } blah blah }
    }

In Perl 6, the curly braces are consistently used to denote an executable block of code--even in a regex. If the portion of the regex before the code block matches, then the code block is immediately executed. Be careful though, if, in the process of matching a string, your regex happens to backtrack across the code block, it will be executed as the parse moves forward across it again and again. As a simple example of this phenomenon, consider the following pattern match:

    "aaa" ~~ / a { say "hi" } b /

As the regex engine tries to match the "ab" sequence, it will match the first "a" in the string, then execute the code (and say "hi"), then attempt to match a "b". Since there is no "b" immediately after the first "a" in the string, the regex engine skips ahead one character in the string and backtracks to try to match the "a" again. Again, it matches an "a", says "hi", fails to match a "b" and then backtracks. On the third attempt, the same sequence of events happens except this time it runs out of string attempting to match the "b" and so the process ends with a failed pattern match. However, whether the match succeeds (if it had ended in a "b") or fails, it still outputs "hi" 3 times.

As should be evident from the example above, the ability to execute code in this manner really has nothing to do with grammars, but is a feature of regex in general.

Another mechanism to execute arbitrary code while parsing the grammar is to create a method within the grammar and then include a call to it at the point in the parse were you want the code to execute.

    grammar foo {
        regex foo { <.setup> blah blah }
        method setup {
            # do stuff here
        }
    }

Remember before when we mentioned that a grammar is just a funny kind of class? The ability to define methods on the grammar is just another manifestation of that.

Yet another way to execute code during parsing is to specify an actions object to the .parse() method. After a named regex is completely parsed, a method of the same name in the actions object is executed with the current match object as its parameter.

The full syntax of a call to .parse() is thus:

    my $match = Grammar.parse($string, :rule($start-regex), :actions($actions-object) );

Add more useful stuff

Hopefully I've covered enough of grammars for you to start playing with them and using them in your own code. For more information on Perl 6 grammars, see the official Perl 6 documentation at http://perlcabal.org/syn/S05.html. There are also some historical documents at http://dev.perl.org/perl6/doc/design/apo/A05.html and http://dev.perl.org/perl6/doc/design/exe/E05.html that may give you a feel for things. If you're really interested in learning more but feel you need to interact with people try the mailing list at perl6-language@perl.org or log on to a freenode IRC server and drop by #perl6.

Jonathan Scott Duff is an Information Technology Research Manager at the Conrad Blucher Institute for Surveying and Science on the campus of Texas A&M University-Corpus Christi. He has a beautiful wife and 4 lovely children. When not working or spending time with his family, Scott tries to keep up with Parrot and Perl 6 development. Sometimes he can be found on IRC as PerlJam in one of the perl-related channels. But if you really want to get in touch with him, the best way is via email: duff@pobox.com

Copyright 2011 Jonathan Scott Duff

Hey! The above document had some coding errors, which are explained below:

Deleting unknown formatting code N<>

Deleting unknown formatting code N<>

=end for without matching =begin. (Stack: [empty])

=end todo without matching =begin. (Stack: [empty])

Something went wrong with that request. Please try again.