Permalink
Fetching contributors…
Cannot retrieve contributors at this time
605 lines (467 sloc) 20.8 KB
=begin pod :tag<perl6>
=TITLE Grammars
=SUBTITLE Parsing and interpreting text
Grammar is a powerful tool used to destructure text and often to return
data structures that have been created by interpreting that text.
For example, Perl 6 is parsed and executed using a Perl 6-style grammar.
An example that's more practical to the common Perl 6 user is the
L<JSON::Tiny module|https://github.com/moritz/json>, which can deserialize
any valid JSON file; however, the deserializing code is written in less than
100 lines of simple, extensible code.
If you didn't like grammar in school, don't let that scare you off grammars.
Grammars allow you to group regexes, just as classes allow you to group
methods of regular code.
=head1 X<Named Regexes|declarator,token>
The main ingredient of grammars is named L<regexes|/language/regexes>.
While the syntax of L<Perl 6 Regexes|/language/regexes> is outside the scope
of this document, I<named> regexes have a special syntax, similar to
subroutine definitions: N<In fact, named regexes can even take extra
arguments, using the same syntax as subroutine parameter lists>
=begin code
my regex number { \d+ [ \. \d+ ]? }
=end code
In this case, we have to specify that the regex is lexically scoped using
the C<my> keyword, because named regexes are normally used within grammars.
Being named gives us the advantage of being able to easily reuse the regex
elsewhere:
=begin code :preamble<my regex number{...};>
say so "32.51" ~~ &number; # OUTPUT: «True␤»
say so "15 + 4.5" ~~ /<number>\s* '+' \s*<number>/ # OUTPUT: «True␤»
=end code
B<C<regex>> isn't the only declarator for named regexes. In fact, it's the
least common. Most of the time, the B<C<token>> or B<C<rule>> declarators
are used. These are both I<ratcheting>, which means that the match engine
won't back up and try again if it fails to match something. This will
usually do what you want, but isn't appropriate for all cases:
=begin code
my regex works-but-slow { .+ q }
my token fails-but-fast { .+ q }
my $s = 'Tokens won\'t backtrack, which makes them fail quicker!';
say so $s ~~ &works-but-slow; # OUTPUT: «True␤»
say so $s ~~ &fails-but-fast; # OUTPUT: «False␤»
# the entire string get taken by the .+
=end code
Note that non-backtracking works on terms, that is, as the example below, if
you have matched something, then you will never backtrack. But when you fail to
match, if there is another candidate introduced by C<|> or C<||>, you will
retry to match again.
=begin code
my token tok-a { .* d };
my token tok-b { .* d | bd };
say so "bd" ~~ &tok-a; # OUTPUT: «False␤»
say so "bd" ~~ &tok-b; # OUTPUT: «True␤»
=end code
=head2 X<Rules|declarator,rule>
The only difference between the C<token> and C<rule> declarators is that the
C<rule> declarator causes L<C<:sigspace>|/language/regexes#Sigspace> to go
into effect for the Regex:
=begin code
my token non-space-y { 'once' 'upon' 'a' 'time' }
my rule space-y { 'once' 'upon' 'a' 'time' }
say so 'onceuponatime' ~~ &non-space-y; # OUTPUT: «True␤»
say so 'once upon a time' ~~ &non-space-y; # OUTPUT: «False␤»
say so 'onceuponatime' ~~ &space-y; # OUTPUT: «False␤»
say so 'once upon a time' ~~ &space-y; # OUTPUT: «True␤»
=end code
=head1 X<Creating grammars>
L<Grammar|/type/Grammar> is the superclass that classes automatically get when
they are declared with the C<grammar> keyword instead of C<class>. Grammars
should only be used to parse text; if you wish to extract complex data, you can
add actions within the grammar, or an
L<action object|/language/grammars#Action_objects> is recommended to be used in
conjunction with the grammar. If action objects are not used, C<.parse> returns
a L<Match> object and sets, by default, the
L<default match object C<$/>|/syntax/$$SOLIDUS>, to the same value.
=head2 X«Proto regexes| :sym<>; proto regex; declarator,grammar»
L<Grammar|/type/Grammar>s are composed of rules, tokens and regexes; these are
actually methods, since grammars are classes.
N<They are actually a special kind of class, but for the rest of the section,
they behave in the same way as a I<normal> class would>
These methods can share a name and
functionality in common, and thus can use L<proto|/syntax/proto>.
For instance, if you have a lot of alternations, it may become difficult to
produce readable code or subclass your grammar. In the C<Actions> class below,
the ternary in C<method TOP> is less than ideal and it becomes even worse the
more operations we add:
grammar Calculator {
token TOP { [ <add> | <sub> ] }
rule add { <num> '+' <num> }
rule sub { <num> '-' <num> }
token num { \d+ }
}
class Calculations {
method TOP ($/) { make $<add> ?? $<add>.made !! $<sub>.made; }
method add ($/) { make [+] $<num>; }
method sub ($/) { make [-] $<num>; }
}
say Calculator.parse('2 + 3', actions => Calculations).made;
# OUTPUT: «5␤»
To make things better, we can use proto regexes that look like C«:sym<...>»
adverbs on tokens:
grammar Calculator {
token TOP { <calc-op> }
proto rule calc-op {*}
rule calc-op:sym<add> { <num> '+' <num> }
rule calc-op:sym<sub> { <num> '-' <num> }
token num { \d+ }
}
class Calculations {
method TOP ($/) { make $<calc-op>.made; }
method calc-op:sym<add> ($/) { make [+] $<num>; }
method calc-op:sym<sub> ($/) { make [-] $<num>; }
}
say Calculator.parse('2 + 3', actions => Calculations).made;
# OUTPUT: «5␤»
In the grammar, the alternation has now been replaced with C«<calc-op>»,
which is essentially the name of a group of values we'll create. We do so by
defining a rule prototype with C<proto rule calc-op>. Each of our previous
alternations have been replaced by a new C<rule calc-op> definition and the
name of the alternation is attached with C«:sym<>» adverb.
In the class that declares actions, we now got rid of the ternary operator and
simply take the C<.made> value from the C«$<calc-op>» match object. And the
actions for individual alternations now follow the same naming pattern as in the
grammar: C«method calc-op:sym<add>» and C«method calc-op:sym<sub>».
The real beauty of this method can be seen when you subclass that grammar
and actions class. Let's say we want to add a multiplication feature to the
calculator:
=begin code :preamble<grammar Calculator {}; class Calculations {}>
grammar BetterCalculator is Calculator {
rule calc-op:sym<mult> { <num> '*' <num> }
}
class BetterCalculations is Calculations {
method calc-op:sym<mult> ($/) { make [*] $<num> }
}
say BetterCalculator.parse('2 * 3', actions => BetterCalculations).made;
# OUTPUT: «6␤»
=end code
All we had to add are additional rule and action to the C<calc-op> group and
the thing works—all thanks to proto regexes.
=head2 Special tokens
=head3 X<C<TOP>|TOP>
grammar Foo {
token TOP { \d+ }
}
The C<TOP> token is the default first token attempted to match
when parsing with a grammar. Note that if you're parsing with
L<C<.parse>|/type/Grammar#method_parse> method, C<token TOP> is automatically
anchored to the start and end of the string. If you don't want to parse the
whole string, look up L<C<.subparse>|/type/Grammar#method_subparse>.
Using C<rule TOP> or C<regex TOP> are also acceptable.
A different token can be chosen to be matched first using the C<:rule> named
argument to C<.parse>, C<.subparse>, or C<.parsefile>. These are all C<Grammar>
methods.
X<|ws>
=head3 C<ws>
When C<rule> instead of C<token> is used, any whitespace after terms and closing
parenthesis/brackets is turned into a non-capturing call to C<ws>,
written as C«<.ws>» where C<.> means non-capturing. That is to say:
rule entry { <key> '=' <value> }
Is the same as:
token entry { <key> <.ws> '=' <.ws> <value> <.ws> }
The default C<ws> matches zero or more whitespace characters, as long as that
point is not within a word (in code form, that's C«regex ws { <!ww> \s* }»):
# First <.ws> matches word boundary at the start of the line
# and second <.ws> matches the whitespace between 'b' and 'c'
say 'ab c' ~~ /<.ws> ab <.ws> c /; # OUTPUT: «「ab c」␤»
# Failed match: there is neither any whitespace nor a word
# boundary between 'a' and 'b'
say 'ab' ~~ /. <.ws> b/; # OUTPUT: «Nil␤»
# Successful match: there is a word boundary between ')' and 'b'
say ')b' ~~ /. <.ws> b/; # OUTPUT: «「)b」␤»
You can also redefine the default C<ws> token:
grammar Foo {
rule TOP { \d \d }
}.parse: "4 \n\n 5"; # Succeeds
grammar Bar {
rule TOP { \d \d }
token ws { \h* }
}.parse: "4 \n\n 5"; # Fails
=head3 X«C<sym>|<sym>»
The C«<sym>» token can be used inside proto regexes to match the string value of
the C<:sym> adverb for that particular regex:
grammar Foo {
token TOP { <letter>+ }
proto token letter {*}
token letter:sym<P> { <sym> }
token letter:sym<e> { <sym> }
token letter:sym<r> { <sym> }
token letter:sym<l> { <sym> }
token letter:sym<*> { . }
}.parse("I ♥ Perl", actions => class {
method TOP($/) { make $<letter>.grep(*.<sym>).join }
}).made.say; # OUTPUT: «Perl␤»
This comes in handy when you're already differentiating the proto regexes with
the strings you're going to match, as using C«<sym>» token prevents repetition
of those strings.
X«|<?>»
=head3 "Always succeed" assertion
The C«<?>» is the I<always succeed> assertion. When used as a grammar
token, it can be used to trigger an Action class method. In the following
grammar we look for Arabic digits and define a C<succ> token with the always
succeed assertion.
In the action class, we use calls to the C<succ> method to do set up (in this
case, we prepare a new element in C<@!numbers>). In the C<digit> method,
we use the Arabic digit as an index into a list of Devanagari digits and add
it to the last element of C<@!numbers>. Thanks to C<succ>, the last element
will always be the number for the currently parsed C<digit> digits.
grammar Digifier {
rule TOP {
[ <.succ> <digit>+ ]+
}
token succ { <?> }
token digit { <[0..9]> }
}
class Devanagari {
has @!numbers;
method digit ($/) { @!numbers.tail ~= <० १ २ ३ ४ ५ ६ ७ ८ ९>[$/] }
method succ ($) { @!numbers.push: '' }
method TOP ($/) { make @!numbers[^(*-1)] }
}
say Digifier.parse('255 435 777', actions => Devanagari.new).made;
# OUTPUT: «(२५५ ४३५ ७७७)␤»
=head2 Methods in grammars
It's fine to use methods instead of rules or tokens in a grammar, as long
as they return a L<Cursor|/type/Cursor>:
grammar DigitMatcher {
method TOP (:$full-unicode) {
$full-unicode ?? self.num-full !! self.num-basic;
}
token num-full { \d+ }
token num-basic { <[0..9]>+ }
}
The grammar above will attempt different matches depending on the arguments
provided by parse methods:
=begin code :preamble<grammar DigitMatcher{};>
say +DigitMatcher.subparse: '12७१७९०९', args => \(:full-unicode);
# OUTPUT: «12717909␤»
say +DigitMatcher.subparse: '12७१७९०९', args => \(:!full-unicode);
# OUTPUT: «12␤»
=end code
=head2 Dynamic variables in grammars
Variables can be defined in tokens by prefixing the lines of code defining them
with C<:>. Arbitrary code can be embedded anywhere in a token by surrounding it
with curly braces. This is useful for keeping state between tokens, which can be
used to alter how the grammar will parse text. Using dynamic variables
(variables with C<$*>, C<@*>, C<&*>, C<%*> twigils) in tokens cascades down
through I<all> tokens defined thereafter within the one where it's defined,
avoiding having to pass them from token to token as arguments.
One use for dynamic variables is guards for matches. This example uses guards to
explain which regex classes parse whitespace literally:
grammar GrammarAdvice {
rule TOP {
:my Int $*USE-WS;
"use" <type> "for" <significance> "whitespace by default"
}
token type {
| "rules" { $*USE-WS = 1 }
| "tokens" { $*USE-WS = 0 }
| "regexes" { $*USE-WS = 0 }
}
token significance {
| <?{ $*USE-WS == 1 }> "significant"
| <?{ $*USE-WS == 0 }> "insignificant"
}
}
Here, text such as "use rules for significant whitespace by default" will only
match if the state assigned by whether rules, tokens, or regexes are mentioned
matches with the correct guard:
=begin code :preamble<grammar GrammarAdvice{};>
say GrammarAdvice.subparse("use rules for significant whitespace by default");
# OUTPUT: «use rules for significant whitespace by default»
say GrammarAdvice.subparse("use tokens for insignificant whitespace by default");
# OUTPUT: «use tokens for insignificant whitespace by default»
say GrammarAdvice.subparse("use regexes for insignificant whitespace by default");
# OUTPUT: «use regexes for insignificant whitespace by default»
say GrammarAdvice.subparse("use regexes for significant whitespace by default")
# OUTPUT: #<failed match>
=end code
=head2 Attributes in grammars
Attributes may be defined in grammars. However, they can only be accessed by
methods. Attempting to use them from within a token will throw an exception
because tokens are methods of L<Cursor|/type/Cursor>, not of the grammar
itself. Note that mutating an attribute from within a method called in a token
will I<only modify the attribute for that token's own match object>! Grammar
attributes can be accessed in the match returned after parsing if made public:
=begin code
grammar HTTPRequest {
has Bool $.invalid;
token TOP {
<type> <path> 'HTTP/1.1' \r\n
[<field> \r\n]+
\r\n
$<body>=.*
}
token type {
| GET | POST | OPTIONS | HEAD | PUT | DELETE | TRACE | CONNECT
| \S+ <.error>
}
token path {
| '/' [[\w+]+ % \/] [\.\w+]?
| '*'
| \S+ <.error>
}
token field {
| $<name>=\w+ : $<value>=<-[\r\n]>*
| <-[\r\n]>+ <.error>
}
method error(--> ::?CLASS:D) {
$!invalid = True;
self;
}
}
my $header = "MEOWS / HTTP/1.1\r\nHost: docs.perl6.org\r\nsup lol\r\n\r\n";
my $/ = HTTPRequest.parse($header);
say $<type>.invalid;
# OUTPUT: True
say $<path>.invalid;
# OUTPUT: (Bool)
say $<field>».invalid;
# OUTPUT: [(Bool) True]
=end code
=head2 Passing arguments into grammars
To pass arguments into a grammar, you can use the named argument of C<:args> on any of the parsing methods of grammar. The arguments passed should be in a C<list>.
=begin code
grammar demonstrate-arguments {
rule TOP ($word) {
"I like" $word
}
}
# Notice the comma after "sweets" when passed to :args to coerce it to a list
say demonstrate-arguments.parse("I like sweets", :args(("sweets",))); # OUTPUT: «「I like sweets」␤»
=end code
Once the arguments are passed in, they can be used in a call to a named regex inside the grammar.
=begin code
grammar demonstrate-arguments-again {
rule TOP ($word) {
<phrase-stem><added-word($word)>
}
rule phrase-stem {
"I like"
}
rule added-word($passed-word) {
$passed-word
}
}
say demonstrate-arguments-again.parse("I like vegetables", :args(("vegetables",)));
# OUTPUT: 「I like vegetables」␤»
# OUTPUT: «phrase-stem => 「I like 」␤»
# OUTPUT: «added-word => 「vegetables」␤»
=end code
Alternatively, you can initialise dynamic variables and use any arguments that way within the grammar.
=begin code
grammar demonstrate-arguments-dynamic {
rule TOP ($*word, $*extra) {
<phrase-stem><added-words>
}
rule phrase-stem {
"I like"
}
rule added-words {
$*word $*extra
}
}
say demonstrate-arguments-dynamic.parse("I like everything else", :args(("everything", "else")));
# OUTPUT: «「I like everything else」␤»
# OUTPUT: «phrase-stem => 「I like 」␤»
# OUTPUT: «added-words => 「everything else」␤»
=end code
X<|Actions>
=head1 Action objects
A successful grammar match gives you a parse tree of L<Match|/type/Match>
objects, and the deeper that match tree gets, and the more branches in the
grammar are, the harder it becomes to navigate the match tree to get the
information you are actually interested in.
To avoid the need for diving deep into a match tree, you can supply an
I<actions> object. After each successful parse of a named rule in your
grammar, it tries to call a method of the same name as the grammar rule,
giving it the newly created L<Match|/type/Match> object as a positional
argument. If no such method exists, it is skipped.
Here is a contrived example of a grammar and actions in action:
=begin code
grammar TestGrammar {
token TOP { \d+ }
}
class TestActions {
method TOP($/) {
$/.make(2 + $/);
}
}
my $match = TestGrammar.parse('40', actions => TestActions.new);
say $match; # OUTPUT: «「40」␤»
say $match.made; # OUTPUT: «42␤»
=end code
An instance of C<TestActions> is passed as named argument C<actions> to the
L<parse> call, and when token C<TOP> has matched successfully,
it automatically calls method C<TOP>, passing the match object as an argument.
To make it clear that the argument is a match object, the example uses C<$/>
as a parameter name to the action method, though that's just a handy
convention, nothing intrinsic. C<$match> would have worked too. (Though using
C<$/> does give the advantage of providing C«$<capture>» as a shortcut
for C«$/<capture>»).
A slightly more involved example follows:
=begin code
grammar KeyValuePairs {
token TOP {
[<pair> \n+]*
}
token ws {
\h*
}
rule pair {
<key=.identifier> '=' <value=.identifier>
}
token identifier {
\w+
}
}
class KeyValuePairsActions {
method pair ($/) {
$/.make: $<key>.made => $<value>.made
}
method identifier($/) {
# subroutine `make` is the same as calling .make on $/
make ~$/
}
method TOP ($match) {
# can use any variable name for parameter, not just $/
$match.make: $match<pair>».made
}
}
my $actions = KeyValuePairsActions;
my $res = KeyValuePairs.parse(q:to/EOI/, :$actions).made;
second=b
hits=42
perl=6
EOI
for @$res -> $p {
say "Key: $p.key()\tValue: $p.value()";
}
=end code
This produces the following output:
=begin code :lang<text>
Key: second Value: b
Key: hits Value: 42
Key: perl Value: 6
=end code
Rule C<pair>, which parsed a pair separated by an equals sign, aliases the two
calls to token C<identifier> to separate capture names to make them available
more easily and intuitively. The corresponding action method constructs a
L<Pair|/type/Pair> object, and uses the C<.made> property of the sub match
objects. So it (like the action method C<TOP> too) exploits the fact that
action methods for submatches are called before those of the calling/outer
regex. So action methods are called in
L<post-order|https://en.wikipedia.org/wiki/Tree_traversal#Post-order>.
The action method C<TOP> simply collects all the objects that were C<.made> by
the multiple matches of the C<pair> rule, and returns them in a list.
Also note that C<KeyValuePairsActions> was passed as a type object to method
C<parse>, which was possible because none of the action methods use attributes
(which would only be available in an instance).
In other cases, action methods might want to keep state in attributes. Then of
course you must pass an instance to method parse.
Note that C<token> C<ws> is special: when C<:sigspace> is enabled (and it is
when we are using C<rule>), it replaces certain whitespace sequences. This is
why the spaces around the equals sign in C<rule pair> work just fine and why
the whitespace before closing C<}> does not gobble up the newlines looked for
in C<token TOP>.
=end pod
# vim: expandtab softtabstop=4 shiftwidth=4 ft=perl6