Document protoregexes

zoffixznet · zoffixznet · commit 9e305f3f713f · 2016-06-30T15:10:19.000-04:00
diff --git a/doc/Language/grammars.pod b/doc/Language/grammars.pod
@@ -0,0 +1,319 @@
+=begin pod
+
+=TITLE Grammars
+
+=SUBTITLE Parsing and interpreting text
+
+Grammars are a powerful tool used to destructure text and often to return
+data structures that have been created by interpreting that text.
+
+For example, Perl 6 is parsed and executed using a Perl 6-style grammar.
+
+An example that's more practical to the common Perl 6 user is the
+L<JSON::Tiny module|https://github.com/moritz/json>, which can deserialize
+any valid JSON file, however the deserializing code is written in less than
+100 lines of simple, extensible code.
+
+If you didn't like grammar in school, don't let that scare you off grammars.
+Grammars allow you to group regexes, just as classes allow you to group
+methods of regular code.
+
+=head1 X<Named Regexes|declarator,regex;declarator,token;declarator,rule>
+
+The main ingredient of grammars is named L<regexes|/language/regexes>.
+While the syntax of L<Perl 6 Regexes|/language/regexes> is outside the scope
+of this document, I<named> regexes have a special syntax, similar to
+subroutine definitions:N<In fact, named regexes can even take extra
+arguments, using the same syntax as subroutine parameter lists>
+
+    =begin code :allow<B>
+    my B<regex number {> \d+ [ \. \d+ ]? B<}>
+    =end code
+
+In this case, we have to specify that the regex is lexically scoped using
+the C<my> keyword, because named regexes are normally used within grammars.
+
+Being named gives us the advantage of being able to easily reuse the regex
+elsewhere:
+
+    =begin code :allow<B>
+    say "32.51" ~~ B<&number>;
+    say "15 + 4.5" ~~ /B<< <number> >>\s* '+' \s*B<< <number> >>/
+    =end code
+
+B<C<regex>> isn't the only declarator for named regexes -- in fact, it's the
+least common. Most of the time, the B<C<token>> or B<C<rule>> declarators
+are used. These are both I<ratcheting>, which means that the match engine
+won't back up and try again if it fails to match something. This will
+usually do what you want, but isn't appropriate for all cases:
+
+    =begin code :allow<B>
+    my regex works-but-slow { .+ q }
+    my token fails-but-fast { .+ q }
+    my $s = 'Tokens won\'t backtrack, which makes them fail quicker!';
+    say so $s ~~ &works-but-slow; # True
+    say so $s ~~ &fails-but-fast; # False, the entire string get taken by the .+
+    =end code
+
+The only difference between the C<token> and C<rule> declarators is that the
+C<rule> declarator causes L<C<:sigspace>|/language/regexes#Sigspace> to go
+into effect for the Regex:
+
+    =begin code :allow<B>
+    my token non-space-y { 'once' 'upon' 'a' 'time' }
+    my rule space-y { 'once' 'upon' 'a' 'time' }
+    say 'onceuponatime'    ~~ &non-space-y;
+    say 'once upon a time' ~~ &space-y;
+    =end code
+
+=head1 X<Creating Grammars|class,Grammar;declarator,grammar>
+
+=SUBTITLE Group of named regexes that form a formal grammar
+
+L<Grammar|/type/Grammar> is the superclass that classes automatically get when
+they are declared with the C<grammar> keyword instead of C<class>. Grammars
+should only be used to parse text; if you wish to extract complex data, an
+L<action object|/language/grammars#Action_Objects> is recommended to be used in
+conjunction with the grammar.
+
+X<sym>
+X<< :sym<> >>
+X<protoregex>
+=head2 Protoregexes
+
+If you have a lot of alternations, it may become difficult to produce
+readable code or subclass your grammar. In the Actions class below, the
+ternary in C<method TOP> is less than ideal and it becomes even worse the more
+operations we we add:
+
+    grammar Calculator {
+        token TOP { [ <add> | <sub> ] }
+        rule  add { <num> '+' <num> }
+        rule  sub { <num> '-' <num> }
+        token num { \d+ }
+    }
+
+    class Calculations {
+        method TOP ($/) { make $<add> ?? $<add>.made !! $<sub>.made; }
+        method add ($/) { make [+] $<num>; }
+        method sub ($/) { make [-] $<num>; }
+    }
+
+    say Calculator.parse('2 + 3', actions => Calculations).made;
+
+    # OUTPUT:
+    # 5
+
+To make things better, we can use protoregexes that look like `C<< :sym<...> >>
+adverbs on tokens:
+
+    grammar Calculator {
+        token TOP { <calc-op> }
+
+        proto rule calc-op          {*}
+              rule calc-op:sym<add> { <num> '+' <num> }
+              rule calc-op:sym<sub> { <num> '-' <num> }
+
+        token num { \d+ }
+    }
+
+    class Calculations {
+        method TOP              ($/) { make $<calc-op>.made; }
+        method calc-op:sym<add> ($/) { make [+] $<num>; }
+        method calc-op:sym<sub> ($/) { make [-] $<num>; }
+    }
+
+    say Calculator.parse('2 + 3', actions => Calculations).made;
+
+    # OUTPUT:
+    # 5
+
+In the grammar, the alternation has now been replaced with C<< <calc-op> >>,
+which is essentially the name of a group of values we'll create. We do so by
+defining a rule prototype with C<proto rule calc-op>. Each of our previous
+alternations have been replaced by a new C<rule calc-op> definition and the
+name of the alternation is attached with C<< :sym<> >> adverb.
+
+In the actions class, we now got rid of the ternary operator and simply take
+the C<.made> value from the C<< $<calc-op> >> match object. And the actions for
+individual alternations now follow the name naming pattern as in the grammar:
+C<< method calc-op:sym<add> >> and C<method calc-op:sym<sub> >>.
+
+The real beauty of this method can be seen when you subclass that grammar
+and actions class. Let's say we want to add a multiplication feature to the
+calculator:
+
+    grammar BetterCalculator is Calculator {
+        rule calc-op:sym<mult> { <num> '*' <num> }
+    }
+
+    class BetterCalculations is Calculations {
+        method calc-op:sym<mult> ($/) { make [*] $<num> }
+    }
+
+    say BetterCalculator.parse('2 * 3', actions => BetterCalculations).made;
+
+    # OUTPUT:
+    # 6
+
+All we had to add are additional rule and action to the C<calc-op> group and
+the thing works—all thanks to protoregexes.
+
+=head2 Special Tokens
+
+X<TOP>
+=head3 C<TOP>
+
+    grammar Foo {
+        token TOP { \d+ }
+    }
+
+The C<TOP> token is the first token attempted to match when parsing with
+a grammar—the root of the tree. Note
+that if you're parsing with L<C<.parse>|/type/Grammar#method_parse> method,
+C<token TOP> is automatically anchored to the start and end of the string
+(see also: L<C<.subparse>|/type/Grammar#method_subparse>).
+
+Using C<rule TOP> or C<regex TOP> are also acceptable.
+
+X<ws>
+=head3 C<ws>
+
+When C<rule> instead of C<token> is used, any whitespace after an
+atom is turned into a non-capturing call to C<ws>. That is:
+
+    rule entry { <key> ’=’ <value> }
+
+Is the same as:
+
+    token entry { <key> <.ws> ’=’ <.ws> <value> <.ws> } # . = non-capturing
+
+The default C<ws> matches "whitespace", such a sequence of spaces (of whatever
+type), newlines, unspaces, or heredocs.
+
+It's perfectly fine to provide your own C<ws> token:
+
+    grammar Foo {
+        rule TOP { \d \d }
+    }.parse: "4   \n\n 5"; # Succeeds
+
+    grammar Bar {
+        rule TOP { \d \d }
+        token ws { \h*   }
+    }.parse: "4   \n\n 5"; # Fails
+
+=head1 Action Objects
+
+A successful grammar match gives you a parse tree of L<Match|/type/Match>
+objects, and the deeper that match tree gets, and the more branches in the
+grammar are, the harder it becomes to navigate the match tree to get the
+information you are actually interested in.
+
+To avoid the need for diving deep into a match tree, you can supply an
+I<actions> object. After each successful parse of a named rule in your
+grammar, it tries to call a method of the same name as the grammar rule,
+giving it the newly created L<Match|/type/Match> object as a positional
+argument. If no such method exists, it is skipped.
+
+Here is a contrived example of a grammar and actions in action:
+
+=begin code
+use v6;
+
+grammar TestGrammar {
+    token TOP { \d+ }
+}
+
+class TestActions {
+    method TOP($/) {
+        $/.make(2 + $/);
+    }
+}
+
+my $actions = TestActions.new;
+my $match = TestGrammar.parse('40', :$actions);
+say $match;         # ｢40｣
+say $match.made;    # 42
+=end code
+
+An instance of C<TestActions> is passed as named argument C<actions> to the
+L<parse> call, and when token C<TOP> has matched
+successfully, it automatically calls method C<TOP>, passing the match object
+as an argument.
+
+To make it clear that the argument is a match object, the example uses C<$/>
+as a parameter name to the action method, though that's just a handy
+convention, nothing intrinsic. C<$match> would have worked too. (Though using
+C<$/> does give the advantage of providing C<< $<capture> >> as a shortcut
+for C<< $/<capture> >>).
+
+A slightly more involved example follows:
+
+=begin code
+use v6;
+
+grammar KeyValuePairs {
+    token TOP {
+        [<pair> \n+]*
+    }
+    token ws { \h* }
+
+    rule pair {
+        <key=.identifier> '=' <value=.identifier>
+    }
+    token identifier {
+        \w+
+    }
+}
+
+class KeyValuePairsActions {
+    method identifier($/) { $/.make: ~$/                          }
+    method pair      ($/) { $/.make: $<key>.made => $<value>.made }
+    method TOP       ($/) { $/.make: $<pair>».made                }
+}
+
+my  $res = KeyValuePairs.parse(q:to/EOI/, :actions(KeyValuePairsActions)).made;
+    second=b
+    hits=42
+    perl=6
+    EOI
+
+for @$res -> $p {
+    say "Key: $p.key()\tValue: $p.value()";
+}
+=end code
+
+This produces the following output:
+
+=begin code
+Key: second     Value: b
+Key: hits       Value: 42
+Key: perl       Value: 6
+=end code
+
+Rule C<pair>, which parsed a pair separated by an equals sign, aliases the two
+calls to token C<identifier> to separate capture names to make them available
+more easily and intuitively. The corresponding action method constructs a
+L<Pair|/type/Pair> object, and uses the C<.made> property of the sub match
+objects. So it (like the action method C<TOP> too) exploits the fact that
+action methods for submatches are called before those of the calling/outer
+regex. So action methods are called in
+L<post-order|https://en.wikipedia.org/wiki/Tree_traversal#Post-order>.
+
+The action method C<TOP> simply collects all the objects that were C<.made> by
+the multiple matches of the C<pair> rule, and returns them in a list.
+
+Also note that C<KeyValuePairsActions> was passed as a type object to method
+C<parse>, which was possible because none of the action methods use attributes
+(which would only be available in an instance).
+
+In other cases, action methods might want to keep state in attributes. Then of
+course you must pass an instance to method parse.
+
+Note that C<token> C<ws> is special: when C<:sigspace> is enabled (and it is
+when we are using C<rule>), it replaces certain whitespace sequences. This is
+why the spaces around the equals sign in C<rule pair> work just fine and why
+the whitespace before closing C<}> does not gobble up the newlines looked for
+in C<token TOP>.
+
+=end pod