Commit 9e305f3

Document protoregexes
1 parent 22c1f65 commit 9e305f3

1 file changed

+319
-0
lines changed

doc/Language/grammars.pod

Lines changed: 319 additions & 0 deletions
@@ -0,0 +1,319 @@
1+
=begin pod
2+
3+
=TITLE Grammars
4+
5+
=SUBTITLE Parsing and interpreting text
6+
7+
Grammars are a powerful tool used to destructure text and often to return
8+
data structures that have been created by interpreting that text.
9+
10+
For example, Perl 6 is parsed and executed using a Perl 6-style grammar.
11+
12+
An example that's more practical to the common Perl 6 user is the
13+
L<JSON::Tiny module|https://github.com/moritz/json>, which can deserialize
14+
any valid JSON file; the deserializing code is written in fewer than
15+
100 lines of simple, extensible code.
16+
17+
If you didn't like grammar in school, don't let that scare you off grammars.
18+
Grammars allow you to group regexes, just as classes allow you to group
19+
methods of regular code.
20+
21+
=head1 X<Named Regexes|declarator,regex;declarator,token;declarator,rule>
22+
23+
The main ingredient of grammars is named L<regexes|/language/regexes>.
24+
While the syntax of L<Perl 6 Regexes|/language/regexes> is outside the scope
25+
of this document, I<named> regexes have a special syntax, similar to
26+
subroutine definitions:N<In fact, named regexes can even take extra
27+
arguments, using the same syntax as subroutine parameter lists>
28+
29+
=begin code :allow<B>
30+
my B<regex number {> \d+ [ \. \d+ ]? B<}>
31+
=end code
32+
33+
In this case, we have to specify that the regex is lexically scoped using
34+
the C<my> keyword, because named regexes are normally used within grammars.
35+
36+
Being named gives us the advantage of being able to easily reuse the regex
37+
elsewhere:
38+
39+
=begin code :allow<B>
40+
say "32.51" ~~ B<&number>;
41+
say "15 + 4.5" ~~ /B<< <number> >>\s* '+' \s*B<< <number> >>/
42+
=end code
43+
44+
B<C<regex>> isn't the only declarator for named regexes -- in fact, it's the
45+
least common. Most of the time, the B<C<token>> or B<C<rule>> declarators
46+
are used. These are both I<ratcheting>, which means that the match engine
47+
won't back up and try again if it fails to match something. This will
48+
usually do what you want, but isn't appropriate for all cases:
49+
50+
=begin code :allow<B>
51+
my regex works-but-slow { .+ q }
52+
my token fails-but-fast { .+ q }
53+
my $s = 'Tokens won\'t backtrack, which makes them fail quicker!';
54+
say so $s ~~ &works-but-slow; # True
55+
say so $s ~~ &fails-but-fast; # False, the entire string gets taken by the .+
56+
=end code
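
The difference comes down to the C<:ratchet> adverb, which C<token> and
C<rule> switch on implicitly. As a minimal sketch (the regex names below are
made up for illustration), a plain C<regex> with C<:ratchet> turned on inside
it behaves like the C<token>:

=begin code
my regex backtracking { \d+ 5 }            # may give back digits so that '5' can match
my regex ratcheting   { :ratchet \d+ 5 }   # explicit :ratchet, never gives anything back
my token tokenized    { \d+ 5 }            # token implies :ratchet

my $digits = '12345';
say so $digits ~~ &backtracking;   # True
say so $digits ~~ &ratcheting;     # False
say so $digits ~~ &tokenized;      # False
=end code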
57+
58+
The only difference between the C<token> and C<rule> declarators is that the
59+
C<rule> declarator causes L<C<:sigspace>|/language/regexes#Sigspace> to go
60+
into effect for the regex:
61+
62+
=begin code :allow<B>
63+
my token non-space-y { 'once' 'upon' 'a' 'time' }
64+
my rule space-y { 'once' 'upon' 'a' 'time' }
65+
say 'onceuponatime' ~~ &non-space-y;
66+
say 'once upon a time' ~~ &space-y;
67+
=end code
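
Swapping the inputs shows the effect from the other side. This sketch repeats
the two declarations from above; the expected results assume the default
C<ws>, which refuses to match between two word characters:

=begin code
my token non-space-y { 'once' 'upon' 'a' 'time' }
my rule  space-y     { 'once' 'upon' 'a' 'time' }

# The token wants the words back to back, so the spaces in the input get in the way
say so 'once upon a time' ~~ &non-space-y;   # False

# The rule calls <.ws> between the words, and that cannot match inside a word
say so 'onceuponatime' ~~ &space-y;          # False
=end code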
68+
69+
=head1 X<Creating Grammars|class,Grammar;declarator,grammar>
70+
71+
=SUBTITLE Group of named regexes that form a formal grammar
72+
73+
L<Grammar|/type/Grammar> is the superclass that classes automatically get when
74+
they are declared with the C<grammar> keyword instead of C<class>. Grammars
75+
should only be used to parse text; if you wish to extract complex data, we
76+
recommend using an L<action object|/language/grammars#Action_Objects> in
77+
conjunction with the grammar.
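
As a rough sketch of that relationship (C<Greeting> is a made-up name), a
grammar is declared like a class, gets C<Grammar> as its superclass, and
parses through the inherited C<.parse> method:

=begin code
grammar Greeting {
    token TOP { 'hello' }
}

say Greeting ~~ Grammar;            # True, Grammar is its superclass
say so Greeting.parse('hello');     # True
say so Greeting.parse('goodbye');   # False
=end code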
78+
79+
X<sym>
80+
X<< :sym<> >>
81+
X<protoregex>
82+
=head2 Protoregexes
83+
84+
If you have a lot of alternations, it may become difficult to produce
85+
readable code or subclass your grammar. In the Actions class below, the
86+
ternary in C<method TOP> is less than ideal and it becomes even worse the more
87+
operations we add:
88+
89+
    grammar Calculator {
89+
        token TOP { [ <add> | <sub> ] }
90+
        rule add { <num> '+' <num> }
91+
        rule sub { <num> '-' <num> }
92+
        token num { \d+ }
93+
    }
95+
96+
    class Calculations {
97+
        method TOP ($/) { make $<add> ?? $<add>.made !! $<sub>.made; }
98+
        method add ($/) { make [+] $<num>; }
99+
        method sub ($/) { make [-] $<num>; }
100+
    }
101+
102+
    say Calculator.parse('2 + 3', actions => Calculations).made;
103+
104+
    # OUTPUT:
105+
    # 5
106+
107+
To make things better, we can use protoregexes, which look like C<< :sym<...> >>
108+
adverbs on tokens:
109+
110+
    grammar Calculator {
111+
        token TOP { <calc-op> }
112+
113+
        proto rule calc-op {*}
114+
        rule calc-op:sym<add> { <num> '+' <num> }
115+
        rule calc-op:sym<sub> { <num> '-' <num> }
116+
117+
        token num { \d+ }
118+
    }
119+
120+
    class Calculations {
121+
        method TOP ($/) { make $<calc-op>.made; }
122+
        method calc-op:sym<add> ($/) { make [+] $<num>; }
123+
        method calc-op:sym<sub> ($/) { make [-] $<num>; }
124+
    }
125+
126+
    say Calculator.parse('2 + 3', actions => Calculations).made;
127+
128+
    # OUTPUT:
129+
    # 5
130+
131+
In the grammar, the alternation has now been replaced with C<< <calc-op> >>,
132+
which is essentially the name of a group of values we'll create. We do so by
133+
defining a rule prototype with C<proto rule calc-op>. Each of our previous
134+
alternations has been replaced by a new C<rule calc-op> definition, and the
135+
name of the alternation is attached with the C<< :sym<> >> adverb.
136+
137+
In the actions class, we have gotten rid of the ternary operator and simply take
138+
the C<.made> value from the C<< $<calc-op> >> match object. The actions for
139+
individual alternations now follow the same naming pattern as in the grammar:
140+
C<< method calc-op:sym<add> >> and C<< method calc-op:sym<sub> >>.
141+
142+
The real beauty of this method can be seen when you subclass that grammar
143+
and actions class. Let's say we want to add a multiplication feature to the
144+
calculator:
145+
146+
    grammar BetterCalculator is Calculator {
147+
        rule calc-op:sym<mult> { <num> '*' <num> }
148+
    }
149+
150+
    class BetterCalculations is Calculations {
151+
        method calc-op:sym<mult> ($/) { make [*] $<num> }
152+
    }
153+
154+
    say BetterCalculator.parse('2 * 3', actions => BetterCalculations).made;
155+
156+
    # OUTPUT:
157+
    # 6
158+
159+
All we had to add was an additional rule and action to the C<calc-op> group, and
160+
the thing works, all thanks to protoregexes.
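
Inside a protoregex candidate, the string given in C<< :sym<...> >> can also be
matched with C<< <sym> >>, which saves repeating the literal. A small sketch,
using a hypothetical C<Operators> grammar that is not part of the calculator
above:

=begin code
grammar Operators {
    token TOP { <op> }

    proto token op {*}
    token op:sym<+> { <sym> }   # <sym> matches the literal '+'
    token op:sym<-> { <sym> }   # <sym> matches the literal '-'
}

say ~Operators.parse('+')<op><sym>;   # +
say ~Operators.parse('-')<op>;        # -
say so Operators.parse('*');          # False, no candidate matches
=end code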
161+
162+
=head2 Special Tokens
163+
164+
X<TOP>
165+
=head3 C<TOP>
166+
167+
    grammar Foo {
168+
        token TOP { \d+ }
169+
    }
170+
171+
The C<TOP> token is the first token the grammar attempts to match when parsing;
172+
it is the root of the tree. Note
173+
that if you're parsing with the L<C<.parse>|/type/Grammar#method_parse> method,
174+
C<token TOP> is automatically anchored to the start and end of the string
175+
(see also: L<C<.subparse>|/type/Grammar#method_subparse>).
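
The anchoring can be observed by comparing the two methods. A minimal sketch,
with C<Digits> as a made-up grammar name:

=begin code
grammar Digits {
    token TOP { \d+ }
}

say so Digits.parse('123');          # True
say so Digits.parse('123abc');       # False, .parse must reach the end of the string
say so Digits.subparse('123abc');    # True, .subparse may stop early
say ~Digits.subparse('123abc');      # 123
say so Digits.subparse('abc123');    # False, still anchored to the start
=end code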
176+
177+
Using C<rule TOP> or C<regex TOP> is also acceptable.
178+
179+
X<ws>
180+
=head3 C<ws>
181+
182+
When C<rule> instead of C<token> is used, any whitespace after an
183+
atom is turned into a non-capturing call to C<ws>. That is:
184+
185+
    rule entry { <key> '=' <value> }
186+
187+
Is the same as:
188+
189+
    token entry { <key> <.ws> '=' <.ws> <value> <.ws> } # . = non-capturing
190+
191+
The default C<ws> matches "whitespace", such as a sequence of spaces (of whatever
192+
type), newlines, unspaces, or heredocs.
193+
194+
It's perfectly fine to provide your own C<ws> token:
195+
196+
    grammar Foo {
197+
        rule TOP { \d \d }
198+
    }.parse: "4 \n\n 5"; # Succeeds
199+
200+
    grammar Bar {
201+
        rule TOP { \d \d }
202+
        token ws { \h* }
203+
    }.parse: "4 \n\n 5"; # Fails
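
A custom C<ws> can also widen what counts as whitespace. The following is only
a sketch under assumed names: the C<Sum> grammar, its C<num> token, and the
C<#> comment syntax are invented for illustration.

=begin code
grammar Sum {
    rule TOP  { <num> '+' <num> }
    token num { \d+ }

    # Treat ordinary whitespace and '#' comments up to the end of the line as whitespace
    token ws  { [ \s | '#' \N* ]* }
}

say so Sum.parse("1 + 2");                # True
say so Sum.parse("1 # first\n + 2");      # True, the comment is skipped by ws
=end code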
204+
205+
=head1 Action Objects
206+
207+
A successful grammar match gives you a parse tree of L<Match|/type/Match>
208+
objects. The deeper that match tree gets, and the more branches the
209+
grammar has, the harder it becomes to navigate the match tree to get the
210+
information you are actually interested in.
211+
212+
To avoid the need for diving deep into a match tree, you can supply an
213+
I<actions> object. After each successful parse of a named rule in your
214+
grammar, the parser tries to call an actions method of the same name as the grammar rule,
215+
giving it the newly created L<Match|/type/Match> object as a positional
216+
argument. If no such method exists, it is skipped.
217+
218+
Here is a contrived example of a grammar and actions in action:
219+
220+
=begin code
221+
use v6;
222+
223+
grammar TestGrammar {
224+
    token TOP { \d+ }
225+
}
226+
227+
class TestActions {
228+
    method TOP($/) {
229+
        $/.make(2 + $/);
230+
    }
231+
}
232+
233+
my $actions = TestActions.new;
234+
my $match = TestGrammar.parse('40', :$actions);
235+
say $match; # 「40」
236+
say $match.made; # 42
237+
=end code
238+
239+
An instance of C<TestActions> is passed as the named argument C<actions> to the
240+
L<C<parse>|/type/Grammar#method_parse> call, and when token C<TOP> has matched
241+
successfully, it automatically calls method C<TOP>, passing the match object
242+
as an argument.
243+
244+
To make it clear that the argument is a match object, the example uses C<$/>
245+
as a parameter name to the action method, though that's just a handy
246+
convention, nothing intrinsic. C<$match> would have worked too. (Though using
247+
C<$/> does give the advantage of providing C<< $<capture> >> as a shortcut
248+
for C<< $/<capture> >>).
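
To illustrate that the parameter name is only a convention, here is the same
contrived example rewritten with C<$match> as the parameter. It is a sketch
that should behave the same way as the version above:

=begin code
grammar TestGrammar {
    token TOP { \d+ }
}

class TestActions {
    # Any parameter name works; the match object is passed positionally
    method TOP($match) {
        $match.make(2 + $match);
    }
}

my $result = TestGrammar.parse('40', actions => TestActions.new);
say $result.made;   # 42
=end code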
249+
250+
A slightly more involved example follows:
251+
252+
=begin code
253+
use v6;
254+
255+
grammar KeyValuePairs {
256+
    token TOP {
257+
        [<pair> \n+]*
258+
    }
259+
    token ws { \h* }
260+
261+
    rule pair {
262+
        <key=.identifier> '=' <value=.identifier>
263+
    }
264+
    token identifier {
265+
        \w+
266+
    }
267+
}
268+
269+
class KeyValuePairsActions {
270+
    method identifier($/) { $/.make: ~$/ }
271+
    method pair ($/) { $/.make: $<key>.made => $<value>.made }
272+
    method TOP ($/) { $/.make: $<pair>».made }
273+
}
274+
275+
my $res = KeyValuePairs.parse(q:to/EOI/, :actions(KeyValuePairsActions)).made;
276+
    second=b
277+
    hits=42
278+
    perl=6
279+
    EOI
280+
281+
for @$res -> $p {
282+
    say "Key: $p.key()\tValue: $p.value()";
283+
}
284+
=end code
285+
286+
This produces the following output:
287+
288+
=begin code
289+
Key: second Value: b
290+
Key: hits Value: 42
291+
Key: perl Value: 6
292+
=end code
293+
294+
Rule C<pair>, which parses a pair separated by an equals sign, aliases the two
295+
calls to token C<identifier> to separate capture names to make them available
296+
more easily and intuitively. The corresponding action method constructs a
297+
L<Pair|/type/Pair> object, and uses the C<.made> property of the sub match
298+
objects. It (like the action method C<TOP>) thus exploits the fact that
299+
action methods for submatches are called before those of the calling/outer
300+
regex. In other words, action methods are called in
301+
L<post-order|https://en.wikipedia.org/wiki/Tree_traversal#Post-order>.
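
The call order can be made visible by recording each action method as it
runs. This is a sketch with made-up names (C<Assignment>, C<CallOrder>); an
instance is passed as the actions object because the calls are stored in an
attribute:

=begin code
grammar Assignment {
    token TOP   { <key> '=' <value> }
    token key   { \w+ }
    token value { \w+ }
}

class CallOrder {
    has @.calls;   # names of the action methods, in the order they were called
    method TOP($/)   { @!calls.push: 'TOP' }
    method key($/)   { @!calls.push: 'key' }
    method value($/) { @!calls.push: 'value' }
}

my $order = CallOrder.new;
Assignment.parse('a=b', actions => $order);
say $order.calls;   # [key value TOP]
=end code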
302+
303+
The action method C<TOP> simply collects all the objects that were C<.made> by
304+
the multiple matches of the C<pair> rule, and returns them in a list.
305+
306+
Also note that C<KeyValuePairsActions> was passed as a type object to method
307+
C<parse>, which was possible because none of the action methods use attributes
308+
(which would only be available in an instance).
309+
310+
In other cases, action methods might want to keep state in attributes. Then of
311+
course you must pass an instance to method C<parse>.
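
A minimal sketch of such a stateful actions class follows. The names C<Words>
and C<CountingActions> are invented for illustration; the counter lives in an
attribute, so an instance is created before parsing:

=begin code
grammar Words {
    token TOP  { [ <word> \s* ]+ }
    token word { \w+ }
}

class CountingActions {
    has $.count = 0;
    # Called once for every successful match of token word
    method word($/) { $!count++ }
}

my $counter = CountingActions.new;
Words.parse('one two three', actions => $counter);
say $counter.count;   # 3
=end code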
312+
313+
Note that C<token> C<ws> is special: when C<:sigspace> is enabled (and it is
314+
when we are using C<rule>), it replaces certain whitespace sequences. This is
315+
why the spaces around the equals sign in C<rule pair> work just fine and why
316+
the whitespace before the closing C<}> does not gobble up the newlines looked for
317+
in C<token TOP>.
318+
319+
=end pod
