Commit f819c26

Rewrite of regexes-lexical conventions section
1 parent b14b5c0 commit f819c26

1 file changed

+197
-30
lines changed

doc/Language/regexes.pod6

Lines changed: 197 additions & 30 deletions
Original file line number | Diff line number | Diff line change
@@ -12,46 +12,138 @@ matching those patterns to actual text.
1212
1313
=head1 X<Lexical conventions|quote,/ /;quote,rx;quote,m>
1414
15-
Perl 6 has special syntax for literal regexes:
15+
Fundamentally, regexes are very much like subroutines: both are code objects,
16+
and just as you can have anonymous subs and named subs, you can have anonymous
17+
and named regexes.
1618
17-
m/abc/; # a regex that is immediately matched against $_
18-
rx/abc/; # a Regex object
19-
/abc/; # a Regex object; shorthand version of 'rx/ /' operator
19+
A regex, whether anonymous or named, is represented by a L<C<Regex>|/type/Regex>
20+
object. The syntax for constructing anonymous and named C<Regex> objects
21+
differs, as do their intended uses.
2022
21-
One difference between the C<m/ /> and C<rx/ /> forms on the one hand, and the
22-
C</ /> form on the other, is that C<m> and C<rx> may be followed by
23-
L<adverbs|/language/regexes#Adverbs>. Another difference is that the
24-
former forms allow delimiters other than the slash to be used:
23+
In short, anonymous regexes may be used anywhere where a regex is needed with
24+
the exception of L<C<Grammars>|/type/Grammar>, which are the domain of named
25+
regexes. Named regexes form the building blocks of grammars, in which they serve
26+
as methods (also known as 'subrules') that can be called from other regexes to
27+
effectively parse textual data.
2528
26-
m{abc}; # curly braces as delimiters
27-
rx:i[abc]; # :i adverb, and square brackets as delimiters
2829
29-
As may be inferred from the above example, the use of a colon as an alternative
30-
delimiter would clash with the use of adverbs; accordingly, such use of the
31-
colon is forbidden. Similarly, parentheses cannot be used as alternative regex
32-
delimiters, at least not without a space between C<m> or C<rx> and the
33-
opening delimiter. This is because identifiers that are immediately followed by
34-
parentheses are always parsed as a subroutine call. For example, in C<rx()> the L<call
35-
operator|/language/operators#postcircumfix_(_)> C<()> invokes the subroutine
36-
C<rx>. The form C<rx ( abc )>, however, I<does> define a Regex object.
30+
=head2 Anonymous regex definition syntax
3731
38-
Here's an example that illustrates the difference between the C<m/ /> and C</ />
39-
operators:
32+
An anonymous regex may be constructed in one of the following ways:
4033
41-
my $match;
42-
$_ = "abc";
43-
$match = m/.+/; say $match; say $match.^name; # OUTPUT: «「abc」␤Match␤»
44-
$match = /.+/; say $match; say $match.^name; # OUTPUT: «/.+/␤Regex␤»
34+
rx/pattern/; # an anonymous Regex object; 'rx' stands for 'regex'
35+
/pattern/; # an anonymous Regex object; shorthand for 'rx/.../'
36+
37+
regex { pattern } # keyword-declared anonymous regex; this form is
38+
# intended for defining named regexes and is discussed
39+
# in that context in the next section
40+
41+
The C<rx/ /> form has two advantages over the bare shorthand form C</ />.
42+
43+
Firstly, it enables the use of delimiters other than the slash, which may be
44+
used to improve the readability of the regex definition:
45+
46+
rx{ '/tmp/'.* } # the use of curly braces as delimiters makes this first
47+
rx/ '/tmp/'.* / # definition somewhat easier on the eyes than the second
48+
49+
Although the choice is vast, not every character may be chosen as an alternative
50+
regex delimiter:
51+
52+
=begin item
53+
You cannot use whitespace or alphanumeric characters as delimiters. Whitespace
54+
in regex definition syntax is generally optional, except where it is required to
55+
distinguish from function call syntax (discussed below).
56+
=end item
57+
58+
=begin item
59+
Use of a colon as a delimiter would clash with the use of adverbs of the form
60+
C<:adverb>; accordingly, such use of the colon is forbidden.
61+
=end item
62+
63+
=begin item
64+
Parentheses can be used as alternative regex delimiters, but only with a space
65+
between C<rx> and the opening delimiter. This is because identifiers that are
66+
immediately followed by parentheses are always parsed as a subroutine call. For example,
67+
in C<rx()> the L<call operator|/language/operators#postcircumfix_(_)> C<()>
68+
invokes the subroutine C<rx>. The form C<rx ( abc )>, however, I<does> define a
69+
C<Regex> object.
70+
=end item
71+
72+
=begin item
73+
The hash C<#> is not available as a delimiter, since it is parsed as the start
74+
of a L<comment|/language/syntax#Single-line_comments> that runs until the end of
75+
the line.
76+
=end item
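As a minimal sketch of the delimiter rules listed above, the following defines one pattern with three different delimiters. The variable names and the sample string are illustrative assumptions, not part of the original text.

```raku
# Three equivalent definitions of one pattern; only the delimiters differ.
my $slashes  = rx/ '/tmp/' .* /;
my $curlies  = rx{ '/tmp/' .* };
my $brackets = rx[ '/tmp/' .* ];

my $path = 'see /tmp/scratch.txt';   # hypothetical sample input

# Each Regex object matches the same substring.
say ($path ~~ $slashes).Str;    # /tmp/scratch.txt
say ($path ~~ $curlies).Str;    # /tmp/scratch.txt
say ($path ~~ $brackets).Str;   # /tmp/scratch.txt
```

Bracketing delimiters such as curly braces tend to read best when the pattern itself contains slashes.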
77+
78+
Secondly, the C<rx> form enables the use of
79+
L<regex adverbs|/language/regexes#Adverbs>, which may be placed between C<rx> and the
80+
opening delimiter to modify the definition of the entire regex:
81+
82+
rx:r:s/pattern/ # :r (:ratchet) and :s (:sigspace) adverbs, defining
83+
# a ratcheting regex in which whitespace is significant
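
The effect of adverbs placed after C<rx> can be sketched as follows. This is an illustrative example only; the variable names and sample strings are assumptions made for the demonstration.

```raku
my $plain  = rx/hello/;     # case-sensitive by default
my $nocase = rx:i/hello/;   # :i (:ignorecase) applies to the whole pattern

say so 'HELLO world' ~~ $plain;    # False
say so 'HELLO world' ~~ $nocase;   # True

# :s (:sigspace) makes the whitespace inside the pattern significant.
my $spaced = rx:s/hello world/;

say so 'hello world' ~~ $spaced;   # True
say so 'helloworld'  ~~ $spaced;   # False
```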
84+
85+
Although anonymous regexes are not, as such, I<named>, they may effectively be
86+
given a name by putting them inside a named variable, after which they can be
87+
referenced, e.g. directly or by means of
88+
L<interpolation|/language/regexes#Regex_interpolation>:
89+
90+
my $regex = / k \w+ /;
91+
say "Made in a low firing kiln" ~~ $regex; # OUTPUT: 「kiln」
92+
93+
my $regex = /pottery/;
94+
"Japanese pottery rocks!" ~~ / <$regex> /; # Interpolation of $regex into /.../
95+
say $/; # OUTPUT: 「pottery」
96+
97+
=head2 Named regex definition syntax
98+
99+
A named regex may be constructed using the C<regex> declarator as follows:
100+
101+
regex R { pattern } # a named Regex object, named 'R'
102+
103+
Unlike with the C<rx> form, you cannot choose your preferred delimiter: curly
104+
braces are mandatory. In this regard it should be noted that the definition of a
105+
named regex using the C<regex> form is syntactically similar to the definition
106+
of a subroutine:
107+
108+
my sub S { /pattern/ }; # definition of Sub object (returning a Regex)
109+
my regex R { pattern }; # definition of Regex object
110+
111+
which emphasizes the fact that a L<C<Regex>|/type/Regex> object represents code
112+
rather than data:
113+
114+
say &S ~~ Code;     # OUTPUT: True
115+
116+
say &R ~~ Code;     # OUTPUT: True
117+
say &R ~~ Method;   # OUTPUT: True (A Regex is really a Method!)
118+
119+
Also unlike with the C<rx> form for defining an anonymous regex, the definition
120+
of a named regex using the C<regex> form does not allow for adverbs to be
121+
inserted before the opening delimiter. Instead, adverbs that are to modify the
122+
entire regex pattern may be included first thing within the curly braces:
123+
124+
regex R { :i pattern } # :i (:ignorecase), renders pattern case insensitive
125+
126+
Alternatively, by way of shorthand, it is also possible (and recommended) to use
127+
the C<rule> and C<token> variants of the C<regex> declarator for defining a
128+
C<Regex> when the C<:ratchet> and C<:sigspace> adverbs are of interest:
45129
46-
Whitespace in literal regexes is ignored unless the
47-
L<C<:sigspace> adverb|/language/regexes#Sigspace> is used to make whitespace
130+
regex R { :r pattern } # apply :r (:ratchet) to entire pattern
131+
token R { pattern } # same thing: 'token' implies ':r'
132+
133+
regex R { :r :s pattern } # apply :r (:ratchet) and :s (:sigspace) to pattern
134+
rule R { pattern } # same thing: 'rule' implies ':r:s'
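
The practical difference between the three declarators can be sketched as below. The regex names (C<Flexible>, C<Strict>, C<Glued>, C<Spaced>) and the test strings are hypothetical and chosen only for illustration.

```raku
my regex Flexible { \w+ 'c' }   # plain regex: may backtrack
my token Strict   { \w+ 'c' }   # token implies :r, so \w+ never gives back the 'c'

say so 'abc' ~~ / ^ <Flexible> $ /;   # True
say so 'abc' ~~ / ^ <Strict> $ /;     # False

my token Glued  { 'hello' 'world' }   # whitespace in the pattern is ignored
my rule  Spaced { 'hello' 'world' }   # rule implies :r:s, whitespace is significant

say so 'helloworld'  ~~ / ^ <Glued> $ /;    # True
say so 'hello world' ~~ / ^ <Glued> $ /;    # False
say so 'hello world' ~~ / ^ <Spaced> $ /;   # True
say so 'helloworld'  ~~ / ^ <Spaced> $ /;   # False
```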
135+
136+
137+
=head2 Regex readability: whitespace and comments
138+
139+
Whitespace in regexes is ignored unless the
140+
L<C<:sigspace>|/language/regexes#Sigspace> adverb is used to make whitespace
48141
syntactically significant.
49142
50143
In addition to whitespace, comments may be used inside of regexes to improve
51-
their readability and comprehensibility just as in Perl 6 code in general. This
52-
is true for both L<single line comments|/language/syntax#Single-line_comments>
53-
and L<multi line/embedded comments|
54-
/language/syntax#Multi-line_/_embedded_comments>:
144+
their comprehensibility just as in code in general. This is true for both
145+
L<single line comments|/language/syntax#Single-line_comments> and
146+
L<multi line/embedded comments|/language/syntax#Multi-line_/_embedded_comments>:
55147
56148
my $regex = rx/ \d ** 4 #`(match the year YYYY)
57149
'-'
@@ -61,6 +153,81 @@ and L<multi line/embedded comments|
61153
62154
say '2015-12-25'.match($regex); # OUTPUT: «「2015-12-25」␤»
63155
156+
=head2 Match syntax
157+
158+
There are a variety of ways to match a string against a regex. Irrespective of
159+
the syntax chosen, a successful match results in a L<C<Match>|/type/Match>
160+
object. In case the match is unsuccessful, the result is L<C<Nil>|/type/Nil>. In
161+
either case, the result of the match operation is available via the special
162+
match variable L<C<$/>|/syntax/$$SOLIDUS>.
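
A short sketch of the two outcomes follows, using a sample string chosen here purely for illustration. It checks definedness and truthiness rather than printing the result objects themselves.

```raku
# Successful match: the result is a Match object, and $/ holds it too.
my $hit = 'kiln' ~~ / k \w+ /;
say $hit.^name;   # Match
say $/.Str;       # kiln

# Unsuccessful match: the result is undefined and false.
say ('kiln' ~~ / z \w+ /).defined;   # False
say so $/;                           # False
```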
163+
164+
The most common ways to match a string against an anonymous regex C</pattern/> or
165+
against a named regex C<R> include the following:
166+
167+
=begin item
168+
I«Smartmatch: "string" ~~ /pattern/, "string" ~~ /<R>/»
169+
170+
L<Smartmatching|/language/operators#index-entry-smartmatch_operator> a string
171+
(C<Str>) against a C<Regex> performs a regex match of the string against the
172+
C<Regex>:
173+
174+
say "Go ahead, make my day." ~~ / \w+ /; # OUTPUT: 「Go」
175+
176+
my regex R { me|you };
177+
say "You talkin' to me?" ~~ / <R> /; # OUTPUT: «「me」␤ R => 「me」␤»
178+
say "May the force be with you." ~~ &R ; # OUTPUT: 「you」
179+
180+
The different outputs of the last two statements show that these two ways of
181+
smartmatching against a named regex are not identical. The difference arises
182+
because the method call C«<R>» from within the anonymous regex C</.../> installs
183+
a so-called L<'named capture'|/language/regexes#Named_captures> in the C<Match>
184+
object, while the smartmatch against the named C<Regex> as such does not.
185+
=end item
186+
187+
=begin item
188+
I«Explicit topic match: m/pattern/, m/<R>/»
189+
190+
The match operator C<m/ /> immediately matches the topic variable
191+
L<C<$_>|/language/variables#index-entry-topic_variable> against the regex
192+
following the C<m>. As with the C<rx/ /> syntax for regex definitions, the match
193+
operator may be used with adverbs in between C<m> and the opening regex
194+
delimiter, and with delimiters other than the slash.
195+
196+
Here's an example that illustrates the difference between the C<m/ /> and C</ />
197+
operators:
198+
199+
my $match;
200+
$_ = "abc";
201+
$match = m/.+/; say $match; say $match.^name; # OUTPUT: «「abc」␤Match␤»
202+
$match = /.+/; say $match; say $match.^name; # OUTPUT: «/.+/␤Regex␤»
203+
=end item
204+
205+
=begin item
206+
I«Implicit topic match in sink and boolean contexts»
207+
208+
In case a C<Regex> object is used in sink context, or in a context in which it
209+
is coerced to L<C<Bool>|/type/Bool>, the topic variable
210+
L<C<$_>|/language/variables#index-entry-topic_variable> is automatically matched
211+
against it:
212+
213+
$_ = "dummy string"; # Set the topic explicitly
214+
215+
rx/ s.* /; # Regex object in sink context matches automatically
216+
say $/; # OUTPUT: 「string」
217+
218+
say $/ if rx/ d.* /; # Regex object in boolean context matches automatically
219+
# OUTPUT: 「dummy string」
220+
=end item
221+
222+
=begin item
223+
I«Match method: "string".match: /pattern/, "string".match: /<R>/»
224+
225+
The L<C<match>|/type/Str#method_match> method is analogous to the C<m/ />
226+
operator discussed above. Invoking it on a string (C<Str>), with a C<Regex> as
227+
an argument, matches the string against the C<Regex>.
228+
=end item
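
As a rough illustration of the C<match> method described above, the sketch below uses a hypothetical named regex C<Year> and a sample date string; neither comes from the original text.

```raku
my regex Year { \d ** 4 }        # hypothetical named regex
my $date = '2015-12-25';

# Method-call form with an anonymous regex.
say $date.match(/ \d ** 4 /).Str;          # 2015

# Colon form, calling the named regex from inside an anonymous one;
# the named capture is then available under the key 'Year'.
my $m = $date.match: / <Year> /;
say $m<Year>.Str;                          # 2015

# An unsuccessful match evaluates to false.
say so $date.match(/ x /);                 # False
```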
229+
230+
64231
=head1 Literals and metacharacters
65232
66233
A regex describes a pattern to be matched in terms of literals and
