@@ -12,46 +12,138 @@ matching those patterns to actual text.
12
12
13
13
= head1 X < Lexical conventions|quote,/ /;quote,rx;quote,m >
14
14
15
- Perl 6 has special syntax for literal regexes:
15
+ Fundamentally, regexes are very much like subroutines: both are code objects,
16
+ and just as you can have anonymous subs and named subs, you can have anonymous
17
+ and named regexes.
16
18
17
- m/abc/; # a regex that is immediately matched against $_
18
- rx/abc/; # a Regex object
19
- /abc/; # a Regex object; shorthand version of 'rx/ /' operator
19
+ A regex, whether anonymous or named, is represented by a L < C < Regex > |/type/Regex >
20
+ object. The syntax for constructing anonymous and named C < Regex > objects
21
+ differs, as do their intended uses.
20
22
21
- One difference between the C < m/ / > and C < rx/ / > forms on the one hand, and the
22
- C < / / > form on the other, is that C < m > and C < rx > may be followed by
23
- L < adverbs|/language/regexes#Adverbs > . Another difference is that the
24
- former forms allow delimiters other than the slash to be used:
23
+ In short, anonymous regexes may be used anywhere a regex is needed, with
24
+ the exception of L < C < Grammars > |/type/Grammar> , which are the domain of named
25
+ regexes. Named regexes form the building blocks of grammars, in which they serve
26
+ as methods (also known as 'subrules') that can be called from other regexes to
27
+ effectively parse textual data.
25
28
26
- m{abc}; # curly braces as delimiters
27
- rx:i[abc]; # :i adverb, and square brackets as delimiters
28
29
29
- As may be inferred from the above example, the use of a colon as an alternative
30
- delimiter would clash with the use of adverbs; accordingly, such use of the
31
- colon is forbidden. Similarly, parentheses cannot be used as alternative regex
32
- delimiters, at least not without a space between C < m > or C < rx > and the
33
- opening delimiter. This is because identifiers that are immediately followed by
34
- parentheses are always parsed as a subroutine call. For example, in C < rx() > the L < call
35
- operator|/language/operators#postcircumfix_(_) > C < () > invokes the subroutine
36
- C < rx > . The form C < rx ( abc ) > , however, I < does > define a Regex object.
30
+ = head2 Anonymous regex definition syntax
37
31
38
- Here's an example that illustrates the difference between the C < m/ / > and C < / / >
39
- operators:
32
+ An anonymous regex may be constructed in one of the following ways:
40
33
41
- my $match;
42
- $_ = "abc";
43
- $match = m/.+/; say $match; say $match.^name; # OUTPUT: «「abc」Match»
44
- $match = /.+/; say $match; say $match.^name; # OUTPUT: «/.+/Regex»
34
+ rx/pattern/; # an anonymous Regex object; 'rx' stands for 'regex'
35
+ /pattern/; # an anonymous Regex object; shorthand for 'rx/.../'
36
+
37
+ regex { pattern } # keyword-declared anonymous regex; this form is
38
+ # intended for defining named regexes and is discussed
39
+ # in that context in the next section
40
+
41
+ The C < rx/ / > form has two advantages over the bare shorthand form C < / / > .
42
+
43
+ Firstly, it enables the use of delimiters other than the slash, which may be
44
+ used to improve the readability of the regex definition:
45
+
46
+ rx{ '/tmp/'.* } # the use of curly braces as delimiters makes this first
47
+ rx/ '/tmp/'.* / # definition somewhat easier on the eyes than the second
48
+
49
+ Although the choice is vast, not every character may be chosen as an alternative
50
+ regex delimiter:
51
+
52
+ = begin item
53
+ You cannot use whitespace or alphanumeric characters as delimiters. Whitespace
54
+ in regex definition syntax is generally optional, except where it is required to
55
+ distinguish from function call syntax (discussed below).
56
+ = end item
57
+
58
+ = begin item
59
+ Use of a colon as a delimiter would clash with the use of adverbs of the form
60
+ C < :adverb > ; accordingly, such use of the colon is forbidden.
61
+ = end item
62
+
63
+ = begin item
64
+ Parentheses can be used as alternative regex delimiters, but only with a space
65
+ between C < rx > and the opening delimiter. This is because identifiers that are
66
+ immediately followed by parentheses are always parsed as a subroutine call. For example,
67
+ in C < rx() > the L < call operator|/language/operators#postcircumfix_(_) > C < () >
68
+ invokes the subroutine C < rx > . The form C < rx ( abc ) > , however, I < does > define a
69
+ C < Regex > object.
70
+ = end item
71
+
72
+ = begin item
73
+ The hash C < # > is not available as a delimiter, since it is parsed as the start
74
+ of a L < comment|/language/syntax#Single-line_comments > that runs until the end of
75
+ the line.
76
+ = end item
77
+
78
+ Secondly, the C < rx > form enables the use of
79
+ L < regex adverbs|/language/regexes#Adverbs > , which may be placed between C < rx > and the
80
+ opening delimiter to modify the definition of the entire regex:
81
+
82
+ rx:r:s/pattern/ # :r (:ratchet) and :s (:sigspace) adverbs, defining
83
+ # a ratcheting regex in which whitespace is significant
84
+
85
+ Although anonymous regexes are not, as such, I < named > , they may effectively be
86
+ given a name by putting them inside a named variable, after which they can be
87
+ referenced, e.g. directly or by means of
88
+ L < interpolation|/language/regexes#Regex_interpolation > :
89
+
90
+ my $regex = / k \w+ /;
91
+ say "Made in a low firing kiln" ~~ $regex; # OUTPUT: 「kiln」
92
+
93
+ my $regex = /pottery/;
94
+ "Japanese pottery rocks!" ~~ / <$regex> /; # Interpolation of $regex into /.../
95
+ say $/; # OUTPUT: 「pottery」
96
+
97
+ = head2 Named regex definition syntax
98
+
99
+ A named regex may be constructed using the C < regex > declarator as follows:
100
+
101
+ regex R { pattern } # a named Regex object, named 'R'
102
+
103
+ Unlike with the C < rx > form, you cannot choose your preferred delimiter: curly
104
+ braces are mandatory. In this regard it should be noted that the definition of a
105
+ named regex using the C < regex > form is syntactically similar to the definition
106
+ of a subroutine:
107
+
108
+ my sub S { /pattern/ }; # definition of Sub object (returning a Regex)
109
+ my regex R { pattern }; # definition of Regex object
110
+
111
+ which emphasizes the fact that a L < C < Regex > |/type/Regex> object represents code
112
+ rather than data:
113
+
114
+ say &S ~~ Code; # OUTPUT: True
115
+
116
+ say &R ~~ Code; # OUTPUT: True
117
+ say &R ~~ Method; # OUTPUT: True (A Regex is really a Method!)
118
+
119
+ Also unlike with the C < rx > form for defining an anonymous regex, the definition
120
+ of a named regex using the C < regex > form does not allow for adverbs to be
121
+ inserted before the opening delimiter. Instead, adverbs that are to modify the
122
+ entire regex pattern may be included first thing within the curly braces:
123
+
124
+ regex R { :i pattern } # :i (:ignorecase), renders pattern case insensitive
125
+
126
+ Alternatively, by way of shorthand, it is also possible (and recommended) to use
127
+ the C < rule > and C < token > variants of the C < regex > declarator for defining a
128
+ C < Regex > when the C < :ratchet > and C < :sigspace > adverbs are of interest:
45
129
46
- Whitespace in literal regexes is ignored unless the
47
- L < C < :sigspace > adverb|/language/regexes#Sigspace> is used to make whitespace
130
+ regex R { :r pattern } # apply :r (:ratchet) to entire pattern
131
+ token R { pattern } # same thing: 'token' implies ':r'
132
+
133
+ regex R { :r :s pattern } # apply :r (:ratchet) and :s (:sigspace) to pattern
134
+ rule R { pattern } # same thing: 'rule' implies ':r:s'
135
+
136
+
137
+ = head2 Regex readability: whitespace and comments
138
+
139
+ Whitespace in regexes is ignored unless the
140
+ L < C < :sigspace > |/language/regexes#Sigspace> adverb is used to make whitespace
48
141
syntactically significant.
49
142
50
143
In addition to whitespace, comments may be used inside of regexes to improve
51
- their readability and comprehensibility just as in Perl 6 code in general. This
52
- is true for both L < single line comments|/language/syntax#Single-line_comments >
53
- and L < multi line/embedded comments|
54
- /language/syntax#Multi-line_/_embedded_comments > :
144
+ their comprehensibility just as in code in general. This is true for both
145
+ L < single line comments|/language/syntax#Single-line_comments > and
146
+ L < multi line/embedded comments|/language/syntax#Multi-line_/_embedded_comments > :
55
147
56
148
my $regex = rx/ \d ** 4 #`(match the year YYYY)
57
149
'-'
@@ -61,6 +153,81 @@ and L<multi line/embedded comments|
61
153
62
154
say '2015-12-25'.match($regex); # OUTPUT: «「2015-12-25」»
63
155
156
+ = head2 Match syntax
157
+
158
+ There are a variety of ways to match a string against a regex. Irrespective of
159
+ the syntax chosen, a successful match results in a L < C < Match > |/type/Match>
160
+ object. In case the match is unsuccessful, the result is L < C < Nil > |/type/Nil> . In
161
+ either case, the result of the match operation is available via the special
162
+ match variable L < C < $/ > |/syntax/$$SOLIDUS> .
163
+
164
+ The most common ways to match a string against an anonymous regex C < /pattern/ > or
165
+ against a named regex C < R > include the following:
166
+
167
+ = begin item
168
+ I « Smartmatch: "string" ~~ /pattern/, "string" ~~ /<R>/ »
169
+
170
+ L < Smartmatching|/language/operators#index-entry-smartmatch_operator > a string
171
+ (C < Str > ) against a C < Regex > performs a regex match of the string against the
172
+ C < Regex > :
173
+
174
+ say "Go ahead, make my day." ~~ / \w+ /; # OUTPUT: 「Go」
175
+
176
+ my regex R { me|you };
177
+ say "You talkin' to me?" ~~ / <R> /; # OUTPUT: «「me」 R => 「me」»
178
+ say "May the force be with you." ~~ &R; # OUTPUT: 「you」
179
+
180
+ The different outputs of the last two statements show that these two ways of
181
+ smartmatching against a named regex are not identical. The difference arises
182
+ because the method call C « <R> » from within the anonymous regex C < /.../ > installs
183
+ a so-called L < 'named capture'|/language/regexes#Named_captures > in the C < Match >
184
+ object, while the smartmatch against the named C < Regex > as such does not.
185
+ = end item
186
+
187
+ = begin item
188
+ I « Explicit topic match: m/pattern/, m/<R>/ »
189
+
190
+ The match operator C < m/ / > immediately matches the topic variable
191
+ L < C < $_ > |/language/variables#index-entry-topic_variable> against the regex
192
+ following the C < m > . As with the C < rx/ / > syntax for regex definitions, the match
193
+ operator may be used with adverbs in between C < m > and the opening regex
194
+ delimiter, and with delimiters other than the slash.
195
+
196
+ Here's an example that illustrates the difference between the C < m/ / > and C < / / >
197
+ operators:
198
+
199
+ my $match;
200
+ $_ = "abc";
201
+ $match = m/.+/; say $match; say $match.^name; # OUTPUT: «「abc」Match»
202
+ $match = /.+/; say $match; say $match.^name; # OUTPUT: «/.+/Regex»
203
+ = end item
204
+
205
+ = begin item
206
+ I « Implicit topic match in sink and boolean contexts »
207
+
208
+ In case a C < Regex > object is used in sink context, or in a context in which it
209
+ is coerced to L < C < Bool > |/type/Bool> , the topic variable
210
+ L < C < $_ > |/language/variables#index-entry-topic_variable> is automatically matched
211
+ against it:
212
+
213
+ $_ = "dummy string"; # Set the topic explicitly
214
+
215
+ rx/ s.* /; # Regex object in sink context matches automatically
216
+ say $/; # OUTPUT: 「string」
217
+
218
+ say $/ if rx/ d.* /; # Regex object in boolean context matches automatically
219
+ # OUTPUT: 「dummy string」
220
+ = end item
221
+
222
+ = begin item
223
+ I « Match method: "string".match: /pattern/, "string".match: /<R>/ »
224
+
225
+ The L < C < match > |/type/Str#method_match> method is analogous to the C < m/ / >
226
+ operator discussed above. Invoking it on a string (C < Str > ), with a C < Regex > as
227
+ an argument, matches the string against the C < Regex > .
228
+ = end item
229
+
230
+
64
231
= head1 Literals and metacharacters
65
232
66
233
A regex describes a pattern to be matched in terms of literals and