X<|Regular Expressions>

- Regular expressions, I<regexes> for short, are written in a domain-specific
- language that describes text patterns. Pattern matching is the process of
- matching those patterns to actual text.
+ A I<regular expression> is a sequence of characters that defines a certain text
+ pattern, typically one that one wishes to find in some large body of text.
+
+ In theoretical computer science and formal language theory, regular expressions
+ are used to describe so-called
+ L<I<regular languages>|https://en.wikipedia.org/wiki/Regular_language>. Since
+ their inception in the 1950s, practical implementations of regular expressions,
+ for instance in the text search and replace functions of text editors, have outgrown
+ their strict scientific definition. In acknowledgement of this, and in an attempt
+ to disambiguate, a regular expression in Perl 6 is normally referred to as a
+ I<regex> (from: I<reg>ular I<ex>pression), a term that is also in common use in
+ other programming languages.
+
+ In Perl 6, regexes are written in a
+ L<I<domain-specific language>|https://en.wikipedia.org/wiki/Domain-specific_language>,
+ i.e. a sublanguage or I<slang>. This page describes this language, and explains how
+ regexes can be used to search for text patterns in strings in a process called
+ I<pattern matching>.

=head1 X<Lexical conventions|quote,/ /;quote,rx;quote,m>

@@ -45,12 +60,7 @@ regex delimiter:
=begin item
You cannot use whitespace or alphanumeric characters as delimiters. Whitespace
in regex definition syntax is generally optional, except where it is required to
- distinguish from function call syntax (discussed below).
- =end item
-
- =begin item
- Use of a colon as a delimiter would clash with the use of adverbs of the form
- C<:adverb>; accordingly, such use of the colon is forbidden.
+ distinguish from function call syntax (discussed hereafter).
=end item

=begin item
@@ -63,7 +73,13 @@ C<Regex> object.
=end item

=begin item
- The hash C<#> is not available as a delimiter, since it is parsed as the start
+ Use of a colon as a delimiter would clash with the use of
+ L<adverbs|/language/regexes#Adverbs>, which take the form C<:adverb>;
+ accordingly, such use of the colon is forbidden.
+ =end item
+
+ =begin item
+ The hashmark C<#> is not available as a delimiter since it is parsed as the start
of a L<comment|/language/syntax#Single-line_comments> that runs until the end of
the line.
=end item
@@ -72,13 +88,13 @@ Secondly, the C<rx> form enables the use of
L<regex adverbs|/language/regexes#Adverbs>, which may be placed between C<rx> and the
opening delimiter to modify the definition of the entire regex:

-     rx:r:s/pattern/ # :r (:ratchet) and :s (:sigspace) adverbs, defining
+     rx:r:s/pattern/; # :r (:ratchet) and :s (:sigspace) adverbs, defining
                       # a ratcheting regex in which whitespace is significant

Although anonymous regexes are not, as such, I<named>, they may effectively be
given a name by putting them inside a named variable, after which they can be
- referenced, e.g. directly or by means of
- L<interpolation|/language/regexes#Regex_interpolation>:
+ referenced, both outside of an embedding regex and from within an embedding
+ regex by means of L<interpolation|/language/regexes#Regex_interpolation>:

    my $regex = / R \w+ /;
    say "Zen Buddhists like Raku too" ~~ $regex; # OUTPUT: 「Raku」
@@ -110,7 +126,7 @@ rather than data:
    &R ~~ Method; # OUTPUT: True (A Regex is really a Method!)

Also unlike with the C<rx> form for defining an anonymous regex, the definition
- of a named regex using the C<regex> form does not allow for adverbs to be
+ of a named regex using the C<regex> keyword does not allow for adverbs to be
inserted before the opening delimiter. Instead, adverbs that are to modify the
entire regex pattern may be included first thing within the curly braces:

@@ -128,9 +144,9 @@ C<Regex> when the C<:ratchet> and C<:sigspace> adverbs are of interest:

Named regexes may be used as building blocks for other regexes, as they are
methods that may be called from within other regexes using the C«<regex-name>»
- syntax. When they are used this way, they are often referred to as 'subrules';
+ syntax. When they are used this way, they are often referred to as I<subrules>;
see L<here|/language/regexes#Subrules> for more details on their use.
- L<C<Grammars>|/type/Grammar> are the natural niche for subrules, but many common
+ L<C<Grammars>|/type/Grammar> are the natural habitat of subrules, but many common
predefined character classes are also implemented as named regexes.

=head2 Regex readability: whitespace and comments
@@ -164,36 +180,41 @@ The most common ways to match a string against an anonymous regex C</pattern/> o
against a named regex C<R> include the following:

=begin item
- I«Smartmatch: "string" ~~ /pattern/, "string" ~~ /<R>/»
+ I«Smartmatch: "string" ~~ /pattern/, or "string" ~~ /<R>/»

L<Smartmatching|/language/operators#index-entry-smartmatch_operator> a string
- (C<Str>) against a C<Regex> performs a regex match of the string against the
- C<Regex>:
+ against a C<Regex> performs a regex match of the string against the C<Regex>:

-     say "Go ahead, make my day." ~~ / \w+ /; # OUTPUT: 「Go」
+     say "Go ahead, make my day." ~~ / \w+ /; # OUTPUT: «「Go」»

    my regex R { me|you };
    say "You talkin' to me?" ~~ / <R> /; # OUTPUT: «「me」 R => 「me」»
-     say "May the force be with you." ~~ &R ; # OUTPUT: 「you」
+     say "May the force be with you." ~~ &R ; # OUTPUT: «「you」»

The different outputs of the last two statements show that these two ways of
smartmatching against a named regex are not identical. The difference arises
- because the method call C«<R>» from within the anonymous regex C</... /> installs
+ because the method call C«<R>» from within the anonymous regex C</ /> installs
a so-called L<'named capture'|/language/regexes#Named_captures> in the C<Match>
object, while the smartmatch against the named C<Regex> as such does not.
=end item

=begin item
- I«Explicit topic match: m/pattern/, m/<R>/»
+ I«Explicit topic match: m/pattern/, or m/<R>/»

The match operator C<m/ /> immediately matches the topic variable
L<C<$_>|/language/variables#index-entry-topic_variable> against the regex
- following the C<m>. As with the C<rx/ /> syntax for regex definitions, the match
- operator may be used with adverbs in between C<m> and the opening regex
- delimiter, and with delimiters other than the slash.
+ following the C<m>.

- Here's an example that illustrates the difference between the C<m/ /> and C</ />
- operators:
+ As with the C<rx/ /> syntax for regex definitions, the match operator may be
+ used with adverbs in between C<m> and the opening regex delimiter, and with
+ delimiters other than the slash. However, while the C<rx/ /> syntax may only be
+ used with L<I<regex adverbs>|/language/regexes#Regex_adverbs> that affect the
+ compilation of the regex, the C<m/ /> syntax may additionally be used with
+ L<I<matching adverbs>|/language/regexes#Matching_adverbs> that determine how the
+ regex engine is to perform pattern matching.
+
+ Here's an example that illustrates the primary difference between the C<m/ />
+ and C</ /> syntax:

    my $match;
    $_ = "abc";
@@ -219,11 +240,22 @@ against it:
=end item

=begin item
- I«Match method: "string".match: /pattern/, "string".match: /<R>/»
+ I«Match method: "string".match: /pattern/, or "string".match: /<R>/»

The L<C<match>|/type/Str#method_match> method is analogous to the C<m/ />
- operator discussed above. Invoking it on a string (C<Str>), with a C<Regex> as
- an argument, matches the string against the C<Regex>.
+ operator discussed above. Invoking it on a string, with a C<Regex> as an
+ argument, matches the string against the C<Regex>.
+ =end item
+
+ =begin item
+ I«Parsing grammars: grammar-name.parse($string)»
+
+ Although parsing a L<Grammar|/language/grammars> involves more than just
+ matching a string against a regex, this powerful regex-based text destructuring
+ tool can't be left out of this overview of common pattern matching methods.
+
+ If you feel that your needs exceed what simple regexes have to offer, check out this
+ L<grammar tutorial|/language/grammar_tutorial> to take regexes to the next level.
=end item
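
The C<grammar-name.parse($string)> form listed in the last item can be sketched
as follows. This is only a minimal illustration and not part of the change
itself: the grammar C<KeyValue>, its tokens C<key> and C<value>, and the input
strings are hypothetical names made up for the example.

```raku
# Hypothetical grammar for illustration: parses strings such as "answer=42".
grammar KeyValue {
    token TOP   { <key> '=' <value> }   # entry point used by .parse
    token key   { \w+ }                 # one or more word characters
    token value { \d+ }                 # one or more digits
}

my $parsed = KeyValue.parse("answer=42");

say $parsed<key>;      # OUTPUT: «「answer」»
say $parsed<value>;    # OUTPUT: «「42」»

# .parse has to match the whole string; on failure it returns Nil.
say KeyValue.parse("answer=forty-two").defined;   # OUTPUT: «False»
```

Each token called as C«<key>» or C«<value>» installs a named capture in the
resulting C<Match> object, just as the C«<R>» call does in the smartmatch
example.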