4fe1c6e @perlpilot initial regex-intro document
authored
1 =head1 Introduction to Perl 6 Regex
2
3 =head2 Context
4
5 Over the years programming languages have incorporated features for
6 regular expressions. Some, such as JavaScript, have added syntax
7 specifically to support regular expressions. Others, such as PHP, have
8 just reused their native string type and utilize special subroutines to
9 parse strings as regular expressions. But one thing almost all of them
10 have in common is that they have mimicked the extended regular
11 expression syntax of Perl.
12
13 Of course, Perl wasn't the first programming language to have support
14 for regular expressions. But it did make them popular. Perl has been so
15 successful as a text processing and glue language and regular
16 expressions so well interwoven into the language that anyone who uses
17 Perl almost I<has> to learn regular expressions. Also, by applying some
18 of Perl's philosophy to regular expressions, common usages became easy
19 and complex usages became possible. Here are just a few features that
20 resulted: character class shortcuts, annotated regular expressions,
21 ability to match Unicode properties, zero-width assertions, independent
22 subexpressions, and code execution inside of a regular expression.
23
24 Unfortunately, as the regular expressioning public put more demand on
25 Perl's regular expression syntax, it accumulated some crufty items--
26 little inconsistencies that were kept to maintain backward compatibility or
27 were introduced because they were needed, but before they were fully
28 thought out. In designing Perl 6, Larry Wall not only looked at the
29 syntax and semantics of Perl proper, but he also took a hard look at the
30 sub-language that is regular expressions and refactored it into
31 something that makes better sense.
32
33 In this article I'm going to give an introduction to Perl 6 regex (we
34 call them "regex" to maintain the historical association with regular
35 expressions though they've strayed quite far from the mathematical
36 sense of regular languages). I'll point out differences from
37 Perl 5 syntax but no knowledge of Perl 5's regular expression syntax
38 should be necessary to understand this document. If you're a Perl 5
39 geek, you may be bored for a while, but read anyway so that you can
40 pick up the syntactic and semantic differences.
41
42 =head2 Literals
43
44 Firstly, let's get some small syntactic things out of the way. In Perl
45 6, as in other implementations, regex are typically delimited by slashes
46 (aka, leaning toothpicks), so a typical regex to match the string "abc"
47 would look like this:
48
49 /abc/
50
51 A regex is also sometimes called a "pattern" because we're looking for
52 a portion of a string that looks like the strings that the regex
53 describes. Regex are also sometimes called "rules" because they
54 describe the conditions under which the string may match. But what
55 string are we looking in? As in Perl 5, Perl 6 applies the regex to a
56 variable called C<$_> if we haven't explicitly specified a variable to
57 match against. For now I'm going to continue assuming that our string
58 is in C<$_> in my examples. Later I'll show you how to specify a
59 different string to match against.
60
61 So, the above regex tries to find the pattern "abc" in the string C<$_>.
62 If the pattern appears in the string, the regex returns a true value to
63 indicate that it matched successfully, otherwise it returns a false
64 value. It's important to note that the pattern may appear anywhere in
65 the string. So, for instance, if C<$_> contained "fooabcbar", the
66 above pattern will successfully match and the regex will return a true
67 value. Here are some more strings that would successfully match against
68 the regex:
69
70 abcgoobltygook
71 now I know my abcs
72 abc
73 babcock
74
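To try this out, here is a minimal sketch, assuming a current Perl 6 (Raku) implementation such as Rakudo. It loops over the strings listed above plus one made-up string, "xyz", that should not match. Inside the loop each string is in C<$_>, so the bare regex is applied to it.

```raku
# "xyz" is a hypothetical extra string added to show a failed match.
for 'abcgoobltygook', 'now I know my abcs', 'abc', 'babcock', 'xyz' {
    # A bare regex in boolean context is matched against $_.
    if /abc/ {
        say "matched:  $_";
    }
    else {
        say "no match: $_";
    }
}
```

Only the last string should report "no match", since "abc" appears somewhere in each of the others.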
75 =head2 Meta-syntax
76
77 If all you could do were match literal strings, regexes wouldn't be so
78 useful, would they? Some characters, rather than being taken literally, are
79 so-called metacharacters that have special meaning in a regex. In Perl 6,
80 any non-alphanumeric character is considered a metacharacter by
81 default. That is, alphabetic and numeric characters match themselves and
82 any other character may not match itself because it may have special
83 meaning. (For the purposes of metasyntax, the underscore is considered
84 alphanumeric.)
85
86 Currently not all metacharacters actually I<have> a special meaning but
87 many do and in order to keep things simple, Perl 6 chooses to designate
88 all non-alphanumeric characters as metasyntactic. However, there's an
89 "escape mechanism" that lets you treat metacharacters as themselves
90 (literally), and alphanumeric characters as metasyntactic. By prefixing
91 an alphanumeric character with a backslash (B<\>) it becomes a
92 metacharacter and is special. By the same token, prefixing a non-
93 alphanumeric character with a backslash removes its metasyntactic nature
94 and it becomes literal.
95
96 So, for example, a common metasyntactic character found in regular
97 expressions is a period (C<.>, sometimes just called a "dot") and it
98 matches B<any> character. Thus,
99
100 /f..d/
101
102 will match any four character sequence that starts with an "f" and ends
103 with a "d". All of the following strings match this pattern:
104
105 my name is fred # matched "fred"
106 I need food, now! # matched "food"
107 those guys are turf idols # matched "f id"
108 shift down a gear # matched "ft d"
109
110 If you want to actually match a period you can escape it like so:
111
112 /foo\./ # matches "foo."
113
114 Similarly, the letter "t" in a regular expression matches itself (i.e.,
115 an occurrence of the letter "t"). But, with a backslash immediately
116 preceding the "t", it takes on its metasyntactic meaning of a tab
117 character. For example:
118
119 /tall/ # matches the string "tall"
120 /\tall/ # matches a tab character, followed by "all"
121
122 Another way to match character sequences that are to be taken
123 literally is to enclose them in quotes:
124
125 /'foo.'/ # matches "foo."
126 /"foo."/ # same
127
128 In these cases, the quotes are metasyntactic delimiters that mean
129 "match the characters in between literally".
130
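The following sketch exercises the escaping rules just described. It assumes a current Perl 6 (Raku) implementation and uses the smart match operator C<~~>, which is explained later in this article, to pick the string to match against. The C<so> prefix simply turns the result into True or False. The strings "fooX" and "a\tall" are made up for illustration.

```raku
say so 'I need food, now!' ~~ / f..d /;    # True: the dots match "oo"
say so 'foo.'   ~~ / foo\. /;              # True: escaped dot is literal
say so 'fooX'   ~~ / foo\. /;              # False: "X" is not a period
say so 'fooX'   ~~ / 'foo.' /;             # False: quoted dot is literal too
say so "a\tall" ~~ / \tall /;              # True: tab followed by "all"
say so 'tall'   ~~ / \tall /;              # False: no tab in the string
```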
131 =begin sidebar
132
133 In Perl 5 (and most other regular expression variants), a dot matches
134 any character except for the newline sequence. In Perl 6, this odd
135 restriction is lifted and the dot matches B<any> character, including
136 the newline sequence. Perl 6 has other mechanisms to accomplish Perl
137 5's behavior. See L<"Character Classes"> below.
138
139 =end sidebar
140
141 The most important metasyntactic characters in regex are whitespace
142 (typically space characters, but sometimes tabs and other "invisible"
143 characters). Because regular expressions tend towards high character
144 density, they can often be difficult to read. In Perl 6 regex, you may
145 use whitespace to separate parts of your regex to make it easier to
146 read. The whitespace is ignored by the regex engine. To match literal
147 space characters or other whitespace you should enclose it in quotes.
148
149 In future examples I will show each regex in its spaced form. The
150 spaced form is preferable when writing regex as it makes them
151 easier to read later, and will be used from now on.
152
153 =head2 Quantifiers
154
155 Three other important metacharacters are what are called quantifiers.
156 These characters allow you to specify repetition; that a character or
157 group of characters may be matched multiple times.
158
159 quantifier matches
160 * the preceding thing zero or more times
161 + the preceding thing one or more times
162 ? the preceding thing zero or one time
163
164
165 Examples:
166
167 / fo* / # will match "f", "fo", "foo", "fooo", etc.
168 / fo+ / # will match "fo", "foo", "fooo", "foooo", etc.
169 / fo? / # will only match "f" or "fo"
170
171 Note that in the above examples the quantifier is only applied to the
172 preceding character. If you need to match a group of characters
173 repeatedly, you have to use one of the several grouping mechanisms in
174 Perl 6 regex (see L<Grouping> below).
175
176 There is another character sequence sometimes called the universal
177 quantifier that allows you to prescribe a specific number of times a
178 particular pattern may match. To use the universal quantifier, you use
179 two C<*> characters followed by a number or a range like so:
180
181 / fo ** 3 / # matches only "fooo"
182 / fo ** 3..5 / # matches one of "fooo", "foooo", or "fooooo"
183
184 You may also specify a closure after the C<**>, but that is beyond the
185 scope of this introduction to explain. See the references given at
186 the end of this article for more reading.
187
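As a rough illustration of the quantifiers above, the sketch below applies each one to the same made-up string, an "f" followed by five "o" characters, and prints the portion that matched. It assumes a current Perl 6 (Raku) implementation; the C<~> prefix turns the result of the match into a plain string, and C<~~> (covered later) selects the string to match against.

```raku
my $s = 'fooooo';                 # hypothetical input: "f" plus five "o"s

say ~($s ~~ / fo* /);             # fooooo
say ~($s ~~ / fo+ /);             # fooooo
say ~($s ~~ / fo? /);             # fo
say ~($s ~~ / fo ** 3 /);         # fooo
say ~($s ~~ / fo ** 3..5 /);      # fooooo
```

Because the quantifiers take as much as they are allowed to, the ranged form consumes all five "o" characters here rather than stopping at three.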
188 =begin sidebar
189
190 =head3 A note on greed
191
192 As in Perl 5, the quantifiers are greedy. That is, they try to
193 consume as much of the target string as possible. So the pattern
194
195 / .* abc /
196
197 when applied to the string "abcdefghijklmnopqrstuvwxyz" will first try
198 to match the entire string (because C<.> matches any character and C<*>
199 tries to match as much as it can). When the regex engine gets to the end
200 of the string and cannot match "abc" it will back up one character and
201 try to match "abc" again. This process will repeat until the regex
202 engine either matches or runs out of characters to process.
203
204 Also as in Perl 5, the greediness of quantifiers can be "turned off" by
205 adding a question mark (C<?>) immediately after the quantifier.
206 So for instance,
207
208 / . *? abc / # first tries to match nothing, then "abc"
209 / . +? abc / # first tries matching one character, then "abc"
210 / . **?3..5 abc/ # first tries matching three characters, then "abc"
211
212 The default behavior (greediness) can be thought of as the regex engine
213 first matching the most characters that the quantifier will allow and
214 then backing up one character at a time if the rest of the pattern
215 doesn't match (to try again). With greediness turned off, the regex
216 engine matches as little as the quantifier allows and moves forward one
217 character at a time to match the rest of the pattern.
218
219 Note that a regex is processed from left to right and the greediness (or
220 non-greediness) behavior B<only> applies to the part of the regex
221 governed by a particular quantifier. Making quantifiers match
222 minimally does not cause the pattern as a whole to match minimally.
223
224 =end sidebar
225
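The difference between greedy and non-greedy matching is easiest to see side by side. The sketch below assumes a current Perl 6 (Raku) implementation; both input strings are made up for the purpose.

```raku
my $s = 'xxabcyyabc';             # hypothetical input containing "abc" twice

say ~($s ~~ / .*  abc /);         # xxabcyyabc  (greedy: runs to the last "abc")
say ~($s ~~ / .*? abc /);         # xxabc       (frugal: stops at the first "abc")

# A minimal quantifier does not make the whole match minimal: the match
# still starts at the leftmost possible position.
say ~('aaab' ~~ / a+? b /);       # aaab
```

In the last line a shorter match, "ab", exists further along in the string, but the engine finds a match starting at the first "a" before it ever looks there.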
226 =head2 Character Classes
227
228 We've seen how to match a specific character at a given location by
229 putting that character in the regex and we've seen how to match any
230 character at a specific location by putting a dot in the regex, but
231 sometimes you want to match a specific I<set> of characters at a given
232 position. The mechanism to do this in regex is called a "character class".
233 Character classes are designated by C<< <[]> >> with the specific
234 characters listed inside the brackets. For instance,
235
236 / foo <[dlt]> / # matches "food", "fool" or "foot"
237
238 The sequence C<< <[dlt]> >> represents any one of the characters "d", "l",
239 or "t". You can also specify a set of contiguous characters to match
240 using a range like so:
241
242 / <[ a..d ]> / # matches one of "a", "b", "c", or "d"
243
244 You can also mix ranges and specific characters:
245
246 /<[ a..d xyz ]>/ # matches one of "a","b","c","d","x","y", or "z"
247 /<[ xyz a..d ]>/ # same
248 /<[ x..z a..d ]>/ # same
249
250 Some character classes are so useful that they have their own
251 designated short-cuts. All of the character class short-cuts make use
252 of alphabetic characters that have been given a metasyntactic meaning
253 by prefixing the character with a backslash. Here's a table of the
254 short-cuts:
255
256 short-cut matches
257 \w word characters (alphabetics, numerics, and underscore)
258 \W non-word characters
259 \d digits
260 \D non-digits
261 \s whitespace characters
262 \S non-whitespace characters
263 \t tab character
264 \T anything but a tab character
265 \n newline sequence
266 \N anything I<but> a newline sequence
267 \r carriage return character
268 \R anything but carriage return character
269 \f form feed character
270 \F anything but form feed character
271 \h horizontal whitespace
272 \H anything but horizontal whitespace
273 \v vertical whitespace
274 \V anything but vertical whitespace
275
276 You may notice some regularity in this table. For every character class
277 short-cut of the form C<< \<lower case letter> >> the anti-class is
278 always C<< \<corresponding upper case letter> >>. (The old non-newline
279 meaning of C<.> maps neatly to the new C<\N> sequence.)
280
281 Character classes bear a remarkable resemblance to sets. In fact, you can "add"
282 and "subtract" character classes much like you would sets:
283
284 /<[a..z] - [aeiou]>/ # Only match a consonant
285 /<[asdfg] + [hjkl;]> +/ # Only match a sequence of characters that can
286 # be made from home row keys.
287
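Here is a small sketch of character classes in action, assuming a current Perl 6 (Raku) implementation. The input strings are invented, and the home-row class is trimmed to letters only to keep the example simple.

```raku
say ~('foot' ~~ / foo <[dlt]> /);                          # foot

# Subtraction: lower-case letters that are not vowels.
say ~('rhythm and blues' ~~ / <[a..z] - [aeiou]>+ /);      # rhythm

# Addition: the union of two classes.
say ~('alfalfa salad' ~~ / <[asdf] + [jkl]>+ /);           # alfalfa

# \N matches anything but a newline, so the match stops at the line break.
say ~("one\ntwo" ~~ / \N+ /);                              # one
```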
288
289 =head2 Grouping
290
291 There are several ways to group a portion of a regex. We saw one such
292 way earlier in our discussion of literals: surround the characters with
293 quotes. Quoting does two things: it forces all of the characters between
294 the quotes to be treated literally and it groups the string of
295 characters together into a quantifiable unit.
296
297 / 'foo.' * / # will match "foo.", "foo.foo.", "foo.foo.foo.", etc.
298
299 Another way to create a quantifiable unit is to use square brackets
300 (C<[]>). Square brackets delimit a portion of the regex that may be
301 treated as a whole. The text in between the brackets is just another
302 regex.
303
304 / f[ oo ]* / # will match "f", "foo", "foooo", "foooooo", etc.
305 / a[ bc* d ]? / # will match "a", "abd", "abcd", "abccd", "abcccd", etc.
306
307 Yet another way to group a portion of a regex to be treated as a unit is
308 to use round brackets (C<()>). These are identical to square brackets
309 as far as grouping goes, but additionally, the portion of the string
310 that is matched by the regex inside the round brackets is also saved
311 somewhere and may be accessed later in a variety of ways. Round brackets
312 are said to be "capturing brackets" because of this property. The
313 following table shows some examples of what would be captured if the
314 given regex matched certain portions of a string:
315
316 regex matched captured
317
318 /f(oo)*/ "f" "" # the empty string
319 "foo" "oo"
320 "foooo" "oooo"
321
322 /a(bc*d)?/ "a" "" # the empty string
323 "abd" "bd"
324 "abcd" "bcd"
325 "abccd" "bccd"
326
327 Both round and square brackets delimit a portion of the regex. This
328 portion of the regex is called a "subpattern". The portion of the string
329 that matches a subpattern can be referenced and accessed individually.
330 We'll talk more about capturing and where the captured portion of the
331 string is stored later.
332
333 =head2 Alternation and Conjunction
334
335 There are a couple of other useful concepts in regex called
336 "alternation" and "conjunction". Alternation is the idea that at a
337 given location in a string, there are alternatives to what may be
338 matched. Conjunction is the idea that at a given location in a
339 string, there are multiple portions of a regex that must match exactly
340 the same section of the string.
341
342 Alternation is designated in regex by either a single vertical bar
343 (C<|>) or a double vertical bar (C<||>). While each allows you to
344 specify alternatives, how they process those alternatives is different.
345
346 A single vertical bar does "longest token" matching on the
347 alternations with no inherent order as to which alternative is tried
348 first. So, for instance, if we were matching the string
349 "football", the following regex
350
351 / f | fo | foo /
352
353 would match "foo" since that's the longest matching portion of the
354 regex in the alternation. But the regular expression engine may have
355 tried them all before it discovered "foo", or perhaps "foo" was the
356 first and only alternative tried. It is completely up to the
357 regex engine implementation as to how the alternatives are tried.
358
359 Had the regex been
360
361 / f | fo | fo.*l | foo /
362
363 then the third item in the alternation would be matched since
364 C<fo.*l> will match the entire string. Again, the order in which the
365 alternatives are tried is unspecified.
366
367 A double vertical bar (C<||>) will match each alternative in a left-to-
368 right manner. That is, the regex
369
370 / f || fo || foo /
371
372 will first try to match "f", and then (if it failed to match "f") try
373 to match "fo", and finally it will try to match "foo". So, were the
374 above regex applied to the string "football" as before, the first
375 alternative ("f") would match. This behavior is exactly the same as
376 traditional implementations of alternation in other backtracking
377 regular expression engines.
378
379 Which alternative matches and the order in which the alternatives are
380 tried becomes particularly important when each alternative has side
381 effects (such as setting a variable or calling a subroutine). We'll
382 talk more about that later.
383
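The two kinds of alternation can be compared directly on the "football" example. This is a minimal sketch, assuming a current Perl 6 (Raku) implementation such as Rakudo; only the part of the string that matched is printed.

```raku
my $s = 'football';

say ~($s ~~ / f |  fo |  foo /);            # foo  (longest alternative wins)
say ~($s ~~ / f || fo || foo /);            # f    (first alternative that matches wins)
say ~($s ~~ / f | fo | fo.*l | foo /);      # football
```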
384 Similar to alternations, conjunctions in regex are designated by either
385 a single ampersand (C<&>) or a double ampersand (C<&&>). In both forms,
386 all of the conjuncted terms must match the exact same portion of the
387 string they are being matched against. But, as with alternation, the single-
388 ampersand version matches the subpatterns in some unspecified order while the
389 double ampersand version of conjunctions will try each conjuncted
390 portion of the regex in a left-to-right manner.
391
392 For an example, if the following regex were applied to the string
393 "blah",
394
395 / <[a..z]>+ & [ ... ] / # matches "bla"
396
397 it would match the string "bla" because the subpattern on the right of the
398 ampersand matches exactly 3 characters and the subpattern on the left
399 matches any sequence of lower case letters. By comparison, had the regex
400 been (still applied to the string "blah"):
401
402 / <[a..z]>+ && [ ... ] /
403
404 It B<still> would match "bla" but how it arrives at that match is
405 slightly different. First, the left hand side of the C<&&> would match
406 as much as it possibly can (since C<+> is greedy), then the right hand
407 side would match its 3 characters. Since the two sides then do not
408 match the exact same portion of the string, the regex engine is said
409 to "backtrack" and try the left hand side with one fewer character
410 before trying to match the right hand side again. This continues
411 until both sides match the exact same portion of the string or until
412 it can be determined that no such match is possible.
413
414 =head2 Backtracking
415
416 Backtracking has the potential to happen whenever there is a decision
417 point in a regex. An alternation creates a decision point with several
418 alternatives; a quantifier creates a decision point where a given
419 portion of a regex may match one more or one fewer times. Backtracking
420 is caused by the regex engine's attempt to satisfy an overall match. For
421 instance, when matching the following regex against the string "footbag"
422
423 / [ foot || base || hand ] ball /
424
425 the regex engine will match "foot" and then attempt to match "ball". When
426 it realizes that "ball" will not match, it backtracks to the point where
427 it matched "foot" and tries to match the next alternative at that point
428 (in this case, "base"). When that fails to match, the regex engine will
429 try to match the next alternative, and so forth until either one of the
430 alternatives matches or they are all exhausted.
431
432 Now, as a human walking through the same sequence of steps that the
433 regex goes through, I can tell right away that since C<foot> was
434 matched, neither C<base> nor C<hand> will match. But the regex
435 engine may not know this. (For this simple example, the regex engine can
436 probably figure out that the alternatives are mutually exclusive. If
437 that bothers you, imagine that they are instead complicated expressions
438 rather than simple strings.) As the match for "ball" repeatedly fails,
439 the regex engine repeatedly backtracks and tries again.
440
441 However, Perl 6 provides a way to tell the regex engine when to give
442 up. A colon in the regex causes Perl to not retry the preceding
443 "atom". In regex parlance, an atom is anything that is matched as a
444 unit; a group is an atom, a single character may be an atom, etc.
445 So, in the above example, if I wanted to tell the regex engine to not
446 try the other alternatives once one of them matched (because I, as the
447 person writing the regex, know that they are all mutually exclusive),
448 I simply follow the group with a colon like so:
449
450 / [ foot || base || hand ] : ball /
451
452 As soon as one of the alternatives matches, the regex engine will move
453 past the colon and try to match C<ball>. When it fails to match C<ball>,
454 ordinarily it would backtrack to try other possibilities. But now the
455 colon acts as a stopping point that says, "don't bother backtracking
456 because nothing else will match" and so the regex engine will fail that
457 portion of the regex immediately rather than trying all of the other
458 alternatives. It's important to note that the colon does not necessarily
459 cause the entire regex to fail once it is backtracked over, but only the
460 atom to which it is applied. If not matching that atom means that the
461 entire regex won't match, then, of course, the entire regex will fail.
462
463 Perl 6 has other forms of backtracking control. A double colon will
464 cause its enclosing group to fail if backtracked over. A triple colon
465 will cause the entire regex to fail. For more information on these
466 see L<S05:Backtracking control>.
467
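The effect of the single colon is easier to observe on a quantified atom than on the alternation above, because the alternation example fails with or without it. The sketch below is an illustration under that substitution, assuming a current Perl 6 (Raku) implementation; the string "aaa" is made up.

```raku
# Without the colon, a+ first takes all three "a"s, then gives one back
# so that the final "a" in the pattern can match.
say so 'aaa' ~~ / a+  a /;      # True

# With the colon, a+ refuses to give anything back, so there is never
# an "a" left over for the rest of the pattern.
say so 'aaa' ~~ / a+: a /;      # False
```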
468 =head2 Zero-Width Assertions
469
470 As the regular expression engine processes a string to see if it matches
471 a given pattern, it keeps a marker to denote how much of the string it
472 has processed so far. (When this marker moves backwards, the regex
473 engine is backtracking.) As the marker moves along the string, it is said
474 to "consume" the string. Sometimes you want to match without consuming
475 any of the input string or match between characters (say the transition
476 from an alphabetic character to a numeric character or only at the
477 beginning of the string). This idea of matching without consuming is
478 called a zero-width assertion.
479
480 Perl 6 provides some metacharacters that denote handy zero-width
481 assertions.
482
483 ^ only matches at the beginning of the string
484 $ only matches at the end of the string
485 ^^ matches at the beginning of any line within the string
486 $$ matches at the end of any line within the string
487 << A word boundary, but only matches the transition from
488 non-word character to word character (i.e., the left-hand
489 side of a word)
490 >> A word boundary, but only matches the transition from
491 word character to non-word character (i.e., the right-hand
492 side of a word)
493
494 Here are some example patterns and the portion(s) of the string that
495 would match if each pattern were applied repeatedly to the entire string
496 "the quick\nbrown fox\njumped over\nthe lazy\ndog" (here "\n" denotes a
497 newline sequence within the string, i.e. "\n" separates the lines):
498
499 Pattern Matches
500
501 / ^the \s+ \w+ / "the quick"
502 / \w+ $ / "dog"
503 / ^^ \w+ / "the", "brown", "jumped", "the", and "dog"
504 / \w+ $$ / "quick", "fox", "over", "lazy", and "dog"
505 / << o\w+ / "over"
506 / o \w+ >> / "own", "ox", "over", "og"
507
508 In order for the patterns that would match multiple portions of the string
509 to actually match those substrings, there needs to be some way to tell the regex
510 engine to continue matching from where the last match left off. See L<modifiers>
511 below.
512
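The table above can be checked with a short sketch, assuming a current Perl 6 (Raku) implementation. It uses the C<:g> (global) modifier described in the modifiers section to collect every match, and C<join> to print the matched substrings on one line.

```raku
my $text = "the quick\nbrown fox\njumped over\nthe lazy\ndog";

say ~($text ~~ / ^the \s+ \w+ /);                  # the quick
say ~($text ~~ / \w+ $ /);                         # dog
say ($text ~~ m:g/ ^^ \w+ /).join(', ');           # the, brown, jumped, the, dog
say ($text ~~ m:g/ \w+ $$ /).join(', ');           # quick, fox, over, lazy, dog
say ($text ~~ m:g/ << o\w+ /).join(', ');          # over
say ($text ~~ m:g/ o \w+ >> /).join(', ');         # own, ox, over, og
```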
513 =head2 Interacting with Perl 6
514
515 =head3 match objects and capturing
516
517 When a regex successfully matches a portion of a string it returns a
518 Match object. In a boolean context the Match object evaluates to true.
519 When using capturing brackets (parentheses), the part of the string that
520 is captured is placed in the Match object and can be accessed in a
521 number of ways.
522
523 The Match object itself is called C<$/>. The substring captured by the
524 first set of parentheses is placed in C<$/[0]>, the substring captured
525 by the second set of parentheses is placed in C<$/[1]> and so forth.
526 There is a short-hand way to access these elements of the Match object
527 that will be familiar (yet slightly different) to people who have used
528 regular expression engines similar to Perl 5. The short-hand for
529 $/[0],$/[1],$/[2], etc. are the special variables $0, $1, $2, etc.
530 The big difference from other regular expression engines is that the
531 numbering starts with 0 rather than 1. Starting from 0 was chosen to
532 mimic the array indices of the match object.
533
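A short sketch of the Match object and its numbered captures follows, assuming a current Perl 6 (Raku) implementation. The input string is invented, and the C<~> prefix turns each capture into a plain string for printing.

```raku
if 'John Smith' ~~ / (\w+) \s+ (\w+) / {
    say ~$/;        # John Smith  (the whole match)
    say ~$/[0];     # John        (first set of parentheses)
    say ~$0;        # John        (short-hand for $/[0])
    say ~$1;        # Smith       (short-hand for $/[1])
}
```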
534 =head3 matching Perl variables
535
536 Unlike Perl 5, a variable placed inside a regex does not automatically
537 interpolate the value of the variable. What happens with the variable
538 depends on context (this B<is> Perl after all :-). An "unadorned"
539 variable will interpolate as a literal string to match if it's a scalar,
540 or as an alternation of literal strings to match if it's an array or
541 hash (not strictly true, but true enough for now). So, given the
542 following declarations:
543
544 my $foo = "ab*c";
545 my @bar = < one two three >;
546
547 The regex:
548
549 / $foo @bar /
550
551 matches exactly as if you had written
552
553 / 'ab*c' [ one | two | three ] /
554
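To see that the scalar really is taken literally, here is a minimal sketch assuming a current Perl 6 (Raku) implementation. The two test strings are made up: the first contains a literal asterisk, the second is what C<ab*c> would match if it were treated as a pattern.

```raku
my $foo = 'ab*c';
my @bar = < one two three >;

say so 'ab*ctwo' ~~ / $foo @bar /;     # True: literal "ab*c", then "two"
say so 'abbctwo' ~~ / $foo @bar /;     # False: "*" is not a quantifier here
say so 'ab*cfour' ~~ / $foo @bar /;    # False: "four" is not in @bar
```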
555 Sometimes a variable inside of a regex is actually used to effect a
556 named capture of a specific portion of the string instead of (or even
557 in addition to) storing the captured portion in $0, $1, $2, etc.
558 For instance:
559
560 / $<foo>:=[ <[A..Z]+[0..9]>**4 ] /
561
562 If the group matches, the result is placed in C<< $/<foo> >>. As with
563 numeric captures, there is a short-hand syntax for accessing a named
564 portion of a Match object: C<< $<foo> >>.
565
566 =head3 matching other variables
567
568 Until now we've talked primarily about the pattern matching syntax
569 itself, but how do we apply these rules to a string other than C<$_>?
570 We use the smart match operator C<~~>. This operator is called "smart
571 match" because it does quite a bit more than just apply regular
572 expressions to strings, but for now we're just going to focus on that
573 one aspect of the smart match operator. So to match a regular
574 expression against the string contained in a variable called C<$foo>,
575 we'd do this:
576
577 $foo ~~ / <regex here> /;
578
579 There's a more general syntax that lets you choose
580 different delimiters if your regular expression happens to match the
581 C</> character and you don't feel like writing C<\/> so often:
582
583 $foo ~~ m/ <regex here> /;
584 $foo ~~ m! <regex here> !; # different delimiter
585
586 =head3 modifiers
587
588 The more general syntax also gives us a convenient place to put
589 modifiers that will affect the regular expression as a whole. For
590 instance, there is an C<ignorecase> modifier that causes the regex engine
591 to be agnostic towards case distinctions in alphabetic characters.
592 There's also a short-hand for this modifier for those times when
593 C<ignorecase> is too much to type out.
594
595 $foo ~~ m :ignorecase/ foo /; # will match "foo", "FOO", "fOo", etc.
596 $foo ~~ m :i/ foo /; # same
597
598 Perl 6 predefines several of these modifiers (for a complete list,
599 see L<S05>):
600
601 modifier short-hand meaning
602 :ignorecase :i Ignore case distinctions
603 :basechar :b Ignore accents and other marks
604 :sigspace :s whitespace in pattern matches
605 whitespace in string
606 :global :g matches as many times as possible
607 :continue :c Continue matching from where
608 previous match left off
609 :pos :p Just like :c but pattern is
610 anchored where the previous match left off
611 :ratchet Don't do any backtracking
612 :bytes dot matches bytes
613 :codes dot matches codepoints
614 :graphs dot matches language-independent graphemes
615 :chars dot matches "characters" at current
616 Unicode level
617 :Perl5 :P5 use perl 5 regex syntax
618
619 There are two other modifiers for matching a pattern some number of
620 times or only matching, say, the third time we see a pattern in a
621 string. These modifiers are a little strange in that their short-hand
622 forms consist of a number followed by some text:
623
624 modifier short-hand meaning
625 :x() :1x,:4x,:12x match some number of times
626 :nth() :1st,:2nd,:3rd,:4th match only the Nth occurance
627
628 Here are some examples to illustrate these modifiers:
629
630 $_ = "foo bar baz blat";
c6502a0 @moritz [regex-intro] use the spaced form by default
moritz authored
631 m :3x/ a / # matches the "a" characters in each word
632 m :nth(3)/ \w+ / # matches "baz"
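As a rough sketch of how these counting modifiers behave, here is the
same string matched in present-day Raku (the current name of Perl 6).
It uses the C<.match> method with named arguments rather than the
C<m :3x/ ... /> spelling shown above; that choice is an assumption made
so the example runs on a current compiler, not something this article
prescribes.

    my $str = "foo bar baz blat";

    # :x(3) asks for exactly three matches of the pattern
    my @three = $str.match(/ a /, :x(3));
    say @three.elems;               # 3

    # :nth(3) keeps only the third match
    my $third = $str.match(/ \w+ /, :nth(3));
    say ~$third;                    # baz

    # :g collects every match
    my @words = $str.match(/ \w+ /, :g);
    say @words.join(",");           # foo,bar,baz,blat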

Some of these modifiers may also be placed inside the regular expression,
where their effect lasts until the end of the innermost enclosing
bracketing construct or the end of the pattern.

    / a [ :i foo ] z/    # matches "afooz", "aFOOz", "aFooz", "afOoz", etc.
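To see the scoping at work, a small check in present-day Raku (the test
strings are made up for illustration) shows that only the bracketed part
ignores case:

    say so "aFOOz" ~~ / a [ :i foo ] z /;    # True:  "FOO" is inside the brackets
    say so "AfooZ" ~~ / a [ :i foo ] z /;    # False: "A" and "Z" are outside them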

The C<:sigspace> modifier is quite useful. If you're unsure of the amount
of whitespace between tokens or can't guarantee a certain number of
spaces, you may be inclined to use C<\s*> or C<\s+> often in your regex.
However, it can get tedious typing C<\s+> so often and it tends to
visually detract from the parts you're really interested in matching.
Thus Perl 6 regex provides the C<:sigspace> modifier so that whitespace
in your pattern matches whitespace in your string. This is I<so>
useful that Perl 6 provides a nice short-cut for it.

    /\s*One\s+small\s+step\s*/    # yuck
    m:sigspace/One small step/    # much better
    mm/One small step/            # even better!
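A minimal sketch of the difference, run against a made-up string with
uneven spacing in present-day Raku. Only the long C<m:sigspace> form is
used here, because that is the spelling I am sure a current compiler
accepts:

    my $text = "One   small\tstep";

    # Spelling out the whitespace by hand
    say so $text ~~ / One \s+ small \s+ step /;              # True

    # Letting whitespace in the pattern stand for whitespace in the string
    say so $text ~~ m:sigspace/One small step/;              # True

    # Words that run together have no whitespace to match
    say so "Onesmallstep" ~~ m:sigspace/One small step/;     # False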

=head2 Named Assertions

Earlier we talked about "zero-width assertions" that allow us to match
"in between" characters. But we've also talked about other kinds of
assertions, only we didn't call them that. Whenever you write a regex to
match I<anything>, you're asserting something about what a successful
match should look like. So, for instance, in the following regex,

    / \w+ '(' [ \w+ ',' ]* [ \w+ ]? ')' /

we are asserting that, for a string to match, it must contain a sequence
of word characters followed by a C<(>, then zero or more word-character
sequences each terminated by a comma, then an optional final
word-character sequence, and finally a C<)>. Each of the "tokens" is an
assertion about what must match for the entire regex to match. So,
C<\w> is an assertion that a word character must match, C<'('> is an
assertion that an open parenthesis must match, and so forth. (There are
6 assertions in the above regex.)

However, the regex would make more sense if we could give the
individual pieces meaningful names. For instance, if we could
write the above regex like so:

    / <function_name> '(' [ <parameter> ',' ]* <parameter>? ')' /

you might have a better idea what those C<\w+> sequences were for.
Lucky for us, Perl 6 provides just such a mechanism.

The syntax for declaring a named regex is:

    regex identifier { \w+ }

Once declared, we can use this in another regex like so:

    / <identifier> /

and it is identical to

    / \w+ /

with an important and useful exception: the portion of the string that
matches can also be accessed as C<< $/<identifier> >>.
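Here is a small sketch of that capture in present-day Raku. The C<my>
in front of C<regex> is an addition of mine: a current compiler wants a
named regex declared outside a grammar to be lexically scoped. The input
string is made up for illustration.

    my regex identifier { \w+ }

    if "  hello world" ~~ / <identifier> / {
        say ~$/<identifier>;    # hello
        say ~$<identifier>;     # hello (shorthand for the line above)
    }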

Perl 6 predeclares several useful named regex (see L<S05> for a complete list):

    <alpha>     a single alphabetic character
    <digit>     a single numeric character
    <ident>     an "identifier"
    <sp>        a single space character
    <ws>        an arbitrary amount of whitespace
    <dot>       a period (same as '.')
    <lt>        a less-than character (same as '<')
    <gt>        a greater-than character (same as '>')
    <null>      matches nothing (useful in alternations that may be empty)
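As an illustration, the following sketch combines three of these names
to pick apart a made-up assignment string. It deliberately sticks to
C<< <ident> >>, C<< <ws> >> and C<< <digit> >>, which are the ones I am
confident a present-day Raku compiler still predeclares; treat the rest
of the table as the S05 design of the time.

    if "x1 = 42" ~~ / <ident> <ws> '=' <ws> <digit>+ / {
        say ~$<ident>;          # x1
        say $<digit>.join;      # 42
    }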

You may have noticed that a C<regex> declaration looks very similar to a
subroutine declaration. Indeed, regex are very much like subroutines.
They may even have parameters. There are two named regex that are used
to obtain zero-width look-ahead and look-behind. The parameter passed to
these named regex may be another regex:

    <before ...>    Zero-width look-ahead for ...
    <after ...>     Zero-width look-behind for ...

An example:

    / foo <before \d+> /    # only matches on "foo" followed
                            # immediately by some digits

Since these assertions are zero-width, the "pointer" that keeps track
of how much of the string has been consumed will point just after the
"foo" portion of the string on successful match so that the digits
can be processed by other means if necessary.
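The position of that "pointer" can be checked directly. This sketch uses
present-day Raku, where the zero-width form is written with a leading
C<?> as C<< <?before ...> >>; the test strings are made up.

    if "foo123" ~~ / foo <?before \d+> / {
        say ~$/;      # foo  (the digits are not part of the match)
        say $/.to;    # 3    (the match ends just after "foo")
    }

    # Without digits right after "foo" the assertion fails
    say so "foobar" ~~ / foo <?before \d+> /;    # False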

By declaring named regex like this you can build up a whole library of
regex that match some special-purpose language. In fact, Perl 6 lets
you group your regex under a common name by declaring that all of your
regex belong to the same "grammar".

    grammar Calc;

    regex expr {
        <term> '*' <expr> |
        <term> '/' <expr> |
        <term>
    }

    regex term {
        <factor> '+' <term> |
        <factor> '-' <term> |
        <factor>
    }

    regex factor { <digit>+ | '(' <expr> ')' }

The grammar declaration must appear at the beginning of the file and is
in effect until the end of the file. To explicitly declare the scope of
the grammar, enclose the regex in curly braces like so:

    grammar Calc {
        regex expr { ... }
        regex term { ... }
        regex factor { ... }
    }

To match strings that belong to this grammar, the named regex must be
fully qualified:

    "3+5*2" ~~ / <Calc.expr> /;

Perl 6 also has some shortcuts for specifying common and useful defaults
to the regex engine. If, instead of using the C<regex> keyword, you use
C<token>, Perl 6 will automatically turn on the C<:ratchet> modifier for
the duration of the regex. The idea is that once you've matched a
"token" you're not likely to want to backtrack into it.

Also, if you use C<rule> instead of C<regex>, Perl 6 will turn on both
the C<:ratchet> and C<:sigspace> modifiers.
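The effect of C<:ratchet> is easiest to see on a pattern that needs to
give characters back. In this sketch (present-day Raku; the names
C<backtracking> and C<ratcheting> are made up for illustration) the two
declarations have identical bodies and differ only in the keyword:

    my regex backtracking { \w+ c }
    my token ratcheting   { \w+ c }

    # \w+ first grabs "abc", then gives back the "c" so the literal can match
    say so "abc" ~~ / ^ <backtracking> $ /;    # True

    # \w+ grabs "abc" and refuses to give anything back, so the literal fails
    say so "abc" ~~ / ^ <ratcheting> $ /;      # False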

Here's the C<Calc> grammar above, rewritten to use these syntactic
shortcuts:

    grammar Calc;

    rule expr {
        <term> '*' <expr> |
        <term> '/' <expr> |
        <term>
    }

    rule term {
        <factor> '+' <term> |
        <factor> '-' <term> |
        <factor>
    }

    token factor { <digit>+ | '(' <expr> ')' }

There's not much difference, is there? But it makes a big difference in
what gets parsed. The original grammar did not have any provisions for
matching whitespace, so any whitespace in the string would cause the
pattern to fail. A string like "3 + 5 * 7" would not be matched by the
original grammar. Now, because whitespace in the pattern matches
whitespace in the string, that string will parse successfully.
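For readers who want something to run, here is a sketch of a similar
calculator grammar in present-day Raku. It is not the grammar above: I
have assumed the block form, added a C<TOP> rule so that the C<.parse>
method has an entry point, and replaced the right-recursive alternatives
with repetition so that C<*> and C</> bind more tightly than C<+> and
C<->. C<.parse> and C<TOP> come from current Raku, not from this
article.

    grammar Calc {
        rule  TOP    { <expr> }
        rule  expr   { <term>   [ [ '+' | '-' ] <term>   ]* }
        rule  term   { <factor> [ [ '*' | '/' ] <factor> ]* }
        token factor { <digit>+ | '(' <expr> ')' }
    }

    say so Calc.parse("3+5*7");        # True
    say so Calc.parse("3 + 5 * 7");    # True:  the rules tolerate whitespace
    say so Calc.parse("3 + * 7");      # False: an operand is missing

Because C<factor> is a C<token>, whitespace is not allowed between the
digits of a number, while the C<rule> declarations accept it around the
operators.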

=head2 Strings and Beyond

Throughout this article I've been talking about regex as they apply to very
ASCII-like strings; however, Perl 6 regex are not restricted to ASCII. Perl 6
regex can be applied to any string of Unicode characters and, in fact, are
written in Unicode by default.

Moreover, Perl 6 regex can be applied to things that are not strings but can be
made to look like strings. For instance, they can be applied to a filehandle
(which can represent itself as a stream of bytes/characters/whatever). Even
stranger, regex can be applied to an array of objects. See
L<S05:Matching against non-strings>.

=head2 Conclusion

Well, that's about it for this introduction to Perl 6 regex. I've run
out of steam. There are tons of features that I've left out since this
is just an introduction. A few that come to mind are:

=over 4

=item * match object polymorphism

Captured portions of the regex can be accessed as strings, but they
can also be accessed in other ways: as a match object, as an array (if
the subpattern is quantified), as a hash, etc.

=item * Perl code in regex

Curly braces in a regex allow for execution of arbitrary Perl code as
the regex is matched.

=item * quantifier enhancement

The universal quantifier can be used to match more than just some number
of times. If the thing on the RHS of the C<**> is a regex, then that is
taken as the pattern to match as the separator between items that match
the LHS. (For example, C<< <ident>**',' >> will match a series of
identifiers separated by comma characters.)

=back
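To give a taste of the last item, here is a sketch of the separator
idea in present-day Raku, where it is spelled with C<%> after an
ordinary quantifier rather than with C<**> as described above. The
input string is made up for illustration.

    if "alpha,beta,gamma" ~~ / <ident>+ % ',' / {
        # A quantified subpattern is available as an array of matches
        say $<ident>.elems;        # 3
        say $<ident>.join(" ");    # alpha beta gamma
    }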

Be sure to read the references given below for a more detailed
explanation of the features mentioned in this article.

=head2 References

If you want to read more about Perl 6 regex, see the official Perl 6
documentation at L<http://perlcabal.org/syn/S05.html>. There are also
some historical documents at
L<http://dev.perl.org/perl6/doc/design/apo/A05.html> and
L<http://dev.perl.org/perl6/doc/design/exe/E05.html> that may give you a
feel for things. If you're really interested in learning more but feel
you need to interact with people, try the mailing list at
perl6-language@perl.org or log on to a freenode IRC server and drop
by #perl6.

=head2 About the Author

Jonathan Scott Duff is an Information Technology Research Manager at the
Conrad Blucher Institute for Surveying and Science on the campus of
Texas A&M University-Corpus Christi. He has a beautiful wife and 4 lovely
children. When not working or spending time with his family, Scott tries
to keep up with Parrot and Perl 6 development. Sometimes he can be found
on IRC as PerlJam in one of the perl-related channels. But if you really
want to get in touch with him, the best way is via email: duff@pobox.com

Copyright 2007-2009 Jonathan Scott Duff