4fe1c6e5 »
2009-07-28 initial regex-intro document
=head1 Introduction to Perl 6 Regex

=head2 Context

Over the years programming languages have incorporated features for
regular expressions. Some, such as JavaScript, have added syntax
specifically to support regular expressions. Others, such as PHP, have
just reused their native string type and use special subroutines to
parse strings as regular expressions. But one thing almost all of them
have in common is that they have mimicked the extended regular
expression syntax of Perl.

Of course, Perl wasn't the first programming language to have support
for regular expressions. But it did make them popular. Perl has been so
successful as a text processing and glue language, and regular
expressions are so well interwoven into the language, that anyone who
uses Perl almost I<has> to learn regular expressions. Also, by applying
some of Perl's philosophy to regular expressions, common usages became
easy and complex usages became possible. Here are just a few features
that resulted: character class shortcuts, annotated regular expressions,
the ability to match Unicode properties, zero-width assertions,
independent subexpressions, and code execution inside of a regular
expression.

Unfortunately, as the regular expressioning public put more demand on
Perl's regular expression syntax, it accumulated some crufty items:
little inconsistencies that were kept to maintain backward compatibility
or were introduced because they were needed, but before they were fully
thought out. In designing Perl 6, Larry Wall not only looked at the
syntax and semantics of Perl proper, but he also took a hard look at the
sub-language that is regular expressions and refactored it into
something that makes better sense.

In this article I'm going to give an introduction to Perl 6 regex (we
call them "regex" to maintain the historical association with regular
expressions, though they've strayed quite far from the mathematical
sense of regular languages). I'll point out differences from
Perl 5 syntax, but no knowledge of Perl 5's regular expression syntax
should be necessary to understand this document. If you're a Perl 5
geek, you may be bored for a while, but read anyway so that you can
pick up the syntactic and semantic differences.

=head2 Literals

First, let's get some small syntactic things out of the way. In Perl
6, as in other implementations, regex are typically delimited by slashes
(aka, leaning toothpicks), so a typical regex to match the string "abc"
would look like this:

    /abc/

A regex is also sometimes called a "pattern" because we're looking for
a portion of a string that looks like the strings that the regex
describes. Regex are also sometimes called "rules" because they
describe the conditions under which the string may match. But what
string are we looking in? As in Perl 5, Perl 6 applies the regex to a
variable called C<$_> if we haven't explicitly specified a variable to
match against. For now I'm going to continue assuming that our string
is in C<$_> in my examples. Later I'll show you how to specify a
different string to match against.

So, the above regex tries to find the pattern "abc" in the string C<$_>.
If the pattern appears in the string, the regex returns a true value to
indicate that it matched successfully; otherwise it returns a false
value. It's important to note that the pattern may appear anywhere in
the string. So, for instance, if C<$_> contained "fooabcbar", the
above pattern will successfully match and the regex will return a true
value. Here are some more strings that would successfully match against
the regex:

    abcgoobltygook
    now I know my abcs
    abc
    babcock

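Here is a small sketch of that behaviour, assuming a current Rakudo
(Perl 6) compiler. The loop puts each string into C<$_> in turn and
C<m/abc/> then matches against it; the last string and the wording of
the report are made up for illustration.

```raku
# Each string is placed in $_ by the loop; m/abc/ matches against $_.
for 'abcgoobltygook', 'now I know my abcs', 'abc', 'babcock', 'a b c' {
    my $verdict = m/abc/ ?? 'matches' !! 'does not match';
    say $_, ': ', $verdict;
}
```

Only the last string fails, because its letters are separated by spaces
and so "abc" never appears as a contiguous run.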
=head2 Meta-syntax

Now, if all you could do is match literal strings, regexes wouldn't be
so useful, would they? Some characters, rather than being taken
literally, are so-called metacharacters that have special meaning in a
regex. In Perl 6, any non-alphanumeric character is considered a
metacharacter by default. That is, alphabetic and numeric characters
match themselves and any other character may not match itself because
it may have special meaning. (For the purposes of metasyntax, the
underscore is considered alphanumeric.)

Currently not all metacharacters actually I<have> a special meaning, but
many do, and in order to keep things simple, Perl 6 chooses to designate
all non-alphanumeric characters as metasyntactic. However, there's an
"escape mechanism" that lets you treat metacharacters as themselves
(literally), and alphanumeric characters as metasyntactic. By prefixing
an alphanumeric character with a backslash (B<\>) it becomes a
metacharacter and is special. By the same token, prefixing a
non-alphanumeric character with a backslash removes its metasyntactic
nature and it becomes literal.

So, for example, a common metasyntactic character found in regular
expressions is a period (C<.>, sometimes just called a "dot") and it
matches B<any> character. Thus,

    /f..d/

will match any four-character sequence that starts with an "f" and ends
with a "d". All of the following strings match this pattern:

    my name is fred              # matched "fred"
    I need food, now!            # matched "food"
    those guys are turf idols    # matched "f id"
    shift down a gear            # matched "ft d"

If you want to actually match a period you can escape it like so:

    /foo\./         # matches "foo."

Similarly, the letter "t" in a regular expression matches itself (i.e.,
an occurrence of the letter "t"). But, with a backslash immediately
preceding the "t", it takes on its metasyntactic meaning of a tab
character. For example:

    /tall/          # matches the string "tall"
    /\tall/         # matches a tab character, followed by "all"

Another way to match character sequences that are to be taken
literally is to enclose them in quotes:

    /'foo.'/        # matches "foo."
    /"foo."/        # same

In these cases, the quotes are metasyntactic delimiters that mean
"match the characters in between literally".

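The escaping rules above can be checked with a few one-liners. This is
only a sketch, assuming a current Rakudo; it uses the smart match
operator C<~~> (covered later in this article) to name the string being
matched, and C<so> to turn each result into a plain true or false.

```raku
# The dot matches any single character, so "f..d" finds "food" here.
say ~('I need food, now!' ~~ / f..d /);

say so 'foo.' ~~ / foo\. /;      # the escaped dot only matches a real "."
say so 'foox' ~~ / foo\. /;      # so this one fails
say so 'foox' ~~ / 'foo.' /;     # quoted text is literal too: fails
say so "a\tall" ~~ / \tall /;    # \t is a tab, followed by "all"
say so 'tall'   ~~ / \tall /;    # no tab in the string: fails
```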
=begin sidebar

In Perl 5 (and most other regular expression variants), a dot matches
any character except for the newline sequence. In Perl 6, this odd
restriction is lifted and the dot matches B<any> character, including
the newline sequence. Perl 6 has other mechanisms to accomplish Perl
5's behavior. See L</Character Classes> below.

=end sidebar

The most important metasyntactic characters in regex are whitespace
(typically space characters, but sometimes tabs and other "invisible"
characters). Because regular expressions tend towards high character
density, they can often be difficult to read. In Perl 6 regex, you may
use whitespace to separate parts of your regex to make it easier to
read. The whitespace is ignored by the regex engine. To match literal
space characters, just as with other metasyntax, you may precede the
space with a backslash or enclose the space in quotes.

From here on, examples are shown in this spaced form. The spaced form
is preferable when writing regex because it makes them easier to read
later.

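A quick sketch of both points (insignificant whitespace and the dot
matching a newline), assuming a current Rakudo. The strings are
invented for the example, and C<~~> is the smart match operator
described later in this article.

```raku
# Whitespace inside the regex is ignored...
say so 'foo42'  ~~ / foo \d+ /;         # True
# ...so it does not match a space in the string.
say so 'foo 42' ~~ / foo \d+ /;         # False
# A quoted or backslashed space is matched literally.
say so 'foo 42' ~~ / foo ' ' \d+ /;     # True
say so 'foo 42' ~~ / foo \  \d+ /;      # True

# The dot matches a newline; \N (anything but a newline) does not.
say so "a\nb" ~~ / a . b /;             # True
say so "a\nb" ~~ / a \N b /;            # False
```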
=head2 Quantifiers

Three other important metacharacters are the so-called quantifiers.
These characters allow you to specify repetition, that is, that a
character or group of characters may be matched multiple times.

    quantifier   matches
    *            the preceding thing zero or more times
    +            the preceding thing one or more times
    ?            the preceding thing zero or one time

Examples:

    / fo* /     # will match "f", "fo", "foo", "fooo", etc.
    / fo+ /     # will match "fo", "foo", "fooo", "foooo", etc.
    / fo? /     # will only match "f" or "fo"

Note that in the above examples the quantifier is only applied to the
preceding character. If you need to match a group of characters
repeatedly, you have to use one of the several grouping mechanisms in
Perl 6 regex (see L</Grouping> below).

There is another character sequence, sometimes called the universal
quantifier, that allows you to prescribe a specific number of times a
particular pattern may match. To use the universal quantifier, you use
two C<*> characters followed by a number or a range like so:

    / fo ** 3 /        # matches only "fooo"
    / fo ** 3..5 /     # matches one of "fooo", "foooo", or "fooooo"

You may also specify a closure after the C<**>, but explaining that is
beyond the scope of this introduction. See the references given at
the end of this article for more reading.

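To see the quantifiers at work, the sketch below (assuming a current
Rakudo) prints the text each regex actually matched. The test string
"fooooo" is an "f" followed by five "o"s; C<~~> is the smart match
operator described later, and the prefix C<~> turns a match into the
matched string.

```raku
say ~('fooooo' ~~ / fo* /);          # fooooo
say ~('f'      ~~ / fo* /);          # f (zero "o"s is fine)
say so 'f'     ~~ / fo+ /;           # False (at least one "o" is needed)
say ~('fooooo' ~~ / fo? /);          # fo
say ~('fooooo' ~~ / fo ** 3 /);      # fooo
say ~('fooooo' ~~ / fo ** 3..5 /);   # fooooo (as many as the range allows)
say so 'foo'   ~~ / fo ** 3 /;       # False (only two "o"s)
```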
=begin sidebar

=head3 A note on greed

As in Perl 5, the quantifiers are greedy. That is, they try to
consume as much of the target string as possible. So the pattern

    / .* abc /

when applied to the string "abcdefghijklmnopqrstuvwxyz" will first try
to match the entire string (because C<.> matches any character and C<*>
tries to match as much as it can). When the regex engine gets to the end
of the string and cannot match "abc" it will back up one character and
try to match "abc" again. This process will repeat until the regex
engine either matches or runs out of characters to process.

Also as in Perl 5, the greediness of quantifiers can be "turned off" by
adding a question mark (C<?>) immediately after the quantifier.
So for instance,

    / . *? abc /         # first tries to match nothing, then "abc"
    / . +? abc /         # first tries matching one character, then "abc"
    / . **?3..5 abc/     # first tries matching three characters, then "abc"

The default behavior (greediness) can be thought of as the regex engine
first matching the most characters that the quantifier will allow and
then backing up one character at a time if the rest of the pattern
doesn't match (to try again). With greediness turned off, the regex
engine matches as little as the quantifier allows and moves forward one
character at a time to match the rest of the pattern.

Note that a regex is processed from left to right and the greediness (or
non-greediness) behavior B<only> applies to the part of the regex
governed by a particular quantifier. Making quantifiers match
minimally does not cause the pattern as a whole to match minimally.

=end sidebar

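The difference between greedy and non-greedy matching is easiest to see
by printing what was matched. A minimal sketch, assuming a current
Rakudo; the test strings are invented for the example.

```raku
my $s = 'xxabcyyabczz';

say ~($s ~~ / .*  abc /);     # xxabcyyabc  (greedy: up to the last "abc")
say ~($s ~~ / .*? abc /);     # xxabc       (non-greedy: up to the first "abc")
say ~($s ~~ / .+? abc /);     # xxabc

# Non-greedy does not mean "shortest possible match overall": the match
# still starts at the leftmost position where it can succeed.
say ~('aabc' ~~ / a+? bc /);  # aabc, not abc
```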
=head2 Character Classes

We've seen how to match a specific character at a given location by
putting that character in the regex, and we've seen how to match any
character at a specific location by putting a dot in the regex, but
sometimes you want to match a specific I<set> of characters at a given
position. The mechanism to do this in regex is called a "character
class". Character classes are designated by C<< <[]> >> with the specific
characters listed inside the brackets. For instance,

    / foo <[dlt]> /     # matches "food", "fool" or "foot"

The sequence C<< <[dlt]> >> represents any one of the characters "d", "l",
or "t". You can also specify a set of contiguous characters to match
using a range like so:

    / <[ a..d ]> /      # matches one of "a", "b", "c", or "d"

You can also mix ranges and specific characters:

    /<[ a..d xyz ]>/    # matches one of "a","b","c","d","x","y", or "z"
    /<[ xyz a..d ]>/    # same
    /<[ x..z a..d ]>/   # same

Some character classes are so useful that they have their own
designated short-cuts. All of the character class short-cuts make use
of alphabetic characters that have been given a metasyntactic meaning
by prefixing the character with a backslash. Here's a table of the
short-cuts:

    short-cut   matches
    \w          word characters (alphabetics, numerics, and underscore)
    \W          non-word characters
    \d          digits
    \D          non-digits
    \s          whitespace characters
    \S          non-whitespace characters
    \t          tab character
    \T          anything but a tab character
    \n          newline sequence
    \N          anything but a newline sequence
    \r          carriage return character
    \R          anything but carriage return character
    \f          form feed character
    \F          anything but form feed character
    \h          horizontal whitespace
    \H          anything but horizontal whitespace
    \v          vertical whitespace
    \V          anything but vertical whitespace

You may notice some regularity in this table. For every character class
short-cut of the form C<< \<lower case letter> >> the anti-class is
always C<< \<corresponding upper case letter> >>. (The old non-newline
meaning of C<.> maps neatly to the new C<\N> sequence.)

Character classes bear a remarkable resemblance to sets. In fact, you can
"add" and "subtract" character classes much like you would sets:

    /<[a..z] - [aeiou]>/        # Only match a lowercase consonant
    /<[asdfg] + [hjkl;]> +/     # Only match a sequence of characters that
                                # can be made from home row keys.

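A few of these classes in action, as a sketch assuming a current
Rakudo. The sample strings are invented, and the home-row class here
leaves out the semicolon to keep the example simple.

```raku
say so 'foot' ~~ / foo <[dlt]> /;                 # True
say so 'foop' ~~ / foo <[dlt]> /;                 # False

say ~('x=42;' ~~ / \d+ /);                        # 42

# Set subtraction: lowercase letters minus the vowels.
say ~('rhythmic' ~~ / <[a..z] - [aeiou]>+ /);     # rhythm

# Set addition: left-hand plus right-hand home row letters.
say ~('flask!' ~~ / <[asdfg] + [hjkl]>+ /);       # flask
```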
=head2 Grouping

There are several ways to group a portion of a regex. We saw one such
way earlier in our discussion of literals: surround the characters with
quotes. Quoting does two things: it forces all of the characters between
the quotes to be treated literally, and it groups the string of
characters together into a quantifiable unit.

    / 'foo.' * /    # will match "foo.", "foo.foo.", "foo.foo.foo.", etc.

Another way to create a quantifiable unit is to use square brackets
(C<[]>). Square brackets delimit a portion of the regex that may be
treated as a whole. The text in between the brackets is just another
regex.

    / f[ oo ]* /        # will match "f", "foo", "foooo", "foooooo", etc.
    / a[ bc* d ]? /     # will match "a", "abd", "abcd", "abccd", "abcccd", etc.

Yet another way to group a portion of a regex to be treated as a unit is
to use round brackets (C<()>). These are identical to square brackets
as far as grouping goes, but additionally, the portion of the string
that is matched by the regex inside the round brackets is also saved
somewhere and may be accessed later in a variety of ways. Round brackets
are said to be "capturing brackets" because of this property. The
following table shows some examples of what would be captured if the
given regex matched certain portions of a string:

    regex           matched     captured

    /f(oo)*/        "f"         ""          # the empty string
                    "foo"       "oo"
                    "foooo"     "oooo"

    /a(bc*d)?/      "a"         ""          # the empty string
                    "abd"       "bd"
                    "abcd"      "bcd"
                    "abccd"     "bccd"

Both round and square brackets delimit a portion of the regex. This
portion of the regex is called a "subpattern". The portion of the string
that matches a subpattern can be referenced and accessed individually.
We'll talk more about capturing and where the captured portion of the
string is stored later.

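The three grouping forms can be compared side by side. A minimal
sketch, assuming a current Rakudo; C<$0> (explained later under match
objects) holds what the first pair of round brackets captured.

```raku
# Quotes: a literal, quantifiable unit.
say ~('foo.foo.foo' ~~ / 'foo.'+ /);     # foo.foo.

# Square brackets: group without capturing.
say ~('foooo' ~~ / f [ oo ]* /);         # foooo
say ~('fooo'  ~~ / f [ oo ]* /);         # foo (the third "o" has no partner)

# Round brackets: group and capture.
if 'abccd' ~~ / a (bc*d)? / {
    say ~$/;                             # abccd
    say ~$0;                             # bccd
}
```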
=head2 Alternation and Conjunction

There are a couple of other useful concepts in regex called
"alternation" and "conjunction". Alternation is the idea that at a
given location in a string, there are alternatives to what may be
matched. Conjunction is the idea that at a given location in a
string, there are multiple portions of a regex that must match exactly
the same section of the string.

Alternation is designated in regex by either a single vertical bar
(C<|>) or a double vertical bar (C<||>). While each allows you to
specify alternatives, how they process those alternatives is different.

A single vertical bar does "longest token" matching on the
alternations with no inherent order as to which alternative is tried
first. So, for instance, if we were matching the string
"football", the following regex

    / f | fo | foo /

would match "foo" since that's the longest matching portion of the
regex in the alternation. But the regular expression engine may have
tried them all before it discovered "foo", or perhaps "foo" was the
first and only alternative tried. It is completely up to the
regex engine implementation as to how the alternatives are tried.

Had the regex been

    / f | fo | fo.*l | foo /

then the third item in the alternation would be matched since
C<fo.*l> will match the entire string. Again, the order in which the
alternatives are tried is unspecified.

A double vertical bar (C<||>) will match each alternative in a
left-to-right manner. That is, the regex

    / f || fo || foo /

will first try to match "f", and then (if it failed to match "f") try
to match "fo", and finally it will try to match "foo". So, were the
above regex applied to the string "football" as before, the first
alternative ("f") would match. This behavior is exactly the same as
traditional implementations of alternation in other backtracking
regular expression engines.

Which alternative matches and the order in which the alternatives are
tried becomes particularly important when each alternative has side
effects (such as setting a variable or calling a subroutine). We'll
talk more about that later.

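The three regexes above can be run against "football" directly. A
sketch, assuming a current Rakudo; C<~~> is the smart match operator
described later and the prefix C<~> shows the matched text.

```raku
say ~('football' ~~ / f | fo | foo /);            # foo (longest alternative)
say ~('football' ~~ / f | fo | fo.*l | foo /);    # football
say ~('football' ~~ / f || fo || foo /);          # f (first alternative that matches)
```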
Similar to alternations, conjunctions in regex are designated by either
a single ampersand (C<&>) or a double ampersand (C<&&>). In both forms,
all of the conjuncted terms must match the exact same portion of the
string they are being matched against. But, as with alternation, the
single-ampersand version matches the subpatterns in some unspecified
order while the double-ampersand version will try each conjuncted
portion of the regex in a left-to-right manner.

For example, if the following regex were applied to the string
"blah",

    / <[a..z]>+ & [ ... ] /     # matches "bla"

it would match the string "bla" because the subpattern on the right of the
ampersand matches exactly 3 characters and the subpattern on the left
matches any sequence of lower case letters. By comparison, had the regex
been (still applied to the string "blah"):

    / <[a..z]>+ && [ ... ] /

it would B<still> match "bla", but how it arrives at that match is
slightly different. First, the left hand side of the C<&&> would match
as much as it possibly can (since C<+> is greedy), then the right hand
side would match its 3 characters. Since the two sides then do not
match the exact same portion of the string, the regex engine is said
to "backtrack" and try the left hand side with one fewer character
before trying to match the right hand side again. This continues
until both sides match the exact same portion of the string or until
it can be determined that no such match is possible.

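Both conjunction forms can be tried on "blah". This is a sketch that
assumes a current Rakudo implements C<&> and C<&&> as described above;
it prints the matched text and its length rather than relying on any
particular evaluation order.

```raku
# Both sides must match the same stretch of the string, and the
# right-hand side always matches exactly three characters.
for / <[a..z]>+ & [ ... ] /, / <[a..z]>+ && [ ... ] / -> $regex {
    my $m = 'blah' ~~ $regex;
    say $m ?? "matched '$m' ({ $m.chars } characters)" !! 'no match';
}
```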
=head2 Backtracking

Backtracking has the potential to happen whenever there is a decision
point in a regex. An alternation creates a decision point with several
alternatives; a quantifier creates a decision point where a given
portion of a regex may match one more or one fewer times. Backtracking
is caused by the regex engine's attempt to satisfy an overall match. For
instance, when matching the following regex against the string "footbag"

    / [ foot || base || hand ] ball /

the regex engine will match "foot" and then attempt to match "ball".
When it realizes that "ball" will not match, it backtracks to the point
where it matched "foot" and tries to match the next alternative at that
point (in this case, "base"). When that fails to match, the regex engine
will try to match the next alternative, and so forth until either one of
the alternatives matches or they are all exhausted.

Now, as a human walking through the same sequence of steps that the
regex goes through, I can tell right away that since C<foot> was
matched, neither C<base> nor C<hand> will match. But the regex
engine may not know this. (For this simple example, the regex engine can
probably figure out that the alternatives are mutually exclusive. If
that bothers you, imagine that they are instead complicated expressions
rather than simple strings.) As the match for "ball" repeatedly fails,
the regex engine repeatedly backtracks and tries again.

However, Perl 6 provides a way to tell the regex engine when to give
up. A colon in the regex causes Perl to not retry the preceding
"atom". In regex parlance, an atom is anything that is matched as a
unit; a group is an atom, a single character may be an atom, etc.
So, in the above example, if I wanted to tell the regex engine to not
try the other alternatives once one of them matched (because I, as the
person writing the regex, know that they are all mutually exclusive),
I simply follow the group with a colon like so:

    / [ foot || base || hand ] : ball /

As soon as one of the alternatives matches, the regex engine will move
past the colon and try to match C<ball>. When it fails to match C<ball>,
ordinarily it would backtrack to try other possibilities. But now the
colon acts as a stopping point that says, "don't bother backtracking
because nothing else will match" and so the regex engine will fail that
portion of the regex immediately rather than trying all of the other
alternatives. It's important to note that the colon does not necessarily
cause the entire regex to fail once it is backtracked over, but only the
atom to which it is applied. If not matching that atom means that the
entire regex won't match, then, of course, the entire regex will fail.

Perl 6 has other forms of backtracking control. A double colon will
cause its enclosing group to fail if backtracked over. A triple colon
will cause the entire regex to fail. For more information on these
see L<S05/Backtracking control>.

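In the "footbag" example the colon only saves work, so the result is
the same with or without it. The effect becomes visible when the
colon follows a quantifier, as in this sketch (assuming a current
Rakudo; the "aaa" strings are invented for the example).

```raku
# Without the colon, a+ can give back one "a" so the final "a" matches.
say so 'aaa' ~~ / a+ a /;       # True
# With the colon, a+ keeps everything it took and the match fails.
say so 'aaa' ~~ / a+: a /;      # False

# On a group, the colon stops the other alternatives being retried.
say so 'football' ~~ / [ foot || base || hand ]: ball /;   # True
say so 'footbag'  ~~ / [ foot || base || hand ]: ball /;   # False
```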
=head2 Zero-Width Assertions

As the regular expression engine processes a string to see if it matches
a given pattern, it keeps a marker to denote how much of the string it
has processed so far. (When this marker moves backwards, the regex
engine is backtracking.) As the marker moves along the string, it is said
to "consume" the string. Sometimes you want to match without consuming
any of the input string or match between characters (say the transition
from an alphabetic character to a numeric character or only at the
beginning of the string). This idea of matching without consuming is
called a zero-width assertion.

Perl 6 provides some metacharacters that denote handy zero-width
assertions.

    ^       only matches at the beginning of the string
    $       only matches at the end of the string
    ^^      matches at the beginning of any line within the string
    $$      matches at the end of any line within the string
    <<      a word boundary, but only matches the transition from
            non-word character to word character (i.e., the left-hand
            side of a word)
    >>      a word boundary, but only matches the transition from
            word character to non-word character (i.e., the right-hand
            side of a word)

Here are some example patterns and the portion(s) of the string that
would match if each pattern were applied repeatedly to the entire string
"the quick\nbrown fox\njumped over\nthe lazy\ndog" (where "\n" denotes a
newline sequence within the string, i.e., it terminates a line):

    Pattern                 Matches

    / ^the \s+ \w+ /        "the quick"
    / \w+ $ /               "dog"
    / ^^ \w+ /              "the", "brown", "jumped", "the", and "dog"
    / \w+ $$ /              "quick", "fox", "over", "lazy", and "dog"
    / << o\w+ /             "over"
    / o \w+ >> /            "own", "ox", "over", and "og"

In order for the patterns that would match multiple portions of the
string to actually match those substrings, there needs to be some way to
tell the regex engine to continue matching from where the last match
left off. See L</modifiers> below.

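The table can be reproduced with the C<:g> modifier (listed under
modifiers below), which collects every match in the string. This is a
sketch assuming a current Rakudo; the helper C<all-matches> is a
made-up name used only to keep the example short.

```raku
my $text = "the quick\nbrown fox\njumped over\nthe lazy\ndog";

# Hypothetical helper: join the text of every match in a list.
sub all-matches(@matches) { @matches.map(*.Str).join(', ') }

say ~($text ~~ / ^ the \s+ \w+ /);             # the quick
say ~($text ~~ / \w+ $ /);                     # dog
say all-matches($text ~~ m:g/ ^^ \w+ /);       # the, brown, jumped, the, dog
say all-matches($text ~~ m:g/ \w+ $$ /);       # quick, fox, over, lazy, dog
say all-matches($text ~~ m:g/ << o \w+ /);     # over
say all-matches($text ~~ m:g/ o \w+ >> /);     # own, ox, over, og
```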
=head2 Interacting with Perl 6

=head3 match objects and capturing

When a regex successfully matches a portion of a string it returns a
Match object. In a boolean context the Match object evaluates to true.
When using capturing brackets (parentheses), the part of the string that
is captured is placed in the Match object and can be accessed in a
number of ways.

The Match object itself is called C<$/>. The substring captured by the
first set of parentheses is placed in C<$/[0]>, the substring captured
by the second set of parentheses is placed in C<$/[1]> and so forth.
There is a short-hand way to access these elements of the Match object
that will be familiar (yet slightly different) to people who have used
regular expression engines similar to Perl 5. The short-hand forms of
C<$/[0]>, C<$/[1]>, C<$/[2]>, etc. are the special variables C<$0>,
C<$1>, C<$2>, etc. The big difference from other regular expression
engines is that the numbering starts with 0 rather than 1. Starting
from 0 was chosen to mimic the array indices of the match object.

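For instance, a date can be pulled apart like this. A minimal sketch,
assuming a current Rakudo; the date string is just sample data, and the
hyphens are quoted because they are non-alphanumeric.

```raku
if '2009-07-28' ~~ / (\d+) '-' (\d+) '-' (\d+) / {
    say ~$/;        # 2009-07-28  (the whole match)
    say ~$/[0];     # 2009        (first capture, long form)
    say ~$0;        # 2009        (same thing, short form)
    say ~$1;        # 07
    say ~$2;        # 28
}
```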
=head3 matching Perl variables

Unlike Perl 5, a variable placed inside a regex does not automatically
interpolate the value of the variable. What happens with the variable
depends on context (this B<is> Perl after all :-). An "unadorned"
variable will interpolate as a literal string to match if it's a scalar,
or as an alternation of literal strings to match if it's an array or
hash (not strictly true, but true enough for now). So, given the
following declarations:

    my $foo = "ab*c";
    my @bar = < one two three >;

The regex:

    / $foo @bar /

matches exactly as if you had written

    / 'ab*c' [ one | two | three ] /

Sometimes a variable inside of a regex is actually used to effect a
named capture of a specific portion of the string instead of (or even
in addition to) storing the captured portion in C<$0>, C<$1>, C<$2>, etc.
For instance:

    / $<foo>=[ <[A..Z]+[0..9]>**4 ] /

If the group matches, the result is placed in C<< $/<foo> >>. As with
numeric captures, there is a short-hand syntax for accessing a named
portion of a Match object: C<< $<foo> >>.

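Both behaviours can be checked quickly. This sketch assumes a current
Rakudo; the test strings are invented for the example.

```raku
my $foo = "ab*c";
my @bar = < one two three >;

say so 'ab*ctwo'  ~~ / $foo @bar /;     # True
say so 'abbbctwo' ~~ / $foo @bar /;     # False: the "*" in $foo is literal
say so 'ab*cfour' ~~ / $foo @bar /;     # False: "four" is not in @bar

# A named capture: four characters, each an upper case letter or a digit.
if 'id: AB12' ~~ / $<foo>=[ <[A..Z] + [0..9]> ** 4 ] / {
    say ~$/<foo>;                       # AB12
    say ~$<foo>;                        # AB12 (short-hand)
}
```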
=head3 matching other variables

Until now we've talked primarily about the pattern matching syntax
itself, but how do we apply these rules to a string other than C<$_>?
We use the smart match operator C<~~>. This operator is called "smart
match" because it does quite a bit more than just apply regular
expressions to strings, but for now we're just going to focus on that
one aspect of the smart match operator. So to match a regular
expression against the string contained in a variable called C<$foo>,
we'd do this:

    $foo ~~ / <regex here> /;

There's a more general syntax that allows you to choose
different delimiters if your regular expression happens to match the
C</> character and you don't feel like writing C<\/> so much:

    $foo ~~ m/ <regex here> /;
    $foo ~~ m! <regex here> !;      # different delimiter

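As an illustration, here is a path matched with and without the
default delimiters. A sketch assuming a current Rakudo; the variable
name C<$path> and its value are made up, and curly braces stand in as
the alternative delimiter.

```raku
my $path = '/usr/bin/perl6';

# With slash delimiters, every literal slash has to be escaped.
say so $path ~~ / \/ bin \/ /;      # True

# With other delimiters the regex is easier on the eyes.
say so $path ~~ m{ '/bin/' };       # True
say so $path ~~ m{ '/sbin/' };      # False
```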
587 =head3 modifiers
588
589 The more general syntax also gives us a convienent place to put
590 modifiers that will affect the regular expression as a whole. For
591 instance, there is an C<ignorecase> modifier that causes the RE engine
592 to be agnostic towards case distinctions in alphabetic characters.
593 There's also a short-hand for this modifier for those times when
594 C<ignorecase> is too much to type out.
595
c6502a07 » moritz
2009-07-31 [regex-intro] use the spaced form by default
596 $foo ~~ m :ignorecase/ foo /; # will match "foo", "FOO", "fOo", etc.
597 $foo ~~ m :i/ foo /; # same
4fe1c6e5 »
2009-07-28 initial regex-intro document
598
599 Perl 6 predefines several of these modifiers (for a complete list,
600 see L<S05>):
601
602 modifier short-hand meaning
603 :ignorecase :i Ignore case distinctions
604 :basechar :b Ignore accents and other marks
605 :sigspace :s whitespace in pattern matches
606 whitespace in string
607 :global :g matches as many times as possible
608 :continue :c Continue matching from where
609 previous match left off
610 :pos :p Just like :c but pattern is
611 anchored where the previous match left off
612 :ratchet Don't do any backtracking
c6502a07 » moritz
2009-07-31 [regex-intro] use the spaced form by default
613 :bytes dot matches bytes
4fe1c6e5 »
2009-07-28 initial regex-intro document
614 :codes dot matches codepoints
615 :graphs dot matches language-independent graphemes
616 :chars dot matches "characters" at current
617 Unicode level
618 :Perl5 :P5 use perl 5 regex syntax
619
620 There are two other modifiers for matching a pattern some number of
621 times or only matching, say, the third time we see a pattern in a
622 string. These modifiers are a little strange in that their short-hand
623 forms consist of a number followed by some text:
624
625 modifier short-hand meaning
626 :x() :1x,:4x,:12x match some number of times
627 :nth() :1st,:2nd,:3rd,:4th match only the Nth occurance
628
629 Here are some examples to illustrate these modifiers:
630
631 $_ = "foo bar baz blat";
632 m :3x/ a / # matches the first three "a" characters (in "bar", "baz", "blat")
633 m :nth(3)/ \w+ / # matches "baz"
634
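As a minimal sketch of the counting modifiers, the snippet below uses the method form C<.match> with the C<:x>, C<:nth> and C<:g> adverbs. It assumes a current Rakudo compiler (Perl 6 is now called Raku), and the variable names are purely illustrative.

```raku
my $text = "foo bar baz blat";

# :x(3) asks for exactly three matches: the "a" in "bar", "baz" and "blat".
my @three = $text.match(/ a /, :x(3));
say @three.elems;                        # 3

# :nth(3) keeps only the third match of the pattern.
my $third = $text.match(/ \w+ /, :nth(3));
say ~$third;                             # baz

# :g collects every match in the string.
my @words = $text.match(/ \w+ /, :g);
say @words.map(*.Str).join("|");         # foo|bar|baz|blat

# :i ignores case distinctions.
say so "FOO" ~~ m:i/ foo /;              # True
```

The method form is used here only because it makes the adverbs explicit. The C<m :3x/ a /> spelling shown above expresses the same request.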
635 Some of these modifiers may also be placed inside the regular expression,
636 where their effect lasts until the end of the innermost enclosing
637 bracketing construct or the end of the pattern.
638
639 / a [ :i foo ] z/ # matches "afooz", "aFOOz", "aFooz", "afOoz", etc.
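A short sketch of that scoping rule, assuming a current Rakudo compiler. The C<^> and C<$> anchors and the variable name C<$pattern> are additions for the demonstration, so that the whole string has to match:

```raku
# :i applies only inside the brackets, so "a" and "z" stay case-sensitive.
my $pattern = / ^ a [ :i foo ] z $ /;

say so "aFOOz" ~~ $pattern;   # True
say so "AfooZ" ~~ $pattern;   # False
```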
640
641 The :sigspace modifier is quite useful. If you're unsure of the amount
642 of whitespace between tokens or can't guarantee a certain number of
643 spaces, you may be inclined to use C<\s*> or C<\s+> often in your regex.
644 However, it can get tedious typing C<\s+> so often and it tends to
645 visually detract from the parts you're really interested in matching.
646 Thus Perl 6 provides the :sigspace modifier so that whitespace
647 in your pattern matches whitespace in your string. This is I<so>
648 useful that Perl 6 provides a nice short-cut for it, C<mm>.
649
650 /\s*One\s+small\s+step\s*/ # yuck
651 m:sigspace/One small step/ # much better
652 mm/One small step/ # even better!
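The comparison can be tried out with a sketch like the following. It assumes a current Rakudo compiler and uses the C<m:s> spelling of the modifier; the variable C<$quote> is hypothetical.

```raku
my $quote = "One   small\tstep";

# The explicit form spells out every run of whitespace.
say so $quote ~~ / One \s+ small \s+ step /;       # True

# With :sigspace (short form :s), whitespace in the pattern
# matches whitespace in the string.
say so $quote ~~ m:s/ One small step /;            # True

# Significant whitespace between two words needs at least one
# whitespace character in the string.
say so "Onesmallstep" ~~ m:s/ One small step /;    # False
```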
653
654 =head2 Named Assertions
655
656 Earlier we talked about "zero-width assertions" that allow us to match
657 "in between" characters. But we've also talked about other kinds of
658 assertions, only we didn't call them that. Whenever you write a regex to
659 match I<anything>, you're asserting something about what a successful
660 match should look like. So, for instance, in the following regex,
661
662 / \w+ '(' [ \w+ ',' ]* [ \w+ ]? ')' /
663
664 we are asserting that, for a string to match, it must contain a sequence
665 of word characters followed by a C<(>, then zero or more word character
666 sequences each followed by a comma, then an optional final word character
667 sequence, and finally a C<)>. Each of the "tokens" is an assertion about
668 what must match for the entire regex to match. So, C<\w> is an assertion
669 that a word character must match, C<'('> is an assertion that an open
670 parenthesis must match, and so forth. (There are 6 assertions in the above regex.)
671
672 However, the regex would make more sense if we could give the
673 individual pieces meaningful names. For instance, if we could
674 write the above regex like so:
675
676 / <function_name> '(' [ <parameter> ',' ]* <parameter>? ')' /
677
678 you might have a better idea what those C<\w+> sequences were for.
679 Lucky for us, Perl 6 provides just such a mechanism.
680
681 The syntax for declaring a named regex is:
682
683 regex identifier { \w+ }
684
685 Once declared, we can use this in another regex like so:
686
687 / <identifier> /
688
689 and it is identical to
690
691 / \w+ /
692
693 with an important and useful exception: the portion of the string that
694 matches can also be accessed as C< $/<identifier> >.
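Here is a minimal sketch of declaring a named regex and reading its capture, assuming a current Rakudo compiler. The C<my> declarator (which makes the regex lexically scoped) and the sample string are assumptions made for the example:

```raku
my regex identifier { \w+ }

my $m = "  total = 42" ~~ / <identifier> /;

say ~$m<identifier>;    # total
say ~$/<identifier>;    # total, read through the match variable $/
```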
695
696 Perl 6 predeclares several useful named regex (See L<S05> for a complete list):
697
698 <alpha> a single alphabetic character
699 <digit> a single numeric character
700 <ident> an "identifier"
701 <sp> a single space character
702 <ws> an arbitrary amount of whitespace
703 <dot> a period (same as '.')
704 <lt> a less-than character (same as '<')
705 <gt> a greater-than character (same as '>')
706 <null> matches nothing (useful in alternations that may be empty)
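A few of these can be combined in a sketch like the one below. It assumes a current Rakudo compiler and deliberately uses only C<< <alpha> >>, C<< <digit> >>, C<< <ident> >> and C<< <ws> >>; whether the other names in the list are available depends on the implementation you run.

```raku
my $m = "width = 640" ~~ / <ident> <ws> '=' <ws> <digit>+ /;

say ~$m<ident>;                    # width
say $m<digit>.join;                # 640 (one capture per digit, joined)
say so "x" ~~ / ^ <alpha> $ /;     # True
```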
707
708 You may have noticed that a C<regex> declaration looks very similar to a
709 subroutine declaration. Indeed, regex are very much like subroutines.
710 They may even have parameters. There are two named regex that are used
711 to obtain zero-width look-ahead and look-behind. The parameter passed to
712 these named regex may be another regex:
713
714 <before ...> Zero-width look ahead for ...
715 <after ...> Zero-width look behind for ...
716
717 An example:
718
719 / foo <before \d+> / # only matches on "foo" followed
720 # immediately by some digits
721
722 Since these assertions are zero-width, the "pointer" that keeps track
723 of how much of the string has been consumed will point just after the
724 "foo" portion of the string on successful match so that the digits
725 can be processed by other means if necessary.
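The position of that "pointer" can be inspected directly. The sketch below assumes a current Rakudo compiler and writes the assertions as C<< <?before ...> >> and C<< <?after ...> >>, the non-capturing spelling used there:

```raku
my $m = "foo123" ~~ / foo <?before \d+> /;
say ~$m;      # foo   (the digits are not part of the match)
say $m.to;    # 3     (the match ends just after "foo")

# Without digits after "foo", the look-ahead fails and so does the match.
say so "foobar" ~~ / foo <?before \d+> /;    # False

# Look-behind: match digits only when "foo" comes right before them.
my $n = "foo123" ~~ / <?after foo> \d+ /;
say ~$n;      # 123
```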
726
727 By declaring named regex like this you can build up a whole library of
728 regex that match some special purpose language. In fact, Perl 6 lets
729 you group your regex under a common name by declaring that all of your
730 regex belong to the same "grammar".
731
732 grammar Calc;
733
734 regex expr {
735 <term> '*' <expr> |
736 <term> '/' <expr> |
737 <term>
738 }
739
740 regex term {
741 <factor> '+' <term> |
742 <factor> '-' <term> |
743 <factor>
744 }
745
746 regex factor { <digit>+ | '(' <expr> ')' }
747
748 The grammar declaration must appear at the beginning of the file and is
749 in effect until the end of file. To explicitly declare the scope of the
750 grammar, enclose the regex in curly braces like so:
751
752 grammar Calc {
753 regex expr { ... }
754 regex term { ... }
755 regex factor { ... }
756 }
757
758 To match strings that belong to this grammar, the named regex must be
759 fully qualified:
760
761 "3+5*2" ~~ / <Calc.expr> /;
762
763 Perl 6 also has some shortcuts for specifying common and useful defaults
764 to the regex engine. If, instead of using the C<regex> keyword, you use
765 C<token>, Perl 6 will automatically turn on the C<:ratchet> modifier for
766 the duration of the regex. The idea is that once you've matched a
767 "token", you're not likely to want to backtrack into it.
768
769 Also, if you use C<rule> instead of C<regex>, Perl 6 will turn on both
770 of the C<:ratchet> and C<:sigspace> modifiers.
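The effect of C<:ratchet> is easiest to see side by side. This is a sketch assuming a current Rakudo compiler; the names C<word-r> and C<word-t> are made up for the comparison.

```raku
my regex word-r { \w+ }    # may backtrack
my token word-t { \w+ }    # :ratchet is on, so it never gives characters back

# The regex first grabs "foobar", then backs off to "foo" so "bar" can match.
say so "foobar" ~~ / ^ <word-r> bar $ /;    # True

# The token grabs "foobar" and keeps it, leaving nothing for "bar".
say so "foobar" ~~ / ^ <word-t> bar $ /;    # False
```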
771
772 Here's the C<Calc> grammar above, rewritten to use these syntactic
773 shortcuts:
774
775 grammar Calc;
776
777 rule expr {
778 <term> '*' <expr> |
779 <term> '/' <expr> |
780 <term>
781 }
782
783 rule term {
784 <factor> '+' <term> |
785 <factor> '-' <term> |
786 <factor>
787 }
788
789 token factor { <digit>+ | '(' <expr> ')' }
790
791 There's not much difference, is there? But it makes a big difference in
792 what gets parsed. The original grammar did not have any provisions for
793 matching whitespace, so any whitespace in the string would cause the
794 pattern to fail. A string like "3 + 5 * 7" would not be matched by the
795 original grammar. Now, because whitespace in the pattern is parsed as
796 whitespace in the string, that string will parse successfully.
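To compare the two versions, here is a sketch that assumes a current Rakudo compiler. It uses the block form of C<grammar>, the hypothetical names C<CalcRegex> and C<CalcRule> so both can live in one file, and the C<.parse> method with C<:rule> to choose the starting rule, rather than the qualified-name syntax shown earlier.

```raku
grammar CalcRegex {
    regex expr   { <term> '*' <expr> | <term> '/' <expr> | <term> }
    regex term   { <factor> '+' <term> | <factor> '-' <term> | <factor> }
    regex factor { <digit>+ | '(' <expr> ')' }
}

grammar CalcRule {
    rule  expr   { <term> '*' <expr> | <term> '/' <expr> | <term> }
    rule  term   { <factor> '+' <term> | <factor> '-' <term> | <factor> }
    token factor { <digit>+ | '(' <expr> ')' }
}

# .parse must consume the whole string to succeed.
say so CalcRegex.parse("3+5*2",     :rule<expr>);   # True
say so CalcRegex.parse("3 + 5 * 7", :rule<expr>);   # False
say so CalcRule.parse("3 + 5 * 7",  :rule<expr>);   # True
```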
797
798 =head2 Strings and Beyond
799
800 Throughout this article I've been talking about regex as they apply to very
801 ASCII-like strings. However, Perl 6 regex are not restricted to ASCII. Perl 6
802 regex can be applied to any string of Unicode characters and, in fact, are
803 written in Unicode by default.
804
805 Moreover, Perl 6 regex can be applied to things that are not strings but can be
806 made to look like strings. For instance, they can be applied to a filehandle
807 (which can represent itself as a stream of bytes/characters/whatever). Even
808 stranger, regex can be applied to an array of objects. See
809 L<S05:Matching against non-strings>.
810
811 =head2 Conclusion
812
813 Well, that's about it for this introduction to Perl 6 regex. I've run
814 out of steam. There are tons of features that I've left out since this
815 is just an introduction. A few that come to mind are:
816
817 =over 4
818
819 =item * match object polymorphism
820
821 Captured portions of the regex can be accessed as strings, but they
822 can also be accessed in other ways: as a match object, as an array (if
823 the subpattern is quantified), as a hash, etc.
824
825 =item * Perl code in regex
826
827 Curly braces in a regex allow for execution of arbitrary Perl code as
828 the regex is matched.
829
830 =item * quantifier enhancement
831
832 The universal quantifier can be used to match more than just some number
833 of times. If the thing on the RHS of the C<**> is a regex, then that is
834 taken as the pattern to match as the separator between items that match
835 the LHS. For example, C<< <ident>**',' >> will match a series of identifiers
836 separated by comma characters.
837
838 =back
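For a taste of these three features, here is a rough sketch that assumes a current Rakudo compiler. Note one loud assumption: in that compiler the separator is written with the C<%> operator (C<< <ident>+ % ',' >>) instead of the C<**> form described above. The variable names are illustrative.

```raku
# Separator quantifier: identifiers separated by commas.
my $m = "alpha,beta,gamma" ~~ / <ident>+ % ',' /;

# The same match object can be read as a string, and its quantified
# named capture as an array of sub-matches.
say ~$m;                  # alpha,beta,gamma
say $m<ident>.elems;      # 3
say ~$m<ident>[1];        # beta

# Code in curly braces runs when the match reaches that point.
my $total = 0;
"10 20 30".match(/ (\d+) { $total += +$0 } /, :g);
say $total;               # 60
```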
839
840 Be sure to read the references given below for a more detailed
841 explanation of the features mentioned in this article.
842
843 =head2 References
844
845 If you want to read more about Perl 6 regex, see the official Perl 6
846 documentation at L<http://perlcabal.org/syn/S05.html>. There are also
847 some historical documents at
848 L<http://dev.perl.org/perl6/doc/design/apo/A05.html> and
849 L<http://dev.perl.org/perl6/doc/design/exe/E05.html> that may give you a
850 feel for things. If you're really interested in learning more but feel
851 you need to interact with people, try the mailing list at
852 perl6-language@perl.org or log on to a freenode IRC server and drop
853 by #perl6.
854
855 =head2 About the Author
856
857 Jonathan Scott Duff is an Information Technology Research Manager at the
858 Conrad Blucher Institute for Surveying and Science on the campus of
859 Texas A&M University-Corpus Christi. He has a beautiful wife and 4 lovely
860 children. When not working or spending time with his family, Scott tries
861 to keep up with Parrot and Perl 6 development. Sometimes he can be found
862 on IRC as PerlJam in one of the perl-related channels. But if you really
863 want to get in touch with him, the best way is via email: duff@pobox.com
864
865
866 Copyright 2007-2009 Jonathan Scott Duff