4fe1c6e @perlpilot initial regex-intro document
=head1 Introduction to Perl 6 Regex

=head2 Context

Over the years programming languages have incorporated features for
regular expressions. Some, such as JavaScript, have added syntax
specifically to support regular expressions. Others, such as PHP, have
just reused their native string type and utilize special subroutines to
parse strings as regular expressions. But one thing almost all of them
have in common is that they have mimicked the extended regular
expression syntax of Perl.

Of course, Perl wasn't the first programming language to have support
for regular expressions. But it did make them popular. Perl has been so
successful as a text processing and glue language, and regular
expressions are so well interwoven into the language, that anyone who
uses Perl almost I<has> to learn regular expressions. Also, by applying
some of Perl's philosophy to regular expressions, common usages became
easy and complex usages became possible. Here are just a few features
that resulted: character class shortcuts, annotated regular expressions,
the ability to match Unicode properties, zero-width assertions,
independent subexpressions, and code execution inside of a regular
expression.

Unfortunately, as the regular expressioning public put more demand on
Perl's regular expression syntax, it accumulated some crufty items:
little inconsistencies that were kept to maintain backward compatibility
or were introduced because they were needed, but before they were fully
thought out. In designing Perl 6, Larry Wall not only looked at the
syntax and semantics of Perl proper, but he also took a hard look at the
sub-language that is regular expressions and refactored it into
something that makes better sense.

In this article I'm going to give an introduction to Perl 6 regex (we
call them "regex" to maintain the historical association with regular
expressions, though they've strayed quite far from the mathematical
sense of regular languages). I'll point out differences from
Perl 5 syntax, but no knowledge of Perl 5's regular expression syntax
should be necessary to understand this document. If you're a Perl 5
geek, you may be bored for a while, but read anyway so that you can
pick up the syntactic and semantic differences.

=head2 Literals

Firstly, let's get some small syntactic things out of the way. In Perl
6, as in other implementations, regex are typically delimited by slashes
(aka leaning toothpicks), so a typical regex to match the string "abc"
would look like this:

    /abc/

A regex is also sometimes called a "pattern" because we're looking for
a portion of a string that looks like the strings that the regex
describes. Regex are also sometimes called "rules" because they
describe the conditions under which the string may match. But what
string are we looking in? As in Perl 5, Perl 6 applies the regex to a
variable called C<$_> if we haven't explicitly specified a variable to
match against. For now I'm going to continue assuming that our string
is in C<$_> in my examples. Later I'll show you how to specify a
different string to match against.

So, the above regex tries to find the pattern "abc" in the string C<$_>.
If the pattern appears in the string, the regex returns a true value to
indicate that it matched successfully; otherwise it returns a false
value. It's important to note that the pattern may appear anywhere in
the string. So, for instance, if C<$_> contained "fooabcbar", the
above pattern will successfully match and the regex will return a true
value. Here are some more strings that would successfully match against
the regex:

    abcgoobltygook
    now I know my abcs
    abc
    babcock
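
As a minimal sketch of matching against C<$_>, assuming a current
Rakudo compiler: the helper name C<mentions-abc> and the sample strings
are mine, not part of any library. I use the C<m/abc/> form here because
it matches against C<$_> right away.

```raku
# mentions-abc is a hypothetical helper name used only for this sketch.
sub mentions-abc(Str $text --> Bool) {
    # 'given' places $text in $_, so m/abc/ has something to match against.
    given $text {
        return so m/abc/;
    }
}

for 'fooabcbar', 'now I know my abcs', 'babcock', 'xyz' -> $candidate {
    # Prints True for the first three strings and False for 'xyz'.
    say mentions-abc($candidate), "\t", $candidate;
}
```

The position of "abc" inside the string makes no difference; only its
presence does.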

=head2 Meta-syntax

Now, if all you could do is match literal strings, regexes wouldn't be
so useful, would they? Some characters, rather than being taken
literally, are so-called metacharacters that have special meaning in a
regex. In Perl 6, any non-alphanumeric character is considered a
metacharacter by default. That is, alphabetic and numeric characters
match themselves and any other character may not match itself because it
may have special meaning. (For the purposes of metasyntax, the
underscore is considered alphanumeric.)

Currently not all metacharacters actually I<have> a special meaning, but
many do, and in order to keep things simple Perl 6 designates all
non-alphanumeric characters as metasyntactic. However, there's an
"escape mechanism" that lets you treat metacharacters as themselves
(literally), and alphanumeric characters as metasyntactic. By prefixing
an alphanumeric character with a backslash (B<\>) it becomes a
metacharacter and is special. By the same token, prefixing a
non-alphanumeric character with a backslash removes its metasyntactic
nature and it becomes literal.

So, for example, a common metasyntactic character found in regular
expressions is a period (C<.>, sometimes just called a "dot") and it
matches B<any> character. Thus,

    /f..d/

will match any four-character sequence that starts with an "f" and ends
with a "d". All of the following strings match this pattern:

    my name is fred              # matched "fred"
    I need food, now!            # matched "food"
    those guys are turf idols    # matched "f id"
    shift down a gear            # matched "ft d"

If you want to actually match a period you can escape it like so:

    /foo\./    # matches "foo."

Similarly, the letter "t" in a regular expression matches itself (i.e.,
an occurrence of the letter "t"). But, with a backslash immediately
preceding the "t", it takes on its metasyntactic meaning of a tab
character. For example:

    /tall/     # matches the string "tall"
    /\tall/    # matches a tab character, followed by "all"

Another way to match character sequences that are to be taken
literally is to enclose them in quotes:

    /'foo.'/    # matches "foo."
    /"foo."/    # same

In these cases, the quotes are metasyntactic delimiters that mean
"match the characters in between literally".
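
Here is a small sketch that replays the dot, backslash, and quote
examples, assuming a current Rakudo compiler. It uses the smart match
operator C<~~>, described later in this article, to name the string
being matched; the variable names are only illustrative.

```raku
my @samples = 'my name is fred', 'I need food, now!',
              'those guys are turf idols', 'shift down a gear';

# Collect the text that / f..d / matched in each sample string.
my @matched = @samples.map({ .match(/ f..d /).Str });
say @matched.join(' | ');            # fred | food | f id | ft d

say so 'foo.' ~~ / foo\. /;          # True:  the escaped dot is a literal period
say so 'fooX' ~~ / foo\. /;          # False: "X" is not a period
say so 'fooX' ~~ / 'foo.' /;         # False: quoted text is literal too
say so "a\tall tale" ~~ / \tall /;   # True:  \t matches the tab character
```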

=begin sidebar

In Perl 5 (and most other regular expression variants), a dot matches
any character except for the newline sequence. In Perl 6, this odd
restriction is lifted and the dot matches B<any> character, including
the newline sequence. Perl 6 has other mechanisms to accomplish Perl
5's behavior. See L<"Character Classes"> below.

=end sidebar

The most important metasyntactic characters in regex are whitespace
(typically space characters, but sometimes tabs and other "invisible"
characters). Because regular expressions tend towards high character
density, they can often be difficult to read. In Perl 6 regex, you may
use whitespace to separate parts of your regex to make it easier to
read. The whitespace is ignored by the regex engine. To match literal
space characters or other whitespace you should enclose it in quotes.

The spaced form is preferable when writing regex because it makes them
easier to read later, so the examples use it from now on.
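
To make the point about ignored whitespace concrete, here is a short
sketch, assuming a current Rakudo compiler. The variable names
C<$spaced> and C<$compact> are mine; both regexes describe the same
pattern.

```raku
my $spaced  = / \d+ '-' \d+ /;   # whitespace between the pieces is ignored
my $compact = /\d+'-'\d+/;       # same pattern, harder to read

say so '10-20'   ~~ $spaced;     # True
say so '10-20'   ~~ $compact;    # True
say so '10 - 20' ~~ $spaced;     # False: spaces in the pattern match nothing
say so '10 - 20' ~~ / \d+ ' - ' \d+ /;   # True: quoted spaces are literal
```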

=head2 Quantifiers

Three other important metacharacters are what are called quantifiers.
These characters allow you to specify repetition; that is, that a
character or group of characters may be matched multiple times.

    quantifier    matches
    *             the preceding thing zero or more times
    +             the preceding thing one or more times
    ?             the preceding thing zero or one time

Examples:

    / fo* /    # will match "f", "fo", "foo", "fooo", etc.
    / fo+ /    # will match "fo", "foo", "fooo", "foooo", etc.
    / fo? /    # will only match "f" or "fo"

Note that in the above examples the quantifier is only applied to the
preceding character. If you need to match a group of characters
repeatedly, you have to use one of the several grouping mechanisms in
Perl 6 regex (see L<Grouping> below).
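
The following sketch, assuming a current Rakudo compiler, shows what
each quantifier actually consumes from one sample string. Prefixing a
match with C<~> gives the matched text.

```raku
my $text = 'fooo';

say ~($text ~~ / fo* /);    # fooo  (zero or more "o")
say ~($text ~~ / fo+ /);    # fooo  (one or more "o")
say ~($text ~~ / fo? /);    # fo    (at most one "o")

say so 'f' ~~ / fo* /;      # True:  zero "o" characters is acceptable
say so 'f' ~~ / fo+ /;      # False: at least one "o" is required
```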

There is another character sequence, sometimes called the universal
quantifier, that allows you to prescribe a specific number of times a
particular pattern may match. To use the universal quantifier, you use
two C<*> characters followed by a number or a range like so:

    / fo ** 3 /       # matches only "fooo"
    / fo ** 3..5 /    # matches one of "fooo", "foooo", or "fooooo"

You may also specify a closure after the C<**>, but explaining that is
beyond the scope of this introduction. See the references given at
the end of this article for more reading.
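
A brief sketch of the counted forms, assuming a current Rakudo
compiler; the sample string is simply an "f" followed by seven "o"
characters.

```raku
my $text = 'f' ~ 'o' x 7;             # "fooooooo"

say ~($text ~~ / fo ** 3 /);          # fooo    (exactly three "o")
say ~($text ~~ / fo ** 3..5 /);       # fooooo  (greedy: as many as five)
say so 'foo' ~~ / fo ** 3 /;          # False:  only two "o" available
```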

=begin sidebar

=head3 A note on greed

As in Perl 5, the quantifiers are greedy. That is, they try to
consume as much of the target string as possible. So the pattern

    / .* abc /

when applied to the string "abcdefghijklmnopqrstuvwxyz" will first try
to match the entire string (because C<.> matches any character and C<*>
tries to match as much as it can). When the regex engine gets to the end
of the string and cannot match "abc", it will back up one character and
try to match "abc" again. This process will repeat until the regex
engine either matches or runs out of characters to process.

Also as in Perl 5, the greediness of quantifiers can be "turned off" by
adding a question mark (C<?>) immediately after the quantifier.
So for instance,

    / . *? abc /        # first tries to match nothing, then "abc"
    / . +? abc /        # first tries matching one character, then "abc"
    / . **?3..5 abc/    # first tries matching three characters, then "abc"

The default behavior (greediness) can be thought of as the regex engine
first matching the most characters that the quantifier will allow and
then backing up one character at a time if the rest of the pattern
doesn't match (to try again). With greediness turned off, the regex
engine matches as little as the quantifier allows and moves forward one
character at a time to match the rest of the pattern.

Note that a regex is processed from left to right and the greediness (or
non-greediness) behavior B<only> applies to the part of the regex
governed by a particular quantifier. Making quantifiers match
minimally does not cause the pattern as a whole to match minimally.

=end sidebar
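
A short sketch of greedy versus non-greedy matching, assuming a current
Rakudo compiler. The last line illustrates the final point of the
sidebar: a minimal quantifier does not make the overall match minimal,
because the match still starts at the leftmost possible position.

```raku
my $text = 'xxabcyyabczz';

say ~($text ~~ / .*  abc /);    # xxabcyyabc  (greedy: runs to the last "abc")
say ~($text ~~ / .*? abc /);    # xxabc       (non-greedy: stops at the first "abc")

# The match begins at the first "a", so "aabc" is found rather than "abc".
say ~('aabc' ~~ / a+? bc /);    # aabc
```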

=head2 Character Classes

We've seen how to match a specific character at a given location by
putting that character in the regex and we've seen how to match any
character at a specific location by putting a dot in the regex, but
sometimes you want to match a specific I<set> of characters at a given
position. The mechanism to do this in regex is called a "character
class". Character classes are designated by C<< <[]> >> with the
specific characters listed inside the brackets. For instance,

    / foo <[dlt]> /    # matches "food", "fool" or "foot"

The sequence C<< <[dlt]> >> represents any one of the characters "d",
"l", or "t". You can also specify a set of contiguous characters to
match using a range like so:

    / <[ a..d ]> /    # matches one of "a", "b", "c", or "d"

You can also mix ranges and specific characters:

    /<[ a..d xyz ]>/     # matches one of "a","b","c","d","x","y", or "z"
    /<[ xyz a..d ]>/     # same
    /<[ x..z a..d ]>/    # same
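
As a quick sketch of enumerated classes and ranges, assuming a current
Rakudo compiler (the sample words are mine):

```raku
say so 'fool' ~~ / foo <[dlt]> /;             # True:  "l" is in the class
say so 'foop' ~~ / foo <[dlt]> /;             # False: "p" is not

# A quantifier applies to the whole class; the first run of class
# members in "taxicab" is "ax" ("t" and "i" are outside the class).
say ~('taxicab' ~~ / <[ a..d xyz ]>+ /);      # ax
```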

Some character classes are so useful that they have their own
designated short-cuts. All of the character class short-cuts make use
of alphabetic characters that have been given a metasyntactic meaning
by prefixing the character with a backslash. Here's a table of the
short-cuts:

    short-cut    matches
    \w           word characters (alphabetics, numerics, and underscore)
    \W           non-word characters
    \d           digits
    \D           non-digits
    \s           whitespace characters
    \S           non-whitespace characters
    \t           tab character
    \T           anything but a tab character
    \n           newline sequence
    \N           anything but a newline sequence
    \r           carriage return character
    \R           anything but carriage return character
    \f           form feed character
    \F           anything but form feed character
    \h           horizontal whitespace
    \H           anything but horizontal whitespace
    \v           vertical whitespace
    \V           anything but vertical whitespace

You may notice some regularity in this table. For every character class
short-cut of the form C<< \<lower case letter> >> the anti-class is
always C<< \<corresponding upper case letter> >>. (The old non-newline
meaning of C<.> maps neatly to the new C<\N> sequence.)

Character classes bear a remarkable resemblance to sets. In fact, you
can "add" and "subtract" character classes much like you would sets:

    /<[a..z] - [aeiou]>/       # Only match a consonant
    /<[asdfg] + [hjkl;]> +/    # Only match a sequence of characters that can
                               # be made from home row keys.
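
The sketch below tries a few short-cuts and the set-like operations,
assuming a current Rakudo compiler. To stay on safe ground I left the
semicolon out of the home row class; the sample strings are mine.

```raku
say ~('order 66 now'   ~~ / \d+ /);    # 66
say ~("first\nsecond"  ~~ / \N+ /);    # first  (stops at the newline)

# Subtraction: lowercase letters minus the vowels.
say ~('a rhythm' ~~ / <[a..z] - [aeiou]>+ /);      # rhythm

# Addition: left-hand and right-hand home row letters combined.
say ~('glad hands' ~~ / <[asdfg] + [hjkl]>+ /);    # glad
```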

=head2 Grouping

There are several ways to group a portion of a regex. We saw one such
way earlier in our discussion of literals: surround the characters with
quotes. Quoting does two things: it forces all of the characters between
the quotes to be treated literally, and it groups the string of
characters together into a quantifiable unit.

    / 'foo.' * /    # will match "", "foo.", "foo.foo.", etc.

Another way to create a quantifiable unit is to use square brackets
(C<[]>). Square brackets delimit a portion of the regex that may be
treated as a whole. The text in between the brackets is just another
regex.

    / f[ oo ]* /       # will match "f", "foo", "foooo", "foooooo", etc.
    / a[ bc* d ]? /    # will match "a", "abd", "abcd", "abccd", "abcccd", etc.

Yet another way to group a portion of a regex to be treated as a unit is
to use round brackets (C<()>). These are identical to square brackets
as far as grouping goes, but additionally, the portion of the string
that is matched by the regex inside the round brackets is also saved
somewhere and may be accessed later in a variety of ways. Round brackets
are said to be "capturing brackets" because of this property. The
following table shows some examples of what would be captured if the
given regex matched certain portions of a string:

    regex         matched    captured

    /f(oo)*/      "f"        ""              # the empty string
                  "foo"      "oo"
                  "foooo"    "oo" and "oo"   # one capture per repetition

    /a(bc*d)?/    "a"        ""              # the empty string
                  "abd"      "bd"
                  "abcd"     "bcd"
                  "abccd"    "bccd"

Both round and square brackets delimit a portion of the regex. This
portion of the regex is called a "subpattern". The portion of the string
that matches a subpattern can be referenced and accessed individually.
We'll talk more about capturing and where the captured portion of the
string is stored later.
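
Here is a small sketch of grouping and capturing, assuming a current
Rakudo compiler. The variable names are mine. Indexing the result of a
match with C<[0]> reaches the first capture; on current Rakudo, a
capture quantified with C<*> holds one entry per repetition.

```raku
# Square brackets group without capturing.
say ~('foooo!' ~~ / f[ oo ]* /);              # foooo

# Round brackets group and capture.
my $optional = 'abccd' ~~ / a(bc*d)? /;
say ~$optional[0];                            # bccd

my $repeated = 'foooo' ~~ / f(oo)* /;
say $repeated[0].elems;                       # 2
say $repeated[0].map(*.Str).join(', ');       # oo, oo
```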

=head2 Alternation and Conjunction

There are a couple of other useful concepts in regex called
"alternation" and "conjunction". Alternation is the idea that at a
given location in a string, there are alternatives to what may be
matched. Conjunction is the idea that at a given location in a
string, there are multiple portions of a regex that must match exactly
the same section of the string.

Alternation is designated in regex by either a single vertical bar
(C<|>) or a double vertical bar (C<||>). While each allows you to
specify alternatives, how they process those alternatives is different.

A single vertical bar does "longest token" matching on the
alternations with no inherent order as to which alternative is tried
first. So, for instance, if we were matching the string
"football", the following regex

    / f | fo | foo /

would match "foo" since that's the longest matching portion of the
regex in the alternation. But the regular expression engine may have
tried them all before it discovered "foo", or perhaps "foo" was the
first and only alternative tried. It is completely up to the
regex engine implementation as to how the alternatives are tried.

Had the regex been

    / f | fo | fo.*l | foo /

then the third item in the alternation would be matched since
C<fo.*l> will match the entire string. Again, the order in which the
alternatives are tried is unspecified.

A double vertical bar (C<||>) will match each alternative in a
left-to-right manner. That is, the regex

    / f || fo || foo /

will first try to match "f", and then (if it failed to match "f") try
to match "fo", and finally it will try to match "foo". So, were the
above regex applied to the string "football" as before, the first
alternative ("f") would match. This behavior is exactly the same as
traditional implementations of alternation in other backtracking
regular expression engines.

Which alternative matches and the order in which the alternatives are
tried become particularly important when each alternative has side
effects (such as setting a variable or calling a subroutine). We'll
talk more about that later.
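
The difference between the two kinds of alternation can be seen in a
short sketch, assuming a current Rakudo compiler:

```raku
# Single bar: the longest alternative wins, whatever its position.
say ~('football' ~~ / f | fo | foo /);            # foo
say ~('football' ~~ / f | fo | fo.*l | foo /);    # football

# Double bar: alternatives are tried left to right; the first to match wins.
say ~('football' ~~ / f || fo || foo /);          # f
```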

Similar to alternations, conjunctions in regex are designated by either
a single ampersand (C<&>) or a double ampersand (C<&&>). In both forms,
all of the conjoined terms must match the exact same portion of the
string they are being matched against. But, as with alternation, the
single-ampersand version matches the subpatterns in some unspecified
order, while the double-ampersand version will try each conjoined
portion of the regex in a left-to-right manner.

For example, if the following regex were applied to the string
"blah",

    / <[a..z]>+ & [ ... ] /    # matches "bla"

it would match the string "bla" because the subpattern on the right of
the ampersand matches exactly 3 characters and the subpattern on the
left matches any sequence of lower case letters. By comparison, had the
regex been (still applied to the string "blah"):

    / <[a..z]>+ && [ ... ] /

it would B<still> match "bla", but how it arrives at that match is
slightly different. First, the left hand side of the C<&&> would match
as much as it possibly can (since C<+> is greedy), then the right hand
side would match its 3 characters. Since the two sides then do not
match the exact same portion of the string, the regex engine is said
to "backtrack" and try the left hand side with one fewer character
before trying to match the right hand side again. This continues
until both sides match the exact same portion of the string or until
it can be determined that no such match is possible.
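
A minimal sketch of conjunction, assuming a current Rakudo compiler. I
only use the C<&&> form here, since that is the one I am sure behaves as
described on current implementations.

```raku
# Both sides must match the same span: lowercase letters AND exactly 3 characters.
say ~('blah' ~~ / <[a..z]>+ && [ ... ] /);    # bla

# With only two characters available the right-hand side can never match.
say so 'bl' ~~ / <[a..z]>+ && [ ... ] /;      # False
```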

=head2 Backtracking

Backtracking has the potential to happen whenever there is a decision
point in a regex. An alternation creates a decision point with several
alternatives; a quantifier creates a decision point where a given
portion of a regex may match one more or one fewer times. Backtracking
is caused by the regex engine's attempt to satisfy an overall match. For
instance, when matching the following regex against the string "footbag"

    / [ foot || base || hand ] ball /

the regex engine will match "foot" and then attempt to match "ball".
When it realizes that "ball" will not match, it backtracks to the point
where it matched "foot" and tries to match the next alternative at that
point (in this case, "base"). When that fails to match, the regex engine
will try to match the next alternative, and so forth until either one of
the alternatives matches or they are all exhausted.

Now, as a human walking through the same sequence of steps that the
regex goes through, I can tell right away that since C<foot> was
matched, neither C<base> nor C<hand> will match. But the regex
engine may not know this. (For this simple example, the regex engine can
probably figure out that the alternatives are mutually exclusive. If
that bothers you, imagine that they are instead complicated expressions
rather than simple strings.) As the match for "ball" repeatedly fails,
the regex engine repeatedly backtracks and tries again.

However, Perl 6 provides a way to tell the regex engine when to give
up. A colon in the regex causes Perl to not retry the preceding
"atom". In regex parlance, an atom is anything that is matched as a
unit; a group is an atom, a single character may be an atom, etc.
So, in the above example, if I wanted to tell the regex engine to not
try the other alternatives once one of them matched (because I, as the
person writing the regex, know that they are all mutually exclusive),
I simply follow the group with a colon like so:

    / [ foot || base || hand ] : ball /

As soon as one of the alternatives matches, the regex engine will move
past the colon and try to match C<ball>. When it fails to match C<ball>,
ordinarily it would backtrack to try other possibilities. But now the
colon acts as a stopping point that says, "don't bother backtracking
because nothing else will match" and so the regex engine will fail that
portion of the regex immediately rather than trying all of the other
alternatives. It's important to note that the colon does not necessarily
cause the entire regex to fail once it is backtracked over, but only the
atom to which it is applied. If not matching that atom means that the
entire regex won't match, then, of course, the entire regex will fail.

Perl 6 has other forms of backtracking control. A double colon will
cause its enclosing group to fail if backtracked over. A triple colon
will cause the entire regex to fail. For more information on these
see L<S05:Backtracking control>.
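
The footbag example fails with or without the colon, so the effect is
easier to see on a pattern where backtracking changes the outcome.
Here is a sketch using the single colon only, assuming a current Rakudo
compiler; the sample strings are mine.

```raku
# Without the colon, a+ gives back one "a" so that "ab" can match.
say so 'aaab' ~~ / a+  ab /;    # True

# With the colon, a+ keeps everything it took and the match fails.
say so 'aaab' ~~ / a+: ab /;    # False

# The colon does not get in the way when no backtracking is needed.
say so 'football' ~~ / [ foot || base || hand ]: ball /;    # True
```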

=head2 Zero-Width Assertions

As the regular expression engine processes a string to see if it matches
a given pattern, it keeps a marker to denote how much of the string it
has processed so far. (When this marker moves backwards, the regex
engine is backtracking.) As the marker moves along the string, it is
said to "consume" the string. Sometimes you want to match without
consuming any of the input string or match between characters (say the
transition from an alphabetic character to a numeric character or only
at the beginning of the string). This idea of matching without consuming
is called a zero-width assertion.

Perl 6 provides some metacharacters that denote handy zero-width
assertions.

    ^     only matches at the beginning of the string
    $     only matches at the end of the string
    ^^    matches at the beginning of any line within the string
    $$    matches at the end of any line within the string
    <<    a word boundary, but only matches the transition from
          non-word character to word character (i.e., the left-hand
          side of a word)
    >>    a word boundary, but only matches the transition from
          word character to non-word character (i.e., the right-hand
          side of a word)

Here are some example patterns and the portion(s) of the string that
would match if each pattern were applied repeatedly to the entire string
"the quick\nbrown fox\njumped over\nthe lazy\ndog" ("\n" denotes a
newline sequence within the string, so each "\n" ends a line):

    Pattern              Matches

    / ^the \s+ \w+ /     "the quick"
    / \w+ $ /            "dog"
    / ^^ \w+ /           "the", "brown", "jumped", "the", and "dog"
    / \w+ $$ /           "quick", "fox", "over", "lazy", and "dog"
    / << o\w+ /          "over"
    / o \w+ >> /         "own", "ox", "over", and "og"

In order for the patterns that would match multiple portions of the
string to actually match those substrings, there needs to be some way to
tell the regex engine to continue matching from where the last match
left off. See L<modifiers> below.
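
The rows of the table can be checked with a sketch like the following,
assuming a current Rakudo compiler. It relies on the C<:g> modifier
(covered in the modifiers section of this article) to apply each pattern
repeatedly; the variable names are mine.

```raku
my $text = "the quick\nbrown fox\njumped over\nthe lazy\ndog";

# m:g/.../ returns every match; .map(*.Str) keeps just the matched text.
my @line-starts = ($text ~~ m:g/ ^^ \w+ /).map(*.Str);
my @line-ends   = ($text ~~ m:g/ \w+ $$ /).map(*.Str);
my @o-words     = ($text ~~ m:g/ << o \w+ /).map(*.Str);
my @o-endings   = ($text ~~ m:g/ o \w+ >> /).map(*.Str);

say @line-starts.join(', ');    # the, brown, jumped, the, dog
say @line-ends.join(', ');      # quick, fox, over, lazy, dog
say @o-words.join(', ');        # over
say @o-endings.join(', ');      # own, ox, over, og
```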

=head2 Interacting with Perl 6

=head3 match objects and capturing

When a regex successfully matches a portion of a string it returns a
Match object. In a boolean context the Match object evaluates to true.
When using capturing brackets (parentheses), the part of the string that
is captured is placed in the Match object and can be accessed in a
number of ways.

The Match object itself is called C<$/>. The substring captured by the
first set of parentheses is placed in C<$/[0]>, the substring captured
by the second set of parentheses is placed in C<$/[1]> and so forth.
There is a short-hand way to access these elements of the Match object
that will be familiar (yet slightly different) to people who have used
regular expression engines similar to Perl 5. The short-hands for
C<$/[0]>, C<$/[1]>, C<$/[2]>, etc. are the special variables C<$0>,
C<$1>, C<$2>, etc. The big difference from other regular expression
engines is that the numbering starts with 0 rather than 1. Starting from
0 was chosen to mimic the array indices of the match object.
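
A short sketch of the Match object and its numbered captures, assuming
a current Rakudo compiler; the sample name is arbitrary.

```raku
if 'John Smith' ~~ / (\w+) \s+ (\w+) / {
    say ~$/;        # John Smith  (the whole match)
    say ~$/[0];     # John        (first capture, long form)
    say ~$0;        # John        (first capture, short-hand)
    say ~$1;        # Smith       (second capture)
}
```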

=head3 matching Perl variables

Unlike Perl 5, a variable placed inside a regex does not automatically
interpolate the value of the variable. What happens with the variable
depends on context (this B<is> Perl after all :-). An "unadorned"
variable will interpolate as a literal string to match if it's a scalar,
or as an alternation of literal strings to match if it's an array or
hash (not strictly true, but true enough for now). So, given the
following declarations:

    my $foo = "ab*c";
    my @bar = < one two three >;

The regex:

    / $foo @bar /

matches exactly as if you had written

    / 'ab*c' [ one | two | three ] /
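
The literal treatment of the scalar can be checked with a small
sketch, assuming a current Rakudo compiler. The declarations are the
ones from above; the strings being matched are mine.

```raku
my $foo = 'ab*c';
my @bar = <one two three>;

# The "*" inside $foo is matched literally; it is not a quantifier.
say so 'ab*ctwo' ~~ / $foo @bar /;    # True
say so 'abbctwo' ~~ / $foo @bar /;    # False: no literal "ab*c" in the string
say so 'ab*cfour' ~~ / $foo @bar /;   # False: "four" is not an element of @bar
```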

Sometimes a variable inside of a regex is actually used to effect a
named capture of a specific portion of the string instead of (or even
in addition to) storing the captured portion in C<$0>, C<$1>, C<$2>,
etc. For instance:

    / $<foo>:=[ <[A..Z]+[0..9]>**4 ] /

If the group matches, the result is placed in C<< $/<foo> >>. As with
numeric captures, there is a short-hand syntax for accessing a named
portion of a Match object: C<< $<foo> >>
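
Here is a hedged sketch of a named capture. One assumption to flag
loudly: current Rakudo spells the binding with a plain C<=> rather than
the C<:=> shown above, so that is the form used here. The capture name
C<code> and the sample string are mine.

```raku
# Assumption: current Rakudo writes the named-capture binding as "=".
if 'part AB12 shipped' ~~ / $<code>=[ <[A..Z] + [0..9]> ** 4 ] / {
    say ~$/<code>;    # AB12  (long form)
    say ~$<code>;     # AB12  (short-hand)
}
```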

=head3 matching other variables

Until now we've talked primarily about the pattern matching syntax
itself, but how do we apply these rules to a string other than C<$_>?
We use the smart match operator C<~~>. This operator is called "smart
match" because it does quite a bit more than just apply regular
expressions to strings, but for now we're just going to focus on that
one aspect of the smart match operator. So to match a regular
expression against the string contained in a variable called C<$foo>,
we'd do this:

    $foo ~~ / <regex here> /;

There's a more general syntax that lets you choose different delimiters
if your regular expression happens to match the C</> character and you
don't feel like writing C<\/> so much:

    $foo ~~ m/ <regex here> /;
    $foo ~~ m! <regex here> !;    # different delimiter
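
A brief sketch of smart matching against named variables, assuming a
current Rakudo compiler. For the alternative delimiter I use curly
braces, a form I am confident current Rakudo accepts; the variable
names and strings are mine.

```raku
my $foo = 'I need food, now!';
if $foo ~~ / f..d / {
    say 'slashes: ', ~$/;     # slashes: food
}

my $path = '/usr/local/bin';
if $path ~~ m{ '/' local '/' } {
    say 'braces: ', ~$/;      # braces: /local/
}
```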
586 =head3 modifiers
588 The more general syntax also gives us a convienent place to put
589 modifiers that will affect the regular expression as a whole. For
590 instance, there is an C<ignorecase> modifier that causes the RE engine
591 to be agnostic towards case distinctions in alphabetic characters.
592 There's also a short-hand for this modifier for those times when
593 C<ignorecase> is too much to type out.
c6502a0 @moritz [regex-intro] use the spaced form by default
moritz authored
595 $foo ~~ m :ignorecase/ foo /; # will match "foo", "FOO", "fOo", etc.
596 $foo ~~ m :i/ foo /; # same
4fe1c6e @perlpilot initial regex-intro document
598 Perl 6 predefines several of these modifiers (for a complete list,
599 see L<S05>):
601 modifier short-hand meaning
602 :ignorecase :i Ignore case distinctions
603 :basechar :b Ignore accents and other marks
604 :sigspace :s whitespace in pattern matches
605 whitespace in string
606 :global :g matches as many times as possible
607 :continue :c Continue matching from where
608 previous match left off
609 :pos :p Just like :c but pattern is
610 anchored where the previous match left off
611 :ratchet Don't do any backtracking
c6502a0 @moritz [regex-intro] use the spaced form by default
moritz authored
612 :bytes dot matches bytes
4fe1c6e @perlpilot initial regex-intro document
613 :codes dot matches codepoints
614 :graphs dot matches language-independent graphemes
615 :chars dot matches "characters" at current
616 Unicode level
617 :Perl5 :P5 use perl 5 regex syntax
619 There are two other modifiers for matching a pattern some number of
620 times or only matching, say, the third time we see a pattern in a
621 string. These modifiers are a little strange in that their short-hand
622 forms consist of a number followed by some text:
624 modifier short-hand meaning
625 :x() :1x,:4x,:12x match some number of times
626 :nth() :1st,:2nd,:3rd,:4th match only the Nth occurance
628 Here are some examples to illustrate these modifiers:
630 $_ = "foo bar baz blat";
c6502a0 @moritz [regex-intro] use the spaced form by default
moritz authored
631 m :3x/ a / # matches the "a" characters in each word
632 m :nth(3)/ \w+ / # matches "baz"
4fe1c6e @perlpilot initial regex-intro document
Some of these modifiers may also be placed inside the regular expression and
their effect is scoped until the end of the innermost enclosing
bracketing construct or the end of the pattern.

    / a [ :i foo ] z/   # matches "afooz", "aFOOz", "aFooz", "afOoz", etc.

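A small check of that scoping, assuming a current Rakudo: the C<:i> inside the brackets applies only to C<foo>, so the surrounding C<a> and C<z> remain case sensitive. The test strings are invented for illustration.

```raku
my $pattern = / a [ :i foo ] z /;

say so "aFOOz" ~~ $pattern;   # case differs only inside the brackets
say so "AfooZ" ~~ $pattern;   # case differs outside the brackets
```

The first line should report a match and the second should not.
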
The :sigspace modifier is quite useful. If you're unsure of the amount
of whitespace between tokens or can't guarantee a certain number of
spaces, you may be inclined to use C<\s*> or C<\s+> often in your regex.
However, it can get tedious typing C<\s+> so often and it tends to
visually detract from the parts you're really interested in matching.
Thus Perl 6 regex provides the :sigspace modifier so that whitespace
in your pattern matches whitespace in your string. This is I<so>
useful that Perl provides a nice short-cut for it.

    /\s*One\s+small\s+step\s*/      # yuck
    m:sigspace/One small step/      # much better
    mm/One small step/              # even better!

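To see the effect, here is a minimal sketch assuming a current Rakudo. It uses only the long C<:sigspace> form, since the spelling of the short-cut may differ between implementations, and the sample strings are my own.

```raku
# Mixed spaces and a tab between the words.
my $line = "One   small\tstep";

# Whitespace in the pattern stands for whitespace in the string.
say so $line ~~ m:sigspace/One small step/;

# The explicit equivalent the article calls "yuck".
say so $line ~~ / One \s+ small \s+ step /;

# With the words run together there is no whitespace to match.
say so "Onesmallstep" ~~ m:sigspace/One small step/;
```
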
=head2 Named Assertions

Earlier we talked about "zero-width assertions" that allow us to match
"in between" characters. But we've also talked about other kinds of
assertions, only we didn't call them that. Whenever you write a regex to
match I<anything>, you're asserting something about what a successful
match should look like. So, for instance, in the following regex,

    / \w+ '(' [ \w+ ',' ]* [ \w+ ]? ')' /

we are asserting that, for a string to match, it must contain a sequence
of word characters followed by a C<(>, then zero or more word-character
sequences each terminated by a comma, then an optional final
word-character sequence, and finally a C<)>. Each of the "tokens" is an
assertion about what must match for the entire regex to match. So,
C<\w> is an assertion that a word character must match, C<'('> is an
assertion that an open parenthesis must match, and so forth. (There are
6 assertions in the above regex.)

However, the regex would make more sense if we could give the
individual pieces meaningful names. For instance, if we could
write the above regex like so:

    / <function_name> '(' [ <parameter> ',' ]* <parameter>? ')' /

you might have a better idea what those C<\w+> sequences were for.
Lucky for us, Perl 6 provides just such a mechanism.

The syntax for declaring a named regex is:

    regex identifier { \w+ }

Once declared, we can use this in another regex like so:

    / <identifier> /

and it is identical to

    / \w+ /

with an important and useful exception: the portion of the string that
matches can also be accessed as C<< $/<identifier> >>.

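A minimal sketch of declaring and using a named regex, assuming a current Rakudo. There the declaration outside a grammar needs a C<my> in front, which is an addition of mine to the syntax shown above; the sample string is also invented.

```raku
# "my" scopes the named regex lexically (an assumption about current Rakudo).
my regex identifier { \w+ }

my $m = "total = 42" ~~ / <identifier> /;

say ~$m;                 # text matched by the whole pattern
say ~$m<identifier>;     # the same text, reached through the named capture
```
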
Perl 6 predeclares several useful named regex (see L<S05> for a complete list):

    <alpha>     a single alphabetic character
    <digit>     a single numeric character
    <ident>     an "identifier"
    <sp>        a single space character
    <ws>        an arbitrary amount of whitespace
    <dot>       a period (same as '.')
    <lt>        a less-than character (same as '<')
    <gt>        a greater-than character (same as '>')
    <null>      matches nothing (useful in alternations that may be empty)

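Here is a short sketch that combines a few of these. It sticks to C<< <ident> >>, C<< <ws> >>, and C<< <digit> >>, which I believe a current Rakudo still provides; the other entries in the table may not be available in every implementation. The input string is made up.

```raku
my $m = "x1 = 42" ~~ / <ident> <ws> '=' <ws> <digit>+ /;

say ~$m<ident>;          # the identifier on the left of "="
say $m<digit>.join;      # the quantified <digit> captures, joined back together
```

Because C<< <digit> >> is quantified, its capture is a list of single-character matches rather than one match.
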
You may have noticed that a C<regex> declaration looks very similar to a
subroutine declaration. Indeed, regex are very much like subroutines.
They may even have parameters. There are two named regex that are used
to obtain zero-width look-ahead and look-behind. The parameter passed to
these named regex may be another regex:

    <before ...>    Zero-width look ahead for ...
    <after ...>     Zero-width look behind for ...

An example:

    / foo <before \d+> /    # only matches on "foo" followed
                            # immediately by some digits

Since these assertions are zero-width, the "pointer" that keeps track
of how much of the string has been consumed will point just after the
"foo" portion of the string on successful match so that the digits
can be processed by other means if necessary.

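The zero-width behaviour can be checked with a sketch like the one below. It assumes a current Rakudo and writes the assertions with a leading C<?>, as C<< <?before ...> >> and C<< <?after ...> >>, which is my assumption about the present-day spelling rather than something stated above.

```raku
my $m = "foo123" ~~ / foo <?before \d+> /;
say ~$m;       # only "foo" is consumed
say $m.to;     # the match ends right after "foo", before the digits

# No digits after "foo", so the look-ahead fails.
say so "foobar" ~~ / foo <?before \d+> /;

# Look-behind: match digits only when "foo" comes just before them.
my $n = "foo123" ~~ / <?after foo> \d+ /;
say ~$n;
```
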
By declaring named regex like this you can build up a whole library of
regex that match some special purpose language. In fact, Perl 6 lets
you group your regex under a common name by declaring that all of your
regex belong to the same "grammar".

    grammar Calc;

    regex expr {
        <term> '*' <expr> |
        <term> '/' <expr> |
        <term>
    }

    regex term {
        <factor> '+' <term> |
        <factor> '-' <term> |
        <factor>
    }

    regex factor { <digit>+ | '(' <expr> ')' }

The grammar declaration must appear at the beginning of the file and is
in effect until the end of file. To explicitly declare the scope of the
grammar, enclose the regex in curly braces like so:

    grammar Calc {
        regex expr { ... }
        regex term { ... }
        regex factor { ... }
    }

To match strings that belong to this grammar, the named regex must be
fully qualified:

    "3+5*2" ~~ / <Calc.expr> /;

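As a runnable sketch of the grammar so far, assuming a current Rakudo: the block form is used, and instead of the qualified call the grammar is driven through C<Calc.parse> with a C<:rule> argument. That entry point is my choice for the example, not something the text above prescribes.

```raku
grammar Calc {
    regex expr   { <term> '*' <expr> | <term> '/' <expr> | <term> }
    regex term   { <factor> '+' <term> | <factor> '-' <term> | <factor> }
    regex factor { <digit>+ | '(' <expr> ')' }
}

# .parse must consume the whole string; :rule picks the starting regex.
say so Calc.parse("3+5*2", :rule<expr>);

# No provision for whitespace yet, so this one is rejected.
say so Calc.parse("3 + 5 * 2", :rule<expr>);
```
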
Perl 6 also has some shortcuts for specifying common and useful defaults
to the regex engine. If, instead of using the C<regex> keyword, you use
C<token>, Perl 6 will automatically turn on the C<:ratchet> modifier for
the duration of the regex. The idea is that once you've matched a
"token", you're not likely to want to backtrack into it.

Also, if you use C<rule> instead of C<regex>, Perl 6 will turn on both
the C<:ratchet> and C<:sigspace> modifiers.

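The difference between the three keywords shows up in a sketch like this one, assuming a current Rakudo. The names C<loose>, C<strict>, and C<assign> are hypothetical, and the C<my> declarator is needed there outside a grammar.

```raku
my regex loose  { \w+ c }        # may backtrack
my token strict { \w+ c }        # :ratchet, no backtracking
my rule  assign { \w+ '=' \d+ }  # :ratchet plus :sigspace

# \w+ first grabs "abc"; the regex gives back the "c", the token cannot.
say so "abc" ~~ / ^ <loose> $ /;
say so "abc" ~~ / ^ <strict> $ /;

# Whitespace in the rule body matches whitespace in the string.
say so "x = 1" ~~ / ^ <assign> $ /;
```
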
Here's the C<Calc> grammar above, rewritten to use these syntactic
shortcuts:

    grammar Calc;

    rule expr {
        <term> '*' <expr> |
        <term> '/' <expr> |
        <term>
    }

    rule term {
        <factor> '+' <term> |
        <factor> '-' <term> |
        <factor>
    }

    token factor { <digit>+ | '(' <expr> ')' }

There's not much difference, is there? But it makes a big difference in
what gets parsed. The original grammar did not have any provisions for
matching whitespace, so any whitespace in the string would cause the
pattern to fail. A string like "3 + 5 * 7" would not be matched by the
original grammar. Now, because whitespace in the pattern matches
whitespace in the string, that string will parse successfully.

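The claim can be tried with the sketch below, assuming a current Rakudo. As in any standalone test, the grammar is written in block form and driven through C<Calc.parse> with C<:rule>, which is my choice of entry point.

```raku
grammar Calc {
    rule  expr   { <term> '*' <expr> | <term> '/' <expr> | <term> }
    rule  term   { <factor> '+' <term> | <factor> '-' <term> | <factor> }
    token factor { <digit>+ | '(' <expr> ')' }
}

# Whitespace between the tokens is now accepted.
say so Calc.parse("3 + 5 * 7", :rule<expr>);

# Input without any whitespace still parses.
say so Calc.parse("3+5*7", :rule<expr>);
```
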
=head2 Strings and Beyond

Throughout this article I've been talking about regex as they apply to very
ASCII-like strings. However, Perl 6 regex are not restricted to ASCII. Perl 6
regex can be applied to any string of Unicode characters and, in fact, are
written in Unicode by default.

Moreover, Perl 6 regex can be applied to things that are not strings but can be
made to look like strings. For instance, they can be applied to a filehandle
(which can represent itself as a stream of bytes/characters/whatever). Even
stranger, regex can be applied to an array of objects. See
L<S05:Matching against non-strings>.

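A small sketch of the Unicode point, assuming a current Rakudo and a UTF-8 source file; the sample text is invented. Matching against filehandles or arrays is not shown, since I am not certain which implementations support it.

```raku
# \w is not limited to ASCII letters.
my $m = "Ünïcödé text" ~~ / \w+ /;
say ~$m;
say (~$m).chars;

# <alpha> also accepts non-Latin alphabetic characters.
say so "λ" ~~ / <alpha> /;
```
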
=head2 Conclusion

Well, that's about it for this introduction to Perl 6 regex. I've run
out of steam. There are tons of features that I've left out since this
is just an introduction. A few that come to mind are:

=over 4

=item * match object polymorphism

Captured portions of the regex can be accessed as strings, but they
can also be accessed in other ways: as a match object, as an array (if
the subpattern is quantified), as a hash, etc.

=item * Perl code in regex

Curly braces in a regex allow for execution of arbitrary Perl code as
the regex is matched.

=item * quantifier enhancement

The universal quantifier can be used to match more than just some number
of times. If the thing on the RHS of the C<**> is a regex, then that is
taken as the pattern to match as the separator between items that match
the LHS. (e.g., C<< <ident>**',' >> will match a series of identifiers
separated by comma characters.)

=back

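The three items above can be sketched briefly, assuming a current Rakudo. One loudly flagged assumption: there the separator form is written with C<%> rather than C<**>, so the example uses C<%>. All strings and variable names are made up.

```raku
# Separator quantifier (written with % in current Rakudo, an assumption here).
my $list = "a,b,c" ~~ / (\w+)+ % ',' /;
say ~$list;                          # the match as a string
say $list[0].map(*.Str).join(" ");   # the quantified capture as an array

# Named captures, reached hash-style on the match object.
my $pair = "width=80" ~~ / $<key>=(\w+) '=' $<value>=(\d+) /;
say ~$pair<key>;
say +$pair<value>;

# A code block inside the regex runs each time it is reached.
my $count = 0;
my $ok = so "aaa" ~~ / [ a { $count++ } ]+ /;
say $count;
```
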
Be sure to read the references given below for a more detailed
explanation of the features mentioned in this article.

=head2 References

If you want to read more about Perl 6 regex, see the official Perl 6
documentation at L<>. There are also
some historical documents at
L<> and
L<> that may give you a
feel for things. If you're really interested in learning more but feel
you need to interact with people, try the mailing list at
or log on to a freenode IRC server and drop
by #perl6.

=head2 About the Author

Jonathan Scott Duff is an Information Technology Research Manager at the
Conrad Blucher Institute for Surveying and Science on the campus of
Texas A&M University-Corpus Christi. He has a beautiful wife and 4 lovely
children. When not working or spending time with his family, Scott tries
to keep up with Parrot and Perl 6 development. Sometimes he can be found
on IRC as PerlJam in one of the perl-related channels. But if you really
want to get in touch with him, the best way is via email:

Copyright 2007-2009 Jonathan Scott Duff