Minor edits to regex chapter; a couple of editorial notes added.

commit b50dc2ae790ef925ed08e44e27ac26f1d4fc96cc 1 parent 41d351c
chromatic authored February 26, 2010

Showing 1 changed file with 339 additions and 227 deletions.

  1. 566  src/regexes.pod
... ...
@@ -1,8 +1,11 @@
1 1
 =head0 Pattern matching
2 2
 
3  
-A common error while writing is to accidentally duplicate a word.
4  
-It is hard to catch errors by rereading your own text, so we present a way to
5  
-let Perl 6 search for your errors, introducing so-called I<regexes>:
  3
+X<regular expressions>
  4
+X<regex>
  5
+
  6
+A common writing error is to duplicate a word by accident.  It is hard to
  7
+catch such errors by rereading your own text, but Perl can do it for you.  A
  8
+simple technique uses so-called I<regular expressions> or I<regexes>:
6 9
 
7 10
 =begin programlisting
8 11
 
@@ -19,13 +22,12 @@ let Perl 6 search for your errors, introducing so-called I<regexes>:
19 22
 Regular expressions are a concept from computer science, and consist of
20 23
 primitive patterns that describe how text looks. In Perl 6 the pattern
21 24
 matching is much more powerful (comparable to Context-Free Languages), so we
22  
-prefer to call them just C<regex>. (If you know regexes from other
23  
-programming languages it's best to forget all of their syntax, since in
24  
-Perl 6 much is different than in PCRE or POSIX regexes.)
  25
+prefer to call them just C<regex>. (If you know regexes from other programming
  26
+languages it's best to forget their syntax; Perl 6 differs from PCRE or POSIX
  27
+regexes.)
25 28
 
26  
-In the simplest case a regex contains
27  
-just a constant string, and matching a string against that regex just searches
28  
-for that string:
  29
+In the simplest case a regex consists of a constant string. Matching a string
  30
+against that regex searches for that string:
29 31
 
30 32
 =begin programlisting
31 33
 
@@ -35,17 +37,17 @@ for that string:
35 37
 
36 38
 =end programlisting
37 39
 
38  
-The construct C<m/ ... /> builds a regex, and putting it on the right hand
39  
-side of the C<~~> smart match operator applies it against the string on the
40  
-left hand side. By default, whitespace inside the regex are irrelevant for the
41  
-matching, so writing the regex as C<m/ perl />, C<m/perl/> or C<m/ p e rl/> all
42  
-produce the exact same semantics - although the first way is probably the most
43  
-readable one.
  40
+The construct C<m/ ... /> builds a regex.  A regex on the right hand side of
  41
+the C<~~> smart match operator applies against the string on the left hand
  42
+side. By default, whitespace inside the regex is irrelevant for the matching,
  43
+so writing the regex as C<m/ perl />, C<m/perl/> or C<m/ p e rl/> all produce
  44
+the exact same semantics--although the first way is probably the most readable
  45
+one.
44 46
 
45  
-Only word characters, digits and the underscore cause an exact substring
46  
-search. All other characters have, at least potentially, a special meaning. If
47  
-you want to search for a comma, an asterisk or other non-word characters, you
48  
-have to quote or escape them:
  47
+Only word characters, digits, and the underscore cause an exact substring
  48
+search. All other characters may have a special meaning. If you want to search
  49
+for a comma, an asterisk, or another non-word character, you must quote or
  50
+escape it:
49 51
 
50 52
 =begin programlisting
51 53
 
@@ -58,8 +60,18 @@ have to quote or escape them:
58 60
 
59 61
 =end programlisting
60 62
 
61  
-However searching for literal strings gets boring pretty quickly, so let's
62  
-explore some "special" (also called I<metasyntactic>) characters. The dot (C<.>)
  63
+=for author
  64
+
  65
+What are the C<index> and C<rindex> ops in Perl 6?
  66
+
  67
+=end for
  68
+
  69
+X<regex; metasyntactic characters>
  70
+X<regex; special characters>
  71
+X<regex; . character>
  72
+
  73
+However, searching for literal strings gets boring pretty quickly.  Regexes
  74
+support special (also called I<metasyntactic>) characters. The dot (C<.>)
63 75
 matches a single, arbitrary character:
64 76
 
65 77
 =begin programlisting
@@ -77,24 +89,29 @@ matches a single, arbitrary character:
77 89
 
78 90
 This prints
79 91
 
  92
+=begin screen
  93
+
80 94
     spell contains pell
81 95
     superlative contains perl
82 96
     openly contains penl
83 97
     no match for stuff
84 98
 
85  
-The dot matched an C<l>, C<r> and C<n>, but it would also match a space in the
86  
-sentence I<the spectroscoB<pe l>acks resolution> - regexes don't care about
87  
-word boundaries at all. The special variable C<$/> stores (among other things)
88  
-just the part of the string the matched the regular expression. C<$/> holds
89  
-the so-called I<match object>.
  99
+=end screen
  100
+
  101
+The dot matched an C<l>, C<r>, and C<n>, but it would also match a space in
  102
+the sentence I<< the spectroscoB<pe l>acks resolution >>--regexes don't care
  103
+about word boundaries at all. The special variable C<$/> stores (among other
  104
+things) only the part of the string that matched the regular expression. C<$/>
  105
+holds the so-called I<match object>.
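
The behavior of the dot and of C<$/> can be sketched as follows. This is a minimal illustration that assumes a current Rakudo compiler; the helper name C<matched-part> is hypothetical and does not appear in the chapter's own listings.

```raku
# Hypothetical helper: returns the text matched by m/ pe . l /,
# or Nil when the pattern does not match.
sub matched-part(Str $text) {
    if $text ~~ m/ pe . l / {
        return ~$/;      # stringify the match object: only the matched part
    }
    return Nil;
}

for <spell superlative openly stuff> -> $word {
    my $part = matched-part($word);
    say $part.defined ?? "$word contains $part" !! "no match for $word";
}

# The dot also matches a space, so word boundaries play no role here.
say matched-part('the spectroscope lacks resolution');
```

Stringifying C<$/> with the C<~> prefix yields only the matched substring, which is why the helper can return it directly.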
  106
+
  107
+X<regex; \w>
90 108
 
91  
-Suppose you had a big chunk of text, and for solving a
92  
-crossword puzzle you are looking  for words containing C<pe>, then an
93  
-arbitrary letter, and then an C<l> - but not a space, your crossword puzzle
94  
-has extra markers for those. The appropriate regex for that is C<m/pe \w l/>.
95  
-The C<\w> is a control sequence that stands for a "word" character, that is a
96  
-letter, digit or an underscore. Other common control sequences that each match
97  
-a single character, can be found in the following table
  109
+Suppose you have a big chunk of text.  For solving a crossword puzzle you are
  110
+looking for words containing C<pe>, then an arbitrary letter, and then an C<l>
  111
+(but not a space, as your puzzle has extra markers for those). The appropriate
  112
+regex for that is C<m/pe \w l/>.  The C<\w> control sequence stands for a
  113
+"word" character--a letter, digit, or an underscore. Several other common
  114
+control sequences each match a single character:
98 115
 
99 116
 =begin table Backslash sequences and their meaning
100 117
 
@@ -170,19 +187,22 @@ a single character, can be found in the following table
170 187
 
171 188
 Each of these backslash sequences means the complete opposite if you convert
172 189
 the letter to upper case: C<\W> matches a character that's not a word
173  
-character, C<\N> matches a single character that's not a newline.
  190
+character and C<\N> matches a single character that's not a newline.
174 191
 
175  
-These matches are not limited to the ASCII range - C<\d> matches Latin,
  192
+X<regex; custom character classes>
  193
+
  194
+These matches are not limited to the ASCII range--C<\d> matches Latin,
176 195
 Arabic-Indic, Devanagari and other digits, C<\s> matches non-breaking
177 196
 whitespace and so on. These I<character classes> follow the Unicode definition
178  
-of what is a letter, number and so on. You can define custom character classes
179  
-by listing them inside nested angle and square brackets C<< <[ ... ]> >>.
  197
+of what is a letter, a number, and so on. Define custom character classes by
  198
+listing them inside nested angle and square brackets C<< <[ ... ]> >>.
180 199
 
181 200
 =begin programlisting
182 201
 
183 202
     if $str ~~ / <[aeiou]> / {
184 203
         say "'$str' contains a vowel";
185 204
     }
  205
+
186 206
     # negation with a -
187 207
     if $str ~~ / <-[aeiou]> / {
188 208
         say "'$str' contains something that's not a vowel";
@@ -190,10 +210,11 @@ by listing them inside nested angle and square brackets C<< <[ ... ]> >>.
190 210
 
191 211
 =end programlisting
192 212
 
193  
-Rather than listing each character in the character class individually,
194  
-ranges of characters may be specified by placing the range operator
195  
-C<..> between the character that starts the range and the character
196  
-that ends the range.  For instance,
  213
+X<regex; character range>
  214
+
  215
+Rather than listing each character in the character class individually, you
  216
+may specify a range of characters by placing the range operator C<..> between
  217
+the character that starts the range and the character that ends the range:
197 218
 
198 219
 =begin programlisting
199 220
 
@@ -204,34 +225,45 @@ that ends the range.  For instance,
204 225
 
205 226
 =end programlisting
206 227
 
207  
-Character classes may also be added or subtracted by using the C<+>
208  
-and C<-> operators:
  228
+X<regex; character class addition>
  229
+X<regex; character class subtraction>
  230
+
  231
+Add to or subtract from character classes with the C<+> and C<-> operators:
209 232
 
210 233
 =begin programlisting
211 234
 
212 235
     if $str ~~ / <[a..z]+[0..9]> / {
213 236
         say "'$str' contains a letter or number";
214 237
     }
  238
+
215 239
     if $str ~~ / <[a..z]-[aeiou]> / {
216 240
         say "'$str' contains a consonant";
217 241
     }
218 242
 
219 243
 =end programlisting
220 244
 
221  
-The negated character class is just a special application of this
222  
-idea.
  245
+The negated character class is a special application of this idea.
  246
+
  247
+X<regex; quantifier>
  248
+X<regex; ? quantifier>
223 249
 
224 250
 A I<quantifier> can specify how often something has to occur. A question mark
225  
-C<?> makes the preceding thing (be it a letter, a character class or
226  
-something more complicated) optional, meaning it can either be present either
227  
-zero or one times in the string being matched. So C<m/ho u? se/> matches
228  
-either C<house> or C<hose>. You can also write the regex as C<m/hou?se/>
229  
-without any spaces, and the C<?> still quantifies only the C<u>.
  251
+C<?> makes the preceding unit (be it a letter, a character class, or something
  252
+more complicated) optional, meaning it can be present either zero or
  253
+one times in the string being matched. So C<m/ho u? se/> matches either
  254
+C<house> or C<hose>. You can also write the regex as C<m/hou?se/> without any
  255
+spaces, and the C<?> still quantifies only the C<u>.
  256
+
  257
+X<regex; * quantifier>
  258
+X<regex; + quantifier>
230 259
 
231 260
 The asterisk C<*> stands for zero or more occurrences, so C<m/z\w*o/> can
232 261
 match C<zo>, C<zoo>, C<zero> and so on. The plus C<+> stands for one or more
233  
-occurrences, C<\w+> matches what is usually considered a word (though only
234  
-matches the first three characters from C<isn't> because C<'> isn't a word character).
  262
+occurrences; C<\w+> I<usually> matches what you might consider a word (though
  263
+it matches only the first three characters of C<isn't> because C<'> isn't a
  264
+word character).
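
The C<?>, C<*>, and C<+> quantifiers described above can be tried out with a short sketch like this one. It assumes a current Rakudo compiler, and C<first-match> is a hypothetical helper introduced only for this illustration.

```raku
# Hypothetical helper: the matched text as a string, or Nil on failure.
sub first-match(Str $text, Regex $pattern) {
    if $text ~~ $pattern {
        return ~$/;
    }
    return Nil;
}

# ? makes the u optional, with or without spaces in the pattern.
say first-match('house', rx/ ho u? se /);
say first-match('hose',  rx/ ho u? se /);

# * allows zero or more word characters between z and o.
say first-match($_, rx/ z \w* o /) for <zo zoo zero>;

# + needs at least one word character and stops at the apostrophe.
say first-match("isn't", rx/ \w+ /);
```

Passing the pattern as an C<rx/ ... /> object keeps the helper reusable for each quantifier.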
  265
+
  266
+X<regex; ** quantifier>
235 267
 
236 268
 The most general quantifier is C<**>. If followed by a number it matches that
237 269
 many times, and if followed by a range, it can match any number of times that
@@ -241,38 +273,42 @@ the range allows:
241 273
 
242 274
     # match a date of the form 2009-10-24:
243 275
     m/ \d**4 '-' \d\d '-' \d\d /
  276
+
244 277
     # match at least three 'a's in a row:
245 278
     m/ a ** 3..* /
246 279
 
247 280
 =end programlisting
248 281
 
249  
-If the right hand side is neither a number nor a range, it is taken as a
  282
+If the right hand side is neither a number nor a range, it becomes a
250 283
 delimiter, which means that C<m/ \w ** ', '/> matches a list of characters
251  
-which are separated by a comma and a whitespace each.
  284
+separated by a comma and a whitespace each.
252 285
 
253  
-If a quantifier has several ways to match, the longest one is chosen.
  286
+X<regex; greedy matching>
  287
+X<regex; non-greedy matching>
  288
+
  289
+If a quantifier has several ways to match, Perl will choose the longest one.
  290
+This is I<greedy> matching. Appending a question mark to a quantifier makes it
  291
+non-greedy N<The non-greedy general quantifier is C<$thing **? $count>, so the
  292
+question mark goes directly after the second asterisk.>N<This example is a
  293
+very poor way to parse HTML; using a proper parser is always preferable.>:
254 294
 
255 295
 =begin programlisting
256 296
 
257 297
     my $html = '<p>A paragraph</p> <p>And a second one</p>';
258 298
     if $html ~~ m/ '<p>' .* '</p>' / {
259  
-        say "Matches the complete string!";
  299
+        say 'Matches the complete string!';
260 300
     }
261 301
 
262  
-=end programlisting
  302
+    if $html ~~ m/ '<p>' .*? '</p>' / {
  303
+        say 'Matches only <p>A paragraph</p>!';
  304
+    }
263 305
 
264  
-This is called I<greedy> matching. Appending a question mark to a quantifier
265  
-makes it non-greedy,
266  
-so using C<.*?> instead of C<.*> in the example above
267  
-makes the regex match only the string C<< <p>A paragraph</p> >>.
  306
+=end programlisting
268 307
 
269  
-N<The non-greedy general quantifier is C<$thing **? $count>, so
270  
-the question mark goes directly after the second asterisk.>
271  
-N<Still it's a very poor way to parse HTML, and a proper parser is always
272  
-preferable.>
  308
+X<regex; grouping>
273 309
 
274  
-If you wish to apply a modifier to more than just one character or character
275  
-class, you can group items with square brackets:
  310
+To apply a quantifier to more than just one character or character class, group
  311
+items with square brackets:
276 312
 
277 313
 =begin programlisting
278 314
 
@@ -282,9 +318,11 @@ class, you can group items with square brackets:
282 318
 
283 319
 =end programlisting
284 320
 
285  
-Alternatives can be separated by vertical bars. One vertical bar between two
286  
-parts of a regex means that the longest alternative wins, two bars make the
287  
-first matching alternative win.
  321
+X<regex; alternation>
  322
+
  323
+Separate I<alternations>--tokens and units of which I<any> can match--with
  324
+vertical bars. One vertical bar between two parts of a regex means that the
  325
+longest alternative wins.  Two bars make the first matching alternative win.
288 326
 
289 327
 =begin programlisting
290 328
 
@@ -294,13 +332,20 @@ first matching alternative win.
294 332
 
295 333
 =head1 Anchors
296 334
 
297  
-So far every regex we have looked at could match anywhere within a string, but
298  
-often it is desirable to limit the match to the start or end of a string or
299  
-line, or to word boundaries.
  335
+X<regex; anchors>
  336
+
  337
+So far every regex could match anywhere within a string.  Often it is
  338
+desirable to limit the match to the start or end of a string or line, or to
  339
+word boundaries.
  340
+
  341
+X<regex; string start anchor>
  342
+X<regex; ^>
  343
+X<regex; string end anchor>
  344
+X<regex; $>
300 345
 
301 346
 A single caret C<^> anchors the regex to the start of the string, a dollar
302  
-C<$> to the end. So C<m/ ^a /> matches strings beginning with an C<a>, and
303  
-C<m/ ^ a $ /> matches strings that only consist of an C<a>.
  347
+C<$> to the end. C<m/ ^a /> matches strings beginning with an C<a>, and C<m/ ^
  348
+a $ /> matches strings that consist only of an C<a>.
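
A small sketch of the two anchors, assuming a current Rakudo compiler. The helper names C<starts-with-a> and C<is-just-a> are hypothetical and exist only to make the two patterns easy to compare.

```raku
# Hypothetical helpers wrapping the two anchored patterns from the text.
sub starts-with-a(Str $text) { so $text ~~ m/ ^ a / }
sub is-just-a(Str $text)     { so $text ~~ m/ ^ a $ / }

say starts-with-a('apple');    # anchored at the start, an a follows
say starts-with-a('banana');   # contains an a, but not at the start
say is-just-a('a');            # the whole string is a single a
say is-just-a('aa');           # a second a sits before the end anchor
```

The C<so> prefix turns the match result into a plain boolean, which suits a yes-or-no check like this.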
304 349
 
305 350
 =begin table Regex anchors
306 351
 
@@ -366,30 +411,35 @@ C<m/ ^ a $ /> matches strings that only consist of an C<a>.
366 411
 
367 412
 =head1 Captures
368 413
 
369  
-Regexes are good to check if a string is in a certain format, and
370  
-to search for pattern. But with some more features they can be very good for
371  
-I<extracting> information too.
  414
+X<regex; captures>
  415
+
  416
+Regexes are useful to check if a string is in a certain format, and to search
  417
+for patterns within a string. With some more features they can be very good
  418
+for I<extracting> information too.
  419
+
  420
+X<regex; $/>
372 421
 
373  
-Surrounding a part of a regex by round brackets C<(...)> makes it
  422
+Surrounding a part of a regex with round brackets C<(...)> makes Perl
374 423
 I<capture> the string it matches. The string matched by the first group of
375  
-parenthesis is stored in C<$/[0]>, the second in C<$/[1]> etc. In fact you can
376  
-use C<$/> as an array containing the captures from each parenthesis group.
  424
+parentheses is available in C<$/[0]>, the second in C<$/[1]>, etc.  C<$/> acts
  425
+as an array containing the captures from each parentheses group.
377 426
 
378 427
 =begin programlisting
379 428
 
380 429
     my $str = 'Germany was reunited on 1990-10-03, peacefully';
381 430
     if $str ~~ m/ (\d**4) \- (\d\d) \- (\d\d) / {
382  
-        say "Year:  ", $/[0];
383  
-        say "Month: ", $/[1];
384  
-        say "Day:   ", $/[2];
  431
+        say 'Year:  ', $/[0];
  432
+        say 'Month: ', $/[1];
  433
+        say 'Day:   ', $/[2];
385 434
         # usage as an array:
386 435
         say $/.join('-');       # prints 1990-10-03
387 436
     }
388 437
 
389 438
 =end programlisting
390 439
 
  440
+X<regex; quantified capture>
391 441
 
392  
-If a capture is quantified, the corresponding entry in the match object is a
  442
+If you quantify a capture, the corresponding entry in the match object is a
393 443
 list of other match objects:
394 444
 
395 445
 =begin programlisting
@@ -404,21 +454,25 @@ list of other match objects:
404 454
 
405 455
 This prints
406 456
 
  457
+=begin screen
  458
+
407 459
     list: eggs | milk | sugar
408 460
     end:  flour
409 461
 
410  
-To the screen. The first capture, C<(\w+)>, was quantified, and thus C<$/[0]>
411  
-is a list on which we can call the C<.join> method. Regardless how many
412  
-times the first capture matches, the second is still available in C<$/[1]>.
  462
+=end screen
  463
+
  464
+The first capture, C<(\w+)>, was quantified, and thus C<$/[0]> is a list on
  465
+which the code calls the C<.join> method. Regardless of how many times the
  466
+first capture matches, the second is still available in C<$/[1]>.
413 467
 
414 468
 As a shortcut, C<$/[0]> is also available under the name C<$0>, C<$/[1]> as
415  
-C<$1> and so on. These aliases are also available inside the regex. This
416  
-allows us to write a regex that detects a rather common error when writing a
417  
-text: an accidentally duplicated word.
  469
+C<$1>, and so on. These aliases are also available inside the regex. This
  470
+allows you to write a regex that detects the common error of duplicated
  471
+words:
418 472
 
419 473
 =begin programlisting
420 474
 
421  
-    my $s = 'the quick brown fox jumped over the the lazy dog';
  475
+    my $s = 'the quick brown fox jumped over B<the the> lazy dog';
422 476
 
423 477
     if $s ~~ m/ « ( \w+ ) \W+ $0 » / {
424 478
         say "Found two '$0' in a row";
@@ -427,52 +481,61 @@ text: an accidentally duplicated word.
427 481
 =end programlisting
428 482
 
429 483
 The regex first anchors to a left word boundary with C<«> so that it doesn't
430  
-match partial duplication of words. Then a word is captured C<( \w+ )>,
431  
-followed by at least one non-word character C<\W+> (which implies a right word
432  
-boundary, so no need to use an explicit one here), and then followed by
433  
-previously matched word, terminated by another word boundary.
  484
+match partial duplication of words.  Next, the regex captures a word (C<( \w+
  485
+)>), followed by at least one non-word character C<\W+>.  This implies a right
  486
+word boundary, so there is no need to use an explicit boundary.  Then it
  487
+matches the previous capture followed by a right word boundary.
434 488
 
435  
-Without the first word boundary anchor the regex would for example match
436  
-I<strB<and and> beach>, without the last word boundary anchor it would also
437  
-match I<B<the the>ory>.
  489
+Without the first word boundary anchor, the regex would for example match I<<
  490
+strB<and and> beach >> or I<< laB<the the> table leg >>.  Without the last
  491
+word boundary anchor it would also match I<< B<the the>ory >>.
438 492
 
439 493
 =head1 Named regexes
440 494
 
441  
-You can declare regexes just like subroutines, and give them names. Suppose
442  
-you found the previous example useful, and wanted to make it available easily.
443  
-Also you don't like the fact that doesn't catch two C<doesn't> or C<isn't> in
444  
-a row, so you want to extend it a bit:
  495
+X<regex; named>
  496
+
  497
+You can declare regexes just like subroutines and even name them. Suppose you
  498
+found the previous example useful and want to make it available easily.
  499
+Suppose also you want to extend it to handle contractions such as C<doesn't>
  500
+or C<isn't>:
445 501
 
446 502
 =begin programlisting
447 503
 
448 504
     regex word { \w+ [ \' \w+]? }
449  
-    regex dup { « <word> \W+ $<word> » }
  505
+    regex dup  { « <word> \W+ $<word> » }
  506
+
450 507
     if $s ~~ m/ <dup> / {
451 508
         say "Found '{$<dup><word>}' twice in a row";
452 509
     }
453 510
 
454 511
 =end programlisting
455 512
 
456  
-Here we introduce a regex with name C<word>, which matches at least one word
  513
+X<regex; backreference>
  514
+
  515
+This code introduces a regex named C<word>, which matches at least one word
457 516
 character, optionally followed by a single quote and more word characters. Another regex called C<dup>
458  
-(short for I<duplicate>) is anchored at a word boundary, then calls the regex
459  
-C<word> by putting it in angle brackets, then matches at least one non-word
460  
-character, and then matches the same string as previously matched by the regex
461  
-C<word>.  After that another word boundary is required.  The syntax for this 
462  
-I<backreference> is a dollar, followed by the name of the named regex in angle
463  
-brackets.
464  
-
465  
-In the mainline code C<< $<dup> >>, short for C<$/{'dup'}>, accesses the match
466  
-object that the regex C<dup> produced. C<dup> also has a subrule called C<word>,
467  
-and the match object produced from that call is accessible as
  517
+(short for I<duplicate>) is anchored at a word boundary.  It calls the regex
  518
+C<word> (via C<< <word> >>), matches at least one non-word character, and then
  519
+matches the same string as previously matched by the regex C<word>.  It ends
  520
+with another word boundary.  The syntax for this I<backreference> is a dollar
  521
+sign followed by the name of the named regex in angle brackets.
  522
+
  523
+X<subrule>
  524
+X<regex; subrule>
  525
+
  526
+Within the C<if> block, C<< $<dup> >> is short for C<$/{'dup'}>.  It accesses
  527
+the match object that the regex C<dup> produced. C<dup> also has a subrule
  528
+called C<word>, and the match object produced from that call is accessible as
468 529
 C<< $<dup><word> >>.
469 530
 
470  
-Named regexes make it easy to organize complex regexes in smaller pieces, just
471  
-as subroutines allow for ordinary code.
  531
+Just as subroutines allow for ordinary code, named regexes make it easy to
  532
+organize complex regexes in smaller pieces.
472 533
 
473 534
 =head1 Modifiers
474 535
 
475  
-A previously used example to match a list of words was
  536
+X<regex; modifiers>
  537
+
  538
+The previous example to match a list of words was:
476 539
 
477 540
 =begin programlisting
478 541
 
@@ -480,14 +543,18 @@ A previously used example to match a list of words was
480 543
 
481 544
 =end programlisting
482 545
 
483  
-This works, but it is kinda clumsy - all these C<\s*> could be left out if we
484  
-had a way to just say "allow whitespaces anywhere". Since this is quite
485  
-common, Perl 6 regexes provide such an option: the C<:sigspace> modifier,
486  
-short C<:s>
  546
+X<regex; :sigspace modifier>
  547
+X<regex; :s modifier>
  548
+
  549
+This works, but the repeated "I don't care about whitespace" units are clumsy.
  550
+The desire to say "allow whitespace I<anywhere>" is common, and Perl 6 regexes
  551
+provide such an option: the
  552
+C<:sigspace> modifier (shortened to C<:s>):
487 553
 
488 554
 =begin programlisting
489 555
 
490 556
     my $ingredients = 'eggs, milk, sugar and flour';
  557
+
491 558
     if $ingredients ~~ m/:s ( \w+ ) ** \,'and' (\w+)/ {
492 559
         say 'list: ', $/[0].join(' | ');
493 560
         say 'end:  ', $/[1];
@@ -495,10 +562,13 @@ short C<:s>
495 562
 
496 563
 =end programlisting
497 564
 
498  
-It allows optional whitespaces in the text wherever there is one or more
499  
-whitespace in the pattern. Actually it's even a bit cleverer than that:
500  
-between two word characters whitespaces are not optional, but mandatory;
501  
-so the regex above does not match the string C<eggs, milk, sugarandflour>.
  565
+This modifier allows optional whitespace in the text wherever there is one or
  566
+more whitespace characters in the pattern. It's even a bit cleverer than that:
  567
+between two word characters whitespace is mandatory.  The regex does I<not>
  568
+match the string C<eggs, milk, sugarandflour>.
  569
+
  570
+X<regex; :ignorecase modifier>
  571
+X<regex; :i>
502 572
 
503 573
 The C<:ignorecase> or C<:i> modifier makes the regex insensitive to upper and
504 574
 lower case, so C<m/ :i perl /> matches not only C<perl>, but also C<PerL> or
@@ -507,39 +577,44 @@ letters).
507 577
 
508 578
 =head1 Backtracking control
509 579
 
510  
-In the course of matching a regex against a string, the regex engine may
511  
-reach a point where an alternation has matched a particular branch
512  
-or a quantifier has greedily matched all it can but the final portion of
513  
-the regex fails to match. So, the regex engine backs up and attempts to
514  
-match another alternative or matches one less character on the
515  
-quantified portion to see if the overall regex succeeds. This process of
516  
-failing and trying again is called I<backtracking>.
517  
-
518  
-For example matching C<m/\w+ 'en'/> against the string C<oxen> makes the
519  
-C<\w+> group first match the whole string (because of the greediness of
520  
-C<+>), but then the C<en> literal at the end can't match anything. So
521  
-C<\w+> gives up one character, and now matches C<oxe>. Still, C<en> can't
522  
-match, so the C<\w+> group again gives up one character and now matches
523  
-C<ox>. The C<en> literal can now match the last two characters of the
524  
-string, and the overall match succeeds.
525  
-
526  
-While backtracking is often what one wants, and very convenient, it can also
527  
-be slow, and sometimes confusing. A colon C<:> switches off backtracking for
528  
-the previous quantifier or alternation. So C<m/ \w+: 'en'/> can never match
529  
-any string, because the C<\w+> always eats up all word characters, and never
530  
-releases them.
  580
+X<regex; backtracking>
  581
+
  582
+In the course of matching a regex against a string, the regex engine may reach
  583
+a point where an alternation has matched a particular branch or a quantifier
  584
+has greedily matched all it can but the final portion of the regex fails to
  585
+match.  In this case, the regex engine backs up and attempts to match another
  586
+alternative or matches one fewer character on the quantified portion to see if
  587
+the overall regex succeeds. This process of failing and trying again is called
  588
+I<backtracking>.
  589
+
  590
+When matching C<m/\w+ 'en'/> against the string C<oxen>, the C<\w+> group
  591
+first matches the whole string (because of the greediness of C<+>), but then
  592
+the C<en> literal at the end can't match anything.  C<\w+> gives up one
  593
+character to match C<oxe>.  C<en> still can't match, so the C<\w+> group again
  594
+gives up one character and now matches C<ox>. The C<en> literal can now match
  595
+the last two characters of the string, and the overall match succeeds.
  596
+
  597
+X<regex; :>
  598
+X<regex; disable backtracking>
  599
+
  600
+While backtracking is often useful and convenient, it can also be slow and
  601
+confusing. A colon C<:> switches off backtracking for the previous quantifier
  602
+or alternation. So C<m/ \w+: 'en'/> can never match any string, because the
  603
+C<\w+> always eats up all word characters, and never releases them.
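
The difference the colon makes can be checked with a sketch like the following, assuming a current Rakudo compiler. Both helper names are hypothetical and only label the two variants of the C<oxen> pattern.

```raku
# Hypothetical helpers: the same pattern with and without backtracking.
sub with-backtracking(Str $text)    { so $text ~~ m/ \w+  'en' / }
sub without-backtracking(Str $text) { so $text ~~ m/ \w+: 'en' / }

# \w+ first takes all of 'oxen', then gives back 'n' and 'e' so 'en' can match.
say with-backtracking('oxen');

# With the colon, \w+ keeps every word character it took, so 'en' never matches.
say without-backtracking('oxen');
```

Because the second pattern is unanchored, the engine still retries it at later start positions, but C<\w+:> swallows the rest of the word each time.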
  604
+
  605
+X<regex; :ratchet>
531 606
 
532 607
 The C<:ratchet> modifier disables backtracking for a whole regex, which is
533  
-often desirable in a small regex that is called from others regexes. When
534  
-searching for duplicate words, we had to anchor the regex to word boundaries,
535  
-because C<\w+> would allow matching only part of a word. By disabling
536  
-backtracking we get the more intuitive behavior that C<\w+> always matches a
537  
-full word:
  608
+often desirable in a small regex that other regexes call frequently.  The
  609
+duplicate word search regex had to anchor the regex to word boundaries,
  610
+because C<\w+> would allow matching only part of a word. Disabling
  611
+backtracking produces simpler behavior where C<\w+> always matches a full
  612
+word:
538 613
 
539 614
 =begin programlisting
540 615
 
541 616
     regex word { :ratchet \w+ [ \' \w+]? }
542  
-    regex dup { <word> \W+ $<word> }
  617
+    regex dup  { <word> \W+ $<word> }
543 618
 
544 619
     # no match, doesn't match the 'and'
545 620
     # in 'strand' without backtracking
@@ -547,21 +622,27 @@ full word:
547 622
 
548 623
 =end programlisting
549 624
 
550  
-However the effect of C<:ratchet> is limited to the regex it stands in - the
551  
-outer one still backtracks, and can also retry the regex C<word> at a
552  
-different staring position.
  625
+However the effect of C<:ratchet> applies only to the regex in which it
  626
+appears.  The outer regex still backtracks, and can also retry the regex
  627
+C<word> at a different starting position.
  628
+
  629
+X<regex; token>
  630
+X<token>
553 631
 
554 632
 The C<regex { :ratchet ... }> pattern is so common that it has its own shortcut:
555  
-C<token { ... }>. So you'd typically write the previous example as
  633
+C<token { ... }>.  The duplicate word searcher is idiomatic when written:
556 634
 
557 635
 =begin programlisting
558 636
 
559  
-    token word { \w+ [ \' \w+]? }
560  
-    regex dup { <word> \W+ $<word> }
  637
+    B<token> word { \w+ [ \' \w+]? }
  638
+    regex dup  { <word> \W+ $<word> }
561 639
 
562 640
 =end programlisting
563 641
 
564  
-A token that also switches on the C<:sigspace> modifier is called a C<rule>.
  642
+X<regex; rule>
  643
+X<rule>
  644
+
  645
+A token that also switches on the C<:sigspace> modifier is a C<rule>:
565 646
 
566 647
 =begin programlisting
567 648
 
@@ -571,10 +652,13 @@ A token that also switches on the C<:sigspace> modifier is called a C<rule>.
571 652
 
572 653
 =head1 Substitutions
573 654
 
574  
-Regexes are not only popular for data validation and extraction, but
575  
-also data manipulation. The C<subst> method matches a regex against a
576  
-string, and if a match is found, substitutes the portion of the string
577  
-that matches with its second argument.
  655
+X<subst>
  656
+X<substitutions>
  657
+
  658
+Regexes are not only popular for data validation and extraction, but also data
  659
+manipulation. The C<subst> method matches a regex against a string.  If it
  660
+finds a match, it substitutes the portion of the string that matches
  661
+with its second argument.
578 662
 
579 663
 =begin programlisting
580 664
 
@@ -584,34 +668,40 @@ that matches with its second argument.
584 668
 
585 669
 =end programlisting
586 670
 
587  
-The C<:g> at the end tells the substitution to work I<globally>, so that every
588  
-match of regex is replaced. Without C<:g> it stops after the first match.
  671
+X<regex; :g>
  672
+X<regex; global substitution>
  673
+
  674
+The C<:g> at the end tells the substitution to work I<globally> to replace
  675
+every match. Without C<:g>, it stops after the first match.
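A minimal sketch of the difference, assuming a current Rakudo compiler; the sample string is invented for illustration:

```raku
my $path = 'usr-local-bin';

# Without :g only the first hyphen is replaced.
my $first-only = $path.subst(rx/ '-' /, '/');

# With :g every hyphen is replaced.
my $every = $path.subst(rx/ '-' /, '/', :g);

say $first-only;   # usr/local-bin
say $every;        # usr/local/bin
```

Note that C<subst> returns a new string and leaves C<$path> unchanged.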
589 676
 
590  
-Note that the regex was constructed with C<rx/ ... /> rather than C<m/ ... />.
591  
-The former constructs a regex object, the latter not only constructs the regex
592  
-object, but immediately matches it against the topic variable C<$_>.
593  
-Had we used C<m/ ... /> in the call to C<subst>, a match object would
594  
-have been passed as the first argument rather than the regex itself.
  677
+X<operators; rx//>
  678
+X<operators; m//>
595 679
 
596  
-=head1 Other regex features
  680
+Note the use of C<rx/ ... /> rather than C<m/ ... /> to construct the regex.
  681
+The former constructs a regex object. The latter not only constructs the regex
  682
+object, but immediately matches it against the topic variable C<$_>.  Using
  683
+C<m/ ... /> in the call to C<subst> creates a match object and passes it as
  684
+the first argument, rather than the regex itself.
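The distinction can be checked by asking each value for its type. This is a small sketch, assuming a current Rakudo compiler; the variable names are illustrative only.

```raku
my ($pattern-type, $result-type);

given 'needle in a haystack' {
    my $pattern = rx/ needle /;   # built, but not yet run against anything
    my $result  = m/ needle /;    # built and matched against $_ right away

    $pattern-type = $pattern.^name;
    $result-type  = $result.^name;
}

say $pattern-type;   # Regex
say $result-type;    # Match
```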
597 685
 
598  
-Sometimes you want to call other regexes, but don't want them to capture
599  
-the matched text, for example when parsing a programming language you might
600  
-discard whitespaces and comments. You can achieve that by calling the regex
601  
-as C<< <.otherrule> >>.
  686
+=head1 Other Regex Features
602 687
 
603  
-For example if you use the C<:sigspace> modifier, every continuous piece of
604  
-whitespaces is internally replaced by C<< <.ws> >>, which means you can
605  
-provide a different idea of what a whitespace is - more on that in
606  
-$theGrammarChapter.
  688
+X<regex; avoid captures>
607 689
 
608  
-Sometimes you just want to take a look ahead, and check if the
609  
-next characters fulfill some properties -- but without actually consuming
610  
-them, so that the following parts of the regex can still match them.
  690
+Sometimes you want to call other regexes, but don't want them to capture the
  691
+matched text.  For example, when parsing a programming language you might
  692
+discard whitespaces and comments. You can achieve that by calling the regex as
  693
+C<< <.otherrule> >>.
611 694
 
612  
-A common use for that are substitutions. In normal English text you always place
613  
-a whitespace after a comma, and if somebody forgets to add that whitespace, a
614  
-regex can clean up after the lazy writer:
  695
+For example, if you use the C<:sigspace> modifier, every contiguous run of
  696
+whitespace calls the built-in rule C<< <.ws> >>.  This use of a rule rather
  697
+than a character class allows you to define your own version of whitespace
  698
+characters (see L<grammars>).
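The effect of the leading dot is easiest to see inside a grammar, a construct covered in its own chapter. The sketch below assumes a current Rakudo compiler; the C<Assignment> grammar and its token names are hypothetical.

```raku
grammar Assignment {
    token TOP   { <key> <.sep> <value> }   # <.sep> matches but is not captured
    token key   { \w+ }
    token sep   { \s* '=' \s* }
    token value { \w+ }
}

my $m = Assignment.parse('colour = red');

# Only the calls without a leading dot show up as named captures.
my @names = $m.hash.keys.sort;
say @names;        # [key value]
say ~$m<key>;      # colour
say ~$m<value>;    # red
```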
  699
+
  700
+Sometimes you just want to take a look ahead, and check if the next characters
  701
+fulfill some properties without actually consuming them, so that the following
  702
+parts of the regex can still match them.  This is common in substitutions. In
  703
+normal English text, you always place a whitespace after a comma.  If somebody
  704
+forgets to add that whitespace, a regex can clean up after the lazy writer:
615 705
 
616 706
 =begin programlisting
617 707
 
@@ -621,14 +711,15 @@ regex can clean up after the lazy writer:
621 711
 
622 712
 =end programlisting
623 713
 
624  
-The word character after the comma is not part of the match, because it
625  
-is in a look-ahead, which C<< <?before ... > >> introduces. The leading
626  
-question mark indicates an I<zero width assertion>, that is a rule that
627  
-never uses up characters from the matched string.
  714
+X<regex; lookahead>
  715
+X<regex; zero-width assertion>
628 716
 
629  
-In fact you can turn any call to a subrule into an zero width assertion.
630  
-The built-in token C<< <alpha> >> matches an alphabetic character, so
631  
-you could write the example above as
  717
+The word character after the comma is not part of the match, because it is in
  718
+a look-ahead, which C<< <?before ... > >> introduces. The leading question
  719
+mark indicates an I<zero-width assertion>: a rule that never consumes
  720
+characters from the matched string.  You can turn any call to a subrule into
  721
+a zero-width assertion.  The built-in token C<< <alpha> >> matches an
  722
+alphabetic character, so you can rewrite this example as:
632 723
 
633 724
 =begin programlisting
634 725
 
@@ -636,8 +727,9 @@ you could write the example above as
636 727
 
637 728
 =end programlisting
638 729
 
639  
-instead. With an exclamation mark the meaning is negated, so yet another way
640  
-to write it is
  730
+X<regex; negative look-ahead assertion>
  731
+
  732
+A leading exclamation mark negates the meaning; another variant is:
641 733
 
642 734
 =begin programlisting
643 735
 
@@ -645,6 +737,12 @@ to write it is
645 737
 
646 738
 =end programlisting
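The three kinds of assertion can be compared side by side. This is a sketch assuming a current Rakudo compiler; the sample sentence is invented, and the patterns are illustrations of the syntax rather than a copy of the chapter's listings.

```raku
my $text = 'milk,flour,eggs, and sugar';

# Positive look-ahead: a comma directly followed by a word character.
my $with-before = $text.subst(/ ',' <?before \w> /, ', ', :g);

# Any subrule call can become a zero-width assertion with a leading '?'.
my $with-alpha = $text.subst(/ ',' <?alpha> /, ', ', :g);

# A leading '!' negates: a comma that is not followed by whitespace.
my $with-negation = $text.subst(/ ',' <!space> /, ', ', :g);

say $with-before;     # milk, flour, eggs, and sugar
say $with-alpha;      # milk, flour, eggs, and sugar
say $with-negation;   # milk, flour, eggs, and sugar
```

The variants are not strictly equivalent: the negated form would also match a comma at the very end of the string, where nothing follows at all.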
647 739
 
  740
+=for author
  741
+
  742
+The first sentence of the next paragraph confuses me.
  743
+
  744
+=end for
  745
+
648 746
 A look in the opposite direction is also possible, with C<< <?after> >>. In
649 747
 fact many built-in anchors can be written with look-ahead and look-behind
650 748
 assertions, though usually not quite as efficiently:
@@ -724,38 +822,51 @@ assertions, though usually not quite as efficient:
724 822
 
725 823
 =end programlisting
726 824
 
727  
-Every regex match returns an object of type C<Match>. Evaluated in boolean
728  
-context, such a match object returns C<True> for successful matches and
729  
-C<False> for failed ones. Most properties are only interesting after
730  
-successful matches, so we'll concentrate on those.
  825
+X<regex; Match object>
  826
+X<Match>
  827
+
  828
+Every regex match returns an object of type C<Match>. In boolean context, a
  829
+match object returns C<True> for successful matches and C<False> for failed
  830
+ones. Most properties are only interesting after successful matches.
  831
+
  832
+X<Match.orig>
  833
+X<Match.from>
  834
+X<Match.to>
731 835
 
732  
-The C<orig> method returns the string that was matched against, C<from> and
733  
-C<to> the positions of the start point and end point of the match.
  836
+The C<orig> method returns the string that was matched against.  The C<from>
  837
+and C<to> methods return the positions of the start and end points of the
  838
+match.
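A short sketch of these three methods, assuming a current Rakudo compiler and an invented sample string:

```raku
my $m = 'hello world' ~~ / wor /;

say $m.orig;   # hello world
say $m.from;   # 6
say $m.to;     # 9

# The matched text is the slice of the original between from and to.
my $slice = $m.orig.substr($m.from, $m.to - $m.from);
say $slice;    # wor
```

Positions are zero-based, and C<to> points just past the last matched character.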
734 839
 
735  
-In the example above the C<line-and-column> function determines the line
736  
-number the match occurred in, by extracting the string up to the match
737  
-position (C<$m.orig.substr(0, $m.from)>), splitting it by newlines and
738  
-counting the elements. The column is determined by searching backwards from
739  
-the match position, and calculating the difference to the match position.
  840
+In the previous example, the C<line-and-column> function determines the line
  841
+number in which the match occurred by extracting the string up to the match
  842
+position (C<$m.orig.substr(0, $m.from)>), splitting it by newlines, and
  843
+counting the elements. It calculates the column by searching backwards from
  844
+the match position and calculating the difference to the match position.
740 845
 
741 846
 =begin sidebar
742 847
 
743 848
 The C<rindex> method searches a string for another substring, starting at the
744  
-end of the string, moving forward until the search string is found. It returns
745  
-the position of search string.
  849
+end of the string, and moving backward until it finds the search string. It
  850
+returns the position of the search string.
746 851
 
747 852
 =end sidebar
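The chapter's own C<line-and-column> listing lies outside this excerpt. What follows is a hypothetical reconstruction from the prose description alone, assuming a current Rakudo compiler; the real function may differ in detail.

```raku
# Hypothetical helper: returns the 1-based line and column of a match.
sub line-and-column(Match $m) {
    # Everything before the match; one element per line when split.
    my $before = $m.orig.substr(0, $m.from);
    my $line   = $before.split("\n").elems;

    # Last newline at or before the match; -1 when the match is on line 1.
    my $newline-pos = $m.orig.rindex("\n", $m.from) // -1;
    my $column      = $m.from - $newline-pos;

    return $line, $column;
}

my $text = "first line\nsecond line";
my $m    = $text ~~ / line $ /;

my ($line, $column) = line-and-column($m);
say "line $line, column $column";   # line 2, column 8
```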
748 853
 
749  
-Using a match object as an array yields access to the positional captures,
750  
-using it as a hash reveals the named captures - which is what C<< $<dup> >>
751  
-was doing in the previous example -- it is a shortcut for C<< $/<dup> >> or
752  
-C<< $/{ 'dup' } >>. These captures are again C<Match> objects, so
753  
-match objects are really trees of matches.
  854
+X<Match; access as a hash>
  855
+X<named captures>
  856
+X<regex; named captures>
  857
+
  858
+Using a match object as an array yields access to the positional captures.
  859
+Using it as a hash reveals the named captures.  In the previous example,
  860
+C<< $<dup> >> is a shortcut for C<< $/<dup> >> or C<< $/{ 'dup' } >>. These
  861
+captures are again C<Match> objects, so match objects are really trees of
  862
+matches.
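The different access styles can be sketched as follows, assuming a current Rakudo compiler; the date string and the capture name C<year> are invented for illustration.

```raku
my $m = '2010-02-26' ~~ / $<year>=[\d+] '-' (\d+) /;

# Hash-style access reaches named captures; both spellings are equivalent.
say ~$m<year>;        # 2010
say ~$m{'year'};      # 2010

# Array-style access reaches positional captures.
say ~$m[0];           # 02

# Each capture is itself a Match object.
say $m<year>.^name;   # Match
```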
  863
+
  864
+X<Match.caps>
754 865
 
755 866
 The C<caps> method returns all captures, named and positional, in the order in
756 867
 which their matched text appears in the source string. The return value is a
757  
-list of C<Pair> object, the keys of which are the name or number of the
758  
-capture, the value the corresponding C<Match> object.
  868
+list of C<Pair> objects.  Each key is the name or number of a capture, and
  869
+each value is the corresponding C<Match> object.
759 870
 
760 871
 =begin programlisting
761 872
 
@@ -765,7 +876,7 @@ capture, the value the corresponding C<Match> object.
765 876
 
766 877
         }
767 878
     }
768  
-    
  879
+
769 880
     # Output:
770 881
     # 0 => a
771 882
     # alpha => b
@@ -774,16 +885,17 @@ capture, the value the corresponding C<Match> object.
774 885
 =end programlisting
775 886
 
776 887
 In this case the captures are in the same order as they are in the regex, but
777  
-quantifiers can change that. Still C<$/.caps> follows the ordering of the
778  
-string, not of the regex. If there is a part of the string that is matched
779  
-but not captured, it does not appear anywhere in the values that C<caps>
780  
-returned.
781  
-
782  
-If you want the non-captured parts too, you need to use C<$/.chunks> instead.
783  
-It returns both the captured and the non-captured part of the matched string,
784  
-in the same format as C<caps>, but with a tilde C<~> as key. So if there are
785  
-no overlapping captures (which could only come from look-around assertions),
786  
-the concatenation of all the pair values that C<chunks> returns is equal to
787  
-the matched part of the string.
  888
+quantifiers can change that. Even so, C<$/.caps> follows the ordering of the
  889
+string, not of the regex. Any part of the string that matches outside a
  890
+capture does not appear in the values that C<caps> returns.
  891
+
  892
+X<Match.chunks>
  893
+
  894
+To access the non-captured parts too, use C<$/.chunks> instead.  It returns
  895
+both the captured and the non-captured parts of the matched string, in the same
  896
+format as C<caps>, but with a tilde C<~> as key. If there are no overlapping
  897
+captures (which could only come from look-around assertions), the
  898
+concatenation of all the pair values that C<chunks> returns is the same as the
  899
+matched part of the string.
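A side-by-side sketch of C<caps> and C<chunks>, assuming a current Rakudo compiler; the pattern and the capture name C<second> are invented for illustration.

```raku
my $m = 'a, b' ~~ / (\w) ',' \s* $<second>=[\w] /;

# caps: captured parts only, in string order.
my @caps = $m.caps.map({ .key ~ ' => ' ~ .value });
say @caps.join('; ');     # 0 => a; second => b

# chunks: the uncaptured ', ' appears too, under the key '~'.
my @chunks = $m.chunks.map({ .key ~ ' => ' ~ .value });
say @chunks.join('; ');   # 0 => a; ~ => , ; second => b

# Without overlapping captures, the chunk values rebuild the matched text.
my $rebuilt = $m.chunks.map({ ~.value }).join;
say $rebuilt eq ~$m;      # True
```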
788 900
 
789 901
 =for vim: spell spelllang=en tw=78
