@@ -28,8 +28,8 @@ the colon is forbidden because it clashes with adverbs, such as C<rx:i/abc/>
 (case insensitive regexes), and round parentheses indicate a function call
 instead.
 
-Whitespace in regexes is generally ignored (except with the C<:s> or
-C<:sigspace> adverb).
+Whitespace in regexes is generally ignored (except with the C<:s> or,
+completely, C<:sigspace> adverb).
 
 As with Perl 6, in general, comments in regexes start with a hash character
 C<#> and go to the end of the line.
@@ -62,8 +62,8 @@ part of the string matches the regex:
     };
 
 Match results are stored in the C<$/> variable and are also returned from
-the match. The result is of L<type Match|/type/Match> if the match was successful;
-otherwise it's L<Nil|/type/Nil>.
+the match. The result is of L<type Match|/type/Match> if the match was
+successful; otherwise it's L<Nil|/type/Nil>.
 
 =head1 Wildcards and character classes
 
@@ -89,7 +89,7 @@ because there's no character to match before C<per> in the target string.
 There are predefined character classes of the form C<\w>. Its negation is
 written with an upper-case letter, C<\W>.
 
-=item X<\d and \D|regex,\d;regex,\D>
+=item X<C<\d> and C<\D>|regex,\d;regex,\D>
 
 C<\d> matches a single digit (Unicode property C<N>) and C<\D> matches a
 single character that is not a digit.
@@ -102,21 +102,21 @@ match C<\d>, but also digits from other scripts.
 
 Examples for digits are:
 
-=begin code :skip-test
+=begin code :lang<text>
 U+0035 5 DIGIT FIVE
-U+07C2 ߂ NKO DIGIT TWO
+U+0BEB ௫ TAMIL DIGIT FIVE
 U+0E53 ๓ THAI DIGIT THREE
-U+1B56 ᭖ BALINESE DIGIT SIX
+U+17E5 ៥ KHMER DIGIT FIVE
 =end code
 
-=item X<\h and \H|regex,\h;regex,\H>
+=item X<C<\h> and C<\H>|regex,\h;regex,\H>
 
 C<\h> matches a single horizontal whitespace character. C<\H> matches a
 single character that is not a horizontal whitespace character.
 
 Examples for horizontal whitespace characters are
 
-=begin code :skip-test
+=begin code :lang<text>
 U+0020 SPACE
 U+00A0 NO-BREAK SPACE
 U+0009 CHARACTER TABULATION
@@ -126,14 +126,14 @@ Examples for horizontal whitespace characters are
 Vertical whitespace like newline characters are explicitly excluded; those
 can be matched with C<\v>, and C<\s> matches any kind of whitespace.
 
-=item X<\n and \N|regex,\n;regex,\N>
+=item X<C<\n> and C<\N>|regex,\n;regex,\N>
 
 C<\n> matches a single, logical newline character. C<\n> is supposed to also
 match a Windows CR LF codepoint pair; though it's unclear whether the magic
 happens at the time that external data is read, or at regex match time.
 C<\N> matches a single character that's not a logical newline.
 
-=item X<\s and \S|regex,\s;regex,\S>
+=item X<C<\s> and C<\S>|regex,\s;regex,\S>
 
 C<\s> matches a single whitespace character. C<\S> matches a single
 character that is not whitespace.
@@ -142,20 +142,20 @@ character that is not whitespace.
         say ~$/; # OUTPUT: «word»
     }
 
-=item X<\t and \T|regex,\t;regex,\T>
+=item X<C<\t> and C<\T>|regex,\t;regex,\T>
 
 C<\t> matches a single tab/tabulation character, C<U+0009>. (Note that
 exotic tabs like the C<U+000B VERTICAL TABULATION> character are not
 included here). C<\T> matches a single character that is not a tab.
 
-=item X<\v and \V|regex,\v;regex,\V>
+=item X<C<\v> and C<\V>|regex,\v;regex,\V>
 
 C<\v> matches a single vertical whitespace character. C<\V> matches a single
 character that is not vertical whitespace.
 
 Examples for vertical whitespace characters:
 
-=begin code :skip-test
+=begin code :lang<text>
 U+000A LINE FEED
 U+000B VERTICAL TABULATION
 U+000C FORM FEED
@@ -167,15 +167,15 @@ Examples for vertical whitespace characters:
 
 Use C<\s> to match any kind of whitespace, not just vertical whitespace.
 
-=item X<\w and \W|regex,\w;regex,\W>
+=item X<C<\w> and C<\W>|regex,\w;regex,\W>
 
 C<\w> matches a single word character; i.e., a letter (Unicode category L), a
 digit or an underscore. C<\W> matches a single character that isn't a word
 character.
 
 Examples of word characters:
 
-=begin code :skip-test
+=begin code :lang<text>
 0041 A LATIN CAPITAL LETTER A
 0031 1 DIGIT ONE
 03B4 δ GREEK SMALL LETTER DELTA
@@ -185,37 +185,37 @@ Examples of word characters:
 
 Predefined subrules:
 
-=begin code :skip-test
-<alnum>   \w       'alpha' plus 'digit'
+=begin code :lang<text>
 <alpha>   <:L>     Alphabetic characters
-<blank>   \h       Horizontal whitespace
-<cntrl>            Control characters
 <digit>   \d       Decimal digits
-<graph>            'alnum' plus 'punct'
-<lower>   <:Ll>    Lowercase characters
-<print>            'graph' plus 'space', but no 'cntrl'
+<xdigit>           Hexadecimal digit [0-9A-Fa-f]
+<alnum>   \w       'alpha' plus 'digit'
 <punct>            Punctuation and Symbols (only Punct beyond ASCII)
+<graph>            'alnum' plus 'punct'
 <space>   \s       Whitespace
+<cntrl>            Control characters
+<print>            'graph' plus 'space', but no 'cntrl'
+<blank>   \h       Horizontal whitespace
+<lower>   <:Ll>    Lowercase characters
 <upper>   <:Lu>    Uppercase characters
 <?same>            Matches between two identical characters
 <?wb>              Word Boundary (zero-width assertion, ? suppress capture)
 <?ww>              Within Word (zero-width assertion, ? suppress capture)
-<xdigit>           Hexadecimal digit [0-9A-Fa-f]
 =end code
 
 =head2 X«Unicode properties|regex,<:property>»
 
 The character classes mentioned so far are mostly for convenience; another
-approach is to use Unicode character properties. These come in the form C<<
-<:property> >>, where C<property> can be a short or long Unicode General
+approach is to use Unicode character properties. These come in the form
+C«<:property>», where C<property> can be a short or long Unicode General
 Category name. These use pair syntax.
 
 To match against a Unicode Property:
 
     "a".uniprop('Script'); # OUTPUT: «Latin»
-    "a" ~~ / <:Script<Latin>> /;
+    "a" ~~ / <:Script<Latin>> /; # OUTPUT: «「a」»
     "a".uniprop('Block'); # OUTPUT: «Basic Latin»
-    "a" ~~ / <:Block('Basic Latin')> /;
+    "a" ~~ / <:Block('Basic Latin')> /; # OUTPUT: «「a」»
 
 The following list of Unicode General Categories is stolen from the Perl 5
 L<perlunicode|http://perldoc.perl.org/perlunicode.html> documentation:
@@ -267,9 +267,9 @@ L<perlunicode|http://perldoc.perl.org/perlunicode.html> documentation:
 
 =end table
 
-For example, C<< <:Lu> >> matches a single, upper-case letter.
+For example, C«<:Lu>» matches a single, upper-case letter.
 
-It's negation is this: C<< <:!property> >>. So, C<< <:!Lu> >> matches a single
+It's negation is this: C«<:!property>». So, C«<:!Lu>» matches a single
 character that isn't an upper-case letter.
 
 Categories can be used together, with an infix operator:
@@ -287,7 +287,7 @@ Categories can be used together, with an infix operator:
 =end table
 
 To match either a lower-case letter or a number, write
-C<< <:Ll+:N> >> or C<< <:Ll+:Number> >> or C<< <+ :Lowercase_Letter + :Number> >>.
+C«<:Ll+:N>» or C«<:Ll+:Number>» or C«<+ :Lowercase_Letter + :Number>».
 
 It's also possible to group categories and sets of categories with
 parentheses; for example:
@@ -297,20 +297,20 @@ parentheses; for example:
 =head2 X«Enumerated character classes and ranges|regex,<[ ]>;regex,<-[ ]>»
 
 Sometimes the pre-existing wildcards and character classes are not enough.
-Fortunately, defining your own is fairly simple. Within C<< <[ ]> >>, you
+Fortunately, defining your own is fairly simple. Within C«<[ ]>», you
 can put any number of single characters and ranges of characters (expressed
 with two dots between the end points), with or without whitespace.
 
     "abacabadabacaba" ~~ / <[ a .. c 1 2 3 ]> /;
     # Unicode hex codepoint range
     "ÀÁÂÃÄÅÆ" ~~ / <[ \x[00C0] .. \x[00C6] ]> /;
     # Unicode named codepoint range
-    "ÀÁÂÃÄÅÆ " ~~ / <[ \c[LATIN CAPITAL LETTER A WITH GRAVE] .. \c[LATIN CAPITAL LETTER AE] ]> /;
+    "αβγ " ~~ /<[ \c[GREEK SMALL LETTER ALPHA].. \c[GREEK SMALL LETTER GAMMA]]> /;
 
-Within the C<< < > >> you can use C<+> and C<-> to add or
+Within the C«< >» you can use C<+> and C<-> to add or
 remove multiple range definitions and
 even mix in some of the unicode categories above. You can also
-write the backslashed forms for character classes between the C<[ ]>.
+write the backslashed forms for character classes between the C<[ ]>.
 
     / <[\d] - [13579]> /;
     # starts with \d and removes odd ASCII digits, but not quite the same as