Commit 7ada731

[regex] start to document Unicode character classes
1 parent 511b47b commit 7ada731

1 file changed

+88
-0
lines changed

lib/regexes.pod

Lines changed: 88 additions & 0 deletions
@@ -180,6 +180,94 @@ Examples of word characters:
03F3 ϳ GREEK LETTER YOT
0409 Љ CYRILLIC CAPITAL LETTER LJE

=head2 Unicode properties

The character classes so far are mostly for convenience; a more systematic
approach is the use of Unicode properties. They are written in the form
C<< <:property> >>, where C<property> can be a short or long Unicode property
name.

The following list is stolen from the Perl 5
L<perlunicode|http://perldoc.perl.org/perlunicode.html> documentation:

=begin table

Short  Long
=====  =====
L      Letter
LC     Cased_Letter
Lu     Uppercase_Letter
Ll     Lowercase_Letter
Lt     Titlecase_Letter
Lm     Modifier_Letter
Lo     Other_Letter

M      Mark
Mn     Nonspacing_Mark
Mc     Spacing_Mark
Me     Enclosing_Mark

N      Number
Nd     Decimal_Number (also Digit)
Nl     Letter_Number
No     Other_Number

P      Punctuation (also Punct)
Pc     Connector_Punctuation
Pd     Dash_Punctuation
Ps     Open_Punctuation
Pe     Close_Punctuation
Pi     Initial_Punctuation
       (may behave like Ps or Pe depending on usage)
Pf     Final_Punctuation
       (may behave like Ps or Pe depending on usage)
Po     Other_Punctuation

S      Symbol
Sm     Math_Symbol
Sc     Currency_Symbol
Sk     Modifier_Symbol
So     Other_Symbol

Z      Separator
Zs     Space_Separator
Zl     Line_Separator
Zp     Paragraph_Separator

C      Other
Cc     Control (also Cntrl)
Cf     Format
Cs     Surrogate
Co     Private_Use
Cn     Unassigned

=end table

For example, C<< <:Lu> >> matches a single upper-case letter.

Negation is written as C<< <:!category> >>, so C<< <:!Lu> >> matches a single
character that isn't an upper-case letter.
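
As a rough illustration of the two forms above, the following sketch tests a
few short strings against C<< <:Lu> >>, its long-name spelling, and its
negation. The sample strings and the use of C<so> to turn each match into a
Boolean are choices made for this example only.

```raku
# Illustrative sketch only; the sample strings are assumptions for this example.

# Short and long property names select the same characters.
say so 'A' ~~ / <:Lu> /;                 # True
say so 'A' ~~ / <:Uppercase_Letter> /;   # True

# A lower-case letter is not in Lu ...
say so 'a' ~~ / <:Lu> /;                 # False

# ... but it is matched by the negated property.
say so 'a' ~~ / <:!Lu> /;                # True

# A property matches one character at a time, so a quantifier
# is needed to match a run of upper-case letters.
my $m = 'see NASA run' ~~ / <:Lu>+ /;
say $m.Str;                              # NASA
```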

Several categories can be combined with one of these infix operators:

=begin table

Operator  Meaning
========  =======
+         set union
|         set union
&         set intersection
-         set difference (first minus second)
^         symmetric set difference / XOR

=end table

For example, to match either a lower-case letter or a number, one can write
C<< <:Ll+:N> >> or C<< <:Ll+:Number> >> or C<< <+ :Lowercase_Letter + :Number> >>.

(Grouping of set operations with round parentheses inside character classes is
supposed to work, but is not supported by Rakudo at the time of writing.)
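
The union and difference operators can be tried out with a short sketch like
the following. It deliberately sticks to C<+> and C<->; the variable names and
sample strings are made up for illustration.

```raku
# Illustrative sketch only; variable names and sample strings are assumptions.

# Union: a lower-case letter or any kind of number.
my $lower-or-number = / <:Ll+:N> /;
say so 'q' ~~ $lower-or-number;    # True
say so '7' ~~ $lower-or-number;    # True
say so 'Q' ~~ $lower-or-number;    # False

# Difference: any letter except the upper-case ones.
my $not-upper-letter = / <:L-:Lu> /;
say so 'q' ~~ $not-upper-letter;   # True
say so 'Q' ~~ $not-upper-letter;   # False
say so '7' ~~ $not-upper-letter;   # False
```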

=head2 Enumerated character classes and ranges

TODO
