[regex] start to document Unicode character classes

moritz · moritz · commit 7ada73100734 · 2014-02-23T22:20:16.000+01:00
diff --git a/lib/regexes.pod b/lib/regexes.pod
@@ -180,6 +180,94 @@ Examples of word characters:
     03F3 ϳ GREEK LETTER YOT
     0409 Љ CYRILLIC CAPITAL LETTER LJE
 
+=head2 Unicode properties
+
+The character classes so far are mostly for convenience; a more systematic
+approach is the use of Unicode properties. They are written in the form
+C<< <:property> >>, where C<property> can be a short or long Unicode property
+name.
+
+The following list is stolen from the Perl 5
+L<perlunicode|http://perldoc.perl.org/perlunicode.html> documentation:
+
+=begin table
+
+    Short       Long
+    =====       =====
+    L           Letter
+    LC          Cased_Letter
+    Lu          Uppercase_Letter
+    Ll          Lowercase_Letter
+    Lt          Titlecase_Letter
+    Lm          Modifier_Letter
+    Lo          Other_Letter
+
+    M           Mark
+    Mn          Nonspacing_Mark
+    Mc          Spacing_Mark
+    Me          Enclosing_Mark
+
+    N           Number
+    Nd          Decimal_Number (also Digit)
+    Nl          Letter_Number
+    No          Other_Number
+
+    P           Punctuation (also Punct)
+    Pc          Connector_Punctuation
+    Pd          Dash_Punctuation
+    Ps          Open_Punctuation
+    Pe          Close_Punctuation
+    Pi          Initial_Punctuation
+                (may behave like Ps or Pe depending on usage)
+    Pf          Final_Punctuation
+                (may behave like Ps or Pe depending on usage)
+    Po          Other_Punctuation
+
+    S           Symbol
+    Sm          Math_Symbol
+    Sc          Currency_Symbol
+    Sk          Modifier_Symbol
+    So          Other_Symbol
+
+    Z           Separator
+    Zs          Space_Separator
+    Zl          Line_Separator
+    Zp          Paragraph_Separator
+
+    C           Other
+    Cc          Control (also Cntrl)
+    Cf          Format
+    Cs          Surrogate
+    Co          Private_Use
+    Cn          Unassigned
+
+=end table
+
+For example, C<< <:Lu> >> matches a single upper-case letter.
+
+Negation works as C<< <:!category> >>, so C<< <:!Lu> >> matches a single
+character that isn't an upper-case letter.
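+
+As a small sketch of these two forms (assuming an implementation that
+supports general-category properties; the sample strings are only
+illustrations):
+
+    say so 'A' ~~ /<:Lu>/;     # True:  'A' is an upper-case letter
+    say so 'a' ~~ /<:Lu>/;     # False: 'a' is a lower-case letter
+    say so 'a' ~~ /<:!Lu>/;    # True:  'a' is not an upper-case letter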
+
+Several categories can be combined with one of these infix operators:
+
+=begin table
+
+    Operator    Meaning
+    ========    =======
+    +           set union
+    |           set union
+    &           set intersection
+    -           set difference (first minus second)
+    ^           symmetric set difference / XOR
+
+=end table
+
+For example, to match either a lower-case letter or a number, one can write
+C<< <:Ll+:N> >> or C<< <:Ll+:Number> >> or C<< <+ :Lowercase_Letter + :Number> >>.
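+
+A short sketch of such a union, assuming an implementation that supports
+these properties and the C<+> operator (the sample strings are illustrative
+only):
+
+    say so 'x' ~~ /<:Ll+:N>/;  # True:  'x' is a lower-case letter
+    say so '7' ~~ /<:Ll+:N>/;  # True:  '7' is a number
+    say so 'X' ~~ /<:Ll+:N>/;  # False: 'X' is neither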
+
+(Grouping of set operations with round parentheses inside character classes
+is supposed to work, but is not supported by Rakudo at the time of writing.)
+
 =head2 Enumerated character classes and ranges
 
 TODO