Commit 81058c1

define extensible boundary syntax
The Unicode folks seem to want an extensible boundary syntax with \b, but we've abandoned \b for boundary, so it's now <|x> for various values of x. (And <!|x> is the negation, so no need for <|X>.) <?wb> is now <|w>.
1 parent 4ec52e3 commit 81058c1

1 file changed, 10 insertions(+), 2 deletions(-)

S05-regex.pod

Lines changed: 10 additions & 2 deletions
@@ -1775,6 +1775,14 @@ Note that a consequence of the previous section is that you also get
 
 for free, which fails if the current rule would match again at this location.
 
+=item *
+
+A leading C<|> indicates some kind of a zero-width boundary.
+
+    <|w>  word boundary
+    <|g>  grapheme boundary (always matches in grapheme mode)
+    <|c>  codepoint boundary (always matches in grapheme/codepoint mode)
+
 =back
 
 The following tokens include angles but are not required to balance:
@@ -1809,8 +1817,8 @@ These tokens are considered declarative, but may force backtracking behavior.
 
 A C<«> or C<<< << >>> token indicates a left word boundary. A C<»> or
 C<<< >> >>> token indicates a right word boundary. (As separate tokens,
-these need not be balanced.) Perl 5's C<\b> is replaced by a C<< <?wb> >>
-"word boundary" assertion, while C<\B> becomes C<< <!wb> >>. (None of
+these need not be balanced.) Perl 5's C<\b> is replaced by a C<< <|w> >>
+"word boundary" assertion, while C<\B> becomes C<< <!|w> >>. (None of
 these are dependent on the definition of C<< <.ws> >>, but only on the C<\w>
 definition of "word" characters. Non-space mark characters are ignored in
 calculating word properties of the preceding character. See TR18 1.4.)
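
As a rough illustration of the semantics described in the changed text, the Python sketch below models `<|w>` as a position where exactly one neighbour is a `\w` character, `<!|w>` as every other position, and the `«` and `»` tokens as the left-hand and right-hand halves of that test. It is not Perl 6 code and is not taken from the commit: the function names are hypothetical, Python's Unicode `\w` stands in for the Perl 6 definition of "word" characters, and the rule about ignoring non-space marks is not modelled.

```python
import re


def is_word(ch):
    """True if ch is a "word" character under Python's Unicode \\w (an assumption)."""
    return re.fullmatch(r"\w", ch) is not None


def _neighbours(text, i):
    """Return (before, after): whether the characters around position i are word characters."""
    before = i > 0 and is_word(text[i - 1])
    after = i < len(text) and is_word(text[i])
    return before, after


def word_boundaries(text):
    """Positions where <|w> would match: exactly one neighbour is a word character."""
    return [i for i in range(len(text) + 1)
            if _neighbours(text, i)[0] != _neighbours(text, i)[1]]


def non_word_boundaries(text):
    """Positions where the negation <!|w> would match: every position that is not a boundary."""
    boundaries = set(word_boundaries(text))
    return [i for i in range(len(text) + 1) if i not in boundaries]


def left_word_boundaries(text):
    """Positions modelling the left word boundary token: non-word before, word after."""
    return [i for i in range(len(text) + 1)
            if _neighbours(text, i) == (False, True)]


def right_word_boundaries(text):
    """Positions modelling the right word boundary token: word before, non-word after."""
    return [i for i in range(len(text) + 1)
            if _neighbours(text, i) == (True, False)]
```

For the string "hi there", `word_boundaries` returns positions 0, 2, 3 and 8; the left boundaries are 0 and 3, the right boundaries 2 and 8, and the remaining positions are where the negated assertion holds. None of this depends on any notion of whitespace, only on which characters count as `\w`.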
