Commit a455b11

old <a>**',' is now written <a>+%','
The separator syntax is now required to be a modifier on a quantifier, so the old <a> ** ',' is now written <a>+ % ','. This allows for other quantifiers like * or ** on the left as the basis of the constraint, while disambiguating use of ** to only work as a quantifier without overloading separator matching. The C<%> was chosen because it can be pronounced 'modulo', because it looks like the relationship of two things, and because it's unlikely to be confused with other regex forms.
1 parent e639b7d commit a455b11

1 file changed: +62 -48 lines changed

S05-regex.pod

@@ -18,7 +18,7 @@ Synopsis 5: Regexes and Rules
 Created: 24 Jun 2002

 Last Modified: 20 Sep 2011
-Version: 147
+Version: 148

 This document summarizes Apocalypse 5, which is about the new regex
 syntax. We now try to call them I<regex> rather than "regular
@@ -941,7 +941,7 @@ side of the complete quantifier. This space is considered significant
 under C<:sigspace>, and will be distributed as a call to <.ws> between
 all the elements of the match but not on either end.

-The next token will determine what kind of repetition is desired:
+The next token constrains how many times the pattern on the left must match.

 If the next thing is an integer, then it is parsed as either as an exact
 count or a range:
@@ -965,21 +965,51 @@ It is illegal to return a list, so this easy mistake fails:
 The closure form is always considered procedural, so the item it is
 modifying is never considered part of the longest token.

-If you supply any other atom (which may be quantified), it is
-interpreted as a separator (such as an infix operator), and the
-initial item is quantified by the number of times the separator is
-seen between items:
+=item *
+
+Negative range values are allowed, but only when modifying a reversible
+pattern (such as C<after> could match). For example, to search the
+surrounding 200 characters as defined by 'dot', you could say:
+
+/ . ** -100..100 <element> /
+
+Similarly, you can back up 50 characters with:

-<alt> ** '|' # repetition controlled by presence of character
-<addend> ** <addop> # repetition controlled by presence of subrule
-<item> ** [ \!?'==' ] # repetition controlled by presence of operator
-<file>**\h+ # repetition controlled by presence of whitespace
+/ . ** -50 <element> /

-A successful match of such a quantifier always ends "in the middle",
+[Conjecture: A negative quantifier forces the construct to be
+considered procedural rather than declarational.]
+
+=item *
+
+Any quantifier may be modified by an additional constraint that
+specifies the separator to look for between repeats of the left side.
+This is indicated by use of a C<%> between the quantifier and
+the separator. The initial item is iterated only as long as the
+separator is seen between items:
+
+<alt>+ % '|' # repetition controlled by presence of character
+<addend>+ % <addop> # repetition controlled by presence of subrule
+<item>+ % [ \!?'==' ] # repetition controlled by presence of operator
+<file>+%\h+ # repetition controlled by presence of whitespace
+
+Any quantifier may be so modified:
+
+<a>* % ',' # 0 or more comma-separated elements
+<a>+ % ',' # 1 or more
+<a>? % ',' # 0 or 1 (but ',' never used!?!)
+<a> ** 2..* % ',' # 2 or more
+
+The C<%> modifier may only be used on a quantifier; any attempt
+to use it on a bare term will result in a parse error (to minimize
+possible confusion with any hash notations we choose to support in
+Perl 6 regexes).
+
+A successful match of a C<%> construct generally ends "in the middle" at the C<%>,
 that is, after the initial item but before the next separator.
 Therefore

-/ <ident> ** ',' /
+/ <ident>+ % ',' /

 can match

@@ -992,67 +1022,51 @@ but never
 foo,
 foo,bar,

+The only time such a match doesn't end in the middle is if the left
+side can match 0 times (and does so), in which case the whole construct
+matches the null string.
+
+'' ~~ / <ident>* % ',' / # matches because of the *
+
+If you wish to quantify each match on the left without the modifier, you must place it in brackets:
+
+[<a>*]+ % ','
+
 It is legal for the separator to be zero-width as long as the pattern on
 the left progresses on each iteration:

-. ** <?same> # match sequence of identical characters
+.+ % <?same> # match sequence of identical characters

 The separator never matches independently of the next item; if the
 separator matches but the next item fails, it backtracks all the way
 back through the separator. Likewise, this matching of the separator
 does not count as "progress" under C<:ratchet> semantics unless the
 next item succeeds.

-When significant space is used under C<:sigspace> with the separator
-form, it applies on both sides of the separator, so
+[Note: the following may be subject to change, now that this construct is
+a quant modifier.] When significant space is used under C<:sigspace>,
+it applies on both sides of the separator, so

-ms/<element> ** ','/
-ms/<element>** ','/
-ms/<element> **','/
+ms/<element>+ % ','/
+ms/<element>+% ','/
+ms/<element>+ %','/

 all allow whitespace around the separator like this:

 / <element>[<.ws>','<.ws><element>]* /

 while

-ms/<element>**','/
+ms/<element>+%','/

-excludes all significant whitespace:
+excludes all significant whitespace like this:

 / <element>[','<element>]* /

 Of course, you can always match whitespace explicitly if necessary, so to
 allow whitespace after the comma but not before, you can say:

-/ <element>**[','\s*] /
-
-You may use both forms of C<**> at once by use of this special form:
-
-/ <expr> ** 0..* ** ',' /
-
-The default is a mininum of 1 time, so when you write
-
-/ <stuff> ** ',' /
-
-it really means
-
-/ <stuff> ** 1..* ** ',' /
-
-=item *
-
-Negative range values are allowed, but only when modifying a reversible
-pattern (such as C<after> could match). For example, to search the
-surrounding 200 characters as defined by 'dot', you could say:
-
-/ . ** -100..100 <element> /
-
-Similarly, you can back up 50 characters with:
-
-/ . ** -50 <element> /
-
-[Conjecture: A negative quantifier forces the construct to be
-considered procedural rather than declarational.]
+/ <element>+%[','\s*] /

 =item *

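The separator semantics described in the diff can be approximated outside Perl 6. What follows is a minimal Python sketch and is not part of the commit. It relies on the expansion the spec text itself gives, <element>[','<element>]*, and on its whitespace-tolerant variant. The helper name `separated`, its parameters, and the `IDENT` pattern are hypothetical; only the + and * quantifiers are handled, and <.ws> is approximated by \s*, which is looser than the real rule.

```python
import re

# Assumed stand-in for <ident>; the real Perl 6 rule is not reproduced here.
IDENT = r"[A-Za-z_]\w*"


def separated(item: str, sep: str, quant: str = "+", ws: bool = False) -> str:
    """Build a Python regex that emulates `item<quant> % sep` (hypothetical helper).

    The core follows the expansion quoted in the spec: item [ sep item ]*.
    With ws=True, optional whitespace is allowed on both sides of the separator.
    """
    gap = r"\s*" if ws else ""
    core = f"(?:{item})(?:{gap}(?:{sep}){gap}(?:{item}))*"
    if quant == "+":
        return core
    if quant == "*":
        # Zero items is allowed, so the whole construct may match the empty string.
        return f"(?:{core})?"
    raise ValueError(f"unsupported quantifier: {quant!r}")


plus = re.compile(separated(IDENT, ","))
star = re.compile(separated(IDENT, ",", quant="*"))
spaced = re.compile(separated(IDENT, ",", ws=True))

print(bool(plus.fullmatch("foo,bar")))      # True
print(bool(plus.fullmatch("foo,")))         # False: no trailing separator
print(plus.match("foo,bar,").group())       # foo,bar (ends "in the middle")
print(bool(star.fullmatch("")))             # True: the * form can match nothing
print(bool(spaced.fullmatch("foo , bar")))  # True: whitespace around the separator
```

The unanchored `plus.match` call shows the backtracking rule from the text: the final comma is consumed only if another item follows, so the match stops after "bar".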