@@ -18,7 +18,7 @@ Synopsis 5: Regexes and Rules
  Created: 24 Jun 2002

  Last Modified: 20 Sep 2011
- Version: 147
+ Version: 148

  This document summarizes Apocalypse 5, which is about the new regex
  syntax. We now try to call them I<regex> rather than "regular
@@ -941,7 +941,7 @@ side of the complete quantifier. This space is considered significant
  under C<:sigspace>, and will be distributed as a call to <.ws> between
  all the elements of the match but not on either end.

- The next token will determine what kind of repetition is desired:
+ The next token constrains how many times the pattern on the left must match.

  If the next thing is an integer, then it is parsed as either as an exact
  count or a range:
@@ -965,21 +965,51 @@ It is illegal to return a list, so this easy mistake fails:
  The closure form is always considered procedural, so the item it is
  modifying is never considered part of the longest token.

- If you supply any other atom (which may be quantified), it is
- interpreted as a separator (such as an infix operator), and the
- initial item is quantified by the number of times the separator is
- seen between items:
+ =item *
+
+ Negative range values are allowed, but only when modifying a reversible
+ pattern (such as C<after> could match). For example, to search the
+ surrounding 200 characters as defined by 'dot', you could say:
+
+ / . ** -100..100 <element> /
+
+ Similarly, you can back up 50 characters with:

- <alt> ** '|' # repetition controlled by presence of character
- <addend> ** <addop> # repetition controlled by presence of subrule
- <item> ** [ \!?'==' ] # repetition controlled by presence of operator
- <file>**\h+ # repetition controlled by presence of whitespace
+ / . ** -50 <element> /

- A successful match of such a quantifier always ends "in the middle",
+ [Conjecture: A negative quantifier forces the construct to be
+ considered procedural rather than declarational.]
+
+ =item *
+
+ Any quantifier may be modified by an additional constraint that
+ specifies the separator to look for between repeats of the left side.
+ This is indicated by use of a C<%> between the quantifier and
+ the separator. The initial item is iterated only as long as the
+ separator is seen between items:
+
+ <alt>+ % '|' # repetition controlled by presence of character
+ <addend>+ % <addop> # repetition controlled by presence of subrule
+ <item>+ % [ \!?'==' ] # repetition controlled by presence of operator
+ <file>+%\h+ # repetition controlled by presence of whitespace
+
+ Any quantifier may be so modified:
+
+ <a>* % ',' # 0 or more comma-separated elements
+ <a>+ % ',' # 1 or more
+ <a>? % ',' # 0 or 1 (but ',' never used!?!)
+ <a> ** 2..* % ',' # 2 or more
+
+ The C<%> modifier may only be used on a quantifier; any attempt
+ to use it on a bare term will result in a parse error (to minimize
+ possible confusion with any hash notations we choose to support in
+ Perl 6 regexes).
+
+ A successful match of a C<%> construct generally ends "in the middle" at the C<%>,
  that is, after the initial item but before the next separator.
  Therefore

- / <ident> ** ',' /
+ / <ident>+ % ',' /

  can match

@@ -992,67 +1022,51 @@ but never
  foo,
  foo,bar,

+ The only time such a match doesn't end in the middle is if the left
+ side can match 0 times (and does so), in which case the whole construct
+ matches the null string.
+
+ '' ~~ / <ident>* % ',' / # matches because of the *
+
+ If you wish to quantify each match on the left without the modifier, you must place it in brackets:
+
+ [<a>*]+ % ','
+
  It is legal for the separator to be zero-width as long as the pattern on
  the left progresses on each iteration:

- . ** <?same> # match sequence of identical characters
+ .+ % <?same> # match sequence of identical characters

  The separator never matches independently of the next item; if the
  separator matches but the next item fails, it backtracks all the way
  back through the separator. Likewise, this matching of the separator
  does not count as "progress" under C<:ratchet> semantics unless the
  next item succeeds.

- When significant space is used under C<:sigspace> with the separator
- form, it applies on both sides of the separator, so
+ [Note: the following may be subject to change, now that this construct is
+ a quant modifier.] When significant space is used under C<:sigspace>,
+ it applies on both sides of the separator, so

- ms/<element> ** ','/
- ms/<element>** ','/
- ms/<element> ** ','/
+ ms/<element>+ % ','/
+ ms/<element>+% ','/
+ ms/<element>+ % ','/

  all allow whitespace around the separator like this:

  / <element>[<.ws>','<.ws><element>]* /

  while

- ms/<element>** ','/
+ ms/<element>+% ','/

- excludes all significant whitespace:
+ excludes all significant whitespace like this :

  / <element>[','<element>]* /

  Of course, you can always match whitespace explicitly if necessary, so to
  allow whitespace after the comma but not before, you can say:

- / <element>**[','\s*] /
-
- You may use both forms of C<**> at once by use of this special form:
-
- / <expr> ** 0..* ** ',' /
-
- The default is a mininum of 1 time, so when you write
-
- / <stuff> ** ',' /
-
- it really means
-
- / <stuff> ** 1..* ** ',' /
-
- =item *
-
- Negative range values are allowed, but only when modifying a reversible
- pattern (such as C<after> could match). For example, to search the
- surrounding 200 characters as defined by 'dot', you could say:
-
- / . ** -100..100 <element> /
-
- Similarly, you can back up 50 characters with:
-
- / . ** -50 <element> /
-
- [Conjecture: A negative quantifier forces the construct to be
- considered procedural rather than declarational.]
+ / <element>+%[','\s*] /

  =item *

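Note on the new C<%> semantics: the added text describes `<item>+ % <sep>` as matching an item followed by any number of separator-plus-item pairs, never ending on a dangling separator. The sketch below approximates that expansion with Python's `re` module, since the Perl 6 syntax itself cannot be run there. The helper name `separated`, its `min_count` parameter, and the `IDENT` pattern are assumptions made for illustration only; they are not part of the synopsis.

```python
import re

# Hypothetical stand-in for <ident>; not the synopsis's definition.
IDENT = r"[A-Za-z_]\w*"


def separated(item: str, sep: str, min_count: int = 1) -> str:
    """Build a Python regex that approximates `item ** min_count..* % sep`.

    The expansion is item (sep item)*, so a separator is consumed only
    when another item follows it. min_count == 0 makes the whole
    construct optional, which lets it match the empty string.
    """
    if min_count < 0:
        raise ValueError("min_count must be non-negative")
    # Number of (sep item) pairs required after the first item.
    pairs = max(min_count - 1, 0)
    body = f"(?:{item})(?:(?:{sep})(?:{item})){{{pairs},}}"
    return f"(?:{body})?" if min_count == 0 else body


one_or_more = re.compile(separated(IDENT, ","))        # like <ident>+ % ','
zero_or_more = re.compile(separated(IDENT, ",", 0))    # like <ident>* % ','
two_or_more = re.compile(separated(IDENT, ",", 2))     # like <ident> ** 2..* % ','
spaced = re.compile(separated(IDENT, r"\s*,\s*"))      # whitespace on both sides
after_only = re.compile(separated(IDENT, r",\s*"))     # like <element>+%[','\s*]

# The match stops before a trailing separator instead of consuming it.
trailing = one_or_more.match("foo,bar,").group(0)
```

Here `trailing` is `"foo,bar"`: the final comma is tried, the following item fails, and the engine backs out of the separator, which mirrors the "separator never matches independently of the next item" rule in the hunk above. Python's backtracking engine does not model C<:ratchet> or longest-token behavior, so this only illustrates the shape of the expansion.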