Commit a30596e

Add example on Capture markers
1 parent dbff958 commit a30596e

1 file changed

+106
-71
lines changed

doc/Language/regexes.pod6

Lines changed: 106 additions & 71 deletions
@@ -526,8 +526,9 @@ This can be useful for augmenting an existing regex. For example if you have
526526
a regex C<quoted> that matches a quoted string, then C</ <quoted> && <-[x]>* />
527527
matches a quoted string that does not contain the character C<x>.
528528
529-
Note that you cannot easily obtain the same behavior with a look-ahead, because
530-
a look-ahead doesn't stop looking when the quoted string stops matching.
529+
Note that you cannot easily obtain the same behavior with a look-ahead (that
530+
is, a regex that doesn't consume characters), because a look-ahead doesn't stop
531+
looking when the quoted string stops matching.
531532
532533
=begin code
533534
say 'abc' ~~ / <?before a> && . /; # OUTPUT: «Nil␤»
@@ -590,65 +591,93 @@ The following is a multi-line string:
590591
and keep it safe
591592
EOS
592593
593-
say so $str ~~ /safe $/; # OUTPUT: «True␤» -- 'safe' is at the end of the string
594-
say so $str ~~ /secret $/; # OUTPUT: «False␤» -- 'secret' is at the end of a line -- not the string
595-
say so $str ~~ /^Keep /; # OUTPUT: «True␤» -- 'Keep' is at the start of the string
596-
say so $str ~~ /^and /; # OUTPUT: «False␤» -- 'and' is at the start of a line -- not the string
594+
# 'safe' is at the end of the string
595+
say so $str ~~ /safe $/; # OUTPUT: «True␤»
596+
597+
# 'secret' is at the end of a line, not the string
598+
say so $str ~~ /secret $/; # OUTPUT: «False␤»
599+
600+
# 'Keep' is at the start of the string
601+
say so $str ~~ /^Keep /; # OUTPUT: «True␤»
602+
603+
# 'and' is at the start of a line, not the string
604+
say so $str ~~ /^and /; # OUTPUT: «False␤»
597605
598606
=head2 X«C<^^>, Start of Line and C<$$>, End of Line|regex,^^;regex,$$»
599607
600608
The C<^^> assertion matches at the start of a logical line. That is, either
601-
at the start of the string, or after a newline character. However, it does not match
602-
at the end of the string, even if it ends with a newline character.
609+
at the start of the string, or after a newline character. However, it does not
610+
match at the end of the string, even if it ends with a newline character.
603611
604612
C<$$> matches only at the end of a logical line, that is, before a newline
605613
character, or at the end of the string when the last character is not a
606614
newline character.
607615
608616
(To understand the following example, it's important to know that the
609-
C<q:to/EOS/...EOS> "heredoc" syntax removes leading indention to the same
610-
level as the C<EOS> marker, so that the first, second and last lines have no
611-
leading space and the third and fourth lines have two leading spaces each).
617+
C<q:to/EOS/...EOS> L<heredoc|/language/quoting#Heredocs:_:to> syntax removes
618+
leading indentation to the same level as the C<EOS> marker, so that the first,
619+
second and last lines have no leading space and the third and fourth lines have
620+
two leading spaces each).
612621
613-
=begin code
614-
my $str = q:to/EOS/;
615-
There was a young man of Japan
616-
Whose limericks never would scan.
617-
When asked why this was,
618-
He replied "It's because
619-
I always try to fit as many syllables into the last line as ever I possibly can."
620-
EOS
621-
622-
say so $str ~~ /^^ There/; # OUTPUT: «True␤» -- start of string
623-
say so $str ~~ /^^ limericks/; # OUTPUT: «False␤» -- not at the start of a line
624-
say so $str ~~ /^^ I/; # OUTPUT: «True␤» -- start of the last line
625-
say so $str ~~ /^^ When/; # OUTPUT: «False␤» -- there are blanks between
626-
# start of line and the "When"
627-
628-
say so $str ~~ / Japan $$/; # OUTPUT: «True␤» -- end of first line
629-
say so $str ~~ / scan $$/; # OUTPUT: «False␤» -- there's a . between "scan"
630-
# and the end of line
631-
say so $str ~~ / '."' $$/; # OUTPUT: «True␤» -- at the last line
632-
=end code
622+
=begin code
623+
my $str = q:to/EOS/;
624+
There was a young man of Japan
625+
Whose limericks never would scan.
626+
When asked why this was,
627+
He replied "It's because I always try to fit
628+
as many syllables into the last line as ever I possibly can."
629+
EOS
630+
631+
# 'There' is at the start of the string
632+
say so $str ~~ /^^ There/; # OUTPUT: «True␤»
633+
634+
# 'limericks' is not at the start of a line
635+
say so $str ~~ /^^ limericks/; # OUTPUT: «False␤»
636+
637+
# 'as' is at the start of the last line
638+
say so $str ~~ /^^ as/; # OUTPUT: «True␤»
639+
640+
# there are blanks between start of line and the "When"
641+
say so $str ~~ /^^ When/; # OUTPUT: «False␤»
642+
643+
# 'Japan' is at the end of the first line
644+
say so $str ~~ / Japan $$/; # OUTPUT: «True␤»
645+
646+
# there's a . between "scan" and the end of line
647+
say so $str ~~ / scan $$/; # OUTPUT: «False␤»
648+
649+
# '."' is at the end of the last line
650+
say so $str ~~ / '."' $$/; # OUTPUT: «True␤»
651+
=end code
633652
634653
635654
=head2 X«C«<|w>» and C«<!|w>», word boundary|regex, <|w>;regex, <!|w>»
636655
637656
To match any word boundary, use C«<|w>». This is similar to other
638-
languages’ X«C<\b>|regex deprecated,\b».
639-
To match not a word boundary, use <!|w>, similar to other languages X<C<\B>|regex deprecated, \B >.
657+
languages' X«C<\b>|regex deprecated,\b».
658+
659+
To match anything that is not a word boundary, use C«<!|w>». This is similar to other
660+
languages' X<C<\B>|regex deprecated, \B >.
661+
640662
These are both zero-width assertions.
641663
642-
=head2 X<<<<C<<< << >>> and C<<< >> >>>, left and right word boundary|regex,<<;regex,>>;regex,«;regex,»>>>>
664+
say "two-words" ~~ / "two"<|w>"-"<|w>"words" /; # OUTPUT: «「two-words」␤»
665+
say "two-words" ~~ / "two"<!|w>"-"<!|w>"words" /; # OUTPUT: «Nil␤»
666+
667+
=head2 C«<<» and C«>>», left and right word boundary
643668
644-
C<<< << >>> matches a left word boundary. It matches positions where there
669+
X«|regex, <<; regex, >>; regex, «; regex, »»
670+
671+
C«<<» matches a left word boundary. It matches positions where there
645672
is a non-word character at the left (or the start of the string) and a word
646673
character to the right.
647674
648-
C<<< >> >>> matches a right word boundary. It matches positions where there
675+
C«>>» matches a right word boundary. It matches positions where there
649676
is a word character at the left and a non-word character at the right (or
650677
the end of the string).
651678
679+
These are both zero-width assertions.
680+
652681
my $str = 'The quick brown fox';
653682
say so $str ~~ /br/; # OUTPUT: «True␤»
654683
say so $str ~~ /<< br/; # OUTPUT: «True␤»
@@ -663,34 +692,34 @@ You can also use the variants C<«> and C<»> :
663692
say so $str ~~ /« own/; # OUTPUT: «False␤»
664693
say so $str ~~ /own »/; # OUTPUT: «True␤»
665694
666-
=head1 X«Grouping and Capturing|regex,( );regex,[ ];regex,$<capture> =»
695+
=head1 Grouping and Capturing
667696
668697
In regular (non-regex) Perl 6, you can use parentheses to group things
669698
together, usually to override operator precedence:
670699
671-
say 1 + 4 * 2; # 9, parsed as 1 + (4 * 2)
672-
say (1 + 4) * 2; # OUTPUT: «10␤»
700+
say 1 + 4 * 2; # OUTPUT: «9␤», parsed as 1 + (4 * 2)
701+
say (1 + 4) * 2; # OUTPUT: «10␤»
673702
674703
The same grouping facility is available in regexes:
675704
676-
/ a || b c /; # matches 'a' or 'bc'
677-
/ ( a || b ) c /; # matches 'ac' or 'bc'
705+
/ a || b c /; # matches 'a' or 'bc'
706+
/ ( a || b ) c /; # matches 'ac' or 'bc'
678707
679708
The same grouping applies to quantifiers:
680709
681-
/ a b+ /; # matches an 'a' followed by one or more 'b's
682-
/ (a b)+ /; # matches one or more sequences of 'ab'
683-
/ (a || b)+ /; # matches a sequence of 'a's and 'b's, at least one long
710+
/ a b+ /; # matches an 'a' followed by one or more 'b's
711+
/ (a b)+ /; # matches one or more sequences of 'ab'
712+
/ (a || b)+ /; # matches a non-empty string of 'a's and 'b's
684713
685714
An unquantified capture produces a L<Match> object. When a capture is
686715
quantified (except with the C<?> quantifier) the capture becomes a list of
687716
L<Match> objects instead.
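
The difference can be sketched as follows. This is an illustrative example rather than part of the committed text; the input string C<'abc'> is an arbitrary assumption, and the checks rely only on C<.isa>, C<.elems> and indexing.

```raku
# Illustrative sketch; the input 'abc' is an arbitrary example string.

# Unquantified capture: $0 is a single Match object.
'abc' ~~ / (\w) /;
say $0.isa(Match);    # OUTPUT: «True␤»
say ~$0;              # OUTPUT: «a␤»

# Capture quantified with +: $0 is a list of Match objects.
'abc' ~~ / (\w)+ /;
say $0.isa(Match);    # OUTPUT: «False␤»
say $0.elems;         # OUTPUT: «3␤»
say ~$0[1];           # OUTPUT: «b␤»

# Capture quantified with ?: $0 is again a single Match object.
'abc' ~~ / (\w)? /;
say $0.isa(Match);    # OUTPUT: «True␤»
```

With a quantified capture, index into C<$0> to reach the individual matches.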
688717
689-
=head2 Capturing
718+
=head2 X«Capturing|regex,( )»
690719
691720
The round parentheses don't just group, they also I<capture>; that is, they
692721
make the string matched within the group available as a variable, and also as
693-
an element of the resulting L<Match|/type/Match> object:
722+
an element of the resulting L<Match> object:
694723
695724
my $str = 'number 42';
696725
if $str ~~ /'number ' (\d+) / {
@@ -716,7 +745,7 @@ access all elements:
716745
say $/.list.join: ', ' # OUTPUT: «a, c␤»
717746
}
718747
719-
=head2 Non-capturing grouping
748+
=head2 X«Non-capturing grouping|regex,[ ]»
720749
721750
The parentheses in regexes perform a double role: they group the regex
722751
elements inside and they capture what is matched by the sub-regex inside.
@@ -728,9 +757,10 @@ instead.
728757
say ~$0; # OUTPUT: «c␤»
729758
}
730759
731-
If you do not need the captures, using non-capturing groups provides three
732-
benefits: they more cleanly communicate the regex intent; they make it easier to
733-
count the capturing groups that you do care about; and matching is bit faster.
760+
If you do not need the captures, using non-capturing groups provides
761+
three benefits: they more cleanly communicate the regex intent; they
762+
make it easier to count the capturing groups that you do care about;
763+
and matching is a bit faster.
734764
735765
=head2 Capture numbers
736766
@@ -749,21 +779,16 @@ Alternations reset the capture count:
749779
Example:
750780
751781
if 'abc' ~~ /(x)(y) || (a)(.)(.)/ {
752-
say ~$1; # b
782+
say ~$1; # OUTPUT: «b␤»
753783
}
754784
755785
If two (or more) alternations have a different number of captures,
756786
the one with the most captures determines the index of the next capture:
757787
758-
=begin code
759-
$_ = 'abcd';
760-
761-
if / a [ b (.) || (x) (y) ] (.) / {
762-
# $0 $0 $1 $2
763-
say ~$2; # d
764-
}
765-
=end code
766-
788+
if 'abcd' ~~ / a [ b (.) || (x) (y) ] (.) / {
789+
# $0 $0 $1 $2
790+
say ~$2; # OUTPUT: «d␤»
791+
}
767792
768793
Captures can be nested, in which case they are numbered per level
769794
@@ -783,23 +808,24 @@ it in a variable first:
783808
say "11" ~~ /(\d) {} :my $c = $0; ($c)/;
784809
# OUTPUT: «「11」␤ 0 => 「1」␤ 1 => 「1」␤»
785810
786-
=head2 Named captures
811+
=head2 X<Named captures|regex, Named captures>
787812
788-
Instead of numbering captures, you can also give them names. The generic --
789-
and slightly verbose -- way of naming captures is like this:
813+
Instead of numbering captures, you can also give them names. The generic,
814+
and slightly verbose, way of naming captures is like this:
790815
791816
if 'abc' ~~ / $<myname> = [ \w+ ] / {
792817
say ~$<myname> # OUTPUT: «abc␤»
793818
}
794819
795-
The access to the named capture, C<< $<myname> >>, is a shorthand for indexing
796-
the match object as a hash, in other words: C<$/{ 'myname' }> or C<< $/<myname> >>.
820+
Accessing the named capture with C«$<myname>» is shorthand for indexing
821+
the match object as a hash, that is, C<$/{ 'myname' }> or C«$/<myname>».
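
A minimal sketch of the three spellings side by side, reusing the capture name C<myname> from the example above (illustrative only, not part of the committed text):

```raku
if 'abc' ~~ / $<myname> = [ \w+ ] / {
    # All three forms look up the same named capture in $/.
    say ~$<myname>;         # OUTPUT: «abc␤»
    say ~$/{ 'myname' };    # OUTPUT: «abc␤»
    say ~$/<myname>;        # OUTPUT: «abc␤»
}
```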
797822
798823
Named captures can also be nested using regular capture group syntax:
799824
800825
if 'abc-abc-abc' ~~ / $<string>=( [ $<part>=[abc] ]* % '-' ) / {
801-
say ~$<string>; # OUTPUT: «abc-abc-abc␤»
802-
say ~$<string><part>; # OUTPUT: «abc abc abc␤»
826+
say ~$<string>; # OUTPUT: «abc-abc-abc␤»
827+
say ~$<string><part>; # OUTPUT: «abc abc abc␤»
828+
say ~$<string><part>[0]; # OUTPUT: «abc␤»
803829
}
804830
805831
Coercing the match object to a hash gives you easy programmatic access to
@@ -818,12 +844,21 @@ all named captures:
818844
}
819845
820846
A more convenient way to get named captures is discussed in
821-
the Subrules section.
847+
the L<Subrules|#Subrules> section.
848+
822849
=head2 X«Capture markers: C«<( )>»|regex,<( )>»
823850
824-
A C«<(» token indicates the start of the match's overall capture, while the corresponding C«)>»
825-
token indicates its endpoint. The C«<(» is similar to other languages X<\K|regex deprecated,\K> to discard any matches
826-
found before the C<\K>.
851+
A C«<(» token indicates the start of the match's overall capture, while the
852+
corresponding C«)>» token indicates its endpoint. The C«<(» is similar to other
853+
languages' X<\K|regex deprecated,\K>, which discards anything matched before the
854+
C<\K>.
855+
856+
say 'abc' ~~ / a <( b )> c/; # OUTPUT: «「b」␤»
857+
say 'abc' ~~ / <(a <( b )> c)>/; # OUTPUT: «「bc」␤»
858+
859+
As the example above shows, C«<(» sets the start point and C«)>» sets the
860+
endpoint. The two markers are independent of each other.
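
Because the markers are independent, each can also appear without the other. The following sketch is an illustrative assumption, not part of the committed example. It uses one marker per regex, then inspects the positions of the overall match with C<.from> and C<.to>:

```raku
# Only a start marker: the overall match runs from 'b' to the end of the match.
say 'abc' ~~ / a <( b c /;    # OUTPUT: «「bc」␤»

# Only an end marker: the overall match runs from the start up to the marker.
say 'abc' ~~ / a b )> c /;    # OUTPUT: «「ab」␤»

# Both markers: the surrounding 'a' and 'c' must match but are not included.
my $m = 'abc' ~~ / a <( b )> c /;
say $m.from;    # OUTPUT: «1␤»
say $m.to;      # OUTPUT: «2␤»
```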
861+
827862
828863
=head1 Substitution
829864
