

=encoding utf8

=head1 TITLE

Synopsis 5: Regexes and Rules

=head1 AUTHORS

    Damian Conway <damian@conway.org>
    Allison Randal <al@shadowed.net>
    Patrick Michaud <pmichaud@pobox.com>
    Larry Wall <larry@wall.org>
    Moritz Lenz <moritz@faui2k3.org>

=head1 VERSION

    Created: 24 Jun 2002

    Last Modified: 31 Jul 2012
    Version: 158

This document summarizes Apocalypse 5, which is about the new regex
syntax.  We now try to call them I<regex> rather than "regular
expressions" because they haven't been regular expressions for a
long time, and we think the popular term "regex" is in the process of
becoming a technical term with a precise meaning of: "something you do
pattern matching with, kinda like a regular expression".  On the other
hand, one of the purposes of the redesign is to make portions of
our patterns more amenable to analysis under traditional regular
expression and parser semantics, and that involves making careful
distinctions between which parts of our patterns and grammars are
to be treated as declarative, and which parts as procedural.

In any case, when referring to recursive patterns within a grammar,
the terms I<rule> and I<token> are generally preferred over I<regex>.

=head1 Overview

In essence, Perl 6 natively implements Parsing Expression Grammars (PEGs)
as an extension of regular expression notation.  PEGs require that you
provide a "pecking order" for ambiguous parses.  Perl 6's pecking order
is determined by a multi-level tie-breaking test:

    1) Longest token matching: food\s+ beats foo by 2 or more positions
    2) Longest literal prefix: food\w* beats foo\w* by 1 position
    3) Declaration from most-derived grammar beats less-derived
    4) Within a given compilation unit, earlier declaration wins
    5) Declaration with least number of 'uses' wins

Note that tiebreaker #5 can occur only when a grammar is monkey-patched
from another compilation unit.  Like #3, it privileges local declarations
over distant ones.

In addition to this pecking order, if any rule chosen under the pecking
order backtracks, the next best rule is chosen.  That is, the pecking order
determines a candidate list; just because one candidate is chosen does not
mean the rest are thrown away.  They may, however, be explicitly thrown away
by an appropriate backtracking control (sometimes called a "cut" operator,
but Perl 6 has several of them, depending on how much you want to cut).
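
The first tiebreaker can be seen with two one-line matches.  This is
only a sketch, assuming an implementation in which C<|> performs
longest-token matching while C<||> tries its alternatives strictly
left to right; the variable names are made up for illustration.

```raku
    # Declarative alternation: the longest token wins, whatever the order.
    my $declarative = 'food' ~~ / foo | food /;

    # Sequential alternation: the first alternative that matches wins.
    my $sequential  = 'food' ~~ / foo || food /;

    say ~$declarative;    # food
    say ~$sequential;     # foo
```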

=head1 New match result and capture variables

The underlying match object is now available via the C<$/>
variable, which is implicitly lexically scoped.  All user access to the
most recent match is through this variable, even when
it doesn't look like it.  The individual capture variables (such as C<$0>,
C<$1>, etc.) are just elements of C<$/>.

By the way, unlike in Perl 5, the numbered capture variables now
start at C<$0> instead of C<$1>.  See below.
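
A short sketch of the numbering, assuming a smartmatch that sets C<$/>
as described above (the sample string and the C<$date> variable are
illustrative only):

```raku
    my $date = '24 Jun 2002' ~~ / (\d+) \s+ (\w+) \s+ (\d+) /;

    say ~$0;              # 24    (the first capture is number 0)
    say ~$/[1];           # Jun   ($1 is just an element of $/)
    say ~$date[2];        # 2002  (the same match object, held in a variable)
    say $/.list.elems;    # 3
```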


=head1 Unchanged syntactic features

The following regex features use the same syntax as in Perl 5:

=over

=item *

Capturing: (...)

=item *

Repetition quantifiers: *, +, and ?

=item *

Alternatives:  |

=item *

Backslash escape:  \

=item *

Minimal matching suffix:   ??,  *?,  +?

=back

While the syntax of C<|> does not change, the default semantics do
change slightly.  We are attempting to concoct a pleasing mixture
of declarative and procedural matching so that we can have the
best of both.  In short, you need not write your own tokener for
a grammar because Perl will write one for you.  See the section
below on "Longest-token matching".

=head1 Simplified lexical parsing of patterns

Unlike traditional regular expressions, Perl 6 does not require
you to memorize an arbitrary list of metacharacters.  Instead it
classifies characters by a simple rule.  All glyphs (graphemes)
whose base characters are either the underscore (C<_>) or have
a Unicode classification beginning with 'L' (i.e. letters) or 'N'
(i.e. numbers) are always literal (i.e. self-matching) in regexes. They
must be escaped with a C<\> to make them metasyntactic (in which
case that single alphanumeric character is itself metasyntactic,
but any immediately following alphanumeric character is not).

All other glyphs--including whitespace--are exactly the opposite:
they are always considered metasyntactic (i.e. non-self-matching) and
must be escaped or quoted to make them literal.  As is traditional,
they may be individually escaped with C<\>, but in Perl 6 they may
also be quoted as follows.

Sequences of one or more glyphs of either type (i.e. any glyphs at all)
may be made literal by placing them inside single quotes.  (Double
quotes are also allowed, with the same interpolative semantics as
the current language in which the regex is lexically embedded.)
Quotes create a quantifiable atom, so while

    moose*

quantifies only the 'e' and matches "mooseee", saying

    'moose'*

quantifies the whole string and would match "moosemoose".

Here is a table that summarizes the distinctions:

                 Alphanumerics        Non-alphanumerics         Mixed

 Literal glyphs   a    1    _        \*  \$  \.   \\   \'       K\-9\!
 Metasyntax      \a   \1   \_         *   $   .    \    '      \K-\9!
 Quoted glyphs   'a'  '1'  '_'       '*' '$' '.' '\\' '\''     'K-9!'

In other words, identifier glyphs are literal (or metasyntactic when
escaped), non-identifier glyphs are metasyntactic (or literal when
escaped), and single quotes make everything inside them literal.
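
The difference between quantifying a bare glyph and quantifying a
quoted atom can be checked directly.  The following is a small sketch
rather than a normative test; it assumes only the quoting rules
summarized in the table above.

```raku
    say so 'mooseee'    ~~ / ^ moose* $ /;      # True:  only the "e" repeats
    say so 'moosemoose' ~~ / ^ moose* $ /;      # False
    say so 'moosemoose' ~~ / ^ 'moose'* $ /;    # True:  the quoted atom repeats

    # Non-identifier glyphs must be escaped or quoted to be literal.
    say so 'K-9!' ~~ / ^ K\-9\! $ /;            # True
    say so 'K-9!' ~~ / ^ 'K-9!' $ /;            # True
```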

Note, however, that not all non-identifier glyphs are currently
meaningful as metasyntax in Perl 6 regexes (e.g. C<\1> C<\_> C<->
C<!>). It is more accurate to say that all unescaped non-identifier
glyphs are I<potential> metasyntax, and reserved for future use.
If you use such a sequence, a helpful compile-time error is issued
indicating that you either need to quote the sequence or define a new
operator to recognize it.

The semicolon character is specifically reserved as a non-meaningful
metacharacter; if an unquoted semicolon is seen, the compiler will
complain that the regex is missing its terminator.

=head1 Modifiers

=over

=item *

The extended syntax (C</x>) is no longer required...it's the default.
(In fact, it's pretty much mandatory--the only way to get back to
the old syntax is with the C<:Perl5>/C<:P5> modifier.)

=item *

There are no C</s> or C</m> modifiers (changes to the meta-characters
replace them - see below).

=item *

There is no C</e> evaluation modifier on substitutions; instead use:

     s/pattern/{ doit() }/

or:

     s[pattern] = doit()

Instead of C</ee> say:

     s/pattern/{ eval doit() }/

or:

     s[pattern] = eval doit()
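
As an illustration of the two replacement forms, here is a minimal
sketch.  The helper C<double-it> and the sample strings are
hypothetical, and the example assumes that the replacement is
evaluated after the match, so that C<$0> is available to it.

```raku
    sub double-it($n) { 2 * $n }    # hypothetical helper

    my $text = 'width 21';
    $text ~~ s/ (\d+) /{ double-it(+$0) }/;
    say $text;                      # width 42

    my $other = 'width 21';
    $other ~~ s[ (\d+) ] = double-it(+$0);
    say $other;                     # width 42
```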

=item *

Modifiers are now placed as adverbs at the I<start> of a match/substitution:

     m:g:i/\s* (\w*) \s* ,?/;

Every modifier must start with its own colon.  The delimiter must be
separated from the final modifier by whitespace if it would otherwise be taken
as an argument to the preceding modifier (which is true if and only if
the next character is a left parenthesis).

=item *

The single-character modifiers also have longer versions:

         :i        :ignorecase
         :m        :ignoremark
         :g        :global
         :r        :ratchet

=item *

The C<:i> (or C<:ignorecase>) modifier causes case distinctions to be
ignored in its lexical scope, but not in its dynamic scope.  That is,
subrules always use their own case settings.  The amount of case folding
depends on the current context.  In byte and codepoint mode, level 1 case folding
is required (as defined in TR18 section 2.4).  In grapheme mode level 2 is
required.

The C<:ii> (or C<:samecase>) variant may be used on a substitution to change the
substituted string to the same case pattern as the matched string.

If the pattern is matched without the C<:sigspace> modifier, case
info is carried across on a character by character basis.  If the
right string is longer than the left one, the case of the final
character is replicated.  Titlecase is carried across if possible
regardless of whether the resulting letter is at the beginning of
a word or not; if there is no titlecase character available, the
corresponding uppercase character is used.  (This policy can be
modified within a lexical scope by a language-dependent Unicode
declaration to substitute titlecase according to the orthographic
rules of the specified language.)  Characters that carry no case
information leave their corresponding replacement character unchanged.

If the pattern is matched with C<:sigspace>, then a slightly smarter
algorithm is used which attempts to determine if there is a uniform
capitalization policy over each matched word, and applies the same
policy to each replacement word.  If there doesn't seem to be a uniform
policy on the left, the policy for each word is carried over word by
word, with the last pattern word replicated if necessary.  If a word
does not appear to have a recognizable policy, the replacement word
is translated character for character as in the non-sigspace case.
Recognized policies include:

    lc()
    uc()
    tc()
    tclc()
    tcuc()

In any case, only the officially matched string part of the pattern
match counts, so any sort of lookahead or contextual matching is not
included in the analysis.

=item *

The C<:m> (or C<:ignoremark>) modifier scopes exactly like C<:ignorecase>
except that it ignores marks (accents and such) instead of case.  It is equivalent
to taking each grapheme (in both target and pattern), converting
both to NFD (maximally decomposed) and then comparing the two base
characters (Unicode non-mark characters) while ignoring any trailing
mark characters.  The mark characters are ignored only for the purpose
of determining the truth of the assertion; the actual text matched
includes all ignored characters, including any that follow the final
base character.

The C<:mm> (or C<:samemark>) variant may be used on a substitution to change the
substituted string to the same mark/accent pattern as the matched string.
Mark info is carried across on a character by character basis.  If
the right string is longer than the left one, the remaining characters
are substituted without any modification.  (Note that NFD/NFC distinctions
are usually immaterial, since Perl encapsulates that in grapheme mode.)
Under C<:sigspace> the preceding rules are applied word by word.

=item *

The C<:c> (or C<:continue>) modifier causes the pattern to continue
scanning from the specified position (defaulting to C<($/ ?? $/.to !! 0)>):

     m:c($p)/ pattern /     # start scanning at position $p

Note that this does not automatically anchor the pattern to the starting
location.  (Use C<:p> for that.)  The pattern you supply to C<split>
has an implicit C<:c> modifier.

String positions are of type C<StrPos> and should generally be treated
as opaque.

=item *

The C<:p> (or C<:pos>) modifier causes the pattern to try to match only at
the specified string position:

     m:pos($p)/ pattern /  # match at position $p

If the argument is omitted, it defaults to C<($/ ?? $/.to !! 0)>.  (Unlike in
Perl 5, the string itself has no clue where its last match ended.)
All subrule matches are implicitly passed their starting position.
Likewise, the pattern you supply to a Perl macro's C<is parsed>
trait has an implicit C<:p> modifier.

Note that

     m:c($p)/pattern/

is roughly equivalent to

     m:p($p)/.*? <( pattern )> /

All of C<:g>, C<:ov>, C<:nth>, and C<:x> are incompatible with C<:p> and
will fail, recommending use of C<:c> instead.  The C<:ex> modifier is allowed
but will produce only matches at that position.
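
The contrast between scanning and anchoring can be sketched as
follows.  The example assumes an implementation that accepts plain
integer offsets as positions, even though positions should in general
be treated as opaque; the variable names are illustrative.

```raku
    my $text = 'abcabc';

    # :c scans forward from position 1 and finds the second "a" at 3.
    my $scanned = $text ~~ m:c(1)/ a /;
    say $scanned.from;               # 3

    # :p must match exactly at the given position.
    say so $text ~~ m:p(1)/ a /;     # False
    say so $text ~~ m:p(3)/ a /;     # True
```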

=item *

The new C<:s> (C<:sigspace>) modifier causes certain whitespace sequences
to be considered "significant"; they are replaced by a whitespace
matching rule, C<< <.ws> >>.  Only whitespace sequences immediately following a
matching construct (atom, quantified atom, or assertion) are eligible.
Hence, initial whitespace is ignored at the front of
any regex, to make it easy to write rules that can participate in longest-token-matching
alternations.  That is,

     m:s/ next cmd '='   <condition>/

is the same as:

     m/ next <.ws> cmd <.ws> '=' <.ws> <condition>/

which is effectively the same as:

     m/ next \s+ cmd \s* '=' \s* <condition>/

But in the case of

     m:s{(a|\*) (b|\+)}

or equivalently,

     m { (a|\*) <.ws> (b|\+) }

C<< <.ws> >> can't decide what to do until it sees the data.
It still does the right thing.  If not, define your own C<< ws >>
and C<:sigspace> will use that.

Whitespace is ignored not just at the front of any rule that might
participate in longest-token matching, but in the front of any
alternative within an explicit alternation as well, for the same
reason.  If you want to match sigspace before a set of alternatives,
place your whitespace outside of the brackets containing the alternation.

When you write

    rule TOP { ^ <stuff> $ }

this is the same as

    token TOP { ^ <.ws> <stuff> <.ws> $ <.ws> }

but note that the final C<< <.ws> >> always matches the null string, since C<$> asserts end of string.
Also, if your C<TOP> rule does not anchor with C<^>, it might not match initial whitespace.

Specifically, the following constructs turn following whitespace into sigspace:

    any atom or quantified atom
    $foo @bar
    'a' "$b"
    ^ $ ^^ $$
    (...) [...] <...> as whole atoms
    (...)* [...]* <...>* as quantified atoms
    <( and )>
    « and » (but don't use « that way!)

and these do not:

    opening ( or [
    | or ||
    & or &&
    ** % or %%
    :foo declarations, including :my and :sigspace itself
    {...}

When we say sigspace can follow either an atom or a quantified atom, we
mean that it can come between an atom and its quantifier:

    ms/ <atom> * /      # means / [<atom><.ws>]* /

(If each atom matches whitespace, then it doesn't need to match after the
quantifier.)

In general you don't need to use C<:sigspace> within grammars because
the parser rules automatically handle whitespace policy for you.
In this context, whitespace often includes comments, depending on
how the grammar chooses to define its whitespace rule.  Although the
default C<< <.ws> >> subrule recognizes no comment construct, any
grammar is free to override the rule.  The C<< <.ws> >> rule is not
intended to mean the same thing everywhere.

It's also possible to pass an argument to C<:sigspace> specifying
a completely different subrule to apply.  This can be any rule; it
doesn't have to match whitespace.  When discussing this modifier, it is
important to distinguish the significant whitespace in the pattern from
the "whitespace" being matched, so we'll call the pattern's whitespace
I<sigspace>, and generally reserve I<whitespace> to indicate whatever
C<< <.ws> >> matches in the current grammar. The correspondence
between sigspace and whitespace is primarily metaphorical, which is
why the correspondence is both useful and (potentially) confusing.

The C<:ss> (or C<:samespace>) variant may be used on substitutions to
do smart space mapping.  For each sigspace-induced call to C<< <ws> >>
on the left, the matched whitespace is copied over to the corresponding
slot on the right, as represented by a single whitespace character
in the replacement string wherever space replacement is desired.
If there are more whitespace slots on the right than the left, those
righthand characters remain themselves.  If there are not enough
whitespace slots on the right to map all the available whitespace
slots from the match, the algorithm tries to minimize information
loss by randomly splicing "common" whitespace characters out of the
list of whitespace.  From least valuable to most, the pecking order is:

    spaces
    tabs
    all other horizontal whitespace, including Unicode
    newlines (including crlf as a unit)
    all other vertical whitespace, including Unicode

The primary intent of these rules is to minimize format disruption
when substitution happens across line boundaries and such.  There is,
of course, no guarantee that the result will be exactly what a human would
do.

The C<:s> and C<:ss> modifiers are considered sufficiently important that
match variants are defined for them:

    ms/match some words/                        # same as m:sigspace
    ss/match some words/replace those words/    # same as s:samespace

Note that C<ss///> is defined in terms of C<:ss>, so:

    $_ = "a b\nc\td";
    ss/b c d/x y z/;

ends up with a value of "C<a x\ny\tz>".
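
To make the effect of sigspace concrete, here is a sketch that
assumes the default C<< <.ws> >> rule, which requires real whitespace
only between two word characters.  The sample strings are
illustrative.

```raku
    my $with-space    = 'next cmd = ready';
    my $without-space = 'nextcmd = ready';

    # Sigspace after "next" becomes <.ws>, which needs real whitespace
    # between two word characters.
    say so $with-space    ~~ m:s/ next cmd '=' ready /;      # True
    say so $without-space ~~ m:s/ next cmd '=' ready /;      # False

    # "=" is not a word character, so <.ws> may match nothing around it.
    say so 'next cmd=ready' ~~ m:s/ next cmd '=' ready /;    # True
```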

=item *

New modifiers specify Unicode level:

     m:bytes  / .**2 /       # match two bytes
     m:codes  / .**2 /       # match two codepoints
     m:graphs / .**2 /       # match two language-independent graphemes
     m:chars  / .**2 /       # match two characters at current max level

There are corresponding pragmas to default to these levels.  Note that
the C<:chars> modifier is always redundant because dot always matches
characters at the highest level allowed in scope.  This highest level
may be identical to one of the other three levels, or it may be more
specific than C<:graphs> when a particular language's character rules
are in use.  Note that you may not specify language-dependent character
processing without specifying I<which> language you're depending on.
[Conjecture: the C<:chars> modifier could take an argument specifying
which language's rules to use for this match.]

=item *

The new C<:Perl5>/C<:P5> modifier allows Perl 5 regex syntax to be
used instead.  (It does not go so far as to allow you to put your
modifiers at the end.)  For instance,

     m:P5/(?mi)^(?:[a-z]|\d){1,2}(?=\s)/

is equivalent to the Perl 6 syntax:

    m/ :i ^^ [ <[a..z]> || \d ] ** 1..2 <?before \s> /

=item *

Any integer modifier specifies a count. What kind of count is
determined by the character that follows.

=item *

If followed by an C<x>, it means repetition.  Use C<:x(4)> for the
general form.  So

     s:4x [ (<.ident>) '=' (\N+) $$] = "$0 => $1";

is the same as:

     s:x(4) [ (<.ident>) '=' (\N+) $$] = "$0 => $1";

which is almost the same as:

     s:c[ (<.ident>) '=' (\N+) $$] = "$0 => $1" for 1..4;

except that the string is unchanged unless all four matches are found.
However, ranges are allowed, so you can say C<:x(1..4)> to change anywhere
from one to four matches.

=item *

If the number is followed by an C<st>, C<nd>, C<rd>, or C<th>, it means
find the I<N>th occurrence.  Use C<:nth(3)> for the general form.  So

     s:3rd/(\d+)/@data[$0]/;

is the same as

     s:nth(3)/(\d+)/@data[$0]/;

which is the same as:

     m/(\d+)/ && m:c/(\d+)/ && s:c/(\d+)/@data[$0]/;

The argument to C<:nth> is allowed to be a list of integers, but such a list
should be monotonically increasing.  (Values which are less than or equal to
any previous value will be ignored.)  So:


    :nth(2,4,6...*)    # return only even matches
    :nth(1,1,*+*...*)  # match only at 1,2,3,5,8,13...

This option is no longer required to support smartmatching.  You can grep a list
of integers if you really need that capability:

    :nth(grep *.oracle, 1..*)

If both C<:nth> and C<:x> are present, the matching routine looks for submatches
that match with C<:nth>. If the number of post-nth matches is compatible with
the constraint in C<:x>, the whole match succeeds with the highest possible
number of submatches. The combination of C<:nth> and C<:x> typically only
makes sense if C<:nth> is not a single scalar.
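
The following sketch shows C<:nth> on both a match and a
substitution.  It assumes that a match with a list argument returns
the selected matches as a list; the sample data is made up.

```raku
    my $text = 'a1b22c333';

    say ~($text ~~ m:2nd/ \d+ /);       # 22
    say ~($text ~~ m:nth(3)/ \d+ /);    # 333

    # A list argument selects several occurrences at once.
    say ($text ~~ m:nth(1, 3)/ \d+ /).map(~*).join(' ');    # 1 333

    # On a substitution only the selected occurrence is replaced.
    my $copy = $text;
    $copy ~~ s:2nd/ \d+ /X/;
    say $copy;                          # a1bXc333
```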

=item *

With the new C<:ov> (C<:overlap>) modifier, the current regex will
match at all possible character positions (including overlapping)
and return all matches in list context, or a disjunction of matches
in item context.  The first match at any position is returned.
The matches are guaranteed to be returned in left-to-right order with
respect to the starting positions.

     $str = "abracadabra";

     if $str ~~ m:overlap/ a (.*) a / {
         @substrings = slice @();    # bracadabr cadabr dabr br
     }

=item *

With the new C<:ex> (C<:exhaustive>) modifier, the current regex will
match every possible way (including overlapping) and return a list of
all matches.

The matches are guaranteed to be returned in left-to-right order with
respect to the starting positions.  The order within each starting
position is not guaranteed and may depend on the nature of both the
pattern and the matching engine.  (Conjecture: or we could enforce
backtracking engine semantics.  Or we could guarantee no order at all
unless the pattern starts with "::" or some such to suppress DFAish
solutions.)

     $str = "abracadabra";

     if $str ~~ m:exhaustive/ a (.*?) a / {
         say "@()";    # br brac bracad bracadabr c cad cadabr d dabr br
     }

Note that the C<~~> above can return as soon as the first match is found,
and the rest of the matches may be performed lazily by C<@()>.
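
The two modifiers can be compared side by side.  This sketch assumes
that both forms return their matches as a list whose elements can be
indexed for the first capture.  Because the order within one starting
position is not guaranteed, the exhaustive result is sorted before it
is printed.

```raku
    my $str = 'abracadabra';

    # :overlap keeps the first (here the longest) match at each start.
    my @overlapping = ($str ~~ m:overlap/ a (.*) a /).map({ ~.[0] });
    say @overlapping.join(' ');        # bracadabr cadabr dabr br

    # :exhaustive keeps every way of matching: the five "a"s can be
    # paired up in ten ways.
    my @exhaustive = ($str ~~ m:exhaustive/ a (.*?) a /).map({ ~.[0] });
    say @exhaustive.elems;             # 10
    say @exhaustive.sort.join(' ');
```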

=item *

The new C<:rw> modifier causes this regex to I<claim> the current
string for modification rather than assuming copy-on-write semantics.
All the captures in C<$/> become lvalues into the string, such
that if you modify, say, C<$1>, the original string is modified in
that location, and the positions of all the other fields modified
accordingly (whatever that means).  In the absence of this modifier
(especially if it isn't implemented yet, or is never implemented),
all pieces of C<$/> are considered copy-on-write, if not read-only.

[Conjecture: this should really associate a pattern with a string variable,
not a (presumably immutable) string value.]

=item *

The new C<:r> or C<:ratchet> modifier causes this regex to not backtrack by default.
(Generally you do not use this modifier directly, since it's implied by
C<token> and C<rule> declarations.)  The effect of this modifier is
to imply a C<:> after every atom, including but not limited to
C<*>, C<+>, and C<?> quantifiers, as well as alternations.  Explicit
backtracking modifiers on quantified atoms, such as C<*?>, will override this.
(Note: for portions of patterns subject to longest-token analysis, a C<:>
is ignored in any case, since there will be no backtracking necessary.)
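
A minimal sketch of the difference, assuming lexically declared
C<regex> and C<token> routines that can be called as subrules (the
names are hypothetical):

```raku
    my regex backtracking { a* a }    # plain regex: may give characters back
    my token ratcheting   { a* a }    # token implies :ratchet

    say so 'aaa' ~~ / ^ <backtracking> $ /;    # True
    say so 'aaa' ~~ / ^ <ratcheting> $ /;      # False: a* keeps all three

    # The modifier can also be written out directly.
    say so 'aaa' ~~ / :r ^ a* a $ /;           # False
```

In the ratcheting cases C<a*> consumes every "a" and never releases
one, so the final C<a> has nothing left to match.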

=item *

The C<:i>, C<:s>, C<:Perl5>, and Unicode-level modifiers can be
placed inside the regex (and are lexically scoped):

     m/:s alignment '=' [:i left|right|cent[er|re]] /

As with modifiers outside, only parentheses are recognized as valid
brackets for args to the adverb.  In particular:

    m/:foo[xxx]/        Parses as :foo [xxx]
    m/:foo{xxx}/        Parses as :foo {xxx}
    m/:foo<xxx>/        Parses as :foo <xxx>
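
The lexical scoping of an embedded modifier can be observed with the
pattern shown above.  This is a sketch with made-up input strings; it
assumes the default C<< <.ws> >> rule.

```raku
    my $spec = rx/:s alignment '=' [:i left|right|cent[er|re]] /;

    say so 'alignment = CENTRE' ~~ $spec;    # True:  :i covers the bracket
    say so 'alignment=Left'     ~~ $spec;    # True:  <.ws> may be empty here
    say so 'ALIGNMENT = left'   ~~ $spec;    # False: :i does not reach back
```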

=item *

User-defined modifiers will be possible:

         m:fuzzy/pattern/;

=item *

User-defined modifiers can also take arguments, but only in parentheses:

         m:fuzzy('bare')/pattern/;

=item *

To use parens for your delimiters you have to separate:

         m:fuzzy (pattern);

or you'll end up with:

         m:fuzzy(fuzzyargs); pattern ;

=item *

Any grammar regex is really just a kind of method, and you may
declare variables in such a routine using a colon followed by any
scope declarator parsed by the Perl 6 grammar, including C<my>,
C<our>, C<state>, and C<constant>.  (As quasi declarators, C<temp>
and C<let> are also recognized.)  A single statement (up through
a terminating semicolon) is parsed as normal Perl 6 code:

    token prove-nondeterministic-parsing {
        :my $threshold = rand;
        'maybe' \s+ <it($threshold)>
    }

Such declarations do not terminate longest-token-matching,
so an otherwise useless declaration may be used as a peg
to hang side effects on without changing how the subsequent
pattern matches:

    rule breaker {
        :state $ = say "got here at least once";
        ...
    }

=back
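
As an illustration of the C<:my> declarations described in the last
item above, here is a small sketch.  The token name and the threshold
variable are hypothetical, and the code assertion is used only to
make the declared variable's effect visible.

```raku
    my token long-word {
        :my $min = 3;                        # hypothetical threshold
        (\w+) <?{ $0.chars >= $min }>
    }

    say so 'abcd' ~~ / ^ <long-word> $ /;    # True
    say so 'ab'   ~~ / ^ <long-word> $ /;    # False
```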

=head2 Allowed modifiers

Some modifiers, but not all of them, are allowed in every place where a
modifier can occur.

In general, a modifier that affects the compilation of a regex (like C<:i>)
must be known at compile time. A modifier that affects only the calling
behaviour, and not the regex itself (e.g. C<:pos>, C<:overlap>, C<:x(4)>) may
only appear on constructs that involve a call (like C<m//> and C<s///>), and
not on C<rx//>. Finally, overlapping is disallowed on substitutions, while
adverbs that affect modifications (e.g. C<:samecase>) are only allowed on
substitutions.

These principles result in the following rules:

=over

=item *

The C<:ignorecase>, C<:ignoremark>, C<:sigspace>, C<:ratchet> and C<:Perl5>
modifiers and their short forms are allowed everywhere: inside a regex,
and on C<m//>, C<rx//> and C<s///> constructs. An implementation may require
that their value is known at compile time, and give a compile-time error
message if that is not the case.

    rx:i/ hello /           # OK
    rx:i(1) /hello/         # OK
    my $i = 1;
    rx:i($i) /hello/        # may error out at compile time

=item *

The C<:samecase>, C<:samespace> and C<:samemark> modifiers (and their short
forms) are only allowed on substitutions (C<s///> and C<s[] = ...>).

=item *

The C<:overlap> and C<:exhaustive> modifiers (and their short forms) are only
allowed on matches (i.e. C<m//>), not on substitutions or regex quotes.

=item *

The C<:pos>, C<:continue>, C<:x> and C<:nth> modifiers and their aliases are
only allowed on constructs that involve immediate calls, e.g. C<m//> and C<s///>
(but not on C<rx//>).

=item *

The C<:dba> adverb is only allowed inside a regex.

=back

=head1 Changed metacharacters

=over

=item *

A dot C<.> now matches I<any> character including newline. (The C</s>
modifier is gone.)

=item *

C<^> and C<$> now always match the start/end of a string, like the old
C<\A> and C<\z>. (The C</m> modifier is gone.)  On the right side of
an embedded C<~~> or C<!~~> operator they always match the start/end
of the indicated submatch because that submatch is logically being
treated as a separate string.

=item *

A C<$> no longer matches an optional preceding C<\n> so it's necessary
to say C<\n?$> if that's what you mean.

=item *

C<\n> now matches a logical (platform independent) newline, not just C<\x0a>.
See TR18 section 1.6 for a list of logical newlines.

=item *

The C<\A>, C<\Z>, and C<\z> metacharacters are gone.

=back
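
The changes listed above can be checked with a few one-line matches.
This is a sketch with made-up strings, not an exhaustive test.

```raku
    say so "a\nb" ~~ / a . b /;       # True:  dot matches the newline
    say so "a\nb" ~~ / ^ b /;         # False: ^ is start of string only
    say so "ab\n" ~~ / b $ /;         # False: $ no longer skips a final \n
    say so "ab\n" ~~ / b \n? $ /;     # True
    say so "a\r\nb" ~~ / a \n b /;    # True:  CRLF is one logical newline
```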

=head1 New metacharacters

=over

=item *

Because C</x> is default:

=over

=item *

An unquoted C<#> now always introduces a comment.  If followed
by a backtick and an opening bracket character,
it introduces an embedded comment that terminates with the closing
bracket.  Otherwise the comment terminates at the newline.

=item *

Whitespace is now always metasyntactic, i.e. used only for layout
and not matched literally (but see the C<:sigspace> modifier described above).

=back

=item *

C<^^> and C<$$> match line beginnings and endings. (The C</m>
modifier is gone.)  They are both zero-width assertions.  C<$$>
matches before any C<\n> (logical newline), and also at the end of
the string if the final character was I<not> a C<\n>.  C<^^> always
matches the beginning of the string and after any C<\n> that is not
the final character in the string.
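
A short sketch of the line anchors, using a made-up two-line string
that ends in a newline:

```raku
    my $text = "one\ntwo\n";

    # Each line is matched separately; nothing starts after the final \n.
    say ($text ~~ m:g/ ^^ \w+ $$ /).map(~*).join(',');    # one,two

    # ^ still refers to the start of the whole string.
    say so $text ~~ / ^ two /;     # False
    say so $text ~~ / ^^ two /;    # True
```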

=item *

C<.> matches an I<anything>, while C<\N> matches an I<anything except
what C<\n> matches>. (The C</s> modifier is gone.)  In particular, C<\N> matches
neither carriage return nor line feed.

=item *

The new C<&> metacharacter separates conjunctive terms.  The patterns
on either side must match with the same beginning and end point.
Note: if you don't want your two terms to end at the same point,
then you really want to use a lookahead instead.

As with the disjunctions C<|> and C<||>, conjunctions come in both
C<&> and C<&&> forms.  The C<&> form is considered declarative rather than
procedural; it allows the compiler and/or the
run-time system to decide which parts to evaluate first, and it is
erroneous to assume either order happens consistently.  The C<&&>
form guarantees left-to-right order, and backtracking makes the right
argument vary faster than the left.  In other words, C<&&> and C<||> establish
sequence points.  The left side may be backtracked into when backtracking
is allowed into the construct as a whole.

The C<&> operator is list associative like C<|>, but has slightly
tighter precedence.  Likewise C<&&> has slightly tighter precedence
than C<||>.  As with the normal junctional and short-circuit operators,
C<&> and C<|> are both tighter than C<&&> and C<||>.
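
As a sketch of a conjunction, the pattern below requires a run of
word characters that also contains a digit.  It uses the sequential
C<&&> form and assumes that backtracking is allowed in the enclosing
regex; the variable name is illustrative.

```raku
    # Both sides of the conjunction must match the very same span.
    my $word-with-digit = rx/ ^ [ \w+ && .* \d .* ] $ /;

    say so 'abc1' ~~ $word-with-digit;    # True
    say so 'abcd' ~~ $word-with-digit;    # False: no digit
    say so 'a c1' ~~ $word-with-digit;    # False: \w+ cannot cross the space
```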

=item *

The C<~~> and C<!~~> operators cause a submatch to be performed on
whatever was matched by the variable or atom on the left.  String
anchors consider that submatch to be the entire string.  So, for
instance, you can ask to match any identifier that does not contain
the word "moose":

    <ident> !~~ 'moose'

In contrast

    <ident> !~~ ^ 'moose' $

would allow any identifier (including any identifier containing
"moose" as a substring) as long as the identifier as a whole is not
equal to "moose". (Note the anchors, which attach the submatch to the
beginning and end of the identifier as if that were the entire match.)
When used as part of a longer match, for clarity it might be good to
use extra brackets:

    [ <ident> !~~ ^ 'moose' $ ]

The precedence of C<~~> and C<!~~> fits in between the junctional and
sequential versions of the logical operators just as it does in normal
Perl expressions (see S03).  Hence

    <ident> !~~ 'moose' | 'squirrel'

parses as

    <ident> !~~ [ 'moose' | 'squirrel' ]

while

    <ident> !~~ 'moose' || 'squirrel'

parses as

    [ <ident> !~~ 'moose' ] || 'squirrel'

=item *

The C<~> operator is a helper for matching nested subrules with a
specific terminator as the goal.  It is designed to be placed between an
opening and closing bracket, like so:

    '(' ~ ')' <expression>

However, it mostly ignores the left argument, and operates on the next
two atoms (which may be quantified).  Its operation on those next
two atoms is to "twiddle" them so that they are actually matched in
reverse order.  Hence the expression above, at first blush, is merely
shorthand for:

    '(' <expression> ')'

But beyond that, when it rewrites the atoms it also inserts the
apparatus that will set up the inner expression to recognize the
terminator, and to produce an appropriate error message if the
inner expression does not terminate on the required closing atom.
So it really does pay attention to the left bracket as well, and it
actually rewrites our example to something more like:

    $<OPEN> = '(' <SETGOAL: ')'> <expression> [ $GOAL || <FAILGOAL> ]

Note that you can use this construct to set up expectations for
a closing construct even when there's no opening bracket:

    <?> ~ ')' \d+

Here C<< <?> >> returns true on the first null string.

By default the error message uses the name of the current rule as an
indicator of the abstract goal of the parser at that point.  However,
often this is not terribly informative, especially when rules are named
according to an internal scheme that will not make sense to the user.
The C<:dba> ("doing business as") adverb may be used to set up a more
informative name for what the following code is trying to parse:

    token postfix:sym<[ ]> {
        :dba('array subscript')
        '[' ~ ']' <expression>
    }

Then instead of getting a message like:

    Unable to parse expression in postfix:sym<[ ]>; couldn't find final ']'

you'll get a message like:

    Unable to parse expression in array subscript; couldn't find final ']'

(The C<:dba> adverb may also be used to give names to alternations
and alternatives, which helps the lexer give better error messages.)

=back
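
The C<~> goal-matching operator described above can be tried in a
tiny grammar.  The grammar and token names are hypothetical, and only
the successful path is shown, since the wording of the failure
message is left to the implementation.

```raku
    grammar Subscript {                    # hypothetical grammar name
        token TOP   { '[' ~ ']' <index> }
        token index { \d+ }
    }

    my $parsed = Subscript.parse('[42]');
    say ~$parsed;           # [42]
    say ~$parsed<index>;    # 42
```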

=head1 Bracket rationalization

=over

=item *

C<(...)> still delimits a capturing group. However the ordering of these
groups is hierarchical rather than linear. See L<Nested subpattern captures>.

=item *

C<[...]> is no longer a character class.
It now delimits a non-capturing group.

A character class is now specified using C<< <[...]> >>.
See also L<Extensible metasyntax>.
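
The new roles of the brackets can be sketched with a few matches on
made-up strings:

```raku
    say so 'abab' ~~ / ^ [ab]+ $ /;      # True:  [ab] groups the sequence "ab"
    say so 'ba'   ~~ / ^ [ab]+ $ /;      # False: it is not a character class
    say so 'ba'   ~~ / ^ <[ab]>+ $ /;    # True:  <[ab]> is the character class

    # Square brackets group without capturing; parentheses capture.
    say ('abab' ~~ / ^ [ab]+ $ /).list.elems;    # 0
    say ('abab' ~~ / ^ (ab)+ $ /).list.elems;    # 1
```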

=item *

C<{...}> is no longer a repetition quantifier.
It now delimits an embedded closure.  It is always considered
procedural rather than declarative; it establishes a sequence point
between what comes before and what comes after.  (To avoid this
use the C<< <?{...}> >> assertion syntax instead.)  A closure
within a regex establishes its own lexical scope.

=item *

You can call Perl code as part of a regex match by using a closure.
Embedded code does not usually affect the match--it is only used
for side-effects:

     / (\S+) { print "string not blank\n"; $text = $0; }
        \s+  { print "but does contain whitespace\n" }
     /

An B<explicit> reduction using the C<make> function generates the
I<abstract syntax tree> object (I<abstract object> or I<ast> for short)
for this match:

        / (\d) { make $0.sqrt } Remainder /;

This has the effect of capturing the square root of the numified
string, instead of the string.  The C<Remainder> part is matched and
returned as part of the C<Match> object but is not returned
as part of the abstract object.  Since the abstract object usually
represents the top node of an abstract syntax tree, the abstract object
may be extracted from the C<Match> object by use of the C<.ast> method.

A second call to C<make> overrides any previous call to C<make>.
C<make> is also available as a method on each match object.

Within a closure, the instantaneous
position within the search is denoted by the C<$¢.pos> method.
As with all string positions, you must not treat it
as a number unless you are very careful about which units you are
dealing with.

The C<Cursor> object can also return the original item that we are
matching against; this is available from the C<.orig> method.

The closure is also guaranteed to start with a C<$/> C<Match> object
representing the match so far.  However, if the closure does its own
internal matching, its C<$/> variable will be rebound to the result
of I<that> match until the end of the embedded closure.  (The match
will actually continue with the current value of the C<$¢> object after
the closure.  C<$/> and C<$¢> just start out the same in your closure.)
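
The C<make> example above can be exercised roughly as follows.  This is
only a sketch: the input string C<'9Remainder'> and the variable C<$m>
are invented for illustration, and it assumes an implementation that
provides C<make> and C<.ast> as described here.

```raku
# The digit is captured as $0; make attaches its square root
# to the match as the abstract object.
my $m = '9Remainder' ~~ / (\d) { make $0.sqrt } Remainder /;
say ~$m;       # the matched text still includes "Remainder"
say $m.ast;    # the abstract object set by make
```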

=item *

It can affect the match if it calls C<fail>:

     / (\d+) { $0 < 256 or fail } /

Since closures establish a sequence point, they are guaranteed to be
called at the canonical time even if the optimizer could prove that
something after them can't match.  (Anything before is fair game,
however.  In particular, a closure often serves as the terminator
of a longest-token pattern.)

=item *

The general repetition specifier is now C<**> for greedy matching,
with a corresponding C<**?> for frugal matching.  (All such quantifier
modifiers now go directly after the C<**>.)  Space is allowed on either
side of the complete quantifier, but only the space before the C<**> will
be considered significant under C<:sigspace> and match between repetitions.
(Sigspace after the entire construct matches once after all the repetitions
are found.)

The next token constrains how many times the pattern on the left must match.

If the next thing is an integer, then it is parsed as either an exact
count or a range:

    . ** 42                  # match exactly 42 times
    <item> ** 3..*           # match 3 or more times

This form is considered declarational.

If you supply a closure, it should return either an C<Int> or a C<Range> object.

    'x' ** {$m}              # exact count returned from closure
    <foo> ** {$m..$n}        # range returned from closure

    / value was (\d **? {1..6}) with ([ <alpha>\w* ]**{$m..$n}) /

It is illegal to return a list, so this easy mistake fails:

    / [foo] ** {1,3} /

The closure form is always considered procedural, so the item it is
modifying is never considered part of the longest token.

For backwards compatibility with previous versions of Perl 6, if the token
following ** is not a closure or literal integer, it is interpreted as +%
with a warning:

   / x ** y /                # same as / x+ % y /
   / x ** $y /               # same as / x [$y x]* /

No check is made to see if $y contains an integer or range value.  This
compatibility feature is not guaranteed to exist forever.
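
A small sketch of the literal (declarational) forms of C<**>, using a
made-up subject string.  It deliberately leaves out the closure form and
the backwards-compatibility form, and assumes an implementation that
supports the quantifier as specified above.

```raku
my $s = 'xxxxx';
my $exact  = $s ~~ / x ** 3 /;        # exact count
my $greedy = $s ~~ / x ** 2..4 /;     # range, greedy
my $frugal = $s ~~ / x **? 2..4 /;    # range, frugal
say ~$exact, ' ', ~$greedy, ' ', ~$frugal;

# An open-ended range needs at least its lower bound.
say so 'xx' ~~ / ^ x ** 3..* /;
```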

=item *

Negative range values are allowed, but only when modifying a reversible
pattern (such as C<after> could match).  For example, to search the
surrounding 200 characters as defined by 'dot', you could say:

    / . ** -100..100 <element> /

Similarly, you can back up 50 characters with:

    / . ** -50 <element> /

[Conjecture: A negative quantifier forces the construct to be
considered procedural rather than declarational.]

=item *

Any quantified atom may be modified by an additional constraint that
specifies the separator to look for between repeats of the left side.
This is indicated by use of a C<%> between the quantifier and
the separator.  The initial item is iterated only as long as the
separator is seen between items:

    <alt>+ % '|'            # repetition controlled by presence of character
    <addend>+ % <addop>     # repetition controlled by presence of subrule
    <item>+ % [ \!?'==' ]   # repetition controlled by presence of operator
    <file>+%\h+             # repetition controlled by presence of whitespace

Any quantifier may be so modified:

    <a>* % ','              # 0 or more comma-separated elements
    <a>+ % ','              # 1 or more 
    <a>? % ','              # 0 or 1 (but ',' never used!?!)
    <a> ** 2..* % ','       # 2 or more 

The C<%> modifier may only be used on a quantifier; any attempt
to use it on a bare term will result in a parse error (to minimize
possible confusion with any hash notations we choose to support in
Perl 6 regexes).

A successful match of a C<%> construct generally ends "in the middle" at the C<%>,
that is, after the initial item but before the next separator.
Therefore

    / <ident>+ % ',' /

can match

    foo
    foo,bar
    foo,bar,baz

but never

    foo,
    foo,bar,

The only time such a match doesn't end in the middle is if the left
side can match 0 times (and does so), in which case the whole construct
matches the null string.

    '' ~~ / <ident>* % ',' /  # matches because of the *

If you wish to allow the match to end after either side, use C<%%> instead.
Therefore

    / <ident>+ %% ',' /

can match any of

    foo
    foo,
    foo,bar
    foo,bar,
    foo,bar,baz
    foo,bar,baz,

If you wish to quantify each match on the left individually, you must place it in brackets:

    [<a>*]+ % ','

It is legal for the separator to be zero-width as long as the pattern on
the left progresses on each iteration:

    .+ % <?same>   # match sequence of identical characters

The separator never matches independently of the next item; if the
separator matches but the next item fails, it backtracks all the way
back through the separator.  Likewise, this matching of the separator
does not count as "progress" under C<:ratchet> semantics unless the
next item succeeds.

When significant space is used under C<:sigspace>,
each matching element enables the immediately following whitespace
to be considered significant.  Space after the C<%> does nothing.  If you write:

    ms/ <element> +  %  ',' /
      #1        #2 #3 #4  #5

it ignores whitespace #1 and #4, and rewrites the rest to:
                   
    / [ <element> <.ws> ]+ % [ ',' <.ws> ] <.ws> /
                    #2               #5      #3

Since #3 is redundant with #2 (because C<+> requires an element),
it suffices to supply either #2 or #3:

    ms/ <element>+ % ',' /    # ws after comma and at end
    ms/ <element> +% ',' /    # ws after comma and any element

So the first

    ms/ <element>+ % ',' /    # ws after comma and at end

is like

    / <element>[','<.ws><element>]*<.ws> /

while the second

    ms/ <element> +% ',' /    # ws after comma and any element

is like

    / <element><.ws>[','<.ws><element><.ws>]* /

and

    ms/ <element>+% ','/

excludes all significant whitespace like this:

    / <element>[','<element>]* /

Note that with a C<*> instead of a C<+>, space #3 would not be
redundant with #2, since if 0 elements are matched, the space
associated with it (#2) is not matched.  In that case it makes sense
to put space on both sides of the C<*>:

    ms/ <element> * % ',' /
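
The difference between C<%> and C<%%> described earlier in this item can
be checked with a short sketch.  The lexical token C<word> is a
hypothetical stand-in for C<< <ident> >>, and the input strings are
invented for illustration.

```raku
my token word { <[a..z]>+ }   # hypothetical stand-in for <ident>

for 'foo', 'foo,bar', 'foo,bar,' -> $s {
    my $strict = so $s ~~ / ^ <word>+ %  ',' $ /;   # must end after an item
    my $loose  = so $s ~~ / ^ <word>+ %% ',' $ /;   # may also end after a separator
    say $s, ' strict: ', $strict, ' loose: ', $loose;
}
```

Anchoring with C<^> and C<$> makes the trailing comma decisive; without
the anchors the C<%> form would simply match a shorter substring.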

=item *

C<< <...> >> are now extensible metasyntax delimiters or I<assertions>
(i.e. they replace Perl 5's crufty C<(?...)> syntax).

=back

=head1 Variable (non-)interpolation

=over

=item *

In Perl 6 regexes, variables don't interpolate.

=item *

Instead they're passed I<raw> to the regex engine, which can then decide
how to handle them (more on that below).

=item *

The default way in which the engine handles a string scalar is to match it
as a C<< '...' >> literal (i.e. it does not treat the interpolated string
as a subpattern).  In other words, a Perl 6:

     / $var /

is like a Perl 5:

     / \Q$var\E /

However, if C<$var> contains a C<Regex> object, instead of attempting to
convert it to a string, it is called as a subrule, as if you said
C<< <$var> >>.  (See assertions below.)  This form does not capture,
and it fails if C<$var> is tainted.

If C<$var> is undefined, a warning is issued and the match fails.

[Conjecture: when we allow matching against non-string types, doing a
type match on the current node will require the syntax of an embedded
signature, not just a bare variable, so there is no need to account for
a variable containing a type object, which is by definition undefined,
and hence fails to match by the above rule.]

However, a variable used as the left side of an alias or submatch
operator is not used for matching.

    $x = <.ident>
    $0 ~~ <.ident>

If you do want to match C<$0> again and then use that as the submatch,
you can force the match using double quotes:

    "$0" ~~ <.ident>

On the other hand, it is nonsensical to alias to something that is
not a variable:

    "$0" = <.ident>     # ERROR
    $0 = <.ident>       # okay
    $x = <.ident>       # okay, temporary capture
    $<x> = <.ident>     # okay, persistent capture
    <x=.ident>          # same thing

Variables declared in capture aliases are lexically scoped to the
rest of the regex.  You should not confuse this use of C<=> with
either ordinary assignment or ordinary binding.  You should read
the C<=> more like the pseudoassignment of a declarator than like
normal assignment.  It's more like the ordinary C<:=> operator,
since at the level regexes work, strings are immutable, so captures
are really just precomputed substr values.  Nevertheless, when you
eventually use the values independently, the substr may be copied,
and then it's more like it was an assignment originally.

Capture variables of the form C<< $<ident> >> may persist beyond
the lexical scope; if the match succeeds they are remembered in the
C<Match> object's hash, with a key corresponding to the variable name's
identifier.  Likewise bound numeric variables persist as C<$0>, etc.

You may capture to existing lexical variables; such variables may
already be visible from an outer scope, or may be declared within
the regex via a C<:my> declaration.

    my $x; / $x = [...] /            # capture to outer lexical $x
    / :my $x; $x = [...] /           # capture to our own lexical $x

=item *

An interpolated array:

     / @cmds /

is matched as if it were an alternation of its elements.  Ordinarily it
matches using junctive semantics:

     / [ @cmds[0] | @cmds[1] | @cmds[2] | ... ] /


However, if it is a direct member of a C<||> list, it uses sequential
matching semantics, even if it's the only member of the list.  Conveniently,
you can put C<||> before the first member of an alternation, hence

     / || @cmds /

is equivalent to

     / [ @cmds[0] || @cmds[1] || @cmds[2] || ... ] /

Of course, you can also write

     / | @cmds /

to be clear that you mean junctive semantics.

An interpolated array using junctive semantics is declarative
(participates in external longest token matching) only if it's 
known to be constant at the time the regex is compiled.  

As with a scalar variable, each element is matched as a literal
unless it happens to be a C<Regex> object, in which case it is matched
as a subrule.  As with scalar subrules, a tainted subrule always fails.
All string values pay attention to the current C<:ignorecase>
and C<:ignoremark> settings, while C<Regex> values use their own
C<:ignorecase> and C<:ignoremark> settings.

When you get tired of writing:

    token sigil { '$' | '@' | '%' | '&' | '::' }

you can write:

    token sigil { < $ @ % & :: > }

as long as you're careful to put a space after the initial angle so that
it won't be interpreted as a subrule.  With the space it is parsed
like angle quotes in ordinary Perl 6 and treated as a literal array value.
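
A brief sketch of array interpolation and the quote-words form.  The
array contents and test strings are made up for illustration.

```raku
my @cmds = <get put delete>;

say so 'delete' ~~ / ^ @cmds $ /;               # elements act as alternatives
say so 'patch'  ~~ / ^ @cmds $ /;               # not one of the elements
say so 'put'    ~~ / ^ < get put delete > $ /;  # same list as a literal array
```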

=item *

Alternatively, if you predeclare a proto regex, you can write multiple
regexes for the same category, differentiated only by the symbol they
match.  The symbol is specified as part of the "long name".  It may also
be matched within the rule using C<< <sym> >>, like this:

    proto token sigil {*}
    multi token sigil:sym<$>  { <sym> }
    multi token sigil:sym<@>  { <sym> }
    multi token sigil:sym<%>  { <sym> }
    multi token sigil:sym<&>  { <sym> }
    multi token sigil:sym<::> { <sym> }

(The C<multi> is optional and generally omitted within a grammar.)

This can be viewed as a form of multiple dispatch, except that it's
based on longest-token matching rather than signature matching.  The
advantage of writing it this way is that it's easy to add additional
rules to the same category in a derived grammar.  All of them will
be matched in parallel when you try to match C<< /<sigil>/ >>.

If there are formal parameters on multi regex methods, matching
still proceeds via longest-token rules first.  If that results in a
tie, a normal multiple dispatch is made using the arguments to the
remaining variants, assuming they can be differentiated by type.

The C<proto> calls into the subdispatcher when it sees a C<*> that
cannot be a quantifier and is the only thing in its block.  Therefore
you can put items before and after the subdispatch by putting
the C<*> into curlies:

    proto token foo { <prestuff> {*} <poststuff> }

This works only in a proto.  See L<S06> for a discussion of the
semantics of C<{*}>.  (Unlike a proto sub, a proto regex
automatically remembers the return values from C<{*}> because
they are carried along with the match cursor.)
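
The following sketch shows a proto token being extended in a derived
grammar.  The grammar names C<Sigils> and C<MoreSigils> and the tokens
C<TOP> and C<word> are hypothetical, chosen only to make the example
self-contained.

```raku
grammar Sigils {
    token TOP  { <sigil> <word> }
    token word { \w+ }

    proto token sigil {*}
    token sigil:sym<$>  { <sym> }
    token sigil:sym<@>  { <sym> }
    token sigil:sym<::> { <sym> }
}

# A derived grammar adds one more candidate to the same category.
grammar MoreSigils is Sigils {
    token sigil:sym<%> { <sym> }
}

say ~Sigils.parse('@list')<sigil>;
say ~Sigils.parse('::Foo')<sigil>;
say ~MoreSigils.parse('%opts')<sigil>;
```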

=item *

The use of a hash variable in patterns is reserved.

=item *

Variable matches are considered declarative if and only if the variable
is known to represent a constant.  Otherwise they are procedural.
Note that role parameters (if readonly) are considered constant
declarations for this purpose despite the absence of an explicit
C<constant> declarator, since roles themselves are immutable, and
will presumably be replacing the parameter with a constant value when
composed (if the value passed is a constant).  Macros instantiated
with constants would also make those constants eligible for declarative
treatment.

=back

=head1 Extensible metasyntax (C<< <...> >>)

Both C<< < >> and C<< > >> are metacharacters, and are usually (but not
always) used in matched pairs.  (Some combinations of metacharacters
function as standalone tokens, and these may include angles.  These are
described below.) Most assertions are considered declarative;
procedural assertions will be marked as exceptions.

For matched pairs, the first character after C<< < >> determines the
nature of the assertion:

=over

=item *

If the first character is whitespace, the angles are treated as an
ordinary "quote words" array literal.

    < adam & eve >   # equivalent to [ 'adam' | '&' | 'eve' ]

Note that the space before the ending C<< > >> is optional and therefore
C<< < adam & eve> >> would be acceptable.

=item *

A leading alphabetic character means it's a capturing grammatical
assertion (i.e. a subrule or a named character class - see below):

     / <sign>? <mantissa> <exponent>? /

The first character after the identifier determines the treatment of
the rest of the text before the closing angle.  The underlying semantics
is that of a function or method call, so if the first character is
a left parenthesis, it really is a call to either a method or function:

    <foo('bar')>

If the first character after the identifier is an C<=>, then the identifier
is taken as an alias for what follows.  In particular,

    <foo=bar>

is just shorthand for

    $<foo> = <bar>

Note that this aliasing does not modify the original C<< <bar> >>
capture.  To rename an inherited method capture without using the
original name, use the dot form described below on the capture you
wish to suppress.  That is,

    <foo=.bar>

desugars to:

    $<foo> = <.bar>

Likewise, to rename a lexically scoped regex explicitly, use the C<&>
form described below.  That is,

    <foo=&bar>

desugars to:

    $<foo> = <&bar>

Multiple aliases are allowed, so

    <foo=pub=bar>

is short for

    $<foo> = $<pub> = <bar>

If the first character after the identifier is whitespace, the
subsequent text (following any whitespace) is passed as a regex, so:

    <foo bar>

is more or less equivalent to

    <foo(/bar/)>

To pass a regex with leading whitespace you must use the parenthesized form.

If the first character is a colon followed by whitespace, the rest
of the text is taken as a list of arguments to the method, just as
in ordinary Perl syntax.  So these mean the same thing:

    <foo('foo', $bar, 42)>
    <foo: 'foo', $bar, 42>

No other characters are allowed after the initial identifier.

Subrule matches are considered declarative to the extent that
the front of the subrule is itself considered declarative.  If a
subrule contains a sequence point, then so does the subrule match.
Longest-token matching does not proceed past such a subrule, for
instance.

This form always gives preference to a lexically scoped regex declaration,
dispatching directly to it as if it were a function.  If there is no such
lexical regex (or lexical method) in scope, the call is dispatched to the
current grammar, assuming there is one. That is, if there is a
C<my regex foo> visible from the current lexical scope, then

    <foo(1,2,3)>

means the same as

    <foo=&foo(1,2,3)>

However, if there is no such lexically scoped regex (and note that within
a grammar, regexes are installed as methods which have no lexical alias
by default), then the call is dispatched as a normal method on the current
C<Cursor> (which will fail if you're not currently within a grammar).  So
in that case:

    <foo(1,2,3)>

means the same as:

    <foo=.foo(1,2,3)>

A call to C<< <foo> >> will fail if there is neither any lexically
scoped routine of that name it can call, nor any method of that name
that can be reached via method dispatch.  (The decision of which dispatcher
to use is made at compile time, not at run time; the method call is not
a fallback mechanism.)

=item *

A leading C<.> explicitly calls a method as a subrule; the fact
that the initial character is not alphanumeric also causes the named
assertion to not capture what it matches (see L<Subrule captures>). For
example:

     / <ident>  <ws>  /      # $/<ident> and $/<ws> both captured
     / <.ident> <ws>  /      # only $/<ws> captured
     / <.ident> <.ws> /      # nothing captured

The assertion is otherwise parsed identically to an assertion beginning with
an identifier, provided the next thing after the dot is an identifier.  As with
the identifier form, any extra arguments pertaining to the matching engine
are automatically supplied to the argument list via the implicit C<Cursor> invocant.
If there is no current class/grammar, or the current class is not derived
from C<Cursor>, the call is likely to fail.

If the dot is not followed by an identifier, it is parsed as a "dotty"
postfix of some type, such as an indirect method call:

    <.$indirect(@args)>

As with all regex matching, the current match state (some derivative
of C<Cursor>) is passed as the first argument, which in this case
is simply the method's invocant.  The method is expected to return
a lazy list of new match state objects, or C<Nil> if the match fails
entirely.  Ratcheted routines will typically return a list containing only
one match.
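
The capture-suppressing effect of the leading dot can be observed by
listing the named captures of each match.  This sketch uses the named
classes C<alpha> and C<digit> and an invented input string.

```raku
my $both = 'x1' ~~ / <alpha>  <digit>  /;   # both subrules capture
my $one  = 'x1' ~~ / <alpha>  <.digit> /;   # only <alpha> captures
my $none = 'x1' ~~ / <.alpha> <.digit> /;   # nothing is captured

say $both.hash.keys.sort;
say $one.hash.keys.sort;
say $none.hash.keys.sort;
```

All three patterns match the same text; only the contents of the
resulting C<Match> object differ.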

=item *

Whereas a leading C<.> unambiguously calls a method, a leading C<&>
unambiguously calls a routine instead.  Such a regex routine must
be declared (or imported) with C<my> or C<our> scoping to make its
name visible to the lexical scope, since by default a regex name is
installed only into the current class's metaobject instance, just
as with an ordinary method. The routine serves as a kind of private
submethod, and is called without any consideration of inheritance.
It must still take a C<Cursor> as its first argument (which it can
think of as an invocant if it likes), and must return the new match
state as a cursor object.  Hence,

     <&foo(1,2,3)>

is sugar for something like:

     <.gather { take foo($¢,1,2,3) }>

where C<$¢> represents the current incoming match state, and the
routine must return C<Nil> for failure, or a lazy list of one or
more match states (C<Cursor>-derived objects) for successful matches.

As with the C<.> form, an explicit C<&> suppresses capture.

Note that all normal C<Regex> objects are really such routines in disguise.
When you say:

    rx/stuff/

you're really declaring an anonymous method, something like:

    my $internal = anon regex :: ($¢: ) { stuff }

and then passing that object off to someone else who will call it
indirectly.  In this case, the method is installed neither into
a class nor into a lexical scope, but as long as the value stays
live somehow, it can still be called indirectly (see below).

=item *

A leading C<$> indicates an indirect subrule call.  The variable must
contain either a C<Regex> object (really an anonymous method--see
above), or a string to be compiled as the regex.  The string is never
matched literally.

If the compilation of the string form fails, the error message is converted
to a warning and the assertion fails.

The indirect subrule assertion is not captured.  (No assertion with leading punctuation
is captured by default.)  You may always capture it explicitly, of course:

    / <name=$rx> /

An indirect subrule is always considered procedural, and may not participate
in longest-token matching.
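
To contrast a bare variable with the indirect subrule form, here is a
small sketch.  The variable C<$pat> and its contents are made up; the
point is only that the same string is matched literally in one case and
compiled as a regex in the other.

```raku
my $pat = 'a.c';

say so 'abc' ~~ / ^ $pat $ /;     # bare variable: the dot must be a literal '.'
say so 'a.c' ~~ / ^ $pat $ /;     # literal match of the three characters
say so 'abc' ~~ / ^ <$pat> $ /;   # indirect subrule: '.' matches any character
```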

=item *

A leading C<::> indicates a symbolic indirect subrule:

     / <::($somename)> /

The variable must contain the name of a subrule.  By the rules of
single method dispatch this is first searched for in the current
grammar and its ancestors.  If this search fails an attempt is made
to dispatch via MMD, in which case it can find subrules defined as
multis rather than methods.  This form is not captured by default.
It is always considered procedural, not declarative.

=item *

A leading C<@> matches like a bare array except that each element is
treated as a subrule (string or C<Regex> object) rather than as a literal.
That is, a string is forced to be compiled as a subrule instead of being
matched literally.  (There is no difference for a C<Regex> object.)

This assertion is not automatically captured.

=item *

The use of a hash as an assertion is reserved.

=item *

A leading C<{> indicates code that produces a regex to be interpolated
into the pattern at that point as a subrule:

     / (<.ident>)  <{ %cache{$0} //= get_body_for($0) }> /

The closure is guaranteed to be run at the canonical time; it declares
a sequence point, and is considered to be procedural.

=item *

In any case of regex interpolation, if the value already happens to be
a C<Regex> object, it is not recompiled.  If it is a string, the compiled
form is cached with the string so that it is not recompiled next
time you use it unless the string changes.  (Any external lexical
variable names must be rebound each time though.)  Subrules may not be
interpolated with unbalanced bracketing.  An interpolated subrule
keeps its own inner match results as a single item, so its parentheses
never count toward the outer regex's groupings.  (In other words,
parenthesis numbering is always lexically scoped.)

=item *

A leading C<?{> or C<!{> indicates a code assertion:

     / (\d**1..3) <?{ $0 < 256 }> /
     / (\d**1..3) <!{ $0 < 256 }> /

Similar to:

     / (\d**1..3) { $0 < 256 or fail } /
     / (\d**1..3) { $0 < 256 and fail } /

Unlike closures, code assertions are considered declarative; they are
not guaranteed to be run at the canonical time if the optimizer can
prove something later can't match.  So you can sneak in a call to a
non-canonical closure that way:

     token { foo .* <?{ do { say "Got here!" } or 1 }> .* bar }

The C<do> block is unlikely to run unless the string ends with "C<bar>".
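
A sketch of the two code assertions applied to a few invented inputs.
The patterns are anchored so that backtracking to a shorter run of
digits cannot rescue a failed assertion.

```raku
for '7', '255', '256', '999' -> $n {
    my $small = so $n ~~ / ^ (\d ** 1..3) <?{ $0 < 256 }> $ /;   # passes if true
    my $large = so $n ~~ / ^ (\d ** 1..3) <!{ $0 < 256 }> $ /;   # passes if false
    say $n, ' small: ', $small, ' large: ', $large;
}
```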

=item *

A leading C<[> indicates an enumerated character class.  Ranges
in enumerated character classes are indicated with "C<..>" rather than "C<->".

     / <[a..z_]>* /

Whitespace is ignored within square brackets:

     / <[ a .. z _ ]>* /

A reversed range is illegal.  In directly compiled code it's a compile-time
error to say

    / <[ z .. a ]> /  # Reversed range is not allowed

In indirectly compiled code, a similar warning is issued and the assertion fails:

    $rx = '<[ z .. a ]>';
    / <$rx> /;  # warns and never matches

=item *

A leading C<-> indicates a complemented character class:

     / <-[a..z_]> <-alpha> /
     / <- [a..z_]> <- alpha> /  # whitespace allowed after -

This is essentially the same as using negative lookahead and dot:

    / <![a..z_]> . <!alpha> . /

Whitespace is ignored after the initial C<->.

=item *

A leading C<+> may also be supplied to indicate that the following
character class is to be matched in a positive sense.

     / <+[a..z_]>* /
     / <+[ a..z _ ]>* /
     / <+ [ a .. z _ ] >* /      # whitespace allowed after +

=item *

Character classes can be combined (additively or subtractively) within
a single set of angle brackets.  Whitespace is ignored. For example:

     / <[a..z] - [aeiou] + xdigit> /      # consonant or hex digit

A named character class may be used by itself:

    <alpha>

However, in order to combine classes you must prefix a named
character class with C<+> or C<->.  Whitespace is required before
any C<-> that would be misparsed as an identifier extender.
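
The combined class from the example above can be probed one character
at a time.  The test characters are chosen for illustration, and the
sketch assumes the terms combine left to right.

```raku
say so 'b' ~~ / ^ <[a..z] - [aeiou]> $ /;            # a consonant
say so 'e' ~~ / ^ <[a..z] - [aeiou]> $ /;            # vowels are subtracted
say so 'E' ~~ / ^ <[a..z] - [aeiou] + xdigit> $ /;   # hex digits are added
say so '_' ~~ / ^ <-[a..z_]> $ /;                    # complemented class
```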

=item *

Unicode properties are indicated by use of pair notation in place of a normal
rule name:

    <:Letter>   # a letter
    <:!Letter>  # a non-letter

Properties with arguments are passed as the argument to the pair:

    <:East_Asian_Width<Narrow>>
    <:!Blk<ASCII>>

The pair value is smartmatched against the value in the Unicode database.

    <:Nv(0 ^..^ 1)>     # the char has a proper fractional value

As a particular case of smartmatching, TR18 section 2.6 is satisfied
with a pattern as the argument:

    <:name(/^LATIN LETTER.*P$/)>
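
As a minimal sketch of the simplest property forms, assuming an
implementation that accepts the long property name C<Letter>:

```raku
say so 'A' ~~ / ^ <:Letter>  $ /;   # a letter
say so '7' ~~ / ^ <:Letter>  $ /;   # a digit is not a letter
say so '7' ~~ / ^ <:!Letter> $ /;   # negated property
```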

=item *

Multiple of these terms may be combined with pluses and minuses:

    <+ :HexDigit - :Upper >

Terms may also be combined using C<&> for set intersection, C<|>
for set union, and C<^> for symmetric set difference.  Parens may be
used for grouping.  (Square brackets always quote literal characters
(including backslashed literal forms), and may not be nested, unlike
the suggested notation in TR18 section 1.3.)  The precedence of
the operators is the same as the correspondingly named operators in
L<S03/Operator precedence>, even though they have somewhat different
semantics.

=item *

Extra long characters may be entered by quoting them and including them
via union.  Any quoted characters will be treated as "longest tokens"
when appropriate.  Here 'll' would be recognized in preference to 'l':

    / <[ a..z ] | 'ñ' | 'ch' | 'll' | 'rr'> /

=item *

The special assertion C<< <.> >> matches any logical grapheme
(including Unicode combining character sequences):

     / seekto = <.> /  # Maybe a combined char

Same as:

     / seekto = [:graphs .] /

=item *

A leading C<!> indicates a negated meaning (always a zero-width assertion):

     / <!before _ > /    # We aren't before an _

Note that C<< <!alpha> >> is different from C<< <-alpha> >>.
C<< /<-alpha>/ >> is a complemented character class equivalent to
C<<< /<!before <alpha>> ./ >>>, whereas C<< <!alpha> >> is a zero-width
assertion equivalent to a C<<< /<!before <alpha>>/ >>> assertion.

Note also that as a metacharacter C<!> doesn't change the parsing
rules of whatever follows (unlike, say, C<+> or C<->).

=item *

A leading C<?> indicates a positive zero-width assertion, and like C<!>
merely reparses the rest of the assertion recursively as if the C<?>
were not there.  In addition to forcing zero-width, it also suppresses
any named capture:

    <alpha>     # match a letter and capture to $alpha (eventually $<alpha>)
    <.alpha>    # match a letter, don't capture
    <?alpha>    # match null before a letter, don't capture

The special named assertions include:

     / <?before pattern> /    # lookahead
     / <?after pattern> /     # lookbehind

     / <?same> /              # true between two identical characters

     / <.ws> /                # match "whitespace":
                              #   \s+ if it's between two \w characters,
                              #   \s* otherwise

     / <?at($pos)> /          # match only at a particular StrPos
                              # short for <?{ .pos === $pos }>
                              # (considered declarative until $pos changes)

It is legal to use any of these assertions as named captures by omitting the
punctuation at the front.  However, capture entails some overhead in both
memory and computation, so in general you want to suppress that for data
you aren't interested in preserving.

The C<after> assertion implements lookbehind by reversing the syntax
tree and looking for things in the opposite order going to the left.
It is illegal to do lookbehind on a pattern that cannot be reversed.

Note: the effect of a forward-scanning lookbehind at the top level
can be achieved with:

    / .*? prestuff <( mainpat )> /
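
A short sketch of the zero-width C<before> and C<after> assertions on
invented strings.  In each case the text examined by the assertion is
not part of the resulting match.

```raku
my $ahead  = 'foobar' ~~ / foo <?before bar> /;   # lookahead
my $behind = 'foobar' ~~ / <?after foo> bar /;    # lookbehind
say ~$ahead;
say ~$behind;

say so 'foobaz' ~~ / foo <?before bar> /;         # no "bar" follows
say so 'foobaz' ~~ / foo <!before bar> /;         # negated form
```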

=item *

A leading C<*> indicates that the following pattern allows a
partial match.  It always succeeds after matching as many characters
as possible.  (It is not zero-width unless 0 characters match.)
For instance, to match a number of abbreviations, you might write
any of:

    s/ ^ G<*n|enesis>     $ /gen/  or
    s/ ^ Ex<*odus>        $ /ex/   or
    s/ ^ L<*v|eviticus>   $ /lev/  or
    s/ ^ N<*m|umbers>     $ /num/  or
    s/ ^ D<*t|euteronomy> $ /deut/ or
    ...

    / (<* <foo bar baz> >) /

    / <short=*@abbrev> / and return %long{$<short>} || $<short>;

The pattern is restricted to declarative forms that can be rewritten
as nested optional character matches.  Sequence information
may not be discarded while making all following characters optional.
That is, it is not sufficient to rewrite:

    <*xyz>

as:

    x? y? z?            # bad, would allow xz

Instead, it must be implemented as:

    [x [y z?]?]?        # allow only x, xy, xyz (and '')

Explicit quantifiers are allowed on single characters, so this:

    <* a b+ c | ax*>

is rewritten as something like:

    [a [b+ c?]?]? | [a x*]?

In the latter example we're assuming the DFA token matcher is going to
give us the longest match regardless.  It's also possible that quantified
multi-character sequences can be recursively remapped:

    <* 'ab'+>     # match a, ab, ababa, etc. (but not aab!)
    ==> [ 'ab'* <*ab> ]
    ==> [ 'ab'* [a b?]? ]

[Conjecture: depending on how fancy we get, we might (or might not)
be able to autodetect ambiguities in C<< <*@abbrev> >> and refuse to
generate ambiguous abbreviations (although exact match of a shorter
abbrev should always be allowed even if it's the prefix of a longer
abbreviation).  If it is not possible, then the user will have to
check for ambiguities after the match. Note also that the array
form is assuming the array doesn't change often.  If it does, the
longest-token matcher has to be recalculated, which could get
expensive.]

=item *

A leading C<~~> indicates a recursive call back into some or all of
the current rule.  An optional argument indicates which subpattern
to re-use, and if provided must resolve to a single subpattern.
If omitted, the entire pattern is called recursively:

    <~~>       # call myself recursively
    <~~0>      # match according to $0's pattern
    <~~foo>    # match according to $<foo>'s pattern

Note that this rematches the pattern associated with the name, not
the string matched.  So

    $_ = "foodbard"

    / ( foo | bar ) d $0 /      # fails; doesn't match "foo" literally
    / ( foo | bar ) d <$0> /    # fails; doesn't match /foo/ as subrule
    / ( foo | bar ) d <~~0> /   # matches using rule associated with $0

The last is equivalent to

    / ( foo | bar ) d ( foo | bar ) /

Note that the "self" call of

    / <term> <operator> <~~> /

calls back into this anonymous rule as a subrule, and is implicitly
anchored to the end of the operator as any other subrule would be.
Despite the fact that the outer rule scans the string, the inner
call to it does not.

Note that a consequence of the previous section is that you also get

    <!~~>

for free, which fails if the current rule would match again at this location.

=item *

A leading C<|> indicates some kind of a zero-width boundary.

    <|w> word boundary
    <|g> grapheme boundary (always matches in grapheme mode)
    <|c> codepoint boundary (always matches in grapheme/codepoint mode)

=back
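
The rewriting constraint for partial matches described above (nested
optionals rather than a flat run of optional characters) can be
illustrated outside Perl 6.  The following is a minimal sketch in Python,
not part of this specification: the helper names C<nested_optional> and
C<flat_optional> are hypothetical, and Python's C<re> module merely stands
in for the regex engine.

```python
import re

def nested_optional(word: str) -> str:
    """Return a pattern matching exactly the prefixes of word (including '')."""
    pattern = ""
    # Build from the last character outward, giving the shape [x [y z?]?]?
    for ch in reversed(word):
        pattern = f"(?:{re.escape(ch)}{pattern})?"
    return pattern

def flat_optional(word: str) -> str:
    """Return the incorrect flat rewrite, x? y? z?"""
    return "".join(f"{re.escape(ch)}?" for ch in word)

candidates = ["", "x", "xy", "xyz", "xz", "yz", "z"]
good = re.compile(nested_optional("xyz"))
bad = re.compile(flat_optional("xyz"))

# Keep only the candidates that each pattern accepts as a whole string.
accepted_good = [s for s in candidates if good.fullmatch(s)]
accepted_bad = [s for s in candidates if bad.fullmatch(s)]

print(accepted_good)  # only the true prefixes of "xyz"
print(accepted_bad)   # the flat form also admits "xz", "yz", and "z"
```

The nested form preserves sequence information because each later
character is reachable only through the characters before it.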

The following tokens include angles but are not required to balance:

=over

=item *

A C<< <( >> token indicates the start of the match's overall capture, while the
corresponding C<< )> >> token indicates its endpoint.  When matched,
these behave as assertions that are always true, but have the side
effect of setting the C<.from> and C<.to> attributes of the match
object.  That is:

    / foo <( \d+ )> bar /

is equivalent to:

    / <?after foo> \d+ <?before bar> /

except that the scan for "C<foo>" can be done in the forward direction,
while a lookbehind assertion would presumably scan for C<\d+> and then
match "C<foo>" backwards.  The use of C<< <(...)> >> affects only the
positions of the beginning and ending of the match, and anything
calculated based on those positions.  For instance, after the match
above, C<$()> contains only the digits matched, and C<$/.to> points
just past the digits.
Other captures (named or numbered) are unaffected and may be accessed
through C<$/>.

These tokens are considered declarative, but may force backtracking behavior.

=item *

A C<«> or C<<< << >>> token indicates a left word boundary.  A C<»> or
C<<< >> >>> token indicates a right word boundary.  (As separate tokens,
these need not be balanced.)  Perl 5's C<\b> is replaced by a C<< <|w> >>
"word boundary" assertion, while C<\B> becomes C<< <!|w> >>.  (None of
these are dependent on the definition of C<< <.ws> >>, but only on the C<\w>
definition of "word" characters.  Non-space mark characters are ignored in
calculating word properties of the preceding character.  See TR18 1.4.)

=back
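
The equivalence between C<< <( ... )> >> and the lookaround form can be
checked by analogy in another engine.  This is a hedged sketch in Python,
not Perl 6: a capture group's span stands in for the C<.from> and C<.to>
set by C<< <( >> and C<< )> >>, and the sample string is made up for
illustration.

```python
import re

text = "xxfoo123barxx"

# Model of / foo <( \d+ )> bar /: match the context in the forward
# direction, then report only the bracketed part as the overall match.
framed = re.search(r"foo(\d+)bar", text)
start, end = framed.span(1)

# Model of / <?after foo> \d+ <?before bar> /.
lookaround = re.search(r"(?<=foo)\d+(?=bar)", text)

print(text[start:end], (start, end))
print(lookaround.group(), lookaround.span())
```

Both forms report the same text and the same positions. They differ only in how the engine gets there.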

=head2 Predefined Subrules

These are some of the predefined subrules for any grammar or regex:

=over

=item * ident
X<ident>X<< <ident> >>

Match an identifier.

=item * upper
X<upper>X<< <upper> >>

Match a single uppercase character.

=item * lower
X<lower>X<< <lower> >>

Match a single lowercase character.

=item * alpha
X<alpha>X<< <alpha> >>

Match a single alphabetic character.

=item * digit
X<digit>X<< <digit> >>

Match a single digit.

=item * xdigit
X<xdigit>X<< <xdigit> >>

Match a single hexadecimal digit.

=item * print
X<print>X<< <print> >>

Match a single printable character.

=item * graph
X<graph>X<< <graph> >>

Match a single "graphical" character.

=item * cntrl
X<cntrl>X<< <cntrl> >>

Match a single "control" character.  A control character is usually one
that doesn't produce output as such but instead controls the terminal
somehow: for example, newline and backspace are control characters.
All characters with C<ord()> less than 32 are usually classified as
control characters (assuming ASCII, the ISO Latin character sets, and
Unicode), as is the character with the C<ord()> value of 127 (DEL).

=item * punct
X<punct>X<< <punct> >>

Match a single punctuation character.

=item * alnum
X<alnum>X<< <alnum> >>

Match a single alphanumeric character.  This is equivalent to
C<< <+alpha +digit> >>.

=item * wb
X<wb>X<< <wb> >>

Returns a zero-width match that is true at word boundaries.  A word
boundary is a spot with a "\w" on one side and a "\W" on the other
side (in either order), counting the beginning and end of the string
as matching "\W".

=item * ww
X<ww>X<< <ww> >>

Matches between two word characters (zero-width match).

=item * ws
X<ws>X<< <ws> >>

Matches required whitespace between two word characters, optional
whitespace otherwise.  This is roughly equivalent to  C<< <!ww> \s* >>
(C<ws> isn't required to use the C<ww> subrule).

=item * space
X<space>X<< <space> >>

Match a single whitespace character (same as C<\s>).

=item * blank
X<blank>X<< <blank> >>

Match a single "blank" character -- in most locales, this corresponds
to space and tab.

=item * before C<pattern>
X<before>X<< <before> >>

Perform lookahead -- i.e., check if we're at a position where
C<pattern> matches.  Returns a zero-width C<Match> object on
success.

=item * after C<pattern>
X<after>X<< <after> >>

Perform lookbehind -- i.e., check if the string before the
current position matches C<pattern> (anchored at the end).
Returns a zero-width C<Match> object on success.

=item * <?>
X<?>X<< <?> >>

Match a null string, viz., always returns true.

=item * <!>
X<!>X<< <!> >>

Inverse of C<< <?> >>, viz., always returns false.

=back
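
The definitions of C<wb> and C<ww> above are precise enough to compute
by hand.  The following minimal sketch is in Python rather than Perl 6,
and it assumes that Python's C<\w> is an acceptable stand-in for the
"word" character class.  The function names are hypothetical.

```python
import re

def is_word_char(ch: str) -> bool:
    """True if the single character ch matches \\w."""
    return re.fullmatch(r"\w", ch) is not None

def side(text: str, index: int) -> bool:
    """True if text[index] exists and is a word character.

    Positions outside the string count as non-word, i.e. as \\W.
    """
    return 0 <= index < len(text) and is_word_char(text[index])

def wb_positions(text: str) -> list:
    """Positions with \\w on exactly one side (word boundaries)."""
    return [i for i in range(len(text) + 1)
            if side(text, i - 1) != side(text, i)]

def ww_positions(text: str) -> list:
    """Positions with a word character on both sides."""
    return [i for i in range(len(text) + 1)
            if side(text, i - 1) and side(text, i)]

sample = "hi, world"
print(wb_positions(sample))
print(ww_positions(sample))
```

Every position in a string is either a C<wb> position, a C<ww> position,
or a position between two non-word characters.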

=head1 Backslash reform

=over

=item *

The C<\p> and C<\P> properties become intrinsic grammar rules (such as
C<< <alpha> >> and C<< <-alpha> >>).  They may be combined using the
above-mentioned character class notation: C<< <[_]+alpha+digit> >>.
Regardless of the higher-level character class names, low-level
Unicode properties are always available with a prefix of colon, that is,
in pair notation within the angle brackets.
Hence, C<< <+:Lu+:Lt> >> is equivalent to C<< <+upper+title> >>.

=item *

The C<\L...\E>, C<\U...\E>, and C<\Q...\E> sequences are gone.  In the
rare cases that need them you can use C<< <{ lc $regex }> >> etc.

=item *

The C<\G> sequence is gone.  Use C<:p> instead.  (Note, however,
that it makes no sense to use C<:p> within a pattern, since every
internal pattern is implicitly anchored to the current position.)
See the C<at> assertion below.

=item *

Backreferences (e.g. C<\1>, C<\2>, etc.) are gone; C<$0>, C<$1>, etc. can be
used instead, because variables are no longer interpolated.

Numeric variables are assumed to change every time and therefore are
considered procedural, unlike normal variables.

=item *

New backslash sequences, C<\h> and C<\v>, match horizontal and vertical
whitespace respectively, including Unicode.  Horizontal whitespace is defined
as anything matching C<\s> that doesn't also match C<\v>.  Vertical whitespace is
defined as any of:

    U+000A  LINE FEED (LF)
    U+000B  LINE TABULATION
    U+000C  FORM FEED (FF)
    U+000D  CARRIAGE RETURN (CR)
    U+0085  NEXT LINE (NEL)
    U+2028  LINE SEPARATOR
    U+2029  PARAGRAPH SEPARATOR

Note that U+000D CARRIAGE RETURN (CR) is considered vertical whitespace despite
the fact that it only moves the "carriage" horizontally.

=item *

C<\s> now matches any Unicode whitespace character.

=item *

The new backslash sequence C<\N> matches anything except a logical
newline; it is the negation of C<\n>.

=item *

Other new capital backslash sequences are also the negations
of their lowercase counterparts:

=over

=item *

C<\H> matches anything but horizontal whitespace.

=item *

C<\V> matches anything but vertical whitespace.

=item *

C<\T> matches anything but a tab.

=item *

C<\R> matches anything but a return.

=item *

C<\F> matches anything but a formfeed.

=item *

C<\E> matches anything but an escape.

=item *

C<\X...> matches anything but the specified character (specified in
hexadecimal).

=back

=back
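
The split between C<\h> and C<\v> given above is a simple set
difference.  As a hedged illustration in Python (not Perl 6), the sketch
below hard-codes the vertical whitespace list from this section and
approximates C<\s> with C<str.isspace()>.  That approximation is an
assumption, since Python's notion of whitespace is not guaranteed to
coincide exactly with the Unicode whitespace property.

```python
# The vertical whitespace characters listed in this section.
VERTICAL = frozenset("\u000A\u000B\u000C\u000D\u0085\u2028\u2029")

def is_vertical(ch: str) -> bool:
    """Model of \\v: one of the listed line-breaking characters."""
    return ch in VERTICAL

def is_horizontal(ch: str) -> bool:
    """Model of \\h: whitespace that is not vertical whitespace.

    str.isspace() stands in for \\s here (an approximation).
    """
    return ch.isspace() and ch not in VERTICAL

for ch in [" ", "\t", "\u00A0", "\r", "\n", "\u2028", "a"]:
    print(repr(ch), is_horizontal(ch), is_vertical(ch))
```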

=head1 Regexes constitute a first-class language, rather than just being strings

=over

=item *

The Perl 5 C<qr/pattern/> regex constructor is gone.

=item *

The Perl 6 equivalents are:

     regex { pattern }    # always takes {...} as delimiters
     rx    / pattern /    # can take (almost) any chars as delimiters

You may not use whitespace or alphanumerics for delimiters.  Space is
optional unless needed to distinguish from modifier arguments or
function parens.  So you may use parens as your C<rx> delimiters,
but only if you interpose whitespace:

     rx ( pattern )      # okay
     rx( 1,2,3 )         # tries to call rx function

(This is true for all quotelike constructs in Perl 6.)

The C<rx> form may be used directly as a pattern anywhere a normal C<//> match can.
The C<regex> form is really a method definition, and must be used in such a way that
the grammar class it is to be used in is apparent.

=item *

If either form needs modifiers, they go before the opening delimiter:

     $regex = regex :s:i { my name is (.*) };
     $regex = rx:s:i     / my name is (.*) /;    # same thing

Space is necessary after the final modifier if you use any
bracketing character for the delimiter.  (Otherwise it would be taken as
an argument to the modifier.)

=item *

You may not use colons for the delimiter.  Space is allowed between
modifiers:

     $regex = rx :s :i / my name is (.*) /;

=item *

The name of the constructor was changed from C<qr> because it's no
longer an interpolating quote-like operator.  C<rx> is short for I<regex>
(not to be confused with regular expressions, except when they are).

=item *

As the syntax indicates, it is now more closely analogous to a C<sub {...}>
constructor.  In fact, that analogy runs I<very> deep in Perl 6.

=item *

Just as a raw C<{...}> is now always a closure (which may still
execute immediately in certain contexts and be passed as an object
in others), so too a raw C</.../> is now always a C<Regex> object (which
may still match immediately in certain contexts and be passed as an
object in others).

=item *

Specifically, a C</.../> matches immediately in a value context (void,
Boolean, string, or numeric), or when it is an explicit argument of
a C<~~>.  Otherwise it's a C<Regex> constructor identical to the explicit
C<regex> form.  So this:

     $var = /pattern/;

no longer does the match and sets C<$var> to the result.
Instead it assigns a C<Regex> object to C<$var>.

=item *

The two cases can always be distinguished using C<m{...}> or C<rx{...}>:

     $match = m{pattern};    # Match regex immediately, assign result
     $regex = rx{pattern};   # Assign the Regex object itself

=item *

Note that this means that former magically lazy usages like:

     @list = split /pattern/, $str;

are now just consequences of the normal semantics.

=item *

It's now also possible to set up a user-defined subroutine that acts
like C<grep>:

     sub my_grep($selector, *@list) {
         given $selector {
             when Regex { ... }
             when Code  { ... }
             when Hash  { ... }
             # etc.
         }
     }

When you call C<my_grep>, the first argument is bound in item context,
so passing C<{...}> or C</.../> produces a C<Code> or C<Regex> object,
which the switch statement then selects upon.  (Normal C<grep> just
lets a smartmatch operator do all the work.)

=item *

Just as C<rx> has variants, so does the C<regex> declarator.
In particular, there are two special variants for use in grammars:
C<token> and C<rule>.

A token declaration:

    token ident { [ <alpha> | _ ] \w* }

never backtracks by default.  That is, it likes to commit to whatever
it has scanned so far.  The above is equivalent to

    regex ident { [ <alpha>: | _: ]: \w*: }

but rather easier to read.  The bare C<*>, C<+>, and C<?> quantifiers
never backtrack in a C<token>.
In normal regexes, use
C<*:>, C<+:>, or C<?:> to prevent any backtracking into the quantifier.
If you want to explicitly backtrack, append either a C<?> or a C<!>
to the quantifier.   The C<?> forces frugal matching as usual,
while the C<!> forces greedy matching.  The C<token> declarator is
really just short for

    regex :ratchet { ... }

The other is the C<rule> declarator, for declaring non-terminal
productions in a grammar.  Like a C<token>, it also does not backtrack
by default.  In addition, a C<rule> regex also assumes C<:sigspace>.
A C<rule> is really short for:

    regex :ratchet :sigspace { ... }

=item *

The Perl 5 C<?...?> syntax (I<succeed once>) was rarely used and can
now be emulated more cleanly with a state variable:

    $result = do { state $x ||= m/ pattern /; }    # only matches first time

To reset the pattern, simply say C<$x = 0>.  Though if you want C<$x> visible
you'd have to avoid using a block:

    $result = state $x ||= m/ pattern /;
    ...
    $x = 0;

=back
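
The C<my_grep> idea above, where the selector arrives as an object and
the routine dispatches on its type, has a close analogue in other
languages with first-class pattern objects.  This is a minimal sketch in
Python under that analogy.  It is not Perl 6, and the particular
behaviour chosen for each selector type is an assumption made for
illustration.

```python
import re
from typing import Any, Iterable, List

def my_grep(selector: Any, items: Iterable[str]) -> List[str]:
    """Filter items according to the kind of selector passed in."""
    if isinstance(selector, re.Pattern):
        # A compiled pattern is an object; it only matches when asked to.
        return [x for x in items if selector.search(x)]
    if isinstance(selector, dict):
        # A mapping selects the items that appear among its keys.
        return [x for x in items if x in selector]
    if callable(selector):
        # Any other callable is treated as a predicate.
        return [x for x in items if selector(x)]
    raise TypeError(f"unsupported selector: {type(selector).__name__}")

words = ["a1", "b", "c3"]
print(my_grep(re.compile(r"\d"), words))
print(my_grep(lambda w: len(w) == 1, words))
print(my_grep({"b": True}, words))
```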

=head1 Backtracking control

Within those portions of a pattern that are considered procedural rather
than declarative, you may control the backtracking behavior.

=over

=item *

By default, backtracking is greedy in C<rx>, C<m>, C<s>, and the like.
It's also greedy in ordinary C<regex> declarations.  In C<rule>
and C<token> declarations, backtracking must be explicit.

=item *

To force the preceding atom to do frugal backtracking (also sometimes
known as "eager matching" or "minimal matching"),
append a C<:?> or C<?> to the atom.  If the preceding token is
a quantifier, the C<:> may be omitted, so C<*?> works just as
in Perl 5.

=item *

To force the preceding atom to do greedy backtracking in a
spot that would default otherwise, append a C<:!> to the atom.
If the preceding token is a quantifier, the C<:> may be omitted.
(Perl 5 has no corresponding construct because backtracking always
defaults to greedy in Perl 5.)

=item *

To force the preceding atom to do no backtracking, use a single C<:>
without a subsequent C<?> or C<!>.  Backtracking over a single colon
causes the regex engine not to retry the preceding atom:

     ms/ \( <expr> [ , <expr> ]*: \) /

(i.e. there's no point trying fewer C<< <expr> >> matches, if there's
no closing parenthesis on the horizon)

When modifying a quantifier, a C<+> may be used instead of a C<:>, in
which case the quantifier is often known as a I<possessive> quantifier.

     ms/ \( <expr> [ , <expr> ]*+ \) /  # same thing

To force all the atoms in an expression not to backtrack by default,
use C<:ratchet> or C<rule> or C<token>.

=item *

Evaluating a double colon throws away all saved choice points in the current
L<LTM|/"Longest-token matching"> alternation.

     ms/ [ if :: <expr> <block>
         | for :: <list> <block>
         | loop :: <loop_controls>? <block>
         ]
     /

(i.e. there's no point trying to match a different keyword if one was
already found but failed).

The C<::> also has the effect of hiding any declarative match on the right
from "longest token" processing by C<|>.  Only the left side is evaluated
for determinacy.

C<::> does nothing if there is no current LTM alternation. "Current"
is defined dynamically, not lexically.  A C<::> in a subrule will affect
the enclosing alternation.

=item *

Evaluating a C<< ::> >> throws away all saved choice points in the current
innermost temporal alternation.  It thus acts as a "then".

    ms/ [
        || <?{ $a == 1 }> ::> <foo>
        || <?{ $a == 2 }> ::> <bar>
        || <?{ $a == 3 }> ::> <baz>
        ]
    /

Note that you can still back into the "then" part of such
an alternation, so you may also need to put C<:> after it if you
also want to disable that.  If an explicit or implicit C<:ratchet>
has disabled backtracking by supplying an implicit C<:>, you need to
put an explicit C<!> after the alternation to enable backing into,
say, the C<< <foo> >> rule above.

C<< ::> >> does nothing if there is no current temporal alternation.
"Current" is defined dynamically, not lexically.  A C<< ::> >> in a
subrule will affect the enclosing alternation.

=item *

Evaluating a triple colon throws away all saved choice points since
the current regex was entered.  Backtracking to (or past) this point
will fail the rule outright (no matter where in the regex it occurs):

     regex ident {
           ( [<alpha>|_] \w* ) ::: { fail if %reserved{$0} }
         || " [<alpha>|_] \w* "
     }

     ms/ get <ident>? /

(i.e. using an unquoted reserved word as an identifier is not permitted)

=item *

Evaluating a C<< <commit> >> assertion throws away all saved choice
points since the start of the entire match.  Backtracking to (or past)
this point will fail the entire match, no matter how many subrules
down it happens:

     regex subname {
         ([<alpha>|_] \w*) <commit> { fail if %reserved{$0} }
     }
     ms/ sub <subname>? <block> /

(i.e. using a reserved word as a subroutine name is instantly fatal
to the I<surrounding> match as well)

If C<< <commit> >> is given an argument, it's the name of a calling rule
that should be committed:

    <commit('infix')>

=item *

A C<< <cut> >> assertion always matches successfully, and has the
side effect of logically deleting the parts of the string already
matched.  Whether this actually frees up the memory immediately may
depend on various interactions among your backreferences, the string
implementation, and the garbage collector.  In any case, the string
will report that it has been chopped off on the front.  It's illegal
to use C<< <cut> >> on a string that you do not have write access to.

Attempting to backtrack past a C<< <cut> >> causes the complete
match to fail (like backtracking past a C<< <commit> >>). This is
because there's now no preceding text to backtrack into.  This is
useful for throwing away successfully processed input when matching
from an input stream or an iterator of arbitrary length.

=back
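
The effect of a single C<:> (or of C<:ratchet>) can be imitated in
engines that lack it.  The sketch below is Python, not Perl 6, and uses
a well-known idiom: a capturing lookahead followed by a backreference.
Python does not backtrack into a lookahead once it has succeeded, so the
captured text behaves like an atom that is never retried.  Treat it as
an illustration of the semantics only.

```python
import re

text = "food"

# Like regex { \w* d }: the greedy \w* first takes "food", then gives
# back the final "d" when the literal d fails to match.
backtracking = re.fullmatch(r"\w*d", text)

# Like regex { \w*: d } or token { \w* d }: \w* takes "food" and is
# never retried, so the literal d has nothing left to match.
ratcheted = re.fullmatch(r"(?=(\w*))\1d", text)

# A ratcheted quantifier still succeeds when no give-back is needed.
ratcheted_ok = re.fullmatch(r"(?=(\d*))\1d", "123d")

print(backtracking is not None, ratcheted is not None, ratcheted_ok is not None)
```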

=head1 Regex Routines, Named and Anonymous

=over

=item *

The analogy between C<sub> and C<regex> extends much further.

=item *

Just as you can have anonymous subs and named subs...

=item *

...so too you can have anonymous regexes and I<named> regexes (and tokens,
and rules):

     token ident { [<alpha>|_] \w* }

     # and later...

     @ids = grep /<ident>/, @strings;

=item *

As the above example indicates, it's possible to refer to named regexes,
such as:

     regex serial_number { <[A..Z]> \d**8 }
     token type { alpha | beta | production | deprecated | legacy }

in other regexes as named assertions:

     rule identification { [soft|hard]ware <type> <serial_number> }

These keyword-declared regexes are officially of type C<Method>,
which is derived from C<Routine>.

In general, the anchoring of any subrule call is controlled by its calling context.
When a regex, token, or rule method is called as a subrule, the
front is anchored to the current position (as with C<:p>), while
the end is not anchored, since the calling context will likely wish
to continue parsing.  However, when such a method is smartmatched
directly, it is automatically anchored on both ends to the beginning
and end of the string.  Thus, you can do direct pattern matching
by using an anonymous regex routine as a standalone pattern:

    $string ~~ regex { \d+ }
    $string ~~ token { \d+ }
    $string ~~ rule { \d+ }

and these are equivalent to

    $string ~~ m/^ \d+ $/;
    $string ~~ m/^ \d+: $/;
    $string ~~ m/^ <.ws> \d+: <.ws> $/;

The basic rule of thumb is that the keyword-defined methods never
do implicit C<.*?>-like scanning, while the C<m//> and C<s///>
quotelike forms do such scanning in the absence of explicit anchoring.

The C<rx//> and C<//> forms can go either way: they scan when used
directly within a smartmatch or boolean context, but when called
indirectly as a subrule they do not scan.  That is, the object returned
by C<rx//> behaves like C<m//> when used directly, but like C<regex>
C<{}> when used as a subrule:

    $pattern = rx/foo/;
    $string ~~ $pattern;                  # equivalent to m/foo/;
    $string ~~ /'[' <$pattern> ']'/       # equivalent to /'[foo]'/

=back
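
The three anchoring behaviours described above (anchored at both ends,
scanning, and anchored only at the front) correspond loosely to three
entry points in Python's C<re> module.  The mapping below is an analogy
for illustration, not a statement about how Perl 6 is implemented.

```python
import re

digits = re.compile(r"\d+")

# Like "abc123" ~~ regex { \d+ }: anchored at both ends, so it fails.
whole = digits.fullmatch("abc123")

# Like "abc123" ~~ m/ \d+ /: scans forward for a place to match.
scan = digits.search("abc123")

# Like a subrule call: anchored at the current position, end left open.
sub_at_start = digits.match("123abc")
sub_misplaced = digits.match("abc123")      # no scanning, so no match
sub_at_three = digits.match("abc123", 3)    # current position is 3

print(whole, scan.span(), sub_at_start.group(), sub_misplaced, sub_at_three.group())
```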

=head1 Nothing is illegal

=over

=item *

The empty pattern is now illegal.

=item *

To match whatever the prior successful regex matched, use:

     / <prior> /

=item *

To match the zero-width string, you must use some explicit
representation of the null match:

    / '' /;
    / <?> /;

For example:

     split /''/, $string

splits between characters.  But then, so does this:

     split '', $string

=item *

Likewise, to match an empty alternative, use something like:

     /a|b|c|<?>/
     /a|b|c|''/

This makes it easier to catch errors like this:

    /a|b|c|/

As a special case, however, the first null alternative in a match like

     ms/ [
         | if :: <expr> <block>
         | for :: <list> <block>
         | loop :: <loop_controls>? <block>
         ]
     /

is simply ignored.  Only the first alternative is special that way.
If you write:

     ms/ [
             if :: <expr> <block>              |
             for :: <list> <block>             |
             loop :: <loop_controls>? <block>  |
         ]
     /


it's still an error.

=item *

However, it's okay for a non-null syntactic construct to have a degenerate
case matching the null string:

     $something = "";
     /a|b|c|$something/;

In particular, C<< <?> >> always matches the null string successfully,
and C<< <!> >> always fails to match anything.

=back

=head1 Longest-token matching

Instead of representing temporal alternation, C<|> now represents
logical alternation with declarative longest-token semantics.  (You may
now use C<||> to indicate the old temporal alternation.  That is, C<|>
and C<||> now work within regex syntax much the same as they do outside
of regex syntax, where they represent junctional and short-circuit OR.
This includes the fact that C<|> has tighter precedence than C<||>.)

Historically regex processing has proceeded in Perl via a backtracking
NFA algorithm.  This is quite powerful, but many parsers work more
efficiently by processing rules in parallel rather than one after
another, at least up to a point.  If you look at something like a
yacc grammar, you find a lot of pattern/action declarations where the
patterns are considered in parallel, and eventually the grammar decides
which action to fire off.  While the default Perl view of parsing is
essentially top-down (perhaps with a bottom-up "middle layer" to handle
operator precedence), it is extremely useful for user understanding
if at least the token processing proceeds deterministically.  So for
regex matching purposes we define token patterns as those patterns
that can be matched without potential side effects or self-reference.
(Since whitespace often has side effects at line transitions, it
is usually excluded from such patterns, give or take a little
lookahead.)  Basically, Perl automatically derives a lexer
from the grammar without you having to write one yourself.

To that end, every regex in Perl 6 is required to be able to
distinguish its "pure" patterns from its actions, and return its
list of initial token patterns (transitively including the token
patterns of any subrule called by the "pure" part of that regex, but
not including any subrule more than once, since that would involve
self reference, which is not allowed in traditional regular
expressions).  A logical alternation using C<|> then takes two or
more of these lists and dispatches to the alternative that matches
the longest token prefix.  This may or may not be the alternative
that comes first lexically.

However, if two alternatives match at the same length, the tie is
broken first by specificity.  The alternative that starts with the
longest fixed string wins; that is, an exact match counts as closer
than a match made using character classes.  If that doesn't work, the tie
is broken by one of two methods.  If the alternatives are in different
grammars, standard MRO (method resolution order) determines which
one to try first.  If the alternatives are in the same grammar file, the
textually earlier alternative takes precedence.  (If a grammar's rules
are defined in more than one file, the order is undefined, and an explicit
assertion must be used to force failure if the wrong one is tried first.)
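
The selection order described so far (longest match, then longest
literal prefix, then earlier declaration) can be written down as a sort
key.  The following is a simplified sketch in Python, not Perl 6.  Each
alternative is given as a hypothetical triple of label, literal prefix,
and remaining pattern.  The MRO tie-break is omitted, and Python's
backtracking engine is assumed to find the longest match, which holds
for these simple patterns but not for arbitrary ones.

```python
import re
from typing import List, Optional, Tuple

Alternative = Tuple[str, str, str]  # (label, literal prefix, rest of pattern)

def ltm_choose(alts: List[Alternative], text: str, pos: int = 0) -> Optional[str]:
    """Pick an alternative the way | is described: longest token first."""
    best_label, best_key = None, None
    for index, (label, literal, rest) in enumerate(alts):
        m = re.compile(re.escape(literal) + rest).match(text, pos)
        if m is None:
            continue
        # Longer match wins, then longer literal prefix, then earlier position.
        key = (m.end() - pos, len(literal), -index)
        if best_key is None or key > best_key:
            best_label, best_key = label, key
    return best_label

def first_match(alts: List[Alternative], text: str, pos: int = 0) -> Optional[str]:
    """Pick an alternative the way || does: first one that matches."""
    for label, literal, rest in alts:
        if re.compile(re.escape(literal) + rest).match(text, pos):
            return label
    return None

keywords = [("fo", "fo", ""), ("foo", "foo", "")]
prefixed = [("generic", "foo", r"\w*"), ("specific", "food", r"\w*")]
tied = [("earlier", "if", ""), ("later", "if", "")]

print(ltm_choose(keywords, "foo"), first_match(keywords, "foo"))
print(ltm_choose(prefixed, "foods"), ltm_choose(prefixed, "fool"))
print(ltm_choose(tied, "if"))
```

With C<keywords>, the two strategies disagree: the longest-token choice
is the second alternative, while the temporal choice stops at the first.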

This longest token prefix corresponds roughly to the notion of "token"
in other parsing systems that use a lexer, but in the case of Perl
this is largely an epiphenomenon derived automatically from the grammar
definition.  However, despite being automatically calculated, the set of
tokens can be modified by the user; various
constructs within a regex declaratively tell the grammar engine that
it is finished with the pattern part and starting in on the side effects,
so by inserting such constructs the user controls what is considered
a token and what is not.  The constructs deemed to terminate a token
declaration and start the "action" part of the pattern include:

=over

=item *

Any C<::> or C<:::> backtracking control (but not the C<:> possessive modifier).

=item *

Any atom that is quantified with a frugal match (using the C<?> modifier).

=item *

Any C<{...}> action, but not an assertion containing a closure.
The closure form of the general C<**{...}> quantifier terminates the
longest token, but not the closureless forms.

=item *

Any sequential control flow operator such as C<||> or C<&&>.

=item *

As a consequence of the previous point, and because the standard
grammar's C<< <ws> >> rule defines whitespace using C<||>, the
longest token is also terminated by any part of the regex or rule
that I<might> match whitespace using that rule, including whitespace
implicitly matched via C<:sigspace>.  (However, token declarations are
specifically allowed to recognize whitespace within a token by using
such lower-level primitives as C<\h+> or other character classes.)

=back

Subpatterns (captures) specifically do not terminate the token pattern,
but may require a reparse of the token to find the location
of the subpatterns.  Likewise assertions may need to be checked out
after the longest token is determined.  (Alternately, if DFA semantics
are simulated in any of various ways, such as by Thompson NFA, it may
be possible to know when to fire off the assertions without backchecks.)

Greedy quantifiers and character classes do not terminate a token pattern.
Zero-width assertions such as word boundaries are also okay.

Because such assertions can be part of the token, the lexer engine must
be able to recover from the failure of such an assertion and backtrack
to the next best token candidate, which might be the same length or shorter,
but can never be longer than the current candidate.

For a pattern that starts with a positive lookahead assertion,
the assertion is assumed to be more specific than the subsequent
pattern, so the lookahead's pattern is treated as the longest token;
the longest-token matcher will be smart enough to rematch any text
traversed by the lookahead when (and if) it continues the match.

Oddly enough, the C<token> keyword specifically does not determine
the scope of a token, except insofar as a token pattern usually
doesn't do much matching of whitespace.  In contrast, the C<rule>
keyword (which assumes C<:sigspace>) defines a pattern that tends
to disqualify itself on the first whitespace following the first
recognized item.  So most of the token patterns will end up coming
from C<token> declarations.  For instance,
a token declaration such as

    token list_composer { \[ <expr> \] }

considers its "longest token" to be just the left square bracket, because
the first thing the C<expr> rule will do is traverse optional whitespace.
As an exception to this, and in order to promote readability,
alternations inside rules get special treatment.  If an alternation in a
rule, or any other context where C<:sigspace> is active, has any leading
whitespace on any of the alternatives, that whitespace is ignored.
That is, C<rule { [ a | b ] }> is treated as
if it were C<rule { [a |b ] }>, and the L<LTM|/"Longest-token matching">
match begins with the first non-sigspace atom.  This exception applies to
the rule itself as well, since any rule might participate in an alternation
higher in the grammar.  And just to keep things simple, we say that the initial
whitespace in any regex before the first actual match is not subject to significance.
This includes any whitespace after a C<:sigspace>, if that declaration is the first
thing in the regex.

The initial token matcher must take into account case sensitivity
(or any other canonicalization primitives) and do the right thing even
when propagated up to rules that don't have the same canonicalization.
That is, they must continue to represent the set of matches that the
lower rule would match.

The C<||> form has the old short-circuit semantics, and will not
attempt to match its right side unless all possibilities (including
all C<|> possibilities) are exhausted on its left.  The first C<||>
in a regex makes the token patterns on its left available to the
outer longest-token matcher, but hides any subsequent tests from
longest-token matching.  Every C<||> establishes a new longest-token
matcher.  That is, if you use C<|> on the right side of C<||>, that
right side establishes a new top level scope for longest-token processing
for this subexpression and any called subrules.  The right side's
longest-token automaton is invisible to the left of the C<||> or outside
the regex containing the C<||>.

=head1 Return values from matches

=head2 Match objects

=over

=item *

A match always returns a C<Match> object, which is also available
as C<$/>, which is a dynamic lexical declared in the outer
routine that is calling the regex.  (A named C<regex>, C<token>,
or C<rule> is a routine, and hence declares its own
lexical C<$/> variable, which always refers to the most recent
submatch within the rule, if any.)  The current match state is
kept in the regex's C<$¢> variable which will eventually get
bound to the user's C<$/> variable when the match completes.

=item *

Notionally, a match object contains (among other things) a boolean
success value, an array of ordered submatch objects, and a hash of named
submatch objects.  (It also optionally carries an I<abstract object> normally
used to build up an abstract syntax tree.)  To provide convenient
access to these various values, the match object evaluates differently
in different contexts:

=over

=item *

In boolean context it evaluates as true or false (i.e. did the match
succeed?):

     if /pattern/ {...}
     # or:
     /pattern/; if $/ {...}

With C<:global> or C<:overlap> or C<:exhaustive> the boolean is
allowed to return true on the first match.  The C<Match> object can
produce the rest of the results lazily if evaluated in list context.

=item *

In string context it evaluates to the stringified value of its match,
which is usually the entire matched string:

     print %hash{ "{$text ~~ /<.ident>/}" };
     # or equivalently:
     $text ~~ /<.ident>/  &&  print %hash{~$/};

But generally you should say C<~$/> if you mean C<~$/>.

=item *

In numeric context it evaluates to the numeric value of its match,
which is usually the entire matched string:

     $sum += /\d+/;
     # or equivalently:
     /\d+/; $sum = $sum + $/;

=item *

When used as a scalar, a C<Match> object evaluates to itself.

However, sometimes you would like an alternate scalar value to
ride along with the match.  The C<Match> object itself describes
a concrete parse tree, so this extra value is called an I<abstract>
object; it rides along as an attribute of the C<Match> object.
The C<.ast> method by default returns an undefined value.
C<$()> is a shorthand for C<$($/.ast // ~$/)>.

Therefore C<$()> is usually just the entire match string, but
you can override that by calling C<make> inside a regex:

    my $moose = $(m[
        <antler> <body>
        { make Moose.new( body => $<body>.attach($<antler>) ) }
        # match succeeds -- ignore the rest of the regex
    ]);

This puts the new abstract node into C<$/.ast>.  An AST node
may be of any type.
This makes it convenient to build up an abstract syntax tree of
arbitrary node types.

=item *

You may also capture a subset of the match using the C<< <(...)> >> construct:

    "foo123bar" ~~ / foo <( \d+ )> bar /;
    say $();    # says 123

In this case C<$()> is always a string when doing string
matching, and a list of one or more elements when doing list matching.
This construct does not set the C<.ast> attribute.

=item *

When used as an array, a C<Match> object pretends to be an array of all
its positional captures.  Hence

     ($key, $val) = ms/ (\S+) '=>' (\S+)/;

can also be written:

     $result = ms/ (\S+) '=>' (\S+)/;
     ($key, $val) = @$result;

To get a single capture into a string, use a subscript:

     $mystring = "{ ms/ (\S+) '=>' (\S+)/[0] }";

To get all the captures into a string, use a I<zen> slice:

     $mystring = "{ ms/ (\S+) '=>' (\S+)/[] }";

Or cast it into an array:

     $mystring = "@( ms/ (\S+) '=>' (\S+)/ )";

Note that, as a scalar variable, C<$/> doesn't automatically flatten
in list context.  Use C<@()> as a shorthand for C<@($/)> to flatten
the positional captures under list context.  Note that a C<Match> object
is allowed to evaluate its match lazily in list context.  Use C<eager @()>
to force an eager match.

=item *

When used as a hash, a C<Match> object pretends to be a hash of all its named
captures.  The keys do not include any sigils, so if you capture to
variable C<< @<foo> >> its real name is C<$/{'foo'}> or C<< $/<foo> >>.
However, you may still refer to it as C<< @<foo> >> anywhere C<$/>
is visible.  (But it is erroneous to use the same name for two different
capture datatypes.)

Note that, as a scalar variable, C<$/> doesn't automatically flatten
in list context.  Use C<%()> as a shorthand for C<%($/)> to flatten as a
hash, or bind it to a variable of the appropriate type.  As with C<@()>,
it's possible for C<%()> to produce its pairs lazily in list context.

=item *

The numbered captures may be treated as named, so C<< $<0 1 2> >>
is equivalent to C<$/[0,1,2]>.  This allows you to write slices of
intermixed named and numbered captures.

=item *

The C<.keys>, C<.values> and C<.kv> methods act both on the list and hash
part, with the list part coming first.

    'abcd' ~~ /(.)(.)**2 <alpha>/;
    say ~$/.keys;           # 0 1 alpha

=item *

In ordinary code, variables C<$0>, C<$1>, etc. are just aliases into
C<$/[0]>, C<$/[1]>, etc.  Hence they will all be undefined if the
last match failed (unless they were explicitly bound in a closure without
using the C<let> keyword).

=back
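
As a rough illustration of the context rules above, the following sketch
forces each context explicitly.  The sample string and the capture name
C<num> are invented for this example, and the method spellings C<.list>
and C<.hash> assume a current Rakudo-style implementation rather than
anything mandated by this synopsis.

     my $m = 'width=42' ~~ / (\w+) '=' $<num>=(\d+) /;

     say ?$m;              # boolean context: did the match succeed?
     say ~$m;              # string context: the matched text
     say +$m<num>;         # numeric context, applied to one capture
     say $m.list.elems;    # array-like view: positional captures only
     say $m.hash.keys;     # hash-like view: named captures only

Numeric context is applied to the C<num> capture rather than to the whole
match here, because the whole matched string would not parse as a number.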

=item *

C<Match> objects have methods that provide additional information about
the match. For example:

     if m/ def <ident> <codeblock> / {
         say "Found sub def from index $/.from.bytes ",
             "to index $/.to.bytes";
     }

The currently defined methods are

    $/.from      # the initial match position
    $/.to        # the final match position
    $/.chars     # $/.to - $/.from
    $/.orig      # the original match string
    $/.Str       # substr($/.orig, $/.from, $/.chars)
    $/.ast       # the abstract result associated with this node
    $/.caps      # sequential captures
    $/.chunks    # sequential tokenization
    $/.prematch  # $/.orig.substr(0, $/.from)
    $/.postmatch # $/.orig.substr($/.to)

Within the regex the current match state C<$¢> also provides

    .pos        # the current match position

This last value may correspond to either C<$¢.from> or C<$¢.to> depending
on whether the match is proceeding in a forward or backward direction
(the latter case arising inside an C<< <?after ...> >> assertion).

=item *

As described above, a C<Match> in list context returns its positional
captures.  However, sometimes you'd rather get a flat list of tokens
in the order they occur in the text.  The C<.caps> method returns
a list of every capture in order, regardless of how it was otherwise
bound into named or numbered captures.  (Other than order, there is
no new information here; all the elements of the list are the very
same C<Match> objects that are bound elsewhere.)  The bindings are actually
returned as key/value pairs where the key is the name or number under which
the match object was bound, and the value is the match object itself.

In addition to returning those captured C<Match> objects, the
C<.chunks> method also returns all the interleaved "noise" between
the captures.  As with C<.caps>, the list elements are in the order
they were originally in the text.  The interleaved bits are also
returned as pairs, where the key is '~' and the value
is a simple C<Match> object containing only the string, even if unbound
subrules such as C<.ws> were called to traverse the text in the first
place.  Calling C<.ast> on such a C<Match> object always returns a C<Str>.

A warning will be issued if either C<.caps> or C<.chunks> discovers
that it has overlapping bindings.  In the absence of such overlap,
C<.chunks> guarantees to map every part of its matched string (between
C<.from> and C<.to>) to exactly one element of its returned matches,
so coverage is complete.

[Conjecture: we could also have C<.deepcaps> and C<.deepchunks> that
recursively expand any capture containing submatches.  Presumably the
keys of such returned chunks would indicate the "pedigree" of bindings
in the parse tree.]

=item *

All match attempts--successful or not--against any regex, subrule, or
subpattern (see below) return an object of class C<Match>. That is:

     $match_obj = $str ~~ /pattern/;
     say "Matched" if $match_obj;

=item *

This returned object is also automatically bound to the lexical
C<$/> variable of the current surroundings regardless of success. That is:

     $str ~~ /pattern/;
     say "Matched" if $/;

=item *

Inside a regex, the C<$¢> variable holds the current regex's incomplete
C<Match> object, known as a match state (of type C<Cursor>).  Generally this should not
be modified unless you know how to create and propagate match states.
All regexes actually return match states even when you think they're
returning something else, because the match states keep track of
the successes and failures of the pattern for you.

Fortunately, when you just want to return a different abstract result along with
the default concrete C<Match> object, you may associate your return value with
the current match state using the C<make> function, which works something
like a C<return>, but doesn't clobber the match state:

    $str ~~ / foo                 # Match 'foo'
               { make 'bar' }     # But pretend we matched 'bar'
             /;
    say $();                      # says 'bar'

The abstract object of any C<Match> object is available via the
C<< .ast >> method.  Hence these abstract objects can be managed
independently of the returned cursor objects.

The current cursor object must always be derived from C<Cursor>, or the
match will not work.  However, within that constraint, the actual type
of the current cursor defines which language you are currently parsing.
When you enter the top of a grammar, this cursor generally starts out
as an object whose type is the name of the grammar you are in, but the
current language can be modified by various methods as they mutate the
current language by returning cursor objects blessed into a different
type, which may or may not be derived from the current grammar.

=back
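
The position methods, C<make>, and C<.chunks> described in this section
can be seen together in a small sketch.  The input strings are invented,
and the behavior shown assumes a current Rakudo-style implementation, so
treat it as an illustration rather than as normative.

     my $m = 'abc 123 def' ~~ / (\d+) { make $0 * 2 } /;

     say $m.from;         # offset where the match starts
     say $m.to;           # offset just past the end of the match
     say $m.prematch;     # text before the match
     say $m.postmatch;    # text after the match
     say ~$m;             # the concrete match is still the digits
     say $m.ast;          # the abstract value attached by make

     # .chunks interleaves the captures with the uncaptured "noise",
     # which is keyed by '~'.
     'foo=bar' ~~ / (\w+) '=' (\w+) /;
     my @parts = $/.chunks.map({ .key ~ ':' ~ .value });
     say @parts.join(' | ');

Note that C<make> changes only what C<.ast> returns; the matched text and
the positions still describe what was actually matched.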

=head2 Subpattern captures

=over

=item *

Any part of a regex that is enclosed in capturing parentheses is called a
I<subpattern>. For example:

        #               subpattern
        #  _________________/\___________________
        # |                                      |
        # |       subpattern  subpattern         |
        # |          __/\__    __/\__            |
        # |         |      |  |      |           |
      ms/ (I am the (walrus), ( khoo )**2  kachoo) /;


=item *

Each subpattern in a regex produces a C<Match> object if it is
successfully matched.

=item *

Each subpattern is either explicitly assigned to a named destination or
implicitly added to an array of matches.

For each subpattern that is not explicitly given a name,
the subpattern's C<Match> object is pushed onto the array inside
the outer C<Match> object belonging to the surrounding scope (known as
its I<parent C<Match> object>). The surrounding scope may be either the
innermost surrounding subpattern (if the subpattern is nested) or else
the entire regex itself.

=item *

Like all captures, these assignments to the array are hypothetical, and
are undone if the subpattern is backtracked.

=item *

For example, if the following pattern matched successfully:

        #                subpat-A
        #  _________________/\__________________
        # |                                     |
        # |         subpat-B  subpat-C          |
        # |          __/\__    __/\__           |
        # |         |      |  |      |          |
      ms/ (I am the (walrus), ( khoo )**2 kachoo) /;

then the C<Match> objects representing the matches made by I<subpat-B>
and I<subpat-C> would be successively pushed onto the array inside
I<subpat-A>'s C<Match> object. Then I<subpat-A>'s C<Match> object would
itself be pushed onto the array inside the C<Match> object for the entire
regex (i.e. onto C<$/>'s array).

=item *

As a result of these semantics, capturing parentheses in Perl 6 are
hierarchical, not linear (see L<Nested subpattern captures>).

=back

=head2 Accessing captured subpatterns

=over

=item *

The array elements of a C<Match> object are referred to using either the
standard array access notation (e.g. C<$/[0]>, C<$/[1]>, C<$/[2]>, etc.)
or else via the corresponding lexically scoped numeric aliases (i.e.
C<$0>, C<$1>, C<$2>, etc.) So:

     say "$/[1] was found between $/[0] and $/[2]";

is the same as:

     say "$1 was found between $0 and $2";

=item *

Note that, in Perl 6, the numeric capture variables start from C<$0>, not
C<$1>, with the numbers corresponding to the element's index inside C<$/>.

=item *

The array elements of the regex's C<Match> object (i.e. C<$/>)
store individual C<Match> objects representing the substrings that were
matched and captured by the first, second, third, etc. I<outermost>
(i.e. unnested) subpatterns. So these elements can be treated like fully
fledged match results. For example:

     if m/ (\d\d\d\d)-(\d\d)-(\d\d) (BCE?|AD|CE)?/ {
           ($yr, $mon, $day) = $/[0..2];
           $era = "$3" if $3;                    # stringify/boolify
           @datepos = ( $0.from() .. $2.to() );  # Call Match methods
     }


=back
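
A self-contained variant of the date example, as a sketch: the date string
is invented, and the hyphens are quoted so that the pattern does not
depend on how an implementation treats bare punctuation.

     my ($yr, $mon, $day, $era);

     if '2012-07-31 CE' ~~ / (\d ** 4) '-' (\d\d) '-' (\d\d) \s* (BCE?|AD|CE)? / {
         ($yr, $mon, $day) = $/[0..2];    # array-style access to the captures
         $era = ~$3 if $3;                # boolify, then stringify
         say "$mon was found between $yr and $day";
         say ~$0 eq ~$/[0];               # $0 is just an alias for $/[0]
         say $0.from .. $2.to;            # each capture is a full Match object
     }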

=head2 Nested subpattern captures

=over

=item *

Substrings matched by I<nested> subpatterns (i.e. nested capturing
parens) are assigned to the array inside the nested subpattern's parent C<Match>
object, not to the array of C<$/>.

=item *

This behavior is quite different from Perl 5 semantics:

      # Perl 5...
      #
      # $1---------------------  $4---------  $5------------------
      # |   $2---------------  | |          | | $6----  $7------  |
      # |   |         $3--   | | |          | | |     | |       | |
      # |   |         |   |  | | |          | | |     | |       | |
     m/ ( A (guy|gal|g(\S+)  ) ) (sees|calls) ( (the|a) (gal|guy) ) /x;

=item *

In Perl 6, nested parens produce properly nested captures:

      # Perl 6...
      #
      # $0---------------------  $1---------  $2------------------
      # |   $0[0]------------  | |          | | $2[0]-  $2[1]---  |
      # |   |       $0[0][0] | | |          | | |     | |       | |
      # |   |         |   |  | | |          | | |     | |       | |
     m/ ( A (guy|gal|g(\S+)  ) ) (sees|calls) ( (the|a) (gal|guy) ) /;


=back
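
The nesting can be checked with a concrete string.  This is only a sketch:
the sentence is invented, and whitespace is matched with explicit C<\s+>
so that the example does not rely on sigspace rules.

     'A geek sees the gal' ~~
         / ( A \s+ (guy|gal|g(\S+)) ) \s+ (sees|calls) \s+ ( (the|a) \s+ (gal|guy) ) /;

     say ~$0;          # the whole first group
     say ~$0[0];       # the paren nested directly inside it
     say ~$0[0][0];    # the paren nested two levels down
     say ~$1;          # second top-level group
     say ~$2[0];       # first paren inside the third top-level group
     say ~$2[1];       # second paren inside the third top-level group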

=head2 Quantified subpattern captures

=over

=item *

If a subpattern is directly quantified with C<?>, it either produces
a single C<Match> object, or C<Nil>.  If a subpattern is directly
quantified using any other quantifier, it never produces a single
C<Match> object.  Instead, it produces a list of C<Match> objects
corresponding to the sequence of individual matches made by the
repeated subpattern.  If we need to distinguish the two categories,
C<?> is an I<item quantifier>, while C<*>, C<+>, and C<**> are called
I<list quantifiers>.

If 0 values match, the captured value depends on which quantifier
is used.  If the quantifier is C<?>, a C<Nil> is captured if it
matched 0 times.  If the quantifier is C<*>, the empty list, C<()>,
is captured instead.  (Nothing is captured by the C<+> quantifier
if it matches 0 times, since it causes backtracking, but the capture
variable should return C<Nil> if an attempt is made to use it after
an unsuccessful match.)  A C<**> quantifier returns C<()> as C<*> does
if the minimum of its range is 0, and backtracks otherwise.

Note that C<** 0..1> is always considered a list quantifier, unlike C<?>.

The rationale for treating C<?> as an item quantifier is to make
it consistent with how C<$object.?meth> is defined, and to reduce
the need for gratuitous C<.[0]> subscripts, which surprise
most people.  Now that C<Nil> is considered undefined rather than a
synonym for C<()>, it's easy to use C<$0 // "default"> or some such
to dereference a capture safely.

=item *

Because a list-quantified subpattern returns a list of C<Match> objects, the
corresponding array element for the quantified capture will store a
(nested) array rather than a single C<Match> object.  For example:

     if m/ (\w+) \: (\w+ \s+)* / {
         say "Key:    $0";         # Unquantified --> single Match
         say "Values: @($1)";      # Quantified   --> array of Match
     }


=back
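
The difference between the item quantifier and the list quantifiers shows
up most clearly when nothing matches.  A minimal sketch with invented
strings, assuming a current Rakudo-style implementation:

     # Item quantifier: zero matches leaves the capture undefined.
     'ab' ~~ / (a) (x)? (b) /;
     my $fallback = ~($1 // 'none');
     say $fallback;

     # List quantifier: zero matches leaves an empty list.
     'ab' ~~ / (a) (x)* (b) /;
     my $empty-count = $1.elems;
     say $empty-count;

     # List quantifier with matches: one Match object per repetition.
     'colors: red green ' ~~ / (\w+) ':' \s* (\w+ \s+)* /;
     my @values = $1.map(~*);
     say "Key: $0, values: ", @values.join('|');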

=head2 Indirectly quantified subpattern captures

=over

=item *

A subpattern may sometimes be nested inside a quantified non-capturing
structure:

      #       non-capturing       quantifier
      #  __________/\____________  __/\__
      # |                        ||      |
      # |   $0         $1        ||      |
      # |  _^_      ___^___      ||      |
      # | |   |    |       |     ||      |
     m/ [ (\w+) \: (\w+ \h*)* \n ] ** 2..* /

Non-capturing brackets I<don't> create a separate nested lexical scope,
so the two subpatterns inside them are actually still in the regex's
top-level scope, hence their top-level designations: C<$0> and C<$1>.

=item *

However, because the two subpatterns are inside a quantified
structure, C<$0> and C<$1> will each contain an array.
The elements of that array will be the submatches returned by the
corresponding subpatterns on each iteration of the non-capturing
brackets. For example:

     my $text = "foo:food fool\nbar:bard barb\n";

               #   $0--     $1------
               #   |   |    |       |
     $text ~~ m/ [ (\w+) \: (\w+ \h*)* \n ] ** 2..* /;

     # Because they're in a quantified non-capturing block...
     # $0 contains the equivalent of:
     #
     #       [ Match.new(str=>'foo'), Match.new(str=>'bar') ]
     #
     # and $1 contains the equivalent of:
     #
     #       [ Match.new(str=>'food '),
     #         Match.new(str=>'fool' ),
     #         Match.new(str=>'bard '),
     #         Match.new(str=>'barb' ),
     #       ]


=item *

In contrast, if the outer quantified structure is a I<capturing>
structure (i.e. a subpattern) then it I<will> introduce a nested
lexical scope. That outer quantified structure will then
return an array of C<Match> objects representing the captures
of the inner parens for I<every> iteration (as described above). That is:

     my $text = "foo:food fool\nbar:bard barb\n";

               # $0-----------------------
               # |                        |
               # | $0[0]    $0[1]---      |
               # | |   |    |       |     |
     $text ~~ m/ ( (\w+) \: (\w+ \h*)* \n ) ** 2..* /;

     # Because it's in a quantified capturing block,
     # $0 contains the equivalent of:
     #
     #       [ Match.new( str=>"foo:food fool\n",
     #                    arr=>[ Match.new(str=>'foo'),
     #                           [
     #                               Match.new(str=>'food '),
     #                               Match.new(str=>'fool'),
     #                           ]
     #                         ],
     #                  ),
     #         Match.new( str=>"bar:bard barb\n",
     #                    arr=>[ Match.new(str=>'bar'),
     #                           [
     #                               Match.new(str=>'bard '),
     #                               Match.new(str=>'barb'),
     #                           ]
     #                         ],
     #                  ),
     #       ]
     #
     # and there is no $1

=item *

In other words, quantified non-capturing brackets collect their components
into handy flattened lists, whereas quantified capturing parens collect
their components in a handy hierarchical structure.

=back
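
The two shapes can be compared side by side.  In this sketch the sample
text ends with a newline so that the C<\n> in the pattern can match on
every line; the variable names are invented, and the flattening shown for
the bracketed form assumes an implementation that follows the description
above.

     my $text = "foo:food fool\nbar:bard barb\n";

     # Non-capturing brackets: $0 and $1 stay at the top level,
     # each holding one flat list.
     $text ~~ / [ (\w+) ':' (\w+ \h*)* \n ] ** 2..* /;
     my @keys   = $0.map(~*);
     my @values = $1.map(~*);
     say @keys.elems, ' keys, ', @values.elems, ' values';

     # Capturing parens: one Match per iteration, each carrying
     # its own nested $0 and $1.
     $text ~~ / ( (\w+) ':' (\w+ \h*)* \n ) ** 2..* /;
     my @lines = $0.list;
     for @lines -> $line {
         say ~$line[0], ' => ', $line[1].map(~*).join('|');
     }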

=head2 Subpattern numbering

=over

=item *

The index of a given subpattern can always be statically determined, but
is not necessarily unique nor always monotonic. The numbering of subpatterns
restarts in each lexical scope (either a regex, a subpattern, or the
branch of an alternation).

=item *

In particular, the index of capturing parentheses restarts after each
C<|> or C<||> (but not after each C<&> or C<&&>). Hence:

                  # $0      $1    $2   $3    $4           $5
     $tune_up = rx/ ("don't") (ray) (me) (for) (solar tea), ("d'oh!")
                  # $0      $1      $2    $3        $4
                  | (every) (green) (BEM) (devours) (faces)
                  /;

This means that if the second alternative matches, the list value of the match will
contain C<('every', 'green', 'BEM', 'devours', 'faces')> rather than
Perl 5's C<(undef, undef, undef, undef, undef, undef, 'every', 'green', 'BEM',
'devours', 'faces')>.

=item *

Note that it is still possible to mimic the monotonic Perl 5 capture
indexing semantics.  See L<Numbered scalar aliasing> below for details.


=back
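
A shortened version of the C<$tune_up> pattern shows the restart in
practice.  This is a sketch only: the words are abbreviated, and the
spaces are matched with explicit C<\s>.

     my $tune_up = rx/ (do) \s (re) \s (mi) \s (fa) | (every) \s (green) \s (BEM) /;

     if 'every green BEM' ~~ $tune_up {
         say $/.list.elems;    # only the captures of the branch that matched
         say ~$0;              # numbering restarted at $0 in that branch
         say ~$2;
     }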

=head2 Subrule captures

=over

=item *

Any call to a named C<< <regex> >> within a pattern is known as a
I<subrule>, whether that regex is actually defined as a C<regex> or
C<token> or C<rule> or even an ordinary C<method> or C<multi>.

=item *

Any bracketed construct that is aliased (see L</Aliasing> below) to a
named variable is also a subrule.

=item *

For example, this regex contains three subrules:

      # subrule       subrule     subrule
      #  __^__    _______^_____    __^__
      # |     |  |             |  |     |
     m/ <ident>  $<spaces>=(\s*)  <digit>+ /

=item *

Just like subpatterns, each successfully matched subrule within a regex
produces a C<Match> object. But, unlike subpatterns, that C<Match>
object is not assigned to the array inside its parent C<Match> object.
Instead, it is assigned to an entry of the hash inside its parent C<Match>
object. For example:

      #  .... $/ .....................................
      # :                                             :
      # :              .... $/[0] ..................  :
      # :             :                             : :
      # : $/<ident>   :        $/[0]<ident>         : :
      # :   __^__     :           __^__             : :
      # :  |     |    :          |     |            : :
      ms/  <ident> \: ( known as <ident> previously ) /


=back

=head2 Accessing captured subrules

=over

=item *

The hash entries of a C<Match> object can be referred to using any of the
standard hash access notations (C<$/{'foo'}>, C<< $/<bar> >>, C<$/«baz»>,
etc.), or else via corresponding lexically scoped aliases (C<< $<foo> >>,
C<$«bar»>, C<< $<baz> >>, etc.)  So the previous example also implies:

      #    $<ident>             $0<ident>
      #     __^__                 __^__
      #    |     |               |     |
      ms/  <ident> \: ( known as <ident> previously ) /

=item *

Note that it makes no difference whether a subrule is angle-bracketed
(C<< <ident> >>) or aliased internally (C<< <ident=.name> >>) or aliased
externally (C<< $<ident>=(<.alpha>\w*) >>). The name's the thing.


=back
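
The access notations can be tried on a concrete string.  In this sketch
the text is invented and C<< <ident> >> is assumed to be the usual
built-in identifier subrule.

     'count: known as total previously' ~~
         / <ident> ':' \s* ( known \s as \s <ident> \s previously ) /;

     say ~$<ident>;       # top-level subrule capture
     say ~$/<ident>;      # the same thing, spelled out
     say ~$/{'ident'};    # the same again, with an ordinary hash subscript
     say ~$0<ident>;      # the <ident> called inside the subpattern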

=head2 Repeated captures of the same subrule

=over

=item *

If a subrule appears two (or more) times in any branch of a lexical
scope (i.e. twice within the same subpattern and alternation), or if
the subrule is list-quantified anywhere within a given scope (that is,
by any quantifier other than C<?>), then its corresponding hash entry
is always assigned an array of C<Match> objects rather than a single
C<Match> object.

=item *

Successive matches of the same subrule (whether from separate calls, or
from a single quantified repetition) append their individual C<Match>
objects to this array. For example:

     if ms/ mv <file> <file> / {
         $from = $<file>[0];
         $to   = $<file>[1];
     }

(Note, for clarity we are ignoring whitespace subtleties here--the
normal sigspace rules would require space only between alphanumeric
characters, which is wrong.  Assume that our file subrule deals
with whitespace on its own.)

Likewise, with a quantified subrule:

     if ms/ mv <file> ** 2 / {
         $from = $<file>[0];
         $to   = $<file>[1];
     }

And with a mixture of both:

     if ms/ mv <file>+ <file> / {
         $to   = pop @($<file>);
         @from = @($<file>);
     }

=item *

To avoid name collisions, you may suppress the original name by use
of a leading dot, and then use an alias to give the capture a different name:

     if ms/ mv <file> <dir=.file> / {
         $from = $<file>;  # Only one subrule named <file>, so scalar
         $to   = $<dir>;   # The Capture Formerly Known As <file>
     }


Likewise, neither of the following constructions causes C<< <file> >> to
produce an array of C<Match> objects, since neither of them has two or more
C<< <file> >> subrules in the same lexical scope:

     if ms/ (keep) <file> | (toss) <file> / {
         # Each <file> is in a separate alternation, therefore <file>
         # is not repeated in any one scope, hence $<file> is
         # not an Array object...
         $action = $0;
         $target = $<file>;
     }

     if ms/ <file> \: (<file>|none) / {
         # Second <file> nested in subpattern which confers a
         # different scope...
         $actual  = $/<file>;
         $virtual = $/[0]<file> if $/[0]<file>;
     }

=item *

On the other hand, unaliased square brackets don't confer a separate
scope (because they don't have an associated C<Match> object). So:

     if ms/ <file> \: [<file>|none] / { # Two <file>s in same scope
         $actual  = $/<file>[0];
         $virtual = $/<file>[1] if $/<file>[1];
     }


=back
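
The following sketch puts the repeated and the aliased forms into a small
grammar so that C<< <file> >> has a definition.  The grammar name C<Mv>,
the rule names C<twice> and C<renamed>, and the one-line definition of
C<file> are all hypothetical, and calling C<.parse> with a C<:rule>
argument assumes a Rakudo-style grammar interface.

     grammar Mv {
         token file    { \S+ }
         token twice   { mv \s+ <file> \s+ <file> }         # same name, same scope
         token renamed { mv \s+ <file> \s+ <dir=.file> }    # second call aliased
     }

     my $t = Mv.parse('mv old.txt new.txt', :rule<twice>);
     say $t<file>.elems;                       # both calls land under one key
     say ~$t<file>[0], ' -> ', ~$t<file>[1];

     my $r = Mv.parse('mv old.txt backup', :rule<renamed>);
     say ~$r<file>, ' -> ', ~$r<dir>;          # two separate scalar captures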

=head2 Aliasing

Aliases can be named or numbered. They can be scalar-, array-, or hash-like.
And they can be applied to either capturing or non-capturing constructs. The
following sections highlight special features of the semantics of some
of those combinations.


=head3 Named scalar aliasing to subpatterns

=over

=item *

If a named scalar alias is applied to a set of I<capturing> parens:

        #         _____/capturing parens\_____
        #        |                            |
        #        |                            |
      ms/ $<key>=( (<[A..E]>) (\d**3..6) (X?) ) /;

then the outer capturing parens no longer capture into the array of
C<$/> as unaliased parens would. Instead the aliased parens capture
into the hash of C<$/>; specifically into the hash element
whose key is the alias name.

=item *

So, in the above example, a successful match sets
C<< $<key> >> (i.e. C<< $/<key> >>), but I<not> C<$0> (i.e. not C<< $/[0] >>).

=item *

More specifically:

=over

=item *

C<< $/<key> >> will contain the C<Match> object that would previously have
been placed in C<< $/[0] >>.

=item *

C<< $/<key>[0] >> will contain the A-E letter,

=item *

C<< $/<key>[1] >> will contain the digits,

=item *

C<< $/<key>[2] >> will contain the optional X.

=back

=item *

Another way to think about this behavior is that aliased parens create
a kind of lexically scoped named subrule; that the contents of the
parentheses are treated as if they were part of a separate subrule whose
name is the alias.


=back

=head3 Named scalar aliases applied to non-capturing brackets

=over

=item *

If a named scalar alias is applied to a set of I<non-capturing> brackets:

        #         __/non-capturing brackets\__
        #        |                            |
        #        |                            |
      ms/ $<key>=[ (<[A..E]>) (\d**3..6) (X?) ] /;

then the corresponding C<< $/<key> >> C<Match> object contains only the string
matched by the non-capturing brackets.

=item *

In particular, the array of the C<< $/<key> >> entry is empty. That's
because square brackets do not create a nested lexical scope, so the
subpatterns are unnested and hence correspond to C<$0>, C<$1>, and C<$2>, and
I<not> to C<< $/<key>[0] >>, C<< $/<key>[1] >>, and C<< $/<key>[2] >>.

=item *

In other words:

=over

=item *

C<< $/<key> >> will contain the complete substring matched by the square
brackets (in a C<Match> object, as described above),

=item *

C<< $0 >> will contain the A-E letter,

=item *

C<< $1 >> will contain the digits,

=item *

C<< $2 >> will contain the optional X.

=back


=back
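
Running the same pattern with parens and then with brackets makes the
contrast concrete.  This is a sketch with an invented input string and
invented variable names, written without the C<ms> form so that
whitespace handling plays no part.

     my $code = 'B1234X';

     # Aliased capturing parens: everything nests under $<key>.
     $code ~~ / $<key>=( (<[A..E]>) (\d ** 3..6) (X?) ) /;
     my @nested          = $<key>.list.map(~*);
     my $top-level-count = $/.list.elems;
     say 'nested: ', @nested.join(','), '; top level: ', $top-level-count;

     # Aliased non-capturing brackets: $<key> holds only the string,
     # and the inner parens are numbered at the top level.
     $code ~~ / $<key>=[ (<[A..E]>) (\d ** 3..6) (X?) ] /;
     my @flat       = $/.list.map(~*);
     my $inside-key = $<key>.list.elems;
     say 'key: ', ~$<key>, '; top level: ', @flat.join(','),
         '; inside key: ', $inside-key;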

=head3 Named scalar aliasing to subrules

=over

=item *

If a subrule is aliased, it assigns its C<Match> object to the hash
entry whose key is the name of the alias, as well as to the original name.

     if m/ ID\: <id=ident> / {
         say "Identified as $/<id> and $/<ident>";    # both names defined
     }

To suppress the original name, use the dot form:

     if m/ ID\: <id=.ident> / {
         say "Identified as $/<id>";    # $/<ident> is undefined
     }

=item *

Hence aliasing a dotted subrule I<changes> the destination of the subrule's C<Match>
object. This is particularly useful for differentiating two or more calls to
the same subrule in the same scope. For example:

     if ms/ mv <file>+ <dir=.file> / {
         @from = @($<file>);
         $to   = $<dir>;
     }

=back
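
The effect of the dot can be observed by listing the hash keys of the
result.  A minimal sketch, assuming the built-in C<< <ident> >> subrule
and an implementation that follows the rule above; the input string is
invented.

     'ID:widget' ~~ / 'ID:' <id=ident> /;
     my @both = $/.hash.keys.sort;
     say @both;           # the alias and the original name

     'ID:widget' ~~ / 'ID:' <id=.ident> /;
     my @alias-only = $/.hash.keys.sort;
     say @alias-only;     # only the alias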

=head3 Numbered scalar aliasing

=over

=item *

If a numbered alias is used instead of a named alias:

     m/ $1=(<-[:]>*) \:  $0=<ident> /   # captures $<ident> too
     m/ $1=(<-[:]>*) \:  $0=<.ident> /  # doesn't capture $<ident>

the behavior is exactly the same as for a named alias (i.e. the various
cases described above), except that the resulting C<Match> object is
assigned to the corresponding element of the appropriate array rather
than to an element of the hash.

=item *

If any numbered alias is used, the numbering of subsequent unaliased
subpatterns in the same scope automatically increments from that
alias number (much like enum values increment from the last explicit
value). That is:

      #  --$1---    -$2-    --$6---    -$7-
      # |       |  |    |  |       |  |    |
     m/ $1=(food)  (bard)  $6=(bazd)  (quxd) /;

=item *

This I<follow-on> behavior is particularly useful for reinstituting
Perl 5 semantics for consecutive subpattern numbering in alternations:

     $tune_up = rx/ ("don't") (ray) (me) (for) (solar tea), ("d'oh!")
                  | $6 = (every) (green) (BEM) (devours) (faces)
                  #              $7      $8    $9        $10
                  /;

=item *

It also provides an easy way in Perl 6 to reinstitute the unnested
numbering semantics of nested Perl 5 subpatterns:

      # Perl 5...
      #               $1
      #  _____________/\___________
      # |    $2        $3      $4  |
      # |  __/\___   __/\___   /\  |
      # | |       | |       | |  | |
     m/ ( ( [A-E] ) (\d{3,6}) (X?) ) /x;


      # Perl 6...
      #                $0
      #  ______________/\______________
      # |   $0[0]       $0[1]    $0[2] |
      # |  ___/\___   ____/\____   /\  |
      # | |        | |          | |  | |
     m/ ( (<[A..E]>) (\d ** 3..6) (X?) ) /;


      # Perl 6 simulating Perl 5...
      #                 $1
      #  _______________/\________________
      # |        $2          $3       $4  |
      # |     ___/\___   ____/\____   /\  |
      # |    |        | |          | |  | |
     m/ $1=[ (<[A..E]>) (\d ** 3..6) (X?) ] /;

The non-capturing brackets don't introduce a scope, so the subpatterns within
them are at regex scope, and hence numbered at the top level. Aliasing the
square brackets to C<$1> means that the next subpattern at the same level
(i.e. the C<< (<[A..E]>) >>) is numbered sequentially (i.e. C<$2>), etc.


=back
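
The follow-on numbering can be probed by asking which positional slots
end up defined.  This sketch reuses the C<food>/C<bard> pattern from
above with explicit C<\s> between the words; whether a given
implementation supports numbered aliases exactly as specified here is an
assumption.

     my $m = 'food bard bazd quxd' ~~ / $1=(food) \s (bard) \s $6=(bazd) \s (quxd) /;

     # Collect the indices that actually received a capture.
     my @used = (0..7).grep({ $m[$_].defined });
     say @used;
     say ~$m[2];      # the unaliased paren that follows $1
     say ~$m[7];      # the unaliased paren that follows $6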

=head3 Scalar aliases applied to quantified constructs

=over

=item *

All of the above semantics apply equally to aliases which are bound to
quantified structures.

=item *

The only difference is that, if the aliased construct is a subrule or
subpattern, that quantified subrule or subpattern will have returned a
list of C<Match> objects (as described in L<Quantified subpattern
captures> and L<Repeated captures of the same subrule>).
So the corresponding array element or hash entry for the alias will
contain an array, instead of a single C<Match> object.

=item *

In other words, aliasing and quantification are completely orthogonal.
For example:

     if ms/ mv $0=<.file>+ / {
         # <file>+ returns a list of Match objects,
         # so $0 contains an array of Match objects,
         # one for each successful call to <file>

         # $/<file> does not exist (it's suppressed by the dot)
     }


     if m/ mv \s+ $<from>=(\S+ \s+)* / {
         # Quantified subpattern returns a list of Match objects,
         # so $/<from> contains an array of Match
         # objects, one for each successful match of the subpattern

         # $0 does not exist (it's pre-empted by the alias)
     }

=item *

Note, however, that a set of quantified I<non-capturing> brackets always
returns a single C<Match> object which contains only the complete
substring that was matched by the full set of repetitions of the
brackets (as described in L<Named scalar aliases applied to
non-capturing brackets>). For example:

     "coffee fifo fumble" ~~ m/ $<effs>=[f <-[f]> ** 1..2 \s*]+ /;

     say $<effs>;    # prints "fee fifo fum"


=back
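
To see the difference between aliasing quantified brackets and aliasing a
quantified subpattern, the same text can be matched both ways.  The
string and the alias name C<words> are invented for this sketch, and
C<.^name> is used only as a convenient way to inspect the kind of value
stored.

     my $s = 'fee fi fo';

     # Quantified non-capturing brackets: one Match for the whole run.
     $s ~~ / $<words>=[ f \w+ \s* ]+ /;
     my $bracket-type = $<words>.^name;
     my $bracket-text = ~$<words>;
     say "$bracket-type: $bracket-text";

     # Quantified subpattern: one Match per repetition.
     $s ~~ / $<words>=( f \w+ \s* )+ /;
     my $paren-type  = $<words>.^name;
     my $paren-count = $<words>.elems;
     say "$paren-type with $paren-count elements";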

=head3 Array aliasing

=over

=item *

An alias can also be specified using an array as the alias instead of a scalar.
For example:

     m/ mv \s+ @<from>=[(\S+) \s+]* <dir> /;

=item *

Using the C<< @alias= >> notation instead of a C<< $alias= >>
mandates that the corresponding hash entry or array element I<always>
receives an array of C<Match> objects, even if the
construct being aliased would normally return a single C<Match> object.
This is useful for creating consistent capture semantics across
structurally different alternations (by enforcing array captures in all
branches):

     ms/ Mr?s? @<names>=<ident> W\. @<names>=<ident>
        | Mr?s? @<names>=<ident>
        /;

     # Aliasing to @names means $/<names> is always
     # an Array object, so...

     say @($/<names>);

=item *

For convenience and consistency, C<< @<key> >> can also be used outside a
regex, as a shorthand for C<< @( $/<key> ) >>. That is:

     ms/ Mr?s? @<names>=<ident> W\. @<names>=<ident>
        | Mr?s? @<names>=<ident>
        /;

     say @<names>;

=item *

If an array alias is applied to a quantified pair of non-capturing
brackets, it captures the substrings matched by each repetition of the
brackets into separate elements of the corresponding array. That is:

     ms/ mv $<files>=[ f.. \s* ]* /; # $/<files> assigned a single
                                     # Match object containing the
                                     # complete substring matched by
                                     # the full set of repetitions
                                     # of the non-capturing brackets

     ms/ mv @<files>=[ f.. \s* ]* /; # $/<files> assigned an array,
                                     # each element of which is a
                                     # Match object containing
                                     # the substring matched by the Nth
                                     # repetition of the non-
                                     # capturing bracket match

=item *

If an array alias is applied to a quantified pair of capturing parens
(i.e. to a subpattern), then the corresponding hash or array element is
assigned a list constructed by concatenating the array values of each
C<Match> object returned by one repetition of the subpattern. That is,
an array alias on a subpattern flattens and collects all nested
subpattern captures within the aliased subpattern. For example:

     if ms/ $<pairs>=( (\w+) \: (\N+) )+ / {
         # Scalar alias, so $/<pairs> is assigned an array
         # of Match objects, each of which has its own array
         # of two subcaptures...

         for @($<pairs>) -> $pair {
             say "Key: $pair[0]";
             say "Val: $pair[1]";
         }
     }


     if ms/ @<pairs>=( (\w+) \: (\N+) )+ / {
         # Array alias, so $/<pairs> is assigned an array
         # of Match objects, each of which is flattened out of
         # the two subcaptures within the subpattern

         for @($<pairs>) -> $key, $val {
             say "Key: $key";
             say "Val: $val";
         }
     }

=item *

Likewise, if an array alias is applied to a quantified subrule, then the
hash or array element corresponding to the alias is assigned a list
containing the array values of each C<Match> object returned by each
repetition of the subrule, all flattened into a single array:

     rule pair { (\w+) \: (\N+) \n }

     if ms/ $<pairs>=<pair>+ / {
         # Scalar alias, so $/<pairs> contains an array of
         # Match objects, each of which is the result of the
         # <pair> subrule call...

         for @($<pairs>) -> $pair {
             say "Key: $pair[0]";
             say "Val: $pair[1]";
         }
     }


     if ms/ @<pairs>=<pair>+ / {
         # Array alias, so $/<pairs> contains an array of
         # Match objects, all flattened down from the
         # nested arrays inside the Match objects returned
         # by each match of the <pair> subrule...

         for @($<pairs>) -> $key, $val {
             say "Key: $key";
             say "Val: $val";
         }
     }

=item *

In other words, an array alias is useful to flatten into a single array
any nested captures that might occur within a quantified subpattern or
subrule, whereas a scalar alias is useful to preserve the internal
structure of each repetition within a top-level array.

=item *

It is also possible to use a numbered variable as an array alias.
The semantics are exactly as described above, with the sole difference
being that the resulting array of C<Match> objects is assigned into the
appropriate element of the regex's match array rather than to a key of
its match hash. For example:

     if m/ mv  \s+  @0=((\w+) \s+)+  $1=((\W+) (\s*))* / {
         #          |                |
         #          |                |
         #          |                 \_ Scalar alias, so $1 gets an
         #          |                    array, with each element
         #          |                    a Match object containing
         #          |                    the two nested captures
         #          |
         #           \___ Array alias, so $0 gets a flattened array of
         #                just the (\w+) captures from each repetition

         @from     = @($0);      # Flattened list

         $to_str   = $1[0][0];   # Nested elems of
         $to_gap   = $1[0][1];   #    unflattened list
     }

=item *

Note again that, outside a regex, C<@0> is simply a shorthand for
C<@($0)>, so the first assignment above could also have been written:

     @from = @0;


=back

=head3 Hash aliasing

=over

=item *

An alias can also be specified using a hash as the alias variable,
instead of a scalar or an array. For example:

     m/ mv %<location>=( (<ident>) \: (\N+) )+ /;

=item *

A hash alias causes the corresponding hash or array element in the
current scope's C<Match> object to be assigned a (nested) C<Hash> object
(rather than an C<Array> object or a single C<Match> object).

=item *

If a hash alias is applied to a subrule or subpattern then the first nested
numeric capture becomes the key of each hash entry and any remaining numeric
captures become the values (in an array if there is more than one).

=item *

As with array aliases it is also possible to use a numbered variable as
a hash alias. Once again, the only difference is where the resulting
C<Match> object is stored:

     rule one_to_many {  (\w+) \: (\S+) (\S+) (\S+) }

     if ms/ %0=<one_to_many>+ / {
         # $/[0] contains a hash, in which each key is provided by
         # the first subcapture within <one_to_many>, and each
         # value is an array containing the
         # subrule's second, third, fourth, etc. subcaptures...

         for %($/[0]) -> $pair {
             say "One:  $pair.key()";
             say "Many: { @($pair.value) }";
         }
     }

=item *

Outside the regex, C<%0> is a shortcut for C<%($0)>:

         for %0 -> $pair {
             say "One:  $pair.key()";
             say "Many: @($pair.value)";
         }


=back

=head3 External aliasing

=over

=item *

Instead of using internal aliases like:

     m/ mv  @<files>=<ident>+  $<dir>=<ident> /

the name of an ordinary variable can be used as an I<external> alias, like so:

     m/ mv  @OUTER::files=<ident>+  $OUTER::dir=<ident> /

=item *

In this case, the behavior of each alias is exactly as described in the
previous sections, except that any resulting capture is bound
directly (but still hypothetically) to the variable of the specified
name, which must already exist in the scope in which the regex is declared.


=back

=head2 Capturing from repeated matches

=over

=item *

When an entire regex is successfully matched with repetitions
(specified via the C<:x> or C<:g> flag) or overlaps (specified via the
C<:ov> or C<:ex> flag), it will usually produce a sequence
of distinct matches.

=item *

A successful match under any of these flags still returns a single
C<Match> object in C<$/>. However, this object may represent a partial
evaluation of the regex.   Moreover, the values of this match object
are slightly different from those provided by a non-repeated match:

=over

=item *

The boolean value of C<$/> after such matches is true or false, depending on
whether the pattern matched.

=item *

The string value is the substring from the start of the first match to
the end of the last match (I<including> any intervening parts of the
string that the regex skipped over in order to find later matches).

=item *

Subcaptures are returned as a multidimensional list, which the user can
choose to process in either of two ways.  If you refer to
C<@().flat> (or just use C<@()> in a flat list context), the
multidimensionality is ignored and all the matches are returned
flattened (but still lazily).  If you refer to C<lol()>, you can
get each individual sublist as a C<Parcel> object.
As with any multidimensional list, each sublist can be lazy separately.

=back

For example:

     if $text ~~ ms:g/ (\S+:) <rocks> / {
         say "Full match context is: [$/]";
     }

But the list of individual match objects corresponding to each separate
match is also available:

     if $text ~~ ms:g/ (\S+:) <rocks> / {
         say "Matched { +lol() } times";    # Note: forced eager here by +

         for lol() -> $m {
             say "Match between $m.from() and $m.to()";
             say 'Right on, dude!' if $m[0] eq 'Perl';
             say "Rocks like $m<rocks>";
         }
     }

=back

=head1 Grammars

=over

=item *

Your private C<ident> rule shouldn't clobber someone else's
C<ident> rule.  So some mechanism is needed to confine rules to a namespace.

=item *

If subs are the model for rules, then modules/classes are the obvious
model for aggregating them.  Such collections of rules are generally
known as I<grammars>.

=item *

Just as a class can collect named actions together:

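     class Identity {
         # private attributes, so the methods below do not call themselves
         has $!name;
         has $!age;
         has $!addr;

         method name { "Name = $!name" }
         method age  { "Age  = $!age"  }
         method addr { "Addr = $!addr" }

         method desc {
             print self.name, "\n",
                   self.age,  "\n",
                   self.addr, "\n";
         }

         # etc.
     }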

so too a grammar can collect a set of named rules together:

     grammar Identity {
         rule name { Name '=' (\N+) }
         rule age  { Age  '=' (\d+) }
         rule addr { Addr '=' (\N+) }
         rule desc {
             <name> \n
             <age>  \n
             <addr> \n
         }

         # etc.
     }

=item *

Like classes, grammars can inherit:

     grammar Letter {
         rule text     { <greet> <body> <close> }

         rule greet { [Hi|Hey|Yo] $<to>=(\S+?) , $$}

         rule body     { <line>+? }   # note: backtracks forwards via +?

         rule close { Later dude, $<from>=(.+) }

         # etc.
     }

     grammar FormalLetter is Letter {

         rule greet { Dear $<to>=(\S+?) , $$}

         rule close { Yours sincerely, $<from>=(.+) }

     }

=item *

Just like the methods of a class, the rule definitions of a grammar are
inherited (and polymorphic!). So there's no need to respecify C<body>,
C<line>, etc.

=item *

Perl 6 will come with at least one grammar predefined:

     grammar STD {    # Perl's own standard grammar

         rule prog { <statement>* }

         rule statement {
                   | <decl>
                   | <loop>
                   | <label> [<cond>|<sideff>|';']
         }

         rule decl { <sub> | <class> | <use> }

         # etc. etc. etc.
     }

=item *

Hence:

     $parsetree = STD.parse($source_code)

=item *

To switch to a different grammar in the middle of a regex, you may use
the C<:lang> adverb.  For example, to match an expression C<< <expr> >>
from C<$funnylang> that is embedded in curlies, say:

    token funnylang { '{' [ :lang($funnylang.unbalanced('}')) <expr> ] '}' }

=item *

A string can be matched against a grammar by calling C<.parse>
or C<.parsefile> on the grammar, optionally passing an I<actions>
object to that grammar:

    MyGrammar.parse($string, :actions($action-object))
    MyGrammar.parsefile($filename, :actions($action-object))

This creates a C<Grammar> object, whose type denotes the current language being
parsed, and from which other grammars may be derived as extended languages.
All grammar objects are derived from C<Cursor>, so every grammar object's
value embodies the current state of the current match.  This new grammar
object is then passed as the invocant to the C<TOP> method (regex, token,
or rule) of C<MyGrammar>. The default rule name to call can be overridden with
the C<:rule> named argument of the C<parse> method.

Grammar objects are considered immutable, so
every match returns a different match state, and multiple match states may
exist simultaneously.  Each such match state is considered a hypothesis on
how the pattern will eventually match.  A backtrackable choice in pattern
matching may be easily represented in Perl 6 as a lazy list of match state
cursors; backtracking consists of merely throwing away the front value of
the list and continuing to match with the next value.  Hence, the management
of these match cursors controls how backtracking works, and falls naturally
out of the lazy list paradigm.
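
As a rough illustration of rule inheritance, the C<.parse> call, the
C<:actions> object, and the C<:rule> argument described above, here is a
minimal sketch.  The C<Greeting> and C<FormalGreeting> grammars and the
C<GreetingActions> class are hypothetical names invented for this example,
and the use of C<make> and C<.ast> to carry a result assumes the usual
action-method conventions rather than anything mandated by this section:

     # Hypothetical grammar, for illustration only.
     grammar Greeting {
         token TOP   { <greet> ' ' <name> }
         token greet { 'Hi' | 'Hey' }
         token name  { \w+ }
     }

     # Overrides only <greet>; TOP and <name> are inherited.
     grammar FormalGreeting is Greeting {
         token greet { 'Dear' }
     }

     # Hypothetical actions class: attaches a result to the TOP match.
     class GreetingActions {
         method TOP($/) { make "to: $<name>" }
     }

     my $m = FormalGreeting.parse('Dear Larry',
                                  :actions(GreetingActions.new));
     say $m.ast;                                    # to: Larry

     # The inherited TOP now calls the overridden <greet>.
     say so FormalGreeting.parse('Hey Larry');      # False

     # Start from a rule other than TOP.
     say ~FormalGreeting.parse('Larry', :rule<name>);   # Larry

Because C<TOP> is looked up polymorphically, the derived grammar needs
to restate only the rule that differs.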

=back

=head1 Syntactic categories

For writing your own backslash and assertion subrules, you may augment
(your copy of) the Regex sublanguage, using the following syntactic
categories:

    augment slang Regex {
        token backslash:sym<y> { ... }   # define your own \y and \Y
        token assertion:sym<*> { ... }   # define your own <*stuff>
        token metachar:sym<,> { ... }    # define a new metacharacter

        multi method tweak (:$x) {...}   # define your own :x modifier
    }

=head1 Pragmas

Various pragmas may be used to control various aspects of regex
compilation and usage not otherwise provided for.  These are tied
to the particular declarator in question:

    use s :foo;         # control s defaults
    use m :foo;         # control m defaults
    use rx :foo;        # control rx defaults
    use regex :foo;     # control regex defaults
    use token :foo;     # control token defaults
    use rule :foo;      # control rule defaults

(It is a general policy in Perl 6 that any pragma designed to influence
the surface behavior of a keyword is named identically to the keyword
itself, unless there is good reason to do otherwise.  On the other hand,
pragmas designed to influence deep semantics should not be named
identically, though of course some similarity is good.)

=head1 Transliteration

=over

=item *

The C<tr///> quote-like operator now also has a method form called
C<trans()>.  Its argument is a list of pairs.  You can use anything that
produces a pair list:

     $str.trans( %mapping.pairs );

Use the C<.=> form to do a translation in place:

     $str.=trans( %mapping.pairs );

(Perl 6 does not support the C<y///> form, which was only in C<sed> because
they were running out of single letters.)

=item *

The two sides of any pair can be strings interpreted as C<tr///> would:

     $str.=trans( 'A..C' => 'a..c', 'XYZ' => 'xyz' );

As a degenerate case, each side can be individual characters:

     $str.=trans( 'A'=>'a', 'B'=>'b', 'C'=>'c' );

Whitespace characters are taken literally as characters to be
translated from or to.  The C<..> range sequence is the only metasyntax
recognized within a string, though you may of course use backslash
interpolations in double quotes.  If the right side is too short, the
final character is replicated out to the length of the left string.
If there is no final character because the right side is the null
string, the result is deletion instead.

=item *

Either or both sides of the pair may also be Array objects:

     $str.=trans( ['A'..'C'] => ['a'..'c'], <X Y Z> => <x y z> );

The array version is the underlying primitive form: the semantics of
the string form is exactly equivalent to first doing C<..> expansion
and then splitting the string into individual characters and then
using that as an array.

=item *

The array version can map one-or-more characters to one-or-more
characters:

     $str.=trans( [' ',      '<',    '>',    '&'    ] =>
                  ['&nbsp;', '&lt;', '&gt;', '&amp;' ]);

In the case that more than one sequence of input characters matches,
the longest one wins.  In the case of two identical sequences the
first in order wins.

As with the string form, missing righthand elements replicate the
final element, and a null array results in deletion instead.

=item *

The recognition done by the string and array forms is very basic.
To achieve greater power, any recognition element of the left side
may be specified by a regex that can do character classes, lookahead,
etc.


    $str.=trans( [/ \h /,   '<',    '>',    '&'    ] =>
                 ['&nbsp;', '&lt;', '&gt;', '&amp;' ]);

    $str.=trans( / \s+ / => ' ' );  # squash all whitespace to one space
    $str.=trans( / <-alpha> / => '' );  # delete all non-alpha

These submatches are mixed into the overall match in exactly the same way that
they are mixed into parallel alternation in ordinary regex processing, so
longest token rules apply across all the possible matches specified to the
transliteration operator.  Once a match is made and transliterated, the parallel
matching resumes at the new position following the end of the previous match,
even if it matched multiple characters.

=item *

If the right side of the arrow is a closure, it is evaluated to
determine the replacement value.  If the left side was matched by a
regex, the resulting match object is available within the closure.
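
The string and array forms described above can be tried out with a short
sketch.  The sample string is invented for illustration, and the results
shown in the comments are what these calls should produce under the rules
given in this section:

     my $str = 'ABC <XYZ> & more';   # hypothetical sample input

     # String form: '..' ranges are expanded as tr/// would expand them.
     say $str.trans( 'A..C' => 'a..c', 'XYZ' => 'xyz' );
     # abc <xyz> & more

     # Array form: each single character maps to a longer string.
     say $str.trans( ['<',    '>',    '&'    ] =>
                     ['&lt;', '&gt;', '&amp;'] );
     # ABC &lt;XYZ&gt; &amp; more

     # Neither call changed $str; use .= to translate in place.
     say $str;
     # ABC <XYZ> & more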

=back

=head1 Substitution

There are also method forms of C<m//> and C<s///>:

     $str.match(/pat/);
     $str.subst(/pat/, "replacement");
     $str.subst(/pat/, {"replacement"});
     $str.=subst(/pat/, "replacement");
     $str.=subst(/pat/, {"replacement"});

The C<.match> and C<.subst> methods support the adverbs of C<m//> and
C<s///> as named arguments, so you can write

    $str.match(/pat/, :g)

as an equivalent to

    $str.comb(/pat/, :match)

There is no syntactic sugar here, so in order to get deferred
evaluation of the replacement you must put it into a closure.  The
syntactic sugar is provided only by the quotelike forms.  First there
is the standard "triple quote" form:

    s/pattern/replacement/

Only non-bracket characters may be used for the "triple quote".  The
right side is always evaluated as if it were a double-quoted string
regardless of the quote chosen.

As with Perl 5, a bracketing form is also supported, but unlike Perl 5,
Perl 6 uses the brackets I<only> around the pattern.  The replacement
is then specified as if it were an ordinary item assignment, with ordinary
quoting rules.  To pick your own quotes on the right just use one of the C<q>
forms.  The substitution above is equivalent to:

    s[pattern] = "replacement"

or

    s[pattern] = qq[replacement]

This is not a normal assignment, since the right side is evaluated each
time the substitution matches (much like the pseudo-assignment to declarators
can happen at strange times).  It is therefore treated as a "thunk", that is,
it will be called as a chunk of code that creates a dynamic scope but not a
lexical scope.  (You can also think of a thunk as a closure that uses the
current lexical scope parasitically.)  In fact, it makes no sense at all to say

    s[pattern] = { doit }

because that would try to substitute a closure into the string.

Any scalar assignment operator may be used; the substitution macro
knows how to turn

    $target ~~ s:g[pattern] op= expr

into something like:

    $target.subst(rx[pattern], { $() op expr }, :g)

So, for example, you can multiply every dollar amount by 2 with:

    s:g[\$ <( \d+ )>] *= 2

(Of course, the optimizer is free to do something faster than an actual
method call.)
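
For comparison, the same doubling can be sketched with the method form.
The variable names and sample text here are hypothetical, and the pointy
block is simply one way of writing the closure; it receives each match
object as its argument:

     my $prices = 'tea $3, cake $12';   # hypothetical sample input

     # Only the part of each match inside <( ... )> is replaced,
     # so the dollar sign is matched but left in place.
     my $doubled = $prices.subst(/ \$ <( \d+ )> /, -> $m { $m * 2 }, :g);

     say $doubled;    # tea $6, cake $24
     say $prices;     # tea $3, cake $12

The original string is untouched because C<.subst> returns a new string;
the C<.=subst> form would modify it in place.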

You'll note from the last example that substitutions only happen on
the "official" string result of the match, that is, the portion of
the string between the C<$/.from> and C<$/.to> positions.
(Here we set those explicitly using the C<< <(...)> >> pair; otherwise we
would have had to use lookbehind to match the C<$>.)

Please note that the C<:ii>/C<:samecase> and C<:mm>/C<:samemark>
switches are really two different modifiers in one, and when the compiler desugars
the quote-like forms it distributes semantics to both the pattern
and the replacement.  That is, C<:ii> on the replacement implies a C<:i> on the
pattern, and C<:mm> implies C<:m>.  The proper method equivalents to:

    s:ii/foo/bar/
    s:mm/boo/far/

are not:

    .subst(/foo/, 'bar', :ii)   # WRONG
    .subst(/boo/, 'far', :mm)   # WRONG

but rather:

    .subst(rx:i/foo/, 'bar', :ii)   # okay
    .subst(rx:m/boo/, 'far', :mm)   # okay

It is specifically I<not> required of an implementation that it treat
the regexes as generic with respect to case and mark.  Retroactive
recompilation is considered harmful.  If an implementation does do lazy
generic case and mark semantics, it is erroneous and non-portable
for a program to depend on it.

=head1 Positional matching, fixed width types

=over

=item *

To anchor to a particular position in the general case you can use
the C<< <at($pos)> >> assertion to say that the current position
is the same as the position object you supply.  You may set the
current match position via the C<:c> and C<:p> modifiers.

However, please remember that in Perl 6 string positions are generally
I<not> integers, but objects that point to a particular place in
the string regardless of whether you count by bytes or codepoints or
graphemes.  If used with an integer, the C<at> assertion will assume
you mean the current lexically scoped Unicode level, on the assumption
that this integer was somehow generated in this same lexical scope.
If this is outside the current string's allowed Unicode abstraction levels, an
exception is thrown.  See S02 for more discussion of string positions.

=item *

C<Buf> types are based on fixed-width cells and can therefore
handle integer positions just fine, and treat them as array indices.
In particular, C<buf8> (also known as C<buf>) is just an old-school byte string.
Matches against C<Buf> types are restricted to ASCII semantics in
the absence of an I<explicit> modifier asking for the array's values
to be treated as some particular encoding such as UTF-32.  (This is
also true for those compact arrays that are considered isomorphic to
C<Buf> types.)  Positions within C<Buf> types are always integers,
counting one per unit cell of the underlying array.  Be aware that
"from" and "to" positions are reported as being between elements.
If matching against a compact array C<@foo>, a final position of 42
indicates that C<@foo[42]> was the first element I<not> included.

=back

=head1 Matching against non-strings

=over

=item *

Anything that can be tied to a string can be matched against a
regex. This feature is particularly useful with input streams:

     my $stream := cat $fh.lines;       # tie scalar to filehandle

     # and later...

     $stream ~~ m/pattern/;         # match from stream

=item *

Any non-compact array of mixed strings or objects can be matched
against a regex as long as you present them as an object with the C<Str>
interface, which does not preclude the object having other interfaces
such as C<Array>.  Normally you'd use C<cat> to generate such an object:

    @array.cat ~~ / foo <,> bar <elem>* /;

The special C<< <,> >> subrule matches the boundary between elements.
The C<< <elem> >> assertion matches any individual array element.
It is the equivalent of the "dot" metacharacter for the whole element.

If the array elements are strings, they are concatenated virtually into
a single logical string.  If the array elements are tokens or other
such objects, the objects must provide appropriate methods for the
kinds of subrules to match against.  It is an assertion failure to match
a string-matching assertion against an object that doesn't provide
a stringified view.  However, pure object lists can be parsed as long as
the match (including any subrules) restricts itself to assertions like:

     <.isa(Dog)>
     <.does(Bark)>
     <.can('scratch')>

It is permissible to mix objects and strings in an array as long as they're
in different elements.  You may not embed objects in strings, however.
Any object may, of course, pretend to be a string element if it likes,
and so a C<Cat> object may be used as a substring with the same restrictions
as in the main string.

Please be aware that the warnings on C<.from> and C<.to> returning
opaque objects go double for matching against an array, where a
particular position reflects both a position within the array and
(potentially) a position within a string of that array.  Do not
expect to do math with such values.  Nor should you expect to be
able to extract a substr that crosses element boundaries.
[Conjecture: Or should you?]

=item *

To match against every element of an array, use a hyper operator:

     @array».match($regex);

=item *

To match against any element of the array, it suffices to use ordinary
smartmatching:

    @array ~~ $regex;

=back

=head1 When C<$/> is valid

To provide implementational freedom, the C<$/> variable is not
guaranteed to be defined until the pattern reaches a sequence
point that requires it (such as completing the match, or calling an
embedded closure, or even evaluating a submatch that requires a Perl
expression for its argument).  Within regex code, C<$/> is officially
undefined, and references to C<$0> or other capture variables may
be compiled to produce the current value without reference to C<$/>.
Likewise a reference to C<< $<foo> >> does not necessarily mean C<<
$/<foo> >> within the regex proper.  During the execution of a match,
the current match state is actually stored in a C<$¢> variable
lexically scoped to an appropriate portion of the match, but that is
not guaranteed to behave the same as the C<$/> object, because C<$/>
is of type C<Match>, while the match state is of a type derived from C<Cursor>.

In any case this is all transparent to the user for simple matches;
and outside of regex code (and inside closures within the regex)
the C<$/> variable is guaranteed to represent the state of the match
at that point.  That is, normal Perl code can always depend on C<<
$<foo> >> meaning C<< $/<foo> >>, and C<$0> meaning C<$/[0]>, whether
that code is embedded in a closure within the regex or outside the
regex after the match completes.
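
A small sketch of the two places where ordinary Perl code may rely on
C<$/>, namely a closure embedded in the regex and the code that runs after
the match completes.  The input string and the capture names C<k> and
C<v> are made up for this example:

     if 'key=value' ~~ / $<k>=[\w+] '=' { say "seen so far: $<k>" }
                         $<v>=[\w+] / {
         # After the match, $<k> and $<v> mean $/<k> and $/<v>.
         say "$<k> maps to $<v>";
     }

     # seen so far: key
     # key maps to value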

=for vim:set expandtab sw=4: