Commit 0229474

Expansion of Regexes:Lexical Conventions section:
- intro clarifying 'regular expressions' and 'regexes';
- systematic treatment of anonymous and named regex definitions;
- new subsection on common ways of matching regexes;
1 parent 195a4b9 commit 0229474

1 file changed

+62
-31
lines changed

doc/Language/regexes.pod6

Lines changed: 62 additions & 31 deletions
@@ -6,9 +6,24 @@
 
 X<|Regular Expressions>
 
-Regular expressions, I<regexes> for short, are written in a domain-specific
-language that describes text patterns. Pattern matching is the process of
-matching those patterns to actual text.
+A I<regular expression> is a sequence of characters that defines a certain text
+pattern, typically one that one wishes to find in some large body of text.
+
+In theoretical computer science and formal language theory, regular expressions
+are used to describe so-called
+L<I<regular languages>|https://en.wikipedia.org/wiki/Regular_language>. Since
+their inception in the 1950's, practical implementations of regular expressions,
+for instance in the text search and replace functions of text editors, have outgrown
+their strict scientific definition. In acknowledgement of this, and in an attempt
+to disambiguate, a regular expression in Perl 6 is normally referred to as a
+I<regex> (from: I<reg>ular I<ex>pression), a term that is also in common use in
+other programming languages.
+
+In Perl 6, regexes are written in a
+L<I<domain-specific language>|https://en.wikipedia.org/wiki/Domain-specific_language>,
+i.e. a sublanguage or I<slang>. This page describes this language, and explains how
+regexes can be used to search for text patterns in strings in a process called
+I<pattern matching>.
 
 =head1 X<Lexical conventions|quote,/ /;quote,rx;quote,m>
 
@@ -45,12 +60,7 @@ regex delimiter:
 =begin item
 You cannot use whitespace or alphanumeric characters as delimiters. Whitespace
 in regex definition syntax is generally optional, except where it is required to
-distinguish from function call syntax (discussed below).
-=end item
-
-=begin item
-Use of a colon as a delimiter would clash with the use of adverbs of the form
-C<:adverb>; accordingly, such use of the colon is forbidden.
+distinguish from function call syntax (discussed hereafter).
 =end item
 
 =begin item
@@ -63,7 +73,13 @@ C<Regex> object.
 =end item
 
 =begin item
-The hash C<#> is not available as a delimiter, since it is parsed as the start
+Use of a colon as a delimiter would clash with the use of
+L<adverbs|/language/regexes#Adverbs>, which take the form C<:adverb>;
+accordingly, such use of the colon is forbidden.
+=end item
+
+=begin item
+The hashmark C<#> is not available as a delimiter since it is parsed as the start
 of a L<comment|/language/syntax#Single-line_comments> that runs until the end of
 the line.
 =end item
@@ -72,13 +88,13 @@ Secondly, the C<rx> form enables the use of
 L<regex adverbs|/language/regexes#Adverbs>, which may be placed between C<rx> and the
 opening delimiter to modify the definition of the entire regex:
 
-    rx:r:s/pattern/ # :r (:ratchet) and :s (:sigspace) adverbs, defining
+    rx:r:s/pattern/; # :r (:ratchet) and :s (:sigspace) adverbs, defining
                      # a ratcheting regex in which whitespace is significant
 
 Although anonymous regexes are not, as such, I<named>, they may effectively be
 given a name by putting them inside a named variable, after which they can be
-referenced, e.g. direcly or by means of
-L<interpolation|/language/regexes#Regex_interpolation>:
+referenced, both outside of an embedding regex and from within an embedding
+regex by means of L<interpolation|/language/regexes#Regex_interpolation>:
 
     my $regex = / R \w+ /;
     say "Zen Buddhists like Raku too" ~~ $regex; # OUTPUT: 「Raku」
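The hunk above shows only the direct use of a regex stored in a variable. As a minimal sketch of the other case mentioned in the added text, referencing the variable from within an embedding regex, the following assumes the C«<$regex>» interpolation form described in the linked section; the variable names C<$direct> and C<$embedded> are illustrative only and do not appear in the commit.

```raku
# Anonymous regex stored in a variable, as in the documented example.
my $regex = / R \w+ /;

# Direct use: smartmatch a string against the variable itself.
my $direct = "Zen Buddhists like Raku too" ~~ $regex;

# Use from within an embedding regex: <$regex> interpolates the stored
# Regex object into the surrounding pattern without creating a capture.
my $embedded = "Zen Buddhists like Raku too" ~~ / like \s+ <$regex> /;

say $direct;    # 「Raku」
say $embedded;  # 「like Raku」
```

Both matches return a C<Match> object; only the extent of the matched text differs.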
@@ -110,7 +126,7 @@ rather than data:
     &R ~~ Method; # OUTPUT: True (A Regex is really a Method!)
 
 Also unlike with the C<rx> form for defining an anonymous regex, the definition
-of a named regex using the C<regex> form does not allow for adverbs to be
+of a named regex using the C<regex> keyword does not allow for adverbs to be
 inserted before the opening delimiter. Instead, adverbs that are to modify the
 entire regex pattern may be included first thing within the curly braces:
 
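The hunk is cut off before the file's own example. As a rough illustration of the rule stated above, the sketch below places an adverb before the opening delimiter of an anonymous C<rx> regex and first thing inside the braces of a named regex. The name C<greeting> and the choice of the C<:i> (C<:ignorecase>) adverb are assumptions made for this sketch and are not taken from the commit.

```raku
# Anonymous form: the adverb sits between rx and the opening delimiter.
my $anon = rx:i/ hello /;

# Named form: the adverb goes first thing inside the curly braces.
# "greeting" is a hypothetical name used only for illustration.
my regex greeting { :i hello };

my $anon-result  = "HELLO world" ~~ $anon;
my $named-result = "HELLO world" ~~ &greeting;

say $anon-result;   # 「HELLO」
say $named-result;  # 「HELLO」
```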
@@ -128,9 +144,9 @@ C<Regex> when the C<:ratchet> and C<:sigspace> adverbs are of interest:
 
 Named regexes may be used as building blocks for other regexes, as they are
 methods that may be called from within other regexes using the C«<regex-name>»
-syntax. When they are used this way, they are often referred to as 'subrules';
+syntax. When they are used this way, they are often referred to as I<subrules>;
 see L<here|/language/regexes#Subrules> for more details on their use.
-L<C<Grammars>|/type/Grammar> are the natural niche for subrules, but many common
+L<C<Grammars>|/type/Grammar> are the natural habitat of subrules, but many common
 predefined character classes are also implemented as named regexes.
 
 =head2 Regex readability: whitespace and comments
@@ -164,36 +180,41 @@ The most common ways to match a string against an anonymous regex C</pattern/> o
 against a named regex C<R> include the following:
 
 =begin item
-I«Smartmatch: "string" ~~ /pattern/, "string" ~~ /<R>/»
+I«Smartmatch: "string" ~~ /pattern/, or "string" ~~ /<R>/»
 
 L<Smartmatching|/language/operators#index-entry-smartmatch_operator> a string
-(C<Str>) against a C<Regex> performs a regex match of the string against the
-C<Regex>:
+against a C<Regex> performs a regex match of the string against the C<Regex>:
 
-    say "Go ahead, make my day." ~~ / \w+ /; # OUTPUT: 「Go」
+    say "Go ahead, make my day." ~~ / \w+ /; # OUTPUT: «「Go」␤»
 
     my regex R { me|you };
     say "You talkin' to me?" ~~ / <R> /; # OUTPUT: «「me」␤ R => 「me」␤»
-    say "May the force be with you. ~~ &R ; # OUTPUT: 「you」
+    say "May the force be with you." ~~ &R ; # OUTPUT: «「you」␤»
 
 The different outputs of the last two statements show that these two ways of
 smartmatching against a named regex are not identical. The difference arises
-because the method call C«<R>» from within the anonymous regex C</.../> installs
+because the method call C«<R>» from within the anonymous regex C</ /> installs
 a so-called L<'named capture'|/language/regexes#Named_captures> in the C<Match>
 object, while the smartmatch against the named C<Regex> as such does not.
 =end item
 
 =begin item
-I«Explicit topic match: m/pattern/, m/<R>/»
+I«Explicit topic match: m/pattern/, or m/<R>/»
 
 The match operator C<m/ /> immediately matches the topic variable
 L<C<$_>|/language/variables#index-entry-topic_variable> against the regex
-following the C<m>. As with the C<rx/ /> syntax for regex definitions, the match
-operator may be used with adverbs in between C<m> and the opening regex
-delimiter, and with delimiters other than the slash.
+following the C<m>.
 
-Here's an example that illustrates the difference between the C<m/ /> and C</ />
-operators:
+As with the C<rx/ /> syntax for regex definitions, the match operator may be
+used with adverbs in between C<m> and the opening regex delimiter, and with
+delimiters other than the slash. However, while the C<rx/ /> syntax may only be
+used with L<I<regex adverbs>|/language/regexes#Regex_adverbs> that affect the
+compilation of the regex, the C<m/ /> syntax may additionally be used with
+L<I<matching adverbs>|/language/regexes#Matching_adverbs> that determine how the
+regex engine is to perform pattern matching.
+
+Here's an example that illustrates the primary difference between the C<m/ />
+and C</ /> syntax:
 
     my $match;
     $_ = "abc";
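The file's own example continues beyond this hunk and is not reproduced here. As a separate, hedged sketch of the two points made in the added paragraphs, the code below contrasts C<m/ /> (which matches the topic at once) with C</ /> (which only defines a C<Regex>), and then uses C<:g> (C<:global>) as one example of a matching adverb on C<m>. The variable names are illustrative assumptions.

```raku
$_ = "abc";

# m/ / matches the topic variable $_ immediately and yields a Match.
my $matched = m/ \w /;

# / / on its own only defines a Regex object; nothing is matched yet.
my $pattern = / \w /;

say $matched.^name;  # Match
say $pattern.^name;  # Regex

# :g is a matching adverb: it tells the engine to return all matches,
# so it is given to m/ / rather than to a regex definition.
my @all = m:g/ \w /;
say @all.elems;      # 3
```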
@@ -219,11 +240,21 @@ against it:
 =end item
 
 =begin item
-I«Match method: "string".match: /pattern/, "string".match: /<R>/»
+I«Match method: "string".match: /pattern/, or "string".match: /<R>/»
 
 The L<C<match>|/type/Str#method_match> method is analogous to the C<m/ />
-operator discussed above. Invoking it on a string (C<Str>), with a C<Regex> as
-an argument, matches the string against the C<Regex>.
+operator discussed above. Invoking it on a string, with a C<Regex> as an
+argument, matches the string against the C<Regex>.
+=end item
+
+=begin item
+I«Parsing grammars: grammar-name.parse($string)»
+
+Although parsing a L<Grammar|/language/grammars> involves more than just
+matching a string against a regex, this powerful regex-based text destructuring
+tool can't be left out from this overview of common pattern matching methods.
+
+If you feel that your needs exceed what simple regexes have to offer, check out this
+L<grammar tutorial|/language/grammar_tutorial> to take regexes to the next level.
 =end item
 
 
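The two items touched by this hunk come without examples in the diff. The following is a minimal sketch of both the C<match> method and C<Grammar.parse>; the grammar C<KeyValue>, its tokens, and the input string are hypothetical and were chosen only to keep the illustration short.

```raku
# Match method: called on a string with a Regex as its argument.
my $first = "Go ahead, make my day.".match: / \w+ /;
say $first;              # 「Go」

# A tiny, hypothetical grammar: a key, an equals sign, and a value.
grammar KeyValue {
    token TOP   { <key> '=' <value> }
    token key   { \w+ }
    token value { \w+ }
}

# .parse matches the whole string against TOP; the subrules <key> and
# <value> become named captures in the resulting Match object.
my $parsed = KeyValue.parse("color=red");
say $parsed<key>.Str;    # color
say $parsed<value>.Str;  # red
```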
