Commit 6e8305e

Merge pull request #2431 from chsanch/fix-2410
Split regexes page and add regexes best practices page
2 parents 4bc82af + 4d04b10 commit 6e8305e

2 files changed, +207 -200 lines changed

Lines changed: 204 additions & 0 deletions
@@ -0,0 +1,204 @@
=begin pod :tag<perl6>

=TITLE Regexes: Best practices and gotchas

=SUBTITLE Some tips on regexes and grammars

To help you write robust regexes and grammars, here are some best practices
for code layout and readability, what to actually match, and avoiding common
pitfalls.

=head1 Code layout

Without the C<:sigspace> adverb, whitespace is not significant in Perl 6
regexes. Use that to your advantage and insert whitespace where it
increases readability. Also, insert comments where necessary.

Compare the very compact

    my regex float { <[+-]>?\d*'.'\d+[e<[+-]>?\d+]? }

to the more readable

    my regex float {
        <[+-]>?         # optional sign
        \d*             # leading digits, optional
        '.'
        \d+
        [               # optional exponent
            e <[+-]>? \d+
        ]?
    }

As a rule of thumb, use whitespace around atoms and inside groups; put
quantifiers directly after the atom; and vertically align opening and closing
square brackets and parentheses.

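As a quick check that the two layouts are interchangeable, the following
sketch runs both versions against a few sample strings. The names
C<float_compact> and C<float_readable> and the sample inputs are made up for
this illustration.

```perl6
# Hypothetical names: float_compact and float_readable exist only in this sketch.
my regex float_compact  { <[+-]>?\d*'.'\d+[e<[+-]>?\d+]? }
my regex float_readable {
    <[+-]>?         # optional sign
    \d*             # leading digits, optional
    '.'
    \d+
    [               # optional exponent
        e <[+-]>? \d+
    ]?
}

# Anchor both regexes so that the whole string has to match.
for '3.14', '-.5', '+10.25e-3', '42' -> $input {
    my $compact  = so $input ~~ / ^ <float_compact>  $ /;
    my $readable = so $input ~~ / ^ <float_readable> $ /;
    say "{$input}: compact={$compact}, readable={$readable}";
}
```

Both regexes should agree on every input. The string C<42> is rejected by
both, because the pattern requires a decimal point.
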

When you use a list of alternations inside parentheses or square brackets, align
the vertical bars:

    my regex example {
        <preamble>
        [
        || <choice_1>
        || <choice_2>
        || <choice_3>
        ]+
        <postamble>
    }

=head1 Keep it small

Regexes are often more compact than regular code. Because they do so much with
so little, keep regexes short.

When you can name a part of a regex, it's usually best to
put it into a separate, named regex.

For example, you could take the float regex from earlier:

    my regex float {
        <[+-]>?         # optional sign
        \d*             # leading digits, optional
        '.'
        \d+
        [               # optional exponent
            e <[+-]>? \d+
        ]?
    }

And decompose it into parts:

    my token sign { <[+-]> }
    my token decimal { \d+ }
    my token exponent { 'e' <sign>? <decimal> }
    my regex float {
        <sign>?
        <decimal>?
        '.'
        <decimal>
        <exponent>?
    }

That helps, especially when the regex becomes more complicated. For example,
you might want to make the decimal point optional in the presence of an exponent.

    my regex float {
        <sign>?
        [
        || <decimal>? '.' <decimal> <exponent>?
        || <decimal> <exponent>
        ]
    }

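A side benefit of the decomposition is that each named part becomes a named
capture. The sketch below assumes the tokens from this section and a handful
of invented sample inputs; it shows how the sign and exponent can be read
back from the match.

```perl6
my token sign     { <[+-]> }
my token decimal  { \d+ }
my token exponent { 'e' <sign>? <decimal> }
my regex float {
    <sign>?
    [
    || <decimal>? '.' <decimal> <exponent>?
    || <decimal> <exponent>
    ]
}

# Sample inputs are illustrative only.
for '-12.5e+3', '7e2', '.25', '42' -> $input {
    if $input ~~ / ^ <float> $ / {
        # Optional parts that did not take part in the match are undefined.
        my $sign     = $<float><sign>     // '(none)';
        my $exponent = $<float><exponent> // '(none)';
        say "{$input}: sign={$sign}, exponent={$exponent}";
    }
    else {
        say "{$input}: no match";
    }
}
```

Here C<7e2> is accepted through the second alternative, while C<42> is
rejected because it has neither a decimal point nor an exponent.
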

=head1 What to match

Often the input data format has no clear-cut specification, or the
specification is not known to the programmer. Then, it's good to be liberal
in what you accept, but only so long as there are no possible ambiguities.

For example, in C<ini> files:

=begin code :skip-test
[section]
key=value
=end code

What can be inside the section header? Allowing only a word might be too
restrictive. Somebody might write C<[two words]>, or use dashes, etc.
Instead of asking what's allowed on the inside, it might be worth asking:
I<what's not allowed?>

Clearly, closing square brackets are not allowed, because C<[a]b]> would be
ambiguous. By the same argument, opening square brackets should be forbidden.
This leaves us with

    my token header { '[' <-[ \[\] ]>+ ']' }

which is fine if you are only processing one line. But if you're processing
a whole file, suddenly the regex parses

=begin code :lang<text>
[with a
newline in between]
=end code

which might not be a good idea. A compromise would be

    my token header { '[' <-[ \[\] \n ]>+ ']' }

and then, in the post-processing, strip leading and trailing spaces and tabs
from the section header.

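A minimal sketch of that compromise, including the post-processing step,
might look as follows. The named capture C<< $<name> >> and the sample lines
are assumptions added for this illustration; the token in the text above does
not capture anything.

```perl6
# $<name> is a hypothetical capture added so the header text can be extracted.
my token header { '[' $<name>=[ <-[ \[\] \n ]>+ ] ']' }

for '[two words]', "[ spaced-out\t]", "[with a\nnewline]", '[a]b]' -> $line {
    if $line ~~ / ^ <header> $ / {
        # .trim strips leading and trailing whitespace (spaces and tabs here,
        # since newlines cannot occur inside the header).
        say 'accepted: ', $<header><name>.trim;
    }
    else {
        say 'rejected';
    }
}
```

The first two lines are accepted and trimmed. The last two are rejected, one
because of the embedded newline and one because of the stray closing bracket.
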

=head1 Matching whitespace

The C<:sigspace> adverb (or using the C<rule> declarator instead of C<token>
or C<regex>) is very handy for implicitly parsing whitespace that can appear
in many places.

Going back to the example of parsing C<ini> files, we have

    my regex kvpair { \s* <key=identifier> '=' <value=identifier> \n+ }

which is probably not as liberal as we want it to be, since the user might
put spaces around the equals sign. So, then we may try this:

    my regex kvpair { \s* <key=identifier> \s* '=' \s* <value=identifier> \n+ }

But that's looking unwieldy, so we try something else:

    my rule kvpair { <key=identifier> '=' <value=identifier> \n+ }

But wait! The implicit whitespace matching after the value uses up all
whitespace, including newline characters, so the C<\n+> doesn't have
anything left to match (and C<rule> also disables backtracking, so no luck
there).

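The difference can be observed directly. In this sketch the C<identifier>
token is assumed to be C<\w+> (as in the grammar later in this section), and
the names C<kvpair_regex> and C<kvpair_rule> are invented so that both
variants can live side by side.

```perl6
# Assumption: identifiers are plain word characters.
my token identifier { \w+ }

# Hypothetical names so that the two variants can be compared.
my regex kvpair_regex { \s* <key=identifier> \s* '=' \s* <value=identifier> \n+ }
my rule  kvpair_rule  { <key=identifier> '=' <value=identifier> \n+ }

my $line = "jack = password1\n";

# Explicit \s* leaves the newline for \n+ to match.
say so $line ~~ / ^ <kvpair_regex> $ /;   # True

# The implicit whitespace after the value already consumed the newline.
say so $line ~~ / ^ <kvpair_rule>  $ /;   # False
```
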

Therefore, it's important to restrict implicit whitespace to whitespace that
is not significant in the input format.

You do this by redefining the token C<ws>; however, this only works for
L<grammars|/language/grammars>:

    grammar IniFormat {
        token ws { <!ww> \h* }
        rule header { \s* '[' (\w+) ']' \n+ }
        token identifier { \w+ }
        rule kvpair { \s* <key=identifier> '=' <value=identifier> \n+ }
        token section {
            <header>
            <kvpair>*
        }

        token TOP {
            <section>*
        }
    }

    my $contents = q:to/EOI/;
    [passwords]
    jack = password1
    joy = muchmoresecure123
    [quotas]
    jack = 123
    joy = 42
    EOI
    say so IniFormat.parse($contents);

Besides putting all regexes into a grammar and turning them into tokens and
rules (because they don't need to backtrack anyway), the interesting new bit is

    token ws { <!ww> \h* }

which gets called for implicit whitespace parsing. It matches zero or more
horizontal whitespace characters, but only when the current position is not
between two word characters (C<< <!ww> >>, the negated "within word"
assertion). The limitation to horizontal whitespace is important, because
newlines (which are vertical whitespace) delimit records and shouldn't be
matched implicitly.

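To see the effect of the custom C<ws> token in isolation, the following
sketch compares two small grammars that differ only in C<ws>. The grammar
names C<DefaultWs> and C<HorizontalWs> and the input string are hypothetical
and exist only for this comparison.

```perl6
# Hypothetical grammar that keeps the default ws, which also matches newlines.
grammar DefaultWs {
    token TOP        { <kvpair>+ }
    token identifier { \w+ }
    rule  kvpair     { <key=identifier> '=' <value=identifier> \n+ }
}

# Same grammar, but implicit whitespace is limited to horizontal whitespace.
grammar HorizontalWs is DefaultWs {
    token ws { <!ww> \h* }
}

my $input = "jack = 123\njoy = 42\n";

say so DefaultWs.parse($input);      # False
say so HorizontalWs.parse($input);   # True
```

Inheriting from the first grammar keeps the rules identical, so the only
variable in the comparison is the definition of C<ws>.
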

Still, there's some whitespace-related trouble lurking. The regex C<\n+>
won't match a string like C<"\n \n"> in full, because there's a blank between
the two newlines. To allow such input strings, replace C<\n+> with C<\n\s*>.

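A short sketch of the difference, using anchors so that the whole string has
to match (the variable name C<$gap> is made up for this example):

```perl6
# Two newlines separated by a single blank.
my $gap = "\n \n";

say so $gap ~~ / ^ \n+    $ /;   # False
say so $gap ~~ / ^ \n \s* $ /;   # True
```
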

=end pod
# vim: expandtab softtabstop=4 shiftwidth=4 ft=perl6
