|
| 1 | +=begin pod :tag<perl6> |
| 2 | +
|
| 3 | +=TITLE Regexes: Best practices and gotchas |
| 4 | +
|
| 5 | +=SUBTITLE Some tips on regexes and grammars |
| 6 | +
|
| 7 | +To help you write robust regexes and grammars, here are some best practices
| 8 | +for code layout and readability, what to actually match, and avoiding common
| 9 | +pitfalls. |
| 10 | +
|
| 11 | +=head1 Code layout |
| 12 | +
|
| 13 | +Without the C<:sigspace> adverb, whitespace is not significant in Perl 6 |
| 14 | +regexes. Use that to your own advantage and insert whitespace where it |
| 15 | +increases readability. Also, insert comments where necessary. |
| 16 | +
|
| 17 | +Compare the very compact |
| 18 | +
|
| 19 | + my regex float { <[+-]>?\d*'.'\d+[e<[+-]>?\d+]? } |
| 20 | +
|
| 21 | +to the more readable |
| 22 | +
|
| 23 | +    my regex float {
| 24 | +        <[+-]>?        # optional sign
| 25 | +        \d*            # leading digits, optional
| 26 | +        '.'
| 27 | +        \d+
| 28 | +        [              # optional exponent
| 29 | +            e <[+-]>? \d+
| 30 | +        ]?
| 31 | +    } |
| 32 | +
|
| 33 | +As a rule of thumb, use whitespace around atoms and inside groups; put |
| 34 | +quantifiers directly after the atom; and vertically align opening and closing |
| 35 | +square brackets and parentheses. |
| 36 | +
|
| 37 | +When you use a list of alternations inside parentheses or square brackets, align |
| 38 | +the vertical bars: |
| 39 | +
|
| 40 | +    my regex example {
| 41 | +        <preamble>
| 42 | +        [
| 43 | +        || <choice_1>
| 44 | +        || <choice_2>
| 45 | +        || <choice_3>
| 46 | +        ]+
| 47 | +        <postamble>
| 48 | +    } |
| 49 | +
|
| 50 | +=head1 Keep it small |
| 51 | +
|
| 52 | +Regexes are often more compact than regular code. Because they do so much with |
| 53 | +so little, keep regexes short. |
| 54 | +
|
| 55 | +When you can name a part of a regex, it's usually best to |
| 56 | +put it into a separate, named regex. |
| 57 | +
|
| 58 | +For example, you could take the float regex from earlier: |
| 59 | +
|
| 60 | +    my regex float {
| 61 | +        <[+-]>?        # optional sign
| 62 | +        \d*            # leading digits, optional
| 63 | +        '.'
| 64 | +        \d+
| 65 | +        [              # optional exponent
| 66 | +            e <[+-]>? \d+
| 67 | +        ]?
| 68 | +    } |
| 69 | +
|
| 70 | +And decompose it into parts: |
| 71 | +
|
| 72 | +    my token sign     { <[+-]> }
| 73 | +    my token decimal  { \d+ }
| 74 | +    my token exponent { 'e' <sign>? <decimal> }
| 75 | +    my regex float {
| 76 | +        <sign>?
| 77 | +        <decimal>?
| 78 | +        '.'
| 79 | +        <decimal>
| 80 | +        <exponent>?
| 81 | +    } |
| 82 | +
|
| 83 | +That helps, especially when the regex becomes more complicated. For example, |
| 84 | +you might want to make the decimal point optional in the presence of an exponent. |
| 85 | +
|
| 86 | +    my regex float {
| 87 | +        <sign>?
| 88 | +        [
| 89 | +        || <decimal>? '.' <decimal> <exponent>?
| 90 | +        || <decimal> <exponent>
| 91 | +        ]
| 92 | +    } |
| 93 | +
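To see what the decomposed regex accepts, a quick check against a few sample strings can help. The following is only a sketch: the sample inputs are made up for illustration, and the C<^> and C<$> anchors are added here so that the whole string has to be a float.

```raku
my token sign     { <[+-]> }
my token decimal  { \d+ }
my token exponent { 'e' <sign>? <decimal> }
my regex float {
    <sign>?
    [
    || <decimal>? '.' <decimal> <exponent>?
    || <decimal> <exponent>
    ]
}

# Illustrative inputs; the last two are expected to be rejected because
# they contain neither a decimal point nor an exponent.
for '3.14', '-.5', '1e10', '+2.5e-3', '42', 'abc' -> $input {
    say "$input: ", so $input ~~ / ^ <float> $ /;
}
```

A plain integer such as C<42> is deliberately not matched, since both alternatives require either a decimal point or an exponent.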
|
| 94 | +=head1 What to match |
| 95 | +
|
| 96 | +Often the input data format has no clear-cut specification, or the |
| 97 | +specification is not known to the programmer. Then, it's good to be liberal |
| 98 | +in what you expect, but only so long as there are no possible ambiguities. |
| 99 | +
|
| 100 | +For example, in C<ini> files: |
| 101 | +
|
| 102 | + =begin code :skip-test |
| 103 | + [section] |
| 104 | + key=value |
| 105 | + =end code |
| 106 | +
|
| 107 | +What can be inside the section header? Allowing only a word might be too |
| 108 | +restrictive. Somebody might write C<[two words]>, or use dashes, etc. |
| 109 | +Instead of asking what's allowed on the inside, it might be worth asking:
| 110 | +I<what's not allowed?> |
| 111 | +
|
| 112 | +Clearly, closing square brackets are not allowed, because C<[a]b]> would be |
| 113 | +ambiguous. By the same argument, opening square brackets should be forbidden. |
| 114 | +This leaves us with |
| 115 | +
|
| 116 | + token header { '[' <-[ \[\] ]>+ ']' } |
| 117 | +
|
| 118 | +which is fine if you are only processing one line. But if you're processing
| 119 | +a whole file, the regex suddenly also matches |
| 120 | +
|
| 121 | + =begin code :lang<text> |
| 122 | + [with a |
| 123 | + newline in between] |
| 124 | + =end code |
| 125 | +
|
| 126 | +which might not be a good idea. A compromise would be |
| 127 | +
|
| 128 | + token header { '[' <-[ \[\] \n ]>+ ']' } |
| 129 | +
|
| 130 | +and then, in the post-processing, strip leading and trailing spaces and tabs |
| 131 | +from the section header. |
| 132 | +
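The post-processing step could look roughly like the sketch below. It assumes a capturing group is added around the header contents (the token in the text has none) and uses C<.trim>, which strips all leading and trailing whitespace, not only spaces and tabs. The sample strings are made up for illustration.

```raku
# Same character class as in the text, wrapped in a capturing group
# so the section name can be pulled out afterwards (an assumption).
my token header { '[' ( <-[ \[ \] \n ]>+ ) ']' }

for "[ two words ]", "[with a\nnewline in between]" -> $line {
    if $line ~~ / ^ <header> $ / {
        # Strip the surrounding whitespace after matching.
        my $name = $<header>[0].Str.trim;
        say "section: ", $name;
    }
    else {
        say "not a header";
    }
}
```

Keeping the regex liberal and cleaning up afterwards keeps the pattern short; the header that spans two lines is rejected because C<\n> is excluded from the character class.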
|
| 133 | +=head1 Matching whitespace |
| 134 | +
|
| 135 | +The C<:sigspace> adverb (or using the C<rule> declarator instead of C<token> |
| 136 | +or C<regex>) is very handy for implicitly parsing whitespace that can appear |
| 137 | +in many places. |
| 138 | +
|
| 139 | +Going back to the example of parsing C<ini> files, we have |
| 140 | +
|
| 141 | + my regex kvpair { \s* <key=identifier> '=' <value=identifier> \n+ } |
| 142 | +
|
| 143 | +which is probably not as liberal as we want it to be, since the user might |
| 144 | +put spaces around the equals sign. So, then we may try this: |
| 145 | +
|
| 146 | + my regex kvpair { \s* <key=identifier> \s* '=' \s* <value=identifier> \n+ } |
| 147 | +
|
| 148 | +But that's looking unwieldy, so we try something else: |
| 149 | +
|
| 150 | + my rule kvpair { <key=identifier> '=' <value=identifier> \n+ } |
| 151 | +
|
| 152 | +But wait! The implicit whitespace matching after the value uses up all |
| 153 | +whitespace, including newline characters, so the C<\n+> doesn't have |
| 154 | +anything left to match (and C<rule> also disables backtracking, so no luck |
| 155 | +there). |
| 156 | +
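The effect can be reproduced with two small grammars that keep the default C<ws>. This is only a sketch: the grammar names C<WithNewline> and C<WithoutNewline> are hypothetical, and the key/value rule is placed in C<TOP> for brevity.

```raku
# Default ws: the implicit whitespace after the value consumes the "\n",
# so the explicit \n+ has nothing left to match.
grammar WithNewline {
    token identifier { \w+ }
    rule  TOP        { <key=identifier> '=' <value=identifier> \n+ }
}

# Without the explicit \n+, the same input parses, because the implicit
# whitespace at the end of the rule already matches the newline.
grammar WithoutNewline {
    token identifier { \w+ }
    rule  TOP        { <key=identifier> '=' <value=identifier> }
}

say so WithNewline.parse("jack = password1\n");     # False
say so WithoutNewline.parse("jack = password1\n");  # True
```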
|
| 157 | +Therefore, it's important to restrict implicit whitespace to whitespace
| 158 | +that is not significant in the input format. |
| 159 | +
|
| 160 | +You do this by redefining the token C<ws>; however, that only works inside
| 161 | +L<grammars|/language/grammars>: |
| 162 | +
|
| 163 | +    grammar IniFormat {
| 164 | +        token ws { <!ww> \h* }
| 165 | +        rule header { \s* '[' (\w+) ']' \n+ }
| 166 | +        token identifier { \w+ }
| 167 | +        rule kvpair { \s* <key=identifier> '=' <value=identifier> \n+ }
| 168 | +        token section {
| 169 | +            <header>
| 170 | +            <kvpair>*
| 171 | +        }
| 172 | +
|
| 173 | +        token TOP {
| 174 | +            <section>*
| 175 | +        }
| 176 | +    }
| 177 | +
|
| 178 | + my $contents = q:to/EOI/; |
| 179 | + [passwords] |
| 180 | + jack = password1 |
| 181 | + joy = muchmoresecure123 |
| 182 | + [quotas] |
| 183 | + jack = 123 |
| 184 | + joy = 42 |
| 185 | + EOI |
| 186 | + say so IniFormat.parse($contents); |
| 187 | +
|
| 188 | +Besides putting all regexes into a grammar and turning them into tokens |
| 189 | +(because they don't need to backtrack anyway), the interesting new bit is |
| 190 | +
|
| 191 | + token ws { <!ww> \h* } |
| 192 | +
|
| 193 | +which gets called for implicit whitespace parsing. It succeeds only when the
| 194 | +current position is not between two word characters (C<< <!ww> >>, the negated
| 195 | +"within word" assertion), and then consumes zero or more horizontal whitespace
| 196 | +characters. The limitation to horizontal whitespace is important, because
| 197 | +newlines (which are vertical whitespace) delimit records and shouldn't be matched implicitly. |
| 198 | +
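A reduced grammar makes it easier to see that newlines now act as record separators while spaces around the equals sign are skipped. This is a sketch only: the grammar name C<KeyValues> and the sample input are made up for illustration.

```raku
grammar KeyValues {
    token ws         { <!ww> \h* }   # implicit whitespace: horizontal only
    token identifier { \w+ }
    rule  kvpair     { <key=identifier> '=' <value=identifier> \n+ }
    token TOP        { <kvpair>+ }
}

my $match = KeyValues.parse("jack = password1\njoy   =   42\n");

# Each line ends up as its own kvpair, whatever the spacing around '='.
for $match<kvpair> -> $pair {
    say ~$pair<key>, ' => ', ~$pair<value>;
}
```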
|
| 199 | +Still, there's some whitespace-related trouble lurking. The regex C<\n+>
| 200 | +won't match all of a string like C<"\n \n">, because there's a blank between the
| 201 | +two newlines. To allow such input strings, replace C<\n+> with C<\n\s*>. |
| 202 | +
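The difference is easy to check in isolation. In this sketch the anchors C<^> and C<$> are added so that the whole string has to match, and the variable name is just an illustration.

```raku
my $blank-between = "\n \n";

# \n+ stops at the blank, so the anchored match fails.
say so $blank-between ~~ / ^ \n+ $ /;      # False

# \n\s* also consumes the blank and the second newline.
say so $blank-between ~~ / ^ \n \s* $ /;   # True
```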
|
| 203 | +=end pod |
| 204 | +# vim: expandtab softtabstop=4 shiftwidth=4 ft=perl6 |