-
Notifications
You must be signed in to change notification settings - Fork 0
/
regex.vroom
313 lines (224 loc) · 7.45 KB
/
regex.vroom
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
---- config
title: Perl 6
indent: 4
auto_size: 1
# height: 18
# width: 69
vim_opts: '-u vimrc'
skip: 0
---- perl6
== # Smartmatch operator ~~
my $text = "Milwaukee Perl Monger's";
if $text ~~ m/ Perl / { # space is not significant by default!
say "I found '$/'"; # $/ is the match object
}
if $text !~~ m/ Clojure / { # negation is represented by !~~
say "Didn't match Clojure";
}
---- perl6
== # Match object is $/
# 012345678901234567890
if 'Today is a Perl 6 day' ~~ m/ Perl / {
say "Matched '$/'";
}
say "from: {$/.from}"; # Starting position of the match
say "to: {$/.to}"; # End position of the match
say "chars: {$/.chars}"; # Number of characters matched
say "Str: {$/.Str}"; # The matched text (default string coercion)
say "orig: {$/.orig}"; # The original matched string
---- skip
== The Rules (5)
Longest token match
1) Longest token matching: food\s+ beats foo by 2 or more positions
+2) Longest literal prefix: food\w* beats foo\w* by 1 position
+3) Declaration from most-derived grammar beats less-derived
+4) Within a given compilation unit, earlier declaration wins
+5) Declaration with least number of 'uses' wins
->
====
See: http://www.perlcabal.org/syn/S05.html#Overview
---- slide
== Unchanged from Perl 5
* Capturing: (...)
+* Repetition quantifiers: *, +, and ?
+* Alternatives (sort of): |
+* Backslash escape: \
+* Minimal matching suffix: ??, *?, +?
====
http://www.perlcabal.org/syn/S05.html#Unchanged_syntactic_features
---- slides
== Some rules
Alphanumerics Non-alphanumerics
Literal glyphs a 1 _ \* \$ \. \\ \'
Metasyntax \a \1 \_ * $ . \ '
Quoted glyphs 'a' '1' '_' '*' '$' '.' '\\' '\''
* All glyphs that are _, Unicode letters or numbers are *always* literal
* Any other glyph is metasyntactic (*not* self-matching)
* Escape metasyntax with \
* Glyphs can be made literal by placing them in quotes
* Double quoting glyphs makes it literal with interpolating semantics
---- perl6
# Modifiers are adverbs at the *start* of a regex.
if "TMTOWTDI" ~~ m:ignorecase/ tmtowtdi / { # or for short m:i//
say "Matched '$/' while ignoring case";
}
if "foo bar" ~~ m:sigspace/ foo bar / { # space is significant now
say "Matched '$/' while space is significant";
}
====
:ignorecase :i => makes matching case insignificant
:ignoremark :m => ignores accents
:sigspace :s => whitspace is significant
:ratchet :r => Causes the regex to *not* backtrack by default
:global :g =>
:Perl5 :P5 => Runs in Perl 5 regular expression mode
Unicode modifiers
:bytes => Match bytes
:codes => Match codepoints
:graphs => Match language-independent graphemes
:chars => Match at current max level
---- perl6
# . matches *any* character, including a newline
say "Matched a newline" if "\n" ~~ m/\n/;
say "Matched a '$/'" if 'abc' ~~ m/a.c/;
---- perl6
== # Matching start/end of string
my $str = q{foo
bar
cake};
# ^ and $ match the start/end of a *string*
say "Matched start of string '$/'" if $str ~~ m/^foo/;
say "Matched end of a string '$/'" if $str ~~ m/cake$/;
say "Failed to match start of 'bar'" if $str !~~ m/^bar/;
---- perl6
== # Matching start/end of line
my $str = q{foo
bar
cake};
# ^^ and $$ match the line beginning/ending
say "Matched beginning of line '$/'" if $str ~~ m/^^bar/;
say "Matched end of the line '$/'" if $str ~~ m/bar$$/;
say "Matched end of the line '$/'" if $str ~~ m/cake$$/;
====
TODO: I need better examples of beginning/end lines.
---- perl6
== # Matching space
my $str = "Check this out";
# regexes default to space is insignificant
say "No match" unless $str ~~ m/ Check this out /;
# Make space significant with the :sigspace modifier (or :s).
say "Matched '$/'" if $str ~~ m:s/ Check this out /;
# Match using a literal space.
say "Matched '$/'" if $str ~~ m/ Check ' ' /;
# Match using a named character class.
say "Matched '$/'" if $str ~~ m/ Check <space> this <space> out /;
---- perl6
== # backslashes
my $str = "zipcode: 54935";
if $str ~~ m/ \w+ ':' \s \d+ / {
say "Matched '$/'";
}
---- skip
== Backslash shortcuts
\d <digit> A digit
\D <-digit> A nondigit
\w <alnum> A word character
\W <-alnum> A non-word character
\s <sp> A whitespace character
\S <-sp> A non-whitespace character
\h A horizontal whitespace
\H A non-horizontal whitespace
\v A vertical whitespace
\V A non-vertical whitespace
====
http://www.perlcabal.org/syn/S05.html#Backslash_reform
---- perl6
== # Grouping
# () delimit capturing groups
if "zipcode: 54935" ~~ m/ \w+ ':' \s (\d+)/ {
say "zipcode is $0";
}
# [] delimit non-capturing groups
my $team = 'BrewersPackers';
if $team ~~ m/[Packers|Brewers]+/ { # recall: longest token match
say "Matched '$/'";
}
---- perl6
== # Extensible metasyntax: quote words
# Leading space after < means "quote words" array literal
my $str = "ruby";
if $str ~~ m/ < perl python ruby > / {
# same as: m/ [ 'perl' | 'python' | 'ruby' ] /
say "Matched '$/'"; # ruby
}
====
http://www.perlcabal.org/syn/S05.html#Extensible_metasyntax_%28%3C...%3E%29
---- perl6
== # Named character class or subrule
# Leading alphabetic character after leading <
say "Matched space" if " " ~~ m/ <space> /;
# A leading . after the initial < calls a method as a
# non-capturing subrule.
if 'identifier ' ~~ m/ <.ident> <.ws> / {
say "Matched '$/'";
}
---- perl6
== # Enumerated character classes
my $str = '_phoneNumber';
say "Matched '$/'" if $str ~~ m/ <[a..zA..Z_]>+ /;
say "Matched '$/'" if $str ~~ m/ <[ a..z A..Z _ ]>+ /;
---- perl6
# Arithmetic
my $str = 'Gabriel';
if $str ~~ m/ <+alnum - [ie]>* / { # recall: greedy match
say "Matched '$/'";
}
---- perl6
== # Repetition quantifier is **
my $str = "mememememe";
if $str ~~ m/ [me|meme] ** 2 / { # | matches greedily
say "Matched '$/'";
}
if $str ~~ m/ 'me' ** 2..4 / { # Match 'me' two to four times
say "Matched '$/'";
}
====
Instead of representing temporal alternation, | now represents logical
alternation with declarative longest-token semantics. (You may now use
|| to indicate the old temporal alternation. That is, | and || now work
within regex syntax much the same as they do outside of regex syntax,
where they represent junctional and short-circuit OR. This includes the
fact that | has tighter precedence than ||.)
via http://www.perlcabal.org/syn/S05.html#Longest-token_matching
---- perl6
== # Named regex
my regex engine { Starboard | Port }
my regex teams { Brewers | Packers }
my @stuff = <Gabriel Milwaukee Port Brewers Starboard Packers>;
say grep /<engine>/, @stuff;
say grep /<teams>/, @stuff;
---- perl6
== # Grammar: Token
# token: never backtracks by default (can be overridden)
token proper_name { <word>+ <space> <word>+ }
# Which is short for:
regex name :ratchet { ... }
====
# regex: use the colon to specify no backtracking
regex proper_name { <word>+: <space> <word>+: }
In normal regexes, use *:, +:, or ?: to prevent backtracking into
the quantifier. If you want to backtrack, append either a ? or a !
to the quantifier. The ? forces frugal matching as usual, while
the ! forces greedy matching.
---- perl6
== # Grammar: Rule
# rule: never backtracks (:ratchet) and
# space is significant (:sigspace)
rule proper_name { <word>+ <word>+ }
# which is short for:
regex name :ratchet :sigspace { ... }
---- perl6
== # Not sure about regexes?
if "Perl " ~~ m:Perl5/.../ { # muahhaaa!
say "TMTOWTDI";
}