Remove $unsupported tokens in the regex lexers #269

katef · 2020-10-10T18:27:55Z

Currently we produce $unsupported for various things (e.g. lookahead, lookbehind in pcre) and then error about it.

My suggestion instead is that we do lex these correctly, and then error about them in the parser instead. This moves the concept of unsupportedness along a layer.
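A minimal sketch of that split, written in Python for illustration only and not taken from libfsm's actual lexer or parser. The token names, `lex`, `check`, and `UnsupportedSyntax` are all hypothetical. The lexer gives lookahead openers their own token kinds, and the decision that they are unsupported is made one layer later:

```python
import re

# Hypothetical token kinds; longer openers are listed before the plain "(".
TOKEN_SPEC = [
    ("LOOKAHEAD_OPEN", r"\(\?="),
    ("NEG_LOOKAHEAD_OPEN", r"\(\?!"),
    ("GROUP_OPEN", r"\("),
    ("GROUP_CLOSE", r"\)"),
    ("CHAR", r"."),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


class UnsupportedSyntax(Exception):
    """Raised by the parser layer, not by the lexer."""


def lex(pattern):
    # The lexer only classifies input; it never rejects a known construct.
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(pattern)]


# The set of constructs the parser layer refuses, kept apart from the lexer.
UNSUPPORTED_IN_PARSER = {"LOOKAHEAD_OPEN", "NEG_LOOKAHEAD_OPEN"}


def check(tokens):
    # Stand-in for a real parser: it walks the tokens and rejects the
    # constructs it cannot handle, naming the specific construct.
    for kind, text in tokens:
        if kind in UNSUPPORTED_IN_PARSER:
            raise UnsupportedSyntax(f"{kind} ({text!r}) is not supported")
    return tokens
```

Because the token carries its real kind, the error can say which construct was rejected instead of reporting a generic unsupported token.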

Eventually I'd like to also construct AST nodes for these, and only error about the unsupportedness when we come to do the AST->NFA conversion. This way we'd also have support for these features for e.g. AST -> regexp rendering (where FSMs are not involved), and perhaps also opportunities to deal with them by AST rewriting.
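To illustrate the idea, here is a small Python sketch under stated assumptions; none of these names come from libfsm. The node classes, `render`, `to_nfa`, and `UnsupportedError` are hypothetical, and `render` does no escaping. The lookahead is an ordinary AST node: rendering works on it, and only the NFA conversion refuses it.

```python
from dataclasses import dataclass


class UnsupportedError(Exception):
    """Raised only at AST -> NFA conversion time."""


@dataclass
class Literal:
    text: str


@dataclass
class Concat:
    parts: list


@dataclass
class Lookahead:
    body: object
    negative: bool = False


def render(node):
    # AST -> regexp text; no FSM is involved, so lookahead is fine here.
    if isinstance(node, Literal):
        return node.text
    if isinstance(node, Concat):
        return "".join(render(p) for p in node.parts)
    if isinstance(node, Lookahead):
        marker = "!" if node.negative else "="
        return "(?" + marker + render(node.body) + ")"
    raise TypeError(f"unknown node: {node!r}")


def to_nfa(node, start=0):
    # Toy conversion: returns (edges, end_state) for a linear chain of states.
    if isinstance(node, Literal):
        edges = [(start + i, ch, start + i + 1) for i, ch in enumerate(node.text)]
        return edges, start + len(node.text)
    if isinstance(node, Concat):
        edges, state = [], start
        for part in node.parts:
            sub, state = to_nfa(part, state)
            edges.extend(sub)
        return edges, state
    if isinstance(node, Lookahead):
        # The unsupportedness surfaces here, one layer after parsing.
        raise UnsupportedError("lookahead cannot be converted to an NFA")
    raise TypeError(f"unknown node: {node!r}")
```

With this shape, an AST rewriting pass could run between parsing and `to_nfa` and remove some unsupported nodes before the conversion ever sees them.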


sfstewman · 2020-10-10T21:42:57Z

I think this makes a lot of sense.

For the PCRE dialect, $unsupported currently falls into four buckets:

1. Word boundary, capture groups, and multiline things that libfsm could potentially support: \b, \B, \K, \Z, and \G.
2. Back references. It makes sense to include these in the AST; simple forms like (foo)\1 can be transformed into (foo)(?:foo), which is compatible with DFAs and linear scanning.
3. Positive/negative look-ahead and look-behind assertions. We may be able to transform these into something compatible with linear scanning.
4. Ways to control backtracking: atomic qualifiers and (*VERB) forms like (*COMMIT) and (*PRUNE). These are so specific to backtracking matchers (and PCRE) that I'm not sure if we want to include them. On the other hand, there aren't that many forms of this, so it may make sense.
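The back reference rewrite mentioned above, (foo)\1 into (foo)(?:foo), could look roughly like the following Python sketch. It is an illustration under assumptions, not libfsm code: the node classes and `expand_backrefs` are hypothetical. The rewrite is only applied when the referenced group's body is a fixed literal, since a group like (a|b) can capture different strings and (a|b)(?:a|b) would not be equivalent. The toy AST has only concatenation, so every group seen so far is known to have participated in the match; alternation or repetition would need extra care.

```python
from dataclasses import dataclass


@dataclass
class Literal:
    text: str


@dataclass
class Group:
    index: int
    body: object


@dataclass
class NonCapture:
    body: object


@dataclass
class Backref:
    index: int


@dataclass
class Concat:
    parts: list


def expand_backrefs(node, seen=None):
    # Left-to-right pass; `seen` maps group index -> body of groups that
    # have already closed, so forward and self references are left alone.
    if seen is None:
        seen = {}
    if isinstance(node, Backref):
        body = seen.get(node.index)
        if isinstance(body, Literal):
            # The group can only have captured this exact text.
            return NonCapture(Literal(body.text))
        return node  # not a simple form; stays for a later layer to reject
    if isinstance(node, Group):
        body = expand_backrefs(node.body, seen)
        seen[node.index] = body
        return Group(node.index, body)
    if isinstance(node, NonCapture):
        return NonCapture(expand_backrefs(node.body, seen))
    if isinstance(node, Concat):
        return Concat([expand_backrefs(p, seen) for p in node.parts])
    return node
```

Back references that survive this pass are still ordinary AST nodes, so they can be rendered back to text and rejected only at the AST->NFA step.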

dkegel-fastly mentioned this issue Jun 2, 2021

pcre word boundary support #359

Open
