Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Need better Possessive Quantifier example #1

Open
jridgewell opened this issue Jan 28, 2020 · 3 comments
Open

Need better Possessive Quantifier example #1

jridgewell opened this issue Jan 28, 2020 · 3 comments

Comments

@jridgewell
Copy link
Owner

jridgewell commented Jan 28, 2020

The current "prevents backtracking" is a bit ambiguous, since it could desugar into multiple similar regexs.

Given /^(?:a|.b)++$/ (purposefully using ?:, so we can ignore capturing groups for the moment):

  1. ^(?>a|.b)+$/
  2. /^(?>(?:a|.b)+)$/ (this one is correct)
  3. /^(?>(?>a|.b)+)$/

This is explained in more detail in the "Using Atomic Grouping Instead of Possessive Quantifiers" section of https://www.regular-expressions.info/possessive.html.

@jridgewell
Copy link
Owner Author

Here are a few examples. I've modified the regex to demonstrate the issue, now using ^(?:a|.)++E$

Note #2 is correct (and the ES5 desguaring).

#1's + outside atomic is easily shown as incorrect. This is because the + is keeping the backtracking. The atomic groups (.) branch will eat every char, but because the + is rematching the atomic group over and over, it still gets the backtracking info. Once (.) eats the E char, we exit the + loop. But now, we still need an E. The + will backtrack one time, retry, and now it matches.

It's a bit more difficult to demonstrate why #3's two atomics desguaring is incorrect. It's because the (?:a|.)+ should be allowed to backtrack as needed to complete. I'll keep trying to come up with a demo for this.

@jridgewell
Copy link
Owner Author

Actually, I don't think it's possible to construct a case where #3 is observably different than the correct desugaring, it's just more characters to parse.

@Keyacom
Copy link

Keyacom commented Sep 23, 2022

I have a good application for the + possessive modifier: limited detecting of syntactically invalid HTML.

Since it's not yet available in ECMAScript, here's the PHP code for it:

<?php
# Requires PCRE 4.0 or newer!

$invalidHTMLre = '#</?[^\0- !-@[-`{-\x7f][^\0- !-,:-@[-`{-\x7f]*+(?!>)#gm';

$htmlToAlter = '<div></div><input></input';

echo htmlentities( preg_replace( $invalidHTMLre, '', $htmlToAlter ) );
# htmlentities escapes HTML into plain text
?>

This will output <div></div><input>. To see this regex in action: https://regex101.com/r/Hrw1SE/3

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants