prefer-predefined-assertion does not recognize `(\.(?!$)|$))` #659

kurtextrem · 2023-10-20T12:07:39Z

Information:
I used the playground to test (so I assume all latest?)

Description
Coming from this SO post: stackoverflow.com/a/36760050
The regex

^((25[0-5]|(2[0-4]|1\d|[1-9]|)\d)(\.(?!$)|$)){4}$

got optimized (by hand) to:

^((25[0-5]|(2[0-4]|1\d|[1-9]|)\d)\.?\b){4}$

so, partbefore(\.(?!$)|$)) can be expressed as (partbefore\.?\b), avoiding the negative lookahead, making the regex both smaller and faster: esbench.com/bench/65326a147ff73700a4debb7d.

I assume prefer-predefined-assertion is meant to capture this, but it does not:
https://ota-meshi.github.io/eslint-plugin-regexp/playground/#eJwVydEKgjAUxvFXOR28cNBail60CB+hiy6dgbRjDHSOaVI5370FH/x/8K34GDWhRCHg6mYzmC9BazXczOB6032UVXZpPXiCC4C4p5WMy8v6yMsm/BlVNCFTSoc646eGVZEsPkod0mqXsJAwthZbIpYz7tG/eppQrujpSW8nnKeOPI/R1BlLmrfTRH42o0WZb9sP02szMA==

(by the way, if you turn on all rules, it errors for that regex: )

The text was updated successfully, but these errors were encountered:

ota-meshi · 2023-10-20T13:01:42Z

Thank you for posting this issue!

However, I don't think partbefore(\.(?!$)|$) and (partbefore\.?\b) are equivalent.
The following script returns a different result.

const s = 'partbefore.'
console.log(/partbefore(\.(?!$)|$)/.test(s), /(partbefore\.?\b)/.test(s)) // false true

I think there probably needs to be a $ after the quantifier, but I haven't yet figured out how that can be optimized.

kurtextrem · 2023-10-20T13:13:02Z

Yeah, looks like this is equivalent:
console.log(/^partbefore(\.(?!$)|$)$/.test(s), /^(partbefore\.?\b)$/.test(s))

-> partbefore(\.(?!$)|$))$ can be expressed as partbefore\.?\b$

Thank you for taking care already regarding the issue!

RunDevelopment · 2023-10-20T16:31:16Z

partbefore(\.(?!$)|$)$ can be expressed as partbefore\.?\b$

Those 2 equivalent, but probably not in the way you think they are. Here's how we can simplify partbefore(\.(?!$)|$)$ and /^(partbefore\.?\b)$/:

  /^partbefore(\.(?!$)|$)$/;
= /^partbefore(\.(?!$)$|$$)/; // move the last $ inside the group
= /^partbefore(\.(?!$)$|$)/; // $$ == $
= /^partbefore($)/; // (?!$)$ will always reject, so we can remove the whole branch
= /^partbefore$/; // remove the group

  /^(partbefore\.?\b)$/;
= /^partbefore\.?\b$/; // remove the group
= /^partbefore(\.|)\b$/; // replace \.? with an equivalent group
= /^partbefore(\.\b|\b)$/; // move \b inside the group
= /^partbefore(\.(?=\w)|\b)$/; // Since \. is a non-word character, `\b` is will enforce a word character after it
= /^partbefore(\.(?=\w)|(?!\w))$/; // Since "e" is a word character, `\b` is will enforce a non-word character after it
= /^partbefore(\.(?=\w)$|(?!\w)$)/; // Move $ inside the group
= /^partbefore(\.(?=\w)$|$)/; // (?!\w)$ == $
= /^partbefore($)/; // (?=\w)$ will always reject
= /^partbefore$/; // remove the group

So both e(\.(?!$)|$)$ and e\.\b$ are equivalent to e$. So they are equivalent, but the neither includes the \..

The only place where (\.(?!$)|$) -> \.?\b is correct is if the expression is in a quantifier, is preceded by a word character, and then followed by $. The simplest example for this would be (a(\.(?!$)|$))*$ -> (a\.?\b)*$. Those are very narrow circumstances, so I don't know whether we should include this.

Don't get me wrong. Adding this to the rule would be relatively easy, but I don't know whether this makes the regex more understandable/maintainable. Since \.?\b relies upon the last $ for correctness, removing the last $ or adding anything before would break the regex.

so, partbefore(\.(?!$)|$)) can be expressed as (partbefore\.?\b), avoiding the negative lookahead, making the regex both smaller and faster: esbench.com/bench/65326a147ff73700a4debb7d.

partbefore(\.(?!$)|$)) is faster on my machine.

Also, if you want the best performance, then consider inlining the quantifier to get rid of the assertions completely: /^(?:25[0-5]|(?:2[0-4]|1\d|[1-9]|)\d)(?:\.(?:25[0-5]|(?:2[0-4]|1\d|[1-9]|)\d)){3}$/

RunDevelopment · 2023-10-20T16:43:40Z

The craziest thing I've seen all day is that eslint-plugin-regex can actually transform /^partbefore\.?\b$/ into /^partbefore$/. Kinda. It first detects that \.? can be removed:

And after following the suggestion, it detects that \b can be removed:

It uses different reasoning, but I did not expect it to get it at all.

kurtextrem · 2023-10-21T08:08:41Z

@RunDevelopment I get the same result now too:

after updating to Chrome 120 (V8 12.0.114)

Pretty curious, thank you for the hint - and I highly appreciate the in-depth explanation for the regex deduction!!

So both e(\.(?!$)|$)$ and e\.\b$ are equivalent to e$

So just checking if I could follow your explanation - the only reason why \.b\b$ or the negative lookahead is needed, is because of the quantifier ({4}), which you've removed subsequently to increase the performance of the regex?

kurtextrem · 2023-10-21T08:12:15Z

That all being said, thank you @ota-meshi @RunDevelopment for fixing the two things quickly. From my side, we can close this thread too. Maintain-/readability and higher performance are good reasons for me to avoid optimizing the regex so to a smaller one.

One last thing: I wonder if it'd make sense to create a rule that aims for highest runtime perf of the regex, so that e.g.

^(?:(?:25[0-5]|(2[0-4]|1\d|[1-9]|)\d)(?:\.(?!$)|$)){4}$ 
->
/^(?:25[0-5]|(?:2[0-4]|1\d|[1-9]|)\d)(?:\.(?:25[0-5]|(?:2[0-4]|1\d|[1-9]|)\d)){3}$/

in that case it makes the regex more readable (I guess that is subjective?) and faster too.

RunDevelopment · 2023-10-21T14:04:30Z

One last thing: I wonder if it'd make sense to create a rule that aims for highest runtime perf of the regex, so that e.g.
^(?:(?:25[0-5]|(2[0-4]|1\d|[1-9]|)\d)(?:\.(?!$)|$)){4}$ 
->
/^(?:25[0-5]|(?:2[0-4]|1\d|[1-9]|)\d)(?:\.(?:25[0-5]|(?:2[0-4]|1\d|[1-9]|)\d)){3}$/

This will be difficult.

I can say this with some certainty because I've only recently added this transformation (and similar ones) to refa (my regex, NFA, and DFA library). However, refa uses its own simplified AST format, which makes implementing these transformations a lot easier. If you're interested, here's the code.

One of the main difficulties for this will be regexpp's AST (regexpp = the regex parser we use). Its AST is essentially readonly for the purposes of this project. So we would most likely have to invent our AST format that allows for easy mutations. Mutations are necessary, because it is practically impossible to do these optimizations in a single step. E.g. here are the steps required to optimize the IP regex:

// I will use "byte" instead of the long "(?:25[0-5]|(?:2[0-4]|1\d|[1-9]|)\d)"
/^(byte(\.(?!$)|$)){4}$/;
// rewrite a quantfier "(AB){n}" as "A(BA){n-1}B"
/^byte((\.(?!$)|$)byte){3}(\.(?!$)|$)$/; 
// remove the "|$" that will always reject
/^byte(\.(?!$)byte){3}(\.(?!$)|$)$/; 
// remove the "\.(?!$)|" that will always reject
/^byte(\.(?!$)byte){3}$$/; 
// remove the double "$"
/^byte(\.(?!$)byte){3}$/; 
// remove the "(?!$)" that will always accept
/^byte(\.byte){3}$/;

It's possible to do all of this programmatically, but it will require a lot of work since we would have to write an optimizer from scratch.

in that case it makes the regex more readable (I guess that is subjective?) and faster too.

In this case. Unfortunately, there are a few scenarios where the regex blows up when we try to remove assertions. This mostly happens for regexes of the form (a|b|c(?=d))+ where d overlaps with a|b|c. It's also difficult to predict which regexes will get simpler after applying assertions. E.g. in the above example, we first need to grow the regex (making it more complex) to simplify it. But it is not guaranteed that we can effectively simply a regex after growing it. E.g. in some of my tests for refa, I've seen regexes grow to 8x their original size to remove a few assertions. This is good for what refa is doing (NFAs can't handle assertions, so removing assertion is very important), but it's not useful here.

the only reason why \.b\b$ or the negative lookahead is needed, is because of the quantifier ({4}), which you've removed subsequently to increase the performance of the regex?

Yes. See above for more details. Generally, v8's regex engine seems to be pretty slow when it some to assertions (likely because assertions prevent certain optimizations), so rewriting regexes to avoid them often helps performance. Same for backreferences.

From my side, we can close this thread too.

👍

ota-meshi mentioned this issue Oct 20, 2023

Fixed crash for v flag in regexp/optimal-quantifier-concatenation #660

Merged

RunDevelopment mentioned this issue Oct 20, 2023

Make no-useless-assertion detect (?!$)$ #661

Closed

This was referenced Oct 21, 2023

regexp-tree does not optimize (\.(?!$)|$))$ DmitrySoshnikov/regexp-tree#259

Open

perf: make ipv4 regex faster fastify/fast-uri#73

Merged

RunDevelopment closed this as completed Oct 21, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

prefer-predefined-assertion does not recognize `(\.(?!$)|$))` #659

prefer-predefined-assertion does not recognize `(\.(?!$)|$))` #659

kurtextrem commented Oct 20, 2023 •

edited

Loading

ota-meshi commented Oct 20, 2023

kurtextrem commented Oct 20, 2023 •

edited

Loading

RunDevelopment commented Oct 20, 2023

RunDevelopment commented Oct 20, 2023

kurtextrem commented Oct 21, 2023

kurtextrem commented Oct 21, 2023

RunDevelopment commented Oct 21, 2023

prefer-predefined-assertion does not recognize (\.(?!$)|$)) #659

prefer-predefined-assertion does not recognize (\.(?!$)|$)) #659

Comments

kurtextrem commented Oct 20, 2023 • edited Loading

ota-meshi commented Oct 20, 2023

kurtextrem commented Oct 20, 2023 • edited Loading

RunDevelopment commented Oct 20, 2023

RunDevelopment commented Oct 20, 2023

kurtextrem commented Oct 21, 2023

kurtextrem commented Oct 21, 2023

RunDevelopment commented Oct 21, 2023

prefer-predefined-assertion does not recognize `(\.(?!$)|$))` #659

prefer-predefined-assertion does not recognize `(\.(?!$)|$))` #659

kurtextrem commented Oct 20, 2023 •

edited

Loading

kurtextrem commented Oct 20, 2023 •

edited

Loading