proposal: regexp: Optimize fixed-length patterns #21463

sylvinus · 2017-08-16T01:29:58Z

The regexp package has 3 different matchers (NFA, onepass, backtrack). One of them is selected depending on the pattern.

I suggest adding a 4th matcher optimized for fixed-length patterns like a.ab$ on strings.
I wrote a proof-of-concept implementation which provides constant-time matching relative to the size of the input string in many cases, whereas the current matchers from regexp perform linearly. Performance becomes close to what can be achieved with methods like strings.Index.

Here is a sample benchmark result of regexp.MatchString("a.ab$", strings.Repeat("a", M)+"b"):

M=1000
BenchmarkRegexpBypass/DotSuffix/bypass-8   	30000000	    109 ns/op  # New matcher implementation
BenchmarkRegexpBypass/DotSuffix/std-8      	   50000	  69882 ns/op  # Go's current standard library
BenchmarkRegexpBypass/DotSuffix/pcre-8     	  100000	  41359 ns/op  # PCRE
BenchmarkRegexpBypass/DotSuffix/rust-8     	20000000	    157 ns/op  # Rust regexp engine

I've added more details, benchmarks and tests (with fuzzing!) to the repository. Not sure how much should be inlined here.

The obvious cons of this proposal are:

Adding more code to the standard library
Adding more work at compilation time.

The pros are:

Constant-time matching for many simple regexps (relative to the size of the input string)
No overhead for regexps that are not supported
Go's regexp package is usually considered immature performance-wise. This proposal plays a small role in fixing that by adding optimizations that can reasonably be expected from the end-user.
This matcher keeps very little state and bypasses the mutex from regexp.go
There are already 3 different matchers in the standard library (4 with the upcoming DFA), so adding a new one for a specific kind of patterns is not surprising.
regexp.MatchString("(?:png|jpg)$") could obviously be rewritten as strings.HasSuffix("png") or strings.HasSuffix("jpg") but sometimes it is not practical because the pattern to be matched is user-supplied or part of a long list of patterns. Examples include interactive log search or lists of paths in HTTP routers.
Limited risk due to exhaustive tests in the standard library and additional fuzzing

Feedback would be highly appreciated. Thanks!!

The text was updated successfully, but these errors were encountered:

ianlancetaylor · 2017-08-16T04:03:17Z

CC @rsc

cespare · 2017-08-16T04:10:26Z

It would be interesting to examine a large corpus of Go code (perhaps https://github.com/rsc/corpus), extract all the regular expressions (at least the constant ones), and categorize them as to whether this matcher would apply.

steren · 2017-08-16T06:35:11Z

I looked in the BigQuery GitHub public dataset for constant regular expressions used with regexp.MatchString:

matcher applies	cnt
yes: ^([\w\d\s]+\\|?)+$	206
yes: ^[\w\d\s\.]+$	47
yes: [\w\d\s\.]+\$	59
no: ^[^\\^\\$]+$	422
no: [\*\+\?]	1182
maybe	548

13% of the time, the matcher applies. 65% of the time, it does not.

Feel free to add comments or suggestions to the query, I can re-run it and update the results here.

sylvinus · 2017-08-16T08:21:10Z

Great idea & thanks for the stats!

Regarding BigQuery, I think we should also match calls to (Must)?Compile, as this proposal isn't limited to MatchString (though my prototype implementation currently is). Also, it might be difficult to decide if the matcher applies based only on a regexp, so another technique could be dumping all the patterns to a CSV and then testing them one at a time. Finally, the matcher has a "firstpass" mode of operation where it extracts anchored fixed-length prefixes and suffixes before delegating to the other matchers, so that should be counted separately.

I'm going to update the repository with a more exhaustive list of supported regexps to make things clearer.

cznic · 2017-08-16T08:25:14Z

The pros are:

Constant-time matching for many simple regexps

I'm not aware of any theory making that possible. Can you please explain?

sylvinus · 2017-08-16T09:17:10Z

I have updated the repository with a clearer table of supported patterns.

@cznic I meant constant-time like strings.HasSuffix is. I'm not sure of the exact complexity of string slicing in Go, but for practical purposes I have observed it to be constant. I made a chart of the benchmark of regexp.MatchString("a.ab$", strings.Repeat("a", M)+"b") as an example:

cznic · 2017-08-16T09:32:05Z

I meant constant-time like strings.HasSuffix is.

I don't think strings.HasSuffix is O(1) either.

I made a chart of the benchmark ...

Nice chart, but the N parameter is used in the wrong place. Actually, the data are for constant N, no wonder it flat-lined.

More importantly, theory needs to talk. Can you refer to a theory enabling/explaining constant time regexp matching?

Don't get me wrong, none of the above means your contribution does not improve performance.

sylvinus · 2017-08-16T09:59:31Z

I don't think strings.HasSuffix is O(1) either.

I'm not sure if it is in theory, but I just added a benchmark for it and it does seem constant even with a 1Gb string.

The actual function being used to read the input string for a.ab$ is utf8.DecodeLastRuneInString, I'd be interested to know its theoretical complexity.

Nice chart, but the N parameter is used in the wrong place. Actually, the data are for constant N, no wonder it flat-lined.

Sorry if I missed something. I'm using the strings returned by getString(n int) as input, those do seem to grow and I'm sending the same input to both implementations so I'm not sure there is an error.

I'm happy to change "constant" to something else if it's not actually the case but so far I haven't seen any linear behaviour (relative to the size of the input string) in the benchmarks for these simple patterns.

cznic · 2017-08-16T10:13:38Z

I'm not sure if it is in theory, but I just added a benchmark for it and it does seem constant even with a 1Gb string.

The N you're looking for is the length of the matched pattern, not the length of the string searched. But you're using a constant length pattern, so you can only get a constant result.

Normally, two parameters should be considered, length of the pattern (let's call it N) and length of the string (M). If the computation can take advantage of knowing the location of the end of the string in O(1) and perform the match from the end, it reduces to considering N only. But I'm not aware of any way to perform the match in O(1) wrt to N only. Therefore, regardless of the complexity wrt M, the overall one cannot be constant time. Known theory says O(NM), which is O(N) for strings.HasSuffix.

sylvinus · 2017-08-16T10:30:46Z

@cznic OK I see what you mean. I did mean "constant" relative to the size of the input string only. I will add that clarification where needed and change N to M to match theory. I'm obviously not claiming it to be constant relative to the size of the pattern itself. Thanks!

steren · 2017-08-16T15:26:06Z

I used this query to extract constant regular expressions used with MatchString, Compile and MustCompile in the the GitHub public dataset: CSV file

Feel free to run an analysis on it.

sylvinus · 2017-08-16T17:02:55Z

Thanks a lot @steren! I just added the dump to the tests, here are the stats:

Total occurences                    76341
Invalid                             2545 (3.33%)
Unsupported                         48006 (62.88%)
Supported total                     25790 (33.78%)
 Supported with byPassProgLinear    14714 (19.27%)
 Supported with byPassProgAlternate 450 (0.59%)
 Supported with byPassProgFirstPass 10626 (13.92%)

33% is much higher than I expected!

sylvinus · 2017-08-30T23:18:35Z

I have updated my repository to add support for all fixed-length unanchored patterns. Support on the GitHub dataset increased from 33.78% to 36.51% with overall performance improvements. It would even go up to 55% if/when we support captures.

I am pretty excited by the fact that on some patterns, this implementation beats all the engines I have been able to benchmark, including Rust which usually performs best. I think implementing both this proposal and a proper DFA would make Go's regexp engine a very strong performer.

rsc · 2017-08-31T00:00:41Z

Hi. Thanks for looking into the performance here. I would like to understand better how much of the improved performance can be had by appropriate adjustments to the existing matchers versus how much requires a whole new matcher. Every new matcher adds a cost, and honestly I think we have too many already. I don't really like how overengineered RE2 is, and I don't want Go's regexp package to end up the same way one matcher at a time. So I want to make sure we're doing all we can with existing matchers before adding a new one.

I observe the following about the standard Go regexp package:

It looks like leading or trailing .* (or leading ^.* or trailing .*$) should be stripped from the pattern and handled specially, just like we handle literal prefixes.
It looks like the literal prefix matcher doesn't kick in for anchored patterns, and we should fix that. (See the difference in std between ^xxy and xx in your benchmark.)
It seems clear that we should handle patterns ending in $, like \.(jpe?g|gif)$, by scanning the string backward, which cuts the processing from O(n) to O(1) for an input of size n. That's easy to add to the existing matchers and doesn't need to be limited to fixed-size patterns.
Every top-level .* in the pattern (including the implicit .* inside .+ expanded to ..*) serves as a "cut" in the Prolog sense; when reached it should cause all lower-priority threads to be discarded, and no new start-of-pattern threads should ever be created, the same way those threads are disabled when finding an overall match.

I think those would recover the bulk of the improvements shown in the benchmarks, without a new matcher. If you'd like to look into CLs making those changes, please go ahead.

sylvinus · 2017-09-01T09:20:59Z

Thanks for your feedback @rsc!

I agree it's fair to try improving the existing matchers before considering adding a new one. I still believe fixed-length patterns can benefit from optimizations that will require new matching methods, but let's indeed find out what the minimal amount of new code would be.

It looks like the literal prefix matcher doesn't kick in for anchored patterns, and we should fix that. (See the difference in std between ^xxy and xx in your benchmark.)

I don't see any difference in the benchmark in std between ^xxy and xx, could you explain?

In your points 1 & 4 you mention optimizations for .*. Are they valid only with the DotNL flag or would they involve scanning the input to look for \n characters?

I will try to make a CL for backward matching of anchored patterns first. Should I open a separate issue for it?

I have a couple questions for further CLs:

Would it be possible to add new InstOps? For instance I think having a new one similar to OpAnyCharNotNL but with any other character than \n would improve performance because excluding a single character is a frequent use case. [^a]*a could also be done in a single InstOp.
Cutting processing time from O(n) to O(1) is always interesting, but how much do you care about reducing the ~100ns/op that seem to be the overhead of re.MatchString even for simple patterns? For instance, would you be open to approaches that bypass the mutex when it's not needed?
Would it be worth it to generalize LiteralPrefix and scan the input string for all top-level literals in the pattern in order to quickly dismiss negative matches?

Thanks!

rsc · 2017-09-01T14:44:07Z

re difference, looks like maybe I misread the benchmark output.

re .*, yes, what I described would only apply when . matches \n. That's kind of sad.

re backward matching, sure feel free to open a separate issue.

re InstOps, it would be nice to avoid them by default, since new InstOps invalidate other code using the Prog representation (that don't know how to handle them), but if they make a big difference it is probably OK.

re LiteralPrefix, the nice thing about the prefix is that it tells you where to start too, so the scanning is not wasted effort. Scanning for the other required literals might be OK but they're harder to extract from the Prog, and I try to avoid doing analyses directly on the Regexp beacuse of all the syntax complications. Worth trying though.

rsc · 2017-10-16T20:12:56Z

On hold for response (no hurry).

gopherbot · 2021-10-02T22:14:11Z

Change https://golang.org/cl/353711 mentions this issue: regexp: allow patterns with no alternates to be one-pass

gopherbot · 2022-01-10T18:01:27Z

Change https://golang.org/cl/377294 mentions this issue: regexp: allow prefix string anchored at beginning

gopherbot added this to the Proposal milestone Aug 16, 2017

gopherbot added the Proposal label Aug 16, 2017

dsnet added the Performance label Aug 16, 2017

sylvinus mentioned this issue Sep 2, 2017

regexp: Scan the input string backward for patterns ending in $ #21743

Open

rsc added the Proposal-Hold label Oct 16, 2017

mvdan mentioned this issue Jul 27, 2018

regexp: investigate further performance improvements #26623

Open

mvdan mentioned this issue Oct 9, 2018

go/ast: add func IsGenerated(*File) bool #28089

Closed

This was referenced Apr 7, 2019

regexp: Optimize for inputs that are provably too short #31329

Closed

regexp: improve test coverage #31392

Open

This was referenced Aug 6, 2019

regexp: LiteralPrefix behaving incorrectly #30425

Open

regexp: LiteralPrefix returns surprising result for ^ anchored strings #33506

Open

mvdan mentioned this issue Jun 26, 2020

regexp: fast path for prefix/suffix literals matching on anchored regexps #39869

Open

pracucci mentioned this issue Jul 2, 2020

Optimise labels regex matchers containing a literal within the pattern prometheus/prometheus#7503

Merged

bboreham mentioned this issue Oct 2, 2021

regexp: allow patterns with no alternates to be one-pass #48748

Open

seankhliao changed the title ~~Proposal: regexp: Optimize fixed-length patterns~~ proposal: regexp: Optimize fixed-length patterns Jan 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

proposal: regexp: Optimize fixed-length patterns #21463

proposal: regexp: Optimize fixed-length patterns #21463

sylvinus commented Aug 16, 2017 •

edited

Loading

ianlancetaylor commented Aug 16, 2017

cespare commented Aug 16, 2017

steren commented Aug 16, 2017 •

edited

Loading

sylvinus commented Aug 16, 2017

cznic commented Aug 16, 2017

sylvinus commented Aug 16, 2017 •

edited

Loading

cznic commented Aug 16, 2017

sylvinus commented Aug 16, 2017 •

edited

Loading

cznic commented Aug 16, 2017

sylvinus commented Aug 16, 2017 •

edited

Loading

steren commented Aug 16, 2017

sylvinus commented Aug 16, 2017

sylvinus commented Aug 30, 2017 •

edited

Loading

rsc commented Aug 31, 2017

sylvinus commented Sep 1, 2017 •

edited

Loading

rsc commented Sep 1, 2017

rsc commented Oct 16, 2017

gopherbot commented Oct 2, 2021

gopherbot commented Jan 10, 2022

proposal: regexp: Optimize fixed-length patterns #21463

proposal: regexp: Optimize fixed-length patterns #21463

Comments

sylvinus commented Aug 16, 2017 • edited Loading

ianlancetaylor commented Aug 16, 2017

cespare commented Aug 16, 2017

steren commented Aug 16, 2017 • edited Loading

sylvinus commented Aug 16, 2017

cznic commented Aug 16, 2017

sylvinus commented Aug 16, 2017 • edited Loading

cznic commented Aug 16, 2017

sylvinus commented Aug 16, 2017 • edited Loading

cznic commented Aug 16, 2017

sylvinus commented Aug 16, 2017 • edited Loading

steren commented Aug 16, 2017

sylvinus commented Aug 16, 2017

sylvinus commented Aug 30, 2017 • edited Loading

rsc commented Aug 31, 2017

sylvinus commented Sep 1, 2017 • edited Loading

rsc commented Sep 1, 2017

rsc commented Oct 16, 2017

gopherbot commented Oct 2, 2021

gopherbot commented Jan 10, 2022

sylvinus commented Aug 16, 2017 •

edited

Loading

steren commented Aug 16, 2017 •

edited

Loading

sylvinus commented Aug 16, 2017 •

edited

Loading

sylvinus commented Aug 16, 2017 •

edited

Loading

sylvinus commented Aug 16, 2017 •

edited

Loading

sylvinus commented Aug 30, 2017 •

edited

Loading

sylvinus commented Sep 1, 2017 •

edited

Loading