New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: regexp: Optimize fixed-length patterns #21463

Open
sylvinus opened this Issue Aug 16, 2017 · 17 comments

Comments

Projects
None yet
8 participants
@sylvinus
Contributor

sylvinus commented Aug 16, 2017

The regexp package has 3 different matchers (NFA, onepass, backtrack). One of them is selected depending on the pattern.

I suggest adding a 4th matcher optimized for fixed-length patterns like a.ab$ on strings.
I wrote a proof-of-concept implementation which provides constant-time matching relative to the size of the input string in many cases, whereas the current matchers from regexp perform linearly. Performance becomes close to what can be achieved with methods like strings.Index.

Here is a sample benchmark result of regexp.MatchString("a.ab$", strings.Repeat("a", M)+"b"):

M=1000
BenchmarkRegexpBypass/DotSuffix/bypass-8   	30000000	    109 ns/op  # New matcher implementation
BenchmarkRegexpBypass/DotSuffix/std-8      	   50000	  69882 ns/op  # Go's current standard library
BenchmarkRegexpBypass/DotSuffix/pcre-8     	  100000	  41359 ns/op  # PCRE
BenchmarkRegexpBypass/DotSuffix/rust-8     	20000000	    157 ns/op  # Rust regexp engine

I've added more details, benchmarks and tests (with fuzzing!) to the repository. Not sure how much should be inlined here.

The obvious cons of this proposal are:

  • Adding more code to the standard library
  • Adding more work at compilation time.

The pros are:

  • Constant-time matching for many simple regexps (relative to the size of the input string)
  • No overhead for regexps that are not supported
  • Go's regexp package is usually considered immature performance-wise. This proposal plays a small role in fixing that by adding optimizations that can reasonably be expected from the end-user.
  • This matcher keeps very little state and bypasses the mutex from regexp.go
  • There are already 3 different matchers in the standard library (4 with the upcoming DFA), so adding a new one for a specific kind of patterns is not surprising.
  • regexp.MatchString("(?:png|jpg)$") could obviously be rewritten as strings.HasSuffix("png") or strings.HasSuffix("jpg") but sometimes it is not practical because the pattern to be matched is user-supplied or part of a long list of patterns. Examples include interactive log search or lists of paths in HTTP routers.
  • Limited risk due to exhaustive tests in the standard library and additional fuzzing

Feedback would be highly appreciated. Thanks!!

@gopherbot gopherbot added this to the Proposal milestone Aug 16, 2017

@gopherbot gopherbot added the Proposal label Aug 16, 2017

@ianlancetaylor

This comment has been minimized.

Show comment
Hide comment
@ianlancetaylor
Contributor

ianlancetaylor commented Aug 16, 2017

CC @rsc

@cespare

This comment has been minimized.

Show comment
Hide comment
@cespare

cespare Aug 16, 2017

Contributor

It would be interesting to examine a large corpus of Go code (perhaps https://github.com/rsc/corpus), extract all the regular expressions (at least the constant ones), and categorize them as to whether this matcher would apply.

Contributor

cespare commented Aug 16, 2017

It would be interesting to examine a large corpus of Go code (perhaps https://github.com/rsc/corpus), extract all the regular expressions (at least the constant ones), and categorize them as to whether this matcher would apply.

@steren

This comment has been minimized.

Show comment
Hide comment
@steren

steren Aug 16, 2017

I looked in the BigQuery GitHub public dataset for constant regular expressions used with regexp.MatchString:

matcher applies cnt
yes: ^([\w\d\s]+\|?)+$ 206
yes: ^[\w\d\s\.]+$ 47
yes: [\w\d\s\.]+\$ 59
no: ^[^\^\$]+$ 422
no: [\*\+\?] 1182
maybe 548

13% of the time, the matcher applies. 65% of the time, it does not.

Feel free to add comments or suggestions to the query, I can re-run it and update the results here.

steren commented Aug 16, 2017

I looked in the BigQuery GitHub public dataset for constant regular expressions used with regexp.MatchString:

matcher applies cnt
yes: ^([\w\d\s]+\|?)+$ 206
yes: ^[\w\d\s\.]+$ 47
yes: [\w\d\s\.]+\$ 59
no: ^[^\^\$]+$ 422
no: [\*\+\?] 1182
maybe 548

13% of the time, the matcher applies. 65% of the time, it does not.

Feel free to add comments or suggestions to the query, I can re-run it and update the results here.

@sylvinus

This comment has been minimized.

Show comment
Hide comment
@sylvinus

sylvinus Aug 16, 2017

Contributor

Great idea & thanks for the stats!

Regarding BigQuery, I think we should also match calls to (Must)?Compile, as this proposal isn't limited to MatchString (though my prototype implementation currently is). Also, it might be difficult to decide if the matcher applies based only on a regexp, so another technique could be dumping all the patterns to a CSV and then testing them one at a time. Finally, the matcher has a "firstpass" mode of operation where it extracts anchored fixed-length prefixes and suffixes before delegating to the other matchers, so that should be counted separately.

I'm going to update the repository with a more exhaustive list of supported regexps to make things clearer.

Contributor

sylvinus commented Aug 16, 2017

Great idea & thanks for the stats!

Regarding BigQuery, I think we should also match calls to (Must)?Compile, as this proposal isn't limited to MatchString (though my prototype implementation currently is). Also, it might be difficult to decide if the matcher applies based only on a regexp, so another technique could be dumping all the patterns to a CSV and then testing them one at a time. Finally, the matcher has a "firstpass" mode of operation where it extracts anchored fixed-length prefixes and suffixes before delegating to the other matchers, so that should be counted separately.

I'm going to update the repository with a more exhaustive list of supported regexps to make things clearer.

@cznic

This comment has been minimized.

Show comment
Hide comment
@cznic

cznic Aug 16, 2017

Contributor

The pros are:

  • Constant-time matching for many simple regexps

I'm not aware of any theory making that possible. Can you please explain?

Contributor

cznic commented Aug 16, 2017

The pros are:

  • Constant-time matching for many simple regexps

I'm not aware of any theory making that possible. Can you please explain?

@sylvinus

This comment has been minimized.

Show comment
Hide comment
@sylvinus

sylvinus Aug 16, 2017

Contributor

I have updated the repository with a clearer table of supported patterns.

@cznic I meant constant-time like strings.HasSuffix is. I'm not sure of the exact complexity of string slicing in Go, but for practical purposes I have observed it to be constant. I made a chart of the benchmark of regexp.MatchString("a.ab$", strings.Repeat("a", M)+"b") as an example:

Benchmark

Contributor

sylvinus commented Aug 16, 2017

I have updated the repository with a clearer table of supported patterns.

@cznic I meant constant-time like strings.HasSuffix is. I'm not sure of the exact complexity of string slicing in Go, but for practical purposes I have observed it to be constant. I made a chart of the benchmark of regexp.MatchString("a.ab$", strings.Repeat("a", M)+"b") as an example:

Benchmark

@cznic

This comment has been minimized.

Show comment
Hide comment
@cznic

cznic Aug 16, 2017

Contributor

I meant constant-time like strings.HasSuffix is.

I don't think strings.HasSuffix is O(1) either.

I made a chart of the benchmark ...

Nice chart, but the N parameter is used in the wrong place. Actually, the data are for constant N, no wonder it flat-lined.

More importantly, theory needs to talk. Can you refer to a theory enabling/explaining constant time regexp matching?

Don't get me wrong, none of the above means your contribution does not improve performance.

Contributor

cznic commented Aug 16, 2017

I meant constant-time like strings.HasSuffix is.

I don't think strings.HasSuffix is O(1) either.

I made a chart of the benchmark ...

Nice chart, but the N parameter is used in the wrong place. Actually, the data are for constant N, no wonder it flat-lined.

More importantly, theory needs to talk. Can you refer to a theory enabling/explaining constant time regexp matching?

Don't get me wrong, none of the above means your contribution does not improve performance.

@sylvinus

This comment has been minimized.

Show comment
Hide comment
@sylvinus

sylvinus Aug 16, 2017

Contributor

I don't think strings.HasSuffix is O(1) either.

I'm not sure if it is in theory, but I just added a benchmark for it and it does seem constant even with a 1Gb string.

The actual function being used to read the input string for a.ab$ is utf8.DecodeLastRuneInString, I'd be interested to know its theoretical complexity.

Nice chart, but the N parameter is used in the wrong place. Actually, the data are for constant N, no wonder it flat-lined.

Sorry if I missed something. I'm using the strings returned by getString(n int) as input, those do seem to grow and I'm sending the same input to both implementations so I'm not sure there is an error.

I'm happy to change "constant" to something else if it's not actually the case but so far I haven't seen any linear behaviour (relative to the size of the input string) in the benchmarks for these simple patterns.

Contributor

sylvinus commented Aug 16, 2017

I don't think strings.HasSuffix is O(1) either.

I'm not sure if it is in theory, but I just added a benchmark for it and it does seem constant even with a 1Gb string.

The actual function being used to read the input string for a.ab$ is utf8.DecodeLastRuneInString, I'd be interested to know its theoretical complexity.

Nice chart, but the N parameter is used in the wrong place. Actually, the data are for constant N, no wonder it flat-lined.

Sorry if I missed something. I'm using the strings returned by getString(n int) as input, those do seem to grow and I'm sending the same input to both implementations so I'm not sure there is an error.

I'm happy to change "constant" to something else if it's not actually the case but so far I haven't seen any linear behaviour (relative to the size of the input string) in the benchmarks for these simple patterns.

@cznic

This comment has been minimized.

Show comment
Hide comment
@cznic

cznic Aug 16, 2017

Contributor

I'm not sure if it is in theory, but I just added a benchmark for it and it does seem constant even with a 1Gb string.

The N you're looking for is the length of the matched pattern, not the length of the string searched. But you're using a constant length pattern, so you can only get a constant result.

Normally, two parameters should be considered, length of the pattern (let's call it N) and length of the string (M). If the computation can take advantage of knowing the location of the end of the string in O(1) and perform the match from the end, it reduces to considering N only. But I'm not aware of any way to perform the match in O(1) wrt to N only. Therefore, regardless of the complexity wrt M, the overall one cannot be constant time. Known theory says O(NM), which is O(N) for strings.HasSuffix.

Contributor

cznic commented Aug 16, 2017

I'm not sure if it is in theory, but I just added a benchmark for it and it does seem constant even with a 1Gb string.

The N you're looking for is the length of the matched pattern, not the length of the string searched. But you're using a constant length pattern, so you can only get a constant result.

Normally, two parameters should be considered, length of the pattern (let's call it N) and length of the string (M). If the computation can take advantage of knowing the location of the end of the string in O(1) and perform the match from the end, it reduces to considering N only. But I'm not aware of any way to perform the match in O(1) wrt to N only. Therefore, regardless of the complexity wrt M, the overall one cannot be constant time. Known theory says O(NM), which is O(N) for strings.HasSuffix.

@sylvinus

This comment has been minimized.

Show comment
Hide comment
@sylvinus

sylvinus Aug 16, 2017

Contributor

@cznic OK I see what you mean. I did mean "constant" relative to the size of the input string only. I will add that clarification where needed and change N to M to match theory. I'm obviously not claiming it to be constant relative to the size of the pattern itself. Thanks!

Contributor

sylvinus commented Aug 16, 2017

@cznic OK I see what you mean. I did mean "constant" relative to the size of the input string only. I will add that clarification where needed and change N to M to match theory. I'm obviously not claiming it to be constant relative to the size of the pattern itself. Thanks!

@steren

This comment has been minimized.

Show comment
Hide comment
@steren

steren Aug 16, 2017

I used this query to extract constant regular expressions used with MatchString, Compile and MustCompile in the the GitHub public dataset: CSV file

Feel free to run an analysis on it.

steren commented Aug 16, 2017

I used this query to extract constant regular expressions used with MatchString, Compile and MustCompile in the the GitHub public dataset: CSV file

Feel free to run an analysis on it.

@dsnet dsnet added the Performance label Aug 16, 2017

@sylvinus

This comment has been minimized.

Show comment
Hide comment
@sylvinus

sylvinus Aug 16, 2017

Contributor

Thanks a lot @steren! I just added the dump to the tests, here are the stats:

Total occurences                    76341
Invalid                             2545 (3.33%)
Unsupported                         48006 (62.88%)
Supported total                     25790 (33.78%)
 Supported with byPassProgLinear    14714 (19.27%)
 Supported with byPassProgAlternate 450 (0.59%)
 Supported with byPassProgFirstPass 10626 (13.92%)

33% is much higher than I expected!

Contributor

sylvinus commented Aug 16, 2017

Thanks a lot @steren! I just added the dump to the tests, here are the stats:

Total occurences                    76341
Invalid                             2545 (3.33%)
Unsupported                         48006 (62.88%)
Supported total                     25790 (33.78%)
 Supported with byPassProgLinear    14714 (19.27%)
 Supported with byPassProgAlternate 450 (0.59%)
 Supported with byPassProgFirstPass 10626 (13.92%)

33% is much higher than I expected!

@sylvinus

This comment has been minimized.

Show comment
Hide comment
@sylvinus

sylvinus Aug 30, 2017

Contributor

I have updated my repository to add support for all fixed-length unanchored patterns. Support on the GitHub dataset increased from 33.78% to 36.51% with overall performance improvements. It would even go up to 55% if/when we support captures.

I am pretty excited by the fact that on some patterns, this implementation beats all the engines I have been able to benchmark, including Rust which usually performs best. I think implementing both this proposal and a proper DFA would make Go's regexp engine a very strong performer.

Contributor

sylvinus commented Aug 30, 2017

I have updated my repository to add support for all fixed-length unanchored patterns. Support on the GitHub dataset increased from 33.78% to 36.51% with overall performance improvements. It would even go up to 55% if/when we support captures.

I am pretty excited by the fact that on some patterns, this implementation beats all the engines I have been able to benchmark, including Rust which usually performs best. I think implementing both this proposal and a proper DFA would make Go's regexp engine a very strong performer.

@rsc

This comment has been minimized.

Show comment
Hide comment
@rsc

rsc Aug 31, 2017

Contributor

Hi. Thanks for looking into the performance here. I would like to understand better how much of the improved performance can be had by appropriate adjustments to the existing matchers versus how much requires a whole new matcher. Every new matcher adds a cost, and honestly I think we have too many already. I don't really like how overengineered RE2 is, and I don't want Go's regexp package to end up the same way one matcher at a time. So I want to make sure we're doing all we can with existing matchers before adding a new one.

I observe the following about the standard Go regexp package:

  1. It looks like leading or trailing .* (or leading ^.* or trailing .*$) should be stripped from the pattern and handled specially, just like we handle literal prefixes.

  2. It looks like the literal prefix matcher doesn't kick in for anchored patterns, and we should fix that. (See the difference in std between ^xxy and xx in your benchmark.)

  3. It seems clear that we should handle patterns ending in $, like \.(jpe?g|gif)$, by scanning the string backward, which cuts the processing from O(n) to O(1) for an input of size n. That's easy to add to the existing matchers and doesn't need to be limited to fixed-size patterns.

  4. Every top-level .* in the pattern (including the implicit .* inside .+ expanded to ..*) serves as a "cut" in the Prolog sense; when reached it should cause all lower-priority threads to be discarded, and no new start-of-pattern threads should ever be created, the same way those threads are disabled when finding an overall match.

I think those would recover the bulk of the improvements shown in the benchmarks, without a new matcher. If you'd like to look into CLs making those changes, please go ahead.

Contributor

rsc commented Aug 31, 2017

Hi. Thanks for looking into the performance here. I would like to understand better how much of the improved performance can be had by appropriate adjustments to the existing matchers versus how much requires a whole new matcher. Every new matcher adds a cost, and honestly I think we have too many already. I don't really like how overengineered RE2 is, and I don't want Go's regexp package to end up the same way one matcher at a time. So I want to make sure we're doing all we can with existing matchers before adding a new one.

I observe the following about the standard Go regexp package:

  1. It looks like leading or trailing .* (or leading ^.* or trailing .*$) should be stripped from the pattern and handled specially, just like we handle literal prefixes.

  2. It looks like the literal prefix matcher doesn't kick in for anchored patterns, and we should fix that. (See the difference in std between ^xxy and xx in your benchmark.)

  3. It seems clear that we should handle patterns ending in $, like \.(jpe?g|gif)$, by scanning the string backward, which cuts the processing from O(n) to O(1) for an input of size n. That's easy to add to the existing matchers and doesn't need to be limited to fixed-size patterns.

  4. Every top-level .* in the pattern (including the implicit .* inside .+ expanded to ..*) serves as a "cut" in the Prolog sense; when reached it should cause all lower-priority threads to be discarded, and no new start-of-pattern threads should ever be created, the same way those threads are disabled when finding an overall match.

I think those would recover the bulk of the improvements shown in the benchmarks, without a new matcher. If you'd like to look into CLs making those changes, please go ahead.

@sylvinus

This comment has been minimized.

Show comment
Hide comment
@sylvinus

sylvinus Sep 1, 2017

Contributor

Thanks for your feedback @rsc!

I agree it's fair to try improving the existing matchers before considering adding a new one. I still believe fixed-length patterns can benefit from optimizations that will require new matching methods, but let's indeed find out what the minimal amount of new code would be.

It looks like the literal prefix matcher doesn't kick in for anchored patterns, and we should fix that. (See the difference in std between ^xxy and xx in your benchmark.)

I don't see any difference in the benchmark in std between ^xxy and xx, could you explain?

In your points 1 & 4 you mention optimizations for .*. Are they valid only with the DotNL flag or would they involve scanning the input to look for \n characters?

I will try to make a CL for backward matching of anchored patterns first. Should I open a separate issue for it?

I have a couple questions for further CLs:

  • Would it be possible to add new InstOps? For instance I think having a new one similar to OpAnyCharNotNL but with any other character than \n would improve performance because excluding a single character is a frequent use case. [^a]*a could also be done in a single InstOp.
  • Cutting processing time from O(n) to O(1) is always interesting, but how much do you care about reducing the ~100ns/op that seem to be the overhead of re.MatchString even for simple patterns? For instance, would you be open to approaches that bypass the mutex when it's not needed?
  • Would it be worth it to generalize LiteralPrefix and scan the input string for all top-level literals in the pattern in order to quickly dismiss negative matches?

Thanks!

Contributor

sylvinus commented Sep 1, 2017

Thanks for your feedback @rsc!

I agree it's fair to try improving the existing matchers before considering adding a new one. I still believe fixed-length patterns can benefit from optimizations that will require new matching methods, but let's indeed find out what the minimal amount of new code would be.

It looks like the literal prefix matcher doesn't kick in for anchored patterns, and we should fix that. (See the difference in std between ^xxy and xx in your benchmark.)

I don't see any difference in the benchmark in std between ^xxy and xx, could you explain?

In your points 1 & 4 you mention optimizations for .*. Are they valid only with the DotNL flag or would they involve scanning the input to look for \n characters?

I will try to make a CL for backward matching of anchored patterns first. Should I open a separate issue for it?

I have a couple questions for further CLs:

  • Would it be possible to add new InstOps? For instance I think having a new one similar to OpAnyCharNotNL but with any other character than \n would improve performance because excluding a single character is a frequent use case. [^a]*a could also be done in a single InstOp.
  • Cutting processing time from O(n) to O(1) is always interesting, but how much do you care about reducing the ~100ns/op that seem to be the overhead of re.MatchString even for simple patterns? For instance, would you be open to approaches that bypass the mutex when it's not needed?
  • Would it be worth it to generalize LiteralPrefix and scan the input string for all top-level literals in the pattern in order to quickly dismiss negative matches?

Thanks!

@rsc

This comment has been minimized.

Show comment
Hide comment
@rsc

rsc Sep 1, 2017

Contributor

re difference, looks like maybe I misread the benchmark output.

re .*, yes, what I described would only apply when . matches \n. That's kind of sad.

re backward matching, sure feel free to open a separate issue.

re InstOps, it would be nice to avoid them by default, since new InstOps invalidate other code using the Prog representation (that don't know how to handle them), but if they make a big difference it is probably OK.

re LiteralPrefix, the nice thing about the prefix is that it tells you where to start too, so the scanning is not wasted effort. Scanning for the other required literals might be OK but they're harder to extract from the Prog, and I try to avoid doing analyses directly on the Regexp beacuse of all the syntax complications. Worth trying though.

Contributor

rsc commented Sep 1, 2017

re difference, looks like maybe I misread the benchmark output.

re .*, yes, what I described would only apply when . matches \n. That's kind of sad.

re backward matching, sure feel free to open a separate issue.

re InstOps, it would be nice to avoid them by default, since new InstOps invalidate other code using the Prog representation (that don't know how to handle them), but if they make a big difference it is probably OK.

re LiteralPrefix, the nice thing about the prefix is that it tells you where to start too, so the scanning is not wasted effort. Scanning for the other required literals might be OK but they're harder to extract from the Prog, and I try to avoid doing analyses directly on the Regexp beacuse of all the syntax complications. Worth trying though.

@rsc

This comment has been minimized.

Show comment
Hide comment
@rsc

rsc Oct 16, 2017

Contributor

On hold for response (no hurry).

Contributor

rsc commented Oct 16, 2017

On hold for response (no hurry).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment