Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

regexp: case-insensitive MatchString performance #13288

Closed
egonelbre opened this Issue Nov 17, 2015 · 15 comments

Comments

Projects
None yet
8 participants
@egonelbre
Copy link
Contributor

egonelbre commented Nov 17, 2015

I was investigating why https://github.com/dimroc/etl-language-comparison/blob/517507b033dbb938dd1e83401914cbd4dd9a79bc/golang/search.go#L160 ends up significantly slower with regular expression.

strings.Contains(strings.ToLower(message), "knicks") // 7.8s
// versus
rx := regexp.MustCompile("(?i)knicks") // won't be done multiple times
rx.MatchString(message) // 21.7s

Profiling the etl example lead me to:

152.84s of 158.97s total (96.14%)
Dropped 167 nodes (cum <= 0.79s)
      flat  flat%   sum%        cum   cum%
    61.12s 38.45% 38.45%     61.25s 38.53%  unicode.SimpleFold
    22.62s 14.23% 52.68%    121.64s 76.52%  regexp.(*machine).tryBacktrack
    18.76s 11.80% 64.48%     18.76s 11.80%  regexp.(*bitState).push
    10.41s  6.55% 71.03%     71.66s 45.08%  regexp/syntax.(*Inst).MatchRunePos
     8.99s  5.66% 76.68%      9.41s  5.92%  regexp.(*inputString).step
     8.94s  5.62% 82.30%    135.55s 85.27%  regexp.(*machine).backtrack
     5.76s  3.62% 85.93%      5.76s  3.62%  runtime.osyield
     3.48s  2.19% 88.12%     75.14s 47.27%  regexp/syntax.(*Inst).MatchRune

Which lead me to: https://github.com/golang/go/blob/master/src/regexp/syntax/prog.go#L213

I assume that there is a good reason for using unicode.SimpleFold, i.e. some special unicode symbols. When I changed it:

for r1 := unicode.SimpleFold(r0); r1 != r0; r1 = unicode.SimpleFold(r1) {
// into
if r1 := unicode.ToLower(r0); r1 == r || r1 == unicode.ToLower(r) {

It didn't break any tests... also I created easy0i = "(?i)ABCDEFGHIJKLMNOPQRSTUVWXYZ$" in https://github.com/golang/go/blob/master/src/regexp/exec_test.go#L674 to measure the performance change:

// old
BenchmarkMatchEasy0i_32-8                 500000              2860 ns/op          11.19 MB/s
BenchmarkMatchEasy0i_1K-8                  20000             84204 ns/op          12.16 MB/s
BenchmarkMatchEasy0i_32K-8                   500           3346191 ns/op           9.79 MB/s
BenchmarkMatchEasy0i_1M-8                     10         107206130 ns/op           9.78 MB/s
BenchmarkMatchEasy0i_32M-8                     1        3424195900 ns/op           9.80 MB/s
// new
BenchmarkMatchEasy0i_32-8                1000000              1831 ns/op          17.48 MB/s
BenchmarkMatchEasy0i_1K-8                  30000             56369 ns/op          18.17 MB/s
BenchmarkMatchEasy0i_32K-8                  1000           2276130 ns/op          14.40 MB/s
BenchmarkMatchEasy0i_1M-8                     20          73004175 ns/op          14.36 MB/s
BenchmarkMatchEasy0i_32M-8                     1        2334133500 ns/op          14.38 MB/s

So, is there a reason why unicode.SimpleFold is used instead of something else, such as unicode.ToLower? Either way, there might be some significant performance gains here for case-insensitive matches.

@egonelbre

This comment has been minimized.

Copy link
Contributor Author

egonelbre commented Nov 17, 2015

unicode.SimpleFold does a binary search over the caseOrbit table, every letter.

Introducing a LUT for runes < 0x80, helped the Easy0i case.

var foldLUT [0x80]uint16

func SimpleFold(r rune) rune {
    if r < len(foldLUT) {
        return rune(foldLUT[r])
    }
    // ...

Results:

BenchmarkMatchEasy0i_32-8        1000000              1796 ns/op          17.82 MB/s
BenchmarkMatchEasy0i_1K-8          30000             50636 ns/op          20.22 MB/s
BenchmarkMatchEasy0i_32K-8          1000           2326133 ns/op          14.09 MB/s
BenchmarkMatchEasy0i_1M-8             20          74604270 ns/op          14.06 MB/s
BenchmarkMatchEasy0i_32M-8             1        2373135700 ns/op          14.14 MB/s

@ianlancetaylor ianlancetaylor added this to the Unplanned milestone Nov 17, 2015

@bradfitz

This comment has been minimized.

Copy link
Member

bradfitz commented Nov 17, 2015

Nice. Use https://godoc.org/golang.org/x/tools/cmd/benchcmp to show the before & after numbers.

Send in a change? https://golang.org/doc/contribute.html

It's probably too late for Go 1.6, though, unless you send it soon and it addresses an existing bug or regression.

@egonelbre

This comment has been minimized.

Copy link
Contributor Author

egonelbre commented Nov 17, 2015

Sure, I can make CL for the unicode.SimpleFold optimization.

I'm not sure whether regexp part can or should be adjusted, to not use SimpleFold?

@bradfitz

This comment has been minimized.

Copy link
Member

bradfitz commented Nov 17, 2015

I don't think you can change SimpleFold to ToLower in regexp. I suspect that would break case-insensitive regexps over characters like "kKK" (little k, big k, Kelvin symbol)? But no, ToLower of kelvin symbol is lowercase 'k'.... http://play.golang.org/p/pqhR4ksOk_ But maybe there are other characters which SimpleFold to each other but don't map to the same ToLower character? You could brute force all the characters (or characters in the defined ranges) and see?

@egonelbre

This comment has been minimized.

Copy link
Contributor Author

egonelbre commented Nov 17, 2015

@gopherbot

This comment has been minimized.

Copy link

gopherbot commented Nov 17, 2015

CL https://golang.org/cl/16943 mentions this issue.

@egonelbre

This comment has been minimized.

Copy link
Contributor Author

egonelbre commented Nov 17, 2015

Did the brute-force, it seems that there are multiple such cases: https://play.golang.org/p/D1_JuZYP0v. So yes, unicode.ToLower wouldn't be suitable. Then again, it would be possible to use regexp/syntax.minFoldRune (after making it single lookup).

@bradfitz bradfitz modified the milestones: Go1.7, Unplanned Feb 1, 2016

mk0x9 pushed a commit to mk0x9/go that referenced this issue Apr 26, 2016

unicode: improve SimpleFold performance for ascii
This change significantly speeds up case-insensitive regexp matching.

benchmark                      old ns/op      new ns/op      delta
BenchmarkMatchEasy0i_32-8      2690           1473           -45.24%
BenchmarkMatchEasy0i_1K-8      80404          42269          -47.43%
BenchmarkMatchEasy0i_32K-8     3272187        2076118        -36.55%
BenchmarkMatchEasy0i_1M-8      104805990      66503805       -36.55%
BenchmarkMatchEasy0i_32M-8     3360192200     2126121600     -36.73%

benchmark                      old MB/s     new MB/s     speedup
BenchmarkMatchEasy0i_32-8      11.90        21.72        1.83x
BenchmarkMatchEasy0i_1K-8      12.74        24.23        1.90x
BenchmarkMatchEasy0i_32K-8     10.01        15.78        1.58x
BenchmarkMatchEasy0i_1M-8      10.00        15.77        1.58x
BenchmarkMatchEasy0i_32M-8     9.99         15.78        1.58x

Issue golang#13288

Change-Id: I94af7bb29e75d60b4f6ee760124867ab271b9642
Reviewed-on: https://go-review.googlesource.com/16943
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>

@bradfitz bradfitz modified the milestones: Unplanned, Go1.7 Apr 26, 2016

@bradfitz

This comment has been minimized.

Copy link
Member

bradfitz commented Apr 26, 2016

Probably enough for Go 1.7. Leaving this open, though, if you were planning on doing more later.

@aprice

This comment has been minimized.

Copy link

aprice commented Jul 14, 2016

Go 1.7 is looming and I don't see this mentioned anywhere in the release notes and no update on this ticket - is this being rolled into the 1.7 release? Regex is an unfortunate pain point for Go performance compared to other languages.

@cespare

This comment has been minimized.

Copy link
Contributor

cespare commented Jul 14, 2016

@aprice @egonelbre's improvements (https://go-review.googlesource.com/#/c/16943/) were merged and will be part of Go 1.7. (Localized performance improvements are typically not called out in the release notes.)

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

ianlancetaylor commented Jul 14, 2016

A change was submitted to speed up unicode.SimpleFold performance for ASCII, which in turns speeds up case-insensitive regexp performance for ASCII. That change is not significant enough to mention in the release notes. This issue remains open for improved case-folding performance for non-ASCII, if anybody wants to work on it.

@kilyo

This comment has been minimized.

Copy link

kilyo commented Jul 14, 2016

I have moved my question to the forum, as requested.
https://forum.golangbridge.org/t/unicode-simplefold-implementataion-in-go-1-7/2999

@bradfitz

This comment has been minimized.

Copy link
Member

bradfitz commented Jul 14, 2016

Let's move general code/style questions to a mailing list and not clutter this bug. I'm happy to answer on a mailing list.

@egonelbre

This comment has been minimized.

Copy link
Contributor Author

egonelbre commented Jul 14, 2016

When making the easy fix, the better solution would have been to avoid calling SimpleFold and use minFoldRune with a LUT to make less lookups. Also, maybe, convert the case-insensitive part of regexp to minFold when compiling the regexp. However, I don't think I would be able to create the time to make such change.

@rsc

This comment has been minimized.

Copy link
Contributor

rsc commented Oct 2, 2018

You can't simply use strings.ToLower on "knicks" because that will fail to match the "k" against the Unicode Kelvin symbol U+212A (K). The extra overhead here compared to strings.ToLower is exactly to add that symbol.

@rsc rsc closed this Oct 2, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.