-
Notifications
You must be signed in to change notification settings - Fork 17.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
proposal: unicode: improvement of rune-checking funcs #68064
Comments
Similar Issues
(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.) |
Can you show us the doc comments for the three functions that you are proposing? It's not obvious what they do. |
Doc comment
The three do the same but with different input types. A possible doc comment for
I initially considered the name More on motivation using the previous exampleNote that it is perfectly fine to use for i, c := range s {
if IndexRune(chars, c) >= 0 { // IndexRune("([{<《(", c)
return i
}
} Then Having said all that, it is no longer clear what's going to be the If we had the chance to cache the strategy of what needs to be searched, then that could be reused. In the shared repo, I'm proposing a solution that provides constant time when checking any rune. So calling |
Another option:
(Thinking that we provide a "RuneSet" to match against) func IsInSet[S interface{ []rune | string | []byte }](s S) func(rune) bool {
swtich ss := any(s).(type){
case []rune:
// ...
}
} |
This sounds like something that might be appropriate for a third party library outside of the standard library. https://go.dev/doc/faq#x_in_std |
I'm thinking of these functions as a way to compile a set of runes to be later matched, much like a regular expression, but for runes. The performance advantage is huge since this is done everywhere, and given that rune matching is cached and you can reuse that in parsing loops. Sure, regular expressions are an infinitely more complex case, but consider that even in the standard library there are many hot paths in parsers using functions like I found that the bitmask approach even performs better than a func checking with a func isToken(s string) bool {
if s == "" {
return false
}
return strings.IndexFunc(s, isNotTokenChar) < 0
}
func isNotTokenChar(r rune) bool {
return !isTokenChar(r)
}
func isTokenChar(r rune) bool {
// token := 1*<any (US-ASCII) CHAR except SPACE, CTLs,
// or tspecials>
return r > 0x20 && r < 0x7f && !isTSpecial(r)
}
func isTSpecial(r rune) bool {
return strings.ContainsRune(`()<>@,;:\"/[]?=`, r)
} Now consider this alternative: var (
isTSpecialChars = `()<>@,;:\"/[]?=`
isTSpecial = whatever.IsInStringSet(isTSpecialChars)
isTokenChar func(rune) bool
)
func init(){
runes := make([]rune, 0, 0x7f - 0x21 - len(isTSpecialChars))
for r := rune(0x21); r < 0x7f; r++ {
if !isTSpecial(r) {
runes = append(runes, r)
}
}
isTokenChar = whatever.IsInRunesSet(runes)
}
func isNotTokenChar(r rune) bool {
return !isTokenChar(r)
} Note that all these function checks are done in constant time in the approach I'm proposing in the linked repo, as opposed to Benchmark stuff
goos: linux
goarch: amd64
pkg: test/mime-test
cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
│ old.txt │ new.txt │
│ sec/op │ sec/op vs base │
IsToken-8 1108.0n ± 1% 574.9n ± 0% -48.11% (p=0.000 n=10) The code package main
import (
"strings"
"testing"
)
// original
func isToken(s string) bool {
if s == "" {
return false
}
return strings.IndexFunc(s, isNotTokenChar) < 0
}
func isNotTokenChar(r rune) bool {
return !isTokenChar(r)
}
func isTokenChar(r rune) bool {
return r > 0x20 && r < 0x7f && !isTSpecial(r)
}
func isTSpecial(r rune) bool {
return strings.ContainsRune(`()<>@,;:\"/[]?=`, r)
}
// changed
var (
isTSpecialChars = `()<>@,;:\"/[]?=`
isTSpecial2 = IsStringFunc(isTSpecialChars)
isTokenChar2 func(rune) bool
)
func init() {
runes := make([]rune, 0, 0x7f-0x21-len(isTSpecialChars))
for r := rune(0x21); r < 0x7f; r++ {
if !isTSpecial(r) {
runes = append(runes, r)
}
}
isTokenChar2 = IsRunesFunc(runes)
}
func isNotTokenChar2(r rune) bool {
return !isTokenChar2(r)
}
func isToken2(s string) bool {
if s == "" {
return false
}
return strings.IndexFunc(s, isNotTokenChar2) < 0
}
// bench stuff
var mediaTypes = []string{
"noslash noslash @",
"foo bar/baz foo bar/baz @",
"foo/bar baz foo/bar baz @",
"attachment attachment @",
"attachment attachment @",
"attachment attachment @",
"foo/BAR foo/BAR @",
"foo/BAR foo/BAR @",
"foo/BAR foo/BAR @",
"foo/BAR foo/BAR @",
"foo/BAR foo/BAR @",
"foo/BAR foo/BAR @",
"foo/BAR foo/BAR @",
"foo/BAR foo/BAR @",
"foo/BAR foo/BAR @",
"foo/BAR foo/BAR @",
"foo/bar foo/bar @",
"foo/bar foo/bar @",
"foo foo @",
"noslash",
"foo bar/baz",
"foo/bar baz",
"attachment",
"attachment",
"attachment",
"foo/BAR",
"foo/BAR",
"foo/BAR",
"foo/BAR",
"foo/BAR",
"foo/BAR",
"foo/BAR",
"foo/BAR",
"foo/BAR",
"foo/BAR",
"foo/bar",
"foo/bar",
"foo",
}
func BenchmarkIsToken(b *testing.B) {
for i := 0; i < b.N; i++ {
for _, mt := range mediaTypes {
isToken(mt)
}
}
}
func BenchmarkIsToken2(b *testing.B) {
for i := 0; i < b.N; i++ {
for _, mt := range mediaTypes {
isToken2(mt)
}
}
} This is also not something uncommon to find in the wild, I think Examples in standard library where this would help
This is not a final list, there are some other cases as well.
|
Proposal Details
I was writing a text parser and came across the need to check some runes repeatedly. I also used a lot functions like
utf8.ValidRune
,unicode.IsSpace
, etc. I found some of them were slower compared to some versions of other rune-checker funcs I wrote for the parser, so I setup a repo to make a PoC with my findings. I also found that usingunicode.RangeTable
is for a very specific use cases, and is not very ergonomic for many mundane use cases. It's also true that I found myself callingContains
andIndex
family of functions instrings
andbytes
packages, but many times with the same "needles" (things looked up) for many different "haystacks" (where we looked them up). In some cases, these functions will create an internal, temporary data structure to aid with the lookup, but they are discarded and cannot be reused. It would be great if they could be reused somehow.(aside: I don't want to bother with the needle/haystack terminology, feel free to point me to more Go-ish alternatives).
The proposal can be divided in two items:
func(rune) bool
functions so they can be reused many times. Bikeshedding starter:unicode.IsBytesFunc([]byte) func(rune) bool
unicode.IsRunesFunc([]rune) func(rune) bool
unicode.IsStringFunc(string) func(rune) bool
unicode
,utf8
,strings
andbytes
packages, especially when there are runes >utf8.RuneSelf
. I would like to contribute to the project, so let me know if these are changes that are wanted in the standard library and to what extent should they be implemented (there might be other things to consider that I'm missing to strike the right balance for the project). If so, I can go through the contribution guides and do all that stuff once the "what" is agreed, we can discuss starting with code in the PoC repo (which can have some cleanup for sure).If both items cannot be attended in this same issue, I would prefer that number 1 is addressed here, as that is more of a proposal, and number 2 sounds more like an improvement that doesn't change API and can probably be addressed in other place (and please, could you kindly point me where to do it).
benchstat results for
unicode
andutf8
improvementsThe text was updated successfully, but these errors were encountered: