-
Notifications
You must be signed in to change notification settings - Fork 8.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUGFIX] FastRegexpMatcher: do Unicode normalization as part of case-insensitive comparison #14170
[BUGFIX] FastRegexpMatcher: do Unicode normalization as part of case-insensitive comparison #14170
Conversation
29071e2
to
c7d1f2f
Compare
model/labels/regexp.go
Outdated
@@ -767,7 +768,7 @@ type equalMultiStringMapMatcher struct { | |||
|
|||
func (m *equalMultiStringMapMatcher) add(s string) { | |||
if !m.caseSensitive { | |||
s = strings.ToLower(s) | |||
s = strings.ToLower(norm.NFKD.String(s)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now thank you for nerd-sniping me on this.
I tried reading the docs and even though I'd have chosen NFKC
for this, I ran the tests in your branch, especially with the example character from the docs ẛ̣
and I couldn't tell any difference (in test results) between both normalizations, so LGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL
Should we credit @kushalShukla-web with the test case? I'm not sure where that came from but it seems the same. |
The test case was defined in the issue. |
Just want to hold off on merging this while we look at benchmarks. |
I've run few selected benchmarks, and there's quite an impactful performance impact:
I'm experimenting with an optimization... |
I think we can skip the normalization if the string only has ascii chars. The check for ascii chars is already done by This reverts the performance impact for the common case of a string only having ASCII chars:
In case this change is accepted, we should unit test |
@pracucci |
Hey, I agree that we should optimize for the ascii case, which is the most common one in the wild.
I think we can start with a middle ground between copying the entire func toLowerNormalized(s string) s {
if allAscii(s) {
return strings.ToLower()
}
return strings.ToLower(norm.NFKD.String(s))
} Use it and benchmark to see if it's worth copying the entire |
@colega
Therefore, rather than employing an There remains a question about whether it's feasible to combine normalization and case conversion into a single, more optimized loop. What are your thoughts on this? @colega @pracucci @bboreham |
9eabd8f
to
1a6e7e4
Compare
I've made a modification based on @pracucci suggestions. Instead of iterating through the string twice to check for uppercase characters and then converting them to lowercase for ASCII strings, we can now convert all uppercase characters to lowercase regardless of whether the string is ASCII or not. I conducted a benchmark test, and the results were quite intriguing. I kindly request @pracucci to rerun the benchmark for re-validation.
@colega @bboreham @pracucci Could I kindly ask you to spare some time for another review? |
} | ||
|
||
// Normalise and convert to lower. | ||
return strings.Map(unicode.ToLower, norm.NFKD.String(b.String())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can abort looping above once you find the first non-ascii char, and just do this here:
return strings.Map(unicode.ToLower, norm.NFKD.String(b.String())) | |
return strings.Map(unicode.ToLower, norm.NFKD.String(s)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commited new changes. I've included strings.Map(unicode.ToLower, norm.NFKD.String(b.String())) to convert most uppercase characters to lowercase until we encounter a non-ASCII character in the string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But why build and allocate a new string if we aren't going to need it? IMO, we should optimize for ascii only, and then optimize for lowercase only.
If you allow me a suggestion, I'd write the method like this:
// toNormalisedLower normalise the input string using "Unicode Normalization Form D" and then convert it to lower case.
// This method is optimized for ASCII-only strings.
func toNormalisedLower(s string) string {
// Check if the string is all ASCII chars and convert any upper case character to lower case character.
isASCII := true
hasUpper := false
for _, c := range s {
hasUpper = hasUpper || ('A' <= c && c <= 'Z')
if c >= utf8.RuneSelf {
isASCII = false
break
}
}
if !isASCII {
return strings.Map(unicode.ToLower, norm.NFKD.String(s))
}
if !hasUpper {
return s
}
var (
b strings.Builder
pos int
)
for i := 0; i < len(s); i++ {
c := s[i]
if 'A' <= c && c <= 'Z' {
if pos < i {
b.WriteString(s[pos:i])
}
c += 'a' - 'A'
b.WriteByte(c)
pos = i + 1
}
}
if pos < len(s) {
b.WriteString(s[pos:])
}
return b.String()
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My concern was regarding the approach of iterating through a string twice to check for uppercase characters and then converting them to lowercase, especially for ASCII strings where this process might seem redundant. Instead, I proposed a single loop solution to convert all uppercase characters to lowercase, regardless of whether the string is ASCII or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was assuming that unnecessarily allocating a new string (when everything is lowercase) is more expensive than looping through an ascii string twice, but of course that depends on the string length and we can only talk if we run the benchmarks.
I proposed a single loop solution to convert all uppercase characters to lowercase, regardless of whether the string is ASCII or not.
Did you mean regardless of whether the string has uppercase or not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can have toNormalisedLower
take a buffer (say 2KB) which we allocate on the stack of the caller, thus removing that allocation cost when the string is not already lowercase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was referring to situations in which the string contains at least one uppercase character.
we can only talk if we run the benchmarks.
True
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can have
toNormalisedLower
take a buffer (say 2KB) which we allocate on the stack of the caller, thus removing that allocation cost when the string is not already lowercase.
Could you please clarify if you are referring to passing the input string as a buffer to toNormalisedLower
?
Apologies, but I'm having difficulty understanding your explanation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, we should commit to something working and not terrible at performance, and then we can nerd snipe each other with alternative implementations of this method, benchmarking specifically this method (instead of the entire matcher).
@Ranveer777 what @bboreham is suggesting is to define an array, say var arr byte[2048]
and create a slice using that array out := arr[:0]
, and build the string on that slice as we proceed, something like:
func toNormalisedLower(s string) string {
var (
arr [2024]byte
b = bytes.NewBuffer(arr[:0])
pos int
)
hasUpper := false
for i := 0; i < len(s); i++ {
c := s[i]
if c >= utf8.RuneSelf {
return strings.Map(unicode.ToLower, norm.NFKD.String(s))
}
if 'A' <= c && c <= 'Z' {
hasUpper = true
if pos < i {
b.WriteString(s[pos:i])
}
c += 'a' - 'A'
b.WriteByte(c)
pos = i + 1
}
}
if !hasUpper {
return s
}
if pos < len(s) {
b.WriteString(s[pos:])
}
return b.String()
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Ranveer777 what @bboreham is suggesting is to define an array, say
var arr byte[2048]
and create a slice using that arrayout := arr[:0]
, and build the string on that slice as we proceed, something like:
Thank you for clarifying. I initially perceived it as a suggestion for commited implementation.
IMO, we should commit to something working and not terrible at performance, and then we can nerd snipe each other with alternative implementations of this method, benchmarking specifically this method (instead of the entire matcher).
👍
ba92b80
to
27b27e1
Compare
model/labels/regexp_test.go
Outdated
toNormalisedLowerTestCases = map[string]string{ | ||
"foo": "foo", | ||
"AAAAAAAAAAAAAAAAAAAAAAAA": "aaaaaaaaaaaaaaaaaaaaaaaa", | ||
"cccccccccccccccccccccccC": "cccccccccccccccccccccccc", | ||
"ſſſſſſſſſſſſſſſſſſſſſſſſS": "sssssssssssssssssssssssss", | ||
"ſſAſſa": "ssassa", | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only used in TestToNormalisedLower
, I would move the definition there to make the test easier to read & run.
Utilizing the benchmark provided below to assess all alternative implementations.
@colega @bboreham @pracucci |
I don't like the idea of randomness on the input of benchmarks. I'd just different inputs with the combinations of:
I.e., I would do this: func BenchmarkToNormalizedLower(b *testing.B) {
benchCase := func(l int, uppercase string, asciiOnly bool, alt int) string {
chars := "abcdefghijklmnopqrstuvwxyz"
if !asciiOnly {
chars = "aаbбcвdгeдfеgёhжiзjиkйlкmлnмoнpоqпrрsсtтuуvфwхxцyчzш"
}
// Swap the alphabet to make alternatives.
chars = chars[alt%len(chars):] + chars[:alt%len(chars)]
str := strings.Repeat(chars, l/len(chars)+1)[:l]
switch uppercase {
case "first":
return strings.ToUpper(str[:1]) + str[1:]
case "last":
return str[:len(str)-1] + strings.ToUpper(str[len(str)-1:])
case "all":
return strings.ToUpper(str)
case "none":
return str
default:
panic("invalid uppercase")
}
}
for _, l := range []int{10, 100, 1000, 4000} {
b.Run(fmt.Sprintf("length=%d", l), func(b *testing.B) {
for _, uppercase := range []string{"none", "first", "last", "all"} {
b.Run("uppercase="+uppercase, func(b *testing.B) {
for _, asciiOnly := range []bool{true, false} {
b.Run(fmt.Sprintf("ascii=%t", asciiOnly), func(b *testing.B) {
inputs := make([]string, 10)
for i := range inputs {
inputs[i] = benchCase(l, uppercase, asciiOnly, i)
}
b.ResetTimer()
for n := 0; n < b.N; n++ {
toNormalisedLower(inputs[n%len(inputs)])
}
})
}
})
}
})
}
} |
Looks good to me @colega.
In the event of randomness, we would have been unable to verify the all aforementioned cases. |
I conducted a benchmark test on the implementation committed to the branch and compared it with the benchmark tests of implementations suggested by @colega and @pracucci.
|
Thank you, @Ranveer777, I think both results look worse than the current one (BTW, you can compare all of them if you do |
Thanks for all the work. I think we can merge the improvement so far and start another PR if there are further ideas. However a recent update to dependencies has created a merge conflict, which I am not confident to resolve in the GitHub UI. |
27b27e1
to
885fee4
Compare
@bboreham |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I would like to see it optimised further, but let's get the bugfix in and move from there.
The test failed on Windows build seems to be unrelated |
Signed-off-by: RA <ranveeravhad777@gmail.com>
Signed-off-by: RA <ranveeravhad777@gmail.com>
Signed-off-by: RA <ranveeravhad777@gmail.com>
…nversion 1) For ASCII strings: The method converts the input string from upper to lower case. 2) For Non-ASCII strings: The method normalizes the input string using 'Unicode Normalization Form D' and then converts it to lower case. Signed-off-by: RA <ranveeravhad777@gmail.com>
Signed-off-by: RA <ranveeravhad777@gmail.com>
Signed-off-by: RA <ranveeravhad777@gmail.com>
Signed-off-by: RA <ranveeravhad777@gmail.com>
ad6cb5a
to
820d35b
Compare
It appears that Windows build has passed successfully. |
Fixes #14066
By normalizing the input string to a standardized form during addition and matching operations, we ensure that our system operates in an optimized manner, meeting our use case requirements effectively.
@colega @bboreham