[BUGFIX] FastRegexpMatcher: do Unicode normalization as part of case-insensitive comparison #14170

Ranveer777 · 2024-05-30T23:38:02Z

This PR normalizes each string to a standard Unicode form both when it is added to equalMultiStringMapMatcher and when it is matched, so that case-insensitive comparison treats equivalent Unicode characters consistently.

@colega @bboreham
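For context, here is a minimal standard-library sketch of the kind of mismatch this change targets. The names `foldRe`, `regexMatches` and `lookupKey` are made up for illustration and are not taken from regexp.go. It assumes the matcher keys its map on the lower-cased string, as the diff below suggests.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// foldRe is a case-insensitive pattern for the single letter "s".
var foldRe = regexp.MustCompile(`^(?i:s)$`)

// regexMatches reports what the plain regexp engine says about s.
func regexMatches(s string) bool { return foldRe.MatchString(s) }

// lookupKey mimics lower-casing a value before a case-insensitive map lookup.
func lookupKey(s string) string { return strings.ToLower(s) }

func main() {
	const longS = "ſ" // U+017F LATIN SMALL LETTER LONG S

	// The regexp engine folds s, S and ſ together.
	fmt.Println(regexMatches(longS))
	// strings.ToLower leaves ſ unchanged, so the two keys differ.
	fmt.Println(lookupKey(longS) == lookupKey("s"))
}
```

With lower-casing alone, a map-based matcher would disagree with the regexp engine for this input, which is the gap the normalization step is meant to close.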

colega · 2024-05-31T07:30:16Z

model/labels/regexp.go

@@ -767,7 +768,7 @@ type equalMultiStringMapMatcher struct {

 func (m *equalMultiStringMapMatcher) add(s string) {
 	if !m.caseSensitive {
-		s = strings.ToLower(s)
+		s = strings.ToLower(norm.NFKD.String(s))


Now thank you for nerd-sniping me on this.

I tried reading the docs and even though I'd have chosen NFKC for this, I ran the tests in your branch, especially with the example character from the docs ẛ̣ and I couldn't tell any difference (in test results) between both normalizations, so LGTM.

bboreham

TIL

bboreham · 2024-05-31T09:14:43Z

Should we credit @kushalShukla-web with the test case? I'm not sure where that came from but it seems the same.

colega · 2024-05-31T09:30:55Z

The test case was defined in the issue.

bboreham · 2024-05-31T09:56:19Z

Just want to hold off on merging this while we look at benchmarks.

pracucci · 2024-05-31T09:59:03Z

I've run a few selected benchmarks, and there's quite a noticeable performance impact:

                                                     │ before.txt  │             after.txt              │
                                                     │   sec/op    │   sec/op     vs base               │
FastRegexMatcher/(?i:foo)-12                           115.5n ± 2%   111.7n ± 2%   -3.33% (p=0.004 n=6)
FastRegexMatcher/(?i:(foo|bar))-12                     239.8n ± 4%   233.4n ± 8%        ~ (p=0.288 n=6)
FastRegexMatcher/(?i:(foo1|foo2|bar))-12               431.2n ± 3%   416.6n ± 2%   -3.37% (p=0.002 n=6)
FastRegexMatcher/^(?i:foo|oo)|(bar)$-12                1.050µ ± 2%   1.042µ ± 0%        ~ (p=0.284 n=6)
FastRegexMatcher/(?i:(foo1|foo2|aaa|bbb|ccc|ddd|e-12   1.994µ ± 2%   2.438µ ± 1%  +22.27% (p=0.002 n=6)
FastRegexMatcher/(?i:(zQPbMkNO|NNSPdvMi|iWuuSoAl|-12   1.998µ ± 5%   2.457µ ± 1%  +22.95% (p=0.002 n=6)
FastRegexMatcher/(?i:(zQPbMkNO.*|NNSPdvMi.*|iWuuS-12   10.06µ ± 2%   10.16µ ± 7%        ~ (p=0.240 n=6)
FastRegexMatcher/(?i:(.*zQPbMkNO|.*NNSPdvMi|.*iWu-12   12.02µ ± 2%   11.70µ ± 4%   -2.64% (p=0.041 n=6)

I'm experimenting with an optimization...

pracucci · 2024-05-31T10:10:52Z

I'm experimenting with an optimization...

I think we can skip the normalization if the string only has ascii chars. The check for ascii chars is already done by strings.ToLower(), so if we hold our nose and copy-paste strings.ToLower() implementation, we can merge it with the normalization:
55dc273

This reverts the performance impact for the common case of a string only having ASCII chars:

goos: darwin
goarch: amd64
pkg: github.com/prometheus/prometheus/model/labels
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
                                                     │ before.txt  │        after-optimized.txt        │
                                                     │   sec/op    │   sec/op     vs base              │
FastRegexMatcher/(?i:foo)-12                           115.5n ± 2%   112.3n ± 1%  -2.77% (p=0.009 n=6)
FastRegexMatcher/(?i:(foo|bar))-12                     239.8n ± 4%   248.6n ± 3%  +3.65% (p=0.037 n=6)
FastRegexMatcher/(?i:(foo1|foo2|bar))-12               431.2n ± 3%   437.8n ± 2%  +1.53% (p=0.039 n=6)
FastRegexMatcher/^(?i:foo|oo)|(bar)$-12                1.050µ ± 2%   1.073µ ± 2%  +2.14% (p=0.026 n=6)
FastRegexMatcher/(?i:(foo1|foo2|aaa|bbb|ccc|ddd|e-12   1.994µ ± 2%   1.936µ ± 1%  -2.88% (p=0.002 n=6)
FastRegexMatcher/(?i:(zQPbMkNO|NNSPdvMi|iWuuSoAl|-12   1.998µ ± 5%   1.963µ ± 4%  -1.75% (p=0.041 n=6)
FastRegexMatcher/(?i:(zQPbMkNO.*|NNSPdvMi.*|iWuuS-12   10.06µ ± 2%   10.04µ ± 3%       ~ (p=0.699 n=6)
FastRegexMatcher/(?i:(.*zQPbMkNO|.*NNSPdvMi|.*iWu-12   12.02µ ± 2%   12.15µ ± 4%       ~ (p=0.310 n=6)
geomean                                                1.252µ        1.253µ       +0.08%

If this change is accepted, we should unit test toNormalisedLower() (at least a few smoke tests).

Ranveer777 · 2024-05-31T17:15:15Z

@pracucci
Interesting approach. This will allow us to restrict the performance impact to only non-ascii characters.

Ranveer777 · 2024-06-01T08:15:12Z

@colega @bboreham
I wanted to get your thoughts on the @pracucci approach.
Also, what are the next steps?

colega · 2024-06-01T08:41:00Z

I wanted to get your thoughts on the @pracucci approach.

Hey, I agree that we should optimize for the ascii case, which is the most common one in the wild.

Also, what are the next steps?

I think we can start with a middle ground rather than copying the entire strings.ToLower, and just define:

func toLowerNormalized(s string) string {
    if allAscii(s) {
        return strings.ToLower(s)
    }
    return strings.ToLower(norm.NFKD.String(s))
}

Use it and benchmark to see if it's worth copying the entire strings.ToLower() as @pracucci suggests.
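A self-contained sketch of that middle ground, for anyone who wants to try it without pulling in golang.org/x/text. Two things here are assumptions rather than code from this PR: `allAscii` is a hypothetical helper (it is not defined anywhere in the thread), and the `normalize` parameter stands in for `norm.NFKD.String`. The toy `foldLongS` only rewrites U+017F and is not a real normalizer.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// allAscii reports whether s consists only of ASCII bytes (hypothetical helper).
func allAscii(s string) bool {
	for i := 0; i < len(s); i++ {
		if s[i] >= utf8.RuneSelf {
			return false
		}
	}
	return true
}

// toLowerNormalized lower-cases s, normalizing first only when s has non-ASCII bytes.
// normalize is a stand-in for norm.NFKD.String.
func toLowerNormalized(s string, normalize func(string) string) string {
	if allAscii(s) {
		return strings.ToLower(s)
	}
	return strings.ToLower(normalize(s))
}

// foldLongS is a toy normalizer: it only maps the long s (U+017F) to "s".
func foldLongS(s string) string { return strings.ReplaceAll(s, "ſ", "s") }

func main() {
	fmt.Println(toLowerNormalized("FooBar", foldLongS)) // ASCII path, normalizer never called
	fmt.Println(toLowerNormalized("ſſAſſa", foldLongS)) // non-ASCII path
}
```

The ASCII path scans the string once in `allAscii` and again inside `strings.ToLower`, which is the extra pass the benchmarks would need to justify or rule out.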

Ranveer777 · 2024-06-01T17:41:32Z

@colega
The strings.ToLower() implementation has already incorporated a check for ASCII characters.

// ToLower returns s with all Unicode letters mapped to their lower case.
func ToLower(s string) string {
	isASCII, hasUpper := true, false
	for i := 0; i < len(s); i++ {
		c := s[i]
		if c >= utf8.RuneSelf {
			isASCII = false
			break
		}
		hasUpper = hasUpper || ('A' <= c && c <= 'Z')
	}
        ...

Therefore, rather than adding an if allAscii(s) check, adopting the implementation of strings.ToLower() (as suggested by @pracucci) appears to be the more efficient approach.

There remains a question about whether it's feasible to combine normalization and case conversion into a single, more optimized loop. What are your thoughts on this? @colega @pracucci @bboreham

Ranveer777 · 2024-06-02T12:49:45Z

I've made a modification based on @pracucci's suggestions. Instead of iterating through an ASCII string twice (once to check for uppercase characters and once to convert them to lowercase), we now convert uppercase characters to lowercase in a single pass, regardless of whether the string is ASCII or not.

I conducted a benchmark test, and the results were quite intriguing. I kindly request @pracucci to rerun the benchmark for re-validation.

                                                    │  before.txt  │              after.txt              │
                                                    │    sec/op    │    sec/op     vs base               │
FastRegexMatcher/#00-8                                74.31n ±  6%   74.14n ±  1%        ~ (p=0.699 n=6)
FastRegexMatcher/foo-8                                90.84n ±  0%   90.97n ±  2%        ~ (p=0.310 n=6)
FastRegexMatcher/^foo-8                               68.90n ±  1%   68.86n ±  1%        ~ (p=0.699 n=6)
FastRegexMatcher/(foo|bar)-8                          80.45n ±  1%   83.80n ±  4%   +4.16% (p=0.015 n=6)
FastRegexMatcher/foo.*-8                              166.0n ±  1%   165.7n ±  0%        ~ (p=0.177 n=6)
FastRegexMatcher/.*foo-8                              188.0n ±  1%   187.9n ±  0%        ~ (p=0.459 n=6)
FastRegexMatcher/^.*foo$-8                            187.9n ±  0%   187.8n ±  1%        ~ (p=0.550 n=6)
FastRegexMatcher/^.+foo$-8                            187.7n ±  2%   187.8n ±  0%        ~ (p=0.937 n=6)
FastRegexMatcher/.?-8                                 123.1n ±  0%   123.2n ±  5%        ~ (p=0.619 n=6)
FastRegexMatcher/.*-8                                 158.5n ±  1%   157.9n ±  0%        ~ (p=0.197 n=6)
FastRegexMatcher/.+-8                                 154.3n ±  0%   154.6n ±  1%        ~ (p=0.139 n=6)
FastRegexMatcher/foo.+-8                              165.1n ±  1%   164.7n ±  0%        ~ (p=0.082 n=6)
FastRegexMatcher/.+foo-8                              188.0n ±  1%   187.7n ±  0%        ~ (p=0.465 n=6)
FastRegexMatcher/foo_.+-8                             153.2n ± 13%   150.0n ±  0%        ~ (p=0.584 n=6)
FastRegexMatcher/foo_.*-8                             150.5n ±  8%   150.0n ±  0%        ~ (p=0.056 n=6)
FastRegexMatcher/.*foo.*-8                            342.9n ±  1%   343.7n ±  1%        ~ (p=0.485 n=6)
FastRegexMatcher/.+foo.+-8                            352.7n ±  2%   354.4n ±  2%        ~ (p=0.290 n=6)
FastRegexMatcher/(?s:.*)-8                            74.06n ±  1%   74.03n ±  0%        ~ (p=1.000 n=6)
FastRegexMatcher/(?s:.+)-8                            86.90n ±  0%   86.98n ±  0%   +0.10% (p=0.026 n=6)
FastRegexMatcher/(?s:^.*foo$)-8                       183.8n ±  1%   183.5n ±  0%        ~ (p=0.271 n=6)
FastRegexMatcher/(?i:foo)-8                           154.2n ±  3%   155.9n ±  4%   +1.07% (p=0.032 n=6)
FastRegexMatcher/(?i:(foo|bar))-8                     301.5n ±  2%   299.4n ±  1%        ~ (p=0.310 n=6)
FastRegexMatcher/(?i:(foo1|foo2|bar))-8               519.8n ±  2%   513.4n ±  2%        ~ (p=0.132 n=6)
FastRegexMatcher/^(?i:foo|oo)|(bar)$-8                1.270µ ±  1%   1.276µ ±  1%   +0.51% (p=0.039 n=6)
FastRegexMatcher/(?i:(foo1|foo2|aaa|bbb|ccc|ddd|e-8   2.630µ ±  3%   2.342µ ±  3%  -10.97% (p=0.002 n=6)
FastRegexMatcher/((.*)(bar|b|buzz)(.+)|foo)$-8        778.2n ±  1%   777.2n ±  2%        ~ (p=0.937 n=6)
FastRegexMatcher/^$-8                                 73.87n ±  1%   74.06n ±  1%        ~ (p=0.180 n=6)
FastRegexMatcher/(prometheus|api_prom)_api_v1_.+-8    275.8n ±  1%   275.1n ±  1%        ~ (p=0.937 n=6)
FastRegexMatcher/10\.0\.(1|2)\.+-8                    151.4n ±  8%   150.6n ±  7%        ~ (p=0.699 n=6)
FastRegexMatcher/10\.0\.(1|2).+-8                     149.9n ±  0%   149.8n ±  1%        ~ (p=0.970 n=6)
FastRegexMatcher/((fo(bar))|.+foo)-8                  344.9n ±  1%   344.9n ±  1%        ~ (p=0.729 n=6)
FastRegexMatcher/zQPbMkNO|NNSPdvMi|iWuuSoAl|qbvKM-8   374.2n ± 19%   389.0n ±  7%        ~ (p=0.310 n=6)
FastRegexMatcher/jyyfj00j0061|jyyfj00j0062|jyyfj9-8   344.6n ± 14%   388.1n ± 10%  +12.61% (p=0.026 n=6)
FastRegexMatcher/(?i:(zQPbMkNO|NNSPdvMi|iWuuSoAl|-8   2.615µ ±  3%   2.404µ ±  3%   -8.05% (p=0.002 n=6)
FastRegexMatcher/(?i:(AAAAAAAAAAAAAAAAAAAAAAAA|BB-8   558.8n ±  1%   563.9n ±  1%   +0.92% (p=0.041 n=6)
FastRegexMatcher/(?i:(zQPbMkNO.*|NNSPdvMi.*|iWuuS-8   12.03µ ±  0%   12.05µ ±  1%        ~ (p=0.310 n=6)
FastRegexMatcher/(?i:(.*zQPbMkNO|.*NNSPdvMi|.*iWu-8   14.34µ ±  1%   14.46µ ±  1%   +0.85% (p=0.004 n=6)
FastRegexMatcher/fo.?-8                               163.0n ±  0%   162.8n ±  0%        ~ (p=0.617 n=6)
FastRegexMatcher/foo.?-8                              163.0n ±  2%   162.8n ±  0%        ~ (p=0.729 n=6)
FastRegexMatcher/f.?o-8                               158.5n ±  0%   158.5n ±  0%        ~ (p=0.623 n=6)
FastRegexMatcher/.*foo.?-8                            356.4n ±  1%   355.7n ±  1%        ~ (p=0.221 n=6)
FastRegexMatcher/.?foo.+-8                            339.6n ±  2%   340.7n ±  1%        ~ (p=0.556 n=6)
FastRegexMatcher/foo.?|bar-8                          263.2n ±  1%   262.4n ±  1%        ~ (p=0.260 n=6)
geomean                                               272.8n         272.7n         -0.06%

@colega @bboreham @pracucci Could I kindly ask you to spare some time for another review?

colega · 2024-06-02T21:28:57Z

model/labels/regexp.go

+	}
+
+	// Normalise and convert to lower.
+	return strings.Map(unicode.ToLower, norm.NFKD.String(b.String()))


I think you can abort looping above once you find the first non-ascii char, and just do this here:

Suggested change

-	return strings.Map(unicode.ToLower, norm.NFKD.String(b.String()))
+	return strings.Map(unicode.ToLower, norm.NFKD.String(s))

Committed new changes. I've included strings.Map(unicode.ToLower, norm.NFKD.String(b.String())) to convert most uppercase characters to lowercase until we encounter a non-ASCII character in the string.

But why build and allocate a new string if we aren't going to need it? IMO, we should optimize for ascii only, and then optimize for lowercase only.

If you allow me a suggestion, I'd write the method like this:

// toNormalisedLower normalise the input string using "Unicode Normalization Form D" and then convert it to lower case.
// This method is optimized for ASCII-only strings.
func toNormalisedLower(s string) string {
	// Check if the string is all ASCII chars and convert any upper case character to lower case character.
	isASCII := true
	hasUpper := false
	for _, c := range s {
		hasUpper = hasUpper || ('A' <= c && c <= 'Z')
		if c >= utf8.RuneSelf {
			isASCII = false
			break
		}
	}
	if !isASCII {
		return strings.Map(unicode.ToLower, norm.NFKD.String(s))
	}
	if !hasUpper {
		return s
	}
	var (
		b   strings.Builder
		pos int
	)
	for i := 0; i < len(s); i++ {
		c := s[i]
		if 'A' <= c && c <= 'Z' {
			if pos < i {
				b.WriteString(s[pos:i])
			}
			c += 'a' - 'A'
			b.WriteByte(c)
			pos = i + 1
		}
	}
	if pos < len(s) {
		b.WriteString(s[pos:])
	}
	return b.String()
}

My concern was regarding the approach of iterating through a string twice to check for uppercase characters and then converting them to lowercase, especially for ASCII strings where this process might seem redundant. Instead, I proposed a single loop solution to convert all uppercase characters to lowercase, regardless of whether the string is ASCII or not.

I was assuming that unnecessarily allocating a new string (when everything is lowercase) is more expensive than looping through an ascii string twice, but of course that depends on the string length and we can only talk if we run the benchmarks.

I proposed a single loop solution to convert all uppercase characters to lowercase, regardless of whether the string is ASCII or not.

Did you mean regardless of whether the string has uppercase or not?

We can have toNormalisedLower take a buffer (say 2KB) which we allocate on the stack of the caller, thus removing that allocation cost when the string is not already lowercase.

I was referring to situations in which the string contains at least one uppercase character.

we can only talk if we run the benchmarks.

True

We can have toNormalisedLower take a buffer (say 2KB) which we allocate on the stack of the caller, thus removing that allocation cost when the string is not already lowercase.

Could you please clarify if you are referring to passing the input string as a buffer to toNormalisedLower?
Apologies, but I'm having difficulty understanding your explanation.

IMO, we should commit to something working and not terrible at performance, and then we can nerd snipe each other with alternative implementations of this method, benchmarking specifically this method (instead of the entire matcher).

@Ranveer777 what @bboreham is suggesting is to define an array, say var arr [2048]byte and create a slice using that array out := arr[:0], and build the string on that slice as we proceed, something like:

func toNormalisedLower(s string) string {
	var (
		arr [2048]byte
		b   = bytes.NewBuffer(arr[:0])
		pos int
	)
	hasUpper := false
	for i := 0; i < len(s); i++ {
		c := s[i]
		if c >= utf8.RuneSelf {
			return strings.Map(unicode.ToLower, norm.NFKD.String(s))
		}
		if 'A' <= c && c <= 'Z' {
			hasUpper = true
			if pos < i {
				b.WriteString(s[pos:i])
			}
			c += 'a' - 'A'
			b.WriteByte(c)
			pos = i + 1
		}
	}
	if !hasUpper {
		return s
	}
	if pos < len(s) {
		b.WriteString(s[pos:])
	}
	return b.String()
}
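A variant where the caller owns the buffer, closer to the wording of the original suggestion, could look like the sketch below. `appendLowerASCII` is a hypothetical name, and whether the array really stays on the caller's stack depends on the compiler's escape analysis, so treat this as something to benchmark rather than a guaranteed win. Strings longer than the buffer simply make `append` grow onto the heap.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// appendLowerASCII appends the lower-cased bytes of s to dst and returns the
// extended slice. If s contains a non-ASCII byte it returns dst at its original
// length and ok=false, so the caller can fall back to the normalizing path.
func appendLowerASCII(dst []byte, s string) (out []byte, ok bool) {
	start := len(dst)
	for i := 0; i < len(s); i++ {
		c := s[i]
		if c >= utf8.RuneSelf {
			return dst[:start], false
		}
		if 'A' <= c && c <= 'Z' {
			c += 'a' - 'A'
		}
		dst = append(dst, c)
	}
	return dst, true
}

func main() {
	var arr [2048]byte // caller-owned scratch space
	out, ok := appendLowerASCII(arr[:0], "FooBAR")
	fmt.Println(string(out), ok)
}
```

A caller that only needs the bytes for a map lookup could use `out` directly and avoid converting it back to a string on the ASCII path.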

@Ranveer777 what @bboreham is suggesting is to define an array, say var arr [2048]byte and create a slice using that array out := arr[:0], and build the string on that slice as we proceed, something like:

Thank you for clarifying. I initially took it as a suggestion for the committed implementation.

IMO, we should commit to something working and not terrible at performance, and then we can nerd snipe each other with alternative implementations of this method, benchmarking specifically this method (instead of the entire matcher).

👍

colega · 2024-06-05T10:32:42Z

model/labels/regexp_test.go

+	toNormalisedLowerTestCases = map[string]string{
+		"foo":                      "foo",
+		"AAAAAAAAAAAAAAAAAAAAAAAA": "aaaaaaaaaaaaaaaaaaaaaaaa",
+		"cccccccccccccccccccccccC": "cccccccccccccccccccccccc",
+		"ſſſſſſſſſſſſſſſſſſſſſſſſS": "sssssssssssssssssssssssss",
+		"ſſAſſa": "ssassa",
+	}


This is only used in TestToNormalisedLower, I would move the definition there to make the test easier to read & run.

Ranveer777 · 2024-06-05T14:22:46Z

I'm using the benchmark below to assess all the alternative implementations.

func BenchmarkToNormalisedLower(b *testing.B) {
	charset := []rune("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZäöüßÄÖÜſ")
	listOfRandomString := make([]string, 20)
	for i := range listOfRandomString {
		randomString := make([]rune, 100000)
		for j := range randomString {
			randomString[j] = charset[rand.Intn(len(charset))]
		}
		listOfRandomString[i] = string(randomString)
	}
	comps := []struct {
		name string
		fun  func(string) string
	}{
		{"toNormalisedLower1", toNormalisedLower1},
		{"toNormalisedLower2", toNormalisedLower2},
		{"toNormalisedLower3", toNormalisedLower3},
	}
	for _, comp := range comps {
		for i, randString := range listOfRandomString {
			// Name each sub-benchmark by implementation and input index so results do not collide.
			b.Run(fmt.Sprintf("%s/input=%d", comp.name, i), func(b *testing.B) {
				// Run the function b.N times so the framework can time it.
				for n := 0; n < b.N; n++ {
					_ = comp.fun(randString)
				}
			})
		}
	}
}

@colega @bboreham @pracucci
What are your insights on this approach?

colega · 2024-06-05T15:22:21Z

I don't like the idea of randomness on the input of benchmarks.

I'd just use different inputs with the combinations of:

len=10, 100, 1000, 4000 (last one is intentionally higher than the on-stack buffer that @bboreham proposed)
lowercase only, first uppercase, last uppercase, all uppercase
ascii only, or also unicode.

I.e., I would do this:

func BenchmarkToNormalizedLower(b *testing.B) {
	benchCase := func(l int, uppercase string, asciiOnly bool, alt int) string {
		chars := "abcdefghijklmnopqrstuvwxyz"
		if !asciiOnly {
			chars = "aаbбcвdгeдfеgёhжiзjиkйlкmлnмoнpоqпrрsсtтuуvфwхxцyчzш"
		}
		// Swap the alphabet to make alternatives.
		chars = chars[alt%len(chars):] + chars[:alt%len(chars)]

		str := strings.Repeat(chars, l/len(chars)+1)[:l]
		switch uppercase {
		case "first":
			return strings.ToUpper(str[:1]) + str[1:]
		case "last":
			return str[:len(str)-1] + strings.ToUpper(str[len(str)-1:])
		case "all":
			return strings.ToUpper(str)
		case "none":
			return str
		default:
			panic("invalid uppercase")
		}
	}

	for _, l := range []int{10, 100, 1000, 4000} {
		b.Run(fmt.Sprintf("length=%d", l), func(b *testing.B) {
			for _, uppercase := range []string{"none", "first", "last", "all"} {
				b.Run("uppercase="+uppercase, func(b *testing.B) {
					for _, asciiOnly := range []bool{true, false} {
						b.Run(fmt.Sprintf("ascii=%t", asciiOnly), func(b *testing.B) {
							inputs := make([]string, 10)
							for i := range inputs {
								inputs[i] = benchCase(l, uppercase, asciiOnly, i)
							}
							b.ResetTimer()
							for n := 0; n < b.N; n++ {
								toNormalisedLower(inputs[n%len(inputs)])
							}
						})
					}
				})
			}
		})
	}
}

Ranveer777 · 2024-06-05T15:56:04Z

Looks good to me @colega.

lowercase only, first uppercase, last uppercase, all uppercase

ascii only, or also unicode.

With random inputs, we would not have been able to verify all of the aforementioned cases.

Ranveer777 · 2024-06-06T09:17:46Z

I conducted a benchmark test on the implementation committed to the branch and compared it with the benchmark tests of implementations suggested by @colega and @pracucci.

ToNormalizedLower/length=10/uppercase=none/ascii=true-8       42.68n ± 1%    21.45n ±  3%   -49.75% (p=0.002 n=6)  
ToNormalizedLower/length=10/uppercase=none/ascii=false-8      310.2n ± 1%    274.1n ±  0%   -11.65% (p=0.002 n=6)  
ToNormalizedLower/length=10/uppercase=first/ascii=true-8      44.70n ± 1%    90.79n ±  1%  +103.09% (p=0.002 n=6)  
ToNormalizedLower/length=10/uppercase=first/ascii=false-8     321.9n ± 1%    309.9n ±  1%    -3.73% (p=0.002 n=6)  
ToNormalizedLower/length=10/uppercase=last/ascii=true-8       43.21n ± 1%    62.77n ±  1%   +45.26% (p=0.002 n=6)  
ToNormalizedLower/length=10/uppercase=last/ascii=false-8      340.9n ± 1%    304.4n ±  1%   -10.71% (p=0.002 n=6)  
ToNormalizedLower/length=10/uppercase=all/ascii=true-8        51.36n ± 1%    97.97n ±  1%   +90.76% (p=0.002 n=6)
ToNormalizedLower/length=10/uppercase=all/ascii=false-8       337.5n ± 1%    295.9n ±  1%   -12.31% (p=0.002 n=6)
ToNormalizedLower/length=100/uppercase=none/ascii=true-8      138.0n ± 2%    165.1n ±  2%   +19.59% (p=0.002 n=6)
ToNormalizedLower/length=100/uppercase=none/ascii=false-8     3.743µ ± 1%    3.667µ ±  1%    -2.02% (p=0.002 n=6)
ToNormalizedLower/length=100/uppercase=first/ascii=true-8     139.1n ± 2%    296.7n ±  2%  +113.34% (p=0.002 n=6)
ToNormalizedLower/length=100/uppercase=first/ascii=false-8    3.746µ ± 1%    3.900µ ±  3%    +4.12% (p=0.002 n=6)
ToNormalizedLower/length=100/uppercase=last/ascii=true-8      138.8n ± 1%    303.4n ±  0%  +118.62% (p=0.002 n=6)
ToNormalizedLower/length=100/uppercase=last/ascii=false-8     3.768µ ± 1%    3.692µ ±  1%    -2.03% (p=0.002 n=6)
ToNormalizedLower/length=100/uppercase=all/ascii=true-8       259.9n ± 1%    575.4n ±  2%  +121.44% (p=0.002 n=6)
ToNormalizedLower/length=100/uppercase=all/ascii=false-8      3.596µ ± 1%    3.507µ ±  0%    -2.48% (p=0.002 n=6)
ToNormalizedLower/length=1000/uppercase=none/ascii=true-8     1.002µ ± 2%    1.570µ ±  1%   +56.68% (p=0.002 n=6)
ToNormalizedLower/length=1000/uppercase=none/ascii=false-8    31.68µ ± 1%    31.43µ ±  1%    -0.80% (p=0.002 n=6)
ToNormalizedLower/length=1000/uppercase=first/ascii=true-8    1.030µ ± 1%    2.252µ ±  1%  +118.80% (p=0.002 n=6)
ToNormalizedLower/length=1000/uppercase=first/ascii=false-8   31.71µ ± 3%    32.62µ ±  1%         ~ (p=0.065 n=6)
ToNormalizedLower/length=1000/uppercase=last/ascii=true-8     1.006µ ± 1%    2.676µ ±  0%  +166.00% (p=0.002 n=6)
ToNormalizedLower/length=1000/uppercase=last/ascii=false-8    31.71µ ± 1%    31.61µ ±  0%         ~ (p=0.084 n=6)
ToNormalizedLower/length=1000/uppercase=all/ascii=true-8      2.267µ ± 1%    4.697µ ±  1%  +107.21% (p=0.002 n=6)
ToNormalizedLower/length=1000/uppercase=all/ascii=false-8     29.70µ ± 1%    29.30µ ±  1%    -1.36% (p=0.002 n=6)
ToNormalizedLower/length=4000/uppercase=none/ascii=true-8     4.143µ ± 3%    6.289µ ±  1%   +51.82% (p=0.002 n=6)
ToNormalizedLower/length=4000/uppercase=none/ascii=false-8    126.0µ ± 1%    124.7µ ±  0%    -1.06% (p=0.002 n=6)
ToNormalizedLower/length=4000/uppercase=first/ascii=true-8    4.085µ ± 1%    8.939µ ±  1%  +118.81% (p=0.002 n=6)
ToNormalizedLower/length=4000/uppercase=first/ascii=false-8   125.5µ ± 0%    128.6µ ±  0%    +2.46% (p=0.002 n=6)
ToNormalizedLower/length=4000/uppercase=last/ascii=true-8     4.126µ ± 1%   10.763µ ± 11%  +160.86% (p=0.002 n=6)
ToNormalizedLower/length=4000/uppercase=last/ascii=false-8    126.6µ ± 7%    125.2µ ±  2%    -1.13% (p=0.041 n=6)
ToNormalizedLower/length=4000/uppercase=all/ascii=true-8      9.326µ ± 1%   18.084µ ±  2%   +93.91% (p=0.002 n=6)
ToNormalizedLower/length=4000/uppercase=all/ascii=false-8     116.3µ ± 1%    114.8µ ±  0%    -1.28% (p=0.002 n=6)
geomean                                                       1.957µ         2.583µ         +31.96%

ToNormalizedLower/length=10/uppercase=none/ascii=true-8       42.68n ± 1%    15.91n ± 6%  -62.73% (p=0.002 n=6)
ToNormalizedLower/length=10/uppercase=none/ascii=false-8      310.2n ± 1%    259.8n ± 0%  -16.26% (p=0.002 n=6)
ToNormalizedLower/length=10/uppercase=first/ascii=true-8      44.70n ± 1%    47.60n ± 1%   +6.48% (p=0.002 n=6)
ToNormalizedLower/length=10/uppercase=first/ascii=false-8     321.9n ± 1%    303.6n ± 1%   -5.70% (p=0.002 n=6)
ToNormalizedLower/length=10/uppercase=last/ascii=true-8       43.21n ± 1%    51.34n ± 1%  +18.80% (p=0.002 n=6)
ToNormalizedLower/length=10/uppercase=last/ascii=false-8      340.9n ± 1%    294.9n ± 1%  -13.49% (p=0.002 n=6)
ToNormalizedLower/length=10/uppercase=all/ascii=true-8        51.36n ± 1%    57.50n ± 1%  +11.95% (p=0.002 n=6)
ToNormalizedLower/length=10/uppercase=all/ascii=false-8       337.5n ± 1%    295.2n ± 2%  -12.52% (p=0.002 n=6)
ToNormalizedLower/length=100/uppercase=none/ascii=true-8      138.0n ± 2%    117.0n ± 0%  -15.21% (p=0.002 n=6)
ToNormalizedLower/length=100/uppercase=none/ascii=false-8     3.743µ ± 1%    3.681µ ± 1%   -1.66% (p=0.002 n=6)
ToNormalizedLower/length=100/uppercase=first/ascii=true-8     139.1n ± 2%    193.9n ± 0%  +39.45% (p=0.002 n=6)
ToNormalizedLower/length=100/uppercase=first/ascii=false-8    3.746µ ± 1%    3.799µ ± 0%   +1.42% (p=0.002 n=6)
ToNormalizedLower/length=100/uppercase=last/ascii=true-8      138.8n ± 1%    236.9n ± 1%  +70.68% (p=0.002 n=6)
ToNormalizedLower/length=100/uppercase=last/ascii=false-8     3.768µ ± 1%    3.693µ ± 1%   -1.99% (p=0.002 n=6)
ToNormalizedLower/length=100/uppercase=all/ascii=true-8       259.9n ± 1%    331.4n ± 0%  +27.55% (p=0.002 n=6)
ToNormalizedLower/length=100/uppercase=all/ascii=false-8      3.596µ ± 1%    3.498µ ± 0%   -2.71% (p=0.002 n=6)
ToNormalizedLower/length=1000/uppercase=none/ascii=true-8     1.002µ ± 2%    1.060µ ± 4%   +5.82% (p=0.002 n=6)
ToNormalizedLower/length=1000/uppercase=none/ascii=false-8    31.68µ ± 1%    31.55µ ± 2%        ~ (p=0.372 n=6)
ToNormalizedLower/length=1000/uppercase=first/ascii=true-8    1.030µ ± 1%    1.429µ ± 1%  +38.85% (p=0.002 n=6)
ToNormalizedLower/length=1000/uppercase=first/ascii=false-8   31.71µ ± 3%    32.59µ ± 1%   +2.79% (p=0.041 n=6)
ToNormalizedLower/length=1000/uppercase=last/ascii=true-8     1.006µ ± 1%    1.909µ ± 1%  +89.71% (p=0.002 n=6)
ToNormalizedLower/length=1000/uppercase=last/ascii=false-8    31.71µ ± 1%    31.63µ ± 1%        ~ (p=0.461 n=6)
ToNormalizedLower/length=1000/uppercase=all/ascii=true-8      2.267µ ± 1%    2.828µ ± 1%  +24.77% (p=0.002 n=6)
ToNormalizedLower/length=1000/uppercase=all/ascii=false-8     29.70µ ± 1%    29.24µ ± 0%   -1.54% (p=0.002 n=6)
ToNormalizedLower/length=4000/uppercase=none/ascii=true-8     4.143µ ± 3%    4.194µ ± 0%   +1.23% (p=0.002 n=6)
ToNormalizedLower/length=4000/uppercase=none/ascii=false-8    126.0µ ± 1%    124.7µ ± 0%   -1.04% (p=0.002 n=6)
ToNormalizedLower/length=4000/uppercase=first/ascii=true-8    4.085µ ± 1%    5.838µ ± 6%  +42.90% (p=0.002 n=6)
ToNormalizedLower/length=4000/uppercase=first/ascii=false-8   125.5µ ± 0%    129.7µ ± 4%   +3.34% (p=0.002 n=6)
ToNormalizedLower/length=4000/uppercase=last/ascii=true-8     4.126µ ± 1%    7.639µ ± 1%  +85.13% (p=0.002 n=6)
ToNormalizedLower/length=4000/uppercase=last/ascii=false-8    126.6µ ± 7%    125.2µ ± 1%   -1.09% (p=0.002 n=6)
ToNormalizedLower/length=4000/uppercase=all/ascii=true-8      9.326µ ± 1%   11.578µ ± 4%  +24.15% (p=0.002 n=6)
ToNormalizedLower/length=4000/uppercase=all/ascii=false-8     116.3µ ± 1%    116.3µ ± 2%        ~ (p=0.937 n=6)
geomean                                                       1.957µ         2.097µ        +7.14%

colega · 2024-06-06T10:08:59Z

Thank you, @Ranveer777. I think both results look worse than the current one. (BTW, you can compare all of them if you run benchstat current.txt foo.txt bar.txt and keep the headers of the table. The missing headers misled me, so I posted a wrong comment that I later removed.)

bboreham · 2024-06-06T10:15:48Z

Thanks for all the work. I think we can merge the improvement so far and start another PR if there are further ideas.

However a recent update to dependencies has created a merge conflict, which I am not confident to resolve in the GitHub UI.
@Ranveer777 could you rebase please?

Ranveer777 · 2024-06-06T10:38:22Z

@colega @bboreham @pracucci
I appreciate your guidance through the changes. Thank you.

Ranveer777 · 2024-06-06T12:02:16Z

@bboreham
I have successfully resolved the conflict and performed a rebase on the branch.

bboreham

LGTM. I would like to see it optimised further, but let's get the bugfix in and move from there.

colega · 2024-06-07T07:40:37Z

The test that failed on the Windows build seems to be unrelated: FAIL: TestDBReadOnly_Querier_NoAlteration (1.76s)

Signed-off-by: RA <ranveeravhad777@gmail.com>

…nversion 1) For ASCII strings: The method converts the input string from upper to lower case. 2) For Non-ASCII strings: The method normalizes the input string using 'Unicode Normalization Form D' and then converts it to lower case. Signed-off-by: RA <ranveeravhad777@gmail.com>

Signed-off-by: RA <ranveeravhad777@gmail.com>

Ranveer777 · 2024-06-09T12:19:15Z

It appears that Windows build has passed successfully.

Ranveer777 marked this pull request as draft May 31, 2024 00:18

Ranveer777 marked this pull request as ready for review May 31, 2024 00:18

Ranveer777 force-pushed the equalMultiStringMapMatcher branch from 29071e2 to c7d1f2f Compare May 31, 2024 06:36

colega reviewed May 31, 2024

colega approved these changes May 31, 2024

bboreham approved these changes May 31, 2024

Ranveer777 force-pushed the equalMultiStringMapMatcher branch 2 times, most recently from 9eabd8f to 1a6e7e4 Compare June 2, 2024 12:48

Ranveer777 requested review from colega and bboreham June 2, 2024 12:50

colega reviewed Jun 2, 2024

Ranveer777 force-pushed the equalMultiStringMapMatcher branch from ba92b80 to 27b27e1 Compare June 3, 2024 17:42

beorn7 mentioned this pull request Jun 4, 2024

Label: Changed regexp.go file and added a single test case 'ſſs' #14086

Open

colega reviewed Jun 5, 2024

Ranveer777 force-pushed the equalMultiStringMapMatcher branch from 27b27e1 to 885fee4 Compare June 6, 2024 10:29

bboreham approved these changes Jun 6, 2024

Ranveer777 added 7 commits June 9, 2024 16:50

Converted string to standarized form

af92067

Signed-off-by: RA <ranveeravhad777@gmail.com>

Added golang.org/x/text in Go dependencies

8dbe247

Signed-off-by: RA <ranveeravhad777@gmail.com>

Added test cases for FastRegexMatcher

03bf4c5

Signed-off-by: RA <ranveeravhad777@gmail.com>

breaking the loop incase ASCII character is found

b9a319c

Signed-off-by: RA <ranveeravhad777@gmail.com>

Added benchmark for toNormalizedLower

549938b

Signed-off-by: RA <ranveeravhad777@gmail.com>

Updated Go dependencies

820d35b

Signed-off-by: RA <ranveeravhad777@gmail.com>

Ranveer777 force-pushed the equalMultiStringMapMatcher branch from ad6cb5a to 820d35b Compare June 9, 2024 11:25

bboreham changed the title ~~model: Normalized the string to standardized form while add and Matches for MultiStringMapMatcher~~ [BUGFIX] FastRegexpMatcher: do Unicode normalization as part of case-insensitive comparison Jun 10, 2024

bboreham merged commit 39902ba into prometheus:main Jun 10, 2024
25 checks passed

colega mentioned this pull request Jun 14, 2024

Refactor toNormalisedLower: shorter and slightly faster. #14299

Merged
