[BUGFIX] FastRegexpMatcher: do Unicode normalization as part of case-insensitive comparison #14170

Merged
merged 7 commits into from
Jun 10, 2024

Conversation

Ranveer777
Copy link
Contributor

Fixes #14066

This normalizes strings to Unicode NFKD before lower-casing them, both when values are added to the matcher (add) and when they are matched (Matches), so that the case-insensitive comparison also treats compatibility-equivalent characters as equal.
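
For context, here is a minimal standard-library sketch of the kind of mismatch behind the linked issue. It assumes the problem involves characters such as ſ (U+017F), which Go's regexp engine treats as a case fold of s but which strings.ToLower leaves unchanged. The helper names regexpMatchesFold and lowerKeyMatches are illustrative only and are not taken from the Prometheus code.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// regexpMatchesFold reports whether s fully matches the literal lit
// case-insensitively, as Go's regexp engine sees it.
func regexpMatchesFold(lit, s string) bool {
	re := regexp.MustCompile("^(?i:" + regexp.QuoteMeta(lit) + ")$")
	return re.MatchString(s)
}

// lowerKeyMatches mimics a set lookup keyed on strings.ToLower alone,
// that is, without any Unicode normalization.
func lowerKeyMatches(lit, s string) bool {
	set := map[string]struct{}{strings.ToLower(lit): {}}
	_, ok := set[strings.ToLower(s)]
	return ok
}

func main() {
	const longS = "\u017f" // ſ, LATIN SMALL LETTER LONG S

	fmt.Println(regexpMatchesFold("s", longS)) // true: the regexp folds ſ to s
	fmt.Println(lowerKeyMatches("s", longS))   // false: ToLower keeps ſ as it is
}
```

The two answers disagree, which is the discrepancy a normalization step before lower-casing is meant to remove.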

@colega @bboreham

@Ranveer777 Ranveer777 marked this pull request as draft May 31, 2024 00:18
@Ranveer777 Ranveer777 marked this pull request as ready for review May 31, 2024 00:18
@@ -767,7 +768,7 @@ type equalMultiStringMapMatcher struct {

func (m *equalMultiStringMapMatcher) add(s string) {
if !m.caseSensitive {
-		s = strings.ToLower(s)
+		s = strings.ToLower(norm.NFKD.String(s))
Copy link
Contributor

Now thank you for nerd-sniping me on this.

I tried reading the docs and even though I'd have chosen NFKC for this, I ran the tests in your branch, especially with the example character from the docs ẛ̣ and I couldn't tell any difference (in test results) between both normalizations, so LGTM.

Copy link
Member

@bboreham bboreham left a comment

TIL

@bboreham
Copy link
Member

Should we credit @kushalShukla-web with the test case? I'm not sure where that came from but it seems the same.

@colega
Copy link
Contributor

colega commented May 31, 2024

The test case was defined in the issue.

@bboreham
Copy link
Member

Just want to hold off on merging this while we look at benchmarks.

@pracucci
Copy link
Contributor

I've run a few selected benchmarks, and there's quite a significant performance impact:

                                                     │ before.txt  │             after.txt              │
                                                     │   sec/op    │   sec/op     vs base               │
FastRegexMatcher/(?i:foo)-12                           115.5n ± 2%   111.7n ± 2%   -3.33% (p=0.004 n=6)
FastRegexMatcher/(?i:(foo|bar))-12                     239.8n ± 4%   233.4n ± 8%        ~ (p=0.288 n=6)
FastRegexMatcher/(?i:(foo1|foo2|bar))-12               431.2n ± 3%   416.6n ± 2%   -3.37% (p=0.002 n=6)
FastRegexMatcher/^(?i:foo|oo)|(bar)$-12                1.050µ ± 2%   1.042µ ± 0%        ~ (p=0.284 n=6)
FastRegexMatcher/(?i:(foo1|foo2|aaa|bbb|ccc|ddd|e-12   1.994µ ± 2%   2.438µ ± 1%  +22.27% (p=0.002 n=6)
FastRegexMatcher/(?i:(zQPbMkNO|NNSPdvMi|iWuuSoAl|-12   1.998µ ± 5%   2.457µ ± 1%  +22.95% (p=0.002 n=6)
FastRegexMatcher/(?i:(zQPbMkNO.*|NNSPdvMi.*|iWuuS-12   10.06µ ± 2%   10.16µ ± 7%        ~ (p=0.240 n=6)
FastRegexMatcher/(?i:(.*zQPbMkNO|.*NNSPdvMi|.*iWu-12   12.02µ ± 2%   11.70µ ± 4%   -2.64% (p=0.041 n=6)

I'm experimenting with an optimization...

@pracucci
Copy link
Contributor

pracucci commented May 31, 2024

I'm experimenting with an optimization...

I think we can skip the normalization if the string only has ASCII chars. The check for ASCII chars is already done by strings.ToLower(), so if we hold our nose and copy-paste the strings.ToLower() implementation, we can merge it with the normalization:
55dc273

This reverts the performance impact for the common case of a string only having ASCII chars:

goos: darwin
goarch: amd64
pkg: github.com/prometheus/prometheus/model/labels
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
                                                     │ before.txt  │        after-optimized.txt        │
                                                     │   sec/op    │   sec/op     vs base              │
FastRegexMatcher/(?i:foo)-12                           115.5n ± 2%   112.3n ± 1%  -2.77% (p=0.009 n=6)
FastRegexMatcher/(?i:(foo|bar))-12                     239.8n ± 4%   248.6n ± 3%  +3.65% (p=0.037 n=6)
FastRegexMatcher/(?i:(foo1|foo2|bar))-12               431.2n ± 3%   437.8n ± 2%  +1.53% (p=0.039 n=6)
FastRegexMatcher/^(?i:foo|oo)|(bar)$-12                1.050µ ± 2%   1.073µ ± 2%  +2.14% (p=0.026 n=6)
FastRegexMatcher/(?i:(foo1|foo2|aaa|bbb|ccc|ddd|e-12   1.994µ ± 2%   1.936µ ± 1%  -2.88% (p=0.002 n=6)
FastRegexMatcher/(?i:(zQPbMkNO|NNSPdvMi|iWuuSoAl|-12   1.998µ ± 5%   1.963µ ± 4%  -1.75% (p=0.041 n=6)
FastRegexMatcher/(?i:(zQPbMkNO.*|NNSPdvMi.*|iWuuS-12   10.06µ ± 2%   10.04µ ± 3%       ~ (p=0.699 n=6)
FastRegexMatcher/(?i:(.*zQPbMkNO|.*NNSPdvMi|.*iWu-12   12.02µ ± 2%   12.15µ ± 4%       ~ (p=0.310 n=6)
geomean                                                1.252µ        1.253µ       +0.08%

If this change is accepted, we should unit test toNormalisedLower() (at least a few smoke tests).

@Ranveer777
Copy link
Contributor Author

Ranveer777 commented May 31, 2024

@pracucci
Interesting approach. This will allow us to restrict the performance impact to only non-ascii characters.

@Ranveer777
Copy link
Contributor Author

@colega @bboreham
I wanted to get your thoughts on the @pracucci approach.
Also, what are the next steps?

@colega
Copy link
Contributor

colega commented Jun 1, 2024

I wanted to get your thoughts on the @pracucci approach.

Hey, I agree that we should optimize for the ascii case, which is the most common one in the wild.

Also, what are the next steps?

I think we can start with a middle ground between copying the entire strings.ToLower and just defining:

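// toLowerNormalized lower-cases s, normalizing it first only when it contains non-ASCII chars.
// allAscii is a helper still to be written.
func toLowerNormalized(s string) string {
    if allAscii(s) {
        return strings.ToLower(s)
    }
    return strings.ToLower(norm.NFKD.String(s))
}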

Use it and benchmark to see if it's worth copying the entire strings.ToLower() as @pracucci suggests.
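
The allAscii helper in the snippet above is not defined anywhere in this thread. A minimal sketch of what it could look like follows, under two assumptions: the helper is spelled allASCII here, and the normalize parameter is a stand-in for norm.NFKD.String so that the example only needs the standard library.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// allASCII reports whether every byte of s is below utf8.RuneSelf (0x80).
func allASCII(s string) bool {
	for i := 0; i < len(s); i++ {
		if s[i] >= utf8.RuneSelf {
			return false
		}
	}
	return true
}

// toLowerNormalized lower-cases s and calls normalize only for non-ASCII input.
// normalize is an assumption standing in for norm.NFKD.String.
func toLowerNormalized(s string, normalize func(string) string) string {
	if allASCII(s) {
		return strings.ToLower(s)
	}
	return strings.ToLower(normalize(s))
}

func main() {
	calls := 0
	countingIdentity := func(s string) string { calls++; return s }

	out := toLowerNormalized("FooBar", countingIdentity)
	fmt.Println(out, calls) // ASCII input: normalize is never called

	out = toLowerNormalized("\u00c4B", countingIdentity)
	fmt.Println(out, calls) // non-ASCII input: normalize is called once
}
```

Note that this version scans an ASCII string twice, once in allASCII and once inside strings.ToLower, which is the cost the benchmark is supposed to weigh.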

@Ranveer777
Copy link
Contributor Author

@colega
The strings.ToLower() implementation has already incorporated a check for ASCII characters.

// ToLower returns s with all Unicode letters mapped to their lower case.
func ToLower(s string) string {
	isASCII, hasUpper := true, false
	for i := 0; i < len(s); i++ {
		c := s[i]
		if c >= utf8.RuneSelf {
			isASCII = false
			break
		}
		hasUpper = hasUpper || ('A' <= c && c <= 'Z')
	}
        ...

Therefore, rather than adding a separate if allAscii(s) check, adapting the implementation of strings.ToLower() (as @pracucci suggested) appears to be the more efficient approach.

There remains a question about whether it's feasible to combine normalization and case conversion into a single, more optimized loop. What are your thoughts on this? @colega @pracucci @bboreham

@Ranveer777 Ranveer777 force-pushed the equalMultiStringMapMatcher branch 2 times, most recently from 9eabd8f to 1a6e7e4 Compare June 2, 2024 12:48
@Ranveer777
Copy link
Contributor Author

I've made a modification based on @pracucci's suggestion. Instead of iterating through an ASCII string twice (once to check for uppercase characters and once to convert them), we now convert uppercase characters to lowercase in a single pass, regardless of whether the string is ASCII or not.

I conducted a benchmark test, and the results were quite intriguing. I kindly request @pracucci to rerun the benchmark for re-validation.

                                                    │  before.txt  │              after.txt              │
                                                    │    sec/op    │    sec/op     vs base               │
FastRegexMatcher/#00-8                                74.31n ±  6%   74.14n ±  1%        ~ (p=0.699 n=6)
FastRegexMatcher/foo-8                                90.84n ±  0%   90.97n ±  2%        ~ (p=0.310 n=6)
FastRegexMatcher/^foo-8                               68.90n ±  1%   68.86n ±  1%        ~ (p=0.699 n=6)
FastRegexMatcher/(foo|bar)-8                          80.45n ±  1%   83.80n ±  4%   +4.16% (p=0.015 n=6)
FastRegexMatcher/foo.*-8                              166.0n ±  1%   165.7n ±  0%        ~ (p=0.177 n=6)
FastRegexMatcher/.*foo-8                              188.0n ±  1%   187.9n ±  0%        ~ (p=0.459 n=6)
FastRegexMatcher/^.*foo$-8                            187.9n ±  0%   187.8n ±  1%        ~ (p=0.550 n=6)
FastRegexMatcher/^.+foo$-8                            187.7n ±  2%   187.8n ±  0%        ~ (p=0.937 n=6)
FastRegexMatcher/.?-8                                 123.1n ±  0%   123.2n ±  5%        ~ (p=0.619 n=6)
FastRegexMatcher/.*-8                                 158.5n ±  1%   157.9n ±  0%        ~ (p=0.197 n=6)
FastRegexMatcher/.+-8                                 154.3n ±  0%   154.6n ±  1%        ~ (p=0.139 n=6)
FastRegexMatcher/foo.+-8                              165.1n ±  1%   164.7n ±  0%        ~ (p=0.082 n=6)
FastRegexMatcher/.+foo-8                              188.0n ±  1%   187.7n ±  0%        ~ (p=0.465 n=6)
FastRegexMatcher/foo_.+-8                             153.2n ± 13%   150.0n ±  0%        ~ (p=0.584 n=6)
FastRegexMatcher/foo_.*-8                             150.5n ±  8%   150.0n ±  0%        ~ (p=0.056 n=6)
FastRegexMatcher/.*foo.*-8                            342.9n ±  1%   343.7n ±  1%        ~ (p=0.485 n=6)
FastRegexMatcher/.+foo.+-8                            352.7n ±  2%   354.4n ±  2%        ~ (p=0.290 n=6)
FastRegexMatcher/(?s:.*)-8                            74.06n ±  1%   74.03n ±  0%        ~ (p=1.000 n=6)
FastRegexMatcher/(?s:.+)-8                            86.90n ±  0%   86.98n ±  0%   +0.10% (p=0.026 n=6)
FastRegexMatcher/(?s:^.*foo$)-8                       183.8n ±  1%   183.5n ±  0%        ~ (p=0.271 n=6)
FastRegexMatcher/(?i:foo)-8                           154.2n ±  3%   155.9n ±  4%   +1.07% (p=0.032 n=6)
FastRegexMatcher/(?i:(foo|bar))-8                     301.5n ±  2%   299.4n ±  1%        ~ (p=0.310 n=6)
FastRegexMatcher/(?i:(foo1|foo2|bar))-8               519.8n ±  2%   513.4n ±  2%        ~ (p=0.132 n=6)
FastRegexMatcher/^(?i:foo|oo)|(bar)$-8                1.270µ ±  1%   1.276µ ±  1%   +0.51% (p=0.039 n=6)
FastRegexMatcher/(?i:(foo1|foo2|aaa|bbb|ccc|ddd|e-8   2.630µ ±  3%   2.342µ ±  3%  -10.97% (p=0.002 n=6)
FastRegexMatcher/((.*)(bar|b|buzz)(.+)|foo)$-8        778.2n ±  1%   777.2n ±  2%        ~ (p=0.937 n=6)
FastRegexMatcher/^$-8                                 73.87n ±  1%   74.06n ±  1%        ~ (p=0.180 n=6)
FastRegexMatcher/(prometheus|api_prom)_api_v1_.+-8    275.8n ±  1%   275.1n ±  1%        ~ (p=0.937 n=6)
FastRegexMatcher/10\.0\.(1|2)\.+-8                    151.4n ±  8%   150.6n ±  7%        ~ (p=0.699 n=6)
FastRegexMatcher/10\.0\.(1|2).+-8                     149.9n ±  0%   149.8n ±  1%        ~ (p=0.970 n=6)
FastRegexMatcher/((fo(bar))|.+foo)-8                  344.9n ±  1%   344.9n ±  1%        ~ (p=0.729 n=6)
FastRegexMatcher/zQPbMkNO|NNSPdvMi|iWuuSoAl|qbvKM-8   374.2n ± 19%   389.0n ±  7%        ~ (p=0.310 n=6)
FastRegexMatcher/jyyfj00j0061|jyyfj00j0062|jyyfj9-8   344.6n ± 14%   388.1n ± 10%  +12.61% (p=0.026 n=6)
FastRegexMatcher/(?i:(zQPbMkNO|NNSPdvMi|iWuuSoAl|-8   2.615µ ±  3%   2.404µ ±  3%   -8.05% (p=0.002 n=6)
FastRegexMatcher/(?i:(AAAAAAAAAAAAAAAAAAAAAAAA|BB-8   558.8n ±  1%   563.9n ±  1%   +0.92% (p=0.041 n=6)
FastRegexMatcher/(?i:(zQPbMkNO.*|NNSPdvMi.*|iWuuS-8   12.03µ ±  0%   12.05µ ±  1%        ~ (p=0.310 n=6)
FastRegexMatcher/(?i:(.*zQPbMkNO|.*NNSPdvMi|.*iWu-8   14.34µ ±  1%   14.46µ ±  1%   +0.85% (p=0.004 n=6)
FastRegexMatcher/fo.?-8                               163.0n ±  0%   162.8n ±  0%        ~ (p=0.617 n=6)
FastRegexMatcher/foo.?-8                              163.0n ±  2%   162.8n ±  0%        ~ (p=0.729 n=6)
FastRegexMatcher/f.?o-8                               158.5n ±  0%   158.5n ±  0%        ~ (p=0.623 n=6)
FastRegexMatcher/.*foo.?-8                            356.4n ±  1%   355.7n ±  1%        ~ (p=0.221 n=6)
FastRegexMatcher/.?foo.+-8                            339.6n ±  2%   340.7n ±  1%        ~ (p=0.556 n=6)
FastRegexMatcher/foo.?|bar-8                          263.2n ±  1%   262.4n ±  1%        ~ (p=0.260 n=6)
geomean                                               272.8n         272.7n         -0.06%

@colega @bboreham @pracucci Could I kindly ask you to spare some time for another review?

}

// Normalise and convert to lower.
return strings.Map(unicode.ToLower, norm.NFKD.String(b.String()))
Copy link
Contributor

I think you can abort looping above once you find the first non-ascii char, and just do this here:

Suggested change
- return strings.Map(unicode.ToLower, norm.NFKD.String(b.String()))
+ return strings.Map(unicode.ToLower, norm.NFKD.String(s))

Copy link
Contributor Author

@Ranveer777 Ranveer777 Jun 3, 2024

Committed new changes. I've included strings.Map(unicode.ToLower, norm.NFKD.String(b.String())) to convert most uppercase characters to lowercase until we encounter a non-ASCII character in the string.

Copy link
Contributor

But why build and allocate a new string if we aren't going to need it? IMO, we should optimize for ascii only, and then optimize for lowercase only.

If you allow me a suggestion, I'd write the method like this:

// toNormalisedLower normalises the input string using Unicode Normalization Form KD (NFKD) and then converts it to lower case.
// This method is optimized for ASCII-only strings.
func toNormalisedLower(s string) string {
	// Check whether the string is all ASCII chars and whether it contains any upper case ASCII character.
	isASCII := true
	hasUpper := false
	for _, c := range s {
		hasUpper = hasUpper || ('A' <= c && c <= 'Z')
		if c >= utf8.RuneSelf {
			isASCII = false
			break
		}
	}
	if !isASCII {
		return strings.Map(unicode.ToLower, norm.NFKD.String(s))
	}
	if !hasUpper {
		return s
	}

	var (
		b   strings.Builder
		pos int
	)
	for i := 0; i < len(s); i++ {
		c := s[i]
		if 'A' <= c && c <= 'Z' {
			if pos < i {
				b.WriteString(s[pos:i])
			}
			c += 'a' - 'A'
			b.WriteByte(c)
			pos = i + 1
		}
	}
	if pos < len(s) {
		b.WriteString(s[pos:])
	}
	return b.String()
}

Copy link
Contributor Author

@Ranveer777 Ranveer777 Jun 4, 2024

My concern was regarding the approach of iterating through a string twice to check for uppercase characters and then converting them to lowercase, especially for ASCII strings where this process might seem redundant. Instead, I proposed a single loop solution to convert all uppercase characters to lowercase, regardless of whether the string is ASCII or not.

Copy link
Contributor

I was assuming that unnecessarily allocating a new string (when everything is lowercase) is more expensive than looping through an ascii string twice, but of course that depends on the string length and we can only talk if we run the benchmarks.

I proposed a single loop solution to convert all uppercase characters to lowercase, regardless of whether the string is ASCII or not.

Did you mean regardless of whether the string has uppercase or not?
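
The allocation side of this trade-off can be checked separately from timing. The sketch below uses testing.AllocsPerRun on two ASCII-only variants. The names lowerTwoPass, lowerAlwaysBuild and allocsPerCall are hypothetical, and neither function is one of the implementations proposed in this PR.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// sink keeps results reachable so the compiler cannot discard the work.
var sink string

// lowerAlwaysBuild lower-cases ASCII letters and always builds a new string.
func lowerAlwaysBuild(s string) string {
	var b strings.Builder
	b.Grow(len(s))
	for i := 0; i < len(s); i++ {
		c := s[i]
		if 'A' <= c && c <= 'Z' {
			c += 'a' - 'A'
		}
		b.WriteByte(c)
	}
	return b.String()
}

// lowerTwoPass scans for an upper case letter first and returns s unchanged
// when there is none; only otherwise does it pay for a second pass.
func lowerTwoPass(s string) string {
	for i := 0; i < len(s); i++ {
		if 'A' <= s[i] && s[i] <= 'Z' {
			return lowerAlwaysBuild(s)
		}
	}
	return s
}

// allocsPerCall reports the average number of heap allocations per call of f(s).
func allocsPerCall(f func(string) string, s string) float64 {
	return testing.AllocsPerRun(100, func() { sink = f(s) })
}

func main() {
	in := strings.Repeat("abcdefgh", 8) // 64 bytes, already lower case
	fmt.Println("two-pass:    ", allocsPerCall(lowerTwoPass, in))
	fmt.Println("always-build:", allocsPerCall(lowerAlwaysBuild, in))
}
```

For input that is already lower case, the two-pass variant returns the original string without allocating, while the always-build variant allocates on every call. Which one is faster overall still depends on string length and on how often inputs contain upper case letters, so timing benchmarks remain necessary.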

Copy link
Member

We can have toNormalisedLower take a buffer (say 2KB) which we allocate on the stack of the caller, thus removing that allocation cost when the string is not already lowercase.

Copy link
Contributor Author

@Ranveer777 Ranveer777 Jun 4, 2024

I was referring to situations in which the string contains at least one uppercase character.

we can only talk if we run the benchmarks.

True

Copy link
Contributor Author

@Ranveer777 Ranveer777 Jun 4, 2024

We can have toNormalisedLower take a buffer (say 2KB) which we allocate on the stack of the caller, thus removing that allocation cost when the string is not already lowercase.

Could you please clarify if you are referring to passing the input string as a buffer to toNormalisedLower?
Apologies, but I'm having difficulty understanding your explanation.

Copy link
Contributor

@colega colega Jun 4, 2024

IMO, we should commit to something working and not terrible at performance, and then we can nerd snipe each other with alternative implementations of this method, benchmarking specifically this method (instead of the entire matcher).

@Ranveer777 what @bboreham is suggesting is to define an array, say var arr [2048]byte, create a slice backed by that array with out := arr[:0], and build the string on that slice as we proceed, something like:

func toNormalisedLower(s string) string {
	var (
		arr [2048]byte
		b   = bytes.NewBuffer(arr[:0])
		pos int
	)
	hasUpper := false
	for i := 0; i < len(s); i++ {
		c := s[i]
		if c >= utf8.RuneSelf {
			return strings.Map(unicode.ToLower, norm.NFKD.String(s))
		}
		if 'A' <= c && c <= 'Z' {
			hasUpper = true
			if pos < i {
				b.WriteString(s[pos:i])
			}
			c += 'a' - 'A'
			b.WriteByte(c)
			pos = i + 1
		}
	}
	if !hasUpper {
		return s
	}
	if pos < len(s) {
		b.WriteString(s[pos:])
	}
	return b.String()
}
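
For comparison, here is a sketch of the variant in which the caller owns the buffer, which is closer to how the original suggestion was worded. The name lowerASCIIInto and its signature are assumptions made for illustration, not an API from this PR. Whether the caller's array actually stays on the stack is decided by the compiler's escape analysis.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// lowerASCIIInto lower-cases s into buf, reusing buf's capacity.
// It returns ok=false as soon as it meets a non-ASCII byte, so the caller
// can fall back to the normalizing slow path.
func lowerASCIIInto(buf []byte, s string) (out []byte, ok bool) {
	out = buf[:0]
	for i := 0; i < len(s); i++ {
		c := s[i]
		if c >= utf8.RuneSelf {
			return out[:0], false
		}
		if 'A' <= c && c <= 'Z' {
			c += 'a' - 'A'
		}
		out = append(out, c) // grows on the heap only if s exceeds cap(buf)
	}
	return out, true
}

func main() {
	values := map[string]struct{}{"foobar": {}}

	var arr [2048]byte // caller-owned scratch space
	out, ok := lowerASCIIInto(arr[:0], "FooBar")
	if ok {
		// Indexing a map with string(out) lets the Go compiler
		// skip the string allocation for the lookup.
		_, found := values[string(out)]
		fmt.Println(string(out), found)
	}
}
```

A design note: this shape only helps when the result is consumed immediately, for example as a map lookup key. Returning it as a string to be stored would copy the bytes again.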

Copy link
Contributor Author

@Ranveer777 what @bboreham is suggesting is to define an array, say var arr byte[2048] and create a slice using that array out := arr[:0], and build the string on that slice as we proceed, something like:

Thank you for clarifying. I initially read it as a suggestion for the committed implementation.

IMO, we should commit to something working and not terrible at performance, and then we can nerd snipe each other with alternative implementations of this method, benchmarking specifically this method (instead of the entire matcher).

👍

Comment on lines 101 to 107
toNormalisedLowerTestCases = map[string]string{
"foo": "foo",
"AAAAAAAAAAAAAAAAAAAAAAAA": "aaaaaaaaaaaaaaaaaaaaaaaa",
"cccccccccccccccccccccccC": "cccccccccccccccccccccccc",
"ſſſſſſſſſſſſſſſſſſſſſſſſS": "sssssssssssssssssssssssss",
"ſſAſſa": "ssassa",
}
Copy link
Contributor

This is only used in TestToNormalisedLower, I would move the definition there to make the test easier to read & run.

@Ranveer777
Copy link
Contributor Author

I'm using the benchmark below to assess all the alternative implementations.

func BenchmarkToNormalisedLower(b *testing.B) {
	charset := []rune("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZäöüßÄÖÜſ")
	listOfRandomString := make([]string, 20)
	for i := 0; i < 20; i++ {
		randomString := make([]rune, 100000)
		for i := range randomString {
			randomString[i] = charset[rand.Intn(len(charset))]
		}
		listOfRandomString[i] = string(randomString)
	}
	comps := []struct {
		fun func(string) string
	}{
		{toNormalisedLower1},
		{toNormalisedLower2},
		{toNormalisedLower3},
	}
	for _, comp := range comps {
		for _, randString := range listOfRandomString {
			b.Run(randString[:32], func(b *testing.B) {
				// Run the function b.N times so the benchmark measures it.
				for n := 0; n < b.N; n++ {
					_ = comp.fun(randString)
				}
			})
		}
	}
}

@colega @bboreham @pracucci
What are your insights on this approach?

@colega
Copy link
Contributor

colega commented Jun 5, 2024

I don't like the idea of randomness in the input of benchmarks.

I'd just use different inputs with the combinations of:

  • len=10, 100, 1000, 4000 (last one is intentionally higher than the on-stack buffer that @bboreham proposed)
  • lowercase only, first uppercase, last uppercase, all uppercase
  • ascii only, or also unicode.

I.e., I would do this:

func BenchmarkToNormalizedLower(b *testing.B) {
	benchCase := func(l int, uppercase string, asciiOnly bool, alt int) string {
		chars := "abcdefghijklmnopqrstuvwxyz"
		if !asciiOnly {
			chars = "aаbбcвdгeдfеgёhжiзjиkйlкmлnмoнpоqпrрsсtтuуvфwхxцyчzш"
		}
		// Swap the alphabet to make alternatives.
		chars = chars[alt%len(chars):] + chars[:alt%len(chars)]

		str := strings.Repeat(chars, l/len(chars)+1)[:l]
		switch uppercase {
		case "first":
			return strings.ToUpper(str[:1]) + str[1:]
		case "last":
			return str[:len(str)-1] + strings.ToUpper(str[len(str)-1:])
		case "all":
			return strings.ToUpper(str)
		case "none":
			return str
		default:
			panic("invalid uppercase")
		}
	}

	for _, l := range []int{10, 100, 1000, 4000} {
		b.Run(fmt.Sprintf("length=%d", l), func(b *testing.B) {
			for _, uppercase := range []string{"none", "first", "last", "all"} {
				b.Run("uppercase="+uppercase, func(b *testing.B) {
					for _, asciiOnly := range []bool{true, false} {
						b.Run(fmt.Sprintf("ascii=%t", asciiOnly), func(b *testing.B) {
							inputs := make([]string, 10)
							for i := range inputs {
								inputs[i] = benchCase(l, uppercase, asciiOnly, i)
							}
							b.ResetTimer()
							for n := 0; n < b.N; n++ {
								toNormalisedLower(inputs[n%len(inputs)])
							}
						})
					}
				})
			}
		})
	}
}
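
One detail of the input generator above may be worth a second look: chars[alt%len(chars):] and [:l] slice by byte offset, so with the mixed Latin/Cyrillic alphabet they can cut a two-byte rune in half and yield invalid UTF-8. If valid input is preferred, a rune-based variant could look like the sketch below. The helpers rotateRunes and repeatToRunes are hypothetical, and they count length in runes rather than bytes, which changes what the length= labels mean.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// rotateRunes rotates s to the left by n runes.
func rotateRunes(s string, n int) string {
	r := []rune(s)
	if len(r) == 0 {
		return s
	}
	n = ((n % len(r)) + len(r)) % len(r) // also tolerate negative n
	return string(r[n:]) + string(r[:n])
}

// repeatToRunes repeats s until the result is exactly l runes long.
func repeatToRunes(s string, l int) string {
	r := []rune(s)
	if len(r) == 0 || l <= 0 {
		return ""
	}
	out := make([]rune, 0, l+len(r))
	for len(out) < l {
		out = append(out, r...)
	}
	return string(out[:l])
}

func main() {
	chars := "a\u0430b\u0431" // Latin a, Cyrillic а, Latin b, Cyrillic б

	byteRotated := chars[2:] + chars[:2] // byte offset 2 falls inside Cyrillic а
	runeRotated := rotateRunes(chars, 2)

	fmt.Println(utf8.ValidString(byteRotated)) // false
	fmt.Println(utf8.ValidString(runeRotated)) // true
	fmt.Println(repeatToRunes(runeRotated, 6))
}
```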

@Ranveer777
Copy link
Contributor Author

Ranveer777 commented Jun 5, 2024

Looks good to me @colega.

  • lowercase only, first uppercase, last uppercase, all uppercase
  • ascii only, or also unicode.

With random inputs, we would not have been able to cover all of the cases above.

@Ranveer777
Copy link
Contributor Author

I conducted a benchmark test on the implementation committed to the branch and compared it with the benchmark tests of implementations suggested by @colega and @pracucci.

ToNormalizedLower/length=10/uppercase=none/ascii=true-8       42.68n ± 1%    21.45n ±  3%   -49.75% (p=0.002 n=6)  
ToNormalizedLower/length=10/uppercase=none/ascii=false-8      310.2n ± 1%    274.1n ±  0%   -11.65% (p=0.002 n=6)  
ToNormalizedLower/length=10/uppercase=first/ascii=true-8      44.70n ± 1%    90.79n ±  1%  +103.09% (p=0.002 n=6)  
ToNormalizedLower/length=10/uppercase=first/ascii=false-8     321.9n ± 1%    309.9n ±  1%    -3.73% (p=0.002 n=6)  
ToNormalizedLower/length=10/uppercase=last/ascii=true-8       43.21n ± 1%    62.77n ±  1%   +45.26% (p=0.002 n=6)  
ToNormalizedLower/length=10/uppercase=last/ascii=false-8      340.9n ± 1%    304.4n ±  1%   -10.71% (p=0.002 n=6)  
ToNormalizedLower/length=10/uppercase=all/ascii=true-8        51.36n ± 1%    97.97n ±  1%   +90.76% (p=0.002 n=6)
ToNormalizedLower/length=10/uppercase=all/ascii=false-8       337.5n ± 1%    295.9n ±  1%   -12.31% (p=0.002 n=6)
ToNormalizedLower/length=100/uppercase=none/ascii=true-8      138.0n ± 2%    165.1n ±  2%   +19.59% (p=0.002 n=6)
ToNormalizedLower/length=100/uppercase=none/ascii=false-8     3.743µ ± 1%    3.667µ ±  1%    -2.02% (p=0.002 n=6)
ToNormalizedLower/length=100/uppercase=first/ascii=true-8     139.1n ± 2%    296.7n ±  2%  +113.34% (p=0.002 n=6)
ToNormalizedLower/length=100/uppercase=first/ascii=false-8    3.746µ ± 1%    3.900µ ±  3%    +4.12% (p=0.002 n=6)
ToNormalizedLower/length=100/uppercase=last/ascii=true-8      138.8n ± 1%    303.4n ±  0%  +118.62% (p=0.002 n=6)
ToNormalizedLower/length=100/uppercase=last/ascii=false-8     3.768µ ± 1%    3.692µ ±  1%    -2.03% (p=0.002 n=6)
ToNormalizedLower/length=100/uppercase=all/ascii=true-8       259.9n ± 1%    575.4n ±  2%  +121.44% (p=0.002 n=6)
ToNormalizedLower/length=100/uppercase=all/ascii=false-8      3.596µ ± 1%    3.507µ ±  0%    -2.48% (p=0.002 n=6)
ToNormalizedLower/length=1000/uppercase=none/ascii=true-8     1.002µ ± 2%    1.570µ ±  1%   +56.68% (p=0.002 n=6)
ToNormalizedLower/length=1000/uppercase=none/ascii=false-8    31.68µ ± 1%    31.43µ ±  1%    -0.80% (p=0.002 n=6)
ToNormalizedLower/length=1000/uppercase=first/ascii=true-8    1.030µ ± 1%    2.252µ ±  1%  +118.80% (p=0.002 n=6)
ToNormalizedLower/length=1000/uppercase=first/ascii=false-8   31.71µ ± 3%    32.62µ ±  1%         ~ (p=0.065 n=6)
ToNormalizedLower/length=1000/uppercase=last/ascii=true-8     1.006µ ± 1%    2.676µ ±  0%  +166.00% (p=0.002 n=6)
ToNormalizedLower/length=1000/uppercase=last/ascii=false-8    31.71µ ± 1%    31.61µ ±  0%         ~ (p=0.084 n=6)
ToNormalizedLower/length=1000/uppercase=all/ascii=true-8      2.267µ ± 1%    4.697µ ±  1%  +107.21% (p=0.002 n=6)
ToNormalizedLower/length=1000/uppercase=all/ascii=false-8     29.70µ ± 1%    29.30µ ±  1%    -1.36% (p=0.002 n=6)
ToNormalizedLower/length=4000/uppercase=none/ascii=true-8     4.143µ ± 3%    6.289µ ±  1%   +51.82% (p=0.002 n=6)
ToNormalizedLower/length=4000/uppercase=none/ascii=false-8    126.0µ ± 1%    124.7µ ±  0%    -1.06% (p=0.002 n=6)
ToNormalizedLower/length=4000/uppercase=first/ascii=true-8    4.085µ ± 1%    8.939µ ±  1%  +118.81% (p=0.002 n=6)
ToNormalizedLower/length=4000/uppercase=first/ascii=false-8   125.5µ ± 0%    128.6µ ±  0%    +2.46% (p=0.002 n=6)
ToNormalizedLower/length=4000/uppercase=last/ascii=true-8     4.126µ ± 1%   10.763µ ± 11%  +160.86% (p=0.002 n=6)
ToNormalizedLower/length=4000/uppercase=last/ascii=false-8    126.6µ ± 7%    125.2µ ±  2%    -1.13% (p=0.041 n=6)
ToNormalizedLower/length=4000/uppercase=all/ascii=true-8      9.326µ ± 1%   18.084µ ±  2%   +93.91% (p=0.002 n=6)
ToNormalizedLower/length=4000/uppercase=all/ascii=false-8     116.3µ ± 1%    114.8µ ±  0%    -1.28% (p=0.002 n=6)
geomean                                                       1.957µ         2.583µ         +31.96%
ToNormalizedLower/length=10/uppercase=none/ascii=true-8       42.68n ± 1%    15.91n ± 6%  -62.73% (p=0.002 n=6)
ToNormalizedLower/length=10/uppercase=none/ascii=false-8      310.2n ± 1%    259.8n ± 0%  -16.26% (p=0.002 n=6)
ToNormalizedLower/length=10/uppercase=first/ascii=true-8      44.70n ± 1%    47.60n ± 1%   +6.48% (p=0.002 n=6)
ToNormalizedLower/length=10/uppercase=first/ascii=false-8     321.9n ± 1%    303.6n ± 1%   -5.70% (p=0.002 n=6)
ToNormalizedLower/length=10/uppercase=last/ascii=true-8       43.21n ± 1%    51.34n ± 1%  +18.80% (p=0.002 n=6)
ToNormalizedLower/length=10/uppercase=last/ascii=false-8      340.9n ± 1%    294.9n ± 1%  -13.49% (p=0.002 n=6)
ToNormalizedLower/length=10/uppercase=all/ascii=true-8        51.36n ± 1%    57.50n ± 1%  +11.95% (p=0.002 n=6)
ToNormalizedLower/length=10/uppercase=all/ascii=false-8       337.5n ± 1%    295.2n ± 2%  -12.52% (p=0.002 n=6)
ToNormalizedLower/length=100/uppercase=none/ascii=true-8      138.0n ± 2%    117.0n ± 0%  -15.21% (p=0.002 n=6)
ToNormalizedLower/length=100/uppercase=none/ascii=false-8     3.743µ ± 1%    3.681µ ± 1%   -1.66% (p=0.002 n=6)
ToNormalizedLower/length=100/uppercase=first/ascii=true-8     139.1n ± 2%    193.9n ± 0%  +39.45% (p=0.002 n=6)
ToNormalizedLower/length=100/uppercase=first/ascii=false-8    3.746µ ± 1%    3.799µ ± 0%   +1.42% (p=0.002 n=6)
ToNormalizedLower/length=100/uppercase=last/ascii=true-8      138.8n ± 1%    236.9n ± 1%  +70.68% (p=0.002 n=6)
ToNormalizedLower/length=100/uppercase=last/ascii=false-8     3.768µ ± 1%    3.693µ ± 1%   -1.99% (p=0.002 n=6)
ToNormalizedLower/length=100/uppercase=all/ascii=true-8       259.9n ± 1%    331.4n ± 0%  +27.55% (p=0.002 n=6)
ToNormalizedLower/length=100/uppercase=all/ascii=false-8      3.596µ ± 1%    3.498µ ± 0%   -2.71% (p=0.002 n=6)
ToNormalizedLower/length=1000/uppercase=none/ascii=true-8     1.002µ ± 2%    1.060µ ± 4%   +5.82% (p=0.002 n=6)
ToNormalizedLower/length=1000/uppercase=none/ascii=false-8    31.68µ ± 1%    31.55µ ± 2%        ~ (p=0.372 n=6)
ToNormalizedLower/length=1000/uppercase=first/ascii=true-8    1.030µ ± 1%    1.429µ ± 1%  +38.85% (p=0.002 n=6)
ToNormalizedLower/length=1000/uppercase=first/ascii=false-8   31.71µ ± 3%    32.59µ ± 1%   +2.79% (p=0.041 n=6)
ToNormalizedLower/length=1000/uppercase=last/ascii=true-8     1.006µ ± 1%    1.909µ ± 1%  +89.71% (p=0.002 n=6)
ToNormalizedLower/length=1000/uppercase=last/ascii=false-8    31.71µ ± 1%    31.63µ ± 1%        ~ (p=0.461 n=6)
ToNormalizedLower/length=1000/uppercase=all/ascii=true-8      2.267µ ± 1%    2.828µ ± 1%  +24.77% (p=0.002 n=6)
ToNormalizedLower/length=1000/uppercase=all/ascii=false-8     29.70µ ± 1%    29.24µ ± 0%   -1.54% (p=0.002 n=6)
ToNormalizedLower/length=4000/uppercase=none/ascii=true-8     4.143µ ± 3%    4.194µ ± 0%   +1.23% (p=0.002 n=6)
ToNormalizedLower/length=4000/uppercase=none/ascii=false-8    126.0µ ± 1%    124.7µ ± 0%   -1.04% (p=0.002 n=6)
ToNormalizedLower/length=4000/uppercase=first/ascii=true-8    4.085µ ± 1%    5.838µ ± 6%  +42.90% (p=0.002 n=6)
ToNormalizedLower/length=4000/uppercase=first/ascii=false-8   125.5µ ± 0%    129.7µ ± 4%   +3.34% (p=0.002 n=6)
ToNormalizedLower/length=4000/uppercase=last/ascii=true-8     4.126µ ± 1%    7.639µ ± 1%  +85.13% (p=0.002 n=6)
ToNormalizedLower/length=4000/uppercase=last/ascii=false-8    126.6µ ± 7%    125.2µ ± 1%   -1.09% (p=0.002 n=6)
ToNormalizedLower/length=4000/uppercase=all/ascii=true-8      9.326µ ± 1%   11.578µ ± 4%  +24.15% (p=0.002 n=6)
ToNormalizedLower/length=4000/uppercase=all/ascii=false-8     116.3µ ± 1%    116.3µ ± 2%        ~ (p=0.937 n=6)
geomean                                                       1.957µ         2.097µ        +7.14%

@colega
Copy link
Contributor

colega commented Jun 6, 2024

Thank you, @Ranveer777, I think both results look worse than the current one. (BTW, you can compare all of them if you run benchstat current.txt foo.txt bar.txt and keep the headers of the table. The missing headers misled me, so I posted a wrong comment that I later removed.)

@bboreham
Copy link
Member

bboreham commented Jun 6, 2024

Thanks for all the work. I think we can merge the improvement so far and start another PR if there are further ideas.

However, a recent update to dependencies has created a merge conflict, which I am not confident resolving in the GitHub UI.
@Ranveer777 could you rebase please?

@Ranveer777
Copy link
Contributor Author

@colega @bboreham @pracucci
I appreciate your guidance through the changes. Thank you.

@Ranveer777
Copy link
Contributor Author

@bboreham
I have successfully resolved the conflict and performed a rebase on the branch.

Copy link
Member

@bboreham bboreham left a comment

LGTM. I would like to see it optimised further, but let's get the bugfix in and move from there.

@colega
Copy link
Contributor

colega commented Jun 7, 2024

The test that failed on the Windows build seems to be unrelated: FAIL: TestDBReadOnly_Querier_NoAlteration (1.76s)

Signed-off-by: RA <ranveeravhad777@gmail.com>
Signed-off-by: RA <ranveeravhad777@gmail.com>
Signed-off-by: RA <ranveeravhad777@gmail.com>
…nversion

1) For ASCII strings: The method converts the input string from upper to lower case.
2) For Non-ASCII strings: The method normalizes the input string using 'Unicode Normalization Form D' and then converts it to lower case.

Signed-off-by: RA <ranveeravhad777@gmail.com>
Signed-off-by: RA <ranveeravhad777@gmail.com>
Signed-off-by: RA <ranveeravhad777@gmail.com>
Signed-off-by: RA <ranveeravhad777@gmail.com>
@Ranveer777
Copy link
Contributor Author

Ranveer777 commented Jun 9, 2024

It appears that the Windows build has now passed successfully.

@bboreham bboreham changed the title model: Normalized the string to standardized form while add and Matches for MultiStringMapMatcher [BUGFIX] FastRegexpMatcher: do Unicode normalization as part of case-insensitive comparison Jun 10, 2024
@bboreham bboreham merged commit 39902ba into prometheus:main Jun 10, 2024
25 checks passed
Development

Successfully merging this pull request may close these issues.

Case insensitive FastRegexMatcher doesn't check all folds
4 participants