bytes: IndexAny is slower than a naive implementation in some cases #60550
Any algorithm is going to have cases where it is better or worse. The question is how to get good average performance for typical cases. How does your implementation fare on the benchmarks that are already in the bytes package?
I agree. When I searched for existing benchmarks, I didn't find any for `IndexAny`. But in all benchmarks with a small number of separators, even if they are all ASCII, the naive version is (much) faster, except when the first occurrence is really close to the beginning (less than 4); in that case they're very similar. And as the distance to the first occurrence increases, so does the advantage of the naive version.
I'm really surprised by what you say, because I discovered this naive version precisely because I wanted to treat only the case of ASCII separators, and I was surprised that even in that case it was faster. If we only want to handle ASCII delimiters, the following works:

```go
func IndexAnyByte(s, chars []byte) int {
	min := -1
	for _, c := range chars {
		// uint(-1) is the largest uint, so a "not found" result (-1)
		// never beats the current minimum, while any real index beats
		// an initial min of -1.
		if i := bytes.IndexByte(s, c); uint(i) < uint(min) {
			min = i
		}
	}
	return min
}
```

I really don't understand the choice of the current implementation.
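For illustration, with hypothetical inputs (assuming `IndexAnyByte` above is in scope):

```go
s := []byte("key=value;next")
fmt.Println(IndexAnyByte(s, []byte("=;"))) // 3: '=' at index 3 beats ';' at index 9
fmt.Println(IndexAnyByte(s, []byte("#!"))) // -1: neither delimiter is present
```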
```
go test -test.bench=BenchmarkIndexAny bytes
```
This surprises me. The current algorithm only looks through the string once. Do you have an explanation for why it is faster to look through it multiple times? Is this a memory caching effect?
I got curious so I'm running the benchmarks in the `bytes` package.
Here's what I got:
This is pretty interesting, assuming it's reproducible. Largely the implementation in the bytes package is faster (sometimes dramatically so). But there are a number of cases where the naive version does better. Picking them out:
I don't know what's particularly interesting about these cases.
@mknyszek I have the impression that the cases where the naive version wins are those where the data size is much larger than the number of separators, it seems to me. And this feels closer to actual use.
In any case we have a double loop, over the separators and over the data. Let n be the size of the data and k the number of separators. Theoretically, without taking optimizations into account, the naive version can only win if each `IndexByte` call stops early, which happens when the first occurrences are close to the beginning. But I don't think this explains the dominance of the naive version in my benchmarks, since I put the "found" separator in the last position. A possible explanation for the advantage of the naive version, in situations where the number of separators is smaller than the first occurrence, is the fact that `bytes.IndexByte` is implemented in highly optimized assembly.
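For comparison, a minimal sketch of the single-pass, set-based strategy (similar in spirit to the `asciiSet` used internally by the standard implementation; `indexAnySet` is an illustrative name, not a real API):

```go
// indexAnySet returns the index of the first byte of s that belongs to
// chars, or -1. It builds a 256-bit membership set once, then scans the
// data a single time, whatever the number of separators.
func indexAnySet(s []byte, chars string) int {
	var set [8]uint32 // one bit per possible byte value
	for i := 0; i < len(chars); i++ {
		c := chars[i]
		set[c/32] |= 1 << (c % 32)
	}
	for i, b := range s {
		if set[b/32]&(1<<(b%32)) != 0 {
			return i
		}
	}
	return -1
}
```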
I just realized last night that I was wrong in my previous message and that the benchmarks, both mine and the two in the standard library, are no good. They all suffer from the same problem: it's not normal to consider only cases where the separator is absent from the data or where it is in the last position. The situation is actually much more complicated, as the positions of all the separators come into play. We can try to make theoretical predictions by approximating the behavior of the two implementations (see the cost sketch below), but I think a realistic benchmark should limit itself to 2, 3 or 4 separators (which seem to be the common use cases on GitHub) and to data which is scanned to find all the separators one after the other. I'll try to design a more realistic benchmark and see how it goes.
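A rough cost sketch along these lines (my notation; it ignores constant factors and the SIMD advantage of `IndexByte`):

```go
// n   = len(data); k = number of separators
// p_c = index of the first occurrence of separator c in data (n if absent)
// p   = min(p_c) over all c — the value IndexAny returns
//
// standard (set-based, single pass): ~ k + p bytes examined
// naive (one IndexByte per char):    ~ sum of p_c over all c
// naive with a "cut" optimization:   ~ sum of min(p_c, best so far)
```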
Using the following benchmark:

```go
func BenchmarkIndexAnyCSV(b *testing.B) {
	// A CSV dataset with 7 columns (of different lengths) and 10 rows.
	// Some of the fields (gender) are empty.
	// We will check with this data repeated 1, 10 and 100 times.
	// This data is fake and was generated with https://www.convertcsv.com/generate-test-data.htm.
	var csvdata = []byte(`seq,name,adress,email,birthdate,gender,phone
1,"May Moody",adress,gofpuwvi@wolamme.gp,08/03/2014,F,(313) 321-8728
2,"Adele Martinez",adress,nohvahac@du.ai,02/09/1993,,(785) 834-8091
3,"Zachary Brown",adress,bo@covih.yt,07/17/1901,M,(308) 643-7802
4,"Lucy Lindsey",adress,sa@oh.fi,04/22/2023,F,(375) 395-3248
5,"Lucy Oliver",adress,gemjivefa@ha.bd,07/21/1995,M,(422) 714-5921
6,"Hilda Mullins",adress,kiujma@bok.kh,05/13/1922,,(716) 645-7631
7,"Mollie Parker",adress,kir@kahamrew.io,03/18/1960,M,(534) 818-5916
8,"Katie Adkins",adress,konocizah@jibfu.et,02/01/1965,F,(828) 544-2687
9,"Maurice Carroll",adress,fosmutlo@erpewi.dm,05/01/1958,M,(478) 249-3000
10,"Lettie Harvey",adress,lircuged@pe.bs,07/04/2008,,(571) 628-2216`)
	// We will check with the first 2, 3 and 4 delimiters of the following list.
	// The 4th is not present in the data.
	var delimiters = ",\n\";"
	for _, k := range []int{1, 10, 100} {
		for _, j := range []int{2, 3, 4} {
			b.Run(fmt.Sprintf("%d:%d", k, j), func(b *testing.B) {
				data := bytes.Repeat(csvdata, k)
				separators := delimiters[:j]
				for i := 0; i < b.N; i++ {
					p, q := 0, 0
					for {
						if q = IndexAny(data[p:], separators); q == -1 {
							break
						}
						p += q + 1
					}
				}
			})
		}
	}
}
```

I get the following results:

```
$ benchstat standard.txt naive.txt
name old time/op new time/op delta
IndexAnyCSV/1:2-8 1.42µs ± 2% 1.30µs ± 1% -8.29% (p=0.029 n=4+4)
IndexAnyCSV/1:3-8 1.68µs ± 2% 2.34µs ± 1% +38.98% (p=0.029 n=4+4)
IndexAnyCSV/1:4-8 1.79µs ± 2% 4.13µs ± 2% +130.71% (p=0.029 n=4+4)
IndexAnyCSV/10:2-8 13.9µs ± 4% 14.0µs ± 4% ~ (p=0.886 n=4+4)
IndexAnyCSV/10:3-8 16.4µs ± 2% 24.1µs ± 1% +46.48% (p=0.029 n=4+4)
IndexAnyCSV/10:4-8 17.8µs ± 2% 119.0µs ± 1% +566.80% (p=0.029 n=4+4)
IndexAnyCSV/100:2-8 139µs ± 4% 138µs ± 5% ~ (p=0.686 n=4+4)
IndexAnyCSV/100:3-8 164µs ± 2% 243µs ± 2% +48.53% (p=0.029 n=4+4)
IndexAnyCSV/100:4-8 179µs ± 2% 11008µs ± 3% +6036.16% (p=0.029 n=4+4)
```

My conclusion is that the current version is better than the naive version in real-life situations. So there's no need to look any further. On the other hand, I think we need to add a more realistic benchmark (like this one, for example) to the standard library, for both `bytes` and `strings`.
FTR, we try to aggregate more realistic benchmarks in golang.org/x/benchmarks. It sounds like there might not be anything left to do in this issue? Closing for now, please let me know if you disagree!
@kpym Thanks for the investigation.
+1, thanks for the investigation!
Just for the sake of completeness, here is a better "naive" version, which is still slower than the standard `bytes.IndexAny`:

```go
func IndexAny(s []byte, chars string) int {
	min, cut := -1, len(s)
	for _, c := range chars {
		// Only search up to the best match found so far: a later
		// separator can only win if it occurs strictly before min.
		if i := bytes.IndexRune(s[:cut], c); uint(i) < uint(min) {
			min = i
			cut = i
		}
	}
	return min
}
```
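For instance, with hypothetical data (assuming the function above is in scope):

```go
s := []byte("abcdef,ghij;k")
// ',' is found at index 6 first, so the search for ';' is limited to
// s[:6] and fails fast; the result is 6.
fmt.Println(IndexAny(s, ",;")) // 6
```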
What version of Go are you using (`go version`)?

What operating system and processor architecture are you using (`go env`)?

What did you do?

I consider a naive implementation of `bytes.IndexAny` along the lines of the versions shown in the discussion above. I benchmark it for 10 delimiters on data where the first delimiter appears at position 1000.
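A minimal sketch of such a benchmark (my reconstruction: the data is synthetic, and `IndexAny` is assumed to be the naive version from the discussion above):

```go
func BenchmarkIndexAnyNaive(b *testing.B) {
	// 1000 filler bytes, then the first of the 10 digit delimiters.
	data := append(bytes.Repeat([]byte("x"), 1000), '0')
	const delimiters = "0123456789"
	for i := 0; i < b.N; i++ {
		if IndexAny(data, delimiters) != 1000 {
			b.Fatal("unexpected index")
		}
	}
}
```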
What did you expect to see?

That `bytes.IndexAny` is faster than this "naive" `IndexAny`.

What did you see instead?

This "naive" implementation is more than 10 times faster than the standard `bytes.IndexAny`. Am I doing something wrong? Is this relevant?
Remark: If we take a lot of delimiters (1000), the standard function is 30% better (on my computer).