bytes: tests are slow on arm #29001
I'm also puzzled as to why some packages on some builders take a very long time to test. Thanks for opening the issue!
Running https://gist.github.com/tklauser/576094d1635c1d3d18793ab3fbef9abf
There's one obvious outlier:
I suspect that's due to missing SIMD implementations of |
Thanks for testing.
That sounds like a reasonable explanation. I know nothing about arm instructions, so I would not know whether it is possible to make this faster. If not, we should make TestCompareBytes shorter on arm: 2 minutes is too long for a single test. Leaving this for others to decide.
Alex
Unfortunately, ARM vector instructions are not yet available in the Go assembler (#7300). However, from a quick look at the current implementation in |
Change https://golang.org/cl/153557 mentions this issue: |
Use word-wise comparison for aligned buffers. Try to align buffers if possible, otherwise fall back to the current byte-wise comparison.

The slight increase in CompareBytesEmpty is due to the additional empty and alignment checks.

name                           old time/op    new time/op    delta
CompareBytesEqual-4            79.6ns ± 0%    67.3ns ± 1%    -15.46%  (p=0.000 n=10+10)
CompareBytesToNil-4            31.0ns ± 1%    30.4ns ± 1%     -2.00%  (p=0.000 n=10+10)
CompareBytesEmpty-4            25.7ns ± 1%    28.8ns ± 0%    +12.32%  (p=0.000 n=10+9)
CompareBytesIdentical-4        34.2ns ± 1%    29.2ns ± 1%    -14.63%  (p=0.000 n=10+10)
CompareBytesSameLength-4       49.9ns ± 0%    47.9ns ± 1%     -3.95%  (p=0.000 n=9+10)
CompareBytesDifferentLength-4  49.8ns ± 0%    48.2ns ± 1%     -3.21%  (p=0.000 n=9+9)
CompareBytesBigUnaligned-4     5.71ms ± 1%    5.99ms ± 4%     +4.86%  (p=0.001 n=10+10)
CompareBytesBig-4              4.97ms ± 1%    2.28ms ± 1%    -54.23%  (p=0.000 n=10+10)
CompareBytesBigIdentical-4     27.4ns ± 1%    27.3ns ± 1%       ~     (p=0.394 n=10+10)

name                           old speed      new speed      delta
CompareBytesBigUnaligned-4     184MB/s ± 1%   175MB/s ± 4%    -4.56%  (p=0.001 n=10+10)
CompareBytesBig-4              211MB/s ± 1%   460MB/s ± 1%  +118.46%  (p=0.000 n=10+10)
CompareBytesBigIdentical-4     38.4TB/s ± 1%  38.4TB/s ± 1%     ~     (p=0.684 n=10+10)

This reduces time for TestCompareBytes from ~90 sec to ~40 sec on a linux-arm builder via gomote.

Updates golang#29001
Change-Id: I25f148739b9ccb7cb1fc97b3d8763549b0a66c16
Change https://golang.org/cl/165338 mentions this issue: |
Use word-wise comparison for aligned buffers, otherwise fall back to the current byte-wise comparison.

name                           old time/op    new time/op    delta
BytesCompare/1-4               41.3ns ± 0%    36.4ns ± 1%    -11.73%  (p=0.008 n=5+5)
BytesCompare/2-4               39.5ns ± 0%    39.5ns ± 1%       ~     (p=0.960 n=5+5)
BytesCompare/4-4               45.3ns ± 0%    41.0ns ± 1%     -9.40%  (p=0.008 n=5+5)
BytesCompare/8-4               64.8ns ± 1%    44.7ns ± 0%    -31.12%  (p=0.008 n=5+5)
BytesCompare/16-4              86.3ns ± 0%    55.1ns ± 0%    -36.21%  (p=0.008 n=5+5)
BytesCompare/32-4               135ns ± 0%      70ns ± 1%    -47.73%  (p=0.008 n=5+5)
BytesCompare/64-4               231ns ± 1%      99ns ± 0%    -57.27%  (p=0.016 n=5+4)
BytesCompare/128-4              424ns ± 0%     147ns ± 0%    -65.31%  (p=0.000 n=4+5)
BytesCompare/256-4              810ns ± 0%     243ns ± 0%    -69.96%  (p=0.008 n=5+5)
BytesCompare/512-4             1.59µs ± 0%    0.44µs ± 0%    -72.43%  (p=0.008 n=5+5)
BytesCompare/1024-4            3.14µs ± 1%    0.83µs ± 1%    -73.56%  (p=0.008 n=5+5)
BytesCompare/2048-4            6.23µs ± 0%    1.61µs ± 1%    -74.21%  (p=0.008 n=5+5)
CompareBytesEqual-4            79.4ns ± 0%    52.2ns ± 0%    -34.23%  (p=0.008 n=5+5)
CompareBytesToNil-4            31.0ns ± 0%    30.3ns ± 0%     -2.32%  (p=0.008 n=5+5)
CompareBytesEmpty-4            25.7ns ± 0%    25.7ns ± 0%       ~     (p=0.556 n=4+5)
CompareBytesIdentical-4        25.7ns ± 0%    25.7ns ± 0%       ~     (p=1.000 n=5+5)
CompareBytesSameLength-4       49.1ns ± 0%    48.5ns ± 0%     -1.26%  (p=0.008 n=5+5)
CompareBytesDifferentLength-4  49.8ns ± 1%    49.3ns ± 0%     -1.08%  (p=0.008 n=5+5)
CompareBytesBigUnaligned-4     5.71ms ± 1%    5.68ms ± 1%       ~     (p=0.222 n=5+5)
CompareBytesBig-4              4.95ms ± 0%    2.28ms ± 1%    -53.81%  (p=0.008 n=5+5)
CompareBytesBigIdentical-4     27.2ns ± 1%    27.3ns ± 1%       ~     (p=0.310 n=5+5)

name                           old speed      new speed      delta
CompareBytesBigUnaligned-4     184MB/s ± 1%   185MB/s ± 1%      ~     (p=0.222 n=5+5)
CompareBytesBig-4              212MB/s ± 0%   459MB/s ± 1%  +116.51%  (p=0.008 n=5+5)
CompareBytesBigIdentical-4     38.5TB/s ± 0%  38.4TB/s ± 1%     ~     (p=0.421 n=5+5)

Also, this reduces time for TestCompareBytes by about 20 sec on a linux-arm builder via gomote.

Updates #29001
Change-Id: I25f148739b9ccb7cb1fc97b3d8763549b0a66c16
Reviewed-on: https://go-review.googlesource.com/c/go/+/165338
Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
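To illustrate the idea behind the commit message above, here is a minimal Go sketch of a word-wise compare with a byte-wise fallback. It is not the code from the CL, which is written in arm assembly; the names `aligned4` and `compareWordwise` are made up for this example, and the 4-byte word size is an assumption matching 32-bit arm.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"unsafe"
)

// aligned4 reports whether b starts on a 4-byte boundary.
// Hypothetical helper; an empty slice counts as aligned.
func aligned4(b []byte) bool {
	return len(b) == 0 || uintptr(unsafe.Pointer(&b[0]))%4 == 0
}

// compareWordwise returns -1, 0 or +1, like bytes.Compare.
// When both slices are word-aligned it skips over equal 4-byte
// words first; otherwise it compares byte by byte from the start.
func compareWordwise(a, b []byte) int {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	i := 0
	if aligned4(a) && aligned4(b) {
		for ; i+4 <= n; i += 4 {
			wa := binary.LittleEndian.Uint32(a[i:])
			wb := binary.LittleEndian.Uint32(b[i:])
			if wa != wb {
				break // the byte loop below finds the differing byte
			}
		}
	}
	// Byte-wise tail, also used as the fallback for unaligned input.
	for ; i < n; i++ {
		if a[i] != b[i] {
			if a[i] < b[i] {
				return -1
			}
			return 1
		}
	}
	// Common prefix is equal: the shorter slice sorts first.
	switch {
	case len(a) < len(b):
		return -1
	case len(a) > len(b):
		return 1
	}
	return 0
}

func main() {
	a := []byte("word-wise comparison!")
	b := []byte("word-wise comparisons")
	fmt.Println(compareWordwise(a, b))
}
```

The word loop only tests words for equality, so byte order does not matter there; ordering is always decided by the byte loop.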
Testing
one of the biggest contributors is
It seems the main culprit there is the |
Follow CL 165338 and use word-wise comparison for aligned buffers in Equal, otherwise fall back to the current byte-wise comparison. TODO benchmarks Updates golang#29001 Change-Id: I73be501c18e67d211ed19da7771b4f254254e609
@tklauser your hard work is paying off. A recent windows-arm builder run
https://build.golang.org/log/32d08cb2c389db61937c461cbcf64172bd00bc84
is more than twice as fast as before. Thank you very much.
Alex
Change https://golang.org/cl/166677 mentions this issue: |
Move the shared code into bytealg.memeqbody. This will allow optimizations (e.g. for #29001) to be implemented in a single function. Change-Id: Iaa34ddeb7068d92c35a8b4e581b7fd92da56535c Reviewed-on: https://go-review.googlesource.com/c/go/+/166677 Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com>
Follow CL 165338 and use word-wise comparison for aligned buffers in Equal on arm, otherwise fall back to the current byte-wise comparison. TODO benchmarks Updates golang#29001 Change-Id: I73be501c18e67d211ed19da7771b4f254254e609
Change https://golang.org/cl/167557 mentions this issue: |
Change https://golang.org/cl/167798 mentions this issue: |
Add Benchmark(Index|Count)Hard[1-3] in preparation for implementing Index and Count in assembly on arm. Updates #29001 Change-Id: I2a9701892190e8d91de069c2f5a7f5bd3544c6c2 Reviewed-on: https://go-review.googlesource.com/c/go/+/167798 Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Follow CL 165338 and use word-wise comparison for aligned buffers in Equal on arm, otherwise fall back to the current byte-wise comparison.

name                 old time/op    new time/op    delta
Equal/0-4            25.7ns ± 1%    23.5ns ± 1%     -8.78%  (p=0.000 n=10+10)
Equal/1-4            65.8ns ± 0%    60.1ns ± 1%     -8.69%  (p=0.000 n=10+9)
Equal/6-4            82.9ns ± 1%    86.7ns ± 0%     +4.59%  (p=0.000 n=10+10)
Equal/9-4            90.0ns ± 0%   101.0ns ± 0%    +12.18%  (p=0.000 n=9+10)
Equal/15-4            108ns ± 0%     119ns ± 0%    +10.19%  (p=0.000 n=8+8)
Equal/16-4            111ns ± 0%      82ns ± 0%    -26.37%  (p=0.000 n=8+10)
Equal/20-4            124ns ± 1%      87ns ± 1%    -29.94%  (p=0.000 n=9+10)
Equal/32-4            160ns ± 1%      97ns ± 1%    -39.40%  (p=0.000 n=10+10)
Equal/4K-4           14.0µs ± 0%     3.6µs ± 1%    -74.57%  (p=0.000 n=9+10)
Equal/4M-4           12.8ms ± 1%     3.2ms ± 0%    -74.93%  (p=0.000 n=9+9)
Equal/64M-4           204ms ± 1%      51ms ± 0%    -74.78%  (p=0.000 n=10+10)
EqualPort/1-4        47.0ns ± 1%    46.8ns ± 0%     -0.40%  (p=0.015 n=10+6)
EqualPort/6-4        82.6ns ± 1%    81.9ns ± 1%     -0.87%  (p=0.002 n=10+10)
EqualPort/32-4        232ns ± 0%     232ns ± 0%       ~     (p=0.496 n=8+10)
EqualPort/4K-4       29.0µs ± 1%    29.0µs ± 1%       ~     (p=0.604 n=9+10)
EqualPort/4M-4       24.0ms ± 1%    23.8ms ± 0%     -0.65%  (p=0.001 n=9+9)
EqualPort/64M-4       383ms ± 1%     382ms ± 0%       ~     (p=0.218 n=10+10)
CompareBytesEqual-4  61.2ns ± 1%    61.0ns ± 1%       ~     (p=0.539 n=10+10)

name                 old speed      new speed      delta
Equal/1-4            15.2MB/s ± 0%  16.6MB/s ± 1%    +9.52%  (p=0.000 n=10+9)
Equal/6-4            72.4MB/s ± 1%  69.2MB/s ± 0%    -4.40%  (p=0.000 n=10+10)
Equal/9-4             100MB/s ± 0%    89MB/s ± 0%   -11.40%  (p=0.000 n=9+10)
Equal/15-4            138MB/s ± 1%   125MB/s ± 1%    -9.41%  (p=0.000 n=10+10)
Equal/16-4            144MB/s ± 1%   196MB/s ± 0%   +36.41%  (p=0.000 n=10+10)
Equal/20-4            162MB/s ± 1%   231MB/s ± 1%   +42.98%  (p=0.000 n=9+10)
Equal/32-4            200MB/s ± 1%   331MB/s ± 1%   +65.64%  (p=0.000 n=10+10)
Equal/4K-4            292MB/s ± 0%  1149MB/s ± 1%  +293.19%  (p=0.000 n=9+10)
Equal/4M-4            328MB/s ± 1%  1307MB/s ± 0%  +298.87%  (p=0.000 n=9+9)
Equal/64M-4           329MB/s ± 1%  1306MB/s ± 0%  +296.56%  (p=0.000 n=10+10)
EqualPort/1-4        21.3MB/s ± 1%  21.4MB/s ± 0%    +0.42%  (p=0.002 n=10+9)
EqualPort/6-4        72.6MB/s ± 1%  73.2MB/s ± 1%    +0.87%  (p=0.003 n=10+10)
EqualPort/32-4        138MB/s ± 0%   138MB/s ± 0%      ~     (p=0.953 n=9+10)
EqualPort/4K-4        141MB/s ± 1%   141MB/s ± 1%      ~     (p=0.382 n=10+10)
EqualPort/4M-4        175MB/s ± 1%   176MB/s ± 0%    +0.65%  (p=0.001 n=9+9)
EqualPort/64M-4       175MB/s ± 1%   176MB/s ± 0%      ~     (p=0.225 n=10+10)

The 5-12% decrease in performance on Equal/{6,9,15} is due to the benchmarks splitting the bytes buffer in half. The b argument to Equal then ends up being unaligned and thus the fast word-wise compare doesn't kick in.

Updates #29001
Change-Id: I73be501c18e67d211ed19da7771b4f254254e609
Reviewed-on: https://go-review.googlesource.com/c/go/+/167557
Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
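The alignment explanation for Equal/{6,9,15} can be checked with a little arithmetic. The sketch below assumes the benchmark buffer of 2*n bytes itself starts on a word boundary and that a word is 4 bytes on arm; `secondHalfAligned` is a hypothetical helper, not something from the Go tree.

```go
package main

import "fmt"

// wordSize is the assumed word size in bytes on 32-bit arm.
const wordSize = 4

// secondHalfAligned reports whether the second half of a 2*n-byte
// buffer starts on a word boundary, assuming the buffer itself does.
// The second half begins at offset n, so only n%wordSize matters.
func secondHalfAligned(n int) bool {
	return n%wordSize == 0
}

func main() {
	for _, n := range []int{6, 9, 15, 16, 20, 32} {
		fmt.Printf("Equal/%d: second operand word-aligned: %v\n",
			n, secondHalfAligned(n))
	}
}
```

Under these assumptions the sizes 6, 9 and 15 put the second operand at an unaligned offset, while 16, 20 and 32 keep it aligned, which matches which rows regress and which improve in the table above.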
Change https://golang.org/cl/167939 mentions this issue: |
Move the shared code of IndexByte and IndexByteString into indexbytebody. This will allow optimizations (e.g. for #29001) to be implemented in a single function. Change-Id: I1d550da8eb65f95e492a460a12058cc35b1162b6 Reviewed-on: https://go-review.googlesource.com/c/go/+/167939 Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
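As a rough Go analogy for the refactoring described above (the real change is in arm assembly), both entry points can delegate to one shared body, so a later optimization only has to touch one place. The Go function names here are illustrative and the use of generics is an assumption of this sketch, not how the CL does it.

```go
package main

import "fmt"

// indexByteBody is the single shared search loop. It returns the
// index of the first occurrence of c in s, or -1 if c is absent.
func indexByteBody[T string | []byte](s T, c byte) int {
	for i := 0; i < len(s); i++ {
		if s[i] == c {
			return i
		}
	}
	return -1
}

// indexByte and indexByteString are thin wrappers around the body.
func indexByte(b []byte, c byte) int       { return indexByteBody(b, c) }
func indexByteString(s string, c byte) int { return indexByteBody(s, c) }

func main() {
	fmt.Println(indexByte([]byte("gopher"), 'p'))
	fmt.Println(indexByteString("gopher", 'z'))
}
```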
Change https://golang.org/cl/168319 mentions this issue: |
… arm

Simple single-byte loop count for now, to be further improved in future CLs.

Benchmark on linux/arm:

name               old time/op    new time/op    delta
CountSingle/10-4    122ns ± 0%      87ns ± 1%    -28.41%  (p=0.000 n=7+10)
CountSingle/32-4    242ns ± 0%     174ns ± 1%    -28.25%  (p=0.000 n=10+10)
CountSingle/4K-4   24.2µs ± 1%    15.6µs ± 1%    -35.42%  (p=0.000 n=10+10)
CountSingle/4M-4   29.6ms ± 1%    21.3ms ± 1%    -28.09%  (p=0.000 n=10+9)
CountSingle/64M-4   562ms ± 0%     414ms ± 1%    -26.23%  (p=0.000 n=8+10)

name               old speed      new speed      delta
CountSingle/10-4   81.7MB/s ± 1%  114.5MB/s ± 1%  +40.07%  (p=0.000 n=10+10)
CountSingle/32-4    132MB/s ± 0%    184MB/s ± 1%  +39.39%  (p=0.000 n=10+9)
CountSingle/4K-4    170MB/s ± 1%    263MB/s ± 1%  +54.86%  (p=0.000 n=10+10)
CountSingle/4M-4    142MB/s ± 1%    197MB/s ± 1%  +39.07%  (p=0.000 n=10+9)
CountSingle/64M-4   119MB/s ± 0%    162MB/s ± 1%  +35.55%  (p=0.000 n=8+10)

Updates #29001
Change-Id: I42a268215a62044286ec32b548d8e4b86b9570ee
Reviewed-on: https://go-review.googlesource.com/c/go/+/168319
Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
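For readers unfamiliar with what a "simple single-byte loop count" means, the logic is roughly the following. This is a plain Go rendering for illustration only; the CL implements the loop in arm assembly, and `countByte` is a name invented for this sketch.

```go
package main

import "fmt"

// countByte returns how many times c occurs in b, inspecting one
// byte per iteration with no word-wise or vector tricks.
func countByte(b []byte, c byte) int {
	n := 0
	for _, x := range b {
		if x == c {
			n++
		}
	}
	return n
}

func main() {
	fmt.Println(countByte([]byte("bytes.Count on arm"), 'n'))
}
```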
I was watching the windows/arm build output, and I noticed that the bytes package tests are pretty slow:
https://farmer.golang.org/temporarylogs?name=windows-arm&rev=311d87dbebbb0238196d3aa13fd9a37f655e1fc3&st=0xc4704774a0
Then I looked at linux/arm, and it is slow there too:
https://farmer.golang.org/temporarylogs?name=linux-arm&rev=311d87dbebbb0238196d3aa13fd9a37f655e1fc3&st=0xc4576f58c0
I felt sorry for the builders and decided to create this issue.
I ran the bytes tests here, and they are fast
I also did not notice any particular test that is slower than the others.
Maybe there is bytes package code that can be made faster? Maybe some tests can be adjusted to make the arm builders faster? Maybe nothing can be done? Then just close the issue.
Thank you.
Alex