bytes: tests are slow on arm #29001

Open
alexbrainman opened this issue Nov 29, 2018 · 13 comments

@alexbrainman
Member

commented Nov 29, 2018

I was watching windows/arm build output, and I noticed that bytes package tests are pretty slow

https://farmer.golang.org/temporarylogs?name=windows-arm&rev=311d87dbebbb0238196d3aa13fd9a37f655e1fc3&st=0xc4704774a0

ok  	bytes	146.447s

and then I looked at linux/arm too, and it is slow too

https://farmer.golang.org/temporarylogs?name=linux-arm&rev=311d87dbebbb0238196d3aa13fd9a37f655e1fc3&st=0xc4576f58c0

ok  	bytes	121.089s

I just felt sorry for the builders, and decided to create this issue.

I ran the bytes tests here, and they are fast

c:\>go test -count=1 bytes
ok      bytes   1.106s

I also did not notice any particular test that is slower than the others.

Maybe there is bytes package code that can be made faster? Maybe some tests can be adjusted to make arm builders faster? Maybe nothing can be done? Then just close the issue.

Thank you.

Alex

@mvdan

Member

commented Nov 29, 2018

I'm also puzzled as to why some packages on some builders take a very long time to test. Thanks for opening the issue!

@tklauser

Member

commented Nov 30, 2018

Running gotip test -v bytes on linux/arm (on a Toradex Apalis T30) gives:

https://gist.github.com/tklauser/576094d1635c1d3d18793ab3fbef9abf

PASS
ok  	bytes	101.559s

There's one obvious outlier:

=== RUN   TestCompareBytes
--- PASS: TestCompareBytes (88.86s)

I suspect that's due to missing SIMD implementations of internal/bytealg.Compare for GOARCH=arm.

@alexbrainman

Member Author

commented Dec 1, 2018

There's one obvious outlier:

=== RUN   TestCompareBytes
--- PASS: TestCompareBytes (88.86s)

Thanks for testing.

I suspect that's due to missing SIMD implementations of internal/bytealg.Compare for GOARCH=arm.

That sounds like a reasonable explanation. I know nothing about arm instructions, so I would not know if it is possible to make this faster. If not, we should make TestCompareBytes shorter on arm - 2 minutes is too long for a single test.

Leaving this for others to decide.

Alex

@tklauser

Member

commented Dec 2, 2018

I suspect that's due to missing SIMD implementations of internal/bytealg.Compare for GOARCH=arm.

That sounds like a reasonable explanation. I know nothing about arm instructions, so I would not know if it is possible to make this faster. If not, we should make TestCompareBytes shorter on arm - 2 minutes is too long for a single test.

Unfortunately, ARM vector instructions are not yet available in the Go assembler (#7300). However, from a quick look at the current implementation in internal/bytealg/compare_arm.s, it looks like we could at least get some speedup by doing word-wise comparison (for word-aligned strings) instead of always using byte-wise comparison.
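
To illustrate the word-wise idea in portable Go (a minimal sketch, not the assembly in compare_arm.s): the function name compareWordwise is hypothetical, and the sketch ignores pointer alignment, which the assembly version would have to check before issuing word loads. It skips over equal 4-byte words and lets a byte loop find the first differing byte.

```go
package main

import "encoding/binary"

// compareWordwise is a hypothetical, portable sketch of a word-wise Compare.
// It returns -1, 0 or +1 like bytes.Compare.
func compareWordwise(a, b []byte) int {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	i := 0
	// Fast path: step over 4 bytes at a time while the words are equal.
	// Only equality is tested here, so the byte order used for the load
	// does not matter.
	for ; i+4 <= n; i += 4 {
		if binary.LittleEndian.Uint32(a[i:]) != binary.LittleEndian.Uint32(b[i:]) {
			break // the byte loop below locates the differing byte
		}
	}
	// Slow path: byte-wise comparison of the mismatching word or the tail.
	for ; i < n; i++ {
		if a[i] != b[i] {
			if a[i] < b[i] {
				return -1
			}
			return 1
		}
	}
	// Common prefix is equal: the shorter slice sorts first.
	switch {
	case len(a) < len(b):
		return -1
	case len(a) > len(b):
		return 1
	}
	return 0
}
```

For long equal prefixes this does roughly a quarter of the loop iterations of a pure byte loop, which is the kind of saving the comment above is hoping for.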

@gopherbot


commented Dec 11, 2018

Change https://golang.org/cl/153557 mentions this issue: bytes: avoid reallocations in TestCompareBytes
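
The thread does not show what CL 153557 changed, so the following is only a sketch of what "avoiding reallocations" in such a test could look like, not the CL's code. The helper name checkCompare and its fill pattern are assumptions: both buffers are allocated and filled once, and each iteration just reslices them and flips a single byte.

```go
package main

import "bytes"

// checkCompare is a hypothetical helper that exercises bytes.Compare for
// every length up to maxLen. The two buffers are allocated once up front
// and resliced, rather than allocated inside the loop.
// It returns the number of unexpected results.
func checkCompare(maxLen int) int {
	a := make([]byte, maxLen)
	b := make([]byte, maxLen)
	for i := range a {
		// Deterministic data in the range 1..254, so that +1 and -1
		// below never wrap around.
		a[i] = byte(1 + 31*i%254)
		b[i] = a[i]
	}
	failures := 0
	for n := 0; n <= maxLen; n++ {
		x, y := a[:n], b[:n]
		if bytes.Compare(x, y) != 0 {
			failures++
		}
		for k := 0; k < n; k++ {
			y[k] = x[k] - 1 // y is now smaller at position k
			if bytes.Compare(x, y) != 1 {
				failures++
			}
			y[k] = x[k] + 1 // y is now larger at position k
			if bytes.Compare(x, y) != -1 {
				failures++
			}
			y[k] = x[k] // restore for the next iteration
		}
	}
	return failures
}
```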

tklauser added a commit to tklauser/go that referenced this issue Dec 20, 2018

internal/bytealg: use word-wise comparison for Compare on arm
Use word-wise comparison for aligned buffers. Try to align buffers if
possible, otherwise fall back to the current byte-wise comparison.

The slight increase in CompareBytesEmpty is due to the additional empty
and alignment checks.

name                           old time/op    new time/op    delta
CompareBytesEqual-4              79.6ns ± 0%    67.3ns ± 1%   -15.46%  (p=0.000 n=10+10)
CompareBytesToNil-4              31.0ns ± 1%    30.4ns ± 1%    -2.00%  (p=0.000 n=10+10)
CompareBytesEmpty-4              25.7ns ± 1%    28.8ns ± 0%   +12.32%  (p=0.000 n=10+9)
CompareBytesIdentical-4          34.2ns ± 1%    29.2ns ± 1%   -14.63%  (p=0.000 n=10+10)
CompareBytesSameLength-4         49.9ns ± 0%    47.9ns ± 1%    -3.95%  (p=0.000 n=9+10)
CompareBytesDifferentLength-4    49.8ns ± 0%    48.2ns ± 1%    -3.21%  (p=0.000 n=9+9)
CompareBytesBigUnaligned-4       5.71ms ± 1%    5.99ms ± 4%    +4.86%  (p=0.001 n=10+10)
CompareBytesBig-4                4.97ms ± 1%    2.28ms ± 1%   -54.23%  (p=0.000 n=10+10)
CompareBytesBigIdentical-4       27.4ns ± 1%    27.3ns ± 1%      ~     (p=0.394 n=10+10)

name                           old speed      new speed      delta
CompareBytesBigUnaligned-4      184MB/s ± 1%   175MB/s ± 4%    -4.56%  (p=0.001 n=10+10)
CompareBytesBig-4               211MB/s ± 1%   460MB/s ± 1%  +118.46%  (p=0.000 n=10+10)
CompareBytesBigIdentical-4     38.4TB/s ± 1%  38.4TB/s ± 1%      ~     (p=0.684 n=10+10)

This reduces time for TestCompareBytes from ~90 sec to ~40 sec on a
linux-arm builder via gomote.

Updates golang#29001

Change-Id: I25f148739b9ccb7cb1fc97b3d8763549b0a66c16
@gopherbot


commented Mar 5, 2019

Change https://golang.org/cl/165338 mentions this issue: internal/bytealg: use word-wise comparison for Compare on arm

@ALTree ALTree added this to the Unplanned milestone Mar 5, 2019

gopherbot pushed a commit that referenced this issue Mar 6, 2019

internal/bytealg: use word-wise comparison for Compare on arm
Use word-wise comparison for aligned buffers, otherwise fall back to the
current byte-wise comparison.

name                           old time/op    new time/op    delta
BytesCompare/1-4                 41.3ns ± 0%    36.4ns ± 1%   -11.73%  (p=0.008 n=5+5)
BytesCompare/2-4                 39.5ns ± 0%    39.5ns ± 1%      ~     (p=0.960 n=5+5)
BytesCompare/4-4                 45.3ns ± 0%    41.0ns ± 1%    -9.40%  (p=0.008 n=5+5)
BytesCompare/8-4                 64.8ns ± 1%    44.7ns ± 0%   -31.12%  (p=0.008 n=5+5)
BytesCompare/16-4                86.3ns ± 0%    55.1ns ± 0%   -36.21%  (p=0.008 n=5+5)
BytesCompare/32-4                 135ns ± 0%      70ns ± 1%   -47.73%  (p=0.008 n=5+5)
BytesCompare/64-4                 231ns ± 1%      99ns ± 0%   -57.27%  (p=0.016 n=5+4)
BytesCompare/128-4                424ns ± 0%     147ns ± 0%   -65.31%  (p=0.000 n=4+5)
BytesCompare/256-4                810ns ± 0%     243ns ± 0%   -69.96%  (p=0.008 n=5+5)
BytesCompare/512-4               1.59µs ± 0%    0.44µs ± 0%   -72.43%  (p=0.008 n=5+5)
BytesCompare/1024-4              3.14µs ± 1%    0.83µs ± 1%   -73.56%  (p=0.008 n=5+5)
BytesCompare/2048-4              6.23µs ± 0%    1.61µs ± 1%   -74.21%  (p=0.008 n=5+5)
CompareBytesEqual-4              79.4ns ± 0%    52.2ns ± 0%   -34.23%  (p=0.008 n=5+5)
CompareBytesToNil-4              31.0ns ± 0%    30.3ns ± 0%    -2.32%  (p=0.008 n=5+5)
CompareBytesEmpty-4              25.7ns ± 0%    25.7ns ± 0%      ~     (p=0.556 n=4+5)
CompareBytesIdentical-4          25.7ns ± 0%    25.7ns ± 0%      ~     (p=1.000 n=5+5)
CompareBytesSameLength-4         49.1ns ± 0%    48.5ns ± 0%    -1.26%  (p=0.008 n=5+5)
CompareBytesDifferentLength-4    49.8ns ± 1%    49.3ns ± 0%    -1.08%  (p=0.008 n=5+5)
CompareBytesBigUnaligned-4       5.71ms ± 1%    5.68ms ± 1%      ~     (p=0.222 n=5+5)
CompareBytesBig-4                4.95ms ± 0%    2.28ms ± 1%   -53.81%  (p=0.008 n=5+5)
CompareBytesBigIdentical-4       27.2ns ± 1%    27.3ns ± 1%      ~     (p=0.310 n=5+5)

name                           old speed      new speed      delta
CompareBytesBigUnaligned-4      184MB/s ± 1%   185MB/s ± 1%      ~     (p=0.222 n=5+5)
CompareBytesBig-4               212MB/s ± 0%   459MB/s ± 1%  +116.51%  (p=0.008 n=5+5)
CompareBytesBigIdentical-4     38.5TB/s ± 0%  38.4TB/s ± 1%      ~     (p=0.421 n=5+5)

Also, this reduces time for TestCompareBytes by about 20 sec on a
linux-arm builder via gomote.

Updates #29001

Change-Id: I25f148739b9ccb7cb1fc97b3d8763549b0a66c16
Reviewed-on: https://go-review.googlesource.com/c/go/+/165338
Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
@tklauser

Member

commented Mar 7, 2019

Testing cmd/compile/internal/gc also takes a substantial amount of time (see https://gist.github.com/tklauser/6c265a73eb9ef310998e29ff976552b6 for the full log):

ok  	cmd/compile/internal/gc	159.902s

one of the biggest contributors is

--- PASS: TestCode (77.99s)

It seems the main culprit there is the bytes.Contains call (used to determine whether to test using softfloat), which is not surprising given that there is no arm-specific optimization for Index/IndexString in internal/bytealg. I'll try to add one.

tklauser added a commit to tklauser/go that referenced this issue Mar 8, 2019

internal/bytealg: use word-wise comparison for Equal on arm
Follow CL 165338 and use word-wise comparison for aligned buffers in
Equal, otherwise fall back to the current byte-wise comparison.

TODO benchmarks

Updates golang#29001

Change-Id: I73be501c18e67d211ed19da7771b4f254254e609

@alexbrainman

Member Author

commented Mar 10, 2019

@tklauser your hard work is paying off. A recent windows-arm builder run

https://build.golang.org/log/32d08cb2c389db61937c461cbcf64172bd00bc84

ok  	bytes	60.869s

is more than twice as fast as before. Thank you very much.

Alex

@gopherbot


commented Mar 11, 2019

Change https://golang.org/cl/166677 mentions this issue: internal/bytealg: share code for equal functions on arm

tklauser added a commit to tklauser/go that referenced this issue Mar 12, 2019

internal/bytealg: use word-wise comparison for Equal on arm
Follow CL 165338 and use word-wise comparison for aligned buffers in
Equal, otherwise fall back to the current byte-wise comparison.

TODO benchmarks

Updates golang#29001

Change-Id: I73be501c18e67d211ed19da7771b4f254254e609

gopherbot pushed a commit that referenced this issue Mar 12, 2019

internal/bytealg: share code for equal functions on arm
Move the shared code into bytealg.memeqbody. This will allow implementing
optimizations (e.g. for #29001) in a single function.

Change-Id: Iaa34ddeb7068d92c35a8b4e581b7fd92da56535c
Reviewed-on: https://go-review.googlesource.com/c/go/+/166677
Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>

tklauser added a commit to tklauser/go that referenced this issue Mar 12, 2019

internal/bytealg: use word-wise comparison for Equal on arm
Follow CL 165338 and use word-wise comparison for aligned buffers in
Equal on arm, otherwise fall back to the current byte-wise comparison.

TODO benchmarks

Updates golang#29001

Change-Id: I73be501c18e67d211ed19da7771b4f254254e609

tklauser added a commit to tklauser/go that referenced this issue Mar 12, 2019

internal/bytealg: use word-wise comparison for Equal on arm
Follow CL 165338 and use word-wise comparison for aligned buffers in
Equal on arm, otherwise fall back to the current byte-wise comparison.

name                 old time/op    new time/op    delta
Equal/0-4              25.7ns ± 1%    23.5ns ± 1%    -8.78%  (p=0.000 n=10+10)
Equal/1-4              65.8ns ± 0%    60.1ns ± 1%    -8.69%  (p=0.000 n=10+9)
Equal/6-4              82.9ns ± 1%    86.7ns ± 0%    +4.59%  (p=0.000 n=10+10)
Equal/9-4              90.0ns ± 0%   101.0ns ± 0%   +12.18%  (p=0.000 n=9+10)
Equal/15-4              108ns ± 0%     119ns ± 0%   +10.19%  (p=0.000 n=8+8)
Equal/16-4              111ns ± 0%      82ns ± 0%   -26.37%  (p=0.000 n=8+10)
Equal/20-4              124ns ± 1%      87ns ± 1%   -29.94%  (p=0.000 n=9+10)
Equal/32-4              160ns ± 1%      97ns ± 1%   -39.40%  (p=0.000 n=10+10)
Equal/4K-4             14.0µs ± 0%     3.6µs ± 1%   -74.57%  (p=0.000 n=9+10)
Equal/4M-4             12.8ms ± 1%     3.2ms ± 0%   -74.93%  (p=0.000 n=9+9)
Equal/64M-4             204ms ± 1%      51ms ± 0%   -74.78%  (p=0.000 n=10+10)
EqualPort/1-4          47.0ns ± 1%    46.8ns ± 0%    -0.40%  (p=0.015 n=10+6)
EqualPort/6-4          82.6ns ± 1%    81.9ns ± 1%    -0.87%  (p=0.002 n=10+10)
EqualPort/32-4          232ns ± 0%     232ns ± 0%      ~     (p=0.496 n=8+10)
EqualPort/4K-4         29.0µs ± 1%    29.0µs ± 1%      ~     (p=0.604 n=9+10)
EqualPort/4M-4         24.0ms ± 1%    23.8ms ± 0%    -0.65%  (p=0.001 n=9+9)
EqualPort/64M-4         383ms ± 1%     382ms ± 0%      ~     (p=0.218 n=10+10)
CompareBytesEqual-4    61.2ns ± 1%    61.0ns ± 1%      ~     (p=0.539 n=10+10)

name                 old speed      new speed      delta
Equal/1-4            15.2MB/s ± 0%  16.6MB/s ± 1%    +9.52%  (p=0.000 n=10+9)
Equal/6-4            72.4MB/s ± 1%  69.2MB/s ± 0%    -4.40%  (p=0.000 n=10+10)
Equal/9-4             100MB/s ± 0%    89MB/s ± 0%   -11.40%  (p=0.000 n=9+10)
Equal/15-4            138MB/s ± 1%   125MB/s ± 1%    -9.41%  (p=0.000 n=10+10)
Equal/16-4            144MB/s ± 1%   196MB/s ± 0%   +36.41%  (p=0.000 n=10+10)
Equal/20-4            162MB/s ± 1%   231MB/s ± 1%   +42.98%  (p=0.000 n=9+10)
Equal/32-4            200MB/s ± 1%   331MB/s ± 1%   +65.64%  (p=0.000 n=10+10)
Equal/4K-4            292MB/s ± 0%  1149MB/s ± 1%  +293.19%  (p=0.000 n=9+10)
Equal/4M-4            328MB/s ± 1%  1307MB/s ± 0%  +298.87%  (p=0.000 n=9+9)
Equal/64M-4           329MB/s ± 1%  1306MB/s ± 0%  +296.56%  (p=0.000 n=10+10)
EqualPort/1-4        21.3MB/s ± 1%  21.4MB/s ± 0%    +0.42%  (p=0.002 n=10+9)
EqualPort/6-4        72.6MB/s ± 1%  73.2MB/s ± 1%    +0.87%  (p=0.003 n=10+10)
EqualPort/32-4        138MB/s ± 0%   138MB/s ± 0%      ~     (p=0.953 n=9+10)
EqualPort/4K-4        141MB/s ± 1%   141MB/s ± 1%      ~     (p=0.382 n=10+10)
EqualPort/4M-4        175MB/s ± 1%   176MB/s ± 0%    +0.65%  (p=0.001 n=9+9)
EqualPort/64M-4       175MB/s ± 1%   176MB/s ± 0%      ~     (p=0.225 n=10+10)

Updates golang#29001

Change-Id: I73be501c18e67d211ed19da7771b4f254254e609
@gopherbot


commented Mar 14, 2019

Change https://golang.org/cl/167557 mentions this issue: internal/bytealg: use word-wise comparison for Equal on arm

@gopherbot


commented Mar 15, 2019

Change https://golang.org/cl/167798 mentions this issue: bytes: add hard benchmarks for Index and Count

gopherbot pushed a commit that referenced this issue Mar 15, 2019

bytes: add hard benchmarks for Index and Count
Add Benchmark(Index|Count)Hard[1-3] in preparation for implementing
Index and Count in assembly on arm.

Updates #29001

Change-Id: I2a9701892190e8d91de069c2f5a7f5bd3544c6c2
Reviewed-on: https://go-review.googlesource.com/c/go/+/167798
Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>

gopherbot pushed a commit that referenced this issue Mar 17, 2019

internal/bytealg: use word-wise comparison for Equal on arm
Follow CL 165338 and use word-wise comparison for aligned buffers in
Equal on arm, otherwise fall back to the current byte-wise comparison.

name                 old time/op    new time/op    delta
Equal/0-4              25.7ns ± 1%    23.5ns ± 1%    -8.78%  (p=0.000 n=10+10)
Equal/1-4              65.8ns ± 0%    60.1ns ± 1%    -8.69%  (p=0.000 n=10+9)
Equal/6-4              82.9ns ± 1%    86.7ns ± 0%    +4.59%  (p=0.000 n=10+10)
Equal/9-4              90.0ns ± 0%   101.0ns ± 0%   +12.18%  (p=0.000 n=9+10)
Equal/15-4              108ns ± 0%     119ns ± 0%   +10.19%  (p=0.000 n=8+8)
Equal/16-4              111ns ± 0%      82ns ± 0%   -26.37%  (p=0.000 n=8+10)
Equal/20-4              124ns ± 1%      87ns ± 1%   -29.94%  (p=0.000 n=9+10)
Equal/32-4              160ns ± 1%      97ns ± 1%   -39.40%  (p=0.000 n=10+10)
Equal/4K-4             14.0µs ± 0%     3.6µs ± 1%   -74.57%  (p=0.000 n=9+10)
Equal/4M-4             12.8ms ± 1%     3.2ms ± 0%   -74.93%  (p=0.000 n=9+9)
Equal/64M-4             204ms ± 1%      51ms ± 0%   -74.78%  (p=0.000 n=10+10)
EqualPort/1-4          47.0ns ± 1%    46.8ns ± 0%    -0.40%  (p=0.015 n=10+6)
EqualPort/6-4          82.6ns ± 1%    81.9ns ± 1%    -0.87%  (p=0.002 n=10+10)
EqualPort/32-4          232ns ± 0%     232ns ± 0%      ~     (p=0.496 n=8+10)
EqualPort/4K-4         29.0µs ± 1%    29.0µs ± 1%      ~     (p=0.604 n=9+10)
EqualPort/4M-4         24.0ms ± 1%    23.8ms ± 0%    -0.65%  (p=0.001 n=9+9)
EqualPort/64M-4         383ms ± 1%     382ms ± 0%      ~     (p=0.218 n=10+10)
CompareBytesEqual-4    61.2ns ± 1%    61.0ns ± 1%      ~     (p=0.539 n=10+10)

name                 old speed      new speed      delta
Equal/1-4            15.2MB/s ± 0%  16.6MB/s ± 1%    +9.52%  (p=0.000 n=10+9)
Equal/6-4            72.4MB/s ± 1%  69.2MB/s ± 0%    -4.40%  (p=0.000 n=10+10)
Equal/9-4             100MB/s ± 0%    89MB/s ± 0%   -11.40%  (p=0.000 n=9+10)
Equal/15-4            138MB/s ± 1%   125MB/s ± 1%    -9.41%  (p=0.000 n=10+10)
Equal/16-4            144MB/s ± 1%   196MB/s ± 0%   +36.41%  (p=0.000 n=10+10)
Equal/20-4            162MB/s ± 1%   231MB/s ± 1%   +42.98%  (p=0.000 n=9+10)
Equal/32-4            200MB/s ± 1%   331MB/s ± 1%   +65.64%  (p=0.000 n=10+10)
Equal/4K-4            292MB/s ± 0%  1149MB/s ± 1%  +293.19%  (p=0.000 n=9+10)
Equal/4M-4            328MB/s ± 1%  1307MB/s ± 0%  +298.87%  (p=0.000 n=9+9)
Equal/64M-4           329MB/s ± 1%  1306MB/s ± 0%  +296.56%  (p=0.000 n=10+10)
EqualPort/1-4        21.3MB/s ± 1%  21.4MB/s ± 0%    +0.42%  (p=0.002 n=10+9)
EqualPort/6-4        72.6MB/s ± 1%  73.2MB/s ± 1%    +0.87%  (p=0.003 n=10+10)
EqualPort/32-4        138MB/s ± 0%   138MB/s ± 0%      ~     (p=0.953 n=9+10)
EqualPort/4K-4        141MB/s ± 1%   141MB/s ± 1%      ~     (p=0.382 n=10+10)
EqualPort/4M-4        175MB/s ± 1%   176MB/s ± 0%    +0.65%  (p=0.001 n=9+9)
EqualPort/64M-4       175MB/s ± 1%   176MB/s ± 0%      ~     (p=0.225 n=10+10)

The 5-12% decrease in performance on Equal/{6,9,15} is due to the
benchmarks splitting the bytes buffer in half. The b argument to Equal
then ends up being unaligned and thus the fast word-wise compare doesn't
kick in.

Updates #29001

Change-Id: I73be501c18e67d211ed19da7771b4f254254e609
Reviewed-on: https://go-review.googlesource.com/c/go/+/167557
Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
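
The alignment effect described in the commit message can be reproduced with a small sketch. The helper names wordAligned and splitAlignment are hypothetical, and the 4-byte word size is an assumption for 32-bit arm. The sketch forces an aligned base address, splits a buffer of 2*n bytes in half, and reports whether each half starts on a word boundary.

```go
package main

import "unsafe"

const wordSize = 4 // assumed word size on 32-bit arm

// wordAligned reports whether b starts on a word boundary.
// Empty slices are treated as aligned.
func wordAligned(b []byte) bool {
	if len(b) == 0 {
		return true
	}
	return uintptr(unsafe.Pointer(&b[0]))%wordSize == 0
}

// splitAlignment allocates a buffer of 2*n bytes with a word-aligned start,
// splits it in half and reports the alignment of each half.
func splitAlignment(n int) (first, second bool) {
	// Over-allocate so an aligned starting offset can always be chosen.
	buf := make([]byte, 2*n+wordSize)
	off := 0
	for !wordAligned(buf[off:]) {
		off++
	}
	buf = buf[off : off+2*n]
	return wordAligned(buf[:n]), wordAligned(buf[n:])
}
```

With n = 6, 9 or 15 the second half starts at an offset that is not a multiple of 4, so only a byte-wise loop applies. With n = 16, 20 or 32 both halves are aligned, which matches where the benchmark numbers above start to improve.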
@gopherbot


commented Mar 17, 2019

Change https://golang.org/cl/167939 mentions this issue: internal/bytealg: share code for IndexByte functions on arm

gopherbot pushed a commit that referenced this issue Mar 18, 2019

internal/bytealg: share code for IndexByte functions on arm
Move the shared code of IndexByte and IndexByteString into
indexbytebody. This will allow implementing optimizations (e.g.
for #29001) in a single function.

Change-Id: I1d550da8eb65f95e492a460a12058cc35b1162b6
Reviewed-on: https://go-review.googlesource.com/c/go/+/167939
Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
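
The same refactoring idea can be shown in plain Go as an analogy (the real change is in arm assembly, where both entry points jump to a common body). In this sketch, which assumes Go 1.18 or later for generics, the names indexByteBody, indexByte and indexByteString are hypothetical. Both wrappers delegate to one body, so a later optimization only has to be written once.

```go
package main

// indexByteBody is the single shared implementation. It returns the index
// of the first occurrence of c in s, or -1 if c is not present.
func indexByteBody[T ~string | ~[]byte](s T, c byte) int {
	for i := 0; i < len(s); i++ {
		if s[i] == c {
			return i
		}
	}
	return -1
}

// indexByte is the []byte entry point.
func indexByte(b []byte, c byte) int { return indexByteBody(b, c) }

// indexByteString is the string entry point.
func indexByteString(s string, c byte) int { return indexByteBody(s, c) }
```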
@gopherbot


commented Mar 19, 2019

Change https://golang.org/cl/168319 mentions this issue: internal/bytealg: add assembly implementation of Count/CountString on arm

gopherbot pushed a commit that referenced this issue Mar 19, 2019

internal/bytealg: add assembly implementation of Count/CountString on arm

Simple single-byte loop count for now, to be further improved in future
CLs.

Benchmark on linux/arm:

name               old time/op    new time/op     delta
CountSingle/10-4      122ns ± 0%       87ns ± 1%  -28.41%  (p=0.000 n=7+10)
CountSingle/32-4      242ns ± 0%      174ns ± 1%  -28.25%  (p=0.000 n=10+10)
CountSingle/4K-4     24.2µs ± 1%     15.6µs ± 1%  -35.42%  (p=0.000 n=10+10)
CountSingle/4M-4     29.6ms ± 1%     21.3ms ± 1%  -28.09%  (p=0.000 n=10+9)
CountSingle/64M-4     562ms ± 0%      414ms ± 1%  -26.23%  (p=0.000 n=8+10)

name               old speed      new speed       delta
CountSingle/10-4   81.7MB/s ± 1%  114.5MB/s ± 1%  +40.07%  (p=0.000 n=10+10)
CountSingle/32-4    132MB/s ± 0%    184MB/s ± 1%  +39.39%  (p=0.000 n=10+9)
CountSingle/4K-4    170MB/s ± 1%    263MB/s ± 1%  +54.86%  (p=0.000 n=10+10)
CountSingle/4M-4    142MB/s ± 1%    197MB/s ± 1%  +39.07%  (p=0.000 n=10+9)
CountSingle/64M-4   119MB/s ± 0%    162MB/s ± 1%  +35.55%  (p=0.000 n=8+10)

Updates #29001

Change-Id: I42a268215a62044286ec32b548d8e4b86b9570ee
Reviewed-on: https://go-review.googlesource.com/c/go/+/168319
Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
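
For reference, the "simple single-byte loop" that the commit message describes corresponds to the following logic, written here in portable Go as a sketch rather than the arm assembly that was committed. The function names countByte and countByteString are hypothetical.

```go
package main

// countByte returns the number of bytes in b equal to c,
// examining one byte per iteration.
func countByte(b []byte, c byte) int {
	n := 0
	for _, x := range b {
		if x == c {
			n++
		}
	}
	return n
}

// countByteString is the string variant of countByte.
func countByteString(s string, c byte) int {
	n := 0
	for i := 0; i < len(s); i++ {
		if s[i] == c {
			n++
		}
	}
	return n
}
```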