x/crypto/blake2b: very low performance for AVX and AVX2 code #18563

Closed
aead opened this issue Jan 7, 2017 · 2 comments

aead commented Jan 7, 2017

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

1.7.*

What operating system and processor architecture are you using (go env)?

amd64/linux on Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
Further info

What did you do?

go test -bench=Benchmark for x/crypto/blake2b

What did you expect to see?

Performance of about 800 MB/s, as on an i7-6500U.

What did you see instead?

BenchmarkWrite128-12              500000              2084 ns/op          61.40 MB/s
BenchmarkWrite1K-12               100000             16227 ns/op          63.10 MB/s
BenchmarkSum128-12               1000000              2107 ns/op          60.74 MB/s
BenchmarkSum1K-12                 100000             16312 ns/op          62.77 MB/s
PASS
ok      golang.org/x/crypto/blake2b     7.647s
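
As a sanity check on the figures in this report, the MB/s column of `go test -bench` output can be reproduced from ns/op and the number of bytes hashed per operation. The sketch below assumes that the `128` benchmarks process 128 bytes per operation and the `1K` benchmarks 1024 bytes (inferred from the benchmark names), and that 1 MB means 10^6 bytes. `throughputMBps` is a hypothetical helper for illustration, not part of x/crypto.

```go
package main

import "fmt"

// throughputMBps converts a benchmark result into MB/s (1 MB = 1e6 bytes).
// bytesPerOp is an assumption derived from the benchmark name.
func throughputMBps(bytesPerOp int, nsPerOp float64) float64 {
	// bytes/ns * 1e9 ns/s / 1e6 bytes/MB == bytes/ns * 1e3
	return float64(bytesPerOp) / nsPerOp * 1e3
}

func main() {
	// ns/op values taken from the Xeon E5-2620 v3 run in this report.
	fmt.Printf("Write128: %.2f MB/s\n", throughputMBps(128, 2084))
	fmt.Printf("Write1K:  %.2f MB/s\n", throughputMBps(1024, 16227))
}
```

Both results land within rounding of the reported 61.40 and 63.10 MB/s, which is roughly 13 times slower than the ~800 MB/s seen on the i7-6500U.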

@rakyll rakyll changed the title blake2b: very low performance for AVX and AVX2 code x/crypto/blake2b: very low performance for AVX and AVX2 code Jan 7, 2017
@rakyll rakyll added this to the Unreleased milestone Jan 7, 2017

aead commented Jan 7, 2017

Replacing PINSRQ with VPINSRQ and adding VZEROUPPER doesn't change anything...

@gopherbot

CL https://golang.org/cl/34993 mentions this issue.

@golang golang locked and limited conversation to collaborators Feb 8, 2018
c-expert-zigbee pushed a commit to c-expert-zigbee/crypto_go that referenced this issue Mar 28, 2022
On some amd64 CPUs (Xeon E5-2680v4 / E5-2620v3), mixing SSE and AVX instructions
leads to very low performance.
On an i7-6500U, the mixed SSE/AVX code performs as follows:

AVX2:
name        time/op
Write128-4    165ns ± 0%
Write1K-4    1.20µs ± 0%
Sum128-4      189ns ± 1%
Sum1K-4      1.22µs ± 0%

name        speed
Write128-4  773MB/s ± 1%
Write1K-4   855MB/s ± 0%
Sum128-4    675MB/s ± 1%
Sum1K-4     838MB/s ± 0%

while the same code achieves values < 65MB/s on a Xeon E5-2620v3.

Replacing the SSE instructions `MOVQ` and `PINSRQ` with the AVX instructions `VMOVQ` and `VPINSRQ`
brings the performance of the AVX/AVX2 code up to the expected values:

name         old time/op    new time/op     delta
Write128-12    2.20µs ±10%     0.22µs ± 9%    -90.00%  (p=0.029 n=4+4)
Write1K-12     16.2µs ± 0%      1.1µs ± 0%    -93.07%  (p=0.029 n=4+4)
Sum128-12      2.10µs ± 0%     0.22µs ± 0%    -89.47%  (p=0.029 n=4+4)
Sum1K-12       16.3µs ± 0%      1.2µs ± 0%    -92.65%  (p=0.029 n=4+4)

name         old speed      new speed       delta
Write128-12  58.5MB/s ±10%  582.8MB/s ±10%   +897.08%  (p=0.029 n=4+4)
Write1K-12   63.1MB/s ± 0%  909.8MB/s ± 0%  +1341.40%  (p=0.029 n=4+4)
Sum128-12    60.8MB/s ± 0%  576.3MB/s ± 0%   +847.84%  (p=0.029 n=4+4)
Sum1K-12     62.8MB/s ± 0%  855.2MB/s ± 0%  +1260.78%  (p=0.029 n=4+4)
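
The delta column can be approximated from the rounded old and new columns as a relative change. The sketch below is only an illustration: `percentDelta` is a hypothetical helper, and small differences from the printed deltas are expected because the comparison tool presumably works from unrounded measurements.

```go
package main

import "fmt"

// percentDelta returns the relative change from oldVal to newVal, in percent.
func percentDelta(oldVal, newVal float64) float64 {
	return (newVal - oldVal) / oldVal * 100
}

func main() {
	// Rounded values copied from the Write128 time and Write1K speed rows.
	fmt.Printf("Write128 time: %+.2f%%\n", percentDelta(2.20, 0.22))
	fmt.Printf("Write1K speed: %+.2f%%\n", percentDelta(63.1, 909.8))
}
```

The first value agrees with the reported -90.00%. The second comes out near +1341.8% against the reported +1341.40%, which is consistent with rounding in the printed columns.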

The AVX/AVX2 code now uses only AVX (no SSE) instructions.

Fixes golang/go#18563.

Change-Id: I1961dd8fa02014642587523b7f099816a263c9f5
Reviewed-on: https://go-review.googlesource.com/34993
Reviewed-by: Adam Langley <agl@golang.org>
c-expert-zigbee pushed a commit to c-expert-zigbee/crypto_go that referenced this issue Mar 29, 2022
LewiGoddard pushed a commit to LewiGoddard/crypto that referenced this issue Feb 16, 2023
BiiChris pushed a commit to BiiChris/crypto that referenced this issue Sep 15, 2023
desdeel2d0m added a commit to desdeel2d0m/crypto that referenced this issue Jul 1, 2024