crypto: understand performance differences compared to BoringSSL #21525

Open
rsc opened this issue Aug 18, 2017 · 4 comments

Comments

@rsc
Contributor

commented Aug 18, 2017

I ran all the crypto benchmarks with standard Go crypto and with BoringCrypto. Results below.

In general there is about a 200ns overhead to calling into BoringCrypto via cgo for a particular call. So for example aes.BenchmarkEncrypt (testing encryption of a single 16-byte block) went from 13ns to 209ns, or +1500%. That we can't do much about except hope that bulk operations call into cgo once instead of once per 16 bytes.

But there are also some mysteries or things to consider fixing. I've put this in milestone Go 1.10 because some of them may be bugs in the Go distribution that we should at least understand. Once we know that the problems are all on the dev.boringcrypto side, we can switch the milestone to Unreleased.

crypto/aes

  • AESCFBEncrypt1K, AESCFBDecrypt1K, AESOFB1K are much slower because there is no bulk CFB operation (no cipher.cfbAble interface). Should there be? Probably.
  • AESCTR1K looks like it is not using the ctrAble implementation that boring.aesCipher is offering.
  • AESCBCEncrypt1K looks like it is not using the cbcEncAble implementation that boring.aesCipher is offering.
  • AESCBCDecrypt1K looks like it is not using the cbcDecAble implementation that boring.aesCipher is offering.

crypto/ecdsa

  • Why does SignP256 take an extra 9µs in BoringCrypto? That's too big to be cgo. Should the signature conversion be improved?
  • SignP384 drops from 5.54ms to 0.85ms, indicating that the Go implementation has 6X room for improvement.
  • Even after the 6X, I don't understand why P384 is so much slower than P256.
  • KeyGeneration is 10X slower in BoringCrypto than in Go. We should make sure the Go version is not missing something important.

crypto/hmac

  • How is it that HMACSHA256_1K takes the same amount of time as HMACSHA256_32 in BoringCrypto?
  • For that matter, how is it that, in BoringCrypto, HMACSHA256_1K takes 2µs but crypto/sha256's Hash1K takes 4µs?

crypto/rsa

  • Why is RSA2048Sign 3X faster in BoringCrypto?

Benchmark results (also at https://perf.golang.org/search?q=upload:20170818.4):

name                              old time/op    new time/op    delta
pkg:crypto/aes goos:linux goarch:amd64
Encrypt-4                           13.3ns ± 2%   208.6ns ± 5%  +1473.15%  (p=0.008 n=5+5)
Decrypt-4                           13.2ns ± 1%   255.0ns ± 0%  +1828.90%  (p=0.016 n=5+4)
Expand-4                            75.8ns ± 0%    76.4ns ± 1%       ~     (p=0.056 n=5+5)
pkg:crypto/cipher goos:linux goarch:amd64
AESGCMSeal1K-4                       341ns ± 1%     503ns ± 0%    +47.71%  (p=0.008 n=5+5)
AESGCMOpen1K-4                       321ns ± 0%     496ns ± 1%    +54.68%  (p=0.008 n=5+5)
AESGCMSeal8K-4                      2.04µs ± 0%    2.21µs ± 1%     +8.27%  (p=0.008 n=5+5)
AESGCMOpen8K-4                      1.97µs ± 1%    2.18µs ± 0%    +10.84%  (p=0.008 n=5+5)
AESCFBEncrypt1K-4                   2.37µs ± 0%   14.48µs ± 0%   +512.17%  (p=0.008 n=5+5)
AESCFBDecrypt1K-4                   2.27µs ± 1%   14.48µs ± 1%   +538.94%  (p=0.008 n=5+5)
AESOFB1K-4                          1.46µs ± 1%   13.76µs ± 1%   +844.07%  (p=0.008 n=5+5)
AESCTR1K-4                          1.66µs ± 1%    8.99µs ± 0%   +442.57%  (p=0.008 n=5+5)
AESCBCEncrypt1K-4                   2.27µs ± 1%    8.13µs ± 0%   +257.59%  (p=0.008 n=5+5)
AESCBCDecrypt1K-4                   1.67µs ± 2%   11.65µs ± 2%   +598.98%  (p=0.008 n=5+5)
pkg:crypto/des goos:linux goarch:amd64
Encrypt-4                            162ns ± 3%     159ns ± 1%       ~     (p=0.222 n=5+5)
Decrypt-4                            157ns ± 1%     158ns ± 2%       ~     (p=0.722 n=5+5)
TDESEncrypt-4                        380ns ± 0%     381ns ± 1%       ~     (p=0.857 n=5+5)
TDESDecrypt-4                        386ns ± 0%     386ns ± 0%       ~     (p=1.000 n=5+5)
pkg:crypto/ecdsa goos:linux goarch:amd64
SignP256-4                          36.2µs ± 2%    45.3µs ± 1%    +24.91%  (p=0.008 n=5+5)
SignP384-4                          5.54ms ± 0%    0.85ms ± 1%    -84.63%  (p=0.008 n=5+5)
VerifyP256-4                         104µs ± 1%     102µs ± 0%     -1.29%  (p=0.016 n=5+4)
KeyGeneration-4                     21.9µs ± 1%   200.3µs ± 0%   +815.39%  (p=0.008 n=5+5)
pkg:crypto/elliptic goos:linux goarch:amd64
BaseMult-4                           979µs ± 3%     954µs ± 1%     -2.62%  (p=0.008 n=5+5)
BaseMultP256-4                      19.8µs ± 0%    19.7µs ± 1%       ~     (p=0.151 n=5+5)
ScalarMultP256-4                    77.3µs ± 0%    76.7µs ± 0%     -0.70%  (p=0.008 n=5+5)
pkg:crypto/hmac goos:linux goarch:amd64
HMACSHA256_1K-4                     4.46µs ± 0%    1.92µs ± 0%    -56.91%  (p=0.008 n=5+5)
HMACSHA256_32-4                     1.16µs ± 0%    1.94µs ± 2%    +66.17%  (p=0.008 n=5+5)
pkg:crypto/md5 goos:linux goarch:amd64
Hash8Bytes-4                         184ns ± 0%     183ns ± 0%     -0.54%  (p=0.029 n=4+4)
Hash1K-4                            2.01µs ± 0%    2.01µs ± 1%       ~     (p=0.087 n=5+5)
Hash8K-4                            14.8µs ± 1%    14.8µs ± 0%       ~     (p=0.651 n=5+5)
Hash8BytesUnaligned-4                184ns ± 0%     183ns ± 0%     -0.89%  (p=0.000 n=5+4)
Hash1KUnaligned-4                   2.01µs ± 0%    2.00µs ± 0%     -0.41%  (p=0.040 n=5+5)
Hash8KUnaligned-4                   14.9µs ± 1%    15.2µs ± 3%       ~     (p=0.690 n=5+5)
pkg:crypto/rand goos:linux goarch:amd64
Prime-4                              146ms ±57%     143ms ± 8%       ~     (p=0.548 n=5+5)
pkg:crypto/rc4 goos:linux goarch:amd64
RC4_128-4                            313ns ± 3%     287ns ± 2%     -8.19%  (p=0.008 n=5+5)
RC4_1K-4                            2.72µs ± 2%    2.70µs ± 1%       ~     (p=0.151 n=5+5)
RC4_8K-4                            21.8µs ± 2%    22.0µs ± 2%       ~     (p=0.056 n=5+5)
pkg:crypto/rsa goos:linux goarch:amd64
RSA2048Sign-4                       2.90ms ± 1%    1.08ms ± 1%    -62.64%  (p=0.008 n=5+5)
pkg:crypto/sha1 goos:linux goarch:amd64
Hash8Bytes-4                         215ns ± 0%     562ns ± 1%   +161.58%  (p=0.016 n=4+5)
Hash320Bytes-4                       821ns ± 1%    1096ns ± 0%    +33.45%  (p=0.008 n=5+5)
Hash1K-4                            1.64µs ± 0%    2.26µs ± 0%    +37.21%  (p=0.008 n=5+5)
Hash8K-4                            10.6µs ± 0%    12.5µs ± 0%    +18.04%  (p=0.008 n=5+5)
pkg:crypto/sha256 goos:linux goarch:amd64
Hash8Bytes-4                         310ns ± 1%     679ns ± 1%   +118.96%  (p=0.008 n=5+5)
Hash1K-4                            3.61µs ± 0%    3.99µs ± 0%    +10.50%  (p=0.008 n=5+5)
Hash8K-4                            26.8µs ± 3%    26.6µs ± 1%       ~     (p=0.548 n=5+5)
pkg:crypto/sha512 goos:linux goarch:amd64
Hash8Bytes-4                         419ns ± 1%     805ns ± 1%    +91.94%  (p=0.008 n=5+5)
Hash1K-4                            2.67µs ± 1%    3.13µs ± 1%    +17.25%  (p=0.008 n=5+5)
Hash8K-4                            18.0µs ± 0%    18.8µs ± 1%     +4.07%  (p=0.008 n=5+5)
pkg:crypto/tls goos:linux goarch:amd64
Throughput/MaxPacket/1MB-4          4.02ms ± 1%    3.48ms ± 1%    -13.48%  (p=0.008 n=5+5)
Throughput/MaxPacket/2MB-4          6.11ms ± 2%    5.66ms ± 1%     -7.46%  (p=0.008 n=5+5)
Throughput/MaxPacket/4MB-4          10.3ms ± 1%    10.0ms ± 1%     -3.65%  (p=0.008 n=5+5)
Throughput/MaxPacket/8MB-4          18.6ms ± 1%    18.5ms ± 0%       ~     (p=0.151 n=5+5)
Throughput/MaxPacket/16MB-4         35.1ms ± 1%    35.5ms ± 1%       ~     (p=0.222 n=5+5)
Throughput/MaxPacket/32MB-4         68.0ms ± 1%    69.9ms ± 2%     +2.67%  (p=0.008 n=5+5)
Throughput/MaxPacket/64MB-4          133ms ± 1%     137ms ± 0%     +2.90%  (p=0.008 n=5+5)
Throughput/DynamicPacket/1MB-4      4.11ms ± 1%    3.55ms ± 2%    -13.55%  (p=0.008 n=5+5)
Throughput/DynamicPacket/2MB-4      6.32ms ± 4%    5.70ms ± 2%     -9.80%  (p=0.008 n=5+5)
Throughput/DynamicPacket/4MB-4      10.5ms ± 1%    10.1ms ± 1%     -3.51%  (p=0.008 n=5+5)
Throughput/DynamicPacket/8MB-4      18.7ms ± 1%    18.6ms ± 0%       ~     (p=0.222 n=5+5)
Throughput/DynamicPacket/16MB-4     35.3ms ± 1%    35.7ms ± 1%     +1.18%  (p=0.032 n=5+5)
Throughput/DynamicPacket/32MB-4     67.9ms ± 0%    69.6ms ± 1%     +2.44%  (p=0.008 n=5+5)
Throughput/DynamicPacket/64MB-4      134ms ± 0%     137ms ± 1%     +2.21%  (p=0.016 n=4+5)
Latency/MaxPacket/200kbps-4          699ms ± 1%     697ms ± 0%       ~     (p=0.151 n=5+5)
Latency/MaxPacket/500kbps-4          286ms ± 0%     283ms ± 0%     -0.84%  (p=0.008 n=5+5)
Latency/MaxPacket/1000kbps-4         147ms ± 0%     145ms ± 0%     -1.62%  (p=0.008 n=5+5)
Latency/MaxPacket/2000kbps-4        77.9ms ± 1%    74.6ms ± 2%     -4.22%  (p=0.008 n=5+5)
Latency/MaxPacket/5000kbps-4        35.5ms ± 0%    33.2ms ± 4%     -6.46%  (p=0.008 n=5+5)
Latency/DynamicPacket/200kbps-4      139ms ± 3%     138ms ± 0%       ~     (p=0.151 n=5+5)
Latency/DynamicPacket/500kbps-4     60.7ms ± 1%    58.9ms ± 1%     -2.97%  (p=0.008 n=5+5)
Latency/DynamicPacket/1000kbps-4    34.0ms ± 2%    32.1ms ± 1%     -5.51%  (p=0.008 n=5+5)
Latency/DynamicPacket/2000kbps-4    20.0ms ± 1%    18.1ms ± 3%     -9.52%  (p=0.008 n=5+5)
Latency/DynamicPacket/5000kbps-4    10.2ms ± 4%    10.1ms ± 8%       ~     (p=1.000 n=5+5)

@rsc rsc added this to the Go1.10 milestone Aug 18, 2017

@valyala

Contributor

commented Aug 27, 2017

Cc'ing @agl

@aead

Contributor

commented Sep 1, 2017

  • crypto/aes
    OFB and CFB are "old" modes and should not be used any more. It is probably not worth it to optimize (introduce assembly) for those cipher modes.
    Currently only s390x provides an optimized implementation of AES-CTR. There is an open issue for amd64 - See #20967
    Currently only s390x provides an optimized implementation of AES-CBC. The reason is probably TLS. Since TLS 1.3 only allows AEADs, I don't think it is worth optimizing the AES-CBC implementation.

  • crypto/ecdsa
    P256 is implemented in asm for amd64 and s390x while P384 and P521 are using the (non-constant time) generic implementation.

  • crypto/rsa
    Go's RSA implementation is not highly optimized. (See e.g. #20058)

The other points may require a closer look - very nice to bring this up 👍

@vielmetti


commented Nov 8, 2017

This issue should have a "Performance" label.

@odeke-em

Member

commented Mar 5, 2018

/cc-ing @FiloSottile too
