crypto/aes: add assembly for non-AES-NI machines #4299

Open
alexbrainman opened this Issue Oct 29, 2012 · 26 comments

9 participants
Member

alexbrainman commented Oct 29, 2012

The Go HTTPS server described in https://golang.org/issue/4073?c=8,
tested with

siege --benchmark --concurrent=100 "https://localhost:8082";

gives

Lifting the server siege...      done.
Transactions:                    779 hits
Availability:                 100.00 %
Elapsed time:                  40.56 secs
Data transferred:               0.02 MB
Response time:                  4.85 secs
Transaction rate:              19.21 trans/sec
Throughput:                     0.00 MB/sec
Concurrency:                   93.11
Successful transactions:         779
Failed transactions:               0
Longest transaction:           10.22
Shortest transaction:           0.34

But nginx does better:

Transactions:                   5120 hits
Availability:                 100.00 %
Elapsed time:                  53.87 secs
Data transferred:               0.74 MB
Response time:                  1.04 secs
Transaction rate:              95.04 trans/sec
Throughput:                     0.01 MB/sec
Concurrency:                   98.92
Successful transactions:        5120
Failed transactions:               0
Longest transaction:            1.08
Shortest transaction:           0.15

hg id is 8e87cb8dca7d. windows/386.

linux/386 golang server does about the same:

Lifting the server siege...      done.
Transactions:                   1867 hits
Availability:                 100.00 %
Elapsed time:                 118.75 secs
Data transferred:               0.05 MB
Response time:                  6.23 secs
Transaction rate:              15.72 trans/sec
Throughput:                     0.00 MB/sec
Concurrency:                   97.89
Successful transactions:        1867
Failed transactions:               0
Longest transaction:           13.59
Shortest transaction:           0.32

https://golang.org/issue/4073?c=6 reports similar results when comparing against a
"hello world node.js app".

I would investigate more, but I know nothing about SSL.

Alex
Member

alexbrainman commented Oct 29, 2012

Comment 1:

Issue #4073 has been merged into this issue.

Contributor

agl commented Oct 29, 2012

Comment 2:

nginx is probably using a less computationally intensive ciphersuite and it's using
OpenSSL rather than math/big.
If anyone is keen on improving things, a constant time modexp in math/big would go a
long way, but it's very complex. We could also likely reduce the cost of RSA blinding by
squaring the blind in the same manner as OpenSSL rather than generating a new blind each time.
Lastly, a high-speed, constant time P-256 would also help.

Labels changed: added priority-later, removed priority-triage.

Status changed to LongTerm.

Contributor

rsc commented Dec 9, 2012

Comment 3:

Labels changed: added go1.1maybe, removed go1.1.

Contributor

robpike commented Mar 7, 2013

Comment 4:

Labels changed: removed go1.1maybe.

Comment 5 by jlmoiron:

I was running into this from another direction, downloading over https, and wrote a
program that demonstrates the issue, not through lower footprint but increased CPU
utilization.  The program is on the playground here:
http://play.golang.org/p/chCbgqS_ls
It is almost certainly possible to create a more straightforward example that passes
data straight through the crypto functions without relying on the network, but I was
trying to see if the issue was related to fetching multiple ssl streams at once or not
and wanted to isolate this anyway.
A sample pprof topN output on a linux-64bit machine:
Total: 953 samples
     472  49.5%  49.5%      472  49.5% crypto/aes.decryptBlockGo
     302  31.7%  81.2%      785  82.4% crypto/cipher.(*cbcDecrypter).CryptBlocks
      85   8.9%  90.1%       85   8.9% crypto/sha1.block
      36   3.8%  93.9%       36   3.8% runtime.memmove
       9   0.9%  94.9%      479  50.3% crypto/aes.decryptBlock
       5   0.5%  95.4%      483  50.7% crypto/aes.(*aesCipher).Decrypt
       4   0.4%  95.8%        4   0.4% runtime.futex
       3   0.3%  96.1%        3   0.3% ifaceeq1
       3   0.3%  96.4%        9   0.9% syscall.read
       2   0.2%  96.6%        2   0.2% crypto/hmac.(*hmac).tmpPad
       2   0.2%  96.9%        5   0.5% crypto/sha1.(*digest).Sum
       2   0.2%  97.1%        2   0.2% netpollblock
This doesn't appear to be limited to aes, as downloading from a different source
(unfortunately an internal server) gave me this pprof output:
Total: 1210 samples
     813  67.2%  67.2%      813  67.2% crypto/des.permuteBlock
     280  23.1%  90.3%     1016  84.0% crypto/des.feistel
      31   2.6%  92.9%       67   5.5% compress/flate.(*compressor).deflate
      21   1.7%  94.6%     1118  92.4% crypto/des.cryptBlock
       8   0.7%  95.3%     1125  93.0% crypto/cipher.(*cbcDecrypter).CryptBlocks
       6   0.5%  95.8%        6   0.5% compress/flate.(*compressor).findMatch
       6   0.5%  96.3%        6   0.5% encoding/binary.bigEndian.PutUint64
       5   0.4%  96.7%       19   1.6% compress/flate.(*huffmanEncoder).bitCounts
       5   0.4%  97.1%       10   0.8% runtime.mallocgc
       5   0.4%  97.5%        5   0.4% runtime.settype_flush
This was with `go version devel +98b396da54db Sun Apr 28 00:18:11 2013 +1000 linux/amd64`
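
A network-free version of the first trace could look roughly like the sketch below. It is not the playground program linked above: `cbcDecryptThroughput` is a hypothetical helper, the all-zero key and IV are placeholders suitable only for timing, and the numbers it prints depend on whether the machine has hardware AES support.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"fmt"
	"time"
)

// cbcDecryptThroughput (hypothetical helper) runs AES-128-CBC decryption over a
// size-byte buffer `rounds` times and reports the rate in MB/s.
func cbcDecryptThroughput(size, rounds int) (float64, error) {
	if size <= 0 || size%aes.BlockSize != 0 {
		return 0, fmt.Errorf("size %d is not a positive multiple of %d", size, aes.BlockSize)
	}
	key := make([]byte, 16)           // all-zero key: for timing only
	iv := make([]byte, aes.BlockSize) // all-zero IV: for timing only
	block, err := aes.NewCipher(key)
	if err != nil {
		return 0, err
	}
	buf := make([]byte, size)
	start := time.Now()
	for i := 0; i < rounds; i++ {
		// Decrypt in place; the content is irrelevant, only the work matters.
		cipher.NewCBCDecrypter(block, iv).CryptBlocks(buf, buf)
	}
	elapsed := time.Since(start).Seconds()
	return float64(size) * float64(rounds) / 1e6 / elapsed, nil
}

func main() {
	mbps, err := cbcDecryptThroughput(1<<20, 64)
	if err != nil {
		panic(err)
	}
	fmt.Printf("AES-128-CBC decrypt: %.1f MB/s\n", mbps)
}
```

Running it under `go tool pprof` should show whether the time lands in the pure Go block function, as in the trace above, or in an assembly implementation.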
Contributor

agl commented Apr 30, 2013

Comment 6:

jlmoiron: it's certainly true that the ciphers take CPU time. DES (in the second trace)
is well known to be a terrible CPU hog. AES (the first trace) isn't quite so bad.
Unfortunately your machine doesn't appear to have AES-NI support in the CPU so the AES
code is pure Go and somewhat slow. I'd be happy to have optimised versions for various
other CPUs but I'm afraid that we don't, yet.
Member

alexbrainman commented Oct 21, 2013

Comment 9:

Yes, go is much faster than before:
go (go version devel +d2cb80eac1ac Sat Oct 05 14:15:02 2013 +1000 linux/386):
Lifting the server siege...      done.
Transactions:                    837 hits
Availability:                 100.00 %
Elapsed time:                  16.95 secs
Data transferred:               0.02 MB
Response time:                  1.90 secs
Transaction rate:              49.38 trans/sec
Throughput:                     0.00 MB/sec
Concurrency:                   94.06
Successful transactions:         837
Failed transactions:               0
Longest transaction:            4.01
Shortest transaction:           0.50
but still not as fast as nginx
nginx:
Lifting the server siege...      done.
Transactions:                   1336 hits
Availability:                 100.00 %
Elapsed time:                  14.16 secs
Data transferred:               0.19 MB
Response time:                  1.02 secs
Transaction rate:              94.35 trans/sec
Throughput:                     0.01 MB/sec
Concurrency:                   96.06
Successful transactions:        1336
Failed transactions:               0
Longest transaction:            1.07
Shortest transaction:           0.12
Feel free to close this if you think there is nothing else we can do here.
Alex
Member

minux commented Oct 21, 2013

Comment 10:

Re #8, no, that commit implements AES-NI-based AES for amd64 only.
We still need to implement AES in assembly for non-AES-NI machines,
especially for 386.
Member

minux commented Oct 21, 2013

Comment 11:

Re #9, is the AES operation the bottleneck here?
If not, perhaps we can close this issue.
(I might one day get time to write AES in assembly for ARM, but I won't bother
doing that for 386; I believe AES-NI machines will be more widespread
before Go 1.3 is released.)
Member

alexbrainman commented Oct 21, 2013

Comment 12:

I don't know. Does this
U:\>go tool pprof main.exe c:\tmp\a.pprof
Welcome to pprof!  For help, type 'help'.
(pprof) top10
Total: 6211 samples
    5461  87.9%  87.9%     5461  87.9% etext
     135   2.2%  90.1%      135   2.2% math/big.addMulVVW
     133   2.1%  92.2%      133   2.1% math/big.subVV
      98   1.6%  93.8%       98   1.6% math/big.mulAddVWW
      45   0.7%  94.5%      327   5.3% math/big.nat.divLarge
      43   0.7%  95.2%       43   0.7% addroots
      31   0.5%  95.7%       31   0.5% crypto/elliptic.p256ReduceDegree
      23   0.4%  96.1%       23   0.4% runtime.settype_flush
      21   0.3%  96.4%       21   0.3% markonly
      20   0.3%  96.8%       34   0.5% crypto/elliptic.p256Mul
(pprof)
tell you anything?
Alex
Member

minux commented Oct 21, 2013

Comment 13:

So the bottleneck is the ECC/RSA part?
Member

alexbrainman commented Oct 21, 2013

Comment 14:

How can I tell?
Alex
Member

minux commented Oct 21, 2013

Comment 15:

It should be the ECC part. RSA shouldn't use math/big.nat.
Let's leave this issue to agl to decide.
Contributor

agl commented Oct 21, 2013

Comment 16:

Why wouldn't RSA use math/big.nat? In this case it's using P256 ECC, which is
implemented specially and *that* won't use math/big.
Contributor

rsc commented Nov 27, 2013

Comment 17:

Labels changed: added go1.3maybe.

Contributor

rsc commented Dec 4, 2013

Comment 18:

Labels changed: added release-none, removed go1.3maybe.

Contributor

rsc commented Dec 4, 2013

Comment 19:

Labels changed: added repo-main.

Comment 20 by alex.gaynor:

I've submitted a patch for constant time modular exponentiation
(https://golang.org/cl/94850043/), however it's slower than normal big.Int.Exp,
so I don't think it actually addresses the performance issue here.

Comment 21 by jtolds:

On Linux, it might be worth considering using an AF_ALG socket and interacting with the
kernel's crypto library. Instead of writing your own assembly, you would then get the
benefit of automatic hardware acceleration if specialized instructions or chips exist.
Here's more sample code for how to interact with AF_ALG on Linux (for cipher and hash
algorithms) http://src.carnivore.it/users/common/af_alg with blog post
http://carnivore.it/2011/04/23/openssl_-_af_alg
Here's an example from Go for SHA1:
https://github.com/jtolds/go-af-alg/blob/master/sha1/sha1_linux.go

Comment 22 by jtolds:

Worth pointing out that specialized chips are common and provide orders of magnitude
more performance on ARM devices.
Contributor

taruti commented May 17, 2014

Comment 23:

Please no cgo for crypto. Also AF_ALG is slower in many use cases if the amount of data
per operation is small.

Comment 24 by jtolds:

cgo isn't required for AF_ALG with some minor syscall pkg changes (documented in that
sha1_linux.go file I linked).
And you're right, a context switch is way slower for small amounts of data. If the
crypto package did end up using AF_ALG (my hope), I assume there'd be a threshold at
which it is done natively in user space before it switches to syscalls.
Member

marete commented Jun 14, 2014

Comment 25:

Not sure if this is specific to AES or if it is a CFB-mode issue, but: AES without
AES-NI on a Nehalem machine in CFB mode gets one an astonishingly anaemic 16 Mbps (2
MB/s). This is especially significant because the openpgp standard (RFC 4880) requires
all symmetric ciphers to be used in CFB mode. So, among other things, this makes the use
of go.crypto/openpgp essentially impossible for large bulk encryptions when using AES
128 or 256 as the symmetric cipher.

rsc added this to the Unplanned milestone Apr 10, 2015

gaillard commented Apr 19, 2016 edited

I am running into this performance bottleneck (not tiny payloads) since the amd64 assembly is not present for ARM. Is this the proper ticket to prod? There was mention in the dev mailing list that arm64 AES-NI is coming for go1.7, how likely is that?

CL https://golang.org/cl/38366 mentions this issue.
