crypto/aes: add assembly for non-AES-NI machines #4299
Comments
Issue #4073 has been merged into this issue. |
nginx is probably using a less computationally intensive ciphersuite, and it's using OpenSSL rather than math/big. If anyone is keen on improving things, a constant time modexp in math/big would go a long way, but it's very complex. We could also likely reduce the cost of RSA blinding by squaring the blind in the same manner as OpenSSL rather than generating a new blind each time. Lastly, a high-speed, constant time P-256 would also help. Labels changed: added priority-later, removed priority-triage. Status changed to LongTerm. |
I was running into this from another direction, downloading over https, and wrote a program that demonstrates the issue, not through lower footprint but increased CPU utilization. The program is on the playground here: http://play.golang.org/p/chCbgqS_ls

It is almost certainly possible to create a more straightforward example that passes data straight through the crypto functions without relying on the network, but I was trying to see if the issue was related to fetching multiple ssl streams at once or not and wanted to isolate this anyway.

A sample pprof topN output on a linux-64bit machine:

    Total: 953 samples
     472  49.5%  49.5%   472  49.5%  crypto/aes.decryptBlockGo
     302  31.7%  81.2%   785  82.4%  crypto/cipher.(*cbcDecrypter).CryptBlocks
      85   8.9%  90.1%    85   8.9%  crypto/sha1.block
      36   3.8%  93.9%    36   3.8%  runtime.memmove
       9   0.9%  94.9%   479  50.3%  crypto/aes.decryptBlock
       5   0.5%  95.4%   483  50.7%  crypto/aes.(*aesCipher).Decrypt
       4   0.4%  95.8%     4   0.4%  runtime.futex
       3   0.3%  96.1%     3   0.3%  ifaceeq1
       3   0.3%  96.4%     9   0.9%  syscall.read
       2   0.2%  96.6%     2   0.2%  crypto/hmac.(*hmac).tmpPad
       2   0.2%  96.9%     5   0.5%  crypto/sha1.(*digest).Sum
       2   0.2%  97.1%     2   0.2%  netpollblock

This doesn't appear to be limited to aes, as downloading from a different source (unfortunately an internal server) gave me this pprof output:

    Total: 1210 samples
     813  67.2%  67.2%   813  67.2%  crypto/des.permuteBlock
     280  23.1%  90.3%  1016  84.0%  crypto/des.feistel
      31   2.6%  92.9%    67   5.5%  compress/flate.(*compressor).deflate
      21   1.7%  94.6%  1118  92.4%  crypto/des.cryptBlock
       8   0.7%  95.3%  1125  93.0%  crypto/cipher.(*cbcDecrypter).CryptBlocks
       6   0.5%  95.8%     6   0.5%  compress/flate.(*compressor).findMatch
       6   0.5%  96.3%     6   0.5%  encoding/binary.bigEndian.PutUint64
       5   0.4%  96.7%    19   1.6%  compress/flate.(*huffmanEncoder).bitCounts
       5   0.4%  97.1%    10   0.8%  runtime.mallocgc
       5   0.4%  97.5%     5   0.4%  runtime.settype_flush

This was with `go version devel +98b396da54db Sun Apr 28 00:18:11 2013 +1000 linux/amd64` |
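A network-free reproduction along the lines suggested in the comment above might look like the following minimal sketch. The function name `cbcDecryptThroughput`, the AES-128 key size, and the 1 MiB buffer are assumptions chosen for illustration; the sketch simply times `CryptBlocks` on random data, so the number it reports depends on the machine and on whether the Go build has an assembly AES path for that CPU.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
	"time"
)

// cbcDecryptThroughput decrypts a random buffer of the given size with
// AES-128-CBC for the given number of rounds and reports MB/s.
// The data is random, so the "plaintext" is meaningless; only the time
// spent in the cipher matters here.
func cbcDecryptThroughput(size, rounds int) (float64, error) {
	if size <= 0 || size%aes.BlockSize != 0 {
		return 0, fmt.Errorf("size %d is not a positive multiple of %d", size, aes.BlockSize)
	}
	key := make([]byte, 16)
	iv := make([]byte, aes.BlockSize)
	buf := make([]byte, size)
	for _, p := range [][]byte{key, iv, buf} {
		if _, err := rand.Read(p); err != nil {
			return 0, err
		}
	}
	block, err := aes.NewCipher(key)
	if err != nil {
		return 0, err
	}

	start := time.Now()
	for i := 0; i < rounds; i++ {
		// A fresh decrypter per round; decrypting in place is allowed.
		cipher.NewCBCDecrypter(block, iv).CryptBlocks(buf, buf)
	}
	elapsed := time.Since(start).Seconds()
	return float64(size) * float64(rounds) / 1e6 / elapsed, nil
}

func main() {
	mbps, err := cbcDecryptThroughput(1<<20, 64)
	if err != nil {
		panic(err)
	}
	fmt.Printf("AES-128-CBC decrypt: %.1f MB/s\n", mbps)
}
```

Running the same program under a CPU profiler should show whether time is spent in the pure Go block function or in an assembly implementation.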
jlmoiron: it's certainly true that the ciphers take CPU time. DES (in the second trace) is well known to be a terrible CPU hog. AES (the first trace) isn't quite so bad. Unfortunately your machine doesn't appear to have AES-NI support in the CPU so the AES code is pure Go and somewhat slow. I'd be happy to have optimised versions for various other CPUs but I'm afraid that we don't, yet. |
Is this fixed by https://code.google.com/p/go/source/detail?r=57503accfdc7 now? |
Yes, go is much faster than before:

go (go version devel +d2cb80eac1ac Sat Oct 05 14:15:02 2013 +1000 linux/386):

    Lifting the server siege... done.
    Transactions:              837 hits
    Availability:              100.00 %
    Elapsed time:              16.95 secs
    Data transferred:          0.02 MB
    Response time:             1.90 secs
    Transaction rate:          49.38 trans/sec
    Throughput:                0.00 MB/sec
    Concurrency:               94.06
    Successful transactions:   837
    Failed transactions:       0
    Longest transaction:       4.01
    Shortest transaction:      0.50

but still not as fast as nginx.

nginx:

    Lifting the server siege... done.
    Transactions:              1336 hits
    Availability:              100.00 %
    Elapsed time:              14.16 secs
    Data transferred:          0.19 MB
    Response time:             1.02 secs
    Transaction rate:          94.35 trans/sec
    Throughput:                0.01 MB/sec
    Concurrency:               96.06
    Successful transactions:   1336
    Failed transactions:       0
    Longest transaction:       1.07
    Shortest transaction:      0.12

Feel free to close this if you think there is nothing else we can do here. Alex |
I don't know. Does this

    U:\>go tool pprof main.exe c:\tmp\a.pprof
    Welcome to pprof! For help, type 'help'.
    (pprof) top10
    Total: 6211 samples
    5461  87.9%  87.9%  5461  87.9%  etext
     135   2.2%  90.1%   135   2.2%  math/big.addMulVVW
     133   2.1%  92.2%   133   2.1%  math/big.subVV
      98   1.6%  93.8%    98   1.6%  math/big.mulAddVWW
      45   0.7%  94.5%   327   5.3%  math/big.nat.divLarge
      43   0.7%  95.2%    43   0.7%  addroots
      31   0.5%  95.7%    31   0.5%  crypto/elliptic.p256ReduceDegree
      23   0.4%  96.1%    23   0.4%  runtime.settype_flush
      21   0.3%  96.4%    21   0.3%  markonly
      20   0.3%  96.8%    34   0.5%  crypto/elliptic.p256Mul
    (pprof)

tell you anything? Alex |
I've submitted a patch for constant time modular exponentiation (https://golang.org/cl/94850043/), however it's slower than normal big.Int.Exp, so I don't think it actually addresses the performance issue here. |
On Linux, it might be worth considering using an AF_ALG socket and interacting with the kernel's crypto library. Then, instead of writing your own assembly, you would get the benefit of automatic hardware acceleration if specialized instructions or chips exist. Here's more sample code for how to interact with AF_ALG on Linux (for cipher and hash algorithms): http://src.carnivore.it/users/common/af_alg with blog post http://carnivore.it/2011/04/23/openssl_-_af_alg Here's an example from Go for SHA1: https://github.com/jtolds/go-af-alg/blob/master/sha1/sha1_linux.go |
cgo isn't required for AF_ALG with some minor syscall pkg changes (documented in that sha1_linux.go file I linked). And you're right, a context switch is way slower for small amounts of data. If the crypto package did end up using AF_ALG (my hope), I assume there'd be a threshold at which it is done natively in user space before it switches to syscalls. |
Not sure if this is specific to AES or if it is a CFB-mode issue, but: AES without AES-NI on a Nehalem machine in CFB mode gets one an astonishingly anaemic 16 Mbps (2 MB/s). This is especially significant because the openpgp standard (RFC 4880) requires all symmetric ciphers to be used in CFB mode. So, among other things, this makes the use of go.crypto/openpgp essentially impossible for large bulk encryptions when using AES 128 or 256 as the symmetric cipher. |
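To check a CFB throughput figure like the one reported above on a given machine, one option is a small standalone benchmark. This is a minimal sketch under stated assumptions: the function name `cfbEncryptThroughput`, the AES-128 key size, and the 1 MiB buffer are illustrative choices, and it uses `testing.Benchmark` from the standard library so that the iteration count is picked automatically. Note that `cipher.NewCFBEncrypter` is marked deprecated in recent Go releases but remains available.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
	"testing"
)

// cfbEncryptThroughput measures AES-128-CFB encryption of a random buffer
// of the given size and returns the observed rate in MB/s.
func cfbEncryptThroughput(size int) float64 {
	// Init is needed when calling Benchmark outside "go test";
	// it has no effect if it was already called.
	testing.Init()
	r := testing.Benchmark(func(b *testing.B) {
		key := make([]byte, 16)
		iv := make([]byte, aes.BlockSize)
		buf := make([]byte, size)
		for _, p := range [][]byte{key, iv, buf} {
			if _, err := rand.Read(p); err != nil {
				b.Fatal(err)
			}
		}
		block, err := aes.NewCipher(key)
		if err != nil {
			b.Fatal(err)
		}
		b.SetBytes(int64(size))
		b.ResetTimer()
		for i := 0; i < b.N; i++ {
			// CFB is a stream mode, so the buffer needs no padding.
			cipher.NewCFBEncrypter(block, iv).XORKeyStream(buf, buf)
		}
	})
	if r.T <= 0 {
		return 0
	}
	// r.Bytes is the byte count per iteration, r.N the iteration count.
	return float64(r.Bytes) * float64(r.N) / 1e6 / r.T.Seconds()
}

func main() {
	fmt.Printf("AES-128-CFB encrypt: %.1f MB/s\n", cfbEncryptThroughput(1<<20))
}
```

Comparing the result across machines with and without hardware AES support gives a rough sense of how much of the cost comes from the block cipher itself rather than from the mode.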
I am running into this performance bottleneck (not tiny payloads) since the amd64 assembly is not present for ARM. Is this the proper ticket to prod? There was mention in the dev mailing list that arm64 AES-NI is coming for go1.7, how likely is that? |
CL https://golang.org/cl/38366 mentions this issue. |
Anyone planning on making progress on that CL linked above? |