New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

crypto/aes: implement ctrAble when using AES-NI on amd64 #20967

Open
Yawning opened this Issue Jul 10, 2017 · 21 comments

Comments

Projects
None yet
9 participants
@Yawning

Yawning commented Jul 10, 2017

What version of Go are you using (go version)?

1.8.3

What operating system and processor architecture are you using (go env)?

linux/amd64

What did you do?

Benchmarked CTR-AES128 backed by crypto/aes on a system with AES-NI.

What did you expect to see?

Acceptable performance.

What did you see instead?

~400 MB/s for 1 KiB writes. For reference the same system using crypto/aes's gcmAble does GCM-AES128 Seal() at ~1300 MB/s. A cursory look at the source suggests that there is no special case ctrAble implementation for AES-NI.

@mvdan mvdan changed the title from crypto/aes: Implement ctrAble when using AES-NI. to crypto/aes: implement ctrAble when using AES-NI Jul 10, 2017

@mvdan

This comment has been minimized.

Show comment
Hide comment
@mvdan
Member

mvdan commented Jul 10, 2017

/cc @agl @rsc

@FiloSottile

This comment has been minimized.

Show comment
Hide comment
@FiloSottile

FiloSottile Jul 10, 2017

Member

Hi @Yawning!

You might know this, but for context on the issue: ctrAble and gcmAble are interfaces to break the cipher.Block abstraction. AES-CTR on linux/amd64 does use AES-NI for block encryption, but the block mode is handled in Go, which is of course slower, as opposed to GCM or CTR on s390x.

I'd be happy to review a CL adding assembly to optimize CTR, and I assume it would be merged, but someone needs to do the work. Let us know if you'd like to.

I suggest adding "amd64" to the title. Label suggestions: NeedsFix HelpWanted Performance.

Member

FiloSottile commented Jul 10, 2017

Hi @Yawning!

You might know this, but for context on the issue: ctrAble and gcmAble are interfaces to break the cipher.Block abstraction. AES-CTR on linux/amd64 does use AES-NI for block encryption, but the block mode is handled in Go, which is of course slower, as opposed to GCM or CTR on s390x.

I'd be happy to review a CL adding assembly to optimize CTR, and I assume it would be merged, but someone needs to do the work. Let us know if you'd like to.

I suggest adding "amd64" to the title. Label suggestions: NeedsFix HelpWanted Performance.

@bradfitz bradfitz changed the title from crypto/aes: implement ctrAble when using AES-NI to crypto/aes: implement ctrAble when using AES-NI on amd64 Jul 10, 2017

@bradfitz bradfitz added this to the Go1.10 milestone Jul 10, 2017

@mmcloughlin

This comment has been minimized.

Show comment
Hide comment
@mmcloughlin

mmcloughlin Jul 18, 2017

Contributor

Sounds fun. I'd be interested in taking this on.

Contributor

mmcloughlin commented Jul 18, 2017

Sounds fun. I'd be interested in taking this on.

@FiloSottile

This comment has been minimized.

Show comment
Hide comment
@FiloSottile

FiloSottile Jul 18, 2017

Member
Member

FiloSottile commented Jul 18, 2017

@mmcloughlin

This comment has been minimized.

Show comment
Hide comment
@mmcloughlin

mmcloughlin Jul 18, 2017

Contributor

@Yawning or anyone else: I don't want to duplicate work, so let me know if you've made any progress in this direction.

I see some work here on AES https://git.schwanenlied.me/yawning/bsaes.git. You also suggest in katzenpost/core#1 that you're going to add the AES-NI instructions to that package.

I actually quite enjoy messing around with assembly every now and then :) As you say, since GCM has been done this shouldn't be hard in theory. Famous last words.

Contributor

mmcloughlin commented Jul 18, 2017

@Yawning or anyone else: I don't want to duplicate work, so let me know if you've made any progress in this direction.

I see some work here on AES https://git.schwanenlied.me/yawning/bsaes.git. You also suggest in katzenpost/core#1 that you're going to add the AES-NI instructions to that package.

I actually quite enjoy messing around with assembly every now and then :) As you say, since GCM has been done this shouldn't be hard in theory. Famous last words.

@Yawning

This comment has been minimized.

Show comment
Hide comment
@Yawning

Yawning Jul 19, 2017

@mmcloughlin Don't hold off on doing this on my account. I haven't gotten around to it yet, and the way I end up integrating it into my AES package will probably be quite different from something that's upstreamable.

Yawning commented Jul 19, 2017

@mmcloughlin Don't hold off on doing this on my account. I haven't gotten around to it yet, and the way I end up integrating it into my AES package will probably be quite different from something that's upstreamable.

@mmcloughlin

This comment has been minimized.

Show comment
Hide comment
@mmcloughlin

mmcloughlin Jul 19, 2017

Contributor

Thanks for clarifying @Yawning

Contributor

mmcloughlin commented Jul 19, 2017

Thanks for clarifying @Yawning

@mmcloughlin

This comment has been minimized.

Show comment
Hide comment
@mmcloughlin

mmcloughlin Jul 22, 2017

Contributor

I have just spent longer than I should investigating a concern I had about cipherhw.AESGCMSupport. Ultimately not fruitful, but I'll document it here in case someone somewhere on the internet ever has the same question.

The function cipherhw.AESGCMSupport delegates to hasAESNI. This checks bit 25 of CPUINFO (leaf 1 ecx).

My concern was that GCM also requires CLMUL which is indicated by bit 1 of the same CPUINFO word. Note that aes.hasGCMAsm checks both bits. It is not clear why we need both cipherhw.AESGCMSupport and aes.hasGCMAsm.

Anyway it seemed to me there might be an issue if there is a processor with AES available and CLMUL off. So I looked to see if that ever happens. I searched through a database of processor CPUIDs and it turns out it doesn't happen:

https://gist.github.com/mmcloughlin/66488e42a8fdbd9ab39c3f6438bb8ed7

Contributor

mmcloughlin commented Jul 22, 2017

I have just spent longer than I should investigating a concern I had about cipherhw.AESGCMSupport. Ultimately not fruitful, but I'll document it here in case someone somewhere on the internet ever has the same question.

The function cipherhw.AESGCMSupport delegates to hasAESNI. This checks bit 25 of CPUINFO (leaf 1 ecx).

My concern was that GCM also requires CLMUL which is indicated by bit 1 of the same CPUINFO word. Note that aes.hasGCMAsm checks both bits. It is not clear why we need both cipherhw.AESGCMSupport and aes.hasGCMAsm.

Anyway it seemed to me there might be an issue if there is a processor with AES available and CLMUL off. So I looked to see if that ever happens. I searched through a database of processor CPUIDs and it turns out it doesn't happen:

https://gist.github.com/mmcloughlin/66488e42a8fdbd9ab39c3f6438bb8ed7

@mmcloughlin

This comment has been minimized.

Show comment
Hide comment
@mmcloughlin

mmcloughlin Jul 25, 2017

Contributor

Some experiments on performance of encrypting multiple blocks concurrently with the same key:

https://github.com/mmcloughlin/aesnix

The work by Gueron et. al. referenced in gcm_amd64.s suggests 8 concurrent encryptions is optimal. The experiments I've run generally tend to agree, but results for 4+ appear more or less the same.

Contributor

mmcloughlin commented Jul 25, 2017

Some experiments on performance of encrypting multiple blocks concurrently with the same key:

https://github.com/mmcloughlin/aesnix

The work by Gueron et. al. referenced in gcm_amd64.s suggests 8 concurrent encryptions is optimal. The experiments I've run generally tend to agree, but results for 4+ appear more or less the same.

@crvv

This comment has been minimized.

Show comment
Hide comment
@crvv

crvv Jul 28, 2017

Contributor

How many concurrent encryptions is optimal depends on the processor.
I tested mmcloughlin's code on two different CPUs. The result is:
Intel Core i7-4770HQ

BenchmarkSingle-8   	200000000	         8.69 ns/op	1841.41 MB/s
BenchmarkMulti/2-8  	100000000	        10.4 ns/op	3063.63 MB/s
BenchmarkMulti/4-8  	100000000	        17.5 ns/op	3661.52 MB/s
BenchmarkMulti/6-8  	100000000	        22.9 ns/op	4193.59 MB/s
BenchmarkMulti/8-8  	50000000	        28.6 ns/op	4476.84 MB/s
BenchmarkMulti/10-8 	50000000	        35.3 ns/op	4529.11 MB/s
BenchmarkMulti/12-8 	30000000	        40.5 ns/op	4740.57 MB/s
BenchmarkMulti/14-8 	30000000	        45.2 ns/op	4955.40 MB/s

Intel Celeron N3050

BenchmarkSingle-2   	30000000	        47.1 ns/op	 339.35 MB/s
BenchmarkMulti/2-2  	20000000	        70.3 ns/op	 455.03 MB/s
BenchmarkMulti/4-2  	10000000	       117 ns/op	 543.86 MB/s
BenchmarkMulti/6-2  	10000000	       168 ns/op	 571.04 MB/s
BenchmarkMulti/8-2  	 5000000	       326 ns/op	 391.92 MB/s
BenchmarkMulti/10-2 	 3000000	       400 ns/op	 399.27 MB/s
BenchmarkMulti/12-2 	 3000000	       480 ns/op	 399.73 MB/s
BenchmarkMulti/14-2 	 3000000	       557 ns/op	 402.03 MB/s
Contributor

crvv commented Jul 28, 2017

How many concurrent encryptions is optimal depends on the processor.
I tested mmcloughlin's code on two different CPUs. The result is:
Intel Core i7-4770HQ

BenchmarkSingle-8   	200000000	         8.69 ns/op	1841.41 MB/s
BenchmarkMulti/2-8  	100000000	        10.4 ns/op	3063.63 MB/s
BenchmarkMulti/4-8  	100000000	        17.5 ns/op	3661.52 MB/s
BenchmarkMulti/6-8  	100000000	        22.9 ns/op	4193.59 MB/s
BenchmarkMulti/8-8  	50000000	        28.6 ns/op	4476.84 MB/s
BenchmarkMulti/10-8 	50000000	        35.3 ns/op	4529.11 MB/s
BenchmarkMulti/12-8 	30000000	        40.5 ns/op	4740.57 MB/s
BenchmarkMulti/14-8 	30000000	        45.2 ns/op	4955.40 MB/s

Intel Celeron N3050

BenchmarkSingle-2   	30000000	        47.1 ns/op	 339.35 MB/s
BenchmarkMulti/2-2  	20000000	        70.3 ns/op	 455.03 MB/s
BenchmarkMulti/4-2  	10000000	       117 ns/op	 543.86 MB/s
BenchmarkMulti/6-2  	10000000	       168 ns/op	 571.04 MB/s
BenchmarkMulti/8-2  	 5000000	       326 ns/op	 391.92 MB/s
BenchmarkMulti/10-2 	 3000000	       400 ns/op	 399.27 MB/s
BenchmarkMulti/12-2 	 3000000	       480 ns/op	 399.73 MB/s
BenchmarkMulti/14-2 	 3000000	       557 ns/op	 402.03 MB/s
@crvv

This comment has been minimized.

Show comment
Hide comment
@crvv

crvv Jul 28, 2017

Contributor

I have tried to optimize AES-CTR on AMD64. The result is:

  1. Encrypt the counter values in bulk.
    crvv@f33f081.
name        old speed     new speed      delta
AESCTR1K-8  520MB/s ± 1%  1210MB/s ± 2%  +132.58%
  1. Implement big-endian 128 bit integer addition in assembly.
    crvv@8b203f5
name        old speed     new speed      delta
AESCTR1K-8  520MB/s ± 1%  2014MB/s ± 3%  +286.96%
  1. merge 1 and 2. This is not significant.
    crvv@cb9cb27
name        old speed     new speed      delta
AESCTR1K-8  520MB/s ± 1%  2104MB/s ± 1%  +304.31%
  1. Like ctr_s390x.go, Implement XOR in assembly.
    crvv@958e355
name        old speed     new speed      delta
AESCTR1K-8  520MB/s ± 1%  2505MB/s ± 3%  +381.43%
Contributor

crvv commented Jul 28, 2017

I have tried to optimize AES-CTR on AMD64. The result is:

  1. Encrypt the counter values in bulk.
    crvv@f33f081.
name        old speed     new speed      delta
AESCTR1K-8  520MB/s ± 1%  1210MB/s ± 2%  +132.58%
  1. Implement big-endian 128 bit integer addition in assembly.
    crvv@8b203f5
name        old speed     new speed      delta
AESCTR1K-8  520MB/s ± 1%  2014MB/s ± 3%  +286.96%
  1. merge 1 and 2. This is not significant.
    crvv@cb9cb27
name        old speed     new speed      delta
AESCTR1K-8  520MB/s ± 1%  2104MB/s ± 1%  +304.31%
  1. Like ctr_s390x.go, Implement XOR in assembly.
    crvv@958e355
name        old speed     new speed      delta
AESCTR1K-8  520MB/s ± 1%  2505MB/s ± 3%  +381.43%
@gopherbot

This comment has been minimized.

Show comment
Hide comment
@gopherbot

gopherbot Jul 28, 2017

Change https://golang.org/cl/51670 mentions this issue: crypto/aes: add optimized implementation of AES-CTR for AMD64

gopherbot commented Jul 28, 2017

Change https://golang.org/cl/51670 mentions this issue: crypto/aes: add optimized implementation of AES-CTR for AMD64

@gopherbot

This comment has been minimized.

Show comment
Hide comment
@gopherbot

gopherbot Jul 28, 2017

Change https://golang.org/cl/51790 mentions this issue: crypto/aes: add encryptMany and use it to speed up ctr

gopherbot commented Jul 28, 2017

Change https://golang.org/cl/51790 mentions this issue: crypto/aes: add encryptMany and use it to speed up ctr

@FiloSottile

This comment has been minimized.

Show comment
Hide comment
@FiloSottile

FiloSottile Aug 30, 2017

Member

I compared the two CLs for speed on my MacBook Pro 3.1 GHz Intel Core i5.

51f9e92 - CL 51670 by @crvv

name        old time/op   new time/op    delta
AESCTR1K-4   1.37µs ± 2%    0.37µs ± 1%   -73.30%  (p=0.002 n=6+6)
AESCTR32-4   57.4ns ± 1%    21.1ns ± 3%   -63.14%  (p=0.002 n=6+6)

name        old speed     new speed      delta
AESCTR1K-4  742MB/s ± 2%  2776MB/s ± 1%  +274.21%  (p=0.002 n=6+6)
AESCTR32-4  558MB/s ± 1%  1514MB/s ± 3%  +171.39%  (p=0.002 n=6+6)

1d4a2c8 - CL 51790 by @mmcloughlin @TocarIP

name        old time/op   new time/op    delta
AESCTR1K-4   1.37µs ± 2%    0.48µs ± 2%   -64.81%  (p=0.002 n=6+6)
AESCTR32-4   57.4ns ± 1%    22.6ns ± 1%   -60.53%  (p=0.002 n=6+6)

name        old speed     new speed      delta
AESCTR1K-4  742MB/s ± 2%  2107MB/s ± 2%  +184.03%  (p=0.002 n=6+6)
AESCTR32-4  558MB/s ± 1%  1413MB/s ± 1%  +153.20%  (p=0.002 n=6+6)

CL 51670 is ~7-30% faster. I also marginally prefer the assembly macros and not touching the non-amd64 files. So unless I'm missing something I'd suggest focusing on that one.

@mmcloughlin @TocarIP Sorry you folks duplicated work :( I would of course appreciate it if you could help review. You already did, thanks, my bad! (But a +1 would be awesome.)

Member

FiloSottile commented Aug 30, 2017

I compared the two CLs for speed on my MacBook Pro 3.1 GHz Intel Core i5.

51f9e92 - CL 51670 by @crvv

name        old time/op   new time/op    delta
AESCTR1K-4   1.37µs ± 2%    0.37µs ± 1%   -73.30%  (p=0.002 n=6+6)
AESCTR32-4   57.4ns ± 1%    21.1ns ± 3%   -63.14%  (p=0.002 n=6+6)

name        old speed     new speed      delta
AESCTR1K-4  742MB/s ± 2%  2776MB/s ± 1%  +274.21%  (p=0.002 n=6+6)
AESCTR32-4  558MB/s ± 1%  1514MB/s ± 3%  +171.39%  (p=0.002 n=6+6)

1d4a2c8 - CL 51790 by @mmcloughlin @TocarIP

name        old time/op   new time/op    delta
AESCTR1K-4   1.37µs ± 2%    0.48µs ± 2%   -64.81%  (p=0.002 n=6+6)
AESCTR32-4   57.4ns ± 1%    22.6ns ± 1%   -60.53%  (p=0.002 n=6+6)

name        old speed     new speed      delta
AESCTR1K-4  742MB/s ± 2%  2107MB/s ± 2%  +184.03%  (p=0.002 n=6+6)
AESCTR32-4  558MB/s ± 1%  1413MB/s ± 1%  +153.20%  (p=0.002 n=6+6)

CL 51670 is ~7-30% faster. I also marginally prefer the assembly macros and not touching the non-amd64 files. So unless I'm missing something I'd suggest focusing on that one.

@mmcloughlin @TocarIP Sorry you folks duplicated work :( I would of course appreciate it if you could help review. You already did, thanks, my bad! (But a +1 would be awesome.)

@mmcloughlin

This comment has been minimized.

Show comment
Hide comment
@mmcloughlin

mmcloughlin Aug 30, 2017

Contributor

@FiloSottile I did some early work on this but stopped when I saw other people had taken it. The two CLs were from @crvv and Ilya Tocar, who does not appear to be on this thread.

Contributor

mmcloughlin commented Aug 30, 2017

@FiloSottile I did some early work on this but stopped when I saw other people had taken it. The two CLs were from @crvv and Ilya Tocar, who does not appear to be on this thread.

@FiloSottile

This comment has been minimized.

Show comment
Hide comment
@FiloSottile

FiloSottile Aug 30, 2017

Member

Ouch, this is what I get for writing replies on planes.

@TocarIP ^

Member

FiloSottile commented Aug 30, 2017

Ouch, this is what I get for writing replies on planes.

@TocarIP ^

@bronze1man

This comment has been minimized.

Show comment
Hide comment
@bronze1man

bronze1man Dec 29, 2017

I want to use this patch with golang 1.9 right now.
So i fork the CL 51670 and make it a standalone package. I think it may help others.
https://github.com/bronze1man/AesCtr

bronze1man commented Dec 29, 2017

I want to use this patch with golang 1.9 right now.
So i fork the CL 51670 and make it a standalone package. I think it may help others.
https://github.com/bronze1man/AesCtr

@agnivade

This comment has been minimized.

Show comment
Hide comment
@agnivade

agnivade Mar 26, 2018

Member

@FiloSottile - CL 51670 has a +1 now and it is waiting for a final +2 from you.

Would you be able to take a look ? I ask because I would love to add a AVX2 flavor to this and see what sort of performance improvements we get.

Member

agnivade commented Mar 26, 2018

@FiloSottile - CL 51670 has a +1 now and it is waiting for a final +2 from you.

Would you be able to take a look ? I ask because I would love to add a AVX2 flavor to this and see what sort of performance improvements we get.

@bradfitz bradfitz modified the milestones: Go1.11, Unplanned May 18, 2018

@gopherbot

This comment has been minimized.

Show comment
Hide comment
@gopherbot

gopherbot Sep 24, 2018

Change https://golang.org/cl/136896 mentions this issue: crypto/aes: optimize AES-CTR mode on amd64

gopherbot commented Sep 24, 2018

Change https://golang.org/cl/136896 mentions this issue: crypto/aes: optimize AES-CTR mode on amd64

@mmcloughlin

This comment has been minimized.

Show comment
Hide comment
@mmcloughlin

mmcloughlin Sep 24, 2018

Contributor

CL https://golang.org/cl/136896 brings CTR mode performance in line with GCM, as you would expect.

$ ./bin/go test -bench '(GCM|CTR).*1K' crypto/cipher
goos: darwin
goarch: amd64
pkg: crypto/cipher
BenchmarkAESGCMSeal1K-4   	 5000000	       269 ns/op	3803.33 MB/s
BenchmarkAESGCMOpen1K-4   	 5000000	       244 ns/op	4189.12 MB/s
BenchmarkAESCTR1K-4       	10000000	       230 ns/op	4426.85 MB/s
PASS
ok  	crypto/cipher	6.357s

@FiloSottile please let me know what would make this CL easiest to verify and review.

Contributor

mmcloughlin commented Sep 24, 2018

CL https://golang.org/cl/136896 brings CTR mode performance in line with GCM, as you would expect.

$ ./bin/go test -bench '(GCM|CTR).*1K' crypto/cipher
goos: darwin
goarch: amd64
pkg: crypto/cipher
BenchmarkAESGCMSeal1K-4   	 5000000	       269 ns/op	3803.33 MB/s
BenchmarkAESGCMOpen1K-4   	 5000000	       244 ns/op	4189.12 MB/s
BenchmarkAESCTR1K-4       	10000000	       230 ns/op	4426.85 MB/s
PASS
ok  	crypto/cipher	6.357s

@FiloSottile please let me know what would make this CL easiest to verify and review.

@gopherbot

This comment has been minimized.

Show comment
Hide comment
@gopherbot

gopherbot Sep 24, 2018

Change https://golang.org/cl/136897 mentions this issue: crypto/cipher: 8K benchmarks for AES stream modes

gopherbot commented Sep 24, 2018

Change https://golang.org/cl/136897 mentions this issue: crypto/cipher: 8K benchmarks for AES stream modes

gopherbot pushed a commit that referenced this issue Sep 25, 2018

crypto/cipher: 8K benchmarks for AES stream modes
Some parallelizable cipher modes may achieve peak performance for larger
block sizes. For this reason the AES-GCM mode already has an 8K
benchmark alongside the 1K version. This change introduces 8K benchmarks
for additional AES stream cipher modes.

Updates #20967

Change-Id: If97c6fbf31222602dcc200f8f418d95908ec1202
Reviewed-on: https://go-review.googlesource.com/136897
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Filippo Valsorda <filippo@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment