Description
There are several optimizations that can be applied to the AES-CTR implementation.
- amd64: When the low 64-bit counter does not overflow within an 8-block group, take the fast path that keeps all subsequent counters in XMM registers. Fill the first counter exactly as before, then expand it in XMM space and only fall back to the scalar path if a carry is detected. This netted up to an 18% improvement for AES-128 on long buffers. (CL 714361) A pure-Go sketch of the carry check appears after this list.
- amd64: Use VAES/AVX2 to process two 128-bit blocks per instruction. VAES is already available on AMD Zen 3 and newer, where a similar optimization reportedly delivers up to a 74% gain.
- amd64: Extend VAES support to AVX-512 so we can process four blocks at a time on wide vector machines. In the Linux kernel this optimization brought up to a 150% speed improvement. A rough dispatch sketch covering this item and the previous one also follows the list.
- amd64: Server CPUs expose 32 vector registers. We should be able to keep the expanded key hot in registers instead of reloading it from memory every eight blocks. Not tested.
- arm64: Likewise, ARM64 gives us 32 vector registers. We can cache the expanded key in registers while the loop runs, rather than re-fetching it every iteration. Tested; brings up to a 20% speedup.
- arm64: Replace the eight scalar `VLD1`/`VST1` pairs with wider `LD1 {Vx, Vy}` and `ST1 {Vx, Vy}` instructions (or the new `LD1R` forms) so we move 32 bytes per issue. Not tested.
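
For reference, here is a minimal pure-Go sketch of the carry check and counter expansion behind the XMM fast path in the first item. This is not the code from CL 714361; it just assumes the usual 128-bit big-endian CTR counter, and the names (`ctrGroupSize`, `lowCounterCarries`, `expandCounters`) are made up for illustration.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// ctrGroupSize is the number of blocks handled per unrolled group
// (hypothetical name; the fast path described above works 8 blocks at a time).
const ctrGroupSize = 8

// lowCounterCarries reports whether deriving the next ctrGroupSize counters
// from ctr would carry out of the low 64 bits of the big-endian 128-bit
// counter. When it returns true, the asm code would take the scalar
// fallback path instead of the XMM fast path.
func lowCounterCarries(ctr [16]byte) bool {
	lo := binary.BigEndian.Uint64(ctr[8:])
	// Conservative: treat a wrap anywhere in lo .. lo+ctrGroupSize as a carry.
	return lo+ctrGroupSize < lo
}

// expandCounters is a pure-Go picture of what the XMM fast path does: fill
// the first counter as before, then derive the rest by adding to the low
// 64 bits only. Only valid when lowCounterCarries reports false.
func expandCounters(ctr [16]byte) [ctrGroupSize][16]byte {
	var out [ctrGroupSize][16]byte
	lo := binary.BigEndian.Uint64(ctr[8:])
	for i := range out {
		out[i] = ctr
		binary.BigEndian.PutUint64(out[i][8:], lo+uint64(i))
	}
	return out
}

func main() {
	var ctr [16]byte
	binary.BigEndian.PutUint64(ctr[8:], ^uint64(0)-3) // wraps partway through a group
	fmt.Println("carry within group:", lowCounterCarries(ctr)) // true -> scalar fallback

	ctr[15] = 0 // move the counter away from the wrap point: fast path applies
	blocks := expandCounters(ctr)
	fmt.Println("first two expanded counters:", blocks[0], blocks[1])
}
```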
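
And a rough sketch of how the VAES items could be dispatched. The feature flags and path names below are placeholders rather than the actual internal/cpu fields; the point is only that the widest supported vector path wins and everything else falls back to the existing AES-NI code.

```go
package main

import "fmt"

// Hypothetical CPU feature flags; in the real tree these would come from
// internal/cpu (VAES, AVX2, and the relevant AVX-512 extensions).
var (
	hasAESNI      = true  // existing baseline path
	hasVAESAVX2   = false // 256-bit VAES: two blocks per instruction (e.g. Zen 3+)
	hasVAESAVX512 = false // 512-bit VAES: four blocks per instruction
)

// ctrPath picks which assembly routine the CTR key-stream generator would
// dispatch to, preferring the widest vector path the CPU supports.
func ctrPath() string {
	switch {
	case hasVAESAVX512:
		return "vaes-avx512 (4 blocks per instruction)"
	case hasVAESAVX2:
		return "vaes-avx2 (2 blocks per instruction)"
	case hasAESNI:
		return "aesni (1 block per instruction)"
	default:
		return "generic"
	}
}

func main() {
	fmt.Println("selected CTR path:", ctrPath())
}
```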