
crypto/cipher: AES CTR optimizations #76061

Description

@starius

There are several optimizations that could be applied to the AES CTR implementation (a benchmark sketch for measuring them is included after the list).

  • amd64: When the low 64-bit half of the counter does not overflow within an 8-block group, take a fast path that keeps all subsequent counters in XMM registers: fill the first counter exactly as before, expand it in XMM space, and fall back to the scalar path only if a carry is detected (see the pure-Go sketch after this list). This netted up to an 18% improvement for AES-128 on long buffers. (CL 714361)

  • amd64: Use VAES/AVX2 to process two 128-bit blocks per instruction. VAES is already available on AMD Zen 3 and newer, where a similar optimization reportedly delivers up to a 74% gain.

  • amd64: Extend VAES support to AVX-512 so we can process four blocks at a time on wide vector machines. In the Linux kernel, this optimization brought up to a 150% speed improvement.

  • amd64: Server CPUs expose 32 vector registers. We should be able to keep the expanded key hot in registers instead of reloading it from memory every eight blocks. Not tested.

  • arm64: Likewise, arm64 gives us 32 vector registers. We can cache the expanded key in registers while the loop runs, rather than re-fetching it every iteration. Tested; brings up to a 20% speedup.

  • arm64: Replace the eight scalar VLD1/VST1 pairs with wider LD1 {Vx, Vy} and ST1 {Vx, Vy} instructions (or the new LD1R forms) so we move 32 bytes per issue. Not tested.
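
To make the first bullet concrete, here is a pure-Go sketch of the carry-detection idea. It is an illustration only, not the real amd64 assembly: expandCounters and groupBlocks are names invented for this sketch, and on the actual fast path the eight derived counters would stay in XMM registers rather than being written out to an array.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const groupBlocks = 8 // blocks processed per group on the amd64 path

// expandCounters fills out[0..7] with the big-endian 128-bit counters
// ctr, ctr+1, ..., ctr+7. It returns true if the fast path was taken.
func expandCounters(ctr [16]byte, out *[groupBlocks][16]byte) bool {
	hi := binary.BigEndian.Uint64(ctr[0:8])
	lo := binary.BigEndian.Uint64(ctr[8:16])

	// Fast path: the low 64-bit half cannot wrap while adding 0..7,
	// so the high half is the same for every block in the group and
	// each counter is the base counter plus a small 64-bit increment.
	if lo <= ^uint64(0)-groupBlocks+1 {
		for i := 0; i < groupBlocks; i++ {
			binary.BigEndian.PutUint64(out[i][0:8], hi)
			binary.BigEndian.PutUint64(out[i][8:16], lo+uint64(i))
		}
		return true
	}

	// Slow path: propagate the carry into the high half, block by block.
	for i := 0; i < groupBlocks; i++ {
		binary.BigEndian.PutUint64(out[i][0:8], hi)
		binary.BigEndian.PutUint64(out[i][8:16], lo)
		lo++
		if lo == 0 {
			hi++
		}
	}
	return false
}

func main() {
	var ctr [16]byte
	// Put the low half one step away from overflow to exercise the slow path.
	binary.BigEndian.PutUint64(ctr[8:16], ^uint64(0))
	var group [groupBlocks][16]byte
	fast := expandCounters(ctr, &group)
	fmt.Println("fast path:", fast)
	fmt.Printf("block 1 counter: %x\n", group[1])
}
```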

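The percentages quoted in the list are throughput comparisons. As a point of reference, a small standard-library benchmark along the following lines (the package and function names are hypothetical, not part of this issue) can reproduce such measurements: run it before and after a change with `go test -bench CTR -count 10` and compare the results with benchstat.

```go
// A self-contained benchmark of AES-128 CTR throughput through the public
// crypto/cipher API. The buffer size and names are illustrative only.
package ctrbench

import (
	"crypto/aes"
	"crypto/cipher"
	"testing"
)

func BenchmarkAESCTR8K(b *testing.B) {
	key := make([]byte, 16)           // AES-128
	iv := make([]byte, aes.BlockSize) // 16-byte counter block
	buf := make([]byte, 8192)         // long buffer, where the fast paths matter most

	block, err := aes.NewCipher(key)
	if err != nil {
		b.Fatal(err)
	}
	stream := cipher.NewCTR(block, iv)

	b.SetBytes(int64(len(buf)))
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		stream.XORKeyStream(buf, buf)
	}
}
```
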
Labels: Implementation, NeedsInvestigation, Performance
