Description
There are several optimizations that can be applied to the AES-CTR implementation.
- amd64: When the low 64-bit counter does not overflow within an 8-block group, take the fast path that keeps all subsequent counters in XMM registers. Fill the first counter exactly as before, then expand it in XMM space and only fall back to the scalar path if a carry is detected. This netted up to an 18% improvement for AES-128 on long buffers. (CL 714361) A pure-Go sketch of the carry check appears after this list.
- amd64: Use VAES/AVX2 to process two 128-bit blocks per instruction. VAES is already available on AMD Zen 3 and newer, where a similar optimization reportedly delivers up to a 74% gain.
- amd64: Extend VAES support to AVX-512 so we can process four blocks at a time on wide vector machines. In the Linux kernel this optimization brought up to a 150% speed improvement. A rough dispatch sketch covering this item and the previous one also follows the list.
- amd64: Server CPUs expose 32 vector registers. We should be able to keep the expanded key hot in registers instead of reloading it from memory every eight blocks. Not tested.
- arm64: Likewise, ARM64 gives us 32 vector registers. We can cache the expanded key in registers while the loop runs, rather than re-fetching it every iteration. Tested; brings up to a 20% speedup.
- arm64: Replace the eight scalar `VLD1`/`VST1` pairs with wider `LD1 {Vx, Vy}` and `ST1 {Vx, Vy}` instructions (or the new `LD1R` forms) so we move 32 bytes per issue. Not tested.
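
For reference, here is a minimal pure-Go sketch of the carry check and counter expansion behind the XMM fast path in the first item. This is not the code from CL 714361; it just assumes the usual 128-bit big-endian CTR counter, and the names (`ctrGroupSize`, `lowCounterCarries`, `expandCounters`) are made up for illustration.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// ctrGroupSize is the number of blocks handled per unrolled group
// (hypothetical name; the fast path described above works 8 blocks at a time).
const ctrGroupSize = 8

// lowCounterCarries reports whether deriving the next ctrGroupSize counters
// from ctr would carry out of the low 64 bits of the big-endian 128-bit
// counter. When it returns true, the asm code would take the scalar
// fallback path instead of the XMM fast path.
func lowCounterCarries(ctr [16]byte) bool {
	lo := binary.BigEndian.Uint64(ctr[8:])
	// Conservative: treat a wrap anywhere in lo .. lo+ctrGroupSize as a carry.
	return lo+ctrGroupSize < lo
}

// expandCounters is a pure-Go picture of what the XMM fast path does: fill
// the first counter as before, then derive the rest by adding to the low
// 64 bits only. Only valid when lowCounterCarries reports false.
func expandCounters(ctr [16]byte) [ctrGroupSize][16]byte {
	var out [ctrGroupSize][16]byte
	lo := binary.BigEndian.Uint64(ctr[8:])
	for i := range out {
		out[i] = ctr
		binary.BigEndian.PutUint64(out[i][8:], lo+uint64(i))
	}
	return out
}

func main() {
	var ctr [16]byte
	binary.BigEndian.PutUint64(ctr[8:], ^uint64(0)-3) // wraps partway through a group
	fmt.Println("carry within group:", lowCounterCarries(ctr)) // true -> scalar fallback

	ctr[15] = 0 // move the counter away from the wrap point: fast path applies
	blocks := expandCounters(ctr)
	fmt.Println("first two expanded counters:", blocks[0], blocks[1])
}
```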
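
And a rough sketch of how the VAES items could be dispatched. The feature flags and path names below are placeholders rather than the actual internal/cpu fields; the point is only that the widest supported vector path wins and everything else falls back to the existing AES-NI code.

```go
package main

import "fmt"

// Hypothetical CPU feature flags; in the real tree these would come from
// internal/cpu (VAES, AVX2, and the relevant AVX-512 extensions).
var (
	hasAESNI      = true  // existing baseline path
	hasVAESAVX2   = false // 256-bit VAES: two blocks per instruction (e.g. Zen 3+)
	hasVAESAVX512 = false // 512-bit VAES: four blocks per instruction
)

// ctrPath picks which assembly routine the CTR key-stream generator would
// dispatch to, preferring the widest vector path the CPU supports.
func ctrPath() string {
	switch {
	case hasVAESAVX512:
		return "vaes-avx512 (4 blocks per instruction)"
	case hasVAESAVX2:
		return "vaes-avx2 (2 blocks per instruction)"
	case hasAESNI:
		return "aesni (1 block per instruction)"
	default:
		return "generic"
	}
}

func main() {
	fmt.Println("selected CTR path:", ctrPath())
}
```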