New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AES-GCM enabled with AVX512 vAES and vPCLMULQDQ. #17239
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comments below. I have not reviewed the assembler. I will accept external review of this code.
providers/implementations/ciphers/cipher_aes_gcm_hw_vaes_avx512.inc
Outdated
Show resolved
Hide resolved
providers/implementations/ciphers/cipher_aes_gcm_hw_vaes_avx512.inc
Outdated
Show resolved
Hide resolved
providers/implementations/ciphers/cipher_aes_gcm_hw_vaes_avx512.inc
Outdated
Show resolved
Hide resolved
providers/implementations/ciphers/cipher_aes_gcm_hw_vaes_avx512.inc
Outdated
Show resolved
Hide resolved
providers/implementations/ciphers/cipher_aes_gcm_hw_vaes_avx512.inc
Outdated
Show resolved
Hide resolved
providers/implementations/ciphers/cipher_aes_gcm_hw_vaes_avx512.inc
Outdated
Show resolved
Hide resolved
providers/implementations/ciphers/cipher_aes_gcm_hw_vaes_avx512.inc
Outdated
Show resolved
Hide resolved
providers/implementations/ciphers/cipher_aes_gcm_hw_vaes_avx512.inc
Outdated
Show resolved
Hide resolved
providers/implementations/ciphers/cipher_aes_gcm_hw_vaes_avx512.inc
Outdated
Show resolved
Hide resolved
dc12196
to
0468bb5
Compare
Thanks Matt for a quick review! I have fixed mentioned style issues plus the ones that were pointed by check-format.pl. |
LGTM. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, although I've not inspected the assembly.
I've lined up one ASM focused reviewer to start in January, and hope to have another lined up soon. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will look into this more in the coming days.
# ; It should be called before restoring the XMM registers | ||
# ; for Windows (XMM6-XMM15). | ||
# ; | ||
sub clear_scratch_zmms_asm { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clearing ZMM scratch registers is not needed, just vzeroupper should be used before returning.
This function can probably be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is currently not called. $CLEAR_SCRATCH_REGISTERS flag is set to zero, so only vzeroupper is called.
} | ||
|
||
# Clears all scratch GP registers | ||
sub clear_scratch_gps_asm { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, scratch registers probably do not need to be cleared.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same for GPR registers cleanup.
crypto/modes/asm/aes-gcm-avx512.pl
Outdated
.quad 0x0000000000000001, 0x0000000000000000 | ||
|
||
.align 16 | ||
TWO: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TWO and TWOf are not used, so they can be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed these and double checked and removed other unused data too. Thanks.
crypto/modes/asm/aes-gcm-avx512.pl
Outdated
.quad 0x0000000000000000, 0x0400000000000000 | ||
|
||
.align 64 | ||
ddq_addbe_5678: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ddq_addbe_5678 and ddq_addbe_888 are not used, so they can be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
crypto/modes/asm/aes-gcm-avx512.pl
Outdated
.quad 0x0000000000000000, 0x0400000000000000 | ||
|
||
.align 64 | ||
ddq_addbe_8888: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ddq_aadbe_8888 is not used, so it can be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
crypto/modes/asm/aes-gcm-avx512.pl
Outdated
my $ZTMP5 = $_[7]; | ||
my $ZTMP6 = $_[8]; | ||
my $ZTMP7 = $_[9]; | ||
my $HKEYS_RANGE = $_[10]; # ; "first16", "mid16", "last16", "all", "first32", "last32" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
last16 is not passed anywhere, so it can be removed (including lines 506, 571-581 and part of 583)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although this code path is currently not used in the code generation, I would leave this for completeness. It does not impact resulting assembly. Do you have any objections?
crypto/modes/asm/aes-gcm-avx512.pl
Outdated
|
||
$code .= <<___; | ||
sub \$`(16*16)`,$T2 | ||
je .L_CALC_AAD_done_${rndsuffix} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason for this alignment and others in 1588, 1591, 1594...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, the issues of code formatting. I'll fix formatting for all jump instructions.
crypto/modes/asm/aes-gcm-avx512.pl
Outdated
my $NUM_BLOCKS = $_[6]; # [in] can only be 1, 2, 3, 4, 5, ..., 15 or 16 (not 0) | ||
my $CTR = $_[7]; # [in/out] current counter value | ||
my $ENC_DEC = $_[8]; # [in] cipher direction (ENC/DEC) | ||
my $INSTANCE_TYPE = $_[9]; # [in] multi_call or single_call |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
multi_call is always used, so I suggest to simplify this and remove INSTANCE_TYPE here, as multi_call is always passed. There are several macros that have this argument and can be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, only "multi_call" option is used in this file across all macros. An upstream code (from IPsec-MB library) which is used as a basis for this contribution supports both (multi_call and single_call), so I would leave this as is for support reasons to allow easier sync with upstream if needed. It does not affect current code generation. Do you think it is a valid reasoning?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Decided to remove this (single_call) code path in generator to not overload it.
crypto/modes/asm/aes-gcm-avx512.pl
Outdated
|
||
my $rndsuffix = &random_string(); | ||
|
||
if ($INSTANCE_TYPE eq "single_call") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All this block can be removed, as "single_call" is never set
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see answer about multi_call / single_call above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
# This implementation is based on the AES-GCM code (AVX512VAES + VPCLMULQDQ) | ||
# from Intel(R) Multi-Buffer Crypto for IPsec Library v1.1 | ||
# (https://github.com/intel/intel-ipsec-mb). | ||
# Original author is Tomasz Kantecki <tomasz.kantecki@intel.com>. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be very useful to reference the 2 papers (Vinod Gopal et al) referenced in the ipsec-mb code here, since there are alot of useful background details. I believe it is OK to include references like this in the OpenSSL code - as I remember there are examples like this elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, will do.
crypto/modes/asm/aes-gcm-avx512.pl
Outdated
} | ||
|
||
# ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; | ||
# ;; GHASH_MUL MACRO to implement: Data*HashKey mod (128,127,126,121,0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The (128,127,126,121,0) notation for the reduction polynomial looks a little unusual. Perhaps something like "x^128 + x^127 + x^126 + x^121 + 1" is more conventional
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
crypto/modes/asm/aes-gcm-avx512.pl
Outdated
my $T4 = $_[5]; #; [clobbered] xmm/ymm/zmm | ||
my $T5 = $_[6]; #; [clobbered] xmm/ymm/zmm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like $T4 and $T5 are unused in the GHASH_MUL macro - perhaps remove
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
vpclmulqdq \$0x11,$HK,$GH,$T1 # ; $T1 = a1*b1 | ||
vpclmulqdq \$0x00,$HK,$GH,$T2 # ; $T2 = a0*b0 | ||
vpclmulqdq \$0x01,$HK,$GH,$T3 # ; $T3 = a1*b0 | ||
vpclmulqdq \$0x10,$HK,$GH,$GH # ; $GH = a0*b1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here it is a textbook multiplication with 4 X vpclmulqdq instructions - has the karatsuba method been considered, to reduce it to 3 X vpclmulqdq? (along with the additional ALU instructions required and additional Hashkey precomputations)
I see this approach discussed in the paper - "Vinodh Gopal et. al. Optimized Galois-Counter-Mode Implementation on Intel Architecture Processors. August, 2010". Potentially there were Arch improvements since the paper was written, meaning the benefit of removing a vpclmulqdq vs the additional ALU work for karatsuba doesn’t give a benefit for current Architectures
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, Karatsuba method was considered too, but as you correctly pointed out with the consistent uarch improvements simple schoolbook multiplication scales better.
crypto/modes/asm/aes-gcm-avx512.pl
Outdated
&GHASH_MUL($T5, $HK, $T1, $T3, $T4, $T6, $T2); | ||
$code .= <<___; | ||
vmovdqu64 $T5,@{[HashKeyByIdx(3,$GCM128_CTX)]} | ||
vinserti64x2 \$1,$T5,$ZT7,$ZT7 | ||
|
||
# ;; calculate HashKey^4<<1 mod poly | ||
___ | ||
&GHASH_MUL($T5, $HK, $T1, $T3, $T4, $T6, $T2); | ||
$code .= <<___; | ||
vmovdqu64 $T5,@{[HashKeyByIdx(4,$GCM128_CTX)]} | ||
vinserti64x2 \$0,$T5,$ZT7,$ZT7 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the start of this block, both HK^1 and HK^2 are known - couldn’t HK^3 and HK^4 be calculated now in a single call to GHASH_MUL using ymm registers, instead of 2 calls to GHASH_MUL using xmm registers (with some extra instructions to create a ymm register containing HK^1 and HK^2), saving 1 call to GHASH_MUL?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated, thanks.
crypto/modes/asm/aes-gcm-avx512.pl
Outdated
my $GPR2 = $_[4]; # [clobbered] GP register | ||
my $GPR3 = $_[5]; # [clobbered] GP register | ||
my $MASKREG = $_[6]; # [clobbered] mask register | ||
my $AAD_HASH = $_[7]; # [out] XMM for AAD_HASH value (xmm14) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should there be an assumption about the exact xmm register used here? i.e. any xmm could potentially be used for $AAD_HASH by a caller of the GCM_UPDATE_AAD subroutine? (not just xmm14)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, removed.
crypto/modes/asm/aes-gcm-avx512.pl
Outdated
my $ZTMP15 = $_[27]; # [clobbered] ZMM register | ||
my $ZTMP16 = $_[28]; # [clobbered] ZMM register | ||
my $ZTMP17 = $_[29]; # [clobbered] ZMM register | ||
my $ZTMP18 = $_[30]; # [clobbered] ZMM register | ||
my $ZTMP19 = $_[31]; # [clobbered] ZMM register | ||
my $ZTMP20 = $_[32]; # [clobbered] ZMM register | ||
my $ZTMP21 = $_[33]; # [clobbered] ZMM register | ||
my $ZTMP22 = $_[34]; # [clobbered] ZMM register |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like $ZTMP15 to $ZTMP22 are unused in the subroutine, remove?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, removed.
crypto/modes/asm/aes-gcm-avx512.pl
Outdated
my $GPR2 = $_[5]; # [clobbered] GP register | ||
my $GPR3 = $_[6]; # [clobbered] GP register | ||
my $MASKREG = $_[7]; # [clobbered] mask register | ||
my $CUR_COUNT = $_[8]; # [out] XMM with current counter (xmm2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should xmm2 be mentioned here? (i.e. it could be any xmm register used by subroutine caller)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, removed.
crypto/modes/asm/aes-gcm-avx512.pl
Outdated
TWOf: | ||
.quad 0x0000000000000000, 0x0200000000000000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TWOf looks to be unused, perhaps remove?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, removed also other unused data entries.
crypto/modes/asm/aes-gcm-avx512.pl
Outdated
mask_out_top_block: | ||
.quad 0xffffffffffffffff, 0xffffffffffffffff | ||
.quad 0xffffffffffffffff, 0xffffffffffffffff | ||
.quad 0xffffffffffffffff, 0xffffffffffffffff | ||
.quad 0x0000000000000000, 0x0000000000000000 | ||
___ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mask_out_top_block looks to be unused, perhaps remove?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep!
crypto/modes/asm/aes-gcm-avx512.pl
Outdated
# ;; - it is assumed that data read from $INPTR is already shuffled and | ||
# ;; $INPTR address is 64 byte aligned | ||
# ;; - there is an option to pass ready blocks through ZMM registers too. | ||
# ;; 4 extra parameters need to passed in such case and 21st argument can be empty |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could add ($_[20]) or ($ZTMP9) after "21st argument" to make it more clear which arg can be empty
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added, thank for noticing.
crypto/modes/asm/aes-gcm-avx512.pl
Outdated
} | ||
|
||
# ;; =========================================================================== | ||
# ;; schoolbook multiply of 16 blocks (8 x 16 bytes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
16 x 16 bytes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, updated.
crypto/modes/asm/aes-gcm-avx512.pl
Outdated
vshufi64x2 \$0x00,$ZT7,$ZT7,$ZT5 # ;; broadcast HashKey^8 across all ZT5 | ||
___ | ||
|
||
# ;; calculate HashKey^9<<1 mod poly, HashKey^10<<1 mod poly, ... HashKey^48<<1 mod poly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... HashKey^16<<1 mod poly
Since hkeys 17 .. 48 are computed somewhere else
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're correct, fixed.
crypto/modes/asm/aes-gcm-avx512.pl
Outdated
jb .L_AAD_blocks_13_${rndsuffix} | ||
je .L_AAD_blocks_14_${rndsuffix} | ||
cmp \$15,@{[DWORD($T2)]} | ||
je .L_AAD_blocks_15_${rndsuffix} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indentation of jump instructions is inconsistent throughout the module
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
crypto/modes/asm/aes-gcm-avx512.pl
Outdated
if ($do_reduction != 0) { | ||
|
||
# ;; GH1H holds reduced hash value | ||
# ;; - normally do "vmovdqa64 XWORD($HASH_IN_OUT), XWORD($GH1H)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could be converted to AT&T syntax for consistency and other places with ASM code in the comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, thanks.
crypto/modes/asm/aes-gcm-avx512.pl
Outdated
.align 32 | ||
.Laes_gcm_encrypt_${keylen}_avx512: | ||
___ | ||
&GCM_ENC_DEC("$arg1", "$arg2", "$arg3", "$arg4", "$arg5", "$arg6", "ENC", "multi_call"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Look like all macros are called with "multi-call". Is there a reason to keep all the "single-call" code in the macros?
crypto/modes/asm/aes-gcm-avx512.pl
Outdated
|
||
# ;; ================================================= | ||
# ;; Return GHASH value through $GH1H | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
empty if statement should be removed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, fixed all style comments below.
crypto/modes/asm/aes-gcm-avx512.pl
Outdated
# $code .= <<___; | ||
# add $PLAIN_CIPH_LEN,`$CTX_OFFSET_InLen`($GCM128_CTX) | ||
# ___ | ||
# } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this commented out code be removed?
crypto/modes/asm/aes-gcm-avx512.pl
Outdated
|
||
$code .= <<___; | ||
cmp \$`(32 * 16)`,$LENGTH | ||
jb .L_message_below_32_blocks_${rndsuffix} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indentation off here and lots more places below in this function
crypto/modes/asm/aes-gcm-avx512.pl
Outdated
$code .= <<___; | ||
# ;; Check aes_keys != NULL | ||
test $arg1,$arg1 | ||
jz .Labort_setiv |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indentation off for jz here and below
- Removed unused data - Removed unused code branch in perl generator (related to single_call scenario) - Indentation fixes - Added references to papers used in the work etc
0468bb5
to
4cf4274
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small comment from me, apart from that, rest looks good. One more thing, will this 3rd commit be squashed into the first commit after addressing all comments?
crypto/modes/asm/aes-gcm-avx512.pl
Outdated
@@ -497,8 +519,7 @@ sub precompute_hkeys_on_stack { | |||
my $ZTMP4 = $_[6]; | |||
my $ZTMP5 = $_[7]; | |||
my $ZTMP6 = $_[8]; | |||
my $ZTMP7 = $_[9]; | |||
my $HKEYS_RANGE = $_[10]; # ; "first16", "mid16", "last16", "all", "first32", "last32" | |||
my $HKEYS_RANGE = $_[9]; # ; "first16", "mid16", "last16", "all", "first32", "last32" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think last16 is not passed anywhere, so it can be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We generally squash commits when they are merged but it is better if the author does this beforehand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I planned to do it (squash commits) when review is done. I think separate commits allow to better track changes during reviews, especially in such large files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for clarifying! So if you agree with my comment about removing last16, I'm done with the review. Once this is settled, I will approve. Thanks for the work!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @pablodelara for the review. I removed "last16" part.
precompute_hkeys_on_stack() routine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! Thanks for the work, Andrey!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Andrey
I've checked all the resolutions to my comments and have no outstanding issues so I'm happy to Approve.
It is great work! And I have learned a lot during the review, thanks.
TJ
@mdcornu - are you satisfied with the resolutions to your comments applied to this PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks OK to me.
My approval stands. Thanks for the assistance everyone. |
This pull request is ready to merge |
Vectorized 'stitched' encrypt + ghash implementation of AES-GCM enabled with AVX512 vAES and vPCLMULQDQ instructions (available starting Intel's IceLake micro-architecture). The performance details for representative IceLake Server and Client platforms are shown below Performance data: OpenSSL Speed KBs/Sec Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (1Core/1Thread) Payload in Bytes 16 64 256 1024 8192 16384 AES-128-GCM Baseline 478708.27 1118296.96 2428092.52 3518199.4 4172355.99 4235762.07 Patched 534613.95 2009345.55 3775588.15 5059517.64 8476794.88 8941541.79 Speedup 1.12 1.80 1.55 1.44 2.03 2.11 AES-256-GCM Baseline 399237.27 961699.9 2136377.65 2979889.15 3554823.37 3617757.5 Patched 475948.13 1720128.51 3462407.12 4696832.2 7532013.16 7924953.91 Speedup 1.19 1.79 1.62 1.58 2.12 2.19 Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz (1Core/1Thread) Payload in Bytes 16 64 256 1024 8192 16384 AES-128-GCM Baseline 259128.54 570756.43 1362554.16 1990654.57 2359128.88 2401671.58 Patched 292139.47 1079320.95 2001974.63 2829007.46 4510318.59 4705314.41 Speedup 1.13 1.89 1.47 1.42 1.91 1.96 AES-256-GCM Baseline 236000.34 550506.76 1234638.08 1716734.57 2011255.6 2028099.99 Patched 247256.32 919731.34 1773270.43 2553239.55 3953115.14 4111227.29 Speedup 1.05 1.67 1.44 1.49 1.97 2.03 Reviewed-by: TJ O'Dwyer, Marcel Cornu, Pablo de Lara Reviewed-by: Paul Dale <pauli@openssl.org> Reviewed-by: Tomas Mraz <tomas@openssl.org> (Merged from #17239)
Merged after squashing the commits and expanding the commit message. Thank you all for your contribution and reviews. |
Vectorized 'stitched' encrypt + ghash implementation of AES-GCM enabled with AVX512 vAES and vPCLMULQDQ instructions (available starting Intel's IceLake micro-architecture). The performance details for representative IceLake Server and Client platforms are shown below Performance data: OpenSSL Speed KBs/Sec Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (1Core/1Thread) Payload in Bytes 16 64 256 1024 8192 16384 AES-128-GCM Baseline 478708.27 1118296.96 2428092.52 3518199.4 4172355.99 4235762.07 Patched 534613.95 2009345.55 3775588.15 5059517.64 8476794.88 8941541.79 Speedup 1.12 1.80 1.55 1.44 2.03 2.11 AES-256-GCM Baseline 399237.27 961699.9 2136377.65 2979889.15 3554823.37 3617757.5 Patched 475948.13 1720128.51 3462407.12 4696832.2 7532013.16 7924953.91 Speedup 1.19 1.79 1.62 1.58 2.12 2.19 Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz (1Core/1Thread) Payload in Bytes 16 64 256 1024 8192 16384 AES-128-GCM Baseline 259128.54 570756.43 1362554.16 1990654.57 2359128.88 2401671.58 Patched 292139.47 1079320.95 2001974.63 2829007.46 4510318.59 4705314.41 Speedup 1.13 1.89 1.47 1.42 1.91 1.96 AES-256-GCM Baseline 236000.34 550506.76 1234638.08 1716734.57 2011255.6 2028099.99 Patched 247256.32 919731.34 1773270.43 2553239.55 3953115.14 4111227.29 Speedup 1.05 1.67 1.44 1.49 1.97 2.03 Reviewed-by: TJ O'Dwyer, Marcel Cornu, Pablo de Lara Reviewed-by: Paul Dale <pauli@openssl.org> Reviewed-by: Tomas Mraz <tomas@openssl.org> (Merged from openssl#17239) (cherry picked from commit 63b996e)
Vectorized 'stitched' encrypt + ghash implementation of AES-GCM enabled with AVX512 vAES and vPCLMULQDQ instructions (available starting Intel's IceLake micro-architecture). The performance details for representative IceLake Server and Client platforms are shown below Performance data: OpenSSL Speed KBs/Sec Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (1Core/1Thread) Payload in Bytes 16 64 256 1024 8192 16384 AES-128-GCM Baseline 478708.27 1118296.96 2428092.52 3518199.4 4172355.99 4235762.07 Patched 534613.95 2009345.55 3775588.15 5059517.64 8476794.88 8941541.79 Speedup 1.12 1.80 1.55 1.44 2.03 2.11 AES-256-GCM Baseline 399237.27 961699.9 2136377.65 2979889.15 3554823.37 3617757.5 Patched 475948.13 1720128.51 3462407.12 4696832.2 7532013.16 7924953.91 Speedup 1.19 1.79 1.62 1.58 2.12 2.19 Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz (1Core/1Thread) Payload in Bytes 16 64 256 1024 8192 16384 AES-128-GCM Baseline 259128.54 570756.43 1362554.16 1990654.57 2359128.88 2401671.58 Patched 292139.47 1079320.95 2001974.63 2829007.46 4510318.59 4705314.41 Speedup 1.13 1.89 1.47 1.42 1.91 1.96 AES-256-GCM Baseline 236000.34 550506.76 1234638.08 1716734.57 2011255.6 2028099.99 Patched 247256.32 919731.34 1773270.43 2553239.55 3953115.14 4111227.29 Speedup 1.05 1.67 1.44 1.49 1.97 2.03 Reviewed-by: TJ O'Dwyer, Marcel Cornu, Pablo de Lara Reviewed-by: Paul Dale <pauli@openssl.org> Reviewed-by: Tomas Mraz <tomas@openssl.org> (Merged from openssl#17239) (cherry picked from commit 63b996e)
Vectorized 'stitched' encrypt + ghash implementation of AES-GCM enabled with AVX512 vAES and vPCLMULQDQ instructions (available starting Intel's IceLake micro-architecture). The performance details for representative IceLake Server and Client platforms are shown below Performance data: OpenSSL Speed KBs/Sec Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (1Core/1Thread) Payload in Bytes 16 64 256 1024 8192 16384 AES-128-GCM Baseline 478708.27 1118296.96 2428092.52 3518199.4 4172355.99 4235762.07 Patched 534613.95 2009345.55 3775588.15 5059517.64 8476794.88 8941541.79 Speedup 1.12 1.80 1.55 1.44 2.03 2.11 AES-256-GCM Baseline 399237.27 961699.9 2136377.65 2979889.15 3554823.37 3617757.5 Patched 475948.13 1720128.51 3462407.12 4696832.2 7532013.16 7924953.91 Speedup 1.19 1.79 1.62 1.58 2.12 2.19 Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz (1Core/1Thread) Payload in Bytes 16 64 256 1024 8192 16384 AES-128-GCM Baseline 259128.54 570756.43 1362554.16 1990654.57 2359128.88 2401671.58 Patched 292139.47 1079320.95 2001974.63 2829007.46 4510318.59 4705314.41 Speedup 1.13 1.89 1.47 1.42 1.91 1.96 AES-256-GCM Baseline 236000.34 550506.76 1234638.08 1716734.57 2011255.6 2028099.99 Patched 247256.32 919731.34 1773270.43 2553239.55 3953115.14 4111227.29 Speedup 1.05 1.67 1.44 1.49 1.97 2.03 Reviewed-by: TJ O'Dwyer, Marcel Cornu, Pablo de Lara Reviewed-by: Paul Dale <pauli@openssl.org> Reviewed-by: Tomas Mraz <tomas@openssl.org> (Merged from openssl#17239) (cherry picked from commit 63b996e)
Vectorized 'stitched' encrypt + ghash implementation of AES-GCM enabled with AVX512 vAES and vPCLMULQDQ instructions (available starting Intel's IceLake micro-architecture). The performance details for representative IceLake Server and Client platforms are shown below Performance data: OpenSSL Speed KBs/Sec Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (1Core/1Thread) Payload in Bytes 16 64 256 1024 8192 16384 AES-128-GCM Baseline 478708.27 1118296.96 2428092.52 3518199.4 4172355.99 4235762.07 Patched 534613.95 2009345.55 3775588.15 5059517.64 8476794.88 8941541.79 Speedup 1.12 1.80 1.55 1.44 2.03 2.11 AES-256-GCM Baseline 399237.27 961699.9 2136377.65 2979889.15 3554823.37 3617757.5 Patched 475948.13 1720128.51 3462407.12 4696832.2 7532013.16 7924953.91 Speedup 1.19 1.79 1.62 1.58 2.12 2.19 Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz (1Core/1Thread) Payload in Bytes 16 64 256 1024 8192 16384 AES-128-GCM Baseline 259128.54 570756.43 1362554.16 1990654.57 2359128.88 2401671.58 Patched 292139.47 1079320.95 2001974.63 2829007.46 4510318.59 4705314.41 Speedup 1.13 1.89 1.47 1.42 1.91 1.96 AES-256-GCM Baseline 236000.34 550506.76 1234638.08 1716734.57 2011255.6 2028099.99 Patched 247256.32 919731.34 1773270.43 2553239.55 3953115.14 4111227.29 Speedup 1.05 1.67 1.44 1.49 1.97 2.03 Reviewed-by: TJ O'Dwyer, Marcel Cornu, Pablo de Lara Reviewed-by: Paul Dale <pauli@openssl.org> Reviewed-by: Tomas Mraz <tomas@openssl.org> (Merged from #17239) (cherry picked from commit 63b996e)
The proposed patch provides a vectorized 'stitched' encrypt + ghash implementation of AES-GCM enabled with AVX512 vAES and vPCLMULQDQ instructions (available starting Intel's IceLake micro-architecture).
The performance details for representative IceLake Server and Client platforms are shown below