
AES-GCM enabled with AVX512 vAES and vPCLMULQDQ. #17239

Closed

Conversation

amatyuko-intc
Contributor

The proposed patch provides a vectorized 'stitched' encrypt + GHASH implementation of AES-GCM, enabled with the AVX512 vAES and vPCLMULQDQ instructions (available starting with Intel's Ice Lake micro-architecture).

Performance details for representative Ice Lake server and client platforms are shown below.

Performance data:
OpenSSL Speed KBs/Sec
Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (1Core/1Thread)
Payload in Bytes       16          64        256         1024        8192      16384
AES-128-GCM
  Baseline      478708.27   1118296.96  2428092.52  3518199.4   4172355.99  4235762.07
  Patched       534613.95   2009345.55  3775588.15  5059517.64  8476794.88  8941541.79
  Speedup            1.12         1.80        1.55        1.44        2.03        2.11
 
AES-256-GCM
  Baseline      399237.27   961699.9    2136377.65  2979889.15  3554823.37  3617757.5
  Patched       475948.13   1720128.51  3462407.12  4696832.2   7532013.16  7924953.91
  Speedup            1.19        1.79         1.62        1.58        2.12        2.19
Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz (1Core/1Thread)
Payload in Bytes       16          64        256         1024        8192      16384
AES-128-GCM
  Baseline      259128.54   570756.43   1362554.16  1990654.57  2359128.88  2401671.58
  Patched       292139.47   1079320.95  2001974.63  2829007.46  4510318.59  4705314.41
  Speedup            1.13        1.89         1.47        1.42        1.91        1.96
AES-256-GCM
  Baseline      236000.34   550506.76   1234638.08  1716734.57  2011255.6   2028099.99
  Patched       247256.32   919731.34   1773270.43  2553239.55  3953115.14  4111227.29
  Speedup            1.05        1.67         1.44        1.49        1.97        2.03
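
For background, the two per-block operations that the 'stitched' implementation interleaves are the CTR-mode encryption and the GHASH update. A sketch of the standard GCM relations (general background, not text from this PR):

    % Per-block GCM encryption and authentication, where H = E_K(0^128) is the
    % hash key and X_0 is the initial GHASH accumulator:
    \[
        C_i = P_i \oplus E_K(\mathrm{ctr}_i), \qquad
        X_i = (X_{i-1} \oplus C_i) \cdot H \quad \text{in } GF(2^{128})
    \]
    % The vectorized code evaluates several counter blocks with vAES and several
    % GHASH multiplications with vPCLMULQDQ in the same loop iteration, which is
    % what "stitching" refers to here.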

@github-actions github-actions bot added the severity: fips change The pull request changes FIPS provider sources label Dec 8, 2021
@t8m t8m added branch: master Merge to master branch triaged: feature The issue/pr requests/adds a feature approval: otc review pending This pull request needs review by an OTC member labels Dec 8, 2021
@mattcaswell mattcaswell left a comment (Member)

Comments below. I have not reviewed the assembler. I will accept external review of this code.

@amatyuko-intc
Contributor Author

Thanks, Matt, for the quick review! I have fixed the style issues you mentioned, plus the ones pointed out by check-format.pl.

@mattcaswell
Member

LGTM.

@paulidale paulidale left a comment (Contributor)

LGTM, although I've not inspected the assembly.

@paulidale paulidale added approval: review pending This pull request needs review by a committer and removed approval: otc review pending This pull request needs review by an OTC member labels Dec 10, 2021
@dtzimmerman

I've lined up one ASM-focused reviewer to start in January, and hope to have another lined up soon.

@mdcornu mdcornu left a comment

Will look into this more in the coming days.

# ; It should be called before restoring the XMM registers
# ; for Windows (XMM6-XMM15).
# ;
sub clear_scratch_zmms_asm {

Clearing the ZMM scratch registers is not needed; just vzeroupper should be used before returning.
This function can probably be removed.

Contributor Author

This function is currently not called: the $CLEAR_SCRATCH_REGISTERS flag is set to zero, so only vzeroupper is used.
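
For reference, a minimal sketch (in the same perlasm style as the file) of the exit path this implies; the code below is illustrative only and not taken from the PR:

    # ;; Hypothetical epilogue sketch: only the upper vector state is
    # ;; cleared via vzeroupper; scratch ZMM/GP registers are not wiped.
    $code .= <<___;
        vzeroupper              # ; zero the upper bits of the vector registers
        ret
___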

}

# Clears all scratch GP registers
sub clear_scratch_gps_asm {

Same here, scratch registers probably do not need to be cleared.

Contributor Author

Same here for the GPR register cleanup.

.quad 0x0000000000000001, 0x0000000000000000

.align 16
TWO:

TWO and TWOf are not used, so they can be removed.

Contributor Author

Removed these, and double-checked and removed other unused data too. Thanks.

.quad 0x0000000000000000, 0x0400000000000000

.align 64
ddq_addbe_5678:

ddq_addbe_5678 and ddq_addbe_888 are not used, so they can be removed.

Contributor Author

Removed.

.quad 0x0000000000000000, 0x0400000000000000

.align 64
ddq_addbe_8888:

ddq_addbe_8888 is not used, so it can be removed.

Contributor Author

Removed.

my $ZTMP5 = $_[7];
my $ZTMP6 = $_[8];
my $ZTMP7 = $_[9];
my $HKEYS_RANGE = $_[10]; # ; "first16", "mid16", "last16", "all", "first32", "last32"

last16 is not passed anywhere, so it can be removed (including lines 506, 571-581 and part of 583)

Contributor Author

Although this code path is currently not used in the code generation, I would leave it in for completeness. It does not affect the resulting assembly. Do you have any objections?


$code .= <<___;
sub \$`(16*16)`,$T2
je .L_CALC_AAD_done_${rndsuffix}

Any reason for this alignment and others in 1588, 1591, 1594...?

Contributor Author

No, these are just code-formatting issues. I'll fix the formatting for all jump instructions.

my $NUM_BLOCKS = $_[6]; # [in] can only be 1, 2, 3, 4, 5, ..., 15 or 16 (not 0)
my $CTR = $_[7]; # [in/out] current counter value
my $ENC_DEC = $_[8]; # [in] cipher direction (ENC/DEC)
my $INSTANCE_TYPE = $_[9]; # [in] multi_call or single_call

multi_call is always used, so I suggest simplifying this and removing INSTANCE_TYPE here, as multi_call is always passed. Several macros have this argument, and it can be removed from them as well.

Contributor Author

You're right, only the "multi_call" option is used across all macros in this file. The upstream code (from the IPsec-MB library) that this contribution is based on supports both (multi_call and single_call), so I would leave this as is to allow easier syncing with upstream if needed. It does not affect the current code generation. Do you think that is valid reasoning?

Contributor Author

Decided to remove the single_call code path from the generator so as not to overload it.


my $rndsuffix = &random_string();

if ($INSTANCE_TYPE eq "single_call") {

This whole block can be removed, as "single_call" is never set.

Contributor Author

Please see answer about multi_call / single_call above.

Contributor Author

Removed.

crypto/modes/asm/aes-gcm-avx512.pl (outdated; resolved)
Comment on lines +10 to +13
# This implementation is based on the AES-GCM code (AVX512VAES + VPCLMULQDQ)
# from Intel(R) Multi-Buffer Crypto for IPsec Library v1.1
# (https://github.com/intel/intel-ipsec-mb).
# Original author is Tomasz Kantecki <tomasz.kantecki@intel.com>.

It would be very useful to reference the two papers (Vinodh Gopal et al.) cited in the ipsec-mb code here, since they contain a lot of useful background detail. I believe it is OK to include references like this in the OpenSSL code; as I remember, there are examples like this elsewhere.

Contributor Author

Sure, will do.

}

# ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
# ;; GHASH_MUL MACRO to implement: Data*HashKey mod (128,127,126,121,0)

The (128,127,126,121,0) notation for the reduction polynomial looks a little unusual. Perhaps something like "x^128 + x^127 + x^126 + x^121 + 1" is more conventional

Contributor Author

Ok
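
For context, the two notations describe the same reduction; a short sketch of the standard GCM math (background, not text from the PR):

    % GHASH is defined over GF(2^128) with the irreducible polynomial
    %   g(x) = x^128 + x^7 + x^2 + x + 1.
    % The implementation works on bit-reflected operands, so the reduction it
    % actually performs corresponds to the reversed polynomial, which is what
    % the exponent list (128, 127, 126, 121, 0) denotes:
    \[
        g_{rev}(x) = x^{128} + x^{127} + x^{126} + x^{121} + 1
    \]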

Comment on lines 1228 to 1229
my $T4 = $_[5]; #; [clobbered] xmm/ymm/zmm
my $T5 = $_[6]; #; [clobbered] xmm/ymm/zmm

looks like $T4 and $T5 are unused in the GHASH_MUL macro - perhaps remove

Contributor Author

Removed.

Comment on lines +1233 to +1255
vpclmulqdq \$0x11,$HK,$GH,$T1 # ; $T1 = a1*b1
vpclmulqdq \$0x00,$HK,$GH,$T2 # ; $T2 = a0*b0
vpclmulqdq \$0x01,$HK,$GH,$T3 # ; $T3 = a1*b0
vpclmulqdq \$0x10,$HK,$GH,$GH # ; $GH = a0*b1

Here it is a textbook multiplication with 4 vpclmulqdq instructions - has the Karatsuba method been considered, to reduce it to 3 vpclmulqdq (along with the additional ALU instructions and additional hash-key precomputation required)?
I see this approach discussed in the paper "Vinodh Gopal et al., Optimized Galois-Counter-Mode Implementation on Intel Architecture Processors, August 2010". Potentially there have been architectural improvements since the paper was written, meaning that removing a vpclmulqdq at the cost of the extra ALU work for Karatsuba no longer gives a benefit on current architectures.

Contributor Author

Yes, the Karatsuba method was considered too, but as you correctly point out, with the consistent uarch improvements the simple schoolbook multiplication scales better.
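
For illustration, a rough sketch of the Karatsuba variant discussed above; this is not the code in the PR, and the name $HK_XOR (a hypothetical precomputed value holding, in its low 64 bits, the XOR of the hash key's high and low 64-bit halves) does not exist in the file:

    # ;; Karatsuba: 3 carry-less multiplies instead of 4, at the cost of extra
    # ;; shuffle/XOR work and one extra per-key precomputation ($HK_XOR).
    $code .= <<___;
        vpclmulqdq      \$0x11,$HK,$GH,$T1       # ; T1 = a1*b1
        vpclmulqdq      \$0x00,$HK,$GH,$T2       # ; T2 = a0*b0
        vpshufd         \$0x4e,$GH,$T3           # ; swap 64-bit halves of GH
        vpxorq          $GH,$T3,$T3              # ; T3.lo = a1 xor a0
        vpclmulqdq      \$0x00,$HK_XOR,$T3,$T3   # ; T3 = (a1^a0)*(b1^b0)
        vpxorq          $T1,$T3,$T3              # ; xor out the high product...
        vpxorq          $T2,$T3,$T3              # ; ...and the low: middle term
___
    # ;; the usual middle-term split and reduction then follow unchanged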

Comment on lines 1300 to 1310
&GHASH_MUL($T5, $HK, $T1, $T3, $T4, $T6, $T2);
$code .= <<___;
vmovdqu64 $T5,@{[HashKeyByIdx(3,$GCM128_CTX)]}
vinserti64x2 \$1,$T5,$ZT7,$ZT7

# ;; calculate HashKey^4<<1 mod poly
___
&GHASH_MUL($T5, $HK, $T1, $T3, $T4, $T6, $T2);
$code .= <<___;
vmovdqu64 $T5,@{[HashKeyByIdx(4,$GCM128_CTX)]}
vinserti64x2 \$0,$T5,$ZT7,$ZT7

At the start of this block, both HK^1 and HK^2 are known - couldn’t HK^3 and HK^4 be calculated now in a single call to GHASH_MUL using ymm registers, instead of 2 calls to GHASH_MUL using xmm registers (with some extra instructions to create a ymm register containing HK^1 and HK^2), saving 1 call to GHASH_MUL?

Contributor Author

Updated, thanks.

my $GPR2 = $_[4]; # [clobbered] GP register
my $GPR3 = $_[5]; # [clobbered] GP register
my $MASKREG = $_[6]; # [clobbered] mask register
my $AAD_HASH = $_[7]; # [out] XMM for AAD_HASH value (xmm14)

should there be an assumption about the exact xmm register used here? i.e. any xmm could potentially be used for $AAD_HASH by a caller of the GCM_UPDATE_AAD subroutine? (not just xmm14)

Contributor Author

Agree, removed.

Comment on lines 3617 to 3624
my $ZTMP15 = $_[27]; # [clobbered] ZMM register
my $ZTMP16 = $_[28]; # [clobbered] ZMM register
my $ZTMP17 = $_[29]; # [clobbered] ZMM register
my $ZTMP18 = $_[30]; # [clobbered] ZMM register
my $ZTMP19 = $_[31]; # [clobbered] ZMM register
my $ZTMP20 = $_[32]; # [clobbered] ZMM register
my $ZTMP21 = $_[33]; # [clobbered] ZMM register
my $ZTMP22 = $_[34]; # [clobbered] ZMM register

looks like $ZTMP15 to $ZTMP22 are unused in the subroutine, remove?

Contributor Author

Agree, removed.

my $GPR2 = $_[5]; # [clobbered] GP register
my $GPR3 = $_[6]; # [clobbered] GP register
my $MASKREG = $_[7]; # [clobbered] mask register
my $CUR_COUNT = $_[8]; # [out] XMM with current counter (xmm2)

should xmm2 be mentioned here? (i.e. it could be any xmm register used by subroutine caller)

Contributor Author

Agree, removed.

Comment on lines 4995 to 4996
TWOf:
.quad 0x0000000000000000, 0x0200000000000000

TWOf looks to be unused, perhaps remove?

Contributor Author

Yep, and also removed other unused data entries.

Comment on lines 5099 to 4951
mask_out_top_block:
.quad 0xffffffffffffffff, 0xffffffffffffffff
.quad 0xffffffffffffffff, 0xffffffffffffffff
.quad 0xffffffffffffffff, 0xffffffffffffffff
.quad 0x0000000000000000, 0x0000000000000000
___

mask_out_top_block looks to be unused, perhaps remove?

Contributor Author

Yep!

# ;; - it is assumed that data read from $INPTR is already shuffled and
# ;; $INPTR address is 64 byte aligned
# ;; - there is an option to pass ready blocks through ZMM registers too.
# ;; 4 extra parameters need to passed in such case and 21st argument can be empty

Could add "($_[20])" or "($ZTMP9)" after "21st argument" to make it clearer which arg can be empty.

Contributor Author

Added, thanks for noticing.

}

# ;; ===========================================================================
# ;; schoolbook multiply of 16 blocks (8 x 16 bytes)

16 x 16 bytes?

Contributor Author

Yep, updated.

vshufi64x2 \$0x00,$ZT7,$ZT7,$ZT5 # ;; broadcast HashKey^8 across all ZT5
___

# ;; calculate HashKey^9<<1 mod poly, HashKey^10<<1 mod poly, ... HashKey^48<<1 mod poly

... HashKey^16<<1 mod poly
Since hkeys 17 .. 48 are computed somewhere else

Contributor Author

You're correct, fixed.
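
For context, the reason powers of the hash key are precomputed at all is the standard GHASH aggregation identity (general background, not PR text):

    % Folding n blocks at once, starting from accumulator X_0:
    \[
        X_n = (X_0 \oplus C_1) \cdot H^{n} \oplus C_2 \cdot H^{n-1}
              \oplus \cdots \oplus C_n \cdot H
    \]
    % so with HashKey^1 .. HashKey^16 (or ^48) on hand, 16 (or 48) blocks can be
    % multiplied independently and the partial products XORed together before a
    % single reduction.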

jb .L_AAD_blocks_13_${rndsuffix}
je .L_AAD_blocks_14_${rndsuffix}
cmp \$15,@{[DWORD($T2)]}
je .L_AAD_blocks_15_${rndsuffix}

indentation of jump instructions is inconsistent throughout the module

Contributor Author

Fixed.

if ($do_reduction != 0) {

# ;; GH1H holds reduced hash value
# ;; - normally do "vmovdqa64 XWORD($HASH_IN_OUT), XWORD($GH1H)"

Could be converted to AT&T syntax for consistency, here and in other places with ASM code in the comments.

Contributor Author

Yep, thanks.
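
For example (assuming the quoted comment is written in Intel "dst, src" order), the AT&T-order form of that line would simply swap the operands:

    # ;; - normally do "vmovdqa64 XWORD($GH1H), XWORD($HASH_IN_OUT)"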

.align 32
.Laes_gcm_encrypt_${keylen}_avx512:
___
&GCM_ENC_DEC("$arg1", "$arg2", "$arg3", "$arg4", "$arg5", "$arg6", "ENC", "multi_call");

Looks like all macros are called with "multi_call". Is there a reason to keep all the "single_call" code in the macros?


# ;; =================================================
# ;; Return GHASH value through $GH1H
}

empty if statement should be removed

Contributor Author

Thanks, fixed all style comments below.

# $code .= <<___;
# add $PLAIN_CIPH_LEN,`$CTX_OFFSET_InLen`($GCM128_CTX)
# ___
# }

Can this commented out code be removed?


$code .= <<___;
cmp \$`(32 * 16)`,$LENGTH
jb .L_message_below_32_blocks_${rndsuffix}

Indentation is off here and in lots more places below in this function.

$code .= <<___;
# ;; Check aes_keys != NULL
test $arg1,$arg1
jz .Labort_setiv

indentation off for jz here and below

- Removed unused data
- Removed unused code branch in perl generator (related to single_call scenario)
- Indentation fixes
- Added references to papers used in the work
etc

@pablodelara pablodelara left a comment

Small comment from me; apart from that, the rest looks good. One more thing: will this third commit be squashed into the first commit after all comments are addressed?

@@ -497,8 +519,7 @@ sub precompute_hkeys_on_stack {
my $ZTMP4 = $_[6];
my $ZTMP5 = $_[7];
my $ZTMP6 = $_[8];
my $ZTMP7 = $_[9];
my $HKEYS_RANGE = $_[10]; # ; "first16", "mid16", "last16", "all", "first32", "last32"
my $HKEYS_RANGE = $_[9]; # ; "first16", "mid16", "last16", "all", "first32", "last32"

I think last16 is not passed anywhere, so it can be removed.

Contributor

We generally squash commits when they are merged but it is better if the author does this beforehand.

Contributor Author

I planned to squash the commits once the review is done. I think separate commits make it easier to track changes during review, especially in such large files.

Thanks for clarifying! So if you agree with my comment about removing last16, I'm done with the review. Once this is settled, I will approve. Thanks for the work!

Contributor Author

Thanks @pablodelara for the review. I removed the "last16" part.

precompute_hkeys_on_stack() routine.

@pablodelara pablodelara left a comment

Looks good to me! Thanks for the work, Andrey!

@tj-odwyer tj-odwyer left a comment

Hi Andrey,
I've checked all the resolutions to my comments and have no outstanding issues, so I'm happy to approve.
It is great work, and I have learned a lot during the review. Thanks.
TJ

@mattcaswell
Member

@mdcornu - are you satisfied with the resolutions to your comments applied to this PR?

@mdcornu mdcornu left a comment

Looks OK to me.

@paulidale
Contributor

My approval stands. Thanks for the assistance everyone.

@t8m t8m added approval: done This pull request has the required number of approvals and removed approval: review pending This pull request needs review by a committer labels Feb 9, 2022
@openssl-machine openssl-machine removed the approval: done This pull request has the required number of approvals label Feb 10, 2022
@openssl-machine
Collaborator

This pull request is ready to merge

@openssl-machine openssl-machine added the approval: ready to merge The 24 hour grace period has passed, ready to merge label Feb 10, 2022
openssl-machine pushed a commit that referenced this pull request Feb 10, 2022
Vectorized 'stitched' encrypt + ghash implementation of AES-GCM enabled
with AVX512 vAES and vPCLMULQDQ instructions (available starting Intel's
IceLake micro-architecture).

The performance details for representative IceLake Server and Client
platforms are shown below

Performance data:
OpenSSL Speed KBs/Sec
Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (1Core/1Thread)
Payload in Bytes       16          64        256         1024        8192      16384
AES-128-GCM
  Baseline      478708.27   1118296.96  2428092.52  3518199.4   4172355.99  4235762.07
  Patched       534613.95   2009345.55  3775588.15  5059517.64  8476794.88  8941541.79
  Speedup            1.12         1.80        1.55        1.44        2.03        2.11

AES-256-GCM
  Baseline      399237.27   961699.9    2136377.65  2979889.15  3554823.37  3617757.5
  Patched       475948.13   1720128.51  3462407.12  4696832.2   7532013.16  7924953.91
  Speedup            1.19        1.79         1.62        1.58        2.12        2.19
Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz (1Core/1Thread)
Payload in Bytes       16          64        256         1024        8192      16384
AES-128-GCM
  Baseline      259128.54   570756.43   1362554.16  1990654.57  2359128.88  2401671.58
  Patched       292139.47   1079320.95  2001974.63  2829007.46  4510318.59  4705314.41
  Speedup            1.13        1.89         1.47        1.42        1.91        1.96
AES-256-GCM
  Baseline      236000.34   550506.76   1234638.08  1716734.57  2011255.6   2028099.99
  Patched       247256.32   919731.34   1773270.43  2553239.55  3953115.14  4111227.29
  Speedup            1.05        1.67         1.44        1.49        1.97        2.03

Reviewed-by: TJ O'Dwyer, Marcel Cornu, Pablo de Lara
Reviewed-by: Paul Dale <pauli@openssl.org>
Reviewed-by: Tomas Mraz <tomas@openssl.org>
(Merged from #17239)
@t8m
Member

t8m commented Feb 10, 2022

Merged after squashing the commits and expanding the commit message. Thank you all for your contribution and reviews.

@t8m t8m closed this Feb 10, 2022
t8m pushed a commit to t8m/openssl that referenced this pull request Nov 4, 2022
(cherry picked from commit 63b996e)
t8m pushed a commit to t8m/openssl that referenced this pull request Nov 4, 2022
(cherry picked from commit 63b996e)
t8m pushed a commit to t8m/openssl that referenced this pull request Nov 9, 2022
(cherry picked from commit 63b996e)
openssl-machine pushed a commit that referenced this pull request Nov 11, 2022
(cherry picked from commit 63b996e)
Labels
approval: ready to merge (The 24 hour grace period has passed, ready to merge)
branch: master (Merge to master branch)
severity: fips change (The pull request changes FIPS provider sources)
triaged: feature (The issue/pr requests/adds a feature)

9 participants