aarch64 vs armv8 32 bit performance (50% lower for aarch64) #11377

Open
realPy opened this issue Mar 21, 2020 · 13 comments
Labels
branch: master Merge to master branch help wanted triaged: feature The issue/pr requests/adds a feature triaged: performance The issue/pr reports/fixes a performance concern
Milestone

Comments

@realPy

realPy commented Mar 21, 2020

Hi,
I am currently testing OpenSSL in 32-bit and 64-bit builds optimized for the Cortex-A72 (Raspberry Pi 4).
My kernel is 64-bit, and I can run software in either 64-bit or 32-bit mode.

I see a huge performance difference between aarch64 and armv8 (32-bit):
performance is about 50% lower on aarch64.

Compiled with a custom aarch64 compiler and executed on a Raspberry Pi 4:
test openssl speed -evp aes-256-cbc
Doing aes-256-cbc for 3s on 16 size blocks: 5011950 aes-256-cbc's in 2.89s
Doing aes-256-cbc for 3s on 64 size blocks: 1343320 aes-256-cbc's in 2.92s
Doing aes-256-cbc for 3s on 256 size blocks: 334189 aes-256-cbc's in 2.88s
Doing aes-256-cbc for 3s on 1024 size blocks: 81715 aes-256-cbc's in 2.82s
Doing aes-256-cbc for 3s on 8192 size blocks: 10625 aes-256-cbc's in 2.90s
Doing aes-256-cbc for 3s on 16384 size blocks: 5244 aes-256-cbc's in 2.91s
OpenSSL 1.1.1e 17 Mar 2020
built on: Sat Mar 21 17:33:00 2020 UTC
options:bn(64,64) rc4(char) des(int) aes(partial) idea(int) blowfish(ptr)
compiler: /cross/bin/aarch64-rpi4-linux-gnu-gcc -fPIC -pthread -Wa,--noexecstack -O3 -march=armv8-a+crypto -mtune=cortex-a72 -fPIC -Wunreachable-code -pedantic -Wextra -Wall -Wformat=2 -D_FORTIFY_SOURCE=2 -fstack-protector-strong -fPIE -pie -Wl,-z,relro,-z,now -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DVPAES_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DNDEBUG
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
aes-256-cbc 27747.82k 29442.63k 29705.69k 29672.40k 30013.79k 29524.98k

Compiled with a custom armv8 (32-bit) compiler and executed on a Raspberry Pi 4:

openssl speed -evp aes-256-cbc
Doing aes-256-cbc for 3s on 16 size blocks: 10145827 aes-256-cbc's in 2.92s
Doing aes-256-cbc for 3s on 64 size blocks: 2874360 aes-256-cbc's in 2.84s
Doing aes-256-cbc for 3s on 256 size blocks: 773257 aes-256-cbc's in 2.92s
Doing aes-256-cbc for 3s on 1024 size blocks: 202402 aes-256-cbc's in 2.98s
Doing aes-256-cbc for 3s on 8192 size blocks: 25266 aes-256-cbc's in 2.96s
Doing aes-256-cbc for 3s on 16384 size blocks: 12036 aes-256-cbc's in 2.90s
OpenSSL 1.1.1e 17 Mar 2020
built on: Sat Mar 21 17:52:50 2020 UTC
options:bn(64,32) rc4(char) des(long) aes(partial) idea(int) blowfish(ptr)
compiler: /cross/bin/armv8-rpi4-linux-gnueabihf-gcc -fPIC -pthread -Wa,--noexecstack -O3 -march=armv8-a+crypto -mtune=cortex-a72 -fPIC -Wunreachable-code -pedantic -Wextra -Wall -Wformat=2 -D_FORTIFY_SOURCE=2 -fstack-protector-strong -fPIE -pie -Wl,-z,relro,-z,now -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DAES_ASM -DBSAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DNDEBUG
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
aes-256-cbc 55593.57k 64774.31k 67792.39k 69550.22k 69925.36k 67999.25k
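
The summary rows follow from the "Doing ..." lines: blocks processed, times block size, divided by elapsed seconds, reported in 1000s of bytes per second. A minimal Python sketch of that arithmetic, using the 16384-byte figures from the two runs above (the function name `throughput_k` is made up for illustration):

```python
def throughput_k(blocks, block_size, seconds):
    """Throughput in 1000s of bytes per second, as reported by openssl speed."""
    return blocks * block_size / seconds / 1000

# 16384-byte rows taken from the two runs quoted above.
aarch64_16k = throughput_k(5244, 16384, 2.91)   # 64-bit build
armv8_16k = throughput_k(12036, 16384, 2.90)    # 32-bit build

# Fraction of the 32-bit throughput that the 64-bit build reaches.
ratio = aarch64_16k / armv8_16k

print(f"aarch64: {aarch64_16k:.2f}k")
print(f"armv8:   {armv8_16k:.2f}k")
print(f"aarch64 / armv8 = {ratio:.2f}")
```

For these two rows the 64-bit build reaches a bit less than half of the 32-bit throughput, which is consistent with the "50% lower" description.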

In general, I have observed a performance gain of 25% in 64-bit, with RAM consumption increasing by 25%.

Does anyone have an idea why I get this strange result?

@realPy realPy added the issue: question The issue was opened to ask a question label Mar 21, 2020
@realPy realPy changed the title Huge aarch64 vs armv8 performance (50% lower for aarch64) Huge aarch64 vs armv8 32 bit performance (50% lower for aarch64) Mar 21, 2020
@realPy realPy changed the title Huge aarch64 vs armv8 32 bit performance (50% lower for aarch64) aarch64 vs armv8 32 bit performance (50% lower for aarch64) Mar 21, 2020
samuel-lee-msft added a commit to samuel-lee-msft/openssl that referenced this issue Aug 17, 2020
+ Looking at cd42b9e the definition of
AES_ASM appears to have been missed for aarch64
+ I suspect this may fix openssl#11377

CLA: trivial
@cherso

cherso commented Nov 26, 2020

I can confirm this issue on all AES variants:
64-bit vs 32-bit shows a 50% performance loss on the RPi 4 (no AES Crypto Extensions on the chip).

@paulidale
Contributor

What is the comparison if both versions are built with no-asm?

@realPy
Author

realPy commented Nov 28, 2020

Sorry @paulidale, but we never said that we compiled with no-asm.

@paulidale
Contributor

I'm wondering if the difference is compiler based or assembly based. OpenSSL has assembly code for various platforms but what is implemented in assembly isn't consistent.

For ARM 32 and ARM 64, they'd be considered to be different platforms from the assembly point of view.

If the difference remains when both are compiled no-asm, the tool chains are the likely culprit. If not, it's more likely the assembly (or lack of assembly) that is the smoking gun.

@realPy
Author

realPy commented Nov 28, 2020

Thanks for the explanation. I can run a test with no-asm builds from both toolchains to see the difference.

@cherso

cherso commented Nov 28, 2020

> What is the comparison if both versions are built with no-asm?

OK.
Natively compiled aarch64 version with no-asm.
Performance is as expected: the same as the 32-bit version.

RPI4

OpenSSL 1.1.1h 22 Sep 2020
built on: Sat Nov 28 21:57:41 2020 UTC
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
aes-128-cbc 62399.04k 70255.38k 75067.56k 76989.78k 77023.91k 77578.24k

gcc 9.3.0

@t8m t8m added branch: 1.1.1 Merge to OpenSSL_1_1_1-stable branch triaged: feature The issue/pr requests/adds a feature and removed issue: question The issue was opened to ask a question labels Jul 22, 2021
@t8m
Member

t8m commented Jul 22, 2021

So apparently the aarch64 asm implementation is somehow lacking in performance. However, there is another issue: the asm implementation might have additional side-channel mitigations, such as constant-time execution and constant memory access patterns, but the no-asm implementation (at least in 1.1.1) doesn't. And perhaps the armv8 asm implementation lacks some of these mitigations too, which might be the reason for its better performance.
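
To illustrate why such mitigations tend to cost throughput, here is a generic Python sketch contrasting a direct table lookup (the memory access depends on a secret index) with a lookup that touches every entry and selects the result with a mask. This is only an illustration of the access pattern: it is not OpenSSL's code, the function names are made up, and Python itself gives no real timing guarantees.

```python
def lookup_direct(table, index):
    """Table-driven lookup: the address accessed depends on `index`."""
    return table[index]

def lookup_const_access(table, index):
    """Read every entry of a 256-byte table and keep only the wanted one.

    The sequence of memory accesses is the same for every `index`,
    at the cost of 256 reads instead of one.
    """
    result = 0
    for i, value in enumerate(table):
        diff = i ^ index                 # 0 only when i == index
        mask = ((diff - 1) >> 8) & 0xFF  # 0xFF when diff == 0, else 0x00
        result |= value & mask
    return result

# Hypothetical 256-entry byte table, standing in for an S-box.
table = [(i * 7 + 3) % 256 for i in range(256)]

# Both lookups return the same values; only the access pattern differs.
matches = all(lookup_direct(table, i) == lookup_const_access(table, i)
              for i in range(256))
print(matches)
```

Real constant-time AES implementations avoid this naive scan by other means, but the trade-off is the same in spirit: extra work in exchange for secret-independent behaviour.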

@t8m t8m added branch: master Merge to master branch and removed branch: 1.1.1 Merge to OpenSSL_1_1_1-stable branch labels Jul 23, 2021
@t8m t8m added this to the Post 3.0.0 milestone Jul 23, 2021
@mahtin

mahtin commented Jun 17, 2023

Hello. It's been just about 23 months since the last comment on this issue, and I wonder if this slowdown is now more important than before because of the mainstreaming of 64-bit OSes for the Pi. I looked at openssl, and a Raspberry Pi 4 running armv7l is still twice as fast as a Raspberry Pi 4 running aarch64. Both were tested with the newest openssl update for their respective mainline OSes (i.e. nothing special done). I don't have an armv8 machine anymore, but this still seems valid.

32 bit OS ...

Raspberry Pi 4 Model B Rev 1.2
Linux 5.15.72-v7l+ #1591 SMP Wed Oct 5 12:05:33 BST 2022 armv7l unknown unknown GNU/Linux

compiler: gcc -fPIC -pthread -Wa,--noexecstack -Wall -Wa,--noexecstack -g -O2 -ffile-prefix-map=/build/openssl-lY0Eec/openssl-1.1.1n=. -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DAES_ASM -DBSAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DNDEBUG -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -Wdate-time -D_FORTIFY_SOURCE=2

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-256-cbc      50622.20k    59695.34k    63555.72k    64419.87k    64719.54k    64738.65k

64 bit OS ...

Raspberry Pi 4 Model B Rev 1.5
Linux 6.1.21-v8+ #1642 SMP PREEMPT Mon Apr 3 17:24:16 BST 2023 aarch64 unknown unknown GNU/Linux

compiler: gcc -fPIC -pthread -Wa,--noexecstack -Wall -Wa,--noexecstack -g -O2 -ffile-prefix-map=/build/openssl-DLe6s5/openssl-1.1.1n=. -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DVPAES_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DNDEBUG -Wdate-time -D_FORTIFY_SOURCE=2 

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-256-cbc      32939.33k    35138.31k    35998.89k    35789.14k    36216.83k    36175.87k

There's a difference in the compile flags. It's easier to see once you remove the common values.

32 bit OS ...

compiler: gcc ... -DOPENSSL_BN_ASM_GF2m -DAES_ASM -DBSAES_ASM -DGHASH_ASM  -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64

64 bit OS ...

compiler: gcc ... -DVPAES_ASM
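
The same comparison can be done mechanically as a set difference over the `-D` defines. A small Python sketch, with the define lists copied from the two `compiler:` lines above (the helper name `defines` is made up for illustration):

```python
def defines(compiler_line):
    """Return the set of -D preprocessor defines in a compiler command line."""
    return {tok for tok in compiler_line.split() if tok.startswith("-D")}

# -D flags copied from the 32-bit and 64-bit "compiler:" lines above.
flags_32 = defines(
    "-DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ "
    "-DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM "
    "-DSHA512_ASM -DKECCAK1600_ASM -DAES_ASM -DBSAES_ASM -DGHASH_ASM "
    "-DECP_NISTZ256_ASM -DPOLY1305_ASM -DNDEBUG -D_LARGEFILE_SOURCE "
    "-D_FILE_OFFSET_BITS=64 -D_FORTIFY_SOURCE=2"
)
flags_64 = defines(
    "-DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ "
    "-DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM "
    "-DKECCAK1600_ASM -DVPAES_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM "
    "-DNDEBUG -D_FORTIFY_SOURCE=2"
)

only_32 = sorted(flags_32 - flags_64)  # defines present only in the 32-bit build
only_64 = sorted(flags_64 - flags_32)  # defines present only in the 64-bit build

print("32-bit only:", " ".join(only_32))
print("64-bit only:", " ".join(only_64))
```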

Any update? Enquiring minds want to know. Thanks in advance.

@paulidale
Contributor

I don't think anyone has looked at this properly yet.
The core developers have other priorities at the moment, so it will be a while longer.

@bernd-edlinger
Member

With openssl speed -evp aes-256-cbc you are not using the BSAES implementation on ARM platform,
and this cipher is therefore never constant time on ARM. Your numbers reflect therefore basically the
performance without NIOS. To measure the BSAES performance you should either
try openssl speed -evp aes-256-cbc -decrypt or openssl speed -evp aes-256-ctr. Note however that
the BSAES implementation is in a way questionable, since it still uses the non-constant time table-driven
AES implementation for key derivation and encrypting messages shorter than 128 bytes.

But OTOH the AArch64 platform uses VPAES, which is always used for CBC in both encrypt and decrypt mode,
and which also does the key derivation in constant time. So unfortunately, on all ARM platforms both VPAES
and BSAES are less performant than the default software AES implementation.

If you want non-constant-time AES on AArch64, you have to do it this way:
OPENSSL_armcap=0 openssl speed -evp aes-256-cbc
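
To compare the variants mentioned in this comment side by side, a rough Python sketch along these lines could run each command and pull out the summary row. The commands and the `OPENSSL_armcap=0` setting are the ones given above; the helper names, the variant labels, and the assumption that the summary row appears on standard output in the format shown earlier in this thread are mine.

```python
import os
import subprocess

# (label, extra environment, arguments to "openssl speed"), as suggested above.
VARIANTS = [
    ("cbc encrypt", {}, ["-evp", "aes-256-cbc"]),
    ("cbc decrypt", {}, ["-evp", "aes-256-cbc", "-decrypt"]),
    ("ctr", {}, ["-evp", "aes-256-ctr"]),
    ("cbc encrypt, armcap=0", {"OPENSSL_armcap": "0"}, ["-evp", "aes-256-cbc"]),
]

def parse_speed_row(output, cipher):
    """Return the throughput columns (in 1000s of bytes/s) for `cipher`, or None."""
    for line in output.splitlines():
        fields = line.split()
        if (len(fields) > 1 and fields[0].lower() == cipher.lower()
                and all(f.endswith("k") for f in fields[1:])):
            return [float(f[:-1]) for f in fields[1:]]
    return None

def run_variant(env_overrides, args):
    """Run "openssl speed" with the given arguments and return its stdout."""
    env = dict(os.environ, **env_overrides)
    result = subprocess.run(["openssl", "speed", *args], env=env,
                            capture_output=True, text=True, check=True)
    return result.stdout

# Example use on a target machine (each run takes a while):
#   for label, env, args in VARIANTS:
#       print(label, parse_speed_row(run_variant(env, args), args[1]))
```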

@paulidale paulidale added the triaged: performance The issue/pr reports/fixes a performance concern label Nov 7, 2023
@ricardobranco777

openssl is slow on Raspberry Pi.

@mahtin

mahtin commented Jan 22, 2024

> openssl is slow on Raspberry Pi.

As a programmer that started coding on PDP11's & VAX11's in the 70's & 80's I find the present day Raspberry Pi's (especially the 4's & 5's) to be extremely fast! So I'm happy to reject the "slow" statement and move back to the issue at hand. If that's ok with you.

Remember, this issue is a few years old and from what I've experienced, we now live with faster code, both for OpenSSL and the underlying operating system.

@ricardobranco777

> > openssl is slow on Raspberry Pi.
>
> As a programmer that started coding on PDP11's & VAX11's in the 70's & 80's I find the present day Raspberry Pi's (especially the 4's & 5's) to be extremely fast! So I'm happy to reject the "slow" statement and move back to the issue at hand. If that's ok with you.
>
> Remember, this issue is a few years old and from what I've experienced, we now live with faster code, both for OpenSSL and the underlying operating system.

Compare the output of the following commands on Raspberry Pi 4b & 5:

  • cryptsetup benchmark
  • gnutls-cli --benchmark-ciphers
  • openssl speed

The first two show the RPi 5 to be 10x faster than the RPi 4b because of their use of the crypto extensions. Not so with openssl.
