aarch64 vs armv8 32 bit performance (50% lower for aarch64) #11377
Comments
+ Looking at cd42b9e the definition of AES_ASM appears to have been missed for aarch64 + I suspect this may fix openssl#11377 CLA: trivial
I can confirm this issue on all AES variants.
What is the comparison if both versions are built with
Sorry @paulidale, but we never said that we compile with no-asm.
I'm wondering if the difference is compiler based or assembly based. OpenSSL has assembly code for various platforms but what is implemented in assembly isn't consistent. For ARM 32 and ARM 64, they'd be considered to be different platforms from the assembly point of view. If the difference remains when both are compiled no-asm, the tool chains are the likely culprit. If not, it's more likely the assembly (or lack of assembly) that is the smoking gun.
Thanks for the explanation. I can run a test with a no-asm build on both toolchains to see the difference.
Ok. RPI4 OpenSSL 1.1.1h 22 Sep 2020 gcc 9.3.0
So apparently the aarch64 asm implementation is somehow lacking in performance. However, there is another issue: the asm implementation might have additional side-channel mitigations, such as constant-time execution and constant memory access patterns, that the no-asm code (at least in 1.1.1) doesn't have. And perhaps the armv8 (32-bit) asm implementation does not have some of these mitigations either, which might be the reason for its better performance.
Hello. It's just about 23 months since the last comment on this issue and I wonder if this slowdown is now more important than before because of the mainstreaming of 64 bit OS's for the Pi. I looked at 32 bit OS ...
64 bit OS ...
There's a difference in the compile flags. It's easier to see once you remove the common values. 32 bit OS ...
64 bit OS ...
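The same "remove the common values" comparison can be sketched for the two `compiler:` lines in the original report (the OpenSSL 1.1.1e cross builds, not the builds quoted in this comment). The flag strings below are copied from that report with the compiler path dropped; the helper name `flag_diff` is made up for illustration.

```python
# Flags copied from the "compiler:" lines of the original report
# (compiler path omitted). flag_diff is a hypothetical helper name.
AARCH64_FLAGS = (
    "-fPIC -pthread -Wa,--noexecstack -O3 -march=armv8-a+crypto "
    "-mtune=cortex-a72 -fPIC -Wunreachable-code -pedantic -Wextra -Wall "
    "-Wformat=2 -D_FORTIFY_SOURCE=2 -fstack-protector-strong -fPIE -pie "
    "-Wl,-z,relro,-z,now -DOPENSSL_USE_NODELETE -DOPENSSL_PIC "
    "-DOPENSSL_CPUID_OBJ -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM "
    "-DSHA512_ASM -DKECCAK1600_ASM -DVPAES_ASM -DECP_NISTZ256_ASM "
    "-DPOLY1305_ASM -DNDEBUG"
)
ARMV8_32_FLAGS = (
    "-fPIC -pthread -Wa,--noexecstack -O3 -march=armv8-a+crypto "
    "-mtune=cortex-a72 -fPIC -Wunreachable-code -pedantic -Wextra -Wall "
    "-Wformat=2 -D_FORTIFY_SOURCE=2 -fstack-protector-strong -fPIE -pie "
    "-Wl,-z,relro,-z,now -DOPENSSL_USE_NODELETE -DOPENSSL_PIC "
    "-DOPENSSL_CPUID_OBJ -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_GF2m "
    "-DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DAES_ASM "
    "-DBSAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DNDEBUG"
)


def flag_diff(a: str, b: str):
    """Return (flags only in a, flags only in b), each sorted."""
    set_a, set_b = set(a.split()), set(b.split())
    return sorted(set_a - set_b), sorted(set_b - set_a)


only_aarch64, only_armv8_32 = flag_diff(AARCH64_FLAGS, ARMV8_32_FLAGS)
print("only in aarch64 build:     ", only_aarch64)
print("only in 32-bit armv8 build:", only_armv8_32)
```

For those two builds the only differences are the assembly defines: the aarch64 build has `-DVPAES_ASM`, while the 32-bit build has `-DAES_ASM`, `-DBSAES_ASM`, `-DGHASH_ASM` and `-DOPENSSL_BN_ASM_GF2m`.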
Any update? Enquiring minds want to know. Thanks in advance.
I don't think anyone has looked at this properly yet.
With But OTOH the AARCH platform uses VPAES which is always used for CBC in encrypt and decrypt mode. If you want non-constant time AES on the AARCH you have to do it this way: |
openssl is slow on Raspberry Pi.
As a programmer that started coding on PDP11's & VAX11's in the 70's & 80's I find the present day Raspberry Pi's (especially the 4's & 5's) to be extremely fast! So I'm happy to reject the "slow" statement and move back to the issue at hand. If that's ok with you. Remember, this issue is a few years old and from what I've experienced, we now live with faster code, both for OpenSSL and the underlying operating system.
Compare the output of the following commands on Raspberry Pi 4b & 5:
The first two show the RPi 5 to be 10x faster than the RPi 4b because of the use of crypto extensions. Not so with openssl.
Hi,
I am currently testing OpenSSL in 32-bit and 64-bit builds optimized for the Cortex-A72 (Raspberry Pi 4).
My kernel is 64-bit and I can run software in either 64-bit or 32-bit mode.
I see a huge performance difference between aarch64 and armv8 (32-bit).
Performance drops by about 50% for aarch64.
Compiled with a custom aarch64 compiler and executed on a Raspberry Pi 4:
test openssl speed -evp aes-256-cbc
Doing aes-256-cbc for 3s on 16 size blocks: 5011950 aes-256-cbc's in 2.89s
Doing aes-256-cbc for 3s on 64 size blocks: 1343320 aes-256-cbc's in 2.92s
Doing aes-256-cbc for 3s on 256 size blocks: 334189 aes-256-cbc's in 2.88s
Doing aes-256-cbc for 3s on 1024 size blocks: 81715 aes-256-cbc's in 2.82s
Doing aes-256-cbc for 3s on 8192 size blocks: 10625 aes-256-cbc's in 2.90s
Doing aes-256-cbc for 3s on 16384 size blocks: 5244 aes-256-cbc's in 2.91s
OpenSSL 1.1.1e 17 Mar 2020
built on: Sat Mar 21 17:33:00 2020 UTC
options:bn(64,64) rc4(char) des(int) aes(partial) idea(int) blowfish(ptr)
compiler: /cross/bin/aarch64-rpi4-linux-gnu-gcc -fPIC -pthread -Wa,--noexecstack -O3 -march=armv8-a+crypto -mtune=cortex-a72 -fPIC -Wunreachable-code -pedantic -Wextra -Wall -Wformat=2 -D_FORTIFY_SOURCE=2 -fstack-protector-strong -fPIE -pie -Wl,-z,relro,-z,now -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DVPAES_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DNDEBUG
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
aes-256-cbc 27747.82k 29442.63k 29705.69k 29672.40k 30013.79k 29524.98k
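As a sanity check, the summary row can be reproduced from the raw "Doing ... in N s" lines: each figure is the operation count times the block size, divided by the measured time, in units of 1000 bytes per second. A minimal sketch using the aarch64 numbers above (the names `throughput_k` and `AARCH64_RUNS` are made up for illustration):

```python
def throughput_k(count: int, block_size: int, seconds: float) -> float:
    """Throughput in 1000s of bytes per second, as printed by `openssl speed`."""
    return count * block_size / seconds / 1000.0


# (block size in bytes, operations completed, measured seconds),
# copied from the aarch64 run above.
AARCH64_RUNS = [
    (16, 5011950, 2.89),
    (64, 1343320, 2.92),
    (256, 334189, 2.88),
    (1024, 81715, 2.82),
    (8192, 10625, 2.90),
    (16384, 5244, 2.91),
]

for block_size, count, seconds in AARCH64_RUNS:
    print(f"{block_size:>6} bytes: {throughput_k(count, block_size, seconds):.2f}k")
```

For the 16-byte case this gives 5011950 × 16 / 2.89 / 1000 ≈ 27747.82k, matching the first column of the table.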
Compiled with a custom armv8 (32-bit) compiler and executed on a Raspberry Pi 4:
openssl speed -evp aes-256-cbc
Doing aes-256-cbc for 3s on 16 size blocks: 10145827 aes-256-cbc's in 2.92s
Doing aes-256-cbc for 3s on 64 size blocks: 2874360 aes-256-cbc's in 2.84s
Doing aes-256-cbc for 3s on 256 size blocks: 773257 aes-256-cbc's in 2.92s
Doing aes-256-cbc for 3s on 1024 size blocks: 202402 aes-256-cbc's in 2.98s
Doing aes-256-cbc for 3s on 8192 size blocks: 25266 aes-256-cbc's in 2.96s
Doing aes-256-cbc for 3s on 16384 size blocks: 12036 aes-256-cbc's in 2.90s
OpenSSL 1.1.1e 17 Mar 2020
built on: Sat Mar 21 17:52:50 2020 UTC
options:bn(64,32) rc4(char) des(long) aes(partial) idea(int) blowfish(ptr)
compiler: /cross/bin/armv8-rpi4-linux-gnueabihf-gcc -fPIC -pthread -Wa,--noexecstack -O3 -march=armv8-a+crypto -mtune=cortex-a72 -fPIC -Wunreachable-code -pedantic -Wextra -Wall -Wformat=2 -D_FORTIFY_SOURCE=2 -fstack-protector-strong -fPIE -pie -Wl,-z,relro,-z,now -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DAES_ASM -DBSAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DNDEBUG
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
aes-256-cbc 55593.57k 64774.31k 67792.39k 69550.22k 69925.36k 67999.25k
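To put a number on the gap, the two summary rows can be compared per block size. This is only a sketch over the figures reported above; the variable names are made up for illustration.

```python
BLOCK_SIZES = [16, 64, 256, 1024, 8192, 16384]

# Throughput in 1000s of bytes per second, copied from the two tables above.
AARCH64_K = [27747.82, 29442.63, 29705.69, 29672.40, 30013.79, 29524.98]
ARMV8_32_K = [55593.57, 64774.31, 67792.39, 69550.22, 69925.36, 67999.25]

# Fraction of the 32-bit throughput that the aarch64 build achieves.
ratios = [a64 / a32 for a64, a32 in zip(AARCH64_K, ARMV8_32_K)]

for size, ratio in zip(BLOCK_SIZES, ratios):
    print(f"{size:>6} bytes: aarch64 reaches {ratio:.1%} of the 32-bit throughput")
```

For these figures the aarch64 build reaches roughly 43% to 50% of the 32-bit throughput, with the gap widest at the larger block sizes.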
In general, I have observed a performance gain of about 25% in 64-bit builds, with RAM consumption increasing by about 25%.
Does anyone have an idea what causes this strange result?