aarch64 vs armv8 32 bit performance (50% lower for aarch64) #11377

realPy · 2020-03-21T18:16:12Z

Hi,
I am currently testing OpenSSL in 32-bit and 64-bit builds optimized for the Cortex-A72 (Raspberry Pi 4).
My kernel is 64-bit, and I can run software in either 64-bit or 32-bit mode.

I see a huge performance difference between aarch64 and 32-bit armv8:
performance drops by 50% on aarch64.

Compiled with a custom aarch64 compiler and executed on a Raspberry Pi 4:
test openssl speed -evp aes-256-cbc
Doing aes-256-cbc for 3s on 16 size blocks: 5011950 aes-256-cbc's in 2.89s
Doing aes-256-cbc for 3s on 64 size blocks: 1343320 aes-256-cbc's in 2.92s
Doing aes-256-cbc for 3s on 256 size blocks: 334189 aes-256-cbc's in 2.88s
Doing aes-256-cbc for 3s on 1024 size blocks: 81715 aes-256-cbc's in 2.82s
Doing aes-256-cbc for 3s on 8192 size blocks: 10625 aes-256-cbc's in 2.90s
Doing aes-256-cbc for 3s on 16384 size blocks: 5244 aes-256-cbc's in 2.91s
OpenSSL 1.1.1e 17 Mar 2020
built on: Sat Mar 21 17:33:00 2020 UTC
options:bn(64,64) rc4(char) des(int) aes(partial) idea(int) blowfish(ptr)
compiler: /cross/bin/aarch64-rpi4-linux-gnu-gcc -fPIC -pthread -Wa,--noexecstack -O3 -march=armv8-a+crypto -mtune=cortex-a72 -fPIC -Wunreachable-code -pedantic -Wextra -Wall -Wformat=2 -D_FORTIFY_SOURCE=2 -fstack-protector-strong -fPIE -pie -Wl,-z,relro,-z,now -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DVPAES_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DNDEBUG
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
aes-256-cbc 27747.82k 29442.63k 29705.69k 29672.40k 30013.79k 29524.98k
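As a sanity check on how the summary row is derived, here is a minimal sketch that recomputes the "1000s of bytes per second" figures from the operation counts and elapsed times in the aarch64 log above. The function name `throughput_k` is made up for illustration; the formula (operations times block size, divided by elapsed seconds and by 1000) is an assumption that happens to reproduce the reported numbers.

```python
# (operation count, block size in bytes, elapsed seconds), copied from the
# "Doing aes-256-cbc for 3s ..." lines of the aarch64 run.
runs = [
    (5011950, 16, 2.89),
    (1343320, 64, 2.92),
    (334189, 256, 2.88),
    (81715, 1024, 2.82),
    (10625, 8192, 2.90),
    (5244, 16384, 2.91),
]


def throughput_k(count, block_size, seconds):
    """Return throughput in 1000s of bytes per second (hypothetical helper)."""
    return count * block_size / seconds / 1000.0


for count, size, secs in runs:
    print(f"{size:>6} bytes: {throughput_k(count, size, secs):.2f}k")
```

The first entry works out to roughly 27747.82k, which matches the first column of the summary row.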

Compiled with a custom armv8 (32-bit) compiler and executed on a Raspberry Pi 4:

openssl speed -evp aes-256-cbc
Doing aes-256-cbc for 3s on 16 size blocks: 10145827 aes-256-cbc's in 2.92s
Doing aes-256-cbc for 3s on 64 size blocks: 2874360 aes-256-cbc's in 2.84s
Doing aes-256-cbc for 3s on 256 size blocks: 773257 aes-256-cbc's in 2.92s
Doing aes-256-cbc for 3s on 1024 size blocks: 202402 aes-256-cbc's in 2.98s
Doing aes-256-cbc for 3s on 8192 size blocks: 25266 aes-256-cbc's in 2.96s
Doing aes-256-cbc for 3s on 16384 size blocks: 12036 aes-256-cbc's in 2.90s
OpenSSL 1.1.1e 17 Mar 2020
built on: Sat Mar 21 17:52:50 2020 UTC
options:bn(64,32) rc4(char) des(long) aes(partial) idea(int) blowfish(ptr)
compiler: /cross/bin/armv8-rpi4-linux-gnueabihf-gcc -fPIC -pthread -Wa,--noexecstack -O3 -march=armv8-a+crypto -mtune=cortex-a72 -fPIC -Wunreachable-code -pedantic -Wextra -Wall -Wformat=2 -D_FORTIFY_SOURCE=2 -fstack-protector-strong -fPIE -pie -Wl,-z,relro,-z,now -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DAES_ASM -DBSAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DNDEBUG
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
aes-256-cbc 55593.57k 64774.31k 67792.39k 69550.22k 69925.36k 67999.25k
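To put a number on the gap, the following sketch divides the aarch64 summary row by the 32-bit armv8 summary row for each block size. The values are copied from the two runs above; the variable and function names are illustrative only.

```python
# Summary rows (in 1000s of bytes per second) copied from the two runs above.
sizes = [16, 64, 256, 1024, 8192, 16384]
aarch64_k = [27747.82, 29442.63, 29705.69, 29672.40, 30013.79, 29524.98]
armv8_32_k = [55593.57, 64774.31, 67792.39, 69550.22, 69925.36, 67999.25]


def ratios(slow, fast):
    """Return element-wise slow/fast throughput ratios."""
    return [s / f for s, f in zip(slow, fast)]


rel = ratios(aarch64_k, armv8_32_k)
for size, r in zip(sizes, rel):
    print(f"{size:>6} bytes: aarch64 reaches {r:.0%} of the 32-bit throughput")
```

For these numbers every ratio falls below one half, so "50% lower" is, if anything, slightly generous to the aarch64 build at the larger block sizes.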

Generally, in 64-bit I have observed a performance gain of 25%, with RAM consumption increasing by 25%.

Does anyone have an idea what causes this strange result?

+ Looking at cd42b9e the definition of AES_ASM appears to have been missed for aarch64 + I suspect this may fix openssl#11377 CLA: trivial

cherso · 2020-11-26T23:54:22Z

I can confirm this issue on all AES variants:
64-bit vs 32-bit shows a 50% performance loss on the RPi 4 (no AES Crypto Extensions on chip).

paulidale · 2020-11-28T05:46:15Z

What is the comparison if both versions are built with no-asm?

realPy · 2020-11-28T08:23:32Z

Sorry @paulidale, but we never said that we compiled with no-asm.

paulidale · 2020-11-28T08:29:06Z

I'm wondering if the difference is compiler based or assembly based. OpenSSL has assembly code for various platforms but what is implemented in assembly isn't consistent.

For ARM 32 and ARM 64, they'd be considered to be different platforms from the assembly point of view.

If the difference remains when both are compiled no-asm, the tool chains are the likely culprit. If not, it's more likely the assembly (or lack of assembly) that is the smoking gun.

realPy · 2020-11-28T08:39:30Z

Thanks for the explanation. I can run a test with no-asm builds from both toolchains to see the difference.

cherso · 2020-11-28T22:17:57Z

> What is the comparison if both versions are built with no-asm?

OK.
Natively compiled aarch64 version with no-asm.
Performance is as expected: the same as the 32-bit version.

RPI4

OpenSSL 1.1.1h 22 Sep 2020
built on: Sat Nov 28 21:57:41 2020 UTC
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
aes-128-cbc 62399.04k 70255.38k 75067.56k 76989.78k 77023.91k 77578.24k

gcc 9.3.0

t8m · 2021-07-22T14:10:35Z

So apparently the aarch64 asm implementation is somehow lacking in performance. However, there is another issue: the asm implementation might have additional side-channel mitigations, such as constant-time, constant-memory-access behavior, that the no-asm build (at least in 1.1.1) does not. And perhaps the armv8 asm implementation lacks some of these mitigations too, which might be the reason for its better performance.

mahtin · 2023-06-17T21:35:58Z

Hello. It's been about 23 months since the last comment on this issue, and I wonder if this slowdown is now more important than before because of the mainstreaming of 64-bit OSes for the Pi. I looked at openssl, and a Raspberry Pi 4 running armv7l is still twice as fast as a Raspberry Pi 4 running aarch64. Both were tested with the newest openssl update for their respective mainline OSes (i.e. nothing special done). I don't have an armv8 machine anymore, but this still seems valid.

32 bit OS ...

Raspberry Pi 4 Model B Rev 1.2
Linux 5.15.72-v7l+ #1591 SMP Wed Oct 5 12:05:33 BST 2022 armv7l unknown unknown GNU/Linux

compiler: gcc -fPIC -pthread -Wa,--noexecstack -Wall -Wa,--noexecstack -g -O2 -ffile-prefix-map=/build/openssl-lY0Eec/openssl-1.1.1n=. -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DAES_ASM -DBSAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DNDEBUG -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -Wdate-time -D_FORTIFY_SOURCE=2

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-256-cbc      50622.20k    59695.34k    63555.72k    64419.87k    64719.54k    64738.65k

64 bit OS ...

Raspberry Pi 4 Model B Rev 1.5
Linux 6.1.21-v8+ #1642 SMP PREEMPT Mon Apr 3 17:24:16 BST 2023 aarch64 unknown unknown GNU/Linux

compiler: gcc -fPIC -pthread -Wa,--noexecstack -Wall -Wa,--noexecstack -g -O2 -ffile-prefix-map=/build/openssl-DLe6s5/openssl-1.1.1n=. -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DVPAES_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DNDEBUG -Wdate-time -D_FORTIFY_SOURCE=2 

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-256-cbc      32939.33k    35138.31k    35998.89k    35789.14k    36216.83k    36175.87k

There's a difference in the compile flags. It's easier to see once you remove the common values.

32 bit OS ...

compiler: gcc ... -DOPENSSL_BN_ASM_GF2m -DAES_ASM -DBSAES_ASM -DGHASH_ASM  -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64

64 bit OS ...

compiler: gcc ... -DVPAES_ASM
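The "remove the common values" step can be reproduced mechanically. This is a minimal sketch that compares only the `-D` defines from the two compiler lines above (the other flags, such as the build-specific `-ffile-prefix-map` path, are deliberately left out); the variable names are illustrative.

```python
# -D defines copied from the two "compiler:" lines above.
flags_32 = (
    "-DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ "
    "-DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM "
    "-DSHA512_ASM -DKECCAK1600_ASM -DAES_ASM -DBSAES_ASM -DGHASH_ASM "
    "-DECP_NISTZ256_ASM -DPOLY1305_ASM -DNDEBUG -D_LARGEFILE_SOURCE "
    "-D_FILE_OFFSET_BITS=64 -D_FORTIFY_SOURCE=2"
).split()
flags_64 = (
    "-DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ "
    "-DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM "
    "-DKECCAK1600_ASM -DVPAES_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM "
    "-DNDEBUG -D_FORTIFY_SOURCE=2"
).split()

# Set difference in each direction leaves only the defines unique to one build.
only_32 = sorted(set(flags_32) - set(flags_64))
only_64 = sorted(set(flags_64) - set(flags_32))

print("32-bit only:", " ".join(only_32))
print("64-bit only:", " ".join(only_64))
```

The interesting part of the result is that `-DAES_ASM` and `-DBSAES_ASM` appear only in the 32-bit build, while `-DVPAES_ASM` appears only in the 64-bit build.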

Any update? Enquiring minds want to know. Thanks in advance.

paulidale · 2023-06-18T07:03:13Z

I don't think anyone has looked at this properly yet.
The core developers have other priorities at the moment, so it will be a while longer.

bernd-edlinger · 2023-06-21T06:53:53Z

With openssl speed -evp aes-256-cbc you are not using the BSAES implementation on the ARM platform,
and this cipher is therefore never constant time on ARM. Your numbers therefore basically reflect the
performance without NEON. To measure the BSAES performance, try either
openssl speed -evp aes-256-cbc -decrypt or openssl speed -evp aes-256-ctr. Note, however, that
the BSAES implementation is in a way questionable, since it still uses the non-constant-time, table-driven
AES implementation for key derivation and for encrypting messages shorter than 128 bytes.

But OTOH the AArch64 platform uses VPAES, which is always used for CBC in both encrypt and decrypt mode,
and which also does the key derivation in constant time. So unfortunately, on all ARM platforms both VPAES
and BSAES are less performant than the default software AES implementation.

If you want non-constant-time AES on AArch64, you have to do it this way:
OPENSSL_armcap=0 openssl speed -evp aes-256-cbc
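To compare the code paths mentioned above side by side, the invocations can be collected in a small script. This is only a sketch: the commands and the `OPENSSL_armcap=0` setting are the ones quoted in this comment, while the labels, the `build_invocation` helper, and the `run` switch are made up for illustration. By default the script only prints the command lines, since each benchmark takes several seconds per block size.

```python
import os
import shlex
import subprocess

# (label, environment overrides, arguments to the openssl binary).
# The commands come from the comment above; the labels are illustrative.
RUNS = [
    ("CBC encrypt, default dispatch", {}, ["speed", "-evp", "aes-256-cbc"]),
    ("CBC decrypt", {}, ["speed", "-evp", "aes-256-cbc", "-decrypt"]),
    ("CTR", {}, ["speed", "-evp", "aes-256-ctr"]),
    ("CBC encrypt, OPENSSL_armcap=0", {"OPENSSL_armcap": "0"},
     ["speed", "-evp", "aes-256-cbc"]),
]


def build_invocation(overrides, args, openssl="openssl"):
    """Return (argv, env) for one benchmark run (hypothetical helper)."""
    env = dict(os.environ)
    env.update(overrides)
    return [openssl, *args], env


def main(run=False):
    for label, overrides, args in RUNS:
        argv, env = build_invocation(overrides, args)
        prefix = " ".join(f"{k}={v}" for k, v in overrides.items())
        print(f"# {label}")
        print((prefix + " " if prefix else "") + shlex.join(argv))
        if run:
            # Actually executes the benchmark; requires openssl on PATH.
            subprocess.run(argv, env=env, check=False)


if __name__ == "__main__":
    main(run=False)
```

Calling `main(run=True)` on the target board would run all four benchmarks in sequence so that their summary rows can be compared directly.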

ricardobranco777 · 2024-01-22T14:15:31Z

openssl is slow on Raspberry Pi.

mahtin · 2024-01-22T15:08:52Z

> openssl is slow on Raspberry Pi.

As a programmer that started coding on PDP11's & VAX11's in the 70's & 80's I find the present day Raspberry Pi's (especially the 4's & 5's) to be extremely fast! So I'm happy to reject the "slow" statement and move back to the issue at hand. If that's ok with you.

Remember, this issue is a few years old and from what I've experienced, we now live with faster code, both for OpenSSL and the underlying operating system.

ricardobranco777 · 2024-01-22T15:17:31Z

> > openssl is slow on Raspberry Pi.
>
> As a programmer that started coding on PDP11's & VAX11's in the 70's & 80's I find the present day Raspberry Pi's (especially the 4's & 5's) to be extremely fast! So I'm happy to reject the "slow" statement and move back to the issue at hand. If that's ok with you.
>
> Remember, this issue is a few years old and from what I've experienced, we now live with faster code, both for OpenSSL and the underlying operating system.

Compare the output of the following commands on Raspberry Pi 4b & 5:

cryptsetup benchmark
gnutls-cli --benchmark-ciphers
openssl speed

The first two show the RPi 5 to be 10x faster than the RPi 4b because of the use of crypto extensions. Not so with openssl.

realPy added the issue: question The issue was opened to ask a question label Mar 21, 2020

realPy changed the title ~~Huge aarch64 vs armv8 performance (50% lower for aarch64)~~ Huge aarch64 vs armv8 32 bit performance (50% lower for aarch64) Mar 21, 2020

realPy changed the title ~~Huge aarch64 vs armv8 32 bit performance (50% lower for aarch64)~~ aarch64 vs armv8 32 bit performance (50% lower for aarch64) Mar 21, 2020

samuel-lee-msft mentioned this issue Aug 17, 2020

Add missed compile time definition for aes on aarch64 #12665

Closed

t8m added branch: 1.1.1 Merge to OpenSSL_1_1_1-stable branch triaged: feature The issue/pr requests/adds a feature and removed issue: question The issue was opened to ask a question labels Jul 22, 2021

t8m added branch: master Merge to master branch and removed branch: 1.1.1 Merge to OpenSSL_1_1_1-stable branch labels Jul 23, 2021

t8m added this to the Post 3.0.0 milestone Jul 23, 2021

paulidale added the help wanted label Jun 18, 2023

preveen-stack mentioned this issue Jun 19, 2023

Question on 64-bit platform optimization for nistp256 curve in OpenSSL #21234

Open

paulidale added the triaged: performance The issue/pr reports/fixes a performance concern label Nov 7, 2023
