LoongArch: chacha20 speed has huge performance degraded #23300

Closed
Fearyncess opened this issue Jan 13, 2024 · 6 comments
Labels
branch: master (Merge to master branch), branch: 3.2 (Merge to openssl-3.2), triaged: bug (The issue/pr is/fixes a bug)

Comments

@Fearyncess

Fearyncess commented Jan 13, 2024

Commit 9a41a3c added an LSX SIMD assembly implementation of ChaCha20 for LoongArch machines, but I have now noticed a performance problem with it on the 3A6000 CPU.

Bisecting shows that the issue first appears after PR #22817 was merged (commit b46de72).

Good ChaCha20 benchmark result (on dfd986b):

Doing ChaCha20 ops for 3s on 16 size blocks: 55003822 ChaCha20 ops in 3.00s
Doing ChaCha20 ops for 3s on 64 size blocks: 23395673 ChaCha20 ops in 3.00s
Doing ChaCha20 ops for 3s on 256 size blocks: 24012325 ChaCha20 ops in 3.00s
Doing ChaCha20 ops for 3s on 1024 size blocks: 12258642 ChaCha20 ops in 3.00s
Doing ChaCha20 ops for 3s on 8192 size blocks: 1614483 ChaCha20 ops in 3.00s
Doing ChaCha20 ops for 3s on 16384 size blocks: 809542 ChaCha20 ops in 3.00s
version: 3.2.0-alpha2-dev
built on: Sat Jan 13 18:51:08 2024 UTC
options: bn(64,64)
compiler: gcc -fPIC -pthread -Wa,--noexecstack -Wall -O3 -Wa,--noexecstack -mabi=lp64d -march=loongarch64 -mtune=la464 -mlsx -O2 -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_BUILDING_OPENSSL -DNDEBUG
CPUINFO: N/A
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
ChaCha20        293353.72k   499107.69k  2049051.73k  4184283.14k  4408614.91k  4421178.71k

Bad ChaCha20 benchmark result after PR #22817 was merged (on b46de72):

Doing ChaCha20 ops for 3s on 16 size blocks: 54890748 ChaCha20 ops in 3.00s
Doing ChaCha20 ops for 3s on 64 size blocks: 23112491 ChaCha20 ops in 3.00s
Doing ChaCha20 ops for 3s on 256 size blocks: 6264785 ChaCha20 ops in 3.00s
Doing ChaCha20 ops for 3s on 1024 size blocks: 1614116 ChaCha20 ops in 3.00s
Doing ChaCha20 ops for 3s on 8192 size blocks: 203605 ChaCha20 ops in 2.99s
Doing ChaCha20 ops for 3s on 16384 size blocks: 101827 ChaCha20 ops in 3.00s
version: 3.3.0-dev
built on: Sat Jan 13 19:35:15 2024 UTC
options: bn(64,64)
compiler: gcc -fPIC -pthread -Wa,--noexecstack -Wall -O3 -Wa,--noexecstack -mabi=lp64d -march=loongarch64 -mtune=la464 -mlsx -O2 -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_BUILDING_OPENSSL -DNDEBUG
CPUINFO: N/A
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
ChaCha20        292750.66k   493066.47k   534594.99k   550951.59k   557836.84k   556111.19k
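
For reference, the slowdown implied by the two throughput rows above can be computed directly. This is a small sketch: the numbers are copied from the benchmark output above, and the variable and function names are only illustrative.

```python
# Throughput in 1000s of bytes per second, copied from the two
# "ChaCha20" rows above (good: dfd986b, bad: b46de72).
block_sizes = [16, 64, 256, 1024, 8192, 16384]
good = [293353.72, 499107.69, 2049051.73, 4184283.14, 4408614.91, 4421178.71]
bad = [292750.66, 493066.47, 534594.99, 550951.59, 557836.84, 556111.19]


def slowdown(good_kbps, bad_kbps):
    """Return the good/bad throughput ratio for each block size."""
    return [g / b for g, b in zip(good_kbps, bad_kbps)]


ratios = slowdown(good, bad)
for size, ratio in zip(block_sizes, ratios):
    print(f"{size:>6} bytes: {ratio:.2f}x")
```

The 16- and 64-byte results are essentially unchanged, while the larger block sizes are several times slower, approaching 8x at 16384 bytes.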

Fearyncess added the "issue: bug report" label on Jan 13, 2024
@lrzlin
Contributor

lrzlin commented Jan 14, 2024

I cannot reproduce this on a 3A6000 running Gentoo:

3a6k ~/openssl/openssl-fix/apps # ldd openssl
        linux-vdso.so.1 (0x00007ffffe090000)
        libssl.so.3 => /root/openssl/openssl-fix/libssl.so.3 (0x00007ffff2a94000)
        libcrypto.so.3 => /root/openssl/openssl-fix/libcrypto.so.3 (0x00007ffff2650000)
        libz.so.1 => /usr/lib64/libz.so.1 (0x00007ffff2628000)
        libc.so.6 => /usr/lib64/libc.so.6 (0x00007ffff24a8000)
        /lib64/ld-linux-loongarch-lp64d.so.1 (0x00007ffff2c9c000)
3a6k ~/openssl/openssl-fix/apps # ./openssl speed -evp ChaCha20
Doing ChaCha20 ops for 3s on 16 size blocks: 59737877 ChaCha20 ops in 3.00s
Doing ChaCha20 ops for 3s on 64 size blocks: 23876988 ChaCha20 ops in 3.00s
Doing ChaCha20 ops for 3s on 256 size blocks: 23876937 ChaCha20 ops in 3.00s
Doing ChaCha20 ops for 3s on 1024 size blocks: 12310928 ChaCha20 ops in 3.00s
Doing ChaCha20 ops for 3s on 8192 size blocks: 1643074 ChaCha20 ops in 3.00s
Doing ChaCha20 ops for 3s on 16384 size blocks: 825151 ChaCha20 ops in 3.00s
version: 3.3.0-dev
built on: Sun Jan 14 08:48:52 2024 UTC
options: bn(64,64)
compiler: gcc -fPIC -pthread -Wa,--noexecstack -Wall -O3 -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_BUILDING_OPENSSL -DZLIB -DNDEBUG
CPUINFO: N/A
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
ChaCha20        318602.01k   509375.74k  2037498.62k  4202130.09k  4486687.40k  4506424.66k

By the way, the GCC I used does not support the -mlsx option, but binutils can handle the LSX/LASX assembly instructions.

@Fearyncess
Author

Fearyncess commented Jan 14, 2024

https://sourceware.org/git/?p=glibc.git;a=commit;h=672b91ba1060887aa8897d0b98af83b96d4a52b0

It seems glibc's hwcap change has been reverted, which would prevent the ChaCha20 LSX assembly code from being called.

But the kernel's hwcap bit should not be cleared. Why is it?
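
One way to check whether the kernel reports LSX to user space is to read AT_HWCAP from the auxiliary vector. The following is a minimal sketch, assuming a 64-bit Linux system and assuming that LSX is bit 4 of AT_HWCAP (HWCAP_LOONGARCH_LSX in the kernel's UAPI header). The helper names are illustrative and are not part of OpenSSL or glibc.

```python
import struct

AT_HWCAP = 16                 # auxv key for hardware capabilities on Linux
HWCAP_LOONGARCH_LSX = 1 << 4  # assumed bit position of the LSX flag


def parse_auxv(data: bytes) -> dict:
    """Parse a 64-bit auxiliary vector into a {key: value} dict."""
    entries = {}
    usable = len(data) - len(data) % 16  # each entry is two 8-byte words
    for key, value in struct.iter_unpack("=QQ", data[:usable]):
        if key == 0:  # AT_NULL terminates the vector
            break
        entries[key] = value
    return entries


def has_lsx(auxv: dict) -> bool:
    """Return True if the assumed LSX bit is set in AT_HWCAP."""
    return bool(auxv.get(AT_HWCAP, 0) & HWCAP_LOONGARCH_LSX)


if __name__ == "__main__":
    try:
        with open("/proc/self/auxv", "rb") as f:
            print("LSX reported:", has_lsx(parse_auxv(f.read())))
    except OSError as exc:
        print("cannot read /proc/self/auxv:", exc)
```

The result is only meaningful on a LoongArch machine, since the same bit has a different meaning on other architectures.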

mattcaswell added the "branch: master", "triaged: bug", "severity: regression", and "branch: 3.2" labels and removed the "issue: bug report" label on Jan 15, 2024
@mattcaswell
Member

@xry111

@mattcaswell
Member

@zhoumin2

@xry111
Contributor

xry111 commented Jan 15, 2024

Phew. I'm too stupid :(.

mattcaswell removed the "severity: regression" label on Jan 15, 2024
lrzlin added a commit to lrzlin/openssl that referenced this issue Jan 15, 2024
In that pull request, the input length check was moved forward,
but the related ori instruction was omitted, which causes input
of any length to fall back to the much slower scalar implementation.

Fixes openssl#23300

CLA: trivial
@zhoumin2
Contributor

@zhoumin2

@lrzlin is fixing this problem.

openssl-machine pushed a commit that referenced this issue Jan 17, 2024
The regression was introduced in PR #22817.

In that pull request, the input length check was moved forward,
but the related ori instruction was omitted, which causes input
of any length to fall back to the much slower scalar implementation.

Fixes #23300

CLA: trivial

Reviewed-by: Shane Lontis <shane.lontis@oracle.com>
Reviewed-by: Tomas Mraz <tomas@openssl.org>
(Merged from #23301)

(cherry picked from commit 9710285)
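
The failure mode described in the commit message can be illustrated with a toy model. This is a sketch only, not the actual assembly: the 64-byte threshold, the register name `t3`, and the stale register value are all assumptions chosen for illustration. The idea is that the length check compares against a register that an `ori` instruction is supposed to load with a constant. Without that load, the register keeps whatever value was left in it, and the check can send every input down the scalar path.

```python
SCALAR_THRESHOLD = 64  # assumed cutoff, for illustration only


def choose_path(length: int, regs: dict, emit_ori: bool) -> str:
    """Model a length check that reads its threshold from register t3."""
    if emit_ori:
        # Models the ori instruction that loads the constant threshold.
        regs["t3"] = SCALAR_THRESHOLD
    # Without the load, t3 still holds its stale value.
    return "scalar" if length <= regs["t3"] else "vector"


stale = {"t3": 2**63}  # hypothetical leftover value, larger than any input
for n in (16, 64, 256, 16384):
    with_ori = choose_path(n, dict(stale), emit_ori=True)
    without_ori = choose_path(n, dict(stale), emit_ori=False)
    print(f"{n:>6} bytes: with ori -> {with_ori}, without ori -> {without_ori}")
```

In this model, 16- and 64-byte inputs take the scalar path either way, which would be consistent with the benchmark results in this issue, where only the larger block sizes slowed down.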