New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
riscv: Provide a vector only implementation of CHACHA20 cipher #24069
Conversation
Also from your performance measurements I see that the performance on short data is severally degraded and the improvement on long data is only slight. Is it really worth it? Should the existing non-vector implementation be called on shorter inputs? |
I think so. Maybe I should do some profiling and optimize the implementation.
I will wait until I have further optimization to make this decision. |
It does not pass the |
OPENSSL_riscvcap seems to be undocumented. Can you add an document like OPENSSL_ia32cap and OPENSSL_s390xcap in another PR? |
I think OPENSSL_riscvcap should be deprecated in the future. For now, we have hwprobe syscall in the linux kernel since v6.4 to get what ISA extensions are supported by every CPU. |
Being able to explicitly enable and disable specific optimisations is very useful. On both x86 and Arm we query for capabilities and use that to set the respective xxxcap variables, which can then be given in the environment to override what is detected. I would recommend against deprecating this. |
OK. So the things to do might add hwprobe and use it by default, and allow it to be overridden by |
I have refactored the code from #24097. And updated the benchmark on the Canaan Kendryte K230 platform. It shows no performance degradation compared to pure C implementation, so it is worth using it. In addition, Canaan Kendryte K230 is an embedded RISC-V SoC with an in-order pipeline with a very short vector length of 128 bits, like Cortex-A53 with NEON in Arm Ecosystems. We may see better performance in out-of-order cores with longer vector lengths on better CPUs. |
I also provided a diff with your implementation using Zvkb. You can see the code below: 5c5
< # Copyright 2023-2023 The OpenSSL Project Authors. All Rights Reserved.
---
> # Copyright 2023-2024 The OpenSSL Project Authors. All Rights Reserved.
13a14
> # Copyright (c) 2024, Yangyu Chen <cyy@cyyself.name>
40d40
< # - RISC-V Vector Cryptography Bit-manipulation extension ('Zvkb')
63,65c63,65
< # void ChaCha20_ctr32_zvkb(unsigned char *out, const unsigned char *inp,
< # size_t len, const unsigned int key[8],
< # const unsigned int counter[4]);
---
> # void ChaCha20_ctr32_v(unsigned char *out, const unsigned char *inp,
> # size_t len, const unsigned int key[8],
> # const unsigned int counter[4]);
124c124,132
< @{[vror_vi $D0, $D0, 32 - 16]}
---
> @{[vsll_vi $V24, $D0, 16]}
> @{[vsll_vi $V25, $D1, 16]}
> @{[vsll_vi $V26, $D2, 16]}
> @{[vsll_vi $V27, $D3, 16]}
> @{[vsrl_vi $D0, $D0, 32 - 16]}
> @{[vsrl_vi $D1, $D1, 32 - 16]}
> @{[vsrl_vi $D2, $D2, 32 - 16]}
> @{[vsrl_vi $D3, $D3, 32 - 16]}
> @{[vor_vv $D0, $D0, $V24]}
126c134
< @{[vror_vi $D1, $D1, 32 - 16]}
---
> @{[vor_vv $D1, $D1, $V25]}
128c136
< @{[vror_vi $D2, $D2, 32 - 16]}
---
> @{[vor_vv $D2, $D2, $V26]}
130c138
< @{[vror_vi $D3, $D3, 32 - 16]}
---
> @{[vor_vv $D3, $D3, $V27]}
149c157,165
< @{[vror_vi $B0, $B0, 32 - 12]}
---
> @{[vsll_vi $V28, $B0, 12]}
> @{[vsll_vi $V29, $B1, 12]}
> @{[vsll_vi $V30, $B2, 12]}
> @{[vsll_vi $V31, $B3, 12]}
> @{[vsrl_vi $B0, $B0, 32 - 12]}
> @{[vsrl_vi $B1, $B1, 32 - 12]}
> @{[vsrl_vi $B2, $B2, 32 - 12]}
> @{[vsrl_vi $B3, $B3, 32 - 12]}
> @{[vor_vv $B0, $B0, $V28]}
151c167
< @{[vror_vi $B1, $B1, 32 - 12]}
---
> @{[vor_vv $B1, $B1, $V29]}
153c169
< @{[vror_vi $B2, $B2, 32 - 12]}
---
> @{[vor_vv $B2, $B2, $V30]}
155c171
< @{[vror_vi $B3, $B3, 32 - 12]}
---
> @{[vor_vv $B3, $B3, $V31]}
174c190,198
< @{[vror_vi $D0, $D0, 32 - 8]}
---
> @{[vsll_vi $V24, $D0, 8]}
> @{[vsll_vi $V25, $D1, 8]}
> @{[vsll_vi $V26, $D2, 8]}
> @{[vsll_vi $V27, $D3, 8]}
> @{[vsrl_vi $D0, $D0, 32 - 8]}
> @{[vsrl_vi $D1, $D1, 32 - 8]}
> @{[vsrl_vi $D2, $D2, 32 - 8]}
> @{[vsrl_vi $D3, $D3, 32 - 8]}
> @{[vor_vv $D0, $D0, $V24]}
176c200
< @{[vror_vi $D1, $D1, 32 - 8]}
---
> @{[vor_vv $D1, $D1, $V25]}
178c202
< @{[vror_vi $D2, $D2, 32 - 8]}
---
> @{[vor_vv $D2, $D2, $V26]}
180c204
< @{[vror_vi $D3, $D3, 32 - 8]}
---
> @{[vor_vv $D3, $D3, $V27]}
199c223,231
< @{[vror_vi $B0, $B0, 32 - 7]}
---
> @{[vsll_vi $V28, $B0, 7]}
> @{[vsll_vi $V29, $B1, 7]}
> @{[vsll_vi $V30, $B2, 7]}
> @{[vsll_vi $V31, $B3, 7]}
> @{[vsrl_vi $B0, $B0, 32 - 7]}
> @{[vsrl_vi $B1, $B1, 32 - 7]}
> @{[vsrl_vi $B2, $B2, 32 - 7]}
> @{[vsrl_vi $B3, $B3, 32 - 7]}
> @{[vor_vv $B0, $B0, $V28]}
201c233
< @{[vror_vi $B1, $B1, 32 - 7]}
---
> @{[vor_vv $B1, $B1, $V29]}
203c235
< @{[vror_vi $B2, $B2, 32 - 7]}
---
> @{[vor_vv $B2, $B2, $V30]}
205c237
< @{[vror_vi $B3, $B3, 32 - 7]}
---
> @{[vor_vv $B3, $B3, $V31]}
214,216c246,248
< .globl ChaCha20_ctr32_zvkb
< .type ChaCha20_ctr32_zvkb,\@function
< ChaCha20_ctr32_zvkb:
---
> .globl ChaCha20_ctr32_v
> .type ChaCha20_ctr32_v,\@function
> ChaCha20_ctr32_v:
471c503
< .size ChaCha20_ctr32_zvkb,.-ChaCha20_ctr32_zvkb
---
> .size ChaCha20_ctr32_v,.-ChaCha20_ctr32_v You can also take a look at this since this code is changed from your implementation. |
Could somebody knowledgeable of RISCV please review this code? I can provide only a formal approval. It would be also appreciated if someone could adopt the riscv build targets as community maintainer(s) as they are currently unadopted. |
@cyyself, as the difference is only on the
I will check the equivalence of these two asm implementation soon. I had some RISC-V Vector Extension experience inside OpenSSL.
@t8m I am willing to do so, How could I participate in? I think I had related experience, see here |
I will do that. Thanks. |
Please open a pull request against https://github.com/openssl/general-policies/ to move the relevant platforms from unadopted to community. Similar to for example this pull request: https://github.com/openssl/general-policies/pull/54/files |
Since we can do group operations on vector registers in RISC-V, some vector registers will be used without being explicitly referenced. Thus, comments on vector register allocation should be added to improve the code readability and maintainability. Signed-off-by: Yangyu Chen <cyy@cyyself.name>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For me, I would like to have the regular indent in perl generator instead of the final asm file. It's easier for me to maintain the generator.
I'm not sure which one is the good practice in openssl.
I also have this concern. That’s why I separate it to a different commit. I will wait for opinions from OpenSSL maintainers. |
The f1d7f4d has:
It's not just for the indent only. |
Yes, the indentation should be primarily correct in the .pl file as that is the source file. It is nice if the indentation is correct in the generated file, but it is definitely not a requirement. |
Thanks. I edited the last commit on indentation. Now, it only reuses the same part of the code. |
This patch merged the `add` and `xor` part of chacha_sub_round, which are same in RISC-V Vector only and Zvkb implementation. There is no change to the generated ASM code except for the indent. Signed-off-by: Yangyu Chen <cyy@cyyself.name>
@paulidale please reconfirm |
Although it hasn't been reviewed for two weeks. There is some good news to share. I got another new RISC-V Board, BananaPi F3, which has Spacemit K1 SoC and 256 bits VLEN, which is 2x wide in Vector Length compared to the K230 I used to benchmark before. The performance is shown below: [cyy@k1 ssl]$ ./bin/openssl speed -evp chacha20
Doing ChaCha20 ops for 3s on 16 size blocks: 11580221 ChaCha20 ops in 3.00s
Doing ChaCha20 ops for 3s on 64 size blocks: 4837731 ChaCha20 ops in 3.00s
Doing ChaCha20 ops for 3s on 256 size blocks: 1433019 ChaCha20 ops in 3.01s
Doing ChaCha20 ops for 3s on 1024 size blocks: 375369 ChaCha20 ops in 3.00s
Doing ChaCha20 ops for 3s on 8192 size blocks: 47621 ChaCha20 ops in 3.00s
Doing ChaCha20 ops for 3s on 16384 size blocks: 23843 ChaCha20 ops in 3.01s
version: 3.4.0-dev
built on: Tue Apr 23 06:23:46 2024 UTC
options: bn(64,64)
compiler: gcc -fPIC -pthread -Wa,--noexecstack -Wall -O3 -march=rv64gc_zba_zbb -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_BUILDING_OPENSSL -DNDEBUG
CPUINFO: N/A
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
ChaCha20 61761.18k 103204.93k 121878.03k 128125.95k 130037.08k 129781.96k
[cyy@k1 ssl]$ export OPENSSL_riscvcap=rv64gc_v_zba_zbb
[cyy@k1 ssl]$ ./bin/openssl speed -evp chacha20
Doing ChaCha20 ops for 3s on 16 size blocks: 11308125 ChaCha20 ops in 3.00s
Doing ChaCha20 ops for 3s on 64 size blocks: 4804401 ChaCha20 ops in 3.01s
Doing ChaCha20 ops for 3s on 256 size blocks: 1409444 ChaCha20 ops in 3.00s
Doing ChaCha20 ops for 3s on 1024 size blocks: 725378 ChaCha20 ops in 3.00s
Doing ChaCha20 ops for 3s on 8192 size blocks: 99129 ChaCha20 ops in 3.00s
Doing ChaCha20 ops for 3s on 16384 size blocks: 51344 ChaCha20 ops in 3.00s
version: 3.4.0-dev
built on: Tue Apr 23 06:23:46 2024 UTC
options: bn(64,64)
compiler: gcc -fPIC -pthread -Wa,--noexecstack -Wall -O3 -march=rv64gc_zba_zbb -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_BUILDING_OPENSSL -DNDEBUG
CPUINFO: N/A
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
ChaCha20 60310.00k 102153.38k 120272.55k 247595.69k 270688.26k 280406.70k
[cyy@k1 ssl]$ neofetch
-` cyy@k1
.o+` ------
`ooo/ OS: Arch Linux riscv64
`+oooo: Host: spacemit k1-x deb1 board
`+oooooo: Kernel: 6.1.15-legacy-k1
-+oooooo+: Uptime: 6 mins
`/:-:++oooo+: Packages: 139 (pacman)
`/++++/+++++++: Shell: bash 5.2.26
`/++++++++++++++: Terminal: /dev/pts/0
`/+++ooooooooooooo/` CPU: Spacemit X60 (8) @ 1.600GHz
./ooosssso++osssssso+` Memory: 111MiB / 3809MiB
.oossssso-````/ossssss+`
-osssssso. :ssssssso.
:osssssss/ osssso+++.
/ossssssss/ +ssssooo/-
`/ossssso+/:- -:/+osssso+-
`+sso+:-` `.-/+oso:
`++:. `-/+/
.` `/
[cyy@k1 ssl]$ cat /proc/cpuinfo | head -n 9
processor : 0
hart : 0
model name : Spacemit(R) X60
isa : rv64imafdcv_sscofpmf_sstc_svpbmt_zicbom_zicboz_zicbop_zihintpause
mmu : sv39
mvendorid : 0x710
marchid : 0x8000000058000001
mimpid : 0x1000000049772200 I changed the baseline to rv64gcv_zba_zbb compiled by GCC. The result shows that it is more than 1x faster than pure C implementation optimized by the compiler. On K230, it only indicates 26.3% faster. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies, missed this one.
This pull request is ready to merge |
Merged to the master branch. Thank you for your contribution. |
Although we have a Zvkb version of Chacha20, the Zvkb from the RISC-V Vector Cryptography Bit-manipulation extension was ratified in late 2023 and does not come to the RVA23 Profile. Many CPUs in 2024 currently do not support Zvkb but may have Vector and Bit-manipulation, which are already in the RVA22 Profile. This commit provides a vector-only implementation that replaced the vror with vsll+vsrl+vor and can provide enough speed for Chacha20 for new CPUs this year. Signed-off-by: Yangyu Chen <cyy@cyyself.name> Reviewed-by: Paul Dale <ppzgs1@gmail.com> Reviewed-by: Tomas Mraz <tomas@openssl.org> (Merged from #24069)
Since we can do group operations on vector registers in RISC-V, some vector registers will be used without being explicitly referenced. Thus, comments on vector register allocation should be added to improve the code readability and maintainability. Signed-off-by: Yangyu Chen <cyy@cyyself.name> Reviewed-by: Paul Dale <ppzgs1@gmail.com> Reviewed-by: Tomas Mraz <tomas@openssl.org> (Merged from #24069)
This patch merged the `add` and `xor` part of chacha_sub_round, which are same in RISC-V Vector only and Zvkb implementation. There is no change to the generated ASM code except for the indent. Signed-off-by: Yangyu Chen <cyy@cyyself.name> Reviewed-by: Paul Dale <ppzgs1@gmail.com> Reviewed-by: Tomas Mraz <tomas@openssl.org> (Merged from #24069)
Although we have a Zvkb version of Chacha20, the Zvkb from the RISC-V Vector Cryptography Bit-manipulation extension was ratified in late 2023 and does not come to the RVA23 Profile. Many CPUs in 2024 currently do not support Zvkb but may have Vector and Scalar Bit-manipulation, which are already in the RVA22 Profile. This commit provides a vector-only implementation that replaced the vror with vsll+vsrl+vor and can provide enough speed for Chacha20 for new CPUs this year.
Performance Evaluation
Platform: Canaan Kendryte K230
CPU: T-Head C908 @1.6GHz
Compiler: gcc version 13.2.1 20230801 (GCC)
Summary of the result:
According to the result, we get chacha20 speed up to 160154.98k on 16384 bytes blocks, which is 26.3% faster compared to C implementation with rv64gcv_zba_zbb optimized by GCC and 78.9% faster compared to C implementation with rv64gc which is the default -march for riscv in GCC.
Checklist
I also
diff
the generated ASM for Zvkb since I updated the Perl script for ASM generation. There are no changes except for indentation.