riscv: Provide a vector only implementation of CHACHA20 cipher #24069

cyyself · 2024-04-09T06:54:41Z

Although we have a Zvkb version of Chacha20, the Zvkb from the RISC-V Vector Cryptography Bit-manipulation extension was ratified in late 2023 and does not come to the RVA23 Profile. Many CPUs in 2024 currently do not support Zvkb but may have Vector and Scalar Bit-manipulation, which are already in the RVA22 Profile. This commit provides a vector-only implementation that replaced the vror with vsll+vsrl+vor and can provide enough speed for Chacha20 for new CPUs this year.

Performance Evaluation

Platform: Canaan Kendryte K230
CPU: T-Head C908 @1.6GHz
Compiler: gcc version 13.2.1 20230801 (GCC)

[cyy@archlinux ssl]$ neofetch
                   -`                    cyy@archlinux
                  .o+`                   -------------
                 `ooo/                   OS: Arch Linux riscv64
                `+oooo:                  Host: Canaan Kendryte K230
               `+oooooo:                 Kernel: 6.8.0+
               -+oooooo+:                Uptime: 4 days, 17 hours, 27 mins
             `/:-:++oooo+:               Packages: 156 (pacman)
            `/++++/+++++++:              Shell: bash 5.2.26
           `/++++++++++++++:             Terminal: /dev/pts/0
          `/+++ooooooooooooo/`           CPU: (1)
         ./ooosssso++osssssso+`          Memory: 72MiB / 477MiB
        .oossssso-````/ossssss+`
       -osssssso.      :ssssssso.
      :osssssss/        osssso+++.
     /ossssssss/        +ssssooo/-
   `/ossssso+/:-        -:/+osssso+-
  `+sso+:-`                 `.-/+oso:
 `++:.                           `-/+/
 .`                                 `/

[cyy@archlinux ssl]$ cat /proc/cpuinfo
processor	: 0
hart		: 0
isa		: rv64imafdcv_zicbom_zicntr_zicsr_zifencei_zihpm_zba_zbb_zbc_zbs_svpbmt
mmu		: sv39
mvendorid	: 0x5b7
marchid		: 0x8000000009140d00
mimpid		: 0x50000
hart isa	: rv64imafdcv_zicbom_zicntr_zicsr_zifencei_zihpm_zba_zbb_zbc_zbs_svpbmt

[cyy@archlinux ssl]$ unset OPENSSL_riscvcap
[cyy@archlinux ssl]$ ./bin/openssl speed -evp chacha20
Doing ChaCha20 ops for 3s on 16 size blocks: 9190817 ChaCha20 ops in 2.90s
Doing ChaCha20 ops for 3s on 64 size blocks: 3541082 ChaCha20 ops in 2.98s
Doing ChaCha20 ops for 3s on 256 size blocks: 998612 ChaCha20 ops in 2.98s
Doing ChaCha20 ops for 3s on 1024 size blocks: 258575 ChaCha20 ops in 2.99s
Doing ChaCha20 ops for 3s on 8192 size blocks: 32655 ChaCha20 ops in 2.98s
Doing ChaCha20 ops for 3s on 16384 size blocks: 16335 ChaCha20 ops in 2.99s
version: 3.4.0-dev
built on: Fri Apr 19 06:03:02 2024 UTC
options: bn(64,64)
compiler: gcc -fPIC -pthread -Wa,--noexecstack -Wall -O3 -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_BUILDING_OPENSSL -DNDEBUG
CPUINFO: N/A
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
ChaCha20         50707.96k    76050.08k    85786.80k    88555.45k    89768.38k    89509.24k
[cyy@archlinux ssl]$ export OPENSSL_riscvcap=rv64gc_v_zba_zbb
[cyy@archlinux ssl]$ ./bin/openssl speed -evp chacha20
Doing ChaCha20 ops for 3s on 16 size blocks: 9400208 ChaCha20 ops in 2.97s
Doing ChaCha20 ops for 3s on 64 size blocks: 3561831 ChaCha20 ops in 2.99s
Doing ChaCha20 ops for 3s on 256 size blocks: 1417570 ChaCha20 ops in 2.98s
Doing ChaCha20 ops for 3s on 1024 size blocks: 370833 ChaCha20 ops in 2.98s
Doing ChaCha20 ops for 3s on 8192 size blocks: 57722 ChaCha20 ops in 2.99s
Doing ChaCha20 ops for 3s on 16384 size blocks: 28833 ChaCha20 ops in 2.98s
version: 3.4.0-dev
built on: Fri Apr 19 06:03:02 2024 UTC
options: bn(64,64)
compiler: gcc -fPIC -pthread -Wa,--noexecstack -Wall -O3 -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_BUILDING_OPENSSL -DNDEBUG
CPUINFO: N/A
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
ChaCha20         50640.85k    76239.86k   121777.83k   127427.18k   158146.70k   158523.45k
[cyy@archlinux ssl]$ unset OPENSSL_riscvcap
[cyy@archlinux ssl]$ ./bin/openssl speed -evp chacha20
Doing ChaCha20 ops for 3s on 16 size blocks: 10710666 ChaCha20 ops in 2.99s
Doing ChaCha20 ops for 3s on 64 size blocks: 4656347 ChaCha20 ops in 3.00s
Doing ChaCha20 ops for 3s on 256 size blocks: 1390592 ChaCha20 ops in 3.00s
Doing ChaCha20 ops for 3s on 1024 size blocks: 366078 ChaCha20 ops in 3.01s
Doing ChaCha20 ops for 3s on 8192 size blocks: 46493 ChaCha20 ops in 3.00s
Doing ChaCha20 ops for 3s on 16384 size blocks: 23223 ChaCha20 ops in 3.00s
version: 3.4.0-dev
built on: Fri Apr 19 06:13:13 2024 UTC
options: bn(64,64)
compiler: gcc -fPIC -pthread -Wa,--noexecstack -Wall -O3 -march=rv64gcv_zba_zbb -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_BUILDING_OPENSSL -DNDEBUG
CPUINFO: N/A
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
ChaCha20         57314.60k    99335.40k   118663.85k   124539.49k   126956.89k   126828.54k
[cyy@archlinux ssl]$ export OPENSSL_riscvcap=rv64gc_v_zba_zbb
[cyy@archlinux ssl]$ ./bin/openssl speed -evp chacha20
Doing ChaCha20 ops for 3s on 16 size blocks: 11064374 ChaCha20 ops in 2.96s
Doing ChaCha20 ops for 3s on 64 size blocks: 4723040 ChaCha20 ops in 3.00s
Doing ChaCha20 ops for 3s on 256 size blocks: 1426926 ChaCha20 ops in 3.01s
Doing ChaCha20 ops for 3s on 1024 size blocks: 373266 ChaCha20 ops in 3.00s
Doing ChaCha20 ops for 3s on 8192 size blocks: 58149 ChaCha20 ops in 2.99s
Doing ChaCha20 ops for 3s on 16384 size blocks: 29032 ChaCha20 ops in 2.97s
version: 3.4.0-dev
built on: Fri Apr 19 06:13:13 2024 UTC
options: bn(64,64)
compiler: gcc -fPIC -pthread -Wa,--noexecstack -Wall -O3 -march=rv64gcv_zba_zbb -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_BUILDING_OPENSSL -DNDEBUG
CPUINFO: N/A
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
ChaCha20         59807.43k   100758.19k   121359.82k   127408.13k   159316.59k   160154.98k
[cyy@archlinux ssl]$

Summary of the result:

Chacha20 Implementation	GCC -march	16 bytes	64 bytes	256 bytes	1024 bytes	8192 bytes	16384 bytes
C	64gc	50707.96k	76050.08k	85786.80k	88555.45k	89768.38k	89509.24k
C	64gcv_zba_zbb	57314.60k	99335.40k	118663.85k	124539.49k	126956.89k	126828.54k
ASM-RVV	64gc	50640.85k	76239.86k	121777.83k	127427.18k	158146.70k	158523.45k
ASM-RVV	64gcv_zba_zbb	59807.43k	100758.19k	121359.82k	127408.13k	159316.59k	160154.98k

According to the result, we get chacha20 speed up to 160154.98k on 16384 bytes blocks, which is 26.3% faster compared to C implementation with rv64gcv_zba_zbb optimized by GCC and 78.9% faster compared to C implementation with rv64gc which is the default -march for riscv in GCC.

Checklist

documentation is added or updated
tests are added or updated

I also diff the generated ASM for Zvkb since I updated the Perl script for ASM generation. There are no changes except for indentation.

crypto/chacha/chacha_riscv.c

t8m · 2024-04-09T08:13:24Z

Also from your performance measurements I see that the performance on short data is severally degraded and the improvement on long data is only slight. Is it really worth it? Should the existing non-vector implementation be called on shorter inputs?

crypto/chacha/chacha_riscv.c

cyyself · 2024-04-09T08:29:36Z

Also from your performance measurements I see that the performance on short data is severally degraded and the improvement on long data is only slight. Is it really worth it?

I think so. Maybe I should do some profiling and optimize the implementation.

Should the existing non-vector implementation be called on shorter inputs?

I will wait until I have further optimization to make this decision.

cyyself · 2024-04-09T08:32:46Z

It does not pass the make test so I convert this PR to a draft. It was modified based on the zvkb implementation, which already in openssl. But that will not pass the test either. I have submitted issue #24070 to address this and will debug it soon.

kroeckx · 2024-04-09T09:01:47Z

OPENSSL_riscvcap seems to be undocumented. Can you add an document like OPENSSL_ia32cap and OPENSSL_s390xcap in another PR?

cyyself · 2024-04-09T09:09:20Z

OPENSSL_riscvcap seems to be undocumented. Can you add an document like OPENSSL_ia32cap and OPENSSL_s390xcap in another PR?

I think OPENSSL_riscvcap should be deprecated in the future. For now, we have hwprobe syscall in the linux kernel since v6.4 to get what ISA extensions are supported by every CPU.

tom-cosgrove-arm · 2024-04-09T09:22:15Z

I think OPENSSL_riscvcap should be deprecated in the future

Being able to explicitly enable and disable specific optimisations is very useful. On both x86 and Arm we query for capabilities and use that to set the respective xxxcap variables, which can then be given in the environment to override what is detected. I would recommend against deprecating this.

cyyself · 2024-04-09T10:06:54Z

I think OPENSSL_riscvcap should be deprecated in the future

Being able to explicitly enable and disable specific optimisations is very useful. On both x86 and Arm we query for capabilities and use that to set the respective xxxcap variables, which can then be given in the environment to override what is detected. I would recommend against deprecating this.

OK. So the things to do might add hwprobe and use it by default, and allow it to be overridden by OPENSSL_riscvcap. Missing documentation for OPENSSL_riscvcap should be added either.

cyyself · 2024-04-12T15:14:02Z

Also from your performance measurements I see that the performance on short data is severally degraded and the improvement on long data is only slight. Is it really worth it? Should the existing non-vector implementation be called on shorter inputs?

@t8m

I have refactored the code from #24097. And updated the benchmark on the Canaan Kendryte K230 platform. It shows no performance degradation compared to pure C implementation, so it is worth using it.

In addition, Canaan Kendryte K230 is an embedded RISC-V SoC with an in-order pipeline with a very short vector length of 128 bits, like Cortex-A53 with NEON in Arm Ecosystems. We may see better performance in out-of-order cores with longer vector lengths on better CPUs.

cyyself · 2024-04-12T15:21:32Z

@JerryShih

I also provided a diff with your implementation using Zvkb. You can see the code below:

5c5
< # Copyright 2023-2023 The OpenSSL Project Authors. All Rights Reserved.
---
> # Copyright 2023-2024 The OpenSSL Project Authors. All Rights Reserved.
13a14
> # Copyright (c) 2024, Yangyu Chen <cyy@cyyself.name>
40d40
< # - RISC-V Vector Cryptography Bit-manipulation extension ('Zvkb')
63,65c63,65
< # void ChaCha20_ctr32_zvkb(unsigned char *out, const unsigned char *inp,
< #                          size_t len, const unsigned int key[8],
< #                          const unsigned int counter[4]);
---
> # void ChaCha20_ctr32_v(unsigned char *out, const unsigned char *inp,
> #                       size_t len, const unsigned int key[8],
> #                       const unsigned int counter[4]);
124c124,132
<     @{[vror_vi $D0, $D0, 32 - 16]}
---
>     @{[vsll_vi $V24, $D0, 16]}
>     @{[vsll_vi $V25, $D1, 16]}
>     @{[vsll_vi $V26, $D2, 16]}
>     @{[vsll_vi $V27, $D3, 16]}
>     @{[vsrl_vi $D0, $D0, 32 - 16]}
>     @{[vsrl_vi $D1, $D1, 32 - 16]}
>     @{[vsrl_vi $D2, $D2, 32 - 16]}
>     @{[vsrl_vi $D3, $D3, 32 - 16]}
>     @{[vor_vv $D0, $D0, $V24]}
126c134
<     @{[vror_vi $D1, $D1, 32 - 16]}
---
>     @{[vor_vv $D1, $D1, $V25]}
128c136
<     @{[vror_vi $D2, $D2, 32 - 16]}
---
>     @{[vor_vv $D2, $D2, $V26]}
130c138
<     @{[vror_vi $D3, $D3, 32 - 16]}
---
>     @{[vor_vv $D3, $D3, $V27]}
149c157,165
<     @{[vror_vi $B0, $B0, 32 - 12]}
---
>     @{[vsll_vi $V28, $B0, 12]}
>     @{[vsll_vi $V29, $B1, 12]}
>     @{[vsll_vi $V30, $B2, 12]}
>     @{[vsll_vi $V31, $B3, 12]}
>     @{[vsrl_vi $B0, $B0, 32 - 12]}
>     @{[vsrl_vi $B1, $B1, 32 - 12]}
>     @{[vsrl_vi $B2, $B2, 32 - 12]}
>     @{[vsrl_vi $B3, $B3, 32 - 12]}
>     @{[vor_vv $B0, $B0, $V28]}
151c167
<     @{[vror_vi $B1, $B1, 32 - 12]}
---
>     @{[vor_vv $B1, $B1, $V29]}
153c169
<     @{[vror_vi $B2, $B2, 32 - 12]}
---
>     @{[vor_vv $B2, $B2, $V30]}
155c171
<     @{[vror_vi $B3, $B3, 32 - 12]}
---
>     @{[vor_vv $B3, $B3, $V31]}
174c190,198
<     @{[vror_vi $D0, $D0, 32 - 8]}
---
>     @{[vsll_vi $V24, $D0, 8]}
>     @{[vsll_vi $V25, $D1, 8]}
>     @{[vsll_vi $V26, $D2, 8]}
>     @{[vsll_vi $V27, $D3, 8]}
>     @{[vsrl_vi $D0, $D0, 32 - 8]}
>     @{[vsrl_vi $D1, $D1, 32 - 8]}
>     @{[vsrl_vi $D2, $D2, 32 - 8]}
>     @{[vsrl_vi $D3, $D3, 32 - 8]}
>     @{[vor_vv $D0, $D0, $V24]}
176c200
<     @{[vror_vi $D1, $D1, 32 - 8]}
---
>     @{[vor_vv $D1, $D1, $V25]}
178c202
<     @{[vror_vi $D2, $D2, 32 - 8]}
---
>     @{[vor_vv $D2, $D2, $V26]}
180c204
<     @{[vror_vi $D3, $D3, 32 - 8]}
---
>     @{[vor_vv $D3, $D3, $V27]}
199c223,231
<     @{[vror_vi $B0, $B0, 32 - 7]}
---
>     @{[vsll_vi $V28, $B0, 7]}
>     @{[vsll_vi $V29, $B1, 7]}
>     @{[vsll_vi $V30, $B2, 7]}
>     @{[vsll_vi $V31, $B3, 7]}
>     @{[vsrl_vi $B0, $B0, 32 - 7]}
>     @{[vsrl_vi $B1, $B1, 32 - 7]}
>     @{[vsrl_vi $B2, $B2, 32 - 7]}
>     @{[vsrl_vi $B3, $B3, 32 - 7]}
>     @{[vor_vv $B0, $B0, $V28]}
201c233
<     @{[vror_vi $B1, $B1, 32 - 7]}
---
>     @{[vor_vv $B1, $B1, $V29]}
203c235
<     @{[vror_vi $B2, $B2, 32 - 7]}
---
>     @{[vor_vv $B2, $B2, $V30]}
205c237
<     @{[vror_vi $B3, $B3, 32 - 7]}
---
>     @{[vor_vv $B3, $B3, $V31]}
214,216c246,248
< .globl ChaCha20_ctr32_zvkb
< .type ChaCha20_ctr32_zvkb,\@function
< ChaCha20_ctr32_zvkb:
---
> .globl ChaCha20_ctr32_v
> .type ChaCha20_ctr32_v,\@function
> ChaCha20_ctr32_v:
471c503
< .size ChaCha20_ctr32_zvkb,.-ChaCha20_ctr32_zvkb
---
> .size ChaCha20_ctr32_v,.-ChaCha20_ctr32_v

You can also take a look at this since this code is changed from your implementation.

t8m · 2024-04-15T07:00:45Z

Could somebody knowledgeable of RISCV please review this code? I can provide only a formal approval.

It would be also appreciated if someone could adopt the riscv build targets as community maintainer(s) as they are currently unadopted.
https://www.openssl.org/policies/general-supplemental/platforms.html

ZenithalHourlyRate · 2024-04-15T10:17:56Z

I also provided a diff with your implementation using Zvkb

@cyyself, as the difference is only on the vror instruction and all other parts are reused, I think reusing all of them in a file named chacha20-riscv64-v-common.pl and making extension specific implementions into chacha20-riscv64-v.pl and chacha20-riscv64-zvkb.pl would largely reduce the maintainance burden.

Could somebody knowledgeable of RISCV please review this code? I can provide only a formal approval.

I will check the equivalence of these two asm implementation soon. I had some RISC-V Vector Extension experience inside OpenSSL.

It would be also appreciated if someone could adopt the riscv build targets as community maintainer(s) as they are currently unadopted.
https://www.openssl.org/policies/general-supplemental/platforms.html

@t8m I am willing to do so, How could I participate in? I think I had related experience, see here

cyyself · 2024-04-15T10:20:06Z

I also provided a diff with your implementation using Zvkb

@cyyself, as the difference is only on the vror instruction and all other parts are reused, I think reusing all of them in a file named chacha20-riscv64-v-common.pl and making extension specific implementions into chacha20-riscv64-v.pl and chacha20-riscv64-zvkb.pl would largely reduce the maintainance burden.

I will do that. Thanks.

t8m · 2024-04-15T10:44:16Z

It would be also appreciated if someone could adopt the riscv build targets as community maintainer(s) as they are currently unadopted.
https://www.openssl.org/policies/general-supplemental/platforms.html

@t8m I am willing to do so, How could I participate in? I think I had related experience, see here

Please open a pull request against https://github.com/openssl/general-policies/ to move the relevant platforms from unadopted to community. Similar to for example this pull request: https://github.com/openssl/general-policies/pull/54/files

crypto/chacha/asm/chacha-riscv64-v.pl

crypto/chacha/build.info

Since we can do group operations on vector registers in RISC-V, some vector registers will be used without being explicitly referenced. Thus, comments on vector register allocation should be added to improve the code readability and maintainability. Signed-off-by: Yangyu Chen <cyy@cyyself.name>

crypto/chacha/asm/chacha-riscv64-v-zbb.pl

JerryShih

For me, I would like to have the regular indent in perl generator instead of the final asm file. It's easier for me to maintain the generator.
I'm not sure which one is the good practice in openssl.

cyyself · 2024-04-23T02:00:32Z

For me, I would like to have the regular indent in perl generator instead of the final asm file. It's easier for me to maintain the generator. I'm not sure which one is the good practice in openssl.

I also have this concern. That’s why I separate it to a different commit. I will wait for opinions from OpenSSL maintainers.

JerryShih · 2024-04-23T04:38:59Z

For me, I would like to have the regular indent in perl generator instead of the final asm file. It's easier for me to maintain the generator. I'm not sure which one is the good practice in openssl.

I also have this concern. That’s why I separate it to a different commit. I will wait for opinions from OpenSSL maintainers.

The f1d7f4d has:

have branch for vror code block.
update indent

It's not just for the indent only.

t8m · 2024-04-23T06:10:36Z

For me, I would like to have the regular indent in perl generator instead of the final asm file. It's easier for me to maintain the generator. I'm not sure which one is the good practice in openssl.

Yes, the indentation should be primarily correct in the .pl file as that is the source file. It is nice if the indentation is correct in the generated file, but it is definitely not a requirement.

cyyself · 2024-04-23T06:35:21Z

For me, I would like to have the regular indent in perl generator instead of the final asm file. It's easier for me to maintain the generator. I'm not sure which one is the good practice in openssl.

Yes, the indentation should be primarily correct in the .pl file as that is the source file. It is nice if the indentation is correct in the generated file, but it is definitely not a requirement.

Thanks. I edited the last commit on indentation. Now, it only reuses the same part of the code.

This patch merged the `add` and `xor` part of chacha_sub_round, which are same in RISC-V Vector only and Zvkb implementation. There is no change to the generated ASM code except for the indent. Signed-off-by: Yangyu Chen <cyy@cyyself.name>

t8m · 2024-04-24T17:12:49Z

@paulidale please reconfirm

cyyself · 2024-05-06T17:00:20Z

Although it hasn't been reviewed for two weeks. There is some good news to share. I got another new RISC-V Board, BananaPi F3, which has Spacemit K1 SoC and 256 bits VLEN, which is 2x wide in Vector Length compared to the K230 I used to benchmark before. The performance is shown below:

[cyy@k1 ssl]$ ./bin/openssl speed -evp chacha20
Doing ChaCha20 ops for 3s on 16 size blocks: 11580221 ChaCha20 ops in 3.00s
Doing ChaCha20 ops for 3s on 64 size blocks: 4837731 ChaCha20 ops in 3.00s
Doing ChaCha20 ops for 3s on 256 size blocks: 1433019 ChaCha20 ops in 3.01s
Doing ChaCha20 ops for 3s on 1024 size blocks: 375369 ChaCha20 ops in 3.00s
Doing ChaCha20 ops for 3s on 8192 size blocks: 47621 ChaCha20 ops in 3.00s
Doing ChaCha20 ops for 3s on 16384 size blocks: 23843 ChaCha20 ops in 3.01s
version: 3.4.0-dev
built on: Tue Apr 23 06:23:46 2024 UTC
options: bn(64,64)
compiler: gcc -fPIC -pthread -Wa,--noexecstack -Wall -O3 -march=rv64gc_zba_zbb -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_BUILDING_OPENSSL -DNDEBUG
CPUINFO: N/A
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
ChaCha20         61761.18k   103204.93k   121878.03k   128125.95k   130037.08k   129781.96k
[cyy@k1 ssl]$ export OPENSSL_riscvcap=rv64gc_v_zba_zbb
[cyy@k1 ssl]$ ./bin/openssl speed -evp chacha20
Doing ChaCha20 ops for 3s on 16 size blocks: 11308125 ChaCha20 ops in 3.00s
Doing ChaCha20 ops for 3s on 64 size blocks: 4804401 ChaCha20 ops in 3.01s
Doing ChaCha20 ops for 3s on 256 size blocks: 1409444 ChaCha20 ops in 3.00s
Doing ChaCha20 ops for 3s on 1024 size blocks: 725378 ChaCha20 ops in 3.00s
Doing ChaCha20 ops for 3s on 8192 size blocks: 99129 ChaCha20 ops in 3.00s
Doing ChaCha20 ops for 3s on 16384 size blocks: 51344 ChaCha20 ops in 3.00s
version: 3.4.0-dev
built on: Tue Apr 23 06:23:46 2024 UTC
options: bn(64,64)
compiler: gcc -fPIC -pthread -Wa,--noexecstack -Wall -O3 -march=rv64gc_zba_zbb -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_BUILDING_OPENSSL -DNDEBUG
CPUINFO: N/A
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
ChaCha20         60310.00k   102153.38k   120272.55k   247595.69k   270688.26k   280406.70k
[cyy@k1 ssl]$ neofetch
                   -`                    cyy@k1
                  .o+`                   ------
                 `ooo/                   OS: Arch Linux riscv64
                `+oooo:                  Host: spacemit k1-x deb1 board
               `+oooooo:                 Kernel: 6.1.15-legacy-k1
               -+oooooo+:                Uptime: 6 mins
             `/:-:++oooo+:               Packages: 139 (pacman)
            `/++++/+++++++:              Shell: bash 5.2.26
           `/++++++++++++++:             Terminal: /dev/pts/0
          `/+++ooooooooooooo/`           CPU: Spacemit X60 (8) @ 1.600GHz
         ./ooosssso++osssssso+`          Memory: 111MiB / 3809MiB
        .oossssso-````/ossssss+`
       -osssssso.      :ssssssso.
      :osssssss/        osssso+++.
     /ossssssss/        +ssssooo/-
   `/ossssso+/:-        -:/+osssso+-
  `+sso+:-`                 `.-/+oso:
 `++:.                           `-/+/
 .`                                 `/

[cyy@k1 ssl]$ cat /proc/cpuinfo | head -n 9
processor	: 0
hart		: 0
model name	: Spacemit(R) X60
isa		: rv64imafdcv_sscofpmf_sstc_svpbmt_zicbom_zicboz_zicbop_zihintpause
mmu		: sv39
mvendorid	: 0x710
marchid		: 0x8000000058000001
mimpid		: 0x1000000049772200

I changed the baseline to rv64gcv_zba_zbb compiled by GCC. The result shows that it is more than 1x faster than pure C implementation optimized by the compiler. On K230, it only indicates 26.3% faster.

paulidale

Apologies, missed this one.

openssl-machine · 2024-05-07T23:00:19Z

This pull request is ready to merge

t8m · 2024-05-08T09:12:14Z

Merged to the master branch. Thank you for your contribution.

Although we have a Zvkb version of Chacha20, the Zvkb from the RISC-V Vector Cryptography Bit-manipulation extension was ratified in late 2023 and does not come to the RVA23 Profile. Many CPUs in 2024 currently do not support Zvkb but may have Vector and Bit-manipulation, which are already in the RVA22 Profile. This commit provides a vector-only implementation that replaced the vror with vsll+vsrl+vor and can provide enough speed for Chacha20 for new CPUs this year. Signed-off-by: Yangyu Chen <cyy@cyyself.name> Reviewed-by: Paul Dale <ppzgs1@gmail.com> Reviewed-by: Tomas Mraz <tomas@openssl.org> (Merged from #24069)

Since we can do group operations on vector registers in RISC-V, some vector registers will be used without being explicitly referenced. Thus, comments on vector register allocation should be added to improve the code readability and maintainability. Signed-off-by: Yangyu Chen <cyy@cyyself.name> Reviewed-by: Paul Dale <ppzgs1@gmail.com> Reviewed-by: Tomas Mraz <tomas@openssl.org> (Merged from #24069)

This patch merged the `add` and `xor` part of chacha_sub_round, which are same in RISC-V Vector only and Zvkb implementation. There is no change to the generated ASM code except for the indent. Signed-off-by: Yangyu Chen <cyy@cyyself.name> Reviewed-by: Paul Dale <ppzgs1@gmail.com> Reviewed-by: Tomas Mraz <tomas@openssl.org> (Merged from #24069)

openssl-machine added the hold: cla required The contributor needs to submit a license agreement label Apr 9, 2024

cyyself force-pushed the chacha-rvv branch from b020c6a to 0874c6b Compare April 9, 2024 07:06

cyyself marked this pull request as draft April 9, 2024 07:18

t8m reviewed Apr 9, 2024

View reviewed changes

crypto/chacha/chacha_riscv.c Show resolved Hide resolved

t8m reviewed Apr 9, 2024

View reviewed changes

crypto/chacha/chacha_riscv.c Outdated Show resolved Hide resolved

openssl-machine removed the hold: cla required The contributor needs to submit a license agreement label Apr 12, 2024

cyyself marked this pull request as ready for review April 12, 2024 15:14

cyyself force-pushed the chacha-rvv branch from 0874c6b to 742a46d Compare April 12, 2024 15:15

cyyself requested a review from t8m April 12, 2024 17:27

ZenithalHourlyRate mentioned this pull request Apr 15, 2024

Propose ZenithalHourlyRate as linux64-riscv64 and linux32-riscv32 community maintainer openssl/general-policies#66

Closed

JerryShih reviewed Apr 16, 2024

View reviewed changes

crypto/chacha/asm/chacha-riscv64-v.pl Outdated Show resolved Hide resolved

crypto/chacha/build.info Outdated Show resolved Hide resolved

cyyself force-pushed the chacha-rvv branch from 8aa8dbd to 5946323 Compare April 19, 2024 14:20

t8m approved these changes Apr 19, 2024

View reviewed changes

paulidale approved these changes Apr 20, 2024

View reviewed changes

JerryShih reviewed Apr 22, 2024

View reviewed changes

crypto/chacha/asm/chacha-riscv64-v-zbb.pl Show resolved Hide resolved

cyyself requested a review from JerryShih April 22, 2024 06:39

JerryShih approved these changes Apr 23, 2024

View reviewed changes

cyyself force-pushed the chacha-rvv branch from f1d7f4d to 52490de Compare April 23, 2024 06:33

chacha-riscv64-v-zbb.pl: better format

be76304

This patch merged the `add` and `xor` part of chacha_sub_round, which are same in RISC-V Vector only and Zvkb implementation. There is no change to the generated ASM code except for the indent. Signed-off-by: Yangyu Chen <cyy@cyyself.name>

cyyself force-pushed the chacha-rvv branch from 52490de to be76304 Compare April 23, 2024 06:37

t8m approved these changes Apr 24, 2024

View reviewed changes

cyyself requested a review from paulidale April 28, 2024 06:50

paulidale approved these changes May 6, 2024

View reviewed changes

paulidale added approval: done This pull request has the required number of approvals and removed approval: review pending This pull request needs review by a committer labels May 6, 2024

openssl-machine removed the approval: done This pull request has the required number of approvals label May 7, 2024

openssl-machine added the approval: ready to merge The 24 hour grace period has passed, ready to merge label May 7, 2024

t8m closed this May 8, 2024

ZenithalHourlyRate mentioned this pull request May 15, 2024

CI: cross-compile: riscv: enable more tests on extensions #24403

Draft

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

riscv: Provide a vector only implementation of CHACHA20 cipher #24069

riscv: Provide a vector only implementation of CHACHA20 cipher #24069

cyyself commented Apr 9, 2024 •

edited

t8m commented Apr 9, 2024

cyyself commented Apr 9, 2024

cyyself commented Apr 9, 2024 •

edited

kroeckx commented Apr 9, 2024

cyyself commented Apr 9, 2024

tom-cosgrove-arm commented Apr 9, 2024

cyyself commented Apr 9, 2024

cyyself commented Apr 12, 2024 •

edited

cyyself commented Apr 12, 2024

t8m commented Apr 15, 2024

ZenithalHourlyRate commented Apr 15, 2024 •

edited

cyyself commented Apr 15, 2024

t8m commented Apr 15, 2024 •

edited

JerryShih left a comment

cyyself commented Apr 23, 2024

JerryShih commented Apr 23, 2024

t8m commented Apr 23, 2024

cyyself commented Apr 23, 2024

t8m commented Apr 24, 2024

cyyself commented May 6, 2024 •

edited

paulidale left a comment

openssl-machine commented May 7, 2024

t8m commented May 8, 2024

riscv: Provide a vector only implementation of CHACHA20 cipher #24069

riscv: Provide a vector only implementation of CHACHA20 cipher #24069

Conversation

cyyself commented Apr 9, 2024 • edited

Performance Evaluation

Checklist

t8m commented Apr 9, 2024

cyyself commented Apr 9, 2024

cyyself commented Apr 9, 2024 • edited

kroeckx commented Apr 9, 2024

cyyself commented Apr 9, 2024

tom-cosgrove-arm commented Apr 9, 2024

cyyself commented Apr 9, 2024

cyyself commented Apr 12, 2024 • edited

cyyself commented Apr 12, 2024

t8m commented Apr 15, 2024

ZenithalHourlyRate commented Apr 15, 2024 • edited

cyyself commented Apr 15, 2024

t8m commented Apr 15, 2024 • edited

JerryShih left a comment

Choose a reason for hiding this comment

cyyself commented Apr 23, 2024

JerryShih commented Apr 23, 2024

t8m commented Apr 23, 2024

cyyself commented Apr 23, 2024

t8m commented Apr 24, 2024

cyyself commented May 6, 2024 • edited

paulidale left a comment

Choose a reason for hiding this comment

openssl-machine commented May 7, 2024

t8m commented May 8, 2024

cyyself commented Apr 9, 2024 •

edited

cyyself commented Apr 9, 2024 •

edited

cyyself commented Apr 12, 2024 •

edited

ZenithalHourlyRate commented Apr 15, 2024 •

edited

t8m commented Apr 15, 2024 •

edited

cyyself commented May 6, 2024 •

edited