riscv: Provide a vector only implementation of CHACHA20 cipher #24069

Closed
wants to merge 3 commits into from

Contributor

@cyyself cyyself commented Apr 9, 2024

Although we have a Zvkb version of Chacha20, the Zvkb from the RISC-V Vector Cryptography Bit-manipulation extension was ratified in late 2023 and does not come to the RVA23 Profile. Many CPUs in 2024 do not yet support Zvkb but may have Vector and Scalar Bit-manipulation, which are already in the RVA22 Profile. This commit provides a vector-only implementation that replaces vror with vsll+vsrl+vor and can provide enough speed for Chacha20 on new CPUs this year.
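
As a rough illustration of the substitution described above, here is a minimal Python sketch (not the RISC-V assembly from this PR; the helper names are hypothetical) showing that a 32-bit rotate can be built from one left shift, one right shift and an OR, which is what vsll+vsrl+vor do on each element:

```python
MASK32 = 0xFFFFFFFF  # ChaCha20 operates on 32-bit words


def rotl32_shift_or(x: int, n: int) -> int:
    """Rotate a 32-bit word left by n (1..31) using only shifts and OR."""
    left = (x << n) & MASK32   # like vsll: bits pushed past bit 31 are dropped
    right = x >> (32 - n)      # like vsrl: recovers the dropped bits at the bottom
    return left | right        # like vor: merge the two halves


def rotr32(x: int, n: int) -> int:
    """Reference rotate right, the operation a single vror performs."""
    return ((x >> n) | (x << (32 - n))) & MASK32


# Rotating left by n is the same as rotating right by 32 - n,
# checked here for the four rotation amounts ChaCha20 uses.
for n in (16, 12, 8, 7):
    assert rotl32_shift_or(0x12345678, n) == rotr32(0x12345678, 32 - n)
```

The cost of the emulation is two extra instructions and one temporary register per rotate, which is the trade-off the benchmark below measures.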

Performance Evaluation

Platform: Canaan Kendryte K230
CPU: T-Head C908 @1.6GHz
Compiler: gcc version 13.2.1 20230801 (GCC)

[cyy@archlinux ssl]$ neofetch
                   -`                    cyy@archlinux
                  .o+`                   -------------
                 `ooo/                   OS: Arch Linux riscv64
                `+oooo:                  Host: Canaan Kendryte K230
               `+oooooo:                 Kernel: 6.8.0+
               -+oooooo+:                Uptime: 4 days, 17 hours, 27 mins
             `/:-:++oooo+:               Packages: 156 (pacman)
            `/++++/+++++++:              Shell: bash 5.2.26
           `/++++++++++++++:             Terminal: /dev/pts/0
          `/+++ooooooooooooo/`           CPU: (1)
         ./ooosssso++osssssso+`          Memory: 72MiB / 477MiB
        .oossssso-````/ossssss+`
       -osssssso.      :ssssssso.
      :osssssss/        osssso+++.
     /ossssssss/        +ssssooo/-
   `/ossssso+/:-        -:/+osssso+-
  `+sso+:-`                 `.-/+oso:
 `++:.                           `-/+/
 .`                                 `/

[cyy@archlinux ssl]$ cat /proc/cpuinfo
processor	: 0
hart		: 0
isa		: rv64imafdcv_zicbom_zicntr_zicsr_zifencei_zihpm_zba_zbb_zbc_zbs_svpbmt
mmu		: sv39
mvendorid	: 0x5b7
marchid		: 0x8000000009140d00
mimpid		: 0x50000
hart isa	: rv64imafdcv_zicbom_zicntr_zicsr_zifencei_zihpm_zba_zbb_zbc_zbs_svpbmt

[cyy@archlinux ssl]$ unset OPENSSL_riscvcap
[cyy@archlinux ssl]$ ./bin/openssl speed -evp chacha20
Doing ChaCha20 ops for 3s on 16 size blocks: 9190817 ChaCha20 ops in 2.90s
Doing ChaCha20 ops for 3s on 64 size blocks: 3541082 ChaCha20 ops in 2.98s
Doing ChaCha20 ops for 3s on 256 size blocks: 998612 ChaCha20 ops in 2.98s
Doing ChaCha20 ops for 3s on 1024 size blocks: 258575 ChaCha20 ops in 2.99s
Doing ChaCha20 ops for 3s on 8192 size blocks: 32655 ChaCha20 ops in 2.98s
Doing ChaCha20 ops for 3s on 16384 size blocks: 16335 ChaCha20 ops in 2.99s
version: 3.4.0-dev
built on: Fri Apr 19 06:03:02 2024 UTC
options: bn(64,64)
compiler: gcc -fPIC -pthread -Wa,--noexecstack -Wall -O3 -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_BUILDING_OPENSSL -DNDEBUG
CPUINFO: N/A
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
ChaCha20         50707.96k    76050.08k    85786.80k    88555.45k    89768.38k    89509.24k
[cyy@archlinux ssl]$ export OPENSSL_riscvcap=rv64gc_v_zba_zbb
[cyy@archlinux ssl]$ ./bin/openssl speed -evp chacha20
Doing ChaCha20 ops for 3s on 16 size blocks: 9400208 ChaCha20 ops in 2.97s
Doing ChaCha20 ops for 3s on 64 size blocks: 3561831 ChaCha20 ops in 2.99s
Doing ChaCha20 ops for 3s on 256 size blocks: 1417570 ChaCha20 ops in 2.98s
Doing ChaCha20 ops for 3s on 1024 size blocks: 370833 ChaCha20 ops in 2.98s
Doing ChaCha20 ops for 3s on 8192 size blocks: 57722 ChaCha20 ops in 2.99s
Doing ChaCha20 ops for 3s on 16384 size blocks: 28833 ChaCha20 ops in 2.98s
version: 3.4.0-dev
built on: Fri Apr 19 06:03:02 2024 UTC
options: bn(64,64)
compiler: gcc -fPIC -pthread -Wa,--noexecstack -Wall -O3 -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_BUILDING_OPENSSL -DNDEBUG
CPUINFO: N/A
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
ChaCha20         50640.85k    76239.86k   121777.83k   127427.18k   158146.70k   158523.45k
[cyy@archlinux ssl]$ unset OPENSSL_riscvcap
[cyy@archlinux ssl]$ ./bin/openssl speed -evp chacha20
Doing ChaCha20 ops for 3s on 16 size blocks: 10710666 ChaCha20 ops in 2.99s
Doing ChaCha20 ops for 3s on 64 size blocks: 4656347 ChaCha20 ops in 3.00s
Doing ChaCha20 ops for 3s on 256 size blocks: 1390592 ChaCha20 ops in 3.00s
Doing ChaCha20 ops for 3s on 1024 size blocks: 366078 ChaCha20 ops in 3.01s
Doing ChaCha20 ops for 3s on 8192 size blocks: 46493 ChaCha20 ops in 3.00s
Doing ChaCha20 ops for 3s on 16384 size blocks: 23223 ChaCha20 ops in 3.00s
version: 3.4.0-dev
built on: Fri Apr 19 06:13:13 2024 UTC
options: bn(64,64)
compiler: gcc -fPIC -pthread -Wa,--noexecstack -Wall -O3 -march=rv64gcv_zba_zbb -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_BUILDING_OPENSSL -DNDEBUG
CPUINFO: N/A
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
ChaCha20         57314.60k    99335.40k   118663.85k   124539.49k   126956.89k   126828.54k
[cyy@archlinux ssl]$ export OPENSSL_riscvcap=rv64gc_v_zba_zbb
[cyy@archlinux ssl]$ ./bin/openssl speed -evp chacha20
Doing ChaCha20 ops for 3s on 16 size blocks: 11064374 ChaCha20 ops in 2.96s
Doing ChaCha20 ops for 3s on 64 size blocks: 4723040 ChaCha20 ops in 3.00s
Doing ChaCha20 ops for 3s on 256 size blocks: 1426926 ChaCha20 ops in 3.01s
Doing ChaCha20 ops for 3s on 1024 size blocks: 373266 ChaCha20 ops in 3.00s
Doing ChaCha20 ops for 3s on 8192 size blocks: 58149 ChaCha20 ops in 2.99s
Doing ChaCha20 ops for 3s on 16384 size blocks: 29032 ChaCha20 ops in 2.97s
version: 3.4.0-dev
built on: Fri Apr 19 06:13:13 2024 UTC
options: bn(64,64)
compiler: gcc -fPIC -pthread -Wa,--noexecstack -Wall -O3 -march=rv64gcv_zba_zbb -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_BUILDING_OPENSSL -DNDEBUG
CPUINFO: N/A
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
ChaCha20         59807.43k   100758.19k   121359.82k   127408.13k   159316.59k   160154.98k
[cyy@archlinux ssl]$
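
The throughput rows in the runs above can be reproduced from the operation counts. The following Python sketch assumes the figure is simply ops × block size ÷ elapsed time, expressed in 1000s of bytes per second, and that the displayed (rounded) elapsed time is close enough to the value `openssl speed` used internally; the function name is hypothetical.

```python
def throughput_k(ops: int, block_size: int, seconds: float) -> float:
    """Throughput in 1000s of bytes per second."""
    return ops * block_size / seconds / 1000.0


# First run above: 9190817 ops on 16-byte blocks in 2.90 s.
print(f"{throughput_k(9190817, 16, 2.90):.2f}k")  # 50707.96k

# Same run, 64-byte blocks: 3541082 ops in 2.98 s.
print(f"{throughput_k(3541082, 64, 2.98):.2f}k")  # 76050.08k
```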

Summary of the result:

Chacha20 Implementation   GCC -march        16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
C                         64gc             50707.96k    76050.08k    85786.80k    88555.45k    89768.38k    89509.24k
C                         64gcv_zba_zbb    57314.60k    99335.40k   118663.85k   124539.49k   126956.89k   126828.54k
ASM-RVV                   64gc             50640.85k    76239.86k   121777.83k   127427.18k   158146.70k   158523.45k
ASM-RVV                   64gcv_zba_zbb    59807.43k   100758.19k   121359.82k   127408.13k   159316.59k   160154.98k

According to the results, ChaCha20 reaches up to 160154.98k on 16384-byte blocks, which is 26.3% faster than the C implementation compiled by GCC with rv64gcv_zba_zbb and 78.9% faster than the C implementation compiled with rv64gc, the default -march for RISC-V in GCC.
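
The two percentages can be checked directly against the 16384-byte column of the summary. A small Python sketch (the helper name is hypothetical):

```python
def percent_faster(new: float, baseline: float) -> float:
    """How much faster `new` is than `baseline`, in percent."""
    return (new / baseline - 1.0) * 100.0


asm_rvv = 160154.98    # ASM-RVV, built with rv64gcv_zba_zbb
c_rv64gcv = 126828.54  # C, built with rv64gcv_zba_zbb
c_rv64gc = 89509.24    # C, built with rv64gc

print(f"{percent_faster(asm_rvv, c_rv64gcv):.1f}%")  # 26.3%
print(f"{percent_faster(asm_rvv, c_rv64gc):.1f}%")   # 78.9%
```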

Checklist
  • documentation is added or updated
  • tests are added or updated

I also diffed the generated ASM for Zvkb, since I updated the Perl script for ASM generation. There are no changes except for indentation.

@openssl-machine openssl-machine added the hold: cla required The contributor needs to submit a license agreement label Apr 9, 2024
@cyyself cyyself marked this pull request as draft April 9, 2024 07:18
@t8m
Member

t8m commented Apr 9, 2024

Also from your performance measurements I see that the performance on short data is severely degraded and the improvement on long data is only slight. Is it really worth it? Should the existing non-vector implementation be called on shorter inputs?

Review thread on crypto/chacha/chacha_riscv.c (outdated, resolved)
@cyyself
Contributor Author

cyyself commented Apr 9, 2024

Also from your performance measurements I see that the performance on short data is severely degraded and the improvement on long data is only slight. Is it really worth it?

I think so. Maybe I should do some profiling and optimize the implementation.

Should the existing non-vector implementation be called on shorter inputs?

I will wait until I have further optimization to make this decision.

@cyyself
Contributor Author

cyyself commented Apr 9, 2024

It does not pass make test, so I converted this PR to a draft. It was modified from the zvkb implementation, which is already in OpenSSL, but that does not pass the test either. I have submitted issue #24070 to address this and will debug it soon.

@kroeckx
Member

kroeckx commented Apr 9, 2024

OPENSSL_riscvcap seems to be undocumented. Can you add a document like the ones for OPENSSL_ia32cap and OPENSSL_s390xcap in another PR?

@cyyself
Contributor Author

cyyself commented Apr 9, 2024

OPENSSL_riscvcap seems to be undocumented. Can you add a document like the ones for OPENSSL_ia32cap and OPENSSL_s390xcap in another PR?

I think OPENSSL_riscvcap should be deprecated in the future. The Linux kernel has provided the hwprobe syscall since v6.4, which reports the ISA extensions supported by each CPU.

@tom-cosgrove-arm
Contributor

I think OPENSSL_riscvcap should be deprecated in the future

Being able to explicitly enable and disable specific optimisations is very useful. On both x86 and Arm we query for capabilities and use that to set the respective xxxcap variables, which can then be given in the environment to override what is detected. I would recommend against deprecating this.

@cyyself
Contributor Author

cyyself commented Apr 9, 2024

I think OPENSSL_riscvcap should be deprecated in the future

Being able to explicitly enable and disable specific optimisations is very useful. On both x86 and Arm we query for capabilities and use that to set the respective xxxcap variables, which can then be given in the environment to override what is detected. I would recommend against deprecating this.

OK. So the plan would be to add hwprobe and use it by default, while allowing OPENSSL_riscvcap to override it. The missing documentation for OPENSSL_riscvcap should be added as well.
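
The detect-then-override scheme discussed here can be sketched as follows. This is a simplified Python illustration, not OpenSSL's actual parsing logic: `resolve_caps` is a hypothetical helper, the detected set stands in for whatever hwprobe would report, and the override string is assumed to look like the `rv64gc_v_zba_zbb` value used in the benchmark, with extensions separated by underscores after the base ISA token.

```python
from typing import Mapping, Set


def resolve_caps(detected: Set[str], env: Mapping[str, str]) -> Set[str]:
    """Use detected extensions unless OPENSSL_riscvcap overrides them."""
    override = env.get("OPENSSL_riscvcap")
    if override is None:
        # No override: trust runtime detection (hwprobe in the real system).
        return set(detected)
    # Assumed format: "<base>_<ext>_<ext>..."; the base token is skipped.
    tokens = override.lower().split("_")
    return {t for t in tokens[1:] if t}


# Detection alone:
print(sorted(resolve_caps({"v", "zbb"}, {})))
# An explicit override replaces the detected set, which also allows
# disabling an optimisation that detection would have enabled:
print(sorted(resolve_caps({"v", "zbb"}, {"OPENSSL_riscvcap": "rv64gc"})))
```

In practice `env` would be `os.environ`; passing a mapping keeps the sketch easy to test.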

@openssl-machine openssl-machine removed the hold: cla required The contributor needs to submit a license agreement label Apr 12, 2024
@cyyself
Contributor Author

cyyself commented Apr 12, 2024

Also from your performance measurements I see that the performance on short data is severely degraded and the improvement on long data is only slight. Is it really worth it? Should the existing non-vector implementation be called on shorter inputs?

@t8m

I have refactored the code from #24097 and updated the benchmark on the Canaan Kendryte K230 platform. It shows no performance degradation compared to the pure C implementation, so it is worth using.

In addition, the Canaan Kendryte K230 is an embedded RISC-V SoC with an in-order pipeline and a very short vector length of 128 bits, similar to a Cortex-A53 with NEON in the Arm ecosystem. We may see better performance on out-of-order cores with longer vector lengths.

@cyyself cyyself marked this pull request as ready for review April 12, 2024 15:14
@cyyself
Contributor Author

cyyself commented Apr 12, 2024

@JerryShih

I also provided a diff with your implementation using Zvkb. You can see the code below:

5c5
< # Copyright 2023-2023 The OpenSSL Project Authors. All Rights Reserved.
---
> # Copyright 2023-2024 The OpenSSL Project Authors. All Rights Reserved.
13a14
> # Copyright (c) 2024, Yangyu Chen <cyy@cyyself.name>
40d40
< # - RISC-V Vector Cryptography Bit-manipulation extension ('Zvkb')
63,65c63,65
< # void ChaCha20_ctr32_zvkb(unsigned char *out, const unsigned char *inp,
< #                          size_t len, const unsigned int key[8],
< #                          const unsigned int counter[4]);
---
> # void ChaCha20_ctr32_v(unsigned char *out, const unsigned char *inp,
> #                       size_t len, const unsigned int key[8],
> #                       const unsigned int counter[4]);
124c124,132
<     @{[vror_vi $D0, $D0, 32 - 16]}
---
>     @{[vsll_vi $V24, $D0, 16]}
>     @{[vsll_vi $V25, $D1, 16]}
>     @{[vsll_vi $V26, $D2, 16]}
>     @{[vsll_vi $V27, $D3, 16]}
>     @{[vsrl_vi $D0, $D0, 32 - 16]}
>     @{[vsrl_vi $D1, $D1, 32 - 16]}
>     @{[vsrl_vi $D2, $D2, 32 - 16]}
>     @{[vsrl_vi $D3, $D3, 32 - 16]}
>     @{[vor_vv $D0, $D0, $V24]}
126c134
<     @{[vror_vi $D1, $D1, 32 - 16]}
---
>     @{[vor_vv $D1, $D1, $V25]}
128c136
<     @{[vror_vi $D2, $D2, 32 - 16]}
---
>     @{[vor_vv $D2, $D2, $V26]}
130c138
<     @{[vror_vi $D3, $D3, 32 - 16]}
---
>     @{[vor_vv $D3, $D3, $V27]}
149c157,165
<     @{[vror_vi $B0, $B0, 32 - 12]}
---
>     @{[vsll_vi $V28, $B0, 12]}
>     @{[vsll_vi $V29, $B1, 12]}
>     @{[vsll_vi $V30, $B2, 12]}
>     @{[vsll_vi $V31, $B3, 12]}
>     @{[vsrl_vi $B0, $B0, 32 - 12]}
>     @{[vsrl_vi $B1, $B1, 32 - 12]}
>     @{[vsrl_vi $B2, $B2, 32 - 12]}
>     @{[vsrl_vi $B3, $B3, 32 - 12]}
>     @{[vor_vv $B0, $B0, $V28]}
151c167
<     @{[vror_vi $B1, $B1, 32 - 12]}
---
>     @{[vor_vv $B1, $B1, $V29]}
153c169
<     @{[vror_vi $B2, $B2, 32 - 12]}
---
>     @{[vor_vv $B2, $B2, $V30]}
155c171
<     @{[vror_vi $B3, $B3, 32 - 12]}
---
>     @{[vor_vv $B3, $B3, $V31]}
174c190,198
<     @{[vror_vi $D0, $D0, 32 - 8]}
---
>     @{[vsll_vi $V24, $D0, 8]}
>     @{[vsll_vi $V25, $D1, 8]}
>     @{[vsll_vi $V26, $D2, 8]}
>     @{[vsll_vi $V27, $D3, 8]}
>     @{[vsrl_vi $D0, $D0, 32 - 8]}
>     @{[vsrl_vi $D1, $D1, 32 - 8]}
>     @{[vsrl_vi $D2, $D2, 32 - 8]}
>     @{[vsrl_vi $D3, $D3, 32 - 8]}
>     @{[vor_vv $D0, $D0, $V24]}
176c200
<     @{[vror_vi $D1, $D1, 32 - 8]}
---
>     @{[vor_vv $D1, $D1, $V25]}
178c202
<     @{[vror_vi $D2, $D2, 32 - 8]}
---
>     @{[vor_vv $D2, $D2, $V26]}
180c204
<     @{[vror_vi $D3, $D3, 32 - 8]}
---
>     @{[vor_vv $D3, $D3, $V27]}
199c223,231
<     @{[vror_vi $B0, $B0, 32 - 7]}
---
>     @{[vsll_vi $V28, $B0, 7]}
>     @{[vsll_vi $V29, $B1, 7]}
>     @{[vsll_vi $V30, $B2, 7]}
>     @{[vsll_vi $V31, $B3, 7]}
>     @{[vsrl_vi $B0, $B0, 32 - 7]}
>     @{[vsrl_vi $B1, $B1, 32 - 7]}
>     @{[vsrl_vi $B2, $B2, 32 - 7]}
>     @{[vsrl_vi $B3, $B3, 32 - 7]}
>     @{[vor_vv $B0, $B0, $V28]}
201c233
<     @{[vror_vi $B1, $B1, 32 - 7]}
---
>     @{[vor_vv $B1, $B1, $V29]}
203c235
<     @{[vror_vi $B2, $B2, 32 - 7]}
---
>     @{[vor_vv $B2, $B2, $V30]}
205c237
<     @{[vror_vi $B3, $B3, 32 - 7]}
---
>     @{[vor_vv $B3, $B3, $V31]}
214,216c246,248
< .globl ChaCha20_ctr32_zvkb
< .type ChaCha20_ctr32_zvkb,\@function
< ChaCha20_ctr32_zvkb:
---
> .globl ChaCha20_ctr32_v
> .type ChaCha20_ctr32_v,\@function
> ChaCha20_ctr32_v:
471c503
< .size ChaCha20_ctr32_zvkb,.-ChaCha20_ctr32_zvkb
---
> .size ChaCha20_ctr32_v,.-ChaCha20_ctr32_v

You can also take a look at this since this code is changed from your implementation.
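
To make the pattern in the diff concrete, here is a Python model of a ChaCha20 quarter round in which each rotate is done the vector-only way: shift left into a temporary, shift right in place, then OR. Python lists stand in for vector registers, one 32-bit lane per element; all function names are hypothetical, and the sketch models the arithmetic only, not the instruction scheduling of the real code. The check at the end uses the quarter-round test vector from RFC 8439, section 2.1.1.

```python
MASK32 = 0xFFFFFFFF


def rotl_lanes(v, n):
    """Rotate every 32-bit lane left by n without a rotate instruction."""
    tmp = [(x << n) & MASK32 for x in v]    # vsll.vi tmp, v, n
    v = [x >> (32 - n) for x in v]          # vsrl.vi v, v, 32 - n
    return [a | b for a, b in zip(v, tmp)]  # vor.vv  v, v, tmp


def add_lanes(x, y):
    return [(a + b) & MASK32 for a, b in zip(x, y)]


def xor_lanes(x, y):
    return [a ^ b for a, b in zip(x, y)]


def quarter_round(a, b, c, d):
    """ChaCha20 quarter round applied lane-wise (rotations 16, 12, 8, 7)."""
    a = add_lanes(a, b); d = rotl_lanes(xor_lanes(d, a), 16)
    c = add_lanes(c, d); b = rotl_lanes(xor_lanes(b, c), 12)
    a = add_lanes(a, b); d = rotl_lanes(xor_lanes(d, a), 8)
    c = add_lanes(c, d); b = rotl_lanes(xor_lanes(b, c), 7)
    return a, b, c, d


# RFC 8439 section 2.1.1 test vector, placed in a single lane.
a, b, c, d = quarter_round([0x11111111], [0x01020304], [0x9B8D6F43], [0x01234567])
assert (a[0], b[0], c[0], d[0]) == (0xEA2A92F4, 0xCB1CF8CE, 0x4581472E, 0x5881C4BB)
```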

@cyyself cyyself requested a review from t8m April 12, 2024 17:27
@t8m t8m added the labels branch: master, approval: review pending, approval: otc review pending, triaged: feature, tests: exempted, triaged: performance on Apr 15, 2024
@t8m
Member

t8m commented Apr 15, 2024

Could somebody knowledgeable of RISCV please review this code? I can provide only a formal approval.

It would be also appreciated if someone could adopt the riscv build targets as community maintainer(s) as they are currently unadopted.
https://www.openssl.org/policies/general-supplemental/platforms.html

@ZenithalHourlyRate
Contributor

ZenithalHourlyRate commented Apr 15, 2024

I also provided a diff with your implementation using Zvkb

@cyyself, as the difference is only in the vror instruction and all other parts are reused, I think putting the shared parts in a file named chacha20-riscv64-v-common.pl and the extension-specific implementations in chacha20-riscv64-v.pl and chacha20-riscv64-zvkb.pl would largely reduce the maintenance burden.

Could somebody knowledgeable of RISCV please review this code? I can provide only a formal approval.

I will check the equivalence of these two asm implementations soon. I have some experience with the RISC-V Vector Extension inside OpenSSL.

It would be also appreciated if someone could adopt the riscv build targets as community maintainer(s) as they are currently unadopted.
https://www.openssl.org/policies/general-supplemental/platforms.html

@t8m I am willing to do so. How can I participate? I think I have related experience, see here

@cyyself
Contributor Author

cyyself commented Apr 15, 2024

I also provided a diff with your implementation using Zvkb

@cyyself, as the difference is only in the vror instruction and all other parts are reused, I think putting the shared parts in a file named chacha20-riscv64-v-common.pl and the extension-specific implementations in chacha20-riscv64-v.pl and chacha20-riscv64-zvkb.pl would largely reduce the maintenance burden.

I will do that. Thanks.
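
The idea of sharing everything except the rotate can be sketched as a tiny generator. This is a Python illustration rather than the Perl generator used in OpenSSL; the function names and the register names in the usage lines are hypothetical, while the instruction mnemonics (`vadd.vv`, `vxor.vv`, `vror.vi`, `vsll.vi`, `vsrl.vi`, `vor.vv`) are the standard RVV and Zvkb ones.

```python
def emit_rotl(reg: str, tmp: str, n: int, use_zvkb: bool) -> list:
    """Emit a 32-bit rotate-left by n: the only extension-specific part."""
    if use_zvkb:
        # Zvkb has a rotate-right, so rotate left by n is vror by 32 - n.
        return [f"vror.vi {reg}, {reg}, {32 - n}"]
    return [
        f"vsll.vi {tmp}, {reg}, {n}",
        f"vsrl.vi {reg}, {reg}, {32 - n}",
        f"vor.vv {reg}, {reg}, {tmp}",
    ]


def emit_sub_round(a: str, b: str, d: str, tmp: str, n: int, use_zvkb: bool) -> list:
    """Shared add + xor, followed by whichever rotate the target supports."""
    common = [f"vadd.vv {a}, {a}, {b}", f"vxor.vv {d}, {d}, {a}"]
    return common + emit_rotl(d, tmp, n, use_zvkb)


# Same sub-round for both targets; only the tail differs.
print("\n".join(emit_sub_round("v0", "v4", "v12", "v24", 16, use_zvkb=True)))
print("\n".join(emit_sub_round("v0", "v4", "v12", "v24", 16, use_zvkb=False)))
```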

@t8m
Member

t8m commented Apr 15, 2024

It would be also appreciated if someone could adopt the riscv build targets as community maintainer(s) as they are currently unadopted.
https://www.openssl.org/policies/general-supplemental/platforms.html

@t8m I am willing to do so. How can I participate? I think I have related experience, see here

Please open a pull request against https://github.com/openssl/general-policies/ to move the relevant platforms from unadopted to community. Similar to for example this pull request: https://github.com/openssl/general-policies/pull/54/files

Review thread on crypto/chacha/asm/chacha-riscv64-v.pl (outdated, resolved)
Review thread on crypto/chacha/build.info (outdated, resolved)
Since we can do group operations on vector registers in RISC-V, some vector
registers will be used without being explicitly referenced. Thus, comments
on vector register allocation should be added to improve the code
readability and maintainability.

Signed-off-by: Yangyu Chen <cyy@cyyself.name>
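
To illustrate why grouped registers can be "used without being explicitly referenced": in RVV, when LMUL is greater than 1, an instruction naming one register operates on a whole aligned group. The Python sketch below lists the members of such a group; the helper name is hypothetical, fractional LMUL is ignored, and the LMUL values shown are examples only, since the thread does not state which LMUL the patch uses.

```python
def register_group(base: int, lmul: int) -> list:
    """Registers implicitly covered when `v<base>` is named under a given LMUL."""
    if lmul not in (1, 2, 4, 8):
        raise ValueError("integer LMUL must be 1, 2, 4 or 8")
    if not 0 <= base < 32:
        raise ValueError("RVV has 32 vector registers, v0..v31")
    if base % lmul != 0:
        raise ValueError("the base register must be a multiple of LMUL")
    return [f"v{r}" for r in range(base, base + lmul)]


# Naming v8 with LMUL=4 also touches v9, v10 and v11,
# which is why an allocation comment helps the reader.
print(register_group(8, 4))  # ['v8', 'v9', 'v10', 'v11']
```
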
@cyyself cyyself requested a review from JerryShih April 22, 2024 06:39
Contributor

@JerryShih JerryShih left a comment

For me, I would like to have the regular indent in perl generator instead of the final asm file. It's easier for me to maintain the generator.
I'm not sure which one is the good practice in openssl.

@cyyself
Contributor Author

cyyself commented Apr 23, 2024

For me, I would like to have the regular indent in perl generator instead of the final asm file. It's easier for me to maintain the generator. I'm not sure which one is the good practice in openssl.

I also have this concern. That’s why I separated it into a different commit. I will wait for opinions from OpenSSL maintainers.

@JerryShih
Contributor

For me, I would like to have the regular indent in perl generator instead of the final asm file. It's easier for me to maintain the generator. I'm not sure which one is the good practice in openssl.

I also have this concern. That’s why I separated it into a different commit. I will wait for opinions from OpenSSL maintainers.

Commit f1d7f4d does two things:

  1. adds a branch for the vror code block
  2. updates the indentation

So it is not an indentation-only change.

@t8m
Member

t8m commented Apr 23, 2024

For me, I would like to have the regular indent in perl generator instead of the final asm file. It's easier for me to maintain the generator. I'm not sure which one is the good practice in openssl.

Yes, the indentation should be primarily correct in the .pl file as that is the source file. It is nice if the indentation is correct in the generated file, but it is definitely not a requirement.

@cyyself
Contributor Author

cyyself commented Apr 23, 2024

For me, I would like to have the regular indent in perl generator instead of the final asm file. It's easier for me to maintain the generator. I'm not sure which one is the good practice in openssl.

Yes, the indentation should be primarily correct in the .pl file as that is the source file. It is nice if the indentation is correct in the generated file, but it is definitely not a requirement.

Thanks. I edited the last commit on indentation. Now, it only reuses the same part of the code.

This patch merged the `add` and `xor` part of chacha_sub_round, which are
same in RISC-V Vector only and Zvkb implementation. There is no change to
the generated ASM code except for the indent.

Signed-off-by: Yangyu Chen <cyy@cyyself.name>
@t8m
Member

t8m commented Apr 24, 2024

@paulidale please reconfirm

@cyyself cyyself requested a review from paulidale April 28, 2024 06:50
@cyyself
Contributor Author

cyyself commented May 6, 2024

It hasn't been reviewed for two weeks, but there is some good news to share. I got another new RISC-V board, the BananaPi F3, which has a Spacemit K1 SoC with a 256-bit VLEN, twice the vector length of the K230 I used for the earlier benchmark. The performance is shown below:

[cyy@k1 ssl]$ ./bin/openssl speed -evp chacha20
Doing ChaCha20 ops for 3s on 16 size blocks: 11580221 ChaCha20 ops in 3.00s
Doing ChaCha20 ops for 3s on 64 size blocks: 4837731 ChaCha20 ops in 3.00s
Doing ChaCha20 ops for 3s on 256 size blocks: 1433019 ChaCha20 ops in 3.01s
Doing ChaCha20 ops for 3s on 1024 size blocks: 375369 ChaCha20 ops in 3.00s
Doing ChaCha20 ops for 3s on 8192 size blocks: 47621 ChaCha20 ops in 3.00s
Doing ChaCha20 ops for 3s on 16384 size blocks: 23843 ChaCha20 ops in 3.01s
version: 3.4.0-dev
built on: Tue Apr 23 06:23:46 2024 UTC
options: bn(64,64)
compiler: gcc -fPIC -pthread -Wa,--noexecstack -Wall -O3 -march=rv64gc_zba_zbb -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_BUILDING_OPENSSL -DNDEBUG
CPUINFO: N/A
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
ChaCha20         61761.18k   103204.93k   121878.03k   128125.95k   130037.08k   129781.96k
[cyy@k1 ssl]$ export OPENSSL_riscvcap=rv64gc_v_zba_zbb
[cyy@k1 ssl]$ ./bin/openssl speed -evp chacha20
Doing ChaCha20 ops for 3s on 16 size blocks: 11308125 ChaCha20 ops in 3.00s
Doing ChaCha20 ops for 3s on 64 size blocks: 4804401 ChaCha20 ops in 3.01s
Doing ChaCha20 ops for 3s on 256 size blocks: 1409444 ChaCha20 ops in 3.00s
Doing ChaCha20 ops for 3s on 1024 size blocks: 725378 ChaCha20 ops in 3.00s
Doing ChaCha20 ops for 3s on 8192 size blocks: 99129 ChaCha20 ops in 3.00s
Doing ChaCha20 ops for 3s on 16384 size blocks: 51344 ChaCha20 ops in 3.00s
version: 3.4.0-dev
built on: Tue Apr 23 06:23:46 2024 UTC
options: bn(64,64)
compiler: gcc -fPIC -pthread -Wa,--noexecstack -Wall -O3 -march=rv64gc_zba_zbb -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_BUILDING_OPENSSL -DNDEBUG
CPUINFO: N/A
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
ChaCha20         60310.00k   102153.38k   120272.55k   247595.69k   270688.26k   280406.70k
[cyy@k1 ssl]$ neofetch
                   -`                    cyy@k1
                  .o+`                   ------
                 `ooo/                   OS: Arch Linux riscv64
                `+oooo:                  Host: spacemit k1-x deb1 board
               `+oooooo:                 Kernel: 6.1.15-legacy-k1
               -+oooooo+:                Uptime: 6 mins
             `/:-:++oooo+:               Packages: 139 (pacman)
            `/++++/+++++++:              Shell: bash 5.2.26
           `/++++++++++++++:             Terminal: /dev/pts/0
          `/+++ooooooooooooo/`           CPU: Spacemit X60 (8) @ 1.600GHz
         ./ooosssso++osssssso+`          Memory: 111MiB / 3809MiB
        .oossssso-````/ossssss+`
       -osssssso.      :ssssssso.
      :osssssss/        osssso+++.
     /ossssssss/        +ssssooo/-
   `/ossssso+/:-        -:/+osssso+-
  `+sso+:-`                 `.-/+oso:
 `++:.                           `-/+/
 .`                                 `/

[cyy@k1 ssl]$ cat /proc/cpuinfo | head -n 9
processor	: 0
hart		: 0
model name	: Spacemit(R) X60
isa		: rv64imafdcv_sscofpmf_sstc_svpbmt_zicbom_zicboz_zicbop_zihintpause
mmu		: sv39
mvendorid	: 0x710
marchid		: 0x8000000058000001
mimpid		: 0x1000000049772200

I changed the baseline to rv64gcv_zba_zbb compiled by GCC. The result shows that it is more than twice as fast as the pure C implementation optimized by the compiler. On the K230, it was only 26.3% faster.
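
The per-size ratios behind this comparison can be computed from the two K1 runs above. A short Python sketch (variable names are hypothetical; the figures are copied from the two `ChaCha20` result rows, in 1000s of bytes per second):

```python
sizes = [16, 64, 256, 1024, 8192, 16384]

# K1 without OPENSSL_riscvcap set (C code path)
k1_c = [61761.18, 103204.93, 121878.03, 128125.95, 130037.08, 129781.96]
# K1 with OPENSSL_riscvcap=rv64gc_v_zba_zbb (vector code path)
k1_rvv = [60310.00, 102153.38, 120272.55, 247595.69, 270688.26, 280406.70]

ratios = {size: rvv / c for size, c, rvv in zip(sizes, k1_c, k1_rvv)}

for size in sizes:
    print(f"{size:>6} bytes: {ratios[size]:.2f}x")
```

On these figures the ratio stays close to 1 up to 256-byte blocks and only rises above 1.9 from 1024-byte blocks onward, so the gain applies to larger inputs.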

Contributor

@paulidale paulidale left a comment

Apologies, missed this one.

@paulidale paulidale added approval: done This pull request has the required number of approvals and removed approval: review pending This pull request needs review by a committer labels May 6, 2024
@openssl-machine openssl-machine removed the approval: done This pull request has the required number of approvals label May 7, 2024
@openssl-machine
Collaborator

This pull request is ready to merge

@openssl-machine openssl-machine added the approval: ready to merge The 24 hour grace period has passed, ready to merge label May 7, 2024
@t8m
Member

t8m commented May 8, 2024

Merged to the master branch. Thank you for your contribution.

@t8m t8m closed this May 8, 2024
openssl-machine pushed a commit that referenced this pull request May 8, 2024
Although we have a Zvkb version of Chacha20, the Zvkb from the RISC-V
Vector Cryptography Bit-manipulation extension was ratified in late 2023
and does not come to the RVA23 Profile. Many CPUs in 2024 currently do not
support Zvkb but may have Vector and Bit-manipulation, which are already in
the RVA22 Profile. This commit provides a vector-only implementation that
replaced the vror with vsll+vsrl+vor and can provide enough speed for
Chacha20 for new CPUs this year.

Signed-off-by: Yangyu Chen <cyy@cyyself.name>

Reviewed-by: Paul Dale <ppzgs1@gmail.com>
Reviewed-by: Tomas Mraz <tomas@openssl.org>
(Merged from #24069)
openssl-machine pushed a commit that referenced this pull request May 8, 2024
Since we can do group operations on vector registers in RISC-V, some vector
registers will be used without being explicitly referenced. Thus, comments
on vector register allocation should be added to improve the code
readability and maintainability.

Signed-off-by: Yangyu Chen <cyy@cyyself.name>

Reviewed-by: Paul Dale <ppzgs1@gmail.com>
Reviewed-by: Tomas Mraz <tomas@openssl.org>
(Merged from #24069)
openssl-machine pushed a commit that referenced this pull request May 8, 2024
This patch merged the `add` and `xor` part of chacha_sub_round, which are
same in RISC-V Vector only and Zvkb implementation. There is no change to
the generated ASM code except for the indent.

Signed-off-by: Yangyu Chen <cyy@cyyself.name>

Reviewed-by: Paul Dale <ppzgs1@gmail.com>
Reviewed-by: Tomas Mraz <tomas@openssl.org>
(Merged from #24069)
Labels
approval: ready to merge, branch: master, tests: exempted, triaged: feature, triaged: performance
8 participants