LoongArch64 assembly pack: add ChaCha20 modules #21998

zhoumin2 · 2023-09-07T10:37:51Z

This assembly implementation of ChaCha20 includes three code paths: a scalar path, a 128-bit LSX path, and a 256-bit LASX path. The LASX or LSX path is preferred when the hardware and the system support these extensions.

The LSX and LASX extensions each provide 32 vector registers, so the implementation can keep both the 16 initial states and the 16 intermediate states of ChaCha in vector registers during the computation. Test results on the 3A5000 and 3A6000 show that this assembly implementation significantly improves ChaCha20 performance on LoongArch-based machines. The detailed test results follow.

Test with:
$ openssl speed -evp chacha20

3A5000
type               16 bytes     64 bytes    256 bytes    1024 bytes    8192 bytes   16384 bytes
C code           178484.53k   282789.93k   311793.70k    322234.99k    324405.93k    324659.88k
assembly code    223152.28k   407863.65k   989520.55k   2049192.96k   2127248.70k   2131749.55k
                   +25%         +44%         +217%        +536%         +556%         +557%

3A6000
type               16 bytes     64 bytes     256 bytes    1024 bytes    8192 bytes   16384 bytes
C code           214945.33k   310041.75k    340724.22k    349949.27k    352925.01k    353140.74k
assembly code    299151.34k   492766.34k   2070166.02k   4300909.91k   4473978.88k   4499084.63k
                   +39%         +59%         +508%         +1129%        +1168%        +1174%
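
As a quick sanity check on the percentage rows, here is a minimal Python sketch (not part of the patch) that recomputes the relative gain from the throughput figures above. The names `RESULTS`, `improvement_percent` and `improvements` are made up for illustration.

```python
# Throughput in thousands of bytes per second, copied from the tables above.
RESULTS = {
    "3A5000": {
        "c":   [178484.53, 282789.93, 311793.70, 322234.99, 324405.93, 324659.88],
        "asm": [223152.28, 407863.65, 989520.55, 2049192.96, 2127248.70, 2131749.55],
    },
    "3A6000": {
        "c":   [214945.33, 310041.75, 340724.22, 349949.27, 352925.01, 353140.74],
        "asm": [299151.34, 492766.34, 2070166.02, 4300909.91, 4473978.88, 4499084.63],
    },
}


def improvement_percent(c_rate, asm_rate):
    """Relative gain of the assembly path over the C path, in percent."""
    return (asm_rate / c_rate - 1.0) * 100.0


def improvements(machine):
    """Rounded gain for each block size (16 bytes ... 16384 bytes)."""
    rates = RESULTS[machine]
    return [round(improvement_percent(c, a))
            for c, a in zip(rates["c"], rates["asm"])]


if __name__ == "__main__":
    for machine in RESULTS:
        print(machine, improvements(machine))
```

The gain is computed as (assembly / C - 1) x 100, so +557% on the 3A5000 corresponds to roughly a 6.6x throughput ratio.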

Checklist

documentation is added or updated
tests are added or updated

zhoumin2 · 2023-09-07T11:28:24Z

@t8m, hi, could you please help me trigger a rebuild of buildbot/master:unix-fedora38-x86_64? That build failed because of a network issue.

t8m

It would be nice if we could get an independent review of this by someone knowing the LoongArch64 assembly.

crypto/chacha/asm/chacha-loongarch64.pl

zhoumin2 · 2023-09-08T02:30:25Z

> It would be nice if we could get an independent review of this by someone knowing the LoongArch64 assembly.

Yes. Although the ChaCha20 algorithm is not difficult to understand, there are indeed not many people who are familiar with LoongArch assembly.

In ChaCha20 there is almost no data dependence between data blocks, so multiple consecutive blocks can be processed in parallel. This makes the vectorization of ChaCha20 intuitive.
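
To illustrate the point about independent blocks, here is a minimal scalar sketch in Python (not the code in this PR). It assumes the RFC 8439 state layout (32-bit block counter, 96-bit nonce); the helper names are hypothetical. Each keystream block depends only on the key, the nonce and its own counter value, so the blocks can be produced in any order, or side by side in vector lanes.

```python
import struct

MASK32 = 0xFFFFFFFF


def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) & MASK32) | (x >> (32 - n))


def quarter_round(s, a, b, c, d):
    """ChaCha quarter round on state indices a, b, c, d (in place)."""
    s[a] = (s[a] + s[b]) & MASK32; s[d] = rotl32(s[d] ^ s[a], 16)
    s[c] = (s[c] + s[d]) & MASK32; s[b] = rotl32(s[b] ^ s[c], 12)
    s[a] = (s[a] + s[b]) & MASK32; s[d] = rotl32(s[d] ^ s[a], 8)
    s[c] = (s[c] + s[d]) & MASK32; s[b] = rotl32(s[b] ^ s[c], 7)


def chacha20_block(key, counter, nonce):
    """One 64-byte keystream block for a 32-byte key and 12-byte nonce."""
    init = list(struct.unpack("<4I", b"expand 32-byte k"))
    init += struct.unpack("<8I", key)
    init.append(counter & MASK32)
    init += struct.unpack("<3I", nonce)
    s = init[:]
    for _ in range(10):  # 10 double rounds = 20 rounds
        quarter_round(s, 0, 4, 8, 12)   # column rounds
        quarter_round(s, 1, 5, 9, 13)
        quarter_round(s, 2, 6, 10, 14)
        quarter_round(s, 3, 7, 11, 15)
        quarter_round(s, 0, 5, 10, 15)  # diagonal rounds
        quarter_round(s, 1, 6, 11, 12)
        quarter_round(s, 2, 7, 8, 13)
        quarter_round(s, 3, 4, 9, 14)
    return struct.pack("<16I", *[(x + y) & MASK32 for x, y in zip(s, init)])


def keystream(key, nonce, first_counter, nblocks, order=None):
    """Compute nblocks blocks in the given order, then concatenate in sequence."""
    order = range(nblocks) if order is None else order
    blocks = {i: chacha20_block(key, first_counter + i, nonce) for i in order}
    return b"".join(blocks[i] for i in range(nblocks))
```

Calling `keystream(key, nonce, 1, 4)` and `keystream(key, nonce, 1, 4, order=[3, 1, 0, 2])` gives the same bytes, which is the property a 4x or 8x vector path relies on.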

Basically, the LASX code path is similar to the AVX2 code path on x86_64: both are ChaCha20_8x implementations. Likewise, the LSX code path is similar to the SSSE3 code path on x86_64: both are ChaCha20_4x implementations.

There are still some differences in the details, though. LASX and LSX both have 32 vector registers, which gives a chance to make the implementation a little simpler than the x86_64 one.
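
A rough model of how the 32 registers can be used, written in Python for illustration only. It assumes one register per state word with one block per 32-bit lane (4 lanes for 128-bit LSX, 8 for 256-bit LASX) and the RFC 8439 state layout; the actual register assignment in chacha-loongarch64.pl is not shown here and may differ. The function names are hypothetical.

```python
import struct


def splat_state(key, counter, nonce, lanes):
    """Return 16 'registers', each a list of `lanes` 32-bit words.

    Register i holds state word i for every block in the batch. Only the
    counter row (word 12) differs between lanes.
    """
    words = list(struct.unpack("<4I", b"expand 32-byte k"))
    words += struct.unpack("<8I", key)
    words.append(counter & 0xFFFFFFFF)
    words += struct.unpack("<3I", nonce)
    regs = [[w] * lanes for w in words]
    regs[12] = [(counter + lane) & 0xFFFFFFFF for lane in range(lanes)]
    return regs


def make_register_file(key, counter, nonce, vector_bits):
    """Model 16 initial-state registers plus 16 working registers."""
    lanes = vector_bits // 32            # 4 for LSX, 8 for LASX
    initial = splat_state(key, counter, nonce, lanes)
    working = [reg[:] for reg in initial]  # the rounds update these in place
    return initial, working
```

With 16 + 16 = 32 registers, the initial state can stay resident for the final addition instead of being reloaded, which is one way the larger register file can simplify the code.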

The second major difference is how the 16 vector registers that hold the batched final states are transposed. LASX and LSX have some useful instructions ((x)vilvl.w, (x)vilvh.w, (x)vilvl.d, (x)vilvh.d, xvpermi.q) that make the transpose easy to implement.
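
The transpose idea can be sketched with a small Python model (illustrative only, not the patch code). The helpers `interleave_lo` and `interleave_hi` are hypothetical stand-ins for word and doubleword interleave instructions; the operand order of the real instructions is not modeled, and only the 128-bit, 4-lane case is shown.

```python
def interleave_lo(x, y, width=1):
    """Interleave the low halves of x and y in groups of `width` words."""
    half = len(x) // 2
    out = []
    for i in range(0, half, width):
        out += x[i:i + width] + y[i:i + width]
    return out


def interleave_hi(x, y, width=1):
    """Interleave the high halves of x and y in groups of `width` words."""
    half = len(x) // 2
    out = []
    for i in range(half, len(x), width):
        out += x[i:i + width] + y[i:i + width]
    return out


def transpose4x4(a, b, c, d):
    """Transpose four 4-word vectors with two rounds of interleaves."""
    t0 = interleave_lo(a, b)  # a0 b0 a1 b1
    t1 = interleave_hi(a, b)  # a2 b2 a3 b3
    t2 = interleave_lo(c, d)  # c0 d0 c1 d1
    t3 = interleave_hi(c, d)  # c2 d2 c3 d3
    return [
        interleave_lo(t0, t2, width=2),  # a0 b0 c0 d0
        interleave_hi(t0, t2, width=2),  # a1 b1 c1 d1
        interleave_lo(t1, t3, width=2),  # a2 b2 c2 d2
        interleave_hi(t1, t3, width=2),  # a3 b3 c3 d3
    ]
```

Before the transpose each vector holds one state word for four blocks; afterwards each vector holds four consecutive words of a single block, ready to be XORed with the input. The 256-bit case presumably needs one more step to recombine 128-bit halves (the role xvpermi.q would play), which this sketch leaves out.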

t8m

Assuming @paulidale's approval still holds, as the change was trivial.

paulidale · 2023-09-08T08:18:10Z

Yes, it holds.

openssl-machine · 2023-09-09T09:00:18Z

24 hours has passed since 'approval: done' was set, but as this PR has been updated in that time the label 'approval: ready to merge' is not being automatically set. Please review the updates and set the label manually.

paulidale · 2023-09-10T22:49:29Z

Merged, thanks for the contribution.

Signed-off-by: Min Zhou <zhoumin@loongson.cn>
Reviewed-by: Tomas Mraz <tomas@openssl.org>
Reviewed-by: Paul Dale <pauli@openssl.org>
(Merged from #21998)

t8m reviewed Sep 7, 2023

crypto/chacha/asm/chacha-loongarch64.pl (outdated)

paulidale approved these changes Sep 8, 2023

paulidale removed the "approval: otc review pending" label Sep 8, 2023

zhoumin2 force-pushed the chacha20_asm branch from 0f32b96 to c14f33e on September 8, 2023 02:45

t8m approved these changes Sep 8, 2023

t8m added the "approval: done" label and removed the "approval: review pending" label Sep 8, 2023

paulidale closed this Sep 10, 2023
