Benchmark on t-head C908 #5

Open
camel-cdr opened this issue Dec 28, 2023 · 10 comments

@camel-cdr commented Dec 28, 2023

Hi, I'm not sure if you have real hardware at hand, so I wanted to share some benchmark runs on my Kendryte K230 with an RVV 1.0-capable C908 core.

Note that, at best, I'd expect the same performance as in the scalar case, because this processor takes roughly two cycles to process one 128-bit vector lane, while a 64-bit GPR operation takes about one cycle (see instruction timings). In other words, a vector instruction retires two 64-bit elements every two cycles, which is exactly the scalar throughput of one 64-bit operation per cycle.
This is fine when working with bytes, but it's hard to get a speedup from 64-bit integers.

For the benchmark, I measured the cycle count of Keccak(576, 1024, in, 1024*1024*4, 0x06, out, 512) on random data. (I removed the instret measurements from the original code.)
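
For reference, the harness boils down to something like this (a minimal sketch: the Keccak prototype is assumed from the XKCP-style readable/compact interface, and reading the cycle CSR this way assumes user-mode rdcycle access is enabled):

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Prototype as implied by the call above; the exact parameter types
 * are an assumption based on the XKCP-style Keccak interface. */
void Keccak(unsigned int rate, unsigned int capacity,
            const unsigned char *input, unsigned long long inputByteLen,
            unsigned char delimitedSuffix,
            unsigned char *output, unsigned long long outputByteLen);

/* Read the cycle CSR. */
static inline uint64_t rdcycle(void) {
    uint64_t c;
    __asm__ volatile("rdcycle %0" : "=r"(c));
    return c;
}

int main(void) {
    enum { IN_LEN = 1024 * 1024 * 4, OUT_LEN = 512 };
    static unsigned char in[IN_LEN], out[OUT_LEN];
    for (size_t i = 0; i < IN_LEN; i++)  /* random input data */
        in[i] = (unsigned char)rand();

    uint64_t start = rdcycle();
    Keccak(576, 1024, in, IN_LEN, 0x06, out, OUT_LEN);
    uint64_t total = rdcycle() - start;

    printf("total: %llu byte/cycle: %.7f\n",
           (unsigned long long)total, (double)IN_LEN / (double)total);
    return 0;
}
```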

C908:

$ ./scalar.out
total: 4193374331 byte/cycle: 0.0010002
$ ./scalar-opt.out
total: 273281329 byte/cycle: 0.0153479
$ ./scalar-opt-in-regs.out
total: 208159562 byte/cycle: 0.0201494
$ ./vector.out
total: 1063668810 byte/cycle: 0.0039432
$ ./vector-in-reg.out
total: 6339152135 byte/cycle: 0.0006616

./vector-in-reg performs very badly because the C908 has a very slow vrgather at higher LMULs; in this case an LMUL=8, SEW=64 vrgather takes about 261.1 cycles.
From what I can tell, there is no processor where high-LMUL gathers are particularly fast, although you should be able to expect one cycle per element processed (similar to the X280).
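
To illustrate what dominates here, this is roughly the shape of an LMUL=8, SEW=64 gather with the RVV intrinsics. The index table and function are made up for illustration, and note that at VLEN=128 an LMUL=8 group holds at most 16 e64 elements, so the full 25-lane state doesn't even fit in one group:

```c
#include <riscv_vector.h>
#include <stdint.h>

/* Illustrative lane-permutation table (values made up). */
static const uint16_t perm_idx[16] = {
    0, 6, 12, 3, 9, 10, 1, 7, 13, 4, 5, 11, 2, 8, 14, 15
};

/* Permute up to 16 of the 64-bit state lanes with a single LMUL=8
 * vrgatherei16 -- the instruction the C908 runs in ~261 cycles. */
static vuint64m8_t permute_lanes(vuint64m8_t state) {
    size_t vl = __riscv_vsetvl_e64m8(16);
    vuint16m2_t idx = __riscv_vle16_v_u16m2(perm_idx, vl);
    return __riscv_vrgatherei16_vv_u64m8(state, idx, vl);
}
```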

./vector.out is about five times slower than scalar-opt-in-regs.out. I think this is mostly due to needing to use LMUL=4 vector register groups to hold only five elements each, effectively almost doubling the execution time. Furthermore, the C908 doesn't support zvbb, only the scalar bitmanip extensions, so the rotations have to be emulated. vluxei might also be slow, but I haven't benchmarked that yet.
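
Concretely, without zvbb's vror, every 64-bit rotation costs three vector instructions instead of one. A minimal sketch with the RVV intrinsics (LMUL=4 only because that's what the current code uses):

```c
#include <riscv_vector.h>

/* Rotate each 64-bit element left by a constant n: with zvbb this
 * would be a single vror.vi; without it, it's shift/shift/or. */
static inline vuint64m4_t rotl64(vuint64m4_t x, unsigned n, size_t vl) {
    return __riscv_vor_vv_u64m4(
        __riscv_vsll_vx_u64m4(x, n, vl),
        __riscv_vsrl_vx_u64m4(x, (64 - n) & 63, vl), vl);
}
```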

I think you'd get the fastest result from porting https://github.com/XKCP/XKCP/blob/master/lib/low/KeccakP-1600-times2/SIMD128/KeccakP-1600-times2-SIMD128.c.
It looks like you could just swap out the macros for equivalent RVV ones, but I'm not sure what exactly corresponds to your code here.
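
If the macros in that file follow XKCP's usual naming (XOReq128, ANDnu128, ROL64in128, ...), the mapping might look roughly like this; the names and the two-states-per-register layout here are assumptions, not verified against the file:

```c
#include <riscv_vector.h>

/* One 128-bit SSE value holds one 64-bit lane from each of the two
 * interleaved Keccak states; at e64/LMUL=1 it fits in a vuint64m1_t. */
typedef vuint64m1_t V128;
#define VL 2  /* fixed vl: VLEN >= 128 gives 2 e64 elements at LMUL=1 */

/* XOReq128(a, b): a ^= b, replacing _mm_xor_si128 */
#define XOReq128(a, b) ((a) = __riscv_vxor_vv_u64m1((a), (b), VL))

/* ANDnu128(a, b): (~a) & b, replacing _mm_andnot_si128 (the chi step) */
#define ANDnu128(a, b) \
    __riscv_vand_vv_u64m1(__riscv_vnot_v_u64m1((a), VL), (b), VL)

/* ROL64in128 would map to the shift/shift/or rotate shown above. */
```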

Obviously, a fully scalable implementation would be best. It might be possible to modify the above into something like that.

If I've got some more time, I might also be able to run the code on a much faster C920, but that would require manually backporting the assembly to RVV 0.7.1, so I'd rather do that after a bit more development.

@nibrunie (Collaborator)

Hi @camel-cdr, thank you for running the code (I do not have RVV 1.0-capable hardware to run it on) and for your suggestions.

To be honest, I was not expecting the current vector code to be faster than the optimized scalar code, so the results you report match my expectations.

I am writing a post for my blog on these experiments. The idea was to come up with a baseline vector implementation I could play with. The use of either indexed loads (vluxei) or vrgather(ei16) made me very pessimistic about the performance that could be reached, and, as you said, using an LMUL=4 vector register group to store 5 elements is far from efficient. There are many more ideas I would like to try (splitting the computation between scalar and vector, using LMUL=8 to pack 3 planes [15 elements], ...) but I am not sure any will pan out.

One suggestion I got from @mjosaarinen was to parallelize two Keccak runs and exploit VLEN=128 to speed this up with an approach very similar to the unrolled scalar code. I need to check whether this corresponds to the KeccakP-1600-times2 code you linked above.

@nibrunie (Collaborator)

Looking at the documentation https://github.com/XKCP/XKCP/blob/master/doc/PlSnP-documentation.h, I think your suggestion exactly matches Markku's: implement parallel processing of multiple Keccak states using vectors.

@camel-cdr (Author)

That's great, good luck with the project. You can message me if you want some more benchmarks for the blog post.

@nibrunie (Collaborator)

@camel-cdr, if you have some cycles available, could you benchmark the vector multi-round implementation https://github.com/nibrunieAtSi5/rvv-keccak/blob/main/src/keccak-vector-multi-round.c (the default configuration, still relying on memory for the rho and pi steps)?

@camel-cdr (Author)

I'm getting the following:

$ ./vector-multi-round.out
total: 1671419730 byte/cycle: 0.0025094

@nibrunie (Collaborator)

> I'm getting the following:
>
> $ ./vector-multi-round.out
> total: 1671419730 byte/cycle: 0.0025094

Thank you

@nibrunie (Collaborator)

Part 1 of my blog post series on SHA-3 is out: https://open.substack.com/pub/fprox/p/rvv-implementation-of-keccak-sha?r=lsukk&utm_campaign=post&utm_medium=web

Thank you again for the hardware measurements @camel-cdr

@camel-cdr (Author)

Oh, I didn't know fprox is your blog. Nice, I've been following it for a bit.

@nibrunieAtSi5 (Owner)

> Oh, I didn't know fprox is your blog. Nice, I've been following it for a bit.

My secret identity has been revealed :-)
Thank you for following.

@camel-cdr (Author)

:-)

It might be worth citing something like "Software implementation of SHA-3 family using AVX2" to explain that not gaining a speedup over scalar in the non-parallel case is expected and not a shortcoming of the ISA.
