Benchmark on t-head C908 #5

Open
camel-cdr opened this issue Dec 28, 2023 · 10 comments

@camel-cdr commented Dec 28, 2023

Hi, I'm not sure if you have real hardware at hand, so I wanted to share some benchmark runs on my Kendryte K230 with an RVV 1.0-capable C908 core.

Note that, at best, I'd expect the same performance as in the scalar case, because this processor takes roughly two cycles to process one 128-bit vector lane, while a 64-bit GPR operation takes about one cycle (see instruction timings). In other words, a vector instruction retires two 64-bit elements every two cycles, which is exactly the scalar throughput of one 64-bit operation per cycle.
This is fine when working with bytes, but it's hard to get a speedup from 64-bit integers.

For the benchmark, I measured the cycle count of Keccak(576, 1024, in, 1024*1024*4, 0x06, out, 512) on random data. (I removed the instret measurements from the original code.)
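
For reference, the harness boils down to something like this (a minimal sketch: the Keccak prototype is assumed from the XKCP-style readable/compact interface, and reading the cycle CSR this way assumes user-mode rdcycle access is enabled):

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Prototype as implied by the call above; the exact parameter types
 * are an assumption based on the XKCP-style Keccak interface. */
void Keccak(unsigned int rate, unsigned int capacity,
            const unsigned char *input, unsigned long long inputByteLen,
            unsigned char delimitedSuffix,
            unsigned char *output, unsigned long long outputByteLen);

/* Read the cycle CSR. */
static inline uint64_t rdcycle(void) {
    uint64_t c;
    __asm__ volatile("rdcycle %0" : "=r"(c));
    return c;
}

int main(void) {
    enum { IN_LEN = 1024 * 1024 * 4, OUT_LEN = 512 };
    static unsigned char in[IN_LEN], out[OUT_LEN];
    for (size_t i = 0; i < IN_LEN; i++)  /* random input data */
        in[i] = (unsigned char)rand();

    uint64_t start = rdcycle();
    Keccak(576, 1024, in, IN_LEN, 0x06, out, OUT_LEN);
    uint64_t total = rdcycle() - start;

    printf("total: %llu byte/cycle: %.7f\n",
           (unsigned long long)total, (double)IN_LEN / (double)total);
    return 0;
}
```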

C908:

$ ./scalar.out
total: 4193374331 byte/cycle: 0.0010002
$ ./scalar-opt.out
total: 273281329 byte/cycle: 0.0153479
$ ./scalar-opt-in-regs.out
total: 208159562 byte/cycle: 0.0201494
$ ./vector.out
total: 1063668810 byte/cycle: 0.0039432
$ ./vector-in-reg.out
total: 6339152135 byte/cycle: 0.0006616

./vector-in-reg performs very badly because the C908 has a very slow vrgather at higher LMULs; in this case an LMUL=8, SEW=64 vrgather takes about 261.1 cycles.
From what I can tell, there is no processor where high-LMUL gathers are particularly fast, although you should be able to expect one cycle per element processed (similar to the X280).
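
To illustrate what dominates here, this is roughly the shape of an LMUL=8, SEW=64 gather with the RVV intrinsics. The index table and function are made up for illustration, and note that at VLEN=128 an LMUL=8 group holds at most 16 e64 elements, so the full 25-lane state doesn't even fit in one group:

```c
#include <riscv_vector.h>
#include <stdint.h>

/* Illustrative lane-permutation table (values made up). */
static const uint16_t perm_idx[16] = {
    0, 6, 12, 3, 9, 10, 1, 7, 13, 4, 5, 11, 2, 8, 14, 15
};

/* Permute up to 16 of the 64-bit state lanes with a single LMUL=8
 * vrgatherei16 -- the instruction the C908 runs in ~261 cycles. */
static vuint64m8_t permute_lanes(vuint64m8_t state) {
    size_t vl = __riscv_vsetvl_e64m8(16);
    vuint16m2_t idx = __riscv_vle16_v_u16m2(perm_idx, vl);
    return __riscv_vrgatherei16_vv_u64m8(state, idx, vl);
}
```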

./vector.out is about five times slower than scalar-opt-in-regs.out. I think this is mostly due to needing to use LMUL=4 vector register groups to hold only five elements each, effectively almost doubling the execution time. Furthermore, the C908 doesn't support zvbb, only the scalar bitmanip extensions, so the rotations have to be emulated. vluxei might also be slow, but I haven't benchmarked that yet.
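
Concretely, without zvbb's vror, every 64-bit rotation costs three vector instructions instead of one. A minimal sketch with the RVV intrinsics (LMUL=4 only because that's what the current code uses):

```c
#include <riscv_vector.h>

/* Rotate each 64-bit element left by a constant n: with zvbb this
 * would be a single vror.vi; without it, it's shift/shift/or. */
static inline vuint64m4_t rotl64(vuint64m4_t x, unsigned n, size_t vl) {
    return __riscv_vor_vv_u64m4(
        __riscv_vsll_vx_u64m4(x, n, vl),
        __riscv_vsrl_vx_u64m4(x, (64 - n) & 63, vl), vl);
}
```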

I think you'd get the fastest result from porting https://github.com/XKCP/XKCP/blob/master/lib/low/KeccakP-1600-times2/SIMD128/KeccakP-1600-times2-SIMD128.c.
It looks like you could just swap out the macros for equivalent RVV ones, but I'm not sure what exactly corresponds to your code here.
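
If the macros in that file follow XKCP's usual naming (XOReq128, ANDnu128, ROL64in128, ...), the mapping might look roughly like this; the names and the two-states-per-register layout here are assumptions, not verified against the file:

```c
#include <riscv_vector.h>

/* One 128-bit SSE value holds one 64-bit lane from each of the two
 * interleaved Keccak states; at e64/LMUL=1 it fits in a vuint64m1_t. */
typedef vuint64m1_t V128;
#define VL 2  /* fixed vl: VLEN >= 128 gives 2 e64 elements at LMUL=1 */

/* XOReq128(a, b): a ^= b, replacing _mm_xor_si128 */
#define XOReq128(a, b) ((a) = __riscv_vxor_vv_u64m1((a), (b), VL))

/* ANDnu128(a, b): (~a) & b, replacing _mm_andnot_si128 (the chi step) */
#define ANDnu128(a, b) \
    __riscv_vand_vv_u64m1(__riscv_vnot_v_u64m1((a), VL), (b), VL)

/* ROL64in128 would map to the shift/shift/or rotate shown above. */
```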

Obviously, a fully scalable implementation would be best. It might be possible to modify the above into something like that.

If I've got some more time, I might also be able to run the code on a much faster C920, but that would require manually backporting the assembly to RVV 0.7.1, so I'd rather do that after a bit more development.

@nibrunie (Collaborator)

Hi @camel-cdr, thank you for running the code (I do not have RVV 1.0-capable hardware to run it on) and for your suggestions.

To be honest, I was not expecting the current vector code to be faster than the optimized scalar code, so the results you report match my expectations.

I am writing a post for my blog on these experiments. The idea was to come up with a baseline vector implementation I could play with. The use of either indexed loads (vluxei) or vrgather(ei16) made me very pessimistic about the performance that could be reached, and, as you said, using an LMUL=4 vector register group to store 5 elements is far from efficient. There are many more ideas I would like to try (splitting the computation between scalar and vector, using LMUL=8 to pack 3 planes [15 elements], ...) but I am not sure any will pan out.

One suggestion I got from @mjosaarinen was to parallelize two Keccak runs and exploit VLEN=128 to speed this up with an approach very similar to the unrolled scalar code. I need to check whether this corresponds to the KeccakP-1600-times2 code you linked above.

@nibrunie (Collaborator)

Looking at the documentation https://github.com/XKCP/XKCP/blob/master/doc/PlSnP-documentation.h, I think your suggestion exactly matches Markku's: implement parallel processing of multiple Keccak states using vectors.

@camel-cdr (Author)

That's great, good luck with the project. You can message me if you want some more benchmarks for the blog post.

@nibrunie (Collaborator)

@camel-cdr, if you have some cycles available, could you benchmark the vector multi-round implementation https://github.com/nibrunieAtSi5/rvv-keccak/blob/main/src/keccak-vector-multi-round.c (the default configuration, still relying on memory for the rho and pi steps)?

@camel-cdr (Author)

I'm getting the following:

$ ./vector-multi-round.out
total: 1671419730 byte/cycle: 0.0025094

@nibrunie (Collaborator)

> I'm getting the following:
>
> $ ./vector-multi-round.out
> total: 1671419730 byte/cycle: 0.0025094

Thank you

@nibrunie (Collaborator)

Part 1 of my blog post series on SHA-3 is out: https://open.substack.com/pub/fprox/p/rvv-implementation-of-keccak-sha?r=lsukk&utm_campaign=post&utm_medium=web

Thank you again for the hardware measurements @camel-cdr

@camel-cdr (Author)

Oh, I didn't know fprox is your blog. Nice, I've been following it for a bit.

@nibrunieAtSi5 (Owner)

> Oh, I didn't know fprox is your blog. Nice, I've been following it for a bit.

My secret identity has been revealed :-)
Thank you for following.

@camel-cdr (Author)

:-)

It might be worth citing something like "Software implementation of SHA-3 family using AVX2" to explain that not gaining a speedup over scalar in the non-parallel case is expected and not a shortcoming of the ISA.
