-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Benchmark on t-head C908 #5
Comments
Hi @camel-cdr , Thank you for running the code (I do not have a RVV 1.0 capable hardware to run it) and for your suggestions. To be honest, I was not expecting the current vector code to be faster than the optimized scalar code so the results you report match my expectation. One suggestion I got from @mjosaarinen was to parallelize two Keccak run and exploit the VLEN=128 to speed-up this with a very similar approach as the unrolled scalar code. I need to check if this corresponds to the KeccakP-1600 times two code you linked above. |
Looking at the documentation https://github.com/XKCP/XKCP/blob/master/doc/PlSnP-documentation.h I think your suggestion exactly matches Markku's: Implement parallel processing of multiple Keccak state using vectors. |
Thats great, good luck with the project. You can message me if you whant some more benchmarks for the blog post. |
@camel-cdr, if you have some cycles available, could you benchmark the vector multi-round implementation: https://github.com/nibrunieAtSi5/rvv-keccak/blob/main/src/keccak-vector-multi-round.c (the default configuration still relying on memory for the rho and pi steps) ? |
I'm getting the following: $ ./vector-multi-round.out
total: 1671419730 byte/cycle: 0.0025094 |
Thank you |
Part 1 of my blog post series on SHA-3 is out: https://open.substack.com/pub/fprox/p/rvv-implementation-of-keccak-sha?r=lsukk&utm_campaign=post&utm_medium=web Thank you again for the hardware measurements @camel-cdr |
Oh, I didn't know fprox is your blog, nice I've been following it for a bit. |
My secret identity has been revealed :-) |
:-) It might be worth citing something like "Software implementation of SHA-3 family using AVX2" to explain that not gaining a speedup over scalar, in the non parallel case is expected and not a shortcoming of the isa. |
Hi, I'm not sure if you have real hardware at hand, so I wanted to share some benchmark runs on my Kendryte K230 with an rvv 1.0 capable C908 core.
Note that at best, I'd expect the same performance as for the scalar case, because this processor takes roughly two cycles to process one 128 wide vector lane, while it takes about one cycle for a 64-bit GPR operation. (see instruction timings)
This is fine when working with bytes, but it's hard to get speedup from 64-bit integers.
For the benchmark I measured the cycle count for
Keccak( 576, 1024, in, 1024*1024*4, 0x06, out, 512);
on random data. (I removed theinstret
measurements from the original code)C908:
./vector-in-reg
performs very bad because the C908 has very slow vrgather for higher LMULs, in this case LMUL=8 SEQ=64 vrgather about 261.1 cycles.From what I can tell, there is no processor where high LMUL gather is particularly fast, although you should be able to expect one cycle per element processed (similar to the X280).
./vector.out
is about five time slower thatscalar-opt-in-regs.out
, I think this is mostly to to using needing to use LMUL=4 for only a single element, effectively almost doubling the execution time. Furthermore, the c908 doesn't support zvbb, but the scalar bitmanip extension. vluxei might also be slow, but I haven't benchmarked that yet.I think you'd get the fastest result from porting https://github.com/XKCP/XKCP/blob/master/lib/low/KeccakP-1600-times2/SIMD128/KeccakP-1600-times2-SIMD128.c.
It looks like you can just swap out the macros to the equivalent rvv ones, but I'm not sure what exactly is equivalent to your code here.
Obviously a fully scalable implementation would be the best. It might be possible to modify the above to something like that.
If I've got some more time I might also be able to run the code on an much faster C920, but that would require me to backport the assembly manually to rvv 0.7.1, so I'd rather do that after a bit more development.
The text was updated successfully, but these errors were encountered: