proposal: x/crypto/sha3: add SHA3 assembly implementation for ARMv7 #28148
Currently, there is no assembly implementation of SHA3 hashing for ARM platforms (specifically ARMv7). On ARMv7+ there are vector instructions (known as NEON) available which greatly speed up SHA3 hashing. There is an upstream reference implementation (here: https://github.com/KeccakTeam/KeccakCodePackage/blob/master/lib/low/KeccakP-1600-times2/OptimizedAsmARM/KeccakP-1600-inplace-pl2-armv7a-neon-le-gcc.s) that implements SHA3 hashing using these vector instructions, and I have ported it to Go.
Unfortunately, the Go assembler/disassembler does not support ARMv7 vector instructions, so I wrote a small tool (available: https://github.com/anonymouse64/asm2go) which translates native ARM assembly into the syntax that Go's Plan 9 based assembly uses for unsupported opcodes, in order to integrate the upstream implementation into Go.
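To illustrate the kind of translation described above, here is a minimal sketch in Go. It is not asm2go's actual code: the `instr` type and `toGoAsm` function are hypothetical names, and the sketch assumes the instructions have already been assembled by a native toolchain into 4-byte little-endian ARM machine code. Each instruction is emitted as a raw `WORD` directive, with the native mnemonic kept as a comment. The sample encoding for `veor q0, q0, q0` should be treated as illustrative.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"strings"
)

// instr is a hypothetical pairing of one assembled ARM instruction
// (4 bytes, little-endian) with its native mnemonic text.
type instr struct {
	raw  [4]byte
	text string
}

// toGoAsm renders each instruction as a Go assembler WORD directive
// that embeds the raw 32-bit encoding, keeping the native mnemonic
// as a trailing comment for readability.
func toGoAsm(ins []instr) string {
	var b strings.Builder
	for _, in := range ins {
		w := binary.LittleEndian.Uint32(in.raw[:])
		fmt.Fprintf(&b, "\tWORD $0x%08x // %s\n", w, in.text)
	}
	return b.String()
}

func main() {
	// Illustrative input: bytes assumed to encode the NEON
	// instruction "veor q0, q0, q0" in ARM (A32) mode.
	ins := []instr{
		{raw: [4]byte{0x50, 0x01, 0x00, 0xf3}, text: "veor q0, q0, q0"},
	}
	fmt.Print(toGoAsm(ins))
}
```

The point of this approach is that the Go assembler never needs to understand the vector instruction itself; it only copies the pre-encoded word into the output.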
I see an approximately 3-4x speedup in SHA3 hashing on a reference Raspberry Pi 3 Model B Revision 1.2 board:
I opened a CL providing this implementation here: https://go-review.googlesource.com/c/crypto/+/119255. However, I have not received any feedback on the CL, so I am opening this issue to hopefully get more visibility on it.
As I wasn't aware of the Assembly Policy mentioned in the issue, I will respond here to some of the points it says should be discussed when considering assembly contributions:
This function is on the "fast path" when performing SHA3 hashing of large files, etc. I don't have benchmarks comparing this against implementing other parts of the hashing algorithm in assembly, but both the speedup I got and the fact that the amd64 implementation implements only this same function in assembly suggest that the function I'm providing is the "minimal" amount to implement in assembly.
I would argue this is okay because the Go assembly is generated from a relatively small amount of hand-written native assembly. I didn't write it directly in Plan 9 assembly because Go's assembler doesn't support the ARM vector instructions needed for this implementation (and using vector instructions is why the implementation is much faster than the pure Go implementation).
One particular place this lack of performance on ARM hits us is during the first boot of a newly flashed Ubuntu Core ARM device, where the system spends a good portion of its time performing SHA3 hashes while installing snaps (and doing other tasks). As an example, a customer device, which I unfortunately can't release more details on, completes its first boot in a factory environment about 1-2 minutes faster (approx. 10% faster) when using this assembly implementation, and this time/percentage increases with the number of snaps the customer wants to configure their device with initially.
Okay, so can I at least get a decision about whether an assembly implementation of SHA3 for ARM is acceptable or not (regardless of which specific Go release it goes into)?