vli_mult algorithm #69
Comments
My "slower" multiplication code (i.e., not fully inlined) uses Comba multiplication, which reduces the number of stores. The main issue with using umaal is that it overwrites the accumulator registers. It might be possible to use it cleverly to reduce the runtime though; I'll look into it some more.
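For readers unfamiliar with the term, here is a minimal portable C sketch of Comba (product-scanning) multiplication. It is not the library's code: `comba_mult` and its parameter names are made up for illustration, and 32-bit little-endian words are assumed. Each output word is stored exactly once, but the column accumulator needs three words (`r2:r1:r0`).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (illustration only): multiply two n-word
 * little-endian integers column by column. result needs 2*n words
 * and must not overlap left or right. Requires n >= 1. */
static void comba_mult(uint32_t *result, const uint32_t *left,
                       const uint32_t *right, size_t n)
{
    uint32_t r0 = 0, r1 = 0, r2 = 0; /* three-word column accumulator */

    assert(n >= 1);
    for (size_t k = 0; k < 2 * n - 1; ++k) {
        /* All pairs (i, k - i) with both indices inside [0, n). */
        size_t i_min = (k < n) ? 0 : k - n + 1;
        size_t i_max = (k < n) ? k : n - 1;

        for (size_t i = i_min; i <= i_max; ++i) {
            uint64_t p = (uint64_t)left[i] * right[k - i];
            uint64_t lo = (uint64_t)r0 + (uint32_t)p;
            uint64_t hi = (uint64_t)r1 + (uint32_t)(p >> 32) + (lo >> 32);
            r0 = (uint32_t)lo;
            r1 = (uint32_t)hi;
            r2 += (uint32_t)(hi >> 32);
        }

        result[k] = r0; /* the only store for this output word */
        r0 = r1;        /* shift the accumulator down one word */
        r1 = r2;
        r2 = 0;
    }
    result[2 * n - 1] = r0;
}
```

Calling `comba_mult(r, a, b, 8)` would correspond to the 8x8=16 word case discussed in this thread.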
But your code was faster under the same inline conditions. Here is a further optimized naive multiplication inline (8x8=16 words) that is indeed faster than yours. I pre-cached one whole operand in registers, which reduced the number of memory accesses. This multiplication method is about 9% faster than your existing inline, running on a NucleoF401RE@48MHz. This inline is also easily scalable to different word counts. Feel free to use, improve, or ignore it.
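The inline assembly itself is not reproduced in this thread, so the following is only a rough C sketch of the idea: a naive (row-by-row) multiplication in which every inner step is a single umaal-style operation. `umaal_model` and `umaal_mult` are hypothetical names, and `umaal_model` is a portable stand-in for the instruction, assuming its usual semantics (hi:lo = a * b + lo + hi). In real assembly the words of `left` would stay in registers for the whole routine; C cannot express that directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable model of ARM UMAAL: (*hi:*lo) = a * b + *lo + *hi. */
static void umaal_model(uint32_t *lo, uint32_t *hi, uint32_t a, uint32_t b)
{
    uint64_t t = (uint64_t)a * b + *lo + *hi;
    *lo = (uint32_t)t;
    *hi = (uint32_t)(t >> 32);
}

/* Hypothetical helper (illustration only): naive multiplication of two
 * n-word little-endian integers. result needs 2*n words. */
static void umaal_mult(uint32_t *result, const uint32_t *left,
                       const uint32_t *right, size_t n)
{
    /* result is cleared first, so it must not alias the inputs. */
    assert(result != left && result != right);

    for (size_t i = 0; i < 2 * n; ++i) {
        result[i] = 0;
    }
    for (size_t j = 0; j < n; ++j) {
        uint32_t carry = 0; /* running high word for this row */
        for (size_t i = 0; i < n; ++i) {
            uint32_t acc = result[i + j];
            /* One step: multiply, add the old result word, add the carry. */
            umaal_model(&acc, &carry, left[i], right[j]);
            result[i + j] = acc;
        }
        result[n + j] = carry;
    }
}
```

Compared with a column-wise scheme, this loads and stores partial results more often, but each step needs only two accumulator words, which is the shape umaal expects.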
Thanks! My current code tries to minimize ldr/str instructions. However, this approach makes the carries quite large (the accumulator needs 3 registers), which makes it hard to take advantage of umaal. I'll try to see if there is any way to improve on your "naive algorithm with umaal" code. Maybe a hybrid approach (e.g., doing 4x4 blocks at a time) would be effective.
Here is what I came up with. It is about a 20% improvement over my original code on my test platform. Let me know if you see any more possible improvements!
On ARM platforms that support UMAAL, this new code should speed up curve operations by 15-20%. There is automatic detection of UMAAL support using compiler macros, but if it doesn't work for a given platform, #define uECC_ARM_USE_UMAAL to 1 or 0 as desired.
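As a small sketch of the manual override: the macro name comes from the comment above, while the header name `ecc_config.h` is only an assumption for illustration. The definition has to be visible before the library sources are compiled. Passing `-DuECC_ARM_USE_UMAAL=0` on the compiler command line should be equivalent with GCC-style compilers.

```c
/* ecc_config.h (hypothetical project header, included before the library).
 * Set to 1 to force the UMAAL code path, or 0 to disable it. */
#ifndef uECC_ARM_USE_UMAAL
#define uECC_ARM_USE_UMAAL 0
#endif
```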
I was wondering what multiplication algorithm is used in the asm_arm.inc inlines (static branch). I tried to improve its speed by writing my own asm multiplication inline using the 'umaal' instruction, which accumulates two numbers along with the multiplication, but it was slower than your asm inline (150ms vs 158ms signature computation time on NucleoF401RE@48MHz). My inline. Was there some reason not to use this instruction and to do the additions separately instead?
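For reference, the reason umaal can fold two additions into the multiply without losing a carry: assuming 32-bit operands $a$, $b$ and 32-bit addends $c$, $d$, the largest possible result still fits in 64 bits.

$$
a \cdot b + c + d \;\le\; (2^{32}-1)^2 + 2\,(2^{32}-1) \;=\; \bigl((2^{32}-1)+1\bigr)^2 - 1 \;=\; 2^{64}-1
$$

So the 64-bit result register pair never overflows, and no separate carry flag is needed between steps.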