vli_mult algorithm #69

Closed
omicronns opened this issue Feb 25, 2016 · 4 comments

@omicronns

I was wondering which multiplication algorithm is used in the asm_arm.inc inlines (static branch). I tried to improve its speed by writing my own asm multiplication inline using the 'umaal' instruction, which adds two accumulator words to the product as part of the multiplication, but it was slower than your asm inline (150ms vs 158ms signature computation time on NucleoF401RE@48MHz). My inline. Was there some reason not to use this instruction and to do the additions separately instead?
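For readers unfamiliar with the instruction, here is a minimal portable C model of what 'umaal' computes: RdHi:RdLo = Rn * Rm + RdLo + RdHi. The function name `umaal_model` is hypothetical and used only for illustration; it is not part of the library. The 64-bit result cannot overflow, since (2^32 - 1)^2 + 2 * (2^32 - 1) = 2^64 - 1.

```c
#include <stdint.h>

/* Portable model of ARM UMAAL (illustrative helper, not library code):
 * {hi:lo} = n * m + lo + hi, all operands unsigned 32-bit. */
static void umaal_model(uint32_t *lo, uint32_t *hi, uint32_t n, uint32_t m) {
    /* The sum fits in 64 bits even when every input is 0xFFFFFFFF. */
    uint64_t r = (uint64_t)n * m + *lo + *hi;
    *lo = (uint32_t)r;         /* low word replaces the first accumulator */
    *hi = (uint32_t)(r >> 32); /* high word replaces the second accumulator */
}
```

Note that both accumulator inputs are overwritten by the result, which matters when deciding which values to keep in registers.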

@kmackay
Owner

kmackay commented Feb 27, 2016

My "slower" multiplication code (i.e., not fully inlined) uses Comba multiplication (which reduces stores). The main issue with using umaal is that it overwrites the accumulator registers. It might be possible to use it cleverly to reduce the runtime though; I'll look into it some more.
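As a rough illustration of Comba (product-scanning) multiplication, here is a plain C sketch. It is not the code from asm_arm.inc; the function name `comba_mult` and its signature are assumptions made for this example. Each result word is computed as one column and stored exactly once, at the cost of a three-word accumulator.

```c
#include <stdint.h>

/* Product-scanning (Comba) multiply, illustrative sketch only:
 * result[0..2n-1] = left[0..n-1] * right[0..n-1], little-endian words. */
static void comba_mult(uint32_t *result, const uint32_t *left,
                       const uint32_t *right, int n) {
    uint32_t r0 = 0, r1 = 0, r2 = 0; /* three-word column accumulator */
    for (int k = 0; k < 2 * n - 1; ++k) {
        int i_min = (k < n) ? 0 : k - n + 1;
        int i_max = (k < n) ? k : n - 1;
        for (int i = i_min; i <= i_max; ++i) {
            /* Add the 64-bit partial product into r2:r1:r0. */
            uint64_t p = (uint64_t)left[i] * right[k - i];
            uint64_t s = (uint64_t)r0 + (uint32_t)p;
            r0 = (uint32_t)s;
            s = (uint64_t)r1 + (uint32_t)(p >> 32) + (s >> 32);
            r1 = (uint32_t)s;
            r2 += (uint32_t)(s >> 32);
        }
        result[k] = r0; /* one store per result word */
        r0 = r1;        /* shift the accumulator down one word */
        r1 = r2;
        r2 = 0;
    }
    result[2 * n - 1] = r0;
}
```

Because a column sums several 64-bit products, the carries spill into a third word, so the accumulator does not map directly onto the two-word accumulate that umaal provides.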

@omicronns
Author

But your code was faster under the same inlining conditions. Here is a further optimized naive multiplication inline (8x8=16 words) that is indeed faster than yours. I pre-cached one whole operand in registers, which reduced the number of memory accesses. This multiplication method is about 9% faster than your existing inline, running on NucleoF401RE@48MHz. This inline also scales easily to different word counts. Feel free to use, improve or ignore it.
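The linked inline is not reproduced in this thread, so the following is only a C sketch of the general idea described: a row-wise (operand-scanning) 8x8-word multiply in which each inner step has the shape of one umaal, with one operand copied into locals up front. The names `umaal_step`, `rowwise_mult` and `NUM_WORDS` are hypothetical, and whether a compiler keeps `cached` in registers or emits umaal is not guaranteed.

```c
#include <stdint.h>
#include <string.h>

#define NUM_WORDS 8 /* 8x8 = 16 words, matching the case discussed above */

/* Same arithmetic as one UMAAL: {hi:lo} = n * m + lo + hi. */
static void umaal_step(uint32_t *lo, uint32_t *hi, uint32_t n, uint32_t m) {
    uint64_t r = (uint64_t)n * m + *lo + *hi;
    *lo = (uint32_t)r;
    *hi = (uint32_t)(r >> 32);
}

/* Row-wise multiply: result[0..15] = left[0..7] * right[0..7]. */
static void rowwise_mult(uint32_t result[2 * NUM_WORDS],
                         const uint32_t left[NUM_WORDS],
                         const uint32_t right[NUM_WORDS]) {
    uint32_t cached[NUM_WORDS]; /* stands in for the register-cached operand */
    memcpy(cached, left, sizeof cached);
    memset(result, 0, 2 * NUM_WORDS * sizeof(uint32_t));

    for (int i = 0; i < NUM_WORDS; ++i) {
        uint32_t carry = 0;
        uint32_t r = right[i]; /* one load per row for the other operand */
        for (int j = 0; j < NUM_WORDS; ++j) {
            /* result[i+j] and carry are the two accumulators of the step. */
            umaal_step(&result[i + j], &carry, cached[j], r);
        }
        result[i + NUM_WORDS] = carry; /* top word of this row */
    }
}
```

Changing `NUM_WORDS` adapts the sketch to other operand sizes, which mirrors the scalability mentioned in the comment.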

@kmackay
Owner

kmackay commented Mar 9, 2016

Thanks! My current code tries to minimize ldr/str instructions. However, this approach makes the carries quite large (the accumulator needs 3 registers), which makes it hard to take advantage of umaal. I'll try to see if there is any way to improve on your "naive algorithm with umaal" code. Maybe a hybrid approach (e.g., doing 4x4 blocks at a time) would be effective.

@kmackay
Owner

kmackay commented Mar 11, 2016

Here is what I came up with. It is about a 20% improvement over my original code on my test platform. Let me know if you see any more possible improvements!

kmackay added a commit that referenced this issue Apr 21, 2016
On ARM platforms that support UMAAL, this new code should speed up curve
operations by 15-20%. There is automatic detection of UMAAL support
using compiler macros, but if it doesn't work for a given platform,
#define uECC_ARM_USE_UMAAL to 1 or 0 as desired.
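
A sketch of how such an override can work is shown below. Only the macro name uECC_ARM_USE_UMAAL comes from the commit message; the detection condition is an assumption for illustration (the GCC/Clang predefined macro `__ARM_ARCH_7EM__`, set when targeting ARMv7E-M cores such as Cortex-M4) and may differ from what the library actually checks. The helper `uses_umaal` is hypothetical.

```c
/* Illustrative only: the library's real detection logic may differ. */
#ifndef uECC_ARM_USE_UMAAL
#  if defined(__ARM_ARCH_7EM__) /* assumed detection condition */
#    define uECC_ARM_USE_UMAAL 1
#  else
#    define uECC_ARM_USE_UMAAL 0
#  endif
#endif

/* Hypothetical helper reporting which code path was selected. */
static int uses_umaal(void) {
    return uECC_ARM_USE_UMAAL;
}
```

With a guard like this, defining the macro before the header is included, or passing something like `-DuECC_ARM_USE_UMAAL=0` on the compiler command line, takes precedence over the automatic choice.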
kmackay closed this as completed Apr 21, 2016