vli_mult algorithm #69

omicronns · 2016-02-25T13:44:33Z

I was wondering what multiplication algorithm is used in the asm_arm.inc inlines (static branch). I tried to improve its speed by writing my own inline asm multiplication using the 'umaal' instruction, which adds two numbers into the product as part of the multiplication, but it was slower than your asm inline (150ms vs 158ms signature computation time on a NucleoF401RE@48MHz). My inline. Was there some reason not to use this instruction, and to do the additions separately instead?
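For readers unfamiliar with the instruction, here is a minimal portable C model of what 'umaal' computes. The helper name umaal_model is hypothetical and is not part of micro-ecc; it only illustrates the arithmetic, not the actual inline assembly discussed in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of ARM's UMAAL RdLo, RdHi, Rn, Rm:
 *   {RdHi:RdLo} = Rn * Rm + RdLo + RdHi   (all operands unsigned 32-bit)
 * The 64-bit result cannot overflow, because
 *   (2^32 - 1)^2 + 2 * (2^32 - 1) = 2^64 - 1. */
static void umaal_model(uint32_t *lo, uint32_t *hi, uint32_t a, uint32_t b) {
    assert(lo != NULL && hi != NULL);
    uint64_t r = (uint64_t)a * b + *lo + *hi;
    *lo = (uint32_t)r;          /* low word replaces RdLo  */
    *hi = (uint32_t)(r >> 32);  /* high word replaces RdHi */
}
```

Note that both accumulator inputs are overwritten by the result, which is the property that makes the instruction awkward to combine with a wider running accumulator.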

kmackay · 2016-02-27T22:07:32Z

My "slower" multiplication code (i.e., not fully inlined) uses Comba multiplication (which reduces stores). The main issue with using umaal is that it overwrites the accumulator registers. It might be possible to use it cleverly to reduce the runtime though; I'll look into it some more.
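As a rough illustration of the Comba (product-scanning) idea mentioned above, here is a plain C sketch. It is not the micro-ecc implementation; the function name comba_mult and the 32-bit little-endian limb layout are assumptions made for this example. Each result word is stored exactly once, at the cost of a three-word column accumulator.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of product-scanning (Comba) multiplication:
 * result[0 .. 2n-1] = left[0 .. n-1] * right[0 .. n-1].
 * Limbs are 32-bit, least significant first. result must not alias inputs. */
static void comba_mult(uint32_t *result, const uint32_t *left,
                       const uint32_t *right, int n) {
    assert(n > 0);
    uint32_t r0 = 0, r1 = 0, r2 = 0;  /* three-word column accumulator */
    for (int k = 0; k < 2 * n - 1; ++k) {
        int i_min = (k < n) ? 0 : k - n + 1;
        int i_max = (k < n) ? k : n - 1;
        /* Sum every partial product that lands in column k. */
        for (int i = i_min; i <= i_max; ++i) {
            uint64_t p = (uint64_t)left[i] * right[k - i];
            uint64_t s = (uint64_t)r0 + (uint32_t)p;
            r0 = (uint32_t)s;
            s = (uint64_t)r1 + (uint32_t)(p >> 32) + (s >> 32);
            r1 = (uint32_t)s;
            r2 += (uint32_t)(s >> 32);
        }
        result[k] = r0;  /* single store per column */
        r0 = r1;         /* shift the accumulator down one word */
        r1 = r2;
        r2 = 0;
    }
    result[2 * n - 1] = r0;
}
```

The accumulator spans three words because a column can sum up to n double-word products, which is why a single umaal (two accumulator words) does not drop in directly.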

omicronns · 2016-02-29T15:27:10Z

But your code was faster under the same inlining conditions. Here is a further-optimized naive multiplication inline (8x8=16 words) that is indeed faster than yours. I pre-cached one whole operand in registers, which reduced the number of memory accesses. This multiplication method is about 9% faster than your existing inline, running on a NucleoF401RE@48MHz. This inline also scales easily to different word counts. Feel free to use, improve, or ignore it.
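The linked inline is not reproduced here, so the following is only a C sketch of the general shape described: a row-wise (operand-scanning) 8x8-word multiply in which one operand is copied up front and each inner step is a single umaal-style multiply-accumulate. The names rowwise_mult, umaal_model, and WORDS are hypothetical, and the local array merely stands in for registers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define WORDS 8  /* 8 x 8 words in, 16 words out */

/* Hypothetical model of UMAAL: {hi:lo} = a * b + lo + hi (never overflows). */
static void umaal_model(uint32_t *lo, uint32_t *hi, uint32_t a, uint32_t b) {
    uint64_t r = (uint64_t)a * b + *lo + *hi;
    *lo = (uint32_t)r;
    *hi = (uint32_t)(r >> 32);
}

/* Row-wise multiply; result must not alias either input. */
static void rowwise_mult(uint32_t *result, const uint32_t *left,
                         const uint32_t *right) {
    assert(result != left && result != right);
    uint32_t cached[WORDS];  /* stand-in for an operand held in registers */
    memcpy(cached, right, sizeof cached);
    memset(result, 0, 2 * WORDS * sizeof result[0]);
    for (int i = 0; i < WORDS; ++i) {
        uint32_t a = left[i];  /* one load per row */
        uint32_t carry = 0;
        for (int j = 0; j < WORDS; ++j) {
            /* result[i+j] and carry are both folded into the product. */
            umaal_model(&result[i + j], &carry, a, cached[j]);
        }
        result[i + WORDS] = carry;  /* top word of this row */
    }
}
```

Changing WORDS is all this sketch needs to handle a different operand size, which mirrors the scalability point made above; the real register pressure of caching a full operand is of course not modeled.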

kmackay · 2016-03-09T05:59:04Z

Thanks! My current code tries to minimize ldr/str instructions. However, this approach makes the carries quite large (the accumulator needs 3 registers), which makes it hard to take advantage of umaal. I'll try to see if there is any way to improve on your "naive algorithm with umaal" code. Maybe a hybrid approach (e.g., doing 4x4 blocks at a time) would be effective.

kmackay · 2016-03-11T06:22:14Z

Here is what I came up with. It is about a 20% improvement over my original code on my test platform. Let me know if you see any more possible improvements!

On ARM platforms that support UMAAL, this new code should speed up curve operations by 15-20%. There is automatic detection of UMAAL support using compiler macros, but if it doesn't work for a given platform, #define uECC_ARM_USE_UMAAL to 1 or 0 as desired.

kmackay closed this as completed Apr 21, 2016
