Optimized support for P-384r1 and P-521r1 #99

balchick · 2017-04-01T01:08:24Z

Hi,

I am new to Github and found your micro-ecc sw. I am interested in that on ARM microcontrollers and was able to get it to work just fine.

I was able to modify the curve-specific.inc file to include support for the NIST P-384r1 and P-521r1 curves and I added code to the test-ecdh.c program to include and test the NIST test vectors. I swap their private key in to the variable and modified the make_key function to make a make_public_key function.

My question is that I have no idea how to give you my code. Can you point me to a guide that would help? Would you even want it?

What I'd really like would be to see if you can add the NIST ecc optimizations for these two curves the way you have for the other ones. When I compile those in, my code is 20-40x faster, depending whether I run in on Linux or ARM bare metal on my dev board.

Thanks for the software and thanks for your time!

Ed B.

balchick · 2017-04-05T03:59:09Z

Update - I was able to incorporate the NIST optimized equation for the P-384 curve that I added and it works (passes the NIST test vectors I also included in the test_ecdh_all.c file I made. Am trying to add the NIST equation for the P-521 curve, but I think the odd number of bits, and the fact that I have to define the num_words as 17 (32-bit words for a total 544 bits available) means that the add routine NEVER generates a normal carry. I think the carry is only a mid-range bit in the highest word. I'm going to try to check the 522 bit by adding a separate vli_add (and maybe vli_sub) routine to detect it and see what happens. Will update if I figure it out.

Ed B.

balchick · 2017-04-13T03:20:01Z

Update - I was able to incorporate the NIST optimized equation for the P-521 curve that I added and it works (passes the NIST test vectors I also included in the test_ecdh_all.c file I made. Just had to only keep the low 9 bits on the high word and rearrange all so that each tmp[i] uses the left 23 bits shifted to the right 9 bits and ORs on the low 9 bits from the next word shifted left 23 bits. Only take the loop to num_words_secp521r1 -2 and set the last tmp[num_words_secp521r1-1] to the left 23 bits shifted to the right 9 and AND with 0x000001FF. No changes to any other files needed (except invoke optimization >0 in uECC.h).

Thanks for making the base software. Thanks for including the references to the math papers so I could learn about the mmod and jacobian algorithms.

Ed B.

kmackay · 2017-05-21T17:19:12Z

I didn't include secp384r1 or scpr521r1 because they are unusably slow on 8-bit platforms, and the optimized assembly would be very large. If a lot of people want to use them I could add them though.

To merge your changes into micro-ecc, you would submit a 'pull request'.

Thanks for your interest!

DoDoENT · 2017-07-16T21:48:46Z

@balchick, do you have a fork of this repository with support for P-384 and P-521? I need a support for those curves and was planning on adding them (at least for ARMv7 and ARM64, other platforms are less important to me). Since you have those already implemented, I would be grateful if you could share that code.

DoDoENT · 2017-08-10T16:34:08Z

@balchick, @kmackay, I am trying to implement P-521 in my code and it works with disabled optimisation (uECC_OPTIMIZATION_LEVEL==0), however I am struggling with writing optimised mmod function, based on NIST paper and suggestions from @balchick.

So far, this looks like this (for 64-bit platforms):

static void vli_mmod_fast_secp521r1(uint64_t *result, uint64_t *product) {
    uint64_t tmp[ num_words_secp521r1 ];
    int carry;
    int i;

    /* t */
    uECC_vli_set(result, product, num_words_secp521r1);

    result[ num_words_secp521r1 - 1 ] &= 0x01FF;

    /* s */
    for ( i = 0; i < num_words_secp521r1 - 2; ++i ) {
        tmp[ i ] = ( product[ num_words_secp521r1 - 1 + i ] >> 9 ) | ( product[ num_words_secp521r1 + i ] << 55 );
    }
    tmp[ num_words_secp521r1 - 1 ] = ( product[ num_words_secp521r1 + num_words_secp521r1 - 1 ] >> 9 ) & 0x01FF;

    carry = (int)uECC_vli_add(result, result, tmp, num_words_secp521r1);

    while (carry || uECC_vli_cmp_unsafe(curve_secp521r1.p, result, num_words_secp521r1) != 1) {
         carry -= uECC_vli_sub( result, result, curve_secp521r1.p, num_words_secp521r1);
    }

}

However, this does not work and I cannot find the reason why. I would be very grateful if you could give me some hints.

- the implementation does not work for unknown reason, but gives overview of final performance of P-521 - details can be tracked here: kmackay#99 (comment)

balchick · 2017-08-11T00:40:29Z

Hi, I have some code on another computer. I'll compare my 32-bit version and let you know. Ed

…

-----Original Message----- From: DoDo <notifications@github.com> To: kmackay/micro-ecc <micro-ecc@noreply.github.com> Cc: balchick <balchickej@aol.com>; Mention <mention@noreply.github.com> Sent: Thu, Aug 10, 2017 12:34 pm Subject: Re: [kmackay/micro-ecc] Optimized support for P-384r1 and P-521r1 (#99) @balchick, @kmackay, I am trying to implement P-521 in my code and it works with disabled optimisation (uECC_OPTIMIZATION_LEVEL==0), however I am struggling with writing optimised mmod function, based on NIST paper and suggestions from @balchick. So far, this looks like this (for 64-bit platforms): static void vli_mmod_fast_secp521r1(uint64_t *result, uint64_t *product) { uint64_t tmp[ num_words_secp521r1 ]; int carry; int i; /* t */ uECC_vli_set(result, product, num_words_secp521r1); result[ num_words_secp521r1 - 1 ] &= 0x01FF; /* s */ for ( i = 0; i < num_words_secp521r1 - 2; ++i ) { tmp[ i ] = ( product[ num_words_secp521r1 - 1 + i ] >> 9 ) | ( product[ num_words_secp521r1 + i ] << 55 ); } tmp[ num_words_secp521r1 - 1 ] = ( product[ num_words_secp521r1 + num_words_secp521r1 - 1 ] >> 9 ) & 0x01FF; carry = (int)uECC_vli_add(result, result, tmp, num_words_secp521r1); while (carry || uECC_vli_cmp_unsafe(curve_secp521r1.p, result, num_words_secp521r1) != 1) { carry -= uECC_vli_sub( result, result, curve_secp521r1.p, num_words_secp521r1); } } However, this does not work and I cannot find the reason why. I would be very grateful if you could give me some hints. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

DoDoENT · 2017-08-11T08:32:09Z

@balchick, this is my 32-bit version (if it helps you in comparison). It doesn't work either (same as 64-bit). The only difference is in left shifting - now its by 23 bytes, not 55 as in 64-bit case:

static void vli_mmod_fast_secp521r1(uint32_t *result, uint32_t *product) {
    uint32_t tmp[ num_words_secp521r1 ];
    int carry;
    int i;

    /* t */
    uECC_vli_set(result, product, num_words_secp521r1);

    result[ num_words_secp521r1 - 1 ] &= 0x01FF;

    /* s */
    for ( i = 0; i < num_words_secp521r1 - 2; ++i ) {
        tmp[ i ] = ( product[ num_words_secp521r1 - 1 + i ] >> 9 ) | ( product[ num_words_secp521r1 + i ] << 23 );
    }
    tmp[ num_words_secp521r1 - 1 ] = ( product[ num_words_secp521r1 + num_words_secp521r1 - 1 ] >> 9 ) & 0x01FF;

    carry = (int)uECC_vli_add(result, result, tmp, num_words_secp521r1);

    while (carry || uECC_vli_cmp_unsafe(curve_secp521r1.p, result, num_words_secp521r1) != 1) {
         carry -= uECC_vli_sub( result, result, curve_secp521r1.p, num_words_secp521r1);
    }

}

- the implementation does not work for unknown reason, but gives overview of final performance of P-521 - details can be tracked here: kmackay#99 (comment)

balchick · 2017-08-12T16:54:23Z

Hi, I put changes in below. Also, be sure that when you run the non-optimized version (with uECC Optimization set to 0) that it generates the correct result using the NIST test vectors to make sure that you entered all the curve data properly. The micro-ECC author doesn't have those tests. I wrote them for myself when I was adding support for the P-384 and P-521 curves. I assume you also know you have to deal with the extra bits in the public keys and shared secrets if you want to compare since the NIST data is exactly the right length. I use 17 32-bit words and lop off the extra bytes when printing and comparing. I also wrote a routing to substitute a constant private key into the private variable and then only call compute_public_key in the test. Hope this helps. Ed

…

-----Original Message----- From: DoDo <notifications@github.com> To: kmackay/micro-ecc <micro-ecc@noreply.github.com> Cc: balchick <balchickej@aol.com>; Mention <mention@noreply.github.com> Sent: Fri, Aug 11, 2017 4:32 am Subject: Re: [kmackay/micro-ecc] Optimized support for P-384r1 and P-521r1 (#99) @balchick, this is my 32-bit version (if it helps you in comparison). It doesn't work either (same as 64-bit). The only difference is in left shifting - now its by 23 bytes, not 55 as in 64-bit case: static void vli_mmod_fast_secp521r1(uint32_t *result, uint32_t *product) { uint32_t tmp[ num_words_secp521r1 ]; uECC_word_t tmp2 = 0; //new by ejb int carry; int i; /* t */ tmp2 = product[num_words_secp521r1 - 1]; // new by ejb product[num_words_secp521r1 -1] = tmp2 & 0x000001FF; // new by ejb uECC_vli_set(result, product, num_words_secp521r1); // moved this down product[num_words_secp521r1 -1 = tmp2; // new by ejb delete this line (i added a slightly different one above) - result[ num_words_secp521r1 - 1 ] &= 0x01FF; /* s */ for ( i = 0; i < num_words_secp521r1 - 1; i++ ) // Note that I changed the line to only subtract 1 instead of 2 and do i++. Your original code may work if you just fix this, not sure. ejb { tmp[i] = ((product[i + num_words_secp521r1] & 0x000001ff) << 23) | ((product[i + numwords_secp521r1 - 1] & 0xFFFFFE00) >> 9); // new by ejb to replace line you have below delete this line and use mine - tmp[ i ] = ( product[ num_words_secp521r1 - 1 + i ] >> 9 ) | ( product[ num_words_secp521r1 + i ] << 23 ); } tmp[ num_words_secp521r1 - 1 ] = ( product[ 2 * num_words_secp521r1 - 2 ] & 0xFFFFFE00) >> 9 ); // NOTE - I edited this line for you. ejb carry = += uECC_vli_add(result, result, tmp, num_words_secp521r1); // NOTE - I edited this line for you. ejb if (carry < 0) { do { carry += uECC_vli_add( result, result, curve_secp521r1.p, num_words_secp521r1); // new line by ejb } while (carry < 0); } else { while (carry || uECC_vli_cmp_unsafe(curve_secp521r1.p, result, num_words_secp521r1) != 1) { carry -= uECC_vli_sub( result, result, curve_secp521r1.p, num_words_secp521r1); } } // don't forget this extra bracket. ejb } — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

balchick · 2017-08-12T16:56:44Z

Hi, I also implemented the 32-bit version of the NIST equation for the P-384 curve. Let me know if you want that. I did not do any ARM optimizations. I wish I knew how to add some of those routines for the 32-bit ARM platform :) Ed

…

-----Original Message----- From: DoDo <notifications@github.com> To: kmackay/micro-ecc <micro-ecc@noreply.github.com> Cc: balchick <balchickej@aol.com>; Mention <mention@noreply.github.com> Sent: Thu, Aug 10, 2017 12:34 pm Subject: Re: [kmackay/micro-ecc] Optimized support for P-384r1 and P-521r1 (#99) @balchick, @kmackay, I am trying to implement P-521 in my code and it works with disabled optimisation (uECC_OPTIMIZATION_LEVEL==0), however I am struggling with writing optimised mmod function, based on NIST paper and suggestions from @balchick. So far, this looks like this (for 64-bit platforms): static void vli_mmod_fast_secp521r1(uint64_t *result, uint64_t *product) { uint64_t tmp[ num_words_secp521r1 ]; int carry; int i; /* t */ uECC_vli_set(result, product, num_words_secp521r1); result[ num_words_secp521r1 - 1 ] &= 0x01FF; /* s */ for ( i = 0; i < num_words_secp521r1 - 2; ++i ) { tmp[ i ] = ( product[ num_words_secp521r1 - 1 + i ] >> 9 ) | ( product[ num_words_secp521r1 + i ] << 55 ); } tmp[ num_words_secp521r1 - 1 ] = ( product[ num_words_secp521r1 + num_words_secp521r1 - 1 ] >> 9 ) & 0x01FF; carry = (int)uECC_vli_add(result, result, tmp, num_words_secp521r1); while (carry || uECC_vli_cmp_unsafe(curve_secp521r1.p, result, num_words_secp521r1) != 1) { carry -= uECC_vli_sub( result, result, curve_secp521r1.p, num_words_secp521r1); } } However, this does not work and I cannot find the reason why. I would be very grateful if you could give me some hints. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

- secp521r1 currently works only for uECC_OPTIMIZATION_LEVEL <= 2 and is around 6 times slower than secp256k1 on same optimisation level and around 12 time slower than secp256k1 on optimisation level 4

DoDoENT · 2017-08-13T13:57:13Z

Hi @balchick!

Thank you a lot! With your help, I was able to fix my code and ensure tests with secp521r1 now work for me. The bug was in for loop (it had to go until num_words_secp521r1 - 1, as you noticed) and the shifting of last element of tmp was done incorrectly. My final (working) code can now be found in my fork of this repository, however it will only work with uECC_OPTIMIZATION_LEVEL <= 2 on ARM, since optimisation levels beyond 2 expect ARM assembly-optimised uECC_vli_* functions, which are very large for secp521r1 (as @kmackay stated in his last comment).

I also implemented the 32-bit version of the NIST equation for the P-384 curve.

Let me know if you want that.

For sake of completeness, it would be good to add that too - it would make this library complete for those who don't seek assembly optimised versions of routines or do not target 8-bit computers (that will have to be stated in README, to prevent people to expect that there is an assembly-optimised version for that too).

Based on your code, I will add 64-bit version (I don't have any 8-bit devices to add and test an 8-bit version) and send a pull request to the @kmackay so he can include that into original software, if he wants.

- the implementation does not work for unknown reason, but gives overview of final performance of P-521 - details can be tracked here: kmackay#99 (comment)

- secp521r1 currently works only for uECC_OPTIMIZATION_LEVEL <= 2 and is around 6 times slower than secp256k1 on same optimisation level and around 12 time slower than secp256k1 on optimisation level 4

balchick · 2017-08-14T23:59:35Z

Hi, Glad that worked. I'll send you an e-mail with the NIST equation for the P-384 tomorrow. Later this month I might take a shot at making the 32-bit ARM asm. I don't know ARM asm, but I have a book and if I can relate his for 32-bit for P-256, I figure I can expand it to make it work. I have some performance numbers from an ARM platform for the supported curves for 32-bit and can't remember if it helps a lot over the NIST math. I'll get a list of the numbers for your info. Ed

…

-----Original Message----- From: DoDo <notifications@github.com> To: kmackay/micro-ecc <micro-ecc@noreply.github.com> Cc: balchick <balchickej@aol.com>; Mention <mention@noreply.github.com> Sent: Sun, Aug 13, 2017 9:57 am Subject: Re: [kmackay/micro-ecc] Optimized support for P-384r1 and P-521r1 (#99) Hi @balchick! Thank you a lot! With your help, I was able to fix my code and ensure tests with secp521r1 now work for me. The bug was in for loop (it had to go until num_words_secp521r1 - 1, as you noticed) and the shifting of last element of tmp was done incorrectly. My final (working) code can now be found in my fork of this repository, however it will only work with uECC_OPTIMIZATION_LEVEL <= 2 on ARM, since optimisation levels beyond 2 expect ARM assembly-optimised uECC_vli_* functions, which are very large for secp521r1 (as @kmackay stated in his last comment). I also implemented the 32-bit version of the NIST equation for the P-384 curve. Let me know if you want that. For sake of completeness, it would be good to add that too - it would make this library complete for those who don't seek assembly optimised versions of routines or do not target 8-bit computers (that will have to be stated in README, to prevent people to expect that there is an assembly-optimised version for that too). Based on your code, I will add 64-bit version (I don't have any 8-bit devices to add and test an 8-bit version) and send a pull request to the @kmackay so he can include that into original software, if he wants. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

DoDoENT · 2017-08-15T11:46:01Z

Hi!

Later this month I might take a shot at making the 32-bit ARM asm. I don't know ARM asm, but I have a book and if I can relate his for 32-bit for P-256, I figure I can expand it to make it work.

Good luck! I tried that and concluded that optimised assembly would be very large (@kmackay also stated that previously). However, maybe a good way to do that would be to start from C code, then first optimise it until it produces best possible assembly (this tool is good for doing that task) and finally hand-optimise the produced assembly until it produces best possible performance - it's a lot of work, but could pay off in cases when performance is critical.

I have some performance numbers from an ARM platform for the supported curves for 32-bit and can't remember if it helps a lot over the NIST math. I'll get a list of the numbers for your info.

On my benchmarks (dual core Cortex A9 in low-end android device), secp521r1 was around 6 times slower than secp256k1 when same optimisation level was used (2), and by using optimisation level 4 for secp256k1 it made it additional 2 times faster - so in total secp521r1 with optimisation level 2 is 12 times slower than secp256k1 with optimisation level 4. Since for my use case performance is much more critical than security, I'll stick with secp256k1 for the time being, which provides pretty good security for my current use case.

However, with the advancement of the technology, when average android device will be faster, I will eventually upgrade my software to use secp521r1, provided that there are no major leaps in quantum computers (ECC is more vulnerable to quantum attacks than RSA). If there will be significant advances in quantum computing in next few years, I'll probably have to explore some quantum-resistant cryptosystem (such as supersingular isogeny key exchange).

- the implementation does not work for unknown reason, but gives overview of final performance of P-521 - details can be tracked here: kmackay#99 (comment)

- secp521r1 currently works only for uECC_OPTIMIZATION_LEVEL <= 2 and is around 6 times slower than secp256k1 on same optimisation level and around 12 time slower than secp256k1 on optimisation level 4

kmackay closed this as completed May 21, 2017

derekatkins mentioned this issue Apr 20, 2022

Support for P-384? #193

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimized support for P-384r1 and P-521r1 #99

Optimized support for P-384r1 and P-521r1 #99

balchick commented Apr 1, 2017

balchick commented Apr 5, 2017

balchick commented Apr 13, 2017

kmackay commented May 21, 2017

DoDoENT commented Jul 16, 2017

DoDoENT commented Aug 10, 2017

balchick commented Aug 11, 2017 via email

DoDoENT commented Aug 11, 2017

balchick commented Aug 12, 2017 via email

balchick commented Aug 12, 2017 via email

DoDoENT commented Aug 13, 2017

balchick commented Aug 14, 2017 via email

DoDoENT commented Aug 15, 2017

Optimized support for P-384r1 and P-521r1 #99

Optimized support for P-384r1 and P-521r1 #99

Comments

balchick commented Apr 1, 2017

balchick commented Apr 5, 2017

balchick commented Apr 13, 2017

kmackay commented May 21, 2017

DoDoENT commented Jul 16, 2017

DoDoENT commented Aug 10, 2017

balchick commented Aug 11, 2017 via email

DoDoENT commented Aug 11, 2017

balchick commented Aug 12, 2017 via email

balchick commented Aug 12, 2017 via email

DoDoENT commented Aug 13, 2017

balchick commented Aug 14, 2017 via email

DoDoENT commented Aug 15, 2017