Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimized support for P-384r1 and P-521r1 #99

Closed
balchick opened this issue Apr 1, 2017 · 12 comments
Closed

Optimized support for P-384r1 and P-521r1 #99

balchick opened this issue Apr 1, 2017 · 12 comments

Comments

@balchick
Copy link

balchick commented Apr 1, 2017

Hi,

I am new to Github and found your micro-ecc sw. I am interested in that on ARM microcontrollers and was able to get it to work just fine.

I was able to modify the curve-specific.inc file to include support for the NIST P-384r1 and P-521r1 curves and I added code to the test-ecdh.c program to include and test the NIST test vectors. I swap their private key in to the variable and modified the make_key function to make a make_public_key function.

My question is that I have no idea how to give you my code. Can you point me to a guide that would help? Would you even want it?

What I'd really like would be to see if you can add the NIST ecc optimizations for these two curves the way you have for the other ones. When I compile those in, my code is 20-40x faster, depending whether I run in on Linux or ARM bare metal on my dev board.

Thanks for the software and thanks for your time!

Ed B.

@balchick
Copy link
Author

balchick commented Apr 5, 2017

Update - I was able to incorporate the NIST optimized equation for the P-384 curve that I added and it works (passes the NIST test vectors I also included in the test_ecdh_all.c file I made. Am trying to add the NIST equation for the P-521 curve, but I think the odd number of bits, and the fact that I have to define the num_words as 17 (32-bit words for a total 544 bits available) means that the add routine NEVER generates a normal carry. I think the carry is only a mid-range bit in the highest word. I'm going to try to check the 522 bit by adding a separate vli_add (and maybe vli_sub) routine to detect it and see what happens. Will update if I figure it out.

Ed B.

@balchick
Copy link
Author

Update - I was able to incorporate the NIST optimized equation for the P-521 curve that I added and it works (passes the NIST test vectors I also included in the test_ecdh_all.c file I made. Just had to only keep the low 9 bits on the high word and rearrange all so that each tmp[i] uses the left 23 bits shifted to the right 9 bits and ORs on the low 9 bits from the next word shifted left 23 bits. Only take the loop to num_words_secp521r1 -2 and set the last tmp[num_words_secp521r1-1] to the left 23 bits shifted to the right 9 and AND with 0x000001FF. No changes to any other files needed (except invoke optimization >0 in uECC.h).

Thanks for making the base software. Thanks for including the references to the math papers so I could learn about the mmod and jacobian algorithms.

Ed B.

@kmackay
Copy link
Owner

kmackay commented May 21, 2017

I didn't include secp384r1 or scpr521r1 because they are unusably slow on 8-bit platforms, and the optimized assembly would be very large. If a lot of people want to use them I could add them though.

To merge your changes into micro-ecc, you would submit a 'pull request'.

Thanks for your interest!

@kmackay kmackay closed this as completed May 21, 2017
@DoDoENT
Copy link

DoDoENT commented Jul 16, 2017

@balchick, do you have a fork of this repository with support for P-384 and P-521? I need a support for those curves and was planning on adding them (at least for ARMv7 and ARM64, other platforms are less important to me). Since you have those already implemented, I would be grateful if you could share that code.

@DoDoENT
Copy link

DoDoENT commented Aug 10, 2017

@balchick, @kmackay, I am trying to implement P-521 in my code and it works with disabled optimisation (uECC_OPTIMIZATION_LEVEL==0), however I am struggling with writing optimised mmod function, based on NIST paper and suggestions from @balchick.

So far, this looks like this (for 64-bit platforms):

static void vli_mmod_fast_secp521r1(uint64_t *result, uint64_t *product) {
    uint64_t tmp[ num_words_secp521r1 ];
    int carry;
    int i;

    /* t */
    uECC_vli_set(result, product, num_words_secp521r1);

    result[ num_words_secp521r1 - 1 ] &= 0x01FF;

    /* s */
    for ( i = 0; i < num_words_secp521r1 - 2; ++i ) {
        tmp[ i ] = ( product[ num_words_secp521r1 - 1 + i ] >> 9 ) | ( product[ num_words_secp521r1 + i ] << 55 );
    }
    tmp[ num_words_secp521r1 - 1 ] = ( product[ num_words_secp521r1 + num_words_secp521r1 - 1 ] >> 9 ) & 0x01FF;

    carry = (int)uECC_vli_add(result, result, tmp, num_words_secp521r1);

    while (carry || uECC_vli_cmp_unsafe(curve_secp521r1.p, result, num_words_secp521r1) != 1) {
         carry -= uECC_vli_sub( result, result, curve_secp521r1.p, num_words_secp521r1);
    }

}

However, this does not work and I cannot find the reason why. I would be very grateful if you could give me some hints.

DoDoENT added a commit to microblink/micro-ecc that referenced this issue Aug 10, 2017
- the implementation does not work for unknown reason, but gives overview
  of final performance of P-521
- details can be tracked here: kmackay#99 (comment)
@balchick
Copy link
Author

balchick commented Aug 11, 2017 via email

@DoDoENT
Copy link

DoDoENT commented Aug 11, 2017

@balchick, this is my 32-bit version (if it helps you in comparison). It doesn't work either (same as 64-bit). The only difference is in left shifting - now its by 23 bytes, not 55 as in 64-bit case:

static void vli_mmod_fast_secp521r1(uint32_t *result, uint32_t *product) {
    uint32_t tmp[ num_words_secp521r1 ];
    int carry;
    int i;

    /* t */
    uECC_vli_set(result, product, num_words_secp521r1);

    result[ num_words_secp521r1 - 1 ] &= 0x01FF;

    /* s */
    for ( i = 0; i < num_words_secp521r1 - 2; ++i ) {
        tmp[ i ] = ( product[ num_words_secp521r1 - 1 + i ] >> 9 ) | ( product[ num_words_secp521r1 + i ] << 23 );
    }
    tmp[ num_words_secp521r1 - 1 ] = ( product[ num_words_secp521r1 + num_words_secp521r1 - 1 ] >> 9 ) & 0x01FF;

    carry = (int)uECC_vli_add(result, result, tmp, num_words_secp521r1);

    while (carry || uECC_vli_cmp_unsafe(curve_secp521r1.p, result, num_words_secp521r1) != 1) {
         carry -= uECC_vli_sub( result, result, curve_secp521r1.p, num_words_secp521r1);
    }

}

DoDoENT added a commit to microblink/micro-ecc that referenced this issue Aug 11, 2017
- the implementation does not work for unknown reason, but gives overview
  of final performance of P-521
- details can be tracked here: kmackay#99 (comment)
@balchick
Copy link
Author

balchick commented Aug 12, 2017 via email

@balchick
Copy link
Author

balchick commented Aug 12, 2017 via email

DoDoENT added a commit to microblink/micro-ecc that referenced this issue Aug 13, 2017
- secp521r1 currently works only for uECC_OPTIMIZATION_LEVEL <= 2 and is
  around 6 times slower than secp256k1 on same optimisation level and
  around 12 time slower than secp256k1 on optimisation level 4
@DoDoENT
Copy link

DoDoENT commented Aug 13, 2017

Hi @balchick!

Thank you a lot! With your help, I was able to fix my code and ensure tests with secp521r1 now work for me. The bug was in for loop (it had to go until num_words_secp521r1 - 1, as you noticed) and the shifting of last element of tmp was done incorrectly. My final (working) code can now be found in my fork of this repository, however it will only work with uECC_OPTIMIZATION_LEVEL <= 2 on ARM, since optimisation levels beyond 2 expect ARM assembly-optimised uECC_vli_* functions, which are very large for secp521r1 (as @kmackay stated in his last comment).

I also implemented the 32-bit version of the NIST equation for the P-384 curve.

Let me know if you want that.

For sake of completeness, it would be good to add that too - it would make this library complete for those who don't seek assembly optimised versions of routines or do not target 8-bit computers (that will have to be stated in README, to prevent people to expect that there is an assembly-optimised version for that too).

Based on your code, I will add 64-bit version (I don't have any 8-bit devices to add and test an 8-bit version) and send a pull request to the @kmackay so he can include that into original software, if he wants.

DoDoENT added a commit to microblink/micro-ecc that referenced this issue Aug 13, 2017
- the implementation does not work for unknown reason, but gives overview
  of final performance of P-521
- details can be tracked here: kmackay#99 (comment)
DoDoENT added a commit to microblink/micro-ecc that referenced this issue Aug 13, 2017
- secp521r1 currently works only for uECC_OPTIMIZATION_LEVEL <= 2 and is
  around 6 times slower than secp256k1 on same optimisation level and
  around 12 time slower than secp256k1 on optimisation level 4
@balchick
Copy link
Author

balchick commented Aug 14, 2017 via email

@DoDoENT
Copy link

DoDoENT commented Aug 15, 2017

Hi!

Later this month I might take a shot at making the 32-bit ARM asm. I don't know ARM asm, but I have a book and if I can relate his for 32-bit for P-256, I figure I can expand it to make it work.

Good luck! I tried that and concluded that optimised assembly would be very large (@kmackay also stated that previously). However, maybe a good way to do that would be to start from C code, then first optimise it until it produces best possible assembly (this tool is good for doing that task) and finally hand-optimise the produced assembly until it produces best possible performance - it's a lot of work, but could pay off in cases when performance is critical.

I have some performance numbers from an ARM platform for the supported curves for 32-bit and can't remember if it helps a lot over the NIST math. I'll get a list of the numbers for your info.

On my benchmarks (dual core Cortex A9 in low-end android device), secp521r1 was around 6 times slower than secp256k1 when same optimisation level was used (2), and by using optimisation level 4 for secp256k1 it made it additional 2 times faster - so in total secp521r1 with optimisation level 2 is 12 times slower than secp256k1 with optimisation level 4. Since for my use case performance is much more critical than security, I'll stick with secp256k1 for the time being, which provides pretty good security for my current use case.

However, with the advancement of the technology, when average android device will be faster, I will eventually upgrade my software to use secp521r1, provided that there are no major leaps in quantum computers (ECC is more vulnerable to quantum attacks than RSA). If there will be significant advances in quantum computing in next few years, I'll probably have to explore some quantum-resistant cryptosystem (such as supersingular isogeny key exchange).

Umair-Anwar pushed a commit to Umair-Anwar/micro-ecc that referenced this issue Aug 22, 2024
- the implementation does not work for unknown reason, but gives overview
  of final performance of P-521
- details can be tracked here: kmackay#99 (comment)
Umair-Anwar pushed a commit to Umair-Anwar/micro-ecc that referenced this issue Aug 22, 2024
- secp521r1 currently works only for uECC_OPTIMIZATION_LEVEL <= 2 and is
  around 6 times slower than secp256k1 on same optimisation level and
  around 12 time slower than secp256k1 on optimisation level 4
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants