Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GOST ECC optimizations #263

Merged
merged 1 commit into from
Jul 3, 2020
Merged

GOST ECC optimizations #263

merged 1 commit into from
Jul 3, 2020

Conversation

bbbrumley
Copy link
Contributor

@bbbrumley bbbrumley commented Jun 4, 2020

Hey @beldmit,

This is def not for merging :) I'm working on the gost-engine rigging and this is just a dry run. Out of WiP

Could you take a look at the performance and let me know what you think?

Should see improvements on EVP keygen, sign, derive, and verify across the board.

Only tested so far on 64-bit and 32-bit linux, with clang.

32-bit support is still WiP, and also will serve as the fallback for compilers that do not support __uint128_t.

Edit: very much suggest clang no debug build with optimization flags.

@beldmit
Copy link
Contributor

beldmit commented Jun 4, 2020

Clang 10, with patch:
sign: 1263.024945/s
verify: 1348.435814/s
derive: 1321.222130/s

Clang 10, without patch:
sign: 1267.627951/s
verify: 1475.796930/s
derive: 1337.792642/s

GCC 8.3, with patch:
sign: 1267.226358/s
verify: 1469.723692/s
derive: 1347.254968/s

GCC 8.3, without patch:
sign: 1251.956182/s
verify: 1428.571429/s
derive: 1343.408900/s

$ cat /proc/cpuinfo

processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 21
model		: 48
model name	: AMD A8-7600 Radeon R7, 10 Compute Cores 4C+6G
stepping	: 1
microcode	: 0x6003106
cpu MHz		: 1391.372
cache size	: 2048 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 2
apicid		: 16
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 tce nodeid_msr tbm topoext perfctr_core perfctr_nb bpext ptsc cpb hw_pstate ssbd vmmcall fsgsbase bmi1 xsaveopt arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold overflow_recov
bugs		: fxsave_leak sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass
bogomips	: 6188.51
TLB size	: 1536 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 48 bits physical, 48 bits virtual
power management: ts ttp tm 100mhzsteps hwpstate cpb eff_freq_ro [13]

...

@bbbrumley
Copy link
Contributor Author

This def doesn't seem right. I don't think it's hitting the right code paths. Could you double check in the debugger? Set a bp at gost_ec_point_mul

@bbbrumley
Copy link
Contributor Author

OH and ofc critical info I left out ... so far it's only id_tc26_gost_3410_2012_256_paramSetA as a dry run.

@beldmit
Copy link
Contributor

beldmit commented Jun 4, 2020

Got my mistake. id_tc26_gost_3410_2012_256_paramSetA is TCA parameters. Just a moment.

@beldmit
Copy link
Contributor

beldmit commented Jun 4, 2020

Wow!

With patch, clang 10
beldmit@manul$ ./test.sh
sign: 5743.000718/s
verify: 2561.475410/s
derive: 3263.973888/s

Without patch:
sign: 1306.975984/s
verify: 1427.755568/s
derive: 1366.586949/s

GCC 8.3, with patch:
sign: 4132.231405/s
verify: 1811.594203/s
derive: 2173.322467/s

GCC 8.3, without patch:
sign: 1313.844638/s
verify: 1432.664756/s
derive: 1360.081605/s

@bbbrumley
Copy link
Contributor Author

Excellent! Thanks for the sanity check :) We'll keep at it.

If you have any comments outside the BEGIN verbatim fiat .. END verbatim fiat, those are welcome :)

@vt-alt
Copy link
Member

vt-alt commented Jun 4, 2020

This looks awesome. I use benchmarker from #264.

Before patch is applied (time, cycles, instrs):

:~/src/gost-engine (sign-time)$ bin/sign-time -p paramset:TCA -c 11111
Using gost2012_256 paramset:TCA
min 1523 max 1697 (microseconds)
  1523: 1796: ################################################################################...
  1530: 4615: ################################################################################...
  1537: 1668: ################################################################################...
  1544:  722: ################################################################################...
  1551:  282: ################################################################################...
  1558:  158: ################################################################################...
  1565:  139: ################################################################################...
  1572:   86: ################################################################################...
  1579:   96: ################################################################################...
  1586:  115: ################################################################################...
  1593:   49: ###
  1600:   40: ###
  1607:   21: #
  1614:   20: #
  1621:   11:
  1628:   15: #
  1635:   12:
  1642:   27: ##
  1649:   32: ##
  1656:   36: ##
  1663:   53: ####
  1670:   19: #
  1677:   11:
  1684:  204: ################################################################################...
  1691:  548: ################################################################################...
:~/src/gost-engine (sign-time)$ bin/sign-time -p paramset:TCA -c 11111 -P0
Using gost2012_256 paramset:TCA
min 4271288 max 4348762 (cpu cycles)
4271288:   21: #
4274809:   70: #####
4278330:  264: ################################################################################...
4281851:  573: ################################################################################...
4285372:  946: ################################################################################...
4288893: 1229: ################################################################################...
4292414: 1165: ################################################################################...
4295935: 1037: ################################################################################...
4299456:  880: ################################################################################...
4302977:  842: ################################################################################...
4306498:  722: ################################################################################...
4310019:  618: ################################################################################...
4313540:  509: ################################################################################...
4317061:  404: ################################################################################...
4320582:  361: ################################################################################...
4324103:  276: ################################################################################...
4327624:  249: ################################################################################...
4331145:  217: ################################################################################...
4334666:  151: ################################################################################...
4338187:   99: ################################################################################...
4341708:   80: ################################################################################...
4345229:   61: ####
4348750:    1:
:~/src/gost-engine (sign-time)$ bin/sign-time -p paramset:TCA -c 11111 -P1
Using gost2012_256 paramset:TCA
min 6108840 max 6205585 (instructions)
6108840:   49: ###
6113237:  399: ################################################################################...
6117634: 1735: ################################################################################...
6122031: 2979: ################################################################################...
6126428: 2327: ################################################################################...
6130825:  694: ################################################################################...
6135222:   81: ################################################################################...
6139619:    6:
6144016:    2:
6148413:   12:
6152810:  136: ################################################################################...
6157207:  514: ################################################################################...
6161604:  824: ################################################################################...
6166001:  496: ################################################################################...
6170398:  145: ################################################################################...
6174795:   14: #
6179192:    1:
6183589:    0:
6187986:    5:
6192383:   35: ##
6196780:  139: ################################################################################...
6201177:  181: ################################################################################...
6205574:    1:
:~/src/gost-engine (sign-time)$ 

After patch is applied (time, cycles, instrs):

:~/src/gost-engine (sign-time)$ git cherry-pick 5c241bc1d991c0794720dbcb15fc141232bc267f

:~/src/gost-engine (sign-time)$ bin/sign-time -p paramset:TCA -c 11111
Using gost2012_256 paramset:TCA
min 226 max 267 (microseconds)
   226:  949: ################################################################################...
   227: 1300: ################################################################################...
   228:   84: ################################################################################...
   229:   27: ##
   230:   13: #
   231:   51: ###
   232:   29: ##
   233:   26: ##
   234:  331: ################################################################################...
   235:  255: ################################################################################...
   236:   23: #
   237:   12:
   238:   11:
   239:    8:
   240:    7:
   241:   35: ##
   242:  120: ################################################################################...
   243:   12:
   244:    6:
   245:    2:
   246:    6:
   247:    5:
   248:    4:
   249:   94: ################################################################################...
   250: 3367: ################################################################################...
   251: 1789: ################################################################################...
   252:  155: ################################################################################...
   253:   54: ####
   254:   48: ###
   255:  125: ################################################################################...
   256:   61: ####
   257:   38: ##
   258:  238: ################################################################################...
   259:  961: ################################################################################...
   260:  169: ################################################################################...
   261:   42: ###
   262:   16: #
   263:   29: ##
   264:   30: ##
   265:   18: #
   266:   35: ##
   267:  190: ################################################################################...
:~/src/gost-engine (sign-time)$ bin/sign-time -p paramset:TCA -c 11111 -P0
Using gost2012_256 paramset:TCA
min 635446 max 680937 (cpu cycles)
635446: 3892: ################################################################################...
637513: 3450: ################################################################################...
639580:  554: ################################################################################...
641647:  205: ################################################################################...
643714:   61: ####
645781:   37: ##
647848:   29: ##
649915:    5:
651982:   12:
654049:   14: #
656116:   62: ####
658183: 1558: ################################################################################...
660250:  287: ################################################################################...
662317:   97: ################################################################################...
664384:   38: ##
666451:   24: #
668518:   12:
670585:    7:
672652:    4:
674719:    4:
676786:   27: ##
678853:  394: ################################################################################...
680920:    2:
:~/src/gost-engine (sign-time)$ bin/sign-time -p paramset:TCA -c 11111 -P1
Using gost2012_256 paramset:TCA
min 1410417 max 1489002 (instructions)
1410417: 8327: ################################################################################...
1413989:    0:
1417561:    0:
1421133:    0:
1424705:    0:
1428277:    1:
1431849:    0:
1435421:    0:
1438993:    0:
1442565:    0:
1446137:  640: ################################################################################...
1449709: 1414: ################################################################################...
1453281:    0:
1456853:    0:
1460425:    0:
1463997:    0:
1467569:    0:
1471141:    0:
1474713:    0:
1478285:    0:
1481857:    0:
1485429:  355: ################################################################################...
1489001:   38: ##
:~/src/gost-engine (sign-time)$

@beldmit
Copy link
Contributor

beldmit commented Jun 4, 2020

Brilliant result!

@beldmit
Copy link
Contributor

beldmit commented Jun 4, 2020

Stupid question no 1: why the sign is 4,5 times faster and verify only 2?

@bbbrumley
Copy link
Contributor Author

bbbrumley commented Jun 5, 2020

Stupid question no 1: why the sign is 4,5 times faster and verify only 2?

It's not a dumb question at all :)

For verify, OpenSSL right now isn't using a constant time algorithm, so it's actually pretty OK efficient. (All the changes in the OpenSSL EC module over the past few years don't affect that code path -- only keygen, sign, and derive.)

The same goes for the code changes here, in the sense that

  • you'll get constant time, formally verified paths for keygen, sign, and derive.
  • you'll get a formally verified path for verify

Signature verify isn't normally made constant time because there are no secrets involved.

@beldmit
Copy link
Contributor

beldmit commented Jun 5, 2020

So. We had a constant-time implementation, that became 4 times faster after replacing the multiplication with a new one. Non-constant-time implementation has become "just" 2 times faster. So the verification now is two times slower than signing.

Do I correctly understand your point?

@bbbrumley
Copy link
Contributor Author

Do I correctly understand your point?

Yes that's the situation :)

In general the verification will be slower because it takes the form aG + bP where signing is just kG.

The fact that it may not currently hold in OpenSSL master is because the CT point multiplication (kG) is not efficient -- so slow in fact that it's possible the non-CT verification (aG + bP) outperforms it.

@bbbrumley
Copy link
Contributor Author

a190178 should be all the curves now ...

@vt-alt
Copy link
Member

vt-alt commented Jun 5, 2020

@bbbrumley Another question - why being constant time algorithm it still show distribution of execution times in my benchmarks?

@bbbrumley
Copy link
Contributor Author

Another question - why being constant time algorithm it still show distribution of execution times in my benchmarks?

Also a good question :)

While in general, CT means "not dependent on the key", the issues for gost-engine are likely deeper than that.

Having these full stack EC implementations will let gost-engine eventually detatch from OpenSSL's ec module. But you're still dependent on the bn module, for various other arithmetic in signing and verifying, and even key agreement.

From experience, I know a few spots to look so when I carve out some time I'll go through those and submit PRs.

@vt-alt
Copy link
Member

vt-alt commented Jun 5, 2020

IC. Yes, BN_lebin2bn/BN_bn2lebinpad, and even CRYPTO_memcmp looks non-regular at all.

@beldmit
Copy link
Contributor

beldmit commented Jun 5, 2020

Without optimization:

Benchmarking gost2012_256
sign,   alg gost2012_256, params A: 1258.059443/s
verify, alg gost2012_256, params A: 1438.434983/s
derive, alg gost2012_256, params A: 1344.989913/s
sign,   alg gost2012_256, params B: 1317.523057/s
verify, alg gost2012_256, params B: 1511.487304/s
derive, alg gost2012_256, params B: 1368.457065/s
sign,   alg gost2012_256, params C: 1319.478806/s
verify, alg gost2012_256, params C: 1467.997651/s
derive, alg gost2012_256, params C: 1364.721938/s
sign,   alg gost2012_256, params TCA: 1329.566229/s
verify, alg gost2012_256, params TCA: 1400.560224/s
derive, alg gost2012_256, params TCA: 1361.933946/s
Benchmarking gost2012_512
sign,   alg gost2012_512, params A: 316.718793/s
verify, alg gost2012_512, params A: 403.551251/s
derive, alg gost2012_512, params A: 319.769766/s
sign,   alg gost2012_512, params B: 316.869331/s
verify, alg gost2012_512, params B: 413.359788/s
derive, alg gost2012_512, params B: 317.965024/s
sign,   alg gost2012_512, params C: 315.345500/s
verify, alg gost2012_512, params C: 377.928949/s
derive, alg gost2012_512, params C: 317.952387/s

With optimization:

Benchmarking gost2012_256
sign,   alg gost2012_256, params A: 4262.120405/s
verify, alg gost2012_256, params A: 1704.158146/s
derive, alg gost2012_256, params A: 2127.093858/s
sign,   alg gost2012_256, params B: 5524.861878/s
verify, alg gost2012_256, params B: 2107.925801/s
derive, alg gost2012_256, params B: 2689.075630/s
sign,   alg gost2012_256, params C: 4217.185029/s
verify, alg gost2012_256, params C: 1587.301587/s
derive, alg gost2012_256, params C: 1978.728667/s
sign,   alg gost2012_256, params TCA: 6334.125099/s
verify, alg gost2012_256, params TCA: 2587.991718/s
derive, alg gost2012_256, params TCA: 3318.125259/s
Benchmarking gost2012_512
sign,   alg gost2012_512, params A: 994.530085/s
verify, alg gost2012_512, params A: 277.161863/s
derive, alg gost2012_512, params A: 333.069653/s
sign,   alg gost2012_512, params B: 1392.272885/s
verify, alg gost2012_512, params B: 385.921581/s
derive, alg gost2012_512, params B: 467.508181/s
sign,   alg gost2012_512, params C: 1395.916943/s
verify, alg gost2012_512, params C: 421.727395/s
derive, alg gost2012_512, params C: 506.104890/s

@beldmit
Copy link
Contributor

beldmit commented Jun 5, 2020

Optimized version:

Benchmarking gost2012_256
keygen, alg gost2012_256, params A: 4242.681375/s
sign,   alg gost2012_256, params A: 4566.210046/s
verify, alg gost2012_256, params A: 1707.650273/s
derive, alg gost2012_256, params A: 2130.833156/s
keygen, alg gost2012_256, params B: 5512.679162/s
sign,   alg gost2012_256, params B: 5534.034311/s
verify, alg gost2012_256, params B: 2122.691573/s
derive, alg gost2012_256, params B: 2683.843264/s
keygen, alg gost2012_256, params C: 4166.666667/s
sign,   alg gost2012_256, params C: 4224.757076/s
verify, alg gost2012_256, params C: 1592.863969/s
derive, alg gost2012_256, params C: 1982.160555/s
keygen, alg gost2012_256, params TCA: 6199.628022/s
sign,   alg gost2012_256, params TCA: 6337.135615/s
verify, alg gost2012_256, params TCA: 2585.315408/s
derive, alg gost2012_256, params TCA: 3324.468085/s
Benchmarking gost2012_512
keygen, alg gost2012_512, params A: 987.751877/s
sign,   alg gost2012_512, params A: 994.530085/s
verify, alg gost2012_512, params A: 277.592716/s
derive, alg gost2012_512, params A: 334.235770/s
keygen, alg gost2012_512, params B: 1384.083045/s
sign,   alg gost2012_512, params B: 1392.951665/s
verify, alg gost2012_512, params B: 386.622849/s
derive, alg gost2012_512, params B: 465.918092/s
keygen, alg gost2012_512, params C: 1383.508578/s
sign,   alg gost2012_512, params C: 1395.283940/s
verify, alg gost2012_512, params C: 421.141293/s
derive, alg gost2012_512, params C: 504.439064/s

Non-optimized version:

Benchmarking gost2012_256
keygen, alg gost2012_256, params A: 1262.785705/s
sign,   alg gost2012_256, params A: 1288.161793/s
verify, alg gost2012_256, params A: 1446.549978/s
derive, alg gost2012_256, params A: 1342.281879/s
keygen, alg gost2012_256, params B: 1324.678765/s
sign,   alg gost2012_256, params B: 1331.380642/s
verify, alg gost2012_256, params B: 1544.878727/s
derive, alg gost2012_256, params B: 1353.912808/s
keygen, alg gost2012_256, params C: 1312.680494/s
sign,   alg gost2012_256, params C: 1327.668614/s
verify, alg gost2012_256, params C: 1458.151064/s
derive, alg gost2012_256, params C: 1365.933616/s
keygen, alg gost2012_256, params TCA: 1324.854266/s
sign,   alg gost2012_256, params TCA: 1333.688984/s
verify, alg gost2012_256, params TCA: 1407.657658/s
derive, alg gost2012_256, params TCA: 1357.404642/s
Benchmarking gost2012_512
keygen, alg gost2012_512, params A: 315.079715/s
sign,   alg gost2012_512, params A: 317.077811/s
verify, alg gost2012_512, params A: 406.967280/s
derive, alg gost2012_512, params A: 318.674315/s
keygen, alg gost2012_512, params B: 318.197728/s
sign,   alg gost2012_512, params B: 316.646085/s
verify, alg gost2012_512, params B: 421.389743/s
derive, alg gost2012_512, params B: 316.866821/s
keygen, alg gost2012_512, params C: 314.257880/s
sign,   alg gost2012_512, params C: 314.841635/s
verify, alg gost2012_512, params C: 374.770453/s
derive, alg gost2012_512, params C: 315.377823/s

@bbbrumley
Copy link
Contributor Author

For me, knowing our EC layer, the gost2012_512:C is a bit disappointing -- e.g. compared to gost2012_512:B :\

I was hoping to see more along the lines of gost2012_256:TCA where you can see the Edwards birational map is really paying off internally compared to the legacy 256-bit curves!

We'll investigate more next week.

@beldmit
Copy link
Contributor

beldmit commented Jun 5, 2020

I also see some regression in 512 A/B verification.
gost_speedup.xlsx

@bbbrumley
Copy link
Contributor Author

Huh, yea.

The signature verification path was more of an afterthought -- we hacked it together in about 15min just to get more complete coverage.

We'll think about how we can improve that one. (It should be easy since there are no CT requirements, even if the Fiat GF layer will anyway be CT.)

Basically you're seeing OpenSSL's variable time code performing nicely for larger curve sizes.

@beldmit
Copy link
Contributor

beldmit commented Jun 5, 2020

Many thanks anyway!

@beldmit
Copy link
Contributor

beldmit commented Jun 5, 2020

Do I correctly understand, that some code is automatically generated and some (minor) is hand-written? And, if new curves appear, the code may be regenerated?

@bbbrumley
Copy link
Contributor Author

Do I correctly understand, that some code is automatically generated and some (minor) is hand-written? And, if new curves appear, the code may be regenerated?

Technically speaking, everything is automatically generated -- it's just about where the templates come from (we are using python mako templating):

  • The 90% that is the Fiat GF layer, that's included verbatim
  • The 5% that is the EC layer, we wrote that but yes it's a mako template
  • The 5% that is rigging / plumbing, we wrote that but yes it's a mako template

Above also highlights the fact that formal verification only covers the GF layer -- so we focused the EC layer on simplicity and correctness for review, coupled with lots of unit testing.

Adding a new curve takes less than 5 minutes:

  • Generate the GF layer with Fiat's tooling
  • Insert the curve parameters in JSON
  • Do some quick optimization of GF inversion by running an included tool that is not extremely robust yet (optional)

We have about 20 curves right now in our experiments, including all the GOST ones.

@vt-alt
Copy link
Member

vt-alt commented Jun 5, 2020

Btw, are there many code duplications now? Hard to compare since all functions named by curve.

@vt-alt
Copy link
Member

vt-alt commented Jun 5, 2020

@bbbrumley Also, could you provide script or text description of complete or just example for reproduction of generated code? This would be nice to include in the sources for the future generations.

@bbbrumley
Copy link
Contributor Author

Btw, are there many code duplications now? Hard to compare since all functions named by curve.

I ... don't know. The Fiat code is, by nature, specific to the finite field. For example, we generated the id_tc26_gost_3410_2012_256_paramSetA code with:

./word_by_word_montgomery --static id_tc26_gost_3410_2012_256_paramSetA 64 '2^256 - 617'

Given that the Fiat lines are the vast majority of the code (try grep -n 'verbatim.fiat'), we didn't put much though into it since one of the goals (for us) was to preserve the formal verification guarantees of Fiat. And since we are not formal verification people, to us that means "don't touch it" :)

Also, could you provide script or text description of complete or just example for reproduction of generated code?

Absolutely! We'll have all the tooling source code up within a few weeks. It's on git already and CI is running now, but I hope to extend the CI to integrations. The rigging supports several third party projects (gost-engine, OpenSSL, ...) so it would be nice to have the CI not just run local testing but also pull in those projects to

  • ensure seamless integration
  • sanity check the rigging hasn't grown stale
  • leverage their regression testing in CI, too

@beldmit
Copy link
Contributor

beldmit commented Jun 6, 2020

Btw, are there many code duplications now? Hard to compare since all functions named by curve.

Not many. vimdiff does not show many matching parts.

@vt-alt
Copy link
Member

vt-alt commented Jun 7, 2020

@bbbrumley Thanks. What do you think - is it possible (and meaningful) to make statistical test to determine if implementation have time leaks?

@bbbrumley
Copy link
Contributor Author

What do you think - is it possible (and meaningful) to make statistical test to determine if implementation have time leaks?

Some timing leaks are here. The issue is BN_mod_mul and BN_mod_add are not constant time. Secret arguments there include priv_key and the nonce k, which are both security critical. These are very unlikely to be remotely exploitable. Only local attackers via uarch timings.

You can see the rich history of similar ECDSA fixes in OpenSSL.

I can help but it needs to be part of larger restructuring of ECC stuff in gost-engine. Basically the EC_GROUP structure needs to go away, and introduce a custom internal structure with NIDs, selective parameters, a Montgomery context, and function pointers instead of the gost_ec_point_mul shim this PR is using. (Which is a bit of a hack, but we can't solve everything at once.) This would properly detach gost-engine from the OpenSSL ec module.

In terms of this PR, yea the rigging parts are the place to look for leaks. You spotted BN_bn2lebinpad and CRYPTO_memcmp. I don't think the CRYPTO_memcmp calls are very meaningful, but I used the "secure" variant instead of a normal memcmp just as good practice. In the signing case, the output point coordinates are public anyway. In the key agreement case, it's just not very useful information -- an attacker could just guess anyway without a side channel, and try.

Yes BN_bn2lebinpad is critical. There's some mitigations there in the OpenSSL code, but I haven't evaluated them and offhand I don't know of any research in that direction. I do know it's used all over the place internally in OpenSSL for similar purposes (i.e. "OpenSSL has this thing and the code I need to call needs bytes").

When things are restructured, potentially both the BN_bn2lebinpad and CRYPTO_memcmp calls can go away. At the end of the day, the interface the EC code gives you here works generically on bytes -- so we can just feed it bytes.

@vt-alt
Copy link
Member

vt-alt commented Jun 8, 2020

Btw, I was wrong to say CRYPTO_memcmp is not regular, since I accidentally looked wrong implementation that defines it as memcmp.

@bbbrumley bbbrumley changed the title WiP: GOST ECC optimization GOST ECC optimizations Jul 2, 2020
@bbbrumley
Copy link
Contributor Author

bbbrumley commented Jul 2, 2020

Hey Folks,

I think we are getting closer to merge quality here.

See what your benchmarks say for this latest push. We haven't done extensive benchmarking yet, but spot testing looks pretty good.

Edit: Oh right and you also get 32-bit support here, which is also the codepath it'll take for compilers / archs that don't support uint128_t (e.g. MSVC).

@beldmit
Copy link
Contributor

beldmit commented Jul 2, 2020

Wow!!! Great job!

It may be even faster than Ed25519!

gost_speedup.xlsx

Standalone EC implementations from ECCKiila.

https://gitlab.com/nisec/ecckiila
@bbbrumley
Copy link
Contributor Author

Wow!!! Great job!

It may be even faster than Ed25519!

Excellent :) Yea, they look good.

The speedup over what we had before is a combination of:

  1. Better Fiat arithmetic strategy for curves with primes of special form 2^n - c or 2^n + c for small c.
  2. Our prior scalar multiplication strategy was more "proof-of-concept" to make sure stuff was working. Here we increased the window size and used an intelligent recoding of scalars, so less EC arith ops but still CT.
  3. We optimized point_mul_two, that is not CT, a little.

I think this is pretty much ready.

After merging (when you're comfortable), my team's next steps are removing the gost_ec_point_mul shim for you, come up with internal GOST_EC_GROUP and GOST_EC_KEY structures, and detach you from EC_GROUP, EC_KEY, and the rest of OpenSSL's ec module. Does that sound sane?

@beldmit
Copy link
Contributor

beldmit commented Jul 3, 2020

Yes, that sound definitely sane, but could you please clarify the purpose of moving to GOST-specific structures instead of openssl native ones?

@bbbrumley
Copy link
Contributor Author

could you please clarify the purpose of moving to GOST-specific structures instead of openssl native ones

There are several reasons, a few here but just my opinions:

  1. Your GOST groups are fixed -- currently you are constructing them dynamically, and not using most of the OpenSSL functionality behind them; just using them for storage, basically, and after this PR even most of the EC functionality behind them goes away. What you want to do is something like group->point_mul and group->point_mul_g and group->point_mul_two instead of the silly switch statement in this PR. (Although internally in OpenSSL that's actually part of the EC_METHOD so the analogue there would be group->meth->mul with a very poorly thought out API.)

  2. If EC_GROUP is going away, I think EC_KEY almost has to go away? Really you are only using these for storage as well, and to fetch the associated EC_GROUP. Try to back up and think about what your keys really are -- they are just bytes. Since GOST has its own OIDs not under the legacy EC hierarchy, you can do basically whatever you want. And you have your own ASN1 encodings, anyway -- you are not using that part of the EC module.

Another way to look at it, is right now you're going gost-engine -> EC module -> gost-engine basically just for arithmetic. After this PR, you're going gost-engine -> EC module -> gost-engine -> ECCKiila -> gost-engine -> EC module -> gost-engine.

When IMO the better path is gost-engine -> ECCKiila -> gost-engine.

Even another, perhaps outsider, view is: If I were to integrate the ECC parts of GOST in OpenSSL now, I would have it as separate as possible from the EC module. It is similar to the 25519 and 448 situation -- yes it is ECC, but you have your own OID hierarchy and can do what you want. The disaster situation is like SM2, where the cryptosystems are totally different but they still put their OIDs in the legacy EC hierarchy: so they are linked forever.

@beldmit
Copy link
Contributor

beldmit commented Jul 3, 2020

Got it. Many thanks for the clarification!

@beldmit beldmit merged commit bc34620 into gost-engine:master Jul 3, 2020
@beldmit
Copy link
Contributor

beldmit commented Jul 3, 2020

Merged. Thanks!

@beldmit
Copy link
Contributor

beldmit commented Jul 3, 2020

Could you also please provide some HOWTO on adding new curves, if any?

@bbbrumley
Copy link
Contributor Author

Could you also please provide some HOWTO on adding new curves, if any?

Sure :) We've got an adding a new curve section, but we need to expand it better for curves with birational equivalences. The oneliner I used there for unifdef in the "Downstream projects: stripping the source for rigging" is accurate, and I'll try to keep it up to date. For example, the GOST one I just used was

cd ecp/
for d in id_*; do unifdef $d/ecp_$d.c -DRIG_GOST -UOPENSSL_BUILDING_OPENSSL -ULIB_TEST > /path/to/gost-engine/ecp_$d.c; done

I will have to update the gost-engine rigging once we introduce the new structures for you. But for now, I hope the changes that are needed to pull in from ECCKiila will be minimal.

@vt-alt
Copy link
Member

vt-alt commented Jul 4, 2020

@bbbrumley How can we be sure that this generated implementations (btw, >5 MB of generated code) does not produce incorrect results for some rare curve points?

@bbbrumley
Copy link
Contributor Author

bbbrumley commented Jul 5, 2020

How can we be sure that this generated implementations (btw, >5 MB of generated code) does not produce incorrect results for some rare curve points

It's a good question! I don't think you can be sure. Here is what you can do:

  1. For most of the GF arithmetic layer: you can re-generate the Fiat code. That part has formal verification guarantees.
  2. For most of the EC arithmetic layer: if you look at point_add_mixed, point_add_proj, and point_dbl, the comments above the functions will tell you where the formula came from. It will be one of these files depending on the curve. All of the formulae we used in ECCKiila are complete and exception free. What it means is "rare" (yet critical for functional correctness) cases like P+Q where Q=P, Q=-P, Q=inf, P=inf, or cases like 2P where P=-P, P=inf -- these "just work" with the math and we don't need special control flow logic to handle them.
  3. Our unit test generator is here. I like to think it has pretty good coverage for corner cases, but I'm biased because I wrote it. PRs welcome (keeping in mind ECCKiila runs CI on many curves so there is only so much regression testing we can do. I think one useful test for coverage would be an analogue of OpenSSL's ecstresstest. The downside is we'd have to have a different way of generating those tests instead of "on-the-fly" -- it would slow down our development process too much over all curves.
  4. About size: yes it's a lot of code! But on a particular build, for sure less than half of that goes in. If you follow the first ifdef, roughly the top half of the file is for 64-bit, and the bottom half 32-bit. I'm guessing 64-bit builds are more common for gost-engine, so it'll actually be less than half -- the EC layer will be roughly the same size, but the Fiat GF layer will be smaller for 64-bit fields.
  5. You can regenerate the complete source with these instructions.

Limitations that I have not mentioned yet for gost-engine are:

  1. You cannot use negative scalars.
  2. You cannot use scalars where the number of bytes exceeds the number of bytes in the curve prime p.

These are cases that never happen in gost-engine due to the nature of GOST cryptosystems (I would argue it is a general property of EC cryptosystems). Nevertheless, I hope to remove these limitations after restructuring some gost-engine internals and modifying the rigging to use the new structures.

If you want, you could create your own version of ecstresstest (using a version before this PR was merged as the ground truth) and integrate this as a unit test in gost-engine, independent of ECCKiila. Maybe somehow rigging it as an "optional" or "extended" test, so it doesn't slow down your development too much. It seems like it might fit nicely here if you add a CLI switch.

Edit: FYI, if you do the ecstresstest thing, you'll need to use the y-coord to do the walk, or sth like k=x+2 or k = x + y mod 2**n for an n-bit curve prime p. This is because curves where you have x=0 points in the main subgroup will be a degenerate walk.

@vt-alt
Copy link
Member

vt-alt commented Jul 5, 2020

@bbbrumley Thank you much for detailed answer!

@vt-alt
Copy link
Member

vt-alt commented Jul 29, 2020

I want to and will implement ecstresstest-like testing. But, later when I get some good time for that. Thanks again for valuable suggestions.

@bbbrumley
Copy link
Contributor Author

Thanks again for valuable suggestions.

YW!

I added some limited ecstresstest to ECCKiila here. But it's not enabled by default in the CI, only run manually. Because ... well ... once it is passing, there isn't much to say about the test.

It's currently passing on 2**16 iterations for all curves, GOST included.

@beldmit
Copy link
Contributor

beldmit commented Jul 29, 2020

When testing libraries, it may be worth testing public and private keys with leading or final zero bytes to catch serialisation issues, but I suspect it's out of scope of ECCKiila

beldmit pushed a commit that referenced this pull request Aug 7, 2020
Standalone EC implementations from ECCKiila.

https://gitlab.com/nisec/ecckiila
(cherry picked from commit bc34620)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants