Windowed GLV acceleration - 25% faster signing on G1 #74

Merged: 3 commits, Aug 24, 2020

@mratsim mratsim commented Aug 24, 2020

This introduces window optimization for endomorphism acceleration.

At the slight cost of a precomputed table of 8 EC points (compared to 16 or 32 for the traditional NAF representation), scalar multiplication is about 25% faster than the existing endomorphism-accelerated version, and it stays fully constant-time.
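To make the table size concrete, here is a minimal sketch of a window-2, two-scalar, sign-aligned recoding. It is an illustration under stated assumptions, not a transcription of the code in this PR: integers modulo a prime stand in for curve points (so "doubling" is `2 * x` and the endomorphism image `P1` is just another integer), the scalar is assumed to be already split into `k0` and `k1`, `k0` is assumed odd, and all function names are hypothetical. Python integers are not constant-time, so only the structure of the loop carries over.

```python
import random

def sac_recode(k0, k1, nbits):
    """Sign-aligned recoding of (k0, k1); k0 odd, both below 2**(nbits - 1)."""
    assert k0 & 1 == 1 and k0 < (1 << (nbits - 1)) and k1 < (1 << (nbits - 1))
    # Every digit of b0 is +1 or -1, and the top digit is always +1.
    b0 = [2 * ((k0 >> (i + 1)) & 1) - 1 for i in range(nbits - 1)] + [1]
    b1 = []
    for i in range(nbits):
        d = b0[i] * (k1 & 1)       # 0, or the same sign as b0[i]
        b1.append(d)
        k1 = (k1 >> 1) - (d // 2)  # d // 2 is -1 only when d == -1
    assert k1 == 0
    return b0, b1

def build_table(P0, P1, n):
    """8 entries: (2 + t) * P0 + (2 * u_hi + t * u_lo) * P1 with t = +1 or -1.

    A real implementation would build these with point additions; integer
    multiplication is a shortcut that only works in this toy group.
    """
    table = [0] * 8
    for u_hi in (0, 1):
        for t_neg in (0, 1):
            for u_lo in (0, 1):
                t = -1 if t_neg else 1
                idx = (u_hi << 2) | (t_neg << 1) | u_lo
                table[idx] = ((2 + t) * P0 + (2 * u_hi + t * u_lo) * P1) % n
    return table

def scalar_mul_w2(k0, k1, P0, P1, n, nbits):
    """Compute k0 * P0 + k1 * P1 (mod n), consuming two digit columns per step."""
    assert nbits % 2 == 0
    b0, b1 = sac_recode(k0, k1, nbits)
    table = build_table(P0, P1, n)
    acc = 0
    for j in reversed(range(nbits // 2)):
        hi, lo = 2 * j + 1, 2 * j
        s = b0[hi]                               # sign factored out of the window
        t_neg = 1 if s * b0[lo] == -1 else 0
        idx = (abs(b1[hi]) << 2) | (t_neg << 1) | abs(b1[lo])
        acc = (4 * acc + s * table[idx]) % n     # 2 doublings, 1 signed addition
    return acc

if __name__ == "__main__":
    n = 2**61 - 1                                # toy group order
    rng = random.Random(0)
    P0, P1 = rng.randrange(1, n), rng.randrange(1, n)
    k0, k1 = rng.randrange(1 << 15) | 1, rng.randrange(1 << 15)
    print(scalar_mul_w2(k0, k1, P0, P1, n, 16) == (k0 * P0 + k1 * P1) % n)
```

Factoring the sign of the high `b0` digit out of each window is what leaves 3 free bits per window, hence 8 table entries instead of 16.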

Benchmarks against the other scalar multiplication implementations:

Compiled with Clang
Optimization level => 
  no optimization: false
  release: true
  danger: true
  inline assembly: true
Using Constantine with 64-bit limbs
Running on Intel(R) Core(TM) i9-9980XE CPU @ 3.00GHz

⚠️ Cycles measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
i.e. a 20% overclock will be about 20% off (assuming no dynamic frequency scaling)

=================================================================================================================

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC Add G1                                                    ECP_SWei_Proj[Fp[BN254_Snarks]]              3731343.284 ops/s           268 ns/op           805 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC Double G1                                                 ECP_SWei_Proj[Fp[BN254_Snarks]]              6172839.506 ops/s           162 ns/op           488 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul G1 (unsafe reference DoubleAdd)                 ECP_SWei_Proj[Fp[BN254_Snarks]]                12627.857 ops/s         79190 ns/op        237575 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul Generic G1 (scratchsize = 4)                    ECP_SWei_Proj[Fp[BN254_Snarks]]                 8847.444 ops/s        113027 ns/op        339087 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul Generic G1 (scratchsize = 8)                    ECP_SWei_Proj[Fp[BN254_Snarks]]                12564.393 ops/s         79590 ns/op        238775 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul Generic G1 (scratchsize = 16)                   ECP_SWei_Proj[Fp[BN254_Snarks]]                14145.272 ops/s         70695 ns/op        212089 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul Generic G1 (scratchsize = 32)                   ECP_SWei_Proj[Fp[BN254_Snarks]]                14669.425 ops/s         68169 ns/op        204511 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul G1 (endomorphism accelerated)                   ECP_SWei_Proj[Fp[BN254_Snarks]]                17086.131 ops/s         58527 ns/op        175583 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul Window-2 G1 (endomorphism accelerated)          ECP_SWei_Proj[Fp[BN254_Snarks]]                23229.343 ops/s         43049 ns/op        129149 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC Add G1                                                    ECP_SWei_Proj[Fp[BLS12_381]]                 2114164.905 ops/s           473 ns/op          1419 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC Double G1                                                 ECP_SWei_Proj[Fp[BLS12_381]]                 3460207.612 ops/s           289 ns/op           867 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul G1 (unsafe reference DoubleAdd)                 ECP_SWei_Proj[Fp[BLS12_381]]                    7083.658 ops/s        141170 ns/op        423516 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul Generic G1 (scratchsize = 4)                    ECP_SWei_Proj[Fp[BLS12_381]]                    5057.888 ops/s        197711 ns/op        593140 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul Generic G1 (scratchsize = 8)                    ECP_SWei_Proj[Fp[BLS12_381]]                    7183.650 ops/s        139205 ns/op        417620 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul Generic G1 (scratchsize = 16)                   ECP_SWei_Proj[Fp[BLS12_381]]                    8159.535 ops/s        122556 ns/op        367674 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul Generic G1 (scratchsize = 32)                   ECP_SWei_Proj[Fp[BLS12_381]]                    8497.621 ops/s        117680 ns/op        353044 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul G1 (endomorphism accelerated)                   ECP_SWei_Proj[Fp[BLS12_381]]                    9772.878 ops/s        102324 ns/op        306976 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul Window-2 G1 (endomorphism accelerated)          ECP_SWei_Proj[Fp[BLS12_381]]                   13148.725 ops/s         76053 ns/op        228161 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
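The "25% faster" figure can be cross-checked from the ns/op column above. The snippet below is only arithmetic on the numbers already listed; the dictionary keys are labels chosen here for illustration.

```python
# ns/op values copied from the benchmark output above.
NS_PER_OP = {
    "BN254_Snarks": {"generic_32": 68169, "endo": 58527, "endo_window2": 43049},
    "BLS12_381": {"generic_32": 117680, "endo": 102324, "endo_window2": 76053},
}

def time_reduction(before_ns, after_ns):
    """Fraction of the time per operation that is saved."""
    return 1.0 - after_ns / before_ns

for curve, t in NS_PER_OP.items():
    vs_endo = time_reduction(t["endo"], t["endo_window2"])
    vs_generic = time_reduction(t["generic_32"], t["endo_window2"])
    print(f"{curve}: {vs_endo:.1%} less time than endomorphism accelerated, "
          f"{vs_generic:.1%} less than generic (scratchsize = 32)")
```

On both curves the window-2 variant takes roughly a quarter less time per operation than the plain endomorphism-accelerated version, and over a third less than the generic implementation with a 32-entry scratch space.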

Further optimization might involve:

  • Mixed addition, provided we have simultaneous inversion (Simultaneous inversion #49). Note that at the moment an Fp inversion requires about 50k cycles, which is almost 25% of the scalar multiplication time, while each multiplication step is only 2 doublings and 1 addition, so there is a significant overhead to overcome.
  • Shortcut addition that does not handle infinity points or the doubling/negation cases. The infinity check is only needed once, and the double-double-add sequence should never (proof?) trigger a case where 4Q == ±P for a P from the precomputed table, hence we don't need to pay the 40% penalty of the exception-free formulae.
  • Jacobian coordinates, in particular composite formulae such as quadrupling, or doubling followed by double+add, from Composites Double-Add 2P+Q, tripling, quadrupling, quintupling, octupling #35.
  • Faster conditional select from the table via SSE or AVX: while cmov is plenty fast on just 4 or 6 limbs, the vectorized versions are probably faster when we need to iterate over 6 limbs * 3 coordinates * 8 table elements.
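Two of the items above lend themselves to short sketches. The first function shows simultaneous inversion using Montgomery's trick, which is one common way to do it and is assumed here rather than taken from #49: n field elements are inverted with a single inversion plus about 3(n - 1) multiplications. The second shows the access pattern of a table select that touches every entry and picks the wanted one with a mask instead of a branch or an indexed load. Function names and the limb layout are hypothetical, and Python integers are not constant-time, so this only illustrates the structure.

```python
def batch_inverse(values, p):
    """Invert every nonzero value modulo the prime p with one field inversion."""
    prefix, acc = [], 1
    for v in values:
        prefix.append(acc)            # product of all earlier values
        acc = acc * v % p
    inv = pow(acc, p - 2, p)          # the only inversion (Fermat, p prime)
    out = [0] * len(values)
    for i in reversed(range(len(values))):
        out[i] = inv * prefix[i] % p  # 1 / values[i]
        inv = inv * values[i] % p     # drop values[i] from the running inverse
    return out

def ct_select(table, index, limb_bits=64):
    """Scan the whole table and keep entry `index` using masks only."""
    full = (1 << limb_bits) - 1
    out = [0] * len(table[0])
    for i, entry in enumerate(table):
        diff = i ^ index
        is_eq = ((diff - 1) >> limb_bits) & 1  # 1 when i == index, else 0
        mask = -is_eq & full                   # all ones or all zeros
        for j, limb in enumerate(entry):
            out[j] |= limb & mask
    return out

if __name__ == "__main__":
    p = 2**61 - 1
    zs = [3, 5, 7, 11, 13, 17, 19, 23]         # e.g. Z coordinates of 8 table points
    print(all(z * zi % p == 1 for z, zi in zip(zs, batch_inverse(zs, p))))
    # 8 entries of 18 limbs each (6 limbs * 3 coordinates)
    table = [[100 * i + j for j in range(18)] for i in range(8)]
    print(ct_select(table, 5) == table[5])
```

With 8 table entries, the batch version replaces 8 inversions by 1 inversion and 21 multiplications, which is what would make converting the table to affine coordinates for mixed addition plausible at the inversion cost quoted above.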

@mratsim mratsim linked an issue Aug 24, 2020 that may be closed by this pull request
@mratsim mratsim merged commit 6ac974d into master Aug 24, 2020
@mratsim mratsim deleted the window-glc-sac branch September 4, 2020 08:57
Linked issue: Window method for GLV acceleration