Internals refactor + renewed focus on perf #17

Merged: 37 commits, Mar 16, 2020

Conversation

mratsim (Owner) commented Mar 15, 2020

This PR is a complete overhaul of Constantine's internals.

Change of priorities

This stems from a renewed focus on performance.
Instead of prioritizing constant-time, code size and performance, in that order, the library will now prioritize
constant-time, performance and code size.

The new focus on performance is due to the following articles, https://medium.com/loopring-protocol/zksnark-prover-optimizations-3e9a3e5578c0 and https://hackmd.io/@zkteam/goff, and to the Cambrian explosion of Zero-Knowledge Proof (ZKP) protocols.

In particular, the first post shows that machines used for ZKP protocols start at $1000 (16 cores + 64 GB of RAM), and in discussions at EthCC with the Consensys ZKP team I realised that clusters with ~100 cores would be interesting to use.
At that scale, squeezing the most performance possible out of the low-level implementation would significantly reduce hardware costs, and might even make assembly (and its auditing) worthwhile in the future.

Performance

Here are the performance figures, before/after.

GCC abysmal performance

Note that GCC generates very inefficient and bloated code for multiprecision arithmetic, even when using the addcarry and subborrow intrinsics.
This is bad enough that GMP has a dedicated page about it: https://gmplib.org/manual/Assembly-Carry-Propagation.html

Example in Godbolt: https://gcc.godbolt.org/z/2h768y

#include <stdint.h>
#include <x86intrin.h>

// In-place 256-bit addition: a <- a + b, propagating the carry across the 4 limbs
void add256(uint64_t a[4], uint64_t b[4]){
  uint8_t carry = 0;
  for (int i = 0; i < 4; ++i)
    carry = _addcarry_u64(carry, a[i], b[i], &a[i]);
}

GCC

add256:
        movq    (%rsi), %rax
        addq    (%rdi), %rax
        setc    %dl
        movq    %rax, (%rdi)
        movq    8(%rdi), %rax
        addb    $-1, %dl
        adcq    8(%rsi), %rax
        setc    %dl
        movq    %rax, 8(%rdi)
        movq    16(%rdi), %rax
        addb    $-1, %dl
        adcq    16(%rsi), %rax
        setc    %dl
        movq    %rax, 16(%rdi)
        movq    24(%rsi), %rax
        addb    $-1, %dl
        adcq    %rax, 24(%rdi)
        ret

Clang

add256:
        movq    (%rsi), %rax
        addq    %rax, (%rdi)
        movq    8(%rsi), %rax
        adcq    %rax, 8(%rdi)
        movq    16(%rsi), %rax
        adcq    %rax, 16(%rdi)
        movq    24(%rsi), %rax
        adcq    %rax, 24(%rdi)
        retq

There are a couple of related issues in the GCC tracker:

Benchmark & Compilation flags

Nim devel from Jan 13 2020
-d:danger

Benchmark is https://github.com/mratsim/constantine/blob/191bb771/benchmarks/bench_eth_curves.nim
which benchmarks the library on the 3 elliptic curves used by Ethereum 1 and 2:

  • secp256k1 (Ethereum1 ECDSA)
  • BN254 (Ethereum 1 precompile and the Zero-Knowledge Proof standard curve for Zcash and many many others)
  • BLS12_381 (Ethereum 2 signatures and standard across Algorand, Chia, Dfinity, Ethereum 2, Filecoin, ...)

The most important item is field multiplication: it is the building block that makes everything else (exponentiation and inversion in particular) slow.

Important: my CPU is overclocked, but the hardware counter ticks at the CPU's nominal frequency instead of the overclocked frequency, meaning the benchmarks are only meaningful for comparing runs on my own PC.
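For context, here is a minimal sketch (not the benchmark's actual harness) of how such cycle counts are typically obtained with RDTSC; the time-stamp counter ticks at the nominal (base) frequency regardless of Turbo-Boost or overclocking, which is why the numbers are only comparable between runs on the same machine:

#include <stdint.h>
#include <x86intrin.h>   // __rdtsc

// Hedged sketch, not the benchmark's actual code: the TSC increments at the
// CPU's nominal frequency, so "cycles" here are nominal cycles, not real
// (boosted or overclocked) core cycles.
static inline uint64_t cycles_of(void (*op)(void)) {
  uint64_t start = __rdtsc();
  op();
  return __rdtsc() - start;
}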

GCC before PR

$  build/bench_eth_curves_gcc_old


Warmup: 0.9042 s, result 224 (displayed to avoid compiler optimizing warmup away)


⚠️ Measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
==========================================================================================================

Addition        Fp[Secp256k1]          13 ns        39 cycles
Substraction    Fp[Secp256k1]           8 ns        26 cycles
Negation        Fp[Secp256k1]           3 ns        11 cycles
Multiplication  Fp[Secp256k1]          59 ns       179 cycles
Squaring        Fp[Secp256k1]          59 ns       179 cycles
Inversion       Fp[Secp256k1]       23215 ns     69646 cycles


Warmup: 0.8972 s, result 224 (displayed to avoid compiler optimizing warmup away)


⚠️ Measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
==========================================================================================================

Addition        Fp[BN254]              13 ns        39 cycles
Substraction    Fp[BN254]               8 ns        26 cycles
Negation        Fp[BN254]               4 ns        12 cycles
Multiplication  Fp[BN254]              59 ns       179 cycles
Squaring        Fp[BN254]              59 ns       179 cycles
Inversion       Fp[BN254]           23049 ns     69149 cycles


Warmup: 0.8966 s, result 224 (displayed to avoid compiler optimizing warmup away)


⚠️ Measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
==========================================================================================================

Addition        Fp[BLS12_381]          17 ns        51 cycles
Substraction    Fp[BLS12_381]          10 ns        32 cycles
Negation        Fp[BLS12_381]           4 ns        14 cycles
Multiplication  Fp[BLS12_381]         106 ns       320 cycles
Squaring        Fp[BLS12_381]         106 ns       319 cycles
Inversion       Fp[BLS12_381]       62882 ns    188649 cycles

Clang before PR

$  build/bench_eth_curves_clang_old


Warmup: 0.9157 s, result 224 (displayed to avoid compiler optimizing warmup away)


⚠️ Measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
==========================================================================================================

Addition        Fp[Secp256k1]          12 ns        37 cycles
Substraction    Fp[Secp256k1]           8 ns        24 cycles
Negation        Fp[Secp256k1]           4 ns        14 cycles
Multiplication  Fp[Secp256k1]          55 ns       167 cycles
Squaring        Fp[Secp256k1]          55 ns       167 cycles
Inversion       Fp[Secp256k1]       20619 ns     61860 cycles


Warmup: 0.9060 s, result 224 (displayed to avoid compiler optimizing warmup away)


⚠️ Measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
==========================================================================================================

Addition        Fp[BN254]              12 ns        37 cycles
Substraction    Fp[BN254]               8 ns        24 cycles
Negation        Fp[BN254]               4 ns        12 cycles
Multiplication  Fp[BN254]              55 ns       167 cycles
Squaring        Fp[BN254]              55 ns       167 cycles
Inversion       Fp[BN254]           20555 ns     61666 cycles


Warmup: 0.9054 s, result 224 (displayed to avoid compiler optimizing warmup away)


⚠️ Measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
==========================================================================================================

Addition        Fp[BLS12_381]          16 ns        49 cycles
Substraction    Fp[BLS12_381]          10 ns        31 cycles
Negation        Fp[BLS12_381]           4 ns        14 cycles
Multiplication  Fp[BLS12_381]         101 ns       304 cycles
Squaring        Fp[BLS12_381]         101 ns       304 cycles
Inversion       Fp[BLS12_381]       54204 ns    162615 cycles

GCC after PR

$  build/bench_eth_curves_gcc_new


Warmup: 0.9033 s, result 224 (displayed to avoid compiler optimizing warmup away)


⚠️ Measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
==========================================================================================================

Addition        Fp[Secp256k1]           4 ns        14 cycles
Substraction    Fp[Secp256k1]           3 ns        10 cycles
Negation        Fp[Secp256k1]           2 ns         6 cycles
Multiplication  Fp[Secp256k1]          34 ns       104 cycles
Squaring        Fp[Secp256k1]          34 ns       104 cycles
Inversion       Fp[Secp256k1]       12463 ns     37390 cycles


Warmup: 0.8966 s, result 224 (displayed to avoid compiler optimizing warmup away)


⚠️ Measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
==========================================================================================================

Addition        Fp[BN254]               4 ns        13 cycles
Substraction    Fp[BN254]               3 ns        10 cycles
Negation        Fp[BN254]               2 ns         6 cycles
Multiplication  Fp[BN254]              32 ns        98 cycles
Squaring        Fp[BN254]              32 ns        98 cycles
Inversion       Fp[BN254]           11473 ns     34420 cycles


Warmup: 0.8966 s, result 224 (displayed to avoid compiler optimizing warmup away)


⚠️ Measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
==========================================================================================================

Addition        Fp[BLS12_381]           9 ns        27 cycles
Substraction    Fp[BLS12_381]           5 ns        15 cycles
Negation        Fp[BLS12_381]           3 ns        10 cycles
Multiplication  Fp[BLS12_381]          62 ns       188 cycles
Squaring        Fp[BLS12_381]          62 ns       188 cycles
Inversion       Fp[BLS12_381]       31324 ns     93972 cycles

Clang after PR

$  build/bench_eth_curves_clang_new


Warmup: 0.9139 s, result 224 (displayed to avoid compiler optimizing warmup away)


⚠️ Measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
==========================================================================================================

Addition        Fp[Secp256k1]           3 ns         9 cycles
Substraction    Fp[Secp256k1]           2 ns         7 cycles
Negation        Fp[Secp256k1]           0 ns         0 cycles
Multiplication  Fp[Secp256k1]          22 ns        68 cycles
Squaring        Fp[Secp256k1]          22 ns        68 cycles
Inversion       Fp[Secp256k1]        9779 ns     29339 cycles


Warmup: 0.9064 s, result 224 (displayed to avoid compiler optimizing warmup away)


⚠️ Measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
==========================================================================================================

Addition        Fp[BN254]               2 ns         8 cycles
Substraction    Fp[BN254]               2 ns         6 cycles
Negation        Fp[BN254]               0 ns         0 cycles
Multiplication  Fp[BN254]              21 ns        64 cycles
Squaring        Fp[BN254]              21 ns        64 cycles
Inversion       Fp[BN254]            9264 ns     27794 cycles


Warmup: 0.9052 s, result 224 (displayed to avoid compiler optimizing warmup away)


⚠️ Measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
==========================================================================================================

Addition        Fp[BLS12_381]           3 ns        11 cycles
Substraction    Fp[BLS12_381]           2 ns         8 cycles
Negation        Fp[BLS12_381]           0 ns         0 cycles
Multiplication  Fp[BLS12_381]          45 ns       136 cycles
Squaring        Fp[BLS12_381]          45 ns       137 cycles
Inversion       Fp[BLS12_381]       24951 ns     74855 cycles

Code-size

The code-size increase to support 3 curves (with BN254 and secp256k1 using the same number of limbs) is very reasonable compared to the old code.
[image: code-size comparison, before/after]

Explanation of the internals

The old internals used the same representation as BearSSL's BigInt: for uint64 words, only 63 bits were used and the top bit held the carry, since there is no easy access to the carry flag from C.
The trouble caused by carries is visible in this code that handles carries in Nim's compile-time VM:

const
  HalfWidth = WordBitWidth shr 1
  HalfBase = (BaseType(1) shl HalfWidth)
  HalfMask = HalfBase - 1

func split(n: BaseType): tuple[hi, lo: BaseType] =
  result.hi = n shr HalfWidth
  result.lo = n and HalfMask

func merge(hi, lo: BaseType): BaseType =
  (hi shl HalfWidth) or lo

func addC(cOut, sum: var BaseType, a, b, cIn: BaseType) =
  # Add with carry, fallback for the Compile-Time VM
  # (CarryOut, Sum) <- a + b + CarryIn
  let (aHi, aLo) = split(a)
  let (bHi, bLo) = split(b)
  let tLo = aLo + bLo + cIn
  let (cLo, rLo) = split(tLo)
  let tHi = aHi + bHi + cLo
  let (cHi, rHi) = split(tHi)
  cOut = cHi
  sum = merge(rHi, rLo)

func subB(bOut, diff: var BaseType, a, b, bIn: BaseType) =
  # Subtract with borrow, fallback for the Compile-Time VM
  # (BorrowOut, Diff) <- a - b - BorrowIn
  let (aHi, aLo) = split(a)
  let (bHi, bLo) = split(b)
  let tLo = HalfBase + aLo - bLo - bIn
  let (noBorrowLo, rLo) = split(tLo)
  let tHi = HalfBase + aHi - bHi - BaseType(noBorrowLo == 0)
  let (noBorrowHi, rHi) = split(tHi)
  bOut = BaseType(noBorrowHi == 0)
  diff = merge(rHi, rLo)

func dbl(a: var BigInt): bool =
  ## In-place multiprecision double
  ## a -> 2a
  var carry, sum: BaseType
  for i in 0 ..< a.limbs.len:
    let ai = BaseType(a.limbs[i])
    addC(carry, sum, ai, ai, carry)
    a.limbs[i] = Word(sum)
  result = bool(carry)

func csub(a: var BigInt, b: BigInt, ctl: bool): bool =
  ## In-place optional subtraction
  ##
  ## It is NOT constant-time and is intended
  ## only for compile-time precomputation
  ## of non-secret data.
  var borrow, diff: BaseType
  for i in 0 ..< a.limbs.len:
    let ai = BaseType(a.limbs[i])
    let bi = BaseType(b.limbs[i])
    subB(borrow, diff, ai, bi, borrow)
    if ctl:
      a.limbs[i] = Word(diff)
  result = bool(borrow)

versus the old representation:

func dbl(a: var BigInt): bool =
  ## In-place multiprecision double
  ## a -> 2a
  for i in 0 ..< a.limbs.len:
    var z = BaseType(a.limbs[i]) * 2 + BaseType(result)
    result = z.isMsbSet()
    a.limbs[i] = mask(Word(z))

func sub(a: var BigInt, b: BigInt, ctl: bool): bool =
  ## In-place optional subtraction
  ##
  ## It is NOT constant-time and is intended
  ## only for compile-time precomputation
  ## of non-secret data.
  for i in 0 ..< a.limbs.len:
    let new_a = BaseType(a.limbs[i]) - BaseType(b.limbs[i]) - BaseType(result)
    result = new_a.isMsbSet()
    a.limbs[i] = if ctl: new_a.Word.mask()
                 else: a.limbs[i]

However, the BearSSL representation has a couple of issues:

  • It uses more words: a 254-bit field like BN254's or a 381-bit field like BLS12_381's requires an extra word compared to a compact representation. On current CPUs the biggest performance bottleneck is memory speed, so we want to touch as little memory as possible, and for multiprecision multiplication those accesses grow quadratically.
  • The BigInt primitives are implemented via type erasure / pointer indirection. This is great for code size, but it prevents unrolling and inlining: an addition on 4 limbs is only 8 instructions (4 mov plus an "add+adc+adc+adc" chain) and should be unrolled and inlined.
    Lastly, at least with Clang, we have access to efficient add-with-carry intrinsics.

So the new representation uses the full 64 bits of each word and relies on intrinsics or uint128 to deal with add-with-carry.
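For illustration, here is a minimal C sketch (not Constantine's actual Nim code) of a full-word add-with-carry built on the compiler's unsigned __int128, the kind of fallback useful where the _addcarry_u64 intrinsic is unavailable or miscompiled:

#include <stdint.h>

// Hedged sketch: full 64-bit limbs, carry extracted from a 128-bit sum.
// (CarryOut, Sum) <- a + b + CarryIn
static inline uint8_t addc_u64(uint8_t carry_in, uint64_t a, uint64_t b, uint64_t *sum) {
  unsigned __int128 t = (unsigned __int128)a + b + carry_in;
  *sum = (uint64_t)t;
  return (uint8_t)(t >> 64);   // the high part is 0 or 1
}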

Furthermore, it uses the technique described in https://hackmd.io/@zkteam/modular_multiplication to improve speed while staying at a high level.
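The key primitive such a high-level Montgomery multiplication relies on is a full 64x64 -> 128-bit multiply-accumulate; a minimal C sketch of it (again an illustration, not the technique from the linked post itself):

#include <stdint.h>

// Hedged sketch: (hi, lo) <- a * b + c + d, which cannot overflow 128 bits
// since (2^64-1)^2 + 2*(2^64-1) = 2^128 - 1. This is the inner step of
// schoolbook and Montgomery multiplication loops.
static inline void muladd2_u64(uint64_t *hi, uint64_t *lo,
                               uint64_t a, uint64_t b, uint64_t c, uint64_t d) {
  unsigned __int128 t = (unsigned __int128)a * b + c + d;
  *lo = (uint64_t)t;
  *hi = (uint64_t)(t >> 64);
}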

Why no lazy carries or reduction

As mentioned in #15, lazy carries and lazy reductions seem to be popular. They also have issues:

  • They significantly increase the number of memory accesses and therefore cache misses. At worst, ADC latency is 6 cycles while a cache miss is ~100 cycles. Arguably everything is stack-allocated and always in cache, so this argument might not hold, but lazy carries also increase register pressure.
  • Addition chains can use lazy carries, but:
    • when a subtraction is involved, either the representation must be signed to handle lazy subtraction,
      or the subtraction requires reducing the field element;
    • multiplications require a reduction (even a partial one; IIRC staying below 2p is fine);
    • unless we have a prime of special form (Generalized Mersenne prime or golden prime), reduction is very costly: even though carries can be shifted from one limb to the next in a single pass, multiple (constant-time) conditional subtractions and inequality checks will be needed;
    • this makes auditing the library harder.

Further improvements

CMOV / ccopy

Currently, conditional move and conditional copy use assembly and perform a test beforehand.
When looping over bigints, that test only needs to be done once.
Thankfully the impact should be very small or invisible thanks to instruction-level parallelism.
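For reference, a generic mask-based conditional copy looks like the following C sketch (an illustration of the technique, not Constantine's assembly implementation); the mask is derived from the control bit once and can then be reused across all limbs of a bigint:

#include <stdint.h>
#include <stddef.h>

// Hedged sketch of a constant-time conditional copy:
// if ctl == 1: a <- b, if ctl == 0: a is left untouched.
// Memory accesses and instruction count are identical in both cases.
static void ccopy_limbs(uint64_t *a, const uint64_t *b, size_t len, uint64_t ctl) {
  uint64_t mask = (uint64_t)0 - (ctl & 1);   // 0x00...00 or 0xFF...FF, computed once
  for (size_t i = 0; i < len; ++i)
    a[i] ^= mask & (a[i] ^ b[i]);
}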

Squaring

This is planned

Multiplication

Further speed improvements are possible but will probably require either inflexible assembly / inline assembly (i.e. always compiled in, with a predetermined number of limbs) or a mini-compiler.
An example for multiprecision addition is available in the following macro:

macro addCarryGen_u64(a, b: untyped, bits: static int): untyped =
  var asmStmt = (block:
    " movq %[b], %[tmp]\n" &
    " addq %[tmp], %[a]\n"
  )

  let maxByteOffset = bits div 8
  const wsize = sizeof(uint64)

  when defined(gcc):
    for byteOffset in countup(wsize, maxByteOffset-1, wsize):
      asmStmt.add (block:
        "\n" &
        # movq 8+%[b], %[tmp]
        " movq " & $byteOffset & "+%[b], %[tmp]\n" &
        # adcq %[tmp], 8+%[a]
        " adcq %[tmp], " & $byteOffset & "+%[a]\n"
      )
  elif defined(clang):
    # https://lists.llvm.org/pipermail/llvm-dev/2017-August/116202.html
    for byteOffset in countup(wsize, maxByteOffset-1, wsize):
      asmStmt.add (block:
        "\n" &
        # movq 8+%[b], %[tmp]
        " movq " & $byteOffset & "%[b], %[tmp]\n" &
        # adcq %[tmp], 8+%[a]
        " adcq %[tmp], " & $byteOffset & "%[a]\n"
      )

  let tmp = ident("tmp")
  asmStmt.add (block:
    ": [tmp] \"+r\" (`" & $tmp & "`), [a] \"+m\" (`" & $a & "->limbs[0]`)\n" &
    ": [b] \"m\"(`" & $b & "->limbs[0]`)\n" &
    ": \"cc\""
  )

  result = newStmtList()
  result.add quote do:
    var `tmp`{.noinit.}: uint64

  result.add nnkAsmStmt.newTree(
    newEmptyNode(),
    newLit asmStmt
  )

  echo result.toStrLit

As explained in "A Fast Implementation of the Optimal Ate Pairing over BN curve on Intel Haswell Processor" (Shigeo Mitsunari, 2013, https://eprint.iacr.org/2013/362.pdf), one of the main bottlenecks on x86 is that the MUL instruction is very inflexible in terms of registers and requires lots of mov instructions before and after, which significantly hinders the throughput of Montgomery multiplication.
Furthermore, MUL clobbers the carry flags. By using MULX instead we can avoid that, and even use the ADCX and ADOX instructions to handle two independent carry chains and benefit from instruction-level parallelism, as described by Intel in https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf
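For illustration, MULX is exposed in C via the _mulx_u64 intrinsic (BMI2); a minimal sketch of a flag-preserving 64x64 -> 128-bit multiply, which is what allows ADCX/ADOX carry chains to stay live across the multiplications:

#include <stdint.h>
#include <immintrin.h>   // _mulx_u64, requires BMI2 (-mbmi2)

// Hedged sketch: unlike MUL, MULX does not clobber the carry/overflow flags,
// so adcx/adox chains in flight are not interrupted by the multiplication.
static inline uint64_t mul_lo_hi(uint64_t a, uint64_t b, uint64_t *hi) {
  unsigned long long h;
  uint64_t lo = _mulx_u64(a, b, &h);
  *hi = (uint64_t)h;
  return lo;
}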


mratsim (Owner, Author) commented Mar 16, 2020

Comparing:

The issue doesn't seem to be in moveMem.

Could it be an issue with the interaction between the carry flags and ccopy in:

func cadd(a: LimbsViewMut, b: LimbsViewAny, ctl: CTBool[Word], len: int): Carry =
  ## Type-erased conditional addition
  ## Returns the carry
  ##
  ## if ctl is true: a <- a + b
  ## if ctl is false: a <- a
  ## The carry is always computed whether ctl is true or false
  ##
  ## Time and memory accesses are the same whether a copy occurs or not
  result = Carry(0)
  var sum: Word
  for i in 0 ..< len:
    addC(result, sum, a[i], b[i], result)
    ctl.ccopy(a[i], sum)

func csub(a: LimbsViewMut, b: LimbsViewAny, ctl: CTBool[Word], len: int): Borrow =
  ## Type-erased conditional subtraction
  ## Returns the borrow
  ##
  ## if ctl is true: a <- a - b
  ## if ctl is false: a <- a
  ## The borrow is always computed whether ctl is true or false
  ##
  ## Time and memory accesses are the same whether a copy occurs or not
  result = Borrow(0)
  var diff: Word
  for i in 0 ..< len:
    subB(result, diff, a[i], b[i], result)
    ctl.ccopy(a[i], diff)

mratsim (Owner, Author) commented Mar 16, 2020

Fixed the 2 leftover bugs:

  1. GCC before version 7 generated wrong code on addcarry_u64
  2. The fallback for subborrow_u64 on ARM didn't mask the borrow byte (a sketch of the masking is shown below).
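To illustrate the second fix, here is a portable C sketch (not the actual Nim fallback) of a subborrow that masks the borrow down to a single bit; without the mask, the wrapped-around high half would corrupt the next limb's borrow input:

#include <stdint.h>

// Hedged sketch: (BorrowOut, Diff) <- a - b - BorrowIn
// The borrow must be masked to one bit, otherwise the all-ones high half of
// the wrapped 128-bit result would be fed back into the next limb.
static inline uint8_t subb_u64(uint8_t borrow_in, uint64_t a, uint64_t b, uint64_t *diff) {
  unsigned __int128 t = (unsigned __int128)a - b - borrow_in;
  *diff = (uint64_t)t;
  return (uint8_t)((t >> 64) & 1);   // masked borrow: 0 or 1
}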
