Internals refactor + renewed focus on perf #17

Merged: 37 commits, Mar 16, 2020

Conversation

mratsim (Owner) commented Mar 15, 2020

This PR is a complete overhaul of Constantine's internals.

Change of priorities

This stems from a renewed focus on performance.
Instead of prioritizing constant-time, code size and performance, in that order, the library will now prioritize
constant-time, performance and code size.

The new focus on performance is due to the following articles, https://medium.com/loopring-protocol/zksnark-prover-optimizations-3e9a3e5578c0 and https://hackmd.io/@zkteam/goff, and to the Cambrian explosion of Zero-Knowledge Proof (ZKP) protocols.

In particular, the first post shows that machines used for ZKP protocols start at $1000 (16 cores + 64 GB of RAM), and in discussions at EthCC with the Consensys ZKP team I realised that clusters with ~100 cores would be interesting to use.
At that scale, squeezing the most performance possible out of the low-level implementation would significantly reduce hardware costs, and might even make assembly (and its auditing) worthwhile in the future.

Performance

Here are the performance figures, before/after.

GCC abysmal performance

Note that GCC generates very inefficient and bloated code for multiprecision arithmetic, even when using the addcarry and subborrow intrinsics.
This is bad enough that GMP has a dedicated page about it: https://gmplib.org/manual/Assembly-Carry-Propagation.html

Example in Godbolt: https://gcc.godbolt.org/z/2h768y

#include <stdint.h>
#include <x86intrin.h>

// In-place 256-bit addition: a <- a + b, propagating the carry across the 4 limbs
void add256(uint64_t a[4], uint64_t b[4]){
  uint8_t carry = 0;
  for (int i = 0; i < 4; ++i)
    carry = _addcarry_u64(carry, a[i], b[i], &a[i]);
}

GCC

add256:
        movq    (%rsi), %rax
        addq    (%rdi), %rax
        setc    %dl
        movq    %rax, (%rdi)
        movq    8(%rdi), %rax
        addb    $-1, %dl
        adcq    8(%rsi), %rax
        setc    %dl
        movq    %rax, 8(%rdi)
        movq    16(%rdi), %rax
        addb    $-1, %dl
        adcq    16(%rsi), %rax
        setc    %dl
        movq    %rax, 16(%rdi)
        movq    24(%rsi), %rax
        addb    $-1, %dl
        adcq    %rax, 24(%rdi)
        ret

Clang

add256:
        movq    (%rsi), %rax
        addq    %rax, (%rdi)
        movq    8(%rsi), %rax
        adcq    %rax, 8(%rdi)
        movq    16(%rsi), %rax
        adcq    %rax, 16(%rdi)
        movq    24(%rsi), %rax
        adcq    %rax, 24(%rdi)
        retq

There are a couple of related issues in the GCC tracker:

Benchmark & Compilation flags

Nim devel from Jan 13 2020
-d:danger

Benchmark is https://github.com/mratsim/constantine/blob/191bb771/benchmarks/bench_eth_curves.nim
which benchmarks the library on the 3 elliptic curves used by Ethereum 1 and 2:

  • secp256k1 (Ethereum1 ECDSA)
  • BN254 (Ethereum 1 precompile and the Zero-Knowledge Proof standard curve for Zcash and many many others)
  • BLS12_381 (Ethereum 2 signatures and standard across Algorand, Chia, Dfinity, Ethereum 2, Filecoin, ...)

The most important item is field multiplication: it is the building block that makes everything else (exponentiation and inversion in particular) slow.

Important: my CPU is overclocked, but the hardware counter ticks at the CPU's nominal frequency instead of the overclocked frequency, meaning the benchmarks are only meaningful for comparing runs on my own PC.
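For context, here is a minimal sketch (not the benchmark's actual harness) of how such cycle counts are typically obtained with RDTSC; the time-stamp counter ticks at the nominal (base) frequency regardless of Turbo-Boost or overclocking, which is why the numbers are only comparable between runs on the same machine:

#include <stdint.h>
#include <x86intrin.h>   // __rdtsc

// Hedged sketch, not the benchmark's actual code: the TSC increments at the
// CPU's nominal frequency, so "cycles" here are nominal cycles, not real
// (boosted or overclocked) core cycles.
static inline uint64_t cycles_of(void (*op)(void)) {
  uint64_t start = __rdtsc();
  op();
  return __rdtsc() - start;
}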

GCC before PR

$  build/bench_eth_curves_gcc_old


Warmup: 0.9042 s, result 224 (displayed to avoid compiler optimizing warmup away)


⚠️ Measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
==========================================================================================================

Addition        Fp[Secp256k1]          13 ns        39 cycles
Substraction    Fp[Secp256k1]           8 ns        26 cycles
Negation        Fp[Secp256k1]           3 ns        11 cycles
Multiplication  Fp[Secp256k1]          59 ns       179 cycles
Squaring        Fp[Secp256k1]          59 ns       179 cycles
Inversion       Fp[Secp256k1]       23215 ns     69646 cycles


Warmup: 0.8972 s, result 224 (displayed to avoid compiler optimizing warmup away)


⚠️ Measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
==========================================================================================================

Addition        Fp[BN254]              13 ns        39 cycles
Substraction    Fp[BN254]               8 ns        26 cycles
Negation        Fp[BN254]               4 ns        12 cycles
Multiplication  Fp[BN254]              59 ns       179 cycles
Squaring        Fp[BN254]              59 ns       179 cycles
Inversion       Fp[BN254]           23049 ns     69149 cycles


Warmup: 0.8966 s, result 224 (displayed to avoid compiler optimizing warmup away)


⚠️ Measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
==========================================================================================================

Addition        Fp[BLS12_381]          17 ns        51 cycles
Substraction    Fp[BLS12_381]          10 ns        32 cycles
Negation        Fp[BLS12_381]           4 ns        14 cycles
Multiplication  Fp[BLS12_381]         106 ns       320 cycles
Squaring        Fp[BLS12_381]         106 ns       319 cycles
Inversion       Fp[BLS12_381]       62882 ns    188649 cycles

Clang before PR

$  build/bench_eth_curves_clang_old


Warmup: 0.9157 s, result 224 (displayed to avoid compiler optimizing warmup away)


⚠️ Measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
==========================================================================================================

Addition        Fp[Secp256k1]          12 ns        37 cycles
Substraction    Fp[Secp256k1]           8 ns        24 cycles
Negation        Fp[Secp256k1]           4 ns        14 cycles
Multiplication  Fp[Secp256k1]          55 ns       167 cycles
Squaring        Fp[Secp256k1]          55 ns       167 cycles
Inversion       Fp[Secp256k1]       20619 ns     61860 cycles


Warmup: 0.9060 s, result 224 (displayed to avoid compiler optimizing warmup away)


⚠️ Measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
==========================================================================================================

Addition        Fp[BN254]              12 ns        37 cycles
Substraction    Fp[BN254]               8 ns        24 cycles
Negation        Fp[BN254]               4 ns        12 cycles
Multiplication  Fp[BN254]              55 ns       167 cycles
Squaring        Fp[BN254]              55 ns       167 cycles
Inversion       Fp[BN254]           20555 ns     61666 cycles


Warmup: 0.9054 s, result 224 (displayed to avoid compiler optimizing warmup away)


⚠️ Measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
==========================================================================================================

Addition        Fp[BLS12_381]          16 ns        49 cycles
Substraction    Fp[BLS12_381]          10 ns        31 cycles
Negation        Fp[BLS12_381]           4 ns        14 cycles
Multiplication  Fp[BLS12_381]         101 ns       304 cycles
Squaring        Fp[BLS12_381]         101 ns       304 cycles
Inversion       Fp[BLS12_381]       54204 ns    162615 cycles

GCC after PR

$  build/bench_eth_curves_gcc_new


Warmup: 0.9033 s, result 224 (displayed to avoid compiler optimizing warmup away)


⚠️ Measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
==========================================================================================================

Addition        Fp[Secp256k1]           4 ns        14 cycles
Substraction    Fp[Secp256k1]           3 ns        10 cycles
Negation        Fp[Secp256k1]           2 ns         6 cycles
Multiplication  Fp[Secp256k1]          34 ns       104 cycles
Squaring        Fp[Secp256k1]          34 ns       104 cycles
Inversion       Fp[Secp256k1]       12463 ns     37390 cycles


Warmup: 0.8966 s, result 224 (displayed to avoid compiler optimizing warmup away)


⚠️ Measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
==========================================================================================================

Addition        Fp[BN254]               4 ns        13 cycles
Substraction    Fp[BN254]               3 ns        10 cycles
Negation        Fp[BN254]               2 ns         6 cycles
Multiplication  Fp[BN254]              32 ns        98 cycles
Squaring        Fp[BN254]              32 ns        98 cycles
Inversion       Fp[BN254]           11473 ns     34420 cycles


Warmup: 0.8966 s, result 224 (displayed to avoid compiler optimizing warmup away)


⚠️ Measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
==========================================================================================================

Addition        Fp[BLS12_381]           9 ns        27 cycles
Substraction    Fp[BLS12_381]           5 ns        15 cycles
Negation        Fp[BLS12_381]           3 ns        10 cycles
Multiplication  Fp[BLS12_381]          62 ns       188 cycles
Squaring        Fp[BLS12_381]          62 ns       188 cycles
Inversion       Fp[BLS12_381]       31324 ns     93972 cycles

Clang after PR

$  build/bench_eth_curves_clang_new


Warmup: 0.9139 s, result 224 (displayed to avoid compiler optimizing warmup away)


⚠️ Measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
==========================================================================================================

Addition        Fp[Secp256k1]           3 ns         9 cycles
Substraction    Fp[Secp256k1]           2 ns         7 cycles
Negation        Fp[Secp256k1]           0 ns         0 cycles
Multiplication  Fp[Secp256k1]          22 ns        68 cycles
Squaring        Fp[Secp256k1]          22 ns        68 cycles
Inversion       Fp[Secp256k1]        9779 ns     29339 cycles


Warmup: 0.9064 s, result 224 (displayed to avoid compiler optimizing warmup away)


⚠️ Measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
==========================================================================================================

Addition        Fp[BN254]               2 ns         8 cycles
Substraction    Fp[BN254]               2 ns         6 cycles
Negation        Fp[BN254]               0 ns         0 cycles
Multiplication  Fp[BN254]              21 ns        64 cycles
Squaring        Fp[BN254]              21 ns        64 cycles
Inversion       Fp[BN254]            9264 ns     27794 cycles


Warmup: 0.9052 s, result 224 (displayed to avoid compiler optimizing warmup away)


⚠️ Measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
==========================================================================================================

Addition        Fp[BLS12_381]           3 ns        11 cycles
Substraction    Fp[BLS12_381]           2 ns         8 cycles
Negation        Fp[BLS12_381]           0 ns         0 cycles
Multiplication  Fp[BLS12_381]          45 ns       136 cycles
Squaring        Fp[BLS12_381]          45 ns       137 cycles
Inversion       Fp[BLS12_381]       24951 ns     74855 cycles

Code-size

The code-size increase to support 3 curves (with BN254 and secp256k1 using the same number of limbs) is very reasonable compared to the old code.
[image: code-size comparison, before/after]

Explanation of the internals

The old internals used the same representation as BearSSL's BigInt: for uint64 words, only 63 bits were used and the top bit held the carry, since there is no easy access to the carry flag from C.
The trouble caused by carries is visible in this code that handles carries in Nim's compile-time VM:

const
  HalfWidth = WordBitWidth shr 1
  HalfBase = (BaseType(1) shl HalfWidth)
  HalfMask = HalfBase - 1

func split(n: BaseType): tuple[hi, lo: BaseType] =
  result.hi = n shr HalfWidth
  result.lo = n and HalfMask

func merge(hi, lo: BaseType): BaseType =
  (hi shl HalfWidth) or lo

func addC(cOut, sum: var BaseType, a, b, cIn: BaseType) =
  # Add with carry, fallback for the Compile-Time VM
  # (CarryOut, Sum) <- a + b + CarryIn
  let (aHi, aLo) = split(a)
  let (bHi, bLo) = split(b)
  let tLo = aLo + bLo + cIn
  let (cLo, rLo) = split(tLo)
  let tHi = aHi + bHi + cLo
  let (cHi, rHi) = split(tHi)
  cOut = cHi
  sum = merge(rHi, rLo)

func subB(bOut, diff: var BaseType, a, b, bIn: BaseType) =
  # Subtract with borrow, fallback for the Compile-Time VM
  # (BorrowOut, Diff) <- a - b - BorrowIn
  let (aHi, aLo) = split(a)
  let (bHi, bLo) = split(b)
  let tLo = HalfBase + aLo - bLo - bIn
  let (noBorrowLo, rLo) = split(tLo)
  let tHi = HalfBase + aHi - bHi - BaseType(noBorrowLo == 0)
  let (noBorrowHi, rHi) = split(tHi)
  bOut = BaseType(noBorrowHi == 0)
  diff = merge(rHi, rLo)

func dbl(a: var BigInt): bool =
  ## In-place multiprecision double
  ## a -> 2a
  var carry, sum: BaseType
  for i in 0 ..< a.limbs.len:
    let ai = BaseType(a.limbs[i])
    addC(carry, sum, ai, ai, carry)
    a.limbs[i] = Word(sum)
  result = bool(carry)

func csub(a: var BigInt, b: BigInt, ctl: bool): bool =
  ## In-place optional subtraction
  ##
  ## It is NOT constant-time and is intended
  ## only for compile-time precomputation
  ## of non-secret data.
  var borrow, diff: BaseType
  for i in 0 ..< a.limbs.len:
    let ai = BaseType(a.limbs[i])
    let bi = BaseType(b.limbs[i])
    subB(borrow, diff, ai, bi, borrow)
    if ctl:
      a.limbs[i] = Word(diff)
  result = bool(borrow)

versus the old representation:

func dbl(a: var BigInt): bool =
  ## In-place multiprecision double
  ## a -> 2a
  for i in 0 ..< a.limbs.len:
    var z = BaseType(a.limbs[i]) * 2 + BaseType(result)
    result = z.isMsbSet()
    a.limbs[i] = mask(Word(z))

func sub(a: var BigInt, b: BigInt, ctl: bool): bool =
  ## In-place optional subtraction
  ##
  ## It is NOT constant-time and is intended
  ## only for compile-time precomputation
  ## of non-secret data.
  for i in 0 ..< a.limbs.len:
    let new_a = BaseType(a.limbs[i]) - BaseType(b.limbs[i]) - BaseType(result)
    result = new_a.isMsbSet()
    a.limbs[i] = if ctl: new_a.Word.mask()
                 else: a.limbs[i]

However, the BearSSL representation has a couple of issues:

  • It uses more words: a 254-bit field like BN254's or a 381-bit field like BLS12_381's requires an extra word compared to a compact representation. On current CPUs the biggest performance bottleneck is memory speed, so we want to touch as little memory as possible, and for multiprecision multiplication those accesses grow quadratically.
  • The BigInt primitives are implemented via type erasure / pointer indirection. This is great for code size, but it prevents unrolling and inlining: an addition on 4 limbs is only 8 instructions (4 mov plus an "add+adc+adc+adc" chain) and should be unrolled and inlined.
    Lastly, at least with Clang, we have access to efficient add-with-carry intrinsics.

So the new representation uses the full 64 bits of each word and relies on intrinsics or uint128 to deal with add-with-carry.
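For illustration, here is a minimal C sketch (not Constantine's actual Nim code) of a full-word add-with-carry built on the compiler's unsigned __int128, the kind of fallback useful where the _addcarry_u64 intrinsic is unavailable or miscompiled:

#include <stdint.h>

// Hedged sketch: full 64-bit limbs, carry extracted from a 128-bit sum.
// (CarryOut, Sum) <- a + b + CarryIn
static inline uint8_t addc_u64(uint8_t carry_in, uint64_t a, uint64_t b, uint64_t *sum) {
  unsigned __int128 t = (unsigned __int128)a + b + carry_in;
  *sum = (uint64_t)t;
  return (uint8_t)(t >> 64);   // the high part is 0 or 1
}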

Furthermore, it uses the technique described in https://hackmd.io/@zkteam/modular_multiplication to improve speed while staying at a high level.
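The key primitive such a high-level Montgomery multiplication relies on is a full 64x64 -> 128-bit multiply-accumulate; a minimal C sketch of it (again an illustration, not the technique from the linked post itself):

#include <stdint.h>

// Hedged sketch: (hi, lo) <- a * b + c + d, which cannot overflow 128 bits
// since (2^64-1)^2 + 2*(2^64-1) = 2^128 - 1. This is the inner step of
// schoolbook and Montgomery multiplication loops.
static inline void muladd2_u64(uint64_t *hi, uint64_t *lo,
                               uint64_t a, uint64_t b, uint64_t c, uint64_t d) {
  unsigned __int128 t = (unsigned __int128)a * b + c + d;
  *lo = (uint64_t)t;
  *hi = (uint64_t)(t >> 64);
}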

Why no lazy carries or reduction

As mentioned in #15, lazy carries and lazy reductions seem to be popular. They also have issues:

  • They significantly increase the number of memory accesses and therefore cache misses. At worst, ADC latency is 6 cycles while a cache miss is ~100 cycles. Arguably everything is stack-allocated and always in cache, so this argument might not hold, but lazy carries also increase register pressure.
  • Addition chains can use lazy carries, but:
    • when a subtraction is involved, either the representation must be signed to handle lazy subtraction,
      or the subtraction requires reducing the field element;
    • multiplications require a reduction (even a partial one; IIRC staying below 2p is fine);
    • unless we have a prime of special form (Generalized Mersenne prime or golden prime), reduction is very costly: even though carries can be shifted from one limb to the next in a single pass, multiple (constant-time) conditional subtractions and inequality checks will be needed;
    • this makes auditing the library harder.

Further improvements

CMOV / ccopy

Currently, conditional move and conditional copy use assembly and perform a test beforehand.
When looping over bigints, that test only needs to be done once.
Thankfully the impact should be very small or invisible thanks to instruction-level parallelism.
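For reference, a generic mask-based conditional copy looks like the following C sketch (an illustration of the technique, not Constantine's assembly implementation); the mask is derived from the control bit once and can then be reused across all limbs of a bigint:

#include <stdint.h>
#include <stddef.h>

// Hedged sketch of a constant-time conditional copy:
// if ctl == 1: a <- b, if ctl == 0: a is left untouched.
// Memory accesses and instruction count are identical in both cases.
static void ccopy_limbs(uint64_t *a, const uint64_t *b, size_t len, uint64_t ctl) {
  uint64_t mask = (uint64_t)0 - (ctl & 1);   // 0x00...00 or 0xFF...FF, computed once
  for (size_t i = 0; i < len; ++i)
    a[i] ^= mask & (a[i] ^ b[i]);
}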

Squaring

This is planned

Multiplication

Further speed improvements are possible but will probably require either inflexible assembly / inline assembly (i.e. always compiled in, with a predetermined number of limbs) or a mini-compiler.
An example for multiprecision addition is available in the following macro:

macro addCarryGen_u64(a, b: untyped, bits: static int): untyped =
  var asmStmt = (block:
    " movq %[b], %[tmp]\n" &
    " addq %[tmp], %[a]\n"
  )

  let maxByteOffset = bits div 8
  const wsize = sizeof(uint64)

  when defined(gcc):
    for byteOffset in countup(wsize, maxByteOffset-1, wsize):
      asmStmt.add (block:
        "\n" &
        # movq 8+%[b], %[tmp]
        " movq " & $byteOffset & "+%[b], %[tmp]\n" &
        # adcq %[tmp], 8+%[a]
        " adcq %[tmp], " & $byteOffset & "+%[a]\n"
      )
  elif defined(clang):
    # https://lists.llvm.org/pipermail/llvm-dev/2017-August/116202.html
    for byteOffset in countup(wsize, maxByteOffset-1, wsize):
      asmStmt.add (block:
        "\n" &
        # movq 8+%[b], %[tmp]
        " movq " & $byteOffset & "%[b], %[tmp]\n" &
        # adcq %[tmp], 8+%[a]
        " adcq %[tmp], " & $byteOffset & "%[a]\n"
      )

  let tmp = ident("tmp")
  asmStmt.add (block:
    ": [tmp] \"+r\" (`" & $tmp & "`), [a] \"+m\" (`" & $a & "->limbs[0]`)\n" &
    ": [b] \"m\"(`" & $b & "->limbs[0]`)\n" &
    ": \"cc\""
  )

  result = newStmtList()
  result.add quote do:
    var `tmp`{.noinit.}: uint64

  result.add nnkAsmStmt.newTree(
    newEmptyNode(),
    newLit asmStmt
  )

  echo result.toStrLit

As explained in "A Fast Implementation of the Optimal Ate Pairing over BN curve on Intel Haswell Processor" (Shigeo Mitsunari, 2013, https://eprint.iacr.org/2013/362.pdf), one of the main bottlenecks on x86 is that the MUL instruction is very inflexible in terms of registers and requires lots of mov instructions before and after, which significantly hinders the throughput of Montgomery multiplication.
Furthermore, MUL clobbers the carry flags. By using MULX instead we can avoid that, and even use the ADCX and ADOX instructions to handle two independent carry chains and benefit from instruction-level parallelism, as described by Intel in https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf
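For illustration, MULX is exposed in C via the _mulx_u64 intrinsic (BMI2); a minimal sketch of a flag-preserving 64x64 -> 128-bit multiply, which is what allows ADCX/ADOX carry chains to stay live across the multiplications:

#include <stdint.h>
#include <immintrin.h>   // _mulx_u64, requires BMI2 (-mbmi2)

// Hedged sketch: unlike MUL, MULX does not clobber the carry/overflow flags,
// so adcx/adox chains in flight are not interrupted by the multiplication.
static inline uint64_t mul_lo_hi(uint64_t a, uint64_t b, uint64_t *hi) {
  unsigned long long h;
  uint64_t lo = _mulx_u64(a, b, &h);
  *hi = (uint64_t)h;
  return lo;
}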


mratsim (Owner, Author) commented Mar 16, 2020

Comparing:

The issue doesn't seem to be in moveMem.

Could it be an issue with the interaction between the carry flags and ccopy in:

func cadd(a: LimbsViewMut, b: LimbsViewAny, ctl: CTBool[Word], len: int): Carry =
  ## Type-erased conditional addition
  ## Returns the carry
  ##
  ## if ctl is true: a <- a + b
  ## if ctl is false: a <- a
  ## The carry is always computed whether ctl is true or false
  ##
  ## Time and memory accesses are the same whether a copy occurs or not
  result = Carry(0)
  var sum: Word
  for i in 0 ..< len:
    addC(result, sum, a[i], b[i], result)
    ctl.ccopy(a[i], sum)

func csub(a: LimbsViewMut, b: LimbsViewAny, ctl: CTBool[Word], len: int): Borrow =
  ## Type-erased conditional subtraction
  ## Returns the borrow
  ##
  ## if ctl is true: a <- a - b
  ## if ctl is false: a <- a
  ## The borrow is always computed whether ctl is true or false
  ##
  ## Time and memory accesses are the same whether a copy occurs or not
  result = Borrow(0)
  var diff: Word
  for i in 0 ..< len:
    subB(result, diff, a[i], b[i], result)
    ctl.ccopy(a[i], diff)

mratsim (Owner, Author) commented Mar 16, 2020

Fixed the 2 leftover bugs:

  1. GCC before version 7 generated wrong code on addcarry_u64
  2. The fallback for subborrow_u64 on ARM didn't mask the borrow byte (a sketch of the masking is shown below).
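To illustrate the second fix, here is a portable C sketch (not the actual Nim fallback) of a subborrow that masks the borrow down to a single bit; without the mask, the wrapped-around high half would corrupt the next limb's borrow input:

#include <stdint.h>

// Hedged sketch: (BorrowOut, Diff) <- a - b - BorrowIn
// The borrow must be masked to one bit, otherwise the all-ones high half of
// the wrapped 128-bit result would be fed back into the next limb.
static inline uint8_t subb_u64(uint8_t borrow_in, uint64_t a, uint64_t b, uint64_t *diff) {
  unsigned __int128 t = (unsigned __int128)a - b - borrow_in;
  *diff = (uint64_t)t;
  return (uint8_t)((t >> 64) & 1);   // masked borrow: 0 or 1
}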
