Parallel Multi-Scalar-Multiplication #226

Merged: 14 commits merged into master on Apr 10, 2023

Conversation

mratsim (Owner) commented Apr 10, 2023

Parallel Multi-Scalar-Multiplication

As mentioned in #220, this is the largest bottleneck in zero-knowledge proofs. It is worth millions in prizes (https://zprize.io) and has motivated custom ASICs and GPU libraries.

We introduce the fastest™ CPU implementation on BLS12-381 G1 for small-scale MSMs, and come within 3.2% of the fastest for medium-scale MSMs (starting from 2^18 = 262144 points).

Overview

Multi-scalar-multiplication (MSM) in pseudocode

func multiScalarMulImpl_reference_vartime:

  c          <- fn(numPoints)  with `fn` a function that minimizes the total number of Elliptic Curve additions
                               in the order of log2(numPoints) - 3
  numWindows <- ⌈coefBits/c⌉
  numBuckets <- 2ᶜ⁻¹
  r          <- ∅              (The elliptic curve infinity point)

  miniMSMs[0..<numWindows] <- ∅

  // 0.a MiniMSMs accumulation
  for w in 0 ..< numWindows:

    // 1.a Bucket accumulation
    buckets[0..<numBuckets] <- ∅
    for j in 0 ..< numPoints:
      b <- coefs[j].getWindowAt(w*c, c)
      buckets[b] += points[j]

    // 1.r Bucket reduction
    accumBuckets <- ∅
    for k in countdown(numBuckets-1, 0):
      accumBuckets += buckets[k]
      miniMSMs[w] += accumBuckets

  // 0.r MiniMSM reduction
  for w in countdown(numWindows-1, 0):
    for _ in 0 ..< c:
      r.double()
    r += miniMSMs[w]

  return r
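
For concreteness, below is a minimal standalone sketch of the window extraction behind coefs[j].getWindowAt(w*c, c). Constantine's actual implementation operates on its internal BigInt limbs; this illustrative version takes the scalar as little-endian bytes:

# Illustrative sketch only: window extraction over a little-endian byte array.
func getWindowAt(scalarLE: openArray[byte], bitIndex, c: int): uint32 =
  ## Extract `c` bits (assuming c <= 16) starting at `bitIndex`.
  let
    byteIdx = bitIndex shr 3     # bitIndex div 8
    shift   = bitIndex and 7     # bitIndex mod 8
  var acc: uint64
  # Gather 4 bytes so the window is fully covered after shifting.
  for i in 0 ..< 4:
    if byteIdx + i < scalarLE.len:
      acc = acc or (uint64(scalarLE[byteIdx + i]) shl (8*i))
  result = uint32((acc shr shift) and ((1'u64 shl c) - 1))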

There are three main levels of parallelism:

  • top-level / MSM-parallelism: partitioning the points and running separate MSMs on each partition before recombining them.
    Disadvantages: MSM complexity is about O(n/log(n)), so the more points in a single MSM, the more we save per point; partitioning shrinks each sub-MSM and forfeits part of that saving. A large number of points also lets us afford a large bit window c for the inner miniMSMs.
    Advantages: this parallelizes even the tricky final reduction (which is otherwise parallelizable only at the cost of extra doublings). Also, increasing c has diminishing returns in practice (around c ≈ 16); past that point, MSM-level parallelism has no disadvantage.
  • mid-level / window-parallelism: scheduling the 0.a MiniMSMs accumulation loop on different threads (see the sketch after this list).
    Disadvantages: none, this is natural parallelism. It is however limited: assuming 255-bit coefficients for a large MSM, the window size is about 16, so we expose 255/16 ≈ 15.9x parallelism opportunities. If we split the scalars using the endomorphism, this becomes 127/16 ≈ 7.9x. With a high core count, or on GPUs, more parallelism is needed.
  • bottom-level / bucket-parallelism: scheduling the 1.a Bucket accumulation loop on different threads.
    At first glance we might want each thread to handle a separate chunk of points, but threads would then race to add into the same buckets. Instead, each thread handles a separate range of buckets and skips the points that do not fall into its range.
    Advantages: the number of buckets easily reaches the thousands (starting from 2^13 = 8192 inputs), providing large parallelism opportunities, even on GPUs.
    Disadvantages: every task must scan the whole input range, which becomes extremely expensive with millions of inputs as large as 255 bits.
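
To make window-level parallelism concrete, here is a self-contained toy sketch using Nim's std/threadpool (compile with --threads:on), with unsigned-integer arithmetic standing in for elliptic-curve addition and doubling; miniMSM and msmParallel are illustrative names, not Constantine's internals:

import std/threadpool

const c = 4                            # toy window size
type Scalar = uint64                   # toy 64-bit "coefficients" and "points"

proc miniMSM(coefs, points: seq[Scalar], w: int): Scalar =
  ## 1.a Bucket accumulation + 1.r bucket reduction for window `w`.
  var buckets: array[1 shl c, Scalar]  # 2^c buckets (toy: no signed digits)
  for j in 0 ..< points.len:
    let b = int((coefs[j] shr (w*c)) and ((1'u64 shl c) - 1))
    buckets[b] += points[j]            # stand-in for EC mixed addition
  var accum: Scalar
  for k in countdown(buckets.len - 1, 1):
    accum += buckets[k]
    result += accum                    # result = sum_k k*buckets[k]

proc msmParallel(coefs, points: seq[Scalar]): Scalar =
  let numWindows = (64 + c - 1) div c
  var miniMSMs = newSeq[FlowVar[Scalar]](numWindows)
  for w in 0 ..< numWindows:              # 0.a: one task per window
    miniMSMs[w] = spawn miniMSM(coefs, points, w)
  for w in countdown(numWindows - 1, 0):  # 0.r: serial final reduction
    for _ in 0 ..< c:
      result += result                 # stand-in for EC doubling
    result += ^miniMSMs[w]             # blocks until window w is done

when isMainModule:
  assert msmParallel(@[3'u64, 5, 7], @[10'u64, 20, 30]) == 3*10 + 5*20 + 7*30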

Note on other parallelism opportunities:

  • for bucket accumulation, it might be tempting to use parallel batch addition into each bucket, but that would require allocating a large buffer per bucket (min 1024 points) to be worthwhile. This seemed too expensive for now.
  • bucket reduction and miniMSM reduction can also be done in parallel, but the parallelism is limited and requires extra doublings: compute the top half and the bottom half separately, then align the top half with doublings (127 doublings, for example), as sketched below.
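
In the same pseudocode style, the split reduction looks like this (a sketch; the doubling count mid*c would be the 127 of the example above, for scalars split at bit 127):

  mid <- numWindows div 2
  rLo <- reduce(miniMSMs[0 ..< mid])            // bottom half
  rHi <- reduce(miniMSMs[mid ..< numWindows])   // top half, in parallel with rLo
  for _ in 0 ..< mid*c:                         // extra doublings to align, e.g. 127
    rHi.double()
  r <- rHi + rLo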

Constantine implements window-level parallelism, and has a stub for bucket-level parallelism for cases where window-level parallelism does not expose enough work to occupy all cores. Unfortunately, bucket-level parallelism as currently implemented actually degrades performance for the problem sizes and hardware tested (laptop CPU i9-9980HK, 8 cores / 16 threads), so it is deactivated.

Bug fix

In #223, an additional slight optimization was to avoid syscalls in `wake` when there are sleepy workers that are not yet asleep: the change in epoch is sufficient to prevent their sleep.
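
The idea can be sketched as a minimal eventcount (illustrative names and fields only, not Constantine's actual backoff implementation):

import std/atomics

type EventCount = object
  epoch: Atomic[uint32]        # bumped on every wake
  anySleeper: Atomic[bool]     # set only once a waiter actually blocks

proc sleepy(ec: var EventCount): uint32 =
  ## Phase 1 of the 2-phase commit to sleep: announce intent, take a ticket.
  ec.epoch.load(moAcquire)

proc sleep(ec: var EventCount, ticket: uint32) =
  ## Phase 2: block only if no wake happened since the ticket was taken.
  if ec.epoch.load(moAcquire) != ticket:
    return                     # epoch changed: the sleep is cancelled for free
  ec.anySleeper.store(true, moRelease)
  # ... park on a futex / condition variable here ...

proc wake(ec: var EventCount) =
  discard ec.epoch.fetchAdd(1, moRelease)  # sleepy-but-awake workers see this
  if ec.anySleeper.load(moAcquire):
    # Only pay for a syscall when a thread is truly asleep.
    # ... futex wake / signal here ...
    ec.anySleeper.store(false, moRelaxed)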

Unfortunately, an ordering issue prevented a thief from waking up any thread when all other threads were sleeping:

In the event loop, a thief first notifies that it is sleepy (a 2-phase commit-to-sleep protocol to avoid deadlocks/sleep-locks), then attempts to steal, and on success cancels its sleep:

proc eventLoop(ctx: var WorkerContext) {.raises:[], gcsafe.} =
  ## Each worker thread executes this loop over and over.
  while true:
    # 1. Pick from local queue
    debug: log("Worker %3d: eventLoop 1 - searching task from local queue\n", ctx.id)
    while (var task = ctx.taskqueue[].pop(); not task.isNil):
      debug: log("Worker %3d: eventLoop 1 - running task 0x%.08x (parent 0x%.08x, current 0x%.08x)\n", ctx.id, task, task.parent, ctx.currentTask)
      ctx.run(task)

    # 2. Run out of tasks, become a thief
    debug: log("Worker %3d: eventLoop 2 - becoming a thief\n", ctx.id)
    let ticket = ctx.threadpool.globalBackoff.sleepy()
    if (var stolenTask = ctx.tryStealOne(); not stolenTask.isNil):
      # We manage to steal a task, cancel sleep
      ctx.threadpool.globalBackoff.cancelSleep()

However, on success, it wakes another thread before cancelling its own sleep:

proc tryStealOne(ctx: var WorkerContext): ptr Task =
  ## Try to steal a task.
  let seed = ctx.rng.nextU32()
  for targetId in seed.pseudoRandomPermutation(ctx.threadpool.numThreads):
    if targetId == ctx.id:
      continue
    let stolenTask = ctx.id.steal(ctx.threadpool.workerQueues[targetId])
    if not stolenTask.isNil():
      # Theft successful, there might be more work for idle threads, wake one
      ctx.threadpool.globalBackoff.wake()
      return stolenTask
  return nil

Due to the #223 optimization (which otherwise brought a significant improvement), the thief detects itself as sleepy but not sleeping, and so never wakes another thread, limiting parallelism.

In Multi-Scalar-Multiplication this was significant: parallelism was limited to 4 cores before the fix in 8271be5.
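
A sketch of the corrected ordering (illustrative, not the literal diff of 8271be5): the thief cancels its own sleep before issuing the wake, so the wake targets another thread instead of being absorbed by the thief's own sleepy state:

if (var stolenTask = ctx.tryStealOne(); not stolenTask.isNil):
  ctx.threadpool.globalBackoff.cancelSleep()  # 1. commit: we are definitely awake
  # Theft successful, there might be more work for idle threads, wake one
  ctx.threadpool.globalBackoff.wake()         # 2. only then wake an idle peer
  ctx.run(stolenTask)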

Unfortunately, this fix increases scheduling overhead by 1.8x on "empty tasks" due to the extra syscalls, as measured on the fibonacci benchmark. It is nonetheless beneficial on non-synthetic benchmarks, and also beneficial for power consumption.

mratsim (Owner, Author) commented Apr 10, 2023

Benchmarks

Configuration

Curve BLS12-381 G1
CPU: i9-11980HK (laptop Tiger Lake CPU, 2021, 8 cores / 16 threads)

We benchmark against Gnark, the fastest multithreaded framework. Single-threaded benches are available in #220 and on zkalc: https://crypto.ethereum.org/blog/zkalc.

Constantine: nimble bench_ec_g1_msm_bls12_381_clang or nim c -r -d:danger --hints:off --warnings:off --threads:on --outdir:build benchmarks/bench_ec_g1_msm_bls12_381.nim
Gnark: go test -bench=MultiExpG1 -run=^#

8-16 inputs

(benchmark screenshot)

32-128 inputs

(benchmark screenshots)

Constantine is 1.50x faster with 32 inputs, 1.38x faster with 64 inputs, 1.10x faster with 128 inputs

256-1024 inputs

(benchmark screenshots)

Constantine is 1.007x faster with 256 inputs, 1.24x faster with 512 inputs, 1.14x faster with 1024 inputs

2048-8192 inputs

(benchmark screenshots)

Constantine is 1.19x faster with 2048 inputs, 1.03x faster with 4096 inputs, 1.16x faster with 8192 inputs

16384-65536 inputs

(benchmark screenshots)

Constantine is 1.028x faster with 16384 inputs, 1.12x faster with 32768 inputs, 1.13x faster with 65536 inputs

131072-262144 inputs

(benchmark screenshots)

Constantine is 1.016x faster with 131072 inputs
Constantine is 1.032x slower with 262144 inputs

mratsim (Owner, Author) commented Apr 10, 2023

Arkworks

Arkworks is the reference ecosystem for zero-knowledge proofs.

Unfortunately, arkworks doesn't seem to have a multithreaded MSM implementation. Their single-threaded implementation is benched with 131072 inputs (https://github.com/arkworks-rs/algebra/blob/bc991d4/bench-templates/src/macros/ec.rs#L206-L230), and the result is 1.83x slower than Constantine single-threaded.

(benchmark screenshot)

Bellman

Bellman is Zcash's backend for zero-knowledge proofs. Their MSM bench uses 65536 (1 << 16) inputs: https://github.com/zkcrypto/bellman/blob/e137775/benches/slow.rs#L14-L44. It is multicore-enabled by default: https://github.com/zkcrypto/bellman/blob/e137775/Cargo.toml#L44

(benchmark screenshot)

Constantine is 4.06x faster than Bellman

Bellman_ce

Bellman CE is a fork of Bellman by Matter Labs, with a focus on BN254_Snarks for the zkSync Ethereum L2.
Hence we bench using BN254_Snarks:

RUSTFLAGS="-C target-cpu=native -C target_feature=+bmi2,+adx,+sse4.1" cargo +nightly test --release --features "asm" -- --nocapture test_new_multexp_speed_with_bn256
(benchmark screenshots)

Constantine is 2.46x faster than Bellman CE

Barretenberg

Barretenberg is a C++ library by Aztec Protocol, with a focus on BN254_Snarks for the Aztec Ethereum L2.

(benchmark screenshots)

Performance varies between 1.035x slower and 1.19x faster.

mratsim merged commit 6c48975 into master on Apr 10, 2023
mratsim (Owner, Author) commented Apr 12, 2023

vs BLST

BLST actually has a multithreaded implementation in Rust: https://github.com/supranational/blst/blob/e9dfc5e/bindings/rust/src/pippenger.rs#L56-L88
It can be benched using blstrs: https://github.com/filecoin-project/blstrs/blob/73d70b9/benches/bls12_381/mod.rs#L139-L141

BLST Performance

(benchmark screenshot)

vs Gnark

(benchmark screenshot)

Beyond 512 points, BLST is significantly slower than Gnark.

vs Constantine

(benchmark screenshot)

Even between 32 and 512 points, Constantine is 1.26x to 1.51x faster.
At 131072 points (2^17), BLST is 1.66x slower than Gnark and 1.68x slower than Constantine.
