Parallel Multi-Scalar-Multiplication #226

Merged: 14 commits merged into master on Apr 10, 2023

Conversation

mratsim (Owner) commented Apr 10, 2023

Parallel Multi-Scalar-Multiplication

As mentioned in #220, this is the largest bottleneck in zero-knowledge proofs. It is worth millions in prizes (https://zprize.io) and has motivated custom ASICs and GPU libraries.

We introduce the fastest™ CPU implementation on BLS12-381 G1 for small-scale MSMs, and come within 3.2% of the fastest for medium-scale MSMs (starting from 2^18 = 262144 points).

Overview

Multi-scalar-multiplication (MSM) in pseudocode

func multiScalarMulImpl_reference_vartime:

  c          <- fn(numPoints)  with `fn` a function that minimizes the total number of Elliptic Curve additions
                               in the order of log2(numPoints) - 3
  numWindows <- ⌈coefBits/c⌉
  numBuckets <- 2ᶜ⁻¹
  r          <- ∅              (The elliptic curve infinity point)

  miniMSMs[0..<numWindows] <- ∅

  // 0.a MiniMSMs accumulation
  for w in 0 ..< numWindows:

    // 1.a Bucket accumulation
    buckets[0..<numBuckets] <- ∅
    for j in 0 ..< numPoints:
      b <- coefs[j].getWindowAt(w*c, c)
      buckets[b] += points[j]

    // 1.r Bucket reduction
    accumBuckets <- ∅
    for k in countdown(numBuckets-1, 0):
      accumBuckets += buckets[k]
      miniMSMs[w] += accumBuckets

  // 0.r MiniMSM reduction
  for w in countdown(numWindows-1, 0):
    for _ in 0 ..< c:
      r.double()
    r += miniMSMs[w]

  return r
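
For concreteness, below is a minimal standalone sketch of the window extraction behind coefs[j].getWindowAt(w*c, c). Constantine's actual implementation operates on its internal BigInt limbs; this illustrative version takes the scalar as little-endian bytes:

# Illustrative sketch only: window extraction over a little-endian byte array.
func getWindowAt(scalarLE: openArray[byte], bitIndex, c: int): uint32 =
  ## Extract `c` bits (assuming c <= 16) starting at `bitIndex`.
  let
    byteIdx = bitIndex shr 3     # bitIndex div 8
    shift   = bitIndex and 7     # bitIndex mod 8
  var acc: uint64
  # Gather 4 bytes so the window is fully covered after shifting.
  for i in 0 ..< 4:
    if byteIdx + i < scalarLE.len:
      acc = acc or (uint64(scalarLE[byteIdx + i]) shl (8*i))
  result = uint32((acc shr shift) and ((1'u64 shl c) - 1))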

There are three main levels of parallelism:

  • top-level / MSM-parallelism: partitioning the points and running separate MSMs on each partition before recombining them.
    Disadvantages: MSM complexity is about O(n/log(n)), so the more points in a single MSM, the more we save per point; partitioning shrinks each sub-MSM and forfeits part of that saving. A large number of points also lets us afford a large bit window c for the inner miniMSMs.
    Advantages: this parallelizes even the tricky final reduction (which is otherwise parallelizable only at the cost of extra doublings). Also, increasing c has diminishing returns in practice (around c ≈ 16); past that point, MSM-level parallelism has no disadvantage.
  • mid-level / window-parallelism: scheduling the 0.a MiniMSMs accumulation loop on different threads (see the sketch after this list).
    Disadvantages: none, this is natural parallelism. It is however limited: assuming 255-bit coefficients for a large MSM, the window size is about 16, so we expose 255/16 ≈ 15.9x parallelism opportunities. If we split the scalars using the endomorphism, this becomes 127/16 ≈ 7.9x. With a high core count, or on GPUs, more parallelism is needed.
  • bottom-level / bucket-parallelism: scheduling the 1.a Bucket accumulation loop on different threads.
    At first glance we might want each thread to handle a separate chunk of points, but threads would then race to add into the same buckets. Instead, each thread handles a separate range of buckets and skips the points that do not fall into its range.
    Advantages: the number of buckets easily reaches the thousands (starting from 2^13 = 8192 inputs), providing large parallelism opportunities, even on GPUs.
    Disadvantages: every task must scan the whole input range, which becomes extremely expensive with millions of inputs as large as 255 bits.
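
To make window-level parallelism concrete, here is a self-contained toy sketch using Nim's std/threadpool (compile with --threads:on), with unsigned-integer arithmetic standing in for elliptic-curve addition and doubling; miniMSM and msmParallel are illustrative names, not Constantine's internals:

import std/threadpool

const c = 4                            # toy window size
type Scalar = uint64                   # toy 64-bit "coefficients" and "points"

proc miniMSM(coefs, points: seq[Scalar], w: int): Scalar =
  ## 1.a Bucket accumulation + 1.r bucket reduction for window `w`.
  var buckets: array[1 shl c, Scalar]  # 2^c buckets (toy: no signed digits)
  for j in 0 ..< points.len:
    let b = int((coefs[j] shr (w*c)) and ((1'u64 shl c) - 1))
    buckets[b] += points[j]            # stand-in for EC mixed addition
  var accum: Scalar
  for k in countdown(buckets.len - 1, 1):
    accum += buckets[k]
    result += accum                    # result = sum_k k*buckets[k]

proc msmParallel(coefs, points: seq[Scalar]): Scalar =
  let numWindows = (64 + c - 1) div c
  var miniMSMs = newSeq[FlowVar[Scalar]](numWindows)
  for w in 0 ..< numWindows:              # 0.a: one task per window
    miniMSMs[w] = spawn miniMSM(coefs, points, w)
  for w in countdown(numWindows - 1, 0):  # 0.r: serial final reduction
    for _ in 0 ..< c:
      result += result                 # stand-in for EC doubling
    result += ^miniMSMs[w]             # blocks until window w is done

when isMainModule:
  assert msmParallel(@[3'u64, 5, 7], @[10'u64, 20, 30]) == 3*10 + 5*20 + 7*30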

Note on other parallelism opportunities:

  • for bucket accumulation, it might be tempting to use parallel batch addition into each bucket, but that would require allocating a large buffer per bucket (min 1024 points) to be worthwhile. This seemed too expensive for now.
  • bucket reduction and miniMSM reduction can also be done in parallel, but the parallelism is limited and requires extra doublings: compute the top half and the bottom half separately, then align the top half with doublings (127 doublings, for example), as sketched below.
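
In the same pseudocode style, the split reduction looks like this (a sketch; the doubling count mid*c would be the 127 of the example above, for scalars split at bit 127):

  mid <- numWindows div 2
  rLo <- reduce(miniMSMs[0 ..< mid])            // bottom half
  rHi <- reduce(miniMSMs[mid ..< numWindows])   // top half, in parallel with rLo
  for _ in 0 ..< mid*c:                         // extra doublings to align, e.g. 127
    rHi.double()
  r <- rHi + rLo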

Constantine implements window-level parallelism, and has a stub for bucket-level parallelism for cases where window-level parallelism does not expose enough work to occupy all cores. Unfortunately, bucket-level parallelism as currently implemented actually degrades performance for the problem sizes and hardware tested (laptop CPU i9-9980HK, 8 cores / 16 threads), so it is deactivated.

Bug fix

In #223, an additional slight optimization was to avoid syscalls in `wake` when there are sleepy workers that are not yet asleep: the change in epoch is sufficient to prevent their sleep.
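
The idea can be sketched as a minimal eventcount (illustrative names and fields only, not Constantine's actual backoff implementation):

import std/atomics

type EventCount = object
  epoch: Atomic[uint32]        # bumped on every wake
  anySleeper: Atomic[bool]     # set only once a waiter actually blocks

proc sleepy(ec: var EventCount): uint32 =
  ## Phase 1 of the 2-phase commit to sleep: announce intent, take a ticket.
  ec.epoch.load(moAcquire)

proc sleep(ec: var EventCount, ticket: uint32) =
  ## Phase 2: block only if no wake happened since the ticket was taken.
  if ec.epoch.load(moAcquire) != ticket:
    return                     # epoch changed: the sleep is cancelled for free
  ec.anySleeper.store(true, moRelease)
  # ... park on a futex / condition variable here ...

proc wake(ec: var EventCount) =
  discard ec.epoch.fetchAdd(1, moRelease)  # sleepy-but-awake workers see this
  if ec.anySleeper.load(moAcquire):
    # Only pay for a syscall when a thread is truly asleep.
    # ... futex wake / signal here ...
    ec.anySleeper.store(false, moRelaxed)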

Unfortunately, an ordering issue prevented a thief from waking up any thread when all other threads were sleeping:

In the event loop, a thief first notifies that it is sleepy (a 2-phase commit-to-sleep protocol to avoid deadlocks/sleep-locks), then attempts to steal, and on success cancels its sleep:

proc eventLoop(ctx: var WorkerContext) {.raises:[], gcsafe.} =
  ## Each worker thread executes this loop over and over.
  while true:
    # 1. Pick from local queue
    debug: log("Worker %3d: eventLoop 1 - searching task from local queue\n", ctx.id)
    while (var task = ctx.taskqueue[].pop(); not task.isNil):
      debug: log("Worker %3d: eventLoop 1 - running task 0x%.08x (parent 0x%.08x, current 0x%.08x)\n", ctx.id, task, task.parent, ctx.currentTask)
      ctx.run(task)

    # 2. Run out of tasks, become a thief
    debug: log("Worker %3d: eventLoop 2 - becoming a thief\n", ctx.id)
    let ticket = ctx.threadpool.globalBackoff.sleepy()
    if (var stolenTask = ctx.tryStealOne(); not stolenTask.isNil):
      # We manage to steal a task, cancel sleep
      ctx.threadpool.globalBackoff.cancelSleep()

However, on success, it wakes another thread before cancelling its own sleep:

proc tryStealOne(ctx: var WorkerContext): ptr Task =
  ## Try to steal a task.
  let seed = ctx.rng.nextU32()
  for targetId in seed.pseudoRandomPermutation(ctx.threadpool.numThreads):
    if targetId == ctx.id:
      continue
    let stolenTask = ctx.id.steal(ctx.threadpool.workerQueues[targetId])
    if not stolenTask.isNil():
      # Theft successful, there might be more work for idle threads, wake one
      ctx.threadpool.globalBackoff.wake()
      return stolenTask
  return nil

Due to the #223 optimization (which otherwise brought a significant improvement), the thief detects itself as sleepy but not sleeping, and so never wakes another thread, limiting parallelism.

In Multi-Scalar-Multiplication this was significant: parallelism was limited to 4 cores before the fix in 8271be5.
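
A sketch of the corrected ordering (illustrative, not the literal diff of 8271be5): the thief cancels its own sleep before issuing the wake, so the wake targets another thread instead of being absorbed by the thief's own sleepy state:

if (var stolenTask = ctx.tryStealOne(); not stolenTask.isNil):
  ctx.threadpool.globalBackoff.cancelSleep()  # 1. commit: we are definitely awake
  # Theft successful, there might be more work for idle threads, wake one
  ctx.threadpool.globalBackoff.wake()         # 2. only then wake an idle peer
  ctx.run(stolenTask)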

Unfortunately, this fix increases scheduling overhead by 1.8x on "empty tasks" due to the extra syscalls, as measured on the fibonacci benchmark. It is nonetheless beneficial on non-synthetic benchmarks, and also beneficial for power consumption.

mratsim (Owner, Author) commented Apr 10, 2023

Benchmarks

Configuration

Curve BLS12-381 G1
CPU: i9-11980HK (laptop Tiger Lake CPU, 2021, 8 cores / 16 threads)

We benchmark against Gnark, the fastest multithreaded framework. Single-threaded benches are available in #220 and on zkalc: https://crypto.ethereum.org/blog/zkalc.

Constantine: nimble bench_ec_g1_msm_bls12_381_clang or nim c -r -d:danger --hints:off --warnings:off --threads:on --outdir:build benchmarks/bench_ec_g1_msm_bls12_381.nim
Gnark: go test -bench=MultiExpG1 -run=^#

8-16 inputs

(benchmark screenshot)

32-128 inputs

(benchmark screenshots)

Constantine is 1.50x faster with 32 inputs, 1.38x faster with 64 inputs, 1.10x faster with 128 inputs

256-1024 inputs

(benchmark screenshots)

Constantine is 1.007x faster with 256 inputs, 1.24x faster with 512 inputs, 1.14x faster with 1024 inputs

2048-8192 inputs

(benchmark screenshots)

Constantine is 1.19x faster with 2048 inputs, 1.03x faster with 4096 inputs, 1.16x faster with 8192 inputs

16384-65536 inputs

(benchmark screenshots)

Constantine is 1.028x faster with 16384 inputs, 1.12x faster with 32768 inputs, 1.13x faster with 65536 inputs

131072-262144 inputs

(benchmark screenshots)

Constantine is 1.016x faster with 131072 inputs
Constantine is 1.032x slower with 262144 inputs

mratsim (Owner, Author) commented Apr 10, 2023

Arkworks

Arkworks is the reference ecosystem for zero-knowledge proofs.

Unfortunately, arkworks doesn't seem to have a multithreaded MSM implementation. Their single-threaded implementation is benched with 131072 inputs (https://github.com/arkworks-rs/algebra/blob/bc991d4/bench-templates/src/macros/ec.rs#L206-L230), and the result is 1.83x slower than Constantine single-threaded.

(benchmark screenshot)

Bellman

Bellman is Zcash's backend for zero-knowledge proofs. Their MSM bench uses 65536 (1 << 16) inputs: https://github.com/zkcrypto/bellman/blob/e137775/benches/slow.rs#L14-L44. It is multicore-enabled by default: https://github.com/zkcrypto/bellman/blob/e137775/Cargo.toml#L44

(benchmark screenshot)

Constantine is 4.06x faster than Bellman

Bellman_ce

Bellman CE is a fork of Bellman by Matter Labs, with a focus on BN254_Snarks for the zkSync Ethereum L2.
Hence we bench using BN254_Snarks:

RUSTFLAGS="-C target-cpu=native -C target_feature=+bmi2,+adx,+sse4.1" cargo +nightly test --release --features "asm" -- --nocapture test_new_multexp_speed_with_bn256
(benchmark screenshots)

Constantine is 2.46x faster than Bellman CE

Barretenberg

Barretenberg is a C++ library by Aztec Protocol, with a focus on BN254_Snarks for the Aztec Ethereum L2.

(benchmark screenshots)

Performance varies between 1.035x slower and 1.19x faster.

mratsim merged commit 6c48975 into master on Apr 10, 2023
mratsim (Owner, Author) commented Apr 12, 2023

vs BLST

BLST actually has a multithreaded implementation in Rust: https://github.com/supranational/blst/blob/e9dfc5e/bindings/rust/src/pippenger.rs#L56-L88
It can be benched using blstrs: https://github.com/filecoin-project/blstrs/blob/73d70b9/benches/bls12_381/mod.rs#L139-L141

BLST Performance

(benchmark screenshot)

vs Gnark

(benchmark screenshot)

Beyond 512 points, BLST is significantly slower than Gnark.

vs Constantine

(benchmark screenshot)

Even between 32 and 512 points, Constantine is 1.26x to 1.51x faster.
At 131072 points (2^17), BLST is 1.66x slower than Gnark and 1.68x slower than Constantine.
