
Msm optimization #29

Closed · wants to merge 7 commits

Conversation

**@kilic** (Collaborator) commented Feb 27, 2023

This effort started as an attempt to locate a discrepancy problem in pse/halo2#40 and has since become a relatively simpler implementation.

Both implementations follow the batch-addition approach with the bucket method (a minimal sketch follows the list below). Some promising techniques that are applied in #40 are not applied here yet. These are:

  • endo decomposition to halve the number of rounds in exchange for doubling the number of scalars
  • signed scalar decomposition to reduce the bucket size, which would reduce memory consumption and the number of steps in the accumulation phase
  • bottom-up parallelization as applied in #40 and suggested by @Brechtpd. Currently this PR applies top-down parallelization as in zcash/halo2::arithmetic::best_multiexp (sketched after the benchmark table below)
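
For readers unfamiliar with the bucket method, here is a minimal serial sketch of the algorithm's structure. It is not this PR's code: `i128` addition stands in for curve point addition, and all names are illustrative; the actual implementation works over halo2curves points and layers batch affine addition on top.

```rust
/// Minimal bucket-method (Pippenger) MSM sketch. `i128` addition stands in
/// for elliptic-curve point addition; `acc += acc` stands in for doubling.
fn bucket_msm(scalars: &[u64], points: &[i128], window: usize) -> i128 {
    let num_windows = (64 + window - 1) / window;
    let mask = (1u64 << window) - 1;
    let mut acc = 0i128;
    // Top-down: process windows from the most significant digit downwards.
    for w in (0..num_windows).rev() {
        // "Double" the accumulator `window` times before absorbing this round.
        for _ in 0..window {
            acc += acc;
        }
        // Sort each point into one of 2^window - 1 buckets keyed by its digit.
        let mut buckets = vec![0i128; (1usize << window) - 1];
        for (scalar, point) in scalars.iter().zip(points) {
            let digit = (scalar >> (w * window)) & mask;
            if digit != 0 {
                buckets[(digit - 1) as usize] += point;
            }
        }
        // Accumulation phase: the running-sum trick adds bucket i with
        // weight i + 1 using only two additions per bucket.
        let mut running = 0i128;
        for bucket in buckets.iter().rev() {
            running += bucket;
            acc += running;
        }
    }
    acc
}

fn main() {
    // 3 * 2 + 5 * 4 + 7 * 6 = 68
    assert_eq!(bucket_msm(&[3, 5, 7], &[2, 4, 6], 4), 68);
}
```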

Without these optimisations, performance seems close to #40, while the code is approximately 5x smaller. This PR also copies the other msm functions in order to compare this effort against #40 and zcash's msm; these copies will be removed once the PR matures. The rough bench results below were obtained on an M1 machine.

| k | #40 | zcash serial | zcash parallel | this serial | this parallel | notes |
|----|----------|----------|----------|----------|----------|------------------------|
| 10 | 12.39ms  | 15.30ms  | 5.77ms   | 11.11ms  | 7.91ms   | |
| 11 | 18.09ms  | 26.69ms  | 8.43ms   | 16.21ms  | 8.22ms   | |
| 12 | 20.03ms  | 51.43ms  | 13.22ms  | 26.18ms  | 12.84ms  | this takes over the lead |
| 13 | 26.74ms  | 85.26ms  | 20.07ms  | 44.86ms  | 18.39ms  | |
| 14 | 34.80ms  | 159.12ms | 31.62ms  | 81.60ms  | 22.37ms  | |
| 15 | 48.35ms  | 283.77ms | 58.36ms  | 146.15ms | 30.52ms  | |
| 16 | 75.76ms  | 525.89ms | 105.72ms | 283.59ms | 56.33ms  | |
| 17 | 127.98ms | 1.00s    | 166.15ms | 577.60ms | 100.93ms | |
| 18 | 207.24ms | 1.86s    | 300.41ms | 1.09s    | 164.42ms | |
| 19 | 402.92ms | 3.50s    | 568.79ms | 2.26s    | 350.78ms | |
| 20 | 714.45ms | 6.85s    | 1.05s    | 4.75s    | 709.65ms | |
| 21 | 1.32s    | 12.73s   | 2.34s    | 9.77s    | 1.60s    | #40 takes over the lead |
| 22 | 2.58s    | 24.82s   | 3.71s    | 19.59s   | 2.96s    | |
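
To make the parallelization contrast concrete, here is a hedged sketch of the top-down strategy (one task per window, combined by shifting), again with `i128` standing in for points and all names illustrative; the per-window bucket sum from the earlier sketch is repeated so the snippet stands alone.

```rust
use std::thread;

/// Bucket accumulation for a single window `w` (same logic as in the
/// serial sketch above).
fn window_sum(scalars: &[u64], points: &[i128], w: usize, window: usize) -> i128 {
    let mask = (1u64 << window) - 1;
    let mut buckets = vec![0i128; (1usize << window) - 1];
    for (scalar, point) in scalars.iter().zip(points) {
        let digit = (scalar >> (w * window)) & mask;
        if digit != 0 {
            buckets[(digit - 1) as usize] += point;
        }
    }
    let (mut running, mut sum) = (0i128, 0i128);
    for bucket in buckets.iter().rev() {
        running += bucket;
        sum += running;
    }
    sum
}

/// Top-down parallelization: each thread owns one window over *all* points,
/// and the partial results are combined by shifting (repeated doubling on a
/// real curve).
fn msm_top_down(scalars: &[u64], points: &[i128], window: usize) -> i128 {
    let num_windows = (64 + window - 1) / window;
    thread::scope(|s| {
        let handles: Vec<_> = (0..num_windows)
            .map(|w| s.spawn(move || window_sum(scalars, points, w, window) << (w * window)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}

fn main() {
    assert_eq!(msm_top_down(&[3, 5, 7], &[2, 4, 6], 4), 68);
}
```

Bottom-up parallelization instead splits the points into per-thread chunks, runs a full serial MSM on each chunk, and sums the partial results, so each thread touches a smaller, more cache-friendly slice of the data.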

Pasta curves haven't been covered yet since there is no `Point::from_xy_unchecked`-style constructor that would allow the cheap point construction required by batch addition. A PR to zcash/pasta_curves is planned in order to enable the generic msm functionality here.

Another limitation of this PR is that it assumes the base points are independent and not equal to infinity, so that doubling and infinity checks can be skipped; this is the usual case in polynomial commitments.
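
These two constraints are what make batch affine addition work: every pairwise addition then has a nonzero `x2 - x1` denominator, so all the slopes of a round can share a single field inversion via the Montgomery trick. Below is a minimal sketch of that trick over a toy prime field; the modulus and names are illustrative, not the curve fields in this repo.

```rust
/// Toy prime field; 65_537 is illustrative, not one of the fields in this repo.
const P: u64 = 65_537;

fn mul(a: u64, b: u64) -> u64 {
    a * b % P
}

/// Single inversion via Fermat's little theorem: a^(p-2) mod p.
fn invert(a: u64) -> u64 {
    let (mut base, mut exp, mut acc) = (a, P - 2, 1u64);
    while exp > 0 {
        if exp & 1 == 1 {
            acc = mul(acc, base);
        }
        base = mul(base, base);
        exp >>= 1;
    }
    acc
}

/// Montgomery trick: invert n nonzero elements in place with one `invert`
/// call plus 3(n - 1) multiplications.
fn batch_invert(vals: &mut [u64]) {
    // Forward pass: prefix[i] holds the product of vals[..i].
    let mut prefix = Vec::with_capacity(vals.len());
    let mut acc = 1u64;
    for v in vals.iter() {
        prefix.push(acc);
        acc = mul(acc, *v);
    }
    // One inversion of the grand product.
    let mut inv = invert(acc);
    // Backward pass: peel off one element at a time.
    for (v, p) in vals.iter_mut().zip(prefix).rev() {
        let peeled = mul(inv, *v); // inverse of the product of all earlier elements
        *v = mul(inv, p); // inverse of the original *v
        inv = peeled;
    }
}

fn main() {
    let mut vals = [3u64, 5, 7, 11];
    batch_invert(&mut vals);
    for (inv, orig) in vals.iter().zip([3u64, 5, 7, 11]) {
        assert_eq!(mul(*inv, orig), 1);
    }
}
```

In the batch-addition setting the `vals` are the `x2 - x1` denominators of a whole round of independent additions; the assumption above guarantees they are all nonzero, so no per-pair doubling or infinity branch is needed.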

**@mratsim** (Contributor) commented Oct 25, 2023

I have redone the benchmarks on my PC using this branch https://github.com/taikoxyz/halo2curves/tree/research-msm-startover, adding the new ASM from #49 and the inversion algorithm from #83.

Strangely, on my machine both your PR and @Brechtpd's get slower than Zcash's at 2²² sizes.

Full details below, from commit 1fd2e54 on the left and with the new assembly and Bernstein-Yang inversion on the right.

*(screenshots: benchmark tables)*

Some details can be found in this branch https://github.com/taikoxyz/halo2curves/tree/research-msm-opt, which tried to rebase all the PRs on top of main and use the criterion benchmark facilities from #86, but test_round failed, and having no field access in traits is just 🤕

`cargo test --features print-trace,asm --release -- --ignored --nocapture test_multiexp_bench`

*(screenshot: benchmark output)*

For reference, here are the benchmarks of my own library.

*(screenshots: benchmark tables)*

**@jonathanpwang** (Contributor) commented:

@mratsim Thanks for sharing this and your branch! (I did not want to try updating Brecht's PR to ff 0.13 etc...)

I ran the taikoxyz@25d80d7 msm bench on my laptop (M2 Max MacBook Pro, no asm, 12 cores, 96 GB RAM); here are the results:

*(screenshot: benchmark results, 2023-11-14)*

I'll get some AWS machines to compare the asm times with different numbers of cores.

**@jonathanpwang** (Contributor) commented:

This is very surprising; it seems to have to do with x86 processors, even without asm turned on. (Or is it just something general about memory?)

On a c7a.8xl (x86_64, 32 cores, AMD EPYC 9R14 processor, 64 GB memory), with `cargo bench --bench msm --features "bn256-table, asm"`:

*(screenshot: benchmark results)*

By the way, I also ran plain `cargo bench --bench msm` without asm, and the performance was indeed worse.

On a c7a.24xl (x86_64, 96 cores, AMD EPYC 9R14 processor, 192 GB memory), with `cargo bench --bench msm --features "bn256-table, asm"`:

*(screenshot: benchmark results)*

Lastly, I checked an ARM processor, c7g.8xl (arm64, 32 cores, AWS Graviton3 processor, 64 GB memory), with `cargo bench --bench msm`:

*(screenshot: benchmark results)*

So the absolute best time for 32 cores is still on x86_64 (the c7a has a 3.7 GHz clock, while the c7g is only 2.5 GHz), but Brecht's msm (halo2_pr40) is faster than zcash's on arm64 and slower on x86.

@mratsim

**@jonathanpwang** (Contributor) commented:

BTW, it is worth noting that with 3x as many cores, zcash MSM was only 2x faster.
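
As a rough back-of-envelope reading of that scaling: under Amdahl's law $S(n) = 1/\big((1-p) + p/n\big)$, and solving $S(96)/S(32) = 2$ gives a parallel fraction $p = 96/97 \approx 0.99$. So even a ~1% serial portion would explain it, though contention on a shared resource such as memory bandwidth fits the numbers just as well.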

**@mratsim** (Contributor) commented Nov 15, 2023

> So the absolute best time for 32 cores is still on x86_64 (the c7a has a 3.7 GHz clock, while the c7g is only 2.5 GHz), but Brecht's msm (halo2_pr40) is faster than zcash's on arm64 and slower on x86.

I'm at Devconnect so I can't investigate much, but it sounds to me like a memory bandwidth bottleneck.

The approach used by Barretenberg needs sorting and lots of allocation; this is what I describe in my deep dive: https://gist.github.com/mratsim/27c78c71fd423f731615a91d237162c3#file-multi-scalar-mul-md (a toy illustration of the scheduling step follows).
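
To make the cost concrete, here is a toy illustration (not Barretenberg's actual code; all names are hypothetical) of that scheduling step: for each window round, every active (digit, point-index) pair is materialized and sorted so that same-bucket additions become adjacent and can be paired up for batch affine addition. Both the per-round allocation and the sort generate memory traffic on top of the point data itself.

```rust
/// Toy sketch of bucket scheduling for one window round `w`. The returned
/// pairs are sorted by bucket so same-bucket additions sit next to each other.
fn schedule(scalars: &[u64], w: usize, window: usize) -> Vec<(u64, usize)> {
    let mask = (1u64 << window) - 1;
    // One O(n) allocation per round, on top of the buckets themselves.
    let mut pairs: Vec<(u64, usize)> = scalars
        .iter()
        .enumerate()
        .filter_map(|(i, s)| {
            let digit = (s >> (w * window)) & mask;
            (digit != 0).then_some((digit, i))
        })
        .collect();
    // O(n log n) sort (by digit, then index): more memory traffic before any
    // point is even touched.
    pairs.sort_unstable();
    pairs
}

fn main() {
    // Window 0, 4-bit digits: scalars 3, 5, 3 put points 0 and 2 in the
    // same bucket, ready to be batch-added.
    assert_eq!(schedule(&[3, 5, 3], 0, 4), vec![(3, 0), (3, 2), (5, 1)]);
}
```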
