
Msm optimization #29

Closed · wants to merge 7 commits

Conversation

**@kilic** (Collaborator) commented Feb 27, 2023

This effort started as an attempt to locate a discrepancy problem in pse/halo2#40 and has since become a relatively simpler implementation.

Both implementations follow the batch-addition approach with the bucket method (a minimal sketch follows the list below). Some promising techniques that are applied in #40 are not applied here yet. These are:

  • endo decomposition to halve the number of rounds in exchange for doubling the number of scalars
  • signed scalar decomposition to reduce the bucket size, which would reduce memory consumption and the number of steps in the accumulation phase
  • bottom-up parallelization as applied in #40 and suggested by @Brechtpd. Currently this PR applies top-down parallelization as in zcash/halo2::arithmetic::best_multiexp (sketched after the benchmark table below)
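
For readers unfamiliar with the bucket method, here is a minimal serial sketch of the algorithm's structure. It is not this PR's code: `i128` addition stands in for curve point addition, and all names are illustrative; the actual implementation works over halo2curves points and layers batch affine addition on top.

```rust
/// Minimal bucket-method (Pippenger) MSM sketch. `i128` addition stands in
/// for elliptic-curve point addition; `acc += acc` stands in for doubling.
fn bucket_msm(scalars: &[u64], points: &[i128], window: usize) -> i128 {
    let num_windows = (64 + window - 1) / window;
    let mask = (1u64 << window) - 1;
    let mut acc = 0i128;
    // Top-down: process windows from the most significant digit downwards.
    for w in (0..num_windows).rev() {
        // "Double" the accumulator `window` times before absorbing this round.
        for _ in 0..window {
            acc += acc;
        }
        // Sort each point into one of 2^window - 1 buckets keyed by its digit.
        let mut buckets = vec![0i128; (1usize << window) - 1];
        for (scalar, point) in scalars.iter().zip(points) {
            let digit = (scalar >> (w * window)) & mask;
            if digit != 0 {
                buckets[(digit - 1) as usize] += point;
            }
        }
        // Accumulation phase: the running-sum trick adds bucket i with
        // weight i + 1 using only two additions per bucket.
        let mut running = 0i128;
        for bucket in buckets.iter().rev() {
            running += bucket;
            acc += running;
        }
    }
    acc
}

fn main() {
    // 3 * 2 + 5 * 4 + 7 * 6 = 68
    assert_eq!(bucket_msm(&[3, 5, 7], &[2, 4, 6], 4), 68);
}
```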

Without these optimisations, performance seems close to #40, while the code is approximately 5x smaller. This PR also copies the other msm functions in order to compare this effort against #40 and zcash's msm; these copies will be removed once the PR matures. The rough bench results below were obtained on an M1 machine.

| k | #40 | zcash serial | zcash parallel | this serial | this parallel | notes |
|----|----------|----------|----------|----------|----------|------------------------|
| 10 | 12.39ms  | 15.30ms  | 5.77ms   | 11.11ms  | 7.91ms   | |
| 11 | 18.09ms  | 26.69ms  | 8.43ms   | 16.21ms  | 8.22ms   | |
| 12 | 20.03ms  | 51.43ms  | 13.22ms  | 26.18ms  | 12.84ms  | this takes over the lead |
| 13 | 26.74ms  | 85.26ms  | 20.07ms  | 44.86ms  | 18.39ms  | |
| 14 | 34.80ms  | 159.12ms | 31.62ms  | 81.60ms  | 22.37ms  | |
| 15 | 48.35ms  | 283.77ms | 58.36ms  | 146.15ms | 30.52ms  | |
| 16 | 75.76ms  | 525.89ms | 105.72ms | 283.59ms | 56.33ms  | |
| 17 | 127.98ms | 1.00s    | 166.15ms | 577.60ms | 100.93ms | |
| 18 | 207.24ms | 1.86s    | 300.41ms | 1.09s    | 164.42ms | |
| 19 | 402.92ms | 3.50s    | 568.79ms | 2.26s    | 350.78ms | |
| 20 | 714.45ms | 6.85s    | 1.05s    | 4.75s    | 709.65ms | |
| 21 | 1.32s    | 12.73s   | 2.34s    | 9.77s    | 1.60s    | #40 takes over the lead |
| 22 | 2.58s    | 24.82s   | 3.71s    | 19.59s   | 2.96s    | |
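
To make the parallelization contrast concrete, here is a hedged sketch of the top-down strategy (one task per window, combined by shifting), again with `i128` standing in for points and all names illustrative; the per-window bucket sum from the earlier sketch is repeated so the snippet stands alone.

```rust
use std::thread;

/// Bucket accumulation for a single window `w` (same logic as in the
/// serial sketch above).
fn window_sum(scalars: &[u64], points: &[i128], w: usize, window: usize) -> i128 {
    let mask = (1u64 << window) - 1;
    let mut buckets = vec![0i128; (1usize << window) - 1];
    for (scalar, point) in scalars.iter().zip(points) {
        let digit = (scalar >> (w * window)) & mask;
        if digit != 0 {
            buckets[(digit - 1) as usize] += point;
        }
    }
    let (mut running, mut sum) = (0i128, 0i128);
    for bucket in buckets.iter().rev() {
        running += bucket;
        sum += running;
    }
    sum
}

/// Top-down parallelization: each thread owns one window over *all* points,
/// and the partial results are combined by shifting (repeated doubling on a
/// real curve).
fn msm_top_down(scalars: &[u64], points: &[i128], window: usize) -> i128 {
    let num_windows = (64 + window - 1) / window;
    thread::scope(|s| {
        let handles: Vec<_> = (0..num_windows)
            .map(|w| s.spawn(move || window_sum(scalars, points, w, window) << (w * window)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}

fn main() {
    assert_eq!(msm_top_down(&[3, 5, 7], &[2, 4, 6], 4), 68);
}
```

Bottom-up parallelization instead splits the points into per-thread chunks, runs a full serial MSM on each chunk, and sums the partial results, so each thread touches a smaller, more cache-friendly slice of the data.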

Pasta curves haven't been covered yet since there is no `Point::from_xy_unchecked`-style constructor that would allow the cheap point construction required by batch addition. A PR to zcash/pasta_curves is planned in order to enable the generic msm functionality here.

Another limitation of this PR is that it assumes the base points are independent and not equal to infinity, so that doubling and infinity checks can be skipped; this is the usual case in polynomial commitments.
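
These two constraints are what make batch affine addition work: every pairwise addition then has a nonzero `x2 - x1` denominator, so all the slopes of a round can share a single field inversion via the Montgomery trick. Below is a minimal sketch of that trick over a toy prime field; the modulus and names are illustrative, not the curve fields in this repo.

```rust
/// Toy prime field; 65_537 is illustrative, not one of the fields in this repo.
const P: u64 = 65_537;

fn mul(a: u64, b: u64) -> u64 {
    a * b % P
}

/// Single inversion via Fermat's little theorem: a^(p-2) mod p.
fn invert(a: u64) -> u64 {
    let (mut base, mut exp, mut acc) = (a, P - 2, 1u64);
    while exp > 0 {
        if exp & 1 == 1 {
            acc = mul(acc, base);
        }
        base = mul(base, base);
        exp >>= 1;
    }
    acc
}

/// Montgomery trick: invert n nonzero elements in place with one `invert`
/// call plus 3(n - 1) multiplications.
fn batch_invert(vals: &mut [u64]) {
    // Forward pass: prefix[i] holds the product of vals[..i].
    let mut prefix = Vec::with_capacity(vals.len());
    let mut acc = 1u64;
    for v in vals.iter() {
        prefix.push(acc);
        acc = mul(acc, *v);
    }
    // One inversion of the grand product.
    let mut inv = invert(acc);
    // Backward pass: peel off one element at a time.
    for (v, p) in vals.iter_mut().zip(prefix).rev() {
        let peeled = mul(inv, *v); // inverse of the product of all earlier elements
        *v = mul(inv, p); // inverse of the original *v
        inv = peeled;
    }
}

fn main() {
    let mut vals = [3u64, 5, 7, 11];
    batch_invert(&mut vals);
    for (inv, orig) in vals.iter().zip([3u64, 5, 7, 11]) {
        assert_eq!(mul(*inv, orig), 1);
    }
}
```

In the batch-addition setting the `vals` are the `x2 - x1` denominators of a whole round of independent additions; the assumption above guarantees they are all nonzero, so no per-pair doubling or infinity branch is needed.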

**@mratsim** (Contributor) commented Oct 25, 2023

I have redone the benchmarks on my PC using this branch https://github.com/taikoxyz/halo2curves/tree/research-msm-startover, adding the new ASM from #49 and the inversion algorithm from #83.

Strangely, on my machine both your PR and @Brechtpd's get slower than Zcash's at 2²² sizes.

Full details below, from commit 1fd2e54 on the left and with the new assembly and Bernstein-Yang inversion on the right.

*(screenshots: benchmark tables)*

Some details can be found in this branch https://github.com/taikoxyz/halo2curves/tree/research-msm-opt, which tried to rebase all the PRs on top of main and use the criterion benchmark facilities from #86, but test_round failed, and having no field access in traits is just 🤕

`cargo test --features print-trace,asm --release -- --ignored --nocapture test_multiexp_bench`

*(screenshot: benchmark output)*

For reference, here are the benchmarks of my own library.

*(screenshots: benchmark tables)*

**@jonathanpwang** (Contributor) commented:

@mratsim Thanks for sharing this and your branch! (I did not want to try updating Brecht's PR to ff 0.13 etc...)

I ran the taikoxyz@25d80d7 msm bench on my laptop (M2 Max MacBook Pro, no asm, 12 cores, 96 GB RAM); here are the results:

*(screenshot: benchmark results, 2023-11-14)*

I'll get some AWS machines to compare the asm times with different numbers of cores.

**@jonathanpwang** (Contributor) commented:

This is very surprising; it seems to have to do with x86 processors, even without asm turned on. (Or is it just something general about memory?)

On a c7a.8xl (x86_64, 32 cores, AMD EPYC 9R14 processor, 64 GB memory), with `cargo bench --bench msm --features "bn256-table, asm"`:

*(screenshot: benchmark results)*

By the way, I also ran plain `cargo bench --bench msm` without asm, and the performance was indeed worse.

On a c7a.24xl (x86_64, 96 cores, AMD EPYC 9R14 processor, 192 GB memory), with `cargo bench --bench msm --features "bn256-table, asm"`:

*(screenshot: benchmark results)*

Lastly, I checked an ARM processor, c7g.8xl (arm64, 32 cores, AWS Graviton3 processor, 64 GB memory), with `cargo bench --bench msm`:

*(screenshot: benchmark results)*

So the absolute best time for 32 cores is still on x86_64 (the c7a has a 3.7 GHz clock, while the c7g is only 2.5 GHz), but Brecht's msm (halo2_pr40) is faster than zcash's on arm64 and slower on x86.

@mratsim

**@jonathanpwang** (Contributor) commented:

BTW, it is worth noting that with 3x as many cores, zcash MSM was only 2x faster.
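
As a rough back-of-envelope reading of that scaling: under Amdahl's law $S(n) = 1/\big((1-p) + p/n\big)$, and solving $S(96)/S(32) = 2$ gives a parallel fraction $p = 96/97 \approx 0.99$. So even a ~1% serial portion would explain it, though contention on a shared resource such as memory bandwidth fits the numbers just as well.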

**@mratsim** (Contributor) commented Nov 15, 2023

> So the absolute best time for 32 cores is still on x86_64 (the c7a has a 3.7 GHz clock, while the c7g is only 2.5 GHz), but Brecht's msm (halo2_pr40) is faster than zcash's on arm64 and slower on x86.

I'm at Devconnect so I can't investigate much, but it sounds to me like a memory bandwidth bottleneck.

The approach used by Barretenberg needs sorting and lots of allocation; this is what I describe in my deep dive: https://gist.github.com/mratsim/27c78c71fd423f731615a91d237162c3#file-multi-scalar-mul-md (a toy illustration of the scheduling step follows).
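
To make the cost concrete, here is a toy illustration (not Barretenberg's actual code; all names are hypothetical) of that scheduling step: for each window round, every active (digit, point-index) pair is materialized and sorted so that same-bucket additions become adjacent and can be paired up for batch affine addition. Both the per-round allocation and the sort generate memory traffic on top of the point data itself.

```rust
/// Toy sketch of bucket scheduling for one window round `w`. The returned
/// pairs are sorted by bucket so same-bucket additions sit next to each other.
fn schedule(scalars: &[u64], w: usize, window: usize) -> Vec<(u64, usize)> {
    let mask = (1u64 << window) - 1;
    // One O(n) allocation per round, on top of the buckets themselves.
    let mut pairs: Vec<(u64, usize)> = scalars
        .iter()
        .enumerate()
        .filter_map(|(i, s)| {
            let digit = (s >> (w * window)) & mask;
            (digit != 0).then_some((digit, i))
        })
        .collect();
    // O(n log n) sort (by digit, then index): more memory traffic before any
    // point is even touched.
    pairs.sort_unstable();
    pairs
}

fn main() {
    // Window 0, 4-bit digits: scalars 3, 5, 3 put points 0 and 2 in the
    // same bucket, ready to be batch-added.
    assert_eq!(schedule(&[3, 5, 3], 0, 4), vec![(3, 0), (3, 2), (5, 1)]);
}
```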
