
[GIP] Batch Verification & Multithread Optimization on Range Proof Verification #1321

Closed
garyyu opened this issue Aug 7, 2018 · 24 comments

@garyyu
Contributor

garyyu commented Aug 7, 2018

This is a Grin Improvement Proposal.

Range proof verification is the most expensive computation in Grin — and I'm deliberately not saying "one of" here :) If I'm wrong, please let me know.

According to my range proof benchmark, it needs on average 40ms per verification and 80ms per creation on my MacBook Air (Early 2015). For comparison, ECDSA takes on average 0.14ms per signing and 0.17ms per signature verification on the same computer. The details of this benchmark can be found here.

$ cargo test --release bench_bullet_proof_wt_extra -- --nocapture

running 1 test
test bench::tests::bench_bullet_proof_wt_extra ...
spent time:	81(s)/(1000 bullet proof)
verify_bullet_proof ok:	1000/1000
spent time:	40(s)/(1000 verify bullet proof)

$ cargo test --release bench_bullet_proof_with_extra_msg -- --nocapture

running 1 test
test bench::tests::bench_bullet_proof_with_extra_msg ... test bench::tests::bench_bullet_proof_with_extra_msg has been running for over 60 seconds
spent time:	80(s)/(1000 bullet proof with extra msg)
verify_bullet_proof ok:	1000/1000
spent time:	41(s)/(1000 verify bullet proof with extra msg)

That's a bit higher than expected! With 50 million UTXOs, it would take about 23 days to complete all the range proof verifications on my poor old MacBook Air; even supposing a brand new computer is 5 times as fast as mine, that's still 4.6 days!
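The arithmetic behind those figures can be checked directly (a quick sketch; the 40ms/proof and 50 million UTXO numbers come from the benchmark above):

```rust
fn main() {
    let utxos: f64 = 50_000_000.0; // assumed UTXO set size
    let ms_per_verify: f64 = 40.0; // measured single verification time
    // Total sequential verification time, converted ms -> s -> days.
    let total_days = utxos * ms_per_verify / 1000.0 / 86_400.0;
    println!("{:.1} days", total_days);       // ~23.1 days sequentially
    println!("{:.1} days", total_days / 5.0); // ~4.6 days on a 5x faster machine
}
```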

Currently, when a new node is installed, fast sync mode downloads and validates the txhashset archive. It took me 17 minutes to verify 23140 range proofs (about 45ms/verification on average):

Aug 07 11:18:04.636 DEBG txhashset: validated the output|rproof|kernel mmrs, took 0s
Aug 07 11:18:17.429 DEBG txhashset: verified 42692 kernel signatures, pmmr size 85377, took 11s
Aug 07 11:35:31.781 DEBG txhashset: verified 23140 rangeproofs, pmmr size 95321, took 1034s

That's the longest part of the whole fast sync procedure (24 minutes in this example): range proof verification takes 70% of the total sync time, not even counting the BodySync stage, which also needs range proof verification.

Before looking into real optimizations of the range proof algorithm or its implementation itself, we can at least easily add a multithreading optimization at the application level.

Looking forward to hearing your views on this. If everyone agrees to proceed, I will try to implement this optimization.
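The proposed application-level parallelism could look roughly like this sketch (assumptions: `verify_range_proof` is a dummy stand-in for the real rust-secp256k1-zkp call, and the chunking/thread-count choices are illustrative only):

```rust
use std::thread;

// Stand-in for the real range proof verification
// (hypothetical; the real call goes through rust-secp256k1-zkp).
fn verify_range_proof(proof: &u64) -> bool {
    *proof % 2 == 0 // dummy validity check for illustration
}

// Split the proofs into one chunk per thread and verify chunks in parallel.
fn verify_all_parallel(proofs: Vec<u64>, n_threads: usize) -> bool {
    let chunk_size = (proofs.len() + n_threads - 1) / n_threads;
    let handles: Vec<_> = proofs
        .chunks(chunk_size)
        .map(|chunk| {
            let chunk = chunk.to_vec();
            thread::spawn(move || chunk.iter().all(verify_range_proof))
        })
        .collect();
    // All chunks must verify for the whole set to be valid.
    handles.into_iter().all(|h| h.join().unwrap())
}

fn main() {
    let proofs: Vec<u64> = (0..1000u64).map(|i| i * 2).collect(); // all "valid"
    assert!(verify_all_parallel(proofs, 4));
    println!("all proofs verified");
}
```

Since each verification is independent, the speedup should be close to linear in the number of cores.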

@antiochp
Member

antiochp commented Aug 7, 2018

Related - #836

We should be able to "batch verify" bulletproofs via libsecp.
We may want to investigate multi-threading here but the big win will come from batching these up.

I'm not sure where we got to with this but @yeastplume may have more info.

@garyyu
Contributor Author

garyyu commented Aug 7, 2018

"Batch verify" bulletproofs via libsecp? 👍 That's wonderful if it indeed exists; I will also take a look at it.

@antiochp
Member

antiochp commented Aug 7, 2018

@apoelstra states in the referenced blog post -

... reduced the speed of 147 multiplications to only 15.5 microseconds each, getting the total verification time of a Bulletproof down to 2.3 ms, vs 5.8 ms for the old proofs.

@ignopeverell
Contributor

Yes, #836 is highly relevant, I had started looking at it and ran into issues on an older version of libsecp-zkp. Not sure where it's at now. And I do think we should do both:

  1. Chunk the bulletproofs and send them to N separate threads.
  2. Batch verify on each thread.

@ignopeverell
Contributor

Oh, and is GIP an actual thing now?! 😄

@yeastplume
Member

When I last left this, I couldn't get the batch verify function working from Rust, and was unsure whether it was being called correctly or not (most likely not). Upon asking, I was pointed at the test in the benchmark, which is indeed passing, so I need to compare it more closely with what's being passed in from Rust, though I haven't gotten to it yet.

Code in benchmark is here:

https://github.com/mimblewimble/secp256k1-zkp/blob/88f2fec355f4a7f0fb7636b60d45d9ead007f52f/src/bench_bulletproof.c#L134

@apoelstra

A 64-bit BP rangeproof should never be taking 40ms to verify regardless of batching. On my laptop a single proof takes 18ms to generate, 2.3ms to verify, and 0.25ms to batch-verify.

@ignopeverell
Contributor

ignopeverell commented Aug 8, 2018

The gap is most likely due to the fact that we recreate the scratch space and generators every time. A first easy step, until we figure out the issue in passing parameters down to batch-verify properly, could be to have a batch implementation that doesn't use libsecp batch-verify but at least reuses those artifacts (of course, fixing both would be best).

Edit: after checking the latest code, we do not recreate generators every time anymore, but we do use secp256k1_bulletproof_rangeproof_verify_single_w_scratch instead of reusing a scratch space. And running the benchmark @garyyu used gives me (on my slow machine) a verify time of 4ms.

@garyyu
Contributor Author

garyyu commented Aug 9, 2018

Thanks to all of you for providing this useful information! It led me to investigate the details of the rust-secp256k1-zkp interfaces to secp256k1, and I finally found a fix in rust-secp256k1-zkp to make batching work. I will provide the new benchmark results soon.

Nice team work:)

@garyyu
Contributor Author

garyyu commented Aug 9, 2018

@ignopeverell what are the detailed specs of your slow machine? It's 10 times faster than my MacBook Air (Early 2015)! Let me check if something else is wrong on my side.

@garyyu
Contributor Author

garyyu commented Aug 9, 2018

The details of the batch verification benchmark can be found here in my Wiki.

Test result on my poor mac air:

$ cargo test --release bench_bulletproof_batch_verify -- --nocapture

running 1 test
test bench::tests::bench_bulletproof_batch_verify ...
spent time:	81.144(s)/(1000 bullet proof creation w/t extra message)
spent time:	0.844(s)/(1 batch verify for 1000 bullet proofs w/t extra message)

proof_range:	ProofRange {
    min: 0x0,
    max: 0xffffffffffffffff
}


$ cargo test --release bench_bulletproof_with_extra_batch_verify -- --nocapture

running 1 test
test bench::tests::bench_bulletproof_with_extra_batch_verify ... test bench::tests::bench_bulletproof_with_extra_batch_verify has been running for over 60 seconds
spent time:	82.608(s)/(1000 bullet proof creation w/ extra data)
spent time:	0.816(s)/(1 batch verify for 1000 bullet proofs w/ extra data)

proof_range:	ProofRange {
    min: 0x0,
    max: 0xffffffffffffffff
}

That's nearly a 50x speedup! Thanks to @apoelstra and all the other contributors of the secp256k1-zkp lib for this nice optimization.

And I'm still working on the scratch space: a simple test changing 256 to 128 saves about 40% of the single range proof verification cost, but I'm still trying to understand the details of this.

I hope to complete all of this and submit a pull request to the Grin repo in one or two days.

@yeastplume
Member

@garyyu That’s great news, what was the issue with the batch verify function?

verify_single_w_scratch was from the older version; the latest T3 branch allows you to create and hold onto a scratch space on the Rust side.

@garyyu
Contributor Author

garyyu commented Aug 9, 2018

@ignopeverell @yeastplume I'm confused about what @ignopeverell said above: we do not recreate generators every time anymore.

Look at these 2 lines:
https://github.com/mimblewimble/rust-secp256k1-zkp/blob/testnet3/src/pedersen.rs#L640-L641

The current code does indeed recreate the generators every time; please confirm.

@garyyu garyyu changed the title [GIP] Multithread Optimization on Range Proof Verification [GIP] Batch Verification & Multithread Optimization on Range Proof Verification Aug 9, 2018
@yeastplume
Member

yeastplume commented Aug 9, 2018

I think @ignopeverell was just looking at master instead of the testnet3 branch (we really need to fix that to avoid confusion).

Would be more concerned about this:

pub const GENERATOR_H : [u8;33] = [
-    0x11,
+    0x0b,

that 0x11 has been copied and copied and copied again, so I don't know where it came from, but if it's wrong it means the generator point is currently being incorrectly flipped, which might indeed lead to problems (and a hard fork). I'll check this further this evening when I have some time.

@yeastplume
Member

No, should be fine. This is what gets called for generator load:

static void secp256k1_generator_load(secp256k1_ge* ge, const secp256k1_generator* gen) {
    secp256k1_fe fe;
    secp256k1_fe_set_b32(&fe, &gen->data[1]);
    secp256k1_ge_set_xquad(ge, &fe);
    if (gen->data[0] & 1) {
        secp256k1_ge_neg(ge, ge);
    }
}

Both 0x11 & 0x01 == 1 and 0x0b & 0x01 == 1, so that change shouldn't make any difference (it should indeed be 0x0b). But it's interesting that our generator H needs to be flipped; we'd get a tiny optimisation from choosing a point that doesn't need to be.
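The parity argument can be checked directly: secp256k1_generator_load above negates the point exactly when data[0] has its low bit set, and both the old and the corrected prefix bytes share that bit.

```rust
fn main() {
    // secp256k1_generator_load negates the point iff (data[0] & 1) == 1.
    let old_byte: u8 = 0x11;   // byte previously in GENERATOR_H
    let fixed_byte: u8 = 0x0b; // corrected byte
    assert_eq!(old_byte & 1, 1);
    assert_eq!(fixed_byte & 1, 1);
    // Both have the low bit set, so the generator is flipped either way
    // and the byte change does not alter the loaded point.
    println!("parity identical: {}", (old_byte & 1) == (fixed_byte & 1));
}
```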

@garyyu
Contributor Author

garyyu commented Aug 9, 2018

Interesting! @yeastplume, you want to choose another point to replace the current H, to avoid the expensive secp256k1_ge_neg() call during loading?

That would be a global optimization, even if the computation saved is trivial. Good idea~

But that would be a hard fork, right? Luckily we're still on Testnet.

But perhaps @ignopeverell doesn't want a Testnet4 :)

@ignopeverell
Contributor

We could hardfork testnet3. But I'd rather see some numbers first, to make sure it's not a micro-optimization with no runtime impact.

@garyyu
Contributor Author

garyyu commented Aug 10, 2018

OK, will investigate if that's a micro-optimization.

@garyyu
Contributor Author

garyyu commented Aug 10, 2018

And about:

The current code does indeed recreate the generators every time...

I made an optimization to use shared generators, avoiding recreating them every time. Here is the benchmark result on my MacBook Air (Early 2015):

$ cargo test --release bench_bullet_proof_wt_extra -- --nocapture

running 1 test
spent time:	45(s)/(1000 bullet proof)
verify_bullet_proof ok:	1000/1000
spent time:	6(s)/(1000 verify bullet proof)

This makes bullet proof creation nearly 1.8x faster (80s → 45s) and single bullet proof verification about 6.7x faster (40s → 6s).
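The shared-generator idea can be sketched as a process-wide cache built on first use (a sketch only; `Generators` and `create_generators` are stand-ins for the expensive generator setup done inside the secp library):

```rust
use std::sync::OnceLock;

// Stand-in for the expensive generator table built by the secp library.
struct Generators {
    table: Vec<u64>,
}

fn create_generators() -> Generators {
    // In the real code this is the costly step worth doing only once.
    Generators { table: (0..256u64).collect() }
}

// Build the generators on first use, then hand out the shared copy.
fn shared_generators() -> &'static Generators {
    static GENS: OnceLock<Generators> = OnceLock::new();
    GENS.get_or_init(create_generators)
}

fn main() {
    let a = shared_generators();
    let b = shared_generators();
    // Same allocation both times: no recreation per verification.
    assert!(std::ptr::eq(a, b));
    println!("generators cached: {} entries", a.table.len());
}
```

A thread-safe one-time cache like this also composes cleanly with the multithreading proposal, since every verifier thread can borrow the same table.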

@garyyu
Contributor Author

garyyu commented Aug 10, 2018

Now I'll try to integrate these optimizations into the Grin repos :)

@garyyu
Contributor Author

garyyu commented Aug 10, 2018

About:

changing H to avoid expensive call secp256k1_ge_neg() in loading ... 
investigate if that's a micro-optimization

Here is the benchmark for Pedersen commitments with the original H and with H' (I simply changed 0x0b to 0x0a for a performance comparison, without strictly checking whether H or H' is indeed a valid elliptic curve point distinct from G).

$ cargo test --release bench_generator_h_efficiency -- --nocapture

running 1 test
test bench::tests::bench_generator_h_efficiency ...
spent time:	252(s)/(1000000 pedersen commitment with H1)
spent time:	252(s)/(1000000 pedersen commitment with H2)

Indeed, that's a micro-optimization and there's no need to do it.

(To be checked! It's normal that a Pedersen commitment takes 0.252 ms, but why does changing H's prefix from 0x0b to 0x0a seem to make no difference in computation cost?! I hope somebody can help review this benchmark code; perhaps something is wrong?)

@yeastplume
Member

:D That was just a throwaway comment; I was never under the impression that it was anything other than a tiny micro-optimisation.. but thanks anyhow for checking!

@garyyu
Contributor Author

garyyu commented Aug 16, 2018

In total we got over a 35x speedup :)

Before:

Aug 07 11:35:31.781 DEBG txhashset: verified 23140 rangeproofs, pmmr size 95321, took 1034s

After #1363:

Aug 16 09:50:12.435 DEBG txhashset: verified 28278 rangeproofs, pmmr size 121240, took 35s

So, this optimization should already be enough. The multithread optimization proposal may be premature and can be postponed.
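The 35x figure can be cross-checked from the per-proof times implied by the two log lines above:

```rust
fn main() {
    // Before: 23140 rangeproofs in 1034s; after #1363: 28278 in 35s.
    let before_ms = 1034.0 * 1000.0 / 23140.0; // ~44.7 ms/proof
    let after_ms = 35.0 * 1000.0 / 28278.0;    // ~1.24 ms/proof
    let speedup = before_ms / after_ms;
    println!("{:.1}x", speedup); // ~36.1x
    assert!(speedup > 35.0);
}
```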

yeastplume pushed a commit that referenced this issue Aug 17, 2018
…1321) (#1363)

* improve: use bullet rangeproof batch verification for txhashset validation (#1321)

* update rust-secp256k1-zkp to tag 'grin_integration_22'
@garyyu garyyu closed this as completed Aug 17, 2018
jackrack pushed a commit to jackrack/grin that referenced this issue Aug 22, 2018
…imblewimble#1321) (mimblewimble#1363)

* improve: use bullet rangeproof batch verification for txhashset validation (mimblewimble#1321)

* update rust-secp256k1-zkp to tag 'grin_integration_22'