Improved modulo multiplier #289

Merged
merged 9 commits into from Dec 5, 2023
Conversation

@DmytroTym (Contributor) commented Nov 29, 2023

I tried to document the inner workings of the methods that were changed in field.cuh, but the bottom line is: Karatsuba is now used for wide multiplication, and the lsb and msb multipliers are optimised as much as possible, plus some extra tweaks.
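For readers unfamiliar with the trick: Karatsuba replaces the four half-width products of schoolbook wide multiplication with three, at the cost of a few extra additions. A toy Python sketch of the idea on plain integers (illustrative only; the actual field.cuh code works on fixed-size limb arrays in CUDA):

```python
def karatsuba_wide(a, b, bits):
    """Multiply two `bits`-wide integers using three half-width products.

    Schoolbook needs four half-width multiplications (a_lo*b_lo, a_lo*b_hi,
    a_hi*b_lo, a_hi*b_hi); Karatsuba trades one of them for extra additions.
    """
    half = bits // 2
    mask = (1 << half) - 1
    a_lo, a_hi = a & mask, a >> half
    b_lo, b_hi = b & mask, b >> half
    lo = a_lo * b_lo                     # low half-product
    hi = a_hi * b_hi                     # high half-product
    mid = (a_lo + a_hi) * (b_lo + b_hi)  # combined cross terms
    # mid - lo - hi == a_lo*b_hi + a_hi*b_lo, recovered without a 4th multiply
    return (hi << bits) + ((mid - lo - hi) << half) + lo

assert karatsuba_wide(0x1234, 0xABCD, 16) == 0x1234 * 0xABCD
```

The lsb/msb multipliers mentioned above are the truncated variants (keeping only the low or high half of the product), which drop further partial products on top of this.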

Benchmarks

All the benchmarks are from the same RTX 3090Ti machine.

Multiplication modulo base modulus of BLS12-377, throughput, 10^9 ops/s:

Multiplication modulo base modulus of BN254, throughput, 10^9 ops/s:

  • Current main - 29.6
  • This PR - 38.9
  • Matter Labs' bellman-cuda - 39.5

Among higher-level applications, we have a Poseidon tree based on Supranational's PC2 implementation. Time for building a depth-30 tree for BLS12-381, in seconds:

  • Current main - 17.2
  • This PR - 14.2
  • Supranational's original version - 13.6

Despite being on par in terms of the multiplier, our tree is still a bit slower, probably due to squaring (sppark has a separate squaring implementation, while we just use multiplication for now).
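For context on why a dedicated squaring routine wins: when squaring, each cross product limbs[i] * limbs[j] with i < j appears twice in the result, so it can be computed once and doubled, saving nearly half the limb multiplications versus feeding a into a general a*b multiplier. A toy Python sketch of that idea (not sppark's actual implementation):

```python
def square_schoolbook(limbs, w=32):
    """Square a multi-limb integer given as a little-endian list of w-bit limbs.

    Diagonal terms limbs[i]^2 appear once; each cross term limbs[i]*limbs[j]
    (i < j) is computed once and doubled via the extra shift by 1.
    """
    k = len(limbs)
    acc = 0
    for i in range(k):
        acc += limbs[i] * limbs[i] << (2 * w * i)              # diagonal term
        for j in range(i + 1, k):
            acc += limbs[i] * limbs[j] << (w * (i + j) + 1)    # doubled cross term
    return acc

limbs = [0x12345678, 0x9ABCDEF0]
a = limbs[0] + (limbs[1] << 32)
assert square_schoolbook(limbs) == a * a
```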
Next, MSM of size 2^22 on the BN254 curve (without data transfer), ms:

  • Current main - 88.8
  • This PR - 70.6
  • Matter Labs' bellman-cuda - 44.1

UPD: one important reason for our slowdown compared to Matter Labs, and for the limited acceleration compared to main, is that the 256-bit adder we're using is not accelerated as much as expected. Here are the throughputs of different versions of the projective BN254 adder:

  • Current main - 2.49
  • This PR - 2.91
  • Matter Labs' bellman-cuda - 3.37

Meanwhile, the 384-bit adder is accelerated as expected, e.g. for projective addition on BLS12-377:

It seems the reason is that EC addition is essentially register-bound, and one advantage of the Montgomery multiplication implemented by Matter Labs and Supranational is its smaller register footprint. This might not affect performance at all, as with the 384-bit adder, but it does affect the 256-bit one, at least on the 3090 Ti. Solving this issue is future work.
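On the register-footprint point: word-level Montgomery multiplication interleaves reduction with the multiplication, so the running accumulator stays only about one limb wider than the operands, rather than a full double-width product staying live until a separate reduction. A toy Python sketch of the CIOS-style idea (illustrative only; not the bellman-cuda or sppark code):

```python
def montgomery_mul(a, b, n, n_limbs, w=32):
    """Word-level Montgomery multiplication: returns a*b*R^{-1} mod n,
    where R = 2^(w*n_limbs) and n is odd.

    Reduction is interleaved limb by limb, so the accumulator t never
    grows past roughly one limb wider than the operands.
    """
    # n' = -n^{-1} mod 2^w, precomputed once per modulus
    n_prime = (-pow(n, -1, 1 << w)) % (1 << w)
    mask = (1 << w) - 1
    t = 0
    for i in range(n_limbs):
        t += ((b >> (w * i)) & mask) * a   # add next partial product
        m = (t * n_prime) & mask           # pick m so that t + m*n ≡ 0 mod 2^w
        t = (t + m * n) >> w               # exact division by 2^w
    return t if t < n else t - n           # final conditional subtraction
```

Round-trip usage: with operands kept in Montgomery form (x*R mod n), montgomery_mul(aR, bR, ...) yields a*b*R mod n, and multiplying by 1 converts back out.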

@vhnatyk if you have NTT benchmarks, would be interesting to see them too.

@DmytroTym mentioned this pull request Nov 29, 2023
@vhnatyk (Contributor) left a comment

Tested performance and correctness for bls12_381. In my custom tests (a simple multiply in a for loop, over different grid sizes, on a laptop 3050 Ti) I get ~9000 mult/microsecond = 9 Gops, versus 7.8 in sppark. That is a decent advantage over current main's ~7 Gops 👍

@DmytroTym merged commit f8610dd into main Dec 5, 2023
10 of 11 checks passed
@DmytroTym deleted the develop/dima/multiplier branch December 5, 2023 11:11