I tried to document the inner workings of the methods that were changed in field.cuh, but the bottom line is: Karatsuba is implemented for wide multiplication, and the lsb and msb multipliers are optimised as much as possible, plus some extra tweaks.
Benchmarks
All the benchmarks were run on the same RTX 3090 Ti machine.
Multiplication modulo base modulus of BLS12-377, throughput, 10^9 ops/s:
main - 14.1
Multiplication modulo base modulus of BN254, throughput, 10^9 ops/s:
main - 29.6
Among some higher-level applications, we have a Poseidon tree based on Supranational's PC2 implementation. The time for building a depth-30 tree for BLS12-381, in seconds:
main - 17.2
Despite being on par in terms of the multiplier, our tree is still a bit slower, probably due to squaring (sppark has a separate squaring implementation, while we just use multiplication for now).
Next, MSM of size 2^22 on the BN254 curve (without data transfer), in ms:
main - 88.8
UPD: one important reason for our slowdown compared to Matter Labs, and for the limited acceleration compared to main, is that the 256-bit adder we're using is not accelerated as much as expected. Here are the throughputs of different versions of the projective BN254 adder:
main - 2.49
Meanwhile, the 384-bit adder is accelerated as expected, e.g. for projective addition on BLS12-377:
main - 1.11
It seems that the reason is that EC addition is basically register-bound, and one advantage of the Montgomery multiplication implemented by Matter Labs and Supranational is its smaller register footprint. This might not affect performance at all, as with the 384-bit adder, but it does affect the 256-bit one, at least on the 3090 Ti. Solving this issue is future work.
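To illustrate the register-footprint point, here is a minimal host-side C++ sketch of a textbook word-interleaved (CIOS) Montgomery multiplication. It is not the code of Matter Labs, Supranational, or this PR; the names `neg_inv32` and `mont_mul` and the 32-bit limb layout are assumptions made for the example. The point to notice is that reduction is interleaved with multiplication, so the accumulator `t` needs only N + 2 limbs, whereas a full wide product followed by a separate reduction holds 2N limbs at once.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: returns -p0^{-1} mod 2^32 for an odd p0 (Newton iteration).
inline uint32_t neg_inv32(uint32_t p0) {
  uint32_t x = p0;  // p0 * p0 == 1 (mod 8), so x is correct to 3 bits
  for (int i = 0; i < 4; ++i) x *= 2u - p0 * x;  // correct bits double each step
  return 0u - x;
}

// Textbook CIOS Montgomery product: a * b * R^{-1} mod p, with R = 2^(32 * N).
// Limbs are little-endian; inv must equal neg_inv32(p[0]); a, b < p.
template <std::size_t N>
std::array<uint32_t, N> mont_mul(const std::array<uint32_t, N>& a,
                                 const std::array<uint32_t, N>& b,
                                 const std::array<uint32_t, N>& p,
                                 uint32_t inv) {
  uint32_t t[N + 2] = {};  // the whole accumulator: N + 2 limbs
  for (std::size_t i = 0; i < N; ++i) {
    // t += a * b[i]
    uint64_t carry = 0;
    for (std::size_t j = 0; j < N; ++j) {
      uint64_t cur = (uint64_t)a[j] * b[i] + t[j] + carry;
      t[j] = (uint32_t)cur;
      carry = cur >> 32;
    }
    uint64_t cur = (uint64_t)t[N] + carry;
    t[N] = (uint32_t)cur;
    t[N + 1] = (uint32_t)(cur >> 32);
    // t = (t + m * p) / 2^32, where m makes the lowest limb vanish
    uint32_t m = t[0] * inv;
    cur = (uint64_t)m * p[0] + t[0];
    carry = cur >> 32;
    for (std::size_t j = 1; j < N; ++j) {
      cur = (uint64_t)m * p[j] + t[j] + carry;
      t[j - 1] = (uint32_t)cur;
      carry = cur >> 32;
    }
    cur = (uint64_t)t[N] + carry;
    t[N - 1] = (uint32_t)cur;
    t[N] = t[N + 1] + (uint32_t)(cur >> 32);
  }
  // The result is below 2p, so one conditional subtraction finishes the job.
  std::array<uint32_t, N> r{}, d{};
  uint64_t borrow = 0;
  for (std::size_t j = 0; j < N; ++j) {
    uint64_t diff = (uint64_t)t[j] - p[j] - borrow;
    r[j] = t[j];
    d[j] = (uint32_t)diff;
    borrow = (diff >> 32) & 1;
  }
  return (t[N] != 0 || borrow == 0) ? d : r;
}
```

As a small check with the toy modulus p = 2^61 - 1 (two limbs, so R = 2^64, which is 8 mod p): `mont_mul<2>({3, 0}, {64, 0}, p, neg_inv32(p[0]))` returns `{24, 0}`, i.e. 3 * 64 / 8.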
@vhnatyk, if you have NTT benchmarks, it would be interesting to see them too.
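For readers who want a concrete picture of the Karatsuba step mentioned in the summary, here is a minimal host-side C++ sketch of one Karatsuba level for a 64x64 -> 128-bit wide multiplication. It is an illustration only, not the code in field.cuh: the names `U128`, `mul_wide_schoolbook` and `mul_wide_karatsuba` are hypothetical, and the subtractive variant is just one common way to write it. The idea is to replace the two cross products a1*b0 and a0*b1 with a single product of differences, so three half-width multiplications are needed instead of four.

```cpp
#include <cstdint>

// 128-bit result stored as two 64-bit halves (hypothetical helper type).
struct U128 {
  uint64_t lo, hi;
};

// Reference: schoolbook 64x64 -> 128 using four 32x32 -> 64 products.
inline U128 mul_wide_schoolbook(uint64_t a, uint64_t b) {
  const uint64_t a0 = a & 0xFFFFFFFFu, a1 = a >> 32;
  const uint64_t b0 = b & 0xFFFFFFFFu, b1 = b >> 32;
  const uint64_t p00 = a0 * b0, p01 = a0 * b1, p10 = a1 * b0, p11 = a1 * b1;
  const uint64_t mid = (p00 >> 32) + (p01 & 0xFFFFFFFFu) + (p10 & 0xFFFFFFFFu);
  return {(p00 & 0xFFFFFFFFu) | (mid << 32),
          p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32)};
}

// One Karatsuba level: three 32x32 -> 64 products instead of four.
inline U128 mul_wide_karatsuba(uint64_t a, uint64_t b) {
  const uint64_t a0 = a & 0xFFFFFFFFu, a1 = a >> 32;
  const uint64_t b0 = b & 0xFFFFFFFFu, b1 = b >> 32;
  const uint64_t z0 = a0 * b0;  // low product
  const uint64_t z2 = a1 * b1;  // high product
  // Subtractive form: a1*b0 + a0*b1 = z0 + z2 - (a1 - a0) * (b1 - b0).
  const uint64_t da = a1 >= a0 ? a1 - a0 : a0 - a1;
  const uint64_t db = b1 >= b0 ? b1 - b0 : b0 - b1;
  const bool negative = (a1 < a0) != (b1 < b0);  // sign of (a1-a0)*(b1-b0)
  const uint64_t m = da * db;                    // third and last product
  // The middle term can need 65 bits, so keep it as (z1_hi, z1_lo).
  uint64_t z1_lo = z0 + z2;
  uint64_t z1_hi = z1_lo < z0 ? 1 : 0;
  if (negative) {  // subtracting a negative product means adding m
    z1_lo += m;
    z1_hi += z1_lo < m ? 1 : 0;
  } else {
    z1_hi -= z1_lo < m ? 1 : 0;
    z1_lo -= m;
  }
  // Recombine: z2 * 2^64 + z1 * 2^32 + z0.
  const uint64_t lo = z0 + (z1_lo << 32);
  const uint64_t carry = lo < z0 ? 1 : 0;
  const uint64_t hi = z2 + (z1_lo >> 32) + (z1_hi << 32) + carry;
  return {lo, hi};
}
```

In a multi-limb field multiplier the same split would be applied to halves of the limb array rather than to halves of a single word; the trade-off is a few extra additions and some sign handling in exchange for fewer multiplications.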