Improved modulo multiplier #289

Merged
merged 9 commits into from Dec 5, 2023
Conversation

@DmytroTym (Contributor) commented Nov 29, 2023

I tried to document the inner workings of the methods that were changed in field.cuh, but the bottom line is: Karatsuba is now used for wide multiplication, and the lsb and msb multipliers are optimised as much as possible, plus some extra tweaks.
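For readers unfamiliar with the trick: Karatsuba replaces the four half-width products of schoolbook wide multiplication with three, at the cost of a few extra additions. A toy Python sketch of the idea on plain integers (illustrative only; the actual field.cuh code works on fixed-size limb arrays in CUDA):

```python
def karatsuba_wide(a, b, bits):
    """Multiply two `bits`-wide integers using three half-width products.

    Schoolbook needs four half-width multiplications (a_lo*b_lo, a_lo*b_hi,
    a_hi*b_lo, a_hi*b_hi); Karatsuba trades one of them for extra additions.
    """
    half = bits // 2
    mask = (1 << half) - 1
    a_lo, a_hi = a & mask, a >> half
    b_lo, b_hi = b & mask, b >> half
    lo = a_lo * b_lo                     # low half-product
    hi = a_hi * b_hi                     # high half-product
    mid = (a_lo + a_hi) * (b_lo + b_hi)  # combined cross terms
    # mid - lo - hi == a_lo*b_hi + a_hi*b_lo, recovered without a 4th multiply
    return (hi << bits) + ((mid - lo - hi) << half) + lo

assert karatsuba_wide(0x1234, 0xABCD, 16) == 0x1234 * 0xABCD
```

The lsb/msb multipliers mentioned above are the truncated variants (keeping only the low or high half of the product), which drop further partial products on top of this.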

Benchmarks

All the benchmarks are from the same RTX 3090Ti machine.

Multiplication modulo base modulus of BLS12-377, throughput, 10^9 ops/s:

Multiplication modulo base modulus of BN254, throughput, 10^9 ops/s:

  • Current main - 29.6
  • This PR - 38.9
  • Matter Labs' bellman-cuda - 39.5

Among higher-level applications, we have a Poseidon tree based on Supranational's PC2 implementation. Time for building a depth-30 tree for BLS12-381, in seconds:

  • Current main - 17.2
  • This PR - 14.2
  • Supranational's original version - 13.6

Despite being on par in terms of the multiplier, our tree is still a bit slower, probably due to squaring (sppark has a separate squaring implementation, while we just use multiplication for now).
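For context on why a dedicated squaring routine wins: when squaring, each cross product limbs[i] * limbs[j] with i < j appears twice in the result, so it can be computed once and doubled, saving nearly half the limb multiplications versus feeding a into a general a*b multiplier. A toy Python sketch of that idea (not sppark's actual implementation):

```python
def square_schoolbook(limbs, w=32):
    """Square a multi-limb integer given as a little-endian list of w-bit limbs.

    Diagonal terms limbs[i]^2 appear once; each cross term limbs[i]*limbs[j]
    (i < j) is computed once and doubled via the extra shift by 1.
    """
    k = len(limbs)
    acc = 0
    for i in range(k):
        acc += limbs[i] * limbs[i] << (2 * w * i)              # diagonal term
        for j in range(i + 1, k):
            acc += limbs[i] * limbs[j] << (w * (i + j) + 1)    # doubled cross term
    return acc

limbs = [0x12345678, 0x9ABCDEF0]
a = limbs[0] + (limbs[1] << 32)
assert square_schoolbook(limbs) == a * a
```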
Next, MSM of size 2^22 on the BN254 curve (without data transfer), ms:

  • Current main - 88.8
  • This PR - 70.6
  • Matter Labs' bellman-cuda - 44.1

UPD: one important reason for our slowdown compared to Matter Labs, and for the limited acceleration compared to main, is that the 256-bit adder we're using is not accelerated as much as expected. Here are the throughputs of different versions of the projective BN254 adder:

  • Current main - 2.49
  • This PR - 2.91
  • Matter Labs' bellman-cuda - 3.37

Meanwhile, the 384-bit adder is accelerated as expected, e.g. for projective addition on BLS12-377:

It seems the reason is that EC addition is essentially register-bound, and one advantage of the Montgomery multiplication implemented by Matter Labs and Supranational is its smaller register footprint. This might not affect performance at all, as with the 384-bit adder, but it does affect the 256-bit one, at least on the 3090 Ti. Solving this issue is future work.
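On the register-footprint point: word-level Montgomery multiplication interleaves reduction with the multiplication, so the running accumulator stays only about one limb wider than the operands, rather than a full double-width product staying live until a separate reduction. A toy Python sketch of the CIOS-style idea (illustrative only; not the bellman-cuda or sppark code):

```python
def montgomery_mul(a, b, n, n_limbs, w=32):
    """Word-level Montgomery multiplication: returns a*b*R^{-1} mod n,
    where R = 2^(w*n_limbs) and n is odd.

    Reduction is interleaved limb by limb, so the accumulator t never
    grows past roughly one limb wider than the operands.
    """
    # n' = -n^{-1} mod 2^w, precomputed once per modulus
    n_prime = (-pow(n, -1, 1 << w)) % (1 << w)
    mask = (1 << w) - 1
    t = 0
    for i in range(n_limbs):
        t += ((b >> (w * i)) & mask) * a   # add next partial product
        m = (t * n_prime) & mask           # pick m so that t + m*n ≡ 0 mod 2^w
        t = (t + m * n) >> w               # exact division by 2^w
    return t if t < n else t - n           # final conditional subtraction
```

Round-trip usage: with operands kept in Montgomery form (x*R mod n), montgomery_mul(aR, bR, ...) yields a*b*R mod n, and multiplying by 1 converts back out.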

@vhnatyk if you have NTT benchmarks, would be interesting to see them too.

@DmytroTym mentioned this pull request Nov 29, 2023
@vhnatyk (Contributor) left a comment

Tested performance and correctness for bls12_381. In my custom tests (a simple multiply in a for loop, over different grid sizes, on a laptop 3050 Ti) I get ~9000 mult/microsecond = 9 Gops, versus 7.8 in sppark. That is a decent advantage over current main's ~7 Gops 👍

@DmytroTym merged commit f8610dd into main Dec 5, 2023
10 of 11 checks passed
@DmytroTym deleted the develop/dima/multiplier branch December 5, 2023 11:11