Single syscall modmul and modadd submachine #1436

qwang98 · 2024-06-07T16:05:47Z

Background

Currently there are two approaches to accelerating the ECDSA signature: 1. accelerate ec_add and ec_double. 2. accelerate modmul and modadd only, which are the most time consuming component of ec_add and ec_double. 3. Do both *1 and *2.
*1 reduces execution trace rows by 60%, *2 only 5%, and *3 by 80%.
*2, despite poor performance right now, has the most potential for improvement, because modmul syscall, which is called extensively by *2, spends most rows in register copying from and to memory. *3 has limited potential for improvement because not many non-ec operations calls modmul, so any improvement would be marginal
It's still inconclusive whether *2 or *3 would achieve the best performance eventually.

The issue

Regardless of whether we do *2 or *3, modmul currently has two syscalls on the affine submachine, which is why we propose to reduce modmul and modadd to one syscall in this issue. This change would mostly benefit *2, if we can create a submachine just for modmul and modadd and remove all ec operations in the current arith machine.
Currently we do two syscalls, the first to obtain D and E from A * B + C = D * 2^256 + E, given A, B, C; and the second to obtain R from M * Q + R = D * 2^256 + E, given D, E, and M (SECP modulo). Q and R are computed as hints.

Basic proposal

User provides A, B, C, compute D and E as before. Also compute Q and R as hint columns. Have 32 limbs of M as 32 constants (instead of columns). In the same 32 rows where we enforce A * B + C = D * 2^256 + E, also enforce that M * Q + R = D * 2^256 + E. In addition, would also need a carry column for M * Q + R.

Additional proposal to reduce column C

Instead of having 16 columns for C, have two operations, one for addition and one for multiplication. Enforce that A (* or +) B = D * 2^256 + E depending on the operation. This would be quite easy to do given that the current eq0 constraint is just

let eq0 = |nr|
        product(x1f, y1f)(nr) // this should work up to nr = 15, but after that, we would need a(nr - 15) * b(15) + ... + a(15) * b(nr - 15)?
        + x2f(nr)
        - shift_right(y2f, 16)(nr) // 0 for nr in 0..16
        - y3f(nr);

so we can just remove the product term for the add operation and remove the x2f term for the product operation. This has the advantage of shaving off the 16 C columns.

One concern for this approach is that the add operation would just take 16 rows for the 16 limbs instead of 32 as in the case of multiplication. I think this should be fine if we just keep all columns constant after 16 rows for the add operation?

Additional proposal to reduce columns Q and R

Not sure if possible but we can also put Q and R in the place of B and C and M in the place of A and just add 32 additional rows. So there would be 64 rows for each latch. We would need another fixed columns like [0, 0, 0, ..., 0, 1, 1, 1, ..., 1] with 32 zeroes and 32 ones, and A, B, C will be query columns that is none for when this fixed column is 0 and hints for when this fixed column is 1.
One concern for this approach is that submachine rows might overflow main machine rows, because currently our ECDSA test calls modmul 13k times, for 64 rows per call, this would be 832k rows. Currently the main machine has 3.1 million rows, but after further optimization this might go below 832k rows.

Other potential optimizations

Currently and in this proposal, it's not enforced that modmul and modadd will return remainder that's smaller than the modulus. This is enforced as an assertion on the Rust level of the k256 crate field mul function. We could consider enforcing this in less than constraints in the submachine, but I'm not sure how to efficiently do it for numbers with 16 limbs.

The text was updated successfully, but these errors were encountered:

qwang98 · 2024-06-07T16:11:30Z

@georgwiese @pacheco Thoughts? :)

pacheco · 2024-06-10T09:16:50Z

I think this makes sense, in particular once we reduce the "fixed" cost of doing ecalls, which is quite high right now.
I don't know about the proving cost for each approach (accelerate ec VS modular operations) but we can measure and compare once we have everything in place.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Single syscall modmul and modadd submachine #1436

Single syscall modmul and modadd submachine #1436

qwang98 commented Jun 7, 2024 •

edited

Loading

qwang98 commented Jun 7, 2024

pacheco commented Jun 10, 2024

Single syscall modmul and modadd submachine #1436

Single syscall modmul and modadd submachine #1436

Comments

qwang98 commented Jun 7, 2024 • edited Loading

Background

The issue

Basic proposal

Additional proposal to reduce column C

Additional proposal to reduce columns Q and R

Other potential optimizations

qwang98 commented Jun 7, 2024

pacheco commented Jun 10, 2024

qwang98 commented Jun 7, 2024 •

edited

Loading