Investigate other field options #1

dlubarov · 2021-03-06T05:28:22Z

We're currently using p = 2^64 - 9 * 2^28 + 1, which I figured would enable a simple and fast Crandall reduction. It's reasonably fast, but seems like there's room for improvement.

We could probably get some speedup by switching to a slightly smaller field and delaying some reductions, although I expect that this would be fairly minor, since GMiMC doesn't do many additions in between multiplications. My code might also have room for improvement independent of field choice.

@unzvfu suggested looking into binary fields, which should be very fast on CPUs. There may be some complications with using FRI with binary fields, but it seems worth looking into.

A much more speculative idea is to use a field that isn't 2-adic, like p = 2^61 - 1. This would complicate LDEs and FRI. For LDEs, I suppose we could either use mixed-arity FFTs (might be expensive to do the high-arity layers), or use some other DFT algorithm with no smoothness requirement (not sure how practical they are). For FRI, we typically want pretty high-arity reductions anyway (like 4-to-1 or 8-to-1 layers), so it might not affect efficiency that much, but it would be a bit more complex.

It would be interesting to see what ~64 bit fields other projects have used, but I found very little online. AFAIK they don't come up in any widely used crypto primitives.

The text was updated successfully, but these errors were encountered:

dlubarov · 2021-03-06T05:31:58Z

Assigning to @unzvfu as the "chief investigator", but I and/or @wborgeaud might help out if we end up exploring several options.

dlubarov · 2021-03-06T06:19:26Z

There's also the Monty option. If I'm remembering correctly, it seems like it should involve the same number of operations as our current Crandall reduction (two muls, two adds and a conditional sub), but I might be off.

unzvfu · 2021-03-29T12:04:59Z

Here is a draft summary of the field options that I've investigated so far, as well as some of their trade-offs. It'll probably remain a work in progress, and there are no doubt some minor improvements to be made at various points that I have missed and that we can update as we learn them.

The criteria for the field F of size q we're interested in are:

2ⁿ ≤ q ≤ 2⁶⁴ where the smallest n is determined by the security; I assume n >= 60, but need to check specifics.
Fast modular multiplication algorithm, obviously
the size of 2-sylow subgroup of the multiplicative subgroup of F (the "2-adicity") is 'large'.
- Clearly can't be more than 63.
- Seems larger is better for FFT, but I still need to understand the precise connection
- need to find out if it's helpful for the 2-adicity itself to have a special form, e.g. a power of 2.
x³ must be a permutation in the field for GMiMC; for Poseidon can have x⁵ or x⁷ (or higher).

One slightly surprising discovery is that rustc actually prefers mul's to combinations of shifts and adds, while gcc does not; so the advantage of a prime with a special form for arithmetic is almost completely negated with pure rust. I will check which one is actually faster and update; presumably it will depend on the machine architecture to some extent.

That lead me to consider what the best form would be from the point of view of the other criteria; this suggests we use something of the form q = s*2^n + 1, with s small and not divisible by 3, 5 etc., to maximise the 2-adicity and control the permutation monomials. Then arithmetic in Montgomery form will provide as about as good performance as you can expect (REDC with two muls and three adds). For example, depending on how big q and n have to be, we could pick q = 5*2^55 + 1, or q = 71*2^57 + 1, etc.

(These choices of q have mu = -q^-1 = q-2 (mod 2^64), which is enticing, though I can't yet see how to use it to replace the multiplication-by-mu modulo 2^64 part of REDC with (say) multiplication by -2, which would be great. In some cases you can replace a mul-by-mu with shifts and adds: For example, x * (5*2^55-1) mod 2^64 is the same as ((x << 57) + (x << 55) - x), which gcc still thinks is faster than the multiplication.)

Probably the fastest modular multiplication possible would be the binary fields (costs just a couple of shifts and xors; see pseudo-code below), though they obviously have some significant disadvantages relating to the 2-adicity and permutation monomial selection criteria.

The following table summarises the "special form" options I've looked into, with varying degrees of detail (doesn't track conditional subs yet; sometimes not needed, e.g. with binary field). The "small factors" column tells us the 2-adicity (i.e. the exponent of 2) as well as which degree permutation monomials are available (namely, products of primes that don't appear in the list); all prime power factors up to and including 11 are shown.

group	form	potential candidates	small factors of `q-1`	pseudocode	implementation notes
Mersenne	2ⁿ-1	2⁶¹ - 1	2⋅3²⋅5²⋅7⋅11...	hi,lo=MUL(a,b); (lo&(1<<61)-1) + (lo>>61) + (hi<<3)	Space for redundant repr.
pseudo-Mersenne/ Crandall	2ⁿ-c small c	2⁶¹ - 31	2⁵⋅3⋅5 ...	hi,lo=MUL(a,b); two muli's by c, three adds	same
		2⁶³ - 25	2⋅3⁵ ...	same	same
binary fields	2ⁿ	F₂[x]/(x⁶⁴+x⁴+x³+x+1)	3⋅5 ...	hi,lo=CLMUL(a,b); 3 shl, 4 xor 3 shl, 4 xor	shl's have one immed. operand; shl's are independent, can tree xor's; CLMUL instructions are same speed as IMUL from Haswell on. Possible improvement with lookup tables and SSE
		F₂[x]/(x⁶³ + x + 1)	7² ...	hi,lo=CLMUL(a,b); 1 shl, 1 xor, 1 and 1 shl, 1 xor, 1 and	as above; and's have one immed. operand; might be able to remove the and's?
Goldilocks	2^2m-2^m±1	2⁶⁴ - 2³² + 1	2³²⋅3⋅5 ...	hi,lo=MUL(a,b); hihi=hi>>32; hilo=hi&(1<<32)-1; lo + (hilo<<32) - hilo - hihi	Note that (hilo<<32) - hilo can't underflow; each part of the reduction line may carry (resp. borrow), these could cancel, or result in conditional add/sub of 2³² - 1; not sure if that could cause another carry/borrow
Solinas	poly(2^t)	2⁶⁴ - 9⋅2²⁸ + 1	2²⁸⋅11 ...	1. same as for Crandall; or 2. Use the structure of 9⋅2²⁸-1=2³¹+2²⁸-1	...

Please let me know if anything is unclear and if anything looks like it might be worth investigating a bit further. My next step from here is look closer at the effect of field choice on the FFT and check whether rustc or gcc is right about muls versus shifts+adds.

dlubarov · 2021-03-29T18:27:34Z

This is great info!

Seems larger is better for FFT, but I still need to understand the precise connection

I think it's just a matter of what size circuits we want to support. Like if we wanted to support circuits with 2^k gates, the prover will generate polynomials of degree 2^k, compute low-degree extensions of them (resulting in Reed-Solomon codewords of size 2^{k + 3} or so), then commit to those LDEs.

Our current approach to computing LDEs is to run a (radix-2) IFFT to convert points to coefficients, zero-pad the coefficient list, and run a (radix-2) FFT to convert back to points. My understanding is that a radix-2 FFT of size 2^{k + 3} requires a multiplicative subgroup of order 2^{k + 3}, so the field must have a 2-adicity of k + 3 of higher. (There might be other approaches we could consider, like using mixed-radix FFTs, or some other DFT algorithm that doesn't require a smooth subgroup. I suspect those alternatives may be slower though.)

I would estimate that the largest circuits anyone would want to use would have around 2²⁵ gates, since proofs for such large circuits would take a lot of time and memory. So I think it would be good to have a 2-adicity of around 28+, although this is just a rough number. We could go a few bits lower if there was a performance reason, or if we could get higher 2-adicity with no performance difference, that might be nice just in case someone wanted to use a giant circuit.

need to find out if it's helpful for the 2-adicity itself to have a special form, e.g. a power of 2.

As far as I'm aware, this shouldn't affect things.

That's interesting about the mul vs shifts/adds. I suppose the mul would be more compact, though that's not a priority for us.

I just noticed that the toolchain supports profile guided optimization. That would be nice if the toolchain could profile mul vs shifts/adds and select the winner for the local machine. I'm not sure if it supports that sort of thing though; the docs give me the impression that it's just low-level stuff like register assignments.

That's an excellent table! So to try to summarize,

Binary fields might be fastest. I'm kind of hesitant about them since it's outside my comfort zone; I haven't really looked into how to adapt FRI etc. to work with them (and it sounds nontrivial). But we could explore it more if there's a significant performance gap.
For GMiMC, it sounds like a Monty reduction might be fastest, especially if we found a field with cheap multiplication by mu. The current Crandall/Solinas field might be close though, so might need to benchmark both to be sure.
For Poseidon, it seems like the leading candidates would probably be ~~Mersenne~~ and Goldilocks. My hunch is that Goldilocks might be preferable since it would let us use a lower-degree permutation monomial like x⁷, though I haven't compared the costs in detail.

dlubarov · 2021-04-02T05:35:42Z

On second thought, I'm not sure Mersenne would be a viable option since its 2-adicity is only 1. I'm not sure how practical it would be to compute DFTs and run FRI in a field like that (see the initial comment in this issue). Between that and the high exponent we'd have to use to get a permutation, it seems likely that Goldilocks would be the better fit if we used Poseidon.

dlubarov · 2021-04-02T19:22:30Z

Interestingly, Bobbin timed our field code as well as his own field code, and got around 1.3ns per field mul for both, much faster than the ~4ns per field mul I got on my laptop. No idea why mine is slower, but I think 1.3ns is believable. Assuming his i9-9980HK was Turbo Boosting to 5GHz, that's 6-7 cycles per field mul, which seems about right for the three 64-bit muls and the few additions etc.

If we can reproduce the performance that Bobbin saw, and get that kind of perf on most hardware, that should make Poseidon pretty practical.

unzvfu · 2021-04-07T11:30:37Z

Thanks heaps for the additional info @dlubarov! I agree with your summary of the table.

Regarding muls vs shift/adds, on practically all recent Intel architectures we have

op	latency	throughput
add/sub	1	4
shl/shr	1	2
mul	3	1

So computing, say, (x << 57) + (x << 55) - x will require three cycles because of the data dependency (the shifts can happen simultaneously (assuming both shifting ports are free)). So it's actually pretty hard to beat mul, except possibly if there are several shift/add-based multiplications to do (the CPU can do two of them simultaneously if they are independent, but that's not the case for us I don't think).

Anyway, I'll add some benchmarks for some of the different field options.

dlubarov · 2021-04-08T20:32:04Z

Quick update/clarification on the timing of field ops: my previous figure of ~4ns per field mul was based on a sequential chain of field muls. Specifically, it was doing naive exponentiation by repeated multiplication. I added another benchmark (bench_field_mul_interleaved) with a more parallel dependency structure, and I'm now seeing much better numbers: ~.9ns, or ~4 cycles, per field mul.

I looked a bit at the asm of both, and didn't see anything weird or surprising like auto-vectorization. I think our current field mul code just has a very sequential dependency structure, so if we do a single field mul in a loop, the CPU can't do much else while it waits for muls to complete. Doing several independent field muls in a loop gives much better throughput.

This might be a point in favor of Poseidon, since its dependency structure should permit more interleaving of field ops, whereas GMiMC has a very sequential dependency structure. I'm just speculating though; we'll see soon enough when we benchmark it.

unzvfu · 2021-05-22T05:08:28Z

I've implemented the three main prime/reduction algorithm pairs and timed them on the field_benchmarks branch.

Here are the timings in nanoseconds, and approximate number of cycles (assuming boost), on an i7-7700hq @ 2.8 GHz (3.8 GHz boost). Proth-5 is 5⋅2⁵⁵ + 1 and Proth-29 is 29⋅2⁵⁷ + 1. The Serial timing measures the average cost per modular multiplication when they're done one after the other, whereas the Interleaved timing measures the average cost when several are interleaved at once.

Prime	Serial		Interleaved
	ns	cyc	ns	cyc
Crandall	4.06	15.5	0.98	3.7
Goldilocks	3.97	15.1	1.22	4.6
Proth-5	2.16	8.2	0.67	2.5
Proth-29	2.58	9.8	0.87	3.3

The main observation is that Goldilocks doesn't perform as well as we had hoped. I spent a while trying different variations of the implementation, but it was very hard to remove the data dependencies caused by overflow/underflow on the addition/subtractions at the end. Of all the things I tried, the fastest version was:

fn reduce128(x: u128) -> GoldilocksField {
    // Write x = a0 + a1*2^64  (0 <= a0, a1 < 2^64)
    //         = b0 + b1*2^32 + b2*2^64 + b3*2^96  (0 <= b0, b1, b2, b3 < 2^32)
    const MASK_LO_32_BITS: u64 = (1u64 << 32) - 1u64;
    let (a0, a1) = split(x);   // This is a no-op
    let b3 = a1 >> 32;
    let b2 = a1 & MASK_LO_32_BITS;
    let x = (b2 << 32) - b2;  // Can't underflow
    let (y, cy) = x.overflowing_add(a0);
    let (mut z, br) = y.overflowing_sub(b3);

    if cy {
        z = z.overflowing_sub(GoldilocksField::ORDER).0;
    }
    if br {
        z = z.overflowing_add(GoldilocksField::ORDER).0;
    }
    GoldilocksField(z)
}

which compiles down to

reduce128:
        ; x = a0 + a1*2^64 arrives in (rdi, rsi) = (a0, a1)
        mov     rcx, rsi
        shr     rcx, 32
        mov     eax, esi
        shl     rsi, 32
        sub     rsi, rax
        lea     rax, [rsi + rdi]
        sub     rax, rcx
        mov     edx, 4294967295
        add     rdx, rax
        add     rsi, rdi
        cmovae  rdx, rax
        movabs  rax, -4294967295
        add     rax, rdx
        cmp     rsi, rcx
        cmovae  rax, rdx
        ret

Happy to hear about any ideas for improvements!

The other observation is that the Proth reduction is very fast; in fact the interleaved Proth-5 is almost optimal, since the 64-bit multiplication itself is 2 cycles. The caveat is that it is only partial reduction, though it is compatible with the FFT in a nice way (cf. this comment on issue 8).

Overall, I think these timings give a pretty good idea of what is possible with different choices of prime, and the unfortunate conclusion is that there's no stand-out winner. I think it will be important to keep the code fairly generic to allow switching between different possibilities as I experiment with improving the FFT code. For example, the Proth-29 prime beats Crandall by about 10%, but has some extra bookkeeping required because of the way it changes residue class representation along the way; if the bookkeeping isn't too bad, that could be enough to swing things it its favour.

dlubarov · 2021-05-23T02:40:36Z

Neat! It's interesting that the compiler not only (ab)uses lea to compute 5 * c0 in Proth-5,

lea     rax, [rcx, +, 4*rcx]

but also to compute 29 * c0 as 3 * 9 * c0 + c0 + c0 in Proth-29,

lea     rax, [rcx, +, 8*rcx]
lea     rax, [rax, +, 2*rax]
add     rax, rcx
add     rax, rcx

dlubarov · 2021-05-23T02:54:06Z

A couple thoughts on the Proth options:

I think it's totally fine in our case to have fast arithmetic for non-canonical encodings and a slower "to canonical" function.
At first glance, it seems like underflow can happen, e.g. if c0 == 0. Do we need to conditionally add q? Your current code matches K-RED in the paper, but I guess the subtraction in the paper means (wrapping) field subtraction?

unzvfu · 2021-05-23T02:55:59Z

Yeah, I'm actually a bit confused about the use of lea for mult-by-29, since it has a latency of 3 when all three operands are used and is restricted to CPU port 1 (so those two leas can't be done in parallel); that's exactly the same as mul from at least Broadwell onwards AFAICT. The two adds at the end have a serial dependency too, which makes matters worse. Curious!

dlubarov · 2021-05-23T02:59:07Z

Another little thought -- since MASK_LO_55_BITS requires a movabs, I thought it might be slightly faster to use two shifts, like

fn reduce128(x: u128) -> ProthField {
    let x_shl9 = x << 9;
    let c0 = x_shl9 as u64 >> 9;
    let c1 = (x_shl9 >> 64) as u64; // free
    let (d, under) = (5 * c0).overflowing_sub(c1);
    ProthField(d)
}

but the compiler switches it back to an AND. I guess the compiler (and you) know probably better :)

unzvfu · 2021-05-23T02:59:51Z

A couple thoughts on the Proth options:

* I think it's totally fine in our case to have fast arithmetic for non-canonical encodings and a slower "to canonical" function.

* At first glance, it seems like underflow can happen, e.g. if `c0 == 0`. Do we need to conditionally add `q`? Your current code matches K-RED in the paper, but I guess the subtraction in the paper means (wrapping) field subtraction?

Yeah, I probably shouldn't vaunt the qualities of the Proth primes too soon, since, as you noticed, I didn't do a fully correct implementation. Underflow can definitely happen, though this might be manageable by using a signed representation instead of unsigned. Otherwise, yes, we'll have to conditionally add q on underflow.

dlubarov · 2021-06-07T22:50:41Z

Here are all the potentially suitable Proth primes whose coefficients are within 1 of a power of two. I figured these should be slower than Proth5 (but with a more desirable field size) but faster than Proth29:

More Proth fields

2287828610704211969
- = (2^7 - 1) * 2^54 + 1
- Mul group factors: 2^54 * 127
- Size: ~61 bits
2308094809027379201
- = (2^10 + 1) * 2^51 + 1
- Mul group factors: 2^51 * 5^2 * 41
- Size: ~61 bits
4611615649683210241
- = (2^16 - 1) * 2^46 + 1
- Mul group factors: 2^46 * 3 * 5 * 17 * 257
- Size: ~62 bits
9223336852482686977
- = (2^18 - 1) * 2^45 + 1
- Mul group factors: 2^45 * 3^3 * 7 * 19 * 73
- Size: ~63 bits
1152923703630102529
- = (2^19 + 1) * 2^41 + 1
- Mul group factors: 2^41 * 3 * 174763
- Size: ~60 bits
9223369837831520257
- = (2^22 - 1) * 2^41 + 1
- Mul group factors: 2^41 * 3 * 23 * 89 * 683
- Size: ~63 bits
1152921513196781569
- = (2^27 + 1) * 2^33 + 1
- Mul group factors: 2^33 * 3^4 * 19 * 87211
- Size: ~60 bits

I added (2^16 - 1) * 2^46 + 1 along with Proth29 to the benchmark. Here are the results on my M1:

Benchmark results

Field: Crandall
Serial:
  average field mul: 4.51690975ns
Interleaved:
  average field mul: 0.9451840138333333ns
Field: Goldilocks
Serial:
  average field mul: 4.212927083ns
Interleaved:
  average field mul: 1.1893468193333334ns
Field: Proth5
Serial:
  average field mul: 3.10666875ns
Interleaved:
  average field mul: 0.8130153401666667ns
Field: Proth29
Serial:
  average field mul: 3.190831083ns
Interleaved:
  average field mul: 0.7355682568333334ns
Field: ProthField65535
Serial:
  average field mul: 2.996222ns
Interleaved:
  average field mul: 0.8194518958333333ns

And the assembly for each:

ProthField5 mul

 umulh   x8, x1, x0
 mul     x9, x1, x0
 and     x10, x9, #0x7fffffffffffff
 extr    x8, x8, x9, #55
 add     x9, x10, x10, lsl, #2
 subs    x8, x9, x8
 mov     x9, #1
 movk    x9, #640, lsl, #48
 csel    x9, x9, xzr, lo
 add     x0, x9, x8
 ret

ProthField29 mul

 umulh   x8, x1, x0
 mul     x9, x1, x0
 and     x10, x9, #0x1ffffffffffffff
 extr    x8, x8, x9, #57
 mov     w9, #29
 mul     x9, x10, x9
 subs    x8, x9, x8
 mov     x9, #1
 movk    x9, #640, lsl, #48
 csel    x9, x9, xzr, lo
 add     x0, x9, x8
 ret

ProthField65535 mul

 umulh   x8, x1, x0
 mul     x9, x1, x0
 and     x10, x9, #0x3fffffffffff
 extr    x8, x8, x9, #46
 lsl     x9, x10, #16
 sub     x9, x9, x10
 subs    x8, x9, x8
 mov     x9, #1
 movk    x9, #640, lsl, #48
 csel    x9, x9, xzr, lo
 add     x0, x9, x8
 ret

I'm no ARM expert, but the ASM is pretty much as expected: Proth5 uses an add to multiply by 5, Proth29 uses a mov+mul (since there's no great way to multiply by 29), and Proth65535 uses a lsl+sub.

The weird thing is just that Proth29 is somehow faster for me, despite the extra mul.

My (minor) code changes are in field_benchmarks_daniel.

dlubarov · 2021-07-27T17:29:57Z

Going back to this,

A much more speculative idea is to use a field that isn't 2-adic, like p = 2^61 - 1.

A recent paper by Ben-Sasson et al. may make this more practical. The authors are planning another paper to talk about adapting their techniques to arbitrary fields.

I haven't had much time to read it yet, but would be interesting to see if concrete performance is really comparable to traditional FFTs.

nbgl · 2021-08-30T19:55:38Z

This was a really interesting thread to read!

The problem you bumped into with Goldilocks is that LLVM does not emit fast assembly for modulo addition (i.e. the thing where we do let (z, carry) = x.overflowing_add(y); if carry { z.overflowing_sub(FIELD_ORDER).0 } else { z }). Crandall has the same problem, but it only does one such addition, whereas Goldilocks does two.

The code

fn reduce128(x: u128) -> GoldilocksField {
    const LO_32_BITS: u64 = (1u64 << 32) - 1u64;
    let (x_lo, x_hi) = split(x);

    let x_hi_hi = x_hi >> 32;
    let (mut t0, borrow) = x_lo.overflowing_sub(x_hi_hi);
    if borrow {
        t0 = t0.overflowing_sub(EPSILON).0;
    }

    let x_hi_lo = x_hi & EPSILON;
    let t1 = x_hi_lo * EPSILON;

    let (mut t2, carry) = t1.overflowing_add(t0);
    if carry {
        t2 = t2.overflowing_add(EPSILON).0;
    }
    GoldilocksField(t2)
}

indeed benchmarks slower than Crandall. Weirdly, writing

t0 = t0.overflowing_sub(EPSILON * (borrow as u64)).0;
t2 = t2.overflowing_add(EPSILON * (carry as u64)).0;

improves the throughput by a lot, although it’s still marginally slower than Crandall.

The latter version generates the following assembly:

mov                rdx, rbx
mulx               r10, rbx, [rbp - 192]
mov                edx, r10d
shr                r10, 32
sub                rbx, r10
mov                ecx, 0
cmovb              rcx, r12
add                rcx, rbx
imul               rdx, r13
add                rdx, rcx
mov                ebx, 0
cmovb              rbx, r13
add                rbx, rdx

The bottleneck here is the number of fused-domain µops, at 14. A Skylake processor can only execute 4 per cycle, limiting the throughput to 3.5 cycles per multiplication.

We can do better by doing the compiler’s job for it with inline asm:

fn reduce128(x: u128) -> GoldilocksField {
    const LO_32_BITS: u64 = (1u64 << 32) - 1u64;
    let (x_lo, x_hi) = split(x);

    let x_hi_hi = x_hi >> 32;
    let mut t0 = x_lo;
    let mut s0 = x_hi_hi;
    unsafe {
        asm!(
            "sub {0}, {1}",
            "sbb {1:e}, {1:e}",
            inlateout(reg) t0,
            inlateout(reg) s0,
            options(pure, nomem, nostack)
        );
    }
    t0 = t0.overflowing_sub(s0).0;

    let x_hi_lo = x_hi & EPSILON;
    let t1 = x_hi_lo * EPSILON;

    let mut t2 = t1;
    let mut s2 = t0;
    unsafe {
        asm!(
            "add {0}, {1}",
            "sbb {1:e}, {1:e}",
            inlateout(reg) t2,
            inlateout(reg) s2,
            options(pure, nomem, nostack)
        );
    }
    t2 = t2.overflowing_add(s2).0;

    GoldilocksField(t2)
}

This compiles to:

mov                rdx, r8
mulx               r10, r8, [rbp - 176]
mov                edx, r10d
shr                r10, 32
sub                r8, r10
sbb                r10d, r10d
sub                r8, r10
imul               rdx, r12
add                rdx, r8
sbb                r8d, r8d
add                r8, rdx

This assembly has a Goldilocks-specific trick, which is that sbb r10d, r10d sets r10 to 2^32 − 1 if the previous instruction generated a carry and to 0 otherwise. Since it so happens that EPSILON = 2^32 − 1, we can use this to conditionally add or subtract EPSILON.

This assembly has 12 fused-domain µops. The execution ports are 2×p1, 1×p5, 3×p06, 4×p0156. As there is no pressure on any particular port, the bottleneck still the number of fused-domain µops, limiting the theoretical throughput to 3 cycles per multiplication on Skylake. Ice Lake can execute 5 fused-domain µops per cycle, so on those processors the theoretical throughput is 2.5 cycles per multiplication.

For fair comparison, here’s my best Crandall reduction code:

fn reduce128(x: u128) -> CrandallField {
    let (lo_1, hi_1) = split(x);
    let (lo_2, hi_2) = split((EPSILON as u128) * (hi_1 as u128) + (lo_1 as u128));
    let lo_3 = hi_2 * EPSILON;

    let mut x = lo_2;
    let mut z = 0u64;
    unsafe {
        asm!(
            "add {0}, {1}",
            "cmovc {2}, {3}",
            inout(reg) x,
            in(reg) lo_3,
            inlateout(reg) z,
            in(reg) EPSILON,
            options(pure, nomem, nostack)
        );
    }
    x = x.overflowing_add(z).0;
    CrandallField(x)
}

It compiles to

mov                rdx, rax
mulx               rdx, rax, [rbp - 176]
mulx               rsi, rcx, r12
add                rcx, rax
adc                rsi, 0
imul               rsi, r12
xor                eax, eax
add                rcx, rsi
cmovb              rax, r12
add                rax, rcx

It also has 12 fused-domain µops. The execution ports are 3×p1, 2×p5, 2×p06, 3×p0156. Again, we’re limited to 3 cycles per multiplication by the fused-domain µops on Skylake. We’re also limited to 3 cycles per multiplication by the capacity of port 1. On Ice Lake, the latter does not change so our theoretical throughput still maxes out at 3 cycles per multiplication.

The stats are

Method	TP / Bench (Skylake)	TP (Ice Lake)	TP (Zen 3)	TP / Bench (M1)
Crandall (100% Rust)	3 / .90ns	3	3	2.5 / .78ns
Crandall (inline ASM)	3 / .87ns	3	3	-
Crandall (vector)	3 / .97ns	2.06	2.25
Goldilocks (100% Rust)	3.5 / .93ns	3	2.75	2 / .71ns
Goldilocks (inline ASM)	3 / .85ns	2.5	2.25	1.67 / .69ns
Goldilocks (vector)	2.5 / .79ns	1.62	1.88

Interestingly, Goldilocks is consistently a little bit faster than Crandall on my Skylake even though they have the same theoretical throughput. I'd speculate it's because the Crandall throughput calculation assumes that port 1 is only doing multiplication, but in practice scheduling isn't perfect.

nbgl · 2021-08-31T02:30:29Z

I’ve updated the above table with throughput calculations for AMD Zen 3 and vectorized versions of both Crandall and Goldilocks.

Skylake and Zen 3 support AVX2 (256-bit vectors), whereas Ice Lake has AVX-512 (512-bit).

The main bottleneck in all the vectorized methods is the fact that AVX does not support full 64x64->128-bit multiplication, so we have to cobble it together from 32x32->64-bit multiplications. AVX-512 does has 64x64->64-bit multiplication, but that’s not useful because we need the carry to form the high word. The AVX2 version of multiplication looks like this:

unsafe fn shift(x: __m256i) -> __m256i {
    _mm256_xor_si256(x, sign_bit())
}

unsafe fn mul64_64_s(x: __m256i, y: __m256i) -> (__m256i, __m256i) {
    let x_hi = _mm256_srli_epi64(x, 32);
    let y_hi = _mm256_srli_epi64(y, 32);
    let mul_ll = _mm256_mul_epu32(x, y);
    let mul_lh = _mm256_mul_epu32(x, y_hi);
    let mul_hl = _mm256_mul_epu32(x_hi, y);
    let mul_hh = _mm256_mul_epu32(x_hi, y_hi);

    let res_lo0_s = shift(mul_ll);
    let res_hi0 = mul_hh;

    let res_lo1_s = _mm256_add_epi32(res_lo0_s, _mm256_slli_epi64(mul_lh, 32));
    let res_hi1 = _mm256_sub_epi64(res_hi0, _mm256_cmpgt_epi64(res_lo0_s, res_lo1_s)); // Carry.

    let res_lo2_s = _mm256_add_epi32(res_lo1_s, _mm256_slli_epi64(mul_hl, 32));
    let res_hi2 = _mm256_sub_epi64(res_hi1, _mm256_cmpgt_epi64(res_lo1_s, res_lo2_s)); // Carry.

    let res_hi3 = _mm256_add_epi64(res_hi2, _mm256_srli_epi64(mul_lh, 32));
    let res_hi4 = _mm256_add_epi64(res_hi3, _mm256_srli_epi64(mul_hl, 32));

    (res_hi4, res_lo2_s)
}

There’s a few tricks to note here:

We don’t need to clear the high doubleword before performing multiplication, because it gets ignored anyway.
_mm256_cmpgt_epi64 returns -1 on true and 0 otherwise; hence the carry is subtracted, not added to the high word.
AVX2 does not have unsigned compare for 64-bit integers, so we use signed compare on numbers that have been shifted by 2^63 (trick from Hacker’s Delight). shift adds 2^63, which is then propagated through adds so we don’t have to do it multiple times, and undone at the end. AVX-512 does have unsigned 64-bit compare, so we don’t need this trick and we save 2 instructions.

Goldilocks multiplication is then implemented (nb: this code is not tested) as

unsafe fn sub_modulo_64s_64_s(x_s: __m256i, y: __m256i) -> __m256i {
    let t0_s = _mm256_sub_epi64(x_s, y);
    let carry_mask = _mm256_cmpgt_epi64(t0_s, x_s);
    let adj = _mm256_and_si256(carry_mask, epsilon());
    let t1_s = _mm256_sub_epi64(t0_s, adj);
    t1_s
}

unsafe fn add_modulo_64s_64_s(x_s: __m256i, y: __m256i) -> __m256i {
    let t0_s = _mm256_add_epi64(x_s, y);
    let carry_mask = _mm256_cmpgt_epi64(x_s, t0_s);
    let adj = _mm256_and_si256(carry_mask, epsilon());
    let t1_s = _mm256_add_epi64(t0_s, adj);
    t1_s
}

unsafe fn reduce128s_s(x_s: (__m256i, __m256i)) -> __m256i {
    let (hi0, lo0_s) = x_s;
    let hi_hi0 = _mm256_srli_epi64(hi0, 32);
    let lo1_s = sub_modulo_64s_64_s(lo0_s, hi_hi0);
    let t1 = _mm256_mul_epu32(hi0, epsilon());
    let lo2_s = add_modulo_64s_64_s(lo1_s, t1);;
    lo2_s
}

impl GoldilocksFieldVec {
    pub unsafe fn mul(x: __m256i, y: __m256i) -> __m256i {
        shift(reduce128s_s(mul64_64_s(x, y)))
    }
}

It involves one more 32x32->64-bit multiplication because it’s cheaper than shift + subtract. But it does exploit Goldilocks’ nice properties.

Crandall is a bit longer:

unsafe fn add_with_carry128_64s_s(x: (__m256i, __m256i), y_s: __m256i) -> (__m256i, __m256i) {
    let (x_hi, x_lo) = x;
    let res_lo_s = _mm256_add_epi64(x_lo, y_s);
    let carry = _mm256_cmpgt_epi64(y_s, res_lo_s);
    let res_hi = _mm256_sub_epi64(x_hi, carry);
    (res_hi, res_lo_s)
}

unsafe fn add_with_carry128s_64_s(x_s: (__m256i, __m256i), y: __m256i) -> (__m256i, __m256i) {
    let (x_hi, x_lo_s) = x_s;
    let res_lo_s = _mm256_add_epi64(x_lo_s, y);
    let carry = _mm256_cmpgt_epi64(x_lo_s, res_lo_s);
    let res_hi = _mm256_sub_epi64(x_hi, carry);
    (res_hi, res_lo_s)
}

unsafe fn fmadd_64_32_64s_s(x: __m256i, y: __m256i, z_s: __m256i) -> (__m256i, __m256i) {
    let x_hi = _mm256_srli_epi64(x, 32);
    let mul_lo = _mm256_mul_epu32(x, y);
    let mul_hi = _mm256_mul_epu32(x_hi, y);
    let tmp_s = add_with_carry128_64s_s(
        (_mm256_srli_epi64(mul_hi, 32), _mm256_slli_epi64(mul_hi, 32)),
        z_s,
    );
    add_with_carry128s_64_s(tmp_s, mul_lo)
}

unsafe fn reduce128s_s(x_s: (__m256i, __m256i)) -> __m256i {
    let (hi0, lo0_s) = x_s;
    let (hi1, lo1_s) = fmadd_64_32_64s_s(hi0, epsilon(), lo0_s);
    let lo2 = _mm256_mul_epu32(hi1, epsilon());
    let res_wrapped_s = _mm256_add_epi64(lo1_s, lo2);
    let carry_mask = _mm256_cmpgt_epi64(lo1_s, res_wrapped_s); // all 1 if overflow
    let res_s = _mm256_add_epi64(res_wrapped_s, _mm256_and_si256(carry_mask, epsilon()));
    res_s
}

impl CrandallFieldVec {
    pub unsafe fn mul(x: __m256i, y: __m256i) -> __m256i {
        shift(reduce128s_s(mul64_64_s(x, y)))
    }
}

The full Goldilocks multiplication consists of 30 instructions. With 3 AVX2 execution ports, Skylake runs it at 10 cycles per vector or 2.5 cycles per word. The Crandall multiplication is 36 instructions for a throughput of 12 cycles per vector, or 3 per word.

The AVX-512 versions are analogous. We save two instructions by not having to shift the low words by 2^63 to perform comparisons. We can also use masked addition to save an instruction every time we perform addition modulo prime order (i.e. avoiding the and in _mm_add_epi64(res_wrapped_s, _mm_and_si256(carry_mask, epsilon()))).

The AVX-512 version of Goldilocks has 26 instructions; Ice Lake has 2 AVX-512 execution ports for a throughput of 13 cycles per vector or 1.625 cycles per word. Crandall has 33 instructions, meaning 16.5 cycles per vector or 2.06 per word.

In all those cases, full multiplication is responsible for 18 out of the 30/36/26/33 instructions.

Lastly, I’m convinced that the very fastest way to implement multiplication is neither purely-scalar nor purely-vector. x86 scalar multiplication is significantly faster than anything I can do in AVX2 (although reduction is faster in vector). I would speculate that the fastest implementation tuned for Skylake would do the full 64x64->128-bit multiply in scalar and then the reduction in vector. This holds to a lesser degree on Ice Lake as well: AVX-512 64-bit full multiply is only marginally slower than scalar, but they also compete only partially, so performing some full multiplications in scalar should increase parallelism and yield more throughput. (And yes, combining domains like that is insanely ugly and means we’re optimizing not just for a specific architecture, but a specific CPU family.)

dlubarov · 2021-08-31T05:06:54Z

Thanks Jakub - very useful info. Is the data based on theoretical or observed throughput? (Or is AVX performance predictable enough that the difference isn't important?)

As Jakub pointed out, the Crandall reduction could also work with the Goldilocks field, with the same number of steps. So Goldilocks should be at least as fast on any hardware. The only downside is that Goldilocks doesn't have many options for permutation monomials (for example, x³ is not a permutation in Goldilocks, since p-1 has a factor of 3). But I think that's fine, since we're planning to use Poseidon with x⁷ anyway, which is a permutation in both fields.

The mixed scalar and vector approach sounds interesting. Might be good to benchmark at some point, so we could see if the improvement is significant enough to justify the weird code. But in the meantime, those AVX numbers sound really good already.

Btw, I wonder how we should select among implementations for a given CPU? We could use compile-time guards like #[target_feature(enable = "avx2")], but I don't think there's an option for AVX-512. We could use is_x86_feature_detected to check for AVX-512, but that's a runtime check. We could always add one or more features to control it, though it's not ideal since we wouldn't be inferring the best option automatically.

nbgl · 2021-08-31T05:29:29Z

Is the data based on theoretical or observed throughput? (Or is AVX performance predictable enough that the difference isn't important?)

All throughput numbers are theoretical, but the timings (only for Skylake for now) are observed.

The theoretical throughput numbers are highly indicative of real-world performance, although the correlation isn’t 100%. Also, they measure slightly different things: multiplications per cycle ≠ multiplications per unit time because the instructions that are running can affect things like turbo boost.

nbgl · 2021-09-01T00:24:54Z

I've added throughput calculations for and benchmarks for the M1. I used these instruction tables to derive the theoretical throughput numbers. They do not fully match the benchmarks, so take them with a grain of salt. (Also, while the M1 has a lot of throughput, its latencies are the same, so more unrolling is necessary to max out the speeds. Eventually you run out of registers…)

The one ASM modification I made to speed up Goldilocks is using a mul instead of a mask, shift, and subtract.

dlubarov · 2021-10-24T19:06:47Z

I think we're happy with Goldilocks so I'm going to close this for now.

My only hesitation is that at some point, we might want to have lower-degree constraints, like a STARK with degree-3 constraints (enabling constraint evaluation over a much smaller subgroup, of size 2n rather than 8n). That wouldn't work well with Goldilocks since its smallest permutation monomial is x^7.

But that's not in our immediate plans, more of a possible future exploration.

Log portability stubs

fix error when no std

dlubarov assigned unzvfu Mar 6, 2021

dlubarov mentioned this issue Apr 3, 2021

Implement native Poseidon #4

Closed

unzvfu mentioned this issue Apr 26, 2021

FFT optimizations #8

Closed

nbgl mentioned this issue Sep 8, 2021

Vectorized FFT #223

Merged

dlubarov closed this as completed Oct 24, 2021

dlubarov pushed a commit that referenced this issue Aug 7, 2022

Merge pull request #1 from mir-protocol/log-portability-stubs

86c1493

Log portability stubs

georgwiese mentioned this issue Jul 27, 2023

Create the powdr book powdr-labs/powdr#421

Merged

SuccinctPaul pushed a commit to SuccinctPaul/plonky2 that referenced this issue Aug 31, 2023

Merge pull request 0xPolygonZero#1 from InternetMaximalism/fix/no-std

f99ed9c

fix error when no std

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Investigate other field options #1

Investigate other field options #1

dlubarov commented Mar 6, 2021

dlubarov commented Mar 6, 2021

dlubarov commented Mar 6, 2021

unzvfu commented Mar 29, 2021 •

edited

dlubarov commented Mar 29, 2021 •

edited

dlubarov commented Apr 2, 2021

dlubarov commented Apr 2, 2021

unzvfu commented Apr 7, 2021

dlubarov commented Apr 8, 2021

unzvfu commented May 22, 2021

dlubarov commented May 23, 2021

dlubarov commented May 23, 2021

unzvfu commented May 23, 2021

dlubarov commented May 23, 2021 •

edited

unzvfu commented May 23, 2021

dlubarov commented Jun 7, 2021

dlubarov commented Jul 27, 2021

nbgl commented Aug 30, 2021 •

edited

nbgl commented Aug 31, 2021 •

edited

dlubarov commented Aug 31, 2021

nbgl commented Aug 31, 2021

nbgl commented Sep 1, 2021

dlubarov commented Oct 24, 2021

Investigate other field options #1

Investigate other field options #1

Comments

dlubarov commented Mar 6, 2021

dlubarov commented Mar 6, 2021

dlubarov commented Mar 6, 2021

unzvfu commented Mar 29, 2021 • edited

dlubarov commented Mar 29, 2021 • edited

dlubarov commented Apr 2, 2021

dlubarov commented Apr 2, 2021

unzvfu commented Apr 7, 2021

dlubarov commented Apr 8, 2021

unzvfu commented May 22, 2021

dlubarov commented May 23, 2021

dlubarov commented May 23, 2021

unzvfu commented May 23, 2021

dlubarov commented May 23, 2021 • edited

unzvfu commented May 23, 2021

dlubarov commented Jun 7, 2021

dlubarov commented Jul 27, 2021

nbgl commented Aug 30, 2021 • edited

nbgl commented Aug 31, 2021 • edited

dlubarov commented Aug 31, 2021

nbgl commented Aug 31, 2021

nbgl commented Sep 1, 2021

dlubarov commented Oct 24, 2021

unzvfu commented Mar 29, 2021 •

edited

dlubarov commented Mar 29, 2021 •

edited

dlubarov commented May 23, 2021 •

edited

nbgl commented Aug 30, 2021 •

edited

nbgl commented Aug 31, 2021 •

edited