Implement native Poseidon #4
[Nothing new below; I'm just doing a brain dump of my understanding at the moment, together with some references.]

The core cost in evaluating Poseidon is the product of an MDS matrix with the state; for example, an experiment by bobbinth suggests that 92% of the time is in the MDS products, and only 5% in the S-box permutations.

The speed improvements mentioned above in Appendix B of the Poseidon paper apply to the partial rounds; in short, because the S-box permutation isn't applied in partial rounds, we can factor the MDS matrix into something close to an identity matrix and use that to update the state. Note that bobbinth's implementation of Poseidon for the test above does not implement this trick, so those results are probably somewhat misleading. The reference implementation cited above does implement the trick, but it's written in Sage, so performance numbers from it are probably meaningless.

There is also the possibility of following the Fractal lead (hat tip to @dlubarov): they use a slightly relaxed property to produce "lightweight MDS" matrices, which apparently gives a 10-fold performance improvement (see p. 68 here for the claim). More generally, there's a long history of research into finding MDS matrices that allow fast products; a recent example with a decent reference list is here. It is not currently clear what the security implications are of using the lightweight MDS matrices with Poseidon.

It would be great to implement Poseidon using those two tricks and compare it with GMiMC; it wouldn't surprise me if Poseidon wasn't far behind GMiMC.
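To make the dominant cost concrete, here is a minimal sketch of the naive MDS layer: one t×t matrix-vector product over a prime field, costing t^2 field multiplications. The prime and the tiny 3×3 circulant matrix below are illustrative placeholders, not the project's actual field or matrix.

```python
# Naive MDS layer: one t x t matrix-vector product over a prime field.
# The prime and matrix below are illustrative, not the project's constants.
P = 2**64 - 2**32 + 1  # Goldilocks-style prime, mentioned later in the thread

def mds_layer(matrix, state):
    """Compute matrix * state mod P the schoolbook way: t^2 multiplications."""
    t = len(state)
    return [sum(matrix[i][j] * state[j] for j in range(t)) % P
            for i in range(t)]

# Tiny example: a 3x3 circulant matrix with first row [2, 1, 1].
M = [[2, 1, 1],
     [1, 2, 1],
     [1, 1, 2]]
print(mds_layer(M, [1, 0, 0]))  # picks out the first column of M
```

This is the all-multiplications baseline that the partial-round factorization and the small-entry matrices discussed below try to beat.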
That all sounds right to me, except:
Maybe this is what you meant, but it's applied to a single element in partial rounds, vs. all the elements in full rounds.

To dig into the cost a bit more, the StarkWare report listed Poseidon as taking (m^2 + 2m) R_f + (2m + 2) R_p multiplications, after the optimizations suggested in the paper. I think the m^2 R_f of that is full matrix multiplication, and the 2m R_p of it is multiplication by a sparse matrix. For our candidate parameters, it looks like full matrix multiplication would be ~72% of the cost while sparse matrix multiplication would be ~3% of the cost, so we should probably just focus on the full matrix and not the sparse matrix.
I wonder what sort of matrix entries we should search for. The literature seems mostly focused on binary fields. Clearly 0s are best for performance, followed by 1s. I guess powers of two should give a decent speedup, though we'd still need to reduce. Coefficients less than 2^32 might also help to speed up reductions somewhat -- e.g. with the Goldilocks reduction you suggested, we'd have x_11 = 0. My understanding is that MDS matrices (at least square ones) can't have 0 entries, but maybe we could find one with several 1 entries, and the rest small and/or power-of-two entries.
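The claim that a square MDS matrix can't have 0 entries follows from the standard characterization: a matrix is MDS iff every square submatrix is nonsingular (a 0 entry is a singular 1×1 submatrix). A brute-force check of that property, usable on small candidate matrices, might look like this (the prime is an example, not the project's field):

```python
from itertools import combinations

P = 2**64 - 2**32 + 1  # example prime; any prime field works here

def det_mod_p(m, p=P):
    """Determinant mod p via Gaussian elimination with modular inverses."""
    m = [row[:] for row in m]
    n = len(m)
    det = 1
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] % p), None)
        if pivot is None:
            return 0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det % p
        det = det * m[col][col] % p
        inv = pow(m[col][col], p - 2, p)  # Fermat inverse of the pivot
        for r in range(col + 1, n):
            f = m[r][col] * inv % p
            for c in range(col, n):
                m[r][c] = (m[r][c] - f * m[col][c]) % p
    return det

def is_mds(m, p=P):
    """A square matrix is MDS iff every square submatrix is nonsingular."""
    n = len(m)
    for k in range(1, n + 1):
        for rows in combinations(range(n), k):
            for cols in combinations(range(n), k):
                sub = [[m[r][c] for c in cols] for r in rows]
                if det_mod_p(sub, p) == 0:
                    return False
    return True

# Any matrix with a 0 entry fails immediately (a singular 1x1 submatrix):
print(is_mds([[1, 0], [1, 1]]))
```

The submatrix count grows fast with dimension, but for width 8 or 12 an exhaustive check like this is still feasible offline.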
Oh, yes, good catch! I was a bit sloppy there. :)
Yep, agreed. Thanks for that reference!
Yep, that was my thinking too. With the "lightweight MDS" matrices, apparently all the entries are either 0, 1, or 2, which explains the great speedup Valardragon referred to.
I think another good candidate would be Poseidon with a width of 8, with the same field and same x^7 S-box. It would have the same number of rounds (8 full, 22 partial), but 86 S-boxes instead of 118, and 1208 total muls instead of 2152. (Aside: if we used a sponge-based 2-to-1 compression function in our (binary) Merkle tree, a width of 8 would mean we need two squeeze steps. But I think we can do secure tree hashing by just permuting […])
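The totals above can be reproduced with a little arithmetic, assuming 4 multiplications per x^7 S-box and the full/sparse matrix costs discussed earlier (m^2 per full round, 2m per partial round); a sketch:

```python
def poseidon_mul_count(m, r_f, r_p, sbox_muls=4):
    """Rough multiplication count with the Appendix B optimization:
    a full m x m matrix product (m^2 muls) in each of the r_f full rounds,
    a sparse product (2m muls) in each of the r_p partial rounds, plus
    S-box costs (x^7 = 4 muls via x^2, x^4, x^3, x^7)."""
    sboxes = m * r_f + r_p  # m S-boxes per full round, 1 per partial round
    muls = m * m * r_f + 2 * m * r_p + sbox_muls * sboxes
    return muls, sboxes

muls12, sboxes12 = poseidon_mul_count(12, 8, 22)
muls8, sboxes8 = poseidon_mul_count(8, 8, 22)
print(sboxes12, muls12)  # 118 2152
print(sboxes8, muls8)    # 86 1208
```

These match the 118/86 S-box and 2152/1208 mul figures quoted for widths 12 and 8.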
The defaults are quite slow, but we will override them with fast, precomputed, field-specific matrices; see #4.
First results on this are in. TL;DR: the circulant matrix with row […] works.

I obtained this matrix by starting with the matrix defined at src/field/crandall_field.rs#L18 and repeatedly reducing each entry by half, stopping if the matrix ever failed to be MDS. I then manually started adding or subtracting a small number from each of the entries while maintaining the MDS property, each time aiming for a power of two. It was a bit ad hoc, but, unless I've missed something obvious, the matrix above can't be simplified further and still satisfy the MDS property. (Note that the 256 can be replaced with 5 and still be good to use; some of the other entries can be changed similarly; this sort of thing allows us to make use of the […])

So the process was a combination of automated "reduction" and manual fiddling; not entirely automatable, but not too laborious either. The result looks pretty good, with three 1's, and the rest powers of two. Hence each dot product can be computed with a smallish number of adds and shifts, which should be relatively easy to interleave to make use of the CPU pipeline. The main hassle will be dealing with the overflow, but that's inevitable since the CrandallField prime leaves no space at the top of the word.
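To illustrate why power-of-two entries pay off, here is a sketch of a dot product where every matrix entry is 1 or a power of two, so each term is a shift plus an add rather than a general multiplication. The prime and the example row are placeholders (the actual matrix row from the search above is not reproduced here), and the final `% p` stands in for the real reduction, where overflow handling is the hard part:

```python
P = 2**64 - 2**32 + 1  # example prime; the CrandallField prime differs

def dot_small(entries, state, p=P):
    """Dot product where every matrix entry is 1 or a power of two, so each
    term costs one shift and one add; no general multiplications needed.
    The accumulator is allowed to grow and is reduced once at the end
    (in Rust this accumulation is where overflow must be handled)."""
    acc = 0
    for e, x in zip(entries, state):
        k = e.bit_length() - 1
        assert e == 1 << k, "entries assumed to be 1 or powers of two"
        acc += x << k
    return acc % p

# Hypothetical row of small power-of-two entries (not the actual matrix):
row = [1, 1, 2, 1, 8, 32, 4, 256]
state = list(range(1, 9))
print(dot_small(row, state) ==
      sum(e * x for e, x in zip(row, state)) % P)  # shift form == mul form
```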
A different approach to computing the MDS matrix-vector product quickly might be to use the fact that, if C is a circulant matrix with first column c, then the product Cv is given by FFT^-1(FFT(c) * FFT(v)), where the * is componentwise multiplication. In our case the c's are constants, so FFT(c) can be precomputed. Unfortunately we can't chain several of these together in the full layers of Poseidon because, as far as I can see, there's no way to compute the S-boxes in the FFT domain: given FFT(v) for v = (x_0, ..., x_{n-1}), the only way I can see to obtain FFT(x_0^α, ..., x_{n-1}^α) is to compute v with FFT^-1, raise each element to the power α, and re-apply FFT. Given that n = t = 8 or 12 or so in our case, which is very small, it seems likely that the overhead of the FFT in this approach will prevent it from beating the one above.
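The convolution-theorem identity can be demonstrated in a few lines with a number-theoretic transform over a toy prime; this is purely an illustration of the identity (a naive O(n^2) transform over p = 257 with n = 4), not the fast transform or the hash's actual field:

```python
# Illustration of Cv = NTT^-1( NTT(c) * NTT(v) ) for a circulant matrix C,
# over the toy prime 257 with n = 4. A real implementation would use the
# hash's field and a fast transform; this naive O(n^2) version just shows
# that the identity holds exactly over a prime field.
P, N = 257, 4
W = next(w for w in range(2, P)  # brute-force an element of order exactly 4
         if pow(w, N, P) == 1 and all(pow(w, k, P) != 1 for k in range(1, N)))

def ntt(v, w=W):
    return [sum(x * pow(w, i * j, P) for j, x in enumerate(v)) % P
            for i in range(N)]

def intt(v):
    n_inv = pow(N, P - 2, P)
    return [x * n_inv % P for x in ntt(v, pow(W, P - 2, P))]

def circulant_matvec(c, v):
    """Direct product with C[i][j] = c[(i - j) mod n] (first column c)."""
    return [sum(c[(i - j) % N] * v[j] for j in range(N)) % P for i in range(N)]

c, v = [1, 2, 3, 4], [5, 6, 7, 8]
via_ntt = intt([a * b % P for a, b in zip(ntt(c), ntt(v))])
print(via_ntt == circulant_matvec(c, v))
```

As noted above, the catch is that the S-box is not componentwise in the transform domain, so each full round would pay a forward and inverse transform.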
Also worth noting that multiplying a circulant matrix by a vector even with normal multiplication is very amenable to CPU vectorisation. For example, with first column c = (c_0, ..., c_{t-1}),

Cv = c_0 v + c_1 rot(v, 1) + ... + c_{t-1} rot(v, t-1),

where rot(v, k) denotes cyclic rotation of v by k places; i.e. each term on the RHS is just a rotation of the original vector scaled by a constant (which will be one of the small constants found above). If we have t = 8, then we can even fit all 8 of the 64-bit vector elements in a single AVX512 register.
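The rotation form is easy to sanity-check against the direct product; a sketch (example prime and hypothetical small entries, not the project's constants):

```python
P = 2**64 - 2**32 + 1  # example prime; substitute the project's field

def rot(v, k):
    """Cyclic rotation: element i of the result is v[(i - k) mod t]."""
    return v[-k:] + v[:-k] if k else v[:]

def circulant_matvec_rotations(c, v, p=P):
    """Cv as a sum of scaled rotations: Cv = sum_k c[k] * rot(v, k), where
    C[i][j] = c[(i - j) mod t]. Each term is one multiply-accumulate over
    the whole state, which maps naturally to a vector register (e.g. one
    AVX-512 register when t = 8 and elements are 64 bits)."""
    t = len(v)
    acc = [0] * t
    for k in range(t):
        r = rot(v, k)
        for i in range(t):
            acc[i] = (acc[i] + c[k] * r[i]) % p
    return acc

def circulant_matvec_direct(c, v, p=P):
    t = len(v)
    return [sum(c[(i - j) % t] * v[j] for j in range(t)) % p for i in range(t)]

c = [1, 1, 2, 1, 8, 32, 4, 256]   # hypothetical small entries
v = [3, 1, 4, 1, 5, 9, 2, 6]
print(circulant_matvec_rotations(c, v) == circulant_matvec_direct(c, v))
```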
That looks very practical! Seems about as good as it can get, at least with our current field. Just thinking out loud about how much performance could be gained with a smaller field. My understanding is that […]
Right (by 'wide multiplication' I assume you mean to do them with shifts). It is pretty nice that we can just do a single […]

I think you're right that the advantage is only obtained if the inputs are already reduced.
A full and correct implementation of Poseidon is now available on the […]
(* Not currently available in the code; timing is approximate.) Experience suggests that the relative timings are fairly constant. For example, before it died, the faster CPU in my other laptop produced timings of about 1.2μs for GMiMC and 2.3μs for Poseidon (smart, fast MDS). There is still space in the implementation to improve CPU pipelining and parallelise with AVX if desired. I should add some documentation and supporting scripts used to generate constants before this is merged.
The performance looks promising! On my M1:
I think if we use x^3 we'll need 42 partial rounds; I was assuming x^7 with the 22 partial round figure. I used this script to get the numbers. Do you think x^3 would be faster despite the extra partial rounds? In the recursive setting I think x^7 should be a little cheaper, as we basically pay a certain cost per S-box; the exponent doesn't affect things much (as long as it's less than 8).
Thanks for the feedback @dlubarov. Good catch with the S-box monomial! Here are the updated numbers (same computer as before):
So x^7 with 22 partial rounds comfortably beats x^3 with 42 partial rounds. I'm a little surprised that evaluating x^7 is that much slower than x^3, since x^3 takes 2 multiplications and x^7 takes 4, but where two of them are independent (x^2 = x * x; x^4 = x^2 * x^2; x^3 = x^2 * x; x^7 = x^4 * x^3). I guess it's testament to the fact that the fast MDS matrices mean the S-boxes are taking a bigger portion of the runtime.
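The two addition chains compared above can be written out directly (the prime is an example field, not the project's):

```python
P = 2**64 - 2**32 + 1  # example prime

def x7(x, p=P):
    """x^7 with 4 multiplications. After x^2, the products x^4 = x^2 * x^2
    and x^3 = x^2 * x are independent of each other, so they can be issued
    in parallel before the final x^4 * x^3."""
    x2 = x * x % p
    x4 = x2 * x2 % p
    x3 = x2 * x % p
    return x4 * x3 % p

def cube(x, p=P):
    """x^3 with 2 multiplications, but a strictly serial dependency chain."""
    return x * x % p * x % p

print(x7(5) == pow(5, 7, P), cube(5) == pow(5, 3, P))
```

So per S-box the depth is 3 muls for x^7 vs. 2 for x^3, even though the total mul count is 4 vs. 2; the timings suggest the extra muls still show up once the MDS layer is cheap.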
For me it only increased from 2.4μs to 2.6μs, so it seems like a bit more parallelism is being used on my machine.
Quick update: Timing for GMiMC and Poseidon with width = 8:
So the performance really converges for width = 8 and, perhaps surprisingly, the smart partial round evaluation doesn't provide such a dramatic improvement. I would guess that this is because the S-box calculation now dominates the runtime, which could mean trading more partial rounds for a smaller S-box monomial is worth it. I'll continue investigating and report back.

Edit: updated the numbers with a better MDS matrix.
Times with smaller S-box, more partial rounds:
So it still looks like x^7 is the sweet spot for S-box monomial vs. number of partial layers. Further experimentation has shown that, somewhat annoyingly, the smart MDS evaluation and the naive MDS evaluation actually cost about the same amount when the width is 8. In principle, for width w, the work is w^2 muls in the naive case and 2w muls in the smart case. However that ignores a big difference: the naive MDS evaluation computes the product with the MDS matrix directly, hence it uses my fast MDS entries directly, whereas the "smart" MDS evaluation uses precomputed constants which are intrinsically random 64-bit values. I think this explains why the performance gains from smart MDS evaluation are less visible when we reduce the hash width from 12 to 8 (at least on my old laptop).
Repeating a comment from #207 for reference:
This issue captures the now merged native Poseidon implementation. Renamed to reflect that, and added #219 for circuit version. |
I don't mean to suggest that we should necessarily replace GMiMC with Poseidon; might be good to wait for more clarity around the security of both. But in the meantime, it would be good to have both in the codebase so that we can get a clearer picture of their costs.
Note that implementing Poseidon efficiently involves a bit of trickery; there are some suggestions in Appendix B of the paper. The reference implementation might also be useful.
There are some variations to consider, but I think a good candidate would be […]
This leads to a recommendation of 8 full rounds and 22 partial rounds, for a total of 118 S-boxes.
It could also be worth considering x^3, and trying to arithmetize it in a clever way that doesn't involve a wire for every S-box (see #2). However, our gate would then have degree 9 constraints, which would mean (since we use radix-2 FFTs) a bunch of intermediate polynomials need to be degree 16n.