crypto: import fiat-crypto implementations #40171

Open
mdempsky opened this issue Jul 12, 2020 · 69 comments

Comments

@mdempsky
Member

@mdempsky mdempsky commented Jul 12, 2020

The fiat-crypto project (https://github.com/mit-plv/fiat-crypto) generates formally-verified, high-performance modular arithmetic implementations, useful for crypto primitives like Curve25519, Poly1305, and the NIST ECC curves that are used within the Go standard library. They're currently working on a Go backend.

BoringSSL has imported their implementations for Curve25519 and P-256: https://boringssl.googlesource.com/boringssl/+/master/third_party/fiat/

At https://go-review.googlesource.com/c/crypto/+/242177, I've uploaded a WIP CL that imports their Curve25519 implementation (w/ minor tweaks), and demonstrates a significant performance improvement over the current "generic" implementation. (The existing amd64 assembly implementation is still considerably faster though.)

This proposal is to import and make use of those implementations.

Open questions:

  1. Which algorithms should be imported? BoringSSL only imports two. Should we import more?

  2. Do we import both 32-bit and 64-bit implementations? We could import just one implementation and still get a performance speedup (e.g., 386 sees a -10% performance boost with curve25519_fiat_64.go, and amd64 sees a -30% boost with curve25519_fiat_32.go), but they do better still with the CPU-appropriate implementations (-30% for 386 w/ 32-bit, and -61% for amd64 w/ 64-bit).

  3. How should the code be imported? E.g., should it be separated into a third_party or vendor directory with its own LICENSE file like how BoringSSL does it?

/cc @agl @FiloSottile

@gopherbot gopherbot added this to the Proposal milestone Jul 12, 2020
@mdempsky
Member Author

@mdempsky mdempsky commented Jul 13, 2020

fiat-crypto's 64-bit P-224 implementation is about 2x as fast as Go's existing portable, constant-time P-224 implementation on amd64. Their 32-bit implementation is about the same speed on 386.

I expect P-256 will be similar to Curve25519 (i.e., existing assembly implementations are faster, but still worth measuring), but using fiat-crypto for P-384 and P-521 should be both much faster and provide a constant-time implementation (unlike the current, generic math/big code).

@ianlancetaylor ianlancetaylor added this to Incoming in Proposals Aug 7, 2020
@rsc rsc moved this from Incoming to Active in Proposals Aug 12, 2020
@mdempsky
Member Author

@mdempsky mdempsky commented Aug 18, 2020

@agl @FiloSottile Ping.

While here, I'll point out the fiat-crypto implementation also speeds up curve25519 on ppc64le:

name               old time/op   new time/op   delta
ScalarBaseMult-32    152µs ± 1%     92µs ± 4%  -39.82%  (p=0.000 n=17+20)

name               old speed     new speed     delta
ScalarBaseMult-32  210kB/s ± 0%  347kB/s ± 2%  +65.27%  (p=0.000 n=17+17)

(I don't have any ARM workstations to readily benchmark on.)

@rsc
Contributor

@rsc rsc commented Aug 26, 2020

I do have some concerns about adding new license notice requirements in the libraries, because those transitively apply to every Go binary anyone builds (that imports net/http at least).

I would feel much more comfortable about this if we could get the code contributed under CLA so that the Go license notice would cover it.

@mdempsky
Member Author

@mdempsky mdempsky commented Aug 26, 2020

@JasonGross Do you think we can get fiat-crypto's Go code contributed under Google's CLA? The normal process is documented at https://golang.org/doc/contribute.html#cla.

@rsc
Contributor

@rsc rsc commented Sep 16, 2020

Ping @JasonGross. We'd be happy to use this code but don't want to impose new notice requirements on every Go binary.

@JasonGross

@JasonGross JasonGross commented Sep 16, 2020

Ah, sorry, I meant to follow up on this earlier. As discussed on openssl/openssl#12201 (comment), MIT unfortunately doesn't permit signing CLAs on projects that it holds copyright to. :-/

@rsc
Contributor

@rsc rsc commented Sep 18, 2020

@JasonGross, thanks for replying. I certainly understand MIT not wanting to complete CLAs.

An alternative solution to our problem of imposing new notice requirements on every Go binary would be if the generator outputs could be licensed under a non-attribution license such as MIT-0 or a source-code-attribution-only license such as BSD-1-Clause.

Do you think that is a possibility?

@JasonGross

@JasonGross JasonGross commented Sep 18, 2020

That seems quite likely. Let me chat with my colleagues and see if it's feasible.

@rsc
Contributor

@rsc rsc commented Sep 18, 2020

Thanks very much.

@JasonGross

@JasonGross JasonGross commented Sep 22, 2020

@rsc We're in the process of re-licensing under user's choice, MIT OR BSD-1-Clause. However, it seems that BSD-1-Clause is not listed under https://pkg.go.dev/license-policy, even though MIT-0 and BSD-0-Clause are. Is this an oversight? Will BSD-1-Clause in fact be sufficient?

@ianlancetaylor
Contributor

@ianlancetaylor ianlancetaylor commented Sep 22, 2020

BSD-1-clause should be fine for our purposes. Thanks very much for tackling this.

I don't know why it's not listed in pkg.go.dev. Maybe it's just not very common. CC @jba .

@jba
Contributor

@jba jba commented Sep 22, 2020

Looking into it.

@rsc
Contributor

@rsc rsc commented Sep 23, 2020

BSD-1-Clause will be fine, thanks.

@rsc
Contributor

@rsc rsc commented Sep 23, 2020

Based on the discussion above, this seems like a likely accept.

@rsc rsc moved this from Active to Likely Accept in Proposals Sep 23, 2020
@JasonGross

@JasonGross JasonGross commented Sep 23, 2020

I've gotten approval from everyone and have prepared mit-plv/fiat-crypto#881. Hopefully we'll get it merged in the next couple of days.

@JasonGross

@JasonGross JasonGross commented Sep 25, 2020

The code has now been relicensed under MIT OR BSD-1-Clause OR Apache-2.0.

@ianlancetaylor
Contributor

@ianlancetaylor ianlancetaylor commented Sep 25, 2020

Thanks!

@rsc
Contributor

@rsc rsc commented Sep 30, 2020

Thanks so much @JasonGross!

Accepted.

@rsc rsc moved this from Likely Accept to Accepted in Proposals Sep 30, 2020
@rsc rsc removed this from the Proposal milestone Sep 30, 2020
@rsc rsc added this to the Backlog milestone Sep 30, 2020
@FiloSottile FiloSottile removed this from the Go1.17 milestone Jun 15, 2021
@FiloSottile FiloSottile added this to the Go1.18 milestone Jun 15, 2021
@Yawning

@Yawning Yawning commented Aug 2, 2021

Am I missing any? Does that sound like a good categorization?

It may be worth revisiting some of these in the medium to long term, since the performance penalty for using fiat can be reduced.

As some of the participants in this issue know, I have been looking into fiat's Go performance recently. As of right now, there are a number of changes that can be made to fiat-crypto's 64-bit Go output that will provide performance increases for "real-world" use-cases.

  • Removing the addcarryxU64/subcarryxU64 wrappers dramatically improves CarryMul/CarrySquare performance (mit-plv/fiat-crypto#949).
  • Selectznz is really really slow, compared to how edwards25519 does it.
  • Working around the compiler (aka "At most, people must copy and paste Go code.") provides gains for at least the following calls:
    • Fusing Add/Sub/Opp with Carry (mit-plv/fiat-crypto#1004). I filed the issue for curve25519 because that was what I was looking at, but skimming the P-521 commit, it applies there as well.
    • Providing a routine that does repeated squaring in a for loop is a moderate gain for curve25519, due to making inversions faster.
    • Providing routines that merge Add/Sub with CarryMul and CarrySquare is a minor gain, though this depends on how the code is called.
* curve25519 and edwards25519 64-bit
  
  * Using the code we are already landing for edwards25519 (https://golang.org/cl/276272, https://golang.org/cl/315269) is a little faster on arm64, and way faster on amd64 (https://golang.org/cl/c/crypto/+/314889/4#message-f16cf2274fc82aaf4a5df6836517ed52eadd32b5). Fiat is 15% faster on POWER9, but that alone doesn't feel worth carrying it (https://golang.org/cl/c/crypto/+/315269/1#message-22671e8bfc8583e4ae329a88ba25ab5c49cc3016).

With the addcarryxU64 removal and fused Add/Sub/Opp + Carry calls, the existing code is still slightly faster than fiat on amd64 (fiat runtime is +9~17%)[0]. I did not benchmark copy-pasting CarrySquare into a for-loop, since the existing inversion code won't leverage it (so there is an optimization opportunity in the base code as well). Likewise, I did not benchmark the impact of adding a * (b + c), a * (b - c), and (a + b)^2 routines.
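For readers unfamiliar with the "repeated squaring in a for loop" idea mentioned above, here is a minimal sketch of its shape. It is not fiat-crypto output: the modulus `p`, `mulMod`, and `squareN` are hypothetical names over a toy single-word field, and `bits.Div64` is not constant time, so this only illustrates the loop structure that an inversion routine would call.

```go
package main

import (
	"fmt"
	"math/bits"
)

// p is a toy modulus (the Mersenne prime 2^61 - 1), chosen only so the
// example fits in one machine word. It is NOT the curve25519 field.
const p = 1<<61 - 1

// mulMod returns a*b mod p for a, b < p. bits.Div64 is not constant time;
// real field code would use a carry chain here instead.
func mulMod(a, b uint64) uint64 {
	hi, lo := bits.Mul64(a, b)
	// hi < 2^58 < p for inputs below p, so Div64 cannot panic.
	_, rem := bits.Div64(hi, lo, p)
	return rem
}

// squareN returns x^(2^n) mod p by squaring n times inside one function,
// so callers do not pay a separate call for every squaring.
func squareN(x uint64, n int) uint64 {
	for i := 0; i < n; i++ {
		x = mulMod(x, x)
	}
	return x
}

func main() {
	fmt.Println(squareN(3, 2)) // 3^(2^2) = 3^4 = 81
}
```

Inversion by Fermat's little theorem is mostly long runs of squarings, which is why a helper of this shape can matter for curve25519.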

Under the assumption that upstream pulls in at least the optimizations I did benchmark (and they seemed open to the idea), the tradeoffs from my perspective are:

  • (Pro) "ed25519 and curve25519 32-bits" for "free". This will almost certainly be faster than the existing code as well.
  • (Pro) Formal verification of part of the implementation (though the current code is quite nice)
  • (Pro) Less assembly to maintain (amd64/arm64 multiply/square have assembly impls)
  • (Con) Slightly slower.

But this is contingent on changes to the generated code happening, so there is no rush. On a positive note, when the changes do happen P-521 will be faster for "free".

[0]: ScalarMult/ScalarBaseMult performance will be significantly worse if fiat's Selectznz is used, so I didn't.

@josharian
Contributor

@josharian josharian commented Aug 3, 2021

P-521 has landed in Go 1.17! We can bring in P-384, 32 bit versions, and scalar fields in Go 1.18.

We are really looking forward to P-384 landing. We just had a bunch of OOM crashes on our new iOS release, which operates in a memory-constrained environment. We traced it back to the sheer number of allocations from doing P-384 verification of an x509 cert chain. We are working around it with a third party crypto library for now; when P-384 gets the fiat treatment that P-521 did (thanks @FiloSottile!), we can switch back to the standard library.

JasonGross added a commit to JasonGross/fiat-crypto that referenced this issue Aug 3, 2021
This is a partial fix for item 1 of
mit-plv#949 (comment) also
mentioned at golang/go#40171 (comment)

Unfortunately, we still cast the arguments to uint1 because we don't
have enough information in the right places to relax the casts.  We
really want a fused absint/rewriting pass, where we get access to true
bounds information rather than just casts.
@JasonGross

@JasonGross JasonGross commented Aug 3, 2021

I should have a PR doing this soon. The only unfortunate remaining point is that there will be a bunch of calls like x79, _ = bits.Add64(x34, x75, uint64(uint1(x78))) where x78 is a uint64. This is largely due to an artifact of how we're encoding bounds information, but it's an artifact that's extremely hard to change. (It is, however, on my long-term todo list; I hope to eventually rewrite the core of the fiat-crypto engine from scratch and merge abstract interpretation with rewriting, which will allow removing these double-casts in a verified way quite easily.) I hope this doesn't incur too much of a performance overhead?

  • Selectznz is really really slow, compared to how edwards25519 does it.

Can you point me at the code in edwards25519? I'm happy to have fiat-crypto emit a different version.

Should be pretty easy, I just need to get around to doing it.

@josharian
Contributor

@josharian josharian commented Aug 3, 2021

uint64(uint1(...)) [...] I hope this doesn't incur too much of a performance overhead?

This should get compiled to a zero-extending move on all platforms (e.g. MOVBU on arm64, MOVBLZX on amd64). That should be pretty cheap, as long as the values (x78) have short lifetimes (so they don't take up too many registers and cause spills).

If you can make uint1 a uint64 instead of a uint8, it'll be free.

Can you point me at the code in edwards25519? I'm happy to have fiat-crypto emit a different version.

I'm guessing, but perhaps https://cs.opensource.google/go/go/+/refs/tags/go1.17rc1:src/crypto/ed25519/internal/edwards25519/edwards25519.go;l=398

I'm also happy to help you optimize these routines, if you can point me at one in particular that is of interest. (I see many Selectznz in fiat-crypto.)

@Yawning

@Yawning Yawning commented Aug 3, 2021

I should have a PR doing this soon.

Nice!

The only unfortunate remaining point is that there will be a bunch of calls like x79, _ = bits.Add64(x34, x75, uint64(uint1(x78))) where x78 is a uint64. This is largely due to an artifact of how we're encoding bounds information, but it's an artifact that's extremely hard to change. (It is, however, on my long-term todo list; I hope to eventually rewrite the core of the fiat-crypto engine from scratch and merge abstract interpretation with rewriting, which will allow removing these double-casts in a verified way quite easily.) I hope this doesn't incur too much of a performance overhead?

There will be some, but I would need to explicitly benchmark it to see how much of a hit it is. It shouldn't be too bad, though it depends on how the truncate + zero extend gets emitted/optimized.

  • Selectznz is really really slow, compared to how edwards25519 does it.

Can you point me at the code in edwards25519? I'm happy to have fiat-crypto emit a different version.

Peeking at it, the rough algorithm appears to be the same (https://github.com/FiloSottile/edwards25519/blob/main/field/fe.go#L261).

The difference I saw when trying to use Selectznz is likely because ./curve25519.go:577:6: cannot inline Selectznz: function too complex: cost 241 exceeds budget 80 versus ./fe.go:261:6: can inline (*Element).Select with cost 80 as: method(*Element) func(*Element, *Element, int) *Element { m := mask64Bits(cond); v.l0 = m & a.l0 | ^m & b.l0; v.l1 = m & a.l1 | ^m & b.l1; v.l2 = m & a.l2 | ^m & b.l2; v.l3 = m & a.l3 | ^m & b.l3; v.l4 = m & a.l4 | ^m & b.l4; return v }.

Normally this wouldn't be a big deal, but a number of the higher-level operations (e.g., a table-driven scalar basepoint multiply) call conditional swap in a loop, so it ends up in the critical path.
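A minimal sketch of the mask-based select quoted in the inliner output above may help. The `element` type, the method name `sel`, and the body of `mask64Bits` are assumptions made for illustration; only the `m & a | ^m & b` pattern is taken from the quoted diagnostic.

```go
package main

import "fmt"

// element is a hypothetical five-limb field element, mirroring the
// l0..l4 layout visible in the inliner output.
type element struct{ l0, l1, l2, l3, l4 uint64 }

// mask64Bits returns all ones when cond is 1 and zero when cond is 0.
// The body is an assumption; the thread only shows the call site.
func mask64Bits(cond int) uint64 {
	return ^(uint64(cond) - 1)
}

// sel sets v to a if cond == 1 and to b if cond == 0, without branching
// on cond. The mask is computed once and reused for every limb.
func (v *element) sel(a, b *element, cond int) *element {
	m := mask64Bits(cond)
	v.l0 = (m & a.l0) | (^m & b.l0)
	v.l1 = (m & a.l1) | (^m & b.l1)
	v.l2 = (m & a.l2) | (^m & b.l2)
	v.l3 = (m & a.l3) | (^m & b.l3)
	v.l4 = (m & a.l4) | (^m & b.l4)
	return v
}

func main() {
	a := &element{1, 2, 3, 4, 5}
	b := &element{6, 7, 8, 9, 10}
	var v element
	fmt.Println(*v.sel(a, b, 1)) // {1 2 3 4 5}
	fmt.Println(*v.sel(a, b, 0)) // {6 7 8 9 10}
}
```

Computing the mask once and writing straight into the receiver keeps the body small, which is plausibly why it fits the inlining budget while the generated Selectznz does not.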

While I'm miffed that this is another inliner related issue, on the upside at least the generated Selectznz assembly doesn't appear to be half NOPs or anything gross like that.

Should be pretty easy, I just need to get around to doing it.

Excellent.

ps: Sorry for being such a pain, and thanks for the quick responses.

@josharian
Contributor

@josharian josharian commented Aug 3, 2021

As a start, looking at the Selectznz in curve25519.go, shortening the lifetimes of the variables will help. Current code:

func Selectznz(out1 *[5]uint64, arg1 uint1, arg2 *[5]uint64, arg3 *[5]uint64) {
	var x1 uint64
	cmovznzU64(&x1, arg1, arg2[0], arg3[0])
	var x2 uint64
	cmovznzU64(&x2, arg1, arg2[1], arg3[1])
	var x3 uint64
	cmovznzU64(&x3, arg1, arg2[2], arg3[2])
	var x4 uint64
	cmovznzU64(&x4, arg1, arg2[3], arg3[3])
	var x5 uint64
	cmovznzU64(&x5, arg1, arg2[4], arg3[4])
	out1[0] = x1
	out1[1] = x2
	out1[2] = x3
	out1[3] = x4
	out1[4] = x5
}

Better:

func Selectznz(out1 *[5]uint64, arg1 uint1, arg2 *[5]uint64, arg3 *[5]uint64) {
	var x1 uint64
	cmovznzU64(&x1, arg1, arg2[0], arg3[0])
	out1[0] = x1
	var x2 uint64
	cmovznzU64(&x2, arg1, arg2[1], arg3[1])
	out1[1] = x2
	var x3 uint64
	cmovznzU64(&x3, arg1, arg2[2], arg3[2])
	out1[2] = x3
	var x4 uint64
	cmovznzU64(&x4, arg1, arg2[3], arg3[3])
	out1[3] = x4
	var x5 uint64
	cmovznzU64(&x5, arg1, arg2[4], arg3[4])
	out1[4] = x5
}

This avoids spills to the stack. On amd64, this compiles to 42 instructions instead of 57; all of the newly elided instructions are stack reads/writes.

This won't help with the execrable inliner, though.

JasonGross added a commit to JasonGross/fiat-crypto that referenced this issue Aug 4, 2021
@JasonGross

@JasonGross JasonGross commented Aug 4, 2021

Note that this changes the semantics of Selectznz when the output array overlaps partially (but not fully) with either input array. Is this acceptable?
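The semantic difference can be seen in a small sketch. It uses slices rather than `*[5]uint64`, because array pointers cannot partially overlap in safe Go; `cmov`, `selectBuffered`, and `selectInterleaved` are hypothetical stand-ins for the two orderings discussed above, not the generated code.

```go
package main

import "fmt"

// cmov returns z when cond is 0 and nz when cond is 1, without branching.
func cmov(cond, z, nz uint64) uint64 {
	mask := -cond // all ones when cond == 1, zero when cond == 0
	return (mask & nz) | (^mask & z)
}

// selectBuffered reads every input limb before writing any output limb,
// like the original ordering.
func selectBuffered(out []uint64, cond uint64, a, b []uint64) {
	var tmp [5]uint64
	for i := range tmp {
		tmp[i] = cmov(cond, a[i], b[i])
	}
	copy(out, tmp[:])
}

// selectInterleaved writes each output limb as soon as it is computed,
// like the shortened-lifetime ordering.
func selectInterleaved(out []uint64, cond uint64, a, b []uint64) {
	for i := 0; i < 5; i++ {
		out[i] = cmov(cond, a[i], b[i])
	}
}

func main() {
	b := []uint64{9, 9, 9, 9, 9}

	// out is shifted one limb relative to a, so they overlap partially.
	buf1 := []uint64{1, 2, 3, 4, 5, 0}
	selectBuffered(buf1[1:6], 0, buf1[0:5], b)
	fmt.Println(buf1) // [1 1 2 3 4 5]

	buf2 := []uint64{1, 2, 3, 4, 5, 0}
	selectInterleaved(buf2[1:6], 0, buf2[0:5], b)
	fmt.Println(buf2) // [1 1 1 1 1 1]
}
```

With full overlap (out and a being the same array) both orderings agree, since each limb is read before the same limb is written.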

JasonGross added a commit to JasonGross/fiat-crypto that referenced this issue Aug 4, 2021
@Yawning

@Yawning Yawning commented Aug 4, 2021

uint64(uint1(...)) [...] I hope this doesn't incur too much of a performance overhead?

This should get compiled to a zero-extending move on all platforms (e.g. MOVBU on arm64, MOVBLZX on amd64). That should be pretty cheap, as long as the values (x78) have short lifetimes (so they don't take up too many registers and cause spills).

Turns out this is rather expensive, because it adds 4 instructions per cast. For example:

	x55, x56 = bits.Add64(x17, x51, uint64(0x0))
	var x57 uint64
	x57, _ = bits.Add64(x18, x53, uint64(uint1(x56)))

With uint1 = uint8:

	0x0273 00627 (curve25519.go:196)	MOVQ	"".x17+248(SP), R15
	0x027b 00635 (curve25519.go:196)	ADDQ	R15, BX
	0x027e 00638 (curve25519.go:196)	SBBQ	R15, R15
	0x0281 00641 (curve25519.go:196)	NEGQ	R15
	0x0284 00644 (curve25519.go:198)	MOVBLZX	R15B, R15
	0x0288 00648 (curve25519.go:198)	NEGL	R15
	0x028b 00651 (curve25519.go:198)	ADCQ	CX, R8

With uint1 = uint64 (which is indeed free):

	0x0268 00616 (curve25519.go:196)	MOVQ	"".x17+248(SP), R15
	0x0270 00624 (curve25519.go:196)	ADDQ	R15, BX
	0x0273 00627 (curve25519.go:198)	ADCQ	CX, R8

Interestingly enough, the compiler is ensuring that the carry being propagated to the second Add64 is always 0 or 1 (which is where the bulk of the extra overhead is coming from), but it isn't noticing that it is the output from a prior Add64 call.
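The two typings compared above can be written side by side as a sketch. This is not fiat-crypto output; `narrowCarry`, `add128Narrow`, and `add128Wide` are hypothetical names. Both functions return the same result because `bits.Add64` only ever produces a carry of 0 or 1, so the narrowing cast never changes the value; the difference lies only in what the compiler has to emit for the cast.

```go
package main

import (
	"fmt"
	"math/bits"
)

// narrowCarry stands in for a uint1 type backed by uint8.
type narrowCarry uint8

// add128Narrow adds two 128-bit values, round-tripping the carry through
// the narrow type the way uint64(uint1(x)) does in the generated code.
func add128Narrow(aHi, aLo, bHi, bLo uint64) (hi, lo uint64) {
	var c uint64
	lo, c = bits.Add64(aLo, bLo, 0)
	hi, _ = bits.Add64(aHi, bHi, uint64(narrowCarry(c)))
	return hi, lo
}

// add128Wide passes the carry straight through as a uint64.
func add128Wide(aHi, aLo, bHi, bLo uint64) (hi, lo uint64) {
	var c uint64
	lo, c = bits.Add64(aLo, bLo, 0)
	hi, _ = bits.Add64(aHi, bHi, c)
	return hi, lo
}

func main() {
	// (2^64 - 1) + 1 carries out of the low word into the high word.
	fmt.Println(add128Narrow(0, ^uint64(0), 0, 1)) // 1 0
	fmt.Println(add128Wide(0, ^uint64(0), 0, 1))   // 1 0
}
```

Building each variant with `go build -gcflags=-S` is one way to compare the emitted instructions on a given toolchain.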

JasonGross added a commit to JasonGross/fiat-crypto that referenced this issue Aug 4, 2021
@JasonGross

@JasonGross JasonGross commented Aug 4, 2021

Turns out this is rather expensive, because it adds 4 instructions per cast.

Oof. Surely the Go compiler can do better and just emit one bitmask instruction per sequence of nested casts, no? If this needs to be handled on our side, I'll have to think about how to solve mit-plv/fiat-crypto#846 so that we can widen the carry type without giving up the ability to make use of a bytes type. (I think ultimately I'll need to keep the casts in to_bytes and from_bytes, but I should hopefully be able to remove them in the other functions.)

@JasonGross

@JasonGross JasonGross commented Aug 4, 2021

I guess a cheap alternative is to just change the typedef to uint64 and add a comment; this is unsafe, but no more unsafe than the current code (where our proof assumes that cast to uint1 truncates to 1 bit, even though our actual code doesn't make use of this).

@Yawning

@Yawning Yawning commented Aug 6, 2021

I guess a cheap alternative is to just change the typedef to uint64 and add a comment; this is unsafe, but no more unsafe than the current code (where our proof assumes that cast to uint1 truncates to 1 bit, even though our actual code doesn't make use of this).

If the truncation isn't leveraged at all for anything, this does seem like the path of least resistance from my perspective.

JasonGross added a commit to JasonGross/fiat-crypto that referenced this issue Aug 6, 2021
JasonGross added a commit to mit-plv/fiat-crypto that referenced this issue Aug 7, 2021
This is a partial fix for item 1 of
#949 (comment) also
mentioned at golang/go#40171 (comment)

Unfortunately, we still cast the arguments to uint1 because we don't
have enough information in the right places to relax the casts.  We
really want a fused absint/rewriting pass, where we get access to true
bounds information rather than just casts.
@Yawning

@Yawning Yawning commented Aug 8, 2021

Thanks to the work by @JasonGross, naively integrating fiat-crypto into the edwards25519 code scheduled for use in 1.17 now looks like this:

name \ time/op                 baseline     baseline-purego  fiat
MultiScalarMultSize8-4          410µs ± 0%       543µs ± 0%   476µs ± 0%
ScalarBaseMult-4               34.6µs ± 0%      44.7µs ± 0%  40.2µs ± 0%
ScalarMult-4                    115µs ± 0%       160µs ± 0%   127µs ± 0%
VarTimeDoubleScalarBaseMult-4   109µs ± 0%       155µs ± 0%   117µs ± 0%

Note that the baseline numbers use assembly language implementations for multiply and square, so the comparison vs fiat isn't "fair" or "direct". The fiat-crypto code outperforms the existing code when assembly is disabled, and comes within spitting distance (+7~16% increased runtime), when compared to code that cheats and uses assembly.

If people are determined to squeeze out everything they can from the fiat backend, then there's some more manual inlining/refactoring that could be done, but the low hanging fruit gets performance to "competitive with existing code", at least from my perspective.

@JasonGross

@JasonGross JasonGross commented Aug 8, 2021

As a start, looking at the Selectznz in curve25519.go, shortening the lifetimes of the variables will help.

I'll aim to add a flag that allows this shortly. If it's helpful, I can also have the cmovznz calls inlined, though I expect that'll take me a little bit more work

@josharian
Contributor

@josharian josharian commented Aug 11, 2021

As a start, looking at the Selectznz in curve25519.go, shortening the lifetimes of the variables will help.

I'll aim to add a flag that allows this shortly. If it's helpful, I can also have the cmovznz calls inlined, though I expect that'll take me a little bit more work

No, you're right, it's better not to make the caller worry about alias-safety. Thanks for pointing that out.

Fortunately, you can get a similar effect by tweaking the method signature a bit to return a value rather than write to *arg0:

func cmovznzU64(arg1 uint1, arg2 uint64, arg3 uint64) uint64 {
	x1 := (uint64(arg1) * 0xffffffffffffffff)
	return ((x1 & arg3) | ((^x1) & arg2))
}

func Selectznz(out1 *[5]uint64, arg1 uint1, arg2 *[5]uint64, arg3 *[5]uint64) {
	x1 := cmovznzU64(arg1, arg2[0], arg3[0])
	x2 := cmovznzU64(arg1, arg2[1], arg3[1])
	x3 := cmovznzU64(arg1, arg2[2], arg3[2])
	x4 := cmovznzU64(arg1, arg2[3], arg3[3])
	x5 := cmovznzU64(arg1, arg2[4], arg3[4])
	out1[0] = x1
	out1[1] = x2
	out1[2] = x3
	out1[3] = x4
	out1[4] = x5
}

// and the obvious change to ToBytes

This shrinks the generated code for Selectznz on amd64 from 61 to 46 instructions, while preserving aliasing safety.

You'll get benefits throughout from returning values, possibly multiple values, from functions rather than passing pointers. And as a bonus, the code will be much more idiomatic. Another example:

func subborrowxU51(arg1 uint1, arg2 uint64, arg3 uint64) (out1 uint64, out2 uint1) {
	x1 := ((int64(arg2) - int64(arg1)) - int64(arg3))
	x2 := int1((x1 >> 51))
	x3 := (uint64(x1) & 0x7ffffffffffff)
	return x3, (0x0 - uint1(x2))
}

func ToBytes(out1 *[32]uint8, arg1 *TightFieldElement) {
	x1, x2 := subborrowxU51(0x0, arg1[0], 0x7ffffffffffed)
	x3, x4 := subborrowxU51(x2, arg1[1], 0x7ffffffffffff)
	x5, x6 := subborrowxU51(x4, arg1[2], 0x7ffffffffffff)
	x7, x8 := subborrowxU51(x6, arg1[3], 0x7ffffffffffff)
	x9, x10 := subborrowxU51(x8, arg1[4], 0x7ffffffffffff)
	// ...
}

This cuts about 25% of the instructions in ToBytes.
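To show the value-returning signature end to end, here is a self-contained sketch of a two-limb borrow chain. The `uint1` and `int1` definitions are assumptions (backed by 8-bit types, as discussed earlier in the thread), and the inputs are made up for illustration.

```go
package main

import "fmt"

// Assumed definitions: one-bit values backed by 8-bit types.
type uint1 uint8
type int1 int8

// subborrowxU51 computes arg2 - arg1 - arg3 on a 51-bit limb and returns
// the low 51 bits together with the outgoing borrow (0 or 1).
func subborrowxU51(arg1 uint1, arg2 uint64, arg3 uint64) (out1 uint64, out2 uint1) {
	x1 := (int64(arg2) - int64(arg1)) - int64(arg3)
	x2 := int1(x1 >> 51) // -1 if the subtraction went negative, else 0
	x3 := uint64(x1) & 0x7ffffffffffff
	return x3, 0x0 - uint1(x2) // 0 - 255 wraps to 1 in uint8 arithmetic
}

func main() {
	// 5 - 7 underflows the 51-bit limb, so the borrow is 1.
	lo, borrow := subborrowxU51(0, 5, 7)
	// The borrow feeds the next limb: 10 - 1 - 3 = 6, no further borrow.
	hi, last := subborrowxU51(borrow, 10, 3)
	fmt.Println(lo == 1<<51-2, borrow, hi, last) // true 1 6 0
}
```

Each call's results stay in ordinary local variables, so nothing needs to be addressable and the chain reads top to bottom.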

I can also have the cmovznz calls inlined

cmovznz is already inlined. Selectznz is not, but getting it inlined is going to be hard to achieve. I messed around with it for a bit, and the only way I saw to do it was to introduce a loop and alter the function signature to return a [5]uint64, which finesses the aliasing issue but will itself be slow.

func Selectznz(arg1 uint1, arg2 *[5]uint64, arg3 *[5]uint64) (out1 [5]uint64) {
	mask := -uint64(arg1)
	for i := 0; i < len(out1); i++ {
		out1[i] = cmovznzU64(mask, arg2[i], arg3[i])
	}
	return
}

func cmovznzU64(mask uint64, arg2 uint64, arg3 uint64) uint64 {
	return ((mask & arg3) | ((^mask) & arg2))
}

I'm pretty sure that is not a net win, even assuming that the changed function signature is acceptable.

@FiloSottile
Contributor

@FiloSottile FiloSottile commented Aug 26, 2021

(Back from vacation! Pardon the lag.)

This is excellent, thanks @Yawning, @JasonGross and @josharian for working on fiat's performance!

@Yawning makes good points in #40171 (comment), however I am still inclined not to switch the 25519 backends to fiat, even if I agree the performance difference is acceptable now.

First, the current code already shipped in Go 1.17, so changing it has the marginal risk involved in making any change: unexpected bugs, edge case performance changes, differences in unspecified behavior. We already incurred that risk in Go 1.17, and doing it again in Go 1.18 is a cost.

Second, fiat is leaps and bounds better than the assembly we had before or big.Int, which is what made it acceptable to import code that can't really be reviewed in its output. However, here we have a lot of confidence in the current code, as it's well tested and documented. (I am obviously biased on that, having written the code, but I am also the one that has to maintain it, so my money is where my mouth is.) fiat is generated from a formal model, which rules out one important class of bugs, but it's not impossible for the code generator to have bugs or for unspecified behavior to be subtly different.

Third, the current code has two additional non-functional purposes: it's the only example in the tree of how to apply https://golang.org/wiki/AssemblyPolicy, and is moderately educational for people learning about elliptic curve implementations. The latter is definitely not dispositive, but it's been historically true that the Go standard library was a good place to learn about cryptography engineering, and I am keen on preserving that when we can.

However, the performance work is still extremely valuable, and maybe eventually we'll even get to switch P-256, which would be a major win.

@Yawning

@Yawning Yawning commented Aug 26, 2021

@Yawning makes good points in #40171 (comment), however I am still inclined not to switch the 25519 backends to fiat, even if I agree the performance difference is acceptable now.

[snip]

However, the performance work is still extremely valuable, and maybe eventually we'll even get to switch P-256, which would be a major win.

The rationale makes sense to me, since I'm currently still trying to decide between similar tradeoffs with the library I maintain as well.

A nice thing is that, while my efforts were based around 25519 performance, the changes that @JasonGross was kind enough to make will help performance for the places that currently do use fiat (the NIST curves).

It's not as if switching the 25519 code would be expensive if there ends up being a more compelling reason to do so later in the future as well.

@josharian
Contributor

@josharian josharian commented Oct 18, 2021

Hey, just checking in. The 1.18 window will close pretty soon.
