crypto/cipher: easy optimisations to xorBytes #53023
Comments
@jech - point 1 is unnecessary because arm64 is excluded via the go:build line in But an emphatic yes to the other points. Especially point 2, where we are currently leaving significant gains on the table in a performance-critical function on ARMv6/7, where unaligned access is OK. There are frequent posts regarding golang’s slow crypto performance on arm32, and this would be a low-hanging-fruit change with big gains.
cc @bradfitz
See also #35381.
@bcmills / @bradfitz / @jech - I have written both NEON and non-NEON versions of xorBytes for 32-bit ARM. The performance increase over the 'slow' version in xor_generic.go is significant, between 4X and 20X. Given that 32-bit ARM remains one of the most popular IoT device platforms, it's a shame that it is the sole platform left unoptimised for xorBytes in the crypto library, since xorBytes can be a bottleneck function for crypto. So, I would be very happy to create a PR here to contribute optimised 32-bit NEON and non-NEON ARM implementations of xorBytes. This could be a significant deal for crypto performance in general on 32-bit ARM. But I would appreciate a signal first that there is the appetite to accept such a PR, so it does not end up in /dev/null.
Change https://go.dev/cl/409394 mentions this issue:
The function crypto/cipher.xorBytes implements the XOR operation between slices of bytes, and is heavily used by encryption in CTR mode. The version in the stdlib has seen a significant amount of optimisation, but some easy optimisations are missing:
1. supportsUnaligned in xor_generic.go should be set for GOARCH equal to arm64.
2. if supportsUnaligned is not set, the fast version should still be called when the addresses of both src and dst are multiples of wordSize (this actually happens fairly often, e.g. when encrypting a freshly allocated slice);
3. if supportsUnaligned is not set, and the addresses of src and dst are equal modulo wordSize, then the initial bytes should be xored and then the fast version called on the rest of the array;
Points 1 through 3 are fairly trivial, and would bring a significant part of the benefit. This is related to #53021.
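Points 2 and 3 can be sketched roughly as follows. This is only an illustration of the idea, not the stdlib code or the change in any CL: the name xorBytesSketch is hypothetical, the (dst, a, b) signature is assumed to mirror xorBytes, and the word loop uses plain unsafe pointer casts instead of whatever the real fast path does. Point 2 falls out as the special case of point 3 where the common offset is zero.

```go
package main

import "unsafe"

// wordSize is the size of a machine word in bytes.
const wordSize = int(unsafe.Sizeof(uintptr(0)))

// xorBytesSketch is a hypothetical illustration, not the stdlib function.
// It XORs a and b into dst and returns the number of bytes processed.
// It never performs an unaligned word access.
func xorBytesSketch(dst, a, b []byte) int {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	if n == 0 {
		return 0
	}
	_ = dst[n-1] // panics if dst is shorter than n

	w := uintptr(wordSize)
	pd := uintptr(unsafe.Pointer(&dst[0])) % w
	pa := uintptr(unsafe.Pointer(&a[0])) % w
	pb := uintptr(unsafe.Pointer(&b[0])) % w

	i := 0
	if pd == pa && pa == pb {
		// All three slices share the same offset modulo wordSize.
		// XOR the leading bytes one at a time until aligned
		// (zero bytes if the slices are already aligned).
		head := int((w - pd) % w)
		if head > n {
			head = n
		}
		for ; i < head; i++ {
			dst[i] = a[i] ^ b[i]
		}
		// Word-at-a-time loop on the aligned middle part.
		for ; n-i >= wordSize; i += wordSize {
			dw := (*uintptr)(unsafe.Pointer(&dst[i]))
			aw := (*uintptr)(unsafe.Pointer(&a[i]))
			bw := (*uintptr)(unsafe.Pointer(&b[i]))
			*dw = *aw ^ *bw
		}
	}
	// Remaining tail bytes, or the whole input when the offsets differ.
	for ; i < n; i++ {
		dst[i] = a[i] ^ b[i]
	}
	return n
}
```

Calling xorBytesSketch(dst, a, b) on freshly allocated slices would normally take the word loop straight away, since the allocator returns word-aligned memory; slices taken at differing odd offsets fall back to the byte loop.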