It would be nice if the following code snippets compiled to equivalent machine code on platforms that support unaligned loads/stores (386, amd64, arm64, ppc64le, s390x...). I've used XOR in these examples, but the same applies to the other logical operators:
Currently (1) is optimal on platforms with unaligned loads and (2) is optimal on other platforms. It would be nice if the compiler could optimize (2) into (1). I've added (3) as an additional case where the current rules are suboptimal.
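For illustration only, here is a minimal sketch of the two styles being compared. It assumes (1) is a word-at-a-time form built on `encoding/binary` and (2) is a byte-at-a-time form. The function names `xorWord` and `xorBytes` are hypothetical and are not taken from the original snippets.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// xorWord is a hypothetical word-at-a-time form: one 32-bit load,
// one XOR, one 32-bit store (assumed to correspond to style (1)).
func xorWord(dst, src []byte, key uint32) {
	v := binary.LittleEndian.Uint32(src)
	binary.LittleEndian.PutUint32(dst, v^key)
}

// xorBytes is a hypothetical byte-at-a-time form: four byte loads,
// four XORs against shifted pieces of key, four byte stores
// (assumed to correspond to style (2)).
func xorBytes(dst, src []byte, key uint32) {
	_ = src[3] // early bounds checks so the accesses below need none
	_ = dst[3]
	dst[0] = src[0] ^ byte(key)
	dst[1] = src[1] ^ byte(key>>8)
	dst[2] = src[2] ^ byte(key>>16)
	dst[3] = src[3] ^ byte(key>>24)
}

func main() {
	src := []byte{0x01, 0x02, 0x03, 0x04}
	a := make([]byte, 4)
	b := make([]byte, 4)
	xorWord(a, src, 0xdeadbeef)
	xorBytes(b, src, 0xdeadbeef)
	fmt.Printf("%x %x\n", a, b) // prints "eebcaeda eebcaeda"
}
```

Both functions produce the same bytes; the question raised here is only whether the compiler emits the same instructions for them.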
If this is ever done it will help simplify the generic golang.org/x/crypto/internal/chacha20 implementation.
On the other hand this doesn't actually result in many more instructions on arm because of the shifted register inputs. The assembly on mips benefits a bit more from (2) though. I don't know if there is a speed difference.
It seems like (1) is the code we should expect people to write, and it may be a bit easier to recognize and rewrite than (2). (It's usually easier to distribute mathematical operations than to un-distribute them, especially binary operations that don't carry.)
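To make the point about carries concrete, here is a small sketch (the helper `lanes` is hypothetical) that applies an operation to each byte independently and compares the result with the full-width operation. XOR agrees with the word-wide result, while addition does not once a carry crosses a byte boundary.

```go
package main

import "fmt"

// lanes applies op independently to each of the four bytes of a and b
// and reassembles the per-byte results into a 32-bit word.
func lanes(a, b uint32, op func(x, y byte) byte) uint32 {
	var out uint32
	for shift := uint(0); shift < 32; shift += 8 {
		out |= uint32(op(byte(a>>shift), byte(b>>shift))) << shift
	}
	return out
}

func main() {
	a, b := uint32(0x000000ff), uint32(0x00000001)
	xor := func(x, y byte) byte { return x ^ y }
	add := func(x, y byte) byte { return x + y } // wraps within the byte

	fmt.Println(lanes(a, b, xor) == a^b) // true: XOR has no carries
	fmt.Println(lanes(a, b, add) == a+b) // false: the carry out of byte 0 is lost
}
```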