cmd/compile: optimize unaligned load-XOR-store on byte slices #25111
It would be nice if the following sets of code compiled to equivalent machine code on platforms that support unaligned loads/stores (386, amd64, arm64, ppc64le, s390x...). I've used XOR in these examples, but the same applies to the other logical operators:
(1)

    binary.LittleEndian.PutUint32(dst, binary.LittleEndian.Uint32(src) ^ x)

(2)

    dst[0] = src[0] ^ byte(x)
    dst[1] = src[1] ^ byte(x>>8)
    dst[2] = src[2] ^ byte(x>>16)
    dst[3] = src[3] ^ byte(x>>24)

(3) [less important]

    x ^= uint32(src[0])
    x ^= uint32(src[1]) << 8
    x ^= uint32(src[2]) << 16
    x ^= uint32(src[3]) << 24
    binary.LittleEndian.PutUint32(dst, x)
Currently (1) is optimal on platforms with unaligned loads and (2) is optimal on other platforms. It would be nice if the compiler could optimize (2) into (1). I've added (3) as an additional case where the current rules are suboptimal.
If this is ever done it will help simplify the generic
Example (1) is equivalent to the following code, which contains three extra shifts:

    v := uint32(src[0])
    v |= uint32(src[1]) << 8
    v |= uint32(src[2]) << 16
    v |= uint32(src[3]) << 24
    v ^= x
    dst[0] = byte(v)
    dst[1] = byte(v>>8)
    dst[2] = byte(v>>16)
    dst[3] = byte(v>>24)
On the other hand this doesn't actually result in many more instructions on arm because of the shifted register inputs. The assembly on mips benefits a bit more from (2) though. I don't know if there is a speed difference.