cmd/compile: optimize unaligned load-XOR-store on byte slices #25111
It would be nice if the following code snippets compiled to equivalent code on platforms that support unaligned loads/stores (386, amd64, arm64, ppc64le, s390x...). I've used XOR in these examples, but the same applies to the other logical operators:
(1)
binary.LittleEndian.PutUint32(dst, binary.LittleEndian.Uint32(src) ^ x)
(2)
dst[0] = src[0] ^ byte(x)
dst[1] = src[1] ^ byte(x>>8)
dst[2] = src[2] ^ byte(x>>16)
dst[3] = src[3] ^ byte(x>>24)
(3) [less important]
x ^= uint32(src[0])
x ^= uint32(src[1]) << 8
x ^= uint32(src[2]) << 16
x ^= uint32(src[3]) << 24
binary.LittleEndian.PutUint32(dst, x)
Currently (1) is optimal on platforms with unaligned loads and (2) is optimal on other platforms. It would be nice if the compiler could optimize (2) into (1). I've added (3) as an additional case where the current rules are suboptimal.
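As a quick sanity check that the three forms compute the same bytes, here is a minimal sketch. The function names (`xorWord`, `xorBytes`, `xorAccum`) and the test values are made up for illustration; only the bodies come from the examples above. It assumes `dst` and `src` are each at least 4 bytes long and do not partially overlap.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// xorWord is form (1): one 32-bit load, one XOR, one 32-bit store.
func xorWord(dst, src []byte, x uint32) {
	binary.LittleEndian.PutUint32(dst, binary.LittleEndian.Uint32(src)^x)
}

// xorBytes is form (2): four byte-wide load-XOR-store operations.
func xorBytes(dst, src []byte, x uint32) {
	_ = dst[3] // early bounds checks so the per-byte accesses below need none
	_ = src[3]
	dst[0] = src[0] ^ byte(x)
	dst[1] = src[1] ^ byte(x>>8)
	dst[2] = src[2] ^ byte(x>>16)
	dst[3] = src[3] ^ byte(x>>24)
}

// xorAccum is form (3): XOR each source byte into x, then store x once.
func xorAccum(dst, src []byte, x uint32) {
	_ = src[3]
	x ^= uint32(src[0])
	x ^= uint32(src[1]) << 8
	x ^= uint32(src[2]) << 16
	x ^= uint32(src[3]) << 24
	binary.LittleEndian.PutUint32(dst, x)
}

func main() {
	src := []byte{0x01, 0x23, 0x45, 0x67} // arbitrary illustrative input
	const x = 0xdeadbeef                  // arbitrary illustrative mask

	a, b, c := make([]byte, 4), make([]byte, 4), make([]byte, 4)
	xorWord(a, src, x)
	xorBytes(b, src, x)
	xorAccum(c, src, x)

	// Reports whether all three forms produced identical output.
	fmt.Println(bytes.Equal(a, b) && bytes.Equal(b, c))
}
```

Building each function with `go build -gcflags=-S` and comparing the generated assembly is one way to see which forms the compiler currently merges into a single load and store on a given GOARCH.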
If this is ever done it will help simplify the generic
Example (1) is equivalent to the following code, which contains 3 more shifts than (2):
v := uint32(src[0])
v |= uint32(src[1]) << 8
v |= uint32(src[2]) << 16
v |= uint32(src[3]) << 24
v ^= u
dst[0] = byte(v)
dst[1] = byte(v>>8)
dst[2] = byte(v>>16)
dst[3] = byte(v>>24)
On the other hand, this doesn't actually result in many more instructions on arm because of the shifted register inputs. The assembly on mips benefits a bit more from (2), though. I don't know if there is a speed difference.
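One way to measure whether there is a speed difference is a small benchmark along these lines. This is a rough sketch, not a rigorous measurement: the names `xorExpanded`, `xorPerByte`, and `bench` are hypothetical, the 4096-byte buffer size is arbitrary, and calling through a function value prevents inlining, which may hide part of the difference. It would need to be run on the target (for example with GOARCH=arm or GOARCH=mips) to be meaningful.

```go
package main

import (
	"fmt"
	"testing"
)

// xorExpanded assembles a word from bytes, XORs it, then splits it again
// (the expanded form of example (1) on targets without unaligned loads).
func xorExpanded(dst, src []byte, u uint32) {
	_ = dst[3]
	_ = src[3]
	v := uint32(src[0])
	v |= uint32(src[1]) << 8
	v |= uint32(src[2]) << 16
	v |= uint32(src[3]) << 24
	v ^= u
	dst[0] = byte(v)
	dst[1] = byte(v >> 8)
	dst[2] = byte(v >> 16)
	dst[3] = byte(v >> 24)
}

// xorPerByte is example (2): byte-wise load-XOR-store.
func xorPerByte(dst, src []byte, x uint32) {
	_ = dst[3]
	_ = src[3]
	dst[0] = src[0] ^ byte(x)
	dst[1] = src[1] ^ byte(x>>8)
	dst[2] = src[2] ^ byte(x>>16)
	dst[3] = src[3] ^ byte(x>>24)
}

// bench runs f over a buffer in 4-byte steps and returns the timing result.
func bench(f func(dst, src []byte, x uint32)) testing.BenchmarkResult {
	src := make([]byte, 4096) // arbitrary buffer size
	dst := make([]byte, 4096)
	return testing.Benchmark(func(b *testing.B) {
		b.SetBytes(int64(len(src)))
		for i := 0; i < b.N; i++ {
			for j := 0; j+4 <= len(src); j += 4 {
				f(dst[j:], src[j:], uint32(i))
			}
		}
	})
}

func main() {
	fmt.Println("expanded:", bench(xorExpanded))
	fmt.Println("per-byte:", bench(xorPerByte))
}
```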