New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: optimize unaligned load-XOR-store on byte slices #25111

Open
mundaym opened this Issue Apr 26, 2018 · 3 comments

Comments

Projects
None yet
2 participants
@mundaym
Member

mundaym commented Apr 26, 2018

It would be nice if the following sets of code were equivalent on platforms that support unaligned loads/stores (386, amd64, arm64, ppc64le, s390x...). I've used XOR in these examples but it is also true for the other logical operators:

(1)

binary.LittleEndian.PutUint32(dst, binary.LittleEndian.Uint32(src) ^ x)

(2)

dst[0] = src[0] ^ byte(x)
dst[1] = src[1] ^ byte(x>>8)
dst[2] = src[2] ^ byte(x>>16)
dst[3] = src[3] ^ byte(x>>24)

(3) [less important]

x ^= uint32(src[0])
x ^= uint32(src[1]) << 8
x ^= uint32(src[2]) << 16
x ^= uint32(src[3]) << 24
binary.LittleEndian.PutUint32(dst, x)

Currently (1) is optimal on platforms with unaligned loads and (2) is optimal on other platforms. It would be nice if the compiler could optimize (2) into (1). I've added (3) as an additional case where the current rules are suboptimal.

If this is ever done it will help simplify the generic golang.org/x/crypto/internal/chacha20 implementation.

@mundaym mundaym added the Performance label Apr 26, 2018

@mundaym mundaym added this to the Unplanned milestone Apr 26, 2018

@bcmills

This comment has been minimized.

Member

bcmills commented Apr 27, 2018

I think I'm missing something: why is (1) not optimal on other platforms?

@mundaym

This comment has been minimized.

Member

mundaym commented Apr 27, 2018

Example (1) is equivalent to the following code which contains an extra 3 shifts:

v := uint32(src[0])
v |= uint32(src[1]) << 8
v |= uint32(src[2]) << 16
v |= uint32(src[3]) << 24
v ^= u
dst[0] = byte(v)
dst[1] = byte(v>>8)
dst[2] = byte(v>>16)
dst[3] = byte(v>>24)

On the other hand this doesn't actually result in many more instructions on arm because of the shifted register inputs. The assembly on mips benefits a bit more from (2) though. I don't know if there is a speed difference.

@bcmills

This comment has been minimized.

Member

bcmills commented Apr 27, 2018

It seems like (1) is the code we should expect people to write, and it may be a bit easier to recognize and rewrite than (2). (It's usually easier to distribute mathematical operations than to un-distribute them, especially binary operations that don't carry.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment