Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: generate wider loads when feasible #23907

Closed
klauspost opened this issue Feb 18, 2018 · 2 comments

Comments

Projects
None yet
3 participants
@klauspost
Copy link
Contributor

commented Feb 18, 2018

What version of Go are you using (go version)?

go version go1.10 windows/amd64

Does this issue reproduce with the latest release?

Yes. Tip not tested.

What did you do?

When working with file compression, a typical time-critical part is reading multiple bytes from a bit stream.

A fairly typical function looks like this:

// Uint32 will read a little endian uint32 from current offset.
func (b byteReader) Uint32() uint32 {
	b2 := b.b[b.off : b.off+4 : b.off+4]
	v3 := uint32(b2[3])
	v2 := uint32(b2[2])
	v1 := uint32(b2[1])
	v0 := uint32(b2[0])
	return (v3 << 24) | (v2 << 16) | (v1 << 8) | v0
}

This has a single bounds check, but has 4 loads to get each byte, before shifting and or'ing them into place. On my test setup this operates at 1.49 ns/op.

It seems like there is some SSA logic in place that can combine two adjacent byte loads (with shifts) into a single word load. This can be seen changing the last line to ((v3 << 24) | (v2 << 16)) | ((v1 << 8) | v0). On my machine, this improves time to 1.24 ns/op.

It is possible to also load the high word using a single word, bringing this to 0.85 ns/op, which it the best I can get with the current compiler. The assembler shows two 16 bit loads.

However, it should be possible for the compiler to automatically reduce the original test case to a single 32 bit load on little endian systems.

You can download the example file and benchmark file here.

What did you expect to see?

The compiler combining this to a single load on little endian platforms.

What did you see instead?

4 single byte loads.

@ALTree

This comment has been minimized.

Copy link
Member

commented Feb 18, 2018

You can change the last line of Uint32Typical to

return v0 | (v1 << 8) | (v2 << 16) | (v3 << 24)

i.e. reverse the order of the or operands, and a single load will be generated. So the compiler does recognize that pattern, but the mechanism is not very robust and you need to specify the or operators in ascending order.

@ALTree ALTree changed the title cmd/compile: Generate wider loads when feasible cmd/compile: generate wider loads when feasible Feb 18, 2018

@ALTree ALTree added the Performance label Feb 18, 2018

@klauspost

This comment has been minimized.

Copy link
Contributor Author

commented Feb 18, 2018

You are completely correct. Awesome! This seems like a small issue, then. I will close for now.

@klauspost klauspost closed this Feb 18, 2018

@golang golang locked and limited conversation to collaborators Feb 18, 2019

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
You can’t perform that action at this time.