Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bytes: trivial to beat ContainsAny in easy cases #31321

Open
mvdan opened this issue Apr 7, 2019 · 5 comments

Comments

Projects
None yet
4 participants
@mvdan
Copy link
Member

commented Apr 7, 2019

See this CL: https://go-review.googlesource.com/c/go/+/155965

In short, code like bytes.ContainsAny(data, " \t") is easy to read and write, but a simple manual implementation as follows is considerably faster:

any := false
for _, b := range data {
	if b == ' ' || b == '\t' {
		any = true
		break
	}
}

ContainsAny uses IndexAny, which already has three paths; it does nothing for empty chars, it builds an ASCII set if chars is all ASCII and over eight bytes, and otherwise it does a naive double-loop search.

I'm not sure what the right solution here would be. I think adding more special cases to IndexAny would probably be overkill, and even slow down the other cases because of the added code. Making the compiler treat ContainsAny or IndexAny in a special way is also undesirable.

I guess if the compiler was super smart in the future, inlined all the code and realised that our chars are just " \t", it would simply inline that last naive search code. But I imagine the current compiler is far away from that possibility.

This is an unfortunate outcome, because the bytes and strings packages are very intuitive and often the fastest. I'd like to continue relying on them, without having to write my own loops by hand when they start showing up on CPU profiles.

/cc @josharian @randall77 @martisch

@mvdan mvdan added the Performance label Apr 7, 2019

@seebs

This comment has been minimized.

Copy link
Contributor

commented Apr 7, 2019

Just intuitively, I'd have thought that a [256]bool would be faster than doing bit ops on a [32]uint8, but possibly not enough faster to matter. That doesn't really address the generalizing problem.

@mvdan

This comment has been minimized.

Copy link
Member Author

commented Apr 7, 2019

Yes, it's about 20% faster. See https://go-review.googlesource.com/c/go/+/156139, which I filed around the same time.

With @josharian we briefly discussed that the bit-test version could be made faster by using BT instructions instead, but I don't think anyone has made significant progress there.

In any case, I don't think that relates to this example; two characters is well below the treshold for the bitset, at the moment.

@bcmills

This comment has been minimized.

Copy link
Member

commented Apr 8, 2019

I guess if the compiler was super smart in the future, inlined all the code and realised that our chars are just " \t", it would simply inline that last naive search code.

Yes, better inlining seems like the appropriate fix here. I wonder: could we hand-optimize ContainsAny by outlining its cases into separate functions? If ContainsAny is small enough, then mid-stack inlining should inline it and prune away the cases that aren't relevant, and at that point the function for the individual case should also be small enough to inline.

@mvdan

This comment has been minimized.

Copy link
Member Author

commented Apr 8, 2019

Maybe, but we'd probably have to tell the compiler to try harder with ContainsAny, as this kind of optimization would likely require a combination of double inlining plus dead code elimination.

Perhaps #21536 could fix this then, if we marked these bytes/strings functions as //go:hotfunction.

@josharian

This comment has been minimized.

Copy link
Contributor

commented Apr 8, 2019

the bit-test version could be made faster by using BT instructions instead, but I don't think anyone has made significant progress there.

https://go-review.googlesource.com/c/go/+/167091 was my failed attempt at this. While working on it, I learned that BTQload can read a distant bit in memory, which is perfect for this use case. If anyone wants to try to revive that CL and find a way to introduce BTQloads in this scenario, that'd be great.

@bcmills bcmills added this to the Unplanned milestone Apr 10, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.