-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bytes: IndexByte performs poorly on short lookups #21759
Comments
\cc @martisch |
Interesting.
Can it be because of this that when the data length is small, a loop code generates faster assembly than the hand-written one ? Would be interesting to look into the assembly of the generated loop code and compare them. |
I do not see how a generic implementation of bytes.IndexByte can be made that will be (near) optimal on any length and distribution of match position if these characteristics are not known at compile time. A solution that is good on a variety of lengths at runtime is complex and therefore will not be inlined to not bloat the binaries and fill the caches. Thereby there is a tradeoff for always having call overhead and overhead to decide which code path to use based e.g. on data length or supported instructions but thereby making it not optimal for short lengths/early positions that require little work. The range loop in the example avoids that since it is known by profiling the application that the problem is mostly composed of small data length and early matches. The solution of the inlined range loop is simple enough that I think it does not warrant to add an e.g. IndexByteShort to the package. However writing a fast IndexByte for long strings is comparatively more complex. It could be that the current version can be tuned a bit for small data sizes. For that we need to benchmark what possible room for optimization there is between an indexByte using a range loop that is not inlined and that checks for e.g. AVX2 and data length. @sirkon what is the platform and cpu used for your benchmarks/production? |
|
Some more info
I.e. with position 5 current implementation is fast enough and it is faster at the length of 6. I just thought if there would be a cheap way to check if the first 4 bytes has a character needed, then some sort of loop would find it, otherwise the generic algorithm would be used. |
Inlining a short lookup is not cost free as the binary size would increase which makes for more icache pressure and it slows down for positions further away. Also in the current state of the compiler (no mid stack inlining) it complicates the compiler and makes the bytes package function special since the optimization with intrinsics would not recognize when the code is copied from the bytes package to somewhere else. I think that when it is known that the position is usually found early the pre check by range loop optimization is good and easy to make explicitly in the code when profiling determines it matters at that call site. If there are performance improvements left to make small lookups a bit faster within the code or in the generated instructions inside the index function that would be great to find out. |
When I remove the inlining the difference goes away:
So I think we're mostly measuring call overhead here. I don't see any real opportunity for making the assembly better (but please, prove me wrong if you see something). |
What version of Go are you using (
go version
)?1.9
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (
go env
)?What did you do?
Benchmarking
bytes.IndexByte
vs loop character lookupWhat did you expect to see?
BenchmarkIndexByte
should be faster or at least close toBenchmarkLoop
for any delimiter position.What did you see instead?
But it can be several times slower when the delimeter is close to the data start (parameter
position
):This matters because we are using IndexByte for log parsing via code generator and a good chunk of data pieces we are looking for are short.
It would be great if
bytes.IndexByte
would be faster at shorter lookups.The text was updated successfully, but these errors were encountered: