Perf(zstd): Improve 'matchLen' performance by vector instructions. #823
Conversation
And I will write the code to make it work fine on amd64 CPUs that do not support AVX2.
I have experimented with dedicated assembly matching before, but never really found any convincing+consistent improvement.
These are rather unexpected and really shouldn't be affected by this at all.
Random10MBEncoderFastest-64 1.392Gi ± 6% 2.017Gi ± 3% +44.93% (p=0.002 n=6)
RandomEncoderDefault-64 1.035Gi ± 7% 1.683Gi ± 28% +62.69% (p=0.002 n=6)
I will benchmark and do a few tests.
Label("ret")
{
	Store(ret, ReturnIndex(0))
VZEROUPPER missing.
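As a minimal sketch of the fix (a hypothetical standalone avo generator; the function name, register layout and the elided compare loop are assumptions, not the PR's actual code):

```go
//go:build ignore

package main

import (
	. "github.com/mmcloughlin/avo/build"
)

func main() {
	TEXT("matchLenAVX2", NOSPLIT, "func(a, b []byte) int")
	Pragma("noescape")

	ret := GP64()
	XORQ(ret, ret)

	// ... the AVX2 compare loop would go here, writing YMM registers
	// and accumulating the matched length in ret ...

	Label("ret")
	// Clear the upper YMM state before returning; otherwise later SSE
	// code pays a heavy AVX/SSE transition penalty.
	VZEROUPPER()
	Store(ret, ReturnIndex(0))
	RET()

	Generate()
}
```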
TZCNTQ(equalMaskBits, matchedLen)
Comment("if matched len > remaining len, just add remaining on ret")
CMPQ(remainLen, matchedLen)
CMOVQLT(remainLen, matchedLen)
This should be fairly predictable, so I suspect a branch will be faster.
If you keep it, check CMOV - not entirely sure this is always present on AMD64.
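A hedged fragment of the branch-based alternative (register names taken from the diff above; the label name is made up), which clamps matchedLen to remainLen without CMOV:

```go
// Clamp matchedLen to the remaining length with a compare-and-branch
// instead of CMOVQLT. If the branch is as predictable as suggested,
// this should be at least as fast and avoids relying on CMOV.
CMPQ(remainLen, matchedLen)
JGE(LabelRef("matched_len_ok")) // remainLen >= matchedLen: keep matchedLen
MOVQ(remainLen, matchedLen)     // otherwise only the remaining bytes count
Label("matched_len_ok")
```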
matchedLen := GP64()
NOTQ(equalMaskBits)
Comment("store first not equal position into matchedLen")
TZCNTQ(equalMaskBits, matchedLen)
Check for BMI
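TZCNT belongs to the BMI1 extension, so the generated code needs a feature gate. A hypothetical sketch using golang.org/x/sys/cpu (the gate variable name is made up; the repository may wire this up differently):

```go
//go:build amd64

package main

import (
	"fmt"

	"golang.org/x/sys/cpu"
)

// The 32-byte loads need AVX2 and TZCNT needs BMI1, so both must be
// present before selecting the vectorized matchLen; otherwise fall back
// to the portable Go implementation.
var useVectorMatchLen = cpu.X86.HasAVX2 && cpu.X86.HasBMI1

func main() {
	fmt.Println("vector matchLen usable on this CPU:", useVectorMatchLen)
}
```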
adata := YMM()
bdata := YMM()
equalMaskBytes := YMM()
VMOVDQU(Mem{Base: aptr}, adata)
As far as I can tell you are over-reading. This is not feasible. You cannot read longer than provided slices.
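For contrast, a bounds-safe scalar sketch (not the repository's actual matchLen, but similar in spirit to common scalar implementations): it only issues 8-byte loads while both slices have at least 8 bytes left, so it never reads past the provided slices.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
)

// matchLen returns the number of leading bytes that a and b have in common,
// without ever reading beyond either slice.
func matchLen(a, b []byte) (n int) {
	for len(a) >= 8 && len(b) >= 8 {
		diff := binary.LittleEndian.Uint64(a) ^ binary.LittleEndian.Uint64(b)
		if diff != 0 {
			// Index of the first differing bit, divided by 8, is the
			// number of matching bytes in this word.
			return n + bits.TrailingZeros64(diff)>>3
		}
		n += 8
		a, b = a[8:], b[8:]
	}
	// Byte-by-byte tail, still strictly within both slices.
	for i := range a {
		if i >= len(b) || a[i] != b[i] {
			break
		}
		n++
	}
	return n
}

func main() {
	fmt.Println(matchLen([]byte("hello, world"), []byte("hello, there"))) // 7
}
```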
Current code is a big speed regression. Testing on CSV input: Before/after (3 runs):
Level 2 (default)
add Pragma("noescape") Co-authored-by: Klaus Post <klauspost@gmail.com>
Adding VZEROUPPER fixes that:
So if you can fix the issues, this could be feasible.
Tested on a few more bodies. Seems to be 5-10% faster.
Is there code for the above benchmark tests in this repo? I did not find it.
No, I have a separate tool that does that. Adding it would add a whole bunch of dependencies, which I am not interested in (and it is a bit messy). If it is of any use, here it is: https://gist.github.com/klauspost/de4b4d42fe3248fa99fd3f661badffe2
I copied over the matchlen assembly from S2, and it is faster than the AVX2-based version - and it doesn't use anything beyond standard amd64:
Copied from the S2 implementation. 5-10% faster. Replaces #823
@zzzzwc I am not sure how you got the equations, but they show the problem fine. You show a crossover at 320 bytes. That is incredibly rare; most matches are much less than 100 bytes.

This means the penalty for a 4-8 (and possibly up to 16) byte match is much higher. On top of that, when loading 32 bytes you are much more likely to cross a 64-byte cache line, which incurs a minor load penalty. I would expect branch prediction to be better, since it is very likely that only one loop will be called, but that is not enough to give an overall improvement. I think the crossover is lower than what your graph indicates.

But you are optimizing the wrong thing. We don't want long matches to be faster. We want short matches to be faster, since they are the main bulk, and fast matches are not a problem for overall performance.

I tried adding it to the S2 assembly in #825 with an appropriate fallback - and again, it is not faster than the BSF code, so I left it disabled. I left the code in, so you can see how it is integrated. It is important that you test on real data.
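To see where a crossover like that actually sits, a hypothetical micro-benchmark sketch (a _test.go file, assuming a matchLen such as the scalar sketch earlier in the thread lives in the same package; real-data testing remains the deciding factor):

```go
package main

import (
	"bytes"
	"fmt"
	"testing"
)

var sink int // prevents the compiler from eliding the call

// BenchmarkMatchLen probes short match lengths, which dominate real data,
// plus a couple of long ones around the claimed 320-byte crossover.
func BenchmarkMatchLen(b *testing.B) {
	for _, n := range []int{4, 8, 16, 32, 64, 320} {
		a := bytes.Repeat([]byte{'x'}, n+1)
		c := bytes.Repeat([]byte{'x'}, n+1)
		c[n] = 'y' // the first n bytes match, byte n differs
		b.Run(fmt.Sprintf("len=%d", n), func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				sink = matchLen(a, c)
			}
		})
	}
}
```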
When we use zstd to compress CSV text, we found that the matchLen function takes a lot of time, so we tried using vector instructions to speed it up.
The following is the benchmark command and the comparison with the old version:
The comparison shows that when we compress large content (where the matchLen function can match long lengths), it is much faster than before.