Optimize decompress generic #645
The current decompress_generic code suffers from many blocked loads, originally mentioned in issue #411. This patchset attempts to fix that issue. Similar issues were mentioned in pull #470, issue #126, and probably others.
The heart of the issue is that LZ4_wildCopy copies 8 bytes at a time, so a load frequently reads bytes that a nearby store has only just written, blocking the load. The farther apart we can move these loads and stores by extending the copy width, the faster things seem to go. Increasing the copy width to as wide as 32 bytes results in an overall decompression speed improvement of around ~10% on a Xeon and ~20% on a modern AMD.
The primary code challenge then is that we have only 8 bytes of WILDCOPYLENGTH, when we'd really like far more. So we end up duplicating most of the decode loop to add a fastpath for a larger copy length, and then the last handful of bytes can use the original loop.
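To make the shape of the change concrete, here is a minimal sketch of the widened copy; the name LZ4_wildCopy32 and the exact details are illustrative rather than a quote of the PR:

```c
#include <string.h>

/* Copy from src to dst in 32-byte stripes until dstEnd is reached.
 * Like LZ4_wildCopy, it may write up to 31 bytes past dstEnd, so the
 * caller must guarantee that much slack in the output buffer. */
static void LZ4_wildCopy32(void* dst, const void* src, void* dstEnd)
{
    unsigned char* d = (unsigned char*)dst;
    const unsigned char* s = (const unsigned char*)src;
    unsigned char* const e = (unsigned char*)dstEnd;

    do {
        memcpy(d, s, 32);   /* compilers lower this to wide vector moves */
        d += 32;
        s += 32;
    } while (d < e);
}
```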
I've split up the code into what I hope are readable small diffs. Testing has mostly consisted of 'make test' and an attempt at fuzzing with afl. It has not been tested on other arches, but it would be fairly easy to #ifdef the first loop away and match the previous behavior if it shows a regression.
Benchmarks on an Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz, from lzbench. The final percentage is how much faster this patchset is than the current dev branch:
An all-zeroes test is particularly ugly for the current code:
I also tested on an AMD 2400G:
The same issue likely exists in zstd (where it is likely easier to fix, since parsing isn't mixed up with sequence execution).
I confirm big decompression speed gains in benchmark measurements on
However, in all tests, compression speed was completely stable.
Thanks for this patch @djwatson , it's in pretty good shape!
Two remaining topics to get to the bottom of it:
Without the last diff optimizing small offsets, so only increasing the copy size to 32 bytes:
The whole series:
So both stripe copy size and small offset optimization seem to be helping. Stripe copy size helps more when there are large matches (xml), or large literal runs (sao).
The only ARM I own is a Raspberry Pi 2 (ARMv7), which looks like a small win:
I borrowed a switch that has an ARMv8 Cortex-A72. It also looks like a small win (not nearly as drastic as x64 though):
Unfortunately I don't have any phones I can test on at the moment.
As mentioned in #411, ld_blocks.store_forward shows the general location of the issue. cycles:pp is helpful when optimizing. But even just counting how many times we need to read extra bytes for match lengths and literal lengths was useful. I also dumped histograms of the offset sizes for the small offset optimization - offset 1 and offset 2 are by far the most common for offsets < 8, and the most important to optimize.
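For illustration, here is a hedged sketch of what the small-offset handling can look like for those two dominant cases; the helper name and exact shape are mine, not necessarily the PR's memcpy_using_offset:

```c
#include <stdint.h>
#include <string.h>

/* Expand an overlapping match with a small offset. Offset 1 is a run of a
 * single byte; offset 2 is a repeating 2-byte pattern. Broadcasting the
 * pattern once and storing it with full-width writes avoids the short
 * overlapping loads that block on prior stores. May write up to 7 bytes
 * past op+length, so the caller must provide slack. */
static void copy_small_offset(unsigned char* op, const unsigned char* match,
                              size_t length, size_t offset)
{
    if (offset == 1) {
        memset(op, match[0], length);                        /* byte run */
    } else if (offset == 2) {
        uint16_t pattern;
        memcpy(&pattern, match, 2);
        uint64_t v = pattern * UINT64_C(0x0001000100010001); /* 4x broadcast */
        for (size_t i = 0; i < length; i += 8)
            memcpy(op + i, &v, 8);
    } else {
        for (size_t i = 0; i < length; i++)                  /* generic overlap copy */
            op[i] = match[i];
    }
}
```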
Here's a general perf stat dump of silesia.tar:
Thanks for all these interesting data points @djwatson .
One last question:
Is that intentional?
From a high level, it seems your optimizations, which target long copies and overlapping copies, feel complementary with the shortcut, which specifically avoids these cases.
Yes, this was intentional. As far as I can tell, the previous shortcut did three things:
See "decompress_generic: re-add fastpath "
And actually I believe this is an improvement on the previous shortcut, because even if the match length overflows, we still only check the output size once.
We did however remove the shortcut for the last 32 bytes.
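For context, here is a hedged sketch of the general shape of such a shortcut; the names, the 16-byte width, and the single bounds check are illustrative, so see the commit above for the real logic:

```c
#include <stddef.h>
#include <string.h>

#define RUN_MASK 15   /* the token's literal-length field saturates at 15 */

/* Try the literal-copy shortcut for one sequence. oendFast marks the point
 * past which speculative wide stores are unsafe; because it is checked once
 * here, a match length that later overflows into extra length bytes needs
 * no second output-size check. Returns the literal length copied, or 0 to
 * fall back to the careful path (a zero-length run harmlessly falls back). */
static size_t literals_shortcut(unsigned char* op, const unsigned char* ip,
                                unsigned token, const unsigned char* oendFast)
{
    size_t const ll = token >> 4;
    if (ll == RUN_MASK || op + 16 > oendFast)
        return 0;            /* long run, or too close to the end */
    memcpy(op, ip, 16);      /* one wide store; over-copies up to 15 bytes */
    return ll;
}
```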
I just made a test of the proposed PR on an ARM platform.
Unfortunately, results are not so good on this platform:
Difference is significant, and pretty consistent across multiple benchmark runs.
Possible reasons (hand-wavy):
There is a bit more to this story.
In contrast, doing
The second version saves one load instruction per sequence,
The shortcut is expected to be useful "as often as possible".
With modern CPUs, the cost of a mispredicted branch is much higher than the cost of an instruction.
Unfortunately I don't have any Qualcomm ARM chips; I'll have to borrow one to dig into it.
Alternatively, we could #ifdef this to only x86, but it looked like there were some nice wins on larger ARM chips, so it would be nice to fix it.
It's even slightly more complicated: the first version (8+8+2) potentially triggers the load-blocked-by-store condition, while doing it by-16 doesn't seem to. However, we don't do anything special to avoid this in memcpy_using_offset for offsets 8-15 (currently only 1, 2, and 4 are handled). We could, but for the shortcut it's probably too small to worry about anyway.
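To make the two shapes concrete, a hedged illustration (function names are mine; 18 bytes matches the shortcut's match-copy width, and both assume offset >= 18 so source and destination don't overlap):

```c
#include <string.h>

/* 8 + 8 + 2: three load/store pairs. With a small offset, a load can read
 * bytes that a preceding store just wrote, hitting the load-blocked-by-store
 * stall discussed above. */
static void copy18_split(unsigned char* op, const unsigned char* match)
{
    memcpy(op,      match,      8);
    memcpy(op + 8,  match + 8,  8);
    memcpy(op + 16, match + 16, 2);
}

/* 16 + 2: one load fewer per sequence, and the by-16 form was not observed
 * to trigger the blocked-load counters. */
static void copy18_wide(unsigned char* op, const unsigned char* match)
{
    memcpy(op,      match,      16);
    memcpy(op + 16, match + 16, 2);
}
```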
The overall benchmark numbers look neutral-ish to me, but indeed the total branches and branch misses drop a bit. I can add the above patch. Thanks!
Note that the offset is chosen by the compressor, so a favorDecSpeed compressor could potentially choose offsets >= 32 to avoid the load-blocked-by-store issue; it doesn't look like we do this currently.
I noticed a few changes in the PR, so decided to give it another benchmark go.
I confirm the great speed up on x64. Tested on
However, the situation on Qualcomm's
I guess it's not possible to correctly observe/debug the situation from x64.
I can lend you a mobile
Worst case would be to only keep the new decoder for
Worth noting:
So, two ARM architectures, two different outcomes. That makes things a bit more complex to interpret...
Edit: for information, I restarted the benchmark on the previous phone and once again got the same performance results. I also tried switching between power mode and saving mode; the difference remains, with approximately the same magnitude.
OK, I'm back onto
One major point of this last sprint is checking this optimization path on ARM, and more specifically for
First results, on
Tests were run multiple times with consistent results. That's clearly a large regression.
normal (gated off):
Now, onto investigating more finely...
Seeing as the performance is fine on gcc and terrible on clang, I would bet it's a compiler problem: something like it now makes a bad inlining decision, or fails to constant-fold one of the "template parameters".
If you can narrow down what part of the patch is causing the regression, I bet it can be improved by rewriting the same or similar logic in a slightly different way.
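For reference, a hedged sketch of the "template parameter" idiom in question (simplified; lz4.c's real force-inline macro and directive enums are more involved):

```c
/* The generic decoder is force-inlined into thin wrappers that pass
 * compile-time constants, letting the compiler fold the per-variant
 * branches away. If a compiler declines to inline it, those parameters
 * become runtime values and every variant branch survives in the hot loop. */
#if defined(__GNUC__) || defined(__clang__)
#  define FORCE_INLINE static inline __attribute__((always_inline))
#else
#  define FORCE_INLINE static inline
#endif

typedef enum { decode_full_block = 0, partial_decode = 1 } earlyEnd_directive;

FORCE_INLINE int decompress_sketch(int srcSize, earlyEnd_directive partial)
{
    if (partial == partial_decode) {
        /* variant-specific work, folded out when `partial` is constant */
    }
    return srcSize;  /* placeholder body */
}

int decompress_full_sketch(int srcSize)
{
    /* constant argument: the branch above resolves at compile time */
    return decompress_sketch(srcSize, decode_full_block);
}
```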
It seems we missed that one.
That cannot work with
The stream property guarantees that the last 5 bytes are literals.
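The specifics of this issue are elided above, but as a general, hedged sketch of the kind of guard the over-copying fast loop needs (the constant's name and value are illustrative):

```c
#include <stddef.h>

#define FASTLOOP_SAFE_DISTANCE 64  /* slack the speculative wide copies may use */

/* Nonzero if the fast loop may process `length` more output bytes without
 * its wide stores running past the end of the buffer; otherwise control
 * must fall back to the byte-exact loop, since only the last few bytes of
 * a block are guaranteed to be literals. */
static int fastloop_has_room(const unsigned char* op,
                             const unsigned char* oend, size_t length)
{
    return (size_t)(oend - op) >= length + FASTLOOP_SAFE_DISTANCE;
}
```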
This issue triggered
In hindsight, a larger problem is that we were missing a test able to catch this issue.
It also means
Another problem is that
However, the test only checks that
Fuzzer tests have been updated to detect such issues in the future.
The fix is simple: test that