Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleanup of ZopfliFindLongestMatch -- gives ~8% overall performance boost #110

Open
wants to merge 3 commits into
base: master
from

Conversation

@jthlim
Copy link
Contributor

jthlim commented Apr 30, 2016

Check jthlim@b5a1ea2 for a detailed discussion of each change next to the associated code.

Please help verify that all of the comments/statements are valid from your perspective too!

Note: the 8% improvement is compounded on top of the optimizations discussed before (having asserts off, integer cost comparison, -ffast-math, link time optimization, etc.)

Note: Testing compressing hundreds of files has verified so far that these changes to not affect the compression.

@jthlim

This comment has been minimized.

Copy link
Owner Author

jthlim commented on src/zopfli/lz77.c in b5a1ea2 Apr 30, 2016

unsigned short calculations often require the compiler to insert extra bit masking operations.
eg.
unsigned short i, j, k;
...
k = i + j

Would be roughly something like: (psuedo assembly, assuming k is used in a register later that doesn't need to be stored)

fetch i
fetch j
add
and 0xffff // <--- this is the unnecessary bit masking
...use result somewhere that isn't a store to memory

Changing these variable to int reduces the x64 output from clang for this function by several hundred bytes and also allows it to do some nice vectorizing with the sublen update loop.

@jthlim

This comment has been minimized.

Copy link
Owner Author

jthlim commented on src/zopfli/lz77.c in b5a1ea2 Apr 30, 2016

same0 hoisted from inner loops

@jthlim

This comment has been minimized.

Copy link
Owner Author

jthlim commented on src/zopfli/lz77.c in b5a1ea2 Apr 30, 2016

Just a small tidy up here -- combine the limit update and the limit check code

@jthlim

This comment has been minimized.

Copy link
Owner Author

jthlim commented on src/zopfli/lz77.c in b5a1ea2 Apr 30, 2016

Since I've eliminated the hpos variable, update the assert

@jthlim

This comment has been minimized.

Copy link
Owner Author

jthlim commented on src/zopfli/lz77.c in b5a1ea2 Apr 30, 2016

currentLength can be pushed one level down in scope (now line 484)

@jthlim

This comment has been minimized.

Copy link
Owner Author

jthlim commented on src/zopfli/lz77.c in b5a1ea2 Apr 30, 2016

pos + bestlength can never be >= size. So remove the check.
This is because of checks at line 444-448 and line 507 (new code line numbering)

@jthlim

This comment has been minimized.

Copy link
Owner Author

jthlim commented on src/zopfli/lz77.c in b5a1ea2 Apr 30, 2016

Promote variables to unsigned int instead of unsigned short

@jthlim

This comment has been minimized.

Copy link
Owner Author

jthlim commented on src/zopfli/lz77.c in b5a1ea2 Apr 30, 2016

Instead of doing:

currentLength = 0;
if(...) {
  currentLength = ...;
}
if(currentLength > bestLength) ...

do:

if(...) {
  currentLength = ...;
  if(currentLength > bestlength) ...
}

This prevents unnecessary checks when the if statement isn't true

@jthlim

This comment has been minimized.

Copy link
Owner Author

jthlim commented on src/zopfli/lz77.c in b5a1ea2 Apr 30, 2016

Checking h->val2 == h->hashval2[p] first seemed to give me better performance results.

edit: This change was reverted in the subsequent commit, due to feedback from MrKrzych00

@jthlim

This comment has been minimized.

Copy link
Owner Author

jthlim commented on src/zopfli/lz77.c in b5a1ea2 Apr 30, 2016

Direct simplification of expression that produces much cleaner generated code. This change alone was responsible for ~3% improvement of total execution time.

@jthlim

This comment has been minimized.

Copy link
Owner Author

jthlim commented on src/zopfli/lz77.c in b5a1ea2 Apr 30, 2016

This has been hoisted to outside of the loop.

@MrKrzYch00

This comment has been minimized.

Copy link

MrKrzYch00 commented on b5a1ea2 Apr 30, 2016

Performance test on Odroid U3:

before -> after

real    7m5.593s      ->    real    7m12.828s
user    19m11.065s    ->    user    19m29.945s
sys     0m0.180s      ->    sys     0m0.140s

6 blocks, thread/block, 4 cores Samsung Exynos 4412 O/C 1.92Ghz. Test done on text sourcecode file.

gcc 4.8 with LTO + PGO + NDEBUG

On Odroid U3 it actually slow downs Zopfli, this device is pretty much headless and I nice -19 it for tests so the usual delta is +/- 1s. I even re-tested the change which still resulted in slower execution (7m12.207s).

I will try to split this change to parts and find the bottleneck.

This comment has been minimized.

Copy link

MrKrzYch00 replied Apr 30, 2016

Updated summary on Odroid U3:
Change on line 511-512 caused speed regression of up to 7 seconds. (maybe it depends on data in input file, if it contains a lot of repeated patterns)
Promoting shorts to ints gave no improvements or added small overhead (up to2 seconds - maybe the most optimal would be to promote all variables in zopfli to size_t / ptrdiff_t, when it's known that overflow is not a desired effect on limited data size computations, to proper utilize most optimal registers for given architecture, eliminating padding and avoiding type conversions; at the cost of higher memory usage).
Other changes provided very little improvement of ~1s which is pretty much in the range of margin of error.

I guess GCC or setup I use with it provides enough optimizations to give none to very little gain. Usually I couldn't achieve as good results with Clang as I did with GCC last time I compared both of them...

@jthlim

This comment has been minimized.

Copy link
Contributor Author

jthlim commented Apr 30, 2016

Thanks for the feedback! would be really interesting to see GCC's generated code before and after for that function for ARM. I can't think why there'd be a regression.

Do you see similar results on x86/x64?

I assume this testing is with your modified fork rather than the verbatim zopfli codebase?

@jthlim

This comment has been minimized.

Copy link
Contributor Author

jthlim commented May 1, 2016

I've reverted the order of comparisons change due to the regression that you've noticed.

re: size/space trade off.. A lot of those variables end up in registers in clang, so there's no extra memory tradeoff at all. I guess that'll be platform specific too, but I expect to see that in most cases.

And perhaps gcc is just better at removing the unnecessary masking when appropriate than clang is.

@jthlim jthlim force-pushed the jthlim:master branch from 80cacaa to 8ffda7e May 1, 2016
@lorents17

This comment has been minimized.

Copy link

lorents17 commented May 1, 2016

enwik8 --i15

Zopfli (original) Zopfli (jthlim)
34 967 036 byte 34 967 390 byte
6m19.762s 6m1.886s

ECT

Compressor File Size Time
ECT -2 35 357 940 byte 0m24.922s
ECT -3 35 018 152 byte 0m30.350s
ECT -4 34 967 607 byte 0m37.274s
ECT -5 34 963 304 byte 0m58.680s
ECT -6 34 942 395 byte 1m19.929s
ECT -7 34 939 105 byte 2m1.326s
ECT -8 34 938 139 byte 5m59.643s
@jthlim

This comment has been minimized.

Copy link
Contributor Author

jthlim commented May 2, 2016

Thanks @lorents17 for the info! What compiler/platform were you testing with?

I see that you've measured a ~5% speedup.. but the difference in file size is interesting to me.

Did your original zopfli results include the changes from 37f6da6 -- specifically katajainen.c?

Otherwise I need to re-examing the changes to see why there could be a change.

Thanks!

@lorents17

This comment has been minimized.

Copy link

lorents17 commented May 2, 2016

OS - Windows 10
compiler - GCC 5.3

I used original zopfli with the latest changes from Apr 25, 2016

Interests me more as the ECT project could squeeze so quickly?

@jthlim

This comment has been minimized.

Copy link
Contributor Author

jthlim commented May 2, 2016

I've tried before and after my changes on OSX/Apple LLVM v 7.3 and I'm getting a consistent 34967390 bytes in each case. I'll try and setup a system with gcc later to test further.

as to the performance:
A quick look at https://github.com/fhanau/Efficient-Compression-Tool/blob/master/src/zopfli/zopfli_gzip.cpp#L259 seems to imply that there are modifications there to add multithreading support (this fork of zopfli is single threaded). At a guess, it looks like he's basing it on https://github.com/MrKrzYch00/zopfli

@fhanau

This comment has been minimized.

Copy link
Contributor

fhanau commented May 2, 2016

ECT (which I am the author of) is much faster than zopfli at similar compression, even without multithreading. Multithreading is off by default as it causes compression losses and needs more processing power per byte.
It achives its speeed improvements by rewriting large parts and replacing slow algorithms(like the current hash-chain based ZopfliFindLongestMatch) with faster ones(like the Binary tree based new one).
The only part of ECT that is based on MrKrzYch00/zopfli is the ZIP support.

@jthlim

This comment has been minimized.

Copy link
Contributor Author

jthlim commented May 2, 2016

@fhanau Thanks for the info and clarifications! Are there any downsides to importing your binary tree based code into the mainline?

@fhanau

This comment has been minimized.

Copy link
Contributor

fhanau commented May 2, 2016

Unfortunately, yes:

  • The binary tree match finder has a very low cost for finding matches as doing so is almost the same as updating the hash. However, updating the hash is very expensive which is why BTMF is only faster when using optimal parsing. I use a match finder based on LZ4 when lazy parsing is used.
  • BTMF is not compatible w/ the current matches cache.
  • As BTMF needs much time to update the hash and this is done at the start of each block after the first block to be able to search at posititions in the previous blocks, BTMF needs much time before compression even begins. This can be solved with copying the match finder struct of the previous block.

But you're right, the changes from ECT should be backported.

@MrKrzYch00

This comment has been minimized.

Copy link
Contributor

MrKrzYch00 commented May 13, 2016

Most of changes don't make any difference on ARMv7 when GCC is used, there is no way for me to efficiently run x86 tests as I run Fedora in VirtualBox (though it now supports AVX instructions set and stuff but the CPU is not "as free" as on Odroid U3). Of course ARM has a lot of registers so that may be the reason. Only the loop part change did a very little speed up, any variable type or order changing (here as well as with previous changes) did not make any difference or were actually slower. However, the differences are so small they could as well fall into a margin of error difference. As far as I'm concerned, GCC 4.8 is faster than GCC 5.3 which is faster than Clang 3.6. Or so it did score on ARMv7, would need to recompare it for original Zopfli again to confirm that Clang is still slower.

Even with my MrKrzYch00@7aa51e1 I don't see any difference in speed. :)

ATM the only thing I'm concerned about in case of Zopfli is why do I get:

*** Error in `zopfli/zopfli': double free or corruption (!prev): 0xb4550038 ***
Aborted

which only occurs on my Odroid U3 (both of them) when freeing heap (randomly) on long runs and (not 100% sure) when lazy matching is used... [can't figure it out so far, with all the debugging included] pointer structure members on not properly aligned structures. That was also the reason I invented restore points to resume compression and by this could confirm that the code should be is right and the Odroid U3 (outdated) kernel or its libs are causing this (or some alignment or small out-of-bounds going on somewhere)... GCC switches need to be altered (either 32-bit aligning or enable unaligned access) which I'm currently testing out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.