q9 significantly faster than q6 for specific input #632

Closed

fasterthanlime opened this issue Jan 9, 2018 · 4 comments

fasterthanlime commented Jan 9, 2018

I've been in touch with Jyrki Alakuijala on Twitter after noticing that compressing with q9 was faster than with q6, and they mentioned that q9 being faster was "unusual".

The behavior was observed deep inside a Go program using (non-official) cgo bindings, so I've been trying to reproduce it with vanilla brotli, and I was finally able to!


Exhibit

[screenshot: brotli-q6-vs-q9]

This is not a "benchmark done right", but I've been playing with various input files over dozens of runs for the last 48 hours, and I've noticed this pattern quite a bit.

How I built brotli

I followed the instructions in README.md. This was run on windows-amd64, compiled with CMake:

$ mkdir out && cd out
$ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=./installed ..
$ cmake --build . --config Release --target install

CMake auto-detected and used Visual Studio 14 2015 to compile. Note that when I originally observed the performance difference, brotli's sources were built with gcc 7.2.0 (built by the MSYS2 project) via cgo.

The version of brotli I used was da254cf (HEAD of master at the time of this writing).

The data

Here's a none.br.zip file - a 2.3MB zip archive (zipped due to GitHub issue attachment limitations) containing a brotli-q9-compressed version of my sample input.

📦 none.br.zip

The uncompressed input (none) is a 2.2GB file. Its SHA-256 checksum is:

352994da7a1b3f10e665e7eb4f477e9ebeaf5088742e4fddfacd7d4901d51df5
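
After unzipping and decompressing, a standard tool like sha256sum (which MSYS2 also ships on Windows) should reproduce that checksum:

$ sha256sum none
352994da7a1b3f10e665e7eb4f477e9ebeaf5088742e4fddfacd7d4901d51df5  none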

It compresses particularly well (brotli in particular performs splendidly) because it's a patch file inspired by the original BSDiff paper.

In particular, the file:

  • Starts with a 4-byte magic number
  • Is a stream of protobuf messages, each prefixed by its length (dead-simple framing; see the reader sketch after this list)
  • After a few header messages, most of the messages are bsdiff instructions (see the protobuf schema)
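
To make the framing concrete, here's a minimal reader sketch in Go. The exact length encoding is an assumption on my part - I'm guessing an unsigned varint prefix, which is common in the protobuf ecosystem; the real schema may differ:

package main

import (
	"bufio"
	"encoding/binary"
	"fmt"
	"io"
	"os"
)

// readPatch walks the patch layout described above: a 4-byte magic
// number, then a stream of length-prefixed protobuf messages.
func readPatch(r io.Reader) error {
	br := bufio.NewReader(r)

	// 4-byte magic number at the start of the file.
	var magic [4]byte
	if _, err := io.ReadFull(br, magic[:]); err != nil {
		return fmt.Errorf("reading magic: %w", err)
	}

	for {
		// Assumed: each message is prefixed by its length as a uvarint.
		size, err := binary.ReadUvarint(br)
		if err == io.EOF {
			return nil // clean end of stream
		}
		if err != nil {
			return err
		}
		msg := make([]byte, size)
		if _, err := io.ReadFull(br, msg); err != nil {
			return err
		}
		// msg now holds one serialized message (a header, then mostly
		// bsdiff instructions) ready for proto.Unmarshal.
		_ = msg
	}
}

func main() {
	f, err := os.Open("none")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if err := readPatch(f); err != nil {
		panic(err)
	}
}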

It's supposed to compress well because bsdiff is basically:

  • Find slices from the "old" file and the "new" file that are similar
  • Compute byte-wise differences between these slices
  • The patch instruction records the slice start in old and new, the length, and the byte-wise difference - a large binary blob which, if everything goes well, should be mostly zero, with the remaining differences forming nice repeating patterns (think 00000029000000000000290000000000). This is called diff data (see the toy sketch after this list)
  • If there's no slice similar enough, the new data is included as-is in the patch - this is called extra data
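
As a toy illustration of why diff data is so compressible, here's the byte-wise difference step in Go (not the real wharf/bsdiff code - slice matching is elided and the two slices are assumed to be equal length):

package main

import "fmt"

// byteDiff computes the byte-wise difference between two equally long
// slices: wherever old and new agree, the output byte is zero, so
// similar slices yield mostly-zero blobs with repetitive structure.
func byteDiff(oldSlice, newSlice []byte) []byte {
	diff := make([]byte, len(newSlice))
	for i := range newSlice {
		diff[i] = newSlice[i] - oldSlice[i] // wraps mod 256, like bsdiff
	}
	return diff
}

func main() {
	oldSlice := []byte{10, 20, 30, 40}
	newSlice := []byte{10, 20, 71, 40}                 // one byte changed
	fmt.Printf("% x\n", byteDiff(oldSlice, newSlice)) // 00 00 29 00
}

Applying a patch inverts this: new[i] = old[i] + diff[i].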

Other differences from the original bsdiff:

  • vanilla bsdiff uses bzip2 for its patch format
  • vanilla bsdiff patches are divided into three sections:
    • all instructions (without data)
    • all diff data
    • all extra data
  • ...whereas our version stores diff data and extra data directly after the relevant instruction - and limits an instruction's slice size. This lets us do streaming diff & apply with lower memory/disk requirements, at a slight compression penalty; I have a tweet thread about this for the curious.

Hypotheses

Although I've played with compression formats a lot, I cannot claim to understand brotli's inner workings, but I've noticed the following:

  • In some parts of the code, LGBLOCK is adjusted differently for q>=9. I saw this here first, but I haven't checked whether it also appears somewhere else. Maybe q>=9 gets a smaller LGBLOCK, resulting in faster compression for this particular data?
  • There's a speed up RLE-ish data compression commit (and this input data might be RLE-ish, depending on what exactly counts as RLE-ish), but afaict I was seeing the performance difference before that commit.

Conclusion

At this point I'm tempted to use q=9 rather than q=6-8 (our diff process is much slower than brotli compression anyway, and we're running it on a machine with many cores, so it's not a bottleneck), but I figured this would be interesting to the brotli developers, and I might learn a thing or two in the process :)

fasterthanlime (Author) commented

Additional measurements

As part of my investigation, I'm comparing compressed size and compression speed of zstd and brotli, at various quality levels.

I'm posting a few of the results here, in CSV format. The relative size column is the size of the compressed file relative to the uncompressed file. The compression speed column is in MiB/s (that's uncompressed MiB per second, not compressed).
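
Concretely, the two columns boil down to the following (a hypothetical helper I'm sketching here, not the actual benchmark harness; the numbers in main are made up):

package main

import (
	"fmt"
	"time"
)

// metrics returns the two CSV columns: compressed size relative to
// uncompressed size, and uncompressed throughput in MiB/s.
func metrics(uncompressed, compressed int64, elapsed time.Duration) (relSize, mibPerSec float64) {
	relSize = float64(compressed) / float64(uncompressed)
	mibPerSec = float64(uncompressed) / (1 << 20) / elapsed.Seconds()
	return
}

func main() {
	rel, speed := metrics(2200<<20, 2421<<10, 19*time.Second)
	fmt.Printf("%.6f,%.6f\n", rel, speed)
}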

Patch that compresses very well (manifold)

For the 2.2GB none file mentioned in the original post, here are the results:

algorithm,relative size,compression speed
ZSTD-q1,0.001963,1394.319830
ZSTD-q3,0.001930,1375.773376
ZSTD-q6,0.001833,697.047500
ZSTD-q9,0.001795,618.516803
BROTLI-q1,0.002585,850.446428
BROTLI-q3,0.001426,220.726677
BROTLI-q6,0.001101,69.193146
BROTLI-q9,0.001040,113.393817

Patch that doesn't compress well at all (perfect-heist)

For comparison, here's another patch mostly made up of extra data (almost none of the old data is reusable - and what's more, the new data has high entropy afaict; it's probably already compressed).

Uncompressed data size is around 260MiB.

algorithm,relative size,compression speed
ZSTD-q1,0.832471,237.475492
ZSTD-q3,0.832103,218.793391
ZSTD-q6,0.831460,78.934550
ZSTD-q9,0.831450,39.802124
BROTLI-q1,0.829575,202.485644
BROTLI-q3,0.828759,144.709333
BROTLI-q6,0.828862,30.029050
BROTLI-q9,0.827752,5.227860

As you can see, the compression speeds (30MiB/s for q6, 5MiB/s for q9) are a lot more in line with what one would expect.

Another patch that compresses well (fate)

This patch is 1.6GiB uncompressed, and it also compresses quite well:

algorithm,relative size,compression speed
ZSTD-q1,0.008512,1192.887042
ZSTD-q3,0.008309,1087.122294
ZSTD-q6,0.007984,517.924763
ZSTD-q9,0.007912,420.128866
BROTLI-q1,0.009386,756.245175
BROTLI-q3,0.007675,213.797475
BROTLI-q6,0.006624,66.828365
BROTLI-q9,0.006396,73.503025

Again we see q9 being faster (and compressing better).

eustas (Collaborator) commented Jan 11, 2018

Thanks for the report, the investigation, and the sample file.
Going to analyze it soon.

eustas (Collaborator) commented Feb 27, 2018

Hello.

I'm terribly sorry - I haven't had time to analyze the root of the problem, so I will keep this issue open.

On the bright side, the last update has improved the situation. Here are the results of my measurements, comparing brotli before and after the latest commit.

command               old time  old size   new time  new size
brotli -6fkw 24 none  0:35.5    2'550'550  0:11.9    2'435'211
brotli -9fkw 24 none  0:39.7    2'405'886  0:28.6    2'384'885
brotli -Zfkw 24 none  4:30.0    2'215'919  5:24.7    2'187'911

The speed imbalance is fixed and the compression ratio is improved =)

mraszyk commented Sep 26, 2018

As pointed out by @fasterthanlime, LGBLOCK is adjusted differently for quality >= 9 - but in fact it gets a larger value, which results in faster and denser compression for this particular data.

LGBLOCK is the length of the so-called input blocks (typically 64K to 256K), and for each input block at least one command used to be produced (as of commit da254cf). As a consequence, a long chunk of zeros spanning k full input blocks resulted in a sequence of k equal commands (insert-length = 0, copy-length = 2^LGBLOCK, copy-distance = 1), although a single command would actually suffice (and the mandatory extra bits prevent Huffman coding from mitigating the issue). This is why compression gets denser with longer input blocks on files with long chunks of zeros. In terms of compression time, with longer input blocks the hasher (which is invoked O(1) times per input block full of zeros) is invoked fewer times in total, which results in faster compression.
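
Back-of-the-envelope, in Go (my own illustration - the LGBLOCK values are just plausible examples from the 64K-256K range, not taken from the brotli source):

package main

import "fmt"

func main() {
	const zeroRun int64 = 1 << 31 // a 2GiB run of zeros, roughly the scale of "none"

	// Before commit 35e69fc: one command per input block full of zeros
	// (insert-length = 0, copy-length = 2^LGBLOCK, copy-distance = 1).
	for _, lgblock := range []uint{16, 18} {
		commands := zeroRun >> lgblock
		fmt.Printf("LGBLOCK=%d: ~%d identical commands for the run\n", lgblock, commands)
	}
	// After the fix, the last command is extended across block
	// boundaries, so a single command can cover the entire run.
}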

The described inefficiency has been fixed by introducing the method ExtendLastCommand (in commit 35e69fc), which checks whether the last command of the previous input block can be extended into the current input block (before invoking the hasher). This results in faster and denser compression.

Finally, there is still some room for improving compression time at qualities <= 9 by optimizing away some calls to the hasher that cannot further improve the HasherSearchResult found so far.
