q9 significantly faster than q6 for specific input #632

Closed

fasterthanlime opened this issue Jan 9, 2018 · 4 comments

fasterthanlime commented Jan 9, 2018

I've been in touch with Jyrki Alakuijala on Twitter after noticing that compressing with q9 was faster than with q6, and they mentioned that q9 being faster was "unusual".

The behavior was observed deep inside a Go program using (non-official) cgo bindings, so I've been trying to reproduce it with vanilla brotli, and I was finally able to!


Exhibit

[screenshot: brotli-q6-vs-q9]

This is not a "benchmark done right", but I've been playing with various input files over dozens of runs for the last 48 hours, and I've noticed this pattern quite a bit.

How I built brotli

I followed the instructions in README.md. This was run on windows-amd64, compiled with CMake:

$ mkdir out && cd out
$ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=./installed ..
$ cmake --build . --config Release --target install

CMake auto-detected and used Visual Studio 14 2015 to compile. Note that when I originally observed the performance difference, brotli's sources were built with gcc 7.2.0 (built by the MSYS2 project) via cgo.

The version of brotli I used was da254cf (HEAD of master at the time of this writing).

The data

Here's a none.br.zip file - a 2.3MB zip archive (zipped due to GitHub issue attachment limitations) containing a brotli-q9-compressed version of my sample input.

📦 none.br.zip

The uncompressed input (none) is a 2.2GB file. Its SHA-256 checksum is:

352994da7a1b3f10e665e7eb4f477e9ebeaf5088742e4fddfacd7d4901d51df5
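
After unzipping and decompressing, a standard tool like sha256sum (which MSYS2 also ships on Windows) should reproduce that checksum:

$ sha256sum none
352994da7a1b3f10e665e7eb4f477e9ebeaf5088742e4fddfacd7d4901d51df5  none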

It compresses particularly well (brotli in particular performs splendidly) because it's a patch file inspired by the original BSDiff paper.

In particular, the file:

  • Starts with a 4-byte magic number
  • Is a stream of protobuf messages, each prefixed by its length (dead-simple framing; see the reader sketch after this list)
  • After a few header messages, most of the messages are bsdiff instructions (see the protobuf schema)
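
To make the framing concrete, here's a minimal reader sketch in Go. The exact length encoding is an assumption on my part - I'm guessing an unsigned varint prefix, which is common in the protobuf ecosystem; the real schema may differ:

package main

import (
	"bufio"
	"encoding/binary"
	"fmt"
	"io"
	"os"
)

// readPatch walks the patch layout described above: a 4-byte magic
// number, then a stream of length-prefixed protobuf messages.
func readPatch(r io.Reader) error {
	br := bufio.NewReader(r)

	// 4-byte magic number at the start of the file.
	var magic [4]byte
	if _, err := io.ReadFull(br, magic[:]); err != nil {
		return fmt.Errorf("reading magic: %w", err)
	}

	for {
		// Assumed: each message is prefixed by its length as a uvarint.
		size, err := binary.ReadUvarint(br)
		if err == io.EOF {
			return nil // clean end of stream
		}
		if err != nil {
			return err
		}
		msg := make([]byte, size)
		if _, err := io.ReadFull(br, msg); err != nil {
			return err
		}
		// msg now holds one serialized message (a header, then mostly
		// bsdiff instructions) ready for proto.Unmarshal.
		_ = msg
	}
}

func main() {
	f, err := os.Open("none")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if err := readPatch(f); err != nil {
		panic(err)
	}
}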

It's supposed to compress well because bsdiff is basically:

  • Find slices from the "old" file and the "new" file that are similar
  • Compute byte-wise differences between these slices
  • The patch instruction records the slice start in old and new, the length, and the byte-wise difference - a large binary blob which, if everything goes well, should be mostly zero, with the remaining differences forming nice repeating patterns (think 00000029000000000000290000000000). This is called diff data (see the toy sketch after this list)
  • If there's no slice similar enough, the new data is included as-is in the patch - this is called extra data
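
As a toy illustration of why diff data is so compressible, here's the byte-wise difference step in Go (not the real wharf/bsdiff code - slice matching is elided and the two slices are assumed to be equal length):

package main

import "fmt"

// byteDiff computes the byte-wise difference between two equally long
// slices: wherever old and new agree, the output byte is zero, so
// similar slices yield mostly-zero blobs with repetitive structure.
func byteDiff(oldSlice, newSlice []byte) []byte {
	diff := make([]byte, len(newSlice))
	for i := range newSlice {
		diff[i] = newSlice[i] - oldSlice[i] // wraps mod 256, like bsdiff
	}
	return diff
}

func main() {
	oldSlice := []byte{10, 20, 30, 40}
	newSlice := []byte{10, 20, 71, 40}                 // one byte changed
	fmt.Printf("% x\n", byteDiff(oldSlice, newSlice)) // 00 00 29 00
}

Applying a patch inverts this: new[i] = old[i] + diff[i].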

Other differences from the original bsdiff:

  • vanilla bsdiff uses bzip2 for its patch format
  • vanilla bsdiff patches are divided into three sections:
    • all instructions (without data)
    • all diff data
    • all extra data
  • ...whereas our version stores diff data and extra data directly after the relevant instruction - and limits an instruction's slice size. This lets us do streaming diff & apply with lower memory/disk requirements, at a slight compression penalty; I have a tweet thread about this for the curious.

Hypotheses

Although I've played with compression formats a lot, I cannot claim to understand brotli's inner workings, but I've noticed the following:

  • In some parts of the code, LGBLOCK is adjusted differently for q>=9. I saw this here first, but I haven't checked whether it also appears somewhere else. Maybe q>=9 gets a smaller LGBLOCK, resulting in faster compression for this particular data?
  • There's a speed up RLE-ish data compression commit (and this input data might be RLE-ish, depending on what exactly counts as RLE-ish), but afaict I was seeing the performance difference before that commit.

Conclusion

At this point I'm tempted to use q=9 rather than q=6-8 (our diff process is much slower than brotli compression anyway, and we're running it on a machine with many cores, so it's not a bottleneck), but I figured this would be interesting to the brotli developers, and I might learn a thing or two in the process :)

fasterthanlime (Author) commented

Additional measurements

As part of my investigation, I'm comparing compressed size and compression speed of zstd and brotli, at various quality levels.

I'm posting a few of the results here, in CSV format. The relative size column is the size of the compressed file relative to the uncompressed file. The compression speed column is in MiB/s (that's uncompressed MiB per second, not compressed).
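
Concretely, the two columns boil down to the following (a hypothetical helper I'm sketching here, not the actual benchmark harness; the numbers in main are made up):

package main

import (
	"fmt"
	"time"
)

// metrics returns the two CSV columns: compressed size relative to
// uncompressed size, and uncompressed throughput in MiB/s.
func metrics(uncompressed, compressed int64, elapsed time.Duration) (relSize, mibPerSec float64) {
	relSize = float64(compressed) / float64(uncompressed)
	mibPerSec = float64(uncompressed) / (1 << 20) / elapsed.Seconds()
	return
}

func main() {
	rel, speed := metrics(2200<<20, 2421<<10, 19*time.Second)
	fmt.Printf("%.6f,%.6f\n", rel, speed)
}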

Patch that compresses very well (manifold)

For the 2.2GB none file mentioned in the original post, here are the results:

algorithm,relative size,compression speed
ZSTD-q1,0.001963,1394.319830
ZSTD-q3,0.001930,1375.773376
ZSTD-q6,0.001833,697.047500
ZSTD-q9,0.001795,618.516803
BROTLI-q1,0.002585,850.446428
BROTLI-q3,0.001426,220.726677
BROTLI-q6,0.001101,69.193146
BROTLI-q9,0.001040,113.393817

Patch that doesn't compress well at all (perfect-heist)

For comparison, here's another patch mostly made up of extra data (almost none of the old data is reusable - and what's more, the new data has high entropy afaict; it's probably already compressed).

Uncompressed data size is around 260MiB.

algorithm,relative size,compression speed
ZSTD-q1,0.832471,237.475492
ZSTD-q3,0.832103,218.793391
ZSTD-q6,0.831460,78.934550
ZSTD-q9,0.831450,39.802124
BROTLI-q1,0.829575,202.485644
BROTLI-q3,0.828759,144.709333
BROTLI-q6,0.828862,30.029050
BROTLI-q9,0.827752,5.227860

As you can see, the compression speeds (30MiB/s for q6, 5MiB/s for q9) are a lot more in line with what one would expect.

Another patch that compresses well (fate)

This patch is 1.6GiB uncompressed, and it also compresses quite well:

algorithm,relative size,compression speed
ZSTD-q1,0.008512,1192.887042
ZSTD-q3,0.008309,1087.122294
ZSTD-q6,0.007984,517.924763
ZSTD-q9,0.007912,420.128866
BROTLI-q1,0.009386,756.245175
BROTLI-q3,0.007675,213.797475
BROTLI-q6,0.006624,66.828365
BROTLI-q9,0.006396,73.503025

Again we see q9 being faster (and compressing better).

eustas (Collaborator) commented Jan 11, 2018

Thanks for the report, the investigation, and the sample file.
Going to analyze it soon.

eustas (Collaborator) commented Feb 27, 2018

Hello.

I'm terribly sorry - I haven't had time to analyze the root of the problem, so I will keep this issue open.

On the bright side, the last update has improved the situation. Here are the results of my measurements, comparing brotli before and after the latest commit.

command               old time  old size   new time  new size
brotli -6fkw 24 none  0:35.5    2'550'550  0:11.9    2'435'211
brotli -9fkw 24 none  0:39.7    2'405'886  0:28.6    2'384'885
brotli -Zfkw 24 none  4:30.0    2'215'919  5:24.7    2'187'911

The speed imbalance is fixed and the compression ratio is improved =)

mraszyk commented Sep 26, 2018

As pointed out by @fasterthanlime, LGBLOCK is adjusted differently for quality >= 9 - but in fact it gets a larger value, which results in faster and denser compression for this particular data.

LGBLOCK is the length of the so-called input blocks (typically 64K to 256K), and for each input block at least one command used to be produced (as of commit da254cf). As a consequence, a long chunk of zeros spanning k full input blocks resulted in a sequence of k equal commands (insert-length = 0, copy-length = 2^LGBLOCK, copy-distance = 1), although a single command would actually suffice (and the mandatory extra bits prevent Huffman coding from mitigating the issue). This is why compression gets denser with longer input blocks on files with long chunks of zeros. In terms of compression time, with longer input blocks the hasher (which is invoked O(1) times per input block full of zeros) is invoked fewer times in total, which results in faster compression.
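
Back-of-the-envelope, in Go (my own illustration - the LGBLOCK values are just plausible examples from the 64K-256K range, not taken from the brotli source):

package main

import "fmt"

func main() {
	const zeroRun int64 = 1 << 31 // a 2GiB run of zeros, roughly the scale of "none"

	// Before commit 35e69fc: one command per input block full of zeros
	// (insert-length = 0, copy-length = 2^LGBLOCK, copy-distance = 1).
	for _, lgblock := range []uint{16, 18} {
		commands := zeroRun >> lgblock
		fmt.Printf("LGBLOCK=%d: ~%d identical commands for the run\n", lgblock, commands)
	}
	// After the fix, the last command is extended across block
	// boundaries, so a single command can cover the entire run.
}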

The described inefficiency has been fixed by introducing the method ExtendLastCommand (in commit 35e69fc), which checks whether the last command of the previous input block can be extended into the current input block (before invoking the hasher). This results in faster and denser compression.

Finally, there is still some room for improving compression time at qualities <= 9 by optimizing away some calls to the hasher that cannot further improve the HasherSearchResult found so far.
