Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

compress/flate: add 3-byte matching for higher compression levels #15622

Open
dsnet opened this issue May 10, 2016 · 5 comments
Open

compress/flate: add 3-byte matching for higher compression levels #15622

dsnet opened this issue May 10, 2016 · 5 comments
Milestone

Comments

@dsnet
Copy link
Member

@dsnet dsnet commented May 10, 2016

Using 9af83462

The recent changes to compress/flate has provided pretty good improvements overall to the compression speeds, but I have noticed a few extraordinary cases where the compression ratio suffers significantly.

This spreadsheet compares the output sizes from go1.6 to current go.tip on a number of compression benchmarks. All tests were done using the flate.DefaultCompression setting.

Most notably, the clegg.tiff image from the ACT corpus experiences a 44% increase in output size (511928 to 741046 bytes).

I think it would be worthwhile to understand what causes this fairly significant regression in compression ratio before the go1.7 release.

cc @nigeltao @klauspost

@dsnet dsnet added this to the Go1.7 milestone May 10, 2016
@dsnet

This comment has been minimized.

Copy link
Member Author

@dsnet dsnet commented May 10, 2016

I spent some time analyzing clegg.tiff, here's a histogram of the backwards references clustered by backwards distance and backwards copy length.

Lengths:           go1.6   go.tip
    1..1:          0       0
    2..2:          0       0
    3..3:          96798   0
    4..4:          19939   19534
    5..5:          3708    3683
    6..6:          19950   19349
    7..7:          4116    4691
    8..11:         2464    2557
    12..15:        874     873
    16..23:        157     162
    24..31:        17      18
    32..47:        5       6
    48..63:        44      44
    64..95:        46      46
    96..127:       57      57
    128..191:      7       5
    192..255:      313     315
    256..383:      4348    4348

Distances:
    1..1:          261     258
    2..2:          1       1
    3..3:          135100  38991
    4..4:          4       1
    5..5:          0       0
    6..6:          0       0
    7..7:          1       0
    8..11:         51      0
    12..15:        516     400
    16..23:        344     267
    24..31:        409     299
    32..47:        407     319
    48..63:        345     288
    64..95:        282     246
    96..127:       188     171
    128..191:      213     176
    192..255:      117     94
    256..383:      153     130
    384..511:      83      69
    512..767:      193     164
    768..1023:     186     160
    1024..1535:    373     305
    1536..2047:    348     301
    2048..3071:    7809    7502
    3072..4095:    168     126
    4096..6143:    70      70
    6144..8191:    1302    1267
    8192..12287:   858     873
    12288..16383:  498     516
    16384..24575:  1590    1674
    24576..32767:  973     1020

It seems that the ratio hit occurs because the new hash-chain algorithm only operates on 4-byte segments, rather than 3-byte segments. This tends to be pathological for bitmap-based images, which represent a pixel as a tuple of 3 bytes for (R, G, B). This explains why there is such a large number of {Length:3, Distance:3} references in go1.6 that cannot be generated in go.tip.

Given the dramatic increase in performance using 4-byte hashing, I support keeping 4-byte hashing even if it causes some inputs to suffer (especially RGB based images, and possibly other inputs).

@nigeltao, your thoughts?

@nigeltao

This comment has been minimized.

Copy link
Contributor

@nigeltao nigeltao commented May 12, 2016

Thanks for the analysis. It certainly sounds plausible that these are RGB 3-tuples in an uncompressed TIFF.

I agree that we don't need to revert 4-byte hashing yet.

For example, TIFF is not as common as it used to be, and more modern image formats use 2D-image-specific filtering before passing to a general purpose compressor like DEFLATE; those filters already take out any easy repeated-RGB-3-tuple wins. Using the standard library's image/png encoder yields 501111 bytes, smaller than both 511928 and 741046. WEBP Lossless (generated by the cwebp tool) is 379612 bytes.

@klauspost

This comment has been minimized.

Copy link
Contributor

@klauspost klauspost commented May 12, 2016

For "BestCompression" it could be feasible to have a version that could to 3-byte matches. The token writer could then be modified to choose between 3-length matches and 3-byte literals depending on the encoded size.

@dsnet

This comment has been minimized.

Copy link
Member Author

@dsnet dsnet commented May 12, 2016

Sounds good. Possibly adding 3-byte matching to BestCompression, might not be a bad idea. Since it seems we're in consensus about keeping the 4-byte matching, I'm going to relabel the issue to possibly adding 3-byte matching to high compression levels.

@dsnet dsnet modified the milestones: Unplanned, Go1.7 May 12, 2016
@dsnet dsnet changed the title compress/flate: compression ratio regression compress/flate: add 3-byte matching for higher compression levels May 12, 2016
@nigeltao

This comment has been minimized.

Copy link
Contributor

@nigeltao nigeltao commented May 19, 2016

I ran both the Go tip (i.e. Go 1.7) and Go 1.6 image/png encoders over a corpus of 20,000 or so PNG images, comparing the output size of a decode-and-then-re-encode. I don't think image/png changed much; it should all be due to compress/flate. Below are the summary statistics (grouped by image type) of the ratios of tip size over 1.6 size; higher means tip is worse, a value of 1.000 means no change.

Gray      n=1252   avg=0.984  p25=0.977  p50=0.987  p75=0.994  p90=0.998  max=1.031
Gray16    n=7      avg=0.998  p25=0.985  p50=1.004  p75=1.009  p90=1.015  max=1.015
NRGBA     n=5365   avg=1.013  p25=0.985  p50=1.007  p75=1.039  p90=1.051  max=1.427
NRGBA64   n=154    avg=1.000  p25=0.994  p50=1.000  p75=1.005  p90=1.013  max=1.243
Paletted  n=3181   avg=0.989  p25=0.984  p50=0.990  p75=0.994  p90=0.998  max=1.023
RGBA      n=11644  avg=1.022  p25=0.991  p50=1.006  p75=1.036  p90=1.082  max=1.696
RGBA64    n=89     avg=1.006  p25=0.996  p50=1.001  p75=1.010  p90=1.032  max=1.118
overall   n=21692  avg=1.013  p25=0.986  p50=0.999  p75=1.027  p90=1.058  max=1.696

pXX are the percentiles, e.g. p50 is the median ratio. It's not too surprising that *image.RGBA and *image.NRGBA Go image types (i.e. what PNG calls RGB and RGBA images) miss the 3-byte matches the most. Nonetheless, I think the increase in compressed size, in the order of 1-2% on average and less than 1% on median, is acceptable for Go 1.7. Sure, we can always find specific examples (such as the TIFF in the original post) that re-encode relatively badly, but I think we're OK in general, at least for PNG.

In Go 1.8, we can consider re-introducing 3-byte matches, and see if that will bring the outliers back into the fold.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
3 participants
You can’t perform that action at this time.