
Allow using zune-png for PNG decoding #1852

Closed
Shnatsel opened this issue Feb 1, 2023 · 25 comments

Comments

@Shnatsel
Contributor

Shnatsel commented Feb 1, 2023

zune-png is a very fast PNG implementation in Rust.

It's safe code except for the SIMD in filters, which is the same policy the image-rs crates use.

It's in the final stages of testing, with just one known issue remaining on a very obscure type of image, and it is being continuously fuzzed on CI. I gave it the full real-world testing treatment, which found a fair number of bugs, and I'm confident we'll squash the remaining ones soon.

Advantages:

  • Decoding go BRRRRR

Drawbacks:

  • No encoding support (only decoding is implemented so far) so the png crate will remain as a dependency
  • No streaming support, the input has to be fully loaded into memory
  • No APNG support (only the first frame is displayed)
  • No dedicated fast path for RLE images like the one the png crate got via fdeflate

See also: #1845 for supporting zune-jpeg

@fintelia
Contributor

fintelia commented Feb 2, 2023

We should try to understand how the performance compares between image-png and zune-png and track down the specific differences. There's been some recent work on making image-png considerably faster, which isn't yet reflected in the crates.io releases. During that process we've found that the vast majority of PNG encoding/decoding time is spent in compression and filtering. If there are optimizations from zune-png that we can copy over, that may be a less invasive way of achieving the same performance improvements.

@Shnatsel
Contributor Author

Shnatsel commented Feb 2, 2023

zune-png uses zune-inflate internally, which is a 100% safe Rust port of libdeflate. It outperforms miniz_oxide used in png and even zlib-ng. You could get the same performance improvements in decompression by switching to zune-inflate.

The trade-off is that libdeflate and zune-inflate do not support streaming decoding. This seems fine for APIs such as next_frame(). But if there's any streaming decoding support in the png crate, it would have to be disabled when using zune-inflate, or fall back to miniz_oxide just for those codepaths.

@HeroicKatora
Member

HeroicKatora commented Feb 2, 2023

It would be very surprising to me if streaming were the cause of the difference in performance. deflate should require no more than a 64kB buffer to make the available data virtually indistinguishable from full lookahead. With any larger buffer, I would expect the only necessary differences to be cache effects and the cost of storing temporary state between data passes (which is constant size: some buffered bits, a pointer to the position in the stream, and the active code table). I'm not aware of any form of cache, lookup table, or algorithmic improvement that works uniquely with infinite lookahead here. It's at least counterintuitive to point to streaming.

@Shnatsel
Contributor Author

Shnatsel commented Feb 2, 2023

The gist of the algorithm is that it assumes the output size is always sufficient to use large copies and other SIMD-like optimizations, without dropping down to the slow loop near the end of the buffer. It might be possible to create a streaming implementation of DEFLATE with similar performance characteristics, but AFAIK it hasn't been done so far.

There's a detailed write-up of how WUFFS achieves high performance, a lot of which boils down to "spend more time in fast loops and not near the end of the buffer" which sounds very similar to what libdeflate is doing, but I haven't compared these two in benchmarks.

@Shnatsel
Contributor Author

Shnatsel commented Feb 2, 2023

The relevant bit in the WUFFS write-up is the section "Running off a cliff", which neatly explains the difference, but the entire thing is worth a read.

zune-inflate uses the 8-byte-chunk copies for RLE as well, although the bounds checks in there eat up to 5% of end-to-end performance. Small further gains from removing bounds checks are probably possible, but my attempt regressed some benchmarks while improving others.
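As a hypothetical sketch of the "wide copy" trick (the real zune-inflate/libdeflate code works on raw pointers with reserved headroom): copy an LZ77 back-reference of `len` bytes starting `dist` bytes back in 8-byte chunks whenever the match cannot overlap a chunk, and fall back to a byte loop otherwise.

```rust
use std::convert::TryInto;

/// Copy an LZ77 back-reference: `len` bytes starting `dist` bytes
/// before the end of `out`, appended to `out`.
fn copy_match(out: &mut Vec<u8>, dist: usize, len: usize) {
    assert!(dist >= 1 && dist <= out.len());
    let start = out.len() - dist;
    if dist >= 8 {
        // Fast path: copy in 8-byte chunks, overshooting by up to 7
        // bytes past `len`, then trim the overshoot. The overshoot
        // bytes are themselves valid continuations of the match, so
        // later chunks may safely read them.
        let mut copied = 0;
        while copied < len {
            let chunk: [u8; 8] =
                out[start + copied..start + copied + 8].try_into().unwrap();
            out.extend_from_slice(&chunk);
            copied += 8;
        }
        out.truncate(start + dist + len);
    } else {
        // Slow path: short distances (e.g. RLE with dist == 1) would
        // overlap an 8-byte chunk, so copy byte by byte.
        for i in 0..len {
            let b = out[start + i];
            out.push(b);
        }
    }
}
```

The decoder never dips into a careful byte-at-a-time loop near the end of the output buffer, which is exactly the "running off a cliff" cost that streaming implementations pay.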

@Shnatsel
Contributor Author

Shnatsel commented Feb 2, 2023

I'd love to see the png crate adopt zune-inflate and use all-at-once decompression for performance.

I'm confident that zune-inflate is production ready already. It's passing regular fuzzing and roundtrip fuzzing with miniz_oxide and zlib_ng, all of which are running on CI.

zune-png also correctly decodes my entire PNG corpus, and is in the final stages of testing on 600,000 real-world images.

@fintelia
Contributor

fintelia commented Feb 2, 2023

One advantage of the streaming approach is that we don't have to do separate passes over the image for each step of decoding (file read, decompression, unfiltering, and de-interlacing). Only benchmarking can say which side of the trade-off is better.

@fintelia
Contributor

fintelia commented Feb 2, 2023

I've also been wanting to expand fdeflate's decompression to support arbitrary zlib streams. It has a bunch of fancy logic to decode multiple literals at the same time, and gets quite good performance on streams it can support.

@Shnatsel
Contributor Author

Shnatsel commented Feb 2, 2023

zune-png benchmarks show it beating png and even libspng, the C library that's faster than libpng.

Could you point me to the fancy logic to decode multiple literals? Now I'm curious!

@fintelia
Contributor

fintelia commented Feb 2, 2023

https://github.com/image-rs/fdeflate/blob/main/src/decompress.rs#L364

Each lookup into the decoding tables produces two data bytes from data_table and an advance_bytes value that encodes the number of input bits and output bytes to advance by. With 12-bit indexes, it is very common for multiple literals to be output from a single table lookup. As an added bonus, since the output buffer is required to be zero-initialized, we can support advancing by 3+ literals as long as all after the first two are zero.
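The multi-literal lookup idea can be sketched with a toy code (the table layout and code assignments here are hypothetical; fdeflate builds its tables from the stream's actual Huffman code). Assume an LSB-first code where literal b'a' is the 1-bit code `0` and b'b' is the 2-bit code `11`:

```rust
#[derive(Clone, Copy, Default)]
struct Entry {
    data: [u8; 2],    // up to two decoded literals
    input_bits: u8,   // input bits this entry consumes
    output_bytes: u8, // bytes of `data` that are valid
}

/// Precompute one entry per 12-bit window of the bitstream, decoding
/// greedily so a single lookup can yield two literals.
fn build_table() -> Vec<Entry> {
    let mut table = vec![Entry::default(); 1 << 12];
    for idx in 0u16..(1 << 12) {
        let mut e = Entry::default();
        let mut rest = idx;
        while e.output_bytes < 2 {
            if rest & 1 == 0 {
                e.data[e.output_bytes as usize] = b'a';
                e.input_bits += 1;
                rest >>= 1;
            } else if rest & 0b11 == 0b11 {
                e.data[e.output_bytes as usize] = b'b';
                e.input_bits += 2;
                rest >>= 2;
            } else {
                break; // some other symbol; a real decoder defers here
            }
            e.output_bytes += 1;
        }
        table[idx as usize] = e;
    }
    table
}

/// Decode from a 64-bit bit buffer while a full 12-bit lookahead is
/// available: each table hit emits up to two literals at once.
fn decode(table: &[Entry], mut bits: u64, mut nbits: u32, out: &mut Vec<u8>) {
    while nbits >= 12 {
        let e = table[(bits & 0xFFF) as usize];
        if e.output_bytes == 0 {
            break; // defer to a bit-by-bit slow path in a real decoder
        }
        out.extend_from_slice(&e.data[..e.output_bytes as usize]);
        bits >>= e.input_bits;
        nbits -= e.input_bits as u32;
    }
}
```

The payoff is that common short codes amortize one branch-heavy table lookup across several output bytes, instead of paying the lookup cost per literal.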

@Shnatsel
Contributor Author

Shnatsel commented Feb 3, 2023

image-rs/image-png#254 is also low-hanging fruit for png optimization

@etemesi254

etemesi254 commented Feb 3, 2023

The difference comes down to three things:

  1. Inflate: zune-inflate is faster than miniz_oxide by a good margin.
  2. Defiltering: the de-filtering code has SSE paths; no amount of coaxing the compiler will get it to generate good SSE code (I've tried).
  3. A lot of small perf optimizations.

I'll elaborate on point 3 down below.
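On the de-filtering point: the Paeth filter in particular resists autovectorization. Here is a minimal scalar sketch of the predictor from the PNG specification (an illustration, not zune-png's actual SSE code):

```rust
/// Paeth predictor from the PNG specification: pick whichever of
/// left (a), above (b), or upper-left (c) is closest to a + b - c.
/// The data-dependent branches below are what compilers struggle to
/// turn into good SIMD, hence the hand-written SSE paths.
fn paeth(a: u8, b: u8, c: u8) -> u8 {
    let p = a as i16 + b as i16 - c as i16;
    let pa = (p - a as i16).abs();
    let pb = (p - b as i16).abs();
    let pc = (p - c as i16).abs();
    if pa <= pb && pa <= pc {
        a
    } else if pb <= pc {
        b
    } else {
        c
    }
}
```

A defiltering pass runs this per byte with a serial dependency on the previous output byte, which compounds the autovectorization problem.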

@etemesi254

etemesi254 commented Feb 3, 2023

PS: allow me to do this incrementally; I do have other things that need my attention.

But here is one.

  1. tRNS expansion.

Your code

https://github.com/image-rs/image-png/blob/6862566bf42685c461b24ae2963e127b9de2df2a/src/utils.rs#L41-L88

Vs mine

https://github.com/etemesi254/zune-image/blob/9741797ad5f8278e202c15004a905b9a428fd968/zune-png/src/decoder.rs#L598-L668

Mine is longer because I chose an interesting formatting convention, but the gist is that yours does a bounds check per iteration and handles both Luma and RGB in one function, while mine separates them.

The resulting code difference can be seen at https://godbolt.org/z/e44M37xz8

Yours does it in place, while mine moves from the old buffer to a new buffer, so yours should be faster than mine.
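The shape of that split can be sketched as two specialized loops (a hypothetical illustration, not the actual zune-png code); each loop body is straight-line, so the compiler can hoist bounds checks and vectorize:

```rust
/// Expand grayscale to grayscale+alpha, mapping the tRNS gray value
/// to a transparent pixel.
fn expand_trns_gray(input: &[u8], trns_gray: u8) -> Vec<u8> {
    let mut out = Vec::with_capacity(input.len() * 2);
    for &g in input {
        let alpha = if g == trns_gray { 0 } else { 255 };
        out.extend_from_slice(&[g, alpha]);
    }
    out
}

/// Expand RGB to RGBA, mapping the tRNS color to a transparent pixel.
fn expand_trns_rgb(input: &[u8], trns_rgb: [u8; 3]) -> Vec<u8> {
    let mut out = Vec::with_capacity(input.len() / 3 * 4);
    for px in input.chunks_exact(3) {
        let alpha = if px == trns_rgb.as_slice() { 0 } else { 255 };
        out.extend_from_slice(&[px[0], px[1], px[2], alpha]);
    }
    out
}
```

Merging both formats into one generic function forces per-pixel branching on the format, which is the pattern being contrasted above.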

@Shnatsel
Contributor Author

Shnatsel commented Feb 3, 2023

I've also been wanting to expand fdeflate's decompression to support arbitrary zlib streams. It has a bunch of fancy logic to decode multiple literals at the same time

Curiously, this optimization was considered for zune-inflate but wasn't used. You can find a detailed write-up about the trade-offs involved here: etemesi254/zune-image#79 (comment)

@oyvindln
Contributor

oyvindln commented Feb 4, 2023

While they probably won't make it match zune, there were some recent performance improvements in miniz_oxide, so they should help a bit with the performance of image-png once it is updated to use miniz_oxide 0.7.1.

As for the compression side, I may misremember, but I think miniz_oxide via flate2 still ends up using a double buffer (i.e. outputting to one buffer and then copying from that to a different buffer), which could probably be avoided.

@Shnatsel
Contributor Author

Shnatsel commented Feb 4, 2023

@oyvindln I haven't seen a release announcement for miniz_oxide 0.7, would you like me to write one? Although the changelog is pretty sparse, you should probably update that first.

@fintelia
Contributor

fintelia commented Mar 1, 2023

I've now got an experimental branch of fdeflate with support for arbitrary zlib streams, and a draft PR out hooking it up to this crate: image-rs/image-png#384

@etemesi254

Is it possible for me to add it to the benchmark suite?

@fintelia
Contributor

fintelia commented Mar 2, 2023

Feel free to experiment with it. I made https://github.com/fintelia/image-png/tree/finflate-github with just the fdeflate changes to avoid you hitting any circular dependency issues for zune-png.

@etemesi254

Yea there is some good improvement there, congratulations.

file-type         png 0.17.7   png-fininflate
8bpp              295 ms       222 ms
8bpp interlaced   381 ms       315 ms

Very good indeed!

@HeroicKatora
Member

Just to clarify from my side: the competition in decoder performance is quite awesome to see, with small and large steps iteratively identifying the currently slowest steps and weak points. For me this pretty clearly shows that being able to experiment is highly useful. Regardless of individual performance results, there should be a mechanism to add a new alternative backend and choose between the backends for the sake of comparison. That also implies the optimal mechanism should not be a pure compile-time choice: features allow some dependency minimization, but they should not be mutually exclusive, as that just increases cycle time.

Compare this to the mechanism for choosing a SIMD flavor: in all use cases we iterated clearly towards runtime choices instead of a cfg choice, because Rust allows this and it just works.
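The runtime-dispatch pattern for SIMD flavors might look like the sketch below (function names are made up; the trivial "Up" defilter stands in for a real kernel). The feature check happens when the program runs rather than being fixed by a cfg at compile time, so one binary serves all CPUs:

```rust
/// Portable scalar "Up" defilter: each byte adds the byte above it.
fn defilter_up_scalar(prev: &[u8], cur: &mut [u8]) {
    for (c, &p) in cur.iter_mut().zip(prev) {
        *c = c.wrapping_add(p);
    }
}

/// Same body, but compiled with SSE4.1 enabled so LLVM may
/// autovectorize this function specifically.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.1")]
unsafe fn defilter_up_sse41(prev: &[u8], cur: &mut [u8]) {
    defilter_up_scalar(prev, cur)
}

/// Runtime dispatch: detect CPU features at call time. Real code
/// would cache the choice, e.g. in a function pointer.
fn defilter_up(prev: &[u8], cur: &mut [u8]) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse4.1") {
            // SAFETY: guarded by the runtime feature detection above.
            unsafe { defilter_up_sse41(prev, cur) };
            return;
        }
    }
    defilter_up_scalar(prev, cur)
}
```

Both the runtime branch and a feature-gated backend can coexist: the cfg only gates whether the fast path is compiled in, never which path a given CPU takes.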

@fintelia
Contributor

fintelia commented Sep 1, 2023

Update on this: many of the optimizations from zune-png have been incorporated into image-png, and the performance is now much closer*. So, while I don't want to completely rule it out, with the current trade-offs I don't think it is worth switching to zune-png for our PNG decoding.

*I'd caution against reading too much into a single number: both decoders have files they're comparatively faster at decoding and it is hard to gather a representative sample of how performance compares on average. But the linked benchmark does at least show that the perf gap has considerably narrowed.

@Shnatsel
Contributor Author

Shnatsel commented Jan 3, 2024

With the additional performance improvements done in image-rs/image-png#416, the png crate is now on par with zune-png or even slightly faster, and has other desirable properties such as the ultra-fast compression mode and a fast decoding path for images produced by it.

It seems switching to zune-png will not provide much benefit, so I am going to go ahead and close this.

Shnatsel closed this as completed on Jan 3, 2024
@etemesi254

Congratulations to all those involved 🎉🎉

I'm wondering whether the optimizations are live on crates.io, or whether there's a git branch with all of them; I'd love to run them all.

@Shnatsel
Contributor Author

Shnatsel commented Jan 4, 2024

I believe all non-API-breaking changes are now in master for the image-png crate.

There is one more PR outstanding that shows a performance improvement but requires an API change: image-rs/image-png#421
