
Allow using zune-png for PNG decoding #1852

Closed
Shnatsel opened this issue Feb 1, 2023 · 25 comments

Comments

@Shnatsel
Contributor

Shnatsel commented Feb 1, 2023

zune-png is a very fast PNG implementation in Rust.

It's safe code except for the SIMD in filters, which is the same policy the image-rs crates use.

It's in the final stages of testing, with just one known issue remaining on a very obscure type of image, and it is being continuously fuzzed on CI. I gave it the full real-world testing treatment, which found a fair number of bugs, and I'm confident we'll squash the remaining ones soon.

Advantages:

  • Decoding go BRRRRR

Drawbacks:

  • No encoding support (only decoding is implemented so far) so the png crate will remain as a dependency
  • No streaming support, the input has to be fully loaded into memory
  • No APNG support (only the first frame is displayed)
  • No dedicated fast path for RLE images like the one the png crate got via fdeflate

See also: #1845 for supporting zune-jpeg

@fintelia
Contributor

fintelia commented Feb 2, 2023

We should try to understand how the performance compares between image-png and zune-png and track down the specific differences. There's been some recent work on making image-png considerably faster, which isn't yet reflected in the crates.io releases. During that process we've found that the vast majority of PNG encoding/decoding time is spent in compression and filtering. If there are optimizations from zune-png that we can copy over, that may be a less invasive way of achieving the same performance improvements.

@Shnatsel
Contributor Author

Shnatsel commented Feb 2, 2023

zune-png uses zune-inflate internally, which is a 100% safe Rust port of libdeflate. It outperforms miniz_oxide used in png and even zlib-ng. You could get the same performance improvements in decompression by switching to zune-inflate.

The trade-off is that libdeflate and zune-inflate do not support streaming decoding. This seems fine for APIs such as next_frame(). But if there's any streaming decoding support in the png crate, it would have to be disabled when using zune-inflate, or fall back to miniz_oxide just for those codepaths.

@HeroicKatora
Member

HeroicKatora commented Feb 2, 2023

It would be very surprising to me if streaming were the cause of the difference in performance. deflate should require no more than a 64kB buffer to make the available data virtually indistinguishable from full lookahead. With any larger buffer, I would expect the only necessary differences to be cache effects and the cost of storing temporary state between data passes (which is constant size: some buffered bits, a pointer to the position in the stream, and the active code table). I'm not aware of any form of cache, lookup table, or algorithmic improvement that works uniquely with infinite lookahead here. It's at least counterintuitive to point to streaming.

@Shnatsel
Contributor Author

Shnatsel commented Feb 2, 2023

The gist of the algorithm is that it assumes the output size is always sufficient to use large copies and other SIMD-like optimizations, without dropping down to the slow loop near the end of the buffer. It might be possible to create a streaming implementation of DEFLATE with similar performance characteristics, but AFAIK it hasn't been done so far.

There's a detailed write-up of how WUFFS achieves high performance, a lot of which boils down to "spend more time in fast loops and not near the end of the buffer" which sounds very similar to what libdeflate is doing, but I haven't compared these two in benchmarks.

@Shnatsel
Contributor Author

Shnatsel commented Feb 2, 2023

The relevant bit in the WUFFS write-up is the section "Running off a cliff", which neatly explains the difference, but the entire thing is worth a read.

zune-inflate uses the 8-byte-chunk copies for RLE as well, although the bounds checks in there eat up to 5% of end-to-end performance. Small further gains from removing bounds checks are probably possible, but my attempt regressed some benchmarks while improving others.
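As a hypothetical sketch of the "wide copy" trick (the real zune-inflate/libdeflate code works on raw pointers with reserved headroom): copy an LZ77 back-reference of `len` bytes starting `dist` bytes back in 8-byte chunks whenever the match cannot overlap a chunk, and fall back to a byte loop otherwise.

```rust
use std::convert::TryInto;

/// Copy an LZ77 back-reference: `len` bytes starting `dist` bytes
/// before the end of `out`, appended to `out`.
fn copy_match(out: &mut Vec<u8>, dist: usize, len: usize) {
    assert!(dist >= 1 && dist <= out.len());
    let start = out.len() - dist;
    if dist >= 8 {
        // Fast path: copy in 8-byte chunks, overshooting by up to 7
        // bytes past `len`, then trim the overshoot. The overshoot
        // bytes are themselves valid continuations of the match, so
        // later chunks may safely read them.
        let mut copied = 0;
        while copied < len {
            let chunk: [u8; 8] =
                out[start + copied..start + copied + 8].try_into().unwrap();
            out.extend_from_slice(&chunk);
            copied += 8;
        }
        out.truncate(start + dist + len);
    } else {
        // Slow path: short distances (e.g. RLE with dist == 1) would
        // overlap an 8-byte chunk, so copy byte by byte.
        for i in 0..len {
            let b = out[start + i];
            out.push(b);
        }
    }
}
```

The decoder never dips into a careful byte-at-a-time loop near the end of the output buffer, which is exactly the "running off a cliff" cost that streaming implementations pay.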

@Shnatsel
Contributor Author

Shnatsel commented Feb 2, 2023

I'd love to see the png crate adopt zune-inflate and use all-at-once decompression for performance.

I'm confident that zune-inflate is production ready already. It's passing regular fuzzing and roundtrip fuzzing with miniz_oxide and zlib_ng, all of which are running on CI.

zune-png also correctly decodes my entire PNG corpus, and is in the final stages of testing on 600,000 real-world images.

@fintelia
Contributor

fintelia commented Feb 2, 2023

One advantage of the streaming approach is that we don't have to do separate passes over the image for each step of decoding (file read, decompression, unfiltering, and de-interlacing). Only benchmarking can say which side of the trade-off is better.

@fintelia
Contributor

fintelia commented Feb 2, 2023

I've also been wanting to expand fdeflate's decompression to support arbitrary zlib streams. It has a bunch of fancy logic to decode multiple literals at the same time, and gets quite good performance on streams it can support.

@Shnatsel
Contributor Author

Shnatsel commented Feb 2, 2023

zune-png benchmarks show it beating png and even libspng, the C library that's faster than libpng.

Could you point me to the fancy logic to decode multiple literals? Now I'm curious!

@fintelia
Contributor

fintelia commented Feb 2, 2023

https://github.com/image-rs/fdeflate/blob/main/src/decompress.rs#L364

Each lookup into the decoding tables produces two data bytes from data_table and an advance_bytes value that encodes the number of input bits and output bytes to advance by. With 12-bit indexes, it is very common for multiple literals to be output from a single table lookup. As an added bonus, since the output buffer is required to be zero-initialized, we can support advancing by 3+ literals as long as all after the first two are zero.
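The multi-literal lookup idea can be sketched with a toy code (the table layout and code assignments here are hypothetical; fdeflate builds its tables from the stream's actual Huffman code). Assume an LSB-first code where literal b'a' is the 1-bit code `0` and b'b' is the 2-bit code `11`:

```rust
#[derive(Clone, Copy, Default)]
struct Entry {
    data: [u8; 2],    // up to two decoded literals
    input_bits: u8,   // input bits this entry consumes
    output_bytes: u8, // bytes of `data` that are valid
}

/// Precompute one entry per 12-bit window of the bitstream, decoding
/// greedily so a single lookup can yield two literals.
fn build_table() -> Vec<Entry> {
    let mut table = vec![Entry::default(); 1 << 12];
    for idx in 0u16..(1 << 12) {
        let mut e = Entry::default();
        let mut rest = idx;
        while e.output_bytes < 2 {
            if rest & 1 == 0 {
                e.data[e.output_bytes as usize] = b'a';
                e.input_bits += 1;
                rest >>= 1;
            } else if rest & 0b11 == 0b11 {
                e.data[e.output_bytes as usize] = b'b';
                e.input_bits += 2;
                rest >>= 2;
            } else {
                break; // some other symbol; a real decoder defers here
            }
            e.output_bytes += 1;
        }
        table[idx as usize] = e;
    }
    table
}

/// Decode from a 64-bit bit buffer while a full 12-bit lookahead is
/// available: each table hit emits up to two literals at once.
fn decode(table: &[Entry], mut bits: u64, mut nbits: u32, out: &mut Vec<u8>) {
    while nbits >= 12 {
        let e = table[(bits & 0xFFF) as usize];
        if e.output_bytes == 0 {
            break; // defer to a bit-by-bit slow path in a real decoder
        }
        out.extend_from_slice(&e.data[..e.output_bytes as usize]);
        bits >>= e.input_bits;
        nbits -= e.input_bits as u32;
    }
}
```

The payoff is that common short codes amortize one branch-heavy table lookup across several output bytes, instead of paying the lookup cost per literal.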

@Shnatsel
Contributor Author

Shnatsel commented Feb 3, 2023

image-rs/image-png#254 is also low-hanging fruit for png optimization

@etemesi254

etemesi254 commented Feb 3, 2023

The difference comes down to three things:

  1. Inflate: zune-inflate is faster than miniz_oxide by a good margin.
  2. Defiltering: the de-filtering code has SSE paths; no amount of coaxing the compiler will get it to generate good SSE code (I've tried).
  3. A lot of small perf optimizations.

I'll elaborate on point 3 down below.
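On the de-filtering point: the Paeth filter in particular resists autovectorization. Here is a minimal scalar sketch of the predictor from the PNG specification (an illustration, not zune-png's actual SSE code):

```rust
/// Paeth predictor from the PNG specification: pick whichever of
/// left (a), above (b), or upper-left (c) is closest to a + b - c.
/// The data-dependent branches below are what compilers struggle to
/// turn into good SIMD, hence the hand-written SSE paths.
fn paeth(a: u8, b: u8, c: u8) -> u8 {
    let p = a as i16 + b as i16 - c as i16;
    let pa = (p - a as i16).abs();
    let pb = (p - b as i16).abs();
    let pc = (p - c as i16).abs();
    if pa <= pb && pa <= pc {
        a
    } else if pb <= pc {
        b
    } else {
        c
    }
}
```

A defiltering pass runs this per byte with a serial dependency on the previous output byte, which compounds the autovectorization problem.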

@etemesi254

etemesi254 commented Feb 3, 2023

PS: allow me to do this incrementally; I do have other things that need my attention.

But here is one.

  1. tRNS expansion.

Your code

https://github.com/image-rs/image-png/blob/6862566bf42685c461b24ae2963e127b9de2df2a/src/utils.rs#L41-L88

Vs mine

https://github.com/etemesi254/zune-image/blob/9741797ad5f8278e202c15004a905b9a428fd968/zune-png/src/decoder.rs#L598-L668

Mine is longer because I chose an interesting formatting convention, but the gist is that yours does a bounds check per iteration and handles both Luma and RGB in one function, while mine separates them.

The resulting code difference can be seen at https://godbolt.org/z/e44M37xz8

Yours does it in place, while mine moves from the old buffer to a new buffer, so yours should be faster than mine.
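The shape of that split can be sketched as two specialized loops (a hypothetical illustration, not the actual zune-png code); each loop body is straight-line, so the compiler can hoist bounds checks and vectorize:

```rust
/// Expand grayscale to grayscale+alpha, mapping the tRNS gray value
/// to a transparent pixel.
fn expand_trns_gray(input: &[u8], trns_gray: u8) -> Vec<u8> {
    let mut out = Vec::with_capacity(input.len() * 2);
    for &g in input {
        let alpha = if g == trns_gray { 0 } else { 255 };
        out.extend_from_slice(&[g, alpha]);
    }
    out
}

/// Expand RGB to RGBA, mapping the tRNS color to a transparent pixel.
fn expand_trns_rgb(input: &[u8], trns_rgb: [u8; 3]) -> Vec<u8> {
    let mut out = Vec::with_capacity(input.len() / 3 * 4);
    for px in input.chunks_exact(3) {
        let alpha = if px == trns_rgb.as_slice() { 0 } else { 255 };
        out.extend_from_slice(&[px[0], px[1], px[2], alpha]);
    }
    out
}
```

Merging both formats into one generic function forces per-pixel branching on the format, which is the pattern being contrasted above.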

@Shnatsel
Contributor Author

Shnatsel commented Feb 3, 2023

I've also been wanting to expand fdeflate's decompression to support arbitrary zlib streams. It has a bunch of fancy logic to decode multiple literals at the same time

Curiously, this optimization was considered for zune-inflate but wasn't used. You can find a detailed write-up about the trade-offs involved here: etemesi254/zune-image#79 (comment)

@oyvindln
Contributor

oyvindln commented Feb 4, 2023

While they probably won't make it match zune, there were some recent performance improvements in miniz_oxide, so they should help a bit with the performance of image-png once it is updated to use miniz_oxide 0.7.1.

As for the compression side, I may misremember, but I think miniz_oxide via flate2 still ends up using a double buffer (i.e. outputting to one buffer and then copying from that to a different buffer), which could probably be avoided.

@Shnatsel
Contributor Author

Shnatsel commented Feb 4, 2023

@oyvindln I haven't seen a release announcement for miniz_oxide 0.7, would you like me to write one? Although the changelog is pretty sparse, you should probably update that first.

@fintelia
Contributor

fintelia commented Mar 1, 2023

I've now got an experimental branch of fdeflate with support for arbitrary zlib streams, and a draft PR out hooking it up to this crate: image-rs/image-png#384

@etemesi254

Is it possible for me to add it to the benchmark suite?

@fintelia
Contributor

fintelia commented Mar 2, 2023

Feel free to experiment with it. I made https://github.com/fintelia/image-png/tree/finflate-github with just the fdeflate changes to avoid you hitting any circular dependency issues for zune-png.

@etemesi254

Yea there is some good improvement there, congratulations.

file-type         png 0.17.7   png-fininflate
8bpp              295 ms       222 ms
8bpp interlaced   381 ms       315 ms

Very good indeed!

@HeroicKatora
Member

Just to clarify from my side: the competition in decoder performance is quite awesome to see, with small and large steps iteratively identifying the currently slowest steps and weak points. For me this pretty clearly shows that being able to experiment is highly useful. Regardless of individual performance results, there should be a mechanism to add a new alternative backend and choose between the backends for the sake of comparison. That also implies the optimal mechanism should not be a pure compile-time choice: features allow some dependency minimization, but they should not be mutually exclusive, as that just increases cycle time.

Compare this to the mechanism for choosing a SIMD flavor: in all use cases we iterated clearly towards runtime choices instead of a cfg choice, because Rust allows this and it just works.
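The runtime-dispatch pattern for SIMD flavors might look like the sketch below (function names are made up; the trivial "Up" defilter stands in for a real kernel). The feature check happens when the program runs rather than being fixed by a cfg at compile time, so one binary serves all CPUs:

```rust
/// Portable scalar "Up" defilter: each byte adds the byte above it.
fn defilter_up_scalar(prev: &[u8], cur: &mut [u8]) {
    for (c, &p) in cur.iter_mut().zip(prev) {
        *c = c.wrapping_add(p);
    }
}

/// Same body, but compiled with SSE4.1 enabled so LLVM may
/// autovectorize this function specifically.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.1")]
unsafe fn defilter_up_sse41(prev: &[u8], cur: &mut [u8]) {
    defilter_up_scalar(prev, cur)
}

/// Runtime dispatch: detect CPU features at call time. Real code
/// would cache the choice, e.g. in a function pointer.
fn defilter_up(prev: &[u8], cur: &mut [u8]) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse4.1") {
            // SAFETY: guarded by the runtime feature detection above.
            unsafe { defilter_up_sse41(prev, cur) };
            return;
        }
    }
    defilter_up_scalar(prev, cur)
}
```

Both the runtime branch and a feature-gated backend can coexist: the cfg only gates whether the fast path is compiled in, never which path a given CPU takes.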

@fintelia
Contributor

fintelia commented Sep 1, 2023

Update on this: many of the optimizations from zune-png have been incorporated into image-png, and the performance is now much closer*. So, while I don't want to completely rule it out, with the current trade-offs I don't think it is worth switching to zune-png for our PNG decoding.

*I'd caution against reading too much into a single number: both decoders have files they're comparatively faster at decoding and it is hard to gather a representative sample of how performance compares on average. But the linked benchmark does at least show that the perf gap has considerably narrowed.

@Shnatsel
Contributor Author

Shnatsel commented Jan 3, 2024

With the additional performance improvements done in image-rs/image-png#416, the png crate is now on par with zune-png or even slightly faster, and has other desirable properties such as the ultra-fast compression mode and a fast decoding path for images produced by it.

It seems switching to zune-png will not provide much benefit, so I am going to go ahead and close this.

Shnatsel closed this as completed on Jan 3, 2024
@etemesi254

Congratulations to all those involved 🎉🎉

I'm wondering whether the optimizations are live on crates.io, or whether there's a git branch with all of them; I'd love to run them all.

@Shnatsel
Contributor Author

Shnatsel commented Jan 4, 2024

I believe all non-API-breaking changes are now in master for the image-png crate.

There is one more PR outstanding that shows a performance improvement but requires an API change: image-rs/image-png#421
