Avoid 32kB decompression lag + compact less often. #447
I think this may be worth landing - the performance improvements can be seen everywhere except the Transparency testcase.

I don't really understand what happens in the Transparency testcase. I can recover some of the regression with an additional commit in master...anforowicz:image-png:cap-incremental-decompression-size. Maybe we should also land this other commit (as a separate PR?). OTOH after this other commit the performance gains elsewhere are still there, but slightly smaller. I don't know how to decide between 1) landing just this commit/PR vs 2) landing this commit + the other commit. I think I am leaning toward just landing this commit/PR for now.

FWIW, I've also tested this commit on some small images from the top-500 benchmark described in #416 (comment). I've measured 10% - 15% performance improvements for those small images (Transparency.png is also small, but apparently being small is not the deciding factor).
src/decoder/zlib.rs
// We can't reuse the rest of `out_buffer` because it hasn't been zeroed-out (as
// expected by `fdeflate`). Because of this, we `truncate` the `out_buffer`.
self.out_buffer.truncate(preserved_len);
I eliminated the requirement in `fdeflate` about zeroing the buffer, so this shouldn't be needed.
Done - removed these 3 lines.

This changes how much will be decompressed in subsequent calls to `fn decompress` though - this probably means that I should redo the performance measurements to double-check the performance impact is still the same. I'll try to do that next week.
The Transparency regression is a bit weird, but I'm OK with merging this given all the other improvements.

While looking at this PR, I noticed we're not properly handling images with extra data. Specifically, it is possible that there are additional bytes in the IDATs beyond the deflate checksum. If that happens fdeflate will set
Thanks for looking!
Ack. I won't have time for it this week, but I plan to eventually try to understand master...anforowicz:image-png:cap-incremental-decompression-size better:
Ack. Question: is this something independent from the PR-under-review? Or is this a problem that you think the PR-under-review may be introducing or making worse?

I can help with adding a test for this and checking what happens. I think the new test can mostly mimic

BTW, I forgot to say that I hope that the PR-under-review doesn't get in the way for your idea from #429 (comment) of "changing ZlibStream to directly write into a caller provided buffer instead of using its own

FWIW, I mostly got motivated to look here because of how the
Avoiding 32kB decompression lag
===============================

Before this commit, decompressed data would be accumulated in `ZlibStream::out_buffer` and returned via `image_data` with 32kB lag corresponding to `CHUNCK_BUFFER_SIZE`:

```
fn transfer_finished_data(&mut self, image_data: &mut Vec<u8>) -> usize {
    let safe = self.out_pos.saturating_sub(CHUNCK_BUFFER_SIZE);
    image_data.extend(self.out_buffer.drain(..safe));
    ...
```

32kB is a typical size of L1 cache, so the lag would mean that the data passed to `image_data.extend(...)` would be already cold and evicted from the L1 cache.

This commit avoids the lag by always returning into `image_data` all the data from `out_buffer` (i.e. data up to `out_pos`):

```
fn transfer_finished_data(&mut self, image_data: &mut Vec<u8>) -> usize {
    let transferred = &self.out_buffer[self.read_pos..self.out_pos];
    image_data.extend_from_slice(transferred);
    self.read_pos = self.out_pos;
    ...
```

Compacting less often
=====================

The changes above mean that `Vec::drain` no longer compacts `out_buffer`. Therefore this commit also refactors how this compaction works.

Before this commit, not-yet-returned data would be shifted to the beginning of `out_buffer` every time `transfer_finished_data` is called. This could potentially mean that for 1 returned byte, N bytes have to be copied during compaction.

After this commit, compaction is only done when the compaction cost is offset by many read bytes - for 3 returned bytes 1 byte has to be copied during compaction.

Performance impact
==================

The commit has a positive impact on performance, except for:

* `decode/Transparency.png` - regression between 15% and 20% is reported in 3-out-of-3 measurements.
* `decode/kodim17.png` - a regression of 2.1% has been reported in 1-out-of-3 measurements (an improvement of 0.6% - 1.13% has been reported in the other 2-out-of-3 measurements).
* `generated-noncompressed-64k-idat/128x128.png` - a regression of 25% has been reported in 1-out-of-3 measurements (an improvement of 21% - 29% has been reported in the other 2-out-of-3 measurements).

The results below have been gathered by running the `decoder` benchmark. First a baseline was saved before this commit, and then a comparison was done after the commit. This (the baseline + the comparison) was repeated a total of 3 times. All results below are for the relative impact on the runtime. All results are with p = 0.00 < 0.05.

* decode/kodim23.png:
    * [-2.9560% -2.7112% -2.4009%]
    * [-3.4876% -3.3406% -3.1928%]
    * [-3.0559% -2.9208% -2.7787%]
* decode/kodim07.png:
    * [-1.2527% -1.0110% -0.6780%]
    * [-1.7851% -1.6558% -1.5164%]
    * [-1.6576% -1.5216% -1.3856%]
* decode/kodim02.png:
    * [-0.5108% -0.2806% -0.0112%]
    * [-1.0885% -0.9493% -0.8118%]
    * [-0.5563% -0.4239% -0.2874%]
* decode/kodim17.png:
    * [+1.8649% +2.1138% +2.4169%] (**regression**)
    * [-1.2891% -1.1322% -0.9736%]
    * [-0.7753% -0.6276% -0.4866%]
* decode/Lohengrin_-_Illustrated_Sporting_and_Dramatic_News.png:
    * [-1.7165% -1.4968% -1.2650%]
    * [-1.7051% -1.4473% -1.2229%]
    * [-1.2544% -1.0457% -0.8375%]
* decode/Transparency.png:
    * [+19.329% +19.789% +20.199%] (**regression**)
    * [+15.337% +15.798% +16.294%] (**regression**)
    * [+18.694% +19.106% +19.518%] (**regression**)
* generated-noncompressed-4k-idat/8x8.png:
    * [-2.3295% -1.9940% -1.5912%]
    * [-6.1285% -5.8872% -5.6091%]
    * [-2.8814% -2.6787% -2.4820%]
* generated-noncompressed-4k-idat/128x128.png:
    * [-59.793% -59.599% -59.417%]
    * [-63.930% -63.846% -63.756%]
    * [-62.377% -62.248% -62.104%]
* generated-noncompressed-4k-idat/2048x2048.png:
    * [-67.678% -67.579% -67.480%]
    * [-65.616% -65.519% -65.429%]
    * [-65.824% -65.647% -65.413%]
* generated-noncompressed-4k-idat/12288x12288.png:
    * [-60.932% -60.774% -60.528%]
    * [-62.088% -62.016% -61.940%]
    * [-61.663% -61.604% -61.546%]
* generated-noncompressed-64k-idat/128x128.png:
    * [-22.237% -21.975% -21.701%]
    * [-29.656% -29.480% -29.311%]
    * [+24.812% +25.190% +25.571%] (**regression**)
* generated-noncompressed-64k-idat/2048x2048.png:
    * [-21.826% -21.499% -21.087%]
    * [-54.279% -54.049% -53.715%]
    * [-11.174% -10.828% -10.482%]
* generated-noncompressed-64k-idat/12288x12288.png:
    * [-40.421% -40.311% -40.180%]
    * [-39.496% -39.183% -38.871%]
    * [-41.443% -41.367% -41.295%]
* generated-noncompressed-2g-idat/2048x2048.png:
    * [-40.136% -40.010% -39.865%]
    * [-58.507% -58.333% -58.060%]
    * [-35.822% -35.457% -35.038%]
* generated-noncompressed-2g-idat/12288x12288.png:
    * [-37.196% -37.107% -37.014%]
    * [-36.125% -36.049% -35.970%]
    * [-35.636% -35.477% -35.350%]
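To make the two ideas from the commit message concrete (returning data without lag, and compacting only when the copy is cheap relative to the bytes already returned), here is a minimal standalone sketch. It is not the code from this PR: the `OutBuf` type, its `produce` helper, the `lookback` and `copied` fields, and the "compact once `out_pos` exceeds 4x the lookback" threshold are all assumptions, chosen only so that each compaction copies 1 byte for at least 3 bytes that were already returned.

```rust
/// Hypothetical stand-in for the output side of `ZlibStream` (illustration only).
struct OutBuf {
    buf: Vec<u8>,
    /// Bytes before `read_pos` have already been returned to the caller.
    read_pos: usize,
    /// Bytes before `out_pos` have been produced by the decompressor.
    out_pos: usize,
    /// Number of trailing bytes that must stay available for back-references.
    lookback: usize,
    /// Total number of bytes moved by compaction (for demonstration only).
    copied: usize,
}

impl OutBuf {
    fn new(lookback: usize) -> Self {
        OutBuf { buf: Vec::new(), read_pos: 0, out_pos: 0, lookback, copied: 0 }
    }

    /// Simulates the decompressor appending freshly decompressed bytes.
    fn produce(&mut self, data: &[u8]) {
        self.buf.truncate(self.out_pos);
        self.buf.extend_from_slice(data);
        self.out_pos = self.buf.len();
    }

    /// Returns everything in `read_pos..out_pos` immediately (no lag).
    fn transfer_finished_data(&mut self, image_data: &mut Vec<u8>) -> usize {
        let transferred = &self.buf[self.read_pos..self.out_pos];
        image_data.extend_from_slice(transferred);
        let len = transferred.len();
        self.read_pos = self.out_pos;
        self.compact_if_needed();
        len
    }

    /// Assumed policy: compact only once the buffer holds more than 4x the
    /// lookback window, so `lookback` copied bytes are paid for by more than
    /// 3x `lookback` bytes that were already returned.
    fn compact_if_needed(&mut self) {
        if self.out_pos > 4 * self.lookback {
            let start = self.out_pos - self.lookback;
            self.buf.copy_within(start..self.out_pos, 0);
            self.buf.truncate(self.lookback);
            self.copied += self.lookback;
            self.read_pos = self.lookback;
            self.out_pos = self.lookback;
        }
    }
}
```

With a toy `lookback` of 4, producing 20 bytes in two batches and transferring after each one returns all 20 bytes right away and triggers a single compaction that copies only the last 4 bytes.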
7228d12 to cb0b595
Just a quick follow-up to #447 (comment) - I re-measured the performance and the results were mostly the same (i.e. as before addressing the comment / as with truncating and re-zero-ing out

* decode/kodim23.png:
* decode/kodim07.png:
* decode/kodim02.png:
* decode/kodim17.png:
* decode/Lohengrin_-_Illustrated_Sporting_and_Dramatic_News.png:
* decode/Transparency.png:
* generated-noncompressed-4k-idat/8x8.png:
* generated-noncompressed-4k-idat/128x128.png:
* generated-noncompressed-4k-idat/2048x2048.png:
* generated-noncompressed-4k-idat/12288x12288.png:
* generated-noncompressed-64k-idat/128x128.png:
* generated-noncompressed-64k-idat/2048x2048.png:
* generated-noncompressed-64k-idat/12288x12288.png:
* generated-noncompressed-2g-idat/2048x2048.png:
* generated-noncompressed-2g-idat/12288x12288.png: