Add further partial decoding optimization #34

mattsarett · 2015-11-11T18:05:27Z

As I've advocated previously, the ability to decode image subsets efficiently is a useful feature. The idea is that if we only need to display a subset of an image, we will improve performance by decoding only that subset (rather than decoding the entire image and then cropping).

In http://sourceforge.net/p/libjpeg-turbo/patches/70/, we added the ability the skip rows when decoding. This significantly improved partial decoding performance, especially when we only care about the bottom portion of an image.

However, we are still non-optimal when dealing with very wide or panorama images. It seems quite wasteful to decode entire rows at a time when we are only interested in a small portion of the width.

I'd like to propose an API to optimize partial decodes by adding the ability to read partial scanlines. I envision that this could be done in a way that makes minimal changes to the code path for regular decodes. We would just need to need to keep track of alternate width counters etc.

The API might look something like this:

/*

Call this before starting to read scanlines, and each subsequent call to jpeg_read_scanlines() will only
read the indicated portion of the row, rather than the entire row.
*
This cannot be called or undone once the decode is in progress.
*/
GLOBAL(void) jpeg_partial_decode(j_decompress_ptr cinfo, JDIMENSION start_x, JDIMENSION width);

The major complication I see with this is that many of the SIMD color conversion routines expect their memory to be 16-byte aligned (and AVX2 will expect 32-byte alignment). So start_x must be a multiple of 16 (or 32) for this to work without significantly altering the SIMD routines.

I think we can get around this by requiring that start_x be a multiple of the necessary alignment - maybe return a bool? Or we can make start_x and width JDIMENSION*, so libjpeg-turbo can adjust them to values that are supported?

Would this be an optimization that you would be interested in? If so, do you have thoughts on the API?

I would write the initial version of the patch, and we would be able to compensate for time spent integrating and testing.

dcommander · 2015-11-11T19:52:25Z

Not sure what you mean regarding alignment. The underlying MCU blocks are allocated by the library, so won't they already be aligned? The only SIMD routines for which alignment should be a concern are the colorspace conversion routines, and those can handle input from/output to unaligned pointers.

dcommander · 2015-11-11T20:00:57Z

To answer your other questions:

I'm definitely interested in this, as long as the integration labor is compensated (as before.)
I don't see any particular issues with the API as you specified it, but to be consistent with other API functions, I think it should be called jpeg_set_decompress_region(). That makes it clear that the function is setting a state rather than performing a task (but I'm open to other suggestions, as long as "set" is in the name.)

dcommander · 2015-11-11T20:03:17Z

Actually, I don't think we really need "decompress" in the name, because that's implied by the argument types. So maybe jpeg_set_partial_scanline would be better.

mattsarett · 2015-11-11T20:19:27Z

"Not sure what you mean regarding alignment. The underlying MCU blocks are allocated by the library, so won't they already be aligned? The only SIMD routines for which alignment should be a concern are the colorspace conversion routines, and those can handle input from/output to unaligned pointers."

Yes, all of the memory allocated by the library is aligned. And yes, I think we only care about the colorspace conversion routines. As for these routines, I think they do not completely handle input from unaligned pointers.

Here's a look at line 200 of jsimd_x86_64.c (in jsimd_ycc_rgb_convert):
sse2fct(cinfo->output_width, input_buf, input_row, output_buf, num_rows);

input_buf is a JSAMPIMAGE. For a given JSAMPROW input_buf_row inside of input_buf, we may want to begin color conversion at "input_buf_row + start_x".

We can modify the pointers in input_buf accordingly, but I believe we will seg fault if start_x is not a multiple of 16. In jdcolext-sse2-64.asm, we will load the vector registers from input_buf_row using movdqa (move aligned double quadword, requires 16 byte alignment).

Of course, this is partly speculation, I haven't made a full attempt at an implementation yet.

On the other hand, we store to output_buf using movdqu (unaligned move), meaning we are safe regardless of the output alignment.

"To answer your other questions:
I'm definitely interested in this, as long as the integration labor is compensated (as before.)
I don't see any particular issues with the API as you specified it, but to be consistent with other API functions, I think it should be called jpeg_set_partial_scanline(). That makes it clear that the function is setting a state rather than performing a task."

Great!

mattsarett · 2015-11-19T22:13:55Z

Just wanted to send a note that I'm still interested in this and intend to implement it, but it's no a longer a priority - so it may take me a little while to get back to it.

dcommander · 2015-11-19T22:35:28Z

Thanks for the update. I'm still interested in it as well, as I think it would further round out the functionality of that API and allow me to create a more reasonable higher-level partial image decode interface for it in TurboJPEG (see #1.) But I won't block on this for 1.5.

mattsarett · 2015-12-17T23:12:46Z

I'm attaching a patch!
patch.txt

Please follow-up with me on any style/design/test/integration tasks that I can help with.

Two concerns I had were:
(1) I needed to write special code to disable fancy upsampling if the subset requested is smaller than 2 pixels. It's kind of ugly.
(2) I need a way to modify the "width" parameter on the destination managers used by djpeg. So far I've just added a hack for using ppm. I'd suggest that "width" be stored in the public part of the destination manager. In itself, this looks like a big change.

Let me know what your thoughts are!

dcommander · 2015-12-18T02:09:29Z

It will probably be January before I can dig into this. I'll keep you posted.

mattsarett · 2015-12-18T13:25:40Z

Sounds good. I'll be out as well.

Happy holidays!
Matt

On Thu, Dec 17, 2015 at 9:09 PM, DRC notifications@github.com wrote:

It will probably be January before I can dig into this. I'll keep you
posted.

—
Reply to this email directly or view it on GitHub
#34 (comment)
.

dcommander · 2016-02-20T03:11:23Z

Modified your patch fairly extensively and pushed to master. Please test and review ASAP. Double-check and triple-check whether the API extensions will be sufficient to support partial image decoding on Android. We can straightforwardly add to the API later, but changing it is extremely painful due to the fact that the libjpeg API is not versioned.

mattsarett · 2016-02-22T15:01:27Z

Looks good to me. I'll follow up if the test suite catches any bugs/issues, but I think this will integrate nicely.

dcommander · 2016-02-22T17:33:17Z

OK, let me know once you have completed the integration.

dcommander · 2016-02-26T17:36:26Z

What's the status of integration?

mattsarett · 2016-02-29T15:17:07Z

It is done! Thanks for your efforts in helping to add this optimization!

dcommander added the enhancement label Dec 2, 2015

dcommander mentioned this issue Jan 18, 2016

ARM64 NEON SIMD huffman encoding #50

Closed

dcommander closed this as completed in 3ab68cf Feb 20, 2016

dcommander added the fixed label Feb 20, 2016

jechols mentioned this issue May 23, 2019

Look into partial JPG decoding uoregon-libraries/rais-image-server#16

Closed

mo271 mentioned this issue Sep 17, 2020

Fix cropping in progressive mode #456

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add further partial decoding optimization #34

Add further partial decoding optimization #34

mattsarett commented Nov 11, 2015

dcommander commented Nov 11, 2015

dcommander commented Nov 11, 2015

dcommander commented Nov 11, 2015

mattsarett commented Nov 11, 2015

mattsarett commented Nov 19, 2015

dcommander commented Nov 19, 2015

mattsarett commented Dec 17, 2015

dcommander commented Dec 18, 2015

mattsarett commented Dec 18, 2015

dcommander commented Feb 20, 2016

mattsarett commented Feb 22, 2016

dcommander commented Feb 22, 2016

dcommander commented Feb 26, 2016

mattsarett commented Feb 29, 2016

Add further partial decoding optimization #34

Add further partial decoding optimization #34

Comments

mattsarett commented Nov 11, 2015

dcommander commented Nov 11, 2015

dcommander commented Nov 11, 2015

dcommander commented Nov 11, 2015

mattsarett commented Nov 11, 2015

mattsarett commented Nov 19, 2015

dcommander commented Nov 19, 2015

mattsarett commented Dec 17, 2015

dcommander commented Dec 18, 2015

mattsarett commented Dec 18, 2015

dcommander commented Feb 20, 2016

mattsarett commented Feb 22, 2016

dcommander commented Feb 22, 2016

dcommander commented Feb 26, 2016

mattsarett commented Feb 29, 2016