Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add further partial decoding optimization #34

Closed
mattsarett opened this issue Nov 11, 2015 · 14 comments
Closed

Add further partial decoding optimization #34

mattsarett opened this issue Nov 11, 2015 · 14 comments

Comments

@mattsarett
Copy link
Contributor

As I've advocated previously, the ability to decode image subsets efficiently is a useful feature. The idea is that if we only need to display a subset of an image, we will improve performance by decoding only that subset (rather than decoding the entire image and then cropping).

In http://sourceforge.net/p/libjpeg-turbo/patches/70/, we added the ability the skip rows when decoding. This significantly improved partial decoding performance, especially when we only care about the bottom portion of an image.

However, we are still non-optimal when dealing with very wide or panorama images. It seems quite wasteful to decode entire rows at a time when we are only interested in a small portion of the width.

I'd like to propose an API to optimize partial decodes by adding the ability to read partial scanlines. I envision that this could be done in a way that makes minimal changes to the code path for regular decodes. We would just need to need to keep track of alternate width counters etc.

The API might look something like this:

/*

  • Call this before starting to read scanlines, and each subsequent call to jpeg_read_scanlines() will only
  • read the indicated portion of the row, rather than the entire row.
    *
  • This cannot be called or undone once the decode is in progress.
    */
    GLOBAL(void) jpeg_partial_decode(j_decompress_ptr cinfo, JDIMENSION start_x, JDIMENSION width);

The major complication I see with this is that many of the SIMD color conversion routines expect their memory to be 16-byte aligned (and AVX2 will expect 32-byte alignment). So start_x must be a multiple of 16 (or 32) for this to work without significantly altering the SIMD routines.

I think we can get around this by requiring that start_x be a multiple of the necessary alignment - maybe return a bool? Or we can make start_x and width JDIMENSION*, so libjpeg-turbo can adjust them to values that are supported?

Would this be an optimization that you would be interested in? If so, do you have thoughts on the API?

I would write the initial version of the patch, and we would be able to compensate for time spent integrating and testing.

@dcommander
Copy link
Member

Not sure what you mean regarding alignment. The underlying MCU blocks are allocated by the library, so won't they already be aligned? The only SIMD routines for which alignment should be a concern are the colorspace conversion routines, and those can handle input from/output to unaligned pointers.

@dcommander
Copy link
Member

To answer your other questions:

  • I'm definitely interested in this, as long as the integration labor is compensated (as before.)
  • I don't see any particular issues with the API as you specified it, but to be consistent with other API functions, I think it should be called jpeg_set_decompress_region(). That makes it clear that the function is setting a state rather than performing a task (but I'm open to other suggestions, as long as "set" is in the name.)

@dcommander
Copy link
Member

Actually, I don't think we really need "decompress" in the name, because that's implied by the argument types. So maybe jpeg_set_partial_scanline would be better.

@mattsarett
Copy link
Contributor Author

"Not sure what you mean regarding alignment. The underlying MCU blocks are allocated by the library, so won't they already be aligned? The only SIMD routines for which alignment should be a concern are the colorspace conversion routines, and those can handle input from/output to unaligned pointers."

Yes, all of the memory allocated by the library is aligned. And yes, I think we only care about the colorspace conversion routines. As for these routines, I think they do not completely handle input from unaligned pointers.

Here's a look at line 200 of jsimd_x86_64.c (in jsimd_ycc_rgb_convert):
sse2fct(cinfo->output_width, input_buf, input_row, output_buf, num_rows);

input_buf is a JSAMPIMAGE. For a given JSAMPROW input_buf_row inside of input_buf, we may want to begin color conversion at "input_buf_row + start_x".

We can modify the pointers in input_buf accordingly, but I believe we will seg fault if start_x is not a multiple of 16. In jdcolext-sse2-64.asm, we will load the vector registers from input_buf_row using movdqa (move aligned double quadword, requires 16 byte alignment).

Of course, this is partly speculation, I haven't made a full attempt at an implementation yet.

On the other hand, we store to output_buf using movdqu (unaligned move), meaning we are safe regardless of the output alignment.

"To answer your other questions:
I'm definitely interested in this, as long as the integration labor is compensated (as before.)
I don't see any particular issues with the API as you specified it, but to be consistent with other API functions, I think it should be called jpeg_set_partial_scanline(). That makes it clear that the function is setting a state rather than performing a task."

Great!

@mattsarett
Copy link
Contributor Author

Just wanted to send a note that I'm still interested in this and intend to implement it, but it's no a longer a priority - so it may take me a little while to get back to it.

@dcommander
Copy link
Member

Thanks for the update. I'm still interested in it as well, as I think it would further round out the functionality of that API and allow me to create a more reasonable higher-level partial image decode interface for it in TurboJPEG (see #1.) But I won't block on this for 1.5.

@mattsarett
Copy link
Contributor Author

I'm attaching a patch!
patch.txt

Please follow-up with me on any style/design/test/integration tasks that I can help with.

Two concerns I had were:
(1) I needed to write special code to disable fancy upsampling if the subset requested is smaller than 2 pixels. It's kind of ugly.
(2) I need a way to modify the "width" parameter on the destination managers used by djpeg. So far I've just added a hack for using ppm. I'd suggest that "width" be stored in the public part of the destination manager. In itself, this looks like a big change.

Let me know what your thoughts are!

@dcommander
Copy link
Member

It will probably be January before I can dig into this. I'll keep you posted.

@mattsarett
Copy link
Contributor Author

Sounds good. I'll be out as well.

Happy holidays!
Matt

On Thu, Dec 17, 2015 at 9:09 PM, DRC notifications@github.com wrote:

It will probably be January before I can dig into this. I'll keep you
posted.


Reply to this email directly or view it on GitHub
#34 (comment)
.

@dcommander
Copy link
Member

Modified your patch fairly extensively and pushed to master. Please test and review ASAP. Double-check and triple-check whether the API extensions will be sufficient to support partial image decoding on Android. We can straightforwardly add to the API later, but changing it is extremely painful due to the fact that the libjpeg API is not versioned.

@mattsarett
Copy link
Contributor Author

Looks good to me. I'll follow up if the test suite catches any bugs/issues, but I think this will integrate nicely.

@dcommander
Copy link
Member

OK, let me know once you have completed the integration.

@dcommander
Copy link
Member

What's the status of integration?

@mattsarett
Copy link
Contributor Author

It is done! Thanks for your efforts in helping to add this optimization!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants