Add further partial decoding optimization #34
Comments
Not sure what you mean regarding alignment. The underlying MCU blocks are allocated by the library, so won't they already be aligned? The only SIMD routines for which alignment should be a concern are the colorspace conversion routines, and those can handle input from/output to unaligned pointers. |
To answer your other questions:
|
Actually, I don't think we really need "decompress" in the name, because that's implied by the argument types. So maybe |
"Not sure what you mean regarding alignment. The underlying MCU blocks are allocated by the library, so won't they already be aligned? The only SIMD routines for which alignment should be a concern are the colorspace conversion routines, and those can handle input from/output to unaligned pointers."
Yes, all of the memory allocated by the library is aligned. And yes, I think we only care about the colorspace conversion routines. As for these routines, I think they do not completely handle input from unaligned pointers.
Here's a look at line 200 of jsimd_x86_64.c (in jsimd_ycc_rgb_convert): input_buf is a JSAMPIMAGE. For a given JSAMPROW input_buf_row inside of input_buf, we may want to begin color conversion at "input_buf_row + start_x". We can modify the pointers in input_buf accordingly, but I believe we will segfault if start_x is not a multiple of 16. In jdcolext-sse2-64.asm, we load the vector registers from input_buf_row using movdqa (move aligned double quadword, which requires 16-byte alignment). Of course, this is partly speculation; I haven't made a full attempt at an implementation yet.
On the other hand, we store to output_buf using movdqu (unaligned move), meaning we are safe regardless of the output alignment.
"To answer your other questions:" Great! |
Just wanted to send a note that I'm still interested in this and intend to implement it, but it's no longer a priority, so it may take me a little while to get back to it. |
Thanks for the update. I'm still interested in it as well, as I think it would further round out the functionality of that API and allow me to create a more reasonable higher-level partial image decode interface for it in TurboJPEG (see #1.) But I won't block on this for 1.5. |
I'm attaching a patch! Please follow-up with me on any style/design/test/integration tasks that I can help with. Two concerns I had were: Let me know what your thoughts are! |
It will probably be January before I can dig into this. I'll keep you posted. |
Sounds good. I'll be out as well. Happy holidays! On Thu, Dec 17, 2015 at 9:09 PM, DRC notifications@github.com wrote:
|
Modified your patch fairly extensively and pushed to master. Please test and review ASAP. Double-check and triple-check whether the API extensions will be sufficient to support partial image decoding on Android. We can straightforwardly add to the API later, but changing it is extremely painful due to the fact that the libjpeg API is not versioned. |
Looks good to me. I'll follow up if the test suite catches any bugs/issues, but I think this will integrate nicely. |
OK, let me know once you have completed the integration. |
What's the status of integration? |
It is done! Thanks for your efforts in helping to add this optimization! |
As I've advocated previously, the ability to decode image subsets efficiently is a useful feature. The idea is that if we only need to display a subset of an image, we will improve performance by decoding only that subset (rather than decoding the entire image and then cropping).
In http://sourceforge.net/p/libjpeg-turbo/patches/70/, we added the ability to skip rows when decoding. This significantly improved partial decoding performance, especially when we only care about the bottom portion of an image.
However, we are still suboptimal when dealing with very wide or panorama images. It seems quite wasteful to decode entire rows at a time when we are only interested in a small portion of the width.
I'd like to propose an API to optimize partial decodes by adding the ability to read partial scanlines. I envision that this could be done in a way that makes minimal changes to the code path for regular decodes. We would just need to keep track of alternate width counters etc.
The API might look something like this:
/*
*
*/
GLOBAL(void) jpeg_partial_decode(j_decompress_ptr cinfo, JDIMENSION start_x, JDIMENSION width);
The major complication I see with this is that many of the SIMD color conversion routines expect their memory to be 16-byte aligned (and AVX2 will expect 32-byte alignment). So start_x must be a multiple of 16 (or 32) for this to work without significantly altering the SIMD routines.
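To make the alignment concern concrete, here is a minimal sketch in C of the condition being described: a row that starts on a 16-byte boundary stays aligned only when the offset added to it is a multiple of 16. The names is_aligned16 and offset_keeps_alignment are hypothetical and do not exist in libjpeg-turbo, and the JSAMPLE typedef is a stand-in that assumes 8-bit samples.

```c
#include <stdalign.h>
#include <stdint.h>

/* Stand-in for libjpeg's JSAMPLE, assuming 8-bit samples. */
typedef unsigned char JSAMPLE;

/* Returns 1 if p lies on a 16-byte boundary (the requirement of an
 * aligned SSE2 load such as movdqa), 0 otherwise. */
int is_aligned16(const void *p)
{
  return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical check, not a library function: given a row pointer,
 * report whether (row + start_x) would be 16-byte aligned. */
int offset_keeps_alignment(const JSAMPLE *row, unsigned int start_x)
{
  return is_aligned16(row + start_x);
}
```

With a row declared as alignas(16) JSAMPLE row[64], offsets of 16 or 32 pass this check while an offset of 5 does not, which is the case where an aligned load would be expected to fault.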
I think we can get around this by requiring that start_x be a multiple of the necessary alignment - maybe return a bool? Or we can make start_x and width JDIMENSION*, so libjpeg-turbo can adjust them to values that are supported?
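As a rough illustration of the second option (passing start_x and width by pointer so the library can adjust them), the adjustment could look something like the sketch below. The function adjust_region is hypothetical and is not part of libjpeg-turbo. It assumes the alignment is a nonzero power of two and that one sample occupies one byte, so that a column offset maps directly to a byte offset.

```c
/* Stand-in for libjpeg's JDIMENSION (an unsigned int in jmorecfg.h). */
typedef unsigned int JDIMENSION;

/*
 * Hypothetical helper: move *start_x down to the nearest multiple of
 * `align` and grow *width by the same amount, so the adjusted region
 * still covers every column the caller asked for.
 * `align` must be a nonzero power of two (e.g. 16 or 32).
 */
void adjust_region(JDIMENSION *start_x, JDIMENSION *width, JDIMENSION align)
{
  JDIMENSION aligned_start = *start_x & ~(align - 1);  /* round down */

  *width += *start_x - aligned_start;  /* right edge stays where it was */
  *start_x = aligned_start;
}
```

For example, a request for start_x = 100 and width = 50 with a 16-byte alignment becomes start_x = 96 and width = 54. The caller would then discard the first 4 columns of each returned row. The trade-off is a few extra decoded columns in exchange for leaving the SIMD routines untouched.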
Would this be an optimization that you would be interested in? If so, do you have thoughts on the API?
I would write the initial version of the patch, and we would be able to compensate for time spent integrating and testing.