Bogus Huffman table definition: jpeg_make_d_derived_tbl too strict #586

malaterre · 2022-03-25T09:28:42Z

Have you searched the existing issues (both open and closed) in the libjpeg-turbo issue tracker to ensure that this bug report is not a duplicate?

yes

Does this bug report describe one of the two known and unsolvable issues with the JPEG format?

no

Clear and concise description of the bug:

jpeg_make_d_derived_tbl() is too strict: it checks every code in the Huffman table, even codes that the decoder never uses.

Steps to reproduce the bug (using only libjpeg-turbo):

Using the attached JPEG file (WITH_12BIT:BOOL=ON), here is what I see:

% ./djpeg -outfile libjpeg-turbo.pgm sample_12bits_huffman_17.jpg
Bogus Huffman table definition

Image(s) needed in order to reproduce the bug (if applicable):

Expected behavior:

% ./djpeg -outfile libjpeg-turbo.pgm sample_12bits_huffman_17.jpg && echo "success"
success

Observed behavior:

% ./djpeg -outfile libjpeg-turbo.pgm sample_12bits_huffman_17.jpg
Bogus Huffman table definition

Platform(s) (compiler version, operating system version, CPU) on which the bug was observed:

Debian/bullseye amd64.

libjpeg-turbo release(s), commit(s), or branch(es) in which the bug was observed (always test the tip of the main branch or the latest stable pre-release to verify that the bug hasn't already been fixed):

% ./djpeg -version
libjpeg-turbo version 2.1.4 (build 20211129)

If the bug is a regression, the specific commit that introduced the regression (use git bisect to determine this):

Seems like commit is: 5ead57a

https://github.com/libjpeg-turbo/libjpeg-turbo/blob/jpeg-6b/jdhuff.c#L253-L265

Additional information:

Using the following local patch, I can properly decode the image:

 % git diff
diff --git a/jdhuff.c b/jdhuff.c
index 679d2216..037fd184 100644
--- a/jdhuff.c
+++ b/jdhuff.c
@@ -252,7 +252,7 @@ jpeg_make_d_derived_tbl(j_decompress_ptr cinfo, boolean isDC, int tblno,
   if (isDC) {
     for (i = 0; i < numsymbols; i++) {
       int sym = htbl->huffval[i];
-      if (sym < 0 || sym > 15)
+      if (sym < 0 || sym > 17)
         ERREXIT(cinfo, JERR_BAD_HUFF_TABLE);
     }
   }

Using the above patch, I get a result compatible with another open-source implementation of ITU-T T.81:

https://github.com/thorfdbg/libjpeg/

% jpeg sample_12bits_huffman_17.jpg jpeg.pgm
jpeg Copyright (C) 2012-2018 Thomas Richter, University of Stuttgart
and Accusoft

For license conditions, see README.license for details.


0 bytes memory not yet released.

1108286 bytes maximal required.

249 allocations performed.
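
For readers without the source at hand, the range check that the patch relaxes can be isolated as a small stand-alone function. This is only a sketch: `first_bad_dc_symbol` and its `max_sym` parameter are hypothetical names and are not part of libjpeg-turbo.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-alone version of the DC symbol range check.
 * huffval holds the symbol values read from a DHT segment, numsymbols is
 * how many of them are defined, and max_sym is the largest value accepted
 * (15 in the stock check, 17 in the local patch).
 * Returns the index of the first out-of-range symbol, or -1 if every
 * symbol is within range. */
static int first_bad_dc_symbol(const unsigned char *huffval, int numsymbols,
                               int max_sym)
{
  int i;

  assert(huffval != NULL);
  for (i = 0; i < numsymbols; i++) {
    if (huffval[i] > max_sym)
      return i;
  }
  return -1;
}
```

Note that the check looks only at which symbols the table defines, not at which symbols the entropy-coded data actually uses.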


malaterre · 2022-03-25T09:29:47Z

For background information, here is the original thread that started it:

Lossless JPEG: Valid range for DC table ? thorfdbg/libjpeg#68 (comment)

dcommander · 2022-03-25T16:29:44Z

Yes, I have been discussing this with @thorfdbg in e-mail. This was my response:

The Huffman decoder tends to be the first line of defense against malformed JPEG images, some of which may trigger exploitable security issues farther down in the decoding pipeline. Thus, browsers and other security-conscious applications are very sensitive to whether and how libjpeg-turbo fails to decode such images. Historically, about 9 out of 10 reported security bugs in libjpeg-turbo have been related to the Huffman decoder, so I am hesitant to change it without a good reason.

The code in question was written by Tom Lane in 1998, and I have no insight into why he decided to be strict about the DC symbol range. (The change log simply says, "Huffman tables are checked for validity much more carefully than before.") The fact that his code comments say "to ensure safe decoding" makes me very cautious about changing or removing the code. At minimum, I would need to fully understand why that code was added in the first place. For example, it may be that the strictness is necessary in order to guard against buffer overruns due to the specific implementation. I assume it would be possible to craft an image that actually does use the unsupported symbols, so I need to understand how the Huffman decoder, as it is currently implemented, would respond to that. Any assistance you can provide is appreciated. After a cursory read of the code, I can see several places at which an out-of-range symbol might cause a problem.
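
To make the buffer-overrun concern concrete, here is a minimal sketch of a table-driven sign extension in the style of the EXTEND procedure from T.81, where the DC symbol `s` is the number of extra bits that follow the Huffman code. It is not libjpeg-turbo's code: the function names and the 16-entry table size are assumptions chosen for illustration, and this thread does not establish that the real decoder has this shape.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CATEGORY 15  /* illustrative: tables cover categories 0..15 */

/* extend_test[s] = 2^(s-1) and extend_offset[s] = 1 - 2^s for s >= 1.
 * Entry 0 is zero because category 0 carries no extra bits. */
static int extend_test[MAX_CATEGORY + 1];
static int extend_offset[MAX_CATEGORY + 1];

static void init_extend_tables(void)
{
  int s;

  extend_test[0] = 0;
  extend_offset[0] = 0;
  for (s = 1; s <= MAX_CATEGORY; s++) {
    extend_test[s] = 1 << (s - 1);
    extend_offset[s] = 1 - (1 << s);
  }
}

/* Converts the s extra bits into a signed DC difference.
 * Without the range test, s >= 16 would index past both tables.
 * Returns 0 on success, -1 if s is outside the tables. */
static int extend_checked(int bits, int s, int *diff)
{
  assert(diff != NULL);
  if (s < 0 || s > MAX_CATEGORY)
    return -1;
  *diff = (bits < extend_test[s]) ? bits + extend_offset[s] : bits;
  return 0;
}
```

In a design like this, rejecting large symbols when the table is built removes the need for the range test on every decoded symbol.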

Several things I need to understand before proceeding:

* How are such images (images with Huffman tables containing symbols >= 16 but that do not actually use those symbols) generated? How likely would it be for a user to encounter such an image in the wild?
* If the image actually did use a Huffman symbol >= 16, would it bork libjpeg-turbo? (Let's just say that I have my suspicions.)
* Is there an officially specified range limit for these symbols, or did you just increase the range limit arbitrarily to 17 to make one specific image work? In other words, if I adopted your patch, what would prevent someone from crafting a JPEG image with a Huffman table containing a symbol value of 18, which would cause the same error?

The answers to these questions will determine whether it makes sense to modify libjpeg-turbo (for instance, to change the error to a warning) or to simply add a code comment indicating that libjpeg-turbo is stricter than the spec requires and that this is because of limitations in the libjpeg-turbo code. The latter is my strong preference at the moment.

thorfdbg · 2022-03-26T09:11:10Z

Darrell, others,

Several things I need to understand before proceeding:

* How are such images (images with Huffman tables containing symbols >= 16 but that do not actually use those symbols) generated? How likely would it be for a user to encounter such an image in the wild?

One can only guess. Those symbols are not needed for correct decoding of progressive or sequential codestreams, but they may be needed for other processes. Potentially, a particular application uses a common (process-agnostic) default table and applies it irrespective of the particular coding process used in the end. All I can say is that there is no normative requirement that only the minimum number of symbols be populated, or that only the first 16 symbols be. It's an implementation decision. The ISO implementation uses default tables depending on the process. If, for some image or some configuration, it generates tables that populate more than the first 16 entries, please let me know, because I would not know how this could have happened (the code for that is in coding/huffmantemplate.cpp, lines 142 and following; val_dc_luminance/chrominance lists the symbols).
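
One way to find out whether a given file populates DC symbols beyond 15 is to walk its DHT segments. The sketch below scans the payload of a single DHT segment (the bytes after the two-byte length field), which per T.81 is a sequence of tables, each consisting of one class/identifier byte, sixteen per-length code counts, and then the symbol values. `scan_dht_payload` is a hypothetical helper, not an existing API.

```c
#include <assert.h>
#include <stddef.h>

/* Scans one DHT payload and reports the largest symbol value found in any
 * table of class 0 (DC or lossless). *max_dc_sym is set to -1 if the
 * payload contains no class-0 table.
 * Returns 0 on success, -1 if the payload is truncated or a table
 * declares more than 256 symbols. */
static int scan_dht_payload(const unsigned char *p, size_t len,
                            int *max_dc_sym)
{
  size_t pos = 0;

  assert(p != NULL && max_dc_sym != NULL);
  *max_dc_sym = -1;

  while (pos < len) {
    int tclass, i;
    size_t count = 0, j;

    if (len - pos < 17)          /* class/id byte plus 16 length counts */
      return -1;
    tclass = p[pos] >> 4;        /* high nibble: 0 = DC, 1 = AC */
    for (i = 1; i <= 16; i++)
      count += p[pos + i];
    pos += 17;
    if (count > 256 || len - pos < count)
      return -1;
    if (tclass == 0) {
      for (j = 0; j < count; j++) {
        if (p[pos + j] > *max_dc_sym)
          *max_dc_sym = p[pos + j];
      }
    }
    pos += count;
  }
  return 0;
}
```

A result greater than 15 means the file would trip the strict check regardless of whether the scan data ever uses those symbols.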

* If the image actually did use a Huffman symbol >= 16, would it bork libjpeg-turbo? (Let's just say that I have my suspicions.)

That would require a check in the code, plus proper error handling. Of course, I'm not in the depth of your (or Tom's) code for that matter. In the ISO code, this check is done in coding/huffmantemplate.cpp, line 853. However, this code is much younger, and of course, there may be errors, despite my attempt to avoid them. It hasn't undergone as careful (and years long) testing as libjpeg-turbo.

* Is there an officially specified range limit for these symbols, or did you just increase the range limit arbitrarily to 17 to make one specific image work? In other words, if I adopted your patch, what would prevent someone from crafting a JPEG image with a Huffman table containing a symbol value of 18, which would cause the same error?

The official limit is that the table can carry 256 symbols. That would of course not prevent any source from populating a symbol #18 or a symbol #255; this does not constitute an error by itself. It would be an error as soon as such a symbol is received during the actual decoding process. Again, I can only speak about the ISO implementation, where this particular check is made in codestream/sequentialscan.cpp, line 682. (This applies both to the sequential process and to the initial scan of the progressive process.)

Rather than the actual symbol, it would be worth checking how libjpeg-turbo handles an incoming bit sequence that does not form a valid Huffman code, a check that needs to be done anyhow. If it handles this case gracefully (aborting etc.), then one could simply invalidate the affected symbols after reading the table. The check would then trigger only if a particular codestream attempts to decode the corresponding bit sequences.

Let me just stress that the purposes of the two code bases are rather different. Darrell's code is a high-speed, production-quality implementation suitable as a black box for software that requires access to JPEG-encoded images. The ISO code is a reference to the standard and thus demonstrates how the standard should be read; it serves as a demonstration of the standard, and not so much as an actual implementation of it that one should use in other software. As such, it checks for normative requirements of the codestreams, but it may not be as robust as an implementation proven over many years, such as libjpeg-turbo.

Greetings,
Thomas
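
The suggestion to invalidate the affected symbols after reading the table, and to raise an error only if the codestream actually selects one, could look roughly like the sketch below. The `dc_symbol_table` structure, the sentinel, and both function names are hypothetical. libjpeg-turbo's derived tables are laid out differently, so this shows the idea only.

```c
#include <assert.h>
#include <stddef.h>

#define INVALID_SYMBOL (-1)

typedef struct {
  int sym[256];     /* one entry per defined code, in DHT order */
  int numsymbols;
} dc_symbol_table;

/* Copies the DHT symbols, replacing every value above max_sym with a
 * sentinel instead of rejecting the whole table.
 * Returns the number of entries that were invalidated. */
static int build_dc_symbol_table(dc_symbol_table *tbl,
                                 const unsigned char *huffval,
                                 int numsymbols, int max_sym)
{
  int i, invalidated = 0;

  assert(tbl != NULL && huffval != NULL);
  assert(numsymbols >= 0 && numsymbols <= 256);
  for (i = 0; i < numsymbols; i++) {
    if (huffval[i] > max_sym) {
      tbl->sym[i] = INVALID_SYMBOL;
      invalidated++;
    } else {
      tbl->sym[i] = huffval[i];
    }
  }
  tbl->numsymbols = numsymbols;
  return invalidated;
}

/* code_index is the position of the decoded code in DHT order.
 * Returns the symbol, or INVALID_SYMBOL if the code is undefined or was
 * invalidated when the table was built. */
static int decode_dc_symbol(const dc_symbol_table *tbl, int code_index)
{
  assert(tbl != NULL);
  if (code_index < 0 || code_index >= tbl->numsymbols)
    return INVALID_SYMBOL;
  return tbl->sym[code_index];
}
```

The caller has to test for the sentinel on every decoded symbol, so the strictness moves from table-build time to the decode path.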

dcommander · 2022-03-26T14:53:13Z

That would require a check in the code, plus proper error handling.

Right, but doing that in a performant manner may prove problematic. The Huffman decoder has a fast path that avoids branches (if() statements) as much as possible.
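
As an illustration of what avoiding branches means here, the sketch below writes the sign extension of a DC difference twice, once with an if() and once with a mask. Both functions are hypothetical and assume that `bits` holds an `s`-bit value with `s` between 1 and 15. This is not libjpeg-turbo's actual macro.

```c
#include <assert.h>
#include <stdint.h>

/* Straightforward version: one conditional per decoded value. */
static int32_t extend_branch(int32_t bits, int s)
{
  assert(s >= 1 && s <= 15);
  if (bits < (1 << (s - 1)))
    return bits + 1 - (1 << s);
  return bits;
}

/* Branch-free version: mask is all ones when the top bit of the s-bit
 * value is clear (negative difference) and zero otherwise, so the offset
 * is added only in the negative case. */
static int32_t extend_branchless(int32_t bits, int s)
{
  int32_t top, mask;

  assert(s >= 1 && s <= 15);
  top = (bits >> (s - 1)) & 1;
  mask = top - 1;               /* top == 0 gives -1, top == 1 gives 0 */
  return bits + (mask & (1 - (1 << s)));
}
```

A per-symbol validity test would reintroduce a conditional into code written this way.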

Again, I need to fully understand the ramifications of the proposed changes from a practical, as opposed to theoretical, perspective before I can consider them. libjpeg-turbo has an indirect user base of billions of people, yet it only has funding for a bit more than a day of full-time labor per month. Thus, the more leg work you and others can do to answer the questions I listed above, the more likely I will be to consider a modification. Otherwise, the default course of action will be to add a code comment and move on.

malaterre · 2022-03-28T14:25:08Z

Just for later reference. Here is what was done in one libjpeg-fork (DCMTK copy):

  if (isDC) {
    for (i = 0; i < numsymbols; i++) {
      int sym = htbl->huffval[i];
      if (sym < 0 || sym > 16)
#ifdef DCMTK_ENABLE_STRICT_HUFFMAN_TABLE_CHECK
        ERREXIT(cinfo, JERR_BAD_HUFF_TABLE);
#else
        TRACEMS1(cinfo, 1, JTRC_UNOPT_HUFF_TABLE, sym);
#endif
    }
  }

ref:

DCMTK/dcmtk@497043a

I cannot comment on the security aspect raised above. But my understanding is that this piece of code has been there since day one and will really only affect a limited set of people (those with JPEGs that lack optimized Huffman tables). I also believe this piece of code may be revisited when implementing #402 (if ever). 2cts

dcommander · 2022-03-28T22:47:04Z

The likelihood that lossless JPEG will ever be implemented in libjpeg-turbo is small. It would require a larger amount of funded development than any that has ever been secured for this project. My own research indicates that lossless JPEG is a mixed bag. It can produce better compression ratios than webp and PNG for some types of images but not for others, and the performance is not exceptional.

The only relevant question at the moment is: how should libjpeg-turbo handle JPEG images with unused DC symbols >= 16 in the Huffman tables? The answer to that depends on the answers to the three questions I posed above. Until/unless I get answers to those three questions, there will be no movement on this issue.

malaterre · 2022-03-29T06:54:47Z

@dcommander Forget my comment about lossless. This was simply a reference to the original issue where:

% jpeg -c -p lena.ppm lena.jpg

does produce a large Huffman table with unused symbols, which eventually exercises the code section in jpeg_make_d_derived_tbl.

dcommander · 2022-03-29T18:49:42Z

@malaterre Fair enough. That image would not successfully decode with libjpeg-turbo anyhow.

dcommander · 2022-04-05T16:26:08Z

@malaterre Actually, an image created with the command line above does not exercise the code section in jpeg_make_d_derived_tbl(). It fails earlier than that with Unsupported JPEG process: SOF type 0xc3.

So let me be more clear: I need an example of an image created with a known JPEG encoder (as opposed to an artificially-crafted image) that would otherwise decompress normally with libjpeg-turbo were it not for the strict checking of the DC Huffman table.

dcommander · 2022-06-14T18:32:45Z

I believe that we have given this issue adequate time to breathe. My opinion at the moment is that libjpeg had a good reason for disallowing DC symbols > 15. Until/unless someone can provide the information I requested above, there is insufficient basis for me to change that opinion. I will reopen this issue if further information emerges.

malaterre added the bug label Mar 25, 2022

malaterre assigned dcommander Mar 25, 2022

malaterre mentioned this issue Mar 25, 2022

Lossless JPEG: Valid range for DC table ? thorfdbg/libjpeg#68

Closed

dcommander closed this as completed Jun 14, 2022

dcommander added enhancement won't implement and removed bug labels Jun 14, 2022

dcommander mentioned this issue May 16, 2024

Allow all ones Huffman codes #765

Closed
