Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zlib decompression produces wrong output #6606

Closed
slonopotamus opened this issue Mar 10, 2021 · 15 comments
Closed

zlib decompression produces wrong output #6606

slonopotamus opened this issue Mar 10, 2021 · 15 comments

Comments

@slonopotamus
Copy link

slonopotamus commented Mar 10, 2021

Environment Information

Provide at least:

  • JRuby version (jruby -v) and command line: jruby 9.2.16.0 (2.5.7) 2021-03-03 f82228dc32 OpenJDK 64-Bit Server VM 25.275-b01 on 1.8.0_275-b01 +jit [linux-x86_64]
  • Operating system and platform (e.g. uname -a): Linux noblesse 5.4.80-gentoo-r1 #1 SMP Tue Dec 8 10:34:59 MSK 2020 x86_64 AMD Ryzen 7 3700X 8-Core Processor AuthenticAMD GNU/Linux

Way to reproduce

  1. Download attached data.gz
  2. Run ruby -e "require 'zlib'; f = File.read('data.gz'); puts Zlib::Inflate.inflate(f).bytesize" in the directory where you saved data.gz with different Ruby implementations

Expected Behavior

Uncompressed data size is 60992 bytes. This is what MRI says and what I actually expect.

$ zlib-flate -uncompress < data.gz > data && ls -l data agrees that uncompressed data is 60992 bytes.

$ openssl zlib -d < data.gz > data && ls -l data also thinks that uncompressed data is 60992 bytes.

$ python3 -c "import zlib; print(len(zlib.decompress(open('data.gz', 'rb').read())))" also thinks that uncompressed data is 60992 bytes.

Java also agrees data is 60992 bytes:

  public static void main(String[] args) throws Exception {
    byte[] input = Files.readAllBytes(Path.of("data.gz"));
    Inflater inflater = new Inflater();
    inflater.setInput(input);

    ByteArrayOutputStream output = new ByteArrayOutputStream();
    byte[] buffer = new byte[1024];

    while (!inflater.finished()) {
      int len = inflater.inflate(buffer);
      output.write(buffer, 0, len);
    }
    System.out.println(output.toByteArray().length); // prints 60992
  }

Actual Behavior

Uncompressed data size is 60994 bytes on JRuby. I observe this on JRuby 9.2.13 and 9.2.16. I didn't test other JRuby versions.

@slonopotamus slonopotamus changed the title zlib decompression differs from MRI zlib decompression produces wrong output Mar 10, 2021
@headius
Copy link
Member

headius commented Mar 10, 2021

The diff appears to be two null bytes at the end of the file. I would guess we are not properly setting the length of the final chunk.

@headius
Copy link
Member

headius commented Mar 11, 2021

The issue here seems to be that the gz file has an extra two \0 at the end of the file (after the end of deflated data), and we have logic that includes those characters in the inflated output. A comment says this mimics CRuby behavior, and I found some code in CRuby that looks similar... but this is clearly not correct behavior.

@headius
Copy link
Member

headius commented Mar 11, 2021

@slonopotamus Where did this gz file come from? I am wondering if these extra \0 are intentional or a bug somewhere else.

@slonopotamus
Copy link
Author

This data is a font, extracted from a mobi file that was generated by Amazon KindleGen. Font is stored in zlib-compressed form inside mobi file. Mobi file also stores expected uncompressed data length, that's how I noticed that JRuby output doesn't match what is expected.

@slonopotamus
Copy link
Author

Anyway, my point here is that I couldn't find any other zlib implementation that would agree with JRuby behavior.

@headius
Copy link
Member

headius commented Mar 11, 2021

@slonopotamus There is definitely a bug, but I was curious why those bytes (which are not part of the compressed content) are there in the first place.

@slonopotamus
Copy link
Author

why those bytes

Not sure. I'm using my own code to extract these bytes from a mobi file. I actually it tis JRuby bug while testing my code on JRuby. But I think that I'm properly following Mobi spec and these bytes are actually part of zlib'ed data. There are alternative implementations of the same code in libmobi and in Calibre, one could possibly hook a couple of print statements inside them to check that they operate on the same compressed chunk.

To give you some more background: mobi contains a bunch of "resources", each inside a "record" whose size is known in advance. Depending on resource type, record data is interpreted differently. Images are stored as-is, video has VIDE prefix that needs to be stripped, audio has AUDI prefix, fonts have FONT prefix, followed by uncompressed size, flags (was data compressed at all or not), some more crazy stuff and then actually data itself (that is assumed to be all remaining bytes up to "record" end. So, the format doesn't describe any things after compressed data. And if we look at Calibre implementation, they do what I do.

@headius
Copy link
Member

headius commented Mar 11, 2021

@slonopotamus Ok thank you, that is good background info. My reason for asking is that we have never had someone else report this, which means to me that nobody ever ran a gz file with extra bytes through our inflater. That makes this a rather unique case, and I wanted to understand why those bytes are there.

Digging back through your code, it seems like this is originating from whatever library provides "palmdb". There could be some other bug in that library, or in the functions it calls to do the extraction of this gz data, but I did not dig that far down.

If you feel so inclined to investigate that end of things, here is the last bit of the hexdump from both the gz and the resulting inflated content:

$ hexdump ~/Downloads/data.gz | tail
0008e00 da 8e 1e 41 3b d0 a3 68 27 7a 0c ed 42 8f a3 dd
0008e10 68 0f f4 0c 4f a0 7d 68 3f 99 e7 e7 66 f6 c9 2c
0008e20 ff 8b e8 04 7a 1d bd 81 de 44 67 d0 47 dc 6a cb
0008e30 5f d1 10 fa 1c 5d 40 5f a2 af d0 37 e8 5b f4 0f
0008e40 f4 3d fa 01 5d 42 3f a2 7f a2 7f a1 ab e8 17 f4
0008e50 2b fa 0d 5d 47 7f a0 9b 68 18 8d a0 51 8c 31 cd
0008e60 d9 43 b3 c4 36 d8 8e 5b c3 28 27 37 53 f2 73 06
0008e70 1f 24 ee e6 b6 83 be b9 3d fa 1a 71 cf f5 aa 2c
0008e80 ff 17 d8 f3 18 66 00 00                        
0008e88

$ jruby -rzlib -e "print Zlib::Inflate.inflate(File.read('/Users/headius/Downloads/data.gz'))" | hexdump | tail
000edc0 00 b3 00 b4 00 b5 00 b6 00 b7 00 b8 00 ba 00 bf
000edd0 00 c0 00 c1 00 cb 00 d0 00 d4 00 d5 00 d6 00 d7
000ede0 00 d8 00 db 00 dd 00 de 00 e1 00 e2 00 e4 00 e6
000edf0 00 e7 00 e8 00 e9 00 ea 00 ed 00 f0 00 f4 00 f5
000ee00 00 f6 00 f8 00 f9 00 fb 00 fc 00 fd 00 ff 01 01
000ee10 01 03 01 08 01 09 01 0b 01 0e 01 10 01 57 01 58
000ee20 01 5b 01 88 00 00 00 01 00 00 00 00 cc 3d a2 cf
000ee30 00 00 00 00 c1 9a 34 f2 00 00 00 00 cf 79 35 0b
000ee40 00 00                                          
000ee42

It is peculiar, at the very least.

@slonopotamus
Copy link
Author

"palmdb" is part of the same library I'm referring to :) I'll investigate whether those trailing zero bytes were supposed to be part of zlib stream or had be stripped off by some kind of lower-level logic that handles mobi records, though it doesn't cancel current JRuby bug.

@headius
Copy link
Member

headius commented Mar 11, 2021

it doesn't cancel current JRuby bug.

Indeed. Looking into a fix now.

@slonopotamus
Copy link
Author

slonopotamus commented Mar 11, 2021

Upd: I'm confirming that libmobi (completely independent MOBI reading library) passes the same amount of bytes (36488) as input to zlib and expects to receive back the same amount of bytes (60992) as my Ruby implementation when given the same MOBI file. If you would like to reproduce this with libmobi, you would need to:

  1. git clone git@github.com:bfabiszewski/libmobi.git
  2. cd libmobi
  3. Edit src/util.c by adding some printing encoded_size and decoded_size around int ret = m_uncompress(*decoded_font, (unsigned long *) decoded_size, encoded_font, encoded_size); (m_uncompress is actually a call to zlib)
  4. make
  5. ./tools/mobitool -s <sample-book.mobi> (it's just a command to unpack MOBI file into a EPUB-like file collection)

And it will print:

Reconstructing source resources...
36488 -> 60992

36488 is exactly the size of data.gz that I initially attached to this bugreport and 60992 is exactly the size of output data that I expect.

So, I'm pretty confident that "palmdb" passes right compressed data to Ruby code that performs zlib uncompression.

@headius
Copy link
Member

headius commented Mar 11, 2021

Ok, so this is likely not covered by the available Ruby zlib tests. I will add some along with the fix.

@headius
Copy link
Member

headius commented Mar 11, 2021

FWIW this does not appear to be a bug in jzlib. We actually add these extra bytes in JRuby code, for some reason.

headius added a commit to headius/jruby that referenced this issue Mar 11, 2021
@headius headius added this to the JRuby 9.2.17.0 milestone Mar 11, 2021
headius added a commit to headius/jruby that referenced this issue Mar 16, 2021
The finish result will include data that came after the compressed
section from the input. This is appropriate when calling finish to
terminate a previous compressed segment and proceed to the next
segment, but the class-level #inflate should only return the
inflated content.

Fixes jruby#6606
@headius
Copy link
Member

headius commented Mar 16, 2021

@slonopotamus Have a look at the new fix in #6612. It passes your case, existing zlib specs, and makes a couple specs pass that didn't before.

@headius headius closed this as completed Mar 17, 2021
@slonopotamus
Copy link
Author

Confirming that jruby-9.2 branch properly passes initial testcase from this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants