archive_entry_size incorrectly ignores ZIP's Data Descriptor records #1764
Comments
Here's the result of …
Ah, that's trickier than I first thought, and would require API (re)design. Fixing this issue would touch more than just the … Even though calling … Specifically, AFAICT, the only relevant API function that takes a …
The latter can return 0 for some archive file formats. See libarchive/libarchive#1764
This is pretty much a WONTFIX for me. Extracting data in streaming processing is best effort. If you want to know the size afterwards, you have to do the accounting yourself. This is a hard limitation of the file format (that file sizes are optional outside the central directory).
I don't think there's a simple way to change this behavior. As Jason pointed out, you can skim the contents (via …).

At one time, I had considered also returning the central directory entries as separate entries in streaming mode. With that design, every zip entry would be returned twice: the first time, as now, would allow you to read contents but might have incomplete metadata; the second time would return complete metadata (from the central directory) but incomplete contents. I'm not sure how intrusive this would actually be: it would certainly impact the extract-to-disk logic ("extracting" the second entry would need to update metadata on disk without touching contents), though it has some similarities to how we currently handle hardlinks. Finding a clean way to avoid redundant syscalls would be an interesting challenge.
Background
Here's a puzzle. I have a zip file containing 2 constituent files.
`unzip -l` gives the file sizes as 305 and 320 bytes. libarchive (see the `main.c` program below) also says 305 and 320 bytes.

However, if I gzip the zip file and pass it to libarchive, I get 0 and 320, not 305 and 320. This is incorrect, and I believe that this is a bug in libarchive (as opposed to a malformed zip file).
The `test_data_descriptor.zip` file can be downloaded from https://github.com/adamhathcock/sharpcompress/files/242365/test_data_descriptor.zip and is referenced from adamhathcock/sharpcompress#88. Its contents are:
The `main.c` program is:

ZIP File Format
As you may already know, a ZIP file, roughly speaking, is a sequence of records, such as a Local File Header record, a Data Descriptor record, a Central Directory File Header record, or an End Of Central Directory record. The Wikipedia page https://en.wikipedia.org/wiki/ZIP_(file_format)#Structure has a good overview and details the fields per record. See also https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT, which is the closest thing we have to an official ZIP file format specification.
Both the LFH (Local File Header) and CDFH (Central Directory File Header) records contain a little-endian 4-byte "uncompressed size" field. For the first file (with filename "-"), the LFH one is at byte offset 0x0016, with value 0x00000000. The CDFH one is at offset 0x011f, with value 0x00000131 = 305.
For `test_data_descriptor.zip`, libarchive has random access to the zip file and uses the CDFH records' uncompressed size (the CDFH records are at the end of the file). For `test_data_descriptor.zip.gz` (note the `.zip.gz` suffix), libarchive can automatically decompress the outer GZIP compression, but the decompressed bytes are only available under sequential (streaming) access, not random access, so libarchive reports the LFH records' uncompressed size.

(Not a) Malformed ZIP File
My initial reaction was that the `test_data_descriptor.zip` file was malformed, since the two "uncompressed size" fields are inconsistent. However, there is a further subtlety. The two bytes starting at offset 0x0006 are the "general purpose bit flag" bytes, and the value here is 0x0008. APPNOTE.TXT section 4.4.4 "general purpose bit flag" says: "Bit 3: If this bit is set, the fields crc-32, compressed size and uncompressed size are set to zero in the local header. The correct values are put in the data descriptor immediately following the compressed data." Bit 3 is, of course, the 0x0008 bit that is set in `test_data_descriptor.zip`'s first LFH.

The first Data Descriptor starts with the "50 4b 07 08" signature at byte offset 0x0031, and its "uncompressed size" field is at byte offset 0x003d, with value 0x00000131 = 305.
So the `test_data_descriptor.zip` file is valid, and both (LFH+DD) and (CDFH) report the same "uncompressed size" value, 305, but libarchive reports the wrong value, presumably because it reports the literal 0 value in the LFH record and ignores the DD record.

Expected Behavior
Given libarchive's iterator model (only requiring sequential access, not random access), I would expect `archive_entry_size_is_set` to still return 0 before the `archive_read_data_skip` call but return non-zero afterwards (and for `archive_entry_size` to then return 305).
to then return 305).Workaround
For my particular program, I could work around it if libarchive treated `test_data_descriptor.zip.gz` as a RAW `data.gz` file instead of a (possibly separately compressed) ZIP `data.zip` file, so I could use libarchive once to extract the 'data' to a real (temporary) file and then use libarchive a second time on that real (seekable) file to use (CDFH) values instead of (LFH+DD) values. However, per the `printf` calls above, `archive_format` doesn't distinguish between `test_data_descriptor.zip` and `test_data_descriptor.zip.gz` and returns `ARCHIVE_FORMAT_ZIP` for both.

Is there a way for libarchive to treat it as raw (for `archive_format` to return `ARCHIVE_FORMAT_RAW`) for the second case?