archive_entry_size incorrectly ignores ZIP's Data Descriptor records #1764

nigeltao · 2022-08-11T06:00:26Z

Background

Here's a puzzle. I have a zip file containing 2 constituent files. unzip -l gives the file sizes as 305 and 320 bytes. libarchive (see the main.c program below) also says 305 and 320 bytes.

However, if I gzip the zip file and pass it to libarchive, I get 0 and 320, not 305 and 320. This is incorrect, and I believe that this is a bug in libarchive (as opposed to a malformed zip file).

$ ls
main.c	test_data_descriptor.zip

$ unzip -l test_data_descriptor.zip
Archive:  test_data_descriptor.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
      305  2015-09-05 16:32   -
      320  2015-09-05 22:00   second.txt
---------                     -------
      625                     2 files

$ gzip --keep test_data_descriptor.zip

$ gcc main.c -larchive

$ ./a.out test_data_descriptor.zip
-
  RAW: 0,   ZIP: 1
  size before: 305 (64)
  size after:  305 (64)
second.txt
  RAW: 0,   ZIP: 1
  size before: 320 (64)
  size after:  320 (64)

$ ./a.out test_data_descriptor.zip.gz
-
  RAW: 0,   ZIP: 1
  size before: 0 (0)
  size after:  0 (0)
second.txt
  RAW: 0,   ZIP: 1
  size before: 320 (64)
  size after:  320 (64)

The test_data_descriptor.zip file can be downloaded from
https://github.com/adamhathcock/sharpcompress/files/242365/test_data_descriptor.zip
and is referenced from
adamhathcock/sharpcompress#88
Its contents are:

$ hd test_data_descriptor.zip
00000000  50 4b 03 04 14 00 08 00  08 00 0e 84 25 47 00 00  |PK..........%G..|
00000010  00 00 00 00 00 00 00 00  00 00 01 00 00 00 2d 73  |..............-s|
00000020  f3 0c 0a 0e 51 70 23 87  e4 e5 22 4b db 10 d7 0c  |....Qp#..."K....|
00000030  00 50 4b 07 08 01 61 6f  5b 12 00 00 00 31 01 00  |.PK...ao[....1..|
00000040  00 50 4b 03 04 14 00 08  00 08 00 1a 70 25 47 00  |.PK.........p%G.|
00000050  00 00 00 00 00 00 00 40  01 00 00 0a 00 7b 00 73  |.......@.....{.s|
00000060  65 63 6f 6e 64 2e 74 78  74 53 44 66 00 ac 00 00  |econd.txtSDf....|
00000070  00 00 08 00 a4 e0 ad bb  63 64 60 69 10 61 60 60  |........cd`i.a``|
00000080  30 60 80 00 1f 20 66 64  05 33 59 45 81 84 42 7b  |0`... fd.3YE..B{|
00000090  86 e1 b2 5c c9 69 c7 67  ca f4 bd 60 c6 2d c7 c8  |...\.i.g...`.-..|
000000a0  c4 c0 c0 c4 90 c0 c0 02  96 96 60 f8 cf 28 cf 00  |..........`..(..|
000000b0  12 03 a9 55 00 a9 05 b3  45 20 e2 8c 10 71 21 06  |...U....E ...q!.|
000000c0  88 d8 7e 46 61 b8 18 37  54 ff 4a 06 21 14 fd 8a  |..~Fa..7T.J.!...|
000000d0  40 36 00 55 54 0d 00 07  74 d9 ea 55 5f d9 ea 55  |@6.UT...t..U_..U|
000000e0  5f d9 ea 55 0b 76 75 f6  f7 73 51 08 26 8f e2 e5  |_..U.vu..sQ.&...|
000000f0  22 53 e3 b0 d1 0f 00 50  4b 07 08 e0 08 02 5a 13  |"S.....PK.....Z.|
00000100  00 00 00 40 01 00 00 50  4b 01 02 1e 00 14 00 08  |...@...PK.......|
00000110  00 08 00 0e 84 25 47 01  61 6f 5b 12 00 00 00 31  |.....%G.ao[....1|
00000120  01 00 00 01 00 00 00 00  00 00 00 01 00 00 00 00  |................|
00000130  00 00 00 00 00 2d 50 4b  01 02 1e 00 14 00 08 00  |.....-PK........|
00000140  08 00 1a 70 25 47 e0 08  02 5a 13 00 00 00 40 01  |...p%G...Z....@.|
00000150  00 00 0a 00 11 00 00 00  00 00 01 00 20 00 00 00  |............ ...|
00000160  41 00 00 00 73 65 63 6f  6e 64 2e 74 78 74 53 44  |A...second.txtSD|
00000170  04 00 ac 00 00 00 55 54  05 00 07 74 d9 ea 55 50  |......UT...t..UP|
00000180  4b 05 06 00 00 00 00 02  00 02 00 78 00 00 00 07  |K..........x....|
00000190  01 00 00 00 00                                    |.....|
00000195

The main.c program is:

$ cat main.c
#include <archive.h>
#include <archive_entry.h>
#include <stdio.h>

int main(int argc, char** argv) {
  struct archive* a;
  struct archive_entry* entry;
  int r;

  if (argc < 2) {
    fprintf(stderr, "No archive listed.\n");
    return 1;
  }

  a = archive_read_new();
  archive_read_support_filter_all(a);
  archive_read_support_format_all(a);
  r = archive_read_open_filename(a, argv[1], 10240);
  if (r != ARCHIVE_OK) {
    fprintf(stderr, "archive_read_open_filename failed.\n");
    return 1;
  }

  while (archive_read_next_header(a, &entry) == ARCHIVE_OK) {
    printf("%s\n", archive_entry_pathname(entry));
    printf("  RAW: %d,   ZIP: %d\n", archive_format(a) == ARCHIVE_FORMAT_RAW,
           archive_format(a) == ARCHIVE_FORMAT_ZIP);
    printf("  size before: %d (%d)\n", (int)archive_entry_size(entry),
           archive_entry_size_is_set(entry));
    archive_read_data_skip(a);
    printf("  size after:  %d (%d)\n", (int)archive_entry_size(entry),
           archive_entry_size_is_set(entry));
  }

  r = archive_read_free(a);
  if (r != ARCHIVE_OK) {
    fprintf(stderr, "archive_read_free failed.\n");
    return 1;
  }
  return 0;
}

ZIP File Format

As you may already know, a ZIP file, roughly speaking, is a sequence of records, such as a Local File Header record, Data Descriptor record Central Directory File Header record or End Of Central Directory record. The https://en.wikipedia.org/wiki/ZIP_(file_format)#Structure Wikipedia page has a good overview, and details the fields per record. See also https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT which is the closest thing we have to an official ZIP file format specification.

Both the LFH (Local File Header) and CDFH (Central Directory File Header) records contain a little-endian 4 byte "uncompressed size" field. For the first file (with filename "-"), the LFH one is at byte offset 0x0016, with value 0x00000000. The CDFH one is at offset 0x011f, with value 0x00000131 = 305.

For test_data_descriptor.zip, libarchive has random access to the zip file and uses the CDFH records' uncompressed size (and CDFH records are at the end of the file). For test_data_descriptor.zip.gz (and note the .zip.gz suffix), libarchive can automatically decompress the outer GZIP compression but the decompressed bytes are only available under sequential (streaming) access, not random access, so libarchive reports the LFH records' uncompressed size.

(Not a) Malformed ZIP File

My initial reaction was that the test_data_descriptor.zip file was malformed, since the two "uncompressed size" fields are inconsistent. However, there is a further subtlety. The two bytes starting at offset 0x0006 are "general purpose bit flag" bytes and the value here is 0x0008.

APPNOTE.TXT section 4.4.4 "general purpose bit flag" says "Bit 3: If this bit is set, the fields crc-32, compressed size and uncompressed size are set to zero in the local header. The correct values are put in the data descriptor immediately following the compressed data." Bit 3 is, of course, the 0x0008 bit that is set in test_data_descriptor.zip's first LFH.

The first Data Descriptor starts with the "50 4b 07 08" signature at byte offset 0x0031 and its "uncompressed size" field is at byte offset 0x003d, with value 0x00000131 = 305.

So the test_data_descriptor.zip file is valid, and both (LFH+DD) and (CDFH) are reporting the same "uncompressed size" value, 305, but libarchive reports the wrong value, presumably because it reports the literal 0 value in the LFH record and ignores the DD record.

Expected Behavior

Given libarchive's iterator model (only requiring sequential access, not random access), I would expect archive_entry_size_is_set to still return 0 before the archive_read_data_skip call but return non-zero afterwards (and for archive_entry_size to then return 305).

Workaround

For my particular program, I could work around it if libarchive treated test_data_descriptor.zip.gz as a RAW data.gz file instead of a (possibly separately compressed) ZIP data.zip file, so I could use libarchive once to extract the 'data' to a real (temporary) file and use libarchive a second time on that real (seekable) file to use (CDFH) values instead of (LFH+DD) values. However, per the printf calls above, archive_format doesn't distinguish between test_data_descriptor.zip and test_data_descriptor.zip.gz and returns ARCHIVE_FORMAT_ZIP for both.

Is there a way for libarchive to treat it as raw (for archive_format to return ARCHIVE_FORMAT_RAW) for the second case?

The text was updated successfully, but these errors were encountered:

nigeltao · 2022-08-14T06:30:41Z

Here's the result of gzip --keep test_data_descriptor.zip, if you want to reproduce this.

test_data_descriptor.zip.gz

nigeltao · 2022-08-14T06:44:57Z

Expected Behavior... archive_entry_size_is_set to... return non-zero afterwards.

Ah, that's trickier than I first thought, and would require API (re)design. Fixing this issue would touch more than just the archive_read_support_format_zip.c source file.

Even though calling archive_read_data_skip calls consume_optional_marker in archive_read_support_format_zip.c and consume_optional_marker already parses the Data Descriptor record, it only updates the uncompressed_size field of a struct zip_entry and not a struct archive_entry.

Specifically, archive_read_data_skip is given only one argument: a struct archive* pointer. There is no struct archive_entry* pointer that it can pass on to consume_optional_marker. Given the existence of the archive_read_next_header2 function, the struct archive_entry* is possibly owned by the caller in general, not the callee, so the callee (libarchive) can't save the struct archive_entry* pointer between calls.

AFAICT, the only relevant API function that takes a struct archive_entry* argument is archive_read_next_header (or archive_read_next_header2), which moves to the next entry, so it cannot be used to update the prior struct archive_entry.

The latter can return 0 for some archive file formats. See libarchive/libarchive#1764

jsonn · 2022-08-14T23:56:17Z

This is pretty much a WONTFIX for me. Extracting data in streaming processing is best effort. If you want to know the size afterwards, you have to do the accounting yourself. This is a hard limitation of the file format (that file sizes are optional outside the central directory).

kientzle · 2022-08-21T00:35:35Z

I don't think there's a simple way to change this behavior. As Jason pointed out, you can skim the contents (via archive_read_data_block) without writing the data to disk to obtain a proper size in these cases.

At one time, I had considered also returning the central directory entries as separate entries in streaming mode. With that design, every zip entry would get returned twice: The first time as now would allow you to read contents but might have incomplete metadata; the second time would return complete metadata (from the central directory) but incomplete contents. I'm not sure how intrusive this would actually be: It would certainly impact the extract-to-disk logic ("extracting" the second entry would need to update metadata on disk without touching contents), though it has some similarities to how we currently handle hardlinks. Finding a clean way to avoid redundant syscalls would be an interesting challenge.

nigeltao added a commit to google/fuse-archive that referenced this issue Aug 14, 2022

Have my_read use node size not archive_entry_size

14e3415

The latter can return 0 for some archive file formats. See libarchive/libarchive#1764

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

archive_entry_size incorrectly ignores ZIP's Data Descriptor records #1764

archive_entry_size incorrectly ignores ZIP's Data Descriptor records #1764

nigeltao commented Aug 11, 2022

nigeltao commented Aug 14, 2022

nigeltao commented Aug 14, 2022

jsonn commented Aug 14, 2022

kientzle commented Aug 21, 2022

archive_entry_size incorrectly ignores ZIP's Data Descriptor records #1764

archive_entry_size incorrectly ignores ZIP's Data Descriptor records #1764

Comments

nigeltao commented Aug 11, 2022

Background

ZIP File Format

(Not a) Malformed ZIP File

Expected Behavior

Workaround

nigeltao commented Aug 14, 2022

nigeltao commented Aug 14, 2022

jsonn commented Aug 14, 2022

kientzle commented Aug 21, 2022