
zip file format details #12

Closed
mikeabdullah opened this issue Jan 17, 2013 · 2 comments

@mikeabdullah
Contributor

I originally posted this as part of another issue, as I was working my way through the reasoning for it in order to come to a final conclusion. Split out here as requested though:

OK, so first of all, let me make sure I've understood the layout of a zip file:

LocalFileHeader — includes dates, crc32, sizes, filename etc.
FileData — likely compressed, up to the archive generator, and specified in the header
DataDescriptor — also includes crc32 & sizes
repeat the above until you run out of files
CentralFileHeader — notes dates, sizes etc. of, I think, the overall archive?
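The layout above can be made concrete with a minimal Python sketch of the fixed, 30-byte portion of a LocalFileHeader (field order and widths are from the PKZIP APPNOTE; the helper name and return shape are mine):

```python
import struct

# Fixed 30-byte portion of a LocalFileHeader (little-endian): signature,
# version needed, flags, compression method, mod time, mod date, crc32,
# compressed size, uncompressed size, filename length, extra field length.
LOCAL_FILE_HEADER = struct.Struct("<IHHHHHIIIHH")
LOCAL_SIG = 0x04034B50  # b"PK\x03\x04"

def read_local_header(f):
    """Parse one LocalFileHeader at the current offset of file object `f`."""
    (sig, _version, flags, method, _mtime, _mdate,
     crc32, comp_size, uncomp_size, name_len, extra_len) = \
        LOCAL_FILE_HEADER.unpack(f.read(LOCAL_FILE_HEADER.size))
    if sig != LOCAL_SIG:
        raise ValueError("not a LocalFileHeader")
    name = f.read(name_len).decode("cp437")  # legacy zip filename encoding
    f.read(extra_len)  # skip the extra field
    return {"name": name, "flags": flags, "method": method, "crc32": crc32,
            "compressed": comp_size, "uncompressed": uncomp_size}
```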

The apparent duplication of LocalFileHeader and DataDescriptor strikes me as weird. From poking around zip archives I have here, it looks like LocalFileHeader generally reports file sizes to be 0. So I'm guessing unarchivers take that as a signal to consult the DataDescriptor for the true size of the file? Thus allowing a file to be written into the archive without yet knowing what its compressed size will be.

And then by waiting until the end of the archive to note the extra info associated with it, that also simplifies generation.

The bit I don't understand, though: how do unarchivers know how to find the individual files and the CentralFileHeader if each LocalFileHeader claims its data is of zero length? Perhaps:

  • The compressed data can be easily analysed in some fashion to know when it stops.
  • The search is purely for an identifier, something like ZZDataDescriptor.signature. If so, how do you avoid hitting false positives inside the compressed data?
  • The CentralFileHeader contains knowledge of each file's length/location, and can be consulted to work back and find the files. But if so, how do you find the central header itself?!
@pixelglow
Owner

The apparent duplication of LocalFileHeader and DataDescriptor strikes me as weird. From poking around zip archives I have here, it looks like LocalFileHeader generally reports file sizes to be 0. So I'm guessing unarchivers take that as a signal to consult the DataDescriptor for the true size of the file? Thus allowing a file to be written into the archive without yet knowing what its compressed size will be.

Either:

  • the archiver knows before writing the entry what its size will be. Then LocalFileHeader reports the true size, DataDescriptor is not written, or
  • the archiver doesn't know. Then LocalFileHeader has 0 size, and a DataDescriptor is written with the file size.
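In other words, general-purpose bit 3 of the LocalFileHeader flags is what signals the second case. A hypothetical helper (names and argument shapes are mine; the bit value is from the zip spec) might decide where to look:

```python
# General-purpose bit 3: sizes and CRC were deferred to a DataDescriptor.
STREAMED = 1 << 3

def entry_sizes(flags, header_sizes, descriptor_sizes):
    """Pick (compressed, uncompressed) sizes for an entry.

    `header_sizes` come from the LocalFileHeader; `descriptor_sizes` come
    from the trailing DataDescriptor (or None if one was never written).
    """
    if flags & STREAMED:
        # The LocalFileHeader wrote zeros; the DataDescriptor is authoritative.
        return descriptor_sizes
    return header_sizes
```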

The latter is often used in streaming scenarios. zipzap handles both streaming and non-streaming scenarios, but for cleanness of implementation, I always treat them as streaming.

The bit I don't understand, though: how do unarchivers know how to find the individual files and the CentralFileHeader if each LocalFileHeader claims its data is of zero length?

This is tricky and is really up to the quality of the unarchiver. I've based zipzap off the algorithm expressed in minizip, which has a fair amount of history to it.

  • Look for the EndOfCentralDirectory signature. Because the EndOfCentralDirectory is constrained to be at most 65,557 bytes (22 fixed bytes plus a variable-length comment of at most 65,535 bytes at the end), we look for it in the last 65,557 bytes of the file, searching backwards.
  • Having got a candidate EndOfCentralDirectory signature, we sanity check as many of the EndOfCentralDirectory fields as possible. In particular the comment length field has to be consistent with the actual discovered length of the EndOfCentralDirectory.
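That backward search plus sanity check might look like this in outline (a Python sketch, not zipzap's actual Objective-C code; the consistency test here is just "the comment must run exactly to end-of-file"):

```python
import struct

EOCD_SIG = b"PK\x05\x06"
# Fixed 22-byte EndOfCentralDirectory: signature, disk number, central
# directory start disk, entries on this disk, total entries, central
# directory size, central directory offset, comment length.
EOCD = struct.Struct("<IHHHHIIH")

def find_eocd(data):
    """Backward search for a consistent EndOfCentralDirectory record."""
    # The record can sit at most 22 + 65535 bytes from the end of the
    # file, since the comment length field is only 16 bits.
    tail_start = max(0, len(data) - (EOCD.size + 0xFFFF))
    pos = data.rfind(EOCD_SIG, tail_start)
    while pos != -1:
        if pos + EOCD.size <= len(data):
            fields = EOCD.unpack_from(data, pos)
            comment_len = fields[-1]
            # Sanity check: the comment must run exactly to end-of-file.
            if pos + EOCD.size + comment_len == len(data):
                return pos, fields
        # Inconsistent candidate (e.g. a stray signature inside the
        # comment): keep searching backwards.
        pos = data.rfind(EOCD_SIG, tail_start, pos)
    raise ValueError("no EndOfCentralDirectory found")
```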

This search is complicated by the possibility that a perverse archiver could write out a second (or third, etc.) EndOfCentralDirectory signature inside the variable-length comment. Because of this (and for performance reasons), zipzap does a backward search and only uses the last header it finds. It could be argued that this is incorrect, since this is not a "real" header but part of a variable-length comment. But I chose an algorithm that will work even with a perverse archiver, not necessarily one that will work correctly in such a situation.

Some refinements I did think of

  • Searching forward but skipping inconsistent candidate EndOfCentralDirectory structures. This is probably more correct but performance-intensive, since you would always end up scanning ~64KB of the tail of any large zip file.
  • Searching backward but continuing past inconsistent structures. This won't avoid the perverse archive above, but might be slightly more robust, e.g. if the variable-length comment happened to contain the signature by accident.

Some other unarchivers do a search forward from the start of the zip. This has several issues:

  • it is I/O-intensive on large zip files, since you would have to sample a couple of bytes, skip, rinse and repeat to get all the information you need.
  • it will work 100% correctly in the non-streaming scenario, but you cannot guarantee that a zip file is non-streaming.
  • in the streaming scenario, you will need to somehow limit the entry reading to the correct size without being told what it is in the LocalFileHeader! This is possible for some formats, e.g. deflate compression, but not in general. It forces you to do expensive decompression just to discover metadata, or stall on formats you cannot interpret.
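To illustrate the deflate case: a raw deflate stream is self-terminating, so an unarchiver can decompress it purely to discover where it ends, which is exactly the expensive work described above. A sketch using Python's zlib (the function name is mine; wbits=-15 selects raw deflate with no zlib header):

```python
import zlib

def deflate_stream_end(data, start):
    """Decompress a raw deflate stream beginning at `start` just to learn
    where it ends, as a forward-scanning unarchiver is forced to do when
    the LocalFileHeader reports zero sizes."""
    d = zlib.decompressobj(wbits=-15)  # raw deflate, no zlib header
    out = d.decompress(data[start:])
    # unused_data holds whatever followed the stream's final block, so the
    # stream ends where the unused tail begins.
    end = len(data) - len(d.unused_data)
    return end, out
```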

@mikeabdullah
Contributor Author

Wow, it is quite tricky then!

It occurs to me that unarchivers can take different strategies depending on the task at hand. For example, if the end goal is to decompress an archive, then there's no harm in searching through the data, decompressing as you go, to find the data descriptors.

Also for situations where compressed data sizes are already known, zipzap could potentially be kind to unarchivers and generate local file headers that contain the file size.

Thank you for filling me in.

macguru pushed a commit to ulyssesapp/zipzap that referenced this issue Jul 18, 2020
…fixes-11 to develop

* commit '068f2b6a8c65f4abda0d1fd6f837c56ad38a7b13':
  ULYSSES-3444 Fixes a crash when using a defective list format string