Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

archive/zip: Zip reading with prefixed bytes #10464

Open
philwebb opened this issue Apr 15, 2015 · 17 comments

Comments

@philwebb
Copy link

commented Apr 15, 2015

As discussed here it is possible for Java jar files (zip formatted archives) to include a prefixed bash script that allows them to act like fully executable applications. Many existing tools (java -jar, unzip etc) support zip files with additional prefixed bytes.

Unfortunately go currently doesn't support the reading of such archives which causes problems for the Cloud Foundry cli tool.

@philwebb

This comment has been minimized.

Copy link
Author

commented Apr 15, 2015

I have a potential patch pushed here, I can try to submit it via gerrit if it looks like a sensible approach.

@mikioh mikioh changed the title Zip handling with prefixed bytes archive/zip: Zip handling with prefixed bytes Apr 15, 2015

@bradfitz

This comment has been minimized.

Copy link
Member

commented Apr 15, 2015

Wasn't this already fixed when we added http://tip.golang.org/pkg/archive/zip/#Writer.SetOffset ?

In de573f8 and 1f35bb6 (Issue #8669)

@philwebb

This comment has been minimized.

Copy link
Author

commented Apr 15, 2015

@bradfitz Looking at that issue I think it's specifically about writing files, this one would be for reading files. I added a testcase that appeared to fail before starting.

It's also worth considering that in this particular case, the size of the prefixed bytes isn't known until you read the zip central directory record. This means that an equivalent withOffset reader method wouldn't be particularly useful for most people since they would need to duplicate the code that reads the central directory end record.

@philwebb

This comment has been minimized.

Copy link
Author

commented Apr 15, 2015

Also, this is my first experience writing go, so apologies if the suggested changes aren't using the proper conventions. I'm happy to try and rework things before submitting to gerrit if there is something off about the suggested changes.

@bradfitz

This comment has been minimized.

Copy link
Member

commented Apr 15, 2015

You didn't mention in your bug report that this is about reading. Can you write a self-contained bug report here without referencing other URLs, discussions, or patches?

I see no reason why reading such a file would fail. The zip TOC is at the end, so it doesn't matter what's at the beginning of a zip file if the TOC is correct.

@philwebb philwebb changed the title archive/zip: Zip handling with prefixed bytes archive/zip: Zip reading with prefixed bytes Apr 15, 2015

@philwebb

This comment has been minimized.

Copy link
Author

commented Apr 15, 2015

Apologies, I've updated the title. I thought the references would be useful.

Although the TOC is located correctly and can be read, the offsets that are stored in the directory structure are assumed to be relative to the front of the file. This means that when you actually try to read the data you're hitting the incorrect part of the underling file since there are additional bytes at the font that aren't considered.

The suggested fix works out the size of these additional bytes and then includes them when reading directory entries and file content.

@bradfitz

This comment has been minimized.

Copy link
Member

commented Apr 15, 2015

So it's an invalid zip file?

Is there a spec for this? Is there some bit in the TOC that says this adjustment should be done?

@philwebb

This comment has been minimized.

Copy link
Author

commented Apr 15, 2015

I'm guessing that the spec is a little open to interpretation here. Going from https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT descriptions such as:

3.12 Central directory structure:
... relative offset of local header 4 bytes

leave things a little ambiguous as to what the offset is "relative" to.

I do know that such files can be read by unzip and zipinfo and by java -jar.

@philwebb

This comment has been minimized.

Copy link
Author

commented Apr 15, 2015

Here is the output from zipinfo for the archive I added to the test case if that helps:

$ zipinfo archive/zip/testdata/test-prefix-junk.zip 
Archive:  archive/zip/testdata/test-prefix-junk.zip   1209   2
warning [archive/zip/testdata/test-prefix-junk.zip]:  39 extra bytes at beginning or within zipfile
  (attempting to process anyway)
-rw-r--r--  3.0 unx       26 tx defN  4-Sep-10 19:12 test.txt
-rw-r--r--  3.0 unx      785 bx stor  4-Sep-10 22:52 gophercolor16x16.png
2 files, 811 bytes uncompressed, 810 bytes compressed:  0.1%

Note specifically the 39 extra bytes at beginning or within zipfile is presented as a warning but the file can still be read.

@philwebb

This comment has been minimized.

Copy link
Author

commented Apr 22, 2015

@bradfitz Is there anything else I can do for this one or do you consider prefixed bytes to be a spec violation? I'm happy to revisit the code if you want changes.

@bradfitz

This comment has been minimized.

Copy link
Member

commented Apr 22, 2015

I'm curious which recovery strategies other tools use. I don't want to implement a unique one that doesn't agree with the leniency of other tools, in either direction.

@philwebb

This comment has been minimized.

Copy link
Author

commented Apr 30, 2015

I've tried to understand what the stock unzip application does but I'm not that proficient with C code. The source is available at ftp.info-zip.org/pub/infozip/src/ and with version 6 it looks like line 857 of process.c is where it calculates how many extra bytes are at the front of the file.

The Java implementation also delegates to C, I believe that the relevant code is here. This one seems a bit easier to understand.

@bradfitz

This comment has been minimized.

Copy link
Member

commented Apr 30, 2015

I imagine on error (when we missed finding the directory header) we'd just scan forward looking for the directory header (0x02014b50) and for each case of 0x02014b50 we find, remember its offset from the real one, then load all the file headers with that offset, stopping once we find an offset that works.

But this is too late for Go 1.5. The tree closes in 1.5 days and the few people who regularly hack on this package are on vacation or busy.

For now I recommend you just do this loop yourself. You could instead scan forwards from the beginning, looking for file headers (0x04034b50) and whenever you find a 0x04034b50, try chopping off that prefix and opening the zip anew. io.NewSectionReader will give you a ReaderAt view into another one, letting you cheaply remove prefix bytes.

@bradfitz bradfitz added this to the Unplanned milestone Apr 30, 2015

@neopaf

This comment has been minimized.

Copy link

commented Jun 30, 2015

Same story here. I heavily relied go could do this upon quick-checking that TOC -reading code is OK.
Yet it fails to parse the file :(

@elvarb

This comment has been minimized.

Copy link

commented Aug 8, 2017

I'm running into this problem I think with a program that is scanning files to search for certain data, when it encounters a JAR file it is processed as a zip file. Now checking the logs I'm noticing that when the program has been stopping for unknown reasons in most cases its scanning JAR files, and specifically certain rt.jar file that is a part of JRE. Most of the time the program can read the rt.jar file but not always, still looking into why that is.

The program has run on over 1000 machines and by far the most common file it stops on are JAR files.

One example of this happening just sometimes, when processing rt.jar that comes with the TSM BA client for the same version. Two machines have the same versions, one successfully scans the file while the other fails.

@EricRobert

This comment has been minimized.

Copy link

commented Jul 5, 2018

Same issue here.

As suggested, the workaround is to scan for every instances of 0x02014b50 and try to create a reader from that offset until it works. Reading from the end helps if the file is big.

At least one implemenation minizip simply computes the start of the archive in the file from the end descriptor instead of assuming it's 0 which looks like the sensible thing to do.

@AxbB36

This comment has been minimized.

Copy link

commented Apr 26, 2019

I'm curious which recovery strategies other tools use. I don't want to implement a unique one that doesn't agree with the leniency of other tools, in either direction.

Implementations differ in this respect:

  • Python zipfile always takes the base offset to be eocd_offsetcd_sizecd_offset, where eocd_offset is the absolute offset of the End of Central Directory (usually 22 bytes before the end of the file), and cd_size and cd_offset are read from the EOCD structure. In other words, Python assumes that there is no slack space between the central directory and the EOCD; whatever slack space might be implied by cd_size and cd_offset should instead be interpreted to come at the beginning of the file. Ref Ref
  • yauzl always takes the base offset to be 0. Ref
  • nsZipArchive always takes the base offset to be 0. Ref
  • Info-ZIP Unzip first does what Python does (as @philwebb noted), but if it doesn't find a central directory header signature there, it tries again with a base offset of 0. See process.c:
         /*-----------------------------------------------------------------------
             Compensate for missing or extra bytes, and seek to where the start
             of central directory should be.  If header not found, uncompensate
             and try again (necessary for at least some Atari archives created
             with STZip, as well as archives created by J.H. Holm's ZIPSPLIT 1.1).
           -----------------------------------------------------------------------*/
    
    If UnZip succeeds with the first strategy (nonzero base offset), it prints a warning, e.g.: 100 extra bytes at beginning or within zipfile. If it succeeds with the second strategy (zero base offset), it prints no warning. You can see where it expects to find the central directory with zipinfo -v.

I haven't seen any implementations do the scanning idea. They either look in one place or two. Scanning seems like a bad idea because file data or even unused space between files can contain a central directory header signature.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants
You can’t perform that action at this time.