Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First steps towards zipfile.ZipFile integration #109

Open
forrestfwilliams opened this issue Jan 19, 2023 · 15 comments
Open

First steps towards zipfile.ZipFile integration #109

forrestfwilliams opened this issue Jan 19, 2023 · 15 comments

Comments

@forrestfwilliams
Copy link

Hello @pauldmccarthy this package is really fantastic! I work at the Alaska Satellite Facility where we distribute satellite imagery data at no cost to scientists and we've run into a similar issue to the one indexed_gzip was designed to solve.

Unfortunately, instead of gzip, the files we're trying index use zip compression. We would love to have an indexed_zip package that has the same functionality as indexed_gzip, but is compatible with python's zipfile.ZipFile. Since zipfile.ZipFile also uses zlib, under the hood how big of a lift do you think this would be?

@martindurant
Copy link

(just a comment that ZIP might use zlib, aka deflate, but can also store uncompressed, bzip2 and maybe others. deflate should be the most common)

@forrestfwilliams
Copy link
Author

Yes, true. we would want to raise an exception if the zip file didn't use zlib compression.

@pauldmccarthy
Copy link
Owner

Hi @forrestfwilliams @martindurant, I'm not at all familiar with the zip format, so I'm afraid I can't really comment on how complex it would be to support zip. I don't have the time to spend on this project (apart from bugs), but am happy to consult/review PRs.

@martindurant
Copy link

martindurant commented Jan 23, 2023

Actually, I think it would be trivial, if the code can apply to zlib/inflate streams, not only gzip. I believe this is already the case.

@forrestfwilliams
Copy link
Author

The big hurdle to overcome is allowing indexed_gzip to work with other types of DEFLATE streams. This post does a great job of going over the differences. From my understanding, DEFLATE compressed data comes in three different format flavors: GZIP, ZLIB, and DEFLATE. All three contain the same compressed data, but have slightly different headers and footers. The python zlib library is capable of compressing/decompressing all three of these formats (see the second answer in this stackoverflow post, but the headers/footers need to be dealt with before the data be passed to zlib.

The DEFLATE format is slightly different from the other two formats because it contains no headers or footers, just the compressed data. Both the python gzip and zipfile python packages convert their respective files to this format before using zlib to compress/de-compress them.

If we could expose the capability of indexed_gzip to create indexes for DEFLATE formatted data. I think we would have a clear path forward for creating a indexed_zip package.

@martindurant
Copy link

martindurant commented Jan 23, 2023

I believe you are correct: they share the same compression algorithm, but add headers (or none for deflate). If we extract an index, we just skip the header to the compressed stream, and all three should look the same.

@pauldmccarthy
Copy link
Owner

@forrestwilliams as you note, the deflate format refers to the format of the compressed data stream, and zlib/gzip refer to container formats which contain deflate data. As it stands, zran.c supports gzip-wrapped deflate data, and I believe (although haven't tested) also supports zlib-wrapped deflate data, as long as the CRC32 validation is disabled via the ZRAN_SKIP_CRC_CHECK flag.

The code should be agnostic to whether a stream has a gzip header or a zlib header - note the use of the inflateInit2 function here. From the inflateInit2 docs:

windowBits can also be greater than 15 for optional gzip decoding. Add 32 to windowBits to enable zlib and gzip decoding with automatic header detection, or add 16 to decode only the gzip format

But zran.c does not currently support raw/headerless deflate data, and it sounds like this is what you want/need? Optimistically, adding support for raw deflate data would potentially only require changes to the _zran_init_zlib_inflate function, to skip the initial call to inflateInit2 (allowing it to be skipped via a new initialisation flag, e.g. ZRAN_SKIP_STREAM_HEADER).

There is also the matter of the footer - currently zran.c will only support gzip-wrapped data, unless the ZRAN_SKIP_CRC_CHECK flag is set. By default, zran.c performs CRC32 validation of the compressed stream - it accumulates a CRC32 checksum as data is decompressed, and then validates that checksum (and the file size) against the checksum/size stored in the gzip footer.

Again, optimistically, I don't think it would be too difficult to add support for zlib-wrapped data - a toggle would need to be added where the checksum is calculated to calculate either a crc32 (for gzip) or an adler32 (for zlib), and the _zran_validate_stream function would need to be adjusted to either check the crc32 and size (for gzip), or just the adler32 (for zlib). And finally, the ZRAN_SKIP_CRC_CHECK flag would need to be automatically switched on, to disable checksum/size validation for raw/headerless deflate data.

By the way, the zlib manual, and the zlib/gzip RFCs are super handy, and quite digestible:

@martindurant
Copy link

martindurant commented Jan 24, 2023

Here is how to get a gzip version of a DEFLATE file in a ZIP:

import zipfile
import gzip
import io

f = open(zipfile) # of other seekable file-like
z = zipfile.ZipFile(file=f)
zinf = z.infolist()[0] # or whichever member file we want
assert zinf.compress_type == zipfile.ZIP_DEFLATED
f.seek(zinf.header_offset + len(zinf.FileHeader()))
data = f.read(zinf.compress_size)

# https://www.rfc-editor.org/rfc/rfc1952#section-2.3.1
ghead = b"\x1f\x8b\x08\x00" + b"\x00\x00\x00\x00" + b"\x00\xff"
gfoot = zinf.CRC.to_bytes(4, "little") + (zinf.file_size % 2**32).to_bytes(4, "little")
b = io.BytesIO(ghead + data + gfoot)
b.seek(0)

gzip.GzipFile(fileobj=b).read() # OK

This could be modified to work with a streaming remote file (s3, etc.) without needing to hold the whole data in memory at once. Instead of calling GZipFile, we would pass this custom file-like object to indexed_gzip for both indexing and random-access reading. This is well within the kinds of tricks we routinely play with kerchunk.

@forrestfwilliams
Copy link
Author

@martindurant this looks great! I've taken this code/info and created a script that works for a specific use case of mine.

This script follows the code you wrote above, but I ran into an issue when finding the location to start reading the raw DEFLATE stream from the zip. Specifically, using a f.seek(zinf.header_offset + len(zinf.FileHeader())) approach resulted in a ZRAN_BUILD_INDEX_FAIL (file: n/a) error because extracting the data at this location doesn't result in a valid gzip file. I ended up using an empirical solution that involves magic number to solve this issue.

While this seems to work in a few cases, I'm betting this won't work in the majority of cases and I would like to figure out what is happening. I think it has something to do with the varying length of the extra section of the zip file header. Have you seen this before?

@martindurant
Copy link

len(zinf.FileHeader()) is hack relying on python's ZIP doing a successful roundtrip of what it finds in the header back to bytes again. I'm only slightly surprised that it failed. The header is not too complex, though, so we can do this directly! I'll see if I get a little time to do so.

@martindurant
Copy link

f.seek(zinf.header_offset)
data0 = f.read(30)
n = int.from_bytes(data0[26:28], "little")
m = int.from_bytes(data0[28:30], "little")
data_start = zinf.header_offset + n + m

@forrestfwilliams
Copy link
Author

Thanks @martindurant! should the last line of the above be:
data_start = zinf.header_offset + n + m + 30

@martindurant
Copy link

You are right, of course :)

@forrestfwilliams
Copy link
Author

@martindurant this solution worked, thanks! Next week I'll try to re-work the scripts in the repo I linked to above so that the parts relevant to Kerchunk are easier to pull out.

@martindurant
Copy link

Yes, it would be great for kerchunk to be able to directly access chunks within files in ZIP or tar.gz archives. It will, I think, take some work to get this right; the index files need o be stored somewhere and loaded on demand, for instance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants