feature: gzip multi member dependant chunker / importer, warc, tar #3604
Comments
Might be cool to start an ipfs/importers repo where we can collect ideas like this.
This never went anywhere, did it?
If anyone wants a userland solution for ZIP archives, @ikreymer did some related work in https://github.com/webrecorder/ipfs-composite-files (for WARC, but the approach works for regular ZIPs too). That being said, I like the generalization proposed here for Kubo's importer. For example, ZIPs all start with the same magic bytes (`PK\x03\x04`).
For gzip, it would be `\x1f\x8b`. Yep, the library is designed to be fairly generic: the tests use WARC/WACZ/web archive data, but the commands are all generic and should work with any unixfs directories and files, and the in-place ZIP support should work with any ZIP file. I do like the idea of detecting file types automatically, rather than having to provide pre-determined split points as we're doing here. For our use case, we would probably keep the pre-computed split-offsets file, as we already have that, but happy to support/work with more generic multi-member gzip splitting efforts.
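The format detection mentioned above could start with a simple magic-byte check; gzip members begin with `\x1f\x8b` (RFC 1952) and ZIP local file headers with `PK\x03\x04`. A minimal sketch (the `sniff` helper is illustrative, not part of any existing tool):

```python
import gzip

# Well-known magic bytes: RFC 1952 for gzip, PKZIP local file header for ZIP.
GZIP_MAGIC = b"\x1f\x8b"
ZIP_MAGIC = b"PK\x03\x04"

def sniff(data: bytes) -> str:
    """Guess a container format from its leading bytes."""
    if data.startswith(GZIP_MAGIC):
        return "gzip"
    if data.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"

print(sniff(gzip.compress(b"hello")))  # → gzip
```

A real importer would likely combine this with deeper validation (e.g. parsing the gzip header flags) before committing to format-aware chunking.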
Version information:
go-ipfs version: 0.4.4
Type: Feature, Enhancement
Priority: P4
Area: Tools, Importer
Description:
As in the case of WARCs, gzip files support multiple members, effectively making it possible to stitch large files together from smaller ones by mere concatenation.
This makes it possible to compress the metadata and each record separately, concatenate them into a single file, and then do partial fetches and decompression, including via HTTP Range requests.
By having the static chunker also split at gzip member boundaries, one can easily construct .tar.gz files, .tar files of .gz files, and all sorts of derived data sets, without duplication.
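The concatenation property described above can be checked directly with Python's standard library (the record contents here are made-up example data):

```python
import gzip
import zlib

# Compress two records as separate gzip members.
member1 = gzip.compress(b"record one\n")
member2 = gzip.compress(b"record two\n")

# Mere concatenation yields a valid multi-member gzip file (RFC 1952).
combined = member1 + member2
assert gzip.decompress(combined) == b"record one\nrecord two\n"

# A single member can also be decompressed independently, e.g. after an
# HTTP Range request for just its bytes. wbits=31 selects the gzip wrapper;
# the decompressor stops at the end of the first member and leaves the
# remaining bytes in unused_data.
d = zlib.decompressobj(wbits=31)
first = d.decompress(combined)
assert first == b"record one\n"
assert d.unused_data == member2
```

This is exactly why aligning block boundaries with member boundaries would let IPFS serve one record's blocks without touching the rest of the file.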
There are two ways to approach this:
a) the chunker works as usual, but additionally splits a block at each member boundary
(resulting in a 1:1 result, except that the one block per member containing a boundary is replaced by two blocks split at that boundary)
b) the chunker works as usual, but when encountering a gzip member boundary, it makes the current block smaller, starting the new member in its own 256k data block
(resulting in a shift of all subsequent boundaries, and hence duplication of data; probably not the way to do it)
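Option (a) can be sketched as a pure boundary computation. Here `member_offsets` is a hypothetical precomputed list of gzip member start offsets (parsed from the file or supplied out of band), and the names are illustrative, not an existing API:

```python
CHUNK_SIZE = 256 * 1024  # the usual fixed chunk size

def split_points(file_size, member_offsets):
    """Option (a): keep the normal fixed-size boundaries, and add a
    boundary at every gzip member start, so each member begins in its
    own block. Only the block containing a member start is split in
    two; all other blocks are identical to a plain fixed-size chunking."""
    points = set(range(0, file_size, CHUNK_SIZE))
    points.update(o for o in member_offsets if 0 < o < file_size)
    return sorted(points)

# Example: a 600 KiB file whose second member starts at offset 300000.
print(split_points(600 * 1024, [0, 300000]))
# → [0, 262144, 300000, 524288]
```

Because the fixed-size grid is preserved, identical byte ranges in other files chunked the same way still deduplicate, which is the point of preferring (a) over (b).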
This should work for all gzip files, tar files, and more.
Related: https://tools.ietf.org/html/rfc1952