feature: gzip multi member dependant chunker / importer, warc, tar #3604

Open
donothesitate opened this issue Jan 17, 2017 · 4 comments
Labels
help wanted Seeking public contribution on this issue kind/enhancement A net-new feature or improvement to an existing feature

Comments

@donothesitate

Version information:

go-ipfs version: 0.4.4

Type: Feature, Enhancement

Priority: P4

Area: Tools, Importer

Description:

As with WARCs, gzip files support multiple members, so large files can be stitched together from smaller ones by simple concatenation.
This makes it possible to compress the metadata and each record separately, concatenate them into a single file, and then fetch and decompress individual parts, including via HTTP Range requests.
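
To illustrate the multi-member property, here is a minimal Go sketch using only the standard `compress/gzip` package. The helper names `gzipMember` and `readMember` are made up for this example and are not part of go-ipfs; the byte offset of the second member is simply the length of the first one, standing in for an offset that would normally come from an index.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"log"
)

// gzipMember (hypothetical helper) compresses data as one standalone gzip member.
func gzipMember(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(data); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// readMember (hypothetical helper) decompresses exactly one member,
// which must start at the first byte of raw.
func readMember(raw []byte) ([]byte, error) {
	zr, err := gzip.NewReader(bytes.NewReader(raw))
	if err != nil {
		return nil, err
	}
	zr.Multistream(false) // stop at the end of this member
	return io.ReadAll(zr)
}

func main() {
	a, err := gzipMember([]byte("record one\n"))
	if err != nil {
		log.Fatal(err)
	}
	b, err := gzipMember([]byte("record two\n"))
	if err != nil {
		log.Fatal(err)
	}

	// Plain concatenation of two members is itself a valid gzip file.
	joined := append(append([]byte{}, a...), b...)

	zr, err := gzip.NewReader(bytes.NewReader(joined))
	if err != nil {
		log.Fatal(err)
	}
	all, err := io.ReadAll(zr) // default multistream mode reads every member
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%q\n", all) // "record one\nrecord two\n"

	// Partial fetch: decode only the second member, given its byte offset.
	second, err := readMember(joined[len(a):])
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%q\n", second) // "record two\n"
}
```

The slice `joined[len(a):]` plays the role of an HTTP Range response here: only the bytes of one member are needed to decompress that record.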

By having the static chunker also split at gzip member boundaries, one can construct .tar.gz files, .tar files of .gz files, and all sorts of derived data sets without duplication.

There are two ways to approach this:
a) the chunker works as usual, but additionally splits a block at each member boundary
(giving the same blocks as today, except that each block containing a member boundary is replaced by two smaller blocks split at that boundary)
b) the chunker works as usual, but on encountering a gzip member boundary it cuts the current block short and starts the new member in its own 256k data block
(this shifts all later blocks and hence duplicates data; probably not the way to do it)
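
Approach (a) could be sketched roughly as below. This is only an illustration under stated assumptions: `splitPoints` is a hypothetical function, not an existing go-ipfs chunker API, and the member boundary offsets are assumed to be known already (for example from an index or a prior scan of the file). It keeps the regular fixed-size grid and adds an extra cut at every member boundary.

```go
package main

import (
	"fmt"
	"sort"
)

// splitPoints (hypothetical) returns the sorted end offsets of all blocks for
// a file of the given size: the usual fixed-size grid, plus one extra cut at
// each gzip member boundary. Duplicate and out-of-range offsets are ignored.
func splitPoints(size, chunk int64, boundaries []int64) []int64 {
	seen := map[int64]bool{}
	var cuts []int64
	add := func(off int64) {
		if off > 0 && off <= size && !seen[off] {
			seen[off] = true
			cuts = append(cuts, off)
		}
	}
	// Regular grid, exactly as the static chunker would cut.
	for off := chunk; off < size; off += chunk {
		add(off)
	}
	// Additional cuts at member boundaries.
	for _, b := range boundaries {
		add(b)
	}
	add(size) // the final block ends at end of file
	sort.Slice(cuts, func(i, j int) bool { return cuts[i] < cuts[j] })
	return cuts
}

func main() {
	// Toy numbers: 1000-byte file, 256-byte chunks, members starting at 300 and 512.
	fmt.Println(splitPoints(1000, 256, []int64{300, 512}))
	// [256 300 512 768 1000]
}
```

In the toy run, the block 256..512 is replaced by 256..300 and 300..512, while the boundary at 512 already coincides with the grid and adds nothing.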

This should work for all gzip files, tar files, and more.

Related: https://tools.ietf.org/html/rfc1952

@donothesitate donothesitate changed the title feature: gzip multi member dependant chunker / importer feature: gzip multi member dependant chunker / importer, warc, tar Jan 17, 2017
@whyrusleeping whyrusleeping added kind/enhancement A net-new feature or improvement to an existing feature help wanted Seeking public contribution on this issue labels Jan 24, 2017
@whyrusleeping
Member

Might be cool to start an ipfs/importers repo where we can collect ideas like this

@bqv

bqv commented Jan 28, 2021

This never went anywhere, did it?

@lidel
Member

lidel commented Nov 9, 2022

If anyone wants a userland solution for ZIP archives, @ikreymer did some related work in https://github.com/webrecorder/ipfs-composite-files (for WARC, but the approach works for regular ZIPs too).

That being said, I like the generalization proposed here, to make Kubo's ipfs add smarter.
Kubo could detect gzip / ZIP archives / TAR streams and use custom chunkers for known formats.

For example, ZIPs start with the same magic bytes (0x50, 0x4b, 0x03, 0x04; see https://en.wikipedia.org/wiki/List_of_file_signatures).
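
A format sniffer along these lines might look like the following Go sketch. `sniff` is a hypothetical helper and not something Kubo provides. The ZIP signature is the one quoted above; the gzip prefix (1f 8b, followed by 08 for deflate) comes from RFC 1952, and the tar check assumes the POSIX "ustar" magic at offset 257, which older tar variants do not carry.

```go
package main

import (
	"bytes"
	"fmt"
)

var (
	zipMagic  = []byte{0x50, 0x4b, 0x03, 0x04} // ZIP local file header
	gzipMagic = []byte{0x1f, 0x8b, 0x08}       // gzip ID1, ID2, CM=deflate
	tarMagic  = []byte("ustar")                // POSIX tar magic at offset 257
)

// sniff (hypothetical) guesses the container format from the first bytes of
// a stream, so that a format-specific chunker could be selected.
func sniff(head []byte) string {
	switch {
	case bytes.HasPrefix(head, zipMagic):
		return "zip"
	case bytes.HasPrefix(head, gzipMagic):
		return "gzip"
	case len(head) >= 262 && bytes.Equal(head[257:262], tarMagic):
		return "tar"
	}
	return "unknown"
}

func main() {
	fmt.Println(sniff([]byte{0x1f, 0x8b, 0x08, 0x00})) // gzip
	fmt.Println(sniff([]byte("PK\x03\x04rest")))       // zip
	fmt.Println(sniff([]byte("plain text")))           // unknown
}
```

Magic bytes only identify the start of a stream; finding later member boundaries would still need a format-aware scan or a pre-computed offsets list.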

@ikreymer

ikreymer commented Nov 9, 2022

> If anyone wants userland solution for ZIP archives, @ikreymer did some related work in https://github.com/webrecorder/ipfs-composite-files (for WARC, but approach works for regular ZIPs too).
>
> That being said, I like the generalization proposed here, to make Kubo's ipfs add smarter. Kubo could detect gzip / ZIP archives / TAR streams and use custom chunkers for known formats.
>
> For example, ZIPs start with the same magic bytes (0x50, 0x4b, 0x03, 0x04; see https://en.wikipedia.org/wiki/List_of_file_signatures).

For gzip, it would be 1f 8b 08, I think.

Yep, the library is designed to be fairly generic. The tests use WARC/WACZ/web archive data, but the commands are all generic and should work with any unixfs directories and files and, for the in-place ZIP, with any ZIP file.

I do like the idea of detecting file types automatically, rather than having to provide pre-determined split points as we're doing here. For our use case, we would probably keep the pre-computed split offsets file, as we already have that, but I'm happy to support/work with more generic multi-member gzip splitting efforts.
