feature: gzip multi member dependant chunker / importer, warc, tar #3604

donothesitate · 2017-01-17T20:30:13Z

Version information:

go-ipfs version: 0.4.4

Type: Feature, Enhancement

Priority: P4

Area: Tools, Importer

Description:

As with WARCs, gzip files support multiple members, which makes it possible to stitch large files together from smaller ones by simple concatenation.
This means the metadata and each record can be compressed separately and concatenated into a single file, which then allows partial fetches and decompression, including via HTTP Range requests.
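To illustrate the concatenation property described above, here is a minimal Python sketch using only the standard library. The record contents and the variable names are made up for the example. It assumes the byte offset of the second member is already known (for instance from an index), which is what would make a ranged fetch possible.

```python
import gzip

# Two records compressed independently, then joined by plain concatenation.
record_a = gzip.compress(b"first record\n")
record_b = gzip.compress(b"second record\n")
archive = record_a + record_b

# A standard gzip reader treats the concatenation as one logical stream.
whole = gzip.decompress(archive)

# "Partial fetch": with the offset of the second member known, that slice
# alone is a valid gzip file and can be decompressed without the rest.
offset = len(record_a)
second = gzip.decompress(archive[offset:])

print(whole)
print(second)
```

The slice `archive[offset:]` stands in for the body of a ranged HTTP response; no network call is made here.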

By having the static chunker also split at gzip member boundaries, one can construct .tar.gz files, .tar files of .gz files, and all sorts of derived data sets without duplication.

There are two ways to approach this:
a) the chunker works as usual, but additionally splits a block at each member boundary
(resulting in a 1:1 result, except that one block per member is replaced by two blocks, split at the boundary)
b) the chunker works as usual, but when it encounters a gzip member boundary it ends the current block early and starts the new member in its own 256k data block
(resulting in a shift, and hence duplication of data; probably not the way to do it)
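Option (a) can be sketched roughly as follows. This is an illustration in Python, not go-ipfs code: the function names `member_offsets`, `split_points` and `chunks` are hypothetical, the 64-byte block size is chosen only to keep the demo small, and the boundary scan decompresses every member in memory, which a real importer would want to avoid.

```python
import gzip
import zlib

def member_offsets(data: bytes) -> list:
    """Return the start offset of every gzip member in data."""
    offsets = []
    pos = 0
    while pos < len(data):
        offsets.append(pos)
        # wbits=31 makes zlib expect a gzip header and trailer.
        d = zlib.decompressobj(wbits=31)
        d.decompress(data[pos:])
        if not d.eof:
            raise ValueError("truncated gzip member")
        # unused_data is whatever follows the member that just ended.
        pos = len(data) - len(d.unused_data)
    return offsets

def split_points(size: int, chunk_size: int, boundaries: list) -> list:
    """Fixed-size cut points plus an extra cut at each member boundary."""
    fixed = set(range(0, size, chunk_size))
    extra = {b for b in boundaries if 0 <= b < size}
    return sorted(fixed | extra)

def chunks(data: bytes, chunk_size: int, boundaries: list) -> list:
    """Split data at the cut points; the fixed grid itself is never shifted."""
    cuts = split_points(len(data), chunk_size, boundaries) + [len(data)]
    return [data[a:b] for a, b in zip(cuts, cuts[1:])]

# Demo: three independently compressed members, concatenated.
members = [gzip.compress(f"record {i}\n".encode() * 50) for i in range(3)]
archive = b"".join(members)
boundaries = member_offsets(archive)
blocks = chunks(archive, 64, boundaries)
```

Because the fixed grid is kept, every block that does not straddle a member boundary is identical to what plain fixed-size chunking would produce; only the straddling block is cut in two.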

This should work for all gzip files, tar files, and more.

Related: https://tools.ietf.org/html/rfc1952

whyrusleeping · 2017-03-06T05:43:43Z

Might be cool to start an ipfs/importers repo where we can collect ideas like this

bqv · 2021-01-28T17:05:04Z

This never went anywhere, did it

lidel · 2022-11-09T18:24:14Z

If anyone wants a userland solution for ZIP archives, @ikreymer did some related work in https://github.com/webrecorder/ipfs-composite-files (for WARC, but the approach works for regular ZIPs too).

That being said, I like the generalization proposed here, to make Kubo's ipfs add smarter.
Kubo could detect gzip / ZIP archives / TAR streams and use custom chunkers for known formats.

For example, ZIPs start with the same magic bytes (0x50, 0x4b, 0x03, 0x04 ← https://en.wikipedia.org/wiki/List_of_file_signatures).
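A format check along these lines could look like the Python sketch below. It is only an illustration: `sniff_format` is a hypothetical helper, not a Kubo API. The ZIP signature is the one quoted above; the gzip signature (1f 8b) comes from RFC 1952, and the `ustar` marker at offset 257 comes from the POSIX tar header layout.

```python
GZIP_MAGIC = b"\x1f\x8b"      # RFC 1952: ID1, ID2
ZIP_MAGIC = b"PK\x03\x04"     # 0x50 0x4b 0x03 0x04, local file header
TAR_MAGIC = b"ustar"          # POSIX ustar header field
TAR_MAGIC_OFFSET = 257

def sniff_format(head: bytes) -> str:
    """Guess the container format from the first bytes of a file."""
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(ZIP_MAGIC):
        # Note: an empty ZIP archive starts with PK\x05\x06 instead.
        return "zip"
    if head[TAR_MAGIC_OFFSET:TAR_MAGIC_OFFSET + len(TAR_MAGIC)] == TAR_MAGIC:
        return "tar"
    return "unknown"
```

Reading the first 512 bytes of the input is enough for all three checks, so an importer could choose a chunker before consuming the rest of the stream.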

ikreymer · 2022-11-09T19:02:52Z

> If anyone wants a userland solution for ZIP archives, @ikreymer did some related work in https://github.com/webrecorder/ipfs-composite-files (for WARC, but the approach works for regular ZIPs too).
>
> That being said, I like the generalization proposed here, to make Kubo's ipfs add smarter. Kubo could detect gzip / ZIP archives / TAR streams and use custom chunkers for known formats.
>
> For example, ZIPs start with the same magic bytes (0x50, 0x4b, 0x03, 0x04 ← https://en.wikipedia.org/wiki/List_of_file_signatures).

For gzip, it would be 1f 8b 08, I think.

Yep, the library is designed to be fairly generic. The tests use WARC/WACZ/web archive data, but the commands are all generic and should work with any UnixFS directories and files, and the in-place ZIP with any ZIP file.

I do like the idea of detecting file types automatically, rather than having to provide pre-determined split points as we're doing here. For our use case, we would probably keep the pre-computed split offsets file as we already have that, but happy to support/work with more generic multi-member gzip splitting efforts.

donothesitate changed the title ~~feature: gzip multi member dependant chunker / importer~~ feature: gzip multi member dependant chunker / importer, warc, tar Jan 17, 2017

whyrusleeping added the kind/enhancement and help wanted labels Jan 24, 2017

probonopd mentioned this issue Dec 4, 2017

Investigate peer-to-peer AppImage distribution AppImage/AppImageKit#175

Open

hacdias mentioned this issue Dec 14, 2023

content aware chunking ipfs/boxo#508

Open
