Intelligent chunking for archives #5323

Open
reit-c opened this issue Aug 1, 2018 · 6 comments

@reit-c

reit-c commented Aug 1, 2018

Type:

Enhancement.

Description:

A commonly cited use of IPFS is managing archival data; however, the way IPFS currently handles archives could be improved. Take the use case of an archival utility that regularly takes a snapshot of a set of files in ZIP format. Even though many of the files are identical between snapshots, a slight change (especially in a file early in the archive) causes the archive to produce an entirely different hash, unnecessarily requiring the archive to be duplicated in full in the IPFS repo.

[Image: file structure of the ZIP-64 archive format]

Pictured above is the file structure of the ZIP-64 archive format. The way to read it is to parse the final bytes of the archive to get the offset of the central directory, then step through the directory, which contains the offset of every compressed file in the archive. If IPFS treated these file offsets as positions at which to start new chunks, that alone would be sufficient to deduplicate the files across any number of ZIP archives (presuming the archives were created with the same compression options).
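
For illustration, here is a minimal Go sketch of pulling those per-file data offsets out of an archive's central directory using the standard archive/zip package. The filename snapshot.zip is just an example, and actually feeding the offsets into an IPFS chunker is hand-waved here; none of this is existing go-ipfs functionality.

```go
package main

import (
	"archive/zip"
	"fmt"
	"log"
	"sort"
)

// zipChunkOffsets returns the byte offsets within the archive at which a
// format-aware chunker could start new chunks: one per file entry, taken
// from the central directory.
func zipChunkOffsets(path string) ([]int64, error) {
	r, err := zip.OpenReader(path)
	if err != nil {
		return nil, err
	}
	defer r.Close()

	var offsets []int64
	for _, f := range r.File {
		// DataOffset is where this entry's (possibly compressed) data begins.
		off, err := f.DataOffset()
		if err != nil {
			return nil, err
		}
		offsets = append(offsets, off)
	}
	sort.Slice(offsets, func(i, j int) bool { return offsets[i] < offsets[j] })
	return offsets, nil
}

func main() {
	offsets, err := zipChunkOffsets("snapshot.zip") // example archive name
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("candidate chunk boundaries:", offsets)
}
```

A format-aware splitter would still need to enforce maximum chunk sizes for large entries, but these offsets are the boundaries that matter for deduplication.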

@dbaarda

dbaarda commented Aug 1, 2018

Correct me if I'm wrong, but wouldn't rabin fingerprint chunking automatically handle this, and any other data format where successive versions might have large, arbitrarily aligned unchanged sections?

A search found this discussion, but much of it seems to miss the point of rabin fingerprint chunking, which is to identify shared chunks in slightly different files: ipfs-inactive/archives#137

The trick is that the average rabin block size needs to be smaller than (less than half) the typical unchanged sequence size. That way it can break those unchanged sequences into a run of identical chunks, losing only (on average) half a chunk at each end of the unchanged sequence. In this case it means the block size needs to be smaller than the typical compressed file entry.
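
As a back-of-the-envelope sketch of that rule of thumb (my own rough arithmetic, not a measurement of any actual chunker): for an unchanged run of runLen bytes chunked with an average block size of avgBlock, roughly half a block is wasted at each end, so only about runLen - avgBlock bytes de-dupe, and nothing at all once the run is shorter than two blocks.

```go
package main

import "fmt"

// expectedDedup is a rough estimate of how many bytes of an unchanged run
// of length runLen a content-defined chunker with average block size
// avgBlock can be expected to de-duplicate: on average about half a block
// is lost at each end of the run.
func expectedDedup(runLen, avgBlock int64) int64 {
	if runLen < 2*avgBlock {
		return 0 // run too short to yield any whole shared chunks
	}
	return runLen - avgBlock
}

func main() {
	// A 1 MiB unchanged compressed file entry:
	fmt.Println(expectedDedup(1<<20, 256<<10)) // 256 KiB avg blocks -> ~768 KiB de-duped
	fmt.Println(expectedDedup(1<<20, 2<<20))   // 2 MiB avg blocks  -> nothing
}
```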

Note there are also things like gzip --rsyncable, which does something similar to rabin fingerprinting within a single large compressed stream: it restarts the compressor at "chunk" boundaries so that small changes in the uncompressed data still leave most of the compressed sections unchanged.

In this particular case, I wonder if it wouldn't just be better to have ipfs unzip the data under the hood before adding it, with some kind of top merkle node that says "when you assemble all these nodes for output, format them back into a zip file". That way compressed files inside added zip archives are de-duped with any uncompressed version of the same file that's added.

@reit-c

reit-c commented Aug 2, 2018

In theory rabin chunking should be able to solve this to some extent, though I too saw that discussion and the results posted in ipfs-inactive/archives#137 (comment), which didn't show encouraging results from the chunker. It is possible that with tweaks like you suggest some better performance could be achieved, though intuitively it would never be as good as an explicit hand-crafted rule defining the best places to chunk.

@dbaarda

dbaarda commented Aug 2, 2018

I'm still unconvinced the testing done then was actually checking for the benefits of rabin fingerprinting.

I'm unfamiliar with the modarchive data used, but it was described as "thousands of files with an awful lot of recycled soundfont samples". If those soundfont samples were recycled as whole files, then rabin would not help (since ipfs already de-dupes identical files). If those "recycled soundfont samples" are inside different files, then they need to be present as runs of identical, unchanged data much larger than the rabin average blocksize.

Note that a smaller rabin blocksize will help find more duplicate data, but it will also result in a lot more chunks, and the overhead of more chunks can outweigh the benefit of more de-duplication.

I suspect for the modarchive data either:

  1. There is less duplicate data than they thought, possibly because the recycled soundfont samples don't end up as identical data in different files (compression effects, changed encoded offsets, etc.).

  2. The identical data from the soundfont samples is there, but in runs too short for the rabin blocksize to find. Using a rabin blocksize small enough to find it results in so many chunks that the overhead kills any de-duping benefits.

  3. The rabin fingerprinting implementation in ipfs is broken.

It might be worth re-testing rabin fingerprinting with data that is genuinely known to include duplicate data runs larger than the rabin blocksize used. Say, two large zip-64 archives containing lots of identical files with only 1 changed :-) Make sure you specify a rabin blocksize smaller than the compressed size of an individual file. Also make sure that these files really do end up with long runs of duplicate data, and that there's nothing in the zip-64 format that modifies the data (timestamps, offsets, etc.).
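
If it helps, here is a rough Go sketch (my own illustration, not part of any existing tooling) of generating that kind of test data: two ZIP archives with the same entries written in the same order and with pinned timestamps, differing in exactly one file. Go's archive/zip only emits zip-64 records when entries actually need them, but the layout argument is the same. You could then add both archives with something like `ipfs add --chunker=rabin-...` at a few block sizes and compare how much the repo grows on the second add.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"log"
	"math/rand"
	"os"
	"sort"
	"time"
)

// writeArchive writes files into a ZIP at path in a deterministic order and
// with pinned timestamps, so two archives differ only where content differs.
func writeArchive(path string, files map[string][]byte) error {
	out, err := os.Create(path)
	if err != nil {
		return err
	}
	defer out.Close()

	names := make([]string, 0, len(files))
	for name := range files {
		names = append(names, name)
	}
	sort.Strings(names)

	zw := zip.NewWriter(out)
	for _, name := range names {
		hdr := &zip.FileHeader{Name: name, Method: zip.Deflate, Modified: time.Unix(0, 0)}
		w, err := zw.CreateHeader(hdr)
		if err != nil {
			return err
		}
		if _, err := w.Write(files[name]); err != nil {
			return err
		}
	}
	return zw.Close()
}

func main() {
	// Pseudo-random "files" large enough that each compressed entry is well
	// above the rabin blocksize you intend to test with.
	rng := rand.New(rand.NewSource(1))
	files := map[string][]byte{}
	for i := 0; i < 50; i++ {
		buf := make([]byte, 1<<20)
		rng.Read(buf)
		files[fmt.Sprintf("file%02d.bin", i)] = buf
	}

	if err := writeArchive("a.zip", files); err != nil {
		log.Fatal(err)
	}
	// Change exactly one file, then write the second snapshot.
	files["file00.bin"] = bytes.Repeat([]byte("changed "), 1<<17)
	if err := writeArchive("b.zip", files); err != nil {
		log.Fatal(err)
	}
}
```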

Explicit hand-crafted rules can nearly always do better than generic solutions, but then you have to write and maintain those hand-crafted rules. The effort-versus-return is usually not worth it, as the generic solution is often good enough, and effort spent improving the generic solution benefits all use cases.

@lanzafame

@dbaarda @reit-c there was a demo of Rabin fingerprinting given by @jbenet at the Berlin dev meetings. I will link to the video once it is out, but in the meantime the script he used to benchmark in the demo is here: https://github.com/jbenet/ipfs-bench-add-opts. It allows you to test the different chunking methods against the same directory. Hopefully that helps.

@momack2 added this to Inbox in ipfs/go-ipfs on May 9, 2019
@dbaarda

dbaarda commented Aug 29, 2019

Is there a video or writeup of this out yet?

A quick glance at the code suggests he might have been comparing apples with oranges: 256k fixed blocks vs 1k rabin blocks... 256x as many blocks is likely to cost more in overhead than the de-duping saves. It also depends a lot on the data used, though.

@lanzafame

@dbaarda Yeah sorry, here it is: https://youtu.be/x0WIDz1bWjI
