
Test the benefits of Rabin Fingerprinting vs normal chunking #137

flyingzumwalt opened this issue Jan 30, 2017 · 13 comments


@flyingzumwalt

The real goal here is to figure out which chunking algorithm(s) are ideal for archiving datasets so that we can recommend those chunking algorithms in any instructions we provide to archivists and data rescuers.

As @Kubuxu said in #134 (comment)

I would recommend checking if rabin fingerprinting gives any benefits in multiple areas: in file, cross files (directory) and cross datasets vs normal chunking.

@Kubuxu could you provide suggestions about how you would test/compare these options?

@flyingzumwalt

cc @20zinnm

@meyerzinn

I found that Rabin did great with historical archives, i.e. where you have the same thing at different points in time (archive.org style). Using Rabin, you could potentially reduce archives of websites (WARCs) by significant amounts. However, in some cases a diff might be more applicable.

@rht

rht commented Feb 1, 2017

ahem (clears throat),

Here is one benchmark (using nixpkgs narinfo [1]: 20k files of ~10KB each, data size 22.5MiB), and here is another (using the President's Daily Brief [2]: 2.5k files of 200-600KB each, data size 481.95MB). [benchmark charts not reproduced here]

In both of these datasets, neither the default chunker (size-262144) nor the rabin chunker provided significant space savings (<10%).

Therefore, what should an archivist use? (The Internet Archive uses torrents for distributing its data.)

Code to reproduce: https://github.com/rht/sfpi-benchmark/tree/master/chunk
Edit: cite data source and code to reproduce

[1] ipfs/kubo#3621, https://ipfs.io/ipfs/QmcsrSRuBmxNxcEXjMZ1pmyRgnutCGwfAhhnRfaNn9P94F
[2] https://archives.ipfs.io, "Presidential Daily Briefs", ipfs:/ipfs/Qme6epvZDj3vzHcFKdF1nZhbixjw8Bn4imGcKnbUyBJL89

@flyingzumwalt

flyingzumwalt commented Feb 2, 2017

@rht thank you for taking a stab at this.

Please compare the two ipfs chunking algorithms (default vs rabin)

I wasn't clear in the description of this issue. This issue is not seeking to compare ipfs against other tools. It's seeking to compare two of the chunking algorithms that are already implemented in ipfs. We are trying to see if the rabin fingerprinting option on ipfs add provides any benefits over the default ipfs chunking algorithm. Rabin fingerprinting is slower and more computationally expensive than the default chunking algorithm, so we want to know if it's worth the extra time and computational load.
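
For anyone reproducing that comparison: a minimal sketch (assuming a local go-ipfs install and a hypothetical test directory testdata/) is to add the same data with each chunker and compare the resulting DAG sizes and timings:

# default chunker (size-262144)
$ time ipfs add -qr testdata/ | tail -1 | ipfs object stat
# rabin chunker
$ time ipfs add -qr -s rabin testdata/ | tail -1 | ipfs object stat

The CumulativeSize lines give the total stored bytes for each DAG, and the wall-clock times show what the rabin chunker costs in extra computation.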

This is, in part, so we can decide whether we should implement some other, more efficient, chunking algorithm. To that point, the git-objects number is quite interesting to me because it tells us we might want to implement something that imitates git's packing strategy. You only included it in the first graph, so the results are inconclusive, but it's an interesting data point.

Please remove comparisons to compression algorithms

We are not looking to compare rabin fingerprinting against compression algorithms like gzip -- those two technologies were designed for different purposes, so the comparison is confusing rather than useful.

Cite your Source Data and Describe its Characteristics

Information we need in order to know what benchmarks are telling us:

  • What was the test data (please post them somewhere and cite them)
  • What is the size of the test data on disk, without chunking or compression?
  • Do the test data contain the sort of repetition that could be recognized by rabin fingerprinting (repetition of content blocks that are more than 256k in size)?

Rabin fingerprinting is designed to deduplicate content at the block level. It's ideal for data that is repetitive, like version histories of data that have changed gradually over time. The fingerprint algorithm catches that duplication at the level of blocks, with a default block size of 256KB. For that reason, I would not expect it to catch much duplication in something like a series of speeches, because speeches repeat words and sentences but they're very unlikely to precisely repeat 256KB blocks of content.
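
A quick way to sanity-check whether a dataset contains that kind of block-level repetition at all (a rough sketch; <root-hash> stands for whatever hash ipfs add printed for the dataset) is to compare total vs. unique block references in the resulting DAG:

$ ipfs refs -r <root-hash> | wc -l            # total block references
$ ipfs refs -r <root-hash> | sort -u | wc -l  # unique blocks

If the two numbers are nearly equal, there is essentially no duplication at the block level for any chunker to exploit.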

Use a Variety of Test Data

I recommend using a variety of test data -- some with a lot of repetition that rabin fingerprinting should catch, others with less repetition. For example, in one test you could use sample data that contains many copies of the same file with varying sizes of edits to the file. In another test you could use a series of speeches, like you did in the first test run. Choosing a variety of test data like that will give us a better basis for analyzing the results.
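
As a sketch of that first kind of test (the file names and edit pattern below are made up, not a real dataset): generate a series of versions of one file where each version inserts a small edit, which shifts all later byte offsets. Fixed-size chunking loses block alignment after the insertion point, while content-defined chunking should keep most blocks identical across versions:

mkdir -p versions
head -c 10485760 /dev/urandom > versions/v0.bin   # 10 MiB of random data
for i in $(seq 1 9); do
  prev=versions/v$((i-1)).bin
  # insert a short string near offset i*1000 so everything after it shifts
  { head -c $((i * 1000)) "$prev"; echo "edit $i"; tail -c +$((i * 1000 + 1)) "$prev"; } > versions/v$i.bin
done
ipfs add -qr versions/ | tail -1 | ipfs object stat            # default chunker
ipfs add -qr -s rabin versions/ | tail -1 | ipfs object stat   # rabin chunker

Comparing the two CumulativeSize values on data like this should show rabin at its best; the speeches-style data should show it near its worst.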

@rht

rht commented Feb 2, 2017

@flyingzumwalt addressing your points.

I am curating the benchmark (the facts) not just for the sake of the maintainers, but also for archivists, institutions, etc. looking to store large datasets using the IPFS protocol. I believe I am entitled to include informative datasets that locate the performance of IPFS storage within the archiving-tools ecosystem. What is the point of a graph that shows "chunking algorithm x does better than chunking algorithm y by such an amount" while hiding the order of magnitude of the overall storage saving relative to existing archival tools? For archivists, such a benchmark is much clearer and more transparent to parse (and to use as a premise) than a fuzzy statement like "using Rabin, you could potentially reduce archives of websites (WARCs) by significant amounts".
Deduplication != compression, but both serve the same objective: saving storage.

I have added the citations and the code to reproduce (though I had already stated what the datasets are and described their characteristics).

Do the test data contain the sort of repetition that could be recognized by rabin fingerprinting? (repetition of content blocks that are more than 256k in size)?

For the narinfo data, the schema is a key-value map, and they all share the same key.
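
For context (values elided; this is from memory of the narinfo format, so treat it as approximate), each of those files is a small key-value text file roughly like:

StorePath: /nix/store/<hash>-<name>
URL: nar/<filehash>.nar.xz
Compression: xz
FileHash: sha256:<...>
FileSize: <n>
NarHash: sha256:<...>
NarSize: <n>
References: <store paths>
Sig: cache.nixos.org-1:<signature>

Most of what is shared across files is just the field names, which is far below the granularity of a 256KB block.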

For archival purposes (store many, read occasionally), it is always worth it to use the more expensive compression algorithm (7z/bz2 over gz).

Rabin fingerprinting is designed to deduplicate content at the block level. It's ideal for data that is repetitive, like version histories of data that have changed gradually over time.
I recommend using a variety of test data -- some with a lot of repetition that rabin fingerprinting should catch, others with less repetition.

My data choice and variety are constrained by the disk space I have on my own device, the variety of the pinned data in archives.ipfs.io, and whether I could practically download them with ipfs get (which is why I recommend publishing a build recipe for each dataset, preferably a deterministic one). Besides, versioned data is only a subset of the pool of large binary datasets; e.g. a web archive of a website doesn't necessarily contain duplicated data.

@rht

rht commented Feb 2, 2017

TL;DR: Given the storage size ballooning due to pinning and the very specific conditions under which the chunking actually saves space, the ipfs toolset is not ready for large-data archival (this statement is backed by data). Corollary: it is dishonest if Protocol Labs markets it otherwise.

@flyingzumwalt

flyingzumwalt commented Feb 2, 2017

@rht You're violating our code of conduct. This is not the first time.

I'm not willing to tolerate being bullied in this way and I completely disagree with your repeated, baseless implications that people are being dishonest or trying to hide information. We make huge efforts to operate in transparent ways and to provide people with good information. There are plenty of ways to press for better information without falsely and baselessly accusing people of bad intentions. We want everyone to help us make sure that we're providing the best possible information in transparent ways, but you must also respect our need to maintain a Friendly, Harassment-free Space.

For the record, I like the idea of the benchmarks you're proposing. To do those benchmarks properly would require proper planning. Due to the tone of this conversation, I'm unable to help you with that work.

@ghost

ghost commented Feb 23, 2017

I've done some experiments on the modarchive dataset (#73). I'd assumed rabin might show an improvement since it's thousands of files with an awful lot of recycled soundfont samples.

For a small directory, with files 35KB or less, adding -s rabin makes no difference at all (directory hash doesn't change):

$ du -sb modarchive.org/ahx/
3011368 modarchive.org/ahx/

$ ipfs add -qr modarchive.org/ahx/ | tail -1 | ipfs object stat
NumLinks: 648
BlockSize: 38220
LinksSize: 38218
DataSize: 2
CumulativeSize: 3024006

Using --raw-leaves shaves off a few KB though; in fact it comes out significantly smaller than the native filesystem after accounting for ext4's directory size (32KiB):

$ ipfs add -qr --raw-leaves modarchive.org/ahx/ | tail -1 | ipfs object stat
NumLinks: 648
BlockSize: 39515
LinksSize: 39513
DataSize: 2
CumulativeSize: 3018115

With a much larger set of files (file size up to 3MB) adding -s rabin does something, but seems to make things slightly worse than the default. It's also very very slow:

$ du -sb modarchive.org/s3m/
2715454353      modarchive.org/s3m/

### default
$ ipfs add -qr modarchive.org/s3m/ | tail -1 | ipfs object stat
NumLinks: 9651
BlockSize: 582400
LinksSize: 582398
DataSize: 2
CumulativeSize: 2716399266

$ ipfs add -qr --raw-leaves modarchive.org/s3m/ | tail -1 | ipfs object stat
NumLinks: 9651
BlockSize: 593301
LinksSize: 593299
DataSize: 2
CumulativeSize: 2716216145

### rabin
$ ipfs add -qr -s rabin modarchive.org/s3m/ | tail -1 | ipfs object stat
NumLinks: 9651
BlockSize: 582400
LinksSize: 582398
DataSize: 2
CumulativeSize: 2716977717

$ ipfs add -qr -s rabin --raw-leaves modarchive.org/s3m/ | tail -1 | ipfs object stat
NumLinks: 9651
BlockSize: 588275
LinksSize: 588273
DataSize: 2
CumulativeSize: 2716710907

And not intended as a comparison, but just to show there's quite a lot of duplicate data hiding there:

$ lrztar --no-compress modarchive.org/s3m/
Compression Ratio: 1.837. Average Compression Speed:  2.718MB/s.
$ du -b s3m.tar.lrz
1482197571      s3m.tar.lrz

Lastly, I added the entire directory tree with -s rabin --raw-leaves (took about 16 hours). Comes out about 10MB larger than the archive in #73:

$ ipfs object stat QmUCVqYNznwxFp4kyCxzyQ9U12uuFJDsLgFMAr5vUih87L
CumulativeSize: 64689569054
$ ipfs object stat QmZjHYLDAr1B1tEU9ZwXFzuAF4HWvLPmuM1nT26r9fgdFL
CumulativeSize: 64698700466

Anyway, results are a bit of a letdown but I hope the data is worth something.

@Kubuxu

Kubuxu commented Feb 23, 2017

Thank you very much. We should evaluate the rabin chunker, as it ought to be providing better results than this. So far it looks like it might as well be a no-op.

@flyingzumwalt

Wow. Thanks @flussence. This confirms our suspicions that the current rabin fingerprinting chunker might just slow things down without any benefit. Bummer. Like @Kubuxu said, we will have to look into why it's not providing more benefits.

I'm confident that we will be able to develop better chunking algorithms over time, and I suspect that the choice of chunking algorithm will be highly dependent on the type of data and the use cases surrounding the data (i.e. many incremental edits vs. accumulation of new info over time, etc.)

For now I think we can close this specific issue because we have our answer to the original question in this issue: With the current code base, the rabin chunker isn't worth it. People should just use the default chunking algorithm until someone improves the rabin chunker or publishes an awesome alternative chunker.

I'm still intrigued by the types of benchmarks that @rht started pushing for. Before closing this current issue I'd like to create a separate issue flagging the need for us to do those broader, farther-reaching benchmarks in a carefully planned, repeatable way. Can anyone help me figure out how to characterize what questions @rht was trying to answer with those tests? Based on that I can write a new issue that describes the questions and sketches how the tests might be structured.

@whyrusleeping

Hey @flussence, thank you so much for the data! If you don't mind, could you try out a few other things?

The rabin chunker does accept parameters (though they're horribly documented; I had to go diving through the source before writing this comment to figure them out).

Examples:
ipfs add -s rabin-1024 - Rabin chunking with an average blocksize of 1024
ipfs add -s rabin-512-2048-4096 - Rabin chunking with a minimum blocksize of 512, average of 2048, and maximum of 4096

The average value should be a power of two (otherwise it gets truncated to one); the min and max can be any numbers so long as they are lower and higher than the average, respectively.

Could you try out some different parameters to rabin? Perhaps try a much smaller average size to start, maybe 1024, then 16384, then maybe 64k?
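
In case it saves some typing, a sweep along those lines could look something like this (just a sketch, reusing the modarchive.org/ahx/ directory from the earlier comment):

for avg in 1024 16384 65536 262144; do
  hash=$(ipfs add -qr -s rabin-$avg modarchive.org/ahx/ | tail -1)
  size=$(ipfs object stat "$hash" | awk '/CumulativeSize/ {print $2}')
  echo "rabin-$avg  CumulativeSize=$size"
done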

@ghost

ghost commented Feb 23, 2017

@whyrusleeping that seems to make it even worse, somehow...

Numbers are for the /ahx/ directory -

switches                          tree size
-s rabin (from above, no-op?)     3024006
-s rabin-65536                    3025136
-s rabin-16384                    3043515
-s rabin-2048                     3245168
-s rabin-1024                     3427008
-s rabin-256-16384-65536          3509324

Not only do the smaller chunk sizes come out larger overall, they also seem to get exponentially slower. The last two were adding files at more or less dialup-modem speeds, and they seem to be entirely disk IO-bound too - I'm not sure what's happening there.

@whyrusleeping

@flussence the slowdown was very likely because it caused ipfs to create a much larger number of objects; storing each object has a base cost (something like X + (Y * size)), so you were seeing that fixed part dominate the performance.
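
To put rough numbers on that intuition (the value of X below is made up purely for illustration): the fixed part of the per-object cost scales with the number of objects, so shrinking the average block size inflates it linearly.

DATA=2715454353   # bytes in modarchive.org/s3m/, from the numbers above
X=200             # hypothetical fixed per-object cost, in bytes
for avg in 262144 65536 16384 1024; do
  blocks=$((DATA / avg))
  echo "avg=$avg  objects=$blocks  fixed-overhead=$((blocks * X)) bytes"
done

At a 1024-byte average that is roughly 2.65 million objects, versus roughly 10k at the default 256KiB, before any deduplication savings.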

Disappointing that rabin fingerprinting isn't showing any gains here...
