Rabin Fingerprinting #1

jbenet · 2015-04-15T21:59:00Z

In use by LBFS and other systems.

gwillen · 2015-04-15T22:04:18Z

I looked at https://github.com/ipfs/go-ipfs/blob/master/importer/chunk/rabin.go -- it looks not-crazy, but I'll note from Wikipedia (in http://en.wikipedia.org/wiki/Rolling_hash): "The choice of a and n is critical to get good hashing; see linear congruential generator for more discussion."

Wikipedia's "a and n" are 'a' and 'MOD' here.

Looking at the LCG article, it suggests:

a - 1 is divisible by all prime factors of MOD
a - 1 is a multiple of 4 if MOD is a multiple of 4.

There is also a table of popular values in the LCG article; note that the column for 'c' isn't relevant to this application since the next byte being hashed takes the place of 'c'. (I'm not sure what effect that has on the applicability of the article, and the article on rolling hashes doesn't really explain further.)

gwillen · 2015-04-15T22:06:14Z

I'll also point out for the record that, since this hash is non-cryptographic, it's easy for an adversary to create a file whose breakpoints are in any place of the adversary's choice. Since you have min and max block size parameters, this probably just means the adversary gets suboptimal content deduplication, but keep it in mind.

jbenet · 2015-07-04T12:27:18Z

https://github.com/restic/restic/tree/master/chunker <-- working implementation of rabin fingerprinting. maybe we can move it into https://github.com/jbenet/go-chunk

whyrusleeping · 2015-07-04T16:47:28Z

@jbenet nice, how did you manage to find that?

jbenet · 2015-07-04T22:59:34Z

@fd0 (author) came by #ipfs yesterday and pointed me to it (thanks!!). Backlog: https://botbot.me/freenode/ipfs/2015-07-04/?msg=43646408&page=3 (including link to hilarious comment :)

fd0 · 2015-07-05T07:33:19Z

Are you interested if I split out the chunker into another package (like github.com/restic/chunker) that you could just use?

What average chunk size are you aiming for? It's about 1MiB for restic (a backup program).

Another thing: I found the Wikipedia article misleading, and nearly all implementations of rabin fingerprinting are (unfortunately, I must say) wrong. I read the original paper by Rabin and it's awesome, albeit hard to understand for people without a solid understanding of Algebra:
http://www.xmailserver.org/rabin.pdf

Other hints for optimization techniques can be found in the documentation:
https://godoc.org/github.com/restic/restic/chunker

jbenet · 2015-07-05T08:07:48Z

Are you interested if I split out the chunker into another package (like github.com/restic/chunker) that you could just use?

Wither fine. We'll always vendor things anyway when we use them, so don't worry about changing urls, etc. :)

We may put it as part of https://github.com/jbenet/go-chunk if i make it

What average chunk size are you aiming for? It's about 1MiB for restic (a backup program).

I think LBFS used 2-64KiB. We'll use 2-256KiB I think.

nearly all implementations of rabin fingerprinting are (unfortunately, I must say) wrong

Even the LBFS implementation?

I read the original paper by Rabin and it's awesome, albeit hard to understand for people without a solid understanding of Algebra:

Thank you! really great paper, even though my algebra is not as solid.

fd0 · 2015-07-05T08:10:14Z

Even the LBFS implementation?

No, the LBFS implementation was one of the few correct ones. Btw, I had trouble finding the source code for lbfs, so once I found it I converted and archived their CVS at https://github.com/fd0/lbfs

jbenet · 2016-02-20T20:20:20Z

this is already in go-ipfs, thanks again @whyrusleeping and @fd0

meyerzinn · 2017-01-29T03:50:37Z

Would a better implementation that takes advantage of SSE2 on Intel chips be considered for this? The current implementation takes a grueling hour to do some 50GB on a computer from DigitalOcean with 16GB ram and 8 CPUs. I feel like that's unnecessarily slow, but it's also possible I've been too used to the normal split-the-block method.

The other implementation I was referring to was https://github.com/kevinko/rabin.

jbenet closed this as completed Feb 20, 2016

ZenGround0 mentioned this issue Jan 10, 2018

IPFS Cluster sharding RFC #278

Open

daviddias mentioned this issue Mar 27, 2018

Add support for Rabin fingerprinting to js-ipfs ipfs/js-ipfs#1283

Closed

This was referenced Nov 27, 2019

Collection of findings ipfs/go-ipfs-chunker#18

Closed

Adoption to IPFS? kevinko/rabin#3

Open

aidanhs mentioned this issue Dec 23, 2019

Research/quantify performance envelopes of multiple CDC algorighms ipfs/specs#227

Open

24 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rabin Fingerprinting #1

Rabin Fingerprinting #1

jbenet commented Apr 15, 2015

gwillen commented Apr 15, 2015

gwillen commented Apr 15, 2015

jbenet commented Jul 4, 2015

whyrusleeping commented Jul 4, 2015

jbenet commented Jul 4, 2015

fd0 commented Jul 5, 2015

jbenet commented Jul 5, 2015

fd0 commented Jul 5, 2015

jbenet commented Feb 20, 2016

meyerzinn commented Jan 29, 2017

Rabin Fingerprinting #1

Rabin Fingerprinting #1

Comments

jbenet commented Apr 15, 2015

gwillen commented Apr 15, 2015

gwillen commented Apr 15, 2015

jbenet commented Jul 4, 2015

whyrusleeping commented Jul 4, 2015

jbenet commented Jul 4, 2015

fd0 commented Jul 5, 2015

jbenet commented Jul 5, 2015

fd0 commented Jul 5, 2015

jbenet commented Feb 20, 2016

meyerzinn commented Jan 29, 2017