-
Notifications
You must be signed in to change notification settings - Fork 30
Rabin Fingerprinting #1
Comments
I looked at https://github.com/ipfs/go-ipfs/blob/master/importer/chunk/rabin.go -- it looks not-crazy, but I'll note from Wikipedia (in http://en.wikipedia.org/wiki/Rolling_hash): "The choice of a and n is critical to get good hashing; see linear congruential generator for more discussion." Wikipedia's "a and n" are 'a' and 'MOD' here. Looking at the LCG article, it suggests:
There is also a table of popular values in the LCG article; note that the column for 'c' isn't relevant to this application since the next byte being hashed takes the place of 'c'. (I'm not sure what effect that has on the applicability of the article, and the article on rolling hashes doesn't really explain further.) |
I'll also point out for the record that, since this hash is non-cryptographic, it's easy for an adversary to create a file whose breakpoints are in any place of the adversary's choice. Since you have min and max block size parameters, this probably just means the adversary gets suboptimal content deduplication, but keep it in mind. |
https://github.com/restic/restic/tree/master/chunker <-- working implementation of rabin fingerprinting. maybe we can move it into https://github.com/jbenet/go-chunk |
@jbenet nice, how did you manage to find that? |
@fd0 (author) came by #ipfs yesterday and pointed me to it (thanks!!). Backlog: https://botbot.me/freenode/ipfs/2015-07-04/?msg=43646408&page=3 (including link to hilarious comment :) |
Are you interested if I split out the chunker into another package (like github.com/restic/chunker) that you could just use? What average chunk size are you aiming for? It's about 1MiB for restic (a backup program). Another thing: I found the Wikipedia article misleading, and nearly all implementations of rabin fingerprinting are (unfortunately, I must say) wrong. I read the original paper by Rabin and it's awesome, albeit hard to understand for people without a solid understanding of Algebra: Other hints for optimization techniques can be found in the documentation: |
Wither fine. We'll always vendor things anyway when we use them, so don't worry about changing urls, etc. :) We may put it as part of https://github.com/jbenet/go-chunk if i make it
I think LBFS used 2-64KiB. We'll use 2-256KiB I think.
Even the LBFS implementation?
Thank you! really great paper, even though my algebra is not as solid. |
No, the LBFS implementation was one of the few correct ones. Btw, I had trouble finding the source code for lbfs, so once I found it I converted and archived their CVS at https://github.com/fd0/lbfs |
this is already in go-ipfs, thanks again @whyrusleeping and @fd0 |
Would a better implementation that takes advantage of SSE2 on Intel chips be considered for this? The current implementation takes a grueling hour to do some 50GB on a computer from DigitalOcean with 16GB ram and 8 CPUs. I feel like that's unnecessarily slow, but it's also possible I've been too used to the normal split-the-block method. The other implementation I was referring to was https://github.com/kevinko/rabin. |
In use by LBFS and other systems.
The text was updated successfully, but these errors were encountered: