This repository has been archived by the owner on Jun 20, 2023. It is now read-only.

Use Rabin splitter (data sensitive) by default instead of SizeSplitter (data insensitive) #13

Closed
sideeffffect opened this issue Apr 13, 2019 · 10 comments

Comments

@sideeffffect

Why not use the Rabin splitter in IPFS instead of the SizeSplitter by default?
The rolling-hash-based, data-sensitive Rabin splitter has a huge and obvious advantage: it creates shift-resistant chunks and thus improves data deduplication and sharing.
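
A minimal sketch of what "shift-resistant" means here. It is not the Rabin implementation from go-ipfs-chunker: the rolling hash is a simple Rabin-Karp style polynomial hash, and `BASE`, `MOD`, the window size, the mask, and all function names are assumptions made up for illustration.

```python
import hashlib
import random
from typing import List

BASE = 257            # hypothetical hash base, chosen for this sketch only
MOD = (1 << 31) - 1   # prime modulus, also an arbitrary choice


def fixed_chunks(data: bytes, size: int = 64) -> List[bytes]:
    """Size-based splitting: cut every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data: bytes, window: int = 16, mask: int = 0x3F) -> List[bytes]:
    """Cut after any byte where the hash of the last `window` bytes has all masked bits zero."""
    top = pow(BASE, window - 1, MOD)  # weight of the byte that leaves the window
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        if i >= window:
            # Roll the window forward: drop the oldest byte's contribution.
            h = (h - data[i - window] * top) % MOD
        h = (h * BASE + b) % MOD
        # Only test for a boundary once the window is full.
        if i + 1 >= window and (h & mask) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def shared_chunks(old: List[bytes], new: List[bytes]) -> int:
    """Count chunks of `new` whose SHA-256 digest already appears in `old`."""
    known = {hashlib.sha256(c).digest() for c in old}
    return sum(hashlib.sha256(c).digest() in known for c in new)


rng = random.Random(0)
data = bytes(rng.getrandbits(8) for _ in range(4096))
shifted = b"X" + data  # the same content with one byte inserted at the front

print("fixed-size chunks reused:",
      shared_chunks(fixed_chunks(data), fixed_chunks(shifted)))
print("content-defined chunks reused:",
      shared_chunks(content_defined_chunks(data), content_defined_chunks(shifted)))
```

Because each cut decision depends only on the last `window` bytes, inserting a byte moves the boundaries along with the content, so chunks after the edit keep their hashes. With size-based splitting, every boundary after the insertion lands on different bytes. With a 6-bit mask, a cut fires roughly once every 64 bytes on random input.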

For example, casync does this too.

@Stebalien
Member

Primarily, to avoid changing file hashes too often. Our plan is to switch to rabin (or similar), CIDv1, raw leaves, UnixFSv2 (https://github.com/ipfs/unixfs-v2) etc. all in one go.

@sideeffffect
Author

oh, interesting! any idea when that could happen? is it in weeks, months or years?

@Stebalien
Member

At this point, hopefully months. We're pushing hard for UnixFSv2 at the moment as it will help us support arbitrary file metadata (important for package manager integration).

However, we could always use suggestions on better chunking algorithms. We picked Rabin out of a hat, but there may be better algorithms.

@sideeffffect
Author

FWIW casync (mentioned above) uses Buzhash

@momack2 added this to Inbox in ipfs/go-ipfs on May 9, 2019
@MarkusTeufelberger

MarkusTeufelberger commented Jun 13, 2019

bup (a backup utility) splits with "an algorithm similar to rsync": https://github.com/bup/bup/blob/master/Documentation/bup-split.md (implementation: https://github.com/bup/bup/blob/master/lib/bup/bupsplit.c)

restic (another backup utility) uses Rabin with a min and max size: https://github.com/restic/restic/blob/master/doc/design.rst#backups-and-deduplication
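
A rough sketch of content-defined chunking with a minimum and maximum chunk size, along the lines of the restic approach mentioned above. It is not restic's code: it uses a cyclic-polynomial (Buzhash-style) rolling hash in place of a Rabin fingerprint, and the table seed, window, mask, size bounds, and function names are all assumptions chosen for illustration.

```python
import random
from typing import List

# Hypothetical substitution table: one pseudo-random 32-bit word per byte value.
_rng = random.Random(1234)
TABLE = [_rng.getrandbits(32) for _ in range(256)]


def rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    r %= 32
    return ((x << r) | (x >> (32 - r))) & 0xFFFFFFFF


def bounded_chunks(data: bytes, min_size: int = 32, max_size: int = 256,
                   window: int = 16, mask: int = 0x3F) -> List[bytes]:
    """Cut on a rolling-hash match, but never before min_size or after max_size.

    Assumes min_size <= max_size.
    """
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        # Buzhash update: rotate, mix in the new byte, cancel the byte leaving the window.
        h = rotl32(h, 1) ^ TABLE[b]
        if i >= window:
            h ^= rotl32(TABLE[data[i - window]], window)
        size = i + 1 - start
        # Force a cut at max_size; allow a hash-triggered cut only after min_size.
        if size >= max_size or (size >= min_size and (h & mask) == 0):
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing chunk may be shorter than min_size
    return chunks


sample = bytes(random.Random(7).getrandbits(8) for _ in range(5000))
sizes = [len(c) for c in bounded_chunks(sample)]
print("chunks:", len(sizes), "smallest:", min(sizes), "largest:", max(sizes))
```

The minimum avoids a flood of tiny chunks when boundaries cluster, and the maximum guarantees progress on input where the hash never matches, such as long runs of identical bytes. In this sketch the hash keeps rolling across chunk boundaries rather than being reset at each cut, which is a simplification.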

@sideeffffect
Author

Hi, has the rolling hash chunker debate moved forward in recent months?

@ribasushi
Contributor

@sideeffffect on my end work has been postponed until Q1-ish of next year (the work that @aidanhs linked right above your comment)
A number of insights have already been gathered, including definitive proof that most current single-pass algorithms (like Rabin) do not perform sufficiently well in the places where we are trying to use them.
More in a couple of months, deity willing ;)

@sideeffffect
Author

Has the situation with the chunking strategy changed?

@hacdias
Member

hacdias commented Jun 16, 2023

This repository is no longer maintained and has been copied over to Boxo. In an effort to avoid noise and crippling in the Boxo repo from the weight of issues of the past, we are closing most issues and PRs in this repo. Please feel free to open a new issue in Boxo (and reference this issue) if resolving this issue is still critical for unblocking or improving your use case.

You can learn more in the FAQs for the Boxo repo copying/consolidation effort.

@sideeffffect
Author

Reopened in boxo: ipfs/boxo#355
