Add a buzhash preset with a smaller chunk-size target #7966

Open
RubenKelevra opened this issue Mar 9, 2021 · 4 comments

@RubenKelevra
Contributor

With rabin we had a lot of flexibility regarding the chunk sizes it produces. With buzhash, we only get a single keyword preset with a target of around 256 K.

While I get the idea behind it, 256 K chunks are simply too large for some data to allow sensible deduplication. Examples would be SQL dumps and VM images.

It would be nice to have an additional keyword for a smaller target chunk size, like 8 K or 16 K, to get a better deduplication ratio.

@RubenKelevra RubenKelevra added the kind/enhancement label Mar 9, 2021
@dbaarda

dbaarda commented Mar 10, 2021

Looking at the current hardcoded buzhash settings here:

https://github.com/ipfs/go-ipfs-chunker/blob/8f5b3dc1f01ec17e2e0c1c304f12a5435da83f4b/buzhash.go#L10

It is using minimum/target/maximum length settings of 128K/128K/512K. Note that the actual average chunk size, excluding the effect of max-length truncation, is minimum + target, or 256K. Assuming 256K is the desired average chunk size, those settings are actually (and unusually) pretty good. The only thing that might be worth doing is increasing the maximum length by at least another 128K to avoid too many truncations.
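
For reference, here is a minimal sketch of how those settings translate into an average chunk size. The constant names follow the linked buzhash.go, but the snippet is illustrative rather than a copy of the real code:

```go
package main

import "fmt"

// Rough sketch of the hardcoded buzhash parameters referenced above
// (names mirror the linked buzhash.go; treat exact values as illustrative).
const (
	buzMin  = 128 << 10 // minimum chunk size: 128 KiB
	buzMax  = 512 << 10 // maximum chunk size: 512 KiB
	buzMask = 1<<17 - 1 // 17-bit boundary mask => ~128 KiB expected run after the minimum
)

func main() {
	// A boundary fires with probability 1/(mask+1) per byte, so the expected
	// distance from the minimum to a cut point is mask+1 bytes (the "target").
	target := buzMask + 1
	avg := buzMin + target // average chunk size, ignoring max-length truncation
	fmt.Printf("min=%dK target=%dK max=%dK => average ~%dK\n",
		buzMin>>10, target>>10, buzMax>>10, avg>>10)
}
```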

I've done some testing of chunker algorithms and settings here that generally confirms that minimum == target, maximum >= minimum + 4*target is the sweet spot for deduplication using these algorithms:

https://github.com/dbaarda/rollsum-chunking/blob/master/RESULTS.rst
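
As a sketch of what that rule implies in practice (purely illustrative arithmetic, not code from any chunker):

```go
package main

import "fmt"

// paramsForAverage derives min/target/max settings from a desired average
// chunk size using the rule above: minimum == target (so the average is
// minimum + target) and maximum >= minimum + 4*target.
func paramsForAverage(avg int) (min, target, max int) {
	target = avg / 2
	min = target
	max = min + 4*target // the suggested lower bound for the maximum
	return
}

func main() {
	for _, avg := range []int{256 << 10, 64 << 10, 32 << 10, 16 << 10} {
		min, target, max := paramsForAverage(avg)
		fmt.Printf("avg %3dK => min/target/max = %dK/%dK/%dK\n",
			avg>>10, min>>10, target>>10, max>>10)
	}
}
```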

There is also the Microsoft paper, which found that the sweet spot between chunk deduplication and chunk compression was around an average chunk size of 64K. IPFS doesn't do chunk compression (yet?), so maybe that's not a consideration, but without any other data to base things on, perhaps 32K/32K/256K would be a better default?

Finally, there are advantages to NOT making the chunker size settings configurable, as I mentioned here:

https://discuss.ipfs.io/t/draft-common-bytes-standard-for-data-deduplication/6813/11?u=dbaarda

@dbaarda

dbaarda commented May 22, 2021

I'd really like to know what the sweet-spot chunk size in IPFS is, and nobody seems to be able to answer that. Is the default 256K chunk size really the sweet spot for per-chunk overheads vs. large-chunk overheads? How bad would it be if the average block size were reduced to 64K?

How bad would it be to just change the sizes to 32K/32K/256K? Is 64K too small? What about 64K/64K/512K for an average size of 128K?

Changing the sizes would break deduplication against content added before the change, but it would improve deduplication for all content added afterwards. I think this is no worse than the amount of deduplication that would be broken by adding yet another chunking method.

I can easily submit a pull request that just changes the sizes, if people think it's worth doing.

@RubenKelevra
Contributor Author

Thanks for the paper, that's interesting.

If I read that correctly, they compared 4 KB average chunks against 64 KB average chunks, with compression. That is obviously only relevant for data that is decently compressible.

The main reason to avoid smaller chunks is usually memory: a single server with limited memory may hold multiple hundred terabytes of data, and chunking and deduplicating that with very small chunks has major downsides. That's why larger chunks matter for that kind of application.

IPFS, on the other hand, has a different setup: you usually have much more memory per terabyte of storage, so smaller chunks bring benefits, as more similar chunks can be found in the data.

If you think about source-code files or database dumps, 64 KByte is quite a lot of data to store twice because of a very small change. The same goes for websites and, for example, wiki articles.

I think in the long term we might want to use rolling chunking for everything by default, and have a MIME-aware switch to 1 MB blocks for video files, compressed archives, pictures, etc., since they cannot be deduplicated anyway. That way we avoid the overhead of small chunks.

So my choice would be 4K/4K/16K to get 8K to 16K chunks.
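
To make the request concrete, here is a hypothetical sketch of how such a preset could sit next to the current one. The "buzhash-small" keyword, the struct, and the exact values are assumptions for illustration, not anything that exists in go-ipfs-chunker today:

```go
package main

import "fmt"

// buzhashParams is a hypothetical way to describe a buzhash preset.
type buzhashParams struct {
	min, max int
	mask     int // boundary mask; expected run after the minimum is mask+1 bytes
}

var presets = map[string]buzhashParams{
	// current hardcoded preset (roughly 128K/128K/512K)
	"buzhash": {min: 128 << 10, max: 512 << 10, mask: 1<<17 - 1},
	// the smaller preset suggested here (4K/4K/16K, ~8K average)
	"buzhash-small": {min: 4 << 10, max: 16 << 10, mask: 1<<12 - 1},
}

func main() {
	for name, p := range presets {
		avg := p.min + p.mask + 1 // ignoring max-length truncation
		fmt.Printf("%-13s min=%dK max=%dK => avg ~%dK\n",
			name, p.min>>10, p.max>>10, avg>>10)
	}
}
```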

RubenKelevra added a commit to RubenKelevra/ipfs_go-ipfs-chunker that referenced this issue Jul 7, 2021
RubenKelevra added a commit to RubenKelevra/ipfs_go-ipfs-chunker that referenced this issue Jul 18, 2021
@dbaarda

dbaarda commented May 10, 2022

I think the main reason for the 64KB sweet spot was actually storage-size minimization through a combination of deduplication and compression. The compression they were using did not use context-aware, pre-populated dictionaries and just compressed each chunk in isolation. With smaller chunks, the wins from better deduplication were outweighed by the worse compression of the smaller chunks; with larger chunks, the wins from better compression were outweighed by less deduplication. The dataset they used was intended to be representative of the contents of Windows file servers, so a mix of uncompressed and compressed data.

Since IPFS doesn't (yet) compress chunks, you are right that smaller chunks would save more space. The biggest problem for IPFS with too-small chunks would be the per-chunk overheads outweighing the deduplication benefits. These overheads include per-chunk metadata and all the transactional overheads of requesting, fetching, and verifying chunks. I have no idea what those overheads are like, or where the sweet spot would be.

Also, the max setting limits the largest chunk size at the cost of deduplication: any chunk that is truncated at the max ends at an arbitrary break-point, which messes with break-point alignment for deduplication. A larger max gives better deduplication, even though some chunks will be larger, with the point of diminishing returns being around max = 4*avg.

If I had to guess, I'd say 16K/16K/128K would be a good setting, with 32K/32K/256K being better if the block overheads are high and/or you start compressing blocks.
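
As a rough way to reason about those per-chunk overheads, here is a back-of-the-envelope sketch; the ~100 bytes of metadata per chunk is an assumption for illustration, not a measured IPFS figure:

```go
package main

import "fmt"

func main() {
	// Estimate per-chunk metadata overhead for a 1 GiB file at different
	// average chunk sizes. The per-chunk figure is an assumed placeholder
	// for CID plus dag-pb link data, not a measured value.
	const fileSize = 1 << 30     // 1 GiB
	const perChunkOverhead = 100 // assumed bytes of metadata per chunk

	for _, avg := range []int{256 << 10, 128 << 10, 64 << 10, 32 << 10, 16 << 10} {
		chunks := fileSize / avg
		overhead := chunks * perChunkOverhead
		fmt.Printf("avg %4dK: ~%6d chunks, ~%5d KiB metadata (%.3f%% of file)\n",
			avg>>10, chunks, overhead>>10, 100*float64(overhead)/fileSize)
	}
}
```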
