Add a buzhash preset with a smaller chunk-size target #7966
Looking at the current hardcoded buzhash settings here: https://github.com/ipfs/go-ipfs-chunker/blob/8f5b3dc1f01ec17e2e0c1c304f12a5435da83f4b/buzhash.go#L10

It uses minimum/target/maximum length settings of 128K/128K/512K. Note that the actual average chunk size, excluding the effect of max-length truncation, is minimum + target, or 256K.

Assuming 256K is the desired average chunk size, those settings are actually (and unusually) pretty good. The only thing that might be worth doing is increasing the maximum length by at least another 128K to avoid too many truncations. I've done some testing of chunker algorithms and settings that generally confirms minimum == target, maximum >= minimum + 4*target is the sweet spot for deduplication using these algorithms: https://github.com/dbaarda/rollsum-chunking/blob/master/RESULTS.rst

There is also the Microsoft paper, which found the sweet spot between chunk deduplication and chunk compression was around an average chunk size of 64K. IPFS doesn't do chunk compression (yet?), so maybe that's not a consideration, but without any other data to base things on, perhaps 32K/32K/256K would be a better default?

Finally, there are advantages to NOT making the chunker size settings configurable, as I mentioned here: https://discuss.ipfs.io/t/draft-common-bytes-standard-for-data-deduplication/6813/11?u=dbaarda
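The "average = minimum + target" claim and the effect of truncation can be checked with a short calculation. If the chunker can only cut after `min` bytes, cuts with probability 1/target per byte after that, and truncates at `max`, the gap past the minimum is approximately exponential with mean `target`, so the expected length is roughly min + target*(1 - e^-((max-min)/target)). A sketch (the candidate presets are the ones from this thread; the exponential model is an approximation):

```python
import math

def expected_chunk_len(min_len, target, max_len):
    """Approximate mean chunk length for a rolling-hash chunker that cuts
    with probability 1/target per byte after min_len and truncates at
    max_len (geometric gap approximated as exponential with mean target)."""
    c = max_len - min_len
    return min_len + target * (1 - math.exp(-c / target))

K = 1024
for name, (mn, tgt, mx) in {
    "current 128K/128K/512K": (128*K, 128*K, 512*K),
    "proposed 32K/32K/256K":  (32*K, 32*K, 256*K),
    "proposed 64K/64K/512K":  (64*K, 64*K, 512*K),
}.items():
    trunc = math.exp(-(mx - mn) / tgt)  # fraction of chunks hitting max_len
    print(f"{name}: avg ~{expected_chunk_len(mn, tgt, mx)/K:.1f} K, "
          f"~{trunc:.1%} truncated")
```

Under this model the current preset averages about 250 K (just under the min + target = 256 K it would reach without a cap), with roughly 5% of chunks truncated at 512 K; the two proposed presets, with max = min + 7*target, truncate well under 1% of chunks.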
I'd really like to know what the sweet-spot chunk size in IPFS is, and nobody seems to be able to answer that. Is the default 256K chunk size really the sweet spot for per-chunk overheads vs large-chunk overheads? How bad would it be if the average block size was reduced to 64K? How bad would it be to just change the sizes to 32K/32K/256K? Is 64K too small? What about 64K/64K/512K for an average size of 128K?

Changing the sizes would break deduplication against content added before the change, but would improve deduplication against all future added content. I think this is no worse than the amount of deduplication that would be broken by adding yet another chunking method. I can easily submit a pull request that just changes the sizes, if people think it's worth doing.
Thanks for the paper, that's interesting. If I read it correctly, they compared 4 KB avg chunks vs 64 KB avg chunks with compression. This is obviously only relevant for data that is decently compressible.

The main reason to avoid smaller chunks is usually memory: one server with limited memory may hold many hundreds of terabytes of data, and chunking and deduplicating that with very small chunks has major downsides, which is why larger chunks matter for that kind of deployment. IPFS, on the other hand, has a different setup: you usually have much more memory per terabyte of storage, so using smaller chunks has benefits, as more similar chunks in the data can be found. If you think about source code files or database dumps, 64 KB is quite a lot of data to store twice because of a very small change. The same goes for websites and, for example, wiki articles.

I think in the long term we might want to use rolling chunking for everything by default, and have a MIME-aware switch to 1 MB blocks for video files, compressed archives, pictures etc., since they cannot be deduplicated anyway. That way we avoid the overhead of small chunks.

So my choice would be 4K/4K/16K to get 8K to 16K chunks.
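The small-edit argument can be illustrated with a toy content-defined chunker. This sketch uses a gear-style rolling hash (not IPFS's actual buzhash; the table, sizes, and edit offset are all illustrative) to measure how many new bytes must be stored after a 9-byte insertion, at a small vs a large average chunk size:

```python
import hashlib
import random

random.seed(42)
GEAR = [random.getrandbits(32) for _ in range(256)]  # random per-byte mix table

def chunks(data, min_len, avg_bits, max_len):
    """Content-defined chunking with a gear-style rolling hash: cut when
    the low avg_bits bits of the hash are zero, giving an average gap of
    about 2**avg_bits bytes past min_len, truncating at max_len."""
    mask = (1 << avg_bits) - 1
    out, start = [], 0
    while start < len(data):
        cut = min(start + max_len, len(data))
        h = 0
        for i in range(start, cut):
            h = ((h << 1) + GEAR[data[i]]) & 0xFFFFFFFF
            if i + 1 - start >= min_len and (h & mask) == 0:
                cut = i + 1
                break
        out.append(data[start:cut])
        start = cut
    return out

blob = random.randbytes(1 << 19)                       # 512 KiB of random data
edited = blob[:250_000] + b"tiny edit" + blob[250_000:]

results = {}
for label, (mn, bits, mx) in {"~4K avg": (2048, 11, 16384),
                              "~64K avg": (32768, 15, 262144)}.items():
    old = {hashlib.sha256(c).digest() for c in chunks(blob, mn, bits, mx)}
    results[label] = sum(len(c) for c in chunks(edited, mn, bits, mx)
                         if hashlib.sha256(c).digest() not in old)
    print(f"{label}: new bytes to store after a 9-byte insert: {results[label]}")
```

Because boundaries are content-defined, the chunker resynchronizes shortly after the edit in both cases, but the one or two invalidated chunks are far larger at a ~64K average, so the small-chunk preset stores far fewer new bytes for the same edit.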
I think the main reason for the 64 KB sweet spot was actually storage-size minimization through a combination of deduplication and compression. The compression they were using didn't have context-aware pre-populated dictionaries and just compressed each chunk in isolation. With smaller chunks, the wins from better deduplication were outweighed by the worse compression of the smaller chunks; with larger chunks, the wins from better compression were outweighed by less deduplication. The dataset they were using was intended to be representative of stuff on Windows file servers, so a mix of uncompressed and compressed data. Since IPFS doesn't (yet) compress chunks, you are right that smaller chunks would save more space.

The biggest problem for IPFS with too-small chunks would be the per-chunk overheads outweighing the deduplication benefits. These overheads include per-chunk metadata and all the transactional overheads of requesting/fetching/verifying chunks. I have no idea what those overheads are like, or where the sweet spot would be.

Also, the max setting caps the chunk size at the cost of deduplication: any chunk truncated to the max size ends at an arbitrary break-point, which messes up break-point alignment for deduplication. A larger max gives better deduplication even though some chunks will be larger, with the point of diminishing returns being around max = 4*avg.

If I had to guess, I'd say 16K/16K/128K would be a good size, with 32K/32K/256K being better if the block overheads are high and/or you start compressing blocks.
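The diminishing-returns claim about the max setting can be sanity-checked analytically: modelling the gap past the minimum as exponential with mean `target` (an approximation of a rolling-hash chunker's geometric cut distribution), the fraction of chunks that get truncated at the cap is e^-((max-min)/target):

```python
import math

# Fraction of chunks truncated when max = min + k*target, under the
# exponential-gap approximation of a rolling-hash chunker's cut points.
for k in range(1, 8):
    print(f"max = min + {k}*target: ~{math.exp(-k):.2%} of chunks truncated")
```

By max = min + 4*target only about 1.8% of chunks hit the cap, so raising the max further buys little additional deduplication.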
With rabin we had a lot of flexibility in the chunk sizes it produced. With buzhash, we only get a single keyword preset with an average chunk size of around 256 K.
While I get the idea behind it, 256 K chunks are just too large for some data to get sensible deduplication; examples would be SQL dumps and VM images.
It would be nice to have an additional keyword for a smaller target chunk size, like 8 K or 16 K, to get a better deduplication ratio.