-
-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Expand rolling chunker documentation #8952
Conversation
RubenKelevra
commented
May 7, 2022
- Add default chunk size boundaries for chunker=rabin and chunker=buzhash
- Explain that different chunkers will lead to different CIDs (and thus no deduplication between files)
- Add a recommendation when to use rolling chunkers
- Explain the tradeoff between smaller and larger chunk sizes in general
- Explain the difference (in data structure) between rolling chunkers and static chunking
- Make clear that there's no need for support on the receiving side for a specific chunking algorithm
- Add default chunk size boundaries for chunker=rabin and chunker=buzhash - Explain that different chunkers will lead to different CIDs (and thus no deduplication between files) - Add a recommendation when to use rolling chunkers - Explain the tradeoff between smaller and larger chunk sizes in general - Explain the difference (in data structure) between rolling chunkers and static chunking - Make clear that there's no need for support on the receiving side for a specific chunking algorithm
@@ -88,6 +88,31 @@ Buzhash or Rabin fingerprint chunker for content defined chunking by | |||
specifying buzhash or rabin-[min]-[avg]-[max] (where min/avg/max refer | |||
to the desired chunk sizes in bytes), e.g. 'rabin-262144-524288-1048576'. | |||
|
|||
If you just specify 'rabin' without parameters the default sizes are | |||
'rabin-87381-262144-393216' and the parameters used for buzhash are | |||
always min: 131072 and max: 524288 with a mask of 65536. Buzhash |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that these default sizes are far from optimal for deduplication or speed according to my testing at;
https://github.com/dbaarda/rollsum-chunking/blob/master/RESULTS.rst
The real average block size is actually min+avg (ignoring minimal max truncation contributions), so I prefer to use the term "target" or "tgt" instead of "avg". For any rollsum using the normal chunker (ie, no fancy dynamic match like FastCDC or truncation handler like Regression Chunking) the optimal sizes for both deduplication and speed (from cut-point-skipping) for a given average chunk size is with min=tgt=avg/2 , max>=4*avg
.
The code currently also seems to support specifying just rabin-[avg]
which is the same as rabin-[avg/3]-[avg]-[avg*1.5]
. Also, <avg>
will get rounded-down to the nearest power of 2 value. There is also constraints that min<tgt<max<=1048576
, which means we cannot set min=tgt
and have to use min=tgt-1
.
For Buzhash the buzMask = 1<<17 - 1
. Looking at go operator precedence it seems -
binds with a higher precedence than <<
, so this really is 1<<16 = 65536
, which is a mask with only one bit set, equivalent to avg=2 for an average block size of 128K+2. This seems like a terrible bug, and the intention was really to have buzMask = (1<<17) - 1
which is a mask with 17 bits set for avg=128K. This bug might explain why buzhash chunking deduplication has been so crap in testing; it is pretty much identical to size-131074
So the current defaults for rabin give an average block size of 341K, and buzhash (due to the bug mentioned above) of 128K+2.
So better defaults for an average block size of 256K would be rabin-131071-131072-1048576, and for buzhash would be min=131072,max=1048576, mask=(1<<17) -1.
I know changing these defaults now will break deduplication against content already added with the old defaults, but given buzhash has effectively been broken by a bug fixing it is probably fine. Fixing rabin is more contentious, but you could make rabin-[avg]
equivalent to rabin-[avg/2-1]-[avg/2]-[min(avg*4,1M)]
so that it gives better defaults when specifying a target average size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I was wondering what the - 1 on the bitshift is supposed to do as well...
Buzhash isn't that terrible right now... It actually performed as well as rabin overall in my tests: ipfs/go-ipfs-chunker#31 (comment)
But maybe @Kubuxu can give this a look? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, it seems <<
does have higher precedence than -
in go, so there is not a bug in buzhash.
https://go.dev/ref/spec#Operators
I was misled by incorrect documentation here;
https://www.tutorialspoint.com/go/go_operators_precedence.htm
So it seems the default buzhash mask is actually 131071, not 65536 as you have documented here.
@RubenKelevra : thanks for contribution here. We think this should ideally show up under docs.ipfs.io rather than the help text of a CLI command. The help text can link to this docs article which can also use visuals, etc. Do you want to make a conribution to the IPFS docs site instead? Rationale: too much text like this in a cli help command isn't as helpful and is non-standard when compared to other tools like gzip. |
@BigLep sure, no problem. I'll cut the help page down and move it to the docs. Btw: I agree that something like this is more for man pages. Is there anything planned to have a man page for go-ipfs instead of large help texts? Maybe this makes more sense in the long run. :) |
2022-06-10 conversation:
We don't have plans for this now, in part because don't have the bandwidth to maintain it. There's also an issue that our ipfs cli helptext emits the long description, but that's a separate issue (see #9035). We still think getting a docs.ipfs.io page on chunking is the right path. |
@@ -88,6 +88,31 @@ Buzhash or Rabin fingerprint chunker for content defined chunking by | |||
specifying buzhash or rabin-[min]-[avg]-[max] (where min/avg/max refer | |||
to the desired chunk sizes in bytes), e.g. 'rabin-262144-524288-1048576'. | |||
|
|||
If you just specify 'rabin' without parameters the default sizes are |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Specifying the default rabin and buzhash parameters here seems good to me to help people understand what's going on.
IMO (and matching some of the comments above) higher level text on how to choose chunking and the tradeoffs seems best on the docs website since it's both not go-ipfs specific and is can be longer and more involved. |
@RubenKelevra fyi created ipfs/ipfs-docs#1176 with some details around what type of article we are looking for. (imo chunking is part of a bigger topic which can be summarized as "knobs that you can tweak when importing data to IPFS", but if you are only interested in documenting chunking, that is already really appreciated! ❤️ ). If you could open a PR against that repo with "chunker explainer" draft we can then decide if it should be a standalone article, or a section in a bigger one, explaining all the import parameters :) |
I'll give the topic a look tomorrow, as I was planning to take another test to see what I can come up with for ipfs/go-ipfs-chunker#31 as @dbaarda gave me a lot mode input on that. :) |
cc @DannyS03 (you may find this useful in the context of revamping docs.ipfs.tech) We have various types of help / explainers in A potential takeaway from this PR is that longer form explainers should live as useful and discoverable articles at https://docs.ipfs.tech, and the |