Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expand rolling chunker documentation #8952

Closed
wants to merge 1 commit into from

Conversation

RubenKelevra
Copy link
Contributor

  • Add default chunk size boundaries for chunker=rabin and chunker=buzhash
  • Explain that different chunkers will lead to different CIDs (and thus no deduplication between files)
  • Add a recommendation when to use rolling chunkers
  • Explain the tradeoff between smaller and larger chunk sizes in general
  • Explain the difference (in data structure) between rolling chunkers and static chunking
  • Make clear that there's no need for support on the receiving side for a specific chunking algorithm

- Add default chunk size boundaries for chunker=rabin and chunker=buzhash
- Explain that different chunkers will lead to different CIDs (and thus no deduplication between files)
- Add a recommendation when to use rolling chunkers
- Explain the tradeoff between smaller and larger chunk sizes in general
- Explain the difference (in data structure) between rolling chunkers and static chunking
- Make clear that there's no need for support on the receiving side for a specific chunking algorithm
@@ -88,6 +88,31 @@ Buzhash or Rabin fingerprint chunker for content defined chunking by
specifying buzhash or rabin-[min]-[avg]-[max] (where min/avg/max refer
to the desired chunk sizes in bytes), e.g. 'rabin-262144-524288-1048576'.

If you just specify 'rabin' without parameters the default sizes are
'rabin-87381-262144-393216' and the parameters used for buzhash are
always min: 131072 and max: 524288 with a mask of 65536. Buzhash
Copy link

@dbaarda dbaarda May 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that these default sizes are far from optimal for deduplication or speed according to my testing at;

https://github.com/dbaarda/rollsum-chunking/blob/master/RESULTS.rst

The real average block size is actually min+avg (ignoring minimal max truncation contributions), so I prefer to use the term "target" or "tgt" instead of "avg". For any rollsum using the normal chunker (ie, no fancy dynamic match like FastCDC or truncation handler like Regression Chunking) the optimal sizes for both deduplication and speed (from cut-point-skipping) for a given average chunk size is with min=tgt=avg/2 , max>=4*avg.

The code currently also seems to support specifying just rabin-[avg] which is the same as rabin-[avg/3]-[avg]-[avg*1.5]. Also, <avg> will get rounded-down to the nearest power of 2 value. There is also constraints that min<tgt<max<=1048576, which means we cannot set min=tgt and have to use min=tgt-1.

For Buzhash the buzMask = 1<<17 - 1. Looking at go operator precedence it seems - binds with a higher precedence than <<, so this really is 1<<16 = 65536, which is a mask with only one bit set, equivalent to avg=2 for an average block size of 128K+2. This seems like a terrible bug, and the intention was really to have buzMask = (1<<17) - 1 which is a mask with 17 bits set for avg=128K. This bug might explain why buzhash chunking deduplication has been so crap in testing; it is pretty much identical to size-131074

So the current defaults for rabin give an average block size of 341K, and buzhash (due to the bug mentioned above) of 128K+2.

So better defaults for an average block size of 256K would be rabin-131071-131072-1048576, and for buzhash would be min=131072,max=1048576, mask=(1<<17) -1.

I know changing these defaults now will break deduplication against content already added with the old defaults, but given buzhash has effectively been broken by a bug fixing it is probably fine. Fixing rabin is more contentious, but you could make rabin-[avg] equivalent to rabin-[avg/2-1]-[avg/2]-[min(avg*4,1M)] so that it gives better defaults when specifying a target average size.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I was wondering what the - 1 on the bitshift is supposed to do as well...

Buzhash isn't that terrible right now... It actually performed as well as rabin overall in my tests: ipfs/go-ipfs-chunker#31 (comment)

But maybe @Kubuxu can give this a look? :)

Copy link

@dbaarda dbaarda May 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, it seems << does have higher precedence than - in go, so there is not a bug in buzhash.

https://go.dev/ref/spec#Operators

I was misled by incorrect documentation here;

https://www.tutorialspoint.com/go/go_operators_precedence.htm

So it seems the default buzhash mask is actually 131071, not 65536 as you have documented here.

@BigLep
Copy link
Contributor

BigLep commented Jun 3, 2022

@RubenKelevra : thanks for contribution here. We think this should ideally show up under docs.ipfs.io rather than the help text of a CLI command. The help text can link to this docs article which can also use visuals, etc. Do you want to make a conribution to the IPFS docs site instead?

Rationale: too much text like this in a cli help command isn't as helpful and is non-standard when compared to other tools like gzip.

@RubenKelevra
Copy link
Contributor Author

@BigLep sure, no problem.

I'll cut the help page down and move it to the docs.

Btw: I agree that something like this is more for man pages. Is there anything planned to have a man page for go-ipfs instead of large help texts?

Maybe this makes more sense in the long run. :)

@BigLep
Copy link
Contributor

BigLep commented Jun 10, 2022

2022-06-10 conversation:

Btw: I agree that something like this is more for man pages. Is there anything planned to have a man page for go-ipfs instead of large help texts?

We don't have plans for this now, in part because don't have the bandwidth to maintain it.

There's also an issue that our ipfs cli helptext emits the long description, but that's a separate issue (see #9035).

We still think getting a docs.ipfs.io page on chunking is the right path.

@@ -88,6 +88,31 @@ Buzhash or Rabin fingerprint chunker for content defined chunking by
specifying buzhash or rabin-[min]-[avg]-[max] (where min/avg/max refer
to the desired chunk sizes in bytes), e.g. 'rabin-262144-524288-1048576'.

If you just specify 'rabin' without parameters the default sizes are
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specifying the default rabin and buzhash parameters here seems good to me to help people understand what's going on.

@aschmahmann
Copy link
Contributor

IMO (and matching some of the comments above) higher level text on how to choose chunking and the tradeoffs seems best on the docs website since it's both not go-ipfs specific and is can be longer and more involved.

@lidel
Copy link
Member

lidel commented Jun 10, 2022

@RubenKelevra fyi created ipfs/ipfs-docs#1176 with some details around what type of article we are looking for. (imo chunking is part of a bigger topic which can be summarized as "knobs that you can tweak when importing data to IPFS", but if you are only interested in documenting chunking, that is already really appreciated! ❤️ ).

If you could open a PR against that repo with "chunker explainer" draft we can then decide if it should be a standalone article, or a section in a bigger one, explaining all the import parameters :)

@RubenKelevra
Copy link
Contributor Author

I'll give the topic a look tomorrow, as I was planning to take another test to see what I can come up with for ipfs/go-ipfs-chunker#31 as @dbaarda gave me a lot mode input on that. :)

@lidel
Copy link
Member

lidel commented Nov 10, 2022

cc @DannyS03 (you may find this useful in the context of revamping docs.ipfs.tech)

We have various types of help / explainers in ipfs --help commands (right now, only place this content can be found is CLI or https://docs.ipfs.tech/reference/kubo/cli/ page generated automatically on new Kubo release).

A potential takeaway from this PR is that longer form explainers should live as useful and discoverable articles at https://docs.ipfs.tech, and the --help text produced on CLI should link to them.

@Jorropo Jorropo closed this Dec 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants