Expand rolling chunker documentation #8952

RubenKelevra · 2022-05-07T14:41:24Z

Add default chunk size boundaries for chunker=rabin and chunker=buzhash
Explain that different chunkers will lead to different CIDs (and thus no deduplication between files)
Add a recommendation when to use rolling chunkers
Explain the tradeoff between smaller and larger chunk sizes in general
Explain the difference (in data structure) between rolling chunkers and static chunking
Make clear that there's no need for support on the receiving side for a specific chunking algorithm

- Add default chunk size boundaries for chunker=rabin and chunker=buzhash - Explain that different chunkers will lead to different CIDs (and thus no deduplication between files) - Add a recommendation when to use rolling chunkers - Explain the tradeoff between smaller and larger chunk sizes in general - Explain the difference (in data structure) between rolling chunkers and static chunking - Make clear that there's no need for support on the receiving side for a specific chunking algorithm

dbaarda · 2022-05-10T00:59:27Z

core/commands/add.go

@@ -88,6 +88,31 @@ Buzhash or Rabin fingerprint chunker for content defined chunking by
 specifying buzhash or rabin-[min]-[avg]-[max] (where min/avg/max refer
 to the desired chunk sizes in bytes), e.g. 'rabin-262144-524288-1048576'.

+If you just specify 'rabin' without parameters the default sizes are
+'rabin-87381-262144-393216' and the parameters used for buzhash are
+always min: 131072 and max: 524288 with a mask of 65536. Buzhash


Note that these default sizes are far from optimal for deduplication or speed according to my testing at;

https://github.com/dbaarda/rollsum-chunking/blob/master/RESULTS.rst

The real average block size is actually min+avg (ignoring minimal max truncation contributions), so I prefer to use the term "target" or "tgt" instead of "avg". For any rollsum using the normal chunker (ie, no fancy dynamic match like FastCDC or truncation handler like Regression Chunking) the optimal sizes for both deduplication and speed (from cut-point-skipping) for a given average chunk size is with min=tgt=avg/2 , max>=4*avg.

The code currently also seems to support specifying just rabin-[avg] which is the same as rabin-[avg/3]-[avg]-[avg*1.5]. Also, <avg> will get rounded-down to the nearest power of 2 value. There is also constraints that min<tgt<max<=1048576, which means we cannot set min=tgt and have to use min=tgt-1.

For Buzhash the buzMask = 1<<17 - 1. Looking at go operator precedence it seems - binds with a higher precedence than <<, so this really is 1<<16 = 65536, which is a mask with only one bit set, equivalent to avg=2 for an average block size of 128K+2. This seems like a terrible bug, and the intention was really to have buzMask = (1<<17) - 1 which is a mask with 17 bits set for avg=128K. This bug might explain why buzhash chunking deduplication has been so crap in testing; it is pretty much identical to size-131074

So the current defaults for rabin give an average block size of 341K, and buzhash (due to the bug mentioned above) of 128K+2.

So better defaults for an average block size of 256K would be rabin-131071-131072-1048576, and for buzhash would be min=131072,max=1048576, mask=(1<<17) -1.

I know changing these defaults now will break deduplication against content already added with the old defaults, but given buzhash has effectively been broken by a bug fixing it is probably fine. Fixing rabin is more contentious, but you could make rabin-[avg] equivalent to rabin-[avg/2-1]-[avg/2]-[min(avg*4,1M)] so that it gives better defaults when specifying a target average size.

Yeah I was wondering what the - 1 on the bitshift is supposed to do as well...

Buzhash isn't that terrible right now... It actually performed as well as rabin overall in my tests: ipfs/go-ipfs-chunker#31 (comment)

But maybe @Kubuxu can give this a look? :)

Actually, it seems << does have higher precedence than - in go, so there is not a bug in buzhash.

https://go.dev/ref/spec#Operators

I was misled by incorrect documentation here;

https://www.tutorialspoint.com/go/go_operators_precedence.htm

So it seems the default buzhash mask is actually 131071, not 65536 as you have documented here.

BigLep · 2022-06-03T15:50:13Z

@RubenKelevra : thanks for contribution here. We think this should ideally show up under docs.ipfs.io rather than the help text of a CLI command. The help text can link to this docs article which can also use visuals, etc. Do you want to make a conribution to the IPFS docs site instead?

Rationale: too much text like this in a cli help command isn't as helpful and is non-standard when compared to other tools like gzip.

RubenKelevra · 2022-06-03T19:47:15Z

@BigLep sure, no problem.

I'll cut the help page down and move it to the docs.

Btw: I agree that something like this is more for man pages. Is there anything planned to have a man page for go-ipfs instead of large help texts?

Maybe this makes more sense in the long run. :)

BigLep · 2022-06-10T15:52:30Z

2022-06-10 conversation:

Btw: I agree that something like this is more for man pages. Is there anything planned to have a man page for go-ipfs instead of large help texts?

We don't have plans for this now, in part because don't have the bandwidth to maintain it.

There's also an issue that our ipfs cli helptext emits the long description, but that's a separate issue (see #9035).

We still think getting a docs.ipfs.io page on chunking is the right path.

aschmahmann · 2022-06-10T15:55:25Z

core/commands/add.go

@@ -88,6 +88,31 @@ Buzhash or Rabin fingerprint chunker for content defined chunking by
 specifying buzhash or rabin-[min]-[avg]-[max] (where min/avg/max refer
 to the desired chunk sizes in bytes), e.g. 'rabin-262144-524288-1048576'.

+If you just specify 'rabin' without parameters the default sizes are


Specifying the default rabin and buzhash parameters here seems good to me to help people understand what's going on.

aschmahmann · 2022-06-10T15:59:49Z

IMO (and matching some of the comments above) higher level text on how to choose chunking and the tradeoffs seems best on the docs website since it's both not go-ipfs specific and is can be longer and more involved.

lidel · 2022-06-10T19:29:41Z

@RubenKelevra fyi created ipfs/ipfs-docs#1176 with some details around what type of article we are looking for. (imo chunking is part of a bigger topic which can be summarized as "knobs that you can tweak when importing data to IPFS", but if you are only interested in documenting chunking, that is already really appreciated! ❤️ ).

If you could open a PR against that repo with "chunker explainer" draft we can then decide if it should be a standalone article, or a section in a bigger one, explaining all the import parameters :)

RubenKelevra · 2022-06-10T20:41:47Z

I'll give the topic a look tomorrow, as I was planning to take another test to see what I can come up with for ipfs/go-ipfs-chunker#31 as @dbaarda gave me a lot mode input on that. :)

lidel · 2022-11-10T17:04:16Z

cc @DannyS03 (you may find this useful in the context of revamping docs.ipfs.tech)

We have various types of help / explainers in ipfs --help commands (right now, only place this content can be found is CLI or https://docs.ipfs.tech/reference/kubo/cli/ page generated automatically on new Kubo release).

A potential takeaway from this PR is that longer form explainers should live as useful and discoverable articles at https://docs.ipfs.tech, and the --help text produced on CLI should link to them.

dbaarda reviewed May 10, 2022

View reviewed changes

dbaarda mentioned this pull request May 10, 2022

buzhash: reduce target size and cutoff size ipfs/go-ipfs-chunker#31

Closed

lidel mentioned this pull request Jun 10, 2022

CLI helptext cleanup into --help and --help-long #9035

Open

aschmahmann reviewed Jun 10, 2022

View reviewed changes

lidel mentioned this pull request Jun 10, 2022

Article: options and tradeoffs around data import parameters ipfs/ipfs-docs#1176

Open

Jorropo closed this Dec 1, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Expand rolling chunker documentation #8952

Expand rolling chunker documentation #8952

RubenKelevra commented May 7, 2022

dbaarda May 10, 2022 •

edited

Loading

RubenKelevra May 10, 2022

dbaarda May 10, 2022 •

edited

Loading

BigLep commented Jun 3, 2022

RubenKelevra commented Jun 3, 2022

BigLep commented Jun 10, 2022 •

edited by lidel

Loading

aschmahmann Jun 10, 2022

aschmahmann commented Jun 10, 2022

lidel commented Jun 10, 2022

RubenKelevra commented Jun 10, 2022

lidel commented Nov 10, 2022

Expand rolling chunker documentation #8952

Expand rolling chunker documentation #8952

Conversation

RubenKelevra commented May 7, 2022

dbaarda May 10, 2022 • edited Loading

Choose a reason for hiding this comment

RubenKelevra May 10, 2022

Choose a reason for hiding this comment

dbaarda May 10, 2022 • edited Loading

Choose a reason for hiding this comment

BigLep commented Jun 3, 2022

RubenKelevra commented Jun 3, 2022

BigLep commented Jun 10, 2022 • edited by lidel Loading

aschmahmann Jun 10, 2022

Choose a reason for hiding this comment

aschmahmann commented Jun 10, 2022

lidel commented Jun 10, 2022

RubenKelevra commented Jun 10, 2022

lidel commented Nov 10, 2022

dbaarda May 10, 2022 •

edited

Loading

dbaarda May 10, 2022 •

edited

Loading

BigLep commented Jun 10, 2022 •

edited by lidel

Loading