
Expand rolling chunker documentation #8952

Closed · wants to merge 1 commit
25 changes: 25 additions & 0 deletions core/commands/add.go
@@ -88,6 +88,31 @@ Buzhash or Rabin fingerprint chunker for content defined chunking by
specifying buzhash or rabin-[min]-[avg]-[max] (where min/avg/max refer
to the desired chunk sizes in bytes), e.g. 'rabin-262144-524288-1048576'.

If you just specify 'rabin' without parameters, the default sizes are
Contributor commented:

Specifying the default rabin and buzhash parameters here seems good to me to help people understand what's going on.

'rabin-87381-262144-393216' and the parameters used for buzhash are
always min: 131072 and max: 524288 with a mask of 65536. Buzhash
@dbaarda commented on May 10, 2022:

Note that these default sizes are far from optimal for deduplication or speed, according to my testing at:

https://github.com/dbaarda/rollsum-chunking/blob/master/RESULTS.rst

The real average block size is actually min+avg (ignoring minimal max-truncation contributions), so I prefer to use the term "target" or "tgt" instead of "avg". For any rollsum using the normal chunker (i.e. no fancy dynamic masking like FastCDC or truncation handling like Regression Chunking), the optimal sizes for both deduplication and speed (from cut-point skipping) for a given average chunk size are min=tgt=avg/2, max>=4*avg.

The code currently also seems to support specifying just rabin-[avg], which is the same as rabin-[avg/3]-[avg]-[avg*1.5]. Also, <avg> will get rounded down to the nearest power-of-2 value. There are also constraints that min<tgt<max<=1048576, which means we cannot set min=tgt and have to use min=tgt-1.
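That expansion can be sketched in Go. This is a hypothetical helper illustrating the behaviour described in this comment, not the actual go-ipfs-chunker code:

```go
package main

import "fmt"

// prevPow2 rounds n down to the nearest power of two.
func prevPow2(n int) int {
	p := 1
	for p*2 <= n {
		p *= 2
	}
	return p
}

// rabinFromAvg sketches the rabin-[avg] expansion described above:
// avg is rounded down to a power of 2, then min = avg/3 and
// max = avg*1.5 (integer arithmetic throughout).
func rabinFromAvg(avg int) (min, tgt, max int) {
	tgt = prevPow2(avg)
	return tgt / 3, tgt, tgt * 3 / 2
}

func main() {
	min, tgt, max := rabinFromAvg(262144)
	// For avg=262144 this reproduces the documented default
	// rabin-87381-262144-393216.
	fmt.Printf("rabin-%d-%d-%d\n", min, tgt, max)
}
```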

For Buzhash the buzMask = 1<<17 - 1. Looking at Go operator precedence, it seems - binds with a higher precedence than <<, so this really is 1<<16 = 65536, which is a mask with only one bit set, equivalent to avg=2 for an average block size of 128K+2. This seems like a terrible bug; the intention was surely buzMask = (1<<17) - 1, which is a mask with 17 bits set for avg=128K. This bug might explain why buzhash chunking deduplication has been so poor in testing: it is pretty much identical to size-131074.

So the current defaults for rabin give an average block size of 341K, and buzhash (due to the bug mentioned above) of 128K+2.
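The 341K figure follows directly from the min+avg rule stated above, applied to the default rabin parameters:

```go
package main

import "fmt"

func main() {
	// Expected average chunk size ≈ min + avg (ignoring max
	// truncation), per the rule discussed above, applied to the
	// default rabin-87381-262144-393216 parameters.
	min, avg := 87381, 262144
	fmt.Printf("%d bytes ≈ %dK\n", min+avg, (min+avg)/1024)
}
```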

So better defaults for an average block size of 256K would be rabin-131071-131072-1048576, and for buzhash would be min=131072, max=1048576, mask=(1<<17)-1.

I know changing these defaults now will break deduplication against content already added with the old defaults, but given that buzhash has effectively been broken by a bug, fixing it is probably fine. Fixing rabin is more contentious, but you could make rabin-[avg] equivalent to rabin-[avg/2-1]-[avg/2]-[min(avg*4,1M)] so that it gives better defaults when specifying a target average size.
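The proposed rabin-[avg] mapping can be sketched as follows. Again a hypothetical helper following the formula suggested in this comment, not existing go-ipfs code:

```go
package main

import "fmt"

// suggestedRabin maps a desired average chunk size to the proposed
// rabin-[avg/2-1]-[avg/2]-[min(avg*4,1M)] parameters. Since the
// parser requires min < tgt, min is set to tgt-1.
func suggestedRabin(avg int) (min, tgt, max int) {
	tgt = avg / 2
	min = tgt - 1
	max = avg * 4
	if max > 1048576 { // clamp max to the 1M upper bound
		max = 1048576
	}
	return
}

func main() {
	min, tgt, max := suggestedRabin(262144)
	// For a 256K target this yields rabin-131071-131072-1048576,
	// the better defaults proposed above.
	fmt.Printf("rabin-%d-%d-%d\n", min, tgt, max)
}
```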

Contributor (author) replied:

Yeah I was wondering what the - 1 on the bitshift is supposed to do as well...

Buzhash isn't that terrible right now... It actually performed as well as rabin overall in my tests: ipfs/go-ipfs-chunker#31 (comment)

But maybe @Kubuxu can give this a look? :)

@dbaarda commented on May 10, 2022:

Actually, it seems << does have higher precedence than - in Go, so there is not a bug in buzhash.

https://go.dev/ref/spec#Operators

I was misled by incorrect documentation here:

https://www.tutorialspoint.com/go/go_operators_precedence.htm

So it seems the default buzhash mask is actually 131071, not 65536 as you have documented here.
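The precedence can be verified directly. A minimal standalone check, not the actual buzhash code:

```go
package main

import "fmt"

func main() {
	// In Go, << binds tighter than binary -, so this parses as
	// (1<<17) - 1: a mask with the 17 low bits set, i.e. 131071.
	buzMask := 1<<17 - 1
	fmt.Println(buzMask)
	fmt.Printf("%b\n", buzMask) // seventeen consecutive 1 bits
}
```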

is the significantly lighter algorithm with comparable performance.

The chunker parameters are not stored in the resulting data; the
chunker only determines the cut-marks, so no special support is
needed by the reading client for any of these.
With a rolling chunker, a file's data will be stored in blocks of
variable size, while with a fixed-size 'size-*' chunker all blocks
will have the same length.

Choosing a smaller chunk size might lead to better deduplication
results if you store similar files, at the cost of more storage
overhead, and may result in slower transmission since more
objects need to be transmitted.

Files stored with different chunkers in IPFS will NOT deduplicate
against each other, as the per-chunk hashes will differ, resulting
in different CIDs for each chunk and for the resulting file.

If you store large files that are likely to contain duplicate data
in plain-text fashion - that is, not compressed - it is recommended
to use a rolling chunker.

The following examples use very small byte sizes to demonstrate the
properties of the different chunkers on a small file. You'll likely
want to use chunk sizes about 1024 times larger for most files.