This repository has been archived by the owner on Jun 20, 2023. It is now read-only.

buzhash: reduce target size and cutoff size #31

Closed
wants to merge 1 commit

Conversation

RubenKelevra

@welcome

welcome bot commented Jul 7, 2021

Thank you for submitting this PR!
A maintainer will be here shortly to review it.
We are super grateful, but we are also overloaded! Help us by making sure that:

  • The context for this PR is clear, with relevant discussion, decisions
    and stakeholders linked/mentioned.

  • Your contribution itself is clear (code comments, self-review for the
    rest) and in its best form. Follow the code contribution guidelines if
    they apply.

Getting other community members to do a review would be great help too on complex PRs (you can ask in the chats/forums). If you are unsure about something, just leave us a comment.
Next steps:

  • A maintainer will triage and assign priority to this PR, commenting on
    any missing things and potentially assigning a reviewer for high
    priority items.

  • The PR gets reviews, discussion, and approvals as needed.

  • The PR is merged by maintainers when it has been approved and comments addressed.

We currently aim to provide initial feedback/triaging within two business days. Please keep an eye on any labelling actions, as these will indicate priorities and status of your contribution.
We are very grateful for your contribution!

Member

@Stebalien Stebalien left a comment


If we're going to change this, we need to make it configurable (like rabin and size) so users can restore the defaults.
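For illustration, a minimal sketch (assuming a hypothetical buzhash-<min>-<tgt>-<max> spec string; this is not the actual go-ipfs-chunker parser) of what such configurability could look like, mirroring how the rabin and size chunkers already take parameters:

package main

import (
    "fmt"
    "strconv"
    "strings"
)

// parseBuzhashSpec parses a hypothetical "buzhash-<min>-<tgt>-<max>" spec
// string. The name and format are illustrative only; today the chunker
// only accepts a bare "buzhash" with fixed constants.
func parseBuzhashSpec(spec string) (min, tgt, max int64, err error) {
    parts := strings.Split(spec, "-")
    if len(parts) != 4 || parts[0] != "buzhash" {
        return 0, 0, 0, fmt.Errorf("expected buzhash-<min>-<tgt>-<max>, got %q", spec)
    }
    vals := make([]int64, 3)
    for i, p := range parts[1:] {
        if vals[i], err = strconv.ParseInt(p, 10, 64); err != nil {
            return 0, 0, 0, fmt.Errorf("bad number %q in %q", p, spec)
        }
    }
    return vals[0], vals[1], vals[2], nil
}

func main() {
    fmt.Println(parseBuzhashSpec("buzhash-16384-32768-65536"))
}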

@RubenKelevra
Author

RubenKelevra commented Jul 9, 2021

Hey @Stebalien, the previous settings are way too large to do sensible deduplication.

There's a study linked in ipfs/kubo#7966 which states that 64 KByte is a good choice for deduplication when compression is applied too.

But that's for highly mixed data and with very limited memory per terabyte of storage.

We don't have such restrictions; users can choose buzhash for data like ISO images, VM images, and uncompressed tar archives, as well as HTML, CSS, and other text-based files.

With the new buzhash size those should deduplicate very well.

For any compressed or encrypted data format, deduplication isn't possible anyway, so those files can be stored in the default 256K blocks and don't need the overhead of being stored in smaller blocks.

@Stebalien
Member

I agree we should change the defaults (although please also consider that we're trading off per-block overhead in terms of access times, network overhead, latencies, etc). However, if we're going to change the defaults, I'd like to let users revert to the old defaults (e.g., to ensure hashes converge with previously imported data).

@RubenKelevra
Author

I mean, sure, there's some overhead involved - but on small files, this is probably not significant. It's more important that many different versions of SQL dumps, VM images, iso images, etc. properly deduplicate.

Latency on disk shouldn't be an issue, since an average of 8K should still perform pretty well on an SSD. That's what a database usually uses as its block size, so operating systems are already somewhat optimized for it. And since we can do more deduplication, we can cache the same amount of data more efficiently. This might even lead to better performance, depending on the amount of duplicate data.

We also reduce the transfer size, since data that hasn't changed can be fetched from the local disk.

For large file transfers, we might want to tweak the network settings to think in data size, not individual blocks. This way we would just fetch more blocks containing the same amount of data.


The target size of this new setting is 8K since, as mentioned above, that's what databases use, which means the operating system is optimized for requests of this size.

When the blocks are stored in a database they would be stored sequentially anyway, so there's no real loss in write speed.

In terms of flatfs, we might hit some barriers since we probably do sync after each block has been written. So we might want to recommend turning off sync if there are performance issues.


Regarding old data: Buzhash is fairly new in ipfs and I don't think there are many users, especially since rabin gives a bigger deduplication ratio at the current settings.

I don't think we need backward compatibility here.

@Stebalien
Member

Stebalien commented Jul 9, 2021 via email

@RubenKelevra
Author

RubenKelevra commented Jul 11, 2021

@Stebalien wrote:

I mean, sure, there's some overhead involved - but on small files, this is probably not significant. It's more important that many different versions of SQL dumps, VM images, iso images, etc. properly deduplicate.

  1. Probably only matters if you have enough information to make a guess. Please test ipfs cat (downloading from a remote host) on files with small chunks versus files with large chunks over a moderate-latency (100ms) connection; you'll notice a large difference.
  2. Prefetching is hard. The smaller the blocks, the worse we are at prefetching. ipfs cat currently prefetches 5-15 blocks ahead with bitswap. I welcome patches to improve this. Setting the target block size to 8KiB will almost certainly reduce throughput on any connection with reasonable latency.

Fair enough, but I think we can improve the performance since it's just about latency. So it's basically the same issue as TCP with a too-small receive window. We just have to send more data "blindly" basically. There's no reason it has to be (much) slower than larger chunks.

  3. Deduplication is not a given and killing perf for encrypted files just to help with files that deduplicate well is not an option.

True. But there's always the option to use just 256K blocks for these types of data (if not even larger). There's basically no point in using a rolling hash for such data to begin with.

So maybe fixed 256K chunks should be the default, and when we're dealing with files we could switch the chunker based on the MIME type of the file if no chunker is manually set by the user?

In terms of defaults, people depend on being able to reproduce old hashes/blocks, both to deduplicate against existing content and for security/reproducibility. Given that making this configurable at runtime is so easy, I'm not going to lower the bar here.

Okay.

I encourage you to read through old issues mentioning "block sizes and chunking" for more context on this issue.

Will do.

@RubenKelevra RubenKelevra marked this pull request as draft July 11, 2021 18:08
@RubenKelevra
Author

@Stebalien I'll first verify my assumption that such small blocks are necessary for good deduplication with some sample data before I proceed.

I just don't trust this Microsoft paper and still think 4-16 K is pretty much what we need.

But will we actually "snap" to 4K blocks, or 512 bytes, or will it output a random byte length?

@RubenKelevra
Author

RubenKelevra commented Jul 14, 2021

@Stebalien I'm finally done - this took longer than expected.

TL;DR:

  • 4-16 K is way too small.
  • Hot candidates are 12k-24k-48k and 16k-32k-64k.

LMK if there's any use-case I missed that would greatly benefit from deduplication on block-level. :)

ipfs rolling hash test - Sheet1.pdf

/ipfs/bafykbzacedsteoatejnapzdhnhoaiyc2y5wsypl6jsif4pf5uwjsztdsk3owa

@RubenKelevra
Author

Ah, and fixed-size 8K and 16K are actually pretty great for databases. So in theory we could recommend those for maximum deduplication of database data.

They even outperform rabin-4k-8k-16k, which means that knowing the alignment of the data is better than simply throwing rolling hashes at everything.

@Stebalien
Member

Awesome work! Were you able to test transfer times as well?

@RubenKelevra
Author

This would have taken too long for all options I've tested. I can run this test, but only for a limited number of cases.

Do you agree with my candidates?

@RubenKelevra
Author

@Stebalien, I forgot the ping :)

@RubenKelevra
Author

RubenKelevra commented Jul 16, 2021

@Stebalien here's my pin time test:

ipfs rolling hash pin-time - Sheet1.pdf

/ipfs/QmPMsvgjoQHQPvw6vWjgHRvC3YddErgAacg9iGu4YLzQsp

Note that this is the worst-case scenario, since this connection is plagued by bufferbloat. Additionally, as the corpus is only one big file, there are only large blocks. In my first test I also had folders with many small files; in that case the results would be closer.

I only tested TCP, to get more reliable results, as UDP is sometimes deprioritized by providers.


Result for me: go with 12k-24k-48k, as it's a fair compromise between small chunks and overhead. The performance penalty for transfers has nothing to do with processing power; it's just that longer latencies aren't handled well yet. So it should be fixable.

@Stebalien
Member

  1. Is that the buzhash with new defaults or buzhash with 256KiB blocks?
  2. What was the latency?
  3. How many blocks?
  4. Which command? ipfs pin add or ipfs cat (both behave differently and will perform differently).

Note: I agree it's "fixable", that's not the problem. The problem is that it needs to be fixed first before we can start using it. Otherwise, we just degrade perf for everyone in the meantime.

Given the numbers I'm seeing, it sounds like 16k-32k-64k is acceptable

@Stebalien
Member

(but I'd still need the answers to the rest of those questions to tell)
(and this still needs to be configurable)

@RubenKelevra
Author

@Stebalien wrote:

  1. Is that the buzhash with new defaults or buzhash with 256KiB blocks?

I've tested the latency with the git commit that is in the profile of the node on the right.

  2. What was the latency?

The latency of the connection is in the document, once idle and once while downloading (with 256K blocks - which did hit the ceiling on this connection).

  3. How many blocks?

The file size is 742 MB; the average block size determines how many blocks each chunker created, but I can look that up if the exact number is important.

  4. Which command? ipfs pin add or ipfs cat (both behave differently and will perform differently).

I've used time ipfs pin add --progress <CID>, then unpinned again and ran an ipfs repo gc between the pin adds.

Note: I agree it's "fixable", that's not the problem. The problem is that it needs to be fixed first before we can start using it. Otherwise, we just degrade perf for everyone in the meantime.

Given the numbers I'm seeing, it sounds like 16k-32k-64k is acceptable

Okay

@RubenKelevra
Author

(but I'd still need the answers to the rest of those questions to tell)
(and this still needs to be configurable)

Well, I think we shouldn't make this configurable. A new and an old buzhash preset should be enough. If everyone experiments with the settings, we end up with the same situation as with rabin, where basically every dataset has different chunker settings, which doesn't help network-wide deduplication.

I actually think we should try to determine the MIME type of the files being added, and select static chunks or the smaller buzhash accordingly for the user. So compressed/encrypted or otherwise non-deduplicable files get large chunks with good performance, and if we encounter HTML/CSS/JS/SQL/XML/etc. we switch to buzhash.

This could be named "adaptive", for example.
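A rough sketch of that "adaptive" idea, assuming the content type is sniffed from the first bytes of the file; pickChunker and the spec strings it returns are placeholders, not existing presets:

package main

import (
    "fmt"
    "net/http"
    "strings"
)

// pickChunker guesses a chunker spec from the first bytes of a file:
// text-like content gets the small-block buzhash, everything else
// (typically already compressed or encrypted) gets plain fixed-size chunks.
// A real implementation would want a much better content sniffer.
func pickChunker(head []byte) string {
    ctype := http.DetectContentType(head) // e.g. "text/html; charset=utf-8"
    if strings.HasPrefix(ctype, "text/") {
        return "buzhash" // hypothetical small-block preset
    }
    return "size-262144" // 256 KiB fixed-size chunks
}

func main() {
    fmt.Println(pickChunker([]byte("<html><body>hello</body></html>")))
    fmt.Println(pickChunker([]byte{0x1f, 0x8b, 0x08, 0x00})) // gzip magic bytes
}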

@RubenKelevra
Author

RubenKelevra commented Jul 18, 2021

To continue this PR I think it makes sense to change buzhash while buzhash-legacy keeps the old chunking parameters. This way users will automatically switch to the new buzhash chunking, while everyone who needs to reproduce a CID from data can use the buzhash-legacy chunker.

How does this sound?

@RubenKelevra
Author

@Stebalien wrote:

  3. How many blocks?

  • 256k: 3136 blocks
    /ipfs/QmQ268dbzaWMLvweiDcRQXmapTUx1PKE5rivPsHPNXvgSY
  • rabin-12k-24k-48k: 30644 blocks
    /ipfs/QmefWQeJHY5hN4BSJw7v1gC5m85gwGHAgUMNJ6nJXEFpk2
  • buzhash: 3279 blocks
    /ipfs/QmSSrnWmXcnST4jD4qnU9WNKaqEV6WNzKQEKzyYfobe37D
  • rabin-16K-32K-64K: 19805 blocks
    /ipfs/QmW7rCHh7QZH8Cqjeoum2f8Gzrh66rKydoT6BgWEhvqHnA

@aschmahmann aschmahmann marked this pull request as draft July 23, 2021 15:40
@dbaarda

dbaarda commented May 10, 2022

Note that I've done extensive testing of chunker sizes here:

https://github.com/dbaarda/rollsum-chunking/blob/master/RESULTS.rst

And in my tests the optimal min-tgt-max sizes for a given avg chunk size are [avg/2]-[avg/2]-[avg*4]. This is optimal for deduplication with good speed (cut-point skipping). The optimal for speed with still-good deduplication is around [avg*2/3]-[avg/3]-[avg*4]. Yes, that means "min" is greater than the middle setting ("avg", AKA "tgt").

I note you didn't test using settings like this... it would be interesting to see how it compares.

These do NOT match what you are proposing.

@dbaarda

dbaarda commented May 10, 2022

Some observations about the rabin settings you were using in your testing in https://github.com/ipfs/go-ipfs-chunker/files/6819330/ipfs.rolling.hash.test.-.Sheet1.pdf

  • for rabin settings [min]-[tgt]-[max] the average chunk size is actually min+tgt, which is why I prefer using "target" or "tgt" for the middle setting, as using "avg" is misleading.
  • your max sizes are too small. You want max>=4*tgt+min, or better max>=4*avg=4*(min+tgt). Setting max smaller than that seriously degrades deduplication for no wins (large chunks are less overhead). Even just using max=1M would probably be better. It's this degradation that probably explains why buzhash and rabin performed much worse than the fixed-size chunkers in some cases.
  • In general, a smaller average chunk size gives better deduplication, and for a given average the sweet spot is with min=tgt. However, at some point the per-chunk overheads start to overtake the deduplication wins, and your figures seem to show this.
  • Note that buzhash has an interesting corner-case that it always breaks at a run of 32 zeros. Things like tar use runs of zero padding between files so it often nicely breaks at file boundaries within a tar file. However, it also means long runs of zeros, like empty blocks in filesystem images, get turned into lots of min-size blocks.

If you want an average block size of 32K, I'd recommend 16K-16K-128K, and for an average block size of 64K (which is a sweet spot identified by Microsoft in their chunking tests) I'd go with 32K-32K-256K. Though in both cases you can set the max higher if you want. If you want to compare against the default 256K average chunk size I'd use 128K-128K-1M.
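A quick sketch applying those rules of thumb (avg = min + tgt, and max >= 4*tgt + min) to the parameter sets above, plus the rabin-16K-32K-64K setting tested earlier in this PR; the check function is hypothetical:

package main

import "fmt"

// check applies the rules of thumb from the comment above: the expected
// average chunk size is min+tgt, and max should be at least 4*tgt+min so
// that truncation at the max size stays rare.
func check(name string, min, tgt, max int) {
    avg := min + tgt
    need := 4*tgt + min
    fmt.Printf("%-14s avg=%4dK  max ok? %-5v (need >=%4dK, have %4dK)\n",
        name, avg>>10, max >= need, need>>10, max>>10)
}

func main() {
    check("16K-16K-128K", 16<<10, 16<<10, 128<<10) // avg 32K, max ok
    check("32K-32K-256K", 32<<10, 32<<10, 256<<10) // avg 64K, max ok
    check("128K-128K-1M", 128<<10, 128<<10, 1<<20) // avg 256K, max ok
    check("16K-32K-64K", 16<<10, 32<<10, 64<<10)   // as tested above: max too small
}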

buzMaxDefault = 64 << 10
buzMinLegacy = 128 << 10
buzMaxLegacy = 512 << 10
buzMask = 1<<17 - 1

@dbaarda dbaarda May 10, 2022


Note that buzMask affects the "target" size, which is the average distance after the min size where chunk boundaries will be found. This setting gives tgt=2^17 or 128K. This means the average chunk boundary will be at 128K + 16K = 144K. Since this is larger than your max of 64K, this means most chunks will be truncated to 64K.

You also need to set buzMaskLegacy = 1<<17 - 1 and change buzMaskDefault = 1<<14 - 1 for tgt=16K, giving a default average block size of 32K. I'd also bump up buzMaxDefault = 128 << 10.
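A small sketch of the arithmetic behind that, assuming the chunker declares a boundary wherever the rolling hash matches the mask, i.e. on average once every mask+1 bytes past the minimum size (the function is illustrative; the values are taken from the discussion above):

package main

import "fmt"

// expectedChunkSize estimates the average chunk size for a buzhash-style
// chunker: boundaries land on average (mask+1) bytes past min, but never
// past max, so a too-small max just truncates most chunks.
func expectedChunkSize(min, max, mask uint32) uint32 {
    tgt := mask + 1
    if min+tgt > max {
        return max // most chunks hit the max and get cut there
    }
    return min + tgt
}

func main() {
    // Values discussed above: min 16K and max 64K (new defaults), mask 1<<17-1.
    fmt.Println(expectedChunkSize(16<<10, 64<<10, 1<<17-1)) // 65536: truncated at max
    // With mask 1<<14-1 (tgt 16K) and max bumped to 128K, as suggested above:
    fmt.Println(expectedChunkSize(16<<10, 128<<10, 1<<14-1)) // 32768: ~32K average
}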

Author


Shouldn't that be (1<<14)-1?



Ha! It turns out << does have higher precedence than -. See ipfs/kubo#8952 (comment)
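For the record, a quick check of that precedence in Go:

package main

import "fmt"

func main() {
    fmt.Println(1<<17 - 1)     // 131071
    fmt.Println((1 << 17) - 1) // 131071: same value, since << binds tighter than -
    fmt.Println(1 << (17 - 1)) // 65536: what the other precedence would have meant
}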

@RubenKelevra
Author

@dbaarda wrote:

  • Note that buzhash has an interesting corner-case that it always breaks at a run of 32 zeros. Things like tar use runs of zero padding between files so it often nicely breaks at file boundaries within a tar file. However, it also means long runs of zeros, like empty blocks in filesystem images, get turned into lots of min-size blocks.

Okay, but that doesn't really matter, as the CIDs are all the same, right?

So you get something like 32391 identical CIDs in a row.

--

Thanks for all the feedback. Def worth going over this again. :)

@dbaarda

dbaarda commented Jun 30, 2022

Sorry for the late reply: yes, all the min-size zero blocks have the same CID. The (maybe minor) problem is that this gives you more blocks, and thus the per-block metadata/etc overheads are worse. But that might be insignificant... I dunno.

Also note that buzhash does this not just for zeros. It will return zero for any 32-byte run of any value where the uint8->uint32 map it uses returns a value with an even number of bits set. The way that map is generated, I think that means at least half of all possible byte values will do this.
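A minimal sketch of that parity argument (not the actual chunker code): a 32-byte window of one repeated byte reduces to the XOR of all 32 rotations of that byte's table entry, which is zero exactly when the entry has an even number of set bits:

package main

import (
    "fmt"
    "math/bits"
)

// xorAllRotations XORs every 32-bit rotation of v. Each output bit then
// equals the parity of popcount(v): even popcount gives 0, odd gives
// 0xffffffff. This is what a window of 32 identical bytes collapses to
// in a buzhash-style rolling hash.
func xorAllRotations(v uint32) uint32 {
    var h uint32
    for k := 0; k < 32; k++ {
        h ^= bits.RotateLeft32(v, k)
    }
    return h
}

func main() {
    fmt.Printf("even popcount -> %#x\n", xorAllRotations(0b1100)) // prints 0x0
    fmt.Printf("odd  popcount -> %#x\n", xorAllRotations(0b0111)) // prints 0xffffffff
}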

@Kubuxu
Member

Kubuxu commented Nov 11, 2022

Also note that buzhash does this not just for zeros. It will return zero for any 32-byte run of any value where the uint8->uint32 map it uses returns a value with an even number of bits set. The way that map is generated, I think that means at least half of all possible byte values will do this.

Interesting observation; we could generate a new table for it to avoid that property. The primary property for the generation was keeping a 50/50 split between one and zero bits in the whole table.
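One rough way to do that, sketched under the assumption that mixing 15-bit and 17-bit entries is acceptable: every entry gets an odd popcount (so no repeated byte hashes a full window to zero) while the table as a whole keeps exactly half of its bits set:

package main

import (
    "fmt"
    "math/bits"
    "math/rand"
)

// withNBitsSet returns a random uint32 with exactly n bits set.
func withNBitsSet(r *rand.Rand, n int) uint32 {
    var v uint32
    for bits.OnesCount32(v) < n {
        v |= 1 << r.Intn(32)
    }
    return v
}

func main() {
    r := rand.New(rand.NewSource(1))
    var table [256]uint32
    total := 0
    for i := range table {
        // Alternate 15- and 17-bit entries: every entry has odd popcount,
        // and the table still has exactly half of all its bits set.
        n := 15
        if i%2 == 1 {
            n = 17
        }
        table[i] = withNBitsSet(r, n)
        total += bits.OnesCount32(table[i])
    }
    fmt.Println("set bits:", total, "of", 256*32) // 4096 of 8192
}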

@hacdias
Member

hacdias commented Jun 16, 2023

This repository is no longer maintained and has been copied over to Boxo. In an effort to avoid noise and crippling in the Boxo repo from the weight of issues of the past, we are closing most issues and PRs in this repo. Please feel free to open a new issue in Boxo (and reference this issue) if resolving this issue is still critical for unblocking or improving your usecase.

You can learn more in the FAQs for the Boxo repo copying/consolidation effort.

@hacdias hacdias closed this Jun 16, 2023

Successfully merging this pull request may close these issues.

Add a buzhash preset with a smaller chunk-size target
5 participants