This repository has been archived by the owner on Jun 20, 2023. It is now read-only.

buzhash: reduce target size and cutoff size #31

Closed
wants to merge 1 commit

Conversation

RubenKelevra

@welcome

welcome bot commented Jul 7, 2021

Thank you for submitting this PR!
A maintainer will be here shortly to review it.
We are super grateful, but we are also overloaded! Help us by making sure that:

  • The context for this PR is clear, with relevant discussion, decisions
    and stakeholders linked/mentioned.

  • Your contribution itself is clear (code comments, self-review for the
    rest) and in its best form. Follow the code contribution guidelines if
    they apply.

Getting other community members to do a review would be great help too on complex PRs (you can ask in the chats/forums). If you are unsure about something, just leave us a comment.
Next steps:

  • A maintainer will triage and assign priority to this PR, commenting on
    any missing things and potentially assigning a reviewer for high
    priority items.

  • The PR gets reviews, discussion, and approvals as needed.

  • The PR is merged by maintainers when it has been approved and comments addressed.

We currently aim to provide initial feedback/triaging within two business days. Please keep an eye on any labelling actions, as these will indicate priorities and status of your contribution.
We are very grateful for your contribution!

Member

@Stebalien Stebalien left a comment


If we're going to change this, we need to make it configurable (like rabin and size) so users can restore the defaults.
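For illustration, a minimal sketch (assuming a hypothetical buzhash-<min>-<tgt>-<max> spec string; this is not the actual go-ipfs-chunker parser) of what such configurability could look like, mirroring how the rabin and size chunkers already take parameters:

package main

import (
    "fmt"
    "strconv"
    "strings"
)

// parseBuzhashSpec parses a hypothetical "buzhash-<min>-<tgt>-<max>" spec
// string. The name and format are illustrative only; today the chunker
// only accepts a bare "buzhash" with fixed constants.
func parseBuzhashSpec(spec string) (min, tgt, max int64, err error) {
    parts := strings.Split(spec, "-")
    if len(parts) != 4 || parts[0] != "buzhash" {
        return 0, 0, 0, fmt.Errorf("expected buzhash-<min>-<tgt>-<max>, got %q", spec)
    }
    vals := make([]int64, 3)
    for i, p := range parts[1:] {
        if vals[i], err = strconv.ParseInt(p, 10, 64); err != nil {
            return 0, 0, 0, fmt.Errorf("bad number %q in %q", p, spec)
        }
    }
    return vals[0], vals[1], vals[2], nil
}

func main() {
    fmt.Println(parseBuzhashSpec("buzhash-16384-32768-65536"))
}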

@RubenKelevra
Author

RubenKelevra commented Jul 9, 2021

Hey @Stebalien, the previous settings are way too large to do sensible deduplication.

There's a study linked in ipfs/kubo#7966 which states that 64 KByte is a good choice for deduplication when compression is applied too.

But that's for highly mixed data and with very limited memory per terabyte of storage.

We don't have such restrictions; users can choose buzhash for data like ISO images, VM images, and uncompressed tar archives, as well as HTML, CSS, and other text-based files.

With the new buzhash size those should deduplicate very well.

For any compressed or encrypted data format, deduplication isn't possible anyway, so those files can be stored in the default 256K blocks and don't need the overhead of being stored in smaller blocks.

@Stebalien
Member

I agree we should change the defaults (although please also consider that we're trading off per-block overhead in terms of access times, network overhead, latencies, etc). However, if we're going to change the defaults, I'd like to let users revert to the old defaults (e.g., to ensure hashes converge with previously imported data).

@RubenKelevra
Author

I mean, sure, there's some overhead involved - but on small files, this is probably not significant. It's more important that many different versions of SQL dumps, VM images, iso images, etc. properly deduplicate.

Latency on disk shouldn't be an issue, since an average of 8K should still perform pretty well on an SSD. That's what a database usually uses as its block size, so operating systems are already somewhat optimized for it. And since we can do more deduplication, we can cache the same amount of data more efficiently. This might even lead to better performance, depending on the amount of duplicate data.

We also reduce the transfer size, since data that hasn't changed can be fetched from the local disk.

For large file transfers, we might want to tweak the network settings to think in data size, not individual blocks. This way we would just fetch more blocks containing the same amount of data.


The target size of this new setting is 8K since, as mentioned above, that's what databases use, which means the operating system is optimized for requests of this size.

When the blocks are stored in a database they would be stored sequentially anyway, so there's no real loss in write speed.

In terms of flatfs, we might hit some barriers since we probably do sync after each block has been written. So we might want to recommend turning off sync if there are performance issues.


Regarding old data: Buzhash is fairly new in ipfs and I don't think there are many users, especially since rabin gives a bigger deduplication ratio at the current settings.

I don't think we need backward compatibility here.

@Stebalien
Member

Stebalien commented Jul 9, 2021 via email

@RubenKelevra
Author

RubenKelevra commented Jul 11, 2021

@Stebalien wrote:

I mean, sure, there's some overhead involved - but on small files, this is probably not significant. It's more important that many different versions of SQL dumps, VM images, iso images, etc. properly deduplicate.

  1. Probably only matters if you have enough information to make a guess. Please test ipfs cat (downloading from a remote host) on files with small chunks versus files with large chunks over a moderate-latency (100ms) connection; you'll notice a large difference.
  2. Prefetching is hard. The smaller the blocks, the worse we are at prefetching. ipfs cat currently prefetches 5-15 blocks ahead with bitswap. I welcome patches to improve this. Setting the target block size to 8KiB will almost certainly reduce throughput on any connection with reasonable latency.

Fair enough, but I think we can improve the performance since it's just about latency. So it's basically the same issue as TCP with a too-small receive window. We just have to send more data "blindly" basically. There's no reason it has to be (much) slower than larger chunks.

  3. Deduplication is not a given and killing perf for encrypted files just to help with files that deduplicate well is not an option.

True. But there's always the option to use just 256K blocks for these types of data (if not even larger). There's basically no point in using a rolling hash for such data to begin with.

So maybe fixed 256K chunks should be the default, and when we're dealing with files we could switch the chunker based on the MIME type of the file if no chunker is manually set by the user?

In terms of defaults, people depend on being able to reproduce old hashes/blocks, both to deduplicate against existing content and for security/reproducibility. Given that making this configurable at runtime is so easy, I'm not going to lower the bar here.

Okay.

I encourage you to read through old issues mentioning "block sizes and chunking" for more context on this issue.

Will do.

@RubenKelevra RubenKelevra marked this pull request as draft July 11, 2021 18:08
@RubenKelevra
Author

@Stebalien I'll first verify my assumption that such small blocks are necessary for good deduplication with some sample data before I proceed.

I just don't trust this Microsoft paper and still think 4-16 K is pretty much what we need.

But will we actually "snap" to 4K blocks, or 512 bytes, or will it output a random byte length?

@RubenKelevra
Author

RubenKelevra commented Jul 14, 2021

@Stebalien I'm finally done - this took longer than expected.

TL;DR:

  • 4-16 K is way too small.
  • Hot candidates are 12k-24k-48k and 16k-32k-64k.

LMK if there's any use-case I missed that would greatly benefit from deduplication on block-level. :)

ipfs rolling hash test - Sheet1.pdf

/ipfs/bafykbzacedsteoatejnapzdhnhoaiyc2y5wsypl6jsif4pf5uwjsztdsk3owa

@RubenKelevra
Author

Ah, and fixed-size 8K and 16K are actually pretty great for databases. So in theory we could recommend those for maximum deduplication of database data.

They even outperform rabin-4k-8k-16k, which means that knowing the alignment of the data is better than simply throwing rolling hashes at everything.

@Stebalien
Member

Awesome work! Were you able to test transfer times as well?

@RubenKelevra
Author

This would have taken too long for all options I've tested. I can run this test, but only for a limited number of cases.

Do you agree with my candidates?

@RubenKelevra
Author

@Stebalien, I forgot the ping :)

@RubenKelevra
Author

RubenKelevra commented Jul 16, 2021

@Stebalien here's my pin time test:

ipfs rolling hash pin-time - Sheet1.pdf

/ipfs/QmPMsvgjoQHQPvw6vWjgHRvC3YddErgAacg9iGu4YLzQsp

Note that this is the worst-case scenario, since this connection is plagued by bufferbloat. Additionally, as the corpus is only one big file, there are only large blocks. In my first test I also had folders with many small files; in that case the results would be closer.

I only tested TCP, to get more reliable results, as UDP is sometimes deprioritized by providers.


Result for me: go with 12k-24k-48k, as it's a fair compromise between small chunks and overhead. The performance penalty for transfers has nothing to do with processing power; it's just that longer latencies aren't handled well yet. So it should be fixable.

@Stebalien
Member

  1. Is that the buzhash with new defaults or buzhash with 256KiB blocks?
  2. What was the latency?
  3. How many blocks?
  4. Which command? ipfs pin add or ipfs cat (both behave differently and will perform differently).

Note: I agree it's "fixable", that's not the problem. The problem is that it needs to be fixed first before we can start using it. Otherwise, we just degrade perf for everyone in the meantime.

Given the numbers I'm seeing, it sounds like 16k-32k-64k is acceptable

@Stebalien
Member

(but I'd still need the answers to the rest of those questions to tell)
(and this still needs to be configurable)

@RubenKelevra
Author

@Stebalien wrote:

  1. Is that the buzhash with new defaults or buzhash with 256KiB blocks?

I've tested the latency with the git commit that is in the profile of the node on the right.

  2. What was the latency?

The latency of the connection is in the document, once idle and once while downloading (with 256K blocks - which did hit the ceiling on this connection).

  3. How many blocks?

The file size is 742 MB; the average block size determines how many blocks each chunker created, but I can look that up if the exact number is important.

  4. Which command? ipfs pin add or ipfs cat (both behave differently and will perform differently).

I've used time ipfs pin add --progress <CID>, then unpinned again and ran an ipfs repo gc between the pin adds.

Note: I agree it's "fixable", that's not the problem. The problem is that it needs to be fixed first before we can start using it. Otherwise, we just degrade perf for everyone in the meantime.

Given the numbers I'm seeing, it sounds like 16k-32k-64k is acceptable

Okay

@RubenKelevra
Author

(but I'd still need the answers to the rest of those questions to tell)
(and this still needs to be configurable)

Well, I think we shouldn't make this configurable. A new and an old buzhash preset should be enough. If everyone experiments with the settings, we end up with the same situation as with rabin, where basically every dataset has different chunker settings, which doesn't help network-wide deduplication.

I actually think we should try to determine the MIME type of the files being added, and select static chunks or the smaller buzhash accordingly for the user. So compressed/encrypted or otherwise non-deduplicable files get large chunks with good performance, and if we encounter HTML/CSS/JS/SQL/XML/etc. we switch to buzhash.

This could be named "adaptive", for example.
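A rough sketch of that "adaptive" idea, assuming the content type is sniffed from the first bytes of the file; pickChunker and the spec strings it returns are placeholders, not existing presets:

package main

import (
    "fmt"
    "net/http"
    "strings"
)

// pickChunker guesses a chunker spec from the first bytes of a file:
// text-like content gets the small-block buzhash, everything else
// (typically already compressed or encrypted) gets plain fixed-size chunks.
// A real implementation would want a much better content sniffer.
func pickChunker(head []byte) string {
    ctype := http.DetectContentType(head) // e.g. "text/html; charset=utf-8"
    if strings.HasPrefix(ctype, "text/") {
        return "buzhash" // hypothetical small-block preset
    }
    return "size-262144" // 256 KiB fixed-size chunks
}

func main() {
    fmt.Println(pickChunker([]byte("<html><body>hello</body></html>")))
    fmt.Println(pickChunker([]byte{0x1f, 0x8b, 0x08, 0x00})) // gzip magic bytes
}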

@RubenKelevra
Author

RubenKelevra commented Jul 18, 2021

To continue this PR I think it makes sense to change buzhash while buzhash-legacy keeps the old chunking parameters. This way users will automatically switch to the new buzhash chunking, while everyone who needs to reproduce a CID from data can use the buzhash-legacy chunker.

How does this sound?

@RubenKelevra
Author

@Stebalien wrote:

  3. How many blocks?

  • 256k: 3136 blocks
    /ipfs/QmQ268dbzaWMLvweiDcRQXmapTUx1PKE5rivPsHPNXvgSY
  • rabin-12k-24k-48k: 30644 blocks
    /ipfs/QmefWQeJHY5hN4BSJw7v1gC5m85gwGHAgUMNJ6nJXEFpk2
  • buzhash: 3279 blocks
    /ipfs/QmSSrnWmXcnST4jD4qnU9WNKaqEV6WNzKQEKzyYfobe37D
  • rabin-16K-32K-64K: 19805 blocks
    /ipfs/QmW7rCHh7QZH8Cqjeoum2f8Gzrh66rKydoT6BgWEhvqHnA

@aschmahmann aschmahmann marked this pull request as draft July 23, 2021 15:40
@dbaarda

dbaarda commented May 10, 2022

Note that I've done extensive testing of chunker sizes here:

https://github.com/dbaarda/rollsum-chunking/blob/master/RESULTS.rst

And in my tests the optimal min-tgt-max sizes for a given avg chunk size are [avg/2]-[avg/2]-[avg*4]. This is optimal for deduplication with good speed (cut-point skipping). The optimal for speed with still-good deduplication is around [avg*2/3]-[avg/3]-[avg*4]. Yes, that means "min" is greater than the middle setting ("avg", AKA "tgt").

I note you didn't test using settings like this... it would be interesting to see how it compares.

These do NOT match what you are proposing.

@dbaarda

dbaarda commented May 10, 2022

Some observations about the rabin settings you were using in your testing in https://github.com/ipfs/go-ipfs-chunker/files/6819330/ipfs.rolling.hash.test.-.Sheet1.pdf

  • for rabin settings [min]-[tgt]-[max] the average chunk size is actually min+tgt, which is why I prefer using "target" or "tgt" for the middle setting, as using "avg" is misleading.
  • your max sizes are too small. You want max>=4*tgt+min, or better max>=4*avg=4*(min+tgt). Setting max smaller than that seriously degrades deduplication for no wins (large chunks are less overhead). Even just using max=1M would probably be better. It's this degradation that probably explains why buzhash and rabin performed much worse than the fixed-size chunkers in some cases.
  • In general, a smaller average chunk size gives better deduplication, and for a given average the sweet spot is with min=tgt. However, at some point the per-chunk overheads start to overtake the deduplication wins, and your figures seem to show this.
  • Note that buzhash has an interesting corner-case that it always breaks at a run of 32 zeros. Things like tar use runs of zero padding between files so it often nicely breaks at file boundaries within a tar file. However, it also means long runs of zeros, like empty blocks in filesystem images, get turned into lots of min-size blocks.

If you want an average block size of 32K, I'd recommend 16K-16K-128K, and for an average block size of 64K (which is a sweet spot identified by Microsoft in their chunking tests) I'd go with 32K-32K-256K. Though in both cases you can set the max higher if you want. If you want to compare against the default 256K average chunk size I'd use 128K-128K-1M.
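A quick sketch applying those rules of thumb (avg = min + tgt, and max >= 4*tgt + min) to the parameter sets above, plus the rabin-16K-32K-64K setting tested earlier in this PR; the check function is hypothetical:

package main

import "fmt"

// check applies the rules of thumb from the comment above: the expected
// average chunk size is min+tgt, and max should be at least 4*tgt+min so
// that truncation at the max size stays rare.
func check(name string, min, tgt, max int) {
    avg := min + tgt
    need := 4*tgt + min
    fmt.Printf("%-14s avg=%4dK  max ok? %-5v (need >=%4dK, have %4dK)\n",
        name, avg>>10, max >= need, need>>10, max>>10)
}

func main() {
    check("16K-16K-128K", 16<<10, 16<<10, 128<<10) // avg 32K, max ok
    check("32K-32K-256K", 32<<10, 32<<10, 256<<10) // avg 64K, max ok
    check("128K-128K-1M", 128<<10, 128<<10, 1<<20) // avg 256K, max ok
    check("16K-32K-64K", 16<<10, 32<<10, 64<<10)   // as tested above: max too small
}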

buzMaxDefault = 64 << 10
buzMinLegacy = 128 << 10
buzMaxLegacy = 512 << 10
buzMask = 1<<17 - 1

@dbaarda dbaarda May 10, 2022


Note that buzMask affects the "target" size, which is the average distance after the min size where chunk boundaries will be found. This setting gives tgt=2^17 or 128K. This means the average chunk boundary will be at 128K + 16K = 144K. Since this is larger than your max of 64K, this means most chunks will be truncated to 64K.

You also need to set buzMaskLegacy = 1<<17 - 1 and change buzMaskDefault = 1<<14 - 1 for tgt=16K, giving a default average block size of 32K. I'd also bump up buzMaxDefault = 128 << 10.
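A small sketch of the arithmetic behind that, assuming the chunker declares a boundary wherever the rolling hash matches the mask, i.e. on average once every mask+1 bytes past the minimum size (the function is illustrative; the values are taken from the discussion above):

package main

import "fmt"

// expectedChunkSize estimates the average chunk size for a buzhash-style
// chunker: boundaries land on average (mask+1) bytes past min, but never
// past max, so a too-small max just truncates most chunks.
func expectedChunkSize(min, max, mask uint32) uint32 {
    tgt := mask + 1
    if min+tgt > max {
        return max // most chunks hit the max and get cut there
    }
    return min + tgt
}

func main() {
    // Values discussed above: min 16K and max 64K (new defaults), mask 1<<17-1.
    fmt.Println(expectedChunkSize(16<<10, 64<<10, 1<<17-1)) // 65536: truncated at max
    // With mask 1<<14-1 (tgt 16K) and max bumped to 128K, as suggested above:
    fmt.Println(expectedChunkSize(16<<10, 128<<10, 1<<14-1)) // 32768: ~32K average
}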

Author


Shouldn't that be (1<<14)-1?



Ha! It turns out << does have higher precedence than -. See ipfs/kubo#8952 (comment)
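For the record, a quick check of that precedence in Go:

package main

import "fmt"

func main() {
    fmt.Println(1<<17 - 1)     // 131071
    fmt.Println((1 << 17) - 1) // 131071: same value, since << binds tighter than -
    fmt.Println(1 << (17 - 1)) // 65536: what the other precedence would have meant
}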

@RubenKelevra
Author

@dbaarda wrote:

  • Note that buzhash has an interesting corner-case that it always breaks at a run of 32 zeros. Things like tar use runs of zero padding between files so it often nicely breaks at file boundaries within a tar file. However, it also means long runs of zeros, like empty blocks in filesystem images, get turned into lots of min-size blocks.

Okay, but that doesn't really matter, as the CIDs are all the same, right?

So you get something like 32391 identical CIDs in a row.

--

Thanks for all the feedback. Def worth going over this again. :)

@dbaarda

dbaarda commented Jun 30, 2022

Sorry for the late reply: yes, all the min-size zero blocks have the same CID. The (maybe minor) problem is that this gives you more blocks, and thus the per-block metadata/etc overheads are worse. But that might be insignificant... I dunno.

Also note that buzhash does this not just for zeros. It will return zero for any 32-byte run of any value where the uint8->uint32 map it uses returns a value with an even number of bits set. The way that map is generated, I think that means at least half of all possible byte values will do this.
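A minimal sketch of that parity argument (not the actual chunker code): a 32-byte window of one repeated byte reduces to the XOR of all 32 rotations of that byte's table entry, which is zero exactly when the entry has an even number of set bits:

package main

import (
    "fmt"
    "math/bits"
)

// xorAllRotations XORs every 32-bit rotation of v. Each output bit then
// equals the parity of popcount(v): even popcount gives 0, odd gives
// 0xffffffff. This is what a window of 32 identical bytes collapses to
// in a buzhash-style rolling hash.
func xorAllRotations(v uint32) uint32 {
    var h uint32
    for k := 0; k < 32; k++ {
        h ^= bits.RotateLeft32(v, k)
    }
    return h
}

func main() {
    fmt.Printf("even popcount -> %#x\n", xorAllRotations(0b1100)) // prints 0x0
    fmt.Printf("odd  popcount -> %#x\n", xorAllRotations(0b0111)) // prints 0xffffffff
}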

@Kubuxu
Member

Kubuxu commented Nov 11, 2022

Also note that buzhash does this not just for zeros. It will return zero for any 32-byte run of any value where the uint8->uint32 map it uses returns a value with an even number of bits set. The way that map is generated, I think that means at least half of all possible byte values will do this.

Interesting observation; we could generate a new table for it to avoid that property. The primary property for the generation was keeping a 50/50 split between one and zero bits in the whole table.
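One rough way to do that, sketched under the assumption that mixing 15-bit and 17-bit entries is acceptable: every entry gets an odd popcount (so no repeated byte hashes a full window to zero) while the table as a whole keeps exactly half of its bits set:

package main

import (
    "fmt"
    "math/bits"
    "math/rand"
)

// withNBitsSet returns a random uint32 with exactly n bits set.
func withNBitsSet(r *rand.Rand, n int) uint32 {
    var v uint32
    for bits.OnesCount32(v) < n {
        v |= 1 << r.Intn(32)
    }
    return v
}

func main() {
    r := rand.New(rand.NewSource(1))
    var table [256]uint32
    total := 0
    for i := range table {
        // Alternate 15- and 17-bit entries: every entry has odd popcount,
        // and the table still has exactly half of all its bits set.
        n := 15
        if i%2 == 1 {
            n = 17
        }
        table[i] = withNBitsSet(r, n)
        total += bits.OnesCount32(table[i])
    }
    fmt.Println("set bits:", total, "of", 256*32) // 4096 of 8192
}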

@hacdias
Member

hacdias commented Jun 16, 2023

This repository is no longer maintained and has been copied over to Boxo. In an effort to avoid noise and crippling in the Boxo repo from the weight of issues of the past, we are closing most issues and PRs in this repo. Please feel free to open a new issue in Boxo (and reference this issue) if resolving this issue is still critical for unblocking or improving your usecase.

You can learn more in the FAQs for the Boxo repo copying/consolidation effort.

@hacdias hacdias closed this Jun 16, 2023

Successfully merging this pull request may close these issues.

Add a buzhash preset with a smaller chunk-size target
5 participants