
feat!: auto-shard based on node size #171

Merged: 6 commits into master from feat/auto-shard-based-on-node-size, Feb 9, 2023

Conversation

@achingbrain (Member) commented Sep 12, 2021

js counterpart to ipfs/kubo#8114

Changes the `shardSplitThreshold` parameter to mean the size of the final DAGNode (including link names, sizes, etc.) instead of the number of entries in the directory.

Fixes: #149

BREAKING CHANGE: the `shardSplitThreshold` option has changed to `shardSplitThresholdBytes` and reflects a DAGNode size where sharding might kick in
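A minimal sketch of the renamed option in use, assuming the post-change importer API (the blockstore and sample file below are placeholders, not part of this PR):

```js
import { importer } from 'ipfs-unixfs-importer'
import { MemoryBlockstore } from 'blockstore-core/memory'

const blockstore = new MemoryBlockstore()
const files = [{
  path: 'dir/file.txt',
  content: new TextEncoder().encode('hello world')
}]

// sharding now triggers on the size of the directory DAGNode,
// not on the number of entries it contains
for await (const entry of importer(files, blockstore, {
  shardSplitThresholdBytes: 262144 // 256 KiB
})) {
  console.log(entry.path, entry.cid.toString())
}
```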

@olizilla (Member)

Consider changing the config key, since the meaning of the key is changing, to make it easier to throw a sensible error message for users who miss the memo.

@olizilla (Member)

e.g. use `shardSplitThresholdBytes` and warn users still passing `shardSplitThreshold` that it is deprecated and ignored.
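A sketch of the suggested guard (a hypothetical helper, not the importer's actual validation; the 262144 default matches the 256 KiB threshold discussed below):

```js
// hypothetical options normalisation implementing the suggestion above
function normaliseShardOptions (options = {}) {
  if ('shardSplitThreshold' in options) {
    // the old key measured entry count, so silently reusing it would be wrong
    console.warn(
      'shardSplitThreshold is deprecated and ignored; use ' +
      'shardSplitThresholdBytes (a DAGNode size in bytes) instead'
    )
  }
  return {
    ...options,
    shardSplitThresholdBytes: options.shardSplitThresholdBytes ?? 262144
  }
}
```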

@BigLep added this to In Review in Maintenance Priorities - JS via automation, Sep 24, 2021
@BigLep commented Sep 24, 2021

We won't ship until the go counterpart (ipfs/kubo#8114) ships.

@BigLep added the status/blocked (Unable to be worked further until needs are met) label, Sep 24, 2021
@BigLep moved this from In Review to Parked/Blocked in Maintenance Priorities - JS, Oct 8, 2021
@BigLep removed this from Parked/Blocked in Maintenance Priorities - JS, Jan 21, 2022
@BigLep removed the status/blocked (Unable to be worked further until needs are met) label, Jan 21, 2022
@BigLep commented Jan 21, 2022

Per 2022-01-21: this is no longer blocked on go-ipfs (it just needs a maintainer to be able to finish it off).

@rvagg (Member) commented Jan 24, 2022

Hey, so I was just fixing the impl for this in go-unixfsnode to match what go-unixfs is doing for go-ipfs (it's already being used in the wild, such as in go-car, and currently doesn't do the right thing with sharding). I could tweak that test case to find the exact edge where it switches from plain to sharded and generate two test cases for the switch. Then we'd have root CIDs that we could duplicate across implementations in our tests, to ensure we're doing the exact same thing and generating the exact same data.

Worthwhile?

@rvagg (Member) commented Jan 24, 2022

I went ahead and did it @ ipfs/go-unixfsnode#21

The tests pack up a directory called "rootDir" full of files with the name "long name to fill out bytes to make the sharded directory test flip over the sharded directory limit because link names are included in the directory entry X", where X increments from 0 to the number of files. The files contain the same string as the name. (A sketch of generating this fixture follows the list below.)

  • 1343 of these files in the directory make an unsharded root with CID bafybeihecq4rpl4nw3cgfb2uiwltgsmw5sutouvuldv5fxn4gfbihvnalq (and a reported total size of 490665; that may not be a relevant concern here, but the builders report a total size).
  • one extra file, making 1344, produces a sharded root with CID bafybeigyvxs6og5jbmpaa43qbhhd5swklqcfzqdrtjgfh53qjon6hpjaye (reported total size of 515735).
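A sketch of how that fixture might be generated (the file name is quoted from the go-unixfsnode test; the generator shape itself is an assumption):

```js
// name quoted from the go-unixfsnode test fixture
const base = 'long name to fill out bytes to make the sharded directory test ' +
  'flip over the sharded directory limit because link names are included in ' +
  'the directory entry '

function * fixtureFiles (count) {
  for (let i = 0; i < count; i++) {
    // each file's content is the same string as its name
    const name = `${base}${i}`
    yield { path: `rootDir/${name}`, content: new TextEncoder().encode(name) }
  }
}

// 1343 of these files => unsharded root; 1344 => sharded root (CIDs above)
```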

@BigLep commented Mar 18, 2022

2022-03-18 conversation: thanks for finishing, @rvagg. We'll get to reviewing after the js-libp2p TypeScript conversion.

@BigLep BigLep marked this pull request as draft May 27, 2022 13:57
@BigLep commented May 27, 2022

2022-05-27 conversation: this needs tidying up now that go-ipfs has shipped the feature. This will involve a change in the interop test suite.

@achingbrain (Member, Author)
@rvagg can you remember what the shard split threshold used by go-unixfsnode in these tests is?

With 1343 entries I can generate a tree with the expected CID of bafybeihecq4rpl4nw3cgfb2uiwltgsmw5sutouvuldv5fxn4gfbihvnalq, but only if I set the shard split threshold to greater than 256KiB, which is what I thought the limit was supposed to be; indeed, the block size for the (unsharded) root node is 276895 bytes.

With 1344 entries and the shard limit at 256KiB I get the expected CID bafybeigyvxs6og5jbmpaa43qbhhd5swklqcfzqdrtjgfh53qjon6hpjaye, and the block size for the (sharded) root node is 13272 bytes.

Both tests use raw leaves and CIDv1.

@achingbrain changed the title from "feat: auto-shard based on node size" to "feat!: auto-shard based on node size", Feb 9, 2023
@achingbrain marked this pull request as ready for review, February 9, 2023 09:44
@rvagg (Member) commented Feb 9, 2023

```js
/**
 * @returns {number}
 */
calculateNodeSize () {
  return 0
}
```
@achingbrain (Member, Author)


This is the abstract superclass of dir-flat and dir-sharded; it returns 0 to appease TypeScript, while the implementations in the concrete subclasses actually do the work.

This will be refactored in a future PR to be an interface so it'll look less weird.
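For illustration, a minimal sketch of that shape (the class and field names below are hypothetical, not the PR's actual code):

```js
// hypothetical sketch: the abstract superclass returns 0 to satisfy TypeScript
class Dir {
  /**
   * @returns {number}
   */
  calculateNodeSize () {
    return 0
  }
}

// a concrete subclass does the real work (the estimate below is made up)
class DirFlat extends Dir {
  constructor (links = []) {
    super()
    this.links = links // e.g. [{ Name: 'file.txt', Tsize: 1024 }]
  }

  calculateNodeSize () {
    return this.links.reduce((sum, l) => sum + l.Name.length + 8, 0)
  }
}
```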

@achingbrain (Member, Author)

> 256KiB, like the kubo link but more explicit

Ah, right: the estimateDirSize function in go-unixfsnode only counts each DAGLink's Tsize and Name length, whereas the version here calculates the actual block size by serializing the DAGNode.

Consequently, the block go-unixfsnode produces for the version with 1343 links is 276895 bytes, which is over the sharding threshold (256KiB = 262144 bytes) even though the node stays unsharded, because the estimate used for the decision stays under it.

Do we want that, or should it be accurate?
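For comparison, a rough sketch of the two approaches (the per-link accounting in the estimate is a guess; only the serialize-and-measure version is what this PR describes):

```js
import { encode, prepare } from '@ipld/dag-pb'

// estimateDirSize-style guess: count only each link's Name length plus
// a few bytes for Tsize (the `+ 8` is a placeholder, exact rule unknown)
function estimateNodeSize (links) {
  return links.reduce((sum, link) => sum + link.Name.length + 8, 0)
}

// the approach described above for this PR: serialize the node, measure it
function actualNodeSize (node) {
  return encode(prepare(node)).length
}
```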

@rvagg (Member) commented Feb 9, 2023

It's somewhat arbitrary: the actual block size will be bigger, but it doesn't have a way of busting block size limits. I figure we should just roll with the historical calculation to make implementations match; it's good enough, and consistency is worth more than achieving a perfect block size (documentation should set the expectation right for the user, though).

We could go with accuracy, but it'd have to be a coordinated effort and would probably be a breaking change for Kubo, since people will find their hashes changing for the same data.

@achingbrain (Member, Author)

Ok, I'll make it consistent.

What happens when people set the chunk size bigger than the sharding threshold? Seems a bit weird to have a maybe-256KiB limit on the root block but then files with 512KB leaves, for example.

@rvagg (Member) commented Feb 9, 2023

My guess is that whoever made this decision in the first place thought it'd be the most efficient approach, because the size has to be calculated so frequently and serialising is expensive. But there are cheap ways of calculating pb-encoded size, particularly with these blocks, so it wouldn't be too expensive (our codecs do it internally as they write, so we could even expose that).

The weight of history is heavy when it comes to hashes, though.

@achingbrain (Member, Author) commented Feb 9, 2023

Well, that's just it: it's really cheap to calculate the protobuf overhead on each field, and things like CIDs are fixed length so you don't need to hash anything, so you could make an accurate size calculation fast 🤷
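To illustrate the point, a hypothetical helper computing a dag-pb PBLink's encoded size from field lengths alone, without serializing (field numbers per the dag-pb schema):

```js
// byte length of a protobuf-style unsigned varint
function varintLength (n) {
  let len = 1
  while (n >= 0x80) {
    n = Math.floor(n / 128)
    len++
  }
  return len
}

// encoded size of one PBLink without serializing it:
// Hash (field 1, bytes), Name (field 2, string), Tsize (field 3, uint64)
function pbLinkSize (cidByteLength, nameByteLength, tsize) {
  return (1 + varintLength(cidByteLength) + cidByteLength) + // tag + len + CID
    (1 + varintLength(nameByteLength) + nameByteLength) +    // tag + len + name
    (1 + varintLength(tsize))                                // tag + varint
}
```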

@achingbrain (Member, Author)

All PR comments have been addressed, so I'm going to merge this to unblock ipfs/helia-unixfs#1.

@achingbrain merged commit 6ef187f into master, Feb 9, 2023
@achingbrain deleted the feat/auto-shard-based-on-node-size branch, February 9, 2023 13:53
github-actions bot pushed a commit that referenced this pull request Feb 9, 2023
## [ipfs-unixfs-v10.0.0](ipfs-unixfs-v9.0.1...ipfs-unixfs-v10.0.0) (2023-02-09)

### ⚠ BREAKING CHANGES

* the `shardSplitThreshold` option has changed to `shardSplitThresholdBytes` and reflects a DAGNode size where sharding might kick in

### Features

* auto-shard based on node size ([#171](#171)) ([6ef187f](6ef187f)), closes [#149](#149)
@github-actions bot commented Feb 9, 2023

🎉 This PR is included in version ipfs-unixfs-v10.0.0 🎉

Your semantic-release bot 📦🚀

github-actions bot pushed a commit that referenced this pull request Feb 9, 2023
## [ipfs-unixfs-importer-v13.0.0](ipfs-unixfs-importer-v12.0.1...ipfs-unixfs-importer-v13.0.0) (2023-02-09)

### ⚠ BREAKING CHANGES

* the `shardSplitThreshold` option has changed to `shardSplitThresholdBytes` and reflects a DAGNode size where sharding might kick in

### Features

* auto-shard based on node size ([#171](#171)) ([6ef187f](6ef187f)), closes [#149](#149)

### Dependencies

* update sibling dependencies ([1a37705](1a37705))
@github-actions bot commented Feb 9, 2023

🎉 This PR is included in version ipfs-unixfs-importer-v13.0.0 🎉

Your semantic-release bot 📦🚀

github-actions bot pushed a commit that referenced this pull request Feb 9, 2023
## [ipfs-unixfs-exporter-v11.0.0](ipfs-unixfs-exporter-v10.0.1...ipfs-unixfs-exporter-v11.0.0) (2023-02-09)

### ⚠ BREAKING CHANGES

* the `shardSplitThreshold` option has changed to `shardSplitThresholdBytes` and reflects a DAGNode size where sharding might kick in

### Features

* auto-shard based on node size ([#171](#171)) ([6ef187f](6ef187f)), closes [#149](#149)

### Dependencies

* update sibling dependencies ([54018a1](54018a1))
* update sibling dependencies ([1a37705](1a37705))
@github-actions bot commented Feb 9, 2023

🎉 This PR is included in version ipfs-unixfs-exporter-v11.0.0 🎉

Your semantic-release bot 📦🚀

@rvagg (Member) commented Feb 10, 2023

this is very 🥳 🎆

having the 3 implementations agree on this, have the same test fixtures, and get the same CIDs for a complex operation, is awesome!

Successfully merging this pull request may close: Align sharding threshold detection (#149).