Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement unixfs sharding #3042

Merged
merged 8 commits into from
Mar 24, 2017
Merged

Implement unixfs sharding #3042

merged 8 commits into from
Mar 24, 2017

Conversation

whyrusleeping
Copy link
Member

@whyrusleeping whyrusleeping commented Aug 5, 2016

This PR implements a hash array mapped trie using merkledag nodes as a base layer for sharded unixfs directories.

This changeset is broken up into three distinct parts:

  1. Implement the HAMT structure itself. This is a bunch of algorithmic code that i'd like some well thought out sanity checking on.
    • Need confirmation on hash code numbering
    • Thoughts on willf/bitset vs stdlib math/big.Int
    • Implementation correctness
    • Consensus on addition of fields to unixfs protobuf
  2. Integration of HAMT code into the directory builder logic.
    • provides us a nice abstraction to think about things in terms of a filesystem
    • optimizes smaller directories by not sharding until we hit a threshold
  3. Integration of directory builder code into mfs
    • moves mfs from using a simple merkledag Node object to using a dirbuilder
    • allows mfs to create sharded directories
    • by way of the previous point, ipfs add and friends can now create sharded directories.

Thorough review will be very appreciated!

cc @Kubuxu @jbenet @kevina @diasdavid @dignifiedquire

also cc @haadcode as you might like to review the datastructure logic

@whyrusleeping whyrusleeping added the need/review Needs a review label Aug 5, 2016
@whyrusleeping
Copy link
Member Author

Note: tests won't pass yet as I havent vendored some of the new libs. I havent quite decided between willf/bitset and just using the math/big.Int to handle the bitfield logic.

@whyrusleeping whyrusleeping force-pushed the feat/hamt-sharding branch 3 times, most recently from 74a6117 to 259aeb4 Compare August 5, 2016 16:23
@whyrusleeping
Copy link
Member Author

going with the stdlib, fewer imports and the stdlib is proven to be pretty stable. Its a hair bit less efficient on some operations, but we can optimize later if it becomes an issue.

@kevina
Copy link
Contributor

kevina commented Aug 5, 2016

@whyrusleeping, I am afraid I do not know anything about sharding, so I will have to first learn about that before I am of any use. Doing a quick search I discovered ipfs/notes#76, is there anything else that would help with context for this pull request?

)

const (
HashMurmur3 uint64 = iota
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remember that this is configurable on the object

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it will likely use multihash if this proposal goes through: multiformats/multihash#30

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, i'd love to use an 'official' multihash table for this.

@jbenet
Copy link
Member

jbenet commented Aug 5, 2016

@whyrusleeping this is a very big PR that will be critical to review + merge in stages. if this has any bugs it will be serious trouble and will take weeks of work to find and debug. please:

  • rip out unixfs into its own repo first, then we can iterate on the datastructure there first. once we're convinced it really works, we can pull this back into ipfs.
  • make it easy to test and verify the hamt data struct stuff is working well. meaning make some tests that shard large directories well. and be able to produce graphviz out of these to inspect and debug formation.
  • this is going to need very serious stress tests. so far i saw very light invariance testing but no serious stresses, with long sequences of operations expected to test how the HAMT grows and shrinks as stuff is added and removed, and testing that the exact same data structures are produced despite sequences of operations being different.

@whyrusleeping
Copy link
Member Author

@kevina if you want to review the algorithm for the HAMT datastructure and use that to check my implementation, that would be very helpful.

https://en.wikipedia.org/wiki/Hash_array_mapped_trie

@whyrusleeping
Copy link
Member Author

@jbenet I moved the HAMT code into its own repo and added a larger test to verify operation order doesnt change the resultant structure: https://github.com/whyrusleeping/unixfs-hamt

the hamt_stress_test.go will generate a random set of insertions and deletions and run through them in different permutations and verify the outputs are consistent.

@jbenet
Copy link
Member

jbenet commented Aug 9, 2016

Thanks i will revivew this year

@whyrusleeping whyrusleeping added the status/in-progress In progress label Sep 14, 2016
@flyingzumwalt flyingzumwalt added status/ready Ready to be worked status/blocked Unable to be worked further until needs are met and removed status/in-progress In progress labels Sep 26, 2016
@whyrusleeping whyrusleeping force-pushed the feat/hamt-sharding branch 2 times, most recently from d5d3b21 to 5ae869f Compare October 12, 2016 20:52
@whyrusleeping whyrusleeping added this to the Directory Sharding milestone Oct 18, 2016
@whyrusleeping whyrusleeping force-pushed the feat/hamt-sharding branch 2 times, most recently from c8f0fa0 to 10fb7a7 Compare November 1, 2016 07:49
@whyrusleeping whyrusleeping added status/in-progress In progress and removed status/ready Ready to be worked status/blocked Unable to be worked further until needs are met labels Nov 2, 2016
@daviddias
Copy link
Member

@whyrusleeping @jbenet mentioned that it would be better rip out all the unixfs into its own repo and this actually might be extremely beneficial to point several people into something isolated they can review fully.

Also, I think the best review I can do is grab this implementation and write the implementation in JavaScript and possibly write the spec for it.

@whyrusleeping
Copy link
Member Author

@diasdavid I had it pulled out at one point, but it was really just for the sake of 'putting the code somewhere else to look at it'. Actually extracting this code is difficult because it would require extracting a large number of other go-ipfs packages.

@whyrusleeping
Copy link
Member Author

Rebased on latest master (1a365a8).

@diasdavid @dignifiedquire The actual HAMT code is isolated in this single file: https://github.com/ipfs/go-ipfs/blob/07683a7d9a6e83c859c84e75b2eb47d2aee79aa7/unixfs/hamt/hamt.go

All the review i'm looking for is just that file. the rest of the changes are just integration of that up through the stack and i'm not nearly as concerned about that part of the changeset.

mask := new(big.Int).Sub(new(big.Int).Exp(big.NewInt(2), big.NewInt(int64(bp)), nil), big.NewInt(1))
mask.And(mask, ds.bitfield)

return popCount(mask)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not entirely clear on how this function works/what exactly it's purpose is. And in turn I'm having a hard time verifying that the bit operations you do, do what they are supposed to be doing. Would be great to have a comment about what the function does and an explenation why the bit ops do what they do.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not the popCount that one is clear, but rather the whole indexForBitPos.

whyrusleeping and others added 5 commits March 21, 2017 20:19
License: MIT
Signed-off-by: Jeromy <why@ipfs.io>
License: MIT
Signed-off-by: Jeromy <why@ipfs.io>
License: MIT
Signed-off-by: Jeromy <why@ipfs.io>
License: MIT
Signed-off-by: Jeromy <why@ipfs.io>
License: MIT
Signed-off-by: Jeromy <jeromyj@gmail.com>
@Kubuxu
Copy link
Member

Kubuxu commented Mar 22, 2017

Wow, rebased. Is this coming in soon (TM)?

@whyrusleeping whyrusleeping force-pushed the feat/hamt-sharding branch 2 times, most recently from 774715f to f72c37e Compare March 23, 2017 05:04
@whyrusleeping
Copy link
Member Author

@pgte Anything else we need to finish before this goes forward? I think we've agreed on not setting filesizes on the shards.

@pgte
Copy link

pgte commented Mar 23, 2017

@whyrusleeping Correct, I don't think we need anything else.

@Kubuxu
Copy link
Member

Kubuxu commented Mar 23, 2017

Apart from one removed test: #3042 (comment)

License: MIT
Signed-off-by: Jeromy <jeromyj@gmail.com>
mfs/dir.go Outdated
var out []NodeListing
err := d.ForEachEntry(context.TODO(), func(nl NodeListing) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I shouldn't have missed that.

License: MIT
Signed-off-by: Jeromy <jeromyj@gmail.com>
@whyrusleeping
Copy link
Member Author

Choo Choo! Here comes the merge train! 🚋🚋🚋🚋🚋🚋🚋🚋

@whyrusleeping whyrusleeping merged commit ac69697 into master Mar 24, 2017
@whyrusleeping whyrusleeping deleted the feat/hamt-sharding branch March 24, 2017 03:51
@whyrusleeping whyrusleeping removed the status/ready Ready to be worked label Mar 24, 2017
laurentsenta pushed a commit to laurentsenta/kubo that referenced this pull request Feb 25, 2022
laurentsenta pushed a commit to laurentsenta/kubo that referenced this pull request Feb 25, 2022
laurentsenta pushed a commit to laurentsenta/kubo that referenced this pull request Mar 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
need/review Needs a review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants