
cluster sharding #309

Closed
ZenGround0 opened this issue Jan 29, 2018 · 8 comments
Labels
exp/wizard Extensive knowledge (implications, ramifications) required P1 High: Likely tackled by core team if no one steps up

Comments

@ZenGround0
Collaborator

ZenGround0 commented Jan 29, 2018

It's happening! This issue tracks work on ipfs-cluster sharding and file ingestion. See the link for a more in-depth work plan.

Current TODOs
[x] basic file streaming from ctl to service over HTTP (multi-file, no dirs or symlinks yet)
[ ] factor a dex library out of go-ipfs and make it suitable for reading an HTTP multipart stream
[x] add a file's dag (non-sharded) to ipfs after importing in cluster
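The multipart-streaming step above could be sketched with Go's standard mime/multipart package. This is a minimal, hypothetical sketch; the "file" field name and the helper names are illustrative, not cluster's actual API:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"mime/multipart"
)

// buildBody assembles an in-memory multipart body standing in for the HTTP
// request that ipfs-cluster-ctl would stream to the service.
func buildBody(files map[string]string) (*bytes.Buffer, string) {
	var buf bytes.Buffer
	w := multipart.NewWriter(&buf)
	for name, data := range files {
		part, err := w.CreateFormFile("file", name)
		if err != nil {
			panic(err)
		}
		part.Write([]byte(data))
	}
	w.Close()
	return &buf, w.Boundary()
}

// partSizes plays the service side: it walks the multipart stream part by
// part, so large files never need to be buffered whole.
func partSizes(body io.Reader, boundary string) (map[string]int, error) {
	sizes := map[string]int{}
	r := multipart.NewReader(body, boundary)
	for {
		part, err := r.NextPart()
		if err == io.EOF {
			return sizes, nil
		}
		if err != nil {
			return nil, err
		}
		data, err := io.ReadAll(part)
		if err != nil {
			return nil, err
		}
		sizes[part.FileName()] = len(data)
	}
}

func main() {
	body, boundary := buildBody(map[string]string{"a.txt": "hello", "b.txt": "world!"})
	sizes, err := partSizes(body, boundary)
	if err != nil {
		panic(err)
	}
	fmt.Println(sizes)
}
```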

Work takes place on the feat/sharding branch. The current plan is to keep this branch separate from cluster master for the time being and make PRs against feat/sharding. We will merge back to master upon reaching major milestones, but expect many PRs before that.

Although we are pushing forward, we are still lacking feedback. If you find yourself reading this issue and are interested in contributing to this work, consider leaving comments on the RFC linked above.

@ZenGround0
Collaborator Author

ZenGround0 commented Feb 21, 2018

Update: PR #310 brought streaming of files over HTTP and the add subcommand to ipfs-cluster. PRs #316 and #318 brought an unpolished but functional local add of files to the calling node.

Current work is happening in PR #317. A rough sharding implementation exists there (triggered by add --shard), and it succeeds in sharding small files among a cluster. Up next is to improve this sharder implementation in some basic ways for a more reliable, functional, and extensible baseline. In parallel, @hsanjuan is working to abstract modules from go-ipfs so that cluster will not need all of go-ipfs as a dependency to import files into dag nodes (the second TODO in the first comment of this issue).

Dev TODOs
[x] Sharder can handle multiple sessions concurrently
[x] Improve robustness and simplicity of rpc_calls into sharder
[x] cluster-DAG has a single root grouping all shards
[x] shard nodes split into a tree when they grow too big for a single block
[ ] better UX for add

  1. api handler sends an add response
  2. add subcommand is parametrized (repl factor, ipfs add options)
  3. state includes built in mechanism for identifying shards and associated cluster-DAG roots with files
  4. Informative, actionable error messages tailored to specific failure scenarios
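The "shard nodes split into a tree" item above could be sketched roughly as follows. The fanout parameter and node-naming scheme are illustrative assumptions; real cluster-DAG nodes are IPLD blocks bounded by byte size, not link count:

```go
package main

import "fmt"

// splitIntoTree models the split-into-a-tree step: when the list of shard
// links outgrows what one node can hold, group links into intermediate
// nodes of at most fanout children and recurse until a single root remains.
// It returns the number of levels, counting the leaf layer.
func splitIntoTree(links []string, fanout int) int {
	depth := 1
	for len(links) > 1 {
		var parents []string
		for i := 0; i < len(links); i += fanout {
			end := i + fanout
			if end > len(links) {
				end = len(links)
			}
			// Each group of up to fanout links becomes one new node.
			parents = append(parents, fmt.Sprintf("node-%d-%d", depth, i/fanout))
			_ = links[i:end]
		}
		links = parents
		depth++
	}
	return depth
}

func main() {
	shards := make([]string, 100)
	for i := range shards {
		shards[i] = fmt.Sprintf("shard-%d", i)
	}
	// 100 leaves with fanout 10 -> 10 intermediate nodes -> 1 root
	fmt.Println("levels:", splitIntoTree(shards, 10))
}
```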

Current Testing TODOs
Unit Tests:
[x] Test sharding two files concurrently
[ ] Test various shard sizes (i.e. ensure config matches shard size)
[x] Test files with many small chunks and large shard sizes to force cluster-DAG shard nodes to exceed IPFS block limits
Live Cluster Tests:
[ ] Test files of various sizes when sharding
[ ] Test various shard sizes (stress test to get a sense for max data sizes and drive profiling and scaling efforts)
[ ] Test clusters with limited free space in their repos so that shard packing is tightly constrained
[ ] Test different import options (trickle dag, rabin chunking etc.)
[ ] Run shard trials where nodes are forced to fail. Examine cluster state after failure, assess error messages for utility, run through the process of cleaning up state or trying to continue the add to understand pain points, and write documentation based on the experience.
[ ] Find/use/build a framework to do most/all of this automatically. We might want to automate deployment too. Open question: is k8s-ipfs going to work for us here? How much harder would it be to use/improve it than to make a custom deployment/shell-script framework?

Dev NiceToHaves, future TODOs:
[ ] add UX better informed by use cases, so that sharding files across a cluster from any source with any transfer method is a breeze. This will involve studying existing workflows that people are comfortable with. It may involve implementing functionality that "stream pulls" files over HTTP given a link, serializes them to a multipart or archive-dependent stream format, and pushes the stream to the service API.
[ ] Importers exist for a variety of archive types (e.g. deb archives for something like @elopio's use case), in the same way that a tar importer exists today, so that these archives can be sharded.
[ ] Safety is addressed, at the very least with documentation. Perhaps we will provide functionality that allows things like restarting an add where it left off, or automatic cleanup of an unfinished add. We could also implement options allowing users to trade off safety and speed during an add. We should also have support commands for safety UX, for example a command giving an estimated upper bound on the size of a file that a cluster add can handle, or at least one that adds up all repo sizes. As with the "better UX" bullet, all of this is going to require spending time with standard workflows and understanding user pain points.
[ ] Clever handling of the edge case where repos are filled up and, while enough space exists for the file and metadata overall, no one node has room for the configured shard size. Perhaps the best option is to leave this as a failure.

@hsanjuan hsanjuan added exp/wizard Extensive knowledge (implications, ramifications) required status/in-progress In progress P1 High: Likely tackled by core team if no one steps up labels Feb 21, 2018
@ZenGround0
Collaborator Author

ZenGround0 commented Mar 14, 2018

-- State format change proposal --

At a high level I want the state format to support two new things

  1. "shallow" pins that are not literally part of the set of hashes that cluster pins through ipfs and tracks, but rather reference a pin that IS part of the pinset. For the sharding implementation, each shallow cid would be the root CID of a sharded file, and it would reference the cluster-DAG root hash of that sharded file.

  2. "hidden" pins that are tracked by the pinset but don't show up in UX queries about the pinset unless the user explicitly asks, e.g. using ipfs-cluster-ctl pin status <hidden-pin> or ipfs pin ls -a. This way we will not clutter the pinset with shard nodes and cluster-DAG nodes unless a user wants to know about them.

To implement 1., my proposal is to include a second map, ShallowPinMap, alongside the PinMap and Version in the MapState. The ShallowPinMap will keep track of shallow pins and the tracked pins they rely on. There are a few ways to represent shallow pins:

  • A: the ShallowPinMap could also be of type map[string]api.PinSerial, where the string key would be the shallow pin's cid and the PinSerial would be that of the reference pin
  • B: as a shallow pin really only needs to refer to another, this could simply be of type map[string]*cid.Cid

I think either of these should work fine, and I lean towards option A. If we go with option A, pretty-printing the state may get a bit easier, as we won't need to fetch the reference pin to send back info on it. Displaying shallow pins is also a topic for discussion. Perhaps, as in the sharding RFC, all pins should display a field clusterDAG: ... and shallow pins are the only ones to display a hash. Alternatively, I lean towards printing the shallow pins differently and displaying the custom clusterDAG/referencePin field only for shallow pins. Additionally, the shallow pin display would include the other fields as in a regular display. Allocations would just mean all of the peers holding shards, and the repl factors should still mean the same thing, only now across shards.
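A minimal sketch of option A, assuming plain-string CIDs and a trimmed stand-in for api.PinSerial (the field set here is illustrative, not cluster's actual type):

```go
package main

import "fmt"

// PinSerial is a trimmed stand-in for cluster's api.PinSerial; only the
// fields needed for the sketch are included.
type PinSerial struct {
	Cid         string
	Allocations []string
	ReplFactor  int
}

// MapState mirrors the proposal: the usual PinMap plus a ShallowPinMap
// whose keys are shallow-pin cids (e.g. the root of a sharded file) and
// whose values are the PinSerial of the referenced pin (the cluster-DAG
// root), per option A.
type MapState struct {
	Version       int
	PinMap        map[string]PinSerial
	ShallowPinMap map[string]PinSerial
}

// Resolve returns the tracked pin backing a cid, following a shallow pin
// to its reference pin when necessary. Option A stores the reference
// PinSerial directly, so no second lookup into PinMap is needed.
func (s *MapState) Resolve(c string) (PinSerial, bool) {
	if p, ok := s.PinMap[c]; ok {
		return p, true
	}
	if ref, ok := s.ShallowPinMap[c]; ok {
		return ref, true
	}
	return PinSerial{}, false
}

func main() {
	st := &MapState{
		Version: 1,
		PinMap: map[string]PinSerial{
			"QmClusterDAG": {Cid: "QmClusterDAG", Allocations: []string{"peerA", "peerB"}, ReplFactor: 2},
		},
		ShallowPinMap: map[string]PinSerial{
			"QmFileRoot": {Cid: "QmClusterDAG", Allocations: []string{"peerA", "peerB"}, ReplFactor: 2},
		},
	}
	p, _ := st.Resolve("QmFileRoot")
	fmt.Println("QmFileRoot is backed by", p.Cid)
}
```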

To implement 2., my proposal is to add a hidden field to the api "Pin" type. The only pins that will be made hidden are those that the sharding component adds to the pinset when constructing the cluster-DAG. Additionally, the reporting functionality (pin ls and pin status) will need to change, though this should amount to a simple check of whether a pin is hidden before sending out data in certain calls.
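The hidden-pin filtering could look roughly like this sketch; the Pin type is a trimmed, hypothetical stand-in for cluster's api.Pin:

```go
package main

import "fmt"

// Pin is a trimmed stand-in for cluster's api.Pin type, extended with the
// proposed Hidden field.
type Pin struct {
	Cid    string
	Hidden bool
}

// visiblePins filters out hidden pins (shard and cluster-DAG nodes) unless
// the caller explicitly asks for everything, as `pin ls -a` would.
func visiblePins(pins []Pin, all bool) []Pin {
	if all {
		return pins
	}
	out := []Pin{}
	for _, p := range pins {
		if !p.Hidden {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	pins := []Pin{
		{Cid: "QmFileRoot"},
		{Cid: "QmShard1", Hidden: true},
		{Cid: "QmClusterDAG", Hidden: true},
	}
	fmt.Println("default listing:", len(visiblePins(pins, false)), "pin(s)")
	fmt.Println("with -a:", len(visiblePins(pins, true)), "pin(s)")
}
```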

@hsanjuan let me know your thoughts and suggestions for alternatives

@hsanjuan
Collaborator

hsanjuan commented Mar 14, 2018

I see some problems:

  • It is not possible to know which shallow pin is associated with a cluster-DAG
  • It is not possible to know which cluster-DAG(s) a shard belongs to
  • Looking up pin attributes (like replication factor) requires a double look-up, and the same information would be attached to all shards.
  • A Hidden field's sole purpose is to indicate a display property to the final user, rather than helping to manage our internal shared state. We should be able to use other fields to figure out what type of pin we're handling.

I'm going to establish some glossary too:

  • Cid Object is an abstract object which carries at least a cid.Cid, and maybe some other info. pin (below) is the only Cid Object for which we have a concrete format (api.Pin).
  • metapin is a Cid Object. The cid is the root of a DAG sharded by cluster. Thus, there is a Cluster DAG associated with it. This is the shallow pin from above.
  • pin is a Cid Object. The cid is the root of a DAG not sharded by cluster. Thus, this DAG fits in a single ipfs daemon.
  • cdag is a Cid Object. The cid is the root of a Cluster DAG.
  • A Cluster DAG is a DAG which groups several shards.
  • Shard is a Cid Object. The cid is the root of a DAG. This DAG is assumed to fit in a single ipfs daemon.

We are trying to specify what these Cid Objects look like, other than carrying a Cid in them.
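One way to make the glossary concrete is a type tag on a shared Cid-object struct. All names here are illustrative sketches, not a committed format:

```go
package main

import "fmt"

// PinType distinguishes the Cid-object kinds from the glossary above.
type PinType int

const (
	DataType       PinType = iota // pin: root of a DAG not sharded by cluster
	MetaType                      // metapin: root of a DAG sharded by cluster
	ClusterDAGType                // cdag: root of a Cluster DAG grouping shards
	ShardType                     // shard: root of one shard of a sharded DAG
)

// CidObject carries at least a cid plus the type tag; links between the
// kinds (metapin -> cdag -> shards) are kept as plain cid references.
type CidObject struct {
	Cid       string
	Type      PinType
	Reference string   // metapin: its cdag; cdag: its metapin (optional)
	Parents   []string // shard: the cdag(s) it belongs to
}

func (o CidObject) String() string {
	names := map[PinType]string{DataType: "pin", MetaType: "metapin", ClusterDAGType: "cdag", ShardType: "shard"}
	return fmt.Sprintf("%s(%s)", names[o.Type], o.Cid)
}

func main() {
	shard := CidObject{Cid: "QmShard1", Type: ShardType, Parents: []string{"QmCdag"}}
	fmt.Println(shard)
}
```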

Considerations:

  • Anything that is pinned recursively in ipfs needs to be mappable to the actual metapin in cluster (if it's not a regular pin). Probably by mapping to the cdag, and then the cdag to the metapin.
  • What happens when a shard belongs to two different cdags with different replication factors? Should we allow this? Should we have unique shards? I think we should allow it (de-duplication is important), but this possibly affects our data format (?), or at least we have to figure out how to handle it.
  • A cdag needs to be mapped to its shards. It is naturally mapped, but this would require consulting ipfs. Are there any cases which would push us to not rely on ipfs for this and carry references to the shards?
  • My feeling is that shards are very much like pins in all respects, except for display and having a parent.
  • Because the shared state might grow very large in the future, and not be memory-based, we need to keep some things in mind (even though we are at prototype and we'll have plenty of time to optimize):
    • Smaller state is better (don't waste fields)
    • Faster state is better (reduce queries)
    • Simpler state is better (will make revamping the component easier).
  • Because all Cid objects can be addressed by a Cid, they should probably share the same keyspace/map/datastore for the moment. For simplicity, until it makes sense performance-wise to separate them.
  • Here's a table I made to compare the different things each type of Cid object requires:

| Cid object | Allocations | Pin/Unpin (ipfsconn) | Pin/Unpin (internal state) | Has parent | Has children | Needs repl factor | Display | Name |
|---|---|---|---|---|---|---|---|---|
| Pins | X | X (recur) | X | O | O | X | X | X |
| Meta pins | O | O | X | O | X | X | X | X |
| Shard | X | X (recur) | X | X (multiple?) | X | From metapin | On req | ? |
| cdag | -1 (pin everywhere?) | X (direct) | X | metapin? | X (shards) | -1 (?) | On req | ? |

I need to let this sink in a little while. I'm not sure right now if it's best to consolidate everything in a single Pin type, to separate everything into multiple types, or to have a single Pin type with separate subtypes. In the meantime, ideas are welcome :)

@ZenGround0
Collaborator Author

ZenGround0 commented Mar 14, 2018

> Anything that is pinned recursively in ipfs needs to be mappable to the actual metapin in cluster (if it's not a regular pin). Probably by mapping to the cdag, and then the cdag to the metapin.

I'm not sure I fully understand the utility of including the metapin reference in the cdag. Which operations/workflows start from the metadata and then move on to working on the data? For example, when we make a sharding-aware unpin operation, it will start from the metapin and release resources by following the links to the cluster-DAG and eventually the shards. What needs links in the other direction? We might be thinking about the metadata differently. My understanding was that for the most part the metadata is ignored and simply acts as the magic glue keeping the sharded file in the cluster.

> What happens when a shard belongs to two different cdags with different replication factors?

Are you worried about the case where a shard is pinned with factor 1 through clusterDagA and factor 5 through clusterDagB, but then clusterDagB is unpinned, so the shard is replicated unnecessarily high? This seems pretty tricky, and maybe I'm now seeing one of the reasons you'd want to keep track of a shard's parents. If you kept track of a set of parents, you could look from the shard node to the cluster-DAGs and figure out your current max and min repl factors. During unpinning, shard nodes would remove the unpinned parent ref and then update their max factor / allow themselves to unpin.

Apart from this problem I'm not sure I am seeing the big picture; perhaps you could elaborate on your concerns if I'm missing some of them. I do think that in general it would be good to support multiple cluster-DAGs referencing a shard, but for the intended use case of sharding right now it does seem unlikely for people to hit this.
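The parent-tracking idea above could be sketched like this; the names and the map-based bookkeeping are hypothetical:

```go
package main

import "fmt"

// shardRefs tracks, for one shard, the replication factor requested by each
// parent cluster-DAG that references it.
type shardRefs struct {
	parents map[string]int // cdag cid -> repl factor
}

// maxRepl is the effective replication factor: the max over live parents.
// A second return of false means no parents remain, so the shard itself
// can be unpinned.
func (s *shardRefs) maxRepl() (int, bool) {
	max, any := 0, false
	for _, f := range s.parents {
		if f > max {
			max = f
		}
		any = true
	}
	return max, any
}

// dropParent removes an unpinned cdag's reference.
func (s *shardRefs) dropParent(cdag string) { delete(s.parents, cdag) }

func main() {
	s := &shardRefs{parents: map[string]int{"clusterDagA": 1, "clusterDagB": 5}}
	f, _ := s.maxRepl()
	fmt.Println("effective repl factor:", f) // 5 while both parents exist

	s.dropParent("clusterDagB")
	f, _ = s.maxRepl()
	fmt.Println("after unpinning clusterDagB:", f) // drops back to 1
}
```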

For the most part I like the terminology and the general idea. Keep me posted on your thoughts, I am going to attempt a rough draft of a state consisting of a map of cid objects to make this more concrete.

@hsanjuan
Collaborator

hsanjuan commented Mar 14, 2018

> I'm not sure I fully understand the utility of including the metapin reference in the cdag

You might be right. I just don't want things pinned on ipfs by cluster which we cannot trace back to the actual metapin. But tracking can be done by checking every metapin to find which one holds a shard. If a shard fails to pin, though, it should be visible/obvious that it belongs to a certain metapin, and the entry point for any operation to fix it (recover) should probably be the metapin.

> Are you worried about the case where a shard is pinned with factor 1 through clusterDagA and factor 5 through clusterDagB, but then clusterDagB is unpinned, so the shard is replicated unnecessarily high?

I'm worried that when we unpin a metapin, we unpin the cdag, and then we unpin the shards. But maybe some shards should not be unpinned, because they are also used by a different cdag. This might happen if a user shards some data, then adds the same data with something else appended. The original shards made by cluster would be part of two cluster DAGs... ipfs has essentially this problem too; I think it keeps a list of parent references to pins, but I can't remember.

@ZenGround0
Collaborator Author

ZenGround0 commented Mar 14, 2018

To your first point: that makes sense. There are other problems around this we should keep in mind. Notably, in the case where we are sharding a file, the original pinning operation of a shard by the sharding component cannot know the cluster-DAG cid, as the whole dag needs to be imported for that, and in general we can't wait to pin shards. Maybe we could block-put now and then pin later when the cluster-DAG is ready, however. We could also include an update call to associate the shard and cluster-DAG after the fact. Either way, we have to be careful to ensure that the shard is always associated with the metapin. This gets tough when we think about cluster failures during sharding.

To your second point: this also makes sense, and I see more clearly how tracking parent references in the shards would help with this. Maybe when the sharder updates the cluster-DAG reference on a shard cid object, it can check for the existence of the shard cid in the pinset and add the parent to the existing cid object. We could also associate the replication factors with the cdag. This way, shards could figure out which repl factors they should switch to during an unpin. This seems better than sweeping through all metapins to see if a shard can be reached during any metapin unpin op. I guess it makes sense fundamentally that refcounting is the way to go over sweeping, because we are dealing with acyclic graphs.
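A rough sketch of the refcounting-over-sweeping idea, with illustrative names and plain-string cids:

```go
package main

import "fmt"

// A tiny model of the pinset: cdags list their shards, shards count live
// parent references. Refcounting (rather than sweeping all metapins)
// decides whether a shard may really be unpinned.
type state struct {
	cdagShards   map[string][]string // cdag cid -> shard cids
	shardParents map[string]int      // shard cid -> live parent count
}

// unpinCdag releases one cluster-DAG and returns the shards whose last
// parent reference was just removed, i.e. the ones safe to unpin on ipfs.
func (st *state) unpinCdag(cdag string) []string {
	freed := []string{}
	for _, sh := range st.cdagShards[cdag] {
		st.shardParents[sh]--
		if st.shardParents[sh] == 0 {
			freed = append(freed, sh)
		}
	}
	delete(st.cdagShards, cdag)
	return freed
}

func main() {
	// shard2 is shared: the user added the same data twice, once with extra
	// content appended, so two cluster-DAGs reference it.
	st := &state{
		cdagShards:   map[string][]string{"cdagA": {"shard1", "shard2"}, "cdagB": {"shard2", "shard3"}},
		shardParents: map[string]int{"shard1": 1, "shard2": 2, "shard3": 1},
	}
	fmt.Println("unpin cdagA frees:", st.unpinCdag("cdagA")) // shard1 only; shard2 survives via cdagB
	fmt.Println("unpin cdagB frees:", st.unpinCdag("cdagB")) // now shard2 and shard3 go
}
```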

@ZenGround0
Collaborator Author

ZenGround0 commented Apr 18, 2018

Tracking bugs and work needed to polish the sharding branch:

[x] Directories are not printed out by cluster add print
[x] Error when adding a sharded file with refactored add
[x] Error on pin ls -a
[x] TestLocalAdd and TestShardAdd for Cluster:AddFile
[x] More tests on restapi server and client
[x] cluster add has the same wrap default as ipfs (right now it defaults to pseudowrap but shouldn't)
[x] data files generated on demand to keep repo clean
[x] Fix bug where pin rm <cid> fails when it is added both sharded and unsharded
[x] sharness tests for ipfs-cluster-ctl add and ls -a
[x] cluster add --quiet only prints out the main hash
[ ] Refactor tests in Sharder and Importer to remove code duplication (importer refactor done; will get to the sharder refactor time permitting, but it is already pretty good)
[ ] More tests for un/pinning new pintypes
[x] More sharder tests (4 is enough for now)
[ ] Upgrade path for sharding once recursive pinning up to a certain depth is implemented
[ ] Debug repinning of sharded nodes -- naive solution is to track recursion info in pininfo datastruct

@hsanjuan
Collaborator

hsanjuan commented Apr 25, 2019

I'm closing this and directing further discussion to #496

@ghost ghost removed the status/in-progress In progress label Apr 25, 2019