gossipsub #67
Added a small zoo of basic gossipsub tests, including a mixed mode test with floodsub peers.
Made a small tweak -- sources that have not joined the mesh also emit gossip.
Removed a potentially harmful topic membership check for mesh peers; potential inconsistency if the ANNOUNCE was lost (or reordered after GRAFT on retry).
gossipsub.go
Outdated
if len(peers) < GossipSubDlo {
	ineed := GossipSubD - len(peers)
	plst := gs.getPeers(topic, ineed, func(p peer.ID) bool {
		_, ok := peers[p]
this looks like we're filtering something, mind adding a comment on what we're filtering on?
We are filtering for peers that are not already in our mesh peer list; I can add a comment to that effect.
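As a sketch, the requested comment might read something like this (the `return !ok` line is an assumption about how the hunk above continues):

```go
// select ineed peers for the topic, filtering out peers that are
// already in our mesh peer list so we only add new ones
plst := gs.getPeers(topic, ineed, func(p peer.ID) bool {
	_, ok := peers[p]
	return !ok
})
```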
gossipsub.go
Outdated
}

func (gs *GossipSubRouter) heartbeatTimer() {
	ticker := time.NewTicker(1 * time.Second)
heartbeat every second? Does every heartbeat send messages?
Not necessarily.
It will send GRAFT/PRUNE if it needs to adjust the overlay, and schedule gossip for piggybacking.
If the gossip is not sent by the next heartbeat, then it will be flushed in its own messages.
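A rough sketch of the pattern being described, assuming the router holds a context and that heartbeat() is invoked directly from the timer loop (the actual implementation may dispatch it differently):

```go
func (gs *GossipSubRouter) heartbeatTimer() {
	ticker := time.NewTicker(1 * time.Second) // heartbeat interval
	defer ticker.Stop()

	for {
		select {
		case <-ticker.C:
			// a heartbeat only sends messages when it has work to do:
			// GRAFT/PRUNE to adjust the overlay, plus any gossip that was
			// scheduled for piggybacking and not yet flushed
			gs.heartbeat()
		case <-gs.ctx.Done():
			return
		}
	}
}
```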
	}
}

func (gs *GossipSubRouter) heartbeat() {
This function feels too long. Mind trying to break it up a little bit?
Sure, I'll refactor a bit, although I would like to keep the main logic together.
I can easily refactor out the part that sends the GRAFT/PRUNE with coalescing, which is also incidental for the logic in the heartbeat.
done
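For reference, a hedged sketch of the kind of helper that could be factored out; `sendGraftPrune`, `makeControlRPC`, and `sendRPC` are illustrative names here, not necessarily the actual ones:

```go
// sendGraftPrune flushes the GRAFT/PRUNE decisions computed by the heartbeat.
// GRAFTs and PRUNEs destined for the same peer are coalesced into a single RPC.
func (gs *GossipSubRouter) sendGraftPrune(tograft, toprune map[peer.ID][]string) {
	for p, graftTopics := range tograft {
		pruneTopics := toprune[p]
		delete(toprune, p) // coalesce the PRUNEs going to the same peer
		gs.sendRPC(p, makeControlRPC(graftTopics, pruneTopics))
	}
	// remaining peers only get PRUNEs
	for p, pruneTopics := range toprune {
		gs.sendRPC(p, makeControlRPC(nil, pruneTopics))
	}
}
```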
Implemented retry of ANNOUNCE messages in pubsub, now that we have a test that exercises the relevant code paths.
}

// wait for heartbeats to build mesh
time.Sleep(time.Second * 2)
Why do we always have to wait for heartbeats? Is it because we don't send out subscription notices immediately now?
Well, subscription notices are still sent immediately, just retried if they fail.
Now that I think of it, we can probably greatly reduce this delay in almost all the tests -- maybe down to say 100ms, just enough for announcements to go out.
My rationale for this delay was to wait a couple of heartbeats to avoid interference from nodes who have joined but haven't seen any peer announcements yet. Also, I wanted to avoid interference from the overlay construction, but it should still be connected after the announcements get sent and nodes pick their peers.
Actually, there is a genuine concern that is an artifact of the concurrent Join from subscriptions.
If we subscribe all the nodes together, then they won't have any peer announcements when they do the Join, and they'll have to wait a heartbeat before they start adding peers to the mesh.
We can avoid this if we add a small (say 10ms) delay after each subscription.
Hrm... I'm very skeptical of 'fixing' things by adding delays.
Actually, in most of the tests the subscriptions are created before connecting the network, which means that all nodes start empty and build the mesh purely in the heartbeat.
@vyzo It's really really bad practice to use delays like this, especially without a select statement to escape out of it if the context is canceled.
@paralin this is just a test that needs a delay -- and there is nothing to cancel the context so a select would be totally useless.
@vyzo Gotcha, I notice now it's a test.
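Outside of tests, the concern about bare delays can be addressed with a timer inside a select, so the wait is abandoned when the context is canceled. A minimal, self-contained sketch:

```go
import (
	"context"
	"time"
)

// sleepCtx waits for d, but returns early if ctx is canceled.
func sleepCtx(ctx context.Context, d time.Duration) error {
	t := time.NewTimer(d)
	defer t.Stop()
	select {
	case <-t.C:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
```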
Could we add some tests that check the number of messages sent, and maybe a way of tracking the overall efficiency of the implementation (like how many nodes received the same message from multiple peers), perhaps in terms of bandwidth overhead? For example: received 500 bytes for every 200 bytes of useful data at a message size of 100 bytes.
Hrm, these are tests I would like to have too -- but not sure they are really unit tests.
Yeah, they are definitely integration tests. No need to write them as unit tests. We should run these tests for floodsub and for gossipsub and compare the results, and choose some failure threshold, i.e. gossipsub should not use more bandwidth than floodsub.
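One way such an integration test could express the overhead metric from the example above (a sketch; the byte counters would have to come from instrumented hosts):

```go
// overheadRatio reports wire bytes received per byte of useful (deduplicated)
// payload, e.g. 500 bytes received for 200 useful bytes gives 2.5.
func overheadRatio(wireBytes, usefulBytes int64) float64 {
	if usefulBytes == 0 {
		return 0
	}
	return float64(wireBytes) / float64(usefulBytes)
}
```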
We have developed a merge conflict, so I will rebase.
Rebased; also added a context done check that was missing in the announce retry goroutine.
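The shape of that check, as a hedged sketch (the retry interval and the announceRetry helper are placeholders, not the actual names):

```go
go func() {
	t := time.NewTimer(announceRetryInterval)
	defer t.Stop()
	select {
	case <-t.C:
		p.announceRetry(pid, topic) // re-send the ANNOUNCE
	case <-p.ctx.Done():
		// the pubsub instance shut down; this is the check that was missing
	}
}()
```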
Hi folks! Any particular reason you went with a custom protocol instead of something built on Chord/Pastry/PolderCast etc.?
hrm, seems like the fix lost the coverage for control piggybacking -- probably because of the slowdown with the
@ORBAT several reasons: simplicity of implementation, robustness, and perhaps most important of all, backwards compatibility with floodsub so that we can easily deploy.
and control piggyback coverage is back, at least for
@ORBAT It looks like PolderCast didn't make it into our pubsub research reading list: https://ipfs.io/ipfs/QmNinWNHd287finciBwbgovkAqEBQKvnys1W26sY8uupc5/pubsub%20reading%20list.pdf In any case, we did a pretty thorough review of the problem space before we arrived at our version zero protocol, floodsub, with the idea that it is the base layer protocol and provides very few guarantees. This code, which we're calling gossipsub, is an iterative improvement over floodsub that essentially only adds fairly simple tree pruning via gossip. Simplicity and ease of implementation are very important for us; gossipsub can be implemented in 150 lines of Scheme and not too many lines of Go. That said, this is still under review. Review of the protocol and/or implementation is very much welcome.
I've rebased but I'm having some trouble reproducing the issue. I'm currently running the test in a loop to see if that gets me anywhere.
No dice.
@whyrusleeping @Stebalien
@whyrusleeping I just read libp2p/interop#1. Having a daemon will of course be useful, although not having to depend on code in Go is preferable, and JSON tests are needed.
I had no reason not to merge this, so I did. Next step: putting it behind a flag in ipfs.
@whyrusleeping shouldn't this be a separate pubsub implementation so that folks can pick the pubsub implementation to use? Will this PR make PubSub in go-ipfs not interop with js-ipfs?
@whyrusleeping just confirmed that this package would be better named. Interop remains
I can't find any mention of DHT (looking in relation to the statement "The initial contact nodes can be obtained via rendezvous with DHT provider records." here). Will this be done in a separate interface (that uses the DHT in libp2p, as well as gossipsub)? Also, you really should use named constants instead of literals. Should we also specify a common source of randomness for interoperability?
@jamesray1
@mhchia, sure, that's fine.
It's more "extracting things into multiple packages in Go is annoying to do when you might be changing things in both really soon". @jamesray1 @mhchia Yeah, the DHT is only an example. You can use any means to rendezvous. Take a look at our rendezvous spec proposal for ideas towards a more specialized way of doing rendezvous.
What duration should we use for request timeouts? Context: implementing a system config for Kademlia to use to get nodes. As for the timeout duration, according to RabbitMQ that is twice the heartbeat interval, which is 1 s in this Go implementation, so based on that it would be 2 s. However, later on the same page it says a timeout of 5 to 20 s is optimal. I am guessing to use 40000 s for … I'll look further into this. https://www.kth.se/social/upload/516479a5f276545d6a965080/3-kademlia.pdf says … OK, at the moment I'm selecting:

/// tRefresh in Kademlia implementations, sources:
/// http://xlattice.sourceforge.net/components/protocol/kademlia/specs.html#refresh
/// https://www.kth.se/social/upload/516479a5f276545d6a965080/3-kademlia.pdf
/// 1 hour
kbuckets_timeout: Duration.hour(1),
/// go gossipsub uses 1 s:
/// https://github.com/libp2p/go-floodsub/pull/67/files#diff-013da88fee30f5c765f693797e8b358dR30
/// However, https://www.rabbitmq.com/heartbeats.html#heartbeats-timeout uses 60 s, and
/// https://gist.github.com/gubatron/cd9cfa66839e18e49846#routing-table uses 15 minutes.
/// Let's make a conservative selection and choose 15 minutes for an alpha release.
request_timeout: Duration.minutes(15),
}

func (gs *GossipSubRouter) handleIWant(ctl *pb.ControlMessage) []*pb.Message {
	ihave := make(map[string]*pb.Message)
Does this need to be a map, or could it be a slice? Is it a map to deduplicate message IDs?
yes, it needs to deduplicate.
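A sketch of the deduplication being described, keyed by message ID (the iteration over the control message and the message-cache lookup are assumptions about the surrounding code):

```go
func (gs *GossipSubRouter) handleIWant(ctl *pb.ControlMessage) []*pb.Message {
	// keyed by message ID so a message requested in several IWANTs
	// (or listed twice in one) is only returned once
	ihave := make(map[string]*pb.Message)
	for _, iwant := range ctl.GetIwant() {
		for _, mid := range iwant.GetMessageIDs() {
			if msg, ok := gs.mcache.Get(mid); ok {
				ihave[mid] = msg
			}
		}
	}

	msgs := make([]*pb.Message, 0, len(ihave))
	for _, msg := range ihave {
		msgs = append(msgs, msg)
	}
	return msgs
}
```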
Implements the gossipsub protocol; see https://github.com/vyzo/gerbil-simsub for a high-level literate specification.
TODO: ANNOUNCE messages in PubSub need to be reliable so that we accurately track peers in a topic.