Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: use floodsub events to simplify discovery #176

Closed
wants to merge 6 commits into from

Conversation

dirkmc
Copy link
Collaborator

@dirkmc dirkmc commented Dec 3, 2018

Fixes #153

Note: This PR is ready for review, but will not pass tests until release v0.15.2 is included into libp2p2, and we include that version of libp2p2 in peer-star-app

@pgte
Copy link
Contributor

pgte commented Dec 4, 2018

pedroteixeira@MacBook-Pro-4:~/projects/ipfs/peer-star/peer-star-app (master)
→ npm ls | ag floodsub
│ │ ├── libp2p-floodsub@0.15.2 deduped
│ ├─┬ libp2p-floodsub@0.15.2

@dirkmc if you rebase from master you should get that dependency included..

@dirkmc
Copy link
Collaborator Author

dirkmc commented Dec 7, 2018

It looks like there are a couple of pinner related tests failing that I don't think are related to the work I've been doing (but I will look into it in the morning). I think the discovery related code is ready for testing, maybe with the minimalist peer-pad @jimpick?

I wrote a simple chat app to try it out, and it looks like the peers are connecting and setting up membership gossip within a couple of seconds (this is using the default rendezvous server at ws-star1.par.dwebops.pub). Watch for the Connected message at the bottom right:
https://youtu.be/Au239dGo2TA

@dirkmc
Copy link
Collaborator Author

dirkmc commented Dec 7, 2018

FYI to see the log messages for web-socket-star, floodsub, and peer-star-app discovery paste this into the browser console and refresh:
localStorage.debug = 'libp2p:floodsub,libp2p:floodsub:*,peer-star:discovery,peer-star:discovery:*,libp2p:websocket-star,libp2p:websocket-star:*'

@dirkmc
Copy link
Collaborator Author

dirkmc commented Dec 7, 2018

@pgte there are quite a lot of changes here, let me know if you would like to code review in real time and I can talk you through it :)

@pgte
Copy link
Contributor

pgte commented Dec 7, 2018

@dirkmc thanks, I may take you up on that some time later!

I pulled your branch locally and rebased from latest master and did a fresh install.
When running the tests, sometimes I'm getting an error on test/replication.spec.js:

$ npx aegir test -t node -f test/replication.spec.js
[...]
 1) replication
       can stop pinner:
     Error: Timeout of 30000ms exceeded. For async tests and hooks, ensure "done()" is called; if returning a Promise, ensure it resolves. (/Users/pedroteixeira/projects/ipfs/peer-star/peer-star-app/test/replication.spec.js)


  2) replication
       peers dont list pinner any longer after pinner stopped:

      AssertionError: expected 1 to equal 0
      + expected - actual

      -1
      +0

      at collaborations.forEach (test/replication.spec.js:112:103)
      at Array.forEach (<anonymous>)
      at Context.it (test/replication.spec.js:112:20)

I did a bit of digging around and I found out that, when the tests fail, sometimes the connection from the nodes to the pinner is not closed after pinner.stop(). Do you have any clues on why this might be happening?

@dirkmc
Copy link
Collaborator Author

dirkmc commented Dec 7, 2018

@pgte it seems like at some stage in the past libp2p must have called AppTransport.close() but it no longer seems to do that (it calls close on the listeners, not the transport itself)

I can listen for when libp2p calls discovery.close(), but it seems like at that point libp2p has already stopped, so it's not possible to disconnect.

I'm digging around in the libp2p code trying to work out how we can make it wait till we've closed out connections before shutting down.

@aschmahmann
Copy link
Collaborator

@dirkmc @pgte how much have you spoken with the (js)libp2p folks about this? I know service discovery is something they're interested in and it feels like we're basically using pubsub to implement our own discovery service because it's not done there yet. Maybe there's some way for us to help them out and ensure the libp2p discovery covers the peer-star-app use cases.

@dirkmc
Copy link
Collaborator Author

dirkmc commented Dec 7, 2018

@aschmahmann that seems like it could be useful for other projects, although our use case is quite particular - it feels a little experimental

@pgte it seems like maybe the disconnects are not happening cleanly. I notice that when the pinner closes its connections, the pinner's push-protocol always detects the error and bails out, while the pull-protocol doesn't always notice the error.
Correspondingly, the regular peer's pull-protocol always seems to detect the disconnect while it's push-protocol doesn't.

So there seems to be something about the connection disconnect that is not being detected. I will do more digging

@pgte
Copy link
Contributor

pgte commented Dec 8, 2018

@dirkmc thank you!

@aschmahmann there's a spec for a rendezvous protocol being designed here: libp2p/specs#56 . Is this what you were referring to?

@dirkmc
Copy link
Collaborator Author

dirkmc commented Dec 10, 2018

I looked into this more over the weekend. If I add a random delay before dialing, the replication test passes.

My theory is that when libp2p-switch gets an incoming connection for a peer at the same time as it is dialing that same peer, it gets confused and loses track of some connections. When it shuts down, it closes all connections for a peer, but I believe it misses some of the connections, which is why the protocol pull stream never detects the connection ending. Tomorrow I'll try to write a test case against libp2p-switch that consistently fails.

@aschmahmann
Copy link
Collaborator

@pgte Yes, thanks.

@dirkmc
Copy link
Collaborator Author

dirkmc commented Dec 10, 2018

I created a test case to demonstrate the issue at #199

@pgte
Copy link
Contributor

pgte commented Dec 11, 2018

@dirkmc I dug into this a bit and I think you're right, apparently there's some problem with libp2p connection management. This is evident through your new test9:
Doing app.stop() triggers ipfs.stop() , which should stop the discovery and transports, which should then disconnect all.

While I did some digging I think I found a problem in this package and the way that it handles connections: when you do collaboration.stop(), the existing outbound connections were not being terminated, which then results in libp2p maintaining connections to other nodes indefinitely.
If we fix this, our tests run consistently, simply because stopping an app will stop all the collaborations, which will then terminate all the collaboration-specific connections.

Because I'd like to review these fixes with you, I suggest we continue this in another PR that joins these contributions to the connection management issues: #200

@pgte pgte closed this Dec 11, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants