
Conversation

cfromknecht
Contributor

@cfromknecht cfromknecht commented Nov 18, 2017

This PR reorders logic in the peer termination watcher so that we short-circuit earlier if we already have pending persistent connection requests. Before, the debug statement could indicate that a reconnection was being attempted even though the watcher may have exited before submitting the request to the connection manager.
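
For illustration, a self-contained toy sketch of the ordering change (every name here, such as server, persistentConnReqs, and maybeReconnect, is a simplified stand-in and not the actual server.go code):

package main

import (
	"fmt"
	"sync"
)

// Toy stand-ins for the real types; none of these mirror lnd's actual fields.
type connReq struct {
	addr      string
	permanent bool
}

type server struct {
	mu                 sync.Mutex
	persistentConnReqs map[string][]*connReq
	submitted          []*connReq // stands in for requests handed to the connection manager
}

// maybeReconnect demonstrates the reordering: check for an outstanding
// persistent request and short-circuit *before* logging that a
// reconnection is being attempted, so the debug output only appears
// when a request is actually submitted.
func (s *server) maybeReconnect(pubStr, addr string) {
	s.mu.Lock()
	defer s.mu.Unlock()

	if _, ok := s.persistentConnReqs[pubStr]; ok {
		// A reconnection request is already pending; exit quietly.
		return
	}

	fmt.Printf("Attempting to re-establish persistent connection to %s\n", pubStr)

	req := &connReq{addr: addr, permanent: true}
	s.persistentConnReqs[pubStr] = append(s.persistentConnReqs[pubStr], req)
	s.submitted = append(s.submitted, req)
}

func main() {
	s := &server{persistentConnReqs: make(map[string][]*connReq)}
	s.maybeReconnect("02aabb...", "127.0.0.1:9735")
	s.maybeReconnect("02aabb...", "127.0.0.1:9735") // short-circuits, logs nothing
}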

@cfromknecht cfromknecht force-pushed the persistent-conn-mgr branch 9 times, most recently from c020dce to a20ab99 Compare November 18, 2017 11:57
@bajohns
Contributor

bajohns commented Nov 25, 2017

@cfromknecht I am seeing an issue where my node's peer count eventually goes to zero. When this happens it stops receiving updates, and I have to restart the lnd process to get peers again.

Will this PR solve this loss-of-peers issue? I will pull your branch and test.

@Roasbeef
Member

@bajohns is the node you're testing on persistent? Is it listening on a publicly reachable interface?

Also, what do you mean by "stops receiving updates"? Does the node have any active channels, or is it just bootstrapping to new peers? Finally, what commit are you on?

@bajohns
Contributor

bajohns commented Nov 26, 2017

@Roasbeef the node I am using is persistent - I run it 24/7. If persistent is a setting, then see my CLI options below.

It is not on a publicly reachable interface, nor does it have active channels. Unfortunately, I just pulled/rebased onto master yesterday in the hope that the issue would resolve itself. Now I am running off of 7016f5b.

Re "Stops receiving updates": The daemon appears to take in two streams of info. Lightning P2P and Bitcoin chain info from Neutrino. I have not had any issues with Neutrino. However, over the course of 24-32 hours I hit 0 lightning peers lncli listpeersand then I have to restart the daemon.

lnd --bitcoin.active --bitcoin.testnet --debuglevel=debug \
  --neutrino.active --neutrino.connect=faucet.lightning.community \
  --configfile=$HOME/workspace \
  --datadir=$HOME/workspace/data \
  --logdir=$HOME/workspace/logs

Now I am running off 7016f5b with this PR merged to see if the situation improves. However, testnet seems to have reorged in the last 24 hours, and that causes some instability (maybe faucet.lightning.community was down?). Regardless, I will test for another 24 hours and report back.

@bajohns
Contributor

bajohns commented Nov 27, 2017

The last twenty-four hours of testing are inconclusive and differ from my prior experience.

Before, the daemon appeared to be disconnected from the Lightning P2P network: it received block updates but not channel updates. After applying this PR, the node appears to stay connected to the Lightning P2P network but becomes disconnected from the remote Neutrino peer. I am not sure if this is related to the PR or is a separate issue. @Roasbeef, perhaps I should open an issue to discuss further?

Short summary:
Below you will see activity from the lnd process. To start the test, I restarted the process around 04:00 on the graph. New blocks are in purple and channel updates are in green. After about eight hours the blocks stop, but the channel updates continue. I restarted the process at around 23:00, and you can see the new block announcements flood in for the missing time period.

To generate this graph, I created a branch of lnd that outputs an event log and piped it to an ELK stack instance.

My next step will be to move lnd to a publicly reachable IP and run a local btcd, to eliminate those as factors.

(attached graph: lnd2017112627, showing new blocks and channel updates over time)

@cfromknecht
Contributor Author

@bajohns thanks for your thorough investigation on this! This PR should only affect the connections maintained in the Lightning P2P network, and should have no effect on neutrino (that I'm aware of).

There might be certain conditions under which this PR helped connectivity with the P2P network in your most recent test, though @Roasbeef has some other hunches about what may have caused this. The main one: connections made by the DNS bootstrapper are not registered as persistent. Here, persistent means that lnd will try to reestablish the connection if it fails, not necessarily that anything is persisted to disk.

To prevent what happened on day one, the two best solutions seem to be:

  1. Open a channel with a peer (or several); this registers a persistent connection with that peer and also writes their address to disk, so that the connection can be reestablished after a restart.
  2. Modify the DNS bootstrapper to mark successful connections as persistent, which would allow lnd to maintain persistent connections irrespective of whether you have any channels open (a rough sketch of the idea follows after this list). If you're interested in helping with this, I'd be more than happy to open an issue with further instructions!
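
As a very rough, self-contained sketch of what option 2 could look like (the types and names here are made up for the example, not lnd's real API):

package main

import "fmt"

// Hypothetical types for the sketch; none of these mirror lnd's code.
type peerAddr struct {
	pubKey string
	host   string
}

type server struct {
	// persistentPeers tracks peers lnd should keep reconnecting to.
	persistentPeers map[string]bool
}

// connectToBootstrapPeer dials an address returned by the DNS
// bootstrapper and, on success, records the peer as persistent so a
// later disconnect triggers a reconnection attempt instead of letting
// the peer count drift to zero.
func (s *server) connectToBootstrapPeer(addr peerAddr) error {
	// ... dial the peer here; assume it succeeds for the sketch ...
	fmt.Printf("connected to bootstrap peer %s@%s\n", addr.pubKey, addr.host)

	// The proposed change: mark bootstrap connections persistent.
	s.persistentPeers[addr.pubKey] = true
	return nil
}

func main() {
	s := &server{persistentPeers: make(map[string]bool)}
	_ = s.connectToBootstrapPeer(peerAddr{pubKey: "03ccdd...", host: "1.2.3.4:9735"})
}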

Regarding day 2, it seems more likely that this was caused by a reorg or a huge increase in hash rate. You can see the number of new blocks increase sharply just before stopping altogether, and the number of new blocks accepted after restarting works out to an average rate of ~200 blocks/hr while the node wasn't receiving updates. In fact, I just checked yesterday's blocks and that seems to be the case 😊

By the way, really cool work in getting this graph generated! Will have to check this out 😎

Member

On second thought, perhaps we can just apply the OG patch (the one I started, but you fixed) to my fork, which will then eventually be submitted back upstream to btcd. This way, it will actually also be possible to review the added diff. As currently presented, it's difficult to see exactly how the original connmgr package was modified.

Member

It would also significantly shrink the size of this diff!

Contributor Author

The bulk of the PR has been moved to Roasbeef/btcd#10; I'm thinking this PR will just become the glide update/minor fixes?

@bajohns
Copy link
Contributor

bajohns commented Nov 28, 2017

@cfromknecht I am interested in working on solution 2 that you presented above. If you can create an issue with instructions, I will get started.

The graph you saw is from a hobby project of mine; I will send you a link once I launch it.

@cfromknecht cfromknecht force-pushed the persistent-conn-mgr branch 2 times, most recently from 38f59f4 to 4604b87 Compare November 30, 2017 00:31
glide.yaml Outdated
Contributor Author

@cfromknecht cfromknecht Nov 30, 2017

temporary redirection until merged into roasbeef/btcd

Contributor Author

@cfromknecht cfromknecht Nov 30, 2017

I was hoping this would work on Travis, but it only works locally due to the btcd setup done by gotest.sh.

@cfromknecht cfromknecht changed the title Adds Persistent Connection Manager Improve Consistency of Logging and Actions in peerTerminationWatcher Dec 13, 2017
This commit reorders logic in the peer termination
watcher such that we short circuit earlier if we
already have pending persistent connection requests.
Before, the debug statement may have indicated that
a reconnection was being attempted, even though it
may have exited before submitting the request to
the connection manager.
Member

@Roasbeef Roasbeef left a comment

LGTM 🚀

@Roasbeef Roasbeef merged commit 3941830 into lightningnetwork:master Dec 13, 2017