Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bug]: lnd (master 15d2ff0) is stalling connections at startup using externalhosts #7924

Closed
blckbx opened this issue Aug 25, 2023 · 12 comments · Fixed by #7938
Closed

[bug]: lnd (master 15d2ff0) is stalling connections at startup using externalhosts #7924

blckbx opened this issue Aug 25, 2023 · 12 comments · Fixed by #7938
Labels
bug Unintended code behaviour networking P1 MUST be fixed or reviewed
Milestone

Comments

@blckbx
Copy link

blckbx commented Aug 25, 2023

Background

Follow-up issue of #7921

Good news first: externalip is working now (with and without Tor).
Bad news: using externalhosts LND still stalls connections and lnd becomes unresponsive with and without Tor.

  • Compiled master branch from source (15d2ff0).
  • LND now starts up without go runtime error but stalls after a while trying to reconnect to peers.
  • seeing all connections (in and out) timing out
[INF] PEER: Peer(x): disconnecting x@x.x.x.x:37298, reason: unable to start peer: unable to send init msg: write tcp 10.151.111.76:9738->x.x.x.x:58556: write: broken pipe

[ERR] CMGR: Can't accept connection: unable to accept connection from x.x.x.x:56404: read tcp 10.151.111.76:9738->x.x.x.x:56404: i/o timeout
  • local lncli commands are getting stuck and rpc calls are running into timeouts, finally getting cancelled.

Startup log show that lnd is resolving the DNS correctly:

[INF] NANN: HostAnnouncer starting
[DBG] NANN: IP change detected! lightning.airdns.org:9738: <nil> -> 195.206.105.203:9738

[DBG] NANN: HostAnnouncer checking for any IP changes...
[DBG] NANN: No IP changes detected for hosts: [lightning.airdns.org:9738]

Your environment

  • version of lnd: v0.17.0-beta.rc1-g15d2ff0c4
  • which operating system (uname -a on *Nix): Linux node 5.15.0-79-generic 86-Ubuntu SMP Mon Jul 10 16:07:21 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
  • version of btcd, bitcoind, or other backend: bitcoind v25.0
  • any other relevant environment details: using hybrid mode (Tor + externalhosts) setting in combination with vpn service

Steps to reproduce

  • compile 15d2ff0 from source
  • lnd.conf: set a DNS address for externalhosts
  • startup LND
  • LND comes to halt trying to connect to peers and stops responding to rpc calls

Expected behaviour

LND is starting up without issues and responding to calls.

Actual behaviour

LND starts up with externalhosts set but comes to halt and stops responding to lncli commands.

I can provide a full log for externalhosts and externalip scenario privately, if required.

@blckbx blckbx added bug Unintended code behaviour needs triage labels Aug 25, 2023
@Roasbeef
Copy link
Member

Are you able to repro this without Tor? Just tried to repro and seems fine.

Re Tor, seems related to this issue perhaps #7917 (comment)? Might be an intermittent Tor issue. I can try it out on my hybrid tor node once we get the next rc out (rc2).

@Roasbeef
Copy link
Member

using hybrid mode (Tor + externalhosts)

What tor flags are you running with? lnd.tor.skip-proxy-for-clearnet-targets=1 as well?

@blckbx
Copy link
Author

blckbx commented Aug 26, 2023

Are you able to repro this without Tor? Just tried to repro and seems fine.

Yes, I ran into this with and without Tor before.
With active Tor, tor.skip-proxy-for-clearnet-targets=true was set, yes.


Just ran new tests today:

externalhosts + no Tor:

listen=0.0.0.0:9738
externalhosts=lightning.airdns.org:9738

[tor]
tor.active=false
tor.v3=false

LND stalls after 15 minutes, it's caught in state server is still in the process of starting.
No lncli and rpc calls possible.

externalhosts + Tor:

listen=0.0.0.0:9738
externalhosts=lightning.airdns.org:9738

[tor]
tor.active=true
tor.v3=true
tor.streamisolation=false
tor.skip-proxy-for-clearnet-targets=true

LND works fine now. I think for this test case, it could have been a Tor glitch yesterday.
lncli and rpc calls are working fine.

So the only problem left seems to be: running clearnet-only with externalhosts setting.
I have recorded logs for both test cases.

@Roasbeef
Copy link
Member

So the only problem left seems to be: running clearnet-only with externalhosts setting.

Next time things stall, do you think you can grab a profile? https://github.com/lightningnetwork/lnd/blob/master/docs/debugging_lnd.md#capturing-pprof-data-with-lnd

I ran with tor hybrid and external host and wasn't able to repro.

@Roasbeef Roasbeef added this to the v0.17.0 milestone Aug 28, 2023
@saubyk saubyk added networking P1 MUST be fixed or reviewed labels Aug 29, 2023
@blckbx
Copy link
Author

blckbx commented Aug 29, 2023

I ran with tor hybrid and external host and wasn't able to repro.

This case is not a problem anymore (thanks to Elle's fix).
If Tor is turned off completely (no hybrid mode, clearnet only), then LND is stalling connections.

pprof.txt

pprof

@Roasbeef
Copy link
Member

@blckbx can you get a goroutine dump from that? https://go.dev/blog/pprof

Then we can see what is actually blocked.

@guggero
Copy link
Collaborator

guggero commented Aug 29, 2023

There is a goroutine dump here: https://github.com/lightningnetwork/lnd/files/12462101/pprof.txt, though it's in a bit of a different format from what a goroutine dump normally looks like.

@blckbx
Copy link
Author

blckbx commented Aug 29, 2023

There is a goroutine dump here: https://github.com/lightningnetwork/lnd/files/12462101/pprof.txt, though it's in a bit of a different format from what a goroutine dump normally looks like.

Sorry I'm new to this tool. This I copied from the curl command mentioned in debugging guide:
$ curl http://localhost:9736/debug/pprof/goroutine?debug=1.

What exact command do I need to run to get the output you can work with?

@guggero
Copy link
Collaborator

guggero commented Aug 29, 2023

No this is fine, thanks.

Roasbeef added a commit to Roasbeef/lnd that referenced this issue Aug 29, 2023
In this commit, we attempt to fix circular waiting scenario introduced
inadvertently when [fixing a race condition
scenario](lightningnetwork#7856). With that
PR, we added a new channel that would block `Disconnect`, and
`WaitForDisconnect` to ensure that only until the `Start` method has
finished would those calls be allowed to succeed.

The issue is that if the server is trying to disconnect a peer due to a
concurrent connection, but `Start` is blocked on `maybeSendNodeAnn`,
which then wants to grab the main server mutex, then `Start` can never
exit, which causes `startReady` to never be closed, which then causes
the server to be blocked.

This PR attempts to fix the issue by calling `maybeSendNodeAnn` in a
goroutine, so it can grab the server mutex and not block the `Start`
method.

Fixes lightningnetwork#7924

Fixes lightningnetwork#7928

Fixes lightningnetwork#7866
@Roasbeef
Copy link
Member

We have a candidate fix here: #7928

@mercurytoxic
Copy link

Pull request #7938 fixed all stalling issues for me.

@blckbx
Copy link
Author

blckbx commented Aug 30, 2023

We have a candidate fix here: #7928

This indeed fixes the issue with node running clearnet-only using externalhosts setting and thus solves #7924.

Roasbeef added a commit that referenced this issue Aug 30, 2023
…7938)

In this commit, we attempt to fix circular waiting scenario introduced
inadvertently when [fixing a race condition
scenario](#7856). With that
PR, we added a new channel that would block `Disconnect`, and
`WaitForDisconnect` to ensure that only until the `Start` method has
finished would those calls be allowed to succeed.

The issue is that if the server is trying to disconnect a peer due to a
concurrent connection, but `Start` is blocked on `maybeSendNodeAnn`,
which then wants to grab the main server mutex, then `Start` can never
exit, which causes `startReady` to never be closed, which then causes
the server to be blocked.

This PR attempts to fix the issue by calling `maybeSendNodeAnn` in a
goroutine, so it can grab the server mutex and not block the `Start`
method.

Fixes #7924

Fixes #7928

Fixes #7866
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Unintended code behaviour networking P1 MUST be fixed or reviewed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants