
[bug]: Lnd doesn't seem to gracefully shutdown on 0.17.0-beta.rc1 #7928

Closed
niteshbalusu11 opened this issue Aug 27, 2023 · 50 comments · Fixed by #7938 or #8151
Labels: P1 (MUST be fixed or reviewed), shutdown (Issues related to shutdown process of LND)

@niteshbalusu11

Background

I am running lnd 0.17.0-rc1 plus Elle's fix that was merged recently. When trying to restart or shut down lnd, my Docker container never stops. I have a 15-minute timeout in the compose file before Docker sends a kill signal. I just get the logs below until the timeout hits and the container gets killed, so lnd doesn't shut down gracefully.

2023-08-26 20:26:08.468 [ERR] DISC: Update edge for short_chan_id(878811056959193089) got: router shutting down
2023-08-26 20:26:08.468 [ERR] DISC: Update edge for short_chan_id(818216971034558465) got: router shutting down
2023-08-26 20:26:08.468 [ERR] DISC: Update edge for short_chan_id(878297584961191937) got: router shutting down
2023-08-26 20:26:08.469 [ERR] DISC: Update edge for short_chan_id(771274421704720385) got: router shutting down
2023-08-26 20:26:13.261 [ERR] DISC: Update edge for short_chan_id(865996249021546496) got: router shutting down
2023-08-26 20:26:13.374 [ERR] DISC: Update edge for short_chan_id(875772006717980672) got: router shutting down

Your environment

  • lnd version: 0.17.0-beta.rc1
  • OS: Ubuntu
  • bitcoind v24
@guggero
Collaborator

guggero commented Aug 28, 2023

Could you please pull a goroutine dump while it's attempting to shut down? That would help a lot.

@Roasbeef
Member

What extra patch are you running? Whatever it was should already be folded into the latest RC.

@Roasbeef Roasbeef added this to the v0.17.0 milestone Aug 28, 2023
@saubyk
Collaborator

saubyk commented Aug 29, 2023

What extra patch are you running? Whatever it was should already be folded into the latest RC.

The extra patch being referred to here is this fix: #7922

@niteshbalusu11 did you compile master, or apply this PR to a branch of yours?

@niteshbalusu11
Author

I compiled master.

@niteshbalusu11
Author

Could you please pull a goroutine dump while it's attempting to shut down? That would help a lot.

How do I do this?

@saubyk saubyk added shutdown Issues related to shutdown process of LND P1 MUST be fixed or reviewed labels Aug 29, 2023
@niteshbalusu11
Author

niteshbalusu11 commented Aug 29, 2023

# Enable profiling
profile=4000
curl http://localhost:4000/debug/pprof/goroutine?debug=1
curl: (56) Recv failure: Connection reset by peer

I enabled profiling in the conf file and restarted lnd; after it started back up, I just get this when attempting to curl.

@niteshbalusu11
Author

New results today

2023-08-29 12:56:10.900 [INF] LTND: Received terminated
2023-08-29 12:56:10.900 [INF] LTND: Shutting down...
2023-08-29 12:56:10.900 [INF] LTND: Gracefully shutting down.
2023-08-29 12:56:22.993 [INF] NTFN: Historical spend dispatch finished for request outpoint=59dfc2002ce40341e4332bdd552fd71b706854943c0a8bc771a62cccf2becef7:2, script=0 a2cc853fe736c0e446d02f22702c3e6efaa85aa54cf4976177fb5104848a6737 (start=804604 end=805311) with details: 1f580d4f4ad9bc9e9ed5a328e806f01d4f2116ce0fda3ad9058d1528f60b21c8[0] spending 59dfc2002ce40341e4332bdd552fd71b706854943c0a8bc771a62cccf2becef7:2 at height=804604
2023-08-29 12:56:23.223 [DBG] NTFN: Updated spend hint to height=804604 for confirmed spend request outpoint=59dfc2002ce40341e4332bdd552fd71b706854943c0a8bc771a62cccf2becef7:2, script=0 a2cc853fe736c0e446d02f22702c3e6efaa85aa54cf4976177fb5104848a6737
2023-08-29 12:56:23.224 [INF] NTFN: Dispatching confirmed spend notification for outpoint=59dfc2002ce40341e4332bdd552fd71b706854943c0a8bc771a62cccf2becef7:2, script=0 a2cc853fe736c0e446d02f22702c3e6efaa85aa54cf4976177fb5104848a6737 at current height=805311: 1f580d4f4ad9bc9e9ed5a328e806f01d4f2116ce0fda3ad9058d1528f60b21c8[0] spending 59dfc2002ce40341e4332bdd552fd71b706854943c0a8bc771a62cccf2becef7:2 at height=804604
2023-08-29 12:56:23.225 [DBG] CNCT: Found confirmed spend of HTLC output 59dfc2002ce40341e4332bdd552fd71b706854943c0a8bc771a62cccf2becef7:2 in tx=1f580d4f4ad9bc9e9ed5a328e806f01d4f2116ce0fda3ad9058d1528f60b21c8

then a whole bunch of

2023-08-29 13:02:39.462 [ERR] LNWL: unable to fetch transaction 159e0c490d3a91803635e2cedb874bbdf40c815138358331cf8589ca5ee4b0b8 from mempool: -5: No such mempool transaction. Use -txindex or provide a block hash to enable blockchain transaction queries. Use gettransaction for wallet transactions.
2023-08-29 13:02:40.160 [ERR] LNWL: unable to fetch transaction 889fde61468f8b2d131315f59cefe2b5925a1634a7d15e56af08971c023cdd7f from mempool: -5: No such mempool transaction. Use -txindex or provide a block hash to enable blockchain transaction queries. Use gettransaction for wallet transactions.
2023-08-29 13:02:43.794 [ERR] LNWL: unable to fetch transaction f76b5d53444b5b1393d25ba90fb039ee1812df8b2e47d9208536ac36f8692096 from mempool: -5: No such mempool transaction. Use -txindex or provide a block hash to enable blockchain transaction queries. Use gettransaction for wallet transactions.

then it's still trying to talk to bitcoind 6 mins later

2023-08-29 13:02:45.129 [INF] LNWL: Started polling mempool for new bitcoind transactions via RPC.
2023-08-29 13:07:34.412 [ERR] CMGR: Can't accept connection: unable to accept connection from 170.75.163.209:58486: read tcp 172.18.0.3:9735->170.75.163.209:58486: i/o timeout
2023-08-29 13:08:41.359 [INF] CRTR: Pruning channel graph using block 00000000000000000002c087b46918b03d6827cfe7487a01d6c4cc12c9c54d82 (height=805312)
2023-08-29 13:08:41.541 [DBG] NTFN: Filtering 2975 txns for 49 spend requests at height 805312
2023-08-29 13:08:41.552 [INF] CRTR: Block 00000000000000000002c087b46918b03d6827cfe7487a01d6c4cc12c9c54d82 (height=805312) closed 1 channels
2023-08-29 13:08:42.163 [INF] NTFN: New block: height=805312, sha=00000000000000000002c087b46918b03d6827cfe7487a01d6c4cc12c9c54d82

Around the 15-minute mark, the container gets killed because of the Docker timeout.

2023-08-29 13:09:52.206 [DBG] CNCT: ChannelArbitrator(55bbe001814f30f93ed4e3c3583440022bcfdfcf66ee022795ce4e0208c7256a:1): terminating at state=StateDefault
2023-08-29 13:09:52.205 [DBG] CNCT: ChannelArbitrator(388f3ca997a2d1d90b3c191e37f7438ca49185a08a0644ce25149468c20d3a77:0): no actions for chain trigger, terminating
2023-08-29 13:09:52.206 [DBG] CNCT: ChannelArbitrator(388f3ca997a2d1d90b3c191e37f7438ca49185a08a0644ce25149468c20d3a77:0): terminating at state=StateDefault
2023-08-29 13:09:52.205 [DBG] CNCT: ChannelArbitrator(4fa7eb9d396e8f988fb49e093a157054ac49febb0850a487cbf8b61c659a428c:1): no actions for chain trigger, terminating
2023-08-29 13:09:52.206 [DBG] CNCT: ChannelArbitrator(4fa7eb9d396e8f988fb49e093a157054ac49febb0850a487cbf8b61c659a428c:1): terminating at state=StateDefault
2023-08-29 13:09:52.206 [DBG] CNCT: ChannelArbitrator(b85b18593d75137e96ec2066d75081ffb1f8fc4c33776377cebd95e7c58ca26e:1): terminating at state=StateDefault
2023-08-29 13:09:52.207 [DBG] CNCT: ChannelArbitrator(60815fd6f35f29b6d50345e9b178d1aea0b9d2408ed6b45aeeefb3e33500aa3a:1): checking commit chain actions at height=805313, in_htlc_count=0, out_htlc_count=0
2023-08-29 13:09:52.207 [DBG] CNCT: ChannelArbitrator(60815fd6f35f29b6d50345e9b178d1aea0b9d2408ed6b45aeeefb3e33500aa3a:1): no actions for chain trigger, terminating
2023-08-29 13:09:52.208 [DBG] CNCT: ChannelArbitrator(60815fd6f35f29b6d50345e9b178d1aea0b9d2408ed6b45aeeefb3e33500aa3a:1): terminating at state=StateDefault
2023-08-29 13:10:51.668 [ERR] CMGR: Can't accept connection: unable to accept connection from 170.75.163.209:32854: read tcp 172.18.0.3:9735->170.75.163.209:32854: i/o timeout
2023-08-29 13:10:58.857 [DBG] CRTR: Received ChannelEdgePolicy for channel 850169878425370625

@guggero
Collaborator

guggero commented Aug 29, 2023

curl: (56) Recv failure: Connection reset by peer

Ah, so I guess that part is already shut down...
Can you try sending SIGQUIT (kill -3 <pid>) and capturing stdout? It should output a goroutine dump.

@niteshbalusu11
Author

curl: (56) Recv failure: Connection reset by peer

Ah, so I guess that part is already shut down... Can you try sending SIGQUIT (kill -3 <pid>) and capturing stdout (it should output a goroutine dump).

logs.zip

@guggero
Collaborator

guggero commented Aug 29, 2023

Thanks a lot!

Took me a while to comb through ~19k lines... It might be that I missed something, but this specific area looks suspicious to me:

There are 218 goroutines waiting for MarkEdgeLive:

goroutine 49695 [sync.Mutex.Lock]:
runtime.gopark(0x20977310290c469?, 0x14a898b01b63e7c?, 0x2a?, 0xd2?, 0x331e34001f682a8?)
                runtime/proc.go:381 +0xd6 fp=0xc00661d460 sp=0xc00661d440 pc=0x43a236
runtime.goparkunlock(...)
                runtime/proc.go:387
runtime.semacquire1(0xc0009fe5b4, 0x0?, 0x3, 0x1, 0xce?)
                runtime/sema.go:160 +0x20f fp=0xc00661d4c8 sp=0xc00661d460 pc=0x44b5af
sync.runtime_SemacquireMutex(0x29d42370128852f?, 0xc?, 0x3e7f388006cd0bf?)
                runtime/sema.go:77 +0x26 fp=0xc00661d500 sp=0xc00661d4c8 pc=0x468e86
sync.(*Mutex).lockSlow(0xc0009fe5b0)
                sync/mutex.go:171 +0x165 fp=0xc00661d550 sp=0xc00661d500 pc=0x489b65
sync.(*Mutex).Lock(...)
                sync/mutex.go:90
sync.(*RWMutex).Lock(0xc00661d300?)
                sync/rwmutex.go:147 +0x36 fp=0xc00661d580 sp=0xc00661d550 pc=0x48b1b6
github.com/lightningnetwork/lnd/channeldb.(*ChannelGraph).MarkEdgeLive(0xc0009fe5a0, 0x0)
                github.com/lightningnetwork/lnd/channeldb/graph.go:3896 +0x65 fp=0xc00661d610 sp=0xc00661d580 pc=0x1514105
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).MarkEdgeLive(0xc005be2000?, {0x4644600?, 0xc0?, 0x0?})
                github.com/lightningnetwork/lnd/routing/router.go:2766 +0x35 fp=0xc00661d630 sp=0xc00661d610 pc=0x166a1f5

And 43 goroutines waiting for HasChannelEdge:

goroutine 50286 [sync.Mutex.Lock]:
runtime.gopark(0x7f73e6c5df18?, 0xa78706?, 0x0?, 0x0?, 0x155?)
                runtime/proc.go:381 +0xd6 fp=0xc006722438 sp=0xc006722418 pc=0x43a236
runtime.goparkunlock(...)
                runtime/proc.go:387
runtime.semacquire1(0xc0009fe5b4, 0x0?, 0x3, 0x1, 0x26?)
                runtime/sema.go:160 +0x20f fp=0xc0067224a0 sp=0xc006722438 pc=0x44b5af
sync.runtime_SemacquireMutex(0x16000018?, 0x0?, 0x1940?)
                runtime/sema.go:77 +0x26 fp=0xc0067224d8 sp=0xc0067224a0 pc=0x468e86
sync.(*Mutex).lockSlow(0xc0009fe5b0)
                sync/mutex.go:171 +0x165 fp=0xc006722528 sp=0xc0067224d8 pc=0x489b65
sync.(*Mutex).Lock(...)
                sync/mutex.go:90
sync.(*RWMutex).Lock(0xc0009fe5b0?)
                sync/rwmutex.go:147 +0x36 fp=0xc006722558 sp=0xc006722528 pc=0x48b1b6
github.com/lightningnetwork/lnd/channeldb.(*ChannelGraph).HasChannelEdge(0xc0009fe5a0, 0x9241f000c630000)
                github.com/lightningnetwork/lnd/channeldb/graph.go:1138 +0x2f4 fp=0xc0067226a0 sp=0xc006722558 pc=0x1505e14
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).IsKnownEdge(0x9241f000c630000?, {0x1f7b5fd?, 0x0?, 0x2a0c?})
                github.com/lightningnetwork/lnd/routing/router.go:2700 +0x35 fp=0xc0067226d0 sp=0xc0067226a0 pc=0x1669f35

Not sure if we changed anything in that area or if that is a red herring (and it's just normal graph sync and zombie resurrection going on).

@niteshbalusu11 for how long has the node been running now without shutting down?

The other mutex waiting for a lock is this one, which sounds more like something we recently worked on (cc @yyforyongyu):

goroutine 43625 [sync.Mutex.Lock]:
runtime.gopark(0x10?, 0xc0006dae38?, 0xb0?, 0x5b?, 0x40dbe8?)
                runtime/proc.go:381 +0xd6 fp=0xc0029f5b38 sp=0xc0029f5b18 pc=0x43a236
runtime.goparkunlock(...)
                runtime/proc.go:387
runtime.semacquire1(0xc0017161cc, 0x0?, 0x3, 0x1, 0x31?)
                runtime/sema.go:160 +0x20f fp=0xc0029f5ba0 sp=0xc0029f5b38 pc=0x44b5af
sync.runtime_SemacquireMutex(0xc0029f5c18?, 0x63?, 0xc004fe17c0?)
                runtime/sema.go:77 +0x26 fp=0xc0029f5bd8 sp=0xc0029f5ba0 pc=0x468e86
sync.(*Mutex).lockSlow(0xc0017161c8)
                sync/mutex.go:171 +0x165 fp=0xc0029f5c28 sp=0xc0029f5bd8 pc=0x489b65
sync.(*Mutex).Lock(...)
                sync/mutex.go:90
github.com/lightningnetwork/lnd/multimutex.(*Mutex[...]).Lock(0x22e06a0, {0x73, 0xa, 0x2d, 0x16, 0xcc, 0x88, 0x19, 0x79, 0xc9, ...})
                github.com/lightningnetwork/lnd/multimutex/multimutex.go:58 +0x11a fp=0xc0029f5c60 sp=0xc0029f5c28 pc=0x9f4eda
github.com/lightningnetwork/lnd/blockcache.(*BlockCache).GetBlock(0xc0006780a0, 0xc001802d60, 0xc0029f5e38)
                github.com/lightningnetwork/lnd/blockcache/blockcache.go:38 +0xa5 fp=0xc0029f5da0 sp=0xc0029f5c60 pc=0x1630ee5
github.com/lightningnetwork/lnd/chainntnfs/bitcoindnotify.(*BitcoindNotifier).GetBlock(...)
                github.com/lightningnetwork/lnd/chainntnfs/bitcoindnotify/bitcoind.go:1037
github.com/lightningnetwork/lnd/chainntnfs/bitcoindnotify.(*BitcoindNotifier).historicalSpendDetails(0xc0000f9290, {{{0x89, 0x62, 0xc7, 0x6f, 0x40, 0x1c, 0x95, 0x0, 0x5c, ...}, ...}, ...}, ...)
                github.com/lightningnetwork/lnd/chainntnfs/bitcoindnotify/bitcoind.go:853 +0xe7 fp=0xc0029f5e78 sp=0xc0029f5da0 pc=0x1681fa7
github.com/lightningnetwork/lnd/chainntnfs/bitcoindnotify.(*BitcoindNotifier).notificationDispatcher.func2(0xc0018007d0)
                github.com/lightningnetwork/lnd/chainntnfs/bitcoindnotify/bitcoind.go:300 +0x118 fp=0xc0029f5fc8 sp=0xc0029f5e78 pc=0x167fff8
github.com/lightningnetwork/lnd/chainntnfs/bitcoindnotify.(*BitcoindNotifier).notificationDispatcher.func5()
                github.com/lightningnetwork/lnd/chainntnfs/bitcoindnotify/bitcoind.go:339 +0x2a fp=0xc0029f5fe0 sp=0xc0029f5fc8 pc=0x167feaa
runtime.goexit()
                runtime/asm_amd64.s:1598 +0x1 fp=0xc0029f5fe8 sp=0xc0029f5fe0 pc=0x46d261
created by github.com/lightningnetwork/lnd/chainntnfs/bitcoindnotify.(*BitcoindNotifier).notificationDispatcher
                github.com/lightningnetwork/lnd/chainntnfs/bitcoindnotify/bitcoind.go:297 +0x94b

@niteshbalusu11
Author

for how long has the node been running now without shutting down?

I had to restart a few times to get these logs, so the last two restarts are about 5 minutes apart.

@guggero
Collaborator

guggero commented Aug 29, 2023

Okay, this seems to be related to some of the other reported issues; I think this is the most relevant one:

goroutine 47339 [semacquire]:
runtime.gopark(0x0?, 0x0?, 0x20?, 0x36?, 0x10?)
                runtime/proc.go:381 +0xd6 fp=0xc002ca1a30 sp=0xc002ca1a10 pc=0x43a236
runtime.goparkunlock(...)
                runtime/proc.go:387
runtime.semacquire1(0xc00af2a6c8, 0x88?, 0x1, 0x0, 0x0?)
                runtime/sema.go:160 +0x20f fp=0xc002ca1a98 sp=0xc002ca1a30 pc=0x44b5af
sync.runtime_Semacquire(0x46b1ce?)
                runtime/sema.go:62 +0x27 fp=0xc002ca1ad0 sp=0xc002ca1a98 pc=0x468d67
sync.(*WaitGroup).Wait(0xc002ca1b58?)
                sync/waitgroup.go:116 +0x4b fp=0xc002ca1af8 sp=0xc002ca1ad0 pc=0x48b66b
github.com/lightningnetwork/lnd/peer.(*Brontide).WaitForDisconnect(0xc00af2a380, 0xc00076ca20)
                github.com/lightningnetwork/lnd/peer/brontide.go:1102 +0xeb fp=0xc002ca1b88 sp=0xc002ca1af8 pc=0x19021ab
github.com/lightningnetwork/lnd.(*server).peerTerminationWatcher(0xc000004300, 0xc00af2a380, 0xc00e0b8360?)
                github.com/lightningnetwork/lnd/server.go:4017 +0x90 fp=0xc002ca1fb8 sp=0xc002ca1b88 pc=0x19cc250
github.com/lightningnetwork/lnd.(*server).peerInitializer.func2()
                github.com/lightningnetwork/lnd/server.go:3973 +0x2e fp=0xc002ca1fe0 sp=0xc002ca1fb8 pc=0x19cc12e
runtime.goexit()
                runtime/asm_amd64.s:1598 +0x1 fp=0xc002ca1fe8 sp=0xc002ca1fe0 pc=0x46d261
created by github.com/lightningnetwork/lnd.(*server).peerInitializer
                github.com/lightningnetwork/lnd/server.go:3973 +0x14a

Roasbeef added a commit to Roasbeef/lnd that referenced this issue Aug 29, 2023
In this commit, we attempt to fix a circular waiting scenario introduced
inadvertently when [fixing a race
condition](lightningnetwork#7856). With that
PR, we added a new channel that would block `Disconnect` and
`WaitForDisconnect` to ensure that those calls are only allowed to
succeed once the `Start` method has finished.

The issue is that if the server is trying to disconnect a peer due to a
concurrent connection while `Start` is blocked on `maybeSendNodeAnn`,
which in turn wants to grab the main server mutex, then `Start` can never
exit. That causes `startReady` to never be closed, which in turn causes
the server to be blocked.

This PR attempts to fix the issue by calling `maybeSendNodeAnn` in a
goroutine, so it can grab the server mutex and not block the `Start`
method.

Fixes lightningnetwork#7924

Fixes lightningnetwork#7928

Fixes lightningnetwork#7866
@Roasbeef
Member

We have a candidate fix here: #7938

@niteshbalusu11
Author

We have a candidate fix here: #7938

Please merge 🙏

@Roasbeef
Member

@niteshbalusu11 does that seem to do the trick?

@niteshbalusu11
Author

I have not tested it yet.

@niteshbalusu11
Author

Just tested it, seems to be working well.

@guggero
Collaborator

guggero commented Oct 23, 2023

@mycelia1 thanks for the goroutine profiles. This looks very similar to something we've already fixed. Can you please confirm what exact version of lnd you are running?

@mycelia1

@guggero
"version": "0.17.0-beta commit=v0.17.0-beta"

@guggero
Collaborator

guggero commented Oct 23, 2023

Hmm, okay, thanks.

So I see this goroutine waiting on a wait group:

7 @ 0x43df8e 0x44f438 0x44f40f 0x46cf25 0x48f1a8 0x181c312 0x18e0545 0x471161
#       0x46cf24        sync.runtime_Semacquire+0x24                                            runtime/sema.go:62
#       0x48f1a7        sync.(*WaitGroup).Wait+0x47                                             sync/waitgroup.go:116
#       0x181c311       github.com/lightningnetwork/lnd/peer.(*Brontide).WaitForDisconnect+0xd1 github.com/lightningnetwork/lnd/peer/brontide.go:1102
#       0x18e0544       github.com/lightningnetwork/lnd.(*server).peerTerminationWatcher+0x84   github.com/lightningnetwork/lnd/server.go:4017

Looks like the readHeader goroutine didn't shut down though:

7 @ 0x43df8e 0x436677 0x46b625 0x4deae7 0x4dfdda 0x4dfdc8 0x5526e5 0x564285 0x4b81d0 0x16fcd88 0x16fcd62 0x16f7a25 0x181c690 0x181e18d 0x471161
#       0x46b624        internal/poll.runtime_pollWait+0x84                                     runtime/netpoll.go:343
#       0x4deae6        internal/poll.(*pollDesc).wait+0x26                                     internal/poll/fd_poll_runtime.go:84
#       0x4dfdd9        internal/poll.(*pollDesc).waitRead+0x279                                internal/poll/fd_poll_runtime.go:89
#       0x4dfdc7        internal/poll.(*FD).Read+0x267                                          internal/poll/fd_unix.go:164
#       0x5526e4        net.(*netFD).Read+0x24                                                  net/fd_posix.go:55
#       0x564284        net.(*conn).Read+0x44                                                   net/net.go:179
#       0x4b81cf        io.ReadAtLeast+0x8f                                                     io/io.go:335
#       0x16fcd87       io.ReadFull+0x47                                                        io/io.go:354
#       0x16fcd61       github.com/lightningnetwork/lnd/brontide.(*Machine).ReadHeader+0x21     github.com/lightningnetwork/lnd/brontide/noise.go:852
#       0x16f7a24       github.com/lightningnetwork/lnd/brontide.(*Conn).ReadNextHeader+0x44    github.com/lightningnetwork/lnd/brontide/conn.go:127
#       0x181c68f       github.com/lightningnetwork/lnd/peer.(*Brontide).readNextMessage+0xaf   github.com/lightningnetwork/lnd/peer/brontide.go:1146
#       0x181e18c       github.com/lightningnetwork/lnd/peer.(*Brontide).readHandler+0x2ec      github.com/lightningnetwork/lnd/peer/brontide.go:1519

It's stuck at noiseConn.ReadNextHeader():

pktLen, err := noiseConn.ReadNextHeader()

That looks like it doesn't have a timeout set at all? But I could be misinterpreting things...

@Roasbeef Roasbeef modified the milestones: v0.17.1, v0.18.0 Oct 31, 2023
@Crypt-iQ
Collaborator

Crypt-iQ commented Nov 2, 2023

Going off of this:

logs.zip

The ChannelRouter is waiting for its goroutines to stop. There are two calls to UpdateEdgePolicy that are hanging; they are waiting for the cacheMu lock (this is obfuscated somewhat: the cacheMu mutex is the locker in the chanScheduler). All of the calls to HasChannelEdge and MarkEdgeLive are also waiting for the cacheMu lock. There is one call to MarkEdgeLive that is not waiting for the cacheMu lock:

goroutine 48194 [runnable]:
runtime.asyncPreempt2()
        runtime/preempt.go:307 +0x3f fp=0xc002a5cd38 sp=0xc002a5cd18 pc=0x43893f
runtime.asyncPreempt()
        runtime/preempt_amd64.s:53 +0xdb fp=0xc002a5cec0 sp=0xc002a5cd38 pc=0x46e8db
go.etcd.io/bbolt.(*pgids).Less(0x9, 0x459b2, 0x45862)
        <autogenerated>:1 +0x9 fp=0xc002a5cee0 sp=0xc002a5cec0 pc=0xa49849
sort.partition({0x22dfca0, 0xc0069ceb70}, 0x455c2, 0x45b05, 0xc0069ceb70?)
        sort/zsortinterface.go:154 +0x182 fp=0xc002a5cf18 sp=0xc002a5cee0 pc=0x478102
sort.pdqsort({0x22dfca0, 0xc0069ceb70}, 0x449b1?, 0x4ab9c?, 0xc0069ceb70?)
        sort/zsortinterface.go:114 +0x225 fp=0xc002a5cf98 sp=0xc002a5cf18 pc=0x477e65
sort.pdqsort({0x22dfca0, 0xc0069ceb70}, 0xf57c?, 0x4f890?, 0x40dc1d?)
        sort/zsortinterface.go:121 +0x276 fp=0xc002a5d018 sp=0xc002a5cf98 pc=0x477eb6
sort.pdqsort({0x22dfca0, 0xc0069ceb70}, 0x18?, 0x1b64280?, 0xc001278501?)
        sort/zsortinterface.go:125 +0x2b5 fp=0xc002a5d098 sp=0xc002a5d018 pc=0x477ef5
sort.Sort({0x22dfca0, 0xc0069ceb70})
        sort/sort.go:48 +0x5d fp=0xc002a5d0d0 sp=0xc002a5d098 pc=0x4761bd
go.etcd.io/bbolt.(*freelist).hashmapGetFreePageIDs(0xc000664380)
        go.etcd.io/bbolt@v1.3.7/freelist_hmap.go:84 +0x165 fp=0xc002a5d1e0 sp=0xc002a5d0d0 pc=0xa3f825
go.etcd.io/bbolt.(*freelist).hashmapGetFreePageIDs-fm()
        <autogenerated>:1 +0x26 fp=0xc002a5d1f8 sp=0xc002a5d1e0 pc=0xa4a206
go.etcd.io/bbolt.(*freelist).copyall(0xc000664380, {0xc014c4c018, 0xae16c, 0xae16c})
        go.etcd.io/bbolt@v1.3.7/freelist.go:104 +0x21d fp=0xc002a5d2f8 sp=0xc002a5d1f8 pc=0xa3d6dd
go.etcd.io/bbolt.(*freelist).write(0xc0006d6ee0?, 0xc014c4c000)
        go.etcd.io/bbolt@v1.3.7/freelist.go:335 +0xcb fp=0xc002a5d360 sp=0xc002a5d2f8 pc=0xa3e84b
go.etcd.io/bbolt.(*Tx).commitFreelist(0xc0006d6ee0)
        go.etcd.io/bbolt@v1.3.7/tx.go:243 +0xaa fp=0xc002a5d3a8 sp=0xc002a5d360 pc=0xa44f0a
go.etcd.io/bbolt.(*Tx).Commit(0xc0006d6ee0)
        go.etcd.io/bbolt@v1.3.7/tx.go:178 +0x1cd fp=0xc002a5d4d0 sp=0xc002a5d3a8 pc=0xa44a2d
github.com/btcsuite/btcwallet/walletdb/bdb.(*transaction).Commit(0x22ebec0?)
        github.com/btcsuite/btcwallet/walletdb@v1.4.0/bdb/db.go:101 +0x25 fp=0xc002a5d508 sp=0xc002a5d4d0 pc=0xa4b1a5
github.com/btcsuite/btcwallet/walletdb/bdb.(*db).Update(0xc002a5d300?, 0xc002a04ad0, 0xc002a5d2e0?)
        github.com/btcsuite/btcwallet/walletdb@v1.4.0/bdb/db.go:437 +0x11c fp=0xc002a5d580 sp=0xc002a5d508 pc=0xa4e13c
github.com/lightningnetwork/lnd/kvdb.Update(...)
        github.com/lightningnetwork/lnd/kvdb@v1.4.2/interface.go:16
github.com/lightningnetwork/lnd/channeldb.(*ChannelGraph).MarkEdgeLive(0xc0009fe5a0, 0x0)
        github.com/lightningnetwork/lnd/channeldb/graph.go:3899 +0xe5 fp=0xc002a5d610 sp=0xc002a5d580 pc=0x1514185
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).MarkEdgeLive(0xc0069d1550?, {0x2409040?, 0xc0?, 0x0?})
        github.com/lightningnetwork/lnd/routing/router.go:2766 +0x35 fp=0xc002a5d630 sp=0xc002a5d610 pc=0x166a1f5
github.com/lightningnetwork/lnd/discovery.(*AuthenticatedGossiper).processZombieUpdate(0xc0065ad600, 0xc0030d4580, 0xc0069d1550)
        github.com/lightningnetwork/lnd/discovery/gossiper.go:2056 +0x1b7 fp=0xc002a5d6c0 sp=0xc002a5d630 pc=0x176bfb7
github.com/lightningnetwork/lnd/discovery.(*AuthenticatedGossiper).handleChanUpdate(0xc0065ad600, 0xc00453d700, 0xc0069d1550, {0xc0065830a0, 0x1, 0x1})
        github.com/lightningnetwork/lnd/discovery/gossiper.go:2731 +0xb1e fp=0xc002a5de90 sp=0xc002a5d6c0 pc=0x1770ede
github.com/lightningnetwork/lnd/discovery.(*AuthenticatedGossiper).processNetworkAnnouncement(0xc007c0d1d0?, 0xc00453d700)
        github.com/lightningnetwork/lnd/discovery/gossiper.go:2005 +0xfe fp=0xc002a5def8 sp=0xc002a5de90 pc=0x176bd5e
github.com/lightningnetwork/lnd/discovery.(*AuthenticatedGossiper).handleNetworkMessages(0xc0065ad600, 0xc00453d700, 0x18ff846?, 0xc007c0d1d0)
        github.com/lightningnetwork/lnd/discovery/gossiper.go:1506 +0x22d fp=0xc002a5dfb0 sp=0xc002a5def8 pc=0x1768f4d
github.com/lightningnetwork/lnd/discovery.(*AuthenticatedGossiper).networkHandler.func4()
        github.com/lightningnetwork/lnd/discovery/gossiper.go:1421 +0x32 fp=0xc002a5dfe0 sp=0xc002a5dfb0 pc=0x1768bd2
runtime.goexit()
        runtime/asm_amd64.s:1598 +0x1 fp=0xc002a5dfe8 sp=0xc002a5dfe0 pc=0x46d261
created by github.com/lightningnetwork/lnd/discovery.(*AuthenticatedGossiper).networkHandler
        github.com/lightningnetwork/lnd/discovery/gossiper.go:1421 +0x913

This goroutine, which was trying to call MarkEdgeLive and write the freelist, was preempted (see runtime.asyncPreempt2). In this specific scenario, I don't think there's a deadlock, but I don't know why that database transaction isn't completing. We did update bbolt in 0.17, but I'm not sure if that's related. How big is your database @niteshbalusu11?

@saubyk saubyk assigned Crypt-iQ and unassigned yyforyongyu Nov 2, 2023
@niteshbalusu11
Author

How big is your database @niteshbalusu11 ?

@Crypt-iQ 429mb

@Crypt-iQ
Collaborator

Crypt-iQ commented Nov 2, 2023

Noticed something in this log dump that is separate from my comment above:

@Roasbeef Here's another goroutine dump. I think the issue still persists.

godump.zip

The ChannelRouter is trying to Stop and is waiting for the wait group counter to go to zero. However, there is an outstanding GetBlock call via fetchFundingTx:

goroutine 15327295 [select, 43 minutes]:
runtime.gopark(0xc00d085878?, 0x3?, 0x20?, 0x70?, 0xc00d08583a?)
        runtime/proc.go:398 +0xce fp=0xc00d0856d0 sp=0xc00d0856b0 pc=0x43df0e
runtime.selectgo(0xc00d085878, 0xc00d085834, 0x6238e8?, 0x0, 0x0?, 0x1)
        runtime/select.go:327 +0x725 fp=0xc00d0857f0 sp=0xc00d0856d0 pc=0x44e385
github.com/btcsuite/btcwallet/chain.(*BitcoindConn).GetBlock(0xc00064c580, 0xc01d6d3da0)
        github.com/btcsuite/btcwallet@v0.16.10-0.20230804184612-07be54bc22cf/chain/bitcoind_conn.go:410 +0x1d1 fp=0xc00d0858b8 sp=0xc00d0857f0 pc=0x947fd1
github.com/btcsuite/btcwallet/chain.(*BitcoindClient).GetBlock(0xc008e2b770?, 0x0?)
        github.com/btcsuite/btcwallet@v0.16.10-0.20230804184612-07be54bc22cf/chain/bitcoind_client.go:144 +0x17 fp=0xc00d0858d8 sp=0xc00d0858b8 pc=0x93f5f7
github.com/btcsuite/btcwallet/chain.Interface.GetBlock-fm(0x96826bc740000002?)
        <autogenerated>:1 +0x2b fp=0xc00d0858f8 sp=0xc00d0858d8 pc=0x1580d8b
github.com/lightningnetwork/lnd/blockcache.(*BlockCache).GetBlock(0xc0014af8c0, 0xc01d6d3da0, 0xc00d085ac8)
        github.com/lightningnetwork/lnd/blockcache/blockcache.go:55 +0x22a fp=0xc00d085a38 sp=0xc00d0858f8 pc=0x156e10a
github.com/lightningnetwork/lnd/lnwallet/btcwallet.(*BtcWallet).GetBlock(0x1000001e00bab?, 0x7fdcf6b86c78?)
        github.com/lightningnetwork/lnd/lnwallet/btcwallet/blockchain.go:139 +0x1de fp=0xc00d085b30 sp=0xc00d085a38 pc=0x1575d9e
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).fetchFundingTx(0xc000149040, 0xc008e2ade0)
        github.com/lightningnetwork/lnd/routing/router.go:1780 +0x5c fp=0xc00d085ba0 sp=0xc00d085b30 pc=0x15a0bfc
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).processUpdate(0xc000149040, {0x1c82220?, 0xc018812580?}, {0xc001a0b0d8, 0x1, 0x1})
        github.com/lightningnetwork/lnd/routing/router.go:1555 +0xea7 fp=0xc00d085f10 sp=0xc00d085ba0 pc=0x159fca7
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).handleNetworkUpdate(0xc000149040, 0xc009855bf0, 0xc0066a67e0)
        github.com/lightningnetwork/lnd/routing/router.go:1031 +0x10d fp=0xc00d085fb8 sp=0xc00d085f10 pc=0x159c72d
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).networkHandler.func4()
        github.com/lightningnetwork/lnd/routing/router.go:1129 +0x28 fp=0xc00d085fe0 sp=0xc00d085fb8 pc=0x159db08
runtime.goexit()
        runtime/asm_amd64.s:1650 +0x1 fp=0xc00d085fe8 sp=0xc00d085fe0 pc=0x4710e1
created by github.com/lightningnetwork/lnd/routing.(*ChannelRouter).networkHandler in goroutine 61461
        github.com/lightningnetwork/lnd/routing/router.go:1129 +0xbbf

Since you're running a pruned node, it will wait here for the block: https://github.com/btcsuite/btcwallet/blob/9c135429d5b7c5c84b19eaffce9e6a32e2d9d5d0/chain/bitcoind_conn.go#L479
The logs also show that your node didn't have any peers at the time and was waiting here: https://github.com/lightninglabs/neutrino/blob/42a196facefac36282e68ddf4a02c9ce4601a0b0/query/workmanager.go#L263. This prevents shutdown until your node can connect to a peer that sends over the block.

@bitromortac
Collaborator

goroutine 33483 [semacquire, 4282 minutes]:
runtime.gopark(0xc000132530?, 0xc002271c20?, 0xc0?, 0x20?, 0xc006c0d0c0?)
        runtime/proc.go:398 +0xce fp=0xc006c0d048 sp=0xc006c0d028 pc=0x43df0e
runtime.goparkunlock(...)
        runtime/proc.go:404
runtime.semacquire1(0xc002e06d88, 0x1e?, 0x1, 0x0, 0x60?)
        runtime/sema.go:160 +0x218 fp=0xc006c0d0b0 sp=0xc006c0d048 pc=0x44f3b8
sync.runtime_Semacquire(0xc006c0d100?)
        runtime/sema.go:62 +0x25 fp=0xc006c0d0e8 sp=0xc006c0d0b0 pc=0x46cea5
sync.(*WaitGroup).Wait(0xc002b9dce0?)
        sync/waitgroup.go:116 +0x48 fp=0xc006c0d110 sp=0xc006c0d0e8 pc=0x48f128
github.com/lightningnetwork/lnd/contractcourt.(*ChannelArbitrator).Stop(0x1ac59e0?)
        github.com/lightningnetwork/lnd/contractcourt/channel_arbitrator.go:826 +0x19c fp=0xc006c0d178 sp=0xc006c0d110 pc=0x150c8fc
github.com/lightningnetwork/lnd/contractcourt.(*ChainArbitrator).ResolveContract(0xc0057ca4e0, {{0x6a, 0x25, 0xc7, 0x8, 0x2, 0x4e, 0xce, 0x95, 0x27, ...}, ...})
        github.com/lightningnetwork/lnd/contractcourt/chain_arbitrator.go:468 +0x21c fp=0xc006c0d228 sp=0xc006c0d178 pc=0x15010bc
github.com/lightningnetwork/lnd/contractcourt.newActiveChannelArbitrator.func4()
        github.com/lightningnetwork/lnd/contractcourt/chain_arbitrator.go:404 +0x78 fp=0xc006c0d278 sp=0xc006c0d228 pc=0x1500cb8
github.com/lightningnetwork/lnd/contractcourt.(*ChannelArbitrator).stateStep(0xc002e06b00, 0x1e93776?, 0x0, 0x1?)
        github.com/lightningnetwork/lnd/contractcourt/channel_arbitrator.go:1265 +0x107c fp=0xc006c0dbb0 sp=0xc006c0d278 pc=0x150da1c
github.com/lightningnetwork/lnd/contractcourt.(*ChannelArbitrator).advanceState(0xc002e06b00, 0x1e80b45?, 0x0, 0xc001a40640?)
        github.com/lightningnetwork/lnd/contractcourt/channel_arbitrator.go:1512 +0x17b fp=0xc006c0dc50 sp=0xc006c0dbb0 pc=0x150fadb
github.com/lightningnetwork/lnd/contractcourt.(*ChannelArbitrator).channelAttendant(0xc002e06b00, 0xc5454)
        github.com/lightningnetwork/lnd/contractcourt/channel_arbitrator.go:2930 +0x10b2 fp=0xc006c0dfc0 sp=0xc006c0dc50 pc=0x1518352
github.com/lightningnetwork/lnd/contractcourt.(*ChannelArbitrator).Start.func2()
        github.com/lightningnetwork/lnd/contractcourt/channel_arbitrator.go:567 +0x25 fp=0xc006c0dfe0 sp=0xc006c0dfc0 pc=0x150b205
runtime.goexit()
        runtime/asm_amd64.s:1650 +0x1 fp=0xc006c0dfe8 sp=0xc006c0dfe0 pc=0x4710e1
created by github.com/lightningnetwork/lnd/contractcourt.(*ChannelArbitrator).Start in goroutine 1
        github.com/lightningnetwork/lnd/contractcourt/channel_arbitrator.go:567 +0x68e

Isn't that also strange? It seems circular: channelAttendant calls the ChannelArbitrator's Stop method, which waits on the wait group that channelAttendant itself belongs to.

@Crypt-iQ
Collaborator

Crypt-iQ commented Nov 3, 2023

Isn't that also strange, seems to be circular as channelAttendant calls the ChannelArbitrator's Stop method, which is waiting on the work group channelAttendant it is part of?

Good catch. This doesn't prevent shutdown, but it's certainly not the right behavior. Apparently, when Stop is called from channelAttendant, it just waits indefinitely.

@dskvr

dskvr commented Dec 4, 2023

0.17.2-beta

still getting logs flooded with lines like

2023-12-04 17:11:18.158 [ERR] DISC: Update edge for short_chan_id(879313533626023936) got: router shutting down

Node will not shut down until a few thousand of these are spat out.

Is this issue actually resolved?

@guggero
Collaborator

guggero commented Dec 4, 2023

That just sounds like a logging issue. Shutting down does involve quite a bit of work, so it might take a few seconds. And from how I interpret your message, the node does actually shut down?
Also, the PR that closed this issue was only released in v0.17.3-beta, which is currently at RC1.

@Crypt-iQ
Collaborator

Crypt-iQ commented Dec 4, 2023

@dskvr Your node is processing many ChannelUpdate messages. Once the shutdown signal is received, it may take a while to shut down because many of these ChannelUpdate messages will get processed before the router subsystem actually shuts down.

@dskvr

dskvr commented Dec 4, 2023

@guggero @Crypt-iQ To be clear, the original text of this issue

I am running lnd 0.17.0 rc1 plus the fix of Elle that was merged recently. When trying to restart or shutting down LND my docker container never stops, I have 15min timeout in the compose file before a kill signal is sent by docker. I just get these logs and the timeout eventually hits and the container gets killed so LND doesn't shut down gracefully.

What I have described is the exact same issue described by OP. I added a post to this issue here; the resource consumption and logging patterns I outlined were echoed shortly after by @mycelia1 here, followed up with a goroutine dump that was reviewed by @guggero here.

That just sounds like a logging issue. Shutting down does involve quite a bit of work, so it might take some seconds.

I'm not talking seconds, I'm talking 1-2 hours.

The only difference between what I've described and the OP is that I am not forcing a SIGTERM after 15m. It takes about 1-2 hours to shut down a node, for a layer-2 solution that demands ~100% uptime from its nodes. So given the stated purpose and risks of this software, the layer-2 problem it aims to solve, and the contents of this thread, it would be difficult to infer anything other than the following:

Blocking a restart while updating non-essential graph data, in a circumstance that can result in monetary loss, is not the intended behavior.

Also thanks for highlighting that my eyes completely missed the milestone update, I'll watch out for 0.17.3-beta

@Crypt-iQ Crypt-iQ reopened this Dec 4, 2023
@Crypt-iQ
Collaborator

Crypt-iQ commented Dec 4, 2023

The linked PR that closed this issue did fix a shutdown bug, but it seems that another issue is lock contention on the graph's cacheMu mutex:

  • The router tries to Stop and waits for the wait group to go to zero.
    • The handleNetworkUpdate goroutine is trying to call UpdateEdgePolicy, which waits for cacheMu.
  • Many ChannelUpdate messages are received in the gossiper and call IsKnownEdge or MarkEdgeLive in handleChanUpdate. These functions try to acquire cacheMu.
    • These calls prevent the UpdateEdgePolicy calls from running until they are done.

I'm not sure how (or if) lock contention is handled in the Go runtime (i.e. whether mutex acquisition is FIFO or something else). I also wonder if we're receiving ChannelUpdate messages fast enough that it prevents shutdown.

@guggero
Collaborator

guggero commented Dec 5, 2023

Okay, thanks for the clarification. It did sound like you were mostly complaining about the log spam. But anything more than a minute for shutting down is unacceptable, I agree.

@C-Otto
Contributor

C-Otto commented Mar 6, 2024

Maybe this issue is resolved? See #8250 for a description of why lnd refused to stop (using a pruned bitcoind).

@saubyk
Collaborator

saubyk commented Mar 6, 2024

Hi @niteshbalusu11 can you please confirm if this is still an issue with lnd 0.17.4? thanks

@niteshbalusu11
Author

Hi @niteshbalusu11 can you please confirm if this is still an issue with lnd 0.17.4? thanks

I think it can be closed. Doesn't happen anymore.

@guggero guggero closed this as completed Mar 6, 2024