-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
routing: shutdown chanrouter correctly. #8497
base: master
Are you sure you want to change the base?
Conversation
Important Review skippedAuto reviews are limited to specific labels. Labels to auto review (1)
Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (invoked as PR comments)
Additionally, you can add CodeRabbit Configration File (
|
But this is just part of the fix why the other node is not able to sync the graph to the chain, we definitely need to retry the blockfetch and not fail immediately if we cannot get the block from the first peer. This issue is already tracked in this issue: |
801e50f
to
3223097
Compare
3223097
to
85a52aa
Compare
Swapped the order when we add the cleanup |
routing/router.go
Outdated
@@ -539,6 +539,13 @@ func (r *ChannelRouter) Start() error { | |||
return nil | |||
} | |||
|
|||
defer func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still need this given the change of orders in the following commit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree we can remove this now.
@@ -1847,173 +1847,174 @@ func (s *server) Start() error { | |||
cleanup := cleaner{} | |||
|
|||
s.start.Do(func() { | |||
cleanup = cleanup.add(s.customMessageServer.Stop) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this change is interesting...need to take a second look to see if this can be done safely, as there's some stop pattern that requires a certain state to be reached before it can stop, such as,
Lines 201 to 205 in acc5951
func (m *PingManager) Stop() error { | |
if m.pingTicker == nil { | |
return errors.New("PingManager cannot be stopped because it " + | |
"isn't running") | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's fine, because we just call the stop function, and if we cannot stop it we just print a warning and continue with the cleanup of the other subsystems. We cannot know at which state the startup might fail so therefore I think we should precautiously call the stop function.
f21e62c
to
8c831e5
Compare
@yyforyongyu while adding the interruptibility to the startup of the server, I figured out that we need to make sure that each stop call is atomic (only happens once) otherwise we first call it in the But I think when the tests pass the switch of the cleanup order should have no side effects and can prevent some cases where subsystems depend on each other and therefore cannot shutdown correctly in case on of them does not close the |
@ziggie1984, remember to re-request review from reviewers when ready |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some comments, will check the itest logs to understand more about the new behavior.
chanfitness/chaneventstore.go
Outdated
@@ -48,6 +49,9 @@ var ( | |||
// ChannelEventStore maintains a set of event logs for the node's channels to | |||
// provide insight into the performance and health of channels. | |||
type ChannelEventStore struct { | |||
started uint32 // To be used atomically. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since there are only two states, I think we can use atomic.Bool
instead
chanfitness/chaneventstore.go
Outdated
@@ -142,6 +146,10 @@ func NewChannelEventStore(config *Config) *ChannelEventStore { | |||
// information from the store. If this function fails, it cancels its existing | |||
// subscriptions and returns an error. | |||
func (c *ChannelEventStore) Start() error { | |||
if !atomic.CompareAndSwapUint32(&c.started, 0, 1) { | |||
return nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's always return an error when Start
is called twice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would also put the log log.Info("ChannelEventStore starting")
before this check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm I think the current design calls them twice, erroring out might cause problems, let's see what the itests say with the new design.
chanfitness/chaneventstore.go
Outdated
@@ -203,6 +211,10 @@ func (c *ChannelEventStore) Start() error { | |||
|
|||
// Stop terminates all goroutines started by the event store. | |||
func (c *ChannelEventStore) Stop() { | |||
if !atomic.CompareAndSwapUint32(&c.stopped, 0, 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here re error out on duplicate Stop
, and more logging would be great.
invoices/invoiceregistry.go
Outdated
@@ -101,6 +101,9 @@ func (r *htlcReleaseEvent) Less(other queue.PriorityQueueItem) bool { | |||
// created by the daemon. The registry is a thin wrapper around a map in order | |||
// to ensure that all updates/reads are thread safe. | |||
type InvoiceRegistry struct { | |||
started sync.Once |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's use atomic.Bool
instead to save us some indentations.
lnd.go
Outdated
@@ -675,10 +675,28 @@ func Main(cfg *Config, lisCfg ListenerCfg, implCfg *ImplementationCfg, | |||
|
|||
// With all the relevant chains initialized, we can finally start the | |||
// server itself. | |||
if err := server.Start(); err != nil { | |||
errChan := make(chan error) | |||
go server.Start(errChan) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the alternative would be, seems easier to follow,
go func() {
errChan <- server.Start()
}()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes looks better, changed.
Make sure that each subsystem only starts and stop once. This makes sure we don't close e.g. quit channels twice.
8c831e5
to
0ac6836
Compare
This commit does two things. It starts up the server in a way that it can be interrupted and shutdown gracefully. Moreover it makes sure that subsystems clean themselves up when they fail to start. This makes sure that depending subsytems can shutdown gracefully as well and the shutdown process is not stuck.
0ac6836
to
5508ca3
Compare
|
||
s.txPublisher.Stop() | ||
|
||
if err := s.txPublisher.Stop(); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yyforyongyu not sure but I forgot why the txpublisher had no error return ? Added one now hmm
Let's see whether all the itests pass after the change to error out when a start/stop is called twice. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good, just a few nits and needs a rebase - think there's a new subserver added, we may need to change that here too.
log.Info("ChannelEventStore starting") | ||
log.Info("ChannelEventStore starting...") | ||
|
||
if c.started.Swap(true) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice TILed this nice Swap
method.
@@ -198,13 +207,18 @@ func (c *ChannelEventStore) Start() error { | |||
cancel: cancel, | |||
}) | |||
|
|||
log.Info("ChannelEventStore started") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: could use defer
instead
@@ -213,6 +227,10 @@ func (c *ChannelEventStore) Stop() { | |||
// Stop the ticker after the goroutine reading from it has exited, to | |||
// avoid a race. | |||
c.cfg.FlapCountTicker.Stop() | |||
|
|||
log.Infof("ChannelEventStore shutdown complete") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I usually do an info
level on starting/stopping
, then a debug
level log on started/stopped
to prevent too many logs - not sure if it really helps since restart should be infrequent.
} | ||
// Start InvoiceExpiryWatcher and prepopulate it with existing | ||
// active invoices. | ||
err = i.expiryWatcher.Start( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
|
||
if err := s.chanRouter.Start(); err != nil { | ||
// The authGossiper depends on the chanRouter and therefore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we cherry-pick this change to a new commit - seems more like a potential bug fix.
@@ -704,7 +704,6 @@ func (t *TxPublisher) Stop() error { | |||
log.Info("TxPublisher stopped") | |||
|
|||
return nil | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🙏
|
||
======= |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😢
Fixes #8489EDIT: Fixes #8721
So in the above linked issue, the channel graph could not be synced correctly so the ChanRouter:
...
so the 34 query failed and therefore the startup of the chanrouter failed as well.
We fail here and never call the
Stop
function of the channel router.https://github.com/lightningnetwork/lnd/blob/master/routing/router.go#L628
When cleaning up all the other subsystems we get stuck however:
because we don't close the quit channel of the channel router and therefore the
Authenticated Gossiper
cannot stop as well so the cleanup process is stuck holding up the shutdown of all subsystems, causing some sideeffects because other subsystems are still running.Goroutine Dump:
So we need to think how to prevent those situations, because I think we don't close the
quit channel
for almost all subsystems when the start fails.