Don't shutdown lit on accounts service critical error, and register to status server #642

ViktorTigerstrom · 2023-09-21T01:11:46Z

This PR is based on #541

In this PR we ensure that if the accounts service hits a critical error, we don't shut down LiT, but instead log the error and reject any new requests to the accounts service.

We also add the accounts service to the status manager, and use the status from the status manager to determine if we should allow gRPC requests for the accounts service through or not.

Finally, we also add a config option to allow users to disable the accounts service, which ensures that the accounts service isn't started if the config option is set to disable.

TODO:

add tests
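
A minimal sketch of the gating described above, where the status of the accounts service decides whether a request is let through. The statusManager type, its SetStopped method, the gate helper and errNotRunning are hypothetical stand-ins, not the actual LiT types; only the idea of checking a per-sub-server status before serving a request comes from the description.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errNotRunning is a hypothetical error returned when a request targets a
// sub-server that is disabled or has stopped after a critical error.
var errNotRunning = errors.New("sub-server is not running")

// statusManager is a simplified, hypothetical stand-in for the status
// manager. It only tracks whether each named sub-server is running.
type statusManager struct {
	mu      sync.RWMutex
	running map[string]bool
}

func newStatusManager() *statusManager {
	return &statusManager{running: make(map[string]bool)}
}

// SetRunning marks the named sub-server as running.
func (m *statusManager) SetRunning(name string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.running[name] = true
}

// SetStopped marks the named sub-server as no longer running.
func (m *statusManager) SetStopped(name string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.running[name] = false
}

// IsRunning reports whether the named sub-server is currently running.
// Sub-servers that were never registered count as not running.
func (m *statusManager) IsRunning(name string) bool {
	m.mu.RLock()
	defer m.mu.RUnlock()
	return m.running[name]
}

// gate wraps a request handler so that it only executes while the named
// sub-server is reported as running.
func gate(m *statusManager, name string,
	handler func() (string, error)) func() (string, error) {

	return func() (string, error) {
		if !m.IsRunning(name) {
			return "", fmt.Errorf("%s: %w", name, errNotRunning)
		}
		return handler()
	}
}

func main() {
	mgr := newStatusManager()
	listAccounts := gate(mgr, "accounts", func() (string, error) {
		return "[]", nil
	})

	// Disabled via config or not yet started: the request is rejected.
	_, err := listAccounts()
	fmt.Println("before start:", err)

	mgr.SetRunning("accounts")
	resp, err := listAccounts()
	fmt.Println("running:", resp, err)

	// After a critical error only this sub-server stops serving requests.
	mgr.SetStopped("accounts")
	_, err = listAccounts()
	fmt.Println("after critical error:", err)
}
```

In this sketch a service that is disabled through configuration and one that stopped after a critical error look the same to the caller: both are simply never (or no longer) marked as running.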

ViktorTigerstrom · 2023-09-21T01:18:51Z

terminal.go

-					"keeping litd running: %v", err,
+			case errFunc := <-g.nonCriticalErrQueue.ChanOut():
+				serviceName, err := errFunc()
+				if serviceName == accounts.ACCOUNTS_SERVICE {


I wasn't really sure how to do this in a clean way, to be able to tell which subsystem the error is sent from. I'm not a big fan of the channel returning a func() (string, error), but couldn't figure out a better solution. Let me know if you have a better alternative!

I guess one alternative would be to make it a queue.ConcurrentQueue[error] and hand it to the accounts service only and no other service. But that kind of defeats the purpose of having a queue in the first place.

Yeah, this is why I think it would be cool to have a LiT subsystem manager type thing which would handle receiving errors from the various subsystems (just accounts for now), setting the status correctly, logging, etc. Then you would not need this workaround.
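
One way the subsystem manager idea could look, as a rough sketch only. The subsystemError and subsystemManager types and their methods are hypothetical and were not part of this PR; the point is that a typed message can carry the subsystem name, so the receiver does not need a func() (string, error) to find out where an error came from.

```go
package main

import (
	"errors"
	"fmt"
)

// errDemo is a stand-in for a critical error raised by a subsystem.
var errDemo = errors.New("demo failure")

// subsystemError is a hypothetical message type: it names the subsystem that
// failed, so the receiver does not have to call a function to find out.
type subsystemError struct {
	Subsystem string
	Err       error
}

// subsystemManager is a hypothetical manager that collects errors from all
// subsystems and records a status for each, instead of shutting down litd.
type subsystemManager struct {
	errChan chan subsystemError
	status  map[string]string
}

func newSubsystemManager() *subsystemManager {
	return &subsystemManager{
		// Buffered so that a report does not block in this small demo.
		errChan: make(chan subsystemError, 1),
		status:  make(map[string]string),
	}
}

// Report is what a subsystem would call when it hits a critical error.
func (m *subsystemManager) Report(subsystem string, err error) {
	m.errChan <- subsystemError{Subsystem: subsystem, Err: err}
}

// handleNext waits for one reported error and marks that subsystem as
// stopped.
func (m *subsystemManager) handleNext() {
	e := <-m.errChan
	m.status[e.Subsystem] = fmt.Sprintf("stopped: %v", e.Err)
}

func main() {
	mgr := newSubsystemManager()
	mgr.Report("accounts", errDemo)
	mgr.handleNext()
	fmt.Println("accounts status:", mgr.status["accounts"])
}
```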

ellemouton

Leaving some comments for now - just want to gauge what the consensus of the best direction is.

I think ideally we want things to be as generic as possible so that this pattern can easily be applied to other LiT sub-systems

accounts/rpcserver.go

ellemouton · 2023-09-21T07:04:14Z

accounts/service.go

+			s.Lock()
+			s.disable()
+			s.Unlock()


I suggest letting disable handle the locking and unlocking of the mutex itself. And then you can have another disableUnsafe method which does not grab the mutex so that you can use that one in places where the mutex is already acquired.

func (s *InterceptorService) disable() {
	s.Lock()
	defer s.Unlock()

	s.disableUnsafe()
}

func (s *InterceptorService) disableUnsafe() {
	s.isEnabled = false
}

Also, it's a bit of a weird pattern with the defer and returnErr (pretty cool idea in general though!).
I would make it a bit more explicit by just replacing any return fmt.Errorf() with return disableAndErrorf(...):

disableAndErrorf := func(format string, a ...any) error {
	s.disable()

	return fmt.Errorf(format, a...)
}

Ok got it!
The reason I added this pattern was that I was a bit afraid that people would forget to disable the accounts service when one of the functions that needs to disable the service returns an error.
I've pushed this fix as separate fixup commits for now, but will squash them if you think it looks good :)
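
For comparison, a self-contained sketch of the two patterns discussed in this thread: the defer with a named returnErr, and the explicit disableAndErrorf helper. The service type and the startWithDefer and startExplicit functions are hypothetical; only disable, disableUnsafe and disableAndErrorf follow the snippets above.

```go
package main

import (
	"fmt"
	"sync"
)

// service is a hypothetical, stripped-down stand-in for the accounts service.
type service struct {
	sync.RWMutex
	isEnabled bool
}

// disable grabs the lock itself and must not be called with the lock held.
func (s *service) disable() {
	s.Lock()
	defer s.Unlock()

	s.disableUnsafe()
}

// disableUnsafe requires that the caller already holds the write lock.
func (s *service) disableUnsafe() {
	s.isEnabled = false
}

// enabled reports the current state under a read lock.
func (s *service) enabled() bool {
	s.RLock()
	defer s.RUnlock()

	return s.isEnabled
}

// startWithDefer uses a named return value and a deferred check, so every
// error return disables the service without the author having to remember it.
func (s *service) startWithDefer(fail bool) (returnErr error) {
	defer func() {
		if returnErr != nil {
			s.disable()
		}
	}()

	if fail {
		return fmt.Errorf("unable to start: %s", "store unavailable")
	}

	return nil
}

// startExplicit makes the disabling visible at each return site instead.
func (s *service) startExplicit(fail bool) error {
	disableAndErrorf := func(format string, a ...any) error {
		s.disable()

		return fmt.Errorf(format, a...)
	}

	if fail {
		return disableAndErrorf("unable to start: %s", "store unavailable")
	}

	return nil
}

func main() {
	a := &service{isEnabled: true}
	fmt.Println(a.startWithDefer(true), a.enabled())

	b := &service{isEnabled: true}
	fmt.Println(b.startExplicit(true), b.enabled())
}
```

Both variants leave the service disabled after a failure. The trade-off is the one raised in the thread: the deferred version cannot be forgotten on a new error path, while the explicit version is easier to read.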

accounts/service.go

accounts/rpcserver.go

ellemouton · 2023-09-21T07:23:46Z

accounts/service.go

+//
+// NOTE: The store lock MUST be held as either a read or write lock when calling
+// this method.
+func (s *InterceptorService) requireRunning() error {


I think we can just use the IsRunning() method instead of this

Left this out, due to the discussion below :)

which discussion?

Why not just have something like this?

func (s *Inter...) isRunning() bool { return s.Enabled }

and then use this throughout?

if !isRunning() {
	return ErrAccountServiceDisabled
}

to me, returning an error from a function indicates failure of the function itself.

Ah, sorry. I meant the discussion about needing to check whether the service is running under the same lock, rather than dropping the lock and then reacquiring it. We still need two different versions of this function, which is why I kept this function before.

But I agree that it's confusing with the error, so I've changed the PR to instead include an isRunningUnsafe function that doesn't acquire the lock, and use that function where it's needed :)
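
A minimal sketch of the resulting pair of functions, assuming an embedded sync.RWMutex. The service type, the credit method and errServiceDisabled are hypothetical; isRunning and isRunningUnsafe follow the names used in this thread.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errServiceDisabled is a hypothetical error for a disabled service.
var errServiceDisabled = errors.New("account service is disabled")

// service is a hypothetical, stripped-down stand-in for the accounts service.
type service struct {
	sync.RWMutex
	isEnabled bool
	balance   int64
}

// isRunning acquires the read lock itself, for callers that hold no lock.
func (s *service) isRunning() bool {
	s.RLock()
	defer s.RUnlock()

	return s.isRunningUnsafe()
}

// isRunningUnsafe must only be called while the lock is already held.
func (s *service) isRunningUnsafe() bool {
	return s.isEnabled
}

// credit checks the state and updates the balance under a single write lock,
// so the service cannot be disabled between the check and the update.
func (s *service) credit(amt int64) error {
	s.Lock()
	defer s.Unlock()

	if !s.isRunningUnsafe() {
		return errServiceDisabled
	}
	s.balance += amt

	return nil
}

func main() {
	s := &service{isEnabled: true}
	fmt.Println(s.credit(100), s.balance, s.isRunning())

	s.Lock()
	s.isEnabled = false
	s.Unlock()

	fmt.Println(s.credit(100), s.balance, s.isRunning())
}
```

Calling isRunning from inside credit would deadlock, because sync.RWMutex is not reentrant. That is why the lock-free variant is needed in addition to the locking one.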

ellemouton · 2023-09-21T07:29:35Z

terminal.go

-					"keeping litd running: %v", err,
+			case errFunc := <-g.nonCriticalErrQueue.ChanOut():
+				serviceName, err := errFunc()
+				if serviceName == accounts.ACCOUNTS_SERVICE {


Yeah, this is why I think it would be cool to have a LiT subsystem manager type thing which would handle receiving errors from the various subsystems (just accounts for now), setting the status correctly, logging, etc. Then you would not need this workaround.

ellemouton · 2023-09-21T07:31:14Z

terminal.go

+	// is disabled, we still want to mark that the service should be stopped
+	// so that we  can properly shut it down (canceling the ctx, and the
+	// accounts store) in the shutdownSubServers call.
+	g.accountServiceShouldBeStopped = true


do we need it to be stopped if the accounts mode is disabled though?

I also think the previous name for the variable is fine btw

Hmm yeah agree that this is just strange. Perhaps we should just close everything in the accountService immediately if the service is disabled :)!

Added that we now stop the account service directly if it is disabled instead!

ViktorTigerstrom · 2023-09-21T10:27:43Z

Thanks for the review @ellemouton!
Posting a general comment here to respond to what much of the feedback relates to :).

The main issues I see with just having one outer check that decides whether a request should be allowed through based on whether the service is running, and then not checking in any of the internal functions in the service itself, are:

1. The internal functions in the accounts system are not only called through gRPC requests. We are also subscribed to the lndclient.LightningClient, which will execute invoiceUpdate when an accounts invoice is added or paid, and to the lndclient.RouterClient, which will execute paymentUpdate when a tracked payment is updated. It's important that these functions don't get executed if the service has already been disabled, especially invoiceUpdate, as we'd then risk the currentAddIndex & currentSettleIndex going out of sync, leading to missed added or settled invoices.
2. The service can shut down while a request that has been allowed through is being executed. I.e. if an account makes a request to send a payment, the service can be Running when the request is made, but by the time the payment has been sent and we handle the SendPaymentV2 response in TrackPayment, the service might have been shut down, for example through an invoice update that came in while the user's request was being processed.

This is also why it's very important that we often disable the service under the same lock, in the function execution that actually caused the error. Imagine the following scenario:
We get 2 invoice settle updates that settle two different invoices within a very short timespan. The first invoice update acquires the lock, and the second update attempts to grab the lock in another thread. In that scenario it's super important that we disable the service under the same lock that the first invoice update holds, if that update fails. If we instead dropped the lock and reacquired it to disable the service, we'd risk the second invoice update grabbing the lock before we've managed to disable the service. If the second invoice update then succeeds, we'd end up with the currentSettleIndex out of sync, which would mean that we'd never credit the account for the first invoice update.
So if we remove the requireRunning function, we need to add a version of IsRunning that doesn't acquire the lock, i.e. an UnsafeIsRunning function.

Now, it's clearest why we'd need to check that the service is still running before we execute the invoiceUpdate function, and we may not need to do that for any other function. But I didn't really know where to draw the line for which functions should require that the service is running, so I settled on functions that update the balance of an account. This is because it feels strange that, in a scenario where the service was disabled while a send payment request by an account was being executed, but before the SendPaymentV2 response from LND was received, we'd still successfully handle the payment update. That is, we'd successfully execute TrackPayment, which adds the subscription to the lndclient.RouterClient, which in turn would execute paymentUpdate once the payment has been updated. It just feels strange that all of those calls would successfully execute while the service was disabled, which is why I drew the line at requiring that the service is running for any function that updates the user balance.
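
The two-invoice scenario from this comment can be sketched as follows. This is a simplified illustration with hypothetical types (service, invoiceUpdate as a struct, errDisabled, errStore), not the actual accounts code; it only shows why the service is disabled before the lock is released.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var (
	// errDisabled and errStore are hypothetical errors for this sketch.
	errDisabled = errors.New("account service is disabled")
	errStore    = errors.New("store update failed")
)

// invoiceUpdate is a hypothetical settle notification.
type invoiceUpdate struct {
	settleIndex uint64
	failStore   bool // simulates a critical error while persisting
}

// service is a hypothetical, stripped-down stand-in for the accounts service.
type service struct {
	mu                 sync.Mutex
	enabled            bool
	currentSettleIndex uint64
}

// invoiceUpdate handles one settle notification under the service lock.
func (s *service) invoiceUpdate(u invoiceUpdate) error {
	s.mu.Lock()
	defer s.mu.Unlock()

	if !s.enabled {
		return errDisabled
	}

	if u.failStore {
		// Disable before the lock is released. An update that is already
		// waiting for the lock will then see the disabled state and cannot
		// advance the settle index past the failed invoice.
		s.enabled = false

		return errStore
	}

	s.currentSettleIndex = u.settleIndex

	return nil
}

func main() {
	s := &service{enabled: true}

	// The first update fails, the second would have succeeded.
	fmt.Println("update 1:", s.invoiceUpdate(invoiceUpdate{1, true}))
	fmt.Println("update 2:", s.invoiceUpdate(invoiceUpdate{2, false}))
	fmt.Println("settle index:", s.currentSettleIndex)
}
```

If the failing update instead released the lock and reacquired it to disable the service, the second update could run in between and move the index to 2, which is the out-of-sync state described above.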

ellemouton · 2023-09-21T15:37:12Z

ok cool! thanks for the explanation @ViktorTigerstrom - that makes sense. I didn't think about point 1.

guggero

Did a first pass. Great work! Looks pretty good, but have a few suggestions to make the diff even smaller.

accounts/interface.go

guggero · 2023-09-22T09:12:42Z

accounts/service.go

+			s.Lock()
+			s.disable()
+			s.Unlock()


Also, it's a bit of a weird pattern with the defer and returnErr (pretty cool idea in general though!).
I would make it a bit more explicit by just replacing any return fmt.Errorf() with return disableAndErrorf(...):

disableAndErrorf := func(format string, a ...any) error {
	s.disable()

	return fmt.Errorf(format, a...)
}

terminal.go

accounts/checkers.go

guggero · 2023-09-22T09:22:34Z

accounts/service.go

-	account.Payments[hash] = &PaymentEntry{
-		Status:     lnrpc.Payment_UNKNOWN,
-		FullAmount: fullAmt,
+	if !ok {


I'm wondering if we could ever end up here with an all-zero payment hash (e.g. because it wasn't set in one of the calls). Not sure if we should check for that at least before we associate the payment?

Hmm, I'm not completely sure that I understood this correctly. I think this should be fixed with the fix for this other feedback:
#642 (comment)

But please double check whether I misunderstood what you meant with this comment, and let me know if I need to fix something else :)
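
For illustration, the kind of guard the question above refers to could look like this. It is a hypothetical sketch: the paymentHash, account and associatePayment names are made up here, and the thread does not confirm that such a check was added.

```go
package main

import (
	"errors"
	"fmt"
)

// paymentHash is a hypothetical stand-in for a 32-byte payment hash.
type paymentHash [32]byte

// zeroHash is the all-zero value a hash has when it was never set.
var zeroHash paymentHash

// errZeroHash is a hypothetical error for an unset payment hash.
var errZeroHash = errors.New("payment hash is all zeroes")

// account is a hypothetical account that maps payment hashes to amounts.
type account struct {
	payments map[paymentHash]int64
}

// associatePayment links a payment to the account, but refuses a hash that
// was never set. Arrays are comparable in Go, so == checks all 32 bytes.
func (a *account) associatePayment(hash paymentHash, fullAmt int64) error {
	if hash == zeroHash {
		return errZeroHash
	}

	// Only add the entry if the payment is not already known.
	if _, ok := a.payments[hash]; !ok {
		a.payments[hash] = fullAmt
	}

	return nil
}

func main() {
	acct := &account{payments: make(map[paymentHash]int64)}

	fmt.Println("unset hash:", acct.associatePayment(paymentHash{}, 1000))
	fmt.Println("real hash:", acct.associatePayment(paymentHash{0x01}, 1000))
	fmt.Println("payments:", len(acct.payments))
}
```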

accounts/service.go

terminal.go

ViktorTigerstrom · 2023-09-22T10:51:07Z

Thanks a lot for the reviews @ellemouton & @guggero! Working on addressing them.

Though I'd like some more feedback on how we should surface the errors to the status server, so would just like to get consensus on that. Thanks for the suggestion on using a callback instead @guggero, will definitely change to that! Though would like to know if we should introduce the sub system server like @ellemouton suggested, to make it more generalizable for future sub systems we might add, or if we should leave that for now and do that later if needed? I think we will still need the internal RequireRunning checks either way though.

ellemouton · 2023-09-22T11:10:45Z

would like to know if we should introduce the sub system server like @ellemouton suggested, to make it more generalizable for future sub systems we might add, or if we should leave that for now and do that later if needed?

I'd say that if we are on a bit of a time limit with this rn, then we can worry about making things more generalisable in a follow up 👍 Since it sounds like that would take some design work

guggero · 2023-09-22T11:11:28Z

would like to know if we should introduce the sub system server like @ellemouton suggested, to make it more generalizable for future sub systems we might add, or if we should leave that for now and do that later if needed?

I'd say that if we are on a bit of a time limit with this rn, then we can worry about making things more generalisable in a follow up 👍 Since it sounds like that would take some design work

Was about to suggest the same! Even though turning everything into subservers sounds great, it might be overkill for this PR. So let's see what the diff looks like with just the callback for now.

ViktorTigerstrom · 2023-09-22T11:57:49Z

Thanks! Will leave it out for now then, and then we can address it in a follow-up later :)
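
A rough sketch of the callback approach agreed on here, as opposed to an error queue. The service type, newService and handleCriticalErr are hypothetical; mainErrCallback is the name used later in this thread, but its real signature is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errDemo is a stand-in for a critical error inside the accounts service.
var errDemo = errors.New("demo failure")

// service is a hypothetical, stripped-down stand-in for the accounts service.
type service struct {
	mu      sync.Mutex
	enabled bool

	// mainErrCallback is invoked when the service hits a critical error.
	// The func(error) signature is an assumption made for this sketch.
	mainErrCallback func(error)
}

func newService(cb func(error)) *service {
	return &service{enabled: true, mainErrCallback: cb}
}

// handleCriticalErr disables the service and then notifies the owner.
func (s *service) handleCriticalErr(err error) {
	s.mu.Lock()
	s.enabled = false
	s.mu.Unlock()

	// Call the callback outside the lock, so that it may safely call back
	// into the service without deadlocking.
	s.mainErrCallback(err)
}

func main() {
	status := "running"

	// The owner (litd in the PR) decides what a critical error means. Here
	// it only records a status and keeps the process alive.
	svc := newService(func(err error) {
		status = fmt.Sprintf("stopped: %v", err)
	})

	svc.handleCriticalErr(errDemo)
	fmt.Println("accounts status:", status)
}
```

Because the callback is registered per service, the receiver already knows which subsystem the error came from, which removes the need for the func() (string, error) workaround discussed earlier.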

ViktorTigerstrom · 2023-09-25T13:27:52Z

Addressed the latest feedback, rebased on main now that the status server PR has been merged, and added unit tests!

ViktorTigerstrom · 2023-09-25T13:45:16Z

Hmm sorry, looking into addressing the unit-race errors.

In terms of itests, I'm looking into adding an itest that triggers the mainErrCallback, to shut down the accounts service while it's not disabled. Unfortunately I think that's going to be a bit hard, as the litd process is started from the binary and not through the struct. So in case I can't find a way, I'll just add an itest for disabling through configuration.

check commits is expected to fail until the commits are squashed.

ViktorTigerstrom · 2023-09-25T22:48:00Z

Fixed CI errors! check commits is expected to fail until the PR has been squashed.

ViktorTigerstrom · 2023-09-27T00:40:28Z

Updated the itests to test that we can disable the accounts system through configuration.
Also updated the commit that adds that config option a little, as I noticed some bugs in it when making the itest.

guggero

Looks good! Can you please apply the fixup commits and I'll do a final review on the completed diff?

accounts/service.go

guggero · 2023-09-27T08:10:42Z

terminal.go

 	} else {
-		g.statusMgr.SetRunning(subservers.ACCOUNTS)
+		stopAccountService()


This else seems to be wrong... Why would we stop the service if we haven't started it?

So agree that this is a bit confusing. The reason why we need to do this is that s.accountService.Stop() closes the contexts and db store which are opened when we create the account service with accounts.NewService in the beginning of the LightningTerminal.start function. This is done regardless of if the accounts service is disabled or if we fail to start it, so we need to close the contexts and db store either way, so this is why we run the stopAccountService function either way.

I changed the name of the stopAccountService function to closeAccountService to better represent what it actually does to make it more understandable. Let me know though if you have a suggestion of how I could do this better :)
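
The reasoning above can be sketched like this: resources are opened in the constructor, so they have to be closed on every path where the service does not end up running. All types and the startAccounts helper are hypothetical; only the closeAccountService name comes from the comment.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// store is a hypothetical stand-in for the accounts DB store.
type store struct{ closed bool }

// accountService is a hypothetical, stripped-down accounts service.
type accountService struct {
	ctx    context.Context
	cancel context.CancelFunc
	store  *store
}

// newAccountService opens the context and the store right away, before it is
// known whether the service will be started at all.
func newAccountService() *accountService {
	ctx, cancel := context.WithCancel(context.Background())

	return &accountService{ctx: ctx, cancel: cancel, store: &store{}}
}

// Start simulates starting the service and optionally failing.
func (s *accountService) Start(fail bool) error {
	if fail {
		return errors.New("unable to start accounts service")
	}

	return nil
}

// Stop releases what the constructor opened.
func (s *accountService) Stop() {
	s.cancel()
	s.store.closed = true
}

// startAccounts reports whether the service ended up running. On both the
// disabled path and the failed-start path the resources are closed.
func startAccounts(s *accountService, disabled, failStart bool) bool {
	closeAccountService := func() { s.Stop() }

	if disabled {
		closeAccountService()
		return false
	}

	if err := s.Start(failStart); err != nil {
		closeAccountService()
		return false
	}

	return true
}

func main() {
	disabled := newAccountService()
	fmt.Println(startAccounts(disabled, true, false), disabled.store.closed)

	running := newAccountService()
	fmt.Println(startAccounts(running, false, false), running.store.closed)
}
```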

itest/litd_mode_integrated_test.go

ViktorTigerstrom · 2023-09-27T08:59:59Z

Thanks for the review @guggero! Addressed the feedback and squashed the commits :)!

Snap though, I notice that the check commits check still fails, so will address that now.

ViktorTigerstrom · 2023-09-27T09:26:47Z

Fixed check commits CI failure :)

guggero

Looks great, thanks a lot! Only nits left, LGTM 🎉

accounts/service.go

terminal.go

ViktorTigerstrom · 2023-09-27T13:23:23Z

Thanks @guggero 🎉!! Addressed your latest feedback with the last push :)

ViktorTigerstrom · 2023-09-27T22:38:40Z

Added a small fix that ensures that the node runner can't stop the accounts service while a request by an account user is being processed.
This is especially important to ensure that we don't stop the service exactly after a user has made an rpc call to send a payment we can't know the payment hash for prior to the actual payment being sent (i.e. Keysend or SendToRoute). This is because if we stop the service after the send request has been sent to lnd, but before TrackPayment has been called, we won't be able to track the payment and debit the account.
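
One common way to get this guarantee is a read/write lock: every request holds the read lock for its full duration, and stopping takes the write lock. The sketch below assumes that approach and uses hypothetical names (service, handleSend, requestMtx); the comment above does not say how the fix is actually implemented.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errStopped is a hypothetical error for requests that arrive after Stop.
var errStopped = errors.New("account service is stopped")

// service is a hypothetical, stripped-down stand-in for the accounts service.
type service struct {
	// requestMtx is read-locked for the full duration of each request and
	// write-locked by Stop, so Stop waits for in-flight requests.
	requestMtx sync.RWMutex
	stopped    bool

	trackMtx sync.Mutex
	tracked  []string
}

// handleSend sends a payment and then starts tracking it. The hash is only
// used after sendToLnd returns, mirroring Keysend or SendToRoute payments.
func (s *service) handleSend(hash string, sendToLnd func()) error {
	s.requestMtx.RLock()
	defer s.requestMtx.RUnlock()

	if s.stopped {
		return errStopped
	}

	sendToLnd()

	s.trackMtx.Lock()
	s.tracked = append(s.tracked, hash)
	s.trackMtx.Unlock()

	return nil
}

// Stop blocks until no request is in flight and then marks the service as
// stopped.
func (s *service) Stop() {
	s.requestMtx.Lock()
	defer s.requestMtx.Unlock()

	s.stopped = true
}

func main() {
	s := &service{}
	inFlight := make(chan struct{})
	release := make(chan struct{})
	done := make(chan error, 1)

	go func() {
		done <- s.handleSend("hash-1", func() {
			close(inFlight) // the payment is now "sent to lnd"
			<-release
		})
	}()
	<-inFlight

	// Stop is requested while the payment is in flight.
	stopDone := make(chan struct{})
	go func() {
		s.Stop()
		close(stopDone)
	}()

	close(release)
	fmt.Println("in-flight request:", <-done)
	<-stopDone

	fmt.Println("tracked payments:", s.tracked)
	fmt.Println("new request:", s.handleSend("hash-2", func() {}))
}
```

The in-flight payment is still tracked because Stop cannot take the write lock until the request releases its read lock. Requests that arrive afterwards are rejected.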

ellemouton

Looks good! Only nits/style comments from my side.

The only one I think we should maybe defs address here is converting the new "accounts-mode" option to an "Accounts.Disable" boolean instead?

ellemouton · 2023-09-28T06:31:37Z

accounts/service.go

+//
+// NOTE: The store lock MUST be held as either a read or write lock when calling
+// this method.
+func (s *InterceptorService) requireRunning() error {


which discussion?

Why not just have something like this?

func (s *Inter...) isRunning() bool { return s.Enabled }

and then use this throughout?

if !isRunning() {
	return ErrAccountServiceDisabled
}

to me, returning an error from a function indicates failure of the function itself.

accounts/service.go

accounts/interceptor.go

accounts/service.go

config.go

terminal.go

Ensure that we don't stop the service while we're processing a request. This is especially important to ensure that we don't stop the service exactly after a user has made an rpc call to send a payment we can't know the payment hash for prior to the actual payment being sent (i.e. Keysend or SendToRoute). This is because if we stop the service after the send request has been sent to lnd, but before TrackPayment has been called, we won't be able to track the payment and debit the account.

Add the accounts service to status manager. This will allow us to query the status of the accounts service and see if it is running or not. For incoming gRPC requests to the accounts service, we also use the status manager to check if the accounts service is running or not to determine if we should let the request through or not.

ViktorTigerstrom · 2023-09-28T11:47:40Z

Thanks a lot for the review @ellemouton 🎉 🚀!!

Addressed the latest feedback, and left replies on the comments that weren't addressed :)

ViktorTigerstrom requested review from ellemouton and guggero September 21, 2023 01:19

ellemouton reviewed Sep 21, 2023

levmi assigned ViktorTigerstrom Sep 21, 2023

guggero reviewed Sep 22, 2023

ViktorTigerstrom force-pushed the 2023-09-dont-stop-lit-on-account-system-error branch from 1d3a4b0 to 1a40522 Compare September 25, 2023 13:27

guggero self-requested a review September 25, 2023 13:29

ViktorTigerstrom force-pushed the 2023-09-dont-stop-lit-on-account-system-error branch 5 times, most recently from ceffb37 to acbe1dd Compare September 25, 2023 22:36

ViktorTigerstrom requested a review from ellemouton September 25, 2023 22:48

ViktorTigerstrom force-pushed the 2023-09-dont-stop-lit-on-account-system-error branch 3 times, most recently from 7d3af76 to 56bd496 Compare September 27, 2023 00:37

guggero reviewed Sep 27, 2023

ViktorTigerstrom force-pushed the 2023-09-dont-stop-lit-on-account-system-error branch from 56bd496 to f23a2d2 Compare September 27, 2023 08:45

ViktorTigerstrom requested a review from guggero September 27, 2023 09:02

ViktorTigerstrom force-pushed the 2023-09-dont-stop-lit-on-account-system-error branch from f23a2d2 to 67c80de Compare September 27, 2023 09:26

guggero approved these changes Sep 27, 2023

ViktorTigerstrom force-pushed the 2023-09-dont-stop-lit-on-account-system-error branch 2 times, most recently from 4c81420 to aafd459 Compare September 27, 2023 13:20

ViktorTigerstrom force-pushed the 2023-09-dont-stop-lit-on-account-system-error branch from aafd459 to 841aa33 Compare September 27, 2023 22:36

ellemouton approved these changes Sep 28, 2023

ViktorTigerstrom added 8 commits September 28, 2023 12:58

accounts: disallow requests after critical errors

a49422f

accounts: add service disabled unit tests

4696c47

terminal: don't stop litd on account system error

32b2a58

accounts: associate payments before sending them

666f19c

terminal: add disable accounts service cfg option

20fa2c0

itest: add disable test for accounts endpoint

294e9d0

ViktorTigerstrom force-pushed the 2023-09-dont-stop-lit-on-account-system-error branch from 841aa33 to 294e9d0 Compare September 28, 2023 11:10

ellemouton merged commit 6ed39b0 into lightninglabs:master Sep 28, 2023
12 checks passed

levmi mentioned this pull request Sep 28, 2023

Terminal server shuts down from hash error #632

Open
