Don't shutdown lit on accounts service critical error, and register to status server #642

Conversation

ViktorTigerstrom
Contributor

@ViktorTigerstrom ViktorTigerstrom commented Sep 21, 2023

This PR is based on #541

In this PR we ensure that if the accounts service critically errors, we won't shut down LiT, but instead log the error and reject any new requests to the accounts service.

We also add the accounts service to the status manager, and use the status from the status manager to determine if we should allow gRPC requests for the accounts service through or not.

Finally, we also add a config option to allow users to disable the accounts service, which ensures that the accounts service isn't started if the config option is set to disable.

TODO:

  • add tests
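
As a rough illustration of the status-based gating described above, the sketch below rejects requests once a subsystem's status flips to not running, without stopping the process. The `statusManager` type, its methods, and `checkRequest` are hypothetical stand-ins for illustration only and are not LiT's actual status manager API.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errServiceNotRunning is a hypothetical error returned for rejected requests.
var errServiceNotRunning = errors.New("accounts service is not running")

// statusManager is a hypothetical, simplified stand-in for a status manager
// that tracks whether each named subsystem is running.
type statusManager struct {
	mu      sync.RWMutex
	running map[string]bool
}

func newStatusManager() *statusManager {
	return &statusManager{running: make(map[string]bool)}
}

// setRunning marks the subsystem as running.
func (m *statusManager) setRunning(name string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.running[name] = true
}

// setErrored marks the subsystem as no longer running after a critical error.
func (m *statusManager) setErrored(name string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.running[name] = false
}

// isRunning reports whether the subsystem is currently running.
func (m *statusManager) isRunning(name string) bool {
	m.mu.RLock()
	defer m.mu.RUnlock()
	return m.running[name]
}

// checkRequest decides whether a request for the given subsystem may pass.
func checkRequest(m *statusManager, name string) error {
	if !m.isRunning(name) {
		return errServiceNotRunning
	}
	return nil
}

func main() {
	mgr := newStatusManager()
	mgr.setRunning("accounts")
	fmt.Println(checkRequest(mgr, "accounts")) // request allowed

	// A critical error in the accounts service only flips its status, so
	// new requests are rejected while the rest of the process keeps running.
	mgr.setErrored("accounts")
	fmt.Println(checkRequest(mgr, "accounts")) // request rejected
}
```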

terminal.go Outdated
"keeping litd running: %v", err,
case errFunc := <-g.nonCriticalErrQueue.ChanOut():
serviceName, err := errFunc()
if serviceName == accounts.ACCOUNTS_SERVICE {
Contributor Author

I wasn't really sure how to do this in a clean way, to be able to tell which subsystem the error is sent from. I'm not a big fan of the channel returning a func() (string, error), but couldn't figure out a better solution. Let me know if you have a better alternative!

I guess one alternative would be to make it a queue.ConcurrentQueue[error] queue that only the accounts service, and no other service, sends to. But that kind of defeats the purpose of having a queue in the first place.

Member

Yeah, this is why I think it would be cool to have a LiT subsystem manager type thing which will handle receiving errors from the various subsystems (just accounts for now), setting the status correctly, logging, etc. Then you would not need this workaround.

Member

@ellemouton ellemouton left a comment

Leaving some comments for now - just want to gauge what the consensus of the best direction is.

I think ideally we want things to be as generic as possible so that this pattern can easily be applied to other LiT sub-systems

accounts/rpcserver.go Outdated
Comment on lines 104 to 106
s.Lock()
s.disable()
s.Unlock()
Member

I suggest letting disable handle the locking and unlocking of the mutex itself. And then you can have another disableUnsafe method which does not grab the mutex so that you can use that one in places where the mutex is already acquired.

func (s *InterceptorService) disable() {
	s.Lock()
	defer s.Unlock()
	
	s.disableUnsafe()
}

func (s *InterceptorService) disableUnsafe() {
	s.isEnabled = false
}

Member

Also, it's a bit of a weird pattern with the defer and returnErr (pretty cool idea in general though!).
I would make it a bit more explicit by just replacing any return fmt.Errorf() with return disableAndErrorf(...):

disableAndErrorf := func (format string, a ...any) error {
    s.disable()
    return fmt.Errorf(format, a...)
}

Contributor Author

Also, it's a bit of a weird pattern with the defer and returnErr (pretty cool idea in general though!). I would make it a bit more explicit by just replacing any return fmt.Errorf() with return disableAndErrorf(...):

disableAndErrorf := func (format string, a ...any) error {
    s.disable()
    return fmt.Errorf(format, a...)
}

Ok got it!
The reason I added this pattern was that I was a bit afraid that people would forget to disable the account service when a function that needs to disable the service errors.
I've pushed this fix in separate fixup commits for now, but will squash them if you think it looks good :)

accounts/rpcserver.go Outdated
//
// NOTE: The store lock MUST be held as either a read or write lock when calling
// this method.
func (s *InterceptorService) requireRunning() error {
Member

I think we can just use the IsRunning() method instead of this

Contributor Author

Left this out, due to the discussion below :)

Member

which discussion?

Why not just have something like this?

func (s *Inter...) isRunning() bool {
      return s.Enabled
}

and then

use this throughout?

if !isRunning() {
    return ErrAccountServiceDisabled
}

to me, returning an error from a function indicates failure of the function itself.

Contributor Author

Ah, sorry. I meant the discussion about needing to check whether the service is running under the same lock, rather than dropping the lock and then reacquiring it. We therefore still need two different versions of this function, which is why I kept this function before.

But I agree that it's confusing with the error, so I've changed the PR to instead include an isRunningUnsafe function that doesn't acquire the lock, and use that function where it is needed :)
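
A minimal sketch of the locking/non-locking pair discussed in this thread might look as follows. The `service` type, its `enabled` field, `errDisabled`, and `updateBalance` are assumptions made for illustration; only the `isRunning`/`isRunningUnsafe` naming comes from the discussion.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errDisabled is a hypothetical error for a disabled service.
var errDisabled = errors.New("account service disabled")

// service is a hypothetical, stripped-down stand-in for InterceptorService.
type service struct {
	sync.RWMutex
	enabled bool
}

// isRunning acquires the read lock itself; use it when no lock is held.
func (s *service) isRunning() bool {
	s.RLock()
	defer s.RUnlock()

	return s.isRunningUnsafe()
}

// isRunningUnsafe must only be called while the lock is already held.
func (s *service) isRunningUnsafe() bool {
	return s.enabled
}

// updateBalance checks the status and performs its work under one lock, so
// the service can't be disabled between the check and the update.
func (s *service) updateBalance(balance *int64, delta int64) error {
	s.Lock()
	defer s.Unlock()

	if !s.isRunningUnsafe() {
		return errDisabled
	}
	*balance += delta

	return nil
}

func main() {
	s := &service{enabled: true}
	var balance int64
	fmt.Println(s.updateBalance(&balance, 100), balance)

	// Disable the service, then observe that further updates are rejected.
	s.Lock()
	s.enabled = false
	s.Unlock()
	fmt.Println(s.isRunning(), s.updateBalance(&balance, 50), balance)
}
```

The caller gets a plain boolean from the status check and decides which error to return, which matches the point that an error return should signal failure of the function itself.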

terminal.go Outdated
// is disabled, we still want to mark that the service should be stopped
// so that we can properly shut it down (canceling the ctx, and the
// accounts store) in the shutdownSubServers call.
g.accountServiceShouldBeStopped = true
Member

do we need it to be stopped if the accounts mode is disabled though?

I also think the previous name for the variable is fine btw

Contributor Author

Hmm yeah agree that this is just strange. Perhaps we should just close everything in the accountService immediately if the service is disabled :)!

Contributor Author

Added that we now stop the account service directly if it is disabled instead!

@ViktorTigerstrom
Contributor Author

ViktorTigerstrom commented Sep 21, 2023

Thanks for the review @ellemouton!
Posting a general comment here to respond to what much of the feedback relates to :).

The main issues I see with having just one outer check that decides whether a request should be allowed through based on whether the service is running, and then not checking in any of the internal functions in the server itself, are that:

  1. The internal functions in the accounts system are not only called through gRPC requests. We are also subscribed to the lndclient.LightningClient, which will execute invoiceUpdate when an accounts invoice is added or paid, and to the lndclient.RouterClient, which will execute paymentUpdate when a tracked payment is updated. It's important that these functions don't get executed if the service has already been disabled, especially invoiceUpdate, as we'd then risk the currentAddIndex & currentSettleIndex going out of sync, leading to missed added or settled invoices.
  2. The service can shut down while a request that has been allowed through is being executed. E.g. if an account makes a request to send a payment, the service can be running when the request is made, but by the time the payment has been sent and we handle the SendPaymentV2 response in TrackPayment, the service might have been shut down, for example by an invoice update that came in while the user's request was being processed.

This is also why it's very important that we often disable the service under the same lock, in the function execution that actually caused the error. Imagine the following scenario:
We get 2 invoice settle updates that settle two different invoices in a very short timespan. The first invoice update acquires the lock, and the second update attempts to grab the lock in another thread. In that scenario it's super important that, if the update for the first invoice fails, we disable the service under the same lock that the update was attempted under. If we instead dropped the lock and reacquired it to disable the service, we'd risk the second invoice update grabbing the lock before we've managed to disable the service. If the second invoice update then succeeds, we'd end up with the currentSettleIndex out of sync, which would mean that we'd never credit the account for the first invoice update.
So if we remove the requireRunning function, we need to add a version of the IsRunning function that doesn't acquire the lock, i.e. an UnsafeIsRunning function.
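
The scenario above can be sketched roughly as follows. This is a simplified model under stated assumptions, not the PR's actual code: the `service` type, the `credit` callback standing in for the store update, and both error values are hypothetical; only `invoiceUpdate`, `currentSettleIndex`, and the idea of an unsafe (non-locking) disable come from the discussion.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Hypothetical errors used only for this illustration.
var (
	errDisabled = errors.New("account service disabled")
	errStore    = errors.New("store failure")
)

// service is a hypothetical, simplified model of the accounts service.
type service struct {
	sync.Mutex
	enabled            bool
	currentSettleIndex uint64
}

// disableUnsafe must be called with the lock held.
func (s *service) disableUnsafe() {
	s.enabled = false
}

// invoiceUpdate processes a settled invoice. The credit callback stands in
// for the store update and may fail.
func (s *service) invoiceUpdate(settleIndex uint64, credit func() error) error {
	s.Lock()
	defer s.Unlock()

	if !s.enabled {
		return errDisabled
	}

	if err := credit(); err != nil {
		// Disable before the lock is released, so an update that is
		// waiting on the lock observes the disabled state and can't
		// advance the settle index past the failed invoice.
		s.disableUnsafe()

		return fmt.Errorf("crediting invoice %d: %w", settleIndex, err)
	}
	s.currentSettleIndex = settleIndex

	return nil
}

func main() {
	s := &service{enabled: true}

	// The first update fails; the second would have succeeded on its own.
	err1 := s.invoiceUpdate(1, func() error { return errStore })
	err2 := s.invoiceUpdate(2, func() error { return nil })

	fmt.Println(err1)
	fmt.Println(err2)
	fmt.Println(s.currentSettleIndex)
}
```

Because the second update is rejected, the settle index stays behind the failed invoice, so that invoice is not silently skipped.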

Now, it's clearest why we need to check that the service is running before we execute the invoiceUpdate function, and we may not need to do that for any other function. But I didn't really know where to draw the line for when to require that the service is running, so I settled for functions that update the balance of an account. This is because it feels strange that, if the service was disabled while a send payment request by an account was being executed, but before the SendPaymentV2 response from LND was received, we'd still successfully handle the payment update. That is, we'd successfully execute TrackPayment, which adds the subscription to the lndclient.RouterClient, which in turn would execute paymentUpdate once the payment has been updated. It just feels strange that all of those calls would execute successfully while the service was disabled, which is why I drew the line at requiring that the service is running for any function that updates the user balance.

@ellemouton
Member

ok cool! thanks for the explanation @ViktorTigerstrom - that makes sense. I didn't think about point 1.

Member

@guggero guggero left a comment

Did a first pass. Great work! Looks pretty good, but have a few suggestions to make the diff even smaller.

accounts/checkers.go Outdated
account.Payments[hash] = &PaymentEntry{
Status: lnrpc.Payment_UNKNOWN,
FullAmount: fullAmt,
if !ok {
Member

I'm wondering if we could ever end up here with an all-zero payment hash (e.g. because it wasn't set in one of the calls). Not sure if we should check for that at least before we associate the payment?

Contributor Author

Hmm, I'm not completely sure that I understood this correctly. I think this should be fixed with the fix for this other feedback:
#642 (comment)

But please double check whether I misunderstood what you meant with this comment, and let me know if I need to fix something else :)

@ViktorTigerstrom
Contributor Author

Thanks a lot for the reviews @ellemouton & @guggero! Working on addressing them.

Though I'd like some more feedback on how we should surface the errors to the status server, so would just like to get consensus on that. Thanks for the suggestion on using a callback instead @guggero, will definitely change to that! Though would like to know if we should introduce the sub system server like @ellemouton suggested, to make it more generalizable for future sub systems we might add, or if we should leave that for now and do that later if needed? I think we will still need the internal RequireRunning checks either way though.

@ellemouton
Member

would like to know if we should introduce the sub system server like @ellemouton suggested, to make it more generalizable for future sub systems we might add, or if we should leave that for now and do that later if needed?

I'd say that if we are on a bit of a time limit with this rn, then we can worry about making things more generalisable in a follow up 👍 Since it sounds like that would take some design work

@guggero
Member

guggero commented Sep 22, 2023

would like to know if we should introduce the sub system server like @ellemouton suggested, to make it more generalizable for future sub systems we might add, or if we should leave that for now and do that later if needed?

I'd say that if we are on a bit of a time limit with this rn, then we can worry about making things more generalisable in a follow up 👍 Since it sounds like that would take some design work

Was about to suggest the same! Even though turning everything into subservers sounds great, it might be overkill for this PR. So let's see what the diff looks like with just the callback for now.
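
For readers following along, a callback-based approach could look roughly like the sketch below, as opposed to a shared error queue that has to carry the subsystem name. Everything here is an assumption for illustration: the `accountService` type and its `start` and `fail` methods are hypothetical; only the name `mainErrCallback` appears later in this thread.

```go
package main

import (
	"errors"
	"fmt"
)

// accountService is a hypothetical, simplified stand-in for the accounts
// service. mainErrCallback is invoked when the service hits a critical error.
type accountService struct {
	enabled         bool
	mainErrCallback func(error)
}

// start enables the service and registers the callback.
func (s *accountService) start(cb func(error)) {
	s.mainErrCallback = cb
	s.enabled = true
}

// fail simulates a critical error: the service disables itself and notifies
// its owner through the callback rather than through a shared error queue.
func (s *accountService) fail(err error) {
	s.enabled = false
	if s.mainErrCallback != nil {
		s.mainErrCallback(err)
	}
}

func main() {
	status := "running" // stand-in for the status manager entry

	svc := &accountService{}
	svc.start(func(err error) {
		// The owner knows which subsystem failed, because the callback
		// was registered for this service specifically.
		status = fmt.Sprintf("errored: %v", err)
	})

	svc.fail(errors.New("store update failed"))
	fmt.Println(status)
}
```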

@ViktorTigerstrom
Contributor Author

Thanks! Will leave it out for now then, and then we can address it in a follow-up later :)

@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-09-dont-stop-lit-on-account-system-error branch from 1d3a4b0 to 1a40522 Compare September 25, 2023 13:27
@ViktorTigerstrom
Contributor Author

Addressed the latest feedback, rebased on main now that the status server PR has been merged, and added unit tests!

@guggero guggero self-requested a review September 25, 2023 13:29
@ViktorTigerstrom
Contributor Author

Hmm sorry, looking into addressing the unit-race errors.

In terms of itests, I'm looking into adding an itest that triggers the mainErrCallback to be called, to shut down the accounts service while it's not disabled. Though unfortunately I think it's going to be a bit hard, as the litd process is started from the binary and not through the struct. So in case I can't find a way, I'll just add an itest for disabling through configuration.

check commits is expected to fail until the commits are squashed.

@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-09-dont-stop-lit-on-account-system-error branch 5 times, most recently from ceffb37 to acbe1dd Compare September 25, 2023 22:36
@ViktorTigerstrom
Contributor Author

ViktorTigerstrom commented Sep 25, 2023

Fixed CI errors! check commits is expected to fail until the PR has been squashed.

@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-09-dont-stop-lit-on-account-system-error branch 3 times, most recently from 7d3af76 to 56bd496 Compare September 27, 2023 00:37
@ViktorTigerstrom
Contributor Author

Updated the itests to test that we can disable the accounts system through configuration.
Also updated the commit that adds that config option a little, as I noticed some bugs in it when making the itest.

Member

@guggero guggero left a comment

Looks good! Can you please apply the fixup commits and I'll do a final review on the completed diff?

terminal.go Outdated
} else {
g.statusMgr.SetRunning(subservers.ACCOUNTS)
stopAccountService()
Member

This else seems to be wrong... Why would we stop the service if we haven't started it?

Contributor Author

So, I agree that this is a bit confusing. The reason we need to do this is that s.accountService.Stop() closes the contexts and db store, which are opened when we create the account service with accounts.NewService at the beginning of the LightningTerminal.start function. This happens regardless of whether the accounts service is disabled or fails to start, so we need to close the contexts and db store either way, which is why we run the stopAccountService function in both cases.

I changed the name of the stopAccountService function to closeAccountService to better represent what it actually does to make it more understandable. Let me know though if you have a suggestion of how I could do this better :)

@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-09-dont-stop-lit-on-account-system-error branch from 56bd496 to f23a2d2 Compare September 27, 2023 08:45
@ViktorTigerstrom
Contributor Author

Thanks for the review @guggero! Addressed the feedback and squashed the commits :)!

Snap though, I notice that the check commits check still fails, so will address that now.

@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-09-dont-stop-lit-on-account-system-error branch from f23a2d2 to 67c80de Compare September 27, 2023 09:26
@ViktorTigerstrom
Contributor Author

Fixed check commits CI failure :)

Member

@guggero guggero left a comment

Looks great, thanks a lot! Only nits left, LGTM 🎉

@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-09-dont-stop-lit-on-account-system-error branch 2 times, most recently from 4c81420 to aafd459 Compare September 27, 2023 13:20
@ViktorTigerstrom
Contributor Author

Thanks @guggero 🎉!! Addressed your latest feedback with the last push :)

@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-09-dont-stop-lit-on-account-system-error branch from aafd459 to 841aa33 Compare September 27, 2023 22:36
@ViktorTigerstrom
Contributor Author

Added a small fix that ensures that the node runner can't stop the accounts service while a request by an account user is being processed.
This is especially important to ensure that we don't stop the service right after a user has made an rpc call to send a payment for which we can't know the payment hash prior to the actual payment being sent (i.e. Keysend or SendToRoute). This is because if we stop the service after the send request has been sent to lnd, but before TrackPayment has been called, we won't be able to track the payment and debit the account.
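
One way such a guarantee can be achieved is with a read/write lock: each request holds the read lock for its whole duration, and stopping takes the write lock, so a stop has to wait for in-flight requests. The sketch below shows that general technique only; it is an assumption, not necessarily how this PR implements it, and the `service` type, `requestMtx`, `sendPayment`, and `trackedHashes` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errStopped is a hypothetical error for requests made after a stop.
var errStopped = errors.New("service stopped")

// service is a hypothetical model of a service that must not be stopped
// while a request is in flight.
type service struct {
	requestMtx sync.RWMutex // read-held by requests, write-held by stop
	stopped    bool         // guarded by requestMtx

	mu      sync.Mutex // guards tracked
	tracked []string
}

// sendPayment models a payment whose hash is only known after sending.
func (s *service) sendPayment(send func() string) error {
	s.requestMtx.RLock()
	defer s.requestMtx.RUnlock()

	if s.stopped {
		return errStopped
	}
	hash := send() // the hash is only known once the send has returned

	// Stands in for TrackPayment: runs before stop can take the write lock.
	s.mu.Lock()
	s.tracked = append(s.tracked, hash)
	s.mu.Unlock()

	return nil
}

// stop blocks until all in-flight requests have released the read lock.
func (s *service) stop() {
	s.requestMtx.Lock()
	defer s.requestMtx.Unlock()
	s.stopped = true
}

// trackedHashes returns a copy of the tracked payment hashes.
func (s *service) trackedHashes() []string {
	s.mu.Lock()
	defer s.mu.Unlock()

	return append([]string(nil), s.tracked...)
}

func main() {
	s := &service{}
	sending := make(chan struct{})
	release := make(chan struct{})
	done := make(chan error)

	go func() {
		done <- s.sendPayment(func() string {
			close(sending)
			<-release // the payment is in flight
			return "hash-1"
		})
	}()

	<-sending
	stopped := make(chan struct{})
	go func() {
		s.stop() // can't complete until the request above has finished
		close(stopped)
	}()

	close(release)
	fmt.Println(<-done)
	<-stopped
	fmt.Println(s.trackedHashes(), s.sendPayment(func() string { return "hash-2" }))
}
```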

Member

@ellemouton ellemouton left a comment

Looks good! Only nits/style comments from my side.

The only one I think we should maybe defs address here is converting the new "accounts-mode" option to an "Accounts.Disable" boolean instead?

Ensure that we don't stop the service while we're processing a request.
This is especially important to ensure that we don't stop the service
exactly after a user has made an rpc call to send a payment we can't
know the payment hash for prior to the actual payment being sent
(i.e. Keysend or SendToRoute). This is because if we stop the service
after the send request has been sent to lnd, but before TrackPayment
has been called, we won't be able to track the payment and debit the
account.

Add the accounts service to status manager. This will allow us to query
the status of the accounts service and see if it is running or not.
For incoming gRPC requests to the accounts service, we also use the
status manager to check if the accounts service is running or not to
determine if we should let the request through or not.
@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-09-dont-stop-lit-on-account-system-error branch from 841aa33 to 294e9d0 Compare September 28, 2023 11:10
@ViktorTigerstrom
Contributor Author

ViktorTigerstrom commented Sep 28, 2023

Thanks a lot for the review @ellemouton 🎉 🚀!!

Addressed the latest feedback, and left some comments for some of the comments which weren't addressed :)

@ellemouton ellemouton merged commit 6ed39b0 into lightninglabs:master Sep 28, 2023
12 checks passed