
multi: ensure link is always torn down due to db failures, add exponential back off for sql-kvdb failures #7927

Merged — 7 commits merged on Aug 30, 2023

Conversation

@Roasbeef Roasbeef (Member) commented Aug 26, 2023

This PR resolves two outstanding issues:

  1. If a revoke failed in the past, we wouldn't tear down the link. This could lead to desynchronized state, eventually leading to a force close.
  2. For the kvdb SQL emulation backends, if we got an error (SQLITE_BUSY, etc), we would simply fail right away. We now implement an exponential back off with some jitter to give other writing goroutines time to finish/interleave their writes (a rough sketch of the retry loop follows below).
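
For context, the retry wrapper described in point 2 has roughly the following shape. This is a simplified sketch rather than the actual kvdb/sqlbase implementation: the package name, executeWithRetry, and the helper stubs are illustrative stand-ins for the real serialization-error detection and back off logic in the PR.

```go
package sqlretry

import (
	"context"
	"database/sql"
	"errors"
	"strings"
	"time"
)

// errRetriesExceeded is returned once every attempt has failed with a
// serialization error.
var errRetriesExceeded = errors.New("db tx retries exceeded")

// isSerializationError is a stand-in for the real check, which inspects
// driver-specific codes (postgres serialization failures, SQLITE_BUSY).
func isSerializationError(err error) bool {
	return strings.Contains(err.Error(), "SQLITE_BUSY")
}

// randRetryDelay is a stand-in for the jittered exponential back off; see
// the last commit message below for the actual parameters.
func randRetryDelay(attempt int) time.Duration {
	return time.Duration(attempt+1) * 50 * time.Millisecond
}

// executeWithRetry runs apply inside a transaction, retrying it when the
// backend reports a serialization failure instead of failing immediately.
func executeWithRetry(ctx context.Context, db *sql.DB,
	apply func(*sql.Tx) error) error {

	const maxRetries = 10

	for i := 0; i < maxRetries; i++ {
		tx, err := db.BeginTx(ctx, nil)
		if err != nil {
			return err
		}

		dbErr := apply(tx)
		if dbErr == nil {
			dbErr = tx.Commit()
		}
		if dbErr == nil {
			return nil
		}

		// Roll back, and only retry serialization failures; any
		// other error is surfaced to the caller immediately.
		_ = tx.Rollback()
		if !isSerializationError(dbErr) {
			return dbErr
		}

		// Give competing writers time to finish before retrying.
		time.Sleep(randRetryDelay(i))
	}

	return errRetriesExceeded
}
```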

For 1, I tacked on an extra commit to just tear down the connection altogether. This may save us from a scenario where Alice had a DB failure, Bob then tries to send a new state, but Alice has torn down the link, but not the TCP connection. In this case, Bob is now "stuck" as he's sent out a sig, and if Alice never recovers, they may need to go on chain to time out the HTLC.

In the 0.18 cycle, as we move to integrate the pure SQL backends for payments/invoices, we'll want to attempt to unify some of this logic, as we effectively have two SQL scaffoldings today: KV-emulation and proper schema-based.

Replaces https://github.com/lightningnetwork/lnd/pull/7876/files

Fixes #7869

@Roasbeef Roasbeef added the bug (Unintended code behaviour), database (Related to the database/storage of LND), and safety (General label for issues/PRs related to the safety of using the software) labels Aug 26, 2023
@Roasbeef Roasbeef added this to the v0.17.0 milestone Aug 28, 2023
@yyforyongyu yyforyongyu (Collaborator) left a comment

Thanks for picking up the issue! Straightforward fix and just need to fix the imports.

kvdb/sqlbase/readwrite_tx.go (resolved)
// returned, then this means that daemon is shutting down so we
// should abort the retries.
waitBeforeRetry := func(attemptNumber int) bool {
	retryDelay := randRetryDelay(
Collaborator:

Non-blocking, but we could use ticker := chain.NewJitterTicker(DefaultInitialRetryDelay, 0.5) instead.

kvdb/sqlbase/db.go (outdated, resolved)
kvdb/sqlbase/db.go (resolved)
@yyforyongyu yyforyongyu (Collaborator) left a comment

LGTM🌊

@positiveblue positiveblue (Collaborator) left a comment

👍

if IsSerializationError(dbErr) {
	_ = tx.Rollback()

	if waitBeforeRetry(i) {
Collaborator:

nit: if we know there is not going to be a next iteration, we can skip the last wait (which will also be the longest) and return ErrRetriesExceeded here.
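
A minimal sketch of that suggestion, reusing the names from the quoted snippets above (maxRetries here is a stand-in for the actual retry-count constant):

```go
// After rolling back a serialization failure inside the retry loop:
_ = tx.Rollback()

// If this was the last allowed attempt, skip the final (and longest)
// wait and surface the terminal error right away.
if i == maxRetries-1 {
	return ErrRetriesExceeded
}

// Otherwise wait; waitBeforeRetry returns false when the daemon is
// shutting down, in which case the retries are aborted as well.
if !waitBeforeRetry(i) {
	return ErrRetriesExceeded
}
```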

ziggie1984 and others added 7 commits August 30, 2023 16:01
The log message is off by one.
When the revocation of a channel state fails after receiving a new
CommitmentSigned msg, we have to fail the channel; otherwise we
continue with an unclean state.
If we couldn't revoke due to a DB error, then we want to also tear down
the connection, as we don't want the other party to continue to send
updates. That may lead to de-sync'd state and an eventual force close.
Otherwise, the database might be able to recover come the next
reconnection attempt.
In this commit, we modify the default isolation level to be
`sql.LevelSerializable`. This is the strictest isolation level for
postgres. For sqlite, there's only ever a single writer, so this doesn't
apply directly.
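
With the standard library, that boils down to passing the isolation level in the transaction options when opening the transaction; a minimal sketch (the function name is illustrative, imports as in the earlier sketch):

```go
// beginSerializableTx opens a transaction at the serializable isolation
// level using the standard database/sql API. If the driver can't honour
// the requested level, BeginTx returns an error.
func beginSerializableTx(ctx context.Context, db *sql.DB,
	readOnly bool) (*sql.Tx, error) {

	return db.BeginTx(ctx, &sql.TxOptions{
		Isolation: sql.LevelSerializable,
		ReadOnly:  readOnly,
	})
}
```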
…ilures

In this commit, we add randomized exponential backoff for serialization
failures. For postgres, we'll hit this any time a transaction set fails
to be linearized. For sqlite, we'll hit this if we have many writers
trying to grab the write lock at the same time, manifesting as a
`SQLITE_BUSY` error code.

As is, we'll retry up to 10 times, waiting a minimum of 50 milliseconds
between each attempt, up to a maximum of 5 seconds. For sqlite, this is
also bounded by the busy timeout set, which applies on top of this retry
logic (block for busy timeout seconds, then apply this back off logic).
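
As a rough illustration of those parameters, the delay computation could look like the sketch below, which fleshes out the randRetryDelay stub from the earlier sketch; the exact jitter formula is illustrative, not necessarily the PR's (assumes math/rand and time are imported):

```go
// randRetryDelay sketches the back off described above: start at 50ms,
// double on every attempt, cap at 5s, and randomize the result so that
// concurrent writers don't retry in lockstep.
func randRetryDelay(attempt int) time.Duration {
	const (
		initialDelay = 50 * time.Millisecond
		maxDelay     = 5 * time.Second
	)

	delay := initialDelay << uint(attempt)
	if delay <= 0 || delay > maxDelay {
		// Guard against overflow for large attempt counts.
		delay = maxDelay
	}

	// Jitter: pick a uniform random duration in [delay/2, delay).
	return delay/2 + time.Duration(rand.Int63n(int64(delay/2)))
}
```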
Labels
bug (Unintended code behaviour), database (Related to the database/storage of LND), no-changelog, safety (General label for issues/PRs related to the safety of using the software)
Development

Successfully merging this pull request may close these issues.

[bug]: Force closes for commitment not revoked, probable SQLite management issue
5 participants