
multi: ensure link is always torn down due to db failures, add exponential back off for sql-kvdb failures #7927

Merged — 7 commits merged on Aug 30, 2023

Conversation

@Roasbeef Roasbeef (Member) commented Aug 26, 2023

This PR resolves two outstanding issues:

  1. If a revoke failed in the past, we wouldn't tear down the link. This could lead to desynchronized state, eventually leading to a force close.
  2. For the kvdb SQL emulation backends, if we got an error (SQLITE_BUSY, etc), we would simply fail right away. We now implement an exponential back off with some jitter to give other writing goroutines time to finish/interleave their writes (a rough sketch of the retry loop follows below).
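
For context, the retry wrapper described in point 2 has roughly the following shape. This is a simplified sketch rather than the actual kvdb/sqlbase implementation: the package name, executeWithRetry, and the helper stubs are illustrative stand-ins for the real serialization-error detection and back off logic in the PR.

```go
package sqlretry

import (
	"context"
	"database/sql"
	"errors"
	"strings"
	"time"
)

// errRetriesExceeded is returned once every attempt has failed with a
// serialization error.
var errRetriesExceeded = errors.New("db tx retries exceeded")

// isSerializationError is a stand-in for the real check, which inspects
// driver-specific codes (postgres serialization failures, SQLITE_BUSY).
func isSerializationError(err error) bool {
	return strings.Contains(err.Error(), "SQLITE_BUSY")
}

// randRetryDelay is a stand-in for the jittered exponential back off; see
// the last commit message below for the actual parameters.
func randRetryDelay(attempt int) time.Duration {
	return time.Duration(attempt+1) * 50 * time.Millisecond
}

// executeWithRetry runs apply inside a transaction, retrying it when the
// backend reports a serialization failure instead of failing immediately.
func executeWithRetry(ctx context.Context, db *sql.DB,
	apply func(*sql.Tx) error) error {

	const maxRetries = 10

	for i := 0; i < maxRetries; i++ {
		tx, err := db.BeginTx(ctx, nil)
		if err != nil {
			return err
		}

		dbErr := apply(tx)
		if dbErr == nil {
			dbErr = tx.Commit()
		}
		if dbErr == nil {
			return nil
		}

		// Roll back, and only retry serialization failures; any
		// other error is surfaced to the caller immediately.
		_ = tx.Rollback()
		if !isSerializationError(dbErr) {
			return dbErr
		}

		// Give competing writers time to finish before retrying.
		time.Sleep(randRetryDelay(i))
	}

	return errRetriesExceeded
}
```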

For 1, I tacked on an extra commit to just tear down the connection altogether. This may save us from a scenario where Alice had a DB failure, Bob then tries to send a new state, but Alice has torn down the link, but not the TCP connection. In this case, Bob is now "stuck" as he's sent out a sig, and if Alice never recovers, they may need to go on chain to time out the HTLC.

In the 0.18 cycle, as we move to integrate the pure SQL backends for payments/invoices, we'll want to attempt to unify some of this logic, as we effectively have two SQL scaffoldings today: KV-emulation and proper schema-based.

Replaces https://github.com/lightningnetwork/lnd/pull/7876/files

Fixes #7869

@Roasbeef Roasbeef added the bug (Unintended code behaviour), database (Related to the database/storage of LND), and safety (General label for issues/PRs related to the safety of using the software) labels Aug 26, 2023
@Roasbeef Roasbeef added this to the v0.17.0 milestone Aug 28, 2023
@yyforyongyu yyforyongyu (Collaborator) left a comment

Thanks for picking up the issue! Straightforward fix and just need to fix the imports.

kvdb/sqlbase/readwrite_tx.go (resolved)
// returned, then this means that daemon is shutting down so we
// should abort the retries.
waitBeforeRetry := func(attemptNumber int) bool {
	retryDelay := randRetryDelay(
Collaborator:

Non-blocking, but we could use ticker := chain.NewJitterTicker(DefaultInitialRetryDelay, 0.5) instead.

kvdb/sqlbase/db.go (outdated, resolved)
kvdb/sqlbase/db.go (resolved)
@yyforyongyu yyforyongyu (Collaborator) left a comment

LGTM🌊

@positiveblue positiveblue (Collaborator) left a comment

👍

if IsSerializationError(dbErr) {
	_ = tx.Rollback()

	if waitBeforeRetry(i) {
Collaborator:

nit: if we know there is not going to be a next iteration, we can skip the last wait (which will also be the longest) and return ErrRetriesExceeded here.
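
A minimal sketch of that suggestion, reusing the names from the quoted snippets above (maxRetries here is a stand-in for the actual retry-count constant):

```go
// After rolling back a serialization failure inside the retry loop:
_ = tx.Rollback()

// If this was the last allowed attempt, skip the final (and longest)
// wait and surface the terminal error right away.
if i == maxRetries-1 {
	return ErrRetriesExceeded
}

// Otherwise wait; waitBeforeRetry returns false when the daemon is
// shutting down, in which case the retries are aborted as well.
if !waitBeforeRetry(i) {
	return ErrRetriesExceeded
}
```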

ziggie1984 and others added 7 commits August 30, 2023 16:01
The log message is off by one.
When the revocation of a channel state fails after receiving a new
CommitmentSigned msg, we have to fail the channel; otherwise we
continue with an unclean state.
If we couldn't revoke due to a DB error, then we want to also tear down
the connection, as we don't want the other party to continue to send
updates. That may lead to de-sync'd state and an eventual force close.
Otherwise, the database might be able to recover come the next
reconnection attempt.
In this commit, we modify the default isolation level to be
`sql.LevelSerializable`. This is the strictest isolation level for
postgres. For sqlite, there's only ever a single writer, so this doesn't
apply directly.
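
With the standard library, that boils down to passing the isolation level in the transaction options when opening the transaction; a minimal sketch (the function name is illustrative, imports as in the earlier sketch):

```go
// beginSerializableTx opens a transaction at the serializable isolation
// level using the standard database/sql API. If the driver can't honour
// the requested level, BeginTx returns an error.
func beginSerializableTx(ctx context.Context, db *sql.DB,
	readOnly bool) (*sql.Tx, error) {

	return db.BeginTx(ctx, &sql.TxOptions{
		Isolation: sql.LevelSerializable,
		ReadOnly:  readOnly,
	})
}
```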
…ilures

In this commit, we add randomized exponential backoff for serialization
failures. For postgres, we'll hit this any time a transaction set fails
to be linearized. For sqlite, we'll hit this if we have many writers
trying to grab the write lock at the same time, manifesting as a
`SQLITE_BUSY` error code.

As is, we'll retry up to 10 times, waiting a minimum of 50 milliseconds
between each attempt, up to a maximum of 5 seconds. For sqlite, this is
also bounded by the busy timeout set, which applies on top of this retry
logic (block for busy timeout seconds, then apply this back off logic).
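
As a rough illustration of those parameters, the delay computation could look like the sketch below, which fleshes out the randRetryDelay stub from the earlier sketch; the exact jitter formula is illustrative, not necessarily the PR's (assumes math/rand and time are imported):

```go
// randRetryDelay sketches the back off described above: start at 50ms,
// double on every attempt, cap at 5s, and randomize the result so that
// concurrent writers don't retry in lockstep.
func randRetryDelay(attempt int) time.Duration {
	const (
		initialDelay = 50 * time.Millisecond
		maxDelay     = 5 * time.Second
	)

	delay := initialDelay << uint(attempt)
	if delay <= 0 || delay > maxDelay {
		// Guard against overflow for large attempt counts.
		delay = maxDelay
	}

	// Jitter: pick a uniform random duration in [delay/2, delay).
	return delay/2 + time.Duration(rand.Int63n(int64(delay/2)))
}
```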
Labels
bug (Unintended code behaviour), database (Related to the database/storage of LND), no-changelog, safety (General label for issues/PRs related to the safety of using the software)
Development

Successfully merging this pull request may close these issues.

[bug]: Force closes for commitment not revoked, probable SQLite management issue
5 participants