tapdb: use the serializable isolation by default for postgres #334

Merged (2 commits), Jun 6, 2023

Conversation

Roasbeef (Member)

In this commit, we start to use serializable snapshot isolation by default for Postgres. For SQLite, this is a no-op, as only a single writer is allowed at any time. With this mode, we should guard against all types of anomalies, at the cost of extra locking and potential performance hits.

One thing we'll need to test is whether the DB driver will automatically retry if a transaction cannot be serialized.

This is an attempt at fixing #333 at a global level.
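
For reference, a minimal sketch of what requesting that isolation level looks like with Go's database/sql (the actual TxOptions plumbing in tapdb may differ):

package example

import (
    "context"
    "database/sql"
)

// beginSerializableTx opens a transaction at the SERIALIZABLE isolation
// level. On Postgres this enables serializable snapshot isolation; on SQLite
// the setting is effectively a no-op, since only a single writer runs at a
// time anyway.
func beginSerializableTx(ctx context.Context, db *sql.DB) (*sql.Tx, error) {
    return db.BeginTx(ctx, &sql.TxOptions{
        Isolation: sql.LevelSerializable,
    })
}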

Roasbeef (Member Author)

itest failures show we do indeed have an issue here:

2023-05-31 22:26:22.717 [WRN] UNIV: unable to log sync event: unknown postgres error: ERROR: null value in column "universe_root_id" violates not-null constraint (SQLSTATE 23502)

Roasbeef (Member Author) commented Jun 1, 2023

itest failures show we do indeed have an issue here:

Fixed with the latest commit: we now retry (forever). We should add logging, and also consider an upper limit on retries.

jharveyb (Collaborator) left a comment

Tested with #336, and confirmed these changes fix the proof upload issues we observed before.

Need to apply the fixup commits, but otherwise LGTM. Nice catch on the default Postgres modes 💯

guggero (Member) left a comment

Good catch! This should indeed make things safer.

Though with the current implementation, I think we might get into weird situations with the combination of goto and defer() (see inline comments).

// Roll back the transaction, then pop back up to try
// once again.
//
// TODO(roasbeef): needs retry limit
guggero (Member)

Should we fix this TODO to avoid endless retries?

Roasbeef (Member Author)

Yeah, I think so. Open question: what's a good limit for retries? My strategy was going to be to deploy the modified version on staging/testnet, then check the logs to see how often we retry. We have @jharveyb's mega mint script, so that can be used to calibrate.

@@ -100,6 +101,7 @@ func NewTransactionExecutor[Querier any](db BatchedQuerier,
func (t *TransactionExecutor[Q]) ExecTx(ctx context.Context,
txOptions TxOptions, txBody func(Q) error) error {

txStart:
guggero (Member)

This will cause a new defer with a rollback to be registered for each retry. And IIUC, Rollback() will return an error if it is called twice. So I think we need to unroll this into an actual loop, otherwise we might get weird side effects or errors.
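
To illustrate the concern, a paraphrased sketch of the goto-based shape under review (not the actual ExecTx code): deferred calls only run when the function returns, so jumping back to the label never pops the defer that was already registered, and every retry stacks one more deferred Rollback.

package example

import (
    "context"
    "database/sql"
)

// execTxWithGoto paraphrases the goto-based retry pattern under discussion;
// it is not the real implementation. Each pass through txStart registers
// another deferred Rollback, and none of them run until the function finally
// returns (at which point they all see the most recent tx value).
func execTxWithGoto(ctx context.Context, db *sql.DB,
    txBody func(*sql.Tx) error) error {

txStart:
    tx, err := db.BeginTx(ctx, &sql.TxOptions{
        Isolation: sql.LevelSerializable,
    })
    if err != nil {
        return err
    }

    // Registered on every retry, executed only at function return.
    defer func() {
        _ = tx.Rollback()
    }()

    if err := txBody(tx); err != nil {
        // Roll back the transaction, then pop back up to try once again.
        // There is no retry limit here, and the defer above stays queued.
        _ = tx.Rollback()
        goto txStart
    }

    return tx.Commit()
}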

Roasbeef (Member Author)

I think we still need to roll back after each attempt; will do some doc/code digging.

Roasbeef force-pushed the postgres-serializable branch 2 times, most recently from beb672a to 782c21b on June 6, 2023 at 01:37
Roasbeef requested a review from guggero on June 6, 2023 at 01:38
Roasbeef (Member Author) commented Jun 6, 2023

Tacked on a proper retry loop, PTAL.

Also has logging now so we can see how many times things need to be retried in practice.

Roasbeef force-pushed the postgres-serializable branch 2 times, most recently from 0b70655 to 28bad1f on June 6, 2023 at 01:43
In this commit, we introduce a new Postgres-specific error returned when a transaction cannot be serialized. We then catch this error in order to implement retry logic for a transaction that failed to serialize. We'll retry up to 10 times, waiting a random amount of time (up to 50 ms) between each retry.
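
A sketch of what that loop-based retry can look like, assuming the pgx driver surfaces Postgres errors as *pgconn.PgError; the helper names, constants, and structure here are illustrative, not the tapdb code itself:

package example

import (
    "context"
    "database/sql"
    "errors"
    "math/rand"
    "time"

    "github.com/jackc/pgconn"
)

const (
    // maxTxRetries is how many times a serialization failure is retried
    // before giving up.
    maxTxRetries = 10

    // maxRetryDelay bounds the random sleep between retries.
    maxRetryDelay = 50 * time.Millisecond
)

// isSerializationError returns true if the error is Postgres' serialization
// failure error (SQLSTATE 40001), which SERIALIZABLE transactions can return
// when they conflict and must be retried.
func isSerializationError(err error) bool {
    var pgErr *pgconn.PgError
    return errors.As(err, &pgErr) && pgErr.Code == "40001"
}

// execTxWithRetry runs txBody in a SERIALIZABLE transaction, retrying up to
// maxTxRetries times with a random delay if the transaction fails to
// serialize.
func execTxWithRetry(ctx context.Context, db *sql.DB,
    txBody func(*sql.Tx) error) error {

    var err error
    for i := 0; i < maxTxRetries; i++ {
        var tx *sql.Tx
        tx, err = db.BeginTx(ctx, &sql.TxOptions{
            Isolation: sql.LevelSerializable,
        })
        if err != nil {
            return err
        }

        err = txBody(tx)
        if err == nil {
            err = tx.Commit()
        }

        // Roll back in all cases; this is a no-op if Commit already
        // succeeded.
        _ = tx.Rollback()

        if err == nil {
            return nil
        }
        if !isSerializationError(err) {
            return err
        }

        // Serialization failure: wait a random amount of time (up to
        // maxRetryDelay) before trying again.
        time.Sleep(time.Duration(rand.Int63n(int64(maxRetryDelay))))
    }

    return err
}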
Roasbeef (Member Author) commented Jun 6, 2023

Re calling rollback multiple times, looks like any second call will be a noop here due to the CAS: https://cs.opensource.google/go/go/+/refs/tags/go1.20.4:src/database/sql/sql.go;l=2279-2284

guggero (Member) left a comment

Very nice, LGTM 🎉

guggero (Member) commented Jun 6, 2023

Re calling rollback multiple times, looks like any second call will be a noop here due to the CAS: https://cs.opensource.google/go/go/+/refs/tags/go1.20.4:src/database/sql/sql.go;l=2279-2284

Ah, yes. I overlooked that we just ignore the ErrTxDone in the defer call, so it's a no-op indeed.
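
As a small illustration of that pattern (illustrative, not the exact tapdb code), the deferred rollback can stay unconditional as long as sql.ErrTxDone is filtered out:

package example

import (
    "database/sql"
    "errors"
    "log"
)

// rollbackIgnoringDone rolls back tx and swallows sql.ErrTxDone, which is
// what Rollback returns once the transaction has already been committed or
// rolled back (database/sql guards that state with an atomic
// compare-and-swap, so the second call is effectively a no-op).
func rollbackIgnoringDone(tx *sql.Tx) {
    if err := tx.Rollback(); err != nil && !errors.Is(err, sql.ErrTxDone) {
        log.Printf("unable to roll back tx: %v", err)
    }
}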

guggero added this pull request to the merge queue on Jun 6, 2023
Merged via the queue into lightninglabs:main with commit 192c67e on Jun 6, 2023
11 checks passed