
Fix random SQLite busy database is locked errors #2527

Merged
merged 16 commits into haskell:master from fix-sql-busy-database-locked on Dec 28, 2021

Conversation

eddiemundo
Collaborator

@eddiemundo eddiemundo commented Dec 23, 2021

If hiedb raises an SQLITE_BUSY exception, retry with random exponential backoff for both reads and writes. The max delay is 1 second, the base delay is 1 millisecond, and the max number of retries is 10.
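For illustration, here is a minimal sketch of that retry shape in Haskell. The name `retryOnSqliteBusySketch` and the exact exception plumbing are assumptions for the example, not the PR's actual code; it assumes the error surfaces as sqlite-simple's `SQLError` with direct-sqlite's `ErrorBusy` code.

```haskell
import           Control.Concurrent     (threadDelay)
import           Control.Exception      (throwIO, try)
import qualified Database.SQLite.Simple as SQL
import           Database.SQLite3       (Error (ErrorBusy))
import           System.Random          (randomRIO)

-- Retry an IO action when it throws SQLITE_BUSY, sleeping for a random,
-- exponentially growing delay (in microseconds) between attempts.
retryOnSqliteBusySketch :: Int -> Int -> Int -> IO a -> IO a
retryOnSqliteBusySketch maxRetries baseDelayUs maxDelayUs act = go 0
  where
    go n = do
      result <- try act
      case result of
        Right a -> pure a
        Left e
          | SQL.sqlError e == ErrorBusy, n < maxRetries -> do
              -- cap the exponential delay, then pick a random value up to the cap
              let cap = min maxDelayUs (baseDelayUs * 2 ^ n)
              delay <- randomRIO (baseDelayUs, cap)
              threadDelay delay
              go (n + 1)
          | otherwise -> throwIO e
```

With the parameters above, a call would look like `retryOnSqliteBusySketch 10 1000 1000000 (hiedbCall db)`: 10 retries, 1 ms base delay, 1 s cap.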

This potentially solves the case where we have multiple HLS instances using the same SQLite database, although if multiple clients could connect to the same server this solution might become moot again.

Edit: The below is now mostly irrelevant because WAL mode is already used for the sqlite db. Oops.

There seem to be a lot of ways of solving this problem, and so what I've done might not be what we want to do.

Here's a summary of what I've learned...

Background:

The SQLite we use is bundled in direct-sqlite and uses the default configuration for most things. This means SQLite uses a rollback journal, and that its sqlite3_busy_handler is null. With a null busy handler, when a connection tries to acquire a lock and is blocked by another connection already holding certain locks, it returns SQLITE_BUSY immediately, which manifests as an exception.

When using a rollback journal there are 4 locks that can be acquired: "shared", "reserved", "pending", "exclusive".

  • Both reads and writes always acquire shared locks. Multiple processes (I don't think they mean OS processes) can hold shared locks at the same time.
  • When a process wants to write it acquires a reserved lock. Only 1 process can hold a reserved lock at a time. If it fails to acquire a lock then SQLITE_BUSY.
  • After the reserved lock is acquired the process creates a rollback journal.
  • When the process wants to update the database file it acquires a pending lock. If another process tries to acquire a shared (or reserved?) lock at this time it will fail with SQLITE_BUSY.
  • Once all other shared locks are released, it also acquires an exclusive lock. If another process tries to acquire a lock at this time it will fail with SQLITE_BUSY.
  • When it is done updating the database file, both pending and exclusive locks are released.

Problem:

The above means that a write can cause a read to SQLITE_BUSY. In fact, it seems that even a read can occasionally cause a read or write to SQLITE_BUSY, although that should be extremely rare because the rollback journal has to be "hot" for it to happen.

Currently we serialize writes on one connection, and do reads on another. The writes happen on one thread, the reads can happen from multiple threads. Our random exceptions are most likely a read trying to acquire a shared lock when a write is finishing up.

Solutions:

  • Turn on write-ahead logging mode. This journaling mode does not need the locking procedure described above; I think only writes need to be serialized, which means this is probably a drop-in solution (a sketch of the pragma appears after this list). Performance for reads is reportedly a bit slower (1-2%), and if we really want to tune performance then fiddling with "checkpoints" is necessary.
    SQLITE_BUSY can still occur in rare cases:
    • If the last connection to a database is exiting then it can acquire a lock while doing cleanup, and if another connection opens and tries to query then SQLITE_BUSY.
    • If the last connection to a database crashes, then the next new connection will acquire a lock while doing recovery, and hence a 3rd connection doing a query can SQLITE_BUSY.
  • When SQLITE_BUSY is received, retry the read. This is what this PR does. Not entirely satisfying, because the right sleep times depend on how long the pertinent locks are held. Currently the thread sleeps for at least 1 millisecond before retrying, for a maximum of 10 retries.
  • Providing an sqlite3_busy_handler. This would more or less do what the retrying above does, but maybe better. I think this requires sending PRs to expose the functionality in direct-sqlite, but maybe I didn't look hard enough.
  • Have reads wait for in-progress writes. I think this can be done by creating a transaction for reads that checks the in-progress TVar in the hiedbWriter in ShakeExtras, or by using some kind of broadcast signaling synchronization abstraction. There are still some edge cases that probably don't matter, e.g. a read is recovering a hot journal and another read or write comes in.
  • Serialize reads and writes. Not sure how to do this one without changing how the current system works. Probably some kind of MVar protecting a single connection, and spawning a thread per write if we don't want to block on writes. I don't think performance will be much different.
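For the WAL option above, switching journal modes is a single pragma issued on an open connection; a minimal sketch using sqlite-simple (the function name is illustrative, and connection setup is elided):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Database.SQLite.Simple as SQL

-- Switch an open connection to write-ahead logging. The WAL setting persists
-- in the database file, so later connections pick it up as well.
enableWal :: SQL.Connection -> IO ()
enableWal conn = SQL.execute_ conn "PRAGMA journal_mode=WAL;"
```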

I don't know much about SQLite so maybe I've got something wrong, or missed some alternatives. I might try WAL mode on another branch since it seems like a one-liner, but I don't know how to test it in HLS. If I do, I only know how to create synthetic examples in sqlite-simple or direct-sqlite.

@jneira
Member

jneira commented Dec 23, 2021

I think insights from @wz1000 would be useful with sqlite issues

@wz1000
Collaborator

wz1000 commented Dec 23, 2021

Turn on write-ahead logging mode. This journaling mode does not need the locking procedure described above. I think only writes need to be serialized.

We already turn on WAL mode in hiedb: https://github.com/wz1000/HieDb/blob/305a896adbaf78175ebd866c0e665d99c093cab8/src/HieDb/Create.hs#L74

SQLITE_BUSY can still occur in rare cases:

Are we sure this is not responsible for the errors?

Currently we serialize all writes to the database (that is what HieDbWriter is for). There is one writer thread and an arbitrary number of reader threads. The second part is important, since we consult the database in many places to provide results for the user, and we would not like to serialize these reads.

Retrying on exceptions is a plausible solution, but I don't quite understand why it is necessary given our usage scenario. Is it possible to reproduce the same errors on a simple test program with the same usage pattern? (i.e. WAL mode, one writer, multiple readers)

@jneira
Member

jneira commented Dec 23, 2021

Out of curiosity, is there some reported issue related to this, or even informal reports in any channel?

@pepeiborra
Collaborator

This was creating problems in our testsuite, reported here: #1430 (comment). I haven't seen these errors reproduce again since this commit: a044156

@wz1000
Collaborator

wz1000 commented Dec 23, 2021

This reminds me, perhaps we should be using wal2 mode now: https://www.sqlite.org/cgi/src/doc/wal2/doc/wal2.md

Last time I tried, the version of SQLite bundled with direct-sqlite was too old.

@eddiemundo
Collaborator Author

I think I read init_conn back when I didn't know what WAL was, so I filtered out the pragmas; then, once I learned what it was, I assumed I already knew what init_conn did and went full steam ahead.

It does seem like #2484 fixes the rare cases, and my PR is not needed because of WAL mode.

@eddiemundo eddiemundo closed this Dec 23, 2021
@eddiemundo eddiemundo deleted the fix-sql-busy-database-locked branch December 23, 2021 18:11
@wz1000
Collaborator

wz1000 commented Dec 23, 2021

I think it could still be valuable, especially if a log message was emitted every time a transaction was retried. You can never really be too sure about this.

These show up much more consistently when we have multiple HLS processes accessing the same database, possibly because of multiple editor sessions. Retrying is the best we can do in that case.

We also probably want a random exponential backoff instead of a fixed delay. See https://en.wikipedia.org/wiki/Exponential_backoff. I found https://hackage.haskell.org/package/retry after a quick search, though it might just be best to implement it ourselves instead of incurring a dependency.

Feel free to reopen if you want to continue with this.

@eddiemundo eddiemundo restored the fix-sql-busy-database-locked branch December 23, 2021 18:32
@eddiemundo
Collaborator Author

Oh, I didn't think of multiple HLS instances hitting a single database; in that case the writes might also need to retry.

I'll add logging retries and random exponential backoff.

@eddiemundo eddiemundo reopened this Dec 23, 2021
Collaborator

@wz1000 wz1000 left a comment


Looks good so far. We need to use the backoff in the writer thread too (the indexHieFile function).

@pepeiborra
Collaborator

Are there any performance implications?

@wz1000
Collaborator

wz1000 commented Dec 24, 2021

I don't think there should be in the absence of contention, since we only retry if a query fails.

@eddiemundo
Collaborator Author

Looks good so far. We need to use the backoff in the writer thread too (the indexHieFile function).

I see... I thought having the thread on the receiving side of the TQueue wrap the function it receives with retries would be enough, but that function may do too much (sending messages, other computations), so we'd be retrying stuff other than hiedb calls.

So I've threaded the retry wrapper through so the client can decide what to retry.
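A hypothetical sketch of that shape, reusing `retryOnSqliteBusySketch` from the earlier example (the queue element type and all names here are illustrative, not the PR's actual definitions): the writer thread hands each dequeued action a retry wrapper, and the action applies it only around its own hiedb calls.

```haskell
{-# LANGUAGE RankNTypes #-}
import Control.Concurrent.STM (TQueue, atomically, readTQueue)
import Control.Monad          (forever)

-- A wrapper the queued action can apply around individual hiedb calls.
newtype WithRetry = WithRetry (forall a. IO a -> IO a)

writerThreadSketch :: TQueue (WithRetry -> IO ()) -> IO ()
writerThreadSketch queue = forever $ do
  action <- atomically (readTQueue queue)
  -- the action decides which of its calls get wrapped, e.g.
  -- \(WithRetry retry) -> retry (someHieDbWrite db)
  action (WithRetry (retryOnSqliteBusySketch 10 1000 1000000))
```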

@Anton-Latukha Anton-Latukha mentioned this pull request Dec 25, 2021
Collaborator

@pepeiborra pepeiborra left a comment


I'm not sure that we want to ever support 2 concurrent HLS instances writing to the same cache. In such a scenario the internal invariant that "no one else is mutating the interface store" no longer holds.

That said, even though this PR is not trivial, the approach is thoughtful and I cannot see any problem with it.

Review threads (resolved): ghcide/session-loader/Development/IDE/Session.hs (×3), ghcide/src/Development/IDE/Core/FileStore.hs
… writerThread with retries, promote time duration and maxRetryCount constants to top level
@wz1000
Collaborator

wz1000 commented Dec 27, 2021

Perhaps we should also retry around HieDb.runCommand in ghcide/src/Development/IDE/Main.hs to better support usecases like #2543 ?

@wz1000
Collaborator

wz1000 commented Dec 27, 2021

I'm not sure that we want to ever support 2 concurrent HLS instances writing to the same cache. In such a scenario the internal invariant that "no one else is mutating the interface store" no longer holds.

I realised this shortly after leaving my last comment, but I still think it could be valuable. Many people, especially beginners, tend to use the IDE on standalone Haskell files in their home directory or somewhere, without a .cabal file. It should be OK to have multiple sessions of the IDE in these cases, especially if the sessions don't overlap in terms of the files they load.

@eddiemundo
Collaborator Author

eddiemundo commented Dec 27, 2021

OK, HieDb.runCommand has been wrapped, and grepping doesn't turn up any more uses of hiedb that hit SQLite which aren't wrapped.

Since HieDb.runCommand isn't actually passed a HieDb connection but uses one internally, I refactored retryOnSqliteBusy into multiple functions, the chief being retryOnException, which potentially retries an m a instead of a HieDb -> m a, and added another function on top that does take a HieDb -> m a and uses retryOnException internally.

I also noticed that wrapping createExportsMapHieDb with a retry was too coarse-grained because it internally makes 2 hiedb calls, one of them in a loop, so I pushed the withHieDb callback into it so that in the event of a busy error it won't retry as many things.
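Roughly, the split described here might look like the following (hypothetical names and a placeholder Conn type, with bodies delegating to the earlier backoff sketch rather than the PR's real signatures):

```haskell
-- Placeholder for the hiedb connection type, just for the sketch.
data Conn = Conn

-- General combinator: retry a plain action (an `IO a`) on SQLITE_BUSY.
retryOnException :: IO a -> IO a
retryOnException = retryOnSqliteBusySketch 10 1000 1000000

-- hiedb-flavoured wrapper built on top: retry an action that takes a connection.
retryOnSqliteBusy :: (Conn -> IO a) -> Conn -> IO a
retryOnSqliteBusy act conn = retryOnException (act conn)
```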

Collaborator

@wz1000 wz1000 left a comment


Looks good. Thanks!

@jneira jneira added the merge me Label to trigger pull request merge label Dec 28, 2021
@mergify mergify bot merged commit 2fa5994 into haskell:master Dec 28, 2021
@Anton-Latukha
Collaborator

🎄

Hey. This merged. Thanks.

Improving test stability is appreciated. We did have problems with it, and we never know when future structure changes/optimizations would uncover real DB issues again.
We frequently have multiple HLS processes. Also, when people start debugging HLS they often run several IDEs to check the HLS integration/behavior in different IDEs, and during that they most probably test on the same project.
Also, DB work is always important and not often looked into. So happy to see work happening, thanks.

drsooch pushed a commit to drsooch/haskell-language-server that referenced this pull request Dec 29, 2021
* fix sql busy database is locked errors using retries

* fix ghc 9.0+

* hlint fixes

* fix ghc 9.0+ again

* remove accidentally added redundant liftIO, remove accidentally added empty lines

* add logging to retries, add exponential backoff, refactor tests

* add random-1.2.1 to older stack.yamls

* use Random typeclass instead of uniformR, revert stack.yamls

* logInfo instead of logDebug

* dont wrap action with hiedb retry on the writer side, give hiedb retry wrapper to action instead

* bump log priorities to warning, wrap all hiedb calls in runWithDb and writerThread with retries, promote time duration and maxRetryCount constants to top level

* fix ghc 9.0.1

* refactor retryOnSqliteBusy into retryOnException et al, wrap Hiedb.runCommand with retry, fix tests

* push WithHieDb into createExportsMapHieDb to potentially retry less stuff, move WithHieDb to Types.Shake to avoid circular dependency

Co-authored-by: Javier Neira <atreyu.bbb@gmail.com>
drsooch pushed a commit to drsooch/haskell-language-server that referenced this pull request Jan 18, 2022