
Fix random SQLite busy database is locked errors #2527

Merged
merged 16 commits into haskell:master from fix-sql-busy-database-locked on Dec 28, 2021

Conversation

eddiemundo
Collaborator

@eddiemundo eddiemundo commented Dec 23, 2021

If hiedb raises an SQLITE_BUSY exception, retry with random exponential backoff for both reads and writes. The max delay is 1 second, the base delay is 1 millisecond, and the max number of retries is 10.
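For illustration, here is a minimal sketch of that retry shape in Haskell. The name `retryOnSqliteBusySketch` and the exact exception plumbing are assumptions for the example, not the PR's actual code; it assumes the error surfaces as sqlite-simple's `SQLError` with direct-sqlite's `ErrorBusy` code.

```haskell
import           Control.Concurrent     (threadDelay)
import           Control.Exception      (throwIO, try)
import qualified Database.SQLite.Simple as SQL
import           Database.SQLite3       (Error (ErrorBusy))
import           System.Random          (randomRIO)

-- Retry an IO action when it throws SQLITE_BUSY, sleeping for a random,
-- exponentially growing delay (in microseconds) between attempts.
retryOnSqliteBusySketch :: Int -> Int -> Int -> IO a -> IO a
retryOnSqliteBusySketch maxRetries baseDelayUs maxDelayUs act = go 0
  where
    go n = do
      result <- try act
      case result of
        Right a -> pure a
        Left e
          | SQL.sqlError e == ErrorBusy, n < maxRetries -> do
              -- cap the exponential delay, then pick a random value up to the cap
              let cap = min maxDelayUs (baseDelayUs * 2 ^ n)
              delay <- randomRIO (baseDelayUs, cap)
              threadDelay delay
              go (n + 1)
          | otherwise -> throwIO e
```

With the parameters above, a call would look like `retryOnSqliteBusySketch 10 1000 1000000 (hiedbCall db)`: 10 retries, 1 ms base delay, 1 s cap.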

This potentially solves the case where we have multiple HLS instances using the same SQLite database, although if multiple clients could connect to the same server this solution might become moot again.

Edit: The below is now mostly irrelevant because WAL mode is already used for the sqlite db. Oops.

There seem to be a lot of ways of solving this problem, and so what I've done might not be what we want to do.

Here's a summary of what I've learned...

Background:

The SQLite we use is bundled in direct-sqlite and uses the default configuration for most things. This means SQLite uses a rollback journal, and that its sqlite3_busy_handler is null. With a null busy handler, when a connection tries to acquire a lock and is blocked by another connection already holding certain locks, it returns SQLITE_BUSY immediately, which manifests as an exception.

When using a rollback journal there are 4 locks that can be acquired: "shared", "reserved", "pending", "exclusive".

  • Both reads and writes always acquire shared locks. Multiple processes (I don't think they mean OS processes) can hold shared locks at the same time.
  • When a process wants to write it acquires a reserved lock. Only 1 process can hold a reserved lock at a time. If it fails to acquire a lock then SQLITE_BUSY.
  • After the reserved lock is acquired the process creates a rollback journal.
  • When the process wants to update the database file it acquires a pending lock. If another process tries to acquire a shared (or reserved?) lock at this time it will fail with SQLITE_BUSY.
  • Once all other shared locks are released, it also acquires an exclusive lock. If another process tries to acquire a lock at this time it will fail with SQLITE_BUSY.
  • When it is done updating the database file, both pending and exclusive locks are released.

Problem:

The above means that a write can cause a read to SQLITE_BUSY. In fact, it seems that even a read can occasionally cause a read or write to SQLITE_BUSY, although that should be extremely rare because the rollback journal has to be "hot" for it to happen.

Currently we serialize writes on one connection, and do reads on another. The writes happen on one thread, the reads can happen from multiple threads. Our random exceptions are most likely a read trying to acquire a shared lock when a write is finishing up.

Solutions:

  • Turn on write-ahead logging mode. This journaling mode does not need the locking procedure described above; I think only writes need to be serialized, which means this is probably a drop-in solution (a sketch of the pragma appears after this list). Performance for reads is reportedly a bit slower (1-2%), and if we really want to tune performance then fiddling with "checkpoints" is necessary.
    SQLITE_BUSY can still occur in rare cases:
    • If the last connection to a database is exiting then it can acquire a lock while doing cleanup, and if another connection opens and tries to query then SQLITE_BUSY.
    • If the last connection to a database crashes, then the next new connection will acquire a lock while doing recovery, and hence a 3rd connection doing a query can SQLITE_BUSY.
  • When SQLITE_BUSY is received, retry the read. This is what this PR does. Not entirely satisfying, because the right sleep times depend on how long the pertinent locks are held. Currently the thread sleeps for at least 1 millisecond before retrying, for a maximum of 10 retries.
  • Providing an sqlite3_busy_handler. This would more or less do what the retrying above does, but maybe better. I think this requires sending PRs to expose the functionality in direct-sqlite, but maybe I didn't look hard enough.
  • Have reads wait for in-progress writes. I think this can be done by creating a transaction for reads that checks the in-progress TVar in the hiedbWriter in ShakeExtras, or by using some kind of broadcast signaling synchronization abstraction. There are still some edge cases that probably don't matter, e.g. a read is recovering a hot journal and another read or write comes in.
  • Serialize reads and writes. Not sure how to do this one without changing how the current system works. Probably some kind of MVar protecting a single connection, and spawning a thread per write if we don't want to block on writes. I don't think performance will be much different.
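For the WAL option above, switching journal modes is a single pragma issued on an open connection; a minimal sketch using sqlite-simple (the function name is illustrative, and connection setup is elided):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Database.SQLite.Simple as SQL

-- Switch an open connection to write-ahead logging. The WAL setting persists
-- in the database file, so later connections pick it up as well.
enableWal :: SQL.Connection -> IO ()
enableWal conn = SQL.execute_ conn "PRAGMA journal_mode=WAL;"
```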

I don't know much about SQLite so maybe I've got something wrong, or missed some alternatives. I might try WAL mode on another branch since it seems like a one-liner, but I don't know how to test it in HLS. If I do, I only know how to create synthetic examples in sqlite-simple or direct-sqlite.

@jneira
Member

jneira commented Dec 23, 2021

I think insights from @wz1000 would be useful with sqlite issues

@wz1000
Collaborator

wz1000 commented Dec 23, 2021

Turn on write-ahead logging mode. This journaling mode does not need the locking procedure described above. I think only writes need to be serialized.

We already turn on WAL mode in hiedb: https://github.com/wz1000/HieDb/blob/305a896adbaf78175ebd866c0e665d99c093cab8/src/HieDb/Create.hs#L74

SQLITE_BUSY can still occur in rare cases:

Are we sure this is not responsible for the errors?

Currently we serialize all writes to the database (that is what HieDbWriter is for). There is one writer thread and an arbitrary number of reader threads. The second part is important, since we consult the database in many places to provide results for the user, and we would not like to serialize these reads.

Retrying on exceptions is a plausible solution, but I don't quite understand why it is necessary given our usage scenario. Is it possible to reproduce the same errors on a simple test program with the same usage pattern? (i.e. WAL mode, one writer, multiple readers)

@jneira
Member

jneira commented Dec 23, 2021

Out of curiosity, is there some reported issue related to this, or even informal reports in any channel?

@pepeiborra
Collaborator

This was creating problems in our testsuite, reported here: #1430 (comment). I haven't seen these errors reproduce again since this commit: a044156

@wz1000
Collaborator

wz1000 commented Dec 23, 2021

This reminds me, perhaps we should be using wal2 mode now: https://www.sqlite.org/cgi/src/doc/wal2/doc/wal2.md

Last time I tried, the version of SQLite bundled with direct-sqlite was too old.

@eddiemundo
Collaborator Author

I think I read init_conn back when I didn't know what WAL was, so I filtered out the pragmas; then, once I learned what it was, I assumed I already knew what init_conn did and went full steam ahead.

It does seem like #2484 fixes the rare cases, and my PR is not needed because of WAL mode.

@eddiemundo eddiemundo closed this Dec 23, 2021
@eddiemundo eddiemundo deleted the fix-sql-busy-database-locked branch December 23, 2021 18:11
@wz1000
Collaborator

wz1000 commented Dec 23, 2021

I think it could still be valuable, especially if a log message was emitted every time a transaction was retried. You can never really be too sure about this.

These show up much more consistently when we have multiple HLS processes accessing the same database, possibly because of multiple editor sessions. Retrying is the best we can do in that case.

We also probably want a random exponential backoff instead of a fixed delay. See https://en.wikipedia.org/wiki/Exponential_backoff. I found https://hackage.haskell.org/package/retry after a quick search, though it might just be best to implement it ourselves instead of incurring a dependency.

Feel free to reopen if you want to continue with this.

@eddiemundo eddiemundo restored the fix-sql-busy-database-locked branch December 23, 2021 18:32
@eddiemundo
Collaborator Author

Oh, I didn't think of multiple HLS instances hitting a single database; in that case the writes might also need to retry.

I'll add logging retries and random exponential backoff.

@eddiemundo eddiemundo reopened this Dec 23, 2021
Collaborator

@wz1000 wz1000 left a comment


Looks good so far. We need to use the backoff in the writer thread too (the indexHieFile function).

@pepeiborra
Collaborator

Are there any performance implications?

@wz1000
Collaborator

wz1000 commented Dec 24, 2021

I don't think there should be in the absence of contention, since we only retry if a query fails.

@eddiemundo
Collaborator Author

Looks good so far. We need to use the backoff in the writer thread too (the indexHieFile function).

I see... I thought having the thread on the receiving side of the TQueue wrap the function it receives with retries would be enough, but that function may do too much (sending messages, other computations), so we'd be retrying stuff other than hiedb calls.

So I've threaded the retry wrapper through so the client can decide what to retry.
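A hypothetical sketch of that shape, reusing `retryOnSqliteBusySketch` from the earlier example (the queue element type and all names here are illustrative, not the PR's actual definitions): the writer thread hands each dequeued action a retry wrapper, and the action applies it only around its own hiedb calls.

```haskell
{-# LANGUAGE RankNTypes #-}
import Control.Concurrent.STM (TQueue, atomically, readTQueue)
import Control.Monad          (forever)

-- A wrapper the queued action can apply around individual hiedb calls.
newtype WithRetry = WithRetry (forall a. IO a -> IO a)

writerThreadSketch :: TQueue (WithRetry -> IO ()) -> IO ()
writerThreadSketch queue = forever $ do
  action <- atomically (readTQueue queue)
  -- the action decides which of its calls get wrapped, e.g.
  -- \(WithRetry retry) -> retry (someHieDbWrite db)
  action (WithRetry (retryOnSqliteBusySketch 10 1000 1000000))
```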

@Anton-Latukha Anton-Latukha mentioned this pull request Dec 25, 2021
Collaborator

@pepeiborra pepeiborra left a comment


I'm not sure that we want to ever support 2 concurrent HLS instances writing to the same cache. In such a scenario the internal invariant that "no one else is mutating the interface store" no longer holds.

That said, even though this PR is not trivial, the approach is thoughtful and I cannot see any problem with it.

Review threads (resolved): ghcide/session-loader/Development/IDE/Session.hs (×3), ghcide/src/Development/IDE/Core/FileStore.hs
… writerThread with retries, promote time duration and maxRetryCount constants to top level
@wz1000
Collaborator

wz1000 commented Dec 27, 2021

Perhaps we should also retry around HieDb.runCommand in ghcide/src/Development/IDE/Main.hs to better support usecases like #2543 ?

@wz1000
Collaborator

wz1000 commented Dec 27, 2021

I'm not sure that we want to ever support 2 concurrent HLS instances writing to the same cache. In such a scenario the internal invariant that "no one else is mutating the interface store" no longer holds.

I realised this shortly after leaving my last comment, but I still think it could be valuable. Many people, especially beginners, tend to use the IDE on standalone Haskell files in their home directory or somewhere, without a .cabal file. It should be OK to have multiple sessions of the IDE in these cases, especially if the sessions don't overlap in terms of the files they load.

@eddiemundo
Collaborator Author

eddiemundo commented Dec 27, 2021

OK, HieDb.runCommand has been wrapped, and grepping doesn't turn up any more uses of hiedb that hit SQLite which aren't wrapped.

Since HieDb.runCommand isn't actually passed a HieDb connection but uses one internally, I refactored retryOnSqliteBusy into multiple functions, the chief being retryOnException, which potentially retries an m a instead of a HieDb -> m a, and added another function on top that does take a HieDb -> m a and uses retryOnException internally.

I also noticed that wrapping createExportsMapHieDb with a retry was too coarse-grained because it internally makes 2 hiedb calls, one of them in a loop, so I pushed the withHieDb callback into it so that in the event of a busy error it won't retry as many things.
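Roughly, the split described here might look like the following (hypothetical names and a placeholder Conn type, with bodies delegating to the earlier backoff sketch rather than the PR's real signatures):

```haskell
-- Placeholder for the hiedb connection type, just for the sketch.
data Conn = Conn

-- General combinator: retry a plain action (an `IO a`) on SQLITE_BUSY.
retryOnException :: IO a -> IO a
retryOnException = retryOnSqliteBusySketch 10 1000 1000000

-- hiedb-flavoured wrapper built on top: retry an action that takes a connection.
retryOnSqliteBusy :: (Conn -> IO a) -> Conn -> IO a
retryOnSqliteBusy act conn = retryOnException (act conn)
```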

Collaborator

@wz1000 wz1000 left a comment


Looks good. Thanks!

@jneira jneira added the merge me Label to trigger pull request merge label Dec 28, 2021
@mergify mergify bot merged commit 2fa5994 into haskell:master Dec 28, 2021
@Anton-Latukha
Collaborator

🎄

Hey. This merged. Thanks.

Improving test stability is appreciated. We did have problems with it, and we never know when future structure changes/optimizations would uncover real DB issues again.
We frequently have multiple HLS processes. Also, when people start debugging HLS they often run several IDEs to check the HLS integration/behavior in different IDEs, and during that they most probably test on the same project.
Also, DB work is always important and not often looked into. So happy to see work happening, thanks.

drsooch pushed a commit to drsooch/haskell-language-server that referenced this pull request Dec 29, 2021
* fix sql busy database is locked errors using retries

* fix ghc 9.0+

* hlint fixes

* fix ghc 9.0+ again

* remove accidentally added redundant liftIO, remove accidentally added empty lines

* add logging to retries, add exponential backoff, refactor tests

* add random-1.2.1 to older stack.yamls

* use Random typeclass instead of uniformR, revert stack.yamls

* logInfo instead of logDebug

* dont wrap action with hiedb retry on the writer side, give hiedb retry wrapper to action instead

* bump log priorities to warning, wrap all hiedb calls in runWithDb and writerThread with retries, promote time duration and maxRetryCount constants to top level

* fix ghc 9.0.1

* refactor retryOnSqliteBusy into retryOnException et al, wrap Hiedb.runCommand with retry, fix tests

* push WithHieDb into createExportsMapHieDb to potentially retry less stuff, move WithHieDb to Types.Shake to avoid circular dependency

Co-authored-by: Javier Neira <atreyu.bbb@gmail.com>
drsooch pushed a commit to drsooch/haskell-language-server that referenced this pull request Jan 18, 2022