
Attempt to fix SQLITE_BUSY outside of pathological cases [ci full] #1260

Closed
wants to merge 5 commits

Conversation


@thomcc thomcc commented Jun 7, 2019

Fixes #1230.

This basically does two things:

  1. Retries in a loop.
  2. Enables the `unlock_notify` rusqlite feature. (This isn't actually required on its own, but it ensures that
    rusqlite calls `sqlite3_reset` on prepared statements after a busy/locked error; see @linacambridge's comment at https://bugzilla.mozilla.org/show_bug.cgi?id=1435446#c17.)

The downside of this is that we now require you:

  • To set SQLCIPHER_NO_PKG_CONFIG=1 in your dev environment.
  • To use the SQLCipher built in libs/desktop.

Both of these are done by ./libs/bootstrap-desktop.sh, but we previously had (undocumented) support for using homebrew to install sqlcipher and openssl. That is no longer the case.

This was already true in most cases, but Mac users might hit a snag: SQLCipher on macOS (and probably elsewhere) does not set SQLITE_ENABLE_UNLOCK_NOTIFY by default, which we now need. Unfortunate.

If you still have an issue after running ./libs/bootstrap-desktop.sh, make sure you do a cargo clean, since SQLCIPHER_NO_PKG_CONFIG doesn't seem to always get picked up.

I also took the time to pass =1 for all our defines since, while it's not 100% relevant here, it confused me when debugging: there are cases where SQLite uses `#if SQLITE_ENABLE_UNLOCK_NOTIFY` (and not `#ifdef`).
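For illustration, the retry-in-a-loop behavior can be sketched as a small generic helper. This is a minimal sketch using only std: the name `retry_on_busy`, the string error type, and the backoff schedule are all hypothetical stand-ins, not the actual patch (which matches on rusqlite's SQLITE_BUSY/SQLITE_LOCKED errors).

```rust
use std::thread;
use std::time::Duration;

/// Hypothetical sketch: run `op` up to `max_attempts` times, sleeping a
/// little longer after each "busy"-style failure. Each iteration assigns
/// to `result`, so repeated failures surface the last error.
fn retry_on_busy<T, F>(mut op: F, max_attempts: u32) -> Result<T, String>
where
    F: FnMut() -> Result<T, String>,
{
    let mut result = op();
    let mut attempt = 1;
    while result.is_err() && attempt < max_attempts {
        // Backoff schedule is illustrative: 10ms, 20ms, 30ms, ...
        thread::sleep(Duration::from_millis(10 * attempt as u64));
        result = op();
        attempt += 1;
    }
    result
}

fn main() {
    // Simulate an operation that is "busy" twice, then succeeds.
    let mut calls = 0;
    let result = retry_on_busy(
        || {
            calls += 1;
            if calls < 3 { Err("busy".to_string()) } else { Ok(calls) }
        },
        5,
    );
    assert_eq!(result, Ok(3));
}
```

Note that in the real patch SQLite itself may also wait up to `sqlite3_busy_timeout` ms inside each attempt, so the loop's sleeps are on top of SQLite's own waiting.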

Pull Request checklist

  • Quality: This PR builds and tests run cleanly
    • cargo test --all produces no test failures
    • cargo clippy --all --all-targets --all-features runs without emitting any warnings
    • cargo fmt does not produce any changes to the code
    • ./gradlew ktlint detekt runs without emitting any warnings
    • swiftformat --swiftversion 4 megazords components/*/ios && swiftlint runs without emitting any warnings or producing changes
    • Note: For changes that need extra cross-platform testing, consider adding [ci full] to the PR title.
  • Tests: This PR includes thorough tests or an explanation of why it does not
  • Changelog: This PR includes a changelog entry or an explanation of why it does not need one
    • Any breaking changes to Swift or Kotlin binding APIs are noted explicitly
  • Dependencies: This PR follows our dependency management guidelines
    • Any new dependencies are accompanied by a summary of the due diligence applied in selecting them.

Thom Chiovoloni added 5 commits June 7, 2019 12:37
- breaks support for homebrew-installed SQLcipher
- Also ensures we pass `=1` to all defines, since apparently SQLite uses
  `#if SQLITE_ENABLE_BLAH` (and not #ifdef) in some cases.
@thomcc

thomcc commented Jun 8, 2019

Note: The pathological cases I'm referring to are cases where one transaction is open for more than ~30 seconds; we'll still fail with a busy error if this happens.

@mhammond

mhammond commented Jun 8, 2019

Thanks for digging in here. The patch is obviously fine (sans micro-nits), but I'm concerned about the strategy.

IIUC, our working hypothesis is that a sync is probably running when this happens. This means that a single "chunked transaction" is taking over 4 seconds to commit (assuming the noteObservation is made immediately after the sync starts, and sync correctly attempts a commit after 1 second). The device is obviously performing poorly, so this may be one of many "chunks" the sync tries to do. We don't yet have telemetry, but when we do, we might be horrified at the time outliers take to sync.

We also have no insight into how long the transaction actually took - again, telemetry seems close and might give us actual insights. Sentry implies most users see it once or twice but not regularly.

Assuming the browser is being actively used, there may be many noteObservation calls just in that 5 seconds. If I'm reading the patch correctly, we end up retrying a number of times for each call - up to ~30 seconds each one. ISTM that it wouldn't take long before every available thread is in this retry loop, waiting for some sync-related thing that we don't yet understand to complete.

I'm also not sure why we need a backoff strategy like this. How is this different from simply asking sqlite to use a 30 second timeout? That also has the advantage of unblocking ASAP, whereas with this patch there seems a chance that the thread just started sleeping for 5 seconds when the lock became available - it's then got to wait out those 5 seconds before trying again, at which point something else might now have the lock (eg, history finished and bookmarks started in that 5 seconds). Or maybe drop the chunked transaction timeout down to 500ms?
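The "missed window" concern here can be illustrated with a toy timing model (pure arithmetic, not real SQLite behavior; all numbers and function names are hypothetical): a busy-timeout wait returns as soon as the lock frees, while a sleep-then-retry loop only observes the lock at its retry instants.

```rust
/// Toy model: given the instant (in ms) at which the lock becomes free,
/// return when each strategy would first acquire it.
///
/// `busy_timeout` blocks inside SQLite and wakes as soon as the lock
/// frees (up to the timeout), so it acquires at `free_at`.
fn busy_timeout_acquires_at(free_at: u64, timeout: u64) -> Option<u64> {
    if free_at <= timeout { Some(free_at) } else { None }
}

/// A sleep-then-retry loop only checks at t = 0, sleep_ms, 2*sleep_ms, ...
fn sleep_retry_acquires_at(free_at: u64, sleep_ms: u64, max_attempts: u64) -> Option<u64> {
    (0..max_attempts)
        .map(|i| i * sleep_ms)
        .find(|&t| t >= free_at)
}

fn main() {
    // Lock frees at t=5001ms, just after a 5s sleep started: the retry
    // loop doesn't notice until t=10000ms, while busy_timeout wakes at 5001.
    assert_eq!(busy_timeout_acquires_at(5001, 30_000), Some(5001));
    assert_eq!(sleep_retry_acquires_at(5001, 5000, 6), Some(10_000));
}
```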

Another possible strategy would be a simple queue of observations - eg, on busy we just stick the failed observation in a queue/vec, and the next time we are asked to add an observation we add all queued ones too, all in a single transaction, which would almost certainly perform better than just "queueing" the many small transactions, which is effectively what this patch does.
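The queue idea could look roughly like this. A minimal sketch using only std: `ObservationQueue`, the `String` stand-in for observation records, and the `write_all` callback (standing in for one transaction writing many rows) are all hypothetical, not existing Places types.

```rust
/// Sketch of the "queue failed observations" strategy: on a busy error,
/// park the observation; the next call flushes the whole backlog in a
/// single (simulated) transaction.
struct ObservationQueue {
    pending: Vec<String>, // stand-in for real observation records
}

impl ObservationQueue {
    fn new() -> Self {
        ObservationQueue { pending: Vec::new() }
    }

    /// `write_all` stands in for "one transaction writing many rows";
    /// it returns Err(()) when the database reports busy.
    fn note_observation<F>(&mut self, obs: String, write_all: F) -> Result<usize, ()>
    where
        F: FnOnce(&[String]) -> Result<(), ()>,
    {
        self.pending.push(obs);
        match write_all(&self.pending) {
            Ok(()) => {
                let written = self.pending.len();
                self.pending.clear();
                Ok(written)
            }
            Err(()) => Err(()), // keep everything queued for next time
        }
    }
}

fn main() {
    let mut q = ObservationQueue::new();
    // First write hits a busy error: the observation stays queued.
    assert!(q.note_observation("obs-1".to_string(), |_rows: &[String]| Err(())).is_err());
    // Next write succeeds and flushes both rows in one transaction.
    assert_eq!(
        q.note_observation("obs-2".to_string(), |rows: &[String]| {
            assert_eq!(rows.len(), 2);
            Ok(())
        }),
        Ok(2)
    );
}
```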

So I'm mildly -1 on doing this at this time - but open to persuasion if others think it is the correct approach. Given how rare it happens and given Fenix is still pre-release, just dropping those noteObservation calls on the floor seems reasonable (but obviously not ideal) and much lower risk given we don't have a clear understanding of what's going on yet.

@mhammond

mhammond commented Jun 10, 2019

Thom and I had a bit of a chat in Slack about this. I think doing exactly one retry, without a sleep(), when we see this error is less risky and has a reasonable chance of success. This should double the time we allow for the other side to complete, and would probably be effective if the underlying issue is that Android put both our threads to sleep during this period, meaning that by the time we woke the timeout had expired even though nothing was actually done.

Thom feels fairly strongly that the reset part of this patch is worthwhile too, and while I agree with that in general, I'm on the fence about whether the risk is worthwhile at this point in time.
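The agreed-on behavior (exactly one retry, no sleep) is a much smaller loop. A minimal sketch using only std; the name `with_single_retry` and the string error type are hypothetical, and the real code would match on rusqlite's SQLITE_BUSY/SQLITE_LOCKED error variants.

```rust
/// Sketch of "exactly one retry, no sleep": run `op` once, and if it
/// fails with a busy-style error, try exactly once more.
fn with_single_retry<T, F>(mut op: F) -> Result<T, String>
where
    F: FnMut() -> Result<T, String>,
{
    match op() {
        Err(e) if e == "busy" => op(), // one immediate retry, then give up
        other => other,
    }
}

fn main() {
    let mut attempts = 0;
    let result = with_single_retry(|| {
        attempts += 1;
        if attempts == 1 { Err("busy".to_string()) } else { Ok("committed") }
    });
    // The first attempt was busy; the single retry succeeded.
    assert_eq!(result, Ok("committed"));
    assert_eq!(attempts, 2);
}
```

Non-busy errors are not retried, which keeps the helper from masking real failures.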

}

/// A helper that attempts to get an Immediate lock on the DB. If it fails with
/// a "busy" or "locked" error, it does exactly 1 retry.

Nit: This isn't true anymore. 😄

return result.map_err(Error::from);
}
// Do the retry loop. Each iteration will assign to `result`, so that in the
// case of repeated BUSY failures, w

Nit: Cliffhanger comment!

// These are fairly arbitrary. We'll retry 5 times before giving up
// completely, but after each failure, we wait for longer than the previous.
// Note that between each attempt, SQLite itself will wait up to
// `sqlite3_busy_timeout` ms, which is 5000 by default.

Hmm, my understanding is pretty fuzzy on how sqlite3_busy_timeout, sqlite3_unlock_notify, SQLITE_BUSY, SQLITE_LOCKED, and shared-cache mode interact. I thought:

  • The unlock notification is only triggered on SQLITE_LOCKED, which requires shared-cache mode.
  • Shared cache mode ignores sqlite3_busy_timeout, since it's the same logical connection as far as SQLite is concerned.
  • SQLITE_BUSY happens if two different logical connections (the default for us—AFAIK, we only use shared-cache in PlacesApi::new_memory) are writing to the same database, and one can't get the lock after waiting for sqlite3_busy_timeout.
  • The unlock notification isn't called for SQLITE_BUSY.
  • However, the docs say:

If sqlite3_unlock_notify() is called in a multi-threaded application, there is a chance that the blocking connection will have already concluded its transaction by the time sqlite3_unlock_notify() is invoked. If this happens, then the specified callback is invoked immediately, from within the call to sqlite3_unlock_notify().

So...

Does that mean that rusqlite's wait_for_unlock_notify immediately calls unlock_notify_cb, which fires the callback and then, in RawStatement::step, calls reset? So we're using the unlock notification to work around rusqlite not exposing reset?

Or does unlock_notify_cb get called for SQLITE_BUSY, too, and not just SQLITE_LOCKED?

@linabutler

How is this different than simply asking sqlite to use a 30 second timeout?

I'm not sure, but, when working on bug 1435446, I noticed that neither bumping PRAGMA busy_timeout nor retrying the same statement without resetting worked. Bumping the timeout still caused SQLITE_BUSY errors for the same statement (and, due to the mix of sync and async methods in Places, would sometimes block the main thread for synchronous statements). Retrying the same statement, or retrying a statement in a DEFERRED transaction also returned SQLITE_BUSY indefinitely.

I couldn't reproduce in an xpcshell test, or using the C API directly in a standalone app—in those cases, both retrying the statement a few times and bumping busy_timeout worked, and SQLite un-stuck itself. But that bug was easy to repro in a Desktop build...just open the library, start moving bookmarks around, and SQLite would wedge with SQLITE_BUSY in less than 10 seconds. It would also trigger randomly and reliably when syncing history, adding bookmarks, hitting "Enter" in the awesomebar—anything that wrote to the database could contend with sync eventually.

Changing all Places transactions from DEFERRED to IMMEDIATE still caused infinite loops, but only on BEGIN IMMEDIATE TRANSACTION—once that ran successfully, all statements would run and COMMIT just fine. Adding the sqlite3_reset is what would let it get unstuck. It would still retry many (50-100?) times, since it was busy-waiting with PR_INTERVAL_NO_WAIT...but, eventually, it'd succeed, once the other transaction finished, committed, and SQLite finished writing to disk.

TL;DR: I haven't the foggiest why Desktop works. 😕 But, since that turned out to be a pretty simple fix after a week of debugging, I didn't spend more time on it. 😅

on busy we just stick the failed observation in a queue/vec, and the next time we are asked to add an observation we add all queued ones too, all in a single transaction, which would almost certainly perform better than just "queueing" the many small transactions

That's also a good idea—I think we'll still want a retry, though, or we might risk losing data when the app is backgrounded and Android force-stops the process. With busy-spinning, that might happen today...though, like I noted on Desktop, with those two changes, it was only a few milliseconds of waiting. (With trace logging in the merge transaction, it was closer to a second—some lag, but the operation would still run quickly). I never saw multi-second wait times.

If we queue as soon as we get a SQLITE_BUSY, we run the risk of not writing anything (potentially indefinitely? What if every statement executed after that returns SQLITE_BUSY?), history or bookmarks, and losing that data when Android stops the process. It's way less predictable and durable than Desktop.

@eoger

eoger commented Aug 22, 2019

Please re-tag for review once this is ready!

@rfk

rfk commented Dec 12, 2019

This PR is getting pretty old, is it something we want to keep or should we close it and revisit afresh at some later time?

@eoger

eoger commented Apr 21, 2020

Let's re-open if we're going to do this.

@eoger eoger closed this Apr 21, 2020
@rfk rfk deleted the retry-tx-loop branch June 7, 2021 10:03

Successfully merging this pull request may close these issues.

PlacesConnectionBusy crash