Attempt to fix SQLITE_BUSY outside of pathological cases [ci full] #1260
- breaks support for homebrew-installed SQLcipher
- Also ensures we pass `=1` to all defines, since apparently SQLite uses `#if SQLITE_ENABLE_BLAH` (and not `#ifdef`) in some cases.
…d on statements after a SQLITE_BUSY
Note: The pathological cases I'm referring to are cases where one transaction is open for more than ~30 seconds. We'll still fail with a busy error if this happens.
Thanks for digging in here. The patch is obviously fine (sans micro-nits), but I'm concerned about the strategy.

IIUC, our working hypothesis is that a sync is probably running when this happens. This means that a single "chunked transaction" is taking over 4 seconds to commit (assuming the noteObservation is made immediately after the sync starts, and sync correctly attempts a commit after 1 second.) The device is obviously performing poorly, so this may be one of many "chunks" the sync tries to do. We don't yet have telemetry, but when we do, we might be horrified about the time outliers take to sync. We also have no insight into how long the transaction actually took - again, telemetry seems close and might give us actual insights. Sentry implies most users see it once or twice but not regularly.

Assuming the browser is being actively used, there may be many noteObservation calls just in that 5 seconds. If I'm reading the patch correctly, we end up retrying a number of times for each call - up to ~30 seconds each one. ISTM that it wouldn't take long before every available thread is in this retry loop, waiting for some sync-related thing that we don't yet understand to complete.

I'm also not sure why we need a backoff strategy like this? How is this different than simply asking sqlite to use a 30 second timeout? That also has the advantage of unblocking ASAP, whereas with this patch, there seems a chance that the thread just started sleeping for 5 seconds when the lock became available - it's then got to wait 5 seconds before trying again, at which point something else might now have the lock (eg, history finished and bookmarks started in that 5 seconds). Or maybe drop the chunked transaction timeout down to 500ms?
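To make the concern above concrete, here is a minimal sketch of a retry-with-backoff loop of the kind being discussed. It is not the code from the patch: `DbError`, `retry_with_backoff`, `worst_case_block`, and the injected `sleep` callback are hypothetical names, and the delays are left to the caller because the patch's actual values are not shown in this thread.

```rust
use std::time::Duration;

/// Hypothetical error type standing in for the SQLite result codes
/// discussed in this thread; not the real error type of the crate.
#[derive(Debug, PartialEq)]
pub enum DbError {
    Busy,
    Other(String),
}

/// Runs `op` once, then retries after each delay in `backoff` for as long
/// as the result is `Busy`. `sleep` is injected so the loop can be tested
/// without actually blocking; production code would pass `std::thread::sleep`.
pub fn retry_with_backoff<T>(
    mut op: impl FnMut() -> Result<T, DbError>,
    backoff: &[Duration],
    mut sleep: impl FnMut(Duration),
) -> Result<T, DbError> {
    let mut result = op();
    for delay in backoff {
        if !matches!(result, Err(DbError::Busy)) {
            break;
        }
        // The thread cannot notice the lock being released while it sleeps.
        sleep(*delay);
        result = op();
    }
    result
}

/// Upper bound on how long one caller can be blocked, assuming every
/// attempt waits the full `busy_timeout` inside SQLite before failing.
pub fn worst_case_block(busy_timeout: Duration, backoff: &[Duration]) -> Duration {
    let attempts = backoff.len() as u32 + 1;
    busy_timeout * attempts + backoff.iter().sum::<Duration>()
}
```

Under these assumptions, each blocked caller holds its thread for up to `worst_case_block`, which is the thread-exhaustion risk described above. A single longer SQLite busy timeout would keep the waiting inside SQLite instead of in a sleep the caller cannot interrupt.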
Another possible strategy would be a simple queue of observations - eg, on busy we just stick the failed observation in a queue/vec, and the next time we are asked to add an observation we add all queued ones too, all in a single transaction, which would almost certainly perform better than just "queueing" the many small transactions, which is effectively what this patch does.

So I'm mildly -1 on doing this at this time - but open to persuasion if others think it is the correct approach. Given how rarely it happens and given Fenix is still pre-release, just dropping those noteObservation calls on the floor seems reasonable (but obviously not ideal) and much lower risk given we don't have a clear understanding of what's going on yet.
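The queueing idea above could look roughly like the following. This is only a sketch under stated assumptions: `Observation`, `ObservationStore`, `QueuedObserver`, and `FlakyStore` are hypothetical names invented for illustration, and `write_batch` stands in for "write everything in a single transaction". None of these are existing APIs in this repository.

```rust
/// Hypothetical observation record; the real type carries more fields.
#[derive(Debug, Clone, PartialEq)]
pub struct Observation {
    pub url: String,
}

/// Hypothetical error type standing in for SQLite result codes.
#[derive(Debug, PartialEq)]
pub enum DbError {
    Busy,
    Other(String),
}

/// Anything that can write a batch of observations in one transaction.
pub trait ObservationStore {
    fn write_batch(&mut self, batch: &[Observation]) -> Result<(), DbError>;
}

/// Holds observations that could not be written because the DB was busy.
#[derive(Debug, Default)]
pub struct QueuedObserver {
    pending: Vec<Observation>,
}

impl QueuedObserver {
    /// Tries to write the new observation together with everything queued.
    /// On `Busy` the whole batch stays queued and the call reports success;
    /// on any other error the new observation is dropped and the error returned.
    pub fn note_observation<S: ObservationStore>(
        &mut self,
        store: &mut S,
        obs: Observation,
    ) -> Result<(), DbError> {
        self.pending.push(obs);
        match store.write_batch(&self.pending) {
            Ok(()) => {
                self.pending.clear();
                Ok(())
            }
            Err(DbError::Busy) => Ok(()),
            Err(other) => {
                self.pending.pop();
                Err(other)
            }
        }
    }

    pub fn pending_len(&self) -> usize {
        self.pending.len()
    }
}

/// In-memory test double that reports `Busy` a fixed number of times.
#[derive(Debug, Default)]
pub struct FlakyStore {
    pub busy_remaining: usize,
    pub written: Vec<Observation>,
}

impl ObservationStore for FlakyStore {
    fn write_batch(&mut self, batch: &[Observation]) -> Result<(), DbError> {
        if self.busy_remaining > 0 {
            self.busy_remaining -= 1;
            return Err(DbError::Busy);
        }
        self.written.extend_from_slice(batch);
        Ok(())
    }
}
```

One trade-off this sketch leaves open: queued observations live only in memory, so they are lost if the process is stopped before the next successful write.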
Thom and I had a bit of a chat in slack about this. I think it's less risky and with a reasonable chance of success to do exactly 1 retry without a sleep() when we see this error. This should both double the time we allow for the other side to complete, and would probably be effective if an underlying issue is that Android caused both our threads to sleep during this period, meaning that when we woke the timeout expired even though nothing was actually done. Thom feels fairly strongly that the |
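A minimal sketch of the "exactly 1 retry without a sleep()" idea, assuming a hypothetical `DbError` enum and a hypothetical helper named `retry_once_on_busy` (neither is taken from the patch):

```rust
/// Hypothetical error type standing in for SQLite result codes.
#[derive(Debug, PartialEq)]
pub enum DbError {
    Busy,
    Locked,
    Other(String),
}

impl DbError {
    /// True for the two contention errors a retry might resolve.
    pub fn is_contention(&self) -> bool {
        matches!(self, DbError::Busy | DbError::Locked)
    }
}

/// Runs `op`; if it fails with a busy/locked error, runs it exactly once
/// more with no sleep in between. Any other outcome is returned unchanged.
pub fn retry_once_on_busy<T>(
    mut op: impl FnMut() -> Result<T, DbError>,
) -> Result<T, DbError> {
    match op() {
        Err(e) if e.is_contention() => op(),
        other => other,
    }
}
```

Because each attempt already waits inside SQLite for the configured busy timeout, a single immediate retry roughly doubles the total time allowed for the other writer to finish, which is the effect described above.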
}

/// A helper that attempts to get an Immediate lock on the DB. If it fails with
/// a "busy" or "locked" error, it does exactly 1 retry.
Nit: This isn't true anymore. 😄
return result.map_err(Error::from);
}
// Do the retry loop. Each iteration will assign to `result`, so that in the
// case of repeated BUSY failures, w
Nit: Cliffhanger comment!
// These are fairly arbitrary. We'll retry 5 times before giving up
// completely, but after each failure, we wait for longer than the previous.
// Note that between each attempt, SQLite itself will wait up to
// `sqlite3_busy_timeout` ms, which is 5000 by default.
Hmm, my understanding is pretty fuzzy on how `sqlite3_busy_timeout`, `sqlite3_unlock_notify`, `SQLITE_BUSY`, `SQLITE_LOCKED`, and shared-cache mode interact. I thought:

- The unlock notification is only triggered on `SQLITE_LOCKED`, which requires shared-cache mode.
- Shared cache mode ignores `sqlite3_busy_timeout`, since it's the same logical connection as far as SQLite is concerned.
- `SQLITE_BUSY` happens if two different logical connections (the default for us; AFAIK, we only use shared-cache in `PlacesApi::new_memory`) are writing to the same database, and one can't get the lock after waiting for `sqlite3_busy_timeout`.
- The unlock notification isn't called for `SQLITE_BUSY`.
- However, the docs say:

  > If sqlite3_unlock_notify() is called in a multi-threaded application, there is a chance that the blocking connection will have already concluded its transaction by the time sqlite3_unlock_notify() is invoked. If this happens, then the specified callback is invoked immediately, from within the call to sqlite3_unlock_notify().

So...

Does that mean that rusqlite's `wait_for_unlock_notify` immediately calls `unlock_notify_cb`, which fires the callback and then, in `RawStatement::step`, calls `reset`? So we're using the unlock notification to work around rusqlite not exposing `reset`?

Or does `unlock_notify_cb` get called for `SQLITE_BUSY`, too, and not just `SQLITE_LOCKED`?
I'm not sure, but, when working on bug 1435446, I noticed that neither bumping I couldn't reproduce in an xpcshell test, or using the C API directly in a standalone app—in those cases, both retrying the statement a few times and bumping Changing all Places transactions from TL;DR: I haven't the foggiest why Desktop works. 😕 But, since that turned out to be a pretty simple fix after a week of debugging, I didn't spend more time on it. 😅
That's also a good idea—I think we'll still want a retry, though, or we might risk losing data when the app is backgrounded and Android force-stops the process. With busy-spinning, that might happen today...though, like I noted on Desktop, with those two changes, it was only a few milliseconds of waiting. (With trace logging in the merge transaction, it was closer to a second—some lag, but the operation would still run quickly). I never saw multi-second wait times. If we queue as soon as we get a |
Please re-tag for review once this is ready!
This PR is getting pretty old, is it something we want to keep or should we close it and revisit afresh at some later time?
Let's re-open if we're going to do this.
Fixes #1230.

This basically does two things:

- `rusqlite` calls `sqlite3_reset` on prepared statements after a busy/locked error (See @linacambridge's comment here https://bugzilla.mozilla.org/show_bug.cgi?id=1435446#c17).

The downside of this is that now we require you

- `SQLCIPHER_NO_PKG_CONFIG=1` in your dev environment
- `libs/desktop`.

Both of these are done by `./libs/bootstrap-desktop.sh`, but we had (undocumented) support for using homebrew to install `sqlcipher` and `openssl`. That is no longer true.

This was already true in most cases, but Mac users might hit a snag. This is true because SQLcipher on macOS (and probably elsewhere) does not set `SQLITE_ENABLE_UNLOCK_NOTIFY` by default. Unfortunate.

If you still have an issue after running `./libs/bootstrap-desktop.sh`, make sure you do a `cargo clean`, since `SQLCIPHER_NO_PKG_CONFIG` doesn't seem to always get picked up.

I also took the time to set `=1` to all our defines, since (while it's not 100% relevant here, it confused me when debugging), there are cases when SQLite just does `#if SQLITE_ENABLE_UNLOCK_NOTIFY`.

Pull Request checklist

- `cargo test --all` produces no test failures
- `cargo clippy --all --all-targets --all-features` runs without emitting any warnings
- `cargo fmt` does not produce any changes to the code
- `./gradlew ktlint detekt` runs without emitting any warnings
- `swiftformat --swiftversion 4 megazords components/*/ios && swiftlint` runs without emitting any warnings or producing changes
- `[ci full]` to the PR title.