
Attempt to fix SQLITE_BUSY outside of pathological cases [ci full] #1260

Closed
wants to merge 5 commits

Conversation


@thomcc thomcc commented Jun 7, 2019

Fixes #1230.

This basically does two things:

  1. Retries in a loop.
  2. Enables the `unlock_notify` rusqlite feature. (This isn't actually required on its own, but it ensures that
    rusqlite calls `sqlite3_reset` on prepared statements after a busy/locked error; see @linacambridge's comment at https://bugzilla.mozilla.org/show_bug.cgi?id=1435446#c17.)

The downside of this is that we now require you:

  • To set SQLCIPHER_NO_PKG_CONFIG=1 in your dev environment.
  • To use the SQLCipher built in libs/desktop.

Both of these are done by ./libs/bootstrap-desktop.sh, but we previously had (undocumented) support for using homebrew to install sqlcipher and openssl. That is no longer the case.

This was already true in most cases, but Mac users might hit a snag: SQLCipher on macOS (and probably elsewhere) does not set SQLITE_ENABLE_UNLOCK_NOTIFY by default, which we now need. Unfortunate.

If you still have an issue after running ./libs/bootstrap-desktop.sh, make sure you do a cargo clean, since SQLCIPHER_NO_PKG_CONFIG doesn't seem to always get picked up.

I also took the time to pass =1 for all our defines since, while it's not 100% relevant here, it confused me when debugging: there are cases where SQLite uses `#if SQLITE_ENABLE_UNLOCK_NOTIFY` (and not `#ifdef`).
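For illustration, the retry-in-a-loop behavior can be sketched as a small generic helper. This is a minimal sketch using only std: the name `retry_on_busy`, the string error type, and the backoff schedule are all hypothetical stand-ins, not the actual patch (which matches on rusqlite's SQLITE_BUSY/SQLITE_LOCKED errors).

```rust
use std::thread;
use std::time::Duration;

/// Hypothetical sketch: run `op` up to `max_attempts` times, sleeping a
/// little longer after each "busy"-style failure. Each iteration assigns
/// to `result`, so repeated failures surface the last error.
fn retry_on_busy<T, F>(mut op: F, max_attempts: u32) -> Result<T, String>
where
    F: FnMut() -> Result<T, String>,
{
    let mut result = op();
    let mut attempt = 1;
    while result.is_err() && attempt < max_attempts {
        // Backoff schedule is illustrative: 10ms, 20ms, 30ms, ...
        thread::sleep(Duration::from_millis(10 * attempt as u64));
        result = op();
        attempt += 1;
    }
    result
}

fn main() {
    // Simulate an operation that is "busy" twice, then succeeds.
    let mut calls = 0;
    let result = retry_on_busy(
        || {
            calls += 1;
            if calls < 3 { Err("busy".to_string()) } else { Ok(calls) }
        },
        5,
    );
    assert_eq!(result, Ok(3));
}
```

Note that in the real patch SQLite itself may also wait up to `sqlite3_busy_timeout` ms inside each attempt, so the loop's sleeps are on top of SQLite's own waiting.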

Pull Request checklist

  • Quality: This PR builds and tests run cleanly
    • cargo test --all produces no test failures
    • cargo clippy --all --all-targets --all-features runs without emitting any warnings
    • cargo fmt does not produce any changes to the code
    • ./gradlew ktlint detekt runs without emitting any warnings
    • swiftformat --swiftversion 4 megazords components/*/ios && swiftlint runs without emitting any warnings or producing changes
    • Note: For changes that need extra cross-platform testing, consider adding [ci full] to the PR title.
  • Tests: This PR includes thorough tests or an explanation of why it does not
  • Changelog: This PR includes a changelog entry or an explanation of why it does not need one
    • Any breaking changes to Swift or Kotlin binding APIs are noted explicitly
  • Dependencies: This PR follows our dependency management guidelines
    • Any new dependencies are accompanied by a summary of the due diligence applied in selecting them.

Thom Chiovoloni added 5 commits June 7, 2019 12:37
- breaks support for homebrew-installed SQLcipher
- Also ensures we pass `=1` to all defines, since apparently SQLite uses
  `#if SQLITE_ENABLE_BLAH` (and not #ifdef) in some cases.
@thomcc

thomcc commented Jun 8, 2019

Note: The pathological cases I'm referring to are cases where one transaction is open for more than ~30 seconds; we'll still fail with a busy error if this happens.

@mhammond

mhammond commented Jun 8, 2019

Thanks for digging in here. The patch is obviously fine (sans micro-nits), but I'm concerned about the strategy.

IIUC, our working hypothesis is that a sync is probably running when this happens. This means that a single "chunked transaction" is taking over 4 seconds to commit (assuming the noteObservation is made immediately after the sync starts, and sync correctly attempts a commit after 1 second). The device is obviously performing poorly, so this may be one of many "chunks" the sync tries to do. We don't yet have telemetry, but when we do, we might be horrified at the time outliers take to sync.

We also have no insight into how long the transaction actually took - again, telemetry seems close and might give us actual insights. Sentry implies most users see it once or twice but not regularly.

Assuming the browser is being actively used, there may be many noteObservation calls just in that 5 seconds. If I'm reading the patch correctly, we end up retrying a number of times for each call - up to ~30 seconds each one. ISTM that it wouldn't take long before every available thread is in this retry loop, waiting for some sync-related thing that we don't yet understand to complete.

I'm also not sure why we need a backoff strategy like this. How is this different from simply asking sqlite to use a 30 second timeout? That also has the advantage of unblocking ASAP, whereas with this patch there seems a chance that the thread just started sleeping for 5 seconds when the lock became available - it's then got to wait out those 5 seconds before trying again, at which point something else might now have the lock (eg, history finished and bookmarks started in that 5 seconds). Or maybe drop the chunked transaction timeout down to 500ms?
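The "missed window" concern here can be illustrated with a toy timing model (pure arithmetic, not real SQLite behavior; all numbers and function names are hypothetical): a busy-timeout wait returns as soon as the lock frees, while a sleep-then-retry loop only observes the lock at its retry instants.

```rust
/// Toy model: given the instant (in ms) at which the lock becomes free,
/// return when each strategy would first acquire it.
///
/// `busy_timeout` blocks inside SQLite and wakes as soon as the lock
/// frees (up to the timeout), so it acquires at `free_at`.
fn busy_timeout_acquires_at(free_at: u64, timeout: u64) -> Option<u64> {
    if free_at <= timeout { Some(free_at) } else { None }
}

/// A sleep-then-retry loop only checks at t = 0, sleep_ms, 2*sleep_ms, ...
fn sleep_retry_acquires_at(free_at: u64, sleep_ms: u64, max_attempts: u64) -> Option<u64> {
    (0..max_attempts)
        .map(|i| i * sleep_ms)
        .find(|&t| t >= free_at)
}

fn main() {
    // Lock frees at t=5001ms, just after a 5s sleep started: the retry
    // loop doesn't notice until t=10000ms, while busy_timeout wakes at 5001.
    assert_eq!(busy_timeout_acquires_at(5001, 30_000), Some(5001));
    assert_eq!(sleep_retry_acquires_at(5001, 5000, 6), Some(10_000));
}
```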

Another possible strategy would be a simple queue of observations - eg, on busy we just stick the failed observation in a queue/vec, and the next time we are asked to add an observation we add all queued ones too, all in a single transaction, which would almost certainly perform better than just "queueing" the many small transactions, which is effectively what this patch does.
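The queue idea could look roughly like this. A minimal sketch using only std: `ObservationQueue`, the `String` stand-in for observation records, and the `write_all` callback (standing in for one transaction writing many rows) are all hypothetical, not existing Places types.

```rust
/// Sketch of the "queue failed observations" strategy: on a busy error,
/// park the observation; the next call flushes the whole backlog in a
/// single (simulated) transaction.
struct ObservationQueue {
    pending: Vec<String>, // stand-in for real observation records
}

impl ObservationQueue {
    fn new() -> Self {
        ObservationQueue { pending: Vec::new() }
    }

    /// `write_all` stands in for "one transaction writing many rows";
    /// it returns Err(()) when the database reports busy.
    fn note_observation<F>(&mut self, obs: String, write_all: F) -> Result<usize, ()>
    where
        F: FnOnce(&[String]) -> Result<(), ()>,
    {
        self.pending.push(obs);
        match write_all(&self.pending) {
            Ok(()) => {
                let written = self.pending.len();
                self.pending.clear();
                Ok(written)
            }
            Err(()) => Err(()), // keep everything queued for next time
        }
    }
}

fn main() {
    let mut q = ObservationQueue::new();
    // First write hits a busy error: the observation stays queued.
    assert!(q.note_observation("obs-1".to_string(), |_rows: &[String]| Err(())).is_err());
    // Next write succeeds and flushes both rows in one transaction.
    assert_eq!(
        q.note_observation("obs-2".to_string(), |rows: &[String]| {
            assert_eq!(rows.len(), 2);
            Ok(())
        }),
        Ok(2)
    );
}
```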

So I'm mildly -1 on doing this at this time - but open to persuasion if others think it is the correct approach. Given how rare it happens and given Fenix is still pre-release, just dropping those noteObservation calls on the floor seems reasonable (but obviously not ideal) and much lower risk given we don't have a clear understanding of what's going on yet.

@mhammond

mhammond commented Jun 10, 2019

Thom and I had a bit of a chat in Slack about this. I think doing exactly one retry, without a sleep(), when we see this error is less risky and has a reasonable chance of success. This should double the time we allow for the other side to complete, and would probably be effective if the underlying issue is that Android put both our threads to sleep during this period, meaning that by the time we woke the timeout had expired even though nothing was actually done.

Thom feels fairly strongly that the reset part of this patch is worthwhile too, and while I agree with that in general, I'm on the fence about whether the risk is worthwhile at this point in time.
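The agreed-on behavior (exactly one retry, no sleep) is a much smaller loop. A minimal sketch using only std; the name `with_single_retry` and the string error type are hypothetical, and the real code would match on rusqlite's SQLITE_BUSY/SQLITE_LOCKED error variants.

```rust
/// Sketch of "exactly one retry, no sleep": run `op` once, and if it
/// fails with a busy-style error, try exactly once more.
fn with_single_retry<T, F>(mut op: F) -> Result<T, String>
where
    F: FnMut() -> Result<T, String>,
{
    match op() {
        Err(e) if e == "busy" => op(), // one immediate retry, then give up
        other => other,
    }
}

fn main() {
    let mut attempts = 0;
    let result = with_single_retry(|| {
        attempts += 1;
        if attempts == 1 { Err("busy".to_string()) } else { Ok("committed") }
    });
    // The first attempt was busy; the single retry succeeded.
    assert_eq!(result, Ok("committed"));
    assert_eq!(attempts, 2);
}
```

Non-busy errors are not retried, which keeps the helper from masking real failures.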

}

/// A helper that attempts to get an Immediate lock on the DB. If it fails with
/// a "busy" or "locked" error, it does exactly 1 retry.

Nit: This isn't true anymore. 😄

return result.map_err(Error::from);
}
// Do the retry loop. Each iteration will assign to `result`, so that in the
// case of repeated BUSY failures, w

Nit: Cliffhanger comment!

// These are fairly arbitrary. We'll retry 5 times before giving up
// completely, but after each failure, we wait for longer than the previous.
// Note that between each attempt, SQLite itself will wait up to
// `sqlite3_busy_timeout` ms, which is 5000 by default.

Hmm, my understanding is pretty fuzzy on how sqlite3_busy_timeout, sqlite3_unlock_notify, SQLITE_BUSY, SQLITE_LOCKED, and shared-cache mode interact. I thought:

  • The unlock notification is only triggered on SQLITE_LOCKED, which requires shared-cache mode.
  • Shared cache mode ignores sqlite3_busy_timeout, since it's the same logical connection as far as SQLite is concerned.
  • SQLITE_BUSY happens if two different logical connections (the default for us—AFAIK, we only use shared-cache in PlacesApi::new_memory) are writing to the same database, and one can't get the lock after waiting for sqlite3_busy_timeout.
  • The unlock notification isn't called for SQLITE_BUSY.
  • However, the docs say:

If sqlite3_unlock_notify() is called in a multi-threaded application, there is a chance that the blocking connection will have already concluded its transaction by the time sqlite3_unlock_notify() is invoked. If this happens, then the specified callback is invoked immediately, from within the call to sqlite3_unlock_notify().

So...

Does that mean that rusqlite's wait_for_unlock_notify immediately calls unlock_notify_cb, which fires the callback and then, in RawStatement::step, calls reset? So we're using the unlock notification to work around rusqlite not exposing reset?

Or does unlock_notify_cb get called for SQLITE_BUSY, too, and not just SQLITE_LOCKED?

@linabutler

How is this different than simply asking sqlite to use a 30 second timeout?

I'm not sure, but, when working on bug 1435446, I noticed that neither bumping PRAGMA busy_timeout nor retrying the same statement without resetting worked. Bumping the timeout still caused SQLITE_BUSY errors for the same statement (and, due to the mix of sync and async methods in Places, would sometimes block the main thread for synchronous statements). Retrying the same statement, or retrying a statement in a DEFERRED transaction also returned SQLITE_BUSY indefinitely.

I couldn't reproduce in an xpcshell test, or using the C API directly in a standalone app—in those cases, both retrying the statement a few times and bumping busy_timeout worked, and SQLite un-stuck itself. But that bug was easy to repro in a Desktop build...just open the library, start moving bookmarks around, and SQLite would wedge with SQLITE_BUSY in less than 10 seconds. It would also trigger randomly and reliably when syncing history, adding bookmarks, hitting "Enter" in the awesomebar—anything that wrote to the database could contend with sync eventually.

Changing all Places transactions from DEFERRED to IMMEDIATE still caused infinite loops, but only on BEGIN IMMEDIATE TRANSACTION—once that ran successfully, all statements would run and COMMIT just fine. Adding the sqlite3_reset is what would let it get unstuck. It would still retry many (50-100?) times, since it was busy-waiting with PR_INTERVAL_NO_WAIT...but, eventually, it'd succeed, once the other transaction finished, committed, and SQLite finished writing to disk.

TL;DR: I haven't the foggiest why Desktop works. 😕 But, since that turned out to be a pretty simple fix after a week of debugging, I didn't spend more time on it. 😅

on busy we just stick the failed observation in a queue/vec, and the next time we are asked to add an observation we add all queued ones too, all in a single transaction, which would almost certainly perform better than just "queueing" the many small transactions

That's also a good idea—I think we'll still want a retry, though, or we might risk losing data when the app is backgrounded and Android force-stops the process. With busy-spinning, that might happen today...though, like I noted on Desktop, with those two changes, it was only a few milliseconds of waiting. (With trace logging in the merge transaction, it was closer to a second—some lag, but the operation would still run quickly). I never saw multi-second wait times.

If we queue as soon as we get a SQLITE_BUSY, we run the risk of not writing anything (potentially indefinitely? What if every statement executed after that returns SQLITE_BUSY?), history or bookmarks, and losing that data when Android stops the process. It's way less predictable and durable than Desktop.

@eoger

eoger commented Aug 22, 2019

Please re-tag for review once this is ready!

@rfk

rfk commented Dec 12, 2019

This PR is getting pretty old, is it something we want to keep or should we close it and revisit afresh at some later time?

@eoger

eoger commented Apr 21, 2020

Let's re-open if we're going to do this.

@eoger eoger closed this Apr 21, 2020
@rfk rfk deleted the retry-tx-loop branch June 7, 2021 10:03

Successfully merging this pull request may close these issues.

PlacesConnectionBusy crash