
Handle Database has a global failure error during replication #5348

Closed
newtewt opened this issue Feb 5, 2019 · 23 comments
Labels: Priority: 2 - Medium (Normal priority), Type: Bug (Fix something that isn't working as intended)

Comments

@newtewt
Contributor

newtewt commented Feb 5, 2019

When replicating to a Tecno WX3 the IndexedDB database seems to be dying, and errors are showing in the console. Reloading the app finishes the app start process and the app becomes usable.

Publishing 'progress' event with: Object {purged: 2325, processed: 486, total: 486}
Purge complete, purged 2325 documents
Publishing 'optimise' event with: undefined
Database has a global failure DOMException: Transaction timed out due to inactivity.
Uncaught (in promise) n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"} 

Steps to reproduce:
Using the filysa user on the purge-upgrade-muso-mali instance, log in to a fresh app install on a Tecno WX3. Have chrome://inspect running for the mobile app.

What should happen:
The app should launch without error.

What actually happens:
The app replicates and purges documents, then the 404 error page is shown and the error above appears in the log. Currently we have a solution to handle this when it is unhandled, but this is being handled.

https://medic.slack.com/files/UBHER12S2/FFZ6U3HEJ/indexed_db_went_bad.log

Environment:

@newtewt newtewt added this to To do in 3.3.0 via automation Feb 5, 2019
@kennsippell
Member

kennsippell commented Feb 5, 2019

Currently we have a solution to handle this when it is unhandled but this is being handled.

To clarify -- this quote refers to the changes from #5045, which prompt the user to reload the page on the unhandled rejection DOMException: Failed to execute 'transaction' on 'IDBDatabase': The database connection is closing.

The exception in this issue is DOMException: Transaction timed out due to inactivity. Although they are both DOMExceptions, this isn't the same issue seen in Siaya. As written, I wouldn't expect to see a reload dialog for the error reported in this issue (whether handled or unhandled).
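
For context, a minimal sketch of the kind of global handler #5045 describes (hypothetical, not the actual implementation; the message check and the confirm dialog are assumptions):

```js
// Hypothetical sketch only -- not the actual #5045 change.
// Prompt the user to reload when the "connection is closing" DOMException
// escapes as an unhandled promise rejection.
window.addEventListener('unhandledrejection', (event) => {
  const reason = event.reason;
  const message = (reason && reason.message) || String(reason);
  if (message.includes('The database connection is closing')) {
    event.preventDefault(); // suppress the default console error
    if (window.confirm('The local database stopped responding. Reload the app?')) {
      window.location.reload();
    }
  }
});
```

A handler keyed on that message would not fire for Transaction timed out due to inactivity, which is the distinction being drawn above.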

@garethbowen
Member

@SCdF As the resident IDB and purging expert does this ring any bells?

@garethbowen garethbowen added the Type: Bug and Priority: 2 - Medium labels Feb 7, 2019
@SCdF
Contributor

SCdF commented Feb 7, 2019

Originally I thought it might be the view cleanup that happens after purge, but looking at the whole log there is a lot more going on.

A slightly more complete look at that log is:

Purge function has changed, running purge
Publishing 'start' event with: Object {totalContacts: 486}
user
Object {name: "filysa", roles: Array[5], locale: "en"}
Publishing 'progress' event with: Object {purged: 0, processed: 1, total: 486}
user
Object {name: "filysa", roles: Array[5], locale: "en"}
Publishing 'progress' event with: Object {purged: N, processed: M, total: 486}
^ 486 times

Purge complete, purged 2325 documents
Publishing 'optimise' event with: undefined
Database has a global failure DOMException: Transaction timed out due to inactivity.
Uncaught (in promise) n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"}
Uncaught (in promise) DOMException: Failed to execute 'transaction' on 'IDBDatabase': The database connection is closing.(…)
^ * 2800 times

n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"}
db.type() is deprecated and will be removed in a future version of PouchDB
Initiating changes service
Database has a global failure DOMException: Transaction timed out due to inactivity.
Error updating icons n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"}
Error initialising watching for db changes n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"}
Attempting changes initialisation in 5 seconds
Error initialising zscore util n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"}
Error fetching settings n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"}
Error loading facilities n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"}
Error fetching settings n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"}
Error getting tasks n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"}
Error checking authorization n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"}

^ * 25 times
n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"} "Possibly unhandled rejection: {"status":500,"name":"indexed_db_went_bad","message":"unknown","error":true,"reason":"TimeoutError"}"
n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"} "Possibly unhandled rejection: {"status":500,"name":"indexed_db_went_bad","message":"unknown","error":true,"reason":"TimeoutError"}"
n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"} "Possibly unhandled rejection: {"status":500,"name":"indexed_db_went_bad","message":"unknown","error":true,"reason":"TimeoutError"}"
Error fetching form definitions n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"}
Failed to retrieve forms n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"}
Error fetching contact n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"}
Initiating changes service
Error initialising watching for db changes n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"}
Attempting changes initialisation in 5 seconds
^ * many more

The key numbers here are:

  • 486 rounds of purging, which will contain one bulk docs write per round, though only 238 of these rounds contained a change and thus fired a bulk_docs call.
  • 2325 documents purged
  • 2800 uncaught failures

I can't get these to line up perfectly, but they get suspiciously close to each other.

@newtewt can you confirm that for this user the purging actually occurred? That is to say, were the documents no longer accessible?

My hunch is that IndexedDB has broken either before purge has run, or because we are doing an all_docs at the start and IDB is "not ready" somehow, and so none of the purges were actually successful.

In terms of the actual error, it's hard to track down. Apart from one specific usage that we aren't going to get to, it's just used whenever a transaction fails.

One option is that we could inject failures in PouchDB's code and narrow it down to only the places where the promise chain doesn't catch.
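
As a rough illustration of that idea, something like the snippet below (a debugging sketch only; the patch point and failure rate are assumptions, not anything in our code) could be dropped into the console or a test build to make a fraction of IDB transactions fail and flush out promise chains without handlers:

```js
// Hypothetical debugging aid: fail a small fraction of IndexedDB transactions
// so that any PouchDB code path without a rejection handler surfaces quickly.
const realTransaction = IDBDatabase.prototype.transaction;
IDBDatabase.prototype.transaction = function (...args) {
  if (Math.random() < 0.05) {
    // Mimic the kind of failure seen in the logs.
    throw new DOMException('Injected failure for testing', 'TimeoutError');
  }
  return realTransaction.apply(this, args);
};
```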

@newtewt
Contributor Author

newtewt commented Feb 7, 2019

I'll confirm today that the docs are unavailable. I was going by the console log, which showed that purging completed.

@newtewt
Contributor Author

newtewt commented Feb 9, 2019

@SCdF purging completes properly. The home_visit docs are not available anymore after refreshing and getting into the app.

@garethbowen
Member

I'm trying to work out if this is a blocker for 3.3.0 or if we should add it to 3.5.0.

@newtewt Do you have any idea how often this would happen?

@SCdF I'm leaning towards bumping it - what do you think?

@newtewt
Contributor Author

newtewt commented Feb 10, 2019

This happens consistently for the filysa user on first replication and purging, so it is reproducible. However, I logged in as the madinagu user and didn't experience that error, even though this user has more docs than filysa.

@garethbowen
Member

Have you tried wiping the filysa user db, or logging in as them on another device? It's possible the db has been corrupted and isn't recovering.

@newtewt
Contributor Author

newtewt commented Feb 10, 2019

I'm wiping all the data in the Android app. I would need to try another device to be sure as I can't remember off the top of my head right now.

@garethbowen
Member

That's messed up. @SCdF Would you mind investigating this user to see if they have a dodgy doc or purge configuration that's messing them up? The goal is just to get 3.3.0 to the point of go/no-go so if in your opinion this isn't required then bump it out to 3.5.0 and we'll pick it off the stack when we get to it.

@SCdF SCdF self-assigned this Feb 11, 2019
@SCdF
Contributor

SCdF commented Feb 11, 2019

I'm on it (well, once I get creds etc)

@SCdF
Contributor

SCdF commented Feb 11, 2019

So it took four different devices, but I managed to replicate it, on our old friend the Tecno Y4.

It's unclear why it happens on that device over the others, though it is the worst device I have.

@newtewt can you post the details of a user who does succeed, compared to this one, and specifically:

  • How many documents they have in total
  • How many documents get purged for them

I'm trying to work out if there is some kind of breaking point ratio.

In the meantime, I'm going to pull this user's data down locally and try mucking with the code to see if I can produce any interesting results.

@newtewt
Contributor Author

newtewt commented Feb 11, 2019

@SCdF Here are the two users I tested on the WX3; only one of them sees the error. I tried on my Pixel 2 and Tecno J5 with no errors. Let me know if you want more.

| user | total number of docs | docs after purging | sees error | device |
| --- | --- | --- | --- | --- |
| filysa | 4792 | 2467 | yes | Tecno WX3 |
| madinagu | 14751 | 7255 | no | Tecno WX3 |

@SCdF
Contributor

SCdF commented Feb 11, 2019

So I'm guessing that we're using a WX3 because Muso people have them deployed?

@newtewt
Contributor Author

newtewt commented Feb 11, 2019

I believe so. @abbyad mentioned that is what the CHWs are using in Mali.

https://medic.slack.com/archives/C02LFG69T/p1547677473108500?thread_ts=1547670638.106300&cid=C02LFG69T

SCdF pushed a commit that referenced this issue Feb 12, 2019
We already compact the local DB on each write, so this should be
unnecessary. It is also causing crashes on some lower-power devices,
presumably because it's holding open some transaction for too long and
erroring everything out after that.

#5348
@SCdF
Contributor

SCdF commented Feb 12, 2019

So I'm going to remove the compaction (optimise) step.

We are already supposedly doing this on every write (and the code seems to exist in PouchDB), and manual compaction is seq based, so it's going to run over the whole changes feed again, which is a lot.

There is no reason this should cause crashes, but there is also no reason for us to manually compact if we're always compacting when we write.
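
For reference, the difference between the two compaction modes in the standard PouchDB API (a sketch only; the database name is made up):

```js
// Compact old revisions as part of every write -- no separate pass needed.
const db = new PouchDB('example-db', { auto_compaction: true });

// Manual compaction is seq-based and runs over the whole changes feed, which
// on a slow device can hold IDB transactions open for a long time. This is
// the "optimise" step being removed.
// db.compact();
```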

@newtewt this is available as the purge-no-optimise branch build.

I'm going to test this locally as well, but you should test it on your phones when you get a chance.

@newtewt
Contributor Author

newtewt commented Feb 12, 2019

So far this has been working fine on the purge-upgrade instance. I haven't seen the crash today. I've done the initial replication a few times in the browser, on the Tecno WX3, and on the Tecno J5. I'd say let's push a PR, get another 3.3 beta, and give it another once-over.

@garethbowen
Member

@SCdF I'll leave it for you to merge and backport accordingly. The diff looks fine - we should probably delete the translations for "optimize" at the same time.

SCdF pushed a commit that referenced this issue Feb 13, 2019
Drop "optimise", the manual compaction step. We don't need to do this
as we are compacting on write as we go.

Move start. This now happens once we decide we are going to purge, as
it's more useful to show this early and have something to show while
we pull in and organise documents that need purging.

#5348
@SCdF
Contributor

SCdF commented Feb 14, 2019

I made enough changes after @garethbowen's comment that I opened a PR and have linked it to Gareth.

@garethbowen leave it to me to merge / forward port.

@SCdF SCdF moved this from To do to In progress in 3.3.0 Feb 14, 2019
garethbowen pushed a commit that referenced this issue Feb 14, 2019
Drop "optimise", the manual compaction step. We don't need to do this
as we are compacting on write as we go.

Move start. This now happens once we decide we are going to purge, as
it's more useful to show this early and have something to show while
we pull in and organise documents that need purging.

#5348
@garethbowen
Member

@newtewt @ngaruko I've merged this into 3.3.x so AT (and hopefully RT) can start. Wait for the build to finish and you'll be all good.

garethbowen pushed a commit that referenced this issue Feb 14, 2019
Drop "optimise", the manual compaction step. We don't need to do this
as we are compacting on write as we go.

Move start. This now happens once we decide we are going to purge, as
it's more useful to show this early and have something to show while
we pull in and organise documents that need purging.

#5348
garethbowen pushed a commit that referenced this issue Feb 14, 2019
Drop "optimise", the manual compaction step. We don't need to do this
as we are compacting on write as we go.

Move start. This now happens once we decide we are going to purge, as
it's more useful to show this early and have something to show while
we pull in and organise documents that need purging.

#5348
@garethbowen
Member

@SCdF To get AT unblocked I've ignored your directive and merged the PR. I also forward ported to 3.4.x and created a PR for you to merge into master.

@garethbowen garethbowen moved this from In progress to In AT in 3.3.0 Feb 14, 2019
SCdF pushed a commit that referenced this issue Feb 15, 2019
Drop "optimise", the manual compaction step. We don't need to do this
as we are compacting on write as we go.

Move start. This now happens once we decide we are going to purge, as
it's more useful to show this early and have something to show while
we pull in and organise documents that need purging.

#5348
@SCdF
Contributor

SCdF commented Feb 15, 2019

Coolio, so:

  • you merged into 3.3
  • you forward ported into 3.4
  • I merged into master
  • there is no 3.5

**breathes out**

@SCdF SCdF removed their assignment Feb 15, 2019
@newtewt
Contributor Author

newtewt commented Feb 15, 2019

Replicating and purging are occurring as expected. Closing.

@newtewt newtewt closed this as completed Feb 15, 2019
3.3.0 automation moved this from In AT to Done Feb 15, 2019