
Handle Database has a global failure error during replication #5348

Closed
newtewt opened this issue Feb 5, 2019 · 23 comments
Labels: Priority: 2 - Medium (Normal priority), Type: Bug (Fix something that isn't working as intended)

Comments

@newtewt
Contributor

newtewt commented Feb 5, 2019

When replicating to a Tecno WX3 the IndexedDB database seems to be dying, and errors are showing in the console. Reloading the app finishes the app start process and the app becomes usable.

Publishing 'progress' event with: Object {purged: 2325, processed: 486, total: 486}
Purge complete, purged 2325 documents
Publishing 'optimise' event with: undefined
Database has a global failure DOMException: Transaction timed out due to inactivity.
Uncaught (in promise) n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"} 

Steps to reproduce:
Using the filysa user on the purge-upgrade-muso-mali instance, log in to a fresh app install on a Tecno WX3. Have chrome://inspect running for the mobile app.

What should happen:
The app should launch without error.

What actually happens:
The app replicates and purges documents, then the 404 error page is shown and the error above appears in the log. Currently we have a solution to handle this when it is unhandled, but this is being handled.

https://medic.slack.com/files/UBHER12S2/FFZ6U3HEJ/indexed_db_went_bad.log

Environment:

@newtewt newtewt added this to To do in 3.3.0 via automation Feb 5, 2019
@kennsippell
Member

kennsippell commented Feb 5, 2019

Currently we have a solution to handle this when it is unhandled but this is being handled.

To clarify -- this quote refers to the changes from #5045, which prompt the user to reload the page on the unhandled rejection DOMException: Failed to execute 'transaction' on 'IDBDatabase': The database connection is closing.

The exception in this issue is DOMException: Transaction timed out due to inactivity. Although they are both DOMExceptions, this isn't the same issue seen in Siaya. As written, I wouldn't expect to see a reload dialog for the error reported in this issue (whether handled or unhandled).
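
For context, a minimal sketch of the kind of global handler #5045 describes (hypothetical, not the actual implementation; the message check and the confirm dialog are assumptions):

```js
// Hypothetical sketch only -- not the actual #5045 change.
// Prompt the user to reload when the "connection is closing" DOMException
// escapes as an unhandled promise rejection.
window.addEventListener('unhandledrejection', (event) => {
  const reason = event.reason;
  const message = (reason && reason.message) || String(reason);
  if (message.includes('The database connection is closing')) {
    event.preventDefault(); // suppress the default console error
    if (window.confirm('The local database stopped responding. Reload the app?')) {
      window.location.reload();
    }
  }
});
```

A handler keyed on that message would not fire for Transaction timed out due to inactivity, which is the distinction being drawn above.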

@garethbowen
Member

@SCdF As the resident IDB and purging expert does this ring any bells?

@garethbowen garethbowen added the Type: Bug and Priority: 2 - Medium labels Feb 7, 2019
@SCdF
Contributor

SCdF commented Feb 7, 2019

Originally I thought it might be the view cleanup that happens after purge, but looking at the whole log there is a lot more going on.

A slightly more complete look at that log is:

Purge function has changed, running purge
Publishing 'start' event with: Object {totalContacts: 486}
user
Object {name: "filysa", roles: Array[5], locale: "en"}
Publishing 'progress' event with: Object {purged: 0, processed: 1, total: 486}
user
Object {name: "filysa", roles: Array[5], locale: "en"}
Publishing 'progress' event with: Object {purged: N, processed: M, total: 486}
^ 486 times

Purge complete, purged 2325 documents
Publishing 'optimise' event with: undefined
Database has a global failure DOMException: Transaction timed out due to inactivity.
Uncaught (in promise) n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"}
Uncaught (in promise) DOMException: Failed to execute 'transaction' on 'IDBDatabase': The database connection is closing.(…)
^ * 2800 times

n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"}
db.type() is deprecated and will be removed in a future version of PouchDB
Initiating changes service
Database has a global failure DOMException: Transaction timed out due to inactivity.
Error updating icons n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"}
Error initialising watching for db changes n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"}
Attempting changes initialisation in 5 seconds
Error initialising zscore util n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"}
Error fetching settings n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"}
Error loading facilities n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"}
Error fetching settings n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"}
Error getting tasks n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"}
Error checking authorization n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"}

^ * 25 times
n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"} "Possibly unhandled rejection: {"status":500,"name":"indexed_db_went_bad","message":"unknown","error":true,"reason":"TimeoutError"}"
n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"} "Possibly unhandled rejection: {"status":500,"name":"indexed_db_went_bad","message":"unknown","error":true,"reason":"TimeoutError"}"
n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"} "Possibly unhandled rejection: {"status":500,"name":"indexed_db_went_bad","message":"unknown","error":true,"reason":"TimeoutError"}"
Error fetching form definitions n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"}
Failed to retrieve forms n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"}
Error fetching contact n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"}
Initiating changes service
Error initialising watching for db changes n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"}
Attempting changes initialisation in 5 seconds
^ * many more

The key numbers here are:

  • 486 rounds of purging, which will contain one bulk docs write per round, though only 238 of these rounds contained a change and thus fired a bulk_docs call.
  • 2325 documents purged
  • 2800 uncaught failures

I can't get these to line up perfectly, but they get suspiciously close to each other.

@newtewt can you confirm that for this user the purging actually occurred? That is to say, were the documents no longer accessible?

My hunch is that IndexedDB has broken either before purge has run, or because we are doing an all_docs at the start and IDB is "not ready" somehow, and so none of the purges were actually successful.

In terms of the actual error, it's hard to track down. Apart from one specific usage that we aren't going to get to, it's just used whenever a transaction fails.

One option is that we could inject failures in PouchDB's code and narrow it down to only the places where the promise chain doesn't catch.
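
As a rough illustration of that idea, something like the snippet below (a debugging sketch only; the patch point and failure rate are assumptions, not anything in our code) could be dropped into the console or a test build to make a fraction of IDB transactions fail and flush out promise chains without handlers:

```js
// Hypothetical debugging aid: fail a small fraction of IndexedDB transactions
// so that any PouchDB code path without a rejection handler surfaces quickly.
const realTransaction = IDBDatabase.prototype.transaction;
IDBDatabase.prototype.transaction = function (...args) {
  if (Math.random() < 0.05) {
    // Mimic the kind of failure seen in the logs.
    throw new DOMException('Injected failure for testing', 'TimeoutError');
  }
  return realTransaction.apply(this, args);
};
```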

@newtewt
Contributor Author

newtewt commented Feb 7, 2019

I'll confirm today that the docs are unavailable. I was going by the console log, which showed that purging completed.

@newtewt
Contributor Author

newtewt commented Feb 9, 2019

@SCdF purging completes properly. The home_visit docs are not available anymore after refreshing and getting into the app.

@garethbowen
Member

I'm trying to work out if this is a blocker for 3.3.0 or if we should add it to 3.5.0.

@newtewt Do you have any idea how often this would happen?

@SCdF I'm leaning towards bumping it - what do you think?

@newtewt
Contributor Author

newtewt commented Feb 10, 2019

This happens consistently for the filysa user on first replication and purging, so it is reproducible. However, I logged in as the madinagu user and didn't experience that error, even though this user has more docs than filysa.

@garethbowen
Member

Have you tried wiping the filysa user db, or logging in as them on another device? It's possible the db has been corrupted and isn't recovering.

@newtewt
Contributor Author

newtewt commented Feb 10, 2019

I'm wiping all the data in the Android app. I would need to try another device to be sure as I can't remember off the top of my head right now.

@garethbowen
Member

That's messed up. @SCdF Would you mind investigating this user to see if they have a dodgy doc or purge configuration that's messing them up? The goal is just to get 3.3.0 to the point of go/no-go so if in your opinion this isn't required then bump it out to 3.5.0 and we'll pick it off the stack when we get to it.

@SCdF SCdF self-assigned this Feb 11, 2019
@SCdF
Contributor

SCdF commented Feb 11, 2019

I'm on it (well, once I get creds etc)

@SCdF
Contributor

SCdF commented Feb 11, 2019

So it took four different devices, but I managed to replicate it, on our old friend the Tecno Y4.

It's unclear why it happens on that device over the others, though it is the worst device I have.

@newtewt can you post the details of a user who does succeed, compared to this one, and specifically:

  • How many documents they have in total
  • How many documents get purged for them

I'm trying to work out if there is some kind of breaking point ratio.

In the meantime, I'm going to pull this user's data down locally and try mucking with the code to see if I can produce any interesting results.

@newtewt
Contributor Author

newtewt commented Feb 11, 2019

@SCdF Here are the two users I tested on the WX3; only one of them sees the error. I tried on my Pixel 2 and Tecno J5 with no errors. Let me know if you want more.

| user | total number of docs | docs after purging | sees error | device |
| --- | --- | --- | --- | --- |
| filysa | 4792 | 2467 | yes | Tecno WX3 |
| madinagu | 14751 | 7255 | no | Tecno WX3 |

@SCdF
Contributor

SCdF commented Feb 11, 2019

So I'm guessing that we're using a WX3 because Muso people have them deployed?

@newtewt
Contributor Author

newtewt commented Feb 11, 2019

I believe so. @abbyad mentioned that is what the CHWs are using in Mali.

https://medic.slack.com/archives/C02LFG69T/p1547677473108500?thread_ts=1547670638.106300&cid=C02LFG69T

SCdF pushed a commit that referenced this issue Feb 12, 2019
We already compact the local DB on each write, so this should be
unnecessary. It is also causing crashes on some lower-power devices,
presumably because it's holding open some transaction for too long and
erroring everything out after that.

#5348
@SCdF
Contributor

SCdF commented Feb 12, 2019

So I'm going to remove the compaction (optimise) step.

We are already supposedly doing this on every write (and the code seems to exist in PouchDB), and manual compaction is seq based, so it's going to run over the whole changes feed again, which is a lot.

There is no reason this should cause crashes, but there is also no reason for us to manually compact if we're always compacting when we write.
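
For reference, the difference between the two compaction modes in the standard PouchDB API (a sketch only; the database name is made up):

```js
// Compact old revisions as part of every write -- no separate pass needed.
const db = new PouchDB('example-db', { auto_compaction: true });

// Manual compaction is seq-based and runs over the whole changes feed, which
// on a slow device can hold IDB transactions open for a long time. This is
// the "optimise" step being removed.
// db.compact();
```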

@newtewt this is available as the purge-no-optimise branch build.

I'm going to test this locally as well, but you should test it on your phones when you get a chance.

@newtewt
Contributor Author

newtewt commented Feb 12, 2019

So far this has been working fine on the purge-upgrade instance. I haven't seen the crash today. I've done the initial replication a few times in the browser, on the Tecno WX3, and on the Tecno J5. I'd say let's push a PR, get another 3.3 beta, and give it another once-over.

@garethbowen
Member

@SCdF I'll leave it for you to merge and backport accordingly. The diff looks fine - we should probably delete the translations for "optimize" at the same time.

SCdF pushed a commit that referenced this issue Feb 13, 2019
Drop "optimise", the manual compaction step. We don't need to do this
as we are compacting on write as we go.

Move start. This now happens once we decide we are going to purge, as
it's more useful to show this early and have something to show while
we pull in and organise documents that need purging.

#5348
@SCdF
Contributor

SCdF commented Feb 14, 2019

I made enough changes after @garethbowen's comment that I opened a PR and have linked it to Gareth.

@garethbowen leave it to me to merge / forward port.

@SCdF SCdF moved this from To do to In progress in 3.3.0 Feb 14, 2019
garethbowen pushed a commit that referenced this issue Feb 14, 2019
Drop "optimise", the manual compaction step. We don't need to do this
as we are compacting on write as we go.

Move start. This now happens once we decide we are going to purge, as
it's more useful to show this early and have something to show while
we pull in and organise documents that need purging.

#5348
@garethbowen
Member

@newtewt @ngaruko I've merged this into 3.3.x so AT (and hopefully RT) can start. Wait for the build to finish and you'll be all good.

garethbowen pushed a commit that referenced this issue Feb 14, 2019
Drop "optimise", the manual compaction step. We don't need to do this
as we are compacting on write as we go.

Move start. This now happens once we decide we are going to purge, as
it's more useful to show this early and have something to show while
we pull in and organise documents that need purging.

#5348
garethbowen pushed a commit that referenced this issue Feb 14, 2019
Drop "optimise", the manual compaction step. We don't need to do this
as we are compacting on write as we go.

Move start. This now happens once we decide we are going to purge, as
it's more useful to show this early and have something to show while
we pull in and organise documents that need purging.

#5348
@garethbowen
Member

@SCdF To get AT unblocked I've ignored your directive and merged the PR. I also forward ported to 3.4.x and created a PR for you to merge into master.

@garethbowen garethbowen moved this from In progress to In AT in 3.3.0 Feb 14, 2019
SCdF pushed a commit that referenced this issue Feb 15, 2019
Drop "optimise", the manual compaction step. We don't need to do this
as we are compacting on write as we go.

Move start. This now happens once we decide we are going to purge, as
it's more useful to show this early and have something to show while
we pull in and organise documents that need purging.

#5348
@SCdF
Contributor

SCdF commented Feb 15, 2019

Coolio, so:

  • you merged into 3.3
  • you forward ported into 3.4
  • I merged into master
  • there is no 3.5

**breathes out**

@SCdF SCdF removed their assignment Feb 15, 2019
@newtewt
Contributor Author

newtewt commented Feb 15, 2019

Replicating and purging are occurring as expected. Closing.

@newtewt newtewt closed this as completed Feb 15, 2019
3.3.0 automation moved this from In AT to Done Feb 15, 2019