Heavy changes request blocks all other requests #6246

Closed
Mwongela opened this issue Feb 12, 2020 · 10 comments
Labels
Type: Performance (Make something faster) · Won't fix: Duplicate (Covered by a different issue)

Comments

@Mwongela

Describe the performance issue
When a user with many documents logs in, the heavy changes request made during their initial replication blocks all other requests. During LG rollouts of v3.6.0, other users are unable to log in and sync their documents, and online users are unable to access the app in the browser. We sync up to 1000 users at a time, but the problem starts at around 300 users.

Describe the improvement you'd like
We have considered the following:

  • Clustering the medic-api over multiple cores
  • Load balancing the medic-api and CouchDB over multiple machines

Can this be fixed with faster _changes responses, or by yielding so that other requests can run in parallel? Failing that, a realistic guideline for how many users can sync in parallel during these large-scale upgrades would help.
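
To make the first option above concrete, here is a minimal, generic Node.js sketch of spreading an HTTP API across CPU cores with the built-in cluster module. This is not how medic-api is structured today; the port number and request handler are placeholders for illustration only.

```js
// Generic sketch (not medic-api specific): one worker per core via Node's cluster module.
const cluster = require('cluster');
const http = require('http');
const os = require('os');

const PORT = 5988; // medic-api's usual port, used here purely for illustration

if (cluster.isMaster) {
  // Fork one worker per core; the master process only supervises.
  os.cpus().forEach(() => cluster.fork());
  cluster.on('exit', (worker) => {
    console.log(`worker ${worker.process.pid} exited, starting a replacement`);
    cluster.fork();
  });
} else {
  // Every worker listens on the same port; connections are distributed across
  // workers, so one slow request no longer blocks everything else.
  http.createServer((req, res) => {
    res.end(`handled by worker ${process.pid}\n`);
  }).listen(PORT, () => console.log(`worker ${process.pid} listening on ${PORT}`));
}
```

Note this only helps if individual requests are CPU-bound in API, and any in-process state would have to be safe to duplicate across workers, which is presumably why it needs design work rather than a drop-in change.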

To Reproduce
Sharing reproduction instructions for this is difficult since you need access to the server, but roughly:

  1. a heavy request to the changes feed takes around 60,000 ms
  2. during this time, all other requests are queued and become very slow

Measurements
See graphs below
[Screenshots: response-time graphs, 2020-02-12 15:43 and 15:42]

Environment

  • Instance: LG
  • App: api
  • Version: 3.6.0


@Mwongela Mwongela added the Type: Performance Make something faster label Feb 12, 2020
@MaxDiz MaxDiz added this to Needs Triage in Implementing Partners Backlog via automation Feb 12, 2020
@garethbowen
Member

Hi @Mwongela, thanks for the issue!

We have made some performance improvements to replication in the last couple of releases, in particular #5550, #5878, #5759, #5797, and #5942. It would be interesting to run our scalability testing suite on 3.6.0, 3.7.0, and 3.8.0 to see how these have impacted real-world replication scalability. @ngaruko Is this something you can run?

I'm wary about clustering or load balancing API, as it hasn't been designed with clustering in mind. Clustering CouchDB is natively supported and I expect it would work well. It will come down to which service is the root cause. Can you check the CPU and memory usage of the various processes on the system?

Clustering comes with its own overheads, complexity, and costs. An easier first step is to look at procuring more server resources, for example bumping the AWS instance to a larger size. While this costs more, it should be possible to upgrade while running the onboarding and then scale down again when load has returned to normal. This is the ideal balance of cost and performance.

@enyachoke

@garethbowen We are currently preparing to test version 3.8 and will share the experience. We are still investigating the cause of this issue, and it is very likely not a server resource constraint; I shall update the issue with more details. Thanks for the support.

@enyachoke

So I am testing out 3.8.0 and saw this warning.
[Screenshot of the warning, 2020-02-17]
What potential impact will it have on the overall system? Is this something we need to worry about?

@dianabarsan
Member

@enyachoke Please have a look at the issue which added this warning: #5362

It's meant to help identify misconfigured users who end up downloading large numbers of documents because of missing roles or permissions, an incorrect home facility, etc.

@garethbowen
Member

The performance for users with a large number of docs hasn't been degraded in any way in 3.8.0; the only change is that we added a warning.

The occasional user with a little more than 10,000 docs is probably OK, but many users with significantly more docs will have a serious impact on both server performance for everyone and client-side performance for those users. If that's the case, we recommend configuring server-side purge so the client only replicates the docs that are relevant to what they're working on.
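
For anyone following along, server-side purge is configured in app_settings. The sketch below is illustrative only: the key names (`fn`, `text_expression`, `run_every_days`) and the purge function signature are recalled from the CHT documentation and should be verified against the docs for the version you are running.

```js
// Illustrative purge configuration. In app_settings.json the purge function is
// typically stored as a string; it is written inline here for readability.
const purgeConfig = {
  purge: {
    text_expression: 'at 1:00 am on Sunday', // assumed scheduling expression
    run_every_days: 7,                       // assumed re-run interval
    fn: function(userCtx, contact, reports, messages) {
      // Return the ids of documents that should no longer replicate to the client,
      // e.g. reports older than roughly one year.
      const oneYearAgo = Date.now() - 365 * 24 * 60 * 60 * 1000;
      return reports
        .filter(report => report.reported_date < oneYearAgo)
        .map(report => report._id);
    },
  },
};

module.exports = purgeConfig;
```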

@niiamon-lg

@garethbowen thanks for the detailed response. Given our current operational context, how would you advise we think about scaling? We feel that we already have plenty of computational and other resources (e.g. we are running API on a 32-core machine and we never see any memory pressure during peak times). This is something that's going to pop up soon on our radar and I'd like to be prepared. Also, are there any system limits that we should be aware of? This would help with capacity planning. In short, we would appreciate any operational data you have around running the codebase with a larger set of users.

@garethbowen
Member

@niiamon-lg Let me discuss some options for scaling with the rest of the Product team and get back to you.

Currently our primary limitation on replication is the number of documents each user will receive. Do you have an idea of how many documents you need to be able to support?

API performance bottlenecks are likely caused by a single request maxing out a CPU core and blocking the other requests, so the number of cores won't help scalability much. Thank you for the response-time screenshots above; can you also provide CPU usage data? I'm particularly interested in the per-process usage so we can isolate API from CouchDB and the other processes on the server.
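
One way to collect that kind of data is sketched below, assuming a Linux host with GNU ps available; the process names and sampling interval are illustrative rather than part of any official tooling.

```js
// Minimal sketch: sample per-process CPU and memory so API (node) can be
// separated from CouchDB (beam.smp, the Erlang VM).
const { execSync } = require('child_process');

const INTERESTING = ['node', 'beam.smp']; // medic-api workers vs CouchDB

const sample = () => {
  const out = execSync('ps -eo pid,pcpu,pmem,comm --sort=-pcpu', { encoding: 'utf8' });
  const rows = out.trim().split('\n')
    .slice(1)                                  // drop the header row
    .map(line => line.trim().split(/\s+/))
    .filter(([, , , comm]) => INTERESTING.includes(comm));

  console.log(new Date().toISOString());
  rows.forEach(([pid, cpu, mem, comm]) => {
    console.log(`  ${comm.padEnd(10)} pid=${pid} cpu=${cpu}% mem=${mem}%`);
  });
};

setInterval(sample, 5000); // log every 5 seconds while reproducing the slow sync
```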

@niiamon-lg

@garethbowen thanks, Gareth. It is difficult to know how many documents we would need to support per user. What we have seen is that, with time and usage, the number of documents a user holds increases. For instance, on our Uganda instance, we have seen well-configured users with about 11,000 documents.

Yes, it is as you say: we have seen that a single user with a high number of documents is able to make the entire service unavailable. I have asked @enyachoke to send you some details on CPU usage. I am sure we can capture some telemetry from htop or something similar, which would help with your request.

@kennsippell
Member

Related #7183

@garethbowen
Member

There have been significant improvements here, particularly in versions 4.0.0 - 4.4.0, which will solve this. Specifically, as of 4.3.0 the changes request is no longer used. I believe this means this specific issue is resolved, and I recommend you upgrade when possible.

If you experience similar issues in 4.4.0+ please reopen this issue or create a new one with additional data.

@garethbowen garethbowen added the Won't fix: Duplicate Covered by a different issue label Feb 13, 2024