Heavy changes request blocks all other requests #6246

Closed
Mwongela opened this issue Feb 12, 2020 · 10 comments
Labels
Type: Performance (Make something faster) · Won't fix: Duplicate (Covered by a different issue)

Comments

@Mwongela

Describe the performance issue
When a user with many documents logs in, the heavy changes request made during their initial replication blocks all other requests. During LG rollouts of v3.6.0, other users are unable to log in and sync their documents, and online users are unable to access the app in the browser. We sync up to 1000 users at a time, but the problem starts at around 300 users.

Describe the improvement you'd like
We have considered the following:

  • Clustering the medic-api over multiple cores
  • Load balancing the medic-api and CouchDB over multiple machines

Can this be fixed with faster _changes responses, or by yielding so that other requests can run in parallel? Failing that, a realistic guideline for how many users can sync in parallel during these large-scale upgrades would help.
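
To make the first option above concrete, here is a minimal, generic Node.js sketch of spreading an HTTP API across CPU cores with the built-in cluster module. This is not how medic-api is structured today; the port number and request handler are placeholders for illustration only.

```js
// Generic sketch (not medic-api specific): one worker per core via Node's cluster module.
const cluster = require('cluster');
const http = require('http');
const os = require('os');

const PORT = 5988; // medic-api's usual port, used here purely for illustration

if (cluster.isMaster) {
  // Fork one worker per core; the master process only supervises.
  os.cpus().forEach(() => cluster.fork());
  cluster.on('exit', (worker) => {
    console.log(`worker ${worker.process.pid} exited, starting a replacement`);
    cluster.fork();
  });
} else {
  // Every worker listens on the same port; connections are distributed across
  // workers, so one slow request no longer blocks everything else.
  http.createServer((req, res) => {
    res.end(`handled by worker ${process.pid}\n`);
  }).listen(PORT, () => console.log(`worker ${process.pid} listening on ${PORT}`));
}
```

Note this only helps if individual requests are CPU-bound in API, and any in-process state would have to be safe to duplicate across workers, which is presumably why it needs design work rather than a drop-in change.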

To Reproduce
Sharing reproduction instructions for this is difficult since you need access to the server, but roughly:

  1. a heavy request to the changes feed takes around 60,000 ms
  2. during this time, all other requests are queued and become very slow

Measurements
See graphs below
[Screenshots: response-time graphs, 2020-02-12 15:43 and 15:42]

Environment

  • Instance: LG
  • App: api
  • Version: 3.6.0


@Mwongela Mwongela added the Type: Performance Make something faster label Feb 12, 2020
@MaxDiz MaxDiz added this to Needs Triage in Implementing Partners Backlog via automation Feb 12, 2020
@garethbowen
Member

Hi @Mwongela, thanks for the issue!

We have made some performance improvements to replication in the last couple of releases, in particular #5550, #5878, #5759, #5797, and #5942. It would be interesting to run our scalability testing suite on 3.6.0, 3.7.0, and 3.8.0 to see how these have impacted real-world replication scalability. @ngaruko Is this something you can run?

I'm wary about clustering or load balancing API, as it hasn't been designed with clustering in mind. Clustering CouchDB is natively supported and I expect it would work well. It will come down to which service is the root cause. Can you check the CPU and memory usage of the various processes on the system?

Clustering comes with its own overheads, complexity, and costs. An easier first step is to look at procuring more server resources, for example bumping the AWS instance to a larger size. While this costs more, it should be possible to upgrade while running the onboarding and then scale down again when load has returned to normal. This is the ideal balance of cost and performance.

@enyachoke

@garethbowen We are currently preparing to test version 3.8 and will share the experience. We are still investigating the cause of this issue, and it is very likely not a server resource constraint; I shall update the issue with more details. Thanks for the support.

@enyachoke

So I am testing out 3.8.0 and saw this warning.
[Screenshot of the warning, 2020-02-17]
What potential impact will it have on the overall system? Is this something we need to worry about?

@dianabarsan
Member

@enyachoke Please have a look at the issue which added this warning: #5362

It's meant to help identify misconfigured users who end up downloading large numbers of documents because of missing roles or permissions, an incorrect home facility, etc.

@garethbowen
Member

The performance for users with a large number of docs hasn't been degraded in any way in 3.8.0; the only change is that we added a warning.

The occasional user with a little more than 10,000 docs is probably OK, but many users with significantly more docs will have a serious impact on both server performance for everyone and client-side performance for those users. If that's the case, we recommend configuring server-side purge so the client only replicates the docs that are relevant to what they're working on.
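
For anyone following along, server-side purge is configured in app_settings. The sketch below is illustrative only: the key names (`fn`, `text_expression`, `run_every_days`) and the purge function signature are recalled from the CHT documentation and should be verified against the docs for the version you are running.

```js
// Illustrative purge configuration. In app_settings.json the purge function is
// typically stored as a string; it is written inline here for readability.
const purgeConfig = {
  purge: {
    text_expression: 'at 1:00 am on Sunday', // assumed scheduling expression
    run_every_days: 7,                       // assumed re-run interval
    fn: function(userCtx, contact, reports, messages) {
      // Return the ids of documents that should no longer replicate to the client,
      // e.g. reports older than roughly one year.
      const oneYearAgo = Date.now() - 365 * 24 * 60 * 60 * 1000;
      return reports
        .filter(report => report.reported_date < oneYearAgo)
        .map(report => report._id);
    },
  },
};

module.exports = purgeConfig;
```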

@niiamon-lg

@garethbowen thanks for the detailed response. Given our current operational context, how would you advise we think about scaling? We feel that we already have plenty of computational and other resources (e.g. we are running API on a 32-core machine and we never see any memory pressure during peak times). This is something that's going to pop up soon on our radar and I'd like to be prepared. Also, are there any system limits that we should be aware of? This would help with capacity planning. In short, we would appreciate any operational data you have around running the codebase with a larger set of users.

@garethbowen
Member

@niiamon-lg Let me discuss some options for scaling with the rest of the Product team and get back to you.

Currently our primary limitation on replication is the number of documents each user will receive. Do you have an idea of how many documents you need to be able to support?

API performance bottlenecks are likely caused by a single request maxing out a CPU core and blocking the other requests, so the number of cores won't help scalability much. Thank you for the response-time screenshots above; can you also provide CPU usage data? I'm particularly interested in the per-process usage so we can isolate API from CouchDB and the other processes on the server.
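
One way to collect that kind of data is sketched below, assuming a Linux host with GNU ps available; the process names and sampling interval are illustrative rather than part of any official tooling.

```js
// Minimal sketch: sample per-process CPU and memory so API (node) can be
// separated from CouchDB (beam.smp, the Erlang VM).
const { execSync } = require('child_process');

const INTERESTING = ['node', 'beam.smp']; // medic-api workers vs CouchDB

const sample = () => {
  const out = execSync('ps -eo pid,pcpu,pmem,comm --sort=-pcpu', { encoding: 'utf8' });
  const rows = out.trim().split('\n')
    .slice(1)                                  // drop the header row
    .map(line => line.trim().split(/\s+/))
    .filter(([, , , comm]) => INTERESTING.includes(comm));

  console.log(new Date().toISOString());
  rows.forEach(([pid, cpu, mem, comm]) => {
    console.log(`  ${comm.padEnd(10)} pid=${pid} cpu=${cpu}% mem=${mem}%`);
  });
};

setInterval(sample, 5000); // log every 5 seconds while reproducing the slow sync
```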

@niiamon-lg

@garethbowen thanks, Gareth. It is difficult to know how many documents we would need to support per user. What we have seen is that, with time and usage, the number of documents a user holds increases. For instance, on our Uganda instance, we have seen well-configured users with about 11,000 documents.

Yes, it is as you say: we have seen that a single user with a high number of documents is able to make the entire service unavailable. I have asked @enyachoke to send you some details on CPU usage. I am sure we can capture some telemetry from htop or something similar, which would help with your request.

@kennsippell
Member

Related #7183

@garethbowen
Member

There have been significant improvements here, particularly in versions 4.0.0 - 4.4.0, which will solve this. Specifically, as of 4.3.0 the changes request is no longer used. I believe this means this specific issue is resolved, and I recommend you upgrade when possible.

If you experience similar issues in 4.4.0+ please reopen this issue or create a new one with additional data.

@garethbowen garethbowen added the Won't fix: Duplicate Covered by a different issue label Feb 13, 2024