Heavy changes request block all other requests #6246
Comments
Hi @Mwongela, thanks for the issue! We have made some performance improvements to replication in the last couple of releases, in particular #5550, #5878, #5759, #5797, and #5942. It would be interesting to run our scalability testing suite on 3.6.0, 3.7.0, and 3.8.0 to see how these have impacted real-world replication scalability. @ngaruko Is this something you can run?

I'm wary about clustering or load balancing API as it hasn't been designed with clustering in mind. Clustering CouchDB is natively supported and I expect it would work well. It will come down to which service is the root cause. Can you check the CPU and memory usage of the various processes on the system?

Clustering comes with its own overheads, complexity, and costs. An easier first step is to look at procuring more server resources, for example bumping the AWS instance to a larger size. While this costs more, it should be possible to upgrade while running the onboarding and then scale down again when load has returned to normal. This is the ideal balance of cost and performance.
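To gather the per-process numbers asked for above, a minimal sketch using procps `ps` on Linux. The `node` and `beam.smp` process names are assumptions based on a typical API-plus-CouchDB deployment, not confirmed from this server:

```shell
# Snapshot of the top CPU consumers, with memory, sorted by CPU usage.
ps -eo pid,comm,%cpu,%mem --sort=-%cpu | head -n 10

# Filter to just the processes of interest (the PID pattern keeps the header).
# API typically runs under node; CouchDB's Erlang VM appears as beam.smp.
ps -eo pid,comm,%cpu,%mem --sort=-%cpu | grep -E 'PID|node|beam\.smp'
```

Sampling this every few seconds during a sync window (e.g. with `watch`) makes it easier to see whether API or CouchDB is the process saturating a core.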
@garethbowen We are currently preparing to test version 3.8 and will share our experience. We are still investigating the cause of this issue, and it is very likely not a server resource constraint; I shall update the issue with more details. Thanks for the support.
@enyachoke Please have a look at the issue which added this warning: #5362. It's meant to help identify misconfigured users who end up downloading large numbers of documents because of missing roles or permissions, an incorrect home facility, etc.
The performance for users with a large number of docs hasn't been degraded in any way in 3.8.0 - the only change is that we added a warning. The occasional user with a few more than 10,000 docs is probably ok, but many users with significantly more docs will have a serious impact on both server performance for everyone and client-side performance for those users. If that's the case we recommend configuring server-side purge so the client only replicates the docs that are relevant to what they're working on.
@garethbowen thanks for the detailed response. Given our current operational context, how would you advise that we think about scaling? We feel that we already have a lot of computational and other resources (e.g. we are running API on a 32-core machine and we never see any memory pressure during peak times). This is something that's going to pop up soon on our radar and I'd like to be prepared. Also, are there any sorts of system limits that we should be aware of? This would help with capacity planning. In short, we would appreciate any sort of operational data that you have around running the codebase with a larger set of users.
@niiamon-lg Let me discuss some options for scaling with the rest of the Product team and get back to you. Currently our primary limitation on replication is the number of documents that a user will receive. Do you have an idea of how many documents you need to be able to support?

API performance bottlenecks are likely caused by a single request maxing out a CPU core and blocking the other requests, so the number of cores won't help scalability much. Thank you for the response time screenshots above - can you also provide CPU usage data? I'm particularly interested in the per-process usage so we can isolate API from CouchDB and the other processes on the server.
@garethbowen thanks, Gareth. It would be a bit difficult to know what number of documents we would need to be able to support per user. What we have seen is that with time and usage, the number of documents a user will possess increases. For instance, on our Uganda instance, we have seen well-configured users with about 11,000 documents. Yes, it is as you say: we have seen that a single user with a high number of documents is able to cause the entire service to become unavailable. I have asked @enyachoke to send you some details on CPU usage. I am sure that we can take some telemetry from htop or something similar which would help with your request.
Related: #7183
There have been significant improvements here, particularly in versions 4.0.0 - 4.4.0, which will solve this. Specifically, as of 4.3.0 the changes request is no longer used. I believe this means this specific issue is resolved, and I recommend you upgrade when possible. If you experience similar issues in 4.4.0+ please reopen this issue or create a new one with additional data.
Describe the performance issue
When a user with many documents logs in, the heavy changes request issued during their initial replication blocks all other requests. During LG rollouts on v3.6.0, other users are not able to log in and sync their documents, and online users are not able to access the app in the browser. We are syncing up to 1000 users at a time, but the problem starts at around 300 users.
Describe the improvement you'd like
We have considered the following:
Can this be fixed with faster _changes responses? Or by yielding to allow parallel requests? Or could we get a realistic guideline for how many users can sync in parallel during these large-scale upgrades?
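A rough way to observe the blocking behaviour is to time a trivial request while a heavy changes request is in flight. This is a hypothetical sketch, not our exact reproduction steps: the host, credentials, database name, and `limit` value are all placeholders to adjust for your deployment.

```shell
# Start a heavy _changes request in the background (placeholder host/creds).
curl -s -o /dev/null -w 'heavy _changes took %{time_total}s\n' \
  'https://admin:pass@cht.example.com/medic/_changes?limit=10000' &

# While it runs, time a trivial request; if API is blocked, this stalls too.
curl -s -o /dev/null -w 'trivial request took %{time_total}s\n' \
  'https://admin:pass@cht.example.com/'

wait
```

Repeating the trivial request with and without the background `_changes` call in progress gives a simple before/after comparison of request latency under load.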
To Reproduce
Sharing reproduction instructions for this is difficult since you need access to the server, but roughly,
Measurements
![Screenshot 2020-02-12 at 15 43 29](https://user-images.githubusercontent.com/3817321/74336118-bb491a80-4dae-11ea-865a-7b5a2edf4589.png)
![Screenshot 2020-02-12 at 15 42 04](https://user-images.githubusercontent.com/3817321/74336126-bf753800-4dae-11ea-92ae-f51d50c14916.png)
See graphs above.
Environment
Additional context