Large snapshots prevent the addition of new managers to the cluster #2374
Comments
@wsong and I discussed this; the fix should also include better error reporting for the failure scenarios involved.
I think streaming the snapshot is the best approach.
3 is just a better separation of critical vs non-critical data, IMO. 4 is an optimization. 1 can possibly be done in the short term. 2 is the long-term approach.
Increasing the gRPC message size to 128MB, in case it is needed: #2375
The short-term fix has landed. Reducing priority to P1 since the long-term solution is not P1.
Fixed in #2458
This issue has been seen in a couple of production clusters and is considered critical. We must fix it in SwarmKit.
Summary
When the raft snapshot grows larger than 4MB, adding a new manager to the cluster becomes problematic. The default gRPC message limit is 4MB, so sending the snapshot to the joining manager fails, and the new manager does not end up with the proper cluster state. The same failure can occur when a manager in an existing cluster falls behind and needs to receive a snapshot from a raft peer.
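To make the size constraint concrete, here is a minimal Go sketch of the chunking idea behind streaming a snapshot: a payload larger than the 4MB limit is cut into pieces that each fit in one message and is reassembled on the receiving side. This is an illustration under stated assumptions, not SwarmKit's actual implementation. The names `splitSnapshot`, `reassembleSnapshot`, and `defaultGRPCMaxMsgSize` are hypothetical, and the 2MB chunk size is an arbitrary choice.

```go
package main

import (
	"bytes"
	"fmt"
)

// defaultGRPCMaxMsgSize mirrors the 4MB default message limit described
// above. The constant name is an assumption made for this sketch.
const defaultGRPCMaxMsgSize = 4 * 1024 * 1024

// splitSnapshot (hypothetical helper) cuts a serialized snapshot into
// chunks of at most chunkSize bytes, so each chunk can travel in its own
// message. It returns nil when chunkSize is not positive.
func splitSnapshot(snapshot []byte, chunkSize int) [][]byte {
	if chunkSize <= 0 {
		return nil
	}
	var chunks [][]byte
	for len(snapshot) > 0 {
		n := chunkSize
		if len(snapshot) < n {
			n = len(snapshot)
		}
		chunks = append(chunks, snapshot[:n])
		snapshot = snapshot[n:]
	}
	return chunks
}

// reassembleSnapshot (hypothetical helper) concatenates the chunks in
// order, as the receiving manager would before applying the snapshot.
func reassembleSnapshot(chunks [][]byte) []byte {
	return bytes.Join(chunks, nil)
}

func main() {
	// A 10MB placeholder payload standing in for a large raft snapshot.
	snapshot := make([]byte, 10*1024*1024)

	// Stay well under the limit to leave room for message framing.
	chunks := splitSnapshot(snapshot, defaultGRPCMaxMsgSize/2)

	fmt.Println("exceeds single-message limit:", len(snapshot) > defaultGRPCMaxMsgSize)
	fmt.Println("chunks needed:", len(chunks))
}
```

A real streaming implementation would also need ordering guarantees, error handling for partial transfers, and the better error reporting requested in the comments.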
What Makes the Snapshot Large
Running a large number of services/tasks, possibly connected to many networks, can increase the size of the snapshot. If the task history retention limit is particularly high, many old tasks stay around and bloat it further. A large number of (possibly large) secrets can also cause this problem.
Possible Fixes
There are several possible fixes that have been discussed. Let's use this issue to discuss pros and cons.
We may have to do a combination of these things.
cc @wsong @anshulpundir @stevvooe @aluzzardi @aaronlehmann @jlhawn