control-plane hardening: Avoid nDB stale entries #1727
Merged
With the current design in libnetwork, control-plane events are first sent through gossip. Gossip runs over UDP, which can be lossy. To account for that, there is an anti-entropy phase every 30 seconds that does a full state sync (bulk sync) with a peer.
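To make the two paths concrete, here is a minimal Go sketch of that loop, assuming hypothetical helpers (`broadcastUDP`, `bulkSyncWith`, `randomPeer`) rather than the real networkdb API: the lossy gossip send is the fast path, and the 30-second ticker drives the anti-entropy bulk sync.

```go
package main

import (
	"log"
	"time"
)

type event struct {
	Table, Key string
	Value      []byte
}

// broadcastUDP gossips an event to peers; UDP delivery is best effort.
func broadcastUDP(ev event) {}

// bulkSyncWith exchanges full table state with a single peer.
func bulkSyncWith(peer string) error { return nil }

// randomPeer picks one peer for the anti-entropy pass.
func randomPeer() string { return "peer-1" }

func controlPlaneLoop(events <-chan event, stop <-chan struct{}) {
	// Anti-entropy timer: a full push/pull state sync every 30 seconds
	// repairs whatever the lossy UDP gossip missed.
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case ev := <-events:
			broadcastUDP(ev) // fast path: gossip the event right away
		case <-ticker.C:
			if err := bulkSyncWith(randomPeer()); err != nil {
				// A failed pass is retried on the next cycle; peers stay
				// out of date until one succeeds.
				log.Printf("bulk sync failed: %v", err)
			}
		case <-stop:
			return
		}
	}
}

func main() {
	events := make(chan event)
	stop := make(chan struct{})
	go controlPlaneLoop(events, stop)
	events <- event{Table: "endpoint_table", Key: "ep1", Value: []byte("join")}
	close(stop)
}
```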
Problem: In some deployments we have seen the CPU getting pegged or memory being exhausted (until OOM kicks in) for long periods, causing both gossip and the push/pull state sync to fail. Delete events are currently retained in nDB for 60 seconds, so a deleted entry survives at most two bulk-sync cycles. If a few bulk syncs from this node fail, the peers will be left with stale entries forever.
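The window is easy to see in a sketch of the reaper, assuming illustrative names (`tombstone`, `reapLoop`, `deleteRetention`) rather than the actual networkdb identifiers: with a 60-second retention and a 30-second bulk-sync period, a delete marker gets at most two chances to propagate before it is dropped.

```go
package ndbsketch

import (
	"sync"
	"time"
)

const (
	bulkSyncInterval = 30 * time.Second // anti-entropy period described above
	deleteRetention  = 60 * time.Second // how long a delete event stays in nDB today
	// deleteRetention / bulkSyncInterval == 2: a missed delete has only two
	// bulk-sync opportunities to reach a peer before it is reaped.
)

type tombstone struct {
	key       string
	deletedAt time.Time
}

type store struct {
	mu         sync.Mutex
	tombstones map[string]tombstone
}

// reapLoop drops delete markers once their retention window has passed. After
// that, a peer that missed both the gossip and the two bulk syncs keeps the
// stale entry forever.
func (s *store) reapLoop(stop <-chan struct{}) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			s.mu.Lock()
			for k, t := range s.tombstones {
				if time.Since(t.deletedAt) > deleteRetention {
					delete(s.tombstones, k)
				}
			}
			s.mu.Unlock()
		case <-stop:
			return
		}
	}
}
```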
This can be fixed by two approaches. The first carries the risk that we could incorrectly delete an event, because there is no owner identification in the event messages and bulk sync happens with only one peer. In bigger clusters it can also take longer for an event to eventually reach all nodes through bulk sync (if it was missed earlier by gossip), so we have to let the entries remain for many bulk-sync cycles anyway. Approach 2, increasing the reap time so that deleted entries remain for many bulk-sync cycles, achieves the same result in a much simpler and more reliable way. This applies to both the network-scoped gossip (endpoint join/leave events) and the global gossip (node leave, network join/leave).
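A rough sketch of what approach 2 amounts to, assuming the fix is essentially a larger retention (reap) time for deleted entries; the constant names and the 30-minute value are illustrative, not necessarily the exact values used in this PR:

```go
package ndbsketch

import "time"

const (
	antiEntropyPeriod = 30 * time.Second // bulk sync runs this often

	// Before: tombstones reaped after 60s, i.e. at most two bulk-sync windows.
	oldDeleteRetention = 60 * time.Second

	// After (illustrative value): keep tombstones long enough to span many
	// bulk-sync cycles, so a node whose gossip and several bulk syncs failed
	// still converges on a later cycle.
	newDeleteRetention = 30 * time.Minute
)

// cyclesCovered reports how many anti-entropy rounds a delete marker survives
// before it is reaped: 2 with the old retention, 60 with the new one.
func cyclesCovered(retention time.Duration) int {
	return int(retention / antiEntropyPeriod)
}
```

The cost of the longer retention is only some extra memory for tombstones, while the stale-entry window now tolerates many consecutive failed bulk syncs instead of just two.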
There is one specific case to consider though: if the gossip for the last few tasks being deleted on a node is lost, we will remove that network from this node, so increasing the reap time for endpoint events won't help there. But the network leave event itself will be retained longer with this change, and that, combined with #1704, will make sure the state is cleaned up on all remote nodes.
Signed-off-by: Santhosh Manohar <santhosh@docker.com>