
control-plane hardening: Avoid nDB stale entries #1727

Merged
merged 1 commit into moby:master on Apr 25, 2017

Conversation


@sanimej commented on Apr 19, 2017

With the current design in libnetwork, control-plane events are first sent through gossip. Gossip runs over UDP, which can be lossy. To account for that, there is an anti-entropy phase every 30 seconds that does a full state sync.

Problem: In some deployments we have seen the CPU getting pegged or memory exhaustion (until the OOM killer kicks in) for long periods, causing both gossip and the push/pull state sync to fail. Delete events are currently retained in nDB for 60 seconds, so at most a deleted event survives only two bulk-sync cycles. If a few bulk syncs from this node fail, the peers will be left with stale entries forever.

This can be fixed by two approaches:

  1. Introduce a mechanism in every node to figure out when an event is stale and delete it locally. This can be done with mark-and-sweep logic, i.e., on every bulk sync mark all events in the nDB as stale and clear the flag as we process the events from the bulk sync. If an event remains stale for a few bulk-sync intervals, delete it.

This carries the risk of incorrectly deleting an event, because there is no owner identification in the event messages and bulk sync happens with only one peer. In bigger clusters it can take longer for an event to eventually reach all nodes through bulk sync (if it was missed earlier in the gossip). To avoid this we would have to let entries remain for many bulk-sync cycles. Approach 2 achieves the same in a much simpler and more reliable way.

  2. Let the deleted entries remain longer in the networkDB. Currently they are cleaned up after 60 seconds, which is too aggressive. This change increases the retention to 30 minutes.

This applies for network scoped gossip (endpoint join/leave events) and the global gossip (node leave, network join/leave).
There is a specific case to consider, though: the gossip for the last few tasks getting deleted on a node is lost. In this case we will remove that network from this node, so increasing the reap time for endpoint events won't help. But the network leave event itself will be retained longer with this change. Combined with #1704, this will make sure the state is cleaned up on all remote nodes.

Signed-off-by: Santhosh Manohar santhosh@docker.com

@mavenugo
Contributor

LGTM

@mavenugo mavenugo merged commit 5dc95a3 into moby:master Apr 25, 2017