ServiceController should clarify behavior of "being deleted" nodes #111824
Comments
@thockin: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Interesting
Pods that don't have a node associated are removed from the EndpointSlice (#110639), but the EndpointSlice controller doesn't handle the Node DELETING "pseudo-state" either 🤷 However, it does handle the Pod terminating state: https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/1672-tracking-terminating-endpoints

There is also https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2000-graceful-node-shutdown

Maybe we should have a better "global" story for these pod/node lifecycles and cascade it down to the controllers that depend on them (LB/services/endpoints), instead of doing it bottom-up and increasing the risk of inconsistent behaviors.
I agree. And determining which we end up doing should be a decision based on how the CCM handles syncing a non-existing VM. If that condition causes it to enter some form of repeated error state, then it should be a comment on why it isn't a predicate... and I get the feeling from the existing auto-scaling taint predicate that this is the case. If we are able to cleanly handle it (either currently or by fixing the possibly broken behaviour), then we should probably remove the current taint predicate and just sync on the DELETED state, so as to give "best-practice-abiding controllers", which use a finalizer and intend to delete the node, the possibility to drain connections, even if only for a very short while. Imagine the auto-scaler using this mechanism instead of tainting the node, i.e.: it would use a finalizer, trigger a node delete, cordon the node, then drain the node, delete the VM and finally remove the finalizer. We would want the CCM to remove the node from the LB set on the last step, but drain the connections once we drain the node. This isn't the case today, but it might be in the future for other controllers.
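A minimal sketch (an editor's illustration, not code from the thread or from the CCM) of how a controller could recognize the "being deleted but still finalizer-protected" window described above. The helper names are hypothetical; only the Kubernetes API fields are real.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// nodeIsBeingDeleted: a Node that has been DELETE'd but is held by a
// finalizer has a non-nil deletionTimestamp while the object still exists
// in the API. A service controller could use this to start draining (or to
// drop the node from the LB set) before the object actually disappears.
func nodeIsBeingDeleted(node *v1.Node) bool {
	return node.DeletionTimestamp != nil
}

// nodeIsFinalizerProtected reports whether deletion is currently blocked by
// finalizers, i.e. the window in which a graceful drain could happen.
func nodeIsFinalizerProtected(node *v1.Node) bool {
	return node.DeletionTimestamp != nil && len(node.Finalizers) > 0
}

func main() {
	node := &v1.Node{}
	fmt.Println(nodeIsBeingDeleted(node), nodeIsFinalizerProtected(node))
}
```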
This is why I wrote https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/object-lifecycle.md :)

When a Pod is DELETING, should it be in the EPSlice? Yes, that's the terminating-endpoints KEP.

When a Pod is on a Node which is DELETING, should it be in the EPSlice? Probably not. It seems better to remove it from the EPSlice while the pod still exists, so that propagation delay results in successful connections. Otherwise it might stay in the slice for O(seconds) after the Node is actually deleted.

When a Node is DELETING, should LBs use it? That's this issue in particular. It's important that we enable a distinction between "this node is probably going away soon - no new connections" and "this node is effectively gone". We have

When an EPSlice is DELETING, should kube-proxy use it at all? I think no - treat it as deleted.

To Antonio's point - I think the semantic of DELETING is "this object will go away abruptly and without further notice - be prepared". In almost all cases, that means "treat it like it is really gone". Agree or disagree?
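As an illustration of the "Pod on a DELETING Node" question above, here is a hedged sketch of how an EndpointSlice-style reconciler might apply those lifecycle rules. This is not the actual endpointslice controller logic, and the helper name is made up.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// includeInEndpointSlice sketches the rules discussed above:
//   - a Pod whose Node is DELETING is dropped, so the removal propagates
//     while the backend still exists rather than O(seconds) after the Node
//     is actually gone;
//   - a DELETING Pod stays, but as a "terminating" endpoint (KEP-1672).
// Illustrative only; the real controller applies more conditions
// (readiness, publishNotReadyAddresses, etc.).
func includeInEndpointSlice(pod *v1.Pod, node *v1.Node) (include, terminating bool) {
	if node != nil && node.DeletionTimestamp != nil {
		return false, false // node is going away: remove the endpoint now
	}
	if pod.DeletionTimestamp != nil {
		return true, true // pod is terminating: keep it, but mark it as such
	}
	return true, false
}

func main() {
	fmt.Println(includeInEndpointSlice(&v1.Pod{}, &v1.Node{}))
}
```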
I am not sure I agree with:
The LB will not send traffic to a Node which is actually deleted, because that Node should not be running kube-proxy well before that point and should hence be failing the HC. If the KCCM handles syncing a deleted node for which there is no VM anymore, then I think it's fine in the Node case to sync on DELETED, as I mentioned above. I will have a look at the KCCM and verify how it handles deleting a DELETED VM, and possibly fix it if it can't handle that today and the fix is easy enough.
agree node DELETING = put out of rotation
There are some assumptions we cannot make: the API and reality are never in sync, and there are a lot of async operations in this dance, with multiple controllers reacting independently to the same events ... and some of these controllers are not under our control: each cloud provider has a different out-of-tree controller, not everybody uses kube-proxy, you can be using ingresses, ... I think it is better to have predictable behavior in this case, so we all know our limitations and just live with that ... than behavior close to optimal that will behave differently depending on the cloud provider, the type of service proxy, ...
But being gentle and not dropping existing connections can also be achieved when the node starts failing the HC. And it will start failing at some point, irrespective of whether the node is running kube-proxy or some other service proxy, because whatever service proxy anyone uses, if it supports ExternalTrafficPolicy, it must expose an endpoint for the LB to target with its HC. And once the service proxy is evicted, that HC will fail and the connections will gently be allowed to finish instead of being abruptly killed (for the LBs that support this). The advantage of this is:
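For readers less familiar with the health-check mechanism referenced here: for externalTrafficPolicy: Local services, kube-proxy serves an HTTP health check on the service's healthCheckNodePort, and cloud LBs probe it per node. A rough sketch of such a probe follows; the node IP and port values are placeholders, and this is an illustration of the concept rather than any particular LB's implementation.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// probeNode imitates what a cloud LB health check does against kube-proxy's
// per-service health check endpoint. When the service proxy on a draining
// node stops (or stops reporting local endpoints), this probe starts failing
// and the LB takes the node out of rotation without abruptly cutting
// established connections (on LBs that support connection draining).
func probeNode(nodeIP string, healthCheckNodePort int) bool {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(fmt.Sprintf("http://%s:%d/healthz", nodeIP, healthCheckNodePort))
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	// Placeholder values; a real LB learns these from the Service object.
	fmt.Println(probeNode("10.0.0.12", 32000))
}
```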
I also worked out that there is no problem if the KCCM syncs Nodes on DELETED. Even if their VM might have been GC'ed by the moment the LB is to be updated, that will have no impact on the KCCM's ability to update the LB set correctly, see #111658 (comment). I will hence use that PR to address this issue.
But my point is that deleting a Node from the API doesn't mean that the VM is gone or that the pods have been evicted; there are situations where Pods are running and the Pod objects are alive, but the Node object has been deleted.
I revised a previous drawing. I am still not 100% sure on the taint part (see kubernetes/autoscaler#5201). Tell me if you disagree with this.
Questions:
Comments:
I feel we are in situation similar to |
I think most of your comments / questions are addressed in kubernetes/enhancements#3460. I've updated the KEP and gone into further detail about how the lifecycle of Nodes/Endpoints/service proxy impacts application ingress. Hopefully that will be beneficial as background and provide good ground for a discussion going forward.
Nothing really tracks the LB HC status. The LB uses the HC to evaluate which nodes should be used for traffic load-balancing, but this is only done in iteration N and nothing is kept in memory / tracked. The KEP describes how things currently work in greater detail.
Yes, a node can be unschedulable, as in "no new pods should get assigned to this node". It does not make any reference to "currently running pods on this node" and what happens to them; that is eviction. This condition was removed from the CCM many moons ago with #90823, because it caused production issues.
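To make that distinction concrete: cordoning only sets a field on the Node spec and influences scheduling of new pods, saying nothing about pods already running there. A small hedged sketch (the helper name is illustrative; only the API field is real):

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// isCordoned reports whether new pods should stop being scheduled to the
// node. It says nothing about already-running pods; removing those is
// eviction/drain, a separate operation. Using this field as an LB-exclusion
// predicate is what #90823 removed from the service controller.
func isCordoned(node *v1.Node) bool {
	return node.Spec.Unschedulable
}

func main() {
	node := &v1.Node{Spec: v1.NodeSpec{Unschedulable: true}}
	fmt.Println(isCordoned(node))
}
```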
Also, it seems that from the KCCM service controller we sync nodes' status only every 100s to check if load balancers need to be updated to point to a new set. Any hints on this @thockin?
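For context on the resync cadence being asked about here, a controller-style periodic resync can be sketched as below. The function and constant names are illustrative only, not the actual service controller code; the 100-second figure is taken from the comment above.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// Illustrative resync period matching the ~100s cadence mentioned above.
const nodeSyncPeriod = 100 * time.Second

// syncNodes is a stand-in for "recompute the LB node set and call the cloud
// provider if it changed". Event handlers can trigger this sooner; the
// periodic loop is a safety net that catches anything the handlers missed.
func syncNodes() {
	fmt.Println("re-evaluating load balancer node sets")
}

func main() {
	stopCh := make(chan struct{})
	// wait.Until runs syncNodes immediately and then every nodeSyncPeriod
	// until stopCh is closed.
	go wait.Until(syncNodes, nodeSyncPeriod, stopCh)

	time.Sleep(1 * time.Second)
	close(stopCh)
}
```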
No, not really. The
That's not the case, as I explained above.
@alexanderConstantinescu Did we resolve this? I only see one reference to
I would say: yes. IMO, what we concluded for the
Agreed. We can add that as a new predicate and add it to the list at https://github.com/kubernetes/kubernetes/blob/f5fbafda8700a888759a62f006be19e34401be3b/staging/src/k8s.io/cloud-provider/controllers/service/controller.go#L945. We can do that once the KEP PR merges, so that it's easier to keep track of things?
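A hedged sketch of what such a predicate could look like, following the func(*v1.Node) bool shape the service controller uses for node filtering. The type and function names here are illustrative, not the code that eventually merged; the final form depends on the KEP.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// nodePredicate mirrors the "func(*v1.Node) bool" shape used for node
// filtering; the exact type name in the real controller may differ.
type nodePredicate func(node *v1.Node) bool

// nodeNotBeingDeleted would exclude nodes with a deletion timestamp from the
// LB node set, which is one possible resolution of this issue.
func nodeNotBeingDeleted(node *v1.Node) bool {
	return node.DeletionTimestamp == nil
}

// nodePassesAll applies a predicate chain, the way the controller composes
// its existing predicates.
func nodePassesAll(node *v1.Node, preds ...nodePredicate) bool {
	for _, p := range preds {
		if !p(node) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(nodePassesAll(&v1.Node{}, nodeNotBeingDeleted))
}
```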
Sure. SGTM. You're just racking up the PRs ;) Feel free to send it with your other one included so I can pipeline the approvals.
Closed by #116536
Should nodes that are "being deleted" (have been DELETE'd but have an active finalizer) be IN or OUT of the LB set? I want to say they should be removed as soon as we see the deletion timestamp, but I can also see an argument for this being another part of "draining". Either way, we should document this in the code explicitly (e.g. as a predicate or a comment why it isn't a predicate).
@alexanderConstantinescu
@aojea