
ServiceController should clarify behavior of "being deleted" nodes #111824

Closed
thockin opened this issue Aug 12, 2022 · 17 comments
Labels
sig/network Categorizes an issue or PR as relevant to SIG Network. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@thockin
Member

thockin commented Aug 12, 2022

Should nodes that are "being deleted" (have been DELETE'd but have an active finalizer) be IN or OUT of the LB set? I want to say they should be removed as soon as we see the deletion timestamp, but I can also see an argument for this being another part of "draining". Either way, we should document this in the code explicitly (e.g. as a predicate or a comment why it isn't a predicate).

@alexanderConstantinescu
@aojea

@thockin thockin added the sig/network Categorizes an issue or PR as relevant to SIG Network. label Aug 12, 2022
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Aug 12, 2022
@k8s-ci-robot
Contributor

@thockin: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@thockin thockin changed the title ServiceController should clarify behavbior of "being deleted" nodes ServiceController should clarify behavior of "being deleted" nodes Aug 12, 2022
@aojea
Member

aojea commented Aug 16, 2022

Interesting

Should nodes that are "being deleted" (have been DELETE'd but have an active finalizer) be IN or OUT of the LB set? I

Pods that don't have an associated node are removed from the EndpointSlice (#110639), but the EndpointSlice controller doesn't handle the Node DELETING "pseudo-state" either 🤷

However, it does handle the Pod terminating state https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/1672-tracking-terminating-endpoints

There is also https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2000-graceful-node-shutdown

Maybe we should have a better "global" story for these pod/node lifecycles and cascade it down to the controllers that depend on them, like LB/services/endpoints, instead of doing it bottom-up and increasing the risk of inconsistent behaviors

@alexanderConstantinescu
Member

Either way, we should document this in the code explicitly (e.g. as a predicate or a comment why it isn't a predicate).

I agree. And determining which we end up doing should be based on how the CCM handles syncing a non-existing VM. If that condition causes it to enter some form of repeated error state, then it should be a comment on why it isn't a predicate... and I get the feeling, from the existing auto-scaling taint predicate, that this is the case.

If we are able to cleanly handle it (either currently or by fixing the possibly broken behaviour), then we should probably remove the current taint predicate and just sync on the DELETED state, so as to give "best-practice-abiding controllers" which use a finalizer and intend on deleting the node the possibility to connection drain, even if only for a very short while. Imagine the auto-scaler using this mechanism instead of tainting the node, i.e. it would use a finalizer, trigger a node delete, cordon the node, then drain the node, delete the VM and finally remove the finalizer. We would want the CCM to remove the node from the LB set on the last step, but drain the connections once we drain the node. This isn't the case today, but it might be in the future for other controllers.

@thockin thockin added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 18, 2022
@thockin
Member Author

thockin commented Aug 18, 2022

This is why I wrote https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/object-lifecycle.md :)

When a Pod is DELETING, should it be in the EPSlice? Yes, that's the terminating-endpoints KEP.

When a Pod is on a Node which is DELETING, should it be in the EPSlice? Probably not. It seems better to remove it from the EPSlice while the pod still exists, so propagation delay results in successful connections. Otherwise it might be in the slice for O(seconds) after the Node is actually deleted.

When a Node is DELETING, should LBs use it? That's this issue in particular. It's important that we enable a distinction between "this node is probably going away soon - no new connections" and "this node is effectively gone". We have unschedulable to represent the former (after all of @alexanderConstantinescu's fixes go in :). Same as above, it seems better to remove it from the LBs while the node still exists, so propagation delay results in successful connections. Otherwise it might be in the LB for O(seconds) after the Node is actually deleted.

When an EPSlice is DELETING, should kube-proxy use it at all? I think no - treat it as deleted.

To Antonio's point - I think the semantic of DELETING is "this object will go away abruptly and without further notice - be prepared". In almost all cases, that means "treat it like it is really gone".
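
A minimal sketch of that DELETING check, for illustration only (the helper name is mine, not code that exists in the tree):

```go
package sketch

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// isDeleting reports whether an API object has been DELETE'd and is now only
// held by pending finalizers: it will go away abruptly, without further
// notice, once those finalizers are removed.
func isDeleting(obj metav1.Object) bool {
	return obj.GetDeletionTimestamp() != nil
}
```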

Agree or disagree?

@alexanderConstantinescu
Member

Agree or disagree?

I am not sure I agree with:

When a Node is DELETING, should LBs use it? That's this issue in particular. It's important that we enable a distinction between "this node is probably going away soon - no new connections" and "this node is effectively gone". We have unschedulable to represent the former (after all of @alexanderConstantinescu's fixes go in :). Same as above, it seems better to remove it from the LBs while the node still exists, so propagation delay results in successful connections. Otherwise it might be in the LB for O(seconds) after the Node is actually deleted.

The LB will not send traffic to a Node which is actually deleted because that Node should not be running kube-proxy well before that point and should hence be failing the HC. If the KCCM handles syncing a deleted node for which there is no VM anymore, then I think it's fine in the Node case to sync on DELETED, as I mentioned above.

I will have a look at the KCCM and verify how it handles deleting a DELETED VM, and possibly fix it if it can't handle that today and the fix is easy enough.

@aojea
Member

aojea commented Aug 26, 2022

Agree or disagree?

agree node DELETING = put out of rotation
why do you want to send new connections to something you know is going to be deleted but you don't know why?
just be gentle and don't drop the existing connections if you can, but that is on the LB, not in the controller ...

The LB will not send traffic to a Node which is actually deleted because that Node should not be running kube-proxy well before that point and should hence be failing the HC.

There are some assumptions we cannot make: the API and reality are never in sync, and there are a lot of async operations in this dance, with multiple controllers reacting independently to the same events ... and some of these controllers are not under our control, each cloud provider has a different controller out of tree, not everybody uses kube-proxy, you can be using ingresses, ...

I think it is better to have a predictable behavior in this case, so we all know our limitations and just live with them ... than one close to optimum that will behave differently depending on the cloud provider, type of service proxy, ...

@alexanderConstantinescu
Member

just be gentle and don't drop the existing connections if you can, but that is on the LB, not in the controller ...

But being gentle and not dropping existing connections can also be achieved when the node starts failing the HC. And it will start failing at some point - irrespective of whether it is running kube-proxy or some other service proxy - because whatever service proxy anyone uses, if it supports ExternalTrafficPolicy, it must expose an endpoint for the LB to target with its HC. And once the service proxy is evicted, that HC will fail and the connections will gently be allowed to finish instead of abruptly killing all of them (for the LBs that support this).
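
To make that concrete, here is a rough sketch of how an LB controller typically picks its HC target. This is illustrative, not the actual cloud-provider code; 10256 is kube-proxy's default healthz port, and the fields are from the core v1 Service API:

```go
package sketch

import v1 "k8s.io/api/core/v1"

// healthCheckTarget sketches the HC wiring described above: with
// externalTrafficPolicy=Local the LB probes the per-service health check
// node port served by the service proxy; otherwise it typically probes the
// proxy's node-wide healthz port. Either way, once the proxy stops running
// on a node, the HC fails and the LB stops sending new connections there.
func healthCheckTarget(svc *v1.Service) int32 {
	if svc.Spec.ExternalTrafficPolicy == "Local" {
		return svc.Spec.HealthCheckNodePort
	}
	return 10256 // kube-proxy's default healthz port
}
```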

The advantages of this approach are:

  • Easier-to-read / easier-to-understand code in the CCM, with fewer unclear and dubious predicates
  • A reduced delta between the LB no longer sending the node traffic and the pod finally being evicted and no longer able to accept traffic. If we remove the node from the LB set well before the pod is stopped and evicted, and that pod is the only endpoint for the service, then we increase any outage - which, granted, will in any case occur when the pod is evicted and restarted elsewhere.

I also worked out that there is no problem if the KCCM syncs Nodes on DELETED. Even if their VM might have been GC'ed by the moment the LB is to be updated, that will have no impact on the KCCM's ability to update the LB set correctly, see: #111658 (comment)

I will hence use that PR to address this issue.

@aojea
Member

aojea commented Sep 7, 2022

But my point is that deleting a Node from the API doesn't mean that the VM is gone or that the pods have been evicted; there are situations where Pods are running and the Pod objects are alive, but the Node object has been deleted.

#110639

@thockin
Member Author

thockin commented Sep 20, 2022

I revised a previous drawing. I am still not 100% on the taint part (see kubernetes/autoscaler#5201)

[Image: Node_LB_states_v2]

Tell me if you disagree with this.

@khenidak
Contributor

khenidak commented Oct 5, 2022

Questions:

  • Why are we tracking LB-HC status?

Comments:

  • State "NODE is not in LB" -> "NODE is in LB": remove the dependency on readiness. LB probes should identify whether it is ready or not. What is ready to us may not be ready from the LB's POV due to network disruption. The inverse is also true: ready/not ready is kubelet status, not the readiness of the pods running.
  • I remember seeing that you can still run pods even when the unschedulable taint is set, via a toleration. I could be wrong, but if I am not, then it should be removed as a precondition.

I feel we are in a situation similar to "it is now safe to unplug your USB device". Meaning, while we are removing a node from the LB because the object is deleted, traffic might still be flowing into that node. The question I am trying to raise here is: what do we consider safe?

@alexanderConstantinescu
Member

alexanderConstantinescu commented Oct 6, 2022

I think most of your comments / questions are addressed in kubernetes/enhancements#3460

I've updated the KEP and gone into further details about how the lifecycle of Nodes/Endpoints/service proxy impacts application ingress. Hopefully that will be beneficial as background and provide a good ground for a discussion going forwards.

Why are we tracking LB-HC status?

Nothing really tracks the LB-HC status. The LB uses the HC to evaluate which nodes should be used for traffic load-balancing, but this is only done in iteration N and nothing is kept in memory / tracked. The KEP details how things currently work in greater detail.

I remember seeing that you can still run pods even unschedulable taint is set via toleration.

Yes, a node can be unschedulable, as in "no new pods should get assigned to this node". It does not make any reference to "currently running pods on this node" and what happens to them; that is eviction. This condition was removed from the CCM many moons ago with #90823, because it caused production issues.

@sterchelen

sterchelen commented Dec 2, 2022

Also, it seems that the KCCM service controller only syncs nodes every 100s to check whether load balancers need to be updated to point to a new set.
See here https://github.com/kubernetes/cloud-provider/blob/016d675f2dd271246e45aa8a1e5b17fd1909539c/controllers/service/controller.go#L51
I'm trying to find the history of why those 100s... that would help me understand why we aren't watching node changes and acting accordingly.

Any hints on this @thockin?

@alexanderConstantinescu
Member

Also, it seems that the KCCM service controller only syncs nodes every 100s to check whether load balancers need to be updated to point to a new set.

No, not really. The nodeSyncPeriod defines the interval at which a full re-sync of all items in the informer cache is triggered. Any change (ADD/UPDATE/DELETE) on any Node object will trigger a sync for that particular Node object, regardless of nodeSyncPeriod.
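
A generic client-go sketch of that split (illustrative only, not the service controller's actual wiring): handlers fire per Node on every watch event, while the resync period only controls how often the full cache is replayed.

```go
package sketch

import (
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// watchNodes enqueues a sync as soon as the watch delivers an ADD/UPDATE/
// DELETE for a Node; the ~100s period only governs full cache replays.
func watchNodes(cs kubernetes.Interface, enqueue func(*v1.Node)) {
	factory := informers.NewSharedInformerFactory(cs, 100*time.Second)
	factory.Core().V1().Nodes().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { enqueue(obj.(*v1.Node)) },
		UpdateFunc: func(_, cur interface{}) { enqueue(cur.(*v1.Node)) },
		DeleteFunc: func(obj interface{}) {
			if n, ok := obj.(*v1.Node); ok {
				enqueue(n) // deletions trigger a sync too
			}
		},
	})
	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
}
```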

that would help me to understand why we aren't watching nodes changes and act accordingly.

That's not the case, as I explained above.

@thockin
Member Author

thockin commented Mar 9, 2023

@alexanderConstantinescu Did we resolve this? I only see one reference to DeletionTimestamp and it's not considered a predicate - probably should be?

@alexanderConstantinescu
Copy link
Member

Did we resolve this?

I would say: yes. IMO, what we concluded for the ToBeDeleted... taint applies to DeletionTimestamp or any other indicator of a "deleting node". That is to say: we want to keep the node in the LB set for as long as possible (so as to allow connection draining to happen), but exclude the Node whenever we do sync, so as not to cause reconciliation errors in the service controller should the VM already be gone.

I only see one reference to DeletionTimestamp and it's not considered a predicate - probably should be?

Agreed. We can add that as a new predicate and add it to https://github.com/kubernetes/kubernetes/blob/f5fbafda8700a888759a62f006be19e34401be3b/staging/src/k8s.io/cloud-provider/controllers/service/controller.go#L945 - we can do that once the KEP PR merges, so that it's easier to keep track of things?
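
Roughly what that predicate could look like (hypothetical name and shape, assuming the node predicates are plain func(*v1.Node) bool checks):

```go
package sketch

import v1 "k8s.io/api/core/v1"

// nodeNotBeingDeletedPredicate follows the conclusion above: a node whose
// deletion timestamp is set is excluded from the set of nodes considered
// whenever a sync runs, so reconciliation never errors on a VM that may
// already be gone, while no extra sync is triggered just to drop it early.
func nodeNotBeingDeletedPredicate(node *v1.Node) bool {
	return node.DeletionTimestamp == nil
}
```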

@thockin
Member Author

thockin commented Mar 10, 2023

Sure. SGTM.

You're just racking up the PRs ;)

Feel free to send it with your other one included so I can pipeline the approvals.

@thockin
Member Author

thockin commented Mar 16, 2023

Closed by #116536

@thockin thockin closed this as completed Mar 16, 2023