Missing a single update message can cause the scheduler to never be able to schedule pods to the right nodes #94437
/sig scheduling
I guess the reason is that we disable the resync: https://github.com/kubernetes/kubernetes/blob/master/cmd/kube-scheduler/app/options/options.go#L290
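(For context, a minimal sketch of what "disabling the resync" means for the node informer. The handler below is a hypothetical stand-in for the scheduler's `deleteNodeFromCache`, and `fake.NewSimpleClientset` is only used to make the example self-contained; with a zero resync period, a missed delete event is never re-delivered by a periodic resync.)

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/fake"
	"k8s.io/client-go/tools/cache"
)

// wireNodeDeleteHandler registers a delete handler on the node informer,
// roughly the way the scheduler does. The zero resync period means the
// informer never periodically replays its store to the handlers, so a
// missed "node delete" event is not repaired by a resync.
func wireNodeDeleteHandler(client kubernetes.Interface, stopCh <-chan struct{}) {
	factory := informers.NewSharedInformerFactory(client, 0) // resync disabled
	nodeInformer := factory.Core().V1().Nodes().Informer()
	nodeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		DeleteFunc: func(obj interface{}) {
			if node, ok := obj.(*v1.Node); ok {
				// The real scheduler removes the node from its internal cache here.
				fmt.Printf("node %s deleted, dropping it from the local cache\n", node.Name)
			}
		},
	})
	factory.Start(stopCh)
	cache.WaitForCacheSync(stopCh, nodeInformer.HasSynced)
}

func main() {
	stopCh := make(chan struct{})
	defer close(stopCh)
	wireNodeDeleteHandler(fake.NewSimpleClientset(), stopCh)
}
```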
@Huang-Wei @ahg-g WDYT?
that's a big if... pretty much all the components in kubernetes depend on reliable informers.
when the network issue resolves, the node informer will relist, observe the "worker1" node no longer exists, and will call the delete handler
We stumbled into this when trying to reproduce issue #56261. The scheduler missing a node deletion event seemed a sufficient concern to warrant a fix in that case (by invalidating the corresponding node in the node cache). Would a similar fix apply here as well?
We only drop a single "node delete" message, once. After that, we keep the network healthy. We find that the scheduler's node informer does not automatically relist after the network becomes healthy, so it stays unaware of the deletion. As a result, the scheduler keeps trying to schedule pods on the non-existing node forever and cannot recover automatically. Since #56261 was trying to make the scheduler recover automatically, a similar fix might help address this issue as well.
I think we had a fix that we broke in 1.18. The fix is #56622; it basically deletes the node when binding fails with a node not-found error. cc @alculquicondor, who worked on the default bind plugin.
Was that done by forcibly cutting the network between scheduler and API server? As @liggitt said above, once the network is restored, the node will be deleted. And this would inevitably happen if binding is also working (returning a not-found error in this case). Still, I'll look into a way of preserving the type of the error so we can still use the same codepath for detecting deleted nodes.
/assign
Although we can fix the bug by amending the scheduling logic (such as removing the "deleted" node upon a "node not found" error), the root cause is still unclear to me.
The "relist" mechanism should have been battle-tested - if that's a problem, the whole Kubernetes and CRD ecosystem is already a problem, so I don't quite believe that's the root cause. @srteam2020 Which version are you using in production when this issue surfaced?
@Huang-Wei
Please retry with 1.18.8. I'm not aware of any known issues around this topic, but 1.18.0 is quite old at this point. Could you also clarify if the issue reproduces in a live cluster with casual network problems? If you are forcibly dropping events by changing things such as code in reflector.go, we are not really concerned with it.
@alculquicondor Thank you for the reply and suggestions!
We just retried with 1.18.8 and the issue is still there (same symptom).
The scheduler cache could miss the “node delete” event due to a network problem, as mentioned in the previous bug report #56261 (that bug happened in a live cluster).
as stated previously, once the network recovers, we should get the update event. Is this not happening? If you are dropping the event to "reproduce" the issue, that update would be lost. Transient errors are acceptable under network problems.
@alculquicondor
BTW: usually the scheduler is deployed along with the apiserver, so it's less likely that the apiserver is live while the scheduler loses its network connection.
#94692 is up for review. This won't be cherry-picked to older versions unless there is evidence of non-transient errors without it.
/close |
@alculquicondor: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What happened:
If the scheduler misses a single "node delete" update message, it won't know the node is deleted and will still try to schedule some pods to the deleted node, so those pods can never be scheduled to the right nodes. #56261 should be related, but somehow the issue still happens.
In a k8s cluster with one apiserver, one scheduler and two workers (worker1 and worker2), after deleting a worker node from the cluster (by `kubectl delete node worker1`), the scheduler should hear from the apiserver that “worker1 is deleted” via the node informer. However, if the “worker1 is deleted” update is missed by the scheduler due to some network issue (which means `deleteNodeFromCache` won’t be called), then the scheduler won’t know that worker1 is deleted. When we create pods in the cluster, the scheduler will keep trying to schedule some of the pods to the non-existing worker1, and these pods will never run successfully.

What you expected to happen:

The scheduler should be able to tell that "worker1" no longer exists after several failed attempts to schedule pods to "worker1".
How to reproduce it (as minimally and precisely as possible):
The simplest way to reproduce this issue is to use kind on a single machine.
1. Start a cluster with one control-plane node (`kind-control-plane`) and two worker nodes (`kind-worker` and `kind-worker2`).
2. Delete one worker node (`kubectl delete node kind-worker`); now only `kind-control-plane` and `kind-worker2` are left.
3. Drop the “deleted” message from the apiserver to the scheduler by setting iptables rules, or by instrumenting the k8s source code so the scheduler stays "unaware" of the "delete" message. For example, we can simply skip `r.store.Delete(event.Object)` for the "node delete" event in `reflector.go`, or remove the `deleteNodeFromCache` handler (both are fine; see the sketch after these steps).
4. Create a `statefulSet` (or `deployment`) with 6 replicas.

We can see that the first replica is stuck as Pending (because it is scheduled to the non-existing `kind-worker`).
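As a rough illustration of the instrumentation meant in step 3 (not the actual `reflector.go` patch): `handleDelete` and the `dropped` flag below are hypothetical names standing in for the `watch.Deleted` branch of the reflector's watch loop, and the point is simply that `store.Delete` is skipped exactly once for a Node.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/tools/cache"
)

// dropped records whether we have already swallowed one delete event.
var dropped bool

// handleDelete mimics the watch.Deleted branch of the reflector's watch
// loop. To reproduce the bug we skip store.Delete exactly once for a Node
// object, which leaves the informer cache (and hence the scheduler cache)
// permanently unaware of the deletion.
func handleDelete(store cache.Store, event watch.Event) error {
	if _, isNode := event.Object.(*v1.Node); isNode && !dropped {
		dropped = true
		fmt.Println("dropping one node delete event to reproduce the bug")
		return nil // skip store.Delete(event.Object)
	}
	return store.Delete(event.Object)
}

func main() {
	store := cache.NewStore(cache.MetaNamespaceKeyFunc)
	node := &v1.Node{ObjectMeta: metav1.ObjectMeta{Name: "kind-worker"}}
	_ = store.Add(node)

	// The first delete event is dropped, so the node stays in the store.
	_ = handleDelete(store, watch.Event{Type: watch.Deleted, Object: node})
	fmt.Println("items after dropped delete:", len(store.List())) // 1
}
```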
If we create a deployment instead, the symptom is different: some pods scheduled to `kind-worker` keep going through "Pending -> ContainerCreating -> Running -> Terminating" cycles forever.

Logs:
Some logs from the scheduler:
Suggestions for fix:
The log indicates that it hits the last branch in `MakeDefaultErrorFunc` in `factory.go`:

Since the `err` says `Failed to bind volumes: failed to get node "kind-worker": node "kind-worker" not found`, maybe we can check the `err` to see whether it contains something like "node not found". If so, we can remove the node from the scheduler cache, just like how the `apierrors.IsNotFound` branch handles it.
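A minimal sketch of that idea (not the real `factory.go` code; `nodeCache`, `maybeEvictMissingNode`, and the regex-based string match are simplifications used only to illustrate the suggestion):

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// nodeCache is a minimal stand-in for the scheduler's internal node cache
// (hypothetical; the real cache lives in the scheduler's internal packages).
type nodeCache map[string]struct{}

func (c nodeCache) RemoveNode(name string) { delete(c, name) }

// nodeNotFoundRE extracts the node name from errors such as:
//   Failed to bind volumes: failed to get node "kind-worker": node "kind-worker" not found
var nodeNotFoundRE = regexp.MustCompile(`node "([^"]+)" not found`)

// maybeEvictMissingNode is the proposed check: if a scheduling/binding error
// says the node no longer exists, drop that node from the cache, mirroring
// what the existing apierrors.IsNotFound branch does for API status errors.
func maybeEvictMissingNode(cache nodeCache, err error) {
	if err == nil {
		return
	}
	if m := nodeNotFoundRE.FindStringSubmatch(err.Error()); m != nil {
		fmt.Printf("evicting stale node %q from scheduler cache\n", m[1])
		cache.RemoveNode(m[1])
	}
}

func main() {
	cache := nodeCache{"kind-worker": {}, "kind-worker2": {}}
	err := errors.New(`Failed to bind volumes: failed to get node "kind-worker": node "kind-worker" not found`)
	maybeEvictMissingNode(cache, err)
	fmt.Println("nodes left in cache:", len(cache)) // 1
}
```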
Environment: