
Missing a single update message leaves the scheduler unable to schedule pods to the right nodes #94437

Closed
srteam2020 opened this issue Sep 2, 2020 · 19 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.

Comments

@srteam2020

srteam2020 commented Sep 2, 2020

What happened:
If the scheduler misses a single "node delete" update message, it won't know that the node is deleted and will still try to schedule some pods to the deleted node, so those pods can never be scheduled to the right nodes. #56261 should be related, but somehow the issue still happens.

In a k8s cluster with one apiserver, one scheduler, and two workers (worker1 and worker2), after deleting a worker node from the cluster (by kubectl delete node worker1), the scheduler should hear from the apiserver, via the node informer, that “worker1 is deleted”. However, if the “worker1 is deleted” update is missed by the scheduler due to some network issue (which means deleteNodeFromCache won’t be called), then the scheduler won’t know that worker1 is deleted. When we create pods in the cluster, the scheduler will keep trying to schedule some of the pods to the non-existent worker1, and these pods will never run successfully.

What you expected to happen:
The scheduler should be able to tell that "worker1" no longer exists after several failed attempts to schedule pods to "worker1".

How to reproduce it (as minimally and precisely as possible):
The simplest way to reproduce this issue is to use kind on a single machine.

  1. Start a cluster with one control-plane node (kind-control-plane) and two worker nodes (kind-worker and kind-worker2)

  2. Delete one worker node (kubectl delete node kind-worker), and now we only have kind-control-plane and kind-worker2, as shown here:

    NAME                 STATUS   ROLES    AGE   VERSION
    kind-control-plane   Ready    master   32m   v1.18.0
    kind-worker2         Ready    <none>   31m   v1.18.0
    
  3. Drop the “deleted” message from the apiserver to the scheduler by setting iptables rules or instrumenting the k8s source code to make the scheduler "unaware" of the "delete" message. For example, we can do this by simply skipping r.store.Delete(event.Object) for the "node delete" event in reflector.go, or by removing the deleteNodeFromCache handler as in the diff below (both work):

    diff --git a/pkg/scheduler/eventhandlers.go b/pkg/scheduler/eventhandlers.go
    index f6d8bb1c..d5ef9e5d 100644
    --- a/pkg/scheduler/eventhandlers.go
    +++ b/pkg/scheduler/eventhandlers.go
    @@ -407,7 +407,6 @@ func addAllEventHandlers(
                    cache.ResourceEventHandlerFuncs{
                            AddFunc:    sched.addNodeToCache,
                            UpdateFunc: sched.updateNodeInCache,
    -                       DeleteFunc: sched.deleteNodeFromCache,
                    },
            )
    
  4. Create a StatefulSet (or Deployment) with 6 replicas.
    We can see that one of the replicas (web-1) is stuck as Pending, because it is scheduled to the non-existent kind-worker:

    NAME    READY   STATUS    RESTARTS   AGE     IP           NODE           NOMINATED NODE   READINESS GATES
    web-0   1/1     Running   0          3m1s    10.244.2.3   kind-worker2   <none>           <none>
    web-1   0/1     Pending   0          2m50s   <none>       <none>         <none>           <none>
    

    If we create a Deployment instead, the symptom is different: some pods scheduled to kind-worker go through "Pending -> ContainerCreating -> Running -> Terminating" cycles forever.

    NAME                                READY   STATUS    RESTARTS   AGE   IP           NODE           NOMINATED NODE   READINESS GATES
    nginx-deployment-6b474476c4-4qb6w   1/1     Running   0          15s   10.244.2.6   kind-worker2   <none>           <none>
    nginx-deployment-6b474476c4-5szj7   0/1     Pending   0          15s   <none>       kind-worker    <none>           <none>
    nginx-deployment-6b474476c4-d2zpg   1/1     Running   0          15s   10.244.2.5   kind-worker2   <none>           <none>
    nginx-deployment-6b474476c4-jjzb7   1/1     Running   0          15s   10.244.2.4   kind-worker2   <none>           <none>
    nginx-deployment-6b474476c4-pn8s5   0/1     Pending   0          15s   <none>       kind-worker    <none>           <none>
    nginx-deployment-6b474476c4-qqh7j   0/1     Pending   0          15s   <none>       kind-worker    <none>           <none>
    nginx-deployment-6b474476c4-qqh7j   0/1     ContainerCreating   0          27s   <none>       kind-worker    <none>           <none>
    nginx-deployment-6b474476c4-5szj7   0/1     ContainerCreating   0          27s   <none>       kind-worker    <none>           <none>
    nginx-deployment-6b474476c4-pn8s5   0/1     ContainerCreating   0          27s   <none>       kind-worker    <none>           <none>
    nginx-deployment-6b474476c4-pn8s5   1/1     Running             0          41s   10.244.1.3   kind-worker    <none>           <none>
    nginx-deployment-6b474476c4-qqh7j   1/1     Running             0          41s   10.244.1.2   kind-worker    <none>           <none>
    nginx-deployment-6b474476c4-5szj7   1/1     Running             0          41s   10.244.1.4   kind-worker    <none>           <none>
    nginx-deployment-6b474476c4-5szj7   1/1     Terminating         0          46s   10.244.1.4   kind-worker    <none>           <none>
    nginx-deployment-6b474476c4-5szj7   1/1     Terminating         0          46s   10.244.1.4   kind-worker    <none>           <none>
    nginx-deployment-6b474476c4-qfr2k   0/1     Pending             0          0s    <none>       <none>         <none>           <none>
    nginx-deployment-6b474476c4-qfr2k   0/1     Pending             0          0s    <none>       kind-worker    <none>           <none>
    nginx-deployment-6b474476c4-pn8s5   1/1     Terminating         0          46s   10.244.1.3   kind-worker    <none>           <none>
    nginx-deployment-6b474476c4-jsrnr   0/1     Pending             0          0s    <none>       <none>         <none>           <none>
    nginx-deployment-6b474476c4-pn8s5   1/1     Terminating         0          46s   10.244.1.3   kind-worker    <none>           <none>
    nginx-deployment-6b474476c4-jsrnr   0/1     Pending             0          0s    <none>       kind-worker    <none>           <none>
    nginx-deployment-6b474476c4-qqh7j   1/1     Terminating         0          46s   10.244.1.2   kind-worker    <none>           <none>
    nginx-deployment-6b474476c4-qqh7j   1/1     Terminating         0          46s   10.244.1.2   kind-worker    <none>           <none>
    nginx-deployment-6b474476c4-sb4xh   0/1     Pending             0          0s    <none>       <none>         <none>           <none>
    nginx-deployment-6b474476c4-sb4xh   0/1     Pending             0          0s    <none>       kind-worker    <none>           <none>
    

Logs:
Some logs from the scheduler:

E0902 03:39:07.887869       1 factory.go:503] pod kube-system/coredns-66bff467f8-q69s9 is already present in the backoff queue
E0902 03:40:36.934416       1 factory.go:478] Error scheduling default/web-1: Failed to bind volumes: failed to get node "kind-worker": node "kind-worker" not found; retrying
E0902 03:40:37.971947       1 factory.go:478] Error scheduling default/web-1: Failed to bind volumes: failed to get node "kind-worker": node "kind-worker" not found; retrying
E0902 03:41:08.240213       1 factory.go:478] Error scheduling default/web-1: Failed to bind volumes: failed to get node "kind-worker": node "kind-worker" not found; retrying
E0902 03:41:14.179062       1 factory.go:478] Error scheduling default/web-1: Failed to bind volumes: failed to get node "kind-worker": node "kind-worker" not found; retrying
E0902 03:42:21.709110       1 factory.go:478] Error scheduling default/web-1: Failed to bind volumes: failed to get node "kind-worker": node "kind-worker" not found; retrying

Suggestions for a fix:
The logs indicate that we hit the last branch of MakeDefaultErrorFunc in factory.go:

if _, ok := err.(*core.FitError); ok {
	klog.V(2).Infof("Unable to schedule %v/%v: no fit: %v; waiting", pod.Namespace, pod.Name, err)
} else if apierrors.IsNotFound(err) {
	klog.V(2).Infof("Unable to schedule %v/%v: possibly due to node not found: %v; waiting", pod.Namespace, pod.Name, err)
	......
	schedulerCache.RemoveNode(&node)
	......
} else {
	klog.Errorf("Error scheduling %v/%v: %v; retrying", pod.Namespace, pod.Name, err)
}

Since the err says Failed to bind volumes: failed to get node "kind-worker": node "kind-worker" not found, maybe we can check whether the error message contains something like "node not found". If so, we can remove the node from the scheduler cache, just as the apierrors.IsNotFound branch does.
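
Below is a minimal sketch of that suggestion, assuming a simple regexp match on the error text; nodeNameFromNotFoundErr is a hypothetical helper, not existing scheduler code, and the scheduler-cache call is only indicated in a comment:

// Illustrative sketch only: nodeNameFromNotFoundErr is a hypothetical helper,
// not part of the scheduler. It extracts the node name from a "node not found"
// error so the caller could then remove that node from the scheduler cache.
package main

import (
	"fmt"
	"regexp"
)

// nodeNotFoundRE matches errors such as:
//   Failed to bind volumes: failed to get node "kind-worker": node "kind-worker" not found
var nodeNotFoundRE = regexp.MustCompile(`node "([^"]+)" not found`)

// nodeNameFromNotFoundErr returns the node name embedded in a "node ... not found"
// error message and reports whether one was found.
func nodeNameFromNotFoundErr(err error) (string, bool) {
	if err == nil {
		return "", false
	}
	m := nodeNotFoundRE.FindStringSubmatch(err.Error())
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	err := fmt.Errorf(`Failed to bind volumes: failed to get node "kind-worker": node "kind-worker" not found`)
	if name, ok := nodeNameFromNotFoundErr(err); ok {
		// In MakeDefaultErrorFunc this is where the node would be removed from the
		// scheduler cache, mirroring the apierrors.IsNotFound branch above.
		fmt.Printf("would remove node %q from the scheduler cache\n", name)
	}
}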

Environment:

  • Kubernetes version: v1.18
@srteam2020 srteam2020 added the kind/bug Categorizes issue or PR as related to a bug. label Sep 2, 2020
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Sep 2, 2020
@srteam2020
Author

/sig scheduling

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 2, 2020
@soulxu
Contributor

soulxu commented Sep 3, 2020

@dims
Member

dims commented Sep 3, 2020

@Huang-Wei @ahg-g WDYT?

@liggitt
Member

liggitt commented Sep 3, 2020

If the scheduler misses a single "node delete" update message

That's a big if... pretty much all the components in Kubernetes depend on reliable informers.

However, if the “worker1 is deleted” update is missed by the scheduler due to some network issue (which means deleteNodeFromCache won’t be called)

when the network issue resolves, the node informer will relist, observe the "worker1" node no longer exists, and will call the delete handler

@srteam2020
Author

We stumbled into this when trying to reproduce issue #56261. The scheduler missing a node deletion event seemed to be enough of a concern to warrant a fix in that case (by invalidating the corresponding node in the node cache). Would a similar fix work here as well?

@srteam2020
Author

srteam2020 commented Sep 3, 2020

when the network issue resolves, the node informer will relist, observe the "worker1" node no longer exists, and will call the delete handler

We drop a single "node delete" message, only once. After that, we keep the network healthy. We find that the scheduler's node informer won't automatically relist after the network becomes healthy and stays unaware of the deletion. As a result, the scheduler keeps trying to schedule pods onto the non-existent node forever and cannot recover from this automatically.

Since #56261 was trying to make the scheduler recover from this automatically, a similar fix might help address this issue as well.

@ahg-g
Member

ahg-g commented Sep 8, 2020

I think we had a fix that we broke in 1.18.

The fix is #56622: it basically deletes the node when binding fails with a node NotFound error. However, in 1.18+ we refactored the binding logic to run in a plugin. The plugin returns a framework.Status, which is created from the error message; that conversion from error to framework.Status drops the error type, so the delete logic introduced in the fix isn't being triggered anymore.

@alculquicondor who worked on the default bind plugin.
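
To illustrate the point above, here is a standalone sketch (not the scheduler's actual code; notFoundError, status, statusFromError, and asError are made-up stand-ins) showing how converting a typed error into a message-only status and back yields a plain error that typed checks such as errors.As or apierrors.IsNotFound can no longer match:

// Standalone illustration only: none of these types exist in the scheduler;
// they just mimic the error -> status -> error round trip described above.
package main

import (
	"errors"
	"fmt"
)

// notFoundError stands in for the typed "node not found" API error that the
// original fix in #56622 matched on.
type notFoundError struct{ name string }

func (e *notFoundError) Error() string { return fmt.Sprintf("node %q not found", e.name) }

// status mimics a framework.Status that only carries the error message as text.
type status struct{ message string }

// statusFromError converts an error into a status, keeping only err.Error().
func statusFromError(err error) *status { return &status{message: err.Error()} }

// asError converts back, producing a brand-new error value with no type information.
func (s *status) asError() error { return errors.New(s.message) }

func main() {
	orig := &notFoundError{name: "kind-worker"}
	roundTripped := statusFromError(orig).asError()

	var nf *notFoundError
	fmt.Println(errors.As(orig, &nf))         // true: the error type is preserved
	fmt.Println(errors.As(roundTripped, &nf)) // false: only the message survived the round trip
}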

@alculquicondor
Member

We stumbled into this when trying to reproduce issue #56261

Was that done by forcibly cutting the network between the scheduler and the API server? As @liggitt said above, once the network is restored, the node will be deleted. And this would inevitably happen if binding is also working (returning a not-found error in this case).

Still, I'll look into a way of preserving the type of the error so we can still use the same codepath for detecting deleted nodes.

@alculquicondor
Member

/assign

@Huang-Wei
Member

Although we can fix the bug by amending the scheduling logic (such as removing the "deleted node" upon a "node not found" error), the root cause is still unclear to me.

We drop a single "node delete" message, only once. After that, we keep the network healthy. We find that the scheduler's node informer won't automatically relist after the network becomes healthy and stays unaware of the deletion.

The "relist" mechanism should have been battle-tested - if that's problem, the whole Kubernetes and CRD ecosystem is already a problem, so I don't quite believe that's the root cause.

@srteam2020 Which version are you using in production when this issue surfaced?

@srteam2020
Author

@Huang-Wei
We were using Kubernetes v1.18.0, downloaded from: https://v1-18.docs.kubernetes.io/docs/setup/release/notes/

@alculquicondor
Member

Please retry with 1.18.8. I'm not aware of any known issues around this topic, but 1.18.0 is quite old at this point.

Could you also clarify if the issue reproduces in a live cluster with casual network problems? If you are forcibly dropping events by changing things such as code in reflector.go, we are not really concerned with it.

@srteam2020
Author

@alculquicondor Thank you for the reply and suggestions!

Please retry with 1.18.8. I'm not aware of any known issues around this topic, but 1.18.0 is quite old at this point.

We just retried with 1.18.8 and the issue is still there (same symptom).

Could you also clarify if the issue reproduces in a live cluster with casual network problems? If you are forcibly dropping events by changing things such as code in reflector.go, we are not really concerned with it.

The scheduler cache could miss the “node delete” event due to a network problem, as mentioned in the previous bug report #56261 (that bug happened in a live cluster).
This issue (94437) shares the same severity level as the previously reported one (56261) and can also be reproduced in a live cluster with casual network problems, as long as the “node delete” message is dropped by the network issue. The reason for using kind and changing the source code is just to keep the reproduction steps as simple as possible (rather than setting up a cluster and manipulating iptables rules to drop the single message).

@alculquicondor
Member

as long as the “node delete” message is dropped by the network issue

As stated previously, once the network recovers, we should get the update event. Is this not happening? If you are dropping the event just to "reproduce" the issue, then that event is of course lost. Transient errors are acceptable under network problems.

@srteam2020
Author

@alculquicondor
Understood: we'll investigate that as well and get back to you.

@Huang-Wei
Member

BTW: usually the scheduler is deployed alongside the apiserver, so it's less likely that the apiserver stays live while the scheduler loses its network connection.

@alculquicondor
Member

#94692 is up for review. This won't be cherry-picked to older versions unless there is evidence of non-transient errors without it.

@alculquicondor
Member

/close
as we haven't heard any reports of this being a non-transient bug.

@k8s-ci-robot
Contributor

@alculquicondor: Closing this issue.

In response to this:

/close
as we haven't heard any reports of this being a non-transient bug.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
