
Missing a single update message leaves the scheduler unable to schedule pods to the right nodes #94437

Closed
srteam2020 opened this issue Sep 2, 2020 · 19 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.

Comments

@srteam2020

srteam2020 commented Sep 2, 2020

What happened:
If the scheduler misses a single "node delete" update message, it won't know that the node is deleted and will still try to schedule some pods to the deleted node, so those pods can never be scheduled to the right nodes. #56261 should be related, but somehow the issue still happens.

In a k8s cluster with one apiserver, one scheduler, and two workers (worker1 and worker2), after deleting a worker node from the cluster (by kubectl delete node worker1), the scheduler should hear from the apiserver, via the node informer, that “worker1 is deleted”. However, if the “worker1 is deleted” update is missed by the scheduler due to some network issue (which means deleteNodeFromCache won’t be called), then the scheduler won’t know that worker1 is deleted. When we create pods in the cluster, the scheduler will keep trying to schedule some of the pods to the non-existent worker1, and these pods will never run successfully.

What you expected to happen:
The scheduler should be able to tell that "worker1" no longer exists after several failed attempts to schedule pods to "worker1".

How to reproduce it (as minimally and precisely as possible):
The simplest way to reproduce this issue is to use kind on a single machine.

  1. Start a cluster with one control-plane node (kind-control-plane) and two worker nodes (kind-worker and kind-worker2)

  2. Delete one worker node (kubectl delete node kind-worker), and now we only have kind-control-plane and kind-worker2, as shown here:

    NAME                 STATUS   ROLES    AGE   VERSION
    kind-control-plane   Ready    master   32m   v1.18.0
    kind-worker2         Ready    <none>   31m   v1.18.0
    
  3. Drop the “deleted” message from the apiserver to the scheduler by setting iptables rules or instrumenting the k8s source code to make the scheduler "unaware" of the "delete" message. For example, we can do this by simply skipping r.store.Delete(event.Object) for the "node delete" event in reflector.go, or by removing the deleteNodeFromCache handler as in the diff below (both work):

    diff --git a/pkg/scheduler/eventhandlers.go b/pkg/scheduler/eventhandlers.go
    index f6d8bb1c..d5ef9e5d 100644
    --- a/pkg/scheduler/eventhandlers.go
    +++ b/pkg/scheduler/eventhandlers.go
    @@ -407,7 +407,6 @@ func addAllEventHandlers(
                    cache.ResourceEventHandlerFuncs{
                            AddFunc:    sched.addNodeToCache,
                            UpdateFunc: sched.updateNodeInCache,
    -                       DeleteFunc: sched.deleteNodeFromCache,
                    },
            )
    
  4. Create a StatefulSet (or Deployment) with 6 replicas.
    We can see that one of the replicas (web-1) is stuck as Pending, because it is scheduled to the non-existent kind-worker:

    NAME    READY   STATUS    RESTARTS   AGE     IP           NODE           NOMINATED NODE   READINESS GATES
    web-0   1/1     Running   0          3m1s    10.244.2.3   kind-worker2   <none>           <none>
    web-1   0/1     Pending   0          2m50s   <none>       <none>         <none>           <none>
    

    If we create a Deployment instead, the symptom is different: some pods scheduled to kind-worker go through "Pending -> ContainerCreating -> Running -> Terminating" cycles forever.

    NAME                                READY   STATUS    RESTARTS   AGE   IP           NODE           NOMINATED NODE   READINESS GATES
    nginx-deployment-6b474476c4-4qb6w   1/1     Running   0          15s   10.244.2.6   kind-worker2   <none>           <none>
    nginx-deployment-6b474476c4-5szj7   0/1     Pending   0          15s   <none>       kind-worker    <none>           <none>
    nginx-deployment-6b474476c4-d2zpg   1/1     Running   0          15s   10.244.2.5   kind-worker2   <none>           <none>
    nginx-deployment-6b474476c4-jjzb7   1/1     Running   0          15s   10.244.2.4   kind-worker2   <none>           <none>
    nginx-deployment-6b474476c4-pn8s5   0/1     Pending   0          15s   <none>       kind-worker    <none>           <none>
    nginx-deployment-6b474476c4-qqh7j   0/1     Pending   0          15s   <none>       kind-worker    <none>           <none>
    nginx-deployment-6b474476c4-qqh7j   0/1     ContainerCreating   0          27s   <none>       kind-worker    <none>           <none>
    nginx-deployment-6b474476c4-5szj7   0/1     ContainerCreating   0          27s   <none>       kind-worker    <none>           <none>
    nginx-deployment-6b474476c4-pn8s5   0/1     ContainerCreating   0          27s   <none>       kind-worker    <none>           <none>
    nginx-deployment-6b474476c4-pn8s5   1/1     Running             0          41s   10.244.1.3   kind-worker    <none>           <none>
    nginx-deployment-6b474476c4-qqh7j   1/1     Running             0          41s   10.244.1.2   kind-worker    <none>           <none>
    nginx-deployment-6b474476c4-5szj7   1/1     Running             0          41s   10.244.1.4   kind-worker    <none>           <none>
    nginx-deployment-6b474476c4-5szj7   1/1     Terminating         0          46s   10.244.1.4   kind-worker    <none>           <none>
    nginx-deployment-6b474476c4-5szj7   1/1     Terminating         0          46s   10.244.1.4   kind-worker    <none>           <none>
    nginx-deployment-6b474476c4-qfr2k   0/1     Pending             0          0s    <none>       <none>         <none>           <none>
    nginx-deployment-6b474476c4-qfr2k   0/1     Pending             0          0s    <none>       kind-worker    <none>           <none>
    nginx-deployment-6b474476c4-pn8s5   1/1     Terminating         0          46s   10.244.1.3   kind-worker    <none>           <none>
    nginx-deployment-6b474476c4-jsrnr   0/1     Pending             0          0s    <none>       <none>         <none>           <none>
    nginx-deployment-6b474476c4-pn8s5   1/1     Terminating         0          46s   10.244.1.3   kind-worker    <none>           <none>
    nginx-deployment-6b474476c4-jsrnr   0/1     Pending             0          0s    <none>       kind-worker    <none>           <none>
    nginx-deployment-6b474476c4-qqh7j   1/1     Terminating         0          46s   10.244.1.2   kind-worker    <none>           <none>
    nginx-deployment-6b474476c4-qqh7j   1/1     Terminating         0          46s   10.244.1.2   kind-worker    <none>           <none>
    nginx-deployment-6b474476c4-sb4xh   0/1     Pending             0          0s    <none>       <none>         <none>           <none>
    nginx-deployment-6b474476c4-sb4xh   0/1     Pending             0          0s    <none>       kind-worker    <none>           <none>
    

Logs:
Some logs from the scheduler:

E0902 03:39:07.887869       1 factory.go:503] pod kube-system/coredns-66bff467f8-q69s9 is already present in the backoff queue
E0902 03:40:36.934416       1 factory.go:478] Error scheduling default/web-1: Failed to bind volumes: failed to get node "kind-worker": node "kind-worker" not found; retrying
E0902 03:40:37.971947       1 factory.go:478] Error scheduling default/web-1: Failed to bind volumes: failed to get node "kind-worker": node "kind-worker" not found; retrying
E0902 03:41:08.240213       1 factory.go:478] Error scheduling default/web-1: Failed to bind volumes: failed to get node "kind-worker": node "kind-worker" not found; retrying
E0902 03:41:14.179062       1 factory.go:478] Error scheduling default/web-1: Failed to bind volumes: failed to get node "kind-worker": node "kind-worker" not found; retrying
E0902 03:42:21.709110       1 factory.go:478] Error scheduling default/web-1: Failed to bind volumes: failed to get node "kind-worker": node "kind-worker" not found; retrying

Suggestions for a fix:
The logs indicate that we hit the last branch of MakeDefaultErrorFunc in factory.go:

if _, ok := err.(*core.FitError); ok {
	klog.V(2).Infof("Unable to schedule %v/%v: no fit: %v; waiting", pod.Namespace, pod.Name, err)
} else if apierrors.IsNotFound(err) {
	klog.V(2).Infof("Unable to schedule %v/%v: possibly due to node not found: %v; waiting", pod.Namespace, pod.Name, err)
	......
	schedulerCache.RemoveNode(&node)
	......
} else {
	klog.Errorf("Error scheduling %v/%v: %v; retrying", pod.Namespace, pod.Name, err)
}

Since the err says Failed to bind volumes: failed to get node "kind-worker": node "kind-worker" not found, maybe we can check whether the error message contains something like "node not found". If so, we can remove the node from the scheduler cache, just as the apierrors.IsNotFound branch does.
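
Below is a minimal sketch of that suggestion, assuming a simple regexp match on the error text; nodeNameFromNotFoundErr is a hypothetical helper, not existing scheduler code, and the scheduler-cache call is only indicated in a comment:

// Illustrative sketch only: nodeNameFromNotFoundErr is a hypothetical helper,
// not part of the scheduler. It extracts the node name from a "node not found"
// error so the caller could then remove that node from the scheduler cache.
package main

import (
	"fmt"
	"regexp"
)

// nodeNotFoundRE matches errors such as:
//   Failed to bind volumes: failed to get node "kind-worker": node "kind-worker" not found
var nodeNotFoundRE = regexp.MustCompile(`node "([^"]+)" not found`)

// nodeNameFromNotFoundErr returns the node name embedded in a "node ... not found"
// error message and reports whether one was found.
func nodeNameFromNotFoundErr(err error) (string, bool) {
	if err == nil {
		return "", false
	}
	m := nodeNotFoundRE.FindStringSubmatch(err.Error())
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	err := fmt.Errorf(`Failed to bind volumes: failed to get node "kind-worker": node "kind-worker" not found`)
	if name, ok := nodeNameFromNotFoundErr(err); ok {
		// In MakeDefaultErrorFunc this is where the node would be removed from the
		// scheduler cache, mirroring the apierrors.IsNotFound branch above.
		fmt.Printf("would remove node %q from the scheduler cache\n", name)
	}
}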

Environment:

  • Kubernetes version: v1.18
@srteam2020 srteam2020 added the kind/bug Categorizes issue or PR as related to a bug. label Sep 2, 2020
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Sep 2, 2020
@srteam2020
Author

/sig scheduling

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 2, 2020
@soulxu
Contributor

soulxu commented Sep 3, 2020

@dims
Member

dims commented Sep 3, 2020

@Huang-Wei @ahg-g WDYT?

@liggitt
Member

liggitt commented Sep 3, 2020

If the scheduler misses a single "node delete" update message

That's a big if... pretty much all the components in Kubernetes depend on reliable informers.

However, if the “worker1 is deleted” update is missed by the scheduler due to some network issue (which means deleteNodeFromCache won’t be called)

when the network issue resolves, the node informer will relist, observe the "worker1" node no longer exists, and will call the delete handler

@srteam2020
Author

We stumbled into this when trying to reproduce issue #56261. The scheduler missing a node deletion event seemed to be enough of a concern to warrant a fix in that case (by invalidating the corresponding node in the node cache). Would a similar fix work here as well?

@srteam2020
Author

srteam2020 commented Sep 3, 2020

when the network issue resolves, the node informer will relist, observe the "worker1" node no longer exists, and will call the delete handler

We drop a single "node delete" message, only once. After that, we keep the network healthy. We find that the scheduler's node informer won't automatically relist after the network becomes healthy and stays unaware of the deletion. As a result, the scheduler keeps trying to schedule pods onto the non-existent node forever and cannot recover from this automatically.

Since #56261 was trying to make the scheduler recover from this automatically, a similar fix might help address this issue as well.

@ahg-g
Member

ahg-g commented Sep 8, 2020

I think we had a fix that we broke in 1.18.

The fix is #56622: it basically deletes the node when binding fails with a node NotFound error. However, in 1.18+ we refactored the binding logic to run in a plugin. The plugin returns a framework.Status, which is created from the error message; that conversion from error to framework.Status drops the error type, so the delete logic introduced in the fix isn't being triggered anymore.

@alculquicondor who worked on the default bind plugin.
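
To illustrate the point above, here is a standalone sketch (not the scheduler's actual code; notFoundError, status, statusFromError, and asError are made-up stand-ins) showing how converting a typed error into a message-only status and back yields a plain error that typed checks such as errors.As or apierrors.IsNotFound can no longer match:

// Standalone illustration only: none of these types exist in the scheduler;
// they just mimic the error -> status -> error round trip described above.
package main

import (
	"errors"
	"fmt"
)

// notFoundError stands in for the typed "node not found" API error that the
// original fix in #56622 matched on.
type notFoundError struct{ name string }

func (e *notFoundError) Error() string { return fmt.Sprintf("node %q not found", e.name) }

// status mimics a framework.Status that only carries the error message as text.
type status struct{ message string }

// statusFromError converts an error into a status, keeping only err.Error().
func statusFromError(err error) *status { return &status{message: err.Error()} }

// asError converts back, producing a brand-new error value with no type information.
func (s *status) asError() error { return errors.New(s.message) }

func main() {
	orig := &notFoundError{name: "kind-worker"}
	roundTripped := statusFromError(orig).asError()

	var nf *notFoundError
	fmt.Println(errors.As(orig, &nf))         // true: the error type is preserved
	fmt.Println(errors.As(roundTripped, &nf)) // false: only the message survived the round trip
}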

@alculquicondor
Member

We stumbled into this when trying to reproduce issue #56261

Was that done by forcibly cutting the network between the scheduler and the API server? As @liggitt said above, once the network is restored, the node will be deleted. And this would inevitably happen if binding is also working (returning a not-found error in this case).

Still, I'll look into a way of preserving the type of the error so we can still use the same codepath for detecting deleted nodes.

@alculquicondor
Member

/assign

@Huang-Wei
Member

Although we can fix the bug by amending the scheduling logic (such as removing the "deleted node" upon a "node not found" error), the root cause is still unclear to me.

We drop a single "node delete" message, only once. After that, we keep the network healthy. We find that the scheduler's node informer won't automatically relist after the network becomes healthy and stays unaware of the deletion.

The "relist" mechanism should have been battle-tested - if that's problem, the whole Kubernetes and CRD ecosystem is already a problem, so I don't quite believe that's the root cause.

@srteam2020 Which version are you using in production when this issue surfaced?

@srteam2020
Author

@Huang-Wei
We were using Kubernetes v1.18.0, downloaded from: https://v1-18.docs.kubernetes.io/docs/setup/release/notes/

@alculquicondor
Member

Please retry with 1.18.8. I'm not aware of any known issues around this topic, but 1.18.0 is quite old at this point.

Could you also clarify if the issue reproduces in a live cluster with casual network problems? If you are forcibly dropping events by changing things such as code in reflector.go, we are not really concerned with it.

@srteam2020
Author

@alculquicondor Thank you for the reply and suggestions!

Please retry with 1.18.8. I'm not aware of any known issues around this topic, but 1.18.0 is quite old at this point.

We just retried with 1.18.8 and the issue is still there (same symptom).

Could you also clarify if the issue reproduces in a live cluster with casual network problems? If you are forcibly dropping events by changing things such as code in reflector.go, we are not really concerned with it.

The scheduler cache could miss the “node delete” event due to a network problem, as mentioned in the previous bug report #56261 (that bug happened in a live cluster).
This issue (94437) shares the same severity level as the previously reported one (56261) and can also be reproduced in a live cluster with casual network problems, as long as the “node delete” message is dropped by the network issue. The reason for using kind and changing the source code is just to keep the reproduction steps as simple as possible (rather than setting up a cluster and manipulating iptables rules to drop the single message).

@alculquicondor
Member

as long as the “node delete” message is dropped by the network issue

As stated previously, once the network recovers, we should get the update event. Is this not happening? If you are dropping the event just to "reproduce" the issue, then that event is of course lost. Transient errors are acceptable under network problems.

@srteam2020
Author

@alculquicondor
Understood: we'll investigate that as well and get back to you.

@Huang-Wei
Member

BTW: usually the scheduler is deployed alongside the apiserver, so it's less likely that the apiserver stays live while the scheduler loses its network connection.

@alculquicondor
Member

#94692 is up for review. This won't be cherry-picked to older versions unless there is evidence of non-transient errors without it.

@alculquicondor
Member

/close
as we haven't heard any reports of this being a non-transient bug.

@k8s-ci-robot
Contributor

@alculquicondor: Closing this issue.

In response to this:

/close
as we haven't heard any reports of this being a non-transient bug.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
