Kubernetes scheduler spams cluster with pods in NodeAffinity status #92067
Comments
/sig scheduling
By spamming, are you referring to the scheduler logs? Also, why are the nodes not labelled from the get-go?
Just to confirm, all the pods in question are still in the Pending state, yes?
If the pods are in the Pending state, all of them will get scheduled eventually once the nodes get their labels.
The pods are not in the Pending state. Also, we are looking at labeling the nodes via kubelet flags to see if that helps.
/remove-kind bug
After we add some retry logic and wait time in the function in question, you can see that this function returns the initial node because the first request failed.
I also get hit by this when a single controller node gets rebooted and the scheduler starts as a static container before the kubelet becomes ready. It can be cleaned up with:
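The exact command is not shown above; as a minimal sketch, assuming the goal is simply to remove every pod that has already reached the Failed phase (which is how the NodeAffinity-rejected pods end up), something like the following client-go program would do it. The kubeconfig path and the lack of any confirmation are illustrative choices, not part of the original comment.

```go
// Sketch only: delete all Failed pods cluster-wide (pods rejected by kubelet
// admission with reason NodeAffinity end up in the Failed phase).
package main

import (
	"context"
	"fmt"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	// Assumed default kubeconfig location; adjust as needed.
	kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// List Failed pods in every namespace, then delete them one by one.
	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(),
		metav1.ListOptions{FieldSelector: "status.phase=Failed"})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		if err := client.CoreV1().Pods(p.Namespace).Delete(context.TODO(), p.Name, metav1.DeleteOptions{}); err != nil {
			fmt.Printf("failed to delete %s/%s: %v\n", p.Namespace, p.Name, err)
			continue
		}
		fmt.Printf("deleted %s/%s\n", p.Namespace, p.Name)
	}
}
```

The field selector on `status.phase` keeps the cleanup limited to terminal pods, so nothing Pending or Running is touched.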
/sig node |
The spam cycle is this loop: "pod create, pod schedule, worker reject, pod failed". We don't have a lot of protection in higher-level workload controllers to handle pods that fail at kubelet admission time versus pods that just don't get created due to quota/admission. In general, is the spam causing any other stability issues? Is pod GC pruning the pods?
I wonder if we can do a trick in the node lister to mitigate any race condition here...
We could maybe do something similar to https://github.com/kubernetes/kubernetes/pull/91500/files to ensure the node lister has synced...
Hacking on an option to ensure that, if the kubelet has a node lister with a valid kube client, we wait for it to sync at least once, which should mitigate this issue. See: #94087
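A rough sketch of that pattern using standard client-go informers is below; it is not the actual kubelet change in #94087, and the names and setup are illustrative. The point is simply to gate any use of the node lister on an initial cache sync.

```go
// Sketch only: make sure the node lister has synced at least once before it is
// used, so callers never act on an empty or stale view of the nodes.
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	nodeInformer := factory.Core().V1().Nodes()
	nodeLister := nodeInformer.Lister()

	stopCh := make(chan struct{})
	defer close(stopCh)
	factory.Start(stopCh)

	// Block until the node informer has completed its initial List from the
	// API server (or the stop channel is closed).
	if !cache.WaitForCacheSync(stopCh, nodeInformer.Informer().HasSynced) {
		panic("node informer never synced")
	}

	nodes, err := nodeLister.List(labels.Everything())
	if err != nil {
		panic(err)
	}
	fmt.Printf("node lister synced; %d nodes known\n", len(nodes))
}
```

Per the comment above, #94087 applies the same idea inside the kubelet: wait for the node lister to sync at least once before relying on it.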
We noticed the problem on the OLM |
@rtheis sure... I would have thought the pod had failed in that instance and reached a terminal state. Either way, I am looking at what I can do to mitigate this.
The problem persists in our environment as well, and the workaround cited above is no longer sufficient. Is there something in the works for this one?
For Deployment and DaemonSet, if we rely on a label not created directly by the kubelet, like `kubernetes.io/os`, the Pod can be scheduled on a Node whose kubelet does not yet know about the label in the NodeSelector, so the Pod gets stuck in `NodeAffinity` status and is never removed. See: kubernetes/kubernetes#93338, kubernetes/kubernetes#92067. Let's rely on the `beta.kubernetes.io/os` label for the moment. NOTE: this label is deprecated in 1.19.
Instead of waiting for all scheduled pods to be running (which is already done in E2E tests), the "stabilization" script now only checks that no change to Pod objects can be seen over a given period of time. This will effectively hide any scheduling issue, especially the `NodeAffinity` flake which has been impacting us a lot recently. See kubernetes/kubernetes#92067 for reference.
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so with `/close`. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale |
What happened:
A deployment is created that uses a `nodeSelector` after the cluster master is created but before the cluster worker nodes are fully initialized. During the initialization of the worker nodes, the nodes are temporarily available for scheduling without the necessary label to match the deployment's node selector. Depending on how long the nodes are available for scheduling without the necessary node labels for the deployment, the scheduler will start to spam the cluster with pods in `NodeAffinity` status. This spamming stops once the worker nodes are fully initialized and the pods are scheduled successfully. The exact timing of all of this is TBD.

What you expected to happen:
Pods in `NodeAffinity` status are annoying and require manual cleanup. It would be preferred that the scheduler avoid this and simply issue warning events.

How to reproduce it (as minimally and precisely as possible):
See issue description.
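Purely as an illustration of the setup described above (the deployment name, the `disktype=ssd` node label, and the pause image are hypothetical and not taken from this issue), the race boils down to creating a Deployment like this before the worker nodes have received the label it selects on:

```go
// Sketch only: a Deployment whose pod template selects on a node label that the
// worker nodes have not been given yet. If such a pod gets bound to a node whose
// labels arrive late, the kubelet rejects it at admission with reason NodeAffinity.
package main

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func int32Ptr(i int32) *int32 { return &i }

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	podLabels := map[string]string{"app": "nodeaffinity-repro"}
	dep := &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{Name: "nodeaffinity-repro"},
		Spec: appsv1.DeploymentSpec{
			Replicas: int32Ptr(1),
			Selector: &metav1.LabelSelector{MatchLabels: podLabels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: podLabels},
				Spec: corev1.PodSpec{
					// Hypothetical label that the workers only receive later in
					// their initialization.
					NodeSelector: map[string]string{"disktype": "ssd"},
					Containers: []corev1.Container{{
						Name:  "pause",
						Image: "k8s.gcr.io/pause:3.2",
					}},
				},
			},
		},
	}

	if _, err := client.AppsV1().Deployments("default").Create(context.TODO(), dep, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```

Whether the failure actually triggers depends on the timing described above: a worker node has to become schedulable before its labels are applied.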
Anything else we need to know?: No.
Environment:
- Kubernetes version (`kubectl version`): 1.18.3
- OS (`cat /etc/os-release`): Ubuntu 18.04.4 LTS
- Kernel (`uname -a`): 4.15.0-96-generic