Kubelet rejects pod scheduled based on newly added node labels which have not been observed by the kubelet yet #93338
Can you describe the issue?

The controller currently ensures the beta and GA OS labels match. What issue are you seeing occur with affinity?

@liggitt We believe the flow is as follows:
There is a timing window here, since we don't always see the problem.

Can you provide the manifests that are deployed for reference?

Can you provide the pod as fetched with

Are you using

@liggitt Please see the previous comment for an example pod YAML; here is the OLM PR to fix the addon-catalog-source pod: operator-framework/operator-lifecycle-manager#1562. The PR details the files containing the manifests.
That should make the scheduler wait to schedule the pod until a node with the appropriate labels appears. Can you provide the full pod (including the status) and the scheduler events you were seeing issues with?

@liggitt Unfortunately, I only have the following data from the previous failure. I can try to recreate the problem again if you need more information.

@kubernetes/sig-scheduling-misc NodeAffinity failure because no nodes currently match the pod's selector is not a terminal state, right (e.g. if a node becomes available that matches the selector, the pod will be scheduled)?
Yes, the pod should remain unschedulable until a node that matches the selector shows up.

To clarify, is the issue that a pod with nodeSelector set is getting scheduled on a node that doesn't yet have the corresponding labels?

@ahg-g No, a pod with nodeSelector set never gets scheduled when the initial scheduling attempt does not find a node with the corresponding label. Even when the node label is added later, the pod stays stuck in NodeAffinity.

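For reference, a minimal pod of the shape being discussed (a sketch; the name and image are illustrative, not taken from the actual manifests in this thread):

```yaml
# Hypothetical minimal reproduction: a pod selecting on the GA OS label.
# If no node carries kubernetes.io/os yet, the pod should stay Pending;
# the bug discussed here is that it can instead end up Failed with
# reason NodeAffinity after the kubelet rejects it at admission.
apiVersion: v1
kind: Pod
metadata:
  name: os-label-demo   # illustrative name
spec:
  nodeSelector:
    kubernetes.io/os: linux
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
```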
This issue is very similar to #92067.

From #93338 (comment), it seems the pod got scheduled; the nodeName is set.
Where is this line coming from: a log? An event?

@ahg-g #93338 (comment) is showing a successful pod. Sorry for the confusion. I've added more context for that pod line.

Not sure about the source of the status field, but this is certainly not a scheduling issue.
A pod that does not have a node it can be scheduled to should remain pending, and then be scheduled successfully when such a node appears, right?
Yes, and that is what is happening here: the pod didn't get scheduled initially and then got scheduled (I presume when the label got applied). #93338 (comment) says the pod didn't get scheduled after the label was applied, but I am saying that it did get scheduled, because the nodeName is set.

@ahg-g So are you saying the pod is scheduled and may be running even though the status is NodeAffinity?

From the scheduler's perspective, the pod is scheduled because the nodeName is set. The status isn't something the scheduler manages; the scheduler only adds a pod condition when it fails to schedule the pod.

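To illustrate the distinction (a sketch with typical field values, not the actual status from the failed pod in this thread): a scheduling failure surfaces as a pod condition, while a kubelet admission rejection surfaces in the top-level status fields:

```yaml
# Scheduler could not place the pod: it stays Pending with a condition.
status:
  phase: Pending
  conditions:
  - type: PodScheduled
    status: "False"
    reason: Unschedulable
---
# Kubelet rejected the pod after binding: top-level status, set by the kubelet.
status:
  phase: Failed
  reason: NodeAffinity
  message: Pod Predicate NodeAffinity failed
```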
Oh, I see now where the issue is: the scheduler sees that the label has been applied but the kubelet doesn't yet, so the kubelet refuses to admit the pod after the scheduler has bound it.

/sig-node |
/sig node @kubernetes/sig-node-bugs

/remove-sig scheduling |
For Deployment and DaemonSet, if we rely on a label not created directly by the kubelet, like `kubernetes.io/os`, the Pod can be scheduled on a Node whose kubelet doesn't yet know about the label in the NodeSelector, so the Pod gets stuck in `NodeAffinity` and is never removed. See: kubernetes/kubernetes#93338, kubernetes/kubernetes#92067. Let's fall back to the `beta.kubernetes.io/os` label for the moment. NOTE: this label was deprecated in 1.19.

Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community.

/remove-lifecycle stale |
I've hit the same issue with v1.19.3: after a normal node reboot, the pod enters the NodeAffinity state.
The fix for this issue was released in v1.19.8+ via #97996.

/close
fixed in #94087

@liggitt: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What happened:
Whenever a node is added or updated, there is a small window where pods are scheduled to that node before any beta labels are applied to it. This can cause issues with pods that are queued up to be scheduled and that have a NodeAffinity (in our case) to the now-deprecated `beta.kubernetes.io/os` label.

What you expected to happen:
The proper labels to be applied to workers before any pods are scheduled on that node.
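The affinity in question would look roughly like this (an illustrative sketch, not the actual manifest from the report):

```yaml
# Illustrative nodeAffinity on the deprecated beta OS label; pods queued
# with such an affinity can race the controller that applies the label.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: beta.kubernetes.io/os
          operator: In
          values: ["linux"]
```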
How to reproduce it (as minimally and precisely as possible):
(Not 100 percent success rate)
Anything else we need to know?:
I have been told this step used to be done on the worker side but is now done on the master side, which could explain why this is happening. https://github.com/kubernetes/kubernetes/blob/v1.19.0-rc.2/pkg/controller/nodelifecycle/node_lifecycle_controller.go#L1534-L1578
/sig scheduling