
Kubelet rejects pod scheduled based on newly added node labels which have not been observed by the kubelet yet #93338

Closed
Joseph-Goergen opened this issue Jul 22, 2020 · 32 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
sig/node: Categorizes an issue or PR as relevant to SIG Node.

Comments

Joseph-Goergen (Contributor) commented Jul 22, 2020

What happened:

Whenever a node is added or updated, there is a small window during which pods can be scheduled to that node before any beta labels are applied to it. This can cause problems for pods that are queued for scheduling and that have a NodeAffinity (in our case) on the now-deprecated beta.kubernetes.io/os label.

What you expected to happen:

The proper labels should be applied to workers before pods are scheduled on that node.

How to reproduce it (as minimally and precisely as possible):

(Does not reproduce 100 percent of the time)

  • deploy 1.19 cluster with no workers
  • apply a deployment with a node affinity for the beta.kubernetes.io/os label
  • add worker
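
For reference, an affinity of the kind described above might look like the following. This is a minimal sketch, not the reporter's actual manifest: the deployment name, labels, and image are placeholders; only the beta.kubernetes.io/os affinity reflects the repro.

```yaml
# Hypothetical deployment; only the nodeAffinity stanza reflects the repro.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: os-affinity-example   # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: os-affinity-example
  template:
    metadata:
      labels:
        app: os-affinity-example
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: beta.kubernetes.io/os   # deprecated beta label from the repro
                operator: In
                values:
                - linux
      containers:
      - name: pause
        image: k8s.gcr.io/pause:3.2   # placeholder image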

Anything else we need to know?:

I have been told this step used to be done on the worker side but is now done on the master side, which could explain why this is happening. https://github.com/kubernetes/kubernetes/blob/v1.19.0-rc.2/pkg/controller/nodelifecycle/node_lifecycle_controller.go#L1534-L1578

/sig scheduling

@Joseph-Goergen Joseph-Goergen added the kind/bug Categorizes issue or PR as related to a bug. label Jul 22, 2020
@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Jul 22, 2020
liggitt (Member) commented Jul 23, 2020

This can cause issues with pods that are queued up to be scheduled and that have a NodeAffinity (in our case) to the now deprecated beta.kubernetes.io/os label.

Can you describe the issue?

@liggitt liggitt added kind/support Categorizes issue or PR as a support question. and removed kind/bug Categorizes issue or PR as related to a bug. labels Jul 23, 2020
liggitt (Member) commented Jul 23, 2020

The controller currently ensures the beta OS and OS labels match. What issue are you seeing occur with affinity?

rtheis commented Jul 23, 2020

@liggitt We believe the flow is as follows:

  1. Deploy control plane components, API server etc.
  2. Deploy addon components (https://github.com/operator-framework/operator-lifecycle-manager in our case). These components will stay in Pending status until worker nodes are ready.
  3. Deploy worker nodes.
  4. Worker node X is registered with the API server (no beta.kubernetes.io/os label yet).
  5. Scheduler schedules the OLM addon catalog pod which uses the beta.kubernetes.io/os label. This pod ends up in NodeAffinity status.
  6. Control plane adds the beta.kubernetes.io/os label to worker node X.
  7. Deleting the OLM addon catalog pod triggers a new pod to be created which is scheduled properly.

There is a timing window here since we don't always see the problem.

liggitt (Member) commented Jul 23, 2020

2. Deploy addon components (https://github.com/operator-framework/operator-lifecycle-manager in our case). These components will stay in Pending status until worker nodes are ready.

Can you provide the manifests that are deployed for reference?

5. Scheduler schedules the OLM addon catalog pod which uses the beta.kubernetes.io/os label.

Can you provide the pod as fetched with kubectl get pod/$name -o yaml for reference?

5. This pod ends up in NodeAffinity status.

Are you using requiredDuringSchedulingIgnoredDuringExecution or preferredDuringSchedulingIgnoredDuringExecution affinity?

rtheis commented Jul 24, 2020

apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/psp: e2e-test-privileged-psp
  creationTimestamp: "2020-07-22T18:48:03Z"
  generateName: addon-catalog-source-
  labels:
    olm.catalogSource: addon-catalog-source
    olm.configMapResourceVersion: "10075545"
  name: addon-catalog-source-dtjxc
  namespace: ibm-system
  ownerReferences:
  - apiVersion: operators.coreos.com/v1alpha1
    blockOwnerDeletion: false
    controller: false
    kind: CatalogSource
    name: addon-catalog-source
    uid: 1e8f8344-6f25-4dc1-b2dc-108762d9f3a7
  resourceVersion: "11390387"
  selfLink: /api/v1/namespaces/ibm-system/pods/addon-catalog-source-dtjxc
  uid: d4e741c2-fdde-49d5-b45e-e986492e9284
spec:
  containers:
  - command:
    - configmap-server
    - -c
    - addon-catalog-manifests
    - -n
    - ibm-system
    image: registry.ng.bluemix.net/armada-master/configmap-operator-registry:v1.6.1
    imagePullPolicy: IfNotPresent
    livenessProbe:
      exec:
        command:
        - grpc_health_probe
        - -addr=localhost:50051
      failureThreshold: 3
      initialDelaySeconds: 2
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    name: configmap-registry-server
    ports:
    - containerPort: 50051
      name: grpc
      protocol: TCP
    readinessProbe:
      exec:
        command:
        - grpc_health_probe
        - -addr=localhost:50051
      failureThreshold: 3
      initialDelaySeconds: 1
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    resources:
      limits:
        cpu: 100m
        memory: 100Mi
      requests:
        cpu: 10m
        memory: 50Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: addon-catalog-source-configmap-server-token-mdcdc
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: 10.177.223.161
  nodeSelector:
    beta.kubernetes.io/os: linux
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: addon-catalog-source-configmap-server
  serviceAccountName: addon-catalog-source-configmap-server
  terminationGracePeriodSeconds: 30
  tolerations:
  - operator: Exists
  volumes:
  - name: addon-catalog-source-configmap-server-token-mdcdc
    secret:
      defaultMode: 420
      secretName: addon-catalog-source-configmap-server-token-mdcdc

rtheis commented Jul 24, 2020

@liggitt Please see previous comment for example pod yaml and here is the OLM PR to fix the addon-catalog-source pod: operator-framework/operator-lifecycle-manager#1562. The PR details the files containing the manifests.

liggitt (Member) commented Jul 24, 2020

nodeSelector:
    beta.kubernetes.io/os: linux

That should make the scheduler wait to schedule the pod until a node with the appropriate labels appears.

Can you provide the full pod (including the status) and scheduler events which you were seeing issues with?

rtheis commented Jul 24, 2020

@liggitt Unfortunately, I only have the following data from the previous failure. I can try to recreate the problem again if you need more information.

ibm-system    addon-catalog-source-br548                             0/1     NodeAffinity   0          147m   <none>           10.240.134.54   <none>           <none>
Name:         addon-catalog-source-br548
Namespace:    ibm-system
Priority:     0
Node:         10.240.134.54/
Start Time:   Wed, 22 Jul 2020 03:04:08 +0000
Labels:       olm.catalogSource=addon-catalog-source
              olm.configMapResourceVersion=2908
Annotations:  kubernetes.io/psp: ibm-privileged-psp
Status:       Failed
Reason:       NodeAffinity
Message:      Pod Predicate NodeAffinity failed
IP:           
IPs:          <none>
Containers:
  configmap-registry-server:
    Image:      registry.ng.bluemix.net/armada-master/configmap-operator-registry:v1.6.1
    Port:       50051/TCP
    Host Port:  0/TCP
    Command:
      configmap-server
      -c
      addon-catalog-manifests
      -n
      ibm-system
    Limits:
      cpu:     100m
      memory:  100Mi
    Requests:
      cpu:        10m
      memory:     50Mi
    Liveness:     exec [grpc_health_probe -addr=localhost:50051] delay=2s timeout=1s period=10s #success=1 #failure=3
    Readiness:    exec [grpc_health_probe -addr=localhost:50051] delay=1s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from addon-catalog-source-configmap-server-token-f5gsm (ro)
Volumes:
  addon-catalog-source-configmap-server-token-f5gsm:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  addon-catalog-source-configmap-server-token-f5gsm
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  beta.kubernetes.io/os=linux
Tolerations:     
Events:          <none>
kubectl get events --all-namespaces
No resources found

liggitt (Member) commented Jul 24, 2020

@kubernetes/sig-scheduling-misc NodeAffinity failure because no nodes currently match the pod's selector is not a terminal state, right (e.g. if a node becomes available that matches the selector, the pod will be scheduled)?

ahg-g (Member) commented Jul 24, 2020

@kubernetes/sig-scheduling-misc NodeAffinity failure because no nodes currently match the pod's selector is not a terminal state, right (e.g. if a node becomes available that matches the selector, the pod will be scheduled)?

yes, the pod should be unschedulable until a node that matches the selector shows up.

ahg-g (Member) commented Jul 24, 2020

To clarify, is the issue that a pod with nodeSelector set is getting scheduled on a node that doesn't yet have the corresponding labels?

rtheis commented Jul 24, 2020

@ahg-g No, a pod with nodeSelector set never gets scheduled on a node when the initial scheduling attempt does not find a corresponding node label. Even when the node label is added later, the pod is stuck in NodeAffinity.

rtheis commented Jul 24, 2020

This issue is very similar to #92067.

ahg-g (Member) commented Jul 24, 2020

From #93338 (comment), it seems the pod got scheduled; the nodeName is set: nodeName: 10.177.223.161

ahg-g (Member) commented Jul 24, 2020

ibm-system    addon-catalog-source-br548                             0/1     NodeAffinity   0          147m   <none>           10.240.134.54   <none>           <none>

where is this line coming from, a log? an event?

rtheis commented Jul 24, 2020

@ahg-g #93338 (comment) is showing a successful pod. Sorry for the confusion. And I've added more context for that pod line.

+ kubectl get pods --all-namespaces -o wide
NAMESPACE     NAME                                                   READY   STATUS         RESTARTS   AGE    IP               NODE            NOMINATED NODE   READINESS GATES
default       test-k8s-e2e-pvg-master-verification                   1/1     Running        0          123m   172.30.3.45      10.240.134.60   <none>           <none>
ibm-system    addon-catalog-source-br548                             0/1     NodeAffinity   0          147m   <none>           10.240.134.54   <none>           <none>

ahg-g (Member) commented Jul 24, 2020

Not sure about the source of the status field, but this is certainly not a scheduling issue.

liggitt (Member) commented Jul 24, 2020

this is certainly not a scheduling issue.

a pod that does not have a node it can be scheduled to should remain pending, and then be scheduled successfully when such a node appears, right?

ahg-g (Member) commented Jul 24, 2020

this is certainly not a scheduling issue.

a pod that does not have a node it can be scheduled to should remain pending, and then be scheduled successfully when such a node appears, right?

yes, and that is what is happening here: the pod didn't get scheduled initially and then got scheduled (I presume when the label got applied).

#93338 (comment) explains that the issue is that the pod didn't get scheduled after the label was applied, and I am saying that it did get scheduled because the nodeName is set.

rtheis commented Jul 24, 2020

@ahg-g So are you saying the pod is scheduled and may be running even though status is NodeAffinity? Or maybe kubelet is confused and doesn't start the pod because it doesn't think the pod should have been scheduled to the node due to missing node label (according to kubelet)? Or maybe the API server didn't update status properly?

ahg-g (Member) commented Jul 24, 2020

From the scheduler perspective, the pod is scheduled because the nodeName is set. The status isn't something the scheduler manages, the scheduler only adds a pod condition when it fails to schedule the pod.

ahg-g (Member) commented Jul 24, 2020

Oh, I see now where the issue is: the scheduler sees that the label is applied but the kubelet doesn't, and so the kubelet is not admitting the pod after the scheduler scheduled it.
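
The race described here can be modeled in a few lines. This is a toy sketch, not actual kubelet or scheduler code; `matches_selector` and the label maps are invented for illustration. The point is that both components evaluate the same nodeSelector, but against different views of the node's labels.

```python
def matches_selector(node_labels, node_selector):
    """True if every key/value pair in the selector is present on the node."""
    return all(node_labels.get(k) == v for k, v in node_selector.items())

# The pod from this issue selects on the deprecated beta OS label.
pod_selector = {"beta.kubernetes.io/os": "linux"}

# Scheduler's view: the node-lifecycle controller has already added the beta label.
scheduler_view = {"kubernetes.io/os": "linux", "beta.kubernetes.io/os": "linux"}

# Kubelet's stale view of its own Node object: the added label not observed yet.
kubelet_view = {"kubernetes.io/os": "linux"}

scheduled = matches_selector(scheduler_view, pod_selector)  # True: pod gets bound
admitted = matches_selector(kubelet_view, pod_selector)     # False: kubelet rejects

if scheduled and not admitted:
    # The behavior reported in this issue: the pod fails with reason
    # "NodeAffinity" and is not retried.
    status = "Failed (NodeAffinity)"
```

Once the kubelet observes the same labels the scheduler used, both checks agree and the pod is admitted, which matches the observation that a replacement pod schedules fine.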

ahg-g (Member) commented Jul 24, 2020

/sig-node

liggitt (Member) commented Jul 24, 2020

/sig node

@kubernetes/sig-node-bugs

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Jul 24, 2020
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jul 24, 2020
@liggitt liggitt changed the title Node schedules pods before registration events Kubelet rejects pod scheduled based on newly added node labels which have not been observed yet Jul 24, 2020
@liggitt liggitt changed the title Kubelet rejects pod scheduled based on newly added node labels which have not been observed yet Kubelet rejects pod scheduled based on newly added node labels which have not been observed by the kubelet yet Jul 24, 2020
ahg-g (Member) commented Aug 27, 2020

/remove-sig scheduling

@k8s-ci-robot k8s-ci-robot removed the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Aug 27, 2020
@liggitt liggitt added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed kind/support Categorizes issue or PR as a support question. labels Nov 21, 2020
TeddyAndrieux added a commit to scality/metalk8s that referenced this issue Nov 26, 2020
For Deployments and DaemonSets, if we rely on a label not created directly by
the kubelet, like `kubernetes.io/os`, the Pod can be scheduled on a Node
running a kubelet that doesn't yet know about the label in the NodeSelector,
so the Pod gets stuck in `NodeAffinity` and is never removed.
See: kubernetes/kubernetes#93338
     kubernetes/kubernetes#92067

Let's rely on the `beta.kubernetes.io/os` label again for the moment.
NOTE: This label was deprecated in 1.19
fejta-bot commented

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 19, 2021
rtheis commented Feb 19, 2021

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 19, 2021
lynic commented Feb 25, 2021

I've hit the same issue with v1.19.3: after a normal node reboot, the pod enters the NodeAffinity state.

liggitt (Member) commented Feb 25, 2021

I've hit the same issue with v1.19.3: after a normal node reboot, the pod enters the NodeAffinity state.

The fix for this issue was released to v1.19.8+ in #97996

liggitt (Member) commented Feb 25, 2021

/close

fixed in #94087

k8s-ci-robot (Contributor) commented

@liggitt: Closing this issue.

In response to this:

/close

fixed in #94087

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
