
Kubernetes scheduler spams cluster with pods in NodeAffinity status #92067

Closed
rtheis opened this issue Jun 12, 2020 · 18 comments · Fixed by #94087
Labels
sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.

Comments

rtheis commented Jun 12, 2020

What happened:

A deployment that uses a nodeSelector is created after the cluster master is created but before the cluster worker nodes are fully initialized. During the initialization of the worker nodes, the nodes are temporarily available for scheduling without the necessary label to match the deployment's node selector. Depending on how long the nodes remain schedulable without those labels, the scheduler starts to spam the cluster with pods in NodeAffinity status. The spamming stops once the worker nodes are fully initialized and the pods are scheduled successfully. The exact timing of all of this is TBD.
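
For illustration, a minimal sketch of the kind of Deployment spec involved, written against the Kubernetes Go API types; the label key, image, and names are made up for the example, not our actual workload. The nodeSelector references a label that the workers only receive after they have registered, which is the window in which the failures occur.

package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	podLabels := map[string]string{"app": "example"}

	dep := appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{Name: "example"},
		Spec: appsv1.DeploymentSpec{
			Selector: &metav1.LabelSelector{MatchLabels: podLabels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: podLabels},
				Spec: corev1.PodSpec{
					// Hypothetical label that is only added to worker nodes
					// after they have registered and become schedulable.
					NodeSelector: map[string]string{
						"example.com/worker-pool": "default",
					},
					Containers: []corev1.Container{
						{Name: "app", Image: "example.com/app:latest"},
					},
				},
			},
		},
	}

	fmt.Printf("nodeSelector: %v\n", dep.Spec.Template.Spec.NodeSelector)
}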

What you expected to happen:

Pods in NodeAffinity status are annoying and require manual cleanup. It would be preferred that the scheduler avoid this and simply issue warning events.

How to reproduce it (as minimally and precisely as possible):

See issue description.

Anything else we need to know?: No.

Environment:

  • Kubernetes version (use kubectl version): 1.18.3
  • Cloud provider or hardware configuration: IBM Cloud Kubernetes Service
  • OS (e.g: cat /etc/os-release): Ubuntu 18.04.4 LTS
  • Kernel (e.g. uname -a): 4.15.0-96-generic
  • Install tools: Managed service install
  • Network plugin and version (if this is a network-related bug): N/A
  • Others: N/A
@rtheis rtheis added the kind/bug Categorizes issue or PR as related to a bug. label Jun 12, 2020
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jun 12, 2020
rtheis (Author) commented Jun 12, 2020

Related to #43846 and #52902.

rtheis (Author) commented Jun 12, 2020

/sig scheduling

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 12, 2020
ahg-g (Member) commented Jun 12, 2020

By spamming, are you referring to scheduler logs? Also, why are the nodes not labelled from the get-go?

ingvagabund (Contributor) commented

> the nodes are temporarily available for scheduling without the necessary label to match the deployment's node selector.

Just to confirm, all the pods in question are still in Pending state, yes?

> Pods in NodeAffinity status are annoying and require manual cleanup.

If the pods are in Pending state, all of them will get scheduled eventually once the nodes get their labels.

rtheis (Author) commented Jun 13, 2020

The pods are not in Pending state. They are actually in NodeAffinity state. The pods eventually get scheduled, but we are left with anywhere from a few to dozens of these NodeAffinity pods.

Also, we are looking at labeling the nodes via kubelet flags to see if that helps.

ahg-g (Member) commented Jun 13, 2020

/remove-kind bug

@k8s-ci-robot k8s-ci-robot removed the kind/bug Categorizes issue or PR as related to a bug. label Jun 13, 2020
zhouya0 (Member) commented Jun 14, 2020

This problem can be solved by adding some retry logic and a wait in getNodeAnyWay:

func (kl *Kubelet) getNodeAnyWay() (*v1.Node, error) {
	if kl.kubeClient != nil {
		// Prefer the Node object from the informer-backed lister.
		if n, err := kl.nodeLister.Get(string(kl.nodeName)); err == nil {
			return n, nil
		}
	}
	// Lister miss (for example, the informer has not synced yet): fall back
	// to a locally-built initial node, which may lack API-server labels.
	return kl.initialNode(context.TODO())
}

As you can see, this function falls back to the initial node when the lister request fails.
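
A minimal sketch of that retry idea, assuming a bounded poll of the lister before falling back to the initial node; the helper name, parameters, and the interval/timeout values are illustrative rather than the kubelet's actual API.

package main

import (
	"context"
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

// getNodeWithRetry polls the lister-backed getter for a short, bounded time
// before giving up and falling back to the locally-built initial node.
func getNodeWithRetry(
	ctx context.Context,
	getFromLister func() (*v1.Node, error),
	buildInitialNode func(context.Context) (*v1.Node, error),
) (*v1.Node, error) {
	var node *v1.Node
	err := wait.PollImmediate(200*time.Millisecond, 2*time.Second, func() (bool, error) {
		n, err := getFromLister()
		if err != nil {
			return false, nil // lister likely not synced yet; keep polling
		}
		node = n
		return true, nil
	})
	if err == nil {
		return node, nil
	}
	// Timed out: fall back to the initial node, which may still be missing
	// labels that only exist on the API server copy of the Node object.
	return buildInitialNode(ctx)
}

func main() {
	// Toy usage: the "lister" never succeeds here, so the fallback is used.
	n, err := getNodeWithRetry(
		context.Background(),
		func() (*v1.Node, error) { return nil, fmt.Errorf("not synced") },
		func(context.Context) (*v1.Node, error) { return &v1.Node{}, nil },
	)
	fmt.Println(n != nil, err)
}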

invidian (Member) commented Jun 25, 2020

I also get hit by this when a single controller node gets rebooted and the scheduler starts as a static container before the kubelet becomes ready.

Can be cleaned up with: kubectl delete pods $(kubectl get pods | grep NodeAff | awk '{print $1}' | tr \\n ' ')
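
Assuming those pods end up in the Failed phase (which is how the kubelet reports pods it rejects for node affinity), something like kubectl delete pods --field-selector=status.phase=Failed should also work, though it removes every Failed pod in the namespace, not just these.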

sjenning (Contributor) commented

/sig node

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Aug 18, 2020
derekwaynecarr (Member) commented

The spam cycle is this loop: "pod create, pod schedule, worker reject, pod failed". We don't have a lot of protection in higher-level workload controllers to handle pods that fail at kubelet admission time versus pods that simply don't get created due to quota/admission. In general, is the spam causing any other stability issues? Is pod GC pruning the pods?

derekwaynecarr (Member) commented

I wonder if we can do a trick in node lister to mitigate any race condition here...

derekwaynecarr (Member) commented

We could maybe do something similar to https://github.com/kubernetes/kubernetes/pull/91500/files to ensure the node lister has synced...

derekwaynecarr (Member) commented

I'm hacking on an option to ensure that, if the kubelet has a node lister with a valid kube client, we wait for it to sync at least once, which should mitigate this issue. See #94087.
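
For reference, a minimal sketch of "wait for the node informer to sync at least once" using client-go primitives; the wiring below is illustrative and not the actual change in #94087.

package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Illustrative out-of-cluster setup; the kubelet wires its client and
	// informers differently.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	nodeInformer := factory.Core().V1().Nodes().Informer()
	nodeLister := factory.Core().V1().Nodes().Lister()

	stopCh := make(chan struct{})
	defer close(stopCh)
	factory.Start(stopCh)

	// Block until the node informer has completed its initial list, so the
	// lister reflects labels already present on the API server.
	if !cache.WaitForCacheSync(stopCh, nodeInformer.HasSynced) {
		panic("node informer never synced")
	}

	nodes, err := nodeLister.List(labels.Everything())
	if err != nil {
		panic(err)
	}
	fmt.Printf("node lister synced with %d nodes\n", len(nodes))
}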

rtheis (Author) commented Aug 18, 2020

We noticed the problem on the OLM addon-catalog-source pod, which is managed by the OLM operator. Right or wrong, the operator does not like seeing the pod go to NodeAffinity status and no longer reconciles the pod. The end result is that manual intervention (i.e. pod deletion) is required to fix the problem.

derekwaynecarr (Member) commented

@rtheis sure... I would have thought the pod failed in that instance and reached a terminal state. Either way, I am looking at what I can do to mitigate this.

igraecao (Contributor) commented Oct 8, 2020

The problem persists in our environment as well, and the workaround cited above no longer suffices. Is there something in the works for this one?

TeddyAndrieux added a commit to scality/metalk8s that referenced this issue Nov 26, 2020
For Deployments and DaemonSets, if we rely on a label that is not created directly by the kubelet, like `kubernetes.io/os`, the Pod can be scheduled on a Node whose kubelet does not yet know the label in the NodeSelector, so the Pod gets stuck in `NodeAffinity` and is never removed.
See: kubernetes/kubernetes#93338
     kubernetes/kubernetes#92067

Let's rely on the `beta.kubernetes.io/os` label again for the moment.
NOTE: This label is deprecated in 1.19.
gdemonet added commits to scality/metalk8s that referenced this issue Dec 9 and Dec 10, 2020
Instead of waiting for all scheduled pods to be running (which is
already done in E2E tests), the "stabilization" script is now only
checking that no change to Pod objects can be seen over a given period
of time.

This will effectively hide any scheduling issue, especially the
`NodeAffinity` flakiness which has been impacting us a lot recently.
See kubernetes/kubernetes#92067 for reference.
fejta-bot commented

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 6, 2021
rtheis (Author) commented Jan 6, 2021

/remove-lifecycle stale
