
DaemonSet Controller doesn't delete orphaned pods #71349

Closed
krzysztof-jastrzebski opened this issue Nov 22, 2018 · 8 comments · Fixed by #73401
Assignees: janetkuo, k82cn
Labels: kind/bug, sig/apps

Comments

krzysztof-jastrzebski (Contributor) commented Nov 22, 2018

What happened:
The DaemonSet controller doesn't delete pods that:

  • weren't scheduled yet, and
  • whose node was deleted, so they can never be scheduled.

What you expected to happen:
The pod should be deleted when its node is removed.

How to reproduce it (as minimally and precisely as possible):
I encountered this problem while running tests on a cluster with 1000 nodes and 30000 pods. During the tests I was deleting and adding nodes via the GCE API without draining them. I reproduced the problem on a much smaller cluster, but it requires stopping the K8s scheduler.

Scenario:

  1. Create cluster with 2 nodes.
  2. Wait till fluentd-gcp pods are running on nodes.
  3. Stop K8s scheduler.
  4. Delete fluentd-gcp pods.
  5. Wait till daemonset controller creates pending pods.
  6. Delete one node.
  7. Start scheduler.
  8. One of the fluentd-gcp pods will remain Pending and the scheduler won't be able to schedule it (see the sketch below).
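
For context on why that pod is stuck: since ScheduleDaemonSetPods (beta in 1.12), the DaemonSet controller no longer sets pod.Spec.NodeName itself; it pins each pod to its target node with a required node affinity on the metadata.name field and lets the default scheduler bind it. A minimal sketch of that affinity, assuming the k8s.io/api/core/v1 types:

```go
package sketch

import v1 "k8s.io/api/core/v1"

// daemonPodAffinity builds the kind of per-node affinity the DaemonSet
// controller attaches to its pods. If nodeName refers to a node that has
// since been deleted, no node can ever satisfy this term, so the pod
// stays Pending forever.
func daemonPodAffinity(nodeName string) *v1.Affinity {
	return &v1.Affinity{
		NodeAffinity: &v1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &v1.NodeSelector{
				NodeSelectorTerms: []v1.NodeSelectorTerm{{
					// MatchFields (not MatchExpressions) is what pins a
					// pod to a node by its metadata.name.
					MatchFields: []v1.NodeSelectorRequirement{{
						Key:      "metadata.name",
						Operator: v1.NodeSelectorOpIn,
						Values:   []string{nodeName},
					}},
				}},
			},
		},
	}
}
```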

Anything else we need to know?:

Environment: GKE

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.7", GitCommit:"0c38c362511b20a098d7cd855f1314dad92c2780", GitTreeState:"clean", BuildDate:"2018-08-20T10:09:03Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.2-gke.3", GitCommit:"3bdb4f77629a001276ae061e68aff4bac147fbc5", GitTreeState:"clean", BuildDate:"2018-11-13T22:54:51Z", GoVersion:"go1.10.4b4", Compiler:"gc", Platform:"linux/amd64"}

/kind bug

@k8s-ci-robot added the kind/bug and needs-sig labels on Nov 22, 2018
krzysztof-jastrzebski (Contributor Author):

/sig apps

@k8s-ci-robot added the sig/apps label and removed needs-sig on Nov 22, 2018
krzysztof-jastrzebski (Contributor Author):

/assign janetkuo

liggitt (Member) commented Dec 10, 2018

/assign @k82cn

k82cn (Member) commented Jan 29, 2019

We have PodGC to clean up orphaned pods every 20s, is that enough?
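
For reference, PodGC's orphan check only considers pods that are already bound to a node. A hedged sketch of that criterion (simplified; the real logic lives in pkg/controller/podgc, and the helper name here is illustrative):

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/sets"
)

// isOrphaned mirrors the condition PodGC uses to garbage-collect pods
// whose node has disappeared. A pod with an empty spec.nodeName is never
// considered orphaned here, which is why the pods described in this
// issue slip through.
func isOrphaned(pod *v1.Pod, existingNodes sets.String) bool {
	return pod.Spec.NodeName != "" && !existingNodes.Has(pod.Spec.NodeName)
}
```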

krzysztof-jastrzebski (Contributor Author):

I'm removing pods that have only node affinity set and don't have pod.Spec.NodeName set. These pods were never touched by the scheduler before the node was removed, and after the node is removed they can't be scheduled because the node doesn't exist anymore. Such pods stay Pending forever, and PodGC can't remove them because pod.Spec.NodeName is empty.
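
To make the gap concrete, here is a minimal sketch (illustrative only, not the actual fix in the linked PR) of how such never-schedulable pods could be identified: no spec.nodeName, plus a required node affinity pinning the pod to a node that no longer exists.

```go
package sketch

import v1 "k8s.io/api/core/v1"

// targetNode extracts the node name a DaemonSet pod is pinned to via the
// required node-affinity term on metadata.name (see the earlier sketch).
// It returns "" if the pod carries no such term.
func targetNode(pod *v1.Pod) string {
	aff := pod.Spec.Affinity
	if aff == nil || aff.NodeAffinity == nil ||
		aff.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution == nil {
		return ""
	}
	for _, term := range aff.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution.NodeSelectorTerms {
		for _, req := range term.MatchFields {
			if req.Key == "metadata.name" && req.Operator == v1.NodeSelectorOpIn && len(req.Values) == 1 {
				return req.Values[0]
			}
		}
	}
	return ""
}

// isStuckDaemonPod reports whether a pod can never be scheduled: it was
// never bound (empty spec.nodeName) and the node its affinity targets is
// gone. PodGC skips such pods entirely, so something else must delete them.
func isStuckDaemonPod(pod *v1.Pod, nodeExists func(name string) bool) bool {
	if pod.Spec.NodeName != "" {
		return false // already bound; the deleted-node case is PodGC's job
	}
	target := targetNode(pod)
	return target != "" && !nodeExists(target)
}
```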

k82cn (Member) commented Jan 29, 2019

> have only node affinity set and don't have pod.Spec.NodeName set

Yes, that's right :) Let me take some time to review the PR, thanks.

krzysztof-jastrzebski (Contributor Author):

@k82cn Should I backport the fix to 1.13 and 1.12?

k82cn (Member) commented Feb 1, 2019

> Should I backport the fix to 1.13 and 1.12?

+1
