
drain daemonsets #75482

Open
kfox1111 opened this issue Mar 19, 2019 · 61 comments
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. priority/backlog Higher priority than priority/awaiting-more-evidence. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@kfox1111

What would you like to be added:
A flag for `kubectl drain` to support draining DaemonSet-managed pods as well.

Why is this needed:
I needed to completely remove the container runtime's images to fix an issue, but DaemonSets kept trying to keep their pods running.

@kfox1111 kfox1111 added the kind/feature Categorizes issue or PR as related to a new feature. label Mar 19, 2019
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Mar 19, 2019
@kfox1111
Author

Taking a guess here...
@kubernetes/sig-node

@vllry
Contributor

vllry commented Mar 20, 2019

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 20, 2019
@nikopen
Contributor

nikopen commented May 17, 2019

/milestone v1.16
/sig apps
/kind bug
/priority important-soon

I'd say it's a bug, as drain is unable to keep the promise of `--ignore-daemonsets=false`, which is even the default value.

Prioritizing this for the next release, ideally

@k8s-ci-robot k8s-ci-robot added this to the v1.16 milestone May 17, 2019
@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels May 17, 2019
@paivagustavo
Contributor

paivagustavo commented Jun 19, 2019

Hey, I would like to give this a try. @nikopen do you think a new contributor to kubernetes could fix this issue? If yes, could you give me some directions?

@paivagustavo
Contributor

I was going through the docs and found this statement:

Note: Pods created by a DaemonSet controller bypass the Kubernetes scheduler and do not respect the unschedulable attribute on a node. This assumes that daemons belong on the machine even if it is being drained of applications while it prepares for a reboot.

Is this really a bug or is it the intended behavior?

@nikopen
Contributor

nikopen commented Jun 20, 2019

@paivagustavo I think the implementation is not the problem (it's likely trivial); the problem lies mostly in the decision-making by the responsible SIG. It could be viewed as either a bug or intended behavior, but there should be a way to manually force a full drain.

@kfox1111
Author

Yeah. Generally, you want the network stack (kube-proxy, flannel, etc.) to continue to function while draining the node of pods, so you don't want those to be deleted too early. But you do want a way to eventually clean all pods off completely. Typically I've seen upgrading docker/containerd/etc. lose track of pods because of this issue: kubelet still thinks they are on the node, but the engine no longer finds them. I hit it again today, so we do need a solution.

Maybe an additional kind of drain that taints the node as going completely offline and deletes all remaining pods? That would prevent the DaemonSets from relaunching pods on it.
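A rough sketch of how that idea could look from the command line. Everything here is hypothetical (the `full-drain` taint key is made up); it relies on the fact that the DaemonSet controller only auto-tolerates the built-in `node.kubernetes.io/*` taints, so a custom `NoExecute` taint would evict DaemonSet pods too, unless a DaemonSet explicitly tolerates it. The function just prints the commands it would run rather than executing them:

```shell
# Sketch only: a "full drain" as an ordinary drain followed by a custom
# NoExecute taint that evicts the remaining DaemonSet pods as well.
# Prints the commands instead of executing them.
full_drain() {
  local node="$1"
  echo "kubectl drain ${node} --ignore-daemonsets --delete-emptydir-data"
  echo "kubectl taint nodes ${node} full-drain=true:NoExecute"
}

full_drain node-1
```

Undoing it would presumably be `kubectl taint nodes node-1 full-drain-` followed by `kubectl uncordon node-1`.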

@xmudrii
Member

xmudrii commented Aug 19, 2019

@kfox1111 Hello! I'm the bug triage lead for the 1.16 release cycle, and since this issue is tagged for 1.16 but has not been updated in a long time, I'd like to check its status. Code freeze starts on August 29th (about 1.5 weeks from now), which means a PR should be ready (and merged) by then.

Do we still target this issue to be fixed for 1.16?

@kfox1111
Author

I hope so, but I don't think anyone is actively working on it at the moment. So probably will miss 1.16.

@nikopen
Contributor

nikopen commented Aug 22, 2019

/milestone v1.17

@k8s-ci-robot k8s-ci-robot modified the milestones: v1.16, v1.17 Aug 22, 2019
@ttousai

ttousai commented Oct 31, 2019

Bug triage for 1.17 here with a gentle reminder that code freeze for this release is on Nov. 18. Is this issue still intended for 1.17?

@kfox1111
Author

This still periodically affects me. Need a fix.

@ttousai

ttousai commented Nov 4, 2019

Correction: Code freeze is on Thursday, November 14.

Sorry for the mix up.

@derekwaynecarr
Member

Are you able to address this by performing a power cycle? Typically, we do not recommend updating the host without rebooting it.

@kfox1111
Author

No. Upgrading the container engine can sometimes cause it to lose track of containers in an unsafe but persistent way (storage). The instructions say to remove 'all' containers from the system before upgrading the container runtime, but DaemonSets ignore this and are not actually drained (which is sometimes useful). There needs to be an easy way to completely remove all Kubernetes-managed containers from a host. For example, a flag on the host, or something similar, to say: kill all the rest and don't let them restart automatically.
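To make the gap concrete: today an admin has to find the leftover DaemonSet pods by hand. A minimal sketch (names hypothetical) that filters pods by their owner kind; it is shown here against canned output so the logic is visible, and on a real cluster the input would come from the `kubectl` call in the comment:

```shell
# Hypothetical helper: list DaemonSet-managed pods by their owner kind.
# On a live cluster the input would come from:
#   kubectl get pods -A --field-selector spec.nodeName=<node> --no-headers \
#     -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].kind'
list_ds_pods() {
  awk '$3 == "DaemonSet" { print $1 "/" $2 }'
}

# Canned sample of that output; only the two DaemonSet pods are printed.
list_ds_pods <<'EOF'
kube-system kube-proxy-abc12 DaemonSet
default web-6f9c4-xk2p1 ReplicaSet
kube-system flannel-ds-7q9zw DaemonSet
EOF
```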

@josiahbjorgaard
Contributor

/milestone v1.18

@k8s-ci-robot k8s-ci-robot modified the milestones: v1.17, v1.18 Nov 13, 2019
@wilmardo

We ran into this today; this issue is the result of a design decision in containerd:

KillMode handles when containerd is being shut down. By default, systemd will look in its named cgroup and kill every process that it knows about for the service. This is not what we want. As ops, we want to be able to upgrade containerd and allow existing containers to keep running without interruption. Setting KillMode to process ensures that systemd only kills the containerd daemon and not any child processes such as the shims and containers.

Source: https://github.com/containerd/containerd/blob/master/docs/ops.md#systemd
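For reference, the setting described above would look roughly like this in a systemd unit or drop-in (illustrative only; the exact path and unit contents vary by distro and containerd version):

```ini
# /etc/systemd/system/containerd.service.d/override.conf (illustrative)
[Service]
# systemd kills only the containerd daemon on stop/restart; the shims and
# their containers keep running. This is why, after an upgrade, a node can
# end up with shim processes that the new daemon no longer tracks.
KillMode=process
```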

So when containerd is upgraded, it abandons its child processes by design. All the ignored DaemonSet pods still running on the drained node are therefore abandoned, but Kubernetes will not pick them back up after uncordoning. This seems to be the issue @kfox1111 describes:

Typically I've seen upgrading docker/containerd/etc. lose track of pods because of this issue: kubelet still thinks they are on the node, but the engine no longer finds them

This is resolved by rebooting the node, as @derekwaynecarr suggests, since that kills the abandoned child processes.

If needed I can provide some examples. In short, this is reproducible as follows:

  • Run a cluster on containerd (1.2.6 for example)
  • Drain a node with --ignore-daemonsets
  • Upgrade containerd to 1.2.10 on this node
  • Run ps faux and see the abandoned child processes
  • Uncordon the node and see Kubernetes not being able to resolve the state
  • Restart the node
  • Watch all pods go to running again
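Step 4 above can be made concrete with a small filter that flags shim processes re-parented to PID 1 after the containerd restart. It is shown against canned `ps` output so it runs anywhere; on an affected node you would pipe in `ps -eo pid,ppid,comm` instead:

```shell
# Hypothetical check for step 4 of the repro: containerd-shim processes
# whose parent died get re-parented to PID 1 (i.e. PPID == 1).
orphaned_shims() {
  awk 'NR > 1 && $2 == 1 && $3 ~ /containerd-shim/ { print $1 }'
}

# Canned `ps -eo pid,ppid,comm` output: only PID 1401 is an orphaned shim.
orphaned_shims <<'EOF'
  PID  PPID COMMAND
 1401     1 containerd-shim
 1533  1400 containerd-shim
 1600     1 sshd
EOF
```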

@kfox1111
Author

Containerd has it for sure; other runtimes might as well. The need to drain a node completely of its workload is reasonable on its own, though, regardless of which runtime is involved.

I've also seen kubelet continuously complain in the logs about pods that disappeared after a containerd upgrade, so just rebooting doesn't entirely fix the issue.

@furkatgofurov7
Member

@kfox1111 hi, how can we move forward with this issue? It has been open for a long time; any ideas? Is it that no one is willing to take it up, or does it need confirmation from SIG Node?
Also, @nikopen hi, is there anything I can do to move this issue forward? Thanks!

@kfox1111
Author

kfox1111 commented May 9, 2022

Pretty sure it's not on anyone's radar anymore. Someone should bring it back up to sig-node, maybe?

@furkatgofurov7
Member

Pretty sure it's not on anyone's radar anymore. Someone should bring it back up to sig-node, maybe?

Got it. Sorry, I am not aware of the process; what exactly does one have to do to bring this to sig-node's attention?
/sig node

@kfox1111
Author

kfox1111 commented May 9, 2022

Got it. Sorry, I am not aware of the process; what exactly does one have to do to bring this to sig-node's attention? /sig node

Sorry. I don't really know.

Guessing it may need to be brought up at one of the meetings?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 7, 2022
@kfox1111
Author

kfox1111 commented Aug 8, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 8, 2022
@kfox1111
Author

kfox1111 commented Aug 8, 2022

still an issue. :/

@movikbence

still an issue. :/

+1

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 18, 2022
@kfox1111
Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 19, 2022
@SergeyKanzhelev
Member

/remove-kind bug
/kind feature

@k8s-ci-robot k8s-ci-robot removed the kind/bug Categorizes issue or PR as related to a bug. label Mar 1, 2023
@SergeyKanzhelev SergeyKanzhelev removed this from High Priority in SIG Node Bugs Mar 1, 2023
@SergeyKanzhelev
Member

/priority backlog

to reflect reality

@k8s-ci-robot k8s-ci-robot added the priority/backlog Higher priority than priority/awaiting-more-evidence. label Mar 1, 2023
@k8s-triage-robot

This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Deprioritize it with /priority important-longterm or /priority backlog
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels May 30, 2023
@SergeyKanzhelev
Member

/remove-priority important-soon

@k8s-ci-robot k8s-ci-robot removed the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jun 24, 2023
@atiratree
Member

This feature might be eventually supported with Declarative Node Maintenance: kubernetes/enhancements#4213

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 9, 2024