
drain daemonsets #75482

Open
kfox1111 opened this issue Mar 19, 2019 · 61 comments
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. priority/backlog Higher priority than priority/awaiting-more-evidence. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@kfox1111

What would you like to be added:
A flag for `kubectl drain` to support draining DaemonSet-managed pods as well.

Why is this needed:
I needed to completely remove the container runtime's images to fix an issue, but DaemonSets kept trying to keep their pods running.

@kfox1111 kfox1111 added the kind/feature Categorizes issue or PR as related to a new feature. label Mar 19, 2019
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Mar 19, 2019
@kfox1111
Author

Taking a guess here...
@kubernetes/sig-node

@vllry
Contributor

vllry commented Mar 20, 2019

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 20, 2019
@nikopen
Contributor

nikopen commented May 17, 2019

/milestone v1.16
/sig apps
/kind bug
/priority important-soon

I'd say it's a bug, as drain is unable to keep the promise of `--ignore-daemonsets=false`, which is even the default value.

Prioritizing this for the next release, ideally

@k8s-ci-robot k8s-ci-robot added this to the v1.16 milestone May 17, 2019
@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels May 17, 2019
@paivagustavo
Contributor

paivagustavo commented Jun 19, 2019

Hey, I would like to give this a try. @nikopen do you think a new contributor to kubernetes could fix this issue? If yes, could you give me some directions?

@paivagustavo
Contributor

I was going through the docs and found this statement:

Note: Pods created by a DaemonSet controller bypass the Kubernetes scheduler and do not respect the unschedulable attribute on a node. This assumes that daemons belong on the machine even if it is being drained of applications while it prepares for a reboot.

Is this really a bug or is it the intended behavior?

@nikopen
Contributor

nikopen commented Jun 20, 2019

@paivagustavo I think the implementation is not the problem (it's likely trivial); the problem lies mostly in the decision-making by the responsible SIG. It could be viewed as either a bug or intended behavior, but there should be a way to manually force a full drain.

@kfox1111
Author

Yeah. Generally, you want the network stack (kube-proxy, flannel, etc.) to continue to function while draining the node of pods, so you don't want those to be deleted too early. But you do want a way to eventually clean all pods off completely. Typically I've seen upgrading docker/containerd/etc. lose track of pods because of this issue: kubelet still thinks they are on the node, but the engine no longer finds them. I hit it again today, so we do need a solution.

Maybe an additional kind of drain that taints the node as going completely offline and deletes all remaining pods? That would prevent the DaemonSets from relaunching pods on it.
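A rough sketch of how that idea could look from the command line. Everything here is hypothetical (the `full-drain` taint key is made up); it relies on the fact that the DaemonSet controller only auto-tolerates the built-in `node.kubernetes.io/*` taints, so a custom `NoExecute` taint would evict DaemonSet pods too, unless a DaemonSet explicitly tolerates it. The function just prints the commands it would run rather than executing them:

```shell
# Sketch only: a "full drain" as an ordinary drain followed by a custom
# NoExecute taint that evicts the remaining DaemonSet pods as well.
# Prints the commands instead of executing them.
full_drain() {
  local node="$1"
  echo "kubectl drain ${node} --ignore-daemonsets --delete-emptydir-data"
  echo "kubectl taint nodes ${node} full-drain=true:NoExecute"
}

full_drain node-1
```

Undoing it would presumably be `kubectl taint nodes node-1 full-drain-` followed by `kubectl uncordon node-1`.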

@xmudrii
Member

xmudrii commented Aug 19, 2019

@kfox1111 Hello! I'm the bug triage lead for the 1.16 release cycle, and since this issue is tagged for 1.16 but has not been updated in a long time, I'd like to check its status. Code freeze starts on August 29th (about 1.5 weeks from now), which means a PR should be ready (and merged) by then.

Do we still target this issue to be fixed for 1.16?

@kfox1111
Author

I hope so, but I don't think anyone is actively working on it at the moment. So probably will miss 1.16.

@nikopen
Contributor

nikopen commented Aug 22, 2019

/milestone v1.17

@k8s-ci-robot k8s-ci-robot modified the milestones: v1.16, v1.17 Aug 22, 2019
@ttousai

ttousai commented Oct 31, 2019

Bug triage for 1.17 here with a gentle reminder that code freeze for this release is on Nov. 18. Is this issue still intended for 1.17?

@kfox1111
Author

This still periodically affects me. Need a fix.

@ttousai

ttousai commented Nov 4, 2019

Correction: Code freeze is on Thursday, November 14.

Sorry for the mix up.

@derekwaynecarr
Member

Are you able to address this by performing a power cycle? Typically, we do not recommend updating the host without rebooting it.

@kfox1111
Author

No. Upgrading the container engine can sometimes cause it to lose track of containers in an unsafe but persistent way (storage). The instructions say to remove 'all' containers from the system before upgrading the container runtime, but DaemonSets ignore this and are not actually drained (which is sometimes useful). There needs to be an easy way to completely remove all Kubernetes-managed containers from a host. For example, a flag on the host, or something similar, to say: kill all the rest and don't let them restart automatically.
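To make the gap concrete: today an admin has to find the leftover DaemonSet pods by hand. A minimal sketch (names hypothetical) that filters pods by their owner kind; it is shown here against canned output so the logic is visible, and on a real cluster the input would come from the `kubectl` call in the comment:

```shell
# Hypothetical helper: list DaemonSet-managed pods by their owner kind.
# On a live cluster the input would come from:
#   kubectl get pods -A --field-selector spec.nodeName=<node> --no-headers \
#     -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].kind'
list_ds_pods() {
  awk '$3 == "DaemonSet" { print $1 "/" $2 }'
}

# Canned sample of that output; only the two DaemonSet pods are printed.
list_ds_pods <<'EOF'
kube-system kube-proxy-abc12 DaemonSet
default web-6f9c4-xk2p1 ReplicaSet
kube-system flannel-ds-7q9zw DaemonSet
EOF
```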

@josiahbjorgaard
Contributor

/milestone v1.18

@k8s-ci-robot k8s-ci-robot modified the milestones: v1.17, v1.18 Nov 13, 2019
@wilmardo

We ran into this today; this issue is the result of a design decision in containerd:

KillMode handles when containerd is being shut down. By default, systemd will look in its named cgroup and kill every process that it knows about for the service. This is not what we want. As ops, we want to be able to upgrade containerd and allow existing containers to keep running without interruption. Setting KillMode to process ensures that systemd only kills the containerd daemon and not any child processes such as the shims and containers.

Source: https://github.com/containerd/containerd/blob/master/docs/ops.md#systemd
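For reference, the setting described above would look roughly like this in a systemd unit or drop-in (illustrative only; the exact path and unit contents vary by distro and containerd version):

```ini
# /etc/systemd/system/containerd.service.d/override.conf (illustrative)
[Service]
# systemd kills only the containerd daemon on stop/restart; the shims and
# their containers keep running. This is why, after an upgrade, a node can
# end up with shim processes that the new daemon no longer tracks.
KillMode=process
```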

So when containerd is upgraded, it abandons its child processes by design. All the ignored DaemonSet pods still running on the drained node are therefore abandoned, but Kubernetes will not pick them back up after uncordoning. This seems to be the issue @kfox1111 describes:

Typically I've seen upgrading docker/containerd/etc. lose track of pods because of this issue: kubelet still thinks they are on the node, but the engine no longer finds them

This is resolved by rebooting the node, as @derekwaynecarr suggests, since that kills the abandoned child processes.

If needed I can provide some examples. In short, this is reproducible as follows:

  • Run a cluster on containerd (1.2.6 for example)
  • Drain a node with --ignore-daemonsets
  • Upgrade containerd to 1.2.10 on this node
  • Run ps faux and see the abandoned child processes
  • Uncordon the node and see Kubernetes not being able to resolve the state
  • Restart the node
  • Watch all pods go to running again
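Step 4 above can be made concrete with a small filter that flags shim processes re-parented to PID 1 after the containerd restart. It is shown against canned `ps` output so it runs anywhere; on an affected node you would pipe in `ps -eo pid,ppid,comm` instead:

```shell
# Hypothetical check for step 4 of the repro: containerd-shim processes
# whose parent died get re-parented to PID 1 (i.e. PPID == 1).
orphaned_shims() {
  awk 'NR > 1 && $2 == 1 && $3 ~ /containerd-shim/ { print $1 }'
}

# Canned `ps -eo pid,ppid,comm` output: only PID 1401 is an orphaned shim.
orphaned_shims <<'EOF'
  PID  PPID COMMAND
 1401     1 containerd-shim
 1533  1400 containerd-shim
 1600     1 sshd
EOF
```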

@kfox1111
Author

Containerd has it for sure; other runtimes might as well. The need to drain a node completely of its workload is reasonable on its own, though, regardless of which runtime is involved.

I've also seen kubelet continuously complain in the logs about pods that disappeared after a containerd upgrade, so just rebooting doesn't entirely fix the issue.

@furkatgofurov7
Member

@kfox1111 hi, how can we move forward with this issue? It has been open for a long time; any ideas? Is it that no one is willing to take it up, or does it need confirmation from SIG Node?
Also, @nikopen hi, is there anything I can do to move this issue forward? Thanks!

@kfox1111
Author

kfox1111 commented May 9, 2022

Pretty sure it's not on anyone's radar anymore. Someone should bring it back up to sig-node, maybe?

@furkatgofurov7
Member

Pretty sure it's not on anyone's radar anymore. Someone should bring it back up to sig-node, maybe?

Got it. Sorry, I am not aware of the process; what exactly does one have to do to bring this to sig-node's attention?
/sig node

@kfox1111
Author

kfox1111 commented May 9, 2022

Got it. Sorry, I am not aware of the process; what exactly does one have to do to bring this to sig-node's attention? /sig node

Sorry. I don't really know.

Guessing it may need to be brought up at one of the meetings?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 7, 2022
@kfox1111
Author

kfox1111 commented Aug 8, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 8, 2022
@kfox1111
Author

kfox1111 commented Aug 8, 2022

still an issue. :/

@movikbence

still an issue. :/

+1

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 18, 2022
@kfox1111
Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 19, 2022
@SergeyKanzhelev
Member

/remove-kind bug
/kind feature

@k8s-ci-robot k8s-ci-robot removed the kind/bug Categorizes issue or PR as related to a bug. label Mar 1, 2023
@SergeyKanzhelev SergeyKanzhelev removed this from High Priority in SIG Node Bugs Mar 1, 2023
@SergeyKanzhelev
Member

/priority backlog

to reflect reality

@k8s-ci-robot k8s-ci-robot added the priority/backlog Higher priority than priority/awaiting-more-evidence. label Mar 1, 2023
@k8s-triage-robot

This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Deprioritize it with /priority important-longterm or /priority backlog
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels May 30, 2023
@SergeyKanzhelev
Member

/remove-priority important-soon

@k8s-ci-robot k8s-ci-robot removed the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jun 24, 2023
@atiratree
Member

This feature might be eventually supported with Declarative Node Maintenance: kubernetes/enhancements#4213

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 9, 2024