
Kubelet may reject static pods when the node is out of disk space, and will not retry them #35521

Closed
justinsb opened this issue Oct 25, 2016 · 22 comments
Labels
area/kubelet, area/reliability, kind/bug, lifecycle/frozen, priority/backlog, sig/node, triage/needs-information

Comments

@justinsb
Member

If a pod fails admission (admission controller, disk space, or network not ready as of #33347), it is not retried.

I think this is the cause of #35409.

Working up a PR now... I am thinking of keeping a list of pods that were rejected, and reattempting them every housekeeping interval.
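
A minimal sketch of the idea proposed here, assuming hypothetical names throughout (rejectedPodTracker, retryRejected, and a stand-in Pod type); the real kubelet types and admission hooks differ:

```go
package main

import (
	"fmt"
	"sync"
)

// Pod stands in for the real *v1.Pod type.
type Pod struct{ Name string }

// rejectedPodTracker remembers pods that failed admission so a later
// housekeeping pass can try to admit them again.
type rejectedPodTracker struct {
	mu   sync.Mutex
	pods map[string]*Pod
}

func (t *rejectedPodTracker) add(p *Pod) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.pods[p.Name] = p
}

// retryRejected re-runs the admission predicate for every remembered
// pod and starts (and forgets) the ones that are now admissible.
func (t *rejectedPodTracker) retryRejected(canAdmit func(*Pod) bool, start func(*Pod)) {
	t.mu.Lock()
	defer t.mu.Unlock()
	for name, p := range t.pods {
		if canAdmit(p) {
			start(p)
			delete(t.pods, name) // deleting during range is safe in Go
		}
	}
}

func main() {
	tracker := &rejectedPodTracker{pods: map[string]*Pod{}}
	tracker.add(&Pod{Name: "static-etcd"})

	// In the kubelet this would run from the housekeeping loop;
	// a single call stands in for one tick here.
	tracker.retryRejected(
		func(p *Pod) bool { return true }, // e.g. disk space was freed
		func(p *Pod) { fmt.Println("admitting", p.Name) },
	)
}
```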

@yujuhong
Contributor

@justinsb pod phases progress monotonically. Once a pod has been rejected and transitioned to the "Failed" state, it is not supposed to go back to pending or running. I don't think kubelet should retry them.

@justinsb
Member Author

But - for example - we check disk space in canAdmitPod. That is a temporary condition.

How do you propose to handle this instead?
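
For context, canAdmitPod is the admission predicate referred to above. The stand-in below is illustrative, not the real code; it only shows the mismatch being debated: the condition tested (disk space) is transient, while the rejection it produces is terminal.

```go
package main

import "fmt"

// Pod stands in for the real *v1.Pod type.
type Pod struct{ Name string }

// diskSpaceAvailable stands in for the cadvisor-backed disk check
// the kubelet of this era performed.
func diskSpaceAvailable() bool { return false }

// canAdmitPod is a simplified stand-in for the kubelet's admission
// predicate. The disk may be freed a minute later, but the rejection
// produced here is permanent: the pod is marked Failed and never
// re-admitted.
func canAdmitPod(p *Pod) (ok bool, reason string) {
	if !diskSpaceAvailable() {
		return false, "OutOfDisk"
	}
	return true, ""
}

func main() {
	ok, reason := canAdmitPod(&Pod{Name: "static-apiserver"})
	fmt.Println(ok, reason) // false OutOfDisk
}
```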

@yujuhong
Contributor

kubelet doesn't care whether the condition is temporary or not (for regular pods). It assumes the control plane (e.g., a ReplicaSet) can react and determine what to do.

@yujuhong
Contributor

/cc @kubernetes/sig-node

@justinsb
Member Author

justinsb commented Oct 25, 2016

Hmmm.. that is tricky because of static pod manifests. I put the network-readiness check in the same place as the disk space check, as I figured that was the closest analog.

Looking for a place to move it such that the pod can be admitted, but where we can prevent it actually starting:

  1. Any suggestion?
  2. Should I move the disk free-space check there also?

@justinsb
Member Author

How about in Kubelet::syncPod?

@yujuhong
Contributor

> Hmmm.. that is tricky because of static pod manifests. I put the network-readiness check in the same place as the disk space check, as I figured that was the closest analog.

@justinsb hmm... even for regular pods, this behavior seems undesirable. If kubelet restarts and the network plugin is not ready yet (due to podCIDR not being assigned), it would reject all pods assigned to it. That seems to be what caused the serial suite to fail.

> How about in Kubelet::syncPod?

It might work. That just means kubelet will not sync the pod if the network is not ready. I don't think the disk space check should be moved to the same place, because kubelet actually wants to reject those pods.
/cc @dchen1107

Can we revert your original PR first? Thanks.
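
A rough sketch of what the syncPod approach discussed above could look like, with all names (networkReady, errNetworkNotReady, the simplified Pod type) assumed for illustration: returning an error from the sync path keeps the pod pending and retryable instead of failing it at admission.

```go
package main

import (
	"errors"
	"fmt"
)

var errNetworkNotReady = errors.New("network is not ready: podCIDR not assigned yet")

type Pod struct {
	Name        string
	HostNetwork bool
}

// networkReady stands in for the real runtime network-status check.
func networkReady() bool { return false }

// syncPod sketches the pod-start-phase check: hostNetwork pods go
// through, everything else waits until the network is ready.
func syncPod(p *Pod) error {
	if !p.HostNetwork && !networkReady() {
		// Returning an error leaves the pod pending; the sync loop
		// calls syncPod again on its next iteration instead of
		// failing the pod terminally.
		return errNetworkNotReady
	}
	// ... set up the sandbox and start containers ...
	return nil
}

func main() {
	for _, p := range []*Pod{
		{Name: "kube-proxy", HostNetwork: true},
		{Name: "nginx"},
	} {
		fmt.Printf("%s: %v\n", p.Name, syncPod(p))
	}
}
```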

@justinsb
Member Author

> Can we revert your original PR first? Thanks.

I feel like I've seen bad behaviour with full disks, so it feels like this is actually exposing a pre-existing problem. I don't understand why we would permanently reject a static pod if the disk was full?

It feels like we should move both to syncPod... I have a WIP PR which I'm testing now - can you give me a few minutes to see if that just fixes things, and then if not we can revert?

@justinsb
Member Author

Actually, on second thoughts, if we're blocking the submit queue we can just revert. It sounds like this is a non-trivial issue, so it's worth finding the right solution.

@yujuhong
Contributor

> Actually, on second thoughts, if we're blocking the submit queue we can just revert. It sounds like this is a non-trivial issue, so it's worth finding the right solution.

I don't think it's blocking the submit queue. Since you no longer need to change the admission logic, there are parts of your original PR that are not needed anymore. I think reverting makes sense here, and it will make cherry-picking easier too.

@yujuhong changed the title from "kubelet does not retry pods that fail admission" to "Kubelet may reject static pods when the node is out of disk space, and will not retry them" Oct 25, 2016
@yujuhong
Contributor

> I feel like I've seen bad behaviour with full disks, so it feels like this is actually exposing a pre-existing problem. I don't understand why we would permanently reject a static pod if the disk was full?

This is a bug, and it should be fixed. I changed the title to reflect that (thanks). The immediate workaround, if anyone encounters this situation, is to restart kubelet so the pods go through the admission process again.

@justinsb I don't think your new PR needs to change the admission logic at all, so it shouldn't be affected by this bug.

@yujuhong added the sig/node label and removed the team/control-plane label Oct 25, 2016
@timstclair

Related PR: #35342

Conditions that the scheduler is not aware of, or that should be retried on the same node, should be "soft rejected": the pod should not be marked as failed; instead, it should be blocked from starting.
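
The "soft rejected" idea can be sketched as a three-way admission outcome. Everything below (admitAction, checkAdmission, the schedulerKnows flag) is hypothetical and only illustrates the distinction, not the actual change in #35342.

```go
package main

import "fmt"

type admitAction int

const (
	admit      admitAction = iota // start the pod
	hardReject                    // terminal: mark the pod Failed
	softReject                    // transient: block from starting, retry later
)

func (a admitAction) String() string {
	return [...]string{"admit", "hardReject", "softReject"}[a]
}

// checkAdmission sketches the proposed split. schedulerKnows says
// whether the scheduler accounts for the failing condition; if it
// does not (e.g. disk pressure, network readiness), failing the pod
// hard just strands it, so it should be soft-rejected instead.
func checkAdmission(conditionOK, schedulerKnows bool) admitAction {
	if conditionOK {
		return admit
	}
	if schedulerKnows {
		return hardReject // the control plane can re-place the pod elsewhere
	}
	return softReject // e.g. a static pod on a node with a full disk
}

func main() {
	fmt.Println(checkAdmission(false, false)) // softReject
}
```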

@dchen1107
Member

Related to: #22212

k8s-github-robot pushed a commit that referenced this issue Nov 6, 2016
Automatic merge from submit-queue

kubelet bootstrap: start hostNetwork pods before we have PodCIDR

Network readiness was checked in the pod admission phase, but pods that
fail admission are not retried.  Move the check to the pod start phase.

Issue #35409 
Issue #35521
@yujuhong added this to the next-candidate milestone Nov 8, 2016
@fejta-bot

Issues go stale after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Dec 19, 2017
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Jan 18, 2018
@yujuhong added the lifecycle/frozen label and removed the lifecycle/rotten label Jan 18, 2018
@wgahnagl
Contributor

/kind bug

@k8s-ci-robot added the kind/bug label Jun 24, 2021
@gjkim42
Member

gjkim42 commented Jun 25, 2021

/assign

@gjkim42
Member

gjkim42 commented Jun 25, 2021

/unassign

I don't want to make another exception for static pods...

/priority backlog

@k8s-ci-robot added the priority/backlog label Jun 25, 2021
@ehashman added this to Triage in SIG Node Bugs Jul 9, 2021
@ehashman
Member

ehashman commented Aug 5, 2021

/milestone clear

@k8s-ci-robot removed this from the next-candidate milestone Aug 5, 2021
@ehashman moved this from Triage to Triaged in SIG Node Bugs Aug 11, 2021
@ehashman moved this from Triaged to Triage in SIG Node Bugs Aug 11, 2021
@SergeyKanzhelev moved this from Triage to Needs Information in SIG Node Bugs Sep 1, 2021
@SergeyKanzhelev
Member

/triage needs-information

Needs a repro on the latest versions, or confirmation from somebody experiencing it.

@k8s-ci-robot added the triage/needs-information label Sep 1, 2021
@matthyx
Contributor

matthyx commented Aug 29, 2022

Please reopen if you experience this issue.
/close

@k8s-ci-robot
Contributor

@matthyx: Closing this issue.

In response to this:

> Please reopen if you experience this issue.
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

SIG Node Bugs automation moved this from Needs Information to Done Aug 29, 2022