
Pods in Node are deleted even if the NoExecute taint that triggered deletion is removed #90794

Closed
pires opened this issue May 6, 2020 · 21 comments · Fixed by #93722
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@pires
Contributor

pires commented May 6, 2020

What happened:

A Pod was scheduled for deletion because a taint w/ effect NoExecute was observed on the Node the Pod was assigned to. After that taint is removed from the Node, the scheduled Pod deletion isn't canceled.

After scheduling a pod deletion, NoExecuteTaintManager only cancels it if all NoExecute taints are removed from the node, including any taints the user intentionally relies on for their own use cases (example below).

What you expected to happen:

After the taint that triggered scheduling pod deletion is removed, NoExecuteTaintManager cancels the scheduled pod deletion.

How to reproduce it (as minimally and precisely as possible):

  • A node MyNode is tainted w/ OnlyForMyUsage:NoExecute, so any workloads already assigned to it which don't tolerate this taint are evicted.
  • A pod MyPod is created w/ tolerations: OnlyForMyUsage:NoExecute w/ unspecified tolerationSeconds and NotReady:NoExecute w/ tolerationSeconds: 300.
  • MyPod is assigned to MyNode.
  • At some point in time after the previous steps took place, MyNode is tainted as NotReady:NoExecute.
  • NoExecuteTaintManager gets an update event for MyNode and observes two NoExecute taints. It proceeds to calculate the minimum time (in seconds) that MyPod tolerates across the two taints, which at this point is 300 seconds, and marks MyPod for deletion in ~300 seconds - this happens in-memory, in a timed queue.
  • At some point in time after the previous steps took place, and before the 300-second trigger fires, the NotReady:NoExecute taint is removed from MyNode.
  • NoExecuteTaintManager gets an update event for MyNode and observes only one NoExecute taint, OnlyForMyUsage:NoExecute. It proceeds to calculate the minimum time (in seconds) that MyPod tolerates for this taint, which is infinity, and returns without canceling the previously scheduled deletion. It completely ignores the fact that the taint that triggered MyPod's deletion is no longer observed (see the sketch after this list).
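
To make the last three bullets concrete, here is a minimal, self-contained Go sketch of the minimum-toleration computation (the types and names are simplified stand-ins, not the actual taint manager code, and it assumes the pod tolerates every taint on the node): with both taints present the result is finite and an eviction timer is set; once only the forever-tolerated taint remains the result is negative ("tolerate forever") and the buggy path returns without touching the timer that was already set.

	package main

	import (
		"fmt"
		"time"
	)

	// Simplified stand-ins for the real taint/toleration types; the names below
	// are illustrative only, not the actual Kubernetes API.
	type taint struct{ key string }

	type toleration struct {
		key               string
		tolerationSeconds *int64 // nil means "tolerate forever"
	}

	// minTolerationTime mirrors the idea in NoExecuteTaintManager: the pod is
	// evicted after the smallest finite tolerationSeconds among the node's
	// NoExecute taints; a negative result means "tolerates forever". The sketch
	// assumes every taint on the node has a matching toleration.
	func minTolerationTime(taints []taint, tolerations []toleration) time.Duration {
		min := time.Duration(-1)
		for _, t := range taints {
			for _, tol := range tolerations {
				if tol.key != t.key || tol.tolerationSeconds == nil {
					continue // no match, or this taint is tolerated forever
				}
				if d := time.Duration(*tol.tolerationSeconds) * time.Second; min < 0 || d < min {
					min = d
				}
			}
		}
		return min
	}

	func main() {
		notReadySeconds := int64(300)
		tolerations := []toleration{
			{key: "OnlyForMyUsage"}, // unspecified tolerationSeconds: tolerate forever
			{key: "NotReady", tolerationSeconds: &notReadySeconds},
		}

		// Both taints present: 5m0s, so an in-memory eviction timer is set.
		fmt.Println(minTolerationTime([]taint{{"OnlyForMyUsage"}, {"NotReady"}}, tolerations))

		// NotReady removed: negative result ("tolerate forever"); the controller
		// returns early here and never cancels the timer scheduled above.
		fmt.Println(minTolerationTime([]taint{{"OnlyForMyUsage"}}, tolerations))
	}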

Anything else we need to know?:

  • Worked w/ @bobveznat to figure this out after all pods on a node got evicted for no apparent reason.

Environment:

  • Kubernetes version (use kubectl version): 1.17.3 (but, looking at the code, this seems to affect all of 1.17.x, 1.18.x, and master).
  • Cloud provider or hardware configuration: N/A
  • OS (e.g: cat /etc/os-release): N/A
  • Kernel (e.g. uname -a): 5.x
  • Install tools: N/A
  • Network plugin and version (if this is a network-related bug): N/A
  • Others: N/A

cc @gmarek @bowei @k82cn (owners of this code)

@pires pires added the kind/bug Categorizes issue or PR as related to a bug. label May 6, 2020
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label May 6, 2020
@pires
Contributor Author

pires commented May 6, 2020

There seems to be at least one other similar report in the wild.

@bobveznat
Contributor

You can sort of get a feel for what's going on here by looking at these two code snippets:

This is what happens in the fast path when the node does not have any NoExecute taints:

	if len(taints) == 0 {
		klog.V(4).Infof("All taints were removed from the Node %v. Cancelling all evictions...", node.Name)
		for i := range pods {
			tc.cancelWorkWithEvent(types.NamespacedName{Namespace: pods[i].Namespace, Name: pods[i].Name})
		}
		return
	}

If you have any NoExecute taints on the node, processPodOnNode gets called, and this routine is really, really eager to evict pods. So this code path is followed:

	if minTolerationTime < 0 {
		klog.V(4).Infof("New tolerations for %v tolerate forever. Scheduled deletion won't be cancelled if already scheduled.", podNamespacedName.String())
		return
	}

We are currently working around this issue by changing over our NoExecute taints to NoSchedule.
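
To spell out the workaround (a sketch using the core/v1 types, assuming k8s.io/api is on the module path; the taint key is just the one from the example above): NoExecute evicts already-running pods through the buggy path, while NoSchedule only blocks new placements and never involves NoExecuteTaintManager, so existing pods are left alone.

	package main

	import (
		"fmt"

		v1 "k8s.io/api/core/v1"
	)

	func main() {
		// NoExecute evicts running pods that don't tolerate the taint; this is
		// the effect that goes through the buggy cancellation logic.
		before := v1.Taint{Key: "OnlyForMyUsage", Effect: v1.TaintEffectNoExecute}

		// NoSchedule only keeps new pods off the node; already-running pods are
		// untouched, so NoExecuteTaintManager never schedules a deletion.
		after := v1.Taint{Key: "OnlyForMyUsage", Effect: v1.TaintEffectNoSchedule}

		fmt.Printf("before: %s:%s  after: %s:%s\n", before.Key, before.Effect, after.Key, after.Effect)
	}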

@neolit123
Member

/sig scheduling node

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 6, 2020
@liggitt
Member

liggitt commented May 6, 2020

/cc @karan

@ahg-g
Member

ahg-g commented May 7, 2020

/remove-sig scheduling

scheduler isn't involved in this case.

@k8s-ci-robot k8s-ci-robot removed the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label May 7, 2020
@pires
Contributor Author

pires commented May 7, 2020

Which SIG owns the controller-manager? Or is the node lifecycle controller (and its NoExecuteTaintManager) owned by sig-node alone?

@liggitt liggitt added the sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. label May 7, 2020
@liggitt
Member

liggitt commented May 7, 2020

I think node lifecycle is jointly owned by sig-node and sig-cloud-provider

@liggitt
Member

liggitt commented May 7, 2020

API machinery owns the mechanics of the controller manager (controller loop setup and management), but the individual controllers are SIG-specific

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 5, 2020
@bobveznat
Contributor

Looks like we lost some momentum here and I'd like to avoid this falling into the abyss of bugs that never get fixed. When this bug is triggered, the outcome is pretty terrible, and once it does happen it is very difficult to dig into and understand what actually happened. Did we ever get this routed to the right owner?

@liggitt liggitt added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 5, 2020
@liggitt
Member

liggitt commented Aug 5, 2020

Adding sig-node leads for triage/routing. This seems like a clearly reproducible bug with significant correctness issues.

/assign @derekwaynecarr @dchen1107

@derekwaynecarr
Member

@bobveznat @pires thanks for the detailed report and reproducer. To help accelerate this, do you have a form of your reproducer we could add to the e2e suite (even if it's failing initially)?

@liggitt
Member

liggitt commented Aug 5, 2020

do you have a form of your reproducer we could add in the e2e suite

custom taints are considered disruptive, so it couldn't run in the main suite

@pires
Contributor Author

pires commented Aug 5, 2020

It's been a while since I wrote tests for the e2e suite, but if you are available over Slack in case guidance is needed, I'm willing to put in the effort.

@pires
Contributor Author

pires commented Aug 5, 2020

custom taints are considered disruptive, so it couldn't run in the main suite

Do all suites need to pass for a fix to be merged? I'd assume that's the case; otherwise, how would a regression be prevented after the fix lands?

@derekwaynecarr
Member

derekwaynecarr commented Aug 5, 2020

@pires I was more inquiring whether there is a test that could automate the reproduction, so even just a linked commit may be helpful for someone who wants to pick this up. I am happy to help if you want to reach out over Slack later this week.

@liggitt
Member

liggitt commented Aug 5, 2020

Looks like a unit test could reproduce the issue; I opened a WIP fix in #93722.
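
The change appears to boil down to cancelling any already-scheduled eviction when the pod now tolerates all remaining NoExecute taints forever. A minimal, self-contained sketch of that behavior (the timed-queue type and names here are illustrative, not the controller's actual API):

	package main

	import (
		"fmt"
		"time"
	)

	// A tiny stand-in for the controller's timed work queue: at most one
	// pending eviction timer per pod, keyed by namespace/name.
	type timedQueue struct{ timers map[string]*time.Timer }

	func (q *timedQueue) schedule(pod string, after time.Duration, evict func()) {
		q.cancel(pod)
		q.timers[pod] = time.AfterFunc(after, evict)
	}

	func (q *timedQueue) cancel(pod string) {
		if t, ok := q.timers[pod]; ok {
			t.Stop()
			delete(q.timers, pod)
		}
	}

	func main() {
		q := &timedQueue{timers: map[string]*time.Timer{}}

		// NotReady:NoExecute appears: eviction scheduled (300s in reality,
		// shortened here so the example runs quickly).
		q.schedule("default/MyPod", 200*time.Millisecond, func() { fmt.Println("MyPod evicted") })

		// NotReady:NoExecute is removed and the remaining taints are tolerated
		// forever. Today the code returns without doing anything; the fix
		// cancels the pending eviction at this point instead.
		q.cancel("default/MyPod")

		time.Sleep(400 * time.Millisecond)
		fmt.Println("no eviction fired after the triggering taint was removed")
	}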

@pires
Contributor Author

pires commented Aug 5, 2020

@derekwaynecarr our code is not publicly accessible, I'm sorry. The PR linked by @liggitt seems to cover both the fix and a test that proves it.

@pires
Contributor Author

pires commented Aug 5, 2020

/remove-lifecycle frozen

@k8s-ci-robot k8s-ci-robot removed the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Aug 5, 2020
@liggitt liggitt added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Aug 5, 2020
@liggitt
Member

liggitt commented Aug 5, 2020

frozen keeps the bot from auto-closing

@pires
Contributor Author

pires commented Aug 6, 2020

Ah, sorry then.
