
Pods in Node are deleted even if the NoExecute taint that triggered deletion is removed #90794

Closed
pires opened this issue May 6, 2020 · 21 comments · Fixed by #93722
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@pires
Contributor

pires commented May 6, 2020

What happened:

A Pod was scheduled for deletion because a taint w/ effect NoExecute was observed on the Node the Pod was assigned to. After that taint is removed from the Node, the scheduled Pod deletion isn't canceled.

After scheduling a pod deletion, NoExecuteTaintManager only cancels it if all NoExecute taints are removed from the node, including any taints the user intentionally relies on for their own use cases (example below).

What you expected to happen:

After the taint that triggered scheduling pod deletion is removed, NoExecuteTaintManager cancels the scheduled pod deletion.

How to reproduce it (as minimally and precisely as possible):

  • A node MyNode is tainted w/ OnlyForMyUsage:NoExecute, so any workloads already assigned to it which don't tolerate this taint are evicted.
  • A pod MyPod is created w/ tolerations: OnlyForMyUsage:NoExecute w/ unspecified tolerationSeconds and NotReady:NoExecute w/ tolerationSeconds: 300.
  • MyPod is assigned to MyNode.
  • At some point in time after the previous steps took place, MyNode is tainted as NotReady:NoExecute.
  • NoExecuteTaintManager gets an update event for MyNode and observes two NoExecute taints. It proceeds to calculate the minimum time (in seconds) that MyPod tolerates across the two taints, which at this point is 300 seconds, and marks MyPod for deletion in ~300 seconds - this happens in-memory, in a timed queue.
  • At some point in time after the previous steps took place, and before the 300-second trigger fires, the NotReady:NoExecute taint is removed from MyNode.
  • NoExecuteTaintManager gets an update event for MyNode and observes only one NoExecute taint, OnlyForMyUsage:NoExecute. It proceeds to calculate the minimum time (in seconds) that MyPod tolerates for this taint, which is infinity, and returns without canceling the previously scheduled deletion. It completely ignores the fact that the taint that triggered MyPod's deletion is no longer observed (see the sketch after this list).
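
To make the last three bullets concrete, here is a minimal, self-contained Go sketch of the minimum-toleration computation (the types and names are simplified stand-ins, not the actual taint manager code, and it assumes the pod tolerates every taint on the node): with both taints present the result is finite and an eviction timer is set; once only the forever-tolerated taint remains the result is negative ("tolerate forever") and the buggy path returns without touching the timer that was already set.

	package main

	import (
		"fmt"
		"time"
	)

	// Simplified stand-ins for the real taint/toleration types; the names below
	// are illustrative only, not the actual Kubernetes API.
	type taint struct{ key string }

	type toleration struct {
		key               string
		tolerationSeconds *int64 // nil means "tolerate forever"
	}

	// minTolerationTime mirrors the idea in NoExecuteTaintManager: the pod is
	// evicted after the smallest finite tolerationSeconds among the node's
	// NoExecute taints; a negative result means "tolerates forever". The sketch
	// assumes every taint on the node has a matching toleration.
	func minTolerationTime(taints []taint, tolerations []toleration) time.Duration {
		min := time.Duration(-1)
		for _, t := range taints {
			for _, tol := range tolerations {
				if tol.key != t.key || tol.tolerationSeconds == nil {
					continue // no match, or this taint is tolerated forever
				}
				if d := time.Duration(*tol.tolerationSeconds) * time.Second; min < 0 || d < min {
					min = d
				}
			}
		}
		return min
	}

	func main() {
		notReadySeconds := int64(300)
		tolerations := []toleration{
			{key: "OnlyForMyUsage"}, // unspecified tolerationSeconds: tolerate forever
			{key: "NotReady", tolerationSeconds: &notReadySeconds},
		}

		// Both taints present: 5m0s, so an in-memory eviction timer is set.
		fmt.Println(minTolerationTime([]taint{{"OnlyForMyUsage"}, {"NotReady"}}, tolerations))

		// NotReady removed: negative result ("tolerate forever"); the controller
		// returns early here and never cancels the timer scheduled above.
		fmt.Println(minTolerationTime([]taint{{"OnlyForMyUsage"}}, tolerations))
	}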

Anything else we need to know?:

  • Worked w/ @bobveznat to figure this out after all pods on a node got evicted for no apparent reason.

Environment:

  • Kubernetes version (use kubectl version): 1.17.3 (but, looking at the code, this seems to affect all of 1.17.x, 1.18.x, and master).
  • Cloud provider or hardware configuration: N/A
  • OS (e.g: cat /etc/os-release): N/A
  • Kernel (e.g. uname -a): 5.x
  • Install tools: N/A
  • Network plugin and version (if this is a network-related bug): N/A
  • Others: N/A

cc @gmarek @bowei @k82cn (owners of this code)

@pires pires added the kind/bug Categorizes issue or PR as related to a bug. label May 6, 2020
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label May 6, 2020
@pires
Contributor Author

pires commented May 6, 2020

There seems to be at least one other similar report in the wild.

@bobveznat
Contributor

You can sort of get a feel for what's going on here by looking at these two code snippets:

This is what happens in the fast path when the node does not have any NoExecute taints:

	if len(taints) == 0 {
		klog.V(4).Infof("All taints were removed from the Node %v. Cancelling all evictions...", node.Name)
		for i := range pods {
			tc.cancelWorkWithEvent(types.NamespacedName{Namespace: pods[i].Namespace, Name: pods[i].Name})
		}
		return
	}

If you have any NoExecute taints on the node, processPodOnNode gets called, and this routine is really, really eager to evict pods. So this code path is followed:

	if minTolerationTime < 0 {
		klog.V(4).Infof("New tolerations for %v tolerate forever. Scheduled deletion won't be cancelled if already scheduled.", podNamespacedName.String())
		return
	}

We are currently working around this issue by changing over our NoExecute taints to NoSchedule.
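
To spell out the workaround (a sketch using the core/v1 types, assuming k8s.io/api is on the module path; the taint key is just the one from the example above): NoExecute evicts already-running pods through the buggy path, while NoSchedule only blocks new placements and never involves NoExecuteTaintManager, so existing pods are left alone.

	package main

	import (
		"fmt"

		v1 "k8s.io/api/core/v1"
	)

	func main() {
		// NoExecute evicts running pods that don't tolerate the taint; this is
		// the effect that goes through the buggy cancellation logic.
		before := v1.Taint{Key: "OnlyForMyUsage", Effect: v1.TaintEffectNoExecute}

		// NoSchedule only keeps new pods off the node; already-running pods are
		// untouched, so NoExecuteTaintManager never schedules a deletion.
		after := v1.Taint{Key: "OnlyForMyUsage", Effect: v1.TaintEffectNoSchedule}

		fmt.Printf("before: %s:%s  after: %s:%s\n", before.Key, before.Effect, after.Key, after.Effect)
	}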

@neolit123
Member

/sig scheduling node

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 6, 2020
@liggitt
Member

liggitt commented May 6, 2020

/cc @karan

@ahg-g
Member

ahg-g commented May 7, 2020

/remove-sig scheduling

scheduler isn't involved in this case.

@k8s-ci-robot k8s-ci-robot removed the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label May 7, 2020
@pires
Contributor Author

pires commented May 7, 2020

Which SIG owns the controller-manager? Or is the node lifecycle controller (and its NoExecuteTaintManager) owned by sig-node alone?

@liggitt liggitt added the sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. label May 7, 2020
@liggitt
Member

liggitt commented May 7, 2020

I think node lifecycle is jointly owned by sig-node and sig-cloud-provider

@liggitt
Member

liggitt commented May 7, 2020

API machinery owns the mechanics of the controller manager (controller loop setup and management), but the individual controllers are SIG-specific

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 5, 2020
@bobveznat
Contributor

Looks like we lost some momentum here and I'd like to avoid this falling into the abyss of bugs that never get fixed. When this bug is triggered, the outcome is pretty terrible, and once it does happen it is very difficult to dig into and understand what actually happened. Did we ever get this routed to the right owner?

@liggitt liggitt added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 5, 2020
@liggitt
Member

liggitt commented Aug 5, 2020

Adding sig-node leads for triage/routing. This seems like a clearly reproducible bug with significant correctness issues.

/assign @derekwaynecarr @dchen1107

@derekwaynecarr
Member

@bobveznat @pires thanks for the detailed report and reproducer. To help accelerate this, do you have a form of your reproducer we could add to the e2e suite (even if it's failing initially)?

@liggitt
Member

liggitt commented Aug 5, 2020

do you have a form of your reproducer we could add in the e2e suite

custom taints are considered disruptive, so it couldn't run in the main suite

@pires
Contributor Author

pires commented Aug 5, 2020

It's been a while since I wrote tests for the e2e suite, but if you are available over Slack in case guidance is needed, I'm willing to put in the effort.

@pires
Contributor Author

pires commented Aug 5, 2020

custom taints are considered disruptive, so it couldn't run in the main suite

Do all suites need to pass for a fix to be merged? I'd assume that's the case; otherwise, how would a regression be prevented after the fix lands?

@derekwaynecarr
Member

derekwaynecarr commented Aug 5, 2020

@pires I was more inquiring whether there is a test that could automate the reproduction, so even just a linked commit may be helpful for someone who wants to pick this up. I am happy to help if you want to reach out over Slack later this week.

@liggitt
Member

liggitt commented Aug 5, 2020

Looks like a unit test could reproduce the issue; I opened a WIP fix in #93722.
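
The change appears to boil down to cancelling any already-scheduled eviction when the pod now tolerates all remaining NoExecute taints forever. A minimal, self-contained sketch of that behavior (the timed-queue type and names here are illustrative, not the controller's actual API):

	package main

	import (
		"fmt"
		"time"
	)

	// A tiny stand-in for the controller's timed work queue: at most one
	// pending eviction timer per pod, keyed by namespace/name.
	type timedQueue struct{ timers map[string]*time.Timer }

	func (q *timedQueue) schedule(pod string, after time.Duration, evict func()) {
		q.cancel(pod)
		q.timers[pod] = time.AfterFunc(after, evict)
	}

	func (q *timedQueue) cancel(pod string) {
		if t, ok := q.timers[pod]; ok {
			t.Stop()
			delete(q.timers, pod)
		}
	}

	func main() {
		q := &timedQueue{timers: map[string]*time.Timer{}}

		// NotReady:NoExecute appears: eviction scheduled (300s in reality,
		// shortened here so the example runs quickly).
		q.schedule("default/MyPod", 200*time.Millisecond, func() { fmt.Println("MyPod evicted") })

		// NotReady:NoExecute is removed and the remaining taints are tolerated
		// forever. Today the code returns without doing anything; the fix
		// cancels the pending eviction at this point instead.
		q.cancel("default/MyPod")

		time.Sleep(400 * time.Millisecond)
		fmt.Println("no eviction fired after the triggering taint was removed")
	}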

@pires
Contributor Author

pires commented Aug 5, 2020

@derekwaynecarr our code is not publicly accessible, I'm sorry. The PR linked by @liggitt seems to cover both the fix and a test that proves it.

@pires
Contributor Author

pires commented Aug 5, 2020

/remove-lifecycle frozen

@k8s-ci-robot k8s-ci-robot removed the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Aug 5, 2020
@liggitt liggitt added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Aug 5, 2020
@liggitt
Member

liggitt commented Aug 5, 2020

frozen keeps the bot from auto-closing

@pires
Contributor Author

pires commented Aug 6, 2020

Ah, sorry then.
