Pods in Node are deleted even if the NoExecute taint that triggered deletion is removed #90794
Comments
There seems to be at least one other similar report in the wild.
You can sort of get a feel for what's going on here by looking at two code snippets. The first is the fast path taken when the node does not have any `NoExecute` taints. If the node has any `NoExecute` taints, `processPodOnNode` gets called instead, and this routine is really, really eager to evict pods; that is the code path followed here.
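For readers without the snippets handy, here is a condensed paraphrase of both paths, abridged from the taint manager in `pkg/controller/nodelifecycle/scheduler/taint_manager.go` (the helpers `getNoExecuteTaints`, `podsOnNode`, and `namespacedName` are stand-ins for the real lookups, not the upstream API):

```go
// Assumes the usual imports (k8s.io/api/core/v1 as v1,
// k8s.io/apimachinery/pkg/types, time) and the surrounding
// NoExecuteTaintManager type; details are simplified.
func (tc *NoExecuteTaintManager) handleNodeUpdate(node *v1.Node) {
	taints := getNoExecuteTaints(node.Spec.Taints)
	pods := podsOnNode(node.Name)

	// Fast path: the node has no NoExecute taints at all, so every
	// pending eviction for its pods is canceled.
	if len(taints) == 0 {
		for _, pod := range pods {
			tc.cancelWorkWithEvent(namespacedName(pod))
		}
		return
	}

	// Slow path: any NoExecute taint at all routes every pod through
	// processPodOnNode.
	now := time.Now()
	for _, pod := range pods {
		tc.processPodOnNode(namespacedName(pod), node.Name, pod.Spec.Tolerations, taints, now)
	}
}

func (tc *NoExecuteTaintManager) processPodOnNode(podName types.NamespacedName,
	nodeName string, tolerations []v1.Toleration, taints []v1.Taint, now time.Time) {
	allTolerated, usedTolerations := getMatchingTolerations(taints, tolerations)
	if !allTolerated {
		// Some taint isn't tolerated at all: evict immediately.
		tc.cancelWorkWithEvent(podName)
		tc.taintEvictionQueue.AddWork(NewWorkArgs(podName.Name, podName.Namespace), now, now)
		return
	}

	minTolerationTime := getMinTolerationTime(usedTolerations)
	if minTolerationTime < 0 {
		// The pod tolerates all remaining taints forever, but this
		// returns WITHOUT canceling an eviction that an earlier node
		// update may already have queued. That is the bug reported here.
		return
	}

	// Otherwise (re)schedule the eviction for now + minTolerationTime;
	// the pending work lives only in an in-memory timed queue.
	tc.taintEvictionQueue.AddWork(NewWorkArgs(podName.Name, podName.Namespace),
		now, now.Add(minTolerationTime))
}
```

Note the asymmetry: only the `len(taints) == 0` fast path cancels pending evictions; the tolerate-forever branch in `processPodOnNode` does not.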
We are currently working around this issue by changing over our NoExecute taints to NoSchedule.
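For illustration, the workaround amounts to swapping the taint effect; a minimal sketch with the `k8s.io/api/core/v1` types (the key is just the example key from the report below):

```go
import v1 "k8s.io/api/core/v1"

// NoExecute evicts already-running pods once their tolerations lapse;
// that eviction path is where the bug lives.
var evicting = v1.Taint{Key: "OnlyForMyUsage", Effect: v1.TaintEffectNoExecute}

// NoSchedule only blocks new placements and never evicts, so the
// NoExecuteTaintManager never schedules the pod for deletion.
var workaround = v1.Taint{Key: "OnlyForMyUsage", Effect: v1.TaintEffectNoSchedule}
```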
/sig scheduling node
/cc @karan
/remove-sig scheduling
scheduler isn't involved in this case.
Which SIG owns the controller-manager, or is the node lifecycle manager (and its NoExecuteTaintManager) owned by sig-node alone?
I think node lifecycle is jointly owned by sig-node and sig-cloud-provider.
API machinery owns the mechanics of the controller manager (controller loop setup and management), but the individual controllers are SIG-specific.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Looks like we lost some momentum here, and I'd like to avoid this falling into the abyss of bugs that never get fixed. When this bug is triggered the outcome is pretty terrible, and once it does happen it is very difficult to dig into and understand what actually happened. Did we ever get this routed to the right owner?
Adding sig-node leads for triage/routing. This seems like a clearly reproducible bug with significant correctness issues.
/assign @derekwaynecarr @dchen1107
@bobveznat @pires thanks for the detailed report and reproducer. To help accelerate this, do you have a form of your reproducer we could add to the e2e suite (even if it's failing initially)?
Custom taints are considered disruptive, so it couldn't run in the main suite.
It's been a while since I wrote tests for the e2e suite, but I'm willing to put in the effort if you are available over Slack in case guidance is needed.
Do all suites need to pass for a fix to be merged? I'd assume so; otherwise, how would a regression be prevented after the fix lands?
@pires I was asking whether there is a test that could automate the reproduction, so even just a linked commit may be helpful for someone who wants to pick this up. I am happy to help if you want to reach out over Slack later this week.
Looks like a unit test could reproduce the issue; opened a WIP fix in #93722.
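Without presuming the exact contents of #93722, the shape of fix implied by the analysis above would be to cancel any queued eviction in the tolerate-forever branch of `processPodOnNode` (a sketch, continuing the paraphrase from earlier in the thread):

```go
	minTolerationTime := getMinTolerationTime(usedTolerations)
	if minTolerationTime < 0 {
		// The pod now tolerates every remaining NoExecute taint
		// forever, so an eviction queued by an earlier node update
		// must be canceled rather than left in the timed queue.
		tc.cancelWorkWithEvent(podName)
		return
	}
```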
@derekwaynecarr our code is not publicly accessible, I'm sorry. The PR linked by @liggitt seems to cover both the fix and the test that proves it.
/remove-lifecycle frozen
frozen keeps the bot from auto-closing.
Ah, sorry then.
What happened:
A Pod was scheduled for deletion because a taint w/ effect `NoExecute` was observed on the Node the Pod was assigned to, and after the taint is removed from the Node, the scheduled Pod deletion isn't canceled.

After scheduling pod deletion, the `NoExecuteTaintManager` only cancels pod deletion if all `NoExecute` taints are removed from the node, including any taints that the user relies on for their use-cases (example below).

What you expected to happen:
After the taint that triggered scheduling pod deletion is removed, `NoExecuteTaintManager` cancels the scheduled pod deletion.

How to reproduce it (as minimally and precisely as possible):
1. `MyNode` is tainted w/ `OnlyForMyUsage:NoExecute`, so any workloads already assigned to it that don't tolerate this taint are evicted.
2. `MyPod` is created w/ tolerations: `OnlyForMyUsage:NoExecute` w/ unspecified `tolerationSeconds`, and `NotReady:NoExecute` w/ `tolerationSeconds: 300`.
3. `MyPod` is assigned to `MyNode`.
4. `MyNode` is tainted as `NotReady:NoExecute`.
5. `NoExecuteTaintManager` gets an update event for `MyNode` and observes two `NoExecute` taints. It proceeds to calculate the minimum time (in seconds) that `MyPod` tolerates across the two taints, which at this point is 300 seconds, and marks `MyPod` for deletion in ~300 seconds. This happens in-memory, in a timed queue.
6. The `NotReady:NoExecute` taint is removed from `MyNode`.
7. `NoExecuteTaintManager` gets an update event for `MyNode` and observes only one `NoExecute` taint, `OnlyForMyUsage:NoExecute`. It proceeds to calculate the minimum time (in seconds) that `MyPod` tolerates for this taint, which is infinity, and returns without canceling the previous deletion. It completely ignores the fact that the taint that triggered `MyPod`'s deletion is no longer observed.

Anything else we need to know?:
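For concreteness, a sketch of the tolerations from step 2 using the `k8s.io/api/core/v1` types (`OnlyForMyUsage` is illustrative, and `node.kubernetes.io/not-ready` is assumed here as the key behind the `NotReady:NoExecute` taint):

```go
var notReadySeconds int64 = 300

var myPodTolerations = []v1.Toleration{
	{
		// Tolerated forever: TolerationSeconds is nil, so the taint
		// manager treats the bound as unbounded (infinity).
		Key:      "OnlyForMyUsage",
		Operator: v1.TolerationOpExists,
		Effect:   v1.TaintEffectNoExecute,
	},
	{
		// Tolerated for 300s; this is the bound that step 5 turns
		// into the ~300s deletion timer.
		Key:               "node.kubernetes.io/not-ready",
		Operator:          v1.TolerationOpExists,
		Effect:            v1.TaintEffectNoExecute,
		TolerationSeconds: &notReadySeconds,
	},
}
```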
Environment:
- Kubernetes version (use `kubectl version`): 1.17.3 (but looking at the code, this seems to affect all of 1.17.x, 1.18.x, and master)
- OS (e.g: `cat /etc/os-release`): N/A
- Kernel (e.g. `uname -a`): 5.x

cc @gmarek @bowei @k82cn (owners of this code)