Add NotReady taint to new nodes during admission #73097

Merged
merged 3 commits into kubernetes:master from bsalamat:fix_taint_nodes on Jan 25, 2019

Conversation

@bsalamat
Contributor

bsalamat commented Jan 19, 2019

What type of PR is this?
/kind bug

What this PR does / why we need it:
Fixes a race condition between the node controller applying taints to new nodes and the scheduler using those new nodes for scheduling pods, by adding the NotReady taint to new nodes at admission time.

Which issue(s) this PR fixes:

Fixes #72129

Special notes for your reviewer:
My understanding is that we do not need to add the taint in node updates, mainly because node condition updates may arrive at any time and can race with other K8s operations. For example, pods may be scheduled on nodes that may become unavailable shortly afterwards under normal conditions, and K8s handles those scenarios well. The bigger problem that we are trying to fix is pods being placed on newly added nodes.

UPDATE: Given that tainting a node at update reduces wasted work and provides a better experience with respect to scheduling pods on more reliable nodes, I added the logic to taint nodes at updates as well.

Does this PR introduce a user-facing change?:

A new `TaintNodesByCondition` admission plugin taints newly created Node objects as "not ready", to fix a race condition that could cause pods to be scheduled on new nodes before their taints were updated to accurately reflect their reported conditions. This admission plugin is enabled by default if the `TaintNodesByCondition` feature is enabled.

/sig scheduling
/sig node
/priority critical-urgent
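
To make the mechanism described above concrete, here is a minimal, self-contained Go sketch of the admission-time tainting logic. It uses simplified stand-in types rather than the real Kubernetes API and admission interfaces; the `admitNodeCreate` name, the boolean feature-flag parameter, and the `NoSchedule` effect are illustrative assumptions, not code from this PR.

```go
package main

import "fmt"

// Simplified stand-ins for the Kubernetes API types; the real plugin works
// with the internal api.Node/api.Taint types and admission.Attributes.
type Taint struct {
	Key    string
	Effect string
}

type Node struct {
	Name   string
	Taints []Taint
}

const taintNodeNotReady = "node.kubernetes.io/not-ready"

// admitNodeCreate sketches what happens when a Node object is created:
// if the TaintNodesByCondition feature is enabled and the not-ready taint is
// not already present, add it so the scheduler never sees an untainted,
// not-yet-ready node.
func admitNodeCreate(node *Node, taintNodesByCondition bool) {
	if !taintNodesByCondition {
		return
	}
	for _, t := range node.Taints {
		if t.Key == taintNodeNotReady {
			return // taint already present; do not add a duplicate
		}
	}
	node.Taints = append(node.Taints, Taint{Key: taintNodeNotReady, Effect: "NoSchedule"})
}

func main() {
	n := &Node{Name: "node-1"}
	admitNodeCreate(n, true)
	fmt.Printf("taints after admission: %+v\n", n.Taints)
}
```

The real plugin performs an equivalent check inside the apiserver's admission chain on node Create requests, gated on the `TaintNodesByCondition` feature, as shown in the review diffs later in this conversation.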

@bsalamat

Contributor Author

bsalamat commented Jan 19, 2019

@ravisantoshgudimetla

Contributor

ravisantoshgudimetla commented Jan 19, 2019

For example, pods may be scheduled on nodes that may become unavailable shortly afterwards under normal conditions, and K8s handles those scenarios well.

Yeah, that is right. Eventually the pods may get evicted from nodes that become unavailable over time, and pods do get scheduled onto those kinds of nodes; more on that scenario can be found here: #72548 (comment)

@@ -350,7 +368,15 @@ func (c *nodePlugin) admitNode(nodeName string, a admission.Attributes) error {
	if len(requestedName) == 0 {
		requestedName = node.Name
	}
	// Taint node with NotReady taint at creation if TaintNodesByCondition is
	// enabled. This is needed to make sure that nodes are added to the cluster
	// with the taint. Otherwise, a new node may receive the taint with some delay

@ravisantoshgudimetla

ravisantoshgudimetla Jan 19, 2019

Contributor

nit: s/the taint/the NotReady taint/.

The comment could be further expanded to something like: the new node may receive the taint from the node controller with some delay...

		expectedTaints []api.Taint
	}{
		{
			name: "notReady taint is added on creation",

@ravisantoshgudimetla

ravisantoshgudimetla Jan 19, 2019

Contributor

nit: Probably add a test to ensure we don't add this taint if the taint already exists.
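
A sketch of the kind of test this nit suggests; it is written against the simplified stand-ins from the earlier sketch (same package assumed), not against the plugin's real `admitNode` method or its test table.

```go
package main

import "testing"

// TestNotReadyTaintNotDuplicated covers the case the nit above asks for:
// admitting a node that already carries the not-ready taint must not add a
// second copy. It reuses the Node/Taint stand-ins, the taintNodeNotReady
// constant, and admitNodeCreate from the sketch earlier in this document.
func TestNotReadyTaintNotDuplicated(t *testing.T) {
	node := &Node{
		Name:   "node-1",
		Taints: []Taint{{Key: taintNodeNotReady, Effect: "NoSchedule"}},
	}

	admitNodeCreate(node, true)

	if got := len(node.Taints); got != 1 {
		t.Fatalf("expected exactly one not-ready taint, got %d: %+v", got, node.Taints)
	}
}
```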

@liggitt

Member

liggitt commented Jan 19, 2019

I think we'd want this in its own admission plugin that we could enable by default here:

defaultOnPlugins := sets.NewString(

@@ -44,7 +44,8 @@ import (
 )
 
 const (
-	PluginName = "NodeRestriction"
+	PluginName        = "NodeRestriction"
+	TaintNodeNotReady = "node.kubernetes.io/not-ready"

@liggitt

liggitt Jan 19, 2019

Member

Isn't there an existing key for the notready taint the node controller manages?

@bsalamat

bsalamat Jan 22, 2019

Author Contributor

There is one in the scheduler API, but I didn't want to create a dependency on that API only for this taint.

@resouer

Member

resouer commented Jan 21, 2019

@answer1991 Please take a look; this patch is critical and aligns with your original proposal.

	// with the taint. Otherwise, a new node may receive the taint with some delay
	// causing pods to be scheduled on a not-ready node.
	if c.features.Enabled(features.TaintNodesByCondition) {
		addNotReadyTaint(node)

@resouer

resouer Jan 21, 2019

Member

And we may also need to point out who is expected to un-taint the node in the future (NodeLifecycleController etc).

@bsalamat

Contributor Author

bsalamat commented Jan 22, 2019

Thanks @liggitt, @ravisantoshgudimetla, @resouer for your comments and reviews. I addressed all the comments. PTAL.

@bsalamat bsalamat force-pushed the bsalamat:fix_taint_nodes branch 3 times, most recently from 77107aa to f2b0959 Jan 22, 2019

@resouer

Member

resouer commented Jan 23, 2019

Seems there are still some golint errors to fix. :-)

@bsalamat bsalamat force-pushed the bsalamat:fix_taint_nodes branch from f2b0959 to b91be7d Jan 23, 2019

@bsalamat

Contributor Author

bsalamat commented Jan 23, 2019

The bigger problem is that the NotReady taint is removed in a node update, causing TestNodeAuthorizer to fail. I added an experimental commit to let a change to the NotReady taint be accepted, but I don't feel good about this change.
@liggitt what would be the recommended way of addressing this issue?

@resouer

Member

resouer commented Jan 24, 2019

that's very surprising to me... doesn't that seem like it would not test actual behavior very accurately?

@liggitt AFAIK the goal of node-e2e is mainly to focus on the Node API and kubelet's behavior; for example, it's normally how we verify CRI shims. We can start the node lifecycle controller there, or just un-taint the node as a quick fix?

That said, this PR may deserve an "Urgent Upgrade Notes" entry in the release note, as the assumption about how a node is operated has changed.

@ravisantoshgudimetla

Contributor

ravisantoshgudimetla commented Jan 24, 2019

@liggitt AFAIK the goal of node-e2e is mainly to focus on the Node API and kubelet's behavior; for example, it's normally how we used to verify CRI shims. We can start the node lifecycle controller there, or just un-taint the node as a quick fix?

@resouer - Yeah, I agree, perhaps starting node lifecycle controller is good in this case.

@liggitt

Member

liggitt commented Jan 24, 2019

@resouer - Yeah, I agree, perhaps starting node lifecycle controller is good in this case.

I'd expect the node-e2e to run in a relatively realistic environment w.r.t. the controllers that interact with nodes and pods on nodes. That seems useful for flushing out unexpected interactions.

@bsalamat

Contributor Author

bsalamat commented Jan 24, 2019

@bsalamat Aha, I believe node lifecycle controller is not started in node-e2e tests, which only includes kubelet, kube-apiserver, and etcd.

@resouer That's correct. @yujuhong verified that the node lifecycle controller is not run in node e2e tests. Those tests are special e2e tests; only the API server and kubelet are run.

@resouer - Yeah, I agree, perhaps starting node lifecycle controller is good in this case.

@ravisantoshgudimetla I don't think that's a good idea as those tests don't run any other controllers and it could cause other issues in the tests. A different approach is needed.

I'd expect the node-e2e to run in a relatively realistic environment w.r.t. the controllers that interact with nodes and pods on nodes. That seems useful for flushing out unexpected interactions.

@liggitt While I agree with you, it is not the case and I don't think we should change that here. We need to cherrypick this to older releases. I don't want to make a risky change to our tests in older releases.

@liggitt

Member

liggitt commented Jan 24, 2019

@liggitt While I agree with you, it is not the case and I don't think we should change that here. We need to cherrypick this to older releases. I don't want to make a risky change to our tests in older releases.

then the safest alternative would seem to be to disable this admission plugin specifically for the node-e2e test, if it wants to run in a partial environment

@k8s-ci-robot k8s-ci-robot removed the lgtm label Jan 24, 2019

@yujuhong

Member

yujuhong commented Jan 24, 2019

I'd expect the node-e2e to run in a relatively realistic environment w.r.t. the controllers that interact with nodes and pods on nodes. That seems useful for flushing out unexpected interactions.

Node e2e was introduced mainly to test the kubelet and things below it in different environments (e.g., various OS images). The apiserver was added simply to make interacting with the kubelet easier. The node e2e suite helps us identify node failures easily (and also allows us to change the kubelet configuration easily to test different features on the node).

The interaction between the kubelet and controllers is tested in cluster e2e. It may be useful to have a simpler environment to test the control plane on the master with nodes, but that wasn't what node e2e was created for.

then the safest alternative would seem to be to disable this admission plugin specifically for the node-e2e test, if it wants to run in a partial environment

+1 to disable this.

@bsalamat

Contributor Author

bsalamat commented Jan 24, 2019

Thanks, @yujuhong!

I just pushed a commit to disable the new admission plugin in node e2e tests.

@resouer
Member

resouer left a comment

/lgtm

let's see what CI says then :-)

@k8s-ci-robot k8s-ci-robot added the lgtm label Jan 24, 2019

Disable TaintNodesByCondition in node e2e tests, because node lifecycle controller does not run in the tests

@bsalamat bsalamat force-pushed the bsalamat:fix_taint_nodes branch from 9b780d3 to 37be9f5 Jan 24, 2019

@k8s-ci-robot k8s-ci-robot removed the lgtm label Jan 24, 2019

@ravisantoshgudimetla

Contributor

ravisantoshgudimetla commented Jan 25, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Jan 25, 2019

@resouer

Member

resouer commented Jan 25, 2019

/lgtm
/hold cancel

Thanks to all working on this issue!

@fejta-bot

fejta-bot commented Jan 25, 2019

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@k8s-ci-robot k8s-ci-robot merged commit d654b49 into kubernetes:master Jan 25, 2019

18 checks passed

cla/linuxfoundation: bsalamat authorized
pull-kubernetes-bazel-build: Job succeeded.
pull-kubernetes-bazel-test: Job succeeded.
pull-kubernetes-cross: Skipped
pull-kubernetes-e2e-gce: Job succeeded.
pull-kubernetes-e2e-gce-100-performance: Job succeeded.
pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded.
pull-kubernetes-e2e-kops-aws: Job succeeded.
pull-kubernetes-e2e-kubeadm-gce: Skipped
pull-kubernetes-godeps: Skipped
pull-kubernetes-integration: Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big: Job succeeded.
pull-kubernetes-local-e2e: Skipped
pull-kubernetes-local-e2e-containerized: Skipped
pull-kubernetes-node-e2e: Job succeeded.
pull-kubernetes-typecheck: Job succeeded.
pull-kubernetes-verify: Job succeeded.
tide: In merge pool.
}

func addNotReadyTaint(node *api.Node) {
	notReadyTaint := api.Taint{

@tedyu

tedyu Jan 25, 2019

Contributor

It seems this can be declared as a singleton outside the func.
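
A minimal sketch of what that suggestion could look like, with the taint hoisted to a package-level variable instead of being rebuilt on every call. The types and the package name here are simplified, invented stand-ins, so this is illustrative rather than the PR's actual code.

```go
package sketch

// Simplified stand-in types; the real function works on *api.Node and api.Taint.
type Taint struct {
	Key    string
	Effect string
}

type Node struct {
	Taints []Taint
}

// Declared once at package level, per the suggestion above, instead of
// constructing the Taint value inside addNotReadyTaint on every admission.
var notReadyTaint = Taint{
	Key:    "node.kubernetes.io/not-ready",
	Effect: "NoSchedule",
}

func addNotReadyTaint(node *Node) {
	for _, t := range node.Taints {
		if t.Key == notReadyTaint.Key {
			// Already tainted; nothing to do.
			return
		}
	}
	node.Taints = append(node.Taints, notReadyTaint)
}
```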

@k8s-ci-robot

Contributor

k8s-ci-robot commented Jan 26, 2019

@bsalamat: The following test failed, say /retest to rerun them all:

Test name: pull-kubernetes-node-e2e
Commit: 9b780d3
Rerun command: /test pull-kubernetes-node-e2e

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

k8s-ci-robot added a commit that referenced this pull request Jan 29, 2019

Merge pull request #73298 from bsalamat/automated-cherry-pick-of-#73097-upstream-release-1.12

Automated cherry pick of #73097 upstream release 1.12

k8s-ci-robot added a commit that referenced this pull request Jan 30, 2019

Merge pull request #73296 from bsalamat/automated-cherry-pick-of-#73097-upstream-release-1.13

Automated cherry pick of #73097 upstream release 1.13