
Add new flag for whitelisting node taints #81043

Merged
merged 2 commits into kubernetes:master from johnSchnake:whitelistedTaints on Sep 11, 2019

Conversation

@johnSchnake (Contributor) commented Aug 6, 2019

What type of PR is this?
/kind feature

Technically a feature, I suppose, since it adds a new flag; some would consider it a bug in the test framework.

What this PR does / why we need it:
Adds a new flag which allows users to specify a regexp that effectively whitelists certain taints, allowing the test framework to start tests despite tainted nodes.

Fixes an issue where e2e tests could not be run in tainted environments because the framework waited for all nodes to become schedulable without any tolerations.
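
Conceptually, the startup check boils down to ignoring whitelisted taints when deciding whether a node blocks the suite. A minimal sketch of that idea (the helper name and package are illustrative, not the exact code from this PR):

    package e2enode

    import (
        "regexp"

        v1 "k8s.io/api/core/v1"
    )

    // taintsBlockStartup reports whether a node's taints should keep the
    // framework waiting: only taints whose keys do NOT match the whitelist
    // regexp count as blocking.
    func taintsBlockStartup(taints []v1.Taint, ignoreTaints *regexp.Regexp) bool {
        for _, t := range taints {
            if !ignoreTaints.MatchString(t.Key) {
                return true
            }
        }
        return false
    }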

Which issue(s) this PR fixes:
Fixes #74282

Special notes for your reviewer:

  • Moves a few pieces of logic into the e2enode package. In some cases the logic was duplicated there and in the framework package; I exported it from the e2enode package to remove the redundancy.
  • Wraps the main logic change in a single, easily testable function; adds a table-driven test for it (sketched below).
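
The table-driven test mentioned above might look roughly like this (a sketch against the hypothetical taintsBlockStartup helper from the description; add "testing" to the imports):

    func TestTaintsBlockStartup(t *testing.T) {
        tests := []struct {
            name   string
            taints []v1.Taint
            ignore string
            want   bool
        }{
            {name: "no taints", taints: nil, ignore: "ignore-me.*", want: false},
            {name: "only whitelisted taints", taints: []v1.Taint{{Key: "ignore-me/etcd"}}, ignore: "ignore-me.*", want: false},
            {name: "non-whitelisted taint", taints: []v1.Taint{{Key: "example.com/gpu"}}, ignore: "ignore-me.*", want: true},
        }
        for _, tc := range tests {
            t.Run(tc.name, func(t *testing.T) {
                if got := taintsBlockStartup(tc.taints, regexp.MustCompile(tc.ignore)); got != tc.want {
                    t.Errorf("got %v, want %v", got, tc.want)
                }
            })
        }
    }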

Does this PR introduce a user-facing change?:

/test/e2e/framework: Adds a flag "non-blocking-taints" which allows tests to run in environments with tainted nodes. String value should be a comma-separated list.
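
For example, a hypothetical invocation (the taint keys here are illustrative):

    e2e.test --non-blocking-taints=node-role.kubernetes.io/master,example.com/gpu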

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

NONE
@johnSchnake (Contributor, Author) commented Aug 6, 2019

/sig testing
/area test
/priority important-soon

@johnSchnake johnSchnake force-pushed the johnSchnake:whitelistedTaints branch from 16894e9 to 347153a Aug 6, 2019
    // Rebuild the node's taint list, dropping any taint whose key matches
    // the whitelist regexp so it no longer blocks the schedulability check.
    nodeCopy.Spec.Taints = []v1.Taint{}
    for _, v := range node.Spec.Taints {
        if !ignoreTaints.MatchString(v.Key) {
            nodeCopy.Spec.Taints = append(nodeCopy.Spec.Taints, v)
        }
    }
@timothysc (Member) commented Aug 6, 2019

I'm worried about whether this will work for all the pod specs the tests generate.

What I'm wondering is: if we whitelist, do we apply a blanket toleration?

If a test tries to schedule a pod, won't it just hang...?

@johnSchnake (Author, Contributor) commented Aug 7, 2019

I think there are two separate use cases with different solutions:

  • a user has a cluster with N nodes, M of which are tainted with their own custom NoSchedule taints (e.g. node/etcd, node/control-plane, etc.), which are meant to be NoSchedule. They do not want test pods to run on those nodes; they just need the existing "wait until ALL nodes are schedulable unless they have the master node label" logic to tolerate them.
  • a user has a cluster with N nodes, all N of which are tainted with various NoSchedule taints; in actual use they deliberately add tolerations to the pods they wish to run on certain nodes (a blanket toleration of this kind is sketched below).
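
For reference, the "blanket toleration" under discussion would look like this in core/v1 terms (a sketch; whether the framework should inject it into test pods is exactly the open question):

    import v1 "k8s.io/api/core/v1"

    // A toleration with Operator: Exists and no Key matches every taint,
    // so a pod carrying it can be scheduled onto any tainted node.
    var tolerateEverything = v1.Toleration{
        Operator: v1.TolerationOpExists,
    }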

This PR (and, I thought, the issue) is about solving the first use case; that is the one I've seen and heard about from users.

You've mentioned case 2, but I'm not sure that's the right problem to solve here:

  • it would potentially impact every test
  • it would require an explicit list of exact taints, which I've been told may be frustrating to provide, whereas this currently allows a regexp (this was a comment in the DD)
  • I am admittedly unfamiliar with this use case, but it seems unclear that it would work uniformly for all those users. What if they have N nodes, M tainted for reason X and (N-M) tainted for reason Y (GPU availability, networking capabilities, geography, etc.)? Would they want the test suite to run any and all workloads on any and all nodes? Maybe the answer is yes, and maybe my scenario is a bit contrived, but if users are tainting all their nodes they may need extra-strict control over which workloads run where, and it is hard to see how or why that should be supported. Especially given the idea of conformance as workload portability: is everyone clear that those clusters would be conformant even if we couldn't move a vanilla hello-world pod into them without added tolerations?

I'm sure you've thought about this more than I have, but that just wasn't the problem I thought I was solving with this ticket.

@timothysc (Member) commented Aug 8, 2019

So my thinking was that both user stories are valid; we have gotten reports on both.

The logic for determining node readiness needs to change, but pod tolerations also need to be adjusted.

@johnSchnake (Author, Contributor) commented Aug 8, 2019

If the second use case applies to you, would it be reasonable to require users to add their own mutating webhook before trying to run tests? Isn't that what we'd have to do to pass tests in that situation? In addition, if tests get cut off in the middle, couldn't that leave a mutating webhook behind, impacting all pods in the user's cluster? That seems like a concern to me.

Regardless, do you have a problem with me continuing to solve use case 1 with a regexp, as done here?

This would mean that startup isn't blocked by taints matching XYZ; to solve use case 2 we'd need a separate, concrete list of taints which MUST be tolerated by pods in order for tests to pass.

We don't want to reuse a single list of taints for both use cases because, as mentioned above, it seems very possible to have every node tainted in a custom way while only intending pods to be scheduled on a subset of them.

@timothysc (Member) commented Aug 12, 2019

> If the second use case applies to you, would it be reasonable to require users to add their own mutating webhook before trying to run tests?

It's a pain and it's why we wanted this change. I'll prod folks from the wild to comment on the issue.

@alejandrox1 (Contributor) commented Aug 6, 2019

/cc

@k8s-ci-robot k8s-ci-robot requested a review from alejandrox1 Aug 6, 2019
@timothysc timothysc self-assigned this Aug 12, 2019
@johnSchnake johnSchnake force-pushed the johnSchnake:whitelistedTaints branch from 347153a to f1c6412 Aug 12, 2019
@johnSchnake johnSchnake force-pushed the johnSchnake:whitelistedTaints branch 3 times, most recently from 640ccf8 to b38feb0 Aug 12, 2019
@johnSchnake johnSchnake force-pushed the johnSchnake:whitelistedTaints branch from e8c2384 to 4bfb19d Aug 30, 2019
@johnSchnake johnSchnake force-pushed the johnSchnake:whitelistedTaints branch from 4bfb19d to 0c2ace2 Aug 30, 2019
@johnSchnake johnSchnake force-pushed the johnSchnake:whitelistedTaints branch from 0c2ace2 to 18fe2d2 Aug 30, 2019
@johnSchnake johnSchnake force-pushed the johnSchnake:whitelistedTaints branch from 18fe2d2 to 8772187 Aug 30, 2019
@johnSchnake (Contributor, Author) commented Aug 30, 2019

I had some of the logic in a setupSuite method, which I thought was the right place for it; however, in some paths it doesn't get called. I moved it into a more idiomatic location that is only for updating the TestContext after all the flags are parsed (the pattern is sketched below). Some tests are failing for what seem to be unrelated issues; retesting.
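
The pattern referred to above, roughly (a minimal sketch assuming a TestContext-style struct; names are illustrative, not the exact framework API):

    package framework

    import (
        "flag"
        "strings"
    )

    // TestContextType is a sketch; the real framework type has many more fields.
    type TestContextType struct {
        NonBlockingTaints string
    }

    var TestContext TestContextType

    // RegisterFlags wires the flag to the context before flag parsing.
    func RegisterFlags(fs *flag.FlagSet) {
        fs.StringVar(&TestContext.NonBlockingTaints, "non-blocking-taints", "",
            "comma-separated list of taint keys that should not block test startup")
    }

    // AfterReadingAllFlags is the idiomatic home for logic that must see the
    // final flag values; unlike a suite-setup hook, it runs on every path.
    func AfterReadingAllFlags(t *TestContextType) {
        if t.NonBlockingTaints == "" {
            return
        }
        keys := strings.Split(t.NonBlockingTaints, ",")
        _ = keys // hand the parsed keys to the node-readiness check
    }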

/test pull-kubernetes-conformance-kind-ipv6

@xmudrii (Member) commented Sep 5, 2019

@andrewsykim @neolit123 This PR fixes an issue that is in the 1.16 milestone, but since the code freeze has started, do we want to move both the PR and the issue to 1.17, or is this urgent?

@neolit123 (Member) commented Sep 5, 2019

we should move them to 1.17.
/milestone v1.17

@k8s-ci-robot k8s-ci-robot added this to the v1.17 milestone Sep 5, 2019
@johnSchnake johnSchnake force-pushed the johnSchnake:whitelistedTaints branch 5 times, most recently from da49d3b to f85fd51 Sep 5, 2019
@johnSchnake johnSchnake force-pushed the johnSchnake:whitelistedTaints branch from f85fd51 to 3c53481 Sep 5, 2019
@alejandrox1 (Contributor) commented Sep 6, 2019

Are those failures flakes?
/retest

A Member left a comment:

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Sep 10, 2019
@andrewsykim (Member) commented Sep 10, 2019

@johnSchnake release notes should be updated to reflect the new flag name; I'm not sure I would consider this "user facing", though.

@neolit123 (Member) commented Sep 10, 2019

> @johnSchnake release notes should be updated to reflect the new flag name; I'm not sure I would consider this "user facing", though.

or at least prefixed with something like /test/e2e/framework:...

@fejta-bot commented Sep 11, 2019

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@k8s-ci-robot k8s-ci-robot merged commit 0f46a8a into kubernetes:master Sep 11, 2019
25 checks passed:

  • cla/linuxfoundation: johnSchnake authorized
  • pull-kubernetes-bazel-build: Job succeeded.
  • pull-kubernetes-bazel-test: Job succeeded.
  • pull-kubernetes-conformance-image-test: Skipped.
  • pull-kubernetes-conformance-kind-ipv6: Job succeeded.
  • pull-kubernetes-cross: Skipped.
  • pull-kubernetes-dependencies: Job succeeded.
  • pull-kubernetes-e2e-gce: Job succeeded.
  • pull-kubernetes-e2e-gce-100-performance: Job succeeded.
  • pull-kubernetes-e2e-gce-alpha-features: Skipped.
  • pull-kubernetes-e2e-gce-csi-serial: Skipped.
  • pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded.
  • pull-kubernetes-e2e-gce-iscsi: Skipped.
  • pull-kubernetes-e2e-gce-iscsi-serial: Skipped.
  • pull-kubernetes-e2e-gce-storage-slow: Skipped.
  • pull-kubernetes-godeps: Skipped.
  • pull-kubernetes-integration: Job succeeded.
  • pull-kubernetes-kubemark-e2e-gce-big: Job succeeded.
  • pull-kubernetes-local-e2e: Skipped.
  • pull-kubernetes-node-e2e: Job succeeded.
  • pull-kubernetes-node-e2e-containerd: Job succeeded.
  • pull-kubernetes-typecheck: Job succeeded.
  • pull-kubernetes-verify: Job succeeded.
  • pull-publishing-bot-validate: Skipped.
  • tide: In merge pool.