Add new flag for whitelisting node taints #81043

johnSchnake · 2019-08-06T20:16:33Z

What type of PR is this?
/kind feature

Technically a feature I suppose since it adds a new flag; considered a bug in the test framework by some.

What this PR does / why we need it:
Adds a new flag which allows users to specify a regexp
which will effectively whitelist certain taints and
allow the test framework to start tests despite having
tainted nodes.

Fixes an issue where e2e tests could not be run
in tainted environments because the framework waits for
all the nodes to be schedulable without any tolerations.

Which issue(s) this PR fixes:
Fixes #74282

Special notes for your reviewer:

Moves a few pieces of logic into the e2enode package. In some cases the logic was duplicated there and in the framework package, so I exported it from the e2enode package to remove the redundancy.
Wrapped the main logic change in a single, easily testable function and added a table-driven test for it.

Does this PR introduce a user-facing change?:

/test/e2e/framework: Adds a flag "non-blocking-taints" which allows tests to run in environments with tainted nodes. String value should be a comma-separated list.
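
For illustration only, a comma-separated flag value like the one described in this release note could be parsed as sketched below. The flag name comes from the note; everything else is an assumption: the helper `parseNonBlockingTaints`, the empty default value, and the choice to treat each entry as an exact taint key are hypothetical and may differ from the framework's actual behavior.

```go
package main

import (
	"flag"
	"fmt"
	"strings"
)

// parseNonBlockingTaints is a hypothetical helper that splits a
// comma-separated flag value into a set of taint keys, skipping blanks.
func parseNonBlockingTaints(csv string) map[string]struct{} {
	keys := map[string]struct{}{}
	for _, k := range strings.Split(csv, ",") {
		if k = strings.TrimSpace(k); k != "" {
			keys[k] = struct{}{}
		}
	}
	return keys
}

func main() {
	// A private FlagSet keeps the example independent of os.Args.
	fs := flag.NewFlagSet("e2e", flag.ContinueOnError)
	raw := fs.String("non-blocking-taints", "",
		"comma-separated list of taint keys that should not block test startup (default is an assumption)")

	args := []string{"--non-blocking-taints=node/etcd,node/control-plane"}
	if err := fs.Parse(args); err != nil {
		fmt.Println("parse error:", err)
		return
	}

	keys := parseNonBlockingTaints(*raw)
	_, ok := keys["node/etcd"]
	fmt.Println(len(keys), ok)
}
```

Exact-key matching avoids the surprises of an overly broad pattern, at the cost of listing every taint explicitly.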

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

NONE

johnSchnake · 2019-08-06T20:24:02Z

/sig testing
/area test
/priority important-soon

timothysc · 2019-08-06T20:41:27Z

test/e2e/framework/node/resource.go

+		nodeCopy.Spec.Taints = []v1.Taint{}
+		for _, v := range node.Spec.Taints {
+			if !ignoreTaints.MatchString(v.Key) {
+				nodeCopy.Spec.Taints = append(nodeCopy.Spec.Taints, v)


I'm worried if this will work on all the pod-specs generated.

What I'm wondering is whether, if we whitelist, we also apply a blanket toleration.

If a test tries to schedule a pod won't it just hang...?
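
The concern above can be made concrete with a small sketch of taint/toleration matching: a pod with no matching toleration cannot be placed on a node that carries a NoSchedule taint, so ignoring the taint at startup does not by itself let test pods land there. The types below are simplified stand-ins for `v1.Taint` and `v1.Toleration`, and `tolerates` and `canSchedule` are hypothetical helpers that follow the documented matching rules in reduced form; this is not the scheduler's implementation.

```go
package main

import "fmt"

// Taint and Toleration are simplified stand-ins for the v1 API types
// (assumption: only the fields used in this illustration).
type Taint struct {
	Key, Value, Effect string
}

type Toleration struct {
	Key, Operator, Value, Effect string
}

// tolerates reports whether tol tolerates taint. An empty Effect or Key
// on the toleration acts as a wildcard for that field.
func tolerates(tol Toleration, taint Taint) bool {
	if tol.Effect != "" && tol.Effect != taint.Effect {
		return false
	}
	if tol.Key != "" && tol.Key != taint.Key {
		return false
	}
	switch tol.Operator {
	case "Exists":
		return true
	case "", "Equal":
		return tol.Value == taint.Value
	}
	return false
}

// canSchedule reports whether every hard taint on a node is tolerated.
// PreferNoSchedule is a soft preference and never blocks placement.
func canSchedule(taints []Taint, tols []Toleration) bool {
	for _, taint := range taints {
		if taint.Effect != "NoSchedule" && taint.Effect != "NoExecute" {
			continue
		}
		tolerated := false
		for _, tol := range tols {
			if tolerates(tol, taint) {
				tolerated = true
				break
			}
		}
		if !tolerated {
			return false
		}
	}
	return true
}

func main() {
	nodeTaints := []Taint{{Key: "node/etcd", Effect: "NoSchedule"}}

	// A test pod without tolerations cannot use this node.
	fmt.Println(canSchedule(nodeTaints, nil))

	// The same pod with a matching toleration can.
	tol := Toleration{Key: "node/etcd", Operator: "Exists", Effect: "NoSchedule"}
	fmt.Println(canSchedule(nodeTaints, []Toleration{tol}))
}
```

If every node in the cluster carries such a taint, a test pod without tolerations would stay pending, which is the hang described above.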

I think there are two separate use cases with different solutions:

1. A user has a cluster with N nodes, M of which are tainted with their own custom NoSchedule taints (e.g. node/etcd, node/control-plane, etc.) which are meant to be NoSchedule. They do not want test pods to be allowed to run on those nodes; they just need the existing "wait until ALL nodes are schedulable unless they have the master node label" logic to tolerate them.

2. A user has a cluster with N nodes, all N of which are tainted with various NoSchedule taints, and in actual use they deliberately add tolerations to the pods they wish to run on certain nodes.

This PR (and, I thought, the issue) was about solving the first use case; that is the one that I've seen and heard from users about.

You've mentioned case 2, but I'm not sure that's the right problem to solve here:

it would have a potential impact on every test

it would require an explicit list of exact taints which I've been told may be frustrating to provide whereas this currently allows a regexp (this was a comment in the DD)

I am admittedly unfamiliar with this use case, but it seems unclear that it would work uniformly for all those users. What if they have N nodes, M tainted for reason X and (N-M) tainted for reason Y (GPU availability, networking capabilities, geography, etc.)? Would they want the test suite to run any/all workloads on any/all nodes? Maybe the answer is yes, maybe my situation is even a bit silly, but it seems like if they are tainting all their nodes they may need extra strict control over what workloads run where, and it is hard to see how/why that should be supported. Especially with the idea of conformance=workload portability: is everyone clear on the expectation that those clusters be conformant even if we couldn't move a vanilla hello-world pod into them without added tolerations?

I'm sure you've thought about that more than I have, but that just wasn't the problem I thought I was trying to solve by this ticket.

So my thinking was that both user stories are valid, and I have gotten reports on both.

The logic for determining nodes ready needs to change but also pod tolerations need to be adjusted.

If the second use case applies to you, would it be reasonable to require users to add their own mutating webhook before trying to run tests? Isn't that what we'll have to do to pass tests in that situation? In addition, if tests get cut off in the middle, wouldn't it be possible to have left a mutating webhook impacting all pods in the user cluster? That seems like a concern to me.

Regardless, do you have a problem with me continuing with solving use case 1 with a regexp as done?

This would mean that startup isn't blocked by taints matching XYZ, and then to solve use case 2 we'd have to have a separate, concrete list of taints which MUST be tolerated by pods in order for tests to pass.

We don't want to try to reuse one list of taints for both use cases because, as I mentioned above, it seems very possible to have every node tainted in a custom way but only intend pods to be scheduled on a subset of them.

> If the second use case applies to you, would it be reasonable to require users to add their own mutating webhook before trying to run tests?

It's a pain and it's why we wanted this change. I'll prod folks from the wild to comment on the issue.

alejandrox1 · 2019-08-06T22:39:03Z

/cc

johnSchnake · 2019-08-30T17:41:16Z

I had some of the logic in a setupSuite method, which I thought was the right place to put it. However, I guess in some paths it doesn't get called. I moved it into a more idiomatic location that is only for updating the testContext after all the flags are parsed. Some tests are failing for what seem to be unrelated issues. Retesting.

/test pull-kubernetes-conformance-kind-ipv6

test/e2e/framework/test_context.go

xmudrii · 2019-09-05T12:48:06Z

@andrewsykim @neolit123 This PR fixes an issue that is in the 1.16 milestone, but since code freeze has started, do we want to move both the PR and the issue to 1.17, or is this urgent?

neolit123 · 2019-09-05T12:59:20Z

we should move them to 1.17.
/milestone v1.17

alejandrox1 · 2019-09-06T12:32:37Z

Are those failures flakes?
/retest

timothysc

/lgtm

andrewsykim · 2019-09-10T21:10:17Z

@johnSchnake release notes should be updated to reflect the new flag name, not sure I would consider this "user facing" though

neolit123 · 2019-09-10T21:24:34Z

> @johnSchnake release notes should be updated to reflect the new flag name, not sure I would consider this "user facing" though

or at least prefixed with something like /test/e2e/framework:...

fejta-bot · 2019-09-11T00:57:10Z

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

k8s-ci-robot requested review from andrewsykim and ixdy August 6, 2019 20:18

k8s-ci-robot added the priority/important-soon label and removed the needs-priority label Aug 6, 2019

johnSchnake force-pushed the whitelistedTaints branch from 16894e9 to 347153a Compare August 6, 2019 20:25

timothysc reviewed Aug 6, 2019

k8s-ci-robot requested a review from alejandrox1 August 6, 2019 22:39

k8s-ci-robot added the needs-rebase label Aug 7, 2019

timothysc self-assigned this Aug 12, 2019

johnSchnake force-pushed the whitelistedTaints branch from 347153a to f1c6412 Compare August 12, 2019 18:55

k8s-ci-robot removed the needs-rebase label Aug 12, 2019

johnSchnake force-pushed the whitelistedTaints branch 3 times, most recently from 640ccf8 to b38feb0 Compare August 13, 2019 14:31

johnSchnake force-pushed the whitelistedTaints branch from 4bfb19d to 0c2ace2 Compare August 30, 2019 14:04

k8s-ci-robot removed the needs-rebase label Aug 30, 2019

johnSchnake force-pushed the whitelistedTaints branch from 0c2ace2 to 18fe2d2 Compare August 30, 2019 14:36

johnSchnake force-pushed the whitelistedTaints branch from 18fe2d2 to 8772187 Compare August 30, 2019 16:16

andrewsykim reviewed Sep 3, 2019

test/e2e/framework/test_context.go (outdated, resolved)

k8s-ci-robot added this to the v1.17 milestone Sep 5, 2019

johnSchnake mentioned this pull request Sep 5, 2019

Simplify test suite startup conditions and provide an opt-out #78500

Closed

johnSchnake force-pushed the whitelistedTaints branch 5 times, most recently from da49d3b to f85fd51 Compare September 5, 2019 19:20

Move from regexp to csv string

3c53481

johnSchnake force-pushed the whitelistedTaints branch from f85fd51 to 3c53481 Compare September 5, 2019 19:37

timothysc approved these changes Sep 10, 2019

k8s-ci-robot added the lgtm label Sep 10, 2019

k8s-ci-robot merged commit 0f46a8a into kubernetes:master Sep 11, 2019

johnSchnake mentioned this pull request Sep 12, 2019

Support e2e runs despite custom taints on nodes vmware-tanzu/sonobuoy#599

Closed

johnSchnake mentioned this pull request Sep 30, 2019

Test framework should support custom taints during testing #83329

Closed

johnSchnake deleted the whitelistedTaints branch October 29, 2019 13:42

johnSchnake mentioned this pull request Dec 2, 2019

Add an FAQ vmware-tanzu/sonobuoy#998

Merged

johnSchnake mentioned this pull request Mar 19, 2020

E2E test failing. Please assist vmware-tanzu/sonobuoy#1091

Closed
