Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new flag for whitelisting node taints #81043

Merged
merged 2 commits into from Sep 11, 2019

Conversation

johnSchnake
Copy link
Contributor

@johnSchnake johnSchnake commented Aug 6, 2019

What type of PR is this?
/kind feature

Technically a feature I suppose since it adds a new flag; considered a bug in the test framework by some.

What this PR does / why we need it:
Adds a new flag which allows users to specify a regexp
which will effectively whitelist certain taints and
allow the test framework to startup tests despite having
tainted nodes.

Fixes an issue where e2e tests were unable to be run
in tainted environments due to the framework waiting for
all the nodes to be schedulable without any tolerations.

Which issue(s) this PR fixes:
Fixes #74282

Special notes for your reviewer:

  • Moves a few pieces of logic into the e2enode. In some cases the logic was duplicated there and in the framework package but I just exported it from the e2enode package in order to remove the redundancy.
  • Wrapped the main logic change into a single function which was easily testable; added a table driven test for it

Does this PR introduce a user-facing change?:

/test/e2e/framework: Adds a flag "non-blocking-taints" which allows tests to run in environments with tainted nodes. String value should be a comma-separated list.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

NONE

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Aug 6, 2019
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Aug 6, 2019
@johnSchnake
Copy link
Contributor Author

/sig testing
/area test
/priority important-soon

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Aug 6, 2019
nodeCopy.Spec.Taints = []v1.Taint{}
for _, v := range node.Spec.Taints {
if !ignoreTaints.MatchString(v.Key) {
nodeCopy.Spec.Taints = append(nodeCopy.Spec.Taints, v)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm worried if this will work on all the pod-specs generated.

What I'm wondering is if we whitelist we apply a blanket toleration.

If a test tries to schedule a pod won't it just hang...?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are two separate use cases with different solutions:

  • a user has a cluster with N nodes, M of which are tainted with their own custom NoSchedule taints (e.g. node/etcd, node/control-plane, etc) which are meant to be NoSchedule. They do not want test pods to be allowed to run on those nodes, they just need the existing "wait until ALL nodes are scheduleable unless they have the master noe label" logic to tolerate them.
  • a user has a cluster with N nodes, all N are tainted with various NoSchedule taints but in their actual use, they deliberately taint pods they wish to run on certain nodes.

This PR (and I thought the issue) was around solving the first issue; that is the one that I've seen and heard from users about.

You've mentioned case 2, but I'm not sure that's the right problem to solve here:

  • it would have a potential impact of every test
  • it would require an explicit list of exact taints which I've been told may be frustrating to provide whereas this currently allows a regexp (this was a comment in the DD)
  • I am admittedly unfamiliar with this use case, but it seems unclear that it would work uniformly for all those users. What if they have N nodes, M tainted for reason X, (N-M) tainted for reason Y (gpu availability, networking capabilities, geography, etc). Would they want the test suite to run any/all workloads on any/all nodes? Maybe the answer is yes, maybe my situation is even a bit silly, but it seems like if they are tainting all their nodes they may need extra strict control over what workloads run where and it is hard to see how/why that should be supported. Especially with the idea of conformance=workload portability; is everyone clear on the expectation that those clusters to be conformant even if we couldn't move a vanilla hello-world pod into them without added tolerations?

I'm sure you've thought about that more than I have, but that just wasn't the problem I thought I was trying to solve by this ticket.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So my thinking was both user stories are valid, and have gotten reports on both.

The logic for determining nodes ready needs to change but also pod tolerations need to be adjusted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the second use case applies to you, would it be reasonable to require users to add their own mutating webhook before trying to run tests? Isn't that what we'll have to do to pass tests in that situation? In addition, if tests get cut off in the middle, wouldn't it be possible to have left a mutating webhook impacting all pods in the user cluster? That seems like a concern to me.

Regardless, do you have a problem with me continuing with solving use case 1 with a regexp as done?

This would mean that startup isnt blocked by taints matching XYZ, and then to solve use case 2 we'd have to have a separate, concrete list of taints which MUST be tolerated by pods in order for tests to pass.

We don't want to try and reuse a list of taints for both use cases because, as I mentioned above, it seems very possible to have every pod tainted in a custom way but only intend pods to be scheduled on a subset of them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the second use case applies to you, would it be reasonable to require users to add their own mutating webhook before trying to run tests?

It's a pain and it's why we wanted this change. I'll prod folks from the wild to comment on the issue.

@alejandrox1
Copy link
Contributor

/cc

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 7, 2019
@timothysc timothysc self-assigned this Aug 12, 2019
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 12, 2019
@johnSchnake johnSchnake force-pushed the whitelistedTaints branch 3 times, most recently from 640ccf8 to b38feb0 Compare August 13, 2019 14:31
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 30, 2019
Adds a new flag which allows users to specify a regexp
which will effectively whitelist certain taints and
allow the test framework to startup tests despite having
tainted nodes.

Fixes an issue where e2e tests were unable to be run
in tainted environments due to the framework waiting for
all the nodes to be schedulable without any tolerations.
@johnSchnake
Copy link
Contributor Author

I had some of the logic in a setupSuite method which I thought was the right place to put it. However I guess in some paths it doesn't get called. I moved it into a more idiomatic location that is only for updating the testContext after all the flags are parsed. Some tests failing for, what seem to be, unrelated issues. Retesting.

/test pull-kubernetes-conformance-kind-ipv6

@xmudrii
Copy link
Member

xmudrii commented Sep 5, 2019

@andrewsykim @neolit123 This PR fixes the issue that is in the 1.16 milestone, but as the code freeze started, do we want to move both the PR and the issue for 1.17 or this is urgent?

@neolit123
Copy link
Member

we should move them to 1.17.
/milestone v1.17

@alejandrox1
Copy link
Contributor

Are those failures flakes?
/retest

Copy link
Member

@timothysc timothysc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 10, 2019
@andrewsykim
Copy link
Member

@johnSchnake release notes should be updated to reflect the new flag name, not sure I would consider this "user facing" though

@neolit123
Copy link
Member

@johnSchnake release notes should be updated to reflect the new flag name, not sure I would consider this "user facing" though

or at least prefixed with something like /test/e2e/framework:...

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support custom taints during e2e tests
9 participants