Add new flag for whitelisting node taints #81043
Conversation
/sig testing
nodeCopy.Spec.Taints = []v1.Taint{}
for _, v := range node.Spec.Taints {
	if !ignoreTaints.MatchString(v.Key) {
		nodeCopy.Spec.Taints = append(nodeCopy.Spec.Taints, v)
	}
}
I'm worried about whether this will work on all the pod specs generated.
What I'm wondering is: if we whitelist, do we apply a blanket toleration?
If a test tries to schedule a pod, won't it just hang...?
I think there are two separate use cases with different solutions:
- a user has a cluster with N nodes, M of which are tainted with their own custom NoSchedule taints (e.g. node/etcd, node/control-plane, etc.) which are meant to be NoSchedule. They do not want test pods to be allowed to run on those nodes; they just need the existing "wait until ALL nodes are schedulable unless they have the master node label" logic to tolerate them.
- a user has a cluster with N nodes, all N of which are tainted with various NoSchedule taints; in their actual use, they deliberately add tolerations to the pods they wish to run on certain nodes.
This PR (and I thought the issue) was around solving the first issue; that is the one that I've seen and heard from users about.
You've mentioned case 2, but I'm not sure that's the right problem to solve here:
- it would have a potential impact on every test
- it would require an explicit list of exact taints which I've been told may be frustrating to provide whereas this currently allows a regexp (this was a comment in the DD)
- I am admittedly unfamiliar with this use case, but it seems unclear that it would work uniformly for all those users. What if they have N nodes, M tainted for reason X and (N-M) tainted for reason Y (GPU availability, networking capabilities, geography, etc.)? Would they want the test suite to run any/all workloads on any/all nodes? Maybe the answer is yes, and maybe my scenario is even a bit silly, but if they are tainting all their nodes they may need extra-strict control over what workloads run where, and it is hard to see how/why that should be supported. Especially with the idea that conformance = workload portability: is everyone clear on the expectation that those clusters would be conformant even if we couldn't move a vanilla hello-world pod into them without added tolerations?
I'm sure you've thought about that more than I have, but that just wasn't the problem I thought I was trying to solve by this ticket.
So my thinking was that both user stories are valid, and I have gotten reports on both.
The logic for determining nodes ready needs to change but also pod tolerations need to be adjusted.
If the second use case applies to you, would it be reasonable to require users to add their own mutating webhook before trying to run tests? Isn't that what we'll have to do to pass tests in that situation? In addition, if tests get cut off in the middle, wouldn't it be possible to have left a mutating webhook impacting all pods in the user cluster? That seems like a concern to me.
Regardless, do you have a problem with me continuing with solving use case 1 with a regexp as done?
This would mean that startup isn't blocked by taints matching XYZ; then, to solve use case 2, we'd have to have a separate, concrete list of taints which MUST be tolerated by pods in order for tests to pass.
We don't want to try and reuse a list of taints for both use cases because, as I mentioned above, it seems very possible to have every pod tainted in a custom way but only intend pods to be scheduled on a subset of them.
If the second use case applies to you, would it be reasonable to require users to add their own mutating webhook before trying to run tests?
It's a pain and it's why we wanted this change. I'll prod folks from the wild to comment on the issue.
/cc
Adds a new flag which allows users to specify a regexp which will effectively whitelist certain taints and allow the test framework to start tests despite having tainted nodes. Fixes an issue where e2e tests were unable to be run in tainted environments due to the framework waiting for all the nodes to be schedulable without any tolerations.
I had some of the logic in a
/test pull-kubernetes-conformance-kind-ipv6
@andrewsykim @neolit123 This PR fixes the issue that is in the 1.16 milestone, but since the code freeze has started, do we want to move both the PR and the issue to 1.17, or is this urgent?
we should move them to 1.17.
Are those failures flakes?
/lgtm
@johnSchnake release notes should be updated to reflect the new flag name, not sure I would consider this "user facing" though
or at least prefixed with something like
/retest
What type of PR is this?
/kind feature
Technically a feature I suppose since it adds a new flag; considered a bug in the test framework by some.
What this PR does / why we need it:
Adds a new flag which allows users to specify a regexp
which will effectively whitelist certain taints and
allow the test framework to start tests despite having
tainted nodes.
Fixes an issue where e2e tests were unable to be run
in tainted environments due to the framework waiting for
all the nodes to be schedulable without any tolerations.
Which issue(s) this PR fixes:
Fixes #74282
Special notes for your reviewer:
`e2enode`. In some cases the logic was duplicated there and in the `framework` package, but I just exported it from the `e2enode` package in order to remove the redundancy.
Does this PR introduce a user-facing change?:
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: