Add wildcard tolerations to kube-proxy #56589
Nodes with extended resources attached are expected to be tainted with the resource name so that they can serve as dedicated nodes. If the ExtendedResourceToleration admission controller is enabled, pods requesting such resources automatically tolerate these taints. The nvidia-gpu-device-plugin daemonset doesn't request such resources but still needs to run on these nodes, so it needs an explicit toleration for the taint.
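For illustration, here are both sides of this arrangement as a sketch. It assumes the extended resource is `nvidia.com/gpu`, the resource this PR's GPU toleration names, and that the node taint uses the `NoSchedule` effect. The pod, container, and image names are hypothetical. A pod that requests the resource gets its toleration from the admission controller. The device plugin requests nothing, so it must declare the toleration itself:

```yaml
# Hypothetical workload that requests the extended resource. With the
# ExtendedResourceToleration admission controller enabled, a matching
# toleration for the "nvidia.com/gpu" taint is added to it automatically.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload            # hypothetical name
spec:
  containers:
  - name: app                   # hypothetical container
    image: example/app:latest   # hypothetical image
    resources:
      limits:
        nvidia.com/gpu: 1
---
# Fragment of the nvidia-gpu-device-plugin daemonset pod template.
# The plugin does not request nvidia.com/gpu, so it declares the
# toleration explicitly to be schedulable onto the tainted GPU nodes.
spec:
  template:
    spec:
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
```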
fluentd-gcp already has these tolerations. When kube-proxy runs as a static pod, it gets a wildcard `NoExecute` toleration (all static pods get that). So this PR adds the same toleration to kube-proxy when it runs as a daemonset, and also adds a wildcard `NoSchedule` toleration to kube-proxy.
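The resulting `tolerations` block in the kube-proxy daemonset pod template might look like the following sketch. It mirrors the two entries visible in the review diff, and all other fields are omitted. A toleration with `operator: "Exists"` and no `key` matches taints with any key, so each entry tolerates every taint carrying its effect:

```yaml
# Fragment of the kube-proxy daemonset pod template (other fields omitted).
spec:
  template:
    spec:
      tolerations:
      # No key + operator Exists: tolerates any taint whose effect is NoSchedule.
      - operator: "Exists"
        effect: "NoSchedule"
      # Same wildcard for NoExecute, matching what static pods already get.
      - operator: "Exists"
        effect: "NoExecute"
```

An entry with neither `key` nor `effect` (only `operator: "Exists"`) would match every taint, including `PreferNoSchedule`. The two entries above cover only the two effects discussed in this PR.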
/lgtm
cc @mikedanese for approval
kube-proxy part LGTM.
/lgtm
/retest
/lgtm
squash?
@@ -107,7 +107,6 @@ spec:
  effect: "NoSchedule"
- operator: "Exists"
  effect: "NoExecute"
#TODO: remove this toleration once #44445 is properly fixed.
Why remove this comment? The issue hasn't been closed.
We don't want this toleration to be removed.
"We don't" is different from "we never will". A TODO indicates that we don't now but might want to eventually. Is this issue obsolete? Should we close it?
#44445 contains multiple issues.
- The title of the issue, "Improvement: fluentd-gcp to get same toleration as kube-proxy", will be fixed by this PR.
- There was a regression in the daemonset controller mentioned in the bug, which was fixed a while back.
- I guess the only thing not yet fixed is what the three comments starting from #44445 (comment) describe: users cannot modify system addons on managed services like GKE, so they can't use NoSchedule or NoExecute taints if these addons don't tolerate them, because adding such taints would leave the nodes without these "required" addons.
But even if we fix that (allow users to modify the tolerations of system addons), I think the default should still be that these addons tolerate all taints. Users who really want to can then use that ability to remove these wildcard tolerations.
Also, we need system addons that run on every node to have wildcard NoSchedule toleration for issue #55080, PR #55839.
cc @vishh
+1. The expected behavior is that all system addons that are expected to run on all GKE nodes should tolerate all taints and effects. Certain addons, like GPU plugins, need to tolerate only GPU-specific taints.
I feel this is slightly different from #44445, where if a user taints all GKE nodes, cluster-level system addons will not run at all. That is a separate feature and is not tied to the comment at all.
@mikedanese thoughts?
Those commits are doing two different things. The commit messages have more detail.
[MILESTONENOTIFIER] Milestone Pull Request Needs Attention @MrHohn @bsalamat @davidopp @jiayingz @mikedanese @mindprince @vishh @kubernetes/sig-scheduling-misc
Action required: During code freeze, pull requests in the milestone should be in progress.
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jiayingz, mikedanese, mindprince, MrHohn, vishh Associated issue: 55080 The full list of commands accepted by this bot can be found here.
/test all [submit-queue is verifying that this PR is safe to merge]
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here.
`nvidia.com/gpu` toleration to nvidia-gpu-device-plugin. Related to #55080 and #44445.
/kind bug
/priority critical-urgent
/sig scheduling
Release note:
/assign @davidopp @bsalamat @vishh @jiayingz