templates: add priority class system-node-critical to etcd pod #353
Conversation
```diff
@@ -120,6 +120,9 @@ contents:
         containerPort: 2379
         protocol: TCP
       hostNetwork: true
+      priorityClassName: system-node-critical
```
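For context, the field lands in the static-pod spec consumed by the kubelet. A minimal sketch of such a manifest (pod name, namespace, and image here are illustrative, not the actual template contents):

```yaml
# Sketch only: names and image are illustrative, not the real template.
apiVersion: v1
kind: Pod
metadata:
  name: etcd-member        # illustrative name
  namespace: kube-system
spec:
  priorityClassName: system-node-critical
  hostNetwork: true
  containers:
  - name: etcd
    image: quay.io/coreos/etcd:v3.3   # illustrative image
    ports:
    - containerPort: 2379
      protocol: TCP
```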
@sjenning are there any other magic incantations missing from etcd here?
really should be system-cluster-critical
> system-cluster-critical - This priority class has a value of 2000000000 (two billion) and is used with pods that are important for the cluster. Pods with this priority class can be evicted from a node in certain circumstances. For example, pods configured with the system-node-critical priority class can take priority. However, this priority class does ensure guaranteed scheduling. Examples of pods that can have this priority class are fluentd, add-on components like descheduler, and so forth.
seems like cluster-critical can be evicted...
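For comparison, both classes ship as built-in PriorityClass objects. A rough sketch of how they look if exported, using the upstream Kubernetes default values (2000001000 for node-critical, 2000000000 for cluster-critical; the API version varies by Kubernetes release):

```yaml
# Sketch of the built-in PriorityClass objects (upstream default values).
apiVersion: scheduling.k8s.io/v1beta1   # v1 in later Kubernetes releases
kind: PriorityClass
metadata:
  name: system-node-critical
value: 2000001000
description: Used for system critical pods that must not be moved from their current node.
---
apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: system-cluster-critical
value: 2000000000
description: Used for system critical pods that must run in the cluster, but can be moved to another node if necessary.
```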
Technically on that node the static pod is critical (so I can buy node critical). Seems cluster critical may need to be defined better. Also etcd is special.
Looks like this needs a rebase. (I am still tempted to change the template unit tests to only sanity check one or two generated files, not all of them)
force-pushed from a073a91 to a75a166
I tried testing this locally and I'm still seeing the kubelet preempt etcd-member. Either my test procedure is faulty or this isn't sufficient.
@sjenning on a local cluster that doesn't have this change, the etcd pod was evicted. We edited the etcd static pod on the node to include these changes and restarted the kubelet, but the kubelet still evicted the etcd pod...
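The manual test described above can be sketched roughly as follows (the manifest path and pod name are assumptions; the static-pod manifest location can differ per deployment):

```console
# Sketch of the manual test; manifest path and pod name are assumptions.
$ sudo vi /etc/kubernetes/manifests/etcd-member.yaml   # add priorityClassName: system-node-critical
$ sudo systemctl restart kubelet
$ oc get pods -n kube-system -w                        # watch whether etcd-member still gets evicted
```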
/retest
1 similar comment
/retest |
```console
go test ./pkg/controller/template/... -u
```
force-pushed from a75a166 to 266b6d2
/retest
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: abhinavdahiya, sjenning
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
rate limiting errors in e2e-aws:

```console
level=warning msg="Found override for ReleaseImage. Please be warned, this is not advised"
level=info msg="Consuming \"Install Config\" from target directory"
level=info msg="Creating cluster..."
level=error msg="Error: Error applying plan:"
level=error msg="2 errors occurred:"
level=error msg="\t* module.vpc.aws_route_table_association.worker_routing[0]: 1 error occurred:"
level=error msg="\t* aws_route_table_association.worker_routing.0: timeout while waiting for state to become 'success' (timeout: 5m0s)"
level=error msg="\t* module.vpc.aws_route_table_association.route_net[5]: 1 error occurred:"
level=error msg="\t* aws_route_table_association.route_net.5: timeout while waiting for state to become 'success' (timeout: 5m0s)"
level=error msg="Terraform does not automatically rollback in the face of errors."
level=error msg="Instead, your Terraform state file has been partially updated with"
level=error msg="any resources that successfully completed. Please address the error"
level=error msg="above and apply again to incrementally change your infrastructure."
level=fatal msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed to apply using Terraform"
```

will retest in a bit.
/retest |
1 similar comment
/retest |
/test e2e-aws-op
using `system-node-critical` from https://docs.okd.io/latest/admin_guide/scheduling/priority_preemption.html#admin-guide-priority-preemption-priority-class