Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

templates: add priority class system-node-critical to etcd pod #353

Merged

Conversation

abhinavdahiya
Copy link
Contributor

@openshift-ci-robot openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jan 29, 2019
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 29, 2019
@@ -120,6 +120,9 @@ contents:
containerPort: 2379
protocol: TCP
hostNetwork: true
priorityClassName: system-node-critical
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sjenning are there any other magic incantations missing from etcd here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

really should be system-cluster-critical

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from https://docs.okd.io/latest/admin_guide/scheduling/priority_preemption.html#admin-guide-priority-preemption-priority-class

System-cluster-critical - This priority class has a value of 2000000000 (two billion) and is used with pods that are important for the cluster. Pods with this priority class can be evicted from a node in certain circumstances. For example, pods configured with the system-node-critical priority class can take priority. However, this priority class does ensure guaranteed scheduling. Examples of pods that can have this priority class are fluentd, add-on components like descheduler, and so forth.

seems like cluster-critical can be evicted...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically on that node the static pod is critical (so I can buy node critical). Seems cluster critical may need to be defined better. Also etcd is special.

@cgwalters
Copy link
Member

Looks like this needs a rebase. (I am still tempted to change the template unit tests to only sanity check one or two generated files, not all of them)

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 1, 2019
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 5, 2019
@crawford
Copy link
Contributor

crawford commented Feb 5, 2019

I tried testing this locally and I'm still seeing kubelet preempt etcd-member. Either my test procedure is faulty or this isn't sufficient.

@abhinavdahiya
Copy link
Contributor Author

I tried testing this locally and I'm still seeing kubelet preempt etcd-member. Either my test procedure is faulty or this isn't sufficient.

@sjenning on a local cluster that doesn't have this change, the etcd pod was evicted.

We edited the etcd static pod to includes these changes on the node and restarted the kubelet. But the kubelet still evicted the etcd pod...

@openshift-merge-robot
Copy link
Contributor

/retest

1 similar comment
@openshift-merge-robot
Copy link
Contributor

/retest

@abhinavdahiya
Copy link
Contributor Author

/retest

@sjenning
Copy link
Contributor

sjenning commented Feb 7, 2019

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 7, 2019
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, sjenning

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@abhinavdahiya
Copy link
Contributor Author

rate limiting errors e2e-aws

level=warning msg="Found override for ReleaseImage. Please be warned, this is not advised"
level=info msg="Consuming \"Install Config\" from target directory"
level=info msg="Creating cluster..."
level=error
level=error msg="Error: Error applying plan:"
level=error
level=error msg="2 errors occurred:"
level=error msg="\t* module.vpc.aws_route_table_association.worker_routing[0]: 1 error occurred:"
level=error msg="\t* aws_route_table_association.worker_routing.0: timeout while waiting for state to become 'success' (timeout: 5m0s)"
level=error
level=error
level=error msg="\t* module.vpc.aws_route_table_association.route_net[5]: 1 error occurred:"
level=error msg="\t* aws_route_table_association.route_net.5: timeout while waiting for state to become 'success' (timeout: 5m0s)"
level=error
level=error
level=error
level=error
level=error
level=error msg="Terraform does not automatically rollback in the face of errors."
level=error msg="Instead, your Terraform state file has been partially updated with"
level=error msg="any resources that successfully completed. Please address the error"
level=error msg="above and apply again to incrementally change your infrastructure."
level=error
level=error
level=fatal msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed to apply using Terraform"

will retest in a bit.

@sjenning
Copy link
Contributor

sjenning commented Feb 8, 2019

/retest

1 similar comment
@ashcrow
Copy link
Member

ashcrow commented Feb 9, 2019

/retest

@ashcrow
Copy link
Member

ashcrow commented Feb 9, 2019

clusterversion.config.openshift.io/version condition met
/bin/bash: line 52: /etc/passwd: Permission denied

@ashcrow
Copy link
Member

ashcrow commented Feb 9, 2019

/test e2e-aws-op

@openshift-merge-robot openshift-merge-robot merged commit 943ed14 into openshift:master Feb 9, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants