templates: add priority class `system-node-critical` to etcd pod #353

Merged
merged 2 commits into from Feb 9, 2019

8 participants
@@ -120,6 +120,9 @@ contents:
 containerPort: 2379
 protocol: TCP
 hostNetwork: true
+priorityClassName: system-node-critical

smarterclayton Jan 29, 2019

Member

@sjenning are there any other magic incantations missing from etcd here?

sjenning Feb 5, 2019

Contributor

really should be system-cluster-critical

abhinavdahiya Feb 5, 2019

Author Member

from https://docs.okd.io/latest/admin_guide/scheduling/priority_preemption.html#admin-guide-priority-preemption-priority-class

> System-cluster-critical - This priority class has a value of 2000000000 (two billion) and is used with pods that are important for the cluster. Pods with this priority class can be evicted from a node in certain circumstances. For example, pods configured with the system-node-critical priority class can take priority. However, this priority class does ensure guaranteed scheduling. Examples of pods that can have this priority class are fluentd, add-on components like descheduler, and so forth.

seems like cluster-critical can be evicted...
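To make the ordering in the quoted docs concrete, here is a minimal sketch comparing the two built-in priority classes. The value for system-cluster-critical is the one quoted above; the value for system-node-critical (2000001000) is the upstream Kubernetes default and is not stated in this thread. The `outranks` helper is a hypothetical name used only for illustration, and the comparison says nothing about how the kubelet actually chooses eviction candidates.

```python
# Values of the two built-in Kubernetes priority classes discussed above.
# system-cluster-critical is quoted in the docs excerpt; system-node-critical
# is the upstream Kubernetes default (assumption: unchanged in this cluster).
BUILTIN_PRIORITY = {
    "system-node-critical": 2_000_001_000,
    "system-cluster-critical": 2_000_000_000,
}


def outranks(class_a: str, class_b: str) -> bool:
    """Return True if pods in class_a have strictly higher priority than class_b."""
    return BUILTIN_PRIORITY[class_a] > BUILTIN_PRIORITY[class_b]


if __name__ == "__main__":
    # Node-critical pods sit above cluster-critical pods.
    print(outranks("system-node-critical", "system-cluster-critical"))
```

Under these values, a system-node-critical pod always compares higher than a system-cluster-critical one, which matches the docs' remark that node-critical pods "can take priority".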

smarterclayton Feb 7, 2019

Member

Technically on that node the static pod is critical (so I can buy node critical). Seems cluster critical may need to be defined better. Also etcd is special.

Contributor

cgwalters commented Feb 1, 2019

Looks like this needs a rebase. (I am still tempted to change the template unit tests to sanity-check only one or two generated files, not all of them.)

@abhinavdahiya abhinavdahiya force-pushed the abhinavdahiya:etcd_priority_class branch from a073a91 to a75a166 Feb 5, 2019

Member

crawford commented Feb 5, 2019

I tried testing this locally and I'm still seeing kubelet preempt etcd-member. Either my test procedure is faulty or this isn't sufficient.

Member Author

abhinavdahiya commented Feb 5, 2019

> I tried testing this locally and I'm still seeing kubelet preempt etcd-member. Either my test procedure is faulty or this isn't sufficient.

@sjenning on a local cluster that doesn't have this change, the etcd pod was evicted.

We edited the etcd static pod on the node to include these changes and restarted the kubelet, but the kubelet still evicted the etcd pod...
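When repeating a test like this, it can help to first confirm that the edited manifest really carries the new field. The sketch below is only an illustration: the manifest is a hypothetical, trimmed-down JSON rendering of the static pod (only `containerPort`, `protocol`, `hostNetwork` and `priorityClassName` come from the diff in this PR), and `priority_class_of` is a made-up helper name. It checks the manifest only and does not explain the eviction behavior reported above.

```python
import json
from typing import Optional

# Hypothetical, trimmed-down static pod manifest; only the fields visible in
# the diff (containerPort, protocol, hostNetwork, priorityClassName) come from
# this PR, everything else is illustrative.
MANIFEST_JSON = """
{
  "apiVersion": "v1",
  "kind": "Pod",
  "metadata": {"name": "etcd-member"},
  "spec": {
    "hostNetwork": true,
    "priorityClassName": "system-node-critical",
    "containers": [
      {
        "name": "etcd-member",
        "ports": [{"containerPort": 2379, "protocol": "TCP"}]
      }
    ]
  }
}
"""


def priority_class_of(manifest: dict) -> Optional[str]:
    """Return spec.priorityClassName, or None if the manifest does not set it."""
    return manifest.get("spec", {}).get("priorityClassName")


manifest = json.loads(MANIFEST_JSON)

if __name__ == "__main__":
    print(priority_class_of(manifest))
```

A manifest without the field returns `None`, which makes a missing or misplaced `priorityClassName` easy to spot before restarting the kubelet.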

Contributor

openshift-merge-robot commented Feb 6, 2019

/retest

Contributor

openshift-merge-robot commented Feb 6, 2019

/retest

@abhinavdahiya abhinavdahiya force-pushed the abhinavdahiya:etcd_priority_class branch from a75a166 to 266b6d2 Feb 6, 2019

Member Author

abhinavdahiya commented Feb 7, 2019

/retest

Contributor

sjenning commented Feb 7, 2019

/lgtm

openshift-ci-robot commented Feb 7, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, sjenning

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Member Author

abhinavdahiya commented Feb 7, 2019

rate limiting errors e2e-aws

level=warning msg="Found override for ReleaseImage. Please be warned, this is not advised"
level=info msg="Consuming \"Install Config\" from target directory"
level=info msg="Creating cluster..."
level=error
level=error msg="Error: Error applying plan:"
level=error
level=error msg="2 errors occurred:"
level=error msg="\t* module.vpc.aws_route_table_association.worker_routing[0]: 1 error occurred:"
level=error msg="\t* aws_route_table_association.worker_routing.0: timeout while waiting for state to become 'success' (timeout: 5m0s)"
level=error
level=error
level=error msg="\t* module.vpc.aws_route_table_association.route_net[5]: 1 error occurred:"
level=error msg="\t* aws_route_table_association.route_net.5: timeout while waiting for state to become 'success' (timeout: 5m0s)"
level=error
level=error
level=error
level=error
level=error
level=error msg="Terraform does not automatically rollback in the face of errors."
level=error msg="Instead, your Terraform state file has been partially updated with"
level=error msg="any resources that successfully completed. Please address the error"
level=error msg="above and apply again to incrementally change your infrastructure."
level=error
level=error
level=fatal msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed to apply using Terraform"

will retest in a bit.

Contributor

sjenning commented Feb 8, 2019

/retest

Member

ashcrow commented Feb 9, 2019

/retest

Member

ashcrow commented Feb 9, 2019

clusterversion.config.openshift.io/version condition met
/bin/bash: line 52: /etc/passwd: Permission denied

Member

ashcrow commented Feb 9, 2019

/test e2e-aws-op

@openshift-merge-robot openshift-merge-robot merged commit 943ed14 into openshift:master Feb 9, 2019

6 checks passed

ci/prow/e2e-aws Job succeeded.
ci/prow/e2e-aws-op Job succeeded.
ci/prow/images Job succeeded.
ci/prow/rhel-images Job succeeded.
ci/prow/unit Job succeeded.
tide In merge pool.