Bug 1705649 : Cluster with halted master did not reschedule operators after 5m of being down #454

Merged

ravisantoshgudimetla
Contributor

Currently, because operators carry unbounded tolerations for every possible taint, they are not evicted from nodes that have a NoExecute taint on them. This PR tightens the conditions under which pods can be scheduled or evicted. The downside is that pods are now very likely to be evicted from nodes that have certain conditions, such as disk-pressure, memory-pressure, or taints added by other controllers (operators). Please make sure this change is OK for your operator/operand before merging this PR.
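
To illustrate the trade-off described above, here is a minimal sketch in Python of how a blanket toleration differs from a bounded one. It is a simplified model, not the code changed in this PR: the `Taint`, `Toleration`, `tolerates`, and `eviction_delay` names are hypothetical, the 120-second value and the `example.com/maintenance` taint key are made up for illustration, and the matching rules only approximate what Kubernetes does.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""        # empty key with operator "Exists" matches every taint key
    operator: str = "Equal"
    value: str = ""
    effect: str = ""     # empty effect matches every effect
    toleration_seconds: Optional[int] = None  # None means "tolerate indefinitely"


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Simplified matching: the effect and key must agree unless left empty."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.key and tol.key != taint.key:
        return False
    if tol.operator == "Exists":
        return True
    return tol.value == taint.value


def eviction_delay(tolerations: List[Toleration], taint: Taint) -> Optional[int]:
    """Seconds a pod stays on a node after a NoExecute taint appears.

    Returns 0 if the taint is not tolerated (evicted right away), None if it
    is tolerated indefinitely, otherwise the smallest bounded toleration.
    """
    matching = [t for t in tolerations if tolerates(t, taint)]
    if not matching:
        return 0
    bounded = [t.toleration_seconds for t in matching
               if t.toleration_seconds is not None]
    return min(bounded) if bounded else None


if __name__ == "__main__":
    unreachable = Taint("node.kubernetes.io/unreachable", "NoExecute")
    other = Taint("example.com/maintenance", "NoExecute")  # hypothetical taint

    blanket = [Toleration(operator="Exists")]  # tolerates everything, forever
    bounded = [Toleration(key="node.kubernetes.io/unreachable",
                          operator="Exists", effect="NoExecute",
                          toleration_seconds=120)]  # illustrative value

    for name, tols in (("blanket", blanket), ("bounded", bounded)):
        print(name, eviction_delay(tols, unreachable), eviction_delay(tols, other))
```

With the blanket toleration the pod is never evicted from a halted node, which is the behavior reported in the bug. With the bounded toleration the pod leaves the unreachable node after the delay, but it is also evicted immediately by any NoExecute taint it no longer tolerates, which is the downside called out above.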

/cc @sjenning @smarterclayton

@openshift-ci-robot openshift-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label May 2, 2019
@mfojtik
Member

mfojtik commented May 3, 2019

@ravisantoshgudimetla kube apiserver is static pod... I heard from @sjenning that static pods can't be evicted (ever). So I wonder if this PR makes sense for static pods (KAS, KCM and KSM)

@deads2k
Contributor

deads2k commented May 3, 2019

/lgtm

@openshift-ci-robot openshift-ci-robot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels May 3, 2019
@ravisantoshgudimetla
Contributor Author

kube apiserver is static pod... I heard from @sjenning that static pods can't be evicted (ever). So I wonder if this PR makes sense for static pods (KAS, KCM and KSM)

Static pods can be evicted (by the kubelet, if they are not critical pods), but we have decided not to apply tolerations to the static pods. I will remove the tolerations for them soon.

@ravisantoshgudimetla
Contributor Author

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 3, 2019
@openshift-ci-robot openshift-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed lgtm Indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 3, 2019
@ravisantoshgudimetla
Contributor Author

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 7, 2019
@mfojtik
Member

mfojtik commented May 7, 2019

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 7, 2019
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, mfojtik, ravisantoshgudimetla

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ravisantoshgudimetla
Contributor Author

/retest

@ravisantoshgudimetla
Contributor Author

error: could not run steps: step e2e-aws-operator failed: template pod "e2e-aws-operator" failed: the pod ci-op-jick3xv8/e2e-aws-operator failed after 53m39s (failed containers: test): ContainerFailed one or more containers exited

FAIL: github.com/openshift/cluster-kube-apiserver-operator/test/e2e TestNamedCertificates/User_three.test 1m0.25s
FAIL: github.com/openshift/cluster-kube-apiserver-operator/test/e2e TestNamedCertificates 3m8.1s

@mfojtik @deads2k, seeing the above error in e2e-operator failures ^

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@ravisantoshgudimetla
Contributor Author

ns/openshift-machine-config-operator pod/etcd-quorum-guard-57cff7b968-jplzg 0/6 nodes are available: 1 node(s) were unschedulable, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't satisfy existing pods anti-affinity rules, 3 node(s) didn't match node selector. (61 times)

/cc @RobertKrawitz

@eparis
Member

eparis commented May 7, 2019

/retest

@RobertKrawitz

@openshift-merge-robot openshift-merge-robot merged commit 0c66803 into openshift:master May 7, 2019
@smarterclayton
Contributor

As we get closer to release, please ensure code changes have a bug and the bug is associated in the PR title - follow the conventions described in previous emails about how to associate bugs with PRs. The PR title must be Bug XXXX: <description>.

This PR did not get a correct title because it used "Bug XXXXXX :" (note the extra space before the colon).
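
As a rough illustration of the title convention, the following Python sketch checks a PR title against the "Bug XXXX: <description>" shape. The regular expression and the `has_valid_bug_title` helper are assumptions made for this example; the thread does not show how the actual tooling matches titles.

```python
import re

# Assumed pattern: "Bug", one space, digits, a colon directly after the
# digits, one space, then a non-empty description.
BUG_TITLE = re.compile(r"^Bug \d+: \S.*$")


def has_valid_bug_title(title: str) -> bool:
    """Return True if the title follows the assumed 'Bug XXXX: <description>' form."""
    return BUG_TITLE.match(title) is not None


if __name__ == "__main__":
    for title in (
        "Bug 1705649 : Cluster with halted master did not reschedule operators after 5m of being down",
        "Bug 1705649: Cluster with halted master did not reschedule operators after 5m of being down",
    ):
        print(has_valid_bug_title(title), title)
```

Under this assumed pattern, the title of this PR fails only because of the space before the colon.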
