Bug 1705649 : Cluster with halted master did not reschedule operators after 5m of being down #454

ravisantoshgudimetla · 2019-05-02T21:56:07Z

Currently, because operators carry infinite tolerations for every possible taint, they are not evicted from nodes that have a NoExecute taint. This PR tightens the conditions under which pods can be scheduled or evicted. The downside is that pods are now quite likely to be evicted from nodes with conditions such as disk-pressure or memory-pressure, or with taints added by other controllers (operators). Please make sure this change is OK for your operator/operand before merging this PR.

/cc @sjenning @smarterclayton

mfojtik · 2019-05-03T11:27:40Z

@ravisantoshgudimetla kube apiserver is static pod... I heard from @sjenning that static pods can't be evicted (ever). So I wonder if this PR makes sense for static pods (KAS, KCM and KSM)

deads2k · 2019-05-03T19:39:59Z

/lgtm

ravisantoshgudimetla · 2019-05-03T19:46:56Z

> kube apiserver is static pod... I heard from @sjenning that static pods can't be evicted (ever). So I wonder if this PR makes sense for static pods (KAS, KCM and KSM)

Static pods can be evicted (by the kubelet, if they're not critical pods), but we have decided not to apply tolerations to the static pods. I will remove the tolerations for them soon.

ravisantoshgudimetla · 2019-05-03T19:47:04Z

/hold

ravisantoshgudimetla · 2019-05-07T12:34:18Z

/hold cancel

mfojtik · 2019-05-07T12:45:10Z

/lgtm

openshift-ci-robot · 2019-05-07T12:45:42Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, mfojtik, ravisantoshgudimetla

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [deads2k,mfojtik]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ravisantoshgudimetla · 2019-05-07T13:36:24Z

/retest

ravisantoshgudimetla · 2019-05-07T14:36:04Z

error: could not run steps: step e2e-aws-operator failed: template pod "e2e-aws-operator" failed: the pod ci-op-jick3xv8/e2e-aws-operator failed after 53m39s (failed containers: test): ContainerFailed one or more containers exited

FAIL: github.com/openshift/cluster-kube-apiserver-operator/test/e2e TestNamedCertificates/User_three.test 1m0.25s
FAIL: github.com/openshift/cluster-kube-apiserver-operator/test/e2e TestNamedCertificates 3m8.1s

@mfojtik @deads2k, I'm seeing the above error in the e2e-aws-operator failures ^

openshift-bot · 2019-05-07T15:16:36Z

/retest

Please review the full test history for this PR and help us cut down flakes.

ravisantoshgudimetla · 2019-05-07T16:26:52Z

ns/openshift-machine-config-operator pod/etcd-quorum-guard-57cff7b968-jplzg 0/6 nodes are available: 1 node(s) were unschedulable, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't satisfy existing pods anti-affinity rules, 3 node(s) didn't match node selector. (61 times)

/cc @RobertKrawitz

eparis · 2019-05-07T17:53:09Z

/retest

RobertKrawitz · 2019-05-07T18:05:25Z

Ref https://bugzilla.redhat.com/show_bug.cgi?id=1707212#c6

smarterclayton · 2019-05-13T21:21:10Z

As we get closer to release, please ensure code changes have a bug and the bug is associated in the PR title - follow the conventions described in previous emails about how to associate bugs with PRs. The PR title must be Bug XXXX: <description>.

This PR didn't get the correct title because it was written as Bug XXXXXX : (extra space before the colon).

openshift-ci-robot requested review from sjenning and smarterclayton May 2, 2019 21:56

openshift-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label May 2, 2019

openshift-ci-robot assigned deads2k May 3, 2019

openshift-ci-robot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels May 3, 2019

openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 3, 2019

ravisantoshgudimetla force-pushed the fix-taints branch from a19cd8c to 1ccf428 Compare May 3, 2019 21:11

openshift-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed lgtm Indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 3, 2019

Fix tolerations

9ad60c1

ravisantoshgudimetla force-pushed the fix-taints branch from 1ccf428 to 9ad60c1 Compare May 6, 2019 21:05

openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 7, 2019

openshift-ci-robot assigned mfojtik May 7, 2019

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 7, 2019

openshift-ci-robot requested a review from RobertKrawitz May 7, 2019 16:26

openshift-merge-robot merged commit 0c66803 into openshift:master May 7, 2019
