
Mark operator as degraded if there are any pods in CrashLoopBackOff State #668

Conversation

@tssurya (Contributor) commented Jun 15, 2020

This patch checks the statuses of pods belonging to deployments and daemonsets
that are in a "hung" state. If any of those pods are in the CrashLoopBackOff state,
the operator will be marked as degraded.

Signed-off-by: Surya Seetharaman suryaseetharaman.9@gmail.com
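
For context, a minimal sketch (not part of this patch) of what "marked as degraded" amounts to on the network ClusterOperator, expressed with the github.com/openshift/api/config/v1 condition types; the helper name and the Reason/Message strings below are illustrative only.

package example // hypothetical package, for illustration only

import (
	"strings"

	configv1 "github.com/openshift/api/config/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// crashLoopDegradedCondition (hypothetical helper) builds a Degraded=True
// condition from human-readable descriptions of crash-looping pods.
func crashLoopDegradedCondition(hung []string) configv1.ClusterOperatorStatusCondition {
	return configv1.ClusterOperatorStatusCondition{
		Type:               configv1.OperatorDegraded,
		Status:             configv1.ConditionTrue,
		Reason:             "RolloutHung", // illustrative reason string
		Message:            "crash-looping pods: " + strings.Join(hung, ", "),
		LastTransitionTime: metav1.Now(),
	}
}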

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 15, 2020
@pecameron (Contributor)

@tssurya This looks like network pods only? Also, I have encountered some non-network pods in CrashLoopBackOff during bringup. They are fine once the cluster is fully up. Not sure why we try to start pods before the cluster is ready.

@tssurya (Contributor Author) commented Jun 15, 2020

@tssurya This looks like network pods only?

This patch actually covers all the pods in the cluster: any pod in any namespace that is stuck crash-looping will be taken into consideration. Also, this is a really simple patch that just updates the status of the operator accordingly.

Also, I have encountered some non-network pods in CrashLoopBackOff during bringup. They are fine once the cluster is fully up. Not sure why we try to start pods before the cluster is ready.

I am not sure about this; I haven't seen it before (but I'm also relatively new to the CNO and operators in general).

@juanluisvaladas (Contributor)

If pods are in CrashLoopBackOff, the daemonset should report having fewer ready replicas than desired (if it's in CrashLoopBackOff, it's not ready). It looks to me like this duplicates the effort of the kube-controller-manager; is this resolving an observed problem, or is it an improvement?

Also, some components aren't critical and therefore shouldn't be enough to mark the CO as degraded; daemonsets created with the annotation networkoperator.openshift.io/non-critical: "" should be ignored here.
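
For illustration, a minimal sketch of what that skip could look like; the helper name isNonCritical and the assumption that it runs while iterating the operator-managed daemonsets are not part of this PR.

package example // hypothetical package, for illustration only

import appsv1 "k8s.io/api/apps/v1"

// isNonCritical (hypothetical helper) reports whether a daemonset carries the
// networkoperator.openshift.io/non-critical annotation (any value, including
// the empty string), in which case its crash-looping pods should not drive
// the operator to Degraded.
func isNonCritical(ds *appsv1.DaemonSet) bool {
	_, ok := ds.Annotations["networkoperator.openshift.io/non-critical"]
	return ok
}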

@squeed (Contributor) commented Jun 15, 2020

If pods are in CrashLoopBackOff, the daemonset should report having fewer ready replicas than desired (if it's in CrashLoopBackOff, it's not ready). It looks to me like this duplicates the effort of the kube-controller-manager; is this resolving an observed problem, or is it an improvement?

@juanluisvaladas yeah, this is somewhat of a cosmetic improvement: if a pod is in CrashLoopBackOff, we will eventually report Degraded because NumberUnavailable > 0. However, we can do better: if we know that a pod is in CrashLoopBackOff, we should go Degraded immediately.

@tssurya (Contributor Author) commented Jun 15, 2020

If pods are in CrashLoopBackOff, the daemonset should report having fewer ready replicas than desired (if it's in CrashLoopBackOff, it's not ready).

Yeah, this is true, the daemonset reports having fewer ready replicas:

} else if ds.Status.NumberUnavailable > 0 {

and that's when I go in and check specifically for the CrashLoopBackOff case.
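
Roughly speaking, that check boils down to inspecting the pod's container statuses; a minimal sketch under that assumption (not the exact patch):

package example // hypothetical package, for illustration only

import corev1 "k8s.io/api/core/v1"

// podIsCrashLooping (hypothetical helper) reports whether any container in the
// pod is currently waiting with reason CrashLoopBackOff, which is how the
// kubelet surfaces a crash loop in the pod status.
func podIsCrashLooping(pod *corev1.Pod) bool {
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.State.Waiting != nil && cs.State.Waiting.Reason == "CrashLoopBackOff" {
			return true
		}
	}
	return false
}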

It looks to me like this duplicates the effort of the kube-controller-manager; is this resolving an observed problem, or is it an improvement?

It's trying to do both, I guess. Firstly, it fixes an observed problem: say we have a pod that alternates between CrashLoopBackOff and Running or Completed states; the existing logic, which checks the daemonset's status, won't catch it among the hung ones. Secondly, it's also an improvement: we are saying it's better to let the user/operator know that the network is degraded because something is crashing.

Also, some components aren't critical and therefore shouldn't be enough to mark the CO as degraded; daemonsets created with the annotation networkoperator.openshift.io/non-critical: "" should be ignored here.

I didn't know about this; I'll include the !isNonCritical check before doing this. Thanks for pointing it out.

@juanluisvaladas (Contributor)

we are saying it's better to let the user/operator know that the network is degraded because something is crashing.

That's a pretty solid reason. Sounds like a good change.

@tssurya tssurya force-pushed the set_cno_degraded_for_crashlooping_pods branch from a10490e to 8c02395 Compare June 16, 2020 19:31
@tssurya tssurya changed the title [WIP] Mark operator as degraded if there are any pods in CrashLoopBackOff State Mark operator as degraded if there are any pods in CrashLoopBackOff State Jun 16, 2020
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 16, 2020
@juanluisvaladas (Contributor) left a comment

Just a couple nitpicks

pkg/controller/statusmanager/pod_status.go (review thread, outdated and resolved)
@tssurya tssurya force-pushed the set_cno_degraded_for_crashlooping_pods branch from 8c02395 to 470ea8c Compare June 19, 2020 09:41
@juanluisvaladas (Contributor)

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 19, 2020
@tssurya (Contributor Author) commented Jun 22, 2020

/retest

@tssurya tssurya force-pushed the set_cno_degraded_for_crashlooping_pods branch from 470ea8c to 61d2fb1 Compare June 22, 2020 10:42
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Jun 22, 2020
@tssurya (Contributor Author) commented Jun 22, 2020

/retest

@@ -290,6 +299,27 @@ func (status *StatusManager) setLastPodState(
})
}

func (status *StatusManager) CheckCrashLoopBackOffPods(dName types.NamespacedName, lb map[string]string, name string) []string {
@squeed (Contributor) commented Jun 22, 2020

A few minor cleanups:

  • Rename lb to selector, and name to kind.
  • Write a quick docblock, something like:

CheckCrashLoopBackOffPods checks for pods matching the label selector with
any containers in state CrashLoopBackOff. It returns a human-readable string
for any pod in such a state.

dName should be the name of a DaemonSet or Deployment.
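
For reference, a sketch of how the renamed signature and docblock might read. This is not the merged code: the status.client field is assumed to be the controller-runtime client the status manager already holds, and imports (context, fmt, corev1, apimachinery types, sigs.k8s.io/controller-runtime/pkg/client) are omitted for brevity.

// CheckCrashLoopBackOffPods checks for pods matching the label selector with
// any containers in state CrashLoopBackOff. It returns a human-readable string
// for each pod in such a state. dName should be the name of a DaemonSet or
// Deployment; kind ("DaemonSet" or "Deployment") is used only in the message.
func (status *StatusManager) CheckCrashLoopBackOffPods(dName types.NamespacedName, selector map[string]string, kind string) []string {
	hung := []string{}
	pods := &corev1.PodList{}
	if err := status.client.List(context.TODO(), pods,
		client.InNamespace(dName.Namespace), client.MatchingLabels(selector)); err != nil {
		// Listing failures are surfaced elsewhere; report nothing here.
		return hung
	}
	for _, pod := range pods.Items {
		for _, cs := range pod.Status.ContainerStatuses {
			if cs.State.Waiting != nil && cs.State.Waiting.Reason == "CrashLoopBackOff" {
				hung = append(hung, fmt.Sprintf("%s %q rollout is not making progress - pod %s is crash-looping", kind, dName.String(), pod.Name))
				break
			}
		}
	}
	return hung
}

The break keeps the report to one entry per pod even if several of its containers are crash-looping.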

@tssurya (Contributor Author)

ack, I'll fix these up

@squeed (Contributor) commented Jun 22, 2020

One minor cleanup, then this looks good.
/approve

I'll lgtm when it's good to go. We might have to skip some tests.

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 22, 2020
@tssurya tssurya force-pushed the set_cno_degraded_for_crashlooping_pods branch from 61d2fb1 to 840e67e Compare June 22, 2020 16:27
@squeed (Contributor) commented Jun 22, 2020

/lgtm
/retest

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 22, 2020
@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: juanluisvaladas, squeed, tssurya

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

2 similar comments

@openshift-ci-robot (Contributor) commented Jun 22, 2020

@tssurya: The following test failed, say /retest to rerun all failed tests:

Test name: ci/prow/e2e-vsphere | Commit: 840e67e | Rerun command: /test e2e-vsphere

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 8f41bdb into openshift:master Jun 22, 2020
Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
lgtm: Indicates that a PR is ready to be merged.