
Mark operator as degraded if there are any pods in CrashLoopBackOff State #668

Conversation

@tssurya (Contributor) commented Jun 15, 2020

This patch checks the statuses of pods belonging to deployments and daemonsets
that are in a "hung" state. If any of those pods are in the CrashLoopBackOff state,
the operator will be marked as degraded.

Signed-off-by: Surya Seetharaman suryaseetharaman.9@gmail.com
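
For context, a minimal sketch (not part of this patch) of what "marked as degraded" amounts to on the network ClusterOperator, expressed with the github.com/openshift/api/config/v1 condition types; the helper name and the Reason/Message strings below are illustrative only.

package example // hypothetical package, for illustration only

import (
	"strings"

	configv1 "github.com/openshift/api/config/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// crashLoopDegradedCondition (hypothetical helper) builds a Degraded=True
// condition from human-readable descriptions of crash-looping pods.
func crashLoopDegradedCondition(hung []string) configv1.ClusterOperatorStatusCondition {
	return configv1.ClusterOperatorStatusCondition{
		Type:               configv1.OperatorDegraded,
		Status:             configv1.ConditionTrue,
		Reason:             "RolloutHung", // illustrative reason string
		Message:            "crash-looping pods: " + strings.Join(hung, ", "),
		LastTransitionTime: metav1.Now(),
	}
}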

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 15, 2020
@pecameron (Contributor)

@tssurya This looks like network pods only? Also, I have encountered some non-network pods in CrashLoopBackOff during bringup. They are fine once the cluster is fully up. Not sure why we try to start pods before the cluster is ready.

@tssurya (Contributor Author) commented Jun 15, 2020

@tssurya This looks like network pods only?

This patch actually covers all the pods in the cluster: any pod in any namespace that is stuck crash-looping will be taken into consideration. Also, this is a really simple patch that just updates the status of the operator accordingly.

Also, I have encountered some non-network pods in CrashLoopBackOff during bringup. They are fine once the cluster is fully up. Not sure why we try to start pods before the cluster is ready.

I am not sure about this; I haven't seen it before (but I'm also relatively new to the CNO and operators in general).

@juanluisvaladas (Contributor)

If pods are in CrashLoopBackOff, the daemonset should report having fewer ready replicas than desired (if it's in CrashLoopBackOff, it's not ready). It looks to me like this duplicates the effort of the kube-controller-manager; is this resolving an observed problem, or is it an improvement?

Also, some components aren't critical and therefore shouldn't be enough to mark the CO as degraded; daemonsets created with the annotation networkoperator.openshift.io/non-critical: "" should be ignored here.
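
For illustration, a minimal sketch of what that skip could look like; the helper name isNonCritical and the assumption that it runs while iterating the operator-managed daemonsets are not part of this PR.

package example // hypothetical package, for illustration only

import appsv1 "k8s.io/api/apps/v1"

// isNonCritical (hypothetical helper) reports whether a daemonset carries the
// networkoperator.openshift.io/non-critical annotation (any value, including
// the empty string), in which case its crash-looping pods should not drive
// the operator to Degraded.
func isNonCritical(ds *appsv1.DaemonSet) bool {
	_, ok := ds.Annotations["networkoperator.openshift.io/non-critical"]
	return ok
}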

@squeed (Contributor) commented Jun 15, 2020

If pods are in CrashLoopBackOff, the daemonset should report having fewer ready replicas than desired (if it's in CrashLoopBackOff, it's not ready). It looks to me like this duplicates the effort of the kube-controller-manager; is this resolving an observed problem, or is it an improvement?

@juanluisvaladas yeah, this is somewhat of a cosmetic improvement: if a pod is in CrashLoopBackOff, we will eventually report Degraded because NumberUnavailable > 0. However, we can do better: if we know that a pod is in CrashLoopBackOff, we should go Degraded immediately.

@tssurya (Contributor Author) commented Jun 15, 2020

If pods are in CrashLoopBackOff, the daemonset should report having fewer ready replicas than desired (if it's in CrashLoopBackOff, it's not ready).

Yeah, this is true, the daemonset reports having fewer ready replicas:

} else if ds.Status.NumberUnavailable > 0 {

and that's when I go in and check specifically for the CrashLoopBackOff case.
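
Roughly speaking, that check boils down to inspecting the pod's container statuses; a minimal sketch under that assumption (not the exact patch):

package example // hypothetical package, for illustration only

import corev1 "k8s.io/api/core/v1"

// podIsCrashLooping (hypothetical helper) reports whether any container in the
// pod is currently waiting with reason CrashLoopBackOff, which is how the
// kubelet surfaces a crash loop in the pod status.
func podIsCrashLooping(pod *corev1.Pod) bool {
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.State.Waiting != nil && cs.State.Waiting.Reason == "CrashLoopBackOff" {
			return true
		}
	}
	return false
}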

It looks to me like this duplicates the effort of the kube-controller-manager; is this resolving an observed problem, or is it an improvement?

It's trying to do both, I guess. Firstly, it fixes an observed problem: say we have a pod that alternates between CrashLoopBackOff and Running or Completed states; the existing logic, which checks the daemonset's status, won't catch it among the hung ones. Secondly, it's also an improvement: we are saying it's better to let the user/operator know that the network is degraded because something is crashing.

Also, some components aren't critical and therefore shouldn't be enough to mark the CO as degraded; daemonsets created with the annotation networkoperator.openshift.io/non-critical: "" should be ignored here.

I didn't know about this; I'll include the !isNonCritical check before doing this. Thanks for pointing it out.

@juanluisvaladas (Contributor)

we are saying it's better to let the user/operator know that the network is degraded because something is crashing.

That's a pretty solid reason. Sounds like a good change.

@tssurya tssurya force-pushed the set_cno_degraded_for_crashlooping_pods branch from a10490e to 8c02395 Compare June 16, 2020 19:31
@tssurya tssurya changed the title [WIP] Mark operator as degraded if there are any pods in CrashLoopBackOff State Mark operator as degraded if there are any pods in CrashLoopBackOff State Jun 16, 2020
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 16, 2020
@juanluisvaladas (Contributor) left a comment

Just a couple nitpicks

pkg/controller/statusmanager/pod_status.go (review thread, outdated and resolved)
@tssurya tssurya force-pushed the set_cno_degraded_for_crashlooping_pods branch from 8c02395 to 470ea8c Compare June 19, 2020 09:41
@juanluisvaladas (Contributor)

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 19, 2020
@tssurya (Contributor Author) commented Jun 22, 2020

/retest

@tssurya tssurya force-pushed the set_cno_degraded_for_crashlooping_pods branch from 470ea8c to 61d2fb1 Compare June 22, 2020 10:42
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Jun 22, 2020
@tssurya (Contributor Author) commented Jun 22, 2020

/retest

@@ -290,6 +299,27 @@ func (status *StatusManager) setLastPodState(
})
}

func (status *StatusManager) CheckCrashLoopBackOffPods(dName types.NamespacedName, lb map[string]string, name string) []string {
@squeed (Contributor) commented Jun 22, 2020

A few minor cleanups:

  • Rename lb to selector, and name to kind.
  • Write a quick docblock, something like:

CheckCrashLoopBackOffPods checks for pods matching the label selector with
any containers in state CrashLoopBackOff. It returns a human-readable string
for any pod in such a state.

dName should be the name of a DaemonSet or Deployment.
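
For reference, a sketch of how the renamed signature and docblock might read. This is not the merged code: the status.client field is assumed to be the controller-runtime client the status manager already holds, and imports (context, fmt, corev1, apimachinery types, sigs.k8s.io/controller-runtime/pkg/client) are omitted for brevity.

// CheckCrashLoopBackOffPods checks for pods matching the label selector with
// any containers in state CrashLoopBackOff. It returns a human-readable string
// for each pod in such a state. dName should be the name of a DaemonSet or
// Deployment; kind ("DaemonSet" or "Deployment") is used only in the message.
func (status *StatusManager) CheckCrashLoopBackOffPods(dName types.NamespacedName, selector map[string]string, kind string) []string {
	hung := []string{}
	pods := &corev1.PodList{}
	if err := status.client.List(context.TODO(), pods,
		client.InNamespace(dName.Namespace), client.MatchingLabels(selector)); err != nil {
		// Listing failures are surfaced elsewhere; report nothing here.
		return hung
	}
	for _, pod := range pods.Items {
		for _, cs := range pod.Status.ContainerStatuses {
			if cs.State.Waiting != nil && cs.State.Waiting.Reason == "CrashLoopBackOff" {
				hung = append(hung, fmt.Sprintf("%s %q rollout is not making progress - pod %s is crash-looping", kind, dName.String(), pod.Name))
				break
			}
		}
	}
	return hung
}

The break keeps the report to one entry per pod even if several of its containers are crash-looping.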

@tssurya (Contributor Author)

ack, I'll fix these up

@squeed (Contributor) commented Jun 22, 2020

One minor cleanup, then this looks good.
/approve

I'll lgtm when it's good to go. We might have to skip some tests.

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 22, 2020
@tssurya tssurya force-pushed the set_cno_degraded_for_crashlooping_pods branch from 61d2fb1 to 840e67e Compare June 22, 2020 16:27
@squeed (Contributor) commented Jun 22, 2020

/lgtm
/retest

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 22, 2020
@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: juanluisvaladas, squeed, tssurya

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

2 similar comments

@openshift-ci-robot (Contributor) commented Jun 22, 2020

@tssurya: The following test failed, say /retest to rerun all failed tests:

Test name: ci/prow/e2e-vsphere | Commit: 840e67e | Rerun command: /test e2e-vsphere

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 8f41bdb into openshift:master Jun 22, 2020
Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
lgtm: Indicates that a PR is ready to be merged.