Mark operator as degraded if there are any pods in CrashLoopBackOff State #668
Conversation
@tssurya This looks like it covers network pods only? Also, I have encountered some non-network pods in CrashLoopBackOff during bringup; they are fine once the cluster is fully up. Not sure why we try to start pods before the cluster is ready.
This patch actually covers all the pods in the cluster - any pod in any namespace that is stuck crash looping will be taken into consideration. It is also a really simple patch that just updates the status of the operator accordingly.
I am not sure about this; I haven't seen it before (but I'm also relatively new to CNO and operators in general).
If pods are in CrashLoopBackOff, the daemonset should report fewer ready replicas than desired (a pod in CrashLoopBackOff is not ready). To me this looks like it duplicates the effort of the kube-controller-manager - is this resolving an observed problem, or is it an improvement? Also, some components aren't critical and therefore shouldn't be enough to mark the CO as degraded, those daemonsets created with the annotation
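For context, the existing signal the reviewer is referring to is visible directly on the DaemonSet status. A minimal sketch using the upstream k8s.io/api types (the helper name daemonSetLagging is hypothetical, not code from this repo):

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
)

// daemonSetLagging reports whether the DaemonSet currently has fewer ready
// pods than it wants scheduled - the condition the existing rollout checks
// can key off when a rollout is not progressing.
func daemonSetLagging(ds *appsv1.DaemonSet) bool {
	return ds.Status.NumberReady < ds.Status.DesiredNumberScheduled
}

func main() {
	ds := &appsv1.DaemonSet{}
	ds.Status.DesiredNumberScheduled = 3
	ds.Status.NumberReady = 2
	fmt.Println(daemonSetLagging(ds)) // true
}
```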
@juanluisvaladas yeah, this is somewhat of a cosmetic improvement: if a pod is in CrashLoopBackOff, we will eventually report degraded because
Yeah, this is true - the daemonset does report having fewer ready replicas than desired.
It's trying to do both, I guess. First, it fixes an observed problem: say we have a pod that is alternating between CrashLoopBackOff and Running/Completed states - the existing logic, which only checks the daemonset's status, won't catch it among the hung ones (the per-pod check is roughly what the sketch below shows). Second, it's also an improvement: it's better to let the user/operator know that the network is degraded because something is crashing.
I didn't know about this - I'll include the !isNonCritical check before doing this. Thanks for pointing it out.
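A minimal sketch of that per-pod check, again using only the upstream k8s.io/api types (hasCrashLoopingContainer is a hypothetical helper, not the PR's code), showing how CrashLoopBackOff surfaces on a pod even when it flaps back to Running between restarts:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// hasCrashLoopingContainer reports whether any container in the pod is
// currently waiting with reason CrashLoopBackOff. A pod that alternates
// between Running and CrashLoopBackOff can be caught this way even if the
// owning DaemonSet momentarily reports all replicas ready.
func hasCrashLoopingContainer(pod *v1.Pod) bool {
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.State.Waiting != nil && cs.State.Waiting.Reason == "CrashLoopBackOff" {
			return true
		}
	}
	return false
}

func main() {
	pod := &v1.Pod{
		Status: v1.PodStatus{
			ContainerStatuses: []v1.ContainerStatus{
				{State: v1.ContainerState{Waiting: &v1.ContainerStateWaiting{Reason: "CrashLoopBackOff"}}},
			},
		},
	}
	fmt.Println(hasCrashLoopingContainer(pod)) // true
}
```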
That's a pretty solid reason. Sounds like a good change.
Force-pushed from a10490e to 8c02395
Just a couple nitpicks
Force-pushed from 8c02395 to 470ea8c
/lgtm
/retest
Force-pushed from 470ea8c to 61d2fb1
/retest
@@ -290,6 +299,27 @@ func (status *StatusManager) setLastPodState(
	})
}

func (status *StatusManager) CheckCrashLoopBackOffPods(dName types.NamespacedName, lb map[string]string, name string) []string {
A few minor cleanups:
- Rename lb to selector, and name to kind.
- Write a quick docblock, something like:

  CheckCrashLoopBackOffPods checks for pods in the label selector with
  any containers in state CrashLoopBackOff. It returns a human-readable string
  for any pod in such a state.
  dName should be the name of a DaemonSet or Deployment.
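Applied, the suggested renames and docblock might look roughly like this. This is a sketch only, assuming a controller-runtime style client field on StatusManager; the field name, error handling, and message wording are illustrative guesses, not the merged code:

```go
package status

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// StatusManager here is a stand-in; the real type has more fields.
type StatusManager struct {
	client client.Client
}

// CheckCrashLoopBackOffPods checks for pods matching the label selector with
// any containers in state CrashLoopBackOff. It returns a human-readable
// string for any pod in such a state. dName should be the name of a
// DaemonSet or Deployment.
func (status *StatusManager) CheckCrashLoopBackOffPods(dName types.NamespacedName, selector map[string]string, kind string) []string {
	hung := []string{}
	pods := &v1.PodList{}
	if err := status.client.List(context.TODO(), pods,
		client.InNamespace(dName.Namespace), client.MatchingLabels(selector)); err != nil {
		// A real implementation would surface the error; the sketch simply
		// reports nothing rather than guessing.
		return hung
	}
	for _, pod := range pods.Items {
		for _, cs := range pod.Status.ContainerStatuses {
			if cs.State.Waiting != nil && cs.State.Waiting.Reason == "CrashLoopBackOff" {
				hung = append(hung, fmt.Sprintf("%s %q rollout is not making progress - pod %s is in CrashLoopBackOff state",
					kind, dName.String(), pod.Name))
				break
			}
		}
	}
	return hung
}
```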
ack, I'll fix these up
One minor cleanup, then this looks good. I'll lgtm when it's good to go. We might have to skip some tests.
This patch checks the statuses of the pods belonging to those deployments and daemonsets that are in a "hung" state. If any of the pods is in the CrashLoopBackOff state, the operator will be marked as degraded. Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
Force-pushed from 61d2fb1 to 840e67e
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: juanluisvaladas, squeed, tssurya. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/retest Please review the full test history for this PR and help us cut down flakes.
@tssurya: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/retest Please review the full test history for this PR and help us cut down flakes.