feat: Add KubeDeploymentRolloutStuck #845
(force-pushed from 2f261d0 to 0ed6536)
Question: how do we make sure that the deployment is stuck for longer than the deadline? Is it this condition? Would be great to add tests.
Yes, that condition is set to false by the controller when exceeding the deadline (https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#failed-deployment). I just noticed it's not the only possible reason though; it can also come from ReplicaSetCreateError. Reasons aren't included in metric labels (AFAICT), so we can't distinguish. Would that be a problem? We care about the symptom rather than the cause, right?
I'll look at the test syntax and add something shortly.
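For context, a rough sketch of the rule shape being discussed, in the mixin's jsonnet style. The selectors and the `for` duration here are illustrative, not necessarily what this PR ends up with; it relies on kube-state-metrics exposing the `Progressing` condition via `kube_deployment_status_condition`:

```
{
  alert: 'KubeDeploymentRolloutStuck',
  // Fires when the controller has marked the rollout as no longer progressing,
  // e.g. after spec.progressDeadlineSeconds is exceeded.
  expr: |||
    kube_deployment_status_condition{condition="Progressing", status="false",%(prefixedNamespaceSelector)s%(kubeStateMetricsSelector)s}
    != 0
  ||| % $._config,
  'for': '15m',  // illustrative duration
}
```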
This adds an alert when a deployment rollout hits its `spec.progressDeadlineSeconds`. These could have a number of causes, which can originate from the cluster or the deployment:
- pods taking too long to start
- cluster at full capacity and deployment surging during upgrade (maxSurge > 0)
(force-pushed from 0ed6536 to a35c53b)
I think in both cases the alert …
I've added some tests.
Just noticed that we also have KubeDeploymentReplicasMismatch:
```
(
  kube_deployment_spec_replicas{%(prefixedNamespaceSelector)s%(kubeStateMetricsSelector)s}
    >
  kube_deployment_status_replicas_available{%(prefixedNamespaceSelector)s%(kubeStateMetricsSelector)s}
) and (
  changes(kube_deployment_status_replicas_updated{%(prefixedNamespaceSelector)s%(kubeStateMetricsSelector)s}[10m])
    ==
  0
)
```

(This is from the KubeDeploymentReplicasMismatch def.) However, the default … Third thought: … In conclusion (sorry for that brain dump): I don't think they would fire for the same thing in all cases, but there should be some overlap.
Are there any other questions to address regarding this? Do you think it's mergeable in its current state, or does it need more work?
So I am currently worried about this part:
Anyway, can we make these alerts not fire together?
You mean we should have an exclusive rather than an inclusive OR between these two, right?
I mean two alerts shouldn't fire for the same issue.
From: https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit
I think the case where this can happen (both alerts firing) is: the deployment is rolling out, with … So what do you think of adding a `spec.replicas <= status.availableReplicas` condition to KubeDeploymentRolloutStuck? This should narrow the alert to the case where the rollout tries to surge but can't (cluster full, scheduler problems, etc.). I'm a bit wary of adding that though, because it seems like it would be designing the alert not to stand on its own. wdyt?
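For illustration only (this condition was not adopted), the narrowed expression would look roughly like the following, using kube-state-metrics' `kube_deployment_spec_replicas` and `kube_deployment_status_replicas_available` series; the base condition on the left is a sketch of the rule under discussion, not its final form:

```
kube_deployment_status_condition{condition="Progressing", status="false",%(prefixedNamespaceSelector)s%(kubeStateMetricsSelector)s}
  != 0
and
# only fire when the rollout is stuck without a replica shortfall,
# i.e. it is likely failing to surge rather than failing to recover
kube_deployment_spec_replicas{%(prefixedNamespaceSelector)s%(kubeStateMetricsSelector)s}
  <=
kube_deployment_status_replicas_available{%(prefixedNamespaceSelector)s%(kubeStateMetricsSelector)s}
```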
Maybe let's leave it as it is. We can add the condition later. I think I'm being a bit nitpicky here :D
This adds an alert when a deployment rollout hits its `spec.progressDeadlineSeconds`. These could have a number of causes, which can originate from the cluster or the deployment:
- pods taking too long to start
- cluster at full capacity and deployment surging during upgrade (maxSurge > 0)

Relevant doc: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#failed-deployment
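For completeness, `spec.progressDeadlineSeconds` is set on the Deployment itself. A minimal illustration (the name and image are hypothetical; 600 seconds is the Kubernetes default):

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example            # hypothetical name
spec:
  # After this many seconds without progress, the controller sets the
  # Progressing condition to False with reason ProgressDeadlineExceeded.
  progressDeadlineSeconds: 600   # the default value
  replicas: 3
  selector:
    matchLabels: {app: example}
  template:
    metadata:
      labels: {app: example}
    spec:
      containers:
      - name: app
        image: nginx:1.25  # hypothetical image
```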