
feat: Add KubeDeploymentRolloutStuck #845

Merged: 2 commits into kubernetes-monitoring:master on May 26, 2023

Conversation

@VannTen (Contributor) commented May 9, 2023

This adds an alert that fires when a deployment rollout hits its
spec.progressDeadlineSeconds.
This can have a number of causes, which can originate from the
cluster or the deployment:
- pods taking too much time to start
- cluster at full capacity and deployment surging during upgrade
  (maxSurge > 0).

@povilasv (Contributor)

Question: how do we make sure that the deployment has been stuck for longer than spec.progressDeadlineSeconds?

Is it this condition="Progressing", status="false" label combination?

Would be great to add tests.
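For context, kube-state-metrics exposes the Deployment's Progressing condition as the kube_deployment_status_condition metric, and the Deployment controller sets that condition to False (reason ProgressDeadlineExceeded) once spec.progressDeadlineSeconds is exceeded. A minimal sketch of such a check, written in the selector-templated style used elsewhere in this mixin (the exact expression and `for:` duration merged in this PR may differ), would be:

              kube_deployment_status_condition{condition="Progressing", status="false",%(prefixedNamespaceSelector)s%(kubeStateMetricsSelector)s}
                !=
              0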

@VannTen (Contributor, Author) commented May 11, 2023 via email

This adds an alert when a deployment rollout hits its
`spec.progressDeadlineSeconds`.
This can have a number of causes, which can originate from the
cluster or the deployment:
- pods taking too much time to start
- cluster at full capacity and deployment surging during upgrade
  (maxSurge > 0).
@povilasv (Contributor)

I think in both cases it makes sense for the KubeDeploymentRolloutStuck alert to fire.

@VannTen (Contributor, Author) commented May 11, 2023

I've added some tests

@povilasv (Contributor)

Just noticed that we also have KubeDeploymentReplicasMismatch. Wouldn't both alerts fire at the same time when a deployment is stuck?

@VannTen (Contributor, Author) commented May 16, 2023

              (
                kube_deployment_spec_replicas{%(prefixedNamespaceSelector)s%(kubeStateMetricsSelector)s}
                  >
                kube_deployment_status_replicas_available{%(prefixedNamespaceSelector)s%(kubeStateMetricsSelector)s}
              ) and (
                changes(kube_deployment_status_replicas_updated{%(prefixedNamespaceSelector)s%(kubeStateMetricsSelector)s}[10m])
                  ==
                0
              )

(this is from the KubeDeploymentReplicasMismatch definition)
My initial thought was:
I think the changes part (which I read as "the status.updated field has not changed in the last 10 minutes") means that KubeDeploymentReplicasMismatch won't fire in the same case, since a rollout implies updated replicas, right?

However, the default spec.progressDeadlineSeconds is 600 seconds, aka 10 minutes, so they would indeed fire at the same time, I think. Except in the case of a non-default spec.progressDeadlineSeconds, I guess.

Third thought:
A stuck rollout can perfectly well happen without available replicas being less than spec.replicas: you just need maxSurge > 0, in which case the controller tries to launch a new pod and won't remove the old one until it succeeds.

In conclusion (sorry for the brain dump), I don't think they would fire for the same thing in all cases, but there should be some overlap.

@VannTen (Contributor, Author) commented May 24, 2023

Are there any other questions to address regarding this? Do you think it's mergeable in its current state, or does it need more work?

@povilasv (Contributor)

So I am currently worried about this part:

> However, the default spec.progressDeadlineSeconds is 600 seconds, aka 10 minutes, so they would indeed fire at the same time, I think. Except in the case of a non-default spec.progressDeadlineSeconds, I guess.

Is there any way we can make these alerts not fire together?

@VannTen (Contributor, Author) commented May 24, 2023

You mean we should have an exclusive rather than an inclusive OR between these two, right?

@povilasv (Contributor)

I mean the two alerts shouldn't fire for the same issue.

> Err on the side of removing noisy alerts – over-monitoring is a harder problem to solve than under-monitoring.

From: https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit

@VannTen (Contributor, Author) commented May 24, 2023

I think the case where this can happen (both alerts firing) is:

Deployment is rolling out with maxUnavailable != 0 => the controller can update by removing a pod first => if the deployment can't progress, spec.replicas > status.availableReplicas becomes true and both alerts fire.

So what do you think of adding a "spec.replicas <= status.availableReplicas" condition to KubeDeploymentRolloutStuck? This would narrow the alert to the case where the rollout tries to surge but can't (cluster full, scheduler problems, etc.).

I'm a bit wary of adding that, though, because it seems like it would be designing the alert not to stand on its own.

wdyt ?

@povilasv (Contributor)

Maybe let's leave it as it is. We can add the condition later. I think I'm being a bit nitpicky here :D
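For reference, a sketch of what that extra guard could look like if it is added later: this is not part of the merged change, and the on(namespace, deployment) matching is an assumption that would need to fit the mixin's actual label set.

              kube_deployment_status_condition{condition="Progressing", status="false",%(prefixedNamespaceSelector)s%(kubeStateMetricsSelector)s}
                != 0
              and on (namespace, deployment) (
                kube_deployment_spec_replicas{%(prefixedNamespaceSelector)s%(kubeStateMetricsSelector)s}
                  <=
                kube_deployment_status_replicas_available{%(prefixedNamespaceSelector)s%(kubeStateMetricsSelector)s}
              )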

@povilasv merged commit b5c70aa into kubernetes-monitoring:master on May 26, 2023
6 checks passed