
feat: Add KubeDeploymentRolloutStuck #845

Merged: 2 commits into kubernetes-monitoring:master on May 26, 2023

Conversation

@VannTen (Contributor) commented May 9, 2023

This adds an alert that fires when a deployment rollout hits its
spec.progressDeadlineSeconds.
This can have a number of causes, which can originate from the
cluster or the deployment:
- pods taking too much time to start
- cluster at full capacity and deployment surging during upgrade
  (maxSurge > 0).

@povilasv (Contributor)

Question: how do we make sure that the deployment has been stuck for longer than spec.progressDeadlineSeconds?

Is it this condition="Progressing", status="false" label combination?

Would be great to add tests.
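For context, kube-state-metrics exposes the Deployment's Progressing condition as the kube_deployment_status_condition metric, and the Deployment controller sets that condition to False (reason ProgressDeadlineExceeded) once spec.progressDeadlineSeconds is exceeded. A minimal sketch of such a check, written in the selector-templated style used elsewhere in this mixin (the exact expression and `for:` duration merged in this PR may differ), would be:

              kube_deployment_status_condition{condition="Progressing", status="false",%(prefixedNamespaceSelector)s%(kubeStateMetricsSelector)s}
                !=
              0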

@VannTen (Contributor, Author) commented May 11, 2023 via email

This adds an alert when a deployment rollout hits its
`spec.progressDeadlineSeconds`.
This can have a number of causes, which can originate from the
cluster or the deployment:
- pods taking too much time to start
- cluster at full capacity and deployment surging during upgrade
  (maxSurge > 0).
@povilasv (Contributor)

I think in both cases it makes sense for the KubeDeploymentRolloutStuck alert to fire.

@VannTen (Contributor, Author) commented May 11, 2023

I've added some tests

@povilasv (Contributor)

Just noticed that we also have KubeDeploymentReplicasMismatch. Wouldn't both alerts fire at the same time when a deployment is stuck?

@VannTen (Contributor, Author) commented May 16, 2023

              (
                kube_deployment_spec_replicas{%(prefixedNamespaceSelector)s%(kubeStateMetricsSelector)s}
                  >
                kube_deployment_status_replicas_available{%(prefixedNamespaceSelector)s%(kubeStateMetricsSelector)s}
              ) and (
                changes(kube_deployment_status_replicas_updated{%(prefixedNamespaceSelector)s%(kubeStateMetricsSelector)s}[10m])
                  ==
                0
              )

(this is from the KubeDeploymentReplicasMismatch definition)
My initial thought was:
I think the changes part (which I read as "the status.updated field has not changed in the last 10 minutes") means that KubeDeploymentReplicasMismatch won't fire in the same case, since a rollout implies updated replicas, right?

However, the default spec.progressDeadlineSeconds is 600 seconds, aka 10 minutes, so they would indeed fire at the same time, I think. Except in the case of a non-default spec.progressDeadlineSeconds, I guess.

Third thought:
A stuck rollout can perfectly well happen without available replicas being less than spec.replicas: you just need maxSurge > 0, in which case the controller tries to launch a new pod and won't remove the old one until it succeeds.

In conclusion (sorry for the brain dump), I don't think they would fire for the same thing in all cases, but there should be some overlap.

@VannTen (Contributor, Author) commented May 24, 2023

Are there any other questions to address regarding this? Do you think it's mergeable in its current state, or does it need more work?

@povilasv (Contributor)

So I am currently worried about this part:

> However, the default spec.progressDeadlineSeconds is 600 seconds, aka 10 minutes, so they would indeed fire at the same time, I think. Except in the case of a non-default spec.progressDeadlineSeconds, I guess.

Is there any way we can make these alerts not fire together?

@VannTen (Contributor, Author) commented May 24, 2023

You mean we should have an exclusive rather than an inclusive OR between these two, right?

@povilasv (Contributor)

I mean the two alerts shouldn't fire for the same issue.

> Err on the side of removing noisy alerts – over-monitoring is a harder problem to solve than under-monitoring.

From: https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit

@VannTen (Contributor, Author) commented May 24, 2023

I think the case where this can happen (both alerts firing) is:

Deployment is rolling out with maxUnavailable != 0 => the controller can update by removing a pod first => if the deployment can't progress, spec.replicas > status.availableReplicas becomes true and both alerts fire.

So what do you think of adding a "spec.replicas <= status.availableReplicas" condition to KubeDeploymentRolloutStuck? This would narrow the alert to the case where the rollout tries to surge but can't (cluster full, scheduler problems, etc.).

I'm a bit wary of adding that, though, because it seems like it would be designing the alert not to stand on its own.

wdyt ?

@povilasv (Contributor)

Maybe let's leave it as it is. We can add the condition later. I think I'm being a bit nitpicky here :D
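For reference, a sketch of what that extra guard could look like if it is added later: this is not part of the merged change, and the on(namespace, deployment) matching is an assumption that would need to fit the mixin's actual label set.

              kube_deployment_status_condition{condition="Progressing", status="false",%(prefixedNamespaceSelector)s%(kubeStateMetricsSelector)s}
                != 0
              and on (namespace, deployment) (
                kube_deployment_spec_replicas{%(prefixedNamespaceSelector)s%(kubeStateMetricsSelector)s}
                  <=
                kube_deployment_status_replicas_available{%(prefixedNamespaceSelector)s%(kubeStateMetricsSelector)s}
              )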

@povilasv merged commit b5c70aa into kubernetes-monitoring:master on May 26, 2023
6 checks passed