eliminate kube_daemonset_status_number_misscheduled fluctuation due to autoscaling #812

Closed
garo opened this issue Jul 9, 2019 · 6 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

garo commented Jul 9, 2019

/kind feature

What happened:

The metric kube_daemonset_status_number_misscheduled is often used with Prometheus alerting to alert when a DaemonSet pod cannot be scheduled on every node where it is supposed to run.

However, in a cluster with an active cluster autoscaler, new machines constantly come and go, so there are often several nodes that haven't yet had time to schedule the required DaemonSet pods. This causes alerts based on kube_daemonset_status_number_misscheduled to trigger without an actual error condition.
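
For illustration, the kind of rule in question is roughly the following (a minimal sketch; the alert name, "for" duration, and severity are just examples, and the exact expression I use is quoted later in this thread):

```yaml
# Sketch of the kind of alert that flaps while the autoscaler adds nodes.
# The alert name, "for" duration, and severity are illustrative examples.
- alert: KubeDaemonSetMisScheduled
  expr: kube_daemonset_status_number_misscheduled{job="kube-state-metrics"} > 0
  for: 10m
  labels:
    severity: warning
```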

What you expected to happen:

I would expect either that the kube_daemonset_status_number_misscheduled metric would not count DaemonSet pods on nodes that haven't yet had time to get the pod running, or that the metric would inspect each DaemonSet pod, determine why it is not running, and ignore the cases where the pod is simply still starting up.

k8s-ci-robot added the kind/feature label on Jul 9, 2019
tariq1890 (Contributor) commented

Have you tried blacklisting this metric?

brancz (Member) commented Jul 9, 2019

I agree the situation is not good, but I don't think this is something to be fixed in kube-state-metrics. kube-state-metrics takes this value directly from the DaemonSet object, so if anything, that is what should change, or the alerting rule should be made less sensitive, which is probably appropriate either way. Can you share the alerting rule? (If you didn't write it yourself, it probably makes most sense to open an issue in the project the alerting rule comes from. :) )

garo (Author) commented Jul 9, 2019

The rule I'm currently using comes from the Prometheus Operator project by default, and it's simply:

"kube_daemonset_status_number_misscheduled{job="kube-state-metrics"} > 0"

I can make this less and less sensitive so that it alerts later, but if the cluster is constantly scaling, the problem never goes away.

You can think of it as each node hosting a DaemonSet pod having its own timeline. kube_daemonset_status_number_misscheduled simply combines them all, ignoring the individuality of each node. If your cluster is constantly scaling, there will always be one node that is just starting up, and that alone will trigger the alert. Relaxing the threshold can then easily miss nodes that are problematic for reasons unrelated to scaling.

I would gladly modify the alert in any way possible to remove the scaling-related false positives while retaining visibility into problems unrelated to scaling, but to my knowledge there isn't another metric I could use.

One fix could be to create a new metric like kube_daemonset_pod_status, which would have a daemonset label, another label for the node where the pod is scheduled, and the pod's status as the value. This way I could create an alert that treats each pod of each DaemonSet as an individual and triggers only if a single pod stays unscheduled for more than x minutes.
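
To make that concrete, an alert on such a metric could look roughly like the sketch below. Both the metric name kube_daemonset_pod_status and its labels are only my proposal, modelled after kube_pod_status_phase; nothing like this exists in kube-state-metrics today:

```yaml
# Hypothetical: assumes a per-pod metric of the shape
#   kube_daemonset_pod_status{daemonset="...", node="...", phase="..."} == 1
# which kube-state-metrics does not expose today.
- alert: DaemonSetPodNotRunning
  expr: kube_daemonset_pod_status{phase!="Running"} == 1
  for: 15m
  labels:
    severity: warning
```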

Might kube-state-metrics be the right place for this kind of new metric?

brancz (Member) commented Jul 9, 2019

The rule I'm currently using comes from the Prometheus Operator project by default, and it's simply:

The Prometheus Operator itself pretty much doesn't define any alerting rules (if anything, kube-prometheus does); it imports all of them through the kubernetes-mixin.

kube-state-metrics only mirrors Kubernetes API objects as Prometheus metrics; it does no correlation, pre-aggregation, or anything like that. So if this data is supposed to end up in kube-state-metrics, it first has to be in the object itself.

In terms of the alerting rule, I think increasing the "for" duration could be a first step, and raising the threshold a second. At the end of the day, if the alerting rule doesn't actually help you, it is worth either modifying it or removing it entirely.
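
For example, a less sensitive variant of the rule you quoted could look roughly like this; the 30m duration and the threshold of 1 are arbitrary placeholder values to tune against your cluster's scaling behaviour, not recommendations:

```yaml
# Sketch of a less sensitive version of the rule quoted above.
# Both the threshold and the "for" duration are placeholder values.
- alert: KubeDaemonSetMisScheduled
  expr: kube_daemonset_status_number_misscheduled{job="kube-state-metrics"} > 1
  for: 30m
  labels:
    severity: warning
```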

garo (Author) commented Jul 9, 2019

Thank you for the clear answer that kube-state-metrics is not the place to do aggregations, which makes my feature request invalid.

As for my use case, this particular metric doesn't help me at this stage, so I'm going to remove the alert. What I will be missing is knowing whether a single pod belonging to a DaemonSet has been unable to start for a longer period of time, but I will need to find another way to express that alert.

Thank you.

aantn commented Dec 28, 2021

@garo you can do it with Robusta by using an on_daemonset_update trigger.
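
Roughly, a custom playbook could hook that trigger as sketched below; the action name is a placeholder for whatever notification or enrichment you want to run, not a real built-in:

```yaml
# Sketch of a Robusta customPlaybooks entry using the on_daemonset_update trigger.
# "your_action_here" is a placeholder, not an actual built-in Robusta action.
customPlaybooks:
- triggers:
  - on_daemonset_update: {}
  actions:
  - your_action_here: {}
```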
