Adding more labels to kube_pod_status_phase #332

rajatjindal · 2018-01-04T18:24:32Z

We had a situation where the number of failed pods increased dramatically, and we were wondering what had happened.

On debugging, we found that 2 nodes were having Docker issues and most of the failed pods had been scheduled on those problematic nodes.

I think it would be useful to add more labels to kube_pod_status_phase, so that we can run queries such as a count of all failed pods grouped by node.

rajatjindal · 2018-01-04T18:25:56Z

I will be more than happy to open a PR if we agree that this is a reasonable ask.

brancz · 2018-01-05T07:47:59Z

That can be done at query time. Prometheus supports joins, so you can join pod info on the phase to figure out the node.

andyxning · 2018-01-06T16:12:31Z

In case you have not used the join syntax of Prometheus, there is an example in the #137 comment with a query like:

sum by(node)(avg by(node,pod,namespace)(kube_pod_info{}) * on(pod, namespace) group_right(node) kube_pod_status_phase{phase="Failed"})
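For anyone new to PromQL joins, here is a rough Python sketch of what that query computes, as I understand it: each kube_pod_status_phase{phase="Failed"} series is matched to kube_pod_info on (namespace, pod), the node label is copied over from kube_pod_info, and the values are summed per node. The pod names, node names, and the function failed_pods_by_node are made up for illustration, and the sketch assumes the phase metric is 1 for pods in that phase and 0 otherwise.

```python
from collections import defaultdict

# Hypothetical sample data; real series come from kube-state-metrics.
# kube_pod_info: one series per pod (value 1) that carries the node label.
pod_info = {
    ("default", "web-1"): "node-a",
    ("default", "web-2"): "node-b",
    ("batch", "job-1"): "node-b",
}

# kube_pod_status_phase{phase="Failed"}: assumed to be 1 if the pod is
# in the Failed phase and 0 otherwise.
failed_phase = {
    ("default", "web-1"): 0,
    ("default", "web-2"): 1,
    ("batch", "job-1"): 1,
}


def failed_pods_by_node(pod_info, failed_phase):
    """Mimic the join on (namespace, pod) followed by sum by(node)."""
    totals = defaultdict(int)
    for key, value in failed_phase.items():
        node = pod_info.get(key)
        if node is None:
            # No matching kube_pod_info series: the sample is dropped,
            # as it would be by the vector match in PromQL.
            continue
        # kube_pod_info has value 1, so the product is just the phase value.
        totals[node] += 1 * value
    return dict(totals)


print(failed_pods_by_node(pod_info, failed_phase))
```

In this toy data both failed pods sit on node-b, which is the kind of skew the per-node breakdown is meant to surface.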

rajatjindal · 2018-01-10T06:47:42Z

Thank you very much, guys. I will try this out tonight.

andyxning closed this as completed Jan 6, 2018
