Use Prometheus collector pattern for stateless metrics #270

metalmatze · 2018-07-12T17:10:08Z

What this PR does / why we need it:
If a machine can't be deleted, we want to alert on it.
With the old approach, the metrics carried state that we couldn't delete. A machine that once existed therefore kept its metrics even when it was long gone, which made it impossible to alert on them.

By using the Prometheus collector pattern, we simply query the informers for a list of machines and nodes on every scrape. We iterate over those lists and forget about the state afterwards.
As a result, metrics of deleted machines no longer show up.
We can now tell that if a machine_controller_machine_deleted metric is present for more than 10 minutes, the machine cannot be deleted.

Special notes for your reviewer:
I've also removed the histograms for the operations, as they were utterly broken.
Example histogram:

machine_controller_controller_operation_duration_seconds_bucket{operation="validate-machine",le="0.005"} 64
machine_controller_controller_operation_duration_seconds_bucket{operation="validate-machine",le="0.01"} 64
machine_controller_controller_operation_duration_seconds_bucket{operation="validate-machine",le="0.025"} 64
machine_controller_controller_operation_duration_seconds_bucket{operation="validate-machine",le="0.05"} 64
machine_controller_controller_operation_duration_seconds_bucket{operation="validate-machine",le="0.1"} 64
machine_controller_controller_operation_duration_seconds_bucket{operation="validate-machine",le="0.25"} 64
machine_controller_controller_operation_duration_seconds_bucket{operation="validate-machine",le="0.5"} 64
machine_controller_controller_operation_duration_seconds_bucket{operation="validate-machine",le="1"} 64
machine_controller_controller_operation_duration_seconds_bucket{operation="validate-machine",le="2.5"} 64
machine_controller_controller_operation_duration_seconds_bucket{operation="validate-machine",le="5"} 64
machine_controller_controller_operation_duration_seconds_bucket{operation="validate-machine",le="10"} 64
machine_controller_controller_operation_duration_seconds_bucket{operation="validate-machine",le="+Inf"} 64
machine_controller_controller_operation_duration_seconds_sum{operation="validate-machine"} 0.00042467500000000006
machine_controller_controller_operation_duration_seconds_count{operation="validate-machine"} 64

In this example, every bucket has a count of 64, which makes the histogram useless. All the other machine-controller histograms I looked at showed the same pattern.
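
One possible reading of the output above, offered as a sketch rather than a diagnosis: Prometheus histogram buckets are cumulative, so each le bucket counts every observation less than or equal to its bound. If all 64 operations finished in well under 5ms, every bucket would report 64. The reported sum of about 0.00042s over 64 observations would be consistent with that. The function name cumulativeCounts and the synthetic observations are assumptions for illustration.

```go
package main

import "fmt"

// defaultBuckets mirrors the upper bounds (le) shown in the output above.
var defaultBuckets = []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}

// cumulativeCounts returns, for each upper bound, how many observations are
// less than or equal to it, which is how histogram buckets are exposed.
func cumulativeCounts(bounds, observations []float64) []int {
	counts := make([]int, len(bounds))
	for _, o := range observations {
		for i, b := range bounds {
			if o <= b {
				counts[i]++
			}
		}
	}
	return counts
}

func main() {
	// Hypothetical input: 64 observations of about 6.6 microseconds each,
	// roughly matching the reported sum of 0.00042s over 64 samples.
	obs := make([]float64, 64)
	for i := range obs {
		obs[i] = 0.0000066
	}
	// Every observation is below the smallest bound, so all buckets match.
	for i, n := range cumulativeCounts(defaultBuckets, obs) {
		fmt.Printf("le=%g %d\n", defaultBuckets[i], n)
	}
}
```

Under this reading, the bucket bounds are simply too coarse for operations this fast, so the histogram cannot show how the durations are distributed.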

alvaroaleman · 2018-07-12T21:57:43Z

Thanks for the PR! Can you create a tracking issue for the defunct metrics you removed?

metalmatze · 2018-07-13T08:28:50Z

Done in #271

alvaroaleman · 2018-07-13T09:53:13Z

pkg/controller/metrics.go

+	}
+}
+
+type NodeController struct {


This should be NodeCollector, not NodeController, right?

Yes, that's a typo. 😇

alvaroaleman · 2018-07-13T09:54:24Z

pkg/controller/metrics.go

+
+		nodes: prometheus.NewDesc(
+			metricsPrefix+"nodes",
+			"The number of nodes created by a machine",


I am actually not sure this metric makes sense: it will include all nodes, not only the ones created by the machine-controller. We could get the same information from kube-state-metrics.

Yup, I had the same thought yesterday. I then looked into kube-state-metrics again and found kube_node_created.
The reasons I added them back were:

1. We already had node metrics.
2. You still get node metrics even if you have not deployed kube-state-metrics.

I'm fine with removing them.

alvaroaleman · 2018-07-13T09:57:25Z

pkg/controller/metrics.go

+		),
+		nodeCreated: prometheus.NewDesc(
+			metricsPrefix+"node_created",
+			"The number of nodes created by a machine",


This should read something like nodeCreationTimestamp, or am I misunderstanding something?

I am not 100% sure about that one either.
I searched for created and deleted in our Prometheus metrics and found that kube-state-metrics simply uses names such as kube_node_created.
The Prometheus docs don't say anything specific about naming timestamp metrics either.
https://prometheus.io/docs/instrumenting/writing_exporters/#naming

ghost assigned metalmatze Jul 12, 2018

ghost added the review label Jul 12, 2018

metalmatze mentioned this pull request Jul 12, 2018

Investigate machine-controller deletion issue kubermatic/kubermatic#1437

Closed

2 tasks

alvaroaleman mentioned this pull request Jul 13, 2018

Remove broken histograms from metrics #271

Closed

alvaroaleman suggested changes Jul 13, 2018

metalmatze changed the title ~~Metrics collector~~ Use Prometheus collector pattern for stateless metrics Jul 13, 2018

alvaroaleman mentioned this pull request Jul 18, 2018

label machines metric with kubelet version #274

Merged

ghost assigned alvaroaleman Jul 18, 2018

metalmatze added 6 commits July 18, 2018 11:25

Refactor long line

53fc8fb

Remove old stateful metrics

f61e251

Create Machine and Node collector

a1f0054

Add example alert for machine being unable to delete

81f6fd2

Fix typo and rename to NodeCollector

116c928

Remove NodeCollector as it's a kube-state-metrics dup

08e8f17

alvaroaleman force-pushed the metrics-collector branch from 251e7e1 to 08e8f17 Compare July 18, 2018 10:02

alvaroaleman approved these changes Jul 18, 2018

alvaroaleman merged commit 107da31 into master Jul 18, 2018

ghost removed the review label Jul 18, 2018

alvaroaleman deleted the metrics-collector branch July 18, 2018 11:28

xrstf mentioned this pull request Aug 21, 2018

Bring back node metrics #302

Closed

LittleFox94 mentioned this pull request Jun 21, 2022

re-add the metric node_join_duration #1332

Closed
