@metalmatze
Contributor

What this PR does / why we need it:
If a machine can't be deleted, we want to alert on it.
With the old approach, the metrics held state that we couldn't delete. A machine that once existed therefore kept its metrics, which made it impossible to alert on them, even when the machine was long gone.

By using the Prometheus collector pattern, we simply query the informers for a list of machines and nodes on each scrape. We iterate over those lists and forget the state afterwards.
As a result, metrics of deleted machines no longer show up.
We can now tell that if a machine_controller_machine_deleted metric is present for more than 10 minutes, the machine cannot be deleted.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):

Special notes for your reviewer:
I've also removed the histograms for the operations as they were utterly broken.
Example histogram:

machine_controller_controller_operation_duration_seconds_bucket{operation="validate-machine",le="0.005"} 64
machine_controller_controller_operation_duration_seconds_bucket{operation="validate-machine",le="0.01"} 64
machine_controller_controller_operation_duration_seconds_bucket{operation="validate-machine",le="0.025"} 64
machine_controller_controller_operation_duration_seconds_bucket{operation="validate-machine",le="0.05"} 64
machine_controller_controller_operation_duration_seconds_bucket{operation="validate-machine",le="0.1"} 64
machine_controller_controller_operation_duration_seconds_bucket{operation="validate-machine",le="0.25"} 64
machine_controller_controller_operation_duration_seconds_bucket{operation="validate-machine",le="0.5"} 64
machine_controller_controller_operation_duration_seconds_bucket{operation="validate-machine",le="1"} 64
machine_controller_controller_operation_duration_seconds_bucket{operation="validate-machine",le="2.5"} 64
machine_controller_controller_operation_duration_seconds_bucket{operation="validate-machine",le="5"} 64
machine_controller_controller_operation_duration_seconds_bucket{operation="validate-machine",le="10"} 64
machine_controller_controller_operation_duration_seconds_bucket{operation="validate-machine",le="+Inf"} 64
machine_controller_controller_operation_duration_seconds_sum{operation="validate-machine"} 0.00042467500000000006
machine_controller_controller_operation_duration_seconds_count{operation="validate-machine"} 64

In this example, every bucket was hit 64 times, which makes the histogram useless. I saw the same result in all the other machine-controller histograms I looked at.
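
To illustrate why identical bucket counts carry no information, here is a small Go sketch of how cumulative "le" buckets are counted. The function name cumulativeCounts and the variable defaultBuckets are made up for this example; the bounds are copied from the output above, and the +Inf bucket is left out because it always equals the total count. From the sum and count above, the mean observation is roughly 0.00042 / 64 ≈ 6.6µs, far below the smallest bound of 5ms.

```go
package main

// defaultBuckets mirrors the finite upper bounds (le) visible in the example output.
var defaultBuckets = []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}

// cumulativeCounts returns, for each upper bound, how many observations are
// less than or equal to it. Prometheus histogram buckets are cumulative in
// the same way, so one observation can increment several buckets.
func cumulativeCounts(bounds []float64, observations []float64) []uint64 {
	counts := make([]uint64, len(bounds))
	for _, o := range observations {
		for i, b := range bounds {
			if o <= b {
				counts[i]++
			}
		}
	}
	return counts
}
```

If all 64 observations are a few microseconds long, every bound admits all of them and each bucket reports 64, so no quantile can be estimated beyond "below 5ms".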

@alvaroaleman
Contributor

Thanks for the PR! Can you create a tracking issue for the defunct metrics you removed?

@metalmatze
Contributor Author

Done in #271

}
}

type NodeController struct {
Contributor

This should be NodeCollector, not NodeController, right?

Contributor Author

Yes, that's a typo. 😇


nodes: prometheus.NewDesc(
metricsPrefix+"nodes",
"The number of nodes created by a machine",
Contributor

I am actually not sure this metric makes sense, because it will include all nodes, not only the ones that were created by the machine-controller. We could get the same info from kube-state-metrics.

Contributor Author

Yup, had the same thought yesterday. Then I also looked into kube-state-metrics again and found kube_node_created.
The reasons I added them back were:

  1. We already had node metrics.
  2. You still get node metrics, even if you have not deployed kube-state-metrics.

I'm fine with removing them.

),
nodeCreated: prometheus.NewDesc(
metricsPrefix+"node_created",
"The number of nodes created by a machine",
Contributor

This should read something like nodeCreationTimestamp or am I misunderstanding something?

Contributor Author

I am not 100% sure on that one either.
I searched for "created" and "deleted" in our Prometheus metrics and found that kube-state-metrics simply uses names such as kube_node_created.
The Prometheus docs don't say anything specific about naming for timestamps either.
https://prometheus.io/docs/instrumenting/writing_exporters/#naming

@metalmatze changed the title from "Metrics collector" to "Use Prometheus collector pattern for stateless metrics" on Jul 13, 2018
@ghost assigned alvaroaleman on Jul 18, 2018
@alvaroaleman merged commit 107da31 into master on Jul 18, 2018
@ghost removed the review label on Jul 18, 2018
@alvaroaleman deleted the metrics-collector branch on July 18, 2018 11:28
@xrstf mentioned this pull request on Aug 21, 2018