
add metrics.md to document exposed metrics #152

Merged
merged 1 commit into openshift:master from the add-metrics-doc branch on Jul 9, 2020

Conversation

@elmiko (Contributor) commented May 28, 2020

This change adds a document to describe the metrics available from this
operator.

/kind documentation

@openshift-ci-robot added the kind/documentation label on May 28, 2020
@enxebre (Member) commented Jun 9, 2020

Can we include a ref to the metrics related to the cluster autoscaler? See https://github.com/openshift/cluster-autoscaler-operator/blob/master/pkg/controller/clusterautoscaler/monitoring.go and https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/metrics.md

@elmiko (Contributor, Author) commented Jun 9, 2020

i'm digging deeper into some of these metrics sources in hopes that we can provide links back to the original sources.

@elmiko force-pushed the add-metrics-doc branch 3 times, most recently from 23ca6b2 to 147311f on June 10, 2020 20:04
@elmiko (Contributor, Author) commented Jun 10, 2020

updated with links to external docs. i've tried to make this easier to consume by removing the sample scrape in favor of links to the code. unfortunately i could not find definitive documentation for the metrics provided by controller-runtime and prometheus/client-go, so i have opted to link directly to the metrics implementations.

### Prometheus REST server metrics

The `url` label can be quite useful for querying these metrics. Here are a few of the available URL values: `"https://172.30.0.1:443/%7Bprefix%7D"`, `"https://172.30.0.1:443/apis?timeout=32s"`.
Contributor

From where should I utilize these URLs? I'm assuming these IPs are subject to change?

@elmiko (Contributor, Author) Jun 23, 2020

these are the in-pod IP addresses that are assigned to the prometheus http server; they are used as labels during promql calls to isolate the different endpoints. in my tests it looked like these IP addresses always came up the same, but i have to imagine they would change if the address range changed for the container network.

this might also be a chicken/egg situation, where a user would have to scrape the metrics once to find out what is available, or they would have to know the IP address for the container in question.

should i add some more context here?
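
As a minimal sketch of how the `url` label can isolate endpoints in a query, assuming the request latency histogram that controller-runtime registers from client-go (the metric name and the `job` label value here are assumptions, not confirmed by the doc; adjust them to whatever your scrape actually shows):

```promql
# Hypothetical sketch: list the endpoints the operator talks to by
# grouping on the url label (metric and job names are assumptions).
sum by (url) (
  rest_client_request_latency_seconds_count{job="cluster-autoscaler-operator"}
)

# Once a url value is known, isolate a single endpoint:
rest_client_request_latency_seconds_count{
  job="cluster-autoscaler-operator",
  url="https://172.30.0.1:443/apis?timeout=32s"
}
```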

Contributor

Just as a heads up, I believe 172.30.0.1 is the service IP of the Kubernetes service in the default namespace on an OpenShift cluster, i.e. the service that fronts the API server. Not sure why this is coming up WRT prometheus though.

@elmiko (Contributor, Author) Jun 25, 2020

is it possible we are just seeing a local ip address that only exists within the pod network for these containers?

edit:
just for reference, i scraped these values from a running CAO on openshift 4.5

Contributor

The pod network on openshift is a 10.x subnet; the 172.30.x range is used for services. Perhaps the metrics are for requests that CAO is making to the Kubernetes API?

Contributor (Author)

could be, i'll take some time to dig in a little and see if i can learn more here.

i guess another way to look at this is, should we just remove this info and point to the docs in the prometheus project? my concern is that we get focused on this metric when it seems like a minor thing.

Contributor

This is a valid concern, I agree. Maybe we just drop that little bit.

Contributor (Author)

first, thank you both for pushing a little on this one, it's been a fascinating investigation. as near as i can tell, these metrics are being generated by client-go using the controller-runtime package. i'm pretty sure this is where it starts. it appears to be showing all the outbound requests that are being made by client-go inside the operator, which would square up with what Joel is saying about the IP address being the root API server.

so, maybe explain that bit here and give the example IP address? or is that too much detail?
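
A minimal sketch along those lines, assuming the latency histogram that controller-runtime's client-go adapter registered at the time (the metric name, the `_bucket` suffix, and the `job` label value are assumptions, not something confirmed in the doc):

```promql
# Hypothetical sketch: p99 latency of the operator's outbound requests
# to the API server, grouped by verb and url. The metric name and the
# job label value are assumptions.
histogram_quantile(0.99,
  sum by (verb, url, le) (
    rate(rest_client_request_latency_seconds_bucket{job="cluster-autoscaler-operator"}[5m])
  )
)
```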

@enxebre (Member) commented Jun 23, 2020

/approve

@openshift-ci-robot added the approved label on Jun 23, 2020
@elmiko (Contributor, Author) commented Jun 29, 2020

updated to add more information about the controller-runtime based metrics. also removed the specific urls for the api server in favor of a description.

@JoelSpeed (Contributor)
LGTM, I'll leave the label for @michaelgugino in case he has further comments


## Metrics about the Prometheus collectors

Prometheus provides some default metrics about the internal state
Contributor

I'm not clear on what this section means. Are these metrics collected automatically by virtue of using some prometheus library in the sig-controller-runtime code?

Contributor

Okay, yeah, that's how it works. Wasn't clear what this was about.

How about an example?
`process_cpu_seconds_total{job="cluster-autoscaler-operator"}`

Contributor (Author)

sure, i can add an example. i do wonder if i shouldn't update the mao doc as well then since it has the same clause in it?

https://github.com/openshift/machine-api-operator/blob/master/docs/dev/metrics.md#metrics-about-the-prometheus-collectors

note that the sample metric in that section is specifically added by the mao; that isn't the case here.

* [Prometheus documentation, Standard and runtime collectors](https://prometheus.io/docs/instrumenting/writing_clientlibs/#standard-and-runtime-collectors)
* [Prometheus client Go language collectors](https://github.com/prometheus/client_golang/blob/master/prometheus/go_collector.go)
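
For reference, a minimal sketch of what these default collector metrics look like in a query, building on the example above (the `job` label value is an assumption about how the operator is scraped):

```promql
# Standard process and Go runtime collector metrics exposed by the
# Prometheus client library (the job label value is an assumption).
go_goroutines{job="cluster-autoscaler-operator"}
process_resident_memory_bytes{job="cluster-autoscaler-operator"}

# Approximate CPU usage over the last five minutes:
rate(process_cpu_seconds_total{job="cluster-autoscaler-operator"}[5m])
```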

# Cluster Autoscaler Metrics
Contributor

Nit: This seems like it could replace the first H1 at the top of this doc.

Contributor (Author)

this one is specifically about the autoscaler though; the top level link is about the operator. i was trying to draw a distinction, maybe there is a better way to specify it?


### Admission webhook metrics

These metric names begin with `controller_runtime_webhook_`. The label
Contributor

Are these actually being scraped? I did some queries on 4.5, and didn't get anything back for "controller_runtime_webhook_requests_total" and other metrics in this link.

Contributor (Author)

define "queries".

these metrics may not get collected by telemetry, but they are exposed from the operator. i was able to scrape them from a running cao on 4.5.

Contributor

Yeah, they're not getting collected by prometheus inside the cluster; I'm not talking about telemetry. They may be exposed but not collected, as you say.
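
As a minimal sketch of querying the webhook metric named above, assuming it is being collected at all (the `job` label value and the rate window are assumptions):

```promql
# Hypothetical sketch: admission webhook request rate, using the
# controller_runtime_webhook_requests_total counter mentioned above.
# The job label value is an assumption.
rate(controller_runtime_webhook_requests_total{job="cluster-autoscaler-operator"}[5m])
```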


### Kubernetes controller metrics

These metric names begin with `controller_runtime_reconcile_`. The labels
Contributor

Are these metrics actually getting collected? No CAO/MAO stuff being reported in 4.5.

Contributor (Author)

they may not be collected, but they are exposed by the metrics endpoint for both operators.

Contributor

Same here. Not getting collected by prometheus. I think some other operators do have this information being collected in prometheus. In any case, we should specify that we are/aren't collecting these?

Contributor (Author)

> In any case, we should specify that we are/aren't collecting these?

that's a good question. my gut feeling is that we should document all the possible metrics someone could scrape from the operator. given that the collection/telemetry stuff is controlled from a different location, perhaps we should drop a link to that repo so readers can evaluate what is being exported.
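
For illustration, a minimal sketch of querying the reconcile metrics if they are collected somewhere (the `controller` and `result` label names follow controller-runtime's conventions; the `job` label value is an assumption):

```promql
# Hypothetical sketch: reconcile outcomes per controller. The metric and
# label names follow controller-runtime conventions; the job label value
# is an assumption.
sum by (controller, result) (
  rate(controller_runtime_reconcile_total{job="cluster-autoscaler-operator"}[5m])
)
```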

@elmiko (Contributor, Author) commented Jul 9, 2020

/retest

@michaelgugino (Contributor) left a comment

/lgtm

/approve

@openshift-ci-robot added the lgtm label on Jul 9, 2020
@openshift-ci-robot
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre, michaelgugino

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:
  • OWNERS [enxebre,michaelgugino]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot merged commit 3790a11 into openshift:master on Jul 9, 2020
@elmiko deleted the add-metrics-doc branch on July 10, 2020 13:56