add metrics.md to document exposed metrics #152
Conversation
Can we include a reference to the metrics related to the cluster autoscaler: https://github.com/openshift/cluster-autoscaler-operator/blob/master/pkg/controller/clusterautoscaler/monitoring.go
i'm digging deeper into some of these metrics sources in hopes that we can provide links back to the original sources.
updated with links to external docs. i've tried to make this easier to consume by removing the sample scrape in favor of links to the code. unfortunately i could not find definitive documentation for the metrics provided by controller-runtime and prometheus/client-go, so i have opted to link directly to the metrics implementations.
docs/dev/metrics.md
Outdated
### Prometheus REST server metrics

The `url` label can be quite useful for querying these metrics, here are a few
of the available URL values to use: `"https://172.30.0.1:443/%7Bprefix%7D"`, `"https://172.30.0.1:443/apis?timeout=32s"`.
Where would I use these URLs? I'm assuming these IPs are subject to change?
these are the in-pod IP addresses that are assigned to the prometheus http server, they are used as labels during promql calls to isolate the different endpoints. in my tests it looked like these IP addresses always came up the same, but i have to imagine it would change if the address range changed for the container network.
this might also be a chicken/egg situation too, where a user would have to scrape the metrics once to find out what is available, or they would have to know the IP address for the container in question.
should i add some more context here?
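On the chicken/egg point: one way a user could discover the available `url` label values is to scrape the endpoint once and pull the label values out of the exposition text. A minimal sketch in Python (the metric name and sample scrape lines are illustrative assumptions modeled on this thread, not actual CAO output):

```python
import re

# Illustrative sample of Prometheus text exposition output; the metric
# name and label values here are assumptions, not a real scrape.
SAMPLE_SCRAPE = """\
rest_client_request_latency_seconds_count{url="https://172.30.0.1:443/apis?timeout=32s",verb="GET"} 4
rest_client_request_latency_seconds_count{url="https://172.30.0.1:443/%7Bprefix%7D",verb="GET"} 2
process_cpu_seconds_total 12.5
"""

def url_label_values(exposition_text):
    """Collect the distinct url="..." label values seen in a scrape."""
    return sorted(set(re.findall(r'url="([^"]+)"', exposition_text)))

print(url_label_values(SAMPLE_SCRAPE))
```

A user could run something like this over a one-off curl of the operator's metrics endpoint to build the list of URL values before writing any promql.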
Just as a heads up, I believe `172.30.0.1` is the service IP of the `kubernetes` service in the `default` namespace on an OpenShift cluster, i.e., it's the service that fronts the API server; not sure why this is coming up WRT prometheus though.
is it possible we are just seeing a local ip address that only exists within the pod network for these containers?
edit:
just for reference, i scraped these values from a running CAO on openshift 4.5
The pod network on openshift is a `10.x` subnet; the `172.30.x` range is used for services. Perhaps the metrics are for requests that CAO is making to the Kubernetes API?
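The two subnets can be checked directly. A quick sketch using the OpenShift 4 install-time defaults (`172.30.0.0/16` for the service network, `10.128.0.0/14` for the pod network; a cluster installed with custom CIDRs would differ):

```python
import ipaddress

# OpenShift 4 install-time defaults (assumption: no custom network
# CIDRs were supplied at install time).
SERVICE_CIDR = ipaddress.ip_network("172.30.0.0/16")
POD_CIDR = ipaddress.ip_network("10.128.0.0/14")

addr = ipaddress.ip_address("172.30.0.1")
print(addr in SERVICE_CIDR)  # address sits in the service network
print(addr in POD_CIDR)      # address is not in the pod network
```

Which supports the reading that these series describe calls to the API server's service IP rather than anything on the pod network.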
could be, i'll take some time to dig in a little and see if i can learn more here.
i guess another way to look at this is, should we just remove this info and point to the docs in the prometheus project? my concern is that we get focused on this metric when it seems like a minor thing.
This is a valid concern, I agree, maybe we just drop that little bit
first, thank you both for pushing a little on this one it's been a fascinating investigation. as near as i can tell, these metrics are being generated by client-go using the controller-runtime package. i'm pretty sure this is where it starts. it appears to be showing all the outbound requests that are being made by the client-go inside the operator. which would square up with what Joel is saying about the IP address being the root API server.
so, maybe explain that bit here and give the example IP address? or is that too much detail?
/approve
updated to add more information about the controller-runtime based metrics. also removed the specific urls for the api server, with a description instead.
LGTM, I'll leave the label for @michaelgugino in case he has further comments
## Metrics about the Prometheus collectors

Prometheus provides some default metrics about the internal state
I'm not clear on what this section means. Are these metrics collected automatically by virtue of using some prometheus library in the sig-controller-runtime code?
Okay, yeah, that's how it works. Wasn't clear what this was about.
How about an example:
`process_cpu_seconds_total{job="cluster-autoscaler-operator"}`
sure, i can add an example. i do wonder if i shouldn't update the mao doc as well then since it has the same clause in it?
note that the sample metric in that section is specifically added by the mao, that isn't the case here.
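To make the default-collector point concrete: the Go client registers process and runtime collectors automatically, so a scrape mixes `process_*`/`go_*` series in with the operator's own. A sketch of splitting the two groups (the sample exposition lines are illustrative assumptions, not real CAO output):

```python
# Split metric names from a scrape into the default Prometheus Go-client
# collector families (process_*, go_*) and everything else.
SAMPLE_SCRAPE = """\
process_cpu_seconds_total 12.5
go_goroutines 42
controller_runtime_reconcile_total{controller="cluster_autoscaler_controller",result="success"} 7
"""

DEFAULT_PREFIXES = ("process_", "go_")

def split_metrics(exposition_text):
    """Return (default-collector names, other names), each sorted."""
    default, custom = set(), set()
    for line in exposition_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and # HELP / # TYPE comments
        name = line.split("{")[0].split()[0]
        (default if name.startswith(DEFAULT_PREFIXES) else custom).add(name)
    return sorted(default), sorted(custom)

print(split_metrics(SAMPLE_SCRAPE))
```

The same split would apply to the mao's endpoint, modulo the extra metric it registers itself.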
* [Prometheus documentation, Standard and runtime collectors](https://prometheus.io/docs/instrumenting/writing_clientlibs/#standard-and-runtime-collectors)
* [Prometheus client Go language collectors](https://github.com/prometheus/client_golang/blob/master/prometheus/go_collector.go)

# Cluster Autoscaler Metrics
Nit: This seems like it could replace the first H1 at the top of this doc.
this one is specifically about the autoscaler though, the top level link is about the operator. i was trying to draw a distinction, maybe there is a better way to specify?
### Admission webhook metrics

These metric names begin with `controller_runtime_webhook_`. The label
Are these actually being scraped? I did some queries on 4.5, and didn't get anything back for "controller_runtime_webhook_requests_total" and other metrics in this link.
define "queries".
these metrics may not get collected by telemetry, but they are exposed from the operator. i was able to scrape them from a running cao on 4.5.
Yeah, they're not getting collected by prometheus inside the cluster, I'm not talking about telemetry. They may be exposed but not collected, as you say.
### Kubernetes controller metrics

These metric names begin with `controller_runtime_reconcile_`. The labels
Are these metrics actually getting collected? No CAO/MAO stuff being reported in 4.5.
they may not be collected, but they are exposed by the metrics endpoint for both operators.
Same here. Not getting collected by prometheus. I think some other operators do have this information being collected in prometheus. In any case, we should specify that we are/aren't collecting these?
> In any case, we should specify that we are/aren't collecting these?
that's a good question, my gut feeling is that we should document all the possible metrics someone could scrape from the operator. given that the collection/telemetry stuff is controlled from a different location, perhaps we should drop a link to that repo so readers can evaluate what is being exported.
/retest
This change adds a document to describe the metrics available from this operator.
/lgtm
/approve
[APPROVALNOTIFIER] This PR is **APPROVED**. This pull-request has been approved by: enxebre, michaelgugino. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/kind documentation