
add metrics.md to document exposed metrics #152

Merged
merged 1 commit into openshift:master from the add-metrics-doc branch on Jul 9, 2020

Conversation

@elmiko (Contributor) commented May 28, 2020

This change adds a document to describe the metrics available from this
operator.

/kind documentation

@openshift-ci-robot added the kind/documentation label on May 28, 2020
@enxebre (Member) commented Jun 9, 2020

Can we include a ref to the metrics related to the cluster autoscaler? See https://github.com/openshift/cluster-autoscaler-operator/blob/master/pkg/controller/clusterautoscaler/monitoring.go and https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/metrics.md

@elmiko (Contributor, Author) commented Jun 9, 2020

i'm digging deeper into some of these metrics sources in hopes that we can provide links back to the original sources.

@elmiko force-pushed the add-metrics-doc branch 3 times, most recently from 23ca6b2 to 147311f on June 10, 2020 20:04
@elmiko (Contributor, Author) commented Jun 10, 2020

updated with links to external docs. i've tried to make this easier to consume by removing the sample scrape in favor of links to the code. unfortunately i could not find definitive documentation for the metrics provided by controller-runtime and prometheus/client-go, so i have opted to link directly to the metrics implementations.

### Prometheus REST server metrics

The `url` label can be quite useful for querying these metrics. Here are a few of the available URL values: `"https://172.30.0.1:443/%7Bprefix%7D"`, `"https://172.30.0.1:443/apis?timeout=32s"`.
Contributor

From where should I utilize these URLs? I'm assuming these IPs are subject to change?

@elmiko (Contributor, Author) Jun 23, 2020

these are the in-pod IP addresses that are assigned to the prometheus http server; they are used as labels during promql calls to isolate the different endpoints. in my tests it looked like these IP addresses always came up the same, but i have to imagine they would change if the address range changed for the container network.

this might also be a chicken/egg situation, where a user would have to scrape the metrics once to find out what is available, or they would have to know the IP address for the container in question.

should i add some more context here?
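
As a minimal sketch of how the `url` label can isolate endpoints in a query, assuming the request latency histogram that controller-runtime registers from client-go (the metric name and the `job` label value here are assumptions, not confirmed by the doc; adjust them to whatever your scrape actually shows):

```promql
# Hypothetical sketch: list the endpoints the operator talks to by
# grouping on the url label (metric and job names are assumptions).
sum by (url) (
  rest_client_request_latency_seconds_count{job="cluster-autoscaler-operator"}
)

# Once a url value is known, isolate a single endpoint:
rest_client_request_latency_seconds_count{
  job="cluster-autoscaler-operator",
  url="https://172.30.0.1:443/apis?timeout=32s"
}
```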

Contributor

Just as a heads up, I believe 172.30.0.1 is the service IP of the Kubernetes service in the default namespace on an OpenShift cluster, i.e. the service that fronts the API server. Not sure why this is coming up WRT prometheus though.

@elmiko (Contributor, Author) Jun 25, 2020

is it possible we are just seeing a local ip address that only exists within the pod network for these containers?

edit:
just for reference, i scraped these values from a running CAO on openshift 4.5

Contributor

The pod network on openshift is a 10.x subnet; the 172.30.x range is used for services. Perhaps the metrics are for requests that CAO is making to the Kubernetes API?

Contributor (Author)

could be, i'll take some time to dig in a little and see if i can learn more here.

i guess another way to look at this is, should we just remove this info and point to the docs in the prometheus project? my concern is that we get focused on this metric when it seems like a minor thing.

Contributor

This is a valid concern, I agree. Maybe we just drop that little bit.

Contributor (Author)

first, thank you both for pushing a little on this one, it's been a fascinating investigation. as near as i can tell, these metrics are being generated by client-go using the controller-runtime package. i'm pretty sure this is where it starts. it appears to be showing all the outbound requests that are being made by client-go inside the operator, which would square up with what Joel is saying about the IP address being the root API server.

so, maybe explain that bit here and give the example IP address? or is that too much detail?
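
A minimal sketch along those lines, assuming the latency histogram that controller-runtime's client-go adapter registered at the time (the metric name, the `_bucket` suffix, and the `job` label value are assumptions, not something confirmed in the doc):

```promql
# Hypothetical sketch: p99 latency of the operator's outbound requests
# to the API server, grouped by verb and url. The metric name and the
# job label value are assumptions.
histogram_quantile(0.99,
  sum by (verb, url, le) (
    rate(rest_client_request_latency_seconds_bucket{job="cluster-autoscaler-operator"}[5m])
  )
)
```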

@enxebre (Member) commented Jun 23, 2020

/approve

@openshift-ci-robot added the approved label on Jun 23, 2020
@elmiko (Contributor, Author) commented Jun 29, 2020

updated to add more information about the controller-runtime based metrics. also removed the specific urls for the api server in favor of a description.

@JoelSpeed (Contributor)
LGTM, I'll leave the label for @michaelgugino in case he has further comments


## Metrics about the Prometheus collectors

Prometheus provides some default metrics about the internal state
Contributor

I'm not clear on what this section means. Are these metrics collected automatically by virtue of using some prometheus library in the sig-controller-runtime code?

Contributor

Okay, yeah, that's how it works. Wasn't clear what this was about.

How about an example?
`process_cpu_seconds_total{job="cluster-autoscaler-operator"}`

Contributor (Author)

sure, i can add an example. i do wonder if i shouldn't update the mao doc as well then since it has the same clause in it?

https://github.com/openshift/machine-api-operator/blob/master/docs/dev/metrics.md#metrics-about-the-prometheus-collectors

note that the sample metric in that section is specifically added by the mao; that isn't the case here.

* [Prometheus documentation, Standard and runtime collectors](https://prometheus.io/docs/instrumenting/writing_clientlibs/#standard-and-runtime-collectors)
* [Prometheus client Go language collectors](https://github.com/prometheus/client_golang/blob/master/prometheus/go_collector.go)
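
For reference, a minimal sketch of what these default collector metrics look like in a query, building on the example above (the `job` label value is an assumption about how the operator is scraped):

```promql
# Standard process and Go runtime collector metrics exposed by the
# Prometheus client library (the job label value is an assumption).
go_goroutines{job="cluster-autoscaler-operator"}
process_resident_memory_bytes{job="cluster-autoscaler-operator"}

# Approximate CPU usage over the last five minutes:
rate(process_cpu_seconds_total{job="cluster-autoscaler-operator"}[5m])
```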

# Cluster Autoscaler Metrics
Contributor

Nit: This seems like it could replace the first H1 at the top of this doc.

Contributor (Author)

this one is specifically about the autoscaler though; the top level link is about the operator. i was trying to draw a distinction, maybe there is a better way to specify it?


### Admission webhook metrics

These metric names begin with `controller_runtime_webhook_`. The label
Contributor

Are these actually being scraped? I did some queries on 4.5, and didn't get anything back for "controller_runtime_webhook_requests_total" and other metrics in this link.

Contributor (Author)

define "queries".

these metrics may not get collected by telemetry, but they are exposed from the operator. i was able to scrape them from a running cao on 4.5.

Contributor

Yeah, they're not getting collected by prometheus inside the cluster; I'm not talking about telemetry. They may be exposed but not collected, as you say.
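
As a minimal sketch of querying the webhook metric named above, assuming it is being collected at all (the `job` label value and the rate window are assumptions):

```promql
# Hypothetical sketch: admission webhook request rate, using the
# controller_runtime_webhook_requests_total counter mentioned above.
# The job label value is an assumption.
rate(controller_runtime_webhook_requests_total{job="cluster-autoscaler-operator"}[5m])
```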


### Kubernetes controller metrics

These metric names begin with `controller_runtime_reconcile_`. The labels
Contributor

Are these metrics actually getting collected? No CAO/MAO stuff being reported in 4.5.

Contributor (Author)

they may not be collected, but they are exposed by the metrics endpoint for both operators.

Contributor

Same here. Not getting collected by prometheus. I think some other operators do have this information being collected in prometheus. In any case, we should specify that we are/aren't collecting these?

Contributor (Author)

> In any case, we should specify that we are/aren't collecting these?

that's a good question. my gut feeling is that we should document all the possible metrics someone could scrape from the operator. given that the collection/telemetry stuff is controlled from a different location, perhaps we should drop a link to that repo so readers can evaluate what is being exported.
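
For illustration, a minimal sketch of querying the reconcile metrics if they are collected somewhere (the `controller` and `result` label names follow controller-runtime's conventions; the `job` label value is an assumption):

```promql
# Hypothetical sketch: reconcile outcomes per controller. The metric and
# label names follow controller-runtime conventions; the job label value
# is an assumption.
sum by (controller, result) (
  rate(controller_runtime_reconcile_total{job="cluster-autoscaler-operator"}[5m])
)
```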

@elmiko (Contributor, Author) commented Jul 9, 2020

/retest

@michaelgugino (Contributor) left a comment

/lgtm

/approve

@openshift-ci-robot added the lgtm label on Jul 9, 2020
@openshift-ci-robot
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre, michaelgugino

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:
  • OWNERS [enxebre,michaelgugino]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot merged commit 3790a11 into openshift:master on Jul 9, 2020
@elmiko deleted the add-metrics-doc branch on July 10, 2020 13:56