add metrics.md document #591

elmiko · 2020-05-15T18:06:06Z

This change adds a document which details all the available metrics for
scraping by Prometheus. It has sample dumps along with some text to help
guide the reader.

JoelSpeed · 2020-05-18T12:41:10Z

docs/dev/metrics.md

+```
+# HELP mapi_machine_created_timestamp_seconds Timestamp of the mapi managed Machine creation time
+# TYPE mapi_machine_created_timestamp_seconds gauge
+mapi_machine_created_timestamp_seconds{api_version="machine.openshift.io/v1beta1",name="ocp-cluster-rndpg-master-0",namespace="openshift-machine-api",node="ip-10-0-130-139.us-east-2.compute.internal",phase="Running",spec_provider_id="aws:///us-east-2a/i-08624d119917119d6"} 1.589550152e+09


I notice there are some labels on this that might change through the lifecycle of a machine, e.g. phase. Does this really just report when a machine was created, or is there something deeper here that's not being expressed by the name? I wonder if this needs a bit more explanation.

that's a good question. i produced this output by scraping the mao on a running cluster i had been using for testing, i will look into this a little deeper.

ok, did some checking. the mapi_machine_created_timestamp_seconds and mapi_machine_set_created_timestamp_seconds will both update their .status.phase on every update cycle (not sure of the frequency here). theoretically, the other values could change as well but it seems like phase is the only one that will change.

should i add a note about this behavior?

I'd be tempted to add some detail if you think this will add understanding for the metrics.

At the moment, this to me is just a list of prometheus metrics, and it's still quite hard to interpret. If there's any detail we can work out and add to explain what the metrics are and why they exist, that's helpful for future readers so they don't have to try and interpret the metrics scrape themselves.

i think that's fair. my thought process here was to put up something with the raw scrape, broken into sections, to help start the ball rolling. i'm a little torn about how much detail to add here, but i'll go back and give it another pass.

elmiko · 2020-05-18T14:40:18Z

added more text around the metrics sections and reorganized things a little.

JoelSpeed

/lgtm

Thanks for the additional info, I think it adds a lot more value to the doc!

elmiko · 2020-06-02T12:51:03Z

/kind documentation

enxebre · 2020-06-09T13:34:38Z

docs/dev/metrics.md

+They can be used to diagnose issues such as increased memory or cpu usage and
+other system resource related queries.
+
+```


could you point me to where these are coming from?

i found these by scraping the running pod, but i will dig into the code a little to figure out what is generating them.

so, apparently these are coming from the prometheus go-client. i am trying to find some documentation that i could link to, i think it will make these cleaner.

elmiko · 2020-06-09T20:00:20Z

@enxebre i cleaned this up considerably and removed all the metrics from the prometheus client-go in favor of links to those docs and code.

JoelSpeed · 2020-06-10T08:18:03Z

/lgtm

Defer to @enxebre to approve

enxebre · 2020-06-10T08:24:01Z

thanks a lot @elmiko
/retest
/approve

openshift-ci-robot · 2020-06-10T08:24:17Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre

Needs approval from an approver in each of these files:

~~OWNERS~~ [enxebre]

openshift-bot · 2020-06-10T11:15:17Z

/retest