add metrics.md document #591

Merged
merged 1 commit into openshift:master from add-metrics-doc on Jun 10, 2020

Contributor

elmiko commented May 15, 2020

This change adds a document which details all the available metrics for
scraping by Prometheus. It has sample dumps along with some text to help
guide the reader.

```
# HELP mapi_machine_created_timestamp_seconds Timestamp of the mapi managed Machine creation time
# TYPE mapi_machine_created_timestamp_seconds gauge
mapi_machine_created_timestamp_seconds{api_version="machine.openshift.io/v1beta1",name="ocp-cluster-rndpg-master-0",namespace="openshift-machine-api",node="ip-10-0-130-139.us-east-2.compute.internal",phase="Running",spec_provider_id="aws:///us-east-2a/i-08624d119917119d6"} 1.589550152e+09
Contributor

I notice there are some labels on this that might change through the lifecycle of a machine, e.g. phase. Does this really just report when a machine was created, or is there something deeper here that the name doesn't express? I wonder if this needs a bit more explanation.

Contributor Author

that's a good question. i produced this output by scraping the mao on a running cluster i had been using for testing, i will look into this a little deeper.

Contributor Author

ok, did some checking. the mapi_machine_created_timestamp_seconds and mapi_machine_set_created_timestamp_seconds metrics will both update their phase label (from .status.phase) on every update cycle (not sure on the frequency here). theoretically, the other values could change as well, but it seems like phase is the only one that will change.

should i add a note about this behavior?

Contributor

I'd be tempted to add some detail if you think this will add understanding for the metrics.

At the moment, this reads to me as just a list of Prometheus metrics, and it's still quite hard to interpret. If there's any detail we can work out and add to explain what the metrics are and why they exist, that's helpful for future readers, so they don't have to interpret the metrics scrape themselves.

Contributor Author

i think that's fair. my thought process here was to put up something with the raw scrape, broken into sections, to help start the ball rolling. i'm a little torn about how much detail to add here, but i'll go back and give it another pass.

Contributor Author

elmiko commented May 18, 2020

added more text around the metrics sections and reorganized things a little.

Contributor

JoelSpeed left a comment

/lgtm

Thanks for the additional info, I think it adds a lot more value to the doc!

openshift-ci-robot added the lgtm label May 18, 2020

Contributor Author

elmiko commented Jun 2, 2020

/kind documentation

openshift-ci-robot added the kind/documentation label Jun 2, 2020
enxebre mentioned this pull request Jun 5, 2020

They can be used to diagnose issues such as increased memory or cpu usage and
other system resource related queries.

```
Member

could you point me to where these are coming from?

Contributor Author

i found these by scraping the running pod, but i will dig into the code a little to figure out what is generating them.

Contributor Author

so, apparently these are coming from the prometheus go-client. i am trying to find some documentation that i could link to, i think it will make these cleaner.
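
as a rough illustration of that split, the sketch below groups metric names from a scrape by prefix. it assumes that names starting with go_ or process_ come from the go client's default collectors and that everything else belongs to the operator. the prefix list, the helper names, and every sample line except the mapi_ one are hypothetical, and the mapi_ line is shortened to two of its labels.

```python
import re
from collections import defaultdict

# Assumption: metrics with these prefixes are emitted by the Prometheus Go
# client's default collectors (Go runtime and process metrics).
CLIENT_PREFIXES = ("go_", "process_")

# Hypothetical scrape excerpt. Only the mapi_ line is taken from this PR
# (with its label set shortened); the other lines are stand-ins.
SAMPLE = """\
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 42
process_open_fds 12
mapi_machine_created_timestamp_seconds{name="ocp-cluster-rndpg-master-0",phase="Running"} 1.589550152e+09
"""

# A metric name is the leading identifier on a sample line.
NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def metric_names(text):
    """Return the metric name of every sample line, skipping comments."""
    names = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # HELP/TYPE comments and blank lines carry no sample
        match = NAME_RE.match(line)
        if match:
            names.append(match.group(0))
    return names


def group_by_origin(text):
    """Split metric names into 'client' and 'operator' groups by prefix."""
    groups = defaultdict(set)
    for name in metric_names(text):
        origin = "client" if name.startswith(CLIENT_PREFIXES) else "operator"
        groups[origin].add(name)
    return {origin: sorted(names) for origin, names in groups.items()}


if __name__ == "__main__":
    for origin, names in group_by_origin(SAMPLE).items():
        print(origin, names)
```

running it against a real scrape would only need SAMPLE replaced with the text returned by the metrics endpoint; the prefix list would likely need adjusting for other collectors.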

openshift-ci-robot removed the lgtm label Jun 9, 2020

Contributor Author

elmiko commented Jun 9, 2020

@enxebre i cleaned this up considerably and removed all the metrics from the prometheus go-client in favor of links to those docs and code.

@JoelSpeed
Contributor

/lgtm

Defer to @enxebre to approve

openshift-ci-robot added the lgtm label Jun 10, 2020

Member

enxebre commented Jun 10, 2020

thanks a lot @elmiko
/retest
/approve

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot added the approved label Jun 10, 2020

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

Contributor

openshift-ci-robot commented Jun 10, 2020

@elmiko: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-azure-operator | cc8878b | link | /test e2e-azure-operator |
| ci/prow/e2e-azure | cc8878b | link | /test e2e-azure |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

openshift-merge-robot merged commit 6c26955 into openshift:master Jun 10, 2020
elmiko deleted the add-metrics-doc branch June 11, 2020 12:47
Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
kind/documentation: Categorizes issue or PR as related to documentation.
lgtm: Indicates that a PR is ready to be merged.