Gather more monitoring data #234

Merged

Conversation

sthaha
Contributor

@sthaha sthaha commented Jun 8, 2021

This patch gathers more monitoring data from

  1. Prometheus, such as
  • rules
  • alertmanagers
  • status/config
  • status/flags
  • status/runtimeinfo
  • status/tsdb
  2. AlertManager, such as
  • /api/v2/status

JIRA: https://issues.redhat.com/browse/MON-1234

Signed-off-by: Sunil Thaha sthaha@redhat.com
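
For orientation, a sketch of the per-endpoint gather calls the list above implies, reusing the prom_get/alertmanager_get helper names and the || true convention that come up in the review below; the output file stems are illustrative, not necessarily the script's actual names:

# one gather call per endpoint; || true keeps a single failure from aborting the run
prom_get rules rules || true
prom_get alertmanagers alertmanagers || true
prom_get status/config status_config || true
prom_get status/flags status_flags || true
prom_get status/runtimeinfo status_runtimeinfo || true
prom_get status/tsdb status_tsdb || true
alertmanager_get api/v2/status status || true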

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 8, 2021
@sthaha
Contributor Author

sthaha commented Jun 8, 2021

@sthaha sthaha force-pushed the gather-more-monitoring branch 3 times, most recently from b9afcb4 to d04501e on June 9, 2021 07:54
@paulfantom paulfantom left a comment

lgtm

@sthaha sthaha changed the title WIP: Gather more monitoring data Gather more monitoring data Jun 9, 2021
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 9, 2021
@dgrisonnet
Member

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 9, 2021
@dgrisonnet
Member

/assign @sferich888

@deads2k
Contributor

deads2k commented Jun 9, 2021

I'm in favor of gathering metrics. Can you indicate whether any of the data being gathered by this change scales with the time the cluster has been running? We had an issue a while back with gathering node logs for all time, so the longer a cluster ran, the more data we pulled back. We want to bound the data being gathered to about two to three days in the no-arg case, and longer if specifically requested when running the script.

@paulfantom

@deads2k The script is not querying for metrics; it "only" gets status information from Prometheus and Alertmanager. As such it will gather an almost constant amount of data regardless of cluster size or runtime. The only variance is the number of alerting rules, but we are in control of that volume since the rules are part of the OpenShift release payload and actively reconciled by CMO.

@sthaha
Contributor Author

sthaha commented Jun 15, 2021

@deads2k Are we good to merge this so that we can make more progress on #234 ?

@simonpasquier

/lgtm

Very cool! It will already provide great value for troubleshooting live clusters.

@sferich888
Contributor

Should this be superseded by #214?

@dgrisonnet
Member

No, in my opinion they are two different initiatives, both of which should be considered to improve the amount of information gathered around monitoring. This particular PR adds information that we (the monitoring team) found necessary when investigating Bugzillas, and as Pawel said, the data gathered should be constant regardless of cluster size or runtime, so it shouldn't break any size limits required by the must-gather.
#214 is a bit different since it introduces the ability to gather particular metrics/dashboards, but since that would amount to a lot of data, it would be disabled by default: only upon request would a customer gather this data and share it with us. I personally think this would be very useful when asking customers for particular metrics, but it needs to be thought through since it will amount to a lot of data.

@sthaha
Contributor Author

sthaha commented Jun 28, 2021

@deads2k @sferich888 Are we good to merge this?

@sferich888 sferich888 left a comment

Right now I think this needs a more thorough code review to clean up some of the complexities this script creates/uses (some of this is baggage from prior commits).

I also want to understand how this overlaps with the work being done in #214


# force disk flush to ensure that all data gathered is accessible in the copy container
sync
cleanup() {
Contributor

Is this needed? When the 'container' is destroyed (at the completion of a gather) won't this also be deleted then? Is this a wasted cycle?

Contributor Author

I wasn't sure either if the cleanup is really needed, but I left it there because:

  1. I would consider it best practice for each script to clean up what it created, so that it doesn't influence/affect other scripts that get executed after this one (the usual pattern is sketched below).
  2. This was already present in the original implementation.
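
For illustration, a minimal trap-based sketch of that pattern, assuming the ca-bundle.crt discussed later in this thread is the scratch file to remove:

# remove this script's scratch files on exit, however the script exits
cleanup() {
  rm -f "${MONITORING_PATH}/ca-bundle.crt"  # hypothetical: the CA bundle mentioned below
}
trap cleanup EXIT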

Contributor Author

On a closer look, yes, we want to delete the CA bundle since that isn't something we want to gather.

--token="${SA_TOKEN}" \
--certificate-authority="${MONITORING_PATH}/ca-bundle.crt" \
--raw=/api/v1/rules?type=alert 2>"${MONITORING_PATH}/alert.stderr" > "${MONITORING_PATH}/alerts.json"
SA_TOKEN="$(oc sa get-token default)"
Contributor

Is this needed? must-gather is given 'cluster-admin' by overriding the default service account in the newly created namespace.

https://github.com/openshift/oc/blob/master/pkg/cli/admin/mustgather/mustgather.go#L564-L585

In short: by getting this SA and creating lines 35-40, we are just complicating the understanding of this script.

Contributor Author

@sthaha sthaha Jul 7, 2021

The reason we need the token is that we use it to authenticate with Prometheus and Alertmanager.

I agree the use of oc_get isn't too obvious. I will remove oc_get and repeat the code in prom_get and alertmanager_get (roughly as sketched below).
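
A rough sketch of what that refactor could look like, modeled on the excerpt above; PROM_ROUTE and ALERTMANAGER_ROUTE are assumed route hostnames, not variables from the actual diff:

SA_TOKEN="$(oc sa get-token default)"

prom_get() {
  # $1: API path under /api/v1 (query string allowed), $2: output file basename
  local endpoint="$1" output="$2"
  oc get --server="https://${PROM_ROUTE}" \
    --token="${SA_TOKEN}" \
    --certificate-authority="${MONITORING_PATH}/ca-bundle.crt" \
    --raw="/api/v1/${endpoint}" \
    2>"${MONITORING_PATH}/${output}.stderr" >"${MONITORING_PATH}/${output}.json"
}

alertmanager_get() {
  # $1: full API path (e.g. api/v2/status), $2: output file basename
  local endpoint="$1" output="$2"
  oc get --server="https://${ALERTMANAGER_ROUTE}" \
    --token="${SA_TOKEN}" \
    --certificate-authority="${MONITORING_PATH}/ca-bundle.crt" \
    --raw="/${endpoint}" \
    2>"${MONITORING_PATH}/${output}.stderr" >"${MONITORING_PATH}/${output}.json"
}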

# begin gathering
# NOTE || true ignores failures

prom_get rules rules || true
Contributor

None of the following need a second argument; in lines 57-58 you only use the first input.

Contributor Author

@sthaha sthaha Jul 7, 2021

Actually, this is indeed useful: say I want to GET /api/v1/rules?type=alert (like before), then I can invoke

prom_get rules?type=alert alerts

That it currently happens not to use 'rules?type=alert&foo=bar' is only coincidental.

The syntax is prom_get $api/endpoint $output

Contributor

We should put a code comment to that effect, so that others who come along can understand this.
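
Something along these lines would do (a suggested comment, not committed code):

# prom_get <endpoint> <output>
#   <endpoint>: path under Prometheus' /api/v1; a query string is allowed,
#               e.g. "rules?type=alert"
#   <output>:   basename for the .json/.stderr files written under
#               ${MONITORING_PATH}
#
# Example: prom_get rules?type=alert alerts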

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jul 7, 2021
@sferich888
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 12, 2021
@openshift-ci
Contributor

openshift-ci bot commented Jul 12, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgrisonnet, paulfantom, sferich888, simonpasquier, sthaha

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 12, 2021
@openshift-merge-robot openshift-merge-robot merged commit 79aaee8 into openshift:master Jul 12, 2021
@sferich888
Contributor

/cherry-pick release-4.6

@openshift-cherrypick-robot

@sferich888: #234 failed to apply on top of branch "release-4.6":

Applying: Gather more monitoring data
Using index info to reconstruct a base tree...
A	collection-scripts/gather_monitoring
Falling back to patching base and 3-way merge...
CONFLICT (modify/delete): collection-scripts/gather_monitoring deleted in HEAD and modified in Gather more monitoring data. Version Gather more monitoring data of collection-scripts/gather_monitoring left in tree.
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Gather more monitoring data
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-4.6

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
