Gather more monitoring data #234

Merged

Conversation

sthaha
Contributor

@sthaha sthaha commented Jun 8, 2021

This patch gathers more monitoring data from

  1. Prometheus, such as
  • rules
  • alertmanagers
  • status/config
  • status/flags
  • status/runtimeinfo
  • status/tsdb
  2. AlertManager, such as
  • /api/v2/status

JIRA: https://issues.redhat.com/browse/MON-1234

Signed-off-by: Sunil Thaha sthaha@redhat.com
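
For orientation, a sketch of the per-endpoint gather calls the list above implies, reusing the prom_get/alertmanager_get helper names and the || true convention that come up in the review below; the output file stems are illustrative, not necessarily the script's actual names:

# one gather call per endpoint; || true keeps a single failure from aborting the run
prom_get rules rules || true
prom_get alertmanagers alertmanagers || true
prom_get status/config status_config || true
prom_get status/flags status_flags || true
prom_get status/runtimeinfo status_runtimeinfo || true
prom_get status/tsdb status_tsdb || true
alertmanager_get api/v2/status status || true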

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 8, 2021
@sthaha
Contributor Author

sthaha commented Jun 8, 2021

@sthaha sthaha force-pushed the gather-more-monitoring branch 3 times, most recently from b9afcb4 to d04501e on June 9, 2021 07:54
@paulfantom paulfantom left a comment

lgtm

@sthaha sthaha changed the title WIP: Gather more monitoring data Gather more monitoring data Jun 9, 2021
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 9, 2021
@dgrisonnet
Member

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 9, 2021
@dgrisonnet
Member

/assign @sferich888

@deads2k
Contributor

deads2k commented Jun 9, 2021

I'm in favor of gathering metrics. Can you indicate whether any of the data being gathered by this change scales with the time the cluster has been running? We had an issue a while back with gathering node logs for all time, so the longer a cluster ran, the more data we pulled back. We want to bound the data being gathered to about two to three days in the no-arg case, and longer if specifically requested when running the script.

@paulfantom

@deads2k The script is not querying for metrics; it "only" gets status information from Prometheus and Alertmanager. As such it will gather an almost constant amount of data regardless of cluster size or runtime. The only variance is the number of alerting rules, but we are in control of that volume since the rules are part of the OpenShift release payload and actively reconciled by CMO.

@sthaha
Contributor Author

sthaha commented Jun 15, 2021

@deads2k Are we good to merge this so that we can make more progress on #234 ?

@simonpasquier

/lgtm

Very cool! It will already provide great value for troubleshooting live clusters.

@sferich888
Contributor

Should this be superseded by #214?

@dgrisonnet
Member

No, in my opinion they are two different initiatives, both of which should be considered to improve the amount of information gathered around monitoring. This particular PR adds information that we (the monitoring team) found necessary when investigating Bugzillas, and as Pawel said, the data gathered should be constant regardless of cluster size or runtime, so it shouldn't break any size limits required by the must-gather.
#214 is a bit different since it introduces the ability to gather particular metrics/dashboards, but since that would amount to a lot of data, it would be disabled by default: only upon request would a customer gather this data and share it with us. I personally think this would be very useful when asking customers for particular metrics, but it needs to be thought through since it will amount to a lot of data.

@sthaha
Contributor Author

sthaha commented Jun 28, 2021

@deads2k @sferich888 Are we good to merge this?

@sferich888 sferich888 left a comment

Right now I think this needs a more thorough code review to clean up some of the complexities this script creates/uses (some of this is baggage from prior commits).

I also want to understand how this overlaps with the work being done in #214


# force disk flush to ensure that all data gathered is accessible in the copy container
sync
cleanup() {
Contributor

Is this needed? When the 'container' is destroyed (at the completion of a gather) won't this also be deleted then? Is this a wasted cycle?

Contributor Author

I wasn't sure either if the cleanup is really needed, but I left it there because:

  1. I would consider it best practice for each script to clean up what it created, so that it doesn't influence/affect other scripts that get executed after this one (the usual pattern is sketched below).
  2. This was already present in the original implementation.
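
For illustration, a minimal trap-based sketch of that pattern, assuming the ca-bundle.crt discussed later in this thread is the scratch file to remove:

# remove this script's scratch files on exit, however the script exits
cleanup() {
  rm -f "${MONITORING_PATH}/ca-bundle.crt"  # hypothetical: the CA bundle mentioned below
}
trap cleanup EXIT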

Contributor Author

On a closer look, yes, we want to delete the CA bundle since that isn't something we want to gather.

--token="${SA_TOKEN}" \
--certificate-authority="${MONITORING_PATH}/ca-bundle.crt" \
--raw=/api/v1/rules?type=alert 2>"${MONITORING_PATH}/alert.stderr" > "${MONITORING_PATH}/alerts.json"
SA_TOKEN="$(oc sa get-token default)"
Contributor

Is this needed? must-gather is given 'cluster-admin' by overriding the default service account in the newly created namespace.

https://github.com/openshift/oc/blob/master/pkg/cli/admin/mustgather/mustgather.go#L564-L585

In short: by getting this SA and creating lines 35-40, we are just complicating the understanding of this script.

Contributor Author

@sthaha sthaha Jul 7, 2021

The reason we need the token is that we use it to authenticate with Prometheus and Alertmanager.

I agree the use of oc_get isn't too obvious. I will remove oc_get and repeat the code in prom_get and alertmanager_get (roughly as sketched below).
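
A rough sketch of what that refactor could look like, modeled on the excerpt above; PROM_ROUTE and ALERTMANAGER_ROUTE are assumed route hostnames, not variables from the actual diff:

SA_TOKEN="$(oc sa get-token default)"

prom_get() {
  # $1: API path under /api/v1 (query string allowed), $2: output file basename
  local endpoint="$1" output="$2"
  oc get --server="https://${PROM_ROUTE}" \
    --token="${SA_TOKEN}" \
    --certificate-authority="${MONITORING_PATH}/ca-bundle.crt" \
    --raw="/api/v1/${endpoint}" \
    2>"${MONITORING_PATH}/${output}.stderr" >"${MONITORING_PATH}/${output}.json"
}

alertmanager_get() {
  # $1: full API path (e.g. api/v2/status), $2: output file basename
  local endpoint="$1" output="$2"
  oc get --server="https://${ALERTMANAGER_ROUTE}" \
    --token="${SA_TOKEN}" \
    --certificate-authority="${MONITORING_PATH}/ca-bundle.crt" \
    --raw="/${endpoint}" \
    2>"${MONITORING_PATH}/${output}.stderr" >"${MONITORING_PATH}/${output}.json"
}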

# begin gathering
# NOTE || true ignores failures

prom_get rules rules || true
Contributor

None of the following need a second argument; in lines 57-58 you only use the first input.

Contributor Author

@sthaha sthaha Jul 7, 2021

Actually, this is indeed useful: say I want to GET /api/v1/rules?type=alert (like before), then I can invoke

prom_get rules?type=alert alerts

That it currently happens not to use 'rules?type=alert&foo=bar' is only coincidental.

The syntax is prom_get $api/endpoint $output

Contributor

We should put a code comment to that effect, so that others who come along can understand this.
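
Something along these lines would do (a suggested comment, not committed code):

# prom_get <endpoint> <output>
#   <endpoint>: path under Prometheus' /api/v1; a query string is allowed,
#               e.g. "rules?type=alert"
#   <output>:   basename for the .json/.stderr files written under
#               ${MONITORING_PATH}
#
# Example: prom_get rules?type=alert alerts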

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jul 7, 2021
@sferich888
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 12, 2021
@openshift-ci
Contributor

openshift-ci bot commented Jul 12, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgrisonnet, paulfantom, sferich888, simonpasquier, sthaha

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 12, 2021
@openshift-merge-robot openshift-merge-robot merged commit 79aaee8 into openshift:master Jul 12, 2021
@sferich888
Contributor

/cherry-pick release-4.6

@openshift-cherrypick-robot

@sferich888: #234 failed to apply on top of branch "release-4.6":

Applying: Gather more monitoring data
Using index info to reconstruct a base tree...
A	collection-scripts/gather_monitoring
Falling back to patching base and 3-way merge...
CONFLICT (modify/delete): collection-scripts/gather_monitoring deleted in HEAD and modified in Gather more monitoring data. Version Gather more monitoring data of collection-scripts/gather_monitoring left in tree.
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Gather more monitoring data
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-4.6

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
