Gather more monitoring data #234
Conversation
Force-pushed from b9afcb4 to d04501e
lgtm
Force-pushed from d04501e to 13017c4
/lgtm
/assign @sferich888
I'm in favor of gathering metrics. Can you indicate whether any of the data being gathered by this change scales with the time the cluster has been running? We had an issue a while back with gathering node logs for all time, so the longer a cluster ran, the more data we pulled back. We want to bound the data being gathered to about two to three days in the no-arg case, and longer only if specifically requested when running the script.
@deads2k The script is not querying for metrics; it "only" gets status information from prometheus and alertmanager. As such it will gather an almost constant amount of data regardless of cluster size or its runtime. The only variance is the number of alerting rules, but we are in control of that volume since it is part of the OpenShift release payload and actively reconciled by CMO.
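To illustrate why the volume stays roughly constant: the status-style endpoints form a fixed list chosen when the script is written. A minimal sketch (the `prom_path` helper is hypothetical, not from this PR; the real script uses `oc --raw` against the service):

```shell
# Sketch under assumptions: the endpoint list is fixed at script-authoring
# time, so the amount of data gathered does not grow with cluster age.
# prom_path is a hypothetical helper that only builds the API path.
prom_path() {
  printf '/api/v1/%s' "$1"
}

for endpoint in status/config status/flags status/runtimeinfo status/tsdb; do
  prom_path "$endpoint"
  echo
done
```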
/lgtm
Very cool! It will already provide great value for troubleshooting live clusters.
Should this be superseded by #214?
No, in my opinion they are two different initiatives, both of which improve the amount of information gathered around monitoring. This particular PR adds information that we (the monitoring team) found necessary when investigating Bugzillas, and as Pawel said, the data gathered should be constant regardless of the cluster size or its runtime, so it shouldn't break any size limits required by must-gather.
@deads2k @sferich888 Are we good to merge this? |
Right now I think this needs a more thorough code review to clean up some of the complexities this script creates/uses (some of this is baggage from prior commits).
I also want to understand how this overlaps with the work being done in #214
# force disk flush to ensure that all data gathered is accessible in the copy container
sync
cleanup() {
Is this needed? When the 'container' is destroyed (at the completion of a gather) won't this also be deleted then? Is this a wasted cycle?
I wasn't sure either whether the cleanup is really needed, but I left it there because:
- I would consider it best practice for each script to clean up what it created, so that it doesn't influence/affect other scripts that get executed after this one
- it was already present in the original implementation
On a closer look, yes, we want to delete the CA bundle since that isn't something we want to gather.
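A minimal sketch of how that could look, assuming a trap-based cleanup (the temp-dir setup here is illustrative, not the PR's actual code; only the `MONITORING_PATH` and `ca-bundle.crt` names mirror the diff):

```shell
# Illustrative only: MONITORING_PATH and ca-bundle.crt mirror names used in
# this PR; the trap-on-EXIT structure is an assumption.
MONITORING_PATH="$(mktemp -d)"
touch "${MONITORING_PATH}/ca-bundle.crt"  # stand-in for the fetched CA bundle

cleanup() {
  # delete only what this script created and should not end up in the gather
  rm -f "${MONITORING_PATH}/ca-bundle.crt"
}
trap cleanup EXIT
```

Binding the removal to EXIT means the CA bundle is deleted even if a later gather step fails.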
--token="${SA_TOKEN}" \
--certificate-authority="${MONITORING_PATH}/ca-bundle.crt" \
--raw=/api/v1/rules?type=alert 2>"${MONITORING_PATH}/alert.stderr" > "${MONITORING_PATH}/alerts.json"
SA_TOKEN="$(oc sa get-token default)"
Is this needed? must-gather is given 'cluster-admin' by overriding the default service account in the newly created namespace.
https://github.com/openshift/oc/blob/master/pkg/cli/admin/mustgather/mustgather.go#L564-L585
In short: by getting this SA and creating lines 35-40, we are just complicating the understanding of this script.
The reason we need the token is because we use it for authentication with prometheus and alertmanager.
I agree the use of oc_get isn't too obvious. I will remove oc_get and repeat the code in prom_get and alertmanager_get.
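The shape that refactor might take can be sketched as follows. Hedged heavily: the helper bodies are assumptions, and the transport is stubbed with printf so no cluster is needed; the real helpers would call `oc --raw` with `--token` as in the diff above.

```shell
# Hedged sketch: after inlining oc_get, both prom_get and alertmanager_get
# pass the service-account token themselves.
SA_TOKEN="fake-token-for-illustration"  # real script: "$(oc sa get-token default)"

authed_get() {
  # stands in for: oc ... --token="${SA_TOKEN}" --raw="$1"
  printf 'Bearer %s GET %s\n' "${SA_TOKEN}" "$1"
}

prom_get() { authed_get "/api/v1/$1"; }
alertmanager_get() { authed_get "/api/v2/$1"; }

prom_get rules
alertmanager_get status
```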
# begin gathering
# NOTE: || true ignores failures
prom_get rules rules || true
None of the following need a second argument. In lines 57-58 you only use the first input.
Actually, this is indeed useful. Say I want to GET /api/v1/rules?type=alert (like before), then I can invoke:
prom_get rules?type=alert alerts
That it currently happens not to use something like 'rules?type=alert&foo=bar' is only coincidental.
The syntax is: prom_get $api/endpoint $output
We should put a code comment to that effect, so that others who come along can understand this.
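Such a comment plus the two-argument contract could be sketched like this (the body is stubbed with echo so the argument handling is visible without a cluster; the real implementation calls the Prometheus API via oc):

```shell
# prom_get <endpoint> <output>: fetch /api/v1/<endpoint> and write the
# response to <output>.json. The endpoint may carry query parameters,
# e.g. prom_get "rules?type=alert" alerts.
# (Sketch only: the network call is replaced by echo.)
prom_get() {
  endpoint="$1"
  output="$2"
  echo "GET /api/v1/${endpoint} -> ${output}.json"
}

prom_get "rules?type=alert" alerts
```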
This patch gathers more monitoring data from

1. `Prometheus`, such as
   * rules
   * alertmanagers
   * status/config
   * status/flags
   * status/runtimeinfo
   * status/tsdb
2. `AlertManager`, such as
   * /api/v2/status

JIRA: https://issues.redhat.com/browse/MON-1234
Signed-off-by: Sunil Thaha <sthaha@redhat.com>
Force-pushed from 13017c4 to 4d36ee0
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dgrisonnet, paulfantom, sferich888, simonpasquier, sthaha The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/cherry-pick release-4.6
@sferich888: #234 failed to apply on top of branch "release-4.6":
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.