Bug 1744752: Add a new alert rule for reporting the Machine-api Operator failure #388
Conversation
@vikaschoudhary16: This pull request references Bugzilla bug 1744752, which is invalid:
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/bugzilla refresh
@vikaschoudhary16: This pull request references Bugzilla bug 1744752, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/lgtm
@vikaschoudhary16 I was wondering, couldn't you put all those rules into one object? Example: https://github.com/openshift/release/blob/master/core-services/openshift-monitoring/cfg_prometheus-prow-rules_prometheusrule.yaml
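For illustration, a single PrometheusRule object can hold several alert rules in one group. The sketch below only shows the shape; the object name, group name, and placeholder rules are assumptions, not the contents of the linked file.

```yaml
# Hypothetical sketch of one PrometheusRule object carrying multiple alert
# rules in a single group, similar in shape to the linked prow rules file.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: machine-api-operator-rules     # illustrative name
  namespace: openshift-machine-api
spec:
  groups:
  - name: machine-api-operator.rules   # one group can carry many rules
    rules:
    - alert: FirstPlaceholderAlert     # placeholder rule
      expr: vector(1) == 0             # never fires; stands in for a real expression
      labels:
        severity: warning
    - alert: SecondPlaceholderAlert    # placeholder rule
      expr: vector(1) == 0
      labels:
        severity: critical
```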
/test e2e-aws-operator
52ed5c7 to 12f5df9 (force-push)
@vikaschoudhary16: This pull request references Bugzilla bug 1744752, which is valid. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
12f5df9 to 3851641 (force-push)
```yaml
severity: critical
annotations:
  message: "machine api MAO metric {{ $labels.kind }} is not being reported"
- name: mao-down
```
Personally I think we should expand M-A-O. It's easier for newcomers/users/customers/et al. to sound it out than to guess what it stands for.
I agree, but personally won't block this PR on it. We should try to merge this today.
My main question is: Doesn't this new alert make the existing MachineMAOMetricsDown alert unnecessary? If so, should we remove it?
No, it does not make MachineMAOMetricsDown unnecessary. This alert covers the overall MAO process, while the other one is specific to metrics reporting. It may happen that the MAO is not down but metrics are still not being reported for some reason.
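Roughly, the two alerts would look like the sketch below; the expressions and durations are assumptions used only to illustrate the split between "process down" and "metrics collection failing".

```yaml
# Illustrative only: two complementary alerts, not the exact rules from the PR.
- alert: MachineAPIOperatorDown
  # The operator process itself is gone: its scrape target no longer exists.
  expr: absent(up{job="machine-api-operator"} == 1)
  for: 5m
  labels:
    severity: critical
- alert: MachineMAOMetricsDown
  # The operator is running, but its internal collector reports that it
  # cannot gather metrics about its operands (MachineSets, Machines, ...).
  expr: mapi_mao_collector_up == 0
  for: 5m
  labels:
    severity: critical
```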
Ah, so the mapi_mao_collector_up metric tracks whether we're successfully scraping operands, or MachineSets etc.?
Yes, or any other similar problem that is specific to metrics reporting.
If the team itself has trouble identifying what mapi_mao_collector_up refers to, how is a customer supposed to know?
Also, a good practice is to expose metrics from the operand itself rather than aggregating them inside an operator. For example, look at how the OpenShift or kube apiserver does it.
@paulfantom regarding the naming confusion, I have added more details and improved the naming.
MAO has also been expanded to "machine-api-operator" in all instances.
Regarding reporting from the operand, we discussed this internally and decided to report cluster-state-related metrics from the operator. Operands will also report operand-specific metrics in the future.
/test e2e-aws
3851641 to 1a3293b (force-push)
```yaml
rules:
- alert: MachineAPIOperatorMetricsCollectionFailing
  expr: |
    mapi_machine-api-operator_collector_up == 0
```
Can we / should we consistently use _ (underscore) here?
Yes, you should; "-" is an invalid character in the Prometheus data model.
The metric name specifies the general feature of a system that is measured [...] It may contain ASCII letters and digits, as well as underscores and colons. It must match the regex [a-zA-Z_:][a-zA-Z0-9_:]*
Source: https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels
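For example, assuming the intent is to spell out the operator name inside the metric, the hyphenated form would need to become all underscores to satisfy that regex:

```yaml
# Invalid: '-' is not allowed in a Prometheus metric name.
#   mapi_machine-api-operator_collector_up == 0
# Valid equivalent (assumed spelling, underscores only):
expr: |
  mapi_machine_api_operator_collector_up == 0
```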
```yaml
labels:
  severity: critical
annotations:
  message: "machine api operator metrics collection is failing. Check machine api operator logs for more details."
```
We could be super helpful and, where we say "check ... logs", turn that into something that can be cut and pasted: "oc logs -n openshift-machine-api machine-api-operator", for example.
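For instance, the annotation could carry the command verbatim; the deployment selector in this sketch is an assumption, not the wording that ended up in the PR:

```yaml
# Sketch of an annotation with a copy/paste-able command; the
# "deployment/machine-api-operator" selector is an assumption.
annotations:
  message: >-
    machine-api-operator metrics collection is failing.
    For more details run:
    oc logs -n openshift-machine-api deployment/machine-api-operator
```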
We usually focus on precisely describing what happened, as this message also shows up on SRE pagers, Slack channels, and email. After that, they are supposed to have runbooks on how to proceed. Personally, I don't see much value in an alert telling me to check the logs.
On Slack, you can see examples of how alerts are reported by going to one of:
#ops-testplatform (managed by DPTP)
#team-monitoring-alert (managed by our SREs)
@paulfantom there I saw alert messages just stating what problem has happened, e.g.:
Alert: TelemeterDown - stage [FIRING:1] No targets found for Telemeter in namespace . Telemeter is down.
Providing the exact command to check the logs is, I think, an improvement over not suggesting anything.
"usually focus on precisely describing what happened"
"machine api operator metrics collection is failing" clearly describes that metrics collection is failing.
1a3293b to 66e8722 (force-push)
bison left a comment:
/lgtm
```yaml
severity: critical
annotations:
  message: "machine api MAO metric {{ $labels.kind }} is not being reported"
    message: "machine api operator metrics collection is failing. For more details: oc logs <machine-api-operator-pod-name> -n openshift-machine-api"
```
Nit: the line is indented too much
66e8722 to 9b666d6 (force-push)
/test e2e-azure-operator
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: ingvagabund
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@vikaschoudhary16: All pull requests linked via external trackers have merged. Bugzilla bug 1744752 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
…eratorDown The cluster-version operator is responsible for complaining if the machine-API operator's deployment is sad, so no need for the operator to handle this directly (we end up doubling up if there's an issue). This drops the alert which was added in 9b666d6 (Add a new alert rule for reporting the MAO failure, 2019-08-23, openshift#388).
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1744752