Bug 1744752: Add a new alert rule for reporting the Machine-api Operator failure #388
Conversation
@vikaschoudhary16: This pull request references Bugzilla bug 1744752, which is invalid:
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/bugzilla refresh
@vikaschoudhary16: This pull request references Bugzilla bug 1744752, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/lgtm
@vikaschoudhary16 I was wondering, couldn't you put all those rules into one object? Example: https://github.com/openshift/release/blob/master/core-services/openshift-monitoring/cfg_prometheus-prow-rules_prometheusrule.yaml
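For illustration, a single PrometheusRule object can hold several alert rules in one group. The sketch below only shows the shape; the object name, group name, and placeholder rules are assumptions, not the contents of the linked file.

```yaml
# Hypothetical sketch of one PrometheusRule object carrying multiple alert
# rules in a single group, similar in shape to the linked prow rules file.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: machine-api-operator-rules     # illustrative name
  namespace: openshift-machine-api
spec:
  groups:
  - name: machine-api-operator.rules   # one group can carry many rules
    rules:
    - alert: FirstPlaceholderAlert     # placeholder rule
      expr: vector(1) == 0             # never fires; stands in for a real expression
      labels:
        severity: warning
    - alert: SecondPlaceholderAlert    # placeholder rule
      expr: vector(1) == 0
      labels:
        severity: critical
```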
/test e2e-aws-operator
52ed5c7 to 12f5df9 (force-push)
@vikaschoudhary16: This pull request references Bugzilla bug 1744752, which is valid. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
12f5df9 to 3851641 (force-push)
```yaml
severity: critical
annotations:
  message: "machine api MAO metric {{ $labels.kind }} is not being reported"
- name: mao-down
```
Personally I think we should expand M-A-O. It's easier for newcomers/users/customers/et al. to sound it out than to guess what it stands for.
I agree, but personally won't block this PR on it. We should try to merge this today.
My main question is: Doesn't this new alert make the existing MachineMAOMetricsDown alert unnecessary? If so, should we remove it?
No, it does not make MachineMAOMetricsDown unnecessary. This alert covers the overall MAO process, while the other one is specific to metrics reporting. It may happen that the MAO is not down but metrics are still not being reported for some reason.
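Roughly, the two alerts would look like the sketch below; the expressions and durations are assumptions used only to illustrate the split between "process down" and "metrics collection failing".

```yaml
# Illustrative only: two complementary alerts, not the exact rules from the PR.
- alert: MachineAPIOperatorDown
  # The operator process itself is gone: its scrape target no longer exists.
  expr: absent(up{job="machine-api-operator"} == 1)
  for: 5m
  labels:
    severity: critical
- alert: MachineMAOMetricsDown
  # The operator is running, but its internal collector reports that it
  # cannot gather metrics about its operands (MachineSets, Machines, ...).
  expr: mapi_mao_collector_up == 0
  for: 5m
  labels:
    severity: critical
```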
Ah, so the mapi_mao_collector_up metric tracks whether we're successfully scraping operands, or MachineSets etc.?
Yes, or any other similar problem that is specific to metrics reporting.
If the team itself has trouble identifying what mapi_mao_collector_up refers to, how is a customer supposed to know?
Also, a good practice is to expose metrics from the operand itself rather than aggregating them inside an operator. For example, look at how the OpenShift or kube apiserver does it.
@paulfantom regarding the naming confusion, I have added more details and improved the naming.
MAO has also been expanded to "machine-api-operator" in all instances.
Regarding reporting from the operand, we discussed this internally and decided to report cluster-state-related metrics from the operator. Operands will also report operand-specific metrics in the future.
/test e2e-aws
3851641 to 1a3293b (force-push)
```yaml
rules:
- alert: MachineAPIOperatorMetricsCollectionFailing
  expr: |
    mapi_machine-api-operator_collector_up == 0
```
Can we / should we consistently use _ (underscore) here?
Yes, you should; "-" is an invalid character in the Prometheus data model.
The metric name specifies the general feature of a system that is measured [...] It may contain ASCII letters and digits, as well as underscores and colons. It must match the regex [a-zA-Z_:][a-zA-Z0-9_:]*
Source: https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels
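For example, assuming the intent is to spell out the operator name inside the metric, the hyphenated form would need to become all underscores to satisfy that regex:

```yaml
# Invalid: '-' is not allowed in a Prometheus metric name.
#   mapi_machine-api-operator_collector_up == 0
# Valid equivalent (assumed spelling, underscores only):
expr: |
  mapi_machine_api_operator_collector_up == 0
```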
```yaml
labels:
  severity: critical
annotations:
  message: "machine api operator metrics collection is failing. Check machine api operator logs for more details."
```
We could be super helpful and, where we say "check ... logs", turn that into something that can be cut and pasted: "oc logs -n openshift-machine-api machine-api-operator", for example.
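For instance, the annotation could carry the command verbatim; the deployment selector in this sketch is an assumption, not the wording that ended up in the PR:

```yaml
# Sketch of an annotation with a copy/paste-able command; the
# "deployment/machine-api-operator" selector is an assumption.
annotations:
  message: >-
    machine-api-operator metrics collection is failing.
    For more details run:
    oc logs -n openshift-machine-api deployment/machine-api-operator
```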
We usually focus on precisely describing what happened, as this message also shows up on SRE pagers, Slack channels, and email. After that, they are supposed to have runbooks on how to proceed. Personally, I don't see much value in an alert telling me to check the logs.
On Slack, you can see examples of how alerts are reported by going to one of:
#ops-testplatform (managed by DPTP)
#team-monitoring-alert (managed by our SREs)
@paulfantom there I saw alert messages just stating what problem has happened, e.g.:
Alert: TelemeterDown - stage [FIRING:1] No targets found for Telemeter in namespace . Telemeter is down.
Providing the exact command to check the logs is, I think, an improvement over not suggesting anything.
"usually focus on precisely describing what happened"
"machine api operator metrics collection is failing" clearly describes that metrics collection is failing.
1a3293b to 66e8722 (force-push)
bison left a comment:
/lgtm
```yaml
severity: critical
annotations:
  message: "machine api MAO metric {{ $labels.kind }} is not being reported"
    message: "machine api operator metrics collection is failing. For more details: oc logs <machine-api-operator-pod-name> -n openshift-machine-api"
```
Nit: the line is indented too much
66e8722 to 9b666d6 (force-push)
/test e2e-azure-operator
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: ingvagabund
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@vikaschoudhary16: All pull requests linked via external trackers have merged. Bugzilla bug 1744752 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
…eratorDown The cluster-version operator is responsible for complaining if the machine-API operator's deployment is sad, so no need for the operator to handle this directly (we end up doubling up if there's an issue). This drops the alert which was added in 9b666d6 (Add a new alert rule for reporting the MAO failure, 2019-08-23, openshift#388).
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1744752