@vikaschoudhary16
Contributor

@vikaschoudhary16 vikaschoudhary16 commented Aug 23, 2019

@openshift-ci-robot added the size/S label (denotes a PR that changes 10-29 lines, ignoring generated files) on Aug 23, 2019
@vikaschoudhary16 changed the title from "Add a new alert rule for reporting the MAO failure" to "Bug 1744752: Add a new alert rule for reporting the MAO failure" on Aug 23, 2019
@openshift-ci-robot added the bugzilla/invalid-bug label (indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting) on Aug 23, 2019
@openshift-ci-robot
Contributor

@vikaschoudhary16: This pull request references Bugzilla bug 1744752, which is invalid:

  • expected the bug to target the "4.2.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1744752: Add a new alert rule for reporting the MAO failure

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@vikaschoudhary16
Contributor Author

/bugzilla refresh

@openshift-ci-robot added the bugzilla/valid-bug label (indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting) and removed the bugzilla/invalid-bug label on Aug 23, 2019
@openshift-ci-robot
Contributor

@vikaschoudhary16: This pull request references Bugzilla bug 1744752, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

/bugzilla refresh


@paulfantom
Contributor

/lgtm

@openshift-ci-robot added the lgtm label (indicates that a PR is ready to be merged) on Aug 23, 2019
@paulfantom
Contributor

@vikaschoudhary16
Contributor Author

/test e2e-aws-operator

@openshift-ci-robot removed the lgtm label (indicates that a PR is ready to be merged) on Aug 27, 2019
@vikaschoudhary16 changed the title from "Bug 1744752: Add a new alert rule for reporting the MAO failure" to "Bug 1744752: Add a new alert rule for reporting the Machine-api Operator failure" on Aug 28, 2019
@openshift-ci-robot
Contributor

@vikaschoudhary16: This pull request references Bugzilla bug 1744752, which is valid.

In response to this:

Bug 1744752: Add a new alert rule for reporting the Machine-api Operator failure


        severity: critical
      annotations:
        message: "machine api MAO metric {{ $labels.kind }} is not being reported"
- name: mao-down
Contributor

Personally, I think we should expand M-A-O. It's easier for newcomers, users, customers, et al. to sound it out than to guess what it stands for.

Contributor

I agree, but personally won't block this PR on it. We should try to merge this today.

My main question is: Doesn't this new alert make the existing MachineMAOMetricsDown alert unnecessary? If so, should we remove it?

Contributor Author

No, it does not make MachineMAOMetricsDown unnecessary. This alert covers the overall MAO process, while the other one is specific to metrics reporting. MAO may be up while metrics are not being reported for some reason.

Contributor

Ah, so the mapi_mao_collector_up metric tracks whether we're successfully scraping operands, or MachineSets etc.?

Contributor Author

Yes, or any other similar problem that is specific to metrics reporting.
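
To make the distinction between the two alerts concrete, here is a minimal sketch in Python. It is an illustration only, not code from this PR: the function name `classify_operator_state`, the `target_up` flag, and the returned strings are hypothetical. The only detail taken from the thread is that the collector metric reports 0 when metrics collection is failing.

```python
def classify_operator_state(target_up: bool, collector_up: float) -> str:
    """Map two independent signals to the condition each alert is meant to catch.

    target_up:    hypothetical flag, True if the operator process is reachable at all.
    collector_up: value of the collector metric discussed above (0 means failing).
    """
    if not target_up:
        # The operator process itself is down, so the "operator down" alert applies.
        return "operator-down"
    if collector_up == 0:
        # The operator is running, but its metrics collection reports failure.
        return "metrics-collection-failing"
    return "healthy"


if __name__ == "__main__":
    print(classify_operator_state(target_up=True, collector_up=0))
```

The point of keeping the two checks separate is that the second branch can only be reached while the operator is up, which is the case the metrics-specific alert is described as covering.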

Contributor

If you have problems inside the team identifying what mapi_mao_collector_up refers to, how is a customer supposed to know?

Also, a good practice is to expose metrics from the operand itself rather than aggregating them inside an operator. For example, look at how the OpenShift or kube apiserver does it.

Contributor Author

@paulfantom regarding the naming confusion, I have added more details and improved the naming.
MAO has also been updated to "machine-api-operator" in all instances.

Regarding reporting from the operand, we discussed this internally and decided to report cluster-state-related metrics from the operator. Operands will also report operand-specific metrics in the future.

@vikaschoudhary16
Contributor Author

/test e2e-aws

@vikaschoudhary16
Contributor Author

/test e2e-aws

@openshift-ci-robot added the size/M label (denotes a PR that changes 30-99 lines, ignoring generated files) and removed the size/S label on Aug 29, 2019
rules:
  - alert: MachineAPIOperatorMetricsCollectionFailing
    expr: |
      mapi_machine-api-operator_collector_up == 0
Contributor

Can we / should we consistently use _ (underscore) here?

Contributor

Yes, you should: a hyphen (-) is an invalid character in the Prometheus data model.

The metric name specifies the general feature of a system that is measured [...] It may contain ASCII letters and digits, as well as underscores and colons. It must match the regex [a-zA-Z_:][a-zA-Z0-9_:]*

Source: https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels
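
As a quick way to check a name against the regex quoted above, here is a minimal sketch in Python. The regex is the one from the Prometheus documentation; the helper names `is_valid_metric_name` and `suggest_metric_name` are hypothetical, and the suggested replacement is just one possible underscore-only spelling, not necessarily the name the PR settled on.

```python
import re

# Regex quoted from the Prometheus data model documentation.
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def is_valid_metric_name(name: str) -> bool:
    """Return True if the whole name matches the Prometheus metric-name regex."""
    return METRIC_NAME_RE.fullmatch(name) is not None


def suggest_metric_name(name: str) -> str:
    """Hypothetical helper: replace every disallowed character with an underscore."""
    cleaned = re.sub(r"[^a-zA-Z0-9_:]", "_", name)
    # A leading digit is also disallowed, so prefix an underscore in that case.
    return "_" + cleaned if cleaned[:1].isdigit() else cleaned


if __name__ == "__main__":
    original = "mapi_machine-api-operator_collector_up"
    print(is_valid_metric_name(original))
    print(suggest_metric_name(original))
```

`fullmatch` is used rather than `match` so that a name with a valid prefix followed by a hyphen is still rejected.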

labels:
  severity: critical
annotations:
  message: "machine api operator metrics collection is failing. Check machine api operator logs for more details."
Contributor

We could be super helpful and, where we say "check ... logs", turn that into something that can be copied and pasted: "oc logs -n openshift-machine-api machine-api-operator", for example.

Contributor

We usually focus on precisely describing what happened, as this message also shows up on the SRE pager, Slack channel, and email. After that, they are supposed to have runbooks on how to proceed. Personally, I don't see any value in getting an alert telling me to check logs.

Contributor

On Slack, you can see examples of how alerts are reported in either of these channels:
#ops-testplatform (managed by DPTP)
#team-monitoring-alert (managed by our SREs)

Contributor Author

@paulfantom there I saw alert messages that just say what problem has happened, e.g.:
Alert: TelemeterDown - stage [FIRING:1] No targets found for Telemeter in namespace . Telemeter is down.

I think providing the exact command to check logs is an improvement over not suggesting anything.

Contributor Author

"usually focus on precisely describing what happened"

The message "machine api operator metrics collection is failing." clearly states that metrics collection is failing.

Contributor

@bison left a comment

/lgtm

@openshift-ci-robot added the lgtm label (indicates that a PR is ready to be merged) on Aug 29, 2019
    severity: critical
  annotations:
-   message: "machine api MAO metric {{ $labels.kind }} is not being reported"
+   message: "machine api operator metrics collection is failing. For more details: oc logs <machine-api-operator-pod-name> -n openshift-machine-api"
Member

Nit: the line is indented too much

@openshift-ci-robot removed the lgtm label (indicates that a PR is ready to be merged) on Aug 29, 2019
@vikaschoudhary16
Contributor Author

/test e2e-azure-operator

@frobware
Contributor

/lgtm

@openshift-ci-robot added the lgtm label (indicates that a PR is ready to be merged) on Aug 29, 2019
@ingvagabund
Member

/approve

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ingvagabund

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Aug 30, 2019
@openshift-merge-robot merged commit a094922 into openshift:master on Aug 30, 2019
@openshift-ci-robot
Contributor

@vikaschoudhary16: All pull requests linked via external trackers have merged. Bugzilla bug 1744752 has been moved to the MODIFIED state.

In response to this:

Bug 1744752: Add a new alert rule for reporting the Machine-api Operator failure


wking added a commit to wking/machine-api-operator that referenced this pull request Mar 11, 2021
…eratorDown

The cluster-version operator is responsible for complaining if the
machine-API operator's deployment is sad, so no need for the operator
to handle this directly (we end up doubling up if there's an issue).
This drops the alert which was added in 9b666d6 (Add a new alert
rule for reporting the MAO failure, 2019-08-23, openshift#388).