
Conversation

@yithian (Contributor) commented Nov 24, 2020

Description

During the rollout of a new logging-operator version, ES can take some
time before it goes green again. This patch causes the alert for ES
being red to fire only if the elasticsearch-operator csv has succeeded.

If the csv is actively rolling out, ES being red is to be expected, and
we have other monitoring for when a csv is abnormal.

/cc @alanconway
/assign @ewolinetz

Links

@RiRa12621 commented

I think that's an uncommon way of handling this: adding the status values together.
It should still work, but a more common way would be to use an "AND" condition.
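
For illustration, a minimal sketch of the "AND" form as a PrometheusRule fragment. The `es_cluster_status` metric name, threshold, and `for` duration are assumptions made for the sketch; `csv_succeeded` is the OLM-exported gauge discussed later in this thread, and none of this is the operator's actual rule:

```yaml
# Sketch only: metric name, threshold, and duration are illustrative, not the
# operator's actual rule.
- alert: ElasticsearchClusterNotHealthy
  expr: |
    (es_cluster_status == 2)
    and on()
    (csv_succeeded{name=~"elasticsearch-operator.*"} == 1)
  for: 7m
  labels:
    severity: critical
  annotations:
    message: >-
      Cluster {{ $labels.cluster }} health status has been RED while the
      elasticsearch-operator CSV reports Succeeded.
```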

@yithian (Contributor, Author) commented Nov 25, 2020

/retest

@yithian force-pushed the OSD-5685_no_ElasticsearchClusterNotHealthy_alert_while_rolling_out branch from 79f0bf6 to 9f58c25 on November 25, 2020 15:14
@yithian (Contributor, Author) commented Nov 25, 2020

/retest

@yithian (Contributor, Author) commented Nov 25, 2020

/retest

1 similar comment
@RiRa12621 commented

/retest

@RiRa12621 commented

/lgtm

@openshift-ci-robot added the lgtm label (indicates that a PR is ready to be merged) on Nov 26, 2020
@openshift-ci-robot commented

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: RiRa12621, yithian
To complete the pull request process, please assign ewolinetz after the PR has been reviewed.
You can assign the PR to them by writing /assign @ewolinetz in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@RiRa12621 commented

/assign @ewolinetz

Co-authored-by: Rick Rackow <rrackow@redhat.com>
@openshift-ci-robot commented

New changes are detected. LGTM label has been removed.

@openshift-ci-robot removed the lgtm label (indicates that a PR is ready to be merged) on Dec 9, 2020
@ewolinetz (Contributor) commented

/cc @blockloop @lukas-vlcek

@ewolinetz removed the request for review from alanconway on December 9, 2020 21:33
@ewolinetz (Contributor) commented

> If the csv is actively rolling out, ES being red should be expected and
> we have other monitoring for when a csv is abnormal.

I'm not sure I'm following here. Why would it be expected that elasticsearch is red when the operator is being updated?

@yithian (Contributor, Author) commented Dec 9, 2020

I'm basing that on a couple of assumptions:

  • when a new elasticsearch operator rolls out, it will restart the ES nodes
  • when an ES node goes down, the cluster goes red

@ewolinetz (Contributor) commented

> • when a new elasticsearch operator rolls out, it will restart the ES nodes

This isn't a correct assumption. The operator will only restart the nodes if there is a cert change (CLO is responsible for this) or if there is a change in the elasticsearch CR that needs to be reflected in the deployments themselves.

> • when an ES node goes down, the cluster goes red

This can happen even if there isn't a csv update... would the proper fix for this be to adjust the `for` of the alert?
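
For context, `for` is the rule's hold duration: the condition must stay true for that long before the pending alert fires. A sketch of what adjusting it would look like, with the metric name and durations purely illustrative:

```yaml
# Sketch: lengthening `for` only delays firing; it does not distinguish
# "red because of a rollout" from "red because something is broken".
- alert: ElasticsearchClusterNotHealthy
  expr: es_cluster_status == 2   # illustrative metric name
  for: 20m                       # raised hold duration; values are examples only
  labels:
    severity: critical
```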

@openshift-merge-robot (Contributor) commented

@yithian: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/unit | aa06a54 | link | /test unit |
| ci/prow/cluster-logging-operator-e2e | aa06a54 | link | /test cluster-logging-operator-e2e |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@RiRa12621 commented

We explicitly don't want to extend the `for`.
The reason is that it may hide completely unrelated issues for longer than it should.
Is there a metric we can use to capture when the cert is being replaced? If not, should we consider adding one? A gauge would do just fine, e.g. `es_cert_replacing` could be 0 if everything is fine and 1 while the cert is being replaced.
That would allow us to make the cluster-red alert conditional, and also to alert separately if the replacement is taking too long.
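
A sketch of how such a hypothetical `es_cert_replacing` gauge (1 while certificates are being rotated, 0 otherwise) could serve both purposes; the metric does not exist today, and all names, thresholds, and durations here are illustrative:

```yaml
# Hypothetical es_cert_replacing gauge: 1 while certificates are being rotated,
# 0 otherwise. Neither the metric nor these rules exist today.
- alert: ElasticsearchClusterNotHealthy
  expr: |
    (es_cluster_status == 2)   # illustrative metric name
    unless on()
    (es_cert_replacing == 1)
  for: 7m
- alert: ElasticsearchCertReplacementTooSlow
  expr: es_cert_replacing == 1
  for: 30m                     # "taking too long" threshold; value illustrative
```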

@blockloop (Contributor) left a comment

Wouldn't it be better to handle this by grouping and/or inhibition? I don't know that we should be combining this kind of behavior into one alert when the API already accounts for this behavior with another entity. Perhaps we should have another alert for CSV failure, or an inhibitor to this rule.

@RiRa12621 commented

An inhibition does require another alert to be present already.
However, replacing a CSV is a regular, expected process, so there is no alert for it by default.
Conditions in alerts via an "and" are perfectly fine and not uncommon.

@blockloop (Contributor) commented

Right, I think I failed to make the point that we probably want to inhibit all ES alerts while the CSV is invalid. In that case I'm suggesting we add an alert for `csv_succeeded == 0`. You're right that this isn't a critical issue, but it's an invalid state nonetheless, so the alert can exist without sending notifications.
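
A sketch of the CSV alert being suggested, which could then serve as the source of an inhibition; the alert name, `for` duration, and severity are assumptions:

```yaml
# Sketch: fires while the elasticsearch-operator CSV is not in the Succeeded
# phase, intended as an inhibition source rather than a paging alert.
- alert: ElasticsearchOperatorCSVNotSucceeded
  expr: csv_succeeded{name=~"elasticsearch-operator.*"} == 0
  for: 10m
  labels:
    severity: info
```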

@ewolinetz (Contributor) commented

> The reason is that it may hide completely unrelated issues for longer than it should.

Can you expand on this? The cluster being red on its own doesn't inherently mean there is an issue. The cluster is red when not all primary shards have been placed.

This can be true and still not be an issue if there are a very large number of primary shards on a node that was just restarted, or if the disk backing the node is slow. My underlying point is: just because it's red doesn't mean something is wrong.

@RiRa12621 commented

It can be true and not be an issue, but more often than not it is an issue.
Additionally, it is the best indicator we have of the cluster's general health and usability.

@yithian (Contributor, Author) commented Dec 14, 2020

Ultimately, what we want is not to get alerts for the elasticsearch cluster being unhealthy if its csv is in an unready state. I'm pretty agnostic about how we get to that point.

If we decide to set up an alert for `csv_succeeded{name=~"elasticsearch-operator.*"} == 0`, can elasticsearch-operator update the main alertmanager config to put the new inhibition rules in place? Or is that something that would need to be configured elsewhere?
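
For reference, inhibition lives in the Alertmanager configuration rather than in the operator's PrometheusRule; a sketch of the entry that would be needed, using the alert names from the examples above and an assumed shared `namespace` label:

```yaml
# Sketch of an Alertmanager inhibit_rules entry (alert names follow the
# examples above; the namespace label is an assumption).
inhibit_rules:
  - source_match:
      alertname: ElasticsearchOperatorCSVNotSucceeded
    target_match:
      alertname: ElasticsearchClusterNotHealthy
    equal: ['namespace']
```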

@yithian (Contributor, Author) commented Dec 22, 2020

If elasticsearch-operator doesn't have a way to manage inhibition rules, should we go ahead with this change to the alerting rule? Or do we need to implement a new `csv_succeeded` rule and time its release with an update to the alertmanager config that would cause it to inhibit `ElasticsearchClusterNotHealthy`?

@RiRa12621 commented Dec 22, 2020

Inhibition rules are part of the alertmanager configuration.
That is centrally managed as part of the cluster-monitoring-operator, so IMHO we should go ahead with the change you proposed here, @yithian, and see how it goes.
A potential option may be for OSD to pilot this and report back how it went.
How does that sound?

@yithian closed this on Dec 23, 2020
@yithian deleted the OSD-5685_no_ElasticsearchClusterNotHealthy_alert_while_rolling_out branch on December 23, 2020 20:05