
Conversation

@yithian (Contributor) commented Nov 24, 2020

Description

During the rollout of a new logging-operator version, ES can take some
time before it goes green again. This patch causes the alert for ES
being red to fire only if the elasticsearch-operator csv has succeeded.

If the csv is actively rolling out, ES being red is to be expected, and
we have other monitoring for when a csv is abnormal.

/cc @alanconway
/assign @ewolinetz

Links

@RiRa12621 commented

I think that's an uncommon way of handling this: adding the status values together.
It should still work, but a more common way would be to use an "AND" condition.
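
For illustration, a minimal sketch of the "AND" form as a PrometheusRule fragment. The `es_cluster_status` metric name, threshold, and `for` duration are assumptions made for the sketch; `csv_succeeded` is the OLM-exported gauge discussed later in this thread, and none of this is the operator's actual rule:

```yaml
# Sketch only: metric name, threshold, and duration are illustrative, not the
# operator's actual rule.
- alert: ElasticsearchClusterNotHealthy
  expr: |
    (es_cluster_status == 2)
    and on()
    (csv_succeeded{name=~"elasticsearch-operator.*"} == 1)
  for: 7m
  labels:
    severity: critical
  annotations:
    message: >-
      Cluster {{ $labels.cluster }} health status has been RED while the
      elasticsearch-operator CSV reports Succeeded.
```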

@yithian (Contributor, Author) commented Nov 25, 2020

/retest

@yithian force-pushed the OSD-5685_no_ElasticsearchClusterNotHealthy_alert_while_rolling_out branch from 79f0bf6 to 9f58c25 on November 25, 2020 15:14
@yithian (Contributor, Author) commented Nov 25, 2020

/retest

@yithian (Contributor, Author) commented Nov 25, 2020

/retest

1 similar comment
@RiRa12621 commented

/retest

@RiRa12621 commented

/lgtm

@openshift-ci-robot added the lgtm label (indicates that a PR is ready to be merged) on Nov 26, 2020
@openshift-ci-robot commented

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: RiRa12621, yithian
To complete the pull request process, please assign ewolinetz after the PR has been reviewed.
You can assign the PR to them by writing /assign @ewolinetz in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@RiRa12621 commented

/assign @ewolinetz

Co-authored-by: Rick Rackow <rrackow@redhat.com>
@openshift-ci-robot commented

New changes are detected. LGTM label has been removed.

@openshift-ci-robot removed the lgtm label (indicates that a PR is ready to be merged) on Dec 9, 2020
@ewolinetz (Contributor) commented

/cc @blockloop @lukas-vlcek

@ewolinetz removed the request for review from alanconway on December 9, 2020 21:33
@ewolinetz (Contributor) commented

> If the csv is actively rolling out, ES being red should be expected and
> we have other monitoring for when a csv is abnormal.

I'm not sure I'm following here. Why would it be expected that elasticsearch is red when the operator is being updated?

@yithian (Contributor, Author) commented Dec 9, 2020

I'm basing that on a couple of assumptions:

  • when a new elasticsearch operator rolls out, it will restart the ES nodes
  • when an ES node goes down, the cluster goes red

@ewolinetz (Contributor) commented

> • when a new elasticsearch operator rolls out, it will restart the ES nodes

This isn't a correct assumption. The operator will only restart the nodes if there is a cert change (CLO is responsible for this) or if there is a change in the elasticsearch CR that needs to be reflected in the deployments themselves.

> • when an ES node goes down, the cluster goes red

This can happen even if there isn't a csv update... would the proper fix for this be to adjust the `for` of the alert?
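
For context, `for` is the rule's hold duration: the condition must stay true for that long before the pending alert fires. A sketch of what adjusting it would look like, with the metric name and durations purely illustrative:

```yaml
# Sketch: lengthening `for` only delays firing; it does not distinguish
# "red because of a rollout" from "red because something is broken".
- alert: ElasticsearchClusterNotHealthy
  expr: es_cluster_status == 2   # illustrative metric name
  for: 20m                       # raised hold duration; values are examples only
  labels:
    severity: critical
```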

@openshift-merge-robot (Contributor) commented

@yithian: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/unit | aa06a54 | link | /test unit |
| ci/prow/cluster-logging-operator-e2e | aa06a54 | link | /test cluster-logging-operator-e2e |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@RiRa12621 commented

We explicitly don't want to extend the `for`.
The reason is that it may hide completely unrelated issues for longer than it should.
Is there a metric we can use to capture when the cert is being replaced? If not, should we consider adding one? A gauge would do just fine, e.g. `es_cert_replacing` could be 0 if everything is fine and 1 while the cert is being replaced.
That would allow us to make the cluster-red alert conditional, and also to alert separately if the replacement is taking too long.
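
A sketch of how such a hypothetical `es_cert_replacing` gauge (1 while certificates are being rotated, 0 otherwise) could serve both purposes; the metric does not exist today, and all names, thresholds, and durations here are illustrative:

```yaml
# Hypothetical es_cert_replacing gauge: 1 while certificates are being rotated,
# 0 otherwise. Neither the metric nor these rules exist today.
- alert: ElasticsearchClusterNotHealthy
  expr: |
    (es_cluster_status == 2)   # illustrative metric name
    unless on()
    (es_cert_replacing == 1)
  for: 7m
- alert: ElasticsearchCertReplacementTooSlow
  expr: es_cert_replacing == 1
  for: 30m                     # "taking too long" threshold; value illustrative
```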

@blockloop (Contributor) left a comment

Wouldn't it be better to handle this by grouping and/or inhibition? I don't know that we should be combining this kind of behavior into one alert when the API already accounts for this behavior with another entity. Perhaps we should have another alert for CSV failure, or an inhibitor to this rule.

@RiRa12621 commented

An inhibition does require another alert to be present already.
However, replacing a CSV is a regular, expected process, so there is no alert for it by default.
Conditions in alerts via an "and" are perfectly fine and not uncommon.

@blockloop (Contributor) commented

Right, I think I failed to make the point that we probably want to inhibit all ES alerts while the CSV is invalid. In that case I'm suggesting we add an alert for `csv_succeeded == 0`. You're right that this isn't a critical issue, but it's an invalid state nonetheless, so the alert can exist without sending notifications.
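
A sketch of the CSV alert being suggested, which could then serve as the source of an inhibition; the alert name, `for` duration, and severity are assumptions:

```yaml
# Sketch: fires while the elasticsearch-operator CSV is not in the Succeeded
# phase, intended as an inhibition source rather than a paging alert.
- alert: ElasticsearchOperatorCSVNotSucceeded
  expr: csv_succeeded{name=~"elasticsearch-operator.*"} == 0
  for: 10m
  labels:
    severity: info
```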

@ewolinetz (Contributor) commented

> The reason is that it may hide completely unrelated issues for longer than it should.

Can you expand on this? The cluster being red on its own doesn't inherently mean there is an issue. The cluster is red when not all primary shards have been placed.

This can be true and still not be an issue if there are a very large number of primary shards on a node that was just restarted, or if the disk backing the node is slow. My underlying point is: just because it's red doesn't mean something is wrong.

@RiRa12621 commented

It can be true and not be an issue, but more often than not it is an issue.
Additionally, it is the best indicator we have of the cluster's general health and usability.

@yithian (Contributor, Author) commented Dec 14, 2020

Ultimately, what we want is not to get alerts for the elasticsearch cluster being unhealthy if its csv is in an unready state. I'm pretty agnostic about how we get to that point.

If we decide to set up an alert for `csv_succeeded{name=~"elasticsearch-operator.*"} == 0`, can elasticsearch-operator update the main alertmanager config to put the new inhibition rules in place? Or is that something that would need to be configured elsewhere?
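
For reference, inhibition lives in the Alertmanager configuration rather than in the operator's PrometheusRule; a sketch of the entry that would be needed, using the alert names from the examples above and an assumed shared `namespace` label:

```yaml
# Sketch of an Alertmanager inhibit_rules entry (alert names follow the
# examples above; the namespace label is an assumption).
inhibit_rules:
  - source_match:
      alertname: ElasticsearchOperatorCSVNotSucceeded
    target_match:
      alertname: ElasticsearchClusterNotHealthy
    equal: ['namespace']
```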

@yithian (Contributor, Author) commented Dec 22, 2020

If elasticsearch-operator doesn't have a way to manage inhibition rules, should we go ahead with this change to the alerting rule? Or do we need to implement a new `csv_succeeded` rule and time its release with an update to the alertmanager config that would cause it to inhibit `ElasticsearchClusterNotHealthy`?

@RiRa12621 commented Dec 22, 2020

Inhibition rules are part of the alertmanager configuration.
That is centrally managed as part of the cluster-monitoring-operator, so IMHO we should go ahead with the change you proposed here, @yithian, and see how it goes.
A potential option may be for OSD to pilot this and report back how it went.
How does that sound?

@yithian closed this on Dec 23, 2020
@yithian deleted the OSD-5685_no_ElasticsearchClusterNotHealthy_alert_while_rolling_out branch on December 23, 2020 20:05