
Log 1098 - Playbook for Critical Alerts #673

Merged

Conversation

@sasagarw (Contributor)

Description

This PR:

  • Ensures that critical alerts have proper diagnostic steps and action steps.

/cc @lukas-vlcek
/assign @ewolinetz

Links

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 11, 2021
@sasagarw sasagarw changed the title [WIP] Log 1098 - Playbook for Critical Alerts Log 1098 - Playbook for Critical Alerts Mar 22, 2021
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 22, 2021
docs/alerts.md Outdated
```
oc logs <elasticsearch_node_name> -c elasticsearch -n openshift-logging
```

Contributor

We probably should have a follow up step here... but this starts to get really tricky... @lukas-vlcek can you think of some steps we can take here? Do we just want to try to restart the nodes that haven't joined? but if there's a cert issue we need to figure out which one has the correct certs... also the operator should be doing something there already...
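One possible shape for the follow-up step discussed above, sketched here as a suggestion rather than playbook content: compare the nodes Elasticsearch reports as cluster members against the running pods, then restart any pod that never joined. This assumes `es_util` passes extra flags through to curl, as in the other snippets in this PR; `<elasticsearch_pod_name>` and the `component=elasticsearch` label selector are placeholders/assumptions.

```shell
# List the nodes that have actually joined the cluster
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_cat/nodes?v

# List all Elasticsearch pods for comparison (label selector is an assumption)
oc get pods -n openshift-logging -l component=elasticsearch

# Restart a pod that failed to join; the controller recreates it
oc delete pod <elasticsearch_pod_name> -n openshift-logging
```

As noted in the comment, this does not resolve an underlying certificate mismatch; it only forces a rejoin attempt.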

@ewolinetz (Contributor)

@jcantrill can you also try to look through some of these steps based on your past customer experiences?

```
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=<elasticsearch_index_name> -X DELETE
```
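Before deleting anything, it may help to pick deletion candidates deliberately. A sketch (not from the PR itself), assuming the standard `_cat/indices` API and the same `es_util` wrapper used above:

```shell
# List indices sorted by on-disk size, largest first
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query="_cat/indices?v&s=store.size:desc"
```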

Contributor

I think we also want to add a step to unlock all the indices once the watermark level is back below the threshold. ES will lock the indices automatically, but on ES6 it will not unlock them.
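A sketch of the unlock step being discussed, assuming the standard ES6 settings API: clearing `index.blocks.read_only_allow_delete` removes the write block that Elasticsearch applies when the flood-stage watermark is exceeded. The `-H`/`-d` pass-through to curl via `es_util` is an assumption based on the other snippets in this PR.

```shell
# Clear the read-only block on all indices after freeing disk space
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_all/_settings -X PUT -H 'Content-Type: application/json' -d '{"index.blocks.read_only_allow_delete": null}'
```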

Contributor Author

@ewolinetz What I found is that ES locks the indices only on reaching the flood watermark level, not on low or high. That's why I mentioned a step to unlock the indices in the flood watermark troubleshooting.
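One way to confirm which watermark is actually in play (a sketch, not part of the playbook): `_cat/allocation` shows per-node disk usage, which can be compared against the configured low/high/flood-stage watermarks.

```shell
# Show per-node disk usage and shard allocation
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_cat/allocation?v
```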

docs/alerts.md Outdated
```
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=<elasticsearch_index_name>/_settings?pretty
```
- Identify the number of replicas from the output of the above command.
- Lower the number of replicas if possible:
Contributor
This would need to be cluster-wide... the operator would try to adjust them afterwards, I believe.
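A hedged sketch of the cluster-wide variant suggested above: apply the replica setting to all indices rather than a single one. `<replica_count>` is a placeholder, and as the comment notes, the operator may revert this change if it manages the replica policy itself.

```shell
# Lower the replica count across all indices (operator may adjust afterwards)
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_all/_settings -X PUT -H 'Content-Type: application/json' -d '{"index":{"number_of_replicas":<replica_count>}}'
```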

@ewolinetz (Contributor)

@openshift/sre-alert-sme could you also take a look through some of these and comment?

@RiRa12621

/assign @RiRa12621

I'll check it out tomorrow, unless one of the APAC folks has time before that

@RiRa12621

Sorry for the delay; LGTM from my perspective. This should give users a good path to fix the given problems.

@sasagarw (Contributor Author)

/retest


openshift-ci bot commented Apr 12, 2021

@sasagarw: The following test failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-upgrade | 8324809 | link | /test e2e-upgrade |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@ewolinetz (Contributor)

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 12, 2021
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ewolinetz, sasagarw

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 12, 2021
@openshift-merge-robot openshift-merge-robot merged commit 5655fe4 into openshift:master Apr 12, 2021
@jeremyeder

Thank you @ewolinetz !
