Add alerting docs #606

Merged

Conversation

michaelgugino
Contributor

No description provided.

Contributor

@elmiko elmiko left a comment

i think this is a good start for our alerting doc, i added a few nits inline.

@@ -0,0 +1,64 @@
## MachineWithoutValidNode
Each machine should have a valid node reference and one or more machines does not after 10 minutes.
Contributor

i still think this reads a little odd, i think it would be clearer like this:

Suggested change
Each machine should have a valid node reference and one or more machines does not after 10 minutes.
One or more machines does not have a valid node reference after 10 minutes.

otoh, if you want to keep the info about each machine needing a valid reference i might rewrite like this:

Suggested change
Each machine should have a valid node reference and one or more machines does not after 10 minutes.
Each machine should have a valid node reference, this alert signifies that one or more machines does not after 10 minutes.

Contributor

Might also be worth clarifying what this is 10 minutes after, 10 minutes after creation? Deletion? Node joining the cluster? Mid-day? 😛


### Resolution
If the machine never became a node, consult the machine troubleshooting guide.
If the node was deleted from the api, you may choose to delete the machine object, if appropriate. (FYI, The machine-api will automatically delete nodes, there is no need to delete node objects directly)
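
A minimal sketch of that resolution step, assuming the machines live in the standard `openshift-machine-api` namespace and using a hypothetical machine name:

```
# List machines and their node references to spot any machine without a node
oc get machines -n openshift-machine-api -o wide

# If the backing node was deleted from the API and the machine is no longer wanted,
# deleting the machine object is enough; node objects are cleaned up automatically.
oc delete machine example-worker-abc12 -n openshift-machine-api
```
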
Contributor

i like the fyi, we cannot state this point enough times!

Contributor

I thought it was the cloud provider controller (in kube-controller-manager) that does this? Not machine-api


### Possible Causes
Machine-api-operator is unable to list machines or machinesets to gather metrics
Contributor

might be worth mentioning a network connection issue as the metrics are served in a way that could be blocked if there is a missing service or something similar.

Machine-api-operator is unable to list machines or machinesets to gather metrics

### Resolution
Investigate the logs of the machine-api-operator to determine why it is unable to gather machines and machinesets.
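
A sketch of that investigation, assuming the operator runs as the `machine-api-operator` deployment in the `openshift-machine-api` namespace (the container name may differ by version):

```
# Confirm the operator pods are running
oc get pods -n openshift-machine-api

# Tail the operator logs and look for errors listing machines or machinesets
oc logs deployment/machine-api-operator -c machine-api-operator \
  -n openshift-machine-api --tail=100
```
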
Contributor

i think it's also good to suggest looking at the deployment for the mao, to see if there are any abnormalities with the exposed metrics port.
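
For example, a rough version of that check, under the same namespace assumption (the exact port and service names vary):

```
# Inspect the deployment spec for the exposed metrics port
oc get deployment machine-api-operator -n openshift-machine-api -o yaml | grep -i -A 3 port

# Confirm a service exposes that port so the metrics can be scraped
oc get services -n openshift-machine-api
```
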

@enxebre
Member

enxebre commented Jun 5, 2020

Thanks!
Can we include a PR desc?
s/Alerts.md/alerts.md?
Shall we consolidate the folder for this with #591, e.g. docs/metrics?

Are we still happy with all these alerts being critical?

Contributor

@JoelSpeed JoelSpeed left a comment

Looks good, added a few suggestions

@@ -0,0 +1,64 @@
## MachineWithoutValidNode
Each machine should have a valid node reference and one or more machines does not after 10 minutes.
Contributor

Might also be worth clarifying what this is 10 minutes after, 10 minutes after creation? Deletion? Node joining the cluster? Mid-day? 😛

Comment on lines +4 to +9
### Query
```
# for: 10m
(mapi_machine_created_timestamp_seconds unless on(node) kube_node_info) > 0
```
Contributor

+1 to this 🎉


### Possible Causes
* The machine never became a node and joined the cluster
Contributor

As it is currently written it reads a bit odd to me, this might be clearer?

Suggested change
* The machine never became a node and joined the cluster
* A node for this machine never joined the cluster, it is possible the machine failed to boot

Alternative:

Suggested change
* The machine never became a node and joined the cluster
* The machine never became a node and so never joined the cluster


### Possible Causes
* The machine was not properly provisioned in the cloud provider due to machine misconfiguration, invalid credentials, or lack of cloud capacity
* The machine took longer than two hours to join the cluster and the bootstrap CSRs were not approved (due to networking or cloud quota/capacity constraints)
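
As an aside on the CSR point, pending bootstrap CSRs can be listed and approved with standard commands (the CSR name below is a placeholder):

```
# List certificate signing requests; machines stuck joining show Pending bootstrap CSRs
oc get csr

# Approve a specific pending CSR (placeholder name)
oc adm certificate approve csr-8vnps
```
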
Contributor

Surely this doesn't apply as this alert will fire within 10 minutes of the Machine being created if I've understood correctly?

Contributor Author

I'm pretty sure it does.

Contributor

But would this alert not fire way before the two hour mark that this cause is suggesting? Without the CSR the machine would stay in Provisioning right?

* The node was deleted from the cluster via the api, but the machine still exists

### Resolution
If the machine never became a node, consult the machine troubleshooting guide.
Contributor

Can we get a link to the machine troubleshooting guide please? Here and the other places that this is mentioned. Would make it easier for readers to resolve their issues

Contributor Author

There isn't one yet.

Contributor

Should we be referencing it if it doesn't yet exist? 😅

Contributor

@elmiko elmiko left a comment

although there are some questions raised on this PR, i think we should definitely try to get it merged.

/approve

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 24, 2020
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elmiko

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

@elmiko elmiko left a comment

i noticed one small nit (MAo), but i'm cool to merge this as is.

@michaelgugino and i spoke about future plans for this doc and there are changes coming to the alerts which will impact this doc. imo, we should merge this and then add/fix content when the alerts change.

lgtm, but i'm leaving the tag for @JoelSpeed since he has been following this as well.

@elmiko
Contributor

elmiko commented Jul 28, 2020

/retest

1 similar comment
@JoelSpeed
Contributor

/retest

@JoelSpeed
Contributor

/lgtm

Please come back and update this when the referenced doc is merged to add links

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 30, 2020
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

11 similar comments

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

27 similar comments
@openshift-ci-robot
Contributor

openshift-ci-robot commented Aug 4, 2020

@michaelgugino: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-metal-ipi | d824271 | link | /test e2e-metal-ipi |
| ci/prow/e2e-gcp-operator | d824271 | link | /test e2e-gcp-operator |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit 0387a5a into openshift:master Aug 4, 2020