Add alerting docs #606

Merged

Conversation

michaelgugino
Contributor

No description provided.

Contributor

@elmiko elmiko left a comment

i think this is a good start for our alerting doc, i added a few nits inline.

@@ -0,0 +1,64 @@
## MachineWithoutValidNode
Each machine should have a valid node reference and one or more machines does not after 10 minutes.
Contributor

i still think this reads a little odd, i think it would be clearer like this:

Suggested change
Each machine should have a valid node reference and one or more machines does not after 10 minutes.
One or more machines does not have a valid node reference after 10 minutes.

otoh, if you want to keep the info about each machine needing a valid reference i might rewrite like this:

Suggested change
Each machine should have a valid node reference and one or more machines does not after 10 minutes.
Each machine should have a valid node reference, this alert signifies that one or more machines does not after 10 minutes.

Contributor

Might also be worth clarifying what this is 10 minutes after, 10 minutes after creation? Deletion? Node joining the cluster? Mid-day? 😛


### Resolution
If the machine never became a node, consult the machine troubleshooting guide.
If the node was deleted from the api, you may choose to delete the machine object, if appropriate. (FYI, The machine-api will automatically delete nodes, there is no need to delete node objects directly)
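
A minimal sketch of that resolution step, assuming the machines live in the standard `openshift-machine-api` namespace and using a hypothetical machine name:

```
# List machines and their node references to spot any machine without a node
oc get machines -n openshift-machine-api -o wide

# If the backing node was deleted from the API and the machine is no longer wanted,
# deleting the machine object is enough; node objects are cleaned up automatically.
oc delete machine example-worker-abc12 -n openshift-machine-api
```
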
Contributor

i like the fyi, we cannot state this point enough times!

Contributor

I thought it was the cloud provider controller (in kube-controller-manager) that does this? Not machine-api


### Possible Causes
Machine-api-operator is unable to list machines or machinesets to gather metrics
Contributor

might be worth mentioning a network connection issue as the metrics are served in a way that could be blocked if there is a missing service or something similar.

Machine-api-operator is unable to list machines or machinesets to gather metrics

### Resolution
Investigate the logs of the machine-api-operator to determine why it is unable to gather machines and machinesets.
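
A sketch of that investigation, assuming the operator runs as the `machine-api-operator` deployment in the `openshift-machine-api` namespace (the container name may differ by version):

```
# Confirm the operator pods are running
oc get pods -n openshift-machine-api

# Tail the operator logs and look for errors listing machines or machinesets
oc logs deployment/machine-api-operator -c machine-api-operator \
  -n openshift-machine-api --tail=100
```
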
Contributor

i think it's also good to suggest looking at the deployment for the mao, to see if there are any abnormalities with the exposed metrics port.
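
For example, a rough version of that check, under the same namespace assumption (the exact port and service names vary):

```
# Inspect the deployment spec for the exposed metrics port
oc get deployment machine-api-operator -n openshift-machine-api -o yaml | grep -i -A 3 port

# Confirm a service exposes that port so the metrics can be scraped
oc get services -n openshift-machine-api
```
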

@enxebre
Member

enxebre commented Jun 5, 2020

Thanks!
Can we include a PR desc?
s/Alerts.md/alerts.md?
Shall we consolidate the folder for this with #591, e.g. docs/metrics?

Are we still happy with all these alerts being critical?

Contributor

@JoelSpeed JoelSpeed left a comment

Looks good, added a few suggestions

@@ -0,0 +1,64 @@
## MachineWithoutValidNode
Each machine should have a valid node reference and one or more machines does not after 10 minutes.
Contributor

Might also be worth clarifying what this is 10 minutes after, 10 minutes after creation? Deletion? Node joining the cluster? Mid-day? 😛

Comment on lines +4 to +9
### Query
```
# for: 10m
(mapi_machine_created_timestamp_seconds unless on(node) kube_node_info) > 0
```
Contributor

+1 to this 🎉


### Possible Causes
* The machine never became a node and joined the cluster
Contributor

As it is currently written it reads a bit odd to me, this might be clearer?

Suggested change
* The machine never became a node and joined the cluster
* A node for this machine never joined the cluster, it is possible the machine failed to boot

Alternative:

Suggested change
* The machine never became a node and joined the cluster
* The machine never became a node and so never joined the cluster


### Possible Causes
* The machine was not properly provisioned in the cloud provider due to machine misconfiguration, invalid credentials, or lack of cloud capacity
* The machine took longer than two hours to join the cluster and the bootstrap CSRs were not approved (due to networking or cloud quota/capacity constraints)
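
As an aside on the CSR point, pending bootstrap CSRs can be listed and approved with standard commands (the CSR name below is a placeholder):

```
# List certificate signing requests; machines stuck joining show Pending bootstrap CSRs
oc get csr

# Approve a specific pending CSR (placeholder name)
oc adm certificate approve csr-8vnps
```
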
Contributor

Surely this doesn't apply as this alert will fire within 10 minutes of the Machine being created if I've understood correctly?

Contributor Author

I'm pretty sure it does.

Contributor

But would this alert not fire way before the two hour mark that this cause is suggesting? Without the CSR the machine would stay in Provisioning right?

* The node was deleted from the cluster via the api, but the machine still exists

### Resolution
If the machine never became a node, consult the machine troubleshooting guide.
Contributor

Can we get a link to the machine troubleshooting guide please? Here and the other places that this is mentioned. Would make it easier for readers to resolve their issues

Contributor Author

There isn't one yet.

Contributor

Should we be referencing it if it doesn't yet exist? 😅

Contributor

@elmiko elmiko left a comment

although there are some questions raised on this PR, i think we should definitely try to get it merged.

/approve

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 24, 2020
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elmiko

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

@elmiko elmiko left a comment

i noticed one small nit (MAo), but i'm cool to merge this as is.

@michaelgugino and i spoke about future plans for this doc and there are changes coming to the alerts which will impact this doc. imo, we should merge this and then add/fix content when the alerts change.

lgtm, but i'm leaving the tag for @JoelSpeed since he has been following this as well.

@elmiko
Contributor

elmiko commented Jul 28, 2020

/retest

1 similar comment
@JoelSpeed
Contributor

/retest

@JoelSpeed
Contributor

/lgtm

Please come back and update this when the referenced doc is merged to add links

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 30, 2020
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

11 similar comments

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

27 similar comments
@openshift-ci-robot
Contributor

openshift-ci-robot commented Aug 4, 2020

@michaelgugino: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-metal-ipi | d824271 | link | /test e2e-metal-ipi |
| ci/prow/e2e-gcp-operator | d824271 | link | /test e2e-gcp-operator |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit 0387a5a into openshift:master Aug 4, 2020