Add alerting docs #606
Conversation
i think this is a good start for our alerting doc, i added a few nits inline.
docs/user/Alerts.md (Outdated)
```
@@ -0,0 +1,64 @@
## MachineWithoutValidNode
Each machine should have a valid node reference and one or more machines does not after 10 minutes.
```
i still think this reads a little odd, i think it would be clearer like this:

```suggestion
One or more machines does not have a valid node reference after 10 minutes.
```

otoh, if you want to keep the info about each machine needing a valid reference i might rewrite like this:

```suggestion
Each machine should have a valid node reference, this alert signifies that one or more machines does not after 10 minutes.
```
Might also be worth clarifying what this is 10 minutes after, 10 minutes after creation? Deletion? Node joining the cluster? Mid-day? 😛
```
### Resolution
If the machine never became a node, consult the machine troubleshooting guide.
If the node was deleted from the api, you may choose to delete the machine object, if appropriate. (FYI, The machine-api will automatically delete nodes, there is no need to delete node objects directly)
```
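As a rough illustration of that resolution step, deleting a stale Machine object might look like the following sketch (the namespace and machine name are assumptions, adjust for your cluster):

```shell
# Names below are hypothetical; substitute your own namespace and machine.
# List machines and their node references to find the one without a node.
oc get machines -n openshift-machine-api -o wide

# Delete only the Machine object; the machine-api controllers handle
# the remaining cleanup. Do not delete Node objects directly.
oc delete machine example-worker-abcde -n openshift-machine-api
```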
i like the fyi, we cannot state this point enough times!
I thought it was the cloud provider controller (in kube-controller-manager) that does this, not the machine-api?
docs/user/Alerts.md (Outdated)
```
### Possible Causes
Machine-api-operator is unable to list machines or machinesets to gather metrics
```
might be worth mentioning a network connection issue as the metrics are served in a way that could be blocked if there is a missing service or something similar.
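To make that suggestion concrete, a quick check for the networking side might look like this sketch (resource names and namespace are assumptions, not taken from the PR):

```shell
# Hypothetical names; verify against your cluster.
# Confirm a Service exposing the machine-api-operator metrics exists
# and has endpoints -- a missing Service would block metrics scraping.
oc get services -n openshift-machine-api
oc get endpoints -n openshift-machine-api
```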
docs/user/Alerts.md (Outdated)
```
Machine-api-operator is unable to list machines or machinesets to gather metrics

### Resolution
Investigate the logs of the machine-api-operator to determine why it is unable to gather machines and machinesets.
```
i think it's also good to suggest looking at the deployment for the mao, to see if there are any abnormalities with the exposed metrics port.
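Combining both suggestions, inspecting the logs and the deployment's metrics port could be sketched like this (deployment name and namespace are assumptions; adjust as needed):

```shell
# Hypothetical names; adjust for your cluster.
# Look for errors listing machines/machinesets in the operator logs.
oc logs deployment/machine-api-operator -n openshift-machine-api

# Check the deployment spec for abnormalities in the exposed metrics port.
oc describe deployment machine-api-operator -n openshift-machine-api | grep -i port
```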
Thanks! Are we still happy with all these alerts being critical?
Looks good, added a few suggestions
docs/user/Alerts.md (Outdated)
### Query
```
# for: 10m
(mapi_machine_created_timestamp_seconds unless on(node) kube_node_info) > 0
```
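For intuition, PromQL's `unless on(node)` keeps series from the left-hand side whose `node` label has no matching series on the right. A rough Python sketch of that set logic, with sample values invented purely for illustration:

```python
# Sketch of PromQL `unless on(node)` semantics: keep machine series
# whose node label does not match any kube_node_info series.
# All sample data below is invented for illustration.
machine_created = {
    ("machine-a", "node-1"): 1690000000.0,
    ("machine-b", "node-2"): 1690000100.0,
    ("machine-c", ""): 1690000200.0,  # machine with no valid node reference
}
kube_node_info = {"node-1", "node-2"}  # nodes known to the cluster

def machines_without_valid_node(created, nodes):
    """Return machine series whose node label is absent from kube_node_info."""
    return {
        (machine, node): ts
        for (machine, node), ts in created.items()
        if node not in nodes
    }

alerts = machines_without_valid_node(machine_created, kube_node_info)
# Only machine-c remains; with `for: 10m`, the alert fires once this
# condition has held continuously for ten minutes.
```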
+1 to this 🎉
docs/user/Alerts.md (Outdated)
```
### Possible Causes
* The machine never became a node and joined the cluster
```
As it currently reads a bit odd to me, this might be clearer?

```suggestion
* A node for this machine never joined the cluster, it is possible the machine failed to boot
```

Alternative:

```suggestion
* The machine never became a node and so never joined the cluster
```
```
### Possible Causes
* The machine was not properly provisioned in the cloud provider due to machine misconfiguration, invalid credentials, or lack of cloud capacity
* The machine took longer than two hours to join the cluster and the bootstrap CSRs were not approved (due to networking or cloud quota/capacity constraints)
```
Surely this doesn't apply as this alert will fire within 10 minutes of the Machine being created if I've understood correctly?
I'm pretty sure it does.
But would this alert not fire way before the two hour mark that this cause is suggesting? Without the CSR the machine would stay in Provisioning, right?
```
* The node was deleted from the cluster via the api, but the machine still exists

### Resolution
If the machine never became a node, consult the machine troubleshooting guide.
```
Can we get a link to the machine troubleshooting guide please? Here and the other places that this is mentioned. Would make it easier for readers to resolve their issues
There isn't one yet.
Should we be referencing it if it doesn't yet exist? 😅
although there are some questions raised on this PR, i think we should definitely try to get it merged.
/approve
Force-pushed from eda28a5 to d824271 (Compare)
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elmiko. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing…
i noticed one small nit (MAo), but i'm cool to merge this as is.
@michaelgugino and i spoke about future plans for this doc and there are changes coming to the alerts which will impact this doc. imo, we should merge this and then add/fix content when the alerts change.
lgtm, but i'm leaving the tag for @JoelSpeed since he has been following this as well.
/retest
/lgtm

Please come back and update this when the referenced doc is merged to add links.
/retest

Please review the full test history for this PR and help us cut down flakes.
@michaelgugino: The following tests failed, say…

Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
No description provided.