Add TroubleShooting Guide #721
Conversation
@@ -0,0 +1,122 @@
# Important
When troubleshooting a **Master/Control Plane Machine**, it's **absolutely imperative** that you familiarize yourself with determining the health of the etcd members. On some rare occasions, a Node may go unready on multiple master Machines, but one or more of those Machines may have healthy etcd members. Before selecting a master Machine to delete, it's mandatory to determine that the Machine you intend to delete will not compromise etcd quorum.
I'm leaving the reference to "Masters" here because that matches current product docs in an effort to reduce confusion. Once that is sorted up in the product docs, we'll remove it here as well.
Nit: IMO we should use `it is` over `it's` in documentation for better readability
content looks mostly good to me, i had a few nits and suggestions
For in-depth information and steps to replace a Master/Control Plane machine, please refer to this guide: https://docs.openshift.com/container-platform/4.5/backup_and_restore/replacing-unhealthy-etcd-member.html
not a blocker for me, but i think we will need to figure out some way to utilize the official doc links that won't require updating link when new versions are released.
I just tried replacing 4.5 with 'latest' and the pages redirect to the same page with 4.5. So, might be possible to put 'latest' everywhere. Of course, if we want to support versioned docs, we might need to adjust the older branches.
Should we update to `latest` for these links then, assuming that they will always redirect to the latest? Or do we prefer to leave as is?
docs/user/TroubleShooting.md
Outdated
<!-- /toc -->

# Document Purpose
The intended purpose of this document is to outline steps to investigate the current status of an individual Machine or MachineSet that appears to not be creating Machines which results in additional Nodes joining the cluster. Troubleshooting an unhealthy Node (a Machine that already joined the cluster as a Node) is outside the scope of this document.
imo we should make this more direct,
Suggested change:
- The intended purpose of this document is to outline steps to investigate the current status of an individual Machine or MachineSet that appears to not be creating Machines which results in additional Nodes joining the cluster. Troubleshooting an unhealthy Node (a Machine that already joined the cluster as a Node) is outside the scope of this document.
+ This document outlines steps to investigate the current status of an individual Machine or MachineSet that appears to not be creating Machines which results in additional Nodes joining the cluster. Troubleshooting an unhealthy Node (a Machine that already joined the cluster as a Node) is outside the scope of this document.
docs/user/TroubleShooting.md
Outdated
# Important Pod Logs
## machine-api components
Most everything that relates to the machine-api is viewable in the `openshift-machine-api namespace`. You will want to familiarize yourself with the output of
minor nit, exclude namespace from the code block text
Suggested change:
- Most everything that relates to the machine-api is viewable in the `openshift-machine-api namespace`. You will want to familiarize yourself with the output of
+ Most everything that relates to the machine-api is viewable in the `openshift-machine-api` namespace. You will want to familiarize yourself with the output of
docs/user/TroubleShooting.md
Outdated
```sh
oc logs -n openshift-machine-api machine-api-controllers-<random suffix> -c <controller-name>
```
The random suffix is au[Important Pod Logs](#important-pod-logs)tomatically generated by the `machine-api-controllers` deployment and is most easily found using the output of |
the "Important Pod Logs" link got caught in the middle of a word here
wtf, lol
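As a side note for readers of the thread: the suffix lookup described in the hunk above can be sketched as below. The sample `oc get pods` output stands in for a live cluster, and the pod-name suffixes are hypothetical.

```shell
# Sample output in the shape `oc get pods -n openshift-machine-api` prints
# (pod names and suffixes are hypothetical).
pods='NAME                                       READY   STATUS    RESTARTS   AGE
machine-api-controllers-6f9d7c4b5d-x2lkq   4/4     Running   0          3h
machine-api-operator-7b5f8d9c6-abcde       2/2     Running   0          3h'

# Grab the full machine-api-controllers pod name, random suffix included.
echo "$pods" | awk 'NR>1 && $1 ~ /^machine-api-controllers-/ {print $1}'
```

On a live cluster, the printed name is what you substitute for `machine-api-controllers-<random suffix>` in the `oc logs` command above.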
docs/user/TroubleShooting.md
Outdated
```sh
oc get csr
```
If there are none, it means the kubelet did not start successfully. If there is a pending CSR for the corresponding Machine/Node, then check the logs for the cluster-machine-approver; refer to the section "Important Pod Logs" above for exact steps.
to be consistent with the other entries, should "cluster-machine-approver" be in code blocks here?
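To make the pending-CSR check from the hunk above concrete, here is a minimal sketch. The sample `oc get csr` output stands in for a live cluster, and the CSR names are hypothetical.

```shell
# Sample output in the shape `oc get csr` prints (CSR names are hypothetical).
csr_list='NAME        AGE   REQUESTOR                                                                   CONDITION
csr-8b2mt   10m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-9xk4p   12m   system:node:worker-0                                                        Approved,Issued'

# Print the names of CSRs still awaiting approval (last column is Pending).
echo "$csr_list" | awk 'NR>1 && $NF=="Pending" {print $1}'
```

On a live cluster, a pending CSR can then be approved with `oc adm certificate approve <name>`.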
docs/user/TroubleShooting.md
Outdated
# I deleted a Machine (or scaled down a MachineSet) but the Machine and/or Node did not go away

This can be caused by a variety of reasons, such as invalid cloud credentials or PodDisruptionBudgets preventing the Node from draining. The best place to look for information is the machine-controller's logs; refer to the section "Important Pod Logs" above for exact steps.
similar to above, should "machine-controller" be in code blocks?
First, consult with the ```machine-controller```'s logs; refer to [Important Pod Logs](#important-pod-logs) above for exact steps.

Next, compare your findings in the machine-controller logs with the cloud provider's configuration.
not sure about this reference to "machine-controller" seems a little different usage from the others, should we have code block here for consistency?
meh
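One common blocker named in the hunk above is a PodDisruptionBudget preventing the Node from draining. A minimal sketch of spotting one follows; the sample `oc get pdb --all-namespaces` output stands in for a live cluster, and the PDB names are hypothetical.

```shell
# Sample output in the shape `oc get pdb --all-namespaces` prints
# (namespaces and PDB names are hypothetical).
pdb_list='NAMESPACE   NAME        MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
app-a       web-pdb     2               N/A               0                     5d
app-b       batch-pdb   N/A             1                 1                     2d'

# A PDB whose ALLOWED DISRUPTIONS is 0 will block a Node drain.
echo "$pdb_list" | awk 'NR>1 && $(NF-1)==0 {print $2}'
```

Any PDB this prints is worth cross-checking against the drain errors in the machine-controller logs.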
docs/user/TroubleShooting.md
Outdated
Ensure you have reviewed and understand that Masters/Control Plane machines are not backed by MachineSets at the root of this document.
i feel like we could shorten this and just provide a link back to the top, eg "Ensure you have reviewed before proceeding"
I thought about that, but I have trust issues.
i can empathize, maybe leave the text and have a link back to the top lol
/approve
@michaelgugino Thanks a lot for this doc. Can you squash the commits later?
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alexander-demichev

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
/lgtm
/retest Please review the full test history for this PR and help us cut down flakes.
This is looking really good, added a bunch of comments, though they're mostly nits, nothing major to add from my side
Perhaps a point for the wider doc discussion, but do we want to have a convention for the names of the files within the docs folders?
```sh
oc get pods -n openshift-machine-api
```

The `machine-api-controllers-*` pod has several containers running: `machineset-controller`, `machine-controller`, `nodelink-controller`, and the `machine-healthcheck-controller`.
There are also the rbac proxy pods, should we note those somewhere? Something along the lines of these exist but aren't really relevant
## cluster-machine-approver
CSRs that are automatically generated by kubelets on instances provisioned by the machine-api will automatically attempt to join the cluster by issuing a `CSR (certificate signing request)`. Under normal circumstances, these CSRs should be approved automatically. On rare occasions, you may encounter a bug where a CSR is stuck in pending state and the kubelet is unable to join the cluster successfully.
This first sentence doesn't make sense/read well to me. We mention CSRs twice during this but they seem unrelated.
Perhaps cut it to
Suggested change:
- CSRs that are automatically generated by kubelets on instances provisioned by the machine-api will automatically attempt to join the cluster by issuing a `CSR (certificate signing request)`. Under normal circumstances, these CSRs should be approved automatically. On rare occasions, you may encounter a bug where a CSR is stuck in pending state and the kubelet is unable to join the cluster successfully.
+ On first boot, a machine's kubelet will attempt to join the cluster. Part of this process involves creating a CSR (Certificate signing request) to request credentials for the new machine. Under normal circumstances, these CSRs should be approved automatically. On rare occasions, you may encounter a bug where a CSR is stuck in pending state and the kubelet is unable to join the cluster successfully.
```sh
oc get pods -n openshift-cluster-machine-approver
```

Note the name of the pod `machine-approver-*` The suffix will be randomly generated by the pod's deployment controller.
This isn't technically true (API server handles name generation), and it's mentioned a couple of times, do we need to mention this? Might be better to say that a random suffix is appended to the pod name when the pod is created?
Be sure to replace `<random suffix>` above with the real suffix from the previous step.

# I created a Machine (or scaled up a MachineSet) but I didn't get a Node.
First, check that a Machine object was created successfully if scaling a MachineSet [TODO: need steps to look at MachineSet status and also Machines]. If there is not a new Machine, then check the `machineset-controller`'s logs; refer to the section [Important Pod Logs](#important-pod-logs) above for exact steps.
Should we add the TODO as an html comment so that it isn't rendered in the markdown?
Next, check the Machine object's status. There may be status conditions that explain the problem, and be sure to check the Phase.

## Machine Status: Phase Provisioning
If the phase is "Provisioning" it means that the cloud provider has not created the corresponding instance yet for one reason or another. This could be quota, misconfiguration, or some other problem. Check the ```machine-controller```'s logs; refer to the section [Important Pod Logs](#important-pod-logs) above for exact steps.
Inline should be 1 pair of backticks
Suggested change:
- If the phase is "Provisioning" it means that the cloud provider has not created the corresponding instance yet for one reason or another. This could be quota, misconfiguration, or some other problem. Check the ```machine-controller```'s logs; refer to the section [Important Pod Logs](#important-pod-logs) above for exact steps.
+ If the phase is "Provisioning" it means that the cloud provider has not created the corresponding instance yet for one reason or another. This could be quota, misconfiguration, or some other problem. Check the `machine-controller`'s logs; refer to the section [Important Pod Logs](#important-pod-logs) above for exact steps.
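Checking the phase described above can be sketched as follows. On a live cluster you would run `oc get machine <name> -n openshift-machine-api -o jsonpath='{.status.phase}'`; here a sample status JSON (hypothetical machine) stands in so the extraction step can be shown self-contained.

```shell
# On a live cluster:
#   oc get machine <name> -n openshift-machine-api -o jsonpath='{.status.phase}'
# Sample status JSON stands in here (hypothetical machine).
machine_status='{"status": {"phase": "Provisioning"}}'

# Extract the phase field from the status.
echo "$machine_status" | sed -n 's/.*"phase": *"\([^"]*\)".*/\1/p'
```

A phase of `Provisioning` that never progresses is the cue to dig into the machine-controller logs.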
# A Machine is listed as 'Failed'
In this case, you'll need to take a look at the Machine's status and determine why the Machine entered a failed state. In many instances, simply deleting the Machine object is sufficient. In some other circumstances, the instance may need to be manually cleaned up directly from the cloud provider. The best place to look for information is the `machine-controller`'s logs; refer to the section [Important Pod Logs](#important-pod-logs) above for exact steps.

If a Machine's status is failed, this means something unrecoverable has happened to the Machine. It may be a Machine spec misconfiguration, the instance may have gone missing (eg, terminated by an outside actor) from the cloud.
Nit, I think this is an `or` scenario
If a Machine's status is failed, this means something unrecoverable has happened to the Machine. It may be a Machine spec misconfiguration, the instance may have gone missing (eg, terminated by an outside actor) from the cloud. | |
If a Machine's status is failed, this means something unrecoverable has happened to the Machine. It may be a Machine spec misconfiguration or the instance may have gone missing (eg. terminated by an outside actor) from the cloud. |
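For a failed Machine as described above, the status usually carries a human-readable explanation. Assuming the machine-api Machine status exposes an `errorMessage` field (verify against your cluster's version), a sketch of pulling it out follows; the sample status JSON and its message are hypothetical.

```shell
# On a live cluster (field name is an assumption; verify on your version):
#   oc get machine <name> -n openshift-machine-api -o jsonpath='{.status.errorMessage}'
# Sample status JSON stands in here (hypothetical failure message).
machine_status='{"status": {"phase": "Failed", "errorMessage": "error launching instance: InvalidParameterValue"}}'

# Extract the error message from the status.
echo "$machine_status" | sed -n 's/.*"errorMessage": *"\([^"]*\)".*/\1/p'
```

Cross-reference whatever message appears with the machine-controller logs before deleting the Machine object.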
@michaelgugino: The following test failed, say `/retest` to rerun all failed tests:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/retest Please review the full test history for this PR and help us cut down flakes.