
Add TroubleShooting Guide #721

Merged
merged 4 commits into openshift:master on Oct 13, 2020

Conversation

michaelgugino
Contributor

No description provided.

@@ -0,0 +1,122 @@
# Important
When troubleshooting a **Master/Control Plane Machine**, it's **absolutely imperative** that you familiarize yourself with determining the health of the etcd members. On some rare occasions, a Node may go unready on multiple master Machines, but one or more of those Machines may have healthy etcd members. Before selecting a master Machine to delete, it's mandatory to determine that the Machine you intend to delete will not compromise etcd quorum.
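A minimal sketch of such a check, assuming the standard `openshift-etcd` namespace and a placeholder pod name:

```sh
# List the etcd pods; there should be one per control plane member.
oc get pods -n openshift-etcd

# Open a shell in one of the etcd pods (placeholder name); depending on the release
# you may need to target a specific container with -c.
oc rsh -n openshift-etcd etcd-<master-node-name>

# Inside the pod, check member and endpoint health (assumes etcdctl is configured via the pod environment).
etcdctl member list -w table
etcdctl endpoint health
```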
Contributor Author

I'm leaving the reference to "Masters" here because that matches current product docs in an effort to reduce confusion. Once that is sorted up in the product docs, we'll remove it here as well.

Contributor

Nit: IMO we should use "it is" over "it's" in documentation for better readability.

Contributor

@elmiko left a comment

content looks mostly good to me, i had a few nits and suggestions


For in-depth information and steps to replace
a Master/Control Plane machine, please refer
to this guide: https://docs.openshift.com/container-platform/4.5/backup_and_restore/replacing-unhealthy-etcd-member.html
Contributor

not a blocker for me, but i think we will need to figure out some way to utilize the official doc links that won't require updating link when new versions are released.

Contributor Author

I just tried replacing 4.5 with 'latest' and the pages redirect to the same page with 4.5. So, might be possible to put 'latest' everywhere. Of course, if we want to support versioned docs, we might need to adjust the older branches.

Contributor

Should we update to latest for these links then, assuming that they will always redirect to the latest? Or do we prefer to leave as is?

<!-- /toc -->

# Document Purpose
The intended purpose of this document is to outline steps to investigate the current status of an individual Machine or MachineSet that appears to not be creating Machines which results in additional Nodes joining the cluster. Troubleshooting an unhealthy Node (a Machine that already joined the cluster as a Node) is outside the scope of this document.
Contributor

imo we should make this more direct,

Suggested change
The intended purpose of this document is to outline steps to investigate the current status of an individual Machine or MachineSet that appears to not be creating Machines which results in additional Nodes joining the cluster. Troubleshooting an unhealthy Node (a Machine that already joined the cluster as a Node) is outside the scope of this document.
This document outlines steps to investigate the current status of an individual Machine or MachineSet that appears to not be creating Machines which results in additional Nodes joining the cluster. Troubleshooting an unhealthy Node (a Machine that already joined the cluster as a Node) is outside the scope of this document.


# Important Pod Logs
## machine-api components
Most everything that relates to the machine-api is viewable in the `openshift-machine-api namespace`. You will want to familiarize yourself with the output of
Contributor

minor nit, exclude namespace from the code block text

Suggested change
Most everything that relates to the machine-api is viewable in the `openshift-machine-api namespace`. You will want to familiarize yourself with the output of
Most everything that relates to the machine-api is viewable in the `openshift-machine-api` namespace. You will want to familiarize yourself with the output of

docs/user/TroubleShooting.md (resolved)
```sh
oc logs -n openshift-machine-api machine-api-controllers-<random suffix> -c <controller-name>
```
The random suffix is au[Important Pod Logs](#important-pod-logs)tomatically generated by the `machine-api-controllers` deployment and is most easily found using the output of
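A minimal sketch of listing the machine-api pods to find that suffix:

```sh
# List the machine-api pods; the machine-api-controllers pod carries the random suffix.
oc get pods -n openshift-machine-api
```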
Contributor

the "Important Pod Logs" link got caught in the middle of a word here

Contributor Author

wtf, lol

```sh
oc get csr
```
If there are none, it means the kubelet did not start successfully. If there is a pending CSR for the corresponding Machine/Node, then check the logs for the cluster-machine-approver; refer to the section "Important Pod Logs" above for exact steps.
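A minimal sketch of checking and, if appropriate, manually approving a stuck CSR; `<csr-name>` is a placeholder:

```sh
# List CSRs and look for entries stuck in Pending for the new node.
oc get csr

# Inspect a specific CSR (placeholder name) to confirm it belongs to the expected Machine/Node.
oc describe csr <csr-name>

# If it is legitimate but stuck, approve it manually.
oc adm certificate approve <csr-name>
```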
Contributor

to be consistent with the other entries, should "cluster-machine-approver" be in code blocks here?


# I deleted a Machine (or scaled down a MachineSet) but the Machine and/or Node did not go away

This can be caused by a variety of reasons, such as invalid cloud credentials or PodDisruptionBudgets preventing the Node from draining. The best place to look for information is the machine-controller's logs; refer to the section "Important Pod Logs" above for exact steps.
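A minimal sketch of those checks, with the pod suffix left as a placeholder:

```sh
# Look for PodDisruptionBudgets that may be blocking the Node drain.
oc get poddisruptionbudgets --all-namespaces

# Review drain and deletion activity in the machine-controller logs (suffix is a placeholder).
oc logs -n openshift-machine-api machine-api-controllers-<random suffix> -c machine-controller
```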
Contributor

similar to above, should "machine-controller" be in code blocks?


First, consult with the ```machine-controller```'s logs; refer to [Important Pod Logs](#important-pod-logs) above for exact steps.

Next, compare your findings in the machine-controller logs with the cloud provider's configuration.
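A minimal sketch of pulling the Machine's full object (including its providerSpec) for that comparison; `<machine-name>` is a placeholder:

```sh
# Dump the full Machine object, including spec.providerSpec, to compare against
# what actually exists on the cloud provider side (name is a placeholder).
oc get machine <machine-name> -n openshift-machine-api -o yaml
```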
Contributor

not sure about this reference to "machine-controller"; it seems a little different usage from the others. Should we have a code block here for consistency?

Contributor Author

meh


Ensure you have reviewed and understand that
Masters/Control Plane machines are not backed
by MachineSets at the root of this document.
Contributor

i feel like we could shorten this and just provide a link back to the top, eg "Ensure you have reviewed before proceeding"

Contributor Author

I thought about that, but I have trust issues.

Contributor

i can empathize, maybe leave the text and have a link back to the top lol

Contributor

@alexander-demicev left a comment

/approve
@michaelgugino Thanks a lot for this doc. Can you squash the commits later?

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alexander-demichev

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 7, 2020
Contributor

@elmiko left a comment

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 13, 2020
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar comment
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

Contributor

@JoelSpeed left a comment

This is looking really good, added a bunch of comments, though they're mostly nits, nothing major to add from my side

Perhaps a point for the wider doc discussion, but do we want to have a convention for the names of the files within the docs folders?

@@ -0,0 +1,122 @@
# Important
When troubleshooting a **Master/Control Plane Machine**, it's **absolutely imperative** that you familiarize yourself with determining the health of the etcd members. On some rare occasions, a Node may go unready on multiple master Machines, but one or more of those Machines may have healthy etcd members. Before selecting a master Machine to delete, it's mandatory to determine that the Machine you intend to delete will not compromise etcd quorum.
Contributor

Nit: IMO we should use "it is" over "it's" in documentation for better readability.


For in-depth information and steps to replace
a Master/Control Plane machine, please refer
to this guide: https://docs.openshift.com/container-platform/4.5/backup_and_restore/replacing-unhealthy-etcd-member.html
Contributor

Should we update to latest for these links then, assuming that they will always redirect to the latest? Or do we prefer to leave as is?

```sh
oc get pods -n openshift-machine-api
```

The `machine-api-controllers-*` pod has several containers running: `machineset-controller`, `machine-controller`, `nodelink-controller`, and the `machine-healthcheck-controller`.
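A minimal sketch of confirming which containers that pod runs, with the suffix left as a placeholder:

```sh
# Print the container names inside the machine-api-controllers pod (suffix is a placeholder).
oc get pod machine-api-controllers-<random suffix> -n openshift-machine-api \
  -o jsonpath='{.spec.containers[*].name}'
```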
Contributor

There are also the rbac proxy pods, should we note those somewhere? Something along the lines of these exist but aren't really relevant


## cluster-machine-approver
CSRs that are automatically generated by kubelets on instances provisioned by the machine-api will automatically attempt to join the cluster by issuing a `CSR (certificate signing request)`. Under normal circumstances, these CSRs should be approved automatically. On rare occasions, you may encounter a bug where a CSR is stuck in pending state and the kubelet is unable to join the cluster successfully.
Contributor

This first sentence doesn't make sense/read well to me. We mention CSRs twice during this but they seem unrelated.
Perhaps cut it to

Suggested change
CSRs that are automatically generated by kubelets on instances provisioned by the machine-api will automatically attempt to join the cluster by issuing a `CSR (certificate signing request)`. Under normal circumstances, these CSRs should be approved automatically. On rare occasions, you may encounter a bug where a CSR is stuck in pending state and the kubelet is unable to join the cluster successfully.
On first boot, a machine's kubelet will attempt to join the cluster. Part of this process involves creating a CSR (Certificate signing request) to request credentials for the new machine. Under normal circumstances, these CSRs should be approved automatically. On rare occasions, you may encounter a bug where a CSR is stuck in pending state and the kubelet is unable to join the cluster successfully.

```sh
oc get pods -n openshift-cluster-machine-approver
```

Note the name of the pod `machine-approver-*`. The suffix will be randomly generated by the pod's deployment controller.
Contributor

This isn't technically true (API server handles name generation), and it's mentioned a couple of times, do we need to mention this? Might be better to say that a random suffix is appended to the pod name when the pod is created?

Be sure to replace `<random suffix>` above with the real suffix from the previous step.
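A minimal sketch of putting that together; the `machine-approver-controller` container name is an assumption, so verify it against the pod spec:

```sh
# Find the machine-approver pod, then read its controller logs.
oc get pods -n openshift-cluster-machine-approver

# The container name below is an assumption; check the pod spec if it differs.
oc logs -n openshift-cluster-machine-approver machine-approver-<random suffix> -c machine-approver-controller
```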

# I created a Machine (or scaled up a MachineSet) but I didn't get a Node.
First, check that a Machine object was created successfully if scaling a MachineSet [TODO: need steps to look at MachineSet status and also Machines]. If there is not a new Machine, then check the `machineset-controller`'s logs; refer to the section [Important Pod Logs](#important-pod-logs) above for exact steps.
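A minimal sketch of the kind of checks that TODO refers to, using placeholder names:

```sh
# Compare desired vs. current replicas on the MachineSet (placeholder name).
oc get machineset <machineset-name> -n openshift-machine-api

# Check whether a new Machine was created for the scale-up.
oc get machines -n openshift-machine-api

# Inspect the MachineSet's status and recent events for errors.
oc describe machineset <machineset-name> -n openshift-machine-api
```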
Contributor

Should we add the TODO as an html comment so that it isn't rendered in the markdown?

Next, check the Machine object's status. There may be status conditions that explain the problem, and be sure to check the Phase.
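A minimal sketch of reading the phase and conditions, with `<machine-name>` as a placeholder:

```sh
# Show the Machine's phase (placeholder name).
oc get machine <machine-name> -n openshift-machine-api -o jsonpath='{.status.phase}{"\n"}'

# Review status conditions and recent events in full.
oc describe machine <machine-name> -n openshift-machine-api
```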

## Machine Status: Phase Provisioning
If the phase is "Provisioning" it means that the cloud provider has not created the corresponding instance yet for one reason or another. This could be quota, misconfiguration, or some other problem. Check the ```machine-controller```'s logs; refer to the section [Important Pod Logs](#important-pod-logs) above for exact steps.
Contributor

Inline should be 1 pair of backticks

Suggested change
If the phase is "Provisioning" it means that the cloud provider has not created the corresponding instance yet for one reason or another. This could be quota, misconfiguration, or some other problem. Check the ```machine-controller```'s logs; refer to the section [Important Pod Logs](#important-pod-logs) above for exact steps.
If the phase is "Provisioning" it means that the cloud provider has not created the corresponding instance yet for one reason or another. This could be quota, misconfiguration, or some other problem. Check the `machine-controller`'s logs; refer to the section [Important Pod Logs](#important-pod-logs) above for exact steps.

# A Machine is listed as 'Failed'
In this case, you'll need to take a look at the Machine's status and determine why the Machine entered a failed state. In many instances, simply deleting the Machine object is sufficient. In some other circumstances, the instance may need to be manually cleaned up directly from the cloud provider. The best place to look for information is the `machine-controller`'s logs; refer to the section [Important Pod Logs](#important-pod-logs) above for exact steps.
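A minimal sketch of that inspection and deletion, with `<machine-name>` as a placeholder:

```sh
# Inspect why the Machine failed before removing it (placeholder name).
oc describe machine <machine-name> -n openshift-machine-api

# Delete the failed Machine object; a MachineSet-owned Machine will be replaced automatically.
oc delete machine <machine-name> -n openshift-machine-api
```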

If a Machine's status is failed, this means something unrecoverable has happened to the Machine. It may be a Machine spec misconfiguration, the instance may have gone missing (eg, terminated by an outside actor) from the cloud.
Contributor

Nit, I think this is an or scenario

Suggested change
If a Machine's status is failed, this means something unrecoverable has happened to the Machine. It may be a Machine spec misconfiguration, the instance may have gone missing (eg, terminated by an outside actor) from the cloud.
If a Machine's status is failed, this means something unrecoverable has happened to the Machine. It may be a Machine spec misconfiguration or the instance may have gone missing (eg. terminated by an outside actor) from the cloud.

@openshift-ci-robot
Contributor

openshift-ci-robot commented Oct 13, 2020

@michaelgugino: The following test failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-azure-operator | 3542e3c | link | /test e2e-azure-operator |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit c35135b into openshift:master Oct 13, 2020