
Add TroubleShooting Guide #721

Merged
merged 4 commits into openshift:master on Oct 13, 2020

Conversation

michaelgugino
Contributor

No description provided.

@@ -0,0 +1,122 @@
# Important
When troubleshooting a **Master/Control Plane Machine**, it's **absolutely imperative** that you familiarize yourself with determining the health of the etcd members. On some rare occasions, a Node may go unready on multiple master Machines, but one or more of those Machines may have healthy etcd members. Before selecting a master Machine to delete, it's mandatory to determine that the Machine you intend to delete will not compromise etcd quorum.
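A minimal sketch of such a check, assuming the standard `openshift-etcd` namespace and a placeholder pod name:

```sh
# List the etcd pods; there should be one per control plane member.
oc get pods -n openshift-etcd

# Open a shell in one of the etcd pods (placeholder name); depending on the release
# you may need to target a specific container with -c.
oc rsh -n openshift-etcd etcd-<master-node-name>

# Inside the pod, check member and endpoint health (assumes etcdctl is configured via the pod environment).
etcdctl member list -w table
etcdctl endpoint health
```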
Contributor Author

I'm leaving the reference to "Masters" here because that matches current product docs in an effort to reduce confusion. Once that is sorted up in the product docs, we'll remove it here as well.

Contributor

Nit: IMO we should use "it is" over "it's" in documentation for better readability.

Contributor

@elmiko left a comment

content looks mostly good to me, i had a few nits and suggestions


For in-depth information and steps to replace
a Master/Control Plane machine, please refer
to this guide: https://docs.openshift.com/container-platform/4.5/backup_and_restore/replacing-unhealthy-etcd-member.html
Contributor

not a blocker for me, but i think we will need to figure out some way to utilize the official doc links that won't require updating link when new versions are released.

Contributor Author

I just tried replacing 4.5 with 'latest' and the pages redirect to the same page with 4.5. So, might be possible to put 'latest' everywhere. Of course, if we want to support versioned docs, we might need to adjust the older branches.

Contributor

Should we update to latest for these links then, assuming that they will always redirect to the latest? Or do we prefer to leave as is?

<!-- /toc -->

# Document Purpose
The intended purpose of this document is to outline steps to investigate the current status of an individual Machine or MachineSet that appears to not be creating Machines which results in additional Nodes joining the cluster. Troubleshooting an unhealthy Node (a Machine that already joined the cluster as a Node) is outside the scope of this document.
Contributor

imo we should make this more direct,

Suggested change
The intended purpose of this document is to outline steps to investigate the current status of an individual Machine or MachineSet that appears to not be creating Machines which results in additional Nodes joining the cluster. Troubleshooting an unhealthy Node (a Machine that already joined the cluster as a Node) is outside the scope of this document.
This document outlines steps to investigate the current status of an individual Machine or MachineSet that appears to not be creating Machines which results in additional Nodes joining the cluster. Troubleshooting an unhealthy Node (a Machine that already joined the cluster as a Node) is outside the scope of this document.


# Important Pod Logs
## machine-api components
Most everything that relates to the machine-api is viewable in the `openshift-machine-api namespace`. You will want to familiarize yourself with the output of
Contributor

minor nit, exclude namespace from the code block text

Suggested change
Most everything that relates to the machine-api is viewable in the `openshift-machine-api namespace`. You will want to familiarize yourself with the output of
Most everything that relates to the machine-api is viewable in the `openshift-machine-api` namespace. You will want to familiarize yourself with the output of

docs/user/TroubleShooting.md (resolved)
```sh
oc logs -n openshift-machine-api machine-api-controllers-<random suffix> -c <controller-name>
```
The random suffix is au[Important Pod Logs](#important-pod-logs)tomatically generated by the `machine-api-controllers` deployment and is most easily found using the output of
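A minimal sketch of listing the machine-api pods to find that suffix:

```sh
# List the machine-api pods; the machine-api-controllers pod carries the random suffix.
oc get pods -n openshift-machine-api
```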
Contributor

the "Important Pod Logs" link got caught in the middle of a word here

Contributor Author

wtf, lol

```sh
oc get csr
```
If there are none, it means the kubelet did not start successfully. If there is a pending CSR for the corresponding Machine/Node, then check the logs for the cluster-machine-approver; refer to the section "Important Pod Logs" above for exact steps.
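A minimal sketch of checking and, if appropriate, manually approving a stuck CSR; `<csr-name>` is a placeholder:

```sh
# List CSRs and look for entries stuck in Pending for the new node.
oc get csr

# Inspect a specific CSR (placeholder name) to confirm it belongs to the expected Machine/Node.
oc describe csr <csr-name>

# If it is legitimate but stuck, approve it manually.
oc adm certificate approve <csr-name>
```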
Contributor

to be consistent with the other entries, should "cluster-machine-approver" be in code blocks here?


# I deleted a Machine (or scaled down a MachineSet) but the Machine and/or Node did not go away

This can be caused by a variety of reasons, such as invalid cloud credentials or PodDisruptionBudgets preventing the Node from draining. The best place to look for information is the machine-controller's logs; refer to the section "Important Pod Logs" above for exact steps.
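A minimal sketch of those checks, with the pod suffix left as a placeholder:

```sh
# Look for PodDisruptionBudgets that may be blocking the Node drain.
oc get poddisruptionbudgets --all-namespaces

# Review drain and deletion activity in the machine-controller logs (suffix is a placeholder).
oc logs -n openshift-machine-api machine-api-controllers-<random suffix> -c machine-controller
```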
Contributor

similar to above, should "machine-controller" be in code blocks?


First, consult with the ```machine-controller```'s logs; refer to [Important Pod Logs](#important-pod-logs) above for exact steps.

Next, compare your findings in the machine-controller logs with the cloud provider's configuration.
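A minimal sketch of pulling the Machine's full object (including its providerSpec) for that comparison; `<machine-name>` is a placeholder:

```sh
# Dump the full Machine object, including spec.providerSpec, to compare against
# what actually exists on the cloud provider side (name is a placeholder).
oc get machine <machine-name> -n openshift-machine-api -o yaml
```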
Contributor

not sure about this reference to "machine-controller"; it seems a little different usage from the others. Should we have a code block here for consistency?

Contributor Author

meh


Ensure you have reviewed and understand that
Masters/Control Plane machines are not backed
by MachineSets at the root of this document.
Contributor

i feel like we could shorten this and just provide a link back to the top, eg "Ensure you have reviewed before proceeding"

Contributor Author

I thought about that, but I have trust issues.

Contributor

i can empathize, maybe leave the text and have a link back to the top lol

Contributor

@alexander-demicev left a comment

/approve
@michaelgugino Thanks a lot for this doc. Can you squash the commits later?

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alexander-demichev

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 7, 2020
Contributor

@elmiko left a comment

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 13, 2020
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar comment
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

Contributor

@JoelSpeed left a comment

This is looking really good, added a bunch of comments, though they're mostly nits, nothing major to add from my side

Perhaps a point for the wider doc discussion, but do we want to have a convention for the names of the files within the docs folders?

@@ -0,0 +1,122 @@
# Important
When troubleshooting a **Master/Control Plane Machine**, it's **absolutely imperative** that you familiarize yourself with determining the health of the etcd members. On some rare occasions, a Node may go unready on multiple master Machines, but one or more of those Machines may have healthy etcd members. Before selecting a master Machine to delete, it's mandatory to determine that the Machine you intend to delete will not compromise etcd quorum.
Contributor

Nit: IMO we should use "it is" over "it's" in documentation for better readability.


For in-depth information and steps to replace
a Master/Control Plane machine, please refer
to this guide: https://docs.openshift.com/container-platform/4.5/backup_and_restore/replacing-unhealthy-etcd-member.html
Contributor

Should we update to latest for these links then, assuming that they will always redirect to the latest? Or do we prefer to leave as is?

```sh
oc get pods -n openshift-machine-api
```

The `machine-api-controllers-*` pod has several containers running: `machineset-controller`, `machine-controller`, `nodelink-controller`, and the `machine-healthcheck-controller`.
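A minimal sketch of confirming which containers that pod runs, with the suffix left as a placeholder:

```sh
# Print the container names inside the machine-api-controllers pod (suffix is a placeholder).
oc get pod machine-api-controllers-<random suffix> -n openshift-machine-api \
  -o jsonpath='{.spec.containers[*].name}'
```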
Contributor

There are also the rbac proxy pods, should we note those somewhere? Something along the lines of these exist but aren't really relevant


## cluster-machine-approver
CSRs that are automatically generated by kubelets on instances provisioned by the machine-api will automatically attempt to join the cluster by issuing a `CSR (certificate signing request)`. Under normal circumstances, these CSRs should be approved automatically. On rare occasions, you may encounter a bug where a CSR is stuck in pending state and the kubelet is unable to join the cluster successfully.
Contributor

This first sentence doesn't make sense/read well to me. We mention CSRs twice during this but they seem unrelated.
Perhaps cut it to

Suggested change
CSRs that are automatically generated by kubelets on instances provisioned by the machine-api will automatically attempt to join the cluster by issuing a `CSR (certificate signing request)`. Under normal circumstances, these CSRs should be approved automatically. On rare occasions, you may encounter a bug where a CSR is stuck in pending state and the kubelet is unable to join the cluster successfully.
On first boot, a machine's kubelet will attempt to join the cluster. Part of this process involves creating a CSR (Certificate signing request) to request credentials for the new machine. Under normal circumstances, these CSRs should be approved automatically. On rare occasions, you may encounter a bug where a CSR is stuck in pending state and the kubelet is unable to join the cluster successfully.

```sh
oc get pods -n openshift-cluster-machine-approver
```

Note the name of the pod `machine-approver-*`. The suffix will be randomly generated by the pod's deployment controller.
Contributor

This isn't technically true (API server handles name generation), and it's mentioned a couple of times, do we need to mention this? Might be better to say that a random suffix is appended to the pod name when the pod is created?

Be sure to replace `<random suffix>` above with the real suffix from the previous step.
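A minimal sketch of putting that together; the `machine-approver-controller` container name is an assumption, so verify it against the pod spec:

```sh
# Find the machine-approver pod, then read its controller logs.
oc get pods -n openshift-cluster-machine-approver

# The container name below is an assumption; check the pod spec if it differs.
oc logs -n openshift-cluster-machine-approver machine-approver-<random suffix> -c machine-approver-controller
```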

# I created a Machine (or scaled up a MachineSet) but I didn't get a Node.
First, check that a Machine object was created successfully if scaling a MachineSet [TODO: need steps to look at MachineSet status and also Machines]. If there is not a new Machine, then check the `machineset-controller`'s logs; refer to the section [Important Pod Logs](#important-pod-logs) above for exact steps.
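A minimal sketch of the kind of checks that TODO refers to, using placeholder names:

```sh
# Compare desired vs. current replicas on the MachineSet (placeholder name).
oc get machineset <machineset-name> -n openshift-machine-api

# Check whether a new Machine was created for the scale-up.
oc get machines -n openshift-machine-api

# Inspect the MachineSet's status and recent events for errors.
oc describe machineset <machineset-name> -n openshift-machine-api
```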
Contributor

Should we add the TODO as an html comment so that it isn't rendered in the markdown?

Next, check the Machine object's status. There may be status conditions that explain the problem, and be sure to check the Phase.
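A minimal sketch of reading the phase and conditions, with `<machine-name>` as a placeholder:

```sh
# Show the Machine's phase (placeholder name).
oc get machine <machine-name> -n openshift-machine-api -o jsonpath='{.status.phase}{"\n"}'

# Review status conditions and recent events in full.
oc describe machine <machine-name> -n openshift-machine-api
```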

## Machine Status: Phase Provisioning
If the phase is "Provisioning" it means that the cloud provider has not created the corresponding instance yet for one reason or another. This could be quota, misconfiguration, or some other problem. Check the ```machine-controller```'s logs; refer to the section [Important Pod Logs](#important-pod-logs) above for exact steps.
Contributor

Inline should be 1 pair of backticks

Suggested change
If the phase is "Provisioning" it means that the cloud provider has not created the corresponding instance yet for one reason or another. This could be quota, misconfiguration, or some other problem. Check the ```machine-controller```'s logs; refer to the section [Important Pod Logs](#important-pod-logs) above for exact steps.
If the phase is "Provisioning" it means that the cloud provider has not created the corresponding instance yet for one reason or another. This could be quota, misconfiguration, or some other problem. Check the `machine-controller`'s logs; refer to the section [Important Pod Logs](#important-pod-logs) above for exact steps.

# A Machine is listed as 'Failed'
In this case, you'll need to take a look at the Machine's status and determine why the Machine entered a failed state. In many instances, simply deleting the Machine object is sufficient. In some other circumstances, the instance may need to be manually cleaned up directly from the cloud provider. The best place to look for information is the `machine-controller`'s logs; refer to the section [Important Pod Logs](#important-pod-logs) above for exact steps.
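A minimal sketch of that inspection and deletion, with `<machine-name>` as a placeholder:

```sh
# Inspect why the Machine failed before removing it (placeholder name).
oc describe machine <machine-name> -n openshift-machine-api

# Delete the failed Machine object; a MachineSet-owned Machine will be replaced automatically.
oc delete machine <machine-name> -n openshift-machine-api
```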

If a Machine's status is failed, this means something unrecoverable has happened to the Machine. It may be a Machine spec misconfiguration, the instance may have gone missing (eg, terminated by an outside actor) from the cloud.
Contributor

Nit, I think this is an or scenario

Suggested change
If a Machine's status is failed, this means something unrecoverable has happened to the Machine. It may be a Machine spec misconfiguration, the instance may have gone missing (eg, terminated by an outside actor) from the cloud.
If a Machine's status is failed, this means something unrecoverable has happened to the Machine. It may be a Machine spec misconfiguration or the instance may have gone missing (eg. terminated by an outside actor) from the cloud.

@openshift-ci-robot
Contributor

openshift-ci-robot commented Oct 13, 2020

@michaelgugino: The following test failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-azure-operator | 3542e3c | link | /test e2e-azure-operator |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit c35135b into openshift:master Oct 13, 2020