diff --git a/_topic_maps/_topic_map.yml b/_topic_maps/_topic_map.yml
index f1b4b4c0b3c2..65b10f5b78e1 100644
--- a/_topic_maps/_topic_map.yml
+++ b/_topic_maps/_topic_map.yml
@@ -2801,6 +2801,8 @@ Topics:
 #    File: nodes-nodes-graceful-shutdown
 - Name: Managing the maximum number of pods per node
   File: nodes-nodes-managing-max-pods
+- Name: Replacing a failed bare-metal control plane node without BMC credentials
+  File: nodes-nodes-replace-control-plane
 - Name: Using the Node Tuning Operator
   File: nodes-node-tuning-operator
 - Name: Remediating, fencing, and maintaining nodes
diff --git a/modules/nodes-add-new-etcd-member.adoc b/modules/nodes-add-new-etcd-member.adoc
new file mode 100644
index 000000000000..71e0d10800ab
--- /dev/null
+++ b/modules/nodes-add-new-etcd-member.adoc
@@ -0,0 +1,82 @@
+// Module included in the following assemblies:
+//
+// * nodes/nodes/nodes-nodes-replace-control-plane.adoc
+
+:_mod-docs-content-type: PROCEDURE
+[id="add-new-etcd-member_{context}"]
+= Adding the new etcd member
+
+Finish adding the new control plane node by adding the new etcd member to the cluster.
+
+.Procedure
+
+. Add the new etcd member to the cluster by performing the following steps in a single bash shell session:
+
+.. Find the IP address of the new control plane node by running the following command:
++
+[source,terminal]
+----
+$ oc get nodes -o wide -l node-role.kubernetes.io/control-plane
+----
++
+Make note of the node's IP address for later use.
+
+.. List the etcd pods by running the following command:
++
+[source,terminal]
+----
+$ oc get -n openshift-etcd pods -l k8s-app=etcd -o wide
+----
+
+.. Connect to one of the running etcd pods by running the following command. The etcd pod on the new node should be in a `CrashLoopBackOff` state.
++
+[source,terminal]
+----
+$ oc rsh -n openshift-etcd <etcd_pod>
+----
++
+Where `<etcd_pod>` is the name of a running pod shown in the previous step.
+
+.. View the etcd member list by running the following command:
++
+[source,terminal]
+----
+sh-4.2# etcdctl member list -w table
+----
+
+.. Add the new control plane etcd member by running the following command:
++
+[source,terminal]
+----
+sh-4.2# etcdctl member add <node_name> --peer-urls="https://<node_ip>:2380"
+----
++
+Where `<node_name>` is the name of the new control plane node, and `<node_ip>` is the IP address of the new node.
+
+.. Exit the rsh shell by running the following command:
++
+[source,terminal]
+----
+sh-4.2# exit
+----
+
+. Force an etcd redeployment by running the following command:
++
+[source,terminal]
+----
+$ oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "single-master-recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
+----
+
+. Turn the etcd quorum guard back on by running the following command:
++
+[source,terminal]
+----
+$ oc patch etcd/cluster --type=merge -p '{"spec": {"unsupportedConfigOverrides": null}}'
+----
+
+. Monitor the cluster Operator rollout by running the following command:
++
+[source,terminal]
+----
+$ watch oc get co
+----
diff --git a/modules/nodes-create-new-control-plane-node.adoc b/modules/nodes-create-new-control-plane-node.adoc
new file mode 100644
index 000000000000..85ad5f4e796b
--- /dev/null
+++ b/modules/nodes-create-new-control-plane-node.adoc
@@ -0,0 +1,140 @@
+// Module included in the following assemblies:
+//
+// * nodes/nodes/nodes-nodes-replace-control-plane.adoc
+
+:_mod-docs-content-type: PROCEDURE
+[id="create-new-machine_{context}"]
+= Creating the new control plane node
+
+Begin creating the new control plane node by creating a `BareMetalHost` object and node.
+
+.Procedure
+
+. Edit the `bmh_affected.yaml` file that you previously saved:
++
+--
+.. Remove the following metadata items from the file:
++
+* `creationTimestamp`
+* `generation`
+* `resourceVersion`
+* `uid`
+
+.. Remove the `status` section of the file.
+--
++
+The resulting file should resemble the following example:
++
+.Example `bmh_affected.yaml` file
+[source,yaml]
+----
+apiVersion: metal3.io/v1alpha1
+kind: BareMetalHost
+metadata:
+  labels:
+    installer.openshift.io/role: control-plane
+  name: openshift-control-plane-2
+  namespace: openshift-machine-api
+spec:
+  automatedCleaningMode: disabled
+  bmc:
+    address:
+    credentialsName:
+    disableCertificateVerification: true
+  bootMACAddress: ab:cd:ef:ab:cd:ef
+  bootMode: UEFI
+  externallyProvisioned: true
+  online: true
+  rootDeviceHints:
+    deviceName: /dev/disk/by-path/pci-0000:04:00.0-nvme-1
+  userData:
+    name: master-user-data-managed
+    namespace: openshift-machine-api
+----
+
+. Create the `BareMetalHost` object using the `bmh_affected.yaml` file by running the following command:
++
+[source,terminal]
+----
+$ oc create -f bmh_affected.yaml
+----
++
+The following warning is expected upon creation of the `BareMetalHost` object:
++
+[source,terminal]
+----
+Warning: metadata.finalizers: "baremetalhost.metal3.io": prefer a domain-qualified finalizer name to avoid accidental conflicts with other finalizer writers
+----
+
+. Extract the control plane ignition secret by running the following command:
++
+[source,terminal]
+----
+$ oc extract secret/master-user-data-managed \
+    -n openshift-machine-api \
+    --keys=userData \
+    --to=- \
+    | sed '/^userData/d' > new_controlplane.ign
+----
++
+This command also removes the starting `userData` line of the ignition secret.
+
+. Create an nmstate YAML file named `new_controlplane_nmstate.yaml` for the new node's network configuration, using the following example for reference:
++
+.Example nmstate YAML file
+[source,yaml]
+----
+interfaces:
+  - name: eno1
+    type: ethernet
+    state: up
+    mac-address: "ab:cd:ef:01:02:03"
+    ipv4:
+      enabled: true
+      address:
+        - ip: 192.168.20.11
+          prefix-length: 24
+      dhcp: false
+    ipv6:
+      enabled: false
+dns-resolver:
+  config:
+    search:
+      - iso.sterling.home
+    server:
+      - 192.168.20.8
+routes:
+  config:
+    - destination: 0.0.0.0/0
+      metric: 100
+      next-hop-address: 192.168.20.1
+      next-hop-interface: eno1
+      table-id: 254
+----
++
+[NOTE]
+====
+If you installed your cluster using the Agent-based Installer, you can use the failed node's `networkConfig` section in the `agent-config.yaml` file from the original cluster deployment as a starting point for the new control plane node's nmstate file. For example, the following command extracts the `networkConfig` section for the first control plane node:

+[source,terminal]
+----
+$ cat agent-config-iso.yaml | yq .hosts[0].networkConfig > new_controlplane_nmstate.yaml
+----
+====
+
+. Create the customized {op-system-first} live ISO by running the following command:
++
+[source,terminal]
+----
+$ coreos-installer iso customize rhcos-live.x86_64.iso \
+    --dest-ignition new_controlplane.ign \
+    --network-nmstate new_controlplane_nmstate.yaml \
+    --dest-device /dev/disk/by-path/<device_path> \
+    -f
+----
++
+Where `<device_path>` is the path of the target disk that {op-system} is installed to when the node boots the customized ISO.
+
+. Boot the new control plane node with the customized {op-system} live ISO.
+
+. Approve the pending certificate signing requests (CSRs) to join the new node to the cluster.
diff --git a/modules/nodes-delete-machine-unhealthy-etcd.adoc b/modules/nodes-delete-machine-unhealthy-etcd.adoc
new file mode 100644
index 000000000000..26e94a639ac2
--- /dev/null
+++ b/modules/nodes-delete-machine-unhealthy-etcd.adoc
@@ -0,0 +1,116 @@
+// Module included in the following assemblies:
+//
+// * nodes/nodes/nodes-nodes-replace-control-plane.adoc
+
+:_mod-docs-content-type: PROCEDURE
+[id="deleting-machine_{context}"]
+= Deleting the machine of the unhealthy etcd member
+
+Finish removing the failed control plane node by deleting the machine of the unhealthy etcd member.
+
+.Procedure
+
+. Ensure that the Bare Metal Operator is available by running the following command:
++
+[source,terminal]
+----
+$ oc get clusteroperator baremetal
+----
++
+.Example output
+[source,terminal]
+----
+NAME        VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
+baremetal   4.20.0    True        False         False      3d15h
+----
+
+. Save the `BareMetalHost` object of the affected node to a file for later use by running the following command:
++
+[source,terminal]
+----
+$ oc get -n openshift-machine-api bmh <node_name> -o yaml > bmh_affected.yaml
+----
++
+Where `<node_name>` is the name of the affected node, which usually matches the associated `BareMetalHost` name.
+
+. View the YAML file of the saved `BareMetalHost` object by running the following command, and ensure that its content is correct:
++
+[source,terminal]
+----
+$ cat bmh_affected.yaml
+----
+
+. Remove the affected `BareMetalHost` object by running the following command:
++
+[source,terminal]
+----
+$ oc delete -n openshift-machine-api bmh <node_name>
+----
++
+Where `<node_name>` is the name of the affected node.
+
+. List all machines by running the following command and identify the machine associated with the affected node:
++
+[source,terminal]
+----
+$ oc get machines -n openshift-machine-api -o wide
+----
++
+.Example output
+[source,terminal]
+----
+NAME                             PHASE     TYPE   REGION   ZONE   AGE    NODE                        PROVIDERID                                                                                              STATE
+examplecluster-control-plane-0   Running                          3h11m  openshift-control-plane-0   baremetalhost:///openshift-machine-api/openshift-control-plane-0/da1ebe11-3ff2-41c5-b099-0aa41222964e  externally provisioned
+examplecluster-control-plane-1   Running                          3h11m  openshift-control-plane-1   baremetalhost:///openshift-machine-api/openshift-control-plane-1/d9f9acbc-329c-475e-8d81-03b20280a3e1  externally provisioned
+examplecluster-control-plane-2   Running                          3h11m  openshift-control-plane-2   baremetalhost:///openshift-machine-api/openshift-control-plane-2/3354bdac-61d8-410f-be5b-6a395b056135  externally provisioned
+examplecluster-compute-0         Running                          165m   openshift-compute-0         baremetalhost:///openshift-machine-api/openshift-compute-0/3d685b81-7410-4bb3-80ec-13a31858241f        provisioned
+examplecluster-compute-1         Running                          165m   openshift-compute-1         baremetalhost:///openshift-machine-api/openshift-compute-1/0fdae6eb-2066-4241-91dc-e7ea72ab13b9        provisioned
+----
+
+. Delete the machine of the unhealthy member by running the following command:
++
+[source,terminal]
+----
+$ oc delete machine -n openshift-machine-api <machine_name>
+----
++
+Where `<machine_name>` is the name of the machine associated with the affected node.
++
+.Example command
+[source,terminal]
+----
+$ oc delete machine -n openshift-machine-api examplecluster-control-plane-2
+----
++
+[NOTE]
+====
+After you remove the `BareMetalHost` and `Machine` objects, the machine controller automatically deletes the `Node` object.
+====
+
+. If the deletion of the machine is delayed or obstructed for any reason, force the deletion by removing the finalizer field from the `Machine` object:
++
+[WARNING]
+====
+Do not interrupt machine deletion by pressing `Ctrl+c`. You must allow the command to proceed to completion. Open a new terminal window to edit and delete the finalizer fields.
+====
+
+.. In a new terminal window, edit the machine configuration by running the following command:
++
+[source,terminal]
+----
+$ oc edit machine -n openshift-machine-api examplecluster-control-plane-2
+----
+
+.. Delete the following fields in the `Machine` custom resource, and then save the updated file:
++
+[source,yaml]
+----
+finalizers:
+- machine.machine.openshift.io
+----
++
+.Example output
+[source,terminal]
+----
+machine.machine.openshift.io/examplecluster-control-plane-2 edited
+----
diff --git a/modules/nodes-link-node-machine-bmh.adoc b/modules/nodes-link-node-machine-bmh.adoc
new file mode 100644
index 000000000000..ed526d4069b5
--- /dev/null
+++ b/modules/nodes-link-node-machine-bmh.adoc
@@ -0,0 +1,182 @@
+// Module included in the following assemblies:
+//
+// * nodes/nodes/nodes-nodes-replace-control-plane.adoc
+
+:_mod-docs-content-type: PROCEDURE
+[id="linking-node-machine-bmh_{context}"]
+= Linking the node, bare metal host, and machine together
+
+Continue creating the new control plane node by creating a machine and then linking it with the new `BareMetalHost` object and node.
+
+.Procedure
+
+. Get the `providerID` for the control plane nodes by running the following command:
++
+[source,terminal]
+----
+$ oc get -n openshift-machine-api baremetalhost -l installer.openshift.io/role=control-plane -ojson | jq -r '.items[] | "baremetalhost:///openshift-machine-api/" + .metadata.name + "/" + .metadata.uid'
+----
++
+.Example output
+[source,terminal]
+----
+baremetalhost:///openshift-machine-api/master-00/6214c5cf-c798-4168-8c78-1ff1a3cd2cb4
+baremetalhost:///openshift-machine-api/master-01/58fb60bd-b2a6-4ff3-a88d-208c33abf954
+baremetalhost:///openshift-machine-api/master-02/dc5a94f3-625b-43f6-ab5a-7cc4fc79f105
+----
+
+. Get the cluster label values for the control plane machines by running the following command:
++
+[source,terminal]
+----
+$ oc get machine -n openshift-machine-api \
+    -l machine.openshift.io/cluster-api-machine-role=master \
+    -L machine.openshift.io/cluster-api-cluster
+----
++
+.Example output
+[source,terminal]
+----
+NAME                            PHASE     TYPE   REGION   ZONE   AGE   CLUSTER-API-CLUSTER
+ci-op-jcp3s7wx-ng5sd-master-0   Running                          10h   ci-op-jcp3s7wx-ng5sd
+ci-op-jcp3s7wx-ng5sd-master-1   Running                          10h   ci-op-jcp3s7wx-ng5sd
+ci-op-jcp3s7wx-ng5sd-master-2   Running                          10h   ci-op-jcp3s7wx-ng5sd
+----
+
+. Create a `Machine` object for the new control plane node by creating a YAML file similar to the following example:
++
+[source,yaml]
+----
+apiVersion: machine.openshift.io/v1beta1
+kind: Machine
+metadata:
+  annotations:
+    metal3.io/BareMetalHost: openshift-machine-api/<machine_name>
+  finalizers:
+  - machine.machine.openshift.io
+  labels:
+    machine.openshift.io/cluster-api-cluster: <cluster_name>
+    machine.openshift.io/cluster-api-machine-role: master
+    machine.openshift.io/cluster-api-machine-type: master
+  name: <machine_name>
+  namespace: openshift-machine-api
+spec:
+  metadata: {}
+  providerID: <provider_id>
+  providerSpec:
+    value:
+      apiVersion: baremetal.cluster.k8s.io/v1alpha1
+      hostSelector: {}
+      image:
+        checksum: ""
+        url: ""
+      kind: BareMetalMachineProviderSpec
+      userData:
+        name: master-user-data-managed
+----
++
+--
+where:

+`<machine_name>`:: Specifies the name of the new machine, which can be the same as the previously deleted machine name.
+`<cluster_name>`:: Specifies the `CLUSTER-API-CLUSTER` value of the other control plane machines, shown in the output of the previous step.
+`<provider_id>`:: Specifies the `providerID` value of the new bare metal host, shown in the output of an earlier step.
+--
++
+The following warning is expected:
++
+[source,terminal]
+----
+Warning: metadata.finalizers: "machine.machine.openshift.io": prefer a domain-qualified finalizer name to avoid accidental conflicts with other finalizer writers
+----
+
+. Link the new control plane node and `Machine` object to the `BareMetalHost` object by performing the following steps in a single bash shell session:
+
+.. Define the `NEW_NODE_NAME` variable by running the following command:
++
+[source,terminal]
+----
+$ NEW_NODE_NAME=<new_node_name>
+----
++
+Where `<new_node_name>` is the name of the new control plane node.
+
+.. Define the `NEW_MACHINE_NAME` variable by running the following command:
++
+[source,terminal]
+----
+$ NEW_MACHINE_NAME=<new_machine_name>
+----
++
+Where `<new_machine_name>` is the name of the new machine.
+
+.. Define the `BMH_UID` variable by running the following commands to extract it from the new node's `BareMetalHost` object:
++
+[source,terminal]
+----
+$ BMH_UID=$(oc get -n openshift-machine-api bmh $NEW_NODE_NAME -ojson | jq -r .metadata.uid)
+----
++
+[source,terminal]
+----
+$ echo $BMH_UID
+----
+
+.. Patch the `consumerRef` object into the bare metal host by running the following command:
++
+[source,terminal]
+----
+$ oc patch -n openshift-machine-api bmh $NEW_NODE_NAME --type merge --patch '{"spec":{"consumerRef":{"apiVersion":"machine.openshift.io/v1beta1","kind":"Machine","name":"'$NEW_MACHINE_NAME'","namespace":"openshift-machine-api"}}}'
+----
+
+.. Patch the `providerID` value into the new node by running the following command:
++
+[source,terminal]
+----
+$ oc patch node $NEW_NODE_NAME --type merge --patch '{"spec":{"providerID":"baremetalhost:///openshift-machine-api/'$NEW_NODE_NAME'/'$BMH_UID'"}}'
+----
+
+.. Review the `providerID` values by running the following command:
++
+[source,terminal]
+----
+$ oc get node -l node-role.kubernetes.io/control-plane -ojson | jq -r '.items[] | .metadata.name + " " + .spec.providerID'
+----
+
+. Set the `BareMetalHost` object's `poweredOn` status to `true` by running the following command:
++
+[source,terminal]
+----
+$ oc patch -n openshift-machine-api bmh $NEW_NODE_NAME --subresource status --type json -p '[{"op":"replace","path":"/status/poweredOn","value":true}]'
+----
+
+. Review the `BareMetalHost` object's `poweredOn` status by running the following command:
++
+[source,terminal]
+----
+$ oc get bmh -n openshift-machine-api -ojson | jq -r '.items[] | .metadata.name + " PoweredOn:" + (.status.poweredOn | tostring)'
+----
+
+. Review the `BareMetalHost` object's provisioning state by running the following command:
++
+[source,terminal]
+----
+$ oc get bmh -n openshift-machine-api -ojson | jq -r '.items[] | .metadata.name + " ProvisioningState:" + .status.provisioning.state'
+----
++
+[IMPORTANT]
+====
+If the provisioning state is not `unmanaged`, change the provisioning state by running the following command:

+[source,terminal]
+----
+$ oc patch -n openshift-machine-api bmh $NEW_NODE_NAME --subresource status --type json -p '[{"op":"replace","path":"/status/provisioning/state","value":"unmanaged"}]'
+----
+====
+
+. Set the machine's state to `Provisioned` by running the following command:
++
+[source,terminal]
+----
+$ oc patch -n openshift-machine-api machines $NEW_MACHINE_NAME --subresource status --type json -p '[{"op":"replace","path":"/status/phase","value":"Provisioned"}]'
+----
diff --git a/modules/nodes-remove-unhealthy-etcd-member.adoc b/modules/nodes-remove-unhealthy-etcd-member.adoc
new file mode 100644
index 000000000000..195128788e77
--- /dev/null
+++ b/modules/nodes-remove-unhealthy-etcd-member.adoc
@@ -0,0 +1,179 @@
+// Module included in the following assemblies:
+//
+// * nodes/nodes/nodes-nodes-replace-control-plane.adoc
+
+:_mod-docs-content-type: PROCEDURE
+[id="removing-etcd-member_{context}"]
+= Removing the unhealthy etcd member
+
+Begin removing the failed control plane node by first removing the unhealthy etcd member.
+
+.Procedure
+
+. List the etcd pods by running the following command and make note of a pod that is not on the affected node:
++
+[source,terminal]
+----
+$ oc -n openshift-etcd get pods -l k8s-app=etcd -o wide
+----
++
+.Example output
+[source,terminal]
+----
+etcd-openshift-control-plane-0   5/5   Running   11   3h56m   192.168.10.9    openshift-control-plane-0
+etcd-openshift-control-plane-1   5/5   Running   0    3h54m   192.168.10.10   openshift-control-plane-1
+etcd-openshift-control-plane-2   5/5   Running   0    3h58m   192.168.10.11   openshift-control-plane-2
+----
+
+. Connect to a running etcd container by running the following command:
++
+[source,terminal]
+----
+$ oc rsh -n openshift-etcd <etcd_pod>
+----
++
+Where `<etcd_pod>` is the name of an etcd pod associated with one of the healthy nodes.
++
+.Example command
+[source,terminal]
+----
+$ oc rsh -n openshift-etcd etcd-openshift-control-plane-0
+----
+
+. View the etcd member list by running the following command:
++
+[source,terminal]
+----
+sh-4.2# etcdctl member list -w table
+----
++
+.Example output
+[source,terminal]
+----
++------------------+---------+---------------------------+---------------------------+---------------------------+
+| ID               | STATUS  | NAME                      | PEER ADDRS                | CLIENT ADDRS              |
++------------------+---------+---------------------------+---------------------------+---------------------------+
+| 6fc1e7c9db35841d | started | openshift-control-plane-2 | https://10.0.131.183:2380 | https://10.0.131.183:2379 |
+| 757b6793e2408b6c | started | openshift-control-plane-1 | https://10.0.164.97:2380  | https://10.0.164.97:2379  |
+| ca8c2990a0aa29d1 | started | openshift-control-plane-0 | https://10.0.154.204:2380 | https://10.0.154.204:2379 |
++------------------+---------+---------------------------+---------------------------+---------------------------+
+----
++
+[IMPORTANT]
+====
+* Make note of the ID and the name of the unhealthy etcd member because these values are required later.
+
+* The `etcdctl endpoint health` command lists the removed member until the replacement is complete and the new member is added.
+====
+
+. Remove the unhealthy etcd member by running the following command:
++
+[source,terminal]
+----
+sh-4.2# etcdctl member remove <member_id>
+----
++
+Where `<member_id>` is the ID of the etcd member on the unhealthy node.
++
+.Example command
+[source,terminal]
+----
+sh-4.2# etcdctl member remove 6fc1e7c9db35841d
+----
++
+.Example output
+[source,terminal]
+----
+Member 6fc1e7c9db35841d removed from cluster b23536c33f2cdd1b
+----
+
+. View the member list again by running the following command and verify that the member was removed:
++
+[source,terminal]
+----
+sh-4.2# etcdctl member list -w table
+----
++
+.Example output
+[source,terminal]
+----
++------------------+---------+---------------------------+---------------------------+---------------------------+
+| ID               | STATUS  | NAME                      | PEER ADDRS                | CLIENT ADDRS              |
++------------------+---------+---------------------------+---------------------------+---------------------------+
+| 757b6793e2408b6c | started | openshift-control-plane-1 | https://10.0.164.97:2380  | https://10.0.164.97:2379  |
+| ca8c2990a0aa29d1 | started | openshift-control-plane-0 | https://10.0.154.204:2380 | https://10.0.154.204:2379 |
++------------------+---------+---------------------------+---------------------------+---------------------------+
+----
++
+[IMPORTANT]
+====
+After you remove the member, the cluster might be unreachable for a short time while the remaining etcd instances reboot.
+====
+
+. Exit the rsh session in the etcd pod by running the following command:
++
+[source,terminal]
+----
+sh-4.2# exit
+----
+
+. Turn off the etcd quorum guard by running the following command:
++
+[source,terminal]
+----
+$ oc patch etcd/cluster --type=merge -p '{"spec": {"unsupportedConfigOverrides": {"useUnsupportedUnsafeNonHANonProductionUnstableEtcd": true}}}'
+----
++
+This command ensures that you can successfully re-create secrets and roll out the static pods.
+
+. List the secrets for the removed, unhealthy etcd member by running the following command:
++
+[source,terminal]
+----
+$ oc get secrets -n openshift-etcd | grep <node_name>
+----
++
+Where `<node_name>` is the name of the failed node whose etcd member you removed.
++
+.Example command
+[source,terminal]
+----
+$ oc get secrets -n openshift-etcd | grep openshift-control-plane-2
+----
++
+.Example output
+[source,terminal]
+----
+etcd-peer-openshift-control-plane-2              kubernetes.io/tls   2   134m
+etcd-serving-metrics-openshift-control-plane-2   kubernetes.io/tls   2   134m
+etcd-serving-openshift-control-plane-2           kubernetes.io/tls   2   134m
+----
+
+. Delete the secrets associated with the affected node that was removed:
+
+.. Delete the peer secret by running the following command:
++
+[source,terminal]
+----
+$ oc delete secret -n openshift-etcd etcd-peer-<node_name>
+----
++
+Where `<node_name>` is the name of the affected node.
+
+.. Delete the serving secret by running the following command:
++
+[source,terminal]
+----
+$ oc delete secret -n openshift-etcd etcd-serving-<node_name>
+----
++
+Where `<node_name>` is the name of the affected node.
+
+.. Delete the metrics secret by running the following command:
++
+[source,terminal]
+----
+$ oc delete secret -n openshift-etcd etcd-serving-metrics-<node_name>
+----
++
+Where `<node_name>` is the name of the affected node.
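The three secret deletions above follow a single naming pattern, so they can be wrapped in a small helper. This is a sketch rather than part of the official procedure; the function name and its argument are illustrative:

```shell
# Sketch: delete the peer, serving, and serving-metrics secrets for the
# removed node in one loop. Each iteration runs the same oc delete
# command shown in the individual steps above.
delete_etcd_secrets() {
  node_name="$1"
  for prefix in etcd-peer etcd-serving etcd-serving-metrics; do
    oc delete secret -n openshift-etcd "${prefix}-${node_name}"
  done
}

# Usage (illustrative):
# delete_etcd_secrets openshift-control-plane-2
```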
diff --git a/modules/nodes-replace-control-plane-prereqs.adoc b/modules/nodes-replace-control-plane-prereqs.adoc
new file mode 100644
index 000000000000..eb3241e97adc
--- /dev/null
+++ b/modules/nodes-replace-control-plane-prereqs.adoc
@@ -0,0 +1,25 @@
+// Module included in the following assemblies:
+//
+// * nodes/nodes/nodes-nodes-replace-control-plane.adoc
+
+:_mod-docs-content-type: REFERENCE
+[id="prerequisites_{context}"]
+= Prerequisites
+
+* You have identified the unhealthy bare metal etcd member.
+* You have verified that either the machine is not running or the node is not ready.
+* You have access to the cluster as a user with the `cluster-admin` role.
+* You have taken an xref:../../backup_and_restore/control_plane_backup_and_restore/backing-up-etcd.adoc#backup-etcd[etcd backup] in case you encounter any issues.
+* You have downloaded and installed the link:https://console.redhat.com/openshift/downloads#tool-coreos-installer[`coreos-installer` CLI].
+* You have downloaded the {op-system-first} live ISO image, which this procedure customizes with the `coreos-installer` CLI.
+* Your cluster does not have a control plane `machineset`. You can check for `machinesets` by running the following command:
++
+[source,terminal]
+----
+$ oc get machinesets,controlplanemachinesets -n openshift-machine-api
+----
++
+[IMPORTANT]
+====
+The command should return `machinesets` for the worker nodes only.
+If a `controlplanemachineset` resource exists for the control plane, do not use this procedure.
+====
diff --git a/modules/nodes-verify-failed-node-deleted.adoc b/modules/nodes-verify-failed-node-deleted.adoc
new file mode 100644
index 000000000000..da20c3ea84b3
--- /dev/null
+++ b/modules/nodes-verify-failed-node-deleted.adoc
@@ -0,0 +1,53 @@
+// Module included in the following assemblies:
+//
+// * nodes/nodes/nodes-nodes-replace-control-plane.adoc
+
+:_mod-docs-content-type: PROCEDURE
+[id="verify-machine-deleted_{context}"]
+= Verifying that the failed node was deleted
+
+Before proceeding to create a replacement control plane node, verify that the failed node was successfully deleted.
+
+.Procedure
+
+. Verify that the machine was deleted by running the following command:
++
+[source,terminal]
+----
+$ oc get machines -n openshift-machine-api -o wide
+----
++
+.Example output
+[source,terminal]
+----
+NAME                             PHASE     TYPE   REGION   ZONE   AGE    NODE                        PROVIDERID                                                                                              STATE
+examplecluster-control-plane-0   Running                          3h11m  openshift-control-plane-0   baremetalhost:///openshift-machine-api/openshift-control-plane-0/da1ebe11-3ff2-41c5-b099-0aa41222964e  externally provisioned
+examplecluster-control-plane-1   Running                          3h11m  openshift-control-plane-1   baremetalhost:///openshift-machine-api/openshift-control-plane-1/d9f9acbc-329c-475e-8d81-03b20280a3e1  externally provisioned
+examplecluster-compute-0         Running                          165m   openshift-compute-0         baremetalhost:///openshift-machine-api/openshift-compute-0/3d685b81-7410-4bb3-80ec-13a31858241f        provisioned
+examplecluster-compute-1         Running                          165m   openshift-compute-1         baremetalhost:///openshift-machine-api/openshift-compute-1/0fdae6eb-2066-4241-91dc-e7ea72ab13b9        provisioned
+----
+
+. Verify that the node was deleted by running the following command:
++
+[source,terminal]
+----
+$ oc get nodes
+----
++
+.Example output
+[source,terminal]
+----
+NAME                        STATUS   ROLES    AGE     VERSION
+openshift-control-plane-0   Ready    master   3h24m   v1.33.4
+openshift-control-plane-1   Ready    master   3h24m   v1.33.4
+openshift-compute-0         Ready    worker   176m    v1.33.4
+openshift-compute-1         Ready    worker   176m    v1.33.4
+----
+
+. Wait for all of the cluster Operators to complete rolling out changes.
+Run the following command to monitor the progress:
++
+[source,terminal]
+----
+$ watch oc get co
+----
diff --git a/nodes/nodes/nodes-nodes-replace-control-plane.adoc b/nodes/nodes/nodes-nodes-replace-control-plane.adoc
new file mode 100644
index 000000000000..28d0178e0d44
--- /dev/null
+++ b/nodes/nodes/nodes-nodes-replace-control-plane.adoc
@@ -0,0 +1,39 @@
+:_mod-docs-content-type: ASSEMBLY
+[id="replacing-control-plane-node"]
+= Replacing a failed bare-metal control plane node without BMC credentials
+include::_attributes/common-attributes.adoc[]
+:context: replacing-control-plane-node
+
+toc::[]
+
+If a control plane node on your bare-metal cluster has failed and cannot be recovered, but you installed your cluster without providing baseboard management controller (BMC) credentials, you must take additional steps to replace the failed node with a new one.
+
+// Prerequisites
+include::modules/nodes-replace-control-plane-prereqs.adoc[leveloffset=+1]
+
+// Removing the unhealthy etcd member
+include::modules/nodes-remove-unhealthy-etcd-member.adoc[leveloffset=+1]
+
+// Deleting the machine of the unhealthy etcd member
+include::modules/nodes-delete-machine-unhealthy-etcd.adoc[leveloffset=+1]
+
+// Verifying that the failed node was deleted
+include::modules/nodes-verify-failed-node-deleted.adoc[leveloffset=+1]
+
+// Creating the new control plane node
+include::modules/nodes-create-new-control-plane-node.adoc[leveloffset=+1]
+
+// Linking the node, bare metal host, and machine together
+include::modules/nodes-link-node-machine-bmh.adoc[leveloffset=+1]
+
+// Adding the new etcd member
+include::modules/nodes-add-new-etcd-member.adoc[leveloffset=+1]
+
+[role="_additional-resources"]
+[id="additional-resources_{context}"]
+== Additional resources
+
+* xref:../../backup_and_restore/control_plane_backup_and_restore/replacing-unhealthy-etcd-member.adoc#restore-replace-stopped-baremetal-etcd-member_replacing-unhealthy-etcd-member[Replacing an unhealthy bare metal etcd member whose machine is not running or whose node is not ready]
+* xref:../../backup_and_restore/control_plane_backup_and_restore/replacing-unhealthy-etcd-member.adoc#restore-replace-crashlooping-etcd-member_replacing-unhealthy-etcd-member[Replacing an unhealthy etcd member whose etcd pod is crashlooping]
+* link:https://access.redhat.com/solutions/6471021[BareMetalHost reference is missing after adding a host to OpenShift Assisted Installer cluster] (Red{nbsp}Hat KCS article)
+* link:https://access.redhat.com/solutions/5504291[How to retrieve Master or Worker Ignition Configuration from {product-title} 4?] (Red{nbsp}Hat KCS article)