2 changes: 2 additions & 0 deletions _topic_maps/_topic_map.yml
@@ -2801,6 +2801,8 @@ Topics:
#    File: nodes-nodes-graceful-shutdown
- Name: Managing the maximum number of pods per node
  File: nodes-nodes-managing-max-pods
- Name: Replacing a failed bare-metal control plane node without BMC credentials
  File: nodes-nodes-replace-control-plane
- Name: Using the Node Tuning Operator
  File: nodes-node-tuning-operator
- Name: Remediating, fencing, and maintaining nodes
82 changes: 82 additions & 0 deletions modules/nodes-add-new-etcd-member.adoc
@@ -0,0 +1,82 @@
// Module included in the following assemblies:
//
// * nodes/nodes/nodes-nodes-replace-control-plane.adoc

:_mod-docs-content-type: PROCEDURE
[id="add-new-etcd-member_{context}"]
= Adding the new etcd member

Finish adding the new control plane node by adding the new etcd member to the cluster.

.Procedure

. Add the new etcd member to the cluster by performing the following steps in a single bash shell session:

.. Find the IP address of the new control plane node by running the following command:
+
[source,terminal]
----
$ oc get nodes -owide -l node-role.kubernetes.io/control-plane
----
+
Make note of the node's IP address for later use.

.. List the etcd pods by running the following command:
+
[source,terminal]
----
$ oc get -n openshift-etcd pods -l k8s-app=etcd -o wide
----

.. Connect to one of the running etcd pods by running the following command. Do not use the etcd pod on the new node, which is expected to be in a `CrashLoopBackOff` state.
+
[source,terminal]
----
$ oc rsh -n openshift-etcd <running_pod>
----
+
Where `<running_pod>` is the name of a running pod shown in the previous step.
// Review comment (Contributor): See SSG for user-replaceable value DL format, i.e. `where:` and so on.

.. View the etcd member list by running the following command:
+
[source,terminal]
----
sh-4.2# etcdctl member list -w table
----

.. Add the new control plane etcd member by running the following command:
+
[source,terminal]
----
sh-4.2# etcdctl member add <new_node> --peer-urls="https://<ip_address>:2380"
----
+
Where `<new_node>` is the name of the new control plane node, and `<ip_address>` is the IP address of the new node.

.. Exit the rsh shell by running the following command:
+
[source,terminal]
----
sh-4.2# exit
----

. Force an etcd redeployment by running the following command:
+
[source,terminal]
----
$ oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "single-master-recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
----

. Turn the etcd quorum guard back on by running the following command:
+
[source,terminal]
----
$ oc patch etcd/cluster --type=merge -p '{"spec": {"unsupportedConfigOverrides": null}}'
----

. Monitor the cluster Operator rollout by running the following command:
+
[source,terminal]
----
$ watch oc get co
----
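
As a rough aid for the monitoring step above, the following sketch filters cluster Operator rows and prints the names of Operators that have not settled yet. The function name `unsettled_operators` is hypothetical. It assumes the usual `oc get co` column order (`NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE`) and that every row has a non-empty `VERSION` value; rows without a version would shift the columns and be misread.

```shell
#!/usr/bin/env bash
# Sketch only: list cluster Operators that are not yet Available=True,
# Progressing=False, Degraded=False.
# Input: rows from `oc get co` on stdin (header row is skipped if present).
# Assumption: columns are NAME VERSION AVAILABLE PROGRESSING DEGRADED ...
unsettled_operators() {
  awk '
    $1 == "NAME" { next }                                   # skip header row
    $3 != "True" || $4 != "False" || $5 != "False" { print $1 }
  '
}

# Hypothetical usage against a live cluster:
#   oc get co | unsettled_operators
# Empty output suggests that the rollout has finished.
```

The function only reads text, so it can be tried on saved command output before you run it against a cluster.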
140 changes: 140 additions & 0 deletions modules/nodes-create-new-control-plane-node.adoc
@@ -0,0 +1,140 @@
// Module included in the following assemblies:
//
// * nodes/nodes/nodes-nodes-replace-control-plane.adoc

:_mod-docs-content-type: PROCEDURE
[id="create-new-machine_{context}"]
= Creating the new control plane node

Begin creating the new control plane node by creating a `BareMetalHost` object and node.

.Procedure

. Edit the `bmh_affected.yaml` file that you previously saved:
+
--
.. Remove the following metadata items from the file:
+
* `creationTimestamp`
* `generation`
* `resourceVersion`
* `uid`

.. Remove the `status` section of the file.
--
+
The resulting file should resemble the following:
// Review comment (Contributor), suggested change: "The resulting file should resemble the following example:"
+
.Example `bmh_affected.yaml` file
[source,yaml]
----
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  labels:
    installer.openshift.io/role: control-plane
  name: openshift-control-plane-2
  namespace: openshift-machine-api
spec:
  automatedCleaningMode: disabled
  bmc:
    address:
    credentialsName:
    disableCertificateVerification: true
  bootMACAddress: ab:cd:ef:ab:cd:ef
  bootMode: UEFI
  externallyProvisioned: true
  online: true
  rootDeviceHints:
    deviceName: /dev/disk/by-path/pci-0000:04:00.0-nvme-1
  userData:
    name: master-user-data-managed
    namespace: openshift-machine-api
----

. Create the `BareMetalHost` object using the `bmh_affected.yaml` file by running the following command:
+
[source,terminal]
----
$ oc create -f bmh_affected.yaml
----
+
The following warning is expected upon creation of the `BareMetalHost` object:
+
[source,terminal]
----
Warning: metadata.finalizers: "baremetalhost.metal3.io": prefer a domain-qualified finalizer name to avoid accidental conflicts with other finalizer writers
----

. Extract the control plane ignition secret by running the following command:
+
[source,terminal]
----
$ oc extract secret/master-user-data-managed \
-n openshift-machine-api \
--keys=userData \
--to=- \
| sed '/^userData/d' > new_controlplane.ign
----
+
This command also removes the starting `userData` line of the ignition secret.

. Create an nmstate YAML file titled `new_controlplane_nmstate.yaml` for the new node's network configuration, using the following example for reference:
// Review comment (Contributor): nmstate or NMState?
+
.Example nmstate YAML file
[source,yaml]
----
interfaces:
  - name: eno1
    type: ethernet
    state: up
    mac-address: "ab:cd:ef:01:02:03"
    ipv4:
      enabled: true
      address:
        - ip: 192.168.20.11
          prefix-length: 24
      dhcp: false
    ipv6:
      enabled: false
dns-resolver:
  config:
    search:
      - iso.sterling.home
    server:
      - 192.168.20.8
routes:
  config:
  - destination: 0.0.0.0/0
    metric: 100
    next-hop-address: 192.168.20.1
    next-hop-interface: eno1
    table-id: 254
----
+
[NOTE]
====
If you installed your cluster using the Agent-based Installer, you can use the failed node's `networkConfig` section in the `agent-config.yaml` file from the original cluster deployment as a starting point for the new control plane node's nmstate file. For example, the following command extracts the `networkConfig` section for the first control plane node:

[source,terminal]
----
$ yq '.hosts[0].networkConfig' agent-config.yaml > new_controlplane_nmstate.yaml
----
====

. Create the customized {op-system-first} live ISO by running the following command:
+
[source,terminal]
----
$ coreos-installer iso customize rhcos-live.x86_64.iso \
--dest-ignition new_controlplane.ign \
--network-nmstate new_controlplane_nmstate.yaml \
--dest-device /dev/disk/by-path/<device_path> \
-f
----
+
Where `<device_path>` is the path to the target device on which {op-system} will be installed.

. Boot the new control plane node with the customized {op-system} live ISO.

. Approve the certificate signing requests (CSRs) to join the new node to the cluster.
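
The CSR approval step above does not show a command. The following sketch is one possible way to pick out the pending requests. The function name `pending_csrs` is hypothetical, and the sketch assumes that `CONDITION` is the last column of the `oc get csr` table output. A new node typically raises a client CSR first and a serving CSR after the first one is approved, so you might need to repeat the approval.

```shell
#!/usr/bin/env bash
# Sketch only: print the names of CSRs whose CONDITION is "Pending".
# Input: table output of `oc get csr` on stdin.
# Assumption: NAME is the first column and CONDITION is the last column.
pending_csrs() {
  awk '$1 != "NAME" && $NF == "Pending" { print $1 }'
}

# Hypothetical usage against a live cluster. Review the list first,
# then approve each request by name:
#   oc get csr | pending_csrs
#   oc adm certificate approve <csr_name>
```

Listing the names before approving them keeps a manual review step in place, which is why the sketch does not pipe directly into the approve command.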
114 changes: 114 additions & 0 deletions modules/nodes-delete-machine-unhealthy-etcd.adoc
@@ -0,0 +1,114 @@
// Module included in the following assemblies:
//
// * nodes/nodes/nodes-nodes-replace-control-plane.adoc

:_mod-docs-content-type: PROCEDURE
[id="deleting-machine_{context}"]
= Deleting the machine of the unhealthy etcd member

Finish removing the failed control plane node by deleting the machine of the unhealthy etcd member.

. Ensure that the Bare Metal Operator is available by running the following command:
// Review comment (Contributor) on lines +10 to +11, suggested change: add a `.Procedure` title above the first step.
+
[source,terminal]
----
$ oc get clusteroperator baremetal
----
+
.Example output
[source,terminal]
----
NAME        VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
baremetal   4.20.0    True        False         False      3d15h
----

. Save the `BareMetalHost` object of the affected node to a file for later use by running the following command:
+
[source,terminal]
----
$ oc get -n openshift-machine-api bmh <node_name> -o yaml > bmh_affected.yaml
----
+
Where `<node_name>` is the name of the affected node, which usually matches the associated `BareMetalHost` name.

. View the YAML file of the saved `BareMetalHost` object by running the following command, and ensure the content is correct:
+
[source,terminal]
----
$ cat bmh_affected.yaml
----

. Remove the affected `BareMetalHost` object by running the following command:
+
[source,terminal]
----
$ oc delete -n openshift-machine-api bmh <node_name>
----
+
Where `<node_name>` is the name of the affected node.

. List all machines by running the following command and identify the machine associated with the affected node:
+
[source,terminal]
----
$ oc get machines -n openshift-machine-api -o wide
----
+
.Example output
[source,terminal]
----
NAME                             PHASE     TYPE   REGION   ZONE   AGE     NODE                        PROVIDERID                                                                                              STATE
examplecluster-control-plane-0   Running                          3h11m   openshift-control-plane-0   baremetalhost:///openshift-machine-api/openshift-control-plane-0/da1ebe11-3ff2-41c5-b099-0aa41222964e   externally provisioned
examplecluster-control-plane-1   Running                          3h11m   openshift-control-plane-1   baremetalhost:///openshift-machine-api/openshift-control-plane-1/d9f9acbc-329c-475e-8d81-03b20280a3e1   externally provisioned
examplecluster-control-plane-2   Running                          3h11m   openshift-control-plane-2   baremetalhost:///openshift-machine-api/openshift-control-plane-2/3354bdac-61d8-410f-be5b-6a395b056135   externally provisioned
examplecluster-compute-0         Running                          165m    openshift-compute-0         baremetalhost:///openshift-machine-api/openshift-compute-0/3d685b81-7410-4bb3-80ec-13a31858241f         provisioned
examplecluster-compute-1         Running                          165m    openshift-compute-1         baremetalhost:///openshift-machine-api/openshift-compute-1/0fdae6eb-2066-4241-91dc-e7ea72ab13b9         provisioned
----

. Delete the machine of the unhealthy member by running the following command:
+
[source,terminal]
----
$ oc delete machine -n openshift-machine-api <machine_name>
----
+
Where `<machine_name>` is the machine name associated with the affected node.
+
.Example command
[source,terminal]
----
$ oc delete machine -n openshift-machine-api examplecluster-control-plane-2
----
+
[NOTE]
====
After you remove the `BareMetalHost` and `Machine` objects, the machine controller automatically deletes the `Node` object.
====

. If deletion of the machine is delayed or the command is blocked, force the deletion by removing the finalizer field from the `Machine` object.
// Review comment (Contributor): Does this need a reformat?
+
[WARNING]
====
Do not interrupt machine deletion by pressing `Ctrl+c`. You must allow the command to proceed to completion. Open a new terminal window to edit and delete the finalizer fields.
====

.. In a new terminal window, edit the machine configuration by running the following command:
+
[source,terminal]
----
$ oc edit machine -n openshift-machine-api examplecluster-control-plane-2
----

.. Delete the following fields in the `Machine` custom resource, and then save the updated file:
+
[source,yaml]
----
finalizers:
- machine.machine.openshift.io
----
+
.Example output
[source,terminal]
----
machine.machine.openshift.io/examplecluster-control-plane-2 edited
----
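
The procedure above asks you to identify the machine associated with the affected node by reading the machine list. As a minimal sketch, the following function prints the machine name from the row that contains an exact match for the node name. The function name `machine_for_node` is hypothetical. Because the `TYPE`, `REGION`, and `ZONE` columns can be empty on bare metal, the sketch scans every field after the first instead of relying on a fixed column position.

```shell
#!/usr/bin/env bash
# Sketch only: print the machine whose row contains the given node name.
# Input: table output of `oc get machines -n openshift-machine-api -o wide`
#        on stdin.
# Argument: the node name, for example openshift-control-plane-2.
machine_for_node() {
  local node="$1"
  awk -v node="$node" '
    {
      # Compare whole fields only, so a node name embedded in the
      # PROVIDERID path does not produce a false match.
      for (i = 2; i <= NF; i++) {
        if ($i == node) { print $1; break }
      }
    }
  '
}

# Hypothetical usage against a live cluster:
#   oc get machines -n openshift-machine-api -o wide \
#     | machine_for_node openshift-control-plane-2
```

Confirm the printed name against the full machine list before you delete the machine.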