OSDOCS#17158: node replacement procedure #102520
base: main
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,82 @@ | ||
| // Module included in the following assemblies: | ||
| // | ||
| // * nodes/nodes/nodes-nodes-replace-control-plane.adoc | ||
|
|
||
| :_mod-docs-content-type: PROCEDURE | ||
| [id="add-new-etcd-member_{context}"] | ||
| = Adding the new etcd member | ||
|
|
||
| Finish adding the new control plane node by adding the new etcd member to the cluster. | ||
|
|
||
| .Procedure | ||
|
|
||
| . Add the new etcd member to the cluster by performing the following steps in a single bash shell session: | ||
|
|
||
| .. Find the IP address of the new control plane node by running the following command: | ||
| + | ||
| [source,terminal] | ||
| ---- | ||
| $ oc get nodes -o wide -l node-role.kubernetes.io/control-plane | ||
| ---- | ||
| + | ||
| Make note of the node's IP address for later use. | ||
|
|
||
| .. List the etcd pods by running the following command: | ||
| + | ||
| [source,terminal] | ||
| ---- | ||
| $ oc get -n openshift-etcd pods -l k8s-app=etcd -o wide | ||
| ---- | ||
|
|
||
| .. Connect to one of the running etcd pods by running the following command. The etcd pod on the new node is expected to be in a `CrashLoopBackOff` state, so choose a pod on a different node. | ||
| + | ||
| [source,terminal] | ||
| ---- | ||
| $ oc rsh -n openshift-etcd <running_pod> | ||
| ---- | ||
| + | ||
| Where `<running_pod>` is the name of a running pod shown in the previous step. | ||
|
|
||
| .. View the etcd member list by running the following command: | ||
| + | ||
| [source,terminal] | ||
| ---- | ||
| sh-4.2# etcdctl member list -w table | ||
| ---- | ||
|
|
||
| .. Add the new control plane etcd member by running the following command: | ||
| + | ||
| [source,terminal] | ||
| ---- | ||
| sh-4.2# etcdctl member add <new_node> --peer-urls="https://<ip_address>:2380" | ||
| ---- | ||
| + | ||
| Where `<new_node>` is the name of the new control plane node, and `<ip_address>` is the IP address of the new node. | ||
|
|
||
| .. Exit the rsh shell by running the following command: | ||
| + | ||
| [source,terminal] | ||
| ---- | ||
| sh-4.2# exit | ||
| ---- | ||
|
|
||
| . Force an etcd redeployment by running the following command: | ||
| + | ||
| [source,terminal] | ||
| ---- | ||
| $ oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "single-master-recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge | ||
| ---- | ||
|
|
||
| . Turn the etcd quorum guard back on by running the following command: | ||
| + | ||
| [source,terminal] | ||
| ---- | ||
| $ oc patch etcd/cluster --type=merge -p '{"spec": {"unsupportedConfigOverrides": null}}' | ||
| ---- | ||
|
|
||
| . Monitor the cluster Operator rollout by running the following command: | ||
| + | ||
| [source,terminal] | ||
| ---- | ||
| $ watch oc get co | ||
| ---- | ||
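The first substep and the `--peer-urls` value in this module can be tied together with a small helper. The following is a minimal sketch and not part of the PR: `etcd_peer_url` is a hypothetical function name, and it assumes the default `oc get nodes -o wide` layout, in which the `INTERNAL-IP` header appears before any column whose values contain spaces. It reads that output on standard input and prints the peer URL in the form used by `etcdctl member add`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): print the etcd peer URL for a node.
# Input: `oc get nodes -o wide` output on stdin. Argument: the node name.
# Assumes an IPv4 address; an IPv6 address would need square brackets in the URL.
#
# Possible usage:
#   oc get nodes -o wide -l node-role.kubernetes.io/control-plane \
#     | etcd_peer_url <new_node>
etcd_peer_url() {
  local node="$1"
  awk -v node="$node" '
    # Locate the INTERNAL-IP column from the header row.
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "INTERNAL-IP") col = i; next }
    # Print the peer URL for the matching node.
    $1 == node && col { print "https://" $col ":2380"; found = 1 }
    # Exit non-zero when the node is not listed.
    END { exit !found }
  '
}
```

The function only formats text, so the printed URL still has to be pasted into the `etcdctl member add` command inside the rsh session.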
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,140 @@ | ||||||
| // Module included in the following assemblies: | ||||||
| // | ||||||
| // * nodes/nodes/nodes-nodes-replace-control-plane.adoc | ||||||
|
|
||||||
| :_mod-docs-content-type: PROCEDURE | ||||||
| [id="create-new-machine_{context}"] | ||||||
| = Creating the new control plane node | ||||||
|
|
||||||
| Begin creating the new control plane node by creating a `BareMetalHost` object and node. | ||||||
|
|
||||||
| .Procedure | ||||||
|
|
||||||
| . Edit the `bmh_affected.yaml` file that you previously saved: | ||||||
| + | ||||||
| -- | ||||||
| .. Remove the following metadata items from the file: | ||||||
| + | ||||||
| * `creationTimestamp` | ||||||
| * `generation` | ||||||
| * `resourceVersion` | ||||||
| * `uid` | ||||||
|
|
||||||
| .. Remove the `status` section of the file. | ||||||
| -- | ||||||
| + | ||||||
| The resulting file should resemble the following: | ||||||
|
|
||||||
| + | ||||||
| .Example `bmh_affected.yaml` file | ||||||
| [source,yaml] | ||||||
| ---- | ||||||
| apiVersion: metal3.io/v1alpha1 | ||||||
| kind: BareMetalHost | ||||||
| metadata: | ||||||
| labels: | ||||||
| installer.openshift.io/role: control-plane | ||||||
| name: openshift-control-plane-2 | ||||||
| namespace: openshift-machine-api | ||||||
| spec: | ||||||
| automatedCleaningMode: disabled | ||||||
| bmc: | ||||||
| address: | ||||||
| credentialsName: | ||||||
| disableCertificateVerification: true | ||||||
| bootMACAddress: ab:cd:ef:ab:cd:ef | ||||||
| bootMode: UEFI | ||||||
| externallyProvisioned: true | ||||||
| online: true | ||||||
| rootDeviceHints: | ||||||
| deviceName: /dev/disk/by-path/pci-0000:04:00.0-nvme-1 | ||||||
| userData: | ||||||
| name: master-user-data-managed | ||||||
| namespace: openshift-machine-api | ||||||
| ---- | ||||||
|
|
||||||
| . Create the `BareMetalHost` object using the `bmh_affected.yaml` file by running the following command: | ||||||
| + | ||||||
| [source,terminal] | ||||||
| ---- | ||||||
| $ oc create -f bmh_affected.yaml | ||||||
| ---- | ||||||
| + | ||||||
| The following warning is expected upon creation of the `BareMetalHost` object: | ||||||
| + | ||||||
| [source,terminal] | ||||||
| ---- | ||||||
| Warning: metadata.finalizers: "baremetalhost.metal3.io": prefer a domain-qualified finalizer name to avoid accidental conflicts with other finalizer writers | ||||||
| ---- | ||||||
|
|
||||||
| . Extract the control plane ignition secret by running the following command: | ||||||
| + | ||||||
| [source,terminal] | ||||||
| ---- | ||||||
| $ oc extract secret/master-user-data-managed \ | ||||||
| -n openshift-machine-api \ | ||||||
| --keys=userData \ | ||||||
| --to=- \ | ||||||
| | sed '/^userData/d' > new_controlplane.ign | ||||||
| ---- | ||||||
| + | ||||||
| This command also removes the starting `userData` line of the ignition secret. | ||||||
|
|
||||||
| . Create an nmstate YAML file named `new_controlplane_nmstate.yaml` for the new node's network configuration, using the following example for reference: | ||||||
|
|
||||||
| + | ||||||
| .Example nmstate YAML file | ||||||
| [source,yaml] | ||||||
| ---- | ||||||
| interfaces: | ||||||
| - name: eno1 | ||||||
| type: ethernet | ||||||
| state: up | ||||||
| mac-address: "ab:cd:ef:01:02:03" | ||||||
| ipv4: | ||||||
| enabled: true | ||||||
| address: | ||||||
| - ip: 192.168.20.11 | ||||||
| prefix-length: 24 | ||||||
| dhcp: false | ||||||
| ipv6: | ||||||
| enabled: false | ||||||
| dns-resolver: | ||||||
| config: | ||||||
| search: | ||||||
| - iso.sterling.home | ||||||
| server: | ||||||
| - 192.168.20.8 | ||||||
| routes: | ||||||
| config: | ||||||
| - destination: 0.0.0.0/0 | ||||||
| metric: 100 | ||||||
| next-hop-address: 192.168.20.1 | ||||||
| next-hop-interface: eno1 | ||||||
| table-id: 254 | ||||||
| ---- | ||||||
| + | ||||||
| [NOTE] | ||||||
| ==== | ||||||
| If you installed your cluster using the Agent-based Installer, you can use the failed node's `networkConfig` section in the `agent-config.yaml` file from the original cluster deployment as a starting point for the new control plane node's nmstate file. For example, the following command extracts the `networkConfig` section for the first control plane node: | ||||||
|
|
||||||
| [source,terminal] | ||||||
| ---- | ||||||
| $ cat agent-config.yaml | yq '.hosts[0].networkConfig' > new_controlplane_nmstate.yaml | ||||||
| ---- | ||||||
| ==== | ||||||
|
|
||||||
| . Create the customized {op-system-first} live ISO by running the following command: | ||||||
| + | ||||||
| [source,terminal] | ||||||
| ---- | ||||||
| $ coreos-installer iso customize rhcos-live.x86_64.iso \ | ||||||
| --dest-ignition new_controlplane.ign \ | ||||||
| --network-nmstate new_controlplane_nmstate.yaml \ | ||||||
| --dest-device /dev/disk/by-path/<device_path> \ | ||||||
| -f | ||||||
| ---- | ||||||
| + | ||||||
| Where `<device_path>` is the path to the target device on which {op-system} is installed when the node boots from the customized ISO. | ||||||
|
|
||||||
| . Boot the new control plane node with the customized {op-system} live ISO. | ||||||
|
|
||||||
| . Approve the certificate signing requests (CSRs) to join the new node to the cluster. | ||||||
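The metadata cleanup in the first step of this module could be scripted rather than done by hand. The following is a minimal sketch and not part of the PR: `clean_bmh` is a hypothetical function name, and it assumes the file came from `oc get ... -o yaml`, so that the metadata keys are indented by exactly two spaces and `status` is the last top-level key.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): strip server-populated fields from
# a saved BareMetalHost manifest read on stdin and write the result to stdout.
# Assumes two-space indentation under `metadata` and `status` as the final
# top-level key, as in `oc get -o yaml` output.
#
# Possible usage (bmh_cleaned.yaml is an illustrative file name):
#   clean_bmh < bmh_affected.yaml > bmh_cleaned.yaml
clean_bmh() {
  sed -E \
    -e '/^  (creationTimestamp|generation|resourceVersion|uid):/d' \
    -e '/^status:/,$d'
}
```

Compare the result with the example `bmh_affected.yaml` file before running `oc create -f`, because the sketch does not validate the YAML.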
| Original file line number | Diff line number | Diff line change | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,114 @@ | ||||||||||||
| // Module included in the following assemblies: | ||||||||||||
| // | ||||||||||||
| // * nodes/nodes/nodes-nodes-replace-control-plane.adoc | ||||||||||||
|
|
||||||||||||
| :_mod-docs-content-type: PROCEDURE | ||||||||||||
| [id="deleting-machine_{context}"] | ||||||||||||
| = Deleting the machine of the unhealthy etcd member | ||||||||||||
|
|
||||||||||||
| Finish removing the failed control plane node by deleting the machine of the unhealthy etcd member. | ||||||||||||
|
|
||||||||||||
| . Ensure that the Bare Metal Operator is available by running the following command: | ||||||||||||
|
Contributor comment on lines +10 to +11 (suggested change):
|
||||||||||||
| + | ||||||||||||
| [source,terminal] | ||||||||||||
| ---- | ||||||||||||
| $ oc get clusteroperator baremetal | ||||||||||||
| ---- | ||||||||||||
| + | ||||||||||||
| .Example output | ||||||||||||
| [source,terminal] | ||||||||||||
| ---- | ||||||||||||
| NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE | ||||||||||||
| baremetal 4.20.0 True False False 3d15h | ||||||||||||
| ---- | ||||||||||||
|
|
||||||||||||
| . Save the `BareMetalHost` object of the affected node to a file for later use by running the following command: | ||||||||||||
| + | ||||||||||||
| [source,terminal] | ||||||||||||
| ---- | ||||||||||||
| $ oc get -n openshift-machine-api bmh <node_name> -o yaml > bmh_affected.yaml | ||||||||||||
| ---- | ||||||||||||
| + | ||||||||||||
| Where `<node_name>` is the name of the affected node, which usually matches the associated `BareMetalHost` name. | ||||||||||||
|
|
||||||||||||
| . View the YAML file of the saved `BareMetalHost` object by running the following command, and ensure the content is correct: | ||||||||||||
| + | ||||||||||||
| [source,terminal] | ||||||||||||
| ---- | ||||||||||||
| $ cat bmh_affected.yaml | ||||||||||||
| ---- | ||||||||||||
|
|
||||||||||||
| . Remove the affected `BareMetalHost` object by running the following command: | ||||||||||||
| + | ||||||||||||
| [source,terminal] | ||||||||||||
| ---- | ||||||||||||
| $ oc delete -n openshift-machine-api bmh <node_name> | ||||||||||||
| ---- | ||||||||||||
| + | ||||||||||||
| Where `<node_name>` is the name of the affected node. | ||||||||||||
|
|
||||||||||||
| . List all machines by running the following command and identify the machine associated with the affected node: | ||||||||||||
| + | ||||||||||||
| [source,terminal] | ||||||||||||
| ---- | ||||||||||||
| $ oc get machines -n openshift-machine-api -o wide | ||||||||||||
| ---- | ||||||||||||
| + | ||||||||||||
| .Example output | ||||||||||||
| [source,terminal] | ||||||||||||
| ---- | ||||||||||||
| NAME PHASE TYPE REGION ZONE AGE NODE PROVIDERID STATE | ||||||||||||
| examplecluster-control-plane-0 Running 3h11m openshift-control-plane-0 baremetalhost:///openshift-machine-api/openshift-control-plane-0/da1ebe11-3ff2-41c5-b099-0aa41222964e externally provisioned | ||||||||||||
| examplecluster-control-plane-1 Running 3h11m openshift-control-plane-1 baremetalhost:///openshift-machine-api/openshift-control-plane-1/d9f9acbc-329c-475e-8d81-03b20280a3e1 externally provisioned | ||||||||||||
| examplecluster-control-plane-2 Running 3h11m openshift-control-plane-2 baremetalhost:///openshift-machine-api/openshift-control-plane-2/3354bdac-61d8-410f-be5b-6a395b056135 externally provisioned | ||||||||||||
| examplecluster-compute-0 Running 165m openshift-compute-0 baremetalhost:///openshift-machine-api/openshift-compute-0/3d685b81-7410-4bb3-80ec-13a31858241f provisioned | ||||||||||||
| examplecluster-compute-1 Running 165m openshift-compute-1 baremetalhost:///openshift-machine-api/openshift-compute-1/0fdae6eb-2066-4241-91dc-e7ea72ab13b9 provisioned | ||||||||||||
| ---- | ||||||||||||
|
|
||||||||||||
| . Delete the machine of the unhealthy member by running the following command: | ||||||||||||
| + | ||||||||||||
| [source,terminal] | ||||||||||||
| ---- | ||||||||||||
| $ oc delete machine -n openshift-machine-api <machine_name> | ||||||||||||
| ---- | ||||||||||||
| + | ||||||||||||
| Where `<machine_name>` is the machine name associated with the affected node. | ||||||||||||
| + | ||||||||||||
| .Example command | ||||||||||||
| [source,terminal] | ||||||||||||
| ---- | ||||||||||||
| $ oc delete machine -n openshift-machine-api examplecluster-control-plane-2 | ||||||||||||
| ---- | ||||||||||||
| + | ||||||||||||
| [NOTE] | ||||||||||||
| ==== | ||||||||||||
| After you remove the `BareMetalHost` and `Machine` objects, the machine controller automatically deletes the `Node` object. | ||||||||||||
| ==== | ||||||||||||
|
|
||||||||||||
| . If deletion of the machine is delayed or the command is obstructed, force the deletion by removing the finalizer field from the machine object. | ||||||||||||
|
Contributor comment: Does this need a reformat?
|
||||||||||||
| + | ||||||||||||
| [WARNING] | ||||||||||||
| ==== | ||||||||||||
| Do not interrupt machine deletion by pressing `Ctrl+c`. You must allow the command to proceed to completion. Open a new terminal window to edit and delete the finalizer fields. | ||||||||||||
| ==== | ||||||||||||
|
|
||||||||||||
| .. In a new terminal window, edit the machine configuration by running the following command: | ||||||||||||
| + | ||||||||||||
| [source,terminal] | ||||||||||||
| ---- | ||||||||||||
| $ oc edit machine -n openshift-machine-api examplecluster-control-plane-2 | ||||||||||||
| ---- | ||||||||||||
|
|
||||||||||||
| .. Delete the following fields in the `Machine` custom resource, and then save the updated file: | ||||||||||||
| + | ||||||||||||
| [source,yaml] | ||||||||||||
| ---- | ||||||||||||
| finalizers: | ||||||||||||
| - machine.machine.openshift.io | ||||||||||||
| ---- | ||||||||||||
| + | ||||||||||||
| .Example output | ||||||||||||
| [source,terminal] | ||||||||||||
| ---- | ||||||||||||
| machine.machine.openshift.io/examplecluster-control-plane-2 edited | ||||||||||||
| ---- | ||||||||||||
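Identifying the machine that belongs to the affected node, as this module asks the reader to do from the `oc get machines` listing, can also be done with a filter. The following is a minimal sketch and not part of the PR: `machine_for_node` is a hypothetical function name, and the usage comment assumes that `Machine` objects expose the node name under `.status.nodeRef.name`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): print the Machine name for a node.
# Input: lines of "<machine_name><TAB><node_name>" on stdin. Argument: node name.
#
# Possible usage, assuming the node name is stored in .status.nodeRef.name:
#   oc get machines -n openshift-machine-api \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeRef.name}{"\n"}{end}' \
#     | machine_for_node openshift-control-plane-2
machine_for_node() {
  local node="$1"
  awk -F '\t' -v node="$node" '
    # Print the machine whose node column matches exactly.
    $2 == node { print $1; found = 1 }
    # Exit non-zero when no machine references the node.
    END { exit !found }
  '
}
```

A tab-separated listing is used here because the `TYPE`, `REGION`, and `ZONE` columns are empty in the example table output, which makes whitespace-based column parsing unreliable.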
Contributor comment: See SSG for user-replaceable value DL format, i.e. `where:` and so on.