From d2c26b689621b31f7f05cebb82ebd4dfbc2a1af2 Mon Sep 17 00:00:00 2001
From: John Wilkins
Date: Tue, 15 Aug 2023 15:01:40 -0700
Subject: [PATCH] TELCODOCS-1111: Added troubleshooting for local persistent volumes using LVMS.

---
 _topic_maps/_topic_map.yml                    |  2 +
 ...ting-a-pvc-stuck-in-the-pending-state.adoc | 49 ++++++++++
 ...eshooting-performing-a-forced-cleanup.adoc | 97 +++++++++++++++++++
 ...shooting-recovering-from-disk-failure.adoc | 33 +++++++
 ...m-missing-lvms-or-operator-components.adoc | 77 +++++++++++++++
 ...shooting-recovering-from-node-failure.adoc | 34 +++++++
 ...g-local-persistent-storage-using-lvms.adoc | 31 ++++++
 7 files changed, 323 insertions(+)
 create mode 100644 modules/lvms-troubleshooting-investigating-a-pvc-stuck-in-the-pending-state.adoc
 create mode 100644 modules/lvms-troubleshooting-performing-a-forced-cleanup.adoc
 create mode 100644 modules/lvms-troubleshooting-recovering-from-disk-failure.adoc
 create mode 100644 modules/lvms-troubleshooting-recovering-from-missing-lvms-or-operator-components.adoc
 create mode 100644 modules/lvms-troubleshooting-recovering-from-node-failure.adoc
 create mode 100644 storage/persistent_storage/persistent_storage_local/troubleshooting-local-persistent-storage-using-lvms.adoc

diff --git a/_topic_maps/_topic_map.yml b/_topic_maps/_topic_map.yml
index b35cb12b37b3..bb3de2804bac 100644
--- a/_topic_maps/_topic_map.yml
+++ b/_topic_maps/_topic_map.yml
@@ -1529,6 +1529,8 @@ Topics:
     File: persistent-storage-hostpath
   - Name: Persistent storage using LVM Storage
     File: persistent-storage-using-lvms
+  - Name: Troubleshooting local persistent storage using LVMS
+    File: troubleshooting-local-persistent-storage-using-lvms
   - Name: Using Container Storage Interface (CSI)
     Dir: container_storage_interface
     Distros: openshift-enterprise,openshift-origin
diff --git a/modules/lvms-troubleshooting-investigating-a-pvc-stuck-in-the-pending-state.adoc b/modules/lvms-troubleshooting-investigating-a-pvc-stuck-in-the-pending-state.adoc
new file mode 100644
index 000000000000..21a40866cca3
--- /dev/null
+++ b/modules/lvms-troubleshooting-investigating-a-pvc-stuck-in-the-pending-state.adoc
@@ -0,0 +1,49 @@
+// This module is included in the following assemblies:
+//
+// storage/persistent_storage/persistent_storage_local/troubleshooting-local-persistent-storage-using-lvms.adoc
+
+:_content-type: PROCEDURE
+[id="investigating-a-pvc-stuck-in-the-pending-state_{context}"]
+= Investigating a PVC stuck in the Pending state
+
+A persistent volume claim (PVC) can get stuck in a `Pending` state for a number of reasons. For example:
+
+- Insufficient computing resources
+- Network problems
+- Mismatched storage class or node selector
+- No available volumes
+- The node with the persistent volume (PV) is in a `Not Ready` state
+
+Identify the cause by using the `oc describe` command to review details about the stuck PVC.
+
+.Procedure
+
+. Retrieve the list of PVCs by running the following command:
++
+[source,terminal]
+----
+$ oc get pvc
+----
++
+.Example output
+[source,terminal]
+----
+NAME        STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
+lvms-test   Pending                                      lvms-vg1       11s
+----
+
+. Inspect the events associated with a PVC stuck in the `Pending` state by running the following command:
++
+[source,terminal]
+----
+$ oc describe pvc <pvc_name> <1>
+----
+<1> Replace `<pvc_name>` with the name of the PVC. For example, `lvms-test`.
++
+.Example output
+[source,terminal]
+----
+Type     Reason              Age               From                         Message
+----     ------              ----              ----                         -------
+Warning  ProvisioningFailed  4s (x2 over 17s)  persistentvolume-controller  storageclass.storage.k8s.io "lvms-vg1" not found
+----
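+
+. If the events show that the requested storage class was not found, check whether the storage class exists by running the following command. The storage class name in this example, `lvms-vg1`, is taken from the preceding example output; replace it with the name shown in your error message:
++
+[source,terminal]
+----
+$ oc get storageclass lvms-vg1
+----
++
+If the storage class does not exist, see "Recovering from missing LVMS or Operator components".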
\ No newline at end of file
diff --git a/modules/lvms-troubleshooting-performing-a-forced-cleanup.adoc b/modules/lvms-troubleshooting-performing-a-forced-cleanup.adoc
new file mode 100644
index 000000000000..c66ecf170356
--- /dev/null
+++ b/modules/lvms-troubleshooting-performing-a-forced-cleanup.adoc
@@ -0,0 +1,97 @@
+// This module is included in the following assemblies:
+//
+// storage/persistent_storage/persistent_storage_local/troubleshooting-local-persistent-storage-using-lvms.adoc
+
+:_content-type: PROCEDURE
+[id="performing-a-forced-cleanup_{context}"]
+= Performing a forced cleanup
+
+If disk- or node-related problems persist after you complete the troubleshooting procedures, it might be necessary to perform a forced cleanup. A forced cleanup comprehensively addresses persistent issues so that logical volume manager storage (LVMS) can function properly.
+
+.Prerequisites
+
+* All of the persistent volume claims (PVCs) that were created by using the LVMS driver have been removed.
+* The pods that were using those PVCs have been stopped.
+
+.Procedure
+
+. Switch to the `openshift-storage` namespace by running the following command:
++
+[source,terminal]
+----
+$ oc project openshift-storage
+----
+
+. Ensure that there are no `LogicalVolume` custom resources (CRs) remaining by running the following command:
++
+[source,terminal]
+----
+$ oc get logicalvolume
+----
++
+.Example output
+[source,terminal]
+----
+No resources found
+----
+
+.. If there are any `LogicalVolume` CRs remaining, remove their finalizers by running the following command:
++
+[source,terminal]
+----
+$ oc patch logicalvolume <logical_volume_name> -p '{"metadata":{"finalizers":[]}}' --type=merge <1>
+----
+<1> Replace `<logical_volume_name>` with the name of the CR.
+
+.. After removing their finalizers, delete the CRs by running the following command:
++
+[source,terminal]
+----
+$ oc delete logicalvolume <logical_volume_name> <1>
+----
+<1> Replace `<logical_volume_name>` with the name of the CR.
+
+. Make sure there are no `LVMVolumeGroup` CRs left by running the following command:
++
+[source,terminal]
+----
+$ oc get lvmvolumegroup
+----
++
+.Example output
+[source,terminal]
+----
+No resources found
+----
+
+.. If there are any `LVMVolumeGroup` CRs left, remove their finalizers by running the following command:
++
+[source,terminal]
+----
+$ oc patch lvmvolumegroup <lvm_volume_group_name> -p '{"metadata":{"finalizers":[]}}' --type=merge <1>
+----
+<1> Replace `<lvm_volume_group_name>` with the name of the CR.
+
+.. After removing their finalizers, delete the CRs by running the following command:
++
+[source,terminal]
+----
+$ oc delete lvmvolumegroup <lvm_volume_group_name> <1>
+----
+<1> Replace `<lvm_volume_group_name>` with the name of the CR.
+
+. Remove any `LVMVolumeGroupNodeStatus` CRs by running the following command:
++
+[source,terminal]
+----
+$ oc delete lvmvolumegroupnodestatus --all
+----
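+
+. Optionally, confirm that no LVMS-owned CRs remain before you remove the `LVMCluster` CR. This check is a convenience step only; it queries the same CR types that are covered in the preceding steps:
++
+[source,terminal]
+----
+$ oc get logicalvolume,lvmvolumegroup,lvmvolumegroupnodestatus
+----
++
+.Example output
+[source,terminal]
+----
+No resources found
+----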
+
+. Remove the `LVMCluster` CR by running the following command:
++
+[source,terminal]
+----
+$ oc delete lvmcluster --all
+----
diff --git a/modules/lvms-troubleshooting-recovering-from-disk-failure.adoc b/modules/lvms-troubleshooting-recovering-from-disk-failure.adoc
new file mode 100644
index 000000000000..8c643db1aa69
--- /dev/null
+++ b/modules/lvms-troubleshooting-recovering-from-disk-failure.adoc
@@ -0,0 +1,33 @@
+// This module is included in the following assemblies:
+//
+// storage/persistent_storage/persistent_storage_local/troubleshooting-local-persistent-storage-using-lvms.adoc
+
+:_content-type: PROCEDURE
+[id="recovering-from-disk-failure_{context}"]
+= Recovering from disk failure
+
+If you see a failure message while inspecting the events associated with the persistent volume claim (PVC), there might be a problem with the underlying volume or disk. Disk and volume provisioning issues often result in a generic error first, such as `Failed to provision volume with StorageClass <storage_class_name>`. A second, more specific error message usually follows.
+
+.Procedure
+
+. Inspect the events associated with a PVC by running the following command:
++
+[source,terminal]
+----
+$ oc describe pvc <pvc_name> <1>
+----
+<1> Replace `<pvc_name>` with the name of the PVC. The following are some examples of disk or volume failure error messages and their causes:
++
+- *Failed to check volume existence:* Indicates a problem in verifying whether the volume already exists. Volume verification failure can be caused by network connectivity problems or other failures.
++
+- *Failed to bind volume:* Failure to bind a volume can happen if the persistent volume (PV) that is available does not match the requirements of the PVC.
++
+- *FailedMount or FailedUnMount:* This error indicates problems when trying to mount the volume to a node or unmount a volume from a node. If the disk has failed, this error might appear when a pod tries to use the PVC.
++
+- *Volume is already exclusively attached to one node and can't be attached to another:* This error can appear with storage solutions that do not support `ReadWriteMany` access modes.
+
+. Establish a direct connection to the host where the problem is occurring.
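++
+For example, you can start a debug session on the node and inspect its block devices and kernel messages. The node name in this example is a placeholder; replace it with the name of the node that hosts the failed disk:
++
+[source,terminal]
+----
+$ oc debug node/<node_name>
+sh-4.4# chroot /host
+sh-4.4# lsblk
+sh-4.4# dmesg | grep -i error
+----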
+
+. Resolve the disk issue.
+
+After you have resolved the issue with the disk, you might need to perform the forced cleanup procedure if failure messages persist or reoccur.
\ No newline at end of file
diff --git a/modules/lvms-troubleshooting-recovering-from-missing-lvms-or-operator-components.adoc b/modules/lvms-troubleshooting-recovering-from-missing-lvms-or-operator-components.adoc
new file mode 100644
index 000000000000..3e04f55c7ea9
--- /dev/null
+++ b/modules/lvms-troubleshooting-recovering-from-missing-lvms-or-operator-components.adoc
@@ -0,0 +1,77 @@
+// This module is included in the following assemblies:
+//
+// storage/persistent_storage/persistent_storage_local/troubleshooting-local-persistent-storage-using-lvms.adoc
+
+:_content-type: PROCEDURE
+[id="recovering-from-missing-lvms-or-operator-components_{context}"]
+= Recovering from missing LVMS or Operator components
+
+If you encounter a storage class "not found" error, check the `LVMCluster` resource and ensure that all the logical volume manager storage (LVMS) pods are running. You can create an `LVMCluster` resource if it does not exist.
+
+.Procedure
+
+. Verify the presence of the `LVMCluster` resource by running the following command:
++
+[source,terminal]
+----
+$ oc get lvmcluster -n openshift-storage
+----
++
+.Example output
+[source,terminal]
+----
+NAME            AGE
+my-lvmcluster   65m
+----
+
+. If the cluster does not have an `LVMCluster` resource, create one by running the following command:
++
+[source,terminal]
+----
+$ oc create -n openshift-storage -f <custom_resource> <1>
+----
+<1> Replace `<custom_resource>` with a custom resource URL or file tailored to your requirements.
++
+.Example custom resource
+[source,yaml,options="nowrap",role="white-space-pre"]
+----
+apiVersion: lvm.topolvm.io/v1alpha1
+kind: LVMCluster
+metadata:
+  name: my-lvmcluster
+spec:
+  storage:
+    deviceClasses:
+    - name: vg1
+      default: true
+      thinPoolConfig:
+        name: thin-pool-1
+        sizePercent: 90
+        overprovisionRatio: 10
+----
+
+. Check that all the pods from LVMS are in the `Running` state in the `openshift-storage` namespace by running the following command:
++
+[source,terminal]
+----
+$ oc get pods -n openshift-storage
+----
++
+.Example output
+[source,terminal]
+----
+NAME                                  READY   STATUS    RESTARTS   AGE
+lvms-operator-7b9fb858cb-6nsml        3/3     Running   0          70m
+topolvm-controller-5dd9cf78b5-7wwr2   5/5     Running   0          66m
+topolvm-node-dr26h                    4/4     Running   0          66m
+vg-manager-r6zdv                      1/1     Running   0          66m
+----
++
+The expected output is one running instance of `lvms-operator` and `vg-manager`. One instance of `topolvm-controller` and `topolvm-node` is expected for each node.
++
+If `topolvm-node` is stuck in the `Init` state, there is a failure to locate an available disk for LVMS to use. To retrieve the information necessary to troubleshoot, review the logs of the `vg-manager` pod by running the following command:
++
+[source,terminal]
+----
+$ oc logs -l app.kubernetes.io/component=vg-manager -n openshift-storage
+----
diff --git a/modules/lvms-troubleshooting-recovering-from-node-failure.adoc b/modules/lvms-troubleshooting-recovering-from-node-failure.adoc
new file mode 100644
index 000000000000..d26fbba58e68
--- /dev/null
+++ b/modules/lvms-troubleshooting-recovering-from-node-failure.adoc
@@ -0,0 +1,34 @@
+// This module is included in the following assemblies:
+//
+// storage/persistent_storage/persistent_storage_local/troubleshooting-local-persistent-storage-using-lvms.adoc
+
+:_content-type: PROCEDURE
+[id="recovering-from-node-failure_{context}"]
+= Recovering from node failure
+
+Sometimes a persistent volume claim (PVC) is stuck in a `Pending` state because a particular node in the cluster has failed. To identify the failed node, you can examine the restart count of the `topolvm-node` pod. An increased restart count indicates potential problems with the underlying node, which might require further investigation and troubleshooting.
+
+.Procedure
+
+* Examine the restart count of the `topolvm-node` pod instances by running the following command:
++
+[source,terminal]
+----
+$ oc get pods -n openshift-storage
+----
++
+.Example output
+[source,terminal]
+----
+NAME                                  READY   STATUS    RESTARTS      AGE
+lvms-operator-7b9fb858cb-6nsml        3/3     Running   0             70m
+topolvm-controller-5dd9cf78b5-7wwr2   5/5     Running   0             66m
+topolvm-node-dr26h                    4/4     Running   0             66m
+topolvm-node-54as8                    4/4     Running   0             66m
+topolvm-node-78fft                    4/4     Running   17 (8s ago)   66m
+vg-manager-r6zdv                      1/1     Running   0             66m
+vg-manager-990ut                      1/1     Running   0             66m
+vg-manager-an118                      1/1     Running   0             66m
+----
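++
+To identify the node that hosts a restarting `topolvm-node` pod, you can check which node the pod is scheduled on. The pod name in this example is taken from the preceding example output; replace it with the pod that shows an increased restart count:
++
+[source,terminal]
+----
+$ oc get pod topolvm-node-78fft -n openshift-storage -o wide
+----
++
+You can then review the status and recent events of that node, for example by running the `oc describe node <node_name>` command, to investigate the underlying failure.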
++
+After you resolve any issues with the node, you might need to perform the forced cleanup procedure if the PVC is still stuck in a `Pending` state.
\ No newline at end of file
diff --git a/storage/persistent_storage/persistent_storage_local/troubleshooting-local-persistent-storage-using-lvms.adoc b/storage/persistent_storage/persistent_storage_local/troubleshooting-local-persistent-storage-using-lvms.adoc
new file mode 100644
index 000000000000..2407d3c5a440
--- /dev/null
+++ b/storage/persistent_storage/persistent_storage_local/troubleshooting-local-persistent-storage-using-lvms.adoc
@@ -0,0 +1,31 @@
+:_content-type: ASSEMBLY
+[id="troubleshooting-local-persistent-storage"]
+= Troubleshooting local persistent storage using LVMS
+include::_attributes/common-attributes.adoc[]
+:context: troubleshooting-local-persistent-storage-using-lvms
+
+toc::[]
+
+Because {product-title} does not scope a persistent volume (PV) to a single project, it can be shared across the cluster and claimed by any project using a persistent volume claim (PVC). This can lead to a number of issues that require troubleshooting.
+
+include::modules/lvms-troubleshooting-investigating-a-pvc-stuck-in-the-pending-state.adoc[leveloffset=+1]
+
+include::modules/lvms-troubleshooting-recovering-from-missing-lvms-or-operator-components.adoc[leveloffset=+1]
+
+include::modules/lvms-troubleshooting-recovering-from-node-failure.adoc[leveloffset=+1]
+
+[role="_additional-resources"]
+[id="additional-resources-forced-cleanup-1"]
+.Additional resources
+
+* xref:troubleshooting-local-persistent-storage-using-lvms.adoc#performing-a-forced-cleanup_troubleshooting-local-persistent-storage-using-lvms[Performing a forced cleanup]
+
+include::modules/lvms-troubleshooting-recovering-from-disk-failure.adoc[leveloffset=+1]
+
+[role="_additional-resources"]
+[id="additional-resources-forced-cleanup-2"]
+.Additional resources
+
+* xref:troubleshooting-local-persistent-storage-using-lvms.adoc#performing-a-forced-cleanup_troubleshooting-local-persistent-storage-using-lvms[Performing a forced cleanup]
+
+include::modules/lvms-troubleshooting-performing-a-forced-cleanup.adoc[leveloffset=+1]