2 changes: 2 additions & 0 deletions _topic_maps/_topic_map.yml
@@ -1529,6 +1529,8 @@ Topics:
File: persistent-storage-hostpath
- Name: Persistent storage using LVM Storage
File: persistent-storage-using-lvms
- Name: Troubleshooting local persistent storage using LVMS
File: troubleshooting-local-persistent-storage-using-lvms
- Name: Using Container Storage Interface (CSI)
Dir: container_storage_interface
Distros: openshift-enterprise,openshift-origin
49 changes: 49 additions & 0 deletions modules/lvms-troubleshooting-investigating-a-pvc-stuck-in-the-pending-state.adoc
@@ -0,0 +1,49 @@
// This module is included in the following assemblies:
//
// storage/persistent_storage/persistent_storage_local/troubleshooting-local-persistent-storage-using-lvms.adoc

:_content-type: PROCEDURE
[id="investigating-a-pvc-stuck-in-the-pending-state_{context}"]
= Investigating a PVC stuck in the Pending state

A persistent volume claim (PVC) can get stuck in a `Pending` state for a number of reasons. For example:

- Insufficient computing resources
- Network problems
- Mismatched storage class or node selector
- No available volumes
- The node with the persistent volume (PV) is in a `Not Ready` state

Identify the cause by using the `oc describe` command to review details about the stuck PVC.

.Procedure

. Retrieve the list of PVCs by running the following command:
+
[source,terminal]
----
$ oc get pvc
----
+
.Example output
[source,terminal]
----
NAME        STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
lvms-test   Pending                                      lvms-vg1       11s
----
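+
If the cluster has many PVCs, you can narrow the output to the claims that are stuck. This is an optional shell filter, not part of the documented procedure:
+
[source,terminal]
----
$ oc get pvc --all-namespaces | grep Pending
----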

. Inspect the events associated with a PVC stuck in the `Pending` state by running the following command:
+
[source,terminal]
----
$ oc describe pvc <pvc_name> <1>
----
<1> Replace `<pvc_name>` with the name of the PVC. For example, `lvms-test`.
+
.Example output
[source,terminal]
----
Type     Reason              Age               From                         Message
----     ------              ----              ----                         -------
Warning  ProvisioningFailed  4s (x2 over 17s)  persistentvolume-controller  storageclass.storage.k8s.io "lvms-vg1" not found
----
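
If the events show that the requested storage class is missing, as in the preceding example, you can list the storage classes that exist in the cluster. This is an optional quick check, not part of the documented procedure:

[source,terminal]
----
$ oc get storageclass
----

If the LVMS storage class is not listed, see "Recovering from missing LVMS or Operator components".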
97 changes: 97 additions & 0 deletions modules/lvms-troubleshooting-performing-a-forced-cleanup.adoc
@@ -0,0 +1,97 @@
// This module is included in the following assemblies:
//
// storage/persistent_storage/persistent_storage_local/troubleshooting-local-persistent-storage-using-lvms.adoc

:_content-type: PROCEDURE
[id="performing-a-forced-cleanup_{context}"]
= Performing a forced cleanup

If disk- or node-related problems persist after you complete the troubleshooting procedures, it might be necessary to perform a forced cleanup. A forced cleanup comprehensively addresses persistent issues and ensures that LVMS functions properly.

.Prerequisites

* All of the persistent volume claims (PVCs) created using the logical volume manager storage (LVMS) driver have been removed.

* The pods using those PVCs have been stopped.

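As an optional quick check before you proceed (not part of the documented prerequisites), you can list the PVCs that remain in the cluster and confirm that none of them use an LVMS storage class:

[source,terminal]
----
$ oc get pvc --all-namespaces
----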

.Procedure

. Switch to the `openshift-storage` namespace by running the following command:
+
[source,terminal]
----
$ oc project openshift-storage
----
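+
If you prefer not to change your current project, you can instead pass the `-n openshift-storage` flag to the commands in this procedure that operate on namespaced resources. For example, an optional alternative check of the `LVMVolumeGroup` CRs without switching projects:
+
[source,terminal]
----
$ oc get lvmvolumegroup -n openshift-storage
----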

. Ensure that there are no `LogicalVolume` custom resources (CRs) remaining by running the following command:
+
[source,terminal]
----
$ oc get logicalvolume
----
+
.Example output
[source,terminal]
----
No resources found
----

.. If there are any `LogicalVolume` CRs remaining, remove their finalizers by running the following command:
+
[source,terminal]
----
$ oc patch logicalvolume <name> -p '{"metadata":{"finalizers":[]}}' --type=merge <1>
----
<1> Replace `<name>` with the name of the CR.

.. After removing their finalizers, delete the CRs by running the following command:
+
[source,terminal]
----
$ oc delete logicalvolume <name> <1>
----
<1> Replace `<name>` with the name of the CR.

. Ensure that there are no `LVMVolumeGroup` CRs remaining by running the following command:
+
[source,terminal]
----
$ oc get lvmvolumegroup
----
+
.Example output
[source,terminal]
----
No resources found
----

.. If there are any `LVMVolumeGroup` CRs left, remove their finalizers by running the following command:
+
[source,terminal]
----
$ oc patch lvmvolumegroup <name> -p '{"metadata":{"finalizers":[]}}' --type=merge <1>
----
<1> Replace `<name>` with the name of the CR.

.. After removing their finalizers, delete the CRs by running the following command:
+
[source,terminal]
----
$ oc delete lvmvolumegroup <name> <1>
----
<1> Replace `<name>` with the name of the CR.

. Remove any `LVMVolumeGroupNodeStatus` CRs by running the following command:
+
[source,terminal]
----
$ oc delete lvmvolumegroupnodestatus --all
----

. Remove the `LVMCluster` CR by running the following command:
+
[source,terminal]
----
$ oc delete lvmcluster --all
----
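
After the `LVMCluster` CR is deleted, you can optionally confirm that no LVMS custom resources remain before you reinstall or reconfigure LVMS. This is a minimal verification sketch, assuming you are still in the `openshift-storage` project:

[source,terminal]
----
$ oc get logicalvolume,lvmvolumegroup,lvmvolumegroupnodestatus,lvmcluster
----

The expected result is that no resources are found.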
33 changes: 33 additions & 0 deletions modules/lvms-troubleshooting-recovering-from-disk-failure.adoc
@@ -0,0 +1,33 @@
// This module is included in the following assemblies:
//
// storage/persistent_storage/persistent_storage_local/troubleshooting-local-persistent-storage-using-lvms.adoc

:_content-type: PROCEDURE
[id="recovering-from-disk-failure_{context}"]
= Recovering from disk failure

If you see a failure message while inspecting the events associated with the persistent volume claim (PVC), there might be a problem with the underlying volume or disk. Disk and volume provisioning issues often result in a generic error first, such as `Failed to provision volume with StorageClass <storage_class_name>`. A second, more specific error message usually follows.

.Procedure

. Inspect the events associated with a PVC by running the following command:
+
[source,terminal]
----
$ oc describe pvc <pvc_name> <1>
----
<1> Replace `<pvc_name>` with the name of the PVC. Here are some examples of disk or volume failure error messages and their causes:
+
- *Failed to check volume existence:* Indicates a problem in verifying whether the volume already exists. Volume verification failure can be caused by network connectivity problems or other failures.
+
- *Failed to bind volume:* Failure to bind a volume can happen if the persistent volume (PV) that is available does not match the requirements of the PVC.
+
- *FailedMount or FailedUnMount:* This error indicates problems when trying to mount the volume to a node or unmount a volume from a node. If the disk has failed, this error might appear when a pod tries to use the PVC.
+
- *Volume is already exclusively attached to one node and can't be attached to another:* This error can appear with storage solutions that do not support `ReadWriteMany` access modes.

. Establish a direct connection to the host where the problem is occurring.

. Resolve the disk issue.
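
One way to establish the direct connection to the host and inspect its disks is to use a debug pod. This is a minimal sketch, not part of the documented procedure; `<node_name>` is a placeholder for the node that hosts the failed disk:

[source,terminal]
----
$ oc debug node/<node_name>
sh-4.4# chroot /host
sh-4.4# lsblk
----

The `lsblk` output lists the block devices on the node so that you can confirm whether the failed disk is visible.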

After you have resolved the issue with the disk, you might need to perform the forced cleanup procedure if failure messages persist or recur.
77 changes: 77 additions & 0 deletions modules/lvms-troubleshooting-recovering-from-missing-lvms-or-operator-components.adoc
@@ -0,0 +1,77 @@
// This module is included in the following assemblies:
//
// storage/persistent_storage/persistent_storage_local/troubleshooting-local-persistent-storage-using-lvms.adoc

:_content-type: PROCEDURE
[id="recovering-from-missing-lvms-or-operator-components_{context}"]
= Recovering from missing LVMS or Operator components

If you encounter a storage class "not found" error, check the `LVMCluster` resource and ensure that all the logical volume manager storage (LVMS) pods are running. You can create an `LVMCluster` resource if it does not exist.

.Procedure

. Verify the presence of the `LVMCluster` resource by running the following command:
+
[source,terminal]
----
$ oc get lvmcluster -n openshift-storage
----
+
.Example output
[source,terminal]
----
NAME AGE
my-lvmcluster 65m
----

. If the cluster does not have an `LVMCluster` resource, create one by running the following command:
+
[source,terminal]
----
$ oc create -n openshift-storage -f <custom_resource> <1>
----
<1> Replace `<custom_resource>` with a custom resource URL or file tailored to your requirements.
+
.Example custom resource
[source,yaml,options="nowrap",role="white-space-pre"]
----
apiVersion: lvm.topolvm.io/v1alpha1
kind: LVMCluster
metadata:
  name: my-lvmcluster
spec:
  storage:
    deviceClasses:
      - name: vg1
        default: true
        thinPoolConfig:
          name: thin-pool-1
          sizePercent: 90
          overprovisionRatio: 10
----
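+
After you create the resource, you can optionally review its status to confirm that LVMS accepted the configuration. The exact status fields depend on your LVMS version, so this is a general check rather than a documented step:
+
[source,terminal]
----
$ oc get lvmcluster -n openshift-storage -o yaml
----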

. Check that all the pods from LVMS are in the `Running` state in the `openshift-storage` namespace by running the following command:
+
[source,terminal]
----
$ oc get pods -n openshift-storage
----
+
.Example output
[source,terminal]
----
NAME                                  READY   STATUS    RESTARTS   AGE
lvms-operator-7b9fb858cb-6nsml        3/3     Running   0          70m
topolvm-controller-5dd9cf78b5-7wwr2   5/5     Running   0          66m
topolvm-node-dr26h                    4/4     Running   0          66m
vg-manager-r6zdv                      1/1     Running   0          66m
----
+
The expected output is one running instance of `lvms-operator` and one running instance of `topolvm-controller`. One instance of `topolvm-node` and one instance of `vg-manager` are expected for each node.
+
If `topolvm-node` is stuck in the `Init` state, LVMS has failed to locate an available disk to use. To retrieve the information necessary to troubleshoot this issue, review the logs of the `vg-manager` pod by running the following command:
+
[source,terminal]
----
$ oc logs -l app.kubernetes.io/component=vg-manager -n openshift-storage
----
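
After all the LVMS pods are running, you can optionally confirm that the LVMS storage class exists before you recreate the PVC. This is a quick check, not part of the documented procedure; the storage class name depends on your device class configuration:

[source,terminal]
----
$ oc get storageclass
----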
34 changes: 34 additions & 0 deletions modules/lvms-troubleshooting-recovering-from-node-failure.adoc
@@ -0,0 +1,34 @@
// This module is included in the following assemblies:
//
// storage/persistent_storage/persistent_storage_local/troubleshooting-local-persistent-storage-using-lvms.adoc

:_content-type: PROCEDURE
[id="recovering-from-node-failure_{context}"]
= Recovering from node failure

Sometimes a persistent volume claim (PVC) is stuck in a `Pending` state because a particular node in the cluster has failed. To identify the failed node, you can examine the restart count of the `topolvm-node` pod. An increased restart count indicates potential problems with the underlying node, which might require further investigation and troubleshooting.

.Procedure

* Examine the restart count of the `topolvm-node` pod instances by running the following command:
+
[source,terminal]
----
$ oc get pods -n openshift-storage
----
+
.Example output
[source,terminal]
----
NAME                                  READY   STATUS    RESTARTS      AGE
lvms-operator-7b9fb858cb-6nsml        3/3     Running   0             70m
topolvm-controller-5dd9cf78b5-7wwr2   5/5     Running   0             66m
topolvm-node-dr26h                    4/4     Running   0             66m
topolvm-node-54as8                    4/4     Running   0             66m
topolvm-node-78fft                    4/4     Running   17 (8s ago)   66m
vg-manager-r6zdv                      1/1     Running   0             66m
vg-manager-990ut                      1/1     Running   0             66m
vg-manager-an118                      1/1     Running   0             66m
----
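+
To identify which node runs a restarting pod, you can add the `-o wide` flag, which includes the node name in the output. This is an optional variation of the preceding command:
+
[source,terminal]
----
$ oc get pods -n openshift-storage -o wide
----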
+
After you resolve any issues with the node, you might need to perform the forced cleanup procedure if the PVC is still stuck in a `Pending` state.
31 changes: 31 additions & 0 deletions storage/persistent_storage/persistent_storage_local/troubleshooting-local-persistent-storage-using-lvms.adoc
@@ -0,0 +1,31 @@
:_content-type: ASSEMBLY
[id="troubleshooting-local-persistent-storage"]
= Troubleshooting local persistent storage using LVMS
include::_attributes/common-attributes.adoc[]
:context: troubleshooting-local-persistent-storage-using-lvms

toc::[]

Because {product-title} does not scope a persistent volume (PV) to a single project, it can be shared across the cluster and claimed by any project using a persistent volume claim (PVC). This can lead to a number of issues that require troubleshooting.

include::modules/lvms-troubleshooting-investigating-a-pvc-stuck-in-the-pending-state.adoc[leveloffset=+1]

include::modules/lvms-troubleshooting-recovering-from-missing-lvms-or-operator-components.adoc[leveloffset=+1]

include::modules/lvms-troubleshooting-recovering-from-node-failure.adoc[leveloffset=+1]

[role="_additional-resources"]
[id="additional-resources-forced-cleanup-1"]
.Additional resources

* xref:troubleshooting-local-persistent-storage-using-lvms.adoc#performing-a-forced-cleanup_troubleshooting-local-persistent-storage-using-lvms[Performing a forced cleanup]

include::modules/lvms-troubleshooting-recovering-from-disk-failure.adoc[leveloffset=+1]

[role="_additional-resources"]
[id="additional-resources-forced-cleanup-2"]
.Additional resources

* xref:troubleshooting-local-persistent-storage-using-lvms.adoc#performing-a-forced-cleanup_troubleshooting-local-persistent-storage-using-lvms[Performing a forced cleanup]

include::modules/lvms-troubleshooting-performing-a-forced-cleanup.adoc[leveloffset=+1]