[BUG] Remove v2 volume rebuild snapshot could cause volume stuck in detaching/faulted state #7573

Closed
yangchiu opened this issue Jan 8, 2024 · 7 comments
Labels
area/v2-data-engine v2 data engine (SPDK), kind/bug, priority/1 Highly recommended to fix in this release (managed by PO), reproduce/often 80 - 50% reproducible
Milestone
v1.6.0

@yangchiu
Member

yangchiu commented Jan 8, 2024

Describe the bug

After deleting replicas of an attached v2 volume, the volume needs to be detached to trigger replica rebuilding. Once the volume was detached and replica rebuilding was triggered, some rebuild-* snapshots were created:
[screenshot: rebuild]
After the replica rebuilding completed, the volume was re-attached and these rebuild-* snapshots were removed; the volume could then get stuck in the detaching/faulted state:
[screenshot: faulted-2]

To Reproduce

  1. Create a v2 volume environment
  2. Create v2 volume test-1 from the UI, and also create a PV/PVC for it from the UI
  3. Create a pod to use this volume:
cat << EOF > pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  containers:
    - name: sleep
      image: busybox
      imagePullPolicy: IfNotPresent
      args: ["/bin/sh", "-c", "while true;do date;sleep 5; done"]
      volumeMounts:
        - name: pod-data
          mountPath: /data
  volumes:
    - name: pod-data
      persistentVolumeClaim:
        claimName: test-1
EOF
kubectl apply -f pod.yaml
  4. Write some data to the volume:
dd if=/dev/urandom of=/data/test-1 bs=3M count=1024
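Since the manifest above mounts the volume at /data inside test-pod, the write is expected to be run inside the pod, e.g.:
kubectl exec -it test-pod -- dd if=/dev/urandom of=/data/test-1 bs=3M count=1024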
  5. Delete replicas of this volume from the UI
  6. Detach the volume to trigger replica rebuilding:
kubectl delete -f pod.yaml
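While the volume stays detached, the offline rebuilding and the resulting rebuild-* snapshots can also be observed through the Longhorn CRs; a minimal sketch (the longhornvolume label selector is an assumption):
# Watch the replica CRs of test-1 while rebuilding runs
kubectl -n longhorn-system get replicas.longhorn.io -l longhornvolume=test-1 -w
# List the snapshot CRs of test-1 and look for rebuild-* entries
kubectl -n longhorn-system get snapshots.longhorn.io -l longhornvolume=test-1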
  7. Replica rebuilding is triggered, and some rebuild-* snapshots are created
  8. Re-attach the volume by:
kubectl apply -f pod.yaml
  9. Try to remove the rebuild-* snapshots from the UI
  10. The volume could get stuck in the detaching/faulted state
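
To confirm the stuck state from step 10 without the UI, the volume CR status can be checked directly; a minimal sketch relying on the Volume CR's status.state and status.robustness fields:
# Expect something like "detaching"/"faulted" instead of "attached"/"healthy"
kubectl -n longhorn-system get volumes.longhorn.io test-1 -o jsonpath='{.status.state}{" "}{.status.robustness}{"\n"}'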

Expected behavior

Removing the rebuild-* snapshots should not cause the volume to get stuck in the detaching/faulted state.

Support bundle for troubleshooting

supportbundle_cbabf3df-d2bb-4981-b527-0ebf65b49c6a_2024-01-08T02-01-38Z.zip

Environment

  • Longhorn version: master-head (longhorn-manager a601b9b)
  • Impacted volume (PV):
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.27.1+k3s1
    • Number of control plane nodes in the cluster: 1
    • Number of worker nodes in the cluster: 3
  • Node config
    • OS type and version: ubuntu 22.04
    • Kernel version:
    • CPU per node:
    • Memory per node:
    • Disk type (e.g. SSD/NVMe/HDD):
    • Network bandwidth between the nodes (Gbps):
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:

Additional context


@yangchiu yangchiu added kind/bug reproduce/often 80 - 50% reproducible priority/1 Highly recommended to fix in this release (managed by PO) labels Jan 8, 2024
@yangchiu yangchiu added this to the v1.6.0 milestone Jan 8, 2024
@innobead innobead added the area/v2-data-engine v2 data engine (SPDK) label Jan 8, 2024
@derekbit
Member

derekbit commented Jan 8, 2024

@yangchiu
Does it happen in v1.5.3 as well?

@shuo-wu
Contributor

shuo-wu commented Jan 8, 2024

The offline rebuilding feature does not work if one snapshot has multiple children (a case that became possible after snapshot revert was introduced). Let me fix that first, then re-check this issue.

@longhorn-io-github-bot

longhorn-io-github-bot commented Jan 15, 2024

Pre Ready-For-Testing Checklist

@shuo-wu
Contributor

shuo-wu commented Jan 15, 2024

After the fix, reproducing step 9 should be rejected since Longhorn does not allow deleting the parent of the volume head.
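
For re-verification, the snapshot chain can be inspected to see which rebuild-* snapshot is the direct parent of the volume head; a sketch where the status.parent/status.children columns and the longhornvolume label selector are assumptions about the Snapshot CRs:
kubectl -n longhorn-system get snapshots.longhorn.io -l longhornvolume=test-1 -o custom-columns=NAME:.metadata.name,PARENT:.status.parent,CHILDREN:.status.children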

@yangchiu
Member Author

yangchiu commented Jan 17, 2024

Tested on v1.6.x-head (longhorn-instance-manager c14a405). I'm still able to remove the rebuild snapshots, and once all rebuild snapshots are removed, one replica gets deleted automatically, which causes the volume to go from healthy to degraded.

@shuo-wu Could you help to check this?

supportbundle_2743423b-9c3b-4420-b7ad-b7ecc587f749_2024-01-17T04-54-21Z.zip

@shuo-wu
Contributor

shuo-wu commented Jan 17, 2024

I just built the instance manager and longhorn manager images from the master-head branch and tried the test. Most of the time it works fine, except that I triggered this issue once during detaching:

[longhorn-instance-manager] time="2024-01-17T09:42:01Z" level=error msg="Failed to delete replica with cleanupRequired flag false" func="spdk.(*Replica).Delete.func1" file="replica.go:682" error="error sending message, id 7606, method nvmf_subsystem_remove_listener, params {nqn.2023-01.io.longhorn.spdk:vol-r-1c41559a {TCP IPv4 10.42.1.237 20001} }: {\"code\": -32603,\"message\": \"subsystem busy, retry later.\n\"}" lvsName=disk-2 lvsUUID=14821ea5-b73e-45de-9574-3d4628637986 replicaName=vol-r-1c41559a

@yangchiu
Member Author

Verified passed on master-head (longhorn-instance-manager d72e4da) and v1.6.x-head (longhorn-instance-manager bcadf21) following the test steps.
