
[BUG] Delete error backup could cause v2 volume stuck in detaching/faulted state #7575

Closed
yangchiu opened this issue Jan 8, 2024 · 5 comments
Labels: area/v2-data-engine (v2 data engine (SPDK)), area/volume-backup-restore (Volume backup restore), kind/bug, priority/1 (Highly recommended to fix in this release, managed by PO), reproduce/rare (< 50% reproducible)
Milestone: v1.6.0

yangchiu commented Jan 8, 2024

Describe the bug

Deleting replicas of a v2 volume while a backup is being created can cause the backup to end up in the Error state:

  error: 'proxyServer=10.42.2.36:8501 destination=10.42.1.31:20013: failed to get
    backup-ae5264e951594c20 backup status: rpc error: code = Internal desc = failed
    to get backup status: rpc error: code = NotFound desc = replica address 10.42.3.34:20007
    is not found in engine test-4-e-0 for getting backup backup-ae5264e951594c20 status'

And removing the errored backup can then leave the volume stuck in a detaching/faulted state:
[screenshot: volume stuck in detaching/faulted state]

To Reproduce

  1. Create a v2 volume environment
  2. Create a v2 volume test-1 with 3 replicas from UI and also create PV/PVC for it from UI
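If the UI is not at hand, a roughly equivalent setup can be sketched with a dedicated StorageClass plus a PVC. This is only a sketch with assumed parameter names (on recent releases the v2 selector is the dataEngine StorageClass parameter; older releases used backendStoreDriver), and with dynamic provisioning the Longhorn volume is named pvc-<uid> rather than test-1, so adjust later volume names accordingly:
cat << EOF > v2-storageclass-pvc.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-v2
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "3"
  staleReplicaTimeout: "2880"
  fsType: "ext4"
  dataEngine: "v2"   # assumption: may be backendStoreDriver on older releases
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-1
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn-v2
  resources:
    requests:
      storage: 20Gi
EOF
kubectl apply -f v2-storageclass-pvc.yaml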
  3. Create a pod for it:
cat << EOF > pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  containers:
    - name: sleep
      image: busybox
      imagePullPolicy: IfNotPresent
      args: ["/bin/sh", "-c", "while true;do date;sleep 5; done"]
      volumeMounts:
        - name: pod-data
          mountPath: /data
  volumes:
    - name: pod-data
      persistentVolumeClaim:
        claimName: test-1
EOF
kubectl apply -f pod.yaml
  4. Write some data to the volume (from inside the pod):
dd if=/dev/urandom of=/data/test-1 bs=3M count=1024
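If a shell is not already open inside the pod, the same write can be issued from outside; a small sketch assuming the pod name and mount path from the manifest above:
kubectl exec test-pod -- dd if=/dev/urandom of=/data/test-1 bs=3M count=1024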
  5. Create a backup from the UI, and while the backup is in progress, delete some replicas so that the backup enters the Error state (a kubectl sketch for this is shown after the backup YAML below):
$ kubectl get backups -n longhorn-system backup-ae5264e951594c20 -oyaml
apiVersion: longhorn.io/v1beta2
kind: Backup
metadata:
  creationTimestamp: "2024-01-08T03:32:35Z"
  finalizers:
  - longhorn.io
  generation: 1
  labels:
    backup-volume: test-4
  name: backup-ae5264e951594c20
  namespace: longhorn-system
  resourceVersion: "26161"
  uid: f81b9dd2-7d32-46ec-b982-a3905599ecb7
spec:
  labels:
    KubernetesStatus: '{"pvName":"test-4","pvStatus":"Bound","namespace":"default","pvcName":"test-4","lastPVCRefAt":"","workloadsStatus":[{"podName":"test-pod-4","podStatus":"Running","workloadName":"","workloadType":""}],"lastPodRefAt":""}'
    longhorn.io/volume-access-mode: rwo
  snapshotName: fdc06fc9-89b7-4bb5-a958-032f51d75a2c
  syncRequestedAt: null
status:
  backupCreatedAt: ""
  compressionMethod: ""
  error: 'proxyServer=10.42.2.36:8501 destination=10.42.1.31:20013: failed to get
    backup-ae5264e951594c20 backup status: rpc error: code = Internal desc = failed
    to get backup status: rpc error: code = NotFound desc = replica address 10.42.3.34:20007
    is not found in engine test-4-e-0 for getting backup backup-ae5264e951594c20 status'
  labels: null
  lastSyncedAt: "2024-01-08T03:32:54Z"
  messages: null
  ownerID: ip-10-0-1-238
  progress: 10
  replicaAddress: ""
  size: ""
  snapshotCreatedAt: "2024-01-08T03:32:35Z"
  snapshotName: fdc06fc9-89b7-4bb5-a958-032f51d75a2c
  state: Error
  url: ""
  volumeBackingImageName: ""
  volumeCreated: ""
  volumeName: ""
  volumeSize: "21474836480"
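For reference, the replica deletion in this step can also be done with kubectl instead of the UI; a sketch that assumes the replica CRs carry the usual longhornvolume label (the backup name comes from your own run's output):
# in one terminal: watch the backup CR until it reports state: Error
kubectl -n longhorn-system get backup <backup-name> -w

# in another terminal: list the volume's replica CRs and delete one or two while the backup is in progress
kubectl -n longhorn-system get replicas -l longhornvolume=test-1
kubectl -n longhorn-system delete replica <replica-name>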
  6. Detach the volume to trigger offline rebuilding:
kubectl delete -f pod.yaml
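One way to follow the offline rebuild after detaching, again assuming the longhornvolume label on the replica CRs:
kubectl -n longhorn-system get replicas -l longhornvolume=test-1 -w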
  7. After offline rebuilding completes, re-attach the volume:
kubectl apply -f pod.yaml
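Before deleting the backup, the volume can be confirmed attached and healthy from the Volume CR status, for example:
kubectl -n longhorn-system get volume test-1 -o jsonpath='{.status.state}/{.status.robustness}{"\n"}'
# expected output: attached/healthy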
  8. The volume is attached and healthy without any problem, but once the errored backup is deleted from the UI, the volume gets stuck in detaching and faulted:
    [screenshot: volume stuck in detaching/faulted state]
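The backup deletion can also be done with kubectl, and the stuck state observed on the Volume CR; a sketch with names from this run (adjust to yours):
kubectl -n longhorn-system delete backup backup-ae5264e951594c20
kubectl -n longhorn-system get volume test-1 -w
# per this report, the volume then cycles into detaching and is reported as faulted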

Please see the logs related to volume test-4 in the support bundle for more details.

Expected behavior

Support bundle for troubleshooting

supportbundle_43c90f10-cff2-486a-bf7b-dc91761bf1ea_2024-01-08T04-09-06Z.zip

Environment

  • Longhorn version: master-head (longhorn-manager a601b9b)
  • Impacted volume (PV):
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.27.1+k3s1
    • Number of control plane nodes in the cluster: 1
    • Number of worker nodes in the cluster: 3
  • Node config
    • OS type and version: ubuntu 22.04
    • Kernel version:
    • CPU per node:
    • Memory per node:
    • Disk type (e.g. SSD/NVMe/HDD):
    • Network bandwidth between the nodes (Gbps):
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:

Additional context


@yangchiu yangchiu added kind/bug reproduce/rare < 50% reproducible priority/1 Highly recommended to fix in this release (managed by PO) labels Jan 8, 2024
@yangchiu yangchiu added this to the v1.6.0 milestone Jan 8, 2024
@innobead innobead added area/volume-backup-restore Volume backup restore area/v2-data-engine v2 data engine (SPDK) labels Jan 8, 2024

derekbit commented Jan 8, 2024

The error is triggered by an actual-size mismatch of the replica lvols. Checking the replica verification logic.

[longhorn-manager-dqzsx] time="2024-01-08T07:27:01Z" level=warning msg="Instance test-1-r-3fe7d78f is state error, error message: found mismatching lvol actual size 2750414848 with recorded prev lvol actual size 2097152 when validating lvol test-1-r-3fe7d78f-snap-rebuild-6d8f7e1c" func="controller.(*InstanceHandler).syncStatusWithInstanceManager" file="instance_handler.go:206
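To check whether a cluster is hitting the same validation failure, the warning can be grepped from the manager logs; a sketch assuming the default app=longhorn-manager pod label:
kubectl -n longhorn-system logs -l app=longhorn-manager --tail=-1 | grep "found mismatching lvol actual size"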

derekbit commented Jan 8, 2024

This is a transient error in the code: when a snapshot lvol is deleted, the merge of lvols changes the actual size.

I'm wondering whether we really need the actual-size validation.
Can we remove the check and revisit it later if needed? WDYT? @shuo-wu @innobead

longhorn-io-github-bot commented Jan 8, 2024

Pre Ready-For-Testing Checklist

  • Where are the reproduce steps/test steps documented?
    The reproduce steps/test steps are at:

  • Does the PR include the explanation for the fix or the feature?

  • Has the backend code been merged (Manager, Engine, Instance Manager, BackupStore, etc.) (including backport-needed/*)?
    The PR is at longhorn-spdk-engine#90 (Prevent the false alarm caused by the check of actual size)

  • Which areas/issues this PR might have potential impacts on?
    Area: v2 volume, snapshot, backup
    Issues

innobead commented Jan 8, 2024

> This is a transient error in the code: when a snapshot lvol is deleted, the merge of lvols changes the actual size.
>
> I'm wondering whether we really need the actual-size validation. Can we remove the check and revisit it later if needed? WDYT? @shuo-wu @innobead

Sounds good to me.

@roger-ryao

Verified on master-head 20240109

The test steps:

  1. [BUG] Delete error backup could cause v2 volume stuck in detaching/faulted state #7575 (comment)

Result: passed
