
[BUG] Delete error backup could cause v2 volume stuck in detaching/faulted state #7575

Closed
yangchiu opened this issue Jan 8, 2024 · 5 comments
Labels: area/v2-data-engine (v2 data engine (SPDK)), area/volume-backup-restore (Volume backup restore), kind/bug, priority/1 (Highly recommended to fix in this release, managed by PO), reproduce/rare (< 50% reproducible)
Milestone: v1.6.0

yangchiu commented Jan 8, 2024

Describe the bug

Deleting replicas of a v2 volume while a backup is being created can cause the backup to end up in the Error state:

  error: 'proxyServer=10.42.2.36:8501 destination=10.42.1.31:20013: failed to get
    backup-ae5264e951594c20 backup status: rpc error: code = Internal desc = failed
    to get backup status: rpc error: code = NotFound desc = replica address 10.42.3.34:20007
    is not found in engine test-4-e-0 for getting backup backup-ae5264e951594c20 status'

And removing the errored backup can then leave the volume stuck in a detaching/faulted state:
[screenshot: volume stuck in detaching/faulted state]

To Reproduce

  1. Create a v2 volume environment
  2. Create a v2 volume test-1 with 3 replicas from UI and also create PV/PVC for it from UI
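If the UI is not at hand, a roughly equivalent setup can be sketched with a dedicated StorageClass plus a PVC. This is only a sketch with assumed parameter names (on recent releases the v2 selector is the dataEngine StorageClass parameter; older releases used backendStoreDriver), and with dynamic provisioning the Longhorn volume is named pvc-<uid> rather than test-1, so adjust later volume names accordingly:
cat << EOF > v2-storageclass-pvc.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-v2
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "3"
  staleReplicaTimeout: "2880"
  fsType: "ext4"
  dataEngine: "v2"   # assumption: may be backendStoreDriver on older releases
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-1
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn-v2
  resources:
    requests:
      storage: 20Gi
EOF
kubectl apply -f v2-storageclass-pvc.yaml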
  3. Create a pod for it:
cat << EOF > pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  containers:
    - name: sleep
      image: busybox
      imagePullPolicy: IfNotPresent
      args: ["/bin/sh", "-c", "while true;do date;sleep 5; done"]
      volumeMounts:
        - name: pod-data
          mountPath: /data
  volumes:
    - name: pod-data
      persistentVolumeClaim:
        claimName: test-1
EOF
kubectl apply -f pod.yaml
  4. Write some data to the volume (from inside the pod):
dd if=/dev/urandom of=/data/test-1 bs=3M count=1024
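If a shell is not already open inside the pod, the same write can be issued from outside; a small sketch assuming the pod name and mount path from the manifest above:
kubectl exec test-pod -- dd if=/dev/urandom of=/data/test-1 bs=3M count=1024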
  5. Create a backup from the UI, and while the backup is in progress, delete some replicas so that the backup enters the Error state (a kubectl sketch for this is shown after the backup YAML below):
$ kubectl get backups -n longhorn-system backup-ae5264e951594c20 -oyaml
apiVersion: longhorn.io/v1beta2
kind: Backup
metadata:
  creationTimestamp: "2024-01-08T03:32:35Z"
  finalizers:
  - longhorn.io
  generation: 1
  labels:
    backup-volume: test-4
  name: backup-ae5264e951594c20
  namespace: longhorn-system
  resourceVersion: "26161"
  uid: f81b9dd2-7d32-46ec-b982-a3905599ecb7
spec:
  labels:
    KubernetesStatus: '{"pvName":"test-4","pvStatus":"Bound","namespace":"default","pvcName":"test-4","lastPVCRefAt":"","workloadsStatus":[{"podName":"test-pod-4","podStatus":"Running","workloadName":"","workloadType":""}],"lastPodRefAt":""}'
    longhorn.io/volume-access-mode: rwo
  snapshotName: fdc06fc9-89b7-4bb5-a958-032f51d75a2c
  syncRequestedAt: null
status:
  backupCreatedAt: ""
  compressionMethod: ""
  error: 'proxyServer=10.42.2.36:8501 destination=10.42.1.31:20013: failed to get
    backup-ae5264e951594c20 backup status: rpc error: code = Internal desc = failed
    to get backup status: rpc error: code = NotFound desc = replica address 10.42.3.34:20007
    is not found in engine test-4-e-0 for getting backup backup-ae5264e951594c20 status'
  labels: null
  lastSyncedAt: "2024-01-08T03:32:54Z"
  messages: null
  ownerID: ip-10-0-1-238
  progress: 10
  replicaAddress: ""
  size: ""
  snapshotCreatedAt: "2024-01-08T03:32:35Z"
  snapshotName: fdc06fc9-89b7-4bb5-a958-032f51d75a2c
  state: Error
  url: ""
  volumeBackingImageName: ""
  volumeCreated: ""
  volumeName: ""
  volumeSize: "21474836480"
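For reference, the replica deletion in this step can also be done with kubectl instead of the UI; a sketch that assumes the replica CRs carry the usual longhornvolume label (the backup name comes from your own run's output):
# in one terminal: watch the backup CR until it reports state: Error
kubectl -n longhorn-system get backup <backup-name> -w

# in another terminal: list the volume's replica CRs and delete one or two while the backup is in progress
kubectl -n longhorn-system get replicas -l longhornvolume=test-1
kubectl -n longhorn-system delete replica <replica-name>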
  6. Detach the volume to trigger offline rebuilding:
kubectl delete -f pod.yaml
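One way to follow the offline rebuild after detaching, again assuming the longhornvolume label on the replica CRs:
kubectl -n longhorn-system get replicas -l longhornvolume=test-1 -w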
  7. After offline rebuilding completes, re-attach the volume:
kubectl apply -f pod.yaml
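Before deleting the backup, the volume can be confirmed attached and healthy from the Volume CR status, for example:
kubectl -n longhorn-system get volume test-1 -o jsonpath='{.status.state}/{.status.robustness}{"\n"}'
# expected output: attached/healthy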
  8. The volume is attached and healthy without any problem, but once the errored backup is deleted from the UI, the volume gets stuck in detaching and faulted:
    [screenshot: volume stuck in detaching/faulted state]
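The backup deletion can also be done with kubectl, and the stuck state observed on the Volume CR; a sketch with names from this run (adjust to yours):
kubectl -n longhorn-system delete backup backup-ae5264e951594c20
kubectl -n longhorn-system get volume test-1 -w
# per this report, the volume then cycles into detaching and is reported as faulted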

Please see the logs related to volume test-4 in the support bundle for more details.

Expected behavior

Support bundle for troubleshooting

supportbundle_43c90f10-cff2-486a-bf7b-dc91761bf1ea_2024-01-08T04-09-06Z.zip

Environment

  • Longhorn version: master-head (longhorn-manager a601b9b)
  • Impacted volume (PV):
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.27.1+k3s1
    • Number of control plane nodes in the cluster: 1
    • Number of worker nodes in the cluster: 3
  • Node config
    • OS type and version: ubuntu 22.04
    • Kernel version:
    • CPU per node:
    • Memory per node:
    • Disk type (e.g. SSD/NVMe/HDD):
    • Network bandwidth between the nodes (Gbps):
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:

Additional context


@yangchiu yangchiu added kind/bug reproduce/rare < 50% reproducible priority/1 Highly recommended to fix in this release (managed by PO) labels Jan 8, 2024
@yangchiu yangchiu added this to the v1.6.0 milestone Jan 8, 2024
@innobead innobead added area/volume-backup-restore Volume backup restore area/v2-data-engine v2 data engine (SPDK) labels Jan 8, 2024

derekbit commented Jan 8, 2024

The error is triggered by an actual-size mismatch of the replica lvols. Checking the replica verification logic.

[longhorn-manager-dqzsx] time="2024-01-08T07:27:01Z" level=warning msg="Instance test-1-r-3fe7d78f is state error, error message: found mismatching lvol actual size 2750414848 with recorded prev lvol actual size 2097152 when validating lvol test-1-r-3fe7d78f-snap-rebuild-6d8f7e1c" func="controller.(*InstanceHandler).syncStatusWithInstanceManager" file="instance_handler.go:206
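To check whether a cluster is hitting the same validation failure, the warning can be grepped from the manager logs; a sketch assuming the default app=longhorn-manager pod label:
kubectl -n longhorn-system logs -l app=longhorn-manager --tail=-1 | grep "found mismatching lvol actual size"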

derekbit commented Jan 8, 2024

This is a transient error in the code: when a snapshot lvol is deleted, the merge of lvols changes the actual size.

I'm wondering whether we really need the actual-size validation.
Can we remove the check and revisit it later if needed? WDYT? @shuo-wu @innobead

longhorn-io-github-bot commented Jan 8, 2024

Pre Ready-For-Testing Checklist

  • Where are the reproduce steps/test steps documented?
    The reproduce steps/test steps are at:

  • Does the PR include the explanation for the fix or the feature?

  • Has the backend code been merged (Manager, Engine, Instance Manager, BackupStore, etc.) (including backport-needed/*)?
    The PR is at longhorn-spdk-engine#90 (Prevent the false alarm caused by the check of actual size)

  • Which areas/issues this PR might have potential impacts on?
    Area: v2 volume, snapshot, backup
    Issues

innobead commented Jan 8, 2024

> This is a transient error in the code: when a snapshot lvol is deleted, the merge of lvols changes the actual size.
>
> I'm wondering whether we really need the actual-size validation. Can we remove the check and revisit it later if needed? WDYT? @shuo-wu @innobead

Sounds good to me.

@roger-ryao

Verified on master-head 20240109

The test steps:

  1. [BUG] Delete error backup could cause v2 volume stuck in detaching/faulted state #7575 (comment)

Result: passed
