[BUG] Remove v2 volume rebuild snapshot could cause volume stuck in detaching/faulted state #7573

Closed
yangchiu opened this issue Jan 8, 2024 · 7 comments
Labels
area/v2-data-engine v2 data engine (SPDK), kind/bug, priority/1 Highly recommended to fix in this release (managed by PO), reproduce/often 80 - 50% reproducible
Milestone
v1.6.0

@yangchiu
Member

yangchiu commented Jan 8, 2024

Describe the bug

After deleting replicas of an attached v2 volume, the volume needs to be detached to trigger replica rebuilding. Once the volume was detached and replica rebuilding was triggered, some rebuild-* snapshots were created:
[screenshot: rebuild]
After the replica rebuilding completed, the volume was re-attached and these rebuild-* snapshots were removed; the volume could then get stuck in the detaching/faulted state:
[screenshot: faulted-2]

To Reproduce

  1. Create a v2 volume environment
  2. Create v2 volume test-1 from the UI, and also create a PV/PVC for it from the UI
  3. Create a pod to use this volume:
cat << EOF > pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  containers:
    - name: sleep
      image: busybox
      imagePullPolicy: IfNotPresent
      args: ["/bin/sh", "-c", "while true;do date;sleep 5; done"]
      volumeMounts:
        - name: pod-data
          mountPath: /data
  volumes:
    - name: pod-data
      persistentVolumeClaim:
        claimName: test-1
EOF
kubectl apply -f pod.yaml
  4. Write some data to the volume:
dd if=/dev/urandom of=/data/test-1 bs=3M count=1024
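Since the manifest above mounts the volume at /data inside test-pod, the write is expected to be run inside the pod, e.g.:
kubectl exec -it test-pod -- dd if=/dev/urandom of=/data/test-1 bs=3M count=1024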
  5. Delete replicas of this volume from the UI
  6. Detach the volume to trigger replica rebuilding:
kubectl delete -f pod.yaml
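While the volume stays detached, the offline rebuilding and the resulting rebuild-* snapshots can also be observed through the Longhorn CRs; a minimal sketch (the longhornvolume label selector is an assumption):
# Watch the replica CRs of test-1 while rebuilding runs
kubectl -n longhorn-system get replicas.longhorn.io -l longhornvolume=test-1 -w
# List the snapshot CRs of test-1 and look for rebuild-* entries
kubectl -n longhorn-system get snapshots.longhorn.io -l longhornvolume=test-1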
  7. Replica rebuilding is triggered, and some rebuild-* snapshots are created
  8. Re-attach the volume by:
kubectl apply -f pod.yaml
  9. Try to remove the rebuild-* snapshots from the UI
  10. The volume could get stuck in the detaching/faulted state
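
To confirm the stuck state from step 10 without the UI, the volume CR status can be checked directly; a minimal sketch relying on the Volume CR's status.state and status.robustness fields:
# Expect something like "detaching"/"faulted" instead of "attached"/"healthy"
kubectl -n longhorn-system get volumes.longhorn.io test-1 -o jsonpath='{.status.state}{" "}{.status.robustness}{"\n"}'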

Expected behavior

Removing the rebuild-* snapshots should not cause the volume to get stuck in the detaching/faulted state.

Support bundle for troubleshooting

supportbundle_cbabf3df-d2bb-4981-b527-0ebf65b49c6a_2024-01-08T02-01-38Z.zip

Environment

  • Longhorn version: master-head (longhorn-manager a601b9b)
  • Impacted volume (PV):
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.27.1+k3s1
    • Number of control plane nodes in the cluster: 1
    • Number of worker nodes in the cluster: 3
  • Node config
    • OS type and version: ubuntu 22.04
    • Kernel version:
    • CPU per node:
    • Memory per node:
    • Disk type (e.g. SSD/NVMe/HDD):
    • Network bandwidth between the nodes (Gbps):
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:

Additional context


@yangchiu yangchiu added kind/bug reproduce/often 80 - 50% reproducible priority/1 Highly recommended to fix in this release (managed by PO) labels Jan 8, 2024
@yangchiu yangchiu added this to the v1.6.0 milestone Jan 8, 2024
@innobead innobead added the area/v2-data-engine v2 data engine (SPDK) label Jan 8, 2024
@derekbit
Member

derekbit commented Jan 8, 2024

@yangchiu
Does it happen in v1.5.3 as well?

@shuo-wu
Contributor

shuo-wu commented Jan 8, 2024

The offline rebuilding feature does not work if one snapshot has multiple children (a case that became possible after snapshot revert was introduced). Let me fix that first, then re-check this issue.

@longhorn-io-github-bot

longhorn-io-github-bot commented Jan 15, 2024

Pre Ready-For-Testing Checklist

@shuo-wu
Contributor

shuo-wu commented Jan 15, 2024

After the fix, reproducing step 9 should be rejected since Longhorn does not allow deleting the parent of the volume head.
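
For re-verification, the snapshot chain can be inspected to see which rebuild-* snapshot is the direct parent of the volume head; a sketch where the status.parent/status.children columns and the longhornvolume label selector are assumptions about the Snapshot CRs:
kubectl -n longhorn-system get snapshots.longhorn.io -l longhornvolume=test-1 -o custom-columns=NAME:.metadata.name,PARENT:.status.parent,CHILDREN:.status.children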

@yangchiu
Member Author

yangchiu commented Jan 17, 2024

Tested on v1.6.x-head (longhorn-instance-manager c14a405). I'm still able to remove the rebuild snapshots, and once all rebuild snapshots are removed, one replica gets deleted automatically, which causes the volume to go from healthy to degraded.

@shuo-wu Could you help to check this?

supportbundle_2743423b-9c3b-4420-b7ad-b7ecc587f749_2024-01-17T04-54-21Z.zip

@shuo-wu
Contributor

shuo-wu commented Jan 17, 2024

I just built the instance manager and longhorn manager images from the master-head branch and tried the test. Most of the time it works fine, except that I triggered this issue once during detaching:

[longhorn-instance-manager] time="2024-01-17T09:42:01Z" level=error msg="Failed to delete replica with cleanupRequired flag false" func="spdk.(*Replica).Delete.func1" file="replica.go:682" error="error sending message, id 7606, method nvmf_subsystem_remove_listener, params {nqn.2023-01.io.longhorn.spdk:vol-r-1c41559a {TCP IPv4 10.42.1.237 20001} }: {\"code\": -32603,\"message\": \"subsystem busy, retry later.\n\"}" lvsName=disk-2 lvsUUID=14821ea5-b73e-45de-9574-3d4628637986 replicaName=vol-r-1c41559a

@yangchiu
Member Author

Verified passed on master-head (longhorn-instance-manager d72e4da) and v1.6.x-head (longhorn-instance-manager bcadf21) following the test steps.
