
[BUG] RWX volume remains attached after workload deleted if it's upgraded from v1.4.2 #6139

Closed
yangchiu opened this issue Jun 16, 2023 · 24 comments
Assignees
Labels
area/volume-attach-detach Volume attach & detach related kind/bug priority/0 Must be fixed in this release (managed by PO) reproduce/always 100% reproducible
Milestone

Comments

@yangchiu
Member

Describe the bug

The RWX volume remains attached and healthy after the workload is deleted if the volume was created in v1.4.2 and Longhorn was then upgraded to master-head or v1.5.x-head.

Directly creating/deleting a workload using an RWX volume in master-head or v1.5.x-head doesn't have this issue.

To Reproduce

Steps to reproduce the behavior:

  1. Install Longhorn v1.4.2
kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/v1.4.2/deploy/longhorn.yaml
  2. Create a statefulset using an RWX volume
# rwx_statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: test-statefulset-rwx
  namespace: default
spec:
  selector:
    matchLabels:
      app: test-statefulset-rwx
  serviceName: test-statefulset-rwx
  replicas: 1
  template:
    metadata:
      labels:
        app: test-statefulset-rwx
    spec:
      terminationGracePeriodSeconds: 10
      containers:
        - image: busybox
          imagePullPolicy: IfNotPresent
          name: sleep
          args: ['/bin/sh', '-c', 'while true;do date;sleep 5; done']
          volumeMounts:
            - name: pod-data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: pod-data
      spec:
        accessModes: ['ReadWriteMany']
        storageClassName: 'longhorn'
        resources:
          requests:
            storage: 1Gi

# kubectl apply -f rwx_statefulset.yaml
  3. Upgrade Longhorn to master-head or v1.5.x-head, and upgrade the engine image of the RWX volume
kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/master/deploy/longhorn.yaml
  4. Delete the statefulset
kubectl delete -f rwx_statefulset.yaml
  5. The volume remains attached and healthy after the workload is deleted (the commands below show one way to verify)
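
One way to verify this on the test cluster (a sketch; these are the default resource names for a kubectl install, and the actual PVC/volume names will differ):

# The Longhorn volume backing the PVC stays attached even though the workload is gone
kubectl -n longhorn-system get volumes.longhorn.io
# The Kubernetes-side VolumeAttachment for the PV is removed once the pod is gone
kubectl get volumeattachments.storage.k8s.io
# The RWX share manager pod also keeps running while the volume stays attached
kubectl -n longhorn-system get pods | grep share-manager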

Expected behavior

The RWX volume should be detached after the workload is deleted, just like a volume created directly in master-head or v1.5.x-head.

Log or Support bundle

supportbundle_7c27605c-22df-493e-9e2a-b9135c68e20b_2023-06-16T02-29-21Z.zip

Environment

  • Longhorn version:
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl):
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version:
    • Number of management node in the cluster:
    • Number of worker node in the cluster:
  • Node config
    • OS type and version:
    • CPU per node:
    • Memory per node:
    • Disk type(e.g. SSD/NVMe):
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:


@yangchiu yangchiu added kind/bug reproduce/always 100% reproducible labels Jun 16, 2023
@yangchiu yangchiu modified the milestones: v1.5.1, v1.5.0 Jun 16, 2023
@derekbit
Member

derekbit commented Jun 16, 2023

The detachment may need to wait for a while.

cc @PhanLe1010

@innobead innobead added the area/volume-attach-detach Volume attach & detach related label Jun 16, 2023
@innobead innobead added the priority/0 Must be fixed in this release (managed by PO) label Jun 16, 2023
@derekbit
Member

derekbit commented Jun 16, 2023

The statefulset was deleted at Jun 16 03:08:39.
attacherDetacher started detaching the volume at Jun 16 03:10:52.

Then, the upgrade ticket of the RWX volume was deleted, which triggered the volume detachment.

Jun 16 03:08:00 ubuntu2004 k3s[3432]: I0616 03:08:00.076821    3432 handler.go:232] Adding GroupVersion metrics.k8s.io v1beta1 to ResourceManager
Jun 16 03:08:39 ubuntu2004 k3s[3432]: I0616 03:08:39.618789    3432 stateful_set.go:458] "StatefulSet has been deleted" key="default/test-statefulset-rwx"
Jun 16 03:08:48 ubuntu2004 systemd[1]: Started Session 9 of user root.
Jun 16 03:09:00 ubuntu2004 k3s[3432]: I0616 03:09:00.076014    3432 handler.go:232] Adding GroupVersion metrics.k8s.io v1beta1 to ResourceManager
Jun 16 03:10:00 ubuntu2004 k3s[3432]: I0616 03:10:00.077053    3432 handler.go:232] Adding GroupVersion metrics.k8s.io v1beta1 to ResourceManager
Jun 16 03:10:01 ubuntu2004 CRON[9564]: (root) CMD (test -e /run/systemd/system || SERVICE_MODE=1 /sbin/e2scrub_all -A -r)
Jun 16 03:10:02 ubuntu2004 systemd[1]: Starting Cleanup of Temporary Directories...
Jun 16 03:10:02 ubuntu2004 systemd[1]: systemd-tmpfiles-clean.service: Succeeded.
Jun 16 03:10:02 ubuntu2004 systemd[1]: Finished Cleanup of Temporary Directories.
Jun 16 03:10:52 ubuntu2004 k3s[3432]: I0616 03:10:52.017032    3432 reconciler.go:273] "attacherDetacher.DetachVolume started" volume={AttachedVolume:{VolumeName:kubernetes.io/csi/driver.longhorn.io^pvc-36481246-9464-41f9-b877-0b75353a34f2 VolumeSpec:0xc009b6e630 NodeName:rancher60-worker2 PluginIsAttachable:true DevicePath: DeviceMountPath: PluginName: SELinuxMountContext:} MountedByNode:false DetachRequestedTime:2023-06-16 03:08:49.788385216 +0000 UTC m=+771.166259892}
Jun 16 03:10:52 ubuntu2004 k3s[3432]: I0616 03:10:52.018700    3432 operation_generator.go:1626] Verified volume is safe to detach for volume "pvc-36481246-9464-41f9-b877-0b75353a34f2" (UniqueName: "kubernetes.io/csi/driver.longhorn.io^pvc-36481246-9464-41f9-b877-0b75353a34f2") on node "rancher60-worker2"
Jun 16 03:10:54 ubuntu2004 k3s[3432]: I0616 03:10:54.301710    3432 operation_generator.go:516] DetachVolume.Detach succeeded for volume "pvc-36481246-9464-41f9-b877-0b75353a34f2" (UniqueName: "kubernetes.io/csi/driver.longhorn.io^pvc-36481246-9464-41f9-b877-0b75353a34f2") on node "rancher60-worker2"
Jun 16 03:10:58 ubuntu2004 k3s[3432]: time="2023-06-16T03:10:58Z" level=info msg="COMPACT compactRev=1587 targetCompactRev=2587 currentRev=3635"
Jun 16 03:10:58 ubuntu2004 k3s[3432]: time="2023-06-16T03:10:58Z" level=info msg="COMPACT deleted 860 rows from 1000 revisions in 40.747639ms - compacted to 2587/3635"

@innobead innobead assigned derekbit and PhanLe1010 and unassigned PhanLe1010 and derekbit Jun 16, 2023
@yangchiu
Member Author

Just for information: it still remains attached and healthy after more than 3 hours of waiting.

@PhanLe1010
Contributor

PhanLe1010 commented Jun 16, 2023

I see. This case can happen when there is no workload pod on the same node as the share manager pod. When the user upgrades to 1.5.x, we create an upgrade AD ticket for the share manager's node to keep the volume attached there. When the user scales down the workload, we don't clean up that upgrade AD ticket because it is on a different node than the workload pod's node. As a result, no one cleans up the upgrade AD ticket and the volume is stuck there forever.

Still figuring out how to fix this issue.
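
For reference, one way to see the leftover ticket on a live cluster (a sketch; it assumes the Longhorn VolumeAttachment CRD in the longhorn-system namespace, uses the volume name from the logs above, and the exact field names may vary by version):

kubectl -n longhorn-system get volumeattachments.longhorn.io pvc-36481246-9464-41f9-b877-0b75353a34f2 -o yaml
# Look at the attachment tickets (spec.attachmentTickets): the leftover one has the upgrade ticket type
# and points at the share manager pod's node instead of the (already deleted) workload pod's node.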

@PhanLe1010
Contributor

Just for information: it still remains attached and healthy after more than 3 hours of waiting.

@yangchiu Do you have the support bundle after 3 hours?

@derekbit
Member

I see. This case can happen when there is no workload pod on the same node as the share manager pod. When the user upgrades to 1.5.x, we create an upgrade AD ticket for the share manager's node to keep the volume attached there. When the user scales down the workload, we don't clean up that upgrade AD ticket because it is on a different node than the workload pod's node. As a result, no one cleans up the upgrade AD ticket and the volume is stuck there forever.

It sounds like we create a ticket for the node where the share manager pod is running.

Can we create multiple tickets for nodes where workloads are running instead?

@PhanLe1010
Contributor

Can we create multiple tickets for nodes where workloads are running instead?

This is a potential approach. We need to be careful not to fall into race conditions, though. For example, at the time we decide to create a ticket for a running workload, it is running, but by the time we finish creating the ticket the workload has already stopped. Now there will be a leftover ticket that nobody is going to clean up.

@PhanLe1010
Contributor

I am thinking about whether we should use the native Kubernetes VolumeAttachment object to create tickets instead. The logic is:

  • When the Longhorn manager starts the upgrade, it fetches all Kubernetes VolumeAttachment objects (see the command after this list).
  • For each VA that doesn't have a deletion timestamp, create a corresponding Longhorn VA ticket with the type csi-attacher.
  • If the Kubernetes VA gets a deletion timestamp immediately after we create the Longhorn VA, it is OK. All Longhorn managers are down at this moment, so Kubernetes will have to wait for Longhorn to come up first and then issue the ControllerUnpublish call. The Longhorn CSI plugin can then clean up the Longhorn VA created in the upgrade path.
  • There is no Longhorn VA leftover this way.
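
For context, a quick way to list the Kubernetes VolumeAttachment objects this upgrade path would read (sketch only; the custom columns are just for readability):

kubectl get volumeattachments.storage.k8s.io -o custom-columns=NAME:.metadata.name,PV:.spec.source.persistentVolumeName,NODE:.spec.nodeName,ATTACHED:.status.attached,DELETING:.metadata.deletionTimestamp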

@PhanLe1010
Contributor

Hmm. But we still have a problem with both of the above approaches:

  • If we don't create a ticket for the share-manager's node, the Longhorn VA controller will (possibly) detach the volume from that node before the share-manager controller kicks in and creates the ticket.
  • On the flip side, if we create the ticket for the share-manager's node, no one is responsible for cleaning it up (a.k.a. the issue in this ticket).

@PhanLe1010
Contributor

PhanLe1010 commented Jun 17, 2023

How about this design?

In the upgrade path:

  • When the Longhorn manager starts the upgrade, it fetches all Kubernetes VolumeAttachment objects.
  • For each VA that doesn't have a deletion timestamp, create a corresponding Longhorn VA ticket with the type csi-attacher.
  • If the Kubernetes VA gets a deletion timestamp immediately after we create the Longhorn VA, it is OK. All Longhorn managers are down at this moment, so Kubernetes will have to wait for Longhorn to come up first and then issue the ControllerUnpublish call. The Longhorn CSI plugin can then clean up the Longhorn VA created in the upgrade path.
  • If the volume is RWX and has a csi-attacher ticket -> also create a share-manager-controller ticket. This way the volume is not detached and the share manager pod is not deleted during the upgrade (a sketch of how to check the resulting tickets follows this list).
  • If for some reason we create an unnecessary share-manager-controller ticket, it is OK; the share-manager controller will clean it up later on.
  • There is no Longhorn VA leftover this way, and RWX is safe during the upgrade.
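
If this design lands, a rough way to confirm the expected tickets on an RWX volume right after the upgrade (a sketch; <volume-name> is a placeholder and the ticket field names may differ by version):

kubectl -n longhorn-system get volumeattachments.longhorn.io <volume-name> -o yaml
# Expect one ticket of type csi-attacher (created by the upgrade path) and one of type
# share-manager-controller, and no longhorn-upgrade ticket.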

Limitation:

  1. Volumes that don't have a corresponding Kubernetes workload (i.e., manually attached volumes) will be detached after upgrading. I think this limitation is acceptable.

Another benefit of this design:

  1. We get rid of the longhorn-upgrade ticket type. The logic of the AD controller is cleaner.

WDYT @derekbit @shuo-wu @innobead ?

@PhanLe1010
Contributor

PhanLe1010 commented Jun 17, 2023

Btw, this issue is a side effect of this fix https://github.com/longhorn/longhorn-manager/pull/1993/files. We fixed the detaching issue but introduced the stuck-attachment issue as a side effect.

@derekbit
Member

The design looks good to me.

We get rid of the longhorn-upgrade ticket type. The logic of the AD controller is cleaner.

Agree with the removal. longhorn-upgrade is a bit confusing.

@derekbit
Member

derekbit commented Jun 17, 2023

Btw, this issue is a side effect of this fix https://github.com/longhorn/longhorn-manager/pull/1993/files. We fixed the detaching issue but introduced the stuck-attachment issue as a side effect.

To clarify a bit: it is not a side effect of the detaching fix. The hasActiveWorkload logic had already been removed from the AD controller implementation before. I mistakenly introduced it into longhorn-manager earlier; it is not related to the detaching issue, so I removed it in the end.

Hence, both the RWX volume attaching and detaching issues exist in the AD controller design and implementation.

@PhanLe1010
Contributor

PhanLe1010 commented Jun 17, 2023

You are right @derekbit, it already existed. The behavior is just a little bit different, but it is still the same issue. Thank you for the clarification.

@innobead
Member

innobead commented Jun 17, 2023

I see. This case can happen when there is no workload pod on the same node as the share manager pod. When the user upgrades to 1.5.x, we create an upgrade AD ticket for the share manager's node to keep the volume attached there. When the user scales down the workload, we don't clean up that upgrade AD ticket because it is on a different node than the workload pod's node. As a result, no one cleans up the upgrade AD ticket and the volume is stuck there forever.

Still figuring out how to fix this issue.

@PhanLe1010 Can't we just delete the (upgrader) ticket if the share manager is not required anymore, given that no workloads are using it (and of course it is not in maintenance mode), no matter which node the ticket belongs to? Or is there anything I missed? Any obstacles?

@innobead
Member

BTW, I agree with the proposal as well, especially having the correct attacher type.

For static volume attachments (not triggered by a Kubernetes VolumeAttachment), why not just use AttacherTypeLonghornAPI? That way they will not be detached after the upgrade, and users can still detach them via the UI or an unknown source, e.g. by emptying the nodeID of the volume resource.

@shuo-wu
Contributor

shuo-wu commented Jun 19, 2023

  1. I agree with using the correct/valid AD ticket type as well and removing the type longhorn-upgrade.
  2. To be honest, I prefer Derek's suggestion: just directly creating (multiple) csi-attacher tickets for volumes used by workload pods. As for the race condition you mentioned, I think the main blocker is the existing CSI plugin that may handle the detachment calls incorrectly (in the old way). If we can remove the old CSI plugin before jumping into the upgrade path, will everything work fine? The new Longhorn CSI driver deployer will be deployed when the new longhorn-manager is up, hence we don't need to worry about it.
  3. For the attached volumes that have no workload pod but contain a non-empty spec node ID, I think we can add AttacherTypeLonghornAPI (one way to find such volumes is sketched after this list). But for the auto-attached volumes, it's too complicated to generate correct AD tickets, hence we can ignore them.
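
A hypothetical spot-check for such volumes (it assumes the Longhorn Volume CRD exposes spec.nodeID and the default longhorn-system namespace):

kubectl -n longhorn-system get volumes.longhorn.io -o custom-columns=NAME:.metadata.name,SPEC_NODE:.spec.nodeID,STATE:.status.state
# Volumes with a non-empty SPEC_NODE but no workload pod are the manually attached ones discussed in point 3.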

@PhanLe1010
Contributor

PhanLe1010 commented Jun 19, 2023

Thanks @innobead and @shuo-wu for the great feedback!

I agree with points 1 and 2 that @shuo-wu mentioned above (which I understand are also the points that @innobead proposed; please correct me if I understand it wrong, @innobead).

For point 2 that @shuo-wu mentioned:

To be honest, I prefer Derek's suggestion: just directly creating (multiple) csi-attacher tickets for volumes used by workload pods.

Let me evaluate more to see which is the better option: using workload pod state vs. using Kubernetes VolumeAttachment state.

@innobead
Member

@PhanLe1010 This seems to be a blocker for 1.5.0. Let's make it the highest priority to tackle first. Thanks.

@shuo-wu
Contributor

shuo-wu commented Jun 19, 2023

Point 3 I mentioned was similar to what David suggested but with extra concerns about the auto-attached volumes.

@PhanLe1010
Contributor

PhanLe1010 commented Jun 19, 2023

Let me evaluate more to see which is the better option: using workload pod state vs. using Kubernetes VolumeAttachment state.

I think using the Kubernetes VolumeAttachment state is better because:

  1. A Kubernetes VA and a Longhorn CSI VA ticket map one-to-one, so it is cleaner.
  2. If we use the workload pod state to make the decision, we will have this problem:
    When we get the workload pod in the upgrade path, assume that the pod has deletionTimestamp set (the pod is being terminated). In this case, if we add the Longhorn VA, it might be wrong because no one will clean up that ticket later. If we don't add the Longhorn VA, the Longhorn manager will detach the volume immediately when it comes up. This might be bad because we might pull the volume out unexpectedly while NodeUnpublishVolume and NodeUnstageVolume have not been executed yet, which is bad for the filesystem. (A spot-check of the two signals involved is sketched after this list.)
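
For illustration, the two signals being compared can be spot-checked like this (a sketch; <workload-pod> and <va-name> are placeholders):

kubectl get pod <workload-pod> -o jsonpath='{.metadata.deletionTimestamp}'
kubectl get volumeattachments.storage.k8s.io <va-name> -o jsonpath='{.metadata.deletionTimestamp}'
# A terminating pod already has a deletionTimestamp while it still exists, which is exactly the
# window where a ticket created from pod state could be left behind.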

@longhorn-io-github-bot

longhorn-io-github-bot commented Jun 19, 2023

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at: [BUG] RWX volume remains attached after workload deleted if it's upgraded from v1.4.2 #6139 (comment)

  • Is there a workaround for the issue? If so, where is it documented?
    The workaround is at:

  • Does the PR include the explanation for the fix or the feature?

  • Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?
    The PR for the YAML change is at:
    The PR for the chart change is at:

  • Have the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc) (including backport-needed/*)?
    The PR is at Remove the longhorn-upgrade attacher type and refactor the upgrade path longhorn-manager#2006

  • Which areas/issues this PR might have potential impacts on?
    Area Upgrade
    Issues

  • If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted?
    The LEP PR is at

  • If labeled: area/ui Has the UI issue filed or ready to be merged (including backport-needed/*)?
    The UI issue/PR is at

  • If labeled: require/doc Has the necessary document PR submitted or merged (including backport-needed/*)?
    The documentation issue/PR is at

  • If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue (including backport-needed/*)
    The automation skeleton PR is at
    The automation test case PR is at
    The issue of automation test case implementation is at (please create by the template)

  • If labeled: require/automation-engine Has the engine integration test been merged (including backport-needed/*)?
    The engine automation PR is at

  • If labeled: require/manual-test-plan Has the manual test plan been documented?
    The updated manual test plan is at

  • If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at

@PhanLe1010
Contributor

PhanLe1010 commented Jun 20, 2023

Test Plan

Check the problem in the issue description

  • Perform the test in the issue description, but make sure the workload pod and the share manager pod are on different nodes (by setting nodeName: for the StatefulSet)

Test attach/detach after upgrade

  1. Install Longhorn v1.4.2
  2. Create a deployment of 1 pod that uses an RWO volume, vol1
  3. Create a deployment of 1 pod that uses an RWX volume, vol2
  4. Create a deployment of 2 pods that use an RWX volume, vol3
  5. Create a volume and attach it using the Longhorn UI, vol4
  6. Create a volume and do not attach it, vol5
  7. Upgrade Longhorn to master-head
  8. Verify that vol1, vol2, and vol3 are not detached during and after the upgrade. Exec into each workload pod and verify that you can still read/write to the volumes' mount points (one way to spot-check the volume states is sketched after this list)
  9. Verify that vol4 is not detached during and after the upgrade
  10. Verify that vol5 remains detached
  11. Scale down the workload deployments to 0, and verify that vol1, vol2, and vol3 are detached
  12. Scale up the workload deployments, and verify that vol1, vol2, and vol3 are attached
  13. Detach/attach vol4 and vol5 using the Longhorn UI (without the force option) and verify that they attach/detach OK
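
One way to spot-check the volume states in steps 8-12 (a sketch; the CRD and field names are assumed from a default kubectl install):

kubectl -n longhorn-system get volumes.longhorn.io -o custom-columns=NAME:.metadata.name,STATE:.status.state,NODE:.status.currentNodeID,ROBUSTNESS:.status.robustness
# vol1-vol4 should stay attached (and healthy) through the upgrade; vol5 should stay detached.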

Upgrade tests

@chriscchien chriscchien self-assigned this Jun 21, 2023
@chriscchien
Contributor

Verified pass in

  • longhorn master (longhorn-manager ec130d)
  • longhorn v1.5.x (longhorn-manager c8092b)

After upgrading from v1.4.2 to master and from v1.4.2 to v1.5.x, the test steps above passed; after deleting the workload, the RWX volume became detached as well.
