[BUG] Race leaves snapshot CRs that cannot be deleted #6298

Closed
ejweber opened this issue Jul 11, 2023 · 7 comments
Labels: area/resilience, area/snapshot, backport/1.5.1, component/longhorn-manager, kind/bug, priority/0, reproduce/always, require/auto-e2e-test

ejweber (Contributor) commented Jul 11, 2023

Describe the bug

I discovered this while doing the iterative testing described in #6078 (comment). In a cluster with 100 volumes and auto-cleanup-system-generated-snapshot=true in which an instance-manager has been force deleted ~50 times, there are ~2000 snapshot CRs that all look similar to:

Name:         fed8ec3d-5cd9-41da-b9d7-b498664883e9
Namespace:    longhorn-system
Labels:       longhornvolume=pvc-13f65711-b559-463f-9a58-5805b84e5a38
Annotations:  <none>
API Version:  longhorn.io/v1beta2
Kind:         Snapshot
Metadata:
  Creation Timestamp:             2023-07-10T21:11:58Z
  Deletion Grace Period Seconds:  0
  Deletion Timestamp:             2023-07-10T21:19:05Z
  Finalizers:
    longhorn.io
  Generation:  2
  Owner References:
    API Version:     longhorn.io/v1beta2
    Kind:            Volume
    Name:            pvc-13f65711-b559-463f-9a58-5805b84e5a38
    UID:             41a930d4-d39c-4e38-9bae-aa97ed98c790
  Resource Version:  120881
  UID:               47a07880-b410-41a2-924a-59126b177a76
Spec:
  Create Snapshot:  false
  Labels:           <nil>
  Volume:           pvc-13f65711-b559-463f-9a58-5805b84e5a38
Status:
  Checksum:  
  Children:
    Volume - Head:  true
  Creation Time:    2023-07-10T21:11:53Z
  Error:            lost track of the corresponding snapshot info inside volume engine
  Labels:
  Mark Removed:  true
  Owner ID:      
  Parent:        
  Ready To Use:  false
  Restore Size:  1073741824
  Size:          51011584
  User Created:  false

Note that the deletion timestamp is set and that the status.error reads "lost track of the corresponding snapshot info inside volume engine".

These snapshots do not exist in engine.status.snapshots nor do they exist on disk.

To Reproduce

I triggered it with the steps above, but a much simpler reproduction is at #6298 (comment).

Expected behavior

Longhorn-manager should clear the finalizer and allow these snapshot CRs to be deleted from the cluster.

Log or Support bundle

If applicable, add the Longhorn managers' log or support bundle when the issue happens.
You can generate a Support Bundle using the link at the footer of the Longhorn UI.

Environment

  • Longhorn version: master-head + likely unrelated modifications
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: RKE2 v1.25.11
    • Number of management nodes in the cluster: 1
    • Number of worker nodes in the cluster: 3
  • Node config
    • OS type and version: Ubuntu 22.04
    • CPU per node: 4 vCPU
    • Memory per node: 8 GB
    • Disk type (e.g. SSD/NVMe): SSD
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): DigitalOcean
  • Number of Longhorn volumes in the cluster: 100

Additional context

Discussed with @PhanLe1010 and @james-munson. We cannot get out of the state we are in because:

  • The snapshot CR has volume-head as one of its children.
  • This line [code link] prevents the volume controller from progressing to this line [code link], which would allow finalizer removal.

@PhanLe1010 suggested that we got into this state because:

  1. snap-1 was created.
  2. The engine controller created the snap-1 CR with status.children=volume-head.
  3. snap-2 was created at around the same time snap-1 was purged.
  4. The engine controller set the deletion timestamp on the snap-1 CR and created the snap-2 CR with status.children=volume-head.
  5. Both CRs now have status.children=volume-head, and the snap-1 CR has a deletion timestamp.
  6. The snap-1 CR's status is never updated because the snapshot is gone from the engine.

This fits well with the circumstances, as my iterative testing causes lots of purging/rebuilding in a short period of time.
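
The direction of the eventual fix (tracked later in this thread) is to stop letting a stale status.children entry block finalizer removal once the engine no longer tracks the snapshot. Below is a minimal sketch of that guard, using simplified stand-in types and a hypothetical helper name; it is not the actual longhorn-manager code.

package main

import "fmt"

// Simplified stand-ins for the Longhorn CR fields involved; the real types live
// in longhorn-manager's longhorn.io/v1beta2 API package.
type Snapshot struct {
    Name              string
    DeletionTimestamp *string         // non-nil once deletion has been requested
    Children          map[string]bool // status.children, e.g. {"volume-head": true}
}

type Engine struct {
    Snapshots map[string]struct{} // snapshot names present in engine.status.snapshots
}

// canRemoveFinalizer is a hypothetical helper sketching the guard discussed above:
// a snapshot CR that still lists volume-head as a child normally blocks finalizer
// removal, but a snapshot the engine no longer knows about must not block it,
// because its stale status will never be corrected.
func canRemoveFinalizer(snap Snapshot, engine Engine) bool {
    if snap.DeletionTimestamp == nil {
        return false // not being deleted; nothing to release
    }
    if _, known := engine.Snapshots[snap.Name]; !known {
        // Purged from the engine (e.g. during a rebuild): the stale children
        // entry is meaningless, so let the deletion finish.
        return true
    }
    // The engine still tracks it: wait until it no longer parents volume-head.
    return !snap.Children["volume-head"]
}

func main() {
    ts := "2023-07-10T21:19:05Z"
    stale := Snapshot{
        Name:              "snap-1",
        DeletionTimestamp: &ts,
        Children:          map[string]bool{"volume-head": true},
    }
    engine := Engine{Snapshots: map[string]struct{}{"snap-2": {}}} // snap-1 already purged
    fmt.Println(canRemoveFinalizer(stale, engine))                 // true: finalizer can be dropped
}

With a check along these lines, the snap-1 CR from the sequence above stops being stuck as soon as the engine forgets about it.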

innobead (Member) commented:

Do we need to backport to 1.4/1.3?

innobead added the component/longhorn-manager, priority/0, area/snapshot, and area/resilience labels on Jul 11, 2023
PhanLe1010 (Contributor) commented:

I think not; this one was introduced in v1.5.1 only.

innobead (Member) commented:

@ejweber @PhanLe1010 Please don't forget to update the status of the ZenHub pipeline. It's important for us to know where we are.

longhorn-io-github-bot commented Jul 12, 2023

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The steps are at: [BUG] Race leaves snapshot CRs that cannot be deleted #6298 (comment).

  • Is there a workaround for the issue? If so, where is it documented?
    The extra snapshots will be removed from the cluster when their corresponding volumes are deleted. Otherwise, the finalizer of each snapshot that matches the described conditions can be removed manually with kubectl (see the client-go sketch after this checklist for an equivalent).

  • Does the PR include the explanation for the fix or the feature?

  • Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?

  • Have the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc) (including backport-needed/*)?
    The PR is at: Fix bug: if the snapshot is no longer in engine CR, don't block the removal process longhorn-manager#2074.

  • Which areas/issues this PR might have potential impacts on?

  • If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted?

  • If labeled: area/ui Has the UI issue filed or ready to be merged (including backport-needed/*)?

  • If labeled: require/doc Has the necessary document PR submitted or merged (including backport-needed/*)?

  • If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only a test case skeleton w/o implementation, have you created an implementation issue (including backport-needed/*)?
    Test skeleton: Add test skeleton test_snapshot_cr longhorn-tests#1469
    Test ticket: [TEST][BUG] Race leaves snapshot CRs that cannot be deleted #6312

  • If labeled: require/automation-engine Has the engine integration test been merged (including backport-needed/*)?

  • If labeled: require/manual-test-plan Has the manual test plan been documented?

  • If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
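
For the manual workaround mentioned in the checklist above, here is a rough client-go equivalent of removing the finalizer with kubectl. It assumes the Snapshot CRD's plural resource name is snapshots (group longhorn.io, version v1beta2), the default kubeconfig path, and a known CR name; it is a sketch, not project tooling.

package main

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/apimachinery/pkg/types"
    "k8s.io/client-go/dynamic"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Load the local kubeconfig (assumes the default ~/.kube/config path).
    config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client, err := dynamic.NewForConfig(config)
    if err != nil {
        panic(err)
    }

    // GVR for the Longhorn Snapshot CRD.
    gvr := schema.GroupVersionResource{Group: "longhorn.io", Version: "v1beta2", Resource: "snapshots"}

    // Name of the stuck snapshot CR; placeholder value taken from the issue description.
    name := "fed8ec3d-5cd9-41da-b9d7-b498664883e9"

    // JSON patch that drops all finalizers, letting the pending deletion complete.
    patch := []byte(`[{"op":"remove","path":"/metadata/finalizers"}]`)
    if _, err := client.Resource(gvr).Namespace("longhorn-system").Patch(
        context.TODO(), name, types.JSONPatchType, patch, metav1.PatchOptions{}); err != nil {
        panic(err)
    }
}

Only apply this to CRs that match the stuck state described in this issue (deletion timestamp set, engine no longer tracking the snapshot); dropping finalizers on healthy snapshot CRs bypasses Longhorn's normal cleanup.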

ejweber added the reproduce/always label and removed the reproduce/often label on Jul 12, 2023
ejweber (Contributor, Author) commented Jul 12, 2023

This appears to be reproducible during all rebuilds when auto-cleanup-system-generated-snapshot is enabled.

On a three node cluster:

  1. Create and attach a volume.
  2. Delete one of the volume's replicas.
  3. Wait for the replica to rebuild. There is now one snapshot CR in the cluster (as expected).
  4. Delete the replica again.
  5. Wait for the replica to rebuild. There are now two snapshot CRs in the cluster. The older one doesn't exist on disk and looks like the one from [BUG] Race leaves snapshot CRs that cannot be deleted #6298 (comment).

We would expect only the new snapshot CR to remain in the cluster, as the old one is purged during the rebuild and no longer exists.
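
To confirm step 5 outside the Longhorn UI, the snapshot CRs can be listed directly and checked for the stuck pattern (deletion timestamp set, finalizer still present, status.error reporting that the engine lost track of the snapshot). A rough sketch using client-go's dynamic client, assuming the default kubeconfig and the snapshots.longhorn.io resource; illustrative only, not part of Longhorn:

package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/client-go/dynamic"
    "k8s.io/client-go/tools/clientcmd"
)

// isStuck flags a snapshot CR matching the state described in this issue.
func isStuck(item unstructured.Unstructured) bool {
    if item.GetDeletionTimestamp() == nil || len(item.GetFinalizers()) == 0 {
        return false
    }
    errMsg, _, _ := unstructured.NestedString(item.Object, "status", "error")
    return errMsg == "lost track of the corresponding snapshot info inside volume engine"
}

func main() {
    config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client, err := dynamic.NewForConfig(config)
    if err != nil {
        panic(err)
    }

    gvr := schema.GroupVersionResource{Group: "longhorn.io", Version: "v1beta2", Resource: "snapshots"}
    list, err := client.Resource(gvr).Namespace("longhorn-system").List(context.TODO(), metav1.ListOptions{})
    if err != nil {
        panic(err)
    }
    for _, item := range list.Items {
        if isStuck(item) {
            fmt.Printf("stuck snapshot CR: %s (volume %s)\n", item.GetName(), item.GetLabels()["longhornvolume"])
        }
    }
}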

ejweber (Contributor, Author) commented Jul 12, 2023

I think we should have an automated test case that catches this issue. I was curious why it wasn't caught by a case like test_snapshot, which has a step like:

16. List the snapshot, make sure `snap1` and `snap3`
are gone. `snap2` is marked as removed.

However, that case uses an API method to get the snapshots the engine knows about; it is not aware of erroneous snapshot CRs in the cluster.
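
A CR-aware check for such a test case could compare the snapshot names the engine reports against the snapshot CRs that exist for the volume, and flag any CR the engine no longer knows about. A small illustrative helper with hypothetical names, sketched in Go for brevity; not existing test code:

package main

import "fmt"

// findOrphanedSnapshotCRs returns the snapshot CR names that the engine no longer
// reports in engine.status.snapshots. After a rebuild with auto cleanup enabled,
// the bug in this issue would show up as a non-empty result.
func findOrphanedSnapshotCRs(engineSnapshots, snapshotCRs []string) []string {
    known := make(map[string]struct{}, len(engineSnapshots))
    for _, name := range engineSnapshots {
        known[name] = struct{}{}
    }
    var orphans []string
    for _, name := range snapshotCRs {
        if _, ok := known[name]; !ok {
            orphans = append(orphans, name)
        }
    }
    return orphans
}

func main() {
    engineSnapshots := []string{"snap-2"}       // what the engine reports after the rebuild
    snapshotCRs := []string{"snap-1", "snap-2"} // snapshot CRs actually present in the cluster
    fmt.Println(findOrphanedSnapshotCRs(engineSnapshots, snapshotCRs)) // [snap-1]
}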

yangchiu (Member) commented:
Verified passed on v1.5.x-head (longhorn-manager 8f12052) following the test steps. After the second replica deletion and rebuild, the old snapshot CR is replaced by a new one; two snapshot CRs no longer exist at the same time.
