[BUG] Race leaves snapshot CRs that cannot be deleted #6298

Closed
ejweber opened this issue Jul 11, 2023 · 7 comments
Labels: area/resilience, area/snapshot, backport/1.5.1, component/longhorn-manager, kind/bug, priority/0, reproduce/always, require/auto-e2e-test

ejweber (Contributor) commented Jul 11, 2023

Describe the bug

I discovered this while doing the iterative testing described in #6078 (comment). In a cluster with 100 volumes and auto-cleanup-system-generated-snapshot=true in which an instance-manager has been force deleted ~50 times, there are ~2000 snapshot CRs that all look similar to:

Name:         fed8ec3d-5cd9-41da-b9d7-b498664883e9
Namespace:    longhorn-system
Labels:       longhornvolume=pvc-13f65711-b559-463f-9a58-5805b84e5a38
Annotations:  <none>
API Version:  longhorn.io/v1beta2
Kind:         Snapshot
Metadata:
  Creation Timestamp:             2023-07-10T21:11:58Z
  Deletion Grace Period Seconds:  0
  Deletion Timestamp:             2023-07-10T21:19:05Z
  Finalizers:
    longhorn.io
  Generation:  2
  Owner References:
    API Version:     longhorn.io/v1beta2
    Kind:            Volume
    Name:            pvc-13f65711-b559-463f-9a58-5805b84e5a38
    UID:             41a930d4-d39c-4e38-9bae-aa97ed98c790
  Resource Version:  120881
  UID:               47a07880-b410-41a2-924a-59126b177a76
Spec:
  Create Snapshot:  false
  Labels:           <nil>
  Volume:           pvc-13f65711-b559-463f-9a58-5805b84e5a38
Status:
  Checksum:  
  Children:
    Volume - Head:  true
  Creation Time:    2023-07-10T21:11:53Z
  Error:            lost track of the corresponding snapshot info inside volume engine
  Labels:
  Mark Removed:  true
  Owner ID:      
  Parent:        
  Ready To Use:  false
  Restore Size:  1073741824
  Size:          51011584
  User Created:  false

Note that the deletion timestamp is set and that the status.error reads "lost track of the corresponding snapshot info inside volume engine".

These snapshots do not exist in engine.status.snapshots nor do they exist on disk.

To Reproduce

I triggered it with the steps above, but a much simpler reproduction is at #6298 (comment).

Expected behavior

Longhorn-manager should clear the finalizer and allow these snapshot CRs to be deleted from the cluster.

Log or Support bundle

If applicable, add the Longhorn managers' log or support bundle when the issue happens.
You can generate a Support Bundle using the link at the footer of the Longhorn UI.

Environment

  • Longhorn version: master-head + likely unrelated modifications
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: RKE2 v1.25.11
    • Number of management nodes in the cluster: 1
    • Number of worker nodes in the cluster: 3
  • Node config
    • OS type and version: Ubuntu 22.04
    • CPU per node: 4 vCPU
    • Memory per node: 8 GB
    • Disk type (e.g. SSD/NVMe): SSD
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): DigitalOcean
  • Number of Longhorn volumes in the cluster: 100

Additional context

Discussed with @PhanLe1010 and @james-munson. We cannot get out of the state we are in because:

  • The snapshot CR has volume-head as one of its children.
  • This line [code link] prevents the volume controller from progressing to this line [code link], which would allow finalizer removal.

@PhanLe1010 suggested that we got into this state because:

  1. snap-1 was created.
  2. The engine controller created the snap-1 CR with status.children=volume-head.
  3. snap-2 was created at around the same time snap-1 was purged.
  4. The engine controller set the deletion timestamp on the snap-1 CR and created the snap-2 CR with status.children=volume-head.
  5. Both CRs now have status.children=volume-head, and the snap-1 CR has a deletion timestamp.
  6. The snap-1 CR's status is never updated because the snapshot is gone from the engine.

This fits well with the circumstances, as my iterative testing causes lots of purging/rebuilding in a short period of time.
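
The direction of the eventual fix (tracked later in this thread) is to stop letting a stale status.children entry block finalizer removal once the engine no longer tracks the snapshot. Below is a minimal sketch of that guard, using simplified stand-in types and a hypothetical helper name; it is not the actual longhorn-manager code.

package main

import "fmt"

// Simplified stand-ins for the Longhorn CR fields involved; the real types live
// in longhorn-manager's longhorn.io/v1beta2 API package.
type Snapshot struct {
    Name              string
    DeletionTimestamp *string         // non-nil once deletion has been requested
    Children          map[string]bool // status.children, e.g. {"volume-head": true}
}

type Engine struct {
    Snapshots map[string]struct{} // snapshot names present in engine.status.snapshots
}

// canRemoveFinalizer is a hypothetical helper sketching the guard discussed above:
// a snapshot CR that still lists volume-head as a child normally blocks finalizer
// removal, but a snapshot the engine no longer knows about must not block it,
// because its stale status will never be corrected.
func canRemoveFinalizer(snap Snapshot, engine Engine) bool {
    if snap.DeletionTimestamp == nil {
        return false // not being deleted; nothing to release
    }
    if _, known := engine.Snapshots[snap.Name]; !known {
        // Purged from the engine (e.g. during a rebuild): the stale children
        // entry is meaningless, so let the deletion finish.
        return true
    }
    // The engine still tracks it: wait until it no longer parents volume-head.
    return !snap.Children["volume-head"]
}

func main() {
    ts := "2023-07-10T21:19:05Z"
    stale := Snapshot{
        Name:              "snap-1",
        DeletionTimestamp: &ts,
        Children:          map[string]bool{"volume-head": true},
    }
    engine := Engine{Snapshots: map[string]struct{}{"snap-2": {}}} // snap-1 already purged
    fmt.Println(canRemoveFinalizer(stale, engine))                 // true: finalizer can be dropped
}

With a check along these lines, the snap-1 CR from the sequence above stops being stuck as soon as the engine forgets about it.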

innobead (Member) commented:

Do we need to backport to 1.4/1.3?

innobead added the component/longhorn-manager, priority/0, area/snapshot, and area/resilience labels on Jul 11, 2023
PhanLe1010 (Contributor) commented:

I think not; this one was introduced in v1.5.1 only.

innobead (Member) commented:

@ejweber @PhanLe1010 Please don't forget to update the status of the ZenHub pipeline. It's important for us to know where we are.

longhorn-io-github-bot commented Jul 12, 2023

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The steps are at: [BUG] Race leaves snapshot CRs that cannot be deleted #6298 (comment).

  • Is there a workaround for the issue? If so, where is it documented?
    The extra snapshots will be removed from the cluster when their corresponding volumes are deleted. Otherwise, the finalizer of each snapshot that matches the described conditions can be removed manually with kubectl (see the client-go sketch after this checklist for an equivalent).

  • Does the PR include the explanation for the fix or the feature?

  • Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?

  • Have the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc) (including backport-needed/*)?
    The PR is at: Fix bug: if the snapshot is no longer in engine CR, don't block the removal process longhorn-manager#2074.

  • Which areas/issues this PR might have potential impacts on?

  • If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted?

  • If labeled: area/ui Has the UI issue filed or ready to be merged (including backport-needed/*)?

  • If labeled: require/doc Has the necessary document PR submitted or merged (including backport-needed/*)?

  • If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only a test case skeleton w/o implementation, have you created an implementation issue (including backport-needed/*)?
    Test skeleton: Add test skeleton test_snapshot_cr longhorn-tests#1469
    Test ticket: [TEST][BUG] Race leaves snapshot CRs that cannot be deleted #6312

  • If labeled: require/automation-engine Has the engine integration test been merged (including backport-needed/*)?

  • If labeled: require/manual-test-plan Has the manual test plan been documented?

  • If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
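
For the manual workaround mentioned in the checklist above, here is a rough client-go equivalent of removing the finalizer with kubectl. It assumes the Snapshot CRD's plural resource name is snapshots (group longhorn.io, version v1beta2), the default kubeconfig path, and a known CR name; it is a sketch, not project tooling.

package main

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/apimachinery/pkg/types"
    "k8s.io/client-go/dynamic"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Load the local kubeconfig (assumes the default ~/.kube/config path).
    config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client, err := dynamic.NewForConfig(config)
    if err != nil {
        panic(err)
    }

    // GVR for the Longhorn Snapshot CRD.
    gvr := schema.GroupVersionResource{Group: "longhorn.io", Version: "v1beta2", Resource: "snapshots"}

    // Name of the stuck snapshot CR; placeholder value taken from the issue description.
    name := "fed8ec3d-5cd9-41da-b9d7-b498664883e9"

    // JSON patch that drops all finalizers, letting the pending deletion complete.
    patch := []byte(`[{"op":"remove","path":"/metadata/finalizers"}]`)
    if _, err := client.Resource(gvr).Namespace("longhorn-system").Patch(
        context.TODO(), name, types.JSONPatchType, patch, metav1.PatchOptions{}); err != nil {
        panic(err)
    }
}

Only apply this to CRs that match the stuck state described in this issue (deletion timestamp set, engine no longer tracking the snapshot); dropping finalizers on healthy snapshot CRs bypasses Longhorn's normal cleanup.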

ejweber added the reproduce/always label and removed the reproduce/often label on Jul 12, 2023
ejweber (Contributor, Author) commented Jul 12, 2023

This appears to be reproducible during all rebuilds when auto-cleanup-system-generated-snapshot is enabled.

On a three node cluster:

  1. Create and attach a volume.
  2. Delete one of the volume's replicas.
  3. Wait for the replica to rebuild. There is now one snapshot CR in the cluster (as expected).
  4. Delete the replica again.
  5. Wait for the replica to rebuild. There are now two snapshot CRs in the cluster. The older one doesn't exist on disk and looks like the one from [BUG] Race leaves snapshot CRs that cannot be deleted #6298 (comment).

We would expect only the new snapshot CR to remain in the cluster, as the old one is purged during the rebuild and no longer exists.
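
To confirm step 5 outside the Longhorn UI, the snapshot CRs can be listed directly and checked for the stuck pattern (deletion timestamp set, finalizer still present, status.error reporting that the engine lost track of the snapshot). A rough sketch using client-go's dynamic client, assuming the default kubeconfig and the snapshots.longhorn.io resource; illustrative only, not part of Longhorn:

package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/client-go/dynamic"
    "k8s.io/client-go/tools/clientcmd"
)

// isStuck flags a snapshot CR matching the state described in this issue.
func isStuck(item unstructured.Unstructured) bool {
    if item.GetDeletionTimestamp() == nil || len(item.GetFinalizers()) == 0 {
        return false
    }
    errMsg, _, _ := unstructured.NestedString(item.Object, "status", "error")
    return errMsg == "lost track of the corresponding snapshot info inside volume engine"
}

func main() {
    config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client, err := dynamic.NewForConfig(config)
    if err != nil {
        panic(err)
    }

    gvr := schema.GroupVersionResource{Group: "longhorn.io", Version: "v1beta2", Resource: "snapshots"}
    list, err := client.Resource(gvr).Namespace("longhorn-system").List(context.TODO(), metav1.ListOptions{})
    if err != nil {
        panic(err)
    }
    for _, item := range list.Items {
        if isStuck(item) {
            fmt.Printf("stuck snapshot CR: %s (volume %s)\n", item.GetName(), item.GetLabels()["longhornvolume"])
        }
    }
}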

ejweber (Contributor, Author) commented Jul 12, 2023

I think we should have an automated test case that catches this issue. I was curious why it wasn't caught by a case like test_snapshot, which has a step like:

16. List the snapshot, make sure `snap1` and `snap3`
are gone. `snap2` is marked as removed.

However, that case uses an API method to get the snapshots the engine knows about; it is not aware of erroneous snapshot CRs in the cluster.
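
A CR-aware check for such a test case could compare the snapshot names the engine reports against the snapshot CRs that exist for the volume, and flag any CR the engine no longer knows about. A small illustrative helper with hypothetical names, sketched in Go for brevity; not existing test code:

package main

import "fmt"

// findOrphanedSnapshotCRs returns the snapshot CR names that the engine no longer
// reports in engine.status.snapshots. After a rebuild with auto cleanup enabled,
// the bug in this issue would show up as a non-empty result.
func findOrphanedSnapshotCRs(engineSnapshots, snapshotCRs []string) []string {
    known := make(map[string]struct{}, len(engineSnapshots))
    for _, name := range engineSnapshots {
        known[name] = struct{}{}
    }
    var orphans []string
    for _, name := range snapshotCRs {
        if _, ok := known[name]; !ok {
            orphans = append(orphans, name)
        }
    }
    return orphans
}

func main() {
    engineSnapshots := []string{"snap-2"}       // what the engine reports after the rebuild
    snapshotCRs := []string{"snap-1", "snap-2"} // snapshot CRs actually present in the cluster
    fmt.Println(findOrphanedSnapshotCRs(engineSnapshots, snapshotCRs)) // [snap-1]
}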

yangchiu (Member) commented:
Verified passed on v1.5.x-head (longhorn-manager 8f12052) following the test steps. After the second replica deletion and rebuild, the old snapshot CR is replaced by a new one; two snapshot CRs no longer exist at the same time.
