
[BUG] Volume metafile getting deleted or empty results in a detach-attach loop #4846

Closed
shuo-wu opened this issue Nov 8, 2022 · 4 comments
Labels: area/replica (Volume replica where data is placed), area/resilience (System or volume resilience), area/v1-data-engine (v1 data engine, iSCSI tgt), backport/1.2.7, backport/1.3.3, kind/bug, priority/0 (Must be implemented or fixed in this release, managed by PO), require/auto-e2e-test (Require adding/updating auto e2e test cases if they can be automated), severity/2 (Function working but has a major issue w/o workaround; a major incident with significant impact)


shuo-wu commented Nov 8, 2022

Describe the bug

From the reporter:

It happened during runtime: no node shutdown, no upgrade, nothing.

We have now run into this issue twice: once during a node reboot and once during a Longhorn upgrade (v1.2.2 -> v1.3.1). In one environment the volume.meta file itself is missing, and in the other the volume.meta file is empty.

More context will be added later.
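For context, volume.meta in a replica's data path is a small JSON file recording the replica's state. A healthy one looks roughly like the following (field names and values are illustrative and may differ between Longhorn versions):

```json
{
  "Size": 2147483648,
  "Head": "volume-head-000.img",
  "Dirty": true,
  "Rebuilding": false,
  "Parent": "",
  "SectorSize": 512
}
```

When this file is missing or zero-length, the replica has no valid record of its state, which is presumably what drives the detach-attach loop in the title.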

To Reproduce

Steps to reproduce the behavior:
N/A

Expected behavior

The volume metafile should always record valid info.

Log or Support bundle

N/A

Environment

  • Longhorn version: v1.2.2 or v1.3.1
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl):
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version:
    • Number of management node in the cluster:
    • Number of worker node in the cluster:
  • Node config
    • OS type and version:
    • CPU per node:
    • Memory per node:
    • Disk type(e.g. SSD/NVMe):
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:

Additional context

@shuo-wu shuo-wu added kind/bug area/v1-data-engine v1 data engine (iSCSI tgt) labels Nov 8, 2022

shuo-wu commented Nov 8, 2022

cc @innobead @joshimoo

@joshimoo joshimoo added area/replica Volume replica where data is placed severity/2 Function working but has a major issue w/o workaround (a major incident with significant impact) labels Nov 8, 2022
@joshimoo joshimoo added this to the v1.4.0 milestone Nov 8, 2022
@innobead innobead added priority/0 Must be implemented or fixed in this release (managed by PO) investigation-needed Need to identify the case before estimating and starting the development labels Nov 8, 2022

innobead commented Nov 29, 2022

cc @longhorn/qa

Given Longhorn's current design, we rely heavily on files persisted on the host, so it's necessary to have error handling and self-healing/recovery for cases where that persistent data has problems, such as being deleted or emptied as in this case.

For QA: we need more negative testing covering this kind of external interference, to exercise Longhorn's resilience.
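A minimal sketch of the kind of defensive loading this implies, assuming a hypothetical loadVolumeMeta helper and the illustrative field set shown earlier (this is not the actual longhorn-engine code; the real fix is in the PR linked in the checklist below):

```go
package replica

import (
	"encoding/json"
	"fmt"
	"os"
)

// volumeMeta mirrors the JSON layout sketched above; the field set is
// illustrative, not the exact longhorn-engine type.
type volumeMeta struct {
	Size       int64  `json:"Size"`
	Head       string `json:"Head"`
	Dirty      bool   `json:"Dirty"`
	Rebuilding bool   `json:"Rebuilding"`
	Parent     string `json:"Parent"`
	SectorSize int64  `json:"SectorSize"`
}

// loadVolumeMeta is a hypothetical helper showing the defensive checks:
// a missing, empty, or corrupted metafile should surface a clear error
// that the caller can act on, instead of feeding bad state downstream.
func loadVolumeMeta(path string) (*volumeMeta, error) {
	data, err := os.ReadFile(path)
	if os.IsNotExist(err) {
		return nil, fmt.Errorf("volume metafile %s is missing: %w", path, err)
	}
	if err != nil {
		return nil, fmt.Errorf("failed to read volume metafile %s: %w", path, err)
	}
	if len(data) == 0 {
		return nil, fmt.Errorf("volume metafile %s is empty", path)
	}
	meta := &volumeMeta{}
	if err := json.Unmarshal(data, meta); err != nil {
		return nil, fmt.Errorf("volume metafile %s is corrupted: %w", path, err)
	}
	if meta.Head == "" || meta.Size <= 0 {
		return nil, fmt.Errorf("volume metafile %s has invalid contents: %+v", path, meta)
	}
	return meta, nil
}
```

The point of the sketch is that every failure mode (missing, empty, corrupted, invalid) becomes a distinct error the caller can treat as "this replica needs to be rebuilt", rather than silently producing bad state that feeds the detach-attach loop.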


longhorn-io-github-bot commented Dec 1, 2022

Pre Ready-For-Testing Checklist

  • Where are the reproduce/test steps documented?
    The reproduce/test steps are:
  1. Create a pod using a Longhorn volume
  2. Write some data to the volume, then get its md5sum
  3. Delete the pod and wait for the volume to be detached
  4. Randomly pick replicas and manually delete or empty the volume metafile in each picked replica's data path (see the sketch after this checklist)
  5. Recreate the pod and wait for the volume to be attached
  6. Check that the volume is Healthy after it attaches
  7. Check that the data is not corrupted
  8. Check that r/w to the volume works in the pod
  • Has the backend code been merged (Manager, Engine, Instance Manager, BackupStore, etc.), including backport-needed/*?
    The PR is at:
    fix(replica): volume metafile deleted or empty longhorn-engine#775

  • Which areas/issues might this PR have potential impacts on?
    area/engine

  • If labeled require/automation-e2e: has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If there is only a test case skeleton without implementation, has an implementation issue been created (including backport-needed/*)?
    The automation skeleton PR is at:
    The automation test case PR is at:
    The issue of automation test case implementation is at (please create it using the template):
    test(replica): volume metafile deleted or empty longhorn-tests#1214
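For step 4 of the test steps, here is a minimal Go sketch of the manual corruption, run on the node hosting the replica (the helper name, the example replica path, and the random choice between the two failure modes are all illustrative; the real automated test is tracked in the longhorn-tests PR above):

```go
package main

import (
	"fmt"
	"math/rand"
	"os"
	"path/filepath"
)

// corruptVolumeMeta simulates step 4: delete or empty volume.meta
// in the given replica data path. Both the helper and the path layout
// are hypothetical illustrations, not Longhorn tooling.
func corruptVolumeMeta(replicaDataPath string, empty bool) error {
	metaPath := filepath.Join(replicaDataPath, "volume.meta")
	if empty {
		// "Empty" case: truncate the file to zero bytes.
		return os.Truncate(metaPath, 0)
	}
	// "Deleted" case: remove the file entirely.
	return os.Remove(metaPath)
}

func main() {
	// Example replica data path under the default Longhorn disk path;
	// in practice, find the replica's actual directory on its node.
	path := "/var/lib/longhorn/replicas/pvc-example-replica-dir"
	if err := corruptVolumeMeta(path, rand.Intn(2) == 0); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```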

chriscchien (Contributor) commented

Verified in Longhorn master (207e74) with the test steps.
Result: Pass

After creating a pod with a Longhorn volume and then deleting the pod so the volume is in the detached state:

  1. If volume.meta is removed or emptied in a replica's data path, the replica recovers soon after the pod is recreated. The volume attaches to the node and the data stays consistent, as long as the volume still has a replica with a correct metafile.
  2. But if volume-head-001.img.meta is also removed or emptied in some replicas, then after the pod is recreated the volume stays in Degraded status, and the replicas whose volume-head-001.img.meta was modified stay in the failed state. Since volume-head-001.img.meta is outside this ticket's scope, we may consider handling that situation in the future.
