
[BUG] Volume metafile getting deleted or empty results in a detach-attach loop #4846

Closed
shuo-wu opened this issue Nov 8, 2022 · 4 comments
Labels: area/replica (Volume replica where data is placed), area/resilience (System or volume resilience), area/v1-data-engine (v1 data engine, iSCSI tgt), backport/1.2.7, backport/1.3.3, kind/bug, priority/0 (Must be implemented or fixed in this release, managed by PO), require/auto-e2e-test (Require adding/updating auto e2e test cases if they can be automated), severity/2 (Function working but has a major issue w/o workaround; a major incident with significant impact)


shuo-wu commented Nov 8, 2022

Describe the bug

From the reporter:

It happened during runtime: no node shutdown, no upgrade, nothing.

We have now run into this issue twice: once during a node reboot and once during a Longhorn upgrade (v1.2.2 -> v1.3.1). In one environment the volume.meta file itself is missing, and in the other the volume.meta file is empty.

More context will be added later.
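For context, volume.meta in a replica's data path is a small JSON file recording the replica's state. A healthy one looks roughly like the following (field names and values are illustrative and may differ between Longhorn versions):

```json
{
  "Size": 2147483648,
  "Head": "volume-head-000.img",
  "Dirty": true,
  "Rebuilding": false,
  "Parent": "",
  "SectorSize": 512
}
```

When this file is missing or zero-length, the replica has no valid record of its state, which is presumably what drives the detach-attach loop in the title.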

To Reproduce

Steps to reproduce the behavior:
N/A

Expected behavior

The volume metafile should always record valid info.

Log or Support bundle

N/A

Environment

  • Longhorn version: v1.2.2 or v1.3.1
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl):
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version:
    • Number of management node in the cluster:
    • Number of worker node in the cluster:
  • Node config
    • OS type and version:
    • CPU per node:
    • Memory per node:
    • Disk type(e.g. SSD/NVMe):
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:

Additional context

@shuo-wu shuo-wu added kind/bug area/v1-data-engine v1 data engine (iSCSI tgt) labels Nov 8, 2022

shuo-wu commented Nov 8, 2022

cc @innobead @joshimoo

@joshimoo joshimoo added area/replica Volume replica where data is placed severity/2 Function working but has a major issue w/o workaround (a major incident with significant impact) labels Nov 8, 2022
@joshimoo joshimoo added this to the v1.4.0 milestone Nov 8, 2022
@innobead innobead added priority/0 Must be implemented or fixed in this release (managed by PO) investigation-needed Need to identify the case before estimating and starting the development labels Nov 8, 2022

innobead commented Nov 29, 2022

cc @longhorn/qa

Given Longhorn's current design, we rely heavily on files persisted on the host, so it's necessary to have error handling and self-healing/recovery for cases where that persistent data has problems, such as being deleted or emptied as in this case.

For QA: we need more negative testing covering this kind of external interference, to exercise Longhorn's resilience.
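A minimal sketch of the kind of defensive loading this implies, assuming a hypothetical loadVolumeMeta helper and the illustrative field set shown earlier (this is not the actual longhorn-engine code; the real fix is in the PR linked in the checklist below):

```go
package replica

import (
	"encoding/json"
	"fmt"
	"os"
)

// volumeMeta mirrors the JSON layout sketched above; the field set is
// illustrative, not the exact longhorn-engine type.
type volumeMeta struct {
	Size       int64  `json:"Size"`
	Head       string `json:"Head"`
	Dirty      bool   `json:"Dirty"`
	Rebuilding bool   `json:"Rebuilding"`
	Parent     string `json:"Parent"`
	SectorSize int64  `json:"SectorSize"`
}

// loadVolumeMeta is a hypothetical helper showing the defensive checks:
// a missing, empty, or corrupted metafile should surface a clear error
// that the caller can act on, instead of feeding bad state downstream.
func loadVolumeMeta(path string) (*volumeMeta, error) {
	data, err := os.ReadFile(path)
	if os.IsNotExist(err) {
		return nil, fmt.Errorf("volume metafile %s is missing: %w", path, err)
	}
	if err != nil {
		return nil, fmt.Errorf("failed to read volume metafile %s: %w", path, err)
	}
	if len(data) == 0 {
		return nil, fmt.Errorf("volume metafile %s is empty", path)
	}
	meta := &volumeMeta{}
	if err := json.Unmarshal(data, meta); err != nil {
		return nil, fmt.Errorf("volume metafile %s is corrupted: %w", path, err)
	}
	if meta.Head == "" || meta.Size <= 0 {
		return nil, fmt.Errorf("volume metafile %s has invalid contents: %+v", path, meta)
	}
	return meta, nil
}
```

The point of the sketch is that every failure mode (missing, empty, corrupted, invalid) becomes a distinct error the caller can treat as "this replica needs to be rebuilt", rather than silently producing bad state that feeds the detach-attach loop.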


longhorn-io-github-bot commented Dec 1, 2022

Pre Ready-For-Testing Checklist

  • Where are the reproduce/test steps documented?
    The reproduce/test steps are:
  1. Create a pod using a Longhorn volume
  2. Write some data to the volume, then get its md5sum
  3. Delete the pod and wait for the volume to be detached
  4. Randomly pick replicas and manually delete or empty the volume metafile in each picked replica's data path (see the sketch after this checklist)
  5. Recreate the pod and wait for the volume to be attached
  6. Check that the volume is Healthy after it attaches
  7. Check that the data is not corrupted
  8. Check that r/w to the volume works in the pod
  • Has the backend code been merged (Manager, Engine, Instance Manager, BackupStore, etc.), including backport-needed/*?
    The PR is at:
    fix(replica): volume metafile deleted or empty longhorn-engine#775

  • Which areas/issues might this PR have potential impacts on?
    area/engine

  • If labeled require/automation-e2e: has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If there is only a test case skeleton without implementation, has an implementation issue been created (including backport-needed/*)?
    The automation skeleton PR is at:
    The automation test case PR is at:
    The issue of automation test case implementation is at (please create it using the template):
    test(replica): volume metafile deleted or empty longhorn-tests#1214
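For step 4 of the test steps, here is a minimal Go sketch of the manual corruption, run on the node hosting the replica (the helper name, the example replica path, and the random choice between the two failure modes are all illustrative; the real automated test is tracked in the longhorn-tests PR above):

```go
package main

import (
	"fmt"
	"math/rand"
	"os"
	"path/filepath"
)

// corruptVolumeMeta simulates step 4: delete or empty volume.meta
// in the given replica data path. Both the helper and the path layout
// are hypothetical illustrations, not Longhorn tooling.
func corruptVolumeMeta(replicaDataPath string, empty bool) error {
	metaPath := filepath.Join(replicaDataPath, "volume.meta")
	if empty {
		// "Empty" case: truncate the file to zero bytes.
		return os.Truncate(metaPath, 0)
	}
	// "Deleted" case: remove the file entirely.
	return os.Remove(metaPath)
}

func main() {
	// Example replica data path under the default Longhorn disk path;
	// in practice, find the replica's actual directory on its node.
	path := "/var/lib/longhorn/replicas/pvc-example-replica-dir"
	if err := corruptVolumeMeta(path, rand.Intn(2) == 0); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```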

chriscchien (Contributor) commented

Verified in Longhorn master (207e74) with the test steps.
Result: Pass

After creating a pod with a Longhorn volume and then deleting the pod so the volume is in the detached state:

  1. If volume.meta is removed or emptied in a replica's data path, the replica recovers soon after the pod is recreated. The volume attaches to the node and the data stays consistent, as long as the volume still has a replica with a correct metafile.
  2. But if volume-head-001.img.meta is also removed or emptied in some replicas, then after the pod is recreated the volume stays in Degraded status, and the replicas whose volume-head-001.img.meta was modified stay in the failed state. Since volume-head-001.img.meta is outside this ticket's scope, we may consider handling that situation in the future.
