
[BUG] Longhorn may keep corrupted salvaged replicas and discard good ones #7425

Closed
ejweber opened this issue Dec 22, 2023 · 6 comments
Assignees
Labels
area/stability (System or volume stability)
area/volume-attach-detach (Volume attach & detach related)
backport/1.5.4
backport/1.6.1
component/longhorn-manager (Longhorn manager, control plane)
investigation-needed (Need to identify the case before estimating and starting the development)
kind/bug
priority/0 (Must be fixed in this release, managed by PO)
require/auto-e2e-test (Require adding/updating auto e2e test cases if they can be automated)
require/backport (Require backport. Only used when the specific versions to backport have not been defined.)
require/qa-review-coverage (Require QA to review coverage)
Milestone

Comments

@ejweber
Contributor

ejweber commented Dec 22, 2023

Describe the bug (🐛 if you encounter this issue)

In a Harvester cluster, we saw many volumes in an attach/detach loop. The instance-manager-e responsible for starting each engine logged the following on each loop:

2023-12-14T08:58:51.406122904Z [pvc-8c39063b-e4ee-4786-a1bf-f3c3056e77bb-e-4f29712b] time="2023-12-14T08:58:51Z" level=info msg="Starting with replicas [\"tcp://10.52.3.78:10885\" \"tcp://10.52.1.130:10885\"]"
2023-12-14T08:58:51.406575780Z [pvc-8c39063b-e4ee-4786-a1bf-f3c3056e77bb-e-4f29712b] time="2023-12-14T08:58:51Z" level=info msg="Connecting to remote: 10.52.3.78:10885"
2023-12-14T08:58:51.413610249Z [pvc-8c39063b-e4ee-4786-a1bf-f3c3056e77bb-e-4f29712b] time="2023-12-14T08:58:51Z" level=info msg="Opening: 10.52.3.78:10885"
2023-12-14T08:58:51.448733315Z [pvc-8c39063b-e4ee-4786-a1bf-f3c3056e77bb-e-4f29712b] time="2023-12-14T08:58:51Z" level=warning msg="failed to create backend with address tcp://10.52.3.78:10885: failed to open replica 10.52.3.78:10885 from remote: rpc error: code = Unknown desc = Failed to find metadata for volume-snap-26f06ba1-bc82-40a4-b506-ff1190f65ba3.img"
2023-12-14T08:58:51.448745943Z time="2023-12-14T08:58:51Z" level=info msg="Connecting to remote: 10.52.1.130:10885"
2023-12-14T08:58:51.451211278Z [pvc-8c39063b-e4ee-4786-a1bf-f3c3056e77bb-e-4f29712b] time="2023-12-14T08:58:51Z" level=info msg="Opening: 10.52.1.130:10885"
2023-12-14T08:58:51.464732493Z [pvc-8c39063b-e4ee-4786-a1bf-f3c3056e77bb-e-4f29712b] time="2023-12-14T08:58:51Z" level=warning msg="failed to create backend with address tcp://10.52.1.130:10885: failed to open replica 10.52.1.130:10885 from remote: rpc error: code = Unknown desc = Failed to find metadata for volume-snap-85e0c320-8232-4428-8bef-ca535e2516a5.img"
2023-12-14T08:58:51.465185634Z [pvc-8c39063b-e4ee-4786-a1bf-f3c3056e77bb-e-4f29712b] time="2023-12-14T08:58:51Z" level=info msg="Adding backend: tcp://10.52.3.78:10885"
2023-12-14T08:58:51.465200274Z time="2023-12-14T08:58:51Z" level=info msg="Adding backend: tcp://10.52.1.130:10885"

At this point, there were two replicas, and both were somehow corrupted (each was missing the .meta file corresponding to one of its .img files).

Both replicas appeared healthy to Longhorn, with healthyAt but no failedAt.

healthyAt: "2023-12-11T08:22:46Z"
healthyAt: "2023-12-11T07:54:46Z"

We were able to look at the data directory on disk for one of the replicas:

node:~ # kubectl -n longhorn-system exec instance-manager-r-46ea085f -- ls -al /host/var/lib/harvester/extra-disks/0b70300b9351baf9e032ab76cfbeb918/replicas/pvc-8c39063b-e4ee-4786-a1bf-f3c3056e77bb-832080a6

total 32

drwx------  2 root root          4096 Dec 12 13:09 .
drwxr-xr-x 57 root root         12288 Dec 20 01:31 ..
-rw-------  1 root root          4096 Dec 12 13:09 revision.counter
-rw-r--r--  1 root root 4398046511104 Dec 12 13:09 volume-head-007.img
-rw-r--r--  1 root root           178 Dec 12 13:09 volume-head-007.img.meta
-rw-r--r--  1 root root 4398046511104 Dec 12 13:09 volume-snap-26f06ba1-bc82-40a4-b506-ff1190f65ba3.img
-rw-r--r--  1 root root 4398046511104 Dec 12 13:09 volume-snap-af2454a6-b3b5-4de6-a6a6-eb465abfca97.img
-rw-r--r--  1 root root           178 Dec 12 13:09 volume-snap-af2454a6-b3b5-4de6-a6a6-eb465abfca97.img.meta
-rw-r--r--  1 root root           197 Dec 12 13:09 volume.meta

node:~ # kubectl -n longhorn-system exec instance-manager-r-46ea085f -- grep -r "" /host/var/lib/harvester/extra-disks/0b70300b9351baf9e032ab76cfbeb918/replicas/pvc-8c39063b-e4ee-4786-a1bf-f3c3056e77bb-832080a6
/host/var/lib/harvester/extra-disks/0b70300b9351baf9e032ab76cfbeb918/replicas/pvc-8c39063b-e4ee-4786-a1bf-f3c3056e77bb-832080a6/volume-snap-af2454a6-b3b5-4de6-a6a6-eb465abfca97.img.meta:{"Name":"volume-head-010.img","Parent":"volume-snap-26f06ba1-bc82-40a4-b506-ff1190f65ba3.img","Removed":false,"UserCreated":false,"Created":"2023-12-12T13:08:59Z","Labels":null}
/host/var/lib/harvester/extra-disks/0b70300b9351baf9e032ab76cfbeb918/replicas/pvc-8c39063b-e4ee-4786-a1bf-f3c3056e77bb-832080a6/volume.meta:{"Size":4398046511104,"Head":"volume-head-007.img","Dirty":false,"Rebuilding":true,"Error":"","Parent":"volume-snap-af2454a6-b3b5-4de6-a6a6-eb465abfca97.img","SectorSize":512,"BackingFilePath":""}
Binary file /host/var/lib/harvester/extra-disks/0b70300b9351baf9e032ab76cfbeb918/replicas/pvc-8c39063b-e4ee-4786-a1bf-f3c3056e77bb-832080a6/volume-head-007.img matches
Binary file /host/var/lib/harvester/extra-disks/0b70300b9351baf9e032ab76cfbeb918/replicas/pvc-8c39063b-e4ee-4786-a1bf-f3c3056e77bb-832080a6/volume-snap-af2454a6-b3b5-4de6-a6a6-eb465abfca97.img matches
/host/var/lib/harvester/extra-disks/0b70300b9351baf9e032ab76cfbeb918/replicas/pvc-8c39063b-e4ee-4786-a1bf-f3c3056e77bb-832080a6/volume-head-007.img.meta:{"Name":"volume-head-001.img","Parent":"volume-snap-af2454a6-b3b5-4de6-a6a6-eb465abfca97.img","Removed":false,"UserCreated":false,"Created":"2023-12-12T13:08:59Z","Labels":null}
Binary file /host/var/lib/harvester/extra-disks/0b70300b9351baf9e032ab76cfbeb918/replicas/pvc-8c39063b-e4ee-4786-a1bf-f3c3056e77bb-832080a6/revision.counter matches
Binary file /host/var/lib/harvester/extra-disks/0b70300b9351baf9e032ab76cfbeb918/replicas/pvc-8c39063b-e4ee-4786-a1bf-f3c3056e77bb-832080a6/volume-snap-26f06ba1-bc82-40a4-b506-ff1190f65ba3.img matches

volume-snap-26f06ba1-bc82-40a4-b506-ff1190f65ba3.img was missing a corresponding .meta file.

We were lucky enough to capture the reason for this in the logs:

2023-12-14T09:10:17.456420177Z time="2023-12-14T09:10:17Z" level=warning msg="pvc-8c39063b-e4ee-4786-a1bf-f3c3056e77bb-r-2afa33f4: time=\"2023-12-12T13:09:34Z\" level=error msg=\"Sync agent gRPC server failed to rebuild replica/sync files: replica tcp://10.52.2.176:10720 failed to send file volume-snap-26f06ba1-bc82-40a4-b506-ff1190f65ba3.img to 10.52.3.78:11431: failed to send file volume-snap-26f06ba1-bc82-40a4-b506-ff1190f65ba3.img to 10.52.3.78:11431: rpc error: code = Unknown desc = failed to SyncContent for source: volume-snap-26f06ba1-bc82-40a4-b506-ff1190f65ba3.img: syncHoleInterval [      32:     256](224) failed, err: sendHole failed, err: Post \\\"http://10.52.3.78:11431/v1-ssync/sendHole?begin=131072&end=1048576\\\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)\""

The rebuild timed out. We always transfer the .meta file after the .img file, so this replica was left in a broken state. It shouldn't have been a big deal. Presumably we marked the replica as failed with a failedAt and attempted to rebuild it. (We can't tell from the logs whether this was the case.) The same thing probably happened to the other replica. So, at one point, we likely had three replicas. Two were broken and one was whole.
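To make the ordering concrete, here is a minimal Go sketch of the transfer order described above, using hypothetical helper names rather than the actual longhorn-engine sync code: the data file is sent first and its metadata file second, so a timeout during the data transfer leaves a .img with no .img.meta, which is exactly the on-disk state shown earlier.

```go
// Hypothetical sketch of the rebuild file ordering described above; not the
// real longhorn-engine sync-agent implementation.
package main

import (
	"errors"
	"fmt"
)

// syncFile stands in for the real file transfer; here it simply fails for the
// large data file to simulate the "context deadline exceeded" timeout seen in
// the log above.
func syncFile(name string) error {
	if name == "volume-snap-26f06ba1-bc82-40a4-b506-ff1190f65ba3.img" {
		return errors.New("context deadline exceeded (Client.Timeout exceeded while awaiting headers)")
	}
	return nil
}

// syncSnapshot transfers the data file first and the metadata file second, so
// a timeout during the data transfer means the .img.meta is never sent and
// the receiving replica is left half-rebuilt.
func syncSnapshot(snap string) error {
	if err := syncFile(snap + ".img"); err != nil {
		return fmt.Errorf("failed to send file %s.img: %w", snap, err)
	}
	return syncFile(snap + ".img.meta")
}

func main() {
	if err := syncSnapshot("volume-snap-26f06ba1-bc82-40a4-b506-ff1190f65ba3"); err != nil {
		fmt.Println("rebuild failed:", err)
	}
}
```

Run as-is, it only prints the simulated rebuild failure; the point is that nothing ever writes the .meta once the .img transfer errors out.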

Somehow, we lost the whole replica and kept one broken one. I think it could have happened like this (a rough sketch of this sequence follows the list):

  • At some point, the node with the whole replica goes down. Now, all replicas are failed.
  • The autosalvage logic kicks in. Since the node with the whole replica is offline, we autosalvage with the broken replicas.
  • We remove the failedAt from the broken replicas and attempt an autosalvage.
  • The engine fails to start (of course) because the replicas are broken.
  • At some point, the node with the whole replica comes back. It is actually the only replica we want, but it has a failedAt and the broken ones do not.
  • After 30 minutes, we delete the whole replica.
  • Data is lost.
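The sequence above can be summarized with a rough, self-contained Go sketch (simplified types and field names, not the actual longhorn-manager code): autosalvage clears failedAt only on reachable replicas that have healthyAt set, and any replica still carrying an old failedAt is later treated as stale and deleted.

```go
// Simplified model of the hypothesized failure sequence above; not the real
// longhorn-manager types or controller logic.
package main

import (
	"fmt"
	"time"
)

// Replica keeps only the fields relevant to the sequence above.
type Replica struct {
	Name      string
	HealthyAt time.Time
	FailedAt  time.Time
	Reachable bool
}

// autosalvage clears FailedAt on replicas that look healthy (HealthyAt set)
// and whose node is reachable, mirroring the hypothesized salvage step.
func autosalvage(replicas []*Replica) {
	for _, r := range replicas {
		if r.Reachable && !r.HealthyAt.IsZero() {
			r.FailedAt = time.Time{}
		}
	}
}

// cleanupStale deletes any replica whose FailedAt is older than the stale
// replica timeout; after autosalvage, only the good-but-offline replica still
// carries a FailedAt, so it is the one that gets deleted.
func cleanupStale(replicas []*Replica, staleTimeout time.Duration, now time.Time) {
	for _, r := range replicas {
		if !r.FailedAt.IsZero() && now.Sub(r.FailedAt) > staleTimeout {
			fmt.Printf("deleting stale replica %s (failed %s ago)\n", r.Name, now.Sub(r.FailedAt))
		}
	}
}

func main() {
	now := time.Now()
	broken := &Replica{Name: "broken", HealthyAt: now.Add(-48 * time.Hour), FailedAt: now.Add(-time.Hour), Reachable: true}
	good := &Replica{Name: "good", HealthyAt: now.Add(-48 * time.Hour), FailedAt: now.Add(-time.Hour), Reachable: false}

	autosalvage([]*Replica{broken, good})                       // the broken replica's FailedAt is cleared
	cleanupStale([]*Replica{broken, good}, 30*time.Minute, now) // the good replica is selected for deletion
}
```

Running this prints that the good replica is the one selected for deletion, which matches the data-loss outcome hypothesized above.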

To Reproduce

I still need to try to reproduce this. We can probably:

  • Create a volume with two replicas.
  • Intentionally corrupt (delete a .meta file) one of the replicas, then crash it.
  • Restart the node running the other healthy replica.
  • Observe the behavior.

Expected behavior

We can distinguish the actual healthy replica from the replicas that failed to be salvaged. We do not delete the actual healthy replica. Eventually, the healthy replica is able to rebuild other healthy replicas.

Support bundle for troubleshooting

I cannot share the support bundle here, but we have one that we can re-examine.

Environment

  • Longhorn version: v1.3.2 (probably affects newer versions)
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Harvester
  • Number of Longhorn volumes in the cluster: hundreds

Additional context

There is a replica.spec.salvageRequested field and a replica.status.salvageExecuted field that our current code doesn't use. Maybe we can use these to track that certain replicas were never successfully salvaged.
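As a purely hypothetical sketch of that idea (the field names come from the paragraph above, but this guard is not existing longhorn-manager behavior): mark a replica when autosalvage is requested for it, only mark the salvage as executed once the engine actually starts with it, and refuse to clean up stale replicas in the volume while any replica is stuck in the requested-but-never-executed state.

```go
// Hypothetical use of the salvage-tracking fields mentioned above; the real
// CRDs differ, and this is not how longhorn-manager currently behaves.
package main

import "fmt"

type ReplicaSpec struct {
	SalvageRequested bool
	FailedAt         string
}

type ReplicaStatus struct {
	SalvageExecuted bool
}

type Replica struct {
	Name   string
	Spec   ReplicaSpec
	Status ReplicaStatus
}

// staleCleanupAllowed is a hypothetical guard: if any replica in the volume
// had a salvage requested that never executed (the engine never started with
// it), hold off on deleting other stale replicas, since one of them may be
// the only intact copy of the data.
func staleCleanupAllowed(replicas []*Replica) bool {
	for _, r := range replicas {
		if r.Spec.SalvageRequested && !r.Status.SalvageExecuted {
			return false
		}
	}
	return true
}

func main() {
	replicas := []*Replica{
		{Name: "salvaged-but-broken", Spec: ReplicaSpec{SalvageRequested: true}},
		{Name: "good-but-stale", Spec: ReplicaSpec{FailedAt: "2023-12-14T09:00:00Z"}},
	}
	fmt.Println("stale cleanup allowed:", staleCleanupAllowed(replicas)) // false
}
```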

@ejweber ejweber added kind/bug require/qa-review-coverage Require QA to review coverage require/backport Require backport. Only used when the specific versions to backport have not been defined. labels Dec 22, 2023
@innobead innobead added this to the v1.6.0 milestone Dec 23, 2023
@innobead innobead added component/longhorn-manager Longhorn manager (control plane) priority/0 Must be fixed in this release (managed by PO) labels Dec 23, 2023
@ejweber
Contributor Author

ejweber commented Dec 27, 2023

I'm not sure our autosalvage mechanism is the main culprit here after all. In all relevant versions of Longhorn, we only autosalvage with replicas that have healthyAt != "".

https://github.com/longhorn/longhorn-manager/blob/c4e7942684cc1f8ece900854d09126a7b1f8c0b6/controller/volume_controller.go#L1266-L1268

As soon as we decide to rebuild a replica for the first time, we set healthyAt = "".

https://github.com/longhorn/longhorn-manager/blob/c4e7942684cc1f8ece900854d09126a7b1f8c0b6/controller/volume_controller.go#L1736

We only set healthyAt to any value when a successful rebuild has completed and a replica is found in the replicaModeMap of the correct engine.

https://github.com/longhorn/longhorn-manager/blob/c4e7942684cc1f8ece900854d09126a7b1f8c0b6/controller/volume_controller.go#L609

The corrupted replicas in our support bundle have healthyAt set. This makes them candidates for autosalvage and indistinguishable from the actual healthy replica. We need to identify how they can have healthyAt set if they failed during rebuilding.
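Putting the three linked snippets together, a condensed Go sketch of the healthyAt lifecycle described above (field and parameter names are simplified approximations, not a verbatim excerpt from volume_controller.go):

```go
// Condensed, simplified restatement of the healthyAt lifecycle discussed in
// this comment; approximations only, not the actual longhorn-manager source.
package main

import (
	"fmt"
	"time"
)

type Replica struct {
	Name      string
	HealthyAt string
	FailedAt  string
}

// startRebuild: healthyAt is cleared as soon as a replica is chosen for
// rebuilding (per the second linked line).
func startRebuild(r *Replica) {
	r.HealthyAt = ""
}

// onEngineStatus: healthyAt is only set back once the replica is found in the
// replicaModeMap of the correct engine, i.e. after a successful rebuild (per
// the third linked line).
func onEngineStatus(r *Replica, replicaModeMap map[string]string) {
	if _, ok := replicaModeMap[r.Name]; ok {
		r.HealthyAt = time.Now().UTC().Format(time.RFC3339)
	}
}

// isSalvageCandidate: autosalvage only considers replicas with a non-empty
// healthyAt (per the first linked line). A replica that failed mid-rebuild
// should therefore never qualify, which is why the corrupted replicas having
// healthyAt set is the open question.
func isSalvageCandidate(r *Replica) bool {
	return r.HealthyAt != ""
}

func main() {
	r := &Replica{Name: "tcp://10.52.3.78:10885", HealthyAt: "2023-12-11T08:22:46Z"}
	startRebuild(r)                        // healthyAt cleared when the rebuild starts
	onEngineStatus(r, map[string]string{}) // rebuild failed: never re-added to the map
	fmt.Println("salvage candidate:", isSalvageCandidate(r)) // false, as expected after a failed rebuild
}
```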

@innobead innobead added the investigation-needed Need to identify the case before estimating and starting the development label Jan 2, 2024
@innobead innobead modified the milestones: v1.6.0, v1.7.0 Jan 2, 2024
@ChanYiLin
Contributor

Hi @ejweber, I found this is quite similar to this issue as well. We also don't have a solution for that one, only a prevention. You may take a look:
#4212 (comment)

@ejweber
Contributor Author

ejweber commented Jan 4, 2024

Thanks @ChanYiLin! I agree that the linked issue is quite relevant! I'm not seeing any real evidence that invalid state rebuilding is a part of the issue the user observed, but the end result is the same: the broken (failed rebuilding) replica had a healthyAt, so we chose it for autosalvage. The "fix" I have been working to implement will prevent us from deleting potentially healthy replicas if this happens, but recovery will be awkward (we probably can't/shouldn't automate deleting the replica autosalvage chose in favor of a different one). It would be much better if we understood how a rebuilding replica maintained its healthyAt status.

@ejweber
Contributor Author

ejweber commented Jan 16, 2024

Longhorn v1.3.2 and older are more susceptible to this issue because:

These (possibly unintentional) protections are thwarted, however, if the volume is ever detached. Then, we clean up the stale (but good) replica after staleReplicaTimeout.

@longhorn-io-github-bot

longhorn-io-github-bot commented Jan 17, 2024

Pre Ready-For-Testing Checklist

@roger-ryao

Verified on master-head 20240306

The test steps

  1. Be more careful while cleaning up stale replicas longhorn-manager#2432 (comment)

Result: Passed

  1. If you wait any length of time, ReplicaA is never cleaned up.
  2. After deleting ReplicaB and waiting for the volume to rebuild from ReplicaA, the data is consistent.
