
[BUG] Longhorn may keep corrupted salvaged replicas and discard good ones #7425

Closed
ejweber opened this issue Dec 22, 2023 · 6 comments
Assignees
Labels
area/stability (System or volume stability)
area/volume-attach-detach (Volume attach & detach related)
backport/1.5.4
backport/1.6.1
component/longhorn-manager (Longhorn manager, control plane)
investigation-needed (Need to identify the case before estimating and starting the development)
kind/bug
priority/0 (Must be fixed in this release, managed by PO)
require/auto-e2e-test (Require adding/updating auto e2e test cases if they can be automated)
require/backport (Require backport. Only used when the specific versions to backport have not been defined.)
require/qa-review-coverage (Require QA to review coverage)
Milestone

Comments

@ejweber
Contributor

ejweber commented Dec 22, 2023

Describe the bug (🐛 if you encounter this issue)

In a Harvester cluster, we saw many volumes in an attach/detach loop. The instance-manager-e responsible for starting each engine logged the following on each loop:

2023-12-14T08:58:51.406122904Z [pvc-8c39063b-e4ee-4786-a1bf-f3c3056e77bb-e-4f29712b] time="2023-12-14T08:58:51Z" level=info msg="Starting with replicas [\"tcp://10.52.3.78:10885\" \"tcp://10.52.1.130:10885\"]"
2023-12-14T08:58:51.406575780Z [pvc-8c39063b-e4ee-4786-a1bf-f3c3056e77bb-e-4f29712b] time="2023-12-14T08:58:51Z" level=info msg="Connecting to remote: 10.52.3.78:10885"
2023-12-14T08:58:51.413610249Z [pvc-8c39063b-e4ee-4786-a1bf-f3c3056e77bb-e-4f29712b] time="2023-12-14T08:58:51Z" level=info msg="Opening: 10.52.3.78:10885"
2023-12-14T08:58:51.448733315Z [pvc-8c39063b-e4ee-4786-a1bf-f3c3056e77bb-e-4f29712b] time="2023-12-14T08:58:51Z" level=warning msg="failed to create backend with address tcp://10.52.3.78:10885: failed to open replica 10.52.3.78:10885 from remote: rpc error: code = Unknown desc = Failed to find metadata for volume-snap-26f06ba1-bc82-40a4-b506-ff1190f65ba3.img"
2023-12-14T08:58:51.448745943Z time="2023-12-14T08:58:51Z" level=info msg="Connecting to remote: 10.52.1.130:10885"
2023-12-14T08:58:51.451211278Z [pvc-8c39063b-e4ee-4786-a1bf-f3c3056e77bb-e-4f29712b] time="2023-12-14T08:58:51Z" level=info msg="Opening: 10.52.1.130:10885"
2023-12-14T08:58:51.464732493Z [pvc-8c39063b-e4ee-4786-a1bf-f3c3056e77bb-e-4f29712b] time="2023-12-14T08:58:51Z" level=warning msg="failed to create backend with address tcp://10.52.1.130:10885: failed to open replica 10.52.1.130:10885 from remote: rpc error: code = Unknown desc = Failed to find metadata for volume-snap-85e0c320-8232-4428-8bef-ca535e2516a5.img"
2023-12-14T08:58:51.465185634Z [pvc-8c39063b-e4ee-4786-a1bf-f3c3056e77bb-e-4f29712b] time="2023-12-14T08:58:51Z" level=info msg="Adding backend: tcp://10.52.3.78:10885"
2023-12-14T08:58:51.465200274Z time="2023-12-14T08:58:51Z" level=info msg="Adding backend: tcp://10.52.1.130:10885"

At this point, there were two replicas, and both were somehow corrupted (each was missing the .meta file corresponding to one of its .img files).

Both replicas appeared healthy to Longhorn, with healthyAt but no failedAt.

healthyAt: "2023-12-11T08:22:46Z"
healthyAt: "2023-12-11T07:54:46Z"

We were able to look at the data directory on disk for one of the replicas:

node:~ # kubectl -n longhorn-system exec instance-manager-r-46ea085f -- ls -al /host/var/lib/harvester/extra-disks/0b70300b9351baf9e032ab76cfbeb918/replicas/pvc-8c39063b-e4ee-4786-a1bf-f3c3056e77bb-832080a6

total 32

drwx------  2 root root          4096 Dec 12 13:09 .
drwxr-xr-x 57 root root         12288 Dec 20 01:31 ..
-rw-------  1 root root          4096 Dec 12 13:09 revision.counter
-rw-r--r--  1 root root 4398046511104 Dec 12 13:09 volume-head-007.img
-rw-r--r--  1 root root           178 Dec 12 13:09 volume-head-007.img.meta
-rw-r--r--  1 root root 4398046511104 Dec 12 13:09 volume-snap-26f06ba1-bc82-40a4-b506-ff1190f65ba3.img
-rw-r--r--  1 root root 4398046511104 Dec 12 13:09 volume-snap-af2454a6-b3b5-4de6-a6a6-eb465abfca97.img
-rw-r--r--  1 root root           178 Dec 12 13:09 volume-snap-af2454a6-b3b5-4de6-a6a6-eb465abfca97.img.meta
-rw-r--r--  1 root root           197 Dec 12 13:09 volume.meta

node:~ # kubectl -n longhorn-system exec instance-manager-r-46ea085f -- grep -r "" /host/var/lib/harvester/extra-disks/0b70300b9351baf9e032ab76cfbeb918/replicas/pvc-8c39063b-e4ee-4786-a1bf-f3c3056e77bb-832080a6
/host/var/lib/harvester/extra-disks/0b70300b9351baf9e032ab76cfbeb918/replicas/pvc-8c39063b-e4ee-4786-a1bf-f3c3056e77bb-832080a6/volume-snap-af2454a6-b3b5-4de6-a6a6-eb465abfca97.img.meta:{"Name":"volume-head-010.img","Parent":"volume-snap-26f06ba1-bc82-40a4-b506-ff1190f65ba3.img","Removed":false,"UserCreated":false,"Created":"2023-12-12T13:08:59Z","Labels":null}
/host/var/lib/harvester/extra-disks/0b70300b9351baf9e032ab76cfbeb918/replicas/pvc-8c39063b-e4ee-4786-a1bf-f3c3056e77bb-832080a6/volume.meta:{"Size":4398046511104,"Head":"volume-head-007.img","Dirty":false,"Rebuilding":true,"Error":"","Parent":"volume-snap-af2454a6-b3b5-4de6-a6a6-eb465abfca97.img","SectorSize":512,"BackingFilePath":""}
Binary file /host/var/lib/harvester/extra-disks/0b70300b9351baf9e032ab76cfbeb918/replicas/pvc-8c39063b-e4ee-4786-a1bf-f3c3056e77bb-832080a6/volume-head-007.img matches
Binary file /host/var/lib/harvester/extra-disks/0b70300b9351baf9e032ab76cfbeb918/replicas/pvc-8c39063b-e4ee-4786-a1bf-f3c3056e77bb-832080a6/volume-snap-af2454a6-b3b5-4de6-a6a6-eb465abfca97.img matches
/host/var/lib/harvester/extra-disks/0b70300b9351baf9e032ab76cfbeb918/replicas/pvc-8c39063b-e4ee-4786-a1bf-f3c3056e77bb-832080a6/volume-head-007.img.meta:{"Name":"volume-head-001.img","Parent":"volume-snap-af2454a6-b3b5-4de6-a6a6-eb465abfca97.img","Removed":false,"UserCreated":false,"Created":"2023-12-12T13:08:59Z","Labels":null}
Binary file /host/var/lib/harvester/extra-disks/0b70300b9351baf9e032ab76cfbeb918/replicas/pvc-8c39063b-e4ee-4786-a1bf-f3c3056e77bb-832080a6/revision.counter matches
Binary file /host/var/lib/harvester/extra-disks/0b70300b9351baf9e032ab76cfbeb918/replicas/pvc-8c39063b-e4ee-4786-a1bf-f3c3056e77bb-832080a6/volume-snap-26f06ba1-bc82-40a4-b506-ff1190f65ba3.img matches

volume-snap-26f06ba1-bc82-40a4-b506-ff1190f65ba3.img was missing a corresponding .meta file.

We were lucky enough to capture the reason for this in the logs:

2023-12-14T09:10:17.456420177Z time="2023-12-14T09:10:17Z" level=warning msg="pvc-8c39063b-e4ee-4786-a1bf-f3c3056e77bb-r-2afa33f4: time=\"2023-12-12T13:09:34Z\" level=error msg=\"Sync agent gRPC server failed to rebuild replica/sync files: replica tcp://10.52.2.176:10720 failed to send file volume-snap-26f06ba1-bc82-40a4-b506-ff1190f65ba3.img to 10.52.3.78:11431: failed to send file volume-snap-26f06ba1-bc82-40a4-b506-ff1190f65ba3.img to 10.52.3.78:11431: rpc error: code = Unknown desc = failed to SyncContent for source: volume-snap-26f06ba1-bc82-40a4-b506-ff1190f65ba3.img: syncHoleInterval [      32:     256](224) failed, err: sendHole failed, err: Post \\\"http://10.52.3.78:11431/v1-ssync/sendHole?begin=131072&end=1048576\\\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)\""

The rebuild timed out. We always transfer the .meta file after the .img file, so this replica was left in a broken state. It shouldn't have been a big deal. Presumably we marked the replica as failed with a failedAt and attempted to rebuild it. (We can't tell from the logs whether this was the case.) The same thing probably happened to the other replica. So, at one point, we likely had three replicas. Two were broken and one was whole.
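To make the ordering concrete, here is a minimal Go sketch of the transfer order described above, using hypothetical helper names rather than the actual longhorn-engine sync code: the data file is sent first and its metadata file second, so a timeout during the data transfer leaves a .img with no .img.meta, which is exactly the on-disk state shown earlier.

```go
// Hypothetical sketch of the rebuild file ordering described above; not the
// real longhorn-engine sync-agent implementation.
package main

import (
	"errors"
	"fmt"
)

// syncFile stands in for the real file transfer; here it simply fails for the
// large data file to simulate the "context deadline exceeded" timeout seen in
// the log above.
func syncFile(name string) error {
	if name == "volume-snap-26f06ba1-bc82-40a4-b506-ff1190f65ba3.img" {
		return errors.New("context deadline exceeded (Client.Timeout exceeded while awaiting headers)")
	}
	return nil
}

// syncSnapshot transfers the data file first and the metadata file second, so
// a timeout during the data transfer means the .img.meta is never sent and
// the receiving replica is left half-rebuilt.
func syncSnapshot(snap string) error {
	if err := syncFile(snap + ".img"); err != nil {
		return fmt.Errorf("failed to send file %s.img: %w", snap, err)
	}
	return syncFile(snap + ".img.meta")
}

func main() {
	if err := syncSnapshot("volume-snap-26f06ba1-bc82-40a4-b506-ff1190f65ba3"); err != nil {
		fmt.Println("rebuild failed:", err)
	}
}
```

Run as-is, it only prints the simulated rebuild failure; the point is that nothing ever writes the .meta once the .img transfer errors out.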

Somehow, we lost the whole replica and kept one broken one. I think it could have happened like this (a rough sketch of this sequence follows the list):

  • At some point, the node with the whole replica goes down. Now, all replicas are failed.
  • The autosalvage logic kicks in. Since the node with the whole replica is offline, we autosalvage with the broken replicas.
  • We remove the failedAt from the broken replicas and attempt an autosalvage.
  • The engine fails to start (of course) because the replicas are broken.
  • At some point, the node with the whole replica comes back. It is actually the only replica we want, but it has a failedAt and the broken ones do not.
  • After 30 minutes, we delete the whole replica.
  • Data is lost.
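The sequence above can be summarized with a rough, self-contained Go sketch (simplified types and field names, not the actual longhorn-manager code): autosalvage clears failedAt only on reachable replicas that have healthyAt set, and any replica still carrying an old failedAt is later treated as stale and deleted.

```go
// Simplified model of the hypothesized failure sequence above; not the real
// longhorn-manager types or controller logic.
package main

import (
	"fmt"
	"time"
)

// Replica keeps only the fields relevant to the sequence above.
type Replica struct {
	Name      string
	HealthyAt time.Time
	FailedAt  time.Time
	Reachable bool
}

// autosalvage clears FailedAt on replicas that look healthy (HealthyAt set)
// and whose node is reachable, mirroring the hypothesized salvage step.
func autosalvage(replicas []*Replica) {
	for _, r := range replicas {
		if r.Reachable && !r.HealthyAt.IsZero() {
			r.FailedAt = time.Time{}
		}
	}
}

// cleanupStale deletes any replica whose FailedAt is older than the stale
// replica timeout; after autosalvage, only the good-but-offline replica still
// carries a FailedAt, so it is the one that gets deleted.
func cleanupStale(replicas []*Replica, staleTimeout time.Duration, now time.Time) {
	for _, r := range replicas {
		if !r.FailedAt.IsZero() && now.Sub(r.FailedAt) > staleTimeout {
			fmt.Printf("deleting stale replica %s (failed %s ago)\n", r.Name, now.Sub(r.FailedAt))
		}
	}
}

func main() {
	now := time.Now()
	broken := &Replica{Name: "broken", HealthyAt: now.Add(-48 * time.Hour), FailedAt: now.Add(-time.Hour), Reachable: true}
	good := &Replica{Name: "good", HealthyAt: now.Add(-48 * time.Hour), FailedAt: now.Add(-time.Hour), Reachable: false}

	autosalvage([]*Replica{broken, good})                       // the broken replica's FailedAt is cleared
	cleanupStale([]*Replica{broken, good}, 30*time.Minute, now) // the good replica is selected for deletion
}
```

Running this prints that the good replica is the one selected for deletion, which matches the data-loss outcome hypothesized above.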

To Reproduce

I still need to try to reproduce this. We can probably:

  • Create a volume with two replicas.
  • Intentionally corrupt (delete a .meta file) one of the replicas, then crash it.
  • Restart the node running the other healthy replica.
  • Observe the behavior.

Expected behavior

We can distinguish the actual healthy replica from the replicas that failed to be salvaged. We do not delete the actual healthy replica. Eventually, the healthy replica is able to rebuild other healthy replicas.

Support bundle for troubleshooting

I cannot share the support bundle here, but we have one that we can re-examine.

Environment

  • Longhorn version: v1.3.2 (probably affects newer versions)
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Harvester
  • Number of Longhorn volumes in the cluster: hundreds

Additional context

There is a replica.spec.salvageRequested field and a replica.status.salvageExecuted field that our current code doesn't use. Maybe we can use these to track that certain replicas were never successfully salvaged.
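As a purely hypothetical sketch of that idea (the field names come from the paragraph above, but this guard is not existing longhorn-manager behavior): mark a replica when autosalvage is requested for it, only mark the salvage as executed once the engine actually starts with it, and refuse to clean up stale replicas in the volume while any replica is stuck in the requested-but-never-executed state.

```go
// Hypothetical use of the salvage-tracking fields mentioned above; the real
// CRDs differ, and this is not how longhorn-manager currently behaves.
package main

import "fmt"

type ReplicaSpec struct {
	SalvageRequested bool
	FailedAt         string
}

type ReplicaStatus struct {
	SalvageExecuted bool
}

type Replica struct {
	Name   string
	Spec   ReplicaSpec
	Status ReplicaStatus
}

// staleCleanupAllowed is a hypothetical guard: if any replica in the volume
// had a salvage requested that never executed (the engine never started with
// it), hold off on deleting other stale replicas, since one of them may be
// the only intact copy of the data.
func staleCleanupAllowed(replicas []*Replica) bool {
	for _, r := range replicas {
		if r.Spec.SalvageRequested && !r.Status.SalvageExecuted {
			return false
		}
	}
	return true
}

func main() {
	replicas := []*Replica{
		{Name: "salvaged-but-broken", Spec: ReplicaSpec{SalvageRequested: true}},
		{Name: "good-but-stale", Spec: ReplicaSpec{FailedAt: "2023-12-14T09:00:00Z"}},
	}
	fmt.Println("stale cleanup allowed:", staleCleanupAllowed(replicas)) // false
}
```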

@ejweber ejweber added kind/bug require/qa-review-coverage Require QA to review coverage require/backport Require backport. Only used when the specific versions to backport have not been defined. labels Dec 22, 2023
@innobead innobead added this to the v1.6.0 milestone Dec 23, 2023
@innobead innobead added component/longhorn-manager Longhorn manager (control plane) priority/0 Must be fixed in this release (managed by PO) labels Dec 23, 2023
@ejweber
Contributor Author

ejweber commented Dec 27, 2023

I'm not sure our autosalvage mechanism is the main culprit here after all. In all relevant versions of Longhorn, we only autosalvage with replicas that have healthyAt != "".

https://github.com/longhorn/longhorn-manager/blob/c4e7942684cc1f8ece900854d09126a7b1f8c0b6/controller/volume_controller.go#L1266-L1268

As soon as we decide to rebuild a replica for the first time, we set healthyAt = "".

https://github.com/longhorn/longhorn-manager/blob/c4e7942684cc1f8ece900854d09126a7b1f8c0b6/controller/volume_controller.go#L1736

We only set healthyAt to any value when a successful rebuild has completed and a replica is found in the replicaModeMap of the correct engine.

https://github.com/longhorn/longhorn-manager/blob/c4e7942684cc1f8ece900854d09126a7b1f8c0b6/controller/volume_controller.go#L609

The corrupted replicas in our support bundle have healthyAt set. This makes them candidates for autosalvage and indistinguishable from the actual healthy replica. We need to identify how they can have healthyAt set if they failed during rebuilding.
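Putting the three linked snippets together, a condensed Go sketch of the healthyAt lifecycle described above (field and parameter names are simplified approximations, not a verbatim excerpt from volume_controller.go):

```go
// Condensed, simplified restatement of the healthyAt lifecycle discussed in
// this comment; approximations only, not the actual longhorn-manager source.
package main

import (
	"fmt"
	"time"
)

type Replica struct {
	Name      string
	HealthyAt string
	FailedAt  string
}

// startRebuild: healthyAt is cleared as soon as a replica is chosen for
// rebuilding (per the second linked line).
func startRebuild(r *Replica) {
	r.HealthyAt = ""
}

// onEngineStatus: healthyAt is only set back once the replica is found in the
// replicaModeMap of the correct engine, i.e. after a successful rebuild (per
// the third linked line).
func onEngineStatus(r *Replica, replicaModeMap map[string]string) {
	if _, ok := replicaModeMap[r.Name]; ok {
		r.HealthyAt = time.Now().UTC().Format(time.RFC3339)
	}
}

// isSalvageCandidate: autosalvage only considers replicas with a non-empty
// healthyAt (per the first linked line). A replica that failed mid-rebuild
// should therefore never qualify, which is why the corrupted replicas having
// healthyAt set is the open question.
func isSalvageCandidate(r *Replica) bool {
	return r.HealthyAt != ""
}

func main() {
	r := &Replica{Name: "tcp://10.52.3.78:10885", HealthyAt: "2023-12-11T08:22:46Z"}
	startRebuild(r)                        // healthyAt cleared when the rebuild starts
	onEngineStatus(r, map[string]string{}) // rebuild failed: never re-added to the map
	fmt.Println("salvage candidate:", isSalvageCandidate(r)) // false, as expected after a failed rebuild
}
```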

@innobead innobead added the investigation-needed Need to identify the case before estimating and starting the development label Jan 2, 2024
@innobead innobead modified the milestones: v1.6.0, v1.7.0 Jan 2, 2024
@ChanYiLin
Contributor

Hi @ejweber, I found this is quite similar to this issue as well. We also don't have a solution for that one, only a prevention. You may take a look:
#4212 (comment)

@ejweber
Contributor Author

ejweber commented Jan 4, 2024

Thanks @ChanYiLin! I agree that the linked issue is quite relevant! I'm not seeing any real evidence that invalid state rebuilding is a part of the issue the user observed, but the end result is the same: the broken (failed rebuilding) replica had a healthyAt, so we chose it for autosalvage. The "fix" I have been working to implement will prevent us from deleting potentially healthy replicas if this happens, but recovery will be awkward (we probably can't/shouldn't automate deleting the replica autosalvage chose in favor of a different one). It would be much better if we understood how a rebuilding replica maintained its healthyAt status.

@ejweber
Contributor Author

ejweber commented Jan 16, 2024

Longhorn v1.3.2 and older are more susceptible to this issue because:

These (possibly unintentional) protections are thwarted, however, if the volume is ever detached. Then, we clean up the stale (but good) replica after staleReplicaTimeout.

@longhorn-io-github-bot

longhorn-io-github-bot commented Jan 17, 2024

Pre Ready-For-Testing Checklist

@roger-ryao

Verified on master-head 20240306

The test steps

  1. Be more careful while cleaning up stale replicas longhorn-manager#2432 (comment)

Result: Passed

  1. If you wait any length of time, ReplicaA is never cleaned up.
  2. After deleting ReplicaB and waiting for the volume to rebuild from ReplicaA, the data is consistent.
