
[BUG] Failed replicas not being cleared from node #2461

Open
dbpolito opened this issue Apr 9, 2021 · 6 comments
Labels: component/longhorn-manager Longhorn manager (control plane), kind/bug

Comments


dbpolito commented Apr 9, 2021

Describe the bug

I had some instability on one node that caused some replicas to become corrupted and rebuilt, but it seems the disk space used by those failed replicas was never freed:

image

Environment:

  • Longhorn version: 1.1.0
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: EKS

I'm not sure that's really the problem, but the only thing I noticed is that this node's disk usage increased after all these replica rebuilds...

innobead added this to New in Community Issue Review via automation Apr 9, 2021
khushboo-rancher (Contributor) commented:

@dbpolito Can you check the staleReplicaTimeout for your storage class? A failed replica remains on the disk for some time, based on the staleReplicaTimeout.
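
For example, one way to read the current value off the StorageClass with kubectl (a sketch assuming the default StorageClass name longhorn; adjust the name if yours differs):

```sh
# Print the staleReplicaTimeout parameter of the Longhorn StorageClass.
# Assumes the StorageClass is named "longhorn" (the Helm chart default).
kubectl get storageclass longhorn \
  -o jsonpath='{.parameters.staleReplicaTimeout}{"\n"}'
```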


dbpolito commented Apr 9, 2021

I don't remember customizing that, so I guess it's the default: staleReplicaTimeout: "30"

That's not even customizable in the Helm chart: https://github.com/longhorn/charts/blob/master/charts/longhorn/templates/storageclass.yaml#L21

What unit is this? Minutes? Hours? Days?

innobead added the component/longhorn-manager Longhorn manager (control plane) label Apr 12, 2021

cclhsu commented Apr 12, 2021

staleReplicaTimeout: "30" is in minutes, as in the example at https://longhorn.io/docs/1.1.0/volumes-and-nodes/create-volumes/

jenting moved this from New to In progress in Community Issue Review Apr 13, 2021
jenting moved this from In progress to Pending user response in Community Issue Review Apr 13, 2021
dbpolito (Author) commented:

I see. Well, after a few days I still have this huge difference between nodes:

image

I can't tell what it is... I'm guessing it's related to these failed replicas, because this node had them, and its disk usage increased during that rebuild period and was never freed.

jenting moved this from Pending user response to In progress in Community Issue Review Apr 16, 2021
jenting self-assigned this Apr 16, 2021
jenting (Contributor) commented Apr 16, 2021

Could you check whether there are any orphaned replica directories under the host path /var/lib/longhorn/replicas/, i.e., directories that don't match any existing volume replica?
If so, you can manually delete the orphaned replicas on the host.

Right now, if the node goes down and comes back, Longhorn no longer knows which replicas are orphaned (i.e., Longhorn does not currently scan /var/lib/longhorn/replicas/). We'll put this in our backlog to see how we can enhance it.
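
For reference, a rough way to do that cross-check from the command line (a sketch assuming the default data path /var/lib/longhorn and kubectl access to the longhorn-system namespace; directory names on disk will not match the replica CR names exactly, so treat this as a starting point for a manual comparison):

```sh
# On the affected node: list the replica data directories Longhorn has on disk.
ls -lh /var/lib/longhorn/replicas/

# From a machine with kubectl access: list the replicas Longhorn still knows about.
kubectl -n longhorn-system get replicas.longhorn.io

# Any directory under /var/lib/longhorn/replicas/ that does not correspond to a
# replica still scheduled on this node is a candidate orphan and can be removed
# by hand, e.g.:
#   rm -rf /var/lib/longhorn/replicas/<orphaned-directory>
```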

jenting moved this from In progress to Backlog Candidates in Community Issue Review Apr 16, 2021
innobead moved this from Backlog Candidates to Resolved/Scheduled in Community Issue Review Apr 16, 2021
innobead added this to the Planning milestone Apr 16, 2021
joshimoo (Contributor) commented:

I left some quick thoughts on how replica deletion could be improved to ensure cleanup of replica data here:
#685 (comment)
