
[BUG] Failed replicas not being cleared from node #2461

Open
dbpolito opened this issue Apr 9, 2021 · 6 comments
Labels: component/longhorn-manager Longhorn manager (control plane), kind/bug

Comments


dbpolito commented Apr 9, 2021

Describe the bug

I had some instability on one node that caused some replicas to become corrupted and rebuilt, but it seems the disk space used by those failed replicas was never freed:

image

Environment:

  • Longhorn version: 1.1.0
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: EKS

I'm not sure that's really the problem, but the only thing I noticed is that this node's disk usage increased after all these replica rebuilds...

innobead added this to New in Community Issue Review via automation Apr 9, 2021
khushboo-rancher (Contributor) commented:

@dbpolito Can you check the staleReplicaTimeout for your storage class? A failed replica remains on the disk for some time, based on the staleReplicaTimeout.
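
For example, one way to read the current value off the StorageClass with kubectl (a sketch assuming the default StorageClass name longhorn; adjust the name if yours differs):

```sh
# Print the staleReplicaTimeout parameter of the Longhorn StorageClass.
# Assumes the StorageClass is named "longhorn" (the Helm chart default).
kubectl get storageclass longhorn \
  -o jsonpath='{.parameters.staleReplicaTimeout}{"\n"}'
```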


dbpolito commented Apr 9, 2021

I don't remember customizing that, so I guess it's the default: staleReplicaTimeout: "30"

That's not even customizable in the Helm chart: https://github.com/longhorn/charts/blob/master/charts/longhorn/templates/storageclass.yaml#L21

What unit is this? Minutes? Hours? Days?

innobead added the component/longhorn-manager Longhorn manager (control plane) label Apr 12, 2021

cclhsu commented Apr 12, 2021

staleReplicaTimeout: "30" is in minutes, as in the example at https://longhorn.io/docs/1.1.0/volumes-and-nodes/create-volumes/

jenting moved this from New to In progress in Community Issue Review Apr 13, 2021
jenting moved this from In progress to Pending user response in Community Issue Review Apr 13, 2021
dbpolito (Author) commented:

I see. Well, after a few days I still have this huge difference between nodes:

image

I can't tell what it is... I'm guessing it's related to these failed replicas, because this node had them, and its disk usage increased during that rebuild period and was never freed.

jenting moved this from Pending user response to In progress in Community Issue Review Apr 16, 2021
jenting self-assigned this Apr 16, 2021
jenting (Contributor) commented Apr 16, 2021

Could you check whether there are any orphaned replica directories under the host path /var/lib/longhorn/replicas/, i.e., directories that don't match any existing volume replica?
If so, you can manually delete the orphaned replicas on the host.

Right now, if the node goes down and comes back, Longhorn no longer knows which replicas are orphaned (i.e., Longhorn does not currently scan /var/lib/longhorn/replicas/). We'll put this in our backlog to see how we can enhance it.
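
For reference, a rough way to do that cross-check from the command line (a sketch assuming the default data path /var/lib/longhorn and kubectl access to the longhorn-system namespace; directory names on disk will not match the replica CR names exactly, so treat this as a starting point for a manual comparison):

```sh
# On the affected node: list the replica data directories Longhorn has on disk.
ls -lh /var/lib/longhorn/replicas/

# From a machine with kubectl access: list the replicas Longhorn still knows about.
kubectl -n longhorn-system get replicas.longhorn.io

# Any directory under /var/lib/longhorn/replicas/ that does not correspond to a
# replica still scheduled on this node is a candidate orphan and can be removed
# by hand, e.g.:
#   rm -rf /var/lib/longhorn/replicas/<orphaned-directory>
```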

jenting moved this from In progress to Backlog Candidates in Community Issue Review Apr 16, 2021
innobead moved this from Backlog Candidates to Resolved/Scheduled in Community Issue Review Apr 16, 2021
innobead added this to the Planning milestone Apr 16, 2021
joshimoo (Contributor) commented:

I left some quick thoughts on how replica deletion could be improved to ensure cleanup of replica data here:
#685 (comment)
