
[BUG] Filesystem corruption and not mountable PVC's #3895

Open
drz9 opened this issue Apr 26, 2022 · 6 comments

drz9 commented Apr 26, 2022

Describe the bug

Hello, on many of our production clusters the PVCs regularly become unmountable for their Pods, and we see VolumeAttach errors that often result in filesystem corruption (we use XFS). This causes a lot of problems because the PVCs mostly hold database data (MongoDB, MariaDB). In most cases a manual mount on a node followed by xfs_repair resolves the issue, but in some cases the data in the databases is also corrupt, so we need to restore that as well.
We are using Longhorn v1.1.1 and the cloud is vCloud Director.
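For reference, the manual workaround looks roughly like this (a sketch; the volume name is a placeholder, and the volume has to be detached from the workload and unmounted first):

    # on the node the volume is currently attached to
    VOLUME=pvc-<uuid>                                # placeholder volume name
    xfs_repair -n /dev/longhorn/$VOLUME              # dry run, only report problems
    xfs_repair /dev/longhorn/$VOLUME                 # actual repair, filesystem must be unmounted
    mkdir -p /mnt/check && mount -t xfs /dev/longhorn/$VOLUME /mnt/check   # verify it mounts again
    umount /mnt/check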

To Reproduce

This is not really reproducible, but we suspect a CPU spike on one of the worker nodes causes the node to become unresponsive to Longhorn. My expectation, however, would be that another replica (we have 3 configured) then takes over, which is not the case. We also don't really see high CPU warnings in Grafana or our other monitoring (see the quick check sketched below).
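As a quick cross-check of the CPU theory outside Grafana (a sketch, assuming metrics-server is installed; the node name is a placeholder):

    kubectl top nodes                                             # current CPU/memory usage per node
    kubectl describe node <worker-node> | grep -A 8 Conditions    # pressure conditions, NotReady flaps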

Expected behavior

Log or Support bundle

Attached.
longhorn-support-bundle_ce287a13-569f-42f1-85ce-f29a30945e15_2022-04-26T06-11-56Z.zip

Environment

  • Longhorn version: 1.1.1
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl with kustomize
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: RKE, Kubernetes version 1.20.9
    • Number of management nodes in the cluster: 3
    • Number of worker nodes in the cluster: 7
  • Node config
    • OS type and version: ubuntu 20.04
    • CPU per node: 4 vCPU
    • Memory per node: 32GB
    • Disk type(e.g. SSD/NVMe): SSD
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): vCloud
  • Number of Longhorn volumes in the cluster: 25

Additional context

I would be very pleased if someone could analyze the support bundle and maybe give us a hint on what is going on and how we can solve this issue.

@drz9 drz9 added the kind/bug label Apr 26, 2022

derekbit commented Apr 26, 2022

We have seen similar xfs corruption before. During development, we found ext4 to be more stable than xfs. Can you use ext4 instead?
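For new volumes, the filesystem is chosen via the StorageClass. A minimal sketch, assuming the fsType parameter from Longhorn's example StorageClass (the class name longhorn-ext4 and the filename are placeholders; existing volumes keep their current filesystem, so please verify the parameters against your Longhorn version):

    # longhorn-ext4-storageclass.yaml (hypothetical filename), apply with: kubectl apply -f longhorn-ext4-storageclass.yaml
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: longhorn-ext4               # placeholder name
    provisioner: driver.longhorn.io
    allowVolumeExpansion: true
    parameters:
      numberOfReplicas: "3"
      staleReplicaTimeout: "2880"
      fsType: "ext4"                    # format new volumes with ext4 instead of xfs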


drz9 commented Apr 26, 2022

@derekbit Thanks for your reply. We also found another issue discussing the same problem, where people likewise reported ext4 to be more stable than xfs. We are considering the switch, but it will not really address the root cause. In my opinion filesystem corruption is a symptom of another cause that has to be found and fixed first. During these errors we see connection timeouts in Longhorn:

level=error msg="GRPC call: /csi.v1.Node/NodeGetVolumeStats request: {"volume_id":"pvc-91ab2ad0-078b-4690-8244-50b81cff3371","volume_path":"/var/lib/kubelet/pods/d6a4244b-d429-46aa-8fee-111036c65551/volumes/kubernetes.io~csi/pvc-91ab2ad0-078b-4690-8244-50b81cff3371/mount"} failed with error: rpc error: code = Internal desc = Get "http://longhorn-manager:9500/v1/volumes/pvc-91ab2ad0-078b-4690-8244-50b81cff3371\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
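The "context deadline exceeded" above means the CSI plugin could not reach the longhorn-manager API in time. A rough way to reproduce that call by hand (a sketch; the namespace, service and label names follow a standard Longhorn install, and the PVC name is taken from the error above):

    # are all longhorn-manager pods up?
    kubectl -n longhorn-system get pods -l app=longhorn-manager -o wide
    # query the same manager endpoint the CSI plugin uses (via port-forward, so this
    # checks the manager API itself, not the node-local network path)
    kubectl -n longhorn-system port-forward svc/longhorn-manager 9500:9500 &
    curl -s --max-time 5 http://localhost:9500/v1/volumes/pvc-91ab2ad0-078b-4690-8244-50b81cff3371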


drz9 commented Apr 29, 2022

So basically we have figured out that we experience high CPU on that specific node, causing it to become unresponsive for a certain amount of time.
If I understand correctly, Longhorn then cannot reach that node and the underlying PVC, which makes the volume unresponsive and leads to I/O timeouts and errors. In roughly 30% of these cases (a rough estimate) this ends in filesystem corruption.
However, as far as I understand how Longhorn should work, one of the replicas on another node could take over during that time to serve the data, or is this not how Longhorn is intended to behave? What can we do to make Longhorn more resilient in these cases? Or is this a known error in our specific version that is perhaps fixed in a newer one?

@PhanLe1010 (Contributor) commented

We recently identified a bug in Longhorn that may corrupt the filesystem. More details are at #4354

@drz9 To verify, could you provide us with the filesystem info of your problematic volumes, if possible? (A combined sketch of these steps follows the list below.)

  1. SSH into the node that the Longhorn volume is currently attached to
  2. If it is ext4, run tune2fs -l /dev/longhorn/<longhorn-volume-name>
  3. If it is xfs, run xfs_info /dev/longhorn/<longhorn-volume-name>
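A combined sketch of the steps above (run on the node the volume is attached to; the device path is a placeholder):

    DEV=/dev/longhorn/<longhorn-volume-name>   # placeholder
    blkid $DEV                                 # shows the filesystem type
    tune2fs -l $DEV                            # if ext4: dump superblock parameters
    xfs_info $DEV                              # if xfs: print filesystem geometry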

@innobead (Member) commented

cc @shuo-wu @derekbit

@derekbit (Member) commented

@drz9
Do you still run into the filesystem corruption issue after upgrading to v1.3.2 or a newer version?
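Side note on the upgrade path, since the original install was done with kubectl: a minimal sketch, assuming the usual upstream manifest URL for the release; please follow the official Longhorn upgrade documentation for your installation method and take backups first.

    # check the currently deployed manager image/version
    kubectl -n longhorn-system get ds longhorn-manager -o jsonpath='{.spec.template.spec.containers[0].image}'
    # apply the manifest of the target release (URL pattern assumed from the Longhorn repo layout)
    kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/v1.3.2/deploy/longhorn.yaml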
