
[BUG] Filesystem corruption and not mountable PVC's #3895

Open
drz9 opened this issue Apr 26, 2022 · 6 comments

drz9 commented Apr 26, 2022

Describe the bug

Hello, on many of our production clusters the PVCs regularly become unmountable for their Pods, and we see VolumeAttach errors that often result in filesystem corruption (we use XFS). This causes a lot of problems because the PVCs mostly hold database data (MongoDB, MariaDB). In most cases a manual mount on a node followed by xfs_repair resolves the issue, but in some cases the data in the databases is also corrupt, so we need to restore that as well.
We are using Longhorn v1.1.1 and the cloud is vCloud Director.
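For reference, the manual workaround looks roughly like this (a sketch; the volume name is a placeholder, and the volume has to be detached from the workload and unmounted first):

    # on the node the volume is currently attached to
    VOLUME=pvc-<uuid>                                # placeholder volume name
    xfs_repair -n /dev/longhorn/$VOLUME              # dry run, only report problems
    xfs_repair /dev/longhorn/$VOLUME                 # actual repair, filesystem must be unmounted
    mkdir -p /mnt/check && mount -t xfs /dev/longhorn/$VOLUME /mnt/check   # verify it mounts again
    umount /mnt/check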

To Reproduce

This is not really reproducible, but we suspect a CPU spike on one of the worker nodes causes the node to become unresponsive to Longhorn. My expectation, however, would be that another replica (we have 3 configured) then takes over, which is not the case. We also don't really see high CPU warnings in Grafana or our other monitoring (see the quick check sketched below).
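As a quick cross-check of the CPU theory outside Grafana (a sketch, assuming metrics-server is installed; the node name is a placeholder):

    kubectl top nodes                                             # current CPU/memory usage per node
    kubectl describe node <worker-node> | grep -A 8 Conditions    # pressure conditions, NotReady flaps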

Expected behavior

Log or Support bundle

Attached.
longhorn-support-bundle_ce287a13-569f-42f1-85ce-f29a30945e15_2022-04-26T06-11-56Z.zip

Environment

  • Longhorn version: 1.1.1
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl with kustomize
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: RKE, Kubernetes version 1.20.9
    • Number of management nodes in the cluster: 3
    • Number of worker nodes in the cluster: 7
  • Node config
    • OS type and version: ubuntu 20.04
    • CPU per node: 4 vCPU
    • Memory per node: 32GB
    • Disk type(e.g. SSD/NVMe): SSD
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): vCloud
  • Number of Longhorn volumes in the cluster: 25

Additional context

I would be very pleased if someone could analyze the support bundle and maybe give us a hint on what is going on and how we can solve this issue.

@drz9 drz9 added the kind/bug label Apr 26, 2022

derekbit commented Apr 26, 2022

We have seen similar xfs corruption before. During development, we found ext4 to be more stable than xfs. Can you use ext4 instead?
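For new volumes, the filesystem is chosen via the StorageClass. A minimal sketch, assuming the fsType parameter from Longhorn's example StorageClass (the class name longhorn-ext4 and the filename are placeholders; existing volumes keep their current filesystem, so please verify the parameters against your Longhorn version):

    # longhorn-ext4-storageclass.yaml (hypothetical filename), apply with: kubectl apply -f longhorn-ext4-storageclass.yaml
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: longhorn-ext4               # placeholder name
    provisioner: driver.longhorn.io
    allowVolumeExpansion: true
    parameters:
      numberOfReplicas: "3"
      staleReplicaTimeout: "2880"
      fsType: "ext4"                    # format new volumes with ext4 instead of xfs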


drz9 commented Apr 26, 2022

@derekbit Thanks for your reply. We also found another issue discussing the same problem, where people likewise reported ext4 to be more stable than xfs. We are considering the switch, but it will not really address the root cause. In my opinion filesystem corruption is a symptom of another cause that has to be found and fixed first. During these errors we see connection timeouts in Longhorn:

level=error msg="GRPC call: /csi.v1.Node/NodeGetVolumeStats request: {"volume_id":"pvc-91ab2ad0-078b-4690-8244-50b81cff3371","volume_path":"/var/lib/kubelet/pods/d6a4244b-d429-46aa-8fee-111036c65551/volumes/kubernetes.io~csi/pvc-91ab2ad0-078b-4690-8244-50b81cff3371/mount"} failed with error: rpc error: code = Internal desc = Get "http://longhorn-manager:9500/v1/volumes/pvc-91ab2ad0-078b-4690-8244-50b81cff3371\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
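The "context deadline exceeded" above means the CSI plugin could not reach the longhorn-manager API in time. A rough way to reproduce that call by hand (a sketch; the namespace, service and label names follow a standard Longhorn install, and the PVC name is taken from the error above):

    # are all longhorn-manager pods up?
    kubectl -n longhorn-system get pods -l app=longhorn-manager -o wide
    # query the same manager endpoint the CSI plugin uses (via port-forward, so this
    # checks the manager API itself, not the node-local network path)
    kubectl -n longhorn-system port-forward svc/longhorn-manager 9500:9500 &
    curl -s --max-time 5 http://localhost:9500/v1/volumes/pvc-91ab2ad0-078b-4690-8244-50b81cff3371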


drz9 commented Apr 29, 2022

So basically we have figured out that we experience high CPU on that specific node, causing it to become unresponsive for a certain amount of time.
If I understand correctly, Longhorn then cannot reach that node and the underlying PVC, which makes the volume unresponsive and leads to I/O timeouts and errors. In roughly 30% of these cases (a rough estimate) this ends in filesystem corruption.
However, as far as I understand how Longhorn should work, one of the replicas on another node could take over during that time to serve the data, or is this not how Longhorn is intended to behave? What can we do to make Longhorn more resilient in these cases? Or is this a known error in our specific version that is perhaps fixed in a newer one?

@PhanLe1010 (Contributor) commented

We recently identified a bug in Longhorn that may corrupt the filesystem. More details are at #4354

@drz9 To verify, could you provide us with the filesystem info of your problematic volumes, if possible? (A combined sketch of these steps follows the list below.)

  1. SSH into the node that the Longhorn volume is currently attached to
  2. If it is ext4, run tune2fs -l /dev/longhorn/<longhorn-volume-name>
  3. If it is xfs, run xfs_info /dev/longhorn/<longhorn-volume-name>
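A combined sketch of the steps above (run on the node the volume is attached to; the device path is a placeholder):

    DEV=/dev/longhorn/<longhorn-volume-name>   # placeholder
    blkid $DEV                                 # shows the filesystem type
    tune2fs -l $DEV                            # if ext4: dump superblock parameters
    xfs_info $DEV                              # if xfs: print filesystem geometry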

@innobead (Member) commented

cc @shuo-wu @derekbit

@derekbit (Member) commented

@drz9
Do you still run into the filesystem corruption issue after upgrading to v1.3.2 or a newer version?
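Side note on the upgrade path, since the original install was done with kubectl: a minimal sketch, assuming the usual upstream manifest URL for the release; please follow the official Longhorn upgrade documentation for your installation method and take backups first.

    # check the currently deployed manager image/version
    kubectl -n longhorn-system get ds longhorn-manager -o jsonpath='{.spec.template.spec.containers[0].image}'
    # apply the manifest of the target release (URL pattern assumed from the Longhorn repo layout)
    kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/v1.3.2/deploy/longhorn.yaml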
