[BUG] Filesystem corruption and unmountable PVCs #3895
Comments
We saw similar XFS corruptions before. During development, we found ext4 to be more stable than XFS. Can you use ext4 instead?
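For reference, switching new volumes to ext4 is typically done via the StorageClass. A minimal sketch, assuming Longhorn's default provisioner and parameter names; the class name `longhorn-ext4` is illustrative, and existing volumes keep their filesystem — only newly provisioned PVCs are affected:

```sh
# Hypothetical StorageClass that provisions Longhorn volumes with ext4.
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-ext4           # illustrative name
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "3"         # matches the 3-replica setup described in this issue
  staleReplicaTimeout: "2880"
  fsType: "ext4"                # ext4 instead of the XFS used so far
EOF
```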
@derekbit Thanks for your reply. We also found another issue that discussed the same problem, where people likewise stated that ext4 was better than XFS. We are considering switching, but that would not really resolve the root cause: in my opinion, the filesystem corruptions are symptoms of an underlying problem that has to be found and fixed first. During these errors we see connection timeouts in Longhorn:

level=error msg="GRPC call: /csi.v1.Node/NodeGetVolumeStats request: {"volume_id":"pvc-91ab2ad0-078b-4690-8244-50b81cff3371","volume_path":"/var/lib/kubelet/pods/d6a4244b-d429-46aa-8fee-111036c65551/volumes/kubernetes.io~csi/pvc-91ab2ad0-078b-4690-8244-50b81cff3371/mount"} failed with error: rpc error: code = Internal desc = Get "http://longhorn-manager:9500/v1/volumes/pvc-91ab2ad0-078b-4690-8244-50b81cff3371": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
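The failing call in that log is the CSI plugin's HTTP request to the longhorn-manager API. One way to check whether the manager is reachable from a given node is to replay the same request by hand; a sketch, assuming the default longhorn-system namespace, the app=longhorn-manager label, and that curl is present in the manager image:

```sh
# Pick a longhorn-manager pod and replay the request from the log above.
POD=$(kubectl -n longhorn-system get pod -l app=longhorn-manager \
  -o jsonpath='{.items[0].metadata.name}')
# -m 5 bounds the wait; a hang here mirrors the "context deadline exceeded" error.
kubectl -n longhorn-system exec "$POD" -- \
  curl -s -m 5 http://longhorn-manager:9500/v1/volumes/pvc-91ab2ad0-078b-4690-8244-50b81cff3371
```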
So basically we can conclude that we are experiencing high CPU load on that specific node, causing it to become unresponsive for a certain amount of time.
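If metrics-server is installed, node-level pressure can be spot-checked directly; a quick sketch (the node name is a placeholder):

```sh
# Compare CPU usage across nodes; a single saturated node would stand out here.
kubectl top nodes
# Node conditions (MemoryPressure, PIDPressure, Ready flapping) hint at unresponsiveness.
kubectl describe node <node-name> | grep -A 6 'Conditions:'
```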
We recently identified a bug in Longhorn that may corrupt the filesystem. More details are at #4354. @drz9 To verify, could you provide us with the filesystem info of your problematic volumes, if possible?
Describe the bug
Hello, we have issues on many of our production clusters: the PVCs regularly become unmountable for the Pod, and we get VolumeAttach errors that often result in filesystem corruption (we use XFS). This causes a lot of trouble because the PVCs mostly hold database data (MongoDB, MariaDB). In most cases, manually mounting the volume on a node and running xfs_repair solves the issue, but in some cases the data in the databases is corrupt as well, so we need to restore that too.
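For context, the manual repair flow mentioned above looks roughly like the following. A sketch only: it assumes the volume is attached to the node (e.g., in maintenance mode) so that its block device appears under /dev/longhorn/, and the volume name is the one from the log earlier in this thread:

```sh
# Sketch of the manual recovery described above; run on the node the volume is attached to.
DEV=/dev/longhorn/pvc-91ab2ad0-078b-4690-8244-50b81cff3371
xfs_repair -n "$DEV"          # -n: inspect only, report inconsistencies without writing
xfs_repair "$DEV"             # actual repair; the filesystem must be unmounted, and a
                              # snapshot/backup beforehand is strongly advised
mkdir -p /mnt/recover
mount "$DEV" /mnt/recover     # verify the filesystem mounts again
```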
We are using Longhorn v1.1.1, and the cloud platform is vCloud Director.
To Reproduce
This is not really reproducible, but we suspect a CPU spike on one of the worker nodes causes that node to become unresponsive to Longhorn. However, my expectation would be that another replica (we have 3 configured) then takes over, which is not the case. Also, we don't really see high-CPU warnings in Grafana or our other monitoring.
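Whether the replicas actually degraded during these incidents can be read from Longhorn's CRDs; a sketch, assuming the default namespace and the longhornvolume label that Longhorn puts on replica objects:

```sh
# Volume robustness: "healthy", "degraded", or "faulted".
kubectl -n longhorn-system get volumes.longhorn.io \
  pvc-91ab2ad0-078b-4690-8244-50b81cff3371 -o jsonpath='{.status.robustness}{"\n"}'
# Per-replica state for the same volume (which node each replica runs on, and its health).
kubectl -n longhorn-system get replicas.longhorn.io \
  -l longhornvolume=pvc-91ab2ad0-078b-4690-8244-50b81cff3371
```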
Expected behavior
If a node becomes temporarily unresponsive, another of the three replicas takes over and the volume stays mountable, without filesystem corruption.
Log or Support bundle
Attached.
longhorn-support-bundle_ce287a13-569f-42f1-85ce-f29a30945e15_2022-04-26T06-11-56Z.zip
Environment
Number of management nodes in the cluster: 3
Number of worker nodes in the cluster: 7
Additional context
I would be very pleased if someone could analyze the support bundle and perhaps give us a hint about what is going on and how we can solve this issue.