[BUG] Pod mount takes a long time even though PV/PVC is bound #2590
I am adding the same comment from #2583. I often notice this issue: on random nodes, the CSI plugin is stuck on the mount command. I also noticed that after restarting longhorn-csi-plugin-____ for the problematic node, the issue is resolved, but it appears on another node after a few days. Could this be related to the amount of data (about 10 GB) the RWX volume holds, or to the number of pods associated with this volume? We are using RWX volumes with Argo Workflows. Hundreds of pods are launched with that specific RWX volume; these pods run to complete some steps, then new pods are launched. All of those hundreds of pods consume the same RWX volume.
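For context, a minimal sketch of the kind of RWX claim described above; the claim name, StorageClass name, and size are illustrative, not taken from the reporter's setup:

```yaml
# Sketch: an RWX claim shared by many Argo workflow pods.
# Names and size are illustrative assumptions.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: argo-shared-data        # hypothetical name
spec:
  accessModes:
    - ReadWriteMany             # RWX: many pods mount the same volume
  storageClassName: longhorn    # assuming the default Longhorn StorageClass
  resources:
    requests:
      storage: 10Gi             # roughly the data size mentioned above
```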
cc @joshimoo
@khushboo-rancher please help reproduce this one and attach the support bundle for debugging reference. cc @kaxing
Besides that, I encountered another issue on Kubernetes v1.21.0:

```
2021-05-12T05:03:40.341728+00:00 jenting-longhorn-master-0 k3s[17498]: E0512 05:03:40.341637 17498 remote_runtime.go:394] "ExecSync cmd from runtime service failed" err="rpc error: code = DeadlineExceeded desc = failed to exec in container: timeout 1s exceeded: context deadline exceeded" containerID="2772def98866331ae76c100ab412ce1961a1e35e19bf4f9f9345077a0d733f9b" cmd=[sh -c ls /data/longhorn && /data/longhorn version --client-only]
```

The workaround is to increase the readiness probe timeout.
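A minimal sketch of that workaround; the probe command is taken from the log above, but the container name, image tag, and timeout value are assumptions, not the actual Longhorn manifests:

```yaml
# Sketch: raise the exec probe timeout above the 1s default,
# which the log shows expiring ("timeout 1s exceeded").
containers:
  - name: longhorn-csi-plugin                    # container name assumed
    image: longhornio/longhorn-manager:v1.1.1    # illustrative tag
    readinessProbe:
      exec:
        command:
          - sh
          - -c
          - ls /data/longhorn && /data/longhorn version --client-only
      timeoutSeconds: 5                          # default is 1s; raised here
      periodSeconds: 10
```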
@jenting increased the readiness probe timeout and changed some liveness probes from exec to socket in longhorn/longhorn-manager#928, but that is not a direct fix for this issue; it was another finding while testing on 1.21.0.
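A sketch of the exec-to-socket change mentioned above; the port is hypothetical, and the actual values are in longhorn/longhorn-manager#928:

```yaml
# Sketch: probe a TCP socket instead of exec'ing into the container,
# which avoids the ExecSync deadline entirely. Port is illustrative.
livenessProbe:
  tcpSocket:
    port: 9500          # hypothetical; use the manager's actual listen port
  initialDelaySeconds: 10
  periodSeconds: 10
```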
Also, the recurring jobs have been removed from the storage class. Related: #3907
@longhorn/qa Can you help check if the issue is still valid? Thank you. |
@roger-ryao Could you please help with this whenever you get a chance?
Verified passed on master-head (longhorn-manager 1f86d76) with the following steps:
Describe the bug
Pod mount takes a long time even though the PV/PVC is bound.
The way to troubleshoot it is to get the kubelet log; generally, we can see what's happening from there.
To Reproduce
I can make it happen when the longhorn-manager is busy handling requests. My setup is 1 control plane node + 3 worker nodes, with Longhorn running on all 4 nodes on k3s v1.21.0+k3s1.
Steps to reproduce the behavior:
1. Create a recurring job with the cron schedule `*/1 * * * *` (every minute) so the longhorn-manager stays busy; see the StorageClass sketch after this list.
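A sketch of what such a setup looked like when recurring jobs were still declared on the StorageClass (this parameter has since been removed, per the comment above); the StorageClass name, job name, and retain count are illustrative:

```yaml
# Sketch: the old StorageClass-embedded recurring job, scheduled
# every minute to keep longhorn-manager busy. Since removed (#3907).
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-frequent-snap   # hypothetical name
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "3"
  recurringJobs: '[{"name":"snap", "task":"snapshot", "cron":"*/1 * * * *", "retain":1}]'
```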
Expected behavior
The Pod should be able to mount the volume quickly.
Log
If applicable, add the Longhorn managers' log when the issue happens.
You can also attach a Support Bundle here. You can generate a Support Bundle using the link at the footer of the Longhorn UI.
Environment:
Additional context
Since the longhorn-manager is the control plane of the Longhorn cluster, I'm thinking about how to keep the longhorn-manager from being impacted by things like recurring jobs running very often.
For example, if we run a recurring job every 3 minutes but the job needs 5 minutes to finish, it causes resource starvation. When we launch a new run of the recurring job, we should make sure the volume has no previous job still running.
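Kubernetes already models this guard for its own CronJobs; a sketch of the idea using `concurrencyPolicy: Forbid` (the job name, schedule, and workload are placeholders, not Longhorn's actual backup code):

```yaml
# Sketch: Forbid skips a new run while the previous one is still going,
# which is the "no previous job running" check described above.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: volume-recurring-job     # hypothetical
spec:
  schedule: "*/3 * * * *"        # every 3 minutes, as in the example
  concurrencyPolicy: Forbid      # do not overlap a still-running job
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: job
              image: busybox     # placeholder workload
              command: ["sh", "-c", "echo run recurring task"]
```

The same Job template would also be a natural home for the snapshot backup itself, so it runs outside the longhorn-manager Pod, as suggested next.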
Furthermore, the snapshot backup currently runs inside the longhorn-manager Pod; we should run it as a Job so it does not occupy longhorn-manager memory.
Or, the simplest way: run the longhorn-manager Pods on the control plane nodes while the other Longhorn components run on the worker nodes.
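A sketch of that placement, assuming the standard control-plane label and taint names (these vary across distributions, and k3s control-plane nodes may not be tainted at all):

```yaml
# Sketch: pin longhorn-manager pods to control plane nodes and
# tolerate the usual control-plane taint. Labels vary by distro.
spec:
  nodeSelector:
    node-role.kubernetes.io/control-plane: ""
  tolerations:
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule
```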
Furthermore, if possible, we should enhance the Longhorn support bundle to collect the host kubelet log as well.