
[BACKPORT][v1.5.5][IMPROVEMENT] Improve logging in CSI plugin when mount fails. #8286

Closed
james-munson opened this issue Mar 31, 2024 · 11 comments
Assignees
Labels
area/environment-issue User-specific related issues, ex: network, DNS, host packages, etc. kind/backport Backport request kind/improvement Request for improvement of existing function
Milestone

Comments

@james-munson
Contributor

james-munson commented Mar 31, 2024

backport #7931

Specifically, the improvement to CSI host namespace and CSI logging of environment when mount fails.
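For context, the improvement being backported makes the CSI node server log the node's kernel release and OS distro together with the mount error when an NFS mount fails, so kernel-specific failures like this one are easy to spot in the plugin logs. The sketch below is not the actual longhorn-manager code; it is a minimal illustration assuming logrus and golang.org/x/sys/unix, and the helper names (kernelRelease, osDistro, logMountFailure) are hypothetical.

package csilog

import (
	"os"
	"strings"

	"github.com/sirupsen/logrus"
	"golang.org/x/sys/unix"
)

// kernelRelease returns the running kernel release, e.g. "5.15.0-94-generic".
func kernelRelease() string {
	var uts unix.Utsname
	if err := unix.Uname(&uts); err != nil {
		return "unknown"
	}
	return unix.ByteSliceToString(uts.Release[:])
}

// osDistro makes a best-effort guess at the distro ID from /etc/os-release.
func osDistro() string {
	data, err := os.ReadFile("/etc/os-release")
	if err != nil {
		return "unknown"
	}
	for _, line := range strings.Split(string(data), "\n") {
		if strings.HasPrefix(line, "ID=") {
			return strings.Trim(strings.TrimPrefix(line, "ID="), `"`)
		}
	}
	return "unknown"
}

// logMountFailure records the volume, node, kernel release, and distro
// together with the mount error, mirroring the warning shown in the
// comments below.
func logMountFailure(log *logrus.Entry, volumeID, nodeID string, mountErr error) {
	log.WithError(mountErr).Warnf(
		"Failed to mount volume %v on node %s with kernel release %v, os distro %v",
		volumeID, nodeID, kernelRelease(), osDistro())
}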

@james-munson james-munson added kind/bug require/qa-review-coverage Require QA to review coverage require/backport Require backport. Only used when the specific versions to backport have not been defined. labels Mar 31, 2024
@james-munson
Contributor Author

Created this backport issue manually, since part of the previous work on the environment script was already backported.

@longhorn-io-github-bot

longhorn-io-github-bot commented Mar 31, 2024

Pre Ready-For-Testing Checklist

@james-munson james-munson self-assigned this Mar 31, 2024
@james-munson james-munson added this to the v1.5.5 milestone Mar 31, 2024
@james-munson james-munson added kind/backport Backport request kind/improvement Request for improvement of existing function area/environment-issue User-specific related issues, ex: network, DNS, host packages, etc. and removed kind/bug require/backport Require backport. Only used when the specific versions to backport have not been defined. require/qa-review-coverage Require QA to review coverage labels Mar 31, 2024
@chriscchien chriscchien self-assigned this Apr 1, 2024
@chriscchien
Contributor

Hi @james-munson ,

I can reproduce the RWX pod mount failure on Ubuntu kernel 5.15.0-94-generic: the RWX pod failed to mount, but the message Failed to mount volume %v on node %s with kernel release that you added in the PR cannot be found. Could you take a look at this? Thank you.

> uname -r
5.15.0-94-generic
>
> k get volume -A
NAMESPACE         NAME                                       STATE      ROBUSTNESS   SCHEDULED   SIZE         NODE   AGE
longhorn-system   pvc-d5ca45f2-7d92-429b-b7bd-14b084c2d48b   attached   healthy                  1073741824   cha    117s
>
> k -n longhorn-system get volume pvc-d5ca45f2-7d92-429b-b7bd-14b084c2d48b -o yaml | grep accessMode
  accessMode: rwx
> 
> k get pods 
NAME                                READY   STATUS              RESTARTS   AGE
longhorn-nfs-installation-t7pfk     1/1     Running             0          4m47s
longhorn-iscsi-installation-m7jb4   1/1     Running             0          4m47s
test-deployment-754dd9fc66-68qsp    0/1     ContainerCreating   0          2m20s
> 
> k describe pod test-deployment-754dd9fc66-68qsp | grep Event -A 20
Events:
  Type     Reason                  Age                 From                     Message
  ----     ------                  ----                ----                     -------
  Warning  FailedScheduling        2m37s               default-scheduler        0/1 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
  Warning  FailedScheduling        2m35s               default-scheduler        0/1 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
  Normal   Scheduled               2m33s               default-scheduler        Successfully assigned default/test-deployment-754dd9fc66-68qsp to cha
  Normal   SuccessfulAttachVolume  2m17s               attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-d5ca45f2-7d92-429b-b7bd-14b084c2d48b"
  Warning  FailedMount             8s (x9 over 2m17s)  kubelet                  MountVolume.MountDevice failed for volume "pvc-d5ca45f2-7d92-429b-b7bd-14b084c2d48b" : rpc error: code = Internal desc = mount failed: exit status 32
Mounting command: /usr/local/sbin/nsmounter
Mounting arguments: mount -t nfs -o vers=4.1,noresvport,timeo=600,retrans=5,softerr 10.43.132.6:/pvc-d5ca45f2-7d92-429b-b7bd-14b084c2d48b /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/6381505fa73cc1887c25ca8b3079326a4843af3ac8af2f500e818253fefc3afb/globalmount
Output: mount.nfs: Protocol not supported

supportbundle_5e584417-1cc9-44ad-8d2f-6c8c3661f36f_2024-04-02T02-19-38Z.zip

In addition, using kernel 5.15.0-101-generic makes the RWX pod mount succeed.
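As a side note (not part of the original report), the kernel running on each node can also be confirmed from the cluster itself, since kubectl get nodes -o wide includes a KERNEL-VERSION column:

> kubectl get nodes -o wide
(the KERNEL-VERSION column shows, e.g., 5.15.0-94-generic vs 5.15.0-101-generic)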

@james-munson
Contributor Author

james-munson commented Apr 2, 2024

I think I see the problem. Testing the fixup.

@james-munson
Contributor Author

james-munson commented Apr 3, 2024

This is strange. I'm having a hard time inducing the failure. I installed my test build of longhorn-manager for both daemonsets (longhorn-manager and longhorn-csi-plugin), changed the kernel on one of my Ubuntu 22.04 worker nodes to 5.15.0-94-generic, and made a new RWX deployment. The pods all mounted successfully, including the one with the problematic kernel.

The pod events:

k describe pod rwx-test-79dc5564d4-x6gzj | grep Event -A 20
Events:
  Type     Reason                  Age   From                     Message
  ----     ------                  ----  ----                     -------
  Warning  FailedScheduling        22m   default-scheduler        0/4 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling..
  Warning  FailedScheduling        22m   default-scheduler        0/4 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling..
  Normal   Scheduled               22m   default-scheduler        Successfully assigned default/rwx-test-79dc5564d4-x6gzj to jbm-u22-pool2-ca5a5aa4-5fcq7
  Normal   SuccessfulAttachVolume  22m   attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-8787aa8f-8496-47b6-8bee-d49a27e68356"
  Normal   Pulled                  22m   kubelet                  Container image "ubuntu:xenial" already present on machine
  Normal   Created                 22m   kubelet                  Created container rwx-test
  Normal   Started                 22m   kubelet                  Started container rwx-test
  Normal   Pulled                  22m   kubelet                  Container image "nginx:stable" already present on machine
  Normal   Created                 22m   kubelet                  Created container nginx
  Normal   Started                 22m   kubelet                  Started container nginx

On the node itself,

 uname -a
Linux jbm-u22-pool2-ca5a5aa4-5fcq7 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
root@jbm-u22-pool2-ca5a5aa4-5fcq7:~# mount -l | grep nfs
10.43.80.98:/pvc-8787aa8f-8496-47b6-8bee-d49a27e68356 on /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/8d63416da5e7a35390a0bb4456ec55a10c6dbb3ce0a42b3eab6c9fa221481328/globalmount type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,softerr,softreval,noresvport,proto=tcp,timeo=600,retrans=5,sec=sys,clientaddr=146.190.134.16,local_lock=none,addr=10.43.80.98)
10.43.80.98:/pvc-8787aa8f-8496-47b6-8bee-d49a27e68356 on /var/lib/kubelet/pods/c0ad2c30-c146-4bce-ba95-f9af4176547f/volumes/kubernetes.io~csi/pvc-8787aa8f-8496-47b6-8bee-d49a27e68356/mount type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,softerr,softreval,noresvport,proto=tcp,timeo=600,retrans=5,sec=sys,clientaddr=146.190.134.16,local_lock=none,addr=10.43.80.98)

And in the pod, everything is happy

uname -a
Linux rwx-test-79dc5564d4-x6gzj 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
root@rwx-test-79dc5564d4-x6gzj:/# mount -l | grep nfs
10.43.80.98:/pvc-8787aa8f-8496-47b6-8bee-d49a27e68356 on /data type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,softerr,softreval,noresvport,proto=tcp,timeo=600,retrans=5,sec=sys,clientaddr=146.190.134.16,local_lock=none,addr=10.43.80.98)

I'm not sure what's going on.

@james-munson
Contributor Author

Repeated the test with 5.15.0-97-generic, which is also in the broken range of Ubuntu kernels, and got the same result: the mount did not fail.

@PhanLe1010
Contributor

@james-munson Is it reproducible if you use Longhorn 1.6.0 instead?

@james-munson
Contributor Author

Actually, that was with 1.6.0 for everything but my custom longhorn-manager and csi-plugin. Perhaps I'll give it a try with a 1.5.x release, or compare test procedures with @chriscchien.

@james-munson
Contributor Author

james-munson commented Apr 6, 2024

So, since I can't repro the kernel-based Protocol not supported error, I tested using a different mount failure that still exercises the code path that was missed. The logging from the CSI plugin is as hoped:

[longhorn-csi-plugin-k46f6 longhorn-csi-plugin] E0406 18:18:12.533095    5056 mount_linux.go:230] Mount failed: exit status 32
[longhorn-csi-plugin-k46f6 longhorn-csi-plugin] time="2024-04-06T18:18:12Z" level=warning msg="Failed to mount volume pvc-fdf504c7-1616-4309-a5b1-25365d74d02e on node jbm-u22-pool2-ca5a5aa4-5fcq7 with kernel release 5.15.0-94-generic, os distro ubuntu" func="csi.(*NodeServer).nodeStageSharedVolume" file="node_server.go:295" component=csi-node-server error="mount failed: exit status 32\nMounting command: /usr/local/sbin/nsmounter\nMounting arguments: mount -t nfs -o rw,vers=4.1,bunny,softerr,timeo=600,retrans=5,noresvport 10.43.163.213:/pvc-fdf504c7-1616-4309-a5b1-25365d74d02e /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/e61d4e8525a632f675b44d7b8ba3704bd179e4e6ae7dbbe5ba811af540ebfa86/globalmount\nOutput: mount.nfs: an incorrect mount option was specified\n" function=nodeStageSharedVolume

In particular,

msg="Failed to mount volume pvc-fdf504c7-1616-4309-a5b1-25365d74d02e on node jbm-u22-pool2-ca5a5aa4-5fcq7 with kernel release 5.15.0-94-generic, os distro ubuntu"

which shows what we wanted to capture in the logs.

@james-munson
Contributor Author

With longhorn/longhorn-manager#2724 committed, this should be testable again.

@chriscchien
Contributor

Verified pass on longhorn v1.5.x (longhorn-manager b160f0)

Deployed Longhorn v1.5.x on Ubuntu kernel 5.15.0-94-generic, on which RWX volumes cannot mount successfully (ref), then created and attached an RWX volume. In longhorn-csi-plugin.log, the related kernel information can be found as below:

2024-04-09T01:32:45.954023014Z time="2024-04-09T01:32:45Z" level=warning msg="Failed to mount volume pvc-60fce72a-acab-4214-9144-8305cf99e932 on node ccc with kernel release 5.15.0-94-generic, os distro ubuntu" func="csi.(*NodeServer).nodeStageSharedVolume" file="node_server.go:284" component=csi-node-server error="mount failed: exit status 32\nMounting command: /usr/local/sbin/nsmounter\nMounting arguments: mount -t nfs -o vers=4.1,noresvport,timeo=600,retrans=5,softerr 10.43.183.167:/pvc-60fce72a-acab-4214-9144-8305cf99e932 /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/67d50915e9bd94523ce9dfb7240f1b491572361b8cad3537c5274f6184eb3cc0/globalmount\nOutput: mount.nfs: Protocol not supported\n"
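The same warning can also be pulled directly from a running CSI plugin pod rather than the collected longhorn-csi-plugin.log (the pod name below is illustrative):

> kubectl -n longhorn-system logs longhorn-csi-plugin-xxxxx -c longhorn-csi-plugin | grep "Failed to mount volume"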
