
[BACKPORT][v1.5.5][IMPROVEMENT] Improve logging in CSI plugin when mount fails. #8286

Closed
james-munson opened this issue Mar 31, 2024 · 11 comments
Assignees
Labels
area/environment-issue User-specific related issues, ex: network, DNS, host packages, etc. kind/backport Backport request kind/improvement Request for improvement of existing function
Milestone

Comments

@james-munson
Contributor

james-munson commented Mar 31, 2024

backport #7931

Specifically, the improvement to CSI host namespace and CSI logging of environment when mount fails.
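For context, the improvement being backported makes the CSI node server log the node's kernel release and OS distro together with the mount error when an NFS mount fails, so kernel-specific failures like this one are easy to spot in the plugin logs. The sketch below is not the actual longhorn-manager code; it is a minimal illustration assuming logrus and golang.org/x/sys/unix, and the helper names (kernelRelease, osDistro, logMountFailure) are hypothetical.

package csilog

import (
	"os"
	"strings"

	"github.com/sirupsen/logrus"
	"golang.org/x/sys/unix"
)

// kernelRelease returns the running kernel release, e.g. "5.15.0-94-generic".
func kernelRelease() string {
	var uts unix.Utsname
	if err := unix.Uname(&uts); err != nil {
		return "unknown"
	}
	return unix.ByteSliceToString(uts.Release[:])
}

// osDistro makes a best-effort guess at the distro ID from /etc/os-release.
func osDistro() string {
	data, err := os.ReadFile("/etc/os-release")
	if err != nil {
		return "unknown"
	}
	for _, line := range strings.Split(string(data), "\n") {
		if strings.HasPrefix(line, "ID=") {
			return strings.Trim(strings.TrimPrefix(line, "ID="), `"`)
		}
	}
	return "unknown"
}

// logMountFailure records the volume, node, kernel release, and distro
// together with the mount error, mirroring the warning shown in the
// comments below.
func logMountFailure(log *logrus.Entry, volumeID, nodeID string, mountErr error) {
	log.WithError(mountErr).Warnf(
		"Failed to mount volume %v on node %s with kernel release %v, os distro %v",
		volumeID, nodeID, kernelRelease(), osDistro())
}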

@james-munson james-munson added kind/bug require/qa-review-coverage Require QA to review coverage require/backport Require backport. Only used when the specific versions to backport have not been defined. labels Mar 31, 2024
@james-munson
Contributor Author

Created this backport issue manually, since part of the previous work on the environment script was already backported.

@longhorn-io-github-bot

longhorn-io-github-bot commented Mar 31, 2024

Pre Ready-For-Testing Checklist

@james-munson james-munson self-assigned this Mar 31, 2024
@james-munson james-munson added this to the v1.5.5 milestone Mar 31, 2024
@james-munson james-munson added kind/backport Backport request kind/improvement Request for improvement of existing function area/environment-issue User-specific related issues, ex: network, DNS, host packages, etc. and removed kind/bug require/backport Require backport. Only used when the specific versions to backport have not been defined. require/qa-review-coverage Require QA to review coverage labels Mar 31, 2024
@chriscchien chriscchien self-assigned this Apr 1, 2024
@chriscchien
Contributor

Hi @james-munson ,

I can reproduce the RWX pod mount failure on Ubuntu kernel 5.15.0-94-generic: the RWX pod failed to mount, but the message Failed to mount volume %v on node %s with kernel release that you added in the PR cannot be found. Could you take a look at this? Thank you.

> uname -r
5.15.0-94-generic
>
> k get volume -A
NAMESPACE         NAME                                       STATE      ROBUSTNESS   SCHEDULED   SIZE         NODE   AGE
longhorn-system   pvc-d5ca45f2-7d92-429b-b7bd-14b084c2d48b   attached   healthy                  1073741824   cha    117s
>
> k -n longhorn-system get volume pvc-d5ca45f2-7d92-429b-b7bd-14b084c2d48b -o yaml | grep accessMode
  accessMode: rwx
> 
> k get pods 
NAME                                READY   STATUS              RESTARTS   AGE
longhorn-nfs-installation-t7pfk     1/1     Running             0          4m47s
longhorn-iscsi-installation-m7jb4   1/1     Running             0          4m47s
test-deployment-754dd9fc66-68qsp    0/1     ContainerCreating   0          2m20s
> 
> k describe pod test-deployment-754dd9fc66-68qsp | grep Event -A 20
Events:
  Type     Reason                  Age                 From                     Message
  ----     ------                  ----                ----                     -------
  Warning  FailedScheduling        2m37s               default-scheduler        0/1 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
  Warning  FailedScheduling        2m35s               default-scheduler        0/1 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
  Normal   Scheduled               2m33s               default-scheduler        Successfully assigned default/test-deployment-754dd9fc66-68qsp to cha
  Normal   SuccessfulAttachVolume  2m17s               attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-d5ca45f2-7d92-429b-b7bd-14b084c2d48b"
  Warning  FailedMount             8s (x9 over 2m17s)  kubelet                  MountVolume.MountDevice failed for volume "pvc-d5ca45f2-7d92-429b-b7bd-14b084c2d48b" : rpc error: code = Internal desc = mount failed: exit status 32
Mounting command: /usr/local/sbin/nsmounter
Mounting arguments: mount -t nfs -o vers=4.1,noresvport,timeo=600,retrans=5,softerr 10.43.132.6:/pvc-d5ca45f2-7d92-429b-b7bd-14b084c2d48b /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/6381505fa73cc1887c25ca8b3079326a4843af3ac8af2f500e818253fefc3afb/globalmount
Output: mount.nfs: Protocol not supported

supportbundle_5e584417-1cc9-44ad-8d2f-6c8c3661f36f_2024-04-02T02-19-38Z.zip

In addition, using kernel 5.15.0-101-generic makes the RWX pod mount succeed.
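As a side note (not part of the original report), the kernel running on each node can also be confirmed from the cluster itself, since kubectl get nodes -o wide includes a KERNEL-VERSION column:

> kubectl get nodes -o wide
(the KERNEL-VERSION column shows, e.g., 5.15.0-94-generic vs 5.15.0-101-generic)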

@james-munson
Contributor Author

james-munson commented Apr 2, 2024

I think I see the problem. Testing the fixup.

@james-munson
Contributor Author

james-munson commented Apr 3, 2024

This is strange. I'm having a hard time inducing the failure. I installed my test build of longhorn-manager for both daemonsets (longhorn-manager and longhorn-csi-plugin), changed the kernel on one of my Ubuntu 22.04 worker nodes to 5.15.0-94-generic, and made a new RWX deployment. The pods all mounted successfully, including the one with the problematic kernel.

The pod events:

k describe pod rwx-test-79dc5564d4-x6gzj | grep Event -A 20
Events:
  Type     Reason                  Age   From                     Message
  ----     ------                  ----  ----                     -------
  Warning  FailedScheduling        22m   default-scheduler        0/4 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling..
  Warning  FailedScheduling        22m   default-scheduler        0/4 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling..
  Normal   Scheduled               22m   default-scheduler        Successfully assigned default/rwx-test-79dc5564d4-x6gzj to jbm-u22-pool2-ca5a5aa4-5fcq7
  Normal   SuccessfulAttachVolume  22m   attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-8787aa8f-8496-47b6-8bee-d49a27e68356"
  Normal   Pulled                  22m   kubelet                  Container image "ubuntu:xenial" already present on machine
  Normal   Created                 22m   kubelet                  Created container rwx-test
  Normal   Started                 22m   kubelet                  Started container rwx-test
  Normal   Pulled                  22m   kubelet                  Container image "nginx:stable" already present on machine
  Normal   Created                 22m   kubelet                  Created container nginx
  Normal   Started                 22m   kubelet                  Started container nginx

On the node itself,

 uname -a
Linux jbm-u22-pool2-ca5a5aa4-5fcq7 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
root@jbm-u22-pool2-ca5a5aa4-5fcq7:~# mount -l | grep nfs
10.43.80.98:/pvc-8787aa8f-8496-47b6-8bee-d49a27e68356 on /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/8d63416da5e7a35390a0bb4456ec55a10c6dbb3ce0a42b3eab6c9fa221481328/globalmount type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,softerr,softreval,noresvport,proto=tcp,timeo=600,retrans=5,sec=sys,clientaddr=146.190.134.16,local_lock=none,addr=10.43.80.98)
10.43.80.98:/pvc-8787aa8f-8496-47b6-8bee-d49a27e68356 on /var/lib/kubelet/pods/c0ad2c30-c146-4bce-ba95-f9af4176547f/volumes/kubernetes.io~csi/pvc-8787aa8f-8496-47b6-8bee-d49a27e68356/mount type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,softerr,softreval,noresvport,proto=tcp,timeo=600,retrans=5,sec=sys,clientaddr=146.190.134.16,local_lock=none,addr=10.43.80.98)

And in the pod, everything is happy

uname -a
Linux rwx-test-79dc5564d4-x6gzj 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
root@rwx-test-79dc5564d4-x6gzj:/# mount -l | grep nfs
10.43.80.98:/pvc-8787aa8f-8496-47b6-8bee-d49a27e68356 on /data type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,softerr,softreval,noresvport,proto=tcp,timeo=600,retrans=5,sec=sys,clientaddr=146.190.134.16,local_lock=none,addr=10.43.80.98)

I'm not sure what's going on.

@james-munson
Contributor Author

Repeated the test with 5.15.0-97-generic, which is also in the broken range of Ubuntu kernels, and got the same result: the mount did not fail.

@PhanLe1010
Contributor

@james-munson Is it reproducible if you use Longhorn 1.6.0 instead?

@james-munson
Contributor Author

Actually, that was with 1.6.0 for everything but my custom longhorn-manager and csi-plugin. Perhaps I'll give it a try with a 1.5.x release, or compare test procedures with @chriscchien.

@james-munson
Contributor Author

james-munson commented Apr 6, 2024

So, since I can't repro the kernel-based Protocol not supported error, I tested using a different mount failure that still exercises the code path that was missed. The logging from the CSI plugin is as hoped:

[longhorn-csi-plugin-k46f6 longhorn-csi-plugin] E0406 18:18:12.533095    5056 mount_linux.go:230] Mount failed: exit status 32
[longhorn-csi-plugin-k46f6 longhorn-csi-plugin] time="2024-04-06T18:18:12Z" level=warning msg="Failed to mount volume pvc-fdf504c7-1616-4309-a5b1-25365d74d02e on node jbm-u22-pool2-ca5a5aa4-5fcq7 with kernel release 5.15.0-94-generic, os distro ubuntu" func="csi.(*NodeServer).nodeStageSharedVolume" file="node_server.go:295" component=csi-node-server error="mount failed: exit status 32\nMounting command: /usr/local/sbin/nsmounter\nMounting arguments: mount -t nfs -o rw,vers=4.1,bunny,softerr,timeo=600,retrans=5,noresvport 10.43.163.213:/pvc-fdf504c7-1616-4309-a5b1-25365d74d02e /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/e61d4e8525a632f675b44d7b8ba3704bd179e4e6ae7dbbe5ba811af540ebfa86/globalmount\nOutput: mount.nfs: an incorrect mount option was specified\n" function=nodeStageSharedVolume

In particular,

msg="Failed to mount volume pvc-fdf504c7-1616-4309-a5b1-25365d74d02e on node jbm-u22-pool2-ca5a5aa4-5fcq7 with kernel release 5.15.0-94-generic, os distro ubuntu"

which shows what we wanted to capture in the logs.

@james-munson
Contributor Author

With longhorn/longhorn-manager#2724 committed, this should be testable again.

@chriscchien
Contributor

Verified pass on longhorn v1.5.x (longhorn-manager b160f0)

Deployed Longhorn v1.5.x on Ubuntu kernel 5.15.0-94-generic, on which RWX volumes cannot mount successfully (ref), then created and attached an RWX volume. In longhorn-csi-plugin.log, the related kernel information can be found as below:

2024-04-09T01:32:45.954023014Z time="2024-04-09T01:32:45Z" level=warning msg="Failed to mount volume pvc-60fce72a-acab-4214-9144-8305cf99e932 on node ccc with kernel release 5.15.0-94-generic, os distro ubuntu" func="csi.(*NodeServer).nodeStageSharedVolume" file="node_server.go:284" component=csi-node-server error="mount failed: exit status 32\nMounting command: /usr/local/sbin/nsmounter\nMounting arguments: mount -t nfs -o vers=4.1,noresvport,timeo=600,retrans=5,softerr 10.43.183.167:/pvc-60fce72a-acab-4214-9144-8305cf99e932 /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/67d50915e9bd94523ce9dfb7240f1b491572361b8cad3537c5274f6184eb3cc0/globalmount\nOutput: mount.nfs: Protocol not supported\n"
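The same warning can also be pulled directly from a running CSI plugin pod rather than the collected longhorn-csi-plugin.log (the pod name below is illustrative):

> kubectl -n longhorn-system logs longhorn-csi-plugin-xxxxx -c longhorn-csi-plugin | grep "Failed to mount volume"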
