
[BUG] Backup NFS - Operation not permitted during mount #6114

Open
adampetrovic opened this issue Jun 13, 2023 · 15 comments
Labels
area/backup-store (Remote backup store related) · area/environment-issue (User-specific related issues, ex: network, DNS, host packages, etc.) · investigation-needed (Need to identify the case before estimating and starting the development) · kind/bug · priority/0 (Must be fixed in this release, managed by PO) · require/knowledge-base (Require adding knowledge base document)


adampetrovic commented Jun 13, 2023

Describe the bug (🐛 if you encounter this issue)

Setting backupTarget to an NFS share gives an "Operation not permitted" error in the UI.

I am using a Synology NAS that only supports NFS up to v4.1.

# cat /proc/fs/nfsd/versions
+2 +3 +4 +4.1
ash-4.4# cat /etc/exports
/volume2/k8s-backup	10.0.80.0/21(rw,async,no_wdelay,no_root_squash,insecure_locks,sec=sys,anonuid=1025,anongid=100)

My Kubernetes nodes are within the subnet above.

To Reproduce

Manually run the mount command from within a longhorn-manager pod:

$ mkdir -p /mnt/nfs
$ mount -t nfs4 -o nfsvers=4.1,actimeo=1,soft,timeo=300,retry=2 <nas url>:/volume2/k8s-backup /mnt/nfs
mount.nfs4: Operation not permitted
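
For reference, one way to get a shell inside a longhorn-manager pod to run the commands above; the label selector is the one the Longhorn manager DaemonSet typically uses, and the pod name is a placeholder:

$ kubectl get pods -n longhorn-system -l app=longhorn-manager
$ kubectl exec -it -n longhorn-system longhorn-manager-xxxxx -- sh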

Executing the same command directly on a k8s node works fine:

$ sudo mount -t nfs4 -o nfsvers=4.1,actimeo=1,soft,timeo=300,retry=2 <nas url>:/volume2/k8s-backup /tmp/nas
$

Expected behavior

The NFS backup target should mount successfully.

Log or Support bundle

If applicable, add the Longhorn managers' log or support bundle when the issue happens.
You can generate a Support Bundle using the link at the footer of the Longhorn UI.

Environment

  • Longhorn version: 1.4.2
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm / Flux
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: K3s
    • Number of management node in the cluster: 1
    • Number of worker node in the cluster: 4
  • Node config
    • OS type and version: Ubuntu 22.04
    • CPU per node: 2
    • Memory per node: 30GB
    • Disk type(e.g. SSD/NVMe): NVMe
    • Network bandwidth between the nodes: 1Gbe
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Proxmox
  • Number of Longhorn volumes in the cluster: N/A

Additional context

Add any other context about the problem here.

@derekbit (Member)

@adampetrovic
Can you provide a support bundle for further investigation?


ozid commented Nov 11, 2023

I have the exact same issue on Longhorn v1.5.1 installed via Helm.
@derekbit here is my support bundle:
[removed]
The mount works on the host but not inside the pod. This is the error in longhorn-manager:
mount.nfs4: Operation not permitted
On the host, the exact same mount command works without issue.

I don't personally have access to the NFS server.


@DanielG0721

I am also struggling with this issue on v1.5.1.

@derekbit (Member)

@ozid @DanielG0721
It looks related to the server's or client's permission configuration. Would you be able to check https://longhorn.io/kb/troubleshooting-unable-to-mount-an-nfs-backup-target/?


ozid commented Nov 12, 2023

Hello @derekbit, I checked a bit more and I see this in a tcpdump capture:
[tcpdump screenshot]

The first line is the pod making the request, and the second line is the NFS server replying with Operation not permitted, so I guess you are right.

I find it strange that this works on the host itself but not in the pod; after testing from another privileged pod I get the exact same issue. I'll keep digging, but on the server side I can only whitelist our IP, nothing else.


ozid commented Nov 12, 2023

I just noticed that when the traffic goes from the pod to the NFS server, it uses an unprivileged source port (above 1023).
On the k8s hosts it always uses a source port below 1024 instead. This is probably why the NFS server answers with Operation not permitted.

....

After testing a rule on my gateway to rewrite the source port to a port between 1 and 1023:
ip daddr (ip nfs server) tcp dport 2049 snat to (my source nat ip) :1-1023
the mount is working :)
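
For readers who want to try the same workaround, here is a minimal sketch of how such a rule could look in a full nftables ruleset on a Linux gateway; the NFS server address (192.0.2.10) and SNAT address (198.51.100.1) are placeholders, not values from this thread:

# /etc/nftables.conf on the gateway (hypothetical addresses)
table ip nat {
    chain postrouting {
        type nat hook postrouting priority srcnat; policy accept;
        # rewrite NFS traffic so the server sees a privileged (<1024) source port
        ip daddr 192.0.2.10 tcp dport 2049 snat to 198.51.100.1:1-1023
    }
}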

derekbit (Member) commented Nov 13, 2023

> I just noticed that when the traffic goes from the pod to the NFS server, it uses an unprivileged source port (above 1023). On the k8s hosts it always uses a source port below 1024 instead. This is probably why the NFS server answers with Operation not permitted.
>
> ....
>
> After testing a rule on my gateway to rewrite the source port to a port between 1 and 1023: ip daddr (ip nfs server) tcp dport 2049 snat to (my source nat ip) :1-1023 the mount is working :)

@ozid
Cool! We were not aware that this could be caused by the source-port usage in the k8s system.
However, why does it work without issue in most environments?

@derekbit derekbit added this to the v1.7.0 milestone Nov 13, 2023
@derekbit derekbit added investigation-needed Need to identify the case before estimating and starting the development component/longhorn-share-manager Longhorn share manager (control plane for NFS server, RWX) labels Nov 13, 2023
@derekbit (Member)

cc @james-munson

@derekbit derekbit added priority/0 Must be fixed in this release (managed by PO) area/backup-store Remote backup store related and removed component/longhorn-share-manager Longhorn share manager (control plane for NFS server, RWX) labels Nov 13, 2023

ozid commented Nov 13, 2023

> I just noticed that when the traffic goes from the pod to the NFS server, it uses an unprivileged source port (above 1023). On the k8s hosts it always uses a source port below 1024 instead. This is probably why the NFS server answers with Operation not permitted.
> ....
> After testing a rule on my gateway to rewrite the source port to a port between 1 and 1023: ip daddr (ip nfs server) tcp dport 2049 snat to (my source nat ip) :1-1023 the mount is working :)

> @ozid Cool! We were not aware that this could be caused by the source-port usage in the k8s system. However, why does it work without issue in most environments?

I would love to know why as well, but there are so many ways of doing things...
Personally I use Cilium as the CNI without kube-proxy, so maybe the eBPF program rewrites the source port to a random higher range and that is causing the issue.

Maybe it is worth mentioning this in the documentation here:
https://longhorn.io/kb/troubleshooting-unable-to-mount-an-nfs-backup-target/
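
For anyone who does control the NFS server, a common server-side alternative is to allow non-privileged (>1023) client source ports with the insecure export option. A sketch based on the export from the original report (only insecure is added; everything else is unchanged):

# /etc/exports -- "insecure" lets the server accept requests from source ports above 1023
/volume2/k8s-backup 10.0.80.0/21(rw,async,no_wdelay,no_root_squash,insecure,insecure_locks,sec=sys,anonuid=1025,anongid=100)
# reload the export table
exportfs -ra

On a Synology NAS this is typically exposed as the "Allow connections from non-privileged ports" checkbox in the shared folder's NFS permissions, rather than by editing /etc/exports by hand.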

@derekbit (Member)

cc @mantissahz


dotdiego commented Dec 8, 2023

> I just noticed that when the traffic goes from the pod to the NFS server, it uses an unprivileged source port (above 1023). On the k8s hosts it always uses a source port below 1024 instead. This is probably why the NFS server answers with Operation not permitted.
>
> ....
>
> After testing a rule on my gateway to rewrite the source port to a port between 1 and 1023: ip daddr (ip nfs server) tcp dport 2049 snat to (my source nat ip) :1-1023 the mount is working :)

Could you share more about how you managed to fix this issue?
I'm currently in this situation and don't know how to fix it.

Thanks


ozid commented Dec 8, 2023

Hello @dotdiego, yes of course. Basically, your NFS server expects the client to use a source port in the privileged range; in simple terms, it must be under 1024.

So you must somehow make sure your client (in this case, the Longhorn pod) sends its NFS requests from a source port between 1 and 1023.

In my case, my Kubernetes cluster needs to go through my gateway server to reach the NFS server. The good news for me is that the gateway is fully managed with Linux tools, so I can manipulate the traffic as I want. This is why I added this rule on my gateway:
ip daddr (ip nfs server) tcp dport 2049 snat to (my source nat ip) :1-1023

After adding this, my gateway rewrites the original source port, which was 54850, to a random port between 1 and 1023.

Before adding the rule, tcpdump showed something like this:
my-longhorn-pod-ip:54850 -> nfs_server-ip:2049   (not working)

After adding the rule on my gateway:
my-longhorn-pod-ip:1022 -> nfs_server-ip:2049   (working)

Hope that makes it clearer for you.
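
A quick way to confirm which source port the NFS connection is using from inside the longhorn-manager pod, assuming the ss tool from iproute2 is available in the container:

# run inside the longhorn-manager pod while a mount attempt is in progress
ss -tn | grep ':2049'
# the local address column shows the source port used inside the pod
# (any NAT on the path to the server can still rewrite it)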

@derekbit derekbit added require/knowledge-base Require adding knowledge base document area/environment-issue User-specific related issues, ex: network, DNS, host packages, etc. and removed investigation-needed Need to identify the case before estimating and starting the development labels Dec 8, 2023
@derekbit derekbit modified the milestones: v1.7.0, v1.6.0 Dec 8, 2023
@innobead innobead added the investigation-needed Need to identify the case before estimating and starting the development label Dec 13, 2023
@innobead (Member)

@james-munson Please help with the doc as @derekbit mentioned. We need a KB doc.

@innobead innobead removed this from the v1.6.0 milestone Jan 2, 2024
@innobead innobead added this to the v1.7.0 milestone Jan 2, 2024

npawelek commented Jan 7, 2024

I stumbled across this while attempting to debug the issue on my own K3s cluster. I'm running into the same mount.nfs4: mount(2): Operation not permitted error when attempting to mount the NFS backup target from the longhorn-manager container. This appears to work without issue on Ubuntu 22.04 with K8s (deployed via kubeadm), but not on Debian 12 with K3s. I'm deploying a migration cluster and can't mount the backupstore, which is a bit frustrating. AppArmor is disabled, so as not to add any complications. From the underlying host I can mount fine, but not from within the longhorn-manager container. The Longhorn version is 1.5.3, installed via Helm.

k exec -it -n longhorn-system longhorn-manager-m7ph7 -- sh
sh-4.4# showmount -e 192.168.0.151
Export list for 192.168.0.151:
/volume1/LonghornBackupstore 10.32.0.0/12,192.168.0.34,192.168.0.33,192.168.0.32,192.168.0.31

sh-4.4# ip a sh eth0
127: eth0@if128: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 56:ae:93:a9:e1:96 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.32.2.86/32 scope global eth0
       valid_lft forever preferred_lft forever

sh-4.4# mount -v -t nfs4 -o nfsvers=4.1,actimeo=1,soft,timeo=300,retry=2 192.168.0.151:/volume1/LonghornBackupstore /var/lib/longhorn-backupstore-mounts/192_168_0_151/volume1/LonghornBackupstore
mount.nfs4: timeout set for Sun Jan  7 17:07:16 2024
mount.nfs4: trying text-based options 'nfsvers=4.1,actimeo=1,soft,timeo=300,retry=2,addr=192.168.0.151,clientaddr=10.32.2.86'
mount.nfs4: mount(2): Operation not permitted
mount.nfs4: Operation not permitted for 192.168.0.151:/volume1/LonghornBackupstore on /var/lib/longhorn-backupstore-mounts/192_168_0_151/volume1/LonghornBackupstore

Nothing is logged in dmesg. I'm also using the latest kernel in Debian 12 stable (6.1.0-17-amd64). The CNI plugin is Cilium 1.14.5, and I can see that traffic from the longhorn-manager pod has a source port < 1024.

k exec -n kube-system cilium-jxfqt -- cilium monitor -n | grep 192.168.0.151
...
Policy verdict log: flow 0x614bba26 local EP ID 1360, remote ID 16777263, proto 6, egress, action allow, auth: disabled, match L3-Only, 10.32.2.86:842 -> 192.168.0.151:2049 tcp SYN
-> network flow 0x614bba26 , identity 67641->16777263 state new ifindex eth0 orig-ip 0.0.0.0: 10.32.2.86:842 -> 192.168.0.151:2049 tcp SYN
-> endpoint 1360 flow 0x0 , identity 16777263->67641 state reply ifindex lxcd72247c18a9c orig-ip 192.168.0.151: 192.168.0.151:2049 -> 10.32.2.86:842 tcp SYN, ACK
-> network flow 0x614bba26 , identity 67641->16777263 state established ifindex eth0 orig-ip 0.0.0.0: 10.32.2.86:842 -> 192.168.0.151:2049 tcp ACK
...

Here are my environment details; I will submit a support bundle as well.

Longhorn version: 1.5.3
Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm / Flux
Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: K3s v1.29.0+k3s1
Number of management node in the cluster: 3
Number of worker node in the cluster: 1
Node config
OS type and version: Debian 12 (stable kernel 6.1.0-17-amd64; apparmor disabled)
CPU per node: 2
Memory per node: 32GB
Disk type(e.g. SSD/NVMe): NVMe
Network bandwidth between the nodes: 1Gbe
Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal


ozid commented Apr 29, 2024

@npawelek maybe you could try running tcpdump at the host level, just to be 100% sure no NAT is applied once the traffic leaves the CNI.
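
A minimal sketch of such a host-level capture, assuming the node's outgoing interface is eth0 and reusing the NFS server address from the comment above:

# run on the k8s node while retrying the mount from the longhorn-manager pod
tcpdump -ni eth0 'host 192.168.0.151 and tcp port 2049'
# check whether the source port of the outgoing SYN packets is below 1024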
