The ip_vs kernel module does not exist in the unprivileged container /proc #4278

Closed
pekindenis opened this issue Feb 13, 2023 · 9 comments

@pekindenis

On the host, “ip_vs” is loaded and “/proc/sys/net/ipv4/vs/” is populated.
In the container, “ip_vs” also shows up in lsmod and the directory “/proc/sys/net/ipv4/vs/” exists, but it is empty.

root@pve01:~$ lsmod | grep ip_vs
ip_vs 155648 1 xt_ipvs
nf_conntrack 139264 10 xt_conntrack,nf_nat,xt_state,xt_nat,openvswitch,nf_conntrack_netlink,nf_conncount,xt_MASQUERADE,ip_vs,xt_REDIRECT
nf_defrag_ipv6 24576 3 nf_conntrack,openvswitch,ip_vs
libcrc32c 16384 7 nf_conntrack,nf_nat,openvswitch,btrfs,xfs,raid456,ip_vs

root@pve02:~# ls -l /proc/sys/net/ipv4/vs/
total 0
-rw-r--r-- 1 root root 0 Feb 11 21:52 am_droprate
-rw-r--r-- 1 root root 0 Feb 11 21:52 amemthresh
-rw-r--r-- 1 root root 0 Feb 11 21:52 backup_only
-rw-r--r-- 1 root root 0 Feb 11 21:52 cache_bypass
-rw-r--r-- 1 root root 0 Feb 11 21:52 conn_reuse_mode
-rw-r--r-- 1 root root 0 Feb 11 21:52 conntrack
-rw-r--r-- 1 root root 0 Feb 11 21:52 drop_entry
-rw-r--r-- 1 root root 0 Feb 11 21:52 drop_packet
-rw-r--r-- 1 root root 0 Feb 11 21:52 expire_nodest_conn
-rw-r--r-- 1 root root 0 Feb 11 21:52 expire_quiescent_template
-rw-r--r-- 1 root root 0 Feb 11 21:52 ignore_tunneled
-rw-r--r-- 1 root root 0 Feb 11 21:52 lblc_expiration
-rw-r--r-- 1 root root 0 Feb 11 21:52 lblcr_expiration
-rw-r--r-- 1 root root 0 Feb 11 21:52 nat_icmp_send
-rw-r--r-- 1 root root 0 Feb 11 21:52 pmtu_disc
-rw-r--r-- 1 root root 0 Feb 11 21:52 schedule_icmp
-rw-r--r-- 1 root root 0 Feb 11 21:52 secure_tcp
-rw-r--r-- 1 root root 0 Feb 11 21:52 sloppy_sctp
-rw-r--r-- 1 root root 0 Feb 11 21:52 sloppy_tcp
-rw-r--r-- 1 root root 0 Feb 11 21:52 snat_reroute
-rw-r--r-- 1 root root 0 Feb 11 21:52 sync_persist_mode
-rw-r--r-- 1 root root 0 Feb 11 21:52 sync_ports
-rw-r--r-- 1 root root 0 Feb 11 21:52 sync_qlen_max
-rw-r--r-- 1 root root 0 Feb 11 21:52 sync_refresh_period
-rw-r--r-- 1 root root 0 Feb 11 21:52 sync_retries
-rw-r--r-- 1 root root 0 Feb 11 21:52 sync_sock_size
-rw-r--r-- 1 root root 0 Feb 11 21:52 sync_threshold
-rw-r--r-- 1 root root 0 Feb 11 21:52 sync_version

root@container:~# ls -l /proc/sys/net/ipv4/vs/
total 0

root@container:~# lsmod | grep ip_vs
ip_vs_wrr 16384 0
ip_vs_wlc 16384 0
ip_vs_sh 16384 0
ip_vs_sed 16384 0
ip_vs_rr 16384 0
ip_vs_nq 16384 0
ip_vs_lc 16384 0
ip_vs_lblcr 16384 0
ip_vs_lblc 16384 0
ip_vs_ftp 16384 0
ip_vs_dh 16384 0
ip_vs 172032 33 ip_vs_wlc,ip_vs_rr,ip_vs_dh,ip_vs_lblcr,ip_vs_sh,ip_vs_nq,ip_vs_lblc,xt_ipvs,ip_vs_wrr,ip_vs_lc,ip_vs_sed,ip_vs_ftp
nf_nat 49152 6 xt_nat,nft_chain_nat,iptable_nat,xt_MASQUERADE,xt_REDIRECT,ip_vs_ftp
nf_conntrack 172032 8 xt_conntrack,nf_nat,xt_state,xt_nat,nf_conntrack_netlink,xt_MASQUERADE,ip_vs,xt_REDIRECT
nf_defrag_ipv6 24576 2 nf_conntrack,ip_vs
libcrc32c 16384 7 nf_conntrack,nf_nat,dm_persistent_data,btrfs,nf_tables,ip_vs,sctp

uname -a
Linux pve02 5.15.83-1-pve #1 SMP PVE 5.15.83-1 (2022-12-15T00:00Z) x86_64 GNU/Linux

@mihalicyn
Member

mihalicyn commented Feb 13, 2023

Hi @pekindenis

This is normal behavior: https://github.com/torvalds/linux/blob/ceaa837f96adb69c0df0397937cd74991d5d821a/net/netfilter/ipvs/ip_vs_ctl.c#L4303

These sysctls are not exposed when the network namespace is owned by a non-initial user namespace.
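
A quick way to tell from inside a container which case applies is to look at the user-namespace ID mapping. The sketch below is a heuristic only, not a kernel or LXC API: it assumes the initial user namespace shows the single identity mapping `0 0 4294967295` in /proc/self/uid_map, while an unprivileged container shows a shifted or narrower range. The function names are made up for illustration.

```python
from pathlib import Path

# Size of the full 32-bit ID range as it appears in the identity mapping.
FULL_RANGE = 4294967295


def is_identity_uid_map(text: str) -> bool:
    """Return True if the uid_map text is the single full identity mapping."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if len(rows) != 1 or len(rows[0]) != 3:
        return False
    inside, outside, count = (int(field) for field in rows[0])
    return inside == 0 and outside == 0 and count == FULL_RANGE


def looks_like_initial_userns(uid_map_path: str = "/proc/self/uid_map") -> bool:
    """Heuristic: read the current process's uid_map and test for identity."""
    return is_identity_uid_map(Path(uid_map_path).read_text())
```

Calling `looks_like_initial_userns()` on the host or in a privileged container would be expected to return True, and False in an unprivileged container. A user namespace created by host root with a full identity mapping would fool this check, so treat the result as a hint.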

@pekindenis
Author

pekindenis commented Feb 13, 2023

> Hi @pekindenis
>
> This is normal behavior: https://github.com/torvalds/linux/blob/ceaa837f96adb69c0df0397937cd74991d5d821a/net/netfilter/ipvs/ip_vs_ctl.c#L4303
>
> These sysctls are not exposed when the network namespace is owned by a non-initial user namespace.

If I use a privileged LXC container, the entries under “/proc/sys/net/ipv4/vs/” exist.
Is there a way to get them in an unprivileged container without patching the kernel?

my current settings:

arch: amd64
cores: 1
cpuunits: 512
features: keyctl=1,nesting=1
hostname: traefik
memory: 512
mp0: /usr/lib/modules/5.15.83-1-pve,mp=/usr/lib/modules/5.15.83-1-pve,backup=0,ro=1
net0: name=eth0,bridge=vmbr0,gw=192.168.1.1,hwaddr=52:75:3E:9B:CD:B0,ip=192.168.1.117/24,type=veth
ostype: debian
rootfs: local-lvm:vm-313-disk-0,size=20G
swap: 512
unprivileged: 0

startup lxc log

@mihalicyn
Member

> If I use a privileged LXC container, the entries under “/proc/sys/net/ipv4/vs/” exist.

That's because privileged containers use the host user namespace.

> Is there a way to get them in an unprivileged container without patching the kernel?

Can you describe a use case for that? You can try to bind-mount part of the procfs tree into the unprivileged container to show these values, but that makes little sense because you won't be able to write these sysctls.

@pekindenis
Author

> You can try to bind-mount part of the procfs tree into the unprivileged container to show these values, but that makes little sense because you won't be able to write these sysctls.

I thought they could be mounted from the host machine with read/write access for the unprivileged LXC container.

> Can you describe a use case for that?

I discovered this issue when I ran Docker Swarm in a container.
It requires the IPVS kernel module and others. All other modules work in unprivileged mode, except IPVS.

@stgraber
Member

Closing as this is a kernel restriction and not an LXC bug.
To be clear, this doesn't mean that we're not interested in keeping up the chat in here, just that there's nothing actionable for the LXC project itself, kernel work will be needed here instead.

@mihalicyn
Member

mihalicyn commented Feb 14, 2023

@pekindenis

> I thought they could be mounted from the host machine with read/write access for the unprivileged LXC container.

There are two issues with the procfs bind-mount solution:

  • bind-mounting itself does not change file permissions
  • /proc/sys/net is tied to the network namespace, so if you bind-mount it from the host you'll see the host net namespace's IPVS sysctl values, not the container's
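
The permission point can be checked directly by listing mode bits and owners of the entries in question. This is a minimal sketch with a hypothetical helper name; the default path is the directory discussed in this thread. Inside an unprivileged container, files owned by host root typically show up under the overflow uid (usually 65534), which is why the owner-write bit does not help the container's root.

```python
import stat
from pathlib import Path


def sysctl_modes(directory: str = "/proc/sys/net/ipv4/vs") -> dict:
    """Return {name: {"mode": permission bits, "uid": owner uid}} per file."""
    report = {}
    for entry in sorted(Path(directory).iterdir()):
        if not entry.is_file():
            continue  # skip subdirectories, only sysctl files are of interest
        info = entry.stat()
        report[entry.name] = {
            "mode": stat.S_IMODE(info.st_mode),  # e.g. 0o644 for -rw-r--r--
            "uid": info.st_uid,                  # owner as seen from this namespace
        }
    return report
```

Printing `oct(info["mode"])` and `info["uid"]` for each entry, once on the host and once in the container, shows whether a bind mount changed anything.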

> I discovered this issue when I ran Docker Swarm in a container.
> It requires the IPVS kernel module and others (portainer/portainer#7736 (comment)). All other modules work in unprivileged mode, except IPVS.

IPVS itself works in unprivileged mode, but Docker wants to modify the default values of IPVS sysctls: https://github.com/moby/libnetwork/blob/master/osl/namespace_linux.go#L679
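
To see how such a write fails, here is a minimal sketch of setting a sysctl through procfs and reporting the error instead of aborting. It is not the libnetwork code: `write_sysctl` and its `root` parameter are hypothetical, and the key and value in the usage note are illustrative only.

```python
import os


def write_sysctl(key: str, value: str, root: str = "/proc/sys") -> bool:
    """Try to set a sysctl such as 'net.ipv4.vs.conntrack'; True on success."""
    path = os.path.join(root, *key.split("."))
    try:
        # No O_CREAT: a sysctl that is not exposed must fail, not be created.
        fd = os.open(path, os.O_WRONLY)
    except OSError as err:
        print(f"cannot open {key}: {err}")
        return False
    try:
        os.write(fd, value.encode())
    except OSError as err:
        # Typically a permission error when the entry is read-only here.
        print(f"cannot write {key}: {err}")
        return False
    finally:
        os.close(fd)
    return True
```

For example, `write_sysctl("net.ipv4.vs.conn_reuse_mode", "0")` would be expected to return False in an unprivileged container on the kernel from this report, because the file is missing there.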

For now I suggest using a privileged container, but I'll put this on my to-do list and we'll discuss IPVS containerization internally with the LXD team.

cc @stgraber

@pekindenis
Author

@mihalicyn

> For now I suggest using a privileged container, but I'll put this on my to-do list and we'll discuss IPVS containerization internally with the LXD team.

Thank you, I have already seen that the LXD team solved a similar issue.

@mihalicyn
Member

> @mihalicyn
>
> > For now I suggest using a privileged container, but I'll put this on my to-do list and we'll discuss IPVS containerization internally with the LXD team.
>
> Thank you, I have already seen that the LXD team solved a similar issue.

Yep, once I do that I'll let you know!

@mihalicyn
Member

kuba-moo pushed a commit to linux-netdev/testing that referenced this issue Apr 16, 2024
Let's make all IPVS sysctls visible and RO even when
network namespace is owned by non-initial user namespace.

Let's make a few sysctls to be writable:
- conntrack
- conn_reuse_mode
- expire_nodest_conn
- expire_quiescent_template

I'm trying to be conservative with this to prevent
introducing any security issues in there. Maybe,
we can allow more sysctls to be writable, but let's
do this on-demand and when we see real use-case.

This list of sysctls was chosen because I can't
see any security risks allowing them and also
Kubernetes uses [2] these specific sysctls.

This patch is motivated by user request in the LXC
project [1].

[1] lxc/lxc#4278
[2] https://github.com/kubernetes/kubernetes/blob/b722d017a34b300a2284b890448e5a605f21d01e/pkg/proxy/ipvs/proxier.go#L103

Cc: Stéphane Graber <stgraber@stgraber.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Julian Anastasov <ja@ssi.bg>
Cc: Simon Horman <horms@verge.net.au>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Jozsef Kadlecsik <kadlec@netfilter.org>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: NipaLocal <nipa@local>
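
Read as a policy, the commit message above comes down to a small lookup. The sketch below only restates that description and is not kernel code. The function name is hypothetical, and treating every entry as writable in the initial user namespace is an assumption based on the host listing earlier in this thread.

```python
# Sysctls the commit message lists as writable from a non-initial user namespace.
WRITABLE_IN_USERNS = frozenset({
    "conntrack",
    "conn_reuse_mode",
    "expire_nodest_conn",
    "expire_quiescent_template",
})


def expected_access(name: str, initial_userns: bool) -> str:
    """Return 'rw' or 'ro' for an IPVS sysctl under the described policy."""
    if initial_userns or name in WRITABLE_IN_USERNS:
        return "rw"
    # Everything else is visible but read-only in a non-initial user namespace.
    return "ro"
```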