JonathonReinhart/linux-netns-sysctl-verify

linux-netns-sysctl-verify

Linux network namespace sysctl safety verifier.

Ensure that net sysctls are network-namespace-safe.

Usage

usage: verify.py [-h] [-v]

optional arguments:
  -h, --help     show this help message and exit
  -v, --verbose  Verbose output

Currently, this must be run as root, in order to use CLONE_NEWNET.

$ sudo ./verify.py -v

Theory of Operation

The premise behind this tool is simple:

  • Take a snapshot of all values in /proc/sys/net.
  • Create a child process with a new netns (using CLONE_NEWNET).
  • In the child netns, modify every writable value in /proc/sys/net.
  • Exit the child netns.
  • Take a second snapshot of /proc/sys/net.
  • Compare the snapshots and report any differences.
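The child-process step above can be sketched roughly as follows. This is a minimal illustration and not necessarily how verify.py does it: the helper name run_in_new_netns is hypothetical, and it assumes a Linux system where the C library's unshare(2) wrapper is reachable through ctypes. The callback is where the sysctl writes would happen.

```python
import ctypes
import os

CLONE_NEWNET = 0x40000000  # value from <linux/sched.h>


def run_in_new_netns(func):
    """Fork, move the child into a fresh netns, run func() there, and wait.

    Hypothetical helper. Returns the child's exit code: 0 if func()
    completed, 1 if unshare() or func() failed, -1 if killed by a signal.
    """
    pid = os.fork()
    if pid == 0:
        # Child: detach from the parent's network namespace, then do the work.
        status = 1
        try:
            libc = ctypes.CDLL(None, use_errno=True)
            if libc.unshare(CLONE_NEWNET) != 0:
                err = ctypes.get_errno()
                raise OSError(err, os.strerror(err))
            func()
            status = 0
        except Exception:
            status = 1
        finally:
            # Leave without running the parent's cleanup handlers.
            os._exit(status)
    # Parent: its own netns is untouched; just wait for the child.
    _, wait_status = os.waitpid(pid, 0)
    if os.WIFEXITED(wait_status):
        return os.WEXITSTATUS(wait_status)
    return -1
```

Without root (more precisely, without CAP_SYS_ADMIN), the unshare() call is expected to fail with EPERM, so the helper would return 1.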

Anything in the parent which changed as a result of manipulations in the child is considered a "leak".
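The snapshot-and-compare part of this idea might look like the sketch below. The function names snapshot and find_leaks are illustrative assumptions, not taken from verify.py, and the root directory is a parameter so the logic can be tried on any directory tree.

```python
import os


def snapshot(root="/proc/sys/net"):
    """Map each readable file under root to its raw contents."""
    values = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    values[path] = f.read()
            except OSError:
                # Some entries are write-only or fail on read; skip them.
                continue
    return values


def find_leaks(before, after):
    """Return {path: (old, new)} for every path whose value differs."""
    leaks = {}
    for path in sorted(before.keys() | after.keys()):
        old, new = before.get(path), after.get(path)
        if old != new:
            leaks[path] = (old, new)
    return leaks
```

Taking one snapshot before the child runs and one after it exits, then passing both to find_leaks, yields the set of candidate leaks. An empty result means nothing observable changed in the parent.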

Background

The Linux kernel provides runtime-configurable kernel parameters known as "sysctls", which are accessed via /proc/sys/.
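As a small illustration of that mapping, a dotted sysctl name such as net.ipv4.tcp_congestion_control corresponds to the file /proc/sys/net/ipv4/tcp_congestion_control. The helpers below are a hypothetical sketch; they assume no path component itself contains a dot (interface names like eth0.100 would need extra handling).

```python
from pathlib import PurePosixPath


def sysctl_path(name, root="/proc/sys"):
    """Translate a dotted sysctl name into its path under root.

    Assumes every dot in the name is a separator.
    """
    return PurePosixPath(root).joinpath(*name.split("."))


def read_sysctl(name, root="/proc/sys"):
    """Read a sysctl value as text, with the trailing newline stripped."""
    with open(str(sysctl_path(name, root))) as f:
        return f.read().rstrip("\n")
```

For example, read_sysctl("net.ipv4.tcp_congestion_control") would return the name of the current default TCP congestion control algorithm on a Linux host.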

Linux also supports network namespaces (netns), which enable isolated virtual network stacks and are used heavily by containerization platforms such as LXC and Docker. See network_namespaces(7).

It's generally understood that the "net" sysctls (under /proc/sys/net) are supposed to be "netns safe", meaning that manipulating sysctls from one network namespace cannot affect any other network namespace. This isn't exactly guaranteed, though.

It may be desirable to allow a container to write to net sysctls, specifically parameters of devices which exist only within the container's netns. However, the latest version of Docker (20.10.6 as of this writing) mounts all of /proc/sys read-only, to prevent changes made in a container from "leaking" out of the container. This protection mechanism makes it more difficult (and less secure) to run a libvirt QEMU VM inside of a Docker container.
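One way to see whether this read-only protection is in effect inside a container is to inspect /proc/self/mountinfo. The sketch below is an illustration only: the function names are hypothetical, and it relies on the field layout documented in proc(5), where the fifth field is the mount point and the sixth is the per-mount option list.

```python
def mount_options(mountinfo_text, target):
    """Return the option set of the last mount whose mount point is target.

    Returns None if no mount is attached exactly at target.
    """
    options = None
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue
        # fields[4] is the mount point, fields[5] the per-mount options.
        if fields[4] == target:
            options = set(fields[5].split(","))
    return options


def is_read_only(mountinfo_text, target="/proc/sys"):
    """True if a mount exists exactly at target and carries the ro option."""
    options = mount_options(mountinfo_text, target)
    return options is not None and "ro" in options
```

Reading /proc/self/mountinfo and passing its contents to is_read_only would report True when a separate read-only mount sits on /proc/sys, and False when /proc/sys is simply part of a writable /proc mount.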

This tool was inspired by conversation on this runc issue.

Results

This tool helped uncover several bugs in the Linux kernel's sysctl implementation, all of which have since been fixed by this tool's author:

Bug 1: Several nf_conntrack sysctls are global and writable by any netns

  • Affected sysctls:
    • net.nf_conntrack_max
    • net.netfilter.nf_conntrack_max
    • net.netfilter.nf_conntrack_expect_max
  • First broken: (long ago; since introduction of net namespaces)
  • Fix: netfilter: conntrack: Make global sysctls readonly in non-init netns
  • Fixed in Kernels:

Bug 2: tcp_allowed_congestion_control is global and writable by any netns

  • Affected sysctls:
    • net.ipv4.tcp_allowed_congestion_control
  • First broken: v5.7
  • Fix: net: Make tcp_allowed_congestion_control readonly in non-init netns
  • Fixed in Kernels:

Bug 3: Setting tcp_congestion_control can globally affect tcp_allowed_congestion_control

  • Related sysctls:
    • net.ipv4.tcp_congestion_control (affects)
    • net.ipv4.tcp_allowed_congestion_control (affected)
  • First broken: v4.15
  • Fix: net: Only allow init netns to set default tcp cong to a restricted algo
  • Fixed in Kernels:

Additionally, a safety check was added to the kernel to prevent certain classes of bugs from going unnoticed:

  • 31c4d2f160eb: net: Ensure net namespace isolation of sysctls
