
kube-proxy currently incompatible with iptables >= 1.8 #71305

Closed
drags opened this issue Nov 21, 2018 · 81 comments · Fixed by #82966
@drags commented Nov 21, 2018

What happened:

When creating nodes on machines with iptables >= 1.8, kube-proxy is unable to initialize and route service traffic. The following is logged:

kube-proxy-22hmk kube-proxy E1120 07:08:50.135017       1 proxier.go:647] Failed to ensure that nat chain KUBE-SERVICES exists: error creating chain "KUBE-SERVICES": exit status 3: iptables v1.6.0: can't initialize iptables table `nat': Table does not exist (do you need to insmod?)
kube-proxy-22hmk kube-proxy Perhaps iptables or your kernel needs to be upgraded.

This is a compatibility issue in iptables, which I believe is called directly by kube-proxy. It is likely due to the module reorganization that came with iptables' move to nf_tables: https://marc.info/?l=netfilter&m=154028964211233&w=2

iptables 1.8 is backwards compatible with the kernel modules used by iptables 1.6, but not vice versa:

root@vm77:~# iptables --version
iptables v1.6.1
root@vm77:~# docker run --cap-add=NET_ADMIN drags/iptables:1.6 iptables -t nat -Ln
iptables: No chain/target/match by that name.
root@vm77:~# docker run --cap-add=NET_ADMIN drags/iptables:1.8 iptables -t nat -Ln
iptables: No chain/target/match by that name.



root@vm83:~# iptables --version
iptables v1.8.1 (nf_tables)
root@vm83:~# docker run --cap-add=NET_ADMIN drags/iptables:1.6 iptables -t nat -Ln
iptables v1.6.0: can't initialize iptables table `nat': Table does not exist (do you need to insmod?)
Perhaps iptables or your kernel needs to be upgraded.
root@vm83:~# docker run --cap-add=NET_ADMIN drags/iptables:1.8 iptables -t nat -Ln
iptables: No chain/target/match by that name.

However, the kube-proxy image is based on debian:stretch, where iptables 1.8 is only likely to arrive via stretch-backports.

How to reproduce it (as minimally and precisely as possible):

Install a node onto a host with iptables-1.8 installed (ex: Debian Testing/Buster)

Anything else we need to know?:

I can keep these nodes in this config for a while, feel free to ask for any helpful output.

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.2", GitCommit:"17c77c7898218073f14c8d573582e8d2313dc740", GitTreeState:"clean", BuildDate:"2018-10-24T06:54:59Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.4", GitCommit:"bf9a868e8ea3d3a8fa53cbb22f566771b3f8068b", GitTreeState:"clean", BuildDate:"2018-10-25T19:06:30Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:

libvirt

  • OS (e.g. from /etc/os-release):
PRETTY_NAME="Debian GNU/Linux buster/sid"
NAME="Debian GNU/Linux"
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
  • Kernel (e.g. uname -a):
Linux vm28 4.16.0-1-amd64 #1 SMP Debian 4.16.5-1 (2018-04-29) x86_64 GNU/Linux
  • Install tools:

kubeadm

  • Others:

/kind bug

@drags commented Nov 21, 2018

/sig network

@drags commented Nov 28, 2018

@kubernetes/sig-network-bugs

@k8s-ci-robot commented Nov 28, 2018

@drags: Reiterating the mentions to trigger a notification:
@kubernetes/sig-network-bugs

In response to this:

@kubernetes/sig-network-bugs

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@danderson commented Dec 1, 2018

For the record, this probably breaks at least Calico and Weave as well, based on my abject failures to get pod<>pod networking to function on Debian Buster (which has upgraded to iptables 1.8). I'm filing bugs for that now, but this breaking change to iptables may be worth a wider broadcast to the k8s community.

@uablrek commented Dec 17, 2018

kube-proxy itself seems compatible with iptables >= 1.8, so the title of this issue is somewhat misleading. I have run basic tests and see no problems when using the correct version of the user-space iptables (and, for IPv6, ip6tables) and the supporting libs. I don't think this problem can be fixed by altering code in kube-proxy.

Tested versions: iptables v1.8.2, Linux 4.19.3

The problem seems to be that the iptables user-space program (and libs) is, and has always been, dependent on the kernel version on the host. When the iptables user-space program in a container is an old version, this problem is bound to happen sooner or later, and it will happen again.

The kernel/user-space dependency is one of the problems that nft is supposed to fix. A long-term solution may be to replace iptables with nft or bpf.

@uablrek commented Dec 17, 2018

iptables v1.8.2 has two modes, selected by soft-links:

# iptables -V
iptables v1.8.2 (nf_tables)

and:

# iptables -V
iptables v1.8.2 (legacy)

kube-proxy seems to work fine with both.

BTW, I have not tested any network policies; that is not kube-proxy, of course, but it is iptables.

@drags commented Dec 17, 2018

While the title is somewhat murky, the fact is that kube-proxy is distributed in images based on debian-stretch and pulls in the iptables userspace from that distribution. When those images are run on hosts with a newer iptables, this fails.

To be clear: this isn't a defect in the code, it's a defect in packaging/release.

@thockin commented Dec 18, 2018

kube-proxy is distributed using images based on debian-stretch and pulls in the iptables userspace from that distribution. When those images are run on hosts with a newer iptables this fails

Do you mean it breaks on a newer kernel? The iptables binary is part of kube-proxy so what would the on-host iptables have to do with anything?

I don't understand.

@danderson commented Dec 18, 2018

There are 2 sets of modules for packet filtering in the kernel: ip_tables, and nf_tables. Until recently, you controlled the ip_tables ruleset with the iptables family of tools, and nf_tables with the nft tools.

In iptables 1.8, the maintainers have "deprecated" the classic ip_tables: the iptables tool now does userspace translation from the legacy UI/UX, and uses nf_tables under the hood. So, the commands look and feel the same, but they're now programming a different kernel subsystem.

The problem arises when you mix and match invocations of iptables 1.6 (the previous stable) and 1.8 on the same machine: although they look identical, they're programming different kernel subsystems. At least Docker does some stuff with iptables on the host (uncontained), so you end up with some rules in nf_tables and some rules (including those programmed by kube-proxy and most CNI addons) in legacy ip_tables.

Empirically, this causes weird and wonderful things to happen - things like if you trace a packet coming from a pod, you see it flowing through both ip_tables and nf_tables, but even if both accept the packet, it then vanishes entirely and never gets forwarded (this is the failure mode I reported to Calico and Weave - bug links upthread - after trying to run k8s on debian testing, which now has iptables 1.8 on the host).

Bottom line, the networking containers on a machine have to be using the same minor version of the iptables binary as exists on the host.
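The mode distinction above can be probed mechanically by parsing the version banner. A minimal sketch (the `classify_iptables` helper is mine, not kube-proxy code), assuming the banner formats shown in this thread — 1.8+ appends "(nf_tables)" or "(legacy)", while pre-1.8 binaries print no suffix and always drive legacy ip_tables:

```shell
# Classify an `iptables -V` banner by the backend it drives.
classify_iptables() {
  case "$1" in
    *"(nf_tables)"*) echo nft ;;
    *"(legacy)"*)    echo legacy ;;
    *)               echo legacy ;;  # pre-1.8 binaries are always ip_tables
  esac
}

classify_iptables "iptables v1.8.2 (nf_tables)"  # -> nft
classify_iptables "iptables v1.8.2 (legacy)"     # -> legacy
classify_iptables "iptables v1.6.0"              # -> legacy
```

On a live host you would feed it `"$(iptables -V)"`; note this only reports what the *binary* would do, not which backend already holds rules.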

@uablrek commented Dec 18, 2018

@danderson Do you think it would be sufficient to force (if possible) the host version of iptables to "legacy":

# iptables -V
iptables v1.8.2 (legacy)

and keep the >=1.8 version?

I build and install iptables myself, and the "mode" is determined by a soft-link:

# ls -l /usr/sbin/iptables
lrwxrwxrwx    1 root     root            20 Dec 18 08:47 /usr/sbin/iptables -> xtables-legacy-multi*

I assume the same applies for "Debian Testing/Buster" and others, but I don't know for sure.
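The soft-link observation above suggests a simple host-side check. A hedged sketch (the `mode_from_link` helper is hypothetical; it assumes the `xtables-nft-multi`/`xtables-legacy-multi` target names that an iptables 1.8 install uses):

```shell
# Infer the iptables mode from the target of the /usr/sbin/iptables soft-link.
mode_from_link() {
  case "$1" in
    *xtables-nft-multi*)    echo nf_tables ;;
    *xtables-legacy-multi*) echo legacy ;;
    *)                      echo unknown ;;
  esac
}

# On a live host: mode_from_link "$(readlink -f /usr/sbin/iptables)"
mode_from_link "/usr/sbin/xtables-legacy-multi"  # -> legacy
```

On Debian the link is managed by update-alternatives, so distributions may point it elsewhere; `unknown` is the honest fallback.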

@thockin commented Dec 18, 2018

@danderson thanks. That was very succinct.

What a crappy situation. How are we to know what is on the host? Can we include BOTH binaries in our images and probe the machine to see if either has been used previously (e.g. lsmod or something in /sys)?

@danderson commented Dec 18, 2018

As a preface, one thing to note: iptables 1.8 ships two binaries, iptables and iptables-legacy. The latter always programs ip_tables. So, there's fortunately no need to bundle two versions of iptables into a container, you can bundle just iptables 1.8 and be judicious about which binary you invoke... At least until the -legacy binary gets deleted, presumably in a future release.

Here's some requirements I think an ideal solution would have:

  • k8s networking must continue to function, obviously.
  • should be robust to the host iptables getting upgraded while the system is running (e.g. apt-get upgrade in the background).
  • should be robust to other k8s pods (e.g. CNI addons) using the "wrong" version of iptables.
  • should be invisible to cluster operators - k8s should just keep working throughout.
  • should not require a "flag day" on which everything must cut over simultaneously. There's too many things in k8s that touch iptables (docker, kube-proxy, CNI addons) to enforce that sanely, and k8s's eventual consistency model doesn't make a hard cutover without downtime possible anyway.
  • at the very least, the problem should be detected and surfaced as a fatal node misconfiguration, so that any automatic cluster healing can attempt to help.

So far I've only thought up crappy options for dealing with this. I'll throw them out in the hopes that it leads to better ideas.

  • Mount chunks of the host filesystem (/usr/sbin, /lib, ...) into kube-proxy's VFS, and make it chroot() to that quasi-host-fs when executing iptables commands. That way it's always using exactly the binary present on the host. Introduces obvious complexity, as well as a bunch of security risks if an attacker gets code execution in the kube-proxy container.
  • Using iptables 1.8 in the container, probe both iptables and iptables-legacy for the presence of rules installed by the host. Hopefully, there will be rules in only one of the two, and that can tell kube-proxy which one to use. This is subject to race conditions, and is fragile to host mutations that happen after kube-proxy startup (e.g. apt-get upgrade that upgrades iptables and restarts the docker daemon, shifting its rules over to nf_tables). Can solve it with periodic reconciling (i.e. "oops, host seems to have switched to nf_tables, wipe all ip_tables rules and reinstall them in nf_tables!")
  • Punt the problem up to kubeadm and an entry in the KubeProxyConfiguration cluster object. IOW, just document that "it's your responsibility to correctly tell kube-proxy which version of iptables you're using, or things will break." Relies on humans to get things right, which I predict will cause a rash of broken clusters. If we do this, we should absolutely also wire something into node-problem-detector that fires when both ip_tables and nf_tables have rules programmed.
  • Have a cutover release in which kube-proxy starts using nf_tables exclusively, through the nft tools, and mandate that host OSes for k8s must do everything in nf_tables, no ip_tables allowed. Likely intractable given the variety of addons and non-k8s software that does stuff to the firewall (same reason iptables has endured all these years even though nftables is measurably better in every way).
  • Find some kernel hackers and ask them if there's any way to make ip_tables and nf_tables play nicer together, so that userspace can just continue tolerating mismatches indefinitely. I'm assuming this is ~impossible, otherwise they'd have done it already to facilitate the transition to nf_tables.
  • Create a new DaemonSet whose sole purpose is to be an RPC-to-iptables translator, and get all iptables-using pods in k8s to use it instead of talking direct to the kernel. Clunky, expensive, and doesn't solve the problem of host software touching stuff.
  • Just document (via a Sonobuoy conformance test) that this is a big bag of knives, and kick the can over to cluster operators to figure out how to safely upgrade k8s in place given these constraints. I can at least speak on behalf of GKE and say that I sure hope it doesn't come to that, because all our options are strictly worse. I can also speak as the author of MetalLB and say that the support load from people with broken on-prem installs will be completely unsustainable for me :)

Of all of these, I think "probe with both binaries and try to conform to whatever is already there" is the most tractable if kube-proxy were the only problem pod... But given the ecosystem of CNI addons and other third-party things, I foresee never ending duels of controllers flapping between ip_tables and nf_tables endlessly, all trying to vaguely converge on a single stack, but never succeeding.
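The "probe with both binaries" option could be sketched roughly like this (helper names and the tie-breaking default are mine; `iptables-legacy-save` and `iptables-nft-save` are the dump tools shipped with iptables 1.8, and lines beginning with `-` in their output are rules):

```shell
# Count the rules visible to one backend's save tool.
count_rules() { "$1" 2>/dev/null | grep -c '^-'; }

# Follow whichever backend the host is already using (more rules wins);
# defaulting to nft on a tie is an arbitrary choice in this sketch.
pick_backend() {  # pick_backend LEGACY_COUNT NFT_COUNT
  if [ "$1" -gt "$2" ]; then echo legacy; else echo nft; fi
}

# On a live host, roughly:
#   pick_backend "$(count_rules iptables-legacy-save)" \
#                "$(count_rules iptables-nft-save)"
pick_backend 12 0   # host rules live in ip_tables -> legacy
pick_backend 0 12   # host rules live in nf_tables -> nft
```

As noted above, this is racy and fragile against host mutations after startup, so it would need to be re-run as part of a periodic reconcile rather than once at boot.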

@uablrek commented Dec 19, 2018

When using iptables 1.8.2 in nf_tables mode, ipset (my version: v6.38) is still used by kube-proxy. But in nft, ipset is "built-in".

It seems to work anyway, but I can't understand how; maybe my testing is insufficient.

I will try to test better and make sure the ipsets are actually exercised, so they are not just defined but unused, with my tests happening to pass by accident.

But if anyone can explain the relation between iptables in nf_tables mode and ipset, please give a reference to some doc.

@uablrek commented Dec 19, 2018

ipset is only used in proxy-mode=ipvs. I get hits on ipset rules, so they work in some way:

Chain KUBE-SERVICES (2 references)
 pkts bytes target     prot opt in     out     source               destination         
    0     0 KUBE-MARK-MASQ  all  --  *      *      !11.0.0.0/16          0.0.0.0/0            match-set KUBE-CLUSTER-IP dst,dst /* Kubernetes service cluster ip + port for masquerade purpose */
   23  1380 KUBE-MARK-MASQ  all  --  *      *       0.0.0.0/0            0.0.0.0/0            match-set KUBE-EXTERNAL-IP dst,dst /* Kubernetes service external ip + port for masquerade and filter purpose */
   23  1380 ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            match-set KUBE-EXTERNAL-IP dst,dst PHYSDEV match ! --physdev-is-in ADDRTYPE match src-type !LOCAL /* Kubernetes service external ip + port for masquerade and filter purpose */

@uablrek commented Dec 19, 2018

When using nf_tables mode, duplicate rules are added indefinitely to the KUBE-FIREWALL chain:

Chain KUBE-FIREWALL (2 references)
 pkts bytes target     prot opt in     out     source               destination         
    0     0 DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            mark match 0x8000/0x8000 /* kubernetes firewall for dropping marked packets */
    0     0 DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            mark match 0x8000/0x8000 /* kubernetes firewall for dropping marked packets */
    0     0 DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            mark match 0x8000/0x8000 /* kubernetes firewall for dropping marked packets */
    0     0 DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            mark match 0x8000/0x8000 /* kubernetes firewall for dropping marked packets */
....

in both proxy modes, ipvs and iptables.
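The runaway growth above is consistent with a check-then-append loop whose check never matches its own rules under the nf_tables backend. A toy simulation (not kube-proxy code; all names are mine) of that failure mode:

```shell
# Simulate an ensure-rule loop: append a rule only when a check says it
# is missing. If the check is broken (always reports "missing"), every
# sync pass appends another copy of the same rule.
RULES=""
append_rule()   { RULES="${RULES}${1}
"; }
working_check() { printf '%s' "$RULES" | grep -qF "$1"; }
broken_check()  { return 1; }                # always reports "missing"
ensure_rule()   { "$1" "$2" || append_rule "$2"; }

for i in 1 2 3; do ensure_rule working_check "DROP mark 0x8000"; done
printf '%s' "$RULES" | grep -c 'DROP'        # 1: check deduplicates

for i in 1 2 3; do ensure_rule broken_check "DROP mark 0x8000"; done
printf '%s' "$RULES" | grep -c 'DROP'        # 4: duplicates pile up
```

Whether `iptables -C` really failed this way against nf_tables is an inference from the symptom here, not something confirmed in this thread.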

@Vonor commented Dec 30, 2018

I experienced the same issue in #72370. As a workaround I found this in the Oracle docs, which made the pods able to communicate with each other, as well as with the outside world, again.

@danwinship commented Jan 25, 2019

I discussed iptables/nft incompatibility in #62720 too, although that was before the iptables binary got rewritten...

It seems like for right now, the backward-compatible answer is "you have to make sure the host is using iptables in legacy mode".

@praseodym commented Mar 16, 2019

FWIW, I hit this issue as well when deploying Kubernetes on Debian Buster. I've included some logging in #75418.

@mcoreix commented Apr 3, 2019

This works for me: update-alternatives --set iptables /usr/sbin/iptables-legacy

@jiridanek commented Mar 11, 2020

If I read correctly, this issue was resolved by #82966, and therefore Kubernetes 1.17 (kube-proxy in 1.17) should work without having to switch the nodes to iptables-legacy?

@danwinship commented Mar 11, 2020

The official kubernetes packages, and in particular kubeadm-based installs, are fixed as of 1.17. Other distributions of kubernetes may have been fixed earlier or might not be fixed yet.

@BenTheElder commented Mar 15, 2020

did we wind up backporting this at all?

@danwinship commented Mar 17, 2020

no... we'd have to backport the whole rebasing-the-images-to-debian-buster thing which seems like a big change for a point release


hswong3i added a commit to alvistack/ansible-role-kube_kubelet that referenced this issue May 3, 2020
@CharlieReitzel commented Apr 19, 2021

I experienced the same issue in #72370. As a workaround I found this in the oracle docs, which made the pods be able to communicate with each other as well as with the outside world again.

Updated link:
https://docs.oracle.com/en/operating-systems/oracle-linux/kubernetes/kube_admin_config.html#kube_admin_config_iptables

diurnalist added a commit to ChameleonCloud/kolla-containers that referenced this issue Apr 26, 2021
iptables in RHEL 8 containers is at version 1.8, which is not compatible
with a host system at RHEL 7 using iptables 1.4.

kubernetes/kubernetes#71305