
update images to Debian buster, detect iptables mode #82966

Merged
merged 6 commits on Nov 16, 2019

Conversation

danwinship
Contributor

@danwinship danwinship commented Sep 21, 2019

What this PR does / why we need it:
On systems running iptables 1.8, containers that run iptables in the root network namespace (such as our kube-proxy image) need to run iptables in the same mode (legacy / nft) as the base system. As of 1.16 our answer was "you have to run the base system in legacy mode"; this PR lets us support either mode (and provides a template for other image creators to follow, although we'll want to provide some slightly more generic scripts for them eventually).

Which issue(s) this PR fixes:
Fixes #71305
Fixes #81729

Does this PR introduce a user-facing change?:

The official kube-proxy image (used by kubeadm, among other things) is now
compatible with systems running iptables 1.8 in "nft" mode, and will autodetect
which mode it should use.

/kind feature
/sig network
/priority important-soon

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. sig/network Categorizes an issue or PR as relevant to SIG Network. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Sep 21, 2019
@danwinship
Contributor Author

/assign @thockin

I started with your branch, and reverted the changes to the debian-iptables build; we don't need the nftables package, because iptables includes both iptables-legacy and iptables-nft.

My original plan had been to install a script to detect the mode, which we could run before starting kube-proxy, but the process for building the kube-proxy (etc.) container images isn't really set up to deal with having a wrapper/init script...

So then I realized that if I made the iptables binaries themselves be the detect-and-run-update-alternatives script, then we wouldn't need to modify the kube-proxy image at all (and in fact, any image based on the debian-iptables image would automatically work as expected). So that's the current version. I tested the debian-iptables image in isolation but not as part of kube-proxy yet because I was having trouble building the release images.

danw@p50:debian-iptables (iptables-nft)> sudo docker run --network=host --privileged -it k8s.gcr.io/debian-iptables-amd64:v12.0.0 /bin/sh
# ls -l /usr/sbin/iptables /etc/alternatives/iptables
lrwxrwxrwx 1 root root 26 Sep 21 13:33 /usr/sbin/iptables -> /etc/alternatives/iptables
lrwxrwxrwx 1 root root 26 Sep 21 14:54 /etc/alternatives/iptables -> /usr/sbin/iptables-wrapper
# iptables -C INPUT -m comment --comment "test" -j ACCEPT
iptables: Bad rule (does a matching rule exist in that chain?).
# iptables -A INPUT -m comment --comment "test" -j ACCEPT
# ls -l /etc/alternatives/iptables
lrwxrwxrwx 1 root root 25 Sep 21 15:07 /etc/alternatives/iptables -> /usr/sbin/iptables-legacy
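
For anyone adapting this to their own images, a minimal sketch of the detect-and-hand-off idea looks roughly like the following (an illustration rather than the exact script merged here; the comparison heuristic and binary paths are assumptions based on what Debian buster ships):

#!/bin/sh
# Count the entries each backend reports; whichever backend the host has
# already populated (e.g. by kubelet) is the one this container should use.
num_legacy=$( (iptables-legacy-save || true; ip6tables-legacy-save || true) 2>/dev/null | wc -l)
num_nft=$( (iptables-nft-save || true; ip6tables-nft-save || true) 2>/dev/null | wc -l)

mode=legacy
if [ "${num_nft}" -gt "${num_legacy}" ]; then
    mode=nft
fi

# Flip the alternatives link so later iptables calls bypass this wrapper,
# then hand the current invocation off to the real binary.
update-alternatives --set iptables "/usr/sbin/iptables-${mode}" > /dev/null
exec "/usr/sbin/iptables-${mode}" "$@"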

@praseodym
Contributor

Will this be affected by the iptables 1.8.2 bug described in #82361?

@danwinship
Contributor Author

kube-proxy doesn't use iptables -C, so it's not affected by the bug from #82361. So if you have a base system with iptables 1.8.3, you could safely use this iptables 1.8.2-based kube-proxy image on it in nft mode. But if other people are using our debian-iptables base image for other things, they might run into the problem. But hopefully that should all be fixed in debian by the time kube 1.17 goes out.

Member

@BenTheElder BenTheElder left a comment


clever approach :-)

ncurses-base \
ncurses-bin \
systemd \
systemd-sysv \
sysv-rc \
Member


changes in this file seem unrelated?

Contributor Author


That's from Tim's commit changing the base from stretch to buster; the exact set of packages that gets installed by default is different and some of the ones that were getting purged from the image before either don't exist or can't be removed now. (Though there are probably additional packages that could be added to the purge list now.)

Member


Exactly. My POC commit was hack-and-slack to get something that builds. I am not an apt expert and I don't really know where this list came from, so SOMEONE would have to do it "right".

Member


Have we identified anyone who ACTUALLY knows how to PROPERLY reproduce this process?

@tallclair ?

Member


The original process for coming up with this list was going through all the installed dependencies (apt list --installed) and trying to remove anything that didn't seem important (for a container, e.g. no need for an init system).
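
A rough way to redo that enumeration on a buster image (just one heuristic for spotting purge candidates, not the documented process) would be something like:

# List installed packages, largest first, to spot candidates for the purge list
dpkg-query -W -f='${Installed-Size}\t${Package}\n' | sort -rn | head -n 30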

The original motivation for this was twofold:

  1. reduce the size of the image, though I now think this is somewhat irrelevant as the base layer is shared among containers that use it, and the application layer tends to dwarf the base layer (at least in the fluentd case).
  2. reduce the vulnerability scanner noise. This is the more significant motivation (though I may be biased). Fewer deps == fewer times we need to bump the image to pick up a vulnerability fix in an irrelevant dependency.

The tradeoff is that by removing base dependencies, we're more likely to hit untested corner cases, and it's harder to update the image (as we see here).

/cc @rphillips - did the last major version bump

Member


for reference, this is the PR with the last major version bump
#52744

Member

@justinsb justinsb Nov 18, 2019


We could also build the image up (i.e. adding packages to a scratch image), using bazel. The downside is that we can't run the postinstall scripts, so e.g. if we need special users we have to create them separately. But ... it's very doable if we have a list of the packages we rely on.

@aojea
Member

aojea commented Sep 29, 2019

I think this fixes #81729 too

@BenTheElder
Member

this looks good to me, happy to help push the image(s) when this is ready - feel free to shoot me a ping

@@ -33,22 +33,22 @@ SUDO=$(if $(filter 0,$(shell id -u)),,sudo)
export DOCKER_CLI_EXPERIMENTAL := enabled

ifeq ($(ARCH),amd64)
-BASEIMAGE?=debian:stretch
+BASEIMAGE?=debian:buster
Member


seems there is a slim flavour of buster in the docker registry at about half the size (buster-slim: 25.84 MB vs buster: 48.05 MB) https://hub.docker.com/_/debian/?tab=tags&page=1&name=buster
is it worth trying it?
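
A hypothetical way to try it without editing anything, assuming BASEIMAGE can be overridden on the make command line (and that the usual build target applies):

# Build the amd64 debian-iptables image against the slim variant for comparison
make ARCH=amd64 BASEIMAGE=debian:buster-slim build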

Contributor Author


Possibly... This is sort of the same issue as above with the apt-get purge. People can submit follow-up commits to slim things down further if we don't get it perfect here.

Member


yes. I don't know who did the debian-base work before, but we probably need community ownership and docs on how to repeat the process for new debian releases

Member


@tallclair was it you?


Member


Looks like the slim version is mostly redundant with the directories we remove, but I think moving to that base can't hurt.

I agree we need a better process for this base image. I'm tempted to say that we should get out of the image maintenance game and just scrap this for buster-slim, but I worry about how much noise we'll end up with from the vulnerability scanners.

That would be somewhat mitigated if we do a full rebuild from the latest upstream base on every release. E.g. rebuild the intermediate images (debian-iptables) as part of the release process.

Member

@aojea aojea Oct 27, 2019


Sorry if this is orthogonal or was previously discussed, but given the image size and security concerns, why use debian instead of alpine? See for example #84420.

@danwinship
Contributor Author

this looks good to me, happy to help push the image(s) when this is ready - feel free to shoot me a ping

OK. I'm not totally sure what needs to be done here; it seems like pull-kubernetes-cross won't even pass completely until the images already exist? If you know what needs to be done there though, and you're happy with the general build changes, and @thockin is happy with the iptables-specific changes (which he hasn't commented on yet), then I think this is ready to go?

@danwinship danwinship changed the title WIP update images to Debian buster, detect iptables mode update images to Debian buster, detect iptables mode Sep 30, 2019
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 30, 2019
@danwinship
Contributor Author

But if other people are using our debian-iptables base image for other things, they might run into the problem. But hopefully that should all be fixed in debian by the time kube 1.17 goes out.

Comments in #82361 suggest that iptables 1.8.3 might not actually be backported to buster, but that it's considered legitimate to pull single packages from buster-backports, which is just a small change:

+# Install latest iptables package from buster-backports
+RUN echo deb http://deb.debian.org/debian buster-backports main >> /etc/apt/sources.list; \
+    apt-get update; \
+    apt-get -t buster-backports -y --no-install-recommends install iptables

@@ -110,7 +110,7 @@ def debian_image_dependencies():
digest = _digest(_DEBIAN_BASE_DIGEST, arch),
registry = "k8s.gcr.io",
repository = "debian-base",
tag = "0.4.1", # ignored, but kept here for documentation
tag = "1.0.0", # ignored, but kept here for documentation
Member


i'm pretty sure the shas above need to be updated, but that is pointless until we push the images anyhow

Member


(lines 91-96, not sure how we are populating these exactly)

Member


the debian base and debian-iptables digests maps above need to be updated with the values in
#82966 (comment)
#82966 (comment)
these are in the same order except the overall manifest is at the bottom.
amd64, arm, arm64, ppc64le, s390x. the digest is the hash at the end of the line after "digest" 🙃

also, this is v2.0.0 now, right? (even if it is ignored)

@BenTheElder
Member

when the base image changes are approved, @thockin or I (or another googler) can push & promote these and then we need to update the references to match them.

in the future image promotion & pushing will run under CNCF infra and build will generally be automated, but that switch hasn't been flipped yet..

# Detect whether the base system is using iptables-legacy or
# iptables-nft. This assumes that some non-containerized process (eg
# kubelet) has already created some iptables rules.
num_legacy_lines=$( (iptables-legacy-save || true; ip6tables-legacy-save || true) 2>/dev/null | wc -l)
Member


What about systems where ip6 isn't enabled? Will this error?

Contributor Author


yes; that's what the "|| true" is for.
I'm continuing to work on https://github.com/danwinship/iptables-wrappers/ and one of the iptables hackers suggested we should add " | grep -v "^\([*:#]\|COMMIT\)"" here too so we're only comparing actual rules and don't get tricked if there are just lots of tables. (Although I think that's probably not really a problem since we know kubelet will create like 8 rules or so before this script ever runs, so even if all the default tables exist in the "wrong" version, we should still be OK.)
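
Applied to the line above, the suggested filter would look something like this (a sketch of the proposal, not a change that's in this PR):

# Ignore table headers (*filter), chain declarations (:INPUT ...), comments,
# and COMMIT lines, so that only actual rules are counted.
num_legacy_lines=$( (iptables-legacy-save || true; ip6tables-legacy-save || true) 2>/dev/null | grep -v "^\([*:#]\|COMMIT\)" | wc -l)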

Member


any reason not to check for actual rules?

Member

@thockin thockin left a comment


My original PR edited the debian-iptables/Dockerfile to include nftables - is that not needed?

mode=nft
fi

update-alternatives --set iptables "/usr/sbin/iptables-${mode}" > /dev/null
Member


Is there a reason to update-alternatives vs just execing the correct mode binary directly?

Contributor Author


It moves the wrapper script out of the way so that all future iptables calls will go directly to the correct binary. (We don't want kube-proxy to have to run "iptables-save" an extra time to figure out the correct mode every time it pushes an update.)
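
For comparison, a wrapper that skipped update-alternatives and just exec'd the detected binary (a hypothetical variant, not what this PR does) would repeat the detection on every invocation:

#!/bin/sh
# Hypothetical exec-only variant: every iptables call re-runs the *-save
# detection before the real command executes, which is exactly the extra
# cost the update-alternatives switch avoids.
num_legacy=$( (iptables-legacy-save || true) 2>/dev/null | wc -l)
num_nft=$( (iptables-nft-save || true) 2>/dev/null | wc -l)
mode=legacy
if [ "${num_nft}" -gt "${num_legacy}" ]; then
    mode=nft
fi
exec "/usr/sbin/iptables-${mode}" "$@"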

@danwinship
Contributor Author

My original PR edited the debian-iptables/Dockerfile to include nftables - is that not needed?

It's not needed. The nftables package contains the non-iptables-API-emulating nft binary, but iptables-nft doesn't use that.

@danwinship danwinship force-pushed the iptables-nft branch 2 times, most recently from d9986cc to b1d283b Compare October 17, 2019 11:56
mskrocki added a commit to mskrocki/cilium that referenced this pull request Apr 29, 2020
This feature is required by any OS that enables nft-based iptables,
e.g. Debian >= 10 and CentOS 8.1, which ship iptables >= 1.8.

This solution is based on kube-proxy fix to the same issue:
kubernetes/kubernetes#82966

Bumped cilium-runtime base image to Ubuntu 20.04 that has iptables 1.8.4
Removed libgcc-5-dev which is not available in 20.04.
Added iptables-wrapper that handles iptables mode detection.

Signed-off-by: Maciej Skrocki <maciejskrocki@google.com>
mskrocki added a commit to mskrocki/cilium that referenced this pull request Apr 30, 2020
aanm pushed a commit to cilium/cilium that referenced this pull request May 5, 2020
champtar added a commit to champtar/dns that referenced this pull request Sep 15, 2020
The debian-iptables container transparently selects iptables-legacy or iptables-nft since v12.0.0:
kubernetes/kubernetes#82966

Signed-off-by: Etienne Champetier <etienne.champetier@anevia.com>
sayboras added a commit to sayboras/cilium that referenced this pull request May 31, 2022
The original iptables-wrapper script came from [1]; however, it was
spun off to [2] in the k8s upstream repo. This commit picks up the
latest iptables-wrapper script.

[1]: kubernetes/kubernetes#82966
[2]: https://github.com/kubernetes-sigs/iptables-wrappers/blob/master/iptables-wrapper-installer.sh
Signed-off-by: Tam Mach <tam.mach@cilium.io>
jrfastab added a commit to cilium/cilium that referenced this pull request Jun 9, 2022
Cilium currently chooses between iptables-legacy and iptables-nft using an
iptables-wrapper script. The script does a simple check: if there are more
than 10 rules in iptables-legacy it picks legacy mode; otherwise it picks
whichever of nft or legacy has more rules. See [1] for the original wrapper
this is taken from.

This however can be problematic in some cases. We've hit an environment
where arguably broken pods insert rules directly into iptables without
checking legacy vs nft. This can happen, for example, with older pods
that use an iptables package before 1.8.4 that was buggy or missing nft
altogether. When this happens it becomes a race to see which pods come
online first and insert rules into the table, and if the count exceeds 10
Cilium will flip into legacy mode. This becomes painfully obvious if the
agent is restarted after the system has been running and these buggy pods
have already created their rules. At this point Cilium may be using legacy
while kube-proxy and kubelet are running in nft space (more on why this is
bad below).

We can quickly check this from a sysdump with a few one liners,

$ find . -name iptables-nft-save* | xargs wc -l
  1495 ./cilium-bugtool-cilium-1234/cmd/iptables-nft-save--c.md
$ find . -name iptables-save* | xargs wc -l
  109  ./cilium-bugtool-cilium-1234/cmd/iptables-save--c.md

here we see that a single node has a significant amount of rules in both
nft and legacy tables. In the above example we dove into the legacy
table and found the normal CILIUM-* chains and rules. Then in the nft
tables we see the standard KUBE-PROXY-* chains and rules.

Another scenario where we can create a similar problem is with an old
kube-proxy. In this hypothetical scenario the user upgrades to a new
distribution/kernel with a base iptables image that points to iptables-nft.
This will cause kubelet to use nft tables, but because of the older
version of kube-proxy it may use iptables. Now kubelet and kube-proxy
are out of sync. Now how should Cilium pick nft or legacy?

Let's analyze the two scenarios. Assume Cilium and kube-proxy pick
differently. First we might ask which runs first, nft or iptables. From
the kernel side it's unclear to me. The hooks are run by walking an array,
but it appears those hooks are registered at runtime, so it's up to which
hooks register first. Hooks register at module init, so we're left
wondering which of nft or legacy registers first. This may well depend on
whether iptables-legacy or iptables-nft runs first, because the module
init happens on demand via a request_module helper. Bottom line: the
ordering is fragile at best. For this discussion let's assume we can't
make any claims about whether nft or iptables runs first.

Next, let's assume kube-proxy is in nft, Cilium is in legacy, and nft
runs first. This breaks Cilium's expectation that its rules run before
kube-proxy's and any other iptables rules. The result can be drops in the
datapath. The example that led us on this adventure was IPSEC traffic
hitting a kube-proxy -j DROP rule because it never reached the Cilium
-j ACCEPT rule we expected to be inserted at the front of the chain. So
clearly this is no good.

Just to cover our cases, consider Cilium running first and then
kube-proxy. We are still stuck: on the kernel side the hooks are
executed in a for loop, and an ACCEPT runs the next hook instead of
the normal "accept the skb and run no further rules". The next hook
in this case will have the kube-proxy rules and we hit the same
-j DROP rule again.

Finally, because we can't depend on the order of nft vs legacy
running, it doesn't matter if Cilium and kube-proxy flip to put
Cilium on nft and kube-proxy on legacy; we get the same problem.

Because Cilium and kube-proxy are coupled in that they both
manage iptables for datapath flows they need to be on the same
hook. We could try to do this by doing [2] and following kubelet AND
assuming kube-proxy does the same everything should be OK. The
problem is if kube-proxy is not updated and doesn't follow
kubelet we again get stuck with Cilium and kube-proxy using different
hooks. To fix this case modify [2] so that Cilium follows kube-proxy
instead of following kubelet. This will force cilium and kube-proxy
to at least choose the same hook and avoid the faults outlined
above. There is a corner case if kube-proxy is not up before
cilium, but experimentally it seems kube-proxy is started close
to kubelet and init paths so is in fact up before cilium making
this ok. If we ever need to verify this in sysdump we can check
startAt times in the k8s-pod.yaml to confirm the start ordering
of pods.

For reference, the original iptables-wrapper script that Cilium used
prior to this patch came from [1]. This patch is based on the new
wrapper [2] in the k8s upstream repo.

[1]: kubernetes/kubernetes#82966
[2]: https://github.com/kubernetes-sigs/iptables-wrappers/blob/master/iptables-wrapper-installer.sh

Signed-off-by: Tam Mach <tam.mach@cilium.io>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
jrfastab added a commit to cilium/cilium that referenced this pull request Jun 9, 2022
joestringer pushed a commit to cilium/cilium that referenced this pull request Jun 9, 2022
qmonnet pushed a commit to cilium/cilium that referenced this pull request Jun 30, 2022
qmonnet pushed a commit to cilium/cilium that referenced this pull request Jul 5, 2022
joestringer pushed a commit to cilium/cilium that referenced this pull request Jul 6, 2022
aanm pushed a commit to cilium/cilium that referenced this pull request Jul 8, 2022
joestringer pushed a commit to cilium/cilium that referenced this pull request Jul 12, 2022
tengattack pushed a commit to tengattack/kubernetes that referenced this pull request Nov 22, 2023
update images to Debian buster, detect iptables mode
Conflicts:
	build/common.sh
	build/workspace.bzl
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/network Categorizes an issue or PR as relevant to SIG Network. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

kube-proxy fails to delete nat entries with IPv6
kube-proxy currently incompatible with iptables >= 1.8
10 participants