
"Connection reset by peer" due to invalid conntrack packets #117924

Closed
junqiang1992 opened this issue May 11, 2023 · 53 comments · Fixed by #120412
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/network Categorizes an issue or PR as relevant to SIG Network. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@junqiang1992

What happened?

When packets with out-of-window sequence numbers arrive at a Kubernetes node, conntrack marks them as INVALID. kube-proxy ignores them and does not rewrite the DNAT. However, since the host itself has no matching connection for these packets, it sends a TCP RST, which tears down the client's connection.

What did you expect to happen?

The connection should not be reset.

How can we reproduce it (as minimally and precisely as possible)?

#74839

Anything else we need to know?

This problem can be worked around with the command:
iptables -t filter -I INPUT -p tcp -m conntrack --ctstate INVALID -j DROP

Similar issue: #74839.
But in that issue the DROP rule is placed on the FORWARD chain; our scenario needs it on the INPUT chain.
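The difference between the two placements can be sketched as follows (illustrative commands only; the helper function is mine, not part of kube-proxy):

```shell
#!/bin/sh
# Sketch: where the INVALID-state DROP rule lands in the two scenarios.
# FORWARD matches packets routed through the node toward pods (#74839);
# INPUT matches packets delivered to the host's own TCP stack, which is
# where the RST in this issue originates.

drop_invalid_rule() {
    chain="$1"
    echo "iptables -t filter -I ${chain} -p tcp -m conntrack --ctstate INVALID -j DROP"
}

drop_invalid_rule FORWARD   # the #74839 workaround
drop_invalid_rule INPUT     # the workaround needed for this issue
```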

Kubernetes version

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.0", GitCommit:"c2b5237ccd9c0f1d600d3072634ca66cefdf272f", GitTreeState:"clean", BuildDate:"2021-08-04T18:03:20Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.0", GitCommit:"c2b5237ccd9c0f1d600d3072634ca66cefdf272f", GitTreeState:"clean", BuildDate:"2021-08-04T17:57:25Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@junqiang1992 junqiang1992 added the kind/bug Categorizes issue or PR as related to a bug. label May 11, 2023
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 11, 2023
@aojea
Member

aojea commented May 11, 2023

/sig network

is this about pods with hostNetwork: true ?
if not, why is this a kube-proxy issue if the problem is in INPUT?

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 11, 2023
@shaneutt
Member

/assign @aojea

@thockin
Member

thockin commented May 11, 2023

Yeah, to be clear - filter INPUT applies to packets which are going to be delivered to a process. Is that kube-proxy's job to define?

@uablrek
Contributor

uablrek commented May 12, 2023

The "How can we reproduce it" section sounds a bit like this always happens. Here is a better reproduction instruction: moby/libnetwork#1090 (comment)

@uablrek
Contributor

uablrek commented May 12, 2023

IMO this is the best description of the problem:

The problem as already mentioned by @aaronlehmann is that benign "invalid" packets to the SNAT'ed container (caused for instance by TCP window overflow due to high throughput but slow client) are assigned to the host interface and considered incorrectly martians, which causes a connection reset.

@uablrek
Contributor

uablrek commented May 12, 2023

echo 1 > /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal

This may be a better solution than letting kube-proxy add rules in the INPUT chain
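The sysctl lives at different proc paths depending on the kernel: the command above uses the legacy ipv4 path, while newer kernels expose it under net.netfilter. A small helper sketch (the function and its presence flags are mine, for illustration only):

```shell
#!/bin/sh
# Sketch: pick which conntrack "be liberal" proc path to write.
# Takes two flags ("1" = that path exists on this host) so the selection
# logic is testable without a live kernel.

liberal_sysctl_path() {
    modern_exists="$1"; legacy_exists="$2"
    if [ "$modern_exists" = 1 ]; then
        echo /proc/sys/net/netfilter/nf_conntrack_tcp_be_liberal
    elif [ "$legacy_exists" = 1 ]; then
        echo /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal
    else
        echo ""   # conntrack module not loaded
    fi
}

# On a real node (requires root) you would then run:
#   echo 1 > "$(liberal_sysctl_path 1 0)"
```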

@aojea
Member

aojea commented May 12, 2023

hehe, ironically we have to decide whether to_be_liberal or to_be_strict; one is a sysctl call, the other is one (or two) iptables rules.

The iptables rule has also the side effect of breaking people (cc @cyclinder ) #94861 (comment)

If we switch to set the sysctl option, the change does not seem to be backportable.

This may be a better solution than let kube-proxy add rules in the INPUT chain
#117924 (comment)

I tend to agree with Lars, remove the iptables INVALID drop rule and set the sysctl, we are already setting more host sysctls

@thockin @dcbw @danwinship what do you think? it will be nice to have some iptables hackers opinion too

@uablrek
Contributor

uablrek commented May 12, 2023

I tried to recreate this problem in a virtual environment (kvm/qemu) by requesting a 100MB file with curl and running:

iptables -A INPUT -m statistic --mode random --probability 0.05 -j DROP

to provoke some packet loss. The transfer took longer, but never failed.

@danwinship
Contributor

IMO kube-proxy should not set any sysctls that are not literally required for functionality that the user has explicitly opted into (eg net.ipv4.ip_forward for most network plugins). Kube-proxy does not own the host network namespace, and it should not be doing things that will affect other people's host-network traffic, because if we do it's going to break some users. (See also: #94861.)

We can document that we think it's a good idea for users/distros to set ip_conntrack_tcp_be_liberal, and if we think it's almost always a better idea than not, we can warn at startup if it's not set, but (IMO) we shouldn't set it ourselves.

If we can get the same effect with an iptables rule that only affects kube-proxy's own traffic, then I think that's better than setting a sysctl that will also affect non-kube-proxy traffic.

If I understand the situation here correctly, if kube-proxy added a drop rule for the invalid conntrack packets, but then the administrator set ip_conntrack_tcp_be_liberal, then the result would be that conntrack would not mark some packets as invalid, and so our drop rule would just not get hit, and so our drop rule wouldn't interfere with the "better" sysctl-based solution?

@thockin
Member

thockin commented May 15, 2023

We can document that we think it's a good idea for users/distros to set ip_conntrack_tcp_be_liberal, and if we think it's almost always a better idea than not, we can warn at startup if it's not set, but (IMO) we shouldn't set it ourselves.

Agree in general. We have done it in the past (e.g. route-localnet) and mostly it seems like a not-great idea.

If we can get the same effect with an iptables rule that only affects kube-proxy's own traffic, then I think that's better than setting a sysctl that will also affect non-kube-proxy traffic.

Also agree.

@sftim
Contributor

sftim commented Aug 14, 2023

If we see this as a known bug, please consider documenting it in https://kubernetes.io/docs/reference/networking/virtual-ips/ - even if latest Kubernetes includes a fix (we can nevertheless point that out!)

@cyclinder
Contributor

I'll fix this in the next few days.

@danwinship
Contributor

We can document that we think it's a good idea for users/distros to set ip_conntrack_tcp_be_liberal, and if we think it's almost always a better idea than not, we can warn at startup if it's not set, but (IMO) we shouldn't set it ourselves.

From further googling, it seems like there have been situations in the past where people were setting that sysctl to work around bugs in conntrack, which were later fixed. I think this may be another such situation; it seems like our problem is that there is some state involving dropped or retransmitted packets which linux would cope with fine in the non-conntrack case, but which it doesn't handle in the conntrack case, unless "be_liberal" is set, and that seems like a bug.

Every time I look away from this set of issues/PRs for longer than 15 minutes I forget the exact scenario that causes the bug, but if someone could summarize it very clearly we could try dragging some kernel developers in.

(The regression test in the e2e suite creates a packet with an intentionally-bad sequence number, but it's not clear to me what sort of real-world issue that's supposed to be representing.)

@aojea
Member

aojea commented Aug 22, 2023

(The regression test in the e2e suite creates a packet with an intentionally-bad sequence number, but it's not clear to me what sort of real-world issue that's supposed to be representing.)

https://kubernetes.io/blog/2019/03/29/kube-proxy-subtleties-debugging-an-intermittent-connection-reset/

@uablrek
Contributor

uablrek commented Aug 23, 2023

Thanks @aojea, now that was an excellent article! Well written (not TL;DR that is), and IMO perfect technical level. If there is some collection of recommended reading for persons who want to know about K8s networking, this article should be in it.

@wyike

wyike commented Aug 28, 2023

Hi @aojea thank you for the document , it's really helpful!

I have some questions:

  1. Per "Add workaround for spurious retransmits leading to connection resets" (moby/libnetwork#1090 (comment)), should we update this sentence accordingly: ip_conntrack_tcp_be_liberal → nf_conntrack_tcp_be_liberal?

Make conntrack more liberal on packets, and don’t mark the packets as INVALID. In Linux, you can do this by echo 1 > /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal.

  2. After we use the above workaround, would there be any side effects for the cluster network or cluster nodes?

@danwinship
Contributor

(The regression test in the e2e suite creates a packet with an intentionally-bad sequence number, but it's not clear to me what sort of real-world issue that's supposed to be representing.)

https://kubernetes.io/blog/2019/03/29/kube-proxy-subtleties-debugging-an-intermittent-connection-reset/

That says "this problem will occur if conntrack is unable to recognize a packet", but it doesn't explain the real-world scenario that causes conntrack to be unable to recognize a packet.

If we go to the kernel developers and say "manually injecting random incorrect packets into a conntrack'ed connection causes it to get confused", then they'll likely respond either "don't do that then" or "use nf_conntrack_tcp_be_liberal".

But if we can tell them something like "if a connection has a duplicated packet after a dropped packet, then conntrack gets confused", then that sounds a lot more like a bug in conntrack that they ought to fix, such that then everything will just work for everyone in the future without needing any DROP rules or sysctls.

@cyclinder
Contributor

So do we need to revert the drop rules? As @aojea mentioned in #117924 (comment) , it now seems that this makes sense.

@danwinship
Contributor

danwinship commented Aug 30, 2023

There are three potential "fixes" here:

  1. If this is unambiguously a conntrack bug, then we should figure out the details, and get the kernel devs to fix it. However, getting kernel bug fixes into all k8s clusters in the world takes "a long time", so even if it is a kernel bug, we should still think about the other fixes.
  2. Individual cluster admins can use nf_conntrack_tcp_be_liberal. We should advertise this better, but we feel that it would be dubious to have kube-proxy set this flag itself ("Connection reset by peer" due to invalid conntrack packets #117924 (comment)).
  3. As per kube-proxy: Drop packets in INVALID state drops packets from outside the pod range #94861 (comment) we could probably tweak the existing rule so that it didn't interfere with non-k8s packets and as suggested by the OP of this issue, we could add a similar rule to the INPUT chain (though this would need a bit of further thinking about to make sure we weren't introducing any new conflicts with non-k8s traffic).
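Option 3 above might look something like the following (a sketch only; the pod CIDR 10.244.0.0/16 is a hypothetical example, and this is not what kube-proxy actually installs):

```shell
#!/bin/sh
# Sketch: scope the INVALID-state DROP to cluster traffic only, so packets
# unrelated to Kubernetes are never matched. The helper just builds the
# command string for illustration.

scoped_drop_rule() {
    chain="$1"; pod_cidr="$2"
    echo "iptables -t filter -I ${chain} -s ${pod_cidr} -p tcp -m conntrack --ctstate INVALID -j DROP"
}

scoped_drop_rule FORWARD 10.244.0.0/16   # tweak of the existing rule
scoped_drop_rule INPUT   10.244.0.0/16   # the new rule suggested by the OP
```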

@danwinship
Contributor

We could revert the DROP rules and tell people to use the sysctl instead (and actually, probably that's a better answer for this issue than adding a new DROP rule to the INPUT chain is). But we'd need to warn users in advance if we were going to do that, since it would break existing clusters.

@cyclinder
Contributor

I can accept a KEP, but I think this may take multiple releases to complete. How about this?

  • We remove the DROP rule
  • Control the sysctl with a flag or a feature-gate.

If it's a flag, I believe it's relatively simple and we can finish it in 1.29. In the case of a feature gate, this may take multiple releases to complete.

@aojea
Member

aojea commented Sep 3, 2023

If #120354 sets the nf_conntrack_tcp_be_liberal to true by default, we can safely remove the DROP rule

@danwinship
Contributor

In the case of feature-gate, this may take multiple releases to complete.

Yes, but if we were doing it that way, that would be because we considered that to be a feature rather than a bug.

We want to avoid breaking working clusters on upgrade. Maybe just having a release note saying "you may need to pass --conntrack-tcp-be-liberal to kube-proxy" is enough.

If #120354 sets the nf_conntrack_tcp_be_liberal to true by default, we can safely remove the DROP rule

I think we agreed we don't want to change any sysctls except by explicit admin request (because that affects the behavior of non-kubernetes networking as well).

@aojea
Member

aojea commented Sep 4, 2023

I think we agreed we don't want to change any sysctls except by explicit admin request (because that affects the behavior of non-kubernetes networking as well).

Agree, but we need to think of the users. This is a problem where a subtle and complex bug may affect 100% of Kubernetes users, most of whom probably don't know they are affected, while the fix impacts only an esoteric cluster networking configuration; and the solution we are offering is "please read the release notes", which we know only 20% of users do... I think advanced users can opt out of the DROP rule if we don't want to mess with the sysctl.

@danwinship
Contributor

If setting nf_conntrack_tcp_be_liberal was a no-brainer good idea, then the kernel would just have that behavior by default. There must be some reason why they don't, and so I worry that if we set that flag by default, then we're just going to start breaking some other subset of users.

Also, if it turns out that this is all just a conntrack bug, which eventually gets fixed, then we wouldn't want to be setting the sysctl after it got fixed. But at that point it would be likely that there were some users unknowingly relying on the sysctl for some other side effect.

@danwinship
Contributor

We could potentially have an iptables rule to count invalid conntrack packets, and expose the count as a metric
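A rule with no -j target only counts its matches, which is enough for a metric. A sketch of the idea (the parsing helper and canned counter output are mine, for illustration; this is not an existing kube-proxy metric):

```shell
#!/bin/sh
# Sketch: count INVALID conntrack packets without dropping them.
# Install (requires root):
#   iptables -t filter -I INPUT -p tcp -m conntrack --ctstate INVALID
# Then scrape the packet counter from `iptables -nvxL INPUT`.

count_invalid() {
    # Reads iptables -nvxL style output on stdin; prints the packet
    # counter of the first rule matching "ctstate INVALID".
    awk '/ctstate INVALID/ { print $1; exit }'
}

# Canned example of one counter line, as -nvxL would print it:
sample='      42     2520            tcp  --  *  *  0.0.0.0/0  0.0.0.0/0  ctstate INVALID'
echo "$sample" | count_invalid   # prints 42
```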

@aojea
Member

aojea commented Sep 4, 2023

If setting nf_conntrack_tcp_be_liberal was a no-brainer good idea, then the kernel would just have that behavior by default. There must be some reason why they don't, and so I worry that if we set that flag by default, then we're just going to start breaking some other subset of users.

I completely and absolutely agree, we should not make that sysctl a default in kube-proxy... what if we read the sysctl and install the DROP rule only if it is not set?
For users that apply the sysctl, we would no longer install the DROP rule.
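That decision logic is simple enough to sketch (a pure helper for illustration; kube-proxy's actual implementation is in Go, not shell):

```shell
#!/bin/sh
# Sketch: install the INVALID DROP rule only when the admin has not
# already enabled be_liberal. Takes the current sysctl value as an
# argument so the decision is testable without root.

decide_drop_rule() {
    be_liberal="$1"   # value read from nf_conntrack_tcp_be_liberal
    if [ "$be_liberal" = 1 ]; then
        echo skip          # sysctl already prevents the INVALID marking
    else
        echo install-drop  # fall back to the iptables DROP rule
    fi
}

decide_drop_rule 0   # prints install-drop
decide_drop_rule 1   # prints skip
```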

@aojea
Member

aojea commented Sep 4, 2023

/assign

let me send a PR , I think that approach covers all cases

@danwinship
Contributor

danwinship commented Sep 6, 2023

If setting nf_conntrack_tcp_be_liberal was a no-brainer good idea, then the kernel would just have that behavior by default. There must be some reason why they don't

FWIW, my kernel conntrack informant tells me that there really aren't any bad side effects of setting it ("I don't see anything that could break by enabling this mode"), and in fact OVN sets it unconditionally on hosts it's running on.

Also, if it turns out that this is all just a conntrack bug, which eventually gets fixed, then we wouldn't want to be setting the sysctl after it got fixed.

He also tells me that the bug (at least in the form that we are causing it in our regression test) is fixed in kernel 6.1 ("netfilter: conntrack: ignore overly delayed tcp packet").

So I'm not sure if that means "we should just go ahead and set the sysctl because it's harmless" or "we should just keep it like it is now and not set the sysctl since it won't be needed in the future"...

(NB: if the latter we still need a docs update to go along with this PR, telling admins about the sysctl.)

@cyclinder
Contributor

If we can determine that the kernel fixes the issue in a certain version, then I think we can do this:

  • Do not install the DROP rule.

  • If the kernel version of the host is greater than the safe kernel version, we do not set sysctl.

  • If less than, we have a flag to set the sysctl, defaulting to 1 (per the earlier comment: "I don't see anything that could break by enabling this mode"), and we let users know that we may remove this sysctl in the future.
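The kernel-version gate in the proposal above could be sketched like this (assuming 6.1 as the "safe" version per the "netfilter: conntrack: ignore overly delayed tcp packet" fix; the helper is illustrative, not kube-proxy code):

```shell
#!/bin/sh
# Sketch: decide whether nf_conntrack_tcp_be_liberal is still needed,
# given a "major.minor" kernel version (e.g. from `uname -r`).

needs_liberal_sysctl() {
    major=${1%%.*}              # text before the first dot
    rest=${1#*.}                # text after the first dot
    minor=${rest%%.*}           # minor, even for "6.1.0"-style input
    if [ "$major" -gt 6 ] || { [ "$major" -eq 6 ] && [ "$minor" -ge 1 ]; }; then
        echo no    # kernel already handles the delayed-packet case
    else
        echo yes   # set the sysctl (unless the user opted out)
    fi
}

needs_liberal_sysctl 5.15   # prints yes
needs_liberal_sysctl 6.1    # prints no
```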

@thockin
Member

thockin commented Sep 6, 2023

If setting nf_conntrack_tcp_be_liberal was a no-brainer good idea, then the kernel would just have that behavior by default.

Exactly my thought. I made this mistake with route_localnet, and we still pay for it.

FWIW, my kernel conntrack informant tells me that there really aren't any bad side effects of setting it

Let me do some internal outreach too, see if I can get a concurring opinion?

@thockin
Member

thockin commented Sep 6, 2023

Kernel ppl here concur:

"I say : if kernel can not be fixed, set tcp_be_liberal to one "

"having Kubernetes set net.netfilter.nf_conntrack_tcp_be_liberal=1 always SGTM, since you want it to work on any kernel version"

@aojea argues for a flag, default to true (or a negative flag, default false), which seems a bit paranoid, but probably smart :)

Are we all in agreement? Who wants to do the PR?

@danwinship
Contributor

@aroradaman already has a PR to add tcp_be_liberal to the existing set of configurable conntrack sysctls in kube-proxy (#120354) so it's just a matter of flipping the default

@cyclinder
Contributor

cyclinder commented Sep 7, 2023

If the sysctl defaults to true, I think this still keeps backward compatibility, So we don't need the DROP rule anymore.

@aojea
Member

aojea commented Sep 7, 2023

I just realized that we need this to be a flag, because there are environments that run kube-proxy with a read-only /sys subsystem; trying to set the sysctl unconditionally would fail and make kube-proxy unusable in those environments.

https://kubernetes.io/docs/tasks/administer-cluster/kubelet-in-userns/#configuring-kube-proxy

@aojea
Member

aojea commented Sep 7, 2023

So we don't need the DROP rule anymore.

if people do not set it, they will hit the bug https://kubernetes.io/blog/2019/03/29/kube-proxy-subtleties-debugging-an-intermittent-connection-reset/ until they run kernel 6.1+ #117924 (comment)

@cyclinder
Contributor

I mean this sysctl defaults to true in kube-proxy. I can't imagine why users would need to set it to false unless the kernel version is safe (≥ 6.1). Of course, we can also document for users that it is better not to do this.

@thockin
Member

thockin commented Sep 7, 2023 via email

@aojea
Member

aojea commented Apr 21, 2024

@danwinship @thockin after setting up the jobs for nftables and hitting the conntrack bug continuously, I think we should default the flag #120354 to true and set the sysctl nf_conntrack_tcp_be_liberal to 1.

We are already setting some sysctls, and for rootless or other environments they can always opt out; this is what we do in kind:

conntrack:
# Skip setting sysctl value "net.netfilter.nf_conntrack_max"
# It is a global variable that affects other namespaces
  maxPerCore: 0
{{if .RootlessProvider}}
# Skip setting "net.netfilter.nf_conntrack_tcp_timeout_established"
  tcpEstablishedTimeout: 0s
# Skip setting "net.netfilter.nf_conntrack_tcp_timeout_close"
  tcpCloseWaitTimeout: 0s
{{end}}{{end}}

@thockin
Member

thockin commented Apr 24, 2024

Sounds OK to me. Risk should be ~zero, right?

@cyclinder
Contributor

Yes, I assume the risk is in setting nf_conntrack_tcp_be_liberal to 1. We do not set it by default, so it defaults to 0 (see the check "if s.Config.Conntrack.TCPBeLiberal {" in the kube-proxy server code).

I'd be happy to work on it. Can I have this one?

@aojea
Member

aojea commented Apr 24, 2024

Talked with @danwinship offline. I don't have very strong arguments to make this enabled by default, so let's wait for more user feedback. nftables will move to beta this release; we can revisit if we have more feedback from users.

@tyler-lloyd
Contributor

tyler-lloyd commented May 8, 2024

@aojea we are considering turning on conntrack-tcp-be-liberal by default in our kube-proxy 1.29+ deployments. Based on the recent discussion here that seems fine and the risk should "be ~zero"(#117924 (comment)), correct? And you are all even discussing turning this on by default in kube-proxy in some future version?

One thing I'm a little confused on: is the fix to conntrack in 6.1 doing the same thing as turning on nf_conntrack_tcp_be_liberal? So by setting nf_conntrack_tcp_be_liberal=1 you're just turning on behavior that will be on by default in 6.1. Did I get that right?

@danwinship
Contributor

is the fix to conntrack in 6.1 doing the same thing as turning on nf_conntrack_tcp_be_liberal?

Not exactly. nf_conntrack_tcp_be_liberal affects a handful of situations. 6.1 only "normalized" one of them.

The general intent of nf_conntrack_tcp_be_liberal is something like "make conntrack work better with packets from broken TCP implementations". But the netfilter devs realized that one of the places it was being used was actually something that could happen due to totally normal TCP packet drops/retransmissions. And so in 6.1+, that particular situation is just handled correctly automatically, whether nf_conntrack_tcp_be_liberal is enabled or not.

I think we should default the flag to true and set the sysctl nf_conntrack_tcp_be_liberal to 1

My argument against this is that once we do it, we have to keep doing it forever, because people might start unwittingly depending on the other side effects of that sysctl. And particularly given that

  1. Most people never hit the bug anyway. (It seems like we got more reports of the "my cluster is broken because of the DROP rule" bug than we ever got of the "my cluster is broken because of conntrack invalid packet handling" bug.)
  2. Some day once everyone has moved to 6.1 the bug will be gone forever

ISTM that it's better to still not set it by default.

@tyler-lloyd
Contributor

Thanks for the clarification on the 6.1 bugfix @danwinship.

because people might start unwittingly depending on the other side effects of that sysctl.

That's true, and they probably will. I'm just trying to leverage some of the earlier discussion here to make a decision about the safety of flipping this to true, since it seems like the level of concern was not too high, basically nil:

FWIW, my kernel conntrack informant tells me that there really aren't any bad side effects of setting it

Kernel ppl here concur:
"I say : if kernel can not be fixed, set tcp_be_liberal to one "
"having Kubernetes set net.netfilter.nf_conntrack_tcp_be_liberal=1 always SGTM, since you want it to work on any kernel version"
