
"Connection reset by peer" due to invalid conntrack packets #117924

Closed
junqiang1992 opened this issue May 11, 2023 · 53 comments · Fixed by #120412
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/network Categorizes an issue or PR as relevant to SIG Network. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@junqiang1992

What happened?

When packets with out-of-window sequence numbers arrive at a Kubernetes node, conntrack marks them as INVALID. kube-proxy ignores them and does not rewrite the DNAT. However, since the host itself has no matching connection for these packets, it sends a TCP RST, which tears down the client's connection.

What did you expect to happen?

The connection should not be reset.

How can we reproduce it (as minimally and precisely as possible)?

#74839

Anything else we need to know?

This problem can be worked around with the command:
iptables -t filter -I INPUT -p tcp -m conntrack --ctstate INVALID -j DROP

Similar issue: #74839.
But in that issue the DROP rule is placed on the FORWARD chain; our scenario needs it on the INPUT chain.
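The difference between the two placements can be sketched as follows (illustrative commands only; the helper function is mine, not part of kube-proxy):

```shell
#!/bin/sh
# Sketch: where the INVALID-state DROP rule lands in the two scenarios.
# FORWARD matches packets routed through the node toward pods (#74839);
# INPUT matches packets delivered to the host's own TCP stack, which is
# where the RST in this issue originates.

drop_invalid_rule() {
    chain="$1"
    echo "iptables -t filter -I ${chain} -p tcp -m conntrack --ctstate INVALID -j DROP"
}

drop_invalid_rule FORWARD   # the #74839 workaround
drop_invalid_rule INPUT     # the workaround needed for this issue
```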

Kubernetes version

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.0", GitCommit:"c2b5237ccd9c0f1d600d3072634ca66cefdf272f", GitTreeState:"clean", BuildDate:"2021-08-04T18:03:20Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.0", GitCommit:"c2b5237ccd9c0f1d600d3072634ca66cefdf272f", GitTreeState:"clean", BuildDate:"2021-08-04T17:57:25Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@junqiang1992 junqiang1992 added the kind/bug Categorizes issue or PR as related to a bug. label May 11, 2023
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 11, 2023
@aojea
Member

aojea commented May 11, 2023

/sig network

is this about pods with hostNetwork: true ?
if not, why is this a kube-proxy issue if the problem is in INPUT?

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 11, 2023
@shaneutt
Member

/assign @aojea

@thockin
Member

thockin commented May 11, 2023

Yeah, to be clear - filter INPUT applies to packets which are going to be delivered to a process. Is that kube-proxy's job to define?

@uablrek
Contributor

uablrek commented May 12, 2023

The "How can we reproduce it" section sounds a bit like this always happens. Here is a better reproduction instruction: moby/libnetwork#1090 (comment)

@uablrek
Contributor

uablrek commented May 12, 2023

IMO this is the best description of the problem:

The problem as already mentioned by @aaronlehmann is that benign "invalid" packets to the SNAT'ed container (caused for instance by TCP window overflow due to high throughput but slow client) are assigned to the host interface and considered incorrectly martians, which causes a connection reset.

@uablrek
Contributor

uablrek commented May 12, 2023

echo 1 > /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal

This may be a better solution than letting kube-proxy add rules in the INPUT chain
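The sysctl lives at different proc paths depending on the kernel: the command above uses the legacy ipv4 path, while newer kernels expose it under net.netfilter. A small helper sketch (the function and its presence flags are mine, for illustration only):

```shell
#!/bin/sh
# Sketch: pick which conntrack "be liberal" proc path to write.
# Takes two flags ("1" = that path exists on this host) so the selection
# logic is testable without a live kernel.

liberal_sysctl_path() {
    modern_exists="$1"; legacy_exists="$2"
    if [ "$modern_exists" = 1 ]; then
        echo /proc/sys/net/netfilter/nf_conntrack_tcp_be_liberal
    elif [ "$legacy_exists" = 1 ]; then
        echo /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal
    else
        echo ""   # conntrack module not loaded
    fi
}

# On a real node (requires root) you would then run:
#   echo 1 > "$(liberal_sysctl_path 1 0)"
```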

@aojea
Member

aojea commented May 12, 2023

hehe, ironically we have to decide whether to_be_liberal or to_be_strict; one is a sysctl call, the other is one (or two) iptables rules.

The iptables rule has also the side effect of breaking people (cc @cyclinder ) #94861 (comment)

If we switch to set the sysctl option, the change does not seem to be backportable.

This may be a better solution than let kube-proxy add rules in the INPUT chain
#117924 (comment)

I tend to agree with Lars, remove the iptables INVALID drop rule and set the sysctl, we are already setting more host sysctls

@thockin @dcbw @danwinship what do you think? it will be nice to have some iptables hackers opinion too

@uablrek
Contributor

uablrek commented May 12, 2023

I tried to recreate this problem in a virtual environment (kvm/qemu) by requesting a 100MB file with curl and running:

iptables -A INPUT -m statistic --mode random --probability 0.05 -j DROP

to provoke some packet loss. The transfer took longer, but never failed.

@danwinship
Contributor

IMO kube-proxy should not set any sysctls that are not literally required for functionality that the user has explicitly opted into (eg net.ipv4.ip_forward for most network plugins). Kube-proxy does not own the host network namespace, and it should not be doing things that will affect other people's host-network traffic, because if we do it's going to break some users. (See also: #94861.)

We can document that we think it's a good idea for users/distros to set ip_conntrack_tcp_be_liberal, and if we think it's almost always a better idea than not, we can warn at startup if it's not set, but (IMO) we shouldn't set it ourselves.

If we can get the same effect with an iptables rule that only affects kube-proxy's own traffic, then I think that's better than setting a sysctl that will also affect non-kube-proxy traffic.

If I understand the situation here correctly, if kube-proxy added a drop rule for the invalid conntrack packets, but then the administrator set ip_conntrack_tcp_be_liberal, then the result would be that conntrack would not mark some packets as invalid, and so our drop rule would just not get hit, and so our drop rule wouldn't interfere with the "better" sysctl-based solution?

@thockin
Member

thockin commented May 15, 2023

We can document that we think it's a good idea for users/distros to set ip_conntrack_tcp_be_liberal, and if we think it's almost always a better idea than not, we can warn at startup if it's not set, but (IMO) we shouldn't set it ourselves.

Agree in general. We have done it in the past (e.g. route-localnet) and mostly it seems like a not-great idea.

If we can get the same effect with an iptables rule that only affects kube-proxy's own traffic, then I think that's better than setting a sysctl that will also affect non-kube-proxy traffic.

Also agree.

@sftim
Contributor

sftim commented Aug 14, 2023

If we see this as a known bug, please consider documenting it in https://kubernetes.io/docs/reference/networking/virtual-ips/ - even if latest Kubernetes includes a fix (we can nevertheless point that out!)

@cyclinder
Contributor

I'll fix this in the next few days.

@danwinship
Contributor

We can document that we think it's a good idea for users/distros to set ip_conntrack_tcp_be_liberal, and if we think it's almost always a better idea than not, we can warn at startup if it's not set, but (IMO) we shouldn't set it ourselves.

From further googling, it seems like there have been situations in the past where people were setting that sysctl to work around bugs in conntrack, which were later fixed. I think this may be another such situation; it seems like our problem is that there is some state involving dropped or retransmitted packets which linux would cope with fine in the non-conntrack case, but which it doesn't handle in the conntrack case, unless "be_liberal" is set, and that seems like a bug.

Every time I look away from this set of issues/PRs for longer than 15 minutes I forget the exact scenario that causes the bug, but if someone could summarize it very clearly we could try dragging some kernel developers in.

(The regression test in the e2e suite creates a packet with an intentionally-bad sequence number, but it's not clear to me what sort of real-world issue that's supposed to be representing.)

@aojea
Member

aojea commented Aug 22, 2023

(The regression test in the e2e suite creates a packet with an intentionally-bad sequence number, but it's not clear to me what sort of real-world issue that's supposed to be representing.)

https://kubernetes.io/blog/2019/03/29/kube-proxy-subtleties-debugging-an-intermittent-connection-reset/

@uablrek
Contributor

uablrek commented Aug 23, 2023

Thanks @aojea, now that was an excellent article! Well written (not TL;DR that is), and IMO perfect technical level. If there is some collection of recommended reading for persons who want to know about K8s networking, this article should be in it.

@wyike

wyike commented Aug 28, 2023

Hi @aojea thank you for the document , it's really helpful!

I have some questions:

  1. Per "Add workaround for spurious retransmits leading to connection resets" (moby/libnetwork#1090 (comment)), should we update this sentence accordingly: ip_conntrack_tcp_be_liberal → nf_conntrack_tcp_be_liberal?

Make conntrack more liberal on packets, and don’t mark the packets as INVALID. In Linux, you can do this by echo 1 > /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal.

  2. After we use the above workaround, would there be any side effects for the cluster network or cluster nodes?

@danwinship
Contributor

(The regression test in the e2e suite creates a packet with an intentionally-bad sequence number, but it's not clear to me what sort of real-world issue that's supposed to be representing.)

https://kubernetes.io/blog/2019/03/29/kube-proxy-subtleties-debugging-an-intermittent-connection-reset/

That says "this problem will occur if conntrack is unable to recognize a packet", but it doesn't explain the real-world scenario that causes conntrack to be unable to recognize a packet.

If we go to the kernel developers and say "manually injecting random incorrect packets into a conntrack'ed connection causes it to get confused", then they'll likely respond either "don't do that then" or "use nf_conntrack_tcp_be_liberal".

But if we can tell them something like "if a connection has a duplicated packet after a dropped packet, then conntrack gets confused", then that sounds a lot more like a bug in conntrack that they ought to fix, such that then everything will just work for everyone in the future without needing any DROP rules or sysctls.

@cyclinder
Contributor

So do we need to revert the drop rules? As @aojea mentioned in #117924 (comment) , it now seems that this makes sense.

@danwinship
Contributor

danwinship commented Aug 30, 2023

There are three potential "fixes" here:

  1. If this is unambiguously a conntrack bug, then we should figure out the details, and get the kernel devs to fix it. However, getting kernel bug fixes into all k8s clusters in the world takes "a long time", so even if it is a kernel bug, we should still think about the other fixes.
  2. Individual cluster admins can use nf_conntrack_tcp_be_liberal. We should advertise this better, but we feel that it would be dubious to have kube-proxy set this flag itself ("Connection reset by peer" due to invalid conntrack packets #117924 (comment)).
  3. As per kube-proxy: Drop packets in INVALID state drops packets from outside the pod range #94861 (comment) we could probably tweak the existing rule so that it didn't interfere with non-k8s packets and as suggested by the OP of this issue, we could add a similar rule to the INPUT chain (though this would need a bit of further thinking about to make sure we weren't introducing any new conflicts with non-k8s traffic).
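Option 3 above might look something like the following (a sketch only; the pod CIDR 10.244.0.0/16 is a hypothetical example, and this is not what kube-proxy actually installs):

```shell
#!/bin/sh
# Sketch: scope the INVALID-state DROP to cluster traffic only, so packets
# unrelated to Kubernetes are never matched. The helper just builds the
# command string for illustration.

scoped_drop_rule() {
    chain="$1"; pod_cidr="$2"
    echo "iptables -t filter -I ${chain} -s ${pod_cidr} -p tcp -m conntrack --ctstate INVALID -j DROP"
}

scoped_drop_rule FORWARD 10.244.0.0/16   # tweak of the existing rule
scoped_drop_rule INPUT   10.244.0.0/16   # the new rule suggested by the OP
```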

@danwinship
Contributor

We could revert the DROP rules and tell people to use the sysctl instead (and actually, probably that's a better answer for this issue than adding a new DROP rule to the INPUT chain is). But we'd need to warn users in advance if we were going to do that, since it would break existing clusters.

@cyclinder
Contributor

I can accept a KEP, but I think this may take multiple releases to complete. How about this?

  • We remove the DROP rule
  • Control the sysctl with a flag or a feature-gate.

If it's a flag, I believe it's relatively simple and we can finish it in 1.29. In the case of a feature gate, this may take multiple releases to complete.

@aojea
Member

aojea commented Sep 3, 2023

If #120354 sets the nf_conntrack_tcp_be_liberal to true by default, we can safely remove the DROP rule

@danwinship
Contributor

In the case of feature-gate, this may take multiple releases to complete.

Yes, but if we were doing it that way, that would be because we considered that to be a feature rather than a bug.

We want to avoid breaking working clusters on upgrade. Maybe just having a release note saying "you may need to pass --conntrack-tcp-be-liberal to kube-proxy" is enough.

If #120354 sets the nf_conntrack_tcp_be_liberal to true by default, we can safely remove the DROP rule

I think we agreed we don't want to change any sysctls except by explicit admin request (because that affects the behavior of non-kubernetes networking as well).

@aojea
Member

aojea commented Sep 4, 2023

I think we agreed we don't want to change any sysctls except by explicit admin request (because that affects the behavior of non-kubernetes networking as well).

Agree, but we need to think of the users. This is a problem where a subtle and complex bug may affect 100% of Kubernetes users, most of whom probably don't know they are affected, while the fix impacts only an esoteric cluster networking configuration; and the solution we are offering is "please read the release notes", which we know only 20% of users do... I think advanced users can opt out of the DROP rule if we don't want to mess with the sysctl.

@danwinship
Contributor

If setting nf_conntrack_tcp_be_liberal was a no-brainer good idea, then the kernel would just have that behavior by default. There must be some reason why they don't, and so I worry that if we set that flag by default, then we're just going to start breaking some other subset of users.

Also, if it turns out that this is all just a conntrack bug, which eventually gets fixed, then we wouldn't want to be setting the sysctl after it got fixed. But at that point it would be likely that there were some users unknowingly relying on the sysctl for some other side effect.

@danwinship
Contributor

We could potentially have an iptables rule to count invalid conntrack packets, and expose the count as a metric
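A rule with no -j target only counts its matches, which is enough for a metric. A sketch of the idea (the parsing helper and canned counter output are mine, for illustration; this is not an existing kube-proxy metric):

```shell
#!/bin/sh
# Sketch: count INVALID conntrack packets without dropping them.
# Install (requires root):
#   iptables -t filter -I INPUT -p tcp -m conntrack --ctstate INVALID
# Then scrape the packet counter from `iptables -nvxL INPUT`.

count_invalid() {
    # Reads iptables -nvxL style output on stdin; prints the packet
    # counter of the first rule matching "ctstate INVALID".
    awk '/ctstate INVALID/ { print $1; exit }'
}

# Canned example of one counter line, as -nvxL would print it:
sample='      42     2520            tcp  --  *  *  0.0.0.0/0  0.0.0.0/0  ctstate INVALID'
echo "$sample" | count_invalid   # prints 42
```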

@aojea
Member

aojea commented Sep 4, 2023

If setting nf_conntrack_tcp_be_liberal was a no-brainer good idea, then the kernel would just have that behavior by default. There must be some reason why they don't, and so I worry that if we set that flag by default, then we're just going to start breaking some other subset of users.

I completely and absolutely agree, we should not make that sysctl a default in kube-proxy... what if we read the sysctl and install the DROP rule only if it is not set?
For users that apply the sysctl, we would no longer install the DROP rule.
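That decision logic is simple enough to sketch (a pure helper for illustration; kube-proxy's actual implementation is in Go, not shell):

```shell
#!/bin/sh
# Sketch: install the INVALID DROP rule only when the admin has not
# already enabled be_liberal. Takes the current sysctl value as an
# argument so the decision is testable without root.

decide_drop_rule() {
    be_liberal="$1"   # value read from nf_conntrack_tcp_be_liberal
    if [ "$be_liberal" = 1 ]; then
        echo skip          # sysctl already prevents the INVALID marking
    else
        echo install-drop  # fall back to the iptables DROP rule
    fi
}

decide_drop_rule 0   # prints install-drop
decide_drop_rule 1   # prints skip
```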

@aojea
Member

aojea commented Sep 4, 2023

/assign

let me send a PR , I think that approach covers all cases

@danwinship
Contributor

danwinship commented Sep 6, 2023

If setting nf_conntrack_tcp_be_liberal was a no-brainer good idea, then the kernel would just have that behavior by default. There must be some reason why they don't

FWIW, my kernel conntrack informant tells me that there really aren't any bad side effects of setting it ("I don't see anything that could break by enabling this mode"), and in fact OVN sets it unconditionally on hosts it's running on.

Also, if it turns out that this is all just a conntrack bug, which eventually gets fixed, then we wouldn't want to be setting the sysctl after it got fixed.

He also tells me that the bug (at least in the form that we are causing it in our regression test) is fixed in kernel 6.1 ("netfilter: conntrack: ignore overly delayed tcp packet").

So I'm not sure if that means "we should just go ahead and set the sysctl because it's harmless" or "we should just keep it like it is now and not set the sysctl since it won't be needed in the future"...

(NB: if the latter we still need a docs update to go along with this PR, telling admins about the sysctl.)

@cyclinder
Contributor

If we can determine that the kernel fixes the issue in a certain version, then I think we can do this:

  • Do not install the DROP rule.

  • If the kernel version of the host is greater than the safe kernel version, we do not set sysctl.

  • If less than, we have a flag to set the sysctl, defaulting to 1 (per the earlier comment: "I don't see anything that could break by enabling this mode"), and we let users know that we may remove this sysctl in the future.
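The kernel-version gate in the proposal above could be sketched like this (assuming 6.1 as the "safe" version per the "netfilter: conntrack: ignore overly delayed tcp packet" fix; the helper is illustrative, not kube-proxy code):

```shell
#!/bin/sh
# Sketch: decide whether nf_conntrack_tcp_be_liberal is still needed,
# given a "major.minor" kernel version (e.g. from `uname -r`).

needs_liberal_sysctl() {
    major=${1%%.*}              # text before the first dot
    rest=${1#*.}                # text after the first dot
    minor=${rest%%.*}           # minor, even for "6.1.0"-style input
    if [ "$major" -gt 6 ] || { [ "$major" -eq 6 ] && [ "$minor" -ge 1 ]; }; then
        echo no    # kernel already handles the delayed-packet case
    else
        echo yes   # set the sysctl (unless the user opted out)
    fi
}

needs_liberal_sysctl 5.15   # prints yes
needs_liberal_sysctl 6.1    # prints no
```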

@thockin
Member

thockin commented Sep 6, 2023

If setting nf_conntrack_tcp_be_liberal was a no-brainer good idea, then the kernel would just have that behavior by default.

Exactly my thought. I made this mistake with route_localnet, and we still pay for it.

FWIW, my kernel conntrack informant tells me that there really aren't any bad side effects of setting it

Let me do some internal outreach too, see if I can get a concurring opinion?

@thockin
Member

thockin commented Sep 6, 2023

Kernel ppl here concur:

"I say : if kernel can not be fixed, set tcp_be_liberal to one "

"having Kubernetes set net.netfilter.nf_conntrack_tcp_be_liberal=1 always SGTM, since you want it to work on any kernel version"

@aojea argues for a flag, default to true (or a negative flag, default false), which seems a bit paranoid, but probably smart :)

Are we all in agreement? Who wants to do the PR?

@danwinship
Contributor

@aroradaman already has a PR to add tcp_be_liberal to the existing set of configurable conntrack sysctls in kube-proxy (#120354) so it's just a matter of flipping the default

@cyclinder
Contributor

cyclinder commented Sep 7, 2023

If the sysctl defaults to true, I think this still keeps backward compatibility, So we don't need the DROP rule anymore.

@aojea
Member

aojea commented Sep 7, 2023

I just realized that we need this to be a flag, because there are environments that run kube-proxy with a read-only /sys subsystem; trying to set the sysctl unconditionally would fail and make kube-proxy unusable in those environments.

https://kubernetes.io/docs/tasks/administer-cluster/kubelet-in-userns/#configuring-kube-proxy

@aojea
Member

aojea commented Sep 7, 2023

So we don't need the DROP rule anymore.

if people do not set it, they will hit the bug https://kubernetes.io/blog/2019/03/29/kube-proxy-subtleties-debugging-an-intermittent-connection-reset/ until they run kernel 6.1+ #117924 (comment)

@cyclinder
Contributor

I mean this sysctl defaults to true in kube-proxy. I can't imagine why users would need to set it to false unless the kernel version is safe (≥ 6.1). Of course, we can also document for users that it is better not to do this.

@thockin
Member

thockin commented Sep 7, 2023 via email

@aojea
Member

aojea commented Apr 21, 2024

@danwinship @thockin after setting up the jobs for nftables and hitting the conntrack bug continuously, I think we should default the flag #120354 to true and set the sysctl nf_conntrack_tcp_be_liberal to 1.

We are already setting some sysctls, and for rootless or other environments they can always opt out; this is what we do in kind:

conntrack:
# Skip setting sysctl value "net.netfilter.nf_conntrack_max"
# It is a global variable that affects other namespaces
  maxPerCore: 0
{{if .RootlessProvider}}
# Skip setting "net.netfilter.nf_conntrack_tcp_timeout_established"
  tcpEstablishedTimeout: 0s
# Skip setting "net.netfilter.nf_conntrack_tcp_timeout_close"
  tcpCloseWaitTimeout: 0s
{{end}}{{end}}

@thockin
Member

thockin commented Apr 24, 2024

Sounds OK to me. Risk should be ~zero, right?

@cyclinder
Contributor

Yes, I assume the risk is in setting nf_conntrack_tcp_be_liberal to 1. We do not set it by default, so it defaults to 0 (see the check "if s.Config.Conntrack.TCPBeLiberal {" in the kube-proxy server code).

I'd be happy to work on it. Can I have this one?

@aojea
Member

aojea commented Apr 24, 2024

Talked with @danwinship offline. I don't have very strong arguments to make this enabled by default, so let's wait for more user feedback. nftables will move to beta this release; we can revisit if we have more feedback from users.

@tyler-lloyd
Contributor

tyler-lloyd commented May 8, 2024

@aojea we are considering turning on conntrack-tcp-be-liberal by default in our kube-proxy 1.29+ deployments. Based on the recent discussion here that seems fine and the risk should "be ~zero"(#117924 (comment)), correct? And you are all even discussing turning this on by default in kube-proxy in some future version?

One thing I'm a little confused on: is the fix to conntrack in 6.1 doing the same thing as turning on nf_conntrack_tcp_be_liberal? So by setting nf_conntrack_tcp_be_liberal=1 you're just turning on behavior that will be on by default in 6.1. Did I get that right?

@danwinship
Contributor

is the fix to conntrack in 6.1 doing the same thing as turning on nf_conntrack_tcp_be_liberal?

Not exactly. nf_conntrack_tcp_be_liberal affects a handful of situations. 6.1 only "normalized" one of them.

The general intent of nf_conntrack_tcp_be_liberal is something like "make conntrack work better with packets from broken TCP implementations". But the netfilter devs realized that one of the places it was being used was actually something that could happen due to totally normal TCP packet drops/retransmissions. And so in 6.1+, that particular situation is just handled correctly automatically, whether nf_conntrack_tcp_be_liberal is enabled or not.

I think we should default the flag to true and set the sysctl nf_conntrack_tcp_be_liberal to 1

My argument against this is that once we do it, we have to keep doing it forever, because people might start unwittingly depending on the other side effects of that sysctl. And particularly given that

  1. Most people never hit the bug anyway. (It seems like we got more reports of the "my cluster is broken because of the DROP rule" bug than we ever got of the "my cluster is broken because of conntrack invalid packet handling" bug.)
  2. Some day once everyone has moved to 6.1 the bug will be gone forever

ISTM that it's better to still not set it by default.

@tyler-lloyd
Contributor

Thanks for the clarification on the 6.1 bugfix @danwinship.

because people might start unwittingly depending on the other side effects of that sysctl.

That's true, and they probably will. I'm just trying to leverage some of the earlier discussion here to make a decision about the safety of flipping this to true, since it seems like the level of concern was not too high, basically nil:

FWIW, my kernel conntrack informant tells me that there really aren't any bad side effects of setting it

Kernel ppl here concur:
"I say : if kernel can not be fixed, set tcp_be_liberal to one "
"having Kubernetes set net.netfilter.nf_conntrack_tcp_be_liberal=1 always SGTM, since you want it to work on any kernel version"
