"Connection reset by peer" due to invalid conntrack packets #117924
Comments
/sig network Is this about pods with hostNetwork: true? |
/assign @aojea |
Yeah, to be clear - filter INPUT applies to packets which are going to be delivered to a process. Is that kube-proxy's job to define? |
The "How can we reproduce" sounds a bit like this always happens. Here is a better instruction moby/libnetwork#1090 (comment) |
IMO this is the best description of the problem: |
This may be a better solution than let |
hehe, ironically we have to decide. The iptables rule also has the side effect of breaking people (cc @cyclinder) #94861 (comment). If we switch to setting the sysctl instead… |
I tend to agree with Lars: remove the iptables INVALID drop rule and set the sysctl; we are already setting other host sysctls. @thockin @dcbw @danwinship, what do you think? It would be nice to have some iptables hackers' opinions too |
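For reference, the two approaches being weighed look roughly like this (a minimal sketch; the real kube-proxy change may differ):
# Option A: drop packets that conntrack marks INVALID before the host stack can answer them with a RST
iptables -t filter -I INPUT -p tcp -m conntrack --ctstate INVALID -j DROP
# Option B: tell conntrack to accept out-of-window TCP packets instead of marking them INVALID
sysctl -w net.netfilter.nf_conntrack_tcp_be_liberal=1
Option B changes conntrack behavior for all traffic on the host, which is the crux of the disagreement below.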
I tried to recreate this problem in a virtual environment (kvm/qemu) by requesting a 100MB file with curl while running a command to provoke some packet loss. The transfer took a longer time, but never failed. |
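The exact command was not captured in this comment; a typical way to provoke loss in such a setup is tc netem (a hypothetical reconstruction — the interface name is an assumption):
# add 5% packet loss and some delay on the guest's interface; eth0 is assumed
tc qdisc add dev eth0 root netem delay 50ms loss 5%
# ... run the curl transfer ...
# remove the impairment afterwards
tc qdisc del dev eth0 root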
IMO kube-proxy should not set any sysctls that are not literally required for functionality that the user has explicitly opted into. We can document that we think it's a good idea for users/distros to set the sysctl themselves. If we can get the same effect with an iptables rule that only affects kube-proxy's own traffic, then I think that's better than setting a sysctl that will also affect non-kube-proxy traffic. If I understand the situation here correctly, if kube-proxy added a drop rule for the invalid conntrack packets, but then the administrator set the sysctl… |
Agree in general. We have done it in the past (e.g. route-localnet) and mostly it seems like a not-great idea.
Also agree. |
If we see this as a known bug, please consider documenting it in https://kubernetes.io/docs/reference/networking/virtual-ips/ - even if the latest Kubernetes includes a fix (we can nevertheless point that out!) |
I'll fix this in the next few days. |
From further googling, it seems like there have been situations in the past where people were setting that sysctl to work around bugs in conntrack, which were later fixed. I think this may be another such situation; it seems like our problem is that there is some state involving dropped or retransmitted packets which linux would cope with fine in the non-conntrack case, but which it doesn't handle in the conntrack case, unless "be_liberal" is set, and that seems like a bug. Every time I look away from this set of issues/PRs for longer than 15 minutes I forget the exact scenario that causes the bug, but if someone could summarize it very clearly we could try dragging some kernel developers in. (The regression test in the e2e suite creates a packet with an intentionally-bad sequence number, but it's not clear to me what sort of real-world issue that's supposed to be representing.) |
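For concreteness, the kind of packet the regression test injects can also be produced by hand. A hypothetical illustration with hping3 (addresses and ports are placeholders, not from the test):
# send a single ACK with a deliberately bogus sequence number toward an established flow;
# conntrack on the node should mark it INVALID
hping3 -c 1 -A -s 40000 -p 80 -M 1000000 198.51.100.10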
Thanks @aojea, now that was an excellent article! Well written (and not TL;DR), and IMO the perfect technical level. If there is some collection of recommended reading for people who want to learn about K8s networking, this article should be in it. |
Hi @aojea, thank you for the document, it's really helpful! I have some questions: |
That says "this problem will occur if conntrack is unable to recognize a packet", but it doesn't explain the real-world scenario that causes conntrack to be unable to recognize a packet. If we go to the kernel developers and say "manually injecting random incorrect packets into a conntrack'ed connection causes it to get confused", then they'll likely respond either "don't do that then" or "use But if we can tell them something like "if a connection has a duplicated packet after a dropped packet, then conntrack gets confused", then that sounds a lot more like a bug in conntrack that they ought to fix, such that then everything will just work for everyone in the future without needing any DROP rules or sysctls. |
So do we need to revert the drop rules? As @aojea mentioned in #117924 (comment) , it now seems that this makes sense. |
There are three potential "fixes" here: the iptables DROP rule(s), the nf_conntrack_tcp_be_liberal sysctl, and a fix to the underlying bug in the kernel's conntrack code. |
We could revert the DROP rules and tell people to use the sysctl instead (and actually, probably that's a better answer for this issue than adding a new DROP rule to the INPUT chain is). But we'd need to warn users in advance if we were going to do that, since it would break existing clusters. |
I can accept a KEP, but I think this may take multiple releases to complete. How about this: if it's a flag, I believe it's relatively simple and we can finish it in 1.29; in the case of a feature gate, this may take multiple releases to complete. |
If #120354 sets the sysctl by default, does that resolve this? |
Yes, but if we were doing it that way, that would be because we considered that to be a feature rather than a bug. We want to avoid breaking working clusters on upgrade. Maybe just having a release note saying "you may need to pass the new flag" would be enough. |
I think we agreed we don't want to change any sysctls except by explicit admin request (because that affects the behavior of non-kubernetes networking as well). |
Agree, but we need to think of the users. This is a problem where the fix, which solves a subtle and complex bug that may affect 100% of Kubernetes users (and most probably they don't know they are affected), impacts an esoteric network configuration on the cluster networking, and the solution we are offering is "please read the release notes", which we know only 20% of users do ... I think that advanced users can opt out of the DROP rule if we don't want to mess with the sysctl. |
Also, if it turns out that this is all just a conntrack bug, which eventually gets fixed, then we wouldn't want to be setting the sysctl after it got fixed. But at that point it would be likely that there were some users unknowingly relying on the sysctl for some other side effect. |
We could potentially have an iptables rule to count invalid conntrack packets, and expose the count as a metric |
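A sketch of that idea: an iptables rule with no target only counts matching packets, and the counters can then be scraped.
# targetless rule: matches (and counts) INVALID packets without affecting them
iptables -t filter -A INPUT -m conntrack --ctstate INVALID
# read the packet/byte counters back
iptables -t filter -L INPUT -v -n -x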
I completely and absolutely agree. We should not make that sysctl a default in kube-proxy ... what if we read the sysctl and install the DROP rule only if it is not set? |
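As a sketch of that logic (shell rather than the actual Go code in kube-proxy):
# only install the DROP rule when the admin has not already opted into be_liberal
if [ "$(cat /proc/sys/net/netfilter/nf_conntrack_tcp_be_liberal)" = "0" ]; then
    iptables -t filter -I INPUT -p tcp -m conntrack --ctstate INVALID -j DROP
fi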
/assign Let me send a PR; I think that approach covers all cases |
FWIW, my kernel conntrack informant tells me that there really aren't any bad side effects of setting it ("I don't see anything that could break by enabling this mode"), and in fact OVN sets it unconditionally on hosts it's running on.
He also tells me that the bug (at least in the form that we are causing it in our regression test) is fixed in kernel 6.1. So I'm not sure if that means "we should just go ahead and set the sysctl because it's harmless" or "we should just keep it like it is now and not set the sysctl since it won't be needed in the future"... (NB: if the latter we still need a docs update to go along with this PR, telling admins about the sysctl.) |
If we can determine that the kernel fixes the issue in a certain version, then I think we can do this: |
Exactly my thought. I made this mistake with route_localnet, and we still pay for it.
Let me do some internal outreach too, see if I can get a concurring opinion? |
Kernel ppl here concur: "I say: if kernel can not be fixed, set tcp_be_liberal to one" and "having Kubernetes set net.netfilter.nf_conntrack_tcp_be_liberal=1 always SGTM, since you want it to work on any kernel version". @aojea argues for a flag, defaulting to true (or a negative flag, defaulting to false), which seems a bit paranoid, but probably smart :) Are we all in agreement? Who wants to do the PR? |
@aroradaman already has a PR to add the flag |
If the sysctl defaults to true, I think this still keeps backward compatibility, so we don't need the DROP rule anymore. |
I just realized that we need this to be a flag because there are cases that run kube-proxy on environments with a read-only /sys subsystem, so trying to set it unconditionally will fail and make kube-proxy unusable for those environments https://kubernetes.io/docs/tasks/administer-cluster/kubelet-in-userns/#configuring-kube-proxy |
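In such environments the write simply fails, so any sketch of setting it needs to tolerate failure rather than aborting, e.g.:
# attempt to set the sysctl, but degrade gracefully on read-only /proc/sys (rootless, userns setups)
sysctl -w net.netfilter.nf_conntrack_tcp_be_liberal=1 \
    || echo "warning: could not set nf_conntrack_tcp_be_liberal; continuing" >&2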
If people do not set it, they will hit the bug https://kubernetes.io/blog/2019/03/29/kube-proxy-subtleties-debugging-an-intermittent-connection-reset/ until they use kernel 6.1 #117924 (comment) |
I mean this sysctl defaults to true in kube-proxy; I can't imagine why users would need to set it to false unless the kernel version is safe (> 6.1). Of course, we can also document for users that it is better not to do this. |
There are cases where we CAN'T set sysctls but should not hard-fail kube-proxy. |
@danwinship @thockin after setting up the jobs for nftables and hitting the conntrack bug continuously, I think we should default the flag #120354 to true and set the sysctl nf_conntrack_tcp_be_liberal to 1. We are already setting some sysctls, and rootless or other environments can always opt out; this is what we do in kind. |
Sounds OK to me. Risk should be ~zero, right? |
Yes, I assumed the risk is at kubernetes/cmd/kube-proxy/app/server_linux.go line 414 (commit 47ad87e).
I'd be happy with it. Can I have this one? |
Talked with @danwinship offline; I don't have very strong arguments to make this enabled by default, so let's wait for more user feedback. nftables will move to beta this release; we can revisit if we have more feedback from users |
@aojea we are considering turning on nf_conntrack_tcp_be_liberal. One thing I'm a little confused about: is the fix to conntrack in 6.1 doing the same thing as turning it on? |
Not exactly. The general intent of nf_conntrack_tcp_be_liberal is broader than the 6.1 bugfix: with the sysctl set, conntrack stops marking out-of-window TCP packets INVALID in general (only out-of-window RSTs still are), whereas 6.1 fixes one specific case of mis-tracking.
My argument against this is that once we do it, we have to keep doing it forever, because people might start unwittingly depending on the other side effects of that sysctl. And particularly given that the underlying bug is fixed in 6.1, ISTM that it's better to still not set it by default. |
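If kube-proxy does not set it by default, administrators who want the behavior can set and persist it themselves (a sketch; the sysctl.d path is conventional, not mandated):
# check the current value
sysctl net.netfilter.nf_conntrack_tcp_be_liberal
# enable it now, and persist it across reboots
sysctl -w net.netfilter.nf_conntrack_tcp_be_liberal=1
echo 'net.netfilter.nf_conntrack_tcp_be_liberal = 1' > /etc/sysctl.d/90-conntrack.conf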
Thanks for the clarification on the 6.1 bugfix @danwinship.
That's true and they probably will. I'm just trying to leverage some of the earlier discussion here to make a decision about the safety of flipping this to true, as it seems like the concern was not too high / basically nil. |
What happened?
When packets with out-of-window sequence numbers arrive at a k8s node, conntrack marks them as INVALID, and kube-proxy ignores them without rewriting the DNAT. Since the host has no matching connection for the original destination, it sends a reset packet, and the session is interrupted.
What did you expect to happen?
The connection should not be reset.
How can we reproduce it (as minimally and precisely as possible)?
#74839
Anything else we need to know?
This problem can be worked around with the command:
iptables -t filter -I INPUT -p tcp -m conntrack --ctstate INVALID -j DROP
Similar issue: #74839
But in that issue the DROP rule is placed on the FORWARD chain; our scenario needs it on the INPUT chain.
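Side by side, the two variants (a sketch; kube-proxy actually places its forward-path rule in its own KUBE-FORWARD chain rather than directly in FORWARD):
# FORWARD chain: covers traffic routed through the node to pods (the #74839 case)
iptables -t filter -I FORWARD -m conntrack --ctstate INVALID -j DROP
# INPUT chain: covers traffic delivered to processes on the node itself, e.g. hostNetwork pods (this issue)
iptables -t filter -I INPUT -p tcp -m conntrack --ctstate INVALID -j DROP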
Kubernetes version
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.0", GitCommit:"c2b5237ccd9c0f1d600d3072634ca66cefdf272f", GitTreeState:"clean", BuildDate:"2021-08-04T18:03:20Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.0", GitCommit:"c2b5237ccd9c0f1d600d3072634ca66cefdf272f", GitTreeState:"clean", BuildDate:"2021-08-04T17:57:25Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}
Cloud provider
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)