kube-proxy: Drop packets in INVALID state #74840

Conversation
/assign @thockin Is this testable in some way?

/test pull-kubernetes-integration

The repro (https://github.com/tcarmet/k8s-connection-reset) is more like a load test; it would be nice to have a simpler, deterministic reproducing script, but currently I don't have one. Other than that, a unit test can be added too. I think it would be nice to make it into v1.14's March 7th window.
Should we also drop INVALID-state packets that will be delivered locally, like NodePort service queries?
This is subtle, and I don't really understand what would cause it to happen. Without a test I have no way to say whether this is doing anything at all. Do we have any way to force this condition? We must look at this for all sorts of traffic, too: NodePorts, LB IPs, etc. Does this cover them all? It also needs to be looked at in IPVS mode. @m1093782566
I created a small app to reproduce this issue: https://github.com/anfernee/k8s-issue-74839

@m1093782566 do you have comments?

@mainred this specific bug is all about the cluster-local traffic between 2 pods. Other incoming traffic is not affected. The rule fixes the returning packet, so both NodePort and ClusterIP should already be fixed together.
@m1093782566 PTAL
// Drop packets in INVALID state, which would otherwise cause an
// unexpected connection reset.
// https://github.com/kubernetes/kubernetes/issues/74839
writeLine(proxier.filterRules,
	"-A", string(kubeForwardChain),
	"-m", "conntrack", "--ctstate", "INVALID",
	"-j", "DROP",
)
Wouldn't this need to be near the start of syncProxyRules rather than near the end? Many packets have already been matched and redirected by this point...
Yes, it's the first rule in the KUBE-FORWARD chain.
Chain KUBE-FORWARD (1 references)
target prot opt source destination
*** NEW RULE INSERTED HERE ***
ACCEPT all -- anywhere anywhere /* kubernetes forwarding rules */ mark match 0x4000/0x4000
ACCEPT all -- 10.36.0.0/14 anywhere /* kubernetes forwarding conntrack pod source rule */ ctstate RELATED,ESTABLISHED
ACCEPT all -- anywhere 10.36.0.0/14 /* kubernetes forwarding conntrack pod destination rule */ ctstate RELATED,ESTABLISHED
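Expressed as a standalone command, the inserted rule would look roughly like this (an illustrative command fragment; kube-proxy actually programs the rule via iptables-restore, and running it by hand requires root):

```
# Insert the DROP rule at the top of KUBE-FORWARD, ahead of the ACCEPT rules.
iptables -I KUBE-FORWARD 1 -m conntrack --ctstate INVALID -j DROP
```

Because iptables evaluates rules in order, placing the DROP first ensures an INVALID packet is never accepted by the later RELATED,ESTABLISHED rules in the chain.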
oh, all the (relevant) stuff before that is in natRules, not filterRules. ok
FYI, Calico does something similar, with an option for disabling it (ConntrackInvalid): projectcalico/felix#1424.
The Calico implementation also works for a cluster that runs IPVS as a load balancer with kube-proxy in iptables mode.
Ah, good to know. Thanks for bringing it up, @cmluciano. Not sure if it should be an option here... any reason not to enable that option?
So this needs a release note (@anfernee you would add this in the appropriate place in the initial comment on the PR). Something like:
And maybe it should not get backported right away, until we have more confidence that this doesn't break anything else. /lgtm
k8s 1.15 introduced the iptables rule "-A KUBE-FORWARD -m conntrack --ctstate INVALID -j DROP", which drops a packet if its CT entry is in the INVALID state: kubernetes/kubernetes#74840. The reason for this rule is to work around an nf_conntrack bug which marks a CT entry as INVALID if a client receives an ACK from a server which is above the client's TCP window. The INVALID state prevents the packet from being reverse-xlated, which results in the connection being terminated by the client's host, which sends a TCP RST to the server.

Most likely, in the case of the direct routing mode when bpf_netdev is attached to a native device, a request packet avoids the nf_conntrack hooks, so no CT entry is created. For some reason, passing the request to the stack instead of redirecting to the EP's iface bypasses the hooks as well (tested on a 5.2 kernel), so no entry is created either way. A reply sent from the EP then gets dropped due to the missing CT entry (= INVALID state) for the request.

Luckily, there is the iptables rule '-A CILIUM_FORWARD -i lxc+ -m comment --comment "cilium: cluster->any on lxc+ forward accept" -j ACCEPT' which prevents a reply from an EP from being dropped. However, this does not apply to the cilium-health EP, as its host-side veth name is "cilium_health", which makes its reply bypass the rule and thus be dropped.

This commit changes the iface name to "lxc_health" instead of adding a rule or extending the existing one (= adding more latency to the datapath). Unfortunately, "lxc_cilium_health" is above the dev name max limit.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
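To check whether conntrack is actually marking packets INVALID on a node, the following diagnostic commands can help (a sketch; it assumes the conntrack-tools CLI is installed and requires root):

```
# Per-CPU conntrack statistics; a growing "invalid" counter suggests
# packets (e.g. out-of-window ACKs) are being marked INVALID.
conntrack -S

# Show match counters for the KUBE-FORWARD chain, including the DROP rule.
iptables -L KUBE-FORWARD -v -n
```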
Can someone clarify whether the fix is in 1.15.7 already? If not, in what version is it?

Yes, I can: life is great again after upgrading to 1.15.0 (in our case from 1.14.x to 1.15.5). But it was already included earlier, see: https://relnotes.k8s.io/?markdown=drop&releaseVersions=1.15.0
For those who might be interested: the conntrack INVALID drop iptables rule introduced by this PR breaks the asymmetric-routing case described in https://github.com/projectcalico/felix/issues/1248. It would be nice to have it configurable.

Well, but you don't want the "long-term connections sometimes die randomly" bug to come back when you use asymmetric routing. The fix would be to find a better way to do this that blocks the packets that cause random connection death without blocking packets related to asymmetric routing.

I've hit the asymmetric routing issue due to this.
It seems like narrowing this should work, yes, although unfortunately it's not guaranteed that kube-proxy knows the cluster CIDR range. (We could try narrowing it iff kube-proxy knows the cluster CIDR.)

I agree with @kabakaev, this rule is too wide. I have to maintain my own private fork of kubernetes with this commit removed, due to asymmetric routing for traffic outside the cluster on private subnets. :( If it were scoped to the cluster CIDR, that would no longer be necessary.
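A cluster-CIDR-scoped variant of the rule, as suggested above, might look roughly like this (an illustrative command fragment, not the shipped implementation; 10.36.0.0/14 stands in for the pod CIDR and is taken from the chain listing earlier in the thread):

```
# Only drop INVALID packets to/from the pod CIDR, leaving asymmetric
# routing for non-cluster traffic unaffected (requires root).
iptables -I KUBE-FORWARD 1 -s 10.36.0.0/14 -m conntrack --ctstate INVALID -j DROP
iptables -I KUBE-FORWARD 2 -d 10.36.0.0/14 -m conntrack --ctstate INVALID -j DROP
```

As the comments note, this only works when kube-proxy actually knows the cluster CIDR.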
Hi all. We have a setup with kube-proxy in IPVS mode and Calico as the CNI, in which we observe intermittent connection resets when pods call internal services in the cluster. Calico does push the per-chain iptables rules to drop packets marked as invalid (e.g. -A cali-fw-cali08d8970e614 -m comment --comment "cali:pgiwwL2d0pFFF8jF" -m conntrack --ctstate INVALID -j DROP), but the drop rule for the KUBE-FORWARD chain is missing. Is that expected? We are running kubernetes version 1.23.7 on CentOS 7 machines. Any help would be appreciated.
Hi all, iptables -N "KUBE-FORWARD-PATCH"
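The comment above is cut off after the chain creation; one plausible shape for such a workaround (a hypothetical reconstruction — the chain contents, hook point, and the 192.0.2.0/24 subnet are all assumptions, not the original poster's commands) is a patch chain that accepts known-asymmetric traffic before KUBE-FORWARD's INVALID drop can match:

```
# Hypothetical sketch (requires root): accept traffic from a trusted,
# asymmetrically routed subnet before the KUBE-FORWARD INVALID drop runs.
iptables -N KUBE-FORWARD-PATCH
iptables -A KUBE-FORWARD-PATCH -s 192.0.2.0/24 -j ACCEPT
iptables -I FORWARD 1 -j KUBE-FORWARD-PATCH
```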
What is the exact message?

o.apache.http.impl.execchain.RetryExec : I/O exception (java.net.SocketException) caught when processing request to {s}->https://<i_have_removed_host_name>:443: Connection reset

That may have several different causes, not necessarily this one: e.g. if <i_have_removed_host_name> stopped listening on that port, or if it is a Service whose endpoint changed or died under aggressive autoscaling, ...
Hi, here is what I see from the pod while capturing RST packets:

ocppserver-bc6f65954-j99lx:/app# tcpdump 'tcp[13] & 4 != 0'

I have applied "echo 1 > /proc/sys/net/netfilter/nf_conntrack_tcp_be_liberal" on both the node and the pod, but the problem continues. Do you have any suggestions for me?
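The nf_conntrack_tcp_be_liberal workaround mentioned above can also be set via sysctl; a minimal sketch (requires root, and it must be applied in the network namespace where conntrack tracks the forwarded pod traffic — normally the node's host namespace):

```
# Treat out-of-window TCP packets as valid instead of marking them INVALID.
sysctl -w net.netfilter.nf_conntrack_tcp_be_liberal=1

# Verify the current value.
sysctl net.netfilter.nf_conntrack_tcp_be_liberal
```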
What type of PR is this?
/kind bug
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #74839
Special notes for your reviewer:
Does this PR introduce a user-facing change?: