
kube-proxy: Drop packets in INVALID state #74840

Merged
merged 1 commit into kubernetes:master from anfernee:connreset on Apr 26, 2019

Conversation

@anfernee (Member) commented Mar 2, 2019

Fixes: #74839

What type of PR is this?
/kind bug

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #74839

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Packets considered INVALID by conntrack are now dropped. In particular, this fixes
a problem where spurious retransmits in a long-running TCP connection to a service
IP could result in the connection being closed with the error "Connection reset by
peer"
@bowei (Member) commented Mar 5, 2019

/assign @thockin

Is this testable in some way?

@anfernee (Member, Author) commented Mar 5, 2019

/test pull-kubernetes-integration

@anfernee (Member, Author) commented Mar 5, 2019

The repro (https://github.com/tcarmet/k8s-connection-reset) is more like a load test; it would be nice to have a simpler, deterministic reproducing script, but I don't currently have one. Other than that, a unit test can be added too.

I think it would be nice to make it into v1.14's March 7th window.

@mainred left a comment

Should we also drop INVALID-state packets that will be delivered locally, like NodePort service queries?

pkg/proxy/iptables/proxier.go (outdated)
@thockin (Member) commented Mar 5, 2019

This is subtle, and I don't really understand what would cause it to happen. Without a test I have no way to say whether this is doing anything at all. Do we have any way to force this condition?

We must look at this for all sorts of traffic, too - NodePorts, LB IPs, etc. Does this cover them all?

This also needs to be looked at in IPVS mode. @m1093782566

@anfernee (Member, Author) commented Mar 9, 2019

I created a small app to reproduce this issue: https://github.com/anfernee/k8s-issue-74839
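
For anyone who wants to confirm a cluster is hitting this condition while the repro runs, one non-destructive option (just a sketch, not part of this PR) is to count or log the packets conntrack classifies as INVALID on the forward path:

# Log (without dropping) packets that conntrack marks INVALID; the log prefix is arbitrary.
iptables -I FORWARD -m conntrack --ctstate INVALID -j LOG --log-prefix "ct-invalid: "
# Watch the rule's packet counter and the kernel log while the repro is running.
iptables -L FORWARD -v -n | head
dmesg | grep ct-invalid
# Remove the logging rule afterwards.
iptables -D FORWARD -m conntrack --ctstate INVALID -j LOG --log-prefix "ct-invalid: "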

@anfernee (Member, Author) commented Mar 12, 2019

@m1093782566 do you have comments?

@anfernee anfernee force-pushed the anfernee:connreset branch from a86204f to a07169b Mar 18, 2019
@anfernee (Member, Author) commented Mar 18, 2019

@mainred This specific bug is all about cluster-local traffic between two pods; other incoming traffic is not affected. The rule fixes the handling of the returning packet, so NodePort and ClusterIP should both already be fixed by it.

@thockin (Member) commented Mar 18, 2019

// unexpected connection reset.
// https://github.com/kubernetes/kubernetes/issues/74839
writeLine(proxier.filterRules,
"-A", string(kubeForwardChain),

@danwinship (Contributor) commented Mar 28, 2019

Wouldn't this need to be near the start of syncProxyRules rather than near the end? Many packets have already been matched and redirected by this point...

@anfernee (Member, Author) commented Mar 28, 2019

Yes, it's the first rule in the KUBE-FORWARD chain:

Chain KUBE-FORWARD (1 references)
target     prot opt source               destination 
*** NEW RULE INSERTED HERE ***
ACCEPT     all  --  anywhere             anywhere             /* kubernetes forwarding rules */ mark match 0x4000/0x4000
ACCEPT     all  --  10.36.0.0/14         anywhere             /* kubernetes forwarding conntrack pod source rule */ ctstate RELATED,ESTABLISHED
ACCEPT     all  --  anywhere             10.36.0.0/14         /* kubernetes forwarding conntrack pod destination rule */ ctstate RELATED,ESTABLISHED
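
In iptables-save form, the inserted rule (as quoted later in this thread) looks like:

# First rule in the filter table's KUBE-FORWARD chain:
-A KUBE-FORWARD -m conntrack --ctstate INVALID -j DROP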

@danwinship (Contributor) commented Mar 29, 2019

oh, all the (relevant) stuff before that is in natRules, not filterRules. ok

@cmluciano (Member) commented Apr 4, 2019

FYI, Calico is doing something similar, with an option for disabling ConntrackInvalid: projectcalico/felix#1424.

The Calico implementation also works for a cluster that runs IPVS as a load balancer with kube-proxy in iptables mode.

@anfernee (Member, Author) commented Apr 5, 2019

Ah, good to know. Thanks for bringing it up, @cmluciano. Not sure if it should be an option here. Any reason not to enable that option?

@danwinship (Contributor) commented Mar 29, 2019

So this needs a release note (@anfernee you would add this in the appropriate place in the initial comment on the PR). Something like:

Packets considered INVALID by conntrack are now dropped. In particular, this fixes
a problem where spurious retransmits in a long-running TCP connection to a service
IP could result in the connection being closed with the error "Connection reset by
peer"

And maybe it should not get backported right away, until we have more confidence that this doesn't break anything else.

/lgtm

@k8s-ci-robot merged commit fa833a1 into kubernetes:master Apr 26, 2019
17 of 18 checks passed
@vishnukraj1111 commented May 13, 2019

Just to provide an update on this: the fix mentioned in most places, echo 1 > /proc/sys/net/netfilter/ip_conntrack_tcp_be_liberal, caused the conntrack table to fill up, and all connectivity to the servers was lost.
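
For context, this is the pre-1.15 workaround being referred to, plus a couple of commands for watching conntrack table pressure while it is in effect (a sketch; the sysctl is named ip_conntrack_tcp_be_liberal on older kernels and nf_conntrack_tcp_be_liberal on newer ones):

# Workaround: make conntrack lenient about out-of-window TCP packets instead of marking them INVALID.
echo 1 > /proc/sys/net/netfilter/nf_conntrack_tcp_be_liberal
# Check how close the conntrack table is to its limit.
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
# Per-CPU conntrack statistics (invalid, insert_failed, drop, ...), from conntrack-tools.
conntrack -S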

@WingkaiHo commented Jun 11, 2019

I have the same problem; sometimes it is a UDP packet SNATed and sent to a pod.

@tuapuikia commented Jun 14, 2019

Has anyone tested the patch in a busy production environment?

@sepulworld commented Jul 2, 2019

Curious if there is a safe way to apply this iptables rule prior to upgrading to 1.15. It looks like if you apply the rule iptables -A "KUBE-FORWARD" -m "conntrack" --ctstate "INVALID" -j "DROP" yourself, it will get flushed by kube-proxy.

@danwinship (Contributor) commented Jul 2, 2019

Yeah, you can't add rules to the chains kube-proxy is managing, but you could just add it to one of the top-level chains. (I guess, FORWARD in this case?)

@sepulworld commented Jul 2, 2019

I was thinking something like this:

# Create a separate chain that kube-proxy does not manage (and therefore will not flush):
iptables -N "KUBE-FORWARD-PATCH"
# Drop packets that conntrack considers INVALID, mirroring the rule added by this PR:
iptables -A "KUBE-FORWARD-PATCH" -m "conntrack" --ctstate "INVALID" -j "DROP"
# Jump to the new chain from the top-level FORWARD chain:
iptables -I FORWARD -m comment --comment "k8s patch PR 74840" -j KUBE-FORWARD-PATCH
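
If something like that is run repeatedly (for example from a node bootstrap script or DaemonSet), a guarded form avoids stacking duplicate jump rules; this is only a sketch of the same idea, not anything kube-proxy does:

# Insert the jump only if it is not already present (-C checks whether the rule exists).
iptables -C FORWARD -m comment --comment "k8s patch PR 74840" -j KUBE-FORWARD-PATCH 2>/dev/null || \
  iptables -I FORWARD -m comment --comment "k8s patch PR 74840" -j KUBE-FORWARD-PATCH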
brb added a commit to cilium/cilium that referenced this pull request Jul 18, 2019
k8s 1.15 introduced the iptables rule "-A KUBE-FORWARD -m conntrack
--ctstate INVALID -j DROP" which drops a packet if its CT entry is
in INVALID state: kubernetes/kubernetes#74840.
The reason for this rule is to work around a nf_conntrack bug which marks
a CT entry as INVALID if a client receives an ACK from a server which is
above client's TCP window. The INVALID state prevents a packet from
being reverse xlated which results in the connection being terminated by
a host of the client which sends TCP RST to the server.

Most likely, in the case of the direct routing mode when bpf_netdev is
attached to a native device, a request packet avoids nf_conntrack hooks,
thus no CT entry is created. For some reason, passing the request to
the stack instead of redirecting to the EP's iface bypasses the hooks as
well (tested on a 5.2 kernel), so no entry is created either way. A reply
sent from the EP gets dropped due to the missing CT entry (=INVALID state)
for the request.

Luckily, there is the iptables rule '-A CILIUM_FORWARD -i lxc+ -m
comment --comment "cilium: cluster->any on lxc+ forward accept" -j
ACCEPT' which prevents from a reply of an EP being dropped. However,
this does not apply to cilium-health EP as its host-side veth name is
"cilium_health" which makes its reply to bypass the rule, and thus to
be dropped.

This commit changes the iface name to "lxcciliumhealth" instead of
adding a rule or extending the existing one (= adding more latency
to the datapath). Unfortunately, "lxc_cilium_health" is above the
dev name max limit.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
@alonisser commented Mar 3, 2020

Can someone clarify whether the fix is in 1.15.7 already? If not, in what version is it?

@johannesboon commented Mar 3, 2020

@alonisser

Can someone clarify whether the fix is in 1.15.7 already? If not, in what version is it?

Yes, I can: life is great again after upgrading to 1.15 (in our case from 1.14.x to 1.15.5).

But it was already included earlier, see: https://relnotes.k8s.io/?markdown=drop&releaseVersions=1.15.0

@unicell (Contributor) commented Apr 24, 2020

For those who might be interested: the conntrack INVALID drop iptables rule introduced by this PR will break the asymmetric routing case described in projectcalico/felix#1248.

It would be nice to have it configurable.

@danwinship (Contributor) commented Apr 24, 2020

It would be nice to have it configurable.

Well, but you don't want the "long-term connections sometimes die randomly" bug to come back when you use asymmetric routing. The fix would be to find a better way to do this that blocks the packets that cause random connection death without blocking packets related to asymmetric routing.

@kabakaev commented May 17, 2020

I've hit the asymmetric routing issue due to this KUBE-FORWARD -m conntrack --ctstate INVALID -j DROP rule.
Isn't it too wide? Can we somehow make it more specific to the pod range?
I'm looking at /var/lib/kube-proxy/config.conf inside the kube-proxy container, and it has a clusterCIDR: 10.2.0.0/16,f00d::/64 line. Wouldn't it be enough to let asymmetric routing of non-Kubernetes subnets coexist with the drop-invalid-conntrack rule?

@danwinship (Contributor) commented May 18, 2020

It seems like narrowing this should work, yes, although it's not guaranteed that kube-proxy knows the cluster CIDR range unfortunately. (We could try narrowing it iff kube-proxy knows the cluster CIDR.)
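
To make the idea concrete, a narrowed variant could look roughly like this (a purely hypothetical sketch, reusing the 10.2.0.0/16 clusterCIDR from the comment above; this is not what kube-proxy generates today):

# Hypothetical: only drop INVALID packets involving the pod CIDR, leaving
# asymmetrically routed non-cluster traffic alone.
-A KUBE-FORWARD -s 10.2.0.0/16 -m conntrack --ctstate INVALID -j DROP
-A KUBE-FORWARD -d 10.2.0.0/16 -m conntrack --ctstate INVALID -j DROP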

@cnmcavoy commented Aug 26, 2020

I agree with @kabakaev, this rule is too wide. I have to maintain my own private fork of Kubernetes with this commit removed, due to asymmetric routing for traffic outside the cluster on private subnets. :( If it were scoped to the cluster CIDR, that would no longer be necessary.
