
kube-proxy: Drop packets in INVALID state #74840

Merged Apr 26, 2019 (1 commit)

Conversation

@anfernee (Member) commented Mar 2, 2019

Fixes: #74839

What type of PR is this?
/kind bug

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #74839

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Packets considered INVALID by conntrack are now dropped. In particular, this fixes
a problem where spurious retransmits in a long-running TCP connection to a service
IP could result in the connection being closed with the error "Connection reset by
peer"

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 2, 2019
@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 2, 2019
@bowei (Member) commented Mar 5, 2019

/assign @thockin

Is this testable in some way?

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. and removed needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Mar 5, 2019
@anfernee (Member, Author) commented Mar 5, 2019

/test pull-kubernetes-integration

@anfernee (Member, Author) commented Mar 5, 2019

The repro (https://github.com/tcarmet/k8s-connection-reset) is more of a load test; it would be nice to have a simpler, deterministic reproducing script, but currently I don't have one. Other than that, a unit test could be added too.

I think it would be nice to make it into v1.14's March 7th window.

@mainred left a comment

Should we also drop invalid-state packets that will be delivered locally, like NodePort service queries?

@thockin (Member) commented Mar 5, 2019

This is subtle, and I don't really understand what would cause it to happen. Without a test I have no way to say whether this is doing anything at all. Do we have any way to force this condition?

We must look at this for all sorts of traffic, too - NodePorts, LB IPs, etc. Does this cover them all?

Also need to be looked at in IPVS mode. @m1093782566

@anfernee (Member, Author) commented Mar 9, 2019

I created a small app to reproduce this issue: https://github.com/anfernee/k8s-issue-74839

@anfernee (Member, Author)

@m1093782566 do you have comments?

@anfernee (Member, Author)

@mainred this specific bug is all about cluster-local traffic between two pods; other incoming traffic is not affected. The rule fixes the returning packet, so both NodePort and ClusterIP traffic should be fixed by the same change.

@thockin (Member) commented Mar 18, 2019 via email

// unexpected connection reset.
// https://github.com/kubernetes/kubernetes/issues/74839
writeLine(proxier.filterRules,
"-A", string(kubeForwardChain),
Contributor:

Wouldn't this need to be near the start of syncProxyRules rather than near the end? Many packets have already been matched and redirected by this point...

Member Author:

Yes, it's the first rule in KUBE-FORWARD chain.

Chain KUBE-FORWARD (1 references)
target     prot opt source               destination 
DROP       all  --  anywhere             anywhere             ctstate INVALID    (new rule inserted here)
ACCEPT     all  --  anywhere             anywhere             /* kubernetes forwarding rules */ mark match 0x4000/0x4000
ACCEPT     all  --  10.36.0.0/14         anywhere             /* kubernetes forwarding conntrack pod source rule */ ctstate RELATED,ESTABLISHED
ACCEPT     all  --  anywhere             10.36.0.0/14         /* kubernetes forwarding conntrack pod destination rule */ ctstate RELATED,ESTABLISHED
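For reference, the rule this PR adds can be written as a standalone iptables command (it matches the rule quoted later in this thread; treat it as a config fragment illustrating the chain listing above, not something to run against a live cluster without care):

```shell
# Drop any forwarded packet that conntrack classifies as INVALID,
# before the ACCEPT rules in KUBE-FORWARD get a chance to match it.
iptables -A KUBE-FORWARD -m conntrack --ctstate INVALID -j DROP
```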

Contributor:

oh, all the (relevant) stuff before that is in natRules, not filterRules. ok


FYI, Calico does something similar, with an option to disable it (ConntrackInvalid): projectcalico/felix#1424.

The Calico implementation also works for a cluster that runs IPVS as a load balancer with iptables-mode kube-proxy.

Member Author:

Ah, good to know. Thanks for bringing it up, @cmluciano. Not sure if it should be an option here... any reason not to enable that option?

@danwinship (Contributor)

So this needs a release note (@anfernee, you would add this in the appropriate place in the initial comment on the PR). Something like:

Packets considered INVALID by conntrack are now dropped. In particular, this fixes
a problem where spurious retransmits in a long-running TCP connection to a service
IP could result in the connection being closed with the error "Connection reset by
peer"

And maybe it should not get backported right away, until we have more confidence that this doesn't break anything else.

/lgtm

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Mar 29, 2019
brb added a commit to cilium/cilium that referenced this pull request Jul 18, 2019
k8s 1.15 introduced the iptables rule "-A KUBE-FORWARD -m conntrack
--ctstate INVALID -j DROP" which drops a packet if its CT entry is
in INVALID state: kubernetes/kubernetes#74840.
The reason for this rule is to work around an nf_conntrack bug which marks
a CT entry as INVALID if a client receives an ACK from a server which is
above the client's TCP window. The INVALID state prevents the packet from
being reverse-xlated, which results in the connection being terminated by
the client's host, which sends a TCP RST to the server.

Most likely, in the case of direct routing mode, when bpf_netdev is
attached to a native device, a request packet avoids the nf_conntrack
hooks, so no CT entry is created. For some reason, passing the request to
the stack instead of redirecting to the EP's iface bypasses the hooks as
well (tested on a 5.2 kernel), so no entry is created either way. A reply
sent from the EP gets dropped due to the missing CT entry (= INVALID
state) for the request.

Luckily, there is the iptables rule '-A CILIUM_FORWARD -i lxc+ -m
comment --comment "cilium: cluster->any on lxc+ forward accept" -j
ACCEPT', which prevents a reply from an EP from being dropped. However,
this does not apply to the cilium-health EP, as its host-side veth name is
"cilium_health", which makes its reply bypass the rule and thus be
dropped.

This commit changes the iface name to "lxcciliumhealth" instead of
adding a rule or extending the existing one (= adding more latency
to the datapath). Unfortunately, "lxc_cilium_health" is above the
dev name max limit.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
brb added a commit to cilium/cilium that referenced this pull request Jul 19, 2019, with the same commit message as above except that the iface is renamed to "lxc_health" rather than "lxcciliumhealth".
ianvernon pushed a commit to cilium/cilium that referenced this pull request Jul 19, 2019 (same commit message as above).
@alonisser

Can someone clarify if the fix is in 15.7 already? If not, in what version is it?

@johannesboon

@alonisser

Can someone clarify if the fix is in 15.7 already? if not in what version is it

Yes, I can: life is great again after upgrading to 1.15.0 (in our case from 1.14.x to 1.15.5).

But it was already included earlier, see: https://relnotes.k8s.io/?markdown=drop&releaseVersions=1.15.0

@unicell (Contributor) commented Apr 24, 2020

For those who might be interested: the conntrack-INVALID drop iptables rule introduced by this PR will break the asymmetric routing case described in https://github.com/projectcalico/felix/issues/1248.

It would be nice to have it configurable.

@danwinship (Contributor)

It would be nice to have it configurable.

Well, but you don't want the "long-term connections sometimes die randomly" bug to come back when you use asymmetric routing. The fix would be to find a better way to do this that blocks the packets that cause random connection death without blocking packets related to asymmetric routing.

@kabakaev commented May 17, 2020

I've hit the asymmetric routing issue due to this KUBE-FORWARD -m conntrack --ctstate INVALID -j DROP rule.
Isn't it too wide? Can we somehow make it more specific to the pod range?
I'm looking at /var/lib/kube-proxy/config.conf inside the kube-proxy container, and it has a clusterCIDR: 10.2.0.0/16,f00d::/64 line. Isn't that enough to let asymmetric routing of non-Kubernetes subnets co-exist with the invalid-conntrack drop rule?

@danwinship (Contributor)

It seems like narrowing this should work, yes, although unfortunately it's not guaranteed that kube-proxy knows the cluster CIDR range. (We could try narrowing it iff kube-proxy knows the cluster CIDR.)
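A sketch of that narrowing, using the example clusterCIDR 10.2.0.0/16 quoted earlier in the thread (hypothetical rules for illustration; this is not what kube-proxy actually emits, and it only helps when kube-proxy knows the cluster CIDR):

```shell
# Hypothetical narrowed rules: only drop INVALID packets whose source or
# destination falls inside the pod CIDR, so asymmetric routing between
# non-Kubernetes subnets is left untouched.
iptables -A KUBE-FORWARD -s 10.2.0.0/16 -m conntrack --ctstate INVALID -j DROP
iptables -A KUBE-FORWARD -d 10.2.0.0/16 -m conntrack --ctstate INVALID -j DROP
```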

@cnmcavoy

I agree with @kabakaev, this rule is too wide. I have to maintain my own private fork of Kubernetes with this commit removed, due to asymmetric routing for traffic outside the cluster on private subnets. :( If it were scoped to the cluster CIDR, that would no longer be necessary.

@agmimidi commented Dec 9, 2022

Hi all,

We have a setup with kube-proxy in IPVS mode and Calico as the CNI, in which we observe intermittent connection resets when pods call internal services in the cluster.

Calico does push the per-chain iptables rules to drop packets marked as invalid (e.g. -A cali-fw-cali08d8970e614 -m comment --comment "cali:pgiwwL2d0pFFF8jF" -m conntrack --ctstate INVALID -j DROP), but the drop rule for the KUBE-FORWARD chain is missing. Is that expected?

We are running Kubernetes version 1.23.7 on CentOS 7 machines.

Any help would be appreciated.

@kk9384 commented Feb 2, 2023

Hi all,
We have a GKE setup, and services inside the pods intermittently hit socket "Connection reset" exceptions, only on a few pods. The GKE Kubernetes version is 1.21.14. Should we configure the values below? Please advise, thank you.

iptables -N "KUBE-FORWARD-PATCH"
iptables -A "KUBE-FORWARD-PATCH" -m "conntrack" --ctstate "INVALID" -j "DROP"
iptables -I FORWARD -m comment --comment "k8s patch PR 74840" -j KUBE-FORWARD-PATCH

@aojea (Member) commented Feb 2, 2023

We have a setup of GKE and the services inside the pods are facing socket exception connect reset intermittently only on few pods it is happening

what is the exact message?

@kk9384 commented Feb 3, 2023

We have a setup of GKE and the services inside the pods are facing socket exception connect reset intermittently only on few pods it is happening

what is the exact message?

o.apache.http.impl.execchain.RetryExec : I/O exception (java.net.SocketException) caught when processing request to {s}->https://<i_have_removed_host_name>:443: Connection reset

@aojea (Member) commented Feb 4, 2023

That may have several different causes, not necessarily this one: <i_have_removed_host_name> may have stopped listening on that port, or if that is a Service, the endpoint may have changed or died under aggressive autoscaling, ...

@mehmetcolgecen

Hi,
I am using on-prem Kubernetes version 1.21.9, with MetalLB providing LoadBalancer IPs to the services, and the CNI is Calico. I am running an OCPP server inside Kubernetes, and the server receives heartbeat signals from clients over a websocket connection. Between these heartbeats, the connection is interrupted with RST packets. Here is the tcpdump output from inside the pod:

ocppserver-bc6f65954-j99lx:/app# tcpdump 'tcp[13] & 4 != 0'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
14:07:59.726669 IP ocppserver-bc6f65954-j99lx.8080 > 10.42.176.0.53627: Flags [R], seq 2070087305, win 0, length 0
14:07:59.734667 IP ocppserver-bc6f65954-j99lx.8080 > 10.42.176.0.53627: Flags [R], seq 2070087305, win 0, length 0
14:08:27.646676 IP ocppserver-bc6f65954-j99lx.8080 > 10.42.176.0.9510: Flags [R], seq 355699249, win 0, length 0
14:08:27.654637 IP ocppserver-bc6f65954-j99lx.8080 > 10.42.176.0.9510: Flags [R], seq 355699249, win 0, length 0

I have applied "echo 1 > /proc/sys/net/netfilter/nf_conntrack_tcp_be_liberal" on both the node and the pod, and
"iptables -I INPUT -m conntrack --ctstate INVALID -j DROP" on the node.

But the problem continues. Do you have any suggestions for me?
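For anyone else trying the nf_conntrack_tcp_be_liberal workaround mentioned above: applying it via echo does not survive a reboot. A persistent sysctl fragment would look like this (the file path is illustrative, and whether this helps depends on the root cause, as the replies above note):

```shell
# /etc/sysctl.d/90-conntrack-liberal.conf (illustrative file name)
# Tell conntrack to accept out-of-window TCP packets instead of
# marking them INVALID (the condition this PR's rule drops on).
net.netfilter.nf_conntrack_tcp_be_liberal = 1
```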

Successfully merging this pull request may close these issues.

"Connection reset by peer" due to invalid conntrack packets