
kube-proxy: Drop packets in INVALID state #74840

Merged Apr 26, 2019 (1 commit)

Conversation

@anfernee (Member) commented Mar 2, 2019

Fixes: #74839

What type of PR is this?
/kind bug

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #74839

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Packets considered INVALID by conntrack are now dropped. In particular, this fixes
a problem where spurious retransmits in a long-running TCP connection to a service
IP could result in the connection being closed with the error "Connection reset by
peer"

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 2, 2019
@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 2, 2019
@bowei (Member) commented Mar 5, 2019

/assign @thockin

Is this testable in some way?

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. and removed needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Mar 5, 2019
@anfernee (Member, Author) commented Mar 5, 2019

/test pull-kubernetes-integration

@anfernee (Member, Author) commented Mar 5, 2019

The repro (https://github.com/tcarmet/k8s-connection-reset) is more of a load test; it would be nice to have a simpler, deterministic reproducing script, but currently I don't have one. Other than that, a unit test could be added too.

I think it would be nice to make it into v1.14's March 7th window.

@mainred left a comment

Should we also drop invalid-state packets that will be delivered locally, like NodePort service queries?

@thockin (Member) commented Mar 5, 2019

This is subtle, and I don't really understand what would cause it to happen. Without a test I have no way to say whether this is doing anything at all. Do we have any way to force this condition?

We must look at this for all sorts of traffic, too - NodePorts, LB IPs, etc. Does this cover them all?

Also need to be looked at in IPVS mode. @m1093782566

@anfernee (Member, Author) commented Mar 9, 2019

I created a small app to reproduce this issue: https://github.com/anfernee/k8s-issue-74839

@anfernee (Member, Author)

@m1093782566 do you have comments?

@anfernee (Member, Author)

@mainred this specific bug is all about cluster-local traffic between two pods; other incoming traffic is not affected. The rule fixes the returning packet, so both NodePort and ClusterIP traffic should be fixed by the same change.

@thockin (Member) commented Mar 18, 2019 via email

// unexpected connection reset.
// https://github.com/kubernetes/kubernetes/issues/74839
writeLine(proxier.filterRules,
"-A", string(kubeForwardChain),
Contributor:

Wouldn't this need to be near the start of syncProxyRules rather than near the end? Many packets have already been matched and redirected by this point...

Member Author:

Yes, it's the first rule in KUBE-FORWARD chain.

Chain KUBE-FORWARD (1 references)
target     prot opt source               destination 
DROP       all  --  anywhere             anywhere             ctstate INVALID    (new rule inserted here)
ACCEPT     all  --  anywhere             anywhere             /* kubernetes forwarding rules */ mark match 0x4000/0x4000
ACCEPT     all  --  10.36.0.0/14         anywhere             /* kubernetes forwarding conntrack pod source rule */ ctstate RELATED,ESTABLISHED
ACCEPT     all  --  anywhere             10.36.0.0/14         /* kubernetes forwarding conntrack pod destination rule */ ctstate RELATED,ESTABLISHED
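For reference, the rule this PR adds can be written as a standalone iptables command (it matches the rule quoted later in this thread; treat it as a config fragment illustrating the chain listing above, not something to run against a live cluster without care):

```shell
# Drop any forwarded packet that conntrack classifies as INVALID,
# before the ACCEPT rules in KUBE-FORWARD get a chance to match it.
iptables -A KUBE-FORWARD -m conntrack --ctstate INVALID -j DROP
```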

Contributor:

oh, all the (relevant) stuff before that is in natRules, not filterRules. ok


FYI, Calico does something similar, with an option to disable it (ConntrackInvalid): projectcalico/felix#1424.

The Calico implementation also works for a cluster that runs IPVS as a load balancer with iptables-mode kube-proxy.

Member Author:

Ah, good to know. Thanks for bringing it up, @cmluciano. Not sure if it should be an option here... any reason not to enable that option?

@danwinship (Contributor)

So this needs a release note (@anfernee, you would add this in the appropriate place in the initial comment on the PR). Something like:

Packets considered INVALID by conntrack are now dropped. In particular, this fixes
a problem where spurious retransmits in a long-running TCP connection to a service
IP could result in the connection being closed with the error "Connection reset by
peer"

And maybe it should not get backported right away, until we have more confidence that this doesn't break anything else.

/lgtm

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Mar 29, 2019
brb added a commit to cilium/cilium that referenced this pull request Jul 18, 2019
k8s 1.15 introduced the iptables rule "-A KUBE-FORWARD -m conntrack
--ctstate INVALID -j DROP" which drops a packet if its CT entry is
in INVALID state: kubernetes/kubernetes#74840.
The reason for this rule is to work around an nf_conntrack bug which marks
a CT entry as INVALID if a client receives an ACK from a server which is
above the client's TCP window. The INVALID state prevents the packet from
being reverse-xlated, which results in the connection being terminated by
the client's host, which sends a TCP RST to the server.

Most likely, in the case of direct routing mode, when bpf_netdev is
attached to a native device, a request packet avoids the nf_conntrack
hooks, so no CT entry is created. For some reason, passing the request to
the stack instead of redirecting to the EP's iface bypasses the hooks as
well (tested on a 5.2 kernel), so no entry is created either way. A reply
sent from the EP gets dropped due to the missing CT entry (= INVALID
state) for the request.

Luckily, there is the iptables rule '-A CILIUM_FORWARD -i lxc+ -m
comment --comment "cilium: cluster->any on lxc+ forward accept" -j
ACCEPT', which prevents a reply from an EP from being dropped. However,
this does not apply to the cilium-health EP, as its host-side veth name is
"cilium_health", which makes its reply bypass the rule and thus be
dropped.

This commit changes the iface name to "lxcciliumhealth" instead of
adding a rule or extending the existing one (= adding more latency
to the datapath). Unfortunately, "lxc_cilium_health" is above the
dev name max limit.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
brb added a commit to cilium/cilium that referenced this pull request Jul 19, 2019, with the same commit message as above except that the iface is renamed to "lxc_health" rather than "lxcciliumhealth".
ianvernon pushed a commit to cilium/cilium that referenced this pull request Jul 19, 2019 (same commit message as above).
@alonisser

Can someone clarify if the fix is in 15.7 already? If not, in what version is it?

@johannesboon

@alonisser

Can someone clarify if the fix is in 15.7 already? if not in what version is it

Yes, I can: life is great again after upgrading to 1.15.0 (in our case from 1.14.x to 1.15.5).

But it was already included earlier, see: https://relnotes.k8s.io/?markdown=drop&releaseVersions=1.15.0

@unicell (Contributor) commented Apr 24, 2020

For those who might be interested: the conntrack-INVALID drop iptables rule introduced by this PR will break the asymmetric routing case described in https://github.com/projectcalico/felix/issues/1248.

It would be nice to have it configurable.

@danwinship (Contributor)

It would be nice to have it configurable.

Well, but you don't want the "long-term connections sometimes die randomly" bug to come back when you use asymmetric routing. The fix would be to find a better way to do this that blocks the packets that cause random connection death without blocking packets related to asymmetric routing.

@kabakaev commented May 17, 2020

I've hit the asymmetric routing issue due to this KUBE-FORWARD -m conntrack --ctstate INVALID -j DROP rule.
Isn't it too wide? Can we somehow make it more specific to the pod range?
I'm looking at /var/lib/kube-proxy/config.conf inside the kube-proxy container, and it has a clusterCIDR: 10.2.0.0/16,f00d::/64 line. Isn't that enough to let asymmetric routing of non-Kubernetes subnets co-exist with the invalid-conntrack drop rule?

@danwinship (Contributor)

It seems like narrowing this should work, yes, although unfortunately it's not guaranteed that kube-proxy knows the cluster CIDR range. (We could try narrowing it iff kube-proxy knows the cluster CIDR.)
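A sketch of that narrowing, using the example clusterCIDR 10.2.0.0/16 quoted earlier in the thread (hypothetical rules for illustration; this is not what kube-proxy actually emits, and it only helps when kube-proxy knows the cluster CIDR):

```shell
# Hypothetical narrowed rules: only drop INVALID packets whose source or
# destination falls inside the pod CIDR, so asymmetric routing between
# non-Kubernetes subnets is left untouched.
iptables -A KUBE-FORWARD -s 10.2.0.0/16 -m conntrack --ctstate INVALID -j DROP
iptables -A KUBE-FORWARD -d 10.2.0.0/16 -m conntrack --ctstate INVALID -j DROP
```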

@cnmcavoy

I agree with @kabakaev, this rule is too wide. I have to maintain my own private fork of Kubernetes with this commit removed, due to asymmetric routing for traffic outside the cluster on private subnets. :( If it were scoped to the cluster CIDR, that would no longer be necessary.

@agmimidi commented Dec 9, 2022

Hi all,

We have a setup with kube-proxy in IPVS mode and Calico as the CNI, in which we observe intermittent connection resets when pods call internal services in the cluster.

Calico does push the per-chain iptables rules to drop packets marked as invalid (e.g. -A cali-fw-cali08d8970e614 -m comment --comment "cali:pgiwwL2d0pFFF8jF" -m conntrack --ctstate INVALID -j DROP), but the drop rule for the KUBE-FORWARD chain is missing. Is that expected?

We are running Kubernetes version 1.23.7 on CentOS 7 machines.

Any help would be appreciated.

@kk9384 commented Feb 2, 2023

Hi all,
We have a GKE setup, and services inside the pods intermittently hit socket "Connection reset" exceptions, only on a few pods. The GKE Kubernetes version is 1.21.14. Should we configure the values below? Please advise, thank you.

iptables -N "KUBE-FORWARD-PATCH"
iptables -A "KUBE-FORWARD-PATCH" -m "conntrack" --ctstate "INVALID" -j "DROP"
iptables -I FORWARD -m comment --comment "k8s patch PR 74840" -j KUBE-FORWARD-PATCH

@aojea (Member) commented Feb 2, 2023

We have a setup of GKE and the services inside the pods are facing socket exception connect reset intermittently only on few pods it is happening

what is the exact message?

@kk9384 commented Feb 3, 2023

We have a setup of GKE and the services inside the pods are facing socket exception connect reset intermittently only on few pods it is happening

what is the exact message?

o.apache.http.impl.execchain.RetryExec : I/O exception (java.net.SocketException) caught when processing request to {s}->https://<i_have_removed_host_name>:443: Connection reset

@aojea (Member) commented Feb 4, 2023

That may have several different causes, not necessarily this one: <i_have_removed_host_name> may have stopped listening on that port, or if that is a Service, the endpoint may have changed or died under aggressive autoscaling, ...

@mehmetcolgecen

Hi,
I am using on-prem Kubernetes version 1.21.9, with MetalLB providing LoadBalancer IPs to the services, and the CNI is Calico. I am running an OCPP server inside Kubernetes, and the server receives heartbeat signals from clients over a websocket connection. Between these heartbeats, the connection is interrupted with RST packets. Here is the tcpdump output from inside the pod:

ocppserver-bc6f65954-j99lx:/app# tcpdump 'tcp[13] & 4 != 0'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
14:07:59.726669 IP ocppserver-bc6f65954-j99lx.8080 > 10.42.176.0.53627: Flags [R], seq 2070087305, win 0, length 0
14:07:59.734667 IP ocppserver-bc6f65954-j99lx.8080 > 10.42.176.0.53627: Flags [R], seq 2070087305, win 0, length 0
14:08:27.646676 IP ocppserver-bc6f65954-j99lx.8080 > 10.42.176.0.9510: Flags [R], seq 355699249, win 0, length 0
14:08:27.654637 IP ocppserver-bc6f65954-j99lx.8080 > 10.42.176.0.9510: Flags [R], seq 355699249, win 0, length 0

I have applied "echo 1 > /proc/sys/net/netfilter/nf_conntrack_tcp_be_liberal" on both the node and the pod, and
"iptables -I INPUT -m conntrack --ctstate INVALID -j DROP" on the node.

But the problem continues. Do you have any suggestions for me?
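For anyone else trying the nf_conntrack_tcp_be_liberal workaround mentioned above: applying it via echo does not survive a reboot. A persistent sysctl fragment would look like this (the file path is illustrative, and whether this helps depends on the root cause, as the replies above note):

```shell
# /etc/sysctl.d/90-conntrack-liberal.conf (illustrative file name)
# Tell conntrack to accept out-of-window TCP packets instead of
# marking them INVALID (the condition this PR's rule drops on).
net.netfilter.nf_conntrack_tcp_be_liberal = 1
```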

Successfully merging this pull request may close these issues.

"Connection reset by peer" due to invalid conntrack packets